Intro to strings and regular expressions

Warmup activity

Work on the activity (handout) with a neighbor, then we will discuss as a class

Warmup

  • Are You Pikachu Or Meowth
  • 23 Photos That Definitively Prove The Moon Landing Was Faked
  • Chinese Exports Fall 22.6% in April
  • From a Beijing Suburb, Vibrant Strings
  • 27 Happy Gifts For People Who Love Jamaica
  • Two largest known prime numbers discovered just two weeks apart, one qualifies for $100k prize
  • Pakistan : New policy on renewable energy launched
  • Strong earthquake hits Pakistan, north India, Afghanistan
  • National Debate About G.O.P. Hits Home in Utah
  • This 7-Picture Test Will Determine What Type Of Harry Potter Fan You Are
  • 22 Fred And George Weasley Moments That’ll Make You Laugh, Cry, And Everything In Between
  • 21 Puppies Who Absolutely Cannot Be Trusted

Characterizing clickbait headlines

Clickbait headlines often contain numbers:

  • 23 Photos That Definitively Prove The Moon Landing Was Faked
  • 21 Puppies Who Absolutely Cannot Be Trusted

Clickbait headlines are often written in first or second person:

  • Are You Pikachu Or Meowth
  • This 7-Picture Test Will Determine What Type Of Harry Potter Fan You Are

Another example

We conduct a survey, and the results contain the following responses:

  • “I am 31 years old”
  • “I just turned 52”
  • “My age is 83”

If we want to explore statistics about respondents’ ages (summary statistics, visualizations, regression models, etc.), what do we need to do first?

Strings

Strings are data that consist of a sequence of characters, and store information like names and text responses. We use single or double quotes when creating a string:

ex_str <- "Hello!"
ex_str
[1] "Hello!"

The number of characters in a string is called its length:

str_length(ex_str)
[1] 6

Extracting information from strings

Working with text data requires us to identify and extract useful information in strings. For example, we may wish to extract the number from a string:

str_extract("I am 31 years old", "31")
[1] "31"
str_extract("21 Puppies Who Absolutely Cannot Be Trusted", "21")
[1] "21"

str_extract: extracts the first match in a string to a specified pattern

Question: Are there any issues with the way we are extracting numbers here?

More general patterns: regular expressions

Instead of specifying a specific number, we can ask R to find any number:

str_extract("My son is 9 years old", "\\d")
[1] "9"
  • \d is a special character that means match any digit
  • In R, we need to add an additional escape character, so we enter this as \\d

Looking for numbers

What do you think will happen if I run the following code?

str_extract("My son is 9 years old", "d")

Looking for numbers

What do you think will happen if I run the following code?

str_extract("My son is 9 years old", "d")
[1] "d"

This just looks for the letter "d"! To get the special character meanining “any digit”, we need the escape character(s):

str_extract("My son is 9 years old", "\\d")
[1] "9"

Looking for numbers

What do you think will happen if I run the following code?

str_extract("My son is 19 years old", "\\d")

Looking for numbers

What do you think will happen if I run the following code?

str_extract("My son is 19 years old", "\\d")
[1] "1"

The pattern \d will just return the first match. To get the full “19”, we need to match any contiguous sequence of digits:

str_extract("My son is 19 years old", "\\d+")
[1] "19"

+ means “one or more occurrences”

Looking for numbers

What do you think will happen if I run the following code?

str_extract("My son is 19 years old, and I am 51", "\\d+")

Looking for numbers

What do you think will happen if I run the following code?

str_extract("My son is 19 years old, and I am 51", "\\d+")
[1] "19"

str_extract returns the first match to the pattern. To get all matches:

str_extract_all("My son is 19 years old, and I am 51", "\\d+")
[[1]]
[1] "19" "51"

Looking for numbers

String functions in the stringr package are also vectorized:

ex_strings <- c("My son is 19 years old, and I am 51",
                "21 Puppies Who Absolutely Cannot Be Trusted")

str_extract(ex_strings, "\\d+")
[1] "19" "21"

Another string function

Instead of extracting a pattern, we may wish to detect whether the string contains a pattern:

ex_strings <- c("23 Photos That Definitively Prove The Moon Landing Was Faked",
                "21 Puppies Who Absolutely Cannot Be Trusted",
                "Pakistan : New policy on renewable energy launched")

str_detect(ex_strings, "\\d+")
[1]  TRUE  TRUE FALSE

We can also see where the match occurs:

str_view(ex_strings, "\\d+")
[1] │ <23> Photos That Definitively Prove The Moon Landing Was Faked
[2] │ <21> Puppies Who Absolutely Cannot Be Trusted

Another string function

Instead of extracting a pattern, we may wish to detect whether the string contains a pattern:

ex_strings <- c("23 Photos That Definitively Prove The Moon Landing Was Faked",
                "21 Puppies Who Absolutely Cannot Be Trusted",
                "Pakistan : New policy on renewable energy launched")

str_detect(ex_strings, "\\d+")
[1]  TRUE  TRUE FALSE

And we can select only the strings which contain the pattern:

str_subset(ex_strings, "\\d+")
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"
[2] "21 Puppies Who Absolutely Cannot Be Trusted"                 

Back to clickbait

str_subset(headlines, "\\d+")
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"                                  
[2] "Chinese Exports Fall 22.6% in April"                                                           
[3] "27 Happy Gifts For People Who Love Jamaica"                                                    
[4] "Two largest known prime numbers discovered just two weeks apart, one qualifies for $100k prize"
[5] "This 7-Picture Test Will Determine What Type Of Harry Potter Fan You Are"                      
[6] "22 Fred And George Weasley Moments That'll Make You Laugh, Cry, And Everything In Between"     
[7] "21 Puppies Who Absolutely Cannot Be Trusted"                                                   

Are all of these headlines clickbait?

Back to clickbait

We can look for a pattern at the beginning of a string with the ^ character:

str_subset(headlines, "^\\d+")
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"                             
[2] "27 Happy Gifts For People Who Love Jamaica"                                               
[3] "22 Fred And George Weasley Moments That'll Make You Laugh, Cry, And Everything In Between"
[4] "21 Puppies Who Absolutely Cannot Be Trusted"                                              

Anchors

We can look for a pattern at the beginning of a string with the ^ character:

str_subset(headlines, "^\\d+")
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"                             
[2] "27 Happy Gifts For People Who Love Jamaica"                                               
[3] "22 Fred And George Weasley Moments That'll Make You Laugh, Cry, And Everything In Between"
[4] "21 Puppies Who Absolutely Cannot Be Trusted"                                              

Question: When might I want to look for a pattern at the end of a string?

Anchors

str_subset(headlines, "^\\d+")
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"                             
[2] "27 Happy Gifts For People Who Love Jamaica"                                               
[3] "22 Fred And George Weasley Moments That'll Make You Laugh, Cry, And Everything In Between"
[4] "21 Puppies Who Absolutely Cannot Be Trusted"                                              
str_detect("my_file.png", "csv$")
[1] FALSE
str_detect("file2.csv", "csv$")
[1] TRUE
str_detect("csv_folder/accident.xlsx", "csv")
[1] TRUE
str_detect("csv_folder/accident.xlsx", "csv$")
[1] FALSE

Regular expressions

Regular expression: a tool for specifying a search pattern in text. (Note: regular expressions are not specific to R, and are used in many languages and platforms)

Some regular expressions so far:

  • \d any digit
  • + one or more occurrences
  • ^ anchors at the beginning
  • $ anchors at the end

Class activity

  • Work independently or with a neighbor on the class activity
  • At the end of class, submit your work as an HTML file on Canvas (one per group, list all your names)

For next time, read:

  • Chapter 15.1 - 15.3 in R for Data Science