[1] "Hello!"
Work on the activity (handout) with a neighbor, then we will discuss as a class
Clickbait headlines often contain numbers:
Clickbait headlines are often written in first or second person:
We conduct a survey, and the results contain the following responses:
If we want to explore statistics about respondents’ ages (summary statistics, visualizations, regression models, etc.), what do we need to do first?
Strings are data that consist of a sequence of characters, and store information like names and text responses. We use single or double quotes when creating a string:
The number of characters in a string is called its length:
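As a minimal sketch of both points (creating strings and measuring their length), assuming the stringr package is installed; the variable names and example text below are hypothetical, not taken from the survey:

```r
library(stringr)

# Single and double quotes create the same kind of object in R
name <- "Ada"
response <- 'I am 19 years old'

# str_length() counts characters, including spaces
str_length(name)      # 3
str_length(response)  # 17
```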
Working with text data requires us to identify and extract useful information in strings. For example, we may wish to extract the number from a string:
[1] "31"
[1] "21"
str_extract: extracts the first match of a specified pattern in a string
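A small sketch of extracting a specific number, assuming the stringr package is installed. The headline is one that appears later in these notes; the variable name is hypothetical:

```r
library(stringr)

headline <- "21 Puppies Who Absolutely Cannot Be Trusted"

# The pattern is the literal text we are searching for
str_extract(headline, "21")  # "21"

# If the literal text does not occur, the result is NA
str_extract(headline, "31")  # NA
```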
Question: Are there any issues with the way we are extracting numbers here?
Instead of specifying a specific number, we can ask R to find any number:
\d is a special character that means “match any digit”.
What do you think will happen if I run the following code?
This just looks for the letter "d"! To get the special character meaning “any digit”, we need the escape character(s):
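The difference can be sketched as follows, assuming the stringr package is installed and using a hypothetical survey response. In an R string the backslash must itself be escaped, so the regular expression \d is typed as "\\d" (typing "\d" directly is a syntax error in R):

```r
library(stringr)

response <- "I am 19 years old"  # hypothetical survey response

# "d" is just the letter d, which first occurs at the end of "old"
str_extract(response, "d")    # "d"

# "\\d" is the regular expression \d, which matches any single digit
str_extract(response, "\\d")  # "1"
```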
What do you think will happen if I run the following code?
The pattern \d matches a single digit, so str_extract returns only the first digit. To get the full “19”, we need to match any contiguous sequence of digits:
+ means “one or more occurrences”
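A short sketch comparing the two patterns, assuming the stringr package is installed; the response text is a hypothetical example:

```r
library(stringr)

response <- "I am 19 years old"  # hypothetical survey response

# A single digit: only the first digit is returned
str_extract(response, "\\d")   # "1"

# One or more digits in a row: the whole number is returned
str_extract(response, "\\d+")  # "19"
```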
What do you think will happen if I run the following code?
str_extract returns the first match to the pattern. To get all matches:
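One way to do this is with str_extract_all, sketched here on a hypothetical response and assuming the stringr package is installed. Note that it returns a list, with one element per input string:

```r
library(stringr)

response <- "I am 19 and my sister is 23"  # hypothetical survey response

# First match only
str_extract(response, "\\d+")      # "19"

# All matches: a list whose single element is c("19", "23")
all_numbers <- str_extract_all(response, "\\d+")
all_numbers[[1]]                   # "19" "23"
```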
String functions in the stringr package are also vectorized:
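A minimal sketch of vectorization, assuming the stringr package is installed; the vector of responses is hypothetical. Each function is applied to every element, and strings with no match give NA:

```r
library(stringr)

responses <- c("I am 19 years old", "21", "twenty")  # hypothetical responses

# One result per input string
extracted <- str_extract(responses, "\\d+")
extracted              # "19" "21" NA

str_length(responses)  # 17 2 6
```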
Instead of extracting a pattern, we may wish to detect whether the string contains a pattern:
ex_strings <- c("23 Photos That Definitively Prove The Moon Landing Was Faked",
"21 Puppies Who Absolutely Cannot Be Trusted",
"Pakistan : New policy on renewable energy launched")
str_detect(ex_strings, "\\d+")
[1] TRUE TRUE FALSE
We can also see where the match occurs:
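The original code for this step is not shown; one option, assumed here, is str_locate from stringr, which reports the start and end position of the first match in each string:

```r
library(stringr)

ex_strings <- c("23 Photos That Definitively Prove The Moon Landing Was Faked",
                "21 Puppies Who Absolutely Cannot Be Trusted",
                "Pakistan : New policy on renewable energy launched")

# Returns a matrix with columns "start" and "end"; rows with no match are NA
loc <- str_locate(ex_strings, "\\d+")
loc
```

For the first two headlines the match spans characters 1 to 2; the third headline contains no digits, so its row is NA.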
And we can select only the strings which contain the pattern:
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"
[2] "Chinese Exports Fall 22.6% in April"
[3] "27 Happy Gifts For People Who Love Jamaica"
[4] "Two largest known prime numbers discovered just two weeks apart, one qualifies for $100k prize"
[5] "This 7-Picture Test Will Determine What Type Of Harry Potter Fan You Are"
[6] "22 Fred And George Weasley Moments That'll Make You Laugh, Cry, And Everything In Between"
[7] "21 Puppies Who Absolutely Cannot Be Trusted"
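A selection like the one above could be produced with str_subset, or equivalently by indexing with str_detect. This is a sketch on the three example headlines defined earlier, assuming the stringr package is installed; the full set of headlines behind the output above is not reproduced here:

```r
library(stringr)

ex_strings <- c("23 Photos That Definitively Prove The Moon Landing Was Faked",
                "21 Puppies Who Absolutely Cannot Be Trusted",
                "Pakistan : New policy on renewable energy launched")

# Keep only the strings that contain at least one digit
with_numbers <- str_subset(ex_strings, "\\d+")

# Equivalent: use the logical vector from str_detect() as an index
same_result <- ex_strings[str_detect(ex_strings, "\\d+")]

with_numbers  # the first two headlines
```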
Are all of these headlines clickbait?
We can look for a pattern at the beginning of a string with the ^ character:
[1] "23 Photos That Definitively Prove The Moon Landing Was Faked"
[2] "27 Happy Gifts For People Who Love Jamaica"
[3] "22 Fred And George Weasley Moments That'll Make You Laugh, Cry, And Everything In Between"
[4] "21 Puppies Who Absolutely Cannot Be Trusted"
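The anchored search can be sketched as follows, assuming the stringr package is installed. The three headlines are taken from the earlier output; the vector name is hypothetical:

```r
library(stringr)

headlines <- c("23 Photos That Definitively Prove The Moon Landing Was Faked",
               "Chinese Exports Fall 22.6% in April",
               "This 7-Picture Test Will Determine What Type Of Harry Potter Fan You Are")

# Without the anchor, any headline containing a number matches
str_detect(headlines, "\\d+")   # TRUE TRUE TRUE

# With ^, the number must appear at the very start of the string
starts_with_number <- str_subset(headlines, "^\\d+")
starts_with_number              # only the "23 Photos ..." headline
```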
Question: When might I want to look for a pattern at the end of a string?
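As one possible illustration, the $ character anchors a pattern at the end of a string. The sketch below assumes the stringr package is installed and uses hypothetical survey responses to check which ones end with a number:

```r
library(stringr)

responses <- c("I am 19", "19 years old", "twenty")  # hypothetical responses

# \\d+$ requires the digits to be the last characters in the string
ends_with_number <- str_detect(responses, "\\d+$")
ends_with_number  # TRUE FALSE FALSE
```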
Regular expression: a tool for specifying a search pattern in text. (Note: regular expressions are not specific to R, and are used in many languages and platforms)
Some regular expressions so far:
\d any digit
+ one or more occurrences
^ anchors at the beginning
$ anchors at the end
For next time, read: