Work on the activity (handout), then we will discuss as a class.
Warmup
computer professionals celebrate 10th birthday of a.l.i.c.e.
i watched ‘home alone’ for the first time and it was actually horrifying
f.b.i. lab houses growing database of dna profiles
6 things i wish i knew as a teen
i asked my mom for marriage advice and here’s what happened
i wore food on my face instead of makeup to see if anyone would notice
Of these 6 headlines, 4 are clickbait. All the clickbait headlines are written in first person. How can I detect these headlines?
Identifying first person headlines
What’s wrong with this code?
str_subset(headlines, "i")
[1] "i watched \"home alone\" for the first time and it was actually horrifying"
[2] "computer professionals celebrate 10th birthday of a.l.i.c.e."
[3] "f.b.i. lab houses growing database of dna profiles"
[4] "6 things i wish i knew as a teen"
[5] "i asked my mom for marriage advice and here's what happened"
[6] "i wore food on my face instead of makeup to see if anyone would notice"
Identifying first person headlines
Adding word boundaries:
str_subset(headlines, "\\bi\\b")
[1] "i watched \"home alone\" for the first time and it was actually horrifying"
[2] "computer professionals celebrate 10th birthday of a.l.i.c.e."
[3] "f.b.i. lab houses growing database of dna profiles"
[4] "6 things i wish i knew as a teen"
[5] "i asked my mom for marriage advice and here's what happened"
[6] "i wore food on my face instead of makeup to see if anyone would notice"
How else could we modify this pattern?
Identifying first person headlines
The word “I” is likely either to start the headline or to be preceded by a space:
str_subset(headlines, "(^|\\s)i\\b")
[1] "i watched \"home alone\" for the first time and it was actually horrifying"
[2] "6 things i wish i knew as a teen"
[3] "i asked my mom for marriage advice and here's what happened"
[4] "i wore food on my face instead of makeup to see if anyone would notice"
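To compare the three patterns side by side, here is a minimal sketch that rebuilds the headlines vector from the warmup (the vector is typed in by hand here, with straight quotes around 'home alone'; the variable names other than headlines are made up for illustration) and counts how many headlines each pattern matches.

```r
library(stringr)

# The six warmup headlines, typed in by hand (assumed to match the class data)
headlines <- c(
  "computer professionals celebrate 10th birthday of a.l.i.c.e.",
  "i watched 'home alone' for the first time and it was actually horrifying",
  "f.b.i. lab houses growing database of dna profiles",
  "6 things i wish i knew as a teen",
  "i asked my mom for marriage advice and here's what happened",
  "i wore food on my face instead of makeup to see if anyone would notice"
)

# How many headlines does each pattern match?
n_any_i      <- sum(str_detect(headlines, "i"))            # the letter i anywhere
n_boundary   <- sum(str_detect(headlines, "\\bi\\b"))      # i between word boundaries
n_first_word <- sum(str_detect(headlines, "(^|\\s)i\\b"))  # i at the start or after a space

# Headlines that the word-boundary pattern keeps but the stricter pattern drops
false_positives <- setdiff(
  str_subset(headlines, "\\bi\\b"),
  str_subset(headlines, "(^|\\s)i\\b")
)

c(n_any_i, n_boundary, n_first_word)
false_positives
```

The two leftover headlines are the ones containing a.l.i.c.e. and f.b.i.: a period is not a word character, so the i between two periods still sits between word boundaries.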
Regular expressions so far
Regular expression: a tool for specifying a search pattern in text.
Some regular expressions so far:
\d any digit
+ one or more occurrences of the preceding element
^ anchors the match at the beginning of the string
$ anchors the match at the end of the string
\b word boundary
| alternation (this pattern OR that pattern)
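As a quick check on the list above, the sketch below tries each piece on a small vector of made-up strings (x and the result names are hypothetical, not from the class data). Note that inside an R string each backslash is doubled, so \d is typed as "\\d" and \b as "\\b".

```r
library(stringr)

# Made-up strings for illustration only
x <- c("room 101", "101 dalmatians", "a cat", "category")

has_digit    <- str_detect(x, "\\d")        # any digit
digit_run    <- str_extract(x, "\\d+")      # one or more digits in a row
starts_digit <- str_detect(x, "^\\d")       # digit at the beginning of the string
ends_digit   <- str_detect(x, "\\d$")       # digit at the end of the string
whole_cat    <- str_detect(x, "\\bcat\\b")  # "cat" as a whole word
cat_or_room  <- str_detect(x, "cat|room")   # "cat" OR "room" anywhere

has_digit     # TRUE  TRUE  FALSE FALSE
digit_run     # "101" "101" NA    NA
starts_digit  # FALSE TRUE  FALSE FALSE
ends_digit    # TRUE  FALSE FALSE FALSE
whole_cat     # FALSE FALSE TRUE  FALSE
cat_or_room   # TRUE  FALSE TRUE  TRUE
```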
Example 2: Cleaning phone numbers
You are working with customer data in which customers have entered their phone numbers:
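As a minimal sketch of one possible cleaning step, assume the entries look like the made-up numbers below (phones, digit_runs, and phones_clean are hypothetical names, and the formats are guesses at what customers might type). Using only \d and + from the list so far, we can pull out every run of digits and glue the runs back together.

```r
library(stringr)

# Made-up phone numbers in inconsistent formats (illustration only)
phones <- c("(336) 555-0142", "336.555.0199", "336 555 0187")

# Extract every run of one or more digits from each entry (returns a list)
digit_runs <- str_extract_all(phones, "\\d+")

# Paste the runs for each entry back together into one digits-only string
phones_clean <- vapply(digit_runs, paste, character(1), collapse = "")

phones_clean  # "3365550142" "3365550199" "3365550187"
```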
Remember that $ is a special character in regular expressions, meaning “the end of the string”. To match a literal dollar sign, escape it with a backslash; inside an R string the backslash is doubled, so the pattern is written "\\$".
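To see the difference between the anchor and the literal character, here is a small sketch on two made-up strings (prices and the result names are hypothetical).

```r
library(stringr)

# Made-up strings for illustration only
prices <- c("the ticket costs $20", "the ticket costs 20 dollars")

# Unescaped $ is the end-of-string anchor, so every string matches
matches_anchor <- str_detect(prices, "$")

# Escaped \\$ matches only an actual dollar sign
matches_dollar <- str_detect(prices, "\\$")

# A dollar sign followed by one or more digits
amount <- str_extract(prices, "\\$\\d+")

matches_anchor  # TRUE TRUE
matches_dollar  # TRUE FALSE
amount          # "$20" NA
```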
Example 4: Extracting LaTeX
document_text
[1] "The equation for the simple linear regression line is given by $Y_i = \\beta_0 + \\beta_1 X_i + \\varepsilon_i$"
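One way to pull out the LaTeX, sketched under the assumption that the text contains a single equation wrapped in a pair of dollar signs: match a literal dollar sign, then one or more characters, then another literal dollar sign. The names latex_with_dollars and latex are made up for illustration.

```r
library(stringr)

document_text <- "The equation for the simple linear regression line is given by $Y_i = \\beta_0 + \\beta_1 X_i + \\varepsilon_i$"

# Literal $, one or more of any character, literal $
latex_with_dollars <- str_extract(document_text, "\\$.+\\$")

# Drop the surrounding dollar signs to keep only the LaTeX source
latex <- str_remove_all(latex_with_dollars, "\\$")

latex
```

Because + is greedy, this pattern would run from the first dollar sign to the last one if the text held more than one equation, so it is only a starting point.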