Activity: Strings and regular expressions I

Instructions:

Clickbait headlines

The following code imports a dataset of 30 headlines from a variety of websites. The headline column contains the text of the headline, and the clickbait column records whether the headline comes from a clickbait article (1) or not (0).

library(tidyverse)

headlines <- read_csv("https://sta279-f25.github.io/data/clickbait_headlines_small.csv")
  1. The code below creates a new variable called clickbait_guess, in which we try to identify clickbait headlines using regular expressions. It then compares the clickbait predictions to the truth. Fill in the regular expression so that a headline is predicted to be clickbait if it begins with a number. How well does this work?
headlines <- headlines |>
  mutate(clickbait_guess = str_detect(headline, "...")) # fill in

table(headlines$clickbait_guess, headlines$clickbait)
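One way to answer "how well does this work?" is to turn the table into a single accuracy number. The sketch below uses two small made-up vectors (truth and guess are hypothetical stand-ins, not the real headlines data) and assumes that accuracy, the proportion of correct predictions, is a reasonable summary here.

```r
# Toy data (hypothetical, not the real headlines): 1 = clickbait, 0 = not
truth <- c(1, 1, 0, 0, 1, 0)
guess <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

# Rows are the predictions and columns are the truth, as in the activity
tab <- table(guess, truth)
tab

# Proportion of predictions that agree with the truth
accuracy <- mean(guess == (truth == 1))
accuracy  # 4 of the 6 toy predictions are correct
```

In the table, the cells where the row and column agree (FALSE with 0, TRUE with 1) are the correct predictions; the other two cells are the errors.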
  2. We also noticed that headlines written in the second person tend to be clickbait. Fill in the regular expression so that a headline is predicted to be clickbait if it contains the word “you”. Note that it is helpful to convert all the headlines to lower case first, so we don’t have to worry about differences in capitalization.
headlines <- headlines |>
  mutate(clickbait_guess = str_detect(tolower(headline), "...")) # fill in

table(headlines$clickbait_guess, headlines$clickbait)
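To see why lower-casing matters, here is a small sketch on toy strings (the vector x and the pattern "forest" are made up for illustration and are not part of the exercise). It also shows stringr's regex() helper with ignore_case = TRUE, an alternative that the activity does not require.

```r
library(stringr)

# Toy strings (hypothetical), with different capitalizations of one word
x <- c("Wake Forest wins", "a walk in the forest", "FOREST fire")

# Regular expressions are case sensitive by default
plain <- str_detect(x, "forest")
plain        # FALSE TRUE FALSE

# Option 1: lower-case the text before matching, as in the activity
lowered <- str_detect(tolower(x), "forest")
lowered      # TRUE TRUE TRUE

# Option 2: leave the text alone and make the pattern case insensitive
insensitive <- str_detect(x, regex("forest", ignore_case = TRUE))
insensitive  # TRUE TRUE TRUE
```

Both options give the same answer here; lower-casing first keeps the pattern itself simple.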

Word boundaries

Depending on how you wrote the regular expression in question 2, you could pick up words that contain the letters “you”, and not just the word “you” itself. Here is a concrete example:

str_view("you are at Wake Forest", "you")
[1] │ <you> are at Wake Forest
str_view("born on the bayou", "you")
[1] │ born on the ba<you>

We can require “you” to be a word on its own by requiring a word boundary (the special character \b) on either side of “you”:

str_view("you are at Wake Forest", "\\byou\\b")
[1] │ <you> are at Wake Forest
str_view("born on the bayou", "\\byou\\b")

Here is another example of the word boundary in action:

str_view("This island is beautiful", "is")
[1] │ Th<is> <is>land <is> beautiful
str_view("This island is beautiful", "\\bis\\b")
[1] │ This island <is> beautiful
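The text writes the word boundary as \b, but the R code writes "\\b". The reason is that the backslash is also R's own escape character inside a string, so it has to be doubled to reach the regular expression as a single backslash. The sketch below illustrates this on a toy vector x (hypothetical); the raw-string form r"(...)" is an alternative that assumes R 4.0 or later.

```r
library(stringr)

# Printing the string shows what the regex engine actually receives
writeLines("\\bis\\b")  # prints \bis\b

# Toy strings (hypothetical)
x <- c("is", "it is.", "this", "island", "what is, is")

# "is" must stand alone: the start or end of the string and punctuation
# all count as word boundaries, but neighbouring letters do not
escaped <- str_detect(x, "\\bis\\b")
escaped  # TRUE TRUE FALSE FALSE TRUE

# Same pattern as a raw string, with no doubled backslashes (R >= 4.0)
raw <- str_detect(x, r"(\bis\b)")
raw      # TRUE TRUE FALSE FALSE TRUE
```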
  3. Modify your code from question 2 to match only the word “you”, using word boundaries. Did this help your clickbait classification? Why or why not?

Alternation

Often, we may wish to match multiple patterns. For example, suppose we wish to identify strings which contain the words “cat” or “dog”:

str_view("I have two cats and one dog", "cat|dog")
[1] │ I have two <cat>s and one <dog>

The vertical bar | lets us match any one of the patterns it separates.
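One detail worth knowing when combining | with other pieces of a pattern: the bar splits the whole pattern, so anything written next to it belongs to only one alternative unless parentheses group the alternatives. The sketch below, on toy strings (the vector x is hypothetical), compares the plain alternation with a grouped version wrapped in word boundaries.

```r
library(stringr)

# Toy strings (hypothetical)
x <- c("I have two cats and one dog", "concatenate", "hotdog stand", "my cat")

# Plain alternation matches "cat" or "dog" anywhere, even inside other words
loose <- str_detect(x, "cat|dog")
loose   # TRUE TRUE TRUE TRUE

# Parentheses group the alternatives, so both \b apply to "cat" and to "dog"
strict <- str_detect(x, "\\b(cat|dog)\\b")
strict  # TRUE FALSE FALSE TRUE
```

Without the parentheses, "\\bcat|dog\\b" would read as "cat at the start of a word" or "dog at the end of a word", which is a different rule.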

  4. Modify your code from question 3 to predict that a headline is clickbait if it contains the word “you” OR it begins with a number. How well do we predict clickbait using this rule?