Strings and regular expressions

Recap: regular expressions

A regular expression is a pattern used to find matches in text.

Example: suppose I want to extract just the lecture number from the following file name. How would I do that?

"teaching/sta279-f23/slides/lecture_22.qmd"

str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "\\d+")

[1] "279"

str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "_\\d+")

[1] "_22"

str_extract("teaching/sta279-f23/slides/lecture_22.qmd", 
            "(?<=_)\\d+")

[1] "22"
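
A lookahead offers another route to the same answer. The sketch below assumes stringr is installed; the variable names are made up for illustration, and the file name is the one from the example above.

```r
library(stringr)

# Hypothetical variable name; the file name is the one used above
path <- "teaching/sta279-f23/slides/lecture_22.qmd"

# Lookbehind: digits that come right after an underscore
from_lookbehind <- str_extract(path, "(?<=_)\\d+")

# Lookahead: digits that come right before ".qmd" (the dot is escaped)
from_lookahead <- str_extract(path, "\\d+(?=\\.qmd)")

# str_extract() returns a character string; convert if a number is needed
lecture_number <- as.integer(from_lookbehind)
```

Both patterns return "22" here. Which one is safer depends on what else can appear in the file names.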

Last time, we learned the following regular expression tools:

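\d matches any digit (in an R string we type \\d, because the backslash itself must be escaped)
. matches any single character (except \n)
+ means “at least once” (one or more repetitions of the preceding item)
(?<=...) and (?=...) are a positive lookbehind and a positive lookahead: they check what comes before or after the match without including it in the match
| is alternation (one pattern or another)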
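
As a quick refresher, the tools in this list can be tried on a small vector. This is only a sketch: the file names below are invented for illustration, and stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical file names, made up for this example
files <- c("lecture_22.qmd", "hw_03.qmd", "notes.txt")

# \\d+ : one or more digits; strings with no match give NA
digits <- str_extract(files, "\\d+")

# Alternation: keep the names that contain "lecture" or "hw"
course_files <- str_subset(files, "lecture|hw")

# . matches any single character, so "h._" matches "hw_"
has_h_any_underscore <- str_detect(files, "h._")
```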

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select just raspberry and blackberry?

str_view(strings, "berry")

[3] │ rasp<berry>
[4] │ black<berry>

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select “raspberry”, “blackberry”, “grrreat”, and “random”?

str_view(strings, "r")

[3] │ <r>aspbe<r><r>y
[4] │ blackbe<r><r>y
[5] │ g<r><r><r>eat
[6] │ <r>andom

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select just “raspberry”, “blackberry”, and “grrreat”?

str_view(strings, "rr+")

[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat

str_view(strings, "r{2,}")

[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat
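
str_view() only highlights the matches. To actually select the strings, one option is str_subset() or str_detect(), sketched here under the assumption that stringr is installed (the result names are illustrative).

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

# str_subset() returns the strings that contain a match
double_r <- str_subset(strings, "r{2,}")

# str_detect() returns TRUE/FALSE for each string, handy for subsetting
has_double_r <- str_detect(strings, "rr+")
same_result <- strings[has_double_r]
```

The two patterns are equivalent: both ask for an r repeated two or more times.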

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select just “grrreat”?

str_view(strings, "r{3}")

[5] │ g<rrr>eat
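
One caveat: r{3} asks for three r's in a row, but it does not forbid a fourth. The sketch below uses a made-up string with four r's to show this. The negative lookarounds (?<!r) and (?!r) go slightly beyond the tools listed earlier and are shown only as one possible way to require exactly three.

```r
library(stringr)

# "grrrreat" (four r's) is a hypothetical extra string for illustration
words <- c("grrreat", "grrrreat")

# r{3} matches the first three r's of a longer run as well
three_or_more <- str_detect(words, "r{3}")

# Negative lookbehind and lookahead: no r just before or just after the run
exactly_three <- str_detect(words, "(?<!r)r{3}(?!r)")
```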

strings <- c("apple", "banana", "raspberry", 
             "blackberry", "grrreat", "random")

How would I select “apple”, “raspberry”, “blackberry”, and “grrreat”?

str_view(strings, "(.)\\1")

[1] │ a<pp>le
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rr>reat
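
In this pattern, (.) captures one character and \\1 requires the same character again. To see which character was captured, str_match() can help. A minimal sketch, assuming stringr is installed; the result names are illustrative.

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

# Strings that contain some character twice in a row
doubled <- str_subset(strings, "(.)\\1")

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group (NA where nothing matched)
m <- str_match(strings, "(.)\\1")
repeated_letter <- m[, 2]
```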

strings <- c("papa", "banana", "memento", 
             "blackberry", "grrreat", "random")

How would I select “papa”, “banana”, and “memento”?

str_view(strings, "(..)\\1")

[1] │ <papa>
[2] │ b<anan>a
[3] │ <meme>nto

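Note that (..)+ is not the same thing: it repeats the pattern (any two characters), not the text that was captured, so it matches every string:

str_view(strings, "(..)+")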

[1] │ <papa>
[2] │ <banana>
[3] │ <mement>o
[4] │ <blackberry>
[5] │ <grrrea>t
[6] │ <random>
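
The difference between a backreference and a repeated group is easy to check with str_detect(). This is a sketch assuming stringr is installed; the result names are illustrative.

```r
library(stringr)

strings <- c("papa", "banana", "memento",
             "blackberry", "grrreat", "random")

# Backreference: the same two characters must appear twice in a row
repeated_pair <- str_detect(strings, "(..)\\1")

# Repeated group: any two characters, one or more times
any_pairs <- str_detect(strings, "(..)+")

selected <- strings[repeated_pair]
```

Only the backreference version singles out "papa", "banana", and "memento".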

"The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"

How would I extract $\mu$ and $\mu = \frac{1}{n} \sum_i x_i$?

str_extract("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
            "\\$.+\\$")

[1] "$\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"

This is not what we want: + is greedy, so .+ matches as many characters as possible and the match runs from the first $ to the last $.

Option 1: make the quantifier lazy. Adding ? after + matches as few characters as possible.

str_extract_all("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
            "\\$.+?\\$")

[[1]]
[1] "$\\mu$"                            "$\\mu = \\frac{1}{n} \\sum_i x_i$"
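
The greedy and lazy versions can be compared side by side on a shorter string. The string below is invented for illustration, and stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical short example with two $...$ pieces
x <- "cost: $a$ and $b$"

# Greedy: .+ grabs as much as it can, up to the last $
greedy <- str_extract(x, "\\$.+\\$")

# Lazy: .+? stops at the first closing $ it can reach
lazy <- str_extract_all(x, "\\$.+?\\$")[[1]]
```

str_extract_all() returns a list with one element per input string, which is why [[1]] is used to pull out the matches.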

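Option 2: use a negated character class. [^\\$] (as typed in an R string) matches any character except a dollar sign, so the match cannot run past the closing $.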

str_extract_all("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
            "\\$[^\\$]+\\$")

[[1]]
[1] "$\\mu$"                            "$\\mu = \\frac{1}{n} \\sum_i x_i$"
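
The two options agree on this example, but they can differ when the text between the dollar signs contains a line break: . does not match \n, while a negated character class does. A sketch with a made-up string, assuming stringr is installed:

```r
library(stringr)

# Hypothetical string: the second $...$ piece spans two lines
x <- "inline $a$ and multi-line $b +\nc$"

# Option 1 (lazy): . cannot cross the \n, so the second piece is missed
lazy <- str_extract_all(x, "\\$.+?\\$")[[1]]

# Option 2 (negated class): [^\\$] also matches \n
negated <- str_extract_all(x, "\\$[^\\$]+\\$")[[1]]
```

Which behavior is preferable depends on whether the pieces being extracted are expected to stay on one line.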

Work independently or with a neighbor on the class activity
At the end of class, submit your work as an HTML file on Canvas (one per group, list all your names)