Strings and regular expressions
Recap: regular expressions
A regular expression is a pattern used to find matches in text.
Example: suppose I want to extract just the lecture number from the following file name. How would I do that?
"teaching/sta279-f23/slides/lecture_22.qmd"
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "\\d+")
str_extract("teaching/sta279-f23/slides/lecture_22.qmd", "_\\d+")
str_extract("teaching/sta279-f23/slides/lecture_22.qmd",
"(?<=_)\\d+")
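The three attempts above return different things, which is easy to miss without running them. The sketch below assumes the stringr package is loaded; the variable names are illustrative only.

```r
library(stringr)

path <- "teaching/sta279-f23/slides/lecture_22.qmd"

# str_extract() returns only the first match, scanning left to right
first_digits    <- str_extract(path, "\\d+")        # first run of digits: the "279" in "sta279"
with_underscore <- str_extract(path, "_\\d+")       # digits after "_", underscore included
lookbehind      <- str_extract(path, "(?<=_)\\d+")  # digits after "_", underscore excluded

print(c(first_digits, with_underscore, lookbehind))
# [1] "279" "_22" "22"
```

Only the lookbehind version returns the lecture number on its own: the lookbehind checks that an underscore comes first but does not include it in the match.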
Recap: regular expressions
Last time, we learned the following regular expression tools:
\d matches any digit (in R, type \\d: the regex is written inside a string, so the backslash itself must be escaped)
. matches any character (except \n)
+ means “at least once” (one or more repetitions of the preceding item)
(?<=) and (?=) are positive lookbehinds and lookaheads
| is alternation (one pattern or another)
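A quick way to check each of these tools is to try it on a few short strings. The example strings below are made up for illustration and are not from the lecture.

```r
library(stringr)

x <- c("cat", "cot", "c\nt", "dog")  # hypothetical example strings

dot_match <- str_detect(x, "c.t")      # "." matches any character except a newline
alt_match <- str_detect(x, "cat|dog")  # "|" matches either alternative

# Lookahead: digits that are followed by " dollars", without including " dollars"
amount <- str_extract("price: 45 dollars", "\\d+(?= dollars)")

print(dot_match)  # TRUE TRUE FALSE FALSE
print(alt_match)  # TRUE FALSE FALSE TRUE
print(amount)     # "45"
```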
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select just raspberry and blackberry?
str_view(strings, "berry")
[3] │ rasp<berry>
[4] │ black<berry>
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select “raspberry”, “blackberry”, “grrreat”, and “random”?
str_view(strings, "r")
[3] │ <r>aspbe<r><r>y
[4] │ blackbe<r><r>y
[5] │ g<r><r><r>eat
[6] │ <r>andom
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select just “raspberry”, “blackberry”, and “grrreat”?
str_view(strings, "rr+")
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat
str_view(strings, "r{2,}")
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rrr>eat
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select just “grrreat”?
str_view(strings, "r{3}")
[5] │ g<rrr>eat
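str_view() highlights matches; to actually keep the matching strings, one option is str_subset(). The sketch below compares the quantifier forms used so far. Using str_subset() here is a choice made for illustration, not something the slides prescribe.

```r
library(stringr)

strings <- c("apple", "banana", "raspberry",
             "blackberry", "grrreat", "random")

# str_subset() keeps only the strings that contain a match
one_or_more   <- str_subset(strings, "r")      # at least one "r"
two_or_more   <- str_subset(strings, "r{2,}")  # two or more consecutive "r"s
exactly_three <- str_subset(strings, "r{3}")   # three consecutive "r"s

print(one_or_more)    # "raspberry" "blackberry" "grrreat" "random"
print(two_or_more)    # "raspberry" "blackberry" "grrreat"
print(exactly_three)  # "grrreat"
```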
More regular expressions
strings <- c("apple", "banana", "raspberry",
"blackberry", "grrreat", "random")
How would I select “apple”, “raspberry”, “blackberry”, and “grrreat”?
str_view(strings, "(.)\\1")
[1] │ a<pp>le
[3] │ raspbe<rr>y
[4] │ blackbe<rr>y
[5] │ g<rr>reat
More regular expressions
strings <- c("papa", "banana", "memento",
"blackberry", "grrreat", "random")
How would I select “papa”, “banana”, and “memento”?
str_view(strings, "(..)\\1")
[1] │ <papa>
[2] │ b<anan>a
[3] │ <meme>nto
str_view(strings, "(..)+")
[1] │ <papa>
[2] │ <banana>
[3] │ <mement>o
[4] │ <blackberry>
[5] │ <grrrea>t
[6] │ <random>
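The two patterns above look similar but ask different questions. The backreference \\1 requires the same two characters to appear again, while (..)+ only repeats the pattern "any two characters". A minimal sketch using str_detect() (variable names are illustrative):

```r
library(stringr)

strings <- c("papa", "banana", "memento",
             "blackberry", "grrreat", "random")

# "(..)\\1": two characters, immediately followed by those same two characters
repeated_pair <- str_detect(strings, "(..)\\1")

# "(..)+": any two characters, one or more times; the pairs need not be equal
any_pairs <- str_detect(strings, "(..)+")

print(repeated_pair)  # TRUE TRUE TRUE FALSE FALSE FALSE
print(any_pairs)      # TRUE TRUE TRUE TRUE TRUE TRUE
```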
More regular expressions
"The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"
How would I extract $\mu$ and $\mu = \frac{1}{n} \sum_i x_i$?
str_extract("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
"\\$.+\\$")
[1] "$\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"
Option 1:
str_extract_all("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
"\\$.+?\\$")
[[1]]
[1] "$\\mu$" "$\\mu = \\frac{1}{n} \\sum_i x_i$"
Option 2:
str_extract_all("The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$",
"\\$[^\\$]+\\$")
[[1]]
[1] "$\\mu$" "$\\mu = \\frac{1}{n} \\sum_i x_i$"
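The first attempt fails because .+ is greedy: it matches as many characters as it can, so it runs to the last $ in the string. The two options fix this in different ways. A side-by-side sketch (the variable names are illustrative):

```r
library(stringr)

txt <- "The mean $\\mu$ is defined by $\\mu = \\frac{1}{n} \\sum_i x_i$"

greedy  <- str_extract_all(txt, "\\$.+\\$")[[1]]       # ".+" runs to the last "$"
lazy    <- str_extract_all(txt, "\\$.+?\\$")[[1]]      # ".+?" stops at the next "$"
negated <- str_extract_all(txt, "\\$[^\\$]+\\$")[[1]]  # "[^\\$]+" cannot cross a "$"

print(length(greedy))            # 1
print(length(lazy))              # 2
print(identical(lazy, negated))  # TRUE
```

On this string the lazy quantifier and the negated character class give the same two matches.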
Class activity
- Work independently or with a neighbor on the class activity
- At the end of class, submit your work as an HTML file on Canvas (one per group, list all your names)