[1] "George Washington: February 22, 1732"
[2] "Thomas Jefferson: April 13, 1743"
[3] "Abraham Lincoln: February 12, 1809"
[4] "Theodore Roosevelt: October 27, 1858"
Exam 2 review
Below are questions to help you study for Exam 2. These are some examples of the kinds of questions I might ask.
- This is not a practice exam. There will be fewer questions on the actual exam.
- The questions cover most, but not all, possible material for the exam.
- The distribution of questions here is not necessarily reflective of the distribution of questions on the actual exam.
Strings and regular expressions
Example 1
Consider the following strings:
For each question below, fill in the R code to produce the desired output.
str_extract(strings, ...)[1] "George Washington" "Thomas Jefferson" "Abraham Lincoln"
[4] "Theodore Roosevelt"
str_extract(strings, ...)[1] "February 22, 1732" "April 13, 1743" "February 12, 1809"
[4] "October 27, 1858"
str_extract(strings, ...)[1] "Washington" "Jefferson" "Lincoln" "Roosevelt"
str_extract(strings, ...)[1] "1732" "1743" "1809" "1858"
Example 2
Consider the following strings:
strings [1] "apple" "banana" "canteloupe" "durian"
[5] "eggplant" "french fries" "goat cheese" "pizza"
[9] "99 red balloons" "101 dalmatians" "route 66"
For each question below, fill in the R code to produce the desired output.
str_subset(strings, ...)[1] "99 red balloons" "101 dalmatians" "route 66"
str_subset(strings, ...)[1] "99 red balloons" "101 dalmatians"
str_subset(strings, ...)[1] "apple" "banana" "canteloupe" "durian"
[5] "eggplant" "goat cheese" "pizza" "99 red balloons"
[9] "101 dalmatians"
str_subset(strings, ...)[1] "french fries" "goat cheese" "99 red balloons" "101 dalmatians"
[5] "route 66"
str_subset(strings, ...)[1] "apple" "eggplant" "goat cheese" "pizza"
[5] "99 red balloons" "route 66"
Example 3
Suppose you receive a vector containing the following email addresses:
emails <- c("evansc@wfu.edu",
"dalzelnm@wfu.edu",
"heplersa@wfu.edu")- Fill in the following code to produce the desired output:
str_extract(emails, ...)[1] "evansc" "dalzelnm" "heplersa"
Example 4
You have a folder with the following filenames:
filenames <- c(".bash_profile",
"workspace.doc",
"img0912.jpg",
"updated_img0912.png",
"documentation.html",
"favicon.gif",
"access.lock")- Fill in the following code to produce the desired output:
str_subset(filenames, ...)[1] "img0912.jpg" "updated_img0912.png" "favicon.gif"
Example 5
Consider the following string:
ex_str <- "The kurtosis is defined by the equation $\\mu_4/\\sigma^4$,
where $\\mu_4$ is the fourth central moment
and $\\sigma$ is the standard deviation."- Fill in the following code to produce the desired output:
str_extract_all(ex_str, ...)[[1]]
[1] "$\\mu_4/\\sigma^4$" "$\\mu_4$" "$\\sigma$"
Example 6
Here is another small dataset, ex_df, which has a single column (faculty):
ex_df faculty
1 Rob Erhardt, Professor: MAN 343
2 Daniel Beavers, Associate Professor: MAN 337
3 Kenneth Berenhaut, Professor: MAN 379
4 Leonardo Cella, Assistant Professor: MAN 344
5 Lucy D'Agostino McGowan, Assistant Professor: MAN 342
6 Nicole Dalzell, Associate Teaching Professor: MAN 338
- Write code to produce the following output using
ex_df:
# A tibble: 6 × 3
name position office
<chr> <chr> <chr>
1 Rob Erhardt Professor MAN 343
2 Daniel Beavers Associate Professor MAN 337
3 Kenneth Berenhaut Professor MAN 379
4 Leonardo Cella Assistant Professor MAN 344
5 Lucy D'Agostino McGowan Assistant Professor MAN 342
6 Nicole Dalzell Associate Teaching Professor MAN 338
Web scraping
Example 1
Consider the following portion of a web page:
STA 279
Welcome to STA 279! This is a 16-week course offered at Wake Forest University, which meets 3 hours each week. The syllabus can be found here
Below are some useful dates for the semester:
| Exam 1 | March 1 |
| Exam 2 | April 12 |
Now here is the source HTML for that simple web page:
<h1> STA 279 </h1>
<p>
Welcome to STA 279!
This is a 16-week course offered at Wake Forest University,
which meets 3 hours each week.
The syllabus can be found
<a href='https://sta279-f25.github.io/about/'>here</a>
</p>
<p>
Below are some useful dates for the semester:
<table>
<tr>
<td>Exam 1</td>
<td>March 1</td>
</tr>
<tr>
<td>Exam 2</td>
<td>April 12</td>
</tr>
</table>
</p>
Suppose that the url for this page is “https://fake_web_page.html”. Write code that will import the table of exam dates into R, as a data frame.
Suppose that the url for this page is “https://fake_web_page.html”. What will the following code return?
read_html("https://fake_web_page.html") |>
html_element("h1") |>
html_text2() |>
str_extract("\\d+")Example 2
Consider the following short portion of a web page, with URL https://fake_page.html.
<div class="mesa">
<span>
<p> Hello! </p>
</span>
<p class="small"> Welcome to STA 279! </p>
<span>
<p class="small"> Room: Carswell 102 </p>
</span>
<span>
<h3 class="small"> Grading policy </h3>
</span>
<span>
<p class="small"> Exam dates </p>
</span>
</div>
- Write code in R that will extract the following string from this web page:
[1] "Carswell 102"
- Using your code from the previous question, write a short function with the following specifications:
- Input:
urlthe URL for a web page with the same format as above - Output: the building and room number for the class, scraped from the web page
- Input:
- Suppose that we have the web pages for 20 different classes at Wake Forest, all of which have the same format as shown above, but which take places in different rooms. The URLS are stored in a vector:
class_urls [1] "https://fake_page_1.html" "https://fake_page_2.html"
[3] "https://fake_page_3.html" "https://fake_page_4.html"
[5] "https://fake_page_5.html" "https://fake_page_6.html"
[7] "https://fake_page_7.html" "https://fake_page_8.html"
[9] "https://fake_page_9.html" "https://fake_page_10.html"
[11] "https://fake_page_11.html" "https://fake_page_12.html"
[13] "https://fake_page_13.html" "https://fake_page_14.html"
[15] "https://fake_page_15.html" "https://fake_page_16.html"
[17] "https://fake_page_17.html" "https://fake_page_18.html"
[19] "https://fake_page_19.html" "https://fake_page_20.html"
Using your answers to the previous questions, iterate over this vector of URLs to produce a vector containing the building and room number for each class.
Polite web scraping
Why is it important to be polite when scraping from a website?
What are the features of polite web scraping that we discussed in class?
SQL
Here is a rough summary of what you should know from the SQL unit:
- Pieces of a SQL query you should know:
SELECTFROMJOIN(an inner join) andLEFT JOIN(a left join)WHEREGROUP BYHAVINGORDER BYLIMIT
- You should also know the order in which queries must be written
- Other functions to know for queries:
DISTINCT- Summary functions like
SUM,AVG,MIN,MAX,COUNT LIKE, and the wildcards%and_CASE
- The function of primary and foreign keys, and the constraints they place on tables
- How to read and interpret an ERD
- You should be able to read and interpet the results of
DESCRIBE,ALTER TABLE, andINSERT INTOstatements. You will not have to write these statements - Subqueries
Practice with SQL queries
Here is a small table in a SQL database. The name of the table is small_penguins.
species island bill_length_mm bill_depth_mm
1 Gentoo Biscoe 43.3 14.0
2 Gentoo Biscoe 59.6 17.0
3 Adelie Dream 39.7 17.9
4 Adelie Dream 39.2 21.1
5 Chinstrap Dream 50.8 19.0
6 Gentoo Biscoe 49.9 16.1
7 Chinstrap Dream 50.7 19.7
8 Gentoo Biscoe 47.3 15.3
9 Gentoo Biscoe 49.3 15.7
10 Adelie Dream 37.5 18.9
In each of the following examples, write a SQL query to produce the desired output from small_penguins.
- Output:
species island n
1 Adelie Dream 3
2 Chinstrap Dream 2
3 Gentoo Biscoe 5
- Output:
island species mean_length
1 Biscoe Gentoo 49.88
2 Dream Adelie 38.80
3 Dream Chinstrap 50.75
- Output:
species island bill_length_mm bill_depth_mm bill_ratio
1 Gentoo Biscoe 43.3 14.0 3.092857
2 Gentoo Biscoe 59.6 17.0 3.505882
3 Adelie Dream 39.7 17.9 2.217877
4 Adelie Dream 39.2 21.1 1.857820
5 Chinstrap Dream 50.8 19.0 2.673684
6 Gentoo Biscoe 49.9 16.1 3.099379
7 Chinstrap Dream 50.7 19.7 2.573604
8 Gentoo Biscoe 47.3 15.3 3.091503
9 Gentoo Biscoe 49.3 15.7 3.140127
10 Adelie Dream 37.5 18.9 1.984127
- Output:
species island bill_length_mm bill_depth_mm
1 Adelie Dream 39.7 17.9
2 Adelie Dream 39.2 21.1
3 Adelie Dream 37.5 18.9
Joins
In each of the following questions, write a SQL query to produce the desired output from the two input tables
df1 id x
1 1 7
2 2 9
3 3 13
df2 id y
1 1 10
2 2 12
3 4 14
Output:
id x y
1 1 7 10
2 2 9 12
3 3 13 NA
df1 id x
1 1 7
2 2 9
3 3 13
df2 id y
1 1 10
2 2 12
3 4 14
Output:
id x y
1 1 7 10
2 2 9 12
More practice with SQL queries
The nycflights13 package in R contains several datasets describing flights from NY airports in 2013, including the airports and weather tables.
The airports table has one row for each different airport in the database. There are a total of 1458 rows and 8 columns in the airports table; here are a few random rows from the table, showing the relevant columns for this question:
faa name
1 RAC John H. Batten Airport
2 RME Griffiss Afld
3 PDK Dekalb-Peachtree Airport
4 HSV Huntsville International Airport-Carl T Jones Field
5 OEB Branch County Memorial Airport
The weather table has one row for each date-time, for each of the three NY airports (EWR, JFK, LGA), and records a variety of weather variables at each time for each location. Here are a few random rows from the table, showing the relevant columns for this question:
origin month temp humid
1 EWR 12 39.92 73.08
2 EWR 9 77.00 51.76
3 JFK 10 64.94 67.57
4 LGA 3 48.02 86.08
5 EWR 9 62.06 74.75
The weather table uses a 3-character FAA code to identify each airport. It does not include the full name for the airports.
We would like to produce a table with the mean temperature, and humidity for each month, for each of the 3 NYC airports in the database. The output table should also include the full name of each airport. Here is what the output table looks like:
origin name month mean_temp mean_humid
1 EWR Newark Liberty Intl 1 35.56216 62.12451
2 EWR Newark Liberty Intl 2 34.26332 63.33558
3 EWR Newark Liberty Intl 3 40.11865 57.91382
4 EWR Newark Liberty Intl 4 52.97750 54.23513
5 EWR Newark Liberty Intl 5 63.32024 64.07371
6 EWR Newark Liberty Intl 6 73.26725 67.00229
7 EWR Newark Liberty Intl 7 80.70300 66.67641
8 EWR Newark Liberty Intl 8 74.53748 65.75175
9 EWR Newark Liberty Intl 9 67.30478 64.03556
10 EWR Newark Liberty Intl 10 59.77821 65.31928
11 EWR Newark Liberty Intl 11 44.57734 56.52011
12 EWR Newark Liberty Intl 12 37.95008 69.72141
13 JFK John F Kennedy Intl 1 35.38555 61.66162
14 JFK John F Kennedy Intl 2 34.19246 62.55848
15 JFK John F Kennedy Intl 3 39.54472 58.02222
16 JFK John F Kennedy Intl 4 50.14270 60.05764
17 JFK John F Kennedy Intl 5 59.31476 70.98926
18 JFK John F Kennedy Intl 6 69.95825 73.76147
19 JFK John F Kennedy Intl 7 78.73492 71.74469
20 JFK John F Kennedy Intl 8 73.81878 67.84420
21 JFK John F Kennedy Intl 9 66.89775 65.25182
22 JFK John F Kennedy Intl 10 59.80195 65.06362
23 JFK John F Kennedy Intl 11 45.13419 56.79961
24 JFK John F Kennedy Intl 12 38.60487 68.31341
25 LGA La Guardia 1 35.95927 59.15609
26 LGA La Guardia 2 34.35612 60.68603
27 LGA La Guardia 3 39.97652 56.55947
28 LGA La Guardia 4 52.11450 53.15060
29 LGA La Guardia 5 62.75000 61.49618
30 LGA La Guardia 6 73.32650 61.22686
31 LGA La Guardia 7 80.76425 62.26404
32 LGA La Guardia 8 75.04825 60.34574
33 LGA La Guardia 9 67.91125 59.67743
34 LGA La Guardia 10 60.63244 60.13875
35 LGA La Guardia 11 45.26093 53.49296
36 LGA La Guardia 12 38.76976 63.62769
Write a SQL query to produce the desired results.
How would you modify your above query to keep only the rows in the results for which the mean temperature is at least 60?
How would you modify your above query to keep only the rows for which the mean temperature is the highest average temperature?
Conceptual questions
What are the differences between
SELECTin a SQL query, and theselectfunction fromdplyrin R?Suppose you are managing a database for a hospital. List four different tables which might be included in the hospital database. For each table:
- Describe what information would be included (what would a row represent? what would be some of the columns?)
- Is there a natural primary key for the table? What column or set of columns would form that primary key?
- Are there any columns in the table that would be foreign keys for other tables in the database?
Practice with function scoping
- What will be the output of the following code?
x <- 10
test_fun <- function(x = 11){
return(x)
}
test_fun()
x- What will be the output of the following code?
x <- 10
test_fun <- function(y = 11){
return(x + 1)
}
test_fun()
x- What will be the output of the following code?
x <- 10
test_fun <- function(y = 11){
x <- x + 1
return(x + 1)
}
test_fun()
x- What will be the output of the following code?
x <- 10
test_fun <- function(x = 11){
x <- x + 1
return(x + 1)
}
test_fun()
x- What will be the output of the following code?
x <- 10
test_fun <- function(x = 11){
x <- x + 1
return(x + 1)
}
x <- test_fun(x)
xPractice with lists
- Create a list
xin R such that:
x[[1]]returns the functionmeanx[[2]]returns the functionsdx[[3]][[1]]returns the vectorc(0, 1, 2)x[[3]][[2]]returns a function which calculates the cube root of a vector