Exam 2 review

Below are questions to help you study for Exam 2. These are some examples of the kinds of questions I might ask.

Strings and regular expressions

Example 1

Consider the following strings:

[1] "George Washington: February 22, 1732"
[2] "Thomas Jefferson: April 13, 1743"    
[3] "Abraham Lincoln: February 12, 1809"  
[4] "Theodore Roosevelt: October 27, 1858"

For each question below, fill in the R code to produce the desired output.

str_extract(strings, ...)
[1] "George Washington"  "Thomas Jefferson"   "Abraham Lincoln"   
[4] "Theodore Roosevelt"
str_extract(strings, ...)
[1] "February 22, 1732" "April 13, 1743"    "February 12, 1809"
[4] "October 27, 1858" 
str_extract(strings, ...)
[1] "Washington" "Jefferson"  "Lincoln"    "Roosevelt" 
str_extract(strings, ...)
[1] "1732" "1743" "1809" "1858"

Example 2

Consider the following strings:

strings
 [1] "apple"           "banana"          "canteloupe"      "durian"         
 [5] "eggplant"        "french fries"    "goat cheese"     "pizza"          
 [9] "99 red balloons" "101 dalmatians"  "route 66"       

For each question below, fill in the R code to produce the desired output.

str_subset(strings, ...)
[1] "99 red balloons" "101 dalmatians"  "route 66"       
str_subset(strings, ...)
[1] "99 red balloons" "101 dalmatians" 
str_subset(strings, ...)
[1] "apple"           "banana"          "canteloupe"      "durian"         
[5] "eggplant"        "goat cheese"     "pizza"           "99 red balloons"
[9] "101 dalmatians" 
str_subset(strings, ...)
[1] "french fries"    "goat cheese"     "99 red balloons" "101 dalmatians" 
[5] "route 66"       
str_subset(strings, ...)
[1] "apple"           "eggplant"        "goat cheese"     "pizza"          
[5] "99 red balloons" "route 66"       

Example 3

Suppose you receive a vector containing the following email addresses:

emails <- c("evansc@wfu.edu",
            "dalzelnm@wfu.edu",
            "heplersa@wfu.edu")
  1. Fill in the following code to produce the desired output:
str_extract(emails, ...)
[1] "evansc"   "dalzelnm" "heplersa"

Example 4

You have a folder with the following filenames:

filenames <- c(".bash_profile",
               "workspace.doc",
               "img0912.jpg",
               "updated_img0912.png",
               "documentation.html",
               "favicon.gif",
               "access.lock")
  1. Fill in the following code to produce the desired output:
str_subset(filenames, ...)
[1] "img0912.jpg"         "updated_img0912.png" "favicon.gif"        

Example 5

Consider the following string:

ex_str <- "The kurtosis is defined by the equation $\\mu_4/\\sigma^4$, 
where $\\mu_4$ is the fourth central moment 
and $\\sigma$ is the standard deviation."
  1. Fill in the following code to produce the desired output:
str_extract_all(ex_str, ...)
[[1]]
[1] "$\\mu_4/\\sigma^4$" "$\\mu_4$"           "$\\sigma$"         

Example 6

Here is another small dataset, ex_df, which has a single column (faculty):

ex_df
                                                faculty
1                       Rob Erhardt, Professor: MAN 343
2          Daniel Beavers, Associate Professor: MAN 337
3                 Kenneth Berenhaut, Professor: MAN 379
4          Leonardo Cella, Assistant Professor: MAN 344
5 Lucy D'Agostino McGowan, Assistant Professor: MAN 342
6 Nicole Dalzell, Associate Teaching Professor: MAN 338
  1. Write code to produce the following output using ex_df:
# A tibble: 6 × 3
  name                    position                     office 
  <chr>                   <chr>                        <chr>  
1 Rob Erhardt             Professor                    MAN 343
2 Daniel Beavers          Associate Professor          MAN 337
3 Kenneth Berenhaut       Professor                    MAN 379
4 Leonardo Cella          Assistant Professor          MAN 344
5 Lucy D'Agostino McGowan Assistant Professor          MAN 342
6 Nicole Dalzell          Associate Teaching Professor MAN 338

Web scraping

Example 1

Consider the following portion of a web page:

STA 279

Welcome to STA 279! This is a 16-week course offered at Wake Forest University, which meets 3 hours each week. The syllabus can be found here

Below are some useful dates for the semester:

Exam 1 March 1
Exam 2 April 12

Now here is the source HTML for that simple web page:

<h1> STA 279 </h1>

<p> 
  Welcome to STA 279! 
  This is a 16-week course offered at Wake Forest University, 
  which meets 3 hours each week. 
  The syllabus can be found 
  <a href='https://sta279-f25.github.io/about/'>here</a> 
</p>

<p>
  Below are some useful dates for the semester:
   <table>
  <tr>
    <td>Exam 1</td>
    <td>March 1</td>
  </tr>
  <tr>
    <td>Exam 2</td>
    <td>April 12</td>
  </tr>
</table> 
</p>
  1. Suppose that the url for this page is “https://fake_web_page.html”. Write code that will import the table of exam dates into R, as a data frame.

  2. Suppose that the url for this page is “https://fake_web_page.html”. What will the following code return?

read_html("https://fake_web_page.html") |>
  html_element("h1") |>
  html_text2() |>
  str_extract("\\d+")

Example 2

Consider the following short portion of a web page, with URL https://fake_page.html.

<div class="mesa">
  <span>
    <p> Hello! </p>
  </span>
  
  <p class="small"> Welcome to STA 279! </p>
  
  <span>
    <p class="small"> Room: Carswell 102 </p>
  </span>
  
  <span>
    <h3 class="small"> Grading policy </h3>
  </span>
  
  <span>
    <p class="small"> Exam dates </p>
  </span>
</div>
  1. Write code in R that will extract the following string from this web page:
[1] "Carswell 102"
  1. Using your code from the previous question, write a short function with the following specifications:
    • Input: url the URL for a web page with the same format as above
    • Output: the building and room number for the class, scraped from the web page
  2. Suppose that we have the web pages for 20 different classes at Wake Forest, all of which have the same format as shown above, but which take places in different rooms. The URLS are stored in a vector:
class_urls
 [1] "https://fake_page_1.html"  "https://fake_page_2.html" 
 [3] "https://fake_page_3.html"  "https://fake_page_4.html" 
 [5] "https://fake_page_5.html"  "https://fake_page_6.html" 
 [7] "https://fake_page_7.html"  "https://fake_page_8.html" 
 [9] "https://fake_page_9.html"  "https://fake_page_10.html"
[11] "https://fake_page_11.html" "https://fake_page_12.html"
[13] "https://fake_page_13.html" "https://fake_page_14.html"
[15] "https://fake_page_15.html" "https://fake_page_16.html"
[17] "https://fake_page_17.html" "https://fake_page_18.html"
[19] "https://fake_page_19.html" "https://fake_page_20.html"

Using your answers to the previous questions, iterate over this vector of URLs to produce a vector containing the building and room number for each class.

Polite web scraping

  1. Why is it important to be polite when scraping from a website?

  2. What are the features of polite web scraping that we discussed in class?

SQL

Here is a rough summary of what you should know from the SQL unit:

  • Pieces of a SQL query you should know:
    • SELECT
    • FROM
    • JOIN (an inner join) and LEFT JOIN (a left join)
    • WHERE
    • GROUP BY
    • HAVING
    • ORDER BY
    • LIMIT
  • You should also know the order in which queries must be written
  • Other functions to know for queries:
    • DISTINCT
    • Summary functions like SUM, AVG, MIN, MAX, COUNT
    • LIKE, and the wildcards % and _
    • CASE
  • The function of primary and foreign keys, and the constraints they place on tables
  • How to read and interpret an ERD
  • You should be able to read and interpet the results of DESCRIBE, ALTER TABLE, and INSERT INTO statements. You will not have to write these statements
  • Subqueries

Practice with SQL queries

Here is a small table in a SQL database. The name of the table is small_penguins.

     species island bill_length_mm bill_depth_mm
1     Gentoo Biscoe           43.3          14.0
2     Gentoo Biscoe           59.6          17.0
3     Adelie  Dream           39.7          17.9
4     Adelie  Dream           39.2          21.1
5  Chinstrap  Dream           50.8          19.0
6     Gentoo Biscoe           49.9          16.1
7  Chinstrap  Dream           50.7          19.7
8     Gentoo Biscoe           47.3          15.3
9     Gentoo Biscoe           49.3          15.7
10    Adelie  Dream           37.5          18.9

In each of the following examples, write a SQL query to produce the desired output from small_penguins.

  1. Output:
    species island n
1    Adelie  Dream 3
2 Chinstrap  Dream 2
3    Gentoo Biscoe 5
  1. Output:
  island   species mean_length
1 Biscoe    Gentoo       49.88
2  Dream    Adelie       38.80
3  Dream Chinstrap       50.75
  1. Output:
     species island bill_length_mm bill_depth_mm bill_ratio
1     Gentoo Biscoe           43.3          14.0   3.092857
2     Gentoo Biscoe           59.6          17.0   3.505882
3     Adelie  Dream           39.7          17.9   2.217877
4     Adelie  Dream           39.2          21.1   1.857820
5  Chinstrap  Dream           50.8          19.0   2.673684
6     Gentoo Biscoe           49.9          16.1   3.099379
7  Chinstrap  Dream           50.7          19.7   2.573604
8     Gentoo Biscoe           47.3          15.3   3.091503
9     Gentoo Biscoe           49.3          15.7   3.140127
10    Adelie  Dream           37.5          18.9   1.984127
  1. Output:
  species island bill_length_mm bill_depth_mm
1  Adelie  Dream           39.7          17.9
2  Adelie  Dream           39.2          21.1
3  Adelie  Dream           37.5          18.9

Joins

In each of the following questions, write a SQL query to produce the desired output from the two input tables

df1
  id  x
1  1  7
2  2  9
3  3 13
df2
  id  y
1  1 10
2  2 12
3  4 14

Output:

  id  x  y
1  1  7 10
2  2  9 12
3  3 13 NA
df1
  id  x
1  1  7
2  2  9
3  3 13
df2
  id  y
1  1 10
2  2 12
3  4 14

Output:

  id x  y
1  1 7 10
2  2 9 12

More practice with SQL queries

The nycflights13 package in R contains several datasets describing flights from NY airports in 2013, including the airports and weather tables.

The airports table has one row for each different airport in the database. There are a total of 1458 rows and 8 columns in the airports table; here are a few random rows from the table, showing the relevant columns for this question:

  faa                                                name
1 RAC                              John H. Batten Airport
2 RME                                       Griffiss Afld
3 PDK                            Dekalb-Peachtree Airport
4 HSV Huntsville International Airport-Carl T Jones Field
5 OEB                      Branch County Memorial Airport

The weather table has one row for each date-time, for each of the three NY airports (EWR, JFK, LGA), and records a variety of weather variables at each time for each location. Here are a few random rows from the table, showing the relevant columns for this question:

  origin month  temp humid
1    EWR    12 39.92 73.08
2    EWR     9 77.00 51.76
3    JFK    10 64.94 67.57
4    LGA     3 48.02 86.08
5    EWR     9 62.06 74.75

The weather table uses a 3-character FAA code to identify each airport. It does not include the full name for the airports.

We would like to produce a table with the mean temperature, and humidity for each month, for each of the 3 NYC airports in the database. The output table should also include the full name of each airport. Here is what the output table looks like:

   origin                name month mean_temp mean_humid
1     EWR Newark Liberty Intl     1  35.56216   62.12451
2     EWR Newark Liberty Intl     2  34.26332   63.33558
3     EWR Newark Liberty Intl     3  40.11865   57.91382
4     EWR Newark Liberty Intl     4  52.97750   54.23513
5     EWR Newark Liberty Intl     5  63.32024   64.07371
6     EWR Newark Liberty Intl     6  73.26725   67.00229
7     EWR Newark Liberty Intl     7  80.70300   66.67641
8     EWR Newark Liberty Intl     8  74.53748   65.75175
9     EWR Newark Liberty Intl     9  67.30478   64.03556
10    EWR Newark Liberty Intl    10  59.77821   65.31928
11    EWR Newark Liberty Intl    11  44.57734   56.52011
12    EWR Newark Liberty Intl    12  37.95008   69.72141
13    JFK John F Kennedy Intl     1  35.38555   61.66162
14    JFK John F Kennedy Intl     2  34.19246   62.55848
15    JFK John F Kennedy Intl     3  39.54472   58.02222
16    JFK John F Kennedy Intl     4  50.14270   60.05764
17    JFK John F Kennedy Intl     5  59.31476   70.98926
18    JFK John F Kennedy Intl     6  69.95825   73.76147
19    JFK John F Kennedy Intl     7  78.73492   71.74469
20    JFK John F Kennedy Intl     8  73.81878   67.84420
21    JFK John F Kennedy Intl     9  66.89775   65.25182
22    JFK John F Kennedy Intl    10  59.80195   65.06362
23    JFK John F Kennedy Intl    11  45.13419   56.79961
24    JFK John F Kennedy Intl    12  38.60487   68.31341
25    LGA          La Guardia     1  35.95927   59.15609
26    LGA          La Guardia     2  34.35612   60.68603
27    LGA          La Guardia     3  39.97652   56.55947
28    LGA          La Guardia     4  52.11450   53.15060
29    LGA          La Guardia     5  62.75000   61.49618
30    LGA          La Guardia     6  73.32650   61.22686
31    LGA          La Guardia     7  80.76425   62.26404
32    LGA          La Guardia     8  75.04825   60.34574
33    LGA          La Guardia     9  67.91125   59.67743
34    LGA          La Guardia    10  60.63244   60.13875
35    LGA          La Guardia    11  45.26093   53.49296
36    LGA          La Guardia    12  38.76976   63.62769
  1. Write a SQL query to produce the desired results.

  2. How would you modify your above query to keep only the rows in the results for which the mean temperature is at least 60?

  3. How would you modify your above query to keep only the rows for which the mean temperature is the highest average temperature?

Conceptual questions

  1. What are the differences between SELECT in a SQL query, and the select function from dplyr in R?

  2. Suppose you are managing a database for a hospital. List four different tables which might be included in the hospital database. For each table:

  • Describe what information would be included (what would a row represent? what would be some of the columns?)
  • Is there a natural primary key for the table? What column or set of columns would form that primary key?
  • Are there any columns in the table that would be foreign keys for other tables in the database?

Practice with function scoping

  1. What will be the output of the following code?
x <- 10
test_fun <- function(x = 11){
  return(x)
}
test_fun()
x
  1. What will be the output of the following code?
x <- 10
test_fun <- function(y = 11){
  return(x + 1)
}
test_fun()
x
  1. What will be the output of the following code?
x <- 10
test_fun <- function(y = 11){
  x <- x + 1
  return(x + 1)
}
test_fun()
x
  1. What will be the output of the following code?
x <- 10
test_fun <- function(x = 11){
  x <- x + 1
  return(x + 1)
}
test_fun()
x
  1. What will be the output of the following code?
x <- 10
test_fun <- function(x = 11){
  x <- x + 1
  return(x + 1)
}
x <- test_fun(x)
x

Practice with lists

  1. Create a list x in R such that:
  • x[[1]] returns the function mean
  • x[[2]] returns the function sd
  • x[[3]][[1]] returns the vector c(0, 1, 2)
  • x[[3]][[2]] returns a function which calculates the cube root of a vector