Exam 1 review solutions

Practice with loops in R

What will the following code return when you run it in R?

output <- rep(NA, 10)
for(i in 1:5){
  output[i] <- i
}

output[6]

[1] NA

What will the following code return when you run it in R?

output <- rep(0, 10)
for(i in 1:10){
  output[i] <- i
}

output[6]

[1] 6

What will the following code return when you run it in R?

output <- rep(0, 10)
for(i in 1:10){
  output[i] <- i
}

output[11]

[1] NA
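
The NA here comes from reading past the end of the vector: the loop only fills positions 1 through 10. A minimal sketch of that behaviour follows; the assignment to position 11 at the end is an added illustration, not part of the original question.

```r
output <- rep(0, 10)
for (i in 1:10) {
  output[i] <- i
}

length(output)     # 10: the vector has no 11th entry
is.na(output[11])  # TRUE: reading past the end gives NA, not an error

# Illustration only: assigning to position 11 silently grows the vector
output[11] <- 11
length(output)     # 11
```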

What will the following code return when you run it in R?

output <- rep(1, 10)
for(i in 2:10){
  output[i] <- i + output[i-1]
}

output[5]

[1] 15
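
To see where 15 comes from, it can help to print the whole vector rather than a single entry. The sketch below reruns the same loop; the comparison with cumsum() is an added observation, not part of the original question.

```r
output <- rep(1, 10)
for (i in 2:10) {
  # each entry is the previous entry plus the current index
  output[i] <- i + output[i - 1]
}

output[1:5]  # 1 3 6 10 15

# Since output[1] is 1, the loop builds the running sum 1 + 2 + ... + i
all(output == cumsum(1:10))  # TRUE
```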

What will the following code return when you run it in R?

output <- rep(1, 10)
for(i in 2:10){
  output[i] <- i + output[i+1]
}

output[5]

[1] 6
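
The key point is that output[i+1] has not yet been updated when iteration i reads it, so it still holds its initial value of 1. A short trace, printing the full vector after the same loop, makes this visible:

```r
output <- rep(1, 10)
for (i in 2:10) {
  # output[i + 1] is still 1 here, because it is only changed in the next iteration
  output[i] <- i + output[i + 1]
}

output[2:9]         # 3 4 5 6 7 8 9 10, i.e. i + 1 for each i
is.na(output[10])   # TRUE: at i = 10 the loop reads output[11], which does not exist
```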

Practice with functions

Standard deviation

The sample standard deviation of numbers \(x_1,...,x_n\) is given by

\[\widehat{\sigma} = \sqrt{\frac{1}{n-1}\sum \limits_{i=1}^n (x_i - \bar{x})^2},\]

where \(\bar{x} = \frac{1}{n} \sum \limits_{i=1}^n x_i\).

Write a function called my_sd which calculates the standard deviation of a vector in R.

my_sd <- function(x){
  n <- length(x)
  if(n == 1){
    return(0)
  } else {
    return(sqrt(1/(n-1) * sum((x - mean(x))^2)))
  }
}

# checking that it works
my_sd(c(1,2,4))

[1] 1.527525

sd(c(1,2,4))

[1] 1.527525

\(\ell_p\) norm

The \(\ell_p\) norm of a vector \(x = (x_1,...,x_k)\) is given by

\[||x||_p = \left( \sum \limits_{i=1}^k |x_i|^p \right)^{1/p}\]

Write a function called p_norm in R, which takes two inputs: a vector x and a number p, and returns \(||x||_p\). Make p = 2 the default value (this corresponds to the usual Euclidean norm).

p_norm <- function(x, p=2){
  (sum(abs(x)^p))^(1/p)
}

p_norm(c(1, 1, 1)) # = sqrt(3)

[1] 1.732051

p_norm(c(1, 2, 3), 1) # = 6

[1] 6

Kurtosis

Suppose we have a sample \(X_1,...,X_n\) from some population distribution. We know that the mean describes the “center” of the distribution, the standard deviation is a measure of variability, and skewness describes the asymmetry of the distribution.

Another quantity we can calculate to describe a distribution is kurtosis, which describes how heavy the tails of the distribution are. The sample kurtosis, with 3 subtracted so that a normal distribution scores 0 (sometimes called excess kurtosis), is calculated by:

\[\dfrac{\frac{1}{n} \sum \limits_{i=1}^n (X_i - \bar{X})^4}{\left( \frac{1}{n} \sum \limits_{i=1}^n (X_i - \bar{X})^2 \right)^2} \ \ - \ \ 3\]

where \(\bar{X}\) is the sample mean.

Write a function in R to calculate the sample kurtosis. Your function should take in one argument: a vector x.

my_kurtosis <- function(x){
  mean((x - mean(x))^4)/(mean((x - mean(x))^2))^2 - 3
}
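
As with my_sd, a quick check on a small vector can confirm the function matches the formula. The vector below is an arbitrary choice for illustration, and the arithmetic in the comments is worked by hand from the formula above.

```r
my_kurtosis <- function(x){
  mean((x - mean(x))^4)/(mean((x - mean(x))^2))^2 - 3
}

# checking that it works
x <- c(1, 2, 3, 4, 5)
# deviations from the mean: -2, -1, 0, 1, 2
# mean of fourth powers: (16 + 1 + 0 + 1 + 16)/5 = 6.8
# mean of squares:       (4 + 1 + 0 + 1 + 4)/5  = 2
my_kurtosis(x)  # 6.8 / 2^2 - 3 = -1.3
```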

Correlation

Suppose we have a sample \((X_1, Y_1),...,(X_n, Y_n)\) of \(n\) observations collected on two variables, \(X\) and \(Y\). The strength of the linear relationship between \(X\) and \(Y\) is measured by their correlation, and the sample correlation is calculated with the following formula:

\[\dfrac{ \sum \limits_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\left( \sum \limits_{i=1}^n (X_i - \bar{X})^2 \right)^{1/2} \left( \sum \limits_{i=1}^n (Y_i - \bar{Y})^2 \right)^{1/2}}\]

Write a function in R to calculate the sample correlation. Your function should take in two arguments: a vector x and a vector y.

my_corr <- function(x, y){
  sum((x - mean(x)) * (y - mean(y)))/( sqrt(sum((x - mean(x))^2)) * 
                                         sqrt(sum((y - mean(y))^2)) )
}
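
One way to check this function is to compare it with R's built-in cor(), in the same spirit as the my_sd check earlier. The example vectors are arbitrary and chosen only for illustration.

```r
my_corr <- function(x, y){
  sum((x - mean(x)) * (y - mean(y)))/( sqrt(sum((x - mean(x))^2)) *
                                         sqrt(sum((y - mean(y))^2)) )
}

# checking that it works
x <- c(1, 2, 4, 7)
y <- c(2, 3, 7, 8)
all.equal(my_corr(x, y), cor(x, y))  # TRUE

# a perfectly linear relationship should give a correlation of 1
# (up to floating-point rounding)
my_corr(c(1, 2, 3), c(2, 4, 6))
```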

Practice with probability simulations

Three players enter a room and a red or blue hat is placed on each person’s head. The color of each hat is determined by an independent coin toss (so, any combination of red and blue hats is possible). No communication of any sort is allowed, except for an initial strategy session before the game begins. Once they have had a chance to look at the other hats but not their own, the players must simultaneously guess the color of their own hats or pass. The players win the game if at least one person guesses correctly, and no one guesses incorrectly.

Here is one strategy: one player randomly guesses the color of their hat, while the other two players pass. Write a simulation to estimate the probability the players win the game (the true probability is 1/2).

set.seed(91)

nsim <- 1000
results <- rep(NA, nsim)
for(i in 1:nsim){
  hats <- sample(c("red", "blue"), 3, replace = TRUE)
  guesses <- c(sample(c("red", "blue"), 1), "pass", "pass")
  results[i] <- guesses[1] == hats[1]
}

mean(results)

[1] 0.48

Here is another strategy: if a player sees the same color on the other two hats, they guess the color they do not see. If a player sees different colors on the other two hats, they pass. For example: If players A, B, and C have hats red, blue, and blue respectively, then player A would guess red, player B would pass, and player C would pass. Write a simulation to estimate the probability the players win the game with this new strategy (the true probability is 3/4).

Note: For the exam, I am more interested in the logic of how you approach the simulation than in your code syntax being perfect. Your code should be mostly correct, but a few minor errors aren’t an issue.

nsim <- 1000
results <- rep(NA, nsim)
for(i in 1:nsim){
  hats <- sample(c("red", "blue"), 3, replace = TRUE)
  guesses <- rep(NA, 3)
  for(j in 1:3){
    if(length(unique(hats[-j])) == 1){
      guesses[j] <- ifelse(unique(hats[-j]) == "red", "blue", "red")
    } else {
      guesses[j] <- "pass"
    }
  }
  
  results[i] <- sum(guesses[guesses != "pass"] == hats[guesses != "pass"]) == 
    length(guesses[guesses != "pass"])
}

mean(results)

[1] 0.751
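
Because there are only 2^3 = 8 equally likely hat assignments, the 3/4 figure can also be checked exactly by enumeration instead of simulation. The sketch below is one way to do that under the same strategy; the names configs, wins, and guessed are illustrative and not from the original solution.

```r
# all 8 equally likely hat assignments, one per row
configs <- expand.grid(h1 = c("red", "blue"),
                       h2 = c("red", "blue"),
                       h3 = c("red", "blue"),
                       stringsAsFactors = FALSE)

wins <- rep(NA, nrow(configs))
for (k in 1:nrow(configs)) {
  hats <- unlist(configs[k, ])
  guesses <- rep("pass", 3)
  for (j in 1:3) {
    others <- unique(hats[-j])
    # guess only when the two visible hats match; guess the unseen color
    if (length(others) == 1) {
      guesses[j] <- ifelse(others == "red", "blue", "red")
    }
  }
  guessed <- guesses != "pass"
  # win: at least one guess, and every guess is correct
  wins[k] <- any(guessed) && all(guesses[guessed] == hats[guessed])
}

mean(wins)  # 0.75
```

The two losing rows are the all-red and all-blue assignments, where all three players guess and all three are wrong; in the six remaining rows exactly one player guesses, and guesses correctly.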

Practice with data wrangling

Writing code

In each of the questions below, write code to produce the output from the original data.

Original data:

sub_data

# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14  
 2 Gentoo    Biscoe           59.6          17  
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19  
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9

Output:

sub_data |>
  count(species, island)

# A tibble: 3 × 3
  species   island     n
  <fct>     <fct>  <int>
1 Adelie    Dream      3
2 Chinstrap Dream      2
3 Gentoo    Biscoe     5

Original data:

sub_data

# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14  
 2 Gentoo    Biscoe           59.6          17  
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19  
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9

Output:

sub_data |>
  group_by(island, species) |>
  summarize(mean_length = mean(bill_length_mm, na.rm = TRUE))

# A tibble: 3 × 3
# Groups:   island [2]
  island species   mean_length
  <fct>  <fct>           <dbl>
1 Biscoe Gentoo           49.9
2 Dream  Adelie           38.8
3 Dream  Chinstrap        50.8

Original data:

sub_data

# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14  
 2 Gentoo    Biscoe           59.6          17  
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19  
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9

Output:

sub_data |>
  mutate(bill_ratio = bill_length_mm/bill_depth_mm)

# A tibble: 10 × 5
   species   island bill_length_mm bill_depth_mm bill_ratio
   <fct>     <fct>           <dbl>         <dbl>      <dbl>
 1 Gentoo    Biscoe           43.3          14         3.09
 2 Gentoo    Biscoe           59.6          17         3.51
 3 Adelie    Dream            39.7          17.9       2.22
 4 Adelie    Dream            39.2          21.1       1.86
 5 Chinstrap Dream            50.8          19         2.67
 6 Gentoo    Biscoe           49.9          16.1       3.10
 7 Chinstrap Dream            50.7          19.7       2.57
 8 Gentoo    Biscoe           47.3          15.3       3.09
 9 Gentoo    Biscoe           49.3          15.7       3.14
10 Adelie    Dream            37.5          18.9       1.98

Original data:

sub_data

# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14  
 2 Gentoo    Biscoe           59.6          17  
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19  
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9

Output:

sub_data |>
  filter(species == "Adelie", 
         island == "Dream")

# A tibble: 3 × 4
  species island bill_length_mm bill_depth_mm
  <fct>   <fct>           <dbl>         <dbl>
1 Adelie  Dream            39.7          17.9
2 Adelie  Dream            39.2          21.1
3 Adelie  Dream            37.5          18.9

Original data:

ex_df

  id x_1 x_2 y_1 y_2
1  1   3   5   0   2
2  2   1   8   1   7
3  3   4   9   2   9

Output:

ex_df |>
  pivot_longer(cols = -id, names_to = c("group", "obs"), names_sep = "_")

# A tibble: 12 × 4
      id group obs   value
   <dbl> <chr> <chr> <dbl>
 1     1 x     1         3
 2     1 x     2         5
 3     1 y     1         0
 4     1 y     2         2
 5     2 x     1         1
 6     2 x     2         8
 7     2 y     1         1
 8     2 y     2         7
 9     3 x     1         4
10     3 x     2         9
11     3 y     1         2
12     3 y     2         9

Original data:

ex_df

  id group value
1  1     x     5
2  1     y     2
3  2     x     5
4  2     y     4
5  3     x     5
6  3     y     5

Output:

ex_df |>
  pivot_wider(id_cols = id, names_from = group, values_from = value)

# A tibble: 3 × 3
     id     x     y
  <dbl> <int> <int>
1     1     5     2
2     2     5     4
3     3     5     5

Joins

In each of the following questions, write code to produce the desired output from the two input datasets. The code may involve additional wrangling steps, beyond a join.

df1

df2

Output:

df1 |>
  left_join(df2, join_by(id))

df1

df2

Output:

df1 |>
  inner_join(df2, join_by(id))

  id x  y
1  1 7 10
2  2 9 12

df1

  a_x a_y b_x b_y
1   1   2   2   3

df2

  id z
1  a 4
2  b 5

df1 |> 
  pivot_longer(cols = everything(), names_to = c("id", ".value"), names_sep = "_") |>
  left_join(df2, join_by(id))

# A tibble: 2 × 4
  id        x     y     z
  <chr> <dbl> <dbl> <dbl>
1 a         1     2     4
2 b         2     3     5

Reading data wrangling code

Here are two small datasets, df1 and df2:

df1

  id  x  y z
1  1  5  8 8
2  2 10  8 8
3  3  7  4 8
4  4  4 10 5
5  5 10  7 2

df2

For each of the following chunks of code, write down the output or explain why it will cause an error.

df1 |>
  left_join(df2, join_by(id))

  id  x  y z  a  b
1  1  5  8 8 NA NA
2  2 10  8 8 NA NA
3  3  7  4 8  5  9
4  4  4 10 5  8  8
5  5 10  7 2  5  6

df1 |>
  inner_join(df2, join_by(id))

  id  x  y z a b
1  3  7  4 8 5 9
2  4  4 10 5 8 8
3  5 10  7 2 5 6

df1 |>
  group_by(z) |>
  summarize(max_b = max(b))

Error in `summarize()`:
ℹ In argument: `max_b = max(b)`.
ℹ In group 1: `z = 2`.
Caused by error:
! object 'b' not found

df1 |>
  select(x, y) |>
  pivot_longer(cols = -id,
               names_to = "measurement",
               values_to = "value")

Error in `pivot_longer()`:
! Can't select columns that don't exist.
✖ Column `id` doesn't exist.

df1 |>
  select(id, x, y) |>
  pivot_longer(cols = -id,
               names_to = "measurement",
               values_to = "value") |>
  filter(id %in% c(1, 2, 3))

# A tibble: 6 × 3
     id measurement value
  <int> <chr>       <int>
1     1 x               5
2     1 y               8
3     2 x              10
4     2 y               8
5     3 x               7
6     3 y               4

df1 |>
  left_join(df2, join_by(id)) |>
  mutate(new_var = x + a) |>
  group_by(z) |>
  summarize(mean_new_var = mean(new_var))

# A tibble: 3 × 2
      z mean_new_var
  <int>        <dbl>
1     2           15
2     5           12
3     8           NA

df1 |>
  left_join(df2, join_by(id)) |>
  mutate(new_var = x + a) |>
  group_by(z) |>
  summarize(mean_new_var = mean(new_var, na.rm = TRUE)) |>
  summarize(mean_b = mean(b))

Error in `summarize()`:
ℹ In argument: `mean_b = mean(b)`.
Caused by error:
! object 'b' not found