Exam 1 review
Below are questions to help you study for Exam 1. These are some examples of the kinds of questions I might ask.
- This is not a practice exam. There will be fewer questions on the actual exam.
- The questions cover most, but not all, possible material for the exam.
- The distribution of questions here is not necessarily reflective of the distribution of questions on the actual exam.
Practice with loops in R
- What will the following code return when you run it in R?
output <- rep(NA, 10)
for(i in 1:5){
  output[i] <- i
}
output[6]
- What will the following code return when you run it in R?
output <- rep(0, 10)
for(i in 1:10){
  output[i] <- i
}
output[6]
- What will the following code return when you run it in R?
output <- rep(0, 10)
for(i in 1:10){
  output[i] <- i
}
output[11]
- What will the following code return when you run it in R?
output <- rep(1, 10)
for(i in 2:10){
  output[i] <- i + output[i-1]
}
output[5]
- What will the following code return when you run it in R?
output <- rep(1, 10)
for(i in 2:10){
  output[i] <- i + output[i+1]
}
output[5]
Practice with functions
Standard deviation
The sample standard deviation of numbers \(x_1,...,x_n\) is given by
\[\widehat{\sigma} = \sqrt{\frac{1}{n-1}\sum \limits_{i=1}^n (x_i - \bar{x})^2},\]
where \(\bar{x} = \frac{1}{n} \sum \limits_{i=1}^n x_i\).
- Write a function called my_sd which calculates the standard deviation of a vector in R.
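One way to approach this is to translate the formula term by term. The sketch below is only one possible implementation; the test vector is made up, and the comparison with R's built-in sd() is a sanity check rather than part of the question.

```r
# A possible implementation of the sample standard deviation formula above
my_sd <- function(x) {
  n <- length(x)
  x_bar <- sum(x) / n                  # sample mean
  sqrt(sum((x - x_bar)^2) / (n - 1))   # divide by n - 1, then take the square root
}

# Made-up test vector; the two results should agree
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
my_sd(x)
sd(x)
```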
\(\ell_p\) norm
The \(\ell_p\) norm of a vector \(x = (x_1,...,x_k)\) is given by
\[||x||_p = \left( \sum \limits_{i=1}^k |x_i|^p \right)^{1/p}\]
- Write a function called p_norm in R, which takes two inputs: a vector x, and p, and returns \(||x||_p\). Make p = 2 the default value (this corresponds to the usual Euclidean norm).
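A minimal sketch of such a function, assuming x is numeric and p is a positive number (no input checking is done here):

```r
# A possible implementation of the l_p norm, with p = 2 as the default
p_norm <- function(x, p = 2) {
  sum(abs(x)^p)^(1 / p)
}

p_norm(c(3, 4))           # Euclidean norm: sqrt(3^2 + 4^2) = 5
p_norm(c(3, -4), p = 1)   # sum of absolute values: 3 + 4 = 7
```

Setting the default in the function signature means the caller can omit p entirely and still get the Euclidean norm.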
Kurtosis
Suppose we have a sample \(X_1,...,X_n\) from some population distribution. We know that the mean describes the “center” of the distribution, the standard deviation is a measure of variability, and skewness describes the shape of the distribution.
Another quantity we can calculate to describe a distribution is kurtosis, which describes how heavy the tails of the distribution are. The sample kurtosis is calculated by:
\[\dfrac{\frac{1}{n} \sum \limits_{i=1}^n (X_i - \bar{X})^4}{\left( \frac{1}{n} \sum \limits_{i=1}^n (X_i - \bar{X})^2 \right)^2} \ \ - \ \ 3\]
where \(\bar{X}\) is the sample mean.
- Write a function in R to calculate the sample kurtosis. Your function should take in one argument: a vector x.
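The formula can be broken into a numerator (the average fourth power of the deviations) and a denominator (the squared average of the squared deviations). The sketch below follows that split; the function name my_kurtosis and the intermediate variable names are illustrative choices, not required by the question.

```r
# A possible implementation of the sample kurtosis formula above
my_kurtosis <- function(x) {
  n <- length(x)
  x_bar <- mean(x)
  fourth_moment <- sum((x - x_bar)^4) / n   # numerator
  second_moment <- sum((x - x_bar)^2) / n   # inside the squared denominator
  fourth_moment / second_moment^2 - 3
}

# For 1, 2, 3, 4, 5: numerator = 34/5 = 6.8, denominator = 2^2 = 4, so 1.7 - 3 = -1.3
my_kurtosis(c(1, 2, 3, 4, 5))
```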
Correlation
Suppose we have a sample \((X_1, Y_1),...,(X_n, Y_n)\) of \(n\) observations collected on two variables, \(X\) and \(Y\). The strength of the linear relationship between \(X\) and \(Y\) is measured by their correlation, and the sample correlation is calculated with the following formula:
\[\dfrac{ \sum \limits_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\left( \sum \limits_{i=1}^n (X_i - \bar{X})^2 \right)^{1/2} \left( \sum \limits_{i=1}^n (Y_i - \bar{Y})^2 \right)^{1/2}}\]
where \(\bar{X}\) and \(\bar{Y}\) are the sample means of the two variables.
- Write a function in R to calculate the sample correlation. Your function should take in two arguments: a vector x and a vector y.
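As a sketch, the deviations from each mean can be computed once and reused in both the numerator and the denominator. The name my_cor and the test vectors are made up for illustration, and the sketch assumes x and y have the same length; the built-in cor() is used only as a check.

```r
# A possible implementation of the sample correlation formula above
my_cor <- function(x, y) {
  x_dev <- x - mean(x)   # deviations of x from its mean
  y_dev <- y - mean(y)   # deviations of y from its mean
  sum(x_dev * y_dev) / (sqrt(sum(x_dev^2)) * sqrt(sum(y_dev^2)))
}

# Made-up test vectors; the two results should agree
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
my_cor(x, y)
cor(x, y)
```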
Practice with probability simulations
Three players enter a room and a red or blue hat is placed on each person’s head. The color of each hat is determined by [an independent] coin toss (so, any combination of red and blue hats is possible). No communication of any sort is allowed, except for an initial strategy session before the game begins. Once they have had a chance to look at the other hats [but not their own], the players must simultaneously guess the color of their own hats or pass. The players win the game if at least one person guesses correctly, and no one guesses incorrectly.
Here is one strategy: one player randomly guesses the color of their hat, while the other two players pass. Write a simulation to estimate the probability the players win the game (the true probability is 1/2).
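One possible way to organize this simulation is sketched below. It assumes player 1 is the one who guesses; the seed and the number of simulated games (nsim) are arbitrary choices.

```r
set.seed(1)     # arbitrary seed, only so the run is reproducible
nsim <- 10000   # number of simulated games (arbitrary)
wins <- rep(NA, nsim)

for (i in 1:nsim) {
  # one independent fair coin toss per player
  hats <- sample(c("red", "blue"), 3, replace = TRUE)
  # player 1 guesses at random; players 2 and 3 pass
  guess <- sample(c("red", "blue"), 1)
  # with only one guess made, the team wins exactly when that guess is right
  wins[i] <- guess == hats[1]
}

mean(wins)   # estimated win probability; should be close to 1/2
```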
Here is another strategy: if a player sees the same color on the other two hats, they guess the color they do not see. If a player sees different colors on the other two hats, they pass. For example: If players A, B, and C have hats red, blue, and blue respectively, then player A would guess red, player B would pass, and player C would pass. Write a simulation to estimate the probability the players win the game with this new strategy (the true probability is 3/4).
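A sketch of one way to simulate this strategy follows. Each player looks at the other two hats, guesses the opposite color if those hats match, and otherwise passes (recorded here as NA). The seed and nsim are arbitrary choices.

```r
set.seed(1)     # arbitrary seed, only so the run is reproducible
nsim <- 10000   # number of simulated games (arbitrary)
wins <- rep(NA, nsim)

for (i in 1:nsim) {
  hats <- sample(c("red", "blue"), 3, replace = TRUE)
  guesses <- rep(NA, 3)   # NA means the player passes

  for (j in 1:3) {
    others <- hats[-j]    # the two hats player j can see
    if (others[1] == others[2]) {
      # guess the color the player does not see
      guesses[j] <- ifelse(others[1] == "red", "blue", "red")
    }
  }

  made_guess <- !is.na(guesses)
  # win: at least one guess was made, and every guess made is correct
  wins[i] <- any(made_guess) && all(guesses[made_guess] == hats[made_guess])
}

mean(wins)   # estimated win probability; should be close to 3/4
```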
Note: For the exam, I am more interested in the logic of how you approach the simulation than in your code syntax being perfect. Your code should be mostly correct, but a few minor errors aren’t an issue.
Practice with data wrangling
Writing code
In each of the questions below, write code to produce the output from the original data.
- Original data:
# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14
 2 Gentoo    Biscoe           59.6          17
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9
Output:
# A tibble: 3 × 3
  species   island     n
  <fct>     <fct>  <int>
1 Adelie    Dream      3
2 Chinstrap Dream      2
3 Gentoo    Biscoe     5
- Original data:
# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14
 2 Gentoo    Biscoe           59.6          17
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9
Output:
# A tibble: 3 × 3
# Groups:   island [2]
  island species   mean_length
  <fct>  <fct>           <dbl>
1 Biscoe Gentoo           49.9
2 Dream  Adelie           38.8
3 Dream  Chinstrap        50.8
- Original data:
# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14
 2 Gentoo    Biscoe           59.6          17
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9
Output:
# A tibble: 10 × 5
   species   island bill_length_mm bill_depth_mm bill_ratio
   <fct>     <fct>           <dbl>         <dbl>      <dbl>
 1 Gentoo    Biscoe           43.3          14         3.09
 2 Gentoo    Biscoe           59.6          17         3.51
 3 Adelie    Dream            39.7          17.9       2.22
 4 Adelie    Dream            39.2          21.1       1.86
 5 Chinstrap Dream            50.8          19         2.67
 6 Gentoo    Biscoe           49.9          16.1       3.10
 7 Chinstrap Dream            50.7          19.7       2.57
 8 Gentoo    Biscoe           47.3          15.3       3.09
 9 Gentoo    Biscoe           49.3          15.7       3.14
10 Adelie    Dream            37.5          18.9       1.98
- Original data:
# A tibble: 10 × 4
   species   island bill_length_mm bill_depth_mm
   <fct>     <fct>           <dbl>         <dbl>
 1 Gentoo    Biscoe           43.3          14
 2 Gentoo    Biscoe           59.6          17
 3 Adelie    Dream            39.7          17.9
 4 Adelie    Dream            39.2          21.1
 5 Chinstrap Dream            50.8          19
 6 Gentoo    Biscoe           49.9          16.1
 7 Chinstrap Dream            50.7          19.7
 8 Gentoo    Biscoe           47.3          15.3
 9 Gentoo    Biscoe           49.3          15.7
10 Adelie    Dream            37.5          18.9
Output:
# A tibble: 3 × 4
  species island bill_length_mm bill_depth_mm
  <fct>   <fct>           <dbl>         <dbl>
1 Adelie  Dream            39.7          17.9
2 Adelie  Dream            39.2          21.1
3 Adelie  Dream            37.5          18.9
- Original data:
  id x_1 x_2 y_1 y_2
1  1   3   5   0   2
2  2   1   8   1   7
3  3   4   9   2   9
Output:
# A tibble: 12 × 4
      id group obs   value
   <dbl> <chr> <chr> <dbl>
 1     1 x     1         3
 2     1 x     2         5
 3     1 y     1         0
 4     1 y     2         2
 5     2 x     1         1
 6     2 x     2         8
 7     2 y     1         1
 8     2 y     2         7
 9     3 x     1         4
10     3 x     2         9
11     3 y     1         2
12     3 y     2         9
- Original data:
  id group value
1  1     x     5
2  1     y     2
3  2     x     5
4  2     y     4
5  3     x     5
6  3     y     5
Output:
# A tibble: 3 × 3
     id     x     y
  <dbl> <int> <int>
1     1     5     2
2     2     5     4
3     3     5     5
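The sketch below shows one possible dplyr/tidyr approach to each of the six questions above, in order. It assumes the dplyr and tidyr packages are installed. The object names penguins_small, wide_df, and long_df are hypothetical stand-ins for the original data shown in the questions, and other pipelines could produce the same outputs.

```r
library(dplyr)
library(tidyr)

# Hypothetical name for the 10-row penguin sample used in questions 1 to 4
penguins_small <- tibble(
  species = factor(c("Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap",
                     "Gentoo", "Chinstrap", "Gentoo", "Gentoo", "Adelie")),
  island = factor(c("Biscoe", "Biscoe", "Dream", "Dream", "Dream",
                    "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream")),
  bill_length_mm = c(43.3, 59.6, 39.7, 39.2, 50.8, 49.9, 50.7, 47.3, 49.3, 37.5),
  bill_depth_mm = c(14, 17, 17.9, 21.1, 19, 16.1, 19.7, 15.3, 15.7, 18.9)
)

# Question 1: number of rows for each species/island combination
counts <- penguins_small |>
  count(species, island)

# Question 2: mean bill length per island/species group
# (dropping only the last grouping level leaves the result grouped by island)
mean_lengths <- penguins_small |>
  group_by(island, species) |>
  summarize(mean_length = mean(bill_length_mm), .groups = "drop_last")

# Question 3: add the ratio of bill length to bill depth as a new column
with_ratio <- penguins_small |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)

# Question 4: keep only the Adelie penguins
adelie_only <- penguins_small |>
  filter(species == "Adelie")

# Question 5: split column names like "x_1" into a group ("x") and an obs ("1")
wide_df <- data.frame(id = c(1, 2, 3), x_1 = c(3, 1, 4), x_2 = c(5, 8, 9),
                      y_1 = c(0, 1, 2), y_2 = c(2, 7, 9))
long_result <- wide_df |>
  pivot_longer(cols = -id,
               names_to = c("group", "obs"),
               names_sep = "_",
               values_to = "value")

# Question 6: one column per group, one row per id
long_df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                      group = rep(c("x", "y"), 3),
                      value = c(5L, 2L, 5L, 4L, 5L, 5L))
wide_result <- long_df |>
  pivot_wider(id_cols = id, names_from = group, values_from = value)
```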
Joins
In each of the following questions, write code to produce the desired output from the two input datasets. The code may involve additional wrangling steps, beyond a join.
df1:
  id  x
1  1  7
2  2  9
3  3 13
df2:
  id  y
1  1 10
2  2 12
3  4 14
Output:
  id  x  y
1  1  7 10
2  2  9 12
3  3 13 NA
df1:
  id  x
1  1  7
2  2  9
3  3 13
df2:
  id  y
1  1 10
2  2 12
3  4 14
Output:
  id x  y
1  1 7 10
2  2 9 12
df1:
  a_x a_y b_x b_y
1   1   2   2   3
df2:
  id z
1  a 4
2  b 5
Output:
# A tibble: 2 × 4
  id        x     y     z
  <chr> <dbl> <dbl> <dbl>
1 a         1     2     4
2 b         2     3     5
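One possible approach to the three join questions above is sketched here, assuming dplyr (version 1.1.0 or later, for join_by) and tidyr are installed. The names df1_wide and df2_z are hypothetical, used only so the inputs of the third question do not overwrite df1 and df2 from the first two.

```r
library(dplyr)
library(tidyr)

# Inputs shared by the first two questions
df1 <- data.frame(id = c(1, 2, 3), x = c(7, 9, 13))
df2 <- data.frame(id = c(1, 2, 4), y = c(10, 12, 14))

# First question: keep every row of df1; ids missing from df2 get NA for y
left_result <- df1 |>
  left_join(df2, join_by(id))

# Second question: keep only ids that appear in both datasets
inner_result <- df1 |>
  inner_join(df2, join_by(id))

# Inputs for the third question (hypothetical names)
df1_wide <- data.frame(a_x = 1, a_y = 2, b_x = 2, b_y = 3)
df2_z <- data.frame(id = c("a", "b"), z = c(4, 5))

# Third question: reshape first, then join.
# The prefix of each column name becomes id; ".value" tells pivot_longer
# that the suffix (x or y) names the output column holding the value.
third_result <- df1_wide |>
  pivot_longer(cols = everything(),
               names_to = c("id", ".value"),
               names_sep = "_") |>
  left_join(df2_z, join_by(id))
```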
Reading data wrangling code
Here are two small datasets, df1 and df2:
df1:
  id  x  y z
1  1  5  8 8
2  2 10  8 8
3  3  7  4 8
4  4  4 10 5
5  5 10  7 2
df2:
  id a b
1  3 5 9
2  4 8 8
3  5 5 6
4  6 9 2
For each of the following chunks of code, write down the output or explain why it will cause an error.
df1 |>
  left_join(df2, join_by(id))

df1 |>
  inner_join(df2, join_by(id))

df1 |>
  group_by(z) |>
  summarize(max_b = max(b))

df1 |>
  select(x, y) |>
  pivot_longer(cols = -id,
               names_to = "measurement",
               values_to = "value")

df1 |>
  select(id, x, y) |>
  pivot_longer(cols = -id,
               names_to = "measurement",
               values_to = "value") |>
  filter(id %in% c(1, 2, 3))

df1 |>
  left_join(df2, join_by(id)) |>
  mutate(new_var = x + a) |>
  group_by(z) |>
  summarize(mean_new_var = mean(new_var))

df1 |>
  left_join(df2, join_by(id)) |>
  mutate(new_var = x + a) |>
  group_by(z) |>
  summarize(mean_new_var = mean(new_var, na.rm=T)) |>
  summarize(mean_b = mean(b))