output <- rep(NA, 10)
for(i in 1:5){
output[i] <- i
}
output[6]
[1] NA
output <- rep(0, 10)
for(i in 1:10){
output[i] <- i
}
output[6]
[1] 6
output <- rep(0, 10)
for(i in 1:10){
output[i] <- i
}
output[11]
[1] NA
output <- rep(1, 10)
for(i in 2:10){
output[i] <- i + output[i-1]
}
output[5]
[1] 15
output <- rep(1, 10)
for(i in 2:10){
output[i] <- i + output[i+1]
}
output[5]
[1] 6
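The loop that adds output[i-1] builds a running total of 1, 2, ..., 10. As a small aside (not part of the original solutions), base R's cumsum() should give the same vector without an explicit loop; the name vectorized is chosen here only for illustration.

```r
# Loop version, as in the exercise above
output <- rep(1, 10)
for(i in 2:10){
  output[i] <- i + output[i-1]
}

# Vectorized version: running total of 1, 2, ..., 10
vectorized <- cumsum(1:10)

# Compare the two element by element
all(output == vectorized)
```

The output[i+1] variant cannot be rewritten this way, because each step reads an entry that the loop has not updated yet (and output[11] does not exist, so the last entry becomes NA).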
The sample standard deviation of numbers \(x_1,...,x_n\) is given by
\[\widehat{\sigma} = \sqrt{\frac{1}{n-1}\sum \limits_{i=1}^n (x_i - \bar{x})^2},\]
where \(\bar{x} = \frac{1}{n} \sum \limits_{i=1}^n x_i\).
Write a function my_sd which calculates the standard deviation of a vector in R.
my_sd <- function(x){
n <- length(x)
if(n == 1){
return(0)
} else {
return(sqrt(1/(n-1) * sum((x - mean(x))^2)))
}
}
# checking that it works
my_sd(c(1,2,4))
[1] 1.527525
sd(c(1,2,4))
[1] 1.527525
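One place where my_sd and the built-in sd() are expected to differ is a vector of length 1: the formula divides by n - 1 = 0, and the solution above chooses to return 0 in that case, whereas base R's sd() returns NA. A minimal sketch of that comparison, with the function repeated so the block runs on its own:

```r
my_sd <- function(x){
  n <- length(x)
  if(n == 1){
    return(0)
  } else {
    return(sqrt(1/(n-1) * sum((x - mean(x))^2)))
  }
}

# Length-1 input: my_sd() uses its own convention, sd() reports a missing value
single_my_sd <- my_sd(5)
single_sd <- sd(5)

single_my_sd
is.na(single_sd)
```

Which convention is preferable depends on the use case; returning NA signals that variability cannot be estimated from a single observation.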
The \(\ell_p\) norm of a vector \(x = (x_1,...,x_k)\) is given by
\[||x||_p = \left( \sum \limits_{i=1}^k |x_i|^p \right)^{1/p}\]
Write a function p_norm in R, which takes two inputs: a vector x, and p, and returns \(||x||_p\). Make p = 2 the default value (this corresponds to the usual Euclidean norm).
p_norm <- function(x, p=2){
(sum(abs(x)^p))^(1/p)
}
p_norm(c(1, 1, 1)) # = sqrt(3)
[1] 1.732051
p_norm(c(1, 2, 3), 1) # = 6
[1] 6
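A few further checks of p_norm, offered as a sketch rather than as part of the original solution. The inputs are made up for illustration: a 3-4-5 right triangle for the default p = 2, a vector with a negative entry to confirm that absolute values are used, and a large p, for which the norm should be close to the largest absolute entry.

```r
p_norm <- function(x, p=2){
  (sum(abs(x)^p))^(1/p)
}

# Euclidean norm of (3, 4): hypotenuse of a 3-4-5 triangle
euclid <- p_norm(c(3, 4))

# p = 1 adds up absolute values, so the sign of -4 does not matter
taxicab <- p_norm(c(3, -4), 1)

# For large p the norm approaches max(abs(x))
large_p <- p_norm(c(1, 2, 3), 50)

c(euclid, taxicab, large_p)
```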
Suppose we have a sample \(X_1,...,X_n\) from some population distribution. We know that the mean describes the “center” of the distribution, the standard deviation is a measure of variability, and skewness describes the shape of the distribution.
Another quantity we can calculate to describe a distribution is kurtosis, which describes how heavy the tails of the distribution are. The sample kurtosis is calculated by:
\[\dfrac{\frac{1}{n} \sum \limits_{i=1}^n (X_i - \bar{X})^4}{\left( \frac{1}{n} \sum \limits_{i=1}^n (X_i - \bar{X})^2 \right)^2} \ \ - \ \ 3\]
where \(\bar{X}\) is the sample mean.
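As a small worked check of the formula (the four-point sample is made up purely for illustration), take \(X = (-2, 0, 0, 2)\), so that \(n = 4\) and \(\bar{X} = 0\). Then
\[\frac{1}{4} \sum \limits_{i=1}^4 (X_i - \bar{X})^4 = \frac{16 + 0 + 0 + 16}{4} = 8, \qquad \left( \frac{1}{4} \sum \limits_{i=1}^4 (X_i - \bar{X})^2 \right)^2 = \left( \frac{4 + 0 + 0 + 4}{4} \right)^2 = 4,\]
so the sample kurtosis is \(8/4 - 3 = -1\).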
Write a function my_kurtosis which calculates the sample kurtosis of a vector x.
my_kurtosis <- function(x){
mean((x - mean(x))^4)/(mean((x - mean(x))^2))^2 - 3
}
Suppose we have a sample \((X_1, Y_1),...,(X_n, Y_n)\) of \(n\) observations collected on two variables, \(X\) and \(Y\). The strength of the linear relationship between \(X\) and \(Y\) is measured by their correlation, and the sample correlation is calculated with the following formula:
\[\dfrac{ \sum \limits_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\left( \sum \limits_{i=1}^n (X_i - \bar{X})^2 \right)^{1/2} \left( \sum \limits_{i=1}^n (Y_i - \bar{Y})^2 \right)^{1/2}}\]
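As a small worked check of the formula (the three-point sample is made up purely for illustration), take \(X = (1, 2, 3)\) and \(Y = (1, 3, 2)\), so that \(\bar{X} = \bar{Y} = 2\). Then
\[\sum \limits_{i=1}^3 (X_i - \bar{X})(Y_i - \bar{Y}) = (-1)(-1) + (0)(1) + (1)(0) = 1, \qquad \sum \limits_{i=1}^3 (X_i - \bar{X})^2 = 2, \qquad \sum \limits_{i=1}^3 (Y_i - \bar{Y})^2 = 2,\]
so the sample correlation is \(1/(\sqrt{2} \cdot \sqrt{2}) = 0.5\).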
Write a function my_corr which calculates the sample correlation between a vector x and a vector y.
my_corr <- function(x, y){
sum((x - mean(x)) * (y - mean(y)))/( sqrt(sum((x - mean(x))^2)) *
sqrt(sum((y - mean(y))^2)) )
}
Three players enter a room and a red or blue hat is placed on each person’s head. The color of each hat is determined by an independent coin toss (so, any combination of red and blue hats is possible). No communication of any sort is allowed, except for an initial strategy session before the game begins. Once they have had a chance to look at the other hats but not their own, the players must simultaneously guess the color of their own hats or pass. The players win the game if at least one person guesses correctly, and no one guesses incorrectly.
set.seed(91)
nsim <- 1000
results <- rep(NA, nsim)
for(i in 1:nsim){
hats <- sample(c("red", "blue"), 3, replace=T)
guesses <- c(sample(c("red", "blue"), 1), "pass", "pass")
results[i] <- guesses[1] == hats[1]
}
mean(results)
[1] 0.48
Note: For the exam, I am more interested in the logic of how you approach the simulation than in your code syntax being perfect. Your code should be mostly correct, but a few minor errors aren’t an issue.
nsim <- 1000
results <- rep(NA, nsim)
for(i in 1:nsim){
hats <- sample(c("red", "blue"), 3, replace=T)
guesses <- rep(NA, 3)
for(j in 1:3){
if(length(unique(hats[-j])) == 1){
guesses[j] <- ifelse(unique(hats[-j]) == "red", "blue", "red")
} else {
guesses[j] <- "pass"
}
}
results[i] <- sum(guesses[guesses != "pass"] == hats[guesses != "pass"]) ==
length(guesses[guesses != "pass"])
}
mean(results)
[1] 0.751
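Because there are only 2^3 = 8 equally likely hat assignments, the simulated win rate can be cross-checked by enumerating all of them. The sketch below is an addition, not part of the original solution; the helper name wins and the objects configs and exact_prob are chosen here for illustration. It applies the same strategy (guess the opposite color when the other two hats match, otherwise pass) and also checks explicitly that at least one player guesses.

```r
# All 8 possible hat assignments, one per row
configs <- expand.grid(h1 = c("red", "blue"),
                       h2 = c("red", "blue"),
                       h3 = c("red", "blue"),
                       stringsAsFactors = FALSE)

# Apply the strategy to one assignment and report whether the team wins
wins <- function(hats){
  guesses <- rep("pass", 3)
  for(j in 1:3){
    others <- unique(hats[-j])
    if(length(others) == 1){
      # the other two hats match: guess the opposite color
      guesses[j] <- ifelse(others == "red", "blue", "red")
    }
  }
  guessed <- guesses != "pass"
  # win: at least one guess, and every guess made is correct
  any(guessed) && all(guesses[guessed] == hats[guessed])
}

# Proportion of the 8 assignments in which the team wins
exact_prob <- mean(apply(configs, 1, wins))
exact_prob
```

The team loses only when all three hats share a color, which is 2 of the 8 assignments, so the exact value agrees with the simulated estimate up to simulation error.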
In each of the questions below, write code to produce the output from the original data.
sub_data
# A tibble: 10 × 4
species island bill_length_mm bill_depth_mm
<fct> <fct> <dbl> <dbl>
1 Gentoo Biscoe 43.3 14
2 Gentoo Biscoe 59.6 17
3 Adelie Dream 39.7 17.9
4 Adelie Dream 39.2 21.1
5 Chinstrap Dream 50.8 19
6 Gentoo Biscoe 49.9 16.1
7 Chinstrap Dream 50.7 19.7
8 Gentoo Biscoe 47.3 15.3
9 Gentoo Biscoe 49.3 15.7
10 Adelie Dream 37.5 18.9
Output:
sub_data |>
count(species, island)
# A tibble: 3 × 3
species island n
<fct> <fct> <int>
1 Adelie Dream 3
2 Chinstrap Dream 2
3 Gentoo Biscoe 5
sub_data
# A tibble: 10 × 4
species island bill_length_mm bill_depth_mm
<fct> <fct> <dbl> <dbl>
1 Gentoo Biscoe 43.3 14
2 Gentoo Biscoe 59.6 17
3 Adelie Dream 39.7 17.9
4 Adelie Dream 39.2 21.1
5 Chinstrap Dream 50.8 19
6 Gentoo Biscoe 49.9 16.1
7 Chinstrap Dream 50.7 19.7
8 Gentoo Biscoe 47.3 15.3
9 Gentoo Biscoe 49.3 15.7
10 Adelie Dream 37.5 18.9
Output:
sub_data |>
group_by(island, species) |>
summarize(mean_length = mean(bill_length_mm, na.rm=T))
# A tibble: 3 × 3
# Groups: island [2]
island species mean_length
<fct> <fct> <dbl>
1 Biscoe Gentoo 49.9
2 Dream Adelie 38.8
3 Dream Chinstrap 50.8
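The grouped means can be cross-checked in base R. The sketch below is an addition to the original solutions: it rebuilds sub_data by hand from the printed rows (as a plain data frame with character columns rather than a tibble with factors, which should not affect the means) and uses aggregate() in place of group_by() and summarize(). The name group_means is chosen here for illustration.

```r
# sub_data rebuilt by hand from the rows printed above
sub_data <- data.frame(
  species = c("Gentoo", "Gentoo", "Adelie", "Adelie", "Chinstrap",
              "Gentoo", "Chinstrap", "Gentoo", "Gentoo", "Adelie"),
  island = c("Biscoe", "Biscoe", "Dream", "Dream", "Dream",
             "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream"),
  bill_length_mm = c(43.3, 59.6, 39.7, 39.2, 50.8, 49.9, 50.7, 47.3, 49.3, 37.5),
  bill_depth_mm = c(14, 17, 17.9, 21.1, 19, 16.1, 19.7, 15.3, 15.7, 18.9)
)

# Mean bill length for each island/species combination that occurs in the data
group_means <- aggregate(bill_length_mm ~ island + species,
                         data = sub_data, FUN = mean)
group_means
```

The tibble output above prints three significant digits, so the unrounded means from aggregate() may show more decimal places.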
sub_data
# A tibble: 10 × 4
species island bill_length_mm bill_depth_mm
<fct> <fct> <dbl> <dbl>
1 Gentoo Biscoe 43.3 14
2 Gentoo Biscoe 59.6 17
3 Adelie Dream 39.7 17.9
4 Adelie Dream 39.2 21.1
5 Chinstrap Dream 50.8 19
6 Gentoo Biscoe 49.9 16.1
7 Chinstrap Dream 50.7 19.7
8 Gentoo Biscoe 47.3 15.3
9 Gentoo Biscoe 49.3 15.7
10 Adelie Dream 37.5 18.9
Output:
sub_data |>
mutate(bill_ratio = bill_length_mm/bill_depth_mm)
# A tibble: 10 × 5
species island bill_length_mm bill_depth_mm bill_ratio
<fct> <fct> <dbl> <dbl> <dbl>
1 Gentoo Biscoe 43.3 14 3.09
2 Gentoo Biscoe 59.6 17 3.51
3 Adelie Dream 39.7 17.9 2.22
4 Adelie Dream 39.2 21.1 1.86
5 Chinstrap Dream 50.8 19 2.67
6 Gentoo Biscoe 49.9 16.1 3.10
7 Chinstrap Dream 50.7 19.7 2.57
8 Gentoo Biscoe 47.3 15.3 3.09
9 Gentoo Biscoe 49.3 15.7 3.14
10 Adelie Dream 37.5 18.9 1.98
sub_data
# A tibble: 10 × 4
species island bill_length_mm bill_depth_mm
<fct> <fct> <dbl> <dbl>
1 Gentoo Biscoe 43.3 14
2 Gentoo Biscoe 59.6 17
3 Adelie Dream 39.7 17.9
4 Adelie Dream 39.2 21.1
5 Chinstrap Dream 50.8 19
6 Gentoo Biscoe 49.9 16.1
7 Chinstrap Dream 50.7 19.7
8 Gentoo Biscoe 47.3 15.3
9 Gentoo Biscoe 49.3 15.7
10 Adelie Dream 37.5 18.9
Output:
sub_data |>
filter(species == "Adelie",
island == "Dream")
# A tibble: 3 × 4
species island bill_length_mm bill_depth_mm
<fct> <fct> <dbl> <dbl>
1 Adelie Dream 39.7 17.9
2 Adelie Dream 39.2 21.1
3 Adelie Dream 37.5 18.9
ex_df
id x_1 x_2 y_1 y_2
1 1 3 5 0 2
2 2 1 8 1 7
3 3 4 9 2 9
Output:
ex_df |>
pivot_longer(cols = -id, names_to = c("group", "obs"), names_sep = "_")
# A tibble: 12 × 4
id group obs value
<dbl> <chr> <chr> <dbl>
1 1 x 1 3
2 1 x 2 5
3 1 y 1 0
4 1 y 2 2
5 2 x 1 1
6 2 x 2 8
7 2 y 1 1
8 2 y 2 7
9 3 x 1 4
10 3 x 2 9
11 3 y 1 2
12 3 y 2 9
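One way to convince yourself that this reshaping loses no information is to pivot the long data back to wide form and compare with the input. The sketch below is an addition, not part of the original solutions. It assumes the tidyr package is installed, rebuilds ex_df by hand from the printed rows, and uses the illustrative names long and wide_again.

```r
library(tidyr)

# ex_df rebuilt by hand from the rows printed above
ex_df <- data.frame(id = c(1, 2, 3),
                    x_1 = c(3, 1, 4), x_2 = c(5, 8, 9),
                    y_1 = c(0, 1, 2), y_2 = c(2, 7, 9))

# Wide to long: column names like "x_1" are split into group = "x", obs = "1"
long <- ex_df |>
  pivot_longer(cols = -id, names_to = c("group", "obs"), names_sep = "_")

# Long back to wide: group and obs are pasted together again with "_"
wide_again <- long |>
  pivot_wider(names_from = c(group, obs), values_from = value, names_sep = "_")

# Compare the round trip with the original data frame
all.equal(as.data.frame(wide_again), ex_df)
```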
ex_df
id group value
1 1 x 5
2 1 y 2
3 2 x 5
4 2 y 4
5 3 x 5
6 3 y 5
Output:
ex_df |>
pivot_wider(id_cols = id, names_from = group, values_from = value)
# A tibble: 3 × 3
id x y
<dbl> <int> <int>
1 1 5 2
2 2 5 4
3 3 5 5
In each of the following questions, write code to produce the desired output from the two input datasets. The code may involve additional wrangling steps, beyond a join.
df1
id x
1 1 7
2 2 9
3 3 13
df2
id y
1 1 10
2 2 12
3 4 14
Output:
df1 |>
left_join(df2, join_by(id))
id x y
1 1 7 10
2 2 9 12
3 3 13 NA
df1
id x
1 1 7
2 2 9
3 3 13
df2
id y
1 1 10
2 2 12
3 4 14
Output:
df1 |>
inner_join(df2, join_by(id))
id x y
1 1 7 10
2 2 9 12
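The two joins above can also be reproduced with base R's merge(), which may help clarify what each one keeps. This is a sketch added for comparison, not part of the original solutions; df1 and df2 are rebuilt by hand from the printed rows, and the names left and inner are chosen for illustration.

```r
# Inputs rebuilt by hand from the rows printed above
df1 <- data.frame(id = c(1, 2, 3), x = c(7, 9, 13))
df2 <- data.frame(id = c(1, 2, 4), y = c(10, 12, 14))

# all.x = TRUE keeps every row of df1 (like left_join);
# id 3 has no match in df2, so its y is NA
left <- merge(df1, df2, by = "id", all.x = TRUE)

# The default keeps only ids present in both inputs (like inner_join)
inner <- merge(df1, df2, by = "id")

left
inner
```

Note that merge() sorts the result by the key column, whereas the dplyr joins keep the row order of the first data frame; for these inputs the two orders coincide.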
df1
a_x a_y b_x b_y
1 1 2 2 3
df2
id z
1 a 4
2 b 5
Output:
df1 |>
pivot_longer(cols = everything(), names_to = c("id", ".value"), names_sep = "_") |>
left_join(df2, join_by(id))
# A tibble: 2 × 4
id x y z
<chr> <dbl> <dbl> <dbl>
1 a 1 2 4
2 b 2 3 5
Here are two small datasets, df1 and df2:
df1
id x y z
1 1 5 8 8
2 2 10 8 8
3 3 7 4 8
4 4 4 10 5
5 5 10 7 2
df2
id a b
1 3 5 9
2 4 8 8
3 5 5 6
4 6 9 2
For each of the following chunks of code, write down the output or explain why it will cause an error.
df1 |>
left_join(df2, join_by(id))
id x y z a b
1 1 5 8 8 NA NA
2 2 10 8 8 NA NA
3 3 7 4 8 5 9
4 4 4 10 5 8 8
5 5 10 7 2 5 6
df1 |>
inner_join(df2, join_by(id))
id x y z a b
1 3 7 4 8 5 9
2 4 4 10 5 8 8
3 5 10 7 2 5 6
df1 |>
group_by(z) |>
summarize(max_b = max(b))
Error in `summarize()`:
ℹ In argument: `max_b = max(b)`.
ℹ In group 1: `z = 2`.
Caused by error:
! object 'b' not found
The code fails because df1 has no column b; b exists only in df2, and the two datasets have not been joined.
df1 |>
select(x, y) |>
pivot_longer(cols = -id,
names_to = "measurement",
values_to = "value")
Error in `pivot_longer()`:
! Can't select columns that don't exist.
✖ Column `id` doesn't exist.
The code fails because select(x, y) drops the id column, so cols = -id refers to a column that no longer exists.
df1 |>
select(id, x, y) |>
pivot_longer(cols = -id,
names_to = "measurement",
values_to = "value") |>
filter(id %in% c(1, 2, 3))
# A tibble: 6 × 3
id measurement value
<int> <chr> <int>
1 1 x 5
2 1 y 8
3 2 x 10
4 2 y 8
5 3 x 7
6 3 y 4
df1 |>
left_join(df2, join_by(id)) |>
mutate(new_var = x + a) |>
group_by(z) |>
summarize(mean_new_var = mean(new_var))
# A tibble: 3 × 2
z mean_new_var
<int> <dbl>
1 2 15
2 5 12
3 8 NA
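The NA for z = 8 comes from missing values propagating through mean(). In the joined data, ids 1, 2, and 3 all have z = 8, but ids 1 and 2 have no match in df2, so a (and therefore new_var = x + a) is NA for them; only id 3 contributes a value, 7 + 5 = 12. A minimal base R sketch of that group, with the illustrative name new_var_z8:

```r
# new_var for the three rows with z = 8 (ids 1, 2, 3)
new_var_z8 <- c(NA, NA, 7 + 5)

# Any NA in the input makes the mean NA by default
mean_default <- mean(new_var_z8)

# na.rm = TRUE drops the missing values before averaging
mean_na_rm <- mean(new_var_z8, na.rm = TRUE)

mean_default
mean_na_rm
```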
df1 |>
left_join(df2, join_by(id)) |>
mutate(new_var = x + a) |>
group_by(z) |>
summarize(mean_new_var = mean(new_var, na.rm=T)) |>
summarize(mean_b = mean(b))
Error in `summarize()`:
ℹ In argument: `mean_b = mean(b)`.
Caused by error:
! object 'b' not found
The code fails because the first summarize() returns only the columns z and mean_new_var, so b is no longer available to the second summarize().