# A tibble: 35 × 9
# Groups: cut [5]
cut color mean_carat mean_depth mean_table mean_price mean_x mean_y mean_z
<ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair D 0.920 64.0 59.0 4291. 6.02 5.96 3.84
2 Fair E 0.857 63.3 59.4 3682. 5.91 5.86 3.72
3 Fair F 0.905 63.5 59.5 3827. 5.99 5.93 3.79
4 Fair G 1.02 64.3 58.8 4239. 6.17 6.11 3.96
5 Fair H 1.22 64.6 58.7 5136. 6.58 6.50 4.22
6 Fair I 1.20 64.2 59.2 4685. 6.56 6.49 4.19
7 Fair J 1.34 64.4 58.9 4976. 6.75 6.68 4.32
8 Good D 0.745 62.4 58.5 3405. 5.62 5.63 3.50
9 Good E 0.745 62.2 58.8 3424. 5.62 5.63 3.50
10 Good F 0.776 62.2 58.9 3496. 5.69 5.71 3.54
# ℹ 25 more rows
Activity: data wrangling across columns
diamonds data
- Modify the code from class to calculate the mean of each numeric variable in the
diamondsdata, for each different cut and color, producing the following output. (Thediamondsdata is contained in theggplot2package).
Solution:
diamonds |>
group_by(cut, color) |>
summarize(across(where(is.numeric),
list("mean" = mean),
.names = "{.fn}_{.col}"))`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 9
# Groups: cut [5]
cut color mean_carat mean_depth mean_table mean_price mean_x mean_y mean_z
<ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair D 0.920 64.0 59.0 4291. 6.02 5.96 3.84
2 Fair E 0.857 63.3 59.4 3682. 5.91 5.86 3.72
3 Fair F 0.905 63.5 59.5 3827. 5.99 5.93 3.79
4 Fair G 1.02 64.3 58.8 4239. 6.17 6.11 3.96
5 Fair H 1.22 64.6 58.7 5136. 6.58 6.50 4.22
6 Fair I 1.20 64.2 59.2 4685. 6.56 6.49 4.19
7 Fair J 1.34 64.4 58.9 4976. 6.75 6.68 4.32
8 Good D 0.745 62.4 58.5 3405. 5.62 5.63 3.50
9 Good E 0.745 62.2 58.8 3424. 5.62 5.63 3.50
10 Good F 0.776 62.2 58.9 3496. 5.69 5.71 3.54
# ℹ 25 more rows
Selecting multiple columns based on name
For the second part of this class activity, we will create our own small toy dataset, so that we can explore data wrangling functions with a dataset that is easy to view and work with.
Run the following code in R to create the toy dataset:
library(tidyverse)
example_df <- data.frame(
x1 = c(1, 2, 3),
x2 = c("a", "b", "c"),
x3 = c(5, 1, 2),
y1 = c(0, 9, 2),
y2 = c(2, 7, 9),
z = c(0, 0, 0)
)
example_df x1 x2 x3 y1 y2 z
1 1 a 5 0 2 0
2 2 b 1 9 7 0
3 3 c 2 2 9 0
Notice that here we have three columns which start with x, and two columns which start with y. If I want to focus only on the x columns, I can use the starts_with function:
example_df |>
select(starts_with("x")) x1 x2 x3
1 1 a 5
2 2 b 1
3 3 c 2
- Use the
starts_withfunction to compute the mean of each column beginning withy, without listing any of the column names explicitly. (Hint: usestarts_with(...)instead ofwhere(...)in theacrossfunction).
Solution:
example_df |>
summarize(across(starts_with("y"), mean)) y1 y2
1 3.666667 6
- Use both
starts_withandwhereto compute the mean of each numeric column beginning withx, without listing any of the column names explicitly.
Solution:
example_df |>
summarize(across(where(is.numeric) & starts_with("x"), mean)) x1 x3
1 2 2.666667