Activity: data wrangling across columns

diamonds data

  1. Modify the code from class to calculate the mean of each numeric variable in the diamonds data, for each different cut and color, producing the following output. (The diamonds data is contained in the ggplot2 package).
# A tibble: 35 × 9
# Groups:   cut [5]
   cut   color mean_carat mean_depth mean_table mean_price mean_x mean_y mean_z
   <ord> <ord>      <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
 1 Fair  D          0.920       64.0       59.0      4291.   6.02   5.96   3.84
 2 Fair  E          0.857       63.3       59.4      3682.   5.91   5.86   3.72
 3 Fair  F          0.905       63.5       59.5      3827.   5.99   5.93   3.79
 4 Fair  G          1.02        64.3       58.8      4239.   6.17   6.11   3.96
 5 Fair  H          1.22        64.6       58.7      5136.   6.58   6.50   4.22
 6 Fair  I          1.20        64.2       59.2      4685.   6.56   6.49   4.19
 7 Fair  J          1.34        64.4       58.9      4976.   6.75   6.68   4.32
 8 Good  D          0.745       62.4       58.5      3405.   5.62   5.63   3.50
 9 Good  E          0.745       62.2       58.8      3424.   5.62   5.63   3.50
10 Good  F          0.776       62.2       58.9      3496.   5.69   5.71   3.54
# ℹ 25 more rows

Solution:

diamonds |>
  group_by(cut, color) |>
  summarize(across(where(is.numeric),
                   list("mean" = mean),
                   .names = "{.fn}_{.col}"))
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 9
# Groups:   cut [5]
   cut   color mean_carat mean_depth mean_table mean_price mean_x mean_y mean_z
   <ord> <ord>      <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
 1 Fair  D          0.920       64.0       59.0      4291.   6.02   5.96   3.84
 2 Fair  E          0.857       63.3       59.4      3682.   5.91   5.86   3.72
 3 Fair  F          0.905       63.5       59.5      3827.   5.99   5.93   3.79
 4 Fair  G          1.02        64.3       58.8      4239.   6.17   6.11   3.96
 5 Fair  H          1.22        64.6       58.7      5136.   6.58   6.50   4.22
 6 Fair  I          1.20        64.2       59.2      4685.   6.56   6.49   4.19
 7 Fair  J          1.34        64.4       58.9      4976.   6.75   6.68   4.32
 8 Good  D          0.745       62.4       58.5      3405.   5.62   5.63   3.50
 9 Good  E          0.745       62.2       58.8      3424.   5.62   5.63   3.50
10 Good  F          0.776       62.2       58.9      3496.   5.69   5.71   3.54
# ℹ 25 more rows

Selecting multiple columns based on name

For the second part of this class activity, we will create our own small toy dataset, so that we can explore data wrangling functions with a dataset that is easy to view and work with.

Run the following code in R to create the toy dataset:

library(tidyverse)

example_df <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c("a", "b", "c"),
  x3 = c(5, 1, 2),
  y1 = c(0, 9, 2),
  y2 = c(2, 7, 9),
  z = c(0, 0, 0)
)

example_df
  x1 x2 x3 y1 y2 z
1  1  a  5  0  2 0
2  2  b  1  9  7 0
3  3  c  2  2  9 0

Notice that here we have three columns which start with x, and two columns which start with y. If I want to focus only on the x columns, I can use the starts_with function:

example_df |>
  select(starts_with("x"))
  x1 x2 x3
1  1  a  5
2  2  b  1
3  3  c  2
  1. Use the starts_with function to compute the mean of each column beginning with y, without listing any of the column names explicitly. (Hint: use starts_with(...) instead of where(...) in the across function).

Solution:

example_df |>
  summarize(across(starts_with("y"), mean))
        y1 y2
1 3.666667  6
  1. Use both starts_with and where to compute the mean of each numeric column beginning with x, without listing any of the column names explicitly.

Solution:

example_df |>
  summarize(across(where(is.numeric) & starts_with("x"), mean))
  x1       x3
1  2 2.666667