Data wrangling across columns
Agenda and reminders
- HW 2 due Friday on GitHub classroom
- Commit early and often, let me know if you have any technical problems
- Make sure to submit both the
.qmd and md files
- Department seminar tomorrow (9/11) at 11am in ZSR auditorium
- please refrain from wearing colognes, perfumes, and/or heavily scented body and hair products
- Today: data wrangling across columns
Warmup activity
Work on the activity (handout) with a neighbor, then we will discuss as a class
Warmup
diamonds |>
summarize(mean_carat = mean(carat),
sd_carat = sd(carat),
mean_depth = mean(depth),
sd_depth = sd(depth),
mean_price = mean(price),
sd_price = sd(price))
# A tibble: 1 × 6
mean_carat sd_carat mean_depth sd_depth mean_price sd_price
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 0.474 61.7 1.43 3933. 3989.
Are there any downsides to this code?
Warmup
diamonds |>
summarize(mean_carat = mean(carat),
sd_carat = sd(carat),
mean_depth = mean(depth),
sd_depth = sd(depth),
mean_price = mean(price),
sd_price = sd(price))
- more variables to summarize means longer code, harder to read
- requires a lot of copying and pasting
- more prone to errors when typing names of functions, variables, etc.
across: Data wrangling across columns
Instead of copying the same function multiple times for different columns, we can apply functions across the columns of a table:
diamonds |>
summarize(across(c(carat, depth, price),
mean))
# A tibble: 1 × 3
carat depth price
<dbl> <dbl> <dbl>
1 0.798 61.7 3933.
across: Data wrangling across columns
Instead of copying the same function multiple times for different columns, we can apply functions across the columns of a table:
diamonds |>
summarize(across(c(carat, depth, price),
mean))
# A tibble: 1 × 3
carat depth price
<dbl> <dbl> <dbl>
1 0.798 61.7 3933.
What if I want to calculate both the mean and the standard deviation of these columns?
across with multiple functions
diamonds |>
summarize(across(c(carat, depth, price),
list(mean, sd)))
# A tibble: 1 × 6
carat_1 carat_2 depth_1 depth_2 price_1 price_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 0.474 61.7 1.43 3933. 3989.
What if I want to include the function name in the summary columns?
across with multiple functions
diamonds |>
summarize(across(c(carat, depth, price),
list("mean" = mean, "sd" = sd)))
# A tibble: 1 × 6
carat_mean carat_sd depth_mean depth_sd price_mean price_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 0.474 61.7 1.43 3933. 3989.
What if I want to change the order of the column names (e.g. mean_carat vs. carat_mean)?
across with multiple functions
diamonds |>
summarize(across(c(carat, depth, price),
list("mean" = mean, "sd" = sd),
.names = "{.col}_{.fn}"))
# A tibble: 1 × 6
carat_mean carat_sd depth_mean depth_sd price_mean price_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 0.474 61.7 1.43 3933. 3989.
diamonds |>
summarize(across(c(carat, depth, price),
list("mean" = mean, "sd" = sd),
.names = "{.fn}_{.col}"))
# A tibble: 1 × 6
mean_carat sd_carat mean_depth sd_depth mean_price sd_price
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 0.474 61.7 1.43 3933. 3989.
Summarizing more columns
diamonds |>
summarize(across(c(carat, depth, price),
list("mean" = mean)))
# A tibble: 1 × 3
carat_mean depth_mean price_mean
<dbl> <dbl> <dbl>
1 0.798 61.7 3933.
How would I modify this code to calculate the mean for all the numeric variables (carat, depth, table, price, x, y, z)?
Summarizing more columns
Option 1:
diamonds |>
summarize(across(c(carat, depth, table, price, x, y, z),
list("mean" = mean)))
# A tibble: 1 × 7
carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 61.7 57.5 3933. 5.73 5.73 3.54
Are there any issues with this approach?
Efficiently summarizing more columns
diamonds |>
summarize(across(where(is.numeric),
list("mean" = mean)))
# A tibble: 1 × 7
carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 61.7 57.5 3933. 5.73 5.73 3.54
Efficiently summarizing more columns
diamonds |>
summarize(across(where(is.numeric),
list("mean" = mean)))
# A tibble: 1 × 7
carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.798 61.7 57.5 3933. 5.73 5.73 3.54
where(is.numeric) returns the columns which are numeric:
is.numeric(diamonds$carat)
is.numeric(diamonds$price)
is.numeric(diamonds$clarity)
Efficiently summarizing more columns
We can use where with other functions too. For example:
diamonds |>
summarize(across(where(is.factor),
list("num_categories" = n_distinct)))
# A tibble: 1 × 3
cut_num_categories color_num_categories clarity_num_categories
<int> <int> <int>
1 5 7 8
What do you think this code is doing?