Data wrangling across columns

Agenda and reminders

HW 2 due Friday on GitHub classroom
- Commit early and often, let me know if you have any technical problems
- Make sure to submit both the .qmd and md files
Department seminar tomorrow (9/11) at 11am in ZSR auditorium
- please refrain from wearing colognes, perfumes, and/or heavily scented body and hair products
Today: data wrangling across columns

Warmup activity

Work on the activity (handout) with a neighbor, then we will discuss as a class

Warmup

diamonds |>
  summarize(mean_carat = mean(carat),
            sd_carat = sd(carat),
            mean_depth = mean(depth),
            sd_depth = sd(depth),
            mean_price = mean(price),
            sd_price = sd(price))

# A tibble: 1 × 6
  mean_carat sd_carat mean_depth sd_depth mean_price sd_price
       <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>
1      0.798    0.474       61.7     1.43      3933.    3989.

Are there any downsides to this code?

Warmup

diamonds |>
  summarize(mean_carat = mean(carat),
            sd_carat = sd(carat),
            mean_depth = mean(depth),
            sd_depth = sd(depth),
            mean_price = mean(price),
            sd_price = sd(price))

more variables to summarize means longer code, harder to read
requires a lot of copying and pasting
more prone to errors when typing names of functions, variables, etc.

`across`: Data wrangling across columns

Instead of copying the same function multiple times for different columns, we can apply functions across the columns of a table:

diamonds |>
  summarize(across(c(carat, depth, price), 
                   mean))

# A tibble: 1 × 3
  carat depth price
  <dbl> <dbl> <dbl>
1 0.798  61.7 3933.

`across`: Data wrangling across columns

Instead of copying the same function multiple times for different columns, we can apply functions across the columns of a table:

diamonds |>
  summarize(across(c(carat, depth, price), 
                   mean))

# A tibble: 1 × 3
  carat depth price
  <dbl> <dbl> <dbl>
1 0.798  61.7 3933.

What if I want to calculate both the mean and the standard deviation of these columns?

`across` with multiple functions

diamonds |>
  summarize(across(c(carat, depth, price), 
                   list(mean, sd)))

# A tibble: 1 × 6
  carat_1 carat_2 depth_1 depth_2 price_1 price_2
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1   0.798   0.474    61.7    1.43   3933.   3989.

What if I want to include the function name in the summary columns?

`across` with multiple functions

diamonds |>
  summarize(across(c(carat, depth, price), 
                   list("mean" = mean, "sd" = sd)))

# A tibble: 1 × 6
  carat_mean carat_sd depth_mean depth_sd price_mean price_sd
       <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>
1      0.798    0.474       61.7     1.43      3933.    3989.

What if I want to change the order of the column names (e.g. mean_carat vs. carat_mean)?

`across` with multiple functions

diamonds |>
  summarize(across(c(carat, depth, price), 
                   list("mean" = mean, "sd" = sd),
                   .names = "{.col}_{.fn}"))

# A tibble: 1 × 6
  carat_mean carat_sd depth_mean depth_sd price_mean price_sd
       <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>
1      0.798    0.474       61.7     1.43      3933.    3989.

diamonds |>
  summarize(across(c(carat, depth, price), 
                   list("mean" = mean, "sd" = sd),
                   .names = "{.fn}_{.col}"))

# A tibble: 1 × 6
  mean_carat sd_carat mean_depth sd_depth mean_price sd_price
       <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>
1      0.798    0.474       61.7     1.43      3933.    3989.

Summarizing more columns

diamonds |>
  summarize(across(c(carat, depth, price), 
                   list("mean" = mean)))

# A tibble: 1 × 3
  carat_mean depth_mean price_mean
       <dbl>      <dbl>      <dbl>
1      0.798       61.7      3933.

How would I modify this code to calculate the mean for all the numeric variables (carat, depth, table, price, x, y, z)?

Summarizing more columns

Option 1:

diamonds |>
  summarize(across(c(carat, depth, table, price, x, y, z), 
                   list("mean" = mean)))

# A tibble: 1 × 7
  carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
       <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1      0.798       61.7       57.5      3933.   5.73   5.73   3.54

Are there any issues with this approach?

Efficiently summarizing more columns

diamonds |>
  summarize(across(where(is.numeric),
                   list("mean" = mean)))

# A tibble: 1 × 7
  carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
       <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1      0.798       61.7       57.5      3933.   5.73   5.73   3.54

Efficiently summarizing more columns

diamonds |>
  summarize(across(where(is.numeric),
                   list("mean" = mean)))

# A tibble: 1 × 7
  carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
       <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1      0.798       61.7       57.5      3933.   5.73   5.73   3.54

where(is.numeric) returns the columns which are numeric:

is.numeric(diamonds$carat)

[1] TRUE

is.numeric(diamonds$price)

[1] TRUE

is.numeric(diamonds$clarity)

[1] FALSE

Efficiently summarizing more columns

We can use where with other functions too. For example:

diamonds |>
  summarize(across(where(is.factor),
                   list("num_categories" = n_distinct)))

# A tibble: 1 × 3
  cut_num_categories color_num_categories clarity_num_categories
               <int>                <int>                  <int>
1                  5                    7                      8

What do you think this code is doing?

Class activity

https://sta279-f25.github.io/class_activities/ca_07.html

Work with a neighbor on the class activity
At the end of class, submit your work as an HTML file on Canvas (one per group, list all your names)

For next time, read:

Chapter 25.2 in R for Data Science

Data wrangling across columns

Agenda and reminders

Warmup activity

Warmup

Warmup

across: Data wrangling across columns

across: Data wrangling across columns

across with multiple functions

across with multiple functions

across with multiple functions

Summarizing more columns

Summarizing more columns

Efficiently summarizing more columns

Efficiently summarizing more columns

Efficiently summarizing more columns

Class activity

`across`: Data wrangling across columns

`across`: Data wrangling across columns

`across` with multiple functions

`across` with multiple functions

`across` with multiple functions