Functions

Warmup activity

Work on the activity (handout) with a neighbor; then we will discuss it as a class.

Warmup

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by(group_var) |>
    summarize(max(max_var, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)

What is this code trying to do?

Warmup

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by(group_var) |>
    summarize(max(max_var, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

What is causing the error?

Warmup

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by(group_var) |>
    summarize(max(max_var, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

What should we change so the code runs correctly?

Embracing

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(max({{ max_var }}, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)
# A tibble: 3 × 2
  species   `max(bill_depth_mm, na.rm = T)`
  <fct>                               <dbl>
1 Adelie                               21.5
2 Chinstrap                            20.8
3 Gentoo                               17.3

Why do we need embracing?

penguins |>
  filter(species == "Adelie")

This code contains two different types of variables:

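  • penguins is an env-variable (a programming variable that lives in an R environment, usually created with <-; not an operating-system environment variable)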
  • species is a data-variable (it makes sense only within the context of a data frame)

Env-variables

Env-variables are objects in the R environment that we can interact with directly. For example:

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Data-variables

Data-variables only exist in the context of a data frame:

# R doesn't know what 'species' is:
species
Error: object 'species' not found
# R DOES understand species in the context of penguins:
penguins$species
  [1] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
  [8] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [15] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [22] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [29] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [36] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [43] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [50] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [57] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [64] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [71] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [78] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [85] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [92] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
 [99] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[106] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[113] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[120] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[127] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[134] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[141] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   
[148] Adelie    Adelie    Adelie    Adelie    Adelie    Gentoo    Gentoo   
[155] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[162] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[169] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[176] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[183] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[190] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[197] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[204] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[211] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[218] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[225] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[232] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[239] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[246] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[253] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[260] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[267] Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo   
[274] Gentoo    Gentoo    Gentoo    Chinstrap Chinstrap Chinstrap Chinstrap
[281] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[288] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[295] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[302] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[309] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[316] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[323] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[330] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[337] Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap
[344] Chinstrap
Levels: Adelie Chinstrap Gentoo

Tidy evaluation

Many tidyverse functions use tidy evaluation, which lets us refer to data-variables directly, as if they were ordinary objects:

penguins |>
  filter(species == "Adelie")

Here filter knows to look for a column called species in the penguins data.
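To see what filter is doing on our behalf, it may help to compare it with base R subsetting, where the column has to be reached through the data frame explicitly. The sketch below uses a small made-up tibble called birds (a hypothetical stand-in for penguins) so that it runs with only dplyr loaded:

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  bill_depth_mm = c(18.7, 17.4, 15.0)
)

# Tidy evaluation: `species` is looked up inside `birds`
masked <- birds |>
  filter(species == "Adelie")

# Base R: the column must be reached through the data frame by name
explicit <- birds[birds$species == "Adelie", ]
```

Both versions keep the same two rows; the difference is only in how the column is found.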

Tidy evaluation

Of course, you will get an error if you try to reference a data-variable that doesn’t exist! For example, if we misspell the name:

penguins |>
  filter(speices == "Adelie")
Error in `filter()`:
ℹ In argument: `speices == "Adelie"`.
Caused by error:
! object 'speices' not found

Tidy evaluation

Of course, you will get an error if you try to reference a data-variable that doesn’t exist!

penguins |>
  group_by(group_var) |>
  summarize(max(max_var, na.rm=T))
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

The problem: group_var and max_var are not columns in the penguins data!

Tidy evaluation

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by(group_var) |>
    summarize(max(max_var, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

What we want R to run:

penguins |>
  group_by(species) |>
  summarize(max(bill_depth_mm, na.rm=T))
# A tibble: 3 × 2
  species   `max(bill_depth_mm, na.rm = T)`
  <fct>                               <dbl>
1 Adelie                               21.5
2 Chinstrap                            20.8
3 Gentoo                               17.3

Tidy evaluation

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by(group_var) |>
    summarize(max(max_var, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

What R is actually running:

penguins |>
  group_by(group_var) |>
  summarize(max(max_var, na.rm=T))
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.
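One way to convince yourself that the names are taken literally is to run the un-embraced function on data that happens to contain columns called group_var and max_var. The sketch below assumes a made-up tibble toy and a hypothetical function name grouped_max_literal; neither comes from the activity:

```r
library(dplyr)

# Same structure as the warmup function: no embracing
grouped_max_literal <- function(df, group_var, max_var) {
  df |>
    group_by(group_var) |>
    summarize(max_value = max(max_var, na.rm = TRUE))
}

# Hypothetical data with columns literally named group_var and max_var
toy <- tibble(
  species   = c("a", "a", "b"),
  bill      = c(1, 2, 3),
  group_var = c("x", "y", "y"),
  max_var   = c(10, 20, 30)
)

# We ask for species and bill, but those arguments are never used:
# the function groups by the column group_var and summarizes max_var
res <- grouped_max_literal(toy, species, bill)
```

No error occurs here, yet the result ignores the arguments we passed, which is arguably worse than the error on the penguins data.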

The solution: embracing

grouped_max <- function(df, group_var, max_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(max({{ max_var }}, na.rm=T))
}

grouped_max(penguins, species, bill_depth_mm)
# A tibble: 3 × 2
  species   `max(bill_depth_mm, na.rm = T)`
  <fct>                               <dbl>
1 Adelie                               21.5
2 Chinstrap                            20.8
3 Gentoo                               17.3

What R is running now:

penguins |>
  group_by(species) |>
  summarize(max(bill_depth_mm, na.rm=T))
# A tibble: 3 × 2
  species   `max(bill_depth_mm, na.rm = T)`
  <fct>                               <dbl>
1 Adelie                               21.5
2 Chinstrap                            20.8
3 Gentoo                               17.3
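The summary column above gets an awkward automatic name. As a possible extension (not part of the original function), dplyr also accepts an embraced argument inside a quoted name on the left of :=, which gives a readable column name. The sketch below assumes a made-up tibble birds and a hypothetical function name grouped_max_named:

```r
library(dplyr)

# Hypothetical variant of grouped_max that also names its output column
grouped_max_named <- function(df, group_var, max_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize("max_{{ max_var }}" := max({{ max_var }}, na.rm = TRUE))
}

# Hypothetical toy data standing in for penguins
birds <- tibble(
  species = c("Adelie", "Adelie", "Gentoo"),
  bill_depth_mm = c(18.7, NA, 15.0)
)

res <- grouped_max_named(birds, species, bill_depth_mm)
names(res)
```

The second column should come out as max_bill_depth_mm, built from whatever column name the caller supplies.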

Another example

Suppose we want to fit a simple linear regression model:

penguins |>
  lm(bill_length_mm ~ bill_depth_mm, data = _) |>
  coef()
  (Intercept) bill_depth_mm 
   55.0673698    -0.6498356 

Another example

penguins |>
  lm(bill_length_mm ~ bill_depth_mm, data = _) |>
  coef()
  (Intercept) bill_depth_mm 
   55.0673698    -0.6498356

lm_coef <- function(df, x, y) {
  df |>
    lm({{ y }} ~ {{ x }}, data = _) |>
    coef()
}

lm_coef(penguins, bill_depth_mm, bill_length_mm)

Do you think this code will work?

Another example

penguins |>
  lm(bill_length_mm ~ bill_depth_mm, data = _) |>
  coef()
  (Intercept) bill_depth_mm 
   55.0673698    -0.6498356

lm_coef <- function(df, x, y) {
  df |>
    lm({{ y }} ~ {{ x }}, data = _) |>
    coef()
}

lm_coef(penguins, bill_depth_mm, bill_length_mm)
Error: object 'bill_length_mm' not found

Why does this code fail?

Another example

penguins |>
  lm(bill_length_mm ~ bill_depth_mm, data = _) |>
  coef()
  (Intercept) bill_depth_mm 
   55.0673698    -0.6498356

lm_coef <- function(df, x, y) {
  df |>
    lm({{ y }} ~ {{ x }}, data = _) |>
    coef()
}

lm_coef(penguins, bill_depth_mm, bill_length_mm)
Error: object 'bill_length_mm' not found

Problem: The lm function does not support tidy evaluation! (To see if a function does support tidy evaluation, look for keywords like “data masking” or “tidy selection” in the documentation.)
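One way to picture the failure: outside of functions that implement tidy evaluation, {{ }} has no special meaning and is simply two nested code blocks. The base R sketch below uses a hypothetical helper plain_braces and a made-up name no_such_column (assumed not to exist in the workspace) to mimic what happens to bill_length_mm:

```r
# Outside of tidy evaluation, {{ }} is just two nested code blocks
nested <- {{ 1 + 2 }}

# Hypothetical helper that uses {{ }} with no tidyverse function around it
plain_braces <- function(y) {
  {{ y }}
}

# Passing an existing object works ...
ok <- plain_braces(pi)

# ... but a bare column name is looked up as an ordinary object and fails
msg <- tryCatch(
  plain_braces(no_such_column),
  error = function(e) conditionMessage(e)
)
msg
```

The message mirrors the "object not found" error from lm_coef: R tries to find the column name as an env-variable, and there is none.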

Fixing the issue

penguins |>
  lm(bill_length_mm ~ bill_depth_mm, data = _) |>
  coef()
  (Intercept) bill_depth_mm 
   55.0673698    -0.6498356

lm_coef <- function(df, x, y) {
  df |>
    lm({{ y }} ~ {{ x }}, data = _) |>
    coef()
}

lm_coef(penguins, bill_depth_mm, bill_length_mm)
Error: object 'bill_length_mm' not found

If lm doesn’t support tidy evaluation, what could we do differently?

Fixing the issue

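SLR slope: \(\widehat{\beta}_1 = \frac{\sum \limits_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum \limits_{i=1}^n (x_i - \overline{x})^2}\)

Dividing the numerator and denominator by \(n - 1\) shows that this is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\), which is what the cov() / var() code below computes.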

penguins |>
  lm(bill_length_mm ~ bill_depth_mm, data = _) |>
  coef()
  (Intercept) bill_depth_mm 
   55.0673698    -0.6498356

penguins |>
  summarize(slope = cov(bill_depth_mm, bill_length_mm, 
                        use="complete.obs")/
              var(bill_depth_mm, na.rm=T))
# A tibble: 1 × 1
   slope
   <dbl>
1 -0.650
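As a quick sanity check, the summation formula, the cov() / var() ratio, and lm should all give the same slope. The sketch below uses two short made-up vectors x and y (not the penguins data) and base R only:

```r
# Hypothetical small example vectors
x <- c(1, 2, 4, 7)
y <- c(3, 5, 4, 10)

# Slope from the summation formula
slope_formula <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Slope as sample covariance over sample variance
slope_cov_var <- cov(x, y) / var(x)

# Slope reported by lm
slope_lm <- unname(coef(lm(y ~ x))[2])

all.equal(slope_formula, slope_cov_var)
all.equal(slope_formula, slope_lm)
```

With missing values the agreement depends on all three calculations dropping the same rows, which is why the penguins code passes use="complete.obs" and na.rm=T.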

Fixing the issue

penguins |>
  summarize(slope = cov(bill_depth_mm, bill_length_mm, 
                        use="complete.obs")/
              var(bill_depth_mm, na.rm=T))

How would I turn this into a function?

slr_slope <- function(df, x, y) {
  
  
  
  
}

Fixing the issue

slr_slope <- function(df, x, y) {
  df |>
    summarize(slope = cov({{ x }}, {{ y }}, use="complete.obs")/
                var({{ x }}, na.rm=T))
}

slr_slope(penguins, bill_depth_mm, bill_length_mm)
# A tibble: 1 × 1
   slope
   <dbl>
1 -0.650

slr_slope(penguins, flipper_length_mm, bill_length_mm)
# A tibble: 1 × 1
  slope
  <dbl>
1 0.255
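slr_slope sidesteps lm entirely. A different route, not the one taken in these slides, keeps lm and passes the column names as strings, building the formula with base R's reformulate(). The sketch below uses a hypothetical helper lm_coef_chr and the built-in mtcars data so that it runs without extra packages:

```r
# Hypothetical helper: x and y are column names given as strings
lm_coef_chr <- function(df, x, y) {
  model_formula <- reformulate(x, response = y)  # e.g. mpg ~ wt
  coef(lm(model_formula, data = df))
}

coefs <- lm_coef_chr(mtcars, "wt", "mpg")
coefs
```

The trade-off is that callers must quote the column names, so this version does not feel like other tidyverse functions; which style is preferable depends on how the function will be used.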

Class activity

https://sta279-f25.github.io/class_activities/ca_10.html

  • Work with a neighbor on the class activity
  • At the end of class, submit your work as an HTML file on Canvas (one per group, list all your names)

For next time, read:

  • Chapter 26.3 in R for Data Science