library(tidyverse)
slr_slope <- function(df, x, y) {
  df |>
    summarize(slope = cov({{ x }}, {{ y }}, use = "complete.obs") /
                var({{ x }}, na.rm = TRUE))
}
Activity: Iteration
Instructions:
- Work with a neighbor to answer the following questions
- See below to get started
- When you are finished, render the file to HTML and submit the HTML file to Canvas (let me know if you encounter any problems)
Getting started
This class activity requires access to several CSV files. Rather than have you download them separately, it will be easiest if we use a GitHub classroom link.
You do not need to submit the activity on GitHub classroom, only the HTML file on Canvas. The GitHub Classroom setup is just to make it easy for you to get a repository with the data files in the right place.
- Go to Canvas -> Assignments -> Class Activity 11. Open the GitHub Classroom assignment link
- Follow the instructions to accept the assignment and clone the repository to your local computer
- The repository contains the file ca_11_template.qmd, and a folder of CSV files called intro_stats_grades. Write your code and answers to the questions in the Quarto document.
- When you are finished, make sure to Render your Quarto document to HTML, then submit the HTML to Canvas
Intro stats grades
The repository for this activity contains a folder called intro_stats_grades, which contains CSV files with gradebooks for several different sections of intro stats at a university. As the head TA for intro stats, it is your job to explore these intro stats grades for any important patterns. (You will work more with these grades on HW 4).
Examining the relationship between exam grades
In a previous class, we wrote the slr_slope function (defined at the top of this document) to calculate the slope for a simple linear regression model.
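As a quick check of what the function computes, the sketch below compares it with the slope reported by lm. The toy data frame is made up for illustration and contains no missing values; it is not one of the course gradebooks.

```r
library(dplyr)

# Same definition as the activity's slr_slope function
slr_slope <- function(df, x, y) {
  df |>
    summarize(slope = cov({{ x }}, {{ y }}, use = "complete.obs") /
                var({{ x }}, na.rm = TRUE))
}

# Hypothetical toy gradebook (invented values, no missing data)
toy <- tibble(midterm_1 = c(70, 80, 90, 85, 60),
              midterm_2 = c(65, 82, 88, 90, 58))

# Slope from the cov/var formula
slope_fn <- slr_slope(toy, midterm_1, midterm_2)$slope

# Slope from fitting the regression directly
slope_lm <- unname(coef(lm(midterm_2 ~ midterm_1, data = toy))["midterm_1"])

# Compare the two, allowing for floating-point rounding
all.equal(slope_fn, slope_lm)
```

The slope of a simple linear regression is the sample covariance of x and y divided by the sample variance of x, which is why the two calculations should agree on complete data.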
Let’s use this function to calculate the slope for the relationship between student grades on midterm 1 and midterm 2 in the first section of intro stats:
read_csv("intro_stats_grades/section_1.csv") |>
  slr_slope(midterm_1, midterm_2)
Iterating
Now we want to calculate the slope for each intro stats section, not just for section 1. One option is to copy and paste the code, and make the necessary changes:
read_csv("intro_stats_grades/section_1.csv") |>
  slr_slope(midterm_1, midterm_2)
read_csv("intro_stats_grades/section_2.csv") |>
  slr_slope(midterm_1, midterm_2)
read_csv("intro_stats_grades/section_3.csv") |>
  slr_slope(midterm_1, midterm_2)
# etc...
However, this is tedious and error-prone! It will be better if we can instead iterate through each of the sections, and apply the same function to the CSV file for each section.
Listing files
To begin, we want to get all of the file names for the CSV files that we need to read in. Instead of writing them all out by hand, we will use the handy list.files function in R.
- Run the following code in R:
grade_files <- list.files("intro_stats_grades", full.names = TRUE)
- The resulting object, grade_files, is a vector. What type of data does grade_files contain? (Hint: use the typeof function!)
- Without counting manually or looking at the file names yourself, how many intro stats sections are there?
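To see what the full.names argument changes, here is a small sketch on a throwaway folder. The folder, its files, and the values in them are invented for illustration; the pattern argument is an optional extra that the activity itself does not use.

```r
# Hypothetical temporary folder standing in for intro_stats_grades
demo_dir <- tempfile("demo_grades")
dir.create(demo_dir)

# Write two small made-up gradebooks and one unrelated file
toy <- data.frame(midterm_1 = c(80, 90), midterm_2 = c(85, 88))
write.csv(toy, file.path(demo_dir, "section_1.csv"), row.names = FALSE)
write.csv(toy, file.path(demo_dir, "section_2.csv"), row.names = FALSE)
writeLines("not a gradebook", file.path(demo_dir, "notes.txt"))

# Default: bare file names, without the folder they live in
short_names <- list.files(demo_dir)

# full.names = TRUE: each name is prefixed with the folder path,
# so the result can be passed straight to a function that reads files
full_names <- list.files(demo_dir, full.names = TRUE)

# pattern takes a regular expression; here it keeps only files ending in .csv
csv_only <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

short_names
full_names
csv_only
```

Without full.names = TRUE, a later call such as read_csv(short_names[1]) would look for the file in the working directory rather than in the folder, which is why the activity asks for full paths.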
Applying a function to each file: first attempt
Now we can use the slr_slope function on each CSV file, by accessing the file names from the grade_files vector:
read_csv(grade_files[1]) |>
  slr_slope(midterm_1, midterm_2)
read_csv(grade_files[2]) |>
  slr_slope(midterm_1, midterm_2)
# etc...
Hmmm… This isn’t much better than what we had before! We still have to manually index each entry in grade_files to read it into R.
purrr::map
Fortunately, there is a different way! The map function from the purrr package allows us to efficiently apply a function to each file in grade_files.
- Run the following code to read in each CSV file with purrr::map.
grade_tables <- map(grade_files, read_csv)
- What type of object is grade_tables?
We can access elements of a list with double square brackets. For example:
grade_tables[[1]]
- What is stored in the grade_tables list?
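The same pattern can be tried on something much smaller than a folder of CSV files. The sketch below applies sqrt to each element of a short numeric vector (values chosen only for illustration) and contrasts double and single square brackets.

```r
library(purrr)

# Apply sqrt to each element; map always returns a list,
# with one element per input
roots <- map(c(1, 4, 9), sqrt)

# Double brackets extract the element itself
second_value <- roots[[2]]

# Single brackets return a shorter list that still wraps the element
second_as_list <- roots[2]

second_value
second_as_list
```

Double brackets are usually what you want when you plan to work with the stored object directly, as in grade_tables[[1]].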
Calculating the slopes
Now we can calculate the slopes for each grade data frame:
grade_tables[[1]] |>
  slr_slope(midterm_1, midterm_2)
grade_tables[[2]] |>
  slr_slope(midterm_1, midterm_2)
# etc...
But this is still tedious! Instead, let’s use map again. As a first attempt, we might try the following:
exam_slopes <- map(grade_tables, slr_slope)
- What happens when we run this code? Why?
To fix the issue, we need to specify which variables we want to calculate the slope for. It is simplest to do this with an anonymous function:
exam_slopes <- map(grade_tables,
                   function(df) slr_slope(df, midterm_1, midterm_2))
- Run the code. What type of object is exam_slopes? What does it contain?
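Anonymous functions can also be written with the backslash shorthand available in R 4.1 and later. The sketch below shows both spellings side by side on two made-up data frames (toy_tables is a hypothetical stand-in for grade_tables); it is one possible way to write the call, not the only one.

```r
library(dplyr)
library(purrr)

# Same definition as the activity's first version of slr_slope
slr_slope <- function(df, x, y) {
  df |>
    summarize(slope = cov({{ x }}, {{ y }}, use = "complete.obs") /
                var({{ x }}, na.rm = TRUE))
}

# Hypothetical stand-in for grade_tables: a list of two small data frames
toy_tables <- list(
  tibble(midterm_1 = c(1, 2, 3, 4), midterm_2 = c(2, 4, 6, 8)),
  tibble(midterm_1 = c(1, 2, 3, 4), midterm_2 = c(4, 3, 2, 1))
)

# Full spelling of the anonymous function
slopes_long <- map(toy_tables,
                   function(df) slr_slope(df, midterm_1, midterm_2))

# Shorthand spelling: \(df) means the same thing as function(df)
slopes_short <- map(toy_tables,
                    \(df) slr_slope(df, midterm_1, midterm_2))

slopes_long[[1]]
slopes_short[[1]]
```

In both versions, map passes each data frame in as df, and the anonymous function supplies the column names that slr_slope needs.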
Putting everything together
Along the way, we saved several intermediate objects (grade_files, grade_tables) that we probably don’t need to keep. Fortunately, we can use the handy pipe |> and do everything in one nice chain!
- Run the following code:
exam_slopes <- list.files("intro_stats_grades", full.names = TRUE) |>
  map(read_csv) |>
  map(function(df) slr_slope(df, midterm_1, midterm_2))
Cleaning up
Currently, exam_slopes is a list of data frames (each slope is stored as a 1x1 data frame from the summarize function in slr_slope). In R, lists are often a bit harder to work with than vectors, and it would be nice to avoid storing the slopes in data frames.
To make these changes, we can adapt the slr_slope function so that it returns a number instead of a data frame. Then, we can use a variant of map called map_dbl, which returns a numeric vector instead of a list:
slr_slope <- function(df, x, y) {
  df |>
    summarize(slope = cov({{ x }}, {{ y }}, use = "complete.obs") /
                var({{ x }}, na.rm = TRUE)) |>
    pull(slope)
}
exam_slopes <- list.files("intro_stats_grades", full.names = TRUE) |>
  map(read_csv) |>
  map_dbl(function(df) slr_slope(df, midterm_1, midterm_2))
exam_slopes
- Using the typeof function, confirm that exam_slopes is now a vector containing numeric values (i.e., “doubles”).
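One optional refinement, not required for the activity: a plain numeric vector does not record which slope belongs to which section. A possible sketch is to label the file paths with purrr's set_names before mapping, since map and map_dbl carry names through to their output. The folder and data below are invented so the example runs on its own.

```r
library(dplyr)
library(purrr)
library(readr)

# Same definition as the activity's updated slr_slope (returns a number)
slr_slope <- function(df, x, y) {
  df |>
    summarize(slope = cov({{ x }}, {{ y }}, use = "complete.obs") /
                var({{ x }}, na.rm = TRUE)) |>
    pull(slope)
}

# Hypothetical temporary folder standing in for intro_stats_grades
demo_dir <- tempfile("demo_grades")
dir.create(demo_dir)
write_csv(tibble(midterm_1 = c(1, 2, 3, 4), midterm_2 = c(2, 4, 6, 8)),
          file.path(demo_dir, "section_1.csv"))
write_csv(tibble(midterm_1 = c(1, 2, 3, 4), midterm_2 = c(4, 3, 2, 1)),
          file.path(demo_dir, "section_2.csv"))

named_slopes <- list.files(demo_dir, full.names = TRUE) |>
  set_names(basename) |>                            # name each path by its file name
  map(\(f) read_csv(f, show_col_types = FALSE)) |>  # read quietly; names are kept
  map_dbl(\(df) slr_slope(df, midterm_1, midterm_2))

named_slopes
```

The result is still a numeric vector, so it works anywhere exam_slopes does, but each entry now prints with the file it came from.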