Activity: Data wrangling across columns and functions
Student grades
You are a TA for a statistics course. The instructor of the course is interested in assessing how students performed on each assignment in the class.
You are provided with a CSV file (student_grades.csv), from Canvas, containing the grades for each student on each assignment in the course. Here are the instructions the professor gives you:
- There are 6 homeworks, 2 midterms, a final exam, and a project
- Each homework is scored out of 10. All other assignments are scored out of 100
- If a student did not submit an assignment, it is marked as NA in the CSV file. These missing assignments should receive a score of 0
The data can be imported into R with the following code:
library(tidyverse)
student_grades <- read_csv("https://sta279-f25.github.io/data/student_grades.csv")
Questions
Your answers to these questions should use functions such as starts_with, where, and across. You should not list all of the homework columns explicitly.
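As a minimal sketch of the difference, the example below uses a small made-up tibble (the name toy and its values are hypothetical, not the course data) to compare listing columns by hand with selecting them through a helper:

```r
library(dplyr)

# Hypothetical toy data, for illustration only (not the course data)
toy <- tibble(
  hw_1 = c(8, 10, 6),
  hw_2 = c(7, 9, 5),
  midterm_1 = c(70, 85, 60)
)

# Explicit listing: works, but must be edited whenever a homework is added
explicit <- toy |>
  summarize(across(c(hw_1, hw_2), mean))

# Selection helper: picks up every column whose name starts with "hw"
by_prefix <- toy |>
  summarize(across(starts_with("hw"), mean))

# where() selects by a property of the column rather than by its name
all_numeric <- toy |>
  summarize(across(where(is.numeric), mean))
```

Both of the first two calls give the same homework means here, but only the second keeps working unchanged if more homework columns are added.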
- What was the average exam score for each midterm?
Solution:
student_grades |>
  summarize(across(starts_with("midterm"), mean))
# A tibble: 1 × 2
  midterm_1 midterm_2
      <dbl>     <dbl>
1      71.8      71.2
- What fraction of students failed each midterm (a grade less than 60%)?
Solution:
One option:
# Fraction of scores below 60 (the mean of a logical vector is the share of TRUEs)
failure_rate <- function(x) {
  mean(x < 60)
}
student_grades |>
  summarize(across(starts_with("midterm"), failure_rate))
# A tibble: 1 × 2
  midterm_1 midterm_2
      <dbl>     <dbl>
1     0.167       0.2
Another option:
student_grades |>
  summarize(across(starts_with("midterm"), function(x) mean(x < 60)))
# A tibble: 1 × 2
  midterm_1 midterm_2
      <dbl>     <dbl>
1     0.167       0.2
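Both options rely on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic, so the mean of a comparison is a proportion. A small base R sketch with made-up scores (the vector below is hypothetical) shows the steps separately:

```r
# Hypothetical scores, for illustration only
scores <- c(55, 72, 90, 40)

failed <- scores < 60        # one TRUE/FALSE per student
n_failed <- sum(failed)      # TRUE counts as 1, so this is the number of failures
frac_failed <- mean(failed)  # failures divided by the number of students
```

If a midterm column contained NA values, mean(x < 60) would return NA unless na.rm = TRUE were passed to mean.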
- What was the average score for each homework, if you ignore missing submissions?
Solution:
One option:
# Mean of x after dropping missing values
mean_no_na <- function(x) {
  mean(x, na.rm = TRUE)
}
student_grades |>
  summarize(across(starts_with("hw"), mean_no_na))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  7.18  6.93  7.21  6.87  6.63  7.47
Another option:
student_grades |>
  summarize(across(starts_with("hw"), function(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  7.18  6.93  7.21  6.87  6.63  7.47
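R 4.1 and later (the same versions that support the |> pipe) also accept the shorthand \(x) in place of function(x), which keeps the anonymous-function version compact. A small sketch on made-up data (toy is hypothetical, not the course data):

```r
library(dplyr)

# Hypothetical toy data with one missing value per column
toy <- tibble(
  hw_1 = c(8, NA, 6),
  hw_2 = c(7, 9, NA)
)

# \(x) is shorthand for function(x); missing values are dropped before averaging
hw_means <- toy |>
  summarize(across(starts_with("hw"), \(x) mean(x, na.rm = TRUE)))
```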
- What was the average score for each homework, if you treat missing assignments as 0? Hint: One approach could use the replace_na function
Solution:
One option:
# Mean of x with missing values counted as 0
mean_missing <- function(x) {
  x[is.na(x)] <- 0
  mean(x)
}
student_grades |>
  summarize(across(starts_with("hw"), mean_missing))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   6.7  6.93  6.97  6.87  5.97  7.47
Another option:
student_grades |>
  summarize(across(starts_with("hw"),
                   function(x) mean(replace_na(x, 0))))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   6.7  6.93  6.97  6.87  5.97  7.47
Another option:
student_grades |>
  mutate(across(everything(), function(x) replace_na(x, 0))) |>
  summarize(across(starts_with("hw"), mean))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   6.7  6.93  6.97  6.87  5.97  7.47
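The last option replaces missing values in every column. If a data set also had non-numeric columns, one possible variation is to restrict the replacement with where(is.numeric). The sketch below assumes a made-up tibble with a character column (toy and its student column are hypothetical, not the course data):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a non-numeric column; not the course data
toy <- tibble(
  student = c("a", "b", "c"),
  hw_1 = c(8, NA, 6),
  hw_2 = c(NA, 9, 3)
)

# Replace NA with 0 in numeric columns only, then average the homework columns
hw_means_zero <- toy |>
  mutate(across(where(is.numeric), function(x) replace_na(x, 0))) |>
  summarize(across(starts_with("hw"), mean))
```

Here the student column is left untouched, and each homework mean is taken over all three students, with the missing submission counted as 0.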
- For each homework assignment, how many students failed to submit?
Solution:
One option:
# Number of missing values in x
count_na <- function(x) {
  sum(is.na(x))
}
student_grades |>
  summarize(across(starts_with("hw"), count_na))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <int> <int> <int> <int> <int> <int>
1     2     0     1     0     3     0
Another option:
student_grades |>
  summarize(across(starts_with("hw"), function(x) sum(is.na(x))))
# A tibble: 1 × 6
   hw_1  hw_2  hw_3  hw_4  hw_5  hw_6
  <int> <int> <int> <int> <int> <int>
1     2     0     1     0     3     0
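Several of the summaries above can also be computed in one call by passing across a named list of functions. This is a sketch on made-up data (toy and the names n_missing and mean_zero are hypothetical choices, not part of the activity):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  hw_1 = c(8, NA, 6),
  hw_2 = c(7, 9, 5)
)

# Each function in the list is applied to each selected column.
# By default the results are named <column>_<function>, e.g. hw_1_n_missing.
hw_summary <- toy |>
  summarize(across(
    starts_with("hw"),
    list(
      n_missing = function(x) sum(is.na(x)),
      mean_zero = function(x) mean(ifelse(is.na(x), 0, x))
    )
  ))
```

The result has one row and two columns per homework, which can be convenient when the missing-submission count and the zero-filled mean are needed side by side.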