Homework 4

Due: Friday, September 26, 10:00pm on Canvas

Instructions:

  1. Go to Canvas -> Assignments -> HW 4, and open the GitHub Classroom assignment link.
  2. Follow the instructions to accept the assignment, then clone the repository to your local computer.
  3. The repository contains the file hw_04.qmd, along with several CSV files containing data for the different questions. Write your code and answers to the questions in the Quarto document, and commit and push to GitHub regularly.
  4. When you are finished, make sure to Render your Quarto document; this will produce a hw_04.md file, which is easy to view on GitHub. Commit and push both the hw_04.qmd and hw_04.md files to GitHub.
  5. The final question in the assignment will also have you create 10 CSV files of course grades. Make sure to commit and push those CSV files too!
  6. Finally, request feedback on your assignment via the “Feedback” pull request in your HW 4 repository.

Important: To receive full credit, make sure your submission includes both the .qmd and .md files, as well as the CSV files from the last question.

Code guidelines:

  • If a question requires code and none is provided, you will not receive full credit.
  • You will be graded on the quality of your code: in addition to being correct, your code should be easy to read.

Resources:

  • Chapter 25 (functions) in R for Data Science (2nd edition)
  • Chapter 26 (iteration) in R for Data Science (2nd edition)

Practice writing functions

In the first part of this assignment, you will practice writing short functions to calculate two quantities that appear often in statistics and data science. These functions will take a vector as input, like the examples in Section 25.2 of R for Data Science (2nd edition).

The goal of these two functions is to assess the performance of a prediction model. When we fit a model (for example, a linear or logistic regression model), we are trying to describe the relationship between the explanatory variable(s) and the response variable. Ideally, our model's predictions should be “close” to the true, observed values of the response variable.

Formally, suppose that you have data with \(n\) observations. Let \(y_1,...,y_n\) denote the observed values of the response variable for these \(n\) observations, and let \(\widehat{y}_1,...,\widehat{y}_n\) denote the predicted values from some model. We want the \(\widehat{y}\) values to be “close” to the true \(y\) values.

How do we formalize this idea of “close”?

Mean squared error (MSE)

If the response \(y\) is a continuous variable, it is most common to use the mean squared error (MSE), which measures the average squared distance between the predictions and the true response:

\[MSE = \frac{1}{n} \sum \limits_{i=1}^n (y_i - \widehat{y}_i)^2\]

For the first question, you will write a function to compute the MSE. When writing your function, it will help to notice that many arithmetic operations in R are vectorized – that is, you can simultaneously apply an operation to every element of a vector. For example:

x <- 1:5
x
## [1] 1 2 3 4 5
x + 1
## [1] 2 3 4 5 6
2 * x
## [1]  2  4  6  8 10
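Vectorized operations also work element-wise between two vectors of the same length, and they pair naturally with summary functions like sum and mean. A quick sketch (the vectors y and z below are just for illustration):

```r
y <- c(1, 2, 3)
z <- c(2, 2, 2)

y - z            # element-wise difference
## [1] -1  0  1
(y - z)^2        # element-wise squaring
## [1] 1 0 1
mean((y - z)^2)  # average of the squared differences
## [1] 0.6666667
```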

Question 1

Write a function called my_mse to compute the MSE, which satisfies the following requirements:

  • Inputs:

    • yhat: vector of predictions
    • y: vector of true responses
  • Output: the MSE, as described in the formula above

Other requirements:

  • Your function should leverage R’s vectorized operations
  • Do not use any existing MSE implementations
  • You may assume that the two input vectors are numeric and of the same length

Examples:

my_mse(c(1, 2, 3), c(1, 2, 3))
## [1] 0
my_mse(c(1, 2, 2), c(1, 2, 3))
## [1] 0.3333333
my_mse(c(0, 1, 2), c(1, 2, 3))
## [1] 1

Binary cross-entropy (BCE)

MSE is a common choice when the response is continuous. But what if the response is binary (as in logistic regression)?

Suppose that our true values \(y_1,...,y_n\) are binary, taking values either 0 or 1. And suppose that our predictions \(\widehat{y}_1,...,\widehat{y}_n\) are all predicted probabilities, taking values between 0 and 1. Then, the binary cross-entropy (BCE) is defined as

\[BCE = \frac{1}{n} \sum \limits_{i=1}^n [y_i \log(\widehat{y}_i) + (1 - y_i) \log(1 - \widehat{y}_i) ]\]

(Here, log denotes the natural logarithm.)
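Conveniently, R's log function uses the natural logarithm (base e) by default:

```r
# log() in R is the natural log unless you pass a base argument
log(exp(1))
## [1] 1
log(0.5)
## [1] -0.6931472
```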

Question 2

Write a function called my_bce to compute the BCE, which satisfies the following requirements:

  • Inputs:

    • yhat: vector of predictions
    • y: vector of true responses
  • Output: the BCE, as described in the formula above

Other requirements:

  • Your function should leverage R’s vectorized operations
  • Do not use any existing BCE implementations
  • You may assume that the two input vectors are of the same length
  • You may assume that the values of y are all 0s and 1s
  • You may assume that the values of yhat are all between 0 and 1, with no values exactly equal to 0 or 1

Examples:

my_bce(c(0.5, 0.5, 0.5), c(1, 0, 0))
## [1] -0.6931472
my_bce(c(0.99, 0.01, 0.02), c(1, 0, 0))
## [1] -0.01343446

Comparing model performance

Model performance metrics such as MSE and BCE can be used to compare different models. The repository for this assignment contains a CSV file, pred_data.csv, with the following columns:

  • truth: the true response values \(y\) for 200 observations
  • pred_knn: the predictions for a k-nearest neighbors (kNN) model
  • pred_random_forest: the predictions for a random forest model
  • pred_lasso: the predictions for a lasso model
  • pred_linear: the predictions for a linear regression model (no penalty term)

Question 3

Using your MSE function from Question 1, and the across function from dplyr, calculate the MSE for each model (kNN, random forest, lasso, linear regression). Which model has the lowest MSE?
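To see the general pattern, across() applies the same function to several columns at once. The toy data frame and its pred_a / pred_b columns below are invented for illustration; in your answer you would apply my_mse to the real pred_ columns of pred_data.csv:

```r
library(dplyr)

# A toy data frame standing in for pred_data.csv (column names made up)
toy <- tibble(
  truth  = c(1, 2, 3),
  pred_a = c(1, 2, 2),
  pred_b = c(0, 1, 2)
)

# across() applies one function to every column matching the selection;
# here a simple anonymous function stands in for my_mse
toy |>
  summarise(across(starts_with("pred_"), \(p) mean((truth - p)^2)))
```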

Practice with iteration

In HW 3, you wrote code to compute students’ overall course grades, given their scores on homeworks, exams, and projects. In this assignment, you will use that code to compute course grades across multiple classes and save the results as CSV files.

The repository for this assignment contains a folder called intro_stats_grades, which contains CSV files with gradebooks for 10 different sections of intro stats at a university. As the head TA for intro stats, your job is to calculate the overall grade for students in each of the 10 sections.

Here is information on how the grades are calculated:

  • Different sections have different numbers of homework assignments: some have 5, some have 6, some have 7, etc.
  • Each section has 2 midterms, a final exam, and a project
  • Each homework is scored out of 10. All other assignments are scored out of 100
  • If a student did not submit an assignment, it is marked as NA in the CSV file. These missing assignments should receive a score of 0
  • For all sections: Homework is worth 15% of the course grade; midterm 1 is worth 15%; midterm 2 is worth 15%; the final exam is worth 25%; and the project is worth 30%
  • The possible letter grades at the university are A, B, C, D, and F. There are no plus/minus options. Grades are assigned on a standard scale:
    • < 60 is F
    • 60 to < 70 is D
    • 70 to < 80 is C
    • 80 to < 90 is B
    • 90+ is A
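One possible way to map percentages to letter grades is with cut(), which bins numeric values into labeled intervals. This is only a sketch of the idea, not a required approach:

```r
# Toy percentages for illustration; right = FALSE makes each bin
# include its lower endpoint, e.g. [80, 90) is a B
pct <- c(95, 82, 71, 64, 38)
cut(pct,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("F", "D", "C", "B", "A"),
    right = FALSE)
## [1] A B C D F
## Levels: F D C B A
```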

Your task is to calculate the overall course grade, reporting both the percentage and the letter grade. For each section, you will create a CSV file containing these overall grades, in the intro_stats_grades folder.

Question 4

Adapting your code from HW 3, write a function called calc_overall_grade that calculates the course grades for a section, satisfying the following requirements:

  • Inputs:
    • df: a data frame containing the gradebook for the section (e.g., the contents of CSV files like intro_stats_grades/section_1.csv)
  • Output:
    • a data frame containing three columns:
      • student_id: the student’s id number
      • class_grade: the student’s overall percentage in the course, calculated as described above
      • letter_grade: the student’s corresponding letter grade (A, B, C, etc.)

Test your function on the data from a couple of the CSV files in the intro_stats_grades folder.

Question 5

Now use your function from Question 4 to calculate the overall grades for students in each section of the course. Save the results as 10 CSV files (one per section) in the intro_stats_grades folder, with names course_grades_section_1.csv, course_grades_section_2.csv, etc.

  • Use functions from the purrr package, as discussed in Chapter 26 of R for Data Science (2nd edition). You may not use any loops (for loops, while loops, etc.)
  • Do not list any files or datasets explicitly. For example, you should not manually write out all the names of the files in the intro_stats_grades folder. Instead, use tools like list.files.
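To illustrate the files-in, files-out pattern (with invented file names and toy data, not the assignment's actual files), here is a sketch that discovers inputs with list.files and processes them with purrr::walk2, using no loops:

```r
library(purrr)
library(readr)

# Set up a temporary folder with two toy CSV files for the demo
dir <- tempdir()
write_csv(data.frame(x = 1:3), file.path(dir, "section_1.csv"))
write_csv(data.frame(x = 4:6), file.path(dir, "section_2.csv"))

# Find the input files programmatically instead of typing their names
paths <- list.files(dir, pattern = "^section_.*\\.csv$", full.names = TRUE)

# Build matching output paths, then iterate over both vectors in
# parallel with walk2(): read each file, transform, write the result
out_paths <- file.path(dir, paste0("out_", basename(paths)))
walk2(paths, out_paths, \(p, out) {
  df <- read_csv(p, show_col_types = FALSE)
  write_csv(df, out)  # replace this with your real transformation
})

list.files(dir, pattern = "^out_")
## [1] "out_section_1.csv" "out_section_2.csv"
```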