Homework 4
Due: Friday, September 26, 10:00pm on Canvas
Instructions:
- Go to Canvas -> Assignments -> HW 4. Open the GitHub Classroom assignment link
- Follow the instructions to accept the assignment and clone the repository to your local computer
- The repository contains the file hw_04.qmd, and several CSV files containing data for the different questions. Write your code and answers to the questions in the Quarto document. Commit and push to GitHub regularly.
- When you are finished, make sure to Render your Quarto document; this will produce a hw_04.md file, which is easy to view on GitHub. Commit and push both the hw_04.qmd and hw_04.md files to GitHub.
- The final question in the assignment will also have you create 10 CSV files of course grades. Make sure to commit and push those CSV files too!
- Finally, request feedback on your assignment on the “Feedback” pull request on your HW 4 repository
Important: Make sure to include both the .qmd and .md files, and the CSV files from the last question, when you submit to receive full credit.
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
Resources:
- Chapter 25 (functions) in R for Data Science (2nd edition)
- Chapter 26 (iteration) in R for Data Science (2nd edition)
Practice writing functions
In the first part of this assignment, you will practice writing some short functions to calculate two quantities which appear often in statistics and data science. These will be functions which take in a vector, like in Section 25.2 of R for Data Science (2nd edition).
The goal of these two functions will be to assess the performance of a prediction model. When we fit a model (for example, a linear or logistic regression model), we are trying to describe the relation between the explanatory variable(s) and the response variable. The goal is that our model predictions should be “close” to the true, observed values of the response variable.
Formally, suppose that you have data with \(n\) observations. Let \(y_1,...,y_n\) denote the observed values of the response variable for these \(n\) observations, and let \(\widehat{y}_1,...,\widehat{y}_n\) denote the predicted values from some model. We want the \(\widehat{y}\) values to be “close” to the true \(y\) values.
How do we formalize this idea of “close”?
Mean squared error (MSE)
If the response \(y\) is a continuous variable, it is most common to use the mean squared error (MSE), which measures the average squared distance between the predictions and the true response:
\[MSE = \frac{1}{n} \sum \limits_{i=1}^n (y_i - \widehat{y}_i)^2\]
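For example (illustrative numbers, not one of the graded test cases): if \(y = (1, 2, 3)\) and \(\widehat{y} = (1, 2, 4)\), then

\[MSE = \frac{1}{3}\left[(1-1)^2 + (2-2)^2 + (3-4)^2\right] = \frac{1}{3} \approx 0.33\]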
For the first question, you will write a function to compute the MSE. When writing your function, it will help to notice that many arithmetic operations in R are vectorized – that is, you can simultaneously apply an operation to every element of a vector. For example:
## [1] 1 2 3 4 5
## [1] 2 3 4 5 6
## [1] 2 4 6 8 10
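The input code for the outputs above was not preserved in this rendering; a minimal reconstruction consistent with those outputs is:

```r
x <- 1:5
x        ## [1] 1 2 3 4 5
x + 1    ## [1] 2 3 4 5 6
x * 2    ## [1] 2 4 6 8 10
```

Note that `x + 1` and `x * 2` operate on every element of `x` at once, with no loop.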
Question 1
Write a function called my_mse to compute the MSE, which
satisfies the following requirements:
Inputs:
- yhat: vector of predictions
- y: vector of true responses
Output: the MSE, as described in the formula above
Other requirements:
- Your function should leverage R’s vectorized operations
- Do not use any existing MSE implementations
- You may assume that the two input vectors are numeric and of the same length
Examples:
## [1] 0
## [1] 0.3333333
## [1] 1
Binary cross-entropy (BCE)
MSE is a common choice when the response is continuous. But what if the response is binary (as in logistic regression)?
Suppose that our true values \(y_1,...,y_n\) are binary, taking values either 0 or 1. And suppose that our predictions \(\widehat{y}_1,...,\widehat{y}_n\) are all predicted probabilities, taking values between 0 and 1. Then, the binary cross-entropy (BCE) is defined as
\[BCE = \frac{1}{n} \sum \limits_{i=1}^n [y_i \log(\widehat{y}_i) + (1 - y_i) \log(1 - \widehat{y}_i) ]\]
(Here log is the natural log).
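For example (illustrative numbers, not one of the graded test cases): if \(y = (0, 1)\) and \(\widehat{y} = (0.5, 0.5)\), then

\[BCE = \frac{1}{2}\left[\log(1 - 0.5) + \log(0.5)\right] = \log(0.5) \approx -0.693\]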
Question 2
Write a function called my_bce to compute the BCE, which
satisfies the following requirements:
Inputs:
- yhat: vector of predictions
- y: vector of true responses
Output: the BCE, as described in the formula above
Other requirements:
- Your function should leverage R’s vectorized operations
- Do not use any existing BCE implementations
- You may assume that the two input vectors are of the same length
- You may assume that the values of y are all 0s and 1s
- You may assume that the values of yhat are all between 0 and 1, with no values exactly equal to 0 or 1
Examples:
## [1] -0.6931472
## [1] -0.01343446
Comparing model performance
Model performance metrics such as MSE and BCE can be used to compare
different models. The repository for this assignment contains a CSV
file, pred_data.csv, with the following columns:
- truth: the true response values \(y\) for 200 observations
- pred_knn: the predictions for a k-nearest neighbors (kNN) model
- pred_random_forest: the predictions for a random forest model
- pred_lasso: the predictions for a lasso model
- pred_linear: the predictions for a linear regression model (no penalty term)
Question 3
Using your MSE function from question 1, and the across
function from dplyr, calculate the MSE for each model (kNN,
random forest, lasso, linear regression). Which model has the lowest
MSE?
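As a reminder of the general across() pattern (a toy sketch using mean() as the summary function, not the answer to this question):

```r
library(dplyr)

toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# Apply one summary function to every column matched by the selector;
# the result is a 1-row data frame with one column per input column
toy |>
  summarise(across(everything(), mean))
```

For this question, you would select the prediction columns and supply your own function instead of mean.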
Practice with iteration
In HW 3, you wrote code to compute students’ overall course grades, given their scores on homeworks, exams, and projects. In this assignment, you will use that code to compute course grades across multiple classes and save the results as CSV files.
The repository for this assignment contains a folder called
intro_stats_grades, which contains CSV files with
gradebooks for 10 different sections of intro stats at a university. As
the head TA for intro stats, your job is to calculate the overall grade
for students in each of the 10 sections.
Here is information on how the grades are calculated:
- Different sections have different numbers of hw assignments. Some have 5, some have 6, some have 7, etc.
- Each section has 2 midterms, a final exam, and a project
- Each homework is scored out of 10. All other assignments are scored out of 100
- If a student did not submit an assignment, it is marked as NA in the CSV file. These missing assignments should receive a score of 0
- For all sections: homework is worth 15% of the course grade; midterm 1 is worth 15%; midterm 2 is worth 15%; the final exam is worth 25%; and the project is worth 30%
- The possible letter grades at the university are A, B, C, D, and F.
There are no plus/minus options. Grades are assigned on a standard
scale:
- < 60 is F
- 60 - 69 is D
- 70 - 79 is C
- 80 - 89 is B
- 90+ is A
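One possible way to encode this scale in R (an illustrative sketch, not required code) is with cut(), using left-closed intervals so that, e.g., exactly 80 is a B:

```r
# Illustrative sketch: map percentages to letter grades with cut().
# The breaks and labels below are chosen to match the scale stated above.
percent <- c(59, 65, 72, 85, 93)
cut(percent,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("F", "D", "C", "B", "A"),
    right = FALSE)
## [1] F D C B A
## Levels: F D C B A
```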
Your task is to calculate the overall course grade, reporting both
the percentage and the letter grade. For each section, you will create a
CSV file containing these overall grades, in the
intro_stats_grades folder.
Question 4
Adapting your code from HW 3, write a function called calc_overall_grade to calculate course grades for a section, which satisfies the following requirements:
- Inputs:
- df: a data frame containing the gradebook for the section (e.g., the contents of a CSV file like intro_stats_grades/section_1.csv)
- Output:
- a data frame containing three columns:
  - student_id: the student’s id number
  - class_grade: the student’s overall percentage in the course, calculated as described above
  - letter_grade: the student’s corresponding letter grade (A, B, C, etc.)
Test your function on the data from a couple of CSV files in the intro_stats_grades folder.
Question 5
Now use your function from question 4 to calculate the overall grades
for students in each section of the course. Save the results as 10 CSV
files (one per section) in the intro_stats_grades folder,
with names course_grades_section_1.csv,
course_grades_section_2.csv, etc.
- Use functions from the purrr package, as discussed in Chapter 26 of R for Data Science (2nd edition). You may not use any loops (for loops, while loops, etc.)
- Do not list any files or datasets explicitly. For example, you should not manually write out all the names of the files in the intro_stats_grades folder. Instead, use tools like list.files
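As a generic illustration of the purrr iteration style (a toy sketch unrelated to the grading files):

```r
library(purrr)

# Apply a function to each element of a list, returning a numeric vector
map_dbl(list(1:3, 4:6), mean)
## [1] 2 5
```

The same pattern extends to iterating over a vector of file paths, applying a function to each one.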