Homework 2
Due: Friday, September 12, 10:00pm on Canvas
Instructions:
- Go to Canvas -> Assignments -> HW 2. Open the GitHub Classroom assignment link
- Follow the instructions to accept the assignment and clone the repository to your local computer
- The repository contains the file hw_02.qmd. Write your code and answers to the questions in this Quarto document. Commit and push to GitHub regularly.
- When you are finished, make sure to Render your Quarto document; this will produce a hw_02.md file which is easy to view on GitHub. Commit and push both the hw_02.qmd and hw_02.md files to GitHub
- Finally, request feedback on your assignment on the “Feedback” pull request on your HW 2 repository
Code guidelines:
- If a question requires code, and code is not provided, you will not receive full credit
- You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read
Resources:
- Chapter 4 and Chapter 6 in Modern Data Science with R
- Chapter 3 and Chapter 5 in R for Data Science (2nd edition)
Practice pivoting
Question 1
The code below creates a data frame called df_1:
df_1 <- data.frame(
  grp = c("A", "A", "B", "B"),
  sex = c("F", "M", "F", "M"),
  meanL = c(0.225, 0.47, 0.325, 0.547),
  sdL = c(0.106, 0.325, 0.106, 0.308),
  meanR = c(0.34, 0.57, 0.4, 0.647),
  sdR = c(0.0849, 0.325, 0.0707, 0.274)
)
df_1
##   grp sex meanL   sdL meanR    sdR
## 1   A   F 0.225 0.106 0.340 0.0849
## 2   A   M 0.470 0.325 0.570 0.3250
## 3   B   F 0.325 0.106 0.400 0.0707
## 4   B   M 0.547 0.308 0.647 0.2740
Using pivot_longer and/or pivot_wider,
reshape df_1 to produce the following output:
## # A tibble: 2 × 9
##   grp   F.meanL F.sdL F.meanR  F.sdR M.meanL M.sdL M.meanR M.sdR
##   <chr>   <dbl> <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 A       0.225 0.106    0.34 0.0849   0.47  0.325   0.57  0.325
## 2 B       0.325 0.106    0.4  0.0707   0.547 0.308   0.647 0.274
Remember that the ? operator provides documentation in R. For
example, running ?pivot_wider in the console gives helpful
information about the pivot_wider function.
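As a reminder of what the two pivoting functions do, here is a minimal sketch on a made-up data frame. The data frame scores and its columns are hypothetical and unrelated to df_1; the example only illustrates the basic names_to / values_to and names_from / values_from arguments described in the documentation.

```r
library(tidyr)

# Hypothetical toy data: one row per student, one column per exam
scores <- data.frame(
  student = c("s1", "s2"),
  exam1 = c(90, 75),
  exam2 = c(85, 80)
)

# Wide -> long: one row per (student, exam) pair
scores_long <- pivot_longer(
  scores,
  cols = c(exam1, exam2),
  names_to = "exam",
  values_to = "score"
)

# Long -> wide: back to one column per exam
scores_wide <- pivot_wider(
  scores_long,
  names_from = exam,
  values_from = score
)
```

Printing scores_long and scores_wide after each step is a quick way to see how the shape of the data changes.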
More practice pivoting
The code below creates a data frame called df_2:
df_2 <- data.frame(id = rep(c(1, 2, 3), 2),
                   group = rep(c("T", "C"), each = 3),
                   vals = c(4, 6, 8, 5, 6, 10))
df_2
##   id group vals
## 1  1     T    4
## 2  2     T    6
## 3  3     T    8
## 4  1     C    5
## 5  2     C    6
## 6  3     C   10
An analyst wants to calculate the pairwise differences between the Treatment (T) and Control (C) values for each individual in this dataset. They use the following code:
Treat <- filter(df_2, group == "T")
Control <- filter(df_2, group == "C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all
##   id group vals diff
## 1  1     T    4   -1
## 2  2     T    6    0
## 3  3     T    8   -2
Question 2
Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects.
Question 3
Provide an alternative approach to generate the diff
variable, using group_by and summarize to
produce the following output:
## # A tibble: 3 × 2
##      id  diff
##   <dbl> <dbl>
## 1     1    -1
## 2     2     0
## 3     3    -2
Question 4
Provide an alternative approach to generate the diff
variable that uses pivot_wider and mutate to
produce the following output:
## # A tibble: 3 × 4
##      id     T     C  diff
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     5    -1
## 2     2     6     6     0
## 3     3     8    10    -2
Baseball data
The Teams data in the Lahman package
contains information on professional baseball teams since 1871.
Question 5
Using the Teams data, create a plot of the number of
home runs scored (HR) and allowed (HRA) by the
Chicago Cubs (teamID CHN) in each season. Your plot should
look close to this:
You may use whichever R functions you like to create the plot, but the axes and legend should be labeled as in the plot above.
Ethics, data wrangling, and reproducibility
Because of the role that data plays in making decisions and informing experts, policy makers, and society as a whole, statisticians and data scientists have an ethical responsibility to make analyses clear, transparent, and understandable. An important component of a good analysis is that it should be reproducible – that is, given the same initial data, an outside observer can reproduce the steps of the analysis and arrive at the same results (the same summary statistics, the same plots, the same models, etc.).
You are already engaging in reproducible analyses by sharing code on GitHub and in your Quarto documents. When the code you used for an analysis is available to others, they are able to see exactly what you did and reproduce your work by re-running your code.
In this assignment, you will think more about ethics and reproducibility in the context of reproducible spreadsheet analysis.
Read the following material, then answer the questions below.
- Section 8.4.6 and section 8.5.6 in Modern Data Science with R
- This NPR article on former Cornell professor Brian Wansink
- This paper on data errors found in several of Brian Wansink’s research papers (focus on the Background, Granularity errors, and Inconsistent sample sizes within and between articles sections)
Question 6
Summarize the different problems with Brian Wansink’s research discussed in the NPR article.
Question 7
What are some similarities between the problems with Wansink’s research and the problems with the Rogoff and Reinhart paper discussed in MDSR?
Question 8
Give an example of a granularity error identified in the Statistical heartburn paper.
Question 9
What could Wansink and his colleagues have done differently to guard against the errors identified in the Statistical heartburn paper?
Learning something new
No class in computing or data science can include a comprehensive list of every function or package you will ever need to work with. Rather, it is important to learn how to search for solutions on your own.
Your friend is working with data from the General Social Survey, which contains information on respondents’ demographics, habits, and political affiliation. Using the R code below, they produce the following plot:
library(tidyverse)
gss_cat |>
  ggplot(aes(x = partyid,
             y = tvhours)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Party affiliation",
       y = "Hours of TV per day")
They like the idea of using a boxplot, but they’re not happy with how many different categories there are for party affiliation. Instead, they want to group some of the levels together and rename others:
- “No answer” and “Don’t know” should be “missing”
- “Other party” should be “other”
- “Strong republican”, “Not str republican” should be “republican”
- “Ind,near rep”, “Independent”, and “Ind,near dem” should all become “independent”
- “Not str democrat”, “Strong democrat” should be “democrat”
Your friend remembers there is an R package called
forcats which is useful for working with categorical
variables, but they don’t remember which function(s) to use here.
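When you know the package but not the function, one possible starting point is to list what the package exports and then read the help pages for promising names. The sketch below assumes forcats is installed; the variable name fct_funs is just an illustrative choice.

```r
library(forcats)

# List the exported functions whose names start with "fct_"
# (ls() on an attached package shows its exported objects)
fct_funs <- ls("package:forcats", pattern = "^fct_")
print(fct_funs)

# From here, ?name opens the help page for any function in the list,
# and help(package = "forcats") shows the package's help index.
```

Reading the one-line descriptions in the help index is often enough to narrow the list down to a few candidates worth trying.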
Question 10
Find a function in the forcats package which allows you
to combine the categories of partyid as described above,
then remake the boxplot. Your final boxplot should look like this: