Homework 2

Due: Friday, September 12, 10:00pm on Canvas

Instructions:

  1. Go to Canvas -> Assignments -> HW 2. Open the GitHub Classroom assignment link.
  2. Follow the instructions to accept the assignment and clone the repository to your local computer.
  3. The repository contains the file hw_02.qmd. Write your code and answers to the questions in this Quarto document. Commit and push to GitHub regularly.
  4. When you are finished, make sure to Render your Quarto document; this will produce a hw_02.md file which is easy to view on GitHub. Commit and push both the hw_02.qmd and hw_02.md files to GitHub.
  5. Finally, request feedback on your assignment on the “Feedback” pull request on your HW 2 repository.

Code guidelines:

  • If a question requires code, and code is not provided, you will not receive full credit
  • You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read

Resources:

Practice pivoting

Question 1

The code below creates a data frame called df_1:

df_1 <- data.frame(
  grp = c("A", "A", "B", "B"),
  sex = c("F", "M", "F", "M"),
  meanL = c(0.225, 0.47, 0.325, 0.547),
  sdL = c(0.106, 0.325, 0.106, 0.308),
  meanR = c(0.34, 0.57, 0.4, 0.647),
  sdR = c(0.0849, 0.325, 0.0707, 0.274)
)

df_1
##   grp sex meanL   sdL meanR    sdR
## 1   A   F 0.225 0.106 0.340 0.0849
## 2   A   M 0.470 0.325 0.570 0.3250
## 3   B   F 0.325 0.106 0.400 0.0707
## 4   B   M 0.547 0.308 0.647 0.2740

Using pivot_longer and/or pivot_wider, reshape df_1 to produce the following output:

## # A tibble: 2 × 9
##   grp   F.meanL F.sdL F.meanR  F.sdR M.meanL M.sdL M.meanR M.sdR
##   <chr>   <dbl> <dbl>   <dbl>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 A       0.225 0.106    0.34 0.0849   0.47  0.325   0.57  0.325
## 2 B       0.325 0.106    0.4  0.0707   0.547 0.308   0.647 0.274

Remember that the ? operator provides documentation in R. For example, running ?pivot_wider in the console opens the help page for the pivot_wider function, including its arguments and examples.
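As a reminder of how the two pivoting functions relate, here is a minimal sketch on a toy data frame. The data frame toy and its columns (id, key, value) are made up for illustration and are not part of the assignment; the sketch assumes the tidyr package is installed. It only shows the basic arguments, so the help pages are still the place to look for the options needed in Question 1.

```r
library(tidyr)

# Hypothetical toy data (not from the assignment): one row per id/key pair
toy <- data.frame(
  id = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

# Wide format: one row per id, one column per distinct value of key
wide <- pivot_wider(toy, names_from = key, values_from = value)
wide

# Long format again: columns a and b are stacked back into key/value pairs
long <- pivot_longer(wide, cols = c(a, b), names_to = "key", values_to = "value")
long
```

Comparing wide and long with the original toy data frame is a quick way to check that you understand what each argument does before applying the functions to df_1.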

More practice pivoting

The code below creates a data frame called df_2:

df_2 <- data.frame(
  id = rep(c(1, 2, 3), 2),
  group = rep(c("T", "C"), each = 3),
  vals = c(4, 6, 8, 5, 6, 10)
)

df_2
##   id group vals
## 1  1     T    4
## 2  2     T    6
## 3  3     T    8
## 4  1     C    5
## 5  2     C    6
## 6  3     C   10

An analyst wants to calculate the pairwise differences between the Treatment (T) and Control (C) values for each individual in this dataset. They use the following code:

Treat <- filter(df_2, group == "T")
Control <- filter(df_2, group == "C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all
##   id group vals diff
## 1  1     T    4   -1
## 2  2     T    6    0
## 3  3     T    8   -2

Question 2

Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects.

Question 3

Provide an alternative approach to generate the diff variable, using group_by and summarize to produce the following output:

## # A tibble: 3 × 2
##      id  diff
##   <dbl> <dbl>
## 1     1    -1
## 2     2     0
## 3     3    -2

Question 4

Provide an alternative approach to generate the diff variable that uses pivot_wider and mutate to produce the following output:

## # A tibble: 3 × 4
##      id     T     C  diff
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     4     5    -1
## 2     2     6     6     0
## 3     3     8    10    -2

Baseball data

The Teams data in the Lahman package contains information on professional baseball teams since 1871.
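Before plotting, it can help to look at the columns you will need. The following is a minimal sketch, assuming the Lahman package is installed; it only inspects the data and does not do any of the wrangling required for Question 5.

```r
library(Lahman)

# Teams is a data frame that becomes available once Lahman is loaded.
# Look at the structure of the columns relevant to this section.
str(Teams[, c("yearID", "teamID", "HR", "HRA")])

# Confirm that the Chicago Cubs team ID appears in the data
"CHN" %in% Teams$teamID
```

Running ?Teams in the console shows the full description of every column.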

Question 5

Using the Teams data, create a plot of the number of home runs scored (HR) and allowed (HRA) by the Chicago Cubs (teamID CHN) in each season. Your plot should look close to this:

You may use whichever R functions you like to create the plot, but the axes and legend should be labeled as in the plot above.

Ethics, data wrangling, and reproducibility

Because of the role that data plays in making decisions and informing experts, policy makers, and society as a whole, statisticians and data scientists have an ethical responsibility to make analyses clear, transparent, and understandable. An important component of a good analysis is that it should be reproducible – that is, given the same initial data, an outside observer can reproduce the steps of the analysis and arrive at the same results (the same summary statistics, the same plots, the same models, etc.).

You are already engaging in reproducible analyses by sharing code on GitHub and in your Quarto documents. When the code you used for an analysis is available to others, they are able to see exactly what you did and reproduce your work by re-running your code.

In this assignment, you will think more about ethics and reproducibility in the context of reproducible spreadsheet analysis.

Read the following material, then answer the questions below.

  • Sections 8.4.6 and 8.5.6 in Modern Data Science with R (MDSR)
  • This NPR article on former Cornell professor Brian Wansink
  • This paper on data errors found in several of Brian Wansink’s research papers (focus on the Background, Granularity errors, and Inconsistent sample sizes within and between articles sections)

Question 6

Summarize the different problems with Brian Wansink’s research discussed in the NPR article.

Question 7

What are some similarities between the problems with Wansink’s research and the problems with the Rogoff and Reinhart paper discussed in MDSR?

Question 8

Give an example of a granularity error identified in the Statistical heartburn paper.

Question 9

What could Wansink and his colleagues have done differently to guard against the errors identified in the Statistical heartburn paper?

Learning something new

No class in computing or data science can include a comprehensive list of every function or package you will ever need to work with. Rather, it is important to learn how to search for solutions on your own.

Your friend is working with data from the General Social Survey, which contains information on respondents’ demographics, habits, and political affiliation. Using the R code below, they produce the following plot:

library(tidyverse)

gss_cat |>
  ggplot(aes(x = partyid,
             y = tvhours)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  labs(x = "Party affiliation",
       y = "Hours of TV per day")

They like the idea of using a boxplot, but they’re not happy with how many different categories there are for party affiliation. Instead, they want to group some of the levels together and rename others:

  • “No answer” and “Don’t know” should both become “missing”
  • “Other party” should become “other”
  • “Strong republican” and “Not str republican” should both become “republican”
  • “Ind,near rep”, “Independent”, and “Ind,near dem” should all become “independent”
  • “Not str democrat” and “Strong democrat” should both become “democrat”

Your friend remembers there is an R package called forcats which is useful for working with categorical variables, but they don’t remember which function(s) to use here.
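One way to start searching, sketched below under the assumption that the forcats package is installed: list the functions the package exports and then read the help pages of the ones whose names sound relevant. The variable name fct_functions is just an illustrative choice, and fct_relevel is used only as an example of opening a help page.

```r
library(forcats)

# List the exported functions whose names start with "fct_"
fct_functions <- ls("package:forcats", pattern = "^fct_")
print(fct_functions)

# In an interactive session, open documentation for a candidate, e.g.:
# ?fct_relevel
# help(package = "forcats")   # index of all help pages in the package
```

The package website and its cheat sheet are other reasonable places to look once you have a few candidate function names.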

Question 10

Find a function in the forcats package which allows you to combine the categories of partyid as described above, then remake the boxplot. Your final boxplot should look like this: