Homework 1

Due: Friday, September 5, 10:00pm on Canvas

Instructions:

Download the HW 1 template, and open the template (a Quarto document) in RStudio.
Put your name in the file header
Click Render
Type all code and answers in the document
Render early and often to catch any errors!
When you are finished, submit the final rendered HTML to Canvas

Code guidelines:

If a question requires code, and code is not provided, you will not receive full credit
You will be graded on the quality of your code. In addition to being correct, your code should also be easy to read

Resources: In addition to the class notes and activities, you should read the following resources:

Chapter 4 in Modern Data Science with R
Chapter 3 in R for Data Science (2nd edition)

NY Flights data

Let’s begin with the flights data from the nycflights13 package.

Suppose we want to know how many flights departed from each of the three NY airports (EWR, LGA, and JFK) in 2013. The count function counts the number of rows, optionally within the groups defined by a variable. Adding sort = TRUE orders the results from largest to smallest. For example:

library(nycflights13)
library(tidyverse)

flights |>
  count(origin, sort = TRUE)

## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

We can see that 120,835 flights departed from EWR in 2013.

Question 1

Reproduce the output from the count function, but use the group_by, summarize, and arrange functions instead.

Question 2

The dep_time variable in the flights data records the actual departure time for each flight. However, there are several rows with missing departure times. Let’s assume that these correspond to flights which were cancelled.

Which month had the highest proportion of cancelled flights?

Hint 1: the is.na() function returns TRUE for each missing value, and FALSE otherwise.

is.na(c(1, 2, NA, 3))

## [1] FALSE FALSE  TRUE FALSE

Hint 2: In R, you can do math with boolean values (i.e., TRUEs and FALSEs). TRUE is treated as 1, and FALSE as 0. For example:

sum(is.na(c(1, 2, NA, 3)))

## [1] 1

(that is, the number of NAs!)
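Because TRUE counts as 1 and FALSE as 0, the same idea extends from counts to proportions: the mean of a logical vector is the fraction of its values that are TRUE. A minimal sketch on the same toy vector (the names x, n_missing, and prop_missing are just for illustration):

```r
x <- c(1, 2, NA, 3)

# Number of missing values: TRUE is added up as 1, FALSE as 0
n_missing <- sum(is.na(x))

# Proportion of missing values: 1 missing out of 4 entries
prop_missing <- mean(is.na(x))
prop_missing
## [1] 0.25
```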

Question 3

Which plane (specified by the tailnum variable) traveled the most times from New York City airports in 2013?

Question 4

Which carrier has the worst average delays?

Question 5

For the results in question 4, can you disentangle the effects of bad airports vs. bad carriers? Why or why not?

Hint: Look at table(flights$origin, flights$carrier)
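If you have not used table() with two arguments before, it cross-tabulates them: each cell counts how often a pair of values occurs together. A small sketch with made-up vectors (airport and airline are hypothetical names, not columns of the flights data):

```r
# Toy data: five trips, each with an airport and an airline
airport <- c("A", "A", "B", "B", "B")
airline <- c("X", "Y", "X", "X", "X")

# Rows are airports, columns are airlines, cells are counts
tab <- table(airport, airline)
tab
```

A cell equal to 0 means that combination never appears in the data.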

Working with dates

Examining the available variables in the flights dataset, we can see that several columns provide information about the date and time of the flight. In particular, the time_hour column shows both the date and hour at which the flight was scheduled to depart, in a format like

2013-01-01 05:00:00

A very helpful package for working with dates in R is lubridate.

Install the lubridate package (if it is not installed already)
Read sections 17.2 and 17.3 in R for Data Science
Then answer the following questions
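As a warm-up, here is a minimal sketch of the general pattern in lubridate: parse a string into a date-time, then pull out individual components with accessor functions. The date-time below is made up for illustration and is not from the flights data.

```r
library(lubridate)

# Parse a string in year-month-day hour:minute:second format
dt <- ymd_hms("2020-02-29 13:45:10")

month(dt)   # month as a number: 2
hour(dt)    # hour of the day: 13
minute(dt)  # minute of the hour: 45
```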

Question 6

Fill in the code to extract the year (2013) from the date-time 2013-01-01 05:00:00:

library(lubridate)

datetime <- ymd_hms(...)
...(datetime)

Question 7

What is the difference between the mday, yday, and wday functions?

Question 8

Look up the documentation on the week function from the lubridate package by running the following code in your console:

?week

What does

week(datetime)

tell us?

Question 9

For the plane identified in question 3, plot the number of trips per week over the year.

To create this plot, you will need to determine the week of the year for each flight. Use the week function in the lubridate package.

Privacy and k-anonymity

On the first day of class, we saw a dataset from which I deliberately removed many demographic features before releasing it to you. As we discussed in class, it is important to be careful with variables that can be used to identify subjects in the data. As statisticians and data scientists, we have an ethical responsibility to protect these individuals’ privacy.

One approach to assessing whether individuals are identifiable in a dataset is a concept called k-anonymity.

Read the Introduction (Section 1) of this paper by Dr. Latanya Sweeney on k-anonymity
Read the Wikipedia article on k-anonymity, which explains the concept without technical mathematical details. The tables on the Wikipedia page provide a good example of 2-anonymity
Then answer the following questions

Question 10

Summarize k-anonymity in a few sentences. What does it mean for a dataset to have k-anonymity (e.g., 2-anonymity)?

On the Wikipedia page, the data in the second table has been modified so that the table has 2-anonymity with respect to Age, Gender, and State of domicile. We can import the data in this table into R with the following code (we will discuss web scraping in more detail later in the semester):

library(tidyverse)
library(rvest)

mod_data <- read_html("https://en.wikipedia.org/wiki/K-anonymity") |> 
  html_elements("table") |>
  purrr::pluck(2) |>
  html_table()

Question 11

Using dplyr functions that we have covered in this course, verify that this data has 2-anonymity with respect to Age, Gender, and State of domicile.