Homework 9 (Optional)

Due: Wednesday, December 3, 11:59pm on Canvas

Important:

  1. This assignment is optional. I will take your 8 highest hw grades in the class; so, if you choose to complete this assignment, it can potentially replace one of your other hw grades.
  2. You do not need to do this assignment through GitHub.

Working in Python: To work in Python, you will connect to the DEAC OnDemand server:

https://sta279-f25.github.io/resources/rstudio_server/

Instructions:

  1. Log on to the RStudio Server
  2. Open a Quarto file
  3. Complete the assignment, render to HTML
  4. Download your .qmd and HTML files to your local computer
  5. Submit the .qmd and HTML files on Canvas

Data wrangling in Python

flights data

In the first part of this assignment, you will return to the flights data from the nycflights13 package we have used throughout the semester. To load the flights data into Python, run the following in R:

library(nycflights13)

Then run the following in Python:

import numpy as np
import pandas as pd

flights_py = r.flights

Pandas DataFrames and Series

Before we begin manipulating this data in Python, let’s take a look at what type of object it is:

type(flights_py)
## <class 'pandas.core.frame.DataFrame'>

You can see that flights_py has been imported as a pandas DataFrame! Recall that R has two main ways of storing rectangular data: matrices and data frames. The equivalent of a matrix in Python is a 2-d NumPy array. The equivalent of an R data frame is a pandas data frame.

Let’s take a look at some of the attributes of this data in Python:

flights_py.shape
## (336776, 19)
flights_py.columns
## Index(['year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',
##        'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',
##        'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',
##        'time_hour'],
##       dtype='object')

We can see that flights_py has 336776 and 19 columns, just like the flights data frame in R. The column names are also the same, and in the same order (just as we would hope).

Note: In Python (and many object-oriented language), the . notation (“dot notation”) is used to access attributes and methods of an object. Here, flights_py is an object (in particular, a pandas DataFrame). For example, all DataFrames in pandas have a .shape attribute, which stores information about their numbers of rows and columns.

We can access column in Python by name:

flights_py['origin']
## 0         EWR
## 1         LGA
## 2         JFK
## 3         JFK
## 4         LGA
##          ... 
## 336771    JFK
## 336772    LGA
## 336773    LGA
## 336774    LGA
## 336775    LGA
## Name: origin, Length: 336776, dtype: object
type(flights_py['origin'])
## <class 'pandas.core.series.Series'>

Each column in a pandas DataFrame is a Series; basically a one-dimensional DataFrame (sort of similar to a vector in R).

Basic data manipulation with pandas

Pandas offers many functions which are similar to functions in dplyr. For example, if I want to group the flights by month, and calculate the mean departure delay in each month, I will use the groupby and agg (short for aggregate) functions:

flights_py.groupby(by = 'month').agg({'dep_delay': 'mean'})
##        dep_delay
## month           
## 1      10.036665
## 2      10.816843
## 3      13.227076
## 4      13.938038
## 5      12.986859
## 6      20.846332
## 7      21.727787
## 8      12.611040
## 9       6.722476
## 10      6.243988
## 11      5.435362
## 12     16.576688

Let’s break this code down:

  • First, notice that these functions are methods that belong to pandas DataFrames, so we access them with dot notation. flights_py.groupby(by = 'month') means “group the flights by month”
  • Second, notice that instead of piping (|>) from one step to the next, we chain the dot notation functions together. This is a similar idea to the pipe, just slightly different syntax. This chaining works because the output of flights_py.groupby(...) is itself a pandas DataFrame, and therefore has an agg method
  • Also notice that when I refer to columns, I always include quotes around the name
  • agg applies summary functions to certain columns. Here we are calculating the mean of the dep_delay column. If we didn’t specify the dep_delay column, we would calculate the mean of all the numeric columns in the data
  • By default, missing values were ignored when calculating the mean

Now, what if I want to calculate the proportion of cancelled flights for each month? Well, the proportion of missing values is not a built-in function (whereas the mean is built-in), so we have to write it ourselves:

def prop_na(x):
  return sum(pd.isna(x))/len(x)

flights_py.groupby(by = 'month').agg(
  {'dep_delay': prop_na}).sort_values(by = 'dep_delay', ascending = False)
##        dep_delay
## month           
## 2       0.050539
## 12      0.036431
## 6       0.035726
## 7       0.031946
## 3       0.029861
## 4       0.023579
## 5       0.019551
## 1       0.019293
## 8       0.016572
## 9       0.016392
## 11      0.008545
## 10      0.008169

Question 1

Using pandas, calculate the mean arrival delay for each carrier in the NY flights data.

Question 2

How many carriers have a positive mean arrival delay?

Question 3

Now use the weather and airports tables to calculate the mean temperature and humidity for each of the three NY airports, for each month.

Hints:

  • You will need to move these tables from R into Python
  • Look up merge function in Pandas
  • Your final results should look like this:
##                                   mean_temp  mean_humid
## origin name                month                       
## EWR    Newark Liberty Intl 1      35.562156   62.124515
##                            2      34.263318   63.335575
##                            3      40.118654   57.913822
##                            4      52.977500   54.235125
##                            5      63.320242   64.073710
##                            6      73.267250   67.002292
##                            7      80.702996   66.676410
##                            8      74.537483   65.751746
##                            9      67.304784   64.035563
##                            10     59.778207   65.319280
##                            11     44.577343   56.520112
##                            12     37.950084   69.721415
## JFK    John F Kennedy Intl 1      35.385553   61.661617
##                            2      34.192459   62.558480
##                            3      39.544717   58.022224
##                            4      50.142698   60.057636
##                            5      59.314758   70.989261
##                            6      69.958250   73.761472
##                            7      78.734919   71.744691
##                            8      73.818780   67.844201
##                            9      66.897750   65.251819
##                            10     59.801951   65.063618
##                            11     45.134194   56.799607
##                            12     38.604867   68.313413
## LGA    La Guardia          1      35.959272   59.156092
##                            2      34.356119   60.686030
##                            3      39.976523   56.559474
##                            4      52.114500   53.150597
##                            5      62.750000   61.496183
##                            6      73.326500   61.226861
##                            7      80.764253   62.264038
##                            8      75.048254   60.345737
##                            9      67.911250   59.677431
##                            10     60.632439   60.138753
##                            11     45.260926   53.492959
##                            12     38.769762   63.627692

Question 4

Modify your code from the previous question to keep only the rows corresponding to the month with the highest mean temperature for each airport. You may not manually specify which month this is (i.e. you may not do something like .query(month == 7)).

Hints:

  • Grouping and then applying a filtration may be useful here!
  • Your final results should look like this:
##                                   mean_temp  mean_humid
## origin name                month                       
## LGA    La Guardia          7      80.764253   62.264038
## JFK    John F Kennedy Intl 7      78.734919   71.744691
## EWR    Newark Liberty Intl 7      80.702996   66.676410

Health code violations

The Violations data from the mdsr package contains information on 480621 health inspections conducted on restaurants in New York City. We can load the data into Python with the following code.

First, in R:

library(mdsr)

my_violations <- Violations |>
  mutate(score = as.numeric(score))

Now in Python:

vio_py = r.my_violations

Question 5

Using pandas, calculate the median inspection score by zip code for zip codes in Manhattan with 50 or more inspections.

Hints: The following functions for pandas DataFrames will be useful:

  • query
  • dropna
  • groupby
  • agg

Simulations

In this part of the assignment, you will practice working in Python by re-implementing a simulation in Python. The following table shows you corresponding Python code for several R operations.

R code (approximate) Python equivalent
length len
dim(x) x.shape
seq np.linspace or np.arange
rep(0, n) np.zeros(n)
runif(n, a, b) np.random.uniform(a, b, n)
mean(x) np.mean(x)
for(i in 1:n) for i in range(n):
ifelse(...) np.where(...)
x[i] x[i] (but remember python is 0-indexed)
if(...){

} else {

}
if ... :

else:
rnorm(n, mean=0, sd=0.5) np.random.normal(loc=0, scale=0.5, n)
set.seed(...) np.random.seed(...)
sample(...) np.random.choice(...)

Question 6

Re-do Question 1 from HW 5 (the robot tug-of-war) in Python. Confirm that you get a similar result.