Homework 9 (Optional)
Due: Wednesday, December 3, 11:59pm on Canvas
Important:
- This assignment is optional. I will take your 8 highest hw grades in the class; so, if you choose to complete this assignment, it can potentially replace one of your other hw grades.
- You do not need to do this assignment through GitHub.
Working in Python: To work in Python, you will connect to the DEAC OnDemand server:
https://sta279-f25.github.io/resources/rstudio_server/
Instructions:
- Log on to the RStudio Server
- Open a Quarto file
- Complete the assignment, render to HTML
- Download your .qmd and HTML files to your local computer
- Submit the .qmd and HTML files on Canvas
Data wrangling in Python
flights data
In the first part of this assignment, you will return to the
flights data from the nycflights13 package we
have used throughout the semester. To load the flights data
into Python, run the following in R:
Then run the following in Python:
Pandas DataFrames and Series
Before we begin manipulating this data in Python, let’s take a look at what type of object it is:
## <class 'pandas.core.frame.DataFrame'>
You can see that flights_py has been imported as a
pandas DataFrame! Recall that R has two main ways of storing rectangular
data: matrices and data frames. The equivalent of a matrix in Python is
a 2-d NumPy array. The equivalent of an R data frame is a pandas data
frame.
Let’s take a look at some of the attributes of this data in Python:
## (336776, 19)
## Index(['year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay',
## 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight',
## 'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute',
## 'time_hour'],
## dtype='object')
We can see that flights_py has 336776 and 19 columns,
just like the flights data frame in R. The column names are
also the same, and in the same order (just as we would hope).
Note: In Python (and many object-oriented language),
the . notation (“dot notation”) is used to access
attributes and methods of an object. Here, flights_py is an
object (in particular, a pandas DataFrame). For example,
all DataFrames in pandas have a .shape attribute,
which stores information about their numbers of rows and columns.
We can access column in Python by name:
## 0 EWR
## 1 LGA
## 2 JFK
## 3 JFK
## 4 LGA
## ...
## 336771 JFK
## 336772 LGA
## 336773 LGA
## 336774 LGA
## 336775 LGA
## Name: origin, Length: 336776, dtype: object
## <class 'pandas.core.series.Series'>
Each column in a pandas DataFrame is a Series; basically a one-dimensional DataFrame (sort of similar to a vector in R).
Basic data manipulation with pandas
Pandas offers many functions which are similar to functions in
dplyr. For example, if I want to group the flights by
month, and calculate the mean departure delay in each month, I will use
the groupby and agg (short for
aggregate) functions:
## dep_delay
## month
## 1 10.036665
## 2 10.816843
## 3 13.227076
## 4 13.938038
## 5 12.986859
## 6 20.846332
## 7 21.727787
## 8 12.611040
## 9 6.722476
## 10 6.243988
## 11 5.435362
## 12 16.576688
Let’s break this code down:
- First, notice that these functions are methods that belong to pandas
DataFrames, so we access them with dot notation.
flights_py.groupby(by = 'month')means “group the flights by month” - Second, notice that instead of piping (
|>) from one step to the next, we chain the dot notation functions together. This is a similar idea to the pipe, just slightly different syntax. This chaining works because the output offlights_py.groupby(...)is itself a pandas DataFrame, and therefore has anaggmethod - Also notice that when I refer to columns, I always include quotes around the name
aggapplies summary functions to certain columns. Here we are calculating the mean of thedep_delaycolumn. If we didn’t specify thedep_delaycolumn, we would calculate the mean of all the numeric columns in the data- By default, missing values were ignored when calculating the mean
Now, what if I want to calculate the proportion of cancelled flights for each month? Well, the proportion of missing values is not a built-in function (whereas the mean is built-in), so we have to write it ourselves:
def prop_na(x):
return sum(pd.isna(x))/len(x)
flights_py.groupby(by = 'month').agg(
{'dep_delay': prop_na}).sort_values(by = 'dep_delay', ascending = False)## dep_delay
## month
## 2 0.050539
## 12 0.036431
## 6 0.035726
## 7 0.031946
## 3 0.029861
## 4 0.023579
## 5 0.019551
## 1 0.019293
## 8 0.016572
## 9 0.016392
## 11 0.008545
## 10 0.008169
Question 1
Using pandas, calculate the mean arrival delay for each carrier in the NY flights data.
Question 2
How many carriers have a positive mean arrival delay?
Question 3
Now use the weather and airports tables to
calculate the mean temperature and humidity for each of the three NY
airports, for each month.
Hints:
- You will need to move these tables from R into Python
- Look up
mergefunction in Pandas - Your final results should look like this:
## mean_temp mean_humid
## origin name month
## EWR Newark Liberty Intl 1 35.562156 62.124515
## 2 34.263318 63.335575
## 3 40.118654 57.913822
## 4 52.977500 54.235125
## 5 63.320242 64.073710
## 6 73.267250 67.002292
## 7 80.702996 66.676410
## 8 74.537483 65.751746
## 9 67.304784 64.035563
## 10 59.778207 65.319280
## 11 44.577343 56.520112
## 12 37.950084 69.721415
## JFK John F Kennedy Intl 1 35.385553 61.661617
## 2 34.192459 62.558480
## 3 39.544717 58.022224
## 4 50.142698 60.057636
## 5 59.314758 70.989261
## 6 69.958250 73.761472
## 7 78.734919 71.744691
## 8 73.818780 67.844201
## 9 66.897750 65.251819
## 10 59.801951 65.063618
## 11 45.134194 56.799607
## 12 38.604867 68.313413
## LGA La Guardia 1 35.959272 59.156092
## 2 34.356119 60.686030
## 3 39.976523 56.559474
## 4 52.114500 53.150597
## 5 62.750000 61.496183
## 6 73.326500 61.226861
## 7 80.764253 62.264038
## 8 75.048254 60.345737
## 9 67.911250 59.677431
## 10 60.632439 60.138753
## 11 45.260926 53.492959
## 12 38.769762 63.627692
Question 4
Modify your code from the previous question to keep only the rows
corresponding to the month with the highest mean temperature for each
airport. You may not manually specify which month this is
(i.e. you may not do something like
.query(month == 7)).
Hints:
- Grouping and then applying a filtration may be useful here!
- Your final results should look like this:
## mean_temp mean_humid
## origin name month
## LGA La Guardia 7 80.764253 62.264038
## JFK John F Kennedy Intl 7 78.734919 71.744691
## EWR Newark Liberty Intl 7 80.702996 66.676410
Health code violations
The Violations data from the mdsr package
contains information on 480621 health inspections conducted on
restaurants in New York City. We can load the data into Python with the
following code.
First, in R:
Now in Python:
Question 5
Using pandas, calculate the median inspection score by zip code for zip codes in Manhattan with 50 or more inspections.
Hints: The following functions for pandas DataFrames will be useful:
querydropnagroupbyagg
Simulations
In this part of the assignment, you will practice working in Python by re-implementing a simulation in Python. The following table shows you corresponding Python code for several R operations.
| R code | (approximate) Python equivalent |
|---|---|
length |
len |
dim(x) |
x.shape |
seq |
np.linspace or np.arange |
rep(0, n) |
np.zeros(n) |
runif(n, a, b) |
np.random.uniform(a, b, n) |
mean(x) |
np.mean(x) |
for(i in 1:n) |
for i in range(n): |
ifelse(...) |
np.where(...) |
x[i] |
x[i] (but remember python is 0-indexed) |
if(...){} else { } |
if ... : else: |
rnorm(n, mean=0, sd=0.5) |
np.random.normal(loc=0, scale=0.5, n) |
set.seed(...) |
np.random.seed(...) |
sample(...) |
np.random.choice(...) |
Question 6
Re-do Question 1 from HW 5 (the robot tug-of-war) in Python. Confirm that you get a similar result.