38.8 Mb
The nycflights13 package contains information on flights from NYC airports in 2013. The data is stored across several data frames:
airlines: information on each airlineairports: information on each airportflights: information on each flightplanes: information on each planeweather: hourly weather dataThe nycflights13 package contains a small subset of a database on 48 million flights. The airlines database includes the following tables:
airportscarriersflightsplanesThis data is too big to store locally, but can be on servers which we can access remotely.
What do you think each part of this query is doing?
name N pct_ontime
1 Virgin America 322 0.8789
2 United Air Lines Inc. 356 0.8736
3 Delta Air Lines Inc. 2100 0.8505
4 American Airlines Inc. 1500 0.8113
carrier name
1 02Q Titan Airways
2 04Q Tradewind Aviation
3 05Q Comlux Aviation, AG
4 06Q Master Top Linhas Aereas Ltd.
5 07Q Flair Airlines Ltd.
6 09Q Swift Air, LLC
7 0BQ DCA
8 0CQ ACM AIR CHARTER GmbH
9 0GQ Inter Island Airways, d/b/a Inter Island Air
10 0HQ Polar Airlines de Mexico d/b/a Nova Air
SELECT: the columns to be retrievedFROM: the table containing the dataLIMIT: limit the rows to returnWhat if I want the year, origin, dest, dep_delay, and arr_delay columns from the flights table?
What if I want the year, origin, dest, dep_delay, and arr_delay columns from the flights table?
year origin dest dep_delay arr_delay
1 2013 LAX DFW -8 -12
2 2013 SFO ATL 5 1
3 2013 SFO DFW -4 -2
4 2013 SEA ORD 19 4
5 2013 LAX IAH -1 -10
What if I also want to calculate the difference between arrival delay and departure delay?
What if I also want to calculate the difference between arrival delay and departure delay?
year origin dest dep_delay arr_delay delay_diff
1 2013 LAX DFW -8 -12 -4
2 2013 SFO ATL 5 1 -4
3 2013 SFO DFW -4 -2 2
What are the equivalent dplyr functions?
Back to our original SQL query:
N pct_ontime
1 18008372 0.8091
SELECT can also be used to calculate summary statistics. For example, if we want the average departure delay:
mean_dep_delay
1 9.7471
Now suppose that I only want the mean departure delay for flights from EWR in 2013:
mean_dep_delay
1 14.703
What do you think should I do if I want the mean delay for each airport in November 2013?
mean_dep_delay
1 3.7766
2 2.2070
3 8.0122
4 -0.2985
5 5.2750
6 3.6619
7 8.2222
8 18.8750
9 -2.1042
10 8.0443
Do you notice anything about this output?
origin mean_dep_delay
1 ABE 3.7766
2 ABI 2.2070
3 ABQ 8.0122
4 ABR -0.2985
5 ABY 5.2750
6 ACT 3.6619
7 ACV 8.2222
8 ADK 18.8750
9 ADQ -2.1042
10 AEX 8.0443
Work on the class activity, and submit your rendered HTML on Canvas at the end of class.