library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
lit_data <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")Activity: Joins
Literacy and GDP data
Run the following R code to import data on literacy and GDP for each country:
As in previous class activities, we can reshape the datasets:
gdp_data <- gdp_data |>
rename(country = starts_with("Income")) |>
pivot_longer(-country,
names_to = "year",
values_to = "gdp",
values_drop_na = TRUE)
lit_data <- lit_data |>
rename(country = starts_with("Adult")) |>
pivot_longer(-country,
names_to = "year",
values_to = "literacy",
values_drop_na = TRUE)- Use a join to create a dataset containing both the literacy rate and gdp for countries in each year. Keep only the rows with information on both gdp and literacy rate:
# A tibble: 6 × 4
country year literacy gdp
<chr> <chr> <dbl> <dbl>
1 Albania 2001 98.3 1282.
2 Albania 2008 94.7 1804.
3 Albania 2011 95.7 1966.
4 Algeria 1987 35.8 1902.
5 Algeria 2002 60.1 1872.
6 Algeria 2006 63.9 2125.
Solution:
lit_gdp <- lit_data |>
inner_join(gdp_data, join_by(country, year))
head(lit_gdp)# A tibble: 6 × 4
country year literacy gdp
<chr> <chr> <dbl> <dbl>
1 Albania 2001 98.3 1282.
2 Albania 2008 94.7 1804.
3 Albania 2011 95.7 1966.
4 Algeria 1987 35.8 1902.
5 Algeria 2002 60.1 1872.
6 Algeria 2006 63.9 2125.
Air quality
In the next question, we will work with air quality data. There are two tables: the air_quality table contains information on different air quality measurements for multiple locations, while the air_parameters table contains information on the different air quality measurements.
The datasets can be loaded into R as follows:
air_quality <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_long.csv")
air_parameters <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_parameters.csv")- Use a left join to add information about each air quality measurement to the
air_qualitytable.
Solution:
air_quality <- air_quality |>
left_join(air_parameters, join_by(parameter == id))
head(air_quality) city country date.utc location parameter value unit
1 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
2 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
3 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
4 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
5 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³
6 Antwerpen BE 2019-06-17 04:00:00+00:00 BETR801 pm25 7.5 µg/m³
description name
1 Particulate matter less than 2.5 micrometers in diameter PM2.5
2 Particulate matter less than 2.5 micrometers in diameter PM2.5
3 Particulate matter less than 2.5 micrometers in diameter PM2.5
4 Particulate matter less than 2.5 micrometers in diameter PM2.5
5 Particulate matter less than 2.5 micrometers in diameter PM2.5
6 Particulate matter less than 2.5 micrometers in diameter PM2.5
NYC Flights
In the next questions, we will return to data on NYC flights in 2013 from the nycflights13 package. We will use two tables from the package: the flights table, which contains information on each flight, and the planes table, which contains information on each plane (uniquely identified by tailnum).
- In R, use a left join to add information about each plane to the
flightstable.
Solution:
library(nycflights13)
flights_new <- flights |>
left_join(planes, join_by(tailnum))
head(flights_new)# A tibble: 6 × 27
year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
# manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
# engine <chr>
Looking at the results from question 3, you will see that there are two year columns in the output of the join. This is because year is a column in both flights and planes. However, we don’t want to join on year, because it represents different information (year of the flight, vs. year of manufacture).
Since we don’t join on year, and R won’t allow multiple columns with the same name in one data frame, then the output has two year columns: year.x (from flights), and year.y (from planes). The year.y column contains information on date of manufacture.
- What is the oldest plane (identified by
tailnum) which flew from NYC airports in 2013?
Solution:
flights |>
left_join(planes, join_by(tailnum)) |>
rename(manufacture_date = year.y) |>
select(tailnum, manufacture_date) |>
distinct() |>
slice_min(manufacture_date, n=3)# A tibble: 3 × 2
tailnum manufacture_date
<chr> <int>
1 N381AA 1956
2 N201AA 1959
3 N567AA 1959
The oldest plane has tail number N381AA.