Activity: Joins

Literacy and GDP data

Run the following R code to import data on literacy and GDP for each country:

library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
lit_data <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")

As in previous class activities, we can reshape the datasets:

gdp_data <- gdp_data |>
  rename(country = starts_with("Income")) |>
  pivot_longer(-country, 
               names_to = "year", 
               values_to = "gdp",
               values_drop_na = TRUE)

lit_data <- lit_data |>
  rename(country = starts_with("Adult")) |>
  pivot_longer(-country, 
               names_to = "year", 
               values_to = "literacy",
               values_drop_na = TRUE)

Use a join to create a dataset containing both the literacy rate and gdp for countries in each year. Keep only the rows with information on both gdp and literacy rate:

# A tibble: 6 × 4
  country year  literacy   gdp
  <chr>   <chr>    <dbl> <dbl>
1 Albania 2001      98.3 1282.
2 Albania 2008      94.7 1804.
3 Albania 2011      95.7 1966.
4 Algeria 1987      35.8 1902.
5 Algeria 2002      60.1 1872.
6 Algeria 2006      63.9 2125.

Solution:

lit_gdp <- lit_data |>
  inner_join(gdp_data, join_by(country, year))

head(lit_gdp)

# A tibble: 6 × 4
  country year  literacy   gdp
  <chr>   <chr>    <dbl> <dbl>
1 Albania 2001      98.3 1282.
2 Albania 2008      94.7 1804.
3 Albania 2011      95.7 1966.
4 Algeria 1987      35.8 1902.
5 Algeria 2002      60.1 1872.
6 Algeria 2006      63.9 2125.

Air quality

In the next question, we will work with air quality data. There are two tables: the air_quality table contains information on different air quality measurements for multiple locations, while the air_parameters table contains information on the different air quality measurements.

The datasets can be loaded into R as follows:

air_quality <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_long.csv")
air_parameters <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_parameters.csv")

Use a left join to add information about each air quality measurement to the air_quality table.

Solution:

air_quality <- air_quality |>
  left_join(air_parameters, join_by(parameter == id))

head(air_quality)

       city country                  date.utc location parameter value  unit
1 Antwerpen      BE 2019-06-18 06:00:00+00:00  BETR801      pm25  18.0 µg/m³
2 Antwerpen      BE 2019-06-17 08:00:00+00:00  BETR801      pm25   6.5 µg/m³
3 Antwerpen      BE 2019-06-17 07:00:00+00:00  BETR801      pm25  18.5 µg/m³
4 Antwerpen      BE 2019-06-17 06:00:00+00:00  BETR801      pm25  16.0 µg/m³
5 Antwerpen      BE 2019-06-17 05:00:00+00:00  BETR801      pm25   7.5 µg/m³
6 Antwerpen      BE 2019-06-17 04:00:00+00:00  BETR801      pm25   7.5 µg/m³
                                               description  name
1 Particulate matter less than 2.5 micrometers in diameter PM2.5
2 Particulate matter less than 2.5 micrometers in diameter PM2.5
3 Particulate matter less than 2.5 micrometers in diameter PM2.5
4 Particulate matter less than 2.5 micrometers in diameter PM2.5
5 Particulate matter less than 2.5 micrometers in diameter PM2.5
6 Particulate matter less than 2.5 micrometers in diameter PM2.5

NYC Flights

In the next questions, we will return to data on NYC flights in 2013 from the nycflights13 package. We will use two tables from the package: the flights table, which contains information on each flight, and the planes table, which contains information on each plane (uniquely identified by tailnum).

In R, use a left join to add information about each plane to the flights table.

Solution:

library(nycflights13)

flights_new <- flights |>
  left_join(planes, join_by(tailnum))

head(flights_new)

# A tibble: 6 × 27
  year.x month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
# ℹ 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
#   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>

Looking at the results from question 3, you will see that there are two year columns in the output of the join. This is because year is a column in both flights and planes. However, we don’t want to join on year, because it represents different information (year of the flight, vs. year of manufacture).

Since we don’t join on year, and R won’t allow multiple columns with the same name in one data frame, then the output has two year columns: year.x (from flights), and year.y (from planes). The year.y column contains information on date of manufacture.

What is the oldest plane (identified by tailnum) which flew from NYC airports in 2013?

Solution:

flights |>
  left_join(planes, join_by(tailnum)) |>
  rename(manufacture_date = year.y) |>
  select(tailnum, manufacture_date) |>
  distinct() |>
  slice_min(manufacture_date, n=3)

# A tibble: 3 × 2
  tailnum manufacture_date
  <chr>              <int>
1 N381AA              1956
2 N201AA              1959
3 N567AA              1959

The oldest plane has tail number N381AA.