Activity: Joins

Instructions:

Work with a neighbor to answer the following questions
To get started, download the class activity template file
When you are finished, render the file as an HTML and submit the HTML to Canvas (let me know if you encounter any problems)

Literacy and GDP data

Run the following R code to import data on literacy and GDP for each country:

library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
lit_data <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")

As in previous class activities, we can reshape the datasets:

gdp_data <- gdp_data |>
  rename(country = starts_with("Income")) |>
  pivot_longer(-country, 
               names_to = "year", 
               values_to = "gdp",
               values_drop_na = TRUE)

lit_data <- lit_data |>
  rename(country = starts_with("Adult")) |>
  pivot_longer(-country, 
               names_to = "year", 
               values_to = "literacy",
               values_drop_na = TRUE)

Use a join to create a dataset containing both the literacy rate and gdp for countries in each year. Keep only the rows with information on both gdp and literacy rate:

# A tibble: 6 × 4
  country year  literacy   gdp
  <chr>   <chr>    <dbl> <dbl>
1 Albania 2001      98.3 1282.
2 Albania 2008      94.7 1804.
3 Albania 2011      95.7 1966.
4 Algeria 1987      35.8 1902.
5 Algeria 2002      60.1 1872.
6 Algeria 2006      63.9 2125.

Air quality

In the next question, we will work with air quality data. There are two tables: the air_quality table contains information on different air quality measurements for multiple locations, while the air_parameters table contains information on the different air quality measurements.

The datasets can be loaded into R as follows:

air_quality <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_long.csv")
air_parameters <- read.csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_parameters.csv")

Use a left join to add information about each air quality measurement to the air_quality table.

NYC Flights

In the next questions, we will return to data on NYC flights in 2013 from the nycflights13 package. We will use two tables from the package: the flights table, which contains information on each flight, and the planes table, which contains information on each plane (uniquely identified by tailnum).

In R, use a left join to add information about each plane to the flights table.

Looking at the results from question 3, you will see that there are two year columns in the output of the join. This is because year is a column in both flights and planes. However, we don’t want to join on year, because it represents different information (year of the flight, vs. year of manufacture).

Since we don’t join on year, and R won’t allow multiple columns with the same name in one data frame, then the output has two year columns: year.x (from flights), and year.y (from planes). The year.y column contains information on date of manufacture.

What is the oldest plane (identified by tailnum) which flew from NYC airports in 2013?