Activity: Practice pivoting

GDP data

Run the following code to import data on GDP for each country (you may need to install the googlesheets4 package):

library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

gdp_data <- gdp_data |>
  rename(country = starts_with("Income"))

Looking at the data, we can see that there is a column for the GDP in each year, and the first column contains country information:

head(gdp_data)
# A tibble: 6 × 53
  country  `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Abkhazia    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
2 Afghani…    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
3 Akrotir…    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
4 Albania     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
5 Algeria   1280.  1085.   856.  1128.  1170.  1215.  1128.  1201.  1292.  1359.
6 America…    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
# ℹ 42 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
#   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
#   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
#   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
#   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
#   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
#   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …
  1. Use the pivot_longer function to reshape the GDP data so it looks like this:
# A tibble: 7,988 × 3
   country year    gdp
   <chr>   <chr> <dbl>
 1 Albania 1980  1061.
 2 Albania 1981  1100.
 3 Albania 1982  1111.
 4 Albania 1983  1101.
 5 Albania 1984  1065.
 6 Albania 1985  1060.
 7 Albania 1986  1092.
 8 Albania 1987  1054.
 9 Albania 1988  1014.
10 Albania 1989  1092.
# ℹ 7,978 more rows
  • There should be no NAs
  • There should be separate columns for year and gdp

Solution:

gdp_data |>
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "gdp",
               values_drop_na = T)
# A tibble: 7,988 × 3
   country year    gdp
   <chr>   <chr> <dbl>
 1 Albania 1980  1061.
 2 Albania 1981  1100.
 3 Albania 1982  1111.
 4 Albania 1983  1101.
 5 Albania 1984  1065.
 6 Albania 1985  1060.
 7 Albania 1986  1092.
 8 Albania 1987  1054.
 9 Albania 1988  1014.
10 Albania 1989  1092.
# ℹ 7,978 more rows

WHO data

The tidyr package contains a dataset called who2 which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:

who2
# A tibble: 7,240 × 58
   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
 2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
 3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
 4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
 5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
 6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
 7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
 8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
 9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
#   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
#   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
#   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
#   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
#   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …

Columns like sp_m_1524 contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel, sn, sp, or ep), sex (f or m), and age group (014 = 0-14 years, 1524 = 15-24 years, etc.).

Ideally, we would like these variables to each have their own column.

  1. Reshape the who2 data so it looks like this:
# A tibble: 76,046 × 6
   country      year diagnosis sex   age_group count
   <chr>       <dbl> <chr>     <chr> <chr>     <dbl>
 1 Afghanistan  1997 sp        m     014           0
 2 Afghanistan  1997 sp        m     1524         10
 3 Afghanistan  1997 sp        m     2534          6
 4 Afghanistan  1997 sp        m     3544          3
 5 Afghanistan  1997 sp        m     4554          5
 6 Afghanistan  1997 sp        m     5564          2
 7 Afghanistan  1997 sp        m     65            0
 8 Afghanistan  1997 sp        f     014           5
 9 Afghanistan  1997 sp        f     1524         38
10 Afghanistan  1997 sp        f     2534         36
# ℹ 76,036 more rows
  • There should be no NAs
  • There should be separate columns for diagnosis, sex, and age_group
  • The count column contains the number of cases

Solution:

who2 |>
  pivot_longer(cols = -c(country, year),
               names_to = c("diagnosis", "sex", "age_group"),
               names_sep = "_",
               values_to = "count",
               values_drop_na = T)
# A tibble: 76,046 × 6
   country      year diagnosis sex   age_group count
   <chr>       <dbl> <chr>     <chr> <chr>     <dbl>
 1 Afghanistan  1997 sp        m     014           0
 2 Afghanistan  1997 sp        m     1524         10
 3 Afghanistan  1997 sp        m     2534          6
 4 Afghanistan  1997 sp        m     3544          3
 5 Afghanistan  1997 sp        m     4554          5
 6 Afghanistan  1997 sp        m     5564          2
 7 Afghanistan  1997 sp        m     65            0
 8 Afghanistan  1997 sp        f     014           5
 9 Afghanistan  1997 sp        f     1524         38
10 Afghanistan  1997 sp        f     2534         36
# ℹ 76,036 more rows