Activity: Practice pivoting

Instructions:

Work with a neighbor to answer the following questions
To get started, download the class activity template file
When you are finished, render the file as an HTML and submit the HTML to Canvas (let me know if you encounter any problems)

GDP data

Run the following code to import data on GDP for each country (you may need to install the googlesheets4 package):

library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

gdp_data <- gdp_data |>
  rename(country = starts_with("Income"))

Looking at the data, we can see that there is a column for the GDP in each year, and the first column contains country information:

head(gdp_data)

# A tibble: 6 × 53
  country  `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Abkhazia    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
2 Afghani…    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
3 Akrotir…    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
4 Albania     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
5 Algeria   1280.  1085.   856.  1128.  1170.  1215.  1128.  1201.  1292.  1359.
6 America…    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA 
# ℹ 42 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
#   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
#   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
#   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
#   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
#   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
#   `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …

Use the pivot_longer function to reshape the GDP data so it looks like this:

# A tibble: 7,988 × 3
   country year    gdp
   <chr>   <chr> <dbl>
 1 Albania 1980  1061.
 2 Albania 1981  1100.
 3 Albania 1982  1111.
 4 Albania 1983  1101.
 5 Albania 1984  1065.
 6 Albania 1985  1060.
 7 Albania 1986  1092.
 8 Albania 1987  1054.
 9 Albania 1988  1014.
10 Albania 1989  1092.
# ℹ 7,978 more rows

There should be no NAs
There should be separate columns for year and gdp

WHO data

The tidyr package contains a dataset called who2 which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:

who2

# A tibble: 7,240 × 58
   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
 2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
 3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
 4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
 5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
 6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
 7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
 8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
 9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
#   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
#   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
#   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
#   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
#   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …

Columns like sp_m_1524 contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel, sn, sp, or ep), sex (f or m), and age group (014 = 0-14 years, 1524 = 15-24 years, etc.).

Ideally, we would like these variables to each have their own column.

Reshape the who2 data so it looks like this:

# A tibble: 76,046 × 6
   country      year diagnosis sex   age_group count
   <chr>       <dbl> <chr>     <chr> <chr>     <dbl>
 1 Afghanistan  1997 sp        m     014           0
 2 Afghanistan  1997 sp        m     1524         10
 3 Afghanistan  1997 sp        m     2534          6
 4 Afghanistan  1997 sp        m     3544          3
 5 Afghanistan  1997 sp        m     4554          5
 6 Afghanistan  1997 sp        m     5564          2
 7 Afghanistan  1997 sp        m     65            0
 8 Afghanistan  1997 sp        f     014           5
 9 Afghanistan  1997 sp        f     1524         38
10 Afghanistan  1997 sp        f     2534         36
# ℹ 76,036 more rows

There should be no NAs
There should be separate columns for diagnosis, sex, and age_group
The count column contains the number of cases