library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
gdp_data <- gdp_data |>
rename(country = starts_with("Income"))Activity: Practice pivoting
Instructions:
- Work with a neighbor to answer the following questions
- To get started, download the class activity template file
- When you are finished, render the file as an HTML and submit the HTML to Canvas (let me know if you encounter any problems)
GDP data
Run the following code to import data on GDP for each country (you may need to install the googlesheets4 package):
Looking at the data, we can see that there is a column for the GDP in each year, and the first column contains country information:
head(gdp_data)# A tibble: 6 × 53
country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abkhazia NA NA NA NA NA NA NA NA NA NA
2 Afghani… NA NA NA NA NA NA NA NA NA NA
3 Akrotir… NA NA NA NA NA NA NA NA NA NA
4 Albania NA NA NA NA NA NA NA NA NA NA
5 Algeria 1280. 1085. 856. 1128. 1170. 1215. 1128. 1201. 1292. 1359.
6 America… NA NA NA NA NA NA NA NA NA NA
# ℹ 42 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
# `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
# `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
# `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
# `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
# `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
# `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …
- Use the
pivot_longerfunction to reshape the GDP data so it looks like this:
# A tibble: 7,988 × 3
country year gdp
<chr> <chr> <dbl>
1 Albania 1980 1061.
2 Albania 1981 1100.
3 Albania 1982 1111.
4 Albania 1983 1101.
5 Albania 1984 1065.
6 Albania 1985 1060.
7 Albania 1986 1092.
8 Albania 1987 1054.
9 Albania 1988 1014.
10 Albania 1989 1092.
# ℹ 7,978 more rows
- There should be no NAs
- There should be separate columns for
yearandgdp
WHO data
The tidyr package contains a dataset called who2 which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:
who2# A tibble: 7,240 × 58
country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1980 NA NA NA NA NA NA
2 Afghanistan 1981 NA NA NA NA NA NA
3 Afghanistan 1982 NA NA NA NA NA NA
4 Afghanistan 1983 NA NA NA NA NA NA
5 Afghanistan 1984 NA NA NA NA NA NA
6 Afghanistan 1985 NA NA NA NA NA NA
7 Afghanistan 1986 NA NA NA NA NA NA
8 Afghanistan 1987 NA NA NA NA NA NA
9 Afghanistan 1988 NA NA NA NA NA NA
10 Afghanistan 1989 NA NA NA NA NA NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
# sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
# sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
# sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
# sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
# sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
Columns like sp_m_1524 contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel, sn, sp, or ep), sex (f or m), and age group (014 = 0-14 years, 1524 = 15-24 years, etc.).
Ideally, we would like these variables to each have their own column.
- Reshape the
who2data so it looks like this:
# A tibble: 76,046 × 6
country year diagnosis sex age_group count
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 1997 sp m 014 0
2 Afghanistan 1997 sp m 1524 10
3 Afghanistan 1997 sp m 2534 6
4 Afghanistan 1997 sp m 3544 3
5 Afghanistan 1997 sp m 4554 5
6 Afghanistan 1997 sp m 5564 2
7 Afghanistan 1997 sp m 65 0
8 Afghanistan 1997 sp f 014 5
9 Afghanistan 1997 sp f 1524 38
10 Afghanistan 1997 sp f 2534 36
# ℹ 76,036 more rows
- There should be no NAs
- There should be separate columns for
diagnosis,sex, andage_group - The
countcolumn contains the number of cases