library(tidyverse)
library(googlesheets4)
gs4_deauth()
gdp_data <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")
gdp_data <- gdp_data |>
rename(country = starts_with("Income"))Activity: Practice pivoting
GDP data
Run the following code to import data on GDP for each country (you may need to install the googlesheets4 package):
Looking at the data, we can see that there is a column for the GDP in each year, and the first column contains country information:
head(gdp_data)# A tibble: 6 × 53
country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abkhazia NA NA NA NA NA NA NA NA NA NA
2 Afghani… NA NA NA NA NA NA NA NA NA NA
3 Akrotir… NA NA NA NA NA NA NA NA NA NA
4 Albania NA NA NA NA NA NA NA NA NA NA
5 Algeria 1280. 1085. 856. 1128. 1170. 1215. 1128. 1201. 1292. 1359.
6 America… NA NA NA NA NA NA NA NA NA NA
# ℹ 42 more variables: `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
# `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
# `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
# `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
# `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
# `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
# `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, …
- Use the
pivot_longerfunction to reshape the GDP data so it looks like this:
# A tibble: 7,988 × 3
country year gdp
<chr> <chr> <dbl>
1 Albania 1980 1061.
2 Albania 1981 1100.
3 Albania 1982 1111.
4 Albania 1983 1101.
5 Albania 1984 1065.
6 Albania 1985 1060.
7 Albania 1986 1092.
8 Albania 1987 1054.
9 Albania 1988 1014.
10 Albania 1989 1092.
# ℹ 7,978 more rows
- There should be no NAs
- There should be separate columns for
yearandgdp
Solution:
gdp_data |>
pivot_longer(cols = -country,
names_to = "year",
values_to = "gdp",
values_drop_na = T)# A tibble: 7,988 × 3
country year gdp
<chr> <chr> <dbl>
1 Albania 1980 1061.
2 Albania 1981 1100.
3 Albania 1982 1111.
4 Albania 1983 1101.
5 Albania 1984 1065.
6 Albania 1985 1060.
7 Albania 1986 1092.
8 Albania 1987 1054.
9 Albania 1988 1014.
10 Albania 1989 1092.
# ℹ 7,978 more rows
WHO data
The tidyr package contains a dataset called who2 which contains information on tuberculosis counts for each country in each year, broken down by method of diagnosis, sex, and age group:
who2# A tibble: 7,240 × 58
country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1980 NA NA NA NA NA NA
2 Afghanistan 1981 NA NA NA NA NA NA
3 Afghanistan 1982 NA NA NA NA NA NA
4 Afghanistan 1983 NA NA NA NA NA NA
5 Afghanistan 1984 NA NA NA NA NA NA
6 Afghanistan 1985 NA NA NA NA NA NA
7 Afghanistan 1986 NA NA NA NA NA NA
8 Afghanistan 1987 NA NA NA NA NA NA
9 Afghanistan 1988 NA NA NA NA NA NA
10 Afghanistan 1989 NA NA NA NA NA NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
# sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
# sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
# sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
# sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
# sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
Columns like sp_m_1524 contain the number of tuberculosis cases. The column names contain information about the method of diagnosis (rel, sn, sp, or ep), sex (f or m), and age group (014 = 0-14 years, 1524 = 15-24 years, etc.).
Ideally, we would like these variables to each have their own column.
- Reshape the
who2data so it looks like this:
# A tibble: 76,046 × 6
country year diagnosis sex age_group count
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 1997 sp m 014 0
2 Afghanistan 1997 sp m 1524 10
3 Afghanistan 1997 sp m 2534 6
4 Afghanistan 1997 sp m 3544 3
5 Afghanistan 1997 sp m 4554 5
6 Afghanistan 1997 sp m 5564 2
7 Afghanistan 1997 sp m 65 0
8 Afghanistan 1997 sp f 014 5
9 Afghanistan 1997 sp f 1524 38
10 Afghanistan 1997 sp f 2534 36
# ℹ 76,036 more rows
- There should be no NAs
- There should be separate columns for
diagnosis,sex, andage_group - The
countcolumn contains the number of cases
Solution:
who2 |>
pivot_longer(cols = -c(country, year),
names_to = c("diagnosis", "sex", "age_group"),
names_sep = "_",
values_to = "count",
values_drop_na = T)# A tibble: 76,046 × 6
country year diagnosis sex age_group count
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 1997 sp m 014 0
2 Afghanistan 1997 sp m 1524 10
3 Afghanistan 1997 sp m 2534 6
4 Afghanistan 1997 sp m 3544 3
5 Afghanistan 1997 sp m 4554 5
6 Afghanistan 1997 sp m 5564 2
7 Afghanistan 1997 sp m 65 0
8 Afghanistan 1997 sp f 014 5
9 Afghanistan 1997 sp f 1524 38
10 Afghanistan 1997 sp f 2534 36
# ℹ 76,036 more rows