Activity: Data wrangling with Gapminder

Data

We will work with data on global health and economic development. The data is called gapminder, and is part of the gapminder package in R. You will need to install the gapminder package before beginning the questions.
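
The setup step might look like the following. This is a minimal sketch, assuming the package is installed from CRAN; the mirror URL is an assumption and any CRAN mirror should work.

```r
# Install the package only if it is not already available.
# The mirror URL below is an assumption; substitute your preferred CRAN mirror.
if (!requireNamespace("gapminder", quietly = TRUE)) {
  install.packages("gapminder", repos = "https://cloud.r-project.org")
}

library(gapminder)

# Inspect the column names and types before starting the questions
str(gapminder)
```

The columns used in the questions below are `continent`, `year`, `lifeExp`, and `gdpPercap`.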

dplyr practice with gapminder data

In this activity, you will practice using the dplyr functions we learned in class.

  1. Fill in the following code to create a new data frame, containing only countries in 2007 with life expectancy at least 70 years and GDP per capita at most $20000.
new_gapminder <- gapminder |>
  filter(year ..., 
         lifeExp ..., 
         gdpPercap ...)

Solution:

library(gapminder)
library(tidyverse)

new_gapminder <- gapminder |>
  filter(year == 2007, 
         lifeExp >= 70, 
         gdpPercap <= 20000)

  2. Fill in the following code to count the number of countries in each continent in the data for 2007.
gapminder |>
  filter(...) |>
  count(...)

Solution: Note that when we look only at 2007, each row in the data represents exactly one country. When we count by the continent variable, we are asking for the number of rows – that is, the number of countries – for each continent.

gapminder |>
  filter(year == 2007) |>
  count(continent)
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa       52
2 Americas     25
3 Asia         33
4 Europe       30
5 Oceania       2
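
To make the reasoning in this solution concrete, the sketch below first checks that no country appears twice in the 2007 rows, and then writes the same count with `group_by()` and `summarize()`. The object names (`gap_2007`, `counts_short`, `counts_long`) are introduced here for illustration only.

```r
library(gapminder)
library(dplyr)

gap_2007 <- gapminder |>
  filter(year == 2007)

# anyDuplicated() returns 0 when every country appears only once
anyDuplicated(gap_2007$country)

# count(continent) is shorthand for grouping and then counting rows
counts_short <- gap_2007 |>
  count(continent)

counts_long <- gap_2007 |>
  group_by(continent) |>
  summarize(n = n())

# Compare the two sets of counts
identical(counts_short$n, counts_long$n)
```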

  3. Fill in the following code to create a data frame with a new column that is the natural log of GDP per capita. (Hint: in R, the natural log function is log).
new_gapminder <- gapminder |>
  mutate(log_gdp_percap = ...)

Solution:

new_gapminder <- gapminder |>
  mutate(log_gdp_percap = log(gdpPercap))

  4. Fill in the following code to calculate the median natural log of GDP per capita in countries with a life expectancy of at least 70 years in 2007. (Hint: in R, the median function is median).
gapminder |>
  mutate(log_gdp_percap = ...) |>
  filter(...) |>
  summarize(...)

Solution:

gapminder |>
  mutate(log_gdp_percap = log(gdpPercap)) |>
  filter(year == 2007, lifeExp >= 70) |>
  summarize(med_log_gdp = median(log_gdp_percap))
# A tibble: 1 × 1
  med_log_gdp
        <dbl>
1        9.40

  5. Fill in the following code to calculate the median natural log of GDP per capita in countries with a life expectancy of at least 70 years in 2007, broken down by continent.
gapminder |>
  mutate(...) |>
  filter(...) |>
  group_by(...) |>
  summarize(...)

Solution:

gapminder |>
  mutate(log_gdp_percap = log(gdpPercap)) |>
  filter(year == 2007, lifeExp >= 70) |>
  group_by(continent) |>
  summarize(med_log_gdp = median(log_gdp_percap))
# A tibble: 5 × 2
  continent med_log_gdp
  <fct>           <dbl>
1 Africa           8.87
2 Americas         9.11
3 Asia             9.39
4 Europe          10.2 
5 Oceania         10.3 

  6. Does it matter whether we mutate or filter first in question 5?

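Solution: No. Here log() is computed separately for each row, so the rows that survive the filter get the same log_gdp_percap values whether the column is created before or after filtering. Swapping the two steps gives the same output: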

gapminder |>
  filter(year == 2007, lifeExp >= 70) |>
  mutate(log_gdp_percap = log(gdpPercap)) |>
  group_by(continent) |>
  summarize(med_log_gdp = median(log_gdp_percap))
# A tibble: 5 × 2
  continent med_log_gdp
  <fct>           <dbl>
1 Africa           8.87
2 Americas         9.11
3 Asia             9.39
4 Europe          10.2 
5 Oceania         10.3 
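
By contrast, the order would matter if the new column depended on other rows in the data. The sketch below uses a hypothetical column, `gdp_rel` (GDP per capita divided by the mean over whatever rows are present), which is not part of the activity and is shown only to illustrate the point.

```r
library(gapminder)
library(dplyr)

# mean() is taken over all years and countries, then rows are dropped
mutate_first <- gapminder |>
  mutate(gdp_rel = gdpPercap / mean(gdpPercap)) |>
  filter(year == 2007, lifeExp >= 70)

# mean() is taken only over the rows that pass the filter
filter_first <- gapminder |>
  filter(year == 2007, lifeExp >= 70) |>
  mutate(gdp_rel = gdpPercap / mean(gdpPercap))

# Same rows in both results, but the gdp_rel values differ
nrow(mutate_first) == nrow(filter_first)
isTRUE(all.equal(mutate_first$gdp_rel, filter_first$gdp_rel))
```

Because `mean()` summarizes across rows, removing rows first changes its value. A row-wise function such as `log()` does not have this issue.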

  7. Calculate the median life expectancy for each continent in 2007, and the correlation between log GDP per capita and life expectancy for each continent in 2007.

Solution: Note that the correlation for Oceania is meaningless because it is based on only two countries. Two points always lie exactly on a line, so the correlation is perfect regardless of the underlying relationship.

gapminder |>
  filter(year == 2007) |>
  group_by(continent) |>
  summarize(med_life = median(lifeExp),
            cor_life_gdp = cor(log(gdpPercap), lifeExp),
            N = n())
# A tibble: 5 × 4
  continent med_life cor_life_gdp     N
  <fct>        <dbl>        <dbl> <int>
1 Africa        52.9        0.452    52
2 Americas      72.9        0.780    25
3 Asia          72.4        0.800    33
4 Europe        78.6        0.836    30
5 Oceania       80.7        1         2
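
The point about two observations can be checked directly. In this small sketch the numbers are made up for illustration and are not taken from the gapminder data.

```r
# Two arbitrary points with distinct x and y values (made-up numbers)
x <- c(1, 2)
y_up <- c(3, 7)
y_down <- c(7, 3)

# With only two points, the correlation is determined by the sign of the slope
cor_up <- cor(x, y_up)      # line through the points slopes upward
cor_down <- cor(x, y_down)  # line through the points slopes downward

cor_up
cor_down
```

Whatever values are chosen, the result is 1 for an upward slope and -1 for a downward slope, which is why a correlation from two countries carries no information about strength.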

  8. Using your summary statistics, describe the relationship between GDP per capita and life expectancy, and summarize the differences between continents.

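Solution: Within each continent, there tends to be a positive relationship between log GDP per capita and life expectancy in 2007. The relationship is noticeably weaker in Africa (correlation about 0.45) than in the Americas, Asia, and Europe (about 0.78 to 0.84); the value for Oceania is based on only two countries and should be ignored. Median life expectancy also differs across continents: it is lowest in Africa (52.9 years), similar in the Americas and Asia (about 72 to 73 years), and highest in Europe (78.6 years) and Oceania (80.7 years).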