Data wrangling fundamentals

What do you do with data?

Data manipulation and cleaning
Calculate summary statistics
Visualization
Input for modeling

Data manipulation

glimpse(starwars)

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

What manipulation might I want to do with the starwars data?

`dplyr`: Tools for data wrangling

part of the tidyverse
provides a “grammar of data manipulation”: useful verbs (functions) for manipulating data
we will cover the key dplyr functions

Some core verbs for data wrangling

filter(): take a subset of the rows (i.e., observations)
select(): take a subset of the columns (i.e., features, variables)
mutate(): add or modify existing columns
arrange(): sort the rows
group_by(): group rows by one or more variables
summarize(): aggregate the data across rows (often after grouping)

Creating a subset of the rows

Question: Suppose I only want the droids in the starwars data. How would I choose only those rows?

Creating a subset of the rows

Question: Suppose I only want the droids in the starwars data. How would I choose only those rows?

filter(starwars, species == "Droid")

# A tibble: 6 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
4 IG-88     200   140 none       metal       red               15 none  masculi…
5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
6 BB8        NA    NA none       none        black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Creating a subset of the rows

starwars |>
  filter(species == "Droid")

# A tibble: 2 × 14
  name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
1 C-3PO    167    75 <NA>       gold        yellow           112 none  masculine
2 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

|> is called the pipe. It means “take <this>, THEN do <that>”
filter keeps only the rows which satisfy a specific condition

Calculating summary statistics

Question: What is the average height for droids in the dataset?

Calculating summary statistics

Question: What is the average height for droids in the dataset?

starwars |>
  filter(species == "Droid") |>
  summarize(mean_height = mean(height))

# A tibble: 1 × 1
  mean_height
        <dbl>
1          NA

pipes (|>) can be chained together
summarize calculates summary statistics
Why am I getting NA?

Handling missing values

# A tibble: 6 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
4 IG-88     200   140 none       metal       red               15 none  masculi…
5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
6 BB8        NA    NA none       none        black             NA none  masculi…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

starwars |>
  filter(species == "Droid") |>
  summarize(mean_height = mean(height, na.rm=T))

# A tibble: 1 × 1
  mean_height
        <dbl>
1        131.

Calculating summary statistics

Question: What if I want the average height for humans?

starwars |>
  filter(species == "Droid") |>
  summarize(mean_height = mean(height, na.rm=T))

Calculating summary statistics

Question: What if I want the average height for humans?

starwars |>
  filter(species == "Human") |>
  summarize(mean_height = mean(height, na.rm=T))

# A tibble: 1 × 1
  mean_height
        <dbl>
1         178

Calculating summary statistics

Question: What is the average height for each species?

Calculating summary statistics

Question: What is the average height for each species?

starwars |>
  group_by(species) |>
  summarize(mean_height = mean(height, na.rm=T))

# A tibble: 38 × 2
   species   mean_height
   <chr>           <dbl>
 1 Aleena            79 
 2 Besalisk         198 
 3 Cerean           198 
 4 Chagrian         196 
 5 Clawdite         168 
 6 Droid            131.
 7 Dug              112 
 8 Ewok              88 
 9 Geonosian        183 
10 Gungan           209.
# ℹ 28 more rows

Creating new variables

Question: What is the distribution of the ratio of body mass to height?

Creating new variables

Question: What is the distribution of the ratio of body mass to height?

starwars |>
  mutate(body_ratio = mass/height)

Creating new variables

starwars |>
  mutate(body_ratio = mass/height) |>
  group_by(species) |>
  summarize(mean_ratio = mean(body_ratio, na.rm=T),
            sd_ratio = sd(body_ratio, na.rm=T))

# A tibble: 38 × 3
   species   mean_ratio sd_ratio
   <chr>          <dbl>    <dbl>
 1 Aleena         0.190  NA     
 2 Besalisk       0.515  NA     
 3 Cerean         0.414  NA     
 4 Chagrian     NaN      NA     
 5 Clawdite       0.327  NA     
 6 Droid          0.453   0.174 
 7 Dug            0.357  NA     
 8 Ewok           0.227  NA     
 9 Geonosian      0.437  NA     
10 Gungan         0.351   0.0207
# ℹ 28 more rows

Creating new variables

starwars |>
  mutate(body_ratio = mass/height) |>
  group_by(species) |>
  summarize(mean_ratio = mean(body_ratio, na.rm=T),
            sd_ratio = sd(body_ratio, na.rm=T),
            N = n())

# A tibble: 38 × 4
   species   mean_ratio sd_ratio     N
   <chr>          <dbl>    <dbl> <int>
 1 Aleena         0.190  NA          1
 2 Besalisk       0.515  NA          1
 3 Cerean         0.414  NA          1
 4 Chagrian     NaN      NA          1
 5 Clawdite       0.327  NA          1
 6 Droid          0.453   0.174      6
 7 Dug            0.357  NA          1
 8 Ewok           0.227  NA          1
 9 Geonosian      0.437  NA          1
10 Gungan         0.351   0.0207     3
# ℹ 28 more rows

Summary so far

filter: choose certain rows
summarize: calculate summary statistics
group_by: group rows together
mutate: create new columns

Class activity

https://sta279-f25.github.io/class_activities/ca_02.html

Work with a neighbor on the class activity
At the end of class, submit your work as an HTML file on Canvas (one per group, list all your names)
I will come around and answer any questions

For next time, read:

Chapter 3 in R for Data Science (2nd ed.)
Chapter 4 in Modern Data Science with R

Data wrangling fundamentals

What do you do with data?

Data manipulation

dplyr: Tools for data wrangling

Some core verbs for data wrangling

Creating a subset of the rows

Creating a subset of the rows

Creating a subset of the rows

Calculating summary statistics

Calculating summary statistics

Handling missing values

Calculating summary statistics

Calculating summary statistics

Calculating summary statistics

Calculating summary statistics

Creating new variables

Creating new variables

Creating new variables

Creating new variables

Summary so far

Class activity

`dplyr`: Tools for data wrangling