Activity: Intro to web scraping

Instructions:

Web scraping

In this activity, you will scrape data from the page here: https://rvest.tidyverse.org/articles/starwars.html

This is a small HTML page designed to help you practice some web scraping fundamentals. You can read the page into R with the following code:

library(rvest)
library(tidyverse)
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
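
Before turning to the real page, it may help to see how the selection functions behave on a tiny hand-written document. The sketch below is only an illustration: the HTML snippet, the object names, and the `book` class are made up for this example and say nothing about how the Star Wars page is structured. It assumes rvest's `minimal_html()` is available to build a document from a string, so it runs without a network connection.

```r
library(rvest)

# A made-up document (not the Star Wars page): two "book" blocks and a footer.
toy <- minimal_html('
  <div class="book"><h3>First Book</h3><p>Pages: 120</p></div>
  <div class="book"><h3>Second Book</h3><p>Pages: 340</p></div>
  <p>Footer note</p>
')

# A bare tag selector matches every <p>, including the footer.
all_p <- toy |> html_elements("p")

# A more specific CSS selector keeps only <p> elements inside a .book block.
book_p <- toy |> html_elements(".book p")

# html_text2() converts the selected elements to a character vector.
book_text <- book_p |> html_text2()
book_text
```

Comparing `length(all_p)` with `length(book_p)` is a quick way to check whether a selector is picking up more than you intended.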

Questions

To begin, let’s try to get the titles of the Star Wars films included on the web page.

  1. In your browser, inspect the web page (right-click the element you want, and click “Inspect”). What type of HTML elements contain the film titles?

  2. Use the html_elements function to find the HTML elements you identified in question 1:

starwars |>
  html_elements("...") # fill this in!

  3. Did your code in question 2 return only the elements that correspond to film titles, or did it also return elements you don’t want? If it returned extras, modify your code from question 2 so that it selects only the elements you want.

  4. Now pull out just the film titles from the elements with html_text2:

starwars |>
  html_elements("...") |> # fill this in!
  html_text2()

  5. Now write code to pull out just the release date for each film. Note: specifying HTML elements won’t be enough to get only the release dates; you will also need some string functions (such as str_subset and str_extract).
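
As a reminder of how the two string functions mentioned above behave, here is a minimal sketch on made-up strings. The `toy_text` vector and the date pattern are assumptions for illustration only; the text you scrape from the page may need a different pattern.

```r
library(stringr)

# Made-up strings standing in for scraped text (not taken from the page).
toy_text <- c("Published: 2001-03-15", "Author: Jane Doe", "Published: 1999-11-02")

# str_subset() keeps only the strings that match the pattern.
published <- str_subset(toy_text, "^Published")

# str_extract() pulls the first match of the pattern out of each string.
dates <- str_extract(published, "\\d{4}-\\d{2}-\\d{2}")
dates
```

Filtering first and extracting second keeps the two steps easy to check separately: you can print `published` to confirm you kept the right strings before worrying about the pattern.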