Activity: Intro to web scraping

Web scraping

In this activity, you will scrape data from the page here: https://rvest.tidyverse.org/articles/starwars.html

This is a small HTML page designed to help you practice some web scraping fundamentals. We can read the HTML page in R with the following:

library(rvest)
library(tidyverse)
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

Questions

To begin, let’s try to get the titles of the Star Wars films included on the web page.

  1. In your browser, inspect the web page (right-click the element you want, and click “Inspect”). What type of HTML elements contain the film titles?

Solution: h2 elements

  2. Use the html_elements function to find the HTML elements you identified in question 1.

Solution:

starwars |>
  html_elements("h2")

  3. Did your elements in question 2 capture only the elements corresponding to film titles? Or did you get any additional results that you don’t want to include? If so, modify your code from question 2 to choose the elements you want more specifically.

Solution: My solution for question 2 above also captured “On this page”, which is not one of the film titles but is contained in an h2 element.

There are a few ways to modify this code. Here is one option: the film titles are all in h2 elements with a data-id attribute.

starwars |>
  html_elements("h2[data-id]")

Here is another option: the film titles are all in h2 elements that are contained in section elements. So, first find the section elements, and then take the first h2 element within each one:

starwars |>
  html_elements("section") |>
  html_element("h2")

This can also be written as a single selector. Here > means “child node”, i.e. section > h2 matches h2 elements that are direct children of a section element:

starwars |>
  html_elements("section > h2")
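
To compare these selectors without depending on the live site, here is a minimal sketch on a made-up page. The HTML below is hypothetical and only loosely imitates the structure described above; minimal_html is the rvest helper for building a small document from a string.

```r
library(rvest)

# Hypothetical miniature page (not the real Star Wars page)
page <- minimal_html('
  <h2>On this page</h2>
  <section>
    <h2 data-id="1">First film</h2>
    <p>Released: 2001-01-01</p>
  </section>
  <section>
    <h2 data-id="2">Second film</h2>
    <p>Released: 2002-02-02</p>
  </section>
')

# Every h2, including the unwanted "On this page" heading
all_h2 <- page |> html_elements("h2")

# Only h2 elements that carry a data-id attribute
by_attr <- page |> html_elements("h2[data-id]")

# Only h2 elements that are direct children of a section
by_parent <- page |> html_elements("section > h2")

# The first h2 inside each section
nested <- page |> html_elements("section") |> html_element("h2")

length(all_h2)   # 3
length(by_attr)  # 2
```

On this toy page the last three approaches select the same two headings. On a real page they can differ, for example when a section holds more than one h2, so it is worth checking the results in the browser inspector.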

  4. Now pull out just the film titles from the elements with html_text2:

Solution:

starwars |>
  html_elements("h2[data-id]") |> 
  html_text2()
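
As a side illustration of why html_text2 is used here, the sketch below compares it with html_text on a hypothetical heading whose source contains messy whitespace. The HTML string is invented for the example and is not taken from the real page.

```r
library(rvest)

# Hypothetical heading with irregular whitespace in the source
page <- minimal_html("<h2>  An   Example
   Title </h2>")

heading <- page |> html_element("h2")

raw <- heading |> html_text()     # text as it appears in the HTML source
clean <- heading |> html_text2()  # whitespace collapsed, closer to what a browser shows

nchar(raw) > nchar(clean)
```

For tidy pages the two functions often return the same text; html_text2 mainly helps when the source has line breaks or extra spacing inside an element.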

  5. Now write code to pull out just the release date for each film. Note: specifying HTML elements won’t be enough to get only the release dates; you will need to use some string functions (such as str_subset and str_extract) too.

Solution:

starwars |>
  html_elements("p") |>
  html_text2() |>
  str_subset("^Released") |>
  str_extract("(?<=: ).+")
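
The two string steps can be tried on their own. This is a minimal sketch on a hypothetical character vector standing in for the output of html_text2(); the values are invented, and the final conversion with as.Date assumes the dates are in year-month-day form.

```r
library(stringr)

# Hypothetical paragraph text, standing in for the scraped <p> contents
paragraphs <- c(
  "Released: 2001-01-01",
  "Director: Jane Doe",
  "Some opening crawl text",
  "Released: 2002-02-02"
)

# Keep only the strings that start with "Released"
released <- str_subset(paragraphs, "^Released")

# (?<=: ) is a lookbehind: match everything that follows ": ",
# without including ": " in the result
dates <- str_extract(released, "(?<=: ).+")
dates
# "2001-01-01" "2002-02-02"

# Optional: convert the strings to Date objects
as.Date(dates)
```

str_subset filters the vector, while str_extract returns one match per remaining string, so the result has one entry per film.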