Activity: Intro to web scraping
Instructions:
- Work with a neighbor to answer the following questions
- To get started, download the class activity template file
- When you are finished, render the file to HTML and submit the HTML file to Canvas (let me know if you encounter any problems)
Web scraping
In this activity, you will scrape data from the page here: https://rvest.tidyverse.org/articles/starwars.html
This is a small HTML page designed to help you practice some web scraping fundamentals. We can read the HTML page into R with the following:
library(rvest)
library(tidyverse)
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
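If you would like to see what kind of object you get back from reading a page before you work with the real one, the sketch below may help. It builds a tiny page from a string with rvest's minimal_html helper, so it runs without an internet connection. The page contents and the name toy are made up for illustration and have nothing to do with the Star Wars page.

```r
library(rvest)

# A made-up page for illustration only; it is not the Star Wars page.
toy <- minimal_html('
  <ul>
    <li class="book">Dune</li>
    <li class="book">Emma</li>
  </ul>
')

# The parsed page is an xml_document, the same kind of object that
# read_html() returns for a URL.
class(toy)

# Printing the object shows the parsed tree of HTML elements.
toy
```

Building small pages like this is a convenient way to try out selectors on HTML you fully understand before pointing the same code at a live site.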
Questions
To begin, let’s try to get the titles of the Star Wars films included on the web page.
1. In your browser, inspect the web page (right-click the element you want, and click “Inspect”). What type of HTML elements contain the film titles?
2. Use the html_elements function to find the HTML elements you identified in question 1:
starwars |>
  html_elements("...") # fill this in!
3. Did your elements in question 2 capture only the elements corresponding to film titles? Or did you get any additional results that you don’t want to include? If so, modify your code from question 2 to more specifically choose the elements you want.
4. Now pull out just the film titles from the elements with html_text2:
starwars |>
  html_elements("...") |> # fill this in!
  html_text2()
5. Now write code to pull out just the release date for each film. Note: specifying HTML elements won’t be enough to get only the release dates; you will need to use some string functions (such as str_subset and str_extract) too.
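If the workflow in the questions above feels abstract, here is a minimal sketch of the same steps on a toy page. Everything about the page is an assumption made for illustration: the books, the span elements, and the class names name and info are invented, and the Star Wars page is structured differently, so you will still need to inspect it yourself. The sketch loads stringr directly, which is also loaded as part of the tidyverse.

```r
library(rvest)
library(stringr)

# A made-up page for illustration; its structure is NOT the Star Wars page's.
books <- minimal_html('
  <h1>Reading list</h1>
  <div class="book">
    <span class="name">Dune</span>
    <span class="info">Published: 1965</span>
    <span class="info">Pages: 412</span>
  </div>
  <div class="book">
    <span class="name">Emma</span>
    <span class="info">Published: 1815</span>
    <span class="info">Pages: 474</span>
  </div>
')

# A bare element name selects every such element, wanted or not.
all_spans <- books |>
  html_elements("span") |>
  html_text2()

# A more specific CSS selector narrows the match to the names only.
book_names <- books |>
  html_elements("span.name") |>
  html_text2()

# The "info" spans mix two kinds of text, so selectors alone are not enough.
info <- books |>
  html_elements("span.info") |>
  html_text2()

# str_subset() keeps the strings that match a pattern;
# str_extract() then pulls out the matching piece of each one.
years <- info |>
  str_subset("Published") |>
  str_extract("\\d{4}")
```

In this toy example, all_spans contains all six pieces of text, book_names contains only the two names, and years contains "1965" and "1815". The same three moves (select broadly, narrow the selector, then clean up with string functions) apply to the questions above, but the selectors and patterns you need will differ.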