library(tidyverse)
library(rvest)
library(polite)
session <- bow("https://www.cheese.com/alphabetical/a/")
Activity: Web scraping
Cheese
The web page https://www.cheese.com/alphabetical/a/ contains an alphabetical list of cheeses whose names begin with “A”. (The website has similar pages for the other letters of the alphabet.)
By default, when you visit this page the first 20 cheeses will be displayed. The name of each cheese links to its own page; e.g. if you click “Aarewasser” you will be taken to https://www.cheese.com/aarewasser/.
Questions
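To scrape the cheese information in the following questions, we will be polite: the polite package checks the site's robots.txt and rate-limits our requests. Start with the setup code above, which loads the packages and uses bow() to open a session with the site.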
- Scrape the hyperlinks from the first page of “A” cheeses:
a_cheeses <- session |>
  scrape() |>
  html_elements("...") |> # fill in!
  ... # fill in!
Solution:
a_cheeses <- session |>
  scrape() |>
  html_elements("h3 > a") |>
  html_attr("href")
a_cheeses
[1] "/aarewasser/" "/abbaye-de-belloc/"
[3] "/abbaye-de-belval/" "/abbaye-de-citeaux/"
[5] "/tamie/" "/abbaye-de-timadeuc/"
[7] "/abbaye-du-mont-des-cats/" "/abbots-gold/"
[9] "/abertam/" "/abondance/"
[11] "/acapella/" "/accasciato/"
[13] "/ackawi/" "/acorn/"
[15] "/adelost/" "/adl-brick-cheese/"
[17] "/adl-mild-cheddar/" "/admiral-collingwood/"
[19] "/affidelice-au-chablis/" "/affineur-walo-rotwein-sennechas/"
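To see why the selector `h3 > a` paired with `html_attr("href")` returns the links, here is a minimal offline sketch. The HTML fragment is made up for illustration and is only assumed to resemble the structure of the real page; `toy_page`, `toy_links`, `toy_urls`, and `base_url` are hypothetical names. It also shows one possible way to turn the relative paths into full URLs.

```r
library(rvest)

# Toy HTML, assumed (not verified) to resemble the cheese listing markup
toy_page <- minimal_html('
  <div><h3><a href="/aarewasser/">Aarewasser</a></h3></div>
  <div><h3><a href="/abbaye-de-belloc/">Abbaye de Belloc</a></h3></div>
  <p><a href="/about/">About</a></p>
')

# "h3 > a" keeps only <a> elements that are direct children of an <h3>,
# so the "About" link inside the <p> is not selected
toy_links <- toy_page |>
  html_elements("h3 > a") |>
  html_attr("href")

# Prepend the site root to turn relative paths into full URLs
base_url <- "https://www.cheese.com"
toy_urls <- paste0(base_url, toy_links)
toy_urls
```

With polite, the relative paths are usually enough, since `nod()` accepts a path relative to the session's host; the full URLs are mainly useful for checking a page in the browser.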
If you visit the page for each cheese, you will get information on that cheese. For example, visiting https://www.cheese.com/aarewasser/, we can see that Aarewasser comes from Switzerland, is a semi-soft cheese, and has a buttery texture.
- Use string and web scraping tools to extract this information from the web page. We will start by nodding at the updated URL:
current_page <- session |>
  nod("/aarewasser")
current_page |>
  scrape() |>
  html_elements("...") |> # fill in!
  ... # fill in!
Solution:
current_page <- session |>
  nod("/aarewasser")
current_page |>
  scrape() |>
  html_elements("li > p") |>
  html_text2() |>
  str_subset("^(Country|Type|Texture)")
[1] "Country of origin: Switzerland" "Type: semi-soft"
[3] "Texture: buttery"
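The strings above still mix each label with its value. As a possible next step, the sketch below splits them apart with stringr. It works offline on the three strings shown in the output; `cheese_info`, `parts`, and `cheese_values` are hypothetical names introduced for illustration, not part of the activity.

```r
library(stringr)

# The three strings returned by the solution above
cheese_info <- c(
  "Country of origin: Switzerland",
  "Type: semi-soft",
  "Texture: buttery"
)

# Split each string at the first ": " into a label column and a value column
parts <- str_split_fixed(cheese_info, ": ", n = 2)

# Named character vector: names are the labels, elements are the values
cheese_values <- setNames(parts[, 2], parts[, 1])
cheese_values
```

Using `n = 2` keeps any later ": " inside the value intact, which guards against values that happen to contain a colon.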