Homework 6
Due: Friday, October 24, 11:59pm
GitHub Classroom link: Canvas has been down, so here is the direct link:
https://classroom.github.com/a/iDBpjdo6
Instructions:
- Go to Canvas -> Assignments -> HW 6. Open the GitHub Classroom assignment link
- Follow the instructions to accept the assignment and clone the repository to your local computer
- The repository contains the file hw_06.qmd. Write your code and answers to the questions in the Quarto document. Commit and push to GitHub regularly.
- When you are finished, make sure to Render your Quarto document; this will produce a hw_06.md file which is easy to view on GitHub. Commit and push both the hw_06.qmd and hw_06.md files to GitHub
- Finally, request feedback on your assignment on the “Feedback” pull request on your HW 6 repository
Important: Make sure to include both the .qmd and .md files when you submit to receive full credit
The Great British Bake Off
The Great British Bake Off (called the Great British Baking Show in the US because of trademark issues with Pillsbury – yes, really) is a British competition baking show. Each episode involves three challenges: a signature bake, a technical challenge, and a showstopper, all centered around a theme (bread week, cake week, pastry week, etc.). The participant who performs worst is eliminated (with a couple of rare exceptions), and the participant who performs best is awarded “star baker” for the week.
The goal of this assignment is to use web scraping and data wrangling (including working with strings) to collect and analyze data about the show. We will scrape the data from Wikipedia articles about the show.
Getting the episode names and information
We will begin with series 2 as an example. The Wikipedia article on series 2 can be found at:
https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_2
If you scroll down, you will notice that there is an “Episodes” section, which contains the headings “Episode 1: Cakes”, “Episode 2: Tarts”, etc.
We will soon learn some basic web scraping tools. For now, I will provide you with code to get the episode titles into R:
library(tidyverse)
library(rvest)
episodes <- read_html("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_2") |>
  html_elements(".mw-heading > [id^='Episode_']") |>
  html_text2()
Question 1
Using string and regular expression tools, use the episodes vector provided by the code above to create a data frame which contains the episode information in the following format:
## episode name
## 1 1 Cakes
## 2 2 Tarts
## 3 3 Bread
## 4 4 Biscuits
## 5 5 Pies
## 6 6 Desserts
## 7 7 Pâtisserie
## 8 8 Final
Question 2
Re-run the web scraping code, and your code from question 1, for Series 4 instead of Series 2. You should get some unexpected results; explain what goes wrong, and why you get those results (look at the Wikipedia page!).
(If, for whatever reason, you did not produce any unexpected results for Series 4, still look at the Wikipedia page, and describe what is different for the episode list of Series 4 vs. Series 2).
Question 3
To fix the issue from question 2, we will get rid of the “Episodes” from the masterclass. This can be done with a little help from the str_subset function.
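As a minimal sketch of how str_subset behaves, here it is applied to a made-up vector of headings (hypothetical values, not the Bake Off data). It keeps the elements that match a pattern, or, with negate = TRUE, the elements that do not:

```r
library(stringr)

# Made-up headings for illustration only (not scraped from Wikipedia).
headings <- c("Chapter 1: Intro", "Chapter 2: Methods", "Appendix A: Data")

# Keep only the elements that match the pattern.
kept <- str_subset(headings, "^Chapter")
kept

# negate = TRUE keeps the elements that do NOT match the pattern.
without_appendix <- str_subset(headings, "^Appendix", negate = TRUE)
without_appendix
```

Both calls return the two "Chapter" headings here; which form reads more naturally depends on whether it is easier to describe what to keep or what to drop.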
Fill in the modified web scraping code below, with an appropriate regular expression, so that the masterclass episodes are not included for Series 4:
Question 4
Use an appropriate regular expression to extract the series number from the Wikipedia URL.
For example:
str_extract("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_4",
            ... ) # fill this in!
Desired output:
## [1] "4"
Question 5
Using your answers to the previous questions, write a function called clean_ggbo which satisfies the following requirements:
- Input: url, a string containing the URL for one of the GBBO seasons (excluding season 1)
- Output: a data frame with columns for the episode number, episode title, and season number
Here are some examples of this function in action:
## episode name season
## 1 1 Cakes 2
## 2 2 Tarts 2
## 3 3 Bread 2
## 4 4 Biscuits 2
## 5 5 Pies 2
## 6 6 Desserts 2
## 7 7 Pâtisserie 2
## 8 8 Final 2
## episode name season
## 1 1 Cakes 4
## 2 2 Bread 4
## 3 3 Desserts 4
## 4 4 Pies and Tarts 4
## 5 5 Biscuits and Traybakes 4
## 6 6 Sweet Dough 4
## 7 7 Pastry 4
## 8 8 Alternative Ingredients 4
## 9 9 French week 4
## 10 10 Final 4
Iterating over series
So far, we have only worked with one series at a time. Next, we will want to iterate over multiple series. As a first step, it will help to create a vector of URLs, one for each of the Bake Off series.
The str_glue function will help here. Here is an example of the str_glue function in action:
## class_activities/ca_1.html
## class_activities/ca_2.html
## class_activities/ca_3.html
## class_activities/ca_4.html
## class_activities/ca_5.html
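One call that would print output like the lines above is sketched here. The exact template string is an assumption reconstructed from that output:

```r
library(stringr)

# The expression inside {} is evaluated as R code. Because 1:5 is a vector,
# the template is filled in once per value, giving 5 strings.
paths <- str_glue("class_activities/ca_{1:5}.html")
paths
```

The key point is that str_glue is vectorised over whatever appears inside the braces, so a single template can generate a whole vector of strings.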
Question 6
Use the str_glue function to create a vector called season_urls which contains the URLs of the Wikipedia pages for seasons 1 - 15 of the Great British Bake Off.
Now let’s iterate over the season_urls vector!
Question 7
Use the map and list_rbind functions from the purrr package to create a dataframe which contains episode information for all episodes in seasons 1 - 15. The combined output should look something like this:
## episode name season
## <num> <char> <num>
## 1: 1 Cakes 1
## 2: 2 Biscuits 1
## 3: 3 Bread 1
## 4: 4 Puddings 1
## 5: 5 Pastry 1
## ---
## 140: 6 Autumn 15
## 141: 7 Desserts 15
## 142: 8 The '70s 15
## 143: 9 Patisserie 15
## 144: 10 Final 15
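To illustrate the map and list_rbind pattern on its own, here is a small sketch with a toy helper. The function make_rows and its columns are hypothetical stand-ins, not part of the assignment:

```r
library(purrr)

# Hypothetical helper: builds a small data frame for one group id.
make_rows <- function(id) {
  data.frame(item = 1:2, group = id)
}

# map() applies the helper to each id and returns a list of data frames;
# list_rbind() then stacks that list row-wise into one data frame.
combined <- map(1:3, make_rows) |>
  list_rbind()
combined
```

The same shape applies whenever a function returns one data frame per input: map collects the pieces, and list_rbind combines them.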
Exploring the data
Now let’s explore the episode information! For each of the following questions, you must write code to produce the answer. You may not answer just by looking through the combined data manually.
Question 8
Which seasons have fewer than 10 episodes?
Now let’s look at the episode titles, which describe the “theme” of each week. Some of these themes appear multiple times, others appear only once.
Note: looking at the episode information, you will see that “Cakes” appears 8 times, and “Cake” appears 7 times. So, “Cake week” actually happens every series! The issue here is that “Cake” and “Cakes” look different, but they are really the same theme (just singular vs. plural).
There are some other issues: should we count “Biscuits” the same as “Biscuits and Traybakes”? Are “Pies” the same as “Pies and Tarts”? And depending on how you wrote the regular expressions, some of the names might have trailing white space.
Handling the extra white space is straightforward: the trimws function in base R will do that for us, or we can modify our regular expressions. The other issues are more complicated.
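For the white space case, a minimal sketch of trimws on a few made-up theme names (hypothetical values with stray spaces added):

```r
# Made-up theme names with stray leading/trailing spaces.
themes <- c("Cakes ", " Bread", "Pies and Tarts  ")

# trimws() strips white space from both ends of each string,
# but leaves the spaces inside "Pies and Tarts" alone.
cleaned <- trimws(themes)
cleaned
```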
Question 9
Which themes have appeared in every series?
In answering this question, you should modify the dataframe so that an episode name of “Cake” or “Cakes” is treated the same.
Question 10
How many episode themes have appeared only once?
Question 11
How often are the first three weeks Biscuits, Bread, and Cakes (in any order)?
Converting LaTeX to Markdown
(The following section is unrelated to the Bake Off section from above).
For another class, I wrote a homework assignment in LaTeX, which contained variable descriptions for a dataset on Titanic passengers. Here is what the original LaTeX looked like:
The data include the following variables:
\begin{itemize}
\item \verb;Passenger;: A unique ID number for each passenger.
\item \verb;Survived;: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
\item \verb;Pclass;: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
\item \verb;Name;: The name of the passenger.
\item \verb;Sex;: Binary indicator for the sex of the passenger.
\item \verb;Age;: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
\item \verb;SibSp;: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
\item \verb;Parch;: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
\item \verb;Ticket;: The unique ticket number for each passenger.
\item \verb;Fare;: How much the ticket cost in US dollars.
\item \verb;Cabin;: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
\item \verb;Embarked;: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
\end{itemize}
In a LaTeX document, this would then render to look like the following:
The data include the following variables:
- Passenger: A unique ID number for each passenger.
- Survived: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
- Pclass: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
- Name: The name of the passenger.
- Sex: Binary indicator for the sex of the passenger.
- Age: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
- SibSp: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
- Parch: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
- Ticket: The unique ticket number for each passenger.
- Fare: How much the ticket cost in US dollars.
- Cabin: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
- Embarked: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
Unfortunately, I later wanted to copy these variable descriptions into a Quarto document. Quarto will render some LaTeX equations, but not the formatting shown above. Instead, written for a Quarto document, these descriptions would look like:
The data include the following variables:
* `Passenger`: A unique ID number for each passenger.
* `Survived`: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
* `Pclass`: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
* `Name`: The name of the passenger.
* `Sex`: Binary indicator for the sex of the passenger.
* `Age`: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
* `SibSp`: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
* `Parch`: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
* `Ticket`: The unique ticket number for each passenger.
* `Fare`: How much the ticket cost in US dollars.
* `Cabin`: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
* `Embarked`: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
Making all these changes by hand is tedious! Instead, let’s do it with some string and regular expression tools.
Here is a string that you can copy into R, which contains the original LaTeX text:
original_str <- "The data include the following variables:
\\begin{itemize}
\\item \\verb;Passenger;: A unique ID number for each passenger.
\\item \\verb;Survived;: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
\\item \\verb;Pclass;: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
\\item \\verb;Name;: The name of the passenger.
\\item \\verb;Sex;: Binary indicator for the sex of the passenger.
\\item \\verb;Age;: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
\\item \\verb;SibSp;: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
\\item \\verb;Parch;: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
\\item \\verb;Ticket;: The unique ticket number for each passenger.
\\item \\verb;Fare;: How much the ticket cost in US dollars.
\\item \\verb;Cabin;: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
\\item \\verb;Embarked;: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
\\end{itemize}
"
You can check how R would print this string with cat:
## The data include the following variables:
##
## \begin{itemize}
## \item \verb;Passenger;: A unique ID number for each passenger.
## \item \verb;Survived;: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
## \item \verb;Pclass;: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
## \item \verb;Name;: The name of the passenger.
## \item \verb;Sex;: Binary indicator for the sex of the passenger.
## \item \verb;Age;: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
## \item \verb;SibSp;: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
## \item \verb;Parch;: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
## \item \verb;Ticket;: The unique ticket number for each passenger.
## \item \verb;Fare;: How much the ticket cost in US dollars.
## \item \verb;Cabin;: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
## \item \verb;Embarked;: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
## \end{itemize}
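The doubled backslashes in the string literal become single backslashes in the output above because "\\" is R's escape sequence for one literal backslash. A small sketch on a shortened, made-up string (short_str is a hypothetical name) shows the difference between print and cat:

```r
# In an R string literal, "\\" stands for one literal backslash.
short_str <- "\\item \\verb;Age;: Age in years.\n"

# print() shows the escaped form, with doubled backslashes and "\n".
print(short_str)

# cat() writes the actual characters: single backslashes and a real newline.
cat(short_str)

# Each escaped backslash counts as a single character.
n_chars <- nchar("\\item")
n_chars
```

This matters when writing regular expressions against the string: the text being matched contains single backslashes, even though the literal is typed with two.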
Question 12
Using string functions and regular expressions, convert the LaTeX text into the desired Quarto text.
Hints:
- Look at str_remove_all and str_replace_all
- Back references will be particularly useful here!
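As a sketch of how these two functions and back references work in general, here they are on made-up inputs that are unrelated to the LaTeX text:

```r
library(stringr)

# str_remove_all() deletes every match of the pattern.
no_dashes <- str_remove_all("a-b-c", "-")
no_dashes

# Made-up input: names written as "Last, First".
people <- c("Lovelace, Ada", "Turing, Alan")

# Each parenthesised group in the pattern is captured. In the replacement,
# \\1 and \\2 refer back to the first and second captured group.
flipped <- str_replace_all(people, "(\\w+), (\\w+)", "\\2 \\1")
flipped
```

Back references let the replacement reuse part of what was matched, which is useful whenever the text around a piece must change while the piece itself is kept.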