Homework 6

Due: Friday, October 24, 11:59pm

GitHub Classroom link: Canvas has been down, so the assignment link is also provided here:

https://classroom.github.com/a/iDBpjdo6

Instructions:

Go to Canvas -> Assignments -> HW 6. Open the GitHub Classroom assignment link.
Follow the instructions to accept the assignment and clone the repository to your local computer.
The repository contains the file hw_06.qmd. Write your code and answers to the questions in the Quarto document. Commit and push to GitHub regularly.
When you are finished, make sure to Render your Quarto document; this will produce a hw_06.md file, which is easy to view on GitHub. Commit and push both the hw_06.qmd and hw_06.md files to GitHub.
Finally, request feedback on your assignment in the “Feedback” pull request on your HW 6 repository.

Important: Make sure to include both the .qmd and .md files when you submit to receive full credit.

The Great British Bake Off

The Great British Bake Off (called the Great British Baking Show in the US because of trademark issues with Pillsbury – yes, really) is a British competition baking show. Each episode involves three challenges: a signature bake, a technical challenge, and a showstopper, all centered around a theme (bread week, cake week, pastry week, etc.). The participant who performs worst is eliminated (with a couple of rare exceptions), and the participant who performs best is awarded “star baker” for the week.

The goal of this assignment is to use web scraping and data wrangling (including working with strings) to collect and analyze data about the show. We will scrape the data from Wikipedia articles about the show.

Getting the episode names and information

We will begin with series 2 as an example. The Wikipedia article on series 2 can be found at:

https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_2

If you scroll down, you will notice that there is an “Episodes” section, which contains the headings “Episode 1: Cakes”, “Episode 2: Tarts”, etc.

We will soon learn some basic web scraping tools. For now, I will provide the code to read the episode titles into R:

library(tidyverse)
library(rvest)

episodes <- read_html("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_2") |>
  html_elements(".mw-heading > [id^='Episode_']") |>
  html_text2()

Question 1

Using string and regular expression tools, use the episodes vector provided by the code above to create a data frame which contains the episode information in the following format:

##   episode       name
## 1       1      Cakes
## 2       2      Tarts
## 3       3      Bread
## 4       4   Biscuits
## 5       5       Pies
## 6       6   Desserts
## 7       7 Pâtisserie
## 8       8      Final

Question 2

Re-run the web scraping code, and your code from question 1, for Series 4 instead of Series 2. You should get some unexpected results; explain what goes wrong, and why you get those results (look at the Wikipedia page!).

(If, for whatever reason, you did not produce any unexpected results for Series 4, still look at the Wikipedia page, and describe what is different for the episode list of Series 4 vs. Series 2).

Question 3

To fix the issue from question 2, we will get rid of the “Episodes” from the masterclass. This can be done with a little help from the str_subset function.
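
As a quick illustration of how str_subset behaves, here is a small sketch on a made-up vector (the file names below are hypothetical and unrelated to the Bake Off data). It keeps the elements that match a pattern, or, with negate = TRUE, the elements that do not match:

```r
library(stringr)

# Hypothetical vector, unrelated to the Bake Off data
files <- c("notes_week1.txt", "draft_notes.txt", "notes_week2.txt", "summary.csv")

# Keep only the elements that match the pattern
kept <- str_subset(files, "^notes_")

# Keep only the elements that do NOT match the pattern
no_drafts <- str_subset(files, "^draft", negate = TRUE)

print(kept)
print(no_drafts)
```

Which of the two forms is more convenient depends on whether it is easier to describe the strings you want to keep or the strings you want to drop.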

Fill in the modified web scraping code below, with an appropriate regular expression, so that the masterclass episodes are not included for Series 4:

read_html("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_4") |>
  html_elements(".mw-heading > [id^='Episode_']") |>
  html_text2() |>
  str_subset(...) # fill this in!

Question 4

Use an appropriate regular expression to extract the series number from the Wikipedia URL.

For example:

str_extract("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_4", 
            ... ) # fill this in!

Desired output:

## [1] "4"

Question 5

Using your answers to the previous questions, write a function called clean_gbbo which satisfies the following requirements:

Input: url, a string containing the URL for one of the GBBO seasons (excluding season 1)
Output: a data frame with columns for the episode number, episode title, and season number

Here are some examples of this function in action:

clean_gbbo("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_2")

##   episode       name season
## 1       1      Cakes      2
## 2       2      Tarts      2
## 3       3      Bread      2
## 4       4   Biscuits      2
## 5       5       Pies      2
## 6       6   Desserts      2
## 7       7 Pâtisserie      2
## 8       8      Final      2

clean_gbbo("https://en.wikipedia.org/wiki/The_Great_British_Bake_Off_series_4")

##    episode                    name season
## 1        1                   Cakes      4
## 2        2                   Bread      4
## 3        3                Desserts      4
## 4        4          Pies and Tarts      4
## 5        5  Biscuits and Traybakes      4
## 6        6             Sweet Dough      4
## 7        7                  Pastry      4
## 8        8 Alternative Ingredients      4
## 9        9             French week      4
## 10      10                   Final      4

Iterating over series

So far, we have only worked with one series at a time. Next, we will want to iterate over multiple series. As a first step, it will help to create a vector of URLs, one for each Bake Off series.

The str_glue function will help here. Here is an example of the str_glue function in action:

str_glue("class_activities/ca_{class_num}.html", 
         class_num = 1:5)

## class_activities/ca_1.html
## class_activities/ca_2.html
## class_activities/ca_3.html
## class_activities/ca_4.html
## class_activities/ca_5.html

Question 6

Use the str_glue function to create a vector called season_urls which contains the URLs of the Wikipedia pages for seasons 1 - 15 of the Great British Bake Off.

Now let’s iterate over the season_urls vector!

Question 7

Use the map and list_rbind functions from the purrr package to create a dataframe which contains episode information for all episodes in seasons 1 - 15. The combined output should look something like this:

##      episode       name season
##        <num>     <char>  <num>
##   1:       1      Cakes      1
##   2:       2   Biscuits      1
##   3:       3      Bread      1
##   4:       4   Puddings      1
##   5:       5     Pastry      1
##  ---                          
## 140:       6     Autumn     15
## 141:       7   Desserts     15
## 142:       8   The '70s     15
## 143:       9 Patisserie     15
## 144:      10      Final     15
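
To see how map and list_rbind fit together, here is a minimal sketch on made-up data, unrelated to the Bake Off. The make_squares helper is hypothetical and only stands in for "a function that returns a data frame"; it assumes purrr version 1.0.0 or later, where list_rbind is available:

```r
library(purrr)

# Hypothetical helper: returns a small data frame for a given input
make_squares <- function(n) {
  data.frame(n = n, value = seq_len(n)^2)
}

# map() applies the function to each element and returns a list of
# data frames; list_rbind() stacks that list into one data frame
squares <- map(1:3, make_squares) |>
  list_rbind()

print(squares)
```

The same pattern applies whenever a function returns one data frame per input and the inputs are stored in a vector.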

Exploring the data

Now let’s explore the episode information! For each of the following questions, you must write code to produce the answer. You may not answer just by looking through the combined data manually.

Question 8

Which seasons have fewer than 10 episodes?

Now let’s look at the episode titles, which describe the “theme” of each week. Some of these themes appear multiple times, others appear only once.

Note: looking at the episode information, you will see that “Cakes” appears 8 times, and “Cake” appears 7 times. So, “Cake week” actually happens every series! The issue here is that “Cake” and “Cakes” look different, but they are really the same theme (just singular vs. plural).

There are some other issues: should we count “Biscuits” the same as “Biscuits and Traybakes”? Are “Pies” the same as “Pies and Tarts”? And depending on how you wrote the regular expressions, some of the names might have trailing white space.

Handling the extra white space is straightforward: the trimws function in base R will do that for us, or we can modify our regular expressions. The other issues are more complicated.
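
For example, here is a small sketch of trimws on a made-up vector (these strings are hypothetical, not taken from the scraped data):

```r
# Hypothetical strings with leading and/or trailing spaces
x <- c("Scones ", "  Crumpets", " Shortbread  ")

# trimws() removes white space from both ends by default
trimmed <- trimws(x)

print(trimmed)
```

White space inside a string (for example, between two words) is left untouched; only the ends are trimmed.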

Question 9

Which themes have appeared in every series?

In answering this question, you should modify the dataframe so that an episode name of “Cake” or “Cakes” is treated the same.

Question 10

How many episode themes have appeared only once?

Question 11

How often are the first three weeks Biscuits, Bread, and Cakes (in any order)?

Converting LaTeX to Markdown

(The following section is unrelated to the Bake Off section above.)

For another class, I wrote a homework assignment in LaTeX, which contained variable descriptions for a dataset on Titanic passengers. Here is what the original LaTeX looked like:

The data include the following variables:

\begin{itemize}
\item \verb;Passenger;: A unique ID number for each passenger.
\item \verb;Survived;: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
\item \verb;Pclass;: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
\item \verb;Name;: The name of the passenger.
\item \verb;Sex;: Binary indicator for the sex of the passenger.
\item \verb;Age;: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
\item \verb;SibSp;: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
\item \verb;Parch;: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
\item \verb;Ticket;: The unique ticket number for each passenger.
\item \verb;Fare;: How much the ticket cost in US dollars.
\item \verb;Cabin;: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
\item \verb;Embarked;: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
\end{itemize}

In a LaTeX document, this would then render to look like the following:

The data include the following variables:

Passenger: A unique ID number for each passenger.
Survived: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
Pclass: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Name: The name of the passenger.
Sex: Binary indicator for the sex of the passenger.
Age: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
SibSp: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
Parch: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
Ticket: The unique ticket number for each passenger.
Fare: How much the ticket cost in US dollars.
Cabin: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
Embarked: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton

Unfortunately, I later wanted to copy these variable descriptions into a Quarto document. Quarto will render some LaTeX equations, but not the formatting shown above. Instead, in a Quarto document, these descriptions would need to be written like this:

The data include the following variables:

* `Passenger`: A unique ID number for each passenger.
* `Survived`: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
* `Pclass`: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
* `Name`: The name of the passenger.
* `Sex`: Binary indicator for the sex of the passenger.
* `Age`: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
* `SibSp`: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
* `Parch`: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
* `Ticket`: The unique ticket number for each passenger.
* `Fare`: How much the ticket cost in US dollars.
* `Cabin`: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
* `Embarked`: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton

Making all these changes by hand is tedious! Instead, let’s do it with some string and regular expression tools.

Here is a string that you can copy into R, which contains the original LaTeX text:

original_str <- "The data include the following variables:

\\begin{itemize}
\\item \\verb;Passenger;: A unique ID number for each passenger.
\\item \\verb;Survived;: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
\\item \\verb;Pclass;: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
\\item \\verb;Name;: The name of the passenger.
\\item \\verb;Sex;: Binary indicator for the sex of the passenger.
\\item \\verb;Age;: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
\\item \\verb;SibSp;: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
\\item \\verb;Parch;: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
\\item \\verb;Ticket;: The unique ticket number for each passenger.
\\item \\verb;Fare;: How much the ticket cost in US dollars.
\\item \\verb;Cabin;: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
\\item \\verb;Embarked;: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
\\end{itemize}
"

You can check how R would print this string with cat:

cat(original_str)

## The data include the following variables:
## 
## \begin{itemize}
## \item \verb;Passenger;: A unique ID number for each passenger.
## \item \verb;Survived;: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
## \item \verb;Pclass;: Indicator for the class of the ticket held by this passengers. 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
## \item \verb;Name;: The name of the passenger.
## \item \verb;Sex;: Binary indicator for the sex of the passenger.
## \item \verb;Age;: Age of the passenger in years. Age is fractional if the passenger was less than 1 year old.
## \item \verb;SibSp;: number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
## \item \verb;Parch;: number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter,son, stepdaughter or stepson. NOTE: Some children traveled only with a nanny, therefore parch=0 for them. There were no parents aboard for these children.
## \item \verb;Ticket;: The unique ticket number for each passenger.
## \item \verb;Fare;: How much the ticket cost in US dollars.
## \item \verb;Cabin;: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
## \item \verb;Embarked;: Port where the passenger boarded the ship, C = Cherbourg, Q = Queenstown, S = Southampton
## \end{itemize}

Question 12

Using string functions and regular expressions, convert the LaTeX text into the desired Quarto text.

Hints:

Look at str_remove_all and str_replace_all
Back references will be particularly useful here!
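
As a reminder of how these tools behave, here is a small sketch on a different kind of markup (a made-up HTML-like string, not the LaTeX above). A back reference such as \\1 in the replacement inserts whatever the first parenthesized group captured:

```r
library(stringr)

# Hypothetical input, unrelated to the Titanic text
html <- "Mix <b>flour</b> and <b>butter</b> first."

# str_remove_all() deletes every match of the pattern
no_tags <- str_remove_all(html, "</?b>")

# (.*?) captures the text between the tags; \\1 in the replacement
# puts that captured text back, now wrapped in ** instead
markdown <- str_replace_all(html, "<b>(.*?)</b>", "**\\1**")

cat(no_tags, "\n")
cat(markdown, "\n")
```

Note the lazy quantifier .*? here: with a greedy .* the first match would run from the first opening tag all the way to the last closing tag.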