Homework 7

Due: Friday, October 31, 11:59pm

Instructions:

  1. Go to Canvas -> Assignments -> HW 7, and open the GitHub Classroom assignment link.
  2. Follow the instructions to accept the assignment and clone the repository to your local computer.
  3. The repository contains two files: hw_07.R and hw_07.qmd. You will use both of these files to write your HW 7 code. Commit and push to GitHub regularly.
  4. When you are finished, make sure to Render your Quarto document; this will produce a hw_07.md file which is easy to view on GitHub.
  5. In the process of completing this assignment, you will also create a CSV file, called something like cheese_info.csv. Commit and push all of the following files to GitHub: cheese_info.csv, hw_07.R, hw_07.qmd, and hw_07.md.
  6. Finally, request feedback on your assignment on the “Feedback” pull request on your HW 7 repository

Important: Make sure to include all four requested files in your repository on GitHub to receive full credit.

Web scraping cheese information

In the class activity on October 24, you began exploring a website with information about all sorts of different cheeses. For this homework assignment, you will continue exploring and scraping the information on this website.

There are currently 112 cheeses whose name begins with “A” on the website (to keep the runtime of your code manageable, we will only look at cheeses beginning with “A” for this assignment). Your first task is to scrape this information and create a dataset containing information about these 112 cheeses, which you will store in a CSV file.

There are specific instructions below that you must follow when scraping this information. If you do not follow these instructions, you will not receive full credit for the assignment.

Requirements for the cheese data

The data table you produce must:

  • contain 112 rows (one for each “A” cheese)
  • contain 5 columns: Name, Country, Type, Texture, and Colour
  • be written to a CSV file called cheese_info.csv, and saved in your GitHub repository

Note that some information will be missing. The Name column should have no missing values, but there could be missing values in the other columns.

Here is what the first few rows of your resulting table should look like:

  Name               Country     Type                       Texture Colour
  <chr>              <chr>       <chr>                      <chr>   <chr> 
  Aarewasser         Switzerland semi-soft                  buttery yellow
  Abbaye de Belloc   France      semi-hard, artisan         creamy… yellow
  Abbaye de Belval   France      semi-hard                  elastic ivory 
  Abbaye de Citeaux  France      semi-soft, artisan, brined creamy… white 

Code requirements

  • Your code to create the cheese_info.csv file must be written in the hw_07.R file, not in the hw_07.qmd file. See below for more information
  • When you commit and push to GitHub, you must include both the hw_07.R and the cheese_info.csv files
  • Your scraping code must be polite, using the bow, scrape, and nod functions from the polite package, with the default 5 second delay between requests. This means that it will take a bit of time to run! See below for more information
  • Your code should use tools we have covered in class, such as in the class notes and the textbooks
  • If you choose to use generative AI to help on the assignment, you must cite which tools you used and how you used them. You must also be able to explain your final code – I may ask you about it, and if you cannot explain the code, you will not receive credit
  • You should write helper functions when needed (like how we used the clean_gbbo function in HW 6). If you are unsure how to organize your code with helper functions, I am happy to chat
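To illustrate the polite workflow required above, here is a minimal sketch. The URL and path are placeholders, not the actual cheese website; your starter hw_07.R already contains the real bow() call, so this is only meant to show how the three functions fit together:

```r
library(polite)

# bow() introduces your scraper to the site and reads its robots.txt;
# the default delay between requests is 5 seconds.
session <- bow("https://www.example.com/")       # placeholder URL

# scrape() politely fetches the page for the current session.
page <- scrape(session)

# nod() adjusts the path within the same session before scraping again.
listing <- nod(session, path = "some-listing/") |>   # placeholder path
  scrape(query = list(per_page = 100))
```

Because bow() stores the session, every subsequent scrape() through it automatically respects the delay, which is what makes your code polite.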

.R files

In this course, we have mostly used Quarto (.qmd files) for our code. This is great for reproducibility, because Quarto files can contain both code and text, and because a rendered Quarto file guarantees that the code ran as shown and shows you the results.

However: unless we specify otherwise, .qmd files will re-run all the code in the document every time we render them. When our code is quick to run, this isn’t an issue. But when our code takes some time to run, this is a problem!

In this assignment, you are producing an intermediate product – a CSV file containing the cheese data – which you will later analyze. When we need to accomplish a resource-intensive task as an initial step, it is a good idea to write the code for this in a separate file. In this case, all of our code to produce the cheese_info.csv file will be contained in hw_07.R. Then, we will load the cheese_info.csv file into the hw_07.qmd file when we want to analyze the data.
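Concretely, the division of labor might look like this (a sketch, assuming your scraping code has produced a data frame called cheese_info):

```r
## In hw_07.R -- run once, after the scraping code:
readr::write_csv(cheese_info, "cheese_info.csv")

## In hw_07.qmd -- at the top, with no scraping code:
cheese_info <- readr::read_csv("cheese_info.csv")
```

This way, rendering hw_07.qmd only reads the saved CSV file, which takes a fraction of a second, rather than re-running the slow scraping step.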

What is a .R file? A .R file, which we call an R script, is a file that contains only R code. If you are used to Quarto documents, think of an R script as a file containing only the code from the R chunks, and nothing else – no Markdown, no plots, no additional text (except for comments in the code itself).
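For instance, a tiny R script (contents purely illustrative, not the provided starter file) might look like:

```r
# example.R -- an R script contains only code and comments
library(readr)

squares <- data.frame(n = 1:5, n_squared = (1:5)^2)
write_csv(squares, "squares.csv")   # runs top to bottom when sourced
```

You can run an entire script from the console with source("example.R"), or run it line by line in RStudio just as you would a code chunk.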

For this assignment, I have provided the hw_07.R file for you to start with. This loads some necessary libraries at the top, and contains the polite connection to the website. You will then write the rest of the code to accomplish the task described here.

Runtime

There are 112 cheeses to scrape, and being polite, you are going to wait 5 seconds between requests to the website. This means that, once your code is ready to run in full, you should expect it to take at least 10 minutes or so (the first time, anyway – polite is pretty good about caching information once you have visited a page once). If your code still hasn’t finished running after 30 minutes, come see me. But, the key point is that the code will not run nearly as quickly as you are used to.

A few hints

  • My solution to this task used many of the tools we have used throughout the semester. In addition to the web scraping and string/text wrangling tools we have learned recently, it is likely that you will need to use many of the general data wrangling tools from earlier in the course
  • Not all 112 cheeses display on the first page when showing the cheeses alphabetically. By default, 20 cheeses are visible on each page. You can change the number of cheeses per page, and navigate to a different page in the list of cheeses, by passing a query to the scrape function. E.g. scrape(query = list(per_page = 100, page = 2)) will scrape page 2 when 100 cheeses are listed per page
  • If you find yourself in a situation with a list of vectors, you can use the unlist function to make a single combined vector
  • If you find yourself in a situation with a list of data frames, you can use the list_rbind function to combine them into a single data frame. Note that list_rbind is smart and will allow the data frames in the list to have different numbers of columns
  • General guidance: Start small, then build up. E.g. start by scraping the required info for a single cheese, and get it into the right format. Then think about how to iterate over all of the cheeses, and then combine the information
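Here is a runnable toy example of the two combining tools mentioned above (the values are made up for illustration):

```r
library(purrr)
library(tibble)

# unlist(): flatten a list of vectors into one combined vector
textures <- list(c("buttery"), c("creamy", "smooth"))
unlist(textures)

# list_rbind(): stack a list of data frames into one;
# columns missing from a given data frame are filled with NA
tables <- list(
  tibble(Name = "Cheese A", Country = "France"),
  tibble(Name = "Cheese B")          # no Country column
)
list_rbind(tables)
```

The NA-filling behavior of list_rbind is exactly what you need when some cheeses are missing a Country, Type, Texture, or Colour value.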

Question 1

Read all of the instructions above before beginning. Make sure you understand the requirements, and where you need to write your code. Then, write your code in hw_07.R to accomplish the specified task, following all requirements.

Analyzing cheese information

Once you have completed the first task – creating the cheese_info.csv file – you are ready for the second task! In the second part of this assignment, you will analyze the cheese data.

Questions of interest

  • What are the most common countries of origin?
  • What fraction of cheeses belong to each of the following categories: soft, semi-soft, firm, semi-hard, and hard?
  • Is there a relationship between the color of a cheese and its hardness (soft, hard, etc.)?

Code and discussion requirements

  • Your code for this task must be written in the hw_07.qmd file
  • The hw_07.qmd file cannot perform any web scraping. You should import the cheese_info.csv file at the top, but you may not include any of the code to create that CSV file
  • Your answers to the questions of interest must include both code and discussion. It is not enough to show only code and R output – you must discuss the results
  • Render your hw_07.qmd file to create the hw_07.md file. When you commit and push to GitHub, you must include both the hw_07.qmd and the hw_07.md files

A few hints

  • Some entries in the dataset will contain multiple values. E.g. the Abbaye de Belloc cheese has Type “semi-hard, artisan”, meaning the cheese is both semi-hard and artisan. Consider some string functions and/or regular expressions to separate these values as you do the analysis
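One possible approach is tidyr's separate_longer_delim function (a sketch; the two-row data here is made up for illustration):

```r
library(tibble)
library(tidyr)

cheeses <- tibble(
  Name = c("Aarewasser", "Abbaye de Belloc"),
  Type = c("semi-soft", "semi-hard, artisan")
)

# One row per individual Type value, so you can count or
# filter on exact categories afterwards
separate_longer_delim(cheeses, Type, delim = ", ")
```

Filtering on exact values afterwards (e.g. Type == "hard") also sidesteps a regex pitfall: a pattern like "hard" matches inside "semi-hard" as well.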

Question 2

Read all of the instructions above before beginning. Make sure you understand the requirements, and where you need to write your code. Then, write your code and discussion in hw_07.qmd to answer all the questions of interest.