Ethics and web scraping

Warmup

Work on the warmup activity (handout), then we will discuss as a group.

Ethical restrictions on web scraping

  • As a general rule, respect websites’ Terms and Conditions
  • Terms prohibiting web scraping have become more prevalent with the advent of LLMs
  • Be polite when scraping: check that you have permission to scrape, and don’t send too many requests in a short space of time
  • Application Programming Interfaces (APIs) provide an alternative to web scraping for some websites
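
When a site does offer an API, requesting data through it is usually simpler and more robust than scraping HTML. A minimal sketch with the httr2 package; the endpoint URL and query parameter here are made up for illustration:

```r
library(httr2)

# Build a request to a hypothetical JSON API, send it, and parse the body
resp <- request("https://api.example.com/v1/movies") |>  # made-up endpoint
  req_url_query(title = "Casablanca") |>                 # made-up parameter
  req_perform()

movies <- resp_body_json(resp)
```

Real APIs document their own endpoints, parameters, and authentication; check the site's developer documentation before falling back to scraping.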

Other restrictions on web scraping

  • While sites often prohibit scraping in their terms and conditions, many still “allow” some portion of their site to be accessed by web crawlers, e.g. for search engine indexing
  • These limitations are expressed in a site’s robots.txt file

robots.txt

https://www.imdb.com/robots.txt
User-agent: *
Disallow: /OnThisDay
Disallow: /*/OnThisDay
Disallow: /ads/
Disallow: /*/ads/
Disallow: /find
Disallow: /*find
...

In R:

library(robotstxt)
paths_allowed("https://www.imdb.com/find/")
[1] FALSE
paths_allowed("https://www.imdb.com/list/ls055592025/")
[1] TRUE

robots.txt

This means that everyone is disallowed from everything:

User-agent: *
Disallow: /
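
We can check this behaviour without fetching a live file. A sketch, assuming the robotstxt() constructor in the robotstxt package accepts the rules as a raw string via its text argument:

```r
library(robotstxt)

# Parse the deny-all rules directly from text (no live fetch needed)
rt <- robotstxt(text = "User-agent: *\nDisallow: /")

# Every path should come back disallowed for every bot
rt$check(paths = c("/", "/find"), bot = "*")
```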

robots.txt

https://www.imdb.com/robots.txt
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: PiplBot
Disallow: /

Polite web scraping: the polite package

  • Automatically ask permission to scrape (by checking robots.txt file)
  • Don’t ask for the same information twice
  • Limit rate of requests

Example: cheese!

https://www.cheese.com/alphabetical/a/

Example: cheese!

library(polite)
session <- bow("https://www.cheese.com/alphabetical/a/")
session
<polite session> https://www.cheese.com/alphabetical/a/
    User-agent: polite R package
    robots.txt: 0 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

bow: establish initial connection to the page

  • Save this so we don’t have to establish the connection again
  • bow automatically checks the robots.txt file and establishes a delay between requests (default: 5 seconds)
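
Both the user agent and the delay can be overridden when bowing. A sketch; the user-agent string (including the contact email) is made up, and identifying yourself this way is a courtesy, not a requirement:

```r
library(polite)

# Identify ourselves and wait longer between requests than the 5 s default
session <- bow(
  "https://www.cheese.com/alphabetical/a/",
  user_agent = "stats-course scraper (student@example.edu)",  # made-up contact
  delay = 10
)
```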

Example: cheese!

library(polite)
session <- bow("https://www.cheese.com/alphabetical/a/")
session |> 
  scrape()
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset ...
[2] <body>\n    \n    \n\n    <!-- Header -->\n    <div id="header">\n  ...

This is the polite version of:

read_html("https://www.cheese.com/alphabetical/a/")
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset ...
[2] <body>\n    \n    \n\n    <!-- Header -->\n    <div id="header">\n  ...

Example: cheese!

What if we want to look at another page on the website?

https://www.cheese.com/aarewasser/
current_page <- session |>
    nod("/aarewasser")
current_page
<polite session> https://www.cheese.com/aarewasser
    User-agent: polite R package
    robots.txt: 0 rules are defined for 1 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent
current_page |>
    scrape()
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset ...
[2] <body>\n    \n    \n\n    <!-- Header -->\n    <div id="header">\n  ...

Example: cheese!

current_page <- session |>
    nod("/aarewasser")
current_page |>
    scrape()

nod allows us to modify the URL without having to establish a new connection (i.e. we don’t have to bow again)

This is the polite version of:

read_html("https://www.cheese.com/aarewasser/")
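
This pattern scales naturally to many pages: bow once, then nod to each page in turn. A sketch, where the list of cheese paths beyond /aarewasser/ is hypothetical, and scrape() enforces the crawl delay between requests:

```r
library(polite)
library(purrr)

session <- bow("https://www.cheese.com/alphabetical/a/")

# Hypothetical list of cheese pages to visit on the same site
cheeses <- c("/aarewasser/", "/abbaye-de-belloc/", "/abertam/")

# nod() reuses the session; scrape() waits out the crawl delay each time
pages <- map(cheeses, \(path) session |> nod(path) |> scrape())
```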

Why be polite?

  • Sending many requests in a short space of time uses the website’s resources, and can look like a malicious attack
    • This could also result in the website blocking you
    • Be polite: ask slowly
  • Requesting the same information repeatedly is a waste of resources
    • Be polite: establish the connection once (bow), and modify when you need a different page (nod)
  • It is particularly important to be polite when you are scraping many different pages on a website
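
Putting the pieces together, a polite scrape feeds directly into the usual rvest extraction steps. A sketch; the "h3" CSS selector is a guess, so inspect the page to find the right one for cheese names:

```r
library(polite)
library(rvest)

# Bow once, scrape politely, then extract text with rvest as usual
session <- bow("https://www.cheese.com/alphabetical/a/")

session |>
  scrape() |>
  html_elements("h3") |>  # selector is a guess; verify in the page source
  html_text2()
```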

Class activity

Work on the class activity. Render your HTML and submit on Canvas at the end of class.