Web scraping in R with rvest

Data Manipulation in R Import and export data
Web scraping in R

Web scraping is a technique used for automatically extracting data from web pages. In this tutorial, we will demonstrate how to scrape data from static websites using the rvest library. Basic knowledge of HTML and CSS is required to follow along with this tutorial.

Web scraping is considered a legal technique in most cases, but we encourage you to learn more about web scraping before scraping any website and to check their robots.txt file and their terms of use.

The sample site

In this tutorial we are going to scrape a website designed to practice web scraping skills named http://books.toscrape.com/. If you enter to this website you will see the following:

Sample site to scrape in R

Read the HTML of the site

The first step to retrieve data from a web page is to download its HTML. Given a URL you can download its HTML with the read_html function from rvest.

# install.packages("rvest")
library(rvest)

# URL to scrape
url <- "http://books.toscrape.com/"

# Download the HTML code of the page
html <- url %>%
  read_html()

# View HTML
html
{html_document}
<html lang="en-us" class="no-js">
[1] <head>\n<title>\n    All products | Books to Scrape - Sandbox\n</title>\n<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n<meta name="created" content="24th Jun 2016 09:29">\n<meta name="description" ...
[2] <body id="default" class="default">\n        \n        \n    \n    \n    <header class="header container-fluid"><div class="page_inner">\n            <div class="row">\n                <div class="col-sm-8 h1">\n<a href ...

Elements

Once you have downloaded the HTML of the page you will be able to obtain the desired data. To do this you will have to select the desired elements or nodes with the html_element or html_elements functions to scrape the first or all occurrences.

html_element and html_elements are also named html_node and html_nodes.

CSS selectors

For instance, if you want to get the heading of the page (the h1 tag) you can specify it inside the html_element function and then return the inner text with html_text2.

html %>%
  html_element("h1") %>%
  html_text2()
"All products"

Now consider that you want to get all the titles of the books. If you inspect the website, you will see that the titles are inside H3 tags:

Web scraping in R with rvest

Once you know which element to scrape specify it inside html_elements and get the inner text with html_text2.

book_titles <- html %>%
  html_elements("h3") %>%
  html_text2()

book_titles
 [1] "A Light in the ..."                    "Tipping the Velvet"                    "Soumission"                            "Sharp Objects"                         "Sapiens: A Brief History ..."         
 [6] "The Requiem Red"                       "The Dirty Little Secrets ..."          "The Coming Woman: A ..."               "The Boys in the ..."                   "The Black Maria"                      
[11] "Starving Hearts (Triangular Trade ..." "Shakespeare's Sonnets"                 "Set Me Free"                           "Scott Pilgrim's Precious Little ..."   "Rip it Up and ..."                    
[16] "Our Band Could Be ..."                 "Olio"                                  "Mesaerion: The Best Science ..."       "Libertarianism for Beginners"          "It's Only the Himalayas" 

The html_elements function also works with CSS classes and ids. Inspecting the site you will see that all prices have a class called price_color, so to download them you can input ".price_color" to html_elements.

book_prices <- html %>%
  html_elements(".price_color") %>%
  html_text2()

book_prices
"£51.77" "£53.74" "£50.10" "£47.82" "£54.23" "£22.65" "£33.34" "£17.93" "£22.60" "£52.15" "£13.99" "£20.66" "£17.46" "£52.29" "£35.02" "£57.25" "£23.88" "£37.59" "£51.33" "£45.17"

Add . before CSS classes names and # before IDs to select them.

Xpath

XPath is a language that allows to select elements based on a route. In order to copy the XPath of an element using the developer tools of your browser select the desired element inside the tools and then do: Right clickCopyXPath. Then, you can input the XPath to the xpath argument of the html_elements or html_element function:

title <- html %>%
  html_elements(xpath = "/html/body/div/div/div/div/div[1]/h1")

title
{xml_nodeset (1)}
[1] <h1>All products</h1>

Children

The rvest package provides a function to get the children elements named html_children. For instance, the children of the H3 headings are their links.

book_links <- html %>%
  html_elements("h3") %>%
  html_children()

head(book_links)
{xml_nodeset (6)}
[1] <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
[2] <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>
[3] <a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>
[4] <a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>
[5] <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>
[6] <a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>

Attributes

rvest provide two functions named html_attr and html_attrs which returns a single attribute (the one specified) or all attributes of a node, respectively.

In the previous examples we got the titles of the books, but the titles were truncated. Inspecting the HTML of the site you can see that each H3 has a link with an attribute named title that contains the full name of each book, so to get them you can do the following:

book_titles <- html %>%
  html_elements("h3") %>%
  html_elements("a") %>%
  html_attr("title")

book_titles
 [1] "A Light in the Attic"                                                                           "Tipping the Velvet"                                                                            
 [3] "Soumission"                                                                                     "Sharp Objects"                                                                                 
 [5] "Sapiens: A Brief History of Humankind"                                                          "The Requiem Red"                                                                               
 [7] "The Dirty Little Secrets of Getting Your Dream Job"                                             "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"       
 [9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics" "The Black Maria"                                                                               
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"                                                 "Shakespeare's Sonnets"                                                                         
[13] "Set Me Free"                                                                                    "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"                                       
[15] "Rip it Up and Start Again"                                                                      "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"            
[17] "Olio"                                                                                           "Mesaerion: The Best Science Fiction Stories 1800-1849"                                         
[19] "Libertarianism for Beginners"                                                                   "It's Only the Himalayas" 

With the html_attrs function you can extract all the attributes of a node. The example below extracts all the attributes within the link of the H3 headings and merges them into a matrix.

link_attrs <- html %>%
  html_elements("h3") %>%
  html_elements("a") %>%
  html_attrs()

# Merge the list into a matrix
link_attrs <- do.call(rbind, link_attrs)

head(link_attrs)
     href                                                            title  
[1,] "catalogue/a-light-in-the-attic_1000/index.html"                "A Light in the Attic"                 
[2,] "catalogue/tipping-the-velvet_999/index.html"                   "Tipping the Velvet"                   
[3,] "catalogue/soumission_998/index.html"                           "Soumission"                           
[4,] "catalogue/sharp-objects_997/index.html"                        "Sharp Objects"                        
[5,] "catalogue/sapiens-a-brief-history-of-humankind_996/index.html" "Sapiens: A Brief History of Humankind"
[6,] "catalogue/the-requiem-red_995/index.html"                      "The Requiem Red"   

You could paste the URL with the links and scrape them if you want to continue scraping the details for each book.

Tables

The html_table function allows to scrape tables in a simple way, since it is only necessary to input the HTML document and it will download all the tables available in the page.

In the example below we are extracting the URL of the first book of the page and its HTML.

# Get links
links <- html %>%
  html_elements("h3") %>%
  html_elements("a") %>%
  html_attr("href")

links <- paste0("http://books.toscrape.com/", links)

# Get link of the first book
book_link <- links[1] # http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

# Read HTML of the book page
html_book <- book_link %>%
  read_html()

The page is the following:

Scraping a table in R with rvest and html_table

The scraped page has a table with information about the book. Using the html_table function you can download the table, as shown below.

book_table <- html_book %>%
  html_table()

# View the table
book_table
[[1]]
# A tibble: 7 × 2
  X1                X2                     
  <chr>             <chr>                  
1 UPC               a897fe39b1053632       
2 Product Type      Books                  
3 Price (excl. tax) £51.77                 
4 Price (incl. tax) £51.77                 
5 Tax               £0.00                  
6 Availability      In stock (22 available)
7 Number of reviews 0