

Introductory webscraping in R

About this tutorial

This is a brief introductory tutorial to webscraping in R. It covers the basics of extracting structured data (i.e. data that is already in a table or similar format) from websites using the rvest package. It's aimed at those with some basic familiarity with R, but feel free to give it a go with no experience! The content is kept somewhat high-level to keep it to a digestible length, but for more information on the technical parts please read some other resources, such as this post.

Here, we are going to extract cost of living data for Australia's capital cities from the website repository Numbeo.

Step 1: Load packages

In R, as you may know, we need to load packages before we can do anything else. This is because packages give us access to the functions we then use to webscrape, summarise data, conduct analysis, and do a whole host of other things. The code chunk below loads the packages necessary for this tutorial. If you don't have any of these packages installed, first run install.packages("insertpackagename") before running the library() calls.
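If several of them are missing, you can also install them all in one go, for example:

install.packages(c("tidyverse", "rvest", "XML", "stringr", "data.table"))  # only needs to be run once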

library(tidyverse)   # data wrangling and the pipe (%>%)
library(rvest)       # reading and parsing HTML pages
library(XML)         # additional HTML/XML parsing utilities
library(stringr)     # string handling (str_extract)
library(data.table)  # rbindlist() for combining the scraped tables

Step 2: Create a list of capital cities in Australia to use in the webscraper

Now we are going to create an object which stores the names of Australia's capital cities. This will be used in the webscraping code to adjust the URL so we can extract each city's data sequentially as the data sits on separate pages for each city.

city_names <- c("Sydney", "Melbourne", "Brisbane", "Perth", "Canberra", "Adelaide", "Hobart", "Darwin")

Step 3: Create an empty object to store our webscraped data in

Since we are going to be extracting data sequentially for each city, we are going to make use of R's list functionality. Remember, you can call the objects whatever you want.

empty.list <- list()  # each city's scraped table will become one element of this list

Step 4: Write the webscraper function

Alright, we're in the thick of it now. This is a longer chunk (see code below) as it is one big for loop that extracts the data from the webpages. It helps to visit the website and visualise the layout, as that is how I knew which parts of the webpage to scrape. But more on that shortly; let's take it from the top!

At the top we specify that we want to run a loop for every entry in the city_names object we created earlier. You can see that we are calling these values i. This means that wherever we put an i inside the loop's curly braces {}, the loop will replace it with the current iteration's city name. This means we only have to write the code once, rather than separately for each city. Pretty neat.
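If the looping idea is new to you, here is a tiny stand-alone example (not part of the scraper itself) showing how i takes each value in turn:

for(i in c("Sydney", "Melbourne")) {
  print(paste("Currently scraping:", i))
}
# [1] "Currently scraping: Sydney"
# [1] "Currently scraping: Melbourne"

With that in mind, here is the full scraping loop.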

for(i in city_names) {

  # Build the URL for the current city and print it so we can track progress
  url <- paste0("https://www.numbeo.com/cost-of-living/in/", i, "?displayCurrency=AUD")

  print(url)

  # try() lets the loop keep going even if one city's page fails to load
  try({

    # Read the page's HTML, pull out the third table, name the columns,
    # drop blank rows and record which city the values belong to
    link <- read_html(url)

    the_data <- html_table(link)[[3]] %>%
      setNames(c('item', 'cost', 'range')) %>%
      filter(range != '') %>%
      mutate(city = i)

    empty.list[[i]] <- the_data

  })

}

Moving on, we next specify the URL we want to pull the data from. Here you can see the first use of our i loop value. This works because on Numbeo each city has its own page, and the URLs are identical apart from the city name. The part of the URL after the city name (?displayCurrency=AUD) just ensures the values displayed are in $AUD. The print() call after the URL line tells us, while the loop is running, which city we are up to so we can keep track of progress.
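As a quick illustration, here is what the paste0() line produces for the first city if you run it outside the loop:

paste0("https://www.numbeo.com/cost-of-living/in/", "Sydney", "?displayCurrency=AUD")
# [1] "https://www.numbeo.com/cost-of-living/in/Sydney?displayCurrency=AUD"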

Within the try() function we get to the actual "scraping" part; wrapping the code in try() means that if one city's page fails to load, the loop simply moves on to the next city instead of stopping. First we tell R to read the website as HTML (the code that describes the structure of the webpage). This then lets us specify parts of the webpage to extract, such as the third table, which is what the [[3]] does in the first line of the the_data creation. I knew the table we wanted was the third one because I first examined the website using Chrome's SelectorGadget extension (just google how to use this).
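If you'd rather not use SelectorGadget, a rough alternative is to pull every table from one city's page and eyeball which one holds the prices. Treat this as a sketch; the number and order of tables may change if Numbeo updates its layout:

sydney_page <- read_html("https://www.numbeo.com/cost-of-living/in/Sydney?displayCurrency=AUD")
all_tables <- html_table(sydney_page)  # a list with one element per table on the page
length(all_tables)                     # how many tables the page contains
head(all_tables[[3]])                  # preview the third table, which holds the prices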

Following this, I tell R to name the extracted columns item, cost and range. The filter line removes blank entries, and the mutate line adds a variable recording which city the values relate to. This is critical since we are looping through multiple cities. The empty.list[[i]] <- the_data part then stores each city's extracted data in the empty list object we created earlier.

Step 5: Merge all the cities' data into one file

Now we have scraped the data! But you may notice it isn't in an immediately useful format. This is because it is stored in a list (which we needed for the loop process). To "use" the data, we need to turn it into a dataframe. We can do this by "rowbinding" all of the loop's iterations (i.e. each city's data) into one dataframe, which we are calling col_data (short for "cost of living data"). A rowbind works when your data sets all have exactly the same column structure, as R can then simply append them one after the other.

col_data <- rbindlist(empty.list, use.names = TRUE)
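A quick sanity check at this point is to count how many rows each city contributed, for example:

dplyr::count(col_data, city)  # one row per city, with the number of scraped items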

Step 6: Cleaning and final pre-processing

Alright, now we have our data in a usable format! Instead of ending the tutorial here, I'm going to show you some basic cleaning to make the data even more analysis-ready. The code chunk below cleans the cost column by removing everything other than numbers from the entries. Specifically, we are using "regular expressions" (regex for short) to edit each "string" (the term for an entry of alphanumerics or other characters). The \s part is regex for whitespace (a space), and the full stop and asterisk tell R to match everything from that whitespace onwards. We replace that part of the string with nothing, and then remove commas. After this, we can convert the column to a numeric format ready for analysis. See this resource for more information on regex in R.

col_data_clean <- col_data %>%
  mutate(cost = gsub("\\s.*", "", cost)) %>%  # drop everything from the first space onwards
  mutate(cost = gsub(",", "", cost)) %>%      # remove thousands-separator commas
  mutate(cost = as.numeric(cost))             # convert to numeric
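If the regex feels opaque, you can test it on a single made-up entry first. The exact formatting of Numbeo's raw entries may differ, so treat this as illustrative:

raw_cost <- "1,234.56 A$"                   # a made-up example of a raw cost entry
gsub("\\s.*", "", raw_cost)                 # "1,234.56" - everything from the first space is dropped
gsub(",", "", gsub("\\s.*", "", raw_cost))  # "1234.56" - comma removed, ready for as.numeric()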

This next section splits the range column into separate columns for the minimum and maximum range values. Broadly, we take the numbers before the hyphen as the minimum and the numbers after it as the maximum, then again remove symbols and commas so we are left with just numbers. Please read up on regex to better understand what is going on here, acknowledging this isn't a regex tutorial (maybe there will be one in the future!). Finally, we remove the original range column since we no longer need it and want to be left with as few columns as possible.

col_data_clean <- col_data_clean %>%
  mutate(min_range = str_extract(range, ".*-")) %>%  # everything up to and including the hyphen
  mutate(min_range = gsub("-", "", min_range)) %>%
  mutate(min_range = gsub(",", "", min_range)) %>%
  mutate(max_range = str_extract(range, "-.*")) %>%  # the hyphen and everything after it
  mutate(max_range = gsub("-", "", max_range)) %>%
  mutate(max_range = gsub(",", "", max_range)) %>%
  mutate(min_range = as.numeric(min_range)) %>%
  mutate(max_range = as.numeric(max_range)) %>%
  dplyr::select(-c(range))                           # drop the original range column
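As before, testing the extraction on a single made-up range value can make the regex clearer (the formatting is illustrative, not necessarily Numbeo's exact format):

raw_range <- "1,000.00-2,500.00"  # a made-up example of a raw range entry
str_extract(raw_range, ".*-")     # "1,000.00-" - everything up to and including the hyphen
str_extract(raw_range, "-.*")     # "-2,500.00" - the hyphen and everything after it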

And that's it! We're done!

Final thoughts

This marks the end of basic webscraping in R. I hope you found it useful and can find ways to implement it into your workflow to take away some of the manual data collection that many of us dislike. Stay tuned for more tutorials in the near future, including a more advanced webscraping example!