Web Scraping with R - Extracting Dialogue from IASIP Transcripts

By Andrew Robson

November 15, 2022

In this post, I’ll explain how I obtained the data for my blog post The Gang Gets NLP’d. The code worked when I first ran it but, as is the nature of web scraping, it will break if the website’s layout ever changes.

First, I loaded the necessary packages: tidyverse for data manipulation and rvest for the web scraping itself. I also defined the URLs I was interested in:

library(tidyverse)
library(rvest)

# Landing page of the It's Always Sunny in Philadelphia transcript forum
base_url <- 'https://transcripts.foreverdreaming.org/viewforum.php?f=104'

# The forum paginates in steps of 25 topics; pages 2-7 are reached by
# appending a &start= offset, giving seven page URLs in total
all_page_urls <- c(base_url,
                   paste0(base_url,
                          '&start=',
                          seq(from = 25, to = 25 * 6, by = 25)))

all_page_urls is now a vector of URLs that I need to open and scrape one by one to get the episode links. To do that, I created a function called get_script_links_from_page:

get_script_links_from_page <- function(page_url) {
  
  page <- read_html(page_url)
  
  # Pause between requests to avoid hammering the server
  Sys.sleep(5)
  
  page %>%
    html_element('.tablebg') %>%   # the table listing episode topics
    html_elements('a') %>%
    html_attr('href') %>%
    tibble() %>%
    rename("link" = ".") %>%
    # Drop one topic that isn't an episode transcript
    filter(substr(link, 1, 29) != "./viewtopic.php?f=104&t=32146") %>%
    # Convert the relative links to absolute URLs
    mutate(link = paste0('https://transcripts.foreverdreaming.org',
                         substr(link, 2, nchar(link))))
  
}

The Sys.sleep call is just to be kind to the servers that host the transcripts - I didn’t want to ping them non-stop, so I added a wait of a few seconds between requests. The function returns a tibble of links to individual episodes, which I then stacked into a single table of all the links:

all_script_links <- do.call(rbind,
                            lapply(all_page_urls, get_script_links_from_page))
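As an aside, since the tidyverse is already loaded, the same stacking can be written with purrr’s map_dfr; this is an equivalent sketch rather than the code I originally ran:

# Equivalent to the do.call/lapply pattern above, using purrr
all_script_links <- map_dfr(all_page_urls, get_script_links_from_page)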

Next, I created another function called get_dialog that loads up each episode page and scrapes the text:

get_dialog <- function(script_url) {
  
  script_page <- read_html(script_url)
  
  # Again, pause to be kind to the server
  Sys.sleep(1)
  
  # The page heading holds the episode title, padded with newlines and tabs
  episode_name <- script_page %>%
    html_element('.community-content') %>%
    html_element('.boxheading') %>%
    html_text() %>%
    gsub(pattern = '\n\t\t\t', replacement = '') %>%
    gsub(pattern = '\n\t\t', replacement = '')
  
  # Each paragraph of the post body is one line of dialogue
  script_page %>%
    html_element('.postbody') %>%
    html_elements('p') %>%
    html_text() %>%
    tibble(episode_name) %>%
    rename("raw_dialog" = ".")
  
}

This function is similar to get_script_links_from_page, but it works on a slightly different page layout. It also does a little cleaning, stripping the tab and newline padding from the episode title - probably not in the most efficient way.
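As one example of a tidier approach, the two gsub calls could be collapsed into a single stringr::str_squish, which trims leading and trailing whitespace and collapses internal runs; this is a sketch, assuming the same page structure, rather than what I originally ran:

# Same title cleaning using stringr (loaded with the tidyverse)
episode_name <- script_page %>%
  html_element('.community-content') %>%
  html_element('.boxheading') %>%
  html_text() %>%
  str_squish()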

Now that I had all the building blocks, I was ready to run them and get the data:

all_dialog <- do.call(rbind, lapply(all_script_links$link, get_dialog))

all_dialog %>% 
  tibble() %>%
  # Headings look like "01x05 - Episode Title": split off the title,
  # then split the season/episode code at the 'x'
  separate(episode_name, c("episode_info", "episode_name"), " - ") %>%
  separate(episode_info, c('season', 'episode'), 'x') %>%
  # Drop the stray ad-loading JavaScript that gets scraped along with the text
  filter(raw_dialog != '(adsbygoogle = window.adsbygoogle || []).push({});') %>%
  # Number the lines of dialogue within each episode
  group_by(season, episode, episode_name) %>%
  mutate(line_id = row_number()) %>%
  write.csv(file = 'IASIP_Dialog.csv', row.names = F)
  

A little more cleaning and some added metadata, and that’s it - the data is ready to use. Head over to the main blog post to see what I did with it.
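If you want to pick the data up yourself, reading the CSV back in is a one-liner (assuming the file name used above):

# Read the scraped dialogue back in for analysis
all_dialog <- read_csv('IASIP_Dialog.csv')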
