Web Scraping with R - Extracting Dialogue from IASIP Transcripts
By Andrew Robson
November 15, 2022
In this post, I’ll explain how I obtained the data for my blog post The Gang Gets NLP’d. This code worked when I first ran it but, as is the nature of web scraping, if the website ever changes, the code will stop working.
First, I loaded the necessary packages: tidyverse, plus rvest for the web scraping itself. I also defined the URLs I was interested in:
library(tidyverse)
library(rvest)

# Forum index for the It's Always Sunny in Philadelphia transcripts
base_url <- 'https://transcripts.foreverdreaming.org/viewforum.php?f=104'

# The forum lists 25 topics per page, so build the index page plus
# the six paginated pages (&start=25 through &start=150)
all_page_urls <- c(base_url,
                   paste0(base_url,
                          '&start=',
                          seq(from = 25, to = 25 * 6, by = 25)))
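Printing all_page_urls shows what that seq call produces: the forum index plus six paginated pages, offset by 25 topics each:

all_page_urls
#> [1] "https://transcripts.foreverdreaming.org/viewforum.php?f=104"
#> [2] "https://transcripts.foreverdreaming.org/viewforum.php?f=104&start=25"
#> [3] "https://transcripts.foreverdreaming.org/viewforum.php?f=104&start=50"
#> [4] "https://transcripts.foreverdreaming.org/viewforum.php?f=104&start=75"
#> [5] "https://transcripts.foreverdreaming.org/viewforum.php?f=104&start=100"
#> [6] "https://transcripts.foreverdreaming.org/viewforum.php?f=104&start=125"
#> [7] "https://transcripts.foreverdreaming.org/viewforum.php?f=104&start=150"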
This is now a vector of URLs that I need to open and scrape one by one to get the episode links. To do that, I created a function called get_script_links_from_page:
get_script_links_from_page <- function(page_url) {
  page <- read_html(page_url)
  Sys.sleep(5)  # wait between requests to be kind to the server
  page %>%
    html_element(".tablebg") %>%   # the table listing the episode topics
    html_elements('a') %>%         # every link inside it
    html_attr('href') %>%
    tibble() %>%
    rename("link" = ".") %>%
    # drop one pinned topic that isn't an episode transcript
    filter(substr(link, 1, 29) != "./viewtopic.php?f=104&t=32146") %>%
    # turn the relative "./viewtopic..." paths into absolute URLs
    mutate(link = paste0('https://transcripts.foreverdreaming.org',
                         substr(link, 2, nchar(link))))
}
The Sys.sleep line is just to be kind to the servers that host the transcripts: I didn’t want to ping them non-stop, so I added a wait of a few seconds between requests. This function returns a table of individual links to episodes, which I then stacked into a single table covering every episode:
all_script_links <- do.call(rbind,
                            lapply(all_page_urls, get_script_links_from_page))
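As an aside, purrr (loaded with the tidyverse) can do the same lapply-plus-rbind stacking in one step; map_dfr would be an equivalent here:

all_script_links <- map_dfr(all_page_urls, get_script_links_from_page)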
Next, I created another function called get_dialog that loads up each episode page and scrapes the text:
get_dialog <- function(script_url) {
  script_page <- read_html(script_url)
  Sys.sleep(1)  # a shorter pause here, but still polite to the server
  # The episode title sits in the page heading, padded with newlines and
  # tabs that need stripping out
  episode_name <- script_page %>%
    html_element('.community-content') %>%
    html_element('.boxheading') %>%
    html_text() %>%
    gsub(pattern = '\n\t\t\t', replacement = '') %>%
    gsub(pattern = '\n\t\t', replacement = '')
  # Each <p> tag in the post body holds one line of dialogue
  script_page %>%
    html_element('.postbody') %>%
    html_elements('p') %>%
    html_text() %>%
    tibble(episode_name) %>%
    rename("raw_dialog" = ".")
}
This function is similar to the get_script_links_from_page function, but it works on a slightly different page format. It also does some cleaning of the episode title, probably not in the most efficient way.
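Looking back, rvest’s html_text2() would probably handle that whitespace cleanup in one step, since it trims and collapses whitespace for you. A sketch, which may trim slightly differently from the two gsub calls above:

episode_name <- script_page %>%
  html_element('.community-content') %>%
  html_element('.boxheading') %>%
  html_text2()  # normalises whitespace, replacing the manual gsub calls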
Now that I had all the building blocks, I was ready to run them and get the data:
all_dialog <- do.call(rbind, lapply(all_script_links$link, get_dialog))
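One caveat with that line: a single failed request would abort the whole run. If I were hardening this, purrr’s possibly() is one way to skip past bad pages instead; a sketch, where a failed episode just returns NULL and gets dropped by rbind:

safe_get_dialog <- possibly(get_dialog, otherwise = NULL)
all_dialog <- do.call(rbind, lapply(all_script_links$link, safe_get_dialog))

Either way, with all_dialog in hand, the last step is to tidy it up and write it out: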
all_dialog %>%
  as_tibble() %>%
  # headings look like "<season>x<episode> - Episode Name", so split them up
  separate(episode_name, c("episode_info", "episode_name"), " - ") %>%
  separate(episode_info, c('season', 'episode'), 'x') %>%
  # drop the ad-injection snippets that appear in the post bodies
  filter(raw_dialog != '(adsbygoogle = window.adsbygoogle || []).push({});') %>%
  group_by(season, episode, episode_name) %>%
  mutate(line_id = row_number()) %>%  # number the lines within each episode
  write.csv(file = 'IASIP_Dialog.csv', row.names = FALSE)
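As a quick sanity check, reading the file back should give one row per line of dialogue, with columns for raw_dialog, season, episode, episode_name, and line_id. For example:

read_csv('IASIP_Dialog.csv') %>%
  count(season, episode, episode_name)  # dialogue lines per episode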
A little more cleaning, a bit of added metadata, and that’s it! The data is ready to use. Head over to the main blog post to see what I did with it.