The Gang Gets NLP'd - analysing the language of Always Sunny

By Andrew Robson

December 10, 2022

In this post, we will be exploring the data from the TV show It’s Always Sunny in Philadelphia. We will be using various techniques, such as sentiment analysis and topic modeling, to gain insights into the show. We will start by loading in the necessary packages and the data, and then breaking the data into tokens. From there, we will use sentiment analysis to determine the overall sentiment of each episode and explore the most common words used in the show. Finally, we will use topic modeling to identify the main topics discussed in the show and see how they have evolved over time.

Understanding the data

First, we will load the necessary packages and import the data from the previous blog post to begin our analysis:

library(tidyverse)
library(tidytext)
library(tidylo)
library(stm)
library(reshape2)

dialog <- read_csv('IASIP_Dialog.csv')

Next, we will examine the data to gain a better understanding of its structure and content:

dialog %>%
  slice(1:5) %>%
  knitr::kable()

…1	raw_dialog	season	episode	episode_name	line_id
1	Another big night, fellas… $ 164.87.	01	01	The Gang Gets r*cist	1
2	That’s not a lot of money.	01	01	The Gang Gets r*cist	2
3	No, it isn’t.	01	01	The Gang Gets r*cist	3
4	And our mortgage is due in two weeks.	01	01	The Gang Gets r*cist	4
5	We paid that a week ago.	01	01	The Gang Gets r*cist	5

We can see that each line of dialogue is represented as a separate row in the table, along with some additional metadata. Before we can analyse the data, we need to break it down into smaller units, known as tokens. These can be individual words, sentences, or groups of words (n-grams). For this analysis, we will simply split the data into individual words:

tidy_dialog <- dialog %>%
  unnest_tokens(word, raw_dialog) 

tidy_dialog %>%
  slice(1:5) %>%
  knitr::kable()

…1	season	episode	episode_name	line_id	word
1	01	01	The Gang Gets r*cist	1	another
1	01	01	The Gang Gets r*cist	1	big
1	01	01	The Gang Gets r*cist	1	night
1	01	01	The Gang Gets r*cist	1	fellas
1	01	01	The Gang Gets r*cist	1	164.87
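For comparison, here is a small sketch of the other kind of token mentioned above, n-grams. It uses one line from the sample above in a toy tibble (toy_lines, toy_words and toy_bigrams are names made up for this illustration) and asks unnest_tokens for pairs of adjacent words instead of single words.

```r
library(dplyr)
library(tibble)
library(tidytext)

# A one-row stand-in for the real dialog data (illustrative only)
toy_lines <- tibble(line_id = 1, raw_dialog = "We paid that a week ago.")

# Single words: lowercased, punctuation stripped
toy_words <- toy_lines %>%
  unnest_tokens(word, raw_dialog)

# Bigrams: every pair of adjacent words
toy_bigrams <- toy_lines %>%
  unnest_tokens(bigram, raw_dialog, token = "ngrams", n = 2)
```

Single words are enough for the sentiment and topic work in this post, but bigrams would be the place to start if you wanted to look at phrases.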

Now that we have the data all set up, let’s dive in and see what interesting insights we can uncover! One thing that would be fun to explore is the overall sentiment of each episode. Is It’s Always Sunny in Philadelphia a happy show or a sad one? We might have our own opinions on this, but let’s see what the data has to say.

Sentiment Analysis

Okay, let’s break this down! First off, we’re joining a sentiment lexicon called AFINN to our data. This lexicon contains a list of words and their corresponding sentiment scores, ranging from -5 (most negative) to 5 (most positive). By using an inner join, we only keep the words that have a sentiment score assigned to them, and then we calculate the average sentiment per episode.

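tidy_dialog %>%
  # keep only the words that have an AFINN score
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(season, episode) %>%
  summarise(sentiment = mean(value)) %>%
  ungroup() %>%
  # number the episodes in season/episode order across the whole show
  mutate(episode_number = row_number()) %>%
  ggplot(aes(x = episode_number, y = sentiment, fill = season)) +
  geom_col() +
  theme_minimal_blog +  # theme object not defined in this post (assumed to be a custom blog theme)
  labs(y = 'Average Sentiment',
       x = 'Overall Episode Number') +
  theme(legend.position = 'none')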

Once we have that, it’s time to plot the data! The colours of the bars represent the different seasons of the show. A positive bar means the scored words in that episode averaged above zero, so positive connotations outweighed negative ones, while a negative bar means the opposite.
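To make the join-then-average step concrete, here is a toy sketch. It does not use the real AFINN lexicon: toy_lexicon is a hypothetical four-word stand-in with illustrative scores, and toy_tokens is made-up data for two pretend episodes.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicon standing in for AFINN (scores are illustrative)
toy_lexicon <- tribble(
  ~word,   ~value,
  "great",  3,
  "love",   3,
  "bad",   -3,
  "hate",  -3
)

# Made-up tokens for two pretend episodes
toy_tokens <- tribble(
  ~episode, ~word,
  "A", "great",
  "A", "love",
  "A", "the",
  "A", "bad",
  "B", "hate",
  "B", "bad",
  "B", "bar"
)

toy_sentiment <- toy_tokens %>%
  # words without a score ("the", "bar") are dropped by the inner join
  inner_join(toy_lexicon, by = "word") %>%
  group_by(episode) %>%
  summarise(sentiment = mean(value), n_scored = n())
```

One thing this makes visible is that the average is taken over scored words only, so an episode’s score can rest on a fairly small share of its dialogue.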

The highest bar in the chart belongs to Season 9 Episode 1 - The Gang Breaks Dee. Was this a positive episode? It was and it wasn’t. Our analysis shows that it contains the most positive dialogue, but if you know the episode, you’ll know it’s not all that it seems. Let’s take a closer look at the words used in that episode.

library(wordcloud)

tidy_dialog %>%
  filter(season == '09', episode == '01') %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  # collapse the AFINN scores into two groups
  mutate(sentiment = ifelse(value > 0, 'Positive', 'Negative')) %>%
  count(word, sentiment, sort = TRUE) %>%
  # reshape to a word x sentiment matrix of counts for comparison.cloud()
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(match.colors = FALSE,
                   title.bg.colors = NA,
                   colors = c("#9423F9", "#00C4A9"),
                   title.size = 0.1,
                   scale = c(5, .9))
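The acast() step is the least obvious part of that pipeline, so here is a minimal sketch of what it produces, using made-up counts (toy_counts and toy_matrix are names invented for this illustration). comparison.cloud() expects a matrix with one row per word and one column per group.

```r
library(tibble)
library(reshape2)

# Made-up word counts in the same shape as the output of count()
toy_counts <- tribble(
  ~word,   ~sentiment, ~n,
  "love",  "Positive",  4,
  "great", "Positive",  2,
  "hate",  "Negative",  3
)

# One row per word, one column per sentiment, zeros where a word is absent
toy_matrix <- acast(toy_counts, word ~ sentiment, value.var = "n", fill = 0)
```

The columns come out in alphabetical order (Negative, then Positive), which is the order the two colours are applied in.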

On the flip side, the least positive episode was Season 11 Episode 6 - Being Frank. This episode is unique because it takes place entirely inside Frank’s mind, and we hear his inner monologue throughout. Given that Frank is known to be a bit of a dark character, it’s no surprise that this episode had the lowest average sentiment score.

tidy_dialog %>%
  filter(season == '11', episode == '06') %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(sentiment = ifelse(value > 0, 'Positive', 'Negative')) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(match.colors = FALSE,
                   title.bg.colors = NA,
                   colors = c("#9423F9", "#00C4A9"),
                   title.size = 0.1,
                   scale = c(5, .9))

Log Odds

Moving forward, we will be following many of the steps outlined in this excellent blog post by Julia Silge.

Now that we know if episodes are generally positive or negative, it would be interesting to see which words and themes are more common in different episodes or seasons. To gain insights into this, we will calculate the log odds of each word per season. This will tell us which season a given word is most likely to be from.

tidy_dialog %>%
  count(season, word, sort = TRUE) %>%
  # weighted log odds of each word within each season
  bind_log_odds(season, word, n) %>%
  # ignore words used 20 times or fewer in a season
  filter(n > 20) %>%
  group_by(season) %>%
  slice_max(log_odds_weighted, n = 5) %>%
  mutate(word = reorder_within(word, log_odds_weighted, season)) %>%
  ggplot(aes(log_odds_weighted, word, fill = season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(paste("Season", season)), scales = "free") +
  scale_y_reordered() +
  theme_minimal_blog +
  labs(x = 'Weighted Log Odds', y = NULL)

The words at the top of each chart are the ones most characteristic of that season. For instance, if you’re watching an episode and they are talking about prom, you’re probably watching Season 1. Let’s just ignore Season 3 for now.
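To get a feel for how the weighted log odds behave, here is a small sketch on made-up counts (toy_counts and toy_log_odds are invented for this illustration, and the numbers are not from the show). The word "prom" is heavily over-represented in the pretend season 01, while "bar" and "gang" are spread fairly evenly.

```r
library(dplyr)
library(tibble)
library(tidylo)

# Made-up word counts for two pretend seasons
toy_counts <- tribble(
  ~season, ~word,  ~n,
  "01",    "prom", 30,
  "01",    "bar",  20,
  "01",    "gang", 20,
  "02",    "prom",  2,
  "02",    "bar",  24,
  "02",    "gang", 22
)

# Adds a log_odds_weighted column: one value per season/word pair
toy_log_odds <- toy_counts %>%
  bind_log_odds(season, word, n) %>%
  arrange(desc(log_odds_weighted))
```

Here "prom" should get a clearly higher score in season 01 than in season 02, while the evenly spread words sit closer to zero. That is the same pattern the chart above picks out for each real season.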

Topic Model

To wrap things up, we will use a topic model to group collections of words into topics. If you want to learn more about the details, I would highly recommend reading Julia’s blog post on this. Essentially, we will use an unsupervised machine learning model to analyse all of the dialogue and identify common themes. First, we will create the model, and then we will examine the topics based on their high “lift” words. These are the words most distinctive to each topic: they may show up in other topics too, but they are what sets this topic apart from the others.

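set.seed(4689257)

# one document per episode, stored as a sparse document-term matrix
dialogue_sparse <- tidy_dialog %>%
  mutate(document = paste(season, episode, sep = "_")) %>%
  count(document, word) %>%
  # drop words used three times or fewer in an episode
  filter(n > 3) %>%
  cast_sparse(document, word, n)

# fit a structural topic model with five topics
topic_model <- stm(dialogue_sparse, K = 5, verbose = FALSE)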

tidy(topic_model, matrix = "lift") %>%
  group_by(topic) %>%
  slice_head(n = 10) %>%
  mutate(rank = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    names_from = "topic", 
    names_glue = "topic {.name}",
    values_from = term
  ) %>%
  select(-rank) %>%
  knitr::kable()

topic 1	topic 2	topic 3	topic 4	topic 5
terrell	band	delicious	assraped	log
creepy	lil	hu	heroic	crippled
popular	pooping	hum	silvia	lap
trey	turd	monkey	pennington	certain
cancer	brilliant	ok	ty	grout
shake	corpse	plate	letter	unemployment
once	crime	rambo	mascot	empty
phase	extra	tea	tunnel	friday
worse	featured	barter	whoomp	punch
heroin	lundgren	canyon	bored	snow
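Lift can feel a bit abstract, so here is a simplified sketch of the idea rather than stm’s actual internals. It assumes a made-up matrix of topic-word probabilities (beta) and made-up overall word frequencies (word_prob), and divides one by the other. A word that is common everywhere gets a lift near 1, while a word concentrated in one topic gets a high lift there.

```r
# Hypothetical topic-word probabilities (rows are topics, columns are words)
beta <- matrix(
  c(0.50, 0.40, 0.10,
    0.50, 0.10, 0.40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic 1", "topic 2"), c("the", "bar", "rum"))
)

# Hypothetical overall frequency of each word in the whole corpus
word_prob <- c(the = 0.50, bar = 0.25, rum = 0.25)

# Lift: probability within the topic divided by overall frequency
lift <- sweep(beta, 2, word_prob, "/")

# The highest-lift word for each topic
top_lift_word <- colnames(lift)[apply(lift, 1, which.max)]
```

In this toy example "the" is the most probable word in both topics but has a lift of 1 in each, so it never ranks first. That is why the table above is full of rare, oddly specific words rather than everyday ones.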

It can be difficult to determine exactly why the model has grouped words into a particular topic. It’s not always as straightforward as seeing a topic called “Charlie Work” and knowing exactly what it’s about. Nonetheless, let’s plot these topics over all of the seasons to see which seasons contain which topics. This will give us a better understanding of how the themes of the show have evolved over time.

tidy(topic_model, 
     matrix = "gamma",
     document_names = rownames(dialogue_sparse)) %>%
  separate(document, c("season", "episode"), sep = "_") %>%
  mutate(topic = factor(topic)) %>%
  ggplot(aes(topic, gamma, fill = topic)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(vars(paste("Season", season))) +
  labs(y = expression(gamma)) +
  theme_minimal_blog

From the plot, we can see that the early seasons generally revolve around the same topics, primarily topics 1 and 5. After Season 9, these topics rarely appear again. This suggests that the show’s focus shifted over time, with different themes and topics becoming more prominent as the seasons progressed.
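If you would rather have numbers than boxplots, one option is to average gamma (the share of each episode assigned to each topic) by season. The sketch below does this on made-up values. toy_gamma mimics the shape of the tidied gamma output with invented numbers, and season_topics is a name made up for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up gamma values: each episode's proportions sum to 1 across topics
toy_gamma <- tribble(
  ~document, ~topic, ~gamma,
  "01_01",   1,      0.7,
  "01_01",   2,      0.3,
  "01_02",   1,      0.9,
  "01_02",   2,      0.1,
  "10_01",   1,      0.2,
  "10_01",   2,      0.8
)

season_topics <- toy_gamma %>%
  # split the "season_episode" document id back into two columns
  separate(document, c("season", "episode"), sep = "_") %>%
  group_by(season, topic) %>%
  summarise(mean_gamma = mean(gamma), .groups = "drop")
```

Swapping toy_gamma for the real tidied gamma output would give one row per season and topic, which makes a shift like the one after Season 9 easier to put a number on.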

Final Thoughts

We have seen that the overall sentiment of the episodes varies, with some being more positive and others more negative. We have also identified the most common words used in the show, and used a topic model to identify the main themes discussed throughout the seasons. Overall, this analysis has given us a deeper understanding of the show and its evolution over time. And just like the show itself, this analysis has been a wild and unpredictable ride.

This conclusion was written by ChatGPT. I’m not sure how true it is, but it feels like a conclusion.
