It's too early to give a yellow card
By Andrew Robson
January 5, 2024
Today we will be using the worldfootballR package to take a brief look at every single yellow card in the 2022/2023 EPL season. I remember looking at something like this when I was still at university, just as I was starting to learn how to do this kind of analysis.
I made hard work of it - many manual steps, cleaning data in Excel, no repeatability. It doesn’t have to be that hard, and the worldfootballR package makes it even easier.
worldfootballR has a great selection of useful functions which help you collect data from many different sources (Transfermarkt, FBref, Understat). There’s really a lot you can do with it but I’m going to be doing something fairly basic in comparison to everything that’s possible.
There’s a cliche in football that it’s ‘too early to give out a yellow card’. Maybe it’s because I’m autistic, but I don’t understand it - a yellow card early in the game can have a big impact on a player’s ability to make necessary tackles later on, so why do you get a pass if it’s early in the game? Or is it all a myth?
Let’s start by loading the usual packages, alongside the new worldfootballR.
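# Load the scraping package plus the usual wrangling and plotting tools
library(worldfootballR)
library(tidyverse)
library(ggplot2)

# Get the match URL for every game of the 2022/2023 men's EPL season
match <- fb_match_urls(country = "ENG", gender = "M", season_end_year = 2023, tier = "1st")

# Scrape each match summary, keep only the card events and stack the results
match_data <- do.call(rbind, lapply(match, function(x) {
  data <- fb_match_summary(match_url = x)
  data %>%
    select(Match_Date, Home_Team, Away_Team, Event_Time, Event_Half, Event_Type, Home_Away) %>%
    filter(Event_Type == 'Yellow Card' | Event_Type == 'Red Card')
}))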
After loading the packages, the code above collects the data we want. It’s not much code but it does a lot, and it’s easy to extend or modify to look at other leagues. First, we use fb_match_urls to get a list of all 380 games in the men’s EPL season for 2022/2023. Then, we iterate over them and apply the fb_match_summary function to each one. This function gives a basic summary of the game you supply the URL for: details like home team, away team and date, plus a few rows which indicate the major match events like yellows, reds and goals.
We just take each game summary, select only the columns we want and then filter it down to just the cards. Note that this function scrapes the web, so to be kind to the servers it waits 3 seconds on each request - with 380 games, that means this actually takes about 20 minutes to run.
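If you want to check the filtering step without waiting for the scrape, here’s a minimal sketch on a hand-made data frame. The rows in toy_summary are invented and the column types are my assumption; only the column names come from the code above. It also shows where the 20 minute estimate comes from.

```r
library(dplyr)

# Made-up rows that imitate the columns selected above; not real match data.
toy_summary <- tibble(
  Match_Date = "2022-08-06",
  Home_Team  = "Team A",
  Away_Team  = "Team B",
  Event_Time = c(12, 30, 46, 77),
  Event_Half = c(1, 1, 1, 2),
  Event_Type = c("Goal", "Yellow Card", "Yellow Card", "Red Card"),
  Home_Away  = c("Home", "Away", "Home", "Away")
)

# Same condition as in the scraping loop, written with %in%.
toy_cards <- toy_summary %>%
  filter(Event_Type %in% c("Yellow Card", "Red Card"))

# Rough runtime: one request per match, with a 3 second pause each.
n_matches     <- 380
pause_seconds <- 3
est_minutes   <- n_matches * pause_seconds / 60
est_minutes
```

The goal row is dropped and the three card rows are kept, which is all the filter inside the loop does for each match.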
Once it’s finished, we have a great dataset to explore. I explored it for a while but it wasn’t as interesting as I was expecting; there isn’t much variation between the home and away sides, for example. The following plot is the best thing I found, so let’s get right to that.
First, we filter to yellow cards and count the number of yellow cards in each minute, then divide by the total number of yellow cards. Strictly speaking, this gives each minute’s share of all yellow cards rather than the chance that a card is shown in that minute, but it works as a rough measure of how likely a yellow card is to come at each point in the game. (Note: we also have to group by Event_Half because, with added time, there are two 46th minutes, for example: one in first-half stoppage time and one at the start of the second half.)
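To see why the grouping by half matters, here’s a small sketch on invented data (the five rows below are made up, not taken from the scrape). Counting by minute alone merges the two different 46th minutes into one bar, while counting by half and minute keeps them apart.

```r
library(dplyr)

# Made-up yellow cards: minute 46 appears once in each half.
toy_cards <- tibble(
  Event_Half = c(1, 1, 2, 2, 2),
  Event_Time = c(30, 46, 46, 80, 80)
)

# Counting by minute only lumps both 46th minutes together.
by_minute_only <- toy_cards %>%
  count(Event_Time) %>%
  mutate(share = n / sum(n))

# Counting by half and minute keeps them separate.
by_half_and_minute <- toy_cards %>%
  count(Event_Half, Event_Time) %>%
  mutate(share = n / sum(n))

by_minute_only
by_half_and_minute
```

In both versions the shares add up to 1 across the whole table, because we divide by the total number of yellow cards rather than by the number of games.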
That’s pretty much all the processing we need, then we’re good to plot it. It looks like a lot of code but most of it is just to make it look nice. Mainly, we’re using geom_bar to plot the percentage chance of a yellow card in each minute.
prob_data <- match_data %>%
  filter(Event_Type == 'Yellow Card') %>%
  group_by(Event_Half, Event_Time) %>%
  count(Event_Time) %>%
  ungroup() %>%
  mutate(probability = n/sum(n))

# Calculate summary statistics
summary_stats <- prob_data %>%
  group_by(Event_Half) %>%
  summarise(avg_probability = mean(probability),
            max_probability = max(probability),
            max_minute = Event_Time[which.max(probability)])
# Plotting the probabilities with additional annotations
# Note: to_string and theme_minimal_blog are not defined in this post;
# they need to exist in your session before this will run
ggplot(prob_data,
       aes(x = Event_Time, y = probability, fill = factor(Event_Half))) +
  geom_bar(stat = "identity",
           position = "dodge",
           alpha = 0.8) +
  geom_hline(data = summary_stats, aes(yintercept = avg_probability),
             linetype = "dashed",
             color = "grey",
             linewidth = 0.8) +
  geom_text(data = summary_stats,
            aes(label = paste0(round(max_probability, 3)*100, '%'),
                x = max_minute,
                y = max_probability),
            vjust = -1,
            color = "grey20",
            size = 4,
            fontface = "bold") +
  labs(x = "Minute",
       y = "Probability of Yellow Card",
       subtitle = 'From all games of the 2022/2023 EPL Season',
       caption = 'andrew-robson.com - Data Source: FBref') +
  ggtitle("Minute by Minute Probability of Yellow Cards") +
  facet_wrap(~Event_Half, scales = 'free_x', labeller = as_labeller(to_string)) +
  theme_minimal(base_size = 20) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal_blog +
  scale_fill_brewer(palette = "Set2")
An immediate takeaway is that the first half has considerably fewer yellow cards. This could be because of the cliche that it’s ‘too early’, but there are other plausible explanations. You can’t get a yellow card for an accumulation of fouls if you haven’t had time to commit fouls. You’re also less likely to be trying to waste time early on. We can see spikes at 45 and 90 minutes - I assume these are mostly driven by unsporting offences like time wasting.
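If you’d rather put a number on the first half versus second half difference than read it off the plot, something like this would do it. It’s only a sketch: share_by_half is a helper name I’ve made up for illustration, and the toy rows are invented, so the output says nothing about the real season.

```r
library(dplyr)

# Hypothetical helper: share of yellow cards that falls in each half.
share_by_half <- function(cards) {
  cards %>%
    filter(Event_Type == "Yellow Card") %>%
    count(Event_Half, name = "yellows") %>%
    mutate(share = yellows / sum(yellows))
}

# Made-up card events, just to show the shape of the result.
toy_cards <- tibble(
  Event_Half = c(1, 2, 2, 2),
  Event_Type = c("Yellow Card", "Yellow Card", "Red Card", "Yellow Card")
)

half_shares <- share_by_half(toy_cards)
half_shares
```

With the scraped data you would call share_by_half(match_data) instead of passing in the toy rows.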
However, we do see there are a few early yellows! One in particular came in the first minute. The game was Brentford vs Wolverhampton Wanderers - a yellow card for Semedo. Unfortunately, Sky didn’t think it was interesting enough to put in their match highlights.
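Tracking down the earliest yellows is just a sort. Here’s a rough sketch, again with a made-up helper name (earliest_yellows) and invented rows rather than the real scrape.

```r
library(dplyr)

# Hypothetical helper: the n earliest yellow cards, ordered by half then minute.
earliest_yellows <- function(cards, n = 3) {
  cards %>%
    filter(Event_Type == "Yellow Card") %>%
    arrange(Event_Half, Event_Time) %>%
    head(n)
}

# Made-up events; team names are placeholders.
toy_cards <- tibble(
  Home_Team  = c("Team A", "Team C", "Team E", "Team G"),
  Away_Team  = c("Team B", "Team D", "Team F", "Team H"),
  Event_Time = c(52, 1, 17, 3),
  Event_Half = c(2, 1, 1, 1),
  Event_Type = c("Yellow Card", "Yellow Card", "Yellow Card", "Red Card")
)

early <- earliest_yellows(toy_cards, n = 2)
early
```

Running earliest_yellows(match_data) on the scraped data should surface the first-minute card along with the match it came from.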