Here’s a snapshot of some of our recent work in sports and esports
Full report: A linear modelling exploration of predictors of scores in the AFL
Preface: This report is a slightly edited version of an assignment I submitted for a PhD linear modelling course at The University of Sydney. I acknowledge that "points scored" is an integer response which may be better modelled using a Poisson or negative binomial GLM. I hope to get around to this one day soon, but for now, enjoy the interesting results that came from a standard linear modelling approach.
Abstract
AFL is a highly popular Australian sport that has garnered a lot of talk show attention but suffers from a lack of statistical rigour. The present report seeks to bridge this gap by providing a statistically robust exploration of predictors of scores, using data aggregated at the team and match level from the 2005-2019 seasons inclusive. An ordinary least squares regression model was used alongside preliminary exploration of a more sophisticated generalised additive model approach. Robust variance-covariance matrix estimators were used due to the presence of mild heteroscedasticity. Results found that tackles and unforced errors significantly and negatively predicted match scores, while rebounds, marks inside 50, marks, inside 50s, handballs, free kicks for, and clearances all significantly and positively predicted match scores. Contested marks and hit outs did not significantly predict match scores. Implications for coaching and gameplay strategy, as well as limitations, are discussed.
Introduction
AFL is a highly popular Australian sports league that began in 1896 and continues strongly today, with Grand Final match attendance (outside of the anomalous COVID-19-impacted 2020 season) approximating a sold-out 100,000 each year at the traditional host venue - the Melbourne Cricket Ground. An AFL match is won on points, which are accumulated by kicking either a goal (worth six points) or a behind (worth one point). Despite its popularity and complexity, AFL is a sport that has traditionally relied on subject matter expertise and the knowledge of past players to inform coaching strategies. Much like other Australian sports, a lack of empirical statistical sophistication is evident.
Globally, sports analytics has continued to generate increasing attention, with websites such as FiveThirtyEight and Advanced Sports Analytics creating stylish platforms that constitute a reliable source of insight and interactive analysis. However, this form of innovative and detailed analysis has yet to fully permeate Australian sports. While the AFL has many dedicated talk show analysis television programs such as AFL 360, The Front Bar, and Talking Footy, these programs focus mostly on qualitative breakdowns of high-level descriptive statistics and not on statistical rigour. This report aims to bridge some of this gap by providing a preliminary statistical investigation of factors associated with scoring in the AFL. Specifically, this report aims to explore the following research question: Which gameplay attributes are predictors of scores in AFL matches?
Data set
Historical AFL data has been made readily accessible in an open-source setting through the R package fitzRoy. The package provides a simple API that accesses and integrates a range of data sources that collate AFL data. Examples of these sources include:
- AFL
- AFL Tables
- Squiggle
- FootyWire
The data itself is diverse, covering domains as broad as player and match statistics, Brownlow medal votes, betting odds, attendance numbers, and match times. This report focuses on player and match statistics by aggregating quantities of interest to team-per-match-level sums using data for the 2005-2019 seasons, inclusive. This time period is somewhat arbitrary, but was chosen to balance a large sample size against recency and homogeneity. The 2020 season is a strong counterexample, where the season was truncated and played almost entirely in Queensland due to the impacts of COVID-19. This means the standard set-up of games - having a home and an away team - was not normal in 2020, and thus data for the entire season may represent a heterogeneous set.
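As a starting point, the sketch below shows how such data might be pulled, assuming fitzRoy's fetch_player_stats() interface (which has changed across package versions):

```r
library(fitzRoy)
library(dplyr)
library(purrr)

# Pull player-level match statistics for the 2005-2019 seasons, inclusive.
# The source argument and its accepted values may differ between versions.
seasons <- 2005:2019
player_stats <- map_dfr(
  seasons,
  ~ fetch_player_stats(season = .x, source = "afltables")
)
```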
Data limitations
Despite the availability of so much player-level data, the author of the fitzRoy package and the creators of the sources it pulls from all note potential caveats around the data. The main caveat is that the data is not official. Each source pulls from multiple others, and many individual people are involved in the continual updating of information. The accuracy of the data in fitzRoy is therefore largely contingent on the accuracy of the sources underpinning the websites it scrapes. While this is cause for concern, a large number of industry-standard sources comprise the majority of the data used in this report, including official statistics produced by the AFL, newspapers and magazines (such as The Herald Sun and Inside Football), and official books (such as @everyone and @everygame). The open-source nature of many of the sources, especially AFL Tables, means continual improvement in accuracy is being achieved, further lending confidence to the available data, though some caution is still advised.
Variable retention
A small subset of variables was retained from the larger dataset. The subset was developed based on the author's AFL subject matter expertise. The variables were selected based on their likely relationship to a team's ability to score, and on whether a team could implement a training or coaching intervention off the back of this analysis to better target important predictors. For example, the variable free kicks against was not included, as the number of free kicks given away by a team is not a core contributor to that team's scoring, and is likely near impossible to coach out of a team's game.
The variables that were retained for the purposes of this analysis included team-match-level counts of scores, marks, handballs, hit outs, tackles, rebounds, inside 50s, clearances, clangers (unforced errors), free kicks for, contested possessions, contested marks, and marks inside 50.
Analysis
A rigorous and detailed linear modelling pipeline was implemented. This involved the following steps, each of which will be discussed in turn:
- Exploratory data analysis and visualisation
- Model fitting and assumption testing
Exploratory data analysis and visualisation
Prior to modelling, the data were aggregated and explored visually and numerically to understand their empirical structure. The data were aggregated to match-level sums for each team by summing over individual player statistics. Matches outside of regular season games (i.e. finals) were removed, as finals likely represent a heterogeneous set. The figure below shows the distributions of each aggregated quantitative variable.
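A minimal sketch of this aggregation step follows, continuing from the data pulled earlier; the column names (Season, Round, Playing.for, Goals, Behinds, and the statistic columns) are illustrative and should be checked against the actual fitzRoy output:

```r
# Aggregate player statistics to team-per-match sums and drop finals.
# Finals rounds are assumed to be coded non-numerically (e.g. "QF", "GF").
team_match <- player_stats %>%
  filter(!grepl("F", Round)) %>%
  group_by(Season, Date, Playing.for) %>%
  summarise(
    across(c(Marks, Handballs, Hit.Outs, Tackles, Rebounds, Inside.50s,
             Clearances, Clangers, Frees.For, Contested.Possessions,
             Contested.Marks, Marks.Inside.50),
           ~ sum(.x, na.rm = TRUE)),
    # Team score: six points per goal, one per behind
    score = 6 * sum(Goals, na.rm = TRUE) + sum(Behinds, na.rm = TRUE),
    .groups = "drop"
  )
```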
The data were further explored using high-level summary statistics, which revealed large differences in scale between the variables. To avoid issues with high-variance predictors (due to scale) dominating the linear modelling or producing extremely small coefficients, all predictors were mean-centred and standardised (z-scored) prior to modelling. This also gives the coefficients an intuitive interpretation compared to other rescaling methods.
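In R, this rescaling might look like the following (predictor names continue from the illustrative aggregation sketch above):

```r
# z-score each predictor so a coefficient reflects the change in score
# per one-standard-deviation change in that predictor
predictor_cols <- c("Marks", "Handballs", "Hit.Outs", "Tackles", "Rebounds",
                    "Inside.50s", "Clearances", "Clangers", "Frees.For",
                    "Contested.Possessions", "Contested.Marks",
                    "Marks.Inside.50")

model_data <- team_match %>%
  mutate(across(all_of(predictor_cols), ~ as.numeric(scale(.x))))
```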
Model fitting and assumption testing
There are four core assumptions of a linear regression model. These include:
- Independent observations
- Linear relationship between X and y
- Normality of residuals
- Homogeneity of variance
Since the data is at the most independent level possible for the present analysis (acknowledging that some dependence may exist between the two teams playing in the same match), the following sections focus on reporting the testing of the other assumptions.
Linear relationship
The purpose of a linear model is to understand the relationship between some number of predictors and a quantitative response variable. As such, a linear model at its core assumes that all predictors are related linearly to the response variable. These bivariate relationships are presented in the figure below. At this stage, a preliminary linear ordinary least squares (OLS) model was fit, which confirmed the visual hypothesis that two variables - contested marks and hit outs - were not significantly associated with total scores. These variables were dropped for the remaining analysis.
A follow-up assessment of linearity was conducted using a residuals versus fitted plot (see the top left plot in the figure below), in which a slight quadratic shape is noted. Three new models were fit in response: OLS with second-degree polynomial terms on suspect predictors, OLS with a square-root-transformed response, and OLS with a log-transformed response. These models introduced new issues without addressing the underlying problem, and given that the quadratic shape was only slight, with the data points themselves looking rather evenly dispersed, the additional models were not retained.
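A sketch of the preliminary fit, its diagnostic plots, and the transformed-response alternatives might look like this (model_data and predictor_cols continue from the earlier sketches):

```r
# Preliminary OLS fit of total score on all standardised predictors
ols_fit <- lm(score ~ ., data = model_data[, c("score", predictor_cols)])
summary(ols_fit)

# Standard diagnostics: residuals vs fitted, QQ, scale-location, leverage
par(mfrow = c(2, 2))
plot(ols_fit)

# Transformed-response alternatives explored (and not retained);
# '.' expands to all columns in the data except the response
ols_sqrt <- lm(sqrt(score) ~ ., data = model_data[, c("score", predictor_cols)])
ols_log  <- lm(log(score) ~ ., data = model_data[, c("score", predictor_cols)])
```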
Homogeneity of variance
Homogeneity of variance - the lack of a systematic pattern or bias in residuals across fitted or predictor values - is another core linear model assumption. This assumption is typically assessed graphically using a residuals versus fitted and/or standardised residuals (scale-location) plot. A model with homogeneity of variance should show no discernible pattern across the fitted values. This plot is depicted in the bottom left of the figure above.
Evidently, there is a non-horizontal line through the plot, indicating potential heteroscedasticity. Visual inspection of the data points themselves suggests only mild heteroscedasticity, as they look reasonably evenly dispersed. It was first hypothesised that potential outliers might be influencing the results, despite the lack of compelling visual evidence of leverage in the Cook's distance plot. A test of the maximum studentised residual against a Bonferroni-corrected critical value was conducted. Since the maximum studentised residual of 4.14 was less than the critical value of 4.45, outliers were declared not to be an issue.
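This test is implemented in the car package; a minimal sketch:

```r
library(car)

# Bonferroni-corrected test of the largest |studentised residual|
outlierTest(ols_fit)
```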
Heteroscedasticity is a major issue for linear models. In response to this potential violation, a weighted OLS model was tested. The weighted OLS model works by computing a weight for each data point, calculated as the inverse of the squared fitted values from an auxiliary linear regression with the absolute residuals of the original model as the response and the fitted values of the original model as the predictor. The weight vector is then factored into the matrix decomposition used to solve the linear regression problem. This method did not fix the heteroscedasticity issue.
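Expressed in R, this weighting scheme might look like:

```r
# Auxiliary regression: |residuals| of the original fit on its fitted values
aux_fit <- lm(abs(residuals(ols_fit)) ~ fitted(ols_fit))

# Weights are the inverse squared fitted values of the auxiliary model
w <- 1 / fitted(aux_fit)^2

wls_fit <- lm(score ~ ., data = model_data[, c("score", predictor_cols)],
              weights = w)
```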
As a solution, a heteroscedastic-robust estimator was used, which produces robust estimates of standard errors, test statistics, and p-values. This may seem like a highly conservative response to a relatively weak violation; however, erring on the side of caution could be considered a safe option in applied settings. Robust estimators are implemented in R in the sandwich package. The estimators work by introducing a new term, $\Omega$, that flexibly acts on the diagonal of the variance-covariance matrix, relaxing the homogeneity assumption by allowing differing variances along the matrix diagonal (see both equations below). The inclusion of heteroscedastic-robust estimators increases standard errors, reduces the size of test statistics, and drives p-values away from zero, reflecting the variance structure of the data.
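In standard notation, these take the form:

$$\widehat{\operatorname{Var}}(\hat{\beta}) = (X^{\top}X)^{-1} X^{\top} \hat{\Omega} X (X^{\top}X)^{-1}$$

$$\hat{\Omega} = \operatorname{diag}(\hat{\omega}_1, \ldots, \hat{\omega}_n)$$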
Numerous $\Omega$ options are available in the sandwich package. Most of the options returned negligibly different values for the present analysis, so the default HC3 estimator recommended by the package authors was retained. It is defined according to the equation below.
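The HC3 weights take the standard form

$$\hat{\omega}_i = \frac{\hat{u}_i^2}{(1 - h_{ii})^2}$$

where $\hat{u}_i$ is the $i$-th residual and $h_{ii}$ is the $i$-th diagonal element of the hat matrix. Applying the robust estimator to the fitted model might look like the following sketch:

```r
library(sandwich)
library(lmtest)

# Robust coefficient table; HC3 is the default type in vcovHC()
coeftest(ols_fit, vcov = vcovHC(ols_fit, type = "HC3"))
```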
Results
Coefficients and model outputs are presented in the table below where the first column of the dependent variable section is the standard OLS model and the second column is the heteroscedastic-robust corrected model. Interpretation will focus on the robust estimators, given the violation of homoscedasticity. For each predictor, coefficients, standard errors, t-statistics, and p-values are reported. Since all predictors were mean-centred and standardised (z-scored) prior to analysis, the interpretation is as follows: the coefficient represents the expected change in total score (response variable) for a one standard deviation change in the predictor. The overall model is statistically significant, F = 811.8 (df = 10; 5641), and explains approximately 59% of the observed variance in scores.
Two predictors were negative and statistically significant. These were tackles (t = -4.28, p < .001) and clangers (t = -15.54, p < .001), such that a one standard deviation increase in tackles is associated with a mean reduction of 1.2 points in total score, and a one standard deviation increase in clangers is associated with a mean reduction of 4.2 points. Of the positive predictors, the two with the strongest coefficients are mechanically related in terms of AFL gameplay: inside 50s (t = 28.03, p < .001) and marks inside 50 (t = 38.6, p < .001). The magnitude of both is noteworthy: a one standard deviation increase in inside 50s is associated with a mean increase of 9.3 points in total score, and a one standard deviation increase in marks inside 50 with a mean increase of 12.2 points. The remaining positive predictors are reported in the table above.
Discussion
The present analysis aimed to produce an innovative and statistically robust exploration of predictors of scoring in the AFL, using team-per-match-level data for the 2005-2019 seasons inclusive accessed through the R package fitzRoy. While not necessarily causal, the analysis sought to quantify the type and magnitude of any relationships with end-of-match scores.
Implications for AFL teams
This report found some potentially informative relationships regarding scoring in the AFL that teams may seek to consider. First, teams should seek to deeply understand their potential to generate opportunities within the fifty-metre arc in front of goal. The analysis strongly supports this recommendation, as increases in inside 50s and marks inside 50 are both associated with substantial increases in total score. This is intuitive from a gameplay sense, as being closer to goal with possession of the ball increases the likelihood of scoring, and a mark inside 50 means a guaranteed uninterrupted set shot at goal, further increasing the likelihood of kicking a six-point goal.
Second, teams should also consider the importance of clearances. The strong positive association found between clearances and scores was surprising, because clearances involve a team kicking the ball away from their own goal area - a heavily defensive statistic. The positive relationship may suggest that the opposition team was unsuccessful in scoring on multiple occasions, allowing the team to convert a successful defence into attacking opportunities of their own.
Third, teams should be cautious about interpreting the causal direction of some of the relationships presented in this paper. The negative relationship between tackles and scores is one such example. It is not necessarily the case that tackling less directly results in higher scores at the end of a match. It is far more likely that teams who score more (and are therefore more likely to win) are simply more defensively efficient, or spend more time attacking rather than defending. Both of these characteristics would manifest as noticeably lower tackle counts.
Limitations
Despite the potentially informative findings, there were some limitations to the analysis. The first, as described earlier, is that the data is not official, and therefore its accuracy is unknown. It is likely that the data quality is high, given that some of the underlying sources are official, published material, and that the project is open-source with contributions from numerous high-profile researchers and analysts.
A second limitation is that of variable selection. The variables included in this analysis were selected based on the author's subject matter expertise and prior knowledge of AFL. However, these variables only explained roughly sixty per cent of the variance in match scores. It is highly likely that the addition of more variables included in the larger dataset of approximately sixty variables would help drive this number closer to a more respectable percentage, such as eighty or ninety per cent. Since factor variables are included in the broader dataset, their inclusion raises some interesting questions around interaction terms. For example, future research may seek to fit interaction terms by team, or by home versus away, to better understand the dynamics of AFL metrics on match scores. Of course, the inclusion of more covariates, especially large numbers of them, may raise serious issues around multicollinearity or other model assumptions. Researchers may seek to account for this by first applying variable selection procedures such as Lasso regression.
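As a sketch of such a variable selection pass, a cross-validated Lasso with glmnet might look like the following (continuing with the illustrative objects from earlier):

```r
library(glmnet)

# Lasso (alpha = 1) with cross-validation to choose the penalty
X <- as.matrix(model_data[, predictor_cols])
y <- model_data$score
cv_lasso <- cv.glmnet(X, y, alpha = 1)

# Predictors whose coefficients are shrunk to exactly zero are dropped
coef(cv_lasso, s = "lambda.1se")
```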
A third limitation is that of model selection. It remains unclear whether an ordinary least squares regression approach is the optimal modelling technique for this data. Preliminary follow-up analysis undertaken by the author revealed that a generalised additive model - a model that additively combines estimated smooth functions of each covariate, using splines - produced a better fit, as indicated by a lower Akaike information criterion value. Further, since the response variable is a count, it may be more appropriate to consider a generalised linear model with a link function appropriate to an integer response, such as a Poisson or negative binomial model. The added benefit of these models is that they correctly model the response as a discrete-valued probability mass function, instead of the probability density function assumed by a Gaussian linear model (if a maximum likelihood rather than ordinary least squares approach is taken). This may be particularly pertinent if future endeavours focus on predictive applications. Future research should aim to consider these modelling options, and potentially even perform a direct comparison.
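A sketch of both alternatives using the mgcv package follows; the smooth terms shown are illustrative, not the exact specification of the follow-up analysis:

```r
library(mgcv)

# Generalised additive model: spline smooths for each covariate
gam_fit <- gam(score ~ s(Marks) + s(Inside.50s) + s(Clearances) + s(Clangers),
               data = model_data)

# Count-appropriate alternative: Poisson GAM on the raw (integer) scores;
# family = nb() would give the negative binomial version
pois_fit <- gam(score ~ s(Marks) + s(Inside.50s) + s(Clearances) + s(Clangers),
                family = poisson(link = "log"), data = team_match)

AIC(ols_fit, gam_fit, pois_fit)
```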
Available code
All code for this paper is available on GitHub.
AFL Analysis: How different is gameplay in finals versus regular season games?
AFL is an incredible sport, and one that many Australians are highly passionate about. Analysis of all facets of the game is also extremely popular, with many talk shows dedicated to it throughout the season. Examples of these shows include (but are not limited to):
- AFL 360
- On the Couch
- AFL Tonight
However, while the analysis these shows present and discuss is deep and informative, it (understandably) lacks statistical and scientific rigour - as this is not the premise on which the shows are built. Orbisant set out to provide a distinct viewpoint on AFL data with statistical modelling.
Using game data from AFL Tables and FootyWire, pulled using the R package fitzRoy, Orbisant produced a set of research questions. The first of these is the focus of this article:
Do finals series games show different match statistics compared to regular season games? Is this a product of the “pressure” of finals?
Guided by this research question, Orbisant framed up two hypotheses to guide the work:
Hypothesis I: Finals are “messier” and more competitive than season games
Hypothesis II: Competitive game statistics will be able to predict whether a game was a final or not
To start off, Orbisant pulled data for the past 10 seasons (2011 to 2020) and plotted a few key match statistics that might be of interest. These are visible in the graph below. Evidently, the distributions of match-level means for finals and season games are remarkably different, with a few metrics showing results of interest:
- There is a much thicker tail on the higher end for contested marks and contested possessions in finals compared to season games
- There is a much thicker tail on the lower end for goals and handballs in finals compared to season games
There seems to be some support for the notion that finals might be more competitive (seen through higher contest metrics) and messier (thicker lower tails for ‘clean’ football metrics such as goals and handballs).
The distributions are a useful starting point, but statistically testing the hypothesis of difference yields a more rigorous answer. There are multiple ways to operationalise this, such as fitting separate models for each metric, but this would require controlling the Type-I error rate. Another way is to use the match-level averages to predict the binary response variable of game type (1 = final, 0 = regular season). This second method can be done using logistic regression. Consideration should be given to the difference in group sample sizes here, as there are many more season games than there are finals.
One output of the logistic regression model is presented in the graph below. An odds ratio greater than 1 signifies that the higher the value of the variable, the higher the odds of the predicted game being a final. Conversely, an odds ratio lower than 1 signifies that the higher the value of the variable, the higher the odds of the predicted game being a regular season game. However, this interpretation is only valid for ‘significant’ variables - identified by the entire confidence interval excluding an odds ratio of 1.
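A minimal sketch of this kind of model and its odds-ratio output, with is_final, match_means, and the predictor names as illustrative:

```r
# Logistic regression of game type on match-level mean statistics
logit_fit <- glm(is_final ~ contested_marks + contested_possessions +
                   clearances + goals + handballs,
                 family = binomial, data = match_means)

# Odds ratios with profile-likelihood 95% confidence intervals
exp(cbind(OR = coef(logit_fit), confint(logit_fit)))
```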
Evidently, higher numbers of contested marks and clearances in a game are both significantly associated with higher odds of a game being a final. This is consistent with the ‘more competitive’ hypothesis; however, another key metric - contested possessions - is not a feature that can easily discriminate between the two game types, which offers some evidence against the hypothesis. In addition, higher numbers of goals and handballs in a game are both significantly associated with higher odds of a game being a regular season game. This suggests that fewer goals are scored in finals (which is consistent with both the competitiveness and ‘messiness’ hypotheses), and fewer clean handballs are performed in finals.
This analysis is the first in a multi-part series that dives into the hard statistics of AFL. Future posts will examine what makes certain teams successful, predictors of Brownlow votes, and much more. Stay tuned.
The “home team advantage” is a real and sizeable effect in the English Premier League
Many of us who follow sports feel noticeably more comfortable when watching our favourite team play at their “home ground.” Consciously or unconsciously, we seem to know that the odds are in their favour, whether it be due to the crowd, lack of travel, or something else. But what is the magnitude of this effect?
Orbisant explored this question using data from the 2018/19 English Premier League season. Specifically, we wanted to determine what impact this home team advantage might have on the second half of the game, given different net scores at half-time. The chart above shows the probability that the home team will win, depending on the net goal difference at half-time and bounded by 95% confidence interval shading (to account for uncertainty). Evidently, the home team may have reason to feel more comfortable than the away team heading into the second half if the scores are tied at half-time.
With even scores at half-time, we might expect, in a fully even match, there to be a 50-50 chance that either team will win. However, when examining this value (0.5) against our model’s probability of a home team winning given a tied half-time score, we see a much higher value of 0.61. This strongly suggests the existence of a home team advantage, whatever its drivers, and it is a non-trivial effect at that.
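A minimal sketch of this kind of model, assuming a data frame epl with an illustrative binary home_win outcome (draws removed) and a half-time net goal difference ht_goal_diff:

```r
# P(home win | half-time net goal difference) via logistic regression
epl_fit <- glm(home_win ~ ht_goal_diff, family = binomial, data = epl)

# Predicted probability of a home win when scores are level at half-time
predict(epl_fit, newdata = data.frame(ht_goal_diff = 0), type = "response")
```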
In sport, many factors can influence a game, especially random ones. However, we can feel somewhat confident in this model, as the net half-time goal difference alone accounted for over 55% of the variance in home team outcomes at the end of the match (after removing “draws”). Interestingly, adding the number of corners at half-time as another input variable alongside net goal difference lifts the variance explained to over two-thirds. Furthermore, we can also examine the probability of the home team and away team winning independently of each other (using raw goals scored at half-time, not net). In this model, the odds of the home team winning increase by a factor of 8.2 for every goal scored in the first half, while they decrease by 86% for every goal the away team scores. Clearly, despite the second-half home team advantage at even scores, the first half is critical to help secure a win. The next iteration of this analysis will likely factor in whether the home team is playing a team positioned above or below them on the league ladder.