Here’s a snapshot of some of our recent work in the transport sector.


Generalised additive modelling of traffic crash data: A time series exploration

time-series.png

Orbisant has previously analysed crash data (see post at the bottom of this page) from the Queensland Department of Transport and Main Roads. However, that analysis focused on predicting crash severity based on road characteristics for just one region of Queensland in 2018. After recently revisiting the same dataset, Orbisant set out to take a time series perspective. Specifically, Orbisant used a class of statistical models called Generalised Additive Models (GAMs) to understand the temporal properties of crashes that resulted in minor injuries (inspired by this post by the excellent Gavin Simpson). However, the whole process described in this post could easily be functionalised to accommodate crash severity (i.e. minor injury, fatality, hospitalisation, medical treatment) as an input.

As with any time series analysis, the most useful starting point is to plot the raw data. The graph above shows this for crashes that resulted in minor injuries. Immediately, two things stand out:

  • There is likely a dynamic trend (i.e. the trend isn’t linearly increasing or decreasing but shows substantial, somewhat sinusoidal variation over time)

  • There is some form of seasonality (seen in the somewhat consistently oscillating shape of the line)

These two properties are useful to keep in mind when specifying the model, because a model that ignores them is likely to be inaccurate.

There are many statistical approaches to time-series analysis (see other Orbisant posts for some examples); however, this post focuses on Generalised Additive Models (GAMs). GAMs are a very useful class of models related to generalised linear models (such as standard linear regression). However, where linear regression is a linear combination of regression coefficients and predictor values (plus error), a GAM is a sum of smooth functions, each built from local basis functions. This is an immensely powerful property, as it enables GAMs to flexibly fit all manner of data, linear and nonlinear, while giving the user a large degree of control over the fit. One of these controls, the number of “knots”, is core to the idea of a GAM: knots determine how many local basis functions are fit, and a very large number of knots would likely overfit the training data.

In R, GAMs are typically fit with the fantastic mgcv package written by Simon Wood. Since our response variable is a count, a Poisson model or negative binomial model (a generalisation of the Poisson where the variance can exceed the mean) is most appropriate. This is because counts form discrete probability distributions (counts are integers), as opposed to continuous distributions (which could be modelled using standard Gaussian methods or others such as Gamma distributions). We will start with the Poisson model for simplicity. So far, our model is composed of the following (a sketch of the mgcv specification follows the list):

  • Poisson-distributed response variable of crash count

  • Smoothed trend component (between-year variance)

  • Smoothed monthly seasonality component (within-year variance)
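As a rough sketch, this specification might look as follows in mgcv. The data frame and its column names here are illustrative assumptions rather than the actual dataset schema:

```r
library(mgcv)

# Hypothetical data frame `crashes` with columns:
#   n_crashes - monthly count of crashes resulting in minor injuries
#   date_num  - observation date as a numeric, e.g. as.numeric(as.Date(date))
#   month     - month of the year (1-12)
m_pois <- gam(
  n_crashes ~ s(date_num) +            # smooth long-term trend
    s(month, bs = "cc", k = 12),       # cyclic spline for within-year seasonality
  data   = crashes,
  family = poisson(link = "log"),
  method = "REML",
  knots  = list(month = c(0.5, 12.5))  # join the cyclic smooth between Dec and Jan
)

summary(m_pois)  # reports approximate p values and deviance explained
```

The cyclic cubic spline (bs = "cc") is one sensible choice for the seasonal term, as it forces December and January to join smoothly.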

Running this model produces the following results:

  • Smooth term of trend (date) is significant at p < .001 (“approximate” p values for significance are used on smooth terms - there is much contention about p values and smoothing splines)

  • Smooth term of monthly seasonality is significant at p < .001 (see above for note on approximation)

  • Intercept is significant at p < .001

  • Deviance explained is 78.3%

This is very encouraging. With just two predictor terms, a deviance explained value of 78.3% is quite good for data that may inherently describe complex real-world processes (i.e. traffic crashes).

However, since we are modelling time series data, it’s important to consider the temporal correlation structure of the data; ignoring it may make our model(s) inaccurate. For GAMs fitted in R, we can examine the residual autocorrelation of our first model using the autocorrelation function (ACF) and partial autocorrelation function (PACF). The output is presented below. Evidently, there may be some residual autocorrelation that we need to account for, and an AR(4) or AR(6) process might be a useful place to start.

acf.png
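As a sketch, this residual check can be done with base R’s acf() and pacf(), assuming the model object m_pois from the earlier code:

```r
# Check remaining autocorrelation in the GAM's residuals
res <- resid(m_pois, type = "deviance")
acf(res,  lag.max = 36, main = "ACF of GAM residuals")
pacf(res, lag.max = 36, main = "PACF of GAM residuals")
```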

We can now compare our original model to these new models that account for residual autocorrelation. There are several ways to do this, and a common starting point is the Akaike information criterion (AIC) - where a lower value is better. The results are presented below:

  • Model with uncorrelated errors: -8.91

  • Model with AR(4) errors: -14.51

  • Model with AR(6) errors: -18.15

Evidently, a model which accounts for temporal correlation may be a superior model (from AIC values, at least). However, one must consider the tradeoff between model complexity and gain in model performance. Other metrics can help provide a more informed assessment.
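For reference, here is a hedged sketch of how a GAM with AR(4) errors can be fitted via mgcv’s gamm(), which accepts nlme correlation structures (variable names carried over from the earlier hypothetical schema; year is an assumed grouping column). Note that gamm() fits non-Gaussian models by PQL, so AIC comparisons should be treated as indicative rather than definitive:

```r
library(nlme)  # supplies the corARMA correlation structure used by gamm()

# Refit the same smooths with AR(4) errors nested within year
m_ar4 <- gamm(
  n_crashes ~ s(date_num) + s(month, bs = "cc", k = 12),
  data        = crashes,
  family      = poisson(link = "log"),
  correlation = corARMA(form = ~ 1 | year, p = 4),
  knots       = list(month = c(0.5, 12.5))
)

# Lower AIC is better; gamm() returns both $gam and $lme components
AIC(m_ar4$lme)
```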

Taking the analysis one step further, it is common in time series analysis (especially in business consulting applications) to report on a trend only in terms of its direction and magnitude (see the plot below for trend comparisons between the models). But what if we wanted to know whether the change in the trend at any given point was statistically significant? We can explore this by computing derivatives along the smoothed trend curve. Because GAMs are statistical models, we can account for uncertainty by computing a confidence interval around those derivatives as well.

trend-comparison.png

The plot below demonstrates this derivative approach for the AR(4) model (we would expect almost the same results for the other two models after examining the plot above) by emphasising regions of significantly increasing and decreasing trend against the overall trend and its 95% confidence interval. Evidently, the steepest parts of the curve show significant rates of change (declining and increasing). While this may seem intuitive or expected when looking at the graph, it’s useful to know whether the rate of change in the trend at any given month is significant or not, as opposed to just eyeballing the graph and summarising “the trend was increasing/decreasing”.

deriv-ar4.png
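One way to produce such a plot is with the gratia package (also by Gavin Simpson). A sketch, assuming the AR(4) model object from the earlier code; the smooth label must match the one in the fitted model:

```r
library(gratia)

# First derivatives of the trend smooth, with 95% intervals;
# where an interval excludes zero, the trend is changing significantly
d <- derivatives(m_ar4$gam, "s(date_num)", type = "central")
draw(d)
```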

This post has been much more methods-focused than other Orbisant posts. A lot more work could be done to interrogate and improve on the models presented here, and this is intended to be explored in the near future. Aside from exploring more model fit assessments, it might also be useful to add holiday periods as an additional binary predictor (as crashes are likely more frequent at these times), which might improve the deviance explained and overall model accuracy. However, the models presented in this post are a very useful starting point and hopefully highlight the flexibility and benefits afforded by GAMs, especially in time series analysis, where they are used less often than other methods (e.g. ARIMA).

Code for this post can be found on GitHub.


Parsimonious models that account for seasonality can potentially forecast public transport data in Queensland

2-forecast-models.png

Understanding the demand side of the equation is critical for operational planning in many sectors, but particularly in transport. Using TransLink (QLD’s public transit agency) trip data from 2016 to the end of 2019, Orbisant set out to answer a few key research questions:

  • What are the trends, seasons, and cycles, if any?

  • Can this data be forecast?

  • What is the best forecasting method?

The first step in most time-series analysis is to visualise the data and understand its stationarity. Stationarity means that a signal’s underlying statistical properties do not change over time. This holds significant implications for the way forecasting is approached. For this analysis, Orbisant used data on the quantity of trips (using a GoCard) aggregated to the monthly level.

The plot below shows a matrix of useful time-series visualisations: the raw trip quantity data plotted over time (top left), the density of these quantities (top right; showing the distribution of trip numbers), the autocorrelation function (bottom left; the relationship between a signal and its previous values as a function of time lag), and the partial autocorrelation function (bottom right; similar to autocorrelation but with indirect time lag effects removed). Evidently, there may be a very slight upward trend over time in the quantity of trips, and the quantities themselves are not far from normally distributed. With regards to autocorrelation, few values fall outside the 95% confidence interval (blue shading), which suggests that previous values at a given time lag are largely uncorrelated with later values.

ts_plot.png
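A sketch of how such a plot matrix can be assembled with the forecast package; the source object monthly_trips and its column are hypothetical stand-ins for the aggregated TransLink data:

```r
library(forecast)

# Monthly go card trip counts, January 2016 to December 2019 (hypothetical source)
trips <- ts(monthly_trips$n_trips, start = c(2016, 1), frequency = 12)

ggtsdisplay(trips)    # time plot with ACF and PACF panels
plot(density(trips))  # distribution of monthly trip quantities
```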

The next step was to statistically determine if the data was stationary. The raw data showed a very slight upward trend, but quantifying this is essential in determining the forecasting methodology. The Augmented Dickey-Fuller (ADF) test for stationarity was used, which revealed that the data is in fact stationary, and the very slight trend in the raw data was not material. This means the data does not have to be transformed into a stationary process prior to forecasting.
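As a sketch, this test is available in the tseries package (using the trips series from above), where a small p value rejects the null hypothesis of non-stationarity:

```r
library(tseries)

# Augmented Dickey-Fuller test: H0 is that the series is non-stationary
adf.test(trips)
```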

The final step in understanding the underlying time-series statistical properties of the data was to decompose it into the following useful components:

  • Trend

  • Seasonality

  • Residuals

The matrix of plots below shows these components. The trend plot confirms the slight upwards movement seen in the raw data; however, the y-axis of this plot shows the magnitude of this effect is very small. The seasonal plot reveals a definite seasonality to the data - which is to be expected, given the likely consistent fluctuations in public transport use at certain points of the year. The residuals plot shows a relatively consistent pattern until right at the end of the time series. This is interesting and worth exploring further. At a high level, the raw time-series data appear to show a slight change in the pattern across the last three quarters of 2019. This may explain the sharp downward shape at the end of the trend and residual plots. Interestingly, these changes were not enough to produce a stationarity violation in the ADF test.

decomp_plot.png
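One common way to produce this decomposition is STL (seasonal-trend decomposition using loess); a minimal sketch using the trips series from earlier:

```r
# Decompose into trend, seasonal, and remainder (residual) components
decomp <- stl(trips, s.window = "periodic")
plot(decomp)
```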

With a visual and statistical understanding of the data, four basic forecast models were constructed and evaluated comparatively to determine which one best fit the transit data. These included:

  • Mean (uses the historical mean for all forecast values)

  • Naive (uses the last historical value for all forecast values)

  • Seasonal naive (uses the most recent value from the same season, e.g. the same month one year earlier, for each forecast value)

  • ARIMA (autoregressive integrated moving average; combines autoregressive and moving average processes with differencing to forecast future values)

The mean, naive, and seasonal naive models are typically considered “simple”, “parsimonious”, or “baseline” models in time-series analysis. They are useful because they provide a base level of accuracy against which to compare more complex models; a more complex model has to substantially outperform the baselines to justify its additional complexity. The ARIMA model represents the complex model in this analysis. The plot at the top of this post shows the performance of each model over the 2019 holdout period. A holdout period in time-series analysis is similar to a “test” set in machine learning: you “train” or construct a model on previous years’ data and test it on the holdout period. This lets you quantify how accurate potential future forecasts might be.
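A sketch of how the four models and the holdout evaluation can be set up with the forecast package, continuing with the hypothetical trips series from earlier (the accuracy() call reports the error metrics listed further below):

```r
# Train on 2016-2018, hold out 2019
train <- window(trips, end = c(2018, 12))
test  <- window(trips, start = c(2019, 1))
h     <- length(test)

fc_mean   <- meanf(train, h = h)                # historical mean
fc_naive  <- naive(train, h = h)                # last observed value
fc_snaive <- snaive(train, h = h)               # last value from the same season
fc_arima  <- forecast(auto.arima(train), h = h)

# RMSE, MAE, MAPE, and MASE against the 2019 holdout
accuracy(fc_snaive, test)
```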

Visually, it appears that the seasonal naive model outperforms the other three models as its values are consistently closer to the actual data. This result was confirmed by running the following accuracy diagnostics and comparing the results, where the seasonal naive model produced the lowest error scores:

  • Root mean squared error

  • Mean absolute error

  • Mean absolute percentage error

  • Mean absolute scaled error

This analysis has shown that the TransLink demand data can be forecast. Forecast accuracy was very strong for the first quarter of the 2019 holdout period. However, the data for the remainder of the 2019 holdout period showed a slightly different pattern to previous years. This meant the accuracy of the forecast models worsened over this timeframe. Other more sophisticated models may handle this slight shift in the data better, such as those that integrate additional information into predicting seasonal trends, as opposed to just using the last value from the previous season as forecast values. Future analysis should seek to test more of these models in a systematic and automated way to determine an optimal method. It should also aim to explore the anomaly in the residuals diagnostic (linked to this slight pattern change). Future analysis may also seek to disaggregate the results by service type (i.e. train, bus, ferry) to build more bespoke models and understand differences in demand trends. Code for this analysis can be found here.


Dark conditions significantly increase the odds of road crashes being fatal

crash_chart.PNG

Understanding the factors at play in road crashes is crucial; even more so is understanding the major factors at play in fatal crashes. Using data from the Australian Government on road crashes between 2001 and 2014 in Noosa Shire Council, Queensland, Orbisant analysed which factors increase the odds of a crash being fatal compared to non-fatal.

A logistic regression model was used for this analysis and assessed whether the following factors predicted fatal crashes (a sketch of the model fit follows the list):

  • Crash atmospheric conditions (clear vs fog/smoke/dust vs raining)

  • Day of the week (weekday vs weekend)

  • Surface conditions (sealed - dry vs unsealed - dry vs sealed - wet vs unsealed - wet)

  • Lighting conditions (daylight vs dawn/dusk vs darkness - lighted vs darkness - not lighted)
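A minimal sketch of this model in base R. The data frame noosa_crashes, its binary outcome fatal, and the predictor column names are assumptions for illustration:

```r
# Hypothetical data frame `noosa_crashes` with a binary outcome `fatal`
# (1 = fatal crash) and the four categorical predictors listed above
m_logit <- glm(
  fatal ~ atmospheric + day_type + surface + lighting,
  data   = noosa_crashes,
  family = binomial(link = "logit")
)

summary(m_logit)    # coefficients on the log-odds scale
exp(coef(m_logit))  # exponentiate to obtain odds ratios
```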

Of course, there are many other factors at play, including random ones, even within this data set, but these are a useful and pragmatic starting point. The logit model shows that all the variables were statistically non-significant, save for the lighting condition of “darkness - not lighted”. Looking at the model outputs, after converting the model’s log odds to odds, this can be interpreted as follows: unlit darkness roughly doubles the odds of a crash being fatal. This is perhaps an unsurprising result, but one that is still very serious.

Taking the model one step further, we can construct a “profile” of the average accident in the Noosa crash data, given the variables used in the model above. This is done by taking the mode (most frequent response) of each of the input variables and feeding this profile into the model. Doing so, after turning the predicted odds into a probability, returns the probability that the “average” crash within these parameters will be fatal. For this model, that probability is roughly 2%.
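A sketch of this profile approach, reusing the hypothetical objects above. Note that predict() with type = "response" returns the fitted probability directly, which is equivalent to converting the predicted odds by p = odds / (1 + odds):

```r
# Mode (most frequent level) of a categorical variable
modal <- function(x) names(which.max(table(x)))

# Profile of the "average" crash across the four predictors
profile <- data.frame(
  atmospheric = modal(noosa_crashes$atmospheric),
  day_type    = modal(noosa_crashes$day_type),
  surface     = modal(noosa_crashes$surface),
  lighting    = modal(noosa_crashes$lighting)
)

# Probability that the "average" crash is fatal (~0.02 in this analysis)
predict(m_logit, newdata = profile, type = "response")
```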

The fact that the average crash is highly unlikely to be fatal is a positive finding, but it may also reflect how little of the variance in crash fatality the model explained (around 4%). All of this highlights the complexity and potential randomness at play in road crashes, although some conditions are clearly more likely to lead to fatalities than others.