Here’s a snapshot of some of our recent work into the education sector


Modelling heteroskedastic socioeconomic data: a quantile regression exploration of student load and population size by postcode

You have an interesting dataset and a strong theoretical idea of a relationship between two of its variables. Let’s say these two variables are resident population and domestic online equivalent full-time student load (EFTSL) for every postcode in Australia. You are all set to build your statistical model, plot the data just to see what it looks like - and you are met with the graph below.

Panic sets in. The data clearly exhibits heteroskedasticity, violating one of the core assumptions of ordinary least-squares (OLS) linear regression. Heteroskedasticity describes statistical dispersion where variance increases as a function of a variable. More simply, we can see the heteroskedasticity visually on the graph as the “fanning out” of the data points away from tight clustering as resident population increases. Linear regression and a suite of other statistical techniques assume the opposite - homoskedasticity - where variance is constant. Luckily, there are a multitude of modelling techniques that can help us, and they are only a small conceptual and mathematical leap from OLS regression.

Source: Department of Education, Skills and Employment. Orbisant analysis.

One of the options at our disposal is quantile regression. Quantile regression is similar to OLS regression, but rather than modelling the conditional mean, it models the conditional median (or any other quantile we might be interested in). You might liken the motivation for such a model to situations with highly skewed data, where we would report a median rather than a mean. Since quantiles describe distance from the median, we can use quantile regression to identify cases in our data that are abnormal or substantially different from the middle quantiles.

Quantile regression is not limited by some of the key assumptions of OLS regression, because we can fit separate quantiles that may describe portions of the data better than a single least-squares line. This grants us much flexibility in the types of continuous response variables we can model, and it may reveal real, strong relationships in data that conditional-mean methods would describe as “weak”. Quantile regression is also robust to outliers.

Further, the method can be implemented quite easily in code through statsmodels in Python and quantreg in R. Since most of Orbisant’s analysis is conducted in R, this post will explore the Python implementation.
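As a sketch of what the statsmodels implementation looks like, the snippet below fits several quantiles alongside an OLS baseline. The data here are simulated with noise that grows with population to mimic the heteroskedastic “fanning out”; the variable names and the simulated relationship are illustrative assumptions, not the actual Department of Education postcode dataset.

```python
# Minimal sketch of quantile regression with statsmodels, on simulated data
# standing in for the real population-vs-EFTSL postcode dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
population = rng.uniform(200, 30_000, n)
# EFTSL rises with population, but so does the spread around the trend
eftsl = 0.01 * population + rng.normal(0, 0.003 * population, n)
df = pd.DataFrame({"population": population, "eftsl": eftsl})

# Fit one model per quantile of interest
model = smf.quantreg("eftsl ~ population", data=df)
fits = {q: model.fit(q=q) for q in (0.05, 0.1, 0.5, 0.95)}
for q, fit in fits.items():
    print(f"quantile {q}: slope = {fit.params['population']:.4f}")

# OLS conditional-mean fit, for comparison
ols = smf.ols("eftsl ~ population", data=df).fit()
print(f"OLS: slope = {ols.params['population']:.4f}")
```

Because the spread grows with population, the fitted slopes diverge across quantiles - the upper quantiles are steeper than the lower ones - which is exactly the structure a single OLS line cannot express.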

The graph below plots the output of a quantile regression model alongside the standard OLS regression for comparison. Many quantiles are shown by the dotted lines, and some key ones include:

  • .05 quantile

  • .1 quantile

  • .5 quantile (median)

  • .95 quantile

The interpretation of these quantiles is rather intuitive. For example, for the 10th percentile we estimate that there is a 10% chance that the true value falls below the predicted value. To operationalise this in the model, we assign higher loss weighting to negative errors (where the prediction overshoots) and lower weighting to positive errors, which pulls the fitted line down until roughly 10% of the data sits below it. The weighting is reversed for the 90th percentile.
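The asymmetric weighting above can be sketched as a small loss function - often called the pinball or check loss. The function name here is ours for illustration, not from any particular library.

```python
# Sketch of the asymmetric "pinball" (check) loss minimised by quantile
# regression. For quantile q, a positive error (actual above prediction)
# is weighted by q and a negative error by 1 - q, so at q = 0.1 the
# fitted line sits low enough that ~10% of points fall below it.
def pinball_loss(actual: float, predicted: float, q: float) -> float:
    error = actual - predicted
    return q * error if error >= 0 else (q - 1) * error

# At the 10th percentile, overprediction costs nine times underprediction
print(pinball_loss(10.0, 8.0, 0.1))  # underprediction: weighted 0.1 * 2
print(pinball_loss(8.0, 10.0, 0.1))  # overprediction: weighted 0.9 * 2
```

Minimising this loss over the whole dataset, rather than squared error, is all that separates a fitted quantile line from the OLS line.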

Source: Department of Education, Skills and Employment. Orbisant analysis.

Source: Department of Education, Skills and Employment. Orbisant analysis.

Evidently, the various quantiles each describe different parts of the data better than the other quantiles and the OLS model itself. The quantiles individually capture the “heteroskedastic” cases at higher resident populations that the OLS line largely misses. While the overall “trend” is the same - as population increases, so does EFTSL - we now have different coefficients that describe the slope of this positive trend more accurately for data falling in different quantiles.

While quantile regression is a very useful tool, many other statistical models could have been fit to this data, and transformations could have addressed the heteroskedasticity problem to potentially enable OLS regression. It is always a good idea to keep the alternatives in mind. However, this example served to show how well quantile regression performs against standard OLS linear regression on the type of problem that warrants considering it.

Code for this post can be found on GitHub.


University rank partially explains vice-chancellor salary


University Vice-Chancellor (VC) salary is an often-debated topic in Australia. A less formally explored lens on the topic is its correlation with university prestige or rank. In this analysis, a moderate negative correlation is visible: as rank number increases (indicating a potentially lower-quality institution), VC salary decreases. The effect size is just under one-third, meaning that a university’s rank explains roughly 29% of the variance in VC salary.
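The arithmetic behind the variance-explained figure can be made explicit: in simple linear regression, R² equals the squared Pearson correlation. The correlation value below is an assumed figure chosen to be consistent with the roughly 29% reported above, not the actual salary data.

```python
# Variance explained (R^2) is the square of the Pearson correlation in
# simple linear regression. A moderate negative correlation of about
# -0.54 (assumed, for illustration) squares to roughly 0.29.
correlation = -0.54
r_squared = correlation ** 2
print(f"variance explained: {r_squared:.0%}")  # about 29%
```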

Of particular interest is the relative positioning of a small number of select universities. Namely, the Australian National University (ANU), Australian Catholic University (ACU), and The University of Sydney (USyd). ANU is the top-ranked university in Australia, yet its VC receives the 4th lowest salary. Conversely, ACU is the 5th-lowest ranked university, yet its VC receives the 2nd highest salary. Last, USyd is the 3rd best ranked university and its VC receives the highest salary. In these three cases, USyd is perhaps the only one that makes the most sense with a more direct performance-reward relationship. The other two example universities highlight that performance and prestige are not always tied so tightly together. Whether this is for better or worse likely depends on a myriad of factors such as the VC themselves, financial performance of the institution and the institution’s strategy and goals.