Here’s a snapshot of some of our recent work in economics


Exploration of socioeconomic factors for Australia’s postcodes

Source: ABS

Orbisant was recently exploring ABS data and got thinking about interesting research questions around variables that might be associated with mortgage repayments. Using postcode-level data, Orbisant set out to produce data visualisations and modelling for the following variables:

The plot above highlights the first foray into this linked dataset. As expected, there is a roughly linear relationship at the postcode level between median mortgage repayments and scores on the SEIFA Index of Education and Occupation (where a higher score indicates relatively higher status). The smooth fitted on the graph is a generalised additive model, hence the curved shape. This is interesting, but disaggregating the analysis further might reveal a more informative pattern.

The plot below disaggregates the same analysis by State/Territory. This plot highlights the difference in the type of relationships between the two variables when accounting for regionality. Some of the patterns that stand out are:

  • Trendlines for each State/Territory are largely unique - While ACT and TAS both show positive linear relationships, the others show an array of different curvilinear smooth lines

  • Sample size plays a role - ACT shows a linear relationship, but with more data points, or an outlier-robust smooth model, the line might be less influenced by a few high-leverage postcodes. Further, as expected, the States/Territories with larger sample sizes (i.e. more postcodes) show tighter confidence intervals around the trendlines

faceted.png

While this analysis is interesting, it could be deepened through the application of statistical models using all the available variables. The first research question that came to mind was an exploratory one:

  • Do postcodes cluster together on a mix of socioeconomic variables?

Before jumping into modelling, a correlogram (a data visualisation of a correlation matrix) was produced to examine the relationship between each pair of variables. This is seen in the plot below. Moderate-to-strong correlations are seen for most pairs of variables, except for any pair involving resident population. The takeaway here is that postcodes with higher populations are associated with higher values for all other variables, but these relationships are weaker than the correlations between any other pair of variables.
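As a rough illustration of what sits behind a correlogram, the sketch below computes a Pearson correlation for every pair of columns. It is written in plain Python for illustration only and is not the code from the GitHub repository; the column names and values are hypothetical toy numbers, not ABS data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every ordered pair of columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical toy values (NOT ABS data), one entry per postcode.
toy = {
    "mortgage": [1200.0, 1500.0, 1800.0, 2100.0, 2600.0],
    "education_score": [900.0, 950.0, 1010.0, 1040.0, 1100.0],
    "population": [5000.0, 21000.0, 8000.0, 30000.0, 12000.0],
}

matrix = correlation_matrix(toy)
for (a, b), r in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: r = {r:.2f}")
```

A correlogram is then simply this matrix drawn as a grid, with colour or size encoding the strength and sign of each coefficient.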

corrplot.png

Given the massive increase in machine learning approaches in recent years, many data scientists may jump straight into an unsupervised learning algorithm such as k-means clustering. However, while k-means is a very useful tool, it works by minimising squared Euclidean distances and gives no indication of uncertainty in its assignments. Instead, a probabilistic approach, using a method such as Latent Profile Analysis (LPA), could help quantify the uncertainty associated with class/cluster/group membership while still utilising a data-driven, largely unsupervised approach.
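To make the contrast concrete, the sketch below compares a k-means-style hard assignment (nearest class mean) with the membership probabilities a Gaussian mixture gives for the same value. LPA is a form of Gaussian mixture modelling, but this is a one-variable toy with hypothetical, hand-picked class parameters rather than a fitted model.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def membership_probabilities(x, components):
    """Posterior probability of each class for x.

    components is a list of (weight, mean, sd) tuples, one per class.
    """
    weighted = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]


def hard_assignment(x, components):
    """k-means-style assignment: index of the nearest class mean."""
    return min(range(len(components)), key=lambda i: (x - components[i][1]) ** 2)


# Hypothetical two-class model on a single scaled variable.
components = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]

for value in (-2.0, 0.1, 2.0):
    probs = membership_probabilities(value, components)
    print(value, hard_assignment(value, components), [round(p, 3) for p in probs])
```

A value close to the midpoint between the two class means still receives a single hard label, whereas its membership probabilities sit close to 50/50, which is exactly the uncertainty a probabilistic method exposes.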

Orbisant fit an LPA on all the quantitative variables specified above. Prior to modelling, all variables were scaled. This puts all the variables “on an equal playing field” and makes the results much easier to interpret visually, because every variable is on the same scale. Multiple competing models were fit:
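The scaling step can be sketched as a z-score standardisation: subtract the mean and divide by the standard deviation. The source does not say exactly which scaling was applied, so the use of the sample standard deviation here is an assumption, and the values are hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    centre = mean(values)
    spread = stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - centre) / spread for v in values]


# Hypothetical values on very different scales (NOT ABS data).
population = [5000.0, 21000.0, 8000.0, 30000.0, 12000.0]
mortgage = [1200.0, 1500.0, 1800.0, 2100.0, 2600.0]

scaled_population = standardise(population)
scaled_mortgage = standardise(mortgage)
print([round(v, 2) for v in scaled_population])
print([round(v, 2) for v in scaled_mortgage])
```

After this step, a one-unit difference means "one standard deviation" for every variable, so no variable dominates the model purely because of its units.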

Set 1

  • 1 class / equal variances and covariances set to 0

  • 2 classes / equal variances and covariances set to 0

  • 3 classes / equal variances and covariances set to 0

  • 4 classes / equal variances and covariances set to 0

  • 5 classes / equal variances and covariances set to 0

  • 6 classes / equal variances and covariances set to 0

Set 2

  • 1 class / varying variances and covariances

  • 2 classes / varying variances and covariances

  • 3 classes / varying variances and covariances

  • 4 classes / varying variances and covariances

  • 5 classes / varying variances and covariances

  • 6 classes / varying variances and covariances

An analytic hierarchy process proposed by Akogul & Erisoglu (2017) was used to determine which model to retain. This process compares multiple fit indices such as AIC, AWE, BIC, CLC, and KIC. Using this process, the model with 6 classes and varying variances and covariances was retained. Outputs of the model are presented in the boxplot below. One way to interpret this is to consider the relative positioning of each class compared to the others across all the variables, and to form a high-level synthesis of this information for each class.
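As a small illustration of why several fit indices are combined, the sketch below computes two of them, AIC and BIC, from their standard definitions. It does not implement the analytic hierarchy process itself, and the log-likelihoods, parameter counts, and sample size are hypothetical numbers, not outputs of the models above.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * log(n_obs) - 2 * log_likelihood


# Hypothetical (log-likelihood, number of free parameters) per candidate model.
candidates = {
    "3 classes": (-5200.0, 20),
    "4 classes": (-5100.0, 27),
    "5 classes": (-5090.0, 34),
}
n_obs = 2000  # hypothetical number of postcodes

scores = {
    name: {"AIC": aic(ll, k), "BIC": bic(ll, k, n_obs)}
    for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_by_bic = min(scores, key=lambda name: scores[name]["BIC"])
print("Best by AIC:", best_by_aic)
print("Best by BIC:", best_by_bic)
```

With these toy numbers the two indices prefer different models, because BIC penalises extra parameters more heavily. Disagreements of this kind are the reason a procedure that weighs several indices together can be helpful.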

Another useful interpretation is to focus on the spread of each class-variable combination. For example, Class 1 across the variables shows quite a tight spread compared to Class 4. Further examination of exactly which postcodes are in these classes (and which State/Territory they are in) may produce a more robust interpretation.

boxplot.png

This analysis used only one type of statistical model, and many more could be fit (depending on the hypothesis-driven approach). Future analysis should aim to test some of these. A suggested first approach would be to extend the LPA outputs and use the classes as predictors in a new model with an important and relevant response variable.

Code for this post can be found on GitHub.


Resident population and dwelling internet access can predict postcode regionality with 87% accuracy

Source: ABS, Orbisant analysis.

Australia’s postcodes differ in some fascinating ways across characteristics of households, individuals, finances, and other demographics. Given the recent and ongoing COVID-19 pandemic, Orbisant got to thinking about questions regarding home internet access.

This led to a research question around whether a postcode is classified as being in a Major City of Australia as per the ABS Remoteness Areas. Specifically, Orbisant set out to address the following:

  • Can a postcode’s usual resident population and proportion of dwellings with internet access predict whether the postcode is in a major city or not?

To do this, the four ABS Remoteness Areas that are not “Major Cities of Australia” were aggregated into a single category of “Not in a major city”, as seen in the plot above. This plot graphs the raw data from the ABS, with population log-scaled for visual clarity. Upon first inspection, it seems likely that the data can be classified, though probably not well with a linear tool. Instead, the “boundary” between those postcodes in and not in a major city might be better estimated with an irregular-shaped curve. Tracing the boundary between the majority of the blue and orange points with your hand is a good way to visualise what this more accurate partition line might look like.
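The recoding and log-scaling described above can be sketched as follows. The five category labels are the ABS Remoteness Area names; the record layout, field names, and values are hypothetical and do not come from the GitHub repository.

```python
from math import log10

MAJOR_CITY = "Major Cities of Australia"


def recode_remoteness(category):
    """Collapse the ABS Remoteness Areas into a two-level label."""
    return "Major city" if category == MAJOR_CITY else "Not in a major city"


# Hypothetical records (NOT ABS data); field names are illustrative only.
postcodes = [
    {"remoteness": "Major Cities of Australia", "population": 25000},
    {"remoteness": "Inner Regional Australia", "population": 8000},
    {"remoteness": "Outer Regional Australia", "population": 3000},
    {"remoteness": "Remote Australia", "population": 600},
    {"remoteness": "Very Remote Australia", "population": 150},
]

prepared = [
    {
        "label": recode_remoteness(row["remoteness"]),
        "log_population": log10(row["population"]),  # log scale for plotting
    }
    for row in postcodes
]
for row in prepared:
    print(row["label"], round(row["log_population"], 2))
```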

Technically, this nonlinear boundary separation can be modelled using a range of machine learning techniques, such as Support Vector Machines (SVM). SVMs are immensely powerful for a few reasons:

  • SVMs can specify linear and nonlinear boundaries, meaning you can compare model accuracy and make tradeoffs between accuracy and parsimony; and

  • SVMs are computationally efficient compared to many other classification algorithms.
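The difference between linear and nonlinear boundaries comes down to the kernel function the SVM uses to compare two observations. The sketch below shows a linear kernel and a radial basis function (RBF) kernel. The source does not state which kernel was used, so the RBF kernel and the gamma value are assumptions chosen for illustration, and the feature values are hypothetical.

```python
from math import exp


def linear_kernel(x, y):
    """Dot product: yields a straight-line (linear) decision boundary."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: similarity decays with squared distance, allowing curved boundaries."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


# Hypothetical scaled features: (population, proportion with internet access).
a = (0.5, -1.0)
b = (0.5, -1.0)
c = (2.0, 1.5)

print("linear(a, c):", linear_kernel(a, c))
print("rbf(a, b):", rbf_kernel(a, b))  # identical points
print("rbf(a, c):", rbf_kernel(a, c))  # distant points
```

Identical points have an RBF similarity of exactly 1, and the similarity shrinks towards 0 as points move apart, which is what lets the fitted boundary bend around local groups of postcodes.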

Source: ABS, Orbisant analysis. 0 = “Not in major city”, 1 = “Major city postcode”.

An SVM was fit to the data, as seen in the plot above. Importantly, the features (i.e. population and dwellings with internet access) were scaled prior to fitting the model. This is why the axis values differ from the first plot. These can be easily back-transformed to the original scale, but were left here to demonstrate a likely first output if you built a similar algorithm.

Importantly, the SVM demonstrated strong accuracy, correctly classifying 87.2% of the test data (major city versus not in a major city), with no evidence of overfitting. This result could likely be marginally improved through additional tuning of the SVM hyperparameters (model settings chosen before fitting). However, without removing some borderline-outlier cases, it is unlikely that accuracy would approach 100%. Such is the fun of “real-world” data, and it highlights the innate variability within Australia’s postcodes.
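For readers unfamiliar with how a test-set accuracy figure is produced, the sketch below compares predicted labels with actual labels on held-out data. The labels are hypothetical toy values (0 = not in a major city, 1 = major city) and are unrelated to the 87.2% result.

```python
def accuracy(actual, predicted):
    """Share of test observations whose predicted label matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) combination for a 0/1 classifier."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Hypothetical held-out labels and model predictions.
actual = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 1, 0, 1]

print("Accuracy:", accuracy(actual, predicted))
print("Confusion counts:", confusion_counts(actual, predicted))
```

Comparing this figure on the training data and on the test data is one simple check for overfitting: a large gap between the two suggests the model has memorised the training set.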

If you would like to explore the data and modelling for yourself, please check out the code at the GitHub repository.


Some of Australia’s capital cities are consistently priced, but others remain distal in terms of cost of living

Source: Numbeo

Many annual measures exist to capture liveability and affordability for cities around the world (see here and here for examples). However, these measures are typically aggregated and holistic in nature. While useful, they are not set up to analyse pricing consistency or similarity. For example, do cities that have a similar cost for cars or childcare also have similarly priced takeaway food or electricity? Orbisant set out to test this using the following research question: Are some of Australia's capital cities priced more similarly than others across a range of differently-valued goods and services?

To answer this, item-level data was webscraped from the online repository Numbeo (see here for Orbisant's tutorial on how to replicate this webscraping process using R). Numbeo utilises self-reported prices from people living in the city. These entries are curated and updated over time to ensure prices remain relevant. Examples of the data's granularity are shown in items such as the price of a carton of eggs, a regular-sized cappuccino, a McMeal at McDonalds, and a Toyota Corolla sedan, among many more. Broadly, the individual items are grouped into the following domains:

  • Restaurants

  • Markets

  • Transportation

  • Utilities

  • Sports and leisure

  • Childcare

  • Clothing and shoes

  • Rent per month

  • Apartment purchase cost

  • Salaries and financing

To measure similarity, a collection of four pricing bands was created for every item in the repository (53 in total). Note that the usual normal-distribution z-score cutoffs were not used because the number of cities is low relative to the number of cutoff groups. The four bands are:

  1. More than 2 standard deviations (SDs) below the mean

  2. Between 2 SDs below the mean and the mean

  3. Between the mean and 2 SDs above the mean

  4. Greater than 2 SDs above the mean
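The four bands above can be sketched as a small function. How ties at the cutoffs are handled and whether the sample or population standard deviation was used are not stated in the source, so both choices below are assumptions, and the prices are hypothetical rather than Numbeo data.

```python
from statistics import mean, stdev


def pricing_band(price, avg, sd):
    """Return the band (1-4) for a price, given the item's mean and SD."""
    if price < avg - 2 * sd:
        return 1  # more than 2 SDs below the mean
    if price < avg:
        return 2  # between 2 SDs below the mean and the mean
    if price <= avg + 2 * sd:
        return 3  # between the mean and 2 SDs above the mean (ties: assumption)
    return 4  # more than 2 SDs above the mean


def bands_for_item(prices_by_city):
    """Band every city for one item, using that item's own mean and sample SD."""
    prices = list(prices_by_city.values())
    avg = mean(prices)
    sd = stdev(prices)  # assumption: sample SD
    return {city: pricing_band(p, avg, sd) for city, p in prices_by_city.items()}


# Hypothetical prices for a single item (NOT Numbeo data).
cappuccino = {
    "Sydney": 4.60, "Melbourne": 4.50, "Brisbane": 4.80, "Perth": 5.00,
    "Adelaide": 4.40, "Hobart": 4.70, "Darwin": 5.50, "Canberra": 4.60,
}
print(bands_for_item(cappuccino))
```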

Orbisant then determined which pricing band each city was placed in for each item, and counted the number of times each city was placed in the same price band as every other city. This is visualised in the network diagram at the top. Typically, network diagrams are used for social network analysis, but they also provide a useful visual way of showing connections and similarity for other purposes, such as this example.
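The pairwise counting step can be sketched as below: for every item, each pair of cities gains one point when both fall in the same band. The band values and the three-city, three-item setup are hypothetical toy inputs; the resulting counts would form the edge weights of a network diagram.

```python
from itertools import combinations


def shared_band_counts(bands_by_item):
    """Count, for each pair of cities, the items where both share a band.

    bands_by_item is a list of {city: band} dictionaries, one per item.
    """
    counts = {}
    for item_bands in bands_by_item:
        for a, b in combinations(sorted(item_bands), 2):
            counts.setdefault((a, b), 0)
            if item_bands[a] == item_bands[b]:
                counts[(a, b)] += 1
    return counts


# Hypothetical bands for three items (NOT derived from Numbeo data).
items = [
    {"Sydney": 3, "Melbourne": 3, "Darwin": 2},
    {"Sydney": 3, "Melbourne": 3, "Darwin": 3},
    {"Sydney": 2, "Melbourne": 3, "Darwin": 2},
]

edge_weights = shared_band_counts(items)
for pair, count in edge_weights.items():
    print(pair, f"{count} of {len(items)} items in the same band")
```

Dividing each count by the number of items gives the share of items on which two cities agree, which makes pairs directly comparable.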

As the diagram shows, Melbourne and Sydney, Melbourne and Canberra, and Perth and Adelaide are the most similarly priced across the 53 goods and services. Conversely, multiple city pairings were placed in the same pricing band much less than 50% of the time, such as Brisbane and Darwin, and Hobart and Canberra, among others. Together, these results highlight the broad spectrum of living costs across Australia's capital cities.


Three major economic metrics are declining in Australia, but at largely different rates

Data from RBA (2019).

In light of recent cash rate cuts by the Reserve Bank of Australia (RBA), Orbisant set out to visualise historically what these cuts have looked like alongside other important economic measures.

Here, we can see that all three rates have been declining rapidly since about 1990, interspersed with periods of peaks and troughs amidst the general decline. The impact of key economic events, such as the Global Financial Crisis (GFC), is visible in this chart.

While the fall of the unemployment rate may bring about positive sentiment, the rapid decline of the cash rate presents a potential cause for economic concern. Also of interest is the relatively steep decline of the consumer price index (CPI, or “inflation”), which we know is an important consideration in the evaluation of monetary policy. This steepness is most evident from roughly 1982 to 1991 where the drop in CPI was enormous. However, the rate of this decline quickly decelerated after this point into a steady downward trajectory following a brief period of rate increases.

While potentially insightful, this chart was developed to be more of a visualisation tool for historical data rather than an analytical tool of economic conditions. Indeed, it may yield promise as an overview tool for those seeking to understand a few broad economic trends in Australia across a relatively short time span.