Here’s a snapshot of some of our recent work on the Triple J Hottest 100 Countdown


Hottest 100 2019 analysis Part II - A statistical modelling approach

hottest-100-both-plots.png

Orbisant has recently been spending some time thinking about ways to build on its previous Hottest 100 Countdown analysis (see post below) to produce a deeper exploration for the next Countdown. To explore some ideas, Orbisant revisited the same dataset and tried some new models and data visualisations.

The process started with simply graphing the raw data in ways not explored in the first analysis. The scatterplot above represents one such way. Evidently, the relationship between Facebook fanbase size and Spotify plays differs for Australian artists compared to international artists. This is seen in the difference in the shape of the trend line (and surrounding confidence interval) between Australian artists (positive linear model) and international artists (positive-trending additive spline model).
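The contrast between the two trend types can be sketched with standard tools. The snippet below is a minimal illustration only, using synthetic stand-in values (the real likes/plays data is not reproduced here): a least-squares line for one group and a smoothing spline for the other.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the real dataset: Facebook likes vs
# Spotify plays for two artist groups (values are invented).
likes_au = rng.uniform(1e4, 1e6, 50)
plays_au = 120 * likes_au + rng.normal(0, 5e6, 50)  # roughly linear
likes_intl = np.sort(rng.uniform(1e4, 5e6, 50))
plays_intl = 2e8 * np.log1p(likes_intl / 1e5) + rng.normal(0, 2e7, 50)  # curved

# Australian artists: simple positive linear trend (least squares).
slope, intercept = np.polyfit(likes_au, plays_au, 1)

# International artists: a smoothing spline captures the non-linear trend.
spline = UnivariateSpline(likes_intl, plays_intl, k=3)

print(f"AU linear slope: {slope:.1f}")
print(f"Intl spline value at 1M likes: {float(spline(1e6)):.3e}")
```

The point is only that the two groups call for different functional forms, which is what the scatterplot's separate trend lines encode.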

Rather than cycling through all of the variables available and plotting them as regression-based scatterplots, Orbisant had the idea to see if any clusters or “profiles” might emerge from considering all of the variables at once. The variables included:

  • Facebook likes (fanbase size)

  • Spotify plays

  • Hottest 100 rank

  • Days between song release and voting

To test this, Orbisant ran a Latent Profile Analysis (LPA) on these quantitative variables. LPA is a probabilistic framework for identifying “latent” (or “higher-order”) groupings across quantitative variables. Its probabilistic nature makes it a stronger choice for many applications than traditional go-to machine learning techniques such as k-means clustering. A model with equal variances, covariances set to zero, and 3 classes displayed the best fit (using an analytical hierarchy process over the AIC, AWE, BIC, CLC, and KIC, as proposed by Akogul & Erisoglu, 2017). The distribution of values across each of the four metrics by class is presented in the plot below.
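The class-enumeration step can be approximated outside the LPA tooling: an LPA with covariances fixed to zero corresponds to a diagonal-covariance Gaussian mixture, and information criteria can arbitrate the number of classes. The sketch below uses scikit-learn on invented stand-in data and BIC alone (not the full multi-criterion process used in the post); `covariance_type="diag"` also lets variances differ by class, a slight relaxation of the equal-variance model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for the four standardised metrics
# (Facebook likes, Spotify plays, rank, days since release):
# three planted profiles of 33 songs each.
centers = np.array([[2, 2, 0, 0], [-1, -1, 1, 1], [0, 0, -2, 2]])
X = np.vstack([c + rng.normal(0, 0.5, (33, 4)) for c in centers])

# Diagonal covariances = zero within-class covariances, as in the
# best-fitting LPA specification described above.
fits = {k: GaussianMixture(n_components=k, covariance_type="diag",
                           random_state=0).fit(X)
        for k in range(1, 6)}
bics = {k: m.bic(X) for k, m in fits.items()}
best_k = min(bics, key=bics.get)
print(f"Best number of profiles by BIC: {best_k}")
```

On this toy data the criterion recovers the planted three-profile structure, mirroring the 3-class solution selected in the analysis.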

boxplot.png

A high-level synthesis of the classes reveals some interesting differences. For example, Class 1 could be described as high popularity/decent ranking (due to high Facebook likes and high Spotify plays, as well as distributed Hottest 100 rank and distributed days since release). The usefulness of this analysis could be further enhanced by examining exactly which songs sit in each class (e.g. seeing if songs from a certain genre group together). However, this would require substantial manual data collection (unless genre data can be accessed through an API). Further analysis should aim to collect genre data and explore this.

Taking a different perspective, Orbisant flipped the direction taken so far and started to explore whether an artist’s nationality could be predicted from the available data. This approach considered the same metrics as the LPA model:

  • Facebook likes (fanbase size)

  • Spotify plays

  • Hottest 100 rank

  • Days between song release and voting

The matrix of graphs below shows one output of this initial line of thinking. The plots visualise a logit model between scores on each metric and artist nationality. A logit model (also known as “logistic regression”) aims to classify data into one of two categories with a certain probability based on input variables. In this matrix of graphs, the “S” shape of the curves strongly suggests that Spotify plays and Facebook likes might be the two most useful variables in classifying whether an artist in the Hottest 100 is Australian or not. Spotify plays is likely the stronger classifier variable, though, because the steepness of its “S”-shaped curve is greater than that of Facebook likes.
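The link between curve steepness and classifier strength can be made concrete: in a single-predictor logit, the coefficient magnitude is the steepness of the fitted “S” curve. The sketch below is a hypothetical illustration on invented, standardised stand-in data in which plays separate the two nationalities more sharply than likes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical stand-ins: standardised Spotify plays and Facebook likes
# for 100 songs, nationality coded 0 = international, 1 = Australian.
# International artists are given systematically higher values,
# more strongly so for plays than for likes.
is_aus = rng.integers(0, 2, 100)
plays = rng.normal(loc=np.where(is_aus == 1, -1.0, 1.0), scale=1.0)
likes = rng.normal(loc=np.where(is_aus == 1, -0.3, 0.3), scale=1.0)

# One single-predictor logit per metric; |coefficient| corresponds to
# the steepness of the "S" curve in the matrix of graphs.
coefs = {}
for name, x in [("spotify_plays", plays), ("facebook_likes", likes)]:
    model = LogisticRegression().fit(x.reshape(-1, 1), is_aus)
    coefs[name] = model.coef_[0][0]
    print(f"{name}: coefficient = {coefs[name]:.2f}")
```

A steeper curve (larger-magnitude coefficient) for plays than for likes is exactly the visual pattern described above.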

logit.png

While the above plots are useful, formalising the concept into an actual statistical model is a strong way to add rigour. Since Orbisant will be analysing future Hottest 100 Countdown data, it makes sense to explore this modelling approach in the Bayesian framework. This is because the posterior results from this analysis can be used to inform the priors of the next Countdown’s analysis. Orbisant wrote a Bayesian logit model in the probabilistic programming language Stan to explore this. The outputs from the model are shown in the plot below.
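The original model was written in Stan, which is not reproduced here. As a rough sketch of the same idea, the snippet below fits a Bayesian logit with Normal priors via a minimal random-walk Metropolis sampler on invented stand-in data, and reports a 95% credible interval for the slope, the quantity summarised in the plot below.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in data: one standardised predictor (think Spotify
# plays) and nationality coded 0 = international, 1 = Australian,
# mirroring the coding in the post. The true slope is negative.
n = 200
x = rng.normal(size=n)
true_alpha, true_beta = 0.0, -2.0
p = 1 / (1 + np.exp(-(true_alpha + true_beta * x)))
y = rng.binomial(1, p)

def log_post(theta):
    """Log posterior: Bernoulli likelihood + Normal(0, 5) priors."""
    alpha, beta = theta
    eta = alpha + beta * x
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = -(alpha**2 + beta**2) / (2 * 5**2)
    return loglik + logprior

# Random-walk Metropolis (Stan's NUTS sampler is far more efficient;
# this is only a self-contained illustration of the inference).
theta = np.zeros(2)
samples = []
for _ in range(5000):
    prop = theta + rng.normal(0, 0.2, size=2)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)
samples = np.array(samples)[1000:]  # discard burn-in

lo, hi = np.percentile(samples[:, 1], [2.5, 97.5])
print(f"beta 95% credible interval: ({lo:.2f}, {hi:.2f})")
```

When the true effect is negative, the whole credible interval sits below zero, which is the pattern the post reports for Spotify plays. The fitted posterior could then seed the priors for the next Countdown's model, as described above.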

bayes-plot.png

In this plot, we can see a few interesting things (Alpha is the constant term in the regression equation, representing the intercept). For example, Spotify plays is a strong negative predictor, with its credible interval lying entirely below zero. This means the higher a song’s Spotify plays, the more probable it is that the artist is international (international was coded as “0” and Australian as “1” - a negative coefficient shifts probability toward the lower-numbered category). However, the other predictors’ credible intervals all include zero - which strongly suggests they cannot meaningfully classify songs as belonging to an Australian or international artist. The out-of-sample classification accuracy of this model will have to be tested on another year’s Hottest 100 Countdown data - a future task for Orbisant to complete.

This follow-up analysis has produced a host of interesting new perspectives on the 2019 Hottest 100 data. It has also stimulated thoughts around new analytical ideas to run for the next Countdown’s analysis. Further, the process of setting up a Bayesian framework that can inform the next analysis will position Orbisant well to continue finding informative relationships.

Code is available for this post on GitHub.


The distribution of songs in Triple J’s Hottest 100 for 2019 shows a bimodal shape, and smaller Australian artists are well-represented

Data compiled manually from Triple J, Facebook, and Spotify. Month 1 = January; Month 12 = December.

The Hottest 100 is recognised as a reflection of the musical zeitgeist that governed the ears of many mainstream and alternative listeners over the past year. Last year's Hottest 100 was no exception. However, the statistics released by Triple J each year after the countdown are typically very high level, amounting to discrete, single-number values presented in an infographic. Orbisant set out to ask deeper questions about the songs which featured in the countdown.

Using "likes" from artist Facebook pages as a proxy for fanbase size, plays from Spotify as an indication of volume/interest, and also whether the artist was Australian or not, Orbisant aimed to describe the year's top 100 songs in a more analytical way. However, it is important to note that these plays and likes were not from explicitly Australian listeners or fans (who likely comprise the overwhelming majority of Hottest 100 voters). That data does not seem to be publicly available or easily accessible.

The top two graphs above present unsurprising results - Australian artists in the countdown largely have fewer fans than international artists, and Australian songs largely have fewer plays than international songs. The bottom two graphs present a much more interesting story. The bottom-left chart shows that most of the songs in the 2019 countdown come from the months around March and August-October. This bimodal distribution of song density (concentration) by month of release shows a remarkable dip in the representation of songs released around May and June.

Of even more interest is the relative replication of these peaks and troughs when the data is split by nationality (see bottom-right chart). Australian songs are represented more in the later peak (August) than in the first. This later peak is also characterised by a sharper increase following the mid-year dip, whereas international songs see a stagnation following the initial decline. Comparatively, international songs are slightly more concentrated around the first peak of the year (March) than the later one. Their second peak also occurs later in the year - at around October, compared to August for Australian songs.
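The density curves described above are kernel density estimates over release months. A minimal sketch of the idea, on invented months that imitate the two concentrations described (around March and around August-October):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)

# Hypothetical release months (1 = January ... 12 = December) with two
# concentrations, imitating the bimodal shape described in the post.
months = np.clip(np.concatenate([rng.normal(3, 1.0, 40),
                                 rng.normal(9, 1.5, 60)]), 1, 12)

# Kernel density estimate of release-month concentration - the same
# idea behind the density curves in the bottom charts.
kde = gaussian_kde(months)
grid = np.linspace(1, 12, 111)
density = kde(grid)

# A mid-year dip between the two peaks indicates bimodality.
print(f"density near March:  {kde([3.0])[0]:.3f}")
print(f"density near June:   {kde([6.0])[0]:.3f}")
print(f"density near Sept.:  {kde([9.0])[0]:.3f}")
```

Splitting the months by nationality and estimating one density per group would reproduce the bottom-right comparison.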

Unfortunately, the causal drivers of this release seasonality can't be directly inferred from this analysis, due to a lack of fanbase and listening data specific to the population that votes in the Hottest 100. Future investigations should also aim to replicate the distributions seen here over previous years’ countdowns to validate the effects. Despite these limitations, the results may suggest some important considerations for artists about when to release their songs if a countdown placing is important to them. The results also provide some preliminary numerical reassurance to Australian artists that size isn't everything - the ability to penetrate the primarily domestic voting market through platforms that showcase Australian music, such as Triple J, may be much more important.