Here’s a snapshot of some of our recent work in the energy & environment sector


Classifying “Normal” and “Abnormal” inline process control sensor measurements during the processing of silicon wafers for semiconductor fabrication

Method

Time-series classification algorithms are typically benchmarked using the Time Series Classification Repository, which contains 128 univariate (i.e., a single variable measured over time) time-series classification datasets/problems and 30 multivariate datasets. Here, we focus on the univariate setting. Many algorithms developed across the sciences have been evaluated on the standardised univariate datasets provided in the repository, with black-box algorithms typically outperforming competitors (see this paper for a review). Given the growing need for algorithmic transparency in the sciences and industry, the present research explores the feasibility of a new approach using one of the 128 datasets as a test case: the "Wafer" dataset. Specifically, this work aims to determine whether informative summary statistics (known as "features") can be used to construct a high-performing classification procedure that rivals or outperforms existing approaches (see this paper for more). Examples of time-series features include properties of the distribution of time-series values, autocorrelation structure, entropy, model-fit statistics, nonlinear time-series analysis, stationarity, and many others (see this paper). At its core, feature-based time-series analysis reduces a time series x time matrix to a time series x feature matrix, which can then be used for statistical learning (such as classification), using the values of time-series features as inputs.
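To make the core idea concrete, the sketch below (in base R, using simulated data and a handful of illustrative features rather than the feature sets used in this work) reduces a time series x time matrix to a time series x feature matrix:

```r
# Minimal sketch: reduce a (time series x time) matrix to a
# (time series x feature) matrix using simple summary statistics.
# 'X' is a hypothetical matrix with one time series per row.
set.seed(123)
X <- matrix(rnorm(50 * 152), nrow = 50, ncol = 152)  # 50 series of length T = 152

extract_features <- function(x) {
  c(
    mean      = mean(x),
    sd        = sd(x),
    acf_lag1  = acf(x, lag.max = 1, plot = FALSE)$acf[2],  # lag-1 autocorrelation
    trend_cor = cor(x, seq_along(x))                       # crude linear-trend proxy
  )
}

feature_matrix <- t(apply(X, 1, extract_features))  # 50 x 4 (time series x feature)
dim(feature_matrix)
```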

Dataset

The Wafer dataset in the Time Series Classification Repository contains a collection of inline process control measurements recorded from a range of sensors during the manufacturing and processing of silicon wafers for semiconductor fabrication. Each unique time series in the Wafer dataset represents the measurements of one sensor during the processing of one wafer by one tool. Labels for two classes are provided in the data: Normal and Abnormal. The goal of this problem is to predict class membership from the time-series values. The dataset contains a pre-designated train-test split, with 1000 samples in the train set and 6164 samples in the test set; every time series has length T = 152.

Algorithmic Approach

The approach in this work contains three stages: (i) extraction of time-series features for each unique time series; (ii) dimensionality reduction using principal components analysis (PCA) to obtain a reduced set of informative vectors; and (iii) classification using the top principal components (quantified in terms of cumulative variance explained) as inputs to a random forest classifier. Features from two open-source feature sets, catch22 (see here for the native R implementation Rcatch22) and Kats, will be extracted using the R package theft. catch22 extracts 24 features here (mean and standard deviation are added to the standard 22) and Kats extracts 40 features, producing a time series x feature matrix of size 7164 x 64. Given the within-set redundancy (i.e., high absolute correlations between features in a set) observed particularly for Kats relative to catch22, as well as the between-set redundancy identified in previous work, PCA will be applied to substantially reduce the size of the input matrix for the classification algorithm, yielding a time series x principal component matrix. A threshold of 80% cumulative variance explained will be used to determine the number of principal components to retain. Following the procedure of previous work, the classifier is then trained and evaluated over 30 resamples of train-test splits, each seeded for reproducibility, with the first always being the pre-designated train-test split provided in the Time Series Classification Repository. This enables inference of algorithmic performance with uncertainty, and facilitates a direct comparison with the performance of existing algorithms and benchmarks.
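For illustration, the feature-extraction step might look something like the sketch below using theft's calculate_features() function. The data frame, column names, and feature-set labels are assumptions and may differ between theft versions; this is a sketch rather than the exact code used here.

```r
# Hedged sketch of feature extraction with the theft package.
# 'wafer_long' is a hypothetical long-format data frame with one row per
# observation: a series identifier, a time index, and the measured value.
library(theft)

feature_data <- calculate_features(
  data        = wafer_long,
  id_var      = "id",
  time_var    = "timepoint",
  values_var  = "values",
  feature_set = c("catch22", "kats")  # set labels are assumptions; check theft's docs
)
```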

Given that the Wafer dataset is held in the Time Series Classification Repository, whose sole purpose is to facilitate benchmarking of time-series classification algorithms, it is expected that meaningful class separation will be possible. The present research takes a novel approach of chaining time-series feature extraction, dimensionality reduction, and a classification algorithm, an approach that has seen almost no research attention to date. As such, a primary goal of this work is to understand the performance of this approach relative to current benchmarks.

Given the high within-set redundancy (i.e., high absolute correlations between features in a set) observed in previous work, it is hypothesised that dimensionality reduction will substantially reduce the input matrix size for the classification algorithm, from the original time series x feature matrix to a much smaller time series x principal component matrix. Further, it is hypothesised that using the time series x principal component matrix as input to a classification algorithm (i.e., a random forest classifier) will not result in a substantial reduction in classification performance compared with using the full time series x feature matrix, or with existing benchmarks, given the highly informative nature of time-series features in capturing temporal dynamics.

Results

Prior to substantive analysis, exploratory data analysis was performed. A sample of three time series from each class (Normal and Abnormal) is displayed in the figure below. To the eye, small but noticeable differences in temporal dynamics and shape are visible between the classes, which suggests that classification based on temporal properties (i.e., "features") is a feasible approach.

Raw time-series plots of three randomly selected time series from each class (“Normal” and “Abnormal”) in the Wafer dataset. Small differences in temporal dynamics are visible.

Dimensionality reduction

Time-series features were computed from two sets (catch22 and Kats) on all time series, producing a time series x feature matrix of size 7164 x 64. This matrix is large, which increases computation time for classification algorithms and complicates model interpretation and evaluation. To reduce this complexity, a principal components analysis was performed. The dataset is well suited to PCA: multicollinearity between time-series features was a key motivation for the analysis, and there are sufficient samples, with the 7164:64 samples-to-variables ratio comfortably exceeding the 20:1 recommendation.
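A minimal sketch of this step in base R is given below, assuming the 7164 x 64 time series x feature matrix is stored as a numeric matrix called wafer_features (a hypothetical name):

```r
# Dimensionality reduction with PCA on the standardised feature matrix
pca <- prcomp(wafer_features, center = TRUE, scale. = TRUE)

# Proportion and cumulative proportion of variance explained per component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

# Retain the smallest number of components reaching 80% cumulative variance
n_keep <- which(cum_var >= 0.80)[1]

# Kaiser criterion check: eigenvalues (sdev^2) greater than 1
sum(pca$sdev^2 > 1)

# Scores form the (time series x PC) matrix; rotation holds the loadings
pc_scores <- pca$x[, 1:n_keep]
loadings  <- pca$rotation[, 1:n_keep]
```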

The figure below provides a visual summary of the PCA. Panel (A) plots the percentage of variance explained for the top eight principal components (PCs), which together explain 80% of the variance. PC1 explains 37.5% of the variance, but there is a steep drop-off after this component, with PC2 explaining 15.6%. Panel (B) plots cumulative variance explained for the eight retained PCs; the first four explain two-thirds of the variance in the dataset. Panel (C) plots the eigenvalues of the eight PCs, all of which exceed the eigenvalue > 1 threshold of the Kaiser criterion for PC retention.

Summary of eight retained principal components. (A) Percentage of variance explained is plotted in descending order for each of the retained principal components. (B) Cumulative variance explained is plotted for each of the retained principal components. An 80% cumulative variance threshold was selected to determine the principal components to retain which returned the eight plotted here (from the original 64). (C) Eigenvalues of the eight retained principal components are plotted in descending order. All retained components also exceed the eigenvalue > 1 cutoff for the Kaiser criterion.

Loadings for each of the 64 time-series features (variables) on the eight PCs are presented in the figure below. Patterns are evident across the PCs, such as the strong loading of features associated with properties of the autocorrelation and partial autocorrelation functions onto PC1, and histogram-based statistics onto PC3. This plot gives interpretable and informative insight into the relative behaviour of the time-series features extracted from the Wafer dataset.

Loadings of each variable onto the eight retained principal components. Relationships between the time-series features are visible, such as loading of features associated with the autocorrelation and partial autocorrelation function onto PC1.

Prior to modelling, the PCA was distilled into an even lower-dimensional space of just two dimensions to examine whether class differences could be discerned using only the two PCs that explain the most variance in the data (a collective 53.1%). This is displayed in the figure below. There is considerable overlap between the "Normal" and "Abnormal" classes in the two-dimensional space, suggesting the additional six principal components are likely needed for accurate classification.

Principal components analysis biplot. The first principal component (positioned along the x-axis) explains 37.5% of the variance in the Wafer dataset. The second principal component (positioned along the y-axis) explains 15.6% of the variance in the Wafer dataset. Class-level covariance is displayed as shaded ellipses.

Time-series classification

The time series x principal component matrix (7164 x 8) was passed as input to a random forest classifier using the caret package in R over 30 resamples, where the first resample was the pre-designated train-test split from the Time Series Classification Repository. Mean classification accuracy across the resamples was 99.08% (SD = 0.19%) and is compared to previous benchmarks in the table below. The benchmark algorithms include the collective of transformation-based ensembles (COTE), shapelet transform (ST), bag of SFA symbols (BOSS), elastic ensemble (EE), dynamic time warping (DTW), time series forest (TSF), time series bag of features (TSBF), learned pattern similarity (LPS), and move-split-merge (MSM). The current approach, while marginally outperformed by the other algorithms, demonstrates classification performance on par with more complex, black-box algorithms, despite using only eight principal components as input to a random forest classifier. Importantly, no manual hyperparameter tuning was performed beyond the basic parameter grid search that caret runs over the k-fold cross-validation procedure; it is likely that even stronger performance could be obtained through more detailed hyperparameter tuning and optimisation. Further, relative to the benchmark algorithms presented in the table below, the current approach is fast, executing the entire classification stage (including all 30 resamples, each with 10-fold cross-validation) in under four minutes locally on a laptop with no parallel processing.

Comparison of mean classification accuracy results on the Wafer dataset between the current approach and existing benchmarks. Performance differences are minimal between all approaches, with each algorithm achieving >99% accuracy.
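For illustration, a hedged sketch of the classification stage with caret is given below. The pc_scores and labels objects, the random resample splits, and the seeds are assumptions rather than the exact code used to produce the reported results (in particular, the first resample reported above used the pre-designated split).

```r
# Sketch of the classification stage, assuming 'pc_scores' is the 7164 x 8
# principal component matrix from above and 'labels' is a factor of class
# labels ("Normal" / "Abnormal").
library(caret)

pc_df      <- as.data.frame(pc_scores)
accuracies <- numeric(30)
fits       <- vector("list", 30)

for (i in 1:30) {
  set.seed(i)  # seed each resample for reproducibility
  train_idx <- sample(nrow(pc_df), size = 1000)  # 1000 train / 6164 test

  fits[[i]] <- train(
    x = pc_df[train_idx, ],
    y = labels[train_idx],
    method    = "rf",                                    # random forest backend
    trControl = trainControl(method = "cv", number = 10) # 10-fold cross-validation
  )

  preds <- predict(fits[[i]], newdata = pc_df[-train_idx, ])
  accuracies[i] <- mean(preds == labels[-train_idx])
}

mean(accuracies); sd(accuracies)
```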

To better understand the inner mechanics of the current approach, we examine the machinery of the random forest models. Understanding relative variable importance can shed deeper light on random forest models and aid interpretability. Variable importance ranks for each variable (principal component) across the 30 resamples are plotted in the figure below. PC4 (comprised mostly of symbolic features, such as those associated with the entropy of small-set probabilities, flat spots, and the proportion of magnitudes exceeding a threshold over the standard deviation) is the most important variable across all 30 resamples for predicting class ("Normal" versus "Abnormal").

Frequencies of ranks over all resamples are plotted for each principal component used as a predictor in the models. PC4 is the most important variable across all 30 resamples for predicting class (“Normal” versus “Abnormal”).
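A rough sketch of how the per-resample variable importance values could be pulled from the fitted models is shown below, continuing the hypothetical fits list from the sketch above:

```r
# Extract variable importance from each fitted caret model across resamples
library(caret)

importance_list <- lapply(seq_along(fits), function(i) {
  imp <- varImp(fits[[i]], scale = TRUE)$importance  # one row per principal component
  data.frame(resample = i, variable = rownames(imp), importance = imp$Overall)
})
importance_df <- do.call(rbind, importance_list)

# Mean and standard deviation of importance per principal component
aggregate(importance ~ variable, data = importance_df,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```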

The performance of PC4 is further demonstrated in the figure below, which plots the mean +/- 1 SD of variable importance values across the 30 resamples. On average, PC4 exhibits variable importance around 5.5 times higher than any other variable.

Mean variable importance +/- 1 SD is plotted for each principal component used as a predictor in the models. PC4 demonstrates the highest mean variable importance, by a factor of 5.5 over the next-highest component (PC6).


How similar are countries in the structure of their CO2 emissions per capita over time? A highly comparative time-series approach

With the worsening consequences of global warming looming ever closer, understanding how countries are changing their emissions over time (if at all) is critical. One way to measure this is to look at CO2 emissions per capita. Luckily for us, this data exists as a time series reaching all the way back to the year 1800 (for some countries). How neat!

In this post, we are going to explore the CO2 emissions per capita dataset and examine ways to analyse the empirical structure and similarity of time series using techniques that are very uncommon in policy analysis and program evaluation. With that out of the way, let's dive in.

As with most (time-series) analysis, it's usually best to start with a plot so we know what we are dealing with. The plot below shows log-scaled CO2 emissions per capita by country, with Australia called out in orange. While it's hard to discern the exact movement of each country from this graph alone, we can clearly see that there is a lot of variation in the time-series data between countries.

ts_plot_log.png

Now, perhaps we are interested in understanding which countries are most similar in the temporal structure of their CO2 emissions per capita. This is more than just a trivial thought exercise - if we worked in public policy we might very well investigate this question because it would provide a data-driven set of countries against which to benchmark our own. This removes the potential for subjectivity to drive our analysis. While we may all have prior beliefs about which countries are similar or should be compared to each other (whether through political or geographical ties), a data-driven approach might reveal more nuanced and informative relationships that are not as restrictive in scope.

While simple at first glance, this research problem is actually inherently tricky. When we want to understand similarity between numerical quantities, our intuition is often to compute a correlation, usually a Pearson or Spearman correlation coefficient. But how do we compute such a metric for a temporally ordered series of data? Do we compute correlations pairwise at every time point? How do we then reconcile K correlations (where K is the length of the time series) into a single understanding of the relationship between multiple time series? From a management consulting perspective, many consultants might approach this problem by computing a compound annual growth rate (CAGR), percentage change, or some other basic metric and use that as the basis for comparison. But this is crude and misses an enormous amount of nuance and information, as we will later see. Clearly, we need a different approach.

A brief introduction to feature-based time-series analysis

One approach that is making waves in the scientific and machine learning literature (and is a focus of my PhD) is feature-based time-series analysis. Broadly, this approach involves reducing a time series to one or more summary statistics, which can then be correlated or analysed with respect to the same statistics for any other given time series. These summary statistics can be almost any conceivable quantity that can be calculated on a time-ordered vector of numbers, and could be as simple as a mean or standard deviation, or as complex as spectral entropy or fluctuation analysis.

In fact, Ben Fulcher developed a toolbox for MATLAB called hctsa, short for "highly comparative time-series analysis", which automates the calculation of over 7,700 time-series features from across the scientific literature for a given time series. Examples of fields represented in the toolbox include (but are definitely not limited to):

  • Physics 

  • Neuroscience

  • Astronomy

  • Information Theory

  • Statistics

  • Medicine

  • Econometrics

  • Dynamics

  • Finance

With so many potential features from many different fields on hand, researchers can begin to investigate which approaches might work best for their given data/problem, rather than just sticking to the methods developed and taught in their respective field. This is important, because if researchers select a single technique for a problem, how can they be sure it was the optimal one? A highly comparative approach facilitates an answer to this question.

Stepping back a bit, working in the feature space as opposed to the raw data (measurement) space is very useful for multiple reasons:

  • Feature space is much more computationally efficient than measurement space - working with a collection of summary statistics rather than entire time-series (which can be potentially very long) makes any further calculations and modelling much more flexible and much faster

  • Feature space can reveal dynamical and nonlinear relationships between statistical processes that the measurement space may not be able to detect - using features that capture different aspects of temporal dynamics enables a deeper and potentially more sophisticated understanding of the empirical structure of and similarity between time series

  • Dimension reduction techniques generalise well to the feature space - using feature summary statistics enables methods such as Principal components analysis and t-SNE to reveal patterns across groups of features, and promote effective data visualisation

However, MATLAB is not free software, which has made adoption of the highly comparative (and feature-based) approach more limited than it should be. Recently, I built an R package called `Rcatch22` which automates the calculation of 22 time-series features that have been shown to be high-performing and minimally redundant. More simply, out of the wider >7,700-feature toolbox in hctsa, these 22 perform best across a variety of tasks while contributing unique variance over the others.
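As a hedged example, computing the 22 features for a single country's series might look like the sketch below, where co2_wide is a hypothetical country x year matrix and the catch22_all() call reflects my understanding of the Rcatch22 interface (it may differ between versions):

```r
# Minimal sketch: compute the catch22 feature set for one country's
# CO2-per-capita series with Rcatch22.
library(Rcatch22)

co2_aus  <- as.numeric(co2_wide["Australia", ])  # one country's time series (hypothetical data)
features <- catch22_all(co2_aus)                 # tidy output: one row per feature
head(features)
```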

A feature-based look at Co2 emissions per capita

Now, back to the problem at hand. One approach we could take is to calculate these 22 features for each country and then compare values. The plot below shows one output of this procedure, where each country is on the y-axis and each feature is along the x-axis. Values were normalised across each feature vector to standardise the analysis, and both rows and columns were hierarchically clustered prior to plotting to reveal rich empirical structure. Clearly, some interesting patterns emerge, with regions of high values and regions of low values organising together.

Note that some countries were removed prior to plotting due to missing data.

feature_matrix.png
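A rough sketch of building such a clustered heatmap in base R is shown below, assuming country_features is a hypothetical (country x feature) numeric matrix holding the 22 values per country, with country names as row names:

```r
# Normalise each feature (column) to z-scores so features are comparable
country_features_z <- scale(country_features)

# Base R heatmap() hierarchically clusters rows and columns by default,
# grouping countries and features with similar patterns together
heatmap(country_features_z,
        scale = "none",  # already normalised above
        xlab = "Feature", ylab = "Country")
```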

We can dive even deeper to the country level and reduce the number of dimensions we are visualising from 22 to just 2 using a dimension-reduction technique such as principal components analysis (PCA). We can fit this to our data and plot it as a scatterplot on just the first two principal components, which by construction explain the most variance. By colouring each data point by continent or country, we can investigate any potential patterns. The plot below does this, with each country-level data point coloured according to its continent. These first two principal components alone explain 48% of the variance in the data.

pca.png
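A sketch of this two-component view with prcomp and ggplot2 follows, reusing the hypothetical country_features_z matrix and a continents vector (also hypothetical) aligned to its rows:

```r
library(ggplot2)

# PCA on the normalised country x feature matrix
pca_countries <- prcomp(country_features_z)

scores <- data.frame(
  PC1       = pca_countries$x[, 1],
  PC2       = pca_countries$x[, 2],
  continent = continents
)

# Scatterplot of countries in the space of the first two principal components
ggplot(scores, aes(x = PC1, y = PC2, colour = continent)) +
  geom_point() +
  labs(x = "Principal component 1", y = "Principal component 2")
```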

How similar are individual countries?

Going even further, we can assess the overall similarity of countries by computing the correlation between each pair of countries based on their 22 feature values. We can then visualise the output as another heatmap, such as the plot below.

cormat.png

While it looks pretty, there is no real structure - meaning we cannot easily discern which countries are most similar at a glance. If we hierarchically cluster the data prior to plotting, countries that are most similar to each other will be organised on the axes accordingly. The plot below shows this. For example, we can now see (apologies for plot scaling in the web browser) that the United States, Switzerland, and Syria are similar in terms of their empirical temporal properties, while Nepal, Vanuatu, Sierra Leone, and the United Arab Emirates are not very similar, yet Australia has moderate-to-strong negative correlations with the latter three countries. This is much more interesting and raises some potential questions around new data-driven sets of countries to benchmark a particular country of interest against, or to see what policy settings similar (or different) countries have in place.

cormat-clustered.png
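A rough sketch of this country-by-country similarity step is given below, again using the hypothetical country_features_z matrix: correlate countries on their 22 feature values, then let heatmap() cluster the correlation matrix.

```r
# cor() works column-wise, so transpose to get country-by-country correlations
country_cor <- cor(t(country_features_z))

# Hierarchical clustering reorders the axes so similar countries sit together
heatmap(country_cor, scale = "none", symm = TRUE)
```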

Final thoughts

In summary, feature-based time-series analysis is an often overlooked but very useful part of any analyst's or researcher's skillset. Feature-based approaches can both reduce computation time and reveal (linear and nonlinear) patterns in temporal data that standard analysis or visualisation of the raw data (measurement) space cannot. Their application so far has been most prominent in neuroscience, where time-series features are used as inputs to statistical and machine learning models to classify healthy controls from people with brain disorders using fMRI and EEG data. However, these techniques are yet to break into heavily applied fields such as policy and program evaluation. This article hopefully serves as a precursor to a slightly wider adoption of feature-based approaches to time-series problems in more econometric-type settings.