An introduction to Concept Drift

Marc Deveaux
Feb 18, 2023

Academic paper source:

Background

When we create a model, we expect the data and logic used to mimic the real world; however, as the world changes, the model’s predictions become less accurate over time (see https://www.youtube.com/watch?v=uOG685WFO00). An example of this is the pandemic, which changed people’s behavior. A model’s accuracy can drop for mainly two reasons:

1. A hidden context

“The data points on which the model was trained are not sufficient to capture the complexity of the problem space. Therefore, the model will perform unexpectedly for samples in the input space that was not covered in the instance space of training examples.” The authors give the example of a learning system predicting the earth’s temperature using spatial and temporal historical data. Predictions will get gradually worse due to the lack of captured information related to climate change (our hidden context).

2. Concept drift

“The system environment is dynamic and progressively subject to changes, making it difficult for a single model to provide accurate predictions.” Concept drift can be attributed to several reasons, such as seasonality or changing preferences and behaviors. This effect has been studied in many domains where all systems share a non-stationarity property due to continuous change. To detect concept drift, a model degradation detector is integrated into the model system to “evaluate and track the system’s performance to control such degradation in prediction accuracy”.

Fixing concept drift

  • Blind adaptation: a passive approach where the model is constantly updated with new data, without trying to detect drift
  • Drift detection: a “methodology that helps determine and identify a time instant or time interval when a change arises in the properties of the target object.” In practice, a test statistic measures the difference between old and new samples, as sketched after this list
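
As an illustration of such a test statistic, here is a minimal sketch using the two-sample Kolmogorov-Smirnov test from SciPy, one common choice among many; the sample sizes and the significance level are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
old = rng.normal(0.0, 1.0, 5_000)   # feature values from the training period
new = rng.normal(0.3, 1.0, 5_000)   # the same feature, observed later

# Two-sample KS test: do the old and new samples come from the same distribution?
result = ks_2samp(old, new)
if result.pvalue < 0.01:            # illustrative significance level
    print(f"change detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.1e})")
```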

Categories to detect concept drift

Data distribution-based technique

We use a distance to measure the similarity between data distributions of samples extracted at different times. The advantage is that you can apply this to both labeled and unlabeled datasets; the disadvantage is that a change in data distribution does not always affect the predictor’s performance (generating false alarms!). A minimal sketch follows.
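
Here is a hedged sketch of the distance-based flavor, using the first Wasserstein distance from SciPy as the similarity measure; the 0.1 threshold is an assumption you would need to tune.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)  # sample from the training period
current = rng.normal(0.0, 1.3, 5_000)    # the spread changed, the mean did not

# Distance between the two empirical distributions
dist = wasserstein_distance(reference, current)
if dist > 0.1:                           # assumed threshold, to be tuned
    print(f"distributions diverged (W1 = {dist:.3f})")
```

Note the caveat above: the distance can cross the threshold even when the change does not actually degrade the model’s predictions, which is exactly how false alarms arise.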

Performance-based technique

Performance-based and data distribution-based techniques are the most dominant. Performance-based approaches “typically trace deviations in the online learner’s output error, known as the predictive sequential (prequential) error, to detect changes”. The main idea is that under a stationary distribution, the error rate should decrease (or at least stay stable) as the learner sees more examples. If the error rate rises instead, we can conclude that we have a concept drift, meaning that the learned relationship between the input data and the target feature is obsolete.

The advantage is that it reacts only when performance actually degrades, so there are no false alarms. The disadvantage is that it requires a quick feedback loop: true labels must arrive shortly after the predictions are made. A minimal monitoring sketch follows.
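
As a hedged illustration, the sketch below tracks the error rate over the most recent predictions and compares it with the best error rate seen so far; the window size, tolerance, and class name are assumptions for the example, not the paper’s method.

```python
from collections import deque

class PrequentialMonitor:
    """Sketch of a performance-based detector: track the error rate on
    the most recent `window` predictions and flag drift when it exceeds
    the best rate seen so far by a tolerance margin (assumed values)."""

    def __init__(self, window=200, tolerance=0.10):
        self.recent = deque(maxlen=window)
        self.best_error = float("inf")
        self.tolerance = tolerance

    def update(self, y_true, y_pred):
        self.recent.append(int(y_true != y_pred))
        if len(self.recent) < self.recent.maxlen:
            return False                      # not enough feedback yet
        error = sum(self.recent) / len(self.recent)
        self.best_error = min(self.best_error, error)
        # Flag drift when the recent error exceeds the best by a margin
        return error > self.best_error + self.tolerance
```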

Multiple hypothesis-based drift detectors

“Hybrid approaches that apply several detection methods and aggregate their results in parallel or hierarchically. Parallel drift detectors integrate the decisions of multiple drift detectors to make the final judgment”. Hierarchical drift detectors incorporate two layers: a warning layer that flags a potential occurrence of concept drift and a validation layer that confirms or rejects it. A minimal parallel-voting sketch is shown below.
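
As a hedged illustration of the parallel flavor, the sketch below aggregates several base detectors by majority vote; the `update(y_true, y_pred) -> bool` interface and the `quorum` parameter are assumptions for the example, not an API from the paper.

```python
def parallel_drift_vote(detectors, y_true, y_pred, quorum=0.5):
    """Sketch of a parallel multiple-hypothesis detector: feed the same
    labeled prediction to every base detector (each assumed to expose an
    update(y_true, y_pred) -> bool method, e.g. PrequentialMonitor above)
    and declare drift when a quorum of them agree."""
    votes = [d.update(y_true, y_pred) for d in detectors]
    return sum(votes) / len(detectors) >= quorum
```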

Contextual approach

“Use context information available from the system and data to detect the drift.” Examples include using historical drift trends or analyzing the spikes of neural networks.

Concept drift types

The authors describe the different types of concept drift they found in the literature. I won’t go through each of them, but I will give you a few examples instead. They created two tables: one categorizing concept drifts by the probabilistic source of change, and the other by the arrival pattern (i.e., the drift transition).

Concept drifts by a probabilistic source of change

  • real concept drift: “a change in the posterior probability distribution [that] indicates a principal change in the underlying target concept. This drift type directly affects the prediction performance since it requires an adaptation of the decision boundary to react to it for preserving the model’s accuracy.” In other words, the relationship P(y|X) changes, but not necessarily the inputs
  • covariate shift: a change in the underlying data distribution P(X). The target feature is not necessarily affected
  • prior-probability shift: a change in the distribution of classes P(y) over time. “This drift type could affect the prediction performance if there is a significant change in the distribution of classes or the number of classes in the learning problem has changed” (a toy illustration follows this list)
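
Here is a toy NumPy illustration of the difference between covariate shift and real concept drift, using a one-feature classification rule; all numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Reference period: X ~ N(0, 1), true rule y = 1 if x > 0
x_ref = rng.normal(0.0, 1.0, n)
y_ref = (x_ref > 0.0).astype(int)

# Covariate shift: P(X) moves, but the rule y = 1[x > 0] is unchanged
x_cov = rng.normal(1.5, 1.0, n)          # input distribution drifts
y_cov = (x_cov > 0.0).astype(int)        # same concept

# Real concept drift: P(X) unchanged, but the boundary moves to x > 1
x_real = rng.normal(0.0, 1.0, n)
y_real = (x_real > 1.0).astype(int)      # the old boundary is now obsolete

# A model that learned the reference rule (predict 1 when x > 0):
acc_cov = np.mean((x_cov > 0.0).astype(int) == y_cov)    # stays 1.0
acc_real = np.mean((x_real > 0.0).astype(int) == y_real) # drops to ~0.66
print(acc_cov, acc_real)
```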

Concept drifts related to the arrival pattern

  • sudden drift: “occurs when the target distribution changes from one concept to another abruptly at a point in time”
  • gradual drift: “occurs when the target distribution changes progressively from one concept to another”
  • recurring drift: “occurs when a precedently-seen concept reappears again after a time interval”
  • incremental drift: “occurs when a new concept replaces the old one slowly in a continuous manner” (a simulation sketch of these patterns follows this list)
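
To make the four arrival patterns concrete, here is a hedged NumPy sketch that simulates each one as a univariate stream whose mean moves from an old concept (0) to a new one (2); the change points and magnitudes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1_000                      # stream length
t = np.arange(T)

# Sudden drift: the mean jumps from 0 to 2 at t = 500
sudden = rng.normal(np.where(t < 500, 0.0, 2.0), 1.0)

# Gradual drift: between t = 400 and 600, samples come from the new
# concept with increasing probability, then only from the new one
p_new = np.clip((t - 400) / 200, 0.0, 1.0)
use_new = rng.random(T) < p_new
gradual = rng.normal(np.where(use_new, 2.0, 0.0), 1.0)

# Incremental drift: the mean itself slides smoothly from 0 to 2
incremental = rng.normal(np.clip((t - 400) / 200, 0.0, 1.0) * 2.0, 1.0)

# Recurring drift: the old concept (mean 0) reappears after an interval
recurring = rng.normal(np.where((t // 250) % 2 == 0, 0.0, 2.0), 1.0)
```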

Performance-based concept drift detectors

Those detectors can be split into three categories: Statistical Process Control, windowing techniques, and ensemble learning. We will only cover a few.

Statistical Process Control: the Drift Detection Method (DDM)

Many methods are based on DDM, so let’s focus on this one. “DDM analyzes the error rate of the streaming data classifier to detect changes. The method considers the error as a Bernoulli random variable with Binomial distribution”. Pt is the probability of error at a time t, and i represents the number of points being sampled. The standard deviation is given by St = sqrt(Pt * (1 - Pt) / i). The method then applies the following rules:

  • At a time t, Pmin and Smin are replaced by the values of Pt and St if Pt + St < Pmin + Smin.
  • A warning state is triggered when Pt + St >= Pmin + 2 * Smin
  • A drift is detected when Pt + St >= Pmin + 3 * Smin

This method does not work particularly well if the drift is very slow and gradual. Note that there are variants of DDM that handle imbalanced datasets (see PerfSim, DDM-OCI, or Linear Four Rates). A minimal sketch of DDM follows.
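
Putting the rules above together, here is a minimal, self-contained sketch of DDM; the 30-sample warm-up and the reset-after-drift behavior are common conventions assumed for the example rather than taken from the paper.

```python
import math

class DDM:
    """Minimal sketch of the Drift Detection Method: consume a stream of
    0/1 prediction errors, track the error rate Pt and its standard
    deviation St = sqrt(Pt * (1 - Pt) / i), and apply the DDM rules."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_samples=30):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_samples = min_samples   # warm-up length (assumed convention)
        self.reset()

    def reset(self):
        self.i = 0                       # number of samples seen
        self.errors = 0                  # number of misclassifications
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the classifier misclassified the sample, else 0."""
        self.i += 1
        self.errors += error
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)

        if self.i < self.min_samples:
            return "stable"

        # Keep the best (lowest) Pt + St observed so far
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s

        if p + s >= self.p_min + self.drift_level * self.s_min:
            self.reset()                 # drift confirmed: retrain, start over
            return "drift"
        if p + s >= self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"
```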

Windowing technique

“Window-based detectors divide the data stream into windows based on data size or time interval in a sliding manner. These methods monitor the performance of the most recent observations introduced to the learner and compare it with the performance of a reference window.” Adaptive Windowing (ADWIN) and STEPD are popular methods that use this technique.

ADWIN

“Adwin is an algorithm that detects concept drifts on the fly and adapts ML models accordingly. The algorithm maintains an adaptive window which is the basis for computing the ML model. Adwin grows the window (i.e., adds the most recent tuples) as long as there is no concept drift detected. As a result, the model can rely on growing training data. Adwin shrinks the window by removing old tuples when it detects a concept drift” (source)

It checks the difference between the means Mu_hist and Mu_new of two large-enough sub-windows W_hist and W_new: a drift is flagged when |Mu_hist - Mu_new| >= eps_cut, with eps_cut = sqrt(ln(4 / delta) / (2 * m)), where delta is a pre-defined confidence parameter and m is the harmonic mean of the sizes of the two windows.
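
The original ADWIN relies on an exponential bucket structure to stay efficient; the hedged sketch below instead checks every split of a plain window, which is easier to read but O(n) per update. The `delta` and `min_size` values are illustrative assumptions.

```python
import math
from collections import deque

def adwin_check(window, delta=0.002, min_size=10):
    """Simplified ADWIN-style cut check (a sketch, not the bucket-based
    original): try every split of `window` into a historical part and a
    recent part, and return the first split whose means differ by more
    than the eps_cut bound, or None if no such split exists."""
    values = list(window)
    n = len(values)
    total = float(sum(values))
    hist_sum = 0.0
    for split in range(1, n):
        hist_sum += values[split - 1]
        n_hist, n_new = split, n - split
        if n_hist < min_size or n_new < min_size:
            continue
        mu_hist = hist_sum / n_hist
        mu_new = (total - hist_sum) / n_new
        m = 1.0 / (1.0 / n_hist + 1.0 / n_new)  # harmonic-mean term of the sizes
        eps_cut = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
        if abs(mu_hist - mu_new) >= eps_cut:
            return split
    return None

# Usage: grow the window while no drift is found, shrink it on a cut
window = deque()
def observe(x):
    window.append(x)
    cut = adwin_check(window)
    if cut is not None:
        for _ in range(cut):
            window.popleft()                    # forget the old concept
        return True                             # drift detected
    return False
```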

Ensemble technique

“Concept drift detectors that are ensemble-based operate by combining the results of multiple diverse base learners. The overall performance is monitored by either considering the accuracy of all the ensemble members or the accuracy of each individual base learner. […] Ensemble-based detectors trigger concept drift if the learners suffer from a significant level of performance degradation. This assumption is based on the fact that each learner has capabilities in solving specific problems.” A system of weights is used by most ensembles to select the best learners (see the Weighted Majority Algorithm, Accuracy Weighted Ensemble, Dynamic Weighted Majority, or Accuracy Updated Ensemble); a sketch follows.
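
As a hedged sketch of the weighting idea, here is a compact version of the Weighted Majority Algorithm: experts that misclassify see their weight multiplied by beta, so the ensemble gradually favors the learners that fit the current concept. The experts and the stream in the usage lines are hypothetical.

```python
import numpy as np

def weighted_majority(experts, stream, beta=0.5):
    """Weighted Majority Algorithm sketch: each expert is a callable
    x -> {0, 1}; wrong experts are down-weighted by beta after every
    labeled sample, shifting the vote toward the better learners."""
    weights = np.ones(len(experts))
    for x, y in stream:
        votes = np.array([expert(x) for expert in experts])
        # Weighted vote: predict 1 if experts voting 1 carry more weight
        prediction = int(weights @ votes >= weights @ (1 - votes))
        # Penalize the experts that were wrong on this sample
        weights[votes != y] *= beta
        yield prediction, weights.copy()

# Hypothetical usage: two threshold experts on a tiny labeled stream
experts = [lambda x: int(x > 0), lambda x: int(x > 1)]
stream = [(0.5, 1), (1.5, 1), (0.5, 0), (0.4, 0)]  # (x, y) pairs, made up
for pred, w in weighted_majority(experts, stream):
    print(pred, w)
```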
