Oversampling and undersampling in data analysis

Statistical sampling techniques

(Learn how and when to remove this template message)

Within statistics, oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented). These terms are used both in statistical sampling, survey design methodology and in machine learning.

Oversampling and undersampling are opposite and roughly equivalent techniques. There are also more complex oversampling techniques, including the creation of artificial data points with algorithms like Synthetic minority oversampling technique.^[1]^[2]

Motivation for oversampling and undersampling

Both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data, or likely to develop if a purely random sample were taken. Data Imbalance can be of the following types:

Under-representation of a class in one or more important predictor variables. Suppose, to address the question of gender discrimination, we have survey data on salaries within a particular field, e.g., computer software. It is known women are under-represented considerably in a random sample of software engineers, which would be important when adjusting for other variables such as years employed and current level of seniority. Suppose only 20% of software engineers are women, i.e., males are 4 times as frequent as females. If we were designing a survey to gather data, we would survey 4 times as many females as males, so that in the final sample, both genders will be represented equally. (See also Stratified Sampling.)
Under-representation of one class in the outcome (dependent) variable. Suppose we want to predict, from a large clinical dataset, which patients are likely to develop a particular disease (e.g., diabetes). Assume, however, that only 10% of patients go on to develop the disease. Suppose we have a large existing dataset. We can then pick 9 times the number of patients who did not go on to develop the disease for every one patient who did.

Oversampling is generally employed more frequently than undersampling, especially when the detailed data has yet to be collected by survey, interview or otherwise. Undersampling is employed much less frequently. Overabundance of already collected data became an issue only in the "Big Data" era, and the reasons to use undersampling are mainly practical and related to resource costs. Specifically, while one needs a suitably large sample size to draw valid statistical conclusions, the data must be cleaned before it can be used. Cleansing typically involves a significant human component, and is typically specific to the dataset and the analytical problem, and therefore takes time and money. For example:

Domain experts will suggest dataset-specific means of validation involving not only intra-variable checks (permissible values, maximum and minimum possible valid values, etc.), but also inter-variable checks. For example, the individual components of a differential white blood cell count must all add up to 100, because each is a percentage of the total.
Data that is embedded in narrative text (e.g., interview transcripts) must be manually coded into discrete variables that a statistical or machine-learning package can deal with. The more the data, the more the coding effort. (Sometimes, the coding can be done through software, but somebody must often write a custom, one-off program to do so, and the program's output must be tested for accuracy, in terms of false positive and false negative results.)

For these reasons, one will typically cleanse only as much data as is needed to answer a question with reasonable statistical confidence (see Sample Size), but not more than that.

Oversampling techniques for classification problems

Random oversampling

Random Oversampling involves supplementing the training data with multiple copies of some of the minority classes. Oversampling can be done more than once (2x, 3x, 5x, 10x, etc.) This is one of the earliest proposed methods, that is also proven to be robust.^[3] Instead of duplicating every sample in the minority class, some of them may be randomly chosen with replacement.

SMOTE

There are a number of methods available to oversample a dataset used in a typical classification problem (using a classification algorithm to classify a set of images, given a labelled training set of images). The most common technique is known as SMOTE: Synthetic Minority Over-sampling Technique.^[4] However, this technique has been shown to yield poorly calibrated models, with an overestimated probability to belong to the minority class.^[5]

To illustrate how this technique works consider some training data which has s samples, and f features in the feature space of the data. Note that these features, for simplicity, are continuous. As an example, consider a dataset of birds for classification. The feature space for the minority class for which we want to oversample could be beak length, wingspan, and weight (all continuous). To then oversample, take a sample from the dataset, and consider its k nearest neighbors (in feature space). To create a synthetic data point, take the vector between one of those k neighbors, and the current data point. Multiply this vector by a random number x which lies between 0, and 1. Add this to the current data point to create the new, synthetic data point.

Many modifications and extensions have been made to the SMOTE method ever since its proposal.^[6]

ADASYN

The adaptive synthetic sampling approach, or ADASYN algorithm,^[7] builds on the methodology of SMOTE, by shifting the importance of the classification boundary to those minority classes which are difficult. ADASYN uses a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn.

Augmentation

Data augmentation in data analysis are techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model.^[8] (See: Data augmentation)

Undersampling techniques for classification problems

Random undersampling

Randomly remove samples from the majority class, with or without replacement. This is one of the earliest techniques used to alleviate imbalance in the dataset, however, it may increase the variance of the classifier and is very likely to discard useful or important samples.^[6]

Cluster

Cluster centroids is a method that replaces cluster of samples by the cluster centroid of a K-means algorithm, where the number of clusters is set by the level of undersampling.

Tomek links

Tomek links remove unwanted overlap between classes where majority class links are removed until all minimally distanced nearest neighbor pairs are of the same class. A Tomek link is defined as follows: given an instance pair $(x_{i},x_{j})$ , where $x_{i}\in S_{\min },x_{j}\in S_{\operatorname {max} }$ and $d(x_{i},x_{j})$ is the distance between $x_{i}$ and $x_{j}$ , then the pair $(x_{i},x_{j})$ is called a Tomek link if there's no instance $x_{k}$ such that $d(x_{i},x_{k})<d(x_{i},x_{j})$ or $d(x_{j},x_{k})<d(x_{i},x_{j})$ . In this way, if two instances form a Tomek link then either one of these instances is noise or both are near a border. Thus, one can use Tomek links to clean up overlap between classes. By removing overlapping examples, one can establish well-defined clusters in the training set and lead to improved classification performance.

Undersampling with ensemble learning

A recent study shows that the combination of Undersampling with ensemble learning can achieve better results, see IFME: information filtering by multiple examples with under-sampling in a digital library environment.^[9]

Techniques for regression problems

Although sampling techniques have been developed mostly for classification tasks, growing attention is being paid to the problem of imbalanced regression.^[10] Adaptations of popular strategies are available, including undersampling, oversampling and SMOTE.^[11]^[12] Sampling techniques have also been explored in the context of numerical prediction in dependency-oriented data, such as time series forecasting^[13] and spatio-temporal forecasting.^[14]

Additional techniques

It's possible to combine oversampling and undersampling techniques into a hybrid strategy. Common examples include SMOTE and Tomek links or SMOTE and Edited Nearest Neighbors (ENN). Additional ways of learning on imbalanced datasets include weighing training instances, introducing different misclassification costs for positive and negative examples and bootstrapping.^[15]

Implementations

A variety of data re-sampling techniques are implemented in the imbalanced-learn package ^[1] compatible with the scikit-learn Python library. The re-sampling techniques are implemented in four different categories: undersampling the majority class, oversampling the minority class, combining over and under sampling, and ensembling sampling.
The Python implementation of 85 minority oversampling techniques with model selection functions are available in the smote-variants ^[2] package.

Criticism

Machine learning models trying to model a conditional distribution using Bayes rule: $P(Y|X)={\frac {P(X|Y)P(Y)}{P(X)}}$ may be wrongly calibrated if modifying the natural distribution $P(Y)$ during training by applying undersampling or downsampling.^[16]

This point can be illustrated with a simple example: Assume no predictive variables $X$ and where the proportion of $Y=1$ is 0.01 and the proportion of $Y=0$ is 0.99. Is a model which learns ${\hat {P}}(Y=1)=0.01$ useless and should be modified via undersampling or oversampling? The answer is no. Class imbalance is not a problem in itself at all.

Additionally,

oversampling
undersampling
as well as assigning weights to samples

may be applied by practitioners in multi-class classification or situations with very imbalanced cost structure. This might be done in order to achieve "desireable", best performances for each class (potentially measured as precision and recall in each class). Finding the best multi-class classification performance or the best tradeoff between precision and recall is, however, inherently a multi-objective optimization problem. It is well known that these problems typically have multiple incomparable Pareto optimal solutions. Oversampling or undersampling as well as assigning weights to samples is an implicit way to find a certain pareto optimum (and it sacrifices the calibration of the estimated probabilities). A more explicit way than oversampling or downsampling could be to select a Pareto optimum by

assign explicit costs to missclassified samples and then minimize the total (scalarized) costs via cost-sensitive machine learning.^[17]
perform threshold tuning in a binary classification setting so that a certain validation precision and recall are achieved ^[18]

Literature

Kubat, M. (2000). Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. Fourteenth International Conference on Machine Learning.

References

^ ^a ^b "Scikit-learn-contrib/Imbalanced-learn". GitHub. 25 October 2021.
^ ^a ^b "Analyticalmindsltd/Smote_variants". GitHub. 26 October 2021.
^ Ling, Charles X., and Chenghui Li. "Data mining for direct marketing: Problems and solutions." Kdd. Vol. 98. 1998.
^ Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2002-06-01). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research. 16: 321–357. arXiv:1106.1813. doi:10.1613/jair.953. ISSN 1076-9757. S2CID 1554582.
^ van den Goorbergh, Ruben; van Smeden, Maarten; Timmerman, Dirk; Van Calster, Ben (2022-09-01). "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression". Journal of the American Medical Informatics Association. 29 (9): 1525–1534. doi:10.1093/jamia/ocac093. ISSN 1527-974X. PMC 9382395. PMID 35686364.
^ ^a ^b Chawla, Nitesh V.; Herrera, Francisco; Garcia, Salvador; Fernandez, Alberto (2018-04-20). "SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary". Journal of Artificial Intelligence Research. 61: 863–905. doi:10.1613/jair.1.11192. hdl:10481/56411. ISSN 1076-9757.
^ He, Haibo; Bai, Yang; Garcia, Edwardo A.; Li, Shutao (June 2008). "ADASYN: Adaptive synthetic sampling approach for imbalanced learning" (PDF). 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). pp. 1322–1328. doi:10.1109/IJCNN.2008.4633969. ISBN 978-1-4244-1820-6. S2CID 1438164. Retrieved 5 December 2022.
^ Shorten, Connor; Khoshgoftaar, Taghi M. (2019). "A survey on Image Data Augmentation for Deep Learning". Mathematics and Computers in Simulation. 6. springer: 60. doi:10.1186/s40537-019-0197-0.
^ Zhu, Mingzhu; Xu, Chao; Wu, Yi-Fang Brook (2013-07-22). IFME: information filtering by multiple examples with under-sampling in a digital library environment. ACM. pp. 107–110. doi:10.1145/2467696.2467736. ISBN 9781450320771. S2CID 13279787.
^ Ribeiro, Rita P.; Moniz, Nuno (2020-09-01). "Imbalanced regression and extreme value prediction". Machine Learning. 109 (9): 1803–1835. doi:10.1007/s10994-020-05900-9. ISSN 1573-0565. S2CID 222143074.
^ Torgo, Luís; Branco, Paula; Ribeiro, Rita P.; Pfahringer, Bernhard (June 2015). "Resampling strategies for regression". Expert Systems. 32 (3): 465–476. doi:10.1111/exsy.12081. S2CID 205129966.
^ Torgo, Luís; Ribeiro, Rita P.; Pfahringer, Bernhard; Branco, Paula (2013). "SMOTE for Regression". In Correia, Luís; Reis, Luís Paulo; Cascalho, José (eds.). Progress in Artificial Intelligence. Lecture Notes in Computer Science. Vol. 8154. Berlin, Heidelberg: Springer. pp. 378–389. doi:10.1007/978-3-642-40669-0_33. hdl:10289/8518. ISBN 978-3-642-40669-0. S2CID 16253787.
^ Moniz, Nuno; Branco, Paula; Torgo, Luís (2017-05-01). "Resampling strategies for imbalanced time series forecasting". International Journal of Data Science and Analytics. 3 (3): 161–181. doi:10.1007/s41060-017-0044-3. ISSN 2364-4168. S2CID 25975914.
^ Oliveira, Mariana; Moniz, Nuno; Torgo, Luís; Santos Costa, Vítor (2021-09-01). "Biased resampling strategies for imbalanced spatio-temporal forecasting". International Journal of Data Science and Analytics. 12 (3): 205–228. doi:10.1007/s41060-021-00256-2. ISSN 2364-4168. S2CID 210931099.
^ Haibo He; Garcia, E.A. (2009). "Learning from Imbalanced Data". IEEE Transactions on Knowledge and Data Engineering. 21 (9): 1263–1284. doi:10.1109/TKDE.2008.239. S2CID 206742563.
^ "Imbalance correction led to models with strong miscalibration without better ability to distinguish between patients with and without the outcome event. The inaccurate probability estimates reduce the clinical utility of the model, because decisions about treatment are ill-informed.", The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression, 2022, Ruben van den Goorbergh, Maarten van Smeden, Dirk Timmerman, Ben Van Calster https://doi.org/10.1093/jamia/ocac093
^ Encyclopedia of Machine Learning. (2011). Deutschland: Springer. Page 193, https://books.google.de/books?id=i8hQhp1a62UC&pg=PT193
^ https://arxiv.org/abs/2201.08528v3

Chawla, Nitesh V. (2010) Data Mining for Imbalanced Datasets: An Overview doi:10.1007/978-0-387-09823-4_45 In: Maimon, Oded; Rokach, Lior (Eds) Data Mining and Knowledge Discovery Handbook, Springer ISBN 978-0-387-09823-4 (pages 875–886)
Lemaître, G. Nogueira, F. Aridas, Ch.K. (2017) Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, Journal of Machine Learning Research, vol. 18, no. 17, 2017, pp. 1–5.