Chapter 9. Imbalanced Classes

If you are classifying data, and the classes are not relatively balanced in size, the bias toward more popular classes can carry over into your model. For example, if you have 1 positive case and 99 negative cases, you can get 99% accuracy simply by classifying everything as negative. There are various options for dealing with imbalanced classes.
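The 99% figure can be checked with a few lines of plain Python. This is a minimal sketch: the "model" is just a list of constant predictions, and the recall calculation is an added illustration (not from the text above) of what that accuracy hides.

```python
# 1 positive case and 99 negative cases, as in the example above
y_true = [1] + [0] * 99

# A "model" that ignores its input and always predicts the negative class
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the labels
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: fraction of actual positives that were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The accuracy looks excellent, yet the single positive case is never found.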

Downsampling Majority

One way to balance classes is to downsample the majority classes. Here is an sklearn example:

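>>> import pandas as pd
>>> from sklearn.utils import resample
>>> # Split the rows into survivors (minority) and deaths (majority)
>>> mask = df.survived == 1
>>> surv_df = df[mask]
>>> death_df = df[~mask]
>>> # Sample the majority down to the minority size, without replacement
>>> df_downsample = resample(
...     death_df,
...     replace=False,
...     n_samples=len(surv_df),
...     random_state=42,
... )
>>> df3 = pd.concat([surv_df, df_downsample])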

>>> df3.survived.value_counts()
1    500
0    500
Name: survived, dtype: int64

Tip

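Don’t use replacement when downsampling (pass replace=False). Sampling with replacement can pick the same row more than once, so you would throw away even more distinct rows than necessary.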
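The effect of replacement can be seen in a minimal sketch that uses only the standard library's random module as a stand-in for sklearn's resample. The sizes (809 majority rows reduced to 500) and the names majority_ids and n_keep are illustrative assumptions, not values taken from the example above.

```python
import random

rng = random.Random(42)

# Illustrative sizes: 809 majority row ids, downsampled to 500
majority_ids = list(range(809))
n_keep = 500

# Without replacement: every selected row is distinct
without_repl = rng.sample(majority_ids, n_keep)

# With replacement: the same row can be drawn more than once
with_repl = rng.choices(majority_ids, k=n_keep)

unique_without = len(set(without_repl))
unique_with = len(set(with_repl))
print(unique_without, unique_with)
```

Sampling without replacement keeps 500 distinct rows, while sampling with replacement keeps fewer distinct rows because some are duplicated.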

The imbalanced-learn library also implements various downsampling algorithms:

ClusterCentroids

This class uses K-means to synthesize data, replacing clusters of majority samples with their centroids.

RandomUnderSampler

This class randomly selects samples.

NearMiss

This class uses nearest neighbors to downsample.

TomekLinks

This class downsamples by removing samples that form Tomek links: pairs of samples from different classes that are each other's nearest neighbor.

EditedNearestNeighbours

This class removes samples whose nearest neighbors do not all (or do not mostly) share the sample's class.

RepeatedNearestNeighbours

This class repeatedly applies EditedNearestNeighbours.

AllKNN

This class is similar to RepeatedNearestNeighbours but increases the number of nearest neighbors with each iteration of downsampling.

CondensedNearestNeighbour

This class picks one sample of the class to be downsampled, then iterates through the other samples of that class, and if KNN misclassifies a sample, it adds that sample to the retained set.

OneSidedSelection

This class removes noisy samples.

NeighbourhoodCleaningRule

This class takes the EditedNearestNeighbours results and applies KNN to them.

InstanceHardnessThreshold

This class trains a model, then removes samples to which the model assigns a low probability.

All of these classes support the .fit_resample method (named .fit_sample in older releases of imbalanced-learn).