If you are classifying data, and the classes are not relatively balanced in size, the bias toward more popular classes can carry over into your model. For example, if you have 1 positive case and 99 negative cases, you can get 99% accuracy simply by classifying everything as negative. There are various options for dealing with imbalanced classes.
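Before reaching for any of these, it helps to see the accuracy trap concretely. The following is a minimal sketch using scikit-learn's DummyClassifier on made-up data (the variable names and the roughly 1-to-99 split are illustrative, not taken from the examples below):
>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> X = np.zeros((1000, 1))
>>> y = np.array([1] * 10 + [0] * 990)
>>> # Always predicts the most frequent class (0)
>>> dummy = DummyClassifier(strategy="most_frequent")
>>> dummy.fit(X, y).score(X, y)
0.99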
Tree-based models may perform better depending on the distribution of the smaller class. If the minority samples tend to be clustered together, they can be classified more easily.
Ensemble methods can further aid in pulling out the minority classes. Bagging and boosting are options found in tree models like random forests and Extreme Gradient Boosting (XGBoost).
Many scikit-learn classification models support the class_weight parameter. Setting this to 'balanced' weights each class inversely proportional to its frequency, incentivizing the model to classify the minority classes correctly. Alternatively, you can grid search the weight options by passing in a dictionary mapping class to weight (give higher weight to smaller classes).
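As a sketch (LogisticRegression and the specific weight dictionaries here are illustrative choices, not the only option):
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV
>>> # Weight classes inversely proportional to frequency
>>> lr = LogisticRegression(class_weight="balanced")
>>> # Or grid search explicit class-to-weight mappings,
>>> # giving more weight to the minority class (1)
>>> grid = GridSearchCV(
...     LogisticRegression(),
...     param_grid={
...         "class_weight": [{0: 1, 1: 5}, {0: 1, 1: 10}]
...     },
...     scoring="roc_auc",
... )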
The XGBoost library has the max_delta_step parameter, which can be set from 1 to 10 to make the update step more conservative. It also has the scale_pos_weight parameter, which sets the ratio of negative to positive samples (for binary classes). For classification, the eval_metric should also be set to 'auc' rather than the default value of 'error'.
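A sketch, assuming the xgboost package and a binary problem; the parameter values are placeholders, and depending on your XGBoost version eval_metric may need to be passed to .fit() instead of the constructor:
>>> import xgboost as xgb
>>> # scale_pos_weight is commonly set to the ratio of
>>> # negative to positive samples; 99 is a placeholder
>>> xgb_clf = xgb.XGBClassifier(
...     max_delta_step=1,
...     scale_pos_weight=99,
...     eval_metric="auc",
... )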
The KNN model has a weights parameter that can bias neighbors that are closer. If the minority class samples are close together, setting this parameter to 'distance' may improve performance.
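For example (a sketch; the number of neighbors is left at its default):
>>> from sklearn.neighbors import KNeighborsClassifier
>>> # Weight votes by inverse distance so closer
>>> # neighbors count for more
>>> knn = KNeighborsClassifier(weights="distance")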
You can upsample the minority class in a couple of ways. Here is an sklearn implementation:
>>> from sklearn.utils import resample
>>> mask = df.survived == 1
>>> surv_df = df[mask]
>>> death_df = df[~mask]
>>> df_upsample = resample(
...     surv_df,
...     replace=True,
...     n_samples=len(death_df),
...     random_state=42,
... )
>>> df2 = pd.concat([death_df, df_upsample])
>>> df2.survived.value_counts()
1    809
0    809
Name: survived, dtype: int64
We can also use the imbalanced-learn library to randomly sample with replacement:
>>> from imblearn.over_sampling import (
...     RandomOverSampler,
... )
>>> ros = RandomOverSampler(random_state=42)
>>> X_ros, y_ros = ros.fit_resample(X, y)
>>> pd.Series(y_ros).value_counts()
1    809
0    809
dtype: int64
The imbalanced-learn library can also generate new samples of minority classes with both the Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN) sampling algorithms. SMOTE works by taking a minority sample, choosing one of its k-nearest neighbors, and generating a new point along the line connecting the two. ADASYN is similar to SMOTE, but generates more samples from those that are harder to learn. The classes in imbalanced-learn are named over_sampling.SMOTE and over_sampling.ADASYN.
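A sketch of both, assuming the same X and y used above (the result variable names are illustrative):
>>> from imblearn.over_sampling import SMOTE, ADASYN
>>> X_smote, y_smote = SMOTE(
...     random_state=42
... ).fit_resample(X, y)
>>> X_ada, y_ada = ADASYN(
...     random_state=42
... ).fit_resample(X, y)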
Another method to balance classes is to downsample majority classes. Here is an sklearn example:
>>> from sklearn.utils import resample
>>> mask = df.survived == 1
>>> surv_df = df[mask]
>>> death_df = df[~mask]
>>> df_downsample = resample(
...     death_df,
...     replace=False,
...     n_samples=len(surv_df),
...     random_state=42,
... )
>>> df3 = pd.concat([surv_df, df_downsample])
>>> df3.survived.value_counts()
1    500
0    500
Name: survived, dtype: int64
Don’t use replacement when downsampling.
The imbalanced-learn library also implements various downsampling algorithms:
ClusterCentroids
This class uses K-means to synthesize data from the cluster centroids.
RandomUnderSampler
This class randomly selects samples.
NearMiss
This class uses nearest neighbors to downsample.
TomekLinks
This class downsamples by removing samples that are close to each other.
EditedNearestNeighbours
This class removes samples whose class differs from the majority (or, depending on settings, from all) of their neighbors.
RepeatedEditedNearestNeighbours
This class repeatedly applies EditedNearestNeighbours.
AllKNN
This class is similar but increases the number of nearest neighbors during the iterations of downsampling.
CondensedNearestNeighbour
This class picks one sample of the class to be downsampled, then iterates through the other samples of that class, adding each sample that KNN misclassifies.
OneSidedSelection
This class removes noisy samples.
NeighbourhoodCleaningRule
This class uses the EditedNearestNeighbours results and applies KNN to them.
InstanceHardnessThreshold
This class trains a model, then removes samples with low probabilities.
All of these classes support the .fit_resample method.
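As a minimal sketch with one of them (assuming the same X and y as above; RandomUnderSampler is just the simplest of the classes listed):
>>> from imblearn.under_sampling import (
...     RandomUnderSampler,
... )
>>> rus = RandomUnderSampler(random_state=42)
>>> # The classes in y_rus will now be equal in size
>>> X_rus, y_rus = rus.fit_resample(X, y)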