In machine learning, and more specifically in classification (supervised learning), industrial or raw datasets come with far more complications than toy data.
One of these complications is a high class imbalance, where the common classes (the majority) occur far more often than the ones we actually want to study (the minority).
In this tutorial, we will look at what lies behind the imbalance learning problem and how it affects our models, explain what undersampling and oversampling mean, and implement both using the Python libraries imblearn and smote-variants.
Throughout the tutorial, we will use the credit card fraud dataset from Kaggle, which you can download here.
$ pip install numpy pandas imbalanced-learn smote-variants
What is Imbalance Learning?
Imbalance learning problems are present in most industrial settings. As mentioned above, the target classes (in binary or multiclass problems) that need to be studied and analyzed usually suffer from a lack of data compared with the overwhelming presence of the common classes.
This imbalance has a direct negative impact on the model during training: the model becomes biased towards the majority class, which can lead to the accuracy paradox.
To illustrate the accuracy paradox, imagine a dataset that contains 98 samples from class 0 and 2 from class 1. If our model naively predicts all of them as class 0, it still reaches an accuracy of 98%, which looks good even though the model never detects class 1. Such an accuracy value is misleading and can lead to wrong conclusions.
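The 98/2 example can be reproduced with a few lines of plain Python. This is only a sketch: the "model" is a hypothetical one that always predicts class 0, and minority recall is used here as one possible complementary metric.

```python
# Toy labels from the example: 98 samples of class 0, 2 of class 1.
y_true = [0] * 98 + [1] * 2

# A naive "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class 1: share of true minority samples that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_minority = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.98
print(f"minority recall = {recall_minority:.2f}")  # minority recall = 0.00
```

The accuracy looks excellent, while the recall on the class we care about is zero.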
For this reason, we will now go through the metrics used with imbalanced datasets:
Another characteristic of imbalance learning is the imbalance ratio (IR). In binary imbalanced classification, it is calculated by dividing the size of the minority class by the size of the majority class. We use the IR to measure how severe the imbalance is.
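As a minimal sketch of this definition, the ratio can be computed from the class counts. The helper name imbalance_ratio is hypothetical and not part of any library used in this tutorial.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return minority size / majority size for a binary label sequence."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return min(counts.values()) / max(counts.values())

# Toy example: 98 majority samples and 2 minority samples.
toy_labels = [0] * 98 + [1] * 2
ir = imbalance_ratio(toy_labels)
print(f"IR = {ir:.4f}")  # IR = 0.0204
```

With this definition, a value close to 0 indicates a severe imbalance and a value of 1 means the two classes are perfectly balanced.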
Types Of Under/Oversampling
To reduce the imbalance, we can follow two different approaches: either remove samples from the majority classes (undersampling) or add samples to the minority ones (oversampling).
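The two directions can be illustrated on a toy label list using only the standard library. This is a sketch of the idea with made-up data, not the implementation used by any resampling library.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy labels: 20 majority samples (class 0) and 4 minority samples (class 1).
labels = [0] * 20 + [1] * 4
majority_idx = [i for i, label in enumerate(labels) if label == 0]
minority_idx = [i for i, label in enumerate(labels) if label == 1]

# Undersampling: keep only as many majority samples as there are minority ones.
kept_majority = random.sample(majority_idx, k=len(minority_idx))
under_idx = kept_majority + minority_idx

# Oversampling: draw minority samples with replacement until the classes match.
n_missing = len(majority_idx) - len(minority_idx)
extra_minority = random.choices(minority_idx, k=n_missing)
over_idx = majority_idx + minority_idx + extra_minority

under_counts = Counter(labels[i] for i in under_idx)  # 4 samples per class
over_counts = Counter(labels[i] for i in over_idx)    # 20 samples per class
print(under_counts)
print(over_counts)
```

Undersampling discards information from the majority class, while random oversampling only duplicates existing minority samples.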
In this tutorial, we will go through two types of resampling-based approaches: random and directed.
For random sampling, we will use the imblearn Python library. But first, we will load our CSV file into a data frame and store the labels in a variable:
import numpy as np
import pandas as pd

df = pd.read_csv("creditcard.csv")
y = df["Class"]
X = df.drop(["Time", "Class"], axis=1)
print(y.value_counts())

Output:

0    284315
1       492
Name: Class, dtype: int64
We will now create a RandomUnderSampler() object from imblearn and use the method fit_resample() to apply the undersampling on the dataset.
from imblearn.under_sampling import RandomUnderSampler

under = RandomUnderSampler()
X_und, y_und = under.fit_resample(X, y)
# compare the class counts on the labels, not on the features
print(len(y_und[y_und == 1]) == len(y_und[y_und == 0]))
As you can see, after undersampling, the number of non-fraudulent samples equals the number of fraudulent ones.
For random oversampling, we use the same process. The only difference is that we use RandomOverSampler() instead of RandomUnderSampler().
from imblearn.over_sampling import RandomOverSampler

over = RandomOverSampler()
X_over, y_over = over.fit_resample(X, y)
# compare the class counts on the labels, not on the features
print(len(y_over[y_over == 1]) == len(y_over[y_over == 0]))
Now we will dive into the second type of balancing algorithms: the directed approaches.
For undersampling, we will cover the following algorithms: Edited Nearest Neighbors, Instance Hardness Threshold, and Tomek Links.
from imblearn.under_sampling import EditedNearestNeighbours, InstanceHardnessThreshold, TomekLinks

# each sampler must be instantiated before calling fit_resample()
under_samp_models = [EditedNearestNeighbours(), InstanceHardnessThreshold(), TomekLinks()]

for under_samp_model in under_samp_models:
    X_und, y_und = under_samp_model.fit_resample(X, y)
Directed Oversampling Using SMOTE
SMOTE ("Synthetic Minority Oversampling TEchnique") is an oversampling technique that works by drawing lines between minority data points and generating new data along those lines, as shown in the figure below.
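The interpolation idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation found in smote-variants or imblearn: the helper smote_sketch, its parameters, and the toy points are all hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours (illustrative helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the line between the base point and its neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy 2D minority samples (made-up values).
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0), (2.5, 2.5)]
new_points = smote_sketch(minority_points, n_new=4)
for point in new_points:
    print(point)
```

Because every synthetic point is a convex combination of two existing minority points, the generated data stays inside the region already covered by the minority class.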
We will use the smote-variants Python library, a package that includes 85 variants of SMOTE, all described in this scientific article.
The implementation is quite similar to that of imblearn, with minor changes such as using the method sample() instead of fit_resample() to generate data. In this tutorial, we will use kmeans_SMOTE, Safe_Level_SMOTE, and SMOTE_Cosine as examples.
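import smote_variants as sv

# three of the SMOTE variants provided by the library
svs = [sv.kmeans_SMOTE(), sv.Safe_Level_SMOTE(), sv.SMOTE_Cosine()]

for over_sampler in svs:
    # pass NumPy arrays rather than pandas objects
    X_over_samp, y_over_samp = over_sampler.sample(X.values, y.values)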
Throughout this tutorial, we have learned:
Happy learning ♥