Deep Dive into Imbalanced Learning: A Framework in Python

Dimitra Gkorou

Audience level:
Experienced

Description

This talk gives a framework for the effective use of state-of-the-art methods for imbalanced learning, with hands-on examples using imblearn and scikit-learn. It covers proper evaluation metrics, selection criteria for resampling, ensembles, and balanced algorithms, and finally tips on reframing the problem when possible.

Abstract

The challenge of imbalanced datasets arises in many Machine Learning applications, such as diagnostics in medicine, fraud detection, and reliability prediction in manufacturing. In these settings, learning a model that predicts the minority class with few false positives is difficult because of the skewed data distribution: a model that always predicts the majority class scores high accuracy but is not meaningful.
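A minimal sketch of this accuracy paradox, using a synthetic 99:1 dataset and scikit-learn's DummyClassifier (the data and parameters here are illustrative, not from the talk):

```python
# On a 99:1 dataset, a "classifier" that always predicts the majority
# class scores ~99% accuracy yet never detects the minority class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with a 1% minority class (illustrative)
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Always predict the most frequent (majority) class
clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))        # ~0.99, looks great
print("minority recall:", recall_score(y_test, y_pred))   # 0.0, useless
```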

In this talk, I will provide a framework for effectively using state-of-the-art methods for imbalanced learning and skewed distributions, with hands-on examples based on the imblearn and scikit-learn libraries. I will present the proper evaluation of a model in imbalanced learning, the drawbacks of resampling, the advantages of ensembles and balanced algorithms, and finally give tips for reframing the problem when possible.

In the first part of the talk, I will discuss how to evaluate a model under class imbalance. Metrics that reflect the underlying data distribution, such as accuracy, give a misleading impression of high performance. Single metrics such as recall, precision, or the F-measure are also insufficient on their own, as they implicitly assume that the imbalance ratio of the training set will remain stable throughout the life of the classifier. Combining these metrics with curve-based assessments (ROC and precision-recall curves) at different imbalance ratios allows us to evaluate the robustness of a classifier.
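A sketch of this evaluation strategy with scikit-learn, assuming a synthetic imbalanced dataset; average precision summarizes the precision-recall curve and ROC AUC summarizes the ROC curve, complementing the threshold-based metrics:

```python
# Threshold-based metrics at the default cutoff, combined with
# threshold-free, curve-based summaries of classifier quality.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # scores for the minority class

# Precision, recall, and F1 at the default 0.5 cutoff...
print(classification_report(y_test, clf.predict(X_test)))
# ...plus curve-based summaries that do not depend on a single threshold
print("ROC AUC:", roc_auc_score(y_test, scores))
print("average precision (PR curve):", average_precision_score(y_test, scores))
```

Under heavy imbalance the precision-recall curve is usually the more informative of the two, since the ROC curve can look optimistic when true negatives dominate.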

In the second part of the talk, I will present selection criteria for imbalanced learning algorithms. Common approaches include resampling, synthetic data generation, ensembles, and balanced algorithms. Resampling techniques, such as oversampling the minority class or undersampling the majority class, mask the imbalance by artificially balancing the dataset. However, undersampling discards potentially useful information from the majority class, while oversampling can lead to overfitting on duplicated training examples and increases the training time of the classifier. Different techniques suit different imbalance ratios. At moderate imbalance ratios, combining synthetic data generation for the minority class with an undersampling strategy that removes only redundant examples from the majority class performs well. Rather than resampling manually before training a classifier, combining ensembles with synthetic data generation typically yields high performance. For severe class imbalance, a boosting-based ensemble combined with a cascade-style learning structure can achieve a very high detection rate while keeping the false positive rate very low.
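A sketch of two of these strategies with imblearn; the talk names the techniques only generically, so the concrete choices here (SMOTETomek for combined over/undersampling, EasyEnsembleClassifier as the boosting-based ensemble on balanced subsets) are one possible instantiation, not necessarily the speaker's:

```python
# (1) SMOTE oversampling combined with Tomek-link undersampling, which
#     removes only borderline/redundant majority examples, and
# (2) an ensemble (EasyEnsemble) that trains AdaBoost learners on
#     balanced bootstrap subsets of the majority class.
from imblearn.combine import SMOTETomek
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# (1) Resampling inside a pipeline, so it is applied only during fit
smote_clf = make_pipeline(SMOTETomek(random_state=0),
                          LogisticRegression(max_iter=1000))
smote_clf.fit(X_train, y_train)

# (2) Ensemble of boosted learners, each fit on a balanced subsample
ens_clf = EasyEnsembleClassifier(n_estimators=10, random_state=0)
ens_clf.fit(X_train, y_train)

for name, model in [("SMOTE+Tomek", smote_clf), ("EasyEnsemble", ens_clf)]:
    scores = model.predict_proba(X_test)[:, 1]
    print(name, "average precision:", average_precision_score(y_test, scores))
```

Using imblearn's pipeline keeps the resampling out of the test data, which avoids the common mistake of evaluating on an artificially balanced set.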

In the third part of the talk, I will show how reframing imbalanced learning can address the rarity of the minority class. Depending on the characteristics of the problem, class imbalance can be addressed by learning to discriminate only the rare class, or by modeling only the normal behavior (the majority class). In this way, imbalanced learning is tackled as outlier detection or one-class classification.
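A sketch of this reframing with scikit-learn, assuming the minority class is rare enough to treat as anomalous; IsolationForest here stands in for any outlier detector (OneClassSVM would be the one-class classification analogue), and it is fit on majority-class examples only:

```python
# Reframe imbalanced classification as outlier detection: model the
# normal (majority) behavior and flag deviations as the rare class.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Learn normal behavior from majority-class examples only
iso = IsolationForest(contamination=0.01, random_state=0)
iso.fit(X_train[y_train == 0])

# IsolationForest returns +1 for inliers and -1 for outliers;
# map outliers to the minority label (1)
y_pred = (iso.predict(X_test) == -1).astype(int)
print("minority precision:", precision_score(y_test, y_pred))
print("minority recall:", recall_score(y_test, y_pred))
```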
