Saturday 4:00 PM–4:45 PM in Theater

Resampling techniques and other strategies for handling highly unbalanced datasets in classification

Ajinkya More

Audience level:
Intermediate

Description

Many real-world machine learning problems involve an imbalanced class distribution, i.e., one class has significantly higher or lower representation than the others. Often, performance on the minority class matters most, as in fraud detection, product classification, and medical diagnosis. In this talk I will discuss several techniques for handling class imbalance in classification.

Abstract

I will discuss techniques to combat class imbalance in classification problems. These mainly fall into the following categories:

  • Asymmetric loss functions: Penalize misclassifications of the class of interest (often the minority class) more heavily
  • Resampling techniques:
    • Undersampling the majority class
    • Oversampling the minority class
    • SMOTE (Synthetic Minority Over-sampling Technique)
    • Tomek link removal
    • Condensed nearest neighbors
    • Edited nearest neighbors
  • Ensemble techniques
    • EasyEnsemble
    • BalanceCascade
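
To make two of the simpler ideas above concrete, here is a minimal sketch in plain Python of random under- and oversampling, plus per-class weights of the kind an asymmetric loss might use. The function names (`balanced_class_weights`, `random_resample`), the toy data, and the inverse-frequency weighting rule are assumptions chosen for illustration, not necessarily what the talk presents. SMOTE, Tomek links, and the nearest-neighbor and ensemble methods are not covered by this sketch.

```python
import random
from collections import Counter, defaultdict


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this is one common heuristic for weighting a loss so that
    errors on rare classes cost more; other weightings are equally valid.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_resample(rows, labels, mode="under", seed=0):
    """Balance classes by random under- or oversampling (hypothetical helper).

    mode="under": shrink every class to the smallest class size,
                  sampling without replacement.
    mode="over":  grow every class to the largest class size,
                  adding extra draws with replacement.
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    sizes = [len(members) for members in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target:
            # Enough examples: keep a random subset of size `target`.
            chosen = rng.sample(members, target)
        else:
            # Too few examples: keep all and add random duplicates.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (made up): 95 majority examples (label 0), 5 minority (label 1).
rows = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

weights = balanced_class_weights(labels)
under_rows, under_labels = random_resample(rows, labels, mode="under")
over_rows, over_labels = random_resample(rows, labels, mode="over")
```

Undersampling discards majority examples and oversampling only duplicates minority ones, which is part of the motivation for the synthetic and neighbor-based methods in the list. In practice, resampling is normally applied to the training split only, so that evaluation still reflects the original class distribution.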