Skewed datasets are common, and they are tough to handle. Standard classification models and techniques often fail when presented with such a problem. We start from the basics of what class imbalance means and move on to how we can overcome it, using various algorithms and some subtle techniques. We also discuss how to evaluate our efforts and cover some small but crucial details.
The talk has the following sections:
What is Class Imbalance? Here we give examples to define what a class-imbalanced dataset is and why it should be handled differently.
Ways to overcome it
We go into detail about three ways to tackle the class imbalance problem:
a. Sampling
b. Setting hyperparameters to assign weights
c. Libraries like imblearn
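To make the first two approaches concrete, here is a minimal sketch using only the Python standard library. The helper names `balanced_class_weights` and `random_oversample` and the toy data are hypothetical, chosen for illustration; in practice one would likely reach for the equivalents in scikit-learn (for example `class_weight="balanced"`) or imblearn rather than hand-rolled code.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get large weights, frequent classes get small ones.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - cnt)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (an assumption for illustration): 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
resampled_rows, resampled_labels = random_oversample(rows, labels)
```

Oversampling should be applied to the training portion only; duplicating rows before splitting would leak copies of the same sample into the validation set.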
Evaluation Methods
We discuss the evaluation methods that best help us judge how our model performs on an imbalanced dataset.
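As a small illustration of why the choice of metric matters, the sketch below scores a trivial "always predict the majority class" model on a toy 95:5 dataset. The function names and data are hypothetical; the metrics are the standard definitions of accuracy, precision, recall and F1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1; undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (an assumption for illustration): 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a model that never predicts the minority class

report = summarize(y_true, always_negative)
```

Here the model scores high on accuracy while its recall on the minority class is zero, which is the kind of gap the evaluation section is meant to expose.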
Custom loss
We discuss a custom loss function that can considerably improve our deep learning model and also explain why it does so.
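This outline does not name the loss function. Purely as an illustration, and as an assumption on our part, the sketch below implements binary focal loss, one commonly used loss for imbalanced classification. It scales the usual cross-entropy term by (1 - p_t)^gamma so that easy, confidently classified examples contribute less. The function name and default values are illustrative choices.

```python
import math


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    y_true: iterable of 0/1 labels.
    p_pred: iterable of predicted probabilities for class 1.
    gamma:  focusing parameter; gamma=0 reduces to weighted cross-entropy.
    alpha:  weight for the positive class (negatives get 1 - alpha).
    """
    total = 0.0
    n = 0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
        n += 1
    return total / n


# A confidently correct prediction is penalized far less than a wrong one.
easy = binary_focal_loss([1], [0.9])
hard = binary_focal_loss([1], [0.1])
```

The same idea carries over to deep learning frameworks by replacing the Python loop with tensor operations.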
Misc
We go over some miscellaneous tricks and steps we can take to avoid common pitfalls:
a. Train-validation splits
b. Removing classes
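One pitfall with train-validation splits on skewed data is that a purely random split can leave the validation set with few or no minority samples. A minimal sketch of a stratified split follows, assuming that is the kind of trick intended here; the function `stratified_split` and the toy labels are hypothetical. scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same
    proportion in the training and validation sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Put at least one sample of every class into validation.
        n_val = max(1, round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx


# Toy data (an assumption for illustration): 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
train_idx, val_idx = stratified_split(labels)
```

With these toy labels the validation set receives 19 negatives and 1 positive, preserving the 19:1 ratio of the full dataset.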
One usually wants a machine learning model to do well on two fronts. The first is the quality of its predictions, as measured by a quantitative metric such as accuracy, precision or recall. The second is the fairness or logical soundness of those predictions. Sometimes these do not go hand in hand. A major reason for a model being less accurate, or less fair or logically sound, is its inability to deal with imbalanced data. This is where bias against the minority creeps in, and our model is no longer a valid reflection of the phenomenon it's trying to predict. Hence, it is vital that people have a good understanding of how to mitigate this imbalance in the model.
From ML practitioners who build models to key business decision makers, this is an issue that everyone needs to be aware of. The way one builds a model and the way one interprets its predictions are closely tied to how the model handles skewed datasets. The key takeaways for practitioners are techniques to do better with skewed datasets, whereas for decision makers they are insights into how to question the model and the data it has been trained on.
My name is Aditya Lahiri and I am currently a Machine Learning intern at American Express, Big Data Labs. I am a Computer Science undergraduate from BITS Pilani, Goa and will graduate in December 2019. I love solving problems through data and code. Besides that, I enjoy attending meetups and talks, and I try my best to contribute to them. I have previously given talks at events in my college, such as Google Developers Group, Goa.
Here are the slides for this proposal: https://docs.google.com/presentation/d/1_hiJQsbXHhrzlXxCtPUSpt9-FvMWNlNw1m6cBVPyGCE/edit?usp=sharing