In this talk I'll introduce the problem of selection bias in the context of machine learning, illustrate its consequences and present some solutions.
When creating a recommendation system, is it OK to use data only from users who rate their movies? When predicting a survey outcome, is it OK to use data only from survey responders? Can non-responders give us any valuable data? Can we estimate the time to read an article using only data from users who actually read it? Can we estimate the reward of an action based only on the actions that were actually executed? These are all simple examples of selection bias, a problem that appears time and again across many domains, hindering our ability to use data successfully. In this talk I'll introduce the problem in the context of machine learning, illustrate its consequences and present some solutions.