Friday 3:55 p.m.–5:25 p.m.

How “good” is your model, and how can you make it better?

Chih-Chun Chen, Dimitry Foures, Elena Chatzimichali, Giuseppe Vettigli, Raoul-Gabriel Urma

Audience level:
Intermediate

Description

What distinguishes “true artists” from “one-hit wonders” in machine learning is an understanding of how a model performs on different data. This hands-on tutorial will show you how to use scikit-learn’s model evaluation functions to compare models in terms of accuracy and generalisability, and to search for optimal parameter configurations.

Abstract

The objective of this tutorial is to give participants the skills required to validate, evaluate and fine-tune models using scikit-learn’s evaluation metrics and parameter search capabilities. It will cover both the theoretical rationale behind these methods and their implementation in code.
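
As a flavour of the kind of evaluation involved, here is a minimal sketch of k-fold cross-validation for two KNN classifiers. It is not taken from the tutorial materials: the iris dataset stands in for the preprocessed data provided in the session, the neighbour counts 1 and 15 are arbitrary choices, and the imports assume a recent scikit-learn release that provides the `sklearn.model_selection` module.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is only a stand-in for the tutorial's own preprocessed data.
X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in (1, 15):  # two arbitrary neighbour counts, chosen for illustration
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each fold is held out once and scored on accuracy.
    cv_scores[k] = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"k={k}: mean accuracy {cv_scores[k].mean():.3f} "
          f"(std {cv_scores[k].std():.3f})")
```

The mean across folds estimates how well each configuration generalises, while the spread across folds gives a rough sense of how sensitive the model is to the particular data it was trained on.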

The session will be structured as follows (rough timings in parentheses):

  1. Explanation of over-fitting and the bias-variance trade-off, followed by a brief conceptual overview of cross-validation, bootstrapping, and ensemble methods, in particular with respect to bias and variance. Pointers to the corresponding scikit-learn functions will also be given. (20 minutes)
  2. Implementation of cross-validation and grid search for parameter tuning, using KNN classification as an illustrative example. Participants will train two KNN classifiers with different numbers of neighbours on preprocessed data (provided). They will then be guided through cross-validation, plotting of results, and grid search to find the best neighbour and weight configuration(s). (30 minutes)
  3. Comparison of different classification models using cross-validation. Participants will implement a logistic regression, a linear or non-linear support vector machine (SVM), or a neural network model and apply the same cross-validation and grid search method as in the guided KNN example. Participants will then compare their plots, evaluate their results and discuss which model they might choose for different objectives, trading off generalisability, accuracy, speed and randomness. (70 minutes)

We assume participants will be familiar with numpy, matplotlib, and at least the intuition behind some of the main classification algorithms. Before the tutorial, participants with GitHub accounts should fork https://github.com/cambridgecoding/pydata-tutorial; others should download the files and IPython notebook so they can participate in the hands-on activities. Required libraries: numpy, scikit-learn, matplotlib, pandas, scipy, multilayer_perceptron (provided).
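
The grid search and model comparison steps outlined above could look roughly like the following sketch. It is an illustration under stated assumptions rather than the tutorial's notebook code: the iris dataset replaces the provided data, the parameter grid values and the 75/25 split are arbitrary, and a recent scikit-learn release with `sklearn.model_selection` is assumed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris stands in for the preprocessed data provided in the tutorial.
X, y = load_iris(return_X_y=True)

# Hold out a test set so the tuned model is scored on data the search never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: candidate neighbour counts and weighting schemes for KNN.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

# Every combination in the grid is scored with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best KNN parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))

# By default GridSearchCV refits the best configuration on the full training set.
test_accuracy = search.score(X_test, y_test)
print("Held-out test accuracy:", round(test_accuracy, 3))

# Comparison model: logistic regression under the same cross-validation scheme.
logreg_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=5, scoring="accuracy"
)
print("Logistic regression mean CV accuracy:", round(logreg_scores.mean(), 3))
```

Scoring the comparison model with the same folds and metric keeps the comparison like-for-like; the held-out test set is touched only once, after tuning, so its accuracy is not biased by the parameter search.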
