Friday 10:40 AM–11:20 AM in Room 3

Visual diagnostics for more informed machine learning

Rebecca Bilbro

Audience level:
Intermediate

Description

Visualization has a critical role to play throughout the analytic process. Where static outputs and tabular data can obscure patterns, human visual analysis can open up insights that lead to more robust data products. For Python programmers who dabble in machine learning, visual diagnostics are a must-have for effective feature analysis, model selection, and parameter tuning.

Abstract

Visual diagnostics are a powerful but frequently underestimated tool in data science. By tapping into one of our most essential resources — the human visual cortex — they can enable us to see patterns rendered opaque by numeric outputs and tabular data, and lead us toward more robust programs and better data products. For Python programmers who dabble in machine learning, visual diagnostics can mean the difference between a model that crashes and burns, and one that predicts the future.

Python and high-level libraries like Scikit-learn, NLTK, TensorFlow, PyBrain, Theano, and MLPY have made machine learning accessible to a broad programming community that might never have found it otherwise. With the democratization of these tools, there are now a great many machine learning practitioners who are primarily self-taught. At the same time, the stakes of machine learning have never been higher; predictive tools are driving decision-making in every sector, from business, art, and engineering to education, law, and defense. In an age where any Python programmer can harness the power of predictive analytics, how do we ensure our models are valid and robust? How can we identify problems such as local minima and overfitting? How can we build intuition around model selection? How can we isolate and combine the most informative features? Whether you have an academic background in predictive methods or not, visual diagnostics are the key to augmenting the algorithms already implemented in Python.

In this talk, I present a suite of visualization tools within and outside the standard Scikit-learn library that Python programmers can use to evaluate their machine learning models' performance, stability, and predictive value. I then identify some of the key gaps in the current visual diagnostics arsenal, and propose some novel possibilities for the future.

OUTLINE

  1. Introduction/Problem statement

    • Machine learning made accessible via the Scikit-learn API
    • But what kinds of things can go wrong?
    • Anscombe's quartet: An argument for using visual diagnostics
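
To ground the Anscombe's quartet point above, here is a minimal sketch, assuming the copy of the quartet bundled with seaborn (any hard-coded version of the four datasets would work equally well): the four datasets share nearly identical summary statistics, yet a simple scatter plot per dataset makes their differences obvious.

    # Anscombe's quartet: nearly identical summary statistics, very different shapes.
    import matplotlib.pyplot as plt
    import seaborn as sns

    # seaborn ships a tidy copy of the quartet with columns: dataset, x, y
    df = sns.load_dataset("anscombe")

    # The four datasets are almost indistinguishable numerically...
    print(df.groupby("dataset").agg(["mean", "std"]))

    # ...but plotting each one immediately reveals how different they are.
    fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
    for ax, (name, group) in zip(axes, df.groupby("dataset")):
        ax.scatter(group["x"], group["y"])
        ax.set_title(f"Dataset {name}")
    plt.show()
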
  2. Visual tools for feature analysis and selection

    • The model selection triple: What it is and how it can support the ML workflow
    • Effective feature analysis is key to informed machine learning
    • Visualizations can facilitate feature selection (box plots/violin plots, histograms, scatterplot matrices, RadViz, parallel coordinates, and more)
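
As a concrete illustration of the feature visualizations listed above, here is a minimal sketch using pandas' built-in radviz and parallel_coordinates plots on the iris data; the dataset and the pandas plotting functions are illustrative choices, standing in for the richer visualizers discussed in the talk.

    # Feature analysis with RadViz and parallel coordinates (pandas.plotting).
    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas.plotting import parallel_coordinates, radviz
    from sklearn.datasets import load_iris

    data = load_iris()
    df = pd.DataFrame(data.data, columns=data.feature_names)
    df["species"] = [data.target_names[i] for i in data.target]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    radviz(df, "species", ax=ax1)                # each feature pulls points toward its anchor
    parallel_coordinates(df, "species", ax=ax2)  # one polyline per sample across all features
    ax1.set_title("RadViz")
    ax2.set_title("Parallel coordinates")
    plt.show()
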
  3. Demystifying model selection

    • Tree diagrams and graph traversal for model selection: The Scikit-learn algorithm cheat sheet, Saed Sayad's data mining map
    • Cluster and classifier comparison plots
    • Model evaluation to support selection (confusion matrices, ROC curves, prediction error plots, residual plots)
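
A minimal sketch of the evaluation plots above, assuming a logistic regression classifier on the breast cancer dataset (both purely illustrative) and drawing the confusion matrix and ROC curve directly with matplotlib:

    # Model evaluation plots: confusion matrix and ROC curve for a binary classifier.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import auc, confusion_matrix, roc_curve
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

    # Confusion matrix: where does the classifier confuse the two classes?
    cm = confusion_matrix(y_test, y_pred)
    ax1.imshow(cm, cmap="Blues")
    for (i, j), count in np.ndenumerate(cm):
        ax1.text(j, i, str(count), ha="center", va="center")
    ax1.set(title="Confusion matrix", xlabel="Predicted label", ylabel="True label")

    # ROC curve: true vs. false positive rate as the decision threshold varies.
    fpr, tpr, _ = roc_curve(y_test, y_score)
    ax2.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
    ax2.plot([0, 1], [0, 1], linestyle="--", label="chance")
    ax2.set(title="ROC curve", xlabel="False positive rate", ylabel="True positive rate")
    ax2.legend()
    plt.show()
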
  4. Taking hyperparameter tuning out of the black box

    • Visualizing hyperparameters with validation curves (see the sketch below)
    • Developing intuition through visual grid search
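
To ground the validation-curve bullet above, here is a sketch that follows the standard Scikit-learn recipe; the digits dataset and the SVM's gamma parameter are illustrative assumptions, and the point is simply to plot training and cross-validation scores side by side as a single hyperparameter sweeps its range.

    # Validation curve: training vs. cross-validation score as one hyperparameter varies.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import validation_curve
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    param_range = np.logspace(-6, -1, 6)  # candidate values for the RBF kernel's gamma

    train_scores, test_scores = validation_curve(
        SVC(), X, y, param_name="gamma", param_range=param_range, cv=5
    )

    # A widening gap between the two curves is a visual signature of overfitting.
    plt.semilogx(param_range, train_scores.mean(axis=1), "o-", label="training score")
    plt.semilogx(param_range, test_scores.mean(axis=1), "o-", label="cross-validation score")
    plt.xlabel("gamma")
    plt.ylabel("accuracy")
    plt.title("Validation curve for an SVM")
    plt.legend()
    plt.show()
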
  5. Open source Python packages for more informed machine learning

    • Yellowbrick: Aggregating the tools available in Scikit-learn, Matplotlib, Seaborn, Pandas, and Bokeh (see the sketch below)
    • Trinket: Wrapping the entire machine learning workflow
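
Finally, a minimal sketch of the Yellowbrick visualizer pattern, assuming a recent Yellowbrick release (where the display call is show(); early releases used poof()) and using a ClassificationReport on the breast cancer data as a stand-in for the many visualizers the talk surveys. Visualizers wrap a Scikit-learn estimator and reuse the familiar fit/score API, so a diagnostic plot drops into an existing workflow in a few lines.

    # Yellowbrick visualizer sketch: per-class precision/recall/F1 as a heatmap.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from yellowbrick.classifier import ClassificationReport

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Wrap the estimator in a visualizer, then fit and score as usual.
    viz = ClassificationReport(LogisticRegression(max_iter=5000))
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)
    viz.show()  # renders the classification report as a color-coded heatmap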