Sunday 3:50 PM–4:35 PM in Track 1

Do You Want To Build A Forest?

Thomas J Fan

Audience level:
Intermediate

Description

scikit-learn provides two popular ways to build tree ensembles: Gradient Boosting Decision Trees (GBDT) and Random Forests. In version 0.21, scikit-learn introduced its own histogram-based GBDT, inspired by LightGBM. In this talk, we will learn about the underpinnings of GBDT and Random Forests, the differences between them, how scikit-learn optimizes each algorithm, and how to interpret the resulting models.
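
To make the contrast concrete, here is a minimal sketch comparing the two ensemble families on synthetic data. It assumes scikit-learn 1.0 or later, where HistGradientBoostingClassifier is no longer experimental (earlier releases required importing enable_hist_gradient_boosting from sklearn.experimental first):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for a real dataset
    X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Bagging: each tree is trained independently on a bootstrap sample
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    # Boosting: trees are trained sequentially, each correcting its predecessors;
    # the "Hist" variant first bins features into histograms, as in LightGBM
    boosting = HistGradientBoostingClassifier(random_state=0)
    boosting.fit(X_train, y_train)

    print(f"Random forest accuracy:          {forest.score(X_test, y_test):.3f}")
    print(f"Hist gradient boosting accuracy: {boosting.score(X_test, y_test):.3f}")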

Abstract

In this talk, we will explore two ways to build tree ensembles: Gradient Boosting Decision Trees and Random Forests. First, we will review the classical random forest and build our way up to scikit-learn's HistGradientBoostingRegressor and HistGradientBoostingClassifier. Next, we will examine the differences between these two models and see how scikit-learn optimizes each algorithm for performance. Lastly, we will explore model inspection techniques such as impurity-based feature importance, permutation importance, and partial dependence curves. In particular, we will look at how collinear features or features with high cardinality can lead to an incorrect characterization of a feature's importance. This talk is aimed at an audience familiar with machine learning concepts and scikit-learn's API.
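
As a rough illustration of the inspection techniques above, the following sketch fits a random forest on synthetic data and contrasts impurity-based importances with permutation importances, then plots partial dependence curves. It assumes scikit-learn 1.0 or later (for PartialDependenceDisplay.from_estimator) and matplotlib:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import PartialDependenceDisplay, permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1_000, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Impurity-based importances: computed from the training data and known
    # to overstate high-cardinality features
    print("Impurity-based:", model.feature_importances_.round(3))

    # Permutation importances: measured on held-out data, but can mislead with
    # collinear features (shuffling one leaves its correlated twin intact)
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=0
    )
    print("Permutation:   ", result.importances_mean.round(3))

    # Partial dependence: the marginal effect of a feature on the prediction
    PartialDependenceDisplay.from_estimator(model, X_test, features=[0, 1])
    plt.show()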
