Ensemble methods perform extremely well in terms of prediction, but they are not easy to interpret. Measuring feature importance is not only a matter of counting how many times a feature is used in the weak learners; it also requires measuring how much that feature contributes to the result. A detailed example and implementation are provided in a Python Jupyter notebook for "xgboost", an extreme gradient boosting library.
I - Feature importance in ensemble algorithms - state of the art
1) Feature importance in sklearn/xgboost : basically counts the occurrences of a feature across all the weak learners
2) Construction of the trees in xgboost : if the trees are deep enough, every feature is going to be used
3) Global feature importance is misleading : a given feature might be critical for one subpopulation but completely irrelevant for another (e.g. multi-class classification)
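To make the occurrence-counting idea concrete, here is a minimal sketch in plain Python. The nested-dict tree format, the helper names (leaf, split, count_splits, occurrence_importance) and the feature names are all hypothetical, chosen for illustration only; this is not xgboost's internal representation or API.

```python
from collections import Counter

# Hypothetical toy trees stored as nested dicts (NOT xgboost's own format).
# An internal node has "feature", "threshold", "left", "right"; a leaf has "value".
def leaf(value):
    return {"value": value}

def split(feature, threshold, left, right):
    return {"feature": feature, "threshold": threshold,
            "left": left, "right": right}

def count_splits(node, counts):
    """Add one to counts[feature] for every internal node under `node`."""
    if "feature" not in node:  # leaf: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

def occurrence_importance(trees):
    """Count-based importance: number of splits using each feature."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return dict(counts)

toy_trees = [
    split("age", 30, leaf(-0.4),
          split("income", 50, leaf(0.1), leaf(0.6))),
    # The "noise" split barely changes the output (0.0 vs 0.05),
    # yet it counts exactly as much as the "income" split above.
    split("age", 45,
          split("noise", 0.5, leaf(0.0), leaf(0.05)),
          leaf(0.3)),
]

importance = occurrence_importance(toy_trees)
print(importance)
```

In this toy ensemble the "noise" feature receives the same score as "income", even though its split hardly moves the prediction, which is the weakness of pure counting.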
II - Xgboost real feature importance
1) Prediction influence : first splits influence the prediction more than last splits, so the importance of a feature must be weighted by the discrimination it provides
2) Point-to-point feature importance : following the path of a given prediction, it is possible to weigh the importance of every used feature
3) A relevant assessment of feature importance : explanation of a given prediction, and aggregation on a set of data points
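One possible reading of the path-based idea can be sketched in plain Python. It assumes that every node of a tree stores the expected output of the training points reaching it, and it credits each split's feature with the change in that value along the path. The tree format, the values and the function names (path_contributions, aggregate) are hypothetical; this is not necessarily the exact weighting used in the notebook.

```python
from collections import defaultdict

# Hypothetical toy tree: every node stores "value", assumed to be the
# expected output of the training points that reach this node.
tree = {
    "value": 0.5, "feature": "age", "threshold": 30,
    "left": {"value": 0.2},
    "right": {
        "value": 0.7, "feature": "income", "threshold": 50,
        "left": {"value": 0.6},
        "right": {"value": 0.9},
    },
}

def path_contributions(node, x):
    """Follow sample x down the tree and credit each split's feature
    with the change in node value. Returns (bias, contributions)."""
    bias = node["value"]  # value at the root, before any split
    contributions = defaultdict(float)
    while "feature" in node:
        feature = node["feature"]
        go_left = x[feature] < node["threshold"]
        child = node["left"] if go_left else node["right"]
        contributions[feature] += child["value"] - node["value"]
        node = child
    return bias, dict(contributions)

def aggregate(tree, samples):
    """Mean absolute contribution per feature over a set of samples."""
    totals = defaultdict(float)
    for x in samples:
        _, contrib = path_contributions(tree, x)
        for feature, c in contrib.items():
            totals[feature] += abs(c)
    return {f: t / len(samples) for f, t in totals.items()}

sample = {"age": 40, "income": 80}
bias, contrib = path_contributions(tree, sample)
# bias plus the contributions recovers the leaf value reached by the sample
prediction = bias + sum(contrib.values())

samples = [sample, {"age": 20, "income": 80}]
mean_abs = aggregate(tree, samples)
```

For a boosted ensemble, the same per-tree contributions would be summed over all trees. Note that "income" contributes nothing for the second sample, whose path never reaches the income split.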
III - Implementation and examples
1) Point-to-point feature importance illustration and implementation explanation
2) Evolution of feature importance with respect to learning iterations
3) Noisy variables cancellation
IV - Limits and ways forward
1) A word on correlated variables
2) Is there a compromise between performance and interpretation ?