Treading on Python Series

Effective XGBoost
Tuning, Understanding, and Deploying Classification Models

Matt Harrison

Technical Editors: Edward Krueger, Alex Rook, Ronald Legere
hairysun.com
COPYRIGHT © 2023
While every precaution has been taken in the preparation of this book, the publisher and
author assume no responsibility for errors or omissions, or for damages resulting from the
use of the information contained herein.
Contents

1 Introduction
2 Datasets
    2.1 Cleanup
    2.2 Cleanup Pipeline
    2.3 Summary
    2.4 Exercises
3 Exploratory Data Analysis
    3.1 Correlations
    3.2 Bar Plot
    3.3 Summary
    3.4 Exercises
4 Tree Creation
    4.1 The Gini Coefficient
    4.2 Coefficients in Trees
    4.3 Another Visualization Tool
    4.4 Summary
    4.5 Exercises
5 Stumps on Real Data
    5.1 Scikit-learn stump on real data
    5.2 Decision Stump with XGBoost
    5.3 Values in the XGBoost Tree
    5.4 Summary
    5.5 Exercises
6 Model Complexity & Hyperparameters
    6.1 Underfit
    6.2 Growing a Tree
    6.3 Overfitting
    6.4 Overfitting with Decision Trees
    6.5 Summary
    6.6 Exercises
7 Tree Hyperparameters
    7.1 Decision Tree Hyperparameters
    7.2 Tracking changes with Validation Curves
    7.3 Leveraging Yellowbrick
    7.4 Grid Search
    7.5 Summary
    7.6 Exercises
8 Random Forest
    8.1 Ensembles with Bagging
    8.2 Scikit-learn Random Forest
    8.3 XGBoost Random Forest
    8.4 Random Forest Hyperparameters
    8.5 Training the Number of Trees in the Forest
    8.6 Summary
    8.7 Exercises
9 XGBoost
    9.1 Jargon
    9.2 Benefits of Boosting
    9.3 A Big Downside
    9.4 Creating an XGBoost Model
    9.5 A Boosted Model
    9.6 Understanding the Output of the Trees
    9.7 Summary
    9.8 Exercises
10 Early Stopping
    10.1 Early Stopping Rounds
    10.2 Plotting Tree Performance
    10.3 Different eval_metrics
    10.4 Summary
    10.5 Exercises
11 XGBoost Hyperparameters
    11.1 Hyperparameters
    11.2 Examining Hyperparameters
    11.3 Tuning Hyperparameters
    11.4 Intuitive Understanding of Learning Rate
    11.5 Grid Search
    11.6 Summary
    11.7 Exercises
12 Hyperopt
    12.1 Bayesian Optimization
    12.2 Exhaustive Tuning with Hyperopt
    12.3 Defining Parameter Distributions
    12.4 Exploring the Trials
    12.5 EDA with Plotly
    12.6 Conclusion
    12.7 Exercises
13 Step-wise Tuning with Hyperopt
    13.1 Groups of Hyperparameters
    13.2 Visualization Hyperparameter Scores
    13.3 Training an Optimized Model
    13.4 Summary
    13.5 Exercises
14 Do you have enough data?
    14.1 Learning Curves
    14.2 Learning Curves for Decision Trees
    14.3 Underfit Learning Curves
    14.4 Overfit Learning Curves
    14.5 Summary
    14.6 Exercises
15 Model Evaluation
    15.1 Accuracy
    15.2 Confusion Matrix
    15.3 Precision and Recall
    15.4 F1 Score
    15.5 ROC Curve
    15.6 Threshold Metrics
    15.7 Cumulative Gains Curve
    15.8 Lift Curves
    15.9 Summary
    15.10 Exercises
16 Training For Different Metrics
    16.1 Metric overview
    16.2 Training with Validation Curves
    16.3 Step-wise Recall Tuning
    16.4 Summary
    16.5 Exercises
17 Model Interpretation
    17.1 Logistic Regression Interpretation
    17.2 Decision Tree Interpretation
    17.3 XGBoost Feature Importance
    17.4 Surrogate Models
    17.5 Summary
    17.6 Exercises
18 xgbfir (Feature Interactions Reshaped)
    18.1 Feature Interactions
    18.2 xgbfir
    18.3 Deeper Interactions
    18.4 Specifying Feature Interactions
    18.5 Summary
    18.6 Exercises
19 Exploring SHAP
    19.1 SHAP
    19.2 Examining a Single Prediction
    19.3 Waterfall Plots
    19.4 A Force Plot
    19.5 Force Plot with Multiple Predictions
    19.6 Understanding Features with Dependence Plots
    19.7 Jittering a Dependence Plot
    19.8 Heatmaps and Correlations
    19.9 Beeswarm Plots of Global Behavior
    19.10 SHAP with No Interaction
    19.11 Summary
    19.12 Exercises
20 Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration
    20.1 ICE Plots
    20.2 ICE Plots with SHAP
    20.3 Partial Dependence Plots
    20.4 PDP with SHAP
    20.5 Monotonic Constraints
    20.6 Calibrating a Model
    20.7 Calibration Curves
    20.8 Summary
    20.9 Exercises
21 Serving Models with MLFlow
    21.1 Installation and Setup
    21.2 Inspecting Model Artifacts
    21.3 Running A Model From Code
    21.4 Serving Predictions
    21.5 Querying from the Command Line
    21.6 Querying with the Requests Library
    21.7 Building with Docker
    21.8 Conclusion
    21.9 Exercises
22 Conclusion
    22.1 One more thing
Index
Foreword
“XGBoost is all you need.” It started as a half-joke, a silly little phrase that I myself had not been
too confident about, but it grew into a signature line of sorts. I don’t remember when I first
said it, and I can only surmise what motivated me to say it. Part of it was my frustration with
the increasingly complex deep learning that was being pushed as the only valid approach to
tackling all Machine Learning problems. The other part of it came from a sense of maturity
that was hard won over the years, acquired from many different wild experiments with
various ML tools and techniques, particularly those that deal with tabular data problems. The
more problems you tackle, the more you appreciate such things as parsimony, robustness, and
elegance in implementing your machine learning models and pipelines.
My first exposure to XGBoost came on Kaggle. Over the past decade, the progress of
Machine Learning for tabular data has been intertwined with Machine Learning platforms,
and Kaggle in particular. XGBoost was first announced on Kaggle in 2015, not long before I
joined that platform. XGBoost was initially a research project by Tianqi Chen, but due to its
excellent predictive performance on a variety of machine learning tasks it quickly became the
algorithm of choice for many Data Science practitioners.
XGBoost is both a library and a particular gradient boosted trees (GBT) algorithm.
(Although the XGBoost library also supports other - linear - base learners.) GBTs are a class
of algorithms that utilize the so-called ensembling - building a very strong ML algorithm
by combining many weaker algorithms. GBTs use decision trees as “base learners”, utilize
“boosting” as an ensembling method, and optimize the ensembling by utilizing gradient
descent, something that they have in common with other machine learning methods, such
as neural networks.
XGBoost the library is in its own right one of the main reasons why I am such a big XGBoost
promoter. As one of the oldest and most widely used GBT libraries, it has matured and
become very robust and stable. It can be easily installed and used on almost any computing
environment. I have installed it and used it on everything from Raspberry Pi Zero to DGX
Station A100. It has had a stable version for various Arm-based platforms for a long while,
including the new Macs with Apple chips. It can even run natively in the browser - it has
been included in Wasm and PyScript for a while. XGBoost is the only GBT library with a
comprehensive Nvidia CUDA GPU support. It works on a single machine, or on a large
cluster. And since version 1.7, it also supports federated learning. It includes C, Java, Python,
and R front ends, as well as many other ones. If you are a working Data Scientist, and need an
efficient way to train and deploy a Machine Learning model for a wide variety of problems,
chances are that XGBoost is indeed all you need.
To be clear, in terms of predictive performance, I have always considered XGBoost to be
on par with other well known GBT libraries - LightGBM, CatBoost, HistGradientBoosting,
etc. Each one of them has its strengths and weaknesses, but in the first approximation all of
them are fairly interchangeable with each other for the majority of problems that a working
Data Scientist comes across. In my experience there is no a priori way of telling which one of
them will perform the best on any given dataset, but in terms of practical considerations the
differences are usually negligible. Mastering any one of them will go a long way towards a
user being comfortable with all of them.
Even though GBTs are very widely used and the gold standard for tackling tabular data
Machine Learning problems, there is still a serious paucity of learning material for them. Most
of us have learned to use them by following widely scattered blog posts, Kaggle competitions,
GitHub repos, and many other dispersed pieces of information. There are a couple good
introductory books out there as well, but for the most part one has to search wide to find a very
comprehensive instruction manual that can get you up and running with XGBoost. This book
aims to put a lot of that scattered material into one coherent whole, with a straightforward
and logical content progression.
I’ve now known Matt for years, and I greatly admire his pedagogical perspective. He is
a no-nonsense educator and has a very down-to-earth approach to all of his teaching. I’ve
greatly benefited from several of his books, especially the ML handbook and the Effective
Pandas book. In all of his books he aims to get the reader to the practical parts as soon as
possible, and this book is no exception - you start dealing with code from the very first page.
Like with all of his other books, this one was also written with a working practitioner in
mind. The book skips most of the convoluted theoretical background, and delves directly into
practical examples. You should be comfortable coding and ideally be an advanced beginner
to an intermediate Python coder in order to get the most out of this book.
Machine Learning for tabular data is still a very hands-on artisanal process. A big part
of what makes a great tabular data ML model has to do with proper data preparation and
feature engineering. This is where Matt’s background with Pandas really comes in handy; many Pandas examples throughout the book are exceptionally valuable in their own right.
Chapters end with a great selection of useful exercises. All of the code in the book is also
available from the accompanying repo, and most of the datasets can be found on Kaggle.
Like all other great tools, it takes many years of experience and dedicated work to fully
master XGBoost. Every path to full mastery needs to start somewhere, and I can’t think of a
better starting point than this book. If you get through all of it you will be well prepared for
the rest of your Machine Learning for tabular data journey.
Happy Boosting!
Bojan Tunguz
Chapter 1
Introduction
In this book, we will build our intuition for classification with supervised learning models.
Initially, we will look at decision trees, a fundamental component of the XGBoost model,
and consider the tradeoffs we make when creating a predictive model. If you have that
background, feel free to skip these chapters and jump straight to the XGBoost chapters.
While providing best practices for using the XGBoost library, we will also show many
related libraries and how to use them to improve your model. Finally, we will demonstrate
how to deploy your model.
I strongly encourage you to practice using this library on a dataset of your choice.
Each chapter has suggested exercises for you to consider what you have learned and put
it into practice. Practicing writing code is crucial for learning because it helps to solidify your
understanding of the library and its concepts. Additionally, regular coding
practice can build familiarity with coding syntax and increase coding efficiency over time.
Using your fingers to practice will help you much more than just reading.
If you are interested in courses to practice and learn more Python and data materials, check
out https://store.metasnake.com. Use the code XGBOOK for a discount.
Chapter 2
Datasets
We will be using machine learning classifiers to predict labels for data. I will demonstrate the
features of XGBoost using survey data from Kaggle. Kaggle is a company that hosts machine
learning competitions. They have conducted surveys with users about their backgrounds,
experience, and tooling. Our goal will be to determine whether a respondent’s job title is
“Data Scientist” or “Software Engineer” based on how they responded to the survey.
This data comes from a Kaggle survey conducted in 2018 (https://www.kaggle.com/datasets/kaggle/kaggle-survey-2018). As part of the survey, Kaggle
asked users about a wide range of topics, from what kind of machine learning software they
use to how much money they make. We’re going to be using the responses from the survey
to predict what kind of job the respondent has.
2.1 Cleanup
I’ve hosted the data on my GitHub page. The 2018 survey data is in a ZIP file. Let’s use some
Python code to extract the contents of this data into a Pandas DataFrame. If you need help
setting up your environment, see the appendix.
The ZIP file contains multiple files. We are concerned with the multipleChoiceResponses.csv
file.
import pandas as pd
import urllib.request
import zipfile

url = 'https://github.com/mattharrison/datasets/raw/master/data/'\
    'kaggle-survey-2018.zip'
fname = 'kaggle-survey-2018.zip'
member_name = 'multipleChoiceResponses.csv'

def extract_zip(src, dst, member_name):
    """Extract a member file from a zip file and read it into a pandas
    DataFrame.

    Parameters:
        src (str): URL of the zip file to be downloaded and extracted.
        dst (str): Local file path where the zip file will be written.
        member_name (str): Name of the member file inside the zip file
            to be read into a DataFrame.

    Returns:
        pandas.DataFrame: DataFrame containing the contents of the
            member file.
    """
    url = src
    fname = dst
    fin = urllib.request.urlopen(url)
    data = fin.read()
    with open(dst, mode='wb') as fout:
        fout.write(data)
    with zipfile.ZipFile(dst) as z:
        kag = pd.read_csv(z.open(member_name))
        kag_questions = kag.iloc[0]
        raw = kag.iloc[1:]
        return raw

raw = extract_zip(url, fname, member_name)
I created the function extract_zip so I can reuse this functionality. I will also add this
function to a library that I will develop as we progress, xg_helpers.py.
After running the above code, we have a Pandas dataframe assigned to the variable raw.
This is the raw data of survey responses. We show how to explore it and clean it up for
machine learning.
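Before cleaning it up, a quick peek at the raw DataFrame can confirm what we downloaded. A minimal sketch (the exact output depends on the survey file):

print(raw.shape)         # on the order of 23,000+ rows and almost 400 columns
print(raw.columns[:5])   # column labels are mostly survey question codes like Q1, Q2, ...
print(raw.head(2))       # peek at the first couple of responses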
2.2 Cleanup Pipeline
The raw data has over 23,000 rows and almost 400 columns. Most data is not natively found
in a form where you can do machine learning on it. Generally, you will need to perform some
preprocessing on it.
Some of the columns don’t lend easily to analysis because they aren’t in numeric form.
Additionally, there may be missing data that needs to be dealt with. I will show how to
preprocess the data using the Pandas library and the Scikit-learn library. I’ll describe the
steps, but I will not dive deeply into the features of the Pandas library. I suggest you check
out Effective Pandas for a comprehensive overview of that library.
Our task will be transforming survey responses into numeric values, encoding categorical
values into numbers, and filling in missing values. We will use a pipeline for that. Scikit-learn
pipelines are a convenient tool for tying together the steps of constructing and evaluating
machine learning models. A pipeline chains a series of steps, including feature extraction,
dimensionality reduction, and model fitting, into a single object. Pipelines simplify the endto-end process of machine learning and make you more efficient. They also make it easy to
reuse and share your work and eliminate the risk of errors by guaranteeing that each step is
executed in a precise sequence.
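As a minimal illustration of the idea (using toy data and standard Scikit-learn transformers, not the custom steps we build below), a pipeline bundles steps so a single call runs them in order:

from sklearn import impute, pipeline, preprocessing

demo_pl = pipeline.Pipeline([
    ('impute', impute.SimpleImputer(strategy='median')),   # fill missing values
    ('scale', preprocessing.StandardScaler()),             # then standardize
])
demo_X = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [10.0, 20.0, None]})
print(demo_pl.fit_transform(demo_X))   # both steps applied in sequence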
I wrote a function, tweak_kag, to perform the survey response cleanup. I generally create a
cleanup function every time I get new tabular data to work with.
The tweak_kag function is the meat of the Pandas preprocessing logic. It uses a chain to
manipulate the data one action (I typically write each step on its own line) at a time. You can
read it as a recipe of actions. The first is to use .assign to create or update columns. Here are
the column updates:
• age - Pull off the first two characters of the Q2 column and convert them to an integer.
• education - Replace the education strings with numeric values.
• major - Take the top three majors, change the others to 'other', and then rename those
top three majors to shortened versions.
• years_exp - Convert the Q8 column to years of experience by replacing '+' with an empty
string, splitting on '-' and taking the left-hand side (the first value of the range), and
converting that value to a floating point number.
• compensation - Replace values in the Q9 column by removing commas, shortening
500,000 to 500, splitting on '-' (the first value of the range) and taking the left side, filling
in missing values with zero, and converting that value to an integer and multiplying it
by 1,000.
• python - Fill in missing values of Q16_Part_1 with zero and convert the result to an
integer.
• r - Fill in missing values of Q16_Part_2 with zero and convert the result to an integer.
• sql - Fill in missing values of Q16_Part_3 with zero and convert the result to an integer.
After column manipulation, tweak_kag renames the columns by replacing spaces with an
underscore.
Finally, it pulls out only the Q1, Q3, age, education, major, years_exp, compensation, python, r,
and sql columns.
def tweak_kag(df_: pd.DataFrame) -> pd.DataFrame:
    """
    Tweak the Kaggle survey data and return a new DataFrame.

    This function takes a Pandas DataFrame containing Kaggle
    survey data as input and returns a new DataFrame. The
    modifications include extracting and transforming certain
    columns, renaming columns, and selecting a subset of columns.

    Parameters
    ----------
    df_ : pd.DataFrame
        The input DataFrame containing Kaggle survey data.

    Returns
    -------
    pd.DataFrame
        The new DataFrame with the modified and selected columns.
    """
    return (df_
        .assign(age=df_.Q2.str.slice(0,2).astype(int),
            education=df_.Q4.replace({'Master’s degree': 18,
                'Bachelor’s degree': 16,
                'Doctoral degree': 20,
                'Some college/university study without earning a bachelor’s degree': 13,
                'Professional degree': 19,
                'I prefer not to answer': None,
                'No formal education past high school': 12}),
            major=(df_.Q5
                .pipe(topn, n=3)
                .replace({
                    'Computer science (software engineering, etc.)': 'cs',
                    'Engineering (non-computer focused)': 'eng',
                    'Mathematics or statistics': 'stat'})
            ),
            years_exp=(df_.Q8.str.replace('+','', regex=False)
                .str.split('-', expand=True)
                .iloc[:,0]
                .astype(float)),
            compensation=(df_.Q9.str.replace('+','', regex=False)
                .str.replace(',','', regex=False)
                .str.replace('500000', '500', regex=False)
                .str.replace('I do not wish to disclose my approximate yearly compensation',
                             '0', regex=False)
                .str.split('-', expand=True)
                .iloc[:,0]
                .fillna(0)
                .astype(int)
                .mul(1_000)
            ),
            python=df_.Q16_Part_1.fillna(0).replace('Python', 1),
            r=df_.Q16_Part_2.fillna(0).replace('R', 1),
            sql=df_.Q16_Part_3.fillna(0).replace('SQL', 1)
        )#assign
        .rename(columns=lambda col:col.replace(' ', '_'))
        .loc[:, 'Q1,Q3,age,education,major,years_exp,compensation,'
                'python,r,sql'.split(',')]
    )
def topn(ser, n=5, default='other'):
    """
    Replace all values in a Pandas Series that are not among
    the top `n` most frequent values with a default value.

    This function takes a Pandas Series and returns a new
    Series with the values replaced as described above. The
    top `n` most frequent values are determined using the
    `value_counts` method of the input Series.

    Parameters
    ----------
    ser : pd.Series
        The input Series.
    n : int, optional
        The number of most frequent values to keep. The
        default value is 5.
    default : str, optional
        The default value to use for values that are not among
        the top `n` most frequent values. The default value is
        'other'.

    Returns
    -------
    pd.Series
        The modified Series with the values replaced.
    """
    counts = ser.value_counts()
    return ser.where(ser.isin(counts.index[:n]), default)
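To see what topn does on its own, here is a tiny, made-up Series (not from the survey data):

>>> s = pd.Series(['cs', 'cs', 'cs', 'eng', 'eng', 'stat', 'bio'])
>>> topn(s, n=2)
0       cs
1       cs
2       cs
3      eng
4      eng
5    other
6    other
dtype: object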
I created the TweakKagTransformer class to wrap tweak_kag so I could embed the cleanup logic
into the pipeline functionality of Scikit-Learn.
It is not a requirement to use pipelines to use XGBoost. To create a model with XGBoost
suitable for classification, you need a matrix with rows of training data with columns
representing the features of the data. Generally, this is called X in most documentation and
examples. You will also need a column with labels for each row, commonly named y.
The raw data might look like this:

         Q2     Q3                 Q4  \
587   25-29  India    Master’s degree
3065  22-24  India  Bachelor’s degree
8435  22-24  India    Master’s degree

                                                     Q5
587   Information technology, networking, or system ...
3065      Computer science (software engineering, etc.)
8435                                              Other

We need to get X looking like this:

      age  education  Q3_India  major_cs
587    25       18.0         1         0
3065   22       16.0         1         1
8435   22       18.0         1         0
Note
The capitalization on X is meant to indicate that the data is a matrix and two-dimensional,
while the lowercase y is a one-dimensional vector. If you are familiar with linear algebra,
you might recognize these conventions, capital letters for matrices and lowercase for
vectors.
The transformer class subclasses the BaseEstimator and TransformerMixin classes. These
classes require that we implement the .fit and .transform methods, respectively. The .fit
method returns the class instance. The .transform method leverages the logic in the tweak_kag
function.
from feature_engine import encoding, imputation
from sklearn import base, pipeline

class TweakKagTransformer(base.BaseEstimator,
                          base.TransformerMixin):
    """
    A transformer for tweaking Kaggle survey data.

    This transformer takes a Pandas DataFrame containing
    Kaggle survey data as input and returns a new version of
    the DataFrame. The modifications include extracting and
    transforming certain columns, renaming columns, and
    selecting a subset of columns.

    Parameters
    ----------
    ycol : str, optional
        The name of the column to be used as the target variable.
        If not specified, the target variable will not be set.

    Attributes
    ----------
    ycol : str
        The name of the column to be used as the target variable.
    """
    def __init__(self, ycol=None):
        self.ycol = ycol

    def transform(self, X):
        return tweak_kag(X)

    def fit(self, X, y=None):
        return self
The get_rawX_y function will take the original data and return an X DataFrame and a y Series
ready to feed into our pipeline for further cleanup. It uses the .query method of Pandas to limit
the rows to only those located in the US, China, or India, and respondents that had the job
title of Data Scientist or Software Engineer.
Below that, a pipeline is stored in the variable kag_pl. The pipeline will process the data
by calling the TweakKagTransformer. Then it will perform one hot encoding on the Q1, Q3, and
major columns using the Feature Engine library. Finally, it will use the imputation library to fill
in missing numeric values in the education and years_exp columns. It does that by filling in the
missing values with the median values.
def get_rawX_y(df, y_col):
    raw = (df
        .query('Q3.isin(["United States of America", "China", "India"]) '
               'and Q6.isin(["Data Scientist", "Software Engineer"])')
    )
    return raw.drop(columns=[y_col]), raw[y_col]

## Create a pipeline
kag_pl = pipeline.Pipeline(
    [('tweak', TweakKagTransformer()),
     ('cat', encoding.OneHotEncoder(top_categories=5, drop_last=True,
        variables=['Q1', 'Q3', 'major'])),
     ('num_impute', imputation.MeanMedianImputer(imputation_method='median',
        variables=['education', 'years_exp']))]
)
Let’s run the code.
We will create a training set and a validation set (also called a test set or holdout set) using
the scikit-learn model_selection.train_test_split function. This function will withhold 30% of
the data into the test dataset (see test_size=.3). This will let us train the model on 70% of data,
then use the other 30% to simulate unseen data and let us experiment with how the model
might perform on data it hasn’t seen before.
Note
The stratify parameter of the train_test_split function is important because it ensures
that the proportion of different classes found in the labels of the dataset is maintained in
both the training and test sets. This is particularly useful when working with imbalanced
datasets, where one class may be under-represented. Without stratification, the training
and test sets may not accurately reflect the distribution of classes in the original dataset,
which can lead to biased or unreliable model performance.
For example, if we have a binary classification problem, where the dataset has 80%
class A and 20% class B, if we stratify, we will split the data into training and test sets.
We will have 80% class A and 20% class B in both sets, which helps to ensure that the
classifier will generalize well to unseen data. If we don’t stratify, it is possible to get all
of class B in the test set, hampering the ability of the model to learn class B.
To use stratification, you pass in the labels to the stratify parameter.
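A small illustration of that 80/20 example, with made-up labels rather than the survey data (the key point is that the stratified test split keeps the same class balance):

from sklearn import model_selection

y_demo = pd.Series(['A'] * 80 + ['B'] * 20)
X_demo = pd.DataFrame({'feature': range(100)})
_, _, _, y_test_demo = model_selection.train_test_split(
    X_demo, y_demo, test_size=.3, random_state=42, stratify=y_demo)
print(y_test_demo.value_counts(normalize=True))   # about 0.8 A / 0.2 B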
Once we have split the data, we can feed it into our pipeline. There are two main methods
that we want to be aware of in our pipeline, .fit and .transform.
The .fit method is used to train a model on the training data. It takes in the training data
as input and uses that data to learn the parameters of the model. For example, in the case of the
pipeline that includes our tweaking function, one-hot encoding, and imputation step, the .fit
method runs each of those steps to learn how to perform the transformation in a consistent
manner on new data. We generally use the .fit method for the side effect of learning; it does
not return transformed data.
The .transform method is used to transform the input data according to the steps of the
pipeline. The method takes the input data and applies the transformations specified in the
pipeline to it (it does not learn any parameters). This method is typically used on test data to
apply the same preprocessing steps that were learned on the training data. In our case, we
also want to run it on the training data so that it is prepped for use in models. It should return
data ready for use with a machine learning model.
The .fit_transform method is a combination of the .fit and .transform methods. It first fits
the model on the training data and then applies the learned transformations to the input data,
returning the transformed data. This method is useful when we want to learn the parameters
of the pipeline on the training data and then apply these learned parameters to the training
data all at once. We only want to run this on the training data. Then after we have fit the
pipeline on the training data, we will call .transform on the testing data.
The output of calling .fit_transform and .transform on the pipeline is a Pandas DataFrame
suitable for training an XGBoost model.
Figure 2.1: Process for splitting data and transforming data with a pipeline.

>>> from sklearn import model_selection
>>> kag_X, kag_y = get_rawX_y(raw, 'Q6')
>>> kag_X_train, kag_X_test, kag_y_train, kag_y_test = \
...     model_selection.train_test_split(
...         kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y)
>>> X_train = kag_pl.fit_transform(kag_X_train, kag_y_train)
>>> X_test = kag_pl.transform(kag_X_test)
>>> print(X_train)
       age  education  years_exp  ...  major_other  major_eng  major_stat
587     25       18.0        4.0  ...            1          0           0
3065    22       16.0        1.0  ...            0          0           0
8435    22       18.0        1.0  ...            1          0           0
3110    40       20.0        3.0  ...            1          0           0
16372   45       12.0        5.0  ...            0          1           0
...    ...        ...        ...  ...          ...        ...         ...
16608   25       16.0        2.0  ...            0          0           0
7325    18       16.0        1.0  ...            0          0           0
21810   18       16.0        2.0  ...            0          0           0
4917    25       18.0        1.0  ...            0          0           1
639     25       18.0        1.0  ...            0          0           0

[2110 rows x 18 columns]
At this point, we should be ready to create a model so we can make predictions.
Remember, our goal is to predict whether an individual is a Software Engineer or Data Scientist.
Here are the labels for the training data:
>>> kag_y_train
587      Software Engineer
3065        Data Scientist
8435        Data Scientist
3110        Data Scientist
16372    Software Engineer
                ...
16608    Software Engineer
7325     Software Engineer
21810       Data Scientist
4917        Data Scientist
639         Data Scientist
Name: Q6, Length: 2110, dtype: object
Our data is in pretty good shape for machine learning. Before we jump into that we want
to explore the data a little more. We will do that in the next chapter. Data exploration is a
worthwhile effort to create better models.
2.3 Summary
In this chapter, we loaded the data. Then we adapted a function that cleans up the data with
Pandas into a Scikit-learn pipeline. We used that to prepare the data for machine learning and
then split the data into training and testing data.
2.4 Exercises
1. Import the pandas library and use the read_csv function to load in a CSV file of your
choice.
2. Use the .describe method to generate summary statistics for all numeric columns in the
DataFrame.
3. Explain how you can convert categorical string data into numbers.
Chapter 3
Exploratory Data Analysis
In this chapter, we will explore the Kaggle data before creating models from it. This is called
Exploratory Data Analysis or EDA. My goal with EDA is to understand the data I’m dealing
with by using summary statistics and visualizations. This process provides insights into what
data we have and what the data can tell us.
The exploration step is the most crucial step in data modeling. If you understand the data,
you can better model it. You will also understand the relationships between the columns and
the target you are trying to predict. This will also enable feature engineering, creating new
features to drive the model to perform better.
I will show some of my favorite techniques for exploring the data. Generally, I use Pandas
to facilitate this exploration. For more details on leveraging Pandas, check out the book
Effective Pandas.
3.1 Correlations
The correlation coefficient is a useful summary statistic for understanding the relationship
between features. There are various ways to calculate this measure. Generally, the Pearson
Correlation Coefficient metric is used. The calculation for this metric assumes a linear
relationship between variables. I often use the Spearman Correlation Coefficient which correlates
the ranks or monotonicity (and doesn’t make the assumption of linearity). The values for both
coefficients range from -1 to 1. Higher values indicate that as one value increases, the other
value increases. A correlation of 0 means that as one variable increases, the other doesn’t
respond. A negative correlation indicates that one variable goes down as the other goes up.
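To see why I often reach for Spearman, consider a small made-up example: a relationship that is perfectly monotonic but not linear gets a Spearman correlation of exactly 1.0, while Pearson reports something lower:

demo = pd.DataFrame({'x': range(1, 11)})
demo['y'] = demo.x ** 3   # monotonic, but far from linear
print(demo.corr(method='pearson').loc['x', 'y'])    # less than 1
print(demo.corr(method='spearman').loc['x', 'y'])   # exactly 1.0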
I like to follow up correlation exploration with a scatter plot to visualize the relationship.
Figure 3.1: Flowchart for my process for basic EDA. I include numerical summaries and visual options.

Here is the Pandas code to create a correlation dataframe. I’ll step through what each line does:

• Start with the X_train data
• Create a data_scientist column (so we can see correlations to it)
• Create a correlation dataframe using the Spearman metric
• Start styling the result (to indicate the best correlations visually)
• Add a background gradient using a diverging colormap (red to white to blue, RdBu), pinning the minimum value of -1 to red and maximum value of 1 to blue
• Make the index sticky, so it stays on the screen during horizontal scrolling

(X_train
 .assign(data_scientist = kag_y_train == 'Data Scientist')
 .corr(method='spearman')
 .style
 .background_gradient(cmap='RdBu', vmax=1, vmin=-1)
 .set_sticky(axis='index')
)

Figure 3.2: Spearman correlation coefficient of features in dataset. We want to look for the darkest blue and red values ignoring the diagonal.
3.2 Bar Plot
If we inspect the columns that correlate with data_scientist, the most significant feature is r
(with a value of 0.32). Let's explore what is going on there. The r column is an indicator
column that is 1 or 0, depending on whether the respondent uses the R language. We could use a
scatter plot to view the relationship, but it would be less compelling: because there are only
two values for data_scientist, the points would all land on top of one of the two options.
Instead, we will use an unstacked bar plot.
Here’s what the code is doing line by line:
• Start with the X_train dataframe
• Create a data_scientist column
• Group by the r column
• Look only at the data_scientist column
• Get the counts for each different entry in the data_scientist column (which will be represented as a multi-index)
• Pull out the data_scientist level of the multi-index and put it in the columns
• Create a bar plot from each column (data_scientist options) for each entry in r
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 4))
(X_train
.assign(data_scientist = kag_y_train)
.groupby('r')
.data_scientist
.value_counts()
.unstack()
.plot.bar(ax=ax)
)
Figure 3.3: Bar plot of R usage for Data Scientists and Software Engineers. The correlation value was
positive, 0.32.
It turns out that the column with the most negative correlation to job title (-0.31) was
also an indicator column, major_cs, indicating whether the respondent studied computer
science. Let’s explore using the cross-tabulation function (pd.crosstab) which provides a
helper function for the previous chain of Pandas code.
fig, ax = plt.subplots(figsize=(8, 4))
(pd.crosstab(index=X_train['major_cs'],
             columns=kag_y)
 .plot.bar(ax=ax)
)
Figure 3.4: Bar plot of majors for Data Scientists and Software Engineers. The correlation value was
negative, -0.31.
There is a slight correlation between years of experience and compensation. Let’s do a
scatter plot of this to see if it helps us understand what is going on.
Here is the code to create a scatter plot:
fig, ax = plt.subplots(figsize=(8, 4))
(X_train
.plot.scatter(x='years_exp', y='compensation', alpha=.3, ax=ax, c='purple')
)
This plot is ok, but it is glaringly evident that the data is binned. In the real world, folks
would only have exactly five years of experience on one day. Luckily, we set the alpha
transparency to see where points pile up on top of each other. But given the plot, it is hard to
tell what is going on.
I’m going to use the Seaborn library to create a plot that is more involved. One thing I want
to do is add jitter. Because our points land on top of each other, it is hard to understand the
density. If we shift each one horizontally and vertically (called jitter) by a different random
amount, it will aid with that. (This is done with so.Jitter(x=.5, y=10_000).)
I’m also going to color the points by the title. (color='title')
Finally, we will fit a polynomial line through the points to see the trend. (see
.add(so.Line(), so.PolyFit()))
Figure 3.5: Scatter plot showing relationship between compensation and years of experience
import seaborn.objects as so
fig = plt.figure(figsize=(8, 4))
(so
.Plot(X_train.assign(title=kag_y_train), x='years_exp', y='compensation', color='title')
.add(so.Dots(alpha=.3, pointsize=2), so.Jitter(x=.5, y=10_000))
.add(so.Line(), so.PolyFit())
.on(fig) # not required unless saving to image
.plot() # ditto
)
Figure 3.6: Scatter plot showing relationship between compensation and years of experience
This plot is more interesting. It indicates that data scientists tend to earn more than
their software engineer counterparts. This hints that compensation might come in useful for
determining job titles.
Another point to consider is that this plot lumps together folks from around the world.
Different regions are likely to compensate differently. Let’s tease that apart as well.
This next plot is cool. The Seaborn objects interface makes it simple to facet the
data by country (.facet('country')). Also, I want to show all of the data in grey on
each plot. If you set col=None, it will plot all of the data. However, I want to zoom
in to the lower left corner because most of the data is found there. I can adjust the tick
locations with .scale(x=so.Continuous().tick(at=[0,1,2,3,4,5])) and adjust the limit as well
with .limit(y=(-10_000, 200_000), x=(-1, 6)).
fig = plt.figure(figsize=(8, 4))
(so
.Plot(X_train
#.query('compensation < 200_000 and years_exp < 16')
.assign(
title=kag_y_train,
country=(X_train
.loc[:, 'Q3_United States of America': 'Q3_China']
.idxmax(axis='columns')
)
), x='years_exp', y='compensation', color='title')
.facet('country')
.add(so.Dots(alpha=.01, pointsize=2, color='grey' ), so.Jitter(x=.5, y=10_000), col=None)
.add(so.Dots(alpha=.5, pointsize=1.5), so.Jitter(x=.5, y=10_000))
.add(so.Line(pointsize=1), so.PolyFit(order=2))
.scale(x=so.Continuous().tick(at=[0,1,2,3,4,5]))
.limit(y=(-10_000, 200_000), x=(-1, 6)) # zoom in with this not .query (above)
.on(fig) # not required unless saving to image
.plot() # ditto
)
3.3 Summary
In this chapter, we explored the data to get a feel for what determines whether a respondent
is a data scientist. We can use summary statistics to quantify amounts with Pandas. I also like
to visualize the data. Typically I will use Pandas and the Seaborn library to make charts to
understand relationships between features of the data.
3.4 Exercises
1. Use .corr to quantify the correlations in your data.
2. Use a scatter plot to visualize the correlations in your data.
3. Use .value_counts to quantify counts of categorical data.
4. Use a bar plot to visualize the counts of categorical data.
Figure 3.7: Scatter plot showing relationship between compensation and years of experience broken out
by country
Chapter 4
Tree Creation
Finally, we are going to make a model! A tree model. Tree-based models are useful for ML
because they can handle both categorical and numerical data, and they are able to handle nonlinear relationships between features and target variables. Additionally, they provide a clear
and interpretable method for feature importance, which can be useful for feature selection
and understanding the underlying relationships in the data.
Before we make a model, let’s try to understand an algorithm for what makes a good tree
model.
At a high level, when we create a tree for classification, the process goes something like
this:
• Loop through all of the columns and
– Find the split point that is best at separating the different labels
– Create a node with this information
• Repeat the above process for results of each node until each node is pure (or a
hyperparameter indicates to stop tree creation)
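Here is a rough sketch of that greedy loop in plain Python. This is not scikit-learn's or XGBoost's actual implementation; it assumes some impurity function (such as the Gini coefficient introduced next) that scores how mixed a set of labels is:

def best_split(df, feature_cols, label_col, impurity):
    """Try every candidate split of every column and keep the one
    with the lowest weighted impurity."""
    best = None
    for col in feature_cols:
        for split_point in df[col].dropna().unique():
            left = df[df[col] < split_point][label_col]
            right = df[df[col] >= split_point][label_col]
            score = (impurity(left) * len(left)
                     + impurity(right) * len(right)) / len(df)
            if best is None or score < best[0]:
                best = (score, col, split_point)
    return best   # (weighted impurity, column, split point)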
We need a metric to help determine when our separation is good. One commonly used
metric is the Gini Coefficient.
4.1 The Gini Coefficient
The Gini coefficient (also known as the Gini index or Gini impurity)¹ quantifies the level
of inequality in a frequency distribution. A value of 0 indicates complete equality, with all
values being the same, and a value of 1 represents the maximum level of inequality among
the values.
The formula for the Gini coefficient is:

Gini = 1 − \sum_{i=1}^{c} p_i^2

Where Gini is the Gini index, c is the number of classes, and p_i is the proportion of
observations that belong to class i.
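For example, a group of observations that is 80% one class and 20% the other has Gini = 1 − (0.8² + 0.2²) = 1 − 0.68 = 0.32, while a pure group (all one class) has Gini = 1 − 1² = 0.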
In order to understand optimal splitting, we’ll simulate a dataset with two classes and one
feature. For some values of the features, the classes will overlap—there will be observations
from both classes on each side of any split.
We’ll find the threshold for which the feature best separates the data into the correct classes.
¹ This coefficient was proposed by an Italian statistician and sociologist Corrado Gini to measure the inequality
of wealth in a country. A coefficient of 0 would mean that everyone has the same income. A measure of 1 would
indicate that one person has all the income.
Let’s explore this by generating two random samples using NumPy. We will label one
sample as Negative and the other as Positive. Here is a visualization of what the distributions
look like:
import numpy as np
import numpy.random as rn
pos_center = 12
pos_count = 100
neg_center = 7
neg_count = 1000
rs = rn.RandomState(rn.MT19937(rn.SeedSequence(42)))
gini = pd.DataFrame({'value':
np.append((pos_center) + rs.randn(pos_count),
(neg_center) + rs.randn(neg_count)),
'label':
['pos']* pos_count + ['neg'] * neg_count})
fig, ax = plt.subplots(figsize=(8, 4))
_ = (gini
.groupby('label')
[['value']]
.plot.hist(bins=30, alpha=.5, ax=ax, edgecolor='black')
)
ax.legend(['Negative', 'Positive'])
Figure 4.1: Positive and negative distributions for Gini demonstration. Note that there are overlapping
values between 9 and 10.
It might be hard to see, but there is some overlap between 9 and 10, where there are both
negative and positive examples. I outlined these in black to make them more evident.
Now, let’s imagine that we needed to choose a value where we determine whether values
are negative or positive. The Gini coefficient is a metric that can help with this task.
To calculate the Gini coefficient for our positive and negative data, we need to consider
a point (my function calls it the split_point) where we decide whether a value is positive
or negative. If we have the actual labels for the data, we can calculate the true positive, false
positive, true negative, and false negative values. Then we calculate the weighted average of one
minus the fraction of true and false positives squared. We also calculate the weighted average
of one minus the fraction of true and false negatives squared. The weighted average of these
two values is the Gini coefficient.
Here is a function that performs the calculation:
def calc_gini(df, val_col, label_col, pos_val, split_point,
              debug=False):
    """
    This function calculates the Gini impurity of a dataset. Gini impurity
    is a measure of the probability of a random sample being classified
    incorrectly when a feature is used to split the data. The lower the
    impurity, the better the split.

    Parameters:
        df (pd.DataFrame): The dataframe containing the data
        val_col (str): The column name of the feature used to split the data
        label_col (str): The column name of the target variable
        pos_val (str or int): The value of the target variable that represents
            the positive class
        split_point (float): The threshold used to split the data.
        debug (bool): optional, when set to True, prints the calculated Gini
            impurities and the final weighted average

    Returns:
        float: The weighted average of Gini impurity for the positive and
            negative subsets.
    """
    ge_split = df[val_col] >= split_point
    eq_pos = df[label_col] == pos_val
    tp = df[ge_split & eq_pos].shape[0]
    fp = df[ge_split & ~eq_pos].shape[0]
    tn = df[~ge_split & ~eq_pos].shape[0]
    fn = df[~ge_split & eq_pos].shape[0]
    pos_size = tp+fp
    neg_size = tn+fn
    total_size = len(df)
    if pos_size == 0:
        gini_pos = 0
    else:
        gini_pos = 1 - (tp/pos_size)**2 - (fp/pos_size)**2
    if neg_size == 0:
        gini_neg = 0
    else:
        gini_neg = 1 - (tn/neg_size)**2 - (fn/neg_size)**2
    weighted_avg = gini_pos * (pos_size/total_size) + \
        gini_neg * (neg_size/total_size)
    if debug:
        print(f'{gini_pos=:.3} {gini_neg=:.3} {weighted_avg=:.3}')
    return weighted_avg
Let’s choose a split point and calculate the Gini coefficient (the weighted_avg value).
>>> calc_gini(gini, val_col='value', label_col='label', pos_val='pos',
...     split_point=9.24, debug=True)
gini_pos=0.217 gini_neg=0.00202 weighted_avg=0.0241
0.024117224644432264
With the function in hand, let’s loop over the possible values for the split point and plot out
the coefficients for a given split. If we go too low, we get too many false positives. Conversely,
if we set the split point too high, we get too many false negatives, and somewhere in between
lies the lowest Gini coefficient.
values = np.arange(5, 15, .1)
ginis = []
for v in values:
    ginis.append(calc_gini(gini, val_col='value', label_col='label',
                           pos_val='pos', split_point=v))
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(values, ginis)
ax.set_title('Gini Coefficient')
ax.set_ylabel('Gini Coefficient')
ax.set_xlabel('Split Point')
Figure 4.2: Gini values over different split points. Notice that the lowest value is around 10.
Armed with this graph, somewhere around 10 provides the lowest Gini coefficient. Let’s
look at the values around 10.
>>> pd.Series(ginis, index=values).loc[9.5:10.5]
9.6     0.013703
9.7     0.010470
9.8     0.007193
9.9     0.005429
10.0    0.007238
10.1    0.005438
10.2    0.005438
10.3    0.007244
10.4    0.009046
10.5    0.009046
dtype: float64
This code verifies that the split point at 9.9 will minimize the Gini coefficient:
>>> print(pd.DataFrame({'gini':ginis, 'split':values})
...   .query('gini <= gini.min()')
... )
        gini  split
49  0.005429    9.9
4.2 Coefficients in Trees
Now let’s make a simple tree that only has a single node. We call this a decision stump (because
it has no branches).
We can use a DecisionTreeClassifier from scikit-learn. We will call the .fit method to fit
the training data to the labels. Remember, we pass in a dataframe for the features and a series
for the labels. (We can also pass in NumPy data structures but I find those inconvenient for
tabular work.)
from sklearn import tree
stump = tree.DecisionTreeClassifier(max_depth=1)
stump.fit(gini[['value']], gini.label)
A valuable decision tree feature is that you can visualize what they do. Below, we will
create a visualization of the decision tree.
It shows a single decision at the top box (or node). It checks if the value is less than or equal
to 9.708. If that decision is true, we jump to the left child, the negative label. Otherwise, we
use the right child, the positive label.
You can see that the base Gini value (given in the top box) is our starting Gini considering
that we labeled everything as positive.
fig, ax = plt.subplots(figsize=(8, 4))
tree.plot_tree(stump, feature_names=['value'],
filled=True,
class_names=stump.classes_,
ax=ax)
If you calculate the weighted average of the Gini coefficients in the leaf nodes, you see a
value very similar to the minimum value we calculated above:
Figure 4.3: Export of decision stump.
>>> gini_pos = 0.039
>>> gini_neg = 0.002
>>> pos_size = 101
>>> neg_size = 999
>>> total_size = pos_size + neg_size
>>> weighted_avg = gini_pos * (pos_size/total_size) + \
...     gini_neg * (neg_size/total_size)
>>> print(weighted_avg)
0.005397272727272727
XGBoost doesn’t use Gini, but Scikit-Learn does by default. XGBoost goes through its own
tree-building process. As it builds each tree, two statistics are computed for every instance,
g_i and h_i. They are similar to Gini in that they guide splitting, but they represent the behavior
of the loss function. The gradient (g_i) is the first derivative of the loss function with respect
to the predicted value for each instance, while the hessian (h_i), the second derivative,
represents the curvature of the loss function.
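For the default binary logistic objective these have simple closed forms. Here is a sketch of the standard log-loss calculus (not code taken from the XGBoost source):

import numpy as np

def logistic_grad_hess(raw_score, y):
    """Per-instance gradient and hessian of log loss, given the raw
    (pre-sigmoid) score and the 0/1 label."""
    p = 1 / (1 + np.exp(-raw_score))   # predicted probability
    g = p - y                          # first derivative (gradient)
    h = p * (1 - p)                    # second derivative (hessian)
    return g, h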
Let’s make a stump in XGBoost by limiting the max_depth and n_estimators parameters:
import xgboost as xgb
xg_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)
xg_stump.fit(gini[['value']], (gini.label== 'pos'))
The xgb.plot_tree function will let us visualize the split point for the stump.
xgb.plot_tree(xg_stump, num_trees=0)
Note
Because I want to be able to set the fonts in this plot, I created my own function,
my_dot_export.
import subprocess

def my_dot_export(xg, num_trees, filename, title='', direction='TB'):
    """Exports a specified number of trees from an XGBoost model as a graph
    visualization in dot and png formats.

    Args:
        xg: An XGBoost model.
        num_trees: The number of the tree to export.
        filename: The name of the file to save the exported visualization.
        title: The title to display on the graph visualization (optional).
        direction: The direction to lay out the graph, either 'TB' (top to
            bottom) or 'LR' (left to right) (optional).
    """
    res = xgb.to_graphviz(xg, num_trees=num_trees)
    content = f'''
    node [fontname = "Roboto Condensed"];
    edge [fontname = "Roboto Thin"];
    label = "{title}"
    fontname = "Roboto Condensed"
    '''
    out = res.source.replace('graph [ rankdir=TB ]',
                             f'graph [ rankdir={direction} ];\n {content}')
    # dot -Gdpi=300 -Tpng -ocourseflow.png courseflow.dot
    dot_filename = filename
    with open(dot_filename, 'w') as fout:
        fout.write(out)
    png_filename = dot_filename.replace('.dot', '.png')
    subprocess.run(f'dot -Gdpi=300 -Tpng -o{png_filename} {dot_filename}'.split())
Let’s try out the function.
my_dot_export(xg_stump, num_trees=0, filename='img/stump_xg.dot', title='A demo stump')
Figure 4.4: Export of xgboost stump.
The split point XGBoost found is very similar to the one scikit-learn found. (The values in the leaves are different, but we will get to those.)
4.3 Another Visualization Tool
The dtreeviz package creates visualizations for decision trees. It provides a clear visualization
of tree structure and decision rules, making it easier to understand and interpret the model.
import dtreeviz
viz = dtreeviz.model(xg_stump, X_train=gini[['value']],
                     y_train=gini.label=='pos',
                     target_name='positive',
                     feature_names=['value'], class_names=['negative', 'positive'],
                     tree_index=0)
viz.view()
Figure 4.5: Decision stump visualized by dtreeviz package. Note that it indicates the split point in the
histogram. It also shows a pie chart for each leaf indicating the fraction of labels.
This visualization shows the distributions of the data as histograms. It also labels the split
point. Finally, it shows what the ratios of the predictions are as pie charts.
4.4 Summary
The Gini coefficient is a measure of inequality used in economics and statistics. It is typically
used to measure income inequality, but can also be applied to other variables such as wealth.
The Gini coefficient takes on a value between 0 and 1, with 0 representing perfect equality
(where everyone has the same income) and 1 representing perfect inequality (where one
person has all the income and everyone else has none).
4.5 Exercises
1. What is the Gini coefficient?
2. How is the Gini coefficient calculated?
3. How is the Gini coefficient used in Decision trees?
4. What does a stump tell us about the data?
Chapter 5
Stumps on Real Data
By now, you should have an intuition on how a decision tree “decides” values to split on.
Decision trees use a greedy algorithm to split on a feature (column) that results in the most
“pure” split.
Instead of looking at our simple data with only one column of information, let’s use the
Kaggle data. We will predict whether someone is a data scientist or software engineer based
on how they answered the survey questions. The decision tree should loop over every column
and try to find the column and split point that best separates data scientists from software
engineers. It will recursively perform this operation, creating a tree structure.
5.1 Scikit-learn stump on real data
Let’s use scikit-learn first to make a decision stump on the Kaggle data. Before we do that,
consider what this stump is telling us. The first split should use one of the most critical
features because it is the value that best separates the data into classes. If you only had one
piece of information, you would probably want the column that the stump splits on.
Let’s create the stump and visualize it. We will use our pipeline to prepare the training
data. Remember that the pipeline uses Pandas code to clean up the raw data, then we use
categorical encoding on the non-numeric columns, and finally, we fill in the missing values
for education level and experience.
We will feed the raw data, kag_X_train, into the .fit_transform method of the pipeline and
get out the cleaned-up training data, X_train. Then we will create a stump and train it with the
.fit method:
stump_dt = tree.DecisionTreeClassifier(max_depth=1)
X_train = kag_pl.fit_transform(kag_X_train)
stump_dt.fit(X_train, kag_y_train)
Let’s explore how the tree decided to split the data. Here is the code to visualize the tree:
fig, ax = plt.subplots(figsize=(8, 4))
features = list(c for c in X_train.columns)
tree.plot_tree(stump_dt, feature_names=features,
               filled=True,
               class_names=stump_dt.classes_,
               ax=ax)
The column that best separates the two professions is the use of the R programming
language. This seems like a sensible choice considering that very few software engineers use
the R language.
Figure 5.1: Decision stump trained on our Kaggle data. It bases its decision purely on the value in the r
column.
We can evaluate the accuracy of our model with the .score method. We call this method on
the holdout set. The holdout set (or validation set), X_test, is the portion of the data that is set
aside and not used during the training phase, but instead is reserved for testing the model’s
performance once it has been trained. It allows for a more accurate evaluation of the model’s
performance against new data, as it is tested on unseen data.
It looks like this model classified 62% of the occupations correctly by just looking at a single
column, the R column.
>>> X_test = kag_pl.transform(kag_X_test)
>>> stump_dt.score(X_test, kag_y_test)
0.6243093922651933
Is 62% accuracy good? Bad? The answer to questions like this is generally “it depends”.
Looking at accuracy on its own is usually not sufficient to inform us of whether a model is
“good enough”. However, we can use it to compare models. We can create a baseline model
using the DummyClassifier. This classifier just predicts the most common label, in our case
'Data Scientist'. Not a particularly useful model, but it provides a baseline score that our
model should be able to beat. If we can’t beat that model, we shouldn’t be using machine
learning.
Here is an example of using the DummyClassifier.
>>> from sklearn import dummy
>>> dummy_model = dummy.DummyClassifier()
>>> dummy_model.fit(X_train, kag_y_train)
>>> dummy_model.score(X_test, kag_y_test)
0.5458563535911602
The baseline performance is 54% accuracy (which is the percent of values that are
'Data Scientist'). Our stump accuracy is better than the baseline. Note that being better than
the baseline does not qualify a model as “good”. But it does indicate that the model is potentially
useful.
5.2 Decision Stump with XGBoost
XGBoost does not use the Gini calculation to decide how to make decisions. Rather XGBoost
uses boosting and gradient descent. Boosting is using a model and combining it with other
models to improve the results. In fact, XGBoost stands for eXtreme Gradient Boosting. The
“extreme” part is due to the ability to regularize the result and the various optimizations it
has to efficiently create the model.
In this case, the subsequent models are trained on the errors of the previous model with
the goal of reducing those errors. The “gradient descent” part comes in because this process of
minimizing the error is expressed as an objective function so that the gradient descent
algorithm can be applied. The outcome is based on the gradient of the error with respect to
the prediction. The objective function combines two parts: training loss and a regularization
term. Trees are created by splitting on features that move a small step in the direction of the
negative gradient, which moves them closer to the global minimum of the loss function.
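To build intuition for that residual-correcting loop, here is a simplified, hand-rolled sketch of boosting with squared-error loss. It is not XGBoost’s actual implementation (which also uses the second derivative and regularization), and the helper name toy_boost is mine.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost(X, y, n_rounds=5, learning_rate=0.3):
    """Fit a sequence of shallow trees, each trained on the residual
    (y minus the current prediction) left by the trees before it."""
    pred = np.zeros(len(y), dtype=float)   # start by predicting 0 for everyone
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        pred += learning_rate * stump.predict(X)   # small corrective step
        trees.append(stump)
    return trees, pred

Each round nudges the prediction a little further toward the target, which is why the learning_rate hyperparameter we meet later controls how conservative the model is.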
There are many parameters for controlling XGBoost. One of them is n_estimators, the
number of trees. Let’s create a stump with XGBoost by setting this value to 1 and see if it
performs similarly to the scikit-learn stump.
import xgboost as xgb
kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)
kag_stump.fit(X_train, kag_y_train)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[402], line 2
      1 kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)
----> 2 kag_stump.fit(X_train, kag_y_train)
      3 kag_stump.score(X_test, kag_y_test)
...
   1462 if (
   1463     self.classes_.shape != expected_classes.shape
   1464     or not (self.classes_ == expected_classes).all()
   1465 ):
-> 1466     raise ValueError(
   1467         f"Invalid classes inferred from unique values of `y`. "
   1468         f"Expected: {expected_classes}, got {self.classes_}"
   1469     )
   1471 params = self.get_xgb_params()
   1473 if callable(self.objective):
ValueError: Invalid classes inferred from unique values of `y`.
Expected: [0 1], got ['Data Scientist' 'Software Engineer']
We got an error! XGBoost is not happy about our labels. Unlike scikit-learn, XGBoost
does not work with string labels. The kag_y_train series has strings in it, and XGBoost wants
integer values.
>>> print(kag_y_train)
587      Software Engineer
3065        Data Scientist
8435        Data Scientist
3110        Data Scientist
16372    Software Engineer
               ...
16608    Software Engineer
7325     Software Engineer
21810       Data Scientist
4917        Data Scientist
639         Data Scientist
Name: Q6, Length: 2110, dtype: object
Since there are only two options, we can convert this to integer values by testing whether
the string is equal to Software Engineer. A True (or 1) represents Software Engineer and a False
(or 0) represents Data Scientist.
>>> print(kag_y_train == 'Software Engineer')
587       True
3065     False
8435     False
3110     False
16372     True
         ...
16608     True
7325      True
21810    False
4917     False
639      False
Name: Q6, Length: 2110, dtype: bool
However, rather than using pandas, we will use the LabelEncoder class from scikit-learn
to convert the labels to numbers. The label encoder stores attributes that are useful when
preparing data for future predictions.
>>> from sklearn import preprocessing
>>> label_encoder = preprocessing.LabelEncoder()
>>> y_train = label_encoder.fit_transform(kag_y_train)
>>> y_test = label_encoder.transform(kag_y_test)
>>> y_test[:5]
array([1, 0, 0, 1, 1])
The label encoder will return 1s and 0s. How do you know which numerical value is which
string value? You can ask the label encoder for the .classes_. The index of these values is the
number that the encoder uses.
>>> label_encoder.classes_
array(['Data Scientist', 'Software Engineer'], dtype=object)
Because 'Data Scientist' is at index 0 in .classes_, it is the 0 value in the training labels.
The 1 represents 'Software Engineer'. This is also called the positive label. (It is the positive
label because it has the value of 1.)
The label encoder has an .inverse_transform method that reverses the transformation. (This
would require more effort if we went with the pandas encoding solution.)
>>> label_encoder.inverse_transform([0, 1])
array(['Data Scientist', 'Software Engineer'], dtype=object)
Now let’s make a stump with XGBoost and check the score.
>>> kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1)
>>> kag_stump.fit(X_train, y_train)
>>> kag_stump.score(X_test, y_test)
0.6243093922651933
It looks like the XGBoost stump performs similarly to the scikit-learn decision tree. Let’s
visualize what it looks like.
my_dot_export(kag_stump, num_trees=0, filename='img/stump_xg_kag.dot',
              title='XGBoost Stump')
Figure 5.2: Stump export of XGBoost Kaggle model. A positive leaf value means that we predict the
positive label.
5.3 Values in the XGBoost Tree
What are the numbers in the leaves of the export? They are probabilities. Well, actually, they
are the logarithm of the odds of the positive value. Statisticians call this the logit.
When you calculate the inverse logit scores with the values from the nodes in the tree,
you get the probability of the prediction. If a survey respondent does not use the R language,
the leaf value is .0717741922. The inverse logit of that is .518, or 51.8%. Because this value is
greater than .5, we assume the positive label or class of Software Engineer (the second value in
kag_stump.classes_ is 1, which our label encoder used as the value for Software Engineer).
>>> kag_stump.classes_
array([0, 1])
import numpy as np

def inv_logit(p: float) -> float:
    """
    Compute the inverse logit function of a given value.

    The inverse logit function is defined as:
    f(p) = exp(p) / (1 + exp(p))

    Parameters
    ----------
    p : float
        The input value to the inverse logit function.

    Returns
    -------
    float
        The output of the inverse logit function.
    """
    return np.exp(p) / (1 + np.exp(p))
Let’s see what probability comes out when we pass in .07177.
>>> inv_logit(.0717741922)
0.5179358489487103
It looks like it spits out 51.8%.
Conversely, the inverse logit of the right node is .411. Because this number is below .5, we
do not classify it as the second option but rather choose Data Scientist.
Note
Note that if a user uses R, they have a higher likelihood of being a Data Scientist (1 - .41, or
59%) than a non-R user has of being a Software Engineer (.518, or 52%). In other words, using
R pushes one more toward Data Scientist than not using R pushes one toward Software
Engineer.
>>> inv_logit(-.3592)
0.41115323716754393
Here I plot the inverse logit function. You can see a crossover point at 0 on the x-axis.
When the x values are above 0, the y value will be > .5 (Software Engineer). When they are
below 0, the values will be < .5 (Data Scientist).
fig, ax = plt.subplots(figsize=(8, 4))
vals = np.linspace(-7, 7)
ax.plot(vals, inv_logit(vals))
ax.annotate('Crossover point', (0,.5), (-5,.8), arrowprops={'color':'k'})
ax.annotate('Predict Positive', (5,.6), (1,.6), va='center', arrowprops={'color':'k'})
ax.annotate('Predict Negative', (-5,.4), (-3,.4), va='center', arrowprops={'color':'k'})
Figure 5.3: The inverse logit function. When the x values are above 0, the y value will be > .5 (Software
Engineer). When they are below 0, the values will be < .5 (Data Scientist).
5.4 Summary
In this chapter, we explored making a stump on real data. A decision stump is a simple
decision tree with a single split. We showed how we can use the XGBoost algorithm to create
a decision stump. We also explored how the values in the leaves of XGBoost models indicate
what label to predict.
5.5 Exercises
1. How are decision stumps trained using the XGBoost algorithm?
2. What is the inverse logit function and how is it used in machine learning?
3. How can the output of the inverse logit function be interpreted in terms of label
probabilities?
Chapter 6
Model Complexity & Hyperparameters
In this chapter, we will explore the concept of underfit and overfit models. Then we will see
how we can use hyperparameters, or attributes of the model, to change the behavior of model
fitting.
6.1 Underfit
A stump is too simple. Statisticians like to say it has too much bias. (I think this term is a little
confusing and prefer “underfit”.) It has a bias towards simple predictions. In the case of our
Kaggle model, the stump only looks at a single column. Perhaps it would perform better if it
were able to also consider additional columns after looking at the R column.
When you have an underfit model, you want to make it more complex (because it is too
simple). There are a few mechanisms we can use to add complexity:
• Add more features (columns) that have predictive value
• Use a more complex model
These are general strategies that work with most underfit models, not just trees.
Here is our stump. Let’s look at the accuracy (as reported by .score). It should be
representative of an underfit model.
>>> underfit = tree.DecisionTreeClassifier(max_depth=1)
>>> X_train = kag_pl.fit_transform(kag_X_train)
>>> underfit.fit(X_train, kag_y_train)
>>> underfit.score(X_test, kag_y_test)
0.6243093922651933
Hopefully, we will be able to create a model with better accuracy than 62%.
6.2 Growing a Tree
We showed how to create a stump. How do we fix our underfit stump? You could add more
advanced features (or columns) to the training data that better separate the classes. Or we
could make the stump more complex by letting the tree grow larger.
Instead of trying arbitrary tactics, generally, we want to proceed in a slightly more
organized fashion. One common technique is to measure model accuracy as you let the
tree grow deeper (or add additional columns) and see if there is an improvement. You can
optimize a scoring function given constraints. Different models can have different scoring
functions. You can also create your own. We saw that scikit-learn optimizes Gini and that
XGBoost performs gradient descent on a loss function.
The constraints for the model are called hyperparameters. Hyperparameters, at a high level,
are knobs that you can use to control the complexity of your model. Remember that our stump
was too simple and underfit our data. One way we can add complexity to a tree-based model
is to let it grow more nodes. We can use hyperparameters to determine how and when we
will add nodes.
6.3 Overfitting
Before optimizing our model, let’s look at the other end of the spectrum, an overfit model.
An overfit model is too complicated. Again, the statistician would say it has too much
variance. I avoid using the term “variance” too much (unless I’m in an interview situation)
and like to picture a model that looks at every nook and cranny of the data. When it sees an
example that is new, the model compares all of these trivial aspects that the model learned by
inspecting the training data, which tends to lead it down the wrong path. Rather than honing
in on the right label, the complexity causes it to “vary”.
And again, there are general solutions to helping deal with overfitting:
• Simplify or constrain (regularize)
• Add more samples (rows of data)
For a tree model, we can prune back the growth so that the leaf nodes are not overly specific.
This will simplify or constrain the model. Alternatively, we can collect more data, thus forcing
the model to have more attempts at finding the important features.
6.4 Overfitting with Decision Trees
Let’s make an overfit model with a decision tree. This is quite easy. We just let the model
grow until every node is pure (of the same class). We can set max_depth=None to overfit the tree,
which is the default behavior of DecisionTreeClassifier.
After we create the model, let’s inspect the accuracy.
>>> hi_variance = tree.DecisionTreeClassifier(max_depth=None)
>>> X_train = kag_pl.fit_transform(kag_X_train)
>>> hi_variance.fit(X_train, kag_y_train)
>>> hi_variance.score(X_test, kag_y_test)
0.6629834254143646
In this case, the accuracy is 66%. The accuracy of the stump was 62%. It is possible (and
likely) that we can get better accuracy than both the stump and the complex model by either
simplifying our complex model or adding some complexity to our simple model.
Here is a visualization of the complex model. You can see that the tree is pretty deep. I
count 22 levels of nodes. This visualization should help with the intuition of why an overfit
tree is bad. It is essentially memorizing the training data. If you feed it data that is not exactly
the same as data it has seen before, it will tend to go down the wrong track. (If you think
memorizing the data is good, then you don’t need a machine learning model because you can
just memorize every possible input and create a giant if statement for that.)
fig, ax = plt.subplots(figsize=(8, 4))
features = list(c for c in X_train.columns)
tree.plot_tree(hi_variance, feature_names=features, filled=True)
Let’s zoom in on the top of the tree to understand more about what is happening.
Figure 6.1: An overfit decision tree. This is too complicated and it is essentially memorizing the data
that it was trained on. However, because it is very specific to the training data, it will tend to perform
worse on data that it hasn’t seen before.
fig, ax = plt.subplots(figsize=(8, 4))
features = list(c for c in X_train.columns)
tree.plot_tree(hi_variance, feature_names=features, filled=True,
               class_names=hi_variance.classes_,
               max_depth=2, fontsize=6)
The darker orange boxes tend towards more data scientists, and the bluer boxes are
software engineers. If we go down the most orange path, at the top node is a person who uses
R, who also didn’t study computer science, and finally, they have less than or equal to 22.5 years
of experience. Of folks who fall into this bucket, 390 of the 443 samples were data scientists.
The ability to interpret a decision tree is important and often business-critical. Many
companies are willing to sacrifice model performance for the ability to provide a concrete
explanation for the behavior of a model. A decision tree makes this very easy. We will see
that XGBoost models prove more challenging at delivering a simple interpretation.
6.5 Summary
In this chapter, the concept of underfit and overfit models is explored. Hyperparameters, or
attributes of the model, are used to change the behavior of the model fitting. We saw that
the max_depth hyperparameter can tune the model between simple and complex. An underfit
model is too simple, having a bias towards simple predictions. To fix an underfit model, one
can add more features or use a more complex model. These general strategies work with most
underfit models, not just trees.
On the other hand, an overfit model is too complicated and has too much variance. To
deal with overfitting, one can simplify or constrain the model, or add more samples. For a
tree model, one can prune back the growth so that the leaf nodes are not overly specific, or collect
Figure 6.2: Zooming in on an overfit decision tree.
more data. The goal is to find the optimal balance between a model that is too simple and too
complex.
Later on, we will see that XGBoost creates decent models out of the box, but it also has
mechanisms to tune the complexity. We will also see how you can diagnose overfitting by
visualizing model performance.
6.6 Exercises
1. What is underfitting in decision trees and how does it occur?
2. What is overfitting in decision trees and how does it occur?
3. What are some techniques to avoid underfitting in decision trees?
4. What are some techniques to avoid overfitting in decision trees?
Chapter 7
Tree Hyperparameters
Previously, we saw that models might be too complex or too simple. Some of the mechanisms
for dealing with model complexity are by creating new columns (to deal with simple models)
and adding more examples (to deal with complex models). But there are also levers that we
can pull without touching the data. These levers are hyperparameters: settings that change how
the model is created, trained, or behaves.
7.1 Decision Tree Hyperparameters
Let’s explore the hyperparameters of decision trees. You might wonder why we seem so
concerned with decision trees when this book is about XGBoost. That is because decision
trees form the basis for XGBoost. If you understand how decision trees work, that will aid
your mental model of XGBoost. Also, many of the hyperparameters for decision trees apply
to XGBoost at some level.
Many of the hyperparameters impact complexity (or simplicity). Here is a general rule
for models based on scikit-learn. Hyperparameters that start with max_ will make the model
more complex when you raise them. On the flip side, those that start with min_ will make the
model simpler when you raise them. (Lowering them has the opposite effect.)
Here are the hyperparameters for scikit-learn’s DecisionTreeClassifier:
• max_depth=None - Tree depth. The default is to keep splitting until all nodes are pure or
there are fewer than min_samples_split samples in each node. Range [1-large number].
• max_features=None - Number of features to examine for the split. Default is the number
of features. Range [1-number of features].
• max_leaf_nodes=None - Number of leaves in a tree. Range [1-large number].
• min_impurity_decrease=0 - Split when the impurity decrease is >= this value. (An impurity
of 0 means 100% accurate; .3 means 70%.) Range [0.0-1.0]
• min_samples_leaf=1 - Minimum samples at each leaf. Range [1-large number]
• min_samples_split=2 - Minimum samples required to split a node. Range [2-large
number]
• min_weight_fraction_leaf=0 - The fraction of the total weights required to be a leaf. Range
[0-1]
To use a particular hyperparameter, you provide the parameter in the constructor before
training a model. The constructor is the method called when a class is created. Then you train
a model and measure a performance metric and track what happens to that metric as you
change the parameter.
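For example, you might constrain a tree in the constructor and then check the effect on the holdout score (the specific values here are arbitrary and only for illustration):

# Hypothetical values, for illustration only
constrained = tree.DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
constrained.fit(X_train, kag_y_train)
constrained.score(X_test, kag_y_test)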
You can explore the parameters of a trained model as well using the .get_params method:
Figure 7.1: Scikit-learn convention for hyperparameter tuning options to deal with overfit and underfit
models.
>>> stump.get_params()
{'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': 1,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'random_state': None,
'splitter': 'best'}
We can verify that this is a stump because the max_depth hyperparameter is set to 1.
7.2 Tracking changes with Validation Curves
Let’s adjust the depth of a tree while tracking accuracy. A chart that does this is called a
Validation Curve.
With the stump, we get around 62% accuracy. As we add more levels, the accuracy tends
to improve until a point. Then it starts to fall back down.
Why does the accuracy start to fall? Our model is now overfitting. Rather than extracting
valuable insights from the training data, it memorizes the noise because we allow it to get too
complicated. When we try to make a prediction with new data that is missing the same noise,
our overfit model underperforms.
Here is the code to create a validation curve:
accuracies = []
for depth in range(1, 15):
    between = tree.DecisionTreeClassifier(max_depth=depth)
    between.fit(X_train, kag_y_train)
    accuracies.append(between.score(X_test, kag_y_test))

fig, ax = plt.subplots(figsize=(10,4))
(pd.Series(accuracies, name='Accuracy', index=range(1, len(accuracies)+1))
 .plot(ax=ax, title='Accuracy at a given Tree Depth'))
ax.set_ylabel('Accuracy')
ax.set_xlabel('max_depth')
Figure 7.2: A handmade validation curve tracking accuracy over depth.
From this graph, it looks like a depth of seven maximizes the accuracy. Let’s check the
.score of our “Goldilocks” model.
>>> between = tree.DecisionTreeClassifier(max_depth=7)
>>> between.fit(X_train, kag_y_train)
>>> between.score(X_test, kag_y_test)
0.7359116022099448
Remember that the scores of the underfit and overfit models were 62% and 66%, respectively.
An accuracy of 74% is a nice little bump.
7.3 Leveraging Yellowbrick
I come from the lazy programmer school of thought. The code above to track the accuracy
while sweeping over hyperparameter changes is not particularly difficult, but I would rather
45
7. Tree Hyperparameters
have one line of code that does that for me. The Yellowbrick library provides a visualizer,
validation_curve, for us. (Ok, it is a few more lines of code because we need to import it and
I’m also configuring Matplotlib and saving the image, but in your notebook, you only need
the call to validation_curve if you have imported the code.)
from yellowbrick.model_selection import validation_curve
fig, ax = plt.subplots(figsize=(10,4))
viz = validation_curve(tree.DecisionTreeClassifier(),
                       X=pd.concat([X_train, X_test]),
                       y=pd.concat([kag_y_train, kag_y_test]),
                       param_name='max_depth', param_range=range(1,14),
                       scoring='accuracy', cv=5, ax=ax, n_jobs=6)
Figure 7.3: Validation curve illustrating underfitting and overfitting. Around a depth of 7 appears to be the sweet spot for this data.
This plot is interesting (and better than my hand-rolled one). It has two lines: one tracks
the “Training Score”, and the other tracks the “Cross Validation Score”. The training score
is the accuracy of the training data. You can see that as we increase the tree depth, the accuracy
continues to improve. However! We only care about the accuracy of the testing data (this is
the “Cross Validation Score”). The testing data line simulates how our model will work when
it encounters data it hasn’t seen before. The testing data score drops as our model begins to
overfit. This plot validates our test and suggests that a depth of seven is an appropriate choice.
The validation_curve function also allows us to specify other parameters:
• cv - Number of k-fold cross validations (defaults to 3)
• scoring - A scikit-learn model metric string or function (there are various functions in
the sklearn.metrics module)
The plot also shows a range of scores with the shaded area. This is because it does cross
validation and has multiple scores for a given hyperparameter value. Ideally, you want the
range to be tight indicating that the metric performed consistently across each split of the cross
validation.
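If you want that spread as a number rather than a shaded band, you can look at the standard deviation of the cross validation scores. Here is a small sketch using model_selection.cross_val_score (which we also use later in this chapter); cv=5 mirrors the plot above.

scores = model_selection.cross_val_score(
    tree.DecisionTreeClassifier(max_depth=7),
    X=pd.concat([X_train, X_test]),
    y=pd.concat([kag_y_train, kag_y_test]),
    cv=5)
print(scores.mean(), scores.std())   # a small std means consistent folds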
7.4 Grid Search
There is a limitation to the validation curve. It only tracks a single hyperparameter. We often
want to evaluate many hyperparameters. Grid search is one tool that allows us to experiment
across many hyperparameters.
Grid search is one technique used to find the best hyperparameters for a model by training
the model on different combinations of hyperparameters and evaluating its performance
on a validation set. This technique is often used in decision tree models to tune the
hyperparameters that control the tree’s growth, such as the maximum depth, minimum
samples per leaf, and minimum samples per split.
With grid search, a list of possible values for each hyperparameter is specified, and the
model is trained and evaluated on every combination of hyperparameters. For example, if the
maximum depth of the decision tree can take values in the range 1 to 10, and the minimum
samples per leaf can take values in the range 1 to 100, then a grid search would involve training
and evaluating the model on all possible combinations of maximum depth and minimum
samples per leaf within these ranges.
After training and evaluating the model on all combinations of hyperparameters, the
combination that produces the best performance on the validation set is selected as the optimal
set of hyperparameters for the model.
Scikit-learn provides this functionality in the GridSearchCV class. We specify a dictionary
to map the hyperparameters we want to explore to a list of options. The GridSearchCV class
follows the standard scikit-learn interface and provides a .fit method to kick off the search
for the best hyperparameters.
The verbose=1 parameter is not a hyperparameter; it tells grid search to display the
computation time for each attempt.
from sklearn.model_selection import GridSearchCV
params = {
    'max_depth': [3, 5, 7, 8],
    'min_samples_leaf': [1, 3, 4, 5, 6],
    'min_samples_split': [2, 3, 4, 5, 6],
}

grid_search = GridSearchCV(estimator=tree.DecisionTreeClassifier(),
                           param_grid=params, cv=4, n_jobs=-1,
                           verbose=1, scoring="accuracy")
grid_search.fit(pd.concat([X_train, X_test]),
                pd.concat([kag_y_train, kag_y_test]))
After the grid search has exhausted all the combinations, we can inspect the best
parameters on the .best_params_ attribute.
>>> grid_search.best_params_
{'max_depth': 7, 'min_samples_leaf': 5, 'min_samples_split': 6}
Then I can use these parameters while constructing the decision tree. (The unpack operation,
**, will use each key from a dictionary as a keyword parameter with its associated value.)
>>> between2 = tree.DecisionTreeClassifier(**grid_search.best_params_)
>>> between2.fit(X_train, kag_y_train)
>>> between2.score(X_test, kag_y_test)
0.7259668508287292
Why is this score less than our between tree that we previously created? Generally, this is
due to grid search doing k-fold cross validation. In our grid search, we split the data into four
parts (cv=4). Then we train each hyperparameter combination on three parts and evaluate it against
the remaining part. The score is the mean of the four testing scores.
When we were only calling the .score method, we held out one set of data for evaluation.
It is conceivable that the section of data that was held out happens to perform slightly above
the average from the grid search.
There is also a .cv_results_ attribute that contains the scores for each of the options. Here
is Pandas code to explore this. The darker cells have better scores.
# why is the score different than between_tree?
(pd.DataFrame(grid_search.cv_results_)
.sort_values(by='rank_test_score')
.style
.background_gradient(axis='rows')
)
Figure 7.4: Scores for each split for the hyperparameter values of the grid search.
We can manually validate the results of the grid search using the cross_val_score. This
scikit-learn function does cross validation on the model and returns the score from each
training.
>>> results = model_selection.cross_val_score(
...     tree.DecisionTreeClassifier(max_depth=7),
...     X=pd.concat([X_train, X_test], axis='index'),
...     y=pd.concat([kag_y_train, kag_y_test], axis='index'),
...     cv=4
... )
>>> results
array([0.69628647, 0.73607427, 0.70291777, 0.7184595 ])
>>> results.mean()
0.7134345024851962
Here is the same process running against the updated hyperparameters from the grid
search. (Note that if you change cv to 3, the grid search model performs slightly worse.)
>>> results = model_selection.cross_val_score(
...     tree.DecisionTreeClassifier(max_depth=7, min_samples_leaf=5,
...                                 min_samples_split=2),
...     X=pd.concat([X_train, X_test], axis='index'),
...     y=pd.concat([kag_y_train, kag_y_test], axis='index'),
...     cv=4
... )
>>> results
array([0.70822281, 0.73740053, 0.70689655, 0.71580345])
>>> results.mean()
0.7170808366886126
One thing you want to be careful of when evaluating machine learning models is that you
are comparing apples to apples. Even if you are using the same evaluation metric, you need
to make sure that you are evaluating against the same data.
7.5 Summary
In this chapter, we learned that we can modify the accuracy of the model by changing
the hyperparameters. We used a validation curve to visualize the impact of a single
hyperparameter change and then changed many of them using grid search. Later on, we
will see other techniques to tune hyperparameters.
7.6 Exercises
1. What are validation curves and how are they used in hyperparameter tuning?
2. What is grid search and how does it differ from validation curves?
3. What are some pros and cons of validation curves?
4. What are some pros and cons of grid search?
Chapter 8
Random Forest
Before diving into XGBoost, we will look at one more model. The Random Forest. This is
a common tree-based model that is useful to compare and contrast with XGBoost (indeed,
XGBoost can create random forest models if so desired).
8.1 Ensembles with Bagging
Jargon warning! A random forest model is an ensemble model. An ensemble is a group
of models that aggregate results to prevent overfitting. The ensembling technique used by
random forests is called bagging. Bagging (which is short for more jargon: bootstrap aggregating)
means taking the average of the models. Bootstrapping means sampling with replacement. In
effect, a random forest trains multiple decision trees, but each one is trained on different rows
of the data (and different subsets of the features). The prediction is the average of those trees.
Random forests also use column subsampling. Column subsampling in a random forest is a
technique where at various points in the model or tree creation, a random subset of the features
is considered rather than using all the features. This can help to reduce the correlation between
the trees and improve the overall performance of the random forest by reducing overfitting
and increasing the diversity of the trees.
After the trees have been created, the trees “vote” for the final class.
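Here is a rough sketch of that idea (my own illustration, not scikit-learn’s internals). It assumes a pandas DataFrame of features and the 0/1 encoded labels, and the helper names toy_bagging and toy_vote are made up:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_bagging(X, y, n_trees=25, random_state=42):
    """Train each tree on a bootstrap sample (rows drawn with replacement),
    letting each split consider a random subset of the columns."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))   # sample with replacement
        clf = DecisionTreeClassifier(max_features='sqrt')
        trees.append(clf.fit(X.iloc[rows], y[rows]))
    return trees

def toy_vote(trees, X):
    """Majority vote across the trees (assumes 0/1 labels)."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

# e.g. preds = toy_vote(toy_bagging(X_train, y_train), X_test)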
An intuitive way to think about this comes from a theory proposed in 1785, Condorcet’s jury
theorem. The Marquis de Condorcet was a French philosopher who proposed a mechanism
for deciding the correct number of jurors to arrive at the correct verdict. The basic idea is that
you should add a juror if they have more than a 50% chance of picking the right verdict and
are not correlated with the other jurors. You should keep adding additional jurors if they have
> 50% accuracy. Each juror will contribute more to the correct verdict. Similarly, if a decision
tree has a greater than 50% chance of making the right classification (and it is looking at
different samples and different features), you should take its vote into consideration. Because
we subsample on the columns, that should aid in reducing the correlation between the trees.
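A quick simulation (my own illustration) shows the jury theorem in action: independent voters that are each right only 55% of the time become much more accurate as a group.

import numpy as np

rng = np.random.default_rng(0)
p_correct, n_trials = 0.55, 10_000
for n_voters in (1, 11, 101):
    # Each voter is independently right with probability p_correct;
    # the majority verdict is right when more than half of them are right.
    votes = rng.random((n_trials, n_voters)) < p_correct
    print(n_voters, (votes.sum(axis=1) > n_voters / 2).mean())
# Accuracy climbs from roughly .55 toward roughly .84 as voters are added.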
8.2 Scikit-learn Random Forest
Here is a random forest model created with scikit-learn:
>>> from sklearn import ensemble
>>> rf = ensemble.RandomForestClassifier(random_state=42)
>>> rf.fit(X_train, kag_y_train)
>>> rf.score(X_test, kag_y_test)
0.7237569060773481
How many trees did this create? Let’s look at the hyperparameters:
>>> rf.get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': 'sqrt',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
It looks like it made 100 trees (see n_estimators). And if we want to, we can explore each
tree. They are found under the .estimators_ attribute.
Note
Scikit-learn has many conventions. We discussed how raising max_ hyperparameters
would make the model more complex.
Another convention is that any attribute that ends with an underscore (_), like
.estimators_, is created during the call to the .fit method. Sometimes these will be
insights, metadata, or tables of data.
In this case, it is a list of the trees created while training the model.
We can validate that there are 100 trees.
>>> len(rf.estimators_)
100
Every “estimator” is a decision tree. Here is the first tree.
>>> print(rf.estimators_[0])
DecisionTreeClassifier(max_features='sqrt', random_state=1608637542)
We can visualize each tree if desired. Here is the visualization for tree 0 (limited to depth
two so we can read it).
fig, ax = plt.subplots(figsize=(8, 4))
features = list(c for c in X_train.columns)
tree.plot_tree(rf.estimators_[0], feature_names=features,
               filled=True, class_names=rf.classes_, ax=ax,
               max_depth=2, fontsize=6)
Figure 8.1: A visualization of the first two levels of tree 0 from our random forest.
Interestingly, this tree chose Q3_United States of America as the first column. It did not
choose the R column, probably due to column subsampling.
8.3 XGBoost Random Forest
You can also create a random forest with the XGBoost library.
>>> import xgboost as xgb
>>> rf_xg = xgb.XGBRFClassifier(random_state=42)
>>> rf_xg.fit(X_train, y_train)
>>> rf_xg.score(X_test, y_test)
0.7447513812154696
You can also inspect the hyperparameters of this random forest. XGBoost uses different
hyperparameters than scikit-learn.
>>> rf_xg.get_params()
{'colsample_bynode': 0.8,
'learning_rate': 1.0,
'reg_lambda': 1e-05,
'subsample': 0.8,
'objective': 'binary:logistic',
'use_label_encoder': None,
'base_score': 0.5,
'booster': 'gbtree',
'callbacks': None,
'colsample_bylevel': 1,
'colsample_bytree': 1,
'early_stopping_rounds': None,
'enable_categorical': False,
'eval_metric': None,
'feature_types': None,
'gamma': 0,
'gpu_id': -1,
'grow_policy': 'depthwise',
'importance_type': None,
'interaction_constraints': '',
'max_bin': 256,
'max_cat_threshold': 64,
'max_cat_to_onehot': 4,
'max_delta_step': 0,
'max_depth': 6,
'max_leaves': 0,
'min_child_weight': 1,
'missing': nan,
'monotone_constraints': '()',
'n_estimators': 100,
'n_jobs': 0,
'num_parallel_tree': 100,
'predictor': 'auto',
'random_state': 42,
'reg_alpha': 0,
'sampling_method': 'uniform',
'scale_pos_weight': 1,
'tree_method': 'exact',
'validate_parameters': 1,
'verbosity': None}
Let’s visualize the first tree with the plot_tree function of XGBoost.
fig, ax = plt.subplots(figsize=(6,12), dpi=600)
xgb.plot_tree(rf_xg, num_trees=0, ax=ax, size='1,1')
my_dot_export(rf_xg, num_trees=0, filename='img/rf_xg_kag.dot',
              title='First Random Forest Tree', direction='LR')
That tree is big, and a little hard to see what is going on without a microscope. Let’s use
the dtreeviz library and limit the depth to 2 (depth_range_to_display=[0,2]).
viz = dtreeviz.model(rf_xg, X_train=X_train,
                     y_train=y_train,
                     target_name='Job', feature_names=list(X_train.columns),
                     class_names=['DS', 'SE'], tree_index=0)
viz.view(depth_range_to_display=[0,2])
8.4 Random Forest Hyperparameters
Here are most of the hyperparameters for random forests (created with the XGBoost library).
At a high level, you have the tree parameters (like the depth), sampling parameters, the
ensembling parameters (number of trees), and regularization parameters (like gamma, which
prunes back trees to regularize them).
Tree hyperparameters.
Figure 8.2: First random forest tree built with XGBoost.
Figure 8.3: dtreeviz export of XGBoost Random Forest model.
• max_depth=6 - Tree depth. Range [1-large number]
• max_leaves=0 - Number of leaves in a tree. (Not supported by tree_method='exact'). A
value of 0 means no limit, otherwise, the range is [2-large number]
• min_child_weight=1 - Minimum sum of hessian (weight) needed in a child. The larger the
value, the more conservative. Range [0,∞]
• grow_policy="depthwise" - (Only supported with tree_method set to 'hist', 'approx', or
'gpu_hist'.) Split nodes closest to the root. Set to "lossguide" (with max_depth=0) to mimic
LightGBM behavior.
• tree_method="auto" - Use "hist" to use histogram bucketing and increase performance.
"auto" heuristic:
– 'exact' for small data
– 'approx' for larger data
– 'hist' for histograms (limit bins with max_bins (default 256))
Sampling hyperparameters. These work cumulatively. First by the tree, then by level, and
finally by the node.
• colsample_bytree=1 - Subsample columns at each tree. Range (0,1]
• colsample_bylevel=1 - Subsample columns at each level (from tree columns colsample_bytree).
Range (0,1]
• colsample_bynode=1 - Subsample columns at each node split (from tree columns
colsample_bytree). Range (0,1]
• subsample=1 - Sample portion of training data (rows) before each boosting round. Range
(0,1]
• sampling_method='uniform' - Sampling method. 'uniform' samples with equal probability;
'gradient_based' samples in proportion to the regularized absolute value of the gradients
(only supported by tree_method="gpu_hist").
Categorical data hyperparameters. Use the enable_categorical=True parameter and set
columns to Pandas categoricals (.astype('category')).
• max_cat_to_onehot=4 - Upper threshold for using one-hot encoding for categorical
features. Use one-hot encoding when the number of categories is less than this number.
56
8.5. Training the Number of Trees in the Forest
• max_cat_threshold=64 - Maximum number of categories to consider for each split.
Ensembling hyperparameters.
• n_estimators=100 - Number of trees. Range [1, large number]
• early_stopping_rounds=None - Stop creating new trees if eval_metric score has not
improved after n rounds. Range [1,∞]
• eval_metric - Metric for evaluating validation data.
  – Classification metrics
    * 'logloss' - Default classification metric. Negative log-likelihood.
    * 'auc' - Area under the curve of the Receiver Operating Characteristic.
• objective - Learning objective to optimize the fit of the model to data during training.
  – Classification objectives
    * 'binary:logistic' - Default classification objective. Outputs probability.
Regularization hyperparameters.
• learning_rate=.3 - After each boosting round, multiply weights to make it more
conservative. Lower is more conservative. Range (0,1]
• gamma=0 - Minimum loss required to make a further partition. Larger is more
conservative. Range [0,∞]
• reg_alpha=0 - L1 regularization. Increasing will make it more conservative.
• reg_lambda=1 - L2 regularization. Increasing will make it more conservative.
Imbalanced data hyperparameters.
• scale_pos_weight=1 - Consider (count negative) / (count positive) for imbalanced classes.
Range (0, large number)
• max_delta_step=0 - Maximum delta step for leaf output. It might help with imbalanced
classes. Range [0,∞]
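As with scikit-learn, these are passed to the constructor before training. Here is an example with a handful of them; the specific values are arbitrary illustrations, not tuned recommendations:

# Arbitrary illustrative values, not tuned recommendations
rf_custom = xgb.XGBRFClassifier(n_estimators=50,        # ensembling: number of trees
                                max_depth=4,            # tree: depth limit
                                colsample_bynode=0.5,   # sampling: columns per split
                                subsample=0.8,          # sampling: rows per tree
                                reg_lambda=1.0,         # regularization: L2 penalty
                                random_state=42)
rf_custom.fit(X_train, y_train)
rf_custom.score(X_test, y_test)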
8.5 Training the Number of Trees in the Forest
Let’s try out a few different values for n_estimators and inspect what happens to the accuracy
of our random forest model.
from yellowbrick.model_selection import validation_curve
fig, ax = plt.subplots(figsize=(10,4))
viz = validation_curve(xgb.XGBClassifier(random_state=42),
                       X=pd.concat([X_train, X_test], axis='index'),
                       y=np.concatenate([y_train, y_test]),
                       param_name='n_estimators', param_range=range(1, 100, 2),
                       scoring='accuracy', cv=3,
                       ax=ax)
It looks like our model performs best around 29 estimators. A quick check shows that it
performs slightly better than the out-of-the-box model. I will choose the simpler model when
given a choice between models that perform similarly. In this case, a model with 99 trees
appears to perform similarly but is more complex, so I go with the simpler model.
Figure 8.4: Validation curve for random forest model that shows accuracy score as n_estimators changes.
Prefer the simpler model when two models return similar results.
>>> rf_xg29 = xgb.XGBRFClassifier(random_state=42, n_estimators=29)
>>> rf_xg29.fit(X_train, y_train)
>>> rf_xg29.score(X_test, y_test)
0.7480662983425415
8.6 Summary
In this chapter, we introduced a random forest. At its essence, a random forest is a group of
decision trees formed by allowing each tree to sample different rows of data. Each tree can
look at different features as well. The hope is that the trees can glean different insights because
they look at different parts of the data. Finally, the results of the trees are combined to predict
a final result.
8.7 Exercises
1. How do we specify the hyperparameters for an XGBoost model to create a random
forest?
2. Create a random forest for the Kaggle data predicting whether the title is a data scientist
or software engineer.
3. What is the score?
4. You have two models with the same score. One has 200 trees; the other has 40. Which
model do you choose and why?
Chapter 9
XGBoost
At last, we can talk about the most common use of the XGBoost library, making extreme
gradient-boosted models.
This ensemble is created by training a weak model and then training another model to
correct the residuals or mistakes of the model before it.
I like to use a golfing metaphor. A decision tree is like getting one swing to hit the ball
and put it in the hole. A random forest is getting a bunch of different attempts at teeing off
(changing clubs, swing styles, etc.) and then placing the ball at the average of each of those
swings. A boosted tree is hitting the ball once, then going to where it landed and hitting
again (trying to correct the mistake of the previous hit), and then getting to keep hitting the
ball. This metaphor clarifies why boosted models work so well; they can correct the model’s
error as they progress.
9.1 Jargon
I used the term weak model to describe a single tree in boosting. A weak tree is a shallow tree.
Boosting works well for performance reasons. Deeper trees grow their node count at a fast rate
(a full binary tree roughly doubles its nodes with each added level). Using
many shallower trees tends to provide better performance and quicker results. However,
shallow trees might not be able to capture interactions between features.
I also used the term residual. This is the prediction error, the true value minus the predicted
value. For a regression model, where we predict a numeric value, this is the difference
between the actual and predicted values. If the residual is positive, it means our prediction
was too low (we underpredicted). A negative residual means that we overpredicted. A residual of zero means
that the prediction was correct.
For classification tasks, where we are trying to predict a label, you can think of the residual
in terms of probabilities. For a positive label, we would prefer a probability greater than .5.
A perfect model would have a probability of 1. If our model predicted a probability of .4, the
residual is .6, the perfect value (1) minus the predicted value (.4). The subsequent boosting
round would try to predict .6 to correct that value.
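Here is that arithmetic as a tiny sketch (the regression numbers are made up; the classification numbers mirror the example above):

# Regression: residual = actual - predicted
actual, predicted = 10.0, 12.0
residual = actual - predicted      # -2.0: we overpredicted

# Classification: aim for a probability of 1 on a positive label
p_predicted = 0.4
residual = 1 - p_predicted         # 0.6: the next round tries to close this gap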
9.2 Benefits of Boosting
The XGBoost library provides extras we don’t get with a standard decision tree. It has
support to handle missing values. It also has support for learning from categorical features.
Additionally, it has various hyperparameters to tune model behavior and fend off overfitting.
9.3 A Big Downside
However, using XGBoost versus a plain decision tree has a pretty big downside. A decision
tree is easy to understand. I’ve been to a doctor’s appointment where I was diagnosed using
a decision tree (the doctor showed me the choices after). Understanding how a model works
is called explainability or interpretability and is valuable in some business contexts.
Explaining the reasoning for rejecting the loan might help customers feel like they
understand the process. For example, if you were applying for a loan and the bank rejected
you, you might be disappointed. But, if the bank told you that if you deposit $2,000 into your
account, you would be approved for the loan, you might resolve to save that amount and
apply again in the future.
Now, imagine that a data scientist at the bank claims they have a new model better at
predicting who will default on a loan, but this model cannot explain why. The same user
comes in and applies for the loan. They are rejected again but not given a reason. That could
be a big turn-off and cause the customer to look for a new bank. (Perhaps not particularly
problematic for folks with a high probability of defaulting on the loan.) However, if enough
customers get turned off and start looking to move their business elsewhere, that could hurt
the bottom line.
In short, many businesses are willing to accept models with worse performance as long as they are
interpretable. Models that are easy to explain are called white box models. Conversely, hard-to-explain models are called black box models.
To be pedantic, you can explain a boosted model. However, if there are 500 trees in it, that
might be a frustrating experience to walk a customer through each of those trees in order.
9.4 Creating an XGBoost Model
Let’s load the libraries and the xg_helpers library. The xg_helpers library has the Python
functions I created to load the data and prepare it for modeling.
%matplotlib inline
import dtreeviz
from feature_engine import encoding, imputation
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import base, compose, datasets, ensemble, \
    metrics, model_selection, pipeline, preprocessing, tree
import scikitplot
import xgboost as xgb
import yellowbrick.model_selection as ms
from yellowbrick import classifier
import urllib
import zipfile
import xg_helpers as xhelp
url = 'https://github.com/mattharrison/datasets/raw/master/data/'\
      'kaggle-survey-2018.zip'
fname = 'kaggle-survey-2018.zip'
member_name = 'multipleChoiceResponses.csv'
raw = xhelp.extract_zip(url, fname, member_name)
## Create raw X and raw y
kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')
## Split data
kag_X_train, kag_X_test, kag_y_train, kag_y_test = \
    model_selection.train_test_split(
        kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y)
## Transform X with pipeline
X_train = xhelp.kag_pl.fit_transform(kag_X_train)
X_test = xhelp.kag_pl.transform(kag_X_test)
## Transform y with label encoder
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(kag_y_train)
y_train = label_encoder.transform(kag_y_train)
y_test = label_encoder.transform(kag_y_test)
# Combined Data for cross validation/etc
X = pd.concat([X_train, X_test], axis='index')
y = pd.Series([*y_train, *y_test], index=X.index)
Now we have our training data and test data and are ready to create models.
9.5 A Boosted Model
The XGBoost library makes it really easy to convert from a decision tree to a random forest
or a boosted model. (The scikit-learn library also makes trying other models, like logistic
regression, support vector machines, or k-nearest neighbors easy.) Let’s make our first boosted
model.
All we need to do is create an instance of the model (generally, this is the only thing we
need to change to try a different model). Then we call the .fit method to train the model on
the training data.
We can evaluate the model using the .score method, which tells us the accuracy of the
model. Out of the box, the accuracy of the boosted model (.745) is quite a bit better than both
a decision tree (.73 for a depth of 7) and a random forest (.72 from scikit-learn).
>>> xg_oob = xgb.XGBClassifier()
>>> xg_oob.fit(X_train, y_train)
>>> xg_oob.score(X_test, y_test)
0.7458563535911602
The default model performs 100 rounds of boosting. (Again, this is like getting 100 chances
to keep hitting the golf ball and move it toward the hole.) Let’s see what happens to our
accuracy if we only hit two times (limiting the tree estimators to two and the depth of each
tree to two).
>>> # Let's try w/ depth of 2 and 2 trees
>>> xg2 = xgb.XGBClassifier(max_depth=2, n_estimators=2)
>>> xg2.fit(X_train, y_train)
>>> xg2.score(X_test, y_test)
0.6685082872928176
Our performance dropped quite a bit. Down to .668. Playing around with these
hyperparameters can have a large impact on our model’s performance.
Let’s look at what the first tree looks like (We will use tree_index=0 to indicate this). I will
use the dtreeviz package to show the tree. It looks like it first considers whether the user uses
the R programming language. Then it looks at whether the user studied CS. (Note that it
doesn’t have to look at the same feature for a level, but in this case, the CS feature turns out
to be most informative for both splits.)
import dtreeviz
viz = dtreeviz.model(xg2, X_train=X, y_train=y, target_name='Job',
                     feature_names=list(X_train.columns),
                     class_names=['DS', 'SE'], tree_index=0)
viz.view(depth_range_to_display=[0,2])
Figure 9.1: Visualization of first booster from xgboost model.
The dtreeviz package is good for understanding what the trees are doing and how the
splits break up the groups.
9.6 Understanding the Output of the Trees
One thing that the dtreeviz does not show with XGBoost models is the score for the nodes.
Let’s use XGBoost to plot the same booster (using num_trees=0) and trace through an
example to understand the leaf scores.
I’m using my function, my_dot_export, because it uses my fonts.
xhelp.my_dot_export(xg2, num_trees=0, filename='img/xgb_md2.dot',
                    title='First Tree')
Note
You can just use the plot_tree function if you want to view this in a Jupyter notebook
xgb.plot_tree(xg2, num_trees=0)
Figure 9.2: Visualization of first tree from xgboost model.
Let’s trace through an example and see what our model predicts.
Here is the prediction for this two-tree model. It predicts 49.8% data scientist and 50.1%
software engineer. Because 50.1% is greater than 50%, our model predicts 1 or software
engineer for this example.
We can ask the model to predict probabilities with .predict_proba.
>>> # Predicts 1 - Software engineer
>>> se7894 = pd.DataFrame({'age': {7894: 22},
... 'education': {7894: 16.0},
... 'years_exp': {7894: 1.0},
... 'compensation': {7894: 0},
... 'python': {7894: 1},
... 'r': {7894: 0},
... 'sql': {7894: 0},
... 'Q1_Male': {7894: 1},
... 'Q1_Female': {7894: 0},
... 'Q1_Prefer not to say': {7894: 0},
... 'Q1_Prefer to self-describe': {7894: 0},
... 'Q3_United States of America': {7894: 0},
... 'Q3_India': {7894: 1},
... 'Q3_China': {7894: 0},
... 'major_cs': {7894: 0},
... 'major_other': {7894: 0},
... 'major_eng': {7894: 0},
... 'major_stat': {7894: 0}})
>>> xg2.predict_proba(se7894)
array([[0.4986236, 0.5013764]], dtype=float32)
Or we can predict just the value with .predict:
>>> # Predicts 1 - Software engineer
>>> xg2.predict(pd.DataFrame(se7894))
array([1])
For our example, the user did not use R, so we took a left turn at the first node. They did
not study CS, so we moved left at that node and ended with a leaf value of -0.084.
Now let’s trace what happens in the second tree.
First, let’s plot it. Remember that this tree tries to fix the predictions of the tree before it.
(We specify num_trees=1 for the second tree because Python is zero-based.)
xhelp.my_dot_export(xg2, num_trees=1, filename='img/xgb_md2_tree1.dot', title='Second Tree')
Figure 9.3: Visualization of second booster from xgboost model.
Again, our example does not use R, so we go left on the first node. The education level is
16, so we take another left at the next node and end at a value of 0.0902.
Our prediction is made by adding these leaf values together and taking the inverse logit
of the sum. This returns the probability of the positive class.
def inv_logit(p):
    return np.exp(p) / (1 + np.exp(p))
>>> inv_logit(-0.08476+0.0902701)
0.5013775215147345
We calculate the inverse logit of the sum of the leaf values and come up with .501. Again
this corresponds to 50.1% software engineer (with binary classifiers, this is the percent that
the label is 1, or software engineer).
If there were more trees, we would repeat this process and add up the values in the leaves.
Note that due to the inverse logistic function, values above 0 push towards the positive label,
and values below zero push to the negative label.
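As a quick check on that arithmetic, we can ask the underlying booster for the raw margin directly. This is a minimal sketch (it assumes the default base_score of 0.5, whose logit is 0, so the margin is just the sum of the leaf values):

import numpy as np
import xgboost as xgb

# The raw margin is the sum of the leaf values across the trees
# (plus the logit of base_score, which is 0 for the default 0.5).
margin = xg2.get_booster().predict(xgb.DMatrix(se7894), output_margin=True)
print(margin)                     # roughly -0.08476 + 0.09027 = 0.0055
print(1 / (1 + np.exp(-margin)))  # roughly 0.5014, matching predict_proba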
9.7 Summary
XGBoost can improve the performance of traditional decision trees and random forest
techniques. Decision trees are simpler models and tend to underfit in certain situations,
while random forests are better at generalizing to unseen data points. XGBoost combines
the benefits of both approaches by using decision trees as its base learners and building an
ensemble of trees with the help of boosting. It works by training multiple decision trees with
different parameters and combining the results into a powerful model. XGBoost also has
additional features such as regularization, early stopping, and feature importance making it
more effective than regular decision tree and random forest algorithms.
9.8 Exercises
1. What are the benefits of an XGBoost model versus a Decision Tree model?
2. Create an XGBoost model with a dataset of your choosing for classification. What is the
accuracy score?
3. Compare your model with a Decision Tree model.
4. Visualize the first tree of your model.
Chapter 10
Early Stopping
Early stopping in XGBoost is a technique that reduces overfitting in XGBoost models. You
can tell XGBoost to use early stopping when you call the .fit method. Early stopping works
by monitoring the performance of a model on a validation set and automatically stopping
training when the model’s performance on the validation set no longer improves. This helps
to improve the model’s generalization, as the model is not overfitted to the training data. Early
stopping is helpful because it can prevent overfitting and help to improve XGBoost models’
generalization.
10.1 Early Stopping Rounds
Here are the results of XGBoost using out-of-the-box behavior. By default, the model will
create 100 trees. This model labels about 74% of the test examples correctly.
>>> # Defaults
>>> xg = xgb.XGBClassifier()
>>> xg.fit(X_train, y_train)
>>> xg.score(X_test, y_test)
0.7458563535911602
Now, we are going to provide the early_stopping_rounds parameter. Note that we also
specify an eval_set parameter. The eval_set parameter is used in the .fit method to provide
a list of (X, y) tuples for evaluation to use as a validation dataset. This dataset evaluates the
model’s performance and adjusts the hyperparameters during the model training process. We
pass in two tuples, the training data and the testing data. When you combine the evaluation
data with the early stopping parameter, XGBoost will look at the results of the last tuple. We
passed in 20 for early_stopping_rounds, so XGBoost will build up to 100 trees, but if the log loss
hasn’t improved after the following 20 trees, it will stop building trees.
You can see the line where the log loss bottoms out. Note that the score on the left, 0.43736,
is the training score and the score on the right, 0.5003, is the testing score and the value used
for early stopping.
[12]  validation_0-logloss:0.43736  validation_1-logloss:0.5003
(Note that this is the thirteenth tree because we started counting at 0.)
>>> xg = xgb.XGBClassifier(early_stopping_rounds=20)
>>> xg.fit(X_train, y_train,
...        eval_set=[(X_train, y_train),
...                  (X_test, y_test)
...                 ]
... )
>>> xg.score(X_test, y_test)
[0]   validation_0-logloss:0.61534  validation_1-logloss:0.61775
[1]   validation_0-logloss:0.57046  validation_1-logloss:0.57623
[2]   validation_0-logloss:0.54011  validation_1-logloss:0.55333
[3]   validation_0-logloss:0.51965  validation_1-logloss:0.53711
[4]   validation_0-logloss:0.50419  validation_1-logloss:0.52511
[5]   validation_0-logloss:0.49176  validation_1-logloss:0.51741
[6]   validation_0-logloss:0.48159  validation_1-logloss:0.51277
[7]   validation_0-logloss:0.47221  validation_1-logloss:0.51040
[8]   validation_0-logloss:0.46221  validation_1-logloss:0.50713
[9]   validation_0-logloss:0.45700  validation_1-logloss:0.50583
[10]  validation_0-logloss:0.45062  validation_1-logloss:0.50430
[11]  validation_0-logloss:0.44533  validation_1-logloss:0.50338
[12]  validation_0-logloss:0.43736  validation_1-logloss:0.50033
[13]  validation_0-logloss:0.43399  validation_1-logloss:0.50034
[14]  validation_0-logloss:0.43004  validation_1-logloss:0.50192
[15]  validation_0-logloss:0.42550  validation_1-logloss:0.50268
[16]  validation_0-logloss:0.42169  validation_1-logloss:0.50196
[17]  validation_0-logloss:0.41854  validation_1-logloss:0.50223
[18]  validation_0-logloss:0.41485  validation_1-logloss:0.50360
[19]  validation_0-logloss:0.41228  validation_1-logloss:0.50527
[20]  validation_0-logloss:0.40872  validation_1-logloss:0.50839
[21]  validation_0-logloss:0.40490  validation_1-logloss:0.50623
[22]  validation_0-logloss:0.40280  validation_1-logloss:0.50806
[23]  validation_0-logloss:0.39942  validation_1-logloss:0.51007
[24]  validation_0-logloss:0.39807  validation_1-logloss:0.50987
[25]  validation_0-logloss:0.39473  validation_1-logloss:0.51189
[26]  validation_0-logloss:0.39389  validation_1-logloss:0.51170
[27]  validation_0-logloss:0.39040  validation_1-logloss:0.51218
[28]  validation_0-logloss:0.38837  validation_1-logloss:0.51135
[29]  validation_0-logloss:0.38569  validation_1-logloss:0.51202
[30]  validation_0-logloss:0.37945  validation_1-logloss:0.51352
[31]  validation_0-logloss:0.37840  validation_1-logloss:0.51545
0.7558011049723757
We can ask the model what the limit was by inspecting the .best_ntree_limit attribute:
>>> xg.best_ntree_limit
13
Note
If this were implemented in scikit-learn, the attribute .best_ntree_limit would have a
trailing underscore because it was learned by fitting the model. Alas, we live in a world
of inconsistencies.
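As a hedged aside (attribute availability depends on your XGBoost version), the scikit-learn wrapper also exposes a zero-based best_iteration, and iteration_range lets you predict with only the trees up to that round:

# best_iteration is the zero-based index of the best boosting round;
# iteration_range=(0, n) predicts with trees [0, n).
print(xg.best_iteration)   # 12 here, i.e. the thirteenth tree
preds = xg.predict(X_test, iteration_range=(0, xg.best_iteration + 1))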
10.2 Plotting Tree Performance
Let's explore the scores that occurred while fitting the model. The .evals_result method returns a data structure containing the results from the eval_set.
>>> # validation_0 is for training data
>>> # validation_1 is for testing data
>>> results = xg.evals_result()
>>> results
{'validation_0': OrderedDict([('logloss',
[0.6153406503923696,
0.5704566627034644,
0.5401074953836288,
0.519646179894983,
0.5041859194071372,
0.49175883369140716,
0.4815858465553177,
0.4722135672319274,
0.46221246084118905,
0.4570046103131291,
0.45062119092139025,
0.44533101600634545,
0.4373589513231934,
0.4339914069003403,
0.4300442738158372,
0.42550266018419824,
0.42168949383456633,
0.41853931894949614,
0.41485192559138645,
0.4122836278413833,
0.4087179538231096,
0.404898268053467,
0.4027963532207719,
0.39941699938733854,
0.3980718078477953,
0.39473153180519993,
0.39388538948800944,
0.39039599470886893,
0.38837148147752126,
0.38569152626668,
0.3794510693344513,
0.37840359436957194,
0.37538466192241227])]),
'validation_1': OrderedDict([('logloss',
[0.6177459120091813,
0.5762297115602546,
0.5533292921537852,
0.5371078260695736,
0.5251118483299708,
0.5174100387491574,
0.5127666981510036,
0.5103968678752362,
0.5071349115538004,
0.5058257413585542,
0.5043005662687247,
0.5033770955193438,
0.5003349146419797,
0.5003436393562437,
0.5019165392779843,
0.502677517614806,
0.501961292550791,
0.5022262006329157,
0.5035970173261607,
0.5052709663297096,
0.508388655664636,
0.5062287504923689,
0.5080608455824424,
0.5100736726054829,
0.5098673969229365,
0.5118910041889845,
0.5117007332982608,
0.5121825202836434,
0.5113475993625531,
0.5120185821281118,
0.5135189292720874,
0.5154504034915188,
0.5158137131755071])])}
Let’s plot that data and visualize what happens as we add more trees:
# Testing score is best at 13 trees
results = xg.evals_result()
fig, ax = plt.subplots(figsize=(8, 4))
ax = (pd.DataFrame({'training': results['validation_0']['logloss'],
                    'testing': results['validation_1']['logloss']})
      .assign(ntrees=lambda adf: range(1, len(adf)+1))
      .set_index('ntrees')
      .plot(figsize=(5,4), ax=ax,
            title='eval_results with early_stopping')
)
ax.annotate('Best number \nof trees (13)', xy=(13, .498),
            xytext=(20, .42), arrowprops={'color': 'k'})
ax.set_xlabel('ntrees')
Now, let’s train a model with 13 trees (also referred to as estimators) and see how the model
performs:
>>> # Using value from early stopping gives same result
>>> xg13 = xgb.XGBClassifier(n_estimators=13)
>>> xg13.fit(X_train, y_train,
...          eval_set=[(X_train, y_train),
...                    (X_test, y_test)]
... )
>>> xg13.score(X_test, y_test)
It looks like this model gives the same results that the early_stopping model did.
>>> xg.score(X_test, y_test)
0.7558011049723757
Figure 10.1: Image showing that the testing score might stop improving
Finally, let’s look at the model that creates exactly 100 trees. The score is worse than only
using 13!
>>> # No early stopping, uses all estimators
>>> xg_no_es = xgb.XGBClassifier()
>>> xg_no_es.fit(X_train, y_train)
>>> xg_no_es.score(X_test, y_test)
0.7458563535911602
What is the takeaway from this chapter? The early_stopping_rounds parameter helps to
prevent overfitting.
10.3 Different eval_metrics
The eval_metric options for XGBoost are metrics used to evaluate the results of an XGBoost
model. They provide a way to measure the performance of a model on a dataset. Some of the
most common eval_metric options for classification are:
1. Log loss ('logloss') - The default measure. Log loss is the negative log-likelihood of the predicted probabilities given the true labels. It penalizes confident but incorrect predictions heavily.
2. Area under the curve ('auc') - AUC is used to measure the performance of a binary
classification model. It is the Area Under a receiver operating characteristic Curve.
3. Accuracy ('error') - Accuracy is the percent of correct predictions. XGBoost uses the error rate, which is one minus accuracy, because it minimizes the metric.
4. You can also create a custom function for evaluation (see the sketch after this list).
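Here is a minimal, hedged sketch of option 4. It assumes a recent XGBoost (1.6 or later), where the scikit-learn wrapper accepts a callable eval_metric, treats it as a cost to minimize, and, for the built-in binary objective, passes predicted probabilities to it; the metric and the 0.5 threshold are just illustrative choices.

from sklearn.metrics import balanced_accuracy_score

def balanced_error(y_true, y_prob):
    # Return an error-style value (lower is better) so early stopping
    # can minimize it; threshold the probabilities at 0.5 first.
    return 1 - balanced_accuracy_score(y_true, y_prob > 0.5)

xg_cust = xgb.XGBClassifier(early_stopping_rounds=20,
                            eval_metric=balanced_error)
xg_cust.fit(X_train, y_train,
            eval_set=[(X_train, y_train), (X_test, y_test)],
            verbose=False)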
In the following example, I set the eval_metric to 'error'.
>>> xg_err = xgb.XGBClassifier(early_stopping_rounds=20,
...                            eval_metric='error')
>>> xg_err.fit(X_train, y_train,
...            eval_set=[(X_train, y_train),
...                      (X_test, y_test)
...                     ]
... )
>>> xg_err.score(X_test, y_test)
[0]   validation_0-error:0.24739  validation_1-error:0.27072
[1]   validation_0-error:0.24218  validation_1-error:0.26188
[2]   validation_0-error:0.23839  validation_1-error:0.24751
[3]   validation_0-error:0.23697  validation_1-error:0.25193
[4]   validation_0-error:0.23081  validation_1-error:0.24530
[5]   validation_0-error:0.22607  validation_1-error:0.24420
[6]   validation_0-error:0.22180  validation_1-error:0.24862
[7]   validation_0-error:0.21801  validation_1-error:0.24862
[8]   validation_0-error:0.21280  validation_1-error:0.25304
[9]   validation_0-error:0.21043  validation_1-error:0.25304
[10]  validation_0-error:0.20806  validation_1-error:0.24641
[11]  validation_0-error:0.20284  validation_1-error:0.25193
[12]  validation_0-error:0.20047  validation_1-error:0.24420
[13]  validation_0-error:0.19668  validation_1-error:0.24420
[14]  validation_0-error:0.19384  validation_1-error:0.24530
[15]  validation_0-error:0.18815  validation_1-error:0.24199
[16]  validation_0-error:0.18531  validation_1-error:0.24199
[17]  validation_0-error:0.18389  validation_1-error:0.23867
[18]  validation_0-error:0.18531  validation_1-error:0.23757
[19]  validation_0-error:0.18815  validation_1-error:0.23867
[20]  validation_0-error:0.18246  validation_1-error:0.24199
[21]  validation_0-error:0.17915  validation_1-error:0.24862
[22]  validation_0-error:0.17867  validation_1-error:0.24751
[23]  validation_0-error:0.17630  validation_1-error:0.24199
[24]  validation_0-error:0.17488  validation_1-error:0.24309
[25]  validation_0-error:0.17251  validation_1-error:0.24530
[26]  validation_0-error:0.17204  validation_1-error:0.24309
[27]  validation_0-error:0.16825  validation_1-error:0.24199
[28]  validation_0-error:0.16730  validation_1-error:0.24088
[29]  validation_0-error:0.16019  validation_1-error:0.24199
[30]  validation_0-error:0.15782  validation_1-error:0.24972
[31]  validation_0-error:0.15972  validation_1-error:0.24862
[32]  validation_0-error:0.15924  validation_1-error:0.24641
[33]  validation_0-error:0.15403  validation_1-error:0.25635
[34]  validation_0-error:0.15261  validation_1-error:0.25525
[35]  validation_0-error:0.15213  validation_1-error:0.25525
[36]  validation_0-error:0.15166  validation_1-error:0.25525
[37]  validation_0-error:0.14550  validation_1-error:0.25525
[38]  validation_0-error:0.14597  validation_1-error:0.25083
0.7624309392265194
Now, the best number of trees when we minimize error is 19.
>>> xg_err.best_ntree_limit
19
10.4 Summary
The early_stopping_rounds parameter in XGBoost specifies how many additional trees to build after the best evaluation metric value is seen before giving up. This is a useful feature to avoid overfitting, as it allows you to specify a validation set and track the model's performance on it. If the performance hasn't improved after the given number of rounds, training is stopped, and the best model is kept. This helps to avoid wasting time and resources on training a model that is unlikely to improve.
Another benefit of the early stopping rounds parameter is that it allows you to select the
best model with the least training time. In other words, it’s like getting to the finish line first
without running the entire race!
10.5 Exercises
1. What is the purpose of early stopping in XGBoost?
2. How is the early_stopping_rounds parameter used in XGBoost?
3. How does the eval_set parameter work with the early stopping round parameter in
XGBoost?
Chapter 11
XGBoost Hyperparameters
XGBoost provides relatively good performance out of the box. However, it is also a
complicated model with many knobs and dials. You can adjust these dials to improve the
results. This chapter will introduce many of these knobs and dials, called hyperparameters.
11.1 Hyperparameters
What does the term hyperparameter even mean? In programming, a parameter lets us control
how a function behaves. Hyperparameters allow us to control how a machine-learning model
behaves.
The scikit-learn library has some basic conventions for hyperparameters that allow us to
think of them as levers that push a model towards overfitting or underfitting. One convention
is that hyperparameters that start with max_ will tend to complicate the model (and lead to
overfitting) when you raise the value. Conversely, they will simplify the model if you lower
them (and lead to underfitting). Often there is a sweet spot in between.
We saw an example of this earlier when we looked at decision trees. Raising the max_depth
hyperparameter will add child nodes, complicating the model. Lowering it will simplify the
model (degrading it into a decision stump).
There is a similar convention for parameters starting with min_. Those tend to simplify
when you raise them and complicate when you lower them.
I will split up the parameters by what they impact. There are parameters for tree creation,
sampling, categorical data, ensembling, regularization, and imbalanced data.
Here are the tree hyperparameters:
• max_depth=6 - Tree depth. How many feature interactions you can have. Larger is more
complex (more likely to overfit). Each level doubles the time. Range [0, ∞].
• max_leaves=0 - Number of leaves in a tree. A larger number is more complex. (Not
supported by tree_method='exact'). Range [0, ∞].
• min_child_weight=1 - Minimum sum of hessian (weight) needed in a child. The larger the
value, the more conservative. Range [0,∞]
• grow_policy='depthwise' - Only supported with tree_method set to 'hist', 'approx', or 'gpu_hist'. Splits the nodes closest to the root. Set to 'lossguide' (with max_depth=0) to mimic LightGBM behavior.
• tree_method='auto' - Use 'hist' to use histogram bucketing and increase performance. The 'auto' heuristic chooses:
– 'exact' for small data
– 'approx' for larger data
– 'hist' for histograms (limit bins with max_bin (default 256))
These are the sampling hyperparameters. They work cumulatively: first by tree, then by level, then by node. If colsample_bytree, colsample_bylevel, and colsample_bynode are all .5, then a given node split will only look at (.5 * .5 * .5) 12.5% of the original columns (see the short sketch after this list).
• colsample_bytree=1 - Subsample columns at each tree. Range (0,1]. Recommended to
search [.5, 1].
• colsample_bylevel=1 - Subsample columns at each level (from tree columns colsample_bytree).
Range (0,1]. Recommended to search [.5, 1].
• colsample_bynode=1 - Subsample columns at each node split (from tree columns
colsample_bytree). Range (0,1]. Recommended to search [.5, 1].
• subsample=1 - Sample portion of training data (rows) before each boosting round. Range
(0,1]. Lower to make more conservative. (When not equal to 1.0, the model performs
stochastic gradient descent, i.e., there is some randomness in the model.) Recommended
to search [.5, 1].
• sampling_method='uniform' - Sampling method. 'uniform' selects rows with equal probability; 'gradient_based' selects with probability proportional to the regularized absolute value of the gradients (only supported by tree_method='gpu_hist').
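Here is the quick sketch promised above; the specific fractions are only illustrative:

# With .5 at each stage, the fractions multiply, so a node split sees
# about .5 * .5 * .5 = 12.5% of the original columns; subsample controls
# the rows drawn before each boosting round.
xg_sampled = xgb.XGBClassifier(colsample_bytree=0.5,
                               colsample_bylevel=0.5,
                               colsample_bynode=0.5,
                               subsample=0.8,
                               random_state=42)
print(0.5 * 0.5 * 0.5)  # 0.125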
Hyperparameters for managing categorical data (use the enable_categorical=True parameter and set the columns to Pandas categoricals with .astype('category')):
• max_cat_to_onehot=4 - Upper threshold for using one-hot encoding for categorical
features. One-hot encoding is used when the number of categories is less than this
number.
• max_cat_threshold=64 - Maximum number of categories to consider for each split.
Ensembling hyperparameters control how the subsequent trees are created:
• n_estimators=100 - Number of trees. Larger is more complex. Default 100. Use
early_stopping_rounds to prevent overfitting. You don’t really need to tune if you use
early_stopping_rounds.
• early_stopping_rounds=None - Stop creating new trees if eval_metric score has not
improved after n rounds. Range [1,∞]
• eval_metric - Metric for evaluating validation data for evaluating early stopping.
– Classification metrics
* 'logloss' - Default classification metric. Negative log-likelihood.
* 'auc' - Area under the curve of Receiver Operating Characteristic.
• objective - Learning objective to optimize the fit of the model to data during training.
– Classification objectives
* 'binary:logistic' - Default classification objective. Outputs probability.
* 'multi:softmax' - Multi-class classification objective.
Regularization hyperparameters control the complexity of the overall model:
• learning_rate=.3 - After each boosting round, the leaf weights are multiplied by this value to make the model more conservative. Lower is more conservative. Typically when you lower this, you want to raise n_estimators. Range (0, 1]
• gamma=0 / min_split_loss - L0 regularization. Prunes tree to remove splits that don’t
meet the given value. Global regularization. Minimum loss required for making a split.
Larger is more conservative. Range [0, ∞), default 0 - no regularization. Recommended
search is (0, 1, 10, 100, 1000…)
• reg_alpha=0 - L1 (lasso) regularization, based on the absolute values of the weights. Increase to be more conservative. Range [0, ∞)
• reg_lambda=1 - L2 (ridge) regularization, based on the squared weights. Increase to be more conservative. Range [0, ∞)
Imbalanced data hyperparameters:
• scale_pos_weight=1 - Set to (count of negative cases) / (count of positive cases) for imbalanced classes (see the short sketch after this list). Range (0, large number)
• max_delta_step=0 - Maximum delta step for leaf output. It might help with imbalanced classes. Range [0, ∞)
• Use 'auc' or 'aucpr' for the eval_metric (rather than the classification default 'logloss')
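Here is a short, hedged sketch of the scale_pos_weight recipe, assuming the binary y_train used throughout this book (the estimator name xg_bal is just for illustration):

# Ratio of negative to positive examples in the training labels.
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
xg_bal = xgb.XGBClassifier(scale_pos_weight=neg / pos,
                           eval_metric='aucpr')
xg_bal.fit(X_train, y_train)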
11.2 Examining Hyperparameters
You can set hyperparameters by passing them into the constructor when creating an XGBoost
model. If you want to inspect hyperparameters, you can use the .get_params method:
>>> xg = xgb.XGBClassifier() # set the hyperparameters in here
>>> xg.fit(X_train, y_train)
>>> xg.get_params()
{'objective': 'binary:logistic',
'use_label_encoder': None,
'base_score': 0.5,
'booster': 'gbtree',
'callbacks': None,
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 1,
'early_stopping_rounds': None,
'enable_categorical': False,
'eval_metric': None,
'feature_types': None,
'gamma': 0,
'gpu_id': -1,
'grow_policy': 'depthwise',
'importance_type': None,
'interaction_constraints': '',
'learning_rate': 0.300000012,
'max_bin': 256,
'max_cat_threshold': 64,
'max_cat_to_onehot': 4,
'max_delta_step': 0,
'max_depth': 6,
'max_leaves': 0,
'min_child_weight': 1,
'missing': nan,
'monotone_constraints': '()',
'n_estimators': 100,
'n_jobs': 0,
'num_parallel_tree': 1,
'predictor': 'auto',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1,
'sampling_method': 'uniform',
'scale_pos_weight': 1,
'subsample': 1,
'tree_method': 'exact',
'validate_parameters': 1,
'verbosity': None}
This shows the default values for each hyperparameter.
11.3 Tuning Hyperparameters
Hyperparameters can be a bit tricky to tune. Let’s explore how to tune a single value first.
To get started, you’ll want to use validation curves. These plots show the relationships
between your model’s performance and the values of specific hyperparameters. They show
the values of hyperparameters that yield the best results.
I like to plot a validation curve by using the validation_curve function from the Yellowbrick
library.
Let's examine how the gamma parameter impacts our model. You can think of gamma as a pruner for the trees. It limits growth unless a specified loss reduction is met. The larger the value, the simpler (more conservative, tending toward underfitting) our model.
fig, ax = plt.subplots(figsize=(8, 4))
ms.validation_curve(xgb.XGBClassifier(), X_train, y_train, param_name='gamma',
                    param_range=[0, .5, 1, 5, 10, 20, 30], n_jobs=-1, ax=ax)
It looks like gamma is best between 10 and 20. We could run another validation curve to
dive into that area. However, we will soon see another mechanism for finding a good value
without requiring us to specify every single option we want to consider.
11.4 Intuitive Understanding of Learning Rate
Returning to our golf metaphor, the learning rate hyperparameter is like how hard you swing.
It’s all about finding the right balance between caution and aggression. Just like in golf, if you
swing too hard, you can end up in the rough. But if you swing too softly, you’ll never get the
ball in the hole.
With XGBoost, if you set the learning_rate hyperparameter too high, your model will be
over-confident and over-fit the data. On the other hand, if you set learning_rate too low, your
model will take too long to learn and never reach its full potential.
We want to find the right balance that maximizes your model’s accuracy. You’ll have to
experiment to find out where that sweet spot is. In golf – you experiment with different
techniques, adjust your swing and your grip, and it’ll take some practice, but eventually, you
might find the perfect combination that will allow you to hit a hole-in-one.
Let’s go back to our model with two layers of depth and examine the values in the leaf
node when the learning_rate is set to 1.
# check the impact of the learning rate on scores
xg_lr1 = xgb.XGBClassifier(learning_rate=1, max_depth=2)
xg_lr1.fit(X_train, y_train)
Figure 11.1: Validation curve for the gamma hyperparameter. Around 10-20 the score is maximized for the cross validation line. The cross validation line gives us a feel for how the model will perform on unseen data, while the training score helps us to understand if the model is overfitting.
my_dot_export(xg_lr1, num_trees=0, filename='img/xg_depth2_tree0.dot',
              title='Learning Rate set to 1')
Figure 11.2: When the learning rate is set to one, we preserve the log loss score.
Now we are going to decrease the learning rate. Remember, this is like being really timid
with our golf swing.
# check the impact of the learning rate on scores
xg_lr001 = xgb.XGBClassifier(learning_rate=.001, max_depth=2)
xg_lr001.fit(X_train, y_train)
my_dot_export(xg_lr001, num_trees=0, filename='img/xg_depth2_tree0_lr001.dot',
              title='Learning Rate set to .001')
Figure 11.3: Lowering the learning rate essentially multiplies the leaf values by this amount in the first tree. Each subsequent tree is now impacted and will shift.
You can see that the learning rate took the values in the leaves and decreased them. Positive
values will still push toward the positive label but at a slower rate. We could also set the
learning rate to a value above 1, but it does not make sense in practice. It is like teeing off
with your largest club when playing miniature golf. You can try it, but you will see that the
model does not perform well.
My suggestion for tuning the learning rate (as you will see when I show step-wise tuning)
is to tune it last. Combine this with early stopping and a large number of trees. If early
stopping doesn’t kick in, raise the number of trees (it means your model is still improving).
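Here is a sketch of that recipe; the specific numbers (learning rate, tree budget, patience) are illustrative, not values from the text:

# A smallish learning rate, a generous tree budget, and early stopping
# to pick the actual number of trees.
xg_lr_tuned = xgb.XGBClassifier(learning_rate=0.1, n_estimators=2_000,
                                early_stopping_rounds=50)
xg_lr_tuned.fit(X_train, y_train,
                eval_set=[(X_train, y_train), (X_test, y_test)],
                verbose=False)
print(xg_lr_tuned.best_ntree_limit)  # if this hits 2_000, raise n_estimators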
11.5 Grid Search
Calibrating one hyperparameter takes some time, and there are multiple hyperparameters, some of which interact with each other. How do we know which combination is best? We can try various combinations and explore which values perform the best.
The GridSearchCV class from scikit-learn will do this for us. We need to provide it with
a mapping of hyperparameters to potential values for each hyperparameter. It will loop
through all combinations and keep track of the best ones. If you have imbalanced data or
need to optimize for a different metric, you can also set the scoring parameter.
Be careful with grid search. Running grid search on XGBoost can be computationally
expensive, as it can take a long time for XGBoost to run through every combination of
hyperparameters. It can also be difficult to determine how many hyperparameters to test
and which values to use as candidates.
Later, I will show another method I prefer more than grid search for tuning the model.
The example below runs quickly on my machine, but it isn't testing that many combinations. You could imagine that (ignoring random_state) if each of the hyperparameters had ten options, there would be 1,000,000 combinations to test.
from sklearn import model_selection

params = {'reg_lambda': [0],  # No effect
          'learning_rate': [.1, .3],  # makes each boost more conservative
          'subsample': [.7, 1],
          'max_depth': [2, 3],
          'random_state': [42],
          'n_jobs': [-1],
          'n_estimators': [200]}

xgb2 = xgb.XGBClassifier(early_stopping_rounds=5)

cv = (model_selection.GridSearchCV(xgb2, params, cv=3, n_jobs=-1)
      .fit(X_train, y_train,
           eval_set=[(X_test, y_test)],
           verbose=50
      )
)
After running the grid search, we can inspect the .best_params_ attribute. This provides a
dictionary of the hyperparameters with the best result.
Here are the best hyperparameters from our minimal grid search:
>>> cv.best_params_
{'learning_rate': 0.3,
'max_depth': 2,
'n_estimators': 200,
'n_jobs': -1,
'random_state': 42,
'reg_lambda': 0,
'subsample': 1}
Let’s stick those values in a dictionary, and then we can use dictionary unpacking to provide
those values as arguments when we create the model. The verbose=10 parameter tells XGBoost to print the evaluation metrics only every tenth boosting round. We will use .fit to train the model.
params = {'learning_rate': 0.3,
          'max_depth': 2,
          'n_estimators': 200,
          'n_jobs': -1,
          'random_state': 42,
          'reg_lambda': 0,
          'subsample': 1
}

xgb_grid = xgb.XGBClassifier(**params, early_stopping_rounds=50)
xgb_grid.fit(X_train, y_train,
             eval_set=[(X_train, y_train),
                       (X_test, y_test)],
             verbose=10
)
Now let’s train a default model to compare if there is an improvement from using grid
search.
# vs default
xgb_def = xgb.XGBClassifier(early_stopping_rounds=50)
xgb_def.fit(X_train, y_train,
            eval_set=[(X_train, y_train),
                      (X_test, y_test)],
            verbose=10
)
Here’s the accuracy for both the default and the grid models:
>>> xgb_def.score(X_test, y_test), xgb_grid.score(X_test, y_test)
(0.7558011049723757, 0.7524861878453039)
Oddly enough, it looks like the default model is slightly better. This is a case where you
need to be careful. The grid search did a 3-fold cross-validation to find the hyperparameters,
but the final score was calculated against a single holdout set.
K-fold cross-validation (in the grid search case, k was 3) splits your data set into k folds or
groups. Then, it reserves a fold to test the model and trains it on the other data. It repeats this
for each fold. Finally, it averages out the results to create a more accurate prediction. It’s a
great way to understand better how your model will perform in the real world.
When you do k-fold validation, you may get different results than just calling .fit on
a model and .score. To run k-fold outside of grid search, you can use the cross_val_score
function from scikit-learn.
Below, we run cross-validation on the default model:
>>> results_default = model_selection.cross_val_score(
...     xgb.XGBClassifier(),
...     X=X, y=y,
...     cv=4
... )
Here are the scores from each of the four folds:
>>> results_default
array([0.71352785, 0.72413793, 0.69496021, 0.74501992])
Some folds performed quite a bit better than others. Our average score (accuracy) is 72%.
>>> results_default.mean()
0.7194114787534214
Let’s call cross_val_score again with a model created from our grid search hyperparameters.
>>> results_grid = model_selection.cross_val_score(
...     xgb.XGBClassifier(**params),
...     X=X, y=y,
...     cv=4
... )
Here are the fold scores. Note that these values do not deviate as much as the default
model.
>>> results_grid
array([0.74137931, 0.74137931, 0.74801061, 0.73572377])
Here is the mean score:
>>> results_grid.mean()
0.7416232505873941
Our grid search model did perform better (at least for these folds of the data). And the
scores across the folds were more consistent. It is probably the case that the test set we held
out for X_test is easier for the default model to predict.
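One quick way to back up that consistency claim is to compare the spread of the fold scores directly (a convenience check using the arrays we already have):

# Standard deviation of the fold accuracies; the grid-searched model's
# scores spread out less than the default model's.
print(results_default.std(), results_grid.std())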
11.6 Summary
Tuning hyperparameters can feel like a stab in the dark. Validation curves can start us down
the path but only work for a single hyperparameter at a time. You believe there is a sweet spot somewhere, but you still have to hunt for the sweet spots of the other hyperparameters. We can combine this with grid search, but it can be tedious.
We also showed how using k-fold validation can help us understand how consistent the
model is and whether some parts of the data might be easier to model. You will be more
confident in a consistent model.
11.7 Exercises
1. What is the purpose of hyperparameters in XGBoost?
2. How can validation curves be used to tune hyperparameters in XGBoost?
3. What is the impact of the gamma hyperparameter on XGBoost models?
4. How can the learning rate hyperparameter affect the performance of an XGBoost model?
5. How can grid search be used to tune hyperparameters in XGBoost?
Chapter 12
Hyperopt
Remember I said that grid search could be slow? It also only searches over the values that
I provide. What if another value is better, but I didn’t tell grid search about that possible
value? Grid search is naive and would have no way of suggesting that value. Let’s look at
another library that might help with this but also search in a smarter way than the brute force
mechanism of grid search—introducing hyperopt.
Hyperopt is a Python library for optimizing both discrete and continuous hyperparameters for XGBoost. Hyperopt uses Bayesian optimization to tune hyperparameters. This uses
a probabilistic model to select the next set of hyperparameters to try. If one value performs
better, it will try values around it and see if they boost the performance. If the values worsen
the model, then Hyperopt can ignore those values as future candidates. Hyperopt can also
tune various other machine-learning models, including random forests and neural networks.
12.1 Bayesian Optimization
Bayesian optimization algorithms offer several benefits for hyperparameter optimization,
including:
• Efficiency: Bayesian optimization algorithms use probabilistic models to update their
predictions based on previous trials, which allows them to quickly identify promising
areas of the search space and avoid wasteful exploration.
• Accuracy: Bayesian optimization algorithms can incorporate prior knowledge and
adapt to the underlying structure of the optimization problem, which can result in more
accurate predictions.
• Flexibility: Bayesian optimization algorithms are flexible and can be applied to a wide
range of optimization problems, including those with complex and multimodal search
spaces.
Using Bayesian optimization algorithms can significantly improve hyperparameter
optimization’s efficiency, accuracy, and robustness, leading to better models.
12.2 Exhaustive Tuning with Hyperopt
Let’s look at an exhaustive exploration with Hyperopt. This code might be confusing if you
aren’t familiar with the notion of first-class functions in Python. Python allows us to pass a
function in as a parameter to another function or return a function as a result. To use the
Hyperopt library, we need to use the fmin function which tries to find the best hyperparameter
values given a space of many parameters.
The fmin function expects us to pass in another function that accepts a dictionary of
hyperparameters to evaluate and returns a dictionary. Our hyperparameter_tuning function
doesn’t quite fit the bill. It accepts a dictionary, space, as the first parameter but it has
additional parameters. I will use a closure in the form of a lambda function to adapt the
hyperparameter_tuning function to a new function that accepts the hyperparameter dictionary.
The hyperparameter_tuning function takes in a dictionary of hyperparameters (space),
training data (X_train and y_train), test data (X_test and y_test), and an optional value for
early stopping rounds (early_stopping_rounds). Some hyperparameters need to be integers,
so it converts those keys and adds early_stopping_rounds to the hyperparameters to evaluate.
Then it trains a model and returns a dictionary with a negative accuracy score and a status.
Because fmin tries to minimize the score and our model is evaluating the accuracy, we do not
want the model with the minimum accuracy. However, we can do a little trick and minimize
the negative accuracy.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import accuracy_score, roc_auc_score
from typing import Any, Dict, Union

def hyperparameter_tuning(space: Dict[str, Union[float, int]],
                          X_train: pd.DataFrame, y_train: pd.Series,
                          X_test: pd.DataFrame, y_test: pd.Series,
                          early_stopping_rounds: int=50,
                          metric: callable=accuracy_score) -> Dict[str, Any]:
    """
    Perform hyperparameter tuning for an XGBoost classifier.

    This function takes a dictionary of hyperparameters, training
    and test data, and an optional value for early stopping rounds,
    and returns a dictionary with the loss and model resulting from
    the tuning process. The model is trained using the training
    data and evaluated on the test data. The loss is computed as
    the negative of the accuracy score.

    Parameters
    ----------
    space : Dict[str, Union[float, int]]
        A dictionary of hyperparameters for the XGBoost classifier.
    X_train : pd.DataFrame
        The training data.
    y_train : pd.Series
        The training target.
    X_test : pd.DataFrame
        The test data.
    y_test : pd.Series
        The test target.
    early_stopping_rounds : int, optional
        The number of early stopping rounds to use. The default value
        is 50.
    metric : callable
        Metric to maximize. Default is accuracy

    Returns
    -------
    Dict[str, Any]
        A dictionary with the loss and model resulting from the
        tuning process. The loss is a float, and the model is an
        XGBoost classifier.
    """
    int_vals = ['max_depth', 'reg_alpha']
    space = {k: (int(val) if k in int_vals else val)
             for k, val in space.items()}
    space['early_stopping_rounds'] = early_stopping_rounds
    model = xgb.XGBClassifier(**space)
    evaluation = [(X_train, y_train),
                  (X_test, y_test)]
    model.fit(X_train, y_train,
              eval_set=evaluation,
              verbose=False)

    pred = model.predict(X_test)
    score = metric(y_test, pred)
    return {'loss': -score, 'status': STATUS_OK, 'model': model}
After we have defined the function we want to minimize, we need to define the search space for our hyperparameters. Based on a talk by Bradley Boehmke (https://bradleyboehmke.github.io/xgboost_databricks_tuning/index.html#slide21), I've used these values and found them to work quite well. We will cover the functions that describe the search space in a moment.
Next, we create a trials object to store the results of the hyperparameter tuning process.
The fmin function loops over the options to search for the best hyperparameters. I told it
to use the tpe.suggest algorithm, which runs the Tree-structured Parzen Estimator - Expected
Improvement Bayesian Optimization. It uses the expected improvement acquisition function
to guide the search. This algorithm works well with noisy data. We also told the search to
limit the evaluations to 2,000 attempts. My Macbook Pro (2022) takes about 30 minutes to run
this search. On my Thinkpad P1 (2020), it takes over 2 hours. You can set a timeout period as
well.
As this runs, it spits out the best loss scores.
options = {'max_depth': hp.quniform('max_depth', 1, 8, 1),  # tree
           'min_child_weight': hp.loguniform('min_child_weight', -2, 3),
           'subsample': hp.uniform('subsample', 0.5, 1),  # stochastic
           'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
           'reg_alpha': hp.uniform('reg_alpha', 0, 10),
           'reg_lambda': hp.uniform('reg_lambda', 1, 10),
           'gamma': hp.loguniform('gamma', -10, 10),  # regularization
           'learning_rate': hp.loguniform('learning_rate', -7, 0),  # boosting
           'random_state': 42
}

trials = Trials()
best = fmin(fn=lambda space: hyperparameter_tuning(space, X_train, y_train,
                                                   X_test, y_test),
            space=options,
            algo=tpe.suggest,
            max_evals=2_000,
            trials=trials,
            # timeout=60*5  # 5 minutes
)
The best variable holds the results. If I’m using a notebook, I will inspect the values and
make sure I keep track of the hyperparameters. I like to print them out and then stick them
into a dictionary, so I have them for posterity.
# 2 hours of training (paste best in here)
long_params = {'colsample_bytree': 0.6874845219014455,
               'gamma': 0.06936323554883501,
               'learning_rate': 0.21439214284976907,
               'max_depth': 6,
               'min_child_weight': 0.6678357091609912,
               'reg_alpha': 3.2979862933185546,
               'reg_lambda': 7.850943400390477,
               'subsample': 0.999767483950891}
Now we can train a model with these hyperparameters.
xg_ex = xgb.XGBClassifier(**long_params, early_stopping_rounds=50,
                          n_estimators=500)
xg_ex.fit(X_train, y_train,
          eval_set=[(X_train, y_train),
                    (X_test, y_test)
                   ],
          verbose=100
)
How does this model do?
>>> xg_ex.score(X_test, y_test)
0.7580110497237569
This is an improvement over the default out-of-the-box model. Generally, you will get
better improvement in your model by understanding your data and doing appropriate feature
engineering than the improvement you will see with hyperparameter optimization. However,
the two are not mutually exclusive, and you can and should use both.
12.3 Defining Parameter Distributions
Rather than limiting the hyperparameters to selection from a list of options, the hyperopt
library has various functions that let us define how to choose values.
The choice function can specify a discrete set of values for a categorical hyperparameter.
This is similar to enumerating the list of options in a grid search. This function takes two
arguments: a list of possible values and an optional probability distribution over these values.
It returns a random value from the list according to the specified probability distribution.
For example, to generate a random value from a list of possible values ['a', 'b', 'c'],
you can use the following code:
>>> from hyperopt import hp, pyll
>>> pyll.stochastic.sample(hp.choice('value', ['a', 'b', 'c']))
'a'
To generate a random value from a list of possible values ['a', 'b', 'c'] with probabilities [0.05, 0.9, 0.05], use the pchoice function:
>>> pyll.stochastic.sample(hp.pchoice('value', [(.05, 'a'), (.9, 'b'),
...                                             (.05, 'c')]))
'c'
Note
Be careful about using choice and pchoice for numeric values. The Hyperopt library treats
each value as independent and not ordered. Its search algorithm cannot take advantage
of the outcomes of neighboring values. I have found examples of using Hyperopt that
suggest defining the search space for 'num_leaves' and 'subsample' like this:
'num_leaves': hp.choice('num_leaves', list(range(20, 250, 10))),
'subsample': hp.choice('subsample', [0.2, 0.4, 0.5, 0.6, 0.7, .8, .9]),
Do not use the above code! Rather, these should be defined as below so Hyperopt can
efficiently explore the space:
'num_leaves': hp.quniform('num_leaves', 20, 250, 10),
'subsample': hp.uniform('subsample', 0.2, .9),
The uniform function can specify a uniform distribution over a continuous range of values.
This function takes two arguments: the minimum and maximum values of the uniform
distribution, and returns a random floating-point value within this range.
For example, to generate a random floating-point value between 0 and 1 using the uniform
function, you can use the following code:
>>> from hyperopt import hp, pyll
>>> pyll.stochastic.sample(hp.uniform('value', 0, 1))
0.7875384438202859
Let’s call it 10,000 times and then look at a histogram of the result. You can see that it has
an equal probability of choosing any value.
uniform_vals = [pyll.stochastic.sample(hp.uniform('value', 0, 1))
                for _ in range(10_000)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(uniform_vals)
The loguniform function can specify a log-uniform distribution over a continuous range
of values. This function takes two arguments: the minimum and maximum values of the
log-uniform distribution, and it returns a random floating-point value within this range on a
logarithmic scale.
Here is the histogram for the loguniform values from -5 to 5. Note that these values do not
range from -5 to 5, but rather 0.006737 (math.exp(-5)) to 148.41 (math.exp(5)). Note that these
values will strongly favor the low end.
Figure 12.1: Plot of the histogram of uniform values for low=0 and high=1
loguniform_vals = [pyll.stochastic.sample(hp.loguniform('value', -5, 5))
                   for _ in range(10_000)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(loguniform_vals)
Figure 12.2: Plot of the histogram of loguniform values for low=-5 and high=5
Here is another way to look at it. This is a plot of the transform of possible values when the
min and max values are set to -5 and 5, respectively. In the y-axis are the exponential values
that come out of the loguniform function.
fig, ax = plt.subplots(figsize=(8, 4))
(pd.Series(np.arange(-5, 5, step=.1))
 .rename('x')
 .to_frame()
 .assign(y=lambda adf: np.exp(adf.x))
 .plot(x='x', y='y', ax=ax)
)
Figure 12.3: Plot of the loguniform values for low=-5 and high=5
For example, to generate a random floating-point value between 0.1 and 10 using the
loguniform function, you can use the following code:
>>> from hyperopt import hp, pyll
>>> from math import log
>>> pyll.stochastic.sample(hp.loguniform('value', log(.1), log(10)))
3.0090767867889174
If you want values pulled from a log scale, .001, .01, .1, 1, 10, then use loguniform.
The quniform function can be used to specify quantized values drawn uniformly from the range [low, high]. The function also accepts a q parameter that specifies the step (for q=1, it returns every integer in the range; for q=3, it returns every third integer).
quniform_vals = [pyll.stochastic.sample(hp.quniform('value', -5, 5, q=2))
                 for _ in range(10_000)]
Here are the counts for the quniform values from -5 to 5, taking every other value (q=2).
>>> pd.Series(quniform_vals).value_counts()
-0.0    2042
-2.0    2021
 2.0    2001
 4.0    2000
-4.0    1936
dtype: int64
12.4 Exploring the Trials
The trial object from the hyperopt search above has the data about the hyperparameter
optimization.
Feel free to skip this section if you want, but I want to show you how to explore this and
see what Hyperopt is doing.
First, I will make a function to convert the trial object to a Pandas dataframe.
from typing import Any, Dict, Sequence

def trial2df(trial: Sequence[Dict[str, Any]]) -> pd.DataFrame:
    """
    Convert a Trial object (sequence of trial dictionaries)
    to a Pandas DataFrame.

    Parameters
    ----------
    trial : List[Dict[str, Any]]
        A list of trial dictionaries.

    Returns
    -------
    pd.DataFrame
        A DataFrame with columns for the loss, trial id, and
        values from each trial dictionary.
    """
    vals = []
    for t in trial:
        result = t['result']
        misc = t['misc']
        val = {k: (v[0] if isinstance(v, list) else v)
               for k, v in misc['vals'].items()
              }
        val['loss'] = result['loss']
        val['tid'] = t['tid']
        vals.append(val)
    return pd.DataFrame(vals)
Now, let’s use the function to create a dataframe.
>>> hyper2hr = trial2df(trials)
Each row is an evaluation. You can see the hyperparameter settings, the loss score, and
the trial id (tid).
>>> hyper2hr
      colsample_bytree       gamma  learning_rate  ...  subsample      loss   tid
0             0.854670    2.753933       0.042056  ...   0.913247 -0.744751     0
1             0.512653    0.153628       0.611973  ...   0.550048 -0.746961     1
2             0.552569    1.010561       0.002412  ...   0.508593 -0.735912     2
3             0.604020  682.836185       0.005037  ...   0.536935 -0.545856     3
4             0.785281    0.004130       0.015200  ...   0.691211 -0.739227     4
...                ...         ...            ...  ...        ...       ...   ...
1995          0.717890    0.000543       0.141629  ...   0.893414 -0.765746  1995
1996          0.725305    0.000248       0.172854  ...   0.919415 -0.765746  1996
1997          0.698025    0.028484       0.162207  ...   0.952204 -0.770166  1997
1998          0.688053    0.068223       0.099814  ...   0.939489 -0.762431  1998
1999          0.666225    0.125253       0.203441  ...   0.980354 -0.767956  1999

[2000 rows x 10 columns]
I’ll do some EDA (exploratory data analysis) on this data.
I like to look at correlations for numeric data, and it might be interesting to see whether the hyperparameters are correlated with each other or with the loss. Since we are using Spearman correlation, this also tells us whether a feature has a monotonic relationship with the loss score.
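For a quick numeric view before plotting, this one-liner ranks each column's Spearman correlation with the loss (a convenience built from the same corr call used below):

print(hyper2hr.corr(method='spearman')['loss'].sort_values())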
Here is a Seaborn plot to show the correlations.
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(hyper2hr.corr(method='spearman'),
            cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax
)
Note
If you want to create a correlation table in Jupyter with Pandas, you can use this code:
(hyper2hr
 .corr(method='spearman')
 .style
 .background_gradient(cmap='RdBu', vmin=-1, vmax=1)
)

Figure 12.4: Spearman correlation of hyperparameters during Hyperopt process.
There is a negative correlation between loss and tid (trial id). This is to be expected. As
we progress, the negative loss score should drop. The optimization process should balance
exploration and exploitation to minimize the negative loss score. If we do a scatter plot of
these two columns, we see this play out. Occasionally, an exploration attempt proves poor
(the loss value around -0.55), and the Bayesian search process cuts back to better values.
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .plot.scatter(x='tid', y='loss', alpha=.1, color='purple', ax=ax)
)
Let's explore what happened with max_depth and loss, which have a correlation of -0.13.
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .plot.scatter(x='max_depth', y='loss', alpha=1, color='purple', ax=ax)
)
This is a little hard to understand because the max_depth values plot on top of each other. I
will use a technique called jittering to pull them apart and let us understand how many values
there are. I will also adjust the alpha values of the plot so we can see where values overlap.
Here is my jitter function. It adds a random amount of noise around a data point. Let’s
apply it to the max_depth column.
Figure 12.5: Scatter plot of loss vs trial during Hyperopt process. Note that the values tend to lower as
the tid goes up.
Figure 12.6: Scatter plot of loss vs max_depth during Hyperopt process.
import numpy as np

def jitter(df: pd.DataFrame, col: str, amount: float=1) -> pd.Series:
    """
    Add random noise to the values in a Pandas DataFrame column.

    This function adds random noise to the values in a specified
    column of a Pandas DataFrame. The noise is uniform random
    noise with a range of `amount` centered around zero. The
    function returns a Pandas Series with the jittered values.

    Parameters
    ----------
    df : pd.DataFrame
        The input DataFrame.
    col : str
        The name of the column to jitter.
    amount : float, optional
        The range of the noise to add. The default value is 1.

    Returns
    -------
    pd.Series
        A Pandas Series with the jittered values.
    """
    vals = np.random.uniform(low=-amount/2, high=amount/2,
                             size=df.shape[0])
    return df[col] + vals
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .assign(max_depth=lambda df: jitter(df, 'max_depth', amount=.8))
 .plot.scatter(x='max_depth', y='loss', alpha=.1, color='purple', ax=ax)
)
This makes it quite clear that the algorithm spent a good amount of time at depth 6.
If we want to get even fancier, we can color this by trial attempt. The later attempts are
represented by the yellow color.
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .assign(max_depth=lambda df: jitter(df, 'max_depth', amount=.8))
 .plot.scatter(x='max_depth', y='loss', alpha=.5,
               color='tid', cmap='viridis', ax=ax)
)
We could also use Seaborn to create a violin plot of loss against max_depth. I don’t feel like
this lets me see the density as well.
Figure 12.7: Scatter plot of loss vs max_depth during Hyperopt process with jittering. You can see the
majority of the values were in the 5-7 range.
Figure 12.8: Scatter plot of loss vs max_depth during Hyperopt process colored by attempt.
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 4))
sns.violinplot(x='max_depth', y='loss', data=hyper2hr, kind='violin', ax=ax)
Figure 12.9: Violin plot of loss versus max_depth. I prefer the jittered scatter plot because it allows me
to understand the density better.
The correlation between reg_alpha and colsample_bytree was 0.41, meaning there was a slight
tendency for the rank of one value to go up if the rank of the other value went up. Let’s plot
this and see if we can glean any insight.
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .plot.scatter(x='reg_alpha', y='colsample_bytree', alpha=.8,
               color='tid', cmap='viridis', ax=ax)
)
ax.annotate('Min Loss (-0.77)', xy=(4.56, 0.692),
            xytext=(.7, .84), arrowprops={'color': 'k'})
The values for colsample_bytree and reg_alpha were .68 and 3.29 respectively.
Let’s explore gamma and loss that had a correlation of .25.
Here is my initial plot.
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .plot.scatter(x='gamma', y='loss', alpha=.1, color='purple', ax=ax)
)
Figure 12.10: Scatter plot of colsample_bytree vs reg_alpha during Hyperopt process colored by attempt.
Figure 12.11: Scatter plot of loss vs gamma during Hyperopt process.
It is a little hard to tell what is going on here. The stacking on the left side might indicate
a need for jittering. But if you look at the scale, you can see that there are some outliers.
Remember that gamma was defined with this distribution:
'gamma': hp.loguniform('gamma', -10, 10), # regularization
These values range from math.exp(-10) to math.exp(10). Because this is a log-uniform
distribution, they will tend towards the low end of this.
One trick you can use is to plot with a log x-axis. That will help tease these values apart.
fig, ax = plt.subplots(figsize=(8, 4))
(hyper2hr
 .plot.scatter(x='gamma', y='loss', alpha=.5, color='tid', ax=ax,
               logx=True, cmap='viridis')
)
ax.annotate('Min Loss (-0.77)', xy=(0.000581, -0.777),
            xytext=(1, -.6), arrowprops={'color': 'k'})
Figure 12.12: Scatter plot of loss vs gamma during Hyperopt process with log x-axis
This shows how Bayesian searching focuses on the values that provide the best loss. Occasionally, it does some exploring, but any gamma value above 10 performs poorly. The gamma value that it settled on during this run was 0.000581.
12.5 EDA with Plotly
Feel free to skip this section too. I like digging into the data. I find the Plotly library works well when I want to create 3D plots and have some interactivity.
Here is a helper function I wrote that will create a 3D mesh. I will make a mesh of two hyperparameters and plot the score in the third (Z) dimension.
import plotly.graph_objects as go

def plot_3d_mesh(df: pd.DataFrame, x_col: str, y_col: str,
                 z_col: str) -> go.Figure:
    """
    Create a 3D mesh plot using Plotly.

    This function creates a 3D mesh plot using Plotly, with
    the `x_col`, `y_col`, and `z_col` columns of the `df`
    DataFrame as the x, y, and z values, respectively. The
    plot has a title and axis labels that match the column
    names, and the intensity of the mesh is proportional
    to the values in the `z_col` column. The function returns
    a Plotly Figure object that can be displayed or saved as
    desired.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing the data to plot.
    x_col : str
        The name of the column to use as the x values.
    y_col : str
        The name of the column to use as the y values.
    z_col : str
        The name of the column to use as the z values.

    Returns
    -------
    go.Figure
        A Plotly Figure object with the 3D mesh plot.
    """
    # The first string is an f-string (so %{{z}} renders as %{z}); the
    # concatenated plain string uses single braces so Plotly sees %{y}.
    fig = go.Figure(data=[go.Mesh3d(x=df[x_col], y=df[y_col], z=df[z_col],
                          intensity=df[z_col] / df[z_col].min(),
                          hovertemplate=f"{z_col}: %{{z}}<br>{x_col}: %{{x}}<br>{y_col}: "
                                        "%{y}<extra></extra>")],
                    )
    fig.update_layout(
        title=dict(text=f'{y_col} vs {x_col}'),
        scene=dict(
            xaxis_title=x_col,
            yaxis_title=y_col,
            zaxis_title=z_col),
        width=700,
        margin=dict(r=20, b=10, l=10, t=50)
    )
    return fig
Here is the code to plot a mesh of gamma and reg_lambda. In the z-axis, I plot the loss. The
minimum value colors this in the z-axis. The most yellow value has the least loss.
Plotly allows us to interact with the plot, rotate it, zoom in or out, and show values by
hovering.
fig = plot_3d_mesh(hyper2hr.query('gamma < .2'),
                   'reg_lambda', 'gamma', 'loss')
fig
Figure 12.13: 3D mesh plot of loss versus reg_lambda and gamma during the Hyperopt process.
Here is the code to create a scatter plot. (It turns out that this wrapper isn’t really needed
but I provided it for consistency).
import plotly.express as px
import plotly.graph_objects as go

def plot_3d_scatter(df: pd.DataFrame, x_col: str, y_col: str,
                    z_col: str, color_col: str,
                    opacity: float=1) -> go.Figure:
    """
    Create a 3D scatter plot using Plotly Express.

    This function creates a 3D scatter plot using Plotly Express,
    with the `x_col`, `y_col`, and `z_col` columns of the `df`
    DataFrame as the x, y, and z values, respectively. The points
    in the plot are colored according to the values in the
    `color_col` column, using a continuous color scale. The
    function returns a Plotly Express scatter_3d object that
    can be displayed or saved as desired.

    Parameters
    ----------
    df : pd.DataFrame
        The DataFrame containing the data to plot.
    x_col : str
        The name of the column to use as the x values.
    y_col : str
        The name of the column to use as the y values.
    z_col : str
        The name of the column to use as the z values.
    color_col : str
        The name of the column to use for coloring.
    opacity : float
        The opacity (alpha) of the points.

    Returns
    -------
    go.Figure
        A Plotly Figure object with the 3D scatter plot.
    """
    fig = px.scatter_3d(data_frame=df, x=x_col,
                        y=y_col, z=z_col, color=color_col,
                        color_continuous_scale=px.colors.sequential.Viridis_r,
                        opacity=opacity)
    return fig
Let’s look at gamma and reg_lambda over time. I’m using color for loss, so the high values
in the z-axis should be getting better loss scores as the hyperparameter search progresses.
plot_3d_scatter(hyper2hr.query('gamma < .2'),
                'reg_lambda', 'gamma', 'tid', color_col='loss')
12.6 Conclusion
In this chapter, we introduced the Hyperopt library and performed hyperparameter searches
using Bayesian optimization. Since there are so many hyperparameters for XGBoost, using
grid search to optimize them could take a long time because it will explore values known to
be bad. Hyperopt can focus on and around known good values.
We also showed how to specify parameter distributions and gave examples of the different
types of distributions. We showed good default distributions that you can use for your
models.
Finally, we explored the relationships between hyperparameters and loss scores using
pandas and visualization techniques.
Figure 12.14: Scatter plot of loss vs gamma during Hyperopt process with log x-axis
12.7 Exercises
1. What is the Hyperopt library and how does it differ from grid search for optimizing
hyperparameters?
2. How can parameter distributions be specified in Hyperopt? Give an example of a
continuous distribution and a discrete distribution.
3. What are some good default distributions for your models in Hyperopt?
4. How can pandas and visualization techniques be used to explore the relationships
between hyperparameters and loss scores?
Chapter 13
Step-wise Tuning with Hyperopt
In this chapter, we introduce step-wise tuning as a method for optimizing the hyperparameters
of an XGBoost model. The main reason for tuning with this method is to save time.
This method involves tuning small groups of hyperparameters that act similarly and then
moving on to the next group while preserving the values from the previous group. This can
significantly reduce the search space compared to tuning all hyperparameters simultaneously.
13.1 Groups of Hyperparameters
In this section, rather than tuning all of the parameters at once, we will use step-wise tuning.
We will tune small groups of hyperparameters that act similarly, then move to the next group
while preserving the values from the previous group. I’m going to limit my steps to:
• Tree parameters
• Sampling parameters
• Regularization parameters
• Learning rate
This limits the search space by quite a bit. Rather than searching 100 options for trees for
every 100 options for sampling (10,000), another 100 options for regularization (1,000,000),
and another 100 for learning rate (100,000,000), you are searching 400 options! Of course, it
could be that this finds a local maximum and ignores the interplay between hyperparameters,
but I find that it is often a worthwhile tradeoff if you are pressed for time.
Here is the code that I use for step-wise tuning. In the rounds list, I have a dictionary for the
hyperparameters to evaluate for each step. The max_evals setting in the call to fmin determines
how many attempts hyperopt makes during the round. You can bump up this number if you
want it to explore more values.
from hyperopt import fmin, tpe, hp, Trials

params = {'random_state': 42}
rounds = [{'max_depth': hp.quniform('max_depth', 1, 8, 1),  # tree
           'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},
          {'subsample': hp.uniform('subsample', 0.5, 1),  # stochastic
           'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},
          {'reg_alpha': hp.uniform('reg_alpha', 0, 10),
           'reg_lambda': hp.uniform('reg_lambda', 1, 10)},
          {'gamma': hp.loguniform('gamma', -10, 10)},  # regularization
          {'learning_rate': hp.loguniform('learning_rate', -7, 0)}  # boosting
          ]

all_trials = []
for round in rounds:
    params = {**params, **round}
    trials = Trials()
    best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(space, X_train,
                                                             y_train, X_test, y_test),
                space=params,
                algo=tpe.suggest,
                max_evals=20,
                trials=trials,
                )
    params = {**params, **best}
    all_trials.append(trials)
Rather than taking 30 minutes (or two hours on my slower computer) to exhaustively
train all of the options, this finished in a little over a minute.
13.2 Visualizing Hyperparameter Scores
Let’s use the Plotly visualization code from the last chapter to explore the relationship between
reg_alpha and reg_lambda.
xhelp.plot_3d_mesh(xhelp.trial2df(all_trials[2]),
                   'reg_alpha', 'reg_lambda', 'loss')
This is a lot coarser than our previous plots. You might use this to diagnose if you need to
up the number of max_evals.
13.3 Training an Optimized Model
With the optimized parameters in hand, let's train a model with them. Make sure that you
explicitly print out the values (this also saves you from having to run through the search space
again if you need to restart the notebook).
step_params = {'random_state': 42,
               'max_depth': 5,
               'min_child_weight': 0.6411044640540848,
               'subsample': 0.9492383155577023,
               'colsample_bytree': 0.6235721099295888,
               'gamma': 0.00011273797329538491,
               'learning_rate': 0.24399020050740935}
Once the parameters are in a dictionary, you can use dictionary unpacking (the **
operator) to pass the parameters into the constructor. Remember to set early_stopping_rounds,
n_estimators, and provide an eval_set to the .fit method so that the number of trees is
optimized (if you use all of the trees, bump up the n_estimators number and run again):
xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50,
                            n_estimators=500)
xg_step.fit(X_train, y_train,
            eval_set=[(X_train, y_train),
                      (X_test, y_test)
                      ],
            verbose=100
            )

Figure 13.1: Contour plot of lambda and alpha during Hyperopt process
How does this model perform?
>>> xg_step.score(X_test, y_test)
0.7613259668508288
Looks pretty good! Let’s compare this to the default out-of-the-box model.
>>> xg_def = xgb.XGBClassifier()
>>> xg_def.fit(X_train, y_train)
>>> xg_def.score(X_test, y_test)
0.7458563535911602
Our tuned model performs marginally better. However, depending on the business context,
even a marginal improvement can yield outsized returns.
13.4 Summary
In this chapter, we used the hyperopt library to optimize the performance of our XGBoost
model. Hyperopt is a library for hyperparameter optimization that provides a range of search
algorithms, parameter distributions, and performance metrics that can be used to optimize the
hyperparameters of XGBoost models.
We used step-wise optimization to speed up the search. This may return a local minimum,
but it does a decent job if you need an answer quickly. If you can access a cluster and have a
weekend to burn, kick off a non-step-wise search.
13.5 Exercises
1. How does step-wise tuning with Hyperopt differ from traditional hyperparameter
tuning methods such as grid search?
2. When does step-wise tuning make sense? When would you use a different strategy?
Chapter 14
Do you have enough data?
It is essential to have enough data for machine learning because the performance of a machine
learning model is highly dependent on the amount and quality of the data used to train
the model. A model trained on a large and diverse dataset will likely have better accuracy,
generalization, and robustness than a model trained on a small and homogeneous dataset.
Having enough data for machine learning can help to overcome overfitting. Overfitting
occurs when the model learns the noise and irrelevant details in the training data, leading to
poor generalization when predicting with new data.
This chapter will introduce learning curves, a valuable tool for understanding your model.
14.1 Learning Curves
A learning curve is a graphical representation of the relationship between the performance of a
machine-learning model and the amount of data used to train the model. We train a model on
increasing amounts of data and plot the scores along the way. The x-axis of a learning curve
shows the number of training examples or the amount of training data, and the y-axis shows
the performance metric of the model, such as the error rate, accuracy, or F1 score.
The Yellowbrick library contains a learning curve visualization function. Note that I’m
manually setting the y-limit for all the learning curve plots to help consistently convey the
results. Otherwise, each plot would be autoscaled to a different y-range, essentially cropping the
image and making better-performing models look as if they performed worse.
params = {'learning_rate': 0.3,
          'max_depth': 2,
          'n_estimators': 200,
          'n_jobs': -1,
          'random_state': 42,
          'reg_lambda': 0,
          'subsample': 1}

import yellowbrick.model_selection as ms

fig, ax = plt.subplots(figsize=(8, 4))
viz = ms.learning_curve(xgb.XGBClassifier(**params),
                        X, y, ax=ax
                        )
ax.set_ylim(0.6, 1)
The learning curve for a good model illustrates a few things.
Figure 14.1: Learning curve for xgboost model
1. Shows a consistent and monotonic improvement in the model’s cross-validation (or
testing) performance as the amount of training data increases. This means that the error
rate or accuracy of the model will decrease as the number of training examples or the
amount of training data increases. It seems we could have an even better model if we
added more data because the cross-validation score does not look like it has plateaued.
2. Shows that the model learns the underlying patterns and trends in the data and that it is
not overfitting or underfitting the training data. An overfit model would show the training score
trending close to 100% accuracy. We want to see the training score come down toward the
testing score. We also like to see the cross-validation score trending close to the training score.
My takeaway from looking at this plot is that the model seems to be performing OK: it is
not overfitting, and it might do even better if we had more data.
14.2 Learning Curves for Decision Trees
In the previous learning curve, we saw the behavior of an XGBoost model.
Let’s plot a decision tree with a max depth of 7. This depth was a good compromise
between overfitting and underfitting our model.
# tuned tree
fig, ax = plt.subplots(figsize=(8, 4))
viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=7),
                        X, y, ax=ax)
viz.ax.set_ylim(0.6, 1)
It looks similar to the XGBoost model in shape, but the scores are shifted down a bit.
My takeaway is similar to the XGBoost model. It looks pretty good. We also might want
to get some more data.
Figure 14.2: Learning curve for decision tree with depth of 7
14.3 Underfit Learning Curves
Now let’s look at an underfit model. A learning curve for a decision stump model will show
a low and constant cross-validation score as the amount of training data increases. As the
number of training examples increases, it shouldn’t impact the model because the model is
too simple to learn from the data. We would also see poor performance on the training data.
The score for both the training and testing data should be similarly bad.
If you look at the learning curve, you see that our model cannot learn the underlying
patterns in the data. It needs to be more complex.
# underfit
fig, ax = plt.subplots(figsize=(8, 4))
viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=1),
                        X, y, ax=ax
                        )
ax.set_ylim(0.6, 1)
14.4 Overfit Learning Curves
Let’s jump to the other end of the spectrum, an overfit learning curve. We can overfit a decision
tree by using the default parameters which let the tree grow unbounded. Here, we expect to
see good performance on the training data but poor performance on the testing data. The
model memorizes the training data but cannot generalize the information to new data.
This learning curve is an excellent example of a model that is too complex.
Figure 14.3: Learning curve for underfit decision stump

# overfit
fig, ax = plt.subplots(figsize=(8, 4))
viz = ms.learning_curve(tree.DecisionTreeClassifier(),
                        X, y, ax=ax
                        )
ax.set_ylim(0.6, 1)
14.5 Summary
Learning curves are a valuable tool to help diagnose models. We can infer whether adding
more data will help the model perform better. We can also understand if the model is
overfitting or underfitting.
If you want to know your machine learning model’s health, plot a learning curve. Your
model is on a good path if the learning curve is good. If the learning curve is bad, your model
needs some attention, and you should consider adjusting its hyperparameters, adding more
training data, or improving its features.
14.6 Exercises
1. What is the relationship between the performance of a machine learning model and the
amount of data used to train the model?
2. How does having enough data for machine learning help to overcome overfitting?
3. What is a learning curve, and how is it used to evaluate the performance of a machine
learning model?
4. What is the purpose of plotting a learning curve for a machine learning model?
5. What are some characteristics of a good learning curve for a machine learning model?
6. How can the learning curve of a model be used to determine if the model is overfitting
or underfitting?
Figure 14.4: Learning curve for overfit decision tree. Evidence of overfitting is clear when the training
score is high but the cross validation score never improves.
Chapter 15
Model Evaluation
Several machine learning evaluation techniques can be used to measure a machine learning
model’s performance, generalization, and robustness. In this chapter, we will explore many
of them.
15.1 Accuracy
Let’s start off with an out-of-the-box model and train it.
xgb_def = xgb.XGBClassifier()
xgb_def.fit(X_train, y_train)
The default result of calling the .score method returns the accuracy. The accuracy metric
calculates the proportion of correct predictions made by the model. The metric is defined as
the number of correct predictions divided by the total number of predictions.
>>> xgb_def.score(X_test, y_test)
0.7458563535911602
This model predicts 74% of the labels for the testing data correctly. On the face of it, it seems
a valuable metric. But you must tread carefully with it, especially if you have imbalanced
labels.
Imagine that you are building a machine learning model to predict whether a patient has
cancer. You have a dataset of 100 patients, where 95 don't have cancer, and only 5 have cancer.
You train a model that always predicts that a patient doesn't have cancer. This model will have
an accuracy of 95%, which looks impressive but is useless and possibly dangerous.
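To make the pitfall concrete, here is a minimal sketch (the labels below are a hypothetical
imbalanced array, not our survey data) showing that a model that always predicts "no cancer"
scores 95% accuracy while catching zero actual cases:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 healthy patients (0) and 5 with cancer (1)
y_cancer = np.array([0] * 95 + [1] * 5)
X_fake = np.zeros((100, 1))  # the features don't matter for this illustration

majority = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
majority.fit(X_fake, y_cancer)

print(accuracy_score(y_cancer, majority.predict(X_fake)))  # 0.95
print(recall_score(y_cancer, majority.predict(X_fake)))    # 0.0 - misses every cancer case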
You can also use the accuracy_score function:
>>> from sklearn import metrics
>>> metrics.accuracy_score(y_test, xgb_def.predict(X_test))
0.7458563535911602
15.2 Confusion Matrix
A confusion matrix is a table that shows the number of true positive, true negative, false
positive, and false negative predictions made by a machine learning model. The confusion
matrix is useful because it helps you to understand the mistakes and errors of your machine
learning model.
The confusion matrix aids in calculating prevalence, accuracy, precision, and recall.
Figure 15.1: A confusion matrix and associated formulas.
Generally, the lower right corner represents the true positive values: values that were
predicted to have the positive label and were indeed positive. You want this number to be
high. Above that are the false positive values: the count of values that were predicted
to have the positive label but, in truth, were negative. You want this number to be low. (These
are also called type 1 errors by statisticians.)
In the upper left corner are the true negative values: values that were both predicted and, in
truth, negative. You want this value to be high. Below that value is the false negative count (or
type 2 errors). These are positive values that were predicted as negative values.
from yellowbrick import classifier

fig, ax = plt.subplots(figsize=(8, 4))
classifier.confusion_matrix(xgb_def, X_train, y_train,
                            X_test, y_test,
                            classes=['DS', 'SE'], ax=ax
                            )
You can also use scikit-learn to create a NumPy matrix of a confusion matrix.
>>> from sklearn import metrics
>>> cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test))
>>> cm
array([[372, 122],
[108, 303]])
Scikit-learn can create a matplotlib plot that is similar to Yellowbrick.
Figure 15.2: Confusion matrix
fig, ax = plt.subplots(figsize=(8, 4))
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=['DS', 'SE'])
disp.plot(ax=ax, cmap='Blues')
Sometimes it is easier to understand a confusion matrix if you use fractions instead of
counts. The normalize='true' parameter will do this for us.
fig, ax = plt.subplots(figsize=(8, 4))
cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test),
                              normalize='true')
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                      display_labels=['DS', 'SE'])
disp.plot(ax=ax, cmap='Blues')
15.3 Precision and Recall
Precision and recall are two important evaluation metrics for machine learning. Precision
measures the proportion of correct positive predictions made by the model divided by all
of the positive predictions. I like to think of this as how relevant the results are.
Recall (statisticians also like the term sensitivity) measures the proportion of actual positive
examples that the model correctly predicts. I like to think of this as how many relevant results
are returned.
Recall and precision are often at odds with each other. If you want high precision, you can
classify only the single most confident sample as positive and get 100% precision, but
you will have a low recall because the other positive samples will be mislabeled. Conversely,
you can get high recall by classifying everything as positive. This will give you 100% recall
but poor precision. How you want to balance this tradeoff depends on your business needs.
Figure 15.3: Confusion matrix from scikit learn
Figure 15.4: Normalized confusion matrix from scikit learn showing fractions.
You can measure the precision with the precision_score function from scikit-learn.
>>> metrics.precision_score(y_test, xgb_def.predict(X_test))
0.7129411764705882
You can measure the recall with the recall_score function from scikit-learn.
>>> metrics.recall_score(y_test, xgb_def.predict(X_test))
0.7372262773722628
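These numbers line up with the confusion matrix from the previous section. Here is a
minimal sketch verifying them by hand (recomputing the raw counts, since the cm variable was
normalized above):

cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test))
tn, fp, fn, tp = cm.ravel()
print(tp / (tp + fp))  # precision: 303 / (303 + 122), about 0.713
print(tp / (tp + fn))  # recall:    303 / (303 + 108), about 0.737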
To visualize the tradeoff between these two measures, you can use a precision-recall curve
found in Yellowbrick.
from yellowbrick import classifier

fig, ax = plt.subplots(figsize=(8, 4))
classifier.precision_recall_curve(xgb_def, X_train, y_train,
                                  X_test, y_test, micro=False, macro=False,
                                  ax=ax, per_class=True)
ax.set_ylim((0, 1.05))
Figure 15.5: Precision-recall curve from Yellowbrick
This curve needs a little explanation. It shows the precision against the recall. That might
not be very clear because we just calculated those scores, which are a single value. So, where
does this plot with a line come from? Many machine learning models have a .predict_proba
method that returns a probability for both positive and negative labels rather than just a single
label or value for prediction.
Imagine changing the threshold at which the model assigns the positive label. If
the probability of the positive label is above a threshold of .98, assign the positive label. This
would likely produce a model with high precision and low recall (a point on the upper left
of our plot). You will produce the precision-recall plot if you loosen this threshold from .98
down to .01 and track both precision and recall.
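That sweep is what scikit-learn's precision_recall_curve function computes; a minimal sketch
of generating the raw points yourself:

probs = xgb_def.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = metrics.precision_recall_curve(y_test, probs)
# each (recall, precision) pair corresponds to one candidate threshold
print(len(thresholds), precisions[:3], recalls[:3])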
This plot also shows the average precision across the threshold values and plots it as the
dotted red line.
Now imagine that your boss says she does not want any false positives. If that metric
is most important, you can raise the threshold (and train to optimize the 'precision' metric
when creating your model). Given such considerations, this curve can help you understand
how your model might perform.
15.4 F1 Score
Another metric that might be useful is the F1 score. This is the harmonic mean of precision
and recall. If you want to balance precision and recall, this single number can help describe
the performance.
>>> metrics.f1_score(y_test, xgb_def.predict(X_test))
0.7248803827751197
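A quick check confirms that this is just the harmonic mean of the precision and recall
values computed above:

prec = metrics.precision_score(y_test, xgb_def.predict(X_test))
rec = metrics.recall_score(y_test, xgb_def.predict(X_test))
print(2 * prec * rec / (prec + rec))  # 0.7248..., matches f1_score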
Scikit-learn also has a text-based classification report. The support column is the number
of samples for each label. The macro average is the average of the class behavior. The weighted
average is based on the count (support).
>>> print(metrics.classification_report(y_test,
...     y_pred=xgb_def.predict(X_test), target_names=['DS', 'SE']))
              precision    recall  f1-score   support

          DS       0.78      0.75      0.76       494
          SE       0.71      0.74      0.72       411

    accuracy                           0.75       905
   macro avg       0.74      0.75      0.74       905
weighted avg       0.75      0.75      0.75       905
The Yellowbrick library provides a classification report that summarizes precision, recall,
and f1-score for both positive and negative labels:
fig, ax = plt.subplots(figsize=(8, 4))
classifier.classification_report(xgb_def, X_train, y_train,
                                 X_test, y_test, classes=['DS', 'SE'],
                                 micro=False, macro=False, ax=ax)
15.5 ROC Curve
A Receiver Operating Characteristic curve or ROC curve is another useful plot for model
diagnoses that requires some explaining.
It was first used during World War II to understand how effective radar signals were at
detecting enemy aircraft. It measures the true positive rate (or recall) against the false positive
rate (or fallout, false positive count divided by the count of all negatives) over a range of
thresholds.
Figure 15.6: Classification report from Yellowbrick
Let’s look at an example and then explain it. We will plot two ROC curves on the same
plot—one from the default XGBoost model and the other from our Hyperopt-trained model.
We will use scikit-learn's RocCurveDisplay.from_estimator method.
fig, ax = plt.subplots(figsize=(8, 8))
metrics.RocCurveDisplay.from_estimator(xgb_def,
                                       X_test, y_test, ax=ax, label='default')
metrics.RocCurveDisplay.from_estimator(xg_step,
                                       X_test, y_test, ax=ax)
Again, imagine sweeping the probability threshold as we did for the precision-recall plot.
At a high threshold, we would want a high true positive rate and a low false positive rate (if
the threshold is .9, only values that the model gave a probability > .9 would be classified as
positive). However, remember that the true positive rate is also the recall. A high threshold
does not necessarily pull out all of the positive labels. The line starts at (0,0). As the threshold
lowers, the line moves up and to the right until the threshold is low enough that we have
classified everything as positive and thus have 100% recall (and 100% FPR).
Also, note that the Area Under Curve (AUC) is reported as .84 for our hyperopt model.
Because this is a unit square, a perfect model would have an AUC of 1.0.
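The AUC printed on the plot is computed from the predicted probabilities (that is what
from_estimator uses under the hood); a minimal sketch reproducing it directly:

probs = xg_step.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, probs))  # should be close to the .84 shown on the plot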
Folks often ask if a given AUC value is “good enough” for their model. Here’s how I
interpret an ROC curve. Generally, you cannot tell if a model is sufficiently good for your
application from the AUC metric or an ROC plot. However, you can tell if a model is bad. If
the AUC is less than .5, it indicates that the model performs worse than guessing (and you
should do the opposite of what your model says).
Suppose the reported AUC is 1 (or very close to it). That usually tells me one of two
things. First, the problem may be so simple that machine learning is not needed. More often,
it is the case that the modeler has leaked data into their model such
Figure 15.7: ROC Curve for out of the box model (default) and the step wise tuned model. Models that
have curves above and to the left of other curves are better models. The AUC is the area under the curve.
If the AUC is < .5, then the model is bad.
that it knows the future. Data leakage is when you include data in the model that wouldn’t
be known when you ask the model to make a prediction.
Imagine that you are building a machine learning model to predict the sentiment of movie
reviews. You have a dataset of 100 reviews, where 50 of them are positive, and 50 of them are
negative. You train a model that correctly predicts the sentiment of each review.
Then, you realize that your model has a serious data leakage problem. You forgot to
remove the column that contains the movie’s rating from the dataset, and you included this
column in the model’s training. This means that the model is not learning from the text of
the reviews, but it is learning from the movie’s rating. When you remove this feature, the
accuracy of the model drops.
You can use an ROC curve to determine if a model is better than another model (at a given
threshold). If one model has a higher AUC or bulges out more to the upper left corner, it is
better. But being a better model is insufficient to determine whether a model would work for
a given business context.
In the above plot, it appears that the default model has worse performance than the stepwise model. You should note that the step-wise model curve is to the left and above the default
curve.
You can leverage the ROC curve to understand overfitting. If you plot an ROC curve for
both the training and testing data, you would hope that they look similar. A good ROC curve
for the training data that yields a poor testing ROC curve is an indication of overfitting.
Let’s explore that with the default model and the stepwise model.
fig, axes = plt.subplots(figsize=(8, 4), ncols=2)
metrics.RocCurveDisplay.from_estimator(xgb_def,
                                       X_train, y_train, ax=axes[0], label='default train')
metrics.RocCurveDisplay.from_estimator(xgb_def,
                                       X_test, y_test, ax=axes[0])
axes[0].set(title='ROC plots for default model')
metrics.RocCurveDisplay.from_estimator(xg_step,
                                       X_train, y_train, ax=axes[1], label='step train')
metrics.RocCurveDisplay.from_estimator(xg_step,
                                       X_test, y_test, ax=axes[1])
axes[1].set(title='ROC plots for stepwise model')
It looks like the training performance of the stepwise model is worse. However, that is
because the default model appears to be overfitting, and our testing score improves with the
tuning even though the training score has decreased.
15.6 Threshold Metrics
The models (in scikit-learn and XGBoost) don’t expose the threshold as a hyperparameter.
You can create a subclass to experiment with the threshold.
class ThresholdXGBClassifier(xgb.XGBClassifier):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold

    def predict(self, X, *args, **kwargs):
        """Predict with `threshold` applied to predicted class probabilities.
        """
        proba = self.predict_proba(X, *args, **kwargs)
        return (proba[:, 1] > self.threshold).astype(int)

Figure 15.8: Exploring overfitting with ROC curves. On the left is the out-of-the-box model. Note
that the training score is much better than the test score. With the stepwise model, the training score
decreases, but the score of the test data has improved.
Here is an example of using the class. Notice that the first row of the testing data has a
probability of .857 of being positive.
>>> xgb_def = xgb.XGBClassifier()
>>> xgb_def.fit(X_train, y_train)
>>> xgb_def.predict_proba(X_test.iloc[[0]])
array([[0.14253652, 0.8574635 ]], dtype=float32)
When we predict this row with the default model, it comes out as positive.
>>> xgb_def.predict(X_test.iloc[[0]])
array([1])
If we set the threshold to .9, the prediction becomes negative.
>>> xgb90 = ThresholdXGBClassifier(threshold=.9, verbosity=0)
>>> xgb90.fit(X_train, y_train)
>>> xgb90.predict(X_test.iloc[[0]])
array([0])
Here is code that you can use to see if it would be appropriate to use a different threshold.
This visualization shows how many common metrics respond as the threshold is changed.
def get_tpr_fpr(probs, y_truth):
    """
    Calculates true positive rate (TPR) and false positive rate
    (FPR) given predicted probabilities and ground truth labels.

    Parameters:
        probs (np.array): predicted probabilities of positive class
        y_truth (np.array): ground truth labels

    Returns:
        tuple: (tpr, fpr)
    """
    tp = (probs == 1) & (y_truth == 1)
    tn = (probs < 1) & (y_truth == 0)
    fp = (probs == 1) & (y_truth == 0)
    fn = (probs < 1) & (y_truth == 1)
    tpr = tp.sum() / (tp.sum() + fn.sum())
    fpr = fp.sum() / (fp.sum() + tn.sum())
    return tpr, fpr
vals = []
for thresh in np.arange(0, 1, step=.05):
    probs = xg_step.predict_proba(X_test)[:, 1]
    tpr, fpr = get_tpr_fpr(probs > thresh, y_test)
    val = [thresh, tpr, fpr]
    for metric in [metrics.accuracy_score, metrics.precision_score,
                   metrics.recall_score, metrics.f1_score,
                   metrics.roc_auc_score]:
        val.append(metric(y_test, probs > thresh))
    vals.append(val)

fig, ax = plt.subplots(figsize=(8, 4))
(pd.DataFrame(vals, columns=['thresh', 'tpr/rec', 'fpr', 'acc',
                             'prec', 'rec', 'f1', 'auc'])
 .drop(columns='rec')
 .set_index('thresh')
 .plot(ax=ax, title='Threshold Metrics')
)
15.7 Cumulative Gains Curve
This curve is useful for maximizing marketing response rates (evaluating a model when you
have finite resources). It plots the gain (the recall or sensitivity) against the ordered samples.
The baseline is what a random prediction would give you. The gain is the true positive rate you
achieve if you order the predictions by probability and contact the top percentage of
samples.
Look at the plot below from the scikitplot library. I have augmented it to show the optimal
gains (what a perfect model would give you).
Figure 15.9: Threshold plots
import scikitplot

fig, ax = plt.subplots(figsize=(8, 4))
y_probs = xgb_def.predict_proba(X_test)
scikitplot.metrics.plot_cumulative_gain(y_test, y_probs, ax=ax)
ax.plot([0, (y_test == 1).mean(), 1], [0, 1, 1], label='Optimal Class 1')
ax.set_ylim(0, 1.05)
ax.annotate('Reach 60% of\nClass 1\nby contacting top 35%', xy=(.35, .6),
            xytext=(.55, .25), arrowprops={'color': 'k'})
ax.legend()
Here is an example of reading the plot. If you want to reach 60% of the positive audience,
you can trace the .6 from the y-axis to the plot. It indicates that you need to contact the top
35% of the sample to reach that number. Note that if we had a perfect model, we
would only need to contact the top 45% of the sample (the fraction that is in the positive class;
see the blue line).
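If you want the underlying numbers rather than the picture, here is a minimal sketch of
building the gain table by hand (assuming y_test is a pandas Series), sorting by predicted
probability and accumulating the positives:

gains = (pd.DataFrame({'y': y_test.values,
                       'prob': xgb_def.predict_proba(X_test)[:, 1]})
         .sort_values('prob', ascending=False)
         .assign(frac_contacted=lambda df: np.arange(1, len(df) + 1) / len(df),
                 gain=lambda df: df.y.cumsum() / df.y.sum())
         )
# fraction of positives reached by contacting the top 35% of the sample
print(gains.query('frac_contacted <= .35').gain.max())  # roughly .6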
15.8 Lift Curves
A lift curve is a plot showing the true positive rate and the cumulative positive rate of a
machine learning model as the prediction threshold varies. It leverages the same x-axis from
the cumulative gains plot, but the y-axis is the ratio of gains to randomly choosing a label. It
indicates how much better the model is than guessing.
If you can only reach the top 20% of your audience (move up from .2 on the x-axis), you
should be almost 1.8x better than randomly choosing labels.
I have also augmented this with the optimal model.
Figure 15.10: Cumulative gains plot

fig, ax = plt.subplots(figsize=(8, 4))
y_probs = xgb_def.predict_proba(X_test)
scikitplot.metrics.plot_lift_curve(y_test, y_probs, ax=ax)
mean = (y_test == 1).mean()
ax.plot([0, mean, 1], [1/mean, 1/mean, 1], label='Optimal Class 1')
ax.legend()
15.9 Summary
In this chapter, we explored various metrics and plots to evaluate a classification model. There
is no one size fits all metric or plot to understand a model. Many times it depends on the needs
of the business.
Precision, recall, accuracy, and F1 are common and important metrics for evaluating
and comparing the performance of a machine learning model. ROC curves, precision-recall
curves, and lift plots are visual representations of that performance. Together, these metrics
and plots provide complementary information and are widely used for evaluating, comparing,
and diagnosing a model, and for interpreting and improving its results.
15.10 Exercises
1. How can cross-validation be used to evaluate the performance of a machine learning
model?
2. What is the difference between precision and recall in evaluating a machine learning
model?
Figure 15.11: Lift Curve for out of the box model
3. What is the F1 score, and how is it used to evaluate a machine learning model?
4. In what situations might it be more appropriate to use precision, recall, accuracy, or F1
as a metric to evaluate the model’s performance?
5. What is the purpose of a ROC curve in evaluating a machine learning model?
6. How can ROC curves, precision-recall curves, and lift plots be used to improve,
optimize, and maximize the performance of a machine learning model?
Chapter 16
Training For Different Metrics
We just looked at different metrics for model evaluation. In this chapter, we will look at
training a model to maximize those metrics.
16.1 Metric overview
We will compare optimizing a model with the metrics of accuracy and the area under the ROC
curve. Accuracy is a measure of how often the model correctly predicts the target class.
Precision is a measure of how often the model is correct when it predicts the positive class.
Recall is a measure of how often the model can correctly identify the positive class. The area
under the ROC curve is one measure to balance precision and recall. The F1 score is another
metric that combines both by using the harmonic mean of precision and recall.
16.2 Training with Validation Curves
Let’s use a validation curve to visualize tuning the learning_rate for both accuracy and area
under ROC. (Note that you can get a high recall by just classifying everything as positive, so
that is not a particularly interesting metric to maximize.) We will use the Yellowbrick library.
Remember that a validation curve tracks the performance of the metric as we sweep through
hyperparameter values.
The validation_curve function accepts a scoring parameter to determine a metric to track.
We will pass in scoring='accuracy' to the first curve and scoring='roc_auc' to the second.
from yellowbrick import model_selection as ms

fig, ax = plt.subplots(figsize=(8, 4))
ms.validation_curve(xgb.XGBClassifier(), X_train, y_train,
                    scoring='accuracy', param_name='learning_rate',
                    param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax
                    )
ax.set_xlabel('Accuracy')

fig, ax = plt.subplots(figsize=(8, 4))
ms.validation_curve(xgb.XGBClassifier(), X_train, y_train,
                    scoring='roc_auc', param_name='learning_rate',
                    param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax
                    )
ax.set_xlabel('roc_auc')
Figure 16.1: Validation Curve for XGBoost model tuning learning_rate for accuracy
You can see that the optimal learning rate for accuracy is around .01 (given our limited
range of choices). The shape of the area under the ROC curve is similar but shifted up, and
its maximum cross-validation score occurs at a learning rate of around .05. In general, tuning for
different metrics will reveal different hyperparameter values.
16.3 Step-wise Recall Tuning
Let’s use the hyperopt library to optimize AUC. Let’s pass in roc_auc_score for the metric
parameter.
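The xhelp.hyperparameter_tuning helper used here (and in the step-wise tuning chapter)
trains a model on the proposed hyperparameters and returns a loss for fmin to minimize; at
least, that is what fmin requires of it. Here is a minimal sketch of such an objective with the
same signature, offered as an illustration under those assumptions, not the book's exact helper:

from hyperopt import STATUS_OK
from sklearn.metrics import accuracy_score

def hyperparameter_tuning(space, X_train, y_train, X_test, y_test,
                          metric=accuracy_score):
    # hyperopt proposes floats; some XGBoost parameters must be integers
    int_vals = ['max_depth']
    space = {k: (int(v) if k in int_vals else v) for k, v in space.items()}
    model = xgb.XGBClassifier(**space, early_stopping_rounds=50,
                              n_estimators=500)
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
    score = metric(y_test, model.predict(X_test))
    # fmin minimizes, so return the negative of a "higher is better" metric
    return {'loss': -score, 'status': STATUS_OK}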
from sklearn.metrics import roc_auc_score
from hyperopt import hp, Trials, fmin, tpe

params = {'random_state': 42}
rounds = [{'max_depth': hp.quniform('max_depth', 1, 9, 1),  # tree
           'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},
          {'subsample': hp.uniform('subsample', 0.5, 1),  # stochastic
           'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},
          {'gamma': hp.loguniform('gamma', -10, 10)},  # regularization
          {'learning_rate': hp.loguniform('learning_rate', -7, 0)}  # boosting
          ]

for round in rounds:
    params = {**params, **round}
    trials = Trials()
    best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(
                    space, X_train, y_train, X_test, y_test, metric=roc_auc_score),
                space=params,
                algo=tpe.suggest,
                max_evals=40,
                trials=trials,
                )
    params = {**params, **best}

Figure 16.2: Validation Curve for XGBoost model tuning learning_rate for area under the ROC curve
Now let's compare the AUC of the default model versus our optimized model.
>>> xgb_def = xgb.XGBClassifier()
>>> xgb_def.fit(X_train, y_train)
>>> metrics.roc_auc_score(y_test, xgb_def.predict(X_test))
0.7451313573096131
>>> # the values from above training
>>> params = {'random_state': 42,
... 'max_depth': 4,
... 'min_child_weight': 4.808561584650579,
... 'subsample': 0.9265505972233746,
... 'colsample_bytree': 0.9870944989347749,
... 'gamma': 0.1383762861356536,
... 'learning_rate': 0.13664139307301595}
Below I use the verbose=100 parameter to only show the output every 100 iterations.
>>> xgb_tuned = xgb.XGBClassifier(**params, early_stopping_rounds=50,
...                               n_estimators=500)
>>> xgb_tuned.fit(X_train, y_train, eval_set=[(X_train, y_train),
...                                           (X_test, y_test)], verbose=100)
[0]	validation_0-logloss:0.66207	validation_1-logloss:0.66289
[100]	validation_0-logloss:0.44945	validation_1-logloss:0.49416
[150]	validation_0-logloss:0.43196	validation_1-logloss:0.49833
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=0.9870944989347749, early_stopping_rounds=50,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=0.1383762861356536, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.13664139307301595, max_bin=256,
max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
max_depth=4, max_leaves=0, min_child_weight=4.808561584650579,
missing=nan, monotone_constraints='()', n_estimators=500,
n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=42, ...)
It looks like our tuned model has improved the area under the ROC curve score.
>>> metrics.roc_auc_score(y_test, xgb_tuned.predict(X_test))
0.7629510328319394
16.4 Summary
In this chapter, we explored optimizing models around different metrics. Talk to the business
owners and determine what metric is important for you to optimize. Then tune your model
to that metric. Accuracy is a fine default metric, but it might not be what you
need.
16.5 Exercises
1. What is the purpose of a validation curve?
2. How does tuning a model for recall differ from tuning for accuracy?
3. How does the metric choice influence a model's hyperparameters?
4. Can a model be optimized for both precision and recall simultaneously? Discuss.
5. In what situations might it be important to prioritize recall over precision or vice versa?
Chapter 17
Model Interpretation
Consider a bank offering loans. The bank wants to be able to tell a user why they were rejected.
Saying, “We are denying your application, but if you deposit $2,000 into your account, we
will approve it.” is better than saying, “We deny your application. Sorry, we can’t tell
you why because our model doesn’t give us that information.” The company might value
explainability more than performance.
Data scientists classify models as white box and black box models. A white box machine
learning model is a model that is transparent and interpretable. This means that the model’s
workings are completely understood and can be explained in terms of the input features and
the mathematical operations used to make predictions.
In contrast, a black box machine learning model is a model that is not transparent or
interpretable. This means that the model’s workings are not easily understood, and it is
difficult or impossible to explain how the model makes predictions.
A decision tree model is a white box model. You can explain exactly what the model is
doing.
An XGBoost model is considered a black box model. I would probably call it a grey model
(or maybe dark grey). We can explain what happens, but we need to sift through multiple
trees to come to an answer. It is unlikely that a user of the model is going to want to track
through 587 trees to determine why it made a prediction.
This chapter will explore different models and how to interpret them. We will also discuss
the pros and cons of the interpretation techniques.
17.1 Logistic Regression Interpretation
Let’s start with a Logistic Regression model. Logistic regression uses a logistic function to
predict the probability that a given input belongs to a particular class. Like linear regression,
it learns coefficients for each feature. These coefficients are multiplied by the feature values
and summed to make a prediction. It is a white box model because the result comes from
the math in the previous sentence. In fact, you can use these coefficients to determine how
features impact the model (by looking at the coefficient’s sign) and the impact’s magnitude
(by looking at the magnitude of the coefficient, assuming that the features are standardized).
To aid with the interpretation of the coefficients, we standardize the features before running
logistic regression. Standardization is a technique used to transform data with a mean of zero
and a standard deviation of one. Standardization is helpful for machine learning because
it helps ensure that all features are on a similar scale, which can improve the performance
or behavior of certain models, such as linear regression, logistic regression, support vector
machines, and neural networks.
Let’s look at how this model works with our data (we set penalty=None to turn off the
regularization of this model):
>>> from sklearn import linear_model, preprocessing
>>> std = preprocessing.StandardScaler()
>>> lr = linear_model.LogisticRegression(penalty=None)
>>> lr.fit(std.fit_transform(X_train), y_train)
>>> lr.score(std.transform(X_test), y_test)
0.7337016574585635
In this case, logistic regression gives us a similar score to XGBoost’s default model. If I
had data that performed the same with both logistic regression and XGBoost, I would use the
logistic regression model because it is simpler. Given two equivalent results, I will choose the
simpler version. Logistic regression also is a white box model and is easy to explain.
Let’s look at the .coef_ attribute. These are the feature coefficients or weights that we
learned during the fitting of the model. (This uses scikit learn’s convention of adding an
underscore to the attribute learned while training the model.)
>>> lr.coef_
array([[-1.56018160e-01, -4.01817103e-01,  6.01542610e-01,
        -1.45213121e-01, -8.13849902e-02, -6.03727624e-01,
         3.11683777e-02,  3.16120596e-02, -3.14510213e-02,
        -4.59272439e-04, -8.21683100e-03, -5.27737710e-02,
        -4.48524110e-03,  1.01853988e-01,  3.49376790e-01,
        -1.79149729e-01,  2.41389081e-02, -3.37424750e-01]])
This is a NumPy array and a little hard to interpret on its own. I prefer to stick this in a
Pandas series and add the corresponding column names to the values. You can make a bar
plot to visualize them.
fig, ax = plt.subplots(figsize=(8, 4))
(pd.Series(lr.coef_[0], index=X_train.columns)
.sort_values()
.plot.barh(ax=ax)
)
The wider the bar, the higher the impact of the feature. Positive values push towards
the positive label, in our case Software Engineer. Negative values push towards the negative
label, Data Scientist. The years of experience column correlates with software engineering,
and using the R language correlates with data science. Also, the Q1_Prefer not to say feature
does not have much impact on this model.
Given this result, I might try to make an even simpler model. A model that only considers
features that have an absolute magnitude above 0.2.
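A minimal sketch of that idea, keeping only the columns whose standardized coefficients
have an absolute value above 0.2 (the 0.2 cutoff is just the one suggested above):

coefs = pd.Series(lr.coef_[0], index=X_train.columns)
keep = coefs[coefs.abs() > 0.2].index

lr_small = linear_model.LogisticRegression(penalty=None)
lr_small.fit(std.fit_transform(X_train[keep]), y_train)
print(lr_small.score(std.transform(X_test[keep]), y_test))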
17.2 Decision Tree Interpretation
A decision tree is also a white box model. I will train a decision tree with a depth of seven
and explore the feature importance results. These values are learned from fitting the model, and
similar to coefficients in the logistic regression model, they give us some insight into how the
features impact the model.
>>> tree7 = tree.DecisionTreeClassifier(max_depth=7)
>>> tree7.fit(X_train, y_train)
>>> tree7.score(X_test, y_test)
0.7337016574585635
Figure 17.1: Logistic Regression feature coefficients
Let’s inspect the .feature_importances_ attribute. This is an array of values that indicate
how important each feature is in making predictions using a decision tree model. It is learned
during the call to .fit (hence the trailing underscore). Like logistic regression, it gives a
magnitude of importance. In scikit-learn, it is calculated from the decrease in Gini impurity
when making a split on a feature. Unlike logistic regression, it does not show the direction
(this is because a tree, unlike logistic regression, can capture non-linear relationships and use
a feature to direct results toward both positive and negative labels).
fig, ax = plt.subplots(figsize=(8, 4))
(pd.Series(tree7.feature_importances_, index=X_train.columns)
.sort_values()
.plot.barh(ax=ax)
)
It is interesting to note that the feature importances of the decision tree are not necessarily
the same as the coefficients of the logistic regression model.
I will plot the first three levels of the tree to compare the nodes to the feature importance
scores.
import dtreeviz

dt3 = tree.DecisionTreeClassifier(max_depth=3)
dt3.fit(X_train, y_train)
viz = dtreeviz.model(dt3, X_train=X_train, y_train=y_train,
                     feature_names=list(X_train.columns), target_name='Job',
                     class_names=['DS', 'SE'])
viz.view()
Figure 17.2: Decision tree feature importance
Figure 17.3: Decision tree visualization using dtreeviz.
17.3 XGBoost Feature Importance
The XGBoost model also reports feature importance. The .feature_importances_ attribute
reports the normalized gain across all the trees when the feature is used.
xgb_def = xgb.XGBClassifier()
xgb_def.fit(X_train, y_train)
fig, ax = plt.subplots(figsize=(8, 4))
(pd.Series(xgb_def.feature_importances_, index=X_train.columns)
.sort_values()
.plot.barh(ax=ax)
)
Figure 17.4: XGBoost default feature importance
You can also visualize the feature importance directly from XGBoost. The plot_importance
function has an importance_type parameter to change how importance is measured.
• The 'gain' feature importance is calculated as the total gain in the model’s performance
that results from using a feature.
• The 'weight' feature importance is calculated as the number of times a feature is used in
the model.
• The 'cover' feature importance is calculated as the number of samples that are affected
by a feature.
fig, ax = plt.subplots(figsize=(8, 4))
xgb.plot_importance(xgb_def, importance_type='cover', ax=ax)
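If you prefer the raw numbers to a plot, the same scores are available from the underlying
booster; a short sketch:

booster = xgb_def.get_booster()
print(booster.get_score(importance_type='gain'))
print(booster.get_score(importance_type='weight'))
print(booster.get_score(importance_type='cover'))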
Figure 17.5: XGBoost feature importance for cover importance
17.4 Surrogate Models
Another way to tease apart the XGBoost model is to train a decision tree on its predictions and
then explore the interpretable decision tree. Let's do that.
from sklearn import tree
sur_reg_sk = tree.DecisionTreeRegressor(max_depth=4)
sur_reg_sk.fit(X_train, xgb_def.predict_proba(X_train)[:,-1])
Let’s export the tree to examine it. If you want to make a tree that goes from left to right
with scikit-learn, you need to use the export_graphviz function.
Note
I ran the following code to convert the DOT file to a PNG.
tree.export_graphviz(sur_reg_sk, out_file='img/sur-sk.dot',
                     feature_names=X_train.columns, filled=True, rotate=True,
                     fontname='Roboto Condensed')
You can use the dot command to generate the PNG:
dot -Gdpi=300 -Tpng -oimg/sur-sk.png img/sur-sk.dot
This surrogate model can provide insight into interactions. Nodes that split on a different
feature than a parent node often have an interaction. It looks like R has an interaction with
major_cs. Also, the years_exp and education columns seem to interact. We will explore those in
the SHAP chapter.
You can also learn about non-linearities with surrogate models. When the same node
feature follows itself, that often indicates a non-linear relationship with the target.
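One sanity check worth doing before reading too much into the surrogate (a sketch, not a
step from the original workflow): see how faithfully the shallow tree reproduces the XGBoost
probabilities.

# R^2 of the surrogate's predictions against the XGBoost probabilities
print(sur_reg_sk.score(X_test, xgb_def.predict_proba(X_test)[:, -1]))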
Figure 17.6: A surrogate tree of the model. You can look for the whitest and orangest nodes and trace
the path to understand the prediction.
17.5 Summary
Interpretability in machine learning refers to understanding how a model makes predictions.
A model is considered interpretable if its workings can be explained in terms of the input features
and the mathematical operations that the model uses to make predictions.
Interpretability is important because it allows us to understand how a model arrives at its
predictions and identify any potential biases or errors. This can be useful for improving the
model or for explaining the model’s predictions to decision-makers.
Feature importance in XGBoost measures each feature’s importance in making predictions
using a decision tree model. XGBoost provides three feature importances: weight, gain, and
cover. These feature importances are calculated differently and can provide different insights
about the model.
17.6 Exercises
1. What is the difference between white box and black box models regarding interpretation?
2. How is logistic regression used to make predictions and how can the coefficients be
interpreted to understand the model’s decision-making process?
3. How can decision trees be interpreted, and how do they visualize the model's decision-making process?
4. What are some potential limitations of using interpretation techniques to understand
the decision-making process of a machine learning model?
5. How can the feature importance attribute be used to interpret the decision-making
process of a black box model?
6. In what situations might it be more appropriate to use a white box model over a black
box model, and vice versa?
Chapter 18
xgbfir (Feature Interactions Reshaped)
In this chapter, we will explore how features relate to each other and the labels. We will
define feature interactions, show how to find them in XGBoost, and then introduce interaction
constraints to see if it improves our model.
18.1 Feature Interactions
When you have multiple columns in a machine-learning model, you may want to understand
how those columns interact with each other. Statisticians say that X1 and X2 interact when
the effect of a change in X1 on y is not constant but also depends on the value of X2.
You could represent this as a formula when using linear or logistic regression. You can
create a new column that multiplies the two columns together:
y = aX1 + bX2 + cX1X2
In decision trees, interaction is a little different. A feature X1 does not interact with
another feature X2 if X2 is not in the path, or if X2 is used the same way in each path. Otherwise,
there is an interaction, and both features combine to impact the result.
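As a concrete sketch of the regression-style interaction term, using two columns from our
data (this is just an illustration; the features are left unscaled here):

from sklearn import linear_model

# add an explicit interaction column (the c*X1*X2 term)
X_inter = X_train.assign(r_x_major_cs=X_train.r * X_train.major_cs)
lr_inter = linear_model.LogisticRegression(penalty=None, max_iter=1000)
lr_inter.fit(X_inter, y_train)
print(pd.Series(lr_inter.coef_[0], index=X_inter.columns).loc['r_x_major_cs'])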
18.2 xgbfir
The xgbfir library will allow us to find the top interactions from our data in an XGBoost model.
You will have to install this third-party library and also the openpyxl library (a library to read
Excel files) to explore the feature interactions.
I’m not going to lie. The interface of this library is a little odd. You create a model, then run
the saveXgbFI function, and it saves an Excel file with multiple sheets with the results. Let’s
try it out:
import xgbfir
xgbfir.saveXgbFI(xgb_def, feature_names=X_train.columns, OutputXlsxFile='fir.xlsx')
Let's read the first sheet. This sheet lists all of the features from our data, along with
columns for several metric values and the rank of each metric. The rows are ordered by
the average rank across the metrics.
Here are the columns:
• Gain: Total gain of each feature or feature interaction
• FScore: Amount of possible splits taken on a feature or feature Interaction
• wFScore: Amount of possible splits taken on a feature or feature interaction weighted by
the probability of the splits to take place
• Average wFScore: wFScore divided by FScore
• Average Gain: Gain divided by FScore
• Expected Gain: Total gain of each feature or feature interaction weighted by the
probability of the gain
For each of the above columns, another column shows the rank. Finally, there is the Average
Rank, the Average Tree Index, and the Average Tree Depth columns.
Let’s sort the features by the Average Rank column:
>>> fir = pd.read_excel('fir.xlsx')
>>> print(fir
...     .sort_values(by='Average Rank')
...     .head()
...     .round(1)
... )
    Interaction   Gain  FScore  ...  Average Rank  Average Tree Index  \
2             r  517.8      84  ...           3.3                44.6
0     years_exp  597.0     627  ...           4.5                45.1
5     education  296.0     254  ...           4.5                45.2
1  compensation  518.5     702  ...           4.8                47.5
4      major_cs  327.1      96  ...           5.5                48.9

   Average Tree Depth
2                 2.6
0                 3.7
5                 3.3
1                 3.7
4                 3.6

[5 rows x 16 columns]
This looks like the R and years_exp features are important for our model.
However, this chapter is about interactions, and this isn’t showing interactions, just a
single column. So we need to move on to the other sheets found in the Excel file.
The Interaction Depth 1 sheet shows how pairs of columns interact. Let’s look at it:
>>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 1').iloc[:20]
...     .sort_values(by='Average Rank')
...     .head(10)
...     .round(1)
... )
                Interaction    Gain  FScore  wFScore  Average wFScore  \
1       education|years_exp   523.8     106     14.8              0.1
0                major_cs|r  1210.8      15      5.4              0.4
6    compensation|education   207.2     103     18.8              0.2
11            age|education   133.2      80     27.2              0.3
3        major_cs|years_exp   441.3      36      4.8              0.1
5             age|years_exp   316.3     216     43.9              0.2
4          age|compensation   344.7     219     38.8              0.2
15     major_stat|years_exp    97.7      32      6.7              0.2
14              education|r   116.5      14      4.6              0.3
18                  age|age    90.5      66     24.7              0.4

    Average Gain  Expected Gain  Gain Rank  FScore Rank  wFScore Rank  \
1            4.9           77.9          2            5             8
0           80.7          607.6          1           45            20
6            2.0           34.0          7            6             7
11           1.7           25.6         12            8             4
3           12.3          108.2          4           20            25
5            1.5           44.0          6            3             1
4            1.6           30.6          5            2             2
15           3.1           20.4         16           25            15
14           8.3           72.3         15           52            27
18           1.4           16.6         19           11             6

    Avg wFScore Rank  Avg Gain Rank  Expected Gain Rank  Average Rank  \
1                 43              8                   3          11.5
0                  8              1                   1          12.7
6                 32             25                   9          14.3
11                12             40                  13          14.8
3                 46              3                   2          16.7
5                 26             57                   7          16.7
4                 34             48                  11          17.0
15                24             14                  14          18.0
14                13              5                   4          19.3
18                 7             62                  16          20.2

    Average Tree Index  Average Tree Depth
1                 38.0                 3.5
0                 12.3                 1.6
6                 50.6                 3.7
11                38.8                 3.6
3                 29.2                 3.2
5                 45.6                 3.9
4                 48.9                 3.9
15                25.5                 3.1
14                40.4                 2.4
18                48.0                 3.6
This tells us that education is often followed by years_exp and major_cs is often followed by
R. This hints that these features might have an impact on each other. Let's explore that a little
further by examining the relationship between the features.
We can create a correlation heatmap with both Pandas and Seaborn. Here’s the pandas
version which is useful inside of Jupyter.
(X_train
 .assign(software_eng=y_train)
 .corr(method='spearman')
 .loc[:, ['education', 'years_exp', 'major_cs', 'r', 'compensation', 'age']]
 .style
 .background_gradient(cmap='RdBu', vmin=-1, vmax=1)
 .format('{:.2f}')
)
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(X_train
            .assign(software_eng=y_train)
            .corr(method='spearman')
            .loc[:, ['age', 'education', 'years_exp', 'compensation', 'r',
                     'major_cs', 'software_eng']],
            cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax
            )
Figure 18.1: Correlation heatmap between subset of features
Interestingly, the correlation between years_exp and education is close to 0.10. There may
be a non-linear relationship that a tree model can tease apart. Let’s explore that with a scatter
plot. We will use Seaborn to add a fit line and color the plot based on the label.
import seaborn.objects as so

fig = plt.figure(figsize=(8, 4))
(so
 .Plot(X_train.assign(software_eng=y_train), x='years_exp', y='education',
       color='software_eng')
 .add(so.Dots(alpha=.9, pointsize=2), so.Jitter(x=.7, y=1))
 .add(so.Line(), so.PolyFit())
 .scale(color='viridis')
 .on(fig)   # not required unless saving to image
 .plot()    # ditto
)
Figure 18.2: Scatter plot of education vs years_exp
It looks like the education levels for software engineers are lower than for data scientists.
Now, let’s explore the relationships between major_cs and R and the target label. This has
three dimensions of categorical data. We can use Pandas grouping or the pd.crosstab shortcut
to quantify this:
>>> print(X_train
...     .assign(software_eng=y_train)
...     .groupby(['software_eng', 'r', 'major_cs'])
...     .age
...     .count()
...     .unstack()
...     .unstack()
... )
major_cs        0         1
r               0    1    0    1
software_eng
0             410  390  243  110
1             308   53  523   73
This shows that if the user didn’t use R, they have a higher probability of being a software
engineer.
Here is the pd.crosstab version:

>>> both = (X_train
...    .assign(software_eng=y_train)
... )
>>> print(pd.crosstab(index=both.software_eng, columns=[both.major_cs, both.r]))
major_cs        0         1
r               0    1    0    1
software_eng
0             410  390  243  110
1             308   53  523   73
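If you prefer proportions over raw counts, pd.crosstab can normalize for you. Here is a small
sketch (reusing the both DataFrame defined above) that shows the share of each label within
each R-usage group; normalize='index' converts each row of counts into fractions.

import pandas as pd

# Row-normalize the counts: each row (r=0 or r=1) sums to 1, so the values are the
# fraction of data scientists and software engineers within that R-usage group.
print(pd.crosstab(index=both.r, columns=both.software_eng, normalize='index'))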
We can also visualize this data with a slope graph. My interpretation of this data is that
not using R is a strong indication of the label “software engineer”, while the combination of
using R and not studying CS is an indicator of the “data science” label. However, studying CS
and using R is not quite so clear. You can see why the tree might want to look at the interaction
between these features rather than just looking at them independently.
This plot is a little more involved than most in this book, but I think it illustrates the point
well.
fig, ax = plt.subplots(figsize=(8, 4))
grey = '#999999'
blue = '#16a2c6'
font = 'Roboto'

data = (X_train
        .assign(software_eng=y_train)
        .groupby(['software_eng', 'r', 'major_cs'])
        .age
        .count()
        .unstack()
        .unstack())

(data
 .pipe(lambda adf: adf.iloc[:,-2:].plot(color=[grey,blue], linewidth=4, ax=ax,
                                        legend=None) and adf)
 .plot(color=[grey, blue, grey, blue], ax=ax, legend=None)
)

ax.set_xticks([0, 1], ['Data Scientist', 'Software Engineer'], font=font, size=12,
              weight=600)
ax.set_yticks([])
ax.set_xlabel('')
ax.text(x=0, y=.93, s="Count Data Scientist or Software Engineer by R/CS",
        transform=fig.transFigure, ha='left', font=font, fontsize=10, weight=1000)
ax.text(x=0, y=.83, s="(Studied CS) Thick lines\n(R) Blue", transform=fig.transFigure,
        ha='left', font=font, fontsize=10, weight=300)
for side in 'left,top,right,bottom'.split(','):
    ax.spines[side].set_visible(False)

# labels
for left,txt in zip(data.iloc[0], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']):
    ax.text(x=-.02, y=left, s=f'{txt} ({left})', ha='right', va='center',
            font=font, weight=300)
for right,txt in zip(data.iloc[1], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']):
    ax.text(x=1.02, y=right, s=f'{txt} ({right})', ha='left', va='center',
            font=font, weight=300)
Figure 18.3: Slopegraph of R vs study CS
18.3 Deeper Interactions
You can get interactions with three columns from the Interaction Depth 2 sheet.
>>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 2').iloc[:20]
...    .sort_values(by='Average Rank')
...    .head(5)
... )
                          Interaction         Gain  FScore  ...  Average Rank  \
0                major_cs|r|years_exp  1842.711375      17  ...     12.000000
7             age|education|years_exp   267.537987      53  ...     15.666667
13         age|compensation|education   154.313245      55  ...     15.833333
2   compensation|education|years_exp    431.541357      91  ...     17.166667
14             education|r|years_exp    145.534591      17  ...     19.000000

    Average Tree Index  Average Tree Depth
0             2.588235            2.117647
7            31.452830            3.981132
13           47.381818            3.800000
2            47.175824            4.010989
14           34.352941            2.588235

[5 rows x 16 columns]
I’m not going to explore these here. Again, these interactions can lead to insights about
how various features relate to each other that would be hard to come by without this analysis.
18.4 Specifying Feature Interactions
The XGBoost library can limit feature interactions. You specify groups of features; within a
tree branch, a feature may only be followed by features that share a group with it. This can
lead to models that are more robust, simpler, or compliant with regulations.
Let’s train models with limited interactions and a default model and see how the
performance compares.
I will take the top entries from the interactions and put them in a nested list. Each entry
in this list shows which groups of columns are able to follow other columns in a given tree.
For example, education is in the first, third, and fourth lists. This means that education can be
followed by years_exp, compensation, or age columns, but not the remaining columns.
constraints = [['education', 'years_exp'], ['major_cs', 'r'],
['compensation', 'education'], ['age', 'education'],
['major_cs', 'years_exp'], ['age', 'years_exp'],
['age', 'compensation'], ['major_stat', 'years_exp'],
]
I’ll write some code to filter out the columns that I want to use, so I can feed data with just
those columns into my model.
def flatten(seq):
    res = []
    for sub in seq:
        res.extend(sub)
    return res
small_cols = sorted(set(flatten(constraints)))
>>> print(small_cols)
['age', 'compensation', 'education', 'major_cs', 'major_stat', 'r', 'years_exp']
>>> xg_constraints = xgb.XGBClassifier(interaction_constraints=constraints)
>>> xg_constraints.fit(X_train.loc[:, small_cols], y_train)
>>> xg_constraints.score(X_test.loc[:, small_cols], y_test)
0.7259668508287292
It looks like this model does reasonably well using only 7 of the 18 features, but it does not
beat the default model (.745) trained on all of the data.
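For a slightly fairer comparison, you could also train an unconstrained model on the same
seven columns. This is only a sketch; xg_small is a hypothetical name, and the exact score will
depend on your split.

# Unconstrained model restricted to the same columns as the constrained model.
xg_small = xgb.XGBClassifier()
xg_small.fit(X_train.loc[:, small_cols], y_train)
print(xg_small.score(X_test.loc[:, small_cols], y_test))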
Here is a visualization of the first tree. You can see that the r column is only followed by
major_cs in this tree.
my_dot_export(xg_constraints, num_trees=0, filename='img/constrains0_xg.dot',
              title='First Constrained Tree')
18.5 Summary
In this chapter, we explored feature interactions. Exploring feature interactions can give you
a better sense of how the columns in the data impact and relate to each other.
Figure 18.4: First constrained tree
18.6 Exercises
1. What is an interaction in a machine learning model?
2. What is the xgbfir library used for in machine learning?
3. How is the Average Rank column calculated in the xgbfir library?
4. What is the purpose of the Interaction Depth 1 sheet in the xgbfir library?
5. How can the xgbfir library be used to find the top feature interactions in an XGBoost model?
6. What is the purpose of using interaction constraints in a machine learning model?
7. What are the potential drawbacks of using interaction constraints in a machine learning model?
Chapter 19
Exploring SHAP
Have you ever wanted to peek inside the “black box” of a machine learning model and see
how it’s making predictions? While feature importance ranks the features, it doesn’t indicate
the direction of the impact.
Importance scores also can’t model non-linear relationships. For example, when I worked
with solid-state storage devices, there was a “bathtub curve” relationship between usage and
lifetime. Devices with little usage often had infant mortality issues, but once the device had
survived “burn-in”, they often lasted until end-of-life failure. A feature importance score
cannot help to tell this story. However, using a tool like SHAP can.
The shap library is a useful tool for explaining black box models (and white box models
too). SHapley Additive Explanations (SHAP) is a mechanism that explains a model’s global
and local behavior. At a global level, SHAP can provide a rank order of feature importance
and how the feature impacts the results. It can model non-linear relationships. At the local
level of an individual prediction, SHAP can explain how each feature contributes to the final
prediction.
19.1 SHAP
SHAP works with both classification and regression models. SHAP analyzes a model and
predictions from it and outputs a value for every feature of an example. The values are
determined using game theory and indicate how to distribute attribution among the features
to the target. In the case of classification, these values add up to the log odds probability of
the positive label. (For regression cases, SHAP values sum to the target prediction.)
One of the key advantages of the SHAP algorithm is that it can provide explanations for
both white box models, where the internal workings of the model are known, and black
box models, where the inner workings of the model are not known. It can also handle
complex models and interactions between features and has been shown to provide accurate
and consistent explanations in various settings.
Let’s look at using this library with our step-tuned model. Make sure you have installed
the library as it does not come with Python.
step_params = {'random_state': 42,
               'max_depth': 5,
               'min_child_weight': 0.6411044640540848,
               'subsample': 0.9492383155577023,
               'colsample_bytree': 0.6235721099295888,
               'gamma': 0.00011273797329538491,
               'learning_rate': 0.24399020050740935}
xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50,
                            n_estimators=500)
xg_step.fit(X_train, y_train,
            eval_set=[(X_train, y_train),
                      (X_test, y_test)
                     ]
           )
The shap library works well in Jupyter. It provides JavaScript visualizations that allow
some interactivity. To enable the JavaScript extension, we need to run the initjs function.
Then, we create an instance of a TreeExplainer. This can provide the SHAP values for us. The
TreeExplainer instance is callable and returns an Explanation object. This object has an attribute,
.values, with the SHAP values for every feature for each sample.
import shap
shap.initjs()
shap_ex = shap.TreeExplainer(xg_step)
vals = shap_ex(X_test)
I’ll stick the values in a DataFrame so we can see what they look like.
>>> shap_df = pd.DataFrame(vals.values, columns=X_test.columns)
>>> print(shap_df)
          age  education  years_exp  compensation    python         r  \
0    0.426614   0.390184  -0.246353      0.145825 -0.034680  0.379261
1    0.011164  -0.131144  -0.292135     -0.014521  0.016003 -1.043464
2   -0.218063  -0.140705  -0.411293      0.048281  0.424516  0.487451
3   -0.015227  -0.299068  -0.426323     -0.205840 -0.125867  0.320594
4   -0.468785  -0.200953  -0.230639      0.064272  0.021362  0.355619
..        ...        ...        ...           ...       ...       ...
900  0.268237  -0.112710   0.330096     -0.209942  0.012074 -1.144335
901  0.154642   0.572190  -0.227121      0.448253 -0.057847  0.290381
902  0.079129  -0.095771   1.136799      0.150705  0.133260  0.484103
903 -0.206584   0.430074  -0.385100     -0.078808 -0.083052 -0.992487
904  0.007351   0.589351   1.485712      0.056398 -0.047231  0.373149

          sql   Q1_Male  Q1_Female  Q1_Prefer not to say  \
0   -0.019017  0.004868   0.000877              0.002111
1    0.020524  0.039019   0.047712              0.001010
2   -0.098703 -0.004710   0.063545              0.000258
3   -0.062712  0.019110   0.012257              0.002184
4   -0.083344 -0.017202   0.002754              0.001432
..        ...       ...        ...                   ...
900 -0.065815  0.028274   0.032291              0.001012
901 -0.069114  0.006243   0.007443              0.002198
902 -0.120819  0.012034   0.057516              0.000266
903 -0.088811  0.080561   0.028648              0.000876
904 -0.105290  0.029283   0.074762              0.001406

     Q1_Prefer to self-describe  Q3_United States of America  Q3_India  Q3_China  \
0                           0.0                     0.033738 -0.117918 -0.018271
1                           0.0                     0.068171  0.086444 -0.026271
2                           0.0                     0.005533 -0.105534 -0.010548
3                           0.0                    -0.000044  0.042814 -0.024099
4                           0.0                     0.035772 -0.073206 -0.022188
..                          ...                          ...       ...       ...
900                         0.0                    -0.086408  0.136677  0.310404
901                         0.0                    -0.074364  0.115520 -0.008244
902                         0.0                     0.103810 -0.097848  0.003234
903                         0.0                     0.045213  0.066553 -0.031448
904                         0.0                    -0.031587  0.117050  0.008734

     major_cs  major_other  major_eng  major_stat
0    0.369876     0.014006  -0.013465    0.104177
1   -0.428484    -0.064157  -0.026041    0.069931
2   -0.333695     0.016919  -0.026932   -0.591922
3    0.486864     0.038438  -0.013727    0.047564
4    0.324419     0.012664  -0.019550    0.093926
..        ...          ...        ...         ...
900 -0.407444    -0.013195  -0.026412   -0.484734
901  0.602087     0.039680  -0.012820    0.083934
902 -0.313785    -0.080046  -0.066032    0.101975
903 -0.524141    -0.048108  -0.007185    0.093196
904 -0.505613    -0.159411  -0.067388    0.126560

[905 rows x 18 columns]
If you add up each row and add the .base_values attribute (the model's default guess), you
get the log odds that the sample is in the positive class. I will stick the sum in a dataframe
with the column name pred. I will also include the ground truth value and convert the log
odds sum into a probability column, prob. If the probability is above .5 (which happens when
pred is positive), we predict the positive label.
>>> print(pd.concat([shap_df.sum(axis='columns')
...                     .rename('pred') + vals.base_values,
...                  pd.Series(y_test, name='true')], axis='columns')
...       .assign(prob=lambda adf: (np.exp(adf.pred) /
...                                 (1 + np.exp(adf.pred))))
... )
         pred  true      prob
0    1.204692     1  0.769358
1   -2.493559     0  0.076311
2   -2.205473     0  0.099260
3   -0.843847     1  0.300725
4   -0.168726     1  0.457918
..        ...   ...       ...
900 -1.698727     0  0.154632
901  1.957872     0  0.876302
902  0.786588     0  0.687098
903 -2.299702     0  0.091148
904  1.497035     1  0.817132

[905 rows x 3 columns]
Note that for index rows 3, 4, 901, and 902, our model makes incorrect predictions.
19.2 Examining a Single Prediction
If we want to explore how SHAP attributes the prediction to the features, we can create a
few visualizations.
Let's explore the first entry in the test data. It looks like this:
>>> X_test.iloc[0]
age                            22.0
education                      16.0
years_exp                       1.0
compensation                    0.0
python                          1.0
r                               0.0
sql                             0.0
Q1_Male                         1.0
Q1_Female                       0.0
Q1_Prefer not to say            0.0
Q1_Prefer to self-describe      0.0
Q3_United States of America     0.0
Q3_India                        1.0
Q3_China                        0.0
major_cs                        1.0
major_other                     0.0
major_eng                       0.0
major_stat                      0.0
Name: 7894, dtype: float64
Our model predicts a value of 1 (which is also the ground truth), which corresponds to a
Software Engineer. Is that because of age, education, R, or something else?
>>> # predicts software engineer... why?
>>> xg_step.predict(X_test.iloc[[0]])
array([1])
>>> # ground truth
>>> y_test[0]
1
We can also do some math to validate the SHAP values. We start off with the expected
value.
>>> # Since this is below zero, the default is Data Scientist
>>> shap_ex.expected_value
-0.2166416
Then we add the sum of the values from the row to the expected value.
155
19. Exploring SHAP
>>> # > 0 therefore ... Software Engineer
>>> shap_ex.expected_value + vals.values[0].sum()
1.2046916
Because this value is above 0, we would predict Software Engineer for this sample.
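As a quick sanity check (a sketch using the objects defined above), you can run the summed
log odds through the sigmoid function and compare the result to the model's own probability
for this row; the two should agree up to rounding.

import numpy as np

log_odds = shap_ex.expected_value + vals.values[0].sum()
print(1 / (1 + np.exp(-log_odds)))              # sigmoid of the SHAP sum
print(xg_step.predict_proba(X_test.iloc[[0]]))  # the model's probabilities for row 0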
19.3 Waterfall Plots
Let’s make a waterfall plot to explore the SHAP values. This plots an explanation of a single
prediction. It displays how the SHAP value from each column impacts the result. You can see
a vertical line at -0.21. This represents the default or base value. This is the right edge of the
bottom bar. By default, our model would predict Data Scientist. But we need to look at each
of the features and examine how they push the prediction.
The waterfall plot is rank ordered. The feature with the most impact is shown at the top.
The next most important feature is second, and so on. The age value of 22 gave a SHAP value
of 0.43. This pushes the result towards the positive case. The R value, 0.0, also pushes toward
the positive case with a magnitude of 0.38. We repeat this and see that the values add up to
1.204 (in the upper right). (You can confirm this from our addition above.)
fig = plt.figure(figsize=(8, 4))
shap.plots.waterfall(vals[0], show=False)
Figure 19.1: SHAP waterfall plot of first test sample
To get a feel for how the values of this sample relate to the other samples, we can plot a
histogram showing the distributions of the values and mark our value.
I will write a function to do this, plot_histograms.

def plot_histograms(df, columns, row=None, title='', color='shap'):
    """
    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame to plot histograms for.
    columns : list of str
        The names of the columns to plot histograms for.
    row : pandas.Series, optional
        A row of data to plot a vertical line for.
    title : str, optional
        The title to use for the figure.
    color : str, optional
        'shap' - color positive values red. Negative blue
        'mean' - above mean red. Below blue.
        None - black

    Returns
    -------
    matplotlib.figure.Figure
        The figure object containing the histogram plots.
    """
    red = '#ff0051'
    blue = '#008bfb'
    fig, ax = plt.subplots(figsize=(8, 4))
    hist = (df
            [columns]
            .hist(ax=ax, color='#bbb')
           )
    fig = hist[0][0].get_figure()
    if row is not None:
        name2ax = {ax.get_title(): ax for ax in fig.axes}
        pos, neg = red, blue
        if color is None:
            pos, neg = 'black', 'black'
        for column in columns:
            if color == 'mean':
                mid = df[column].mean()
            else:
                mid = 0
            if row[column] > mid:
                c = pos
            else:
                c = neg
            name2ax[column].axvline(row[column], c=c)
    fig.tight_layout()
    fig.suptitle(title)
    return fig
I will use this function to show where the features of the individual sample are located
among the distributions.
features = ['education', 'r', 'major_cs', 'age', 'years_exp',
'compensation']
fig = plot_histograms(shap_df, features, shap_df.iloc[0],
title='SHAP values for row 0')
Figure 19.2: Histograms of SHAP values for first row
We can also use this function to visualize the histograms of the original feature values.
fig = plot_histograms(X_test, features, X_test.iloc[0],
title='Values for row 0', color='mean')
We can create a bar plot of the SHAP values with the Pandas library.
fig, ax = plt.subplots(figsize=(8, 4))
(pd.Series(vals.values[0], index=X_test.columns)
.sort_values(key=np.abs)
.plot.barh(ax=ax)
)
Figure 19.3: Histograms of original values for first row
Figure 19.4: Bar plot of SHAP values for first test sample
19.4 A Force Plot
The shap library also provides a flattened version of the waterfall plot, called a force plot. (In
my example, I pass matplotlib=True, but in Jupyter, you can leave that out and the plot will
have some interactivity.)
# use matplotlib if having js issues
# blue - DS
# red - Software Engineer
# to save need both matplotlib=True, show=False
res = shap.plots.force(base_value=vals.base_values,
                       shap_values=vals.values[0,:], features=X_test.iloc[0],
                       matplotlib=True, show=False
                       )
res.savefig('img/shap_forceplot0.png', dpi=600, bbox_inches='tight')
Figure 19.5: Force plot of SHAP values for first test sample
19.5 Force Plot with Multiple Predictions
The shap library allows you to pass in multiple rows of SHAP values into the force function.
In that case, it flips them vertically and stacks them next to each other. There currently is no
Matplotlib version for this plot, and the Jupyter/JavaScript version has a dropdown to change
the order of the results and a dropdown to change what features it shows.
# First n values
n = 100
# blue - DS
# red - Software Engineer
shap.plots.force(base_value=vals.base_values,
                 shap_values=vals.values[:n,:], features=X_test.iloc[:n],
                 )
19.6 Understanding Features with Dependence Plots
Because the shap library has computed SHAP values for every feature, we can use that to
visualize how features impact the model. The Dependence Scatter Plot shows the SHAP values
across the values for a feature. This plot can be a little confusing at first, but when you take a
moment to understand it, it becomes very useful.
Let’s pick one of the features that had a big impact on our model, education.
Figure 19.6: Force plot of SHAP values for first 100 test samples
fig, ax = plt.subplots(figsize=(8, 4))
shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals,
                   x_jitter=0, hist=False)
Figure 19.7: Dependence plot of SHAP values for the education feature
161
19. Exploring SHAP
Let’s try and make sense of this plot. In the x-axis, we have different entries for education.
In the y-axis, we have the different SHAP values for each education value. You might wonder
why the y values are different. This is because the education column interacts with other
columns, so sometimes a value of 19 pushes toward Data Scientist, and sometimes it moves
towards Software Engineer. Remember that y values above 0 will push toward the positive
label, and negative values push toward the negative label. My interpretation of this plot is
that there is a non-linear relationship between education and the target. Education values
push towards the negative label quicker as the value increases.
Here are a couple of other things to note about this plot. The shap library automatically
chooses an interaction index, another column to plot in a color gradient. In this example, it
decided on the age feature. This allows you to visualize how another column interacts with
education. It is a little hard to see, but higher compensation pushes more toward the software
engineer label for low education levels. You can specify a different column by setting the color
parameter to a single column of values (color=vals[:, 'compensation'])).
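Here is a minimal sketch of that variation, reusing the vals Explanation object from above and
coloring the education dependence plot by the compensation SHAP values.

fig, ax = plt.subplots(figsize=(8, 4))
shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals[:, 'compensation'])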
19.7 Jittering a Dependence Plot
Also, because the education entries are limited to a few values, the scatter plot plots the
values on top of each other. This makes it hard to understand what is really happening. We
will address that using the x_jitter parameter to spread the values. I also like to lower
the alpha parameter to control the transparency of each dot. By default, every dot has an
alpha of 1, which means that it is completely opaque. Lowering the value makes the dots
more see-through and lets you see the density better.
Here is a plot that sets the alpha to .5, spreads out the values in the x-axis with x_jitter,
and shows the interaction with the years_exp column.
fig, ax = plt.subplots(figsize=(8, 4))
shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals[:, 'years_exp'], x_jitter=1,
                   alpha=.5)
This plot makes it very clear that most of the education entries have values of 16 and 18.
Based on the SHAP values, having more schooling pushes toward data science.
You can also see a faint histogram. Because most values have an education level above
15, you might want to explore whether the model is overfitting the values for 12 and 13. The
value for 19 might warrant exploration as there are few entries there.
The jittering and histogram functionality is automatic when you use the scatter function.
Let’s explore one more dependence plot showing the impact of the major_cs column. We
will choose the R column for the interaction index.
fig, ax = plt.subplots(figsize=(8, 4))
shap.plots.scatter(vals[:, 'major_cs'], ax=ax, color=vals[:, 'r'], alpha=.5)
This plot indicates that studying computer science pushes the result more toward software
engineer than the reverse does toward data scientist. It is interesting to note that adding R to
studying CS strengthens the effect (which is the opposite of what I would have expected).
19.8 Heatmaps and Correlations
Though not part of the shap package, it can be helpful to look at the heatmaps of the
correlations of the SHAP values.
Figure 19.8: Dependence plot of SHAP values for education feature with jittering and different
interaction column
Figure 19.9: Dependence plot of SHAP values for education feature with jittering for major_cs column
Let’s start with the standard correlation of the features (not involving SHAP).
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(X_test
            .assign(software_eng=y_test)
            .corr(method='spearman')
            .loc[:, ['age', 'education', 'years_exp',
                     'compensation', 'r', 'major_cs',
                     'software_eng']],
            cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax
)
Figure 19.10: Correlation heatmap between subset of features for the testing data
This heatmap tells us if the features tend to move in the same or opposite directions. I
have added the software_eng column to see if there is a correlation with the prediction.
Now, we will create a heatmap of the correlation of the SHAP values for each prediction. This
correlation will tell us if two features tend to move the prediction similarly. For example, the
SHAP values for both education and major_stat push the prediction label in the same direction.
This can help us understand interactions from different points of view.
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(shap_df
            .assign(software_eng=y_test)
            .corr(method='spearman')
            .loc[:, ['age', 'education', 'years_exp', 'compensation', 'r', 'major_cs',
                     'software_eng']],
            cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax
)
Figure 19.11: Correlation heatmap between subset of SHAP values for each feature for the testing data
Generally, when dealing with correlation heatmaps, I want to look at the dark red and
dark blue values (ignoring the self-correlation values). It looks like the SHAP values for
compensation tend to rise if the SHAP values for age rise. (Note that this is different from
the actual non-SHAP values correlating, though, in the case of compensation and age, it looks
like they do.) You could explore these further by doing a scatter plot.
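Here is a sketch of that follow-up scatter plot, using the shap_df DataFrame of SHAP values
built earlier.

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(shap_df['age'], shap_df['compensation'], alpha=.5)
ax.set_xlabel('age SHAP value')
ax.set_ylabel('compensation SHAP value')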
19.9 Beeswarm Plots of Global Behavior
One of the reasons that SHAP is so popular is that it not only explains local predictions and
feature interactions but also gives you a global view of how features impact the model.
Let's look at the beeswarm or summary plot. This provides a rank-ordered listing of the
features that have the most impact on the final prediction. On the x-axis is the SHAP value.
Positive values push toward the positive label. Each feature is colored to indicate high (red)
or low (blue) values.
The R feature is binary, with only red and blue values. You can see that a high value (1)
results in a significant push toward data science. The spread of the red values indicates some
interaction with other columns. Another way to understand this is that there are probably
some low-frequency effects when R is red that cause a large impact on our model.
The years_exp feature has gradations in the color because there are multiple options for
that feature. It looks like there is a pretty smooth transition from blue to red. Contrast that
smooth gradation with education, where from the left you see red, then purple, then red again,
then blue. This indicates non-monotonic behavior for the education feature. We saw this in the
dependence plot for education above.
fig = plt.figure(figsize=(8, 4))
shap.plots.beeswarm(vals)
Figure 19.12: Summary plot of SHAP values for top features
If you want to see all of the features, use the max_display parameter. I’m also changing the
colormap to a Matplotlib colormap that will work better in grayscale. (This is for all those
who bought the grayscale physical edition.)
from matplotlib import cm
fig = plt.figure(figsize=(8, 4))
shap.plots.beeswarm(vals, max_display=len(X_test.columns), color=cm.autumn_r)
19.10 SHAP with No Interaction
If your model has interactions, the SHAP values will reflect them. If we remove interactions
(by setting max_depth to 1), we simplify the model. It also makes non-linear responses more
clear.
Let’s train a model of stumps and look at some of the SHAP plots.
no_int_params = {'random_state': 42,
                 'max_depth': 1
                }
xg_no_int = xgb.XGBClassifier(**no_int_params, early_stopping_rounds=50,
                              n_estimators=500)
Figure 19.13: Summary plot of SHAP values for all the features
xg_no_int.fit(X_train, y_train,
              eval_set=[(X_train, y_train),
                        (X_test, y_test)
                       ]
             )
The stump model does not perform as well as the default model, but it is close. The
default score was 0.74.
>>> xg_no_int.score(X_test, y_test)
0.7370165745856354
shap_ind = shap.TreeExplainer(xg_no_int)
shap_ind_vals = shap_ind(X_test)
Here is the summary plot for the model without interactions.
from matplotlib import cm
fig = plt.figure(figsize=(8, 4))
shap.plots.beeswarm(shap_ind_vals, max_display=len(X_test.columns))
Figure 19.14: Summary plot of SHAP values for all the features without interactions
It is interesting to observe that the ordering of feature importance changes with this model.
Here is the years_exp plot for our model with interactions. You can see a vertical spread in
the y-axis from the interactions of years_exp with other columns. (The spread in the x-axis is
due to jittering.)
fig, ax = plt.subplots(figsize=(8, 4))
shap.plots.scatter(vals[:, 'years_exp'], ax=ax,
                   color=vals[:, 'age'], alpha=.5,
                   x_jitter=1)
Here is the dependence plot for the model without interactions. You can see that there is
no variation in the y-axis.
Figure 19.15: Dependence plot of SHAP values for years_exp feature with feature interactions and
jittering.
fig, ax = plt.subplots(figsize=(8, 4))
shap.plots.scatter(shap_ind_vals[:, 'years_exp'], ax=ax,
                   color=shap_ind_vals[:, 'age'], alpha=.5,
                   x_jitter=1)
This makes the non-linear response in years_exp very clear. Note that you might want to
jitter this in the y-axis to help reveal population density, but shap doesn't support that. You
can change the alpha parameter to help with density.
19.11 Summary
SHAP (SHapley Additive exPlanations) is a powerful tool for explaining the behavior of
machine learning models, including XGBoost models. It provides insight into which features
are most important in making predictions and how each feature contributes to the model’s
output. This can be useful for several reasons, including improving the interpretability of a
model, identifying potential issues or biases in the data, and gaining a better understanding
of how the model makes predictions.
In this chapter, we explored using SHAP values to understand predictions and interactions
between the features. We also used SHAP values to rank features.
SHAP values are an important mechanism for explaining black-box models like XGBoost.
19.12 Exercises
1. How can SHAP be used to explain a model’s global and local behavior?
2. How does the waterfall plot display the impact of each feature on the prediction result?
3. How does the force plot display the impact of features on the model?
Figure 19.16: Dependence plot of SHAP values for years_exp feature with no feature interaction and
jittering for major_cs column
4. What does the summary plot display, and how does it help interpret feature importance?
5. How does the dependence plot help to identify non-monotonic behavior?
6. How can the summary plot help identify features with non-monotonic behavior?
7. How can the summary plot help to identify features that have interacted with other features?
Chapter 20
Better Models with ICE, Partial Dependence, Monotonic
Constraints, and Calibration
In this chapter, we will explore some advanced techniques that can be used to improve the
interpretability and performance of XGBoost models. We will discuss Individual Conditional
Expectation (ICE) plots and Partial Dependence Plots (PDP). These powerful visualization
tools allow us to understand how the input features affect the predictions made by the model.
We will also examine how to constrain XGBoost models to prevent overfitting and improve
generalization performance. This chapter will provide a comprehensive overview of some
important techniques that can be used to extract valuable insights from XGBoost models.
20.1 ICE Plots
Individual Conditional Expectation (ICE) plots are a useful tool for visualizing the effect of a
single input variable on the output of a machine learning model. In the context of XGBoost,
ICE plots can help us understand how each feature contributes to the final predictions made
by the model. This section will explore using Python to create ICE plots for XGBoost models.
An ICE plot shows the predicted values of a machine learning model for a single
observation as the value of a specific input feature varies. In other words, an ICE plot displays
the model’s output for a fixed instance while incrementally changing one input feature’s
value. Each line in the plot represents the predicted output for a particular instance as the
input feature changes. By examining the shape and slope of each line, we can gain insights
into how the model uses that input feature to make predictions. ICE plots provide a detailed
view of the relationship between a single input feature and the model’s predictions.
To create an ICE plot for a single row, follow these steps:
1. Choose an input feature to analyze and select a range of values for that feature to vary.
2. Fix the values of all other input features to the observed values for the instance of
interest.
3. Vary the selected input feature across the values chosen in step 1.
4. For each value of the selected input feature, calculate the corresponding prediction of
the model for the specific instance of interest.
5. Plot the values obtained in step 4 against the corresponding values of the input feature
to create a single line in the ICE plot.
You can repeat this process for each row in your data and visualize how changing that
feature would impact the final result.
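Before reaching for scikit-learn's built-in tooling below, here is a minimal hand-rolled sketch
of those steps. It assumes a fitted binary classifier (clf) with a predict_proba method and a
feature DataFrame X; single_ice_line is a hypothetical helper, not part of any library.

import numpy as np
import pandas as pd

def single_ice_line(clf, X, row_idx, col):
    row = X.iloc[[row_idx]]                    # fix all other features (step 2)
    grid = np.sort(X[col].unique())            # values to vary over (steps 1 and 3)
    preds = [clf.predict_proba(row.assign(**{col: val}))[0, 1]
             for val in grid]                  # prediction per value (step 4)
    return pd.Series(preds, index=grid)        # one ICE line (step 5)

Calling single_ice_line(model, X_train, 0, 'education') would give the line for the first training
row; repeating it over many rows and plotting the resulting Series values gives the ICE plot.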
xgb_def = xgb.XGBClassifier(random_state=42)
xgb_def.fit(X_train, y_train)
xgb_def.score(X_test, y_test)
Let’s make the default model for the r and education features. We will use scikit-learn to
create an ICE plot.
from sklearn.inspection import PartialDependenceDisplay
fig, axes = plt.subplots(ncols=2, figsize=(8,4))
PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],
kind='individual', ax=axes)
Figure 20.1: ICE plots for R and education. As R values have a positive value, the probability of data
scientists tends to increase. For education, it appears that more education also tends to push toward
data scientists.
It is a little difficult to discern what is happening here. Remember that the y-axis in these
plots represents the probability of the final label, Software Engineer (1) or Data Scientist (0).
If r goes to 1, there is a strong push to Data Scientist. For the education plot, low education
values tend toward software engineering, while larger values push toward data science.
One technique to help make the visualizations more clear is to have them all start at the
same value on the left-hand side. We can do that with the centered parameter.
fig, axes = plt.subplots(ncols=2, figsize=(8,4))
PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],
                                        centered=True,
                                        kind='individual', ax=axes)
Figure 20.2: Centered ICE plots help visualize the impact of the feature as it changes.
This plot also reveals ticks at the bottom that are intended to help visualize the number of
rows with those values. However, due to the binned values in the survey data, the education
levels are not discernible.
We can throw a histogram on top to see the distribution. This should provide intuition
into the density of data. Locations with more data would tend to have better predictions. In
the case of education, we don’t have many examples of respondents with less than 14 years
of education. That would cause us to have more uncertainty about those predictions.
fig, axes = plt.subplots(ncols=2, figsize=(8,4))
ax_h0 = axes[0].twinx()
ax_h0.hist(X_train.r, zorder=0)
ax_h1 = axes[1].twinx()
ax_h1.hist(X_train.education, zorder=0)
PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],
                                        centered=True,
                                        ice_lines_kw={'zorder':10},
                                        kind='individual', ax=axes)
fig.tight_layout()
Figure 20.3: ICE plots with histograms to aid with understanding where there is sparseness in the data
that might lead to overfitting.
I wrote quantile_ice, a function that will subdivide a feature by the quantile of the
predicted probability and create lines from the average of each quantile. It can also show
a histogram.

def quantile_ice(clf, X, col, center=True, q=10, color='k', alpha=.5, legend=True,
                 add_hist=False, title='', val_limit=10, ax=None):
    """
    Generate an ICE plot for a binary classifier's predicted probabilities split
    by quantiles.

    Parameters:
    ----------
    clf : binary classifier
        A binary classifier with a `predict_proba` method.
    X : DataFrame
        Feature matrix to predict on with shape (n_samples, n_features).
    col : str
        Name of column in `X` to plot against the quantiles of predicted probabilities.
    center : bool, default=True
        Whether to center the plot on 0.5.
    q : int, default=10
        Number of quantiles to split the predicted probabilities into.
    color : str or array-like, default='k'
        Color(s) of the lines in the plot.
    alpha : float, default=0.5
        Opacity of the lines in the plot.
    legend : bool, default=True
        Whether to show the plot legend.
    add_hist : bool, default=False
        Whether to add a histogram of the `col` variable to the plot.
    title : str, default=''
        Title of the plot.
    val_limit : num, default=10
        Maximum number of values to test for col.
    ax : Matplotlib Axis, default=None
        Axis to plot on.

    Returns:
    -------
    results : DataFrame
        A DataFrame with the same columns as `X`, as well as a `prob` column with
        the predicted probabilities of `clf` for each row in `X`, and a `group`
        column indicating which quantile group the row belongs to.
    """
    probs = clf.predict_proba(X)
    df = (X
          .assign(probs=probs[:,-1],
                  p_bin=lambda df_: pd.qcut(df_.probs, q=q,
                                            labels=[f'q{n}' for n in range(1, q+1)])
                 )
         )
    groups = df.groupby('p_bin')
    vals = X.loc[:, col].unique()
    if len(vals) > val_limit:
        vals = np.linspace(min(vals), max(vals), num=val_limit)
    res = []
    for name, g in groups:
        for val in vals:
            this_X = g.loc[:, X.columns].assign(**{col: val})
            q_prob = clf.predict_proba(this_X)[:,-1]
            res.append(this_X.assign(prob=q_prob, group=name))
    results = pd.concat(res, axis='index')
    if ax is None:
        fig, ax = plt.subplots(figsize=(8,4))
    if add_hist:
        back_ax = ax.twinx()
        back_ax.hist(X[col], density=True, alpha=.2)
    for name, g in results.groupby('group'):
        g.groupby(col).prob.mean().plot(ax=ax, label=name, color=color, alpha=alpha)
    if legend:
        ax.legend()
    if title:
        ax.set_title(title)
    return results
Let’s plot the 10 quantiles for the education feature. Ideally, these lines do not cross each
other. If they do cross, you might want to debug the model to make sure that it isn’t overfitting
on sparse data.
fig, ax = plt.subplots(figsize=(8,4))
quantile_ice(xgb_def, X_train, 'education', q=10, legend=False, add_hist=True, ax=ax,
             title='ICE plot for Age')
Figure 20.4: ICE plot for quantiles.
20.2 ICE Plots with SHAP
There are multiple tools to create ICE plots. If you know the magic parameters, you can get
the SHAP library to create an ICE plot. It will also create the histogram.
This function is poorly documented, but I've decoded it for you. The model parameter
needs to be a function that, given rows of data, will return probabilities. You can specify
the rows to draw ICE lines for using the data parameter. You need to ensure that the npoints
parameter is the number of unique values for a column.
import shap
fig, ax = plt.subplots(figsize=(8,4))
shap.plots.partial_dependence_plot(ind='education',
    model=lambda rows: xgb_def.predict_proba(rows)[:,-1],
    data=X_train.iloc[0:1000], ice=True,
    npoints=(X_train.education.nunique()),
    pd_linewidth=0, show=False, ax=ax)
ax.set_title('ICE plot (from SHAP)')
Figure 20.5: An ICE plot created with the SHAP library.
20.3 Partial Dependence Plots
If you set the quantile to 1 in the code above, you create a Partial Dependence Plot. It is the
average of the ICE plots.
Partial Dependence Plots (PDPs) are a popular visualization technique used in machine
learning to understand the relationship between input variables and the model's predicted
output. PDPs illustrate the average behavior of the model for a particular input variable while
holding all other variables constant. These plots allow us to identify non-linear relationships,
interactions, and other important patterns in the data that are not immediately apparent from
summary statistics or simple scatterplots.
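To make the "average of the ICE plots" idea concrete, here is a sketch (assuming the xgb_def
model and X_train from above) that computes the partial dependence for education by hand:
override the column at each grid value and average the predicted probabilities.

import numpy as np
import pandas as pd

grid = np.sort(X_train['education'].unique())
pdp = pd.Series(
    [xgb_def.predict_proba(X_train.assign(education=val))[:, 1].mean()
     for val in grid],                         # mean prediction per grid value
    index=grid, name='avg_predicted_prob')
print(pdp)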
fig, axes = plt.subplots(ncols=2, figsize=(8,4))
PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],
                                        kind='average', ax=axes)
fig.tight_layout()
A common suggestion is to plot the PDP on top of a centered ICE plot. This lets us
better understand the general impact of a feature.
fig, axes = plt.subplots(ncols=2, figsize=(8,4))
PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'],
                                        centered=True, kind='both',
                                        ax=axes)
fig.tight_layout()
Let’s expore the years_exp and Q1_Male plots.
fig, axes = plt.subplots(ncols=2, figsize=(8,4))
PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['years_exp', 'Q1_Male'],
177
20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration
Figure 20.6: PDP and ICE plots for R and education.
fig.tight_layout()
centered=True, kind='both',
ax=axes)
Figure 20.7: PDP and ICE plots for years_exp and Q1_Male.
It looks like years_exp tends to be monotonically increasing. The Q1_Male plot is flat for
the PDP, but there is some spread in the ICE values, indicating that there is probably an
interaction with other columns.
20.4 PDP with SHAP
It might not surprise you that the SHAP library can make PDP plots as well. We will create
PDP plots with the aptly named partial_dependence_plot function. It can even add a histogram
to the plot.
Let’s create one for the years_exp feature.
import shap
fig, ax = plt.subplots(figsize=(8,4))
col = 'years_exp'
shap.plots.partial_dependence_plot(ind=col,
    model=lambda rows: xgb_def.predict_proba(rows)[:,-1],
    data=X_train.iloc[0:1000], ice=False,
    npoints=(X_train[col].nunique()),
    pd_linewidth=2, show=False, ax=ax)
ax.set_title('PDP plot (from SHAP)')
Figure 20.8: A PDP plot created with SHAP.
If we specify ice=True, we can plot the PDP on top of the ICE plots. In this example, we
also highlight the expected value for the years_exp column and the expected probability for
the positive label.
fig, ax = plt.subplots(figsize=(8,4))
col = 'years_exp'
shap.plots.partial_dependence_plot(ind=col,
    model=lambda rows: xgb_def.predict_proba(rows)[:,-1],
    data=X_train.iloc[0:1000], ice=True,
    npoints=(X_train[col].nunique()),
    model_expected_value=True,
    feature_expected_value=True,
    pd_linewidth=2, show=False, ax=ax)
ax.set_title('PDP plot (from SHAP) with ICE Plots')
Figure 20.9: PDP and ICE plots created with SHAP.
20.5 Monotonic Constraints
PDP and ICE plots are useful visualization tools in machine learning that can help to identify
monotonic constraints in a model.
A monotonic constraint means that as the value of a specific feature changes, the predicted
outcome should either always increase or decrease. This is often the case in situations with
a correlation between a feature and the outcome. The standard disclaimer that correlation is
not causation applies here!
Examining the ranked correlation (Spearman) is one way to explore these relationships.
fig, ax = plt.subplots(figsize=(8,4))
(X_test
 .assign(target=y_test)
 .corr(method='spearman')
 .iloc[:-1]
 .loc[:,'target']
 .sort_values(key=np.abs)
 .plot.barh(title='Spearman Correlation with Target', ax=ax)
)
Figure 20.10: Spearman correlation with target variable. You can use this to guide you with monotonic
constraints.
You can create a cutoff and explore variables above that cutoff. Assume a cutoff for an
absolute value of 0.2. The education column is non-binary and might be a good candidate to
explore.
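Here is a small sketch (assuming X_test and y_test from above) that applies such a cutoff
programmatically and lists the candidate columns.

import numpy as np

corr = (X_test
        .assign(target=y_test)
        .corr(method='spearman')
        .loc[:, 'target']
        .drop('target'))          # drop the self-correlation
print(corr[corr.abs() > 0.2].sort_values(key=np.abs))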
PDP and ICE plots can also help to identify monotonic constraints by showing the
relationship between a feature and the predicted outcome while holding all other features
constant. If the plot shows a clear increasing or decreasing trend for a specific feature, this
can indicate a monotonic constraint.
You can see the PDP for education in the last section.
If we inspect the PDP plot for education, the line appears almost monotonic. It
has a small bump at 19. This plot caused me to explore the data a little more. The tweak_kag
function converts Professional Degree to a value of 19 (19 years of school). Most professional
degrees take 3-5 years.
Let’s use some Pandas to explore what happens in education and the mean value of other
columns:
>>> print(X_train
...    .assign(target=y_train)
...    .groupby('education')
...    .mean()
...    .loc[:, ['age', 'years_exp', 'target']]
... )
                 age  years_exp    target
education
12.0       30.428571   2.857143  0.714286
13.0       30.369565   6.760870  0.652174
16.0       25.720867   2.849593  0.605691
18.0       28.913628   3.225528  0.393474
19.0       27.642857   4.166667  0.571429
20.0       35.310638   4.834043  0.174468
It appears that the target value does jump at 19. Let’s see how many values there are for
education 19:
>>> X_train.education.value_counts()
18.0    1042
16.0     738
20.0     235
13.0      46
19.0      42
12.0       7
Name: education, dtype: int64
There aren’t very many values. We can dig in a little more and inspect the raw data
(remember that 19 was derived from the Professional degree value) to see if anything stands
out:
>>> print(raw
...    .query('Q3.isin(["United States of America", "China", "India"]) '
...           'and Q6.isin(["Data Scientist", "Software Engineer"])')
...    .query('Q4 == "Professional degree"')
...    .pipe(lambda df_: pd.crosstab(index=df_.Q5, columns=df_.Q6))
... )
Q6                                                   Data Scientist  \
Q5
A business discipline (accounting, economics, f...               0
Computer science (software engineering, etc.)                   12
Engineering (non-computer focused)                                6
Humanities (history, literature, philosophy, etc.)                2
I never declared a major                                          0
Mathematics or statistics                                         2
Other                                                             2
Physics or astronomy                                              2

Q6                                                   Software Engineer
Q5
A business discipline (accounting, economics, f...                   1
Computer science (software engineering, etc.)                       19
Engineering (non-computer focused)                                  10
Humanities (history, literature, philosophy, etc.)                   0
I never declared a major                                              1
Mathematics or statistics                                             1
Other                                                                 1
Physics or astronomy                                                  1
Deciding whether to add a monotonic constraint to XGBoost depends on the specific
context and goals of the model. A monotonic constraint can be added to ensure the model
adheres to a particular relationship between a feature and the outcome. For example, in
a scenario where we expect a clear cause-and-effect relationship between a feature and the
outcome, such as in financial risk modeling, we might want to add a monotonic constraint to
ensure that the model predicts an increasing or decreasing trend for that feature.
Adding a monotonic constraint can help to improve the accuracy and interpretability of
the model, by providing a clear constraint on the relationship between the feature and the
outcome. It can also help reduce overfitting by limiting the range of values the model can
predict for a given feature.
Looking at the PDP and ICE plots, there does not appear to be enough data to justify leaving
education unconstrained. There is a chance that the model is overfitting the
education entries with the value of 19. To simplify the model, we will add a monotonic
constraint.
I will also add a constraint to the years_exp column. We will specify the monotone_constraints
parameter mapping the column name to the sign of the slope of the PDP plot. Because
years_exp increases, we will map it to 1. The education column is decreasing, so we map it
to -1.
xgb_const = xgb.XGBClassifier(random_state=42,
                              monotone_constraints={'years_exp':1, 'education':-1})
xgb_const.fit(X_train, y_train)
xgb_const.score(X_test, y_test)
It looks like the constraints improved (and simplified) our model!
I’m going to go one step further. Because the PDP line for Q1_Male was flat, I’m going to
remove gender columns as well.
small_cols = ['age', 'education', 'years_exp', 'compensation', 'python', 'r', 'sql',
              #'Q1_Male', 'Q1_Female', 'Q1_Prefer not to say',
              #'Q1_Prefer to self-describe',
              'Q3_United States of America', 'Q3_India',
              'Q3_China', 'major_cs', 'major_other', 'major_eng', 'major_stat']
xgb_const2 = xgb.XGBClassifier(random_state=42,
                               monotone_constraints={'years_exp':1, 'education':-1})
xgb_const2.fit(X_train[small_cols], y_train)
Let’s look at the score:
>>> xgb_const2.score(X_test[small_cols], y_test)
0.7569060773480663
Slightly better! And simpler!
It looks like these constraints improved our model (and also simplified it). Another way to
evaluate this is to look at the feature importance values. Our default model focuses on the R
column now. The constrained model will hopefully spread out the feature importance more.
fig, ax = plt.subplots(figsize=(8,4))
(pd.Series(xgb_def.feature_importances_, index=X_train.columns)
.sort_values()
.plot.barh(ax=ax)
)
Figure 20.11: Unconstrained feature importance. Notice that the R column is driving our model right
now. We would like to distribute the importance if possible.
It appears that the feature importance values from the constrained model are slightly more
evenly dispersed, indicating that the model will pay less attention to the R column.
Figure 20.12: Constrained feature importance. Notice that the R column has less impact and the other
columns have more.
fig, ax = plt.subplots(figsize=(8,4))
(pd.Series(xgb_const2.feature_importances_, index=small_cols)
.sort_values()
.plot.barh(ax=ax)
)
20.6 Calibrating a Model
This section will look at fine-tuning our model with calibration. Calibration refers to adjusting
a model’s output to better align with the actual probabilities of the target variable. If we want
to use the probabilities (from .predict_proba) and not just the target, we will want to calibrate
our model.
With XGBoost, the probability output often does not correspond to the actual probabilities
of the target variable. XGBoost models tend to produce predicted probabilities biased towards
the ends of the probability range, meaning they often overestimate or underestimate the actual
probabilities. This will lead to poor performance on tasks that require accurate probability
estimates, such as ranking, threshold selection, and decision-making.
Calibrating an XGBoost model involves post-processing the predicted probabilities of the
model using a calibration method, such as Platt scaling or isotonic regression. These methods
map the model’s predicted probabilities to calibrated probabilities that better align with the
actual probabilities of the target variable.
Let’s calibrate the model using scikit-learn.
Two calibration types are available, sigmoid and isotonic. Sigmoid calibration involves
fitting a logistic regression model to the predicted probabilities of a binary classifier and
transforming the probabilities using the logistic function. Isotonic calibration consists in
fitting a non-parametric monotonic function to the predicted probabilities, ensuring that the
function increases with increasing probabilities.
Both techniques can improve the reliability of the probabilistic predictions of a model,
but they differ in their flexibility and interpretability. Sigmoid calibration is a simple and
computationally efficient method that can be easily implemented, but it assumes a parametric
form for the calibration function and may not capture more complex calibration patterns.
Isotonic calibration, on the other hand, is a more flexible and data-driven method that can
capture more complex calibration patterns. Still, it may require more data and computational
resources to implement.
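To build intuition, here is a rough sketch of what sigmoid (Platt) calibration does conceptually,
assuming xgb_def, X_test, and y_test from above: fit a one-feature logistic regression on the
model's scores and use its output as the calibrated probability. CalibratedClassifierCV handles
the details (and the proper use of held-out data) for you.

from sklearn.linear_model import LogisticRegression

raw = xgb_def.predict_proba(X_test)[:, [1]]    # uncalibrated scores as a 2D column
platt = LogisticRegression()
platt.fit(raw, y_test)                         # learn the remapping
calibrated = platt.predict_proba(raw)[:, 1]    # calibrated probabilities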
We will try both and compare the results. We need to provide the CalibratedClassifierCV
with our existing model and the method parameter. I’m using cv=prefit because I already fit
the model. Then we call the .fit method.
from sklearn.calibration import CalibratedClassifierCV
xgb_cal = CalibratedClassifierCV(xgb_def, method='sigmoid', cv='prefit')
xgb_cal.fit(X_test, y_test)
xgb_cal_iso = CalibratedClassifierCV(xgb_def, method='isotonic', cv='prefit')
xgb_cal_iso.fit(X_test, y_test)
20.7 Calibration Curves
We will visualize the results with a calibration curve.
A calibration curve is a graphical representation of the relationship between predicted
probabilities and the actual frequencies of events in a binary classification problem. The curve
plots the predicted probabilities (x-axis) against the observed frequencies of the positive class
(y-axis) for different probability thresholds.
The calibration curve visually assesses how well a classification model’s predicted
probabilities match the true probabilities of the events of interest. Ideally, a well-calibrated
model should have predicted probabilities that match the actual probabilities of the events,
meaning that the calibration curve should be close to the diagonal line.
If the calibration curve deviates from the diagonal line, it suggests that the model’s
predicted probabilities are either overconfident or underconfident. An overconfident model
predicts high probabilities for events that are unlikely to occur, while an underconfident
model predicts low probabilities for events that are likely to occur.
from sklearn.calibration import CalibrationDisplay
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(8,6))
gs = GridSpec(4, 3)
axes = fig.add_subplot(gs[:2, :3])
display = CalibrationDisplay.from_estimator(xgb_def, X_test, y_test,
                                            n_bins=10, ax=axes)
disp_cal = CalibrationDisplay.from_estimator(xgb_cal, X_test, y_test,
                                             n_bins=10, ax=axes, name='sigmoid')
disp_cal_iso = CalibrationDisplay.from_estimator(xgb_cal_iso, X_test, y_test,
                                                 n_bins=10, ax=axes, name='isotonic')
row = 2
col = 0
ax = fig.add_subplot(gs[row, col])
ax.hist(display.y_prob, range=(0,1), bins=20)
ax.set(title='Default', xlabel='Predicted Prob')
ax2 = fig.add_subplot(gs[row, 1])
ax2.hist(disp_cal.y_prob, range=(0,1), bins=20)
ax2.set(title='Sigmoid', xlabel='Predicted Prob')
ax3 = fig.add_subplot(gs[row, 2])
ax3.hist(disp_cal_iso.y_prob, range=(0,1), bins=20)
ax3.set(title='Isotonic', xlabel='Predicted Prob')
fig.tight_layout()
The calibration curve suggests that our default model does a respectable job. It is tracking
the diagonal pretty well. But our calibrated models look like they track the diagonal better.
The histograms show the distribution of the default model and the calibrated models.
Let’s look at the score of our calibrated models. It looks like they perform slightly better
(at least for accuracy).
>>> xgb_cal.score(X_test, y_test)
0.7480662983425415
>>> xgb_cal_iso.score(X_test, y_test)
0.7491712707182321
>>> xgb_def.score(X_test, y_test)
0.7458563535911602
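Accuracy only tells part of the story because calibration is about the quality of the
probabilities themselves. Here is a sketch (using the models above) that compares Brier
scores, where lower is better.

from sklearn.metrics import brier_score_loss

for name, model in [('default', xgb_def), ('sigmoid', xgb_cal),
                    ('isotonic', xgb_cal_iso)]:
    probs = model.predict_proba(X_test)[:, 1]   # predicted probability of positive label
    print(name, brier_score_loss(y_test, probs))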
Figure 20.13: Calibration curves for the default and calibrated models. Models that track the diagonal
tend to return probabilities that reflect the data.
20.8 Summary
This chapter explored ICE and PDP plots. Then we looked at monotonic constraints and
model calibration. ICE and PDP plots are two visualization techniques used to interpret the
output of XGBoost models. ICE plots show the predicted outcome of an individual instance
as a function of a single feature. In contrast, PDP plots show the average predicted outcome
across all instances as a function of a single feature. Monotonic constraints can be added
to XGBoost models to ensure that the model's predictions increase or decrease monotonically
with increasing feature values. This can simplify the model to prevent overfitting. Calibration
techniques such as sigmoid or isotonic calibration can be used to improve the reliability of the
model's probabilistic predictions.
20.9 Exercises
1. How can ICE and PDP plots identify non-linear relationships between features and the
predicted outcome in an XGBoost model?
2. How can ICE and PDP plots be used to compare the effects of different features on the
predicted outcome in an XGBoost model?
3. Can ICE and PDP plots visualize interactions between features in an XGBoost model?
If so, how?
4. What are monotonic constraints in XGBoost, and how do they impact the model’s
predictions?
5. How can monotonic constraints be specified for individual features in an XGBoost
model?
6. What are some potential benefits of adding monotonic constraints to an XGBoost model,
and in what situations are they particularly useful?
7. How can the calibration curve be used to assess the calibration of an XGBoost model’s
predicted probabilities?
8. How can calibration techniques help improve the reliability of an XGBoost model’s
probabilistic predictions in practical applications?
Chapter 21
Serving Models with MLFlow
In this chapter, we will explore the MLFlow library.
MLflow is an open-source library for managing the end-to-end machine learning lifecycle.
It provides tools for tracking experiments, packaging machine learning models, and
deploying models to production. MLflow is particularly useful for XGBoost because it allows
users to track and compare the performance of different XGBoost models and quickly deploy
the best-performing model to production.
One of the key features of MLflow is its experiment-tracking capability. This allows users
to log various metrics, such as model accuracy or training time, and compare the results of
different experiments. This can be useful for identifying the most effective hyperparameters
for an XGBoost model and choosing the best-performing model.
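At its simplest, tracking looks something like the following sketch. The experiment name and the logged values here are placeholders; the full example for our model appears later in the chapter.

import mlflow

mlflow.set_experiment(experiment_name='scratch')  # placeholder experiment name
with mlflow.start_run():
    mlflow.log_param('max_depth', 3)     # a hyperparameter you chose
    mlflow.log_metric('accuracy', 0.75)  # a result you measured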
MLflow also includes a model packaging feature, which makes it easy to package and
deploy XGBoost models. This makes it possible to deploy XGBoost models to production
environments and integrate them with other machine learning systems.
21.1 Installation and Setup
MLFlow is a third-party library. Make sure you install it (for example, with pip install mlflow) before running the examples.
Let’s show how to use MLFlow with the model we have been developing. We will then
serve the model using MLFlow.
Here are the imports that we will need:
%matplotlib inline
from feature_engine import encoding, imputation
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
from sklearn import base, metrics, model_selection, \
    pipeline, preprocessing
from sklearn.metrics import accuracy_score, roc_auc_score
import xgboost as xgb
import urllib
import zipfile
Let’s load the raw data.
This code will create a pipeline and prepare the training and testing data.
import pandas as pd
from sklearn import model_selection, preprocessing
import xg_helpers as xhelp

url = 'https://github.com/mattharrison/datasets/raw/master/data/'\
    'kaggle-survey-2018.zip'
fname = 'kaggle-survey-2018.zip'
member_name = 'multipleChoiceResponses.csv'

raw = xhelp.extract_zip(url, fname, member_name)

## Create raw X and raw y
kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6')

## Split data
kag_X_train, kag_X_test, kag_y_train, kag_y_test = \
    model_selection.train_test_split(
        kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y)

## Transform X with pipeline
X_train = xhelp.kag_pl.fit_transform(kag_X_train)
X_test = xhelp.kag_pl.transform(kag_X_test)

## Transform y with label encoder
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(kag_y_train)
y_train = label_encoder.transform(kag_y_train)
y_test = label_encoder.transform(kag_y_test)

# Combined Data for cross validation/etc
X = pd.concat([X_train, X_test], axis='index')
y = pd.Series([*y_train, *y_test], index=X.index)
This code uses hyperopt to train the model. It has been extended to log various metrics to MLFlow while training.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import mlflow
from sklearn import metrics
import xgboost as xgb
ex_id = mlflow.create_experiment(name='ex3', artifact_location='ex2path')
mlflow.set_experiment(experiment_name='ex3')

with mlflow.start_run():
    params = {'random_state': 42}
    rounds = [{'max_depth': hp.quniform('max_depth', 1, 12, 1),  # tree
               'min_child_weight': hp.loguniform('min_child_weight', -2, 3)},
              {'subsample': hp.uniform('subsample', 0.5, 1),  # stochastic
               'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)},
              {'gamma': hp.loguniform('gamma', -10, 10)},  # regularization
              {'learning_rate': hp.loguniform('learning_rate', -7, 0)}  # boosting
             ]
    for round in rounds:
        params = {**params, **round}
        trials = Trials()
        best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(
                        space, X_train, y_train, X_test, y_test),
                    space=params,
                    algo=tpe.suggest,
                    max_evals=10,
                    trials=trials,
                    timeout=60*5  # 5 minutes
                   )
        params = {**params, **best}
    for param, val in params.items():
        mlflow.log_param(param, val)

    params['max_depth'] = int(params['max_depth'])
    xg = xgb.XGBClassifier(eval_metric='logloss', early_stopping_rounds=50,
                           **params)
    xg.fit(X_train, y_train,
           eval_set=[(X_train, y_train),
                     (X_test, y_test)
                    ]
          )
    for metric in [metrics.accuracy_score, metrics.precision_score,
                   metrics.recall_score, metrics.f1_score]:
        mlflow.log_metric(metric.__name__, metric(y_test, xg.predict(X_test)))

    model_info = mlflow.xgboost.log_model(xg, artifact_path='model')
At the end of this code, you'll notice that we store the result of calling log_model in model_info. The log_model function stores the XGBoost model as an artifact and returns an object with metadata about the model we created.
If you inspect the ex_id variable and the .run_id attribute of model_info, together they point to the directory that stores information about the run.
>>> ex_id
'172212630951564101'
>>> model_info.run_id
'263b3e793f584251a4e4cd1a2d494110'
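If you want the trained model back in memory in a later session, you can rebuild a model URI from the run id and load it with the XGBoost flavor. This is a sketch; with recent versions of MLFlow it returns the original XGBClassifier (the pyfunc loader shown in the next section returns a generic wrapper instead):

import mlflow

# 'runs:/<run_id>/model' matches the artifact_path we passed to log_model.
model_uri = f'runs:/{model_info.run_id}/model'
reloaded = mlflow.xgboost.load_model(model_uri)
reloaded.predict(X_test.iloc[[0]])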
The directory structure looks like this:
mlruns/172212630951564101/263b3e793f584251a4e4cd1a2d494110
├── artifacts
├── meta.yaml
├── metrics
│   ├── accuracy_score
│   ├── f1_score
│   ├── precision_score
│   └── recall_score
├── params
│   ├── colsample_bytree
│   ├── gamma
│   ├── learning_rate
│   ├── max_depth
│   ├── min_child_weight
│   ├── random_state
│   └── subsample
└── tags
    ├── mlflow.log-model.history
    ├── mlflow.runName
    ├── mlflow.source.name
    ├── mlflow.source.type
    └── mlflow.user
Once we know where the model artifacts are stored, we can launch a server to run the model.
21.2 Inspecting Model Artifacts
You can launch the MLFlow UI to inspect the artifacts created while training the model.
From the command line, use the following command:

mlflow ui

Then go to the URL http://localhost:5000.
You should see a site that looks like this.
Figure 21.1: MLFlow UI Homepage
MLFlow automatically assigns each run a unique, human-readable identifier such as colorful-moose-627. You can also find the run name by inspecting the tags/mlflow.runName file.
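You can also get at the run names programmatically. As a sketch, mlflow.search_runs returns a Pandas DataFrame with one row per run, including a tags.mlflow.runName column and columns for the metrics we logged:

import mlflow

# One row per run in the experiment we created above.
runs = mlflow.search_runs(experiment_ids=[ex_id])
runs[['run_id', 'tags.mlflow.runName', 'metrics.accuracy_score']]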
21.3 Running A Model From Code
When you click on a finished model, MLFlow will give you the code to make predictions. You
load the model using the load_model function and pass in the run id.
import mlflow
logged_model = 'runs:/ecc05fedb5c942598741816a1c6d76e2/model'
# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)
The code above is available in the MLFlow UI launched from the command line (figure 21.2). If you click on the model folder icon, it will show how to make a prediction with Pandas (like above) or with Spark.

Figure 21.2: Inspecting a run of an experiment. You can expand the Parameters and the Metrics on the left-hand side. In the lower right is the code to run a model.
Once you have loaded the model, you can make predictions. Let’s predict the first row of
the test data:
>>> loaded_model.predict(X_test.iloc[[0]])
array([1])
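The model returns the encoded label. If you want the original job title back, you can undo the encoding with the label encoder we fit earlier (a quick sketch):

# 1 is the encoded label; inverse_transform maps it back to the original
# string (a software engineer in this dataset).
label_encoder.inverse_transform(loaded_model.predict(X_test.iloc[[0]]))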
21.4 Serving Predictions
MLFlow also creates an endpoint that we can query for predictions.
If you have the run id, you can launch a service for it. By default, MLFlow will use pyenv to create a
fresh environment to run the service. I’m passing --env-manager local to bypass that and use
my local environment:
mlflow models serve -m mlruns/172212630951564101/ \
263b3e793f584251a4e4cd1a2d494110/artifacts/model \
-p 1234 --env-manager local
The above command starts a service on port 1234. If you hit the URL and port in a web
browser, it will throw an error because it is expecting an HTTP POST command. The next
section will show you how to query the service.
MLFlow also enables you to serve the same model from Azure, AWS Sagemaker, or Spark.
Explore the documentation for MLFlow about the built-in deployment tools for more details.
21.5 Querying from the Command Line
You can also query models from the command line with curl.
On UNIX systems, the command looks like this (make sure to replace $URL and $JSON_DATA with the appropriate values):
curl $URL -X POST -H "Content-Type:application/json" --data $JSON_DATA
You will need to create the JSON_DATA payload in the following format. It is shown here on multiple lines for readability, but you should put it on a single line:
{'dataframe_split':
{'columns': ['col1', 'col2'],
'data': [[22, 16.0],
[25, 18.0]]}
}
If you don’t want to create the JSON manually, you can use Pandas to make it. If you have
a Dataframe with the data, use the Pandas .to_json method and set orient='split'.
>>> X_test.head(2).to_json(orient='split', index=False)
'{"columns":["age","education","years_exp","compensation",
"python","r","sql","Q1_Male","Q1_Female","Q1_Prefer not to say",
"Q1_Prefer to self-describe","Q3_United States of America",
"Q3_India","Q3_China","major_cs","major_other","major_eng",
"major_stat"],"data":[[22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,
0,0],[25,18.0,1.0,70000,1,1,0,1,0,0,0,1,0,0,0,1,0,0]]}'
This is a JSON string, and we need a Python dictionary so that we can embed it in another dictionary. Consider this value to be DICT. We must place it in another dictionary under the key dataframe_split: {'dataframe_split': DICT}. We will use the json.loads function to create a dictionary from the string. (We can't use the Python string directly because its quotes would be incorrect for JSON.)
Here is the JSON data we need to insert into the dictionary:
>>> import json
>>> json.loads(X_test.head(2).to_json(orient='split', index=False))
{'columns': ['age',
'education',
'years_exp',
'compensation',
'python',
'r',
'sql',
'Q1_Male',
'Q1_Female',
'Q1_Prefer not to say',
'Q1_Prefer to self-describe',
'Q3_United States of America',
'Q3_India',
'Q3_China',
'major_cs',
'major_other',
'major_eng',
'major_stat'],
'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}
Then you need to nest this in another dictionary under the key 'dataframe_split'.
>>> {'dataframe_split': json.loads(X_test.head(2).to_json(orient='split',
...     index=False))}
{'dataframe_split': {'columns': ['age',
'education',
'years_exp',
'compensation',
'python',
'r',
'sql',
'Q1_Male',
'Q1_Female',
'Q1_Prefer not to say',
'Q1_Prefer to self-describe',
'Q3_United States of America',
'Q3_India',
'Q3_China',
'major_cs',
'major_other',
'major_eng',
'major_stat'],
'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}
Finally, we need to convert the Python dictionary back to a valid JSON string. We will use
the json.dumps function to do that:
>>> import json
>>> post_data = json.dumps({'dataframe_split': json.loads(
...     X_test.head(2).to_json(orient='split', index=False))})
>>> post_data
'{"dataframe_split": {"columns": ["age", "education",
"years_exp", "compensation", "python", "r", "sql", "Q1_Male",
"Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe",
"Q3_United States of America", "Q3_India", "Q3_China", "major_cs",
"major_other", "major_eng", "major_stat"], "data": [[22, 16.0,
1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0,
70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}'
I’ll make a function that transforms a dataframe into this format to avoid repetition.
def create_post_data(df):
    dictionary = json.loads(df
        .to_json(orient='split', index=False))
    return json.dumps({'dataframe_split': dictionary})
Let’s try out the function:
>>> post_data = create_post_data(X_test.head(2))
>>> print(post_data)
{"dataframe_split": {"columns": ["age", "education", "years_exp", "compensation",
"python", "r", "sql", "Q1_Male", "Q1_Female", "Q1_Prefer not to say",
"Q1_Prefer to self-describe", "Q3_United States of America", "Q3_India", "Q3_China",
"major_cs", "major_other", "major_eng", "major_stat"],
"data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}
In Jupyter, we can use Python variables in shell commands. To tell Jupyter to use the value
of the variable, we need to stick a dollar sign ($) in front of the variable. However, when we
throw the post_data variable into the curl command in Jupyter, it fails.
!curl http://127.0.0.1:1234/invocations -X POST -H \
"Content-Type:application/json" --data $post_data
It fails with this error:
curl: (3) unmatched brace in URL position 1:
{columns:
^
This is because we need to wrap the contents of the post_data variable with single quotes.
I’m going to stick the quotes directly into the Python string.
>>> quoted = f"'{post_data}'"
>>> quoted
'\'{"dataframe_split": {"columns": ["age", "education",
"years_exp", "compensation", "python", "r", "sql", "Q1_Male",
"Q1_Female", "Q1_Prefer not to say", "Q1_Prefer to self-describe",
"Q3_United States of America", "Q3_India", "Q3_China", "major_cs",
"major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0,
70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\''
Let’s stick the capability to “quote” the data into the create_post_data function.
def create_post_data(df, quote=True):
    dictionary = {'dataframe_split': json.loads(df
        .to_json(orient='split', index=False))}
    if quote:
        # Dump to valid JSON (double quotes) and wrap the result in single
        # quotes so the shell passes it to curl as a single argument.
        return f"'{json.dumps(dictionary)}'"
    else:
        return dictionary
quoted = create_post_data(X_test.head(2))
Let’s try it now:
!curl http://127.0.0.1:1234/invocations -X POST -H \
    "Content-Type:application/json" --data $quoted
This returns the JSON:
{"predictions": [1, 0]}
Indicating that the first row is a software engineer and the second is a data scientist.
21.6 Querying with the Requests Library
I’ll show you how to query the service from python using the requests library.
This code uses the requests library to make a post request. It will predict the first two test
rows. it uses the pandas .to_json method and requires setting orient='split'. it returns 1
(software engineer) for the first row and 0 (data scientist) for the second.
In this case, because we are sending the json data as a dictionary and not a quoted string,
we will pass in quote=false.
>>> import requests as req
>>> import json
>>> r = req.post('http://127.0.0.1:1234/invocations',
...              json=create_post_data(X_test.head(2), quote=False))
>>> print(r.text)
{"predictions": [1, 0]}
Again, the .text result indicates that the web service predicted software engineer for the
first row and data scientist for the second.
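Because the endpoint returns JSON, it is usually more convenient to parse the response than to read .text, and to fail loudly if the request was rejected. A small sketch:

# Raise an exception if the server returned an error status
# (for example, because the payload was malformed).
r.raise_for_status()

# Parse the JSON body and pull out the list of predictions.
preds = r.json()['predictions']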
21.7 Building with Docker
Docker is a platform that enables developers to package, deploy, and run applications
in containers. A container is a lightweight, stand-alone executable package that includes
everything needed to run a piece of software, including the code, runtime, system tools,
libraries, and settings.
Docker provides consistent and reproducible development and deployment environments
across different platforms, making it an efficient and reliable tool for deploying applications,
including machine learning models. It also makes it simple to scale the deployment by
running multiple container instances.
If you want to build a Docker image of the model and the service, you can use this
command. (You will need to point to the correct model directory and replace MODELNAME with
the name you want to use for the Docker image.)
mlflow models build-docker -m mlruns/172212630951564101/ \
263b3e793f584251a4e4cd1a2d494110/artifacts/model \
-n MODELNAME
Note that you need to install the extra packages when installing MLFlow to get the Docker capabilities:
pip install mlflow[extras]
After you have built the Docker image, you can run it with:
docker run --rm -p 5001:8080 MODELNAME
This will run the application locally on port 5001.
You can query it from the command line or from code, as we illustrated previously:
curl http://127.0.0.1:5001/invocations -X POST -H \
"Content-Type:application/json" --data '{"dataframe_split": \
{"columns": ["age", "education", "years_exp", "compensation", \
"python", "r", "sql", "Q1_Male", "Q1_Female", \
"Q1_Prefer not to say", "Q1_Prefer to self-describe", \
"Q3_United States of America", "Q3_India", "Q3_China", \
"major_cs", "major_other", "major_eng", "major_stat"], \
"data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, \
1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1,\
0, 0, 0, 1, 0, 0]]}}'
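The only change from the earlier service is the port, so the same helper works from Python as well. Here is a sketch, assuming the container from the docker run command above is still running:

import requests as req

# Query the dockerized model on port 5001 with the helper defined earlier.
r = req.post('http://127.0.0.1:5001/invocations',
             json=create_post_data(X_test.head(2), quote=False))
print(r.text)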
21.8 Conclusion
This chapter introduced MLFlow, an open-source library for managing the end-to-end machine learning lifecycle. It provides tools for tracking experiments, packaging machine
learning models, and deploying models to production. We showed how to use the MLFlow
library with the XGBoost library, how to inspect the artifacts of training the model, how to run
a model from code, how to make predictions with the loaded model and launch the service,
and how to use Docker with MLFlow.
21.9 Exercises
1. What is the MLFlow library, and what are its main features?
2. How does MLFlow help track and compare the performance of different XGBoost
models?
3. How do you use MLFlow to package and deploy XGBoost models?
4. How do you create a Docker image from MLFlow?
Chapter 22
Conclusion
Well, folks. We are at the end. I hope you enjoyed this journey. Throughout this journey,
we have explored the depths of this powerful algorithm and learned how to build highly
accurate classification models for real-world applications. We have covered everything from
data preparation and feature engineering to hyperparameter tuning and model evaluation.
To use a cliché: the journey doesn't end here; it's just the beginning. The knowledge you have gained
from this book can be applied to a wide range of classification problems, from predicting
customer churn in business to diagnosing medical conditions in healthcare.
So, what are you waiting for? Please take what you have learned and put it into practice.
Practice is the best way to make sure you are absorbing this content. And please let me know.
I’d love to hear what you used this for.
Please reach out if you need help with XGBoost and want consulting or training for your
team. You can find my offerings at www.metasnake.com.
If you are a teacher using this book as a textbook, don't hesitate to contact me; I would love to present on XGBoost to your class.
22.1 One more thing
As an independent self-published author, I know firsthand how challenging it can be to share
this content. Without the backing of a major publishing house or marketing team, I rely on the
support of my readers to help spread the word about my work. That’s why I’m urging you
to take a few minutes out of your day to post a review and share my book with your friends
and family.
If you enjoyed this book, posting an honest review is the best way you can say thanks for
the countless hours I’ve poured into my work. If you do post a review, please let me know.
I’d love to link to it on the homepage of the book. Your support can help me reach a wider
audience. So please, if you’ve enjoyed my book, consider leaving a review and sharing it with
others. Your support means the world to me and will impact my future endeavors.
About the Author
Matt Harrison is a renowned corporate trainer specializing in Python and Pandas. With over
a decade of experience in the field, he has trained thousands of professionals and helped them
develop their skills in data analysis, machine learning, and automation using Python.
Matt authored several best-selling Python and data analysis books, including Illustrated
Guide to Python 3, Effective Pandas, Machine Learning Pocket Reference, and more. He is a frequent
speaker at industry conferences and events, where he shares his expertise and insights on the
latest trends and techniques in data analysis and Python programming.
Technical Editors
Dr. Ronald Legere earned his Ph.D. in Physics from Yale University in 1998, with a focus on
laser cooling and cold atom collisions. He then served as a Post Doc in Quantum Information
Theory at Caltech and later joined MIT Lincoln Laboratory in 2000 as technical staff. There,
he managed a team of analysts and support staff providing systems analysis and cost-benefit analysis to various Federal programs. In 2013, Dr. Legere became CEO of Legere
Pharmaceuticals, leveraging his technical and leadership experience to improve business
systems and regulatory compliance. In 2021, after selling Legere, he founded Rational Pursuit
LLC to provide scientific, data-driven insights to businesses and organizations.
Edward Krueger is an expert in data science and scientific applications. As the
proprietor of Peak Values Consulting, he specializes in developing and deploying simulation,
optimization, and machine learning models for clients across various industries, from
programmatic advertising to heavy industry.
In addition to his consulting work, Krueger is also an Adjunct Assistant Professor in the
Economics Department at The University of Texas at Austin. He teaches programming, data
science, and machine learning to master’s students.
Alex Rook is a data scientist and Artificial Engineer based in North Carolina. He has
degrees in Cognitive Science and Philosophy from the University of Virginia. He likes
designing robust systems that look and feel good, function perfectly, and answer questions of
high importance to multiple stakeholders who often have competing - yet valid - interests
in their organization. He has extensive experience in non-financial risk management and
mitigation.
Index
., 172
$, 197
'error', 71
**, 47, 106
.add, 19, 144
.assign, 154
.background_gradient, 15
.barh, 137
.bar, 17
.base_values, 154
.best_params_, 47, 81
.classes_, 34
.coef_, 134
.corr, 15, 162
.count, 146
.cv_results_, 48
.estimators_, 52
.evals_result, 68
.feature_importances_, 135, 137
.fit_transform, 11, 31
.fit, 27, 31, 61
.from_estimator, 120, 176, 185
.get_params, 43, 52, 53
.groupby, 181
.inverse_transform, 34
.loc, 26
.mean, 48
.on, 144
.pipe, 146
.plot.bar, 17
.plot.scatter, 94
.plot_tree, 28, 40, 52, 53, 62
.plot, 144
.predict_proba, 63, 119, 124, 185
.predict, 64
.query, 10, 27, 182
.rename, 154
.run_id, 191
.scale, 21, 144
.scatter, 19, 94
.score, 32, 45, 61, 115
.set_sticky, 15
.set_visible, 146
.set_xlabel, 146
.set_xticks, 146
.set_yticks, 146
.spines, 146
.text, 146
.tight_layout, 176
.to_json, 198
.transform, 11, 34
.unstack, 145, 146
.value_counts, 182
.values, 152
.view, 135
.view (dtreeviz), 54
.violinplot, 96
BaseEstimator, 10
CalibratedClassifierCV, 185
CalibrationDisplay, 185
ConfusionMatrixDisplay, 116
DecisionTreeClassifier, 27
DummyClassifier, 32
GridSearchCV, 47, 80
GridSpec, 185
LabelEncoder, 34
LogisticRegression, 133
MeanMedianImputer, 10
OneHotEncoder, 10
PartialDependenceDisplay, 172
PartialDependencePlot, 176
Plot, 144
RandomForestClassifier, 52
RandomState, 23
RocCurveDisplay, 120
SeedSequence, 23
StandardScaler, 133
ThresholdXGBClassifier, 124
TreeExplainer, 152
Trial, 87
ValueError, 33
XGBClassifier, 61
XGBRFClassifier, 53
XGBClassifier subclass, 123
X, 9
accuracy_score, 115
add, 19
annotate, 125
annot, 162
ax.plot, 26
background_gradient, 15, 48
beeswarm, 165
calc_gini, 25
calibration, 185
choice, 88
classification_report, 120
cmap, 162
cm, 166
color, 166
colsample_bylevel, 54, 75
colsample_bynode, 54, 75
colsample_bytree, 54, 75
confusion_matrix, 115
corr, 162
create_experiment, 190
create_post_data, 197
cross_val_score, 48, 82
crosstab, 18, 182
cv_results_, 48
cv, 46, 48
depth_range_to_display, 54
dtreeviz, 30, 54
dummy, 32
early_stopping_rounds, 57, 67, 76, 85, 106
ensemble, 52
eval_metric, 57, 71, 76
eval_set, 67, 106, 132
export_graphviz, 138
exp, 154
extract_zip, 5
f1_score, 120
feature_engine, 10
fig.tight_layout, 176
fit_transform, 11, 31
fit, 31
fmin, 86, 105, 190
fmt, 162
force, 160
from_estimator, 172, 176, 185
gamma, 57, 76, 78
get_tpr_fpr, 124
groupby, 181
grow_policy, 54, 75
heatmap, 93
hp, 88
hyperopt, 105
initjs, 152
inspection, 172
interaction_constraints, 148
inv_log, 64
inverse_transform, 34
jitter, 94
json.loads, 195
lambda, 154
learning_curve, 109
learning_rate, 57, 76, 78, 79
linear_model, 133
load_model, 193
loads, 195
loc, 26
log_metric, 190
log_param, 190
loguniform, 89
max_delta_step, 57, 76
max_depth, 27, 40, 43, 54, 61, 75
max_evals, 105
max_features, 43
max_leaves, 43, 54, 75
min_child_weight, 54, 75
min_impurity_decrease, 43
min_samples_leaf, 43
min_samples_split, 43
min_weight_fraction_leaf, 43
mlflow ui, 192
model_selection, 11
model_selection (sklearn), 47
model, 135
monotone_constraints, 183
my_dot_export, 28, 54
n_estimators, 33, 52, 57, 61, 76, 132
np.exp, 154
num_trees, 62
objective, 57, 76
openpyxl, 141
orient='split', 195
param_range, 129
partial_dependence_plot, 176, 179
pd.crosstab, 18, 145, 182
pd.read_csv, 5
penalty=None, 133
plot.bar, 17
plot_3d_mesh, 100
plot_cumulative_gain, 125
plot_lift_curve, 126
plot_tree, 28, 31, 40, 52
plots.force, 160
plots.scatter, 160
plots.waterfall, 156
precision_recall_curve, 119
precision_score, 117
predict_proba, 185
preprocessing, 133
pyfunc, 193
pyll, 88
quantile_ice, 173
query, 10, 27, 182
quniform, 91
read_csv, 5
recall_score, 117
reg_alpha, 57, 76
reg_lambda, 57, 76
rounds, 105
run_id, 191
sample, 88
sampling_method, 54, 75
scale_pos_weight, 57, 76
scale, 21
scatter (SHAP), 160
score, 32, 45
scoring, 46, 129
set_experiment, 190
set_sticky, 15
shap (PDP), 179
so.Dots, 19, 144
so.Jitter, 144
so.Line, 19, 144
so.PolyFit, 144
spines, 146
stratify, 11
subsample, 54, 75
tight_layout, 176
to_json, 198
top_n, 10
tpe.suggest, 87
train_test_split, 11
transform, 11, 146
tree.plot_tree, 31, 52
tree_index, 62
tree_method, 54, 75
trial2df, 92
trial, 92
ui, 192
uniform, 89
urllib, 5
validation_curve, 46, 78, 129
value_counts, 182
verbose, 47, 81, 132
view, 135
vmax, 162
vmin, 162
x_jitter, 162
xg_helpers.py, 6
xg_helpers, 60
xgbfir, 141
y, 9
zipfile, 5
3D mesh, 106
3D plot, 100
accuracy, 32, 71, 115, 129
algorithm, 23
alpha, 162
annotate, 125
area under curve, 71, 121
arrow (on plot), 125
AUC, 71, 121
auc tuning, 130
Average Gain, 141
average precision, 119
Average wFScore, 141
axis spines, 146
bagging, 51
bar plot, 17, 137
base value, 156
baseline model, 32
bayesian optimization, 85
benefits of XGBoost, 59
binary feature, 165
black box, 60, 133, 151
boosting, 33
bootstrap aggregation, 51
calibration, 185
categorical, 88
categorical encoding, 31
categorical values, 59
centered ICE plots, 172
classification report, 120
clean up data, 31, 61
cleanup, 5
closure, 85
coefficient, 133
coefficients, 134
colormap, 166
column subsampling, 51
command line, 193
comparing models, 121
complicated, 40
Condorcet’s Jury Theorem, 51
confusion matrix, 115
constraints, 148, 180
constructor, 43
conventions (scikit-learn), 52
convert labels to numbers, 34
correlation, 15, 93, 143
correlation between trees, 51
correlations, 162
cover, 137
cross tabulation, 145
cross validation score, 46
cross-validation, 82
cumulative gains curve, 125
curl, 193, 197
data leakage, 121
data size, 109
Decision stump, 27
decision tree visualization, 30
decrease learning rate, 79
default value, 156
dependence plot, 162
dependence plot (SHAP), 160
diagnose overfit, 111
diagnose underfit, 111
dictionary, 195
dictionary unpacking, 106
discrete, 88
distributions, 87
docker, 198
docker run, 198
DOT, 138
dtreeviz, 62, 135
early stopping, 67
EDA, 15
edge color, 23
endpoint, 193
ensemble, 51
ensembling, 54
environment, 193
error, 59
estimators, 52
eta, 79
evaluate labelling errors, 115
evaluation set, 67
Excel, 141
exhaust, 47
Expected Gain, 141
experiment tracking, 189
explaining, 60
explanations, 151
exploitation, 94
exploration, 94
exploratory data analysis, 15
export xgboost plot, 62
F1 score, 120
facet plot, 21
false positive, 25, 115, 117
feature engineering, 15
feature importance, 135
feature interactions, 141
fill in missing values, 31
first-class functions, 85
fit line, 19
fixing overfitting, 40
fixing underfit models, 39
font, 146
fraction (confusion matrix), 117
FScore, 141
gain, 125, 137, 141
gamma, 78, 100
Gini coefficient, 23, 25
global interpretation, 151
golfing, 59
good enough, 121
gradient descent, 33
graphviz, 138
grid search, 47, 80
grouping, 145
growing trees, 39
heatmap, 93, 143, 162
histogram, 23
histogram (ICE), 173
holdout, 11
holdout set, 11, 32
horizontal alignment, 146
Hyperopt, 85
hyperopt (with mlflow), 190
hyperparameter tuning, 87
hyperparameters, 39
hyperparameters (decision tree), 43
hyperparameters (random forest), 54
hyperparameters (XGBoost), 75
ICE, 171
ICE (shap), 176
imbalanced data, 11, 115
impute, 10
indicator column, 17
Individual conditional expectation, 171
integer distribution, 91
interaction, 141
interactions, 141
interactions (dependence plot), 162
interactive plot, 101
interactive visuals, 100
interpret, 133
interpretation, 60
inverse logit, 35, 64
isotonic calibration, 185
jitter, 19, 94, 162
JSON, 193
Jupyter, 197
juror, 51
k-fold validation, 82
Kaggle, 5
label encoder, 34
leaf values, 35
leaked data, 121
learning rate, 78
lift curve, 126
limit adjustment, 21
limit feature interactions, 148
line plot (Matplotlib), 26
load data, 5
local interpretation, 151
log loss, 71
log odds, 35, 154
log-axis, 100
log-uniform, 89
logistic regression, 133
logit, 35
loss, 100
macro average, 120
magnitude, 133
Matplotlib colormap, 166
max convention, 52
maximize marketing response rates, 125
memorizing, 40
metric, 46
min convention, 52
missing values, 59
mlflow, 189
mlflow serve, 193
monotonic constraints, 180
negative accuracy, 85
nested list, 148
no interactions (shap), 166
non-linear relationships, 135
non-linear response, 169
non-monotonic interactions, 165
normalize confusion matrix, 117
number of k-fold cross validations, 46
number of trees, 33
NumPy, 23
optimal gain, 125
optimal model, 126
optimize auc, 130
ordered samples, 125
over-fit model diagnosis, 109
overfit, 40
overfit diagnosis, 111
overfit learning curve, 111
Partial dependence plot, 176
PDP, 176
PDP (SHAP), 179
percentage (confusion matrix), 117
philosopher, 51
pipeline, 6, 9, 10, 31, 61
plot inverse logit, 36
plot polynomial line, 19
plot xgboost tree, 62
plotly, 100, 106
positive label, 34
precision, 117
precision-recall curve, 119
predict probability, 63
probabilities, 35
probability, 63, 124
probability of label, 119
prune, 40
pruning, 78
pyenv, 193
Python variables, 197
query model, 193
random forest, 51
random forest (XGBoost), 53
random forest hyperparameters, 54
random sample, 23
random samples, 23
rank, 141
ratio of gains, 126
recall, 117
receiver operating characteristic, 120
regularization, 54
requests, 198
residual, 59
results of evaluation, 68
reverse transformation, 34
ROC Curve, 120
roc-auc tuning, 130
roc_auc, 129
run name, 192
running a model, 193
running with docker, 198
sampling, 54
scale, 133
scatter plot, 19, 94, 144
scikit-learn pipeline, 6
scikit-learn conventions, 52
scikitplot, 125
Seaborn, 162
seaborn, 19, 93, 96, 144
sensitivity, 117
serve model, 193
service (mlflow), 192
setting y-axis limit, 109
shap, 151
SHAP (ICE), 176
shap waterfall, 156
shap without interactions, 166
sigmoid calibration, 185
sign of coefficient, 133
simple, 39
simpler models, 148
single prediction, 156
slope graph, 146
Spearman correlation, 15
specify feature interactions, 148
split, 11
split data, 61
split point, 23
standard deviation, 133
standardize, 133
step-wise tuning, 105, 130
sticky index, 15
string labels, 33
Stump, 27
stump, 166
subclass, 123
subsampling, 51
summary plot, 165
support, 120
survey data, 5
test set, 11
threshold, 120
threshold metrics, 124
tick location (Seaborn), 21
track artifacts, 192
trailing underscore convention, 52
trials, 87
training score, 46
training set, 11
transparency, 19
transparent model, 133
tree depth, 40
tree model, 23
trend, 19
true positive, 25, 115, 117
tuning function, 86
tuning hyperparameters, 78
tweak function, 6, 10
underfit, 39
underfit diagnosis, 111
undo transformation, 34
unstacked bar plot, 17
unpack operator, 47
uuid, 193
validation curve, 78
validation curve (random forest), 57
validation curves, 44
validation score, 46
validation set, 11, 32
variance, 40
verbose, 132
vertical alignment, 146
violin plot, 96
visualize decision tree, 31
waterfall plot, 156
weak model, 59
weight, 137
weighted average, 120
wFScore, 141
white box, 60, 133
x-axis log, 100
XGBoost hyperparameters, 75
XGBoost stump, 35
y-axis limit, 109
yellowbrick, 57, 120
Yellowbrick (validation curve), 46
yellowbrick learning curve, 109
ZIP file, 5