Treading on Python Series

Effective XGBoost
Tuning, Understanding, and Deploying Classification Models

Matt Harrison

Technical Editors: Edward Krueger, Alex Rook, Ronald Legere

hairysun.com

Copyright © 2023

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Contents

1 Introduction
2 Datasets
    2.1 Cleanup
    2.2 Cleanup Pipeline
    2.3 Summary
    2.4 Exercises
3 Exploratory Data Analysis
    3.1 Correlations
    3.2 Bar Plot
    3.3 Summary
    3.4 Exercises
4 Tree Creation
    4.1 The Gini Coefficient
    4.2 Coefficients in Trees
    4.3 Another Visualization Tool
    4.4 Summary
    4.5 Exercises
5 Stumps on Real Data
    5.1 Scikit-learn stump on real data
    5.2 Decision Stump with XGBoost
    5.3 Values in the XGBoost Tree
    5.4 Summary
    5.5 Exercises
6 Model Complexity & Hyperparameters
    6.1 Underfit
    6.2 Growing a Tree
    6.3 Overfitting
    6.4 Overfitting with Decision Trees
    6.5 Summary
    6.6 Exercises
7 Tree Hyperparameters
    7.1 Decision Tree Hyperparameters
    7.2 Tracking changes with Validation Curves
    7.3 Leveraging Yellowbrick
    7.4 Grid Search
    7.5 Summary
    7.6 Exercises
8 Random Forest
    8.1 Ensembles with Bagging
    8.2 Scikit-learn Random Forest
    8.3 XGBoost Random Forest
    8.4 Random Forest Hyperparameters
    8.5 Training the Number of Trees in the Forest
    8.6 Summary
    8.7 Exercises
9 XGBoost
    9.1 Jargon
    9.2 Benefits of Boosting
    9.3 A Big Downside
    9.4 Creating an XGBoost Model
    9.5 A Boosted Model
    9.6 Understanding the Output of the Trees
    9.7 Summary
    9.8 Exercises
10 Early Stopping
    10.1 Early Stopping Rounds
    10.2 Plotting Tree Performance
    10.3 Different eval_metrics
    10.4 Summary
    10.5 Exercises
11 XGBoost Hyperparameters
    11.1 Hyperparameters
    11.2 Examining Hyperparameters
    11.3 Tuning Hyperparameters
    11.4 Intuitive Understanding of Learning Rate
    11.5 Grid Search
    11.6 Summary
    11.7 Exercises
12 Hyperopt
    12.1 Bayesian Optimization
    12.2 Exhaustive Tuning with Hyperopt
    12.3 Defining Parameter Distributions
    12.4 Exploring the Trials
    12.5 EDA with Plotly
    12.6 Conclusion
    12.7 Exercises
13 Step-wise Tuning with Hyperopt
    13.1 Groups of Hyperparameters
    13.2 Visualization Hyperparameter Scores
    13.3 Training an Optimized Model
    13.4 Summary
    13.5 Exercises
14 Do you have enough data?
    14.1 Learning Curves
    14.2 Learning Curves for Decision Trees
    14.3 Underfit Learning Curves
    14.4 Overfit Learning Curves
    14.5 Summary
    14.6 Exercises
15 Model Evaluation
    15.1 Accuracy
    15.2 Confusion Matrix
    15.3 Precision and Recall
    15.4 F1 Score
    15.5 ROC Curve
    15.6 Threshold Metrics
    15.7 Cumulative Gains Curve
    15.8 Lift Curves
    15.9 Summary
    15.10 Exercises
16 Training For Different Metrics
    16.1 Metric overview
    16.2 Training with Validation Curves
    16.3 Step-wise Recall Tuning
    16.4 Summary
    16.5 Exercises
17 Model Interpretation
    17.1 Logistic Regression Interpretation
    17.2 Decision Tree Interpretation
    17.3 XGBoost Feature Importance
    17.4 Surrogate Models
    17.5 Summary
    17.6 Exercises
18 xgbfir (Feature Interactions Reshaped)
    18.1 Feature Interactions
    18.2 xgbfir
    18.3 Deeper Interactions
    18.4 Specifying Feature Interactions
    18.5 Summary
    18.6 Exercises
19 Exploring SHAP
    19.1 SHAP
    19.2 Examining a Single Prediction
    19.3 Waterfall Plots
    19.4 A Force Plot
    19.5 Force Plot with Multiple Predictions
    19.6 Understanding Features with Dependence Plots
    19.7 Jittering a Dependence Plot
    19.8 Heatmaps and Correlations
    19.9 Beeswarm Plots of Global Behavior
    19.10 SHAP with No Interaction
    19.11 Summary
    19.12 Exercises
20 Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration
    20.1 ICE Plots
    20.2 ICE Plots with SHAP
    20.3 Partial Dependence Plots
    20.4 PDP with SHAP
    20.5 Monotonic Constraints
    20.6 Calibrating a Model
    20.7 Calibration Curves
    20.8 Summary
    20.9 Exercises
21 Serving Models with MLFlow
    21.1 Installation and Setup
    21.2 Inspecting Model Artifacts
    21.3 Running A Model From Code
    21.4 Serving Predictions
    21.5 Querying from the Command Line
    21.6 Querying with the Requests Library
    21.7 Building with Docker
    21.8 Conclusion
    21.9 Exercises
22 Conclusion
    22.1 One more thing
Index
Forward

“XGBoost is all you need.” It started as a half-joke, a silly little phrase that I myself had not been too confident about, but it grew into a signature line of sorts. I don’t remember when I first said it, and I can only surmise what motivated me to say it. Part of it was my frustration with the increasingly complex deep learning methods that were being pushed as the only valid approach to tackling all Machine Learning problems. The other part of it came from a sense of maturity that was hard won over the years, acquired from many different wild experiments with various ML tools and techniques, particularly those that deal with tabular data problems. The more problems you tackle, the more you appreciate such things as parsimony, robustness, and elegance in implementing your machine learning models and pipelines.

My first exposure to XGBoost came on Kaggle. Over the past decade, the progress of Machine Learning for tabular data has been intertwined with Machine Learning platforms, and Kaggle in particular. XGBoost was first announced on Kaggle in 2015, not long before I joined that platform. XGBoost was initially a research project by Tianqi Chen, but due to its excellent predictive performance on a variety of machine learning tasks it quickly became the algorithm of choice for many Data Science practitioners.

XGBoost is both a library and a particular gradient boosted trees (GBT) algorithm. (Although the XGBoost library also supports other - linear - base learners.) GBTs are a class of algorithms that rely on ensembling - building a very strong ML algorithm by combining many weaker algorithms. GBTs use decision trees as “base learners”, utilize “boosting” as an ensembling method, and optimize the ensemble using gradient descent, something they have in common with other machine learning methods, such as neural networks.

XGBoost the library is in its own right one of the main reasons why I am such a big XGBoost promoter. As one of the oldest and most widely used GBT libraries, it has matured and become very robust and stable. It can be easily installed and used on almost any computing environment. I have installed it and used it on everything from a Raspberry Pi Zero to a DGX Station A100. It has had a stable version for various Arm-based platforms for a long while, including the new Macs with Apple chips. It can even run natively in the browser - it has been included in Wasm and PyScript for a while. XGBoost is the only GBT library with comprehensive Nvidia CUDA GPU support. It works on a single machine or on a large cluster. And since version 1.7, it also supports federated learning. It includes C, Java, Python, and R front ends, as well as many others. If you are a working Data Scientist and need an efficient way to train and deploy a Machine Learning model for a wide variety of problems, chances are that XGBoost is indeed all you need.

To be clear, in terms of predictive performance, I have always considered XGBoost to be on par with other well-known GBT libraries - LightGBM, CatBoost, HistGradientBoosting, etc. Each one of them has its strengths and weaknesses, but to a first approximation all of them are fairly interchangeable for the majority of problems that a working Data Scientist comes across.
In my experience there is no a priori way of telling which one of them will perform best on any given dataset, but in terms of practical considerations the differences are usually negligible. Mastering any one of them will go a long way towards being comfortable with all of them.

Even though GBTs are very widely used and are the gold standard for tackling tabular data Machine Learning problems, there is still a serious paucity of learning material for them. Most of us have learned to use them by following widely scattered blog posts, Kaggle competitions, GitHub repos, and many other dispersed pieces of information. There are a couple of good introductory books out there as well, but for the most part one has to search far and wide to find a comprehensive instruction manual that can get you up and running with XGBoost. This book aims to put a lot of that scattered material into one coherent whole, with a straightforward and logical content progression.

I’ve now known Matt for years, and I greatly admire his pedagogical perspective. He is a no-nonsense educator and has a very down-to-earth approach to all of his teaching. I’ve greatly benefited from several of his books, especially the ML handbook and the Effective Pandas book. In all of his books he aims to get the reader to the practical parts as soon as possible, and this book is no exception - you start dealing with code from the very first page.

Like all of his other books, this one was written with the working practitioner in mind. The book skips most of the convoluted theoretical background and delves directly into practical examples. You should be comfortable coding, and ideally be an advanced beginner to intermediate Python coder, to get the most out of this book.

Machine Learning for tabular data is still a very hands-on, artisanal process. A big part of what makes a great tabular data ML model has to do with proper data preparation and feature engineering. This is where Matt’s background with Pandas really comes in handy - the many Pandas examples throughout the book are exceptionally valuable in their own right. Chapters end with a great selection of useful exercises. All of the code in the book is also available from the accompanying repo, and most of the datasets can be found on Kaggle.

Like all other great tools, it takes many years of experience and dedicated work to fully master XGBoost. Every path to full mastery needs to start somewhere, and I can’t think of a better starting point than this book. If you get through all of it, you will be well prepared for the rest of your Machine Learning for tabular data journey.

Happy Boosting!

Bojan Tunguz

Chapter 1

Introduction

In this book, we will build our intuition for classification with supervised learning models. Initially, we will look at decision trees, a fundamental component of the XGBoost model, and consider the tradeoffs we make when creating a predictive model. If you already have that background, feel free to skip these chapters and jump straight to the XGBoost chapters. While providing best practices for using the XGBoost library, we will also show many related libraries and how to use them to improve your model. Finally, we will demonstrate how to deploy your model.

I strongly encourage you to practice using this library on a dataset of your choice. Each chapter has suggested exercises for you to consider what you have learned and put it into practice.
Practicing writing code is crucial for learning because it helps to solidify understanding of the library and understanding concepts. Additionally, regular coding practice can build familiarity with coding syntax and increase coding efficiency over time. Using your fingers to practice will help you much more than just reading. If you are interested in courses to practice and learn more Python and data materials, check out https://store.metasnake.com. Use the code XGBOOK for a discount. 3 Chapter 2 Datasets We will be using machine learning classifiers to predict labels for data. I will demonstrate the features of XGBoost using survey data from Kaggle. Kaggle is a company that hosts machine learning competitions. They have conducted surveys with users about their backgrounds, experience, and tooling. Our goal will be to determine whether a respondent’s job title is “Data Scientist” or “Software Engineer” based on how they responded to the survey. This data comes from a Kaggle survey conducted in 2018 1 . As part of the survey, Kaggle asked users about a wide range of topics, from what kind of machine learning software they use to how much money they make. We’re going to be using the responses from the survey to predict what kind of job the respondent has. 2.1 Cleanup I’ve hosted the data on my GitHub page. The 2018 survey data is in a ZIP file. Let’s use some Python code to extract the contents of this data into a Pandas DataFrame. If you need help setting up your environment, see the appendix. The ZIP file contains multiple files. We are concerned with the multipleChoiceResponses.csv file. import pandas as pd import urllib.request import zipfile url = 'https://github.com/mattharrison/datasets/raw/master/data/'\ 'kaggle-survey-2018.zip' fname = 'kaggle-survey-2018.zip' member_name = 'multipleChoiceResponses.csv' def extract_zip(src, dst, member_name): """Extract a member file from a zip file and read it into a pandas DataFrame. Parameters: src (str): URL of the zip file to be downloaded and extracted. 1 https://www.kaggle.com/datasets/kaggle/kaggle-survey-2018 5 2. Datasets dst (str): Local file path where the zip file will be written. member_name (str): Name of the member file inside the zip file to be read into a DataFrame. Returns: pandas.DataFrame: DataFrame containing the contents of the member file. """ url = src fname = dst fin = urllib.request.urlopen(url) data = fin.read() with open(dst, mode='wb') as fout: fout.write(data) with zipfile.ZipFile(dst) as z: kag = pd.read_csv(z.open(member_name)) kag_questions = kag.iloc[0] raw = kag.iloc[1:] return raw raw = extract_zip(url, fname, member_name) I created the function extract_zip so I can reuse this functionality. I will also add this function to a library that I will develop as we progress, xg_helpers.py. After running the above code, we have a Pandas dataframe assigned to the variable raw. This is the raw data of survey responses. We show how to explore it and clean it up for machine learning. 2.2 Cleanup Pipeline The raw data has over 23,000 rows and almost 400 columns. Most data is not natively found in a form where you can do machine learning on it. Generally, you will need to perform some preprocessing on it. Some of the columns don’t lend easily to analysis because they aren’t in numeric form. Additionally, there may be missing data that needs to be dealt with. I will show how to preprocess the data using the Pandas library and the Scikit-learn library. 
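Before walking through the cleanup, it is worth a quick sanity check of what extract_zip returned. The following is a minimal sketch of the kind of inspection I mean; it assumes the raw DataFrame created above, and the exact output will depend on the data you downloaded.

# Quick checks on the raw survey responses (assumes `raw` from extract_zip above)
print(raw.shape)    # over 23,000 rows and almost 400 columns
print(raw.head(3))  # peek at the first few responses

# Columns with the most missing values
missing = raw.isna().sum().sort_values(ascending=False)
print(missing.head())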
I’ll describe the steps, but I will not dive deeply into the features of the Pandas library. I suggest you check out Effective Pandas for a comprehensive overview of that library. Our task will be transforming survey responses into numeric values, encoding categorical values into numbers, and filling in missing values. We will use a pipeline for that. Scikit-learn pipelines are a convenient tool for tying together the steps of constructing and evaluating machine learning models. A pipeline chains a series of steps, including feature extraction, dimensionality reduction, and model fitting, into a single object. Pipelines simplify the endto-end process of machine learning and make you more efficient. They also make it easy to reuse and share your work and eliminate the risk of errors by guaranteeing that each step is executed in a precise sequence. I wrote a function, tweak_kag, to perform the survey response cleanup. I generally create a cleanup function every time I get new tabular data to work with. The tweak_kag function is the meat of the Pandas preprocessing logic. It uses a chain to manipulate the data one action (I typically write each step on its own line) at a time. You can read it as a recipe of actions. The first is to use .assign to create or update columns. Here are the columns updates: 6 2.2. Cleanup Pipeline • age - Pull off the first two characters of the Q2 column and convert them to an integer. • education - Replace the education strings with numeric values. • major - Take the top three majors, change the others to 'other', and then rename those top three majors to shortened versions. • year_exp - Convert the Q8 column to years of experience by replacing '+' empty space, splitting on '-' (the first value of the range) and taking the left-hand side, and converting that value to a floating point number. • compensation - Replace values in the Q9 column by removing commas, shortening 500,000 to 500, splitting on '-' (the first value of the range) and taking the left side, filling in missing values with zero, and converting that value to an integer and multiplying it by 1,000. • python - Fill in missing values of Q16_Part_1 with zero and convert the result to an integer. • r - Fill in missing values of Q16_Part_2 with zero and convert the result to an integer. • sql - Fill in missing values of Q16_Part_2 with zero and convert the result to an integer. After column manipulation, tweak_kag renames the columns by replacing spaces with an underscore. Finally, it pulls out only the Q1, Q2, age, education, major, years_exp, compensation, python, r, and sql columns. def tweak_kag(df_: pd.DataFrame) -> pd.DataFrame: """ Tweak the Kaggle survey data and return a new DataFrame. This function takes a Pandas DataFrame containing survey data as input and returns a new DataFrame. modifications include extracting and transforming columns, renaming columns, and selecting a subset Kaggle The certain of columns. Parameters ---------df_ : pd.DataFrame The input DataFrame containing Kaggle survey data. Returns ------pd.DataFrame The new DataFrame with the modified and selected columns. """ return (df_ .assign(age=df_.Q2.str.slice(0,2).astype(int), education=df_.Q4.replace({'Master’s degree': 18, 'Bachelor’s degree': 16, 'Doctoral degree': 20, 'Some college/university study without earning a bachelor’s degree': 13, 'Professional degree': 19, 'I prefer not to answer': None, 'No formal education past high school': 12}), major=(df_.Q5 .pipe(topn, n=3) .replace({ 7 2. 
Datasets 'Computer science (software engineering, etc.)': 'cs', 'Engineering (non-computer focused)': 'eng', 'Mathematics or statistics': 'stat'}) ), years_exp=(df_.Q8.str.replace('+','', regex=False) .str.split('-', expand=True) .iloc[:,0] .astype(float)), compensation=(df_.Q9.str.replace('+','', regex=False) .str.replace(',','', regex=False) .str.replace('500000', '500', regex=False) .str.replace('I do not wish to disclose my approximate yearly compensation', '0', regex=False) .str.split('-', expand=True) .iloc[:,0] .fillna(0) .astype(int) .mul(1_000) ), python=df_.Q16_Part_1.fillna(0).replace('Python', 1), r=df_.Q16_Part_2.fillna(0).replace('R', 1), sql=df_.Q16_Part_3.fillna(0).replace('SQL', 1) )#assign .rename(columns=lambda col:col.replace(' ', '_')) .loc[:, 'Q1,Q3,age,education,major,years_exp,compensation,' 'python,r,sql'.split(',')] ) def topn(ser, n=5, default='other'): """ Replace all values in a Pandas Series that are not among the top `n` most frequent values with a default value. This function takes a Pandas Series and returns a new Series with the values replaced as described above. The top `n` most frequent values are determined using the `value_counts` method of the input Series. Parameters ---------ser : pd.Series The input Series. n : int, optional The number of most frequent values to keep. The default value is 5. default : str, optional The default value to use for values that are not among the top `n` most frequent values. The default value is 'other'. Returns 8 2.2. Cleanup Pipeline ------pd.Series The modified Series with the values replaced. """ counts = ser.value_counts() return ser.where(ser.isin(counts.index[:n]), default) I created the TweakKagTransformer class to wrap tweak_kag so I could embed the cleanup logic into the pipeline functionality of Scikit-Learn. It is not a requirement to use pipelines to use XGBoost. To create a model with XGBoost suitable for classification, you need a matrix with rows of training data with columns representing the features of the data. Generally, this is called X in most documentation and examples. You will also need a column with labels for each row, commonly named y. The raw data might look like this: Q2 Q3 Q4 \ 587 25-29 India Master’s degree 3065 22-24 India Bachelor’s degree 8435 22-24 India Master’s degree Q5 587 Information technology, networking, or system ... 3065 Computer science (software engineering, etc.) 8435 Other We need to get X into looking like this: 587 3065 8435 age education Q3_India major_cs 25 18.0 1 0 22 16.0 1 1 22 18.0 1 0 Note The capitalization on X is meant to indicate that the data is a matrix and two-dimensional, while the lowercase y is a one-dimensional vector. If you are familiar with linear algebra, you might recognize these conventions, capital letters for matrices and lowercase for vectors. The transformer class subclasses the BaseEstimator and TransformerMixin classes. These classes require that we implement the .fit and .transform methods, respectively. The .fit method returns the class instance. The .transform method leverages the logic in the tweak_kag function. from feature_engine import encoding, imputation from sklearn import base, pipeline class TweakKagTransformer(base.BaseEstimator, base.TransformerMixin): """ A transformer for tweaking Kaggle survey data. 9 2. Datasets This transformer takes a Pandas DataFrame containing Kaggle survey data as input and returns a new version of the DataFrame. 
The modifications include extracting and transforming certain columns, renaming columns, and selecting a subset of columns. Parameters ---------ycol : str, optional The name of the column to be used as the target variable. If not specified, the target variable will not be set. Attributes ---------ycol : str The name of the column to be used as the target variable. """ def __init__(self, ycol=None): self.ycol = ycol def transform(self, X): return tweak_kag(X) def fit(self, X, y=None): return self The get_rawX_y function will take the original data and return an X DataFrame and a y Series ready to feed into our pipeline for further cleanup. It uses the .query method of Pandas to limit the rows to only those located in the US, China, or India, and respondents that had the job title of Data Scientist or Software Engineer. Below that, a pipeline is stored in the variable kag_pl. The pipeline will process the data by calling the TweakKagTransformer. Then it will perform one hot encoding on the Q1, Q3, and major columns using the Feature Engine library. Finally, it will use the imputation library to fill in missing numeric values in the education and year_exp columns. It does that by filling in the missing values with the median values. def get_rawX_y(df, y_col): raw = (df .query('Q3.isin(["United States of America", "China", "India"]) ' 'and Q6.isin(["Data Scientist", "Software Engineer"])') ) return raw.drop(columns=[y_col]), raw[y_col] ## Create a pipeline kag_pl = pipeline.Pipeline( [('tweak', TweakKagTransformer()), ('cat', encoding.OneHotEncoder(top_categories=5, drop_last=True, variables=['Q1', 'Q3', 'major'])), ('num_impute', imputation.MeanMedianImputer(imputation_method='median', 10 2.2. Cleanup Pipeline variables=['education', 'years_exp']))] ) Let’s run the code. We will to create a training set and a validation set (also called test set or holdout set) using the scikit-learn model_selection.train_test_split function. This function will withhold 30% of the data into the test dataset (see test_size=.3). This will let us train the model on 70% of data, then use the other 30% to simulate unseen data and let us experiment with how the model might perform on data it hasn’t seen before. Note The stratify parameter of the train_test_split function is important because it ensures that the proportion of different classes found in the labels of in the dataset is maintained in both the training and test sets. This is particularly useful when working with imbalanced datasets, where one class may be under-represented. Without stratification, the training and test sets may not accurately reflect the distribution of classes in the original dataset, which can lead to biased or unreliable model performance. For example, if we have a binary classification problem, where the dataset has 80% class A and 20% class B, if we stratify, we will split the data into training and test sets. We will have 80% class A and 20% class B in both sets, which helps to ensure that the classifier will generalize well to unseen data. If we don’t stratify, it is possible to get all of class B in the test set, hampering the ability of the model to learn class B. To use stratification, you pass in the labels to the stratify parameter. Once we have split the data, we can feed it into our pipeline. There are two main methods that we want to be aware of in our pipeline, .fit and .transform. The .fit method is used to train a model on the training data. 
It takes in the training data as input and uses that data to learn the parameters of the model. For example, in the case of the pipeline that includes our tweaking function, one-hot encoding, and imputation step, the .fit method runs each of those steps to learn how to perform the transformation in a consistent manner on new data. We generally use the .fit method for the side effect of learning, it does not return transformed data. The .transform method is used to transform the input data according to the steps of the pipeline. The method takes the input data and applies the transformations specified in the pipeline to it (it does not learn any parameters). This method is typically used on test data to apply the same preprocessing steps that were learned on the training data. In our case, we also want to run it on the training data so that it is prepped for use in models. It should return data ready for use with a machine learning model. The .fit_transform method is a combination of the .fit and .transform methods. It first fits the model on the training data and then applies the learned transformations to the input data, returning the transformed data. This method is useful when we want to learn the parameters of the pipeline on the training data and then apply these learned parameters to the training data all at once. We only want to run this on the training data. Then after we have fit the pipeline on the training data, we will call .transform on the testing data. The output of calling .fit_transform and .transform on the pipeline is a Pandas DataFrame suitable for training an XGBoost model. >>> from sklearn import model_selection >>> kag_X, kag_y = get_rawX_y(raw, 'Q6') >>> kag_X_train, kag_X_test, kag_y_train, kag_y_test = \ 11 2. Datasets Figure 2.1: Process for splitting data and transforming data with a pipeline. 12 2.3. Summary ... ... model_selection.train_test_split( kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) >>> X_train = kag_pl.fit_transform(kag_X_train, kag_y_train) >>> X_test = kag_pl.transform(kag_X_test) >>> print(X_train) age education years_exp ... major_other major_eng major_stat 587 25 18.0 4.0 ... 1 0 0 3065 22 16.0 1.0 ... 0 0 0 8435 22 18.0 1.0 ... 1 0 0 3110 40 20.0 3.0 ... 1 0 0 16372 45 5.0 ... 0 12.0 1 0 ... ... ... ... ... ... ... ... 16608 25 16.0 2.0 ... 0 0 0 7325 18 16.0 1.0 ... 0 0 0 21810 18 16.0 2.0 ... 0 0 0 4917 25 18.0 1.0 ... 0 0 1 639 25 18.0 1.0 ... 0 0 0 [2110 rows x 18 columns] At this point, we should be ready to create a model so we can make predictions. Remember, our goal is to predict whether an individual is a Software Engineer or Data Scientist. Here are the labels for the training data: >>> kag_y_train 587 Software Engineer 3065 Data Scientist 8435 Data Scientist 3110 Data Scientist 16372 Software Engineer ... 16608 Software Engineer 7325 Software Engineer 21810 Data Scientist 4917 Data Scientist 639 Data Scientist Name: Q6, Length: 2110, dtype: object Our data is in pretty good shape for machine learning. Before we jump into that we want to explore the data a little more. We will do that in the next chapter. Data exploration is a worthwhile effort to create better models. 2.3 Summary In this chapter, we loaded the data. Then we adapted a function that cleans up the data with Pandas into a Scikit-learn pipeline. We used that to prepare the data for machine learning and then split the data into training and testing data. 13 2. Datasets 2.4 Exercises 1. 
Import the pandas library and use the read_csv function to load in a CSV file of your choice. 2. Use the .describe method to generate summary statistics for all numeric columns in the DataFrame. 3. Explain how you can convert categorical string data into numbers. 14 Chapter 3 Exploratory Data Analysis In this chapter, we will explore the Kaggle data before creating models from it. This is called Exploratory Data Analysis or EDA. My goal with EDA is to understand the data I’m dealing with by using summary statistics and visualizations. This process provides insights into what data we have and what the data can tell us. The exploration step is the most crucial step in data modeling. If you understand the data, you can better model it. You will also understand the relationships between the columns and the target you are trying to predict. This will also enable feature engineering, creating new features to drive the model to perform better. I will show some of my favorite techniques for exploring the data. Generally, I use Pandas to facilitate this exploration. For more details on leveraging Pandas, check out the book Effective Pandas. 3.1 Correlations The correlation coefficient is a useful summary statistic for understanding the relationship between features. There are various ways to calculate this measure. Generally, the Pearson Correlation Coefficient metric is used. The calculation for this metric assumes a linear relationship between variables. I often use the Spearman Correlation Coefficient which correlates the ranks or monotonicity (and doesn’t make the assumption of linearity). The values for both coefficients range from -1 to 1. Higher values indicate that as one value increases, the other value increases. A correlation of 0 means that as one variable increases, the other doesn’t respond. A negative correlation indicates that one variable goes down as the other goes up. I like to follow up correlation exploration with a scatter plot to visualize the relationship. Here is the Pandas code to create a correlation dataframe. I’ll step through what each line does: Start with the X_train data Create a data_scientist column (so we can see correlations to it) Create a correlation dataframe using the Spearman metric Start styling the result (to indicate the best correlations visually) Add a background gradient using a diverging colormap (red to white to blue, RdBu), pinning the minimum value of -1 to red and maximum value of 1 to blue • Make the index sticky, so it stays on the screen during horizontal scrolling • • • • • (X_train .assign(data_scientist = kag_y_train == 'Data Scientist') .corr(method='spearman') 15 3. Exploratory Data Analysis Figure 3.1: Flowchart for my process for basic EDA. I include numerical summaries and visual options. 16 3.2. Bar Plot ) .style .background_gradient(cmap='RdBu', vmax=1, vmin=-1) .set_sticky(axis='index') Figure 3.2: Spearman correlation coefficient of features in dataset. We want to look for the darkest blue and red values ignoring the diagonal. 3.2 Bar Plot If we inspect the columns that correlate with data_scientist, the most significant feature is r (with a value of 0.32). Let’s explore what is going on there. The r column is an indicator column that is 1 or 0, depending on if the sample uses the R language. We could use a scatter plot to view the relationship, but it will be less compelling because there are only two values for data scientist, the values will all land on top of one of the two options. Instead, we will use an unstacked bar plot. 
Here’s what the code is doing line by line: • Start with the X_train dataframe 17 3. Exploratory Data Analysis Create a data_scientist column Group by the r column Look only at the data_scientist column Get the counts for each different entry in the data_scientist column (which will be represented as a multi-index) • Pull out the data_scientist level of the multi-index and put it in the columns • Create a bar plot from each column (data_scientist options) for each entry in r • • • • import matplotlib.pyplot as plt fig, ax = plt.subplots(figsize=(8, 4)) (X_train .assign(data_scientist = kag_y_train) .groupby('r') .data_scientist .value_counts() .unstack() .plot.bar(ax=ax) ) Figure 3.3: Bar plot of R usage for Data Scientists and Software Engineers. The correlation value was positive, 0.32. It turns out that the column with the most negative correlation to job title (-0.31) was also an indicator column, major_cs, indicating whether the respondent studied computer science. Let’s explore using the cross-tabulation function (pd.crosstab) which provides a helper function for the previous chain of Pandas code. fig, ax = plt.subplots(figsize=(8, 4)) (pd.crosstab(index=X_train['major_cs'], 18 3.2. Bar Plot ) columns=kag_y) .plot.bar(ax=ax) Figure 3.4: Bar plot of majors for Data Scientists and Software Engineers. The correlation value was negative, -0.31. There is a slight correlation between years of experience and compensation. Let’s do a scatter plot of this to see if it helps us understand what is going on. Here is the code to create a scatter plot: fig, ax = plt.subplots(figsize=(8, 4)) (X_train .plot.scatter(x='years_exp', y='compensation', alpha=.3, ax=ax, c='purple') ) This plot is ok, but it is glaringly evident that the data is binned. In the real-world folks would only have exactly five years of experience on one day. Luckily, we set the alpha transparency to see where points pile up on top of each other. But given the plot, it is hard to tell what is going on. I’m going to use the Seaborn library to create a plot that is more involved. One thing I want to do is add jitter. Because our points land on top of each other, it is hard to understand the density. If we shift each one horizontally and vertically (called jitter) by a different random amount, it will aid with that. (This is done with so.Jitter(x=.5, y=10_000) I’m also going to color the points by the title. (color='title') Finally, we will fit a polynomial line through the points to see the trend. (see .add(so.Line(), so.PolyFit())) 19 3. Exploratory Data Analysis Figure 3.5: Scatter plot showing relationship between compensation and years of experience import seaborn.objects as so fig = plt.figure(figsize=(8, 4)) (so .Plot(X_train.assign(title=kag_y_train), x='years_exp', y='compensation', color='title') .add(so.Dots(alpha=.3, pointsize=2), so.Jitter(x=.5, y=10_000)) .add(so.Line(), so.PolyFit()) .on(fig) # not required unless saving to image .plot() # ditto ) Figure 3.6: Scatter plot showing relationship between compensation and years of experience 20 3.3. Summary This plot is more interesting. It indicates that data scientists tend to earn more than their software engineer counterparts. This hints that compensation might come in useful for determining job titles. Another point to consider is that this plot lumps together folks from around the world. Different regions are likely to compensate differently. Let’s tease that apart as well. This next plot is cool. 
The Seaborn objects interface makes it simple to facet the data by country (.facet('country')). Also, I want to show all of the data in grey on each plot; if you set col=None, it will plot all of the data. However, I want to zoom in to the lower left corner because most of the data is found there. I can adjust the tick locations with .scale(x=so.Continuous().tick(at=[0,1,2,3,4,5])) and adjust the limits as well with .limit(y=(-10_000, 200_000), x=(-1, 6)).

fig = plt.figure(figsize=(8, 4))
(so
 .Plot(X_train
       #.query('compensation < 200_000 and years_exp < 16')
       .assign(
           title=kag_y_train,
           country=(X_train
                    .loc[:, 'Q3_United States of America':'Q3_China']
                    .idxmax(axis='columns')
           )
       ),
       x='years_exp', y='compensation', color='title')
 .facet('country')
 .add(so.Dots(alpha=.01, pointsize=2, color='grey'),
      so.Jitter(x=.5, y=10_000), col=None)
 .add(so.Dots(alpha=.5, pointsize=1.5), so.Jitter(x=.5, y=10_000))
 .add(so.Line(pointsize=1), so.PolyFit(order=2))
 .scale(x=so.Continuous().tick(at=[0,1,2,3,4,5]))
 .limit(y=(-10_000, 200_000), x=(-1, 6))  # zoom in with this, not .query (above)
 .on(fig)   # not required unless saving to image
 .plot()    # ditto
)

Figure 3.7: Scatter plot showing the relationship between compensation and years of experience, broken out by country.

3.3 Summary

In this chapter, we explored the data to get a feel for what determines whether a respondent is a data scientist. We can use summary statistics to quantify amounts with Pandas. I also like to visualize the data. Typically I will use Pandas and the Seaborn library to make charts to understand relationships between features of the data.

3.4 Exercises

1. Use .corr to quantify the correlations in your data.
2. Use a scatter plot to visualize the correlations in your data.
3. Use .value_counts to quantify counts of categorical data.
4. Use a bar plot to visualize the counts of categorical data.

Chapter 4

Tree Creation

Finally, we are going to make a model! A tree model. Tree-based models are useful for ML because they can handle both categorical and numerical data, and they can handle nonlinear relationships between features and target variables. Additionally, they provide a clear and interpretable method for feature importance, which can be useful for feature selection and understanding the underlying relationships in the data.

Before we make a model, let's try to understand an algorithm for deciding what makes a good tree model. At a high level, when we create a tree for classification, the process goes something like this:

• Loop through all of the columns and
  – Find the split point that is best at separating the different labels
  – Create a node with this information
• Repeat the above process on the results of each node until each node is pure (or a hyperparameter indicates to stop tree creation)

We need a metric to help determine when our separation is good. One commonly used metric is the Gini Coefficient.

4.1 The Gini Coefficient

The Gini coefficient (also known as the Gini index or Gini impurity)1 quantifies the level of inequality in a frequency distribution. A value of 0 indicates complete equality, with all values being the same, and a value of 1 represents the maximum level of inequality among the values. The formula for the Gini coefficient is:

    Gini = 1 − ∑_{i=1}^{c} p_i²

where Gini is the Gini index, c is the number of classes, and p_i is the proportion of observations that belong to class i. For example, a node holding 80% of one class and 20% of the other has Gini = 1 − 0.8² − 0.2² = 0.32 (the short sketch following this paragraph checks the formula in code). In order to understand optimal splitting, we'll simulate a dataset with two classes and one feature.
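Before building the simulation, here is a minimal standalone check of the formula above. The gini_impurity helper is written purely for this illustration (it is not part of the book's code); it computes 1 − ∑ p_i² from a list of labels.

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a single node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1 - np.sum(proportions ** 2)

print(gini_impurity(['pos'] * 20 + ['neg'] * 80))   # 1 - 0.2**2 - 0.8**2 = 0.32
print(gini_impurity(['pos'] * 50 + ['neg'] * 50))   # most mixed two-class node: 0.5
print(gini_impurity(['pos'] * 100))                 # pure node: 0.0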
For some values of the features, the classes will overlap—there will be observations from both classes on each side of any split. We’ll find the threshold for which the feature best separates the data into the correct classes. 1 This coefficient was proposed by an Italian statistician and sociologist Corrado Gini to measure the inequality of wealth in a country. A coefficient of 0 would mean that everyone has the same income. A measure of 1 would indicate that one person has all the income. 23 4. Tree Creation Let’s explore this by generating two random samples using NumPy. We will label one sample as Negative and the other as Positive. Here is a visualization of what the distributions look like: import numpy as np import numpy.random as rn pos_center = 12 pos_count = 100 neg_center = 7 neg_count = 1000 rs = rn.RandomState(rn.MT19937(rn.SeedSequence(42))) gini = pd.DataFrame({'value': np.append((pos_center) + rs.randn(pos_count), (neg_center) + rs.randn(neg_count)), 'label': ['pos']* pos_count + ['neg'] * neg_count}) fig, ax = plt.subplots(figsize=(8, 4)) _ = (gini .groupby('label') [['value']] .plot.hist(bins=30, alpha=.5, ax=ax, edgecolor='black') ) ax.legend(['Negative', 'Positive']) Figure 4.1: Positive and negative distributions for Gini demonstration. Note that there are overlapping values between 9 and 10. It might be hard to see, but there is some overlap between 9 and 10, where there are both negative and positive examples. I outlined these in black to make them more evident. 24 4.1. The Gini Coefficient Now, let’s imagine that we needed to choose a value where we determine whether values are negative or positive. The Gini coefficient is a metric that can help with this task. To calculate the Gini coefficient for our positive and negative data, we need to consider a point (my function calls it the split_point) where we decide whether a value is positive or negative. If we have the actual labels for the data, we can calculate the true positive, false positive, true negative, and false negative values. Then we calculate the weighted average of one minus the fraction of true and false positives squared. We also calculate the weighted average of one minus the fraction of true and false negatives squared. The weighted average of these two values is the Gini coefficient. Here is a function that performs the calculation: def calc_gini(df, val_col, label_col, pos_val, split_point, debug=False): """ This function calculates the Gini impurity of a dataset. Gini impurity is a measure of the probability of a random sample being classified incorrectly when a feature is used to split the data. The lower the impurity, the better the split. Parameters: df (pd.DataFrame): The dataframe containing the data val_col (str): The column name of the feature used to split the data label_col (str): The column name of the target variable pos_val (str or int): The value of the target variable that represents the positive class split_point (float): The threshold used to split the data. debug (bool): optional, when set to True, prints the calculated Gini impurities and the final weighted average Returns: float: The weighted average of Gini impurity for the positive and negative subsets. 
""" ge_split = df[val_col] >= split_point eq_pos = df[label_col] == pos_val tp = df[ge_split & eq_pos].shape[0] fp = df[ge_split & ~eq_pos].shape[0] tn = df[~ge_split & ~eq_pos].shape[0] fn = df[~ge_split & eq_pos].shape[0] pos_size = tp+fp neg_size = tn+fn total_size = len(df) if pos_size == 0: gini_pos = 0 else: gini_pos = 1 - (tp/pos_size)**2 - (fp/pos_size)**2 if neg_size == 0: gini_neg = 0 else: gini_neg = 1 - (tn/neg_size)**2 - (fn/neg_size)**2 weighted_avg = gini_pos * (pos_size/total_size) + \ gini_neg * (neg_size/total_size) 25 4. Tree Creation if debug: print(f'{gini_pos=:.3} {gini_neg=:.3} {weighted_avg=:.3}') return weighted_avg Let’s choose a split point and calculate the Gini coefficient (the weighted_avg value). >>> calc_gini(gini, val_col='value', label_col='label', pos_val='pos', ... split_point=9.24, debug=True) gini_pos=0.217 gini_neg=0.00202 weighted_avg=0.0241 0.024117224644432264 With the function in hand, let’s loop over the possible values for the split point and plot out the coefficients for a given split. If we go too low, we get too many false positives. Conversely, if we set the split point too high, we get too many false negatives, and somewhere in between lies the lowest Gini coefficient. values = np.arange(5, 15, .1) ginis = [] for v in values: ginis.append(calc_gini(gini, val_col='value', label_col='label', pos_val='pos', split_point=v)) fig, ax = plt.subplots(figsize=(8, 4)) ax.plot(values, ginis) ax.set_title('Gini Coefficient') ax.set_ylabel('Gini Coefficient') ax.set_xlabel('Split Point') Figure 4.2: Gini values over different split points. Notice that the lowest value is around 10. Armed with this graph, somewhere around 10 provides the lowest Gini coefficient. Let’s look at the values around 10. 26 4.2. Coefficients in Trees >>> pd.Series(ginis, index=values).loc[9.5:10.5] 9.6 0.013703 9.7 0.010470 9.8 0.007193 9.9 0.005429 10.0 0.007238 10.1 0.005438 10.2 0.005438 10.3 0.007244 10.4 0.009046 10.5 0.009046 dtype: float64 This code verifies that the split point at 9.9 will minimize the Gini coefficient: >>> print(pd.DataFrame({'gini':ginis, 'split':values}) ... .query('gini <= gini.min()') ... ) gini split 49 0.005429 9.9 4.2 Coef昀椀cients in Trees Now let’s make a simple tree that only has a single node. We call this a decision stump (because it has no branches). We can use a DecisionTreeClassifier from scikit-learn. We will call the .fit method to fit the training data to the labels. Remember, we pass in a dataframe for the features and a series for the labels. (We can also pass in NumPy data structures but I find those inconvenient for tabular work.) from sklearn import tree stump = tree.DecisionTreeClassifier(max_depth=1) stump.fit(gini[['value']], gini.label) A valuable decision tree feature is that you can visualize what they do. Below, we will create a visualization of the decision tree. It shows a single decision at the top box (or node). It checks if the value less than or equal to 9.708. If that decision is true, we jump to the left child, the negative label. Otherwise, we use the right child, the positive label. You can see that the base Gini value (given in the top box) is our starting Gini considering that we labeled everything as positive. fig, ax = plt.subplots(figsize=(8, 4)) tree.plot_tree(stump, feature_names=['value'], filled=True, class_names=stump.classes_, ax=ax) If you calculate the weighted average of the Gini coefficients in the leaf nodes, you see a value very similar to the minmum value we calculated above: 27 4. 
Tree Creation Figure 4.3: Export of decision stump. >>> gini_pos = 0.039 >>> gini_neg = 0.002 >>> pos_size = 101 >>> neg_size = 999 >>> total_size = pos_size + neg_size >>> weighted_avg = gini_pos * (pos_size/total_size) + \ ... gini_neg * (neg_size/total_size) >>> print(weighted_avg) 0.005397272727272727 XGBoost doesn’t use Gini, but Scikit-Learn does by default. XGBoost goes through a treebuilding process. After training a tree, there are two statistics gi and hi . They are similar to Gini but also represent the behavior of the loss function. The gradient (gi ) represents the first derivative of the loss function with respect to the predicted value for each instance, while the second derivative (hi ) represents the curvature of the loss function. Let’s make a stump in XGBoost by limiting the max_depth and n_estimators parameters: import xgboost as xgb xg_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1) xg_stump.fit(gini[['value']], (gini.label== 'pos')) The .plot_tree method will let us visualize the split point for the stump. xgb.plot_tree(xg_stump, num_trees=0) Note Because I want to be able to set the fonts in this plot, I created my own function, my_dot_export. 28 4.2. Coefficients in Trees import subprocess def my_dot_export(xg, num_trees, filename, title='', direction='TB'): """Exports a specified number of trees from an XGBoost model as a graph visualization in dot and png formats. Args: xg: An XGBoost model. num_trees: The number of tree to export. filename: The name of the file to save the exported visualization. title: The title to display on the graph visualization (optional). direction: The direction to lay out the graph, either 'TB' (top to bottom) or 'LR' (left to right) (optional). """ res = xgb.to_graphviz(xg, num_trees=num_trees) node [fontname = "Roboto Condensed"]; content = f''' edge [fontname = "Roboto Thin"]; label = "{title}" fontname = "Roboto Condensed" ''' out = res.source.replace('graph [ rankdir=TB ]', f'graph [ rankdir={direction} ];\n {content}') # dot -Gdpi=300 -Tpng -ocourseflow.png courseflow.dot dot_filename = filename with open(dot_filename, 'w') as fout: fout.write(out) png_filename = dot_filename.replace('.dot', '.png') subprocess.run(f'dot -Gdpi=300 -Tpng -o{png_filename} {dot_filename}'.split()) Let’s try out the function. my_dot_export(xg_stump, num_trees=0, filename='img/stump_xg.dot', title='A demo stump') Figure 4.4: Export of xgboost stump. 29 4. Tree Creation The split point XGBoost found is very similar to scikit-learn. (The values in the leaves are different, but we will get to those.) 4.3 Another Visualization Tool The dtreeviz package creates visualizations for decision trees. It provides a clear visualization of tree structure and decision rules, making it easier to understand and interpret the model. import dtreeviz viz = dtreeviz.model(xg_stump, X_train=gini[['value']], y_train=gini.label=='pos', target_name='positive', feature_names=['value'], class_names=['negative', 'positive'], tree_index=0) viz.view() Figure 4.5: Decision stump visualized by dtreeviz package. Note that it indicates the split point in the histogram. It also shows a pie chart for each leaf indicating the fraction of labels. This visualization shows the distributions of the data as histograms. It also labels the split point. Finally, it shows what the ratios of the predictions are as pie charts. 4.4 Summary The Gini coefficient is a measure of inequality used in economics and statistics. 
4.4 Summary

The Gini coefficient is a measure of inequality used in economics and statistics. It is typically used to measure income inequality, but it can also be applied to other variables such as wealth. The Gini coefficient takes on a value between 0 and 1, with 0 representing perfect equality (where everyone has the same income) and 1 representing perfect inequality (where one person has all the income and everyone else has none).

4.5 Exercises

1. What is the Gini coefficient?
2. How is the Gini coefficient calculated?
3. How is the Gini coefficient used in decision trees?
4. What does a stump tell us about the data?

Chapter 5
Stumps on Real Data

By now, you should have an intuition for how a decision tree "decides" which values to split on. Decision trees use a greedy algorithm to split on the feature (column) that results in the most "pure" split. Instead of looking at our simple data with only one column of information, let's use the Kaggle data. We will predict whether someone is a data scientist or a software engineer based on how they answered the survey questions. The decision tree should loop over every column and try to find the column and split point that best separate data scientists from software engineers. It will recursively perform this operation, creating a tree structure.

5.1 Scikit-learn stump on real data

Let's use scikit-learn first to make a decision stump on the Kaggle data. Before we do that, consider what this stump is telling us. The first split should use one of the most important features because it is the value that best separates the data into classes. If you only had one piece of information, you would probably want the column that the stump splits on.

Let's create the stump and visualize it. We will use our pipeline to prepare the training data. Remember that the pipeline uses Pandas code to clean up the raw data, then we use categorical encoding on the non-numeric columns, and finally, we fill in the missing values for education level and experience.

We will feed the raw data, kag_X_train, into the .fit_transform method of the pipeline and get out the cleaned-up training data, X_train. Then we will create a stump and train it with the .fit method:

stump_dt = tree.DecisionTreeClassifier(max_depth=1)
X_train = kag_pl.fit_transform(kag_X_train)
stump_dt.fit(X_train, kag_y_train)

Let's explore how the tree decided to split the data. Here is the code to visualize the tree:

fig, ax = plt.subplots(figsize=(8, 4))
features = list(c for c in X_train.columns)
tree.plot_tree(stump_dt, feature_names=features, filled=True,
    class_names=stump_dt.classes_, ax=ax)

The column that best separates the two professions is the use of the R programming language. This seems like a sensible choice considering that very few software engineers use the R language.

Figure 5.1: Decision stump trained on our Kaggle data. It bases its decision purely on the value in the r column.

We can evaluate the accuracy of our model with the .score method. We call this method on the holdout set. The holdout set (or validation set), X_test, is the portion of the data that is set aside and not used during the training phase; it is reserved for testing the model's performance after it has been trained. It allows for a more accurate evaluation of the model's performance, as the model is tested on unseen data. It looks like this model classified 62% of the occupations correctly by just looking at a single column, the R column.

>>> X_test = kag_pl.transform(kag_X_test)
>>> stump_dt.score(X_test, kag_y_test)
0.6243093922651933

Is 62% accuracy good? Bad?
The answer to questions like this is generally “it depends”. Looking at accuracy on its own is usually not sufficient to inform us of whether a model is “good enough”. However, we can use it to compare models. We can create a baseline model using the DummyClassifier. This classifier just predicts the most common label, in our case 'Data Scientist'. Not a particularly useful model, but it provides a baseline score that our model should be able to beat. If we can’t beat that model, we shouldn’t be using machine learning. Here is an example of using the DummyClassifier. >>> from sklearn import dummy >>> dummy_model = dummy.DummyClassifier() >>> dummy_model.fit(X_train, kag_y_train) >>> dummy_model.score(X_test, kag_y_test) 0.5458563535911602 32 5.2. Decision Stump with XGBoost The baseline performance is 54% accuracy (which is the percent of values that are 'Data Scientist'). Our stump accuracy is better than the baseline. Note that being better than the baseline does not qualify a model as “good”. But it does indicate that model is potentially useful. 5.2 Decision Stump with XGBoost XGBoost does not use the Gini calculation to decide how to make decisions. Rather XGBoost uses boosting and gradient descent. Boosting is using a model and combining it with other models to improve the results. In fact, XGBoost stands for eXtreme Gradient Boosting. The “extreme” part is due to the ability to regularize the result and the various optimizations is has to efficiently create the model. In this case, the subsequent models are trained from the error of the previous model with the goal to reduce the error. The “gradient descent” part comes in because this process of minimizing the error is specified into an objective function such that the gradient descent algorithm can be applied. The outcome is based on the gradient of the error with respect to the prediction. The objective function combines two parts: training loss and a regularization term. Trees are created by splitting on features that move a small step in the direction of the negative gradient, which moves them closer to the global minimum of the loss function. There are many parameters for controlling XGBoost. One of them is n_estimators, the number of trees. Let’s create a stump with XGBoost by setting this value to 1 and see if it performs similarly to the scikit-learn stump. import xgboost as xgb kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1) kag_stump.fit(X_train, kag_y_train) --------------------------------------------------------------------------ValueError Traceback (most recent call last) Cell In[402], line 2 1 kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1) ----> 2 kag_stump.fit(X_train, kag_y_train) 3 kag_stump.score(X_test, kag_y_test) ... 1462 1463 1464 1465 -> 1466 1467 1468 1469 1471 1473 if ( ): self.classes_.shape != expected_classes.shape or not (self.classes_ == expected_classes).all() raise ValueError( f"Invalid classes inferred from unique values of `y`. " f"Expected: {expected_classes}, got {self.classes_}" ) params = self.get_xgb_params() if callable(self.objective): ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1], got ['Data Scientist' 'Software Engineer'] We got an error! XGBoost is not happy about our labels. Unlike scikit-learn, XGBoost does not work with string labels. The kag_y_train series has strings in it, and XGBoost wants integer values. 33 5. 
Stumps on Real Data >>> print(kag_y_train) 587 Software Engineer 3065 Data Scientist 8435 Data Scientist 3110 Data Scientist 16372 Software Engineer ... 16608 Software Engineer 7325 Software Engineer 21810 Data Scientist 4917 Data Scientist 639 Data Scientist Name: Q6, Length: 2110, dtype: object Since there are only two options, we can convert this to integer values by testing whether the string is equal to Software Engineer. A True or (1) represents Software Engineer and False (or 0) is a Data Scientist. >>> print(kag_y_train == 'Software Engineer') 587 True 3065 False 8435 False 3110 False 16372 True ... 16608 True 7325 True 21810 False 4917 False 639 False Name: Q6, Length: 2110, dtype: bool However, rather than using pandas, we will use the LabelEncoder class from scikit-learn to convert the labels to numbers. The label encoder stores attributes that are useful when preparing data for future predictions. >>> from sklearn import preprocessing >>> label_encoder = preprocessing.LabelEncoder() >>> y_train = label_encoder.fit_transform(kag_y_train) >>> y_test = label_encoder.transform(kag_y_test) >>> y_test[:5] array([1, 0, 0, 1, 1]) The label encoder will return 1s and 0s. How do you know which numerical value is which string value? You can ask the label encoder for the .classes_. The index of these values is the number that the encoder uses. >>> label_encoder.classes_ array(['Data Scientist', 'Software Engineer'], dtype=object) 34 5.3. Values in the XGBoost Tree Because 'Data Scientist' is in index 0 in .classes_, it is the 0 value in the training labels. The 1 represents 'Software Engineer'. This is also called the positive label. (It is the positive label because it has the value of 1.) The label encoder has an .inverse_transform method that reverses the transformation. (This would require more effort if we went with the pandas encoding solution.) >>> label_encoder.inverse_transform([0, 1]) array(['Data Scientist', 'Software Engineer'], dtype=object) Now let’s make a stump with XGBoost and check the score. >>> kag_stump = xgb.XGBClassifier(n_estimators=1, max_depth=1) >>> kag_stump.fit(X_train, y_train) >>> kag_stump.score(X_test, y_test) 0.6243093922651933 It looks like the XGBoost stump performs similarly to the scikit-learn decision tree. Let’s visualize what it looks like. my_dot_export(kag_stump, num_trees=0, filename='img/stump_xg_kag.dot', title='XGBoost Stump') Figure 5.2: Stump export of XGBoost Kaggle model. A positive leaf value means that we predict the positive label. 5.3 Values in the XGBoost Tree What are the numbers in the leaves of the export? They are probabilities. Well, actually, they are the logarithm of the odds of the positive value. Statisticians call this the logit. When you calculate the inverse logit scores with the values from the nodes in the tree, you get the probability of the prediction. If a survey respondent does not use the R language, the leaf value is .0717741922. The inverse logit of that is .518, or 51.8%. Because this value is greater than .5, we assume the positive label or class of Software Engineer (the second value in kag_stump.classes_ is 1, which our label encoder used as the value for Software Engineer). 35 5. Stumps on Real Data >>> kag_stump.classes_ array([0, 1]) import numpy as np def inv_logit(p: float) -> float: """ Compute the inverse logit function of a given value. The inverse logit function is defined as: f(p) = exp(p) / (1 + exp(p)) Parameters ---------p : float The input value to the inverse logit function. 
Returns ------float The output of the inverse logit function. """ return np.exp(p) / (1 + np.exp(p)) Let’s see what probability comes out when we pass in .07177. >>> inv_logit(.0717741922) 0.5179358489487103 It looks like it spits out 51.8%. Conversely, the inverse logit of the right node is .411. Because this number is below .5, we do not classify it as the second option but rather choose Data Scientist. Note Note that if a user uses R, they have a higher likelihood of being a Data Scientist (1-.41 or 59%), than a Software Engineer who doesn’t use R (.518 or 52%). In other words, using R pushes one more toward Data Scientist than not using R pushes one toward Software Engineer. >>> inv_logit(-.3592) 0.41115323716754393 Here I plot the inverse logit function. You can see a crossover point at 0 on the x-axis. When the x values are above 0, the y value will be > .5 (Software Engineer). When they are below 0, the values will be < .5 (Data Scientist). fig, ax = plt.subplots(figsize=(8, 4)) vals = np.linspace(-7, 7) ax.plot(vals, inv_logit(vals)) ax.annotate('Crossover point', (0,.5), (-5,.8), arrowprops={'color':'k'}) ax.annotate('Predict Positive', (5,.6), (1,.6), va='center', arrowprops={'color':'k'}) ax.annotate('Predict Negative', (-5,.4), (-3,.4), va='center', arrowprops={'color':'k'}) 36 5.4. Summary Figure 5.3: The inverse logit function. When the x values are above 0, the y value will be > .5 (Software Engineer). When they are below 0, the values will be < .5 (Data Scientist). 5.4 Summary In this chapter, we explored making a stump on real data. A decision stump is a simple decision tree with a single split. We showed how we can use the XGBoost algorithm to create a decision stump. We also explored how the values in the leaves of XGBoost models indicate what label to predict. 5.5 Exercises 1. How are decision stumps trained using the XGBoost algorithm? 2. What is the inverse logit function and how is it used in machine learning? 3. How can the output of the inverse logit function be interpreted in terms of label probabilities? 37 Chapter 6 Model Complexity & Hyperparameters In this chapter, we will explore the concept of underfit and overfit models. Then we will see how we can use hyperparameters, or attributes of the model to change the behavior of the model fitting. 6.1 Under昀椀t A stump is too simple. Statisticians like to say it has too much bias. (I think this term is a little confusing and prefer “underfit”.) It has a bias towards simple predictions. In the case of our Kaggle model, the stump only looks at a single column. Perhaps it would perform better if it were able to also consider additional columns after looking at the R column. When you have an underfit model, you want to make it more complex (because it is too simple). There are a few mechanisms we can use to add complexity: • Add more features (columns) that have predictive value • Use a more complex model These are general strategies that work with most underfit models, not just trees. Here is our stump. Let’s look at the accuracy (as reported by .score). It should be representative of an underfit model. >>> underfit = tree.DecisionTreeClassifier(max_depth=1) >>> X_train = kag_pl.fit_transform(kag_X_train) >>> underfit.fit(X_train, kag_y_train) >>> underfit.score(X_test, kag_y_test) 0.6243093922651933 Hopefully, we will be able to create a model with better accuracy than 62%. 6.2 Growing a Tree We showed how to create a stump. How do we fix our underfit stump? 
You could add more advanced features (or columns) to the training data that better separate the classes. Or we could make the stump more complex by letting the tree grow larger.

Instead of trying arbitrary tactics, we generally want to proceed in a more organized fashion. One common technique is to measure model accuracy as you let the tree grow deeper (or add additional columns) and see if there is an improvement.

You can optimize a scoring function given constraints. Different models can have different scoring functions. You can also create your own. We saw that scikit-learn optimizes Gini and that XGBoost performs gradient descent on a loss function.

The constraints for the model are called hyperparameters. Hyperparameters, at a high level, are knobs that you can use to control the complexity of your model. Remember that our stump was too simple and underfit our data. One way we can add complexity to a tree-based model is to let it grow more nodes. We can use hyperparameters to determine how and when we will add nodes.

6.3 Overfitting

Before optimizing our model, let's look at the other end of the spectrum, an overfit model. An overfit model is too complicated. Again, the statistician would say it has too much variance. I avoid using the term "variance" too much (unless I'm in an interview situation) and like to picture a model that looks at every nook and cranny of the data. When it sees a new example, the model compares it against all of the trivial aspects it learned by inspecting the training data, which tends to lead it down the wrong path. Rather than homing in on the right label, the complexity causes it to "vary".

And again, there are general strategies for dealing with overfitting:

• Simplify or constrain (regularize) the model
• Add more samples (rows of data)

For a tree model, we can prune back the growth so that the leaf nodes aren't overly specific. This will simplify or constrain the model. Alternatively, we can collect more data, giving the model more attempts at finding the important features.

6.4 Overfitting with Decision Trees

Let's make an overfit model with a decision tree. This is quite easy. We just let the model grow until every node is pure (of the same class). We can set max_depth=None to overfit the tree, which is the default behavior of DecisionTreeClassifier. After we create the model, let's inspect the accuracy.

>>> hi_variance = tree.DecisionTreeClassifier(max_depth=None)
>>> X_train = kag_pl.fit_transform(kag_X_train)
>>> hi_variance.fit(X_train, kag_y_train)
>>> hi_variance.score(X_test, kag_y_test)
0.6629834254143646

In this case, the accuracy is 66%. The accuracy of the stump was 62%. It is possible (and likely) that we can get better accuracy than both the stump and the complex model by either simplifying our complex model or adding some complexity to our simple model.

Here is a visualization of the complex model. You can see that the tree is pretty deep. I count 22 levels of nodes. This visualization should help with the intuition of why an overfit tree is bad. It is essentially memorizing the training data. If you feed it data that is not exactly the same as the data it has seen before, it will tend to go down the wrong track. (If you think memorizing the data is good, then you don't need a machine learning model because you can just memorize every possible input and create a giant if statement for that.)
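One quick way to see this memorization numerically — a small check of my own using the variables defined above, not something from the book — is to compare the tree's accuracy on the rows it trained on with its accuracy on the holdout rows:

# Compare accuracy on memorized training rows vs. unseen holdout rows.
train_acc = hi_variance.score(X_train, kag_y_train)
test_acc = hi_variance.score(X_test, kag_y_test)
# Expect train_acc to be at (or very near) 1.0 because the tree grew until
# its leaves were pure, while test_acc stays around the mid-60s shown above.
print(f'train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}')

A large gap between the two numbers is the classic signature of overfitting. With that gap in mind, the visualization below shows just how deep the tree grew.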
fig, ax = plt.subplots(figsize=(8, 4)) features = list(c for c in X_train.columns) tree.plot_tree(hi_variance, feature_names=features, filled=True) Let’s zoom in on the top of the tree to understand more about what is happening. 40 6.5. Summary Figure 6.1: An overfit decision tree. This is too complicated and it is essentially memorizing the data that it was trained on. However, because it is very specific to the training data, it will tend to perform worse on data that it hasn’t seen before. fig, ax = plt.subplots(figsize=(8, 4)) features = list(c for c in X_train.columns) tree.plot_tree(hi_variance, feature_names=features, filled=True, class_names=hi_variance.classes_, max_depth=2, fontsize=6) The darker orange boxes tend towards more data scientists, and the bluer boxes are software engineers. If we go down the most orange path, at the top node is a person who uses R, who also didn’t study computer science, and finally, they have less or equal to 22.5 years of experience. Of folks who fall into this bucket, 390 of the 443 samples were data scientists. The ability to interpret a decision tree is important and often business-critical. Many companies are willing to sacrifice model performance for the ability to provide a concrete explanation for the behavior of a model. A decision tree makes this very easy. We will see that XGBoost models prove more challenging at delivering a simple interpretation. 6.5 Summary In this chapter, the concept of underfit and overfit models is explored. Hyperparameters, or attributes of the model, are used to change the behavior of the model fitting. We saw that the max_depth hyperparameter can tune the model between simple and complex. An underfit model is too simple, having a bias towards simple predictions. To fix an underfit model, one can add more features or use a more complex model. These general strategies work with most underfit models, not just trees. On the other hand, an overfit model is too complicated and has too much variance. To deal with overfitting, one can simplify or constrain the model, or add more samples. For a tree model, one can prune back the growth so that the leaf nodes are overly specific or collect 41 6. Model Complexity & Hyperparameters Figure 6.2: Zooming in on an overfit decision tree. more data. The goal is to find the optimal balance between a model that is too simple and too complex. Later on, we will see that XGBoost creates decent models out of the box, but it also has mechanisms to tune the complexity. We will also see how you can diagnose overfitting by visualizing model performance. 6.6 Exercises 1. 2. 3. 4. 42 What is underfitting in decision trees and how does it occur? What is overfitting in decision trees and how does it occur? What are some techniques to avoid underfitting in decision trees? What are some techniques to avoid overfitting in decision trees? Chapter 7 Tree Hyperparameters Previously, we saw that models might be too complex or too simple. Some of the mechanisms for dealing with model complexity are by creating new columns (to deal with simple models) and adding more examples (to deal with complex models). But there are also levers that we can pull without dealing with data. These levers are hyperparameters, levers that change how the model is created, trained, or behaves. 7.1 Decision Tree Hyperparameters Let’s explore the hyperparameters of decision trees. You might wonder why we seem so concerned with decision trees when this book is about XGBoost. 
That is because decision trees form the basis for XGBoost. If you understand how decision trees work, that will aid your mental model of XGBoost. Also, many of the hyperparameters for decision trees apply to XGBoost at some level. Many of the hyperparameters impact complexity (or simplicity). Here is a general rule for models based on scikit-learn. Hyperparameters that start with max_ will make the model more complex when you raise it. On the flip side, those that start with min_ will make the model simpler if you raise it. (The reverse is also true, they do the reverse when you lower them.) Here are the hyperparameters for scikit-learn’s DecisionTreeClassifier: • max_depth=None - Tree depth. The default is to keep splitting until all nodes are pure or there are fewer than min_samples_split samples in each node. Range [1-large number]. • max_features=None - Amount of features to examine for the split. Default is the number of features. Range [1-number of features]. • max_leaf_node=None - Number of leaves in a tree. Range [1-large number]. • min_impurity_decrease=0 - Split when impurity is >= this value. (Impurity : 0 - 100% accurate, .3 - 70%). Range [0.0-1.0] • min_samples_leaf=1, - Minimum samples at each leaf. Range [1-large number] • min_samples_split=2 - Minimum samples required to split a node. Range [2-large number] • min_weight_fraction_leaf=0 - The fraction of the total weights required to be a leaf. Range [0-1] To use a particular hyperparameter, you provide the parameter in the constructor before training a model. The constructor is the method called when a class is created. Then you train a model and measure a performance metric and track what happens to that metric as you change the parameter. You can explore the parameters of a trained model as well using the .get_params method: 43 7. Tree Hyperparameters Figure 7.1: Scikit-learn convention for hyperparameter tuning options to deal with overfit and underfit models. >>> stump.get_params() {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 1, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'} We can verify that this is a stump because the max_depth hyperparameter is set to 1. 7.2 Tracking changes with Validation Curves Let’s adjust the depth of a tree while tracking accuracy. A chart that does this is called a Validation Curve. With the stump, we get around 62% accuracy. As we add more levels, the accuracy tends to improve until a point. Then it starts to fall back down. 44 7.3. Leveraging Yellowbrick Why does the accuracy start to fall? Our model is now overfitting. Rather than extracting valuable insights from the training data, it memorizes the noise because we allow it to get too complicated. When we try to make a prediction with new data that is missing the same noise, our overfit model underperforms. Here is the code to create a validation curve: accuracies = [] for depth in range(1, 15): between = tree.DecisionTreeClassifier(max_depth=depth) between.fit(X_train, kag_y_train) accuracies.append(between.score(X_test, kag_y_test)) fig, ax = plt.subplots(figsize=(10,4)) (pd.Series(accuracies, name='Accuracy', index=range(1, len(accuracies)+1)) .plot(ax=ax, title='Accuracy at a given Tree Depth')) ax.set_ylabel('Accuracy') ax.set_xlabel('max_depth') Figure 7.2: A handmade validation curve tracking accuracy over depth. 
From this graph, it looks like a depth of seven maximizes the accuracy. Let's check the .score of our "Goldilocks" model.

>>> between = tree.DecisionTreeClassifier(max_depth=7)
>>> between.fit(X_train, kag_y_train)
>>> between.score(X_test, kag_y_test)
0.7359116022099448

Remember that the scores of the underfit and overfit models were 62% and 66%, respectively. An accuracy of 74% is a nice little bump.

7.3 Leveraging Yellowbrick

I come from the lazy programmer school of thought. The code above to track the accuracy while sweeping over hyperparameter changes is not particularly difficult, but I would rather have one line of code that does that for me. The Yellowbrick library provides a visualizer, validation_curve, for us. (OK, it is a few more lines of code because we need to import it, and I'm also configuring Matplotlib and saving the image, but in your notebook, you only need the call to validation_curve once you have done the imports.)

from yellowbrick.model_selection import validation_curve
fig, ax = plt.subplots(figsize=(10,4))
viz = validation_curve(tree.DecisionTreeClassifier(),
    X=pd.concat([X_train, X_test]),
    y=pd.concat([kag_y_train, kag_y_test]),
    param_name='max_depth',
    param_range=range(1,14), scoring='accuracy', cv=5, ax=ax, n_jobs=6)

Figure 7.3: Validation curve illustrating underfitting and overfitting. Around a depth of 7 appears to be the sweet spot for this data.

This plot is interesting (and better than my hand-rolled one). It has two lines: one tracks the "Training Score", and the other tracks the "Cross Validation Score". The training score is the accuracy on the training data. You can see that as we increase the tree depth, the training accuracy continues to improve. However! We only care about the accuracy on the testing data (this is the "Cross Validation Score"). The testing data line simulates how our model will work when it encounters data it hasn't seen before. The testing data score drops as our model begins to overfit. This plot validates our test and suggests that a depth of seven is an appropriate choice.

The validation_curve function also allows us to specify other parameters:

• cv - Number of k-fold cross validations (defaults to 3)
• scoring - A scikit-learn model metric string or function (there are various functions in the sklearn.metrics module)

The plot also shows a range of scores with the shaded area. This is because it does cross validation and has multiple scores for a given hyperparameter value. Ideally, you want the range to be tight, indicating that the metric performed consistently across each split of the cross validation.

7.4 Grid Search

There is a limitation to the validation curve: it only tracks a single hyperparameter. We often want to evaluate many hyperparameters at once. Grid search is one tool that allows us to experiment across many hyperparameters.

Grid search is a technique for finding the best hyperparameters for a model by training the model on different combinations of hyperparameters and evaluating its performance on a validation set. This technique is often used in decision tree models to tune the hyperparameters that control the tree's growth, such as the maximum depth, minimum samples per leaf, and minimum samples per split. With grid search, a list of possible values for each hyperparameter is specified, and the model is trained and evaluated on every combination of hyperparameters.
For example, if the maximum depth of the decision tree can take values in the range 1 to 10, and the minimum samples per leaf can take values in the range 1 to 100, then a grid search would involve training and evaluating the model on all possible combinations of maximum depth and minimum samples per leaf within these ranges. After training and evaluating the model on all combinations of hyperparameters, the combination that produces the best performance on the validation set is selected as the optimal set of hyperparameters for the model. Scikit-learn provides this functionality in the GridSearchCV class. We specify a dictionary to map the hyperparameters we want to explore to a list of options. The GridSearchCV class follows the standard scikit-learn interface and provides a .fit method to kick off the search for the best hyperparameters. The verbose=1 parameter is not a hyperparameter, it tells grid search to display computation time for each attempt. from sklearn.model_selection import GridSearchCV params = { 'max_depth': [3, 5, 7, 8], 'min_samples_leaf': [1, 3, 4, 5, 6], 'min_samples_split': [2, 3, 4, 5, 6], } grid_search = GridSearchCV(estimator=tree.DecisionTreeClassifier(), param_grid=params, cv=4, n_jobs=-1, verbose=1, scoring="accuracy") grid_search.fit(pd.concat([X_train, X_test]), pd.concat([kag_y_train, kag_y_test])) After the grid search has exhausted all the combinations, we can inspect the best parameters on the .best_params_ attribute. >>> grid_search.best_params_ {'max_depth': 7, 'min_samples_leaf': 5, 'min_samples_split': 6} Then I can use these parameters while constructing the decision tree. (The unpack operation, **, will use each key from a dictionary as a keyword parameter with its associated value. >>> between2 = tree.DecisionTreeClassifier(**grid_search.best_params_) >>> between2.fit(X_train, kag_y_train) >>> between2.score(X_test, kag_y_test) 0.7259668508287292 47 7. Tree Hyperparameters Why is this score less than our between tree that we previously created? Generally, this is due to grid search doing k-fold cross validation. In our grid search, we split the data into four parts (cv=4). Then we train each hyperparameter against three parts and evaluate it against the remaining part. The score is the mean of each of the four testing scores. When we were only calling the .score method, we held out one set of data for evaluation. It is conceivable that the section of data that was held out happens to perform slightly above the average from the grid search. There is also a .cv_results_ attribute that contains the scores for each of the options. Here is Pandas code to explore this. The darker cells have better scores. # why is the score different than between_tree? (pd.DataFrame(grid_search.cv_results_) .sort_values(by='rank_test_score') .style .background_gradient(axis='rows') ) Figure 7.4: Scores for each split for the hyperparameter values of the grid search. We can manually validate the results of the grid search using the cross_val_score. This scikit-learn function does cross validation on the model and returns the score from each training. >>> results = model_selection.cross_val_score( ... tree.DecisionTreeClassifier(max_depth=7), ... X=pd.concat([X_train, X_test], axis='index'), ... y=pd.concat([kag_y_train, kag_y_test], axis='index'), ... cv=4 ... ) 48 7.5. Summary >>> results array([0.69628647, 0.73607427, 0.70291777, 0.7184595 ]) >>> results.mean() 0.7134345024851962 Here is the same process running against the updated hyperparameters from the grid search. 
(Note that if you change cv to 3, the grid search model performs slightly worse.) >>> results = model_selection.cross_val_score( ... tree.DecisionTreeClassifier(max_depth=7, min_samples_leaf=5, ... min_samples_split=2), ... X=pd.concat([X_train, X_test], axis='index'), ... y=pd.concat([kag_y_train, kag_y_test], axis='index'), ... cv=4 ... ) >>> results array([0.70822281, 0.73740053, 0.70689655, 0.71580345]) >>> results.mean() 0.7170808366886126 One thing you want to be careful of when evaluating machine learning models is that you are comparing apples to apples. Even if you are using the same evaluation metric, you need to make sure that you are evaluating against the same data. 7.5 Summary In this chapter, we learned that we can modify the accuracy of the model by changing the hyperparameters. We used a validation curve to visualize the impact of a single hyperparameter change and then changed many of them using grid search. Later on, we will see other techniques to tune hyperparameters. 7.6 1. 2. 3. 4. Exercises What are validation curves and how are they used in hyperparameter tuning? What is grid search and how does it differ from validation curves? What are some pros and cons of validation curves? What are some pros and cons of grid search? 49 Chapter 8 Random Forest Before diving into XGBoost, we will look at one more model. The Random Forest. This is a common tree-based model that is useful to compare and contrast with XGBoost (indeed, XGBoost can create random forest models if so desired). 8.1 Ensembles with Bagging Jargon warning! A random forest model is an ensemble model. An ensemble is a group of models that aggregate results to prevent overfitting. The ensembling technique used by random forests is called bagging. Bagging (which is short for more jargon-bootstrap aggregating) means taking the models’ average. Bootstrapping means to sample with replacement. In effect, a random forest trains multiple decision trees, but each one is trained on different rows of the data (and different subsets of the features). The prediction is the average of those trees. Random forests also use column subsampling. Column subsampling in a random forest is a technique where at various points in the model or tree creation, a random subset of the features is considered rather than using all the features. This can help to reduce the correlation between the trees and improve the overall performance of the random forest by reducing overfitting and increasing the diversity of the trees. After the trees have been created, the trees “vote” for the final class. An intuitive way to think about this comes from a theory proposed in 1785, Condorcet’s jury theorem. The Marquis de Condorcet was a French philosopher who proposed a mechanism for deciding the correct number of jurors to arrive at the correct verdict. The basic idea is that you should add a juror if they have more than a 50% chance of picking the right verdict and are not correlated with the other jurors. You should keep adding additional jurors if they have > 50% accuracy. Each juror will contribute more to the correct verdict. Similarly, if a decision tree has a greater than 50% chance of making the right classification (and it is looking at different samples and different features), you should take its vote into consideration. Because we subsample on the columns, that should aid in reducing the correlation between the trees. 
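Before looking at the library implementations, here is a small, hand-rolled sketch of bagging on synthetic data. This is my own illustration of the idea (bootstrap row sampling, column subsampling, and majority voting), not how scikit-learn's RandomForestClassifier is implemented internally, and the dataset and tree settings are arbitrary:

import numpy as np
from sklearn import datasets, model_selection, tree
from sklearn.utils import resample

# Toy data purely for illustration
X, y = datasets.make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(X, y, random_state=42)

rng = np.random.RandomState(42)
forest = []
for _ in range(25):
    # Bootstrap: sample rows with replacement
    X_boot, y_boot = resample(X_tr, y_tr, random_state=rng)
    # Column subsampling: each tree only sees a random subset of features
    cols = rng.choice(X_tr.shape[1], size=4, replace=False)
    dt = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
    dt.fit(X_boot[:, cols], y_boot)
    forest.append((cols, dt))

# Each tree votes; the majority wins (bootstrap aggregating)
votes = np.array([dt.predict(X_te[:, cols]) for cols, dt in forest])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print('bagged accuracy:', (majority == y_te).mean())

Each individual shallow tree is a mediocre classifier, but, in the spirit of Condorcet's jury, the majority vote of many weakly correlated trees is usually better than any single one of them.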
8.2 Scikit-learn Random Forest Here is a random forest model created with scikit-learn: >>> from sklearn import ensemble >>> rf = ensemble.RandomForestClassifier(random_state=42) >>> rf.fit(X_train, kag_y_train) >>> rf.score(X_test, kag_y_test) 0.7237569060773481 51 8. Random Forest How many trees did this create? Let’s look at the hyperparameters: >>> rf.get_params() {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False} It looks like it made 100 trees (see n_estimators). And if we want to, we can explore each tree. They are found under the .estimators_ attribute. Note Scikit-learn has many conventions. We discussed how raising max_ hyperparameters would make the model more complex. Another convention is that any attribute that ends with an underscore (_), like .estimators_, is created during the call to the .fit method. Sometimes these will be insights, metadata, or tables of data. In this case, it is a list of the trees created while training the model. We can validate that there are 100 trees. >>> len(rf.estimators_) 100 Every “estimator” is a decision tree. Here is the first tree. >>> print(rf.estimators_[0]) DecisionTreeClassifier(max_features='sqrt', random_state=1608637542) We can visualize each tree if desired. Here is the visualization for tree 0 (limited to depth two so we can read it). fig, ax = plt.subplots(figsize=(8, 4)) features = list(c for c in X_train.columns) tree.plot_tree(rf.estimators_[0], feature_names=features, filled=True, class_names=rf.classes_, ax=ax, max_depth=2, fontsize=6) 52 8.3. XGBoost Random Forest Figure 8.1: A visualization of the first two levels of tree 0 from our random forest. Interestingly, this tree chose Q3_United States of America as the first column. It did not choose the R column, probably due to column subsampling. 8.3 XGBoost Random Forest You can also create a random forest with the XGBoost library. >>> import xgboost as xgb >>> rf_xg = xgb.XGBRFClassifier(random_state=42) >>> rf_xg.fit(X_train, y_train) >>> rf_xg.score(X_test, y_test) 0.7447513812154696 You can also inspect the hyperparameters of this random forest. XGBoost uses different hyperparameters than scikit-learn. >>> rf_xg.get_params() {'colsample_bynode': 0.8, 'learning_rate': 1.0, 'reg_lambda': 1e-05, 'subsample': 0.8, 'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': 0.5, 'booster': 'gbtree', 'callbacks': None, 'colsample_bylevel': 1, 'colsample_bytree': 1, 'early_stopping_rounds': None, 'enable_categorical': False, 53 8. Random Forest 'eval_metric': None, 'feature_types': None, 'gamma': 0, 'gpu_id': -1, 'grow_policy': 'depthwise', 'importance_type': None, 'interaction_constraints': '', 'max_bin': 256, 'max_cat_threshold': 64, 'max_cat_to_onehot': 4, 'max_delta_step': 0, 'max_depth': 6, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 0, 'num_parallel_tree': 100, 'predictor': 'auto', 'random_state': 42, 'reg_alpha': 0, 'sampling_method': 'uniform', 'scale_pos_weight': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None} Let’s visualize the first tree with the plot_tree function of XGBoost. 
fig, ax = plt.subplots(figsize=(6,12), dpi=600) xgb.plot_tree(rf_xg, num_trees=0, ax=ax, size='1,1') my_dot_export(rf_xg, num_trees=0, filename='img/rf_xg_kag.dot', title='First Random Forest Tree', direction='LR') That tree is big, and a little hard to see what is going on without a microscope. Let’s use the dtreeviz library and limit the depth to 2 (depth_range_to_display=[0,2]). viz = dtreeviz.model(rf_xg, X_train=X_train, y_train=y_train, target_name='Job', feature_names=list(X_train.columns), class_names=['DS', 'SE'], tree_index=0) viz.view(depth_range_to_display=[0,2]) 8.4 Random Forest Hyperparameters Here are most of the hyperparameters for random forests (created with the XGBoost library). At a high level, you have the tree parameters (like the depth), sampling parameters, the ensembling parameters (number of trees), and regularization parameters (like gamma, which prunes back trees to regularize them). Tree hyperparameters. 54 8.4. Random Forest Hyperparameters Figure 8.2: First random forest tree built with XGBoost. 55 8. Random Forest Figure 8.3: dtreeviz export of XGBoost Random Forest model. • max_depth=6 - Tree depth. Range [1-large number] • max_leaves=0 - Number of leaves in a tree. (Not supported by tree_method='exact'). A value of 0 means no limit, otherwise, the range is [2-large number] • min_child_weight=1 - Minimum sum of hessian (weight) needed in a child. The larger the value, the more conservative. Range [0,∞] • grow_policy="depthwise" - (Only supported with tree_method set to 'hist', 'approx', or 'gpu_hist'. Split nodes closest to root. Set to "lossguide" (with max_depth=0) to mimic LightGBM behavior. • tree_method="auto" - Use "hist" to use histogram bucketing and increase performance. "auto" heuristic: – 'exact' for small data – 'approx' for larger data – 'hist' for histograms (limit bins with max_bins (default 256)) Sampling hyperparameters. These work cumulatively. First by the tree, then by level, and finally by the node. • colsample_bytree=1 - Subsample columns at each tree. Range (0,1] • colsample_bylevel=1 - Subsample columns at each level (from tree columns colsample_bytree). Range (0,1] • colsample_bynode=1 - Subsample columns at each node split (from tree columns colsample_bytree). Range (0,1] • subsample=1 - Sample portion of training data (rows) before each boosting round. Range (0,1] • sampling_method='uniform' - Sampling method. 'uniform' for equal probability. gradient_based sample is proportional to the regularized absolute value of gradients (only supported by tree_method="gpu_hist". Categorical data hyperparameters. Use the enable_categorical=True parameter and set columns to Pandas categoricals (.astype('category')). • max_cat_to_onehot=4 - Upper threshold for using one-hot encoding for categorical features. Use one-hot encoding when the number of categories is less than this number. 56 8.5. Training the Number of Trees in the Forest • max_cat_threshold=64 - Maximum number of categories to consider for each split. Ensembling hyperparameters. • n_estimators=100 - Number of trees. Range [1, large number] • early_stopping_rounds=None - Stop creating new trees if eval_metric score has not improved after n rounds. Range [1,∞] • eval_metric - Metric for evaluating validation data. – Classification metrics * * 'logloss' - Default classification metric. Negative log-likelihood. 'auc' - Area under the curve of Receiver Operating Characteristic. • objective - Learning objective to optimize the fit of the model to data during training. 
– Classification objectives * 'binary:logistic' - Default classification objective. Outputs probability. Regularization hyperparameters. • learning_rate=.3 - After each boosting round, multiply weights to make it more conservative. Lower is more conservative. Range (0,1] • gamma=0 - Minimum loss required to make a further partition. Larger is more conservative. Range [0,∞] • reg_alpha=0 - L1 regularization. Increasing will make it more conservative. • reg_lambda=1 - L2 regularization. Increasing will make it more conservative. Imbalanced data hyperparameters. • scale_pos_weight=1 - Consider (count negative) / (count positive) for imbalanced classes. Range (0, large number) • max_delta_step=0 - Maximum delta step for leaf output. It might help with imbalanced classes. Range [0,∞] 8.5 Training the Number of Trees in the Forest Let’s try out a few different values for n_estimators and inspect what happens to the accuracy of our random forest model. from yellowbrick.model_selection import validation_curve fig, ax = plt.subplots(figsize=(10,4)) viz = validation_curve(xgb.XGBClassifier(random_state=42), x=pd.concat([X_train, X_test], axis='index'), y=np.concatenate([y_train, y_test]), param_name='n_estimators', param_range=range(1, 100, 2), scoring='accuracy', cv=3, ax=ax) It looks like our model performs best around 29 estimators. A quick check shows that it performs slightly better than the out-of-the-box model. I will choose the simpler model when given a choice between models that perform similarly. In this case, a model with 99 trees appears to perform similarly but is more complex, so I go with the simpler model. 57 8. Random Forest Figure 8.4: Validation curve for random forest model that shows accuracy score as n_estimators changes. Prefer the simpler model when two models return similar results. >>> rf_xg29 = xgb.XGBRFClassifier(random_state=42, n_estimators=29) >>> rf_xg29.fit(X_train, y_train) >>> rf_xg29.score(X_test, y_test) 0.7480662983425415 8.6 Summary In this chapter, we introduced a random forest. At its essence, a random forest is a group of decision trees formed by allowing each tree to sample different rows of data. Each tree can look at different features as well. The hope is that the trees can glean different insights because they look at different parts of the data. Finally, the results of the trees are combined to predict a final result. 8.7 Exercises 1. How do we specify the hyperparameters for an XGBoost model to create a random forest? 2. Create a random forest for the Kaggle data predicting whether the title is a data scientist or software engineer. 3. What is the score? 4. You have two models with the same score. One has 200 trees; the other has 40. Which model do you choose and why? 58 Chapter 9 XGBoost At last, we can talk about the most common use of the XGBoost library, making extreme gradient-boosted models. This ensemble is created by training a weak model and then training another model to correct the residuals or mistakes of the model before it. I like to use a golfing metaphor. A decision tree is like getting one swing to hit the ball and put it in the hole. A random forest is getting a bunch of different attempts at teeing off (changing clubs, swing styles, etc.) and then placing the ball at the average of each of those swings. A boosted tree is hitting the ball once, then going to where it landed and hitting again (trying to correct the mistake of the previous hit), and then getting to keep hitting the ball. 
This metaphor clarifies why boosted models work so well; they can correct the model's errors as they progress.

9.1 Jargon

I used the term weak model to describe a single tree in boosting. A weak tree is a shallow tree. Boosting works well for performance reasons: deeper trees grow nodes at a fast rate, while many shallower trees tend to provide better performance and quicker results. However, shallow trees might not be able to capture interactions between features.

I also used the term residual. This is the prediction error, the true value minus the predicted value. For a regression model, where we predict a numeric value, this is the difference between the actual and predicted values. If the residual is positive, it means our prediction was too low (we underpredicted). A negative residual means that we overpredicted. A residual of zero means that the prediction was correct.

For classification tasks, where we are trying to predict a label, you can think of the residual in terms of probabilities. For a positive label, we would prefer a probability greater than .5. A perfect model would have a probability of 1. If our model predicted a probability of .4, the residual is .6, the perfect value (1) minus the predicted value (.4). The subsequent boosting round would try to predict .6 to correct that value.

9.2 Benefits of Boosting

The XGBoost library provides extras we don't get with a standard decision tree. It has support for handling missing values. It also has support for learning from categorical features. Additionally, it has various hyperparameters to tune model behavior and fend off overfitting.

9.3 A Big Downside

However, using XGBoost instead of a plain decision tree has a pretty big downside. A decision tree is easy to understand. I've been to a doctor's appointment where I was diagnosed using a decision tree (the doctor showed me the choices after). Understanding how a model works is called explainability or interpretability and is valuable in some business contexts.

For example, if you were applying for a loan and the bank rejected you, you might be disappointed. But if the bank told you that depositing $2,000 into your account would get you approved, you might resolve to save that amount and apply again in the future. Explaining the reasoning for rejecting the loan helps customers feel like they understand the process.

Now, imagine that a data scientist at the bank claims they have a new model that is better at predicting who will default on a loan, but this model cannot explain why. The same customer comes in and applies for the loan. They are rejected again but not given a reason. That could be a big turn-off and cause the customer to look for a new bank. (Perhaps not particularly problematic for folks with a high probability of defaulting on the loan.) However, if enough customers get turned off and start looking to move their business elsewhere, that could hurt the bottom line.

In short, many businesses are willing to accept models with worse performance if they are interpretable. Models that are easy to explain are called white box models. Conversely, hard-to-explain models are called black box models. To be pedantic, you can explain a boosted model. However, if there are 500 trees in it, walking a customer through each of those trees in order might be a frustrating experience.

9.4 Creating an XGBoost Model

Let's load the libraries and the xg_helpers library. The xg_helpers library has the Python functions I created to load the data and prepare it for modeling.
%matplotlib inline import dtreeviz from feature_engine import encoding, imputation import matplotlib.pyplot as plt import numpy as np import pandas as pd from sklearn import base, compose, datasets, ensemble, \ metrics, model_selection, pipeline, preprocessing, tree import scikitplot import xgboost as xgb import yellowbrick.model_selection as ms from yellowbrick import classifier import urllib import zipfile import xg_helpers as xhelp url = 'https://github.com/mattharrison/datasets/raw/master/data/'\ 'kaggle-survey-2018.zip' fname = 'kaggle-survey-2018.zip' member_name = 'multipleChoiceResponses.csv' 60 9.5. A Boosted Model raw = xhelp.extract_zip(url, fname, member_name) ## Create raw X and raw y kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6') ## Split data kag_X_train, kag_X_test, kag_y_train, kag_y_test = \ model_selection.train_test_split( kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) ## Transform X with pipeline X_train = xhelp.kag_pl.fit_transform(kag_X_train) X_test = xhelp.kag_pl.transform(kag_X_test) ## Transform y with label encoder label_encoder = preprocessing.LabelEncoder() label_encoder.fit(kag_y_train) y_train = label_encoder.transform(kag_y_train) y_test = label_encoder.transform(kag_y_test) # Combined Data for cross validation/etc X = pd.concat([X_train, X_test], axis='index') y = pd.Series([*y_train, *y_test], index=X.index) Now we have our training data and test data and are ready to create models. 9.5 A Boosted Model The XGBoost library makes it really easy to convert from a decision tree to a random forest or a boosted model. (The scikit-learn library also makes trying other models, like logistic regression, support vector machines, or k-nearest neighbors easy.) Let’s make our first boosted model. All we need to do is create an instance of the model (generally, this is the only thing we need to change to try a different model). Then we call the .fit method to train the model on the training data. We can evaluate the model using the .score method, which tells us the accuracy of the model. Out of the box, the accuracy of the boosted model (.745) is quite a bit better than both a decision tree (.73 for a depth of 7) and a random forest (.72 from scikit-learn). >>> xg_oob = xgb.XGBClassifier() >>> xg_oob.fit(X_train, y_train) >>> xg_oob.score(X_test, y_test) 0.7458563535911602 The default model performs 100 rounds of boosting. (Again, this is like getting 100 chances to keep hitting the golf ball and move it toward the hole.) Let’s see what happens to our accuracy if we only hit two times (limiting the tree estimators to two and the depth of each tree to two). 61 9. XGBoost >>> # Let's try w/ depth of 2 and 2 trees >>> xg2 = xgb.XGBClassifier(max_depth=2, n_estimators=2) >>> xg2.fit(X_train, y_train) >>> xg2.score(X_test, y_test) 0.6685082872928176 Our performance dropped quite a bit. Down to .668. Playing around with these hyperparameters can have a large impact on our model’s performance. Let’s look at what the first tree looks like (We will use tree_index=0 to indicate this). I will use the dtreeviz package to show the tree. It looks like it first considers whether the user uses the R programming language. Then it looks at whether the user studied CS. (Note that it doesn’t have to look at the same feature for a level, but in this case, the CS feature turns out to be most informative for both splits.) 
import dtreeviz viz = dtreeviz.model(xg2, X_train=X, y_train=y, target_name='Job', feature_names=list(X_train.columns), class_names=['DS', 'SE'], tree_index=0) viz.view(depth_range_to_display=[0,2]) Figure 9.1: Visualization of first booster from xgboost model. The dtreeviz package is good for understanding what the trees are doing and how the splits break up the groups. 9.6 Understanding the Output of the Trees One thing that the dtreeviz does not show with XGBoost models is the score for the nodes. Let’s use XGBoost to plot the same booster (using num_trees=0) and trace through an example to understand the leaf scores. I’m using my function, my_dot_export, because it uses my fonts. 62 9.6. Understanding the Output of the Trees xhelp.my_dot_export(xg2, num_trees=0, filename='img/xgb_md2.dot', title='First Tree') Note You can just use the plot_tree function if you want to view this in a Jupyter notebook xgb.plot_tree(xg2, num_trees=0) Figure 9.2: Visualization of first tree from xgboost model. Let’s trace through an example and see what our model predicts. Here is the prediction for this two-tree model. It predicts 49.8% data scientist and 50.1% software engineer. Because 50.1% is greater than 50%, our model predicts 1 or software engineer for this example. We can ask the model to predict probabilities with .predict_proba. >>> # Predicts 1 - Software engineer >>> se7894 = pd.DataFrame({'age': {7894: 22}, ... 'education': {7894: 16.0}, ... 'years_exp': {7894: 1.0}, ... 'compensation': {7894: 0}, ... 'python': {7894: 1}, ... 'r': {7894: 0}, ... 'sql': {7894: 0}, ... 'Q1_Male': {7894: 1}, ... 'Q1_Female': {7894: 0}, ... 'Q1_Prefer not to say': {7894: 0}, ... 'Q1_Prefer to self-describe': {7894: 0}, ... 'Q3_United States of America': {7894: 0}, ... 'Q3_India': {7894: 1}, ... 'Q3_China': {7894: 0}, ... 'major_cs': {7894: 0}, ... 'major_other': {7894: 0}, ... 'major_eng': {7894: 0}, ... 'major_stat': {7894: 0}}) >>> xg2.predict_proba(se7894) array([[0.4986236, 0.5013764]], dtype=float32) 63 9. XGBoost Or we can predict just the value with .predict: >>> # Predicts 1 - Software engineer >>> xg2.predict(pd.DataFrame(se7894)) array([1]) For our example, the user did not use R, so we took a left turn at the first node. They did not study CS, so we moved left at that node and ended with a leaf value of -0.084. Now let’s trace what happens in the second tree. First, let’s plot it. Remember that this tree tries to fix the predictions of the tree before it. (We specify num_trees=1 for the second tree because Python is zero-based.) xhelp.my_dot_export(xg2, num_trees=1, filename='img/xgb_md2_tree1.dot', title='Second Tree') Figure 9.3: Visualization of second booster from xgboost model. Again, our example does not use R, so we go left on the first node. The education level is 16, so we take another left at the next node and end at a value of 0.0902. Our prediction is made by adding these leaf values together and taking the inverse logit of the sum. This returns the probability of the positive class. def inv_logit(p): return np.exp(p) / (1 + np.exp(p)) >>> inv_logit(-0.08476+0.0902701) 0.5013775215147345 We calculate the inverse logit of the sum of the leaf values and come up with .501. Again this corresponds to 50.1% software engineer (with binary classifiers, this is the percent that the label is 1, or software engineer). If there were more trees, we would repeat this process and add up the values in the leaves. 
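If you would rather not read the leaf values off the plots, you can ask the underlying booster for the raw margin directly. This is a small check of my own (not from the book); it assumes the model kept the default base_score of 0.5, which contributes zero on the log-odds scale, so the margin is simply the sum of the leaf values from the two trees:

# Recover the summed leaf values (the raw margin) and confirm that the
# inverse logit of that sum matches .predict_proba for the positive class.
booster = xg2.get_booster()
margin = booster.predict(xgb.DMatrix(se7894), output_margin=True)
print(margin)             # roughly -0.08476 + 0.09027, about 0.0055
print(inv_logit(margin))  # roughly 0.501, the Software Engineer probability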
Note that due to the inverse logistic function, values above 0 push towards the positive label, and values below zero push to the negative label. 64 9.7. Summary 9.7 Summary XGBoost can improve the performance of traditional decision trees and random forest techniques. Decision trees are simpler models and tend to underfit in certain situations, while random forests are better at generalizing to unseen data points. XGBoost combines the benefits of both approaches by using decision trees as its base learners and building an ensemble of trees with the help of boosting. It works by training multiple decision trees with different parameters and combining the results into a powerful model. XGBoost also has additional features such as regularization, early stopping, and feature importance making it more effective than regular decision tree and random forest algorithms. 9.8 Exercises 1. What are the benefits of an XGBoost model versus a Decision Tree model? 2. Create an XGBoost model with a dataset of your choosing for classification. What is the accuracy score? 3. Compare your model with a Decision Tree model. 4. Visualize the first tree of your model. 65 Chapter 10 Early Stopping Early stopping in XGBoost is a technique that reduces overfitting in XGBoost models. You can tell XGBoost to use early stopping when you call the .fit method. Early stopping works by monitoring the performance of a model on a validation set and automatically stopping training when the model’s performance on the validation set no longer improves. This helps to improve the model’s generalization, as the model is not overfitted to the training data. Early stopping is helpful because it can prevent overfitting and help to improve XGBoost models’ generalization. 10.1 Early Stopping Rounds Here are the results of XGBoost using out-of-the-box behavior. By default, the model will create 100 trees. This model labels about 74% of the test examples correctly. >>> # Defaults >>> xg = xgb.XGBClassifier() >>> xg.fit(X_train, y_train) >>> xg.score(X_test, y_test) 0.7458563535911602 Now, we are going to provide the early_stopping_rounds parameter. Note that we also specify an eval_set parameter. The eval_set parameter is used in the .fit method to provide a list of (X, y) tuples for evaluation to use as a validation dataset. This dataset evaluates the model’s performance and adjusts the hyperparameters during the model training process. We pass in two tuples, the training data and the testing data. When you combine the evaluation data with the early stopping parameter, XGBoost will look at the results of the last tuple. We passed in 20 for early_stopping_rounds, so XGBoost will build up to 100 trees, but if the log loss hasn’t improved after the following 20 trees, it will stop building trees. You can see the line where the log loss bottoms out. Note that the score on the left, 0.43736, is the training score and the score on the right, 0.5003, is the testing score and the value used for early stopping. [12] validation_0-logloss:0.43736 validation_1-logloss:0.5003 (Note that this is the thirteenth tree because we started counting at 0.) >>> xg = xgb.XGBClassifier(early_stopping_rounds=20) >>> xg.fit(X_train, y_train, ... eval_set=[(X_train, y_train), ... (X_test, y_test) 67 10. Early Stopping ... ] ... 
) >>> xg.score(X_test, y_test) [0] validation_0-logloss:0.61534 validation_1-logloss:0.61775 [1] validation_0-logloss:0.57046 validation_1-logloss:0.57623 [2] validation_0-logloss:0.54011 validation_1-logloss:0.55333 [3] validation_0-logloss:0.51965 validation_1-logloss:0.53711 [4] validation_0-logloss:0.50419 validation_1-logloss:0.52511 [5] validation_0-logloss:0.49176 validation_1-logloss:0.51741 [6] validation_0-logloss:0.48159 validation_1-logloss:0.51277 [7] validation_0-logloss:0.47221 validation_1-logloss:0.51040 [8] validation_0-logloss:0.46221 validation_1-logloss:0.50713 [9] validation_0-logloss:0.45700 validation_1-logloss:0.50583 [10] validation_0-logloss:0.45062 validation_1-logloss:0.50430 [11] validation_0-logloss:0.44533 validation_1-logloss:0.50338 [12] validation_0-logloss:0.43736 validation_1-logloss:0.50033 [13] validation_0-logloss:0.43399 validation_1-logloss:0.50034 [14] validation_0-logloss:0.43004 validation_1-logloss:0.50192 [15] validation_0-logloss:0.42550 validation_1-logloss:0.50268 [16] validation_0-logloss:0.42169 validation_1-logloss:0.50196 [17] validation_0-logloss:0.41854 validation_1-logloss:0.50223 [18] validation_0-logloss:0.41485 validation_1-logloss:0.50360 [19] validation_0-logloss:0.41228 validation_1-logloss:0.50527 [20] validation_0-logloss:0.40872 validation_1-logloss:0.50839 [21] validation_0-logloss:0.40490 validation_1-logloss:0.50623 [22] validation_0-logloss:0.40280 validation_1-logloss:0.50806 [23] validation_0-logloss:0.39942 validation_1-logloss:0.51007 [24] validation_0-logloss:0.39807 validation_1-logloss:0.50987 [25] validation_0-logloss:0.39473 validation_1-logloss:0.51189 [26] validation_0-logloss:0.39389 validation_1-logloss:0.51170 [27] validation_0-logloss:0.39040 validation_1-logloss:0.51218 [28] validation_0-logloss:0.38837 validation_1-logloss:0.51135 [29] validation_0-logloss:0.38569 validation_1-logloss:0.51202 [30] validation_0-logloss:0.37945 validation_1-logloss:0.51352 [31] validation_0-logloss:0.37840 validation_1-logloss:0.51545 0.7558011049723757 We can ask the model what the limit was by inspecting the .best_ntree_limit attribute: >>> xg.best_ntree_limit 13 Note If this were implemented in scikit-learn, the attribute .best_ntree_limit would have a trailing underscore because it was learned by fitting the model. Alas, we live in a world of inconsistencies. 10.2 Plotting Tree Performance Let’s explore the score that occurred during fitting the model. The .eval_results method will return a data structure containing the results from the eval_set. 68 10.2. 
Plotting Tree Performance >>> # validation_0 is for training data >>> # validation_1 is for testing data >>> results = xg.evals_result() >>> results {'validation_0': OrderedDict([('logloss', [0.6153406503923696, 0.5704566627034644, 0.5401074953836288, 0.519646179894983, 0.5041859194071372, 0.49175883369140716, 0.4815858465553177, 0.4722135672319274, 0.46221246084118905, 0.4570046103131291, 0.45062119092139025, 0.44533101600634545, 0.4373589513231934, 0.4339914069003403, 0.4300442738158372, 0.42550266018419824, 0.42168949383456633, 0.41853931894949614, 0.41485192559138645, 0.4122836278413833, 0.4087179538231096, 0.404898268053467, 0.4027963532207719, 0.39941699938733854, 0.3980718078477953, 0.39473153180519993, 0.39388538948800944, 0.39039599470886893, 0.38837148147752126, 0.38569152626668, 0.3794510693344513, 0.37840359436957194, 0.37538466192241227])]), 'validation_1': OrderedDict([('logloss', [0.6177459120091813, 0.5762297115602546, 0.5533292921537852, 0.5371078260695736, 0.5251118483299708, 0.5174100387491574, 0.5127666981510036, 0.5103968678752362, 0.5071349115538004, 0.5058257413585542, 0.5043005662687247, 0.5033770955193438, 0.5003349146419797, 69 10. Early Stopping 0.5003436393562437, 0.5019165392779843, 0.502677517614806, 0.501961292550791, 0.5022262006329157, 0.5035970173261607, 0.5052709663297096, 0.508388655664636, 0.5062287504923689, 0.5080608455824424, 0.5100736726054829, 0.5098673969229365, 0.5118910041889845, 0.5117007332982608, 0.5121825202836434, 0.5113475993625531, 0.5120185821281118, 0.5135189292720874, 0.5154504034915188, 0.5158137131755071])])} Let’s plot that data and visualize what happens as we add more trees: # Testing score is best at 13 trees results = xg.evals_result() fig, ax = plt.subplots(figsize=(8, 4)) ax = (pd.DataFrame({'training': results['validation_0']['logloss'], 'testing': results['validation_1']['logloss']}) .assign(ntrees=lambda adf: range(1, len(adf)+1)) .set_index('ntrees') .plot(figsize=(5,4), ax=ax, title='eval_results with early_stopping') ) ax.annotate('Best number \nof trees (13)', xy=(13, .498), xytext=(20,.42), arrowprops={'color':'k'}) ax.set_xlabel('ntrees') Now, let’s train a model with 13 trees (also referred to as estimators) and see how the model performs: # Using value from early stopping gives same result >>> xg13 = xgb.XGBClassifier(n_estimators=13) >>> xg13.fit(X_train, y_train, ... eval_set=[(X_train, y_train), ... (X_test, y_test)] ... ) >>> xg13.score(X_test, y_test) It looks like this model gives the same results that the early_stopping model did. >>> xg.score(X_test, y_test) 0.7558011049723757 70 10.3. Different eval_metrics Figure 10.1: Image showing that the testing score might stop improving Finally, let’s look at the model that creates exactly 100 trees. The score is worse than only using 13! >>> # No early stopping, uses all estimators >>> xg_no_es = xgb.XGBClassifier() >>> xg_no_es.fit(X_train, y_train) >>> xg_no_es.score(X_test, y_test) 0.7458563535911602 What is the takeaway from this chapter? The early_stopping_rounds parameter helps to prevent overfitting. 10.3 Different eval_metrics The eval_metric options for XGBoost are metrics used to evaluate the results of an XGBoost model. They provide a way to measure the performance of a model on a dataset. Some of the most common eval_metric options for classification are: 1. Log loss ('logloss') - The default measure. Log loss measures a model’s negative loglikelihood performance that predicts a given class’s probability. 
It penalizes incorrect predictions more heavily than correct ones. 2. Area under the curve ('auc') - AUC is used to measure the performance of a binary classification model. It is the Area Under a receiver operating characteristic Curve. 71 10. Early Stopping 3. Accuracy ('error') - Accuracy is the percent of correct predictions. XGBoost uses the error, which is one minus accuracy because it wants to minimize the value. 4. You can also create a custom function for evaluation. In the following example, I set the eval_metric to 'error'. >>> xg_err = xgb.XGBClassifier(early_stopping_rounds=20, ... eval_metric='error') >>> xg_err.fit(X_train, y_train, ... eval_set=[(X_train, y_train), ... (X_test, y_test) ... ] ... ) >>> xg_err.score(X_test, y_test) [0] validation_0-error:0.24739 validation_1-error:0.27072 [1] validation_0-error:0.24218 validation_1-error:0.26188 [2] validation_0-error:0.23839 validation_1-error:0.24751 [3] validation_0-error:0.23697 validation_1-error:0.25193 [4] validation_0-error:0.23081 validation_1-error:0.24530 [5] validation_0-error:0.22607 validation_1-error:0.24420 [6] validation_0-error:0.22180 validation_1-error:0.24862 [7] validation_0-error:0.21801 validation_1-error:0.24862 [8] validation_0-error:0.21280 validation_1-error:0.25304 [9] validation_0-error:0.21043 validation_1-error:0.25304 [10] validation_0-error:0.20806 validation_1-error:0.24641 [11] validation_0-error:0.20284 validation_1-error:0.25193 [12] validation_0-error:0.20047 validation_1-error:0.24420 [13] validation_0-error:0.19668 validation_1-error:0.24420 [14] validation_0-error:0.19384 validation_1-error:0.24530 [15] validation_0-error:0.18815 validation_1-error:0.24199 [16] validation_0-error:0.18531 validation_1-error:0.24199 [17] validation_0-error:0.18389 validation_1-error:0.23867 [18] validation_0-error:0.18531 validation_1-error:0.23757 [19] validation_0-error:0.18815 validation_1-error:0.23867 [20] validation_0-error:0.18246 validation_1-error:0.24199 [21] validation_0-error:0.17915 validation_1-error:0.24862 [22] validation_0-error:0.17867 validation_1-error:0.24751 [23] validation_0-error:0.17630 validation_1-error:0.24199 [24] validation_0-error:0.17488 validation_1-error:0.24309 [25] validation_0-error:0.17251 validation_1-error:0.24530 [26] validation_0-error:0.17204 validation_1-error:0.24309 [27] validation_0-error:0.16825 validation_1-error:0.24199 [28] validation_0-error:0.16730 validation_1-error:0.24088 [29] validation_0-error:0.16019 validation_1-error:0.24199 [30] validation_0-error:0.15782 validation_1-error:0.24972 [31] validation_0-error:0.15972 validation_1-error:0.24862 [32] validation_0-error:0.15924 validation_1-error:0.24641 [33] validation_0-error:0.15403 validation_1-error:0.25635 [34] validation_0-error:0.15261 validation_1-error:0.25525 [35] validation_0-error:0.15213 validation_1-error:0.25525 [36] validation_0-error:0.15166 validation_1-error:0.25525 72 10.4. Summary [37] validation_0-error:0.14550 validation_1-error:0.25525 [38] validation_0-error:0.14597 validation_1-error:0.25083 0.7624309392265194 Now, the best number of trees when we minimize error is 19. >>> xg_err.best_ntree_limit 19 10.4 Summary You can use the early_stopping_rounds parameter in XGBoost as a feature to specify a maximum number of trees after the lowest evaluation metric is encountered. This is a useful feature to avoid overfitting, as it allows you to specify a validation set and track the model’s performance. 
If the performance hasn’t improved after a certain number of rounds (specified by early_stopping_rounds), then training is stopped, and the best model is kept. This helps to avoid wasting time and resources on training a model that is unlikely to improve. Another benefit of the early stopping rounds parameter is that it allows you to select the best model with the least training time. In other words, it’s like getting to the finish line first without running the entire race! 10.5 Exercises 1. What is the purpose of early stopping in XGBoost? 2. How is the early_stopping parameter used in XGBoost? 3. How does the eval_set parameter work with the early stopping round parameter in XGBoost? 73 Chapter 11 XGBoost Hyperparameters XGBoost provides relatively good performance out of the box. However, it is also a complicated model with many knobs and dials. You can adjust these dials to improve the results. This chapter will introduce many of these knobs and dials, called hyperparameters. 11.1 Hyperparameters What does the term hyperparameter even mean? In programming, a parameter lets us control how a function behaves. Hyperparameters allow us to control how a machine-learning model behaves. The scikit-learn library has some basic conventions for hyperparameters that allow us to think of them as levers that push a model towards overfitting or underfitting. One convention is that hyperparameters that start with max_ will tend to complicate the model (and lead to overfitting) when you raise the value. Conversely, they will simplify the model if you lower them (and lead to underfitting). Often there is a sweet spot in between. We saw n example of this earlier when we looked at decision trees. Raising the max_depth hyperparameter will add child nodes, complicating the model. Lowering it will simplify the model (degrading it into a decision stump). There is a similar convention for parameters starting with min_. Those tend to simplify when you raise them and complicate when you lower them. I will split up the parameters by what they impact. There are parameters for tree creation, sampling, categorical data, ensembling, regularization, and imbalanced data. Here are the tree hyperparameters: • max_depth=6 - Tree depth. How many feature interactions you can have. Larger is more complex (more likely to overfit). Each level doubles the time. Range [0, ∞]. • max_leaves=0 - Number of leaves in a tree. A larger number is more complex. (Not supported by tree_method='exact'). Range [0, ∞]. • min_child_weight=1 - Minimum sum of hessian (weight) needed in a child. The larger the value, the more conservative. Range [0,∞] • 'grow_policy="depthwise"' - (Only supported with tree_method set to 'hist', 'approx', or 'gpu_hist'. Split nodes closest to root. Set to "lossguide" (with max_depth=0) to mimic LightGBM behavior. • tree_method="auto" - Use "hist" to use histogram bucketing and increase performance. "auto" heuristic: – 'exact' for small data – 'approx' for larger data – 'hist' for histograms (limit bins with max_bins (default 256)) 75 11. XGBoost Hyperparameters These are the sampling hyperparameters. These work cumulatively. First, tree, then level, then node. If colsample_bytree, colsample_bylevel, and colsample_bynode are all .5, then a tree will only look at (.5 * .5 * .5) 12.5% of the original columns. • colsample_bytree=1 - Subsample columns at each tree. Range (0,1]. Recommended to search [.5, 1]. • colsample_bylevel=1 - Subsample columns at each level (from tree columns colsample_bytree). Range (0,1]. 
Recommended to search [.5, 1]. • colsample_bynode=1 - Subsample columns at each node split (from tree columns colsample_bytree). Range (0,1]. Recommended to search [.5, 1]. • subsample=1 - Sample portion of training data (rows) before each boosting round. Range (0,1]. Lower to make more conservative. (When not equal to 1.0, the model performs stochastic gradient descent, i.e., there is some randomness in the model.) Recommended to search [.5, 1]. • sampling_method='uniform' - Sampling method. 'uniform' for equal probability. gradient_based sample is proportional to the regularized absolute value of gradients (only supported by tree_method="gpu_hist". Hyperparameters for managing categorical data(use enable_categorical=True parameter and set columns to Pandas categoricals .astype('category')). • max_cat_to_onehot=4 - Upper threshold for using one-hot encoding for categorical features. One-hot encoding is used when the number of categories is less than this number. • max_cat_threshold=64 - Maximum number of categories to consider for each split. Ensembling hyperparameters control how the subsequent trees are created: • n_estimators=100 - Number of trees. Larger is more complex. Default 100. Use early_stopping_rounds to prevent overfitting. You don’t really need to tune if you use early_stopping_rounds. • early_stopping_rounds=None - Stop creating new trees if eval_metric score has not improved after n rounds. Range [1,∞] • eval_metric - Metric for evaluating validation data for evaluating early stopping. – Classification metrics * 'logloss' - Default classification metric. Negative log-likelihood. * 'auc' - Area under the curve of Receiver Operating Characteristic. • objective - Learning objective to optimize the fit of the model to data during training. – Classification objectives * 'binary:logistic' - Default classification objective. Outputs probability. * 'multi:softmax' - Multi-class classification objective. Regularization hyperparameters control the complexity of the overall model: • learning_rate=.3 - After each boosting round, multiply weights to make it more conservative. Lower is more conservative. Range (0,1]. Typically when you lower this, you want to raise n_estimators. Range (0, 1] • gamma=0 / min_split_loss - L0 regularization. Prunes tree to remove splits that don’t meet the given value. Global regularization. Minimum loss required for making a split. Larger is more conservative. Range [0, ∞), default 0 - no regularization. Recommended search is (0, 1, 10, 100, 1000…) 76 11.2. Examining Hyperparameters • reg_alpha=0 - L1/ridge regularization. (Mean of weights). Increase to be more conservative. Range [0, ∞) • reg_lambda=0 - L2 regularization. (Root of squared weights). Increase to be more conservative. Range [0, ∞) Imbalanced data hyperparameters: • scale_pos_weight=1 - Consider (count negative) / (count positive) for imbalanced classes. Range (0, large number) • max_delta_step=0 - Maximum delta step for leaf output. It might help with imbalanced classes. Range [0, ∞) • Use 'auc' or 'aucpr' for eval_metric metric (rather than classification default 'logless') 11.2 Examining Hyperparameters You can set hyperparameters by passing them into the constructor when creating an XGBoost model. 
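For example, here is a minimal sketch of overriding a handful of the hyperparameters above at construction time (the specific values are arbitrary and only for illustration):

# Anything not passed to the constructor keeps its default value.
xg_custom = xgb.XGBClassifier(max_depth=3, learning_rate=0.1,
                              subsample=0.8, colsample_bytree=0.8,
                              random_state=42)
xg_custom.fit(X_train, y_train)

# Because the wrapper follows scikit-learn conventions, you can also
# update hyperparameters on an existing estimator with .set_params.
xg_custom.set_params(max_depth=4)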
If you want to inspect hyperparameters, you can use the .get_params method: >>> xg = xgb.XGBClassifier() # set the hyperparamters in here >>> xg.fit(X_train, y_train) >>> xg.get_params() {'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': 0.5, 'booster': 'gbtree', 'callbacks': None, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': 0, 'gpu_id': -1, 'grow_policy': 'depthwise', 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.300000012, 'max_bin': 256, 'max_cat_threshold': 64, 'max_cat_to_onehot': 4, 'max_delta_step': 0, 'max_depth': 6, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 0, 'num_parallel_tree': 1, 'predictor': 'auto', 77 11. XGBoost Hyperparameters 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'sampling_method': 'uniform', 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None} This shows the default valus for each hyperparameter. 11.3 Tuning Hyperparameters Hyperparameters can be a bit tricky to tune. Let’s explore how to tune a single value first. To get started, you’ll want to use validation curves. These plots show the relationships between your model’s performance and the values of specific hyperparameters. They show the values of hyperparameters that yield the best results. I like to plot a validation curve by using the validation_curve function from the Yellowbrick library. Let’s examine how the gamma parameter impacts our model. You can think of gamma as a pruner for the trees. It limits growth unless a specified loss reduction is met. The larger the value, the more simple (conservative or tending to underfit) our model. fig, ax = plt.subplots(figsize=(8, 4)) ms.validation_curve(xgb.XGBClassifier(), X_train, y_train, param_name='gamma', param_range=[0, .5, 1,5,10, 20, 30], n_jobs=-1, ax=ax) It looks like gamma is best between 10 and 20. We could run another validation curve to dive into that area. However, we will soon see another mechanism for finding a good value without requiring us to specify every single option we want to consider. 11.4 Intuitive Understanding of Learning Rate Returning to our golf metaphor, the learning rage hyperparameter is like how hard you swing. It’s all about finding the right balance between caution and aggression. Just like in golf, if you swing too hard, you can end up in the rough. But if you swing too softly, you’ll never get the ball in the hole. With XGBoost, if you set the learning_rate hyperparameter too high, your model will be over-confident and over-fit the data. On the other hand, if you set learning_rate too low, your model will take too long to learn and never reach its full potential. We want to find the right balance that maximizes your model’s accuracy. You’ll have to experiment to find out where that sweet spot is. In golf – you experiment with different techniques, adjust your swing and your grip, and it’ll take some practice, but eventually, you might find the perfect combination that will allow you to hit a hole-in-one. Let’s go back to our model with two layers of depth and examine the values in the leaf node when the learning_rate is set to 1. # check impact of learning weight on scores xg_lr1 = xgb.XGBClassifier(learning_rate=1, max_depth=2) xg_lr1.fit(X_train, y_train) 78 11.4. 
Intuitive Understanding of Learning Rate

Figure 11.1: Validation curve for the gamma hyperparameter. Around 10-20 the score is maximized for the cross validation line. The cross validation line gives us a feel for how the model will perform on unseen data, while the training score helps us to understand if the model is overfitting.

my_dot_export(xg_lr1, num_trees=0, filename='img/xg_depth2_tree0.dot',
    title='Learning Rate set to 1')

Figure 11.2: When the learning rate is set to one, we preserve the log loss score.

Now we are going to decrease the learning rate. Remember, this is like being really timid with our golf swing.

# check impact of learning rate on scores
xg_lr001 = xgb.XGBClassifier(learning_rate=.001, max_depth=2)
xg_lr001.fit(X_train, y_train)

my_dot_export(xg_lr001, num_trees=0, filename='img/xg_depth2_tree0_lr001.dot',
    title='Learning Rate set to .001')

Figure 11.3: Lowering the learning rate essentially multiplies the leaf values by this amount in the first tree. Each subsequent tree is now impacted and will shift.

You can see that the learning rate took the values in the leaves and decreased them. Positive values will still push toward the positive label but at a slower rate.

We could also set the learning rate to a value above 1, but it does not make sense in practice. It is like teeing off with your largest club when playing miniature golf. You can try it, but you will see that the model does not perform well.

My suggestion for tuning the learning rate (as you will see when I show step-wise tuning) is to tune it last. Combine this with early stopping and a large number of trees. If early stopping doesn't kick in, raise the number of trees (it means your model is still improving).

11.5 Grid Search

Calibrating a single hyperparameter takes some time, and there are multiple hyperparameters, some of which interact with each other. How do we know which combination is best? We can try various combinations and explore which values perform the best. The GridSearchCV class from scikit-learn will do this for us. We need to provide it with a mapping of hyperparameters to potential values for each hyperparameter. It will loop through all combinations and keep track of the best ones. If you have imbalanced data or need to optimize for a different metric, you can also set the scoring parameter.

Be careful with grid search. Running grid search on XGBoost can be computationally expensive, as it can take a long time for XGBoost to run through every combination of hyperparameters. It can also be difficult to determine how many hyperparameters to test and which values to use as candidates. Later, I will show another method I prefer over grid search for tuning the model.

The example below runs quickly on my machine, but it isn't testing that many combinations. You could imagine (ignoring random_state) that if each hyperparameter had ten options, there would be 1,000,000 tests to run.

from sklearn import model_selection
params = {'reg_lambda': [0],  # No effect
          'learning_rate': [.1, .3],  # makes each boost more conservative
          'subsample': [.7, 1],
          'max_depth': [2, 3],
          'random_state': [42],
          'n_jobs': [-1],
          'n_estimators': [200]}
xgb2 = xgb.XGBClassifier(early_stopping_rounds=5)
cv = (model_selection.GridSearchCV(xgb2, params, cv=3, n_jobs=-1)
      .fit(X_train, y_train,
           eval_set=[(X_test, y_test)],
           verbose=50
      )
)

After running the grid search, we can inspect the .best_params_ attribute.
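The fitted search object also exposes a few other attributes worth a quick look. Here is a short sketch (using the cv object fit above):

# Mean cross-validated accuracy for the winning combination.
print(cv.best_score_)

# A copy of the estimator refit with the winning hyperparameters
# (available because refit defaults to True).
best_model = cv.best_estimator_

# One row per combination tried, with per-fold scores, mean test
# score, rank, and fit times.
results = pd.DataFrame(cv.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])

The attribute we will use next, though, is .best_params_.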
This provides a dictionary of the hyperparameters with the best result. Here are the best hyperparameters from our minimal grid search: >>> cv.best_params_ {'learning_rate': 0.3, 'max_depth': 2, 'n_estimators': 200, 'n_jobs': -1, 'random_state': 42, 'reg_lambda': 0, 'subsample': 1} Let’s stick those values in a dictionary, and then we can use dictionary unpacking to provide those values as arguments when we create the model. The verbose=10 parameter tells XGBoost only to provide training metrics after every ten attempts. We will use .fit to train the model. params = {'learning_rate': 0.3, 'max_depth': 2, 'n_estimators': 200, 'n_jobs': -1, 'random_state': 42, 'reg_lambda': 0, 'subsample': 1 } xgb_grid = xgb.XGBClassifier(**params, early_stopping_rounds=50) xgb_grid.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=10 ) Now let’s train a default model to compare if there is an improvement from using grid search. # vs default xgb_def = xgb.XGBClassifier(early_stopping_rounds=50) xgb_def.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=10 ) Here’s the accuracy for both the default and the grid models: 81 11. XGBoost Hyperparameters >>> xgb_def.score(X_test, y_test), xgb_grid.score(X_test, y_test) (0.7558011049723757, 0.7524861878453039) Oddly enough, it looks like the default model is slightly better. This is a case where you need to be careful. The grid search did a 3-fold cross-validation to find the hyperparameters, but the final score was calculated against a single holdout set. K-fold cross-validation (in the grid search case, k was 3) splits your data set into k folds or groups. Then, it reserves a fold to test the model and trains it on the other data. It repeats this for each fold. Finally, it averages out the results to create a more accurate prediction. It’s a great way to understand better how your model will perform in the real world. When you do k-fold validation, you may get different results than just calling .fit on a model and .score. To run k-fold outside of grid search, you can use the cross_val_score function from scikit-learn. Below, we run cross-validation on the default model: >>> results_default = model_selection.cross_val_score( ... xgb.XGBClassifier(), ... X=X, y=y, ... cv=4 ... ) Here are the scores from each of the four folds: >>> results_default array([0.71352785, 0.72413793, 0.69496021, 0.74501992]) Some folds performed quite a bit better than others. Our average score (accuracy) is 72%. >>> results_default.mean() 0.7194114787534214 Let’s call cross_val_score again with a model created from our grid search hyperparameters. >>> results_grid = model_selection.cross_val_score( ... xgb.XGBClassifier(**params), ... X=X, y=y, ... cv=4 ... ) Here are the fold scores. Note that these values do not deviate as much as the default model. >>> results_grid array([0.74137931, 0.74137931, 0.74801061, 0.73572377]) Here is the mean score: >>> results_grid.mean() 0.7416232505873941 Our grid search model did perform better (at least for these folds of the data). And the scores across the folds were more consistent. It is probably the case that the test set we held out for X_test is easier for the default model to predict. 82 11.6. Summary 11.6 Summary Tuning hyperparameters can feel like a stab in the dark. Validation curves can start us down the path but only work for a single value. You believe there is a sweet spot somewhere, but you have to dig deep and hunt for the other sweet spots of the other hyperparameters. 
We can combine this with grid search, but it can be tedious. We also showed how using k-fold validation can help us understand how consistent the model is and whether some parts of the data might be easier to model. You will be more confident in a consistent model. 11.7 1. 2. 3. 4. 5. Exercises What is the purpose of hyperparameters in XGBoost? How can validation curves be used to tune hyperparameters in XGBoost? What is the impact of the gamma hyperparameter on XGBoost models? How can the learning rate hyperparameter affect the performance of an XGBoost model? How can grid search be used to tune hyperparameters in XGBoost? 83 Chapter 12 Hyperopt Remember I said that grid search could be slow? It also only searches over the values that I provide. What if another value is better, but I didn’t tell grid search about that possible value? Grid search is naive and would have no way of suggesting that value. Let’s look at another library that might help with this but also search in a smarter way than the brute force mechanism of grid search—introducing hyperopt. Hyperopt is a Python library for optimizing both discrete and continuous hyperparameters for XGBoost. Hyperopt uses Bayesian optimization to tune hyperparameters. This uses a probabilistic model to select the next set of hyperparameters to try. If one value performs better, it will try values around it and see if they boost the performance. If the values worsen the model, then Hyperopt can ignore those values as future candidates. Hyperopt can also tune various other machine-learning models, including random forests and neural networks. 12.1 Bayesian Optimization Bayesian optimization algorithms offer several benefits for hyperparameter optimization, including: • Efficiency: Bayesian optimization algorithms use probabilistic models to update their predictions based on previous trials, which allows them to quickly identify promising areas of the search space and avoid wasteful exploration. • Accuracy: Bayesian optimization algorithms can incorporate prior knowledge and adapt to the underlying structure of the optimization problem, which can result in more accurate predictions. • Flexibility: Bayesian optimization algorithms are flexible and can be applied to a wide range of optimization problems, including those with complex and multimodal search spaces. Using Bayesian optimization algorithms can significantly improve hyperparameter optimization’s efficiency, accuracy, and robustness, leading to better models. 12.2 Exhaustive Tuning with Hyperopt Let’s look at an exhaustive exploration with Hyperopt. This code might be confusing if you aren’t familiar with the notion of first-class functions in Python. Python allows us to pass a function in as a parameter to another function or return a function as a result. To use the Hyperopt library, we need to use the fmin function which tries to find the best hyperparameter values given a space of many parameters. 85 12. Hyperopt The fmin function expects us to pass in another function that accepts a dictionary of hyperparameters to evaluate and returns a dictionary. Our hyperparameter_tuning function doesn’t quite fit the bill. It accepts a dictionary, space, as the first parameter but it has additional parameters. I will use a closure in the form of a lambda function to adapt the hyperparameter_tuning function to a new function that accepts the hyperparameter dictionary. 
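If that adapter pattern is new to you, here is the idea in miniature, stripped of XGBoost entirely. The toy_objective function below is hypothetical and purely for illustration; the point is that fmin only ever sees a one-argument function, and the lambda closes over any extra data we need:

from hyperopt import fmin, tpe, hp, STATUS_OK

def toy_objective(space, offset):
    # Two arguments, so hyperopt cannot call this directly.
    return {'loss': (space['x'] - offset) ** 2, 'status': STATUS_OK}

offset = 3
best = fmin(fn=lambda space: toy_objective(space, offset),  # adapt to one argument
            space={'x': hp.uniform('x', -10, 10)},
            algo=tpe.suggest,
            max_evals=50)
print(best)   # should land somewhere near {'x': 3.0}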
The hyperparameter_tuning function takes in a dictionary of hyperparameters (space), training data (X_train and y_train), test data (X_test and y_test), and an optional value for early stopping rounds (early_stopping_rounds). Some hyperparameters need to be integers, so it converts those keys and adds early_stopping_rounds to the hyperparameters to evaluate. Then it trains a model and returns a dictionary with a negative accuracy score and a status. Because fmin tries to minimize the score and our model is evaluating the accuracy, we do not want the model with the minimum accuracy. However, we can do a little trick and minimize the negative accuracy. from hyperopt import fmin, tpe, hp, STATUS_OK, Trials from sklearn.metrics import accuracy_score, roc_auc_score from typing import Any, Dict, Union def hyperparameter_tuning(space: Dict[str, Union[float, int]], X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series, early_stopping_rounds: int=50, metric:callable=accuracy_score) -> Dict[str, Any]: """ Perform hyperparameter tuning for an XGBoost classifier. This function takes a dictionary of hyperparameters, training and test data, and an optional value for early stopping rounds, and returns a dictionary with the loss and model resulting from the tuning process. The model is trained using the training data and evaluated on the test data. The loss is computed as the negative of the accuracy score. Parameters ---------space : Dict[str, Union[float, int]] A dictionary of hyperparameters for the XGBoost classifier. X_train : pd.DataFrame The training data. y_train : pd.Series The training target. X_test : pd.DataFrame The test data. y_test : pd.Series The test target. early_stopping_rounds : int, optional The number of early stopping rounds to use. The default value is 50. metric : callable Metric to maximize. Default is accuracy 86 12.2. Exhaustive Tuning with Hyperopt Returns ------Dict[str, Any] A dictionary with the loss and model resulting from the tuning process. The loss is a float, and the model is an XGBoost classifier. """ int_vals = ['max_depth', 'reg_alpha'] space = {k: (int(val) if k in int_vals else val) for k,val in space.items()} space['early_stopping_rounds'] = early_stopping_rounds model = xgb.XGBClassifier(**space) evaluation = [(X_train, y_train), (X_test, y_test)] model.fit(X_train, y_train, eval_set=evaluation, verbose=False) pred = model.predict(X_test) score = metric(y_test, pred) return {'loss': -score, 'status': STATUS_OK, 'model': model} After we have defined the function we want to minimize, we need to define the space of the search for our hyperparameters. Based on a talk by Bradley Boehmke1 , I’ve used these values and found them to work quite well. We will cover the functions that describe the search space in a moment. Next, we create a trials object to store the results of the hyperparameter tuning process. The fmin function loops over the options to search for the best hyperparameters. I told it to use the tpe.suggest algorithm, which runs the Tree-structured Parzen Estimator - Expected Improvement Bayesian Optimization. It uses the expected improvement acquisition function to guide the search. This algorithm works well with noisy data. We also told the search to limit the evaluations to 2,000 attempts. My Macbook Pro (2022) takes about 30 minutes to run this search. On my Thinkpad P1 (2020), it takes over 2 hours. You can set a timeout period as well. As this runs, it spits out the best loss scores. 
options = {'max_depth': hp.quniform('max_depth', 1, 8, 1), # tree 'min_child_weight': hp.loguniform('min_child_weight', -2, 3), 'subsample': hp.uniform('subsample', 0.5, 1), # stochastic 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1), 'reg_alpha': hp.uniform('reg_alpha', 0, 10), 'reg_lambda': hp.uniform('reg_lambda', 1, 10), 'gamma': hp.loguniform('gamma', -10, 10), # regularization 'learning_rate': hp.loguniform('learning_rate', -7, 0), # boosting 'random_state': 42 } trials = Trials() best = fmin(fn=lambda space: hyperparameter_tuning(space, X_train, y_train, 1 https://bradleyboehmke.github.io/xgboost_databricks_tuning/index.html#slide21 87 12. Hyperopt ) space=options, algo=tpe.suggest, max_evals=2_000, trials=trials, #timeout=60*5 # 5 minutes X_test, y_test), The best variable holds the results. If I’m using a notebook, I will inspect the values and make sure I keep track of the hyperparameters. I like to print them out and then stick them into a dictionary, so I have them for posterity. # 2 hours of training (paste best in here) long_params = {'colsample_bytree': 0.6874845219014455, 'gamma': 0.06936323554883501, 'learning_rate': 0.21439214284976907, 'max_depth': 6, 'min_child_weight': 0.6678357091609912, 'reg_alpha': 3.2979862933185546, 'reg_lambda': 7.850943400390477, 'subsample': 0.999767483950891} Now we can train a model with these hyperparameters. xg_ex = xgb.XGBClassifier(**long_params, early_stopping_rounds=50, n_estimators=500) xg_ex.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test) ], verbose=100 ) How does this model do? >>> xg_ex.score(X_test, y_test) 0.7580110497237569 This is an improvement over the default out-of-the-box model. Generally, you will get better improvement in your model by understanding your data and doing appropriate feature engineering than the improvement you will see with hyperparameter optimization. However, the two are not mutually exclusive, and you can and should use both. 12.3 De昀椀ning Parameter Distributions Rather than limiting the hyperparameters to selection from a list of options, the hyperopt library has various functions that let us define how to choose values. The choice function can specify a discrete set of values for a categorical hyperparameter. This is similar to enumerating the list of options in a grid search. This function takes two arguments: a list of possible values and an optional probability distribution over these values. It returns a random value from the list according to the specified probability distribution. For example, to generate a random value from a list of possible values ['a', 'b', 'c'], you can use the following code: 88 12.3. Defining Parameter Distributions >>> from hyperopt import hp, pyll >>> pyll.stochastic.sample(hp.choice('value', ['a', 'b', 'c'])) 'a' To generate a random value from a list of possible values ['a', 'b', 'c'] with probabilities [0.05, 0.9, 0.05] use the choice function: >>> pyll.stochastic.sample(hp.pchoice('value', [(.05, 'a'), (.9, 'b'), ... (.05, 'c')])) 'c' Note Be careful about using choice and pchoice for numeric values. The Hyperopt library treats each value as independent and not ordered. Its search algorithm cannot take advantage of the outcomes of neighboring values. I have found examples of using Hyperopt that suggest defining the search space for 'num_leaves' and 'subsample' like this: 'num_leaves': hp.choice('num_leaves', list(range(20, 250, 10))), 'subsample': hp.choice('subsample', [0.2, 0.4, 0.5, 0.6, 0.7, .8, .9]), Do not use the above code! 
Rather, these should be defined as below so Hyperopt can efficiently explore the space: 'num_leaves': hp.quniform('num_leaves', 20, 250, 10), 'subsample': hp.uniform('subsample', 0.2, .9), The uniform function can specify a uniform distribution over a continuous range of values. This function takes two arguments: the minimum and maximum values of the uniform distribution, and returns a random floating-point value within this range. For example, to generate a random floating-point value between 0 and 1 using the uniform function, you can use the following code: >>> from hyperopt import hp, pyll >>> pyll.stochastic.sample(hp.uniform('value', 0, 1)) 0.7875384438202859 Let’s call it 10,000 times and then look at a histogram of the result. You can see that it has an equal probability of choosing any value. uniform_vals = [pyll.stochastic.sample(hp.uniform('value', 0, 1)) for _ in range(10_000)] fig, ax = plt.subplots(figsize=(8, 4)) ax.hist(uniform_vals) The loguniform function can specify a log-uniform distribution over a continuous range of values. This function takes two arguments: the minimum and maximum values of the log-uniform distribution, and it returns a random floating-point value within this range on a logarithmic scale. Here is the histogram for the loguniform values from -5 to 5. Note that these values do not range from -5 to 5, but rather 0.006737 (math.exp(-5)) to 148.41 (math.exp(5)). Note that these values will strongly favor the low end. 89 12. Hyperopt Figure 12.1: Plot of the histogram of uniform values for low=0 and high=1 loguniform_vals = [pyll.stochastic.sample(hp.loguniform('value', -5, 5)) for _ in range(10_000)] fig, ax = plt.subplots(figsize=(8, 4)) ax.hist(loguniform_vals) Figure 12.2: Plot of the histogram of loguniform values for low=-5 and high=5 90 12.3. Defining Parameter Distributions Here is another way to look at it. This is a plot of the transform of possible values when the min and max values are set to -5 and 5, respectively. In the y-axis are the exponential values that come out of the loguniform function. fig, ax = plt.subplots(figsize=(8, 4)) (pd.Series(np.arange(-5, 5, step=.1)) .rename('x') .to_frame() .assign(y=lambda adf:np.exp(adf.x)) .plot(x='x', y='y', ax=ax) ) Figure 12.3: Plot of the loguniform values for low=-5 and high=5 For example, to generate a random floating-point value between 0.1 and 10 using the loguniform function, you can use the following code: >>> from hyperopt import hp, pyll >>> from math import log >>> pyll.stochastic.sample(hp.loguniform('value', log(.1), log(10))) 3.0090767867889174 If you want values pulled from a log scale, .001, .01, .1, 1, 10, then use loguniform. The quniform function can be used to specify an integer from the range [exp(low), exp(high)]. The function also accepts a q parameter that specifies the step (for q=1, it returns every integer in the range. For q=3, it returns every third integer). quniform_vals = [pyll.stochastic.sample(hp.quniform('value', -5, 5, q=2)) for _ in range(10_000)] 91 12. Hyperopt Here are the counts for the quniform values from -5 to 5, taking every other value (q=2). >>> pd.Series(quniform_vals).value_counts() -0.0 2042 -2.0 2021 2.0 2001 4.0 2000 -4.0 1936 dtype: int64 12.4 Exploring the Trials The trial object from the hyperopt search above has the data about the hyperparameter optimization. Feel free to skip this section if you want, but I want to show you how to explore this and see what Hyperopt is doing. 
First, I will make a function to convert the trial object to a Pandas dataframe. from typing import Any, Dict, Sequence def trial2df(trial: Sequence[Dict[str, Any]]) -> pd.DataFrame: """ Convert a Trial object (sequence of trial dictionaries) to a Pandas DataFrame. Parameters ---------trial : List[Dict[str, Any]] A list of trial dictionaries. Returns ------pd.DataFrame A DataFrame with columns for the loss, trial id, and values from each trial dictionary. """ vals = [] for t in trial: result = t['result'] misc = t['misc'] val = {k:(v[0] if isinstance(v, list) else v) for k,v in misc['vals'].items() } val['loss'] = result['loss'] val['tid'] = t['tid'] vals.append(val) return pd.DataFrame(vals) Now, let’s use the function to create a dataframe. >>> hyper2hr = trial2df(trials) 92 12.4. Exploring the Trials Each row is an evaluation. You can see the hyperparameter settings, the loss score, and the trial id (tid). >>> hyper2hr colsample_bytree gamma learning_rate 0 0.854670 2.753933 0.042056 1 0.512653 0.153628 0.611973 2 0.552569 1.010561 0.002412 3 0.604020 682.836185 0.005037 4 0.785281 0.004130 0.015200 ... ... ... ... 1995 0.717890 0.000543 0.141629 1996 0.725305 0.000248 0.172854 1997 0.698025 0.028484 0.162207 1998 0.688053 0.068223 0.099814 1999 0.666225 0.125253 0.203441 0 1 2 3 4 ... 1995 1996 1997 1998 1999 ... subsample loss \ ... 0.913247 -0.744751 ... 0.550048 -0.746961 ... 0.508593 -0.735912 ... 0.536935 -0.545856 ... 0.691211 -0.739227 ... ... ... ... 0.893414 -0.765746 ... 0.919415 -0.765746 ... 0.952204 -0.770166 ... 0.939489 -0.762431 ... 0.980354 -0.767956 tid 0 1 2 3 4 ... 1995 1996 1997 1998 1999 [2000 rows x 10 columns] I’ll do some EDA (exploratory data analysis) on this data. I like to look at correlations for numeric data and it might be interesting to see if hyperparameters are correlated to each other or to the loss. We can also inspect if a feature has a monotonic relationship with the loss score. Here is a Seaborn plot to show the correlations. import seaborn as sns fig, ax = plt.subplots(figsize=(8, 4)) sns.heatmap(hyper2hr.corr(method='spearman'), cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax ) Note If you want to create a correlation table in Jupyter with Pandas, you can use this code: (hyper2hr .corr(method='spearman') .style 93 12. Hyperopt Figure 12.4: Spearman correlation of hyperparameters during Hyperopt process. ) .background_gradient(cmap='RdBu', vmin=-1, vmax=1) There is a negative correlation between loss and tid (trial id). This is to be expected. As we progress, the negative loss score should drop. The optimization process should balance exploration and exploitation to minimize the negative loss score. If we do a scatter plot of these two columns, we see this play out. Occasionally, an exploration attempt proves poor (the loss value around -0.55), and the Bayesian search process cuts back to better values. fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .plot.scatter(x='tid', y='loss', alpha=.1, color='purple', ax=ax) ) Let’s explore what happened with max_depth and olss which have a correlation of -0.13. fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .plot.scatter(x='max_depth', y='loss', alpha=1, color='purple', ax=ax) ) This is a little hard to understand because the max_depth values plot on top of each other. I will use a technique called jittering to pull them apart and let us understand how many values there are. I will also adjust the alpha values of the plot so we can see where values overlap. Here is my jitter function. 
It adds a random amount of noise around a data point. Let’s apply it to the max_depth column. 94 12.4. Exploring the Trials Figure 12.5: Scatter plot of loss vs trial during Hyperopt process. Note that the values tend to lower as the tid goes up. Figure 12.6: Scatter plot of loss vs max_depth during Hyperopt process. 95 12. Hyperopt import numpy as np def jitter(df: pd.DataFrame, col: str, amount: float=1) -> pd.Series: """ Add random noise to the values in a Pandas DataFrame column. This function adds column of a Pandas noise with a range function returns a random noise to the values in a specified DataFrame. The noise is uniform random of `amount` centered around zero. The Pandas Series with the jittered values. Parameters ---------df : pd.DataFrame The input DataFrame. col : str The name of the column to jitter. amount : float, optional The range of the noise to add. The default value is 1. Returns ------pd.Series A Pandas Series with the jittered values. """ vals = np.random.uniform(low=-amount/2, high=amount/2, size=df.shape[0]) return df[col] + vals fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .assign(max_depth=lambda df:jitter(df, 'max_depth', amount=.8)) .plot.scatter(x='max_depth', y='loss', alpha=.1, color='purple', ax=ax) ) This makes it quite clear that the algorithm spent a good amount of time at depth 6. If we want to get even fancier, we can color this by trial attempt. The later attempts are represented by the yellow color. fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .assign(max_depth=lambda df:jitter(df, 'max_depth', amount=.8)) .plot.scatter(x='max_depth', y='loss', alpha=.5, color='tid', cmap='viridis', ax=ax) ) We could also use Seaborn to create a violin plot of loss against max_depth. I don’t feel like this lets me see the density as well. 96 12.4. Exploring the Trials Figure 12.7: Scatter plot of loss vs max_depth during Hyperopt process with jittering. You can see the majority of the values were in the 5-7 range. Figure 12.8: Scatter plot of loss vs max_depth during Hyperopt process colored by attempt. 97 12. Hyperopt import seaborn as sns fig, ax = plt.subplots(figsize=(8, 4)) sns.violinplot(x='max_depth', y='loss', data=hyper2hr, kind='violin', ax=ax) Figure 12.9: Violin plot of loss versus max_depth. I prefer the jittered scatter plot because it allows me to understand the density better. The correlation between reg_alpha and colsample_bytree was 0.41, meaning there was a slight tendency for the rank of one value to go up if the rank of the other value went up. Let’s plot this and see if we can glean any insight. fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .plot.scatter(x='reg_alpha', y='colsample_bytree', alpha=.8, color='tid', cmap='viridis', ax=ax) ) ax.annotate('Min Loss (-0.77)', xy=(4.56, 0.692), xytext=(.7, .84), arrowprops={'color':'k'}) The values for colsample_bytree and reg_alpha were .68 and 3.29 respectively. Let’s explore gamma and loss that had a correlation of .25. Here is my initial plot. fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .plot.scatter(x='gamma', y='loss', alpha=.1, color='purple', ax=ax) ) 98 12.4. Exploring the Trials Figure 12.10: Scatter plot of colsample_bytree vs reg_alpha during Hyperopt process colored by attempt. Figure 12.11: Scatter plot of loss vs gamma during Hyperopt process. 99 12. Hyperopt It is a little hard to tell what is going on here. The stacking on the left side might indicate a need for jittering. But if you look at the scale, you can see that there are some outliers. 
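One quick way to confirm that numerically is to look at the quantiles of the sampled gamma values (a small sketch using the hyper2hr dataframe built above):

# Most of the sampled gamma values should be tiny, while the largest
# ones can be orders of magnitude bigger, which squashes the
# linear-scale scatter plot.
print(hyper2hr['gamma'].describe(percentiles=[.5, .9, .99]))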
Remember that gamma was defined with this distribution: 'gamma': hp.loguniform('gamma', -10, 10), # regularization These values range from math.exp(-10) to math.exp(10). Because this is a log-uniform distribution, they will tend towards the low end of this. One trick you can use is to plot with a log x-axis. That will help tease these values apart. fig, ax = plt.subplots(figsize=(8, 4)) (hyper2hr .plot.scatter(x='gamma', y='loss', alpha=.5, color='tid', ax=ax, logx=True, cmap='viridis') ) ax.annotate('Min Loss (-0.77)', xy=(0.000581, -0.777), xytext=(1, -.6), arrowprops={'color':'k'}) Figure 12.12: Scatter plot of loss vs gamma during Hyperopt process with log x-axis This shows how Bayesian searching focuses on the values that provide the best loss. Occasionally, it does some exploring, but any gamma value above 10 performs poorly. The gamma value that it settled on during this run was 0.000581000581. 12.5 EDA with Plotly Feel free to skip this section too. I like digging into the data. I find the Plotly library works well if I was to create 3D plots and have some interactivity. Here is a helper function I wrote that will create a 3D mesh. I will make a mesh of two hyperparameters and plot the score in the third (Z) dimension. 100 12.5. EDA with Plotly import plotly.graph_objects as go def plot_3d_mesh(df: pd.DataFrame, x_col: str, y_col: str, z_col: str) -> go.Figure: """ Create a 3D mesh plot using Plotly. This function creates a 3D mesh plot using Plotly, with the `x_col`, `y_col`, and `z_col` columns of the `df` DataFrame as the x, y, and z values, respectively. The plot has a title and axis labels that match the column names, and the intensity of the mesh is proportional to the values in the `z_col` column. The function returns a Plotly Figure object that can be displayed or saved as desired. Parameters ---------df : pd.DataFrame The DataFrame containing the data to x_col : str The name of the column to use as the y_col : str The name of the column to use as the z_col : str The name of the column to use as the plot. x values. y values. z values. Returns ------go.Figure A Plotly Figure object with the 3D mesh plot. """ fig = go.Figure(data=[go.Mesh3d(x=df[x_col], y=df[y_col], z=df[z_col], intensity=df[z_col]/ df[z_col].min(), hovertemplate=f"{z_col}: %{{z}}<br>{x_col}: %{{x}}<br>{y_col}: " "%{{y}}<extra></extra>")], ) fig.update_layout( title=dict(text=f'{y_col} vs {x_col}'), scene = dict( xaxis_title=x_col, yaxis_title=y_col, zaxis_title=z_col), width=700, margin=dict(r=20, b=10, l=10, t=50) ) return fig Here is the code to plot a mesh of gamma and reg_lambda. In the z-axis, I plot the loss. The minimum value colors this in the z-axis. The most yellow value has the least loss. 101 12. Hyperopt Plotly allows us to interact with the plot, rotate it, zoom in or out, and show values by hovering. fig = plot_3d_mesh(hyper2hr.query('gamma < .2'), 'reg_lambda', 'gamma', 'loss') fig Figure 12.13: Scatter plot of loss vs gamma during Hyperopt process with log x-axis Here is the code to create a scatter plot. (It turns out that this wrapper isn’t really needed but I provided it for consistency). import plotly.express as px import plotly.graph_objects as go def plot_3d_scatter(df: pd.DataFrame, x_col: str, y_col: str, z_col: str, color_col: str, opacity: float=1) -> go.Figure: """ Create a 3D scatter plot using Plotly Express. This function creates a 3D scatter plot using Plotly Express, with the `x_col`, `y_col`, and `z_col` columns of the `df` DataFrame as the x, y, and z values, respectively. 
The points 102 12.6. Conclusion in the plot are colored according to the values in the `color_col` column, using a continuous color scale. The function returns a Plotly Express scatter_3d object that can be displayed or saved as desired. Parameters ---------df : pd.DataFrame The DataFrame containing the data to plot. x_col : str The name of the column to use as the x values. y_col : str The name of the column to use as the y values. z_col : str The name of the column to use as the z values. color_col : str The name of the column to use for coloring. opacity : float The opacity (alpha) of the points. Returns ------go.Figure A Plotly Figure object with the 3D mesh plot. """ fig = px.scatter_3d(data_frame=df, x=x_col, y=y_col, z=z_col, color=color_col, color_continuous_scale=px.colors.sequential.Viridis_r, opacity=opacity) return fig Let’s look at gamma and reg_lambda over time. I’m using color for loss, so the high values in the z-axis should be getting better loss scores as the hyperparameter search progresses. plot_3d_scatter(hyper2hr.query('gamma < .2'), 'reg_lambda', 'gamma', 'tid', color_col='loss') 12.6 Conclusion In this chapter, we introduced the Hyperopt library and performed hyperparameter searches using Bayesian optimization. Since there are so many hyperparameters for XGBoost, using grid search to optimize them could take a long time because it will explore values known to be bad. Hyperopt can focus on and around known good values. We also showed how to specify parameter distributions and gave examples of the different types of distributions. We showed good default distributions that you can use for your models. Finally, we explored the relationships between hyperparameters and loss scores using pandas and visualization techniques. 103 12. Hyperopt Figure 12.14: Scatter plot of loss vs gamma during Hyperopt process with log x-axis 12.7 Exercises 1. What is the Hyperopt library and how does it differ from grid search for optimizing hyperparameters? 2. How can parameter distributions be specified in Hyperopt? Give an example of a continuous distribution and a discrete distribution. 3. What are some good default distributions for your models in Hyperopt? 4. How can pandas and visualization techniques be used to explore the relationships between hyperparameters and loss scores? 104 Chapter 13 Step-wise Tuning with Hyperopt In this chapter, we introduce step-wise tuning as a method for optimizing the hyperparameters of an XGBoost model. The main reason for tuning with this method is to save time. This method involves tuning small groups of hyperparameters that act similarly and then moving on to the next group while preserving the values from the previous group. This can significantly reduce the search space compared to tuning all hyperparameters simultaneously. 13.1 Groups of Hyperparameters In this section, rather than tuning all of the parameters at once, we will use step-wise tuning. We will tune small groups of hyperparameters that act similarly, then move to the next group while preserving the values from the previous group. I’m going to limit my steps to: • • • • Tree parameters Sampling parameters Regularization parameters Learning rate This limits the search space by quite a bit. Rather than searching 100 options for trees for every 100 options for sampling (10,000), another 100 options for regularization (1,000,000), and another 100 for learning rate (100,000,000), you are searching 400 options! 
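To make that back-of-the-envelope arithmetic concrete (the 100-options-per-group figure is only an illustration):

options_per_group = 100
groups = 4
joint_search = options_per_group ** groups     # 100,000,000 combinations
step_wise_search = options_per_group * groups  # 400 evaluations
print(joint_search, step_wise_search)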
Of course, it could be that this finds a local maximum and ignores the interplay between hyperparameters, but I find that it is often a worthwhile tradeoff if you are pressed for time. Here is the code that I use for step-wise tuning. In the rounds list, I have a dictionary for the hyperparameters to evaluate for each step. The max_evals setting in the call to fmin determines how many attempts hyperopt makes during the round. You can bump up this number if you want it to explore more values. from hyperopt import fmin, tpe, hp, Trials params = {'random_state': 42} rounds = [{'max_depth': hp.quniform('max_depth', 1, 8, 1), # tree 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)}, {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)}, {'reg_alpha': hp.uniform('reg_alpha', 0, 10), 'reg_lambda': hp.uniform('reg_lambda', 1, 10),}, {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting 105 13. Step-wise Tuning with Hyperopt ] all_trials = [] for round in rounds: params = {**params, **round} trials = Trials() best = fmin(fn=lambda space: xhelp.hyperparameter_tuning(space, X_train, y_train, X_test, y_test), space=params, algo=tpe.suggest, max_evals=20, trials=trials, ) params = {**params, **best} all_trials.append(trials) Rather than taking 30 minutes (or two hours om my slower computer) to exhaustively train all of the options, this finished in a little over a minute. 13.2 Visualization Hyperparameter Scores Let’s use the Plotly visualization code from the last chapter to explore the relationship between reg_alpha and reg_lambda. xhelp.plot_3d_mesh(xhelp.trial2df(all_trials[2]), 'reg_alpha', 'reg_lambda', 'loss') This is a lot coarser than our previous plots. You might use this to diagnose if you need to up the number of max_evals. 13.3 Training an Optimized Model With the optimized parameters in hand, let’s train a model with them. Make sure that you explicitly print out the values (this also helps from having to run through the search space again if you need to restart the notebook). step_params = {'random_state': 42, 'max_depth': 5, 'min_child_weight': 0.6411044640540848, 'subsample': 0.9492383155577023, 'colsample_bytree': 0.6235721099295888, 'gamma': 0.00011273797329538491, 'learning_rate': 0.24399020050740935} Once the parameters are in a dictionary, you can use dictionary unpacking (the ** operator) to pass the parameters into the constructor. Remember to set early_stopping_rounds, n_estimators, and provide an eval_set to the .fit method so that the number of trees is optimized (if you use all of the trees, bump up the n_estimators number and run again): xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50, n_estimators=500) xg_step.fit(X_train, y_train, 106 13.3. Training an Optimized Model Figure 13.1: Contour plot of lambda and alpha during Hyperopt process eval_set=[(X_train, y_train), (X_test, y_test) ], verbose=100 ) How does this model perform? >>> xg_step.score(X_test, y_test) 0.7613259668508288 Looks pretty good! Let’s compare this to the default out-of-the-box model. >>> xg_def = xgb.XGBClassifier() >>> xg_def.fit(X_train, y_train) >>> xg_def.score(X_test, y_test) 0.7458563535911602 Our tuned model performs marginally better. However, a marginal improvement often yields outsized results. 107 13. 
Step-wise Tuning with Hyperopt 13.4 Summary In this chapter, we used the hyperopt library to optimize the performance of our XGBoost model. Hyperopt is a library for hyperparameter optimization that provides a range of search algorithms, parameter distributions, and performance metrics that can be used to optimize the hyperparameters of XGBoost models. We used step-wise optimization to speed up the search. This may return a local minimum, but it does a decent job if you need an answer quickly. If you can access a cluster and have a weekend to burn, kick off a non-step-wise search. 13.5 Exercises 1. How does step-wise tuning with Hyperopt differ from traditional hyperparameter tuning methods such as grid search? 2. When does step-wise tuning make sense? When would you use a different strategy? 108 Chapter 14 Do you have enough data? It is essential to have enough data for machine learning because the performance of a machine learning model is highly dependent on the amount and quality of the data used to train the model. A model trained on a large and diverse dataset will likely have better accuracy, generalization, and robustness than a model trained on a small and homogeneous dataset. Having enough data for machine learning can help to overcome overfitting. Overfitting occurs when the model learns the noise and irrelevant details in the training data, leading to poor generalization when predicting with new data. This chapter will introduce learning curves, a valuable tool for understanding your model. 14.1 Learning Curves A learning curve is a graphical representation of the relationship between the performance of a machine-learning model and the amount of data used to train the model. We train a model on increasing amounts of data and plot the scores along the way. The x-axis of a learning curve shows the number of training examples or the amount of training data, and the y-axis shows the performance metric of the model, such as the error rate, accuracy, or F1 score. The Yellowbrick library contains a learning curve visualization function. Note that I’m manually setting the y-limit for all the learning curve plots to help consistently convey the results. Otherwise, the plot would lop off the bottom portions essentially cropping the image and making better-performing models look as if they performed worse. params = {'learning_rate': 0.3, 'max_depth': 2, 'n_estimators': 200, 'n_jobs': -1, 'random_state': 42, 'reg_lambda': 0, 'subsample': 1} import yellowbrick.model_selection as ms fig, ax = plt.subplots(figsize=(8, 4)) viz = ms.learning_curve(xgb.XGBClassifier(**params), X, y, ax=ax ) ax.set_ylim(0.6, 1) The learning curve for a good model illustrates a few things. 109 14. Do you have enough data? Figure 14.1: Learning curve for xgboost model 1. Shows a consistent and monotonic improvement in the model’s cross-validation (or testing) performance as the amount of training data increases. This means that the error rate or accuracy of the model will decrease as the number of training examples or the amount of training data increases. It seems we could have an even better model if we added more data because the cross-validation score does not look like it has plateaued. 2. Learns the underlying patterns and trends in the data and that it is not overfitting or underfitting to the training data. An overfit model would show the training score trending close to 100% accuracy. We want to see it coming down to the testing data. 
We also like to see the cross-validation score trending close to the training score. My takeaway from looking at this plot is that the model seems to be performing ok, it is not overfitting, and it might do even better if we had more data. 14.2 Learning Curves for Decision Trees In the previous learning curve, we saw the behavior of an XGBoost model. Let’s plot a decision tree with a max depth of 7. This depth was a good compromise between overfitting and underfitting our model. # tuned tree fig, ax = plt.subplots(figsize=(8, 4)) viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=7), X, y, ax=ax) viz.ax.set_ylim(0.6, 1) It looks similar to the XGBoost model in shape, but the scores are shifted down a bit. My takeaway is similar to the XGBoost model. It looks pretty good. We also might want to get some more data. 110 14.3. Underfit Learning Curves Figure 14.2: Learning curve for decision tree with depth of 7 14.3 Under昀椀t Learning Curves Now let’s look at an underfit model. A learning curve for a decision stump model will show a low and constant cross-validation score as the amount of training data increases. As the number of training examples increases, it shouldn’t impact the model because the model is too simple to learn from the data. We would also see poor performance on the training data. The score for both the training and testing data should be similarly bad. If you look at the learning curve, you see that our model cannot learn the underlying patterns in the data. It needs to be more complex. # underfit fig, ax = plt.subplots(figsize=(8, 4)) viz = ms.learning_curve(tree.DecisionTreeClassifier(max_depth=1), X, y, ax=ax ) ax.set_ylim(0.6, 1) 14.4 Over昀椀t Learning Curves Let’s jump to the other end of the spectrum, an overfit learning curve. We can overfit a decision tree by using the default parameters which let the tree grow unbounded. Here, we expect to see good performance on the training data but poor performance on the testing data. The model memorizes the training data but cannot generalize the information to new data. This learning curve is an excellent example of a model that is too complex. # overfit fig, ax = plt.subplots(figsize=(8, 4)) 111 14. Do you have enough data? Figure 14.3: Learning curve for underfit decision stump viz = ms.learning_curve(tree.DecisionTreeClassifier(), X, y, ax=ax ) ax.set_ylim(0.6, 1) 14.5 Summary Learning curves are a valuable tool to help diagnose models. We can infer whether adding more data will help the model perform better. We can also understand if the model is overfitting or underfitting. If you want to know your machine learning model’s health, plot a learning curve. Your model is on a good path if the learning curve is good. If the learning curve is bad, your model needs some attention, and you should consider adjusting its hyperparameters, adding more training data, or improving its features. 14.6 Exercises 1. What is the relationship between the performance of a machine learning model and the amount of data used to train the model? 2. How does having enough data for machine learning help to overcome overfitting? 3. What is a learning curve, and how is it used to evaluate the performance of a machine learning model? 4. What is the purpose of plotting a learning curve for a machine learning model? 5. What are some characteristics of a good learning curve for a machine learning model? 6. How can the learning curve of a model be used to determine if the model is overfitting or underfitting? 112 14.6. 
Exercises Figure 14.4: Learning curve for overfit decision tree. Evidence of overfitting is clear when the training score is high but the cross validation score never improves. 113 Chapter 15 Model Evaluation Several machine learning evaluation techniques can be used to measure a machine learning model’s performance, generalization, and robustness. In this chapter, we will explore many of them. 15.1 Accuracy Let’s start off with an out-of-the-box model and train it. xgb_def = xgb.XGBClassifier() xgb_def.fit(X_train, y_train) The default result of calling the .score method returns the accuracy. The accuracy metric calculates the proportion of correct predictions made by the model. The metric is defined as the number of correct predictions divided by the total number of predictions. >>> xgb_def.score(X_test, y_test) 0.7458563535911602 This model predicts 74% of the labels for the testing data correctly. On the face of it, it seems a valuable metric. But you must tread carefully with it, especially if you have imbalanced labels. Imagine that you are building a machine learning model to predict whether a patient has cancer. You have a dataset of 100 patients, where 95 don’t have cancer, and only 5 have cancer. You train a model that always predicts that a patient doesn’t have cancer. This model will have an accuracy of 95%, which looks impressive but useless and possibly dangerous. You can also use the accuracy_score function: >>> from sklearn import metrics >>> metrics.accuracy_score(y_test, xgb_def.predict(X_test)) 0.7458563535911602 15.2 Confusion Matrix A confusion matrix is a table that shows the number of true positive, true negative, false positive, and false negative predictions made by a machine learning model. The confusion matrix is useful because it helps you to understand the mistakes and errors of your machine learning model. The confusion matrix aids in calculating prevalence, accuracy, precision, and recall. 115 15. Model Evaluation Figure 15.1: A confusion matrix and associated formulas. Generally, the lower right corner represents the true positive values. Values that were predicted to have the positive label and were indeed positive. You want this number to be high. Above those of the false positives values. This is the count of values that were predicted to have the positive label but, in truth, were negative. You want this number to be low. (These are also called type 1 errors by statisticians.) In the upper left corner is the true negative values. Values that were both predicted and, in truth, negative. You want this value to be high. Below that value is the false negative count (or type 2 errors). These are positive values that were predicted as negative values. fig, ax = plt.subplots(figsize=(8, 4)) classifier.confusion_matrix(xgb_def, X_train, y_train, X_test, y_test, classes=['DS', 'SE'], ax=ax ) You can also use scikit-learn to create a NumPy matrix of a confusion matrix. >>> from sklearn import metrics >>> cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test)) >>> cm array([[372, 122], [108, 303]]) Scikit-learn can create a matplotlib plot that is similar to Yellowbrick. 116 15.3. Precision and Recall Figure 15.2: Confusion matrix fig, ax = plt.subplots(figsize=(8, 4)) disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['DS', 'SE']) disp.plot(ax=ax, cmap='Blues') Sometimes it is easier to understand a confusion matrix if you use fractions instead of counts. The ORmaliizze=True parameter will do this for us. 
fig, ax = plt.subplots(figsize=(8, 4)) cm = metrics.confusion_matrix(y_test, xgb_def.predict(X_test), normalize='true') disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['DS', 'SE']) disp.plot(ax=ax, cmap='Blues') 15.3 Precision and Recall Precision and recall are two important evaluation metrics for machine learning. Precision measures the proportion of correct positive predictions made by the model divided by all of the positive predictions. I like to think of this as how relevant the results are. Recall (statisticians also like the term sensitivity) measures the proportion of actual positive examples that the model correctly predicts. I like to think of this as how many relevant results are returned. Recall and precision are often at odds with each other. If you want high precision, then you can classify the sample that is the most positive as positive and get 100% precision, but you will have a low recall because the other positive samples will be mislabeled. Conversely, you can get high recall by classifying everything as positive. This will give you 100% recall but poor precision. How you want to balance this tradeoff depends on your business needs. 117 15. Model Evaluation Figure 15.3: Confusion matrix from scikit learn Figure 15.4: Normalized confusion matrix from scikit learn showing fractions. 118 15.3. Precision and Recall You can measure the precision with the precision_score function from scikit-learn. >>> metrics.precision_score(y_test, xgb_def.predict(X_test)) 0.7129411764705882 You can measure the recall with the recall_score function from scikit-learn. >>> metrics.recall_score(y_test, xgb_def.predict(X_test)) 0.7372262773722628 To visualize the tradeoff between these two measures, you can use a precision-recall curve found in Yellowbrick. from yellowbrick import classifier fig, ax = plt.subplots(figsize=(8, 4)) classifier.precision_recall_curve(xgb_def, X_train, y_train, X_test, y_test, micro=False, macro=False, ax=ax, per_class=True) ax.set_ylim((0,1.05)) Figure 15.5: Precision-recall curve from Yellowbrick This curve needs a little explanation. It shows the precision against the recall. That might not be very clear because we just calculated those scores, which are a single value. So, where does this plot with a line come from? Many machine learning models have a .predict_proba method that returns a probability for both positive and negative labels rather than just a single label or value for prediction. One could imagine that if you changed how a model assigns a label to some threshold. If the probability of the positive label is above a threshold of .98, assign the positive label. This would likely produce a model with high precision and low recall (a point on the upper left 119 15. Model Evaluation of our plot). You will produce the precision-recall plot if you loosen this threshold from .98 down to .01 and track both precision and recall. This plot also shows the average precision across the threshold values and plots it as the dotted red line. Now imagine that your boss says she does not want any false positives. If that metric is most important, you can raise the threshold (and train to optimize the 'precision' metric when creating your model). Given such considerations, this curve can help you understand how your model might perform. 15.4 F1 Score Another metric that might be useful is the F1 score. This is the harmonic mean of precision and recall. 
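As a quick sanity check, you can also compute the F1 score by hand from the counts in the confusion matrix above (a small sketch, not scikit-learn code):

>>> tp, fp, fn = 303, 122, 108
>>> prec = tp / (tp + fp)
>>> rec = tp / (tp + fn)
>>> round(2 * prec * rec / (prec + rec), 4)
0.7249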
If you want to balance precision and recall, this single number can help describe the performance. >>> metrics.f1_score(y_test, xgb_def.predict(X_test)) 0.7248803827751197 Scikit-learn also has a text-based classification report. The support column is the number of samples for each label. The macro average is the average of the class behavior. The weighted average is based on the count (support). >>> print(metrics.classification_report(y_test, ... y_pred=xgb_def.predict(X_test), target_names=['DS', 'SE'])) precision recall f1-score support DS SE 0.78 0.71 0.75 0.74 0.76 0.72 494 411 accuracy macro avg weighted avg 0.74 0.75 0.75 0.75 0.75 0.74 0.75 905 905 905 The Yellowbrick library provides a classification report that summarizes precision, recall, and f1-score for both positive and negative labels: fig, ax = plt.subplots(figsize=(8, 4)) classifier.classification_report(xgb_def, X_train, y_train, X_test, y_test, classes=['DS', 'SE'], micro=False, macro=False, ax=ax) 15.5 ROC Curve A Receiver Operating Characteristic curve or ROC curve is another useful plot for model diagnoses that requires some explaining. It was first used during World War II to understand how effective radar signals were at detecting enemy aircraft. It measures the true positive rate (or recall) against the false positive rate (or fallout, false positive count divided by the count of all negatives) over a range of thresholds. 120 15.5. ROC Curve Figure 15.6: Classification report from Yellowbrick Let’s look at an example and then explain it. We will plot two ROC curves on the same plot—one from the default XGBoost model and the other from our Hyperopt-trained model. We will use the scikit learn library RocCurveDisplay.from_estimator method. fig, ax = plt.subplots(figsize=(8,8)) metrics.RocCurveDisplay.from_estimator(xgb_def, X_test, y_test,ax=ax, label='default') metrics.RocCurveDisplay.from_estimator(xg_step, X_test, y_test,ax=ax) Again, imagine sweeping the probability threshold as we did for the precision-recall plot. At a high threshold, we would want a high true positive rate and a low false positive rate (if the threshold is .9, only values that the model gave a probability > .9 would be classified as positive). However, remember that the true positive rate is also the recall. A high threshold does not necessarily pull out all of the positive labels. The line starts at (0,0). As the threshold lowers, the line moves up and to the right until the threshold is low enough that we have classified everything as positive and thus have 100% recall (and 100% FPR). Also, note that the Area Under Curve (AUC) is reported as .84 for our hyperopt model. Because this is a unit square, a perfect model would have an AUC of 1.0. Folks often ask if a given AUC value is “good enough” for their model. Here’s how I interpret an ROC curve. Generally, you cannot tell if a model is sufficiently good for your application from the AUC metric or an ROC plot. However, you can tell if a model is bad. If the AUC is less than .5, it indicates that the model performs worse than guessing (and you should do the opposite of what your model says). Suppose the reported AUC is 1 (or very close to it). That usually tells me one of two things. First, the problem is such that it is very simple, and perhaps that machine learning is not needed. More often, it is the case that the modeler has leaked data into their model such 121 15. Model Evaluation Figure 15.7: ROC Curve for out of the box model (default) and the step wise tuned model. 
Models that have curves above and to the left of other curves are better models. The AUC is the area under the curve. If the AUC is < .5, then the model is bad. 122 15.6. Threshold Metrics that it knows the future. Data leakage is when you include data in the model that wouldn’t be known when you ask the model to make a prediction. Imagine that you are building a machine learning model to predict the sentiment of movie reviews. You have a dataset of 100 reviews, where 50 of them are positive, and 50 of them are negative. You train a model that correctly predicts the sentiment of each review. Then, you realize that your model has a serious data leakage problem. You forgot to remove the column that contains the movie’s rating from the dataset, and you included this column in the model’s training. This means that the model is not learning from the text of the reviews, but it is learning from the movie’s rating. When you remove this feature, the accuracy of the model drops. You can use an ROC curve to determine if a model is better than another model (at a given threshold). If one model has a higher AUC or bulges out more to the upper left corner, it is better. But being a better model is insufficient to determine whether a model would work for a given business context. In the above plot, it appears that the default model has worse performance than the stepwise model. You should note that the step-wise model curve is to the left and above the default curve. You can leverage the ROC curve to understand overfitting. If you plot an ROC curve for both the training and testing data, you would hope that they look similar. A good ROC curve for the training data that yields a poor testing ROC curve is an indication of overfitting. Let’s explore that with the default model and the stepwise model. fig, axes = plt.subplots(figsize=(8, 4), ncols=2) metrics.RocCurveDisplay.from_estimator(xgb_def, X_train, y_train,ax=axes[0], label='detault train') metrics.RocCurveDisplay.from_estimator(xgb_def, X_test, y_test,ax=axes[0]) axes[0].set(title='ROC plots for default model') metrics.RocCurveDisplay.from_estimator(xg_step, X_train, y_train,ax=axes[1], label='step train') metrics.RocCurveDisplay.from_estimator(xg_step, X_test, y_test,ax=axes[1]) axes[1].set(title='ROC plots for stepwise model') It looks like the training performance of the stepwise model is worse. However, that is because the default model appears to be overfitting, and our testing score improves with the tuning even though the training score has decreased. 15.6 Threshold Metrics The models (in scikit-learn and XGBoost) don’t expose the threshold as a hyperparameter. You can create a subclass to experiment with the threshold. class ThresholdXGBClassifier(xgb.XGBClassifier): def __init__(self, threshold=0.5, **kwargs): super().__init__(**kwargs) self.threshold = threshold def predict(self, X, *args, **kwargs): """Predict with `threshold` applied to predicted class probabilities. 123 15. Model Evaluation Figure 15.8: Exploring overfitting with ROC curves. On the left is the out-of-the-box model. Note that the training score is much better than the test score. With the stepwise model, the training score decreases, but the score of the test data has improved. """ proba = self.predict_proba(X, *args, **kwargs) return (proba[:, 1] > self.threshold).astype(int) Here is an example of using the class. Notice that the first row of the testing data has a probability of .857 of being positive. 
>>> xgb_def = xgb.XGBClassifier() >>> xgb_def.fit(X_train, y_train) >>> xgb_def.predict_proba(X_test.iloc[[0]]) array([[0.14253652, 0.8574635 ]], dtype=float32) When we predict this row with the default model, it comes out as positive. >>> xgb_def.predict(X_test.iloc[[0]]) array([1]) If we set the threshold to .9, the prediction becomes negative. >>> xgb90 = ThresholdXGBClassifier(threshold=.9, verbosity=0) >>> xgb90.fit(X_train, y_train) >>> xgb90.predict(X_test.iloc[[0]]) array([0]) Here is code that you can use to see if it would be appropriate to use a different threshold. This visualization shows how many common metrics respond as the threshold is changed. 124 15.7. Cumulative Gains Curve def get_tpr_fpr(probs, y_truth): """ Calculates true positive rate (TPR) and false positive rate (FPR) given predicted probabilities and ground truth labels. Parameters: probs (np.array): predicted probabilities of positive class y_truth (np.array): ground truth labels Returns: tuple: (tpr, fpr) """ tp = (probs == 1) & (y_truth == 1) tn = (probs < 1) & (y_truth == 0) fp = (probs == 1) & (y_truth == 0) fn = (probs < 1) & (y_truth == 1) tpr = tp.sum() / (tp.sum() + fn.sum()) fpr = fp.sum() / (fp.sum() + tn.sum()) return tpr, fpr vals = [] for thresh in np.arange(0, 1, step=.05): probs = xg_step.predict_proba(X_test)[:, 1] tpr, fpr = get_tpr_fpr(probs > thresh, y_test) val = [thresh, tpr, fpr] for metric in [metrics.accuracy_score, metrics.precision_score, metrics.recall_score, metrics.f1_score, metrics.roc_auc_score]: val.append(metric(y_test, probs > thresh)) vals.append(val) fig, ax = plt.subplots(figsize=(8, 4)) (pd.DataFrame(vals, columns=['thresh', 'tpr/rec', 'fpr', 'acc', 'prec', 'rec', 'f1', 'auc']) .drop(columns='rec') .set_index('thresh') .plot(ax=ax, title='Threshold Metrics') ) 15.7 Cumulative Gains Curve This curve is useful for maximizing marketing response rates (evaluating a model when you have finite resources). It plots the gain (the recall or sensitivity) against the ordered samples. The baseline is what a random prediction would give you. The gain is calculated as the true positive rate if you were to order the predictions by the probability against the percentage of samples. Look at the plot below from the scikitplot library. I have augmented it to show the optimal gains (what a perfect model would give you). 125 15. Model Evaluation Figure 15.9: Threshold plots import scikitplot fig, ax = plt.subplots(figsize=(8, 4)) y_probs = xgb_def.predict_proba(X_test) scikitplot.metrics.plot_cumulative_gain(y_test, y_probs, ax=ax) ax.plot([0, (y_test == 1).mean(), 1], [0, 1, 1], label='Optimal Class 1') ax.set_ylim(0, 1.05) ax.annotate('Reach 60% of\nClass 1\nby contacting top 35%', xy=(.35, .6), xytext=(.55,.25), arrowprops={'color':'k'}) ax.legend() Here is an example of reading the plot. If you want to reach 60% of the positive audience, you can trace the .6 from the y-axis to the plot. It indicates that you need to contact the top 35% of the sample if you want to reach that number. Note that if we had a perfect model, we would only need to reach out to the 45% of them in the positive class (see the blue line). 15.8 Lift Curves A lift curve is a plot showing the true positive rate and the cumulative positive rate of a machine learning model as the prediction threshold varies. It leverages the same x-axis from the cumulative gains plot, but the y-axis is the ratio of gains to randomly choosing a label. It indicates how much better the model is than guessing. 
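In other words, the lift at a given cutoff is just the cumulative gain divided by the fraction of samples contacted. A tiny sketch with hypothetical numbers:

fraction_contacted = 0.20   # top 20% of samples, ranked by predicted probability
gain = 0.36                 # hypothetical: 36% of the positive class captured so far
lift = gain / fraction_contacted   # 1.8, i.e. 1.8x better than random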
If you only can reach the top 20% of your audience (move up from .2 on the x-axis), you should be almost 1.8x better than randomly choosing labels. I have also augmented this with the optimal model. fig, ax = plt.subplots(figsize=(8, 4)) y_probs = xgb_def.predict_proba(X_test) 126 15.9. Summary Figure 15.10: Cumulative gains plot scikitplot.metrics.plot_lift_curve(y_test, y_probs, ax=ax) mean = (y_test == 1).mean() ax.plot([0, mean, 1], [1/mean, 1/mean, 1], label='Optimal Class 1') ax.legend() 15.9 Summary In this chapter, we explored various metrics and plots to evaluate a classification model. There is no one size fits all metric or plot to understand a model. Many times it depends on the needs of the business. Precision, recall, accuracy, and F1 are common and important metrics for evaluating and comparing the performance of a machine learning model. ROC curves, precision-recall curves, and lift plots are visual representations of the performance of a machine learning model. These metrics and plots provide valuable and complementary information about the performance of a machine learning model, and they are widely used for evaluating, comparing, and diagnosing the performance of the model. They are important and useful tools for improving, optimizing, and maximizing the performance of the model, and they are essential for understanding and interpreting the results of the model. 15.10 Exercises 1. How can cross-validation be used to evaluate the performance of a machine learning model? 2. What is the difference between precision and recall in evaluating a machine learning model? 127 15. Model Evaluation Figure 15.11: Lift Curve for out of the box model 3. What is the F1 score, and how is it used to evaluate a machine learning model? 4. In what situations might it be more appropriate to use precision, recall, accuracy, or F1 as a metric to evaluate the model’s performance? 5. What is the purpose of a ROC curve in evaluating a machine learning model? 6. How can ROC curves, precision-recall curves, and lift plots be used to improve, optimize, and maximize the performance of a machine learning model? 128 Chapter 16 Training For Different Metrics We just looked at different metrics for model evaluation. In this chapter, we will look at training a model to maximize those metrics. 16.1 Metric overview We will compare optimizing a model with the metrics of accuracy, the area under the ROC curve. Accuracy is a measure of how often the model correctly predicts the target class. Precision is a measure of how often the model is correct when it predicts the positive class. Recall is a measure of how often the model can correctly identify the positive class. The area under the ROC curve is one measure to balance precision and recall. The F1 score is another metric that combines both by using the harmonic mean of precision and recall. 16.2 Training with Validation Curves Let’s use a validation curve to visualize tuning the learning_rate for both accuracy and area under ROC. (Note that you can get a high recall by just classifying everything as positive, so that is not a particularly interesting metric to maximize.) We will use the Yellowbrick library. Remember that a validation curve tracks the performance of the metric as we sweep through hyperparameter values. The validation_curve function accepts a scoring parameter to determine a metric to track. We will pass in scoring='accuracy' to the first curve and scoring='roc_auc' to the second. 
from yellowbrick import model_selection as ms fig, ax = plt.subplots(figsize=(8, 4)) ms.validation_curve(xgb.XGBClassifier(), X_train, y_train, scoring='accuracy', param_name='learning_rate', param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax ) ax.set_xlabel('Accuracy') fig, ax = plt.subplots(figsize=(8, 4)) ms.validation_curve(xgb.XGBClassifier(), X_train, y_train, scoring='roc_auc', param_name='learning_rate', param_range=[0.001, .01, .05, .1, .2, .5, .9, 1], ax=ax ) ax.set_xlabel('roc_auc') 129 16. Training For Different Metrics Figure 16.1: Validation Curve for XGBoost model tuning learning_rate for accuracy You can see that the optimal learning rate for accuracy is around .01 (given our limited range of choices). The shape of the area under the ROC curve is similar but shifted up, and the maximum cross-validation score is around .05. In general, tuning for different metrics will reveal different hyperparameter values. 16.3 Step-wise Recall Tuning Let’s use the hyperopt library to optimize AUC. Let’s pass in roc_auc_score for the metric parameter. from sklearn.metrics import roc_auc_score from hyperopt import hp, Trials, fmin, tpe params = {'random_state': 42} rounds = [{'max_depth': hp.quniform('max_depth', 1, 9, 1), # tree 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)}, {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)}, {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting ] for round in rounds: params = {**params, **round} trials = Trials() best = fmin(fn=lambda space: xhelp.hyperparameter_tuning( space, X_train, y_train, X_test, y_test, metric=roc_auc_score), space=params, 130 16.3. Step-wise Recall Tuning Figure 16.2: Validation Curve for XGBoost model tuning learning_rate for area under the ROC curve algo=tpe.suggest, max_evals=40, trials=trials, ) params = {**params, **best} Now let’s compare the default model for recall versus our optimized model. >>> xgb_def = xgb.XGBClassifier() >>> xgb_def.fit(X_train, y_train) >>> metrics.roc_auc_score(y_test, xgb_def.predict(X_test)) 0.7451313573096131 >>> # the values from above training >>> params = {'random_state': 42, ... 'max_depth': 4, ... 'min_child_weight': 4.808561584650579, ... 'subsample': 0.9265505972233746, ... 'colsample_bytree': 0.9870944989347749, ... 'gamma': 0.1383762861356536, ... 'learning_rate': 0.13664139307301595} Below I use the verbose=100 parameter to only show the output every 100 iterations. >>> xgb_tuned = xgb.XGBClassifier(**params, early_stopping_rounds=50, ... n_estimators=500) >>> xgb_tuned.fit(X_train, y_train, eval_set=[(X_train, y_train), ... (X_test, y_test)], verbose=100) 131 16. 
Training For Different Metrics [0] validation_0-logloss:0.66207 validation_1-logloss:0.66289 [100] validation_0-logloss:0.44945 validation_1-logloss:0.49416 [150] validation_0-logloss:0.43196 validation_1-logloss:0.49833 XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.9870944989347749, early_stopping_rounds=50, enable_categorical=False, eval_metric=None, feature_types=None, gamma=0.1383762861356536, gpu_id=-1, grow_policy='depthwise', importance_type=None, interaction_constraints='', learning_rate=0.13664139307301595, max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0, max_depth=4, max_leaves=0, min_child_weight=4.808561584650579, missing=nan, monotone_constraints='()', n_estimators=500, n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=42, ...) It looks like our tuned model has improved the area under the ROC curve score. >>> metrics.roc_auc_score(y_test, xgb_tuned.predict(X_test)) 0.7629510328319394 16.4 Summary In this chapter, we explored optimizing models around different metrics. Talk to the business owners and determine what metric is important for you to optimize. Then tune your model to that metric. The default metric of accuracy is a fine default, but it might not be what you need. 16.5 Exercises 1. 2. 3. 4. 5. 132 What is the purpose of a validation curve? How does tuning a model for recall differ from tuning for accuracy? How does the metric choice influence a model’s hyperparameters? Can a model be optimized for both precision and recall simultaneously? Discuss. In what situations might it be important to prioritize recall over precision or vice versa? Chapter 17 Model Interpretation Consider a bank offering loans. The bank wants to be able to tell a user why they were rejected. Saying, “We are denying your application, but if you deposit $2,000 into your account, we will approve it.” is better than saying, “We deny your application. Sorry, we can’t tell you why because our model doesn’t give us that information.” The company might value explainability more than performance. Data scientists classify models as white box and black box models. A white box machine learning model is a model that is transparent and interpretable. This means that the model’s workings are completely understood and can be explained in terms of the input features and the mathematical operations used to make predictions. In contrast, a black box machine learning model is a model that is not transparent or interpretable. This means that the model’s workings are not easily understood, and it is difficult or impossible to explain how the model makes predictions. A decision tree model is a white box model. You can explain exactly what the model is doing. An XGBoost model is considered a black box model. I would probably call it a grey model (or maybe dark grey). We can explain what happens, but we need to sift through multiple trees to come to an answer. It is unlikely that a user of the model is going to want to track through 587 trees to determine why it made a prediction. This chapter will explore different models and how to interpret them. We will also discuss the pros and cons of the interpretation techniques. 17.1 Logistic Regression Interpretation Let’s start with a Logistic Regression model. Logistic regression uses a logistic function to predict the probability that a given input belongs to a particular class. Like linear regression, it learns coefficients for each feature. 
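In symbols (a standard formulation of logistic regression, not notation taken from this book), the predicted probability of the positive class is

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$

where the $\beta_j$ are the learned coefficients.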
These coefficients are multiplied by the feature values and summed to make a prediction. It is a white box model because the result comes from the match in the previous sentence. In fact, you can use these coefficients to determine how features impact the model (by looking at the coefficient’s sign) and the impact’s magnitude (by looking at the magnitude of the coefficient, assuming that the features are standardized). To aid with the interpretation of the coefficients, we standardize the features before running logistic regression. Standardization is a technique used to transform data with a mean of zero and a standard deviation of one. Standardization is helpful for machine learning because it helps ensure that all features are on a similar scale, which can improve the performance or behavior of certain models, such as linear regression, logistic regression, support vector machines, and neural networks. Let’s look at how this model works with our data (we set penalty=None to turn off the regularization of this model): 133 17. Model Interpretation >>> from sklearn import linear_model, preprocessing >>> std = preprocessing.StandardScaler() >>> lr = linear_model.LogisticRegression(penalty=None) >>> lr.fit(std.fit_transform(X_train), y_train) >>> lr.score(std.transform(X_test), y_test) 0.7337016574585635 In this case, logistic regression gives us a similar score to XGBoost’s default model. If I had data that performed the same with both logistic regression and XGBoost, I would use the logistic regression model because it is simpler. Given two equivalent results, I will choose the simpler version. Logistic regression also is a white box model and is easy to explain. Let’s look at the .coef_ attribute. These are the feature coefficients or weights that we learned during the fitting of the model. (This uses scikit learn’s convention of adding an underscore to the attribute learned while training the model.) >>> lr.coef_ array([[-1.56018160e-01, -1.45213121e-01, 3.11683777e-02, -4.59272439e-04, -4.48524110e-03, -1.79149729e-01, -4.01817103e-01, -8.13849902e-02, 3.16120596e-02, -8.21683100e-03, 1.01853988e-01, 2.41389081e-02, 6.01542610e-01, -6.03727624e-01, -3.14510213e-02, -5.27737710e-02, 3.49376790e-01, -3.37424750e-01]]) This is a NumPy array and a little hard to interpret on its own. I prefer to stick this in a Pandas series and add the corresponding column names to the values. You can make a bar plot to visualize them. fig, ax = plt.subplots(figsize=(8, 4)) (pd.Series(lr.coef_[0], index=X_train.columns) .sort_values() .plot.barh(ax=ax) ) The wider the bar, the higher the impact of the feature. Positive values push towards the positive label, in our case Software Engineer. Negative labels push towards the negative label, Data Scientist. The years of experience column correlates with software engineering, and using the R language correlates with data science. Also, the Q1_Prefer not to say feature does not have much impact on this model. Given this result, I might try to make an even simpler model. A model that only considers features that have an absolute magnitude above 0.2. 17.2 Decision Tree Interpretation A decision tree is also a white box model. I will train a decision tree with a depth of seven and explore the feature importance results. These values are learned from fitting the model, and similar to coefficients in the logistic regression model, they give us some insight into how the features impact the model. 
>>> tree7 = tree.DecisionTreeClassifier(max_depth=7) >>> tree7.fit(X_train, y_train) >>> tree7.score(X_test, y_test) 0.7337016574585635 134 17.2. Decision Tree Interpretation Figure 17.1: Logistic Regression feature coefficients Let’s inspect the .feature_importances_ attribute. This is an array of values that indicate how important each feature is in making predictions using a decision tree model. It is learned during the call to .fit (hence the trailing underscore). Like logistic regression, it gives a magnitude of importance. In scikit-learn, it is calculated from the decrease in Gini importance when making a split on a feature. Unlike logistic regression, it does not show the direction (this is because a tree, unlike logistic regression, can capture non-linear relationships and use a feature to direct results toward both positive and negative labels). fig, ax = plt.subplots(figsize=(8, 4)) (pd.Series(tree7.feature_importances_, index=X_train.columns) .sort_values() .plot.barh(ax=ax) ) It is interesting to note that the feature importances of the decision tree are not necessarily the same as the coefficients of the logistic regression model. I will plot the first three levels of the tree to compare the nodes to the feature importance scores. import dtreeviz dt3 = tree.DecisionTreeClassifier(max_depth=3) dt3.fit(X_train, y_train) viz = dtreeviz.model(dt3, X_train=X_train, y_train=y_train, feature_names=list(X_train.columns), target_name='Job', class_names=['DS', 'SE']) viz.view() 135 17. Model Interpretation Figure 17.2: Decision tree feature importance Figure 17.3: Decision tree visualization using dtreeviz. 136 17.3. XGBoost Feature Importance 17.3 XGBoost Feature Importance The XGBoost model also reports feature importance. The .feature_importances_ attribute reports the normalized gain across all the trees when the feature is used. xgb_def = xgb.XGBClassifier() xgb_def.fit(X_train, y_train) fig, ax = plt.subplots(figsize=(8, 4)) (pd.Series(xgb_def.feature_importances_, index=X_train.columns) .sort_values() .plot.barh(ax=ax) ) Figure 17.4: XGBoost default feature importance You can also visualize the feature importance directly from XGBoost. The .plot_importance method has a importance_type parameter to change how importance is measured. • The 'gain' feature importance is calculated as the total gain in the model’s performance that results from using a feature. • The 'weight' feature importance is calculated as the number of times a feature is used in the model. • The 'cover' feature importance is calculated as the number of samples that are affected by a feature. fig, ax = plt.subplots(figsize=(8, 4)) xgb.plot_importance(xgb_def, importance_type='cover', ax=ax) 137 17. Model Interpretation Figure 17.5: XGBoost feature importance for cover importance 17.4 Surrogate Models Another way to tease apart the XGBoost model is to train a decision tree to its predictions and then explore the interpretable decision tree. Let’s do that. from sklearn import tree sur_reg_sk = tree.DecisionTreeRegressor(max_depth=4) sur_reg_sk.fit(X_train, xgb_def.predict_proba(X_train)[:,-1]) Let’s export the tree to examine it. If you want to make a tree that goes from left to right with scikit-learn, you need to use the export_graphviz function. Note I ran the following code to convert the DOT file to a PNG. 
tree.export_graphviz(sur_reg_sk, out_file='img/sur-sk.dot', feature_names=X_train.columns, filled=True, rotate=True, fontname='Roboto Condensed') You can use the dot command to generate the PNG: dot -Gdpi=300 -Tpng -oimg/sur-sk.png img/sur-sk.dot # HIDE This surrogate model can provide insight into interactions. Nodes that split on a different feature than a parent node often have an interaction. It looks like R has an interaction with major_cs. Also, the years_exp and education columns seem to interact. We will explore those in the SHAP chapter. You can also learn about non-linearities with surrogate models. When the same node feature follows itself, that often indicates a non-linear relationship with the target. 138 17.4. Surrogate Models Figure 17.6: A surrograte tree of the model. You can look for the whitest and orangest nodes and trace the path to understand the prediction. 139 17. Model Interpretation 17.5 Summary Interpretability in machine learning refers to understanding how a model makes predictions. A model is considered interpretable if it can explain its workings regarding the input features and the mathematical operations that the model uses to make predictions. Interpretability is important because it allows us to understand how a model arrives at its predictions and identify any potential biases or errors. This can be useful for improving the model or for explaining the model’s predictions to decision-makers. Feature importance in XGBoost measures each feature’s importance in making predictions using a decision tree model. XGBoost provides three feature importances: weight, gain, and cover. These feature importances are calculated differently and can provide different insights about the model. 17.6 Exercises 1. What is the difference between white box and black box models regarding interpretation? 2. How is logistic regression used to make predictions and how can the coefficients be interpreted to understand the model’s decision-making process? 3. How can decision trees be interpreted, and how do they visualize the model’s decisionmaking process? 4. What are some potential limitations of using interpretation techniques to understand the decision-making process of a machine learning model? 5. How can the feature importance attribute be used to interpret the decision-making process of a black box model? 6. In what situations might it be more appropriate to use a white box model over a black box model, and vice versa? 140 Chapter 18 xgb昀椀r (Feature Interactions Reshaped) In this chapter, we will explore how features relate to each other and the labels. We will define feature interactions, show how to find them in XGBoost, and then introduce interaction constraints to see if it improves our model. 18.1 Feature Interactions When you have multiple columns in a machine-learning model, you may want to understand how those columns interact with each other. Statisticians say that X1 and X2 interact when the X1 changes are not constant in the change to y but also depend on X2 . You could represent this as a formula when using linear or logistic regression. You can create a new column that multiplies the two columns together: y = aX1 + bX2 + cX1 X2 In decision trees, interaction is a little different. A feature, X1 does not interact with another feature, X2 , if X2 is not in the path, or if X2 is used the same in each path. Otherwise, there is an interaction, and both features combined to impact the result. 
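For a linear or logistic regression, you can materialize that interaction explicitly as a product column before fitting. A minimal sketch with hypothetical column names x1 and x2:

import pandas as pd

df = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0, 1, 0, 1]})
# The product column gives the model a separate coefficient (the c term
# in the formula above) for the combined effect of x1 and x2.
df = df.assign(x1_x2=df.x1 * df.x2)

Tree models like XGBoost do not need this extra column because they can capture the interaction through consecutive splits.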
18.2 xgb昀椀r The xgbfir library will allow us to find the top interactions from our data in an XGBoost model. You will have to install this third-party library and also the openpyxl library (a library to read Excel files) to explore the feature interactions. I’m not going to lie. The interface of this library is a little odd. You create a model, then run the saveXgbFI function, and it saves an Excel file with multiple sheets with the results. Let’s try it out: import xgbfir xgbfir.saveXgbFI(xgb_def, feature_names=X_train.columns, OutputXlsxFile='fir.xlsx') Let’s read the first sheet. This sheet lists all of the features from our data. It includes a bunch of other columns that include metric values and the rank of the metrics. The final order of the rows is the average rank of the metrics. Here are the columns: • Gain: Total gain of each feature or feature interaction • FScore: Amount of possible splits taken on a feature or feature Interaction 141 18. xgbfir (Feature Interactions Reshaped) • wFScore: Amount of possible splits taken on a feature or feature interaction weighted by the probability of the splits to take place • Average wFScore: wFScore divided by FScore • Average Gain: Gain divided by FScore • Expected Gain: Total gain of each feature or feature interaction weighted by the probability of the gain For each of the above columns, another column shows the rank. Finally, there is the Average Rank, the Average Tree Index, and the Average Tree Depth columns. Let’s sort the features by the Average Rank column: >>> >>> ... ... ... ... 2 0 5 1 4 fir = pd.read_excel('fir.xlsx') print(fir .sort_values(by='Average Rank') .head() .round(1) ) Interaction Gain FScore ... Average Rank Average Tree Index \ r 517.8 84 ... 3.3 44.6 years_exp 597.0 627 ... 4.5 45.1 education 296.0 254 ... 4.5 45.2 compensation 518.5 702 ... 4.8 47.5 major_cs 327.1 96 ... 5.5 48.9 2 0 5 1 4 Average Tree Depth 2.6 3.7 3.3 3.7 3.6 [5 rows x 16 columns] This looks like the R and years_exp features are important for our model. However, this chapter is about interactions, and this isn’t showing interactions, just a single column. So we need to move on to the other sheets found in the Excel file. The Interaction Depth 1 sheet shows how pairs of columns interact. Let’s look at it: >>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 1').iloc[:20] ... .sort_values(by='Average Rank') ... .head(10) ... .round(1) ... ) Interaction Gain FScore wFScore Average wFScore \ 1 education|years_exp 523.8 106 14.8 0.1 0 major_cs|r 1210.8 15 5.4 0.4 6 compensation|education 207.2 103 18.8 0.2 11 age|education 133.2 80 27.2 0.3 3 major_cs|years_exp 441.3 36 4.8 0.1 142 18.2. 
xgbfir 5 4 15 14 18 age|years_exp age|compensation major_stat|years_exp education|r age|age 316.3 344.7 97.7 116.5 90.5 216 219 32 14 66 43.9 38.8 6.7 4.6 24.7 0.2 0.2 0.2 0.3 0.4 1 0 6 11 3 5 4 15 14 18 Average Gain Expected Gain Gain Rank FScore Rank wFScore Rank \ 4.9 77.9 2 5 8 80.7 607.6 1 45 20 2.0 34.0 7 6 7 1.7 25.6 12 8 4 108.2 20 12.3 4 25 1.5 44.0 6 3 1 1.6 30.6 5 2 2 3.1 20.4 16 25 15 8.3 72.3 15 52 27 1.4 16.6 19 11 6 1 0 6 11 3 5 4 15 14 18 Avg wFScore Rank Avg Gain Rank Expected Gain Rank Average Rank \ 43 8 3 11.5 8 1 1 12.7 32 25 9 14.3 12 40 13 14.8 46 3 2 16.7 26 57 7 16.7 34 48 11 17.0 24 14 14 18.0 13 5 4 19.3 7 62 16 20.2 1 0 6 11 3 5 4 15 14 18 Average Tree Index Average Tree Depth 38.0 3.5 12.3 1.6 50.6 3.7 38.8 3.6 29.2 3.2 45.6 3.9 48.9 3.9 25.5 3.1 40.4 2.4 48.0 3.6 This tells us that education is often followed by years_exp and major_cs is often followed by R. This hints that these features might have an impact on each other. Let’s explore that a little more by exploring the relationship between the features. We can create a correlation heatmap with both Pandas and Seaborn. Here’s the pandas version which is useful inside of Jupyter. (X_train .assign(software_eng=y_train) .corr(method='spearman') 143 18. xgbfir (Feature Interactions Reshaped) .loc[:, ['education', 'years_exp', 'major_cs', 'r', 'compensation', 'age']] .style .background_gradient(cmap='RdBu', vmin=-1, vmax=1) .format('{:.2f}') ) import seaborn as sns fig, ax = plt.subplots(figsize=(8, 4)) sns.heatmap(X_train .assign(software_eng=y_train) .corr(method='spearman') .loc[:, ['age','education', 'years_exp', 'compensation', 'r', 'major_cs', 'software_eng']], cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax ) Figure 18.1: Correlation heatmap between subset of features Interestingly, the correlation between years_exp and education is close to 0.10. There may be a non-linear relationship that a tree model can tease apart. Let’s explore that with a scatter plot. We will use Seaborn to add a fit line and color the plot based on the label. import seaborn.objects as so fig = plt.figure(figsize=(8, 4)) (so .Plot(X_train.assign(software_eng=y_train), x='years_exp', y='education', color='software_eng') .add(so.Dots(alpha=.9, pointsize=2), so.Jitter(x=.7, y=1)) .add(so.Line(), so.PolyFit()) .scale(color='viridis') .on(fig) # not required unless saving to image .plot() # ditto 144 18.2. xgbfir ) Figure 18.2: Scatter plot of education vs years_exp It looks like the education levels for software engineers are lower than for data scientists. Now, let’s explore the relationships between major_cs and R and the target label. This has three dimensions of categorical data. We can use Pandas grouping or the pd.crosstab shortcut to quantify this: >>> print(X_train ... .assign(software_eng=y_train) ... .groupby(['software_eng', 'r', 'major_cs']) ... .age ... .count() ... .unstack() ... .unstack() ... ) major_cs 0 1 r 0 1 0 1 software_eng 0 410 390 243 110 1 308 53 523 73 This shows that if the user didn’t use R, they have a higher probability of being a software engineer. Here is the pd.crosstab version: >>> both = (X_train ... .assign(software_eng=y_train) ... ) >>> print(pd.crosstab(index=both.software_eng, columns=[both.major_cs, both.r])) major_cs 0 1 r 0 1 0 1 145 18. xgbfir (Feature Interactions Reshaped) software_eng 0 1 410 390 243 110 308 53 523 73 We can also visualize this. Here is a visualization of this data using a slope graph. 
My interpretation of this data is that not using R is a strong indication of the label “software engineer”. Here is the interaction between using R and not studying CS. It is an indicator of the “data science” label. However, studying CS and using R is not quite so clear. You can see why the tree might want to look at the interaction between these features rather than just looking at them independently. This plot is a little more involved than most in this book, but I think it illustrates the point well. fig, grey blue font ax = plt.subplots(figsize=(8, 4)) = '#999999' = '#16a2c6' = 'Roboto' data = (X_train .assign(software_eng=y_train) .groupby(['software_eng', 'r', 'major_cs']) .age .count() .unstack() .unstack()) (data .pipe(lambda adf: adf.iloc[:,-2:].plot(color=[grey,blue], linewidth=4, ax=ax, legend=None) and adf) .plot(color=[grey, blue, grey, blue], ax=ax, legend=None) ) ax.set_xticks([0, 1], ['Data Scientist', 'Software Engineer'], font=font, size=12, weight=600) ax.set_yticks([]) ax.set_xlabel('') ax.text(x=0, y=.93, s="Count Data Scientist or Software Engineer by R/CS", transform=fig.transFigure, ha='left', font=font, fontsize=10, weight=1000) ax.text(x=0, y=.83, s="(Studied CS) Thick lines\n(R) Blue", transform=fig.transFigure, ha='left', font=font, fontsize=10, weight=300) for side in 'left,top,right,bottom'.split(','): ax.spines[side].set_visible(False) # labels for left,txt in zip(data.iloc[0], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']): ax.text(x=-.02, y=left, s=f'{txt} ({left})', ha='right', va='center', font=font, weight=300) for right,txt in zip(data.iloc[1], ['Other/No R', 'Other/R', 'CS/No R', 'CS/R']): ax.text(x=1.02, y=right, s=f'{txt} ({right})', ha='left', va='center', font=font, weight=300) 146 18.3. Deeper Interactions Figure 18.3: Slopegraph of R vs study CS 18.3 Deeper Interactions You can get interactions with three columns from the Interaction Depth 2 sheet. >>> print(pd.read_excel('fir.xlsx', sheet_name='Interaction Depth 2').iloc[:20] ... .sort_values(by='Average Rank') ... .head(5) ... ) Interaction Gain FScore ... Average Rank \ 0 major_cs|r|years_exp 1842.711375 17 ... 12.000000 7 age|education|years_exp 267.537987 53 ... 15.666667 13 age|compensation|education 154.313245 55 ... 15.833333 2 compensation|education|years_exp 431.541357 91 ... 17.166667 14 education|r|years_exp 145.534591 17 ... 19.000000 0 7 13 2 14 Average Tree Index Average Tree Depth 2.588235 2.117647 31.452830 3.981132 47.381818 3.800000 47.175824 4.010989 34.352941 2.588235 [5 rows x 16 columns] I’m not going to explore these here. Again, these interactions can lead to insights about how various features relate to each other that would be hard to come by without this analysis. 147 18. xgbfir (Feature Interactions Reshaped) 18.4 Specifying Feature Interactions The XGBoost library can limit feature interactions. You can specify a list of features that can only be in trees with other limited features. This could lead to models that are more robust, simpler or comply with regulations. Let’s train models with limited interactions and a default model and see how the performance compares. I will take the top entries from the interactions and put them in a nested list. Each entry in this list shows which groups of columns are able to follow other columns in a given tree. For example, education is in the first, third, and fourth lists. This means that education can be followed by years_exp, compensation, or age columns, but not the remaining columns. 
constraints = [['education', 'years_exp'], ['major_cs', 'r'], ['compensation', 'education'], ['age', 'education'], ['major_cs', 'years_exp'], ['age', 'years_exp'], ['age', 'compensation'], ['major_stat', 'years_exp'], ] I’ll write some code to filter out the columns that I want to use, so I can feed data with just those columns into my model. def flatten(seq): res = [] for sub in seq: res.extend(sub) return res small_cols = sorted(set(flatten(constraints))) >>> print(small_cols) ['age', 'compensation', 'education', 'major_cs', 'major_stat', 'r', 'years_exp'] >>> xg_constraints = xgb.XGBClassifier(interaction_constraints=constraints) >>> xg_constraints.fit(X_train.loc[:, small_cols], y_train) >>> xg_constraints.score(X_test.loc[:, small_cols], y_test) 0.7259668508287292 It looks like this model does ok when it uses 7 of the 18 features. But it does not exhibit better performance than the default model (.745) trained on all of the data. Here is a visualization of the first tree. You can see that the r column is only following by major_cs in this tree. my_dot_export(xg_constraints, num_trees=0, filename='img/constrains0_xg.dot', title='First Constrained Tree') 18.5 Summary In this chapter, we explored feature interactions. Exploring feature interactions can give you a better sense for how the columns in the data impact and relate to each other. 148 18.6. Exercises Figure 18.4: First constrained tree 18.6 Exercises 1. 2. 3. 4. 5. What is an interaction in a machine learning model? What is the xgbfir library used for in machine learning? How is the Average Rank column calculated in the xgbfir library? What is the purpose of the Interaction Depth 1 sheet in the xgbfir library? How can the xgbfir library be used to find the top feature interactions in an XGBoost model? 6. What is the purpose of using interaction constraints in a machine learning model? 7. What are the potential drawbacks of using interaction constraints in a machine learning model? 149 Chapter 19 Exploring SHAP Have you ever wanted to peek inside the “black box” of a machine learning model and see how it’s making predictions? While feature importance ranks the features, it doesn’t indicate the direction of the impact. Importance scores also can’t model non-linear relationships. For example, when I worked with solid-state storage devices, there was a “bathtub curve” relationship between usage and lifetime. Devices with little usage often had infant mortality issues, but once the device had survived “burn-in”, they often lasted until end-of-life failure. A feature importance score cannot help to tell this story. However, using a tool like SHAP can. The shap library is a useful tool for explaining black box models (and white box models too). SHapley Additive Explanations (SHAP) is a mechanism that explains a model’s global and local behavior. At a global level, SHAP can provide a rank order of feature importance and how the feature impacts the results. It can model non-linear relationships. At the local level of an individual prediction, SHAP can explain how each feature contributes to the final prediction. 19.1 SHAP SHAP works with both classification and regression models. SHAP analyzes a model and predictions from it and outputs a value for every feature of an example. The values are determined using game theory and indicate how to distribute attribution among the features to the target. In the case of classification, these values add up to the log odds probability of the positive label. 
(For regression cases, SHAP values sum to the target prediction.) One of the key advantages of the SHAP algorithm is that it can provide explanations for both white box models, where the internal workings of the model are known, and black box models, where the inner workings of the model are not known. It can also handle complex models and interactions between features and has been shown to provide accurate and consistent explanations in various settings. Let’s look at using this library with our step-tuned model. Make sure you have installed the library as it does not come with Python. step_params = {'random_state': 42, 'max_depth': 5, 'min_child_weight': 0.6411044640540848, 'subsample': 0.9492383155577023, 'colsample_bytree': 0.6235721099295888, 'gamma': 0.00011273797329538491, 'learning_rate': 0.24399020050740935} xg_step = xgb.XGBClassifier(**step_params, early_stopping_rounds=50, 151 19. Exploring SHAP n_estimators=500) xg_step.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test) ] ) The shap library works well in Jupyter. It provides JavaScript visualizations that allow some interactivity. To enable the JavaScript extension, we need to run the initjs function. Then, we create an instance of a TreeExplainer. This can provide the SHAP values for us. The TreeExplainer instance is callable and returns an Explanation object. This object has an attribute, .values, with the SHAP values for every feature for each sample. import shap shap.initjs() shap_ex = shap.TreeExplainer(xg_step) vals = shap_ex(X_test) I’ll stick the values in a DataFrame so we can see what they look like. >>> shap_df = pd.DataFrame(vals.values, columns=X_test.columns) >>> print(shap_df) age education years_exp compensation python r \ 0 0.426614 0.390184 -0.246353 0.145825 -0.034680 0.379261 1 0.011164 -0.131144 -0.292135 -0.014521 0.016003 -1.043464 2 -0.218063 -0.140705 -0.411293 0.048281 0.424516 0.487451 3 -0.015227 -0.299068 -0.426323 -0.205840 -0.125867 0.320594 4 -0.468785 -0.200953 -0.230639 0.064272 0.021362 0.355619 .. ... ... ... ... ... ... 900 0.268237 -0.112710 0.330096 -0.209942 0.012074 -1.144335 901 0.154642 0.572190 -0.227121 0.448253 -0.057847 0.290381 902 0.079129 -0.095771 1.136799 0.150705 0.133260 0.484103 903 -0.206584 0.430074 -0.385100 -0.078808 -0.083052 -0.992487 904 0.007351 0.589351 1.485712 0.056398 -0.047231 0.373149 0 1 2 3 4 .. 900 901 902 903 904 0 1 152 sql -0.019017 0.020524 -0.098703 -0.062712 -0.083344 ... -0.065815 -0.069114 -0.120819 -0.088811 -0.105290 Q1_Male Q1_Female 0.004868 0.000877 0.039019 0.047712 -0.004710 0.063545 0.019110 0.012257 -0.017202 0.002754 ... ... 0.028274 0.032291 0.006243 0.007443 0.012034 0.057516 0.080561 0.028648 0.029283 0.074762 Q1_Prefer not to say \ 0.002111 0.001010 0.000258 0.002184 0.001432 ... 0.001012 0.002198 0.000266 0.000876 0.001406 Q1_Prefer to self-describe Q3_United States of America Q3_India 0.0 0.033738 -0.117918 0.0 0.068171 0.086444 \ 19.1. SHAP 2 3 4 .. 900 901 902 903 904 0 1 2 3 4 .. 900 901 902 903 904 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 Q3_China -0.018271 -0.026271 -0.010548 -0.024099 -0.022188 ... 0.310404 -0.008244 0.003234 -0.031448 0.008734 [905 rows x 18 age 0 0.426614 1 0.011164 2 -0.218063 3 -0.015227 4 -0.468785 .. ... 900 0.268237 901 0.154642 902 0.079129 903 -0.206584 904 0.007351 0 1 2 3 4 .. 900 901 902 903 904 0 sql -0.019017 0.020524 -0.098703 -0.062712 -0.083344 ... -0.065815 -0.069114 -0.120819 -0.088811 -0.105290 major_cs 0.369876 -0.428484 -0.333695 0.486864 0.324419 ... 
-0.407444 0.602087 -0.313785 -0.524141 -0.505613 columns] education 0.390184 -0.131144 -0.140705 -0.299068 -0.200953 ... -0.112710 0.572190 -0.095771 0.430074 0.589351 major_other 0.014006 -0.064157 0.016919 0.038438 0.012664 ... -0.013195 0.039680 -0.080046 -0.048108 -0.159411 0.005533 -0.000044 0.035772 ... -0.086408 -0.074364 0.103810 0.045213 -0.031587 major_eng -0.013465 -0.026041 -0.026932 -0.013727 -0.019550 ... -0.026412 -0.012820 -0.066032 -0.007185 -0.067388 -0.105534 0.042814 -0.073206 ... 0.136677 0.115520 -0.097848 0.066553 0.117050 major_stat 0.104177 0.069931 -0.591922 0.047564 0.093926 ... -0.484734 0.083934 0.101975 0.093196 0.126560 years_exp compensation python r \ -0.246353 0.145825 -0.034680 0.379261 -0.292135 -0.014521 0.016003 -1.043464 -0.411293 0.048281 0.424516 0.487451 -0.426323 -0.205840 -0.125867 0.320594 -0.230639 0.064272 0.021362 0.355619 ... ... ... ... 0.330096 -0.209942 0.012074 -1.144335 -0.227121 0.448253 -0.057847 0.290381 1.136799 0.150705 0.133260 0.484103 -0.385100 -0.078808 -0.083052 -0.992487 1.485712 0.056398 -0.047231 0.373149 Q1_Male Q1_Female Q1_Prefer not to say \ 0.004868 0.000877 0.002111 0.039019 0.047712 0.001010 -0.004710 0.063545 0.000258 0.019110 0.012257 0.002184 -0.017202 0.002754 0.001432 ... ... ... 0.028274 0.032291 0.001012 0.006243 0.007443 0.002198 0.012034 0.057516 0.000266 0.080561 0.028648 0.000876 0.029283 0.074762 0.001406 Q1_Prefer to self-describe Q3_United States of America Q3_India \ 0.0 0.033738 -0.117918 153 19. Exploring SHAP 1 2 3 4 .. 900 901 902 903 904 0 1 2 3 4 .. 900 901 902 903 904 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 Q3_China -0.018271 -0.026271 -0.010548 -0.024099 -0.022188 ... 0.310404 -0.008244 0.003234 -0.031448 0.008734 major_cs major_other major_eng 0.369876 0.014006 -0.013465 -0.428484 -0.064157 -0.026041 -0.333695 0.016919 -0.026932 0.486864 0.038438 -0.013727 0.324419 0.012664 -0.019550 ... ... ... -0.407444 -0.013195 -0.026412 0.602087 0.039680 -0.012820 -0.313785 -0.080046 -0.066032 -0.524141 -0.048108 -0.007185 -0.505613 -0.159411 -0.067388 0.068171 0.005533 -0.000044 0.035772 ... -0.086408 -0.074364 0.103810 0.045213 -0.031587 0.086444 -0.105534 0.042814 -0.073206 ... 0.136677 0.115520 -0.097848 0.066553 0.117050 major_stat 0.104177 0.069931 -0.591922 0.047564 0.093926 ... -0.484734 0.083934 0.101975 0.093196 0.126560 [905 rows x 18 columns] If you add up each row and add the .base_values attribute (this is the default guess for the model) you will get the log odds value that the sample is in the positive case. I will stick the sum in a dataframe with the column name pred. I will also include the ground truth value and convert the log odds sum into a probability column, prob. If the probability is above .5 (this happens when pred is positive), we predict the positive value. >>> print(pd.concat([shap_df.sum(axis='columns') ... .rename('pred') + vals.base_values, ... pd.Series(y_test, name='true')], axis='columns') ... .assign(prob=lambda adf: (np.exp(adf.pred) / ... (1 + np.exp(adf.pred)))) ... ) pred true prob 0 1.204692 1 0.769358 1 -2.493559 0 0.076311 2 -2.205473 0 0.099260 3 -0.843847 1 0.300725 4 -0.168726 1 0.457918 .. ... ... ... 900 -1.698727 0 0.154632 901 1.957872 0 0.876302 902 0.786588 0 0.687098 903 -2.299702 0 0.091148 904 1.497035 1 0.817132 154 19.2. Examining a Single Prediction [905 rows x 3 columns] Note that for index rows 3, 4, 901, and 902, our model makes incorrect predictions. 
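As a sanity check, this decomposition can be compared with the model’s raw output. Here is a minimal sketch, assuming the xg_step model, the vals explanation, and the shap_df DataFrame from above; exact agreement also assumes the explainer and the booster use the same number of trees, which may not hold if early stopping trimmed the model.

import numpy as np

# Raw log odds from the booster for every test row.
raw_margin = xg_step.predict(X_test, output_margin=True)

# The SHAP values plus the base value should rebuild the same log odds.
rebuilt = shap_df.sum(axis='columns') + vals.base_values
print(np.allclose(raw_margin, rebuilt, atol=1e-3))

# Converting the log odds to a probability should match .predict_proba
# for the positive (Software Engineer) class.
prob = 1 / (1 + np.exp(-rebuilt))
print(np.allclose(prob, xg_step.predict_proba(X_test)[:, -1], atol=1e-3))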
19.2 Examining a Single Prediction If we want to explore how shap attributes the prediction from the features, you can create a few visualizations. Let’s explore the first entry in the test data. It looks like this: >>> X_test.iloc[0] age education years_exp compensation python r sql Q1_Male Q1_Female Q1_Prefer not to say Q1_Prefer to self-describe Q3_United States of America Q3_India Q3_China major_cs major_other major_eng major_stat Name: 7894, dtype: float64 22.0 16.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 Our model predicts a value of 1 (which is also the ground truth), which corresponds to a Software Engineer. Is that because of age, education, R, or something else? >>> # predicts software engineer... why? >>> xg_step.predict(X_test.iloc[[0]]) array([1]) >>> # ground truth >>> y_test[0] 1 We can also do some math to validate the SHAP values. We start off with the expected value. >>> # Since this is below zero, the default is Data Scientist >>> shap_ex.expected_value -0.2166416 Then we add the sum of the values from the row to the expected value. 155 19. Exploring SHAP >>> # > 0 therefore ... Software Engineer >>> shap_ex.expected_value + vals.values[0].sum() 1.2046916 Because this value is above 0, we would predict Software Engineer for this sample. 19.3 Waterfall Plots Let’s make a waterfall plot to explore the SHAP values. This plots an explanation of a single prediction. It displays how the SHAP value from each column impacts the result. You can see a vertical line at -0.21. This represents the default or base value. This is the right edge of the bottom bar. By default, our model would predict Data Scientist. But we need to look at each of the features and examine how they push the prediction. The waterfall plot is rank ordered. The feature with the most impact is shown at the top. The next most important feature is second, and so on. The age value of 22 gave a SHAP value of 0.43. This pushes the result towards the positive case. The R value, 0.0, also pushes toward the positive case with a magnitude of 0.38. We repeat this and see that the values add up to 1.204 (in the upper right). (You can confirm this from our addition above.) fig = plt.figure(figsize=(8, 4)) shap.plots.waterfall(vals[0], show=False) To get a feel for how the values of this sample relate to the other samples, we can plot a histogram showing the distributions of the values and mark our value. I will write a function to do this, plot_histograms. def plot_histograms(df, columns, row=None, title='', color='shap'): """ Parameters ---------df : pandas.DataFrame The DataFrame to plot histograms for. columns : list of str The names of the columns to plot histograms for. row : pandas.Series, optional A row of data to plot a vertical line for. title : str, optional The title to use for the figure. color : str, optional 'shap' - color positive values red. Negative blue 'mean' - above mean red. Below blue. None - black Returns ------matplotlib.figure.Figure The figure object containing the histogram plots. """ red = '#ff0051' blue = '#008bfb' 156 19.3. 
Waterfall Plots Figure 19.1: SHAP waterfall plot of first test sample fig, ax = plt.subplots(figsize=(8, 4)) hist = (df [columns] .hist(ax=ax, color='#bbb') ) fig = hist[0][0].get_figure() if row is not None: name2ax = {ax.get_title():ax for ax in fig.axes} pos, neg = red, blue if color is None: pos, neg = 'black', 'black' for column in columns: if color == 'mean': mid = df[column].mean() else: mid = 0 if row[column] > mid: c = pos else: c = neg name2ax[column].axvline(row[column], c=c) 157 19. Exploring SHAP fig.tight_layout() fig.suptitle(title) return fig I will use this function to show where the features of the individual sample are located among the distributions. features = ['education', 'r', 'major_cs', 'age', 'years_exp', 'compensation'] fig = plot_histograms(shap_df, features, shap_df.iloc[0], title='SHAP values for row 0') Figure 19.2: Histograms of SHAP values for first row We can also use this function to visualize the histograms of the original feature values. fig = plot_histograms(X_test, features, X_test.iloc[0], title='Values for row 0', color='mean') We can create a bar plot of the SHAP values with the Pandas library. fig, ax = plt.subplots(figsize=(8, 4)) (pd.Series(vals.values[0], index=X_test.columns) .sort_values(key=np.abs) .plot.barh(ax=ax) ) 158 19.3. Waterfall Plots Figure 19.3: Histograms of original values for first row Figure 19.4: Bar plot of SHAP values for first test sample 159 19. Exploring SHAP 19.4 A Force Plot The shap library also provides a flattened version of the waterfall plot, called a force plot. (In my example, I show matplotlib=True, but in Jupyter, you can leave that out and the plot will have some interactivity.) # use matplotlib if having js issues # blue - DS # red - Software Engineer # to save need both matplotlib=True, show=False res = shap.plots.force(base_value=vals.base_values, shap_values=vals.values[0,:], features=X_test.iloc[0], matplotlib=True, show=False ) res.savefig('img/shap_forceplot0.png', dpi=600, bbox_inches='tight') Figure 19.5: Force plot of SHAP values for first test sample 19.5 Force Plot with Multiple Predictions The shap library allows you to pass in multiple rows of SHAP values into the force function. In that case, it flips them vertically and stacks them next to each other. There currently is no Matplotlib version for this plot, and the Jupyter/JavaScript version has a dropdown to change the order of the results and a dropdown to change what features it shows. # First n values n = 100 # blue - DS # red - Software Engineer shap.plots.force(base_value=vals.base_values, shap_values=vals.values[:n,:], features=X_test.iloc[:n], ) 19.6 Understanding Features with Dependence Plots Because the shap library has computed SHAP values for every feature, we can use that to visualize how features impact the model. The Dependence Scatter Plot shows the SHAP values across the values for a feature. This plot can be a little confusing at first, but when you take a moment to understand it, it becomes very useful. Let’s pick one of the features that had a big impact on our model, education. 160 19.6. Understanding Features with Dependence Plots Figure 19.6: Force plot of SHAP values for first 100 test samples fig, ax = plt.subplots(figsize=(8, 4)) shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals, x_jitter=0, hist=False) Figure 19.7: Dependence plot of SHAP values for the education feature 161 19. Exploring SHAP Let’s try and make sense of this plot. In the x-axis, we have different entries for education. 
In the y-axis, we have the different SHAP values for each education value. You might wonder why the y values are different. This is because the education column interacts with other columns, so sometimes a value of 19 pushes toward Data Scientist, and sometimes it moves towards Software Engineer. Remember that y values above 0 will push toward the positive label, and negative values push toward the negative label. My interpretation of this plot is that there is a non-linear relationship between education and the target. Education values push towards the negative label quicker as the value increases. Here are a couple of other things to note about this plot. The shap library automatically chooses an interaction index, another column to plot in a color gradient. In this example, it decided on the age feature. This allows you to visualize how another column interacts with education. It is a little hard to see, but higher compensation pushes more toward the software engineer label for low education levels. You can specify a different column by setting the color parameter to a single column of values (color=vals[:, 'compensation'])). 19.7 Jittering a Dependence Plot Also, because the education entries are limited to a few values, the scatter plot just plots the values on top of each other. This makes it hard to understand what is really happening. We will address that using the x_jitter parameter to spread the values. I also like to adjust the alpha parameter to adjust the transparency of each dot. By default, every dot has a transparency of 1, which means that it is completely opaque. If you lower the value, it makes them more see-through and lets you see the density better. Here is a plot that sets the alpha to .5, spreads out the values in the x-axis with x_jitter, and shows the interaction with the years_exp column. fig, ax = plt.subplots(figsize=(8, 4)) shap.plots.scatter(vals[:, 'education'], ax=ax, color=vals[:, 'years_exp'], x_jitter=1, alpha=.5) This plot makes it very clear that most of the education entries have values of 16 and 18. Based on the SHAP values, having more schooling pushes toward data science. You can also see a faint histogram. Because most values have an education level above 15, you might want to explore whether the model is overfitting the values for 12 and 13. The value for 19 might warrant exploration as there are few entries there. The jittering and histogram functionality is automatic when you use the scatter function. Let’s explore one more dependence plot showing the impact of the major_cs column. We will choose the R column for the interaction index. fig, ax = plt.subplots(figsize=(8, 4)) shap.plots.scatter(vals[:, 'major_cs'], ax=ax, color=vals[:, 'r'], alpha=.5) This plot indicates that studying computer science pushes the result more toward software engineer than the reverse does toward data scientist. It is interesting to note that adding R to studying CS strengthens the effect (which is the opposite of what I would have expected). 19.8 Heatmaps and Correlations Though not part of the shap package, it can be helpful to look at the heatmaps of the correlations of the SHAP values. 162 19.8. Heatmaps and Correlations Figure 19.8: Dependence plot of SHAP values for education feature with jittering and different interaction column Figure 19.9: Dependence plot of SHAP values for education feature with jittering for major_cs column 163 19. Exploring SHAP Let’s start with the standard correlation of the features (not involving SHAP). 
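Both of the heatmaps that follow are drawn from ordinary pandas correlation matrices, so the same information can be pulled out numerically. Here is a minimal sketch, assuming the X_test, y_test, and shap_df objects from above; top_pairs is a hypothetical helper, not part of the book’s code. The seaborn heatmaps below show the same matrices visually.

import numpy as np

def top_pairs(df, n=5):
    # Return the n most strongly correlated column pairs (Spearman).
    corr = df.corr(method='spearman')
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return (corr
        .where(upper)          # keep each pair once, drop self-correlation
        .stack()
        .sort_values(key=np.abs, ascending=False)
        .head(n))

print(top_pairs(X_test.assign(software_eng=y_test)))   # raw feature values
print(top_pairs(shap_df.assign(software_eng=y_test)))  # SHAP values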
import seaborn as sns fig, ax = plt.subplots(figsize=(8, 4)) sns.heatmap(X_test .assign(software_eng=y_test) .corr(method='spearman') .loc[:, ['age', 'education', 'years_exp', 'compensation', 'r', 'major_cs', 'software_eng']], cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax ) Figure 19.10: Correlation heatmap between subset of features for the testing data This heatmap tells us if the features tend to move in the same or opposite directions. I have added the software_eng column to see if there is a correlation with the prediction. Now, we will create a heatmap of the correlation of the SHAP values for each prediction. This correlation will tell us if two features tend to move the prediction similarly. For example, the SHAP values for both education and major_stat push the prediction label in the same direction. This can help us understand interactions from different points of view. import seaborn as sns fig, ax = plt.subplots(figsize=(8, 4)) sns.heatmap(shap_df .assign(software_eng=y_test) .corr(method='spearman') .loc[:, ['age', 'education', 'years_exp', 'compensation', 'r', 'major_cs', 'software_eng']], cmap='RdBu', annot=True, fmt='.2f', vmin=-1, vmax=1, ax=ax 164 19.9. Beeswarm Plots of Global Behavior ) Figure 19.11: Correlation heatmap between subset of SHAP values for each feature for the testing data Generally, when dealing with correlation heatmaps, I want to look at the dark red and dark blue values (ignoring the self-correlation values). It looks like the SHAP values for compensation tend to rise if the SHAP values for age rise. (Note that this is different from the actual non-SHAP values correlating, though, in the case of compensation and age, it looks like they do.) You could explore these further by doing a scatter plot. 19.9 Beeswarm Plots of Global Behavior One of the reasons that SHAP is so popular is that it not only explains local predictions and feature interactions but also gives you a global view of how features impact the model. Let’s look at the beeswarm or summary plot. This provides a rank-ordered listing of the features that drive the most impact on the final prediction. In the x-axis is the SHAP value. Positive value push towards the positive label. Each feature is colored to indicate high (red) or low (blue) values. The R feature is binary, with only red and blue values. You can see that a high value (1) results in a significant push toward data science. The spread on the red values indicates some interaction with other columns. Another way to understand this is that there are probably some low frequency effects when R is red that cause a large impact on our model. The years_exp feature has gradations in the color because there are multiple options for that feature. It looks like there is a pretty smooth transition from blue to red. Contrast that smooth gradation with education, where you see from the left red, then purple, then red again, then blue. This indicates non-monotonic behavior for the education feature. We saw this in the dependence plot for education above. 165 19. Exploring SHAP fig = plt.figure(figsize=(8, 4)) shap.plots.beeswarm(vals) Figure 19.12: Summary plot of SHAP values for top features If you want to see all of the features, use the max_display parameter. I’m also changing the colormap to a Matplotlib colormap that will work better in grayscale. (This is for all those who bought the grayscale physical edition.) 
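The ordering in the beeswarm can also be checked numerically: ranking features by their mean absolute SHAP value gives the same kind of global importance. Here is a minimal sketch using the shap_df DataFrame built earlier; the full beeswarm with the grayscale-friendly colormap follows below.

# Global importance: mean absolute SHAP value per feature.
global_imp = (shap_df
    .abs()
    .mean()
    .sort_values(ascending=False))
print(global_imp.head(8))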
from matplotlib import cm fig = plt.figure(figsize=(8, 4)) shap.plots.beeswarm(vals, max_display=len(X_test.columns), color=cm.autumn_r) 19.10 SHAP with No Interaction If your model has interactions, SHAP values will reflect them. If we remove interactions, you will simplify your model (by making the max_depth equal to 1). It will also make non-linear responses more clear. Let’s train a model of stumps and look at some of the SHAP plots. no_int_params = {'random_state': 42, 'max_depth': 1 } xg_no_int = xgb.XGBClassifier(**no_int_params, early_stopping_rounds=50, n_estimators=500) 166 19.10. SHAP with No Interaction Figure 19.13: Summary plot of SHAP values for all the features xg_no_int.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test) ] ) The stump model does not per form as well as the default model. But it is close. The default score was 0.74. >>> xg_no_int.score(X_test, y_test) 0.7370165745856354 shap_ind = shap.TreeExplainer(xg_no_int) shap_ind_vals = shap_ind(X_test) Here is the summary plot for the model without interactions. 167 19. Exploring SHAP from matplotlib import cm fig = plt.figure(figsize=(8, 4)) shap.plots.beeswarm(shap_ind_vals, max_display=len(X_test.columns)) Figure 19.14: Summary plot of SHAP values for all the features without interactions It is interesting to observe that the ordering of feature importance changes with this model. Here is the years_exp plot for our model with interactions. You can see a vertical spread in the y-axis from the interactions of years_exp with other columns. (The spread in the x-axis is due to jittering.) fig, ax = plt.subplots(figsize=(8, 4)) shap.plots.scatter(vals[:, 'years_exp'], ax=ax, color=vals[:, 'age'], alpha=.5, x_jitter=1) Here is the dependence plot for the model without interactions. You can see that there is no variation in the y-axis. 168 19.11. Summary Figure 19.15: Dependence plot of SHAP values for years_exp feature with feature interactions and jittering. fig, ax = plt.subplots(figsize=(8, 4)) shap.plots.scatter(shap_ind_vals[:, 'years_exp'], ax=ax, color=shap_ind_vals[:, 'age'], alpha=.5, x_jitter=1) This makes the non-linear response in years_exp very clear. Note that you might want to jitter this is the y-axis to help reveal population density, but shap doesn’t support that. You can change the alpha parameter to help with density. 19.11 Summary SHAP (SHapley Additive exPlanations) is a powerful tool for explaining the behavior of machine learning models, including XGBoost models. It provides insight into which features are most important in making predictions and how each feature contributes to the model’s output. This can be useful for several reasons, including improving the interpretability of a model, identifying potential issues or biases in the data, and gaining a better understanding of how the model makes predictions. In this chapter, we explored using SHAP values to understand predictions and interactions between the features. We also used SHAP values to rank features. SHAP values are an important mechanism for explaining black-box models like XGBoost. 19.12 Exercises 1. How can SHAP be used to explain a model’s global and local behavior? 2. How does the waterfall plot display the impact of each feature on the prediction result? 3. How does the force plot display the impact of features on the model 169 19. Exploring SHAP Figure 19.16: Dependence plot of SHAP values for years_exp feature with no feature interaction and jittering for major_cs column 4. 5. 6. 7. 
170 What does the summary plot display, and how does it help interpret feature importance? How does the dependence plot help to identify non-monotonic behavior? How can the summary plot help identify features with non-monotonic behavior? How can the summary plot help to identify features that have interacted with other features? Chapter 20 Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration In this chapter, we will explore some advanced techniques that can be used to improve the interpretability and performance of XGBoost models. We will discuss Individual Conditional Expectation (ICE) plots and Partial Dependence Plots (PDP). These powerful visualization tools allow us to understand how the input features affect the predictions made by the model. We will also examine how to constrain XGBoost models to prevent overfitting and improve generalization performance. This chapter will provide a comprehensive overview of some important techniques that can be used to extract valuable insights from XGBoost models. 20.1 ICE Plots Individual Conditional Expectation (ICE) plots are a useful tool for visualizing the effect of a single input variable on the output of a machine learning model. In the context of XGBoost, ICE plots can help us understand how each feature contributes to the final predictions made by the model. This section will explore using Python to create ICE plots for XGBoost models. An ICE plot shows the predicted values of a machine learning model for a single observation as the value of a specific input feature varies. In other words, an ICE plot displays the model’s output for a fixed instance while incrementally changing one input feature’s value. Each line in the plot represents the predicted output for a particular instance as the input feature changes. By examining the shape and slope of each line, we can gain insights into how the model uses that input feature to make predictions. ICE plots provide a detailed view of the relationship between a single input feature and the model’s predictions. To create an ICE plot for a single row, follow these steps: 1. Choose an input feature to analyze and select a range of values for that feature to vary. 2. Fix the values of all other input features to the observed values for the instance of interest. 3. Vary the selected input feature across the values chosen in step 1. 4. For each value of the selected input feature, calculate the corresponding prediction of the model for the specific instance of interest. 5. Plot the values obtained in step 4 against the corresponding values of the input feature to create a single line in the ICE plot. You can repeat this process for each row in your data and visualize how changing that feature would impact the final result. 171 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration xgb_def = xgb.XGBClassifier(random_state=42) xgb_def.fit(X_train, y_train) xgb_def.score(X_test, y_test) Let’s make the default model for the r and education features. We will use scikit-learn to create an ICE plot. from sklearn.inspection import PartialDependenceDisplay fig, axes = plt.subplots(ncols=2, figsize=(8,4)) PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'], kind='individual', ax=axes) Figure 20.1: ICE plots for R and education. As R values have a positive value, the probability of data scientists tends to increase. For education, it appears that more education also tends to push toward data scientists. 
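The five-step recipe above is easy to carry out by hand. Here is a minimal sketch, assuming the xgb_def model and X_train from above, that traces a single ICE line by sweeping education for one fixed row; PartialDependenceDisplay draws one such line per row in Figure 20.1.

import numpy as np
import pandas as pd

row = X_train.iloc[[0]]                          # step 2: fix one observation
grid = np.sort(X_train['education'].unique())    # step 1: values to sweep

preds = []
for val in grid:                                 # steps 3 and 4: vary, then predict
    preds.append(xgb_def.predict_proba(row.assign(education=val))[:, -1][0])

ice_line = pd.Series(preds, index=grid, name='P(software engineer)')
print(ice_line)                                  # step 5: these points form one ICE line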
It is a little difficult to discern what is happening here. Remember that the y-axis in these plots represents the probability of the final label, Software Engineer (1) or Data Scientist (0). If r goes to 1, there is a strong push to Data Scientist. For the education plot, low education values tend toward software engineering, while larger values push toward data science. One technique to help make the visualizations more clear is to have them all start at the same value on the left-hand side. We can do that with the centered parameter. fig, axes = plt.subplots(ncols=2, figsize=(8,4)) PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'], centered=True, kind='individual', ax=axes) 172 20.1. ICE Plots Figure 20.2: Centered ICE plots helps visualize the impact of the feature as it changes. This plot also reveals that ticks at the bottom are intended to help visualize the number of rows with those values. However, due to the binned values in the survey data, the education levels are not discernable. We can throw a histogram on top to see the distribution. This should provide intuition into the density of data. Locations with more data would tend to have better predictions. In the case of education, we don’t have many examples of respondents with less than 14 years of education. That would cause us to have more uncertainty about those predictions. fig, axes = plt.subplots(ncols=2, figsize=(8,4)) ax_h0 = axes[0].twinx() ax_h0.hist(X_train.r, zorder=0) ax_h1 = axes[1].twinx() ax_h1.hist(X_train.education, zorder=0) PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'], centered=True, ice_lines_kw={'zorder':10}, kind='individual', ax=axes) fig.tight_layout() I wrote quantile_ice, a function that will subdivide a feature by the quantile of the predicted probability and create lines from the average of each quantile. It can also show a histogram. def quantile_ice(clf, X, col, center=True, q=10, color='k', alpha=.5, legend=True, add_hist=False, title='', val_limit=10, ax=None): """ 173 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration Figure 20.3: ICE plots with histograms to aid with understanding where there is sparseness in the data that might lead to overfitting. Generate an ICE plot for a binary classifier's predicted probabilities split by quantiles. Parameters: ---------clf : binary classifier A binary classifier with a `predict_proba` method. X : DataFrame Feature matrix to predict on with shape (n_samples, n_features). col : str Name of column in `X` to plot against the quantiles of predicted probabilities. center : bool, default=True Whether to center the plot on 0.5. q : int, default=10 Number of quantiles to split the predicted probabilities into. color : str or array-like, default='k' Color(s) of the lines in the plot. alpha : float, default=0.5 Opacity of the lines in the plot. legend : bool, default=True Whether to show the plot legend. add_hist : bool, default=False Whether to add a histogram of the `col` variable to the plot. title : str, default='' Title of the plot. val_limit : num, default=10 Maximum number of values to test for col. ax : Matplotlib Axis, deafault=None Axis to plot on. 174 20.1. ICE Plots Returns: ------results : DataFrame A DataFrame with the same columns as `X`, as well as a `prob` column with the predicted probabilities of `clf` for each row in `X`, and a `group` column indicating which quantile group the row belongs to. 
""" probs = clf.predict_proba(X) df = (X .assign(probs=probs[:,-1], p_bin=lambda df_:pd.qcut(df_.probs, q=q, labels=[f'q{n}' for n in range(1,q+1)]) ) ) groups = df.groupby('p_bin') vals = X.loc[:,col].unique() if len(vals) > val_limit: vals = np.linspace(min(vals), max(vals), num=val_limit) res = [] for name,g in groups: for val in vals: this_X = g.loc[:,X.columns].assign(**{col:val}) q_prob = clf.predict_proba(this_X)[:,-1] res.append(this_X.assign(prob=q_prob, group=name)) results = pd.concat(res, axis='index') if ax is None: fig, ax = plt.subplots(figsize=(8,4)) if add_hist: back_ax = ax.twinx() back_ax.hist(X[col], density=True, alpha=.2) for name, g in results.groupby('group'): g.groupby(col).prob.mean().plot(ax=ax, label=name, color=color, alpha=alpha) if legend: ax.legend() if title: ax.set_title(title) return results Let’s plot the 10 quantiles for the education feature. Ideally, these lines do not cross each other. If they do cross, you might want to debug the model to make sure that it isn’t overfitting on sparse data. fig, ax = plt.subplots(figsize=(8,4)) quantile_ice(xgb_def, X_train, 'education', q=10, legend=False, add_hist=True, ax=ax, title='ICE plot for Age') 175 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration Figure 20.4: ICE plot for quantiles. 20.2 ICE Plots with SHAP There are multiple tools to create ICE plots. If you know the magic parameters, you can get the SHAP library to create an ICE plot. It will also create the histogram. This function is poorly documented, but I’ve decoded it for you. The model parameter needs to be a function that, given rows of data, will return probabilities. You can specify the rows to draw ice lines using the’ data’ parameter. You need to ensure that the npoints parameter is the number of unique values for a column. import shap fig, ax = plt.subplots(figsize=(8,4)) shap.plots.partial_dependence_plot(ind='education', model=lambda rows: xgb_def.predict_proba(rows)[:,-1], data=X_train.iloc[0:1000], ice=True, npoints=(X_train.education.nunique()), pd_linewidth=0, show=False, ax=ax) ax.set_title('ICE plot (from SHAP)') 20.3 Partial Dependence Plots If you set the quantile to 1 in the code above, you create a Partial Dependence Plot. It is the average of the ICE plots. Partial Dependence Plots (PDPs) are a popular visualization technique used in machine learning to understand the relationship between input variables and the model’s predicted 176 20.3. Partial Dependence Plots Figure 20.5: An ICE plot created whith the SHAP library. output. PDPs illustrate the average behavior of the model for a particular input variable while holding all other variables constant. These plots allow us to identify non-linear relationships, interactions, and other important patterns in the data that are not immediately apparent from summary statistics or simple scatterplots. fig, axes = plt.subplots(ncols=2, figsize=(8,4)) PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'], kind='average', ax=axes) fig.tight_layout() A common suggestion is to plot the PDP plot on top of a centered ICE plot. This let’s us better understand the general impact of a feature. fig, axes = plt.subplots(ncols=2, figsize=(8,4)) PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['r', 'education'], centered=True, kind='both', ax=axes) fig.tight_layout() Let’s expore the years_exp and Q1_Male plots. 
fig, axes = plt.subplots(ncols=2, figsize=(8,4)) PartialDependenceDisplay.from_estimator(xgb_def, X_train, features=['years_exp', 'Q1_Male'], 177 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration Figure 20.6: PDP and ICE plots for R and education. fig.tight_layout() centered=True, kind='both', ax=axes) Figure 20.7: PDP and ICE plots for years_exp and Q1_Male. It looks like year_exp tends to be monotonically increasing. The Q1_Male plot is flat for the PDP, but there is some spread for the ICE values, indicating that there is probably an interaction with other columns. 178 20.4. PDP with SHAP 20.4 PDP with SHAP It might not surprise you that the SHAP library can make PDP plots as well. We will create PDP plots with the aptly named partial_dependence_plot function. It can even add a histogram to the plot. Let’s create one for the years_exp feature. import shap fig, ax = plt.subplots(figsize=(8,4)) col = 'years_exp' shap.plots.partial_dependence_plot(ind=col, model=lambda rows: xgb_def.predict_proba(rows)[:,-1], data=X_train.iloc[0:1000], ice=False, npoints=(X_train[col].nunique()), pd_linewidth=2, show=False, ax=ax) ax.set_title('PDP plot (from SHAP)') Figure 20.8: A PDP plot created with SHAP. If we specify ice=True, we can plot the PDP on top of the ICE plots. In this example, we also highlight the expected value for the years_exp column and the expected probability for the positive label. fig, ax = plt.subplots(figsize=(8,4)) col = 'years_exp' shap.plots.partial_dependence_plot(ind=col, 179 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration model=lambda rows: xgb_def.predict_proba(rows)[:,-1], data=X_train.iloc[0:1000], ice=True, npoints=(X_train[col].nunique()), model_expected_value=True, feature_expected_value=True, pd_linewidth=2, show=False, ax=ax) ax.set_title('PDP plot (from SHAP) with ICE Plots') Figure 20.9: PDP and ICE plots created with SHAP. 20.5 Monotonic Constraints PDP and ICE plots are useful visualization tools in machine learning that can help to identify monotonic constraints in a model. A monotonic constraint means that as the value of a specific feature changes, the predicted outcome should either always increase or decrease. This is often the case in situations with a correlation between a feature and the outcome. The standard disclaimer that correlation is not causation applies here! Examining the ranked correlation (Spearman) is one way to explore these relationships. fig, ax = plt.subplots(figsize=(8,4)) (X_test .assign(target=y_test) .corr(method='spearman') .iloc[:-1] 180 20.5. Monotonic Constraints ) .loc[:,'target'] .sort_values(key=np.abs) .plot.barh(title='Spearman Correlation with Target', ax=ax) Figure 20.10: Spearman correlation with target variable. You can use this to guide you with monotonic constraints. You can create a cutoff and explore variables above that cutoff. Assume a cutoff for an absolute value of 0.2. The education column is non-binary and might be a good candidate to explore. PDP and ICE plots can also help to identify monotonic constraints by showing the relationship between a feature and the predicted outcome while holding all other features constant. If the plot shows a clear increasing or decreasing trend for a specific feature, this can indicate a monotonic constraint. You can see the PDP for education in the last section. The line appears almost monotonic if we inspect the education PDP plot for education. It has a small bump at 19. 
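You can confirm that bump numerically with the manual_pdp helper sketched earlier (a hypothetical helper, not part of the book’s code), assuming the default xgb_def model:

# Partial dependence of the default model on education; look at the value at 19.
print(manual_pdp(xgb_def, X_train, 'education'))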
This plot caused me to explore the data a little more. The tweak_kag function converts Professional Degree to a value of 19 (19 years of school). Most professional degrees take 3-5 years. Let’s use some Pandas to explore what happens in education and the mean value of other columns: >>> ... ... ... ... ... print(X_train .assign(target=y_train) .groupby('education') .mean() .loc[:, ['age', 'years_exp', 'target']] ) education age years_exp target 181 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration 12.0 13.0 16.0 18.0 19.0 20.0 30.428571 30.369565 25.720867 28.913628 27.642857 35.310638 2.857143 6.760870 2.849593 3.225528 4.166667 4.834043 0.714286 0.652174 0.605691 0.393474 0.571429 0.174468 It appears that the target value does jump at 19. Let’s see how many values there are for education 19: >>> X_train.education.value_counts() 18.0 1042 16.0 738 20.0 235 13.0 46 19.0 42 12.0 7 Name: education, dtype: int64 There aren’t very many values. We can dig in a little more and inspect the raw data (remember that 19 was derived from the Professional degree value) to see if anything stands out: >>> ... ... ... ... ... print(raw .query('Q3.isin(["United States of America", "China", "India"]) ' 'and Q6.isin(["Data Scientist", "Software Engineer"])') .query('Q4 == "Professional degree"') .pipe(lambda df_:pd.crosstab(index=df_.Q5, columns=df_.Q6)) ) Q6 Q5 A business discipline (accounting, economics, f... Computer science (software engineering, etc.) Engineering (non-computer focused) Humanities (history, literature, philosophy, etc.) I never declared a major Mathematics or statistics Other Physics or astronomy Data Scientist Q6 Q5 A business discipline (accounting, economics, f... Computer science (software engineering, etc.) Engineering (non-computer focused) Humanities (history, literature, philosophy, etc.) I never declared a major Mathematics or statistics Other Physics or astronomy Software Engineer 182 \ 0 12 6 2 0 2 2 2 1 19 10 0 1 1 1 1 20.5. Monotonic Constraints Deciding whether to add a monotonic constraint to XGBoost depends on the specific context and goals of the model. A monotonic constraint can be added to ensure the model adheres to a particular relationship between a feature and the outcome. For example, in a scenario where we expect a clear cause-and-effect relationship between a feature and the outcome, such as in financial risk modeling, we might want to add a monotonic constraint to ensure that the model predicts an increasing or decreasing trend for that feature. Adding a monotonic constraint can help to improve the accuracy and interpretability of the model, by providing a clear constraint on the relationship between the feature and the outcome. It can also help reduce overfitting by limiting the range of values the model can predict for a given feature. From looking at the PDP and ICE plots it appears there is not a ton of data to not enforce a monotonic constraint for education. There is a chance that the model is overfitting the education entries with the value of 19. To simplify the model, we will add a monotonic constraint. I will also add a constraint to the years_exp column. We will specify the monotone_constraints parameter mapping the column name to the sign of the slope of the PDP plot. Because years_exp increases, we will map it to 1. The education column is decreasing, so we map it to -1. 
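Once the constrained model shown in the next block is fit, the constraint can be verified numerically: the education partial dependence should now be non-increasing. A minimal sketch, assuming the xgb_const model defined just below and the hypothetical manual_pdp helper from earlier:

import numpy as np

pdp_edu = manual_pdp(xgb_const, X_train, 'education')
print(np.all(np.diff(pdp_edu.values) <= 0))   # True if monotonically non-increasing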
xgb_const = xgb.XGBClassifier(random_state=42, monotone_constraints={'years_exp':1, 'education':-1}) xgb_const.fit(X_train, y_train) xgb_const.score(X_test, y_test) It looks like the constraints improved (and simplified) our model! I’m going to go one step further. Because the PDP line for Q1_Male was flat, I’m going to remove gender columns as well. small_cols = ['age', 'education', 'years_exp', 'compensation', 'python', 'r', 'sql', #'Q1_Male', 'Q1_Female', 'Q1_Prefer not to say', #'Q1_Prefer to self-describe', 'Q3_United States of America', 'Q3_India', 'Q3_China', 'major_cs', 'major_other', 'major_eng', 'major_stat'] xgb_const2 = xgb.XGBClassifier(random_state=42, monotone_constraints={'years_exp':1, 'education':-1}) xgb_const2.fit(X_train[small_cols], y_train) Let’s look at the score: >>> xgb_const2.score(X_test[small_cols], y_test) 0.7569060773480663 Slightly better! And simpler! It looks like these constraints improved our model (and also simplified it). Another way to evaluate this is to look at the feature importance values. Our default model focuses on the R column now. The constrained model will hopefully spread out the feature importance more. fig, ax = plt.subplots(figsize=(8,4)) (pd.Series(xgb_def.feature_importances_, index=X_train.columns) .sort_values() .plot.barh(ax=ax) ) 183 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration Figure 20.11: Unconstrained feature importance. Notice that the R column is driving our model right now. We would like to distribute the importance if possible. It appears that the feature importance values from the constrained model are slightly more evenly disbursed indicating that that model will pay less attention to the R column. Figure 20.12: Constrained feature importance. Notice that the R column has less impact and the other columns have more. fig, ax = plt.subplots(figsize=(8,4)) (pd.Series(xgb_const2.feature_importances_, index=small_cols) .sort_values() .plot.barh(ax=ax) ) 184 20.6. Calibrating a Model 20.6 Calibrating a Model This section will look at fine-tuning our model with calibration. Calibration refers to adjusting a model’s output to better align with the actual probabilities of the target variable. If we want to use the probabilities (from .predict_proba) and not just the target, we will want to calibrate our model. With XGBoost, the probability output often does not correspond to the actual probabilities of the target variable. XGBoost models tend to produce predicted probabilities biased towards the ends of the probability range, meaning they often overestimate or underestimate the actual probabilities. This will lead to poor performance on tasks that require accurate probability estimates, such as ranking, threshold selection, and decision-making. Calibrating an XGBoost model involves post-processing the predicted probabilities of the model using a calibration method, such as Platt scaling or isotonic regression. These methods map the model’s predicted probabilities to calibrated probabilities that better align with the actual probabilities of the target variable. Let’s calibrate the model using scikit-learn. Two calibration types are available, sigmoid and isotonic. Sigmoid calibration involves fitting a logistic regression model to the predicted probabilities of a binary classifier and transforming the probabilities using the logistic function. 
Isotonic calibration consists in fitting a non-parametric monotonic function to the predicted probabilities, ensuring that the function increases with increasing probabilities. Both techniques can improve the reliability of the probabilistic predictions of a model, but they differ in their flexibility and interpretability. Sigmoid calibration is a simple and computationally efficient method that can be easily implemented, but it assumes a parametric form for the calibration function and may not capture more complex calibration patterns. Isotonic calibration, on the other hand, is a more flexible and data-driven method that can capture more complex calibration patterns. Still, it may require more data and computational resources to implement. We will try both and compare the results. We need to provide the CalibratedClassifierCV with our existing model and the method parameter. I’m using cv=prefit because I already fit the model. Then we call the .fit method. from sklearn.calibration import CalibratedClassifierCV xgb_cal = CalibratedClassifierCV(xgb_def, method='sigmoid', cv='prefit') xgb_cal.fit(X_test, y_test) xgb_cal_iso = CalibratedClassifierCV(xgb_def, method='isotonic', cv='prefit') xgb_cal_iso.fit(X_test, y_test) 20.7 Calibration Curves We will visualize the results with a calibration curve. A calibration curve is a graphical representation of the relationship between predicted probabilities and the actual frequencies of events in a binary classification problem. The curve plots the predicted probabilities (x-axis) against the observed frequencies of the positive class (y-axis) for different probability thresholds. The calibration curve visually assesses how well a classification model’s predicted probabilities match the true probabilities of the events of interest. Ideally, a well-calibrated model should have predicted probabilities that match the actual probabilities of the events, meaning that the calibration curve should be close to the diagonal line. 185 20. Better Models with ICE, Partial Dependence, Monotonic Constraints, and Calibration If the calibration curve deviates from the diagonal line, it suggests that the model’s predicted probabilities are either overconfident or underconfident. An overconfident model predicts high probabilities for events that are unlikely to occur, while an underconfident model predicts low probabilities for events that are likely to occur. from sklearn.calibration import CalibrationDisplay from matplotlib.gridspec import GridSpec fig = plt.figure(figsize=(8,6)) gs = GridSpec(4, 3) axes = fig.add_subplot(gs[:2, :3]) display = CalibrationDisplay.from_estimator(xgb_def, X_test, y_test, n_bins=10, ax=axes) disp_cal = CalibrationDisplay.from_estimator(xgb_cal, X_test, y_test, n_bins=10,ax=axes, name='sigmoid') disp_cal_iso = CalibrationDisplay.from_estimator(xgb_cal_iso, X_test, y_test, n_bins=10, ax=axes, name='isotonic') row = 2 col = 0 ax = fig.add_subplot(gs[row, col]) ax.hist(display.y_prob, range=(0,1), bins=20) ax.set(title='Default', xlabel='Predicted Prob') ax2 = fig.add_subplot(gs[row, 1]) ax2.hist(disp_cal.y_prob, range=(0,1), bins=20) ax2.set(title='Sigmoid', xlabel='Predicted Prob') ax3 = fig.add_subplot(gs[row, 2]) ax3.hist(disp_cal_iso.y_prob, range=(0,1), bins=20) ax3.set(title='Isotonic', xlabel='Predicted Prob') fig.tight_layout() The calibration curve suggests that our default model does a respectable job. It is tracking the diagonal pretty well. But our calibrated models look like they track the diagonal better. 
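That visual impression can be quantified with the Brier score (the mean squared error of the predicted probabilities; lower is better). A minimal sketch using scikit-learn, assuming the three fitted models from above; note that the calibrated models were calibrated on this same test set, so the comparison is only a rough check.

from sklearn.metrics import brier_score_loss

for name, model in [('default', xgb_def), ('sigmoid', xgb_cal),
                    ('isotonic', xgb_cal_iso)]:
    probs = model.predict_proba(X_test)[:, -1]
    print(name, round(brier_score_loss(y_test, probs), 4))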
The histograms show the distribution of the default model and the calibrated models. Let’s look at the score of our calibrated models. It looks like they perform slightly better (at least for accuracy). >>> xgb_cal.score(X_test, y_test) 0.7480662983425415 >>> xgb_cal_iso.score(X_test, y_test) 0.7491712707182321 >>> xgb_def.score(X_test, y_test) 0.7458563535911602 20.8 Summary This chapter explored ICE and PDP plots. Then we looked at monotonic constraints and model calibration. ICE and PDP plots are two visualization techniques used to interpret the output of XGBoost models. ICE plots show the predicted outcome of an individual instance as a function of a single feature. In contrast, PDP plots show the average predicted outcome across all instances as a function of a single feature. Monotonic constraints can be added 186 20.9. Exercises Figure 20.13: Calibration curves for the default and calibrated models. Models that track the diagonal tend to return probabilities that reflect the data. to XGBoost models to ensure that the model’s predictions increase or decrease monotonically with increasing feature values. This can simplify the model to prevent overfitting. Calibration techniques such as sigmoid or isotonic calibration can be used to improve the reliability of the model’s probabilistic predictions. 20.9 Exercises 1. How can ICE and PDP plots identify non-linear relationships between features and the predicted outcome in an XGBoost model? 2. How can ICE and PDP plots be used to compare the effects of different features on the predicted outcome in an XGBoost model? 3. Can ICE and PDP plots visualize interactions between features in an XGBoost model? If so, how? 4. What are monotonic constraints in XGBoost, and how do they impact the model’s predictions? 5. How can monotonic constraints be specified for individual features in an XGBoost model? 6. What are some potential benefits of adding monotonic constraints to an XGBoost model, and in what situations are they particularly useful? 7. How can the calibration curve be used to assess the calibration of an XGBoost model’s predicted probabilities? 8. How can calibration techniques help improve the reliability of an XGBoost model’s probabilistic predictions in practical applications? 187 Chapter 21 Serving Models with MLFlow In this chapter, we will explore the MLFlow library. MLflow is an open-source library for managing the end-to-end machine learning lifecycle. It provides tools for tracking experiments, packaging machine learning models, and deploying models to production. MLflow is particularly useful for XGBoost because it allows users to track and compare the performance of different XGBoost models and quickly deploy the best-performing model to production. One of the key features of MLflow is its experiment-tracking capability. This allows users to log various metrics, such as model accuracy or training time, and compare the results of different experiments. This can be useful for identifying the most effective hyperparameters for an XGBoost model and choosing the best-performing model. MLflow also includes a model packaging feature, which makes it easy to package and deploy XGBoost models. This makes it possible to deploy XGBoost models to production environments and integrate them with other machine learning systems. 21.1 Installation and Setup MLFlow is a third-party library. Make sure you install it. Let’s show how to use MLFlow with the model we have been developing. We will then serve the model using MLFlow. 
Here are the imports that we will need: %matplotlib inline from feature_engine import encoding, imputation from hyperopt import fmin, tpe, hp, STATUS_OK, Trials import matplotlib.pyplot as plt import mlflow import numpy as np import pandas as pd from sklearn import base, metrics, model_selection, \ pipeline, preprocessing from sklearn.metrics import accuracy_score, roc_auc_score import xgboost as xgb import urllib import zipfile 189 21. Serving Models with MLFlow Let’s load the raw data. This code will create a pipeline and prepare the training and testing data. import pandas as pd from sklearn import model_selection, preprocessing import xg_helpers as xhelp url = 'https://github.com/mattharrison/datasets/raw/master/data/'\ 'kaggle-survey-2018.zip' fname = 'kaggle-survey-2018.zip' member_name = 'multipleChoiceResponses.csv' raw = xhelp.extract_zip(url, fname, member_name) ## Create raw X and raw y kag_X, kag_y = xhelp.get_rawX_y(raw, 'Q6') ## Split data kag_X_train, kag_X_test, kag_y_train, kag_y_test = \ model_selection.train_test_split( kag_X, kag_y, test_size=.3, random_state=42, stratify=kag_y) ## Transform X with pipeline X_train = xhelp.kag_pl.fit_transform(kag_X_train) X_test = xhelp.kag_pl.transform(kag_X_test) ## Transform y with label encoder label_encoder = preprocessing.LabelEncoder() label_encoder.fit(kag_y_train) y_train = label_encoder.transform(kag_y_train) y_test = label_encoder.transform(kag_y_test) # Combined Data for cross validation/etc X = pd.concat([X_train, X_test], axis='index') y = pd.Series([*y_train, *y_test], index=X.index) This code uses hypopt to train the model. It has been extended to log various metrics in MLFlow while training. from hyperopt import fmin, tpe, hp, STATUS_OK, Trials import mlflow from sklearn import metrics import xgboost as xgb ex_id = mlflow.create_experiment(name='ex3', artifact_location='ex2path') mlflow.set_experiment(experiment_name='ex3') with mlflow.start_run(): params = {'random_state': 42} rounds = [{'max_depth': hp.quniform('max_depth', 1, 12, 1), # tree 'min_child_weight': hp.loguniform('min_child_weight', -2, 3)}, {'subsample': hp.uniform('subsample', 0.5, 1), # stochastic 'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1)}, 190 21.1. Installation and Setup ] {'gamma': hp.loguniform('gamma', -10, 10)}, # regularization {'learning_rate': hp.loguniform('learning_rate', -7, 0)} # boosting for round in rounds: params = {**params, **round} trials = Trials() best = fmin(fn=lambda space: xhelp.hyperparameter_tuning( space, X_train, y_train, X_test, y_test), space=params, algo=tpe.suggest, max_evals=10, trials=trials, timeout=60*5 # 5 minutes ) params = {**params, **best} for param, val in params.items(): mlflow.log_param(param, val) params['max_depth'] = int(params['max_depth']) xg = xgb.XGBClassifier(eval_metric='logloss', early_stopping_rounds=50, **params) xg.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test) ] ) for metric in [metrics.accuracy_score, metrics.precision_score, metrics.recall_score, metrics.f1_score]: mlflow.log_metric(metric.__name__, metric(y_test, xg.predict(X_test))) model_info = mlflow.xgboost.log_model(xg, artifact_path='model') At the end of this code, you’ll notice that we store the results of calling log_model in model_info. The log_model function stores the xgboost model as an artifact. It returns an object that has metadata about the model that we created. 
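That metadata is enough to reload the model later without knowing the directory layout on disk. A minimal sketch, assuming the model_info object returned above; the model_uri attribute is available in recent MLflow versions.

import mlflow

# model_info was returned by mlflow.xgboost.log_model above.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(X_test.iloc[[0]]))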
If you inspect the ex_id variable and the .run_id attribute, they will point to a directory that stores information about the model. >>> ex_id '172212630951564101' >>> model_info.run_id '263b3e793f584251a4e4cd1a2d494110' The directory structure looks like this: mlruns/172212630951564101/263b3e793f584251a4e4cd1a2d494110 ├── artifacts ├── meta.yaml ├── metrics │ ├── accuracy_score │ ├── f1_score 191 21. Serving Models with MLFlow │ │ ├── │ │ │ │ │ │ │ └── ├── precision_score └── recall_score params ├── colsample_bytree ├── gamma ├── learning_rate ├── max_depth ├── min_child_weight ├── random_state └── subsample tags ├── mlflow.log-model.history ├── mlflow.runName ├── mlflow.source.name ├── mlflow.source.type └── mlflow.user We can launch a server to run the model if we know the artifact. 21.2 Inspecting Model Artifacts You can tell MLFlow to launch a service to inspect the artifacts of training the model. From the command line, use the following command: mlflow ui Then go to the URL localhost:5000 You should see a site that looks like this. Figure 21.1: MLFlow UI Homepage 192 21.3. Running A Model From Code It will create unique identifiers of the runs like colorful-moose-627. You can inspect the value of tags/mlflow.runName file to find the run name. MLFlow automatically names your models with unique ids. 21.3 Running A Model From Code When you click on a finished model, MLFlow will give you the code to make predictions. You load the model using the load_model function and pass in the run id. import mlflow logged_model = 'runs:/ecc05fedb5c942598741816a1c6d76e2/model' # Load model as a PyFuncModel. loaded_model = mlflow.pyfunc.load_model(logged_model) This code above is available from the website found from launching the command line service. If you click on the model folder icon, it will show how to make a prediction with Pandas (like above) or with Spark. Once you have loaded the model, you can make predictions. Let’s predict the first row of the test data: >>> loaded_model.predict(X_test.iloc[[0]]) array([1]) 21.4 Serving Predictions MLFlow also creates an endpoint that we can query for predictions. If you have the uuid, you can launch a service for it. By default, it will use pyenv to create a fresh environment to run the service. I’m passing --env-manager local to bypass that and use my local environment: mlflow models serve -m mlruns/172212630951564101/ \ 263b3e793f584251a4e4cd1a2d494110/artifacts/model \ -p 1234 --env-manager local The above command starts a service on port 1234. If you hit the URL and port in a web browser, it will throw an error because it is expecting an HTTP POST command. The next section will show you how to query the service. MLFlow also enables you to serve the same model from Azure, AWS Sagemaker, or Spark. Explore the documentation for MLFlow about the built-in deployment tools for more details. 21.5 Querying from the Command Line You can also query models from the command line with curl. On UNIX systems, you can use the curl command. It look like this (make sure to replace $URL and $JSON_DATA with the appropriate values): curl $URL -X POST -H "Content-Type:application/json" --data $JSON_DATA You will need to create JSON_DATA data with this format. I’m showing it on multiple lines, but you will put it on a single line.: 193 21. Serving Models with MLFlow Figure 21.2: Inspecting a run of an experiment. You can expand the Parameters and the Metrics on the left-hand side. In the lower right is the code to run a model. 194 21.5. 
21.5 Querying from the Command Line
You can also query models from the command line. On UNIX systems, you can use the curl command. It looks like this (make sure to replace $URL and $JSON_DATA with the appropriate values):

curl $URL -X POST -H "Content-Type:application/json" --data $JSON_DATA

You will need to create the JSON_DATA payload with this format. I'm showing it on multiple lines, but you will put it on a single line:

{'dataframe_split': {'columns': ['col1', 'col2'],
                     'data': [[22, 16.0], [25, 18.0]]}
}

If you don't want to create the JSON manually, you can use Pandas to make it. If you have a DataFrame with the data, use the Pandas .to_json method and set orient='split'.

>>> X_test.head(2).to_json(orient='split', index=False)
'{"columns":["age","education","years_exp","compensation",
"python","r","sql","Q1_Male","Q1_Female","Q1_Prefer not to say",
"Q1_Prefer to self-describe","Q3_United States of America",
"Q3_India","Q3_China","major_cs","major_other","major_eng",
"major_stat"],"data":[[22,16.0,1.0,0,1,0,0,1,0,0,0,0,1,0,1,0,
0,0],[25,18.0,1.0,70000,1,1,0,1,0,0,0,1,0,0,0,1,0,0]]}'

This is a JSON string, but we need a Python dictionary so that we can embed it in another dictionary. Consider this value to be DICT. We must place it in another dictionary with the key 'dataframe_split': {'dataframe_split': DICT}. We will use the json.loads function to create the dictionary from the string. (We can't just drop the string into a Python dictionary and convert that to text, because a dictionary's repr uses single quotes, which are not valid JSON.)

Here is the JSON data we need to insert into the dictionary:

>>> import json
>>> json.loads(X_test.head(2).to_json(orient='split', index=False))
{'columns': ['age',
  'education',
  'years_exp',
  'compensation',
  'python',
  'r',
  'sql',
  'Q1_Male',
  'Q1_Female',
  'Q1_Prefer not to say',
  'Q1_Prefer to self-describe',
  'Q3_United States of America',
  'Q3_India',
  'Q3_China',
  'major_cs',
  'major_other',
  'major_eng',
  'major_stat'],
 'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
  [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}

Then you need to nest this in another dictionary under the key 'dataframe_split'.

>>> {'dataframe_split': json.loads(X_test.head(2).to_json(orient='split',
...     index=False))}
{'dataframe_split': {'columns': ['age',
   'education',
   'years_exp',
   'compensation',
   'python',
   'r',
   'sql',
   'Q1_Male',
   'Q1_Female',
   'Q1_Prefer not to say',
   'Q1_Prefer to self-describe',
   'Q3_United States of America',
   'Q3_India',
   'Q3_China',
   'major_cs',
   'major_other',
   'major_eng',
   'major_stat'],
  'data': [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
   [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}

Finally, we need to convert the Python dictionary back to a valid JSON string. We will use the json.dumps function to do that:

>>> import json
>>> post_data = json.dumps({'dataframe_split': json.loads(
...     X_test.head(2).to_json(orient='split', index=False))})
>>> post_data
'{"dataframe_split": {"columns": ["age", "education", "years_exp",
"compensation", "python", "r", "sql", "Q1_Male", "Q1_Female",
"Q1_Prefer not to say", "Q1_Prefer to self-describe",
"Q3_United States of America", "Q3_India", "Q3_China", "major_cs",
"major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000,
1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}'
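Because post_data is now an ordinary JSON string, you can already exercise the service from Python before reaching for curl (or the requests library covered in section 21.6). This is only a sketch using the standard library, and it assumes the service from section 21.4 is running on port 1234:

import json
import urllib.request

# Build a POST request carrying the JSON payload we just created.
req = urllib.request.Request(
    'http://127.0.0.1:1234/invocations',
    data=post_data.encode('utf-8'),
    headers={'Content-Type': 'application/json'},
    method='POST')
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))

It should print the same {'predictions': [1, 0]} response that the curl examples return.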
I'll make a function that transforms a DataFrame into this format to avoid repetition.

def create_post_data(df):
    dictionary = json.loads(df
        .to_json(orient='split', index=False))
    return json.dumps({'dataframe_split': dictionary})

Let's try out the function:

>>> post_data = create_post_data(X_test.head(2))
>>> print(post_data)
{"dataframe_split": {"columns": ["age", "education", "years_exp",
"compensation", "python", "r", "sql", "Q1_Male", "Q1_Female",
"Q1_Prefer not to say", "Q1_Prefer to self-describe",
"Q3_United States of America", "Q3_India", "Q3_China", "major_cs",
"major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000,
1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}

In Jupyter, we can use Python variables in shell commands. To tell Jupyter to use the value of a variable, we stick a dollar sign ($) in front of the variable name. However, when we throw the post_data variable into the curl command in Jupyter, it fails.

!curl http://127.0.0.1:1234/invocations -X POST -H \
    "Content-Type:application/json" --data $post_data

It fails with this error:

curl: (3) unmatched brace in URL position 1:
{columns:
 ^

This is because we need to wrap the contents of the post_data variable in single quotes. I'm going to stick the quotes directly into the Python string.

>>> quoted = f"'{post_data}'"
>>> quoted
'\'{"dataframe_split": {"columns": ["age", "education", "years_exp",
"compensation", "python", "r", "sql", "Q1_Male", "Q1_Female",
"Q1_Prefer not to say", "Q1_Prefer to self-describe",
"Q3_United States of America", "Q3_India", "Q3_China", "major_cs",
"major_other", "major_eng", "major_stat"], "data": [[22, 16.0, 1.0,
0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], [25, 18.0, 1.0, 70000,
1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]]}}\''

Let's build the capability to "quote" the data into the create_post_data function.

def create_post_data(df, quote=True):
    dictionary = {'dataframe_split':
        json.loads(df.to_json(orient='split', index=False))}
    if quote:
        return f"'{json.dumps(dictionary)}'"
    else:
        return dictionary

quoted = create_post_data(X_test.head(2))

Let's try it now:

!curl http://127.0.0.1:1234/invocations -X POST -H \
    "Content-Type:application/json" --data $quoted

This returns the JSON:

{"predictions": [1, 0]}

indicating that the first row is a software engineer and the second is a data scientist.

21.6 Querying with the Requests Library
I'll show you how to query the service from Python using the requests library. This code makes a POST request to predict the first two test rows. It builds the payload with the Pandas .to_json method (via create_post_data), which requires setting orient='split'. It returns 1 (software engineer) for the first row and 0 (data scientist) for the second. In this case, because we are sending the JSON data as a dictionary and not a quoted string, we pass in quote=False.

>>> import requests as req
>>> import json
>>> r = req.post('http://127.0.0.1:1234/invocations',
...     json=create_post_data(X_test.head(2), quote=False))
>>> print(r.text)
{"predictions": [1, 0]}

Again, the .text result indicates that the web service predicted software engineer for the first row and data scientist for the second.
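Note that the service returns the encoded class labels, not the original Q6 strings. If you want the job titles back, you can reuse the label_encoder that was fit during data preparation. This is a short sketch, assuming the encoder and the running service from the sections above are still available:

import requests as req

r = req.post('http://127.0.0.1:1234/invocations',
             json=create_post_data(X_test.head(2), quote=False))
encoded = r.json()['predictions']   # e.g. [1, 0]
# Map the encoded labels back to the original Q6 job titles.
print(label_encoder.inverse_transform(encoded))

For these two rows, that should map 1 and 0 back to the software engineer and data scientist labels, respectively.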
21.7 Building with Docker
Docker is a platform that enables developers to package, deploy, and run applications in containers. A container is a lightweight, stand-alone executable package that includes everything needed to run a piece of software, including the code, runtime, system tools, libraries, and settings.

Docker provides consistent and reproducible development and deployment environments across different platforms, making it an efficient and reliable tool for deploying applications, including machine learning models. It also makes it simple to scale a deployment by running multiple container instances.

If you want to build a Docker image of the model and the service, you can use this command. (You will need to point to the correct model directory and replace MODELNAME with the name you want to use for the Docker image.)

mlflow models build-docker \
    -m mlruns/172212630951564101/263b3e793f584251a4e4cd1a2d494110/artifacts/model \
    -n MODELNAME

Note that you need to install the extra packages when installing MLFlow to get the Docker capabilities:

pip install mlflow[extras]

After you have built the Docker image, you can run it with:

docker run --rm -p 5001:8080 MODELNAME

This will run the application locally on port 5001. You can query it from the command line or from code, as we illustrated previously:

curl http://127.0.0.1:5001/invocations -X POST \
    -H "Content-Type:application/json" --data '{"dataframe_split":
    {"columns": ["age", "education", "years_exp", "compensation",
    "python", "r", "sql", "Q1_Male", "Q1_Female",
    "Q1_Prefer not to say", "Q1_Prefer to self-describe",
    "Q3_United States of America", "Q3_India", "Q3_China",
    "major_cs", "major_other", "major_eng", "major_stat"],
    "data": [[22, 16.0, 1.0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
    1, 0, 0, 0], [25, 18.0, 1.0, 70000, 1, 1, 0, 1, 0, 0, 0, 1,
    0, 0, 0, 1, 0, 0]]}}'

21.8 Conclusion
This chapter introduced the MLFlow library, an open-source library for managing the end-to-end machine learning lifecycle. It provides tools for tracking experiments, packaging machine learning models, and deploying models to production. We showed how to use MLFlow with XGBoost, how to inspect the artifacts of a training run, how to load a logged model from code and make predictions with it, how to launch a prediction service, and how to use Docker with MLFlow.

21.9 Exercises
1. What is the MLFlow library, and what are its main features?
2. How does MLFlow help track and compare the performance of different XGBoost models?
3. How do you use MLFlow to package and deploy XGBoost models?
4. How do you create a Docker image from MLFlow?

Chapter 22
Conclusion

Well, folks, we are at the end. I hope you enjoyed this journey. Throughout it, we have explored the depths of this powerful algorithm and learned how to build highly accurate classification models for real-world applications. We have covered everything from data preparation and feature engineering to hyperparameter tuning and model evaluation.

To use a cliché, the journey doesn't end here; it's just the beginning. The knowledge you have gained from this book can be applied to a wide range of classification problems, from predicting customer churn in business to diagnosing medical conditions in healthcare.

So, what are you waiting for? Take what you have learned and put it into practice. Practice is the best way to make sure you are absorbing this content. And please let me know; I'd love to hear what you used this for.

Please reach out if you need help with XGBoost and want consulting or training for your team. You can find my offerings at www.metasnake.com. If you are a teacher using this book as a textbook, don't hesitate to contact me; I would love to present on XGBoost to your class.
22.1 One more thing
As an independent, self-published author, I know firsthand how challenging it can be to share this content. Without the backing of a major publishing house or marketing team, I rely on the support of my readers to help spread the word about my work.

That's why I'm urging you to take a few minutes out of your day to post a review and share my book with your friends and family. If you enjoyed this book, posting an honest review is the best way you can say thanks for the countless hours I've poured into my work. If you do post a review, please let me know; I'd love to link to it on the homepage of the book. Your support can help me reach a wider audience.

So please, if you've enjoyed my book, consider leaving a review and sharing it with others. Your support means the world to me and will impact my future endeavors.

About the Author
Matt Harrison is a renowned corporate trainer specializing in Python and Pandas. With over a decade of experience in the field, he has trained thousands of professionals and helped them develop their skills in data analysis, machine learning, and automation using Python.

Matt has authored several best-selling Python and data analysis books, including Illustrated Guide to Python 3, Effective Pandas, Machine Learning Pocket Reference, and more. He is a frequent speaker at industry conferences and events, where he shares his expertise and insights on the latest trends and techniques in data analysis and Python programming.

Technical Editors
Dr. Ronald Legere earned his Ph.D. in Physics from Yale University in 1998, with a focus on laser cooling and cold atom collisions. He then served as a postdoc in Quantum Information Theory at Caltech and later joined MIT Lincoln Laboratory in 2000 as technical staff. There, he managed a team of analysts and support staff providing systems analysis and cost-benefit analysis to various Federal programs. In 2013, Dr. Legere became CEO of Legere Pharmaceuticals, leveraging his technical and leadership experience to improve business systems and regulatory compliance. In 2021, after selling Legere, he founded Rational Pursuit LLC to provide scientific, data-driven insights to businesses and organizations.

Edward Krueger is an expert in data science and scientific applications. As the proprietor of Peak Values Consulting, he specializes in developing and deploying simulation, optimization, and machine learning models for clients across various industries, from programmatic advertising to heavy industry. In addition to his consulting work, Krueger is an Adjunct Assistant Professor in the Economics Department at The University of Texas at Austin, where he teaches programming, data science, and machine learning to master's students.

Alex Rook is a data scientist and AI engineer based in North Carolina. He has degrees in Cognitive Science and Philosophy from the University of Virginia. He likes designing robust systems that look and feel good, function perfectly, and answer questions of high importance to multiple stakeholders who often have competing, yet valid, interests in their organization. He has extensive experience in non-financial risk management and mitigation.