Why xgboost is so fast?
Yafei Zhang (Tencent), kimmyzhang@tencent.com
August 1, 2016

Outline
1. GBDT Recall
2. Split Finding Algorithm
3. Column Blocks and Parallelization
4. Cache Aware Access
5. System Tricks
6. Summary
7. Reference

GBDT Recall: Goal

Given a training set D = {(x_i, y_i)}_{i=1}^N, a loss function L, and a model F, the objective function is

    L = Σ_{i=1}^N L(y_i, F(x_i; w)) + Σ_{k=1}^K Ω(f_k)    (1)

where the first term is the training loss, the second is the regularization, and

    F(x; w) = Σ_{k=0}^K f_k(x; w_k)    (2)

Goal: learn F greedily,

    F* = argmin_F L    (3)

GBDT Recall: Tree Splitting Algorithm

At iteration k, f_k is wanted. A second-order Taylor expansion of the loss around F_{k-1} gives

    L(y_i, F_k(x_i; w)) = L(y_i, F_{k-1}(x_i; w) + f_k(x_i))    (4)
                        ≈ L(y_i, F_{k-1}(x_i; w)) + g_i f_k(x_i) + (1/2) h_i f_k^2(x_i)    (5, 6)

where

    g_i := ∂L(y_i, F_{k-1}(x_i; w)) / ∂F_{k-1}
    h_i := ∂^2 L(y_i, F_{k-1}(x_i; w)) / ∂F_{k-1}^2
    F_k  = Σ_{j=1}^k f_j    (7)

GBDT Recall: Tree Splitting Algorithm

The overall loss w.r.t. f_k (the constant term L(y_i, F_{k-1}) is dropped) is

    L(f_k) = Σ_{i=1}^N [ g_i f_k(x_i) + (1/2) h_i f_k^2(x_i) ] + Ω(f_k)    (8)

For a decision tree f, Ω is

    Ω(f) = (γ/2) J + (λ/2) Σ_{j=1}^J b_j^2    (9)

- γ, λ: regularization coefficients.
- J: number of leaf nodes in f.
- R_j: regions of the leaf nodes in f.
- b_j: value of leaf R_j.

GBDT Recall: Tree Splitting Algorithm

With {R_j}_1^J known, we optimize {b_j}_1^J; how to get {R_j}_1^J is described later. The optimal leaf values are given by minimizing L:

    L(f_k) = L({b_j}_1^J, {R_j}_1^J)
           = Σ_{j=1}^J [ G_j b_j + (1/2)(H_j + λ) b_j^2 ] + (γ/2) J    (10)

    where G_j := Σ_{x_i ∈ R_j} g_i  and  H_j := Σ_{x_i ∈ R_j} h_i.

To keep symbols uncluttered, J, b_j and R_j are understood as parameters of f_k.
GBDT Recall: Tree Splitting Algorithm — Optimize b_j

The optimal leaf value of R_j:

    b_j* = argmin_{b_j} L({b_j}_1^J, {R_j}_1^J) = - G_j / (H_j + λ)    (11)

The minimal loss function:

    L({b_j*}_1^J, {R_j}_1^J) = - (1/2) Σ_{j=1}^J G_j^2 / (H_j + λ) + (γ/2) J    (12)

GBDT Recall: Tree Splitting Algorithm — Get R_j

With {R_j}_1^J and {b_j*}_1^J in hand, we now consider splitting some R_i into R_L and R_R. Define R and R' (the old and new splitting schemes) and the corresponding leaf values b* and b*':

    R   : {R_j}_1^J                              (13)
    R'  : {R_j}_{1, j≠i}^J ∪ {R_L, R_R}          (14)
    b*  : {b_j*}_1^J                             (15)
    b*' : {b_j*}_{1, j≠i}^J ∪ {b_L*, b_R*}       (16)

Gain is defined as the decrease in L after splitting R_i; using (12),

    Gain = 2 L(b*, R) - 2 L(b*', R')    (17)
         = G_L^2/(H_L + λ) + G_R^2/(H_R + λ) - (G_L + G_R)^2/(H_L + H_R + λ) - γ    (18)

Three terms:
- the first two terms: the gain from splitting;
- the third term: the gain from not splitting;
- γ: the complexity introduced by splitting.

GBDT Recall: Tree Splitting Algorithm — Put it all together

From (18) we get the algorithm:
- Calculate g_i and h_i for all instances.
- Run a split finding algorithm to get proposed splits.
- Iterate over all proposed splits, calculating the gain of each according to (18).
- Select the "best" split, i.e. the one with the maximal gain.
- Stop splitting if the maximal gain is negative or some other stopping criteria are met.

So we go from GBDT to the tree splitting algorithm to the split finding algorithm; the split finding algorithm is described in the next section.

Split Finding Algorithm

Trivial for binary features: there is only one possible split.
We can skip this section :)

How to find proposed splits?
- Exact greedy algorithm: enumerate all possible splits by brute force.
- Approximate algorithm: propose percentiles of each feature.
  - GK01 (M. Greenwald & S. Khanna, 2001, Space-Efficient Online Computation of Quantile Summaries).
  - GK01's variations and extensions.

In the approximate algorithm, can we reuse the proposed splits?
- Global: the proposal is carried out once per tree.
- Local: the proposal is carried out per split.

[Figure: a percentiles demo from the Internet.]

Exact greedy algorithm: time complexity O(N_u), where N_u is the number of unique values of the feature.

Approximate algorithm: time complexity O(N_p), where N_p is the number of proposed splits of the feature.

Approximate algorithm demo: with three percentile buckets (G_1, H_1), (G_2, H_2), (G_3, H_3) and only the bucket boundaries as candidates,

    Gain = max{ Gain,
                G_1^2/(H_1 + λ) + G_23^2/(H_23 + λ) - G_123^2/(H_123 + λ) - γ,    (19)
                G_12^2/(H_12 + λ) + G_3^2/(H_3 + λ) - G_123^2/(H_123 + λ) - γ }    (20)

where G_12 = G_1 + G_2, H_23 = H_2 + H_3, and so on.

Split Finding Algorithm: Conclusions

eps: bucket width; 1/eps: number of percentile buckets.

Conclusion: the approximate approach can perform as well as the exact one.
Column Blocks and Parallelization

- Feature values are sorted within each block.
- A block contains one or more features (columns).
- Instance indices are stored in the blocks alongside the sorted values.
- Missing features are not stored.
- With column blocks, a parallel split finding algorithm is easy to design.

Cache Aware Access

Tricks:
- A thread pre-fetches data from non-contiguous memory into a contiguous buffer.
- The main thread accumulates gradient statistics from the contiguous buffer.

Conclusion: the cache-aware algorithm is never worse than the naive one.

System Tricks

- Block pre-fetching.
- Utilize multiple disks to parallelize disk operations.
- LZ4 compression (popular in recent years for its outstanding speed).
- Unrolling loops.
- OpenMP.

LZ4 Compression

Blocks are compressed with LZ4.

Unrolling Loops: Program I

    #include <stdio.h>

    int main() {
        long long sum = 0;   /* int would overflow: the sum is about 5e15 */
        int i;
        for (i = 0; i < 100000000; i++) {
            sum += i;
        }
        printf("%lld\n", sum);
        return 0;
    }
Unrolling Loops: Program II

    #include <stdio.h>

    int main() {
        long long sum = 0;   /* int would overflow: the sum is about 5e15 */
        int i;
        for (i = 0; i < 100000000; i += 8) {
            sum += i;
            sum += i + 1;
            sum += i + 2;
            sum += i + 3;
            sum += i + 4;
            sum += i + 5;
            sum += i + 6;
            sum += i + 7;
        }
        printf("%lld\n", sum);
        return 0;
    }

Unrolling Loops: Conclusions

- Program II is 40%-50% faster.
- Performance can benefit from unrolling loops with light loop bodies.
- A simple alternative: compile Program I with the flags

    CFLAGS=-funroll-loops
    CXXFLAGS=-funroll-loops

OpenMP

A shared-memory parallel programming primitive:

    #pragma omp parallel for
    for (int i = 0; i < 1024; i++) {
        // do something in parallel
    }

Not always faster; programs need to be tuned carefully. Compile and link with:

    CFLAGS=-fopenmp
    CXXFLAGS=-fopenmp
    LDFLAGS=-fopenmp

Summary

- GBDT recall
  - tree splitting algorithm: f_k, b_j, R_j
- Split finding algorithm
  - exact vs approximate
  - global vs local
- Column blocks and parallelization
  - one block = one or more features
  - sorted feature values
  - easy to parallelize
- Cache aware access
  - pre-fetching alleviates non-contiguous memory access
- System tricks
  - pre-fetching
  - utilizing multiple disks
  - LZ4 compression
  - unrolling loops
  - OpenMP

Reference

- J. Friedman (1999). Greedy Function Approximation: A Gradient Boosting Machine.
- J. Friedman (1999). Stochastic Gradient Boosting.
- T. Hastie, R. Tibshirani, J. Friedman (2008). The Elements of Statistical Learning (2e), Chapter 10.
- T. Chen, C. Guestrin (2016). XGBoost: A Scalable Tree Boosting System.