# gbdt-xgboost ```Why xgboost is so fast?
Yafei Zhang
kimmyzhang@tencent.com
August 1, 2016
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
1 / 34
GBDT Recall
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
2 / 34
GBDT Recall
Goal
A training set D = {(xi , yi )}N
1 . A loss function L. The model F .
Objective Function
L=
N
X
|i=1
K
X
L(yi , F (xi ; w )) +
{z
}
Training loss
where
F (x; w ) =
K
X
Ω(fk )
|k=1 {z
(1)
}
Regularization
fk (x; wk )
(2)
k=0
Goal: Learn F Greedily
F ∗ = arg min L
(3)
F
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
3 / 34
GBDT Recall
Tree Splitting Algorithm
At iteration k, fk is wanted
L(yi , Fk (xi ; w )) = L(yi , Fk−1 (xi ; w ) + fk (xi ))
∂L(yi , Fk−1 (xi ; w ))
≈ L(yi , Fk−1 (xi ; w )) +
fk (xi )
∂Fk−1
|
{z
}
+
1
2
(4)
:=gi
2
∂ L(yi , Fk−1 (xi ; w )) 2
fk (xi )
2
∂Fk−1
{z
|
:=hi
(5)
}
1
= L(yi , Fk−1 (xi ; w )) + gi fk (xi ) + hi fk2 (xi )
2
where
Fk =
Yafei Zhang (tencent)
(6)
k
X
fj
j=1
Why xgboost is so fast?
(7)
August 1, 2016
4 / 34
GBDT Recall
Tree Splitting Algorithm
Overall Loss w.r.t. fk
N X
1 2
L(fk ) =
gi fk (xi ) + hi fk (xi ) + Ω(fk )
2
(8)
i=1
Ω() for a decision tree f
J
γ
λX 2
Ω(f ) = J +
bj
2
2
(9)
j=1
γ, λ: regularization coefficient.
J: number of leaf nodes in f .
Rj : regions of leaf nodes in f .
bj : values of Rj .
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
5 / 34
GBDT Recall
Tree Splitting Algorithm
With {Rj }J1 known, we are to optimize {bj }J1 .
How to get {Rj }J1 will be described later.
The optimal leaf values are given by minimizing L

L(fk ) =
L({bj }J1 , {Rj }J1 )





 
J X
  γ
X
1

X
 2
=
g
b
+
+
λ
h


 bj  + J
i j
i

  2
2
j=1 xi ∈Rj
xi ∈Rj
 
| {z }
| {z }
:=Gj
:=Hj
(10)
To keep symbols uncluttered, J, bj and Rj are parameters of fk .
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
6 / 34
GBDT Recall
Tree Splitting Algorithm: Optimize bj
The optimal leaf value of Rj
Gj
Hj + λ
(11)
J
2
1 X Gj
γ
=−
+ J
2
Hj + λ 2
(12)
bj∗ = arg min L({bj }J1 , {Rj }J1 ) = −
bj
The minimal loss function
L({bj∗ }J1 , {Rj }J1 )
j=1
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
7 / 34
GBDT Recall
Tree Splitting Algorithm: Get Rj
With {Rj }J1 and {bj∗ }J1 , now we consider splitting Ri to RL and RR .
Define R and R0 : old and new splitting scheme
Define b∗ and b∗
R : {Rj }J1
(13)
R0 : {Rj }J1,j6=i ∪ {RL , RR }
(14)
b∗ : {bj∗ }J1
(15)
b∗ : {bj∗ }J1,j6=i ∪ {bL∗ , bR∗ }
(16)
0
0
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
8 / 34
GBDT Recall
Tree Splitting Algorithm: Get Rj
Gain is defined to be the decrease in L after splitting Ri , and use (12).
Gain
0
Gain = 2L(b∗ , R) − 2L(b∗ , R0 )
GR2
GL2
(GL + GR )2
+
−
−γ
=
HL + λ HR + λ HL + H R + λ
(17)
(18)
3 Terms:
Gain from splitting
Gain from not splitting
The complexity introduced by splitting
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
9 / 34
GBDT Recall
Tree Splitting Algorithm: Put all together
From (18), we have the algorithm
Calculate gi and hi for all instances.
Carry out split finding algorithm to get some proposed splits.
Iterate over all proposed splits.
Calculate gain according to (18).
Select the ”best” split, which results in the maximal gain.
Stop splitting if the maximal gain is negative or some criterions are
met.
GBDT to tree splitting algorithm to split finding algorithm.
Split finding algorithm will be described next section.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
10 / 34
Split Finding Algorithm
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
11 / 34
Split Finding Algorithm
Split Finding Algorithm
Trivial for binary features
Only one split. We can skip this section:)
How to find proposed splits?
Exact Greedy Algorithm
Enumerate over all possible splits by brute force.
Approximate Algorithm
Propose percentiles on one feature.
GK01(M. Greenwald &amp; S. Khanna, 2001, Space-Efficient Online
Computation of Quantile Summaries).
GK01’s variations and extensions.
In the approximate algorithm, can we reuse the proposed splits?
Global: proposal will be carried out per tree.
Local: proposal will be carried out per split.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
12 / 34
Split Finding Algorithm
A Percentiles Demo
from Internet
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
13 / 34
Split Finding Algorithm
Exact Greedy Algorithm
Time complexity
O(Nu ): Nu is the number of unique values of this feature.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
14 / 34
Split Finding Algorithm
Approximate Algorithm
Time complexity
O(Np ): Np is the number of proposed splits of this feature.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
15 / 34
Split Finding Algorithm
Approximate Algorithm: Demo
Gain = max{Gain,
Yafei Zhang (tencent)
2
2
G12
G23
G123
+
−
− γ,
H1 + λ H23 + λ H123 + λ
2
2
G12
G32
G123
+
−
− γ}
H12 + λ H3 + λ H123 + λ
Why xgboost is so fast?
August 1, 2016
(19)
(20)
16 / 34
Split Finding Algorithm
Split Finding Algorithm: Conclusions
eps: bucket width, 1/eps: number of percentile buckets.
Conclusions
Approximate approach can be as well as the exact one.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
17 / 34
Column Blocks and Parallelization
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
18 / 34
Column Blocks and Parallelization
Column Blocks and Parallelization
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
19 / 34
Column Blocks and Parallelization
Column Blocks and Parallelization
Feature values are sorted.
A block contains one or more feature values.
Instance indices are stored in blocks.
Missing features are not stored.
With column blocks, a parallel split finding algorithm is easy to design.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
20 / 34
Cache Aware Access
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
21 / 34
Cache Aware Access
Cache Aware Access
Tricks
A thread pre-fetches data from non-continuous memory into a
continuous buffer.
buffer.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
22 / 34
Cache Aware Access
Cache Aware Access: Conclusions
Conclusions
Cache aware algorithm is always no worse.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
23 / 34
System Tricks
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
24 / 34
System Tricks
System Tricks
Block pre-fetching.
Utilize multiple disks to parallelize disk operations.
LZ4 compression(popular recent years for outstanding performance).
Unrolling loops.
OpenMP.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
25 / 34
System Tricks
LZ4 Compression
Compress blocks with LZ4.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
26 / 34
System Tricks
Unrolling Loops: Program I
# include &lt; stdio .h &gt;
int main () {
int sum = 0;
int i ;
for ( i = 0; i &lt; 100000000; i ++) {
sum += i ;
}
printf ( &quot; % d \ n &quot; , sum );
return 0;
}
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
27 / 34
System Tricks
Unrolling Loops: Program II
# include &lt; stdio .h &gt;
int main () {
int sum = 0;
int i ;
for ( i = 0; i &lt; 100000000; i += 8) {
sum += i ;
sum += i + 1;
sum += i + 2;
sum += i + 3;
...
sum += i + 7;
}
printf ( &quot; % d \ n &quot; , sum );
return 0;
}
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
28 / 34
System Tricks
Unrolling Loops
Conclusions
Program II is 40%-50% faster.
Performance can benefit from unrolling loops with light loop bodies.
A simple solution
CFLAGS=-funroll-loops
CXXFLAGS=-funroll-loops
Use flags above to compile Program I.
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
29 / 34
System Tricks
OpenMP
A shared-memory parallel programming primitive.
# pragma omp parallel for
for ( int i = 0; i &lt; 1024; i ++) {
// do something in parallel
}
Not always faster, programs need to be tuned carefully.
CFLAGS=-fopenmp
CXXFLAGS=-fopenmp
LDFLAGS=-fopenmp
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
30 / 34
Summary
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
31 / 34
Summary
Summary
GBDT recall
tree splitting algorithm: fk , bj , Rj
Split finding algorithm
exact vs approximate
global vs local
Column blocks and parallelization
one block = one or multiple features
sorted feature value
easy to parallelize
Cache aware access
pre-fetch
alleviate non-continuous memory access
System Tricks
pre-fetch
utilize multiple disks
LZ4 compression
Unrolling loops
OpenMP
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
32 / 34
Reference
Outline
1
GBDT Recall
2
Split Finding Algorithm
3
Column Blocks and Parallelization
4
Cache Aware Access
5
System Tricks
6
Summary
7
Reference
Yafei Zhang (tencent)
Why xgboost is so fast?
August 1, 2016
33 / 34
Reference
Reference
J. Friedman(1999). Greedy Function Approximation: A Gradient Boosting
Machine.