Entropy Estimation and Applications to Decision Trees

Estimation
[Figure: a discrete distribution over K = 8 classes with true entropy H = 1.289.]

Repeat 50,000 times:
1. Generate N samples
2. Estimate entropy from samples
[Figure: histograms of the plugin entropy estimate over 50,000 replicates, for N = 10, N = 100, and N = 1000.]
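A minimal Python sketch of this experiment, assuming a hypothetical 8-class distribution (the exact distribution behind H = 1.289 is not reproduced here) and the plugin (maximum-likelihood) entropy estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 8-class distribution (stand-in for the one in the figure).
p = np.array([0.35, 0.25, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02])
H_true = -np.sum(p * np.log(p))               # true entropy in nats

def plugin_entropy(counts):
    """Plugin (maximum-likelihood) entropy estimate from class counts."""
    q = counts[counts > 0] / counts.sum()     # drop empty bins: 0 log 0 = 0
    return -np.sum(q * np.log(q))

for N in (10, 100, 1000):
    estimates = np.array([plugin_entropy(rng.multinomial(N, p))
                          for _ in range(50_000)])   # 50,000 replicates
    print(f"N={N:4d}  mean={estimates.mean():.3f}  "
          f"std={estimates.std():.3f}  true={H_true:.3f}")
```

For small N the estimates are centred well below the true entropy, illustrating the negative bias of the plugin estimator; both the bias and the spread shrink as N grows.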
Estimation
[Figure: histogram of the plugin entropy estimate for N = 100 over 50,000 replicates, with the true entropy and 2-standard-deviation estimates shown.]

Estimating the true entropy
Goals:
1. Consistency: large N guarantees the correct result
2. Low variance: variation of estimates is small
3. Low bias: the expected estimate should be correct
Discrete Entropy Estimators
Experimental Results
• UCI classification data sets
• Accuracy on test set
• Plugin vs. Grassberger
• Better trees
Source: [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
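The Grassberger estimator itself is not reproduced here; as an illustration of the same idea (correcting the small-sample bias of the plugin estimate), the sketch below uses the simpler Miller-Madow correction, which adds (K_observed − 1)/(2N) to the plugin estimate. The 8-class distribution is again a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.35, 0.25, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02])  # hypothetical
H_true = -np.sum(p * np.log(p))

def plugin_entropy(counts):
    q = counts[counts > 0] / counts.sum()
    return -np.sum(q * np.log(q))

def miller_madow_entropy(counts):
    # Plugin estimate plus the first-order bias correction (K_observed - 1) / (2N).
    return plugin_entropy(counts) + (np.count_nonzero(counts) - 1) / (2 * counts.sum())

N, reps = 20, 20_000
plug = np.array([plugin_entropy(rng.multinomial(N, p)) for _ in range(reps)])
mm = np.array([miller_madow_entropy(rng.multinomial(N, p)) for _ in range(reps)])
print(f"true H       : {H_true:.3f}")
print(f"plugin       : mean {plug.mean():.3f}  (bias {plug.mean() - H_true:+.3f})")
print(f"Miller-Madow : mean {mm.mean():.3f}  (bias {mm.mean() - H_true:+.3f})")
```

The bias-corrected estimate sits much closer to the true entropy at small N, which is the effect exploited by the improved split scores in the paper.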
Differential Entropy Estimation
• In regression, differential entropy
– measures remaining uncertainty about y
– is a function of a distribution
𝐻 π‘ž =−
π‘ž 𝑦 π‘₯ log π‘ž(𝑦|π‘₯) d𝑦
𝑦
• Problem
– q is not from a parametric family
• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation
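A quick numerical check of this definition, using a standard normal density as a stand-in for q(y|x) (scipy's quad is assumed for the integral); the result should match the closed-form Gaussian entropy:

```python
import numpy as np
from scipy.integrate import quad

# Differential entropy H(q) = -int q(y) log q(y) dy, evaluated numerically
# for a standard normal density used as a stand-in for q(y|x).
def neg_q_log_q(y):
    log_q = -0.5 * y ** 2 - 0.5 * np.log(2 * np.pi)   # log of the N(0, 1) density
    return -np.exp(log_q) * log_q

H_numeric, _ = quad(neg_q_log_q, -np.inf, np.inf)
H_closed = 0.5 * np.log(2 * np.pi * np.e)             # exact entropy of N(0, 1)
print(H_numeric, H_closed)                            # both approximately 1.4189
```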
Solution 1: parametric family
• Multivariate Normal distribution
– Estimate covariance matrix C of all y vectors
– Plugin estimate of the entropy
  H(C) = d/2 + (d/2) log(2π) + (1/2) log |C|
– Uniform minimum variance unbiased estimator (UMVUE)
𝐻 π‘Œ =
𝑑
1
log π‘’πœ‹ + log
2
2
𝑦 𝑦𝑇 −
𝑦∈π‘Œ
1
2
𝑑
πœ“
𝑗=1
𝑛+1−𝑗
2
[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
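A sketch of both estimates above in Python (scipy's digamma for ψ). Note that, as written here, the UMVUE uses the raw scatter matrix Σ y yᵀ, which corresponds to zero-mean (or pre-centred) samples; the quick check below therefore draws zero-mean data:

```python
import numpy as np
from scipy.special import digamma

def gaussian_plugin_entropy(Y):
    """Plugin estimate H(C) = d/2 + (d/2) log(2 pi) + (1/2) log|C|, C the ML covariance."""
    n, d = Y.shape
    C = np.cov(Y, rowvar=False, bias=True)            # maximum-likelihood covariance
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(C)[1]

def gaussian_umvue_entropy(Y):
    """UMVUE: (d/2) log(e pi) + (1/2) log|sum_y y y^T| - (1/2) sum_j psi((n+1-j)/2)."""
    n, d = Y.shape
    S = Y.T @ Y                                        # scatter matrix sum_y y y^T
    j = np.arange(1, d + 1)
    return (0.5 * d * np.log(np.e * np.pi)
            + 0.5 * np.linalg.slogdet(S)[1]
            - 0.5 * np.sum(digamma((n + 1 - j) / 2)))

# Quick check against the exact entropy of a known zero-mean Gaussian.
rng = np.random.default_rng(0)
d, n = 3, 500
Sigma = np.diag([1.0, 2.0, 0.5])
Y = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
H_exact = 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.linalg.slogdet(Sigma)[1]
print(H_exact, gaussian_plugin_entropy(Y), gaussian_umvue_entropy(Y))
```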
Solution 2: Non-parametric entropy estimation
• Minimal assumptions on distribution
• Nearest neighbour estimate
  H_1NN = (d/n) Σ_{i=1}^{n} log ρ_i + log(n−1) + γ + log V_d
– NN distance ρ_i = min_{j∈{1,…,n}\{i}} ‖ y_j − y_i ‖
– Euler-Mascheroni constant γ ≈ 0.5772
– Volume of the d-dim. hypersphere V_d = π^{d/2} / Γ(1 + d/2)
• Other estimators: KDE, spanning tree, k-NN, etc.
[Kozachenko, Leonenko, “Sample estimate of the entropy of a random vector”, Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Györfi, van der Meulen, “Nonparametric entropy estimation: An overview”, 2001]
[Wang, Kulkarni, Verdú, “Universal estimation of information measures for analog sources”, FnT Comm. Inf. Th., 2009]
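A brute-force O(n²) sketch of the 1-NN estimator above; for a Gaussian test case the estimate should be close to the known closed-form entropy:

```python
import numpy as np
from scipy.special import gammaln

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

def entropy_1nn(Y):
    """Kozachenko-Leonenko 1-NN differential entropy estimate (in nats)."""
    Y = np.asarray(Y, dtype=float)
    n, d = Y.shape
    # Brute-force pairwise Euclidean distances; ignore the zero self-distances.
    dist = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    rho = dist.min(axis=1)                        # nearest-neighbour distance rho_i
    log_Vd = 0.5 * d * np.log(np.pi) - gammaln(1 + 0.5 * d)   # log volume of unit d-ball
    return (d / n) * np.log(rho).sum() + np.log(n - 1) + EULER_GAMMA + log_Vd

# Sanity check on a 2-D standard normal with known entropy log(2 pi e).
rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 2))
print(np.log(2 * np.pi * np.e), entropy_1nn(Y))
```

For large n a k-d tree or k-NN variant is preferable; the brute-force version is only meant to make the formula concrete.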
[Figure slides: Solution 2, non-parametric estimation; experimental results from Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012.]
Streaming Decision Trees
Streaming Data
• “Infinite data” setting
• 10 possible splits and their scores
• When to stop and make a decision?
[Figure: histogram of the plugin entropy estimate for N = 100 over 50,000 replicates, with the true entropy and 2-standard-deviation estimates shown.]
Streaming Decision Trees
• Score splits on a subset of samples only
• Domingos/Hulten (Hoeffding Trees), 2000:
– Compute sample count n for given precision
– Streaming decision tree induction
– The confidence intervals are not strictly valid, but the method works well in practice (sketch after the references)
• Jin/Agrawal, 2003:
– Tighter confidence interval, asymptotic derivation using delta method
• Loh/Nowozin, 2013:
– Racing algorithm (bad splits are removed early)
– Finite sample confidence intervals for entropy and gini
[Domingos, Hulten, “Mining High-Speed Data Streams”, KDD 2000]
[Jin, Agrawal, “Efficient Decision Tree Construction on Streaming Data”, KDD 2003]
[Loh, Nowozin, “Faster Hoeffding racing: Bernstein races via jackknife estimates”, ALT 2013]
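A sketch of the Hoeffding-bound split decision in the spirit of Domingos/Hulten, not their exact implementation: split once the gap between the best and second-best score exceeds ε = sqrt(R² ln(1/δ) / (2n)), with R the range of the score (log₂ K for an entropy-based infogain) and a tie-breaking threshold. Function names, parameter values, and the example scores are illustrative assumptions:

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: with prob. 1 - delta the true mean of a quantity with the
    given range lies within epsilon of its empirical mean over n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(scores, n, num_classes, delta=1e-6, tie_threshold=0.05):
    """Decide whether the currently best split can be selected after n samples.

    scores: estimated infogains of the candidate splits on the samples seen so far.
    R = log2(num_classes) is used as the range of an entropy-based infogain.
    """
    best, second = sorted(scores, reverse=True)[:2]
    eps = hoeffding_epsilon(math.log2(num_classes), delta, n)
    # Split if the leader is clearly ahead, or if the race is effectively a tie.
    return (best - second > eps) or (eps < tie_threshold)

# Example: 10 candidate splits and their current scores.
scores = [0.41, 0.35, 0.33, 0.30, 0.28, 0.22, 0.20, 0.15, 0.10, 0.05]
print(should_split(scores, n=200, num_classes=8))      # False: not enough samples yet
print(should_split(scores, n=50_000, num_classes=8))   # True: gap exceeds epsilon
```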
Multivariate Delta Method
[Figure: illustration of g near θ, showing g(θ) and the gradient ∇g(θ).]
Theorem. Let T_n be a sequence of k-dimensional random vectors such that
√n (T_n − θ) → N_k(0, Σ(θ)) in law. Let g: ℝ^k → ℝ^m be once differentiable at θ with
gradient matrix ∇g(θ). Then
√n (g(T_n) − g(θ)) → N_m(0, ∇g(θ)ᵀ Σ(θ) ∇g(θ)) in law.
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
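A small numerical check of the theorem for a hypothetical smooth map g(x, y) = x·y applied to the mean T_n of n i.i.d. 2-D vectors: the empirical variance of √n (g(T_n) − g(θ)) should approach ∇g(θ)ᵀ Σ(θ) ∇g(θ). The distribution, map, and sample sizes are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([1.0, 2.0])                 # mean of the underlying 2-D distribution
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])               # its covariance

g = lambda t: t[0] * t[1]                    # hypothetical smooth map g: R^2 -> R
grad_g = np.array([theta[1], theta[0]])      # gradient of g evaluated at theta

n, reps = 2000, 5000
vals = np.empty(reps)
for r in range(reps):
    Tn = rng.multivariate_normal(theta, Sigma, size=n).mean(axis=0)   # T_n = sample mean
    vals[r] = np.sqrt(n) * (g(Tn) - g(theta))

print("empirical variance    :", vals.var())
print("grad_g^T Sigma grad_g :", grad_g @ Sigma @ grad_g)             # delta-method value
```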
Delta Method for the Information Gain
• 8 classes, 2 choices (left/right)
• P_{S,i}: probability of choice S, class i
• P_L = Σ_i P_{L,i},  P_R = Σ_i P_{R,i}
• P_i = P_{L,i} + P_{R,i}
• Σ_{S∈{L,R}} Σ_i P_{S,i} = 1
[Figure: class distributions over the 8 classes for the left and right choices.]
Multivariate delta method: for n → ∞ we have that
  I(P̂) ~ N( I(P), V(P)/n ),   V(P) = Σ_{S∈{L,R}} Σ_i P_{S,i} α_{S,i} (α_{S,i} + I(P))
• α_{S,i} = log P_S + log P_i − log P_{S,i}
• I(P) = H((P_i)_i) + H((P_L, P_R)) − H(P), the mutual information (infogain)
(simulation sketch after the references)
• The derivation is lengthy but not difficult; a slight generalization of Jin & Agrawal
[Small, “Expansions and Asymptotics for Statistics”, CRC, 2010]
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
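A simulation sketch of this result using the formulas above, for a hypothetical joint table P over {L, R} × 8 classes: the empirical standard deviation of the plugin infogain is compared with the delta-method value sqrt(V(P)/n):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint table P[S, i] over S in {L, R} and 8 classes (entries sum to 1).
P = np.array([[0.10, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.02],
              [0.02, 0.03, 0.04, 0.05, 0.06, 0.10, 0.15, 0.15]])

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def infogain(P):
    """I(P) = H(class marginal) + H(split marginal) - H(joint)."""
    return entropy(P.sum(axis=0)) + entropy(P.sum(axis=1)) - entropy(P.ravel())

def delta_method_variance(P):
    """V(P) = sum_{S,i} P_{S,i} alpha_{S,i} (alpha_{S,i} + I(P))."""
    P_S = P.sum(axis=1, keepdims=True)       # split marginal (P_L, P_R)
    P_i = P.sum(axis=0, keepdims=True)       # class marginal P_i
    alpha = np.log(P_S) + np.log(P_i) - np.log(P)
    return np.sum(P * alpha * (alpha + infogain(P)))

I_true, V = infogain(P), delta_method_variance(P)
print(f"I(P) = {I_true:.4f}")
for n in (100, 500, 2000):
    est = [infogain(rng.multinomial(n, P.ravel()).reshape(P.shape) / n)
           for _ in range(5000)]
    print(f"n={n:5d}  empirical std={np.std(est):.4f}  delta-method std={np.sqrt(V / n):.4f}")
```

The empirical spread approaches sqrt(V(P)/n) as n grows, which is exactly the comparison shown on the next slide.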
Delta Method Example
[Figure: plugin infogain estimate and its standard deviation over 10,000 replicates as a function of sample size (0 to 500), together with the true infogain.]

  I(P̂) ~ N( I(P), V(P)/n )
As n → ∞, V(P) stays fixed, so the standard deviation of the estimate shrinks as sqrt(V(P)/n).

[Figure: asymptotic variance of the information gain; empirical stddev vs. delta-method stddev as a function of sample size (0 to 500).]
Conclusion on Entropy Estimation
• Statistical problem
• Large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• Distribution of estimate relevant in the streaming setting