Entropy Estimation and Applications to Decision Trees

Estimation
• Distribution over K = 8 classes, true entropy H = 1.289
• Repeat 50,000 times:
  1. Generate N samples
  2. Estimate entropy from the samples
[Figures: histograms of the plugin estimate $\hat{H}$ over 50,000 replicates for N = 10, N = 100 and N = 1000]

Estimation
[Figure: plugin $\hat{H}$, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]

Estimating the true entropy
Goals:
1. Consistency: a large N guarantees the correct result
2. Low variance: the variation of the estimates is small
3. Low bias: the expected estimate should be correct

Discrete Entropy Estimators

Experimental Results
• UCI classification data sets
• Accuracy on the test set
• Plugin vs. Grassberger
• Better trees
Source: [Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", ICML 2012]

Differential Entropy Estimation
• In regression, the differential entropy
  – measures the remaining uncertainty about y
  – is a function of a distribution:
    $H(q) = -\int_y q(y|x) \log q(y|x) \,\mathrm{d}y$
• Problem
  – q is not from a parametric family
• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation

Solution 1: parametric family
• Multivariate Normal distribution
  – Estimate the covariance matrix $C$ of all y vectors
  – Plugin estimate of the entropy:
    $H(C) = \frac{d}{2} + \frac{d}{2}\log 2\pi + \frac{1}{2}\log|C|$
  – Uniform minimum variance unbiased estimator (UMVUE):
    $H(Y) = \frac{d}{2}\log(e\pi) + \frac{1}{2}\log\Big|\sum_{y \in Y}(y-\bar{y})(y-\bar{y})^T\Big| - \frac{1}{2}\sum_{i=1}^{d}\psi\!\left(\frac{n+1-i}{2}\right)$
[Ahmed, Gokhale, "Entropy expressions and their estimators for multivariate distributions", IEEE Trans. Inf. Theory, 1989]

Solution 1: parametric family
[Figure-only slides]

Solution 2: Non-parametric entropy estimation
• Minimal assumptions on the distribution
• Nearest neighbour estimate:
  $H_{1\mathrm{NN}}(Y) = \frac{d}{n}\sum_{i=1}^{n}\log\rho_i + \log(n-1) + \gamma + \log V_d$
  – NN distance $\rho_i = \min_{j \in \{1,\dots,n\}\setminus\{i\}} \|y_i - y_j\|$
  – Euler-Mascheroni constant $\gamma \approx 0.5772$
  – Volume of the d-dim. hypersphere $V_d = \pi^{d/2}/\Gamma(1 + \frac{d}{2})$
• Other estimators: KDE, spanning tree, k-NN, etc.
[Kozachenko, Leonenko, "Sample estimate of the entropy of a random vector", Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, "Nonparametric entropy estimation: An overview", 2001]
[Wang, Kulkarni, Verdú, "Universal estimation of information measures for analog sources", FnT Comm. Inf. Th., 2009]

Solution 2: Non-parametric estimation
[Figure-only slide]

Experimental Results
[Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", ICML 2012]
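As a concrete illustration of the nearest-neighbour estimate above, here is a minimal Python sketch of the Kozachenko-Leonenko formula (not part of the original slides; it assumes NumPy and SciPy, and the helper name kl_entropy_1nn is made up for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def kl_entropy_1nn(y):
    """Kozachenko-Leonenko 1-NN differential entropy estimate (in nats).

    y: (n, d) array of samples; assumes no duplicate points (rho_i > 0).
    """
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    # Distance of each point to its nearest *other* sample point:
    # query k=2 because the closest point to y_i is y_i itself (distance 0).
    tree = cKDTree(y)
    rho = tree.query(y, k=2)[0][:, 1]
    # log volume of the unit d-dimensional ball: pi^(d/2) / Gamma(1 + d/2).
    log_vd = (d / 2.0) * np.log(np.pi) - gammaln(1.0 + d / 2.0)
    # H_1NN = (d/n) * sum_i log rho_i + log(n-1) + gamma + log V_d
    return (d / n) * np.sum(np.log(rho)) + np.log(n - 1) + np.euler_gamma + log_vd

# Sanity check on a standard normal in d=2, whose true entropy is log(2*pi*e).
rng = np.random.default_rng(0)
samples = rng.standard_normal((5000, 2))
print(kl_entropy_1nn(samples))    # should be close to ...
print(np.log(2 * np.pi * np.e))   # ... the true value, about 2.838 nats
```

The sanity check at the end follows the spirit of the slide's goals: for a distribution with known entropy, the estimate should concentrate near the true value as n grows.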
Streaming Decision Trees

Streaming Data
• "Infinite data" setting
• 10 possible splits and their scores
• When to stop and make a decision?
[Figure: plugin $\hat{H}$, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]

Streaming Decision Trees
• Score splits on a subset of the samples only
• Domingos/Hulten (Hoeffding Trees), 2000:
  – Compute the sample count n needed for a given precision
  – Streaming decision tree induction
  – Incorrect confidence intervals, but they work well in practice
• Jin/Agrawal, 2003:
  – Tighter confidence interval, asymptotic derivation using the delta method
• Loh/Nowozin, 2013:
  – Racing algorithm (bad splits are removed early)
  – Finite-sample confidence intervals for entropy and Gini
[Domingos, Hulten, "Mining High-Speed Data Streams", KDD 2000]
[Jin, Agrawal, "Efficient Decision Tree Construction on Streaming Data", KDD 2003]
[Loh, Nowozin, "Faster Hoeffding racing: Bernstein races via jackknife estimates", ALT 2013]

Multivariate Delta Method
Theorem. Let $X_n$ be a sequence of $d$-dimensional random vectors such that $\sqrt{n}\,(X_n - \theta) \Rightarrow \mathcal{N}_d(0, \Sigma(\theta))$. Let $g: \mathbb{R}^d \to \mathbb{R}^k$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then
$\sqrt{n}\,\big(g(X_n) - g(\theta)\big) \Rightarrow \mathcal{N}_k\!\big(0,\; \nabla g(\theta)^T \Sigma(\theta)\, \nabla g(\theta)\big).$
[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]

Delta Method for the Information Gain
• 8 classes, 2 choices (left/right)
• $\pi_{s,i}$: probability of choice s, class i
• $\pi_L = \sum_i \pi_{L,i}$, $\pi_R = \sum_i \pi_{R,i}$, $\pi_i = \pi_{L,i} + \pi_{R,i}$, $\sum_{s \in \{L,R\}} \sum_i \pi_{s,i} = 1$
[Figure: class histograms of the left and right choices]
• Multivariate delta method: for $n \to \infty$ we have that $I(\hat{\pi}) \sim \mathcal{N}\big(I(\pi), \sigma^2(\pi)/n\big)$ with
  $\sigma^2(\pi) = \sum_{s \in \{L,R\}} \sum_i \pi_{s,i}\, \alpha_{s,i}\big(\alpha_{s,i} + I(\pi)\big)$
• $\alpha_{s,i} = \log \pi_s + \log \pi_i - \log \pi_{s,i}$
• $I(\pi) = H(\pi) + H(\pi_L, \pi_R) - H(\pi_{s,i})$, the mutual information (infogain)
• The derivation is lengthy but not difficult; a slight generalization of Jin & Agrawal
[Small, "Expansions and Asymptotics for Statistics", CRC, 2010]
[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]

Delta Method Example
• $I(\hat{\pi}) \sim \mathcal{N}\big(I(\pi), \sigma^2(\pi)/n\big)$; as $n \to \infty$, $\sigma^2(\pi)$ is fixed
[Figure: plugin estimate and standard deviation of the information gain vs. sample size (up to 500), 10,000 replicates, with the infogain truth shown]
[Figure: asymptotic variance of the information gain, empirical stddev vs. delta-method stddev, as a function of the sample size]

Conclusion on Entropy Estimation
• A statistical problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• The distribution of the estimate is relevant in the streaming setting
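To make the delta-method result concrete, here is a minimal simulation sketch (not part of the original slides; it assumes NumPy, and the distribution pi and all names are chosen only for illustration). It compares the empirical spread of the plugin information-gain estimate over multinomial replicates with the delta-method value $\sigma(\pi)/\sqrt{n}$, in the spirit of the "Delta Method Example" slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed joint distribution pi[s, i] over {L, R} x {1, ..., K}, here K = 8.
K = 8
pi = rng.dirichlet(np.ones(2 * K)).reshape(2, K)

def infogain(p):
    """Plugin mutual information I = H(class) + H(split) - H(joint), in nats."""
    def H(q):
        q = q[q > 0]
        return -np.sum(q * np.log(q))
    return H(p.sum(axis=0)) + H(p.sum(axis=1)) - H(p)

# Delta-method variance: sigma^2 = sum_{s,i} pi_si * alpha_si * (alpha_si + I),
# with alpha_si = log pi_s + log pi_i - log pi_si, as on the slide.
I_true = infogain(pi)
alpha = (np.log(pi.sum(axis=1))[:, None]
         + np.log(pi.sum(axis=0))[None, :]
         - np.log(pi))
sigma2 = np.sum(pi * alpha * (alpha + I_true))

# Compare with the empirical spread of the plugin estimate at sample size n.
n, replicates = 200, 10000
counts = rng.multinomial(n, pi.ravel(), size=replicates).reshape(replicates, 2, K)
estimates = np.array([infogain(c / n) for c in counts])

print("true infogain        :", I_true)
print("mean plugin estimate :", estimates.mean())   # biased upward for small n
print("empirical stddev     :", estimates.std())
print("delta-method stddev  :", np.sqrt(sigma2 / n))
```

As on the slides, the empirical standard deviation and the delta-method prediction should roughly agree, while the plugin estimate itself remains biased upward at small sample sizes.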