Entropy Estimation and Applications to Decision Trees

Estimation
• Distribution over K = 8 classes, true entropy H = 1.289
• Repeat 50,000 times:
  1. Generate N samples
  2. Estimate entropy from the samples
[Figures: histograms of the plugin estimate $\hat{H}$ over 50,000 replicates for N = 10, N = 100 and N = 1000]

Estimation
[Figure: plugin $\hat{H}$, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]

Estimating the true entropy
Goals:
1. Consistency: a large N guarantees the correct result
2. Low variance: the variation of the estimates is small
3. Low bias: the expected estimate should be correct

Discrete Entropy Estimators

Experimental Results
• UCI classification data sets
• Accuracy on the test set
• Plugin vs. Grassberger
• Better trees
Source: [Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", ICML 2012]

Differential Entropy Estimation
• In regression, the differential entropy
  – measures the remaining uncertainty about y
  – is a function of a distribution:
    $H(q) = -\int_y q(y|x) \log q(y|x) \,\mathrm{d}y$
• Problem
  – q is not from a parametric family
• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation

Solution 1: parametric family
• Multivariate Normal distribution
  – Estimate the covariance matrix $C$ of all y vectors
  – Plugin estimate of the entropy:
    $H(C) = \frac{d}{2} + \frac{d}{2}\log 2\pi + \frac{1}{2}\log|C|$
  – Uniform minimum variance unbiased estimator (UMVUE):
    $H(Y) = \frac{d}{2}\log(e\pi) + \frac{1}{2}\log\Big|\sum_{y \in Y}(y-\bar{y})(y-\bar{y})^T\Big| - \frac{1}{2}\sum_{i=1}^{d}\psi\!\left(\frac{n+1-i}{2}\right)$
[Ahmed, Gokhale, "Entropy expressions and their estimators for multivariate distributions", IEEE Trans. Inf. Theory, 1989]

Solution 1: parametric family
[Figure-only slides]

Solution 2: Non-parametric entropy estimation
• Minimal assumptions on the distribution
• Nearest neighbour estimate:
  $H_{1\mathrm{NN}}(Y) = \frac{d}{n}\sum_{i=1}^{n}\log\rho_i + \log(n-1) + \gamma + \log V_d$
  – NN distance $\rho_i = \min_{j \in \{1,\dots,n\}\setminus\{i\}} \|y_i - y_j\|$
  – Euler-Mascheroni constant $\gamma \approx 0.5772$
  – Volume of the d-dim. hypersphere $V_d = \pi^{d/2}/\Gamma(1 + \frac{d}{2})$
• Other estimators: KDE, spanning tree, k-NN, etc.
[Kozachenko, Leonenko, "Sample estimate of the entropy of a random vector", Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, "Nonparametric entropy estimation: An overview", 2001]
[Wang, Kulkarni, Verdú, "Universal estimation of information measures for analog sources", FnT Comm. Inf. Th., 2009]

Solution 2: Non-parametric estimation
[Figure-only slide]

Experimental Results
[Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", ICML 2012]
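As a concrete illustration of the nearest-neighbour estimate above, here is a minimal Python sketch of the Kozachenko-Leonenko formula (not part of the original slides; it assumes NumPy and SciPy, and the helper name kl_entropy_1nn is made up for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def kl_entropy_1nn(y):
    """Kozachenko-Leonenko 1-NN differential entropy estimate (in nats).

    y: (n, d) array of samples; assumes no duplicate points (rho_i > 0).
    """
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    # Distance of each point to its nearest *other* sample point:
    # query k=2 because the closest point to y_i is y_i itself (distance 0).
    tree = cKDTree(y)
    rho = tree.query(y, k=2)[0][:, 1]
    # log volume of the unit d-dimensional ball: pi^(d/2) / Gamma(1 + d/2).
    log_vd = (d / 2.0) * np.log(np.pi) - gammaln(1.0 + d / 2.0)
    # H_1NN = (d/n) * sum_i log rho_i + log(n-1) + gamma + log V_d
    return (d / n) * np.sum(np.log(rho)) + np.log(n - 1) + np.euler_gamma + log_vd

# Sanity check on a standard normal in d=2, whose true entropy is log(2*pi*e).
rng = np.random.default_rng(0)
samples = rng.standard_normal((5000, 2))
print(kl_entropy_1nn(samples))    # should be close to ...
print(np.log(2 * np.pi * np.e))   # ... the true value, about 2.838 nats
```

The sanity check at the end follows the spirit of the slide's goals: for a distribution with known entropy, the estimate should concentrate near the true value as n grows.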
Streaming Decision Trees

Streaming Data
• "Infinite data" setting
• 10 possible splits and their scores
• When to stop and make a decision?
[Figure: plugin $\hat{H}$, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]

Streaming Decision Trees
• Score splits on a subset of the samples only
• Domingos/Hulten (Hoeffding Trees), 2000:
  – Compute the sample count n needed for a given precision
  – Streaming decision tree induction
  – Incorrect confidence intervals, but they work well in practice
• Jin/Agrawal, 2003:
  – Tighter confidence interval, asymptotic derivation using the delta method
• Loh/Nowozin, 2013:
  – Racing algorithm (bad splits are removed early)
  – Finite-sample confidence intervals for entropy and Gini
[Domingos, Hulten, "Mining High-Speed Data Streams", KDD 2000]
[Jin, Agrawal, "Efficient Decision Tree Construction on Streaming Data", KDD 2003]
[Loh, Nowozin, "Faster Hoeffding racing: Bernstein races via jackknife estimates", ALT 2013]

Multivariate Delta Method
Theorem. Let $X_n$ be a sequence of $d$-dimensional random vectors such that $\sqrt{n}\,(X_n - \theta) \Rightarrow \mathcal{N}_d(0, \Sigma(\theta))$. Let $g: \mathbb{R}^d \to \mathbb{R}^k$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then
$\sqrt{n}\,\big(g(X_n) - g(\theta)\big) \Rightarrow \mathcal{N}_k\!\big(0,\; \nabla g(\theta)^T \Sigma(\theta)\, \nabla g(\theta)\big).$
[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]

Delta Method for the Information Gain
• 8 classes, 2 choices (left/right)
• $\pi_{s,i}$: probability of choice s, class i
• $\pi_L = \sum_i \pi_{L,i}$, $\pi_R = \sum_i \pi_{R,i}$, $\pi_i = \pi_{L,i} + \pi_{R,i}$, $\sum_{s \in \{L,R\}} \sum_i \pi_{s,i} = 1$
[Figure: class histograms of the left and right choices]
• Multivariate delta method: for $n \to \infty$ we have that $I(\hat{\pi}) \sim \mathcal{N}\big(I(\pi), \sigma^2(\pi)/n\big)$ with
  $\sigma^2(\pi) = \sum_{s \in \{L,R\}} \sum_i \pi_{s,i}\, \alpha_{s,i}\big(\alpha_{s,i} + I(\pi)\big)$
• $\alpha_{s,i} = \log \pi_s + \log \pi_i - \log \pi_{s,i}$
• $I(\pi) = H(\pi) + H(\pi_L, \pi_R) - H(\pi_{s,i})$, the mutual information (infogain)
• The derivation is lengthy but not difficult; a slight generalization of Jin & Agrawal
[Small, "Expansions and Asymptotics for Statistics", CRC, 2010]
[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]

Delta Method Example
• $I(\hat{\pi}) \sim \mathcal{N}\big(I(\pi), \sigma^2(\pi)/n\big)$; as $n \to \infty$, $\sigma^2(\pi)$ is fixed
[Figure: plugin estimate and standard deviation of the information gain vs. sample size (up to 500), 10,000 replicates, with the infogain truth shown]
[Figure: asymptotic variance of the information gain, empirical stddev vs. delta-method stddev, as a function of the sample size]

Conclusion on Entropy Estimation
• A statistical problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• The distribution of the estimate is relevant in the streaming setting
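To make the delta-method result concrete, here is a minimal simulation sketch (not part of the original slides; it assumes NumPy, and the distribution pi and all names are chosen only for illustration). It compares the empirical spread of the plugin information-gain estimate over multinomial replicates with the delta-method value $\sigma(\pi)/\sqrt{n}$, in the spirit of the "Delta Method Example" slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed joint distribution pi[s, i] over {L, R} x {1, ..., K}, here K = 8.
K = 8
pi = rng.dirichlet(np.ones(2 * K)).reshape(2, K)

def infogain(p):
    """Plugin mutual information I = H(class) + H(split) - H(joint), in nats."""
    def H(q):
        q = q[q > 0]
        return -np.sum(q * np.log(q))
    return H(p.sum(axis=0)) + H(p.sum(axis=1)) - H(p)

# Delta-method variance: sigma^2 = sum_{s,i} pi_si * alpha_si * (alpha_si + I),
# with alpha_si = log pi_s + log pi_i - log pi_si, as on the slide.
I_true = infogain(pi)
alpha = (np.log(pi.sum(axis=1))[:, None]
         + np.log(pi.sum(axis=0))[None, :]
         - np.log(pi))
sigma2 = np.sum(pi * alpha * (alpha + I_true))

# Compare with the empirical spread of the plugin estimate at sample size n.
n, replicates = 200, 10000
counts = rng.multinomial(n, pi.ravel(), size=replicates).reshape(replicates, 2, K)
estimates = np.array([infogain(c / n) for c in counts])

print("true infogain        :", I_true)
print("mean plugin estimate :", estimates.mean())   # biased upward for small n
print("empirical stddev     :", estimates.std())
print("delta-method stddev  :", np.sqrt(sigma2 / n))
```

As on the slides, the empirical standard deviation and the delta-method prediction should roughly agree, while the plugin estimate itself remains biased upward at small sample sizes.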