Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n

Jiantao Jiao (Stanford EE)
Joint work with: Kartik Venkat (Stanford EE), Yanjun Han (Tsinghua EE), Tsachy Weissman (Stanford EE)
Information Theory, Learning, and Big Data Workshop, March 17th, 2015

Problem Setting

Observe n samples from a discrete distribution P with support size S. We want to estimate

$$H(P) = \sum_{i=1}^{S} -p_i \ln p_i \quad \text{(Shannon'48)}$$

$$F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha, \quad \alpha > 0 \quad \text{(diversity, Simpson index, Rényi entropy, etc.)}$$

What was known?

Start with entropy: $H(P) = \sum_{i=1}^{S} -p_i \ln p_i$.

Question: what is the optimal estimator for $H(P)$ given n samples?

There is no unbiased estimator, and the minimax estimator cannot be calculated (Lehmann and Casella'98).

Classical asymptotics: what is the optimal estimator for $H(P)$ when $n \to \infty$?

Classical problem

Empirical entropy:

$$H(P_n) = \sum_{i=1}^{S} -\hat{p}_i \ln \hat{p}_i,$$

where $\hat{p}_i$ is the empirical frequency of symbol $i$. $H(P_n)$ is the maximum likelihood estimator (MLE).

Theorem (Hájek–Le Cam theory): the MLE $H(P_n)$ is asymptotically efficient.
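In Python, the plug-in estimator is a few lines of counting; a minimal sketch (the function name and toy data are ours, not from the talk):

```python
import numpy as np

def empirical_entropy(samples):
    """Plug-in (MLE) estimate H(P_n) = -sum_i phat_i * ln(phat_i)."""
    _, counts = np.unique(samples, return_counts=True)
    phat = counts / counts.sum()
    return -(phat * np.log(phat)).sum()

# Example: 1000 draws from the uniform distribution on 5 symbols.
samples = np.random.default_rng(0).integers(0, 5, size=1000)
print(empirical_entropy(samples))  # close to ln(5) ~ 1.609
```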
Is there something missing?

The optimal estimator for finite samples is unknown.

Question: how about using the MLE when n is finite? We will show it is a very bad idea in general.

Our evaluation criterion: minimax decision theory

Denote by $\mathcal{M}_S$ the set of distributions with support size $S$. Define the worst-case risk

$$R_n^{\max}(F, \hat{F}) \triangleq \sup_{P \in \mathcal{M}_S} \mathbb{E}_P[(F(P) - \hat{F})^2]$$

and the minimax risk

$$\text{Minimax risk} \triangleq \inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P[(F(P) - \hat{F})^2].$$

Notation:
$a_n \asymp b_n \Leftrightarrow 0 < c \le a_n/b_n \le C < \infty$;
$a_n \gtrsim b_n \Leftrightarrow a_n/b_n \ge C > 0$.

The $L_2$ risk decomposes as $\text{Bias}^2 + \text{Variance}$:

$$\mathbb{E}_P[(F(P) - \hat{F})^2] = \big(F(P) - \mathbb{E}_P \hat{F}\big)^2 + \mathrm{Var}(\hat{F}).$$

Shall we analyze the MLE non-asymptotically?

Theorem (J., Venkat, Han, Weissman'14): for $n \gtrsim S$,

$$R_n^{\max}(H, H(P_n)) \asymp \underbrace{\frac{S^2}{n^2}}_{\text{Bias}^2} + \underbrace{\frac{\ln^2 S}{n}}_{\text{Variance}}.$$

Consistency holds if and only if $n \gg S$, and the bias dominates whenever n is not too large compared to S.
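The bias-dominance claim is easy to see empirically. A small Monte Carlo sketch under illustrative parameters of our choosing (uniform P, n comparable to S):

```python
import numpy as np

rng = np.random.default_rng(0)
S = n = 10_000                       # n comparable to S: the biased regime
H_true = np.log(S)                   # entropy of the uniform distribution

estimates = []
for _ in range(200):
    counts = rng.multinomial(n, np.full(S, 1.0 / S))
    phat = counts[counts > 0] / n
    estimates.append(-(phat * np.log(phat)).sum())
estimates = np.array(estimates)

print("squared bias:", (estimates.mean() - H_true) ** 2)  # dominates...
print("variance:    ", estimates.var())                   # ...this by far
```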
Can we reduce the bias?

- Taylor series (expansion of $H(P_n)$ around $P_n = P$)? Requires $n \gg S$ (Paninski'03).
- Jackknife? Requires $n \gg S$ (Paninski'03).
- Bootstrap? Requires at least $n \gtrsim S$ (Han, J., Weissman'15).
- Bayes estimator under a Dirichlet prior? Requires at least $n \gtrsim S$ (Han, J., Weissman'15).
- Plugging in a Dirichlet-smoothed distribution? Requires at least $n \gtrsim S$ (Han, J., Weissman'15).

There are many more...

- Coverage adjusted estimator (Chao, Shen'03; Vu, Yu, Kass'07)
- BUB estimator (Paninski'03)
- Shrinkage estimator (Hausser, Strimmer'09)
- Grassberger estimator (Grassberger'08)
- NSB estimator (Nemenman, Shafee, Bialek'02)
- B-splines estimator (Daub et al.'04)
- Wagner, Viswanath, Kulkarni'11
- Ohannessian, Tan, Dahleh'11
- ...

Recent breakthrough: Valiant and Valiant'11 showed that the exact phase transition of entropy estimation is $n \asymp S/\ln S$. Their linear-programming-based estimator is not shown to achieve the minimax rates (dependence on $\epsilon$); another of their estimators achieves them only for $n \lesssim \frac{S^{1.03}}{\ln S}$, and it is not clear how to handle other functionals.

Systematic improvement

Question: can we find a systematic methodology to improve the MLE and achieve the minimax rates for a wide class of functionals?

George Pólya: "the more general problem may be easier to solve than the special problem."

Start from first principles

If $X \sim B(n, p)$ and we use $g(X)$ to estimate $f(p)$:

$$\mathrm{Bias}(g(X)) \triangleq f(p) - \mathbb{E}_p g(X) = f(p) - \sum_{j=0}^{n} g(j) \binom{n}{j} p^j (1-p)^{n-j}.$$

Only polynomials of order $\le n$ can be estimated without bias (checked numerically below):

$$\mathbb{E}_p\left[\frac{X(X-1)\cdots(X-r+1)}{n(n-1)\cdots(n-r+1)}\right] = p^r, \quad 0 \le r \le n.$$

The bias therefore corresponds to a polynomial approximation error.

Minimize the maximum bias

What if we choose $g(X)$ such that

$$g = \arg\min_{g} \sup_{p \in [0,1]} |\mathrm{Bias}(g(X))| \; ?$$

Paninski'03: it does not work for entropy (the variance is too large!).

Best polynomial approximation:

$$P_k^* \triangleq \arg\min_{P_k \in \mathrm{Poly}_k} \sup_{x \in D} |f(x) - P_k(x)|.$$
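The falling-factorial identity above is what makes polynomial functionals exactly estimable, and it is easy to check numerically; a small sketch (names ours):

```python
import numpy as np

def falling_ratio(X, n, r):
    """g_r(X) = X(X-1)...(X-r+1) / (n(n-1)...(n-r+1)); E[g_r(X)] = p^r."""
    out = np.ones_like(X, dtype=float)
    for j in range(r):
        out = out * (X - j) / (n - j)
    return out

rng = np.random.default_rng(1)
n, p, r = 50, 0.3, 3
X = rng.binomial(n, p, size=1_000_000)
# The sample mean of g_r(X) matches p^r: the estimator is exactly unbiased.
print(falling_ratio(X, n, r).mean(), p ** r)
```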
We only approximate when the problem is hard!

Split $[0,1]$ at the threshold $\frac{\ln n}{n}$ and treat each symbol according to where $\hat{p}_i$ falls (a code sketch follows below):

- "Nonsmooth" regime $\hat{p}_i \in [0, \frac{\ln n}{n}]$: use an unbiased estimate of the best polynomial approximation of $f$ of order $\ln n$.
- "Smooth" regime $\hat{p}_i \in (\frac{\ln n}{n}, 1]$: use the bias-corrected plug-in estimate $f(\hat{p}_i) - \frac{f''(\hat{p}_i)\,\hat{p}_i(1-\hat{p}_i)}{2n}$.

Why the $\frac{\ln n}{n}$ threshold and the $\ln n$ order? Approximation theory in functional estimation:

- Lepski, Nemirovski, Spokoiny'99; Cai and Low'11; Wu and Yang'14 (entropy estimation lower bound)
- Duality with shrinkage (J., Venkat, Han, Weissman'14): a new field to be explored!
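To make the construction concrete, here is a simplified sketch of a two-regime entropy estimator in this spirit. It is ours, not the released JVHW code: it omits refinements such as sample splitting, the constants c1 and c2 are illustrative tuning parameters, and a Chebyshev interpolant stands in for the exact best polynomial approximation. It relies on the scaling identity $-p\ln p = \Delta\, h(p/\Delta) - p\ln\Delta$ for $h(x) = -x\ln x$, so one approximation of $h$ on $[0,1]$ covers the whole nonsmooth regime $[0,\Delta]$.

```python
import numpy as np
from numpy.polynomial import Polynomial, chebyshev as C

def coeffs_neg_xlnx(K):
    """Monomial coefficients of a degree-K Chebyshev interpolant (a
    near-best polynomial approximation) of h(x) = -x ln x on [0, 1]."""
    k = np.arange(K + 1)
    u = np.cos((2 * k + 1) * np.pi / (2 * (K + 1)))  # Chebyshev nodes in (-1, 1)
    x = (u + 1) / 2                                  # mapped into (0, 1)
    c = C.chebfit(u, -x * np.log(x), K)              # Chebyshev coefficients in u
    # Substitute u = 2x - 1 to get an ordinary power series in x.
    return Polynomial(C.cheb2poly(c))(Polynomial([-1.0, 2.0])).coef

def falling_ratio(X, n, r):
    """Unbiased estimator of p^r from X ~ Binomial(n, p); zero when X < r."""
    out = 1.0
    for j in range(r):
        out *= (X - j) / (n - j)
    return out

def entropy_two_regime(counts, c1=4.0, c2=1.6):
    """Two-regime entropy estimate from symbol counts (sketch only)."""
    counts = np.asarray(counts)
    counts = counts[counts > 0]
    n = counts.sum()
    delta = c1 * np.log(n) / n          # nonsmooth/smooth threshold ~ ln(n)/n
    b = coeffs_neg_xlnx(int(np.ceil(c2 * np.log(n))))  # order ~ ln n
    est = 0.0
    for X in counts:
        phat = X / n
        if phat < delta:
            # -p ln p = delta*h(p/delta) - p*ln(delta): estimate every
            # monomial p^r without bias, so only approximation error remains.
            est += sum(b[r] * delta ** (1 - r) * falling_ratio(X, n, r)
                       for r in range(len(b)))
            est -= np.log(delta) * phat
        else:
            # Bias-corrected plug-in f(phat) - f''(phat)*phat*(1-phat)/(2n),
            # which for f(p) = -p ln p equals -phat*ln(phat) + (1-phat)/(2n).
            est += -phat * np.log(phat) + (1 - phat) / (2 * n)
    return est
```

For small orders $K \approx \ln n$ the direct monomial conversion above is numerically tolerable; released implementations handle the approximation coefficients more carefully.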
A class of functionals

Recall $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$. [J., Venkat, Han, Weissman'14] showed:

| Functional | Minimax $L_2$ rate | $L_2$ rate of MLE |
| $H(P)$ | $\frac{S^2}{(n\ln n)^2} + \frac{\ln^2 S}{n}$ | $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$ |
| $F_\alpha(P)$, $0 < \alpha \le \frac{1}{2}$ | $\frac{S^2}{(n\ln n)^{2\alpha}}$ | $\frac{S^2}{n^{2\alpha}}$ |
| $F_\alpha(P)$, $\frac{1}{2} < \alpha < 1$ | $\frac{S^2}{(n\ln n)^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$ | $\frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$ |
| $F_\alpha(P)$, $1 < \alpha < \frac{3}{2}$ | $(n\ln n)^{-2(\alpha-1)}$ | $n^{-2(\alpha-1)}$ |
| $F_\alpha(P)$, $\alpha \ge \frac{3}{2}$ | $n^{-1}$ | $n^{-1}$ |

Effective sample size enlargement: being minimax rate-optimal with n samples is equivalent to running the MLE with n ln n samples.

Sample complexity perspectives

| Functional | MLE | Optimal |
| $H(P)$ | $\Theta(S)$ | $\Theta(S/\ln S)$ |
| $F_\alpha(P)$, $0 < \alpha < 1$ | $\Theta(S^{1/\alpha})$ | $\Theta(S^{1/\alpha}/\ln S)$ |
| $F_\alpha(P)$, $\alpha > 1$ | $\Theta(1)$ | $\Theta(1)$ |

Existing literature on $F_\alpha(P)$ for $0 < \alpha < 2$: minimax rates unknown, sample complexity unknown, optimal schemes unknown.

Unique features of our entropy estimator

- One realization of a general methodology.
- Achieves the minimax rate (lower bound due to Wu and Yang'14): $\frac{S^2}{(n\ln n)^2} + \frac{\ln^2 S}{n}$.
- Agnostic to the knowledge of S.
- Easily implementable: the approximation order is $\ln n$, and a 500th-order approximation takes 0.46s on a PC (Chebfun).
- Empirical performance outperforms every available entropy estimator. (Code to be released this week, available upon request.)
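A rough Python analogue of the Chebfun timing (ours; plain Chebyshev interpolation rather than Chebfun's machinery): fitting a degree-500 approximation of $h(x) = -x\ln x$ takes a fraction of a second.

```python
import time
import numpy as np
from numpy.polynomial import chebyshev as C

deg = 500
k = np.arange(deg + 1)
u = np.cos((2 * k + 1) * np.pi / (2 * (deg + 1)))  # Chebyshev nodes in (-1, 1)
x = (u + 1) / 2                                    # mapped into (0, 1)

t0 = time.perf_counter()
c = C.chebfit(u, -x * np.log(x), deg)              # degree-500 interpolant
t1 = time.perf_counter()

grid = np.linspace(1e-12, 1.0, 20_001)
err = np.abs(C.chebval(2 * grid - 1, c) + grid * np.log(grid)).max()
print(f"fit time: {t1 - t0:.3f}s, sup-norm error: {err:.1e}")
```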
Some statisticians raised interesting questions: "We may not use this estimator unless you prove it is adaptive."

Adaptive estimation

Our estimator is near-minimax:

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P\big(\hat{H} - H(P)\big)^2.$$

What if we know a priori that $H(P) \le H$? We then want

$$\sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P\big(\hat{H}^{\mathrm{Our}} - H(P)\big)^2 \asymp \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P\big(\hat{H} - H(P)\big)^2,$$

where $\mathcal{M}_S(H) = \{P : H(P) \le H,\; P \in \mathcal{M}_S\}$. Can our estimator satisfy all these requirements without knowing S and H?

Our estimator is adaptive!

Theorem (Han, J., Weissman'15):

$$\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P|\hat{H} - H(P)|^2 \asymp \begin{cases} \dfrac{S^2}{(n \ln n)^2} + \dfrac{H \ln S}{n} & \text{if } S \ln S \le e\, nH \ln n, \\[6pt] \dfrac{H^2}{\ln^2 S}\, \ln^2 \dfrac{S \ln S}{nH \ln n} & \text{otherwise.} \end{cases} \quad (1)$$

For $\epsilon \gtrsim \frac{H}{\ln S}$, achieving root MSE $\epsilon$ requires $n \gtrsim \frac{S^{1-\epsilon/H}}{H}$ (much smaller than $S/\ln S$ for small $H$!), whereas the MLE requires precisely $n \gtrsim \frac{S^{1-\epsilon/H} \ln S}{H}$. The $n \Rightarrow n \ln n$ phenomenon is still true!

Understanding the generality

"Can you deal with cases where the non-differentiable regime is not just a point?"

We consider $L_1(P, Q) = \sum_{i=1}^{S} |p_i - q_i|$. Breakthrough by Valiant and Valiant'11: the MLE requires $n \asymp S$ samples, while the optimal sample complexity is $S/\ln S$! However, VV'11's construction is only proved optimal when $S/\ln S \lesssim n \lesssim S$.

Applying our general methodology

The minimax $L_2$ rate is $\frac{S}{n \ln n}$, while that of the MLE is $\frac{S}{n}$: the $n \Rightarrow n \ln n$ phenomenon is still true! We also show that VV'11 cannot achieve this rate when $n \gtrsim S$.

However, is this easy to do? The nonsmooth regime is now a whole line! How do we cover it? With a small square $[0, \frac{\ln n}{n}]^2$? With a band? With some other shape?

Bad news

- Best polynomial approximation in 2D is not unique.
- There exists no efficient algorithm to compute it.
- Some polynomials that achieve the best approximation rate cannot be used in our methodology.
- All the nonsmooth-regime designs mentioned before fail.

Solution: hints are in our paper "Minimax estimation of functionals of discrete distributions", or wait for the soon-to-be-finished paper "Minimax estimation of divergence functions".

Significant difference in phase transitions

One may think: "Who cares about a ln n improvement?" Take the sequence n = 2S/ln S, with S sampled equally (on a log scale) from 10^2 to 10^7; this keeps S/n in [2.2727, 8.0590]. For each (S, n), sample 20 times from a uniform distribution.
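A sketch of this experiment (ours), reusing the `entropy_two_regime` sketch from above as a stand-in for the actual JVHW estimator, and stopping at $S = 10^5$ so it runs quickly:

```python
import numpy as np

rng = np.random.default_rng(0)
for S in np.logspace(2, 5, 7).astype(int):   # the talk sweeps up to 1e7
    n = int(2 * S / np.log(S))               # the n = 2S/ln(S) sequence
    H_true = np.log(S)                       # entropy of the uniform P
    se_mle, se_two = [], []
    for _ in range(20):                      # 20 trials per (S, n), as in the talk
        counts = rng.multinomial(n, np.full(S, 1.0 / S))
        phat = counts[counts > 0] / n
        se_mle.append((-(phat * np.log(phat)).sum() - H_true) ** 2)
        se_two.append((entropy_two_regime(counts) - H_true) ** 2)
    print(f"S={S:>6}  S/n={S / n:.2f}  MSE(MLE)={np.mean(se_mle):.3f}"
          f"  MSE(two-regime)={np.mean(se_two):.3f}")
```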
Significant improvement!

[Figure: MSE versus S/n for the JVHW estimator and the MLE.]

Mutual Information $I(P_{XY})$

Mutual information: $I(P_{XY}) = H(P_X) + H(P_Y) - H(P_{XY})$.

Theorem (J., Venkat, Han, Weissman'14): the $n \Rightarrow n \ln n$ phenomenon also holds here.

Learning Tree Graphical Models

A d-dimensional random vector with alphabet size S: $X = (X_1, X_2, \ldots, X_d)$. Suppose the p.m.f. has a tree structure:

$$P_X = \prod_{i=1}^{d} P_{X_i | X_{\pi(i)}}.$$

[Figure: an approximating dependence tree; in this example $\hat{P}_T \approx P_{X_6} P_{X_1|X_6} P_{X_3|X_6} P_{X_4|X_3} P_{X_2|X_3} P_{X_5|X_2}$.]

Given n i.i.d. samples from $P_X$, estimate the underlying tree structure.

Chow–Liu algorithm (1968)

Natural approach: maximum likelihood! Chow–Liu'68 solved it (2000 citations) via a maximum weight spanning tree (MWST) with empirical mutual information edge weights.

Theorem (Chow, Liu'68):

$$E_{\mathrm{MLE}} = \arg\max_{E_Q \text{ is a tree}} \sum_{e \in E_Q} I(\hat{P}_e).$$

Our approach: replace empirical mutual information by better estimates!
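A sketch of both variants (ours): estimate pairwise mutual information via $I(X;Y) = H(X) + H(Y) - H(X,Y)$ from an entropy estimator applied to empirical counts, then take a maximum-weight spanning tree. With `plugin_entropy` this is the original Chow–Liu; swapping in a minimax estimator such as the `entropy_two_regime` sketch above gives the modification.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def plugin_entropy(counts):
    """Plug-in entropy from a vector of symbol counts."""
    phat = counts / counts.sum()
    return -(phat * np.log(phat)).sum()

def mi_estimate(x, y, entropy_est):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), each term estimated from counts."""
    cx = np.unique(x, return_counts=True)[1]
    cy = np.unique(y, return_counts=True)[1]
    cxy = np.unique(np.stack([x, y]), axis=1, return_counts=True)[1]
    return entropy_est(cx) + entropy_est(cy) - entropy_est(cxy)

def chow_liu(data, entropy_est=plugin_entropy):
    """Chow-Liu tree: maximum-weight spanning tree under pairwise MI.
    data: (num_samples, d) integer array; returns a list of edges (i, j)."""
    d = data.shape[1]
    W = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            # Negate MI (and shift by -1 so no entry is exactly zero, since
            # scipy treats zeros in a dense graph matrix as missing edges);
            # a minimum spanning tree of -MI is a maximum-weight tree of MI.
            W[i, j] = -1.0 - mi_estimate(data[:, i], data[:, j], entropy_est)
    tree = minimum_spanning_tree(W)
    return sorted(zip(*tree.nonzero()))
```

In the experiment that follows, the only difference between the two curves is this choice of entropy estimator.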
Original CL vs. modified CL

We set d = 7, S = 300, a star graph, and sweep n from 1,000 to 55,000.

8 fold improvement! [and even more]

[Figure: expected wrong-edges ratio versus the number of samples n, for the original and modified Chow–Liu algorithms.]

What we did not talk about

- How to analyze the bias of the MLE
- Why bootstrap and jackknife fail
- Extensions to multivariate settings ($\ell_p$ distance)
- Extensions to nonparametric estimation
- Extensions to non-i.i.d. models
- Applications in machine learning, medical imaging, computer vision, genomics, etc.

Hindsight

Unbiased estimation works when:
- the approximation basis satisfies the unbiased equation (Kolmogorov'50);
- the basis has good approximation properties;
- the additional variance incurred by the approximation is not large.

Theory of functional estimation ⇔ analytical properties of functions: the whole field of approximation theory, or even more.

Related work

- O. Lepski, A. Nemirovski, and V. Spokoiny'99, "On estimation of the L_r norm of a regression function"
- T. Cai and M. Low'11, "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional"
- Y. Wu and P. Yang'14, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation"
- J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi'14, "The complexity of estimating Rényi entropy"

Our work

- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Minimax estimation of functionals of discrete distributions", to appear in IEEE Transactions on Information Theory
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Maximum likelihood estimation of functionals of discrete distributions", available on arXiv
- J. Jiao, K. Venkat, Y. Han, T. Weissman, "Beyond maximum likelihood: from theory to practice", available on arXiv
- J. Jiao, Y. Han, T. Weissman, "Minimax estimation of divergence functions", in preparation
- Y. Han, J. Jiao, T. Weissman, "Adaptive estimation of Shannon entropy", available on arXiv
- Y. Han, J. Jiao, T. Weissman, "Does Dirichlet prior smoothing solve the Shannon entropy estimation problem?", available on arXiv
- Y. Han, J. Jiao, T. Weissman, "Bias correction using Taylor series, Bootstrap, and Jackknife", in preparation
- Y. Han, J. Jiao, T. Weissman, "How to bound the gap of Jensen's inequality?", in preparation

Thank you!