Information Measure Estimation and Applications:
Boosting the Effective Sample Size from n to n ln n
Jiantao Jiao (Stanford EE)
Joint work with: Kartik Venkat (Stanford EE), Yanjun Han (Tsinghua EE), Tsachy Weissman (Stanford EE)
Information Theory, Learning, and Big Data Workshop
March 17th, 2015
Problem Setting
Observe n samples from discrete distribution P with support size S
We want to estimate
H(P) = Σ_{i=1}^S −p_i ln p_i   (Shannon’48)
F_α(P) = Σ_{i=1}^S p_i^α,  α > 0   (Diversity, Simpson index, Rényi entropy, etc.)
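For concreteness, a minimal Python sketch of the two target functionals (ours, not part of the original slides):

    import numpy as np

    def shannon_entropy(p):
        """H(P) = sum_i -p_i ln p_i, in nats; uses the convention 0 ln 0 = 0."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))

    def power_sum(p, alpha):
        """F_alpha(P) = sum_i p_i^alpha, alpha > 0."""
        p = np.asarray(p, dtype=float)
        return float(np.sum(p[p > 0] ** alpha))

    P = np.array([0.5, 0.25, 0.125, 0.125])
    print(shannon_entropy(P))   # 1.2130 nats
    print(power_sum(P, 2.0))    # 0.34375 (Simpson index)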
What was known?
Start with entropy: H(P) = Σ_{i=1}^S −p_i ln p_i.
Question
Optimal estimator for H(P) given n samples?
No unbiased estimator, cannot calculate minimax estimator...
(Lehmann and Casella’98)
Classical asymptotics
Optimal estimator for H(P) when n → ∞?
Classical problem
Empirical Entropy
H(P_n) = Σ_{i=1}^S −p̂_i ln p̂_i
where p̂i is the empirical frequency of symbol i
H(Pn ) is the Maximum Likelihood Estimator (MLE)
Theorem
The MLE H(Pn ) is asymptotically efficient. (Hájek–Le Cam theory)
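As a reference point, a minimal plug-in (MLE) sketch (ours, not from the talk); run on n < S samples, it already hints at the trouble ahead:

    import numpy as np

    def empirical_entropy(samples, S=None):
        """Plug-in (MLE) estimate H(P_n): empirical frequencies into H."""
        counts = np.bincount(samples, minlength=S or 0)
        phat = counts[counts > 0] / len(samples)
        return float(-np.sum(phat * np.log(phat)))

    rng = np.random.default_rng(0)
    P = np.full(100, 1 / 100)             # uniform, H = ln 100 ~ 4.605
    x = rng.choice(100, size=50, p=P)     # n = 50 < S = 100
    print(empirical_entropy(x))           # badly biased downward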
Is there something missing?
Optimal estimator for finite samples unknown :(
Question
How about using MLE when n is finite?
We will show it is a very bad idea in general.
Our evaluation criterion: minimax decision theory
Denote by M_S the set of distributions with support size S.

R_n^max(F, F̂) ≜ sup_{P∈M_S} E_P[(F(P) − F̂)²]

Minimax risk ≜ inf_{F̂} sup_{P∈M_S} E_P[(F(P) − F̂)²]

Notation:
a_n ≍ b_n ⇔ 0 < c ≤ a_n/b_n ≤ C < ∞
a_n ≳ b_n ⇔ a_n/b_n ≥ C > 0

L² Risk = Bias² + Variance:
E_P[(F(P) − F̂)²] = (F(P) − E_P F̂)² + Var(F̂)
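The decomposition is easy to verify numerically; a quick Monte Carlo sketch for the entropy MLE (S, n, and trial counts are our own illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    S, n, trials = 1000, 1000, 2000
    P = np.full(S, 1 / S)                  # uniform; H(P) = ln S
    H_true = np.log(S)

    est = np.empty(trials)
    for t in range(trials):
        phat = rng.multinomial(n, P) / n   # empirical distribution
        phat = phat[phat > 0]
        est[t] = -np.sum(phat * np.log(phat))

    bias, var = est.mean() - H_true, est.var()
    print(bias**2 + var, np.mean((est - H_true) ** 2))  # the two agree
    print(bias**2, var)                                 # bias^2 dominates here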
Shall we analyze MLE non-asymptotically?
Theorem (J., Venkat, Han, Weissman’14)
R_n^max(H, H(P_n)) ≍ S²/n² + (ln²S)/n,   for n ≳ S,
where S²/n² is the squared bias and (ln²S)/n the variance.

n ≫ S ⇔ Consistency
The bias dominates if n is not too large compared to S.
Can we reduce the bias?
Taylor series? (Taylor expansion of H(P_n) around P_n = P)
Requires n ≫ S (Paninski’03)
Jackknife?
Requires n ≫ S (Paninski’03)
Bootstrap?
Requires at least n ≍ S (Han, J., Weissman’15)
Bayes estimator under Dirichlet prior?
Requires at least n ≍ S (Han, J., Weissman’15)
Plug-in of the Dirichlet-smoothed distribution?
Requires at least n ≍ S (Han, J., Weissman’15)
There are many more...
Coverage adjusted estimator (Chao, Shen’03, Vu, Yu, Kass’07)
BUB estimator (Paninski’03)
Shrinkage estimator (Hausser, Strimmer’09)
Grassberger estimator (Grassberger’08)
NSB estimator (Nemenman, Shafee, Bialek’02)
B-Splines estimator (Daub et al.’04)
Wagner, Viswanath, Kulkarni’11
Ohannessian, Tan, Dahleh’11
...
Recent breakthrough: Valiant and Valiant’11: the exact phase transition of entropy estimation is n ≍ S/ln S
The Linear Programming based estimator is not shown to achieve the minimax rates (dependence on ε)
Another one achieves it, but only for n ≲ S^{1.03}/ln S
Not clear about other functionals
Systematic improvement
Question
Can we find a systematic methodology to improve MLE, and achieve the
minimax rates for a wide class of functionals?
George Pólya: “the more general problem may be easier to solve than the
special problem”.
Start from first principles
X ∼ B(n, p); if we use g(X) to estimate f(p):

Bias(g(X)) ≜ f(p) − E_p g(X) = f(p) − Σ_{j=0}^n g(j) C(n, j) p^j (1−p)^{n−j}

Only polynomials of order ≤ n can be estimated without bias:

E_p [ X(X−1)···(X−r+1) / (n(n−1)···(n−r+1)) ] = p^r,   0 ≤ r ≤ n

Bias corresponds to polynomial approximation error
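The falling-factorial identity is easy to check by simulation; a sketch (the helper name falling_factorial_estimate is ours):

    import numpy as np

    def falling_factorial_estimate(X, n, r):
        """Unbiased estimator of p^r from X ~ Binomial(n, p), 0 <= r <= n."""
        num = np.prod([X - i for i in range(r)], axis=0, dtype=float)
        den = np.prod([n - i for i in range(r)], dtype=float)
        return num / den

    rng = np.random.default_rng(2)
    n, p, r = 30, 0.3, 3
    X = rng.binomial(n, p, size=200_000)
    print(falling_factorial_estimate(X, n, r).mean(), p**r)  # both ~0.027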
Minimize the maximum bias
What if we choose g(X) such that
g = arg min_g sup_{p∈[0,1]} |Bias(g(X))|

Paninski’03
It does not work for entropy. (Variance is too large!)

Best polynomial approximation:
P_k(x) ≜ arg min_{P_k∈Poly_k} sup_{x∈D} |f(x) − P_k(x)|
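To get a feel for the approximation error, a sketch using Chebyshev interpolation as a computationally easy stand-in for the true best polynomial approximation (the Remez algorithm, as in Chebfun’s remez, computes the exact minimax polynomial; this is only near-best):

    import numpy as np
    from numpy.polynomial import chebyshev as C

    def cheb_interp_err(f, k, a=0.0, b=1.0):
        """Sup-norm error of the degree-k Chebyshev interpolant of f on [a, b]
        (near-best: within roughly a ln k factor of the minimax error)."""
        j = np.arange(k + 1)
        nodes = (a + b) / 2 + (b - a) / 2 * np.cos((2 * j + 1) * np.pi / (2 * k + 2))
        t = (2 * nodes - a - b) / (b - a)        # map nodes to [-1, 1]
        coef = C.chebfit(t, f(nodes), k)         # interpolation through k+1 points
        x = np.linspace(a, b, 5001)
        tx = (2 * x - a - b) / (b - a)
        return np.max(np.abs(f(x) - C.chebval(tx, coef)))

    f = lambda x: np.where(x > 0, -x * np.log(np.where(x > 0, x, 1.0)), 0.0)
    for k in [4, 8, 16, 32, 64]:
        print(k, cheb_interp_err(f, k))   # error decays roughly like 1/k^2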
We only approximate when the problem is hard!
[Figure: f(p) over p̂_i ∈ [0, 1], split at ln n/n. In the “nonsmooth” regime [0, ln n/n], use an unbiased estimate of the best polynomial approximation of f of order ln n; in the “smooth” regime (ln n/n, 1], use the bias-corrected plug-in f(p̂_i) − f″(p̂_i) p̂_i (1 − p̂_i)/(2n).]
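Putting the two regimes together for entropy, a simplified sketch of this construction (assumptions, all ours: order K = ln n, threshold ln n/n, approximation interval [0, 2 ln n/n], a least-squares fit in place of the best polynomial approximation, and no sample splitting; the released JVHW code is more careful on all of these):

    import numpy as np

    def entropy_two_regime_sketch(counts, n):
        """Two-regime entropy estimate (nats) from symbol counts summing to n."""
        K = max(2, int(np.log(n)))       # approximation order ~ ln n
        thr = np.log(n) / n              # regime threshold
        delta = 2 * np.log(n) / n        # approximation interval [0, delta]

        # Polynomial fit of g(t) = -t ln t on [0, 1]; a proxy for best approximation.
        t = np.linspace(0, 1, 2000)[1:]
        b = np.polyfit(t, -t * np.log(t), K)[::-1]   # b[m] multiplies t^m

        H = 0.0
        for X in counts[counts > 0]:
            phat = X / n
            if phat < thr:
                # f(p) ~ sum_m b[m] delta^(1-m) p^m - p ln(delta); estimate each
                # p^m without bias via falling factorials, and p via X/n.
                est = 0.0
                for m in range(K + 1):
                    if m > X:            # falling factorial is exactly 0
                        continue
                    ff = np.prod([(X - i) / (n - i) for i in range(m)])
                    est += b[m] * delta ** (1 - m) * ff
                H += est - phat * np.log(delta)
            else:
                # bias-corrected plug-in: f(phat) - f''(phat) phat (1-phat)/(2n),
                # with f = -p ln p, so the correction is +(1-phat)/(2n).
                H += -phat * np.log(phat) + (1 - phat) / (2 * n)
        return H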
Why the ln n/n threshold and the ln n approximation order?
Approximation theory in functional estimation
Lepski, Nemirovski, Spokoiny’99, Cai and Low’11, Wu and Yang’14
(entropy estimation lower bound)
Duality with shrinkage (J., Venkat, Han, Weissman’14): a new field
to be explored!
A class of functionals
Note F_α(P) = Σ_{i=1}^S p_i^α, α > 0. [J., Venkat, Han, Weissman’14] showed:

Functional            | Minimax L² rates                  | L² rates of MLE
H(P)                  | S²/(n ln n)² + ln²S/n             | S²/n² + ln²S/n
F_α(P), 0 < α ≤ 1/2   | S²/(n ln n)^{2α}                  | S²/n^{2α}
F_α(P), 1/2 < α < 1   | S²/(n ln n)^{2α} + S^{2−2α}/n     | S²/n^{2α} + S^{2−2α}/n
F_α(P), 1 < α < 3/2   | (n ln n)^{−2(α−1)}                | n^{−2(α−1)}
F_α(P), α ≥ 3/2       | n^{−1}                            | n^{−1}
Effective sample size enlargement
Minimax rate-optimal with n samples ⇔ MLE with n ln n samples
Sample complexity perspectives
Functional            | MLE          | Optimal
H(P)                  | Θ(S)         | Θ(S/ln S)
F_α(P), 0 < α < 1     | Θ(S^{1/α})   | Θ(S^{1/α}/ln S)
F_α(P), α > 1         | Θ(1)         | Θ(1)

Existing literature on F_α(P) for 0 < α < 2: minimax rates unknown, sample complexity unknown, optimal schemes unknown...
Unique features of our entropy estimator
One realization of a general methodology
Achieve the minimax rate (lower bound due to Wu and Yang’14):
S²/(n ln n)² + ln²S/n
Agnostic to the knowledge of S
Unique features of our entropy estimator
Easily implementable: the approximation order is ln n; even a 500th-order approximation takes 0.46 s on a PC (Chebfun)
Empirical performance outperforms every available entropy estimator
(Code to be released this week, available upon request)
Some statisticians raised interesting questions: “We may not use this
estimator unless you prove it is adaptive.”
Adaptive estimation
Near-minimax estimator:
sup_{P∈M_S} E_P (Ĥ^Our − H(P))² ≍ inf_{Ĥ} sup_{P∈M_S} E_P (Ĥ − H(P))²

What if we know a priori that H(P) ≤ H? We want
sup_{P∈M_S(H)} E_P (Ĥ^Our − H(P))² ≍ inf_{Ĥ} sup_{P∈M_S(H)} E_P (Ĥ − H(P))²,
where M_S(H) = {P : H(P) ≤ H, P ∈ M_S}.
Can our estimator satisfy all these requirements without knowing S and
H?
Our estimator is adaptive!
Han, J., Weissman’15 showed
Theorem
inf_{Ĥ} sup_{P∈M_S(H)} E_P|Ĥ − H(P)|² ≍
  S²/(n ln n)² + H ln S/n,              if S ln S ≤ e·nH ln n,
  ((H/ln S) ln(S ln S/(nH ln n)))²,     otherwise.    (1)

For ε > H/ln S, achieving root MSE ε requires n ≳ S^{1−ε/H}/H (much smaller than S/ln S for small H!).
MLE requires precisely n ≳ (ln S/H)·S^{1−ε/H}.
The n ⇒ n ln n phenomenon is still true!
Understanding the generality
“Can you deal with cases where the non-differentiable regime is not just a point?”
We consider L₁(P, Q) = Σ_{i=1}^S |p_i − q_i|.
Breakthrough by Valiant and Valiant’11: MLE requires n ≍ S samples; the optimal is S/ln S!
However, VV’11’s construction is only proved optimal when S/ln S ≲ n ≲ S.
Applying our general methodology
The minimax L² rate is S/(n ln n), while MLE achieves S/n. The n ⇒ n ln n is still true!
We show that VV’11 cannot achieve it when n ≳ S.
However, is it easy to do? The nonsmooth regime is a whole line! How to cover it?
Use a small square [0, ln n/n]²?
Use a band?
Use some other shape?
Bad news
Best polynomial approximation in 2D is not unique.
There exists no efficient algorithm to compute it.
Some polynomials that achieve the best approximation rate cannot be used in our methodology.
All nonsmooth regime design methods mentioned before fail.
Solution: hints from our paper “Minimax estimation of functionals of discrete distributions”, or wait for the soon-to-be-finished paper “Minimax estimation of divergence functions”.
Significant difference in phase transitions
One may think: “Who cares about a ln n improvement?”
Take the sequence n = 2S/ln S, with S sampled equally (on a log scale) from 10² to 10⁷. For each (n, S) pair, sample 20 times from a uniform distribution.
The scale of S/n: S/n ∈ [2.2727, 8.0590]
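A sketch of this sweep for the MLE side of the comparison (the other curve would use a rate-optimal estimator, e.g. the two-regime sketch above):

    import numpy as np

    rng = np.random.default_rng(3)
    for S in np.logspace(2, 7, num=6, dtype=int):   # S = 10^2 ... 10^7
        n = int(2 * S / np.log(S))
        errs = []
        for _ in range(20):
            counts = np.bincount(rng.integers(0, S, size=n), minlength=S)
            phat = counts[counts > 0] / n
            errs.append((-(phat * np.log(phat)).sum() - np.log(S)) ** 2)
        print(S, n, np.mean(errs))   # MSE of the MLE; it does not vanish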
Significant improvement!
[Figure: MSE versus S/n (from 2 to 9) for the JVHW estimator and the MLE; the JVHW estimator’s MSE stays far below the MLE’s throughout.]
Mutual Information I(P_XY)
Mutual information: I(P_XY) = H(P_X) + H(P_Y) − H(P_XY)
Theorem (J., Venkat, Han, Weissman’14)
n ⇒ n ln n also holds.
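Since the identity reduces mutual information estimation to three entropy estimations, any entropy estimator can be plugged in; a sketch (entropy_est is a placeholder for, e.g., the two-regime sketch earlier):

    import numpy as np

    def mutual_information_est(xy_counts, n, entropy_est):
        """I(P_XY) = H(P_X) + H(P_Y) - H(P_XY) from a joint count table.
        `entropy_est(counts, n)` is pluggable: the plug-in gives empirical MI,
        an improved entropy estimator gives an improved MI estimate."""
        hx = entropy_est(xy_counts.sum(axis=1), n)   # marginal counts of X
        hy = entropy_est(xy_counts.sum(axis=0), n)   # marginal counts of Y
        hxy = entropy_est(xy_counts.ravel(), n)      # joint counts
        return hx + hy - hxy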
Learning Tree Graphical Models
A d-dimensional random vector with alphabet size S:
X = (X1, X2, ..., Xd)
Suppose the p.m.f. has tree structure:
P_X = Π_{i=1}^d P_{Xi | Xπ(i)}
[Figure: an approximating dependence tree on X1, ..., X6; in this example, P̂_T ≈ P_X6 P_X1|X6 P_X3|X6 P_X4|X3 P_X2|X3 P_X5|X2.]
Given n i.i.d. samples from P_X, estimate the underlying tree structure
Chow–Liu algorithm (1968)
Natural approach: Maximum Likelihood!
Chow–Liu’68 solved it (2000 citations)
- Maximum Weight Spanning Tree (MWST)
- Empirical mutual information
Theorem (Chow, Liu’68)
E_MLE = arg max_{E_Q is a tree} Σ_{e∈E_Q} I(P̂_e)
Our approach
Replace empirical mutual information by better estimates!
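A sketch of this pipeline (mi_estimate is a hypothetical pluggable hook: empirical MI recovers the original Chow–Liu algorithm, while a better MI estimator, e.g. mutual_information_est above with an improved entropy estimator, gives the modified one):

    import numpy as np

    def chow_liu(samples, S, mi_estimate):
        """Chow-Liu tree via maximum-weight spanning tree (Prim's algorithm)
        over pairwise mutual-information estimates.
        samples: (n, d) integer array with entries in {0, ..., S-1}."""
        n, d = samples.shape
        W = np.zeros((d, d))
        for i in range(d):
            for j in range(i + 1, d):
                counts = np.zeros((S, S))
                np.add.at(counts, (samples[:, i], samples[:, j]), 1)
                W[i, j] = W[j, i] = mi_estimate(counts, n)
        # Prim: repeatedly add the heaviest edge crossing the cut
        in_tree, edges = {0}, []
        while len(in_tree) < d:
            i, j = max(((u, v) for u in in_tree for v in range(d)
                        if v not in in_tree), key=lambda e: W[e])
            edges.append((i, j))
            in_tree.add(j)
        return edges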
Original CL vs. Modified CL
We set d = 7, S = 300, a star graph, and sweep n from 1k to 55k.
[Figure: expected wrong-edges ratio of the original Chow–Liu algorithm versus the number of samples n, over n from 0 to 6×10⁴.]
8-fold improvement! [and even more]
We set d = 7, S = 300, a star graph, and sweep n from 1k to 55k.
[Figure: expected wrong-edges ratio versus n for the original and modified CL algorithms; the modified CL reaches the same accuracy with roughly 8× fewer samples.]
What we did not talk about
How to analyze the bias of MLE
Why bootstrap and jackknife fail
Extensions in multivariate settings (ℓ_p distance)
Extensions in nonparametric estimation
Extensions in non-i.i.d. models
Applications in machine learning, medical imaging, computer vision, genomics, etc.
Hindsight
Unbiased estimation:
the approximation basis must satisfy the unbiased estimation equation (Kolmogorov’50)
the basis must have good approximation properties
the additional variance incurred by approximation should not be large
Theory of functional estimation ⇔ Analytical properties of functions
The whole field of approximation theory, or even more
Related work
O. Lepski, A. Nemirovski, and V. Spokoiny’99, “On estimation of the L_r norm of a regression function”
T. Cai and M. Low’11, “Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional”
Y. Wu and P. Yang’14, “Minimax rates of entropy estimation on large alphabets via best polynomial approximation”
J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi’14, “The complexity of estimating Rényi entropy”
Jiantao Jiao (Stanford University)
Boosting the Effective Sample Size
March 17th, 2015
34 / 35
Our work
J. Jiao, K. Venkat, Y. Han, T. Weissman, “Minimax estimation of functionals of discrete distributions”, to appear in IEEE Transactions on Information Theory
J. Jiao, K. Venkat, Y. Han, T. Weissman, “Maximum likelihood estimation of functionals of discrete distributions”, available on arXiv
J. Jiao, K. Venkat, Y. Han, T. Weissman, “Beyond maximum likelihood: from theory to practice”, available on arXiv
J. Jiao, Y. Han, T. Weissman, “Minimax estimation of divergence functions”, in preparation
Y. Han, J. Jiao, T. Weissman, “Adaptive estimation of Shannon entropy”, available on arXiv
Y. Han, J. Jiao, T. Weissman, “Does Dirichlet prior smoothing solve the Shannon entropy estimation problem?”, available on arXiv
Y. Han, J. Jiao, T. Weissman, “Bias correction using Taylor series, Bootstrap, and Jackknife”, in preparation
Y. Han, J. Jiao, T. Weissman, “How to bound the gap of Jensen’s inequality?”, in preparation
Thank you!