A survey on mixing coefficients: computation and estimation. Vitaly Kuznetsov October 29, 2013

A survey on mixing coefficients:
computation and estimation.
Vitaly Kuznetsov
Courant Institute of Mathematical Sciences,
New York University
October 29, 2013
Introduction
Binary classification
Receive a sample $X_1, \dots, X_m$ with labels in $\{0, 1\}$.
Choose a hypothesis $h$ that has good expected
performance on unseen data.
$X_1, \dots, X_m$ are typically assumed i.i.d.
Introduction (continued)
Much of learning theory operates under the
assumption that data comes from an i.i.d. source.
In certain scenarios this assumption is not appropriate,
e.g. time series analysis.
To extend learning theory to these scenarios we need to
find a suitable relaxation of the i.i.d. requirement.
One common approach found in the literature is to impose
various “mixing conditions”.
Under these mixing conditions the strength of
dependence between random variables is measured
using “mixing coefficients”.
Outline
Mixing conditions and coefficients: definitions
and basic properties.
Computational aspects.
Estimating mixing coefficients.
Discussion.
How can we measure dependence between
random variables?
Common measures of dependence are the so-called
“mixing” coefficients.
They were originally introduced to prove laws of large
numbers for sequences of dependent variables.
α mixing coefficient between two σ-algebras
Given a probability space $(\Omega, \mathcal{F}, P)$ and two sub-σ-algebras
$\sigma_1$ and $\sigma_2$, define the α-mixing coefficient
\[ \alpha(\sigma_1, \sigma_2) = \sup_{A, B} |P(A)P(B) - P(A \cap B)| \]
where the supremum is taken over all $A \in \sigma_1$ and $B \in \sigma_2$.
ϕ mixing coefficient
Define the ϕ-mixing coefficient
\[ \phi(\sigma_1 \mid \sigma_2) = \sup_{A, B} |P(A) - P(A \mid B)| \]
where the supremum is taken over all $A \in \sigma_1$ and $B \in \sigma_2$ with $P(B) > 0$.
Note that the ϕ coefficient is not symmetric.
β mixing coefficient
Define the β-mixing coefficient between two σ-algebras $\sigma_1$ and $\sigma_2$:
\[ \beta(\sigma_1, \sigma_2) = E\Big[ \sup_{A} |P(A) - P(A \mid \sigma_2)| \Big] \]
where the supremum is taken over all $A \in \sigma_1$.
We can rewrite the β-mixing coefficient as follows:
\[ \beta(\sigma_1, \sigma_2) = \frac{1}{2} \sup \sum_{i=1}^{I} \sum_{j=1}^{J} |P(A_i)P(B_j) - P(A_i \cap B_j)| \]
where the supremum is taken over all finite partitions
$\{A_1, \dots, A_I\}$ and $\{B_1, \dots, B_J\}$ of $\Omega$ such that $A_i \in \sigma_1$
and $B_j \in \sigma_2$.
Alternative definitions of β mixing coefficient
This leads to yet another characterization of the β-mixing
coefficient:
\[ \beta(\sigma_1, \sigma_2) = \| P_{\sigma_1} \otimes P_{\sigma_2} - P_{\sigma_1 \otimes \sigma_2} \| \]
where $\| \cdot \|$ denotes the total variation distance, i.e.
$\|P - Q\| = \sup_A |P(A) - Q(A)|$.
Assuming distributions $P$ and $Q$ have densities $f$ and
$g$ respectively,
\[ \|P - Q\| = \frac{1}{2} \int |f - g| \]
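For distributions on a finite set the two expressions for total variation can be compared directly. The following is a minimal sketch, assuming probability vectors `p` and `q` on the same finite support (the function names are illustrative, not from the slides): it evaluates the supremum over all events by brute force and compares it with half the L1 distance.

```python
from itertools import combinations


def tv_sup(p, q):
    """Total variation as the sup over all events A of |P(A) - Q(A)| (brute force)."""
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for event in combinations(range(n), size):
            diff = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, diff)
    return best


def tv_half_l1(p, q):
    """Total variation as half the L1 distance between the mass functions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Illustrative distributions on three points.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_sup(p, q), tv_half_l1(p, q))
```

The brute-force version enumerates all 2^n events, so it is only meant as a check on small examples.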
Relations between mixing coefficients
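We have the following:
\[ 2\alpha(\sigma_1, \sigma_2) \le \beta(\sigma_1, \sigma_2) \le \phi(\sigma_1, \sigma_2) \]
The second inequality is immediate from the
definition.
Proof of the first inequality: for any $A \in \sigma_1$ and $B \in \sigma_2$,
applying the partition form of β to $\{A, A^c\}$ and $\{B, B^c\}$ gives
\[ |P(A)P(B) - P(A \cap B)| + |P(A)P(B^c) - P(A \cap B^c)| + |P(A^c)P(B) - P(A^c \cap B)| + |P(A^c)P(B^c) - P(A^c \cap B^c)| \le 2\beta(\sigma_1, \sigma_2) \]
Each of the four terms equals $|P(A)P(B) - P(A \cap B)|$, so
$4|P(A)P(B) - P(A \cap B)| \le 2\beta(\sigma_1, \sigma_2)$; taking the supremum
over $A$ and $B$ gives $2\alpha(\sigma_1, \sigma_2) \le \beta(\sigma_1, \sigma_2)$.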
From two variables to stochastic processes (i)
Let $\{X_t\}_{t=-\infty}^{\infty}$ be a doubly infinite sequence of
random variables.
Notation:
$X_i^j = (X_i, X_{i+1}, \dots, X_j)$
$P_i^j$ is the joint probability distribution of $X_i^j$
$\sigma_i^j$ is the σ-algebra generated by $X_i^j$
From two variables to stochastic processes (ii)
Define the following mixing coefficients:
\[ \alpha(a) = \sup_t \alpha(\sigma_{-\infty}^{t}, \sigma_{t+a}^{\infty}) \]
\[ \beta(a) = \sup_t \beta(\sigma_{-\infty}^{t}, \sigma_{t+a}^{\infty}) \]
\[ \phi(a) = \sup_t \phi(\sigma_{-\infty}^{t}, \sigma_{t+a}^{\infty}) \]
We say that a sequence of random variables $X_{-\infty}^{\infty}$ is α-,
β- or ϕ-mixing if the corresponding mixing coefficient
tends to 0 as $a \to \infty$.
These coefficients measure the dependence between the future
and the past separated by $a$ time units.
Stationary stochastic processes
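A stochastic process $X_{-\infty}^{\infty}$ is (strictly) stationary if for
any $t \in \mathbb{Z}$ and $k, n \in \mathbb{N}$ the distribution of $X_t^{t+n}$ is the
same as the distribution of $X_{t+k}^{t+k+n}$.
For stationary processes the mixing coefficients
simplify to
\[ \alpha(a) = \alpha(\sigma_{-\infty}^{0}, \sigma_{a}^{\infty}) \]
\[ \beta(a) = \beta(\sigma_{-\infty}^{0}, \sigma_{a}^{\infty}) \]
\[ \phi(a) = \phi(\sigma_{-\infty}^{0}, \sigma_{a}^{\infty}) \]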
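As a concrete illustration of a stationary β-mixing process, consider a finite-state Markov chain started from its stationary distribution. The sketch below relies on an assumption not stated in the slides: that for a Markov chain the past/future coefficient reduces to the coefficient between $\sigma(X_0)$ and $\sigma(X_a)$, so that β(a) is half the L1 distance between the joint law of $(X_0, X_a)$ and the product of its marginals. The transition matrix `P`, the stationary vector `PI`, and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_pow(p, a):
    """a-step transition matrix, computed by repeated multiplication."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(a):
        result = mat_mul(result, p)
    return result


def beta_markov(p, pi, a):
    """beta(sigma(X_0), sigma(X_a)) for a stationary chain with transition
    matrix p and stationary distribution pi (assumed supplied by the caller)."""
    pa = mat_pow(p, a)
    n = len(pi)
    return 0.5 * sum(abs(pi[i] * pa[i][j] - pi[i] * pi[j])
                     for i in range(n) for j in range(n))


# Two-state example; PI satisfies PI = PI * P.
P = [[0.7, 0.3], [0.2, 0.8]]
PI = [0.4, 0.6]
for lag in (1, 2, 3):
    print(lag, beta_markov(P, PI, lag))
```

For this two-state chain the coefficient decays geometrically in the lag, at the rate of the second eigenvalue 0.5 of `P`.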
Connections to machine learning
Theorem (M. Mohri, A. Rostamizadeh, 2009): Let
$H = \{h : \mathcal{X} \to \mathcal{Y}\}$ be a set of hypotheses and $L$ be an $M$-bounded loss
function. Let $S$ be a sample of size $m = 2\mu a$ from a stationary β-mixing
process on $\mathcal{X} \times \mathcal{Y}$. Then for any $\delta > 4(\mu - 1)\beta(a)$, with probability at least
$1 - \delta'$ the following holds for all $h \in H$:
\[ E[L(h(X), Y)] \le \frac{1}{m} \sum_{i=1}^{m} L(h(X_i), Y_i) + \hat{R}_{S_\mu}(L \circ H) + 3M \sqrt{\frac{\log \frac{4}{\delta'}}{2\mu}} \]
where $\hat{R}_{S_\mu}$ denotes the empirical Rademacher complexity and
$\delta' = \delta - 4(\mu - 1)\beta(a)$.
Other results of a similar nature are due to R. Meir, M. Mohri and A.
Rostamizadeh, and I. Steinwart et al., to name a few.
Can we compute mixing coefficients?
Theorem (M. Ahsen, M. Vidyasagar, 2013):
Suppose $X$ and $Y$ are discrete random variables with
known joint and marginal probability distributions. Then
computing the α-mixing coefficient is NP-hard (equivalent
to the “partition problem”).
Ahsen and Vidyasagar also give efficiently computable
upper and lower bounds.
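To see where the difficulty comes from, note that for discrete $X$ and $Y$ every event in $\sigma(X)$ is a subset of the values of $X$, and likewise for $Y$, so the supremum in the definition of α ranges over exponentially many pairs of subsets. The following is a brute-force sketch under that observation; it is not the bounds of Ahsen and Vidyasagar, and the function name and the joint tables are illustrative.

```python
from itertools import product


def alpha_brute_force(theta):
    """alpha(sigma(X), sigma(Y)) for a joint probability table theta[i][j],
    by enumerating every pair of events (exponential time)."""
    n_rows, n_cols = len(theta), len(theta[0])
    mu = [sum(row) for row in theta]                                    # marginal of X
    nu = [sum(theta[i][j] for i in range(n_rows)) for j in range(n_cols)]  # marginal of Y
    best = 0.0
    for rows in product((0, 1), repeat=n_rows):        # indicator of event A
        for cols in product((0, 1), repeat=n_cols):    # indicator of event B
            p_a = sum(mu[i] for i in range(n_rows) if rows[i])
            p_b = sum(nu[j] for j in range(n_cols) if cols[j])
            p_ab = sum(theta[i][j] for i in range(n_rows) for j in range(n_cols)
                       if rows[i] and cols[j])
            best = max(best, abs(p_a * p_b - p_ab))
    return best


dependent = [[0.5, 0.0], [0.0, 0.5]]        # X = Y, uniform on two values
independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y independent
print(alpha_brute_force(dependent), alpha_brute_force(independent))
```

The double loop visits 2^(rows + columns) pairs of events, which is feasible only for very small tables.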
Can we compute mixing coefficients? (continued)
Theorem (M. Ahsen, M. Vidyasagar, 2013):
Suppose $X$ and $Y$ are discrete random variables with
known joint distribution $\theta_{ij}$ and marginal probability
distributions $\mu_i$ and $\nu_j$. Then
\[ \beta(\sigma(X), \sigma(Y)) = \frac{1}{2} \sum_i \sum_j |\gamma_{ij}| \]
\[ \phi(\sigma(X), \sigma(Y)) = \max_j \frac{1}{\nu_j} \sum_i \max(\gamma_{ij}, 0) \]
where $\gamma_{ij} = \theta_{ij} - \mu_i \nu_j$. Thus, $\beta(\sigma(X), \sigma(Y))$ and
$\phi(\sigma(X), \sigma(Y))$ are both computable in polynomial time.
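The two closed-form expressions translate directly into code. The following is a minimal sketch, assuming the joint distribution is given as a table `theta[i][j]` whose entries sum to one; the function name and the example tables are illustrative, and columns with zero marginal mass are skipped in the maximum.

```python
def beta_phi_discrete(theta):
    """Return (beta, phi) for a joint probability table theta[i][j]."""
    n_rows, n_cols = len(theta), len(theta[0])
    mu = [sum(row) for row in theta]                                    # marginal of X
    nu = [sum(theta[i][j] for i in range(n_rows)) for j in range(n_cols)]  # marginal of Y
    gamma = [[theta[i][j] - mu[i] * nu[j] for j in range(n_cols)]
             for i in range(n_rows)]
    beta = 0.5 * sum(abs(g) for row in gamma for g in row)
    phi = max(sum(max(gamma[i][j], 0.0) for i in range(n_rows)) / nu[j]
              for j in range(n_cols) if nu[j] > 0)
    return beta, phi


print(beta_phi_discrete([[0.5, 0.0], [0.0, 0.5]]))      # X = Y
print(beta_phi_discrete([[0.4, 0.1], [0.1, 0.4]]))      # positively dependent
print(beta_phi_discrete([[0.25, 0.25], [0.25, 0.25]]))  # independent
```

Both quantities need a single pass over the table, which is the polynomial-time claim of the theorem in concrete form.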
Estimation of mixing coefficients: naive approach (i)
Question: Given i.i.d. samples $(X_1, Y_1), \dots, (X_m, Y_m)$ from a joint
distribution of real-valued $(X, Y)$, can we estimate any of the mixing
coefficients?
Define the following estimators of the joint and marginal
distributions:
\[ \hat{\Phi}(x) = \frac{1}{m} \sum_{i=1}^{m} I_{X_i \le x} \]
\[ \hat{\Phi}(y) = \frac{1}{m} \sum_{i=1}^{m} I_{Y_i \le y} \]
\[ \hat{\Phi}(x, y) = \frac{1}{m} \sum_{i=1}^{m} I_{X_i \le x,\, Y_i \le y} \]
Let $\hat{\beta}$ and $\hat{\phi}$ be estimators of β and ϕ based on the empirical c.d.f.’s.
Estimation of mixing coefficients: naive approach (ii)
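Theorem (M. Ahsen, M. Vidyasagar, 2013):
\[ \hat{\phi} \ge \hat{\beta} = \frac{m-1}{m} \to 1 \text{ as } m \to \infty \]
Justification: Under the empirical probability distribution
each sample has mass $1/m$. The marginals are also uniform,
and hence the product distribution assigns mass $1/m^2$ to
each point in the grid $(x_i, y_j)$. The $m$ sample points each
contribute $1/m - 1/m^2$ and the remaining $m^2 - m$ grid points
each contribute $1/m^2$, so the above formula for discrete β gives
$\hat{\beta} = \frac{1}{2} \cdot \frac{2(m-1)}{m} = \frac{m-1}{m}$.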
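The failure of the naive plug-in estimator is easy to reproduce numerically. The following is a minimal sketch, assuming all sample values are distinct; the function name and the synthetic data are illustrative. It applies the discrete formula for β to the empirical joint and marginal distributions of an independent sample.

```python
import random


def naive_beta_hat(xs, ys):
    """Plug-in beta from the empirical joint and marginal distributions.
    Assumes all xs are distinct and all ys are distinct."""
    m = len(xs)
    x_rank = {x: r for r, x in enumerate(sorted(xs))}
    y_rank = {y: r for r, y in enumerate(sorted(ys))}
    joint = [[0.0] * m for _ in range(m)]
    for x, y in zip(xs, ys):
        joint[x_rank[x]][y_rank[y]] += 1.0 / m   # each sample carries mass 1/m
    product_mass = 1.0 / m ** 2                  # uniform marginals on the grid
    return 0.5 * sum(abs(joint[i][j] - product_mass)
                     for i in range(m) for j in range(m))


random.seed(0)
m = 50
xs = random.sample(range(1000), m)  # distinct values
ys = random.sample(range(1000), m)  # drawn independently of xs
print(naive_beta_hat(xs, ys), (m - 1) / m)
```

Although `xs` and `ys` are independent, so that the true β is 0, the estimate equals (m - 1)/m regardless of the data.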
Estimation of mixing coefficients: histograms (i)
A histogram estimator $\hat{f}$ of a density $f$ based on a sample
$X_1, \dots, X_m$ is
\[ \hat{f}(x) = \sum_{j=1}^{J} \frac{\hat{p}_j}{m w_j} I_{B_j}(x) \]
where
the $B_j$’s are bins partitioning the region with observations,
$\hat{p}_j = \sum_{i=1}^{m} I_{B_j}(X_i)$ counts the number of samples in bin $B_j$,
$w_j$ is the width of the $j$-th bin.
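A one-dimensional version of this estimator can be sketched as follows. The bin edges are assumed to be supplied by the caller in increasing order and to cover every observation; the function name and the sample are illustrative.

```python
from bisect import bisect_right


def histogram_density(sample, edges):
    """Return (f_hat, heights, widths) for bins [edges[j], edges[j+1])."""
    m = len(sample)
    n_bins = len(edges) - 1

    def bin_index(x):
        if x < edges[0] or x > edges[-1]:
            raise ValueError("observation outside the binned region")
        # The right end point is assigned to the last bin.
        return min(bisect_right(edges, x) - 1, n_bins - 1)

    counts = [0] * n_bins                     # p_hat_j: samples in bin j
    for x in sample:
        counts[bin_index(x)] += 1
    widths = [edges[j + 1] - edges[j] for j in range(n_bins)]
    heights = [counts[j] / (m * widths[j]) for j in range(n_bins)]

    def f_hat(x):
        if x < edges[0] or x > edges[-1]:
            return 0.0
        return heights[bin_index(x)]

    return f_hat, heights, widths


f_hat, heights, widths = histogram_density([0.1, 0.2, 0.6, 0.9], [0.0, 0.25, 1.0])
print(heights, f_hat(0.1), f_hat(0.5))
```

Dividing each count by `m * widths[j]` makes the estimate integrate to one even when the bins have unequal widths.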
Estimation of mixing coefficients: histograms (ii)
Given $m$ samples, choose $J_m$ intervals on $\mathbb{R}$ so that each
bin contains $\lfloor m/J_m \rfloor$ or $\lfloor m/J_m \rfloor + 1$ samples from both $X$
and $Y$.
Theorem (M. Ahsen, M. Vidyasagar, 2013):
Suppose $(X, Y) \sim \theta$, $X \sim \mu$ and $Y \sim \nu$ with θ being
absolutely continuous with respect to $\mu \otimes \nu$. Then $\hat{\beta}$
converges to β provided that $J_m / m \to 0$. If, in addition,
the density $f \in L_\infty$, then $\hat{\alpha}$ and $\hat{\phi}$ also converge to α and
ϕ respectively.
The measure-theoretic arguments used in the proof
establish consistency of the estimators but do not yield
error rates.
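One way to realize this binning is to assign each observation to a bin by its rank, which gives every bin either $\lfloor m/J_m \rfloor$ or $\lfloor m/J_m \rfloor + 1$ points, and then to apply the discrete formula for β to the resulting $J_m \times J_m$ table of cell frequencies. The sketch below follows that reading; it is not necessarily the exact construction of Ahsen and Vidyasagar, ties are broken by sample order, and the function names are illustrative.

```python
def equal_frequency_bins(values, n_bins):
    """Bin index of each value, with bin sizes differing by at most one."""
    m = len(values)
    order = sorted(range(m), key=lambda i: values[i])
    bins = [0] * m
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // m
    return bins


def beta_hat_binned(xs, ys, n_bins):
    """Discrete beta formula applied to equal-frequency cell frequencies."""
    m = len(xs)
    bx = equal_frequency_bins(xs, n_bins)
    by = equal_frequency_bins(ys, n_bins)
    counts = [[0] * n_bins for _ in range(n_bins)]
    for i, j in zip(bx, by):
        counts[i][j] += 1
    theta = [[c / m for c in row] for row in counts]
    mu = [sum(row) for row in theta]
    nu = [sum(theta[i][j] for i in range(n_bins)) for j in range(n_bins)]
    return 0.5 * sum(abs(theta[i][j] - mu[i] * nu[j])
                     for i in range(n_bins) for j in range(n_bins))


# Fully dependent sample (Y = X) versus a sample that fills the grid evenly.
xs = list(range(16))
print(beta_hat_binned(xs, xs, 4))
print(beta_hat_binned(xs, [(i % 4) * 4 + i // 4 for i in xs], 4))
```

With `n_bins` much smaller than the sample size, the estimate no longer collapses to (m - 1)/m as the naive empirical estimator does.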
Estimation of mixing coefficients: stochastic processes (i)
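Two-step approximation:
\[ |\hat{\beta}^d(a) - \beta(a)| \le |\hat{\beta}^d(a) - \beta^d(a)| + |\beta^d(a) - \beta(a)| \]
where $\beta^d(a) = \sup_t \beta(\sigma_{t-d}^{t}, \sigma_{t+a}^{t+a+d})$ and $\hat{\beta}^d(a)$ is the
estimator
\[ \hat{\beta}^d(a) = \frac{1}{2} \int |\hat{f}^d \otimes \hat{f}^d - \hat{f}^{2d}| \]
with $\hat{f}^d$, $\hat{f}^{2d}$ being $d$- and $2d$-dimensional histogram
estimators.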
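For intuition, the simplest case uses a single observation on each side of the gap, so that the joint histogram is two-dimensional. The sketch below makes several simplifying assumptions that are not from the slides: it forms pairs $(X_t, X_{t+a})$ from one observed series, uses a common set of bin edges covering all observations, and takes the marginal histograms from the pair sample itself. Because all histograms are piecewise constant on the same grid, the integral reduces to a sum over cells of absolute differences of cell probabilities. The function name and the series are illustrative.

```python
from bisect import bisect_right


def beta_hat_pairs(series, a, edges):
    """Histogram-based beta estimate from pairs (X_t, X_{t+a})."""
    n_bins = len(edges) - 1

    def bin_index(x):
        # Assumes edges[0] <= x <= edges[-1]; the right end point
        # is assigned to the last bin.
        return min(bisect_right(edges, x) - 1, n_bins - 1)

    pairs = [(bin_index(series[t]), bin_index(series[t + a]))
             for t in range(len(series) - a)]
    n = len(pairs)
    joint = [[0] * n_bins for _ in range(n_bins)]
    for i, j in pairs:
        joint[i][j] += 1
    past = [sum(joint[i]) / n for i in range(n_bins)]
    future = [sum(joint[i][j] for i in range(n_bins)) / n for j in range(n_bins)]
    return 0.5 * sum(abs(joint[i][j] / n - past[i] * future[j])
                     for i in range(n_bins) for j in range(n_bins))


edges = [-0.5, 0.5, 1.5]
alternating = [0, 1] * 4 + [0]     # X_{t+1} is determined by X_t
balanced = [0, 0, 1, 1] * 2 + [0]  # every pair of values is equally frequent
print(beta_hat_pairs(alternating, 1, edges), beta_hat_pairs(balanced, 1, edges))
```

Larger $d$ requires histograms in $2d$ dimensions, where the number of cells grows exponentially with $d$.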
Estimation of mixing coefficients: stochastic processes (ii)
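Theorem (D. McDonald, C. Shalizi, M. Schervish, 2011): Let
$X_1^m$ be a sample from a stationary β-mixing process. For $m = 2\mu_m b_m$
and $d \le \mu_m$ we have that
\[ P(|\hat{\beta}^d(a) - \beta^d(a)| \ge \epsilon) \le 2\exp\left(-\frac{\mu_m \epsilon_1^2}{2}\right) + 2\exp\left(-\frac{\mu_m \epsilon_2^2}{2}\right) + 4(\mu_m - 1)\beta(b_m) \]
where $\epsilon_1 = \epsilon/2 - E[\int |\hat{f}^d - f^d|]$ and $\epsilon_2 = \epsilon - E[\int |\hat{f}^{2d} - f^{2d}|]$.
The proof is based on the blocking technique.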
Estimation of mixing coefficients: stochastic processes (iii)
For the term $|\beta^d(a) - \beta(a)|$, a measure-theoretic argument can be used to
show that it tends to 0 as $d \to \infty$.
Under the assumption that the densities $f^d$ and $f^{2d}$ are in the Sobolev
space $H^2$, McDonald, Shalizi and Schervish argue that $\hat{f}^{2d}$ and $\hat{f}^d$
are consistent.
Choosing $d_m = O(\exp(W(\log m)))$ and $w_m = O(m^{-k_m})$, where
\[ k_m = \frac{W(\log m) + \frac{1}{2} \log m}{\log m \left( \frac{1}{2} \exp(W(\log m)) + 1 \right)} \]
and $W$ is the inverse of $w \exp(w)$, they show that the estimator of β
based on histograms is consistent.
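To get a feel for how slowly the dimension $d_m$ grows, the quantities above can be evaluated numerically. The following is a minimal sketch: the inverse of $w \exp(w)$ is computed by Newton's method for non-negative arguments, and the constants hidden in the $O(\cdot)$ terms are taken to be 1, which is an assumption made purely for illustration. The function names are hypothetical.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Inverse of w * exp(w) for x >= 0, computed by Newton's method."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)  # starting point at or above the root
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def histogram_parameters(m):
    """Return (d_m, k_m, w_m) with all O(.) constants set to 1 (an assumption)."""
    if m <= 1:
        raise ValueError("need m > 1 so that log m > 0")
    log_m = math.log(m)
    w = lambert_w(log_m)
    d_m = math.exp(w)
    k_m = (w + 0.5 * log_m) / (log_m * (0.5 * math.exp(w) + 1.0))
    return d_m, k_m, m ** (-k_m)


for m in (10 ** 3, 10 ** 6, 10 ** 9):
    print(m, histogram_parameters(m))
```

Since $W(\log m) \exp(W(\log m)) = \log m$, the dimension $d_m$ satisfies $d_m \log d_m = \log m$ under this choice of constants, so it grows more slowly than $\log m$.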
Estimation of mixing coefficients: discussion
The results do not provide convergence rates.
High-dimensional histogram estimation may not be
accurate.
Instead of estimating β directly, an intermediate step is
used to estimate densities.
Estimators based on kernels instead of histograms?