A survey on mixing coefficients: computation and estimation

Vitaly Kuznetsov
Courant Institute of Mathematical Sciences, New York University
October 29, 2013

Introduction

Binary classification: receive a sample X_1, ..., X_m with labels in {0, 1}, and choose a hypothesis h that has good expected performance on unseen data. The sample points X_1, ..., X_m are typically assumed to be i.i.d.

Introduction (continued)

Much of learning theory operates under the assumption that the data come from an i.i.d. source. In certain scenarios this assumption is not appropriate, e.g. in time series analysis. To extend learning theory to these scenarios we need a suitable relaxation of the i.i.d. requirement. One common approach found in the literature is to impose various "mixing conditions"; under these conditions the strength of dependence between random variables is measured by "mixing coefficients".

Outline

Mixing conditions and coefficients: definitions and basic properties.
Computational aspects.
Estimating mixing coefficients.
Discussion.

How can we measure dependence between random variables?

Common measures of dependence are the so-called "mixing" coefficients. They were originally introduced to prove laws of large numbers for sequences of dependent random variables.

α-mixing coefficient between two σ-algebras

Given a probability space (Ω, F, P) and two sub-σ-algebras σ_1 and σ_2, define the α-mixing coefficient

    α(σ_1, σ_2) = sup |P(A)P(B) − P(A ∩ B)|,

where the supremum is taken over all A ∈ σ_1 and B ∈ σ_2.

ϕ-mixing coefficient

Define the ϕ-mixing coefficient

    ϕ(σ_1 | σ_2) = sup |P(A) − P(A | B)|,

where the supremum is taken over all A ∈ σ_1 and all B ∈ σ_2 with P(B) > 0. Note that the ϕ coefficient is not symmetric.

β-mixing coefficient

Define the β-mixing coefficient between two σ-algebras σ_1 and σ_2:

    β(σ_1, σ_2) = E[ sup |P(A) − P(A | σ_2)| ],

where the supremum is taken over all A ∈ σ_1. We can rewrite the β-mixing coefficient as follows:

    β(σ_1, σ_2) = (1/2) sup Σ_{i=1}^{I} Σ_{j=1}^{J} |P(A_i)P(B_j) − P(A_i ∩ B_j)|,

where the supremum is taken over all finite partitions {A_1, ..., A_I} and {B_1, ..., B_J} of Ω such that A_i ∈ σ_1 and B_j ∈ σ_2.

Alternative definitions of the β-mixing coefficient

This leads to yet another characterization of the β-mixing coefficient:

    β(σ_1, σ_2) = ‖P_{σ_1} ⊗ P_{σ_2} − P_{σ_1 ⊗ σ_2}‖,

where ‖ · ‖ denotes the total variation distance, i.e. ‖P − Q‖ = sup_A |P(A) − Q(A)|. Assuming the distributions P and Q have densities f and g respectively,

    ‖P − Q‖ = (1/2) ∫ |f − g|.

Relations between mixing coefficients

We have the following:

    2α(σ_1, σ_2) ≤ β(σ_1, σ_2) ≤ ϕ(σ_1, σ_2).

The second inequality is immediate from the definitions. Proof of the first inequality: for any A ∈ σ_1 and B ∈ σ_2, the pairs {A, A^c} and {B, B^c} are finite partitions of Ω, so the partition characterization of β gives

    |P(A)P(B) − P(A ∩ B)| + |P(A)P(B^c) − P(A ∩ B^c)| + |P(A^c)P(B) − P(A^c ∩ B)| + |P(A^c)P(B^c) − P(A^c ∩ B^c)| ≤ 2β(σ_1, σ_2).

Each of the four terms on the left equals |P(A)P(B) − P(A ∩ B)|, so taking the supremum over A and B yields 4α(σ_1, σ_2) ≤ 2β(σ_1, σ_2).
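To make the two-σ-algebra definitions concrete, here is a minimal Python sketch (an illustration added to this survey, not a result from the talk). For discrete X and Y, every event in σ(X) or σ(Y) is a union of atoms {X = i} or {Y = j}, so α can be computed by brute force over all pairs of events; the joint table `theta` below is a made-up example.

```python
import itertools
import numpy as np

def alpha_bruteforce(theta):
    """alpha(sigma(X), sigma(Y)) straight from the definition: the supremum
    of |P(A)P(B) - P(A n B)| over events A in sigma(X) and B in sigma(Y).
    For discrete variables every event is a union of atoms, so we enumerate
    all 2^n * 2^k pairs of subsets of the atoms."""
    n, k = theta.shape
    mu = theta.sum(axis=1)          # marginal of X
    nu = theta.sum(axis=0)          # marginal of Y
    best = 0.0
    for A in itertools.product([False, True], repeat=n):   # indicator of A
        a = np.array(A)
        for B in itertools.product([False, True], repeat=k):  # indicator of B
            b = np.array(B)
            pa, pb = mu[a].sum(), nu[b].sum()
            pab = theta[np.ix_(a, b)].sum()                # P(A n B)
            best = max(best, abs(pa * pb - pab))
    return best

# Made-up joint p.m.f. of X (3 values, rows) and Y (2 values, columns).
theta = np.array([[0.20, 0.10],
                  [0.05, 0.30],
                  [0.15, 0.20]])
print(alpha_bruteforce(theta))
```

The enumeration visits 2^n · 2^k pairs of events, so it is feasible only for tiny alphabets; as discussed below, this exponential blow-up is not merely an artifact of the implementation.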
From two variables to stochastic processes (i)

Let {X_t}_{t=−∞}^{∞} be a doubly infinite sequence of random variables. Notation:

    X_i^j = (X_i, X_{i+1}, ..., X_j);
    P_i^j is the joint probability distribution of X_i^j;
    σ_i^j is the σ-algebra generated by X_i^j.

From two variables to stochastic processes (ii)

Define the following mixing coefficients:

    α(a) = sup_t α(σ_{−∞}^{t}, σ_{t+a}^{∞}),
    β(a) = sup_t β(σ_{−∞}^{t}, σ_{t+a}^{∞}),
    ϕ(a) = sup_t ϕ(σ_{−∞}^{t}, σ_{t+a}^{∞}).

We say that a sequence of random variables X_{−∞}^{∞} is α-, β- or ϕ-mixing if the corresponding mixing coefficient tends to 0 as a → ∞. These coefficients measure the dependence between the future and the past separated by a time units.

Stationary stochastic processes

A stochastic process X_{−∞}^{∞} is (strictly) stationary if for any t ∈ Z and k, n ∈ N the distribution of X_t^{t+n} is the same as the distribution of X_{t+k}^{t+k+n}. For stationary processes the mixing coefficients can be simplified to

    α(a) = α(σ_{−∞}^{0}, σ_a^{∞}),
    β(a) = β(σ_{−∞}^{0}, σ_a^{∞}),
    ϕ(a) = ϕ(σ_{−∞}^{0}, σ_a^{∞}).

Connections to machine learning

Theorem (M. Mohri, A. Rostamizadeh, 2009): Let H = {X → Y} be a set of hypotheses and let L be an M-bounded loss function. Let S be a sample of size m = 2µa drawn from a stationary β-mixing process on X × Y. Then for any δ > 4(µ − 1)β(a), with probability at least 1 − δ, the following holds for all h ∈ H:

    E[L(h(X), Y)] ≤ (1/m) Σ_{i=1}^{m} L(h(X_i), Y_i) + R̂_{S_µ}(L ∘ H) + 3M √( log(4/δ′) / (2µ) ),

where R̂_{S_µ} denotes the empirical Rademacher complexity and δ′ = δ − 4(µ − 1)β(a). Other results of a similar nature have been obtained by R. Meir, M. Mohri and A. Rostamizadeh, and I. Steinwart et al., to name a few.

Can we compute mixing coefficients?

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint and marginal probability distributions. Then computing the α-mixing coefficient is NP-hard (the problem is equivalent to the "partition" problem). Ahsen and Vidyasagar also give efficiently computable upper and lower bounds.

Can we compute mixing coefficients? (continued)

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint distribution θ_ij and marginal probability distributions µ_i and ν_j. Then

    β(σ(X), σ(Y)) = (1/2) Σ_i Σ_j |γ_ij|,
    ϕ(σ(X), σ(Y)) = max_j (1/ν_j) Σ_i max(γ_ij, 0),

where γ_ij = θ_ij − µ_i ν_j. Thus β(σ(X), σ(Y)) and ϕ(σ(X), σ(Y)) are both computable in polynomial time, as the sketch below illustrates.
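The two formulas translate directly into code. The following minimal sketch uses the same made-up joint table as before (rows index values of X, columns index values of Y) and assumes every ν_j is strictly positive.

```python
import numpy as np

def beta_phi_discrete(theta):
    """Closed-form beta and phi between sigma(X) and sigma(Y) for discrete
    (X, Y) with joint p.m.f. matrix theta, via the formulas quoted above.
    Assumes every column marginal nu_j is strictly positive."""
    mu = theta.sum(axis=1, keepdims=True)   # marginal of X, as a column
    nu = theta.sum(axis=0, keepdims=True)   # marginal of Y, as a row
    gamma = theta - mu * nu                 # gamma_ij = theta_ij - mu_i nu_j
    beta = 0.5 * np.abs(gamma).sum()
    phi = (np.maximum(gamma, 0.0).sum(axis=0) / nu.ravel()).max()
    return beta, phi

theta = np.array([[0.20, 0.10],
                  [0.05, 0.30],
                  [0.15, 0.20]])
print(beta_phi_discrete(theta))
```

In contrast with the exponential enumeration needed for α, both quantities here cost a single pass over the table.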
Estimation of mixing coefficients: naive approach (i)

Question: given i.i.d. samples (X_1, Y_1), ..., (X_m, Y_m) from a joint distribution of real-valued (X, Y), can we estimate any of the mixing coefficients?

Define the following estimators of the joint and marginal distributions:

    Φ̂(x) = (1/m) Σ_{i=1}^{m} I_{X_i ≤ x},
    Φ̂(y) = (1/m) Σ_{i=1}^{m} I_{Y_i ≤ y},
    Φ̂(x, y) = (1/m) Σ_{i=1}^{m} I_{X_i ≤ x, Y_i ≤ y}.

Let β̂ and ϕ̂ be the estimators of β and ϕ based on these empirical c.d.f.'s.

Estimation of mixing coefficients: naive approach (ii)

Theorem (M. Ahsen, M. Vidyasagar, 2013):

    ϕ̂ ≥ β̂ = (m − 1)/m → 1 as m → ∞.

Justification: under the empirical probability distribution each sample point has mass 1/m. The marginals are also uniform, and hence the product distribution assigns mass 1/m² to each of the m² points of the grid (x_i, y_j). The conclusion now follows from the above formula for the discrete β. In other words, the naive plug-in estimators converge to 1 regardless of the true coefficients, so they are useless.

Estimation of mixing coefficients: histograms (i)

A histogram estimator f̂ of a density f based on a sample X_1, ..., X_m is

    f̂(x) = Σ_{j=1}^{J} (p̂_j / (m w_j)) I_{B_j}(x),

where the B_j's are bins partitioning the region containing the observations, p̂_j = Σ_{i=1}^{m} I_{B_j}(X_i) counts the number of samples in bin B_j, and w_j is the width of the j-th bin.

Estimation of mixing coefficients: histograms (ii)

Given m samples, choose J_m intervals on R so that each bin contains ⌊m/J_m⌋ or ⌊m/J_m⌋ + 1 samples from both X and Y.

Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose (X, Y) ∼ θ, X ∼ µ and Y ∼ ν, with θ absolutely continuous with respect to µ ⊗ ν. Then β̂ converges to β provided that J_m/m → 0. If, in addition, the density f ∈ L∞, then α̂ and ϕ̂ also converge to α and ϕ respectively.

The measure-theoretic arguments used in the proof establish consistency of the estimators but do not yield error rates.

Estimation of mixing coefficients: stochastic processes (i)

Two-step approximation:

    |β̂^d(a) − β(a)| ≤ |β̂^d(a) − β^d(a)| + |β^d(a) − β(a)|,

where β^d(a) = sup_t β(σ_{t−d}^{t}, σ_{t+a}^{t+a+d}) and β̂^d(a) is an estimator based on

    β̂^d(a) = (1/2) ∫ |f̂_d ⊗ f̂_d − f̂_{2d}|,

with f̂_d, f̂_{2d} being d- and 2d-dimensional histogram estimators.

Estimation of mixing coefficients: stochastic processes (ii)

Theorem (D. McDonald, C. Shalizi, M. Schervish, 2011): Let X_1^m be a sample from a stationary β-mixing process. For m = 2µ_m b_m and d ≤ µ_m we have

    P(|β̂^d(a) − β^d(a)| ≥ ε) ≤ 2 exp(−µ_m ε_1²/2) + 2 exp(−µ_m ε_2²/2) + 4(µ_m − 1)β(b_m),

where ε_1 = ε/2 − E[∫ |f̂_d − f_d|] and ε_2 = ε − E[∫ |f̂_{2d} − f_{2d}|]. The proof is based on the blocking technique: the sample is split into 2µ_m blocks of length b_m, and every other block behaves almost like an independent sample, at a cost of 4(µ_m − 1)β(b_m) in total variation.

Estimation of mixing coefficients: stochastic processes (iii)

For the second term, |β^d(a) − β(a)|, a measure-theoretic argument can be used to show that it tends to 0 as d → ∞. Under the assumption that the densities f_d and f_{2d} are in the Sobolev space H_2, McDonald, Shalizi and Schervish argue that f̂_d and f̂_{2d} are consistent. Choosing

    d_m = O(exp(W(log m))),    w_m = O(m^{−k_m}),    k_m = (W(log m) + (1/2) log m) / (log m ((1/2) exp(W(log m)) + 1)),

where W is the inverse of w ↦ w exp(w) (the Lambert W function), they show that the histogram-based estimator of β is consistent.

Estimation of mixing coefficients: discussion

The results do not provide convergence rates. High-dimensional histogram estimation may not be accurate. Instead of estimating β directly, an intermediate density-estimation step is used. Could estimators based on kernels, rather than histograms, do better?
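Two short numerical illustrations of the estimation story close this survey; both are hedged sketches added here, not reproductions of the cited papers. First, the naive-approach degeneracy: for continuous data all sampled values are distinct almost surely, so the empirical joint distribution is, up to relabeling of the atoms, the identity matrix divided by m, and the discrete formula implemented in `beta_phi_discrete` above returns exactly (m − 1)/m no matter how dependent X and Y actually are.

```python
import numpy as np

m = 200
# With continuous (X, Y), all observed values are distinct almost surely:
# the empirical joint puts mass 1/m on each observed pair and both
# empirical marginals are uniform. Relabeling atoms, the empirical joint
# p.m.f. is the m-by-m identity matrix scaled by 1/m.
theta_hat = np.eye(m) / m

beta_hat, phi_hat = beta_phi_discrete(theta_hat)  # defined in the earlier sketch
print(beta_hat, phi_hat, (m - 1) / m)             # all three equal 0.995
```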
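Second, a minimal end-to-end sketch of the two-step histogram estimator, specialized to d = 1 and a single gap a. The equal-occupancy bins mirror the binning rule from the histograms slide, but the AR(1) example process and all tuning constants are choices made here for illustration, not the exact construction analyzed by McDonald, Shalizi and Schervish.

```python
import numpy as np

def beta_hat_histogram(x, a, n_bins=10):
    """Two-step estimate of beta(a) with d = 1: histogram the marginal of
    X_t and the joint of (X_t, X_{t+a}) on a common equal-occupancy grid,
    then take half the L1 distance between the product of the marginals
    and the joint, cell by cell."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] += 1e-9                        # make the last bin right-closed
    idx = np.digitize(x, edges) - 1          # bin index of each observation
    p = np.bincount(idx, minlength=n_bins) / len(idx)          # marginal
    q = np.zeros((n_bins, n_bins))
    np.add.at(q, (idx[:-a], idx[a:]), 1.0 / (len(idx) - a))    # joint of pairs
    return 0.5 * np.abs(np.outer(p, p) - q).sum()

# Example: a stationary AR(1) process, which is beta-mixing.
rng = np.random.default_rng(1)
x = np.zeros(5000)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()
print([round(beta_hat_histogram(x, a), 3) for a in (1, 5, 25)])
```

For a β-mixing process such as this AR(1), the estimates should roughly decrease toward the estimator's noise floor as the gap a grows, in line with β(a) → 0.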