Measuring and Testing Dependence by Correlation of Distances

Distance Correlation
E-Statistics
Gábor J. Székely
Rényi Institute of the Hungarian Academy of Sciences
Columbia University, April 28-April 30, 2014
Topics
• Lecture 1. Distance Correlation. From correlation (Galton/Pearson, 1895) to distance correlation (Székely, 2005). Important measures of dependence and how to classify them via invariances. Distance correlation t-test of independence. Open problems for big data.
• Lecture 2. Energy statistics (E-statistics) and their applications. Testing for symmetry, testing for normality, DISCO analysis, energy clustering, etc. A simple inequality on energy statistics and a beautiful theorem on Fourier transforms. What makes a statistic U (or V)?
• Lecture 3. Brownian correlation. Correlation with respect to stochastic processes. Distances and negative definite functions. Physics principles in statistics (the uncertainty principle of statistics, symmetries/invariances, equilibrium estimates). CLT for dependent variables via Brownian correlation. What if the sample is not iid, what if the sample comes from a stochastic process?
• Colloquium talk. Partial distance correlation. Distance correlation and dissimilarities via unbiased distance covariance estimates. What is wrong with the Mantel test? Variable selection via pdCor. What is a good measure of dependence? My Erlangen program in Statistics.
Lecture 1.
Distance Correlation
Dependence Measures and Tests for Independence
Kolmogorov: “Independence is the most important notion of probability theory”
•Correlation (Galton 1885-1888, Natural Inheritance, 1889, Pearson, 1895)
•Chi-square (Pearson, 1900)
•Spearman’s rank correlation (1904) Amer. J. Psychol. 15: 72–101.
•Fisher, R. (1922) and Fisher’s exact test
•Kendall’s tau (1938) A New Measure of Rank Correlation".Biometrika 30 (1–2):81–89.
•Maximal correlation (Hirschfeld), (Gebelein, 1941), (Lancaster, 1957), (Rényi, 1959), (Sarmanov, 1958), (Buja, 1990), (Dembo, 2001)
•Hoeffding’s independence test (1948) Annals of Mathematical Statistics 19: 293–325, 1948.
•Blum-Kiefer-Wolfowitz (1961)
•Mantel test (1967)
•RKHS Baker (1973), Fukumizo, Gretton, Poczos, …
•RV coefficient (1976): Robert, P.; Escoufier, Y. "A Unifying Tool for Linear Multivariate Statistical Methods: The RV Coefficient", Applied Statistics 25 (3): 257–265.
•A Stack Exchange question/answer that mentions dCor and suggests it did better than the RV coefficient:
•http://math.stackexchange.com/questions/690972/distance-or-similarity-between-matrices-that-are-not-the-same-size
•Distance correlation (dCor) Szekely (2005), Szekely Bakirov and Rizzo (2007)
•A free Matlab/Octave implementation (a link from our energy page could be added):
•http://mastrave.org/doc/mtv_m/dist_corr
•Brownian correlation Szekely and Rizzo (2009)
DCor generalizes and improves Correlation, RV, Mantel and Chi-square (denominator!)
MIC, 2010
Valhalla --- GÖTTERDÄMMERUNG
Kolmogorov: “Independence is the most important notion of probability theory”
What is Pearson’s correlation?
Sample: (X_k, Y_k), k = 1, 2, …, n
Centered sample: A_k := X_k − X̄, B_k := Y_k − Ȳ
cov(x,y) = (1/n) Σ_k A_k B_k
r := cor(x,y) = cov(x,y) / [cov(x,x) cov(y,y)]^{1/2}
Prehistory:
(i) Gauss (1823) – normal surface with n correlated variables – for Gauss this was just one of several parameters
(ii) Auguste Bravais (1846) referred to one of the parameters of the bivariate normal distribution as "une correlation" but, like Gauss, he did not recognize the importance of correlation as a measure of dependence between variables. [Analyse mathématique sur les probabilités des erreurs de situation d'un point. Mémoires présentés par divers savants à l'Académie royale des sciences de l'Institut de France, 9, 255-332.]
(iii) Francis Galton (1885-1888)
(iv) Karl Pearson (1895) product-moment r
LIII. On lines and planes of closest fit to systems of points in space
Philosophical Magazine Series 6, 1901 -- cited by 1700
Pearson had no unpublished thoughts
Why do we (NOT) like Pearson’s correlation?
What is the remedy?
Apples and Oranges
If we want to study the dependence between
oranges and apples then it is hard to add or
multiply them but it is always easy to do the
same with their distances.
a_{k,l} := |X_k − X_l|, b_{k,l} := |Y_k − Y_l|, for k, l = 1, 2, …, n
A_{k,l} := a_{k,l} − ā_{k·} − ā_{·l} + ā_{··}
B_{k,l} := b_{k,l} − b̄_{k·} − b̄_{·l} + b̄_{··}
Distance Covariance:
dCov²(X,Y) = V²(X,Y) := (1/n²) Σ_{k,l} A_{k,l} B_{k,l} ≥ 0 (!?!)
see Székely, Rizzo, Bakirov (2007), Ann. Statist. 35/6
Distance standard deviation: V(X) := V(X,X), V(Y) := V(Y,Y)
Distance Correlation:
dCor²(X,Y) := R²(X,Y) := V²(X,Y) / [V(X) V(Y)]
This should be introduced in our teaching at the
undergraduate level.
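A minimal numerical sketch of these sample formulas (Python with numpy; the helper names pairwise, double_center, dcov2, dcor2 are illustrative and not the energy package's API): double-center the two distance matrices and average their entrywise product.

    import numpy as np

    def pairwise(x):
        # matrix of pairwise Euclidean distances; x is (n,) or (n, p)
        x = np.asarray(x, dtype=float)
        if x.ndim == 1:
            x = x[:, None]
        return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

    def double_center(d):
        # A_kl = a_kl - row mean - column mean + grand mean
        return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()

    def dcov2(x, y):
        # V^2(X,Y) = (1/n^2) sum_kl A_kl B_kl  (V-statistic / biased version)
        A = double_center(pairwise(x))
        B = double_center(pairwise(y))
        return (A * B).mean()

    def dcor2(x, y):
        # R^2(X,Y) = V^2(X,Y) / [V(X) V(Y)],  V(X) = sqrt(V^2(X,X))
        vxy, vxx, vyy = dcov2(x, y), dcov2(x, x), dcov2(y, y)
        return vxy / np.sqrt(vxx * vyy) if vxx > 0 and vyy > 0 else 0.0

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    y = x ** 2 + 0.1 * rng.standard_normal(200)   # dependent but nearly uncorrelated
    print(dcor2(x, y))                            # clearly positive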
The population values are a.s.
limits of the empirical ones
as n→∞ .
Thm: dCov² = ||f_n(s,t) − f_n(s) f_n(t)||²
where ||·|| is the L2-norm with the singular kernel w(s,t) := c/(st)².
This kernel is unique if we have the following invariance: dCov²(a1+b1O1X, a2+b2O2Y)=b1b2dCov²(X,Y).
A beautiful theorem on Fourier transforms
∫ (1 − cos tx)/t² dt = c|x|
The Fourier transform of any power of |t|
is a constant times a power of |x|
Gel’fand, I. M. – Shilov, G. E. (1958, 1964), Generalized
Functions
Thm
V(X) = 0 iff X is constant
V(a + bCX) = |b| V(X)
V(X+Y) ≤ V(X) + V(Y) for independent rv's, with equality iff X or Y is constant
0 ≤ dCor(X,Y) ≤ 1
dCor(X,Y) = 0 iff X, Y are independent
dCor(X,Y) = 1 iff Y = a + bCX
a_{kl} := |X_k − X_l|^α, b_{kl} := |Y_k − Y_l|^α define R_α for 0 < α < 2 [R_1 = R]
R_2(X,Y) = |Cor(X,Y)| = |Pearson's correlation|
E-statistics (energy statistics). R package version 1.1-0.
Thm. Under independence of X and Y,
n dCov_n²(X,Y) → Q = Σ_k λ_k Z_k²;
otherwise the limit is ∞.
Thus we have a universally consistent test of independence
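In practice the null quadratic form Q is often replaced by a permutation null. A hedged sketch (Python with numpy; a permutation test is one standard way to calibrate n·dCov²_n, not necessarily the calibration used in the 2007 paper):

    import numpy as np

    def dcov2(x, y):
        # biased (V-statistic) dCov^2 for univariate samples
        a = np.abs(x[:, None] - x[None, :])
        b = np.abs(y[:, None] - y[None, :])
        A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
        B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
        return (A * B).mean()

    def dcov_perm_test(x, y, n_perm=999, seed=0):
        # p-value of the independence test based on n * dCov_n^2
        rng = np.random.default_rng(seed)
        n = len(x)
        obs = n * dcov2(x, y)
        exceed = sum(n * dcov2(x, y[rng.permutation(n)]) >= obs
                     for _ in range(n_perm))
        return (exceed + 1) / (n_perm + 1)

    rng = np.random.default_rng(1)
    x = rng.standard_normal(100)
    print(dcov_perm_test(x, np.sin(3 * x) + 0.2 * rng.standard_normal(100)))  # small p-value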
What if (X,Y) is bivariate normal?
In this case
0.89 |cor| ≤ dCor ≤ |cor|
Unbiased Distance Correlation
The unbiased estimator of dCov²(X, Y) is
dCov*_n := ⟨A*, B*⟩ := (1/[n(n−3)]) (A*, B*)
This is an inner product in the linear space H_n of n×n matrices generated by n×n distance matrices. The population Hilbert space is denoted by H, where the inner product is (generated by) dCov*(X, Y).
The power of the dCor test for independence is very good, especially in high dimensions p, q.
Denote the unbiased version by dCov*_n.
The corresponding bias-corrected distance correlation is R*_n.
This is the correlation for the 21st century.
Theorem. In high dimension, if the CLT holds for the coordinates, then
T_n := [M−1]^{1/2} R*_n / [1 − (R*_n)²]^{1/2}, where M = n(n−3)/2,
is t-distributed with d.f. M−1.
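A sketch of this t-test (Python with numpy/scipy). The U-centering constants below are my reading of the unbiased estimator's construction, so treat them as an assumption; the theorem is stated for high dimension, so with low-dimensional toy data the p-value is only illustrative.

    import numpy as np
    from scipy import stats

    def u_center(d):
        # U-centered distance matrix A*: zero diagonal, modified row/column centering
        n = d.shape[0]
        r = d.sum(axis=1, keepdims=True)   # row sums
        c = d.sum(axis=0, keepdims=True)   # column sums
        A = d - r / (n - 2) - c / (n - 2) + d.sum() / ((n - 1) * (n - 2))
        np.fill_diagonal(A, 0.0)
        return A

    def dcov_star(x, y):
        # unbiased dCov*_n = <A*, B*> / (n (n - 3))
        a = np.abs(x[:, None] - x[None, :])
        b = np.abs(y[:, None] - y[None, :])
        A, B = u_center(a), u_center(b)
        n = len(x)
        return (A * B).sum() / (n * (n - 3))

    def dcor_t_test(x, y):
        # bias-corrected R*_n and the t statistic of the theorem above
        n = len(x)
        vxy, vxx, vyy = dcov_star(x, y), dcov_star(x, x), dcov_star(y, y)
        r = vxy / np.sqrt(vxx * vyy) if vxx > 0 and vyy > 0 else 0.0
        M = n * (n - 3) / 2
        t = np.sqrt(M - 1) * r / np.sqrt(1 - r * r)
        p = stats.t.sf(t, df=M - 1)        # one-sided p-value
        return r, t, p

    rng = np.random.default_rng(0)
    x = rng.standard_normal(80)
    print(dcor_t_test(x, 0.5 * x + rng.standard_normal(80)))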
Why?
R*_n = Σ_{ij} U_{ij} V_{ij} / [Σ U_{ij}² Σ V_{ij}²]^{1/2}
with iid standard normal variables.
Put Z_{ij} = U_{ij} / [Σ U_{ij}²]^{1/2}; then Σ Z_{ij}² = 1.
Under the null (independence of U_{ij} and V_{ij}), Z_{ij} does not depend on V_{ij}.
Given Z, by Cochran's thm (the square of Σ_{ij} Z_{ij} V_{ij} has rank 1), T_n is t-distributed when Z is given, and thus also unconditionally.
Under the alternative?
We need to show that if U, V are standard normal with
zero expected value and correlation ρ>0 then
P(UV > c) is a monotone increasing function of ρ.
For the proof notice that if X, Y are iid standard normal and a² + b² = 1, 2ab = ρ, then for
U := aX + bY and V := bX + aY
we have Var(U) = Var(V) = 1 and E(UV) = 2ab = ρ.
Thus
UV=ab(X²+Y²)+(a²+b²)XY= ρ(X²+Y²)/2 + XY
Q.E.D.
(I do not need it, but I do not know what happens if the expectations are not zero.)
The number of operations is O(n²),
independently of the dimension
which can even be infinite
(X and Y can be in two different metric spaces –
Hilbert spaces)
The storage complexity can be reduced to O(n)
via recursive formula
Parallel processing for big n?
A characteristic measure of dependence (population value)
dCov²(X,Y) = E[|X−X'||Y−Y'|] + E|X−X'| E|Y−Y'| − 2E[|X−X'||Y−Y''|]
dCov = cov of distances?
(X,Y), (X',Y'), (X'',Y'') are iid
dCov²(X,Y) = E[|X−X'||Y−Y'|] + E|X−X'| E|Y−Y'| − E[|X−X'||Y−Y''|] − E[|X−X''||Y−Y'|]
= cov(|X−X'|, |Y−Y'|) − 2 cov(|X−X'|, |Y−Y''|)
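A quick Monte Carlo sanity check that these two population expressions agree (sketch in Python with numpy; the bivariate normal dependence and the sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    def draw():                           # one iid copy of the pair (X, Y)
        x = rng.standard_normal(N)
        return x, 0.5 * x + rng.standard_normal(N)

    x,  y  = draw()    # (X , Y )
    x1, y1 = draw()    # (X', Y')
    x2, y2 = draw()    # (X'', Y'')

    dx, dy, dy2 = np.abs(x - x1), np.abs(y - y1), np.abs(y - y2)
    form1 = (dx * dy).mean() + dx.mean() * dy.mean() - 2 * (dx * dy2).mean()
    form2 = np.cov(dx, dy)[0, 1] - 2 * np.cov(dx, dy2)[0, 1]
    print(form1, form2)   # two Monte Carlo estimates of dCov^2(X,Y); they nearly agree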
(i) Does cov(|X−X'|, |Y−Y'|) = 0 imply X and Y are independent?
(ii) Does the independence of X and Y imply the independence of X, Y?
(i) q(x) = −c/2 for −1 < x < 0, 1/2 for 0 < x < c, 0 otherwise; p(x,y) := 1/4 − q(x)q(y)
Max correlation?
sup f,g Cor(f(X), g(Y)) for all f,g Borel
functions with 0 < Var f(X), Var g(Y) < ∞.
Why should we (not) like max cor?
If max cor(X, Y) = 0 then X, Y are independent.
For the bivariate normal, max cor = |cor|.
For partial sums of iid rv's, max cor²(S_m, S_n) = m/n for m ≤ n.
Sarmanov (1958) Dokl. Akad. Nauk SSSR
What is wrong with maxcor?
What is the meaning of max cor = 1?
Trigonometric coins
Sn := sin U+ sin 2U + … + sin nU
tends to Cauchy (we did not divide by √n !!)
Open problem: What is the sup of dCor for uncorrelated X and Y? Can it be > 0.85?
Lecture 2.
Energy Statistics (E-statistics)
Newton’s gravitational potential energy can
be generalized for statistical applications.
Statistical observations are heavenly bodies (in a metric space) governed by a statistical
potential energy which is zero iff an underlying statistical null hypothesis holds.
Potential energy statistics are symmetric functions of distances
between statistical observations in metric spaces.
EXAMPLE Testing Independence
Potential Energy Statistics
Potential energy statistics, or energy statistics, or E-statistics for short, are U-statistics or V-statistics that are functions of distances between sample elements.
The idea is to consider statistical observations as heavenly bodies governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis is true.
Distances and Energy: the next level of
abstraction (Prelude)
In the beginning Man created integers. The accountants of Uruk in Mesopotamia, about five thousand years ago,
invented the first numerals – signs encoding the concept of oneness, twoness, threeness, etc. abstracted from any
particular entity. Before that, for about another 5000 years, jars of oil were counted with ovoids, measures of grain were counted with cones, etc.; numbers were indicated by one-to-one correspondence. Numerals revolutionized our
civilization: they expressed abstract thoughts, after all, “two” does not exist in nature, only two fingers, two people, two
sheep, two apples, two oranges. After this abstraction we could not tell from the numerals what the objects were;
seeing the signs of 1,2,3,... we could not see or smell oranges, apples, etc. but we could do comparisons, we could do
“statistics”, “statistical inference”.
In this lecture instead of working with statistical observations, data taking integer or real values, or taking values in
Euclidean spaces, Hilbert spaces or in more general metric spaces we make inferences from their distances.
Distances and angles make wonders in science (see e.g. Thales 600 BC; G.J. Szekely: Thales and the Ten
Commandments). Here we will exploit this in statistics. Instead of working with numbers, vectors, functions, etc.
first we compute their distances and all our inferences will be based on these distances. This is the next level of
abstraction where not only we cannot tell the objects, we cannot even tell how big their numbers are, we cannot tell
what the data are, we can just tell how far they are from each other. At this level of abstraction we of course lose even
more information, we cannot sense lots of properties of data, e.g. if we add the same constant to all data then their
distances won’t change. No rigid motion in the space changes the distances. On the other hand we gain a lot: distances
are always easy to add, multiply, etc. even when it is not so natural to add or multiply vectors and more abstract
observations especially if they are not from the same space.
The next level of abstraction is energy statistics: invariance wrt ratios of distances: the angles are invariant. Distance correlation depends on angles.
Goodness-of-fit
Dual space
Application in statistics
Construct a U (or V) statistic with kernel
h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|
V_n = (1/n²) Σ_{i,k} h(X_i, X_k)
Under the null: Eh(X,Y) = 0, but h is also a rank-1 degenerate kernel because Eh(x,Y') = 0 a.s. under the null; thus
under the null the limit distribution of nV_n is
Q := Σ_k λ_k Z_k², where λ_k are eigenvalues of the
Hilbert–Schmidt operator: ∫ h(x,y) ψ(y) dF(y) = λψ(x),
and under the alternative (X and Y have different distributions)
nV_n → ∞ a.s.
What to do with Hilbert-Schmidt?
∫ h(x,y) ψ(y) dF(y) = λψ(x), Q := Σ_k λ_k Z_k²
(i) Approximate the eigenvalues: (1/n) Σ_i h(X_i, X_j) ψ = λψ
(ii) If Σ_i λ_i = 1 then P(Q ≥ c) ≤ P(Z² ≥ c) if this probability is at most 0.215 (conservative, consistent test)
(iii) t-test (see later)
Simple vs Good
E_α(X,Y) := 2E|X−Y|^α − E|X−X'|^α − E|Y−Y'|^α ≥ 0 for 0 < α < 2,
= 0 iff X and Y are identically distributed.
For α = 2 we have E_2(X,Y) = 2[E(X) − E(Y)]².
In case of “classical statistics” (α =2) life is simple but not always good
In case of “energy statistics” (0 < α < 2) life is not so simple but good.
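A sketch of the two-sample E_α statistic for univariate data (Python with numpy; V-statistic form, diagonal terms included; the function name is illustrative):

    import numpy as np

    def energy_dist(x, y, alpha=1.0):
        # E_alpha = 2 E|X-Y|^a - E|X-X'|^a - E|Y-Y'|^a, estimated from two samples
        x, y = np.asarray(x, float), np.asarray(y, float)
        dxy = np.abs(x[:, None] - y[None, :]) ** alpha
        dxx = np.abs(x[:, None] - x[None, :]) ** alpha
        dyy = np.abs(y[:, None] - y[None, :]) ** alpha
        return 2 * dxy.mean() - dxx.mean() - dyy.mean()

    rng = np.random.default_rng(0)
    print(energy_dist(rng.standard_normal(500), rng.standard_normal(500)))      # near 0
    print(energy_dist(rng.standard_normal(500), 1 + rng.standard_normal(500)))  # clearly positive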
Testing for (multivariate)
normality
Standardized sample: y_1, y_2, …, y_n
E|Z − Z'| = ? For U(a,b): E|U−U'| = |b−a|/3;
for the exponential: E|e−e'| = 1/λ
E|y − Z| = ? For U(a,b): E|y − U| = quadratic polynomial (in y)
(hint: if Z is a d-variate standard normal then |y−Z|² has a noncentral
chi-square distribution with non-centrality parameter |y|²/2 and d.f. d + 2p,
where p is a Poisson r.v. with mean |y|²/2; see Zacks (1981) p. 55)
In 1 dim: E|y − Z| = (2/π)^{1/2} exp{−y²/2} + 2yΦ(y) − y
For implementation see the energy package in R and Székely and Rizzo (2004).
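Using the univariate closed form above and the known value E|Z − Z'| = 2/√π, here is a sketch of the statistic n·E_n for testing univariate normality (Python with numpy/scipy; critical values would still have to come from the null distribution, e.g. by simulation, which is not shown):

    import numpy as np
    from scipy.stats import norm

    def energy_normality_stat(x):
        # n * [ (2/n) sum_i E|y_i - Z|  -  E|Z - Z'|  -  (1/n^2) sum_ij |y_i - y_j| ]
        x = np.asarray(x, float)
        n = len(x)
        y = (x - x.mean()) / x.std(ddof=1)                      # standardized sample
        e_yz = np.sqrt(2 / np.pi) * np.exp(-y**2 / 2) + 2 * y * norm.cdf(y) - y
        e_zz = 2 / np.sqrt(np.pi)                               # E|Z - Z'| for iid N(0,1)
        e_yy = np.abs(y[:, None] - y[None, :]).mean()
        return n * (2 * e_yz.mean() - e_zz - e_yy)

    rng = np.random.default_rng(0)
    print(energy_normality_stat(rng.standard_normal(300)))   # small under normality
    print(energy_normality_stat(rng.exponential(size=300)))  # large under non-normality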
Why is energy a very good test
for normality?
1. It is affine invariant
2. Consistent against general alternatives
3. Powerful omnibus test
In the univariate case our energy test is "almost" the same as the Anderson–Darling EDF test based on
∫ (F_n(x) − F(x))² dF(x) / [F(x)(1 − F(x))].
But here dF(x)/[F(x)(1 − F(x))] is close to constant for standard normal F, and thus almost the same as "energy"; thus our energy test is essentially a multivariate extension of the powerful Anderson–Darling test.
Distance skewness
Advantages:
Skew(X) := E[(X − E(X))/σ]³ = 0 does NOT characterize symmetry, but
distance skewness:
dSkew(X) := 1 − E|X−X'| / E|X+X'| = 0
iff X is centrally symmetric.
Sample: 1 − Σ|X_i − X_k| / Σ|X_i + X_k|
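A direct implementation of the sample statistic just given (Python with numpy; applied to the raw sample exactly as written, so center the data first if symmetry about the mean is what you want to test):

    import numpy as np

    def dskew(x):
        # sample distance skewness: 1 - sum|Xi - Xk| / sum|Xi + Xk|
        x = np.asarray(x, float)
        num = np.abs(x[:, None] - x[None, :]).sum()
        den = np.abs(x[:, None] + x[None, :]).sum()
        return 1.0 - num / den

    rng = np.random.default_rng(0)
    print(dskew(rng.standard_normal(1000)))           # symmetric about 0: close to 0
    print(dskew(rng.exponential(size=1000) - 1.0))    # centered but skewed: visibly > 0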
DISCO: a nonparametric extension of
ANOVA
DISCO is a multi-sample test of equal distributions, a generalization of the hypothesis of
equal means which is ANOVA.
Put A = (X_1, X_2, …, X_n), B = (Y_1, Y_2, …, Y_m), and d(A, B) := (1/(nm)) Σ_{i,k} |X_i − Y_k|
Within-sample dispersion: W := Σ_j (n_j/2) d(A_j, A_j)
Put N = n_1 + n_2 + … + n_K and A := {A_1, A_2, …, A_K}
Total dispersion: T := (N/2) d(A, A)
Thm. T = B + W,
where B, the between-sample dispersion, is the energy distance, i.e. the weighted sum of
E(A_j, A_k) = 2 d(A_j, A_k) − d(A_j, A_j) − d(A_k, A_k)
The same thing with exponent α = 2 in d(A, B) is ANOVA.
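A small numerical check of the decomposition T = B + W (Python with numpy; the between-sample weights n_j n_k/(2N) in front of E(A_j, A_k) are my reading of the "weighted sum", so treat the sketch as illustrative and check against the energy package if it matters):

    import numpy as np

    def d(a, b):
        # d(A,B) = (1/(nm)) * sum_{i,k} |X_i - Y_k|
        return np.abs(np.asarray(a)[:, None] - np.asarray(b)[None, :]).mean()

    def disco(samples):
        N = sum(len(s) for s in samples)
        pooled = np.concatenate(samples)
        T = (N / 2) * d(pooled, pooled)                      # total dispersion
        W = sum((len(s) / 2) * d(s, s) for s in samples)     # within-sample dispersion
        B = 0.0                                              # between-sample (energy) dispersion
        for j in range(len(samples)):
            for k in range(j + 1, len(samples)):
                aj, ak = samples[j], samples[k]
                e = 2 * d(aj, ak) - d(aj, aj) - d(ak, ak)
                B += len(aj) * len(ak) / (2 * N) * e
        return T, W, B

    rng = np.random.default_rng(0)
    T, W, B = disco([rng.standard_normal(30),
                     rng.standard_normal(40) + 1,
                     rng.standard_normal(50)])
    print(T, W + B)   # the two numbers agree: T = W + B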
E-clustering
Hierarchical clustering: we merge clusters with minimum energy distance:
E(C_i ∪ C_j, C_k) = [(n_i+n_k)/(n_i+n_j+n_k)] E(C_i, C_k) + [(n_j+n_k)/(n_i+n_j+n_k)] E(C_j, C_k) − [n_k/(n_i+n_j+n_k)] E(C_i, C_j)
In E-clustering not only the cluster centers matter but the cluster point
distributions. If the exponent in d is α=2 then we get Ward’s minimum
variance method, a geometrical method that separates and identifies
clusters by their centers. Thus Ward is not consistent but E-clustering
is consistent. The ability of E-clustering to separate and identify clusters
with equal or nearly equal centers has important practical applications.
For details see Szekely- Rizzo (2005) Hierarchical clustering via joint
between-within distances, Journal of Classification, 22(2), 151-183.
Kinetic Energy (E)
Under the Null the limit distribution of nVn is
Q:=∑kλkZk2 where λk are eigenvalues of
Hilbert-Schmidt: ∫h(x,y)ψ(y)dF(y) = λψ(x)
where h(x,y)= E|x-X| + E|y-Y| - E|X-Y| - |x-y|
Differentiate twice:
−ψ''/(2f) = Eψ with boundary conditions ψ'(a) = ψ'(b) = 0
(the second derivative wrt x of −(1/2)|x−y| is −δ(x−y), where δ is the Dirac delta)
Thus in 1 dimension E = 1/λ.
Thus we transformed the potential energy (Hilbert-Schmidt) equation into a
kinetic energy (Schrödinger) equation.
Schrödinger equation(1926):
-ψ(x)”/(2m) + V(x)ψ(x) = (E + 1/E)ψ(x)
Energy conservation law?
My Erlangen Program in
Statistics
Klein, Felix 1872. "A comparative review of recent researches in geometry".
This is a classification of geometries via invariances (Euclidean, Similarity,
Affine, Projective,…) Klein was then at Erlangen.
Energy statistics are always rigid motion invariant, their ratios, e.g. dCor is also
invariant wrt scaling (angles remain invariant like in Thales’s geometry of
similarities)
Can we have more invariance? In the univariate case we have monotone invariant rank statistics. But in the multivariate case, if a statistic is 1-1 affine/projective invariant and continuous, then it is constant. (Projection is affine but not 1-1; still, because of continuity the statistics are invariant to all projections to (coordinate) lines, thus they are constant.)
Affine invariant energy statistics
They cannot be continuous but in case of testing
for normality affine invariance is natural (it is not
natural for testing independence because it
changes angles).
BUT dCor =0 is invariant with respect to all 1-1
Borel functions and max cor is also invariant wrt all
1-1 Borel but these are population values.
Maximal correlation is too invariant. Why? Max
correlation can easily be 1 for uncorrelated rv’s but
the max of dCor for uncorrelated variables is <
0.85.
Unsolved dCor problems
• Using subsampling construct confidence interval for
dCor^2. Why (not) bootstrap?
• Definition of Complexity of function f via dCor (X, f(X))
• Find sup dCor (X,Y) for uncorrelated X and Y.
Energy and U, V
Lecture 3.
Brownian Correlation / Covariance
X_id := X − E(X) = id(X) − E(id(X) | id(·))
Cov²_id(X, Y) := E(X_id X'_id Y_{id'} Y'_{id'})
X_W := W(X) − E(W(X) | W(·))
Cov²_W(X, Y) := E(X_W X'_W Y_{W'} Y'_{W'})
Remark: Cov_id(X,Y) = |Cov(X,Y)|
Theorem: dCov(X,Y) = Cov_W(X,Y) (!!)
Székely and Rizzo (2009), Ann. Appl. Statist. 3/4, Discussion Paper
What if Brownian motion is replaced by another stochastic process? What
matters is the (positive definite) covariance function of the process.
Why Brownian?
We can replace BM by any two stochastic
processes U=U(t) and V=V(t)
Cov²_{U,V}(X,Y) := E(X_U X'_U Y_V Y'_V)
But why is this generalization good, how to
compute, how to apply?
The covariance function of BM is
2min(s,t)= |s| + |t| -|s-t|.
Fractional BM
The simplest extension is
|s|α + |t|α -|s-t|α
and a zero mean Gaussian process with this cov is the fractional
BM defined for 0 < α < 2. This process was mentioned for the first
time in Kolmogorov(1940). α = 2H where H is the Hurst exponent.
Fractal dimension D= 2-H.
H describes the raggedness of the resultant motion,
with a higher value leading to a smoother motion. The value of H determines what kind of
process the fBm is:
•if H = 1/2 then the process is in fact a Brownian motion or Wiener process;
•if H > 1/2 then the increments of the process are positively correlated;
•if H < 1/2 then the increments of the process are negatively correlated.
The increment process, X(t) = BH(t+1) − BH(t), is known as fractional Gaussian noise.
Variogram
What properties of the (fractional) BM do we need to make sure that the cov wrt certain stochastic processes is of "energy" type, i.e. it depends only on the distances between observations?
In spatial statistics the variogram 2γ(s,t) of a random field Z(t) is 2γ(s,t) := Var(Z(s) − Z(t)).
Suppose E(Z(t))=0. For stationary processes
γ(s,t):= γ(s-t) and for stationary isotropic ones:
γ(s,t):= γ(|s-t|)
A function is a variogram of a zero expected value process/field iff it is conditionally negative definite (see later). If the covariance function C(s,t) of a process with Z(0) = 0 exists then
2C(s,t) = 2E[Z(s)Z(t)] = E Z(s)² + E Z(t)² − E[Z(t) − Z(s)]² = 2γ(s,0) + 2γ(t,0) − 2γ(s,t).
For BM we had 2 min(s,t) = |s| + |t| − |s−t|.
We also have the converse: 2γ(s,t) = C(s,s) + C(t,t) − 2C(s,t).
Cov²_{U,V}(X,Y) is of "energy type" if the increments of U, V are stationary isotropic.
Special Gaussian processes
The negative log of the symmetric Laplace ch.f. is γ(t) := log(1 + |t|²); it defines a Laplace–Gaussian process with the corresponding C(s,t) because this γ is conditionally negative definite.
The negative log of the ch.f. of the difference of two iid Poisson variables is γ(t) := 1 − cos t (up to a constant factor). This defines a Poisson–Gaussian process.
Correlation wrt stochastic
processes
When only the covariance counts, we can assume the processes are Gaussian.
Why do we need this generalization?
Conjecture. We need this generalization if the observations (X_t, Y_t) are not iid but stationary ergodic. Then consider cor wrt zero mean (Gaussian) processes with stationary increments having the same cov as (X_t, Y_t)?
A property of squared distances
What exactly are the properties of distances in Euclidean spaces (and
Hilbert spaces) that we need for statistical inferences?
We need the following properties of squared distances |x−y|² in Hilbert spaces.
Let H be a real Hilbert space, x_i ∈ H. Then if a_i ∈ R and Σ_i a_i = 0,
Σ_{ij} a_i a_j |x_i − x_j|² = −2 |Σ_i a_i x_i|² ≤ 0.
Thus if yi is in H i=1,2,…, n is another set of elements from H then
Σij ai aj |xi – yj|² = -2 Σij ai aj xi yj ≤ -| Σi ai (xi + yi)|² ≤ 0.
This is what we call the (conditional) negative definite property of |x-y|^2.
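The displayed identity is easy to verify numerically (a sketch in Python with numpy; random points in R^p stand in for the Hilbert space):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 8, 3
    x = rng.standard_normal((n, p))            # points x_i in R^p
    a = rng.standard_normal(n); a -= a.mean()  # coefficients with sum a_i = 0

    D2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # |x_i - x_j|^2
    lhs = a @ D2 @ a                                       # sum_ij a_i a_j |x_i - x_j|^2
    rhs = -2 * np.sum((a[:, None] * x).sum(0) ** 2)        # -2 |sum_i a_i x_i|^2
    print(lhs, rhs)   # equal up to rounding, and <= 0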
Negative definite functions
Let the data come from an arbitrary set S.
A function h(x,y): S×S → R is negative definite if
h(x,y) = h(y,x) (symmetric), h(x,x) = 0,
and for all real numbers a_i with Σ_i a_i = 0,
Σ_{ij} a_i a_j h(x_i, x_j) ≤ 0. (*)
The function h is strongly negative definite if (*) holds and equality in (*) holds iff all a_i = 0.
Theorem (I. J. Schoenberg (1938)) A metric space (S,d) embeds in a
Hilbert space iff h= d^2 is negative definite.
Further examples
h(x,y) := |x−y|^α is negative definite if 0 < α ≤ 2, strictly negative definite if 0 < α < 2.
This is equivalent to the claim that exp{−|t|^α} is a characteristic function (of a symmetric stable distribution).
Classical statistics was built on α = 2.
This makes classical formulae simpler, but because the "strictness" does not hold here, the corresponding "quadratic theorems" apply to "quadratic type distributions" only, e.g. Gaussian distributions whose densities are exp{quadratic polynomial}.
See also least squares
For α = 2 life is simple (~ multivariate normal) but not always good, for
0 < α < 2 life is not so simple but good (nonparametric).
My “energy” inferences are based on strictly negative definite kernels.
Why do we need negative
definite functions?
Let p_i and q_i be two probability distributions on the same points x_i. Let X, Y be independent rv's: P(X = x_i) = p_i, P(Y = x_i) = q_i. Then the strong negative definiteness of h(x,y) = |x−y| implies that if a_i = p_i − q_i then
Σ_{ij} (p_i − q_i)(p_j − q_j) |x_i − x_j| ≤ 0,
i.e. if E denotes the expectation of a random variable then the potential energy of (X,Y)
E(X,Y) := E|X−Y| + E|X'−Y'| − E|X−X'| − E|Y−Y'| ≥ 0 (*)
where X’ and Y’ are iid copies of X and Y resp. Strong negative
definiteness implies that equality holds iff X and Y are identically
distributed. What it means is that the double centered expected
distance of X and Y, i.e. the potential energy of (X,Y), is always
nonnegative and equals zero iff X and Y are identically distributed.
High school example
n "red" cities x_i are on two sides of a line L (a river), k of them on the left side, n−k on the right; similarly, there are n "green" cities y_i, m of them on the left side of the same line, n−m on the right. We connect two cities if they are on different sides of the river. Red–red connections are drawn in red, green–green in green, and mixed ones in blue.
Claim: 2#blue - #red - # green ≥ 0 and = 0 iff k=m
Hint: k(n-m)+m(n-k) – k(n-k)-m(n-m) = (k-m)² ≥ 0
Combine this with M. W. Crofton (1868) integral geometry formula on
random lines to get Energy.
Newton’s potential energy –
Statistical potential energy
Newton’s potential energy in our 3-dim space is proportional to the
reciprocal of the distance; if r:= |x-y| denotes the distance of points x,y,
then the potential energy is proportional to 1/r. The mathematical
significance of this function is that it is harmonic, i.e. 1/r is the
fundamental solution of the Laplace equation. In 1 dimension r itself is
harmonic.
For statistical applications what is relevant is that r^{α} is strictly
negative definite iff 0 < α < 2. Statistical potential energy is the double
centered version of
E|X−Y|^α for 0 < α < 2:
E(X,Y) := 2E|X−Y|^α − E|X−X'|^α − E|Y−Y'|^α ≥ 0 for 0 < α < 2.
SEP
Suppose for simplicity that the kernel of a V-statistic has two arguments: h = h(x_1, x_2). This is the situation if we want to check that X has a given distribution. But what if the sample is SEP (a stationary ergodic process)? Stationary = ? Ergodic = ?
Even in this case the SLLN holds, and thus
(1/n²) Σ_{i,j} h(X_i, X_j) → Eh(X, X') a.s.,
so we have strongly consistent estimators.
If h has rank 1 and the sample is iid then the limit distribution of (1/n) Σ_{i,j} h(X_i, X_j) is Q = Σ_k λ_k Z_k², where λ_k are
eigenvalues of the Hilbert–Schmidt operator ∫ h(x_1, x_2) Ψ(x_2) dF(x_2) = λΨ(x_1).
We know that in general this is not true for SEP, e.g. if h = x_1 x_2.
We can still compute the eigenvalues μ_k, k = 1, 2, …, n, of the random operator (n×n random matrix)
(1/n) [h(X_i, X_j); i, j = 1, 2, …, n]
and consider the corresponding Gaussian quadratic form Q = Σ_{k=1}^n μ_k Z_k².
Can the critical values for the corresponding null hypothesis be computed from Q if a CLT holds, e.g. if we have a martingale difference structure or mixing/weak dependence or dCor → 0 (we need to approach the Gaussian distribution)?
What to do with kernels like h(x_1, x_2, x_3, x_4), etc., and how to test independence from SEP?
Testing independence of
ergodic processes
If we have strongly stationary ergodic sequences then by
the SLLN for V-statistics we know that the empirical dCor
converges a.s. to the population dCor and this is constant
a.s. So we have a consistent estimator for the population
dCor. But how can we test if dCor=0 i.e. if the X process is
independent of the Y process?
Permutation tests won't work. Limit theorems for Q depend on the dependence structure, so it is complicated. How about the t-test? For this we need a kind of CLT.
What is the question?
Do we want to test if Xt is independent of Yt
or if the X sequence is independent of the Y
sequence?
Example. Let Xt t=1,2,… be iid and Yt = Xt+1
Then Xt is independent of Yt but the Y sequence is a shift of the X sequence so they
are not independent. We can now test if (Xt , Xt+1)
is independent of (Yt , Yt+1) , etc. using permutation test.
Null: The X st process is independent of the Y st process
Test if p-tuples of consecutive observations with random starting points
are independent e.g. with p= √n.
Proof of a conjecture of
Ibragimov-Linnik
How to avoid mixing condition in CLT?
Thm. Let X_n, n = 0, ±1, ±2, … be a strictly stationary sequence, EX_n = 0, S_n = X_1 + … + X_n. Suppose
(i) s_n := [Var(S_n)]^{1/2} = n f(n) where f(n) is a slowly varying function,
(ii) Cor_W(S_{−m}/s_m, (S_{r+m} − S_m)/s_m) → 0 as m, r → ∞ and r/m → 0, and
(iii) (S_m/s_m)² is uniformly integrable.
Then the CLT holds.
(Bakirov, N. K and Szekely, G.J. Brownian covariance and central limit theorem for stationary sequences, Theory of
Probability and Its Applications, Vol. 55, No. 3, 371-394, 2011.)
ER
ER in our case has two meanings: Emergency Room and Energy in R,
i.e. Energy programs in the program package R.
Classical emergency toolkits of statisticians contain things like the t-test, F-test, ANOVA, tests of independence, Pearson's correlation, etc. Most of
them are based on the assumption that the underlying distribution is
Gaussian. Our first aid toolkit is a collection of programs that are based
on the notion of energy and they do not assume that the underlying
distribution is Gaussian.
ANOVA is replaced by DISCO, Ward’s hierarchical clustering is
replaced by energy clustering, Pearson’s correlation by distance
correlation, etc. We can also test if a distribution is (multivariate)
Gaussian. We suggest statisticians use our Energy package in R, ER,
as a first aid for analyzing data in the Emergency Room. ER of course
cannot replace further scrutiny of specialists.
References
Székely, G.J. (1985-2005) Technical Reports on Energy (E-)statistics and on distance
correlation. Potential and Kinetic energy in Statistics.
Székely, G.J. and Rizzo, M. L. and Bakirov, N.K. (2007) Measuring and testing
independence by correlation of distances, Ann. Statistics 35/6, 2769-2794.
Székely, G. J. and Rizzo, M. L (2009) Brownian distance covariance, Discussion paper,
Ann. Applied Statistics. 3 /4 1236-1265.
Lyons, R. (2013) Distance covariance in metric spaces, Ann. Probability, 41/5, 3284-3305.
Szekely, G.J., Rizzo, M. L. (2013) Energy Statistics: A class of statistics based on
distances, JSPI, Invited paper.