Postgraduate Lectures – MSSL/UCL
A multivariate analysis primer
Ignacio Ferreras
i.ferreras@ucl.ac.uk
May 3, 2016
Introduction
Multivariate analysis deals with the study of datasets where several observations are made for each system within a sample. We can assume these observations depend on a number of parameters. For instance, in astrophysics:
+ Several photometric/spectroscopic measurements of a sample of stars.
+ The analysis of the surface brightness distributions of galaxies.
+ Separation of components from a multi-frequency survey of the CMB.
In general, the goal is to:
• Find the relationship between the data and the parameters (regression).
• Arrange the data into a reduced set of classes (classification).
• Reduce the dimensionality, or determine the driving parameter(s).
• Reduce the noise of the observables based on the statistical properties.
An example
Consider a simple example with three observables: the mass, luminosity and size of spheroidal stellar systems. In this 3D parameter space, the observations populate a lower-dimensional space, revealing some relationship. In this case the virial theorem (blue plane) gives a good (although not complete) explanation of the relation. The question is: in general, can we use the data alone to inform us of possible physical correlations in a complex set of observations?
(Tollerud et al. 2011, ApJ, 726, 108)
Probability Distribution
We can consider the observations and the sources as originating from a probability density function (pdf), such that a given observation has a probability

\[ p(x_0)\,dx \]

of being measured in the interval [x_0 − dx, x_0 + dx], where p(x) is the probability density function, which can also be given by its cumulative version:

\[ F(<x_0) = \int_{-\infty}^{x_0} p(\xi)\,d\xi . \]

The expected value of an arbitrary function of the variable, g(x), is:

\[ \langle g \rangle = E(g(x)) = \int_{-\infty}^{+\infty} g(x)\,p(x)\,dx \]

and the uncertainty interval of x at the confidence level [c_1, c_2] (e.g. c_1 = 0.05, c_2 = 0.95 for the 90% C.L.) is [x_1, x_2] = [F^{-1}(c_1), F^{-1}(c_2)].
Typical distributions
Binomial: The probability of an event succeeding is q, and of failing is 1 − q. After n trials, the probability of k successes is:

\[ p_k(n; q) = \binom{n}{k}\, q^k (1-q)^{n-k} . \]

Poisson: The limit of the binomial distribution when q → 0 but nq ≡ λ is finite. The probability of detecting k events (e.g. in a fixed interval of time) is:

\[ p_k(\lambda) = \frac{\lambda^k e^{-\lambda}}{k!} . \]
Gaussian: The central limit theorem states that a sum of random variables with finite variance will approach the Gaussian distribution, defined by two parameters: mean (µ) and standard deviation (σ):

\[ p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \]
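A minimal Python sketch of these limits (scipy.stats, with arbitrary parameter values chosen for illustration):

import numpy as np
from scipy import stats

# Arbitrary illustrative values: many trials, small success probability
n, q = 1000, 0.005
lam = n * q                       # Poisson rate in the q -> 0, nq = const. limit
k = np.arange(0, 20)

p_binom = stats.binom.pmf(k, n, q)                        # exact binomial
p_pois = stats.poisson.pmf(k, lam)                        # Poisson limit
p_gauss = stats.norm.pdf(k, loc=lam, scale=np.sqrt(lam))  # Gaussian approximation

print(np.max(np.abs(p_binom - p_pois)))   # small in this regime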
“chi-squared” (χ²): The sum of the squares of ν centrally-distributed Gaussian variables with unit variance follows a χ² distribution with ν degrees of freedom:

\[ p_\nu(x) = \frac{x^{\nu/2 - 1}\, e^{-x/2}}{2^{\nu/2}\,\Gamma(\nu/2)} \]

This is used (and abused!) in model fitting, where one has a set of observed quantities {x_i} with uncertainties {σ_i} and a model y(x; π_j) that explains those data with a set of parameters {π_j}. The χ² is formed as follows:

\[ \chi^2(\pi_j) = \sum_i \left[ \frac{x_i - y(x_i; \pi_j)}{\sigma_i} \right]^2 \]

and the likelihood related to the parameters is therefore:

\[ \mathcal{L}(\pi_j | x) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-[x_i - y(x_i;\pi_j)]^2 / 2\sigma_i^2} \propto e^{-\frac{1}{2}\chi^2(\pi_j)} \]
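A minimal Python sketch of this construction, with a made-up linear model and simulated data (all names and parameter values are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 25)                    # independent variable
sigma = 0.3 * np.ones_like(t)                     # uncertainties sigma_i
x_obs = 1.5 * t + 2.0 + rng.normal(0.0, sigma)    # simulated observations x_i

def model(t, a, b):
    # Toy model y(t; a, b) = a*t + b (hypothetical example)
    return a * t + b

def chi2(a, b):
    return np.sum(((x_obs - model(t, a, b)) / sigma) ** 2)

def log_likelihood(a, b):
    # ln L = -chi2/2 plus the constant Gaussian normalisation terms
    return -0.5 * chi2(a, b) - np.sum(np.log(np.sqrt(2.0 * np.pi) * sigma))

print(chi2(1.5, 2.0), len(t))   # a good fit gives chi2 of order N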
Distribution Moments
The pdf can be coded into a set of numbers (statistics) that are commonly used to describe the distribution. The nth order moment of a distribution is defined as the expected value of the nth power of the variable:

\[ M_n \equiv E(x^n) = \int_{-\infty}^{+\infty} x^n p(x)\,dx , \qquad (1) \]

• n=0: Normalization, always M_0 = 1
• n=1: Average. For a Gaussian M_1 = µ
• n=2: Related to the “width” of the distribution; for a Gaussian M_2 = σ² + µ²

Higher order moments give more information about the shape of the pdf. A Gaussian is uniquely defined by (µ, σ): all cumulants of order higher than two vanish.
High order moments
It is practical to remove the mean (µ ≡ M_1) from the data to compute the moments. These are the central moments:

\[ \mu_n = \int_{-\infty}^{+\infty} (x - \mu)^n p(x)\,dx . \qquad (2) \]

Two important high-order moments are used to explore non-gaussianity:

Skewness: (many similar definitions) Third standardised moment: γ_1 ≡ µ_3 / µ_2^{3/2}. Indicates deviation from symmetry about the mean (γ_1 = 0 for a Gaussian distribution).

Kurtosis: γ_2 ≡ (µ_4 / µ_2²) − 3 represents the “degree of peakiness” (γ_2 = 0 for a Gaussian).
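A minimal Python sketch of the sample skewness and kurtosis on simulated data; note that scipy's kurtosis returns the excess kurtosis (the −3 is already applied, so a Gaussian gives ≈ 0):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gaussian = rng.normal(size=100_000)
skewed = rng.exponential(size=100_000)   # strongly asymmetric distribution

for name, sample in [("gaussian", gaussian), ("exponential", skewed)]:
    print(name, stats.skew(sample), stats.kurtosis(sample))
# gaussian: both close to 0; exponential: skewness ~ 2, excess kurtosis ~ 6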
Skewness & Kurtosis
I. Ferreras
A multivariate analysis primer
Page 8
Extending to more variables
We will extend our measurements from one random variable, x, to a set of n variables, written as a column vector¹:

\[ \mathbf{x}^T = (x_1\; x_2\; x_3 \cdots x_n) \qquad (3) \]

We can thus define a mean vector:

\[ \mathbf{m} = E(\mathbf{x}), \qquad (4) \]

the correlation matrix:

\[ r_{ij} = E(x_i x_j) = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} x_i x_j\, p(x_i, x_j)\, dx_i\, dx_j , \qquad (5) \]

and, in vector notation:

\[ \mathbf{R} = E(\mathbf{x}\mathbf{x}^T). \qquad (6) \]

¹ Hence the transpose symbol in the definition.
The covariance
The correlation matrix is the extension of the second order moment to a set of n random variables. Analogously, we can define the central second order moment if we subtract the mean vector. This is the covariance matrix:

\[ \mathbf{C} \equiv E[(\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T], \qquad (7) \]

the covariances being each of the cross terms (off-diagonal) in the matrix:

\[ c_{ij} = E[(x_i - m_i)(x_j - m_j)], \qquad (8) \]

which trivially reduces to the individual variances for the diagonal terms of the covariance matrix. Note that

\[ |c_{ij}| \le \sigma_i \cdot \sigma_j . \qquad (9) \]

The equality holds when x_i and x_j are fully correlated.
Cross-covariance
These definitions can be extended when considering two different random vectors x, y:

Cross-correlation: R_xy = E[x y^T]
Cross-covariance: C_xy = E[(x − m_x)(y − m_y)^T]

Correlations and covariances measure the dependence between the random variables using their second-order statistics.

Examples:
I. Autocorrelation function of galaxies to find clustering according to galaxy types.
II. Cross-correlation of QSOs and galaxies to look for connections between QSO activity and galaxy formation.
Multivariate Normal distribution
It is the multivariate equivalent of the univariate Gaussian distribution, where the mean becomes a vector (µ) and the variance becomes the covariance matrix (Σ):

\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \qquad (10) \]

• Only first and second order statistics are needed
• Linear transformations are Gaussian
• Marginal and conditional densities are Gaussian
• The contours of fixed probability are n-dimensional hyperellipsoids centered at m_x.
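A minimal Python sketch evaluating eq. (10) for an arbitrary mean vector and covariance matrix, both with scipy and with the explicit formula:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, -1.5])

pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

d = x - mu
p = len(mu)
pdf_explicit = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
               np.sqrt((2.0 * np.pi) ** p * np.linalg.det(Sigma))

print(pdf_scipy, pdf_explicit)   # identical up to round-off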
Multivariate Gaussian Density
The covariance matrix is symmetric and positive definite, which means we can find a rotation (defined by an orthogonal matrix E) such that:

\[ \mathbf{C}_x = \mathbf{E}\mathbf{D}\mathbf{E}^T = \sum_{i=1}^{n} \lambda_i\, \mathbf{e}_i \mathbf{e}_i^T \qquad (11) \]

with D being a diagonal matrix whose elements {λ_i} are the variances of the rotated components e_i.

The figure shows the hyperellipsoid corresponding to the equation:

\[ (\mathbf{x} - \mathbf{m}_x)^T \mathbf{C}_x^{-1} (\mathbf{x} - \mathbf{m}_x) = \mathrm{constant} \qquad (12) \]
Quoting uncertainties
For instance, the quoting of uncertainties often reduces to giving some “1σ” level, which – if the pdf of the measurement is Gaussian – tells us that the observation ±1σ error defines an interval within which the true value should lie with 68% probability, or within an interval ±2σ the true value is contained with probability 95%, etc.

Sometimes the pdf is known in detail, and one can quote the non-gaussian confidence levels, or even show a contour map of the PDF for the parameters considered. The figure shows a typical example, with the 68, 90 and 95% confidence levels for the estimate of the age of the stellar populations in a galaxy from its spectroscopic data.
(Ferreras & Yi, 2004, MNRAS, 350, 1322)
Estimation Theory
Normally, we do not have access to the probability density function (pdf). We define a set of “estimators” that allow us to determine the underlying properties of the pdf.

If we take a set of N independent measurements, say the length of a rod, we define a data vector:

\[ \mathbf{x}^T = (x_1\; x_2 \cdots x_N) \qquad (13) \]

Typical estimators for the mean and variance in the p = 1 (univariate) case are:

\[ \hat\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (14) \]

\[ \hat\sigma^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left[ x_i - \hat\mu \right]^2 \qquad (15) \]
Likewise for the p-dimensional multivariate case:

\[ \hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \qquad (16) \]

\[ \hat\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} \left[ \mathbf{x}_i - \hat{\boldsymbol{\mu}} \right]\left[ \mathbf{x}_i - \hat{\boldsymbol{\mu}} \right]^T \qquad (17) \]

The standard notation is to define µ and Σ as the population mean and covariance, and µ̂ and Σ̂ as the sample mean and covariance. Note that sample and population values may differ if the observed data set is biased. Also beware that this result relies on the underlying distribution being Gaussian.

You may also come across the scattering matrix, which is a scaled version of the covariance:

\[ \tilde\Sigma = (N-1)\,\hat\Sigma = \sum_{i=1}^{N} \left[ \mathbf{x}_i - \hat{\boldsymbol{\mu}} \right]\left[ \mathbf{x}_i - \hat{\boldsymbol{\mu}} \right]^T \qquad (18) \]
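A minimal Python sketch of the estimators of eqs. (16)-(18) on simulated data:

import numpy as np

rng = np.random.default_rng(2)
N, p = 500, 3
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=np.diag([1.0, 4.0, 0.25]), size=N)

mu_hat = X.mean(axis=0)                   # eq. (16), sample mean
D = X - mu_hat
Sigma_hat = D.T @ D / (N - 1)             # eq. (17), sample covariance
S_tilde = D.T @ D                         # eq. (18), scattering matrix

print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # matches numpy's estimator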
Multivariate likelihood
In multivariate model fitting, a common approach is to compare a set of N observations (y_i) to model predictions according to a parameter, or a set of parameters (f(x_i; π_j)), with the likelihood:

\[ \mathcal{L}(\pi_j) \propto e^{-\chi^2(\pi_j)/2}, \qquad (19) \]

where χ² is the standard comparison between observations and model, scaled by the uncertainties (σ_i):

\[ \chi^2(\pi_j) \equiv \sum_{i=1}^{N} \left[ \frac{y_i - f(x_i; \pi_j)}{\sigma_i} \right]^2 \qquad (20) \]

A good model fit to the data gives a minimum χ² of order N. If χ²_MIN ≫ N the model does not describe the data well (and any result from this likelihood should be discarded)², whereas χ²_MIN ≪ N implies the uncertainties must have been overestimated.

² or the errors have been underestimated
If we use L(π_j) as a PDF, we can follow a Bayesian approach for the derivation of parameters and their uncertainties.

However, note that this assumption implies that all the N measurements, {y_i}, are uncorrelated, i.e. the “variance” attached to, say, the i-th measurement is only

\[ \sigma_i^2 = \Sigma_{ii} \;\Longrightarrow\; \Sigma_{ij} = 0 \;\;\mathrm{for}\; i \neq j, \qquad (21) \]

where Σ is the N × N covariance matrix of the measurements. For correlated data sets, the definition of χ² is:

\[ \chi^2(\pi_j) \equiv \left( \mathbf{y} - \mathbf{f}(\mathbf{x}, \pi) \right)^T \Sigma^{-1} \left( \mathbf{y} - \mathbf{f}(\mathbf{x}, \pi) \right), \qquad (22) \]

and in this case the confidence levels from the use of L(π) as a PDF will change with respect to the original case (eq. 20), according to the level of correlation.
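A minimal Python sketch of the generalised χ² of eq. (22), using made-up observations, model values and measurement covariance, compared with the uncorrelated version of eq. (20):

import numpy as np

y = np.array([1.2, 2.1, 2.9])           # hypothetical observations
f = np.array([1.0, 2.0, 3.0])           # model predictions for some parameters
Sigma = np.array([[0.04, 0.01, 0.00],   # measurement covariance (correlated errors)
                  [0.01, 0.04, 0.01],
                  [0.00, 0.01, 0.04]])

r = y - f
chi2_corr = r @ np.linalg.solve(Sigma, r)     # eq. (22)
chi2_diag = np.sum(r**2 / np.diag(Sigma))     # eq. (20), ignoring correlations

print(chi2_corr, chi2_diag)   # they differ according to the level of correlation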
Linear Discriminant Analysis
This technique (pioneered by Fisher) rests on the definition of a function
that has to be maximised for an optimal classification of the data points
into classes. The simplest case corresponds to two classes ({c1, c2}) in a 2D
observable space (e.g. we have a set of galaxies for which we measure their
mass and size and we split them according to two morphological types).
The goal is to find a projection on a line such that both classes will be well
separated. We can easily extend this analysis to a higher dimensional space
by using vector notation.
This method needs a “training sample” which we already associate to either of the two classes. A hierarchical method can be built up from this method. We start with n_1 data points for class c_1 and n_2 points for class c_2.

For an arbitrary direction, given by a unit vector v, the projections of the data points are given by v^T x_i. We can define two different means for each class (j = {1, 2}):

\[ \mathbf{m}_j = \frac{1}{n_j} \sum_{i \in c_j} \mathbf{x}_i \qquad\qquad m_j^v = \frac{1}{n_j} \sum_{i \in c_j} \mathbf{v}^T \mathbf{x}_i = \mathbf{v}^T \mathbf{m}_j \]

The first one is a vector quantity, giving the average position of the j-th class. The second one is a scalar quantity, representing the average of the projections on to v. The linear discriminant that we need to maximise is:

\[ J(\mathbf{v}) = \frac{(m_1^v - m_2^v)^2}{s_1^2(\mathbf{v}) + s_2^2(\mathbf{v})} \]

where s_j²(v) is the scatter measured within the projections in the j-th class,
i.e. it is related to the sample variance restricted to the data points in that class:

\[ s_j^2(\mathbf{v}) = \sum_{i \in c_j} \left( \mathbf{v}^T \mathbf{x}_i - m_j^v \right)^2 \]

Note that in this case there is no factor 1/(n_j − 1) in the definition.

We can write the discriminant in vector notation:

\[ J(\mathbf{v}) = \frac{\mathbf{v}^T \tilde\Sigma_B\, \mathbf{v}}{\mathbf{v}^T \tilde\Sigma_W\, \mathbf{v}} \]

where Σ̃_B and Σ̃_W are the scatter matrices between classes, and within classes, respectively:

\[ \tilde\Sigma_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T \]

\[ \tilde\Sigma_W = \tilde\Sigma_1 + \tilde\Sigma_2 = \sum_{i \in c_1} (\mathbf{x}_i - \mathbf{m}_1)(\mathbf{x}_i - \mathbf{m}_1)^T + \sum_{i \in c_2} (\mathbf{x}_i - \mathbf{m}_2)(\mathbf{x}_i - \mathbf{m}_2)^T \]
I leave as an exercise the derivation of the final result: the direction v that maximises the discriminant corresponds to:

\[ \tilde\Sigma_B\, \mathbf{v} = \lambda\, \tilde\Sigma_W\, \mathbf{v} \]

which is a generalized eigenvalue problem. If Σ̃_W has an inverse, we can convert this to an eigenvalue problem:

\[ \tilde\Sigma_W^{-1} \tilde\Sigma_B\, \mathbf{v} = \lambda\, \mathbf{v} \]

Finally, since for any direction v the transformed vector Σ̃_B v is always collinear with (m_1 − m_2), we can solve the eigenvalue equation:

\[ \mathbf{v} = \tilde\Sigma_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2) \]
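A minimal Python sketch of this result on simulated two-class data (the class means and covariances are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
c1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=200)
c2 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], size=200)

m1, m2 = c1.mean(axis=0), c2.mean(axis=0)
S_W = (c1 - m1).T @ (c1 - m1) + (c2 - m2).T @ (c2 - m2)   # within-class scatter

v = np.linalg.solve(S_W, m1 - m2)    # direction maximising J(v)
v /= np.linalg.norm(v)

# the projections of the two classes on to v should be well separated
print((c1 @ v).mean(), (c2 @ v).mean())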
Clustering analysis
One can use the observed data to classify the sources into different sets. These classes are defined from the statistical properties within the whole set, but they may reflect an underlying connection with the physical processes of the systems under study. Clustering analyses can be classified as:
• Supervised/Unsupervised
• Hierarchical/Non-hierarchical
The concept of clustering relies on a definition of a distance in p-dimensional
parameter space. Note these parameters can be comparable (X,Y,Z
distances) or not at all (RA, Dec, redshift, luminosity).
Distance in parameter space
A generalisation of the Euclidean distance is the Minkowski metric. The distance between two p-dimensional points is:

\[ D(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{p} |x_{i,k} - x_{j,k}|^m \right)^{1/m} \qquad (23) \]

The Euclidean case corresponds to m = 2. Other choices are m = 1 (Manhattan distance) or m → ∞ (Chebyshev distance).

When the different parameters have very different ranges, it is often – but not always! – advisable to re-define them, scaling the parameters with respect to their variance (and also offsetting them to have zero mean):

\[ z_{i,j} = \frac{x_{i,j} - \hat{x}_j}{\sqrt{\hat\Sigma_{jj}}} \qquad (24) \]

In addition, those parameters with a large range of variation should be re-scaled by taking the logarithm.
A further approach, taking into account the covariance of the data, leads to the Mahalanobis distance:

\[ D(\mathbf{x}_i, \mathbf{x}_j) = \left[ (\mathbf{x}_i - \mathbf{x}_j)^T \hat\Sigma^{-1} (\mathbf{x}_i - \mathbf{x}_j) \right]^{1/2} \qquad (25) \]

When the dataset is decorrelated (diagonal covariance), this distance reduces to the Euclidean case weighted by the variances.
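A minimal Python sketch of these distances with scipy.spatial.distance, on simulated data with an estimated covariance:

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=1000)
xi, xj = X[0], X[1]

for m in (1, 2, 10):    # m=1 Manhattan, m=2 Euclidean, large m -> Chebyshev
    print("Minkowski, m =", m, distance.minkowski(xi, xj, p=m))

Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
print("Mahalanobis", distance.mahalanobis(xi, xj, Sigma_inv))   # eq. (25)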
Clustering
Once a distance is defined, the clustering proceeds by agglomerative clustering in a hierarchical way, starting with one class per data point. Two nearby data points are merged into one class following some specific threshold (e.g. the closest pair in the whole data set), continuing the process until one class engulfs all data points. The procedure can be visualized as a tree or dendrogram.

However, this process needs to define the distance between a cluster (C, made up of points {p_1, p_2, ..., p_j}) and a new point (p_k):

Friends-of-friends (single linkage): d_Ck = min(d_1k, d_2k, ..., d_jk)
Complete linkage: d_Ck = max(d_1k, d_2k, ..., d_jk)
Average linkage: d_Ck = (1/j) Σ_{i=1}^{j} d_ik
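A minimal Python sketch of agglomerative clustering with scipy on two made-up groups of points, using the linkage rules listed above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(2.0, 0.3, size=(20, 2))])

Z = linkage(X, method="single")    # friends-of-friends; try "complete" or "average"
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 classes
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram described above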
Clustering: k-means
A standard clustering procedure starts with a set of k locations in p-dimensional parameter space, representing the centroids of k classes. Data points are assigned to one of these classes, with the choice driven by a minimization of the sum of the squares of distances among points within the same class. The number of classes k and the seed locations are chosen at startup.
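A minimal Python sketch of k-means with scikit-learn on simulated data; the number of classes k and the seeding strategy are set at startup, as described above:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(3.0, 0.5, size=(100, 2)),
               rng.normal([0.0, 4.0], 0.5, size=(100, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the k centroids
print(km.inertia_)           # within-class sum of squared distances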
The Information Bottleneck (IB)
There is a plethora of multivariate techniques aimed at blind source separation. Among the many, the Information Bottleneck (Slonim et al. 2000) is a good method to illustrate how to progressively build up common classes. Its methodology derives from clustering techniques that minimize a defined Euclidean distance in the n-dimensional parameter space spanned by the data vectors. So, if we have a set of s classes {k_i}_{i=1}^{s} to describe our data sample comprised of N n-dimensional vectors {x_j}_{j=1}^{N}, we can describe the probability of a class k for a given data vector x by the use of Bayes' theorem:

\[ p(k|\mathbf{x}) \propto p(k)\, p(\mathbf{x}|k) = p(k)\, \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} D_E^2(k, \mathbf{x}) \right], \qquad (26) \]

where p(k) is the prior of the class, and D_E is defined as the Euclidean distance between the data vector and the class:

\[ D_E^2(k, \mathbf{x}) = \sum_{s=1}^{n} \left[ k_s - x_s \right]^2 \qquad (27) \]
Defining Classes
The notation will be clearer if we consider a specific example. Let us assume that we have a sample of galaxy spectra. Let us denote by G the set of all galaxies in the sample, and by Λ the set of wavelengths observed in each spectrum. The ensemble can be described by a joint probability p(g, λ) denoting the probability of observing a photon with wavelength λ ∈ Λ from galaxy g ∈ G (it is necessary to normalize all spectra to unity so that they can be considered probability distributions).

We assume a uniform prior on the galaxies: p(g) = 1/N, where N is the total number of galaxies.

The goal of the IB is to construct a set of classes C that preserves the properties of the original sample, with a minimal number of classes and a minimal loss of information. The spectral information of class c ∈ C is therefore:

\[ p(\lambda | c) = \sum_{g} p(\lambda | g)\, p(g | c) \qquad (28) \]
Mutual Information I
Information is often quantified in terms of the entropy of the class. For the class of galaxies:

\[ H(G) = -\sum_{g} p(g) \log p(g) \qquad (29) \]

If we include information about wavelengths, we can define a conditional entropy of the galaxies from the spectra:

\[ H(G|\Lambda) = -\sum_{\lambda} p(\lambda) \sum_{g} p(g|\lambda) \log p(g|\lambda) \qquad (30) \]

The additional knowledge about the wavelength information can only result in less uncertainty in the knowledge of G. We can define the mutual information between G and Λ as:

\[ I(G; \Lambda) \equiv H(G) - H(G|\Lambda) = \sum_{g, \lambda} p(g)\, p(\lambda|g) \log \frac{p(\lambda|g)}{p(\lambda)} \qquad (31) \]
Mutual Information II
Mutual information between two random variables is therefore the amount of uncertainty in one variable that is removed by the knowledge of the other one. In our specific case, we can define the mutual information between the set of galaxies G and the set of classes C as:

\[ I(C; G) = \sum_{c, g} p(g)\, p(c|g) \ln \frac{p(c|g)}{p(c)} \qquad (32) \]

The mutual information is symmetric, non-negative, and zero if and only if both sets are independent.

“No manipulation of the data can increase the amount of mutual information” (data processing inequality theorem). Hence, by grouping galaxies into classes, one can only lose information about the data:

\[ I(C; \Lambda) \le I(G; \Lambda) \qquad (33) \]
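A minimal Python sketch computing the mutual information of eqs. (31)/(32) from a made-up joint probability table:

import numpy as np

# joint probability p(g, lambda) for 3 "galaxies" and 4 "wavelength bins" (sums to 1)
p_joint = np.array([[0.10, 0.05, 0.05, 0.05],
                    [0.05, 0.15, 0.05, 0.05],
                    [0.05, 0.05, 0.20, 0.15]])

p_g = p_joint.sum(axis=1, keepdims=True)   # marginal p(g)
p_l = p_joint.sum(axis=0, keepdims=True)   # marginal p(lambda)

mask = p_joint > 0
mi = np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_g @ p_l)[mask]))
print(mi)   # = H(G) - H(G|Lambda); zero only if the two sets are independent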
The Information Bottleneck
The goal of the IB is then to find a set of classes C that maximize the spectral information I(C; Λ) under a constraint on I(C; G). In essence, we pass the spectral information I(G; Λ) through the bottleneck of the classes, which are forced to extract the relevant information between G and Λ. The optimal classification has to maximise the functional:

\[ \mathcal{L}[p(c|g)] = I(C; \Lambda) - \beta^{-1} I(C; G) \qquad (34) \]

where β^{-1} is the Lagrange multiplier attached to the complexity constraint.

If β → 0 the classification is as non-informative as possible: one class for all galaxies.
If β → ∞ all the information is preserved, with no compression of the data into classes: one class per galaxy.

Varying the constraint allows us to probe the level of compactness of the data into simpler classes.
The Information Bottleneck
The maximisation of the functional in equation 34 gives:

\[ p(c|g) = \frac{p(c)}{Z(g, \beta)} \exp(-\beta D_{KL}) \qquad (35) \]

where Z(g, β) is the partition function and D_KL is the Kullback-Leibler divergence, or cross-entropy, between g and c, defined by:

\[ D_{KL}(g \,\|\, c) = \sum_{\lambda} p(\lambda|g) \ln \frac{p(\lambda|g)}{p(\lambda|c)} \qquad (36) \]

analogous to the result using the Euclidean distance (equation 26).

In practice, the IB method follows a hierarchical approach, starting with C ≡ G and merging two classes in each step, checking that the mutual information I(C; Λ) is maximally preserved. The iterative method stops when a target minimum number of classes (or a mutual information threshold) is reached.
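A minimal Python sketch of the Kullback-Leibler divergence of eq. (36) between two made-up normalised distributions (a "galaxy" spectrum p(λ|g) and a "class" spectrum p(λ|c)):

import numpy as np

p_g = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # p(lambda | g)
p_c = np.array([0.20, 0.20, 0.20, 0.20, 0.20])   # p(lambda | c)

d_kl = np.sum(p_g * np.log(p_g / p_c))
print(d_kl)   # >= 0, and 0 only when the two distributions coincide
# scipy.stats.entropy(p_g, p_c) returns the same number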
The Information Bottleneck
An application of the IB to spectral data from the 2dF galaxy survey
(Slonim et al. 2001, MNRAS, 323, 720). Just five components are needed
to preserve most of the information (crosses in the left-hand panel). Notice
that information from real data (2dF) is harder to “compress” into classes
than mock samples from galaxy formation models.
The Information Bottleneck
(Ferreras 2012, IAUS, 284, 38)
Blind Source Separation
The goal is to separate a set of data into their underlying components.
Example I: Dinner Party Problem
We invite a number of guests to a dinner party. They have N independent
conversations. We put M microphones in the room, which record various
linear superpositions (depending on their location within the room) of
the conversations. Is it possible to disentangle the M recordings into N
conversations?
Example II: The formation history of a galaxy
The spectrum of a galaxy represents a superposition of its stellar populations.
They comprise all stars ever formed or incorporated in the galaxy (of course
excluding remnants). Is it possible to disentangle those populations into a
star formation history?
Blind Source Separation (cont’d)
Example III: Face Recognition
Algorithm to identify/classify faces by decomposing the information from
a large dataset into “sources” that can cleanly discriminate facial features
(no modelling).
Example IV: Response of the brain
In order to understand the processes inside the brain, NMR imaging is often
used on people who are subjected to stimuli. The spatio-temporal output is
fed to some algorithm that separates the output into its key sources, so
that one can relate the input stimuli to the region of the brain that is being
activated.
Example V: Time series analysis
E.g. GRB light curves to be classified without any reference to a model,
simply decomposed into their simpler sources by the statistical properties of
a large sample of GRB data.
Signal Mixture (as a time series)
Let us denote by {x_j(t_k)} the sequence of observables (j = 1 ··· N), measured at a number of times (k = 1 ··· T). The measurement process is simply a MIXTURE of the original variables {y_i(t_k)} (i = 1 ··· N) into the observations:

\[ x_i(t_k) = \sum_{j} w_{ij}\, y_j(t_k) \;\Longrightarrow\; \mathbf{x}(t) = \mathbf{W} \cdot \mathbf{y}(t) \;\left(+\,\mathrm{noise}\right) \qquad (37) \]

The matrix W^{-1} solves the problem. One can consider the statistical properties of the observations in order to find out about the matrix. For instance, one can consider choices of W that produce decorrelated components (Principal Component Analysis) or statistically independent components (Independent Component Analysis), or that reduce the mutual information among classes (Information Bottleneck).
... a tough problem to solve
In a Blind Source Separation problem, we do not have any information about
the mixtures or about the underlying sources. The only data available is
a (hopefully large) set of observations that are known/hoped to originate
from a simple set of sources. We do not even know how many sources are
responsible for the data.
Often, a smaller number of sources can reliably reproduce the observations
(data compression).
Noise will be considered as an extra, additive component, i.e. by solving
the problem one can “denoise” the data.
Uncorrelatedness
Two random vectors x and y are uncorrelated if their cross-covariance matrix is a zero matrix:

\[ \mathbf{C}_{xy} = 0 \;\Longrightarrow\; \mathbf{R}_{xy} = \mathbf{m}_x \mathbf{m}_y^T . \qquad (38) \]

One can also consider the case of uncorrelatedness within the components of a random vector x:

\[ \mathbf{C}_x = \mathbf{D} = \mathrm{diag}(\sigma_{x_1}^2\; \sigma_{x_2}^2 \cdots \sigma_{x_n}^2), \qquad (39) \]

which is the essence of Principal Component Analysis (PCA).

In particular, random vectors having zero mean and unit covariance (up to some constant variance σ²) are said to be white:

\[ \mathbf{m}_x = 0, \qquad \mathbf{R}_x = \mathbf{C}_x = \mathbf{I}. \qquad (40) \]

Exercise: Show that under an orthogonal transformation of an n-dimensional vector, y = Tx with T ∈ SO(n), the transformed vector y remains white.
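A minimal Python sketch of a whitening transformation built from the eigendecomposition of the sample covariance, C_x = E D E^T, on simulated correlated data (arbitrary covariance):

import numpy as np

rng = np.random.default_rng(7)
X = rng.multivariate_normal([1.0, -1.0], [[3.0, 1.2], [1.2, 1.0]], size=5000)

m = X.mean(axis=0)
C = np.cov(X, rowvar=False)
evals, E = np.linalg.eigh(C)            # C = E diag(evals) E^T

Y = (X - m) @ E / np.sqrt(evals)        # whitened data: zero mean, unit covariance
print(np.cov(Y, rowvar=False).round(2))
# A further orthogonal rotation Y @ T.T (T in SO(n)) leaves the covariance equal to I.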
Statistical Independence
We can impose a stronger constraint on the data: two random variables x and y are said to be statistically independent if and only if:

\[ p_{x,y}(x, y) = p_x(x)\, p_y(y) \qquad (41) \]

which implies that for any functions of these variables:

\[ E[g(x)h(y)] = E[g(x)]\, E[h(y)] \qquad (42) \]

If both x and y are Gaussian distributions, uncorrelatedness and statistical independence are the same thing (remember a Gaussian distribution can be fully described by the first and second order moments).

Uncorrelatedness: equality of distributions up to the second order moments.
Independence: equality of distributions for all orders, n = 1, ..., ∞.
Testing for correlation
A simple example that shows us how two variables can be correlated is the following pdf – the 2D version of the previous definition of a multivariate Gaussian (eq. 10):

\[ P(x, y \,|\, \sigma_x, \sigma_y, \rho) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho (x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y} \right] \right\} \qquad (43) \]

The correlation between x and y depends on the parameter ρ, disappearing as ρ → 0. This is the correlation coefficient for two variables, defined as:

\[ \rho = \frac{\mathrm{cov}[x, y]}{\sigma_x \sigma_y} , \qquad (44) \]
Testing for correlation (cont’d)
The figure shows the contours of the bivariate Gaussian pdf for two choices of ρ: a decorrelated case (blue) and a strongly correlated one (red).

A typical estimator of correlation is given by the Pearson product-moment correlation coefficient:

\[ r \equiv \frac{\sum_{i=1}^{N} (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sqrt{\sum_{i=1}^{N} (x_i - \langle x \rangle)^2 \sum_{i=1}^{N} (y_i - \langle y \rangle)^2}} \qquad (45) \]

where ⟨···⟩ denotes the average.
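A minimal Python sketch of the Pearson estimator of eq. (45) on simulated bivariate Gaussian data with a known ρ:

import numpy as np

rng = np.random.default_rng(8)
rho = 0.7
X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=2000)
x, y = X[:, 0], X[:, 1]

r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
        np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
print(r, np.corrcoef(x, y)[0, 1])   # both close to the input rho = 0.7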
Testing for correlation (cont’d)
The contours of the previous figure drop from the maximum (at the origin) by a factor e^{−1/2} at a distance x given by:

\[ \mathbf{x}^T \mathbf{C}^{-1} \mathbf{x} = 1, \qquad (46) \]

where the covariance matrix is:

\[ \mathbf{C} = \begin{pmatrix} \sigma_x^2 & \sigma_x \sigma_y \rho \\ \sigma_x \sigma_y \rho & \sigma_y^2 \end{pmatrix} \qquad (47) \]

We can use the standard estimators for the covariance term:

\[ \mathrm{cov}[x, y] = \sigma_x \sigma_y \rho = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}). \qquad (48) \]
Beware of wrong parameter interpretation!
Anscombe’s quartet shows four sets of data with the same means, regression
coefficients and correlation/covariance.
Principal Component Analysis
Consider a sample of N objects with n parameters measured for each of them. These data can be written as a set of N n-dimensional vectors {x^(k)}_{k=1}^{N}. The aim of PCA is to perform a linear transformation of these vectors (a rotation in n-dimensional space) such that one can define an orthogonal set of n vectors (principal components, {e_i}_{i=1}^{n}) that are decorrelated, and can be used to describe the original set of N vectors. Furthermore, each principal component will have an associated variance, so that we can sort the principal components in decreasing order of the individual variances.

Each of the N original vectors can be described by a set of n numbers (“coordinates”) representing the projections on to each of the principal components. This method also allows us to compress the data (lossy). We can truncate this set of projections into the first m < n components, so that most of the information (in the sense of variance) for each vector is preserved.
PCA – Covariance
The easiest way to deal with PCA is to consider the covariance matrix, which is an n × n real, symmetric matrix:

\[ c_{ij} = \sum_{k=1}^{N} \left( x_i^{(k)} - \langle x_i \rangle \right)\left( x_j^{(k)} - \langle x_j \rangle \right), \qquad 1 \le i, j \le n \qquad (49) \]

One can always diagonalize this matrix:

\[ \mathbf{C}\, \mathbf{e}_i = \lambda_i\, \mathbf{e}_i, \qquad (50) \]

with n eigenvalues {λ_i} and n eigenvectors e_i (the principal components), and reorder them such that λ_1 > λ_2 > ··· > λ_n. In the basis of the eigenvectors C is diagonal, so the principal components are decorrelated (all the off-diagonal terms in their covariance matrix are equal to zero).
PCA – Covariance
The projections of the original data vectors are often given as PC_{i=1,...,n}. For the k-th input data vector we have the following expansion:

\[ \mathrm{PC}_i^{(k)} \equiv \mathbf{x}^{(k)} \cdot \mathbf{e}_i = \sum_{s=1}^{n} x_s^{(k)}\, e_{i,s} \qquad (51) \]

The original vectors are therefore uniquely given by these n “coordinates”:

\[ \mathbf{x}^{(k)} = \sum_{i=1}^{n} \mathrm{PC}_i^{(k)}\, \mathbf{e}_i \qquad (52) \]

The truncation of this series leads to data compression.
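A minimal Python sketch of PCA "by hand" following eqs. (49)-(52): diagonalise the covariance of the mean-subtracted data, sort by eigenvalue, project, and truncate to m < n components for (lossy) compression. The data are simulated with two underlying sources plus a small amount of noise:

import numpy as np

rng = np.random.default_rng(9)
N, n = 1000, 5
A = rng.normal(size=(n, 2))
X = rng.normal(size=(N, 2)) @ A.T + 0.05 * rng.normal(size=(N, n))

Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (N - 1)            # covariance (eq. 49 up to the 1/(N-1) factor)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]    # decreasing variance
evals, evecs = evals[order], evecs[:, order]

PC = Xc @ evecs                    # projections, eq. (51)
m = 2
X_rec = PC[:, :m] @ evecs[:, :m].T + X.mean(axis=0)   # truncated series, eq. (52)

print(np.cumsum(evals) / evals.sum())   # cumulative variance (the scree-plot inset)
print(np.abs(X - X_rec).max())          # small: two components carry most of the variance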
Scree plot
The scree plot is a very useful figure that shows the variance of each
principal component as a function of rank. That allows us to determine
how much information is kept in each component and gives a quantitative
measurement of the information lost if the series is truncated.
This scree plot shows two main trends in the
decay of “information” with the increasing
rank of the principal components. Typically,
the trend after the 7–8th component is
characteristic of noise. Hence, by truncating
the series around those terms, one would
be capable of “de-noising” the data. The
inset shows the cumulative variance: with
8 components we retain about 90% of the
information in the original data set.
An example: PCA on galaxy spectra
(Rogers, IF et al. 2007)
This is an example of PCA applied to a set of ~7,000 spectra of early-type galaxies in the Sloan Digital Sky Survey. After de-reddening and de-redshifting the sample, one can treat each SED as a data vector, compute the covariance matrix, and find the principal components.

The benefit of a BSS approach is that one does not rely on models to extract information from a data set. It is just the information hidden in the data set – in the form of variance – that results in the definition of the principal components. The drawback is that there is no “physics” in the methodology. Even though, in this case, we can see the Balmer series in components 2 and 5, we cannot interpret these spectra as physical ones. Indeed, the enforced orthogonality inherent to PCA introduces spurious non-physical spectral features.
An example: PCA on galaxy spectra
(Rogers, IF et al. 2007)
One can put the physics back into the analysis by comparing the projections of the principal components on to the galaxies (i.e. their coordinates) with physical observables. Here we see a strong trend of some of the components with respect to colour or central velocity dispersion.
An example: PCA on galaxy spectra
(Rogers, IF et al. 2007)
We can then project synthetic models of population synthesis – of known
age and metallicity – to quantify the way PCA has disentangled in part the
inherent degeneracies.
An example: PCA on galaxy spectra
(IF et al. 2010)
Once we identify the physical meaning of the PCA-related projections, one can use those as a way of describing the essential information in the galaxy spectra. This figure shows how the PCA information (given here by a combination of the projections of the first two principal components) can discriminate between the effects of intrinsic galaxy properties – such as central velocity dispersion – and environment effects – described here by the mass of the host halo.
The complexity of galaxies
If we consider a set of observables of galaxies like size, colour, luminosity, etc., one finds a very “compressible” distribution. Here, a sample of HI-detected galaxies is analyzed with PCA, to show that one independent parameter may be enough to explain their properties (Disney et al. 2008).
PCA: characterization of the PSF
PCA can be used to represent in a few numbers the point spread function of a camera. The figure illustrates the case for the Advanced Camera for Surveys (HST/ACS, Jee et al. 2007). The top panel shows an observed PSF through the F814W passband (a), and reconstructions using wavelets (b, 150 basis functions), shapelets (c, 78 functions) and PCA (d, 20 components, extracted from 800 stellar images). The plot compares these profiles, showing the advantage of PCA, which just uses the variance in the data set as a way to determine the optimal basis functions. The other methods rely on the definition of the basis functions to optimally match the PSF.
Face Recognition
Treating images as data vectors, we can use the covariance matrix
of a set of pictures of faces to decompose the information into principal
components. We can then describe an arbitrary face by a number of
projections on to the most significant “eigenfaces”.
Other image recognition problems
Similarly, one can use PCA to determine the illumination or the orientation of simple figures. This can help towards the general problem of computer-based visual recognition. It is also used in video surveillance work, separating the interesting data from the background.
Drawbacks of PCA
• Linear
• Enforced orthogonality of principal components
• Non-physical sources
• Highly sensitive to outliers: Robust PCA requires a way of “clipping”
outliers from the original data set.
• “Attention deficit”: prone to pick up consistent instrumental/data-reduction residuals.
PCA: removal of systematic signals
The last point in the list of drawbacks can actually be a strength of PCA when applied to the filtering of residual effects. In this case, Hewett & Wild (2005) use PCA to remove small – but noticeable – night sky emission from SDSS spectra.
Factor Analysis (FA)
An alternative methodology to solve the blind source separation problem is to assume a set of m latent variables ({f_i}), such that the p observed data ({y_j}, p > m) correspond to linear superpositions of these variables plus noise ({ε_j}):

\[ \mathbf{y} = \boldsymbol{\mu} + \mathbf{W} \cdot \mathbf{f} + \boldsymbol{\epsilon} \qquad (53) \]

Here, µ is the mean of the data. In FA jargon, the p × m mixing matrix (W) is called the loadings of the latent variables. There are a number of assumptions about the data: the uncertainties have zero mean and are uncorrelated; there is no cross-covariance between the factors and the uncertainties; and cov(f) = 1_{m×m}.

Note the difference between PCA and FA:
• PCA gives the principal components as linear superpositions of the original data. FA uses latent variables.
• PCA aims at sorting the data with respect to the variance of the observations. FA exploits the covariances among subsets.
After a few steps, we find that the covariance of the data, Σ = cov(y) ≡ E[(y − µ)(y − µ)^T], can be written:

\[ \Sigma = \mathbf{W}\mathbf{W}^T + \Psi , \]

where Ψ is the (diagonal) covariance matrix of the uncertainties.

There are several ways to solve this:

1. Principal component method: (Note PC/FA ≠ PCA.) Here we neglect the covariance of the uncertainty, and write:

\[ \Sigma = \mathbf{C}\mathbf{D}\mathbf{C}^T = (\mathbf{C}\mathbf{D}^{1/2})(\mathbf{C}\mathbf{D}^{1/2})^T \]

where D is a diagonal matrix. We can take the square root as we are dealing with a covariance (i.e. non-negative eigenvalues). Note CD^{1/2} is a p × p matrix. The trick is now to select only a few of the top eigenvalues (m < p), creating the eigenvector matrix (C_1)_{p×m} and the eigenvalue diagonal matrix (D_1)_{m×m}, such that:

\[ \mathbf{W} = (\mathbf{C}_1 \mathbf{D}_1^{1/2})_{p \times m} \]
2. Principal factor method: The uncertainty matrix is included. The method is equivalent to PC/FA where the covariance is replaced by Σ − Ψ (remember the covariance of the uncertainties is diagonal). A typical assumption for the diagonal elements of this matrix is:

\[ (\Sigma - \Psi)_{ii} = (\Sigma)_{ii} - \frac{1}{(\Sigma^{-1})_{ii}} \]

Similarly to the previous case, we diagonalise this matrix and restrict the analysis to the highest m eigenvalues, obtaining:

\[ \mathbf{W} = \mathbf{C}_1 \mathbf{D}_1^{1/2} \]

This method can be iterated, substituting the values of (WW^T)_{ii} back into the diagonal elements of Σ − Ψ.
Note that the decomposition into factors is not unique. A rotation, i.e. a transformation via an orthogonal matrix (OO^T = 1), produces the same result. Therefore, the last, and important, step in FA is to rotate the mixing matrix (W) until the loadings fall on fewer latent variables (rather than being all spread out).
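A minimal Python sketch of factor analysis with scikit-learn's FactorAnalysis on simulated data with two latent factors; it estimates the loadings W and the diagonal noise variances Ψ in Σ = W Wᵀ + Ψ (the varimax rotation option requires a recent scikit-learn version):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(10)
N, p, m = 2000, 6, 2
W_true = rng.normal(size=(p, m))
F = rng.normal(size=(N, m))                        # latent factors, cov(f) = I
Y = F @ W_true.T + 0.3 * rng.normal(size=(N, p))   # y = W f + noise (zero mean)

fa = FactorAnalysis(n_components=m, rotation="varimax").fit(Y)
print(fa.components_.shape)   # (m, p): the estimated (rotated) loadings
print(fa.noise_variance_)     # diagonal of Psi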
Independent Component Analysis (ICA)
ICA can be considered as an extension of PCA to arbitrary moments of
the probability distribution. With PCA, we simply decorrelate the data –
hence stopping at the covariance, i.e. the second order moment. With ICA
we require a separation of the data vectors into sources that are not only
decorrelated but statistically independent.
While PCA has a clean method to proceed: “diagonalise the covariance
matrix and project the data vectors on to the eigenvectors in decreasing
order of the eigenvalues”, ICA is not uniquely defined, and many techniques
have been developed to achieve the extraction of statistically independent
components. We will give a few conceptual ideas below. For more details
check out specific packages for the implementation of ICA (e.g. FastICA³).

³ http://scikit-learn.org/stable/modules/decomposition.html#ica
Non-Gaussianity
The central tenet of blind source separation is that the observed data vectors are a mixture of the source signals plus some noise:

\[ \mathbf{x} = \mathbf{A}\mathbf{s} + \mathbf{n} \qquad (54) \]

such that the sources s are statistically independent. But remember: neither the mixing matrix nor the sources are known.

One way of proceeding makes use of the Central Limit Theorem:

If a set of signals s = (s_1 s_2 ··· s_N) are independent, with means (µ_1 µ_2 ··· µ_N) and variances (σ_1² σ_2² ··· σ_N²), then the signal defined as x ≡ Σ_{i=1}^{N} s_i has a probability density function that approaches (as N → ∞) a Gaussian distribution with mean Σ_i µ_i and variance Σ_i σ_i².
Non-Gaussianity: an example
Consider the speech signal on the left. It is a leptokurtic (or super-gaussian) distribution – positive kurtosis. The middle panel shows a sawtooth signal, clearly platykurtic (sub-gaussian, negative kurtosis). A mixture of both (let's just take the sum, rightmost panels) is a signal closer to a Gaussian.
(From “Independent Component Analysis”, Stone.)
Non-Gaussianity (Projection Pursuit)
This means that any mixture of independent (non-Gaussian) signals will
appear more Gaussian than the original ones. Hence, one can search for
possible decompositions of the original data vectors into those with the
highest non-gaussianities.
The downside is that ICA will only be capable of decomposing a set of signals into a number of non-gaussian sources plus a single gaussian signal, which cannot be decomposed any further.

This example shows how to separate the first two principal components out of a PCA test into two more independent sources, by maximizing the non-gaussianity, measured here as kurtosis (contour line) (Ferreras 2012, IAUS, 284, 38).
A pictorial version of ICA
This is a very simple representation of ICA, where two independent signals (left) are mixed into two observed datasets (middle). By whitening the data (i.e. decorrelating and scaling such that cov(y) = 1), we see that the final step is to “rotate” the axes so that each signal returns to a set of independent components.
(from Hyvärinen et al. 2001)
Negentropy
Kurtosis is the simplest indicator of non-gaussianity, but it is strongly affected by outliers. Other, more robust, indicators are used in ICA, for instance negentropy, which is the extra information (entropy) between the observed dataset and the corresponding Gaussian one that has the same covariance:

\[ J(\mathbf{y}) \equiv H(\mathbf{y}_{\rm gauss}) - H(\mathbf{y}), \]

where H(y) = −E[ln p(y)] is the entropy. The trick is to use some function g(y) to avoid the dependence on outliers.

One of the methods that follow this approach is FastICA, consisting of a fixed-point (à la Newton-Raphson) method. An approximation is made to describe negentropy. The first approach would involve high-order moments:

\[ J(y) \approx \frac{1}{12} E(y^3)^2 + \frac{1}{48} \left[ \mathrm{kurt}(y) \right]^2 \]

However, this method is not robust against outliers. One can go for non-polynomial expressions, finding:

\[ J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2 , \]
where the data (y) have zero mean and unit variance, and ν is a random variable from a Gaussian distribution, also with zero mean and unit variance. Functions G(y) with a slower growth than y³ will be less sensitive to outliers, and typical cases are:

\[ G(y) = e^{-y^2/2} \]

FastICA is a fixed-point method (similar to the Newton-Raphson algorithm to find the roots of a function) that maximises J(y) by an iterative optimization of a projection vector (equivalent to transforming the mixing matrix).
(from scikit-learn.org)
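A minimal Python sketch of FastICA with scikit-learn on the two-signal mixture discussed earlier: a super-gaussian and a sub-gaussian source mixed with an arbitrary matrix A (eq. 54), then recovered up to order, sign and scale:

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(11)
t = np.linspace(0.0, 8.0, 2000)
s1 = rng.laplace(size=t.size)    # speech-like, super-gaussian (leptokurtic)
s2 = 2.0 * (t % 1.0) - 1.0       # sawtooth, sub-gaussian (platykurtic)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])       # mixing matrix (unknown in a real application)
X = S @ A.T                      # observed mixtures, x = A s

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # estimated independent sources

# cross-correlation between estimated and true sources: ~ a (signed) permutation
print(np.abs(np.corrcoef(S_est.T, S.T))[0:2, 2:4].round(2))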
Infomax
Another way of extracting statistically independent sources is by the use of the entropy (i.e. “the level of surprise”):

I. A set of signals with a uniform joint pdf has maximum joint entropy.
II. A set of signals that have maximum joint entropy are mutually independent.
III. Any invertible function of independent signals yields signals that are also mutually independent.

The last point will be useful if we consider that for any pdf p(y), the cumulative density function

\[ g(Y) \equiv \int_{-\infty}^{Y} p(y)\,dy \qquad (55) \]

has a maximum entropy pdf.
Infomax (cont’d)
An example of two source signals (s, leftmost panels) mixed (x = As), and
separated via infomax (y = Wx). The rightmost panels correspond to the
cumulative distribution (Y = g(y)) when optimized.
(from “Independent Component Analysis”, Stone)
Many more methods ...
This has been a brief introduction. There are many methods to extract
information from multivariate data, including the vast realm of machine
learning algorithms. Some interesting advanced topics are:
• Non-negative matrix factorization
• Support Vector Machines
• Artificial Neural Networks
• Gaussian Processes
Further Reading
• Methods of multivariate analysis, Rencher & Christensen, 2012, Wiley
• Independent Component Analysis, Hyvärinen, Karhunen & Oja, 2001,
Wiley
• Independent Component Analysis: A tutorial introduction, Stone, 2004,
MIT Press
• Modern Statistical Methods for Astronomy, Feigelson & Babu, 2012,
Cambridge
• Practical statistics for astronomers, Wall & Jenkins, 2003, Cambridge