Dimension reduction: modern aspects
Nickolay T. Trendafilov
Department of Mathematics and Statistics, The Open University, UK
Department of Mathematics, Linköping University, May 16, 2013, Linköping, Sweden
Contents
1  Principal component analysis (PCA)
     Basic facts and definitions
     The interpretation and related difficulties
     Example: the Pitprop data
   Sparse PCA
     Genesis and the way forward
     Function-constrained sparse components
2  Exploratory factor analysis (EFA)
     Intro/Motivation: comparison with PCA
     The classical EFA (n > p)
     Generalized EFA (GEFA) for n > p & p ≥ n
     Numerical illustrations with GEFA
Principal component analysis (PCA)
Basic facts and definitions
Origins and background
PCA was first defined in the form used nowadays
by Pearson (1901). He found the best-fitting line in the
least squares sense to the data points, which is known
today as the first principal component.
Hotelling (1933) showed that the loadings for the
components are the eigenvectors of the sample covariance
matrix.
Note that PCA as a scientific tool appeared first in
Psychometrika. The quantitative measurement of human
intelligence and personality in the 1930s was arguably as
challenging and important then as the contemporary interest
in gene and tissue engineering, signal processing, or the
analysis of huge climate, financial or internet data is today.
Principal component analysis (PCA)
Basic facts and definitions
General
Main goal: analyzing high-dimensional multivariate data
Main tools:
  singular value decomposition (SVD) of the (raw) data matrix,
  eigenvalue decomposition (EVD) of the correlation matrix
Main problems:
  PCA might be too slow for very large data
  the results involve all input variables, which complicates the
  interpretation
Principal component analysis (PCA)
Basic facts and definitions
Definition 1: PCA of data matrix
Let X be an n × p (n > p) data matrix containing
measurements of p random variables on n subjects
(individuals);
Let X be centered, i.e. X^T 1_n = 0_p, and let the columns
(variables) have unit length, i.e. diag(X^T X) = 1_p.
Denote the SVD of X by X = F D A^T, where
  1. F is n × n orthogonal,
  2. A is p × p orthogonal, and
  3. D is n × p diagonal, whose elements are called the singular
     values of X and are in decreasing order.
Principal component analysis (PCA)
Basic facts and definitions
PCA dimension reduction: truncated SVD
For any r (≤ p), let X_r = F_r D_r A_r^T be the truncated SVD
of X, where F_r and A_r denote the first r columns of F and A
respectively (F_r^T F_r = A_r^T A_r = I_r), and D_r is the diagonal
matrix with the first r singular values of X:
  1. Y_r = F_r D_r contains the component scores, also known
     as principal components (PCs),
  2. A_r contains the component loadings, and
  3. D_r^2 contains the variances of the first r PCs.
It is well known that X_r gives the best least squares (LS)
approximation of rank r to X.
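As a rough illustration of Definition 1 and the truncated SVD, the following numpy sketch (the function and variable names are mine, not from the talk) centres and scales a data matrix, computes the SVD and forms the first r scores, loadings and variances.

    import numpy as np

    def pca_svd(X, r):
        # centre the columns (X^T 1_n = 0_p) and scale them to unit length
        X = X - X.mean(axis=0)
        X = X / np.linalg.norm(X, axis=0)
        F, d, At = np.linalg.svd(X, full_matrices=False)   # X = F D A^T
        Fr, dr, Ar = F[:, :r], d[:r], At[:r, :].T
        Yr = Fr * dr            # component scores Y_r = F_r D_r
        Xr = Yr @ Ar.T          # best rank-r LS approximation of X
        return Yr, Ar, dr**2, Xr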
Principal component analysis (PCA)
Basic facts and definitions
...more on truncated SVD
A right multiplication of X = F D A^T by A_r gives:

    X A_r = F D [ I_r ; 0_(p−r)×r ] = F [ D_r ; 0_(n−r)×r ] = F_r D_r = Y_r ,   (1)

i.e. the principal components Y_r are linear combinations of
the original variables X with coefficients A_r;
another right multiplication of (1) by A_r^T gives

    X A_r A_r^T = Y_r A_r^T = F_r D_r A_r^T = X_r ,

i.e. the best LS approximation X_r to X of rank r is given
by the orthogonal projection of X onto the r-dimensional
subspace of R^p spanned by the columns of A_r.
Principal component analysis (PCA)
Basic facts and definitions
PCA optimality
The variances of the first r PCs are given by

    Y_r^T Y_r = A_r^T X_r^T X_r A_r = D_r F_r^T F_r D_r = D_r^2 .

This also shows that the principal components are
uncorrelated, as Y_r^T Y_r is diagonal.
Then, the total variance explained by the first r PCs is
given by trace(D_r^2) = trace(A_r^T X_r^T X_r A_r).
It is important to stress that the PCs are the only linear
combinations of the original variables which are uncorrelated
and for which the matrix A_r of their coefficients is orthonormal.
Principal component analysis (PCA)
Basic facts and definitions
Definition 2: PCA of correlation matrix
Let R be a p × p sample correlation matrix.
Denote the EVD of R by R = A D^2 A^T, where
  1. A is p × p orthogonal, and
  2. D^2 is p × p diagonal, with elements in decreasing order.
Let R_r = A_r D_r^2 A_r^T be the truncated EVD of R (r ≤ p):
  A_r contains the first r eigenvectors of R and is the
  matrix of the component loadings of the first r PCs,
  D_r^2 contains the first r eigenvalues of R, which are the
  variances of the first r PCs,
  R_r gives the best LS approximation of rank r to R.
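The same reduction via Definition 2, sketched with numpy's eigendecomposition (again only an illustrative sketch; R can be taken as the correlation matrix X^T X of the standardized data above):

    import numpy as np

    def pca_evd(R, r):
        evals, evecs = np.linalg.eigh(R)      # eigenvalues in ascending order
        order = np.argsort(evals)[::-1]       # reorder to decreasing variances
        D2r = evals[order][:r]                # variances of the first r PCs
        Ar = evecs[:, order][:, :r]           # component loadings
        Rr = Ar @ np.diag(D2r) @ Ar.T         # best rank-r LS approximation of R
        return Ar, D2r, Rr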
Principal component analysis (PCA)
The interpretation and related difficulties
PCA interpretation: the classic case
PCA is interpreted by considering the magnitudes of
the component loadings, which indicate how strongly
each of the original variables contributes to the PC.
Even if a reduced dimension of r PCs is considered for
further analysis, each PC is still a linear combination of
all original variables. This complicates the interpretation
of the PCs, especially when p is large.
Which loadings are easily interpretable?
Principal component analysis (PCA)
The interpretation and related difficulties
The Thurstone’s simple structure concept...
(Thurstone, 1947, p. 335)
1. Each row of the factor matrix should have at least one
   zero.
2. If there are r common factors, each column of the factor
   matrix should have at least r zeros.
3. For every pair of columns of the factor matrix there
   should be several variables whose entries vanish in one
   column but not in the other.
4. For every pair of columns of the factor matrix, a large
   proportion of the variables should have vanishing entries
   in both columns when there are four or more factors.
5. For every pair of columns of the factor matrix there
   should be only a small number of variables with
   non-vanishing entries in both columns.
Principal component analysis (PCA)
The interpretation and related difficulties
PCA interpretation via simple structure rotation
The oldest approach to the PCA interpretation
problem is simple structure rotation, which is based
on the following simple identity:

    X = F D A^T = F Q Q^{-1} D A^T ,

where Q can be any non-singular transformation matrix.
Two types of matrices Q are used in PCA and FA:
orthogonal and oblique.
B = A D Q^{-T} is the matrix of loadings rotated by Q.
For orthogonal Q it reduces to B = A D Q. (As an
orthogonal matrix Q can be viewed as a rotation, these
methods are known as rotation methods.)
Principal component analysis (PCA)
The interpretation and related difficulties
Doing simple structure rotation in PCA (and FA)
Steps:
  1. Low-dimensional data approximation, ...
  2. followed by rotation of the PC loadings.
  3. The rotation is found by optimizing a certain criterion
     which defines/formalizes the perception of simple
     (interpretable) structure.
Example: the VARIMAX criterion maximizes the variance
of the squared rotated loadings B, i.e. max_Q V, where

    V = Σ_{j=1}^r V_j   and   V_j = Σ_{i=1}^p b_ij^4 − (1/p) ( Σ_{i=1}^p b_ij^2 )^2 .
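For concreteness, the VARIMAX criterion can be evaluated for a given matrix of rotated loadings B = ADQ with a few lines of numpy (a sketch only; the actual maximization over orthogonal Q is what the rotation routine does):

    import numpy as np

    def varimax_criterion(B):
        # V = sum_j [ sum_i b_ij^4 - (1/p) (sum_i b_ij^2)^2 ]
        B2 = B**2
        p = B.shape[0]
        return np.sum(np.sum(B2**2, axis=0) - np.sum(B2, axis=0)**2 / p)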
Drawbacks of the rotated components:
  the loadings are often still difficult to interpret
  the components are correlated and no longer explain a
  decreasing amount of variance.
Principal component analysis (PCA)
The interpretation and related difficulties
PCA interpretation via rotation methods
Traditionally, PCs are considered easily interpretable if
there are plenty of small component loadings indicating
the negligible importance of the corresponding variables.
Jolliffe, 2002, p. 269
The most common way of doing this is to ignore (effectively
set to zero) coefficients whose absolute values fall below some
threshold.
Thus, implicitly, the PCs' simplicity and interpretability are
associated with the sparseness of the component loadings.
Principal component analysis (PCA)
Example: the Pitprop data
The Pitprop data consist of 14 variables which were measured
for each of 180 pitprops cut from Corsican pine timber. One
variable is compressive strength, and the other 13 variables are
physical measurements on the pitprops (Jeffers, 1967).
Table: Jeffers’s Pitprop data: Loadings of the first six PCs and
their interpretation by normalizing each column, and then, taking
loadings greater than .7 only (Jeffers, 1967)
Vars        Component loadings (AD)
              1     2     3     4     5     6
topdiam     .83   .34  -.28  -.10   .08   .11
length      .83   .29  -.32  -.11   .11   .15
moist       .26   .83   .19   .08  -.33  -.25
testsg      .36   .70   .48   .06  -.34  -.05
ovensg      .12  -.26   .66   .05  -.17   .56
ringtop     .58  -.02   .65  -.07   .30   .05
ringbut     .82  -.29   .35  -.07   .21   .00
bowmax      .60  -.29  -.33   .30  -.18  -.05
bowdist     .73   .03  -.28   .10   .10   .03
whorls      .78  -.38  -.16  -.22  -.15  -.16
clear      -.02   .32  -.10   .85   .33   .16
knots      -.24   .53   .13  -.32   .57  -.15
diaknot    -.23   .48  -.45  -.32  -.08   .57

Vars        Jeffers's interpretation (normalized loadings greater than .7)
              1     2     3     4     5     6
topdiam     1.0
length      1.0
moist             1.0
testsg            .84   .73
ovensg                  1.0                    1.0
ringtop     .70         .99
ringbut     .99
bowmax      .72
bowdist     .88
whorls      .93
clear                        1.0
knots                              1.0
diaknot                                  1.0
Principal component analysis (PCA)
Example: the Pitprop data
Table: Jeffers’s Pitprop data: Rotated loadings by varimax and
their interpretation by normalizing each column, and then, taking
loadings greater than .59 only
Vars        VARIMAX loadings
              1     2     3     4     5     6
topdiam     .91   .26  -.01   .03   .01   .08
length      .94   .19  -.00   .03   .00   .10
moist       .13   .96  -.14   .08   .08   .04
testsg      .13   .95   .24   .03   .06  -.03
ovensg     -.14   .03   .90  -.03  -.18  -.03
ringtop     .36   .19   .61  -.03   .28  -.49
ringbut     .62  -.02   .47  -.13  -.01  -.55
bowmax      .54  -.10  -.10   .11  -.56  -.23
bowdist     .77   .03  -.03   .12  -.16  -.12
whorls      .68  -.10   .02  -.40  -.35  -.34
clear       .03   .08  -.04   .97   .00  -.00
knots      -.06   .14  -.14   .04   .87   .09
diaknot     .10   .04  -.07  -.01   .15   .93

Vars        Normalized loadings greater than .59
              1     2     3     4     5     6
topdiam     .97
length      1.0
moist             1.0
testsg            .98
ovensg                  1.0
ringtop                 .68
ringbut     .66
bowmax                             -.64
bowdist     .82
whorls      .73
clear                        1.0
knots                               1.0
diaknot                                   1.0
Sparse PCA
Genesis and the way forward
Abandoning the rotation methods
Ignoring the small loadings is subjective and misleading,
especially for PCs from a covariance matrix (Cadima &
Jolliffe, 1995).
Cadima & Jolliffe, 1995
One of the reasons for this is that it is not just loadings but
also the size (standard deviation) of each variable which
determines the importance of that variable in the linear
combination.
Therefore it may be desirable to put more emphasis on
simplicity than on variance maximization.
Sparse PCA
Genesis and the way forward
Alternative to rotation: modify PCA to produce
explicitly simple principal components
The first method to directly construct sparse components
was proposed by Hausman (1982): it finds PC loadings
from a prescribed subset of values, say S = {−1, 0, 1}.
Jolliffe & Uddin (2000) were the first to modify the
original PCs to additionally satisfy the Varimax criterion
(simplified component technique, SCoT), while still explaining
successively decreasing portions of the variance:

    max_{a^T a = 1}  a^T R a + τ V(a) ,

where τ is a tuning parameter controlling the trade-off
between a^T R a and V(a).
Sparse PCA
Genesis and the way forward
Genuine sparse component analysis
Jolliffe, Trendafilov & Uddin (2003) were the first to modify
the original PCs to additionally satisfy the LASSO constraint,
which drives many loadings to exact zeros (SCoTLASS):

    max  a^T R a ,
    subject to  ||a||_2 = 1,  ||a||_1 ≤ τ  and  a ⊥ {a_1, a_2, ..., a_{i−1}} ,

where a_1, ..., a_{i−1} are the already found vectors of sparse
component loadings. The requirement ||a||_1 ≤ τ is known as
the LASSO constraint.
Sparse PCA
Genesis and the way forward
SCoTLASS solution
1. SCoTLASS is first reformulated as:

     max  F_µ(a) = a^T R a − µ P(||a||_1 − τ) ,
     subject to  ||a||_2 = 1  and  a ⊥ {a_1, a_2, ..., a_{i−1}} ,

   where P(x) is a penalty term, e.g. P(x) = max(x, 0).
2. Adopt ||a||_1 = a^T sign(a) ≈ a^T tanh(1000a) and solve:
   (i) for orthonormal loadings and correlated components

     da_i/dt = (I_p − A_i A_i^T) ∇F_µ(a_i) ,   i = 1, 2, ..., r ,

   where A_i = [a_1, a_2, ..., a_i] and A_i^T A_i = I_i, and
   (ii) for oblique loadings and uncorrelated components:

     da_i/dt = ( I_p − a_1 a_1^T R / (a_1^T R a_1) − ... − a_i a_i^T R / (a_i^T R a_i) ) ∇F_µ(a_i) .
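A small numpy sketch of the smoothed objective F_µ and its gradient, using the tanh approximation of ||a||_1 quoted above (the smoothing constant 1000 comes from the slide; µ and τ are user-chosen; the projection onto the sphere/orthogonality constraints used in the ODEs is not shown here):

    import numpy as np

    def scotlass_objective(a, R, mu, tau, gamma=1000.0):
        t = np.tanh(gamma * a)
        l1 = a @ t                                # ||a||_1 approx a^T tanh(gamma a)
        value = a @ R @ a - mu * max(l1 - tau, 0.0)
        grad_l1 = t + gamma * a * (1.0 - t**2)    # gradient of a^T tanh(gamma a)
        grad = 2.0 * R @ a - mu * (l1 > tau) * grad_l1
        return value, grad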
Sparse PCA
Genesis and the way forward
Efficient numerical procedures for sparse PCA
Zou, Hastie & Tibshirani (2006) transform the standard PCA
into a regression form to propose a fast algorithm (SPCA) for
sparse PCA, applicable to large data:

    min_{a,b}  ||X − X a b^T||_F^2 + λ_1 ||a||_2^2 + λ_2 ||a||_1 ,
    subject to  ||b||_2 = 1 ,

where b ∈ R^p is an auxiliary vector.
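The following is a deliberately simplified single-component sketch of this regression idea, not the exact SPCA algorithm of Zou, Hastie & Tibshirani: b is updated exactly under ||b||_2 = 1, while a takes a proximal-gradient (soft-thresholding) step on the elastic-net-penalized part.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def spca_one_component(X, lam1, lam2, n_iter=500):
        # assumes lam2 is small enough that a does not collapse to zero
        G = X.T @ X
        step = 1.0 / (2.0 * (np.linalg.eigvalsh(G)[-1] + lam1))
        a = G[:, 0] / np.linalg.norm(G[:, 0])        # crude starting value
        b = a.copy()
        for _ in range(n_iter):
            b = G @ a
            b /= np.linalg.norm(b)                   # update of b with ||b||_2 = 1
            grad = 2.0 * (G @ a + lam1 * a - G @ b)  # gradient of the smooth part
            a = soft_threshold(a - step * grad, step * lam2)  # LASSO (l1) step
        return a, b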
During the last ten years, since the first paper on sparse
PCA appeared, an enormous number of works on the topic
has been published.
Sparse PCA
Genesis and the way forward
A major weakness of the existing methods is that:
  they produce sparse loadings that are not completely
  orthonormal, and
  the corresponding sparse components are correlated.
Only SCoTLASS (and a recently proposed method) is
capable of producing either orthonormal loadings or
uncorrelated sparse components.
In the following, new definitions of sparse PCA will be
considered, some of which can result simultaneously in nearly
orthonormal loadings and nearly uncorrelated sparse
components.
Sparse PCA
Function-constrained sparse components
Types of problems seeking sparse minimizers x
of f(x) through the ℓ1 norm:
Wright, S. (2011). Gradient algorithms for regularized
optimization, SPARS11, Edinburgh, Scotland,
http://pages.cs.wisc.edu/˜swright
Weighted form: min f(x) + τ ||x||_1, for some τ > 0;
ℓ1-constrained form (variable selection): min f(x) subject
to ||x||_1 ≤ τ;
Function-constrained form: min ||x||_1 subject to f(x) ≤ f̄.
Sparse PCA
Function-constrained sparse components
...and restated accordingly for sparse PCA:
For a given p × p correlation matrix R, find a vector of loadings
a (||a||_2 = 1) by solving one of the following:
Weighted form: max a^T R a + τ ||a||_1, for some τ > 0.
ℓ1-constrained form (variable selection): max a^T R a
subject to ||a||_1 ≤ τ, τ ∈ [1, √p].
Function-constrained form: min ||a||_1 subject to
a^T R a ≤ λ, where λ is an eigenvalue of R.
Sparse PCA
Function-constrained sparse components
Weakly correlated sparse components (WCPC)
The following version of sparse PCA:

    min_{A^T A = I_r}  ||A||_1 + µ ||A^T R A − D^2||_F ,

seeks sparse loadings A which additionally diagonalize
R, i.e. they are supposed to produce sparse components which
are as weakly correlated as possible. D^2 is the diagonal matrix of
the original PCs' variances.
Sparse PCA
Function-constrained sparse components
Sparse components approximating the PCs'
variances (VarFit)
The next version of sparse PCA is:

    min_{A^T A = I_r}  ||A||_1 + µ ||diag(A^T R A) − D^2||_F ,

in which the variances of the sparse components should fit
the initial variances D^2 better, without paying attention to the
off-diagonal elements of R. As a result, A^T R A is expected to
be less similar to a diagonal matrix than in WCPC, and the
resulting sparse components to be more correlated.
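A short sketch of the two penalized objectives, WCPC and VarFit, written as plain functions of the loading matrix A (here D2 is assumed to be the vector of the original PCs' variances and mu the trade-off parameter; the minimization over A^T A = I_r, e.g. with a manifold-optimization routine, is not shown):

    import numpy as np

    def wcpc_objective(A, R, D2, mu):
        C = A.T @ R @ A
        return np.abs(A).sum() + mu * np.linalg.norm(C - np.diag(D2), 'fro')

    def varfit_objective(A, R, D2, mu):
        v = np.diag(A.T @ R @ A)
        return np.abs(A).sum() + mu * np.linalg.norm(v - D2)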
Sparse PCA
Function-constrained sparse components
Table: Function-constrained sparse loadings, Jeffers’s Pitprop data
(The individual sparse loadings cannot be re-positioned unambiguously from the
extracted slide and are not reproduced; the rows are the 13 pitprop variables of
the previous tables. WCPC yields 45 zero loadings out of 78, VarFit 63 out of 78.)

WCPC (45/78 zero loadings)
  Component      1     2     3     4     5     6
  %Var         29.5  12.7  12.2   7.7   6.6   6.1
  %Cvar        29.5  42.2  54.4  62.1  68.7  74.8
  %Cvaradj     29.5  42.1  54.2  61.8  68.0  73.8

  Correlations among the sparse components:
    1.00   .09  -.07  -.01   .06   .07
     .09  1.00  -.02  -.08   .06  -.03
    -.07  -.02  1.00  -.09   .23   .17
    -.01  -.08  -.09  1.00   .01   .07
     .06   .06   .23   .01  1.00   .00
     .07  -.03   .17   .07   .00  1.00

VarFit (63/78 zero loadings)
  Component      1     2     3     4     5     6
  %Var         27.8  14.8  13.1   7.7   7.7   7.7
  %Cvar        27.8  42.6  55.7  63.4  71.1  78.8
  %Cvaradj     27.8  41.0  53.7  61.1  68.0  74.3

  Correlations among the sparse components:
    1.00  -.18  -.27  -.21   .11  -.02
    -.18  1.00   .19  -.20   .07  -.13
    -.27   .19  1.00   .08  -.34   .08
    -.21  -.20   .08  1.00  -.18   .03
     .11   .07  -.34  -.18  1.00  -.01
    -.02  -.13   .08   .03  -.01  1.00
Exploratory factor analysis (EFA)
Intro/Motivation: comparison with PCA
Motivation:
Why is PCA more popular than EFA?
  simple geometrical meaning
  stable and fast computation: EVD or SVD
  no distributional assumptions
What does EFA have to offer?
  a complicated model (difficult to justify)
  rather slow algorithms
  restrictive distributional assumptions
Exploratory factor analysis (EFA)
Intro/Motivation: comparison with PCA
Goal: make EFA as attractive as PCA
EFA as a specific data matrix decomposition
SVD-based algorithms
better fit than PCA
Exploratory factor analysis (EFA)
Intro/Motivation: comparison with PCA
Harman’s five socio-economic variables data
The data contain:
n = 12 observations: census tracts - small areal
subdivisions of the city of Los Angeles, and
p = 5 variables: ‘total population’ = 1, ‘median school
years‘ = 2, ‘total employment’ = 3, ‘miscellaneous
professional services’ = 4 and ‘median house value’ = 5;
Harman, H. (1976) Modern Factor Analysis, 3rd Edition,
Table 2.1, p. 14.
Exploratory factor analysis (EFA)
Intro/Motivation: comparison with PCA
A motivational example: Harman’s data
Variable        PCA Λp        EFA-LS Λf,l   Ψ²l      EFA-ML Λf,m   Ψ²m      rEFA-ML Λf,m Q
POPULATION     .58   .81       .62   .78    .00       .02  1.00    .01       .59   .81
SCHOOL         .77  -.54       .70  -.52    .23       .90  -.01    .19       .76  -.53
EMPLOYMENT     .67   .73       .70   .68    .04       .14   .97    .04       .67   .72
SERVICES       .93  -.10       .88  -.14    .20       .80   .42    .19       .92  -.11
HOUSE          .79  -.56       .78  -.60    .03       .96  -.00    .07       .81  -.56
It seems, therefore, that PCA and EFA do the same job for Harman's data, but
is this really a correct conclusion?
The answer is no, because the EFA solution is not simply provided by the factor
loadings matrix Λf,l or Λf,m.
Let's check how well the sample correlation matrix Z^T Z is fitted by PCA and
EFA. The PCA fit is ||Z^T Z − Λp Λp^T|| = .24, while ||Z^T Z − Λf,l Λf,l^T − Ψ²l|| = .04
and ||Z^T Z − Λf,m Λf,m^T − Ψ²m|| = .06 are obtained for LS-EFA and ML-EFA,
respectively. These results imply that, in fact, PCA and EFA are quite different.
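The two fit measures quoted here are just Frobenius norms of residual correlation matrices; a minimal sketch (assuming, as in the talk, that Z is standardized so that Z^T Z is the sample correlation matrix, and that psi2 holds the diagonal of Ψ²):

    import numpy as np

    def residual_fit(Z, Lam, psi2=None):
        R = Z.T @ Z                        # sample correlation matrix
        residual = R - Lam @ Lam.T
        if psi2 is not None:               # EFA: also remove the diagonal Psi^2
            residual = residual - np.diag(psi2)
        return np.linalg.norm(residual, 'fro')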
Exploratory factor analysis (EFA)
The classical EFA (n > p)
Definition:
Let z ∈ R^{p×1} be a standardized manifest random variable. The EFA
model is:

    z = Λf + Ψu ,

where
  f ∈ R^{k×1} (k < p) are random common factors,
  Λ ∈ R^{p×k} is a matrix of fixed factor loadings,
  u ∈ R^{p×1} are random unique factors, and
  Ψ ∈ R^{p×p} is a diagonal matrix of fixed coefficients called
  uniquenesses.
Exploratory factor analysis (EFA)
The classical EFA (n > p)
Assumptions:
  E(f) = 0_{k×1}
  E(u) = 0_{p×1}
  E(uu^T) = I_p
  E(fu^T) = O_{k×p}
  E(zz^T) = Σ (correlation matrix)
  E(ff^T) = Φ (correlation matrix)
Exploratory factor analysis (EFA)
The classical EFA (n > p)
Correlation structure:
The k-factor model and the assumptions imply:

    Σ = ΛΦΛ^T + Ψ² .

If Φ = I_k, the common factors are uncorrelated (orthogonal) and
the model reduces to:

    Σ = ΛΛ^T + Ψ² .

Otherwise (Φ ≠ I_k) the common factors are oblique.
Exploratory factor analysis (EFA)
The classical EFA (n > p)
Sample version (n > p):
Let Z be the n × p standardized data matrix.
The k-factor EFA model and the assumptions imply that the
data Z are represented as follows:

    Z ≈ FΛ^T + UΨ ,

subject to
  F^T F = I_k,  U^T U = I_p,  U^T F = 0_{p×k},
  Ψ diagonal.
Exploratory factor analysis (EFA)
The classical EFA (n > p)
Factor extraction:
Find Λ, Ψ such that Σ = ΛΛ^T + Ψ² fits Z^T Z as closely as
possible (i.e. Σ ≈ Z^T Z) in some sense:
Maximum likelihood FA:

    min_{Λ,Ψ} [ log(det Σ) + trace(Σ^{-1} Z^T Z) ]

Least squares FA:

    min_{Λ,Ψ} ||Z^T Z − Σ||_F^2
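Both extraction criteria are easy to write down as plain functions of Λ and Ψ²; a sketch (S stands for the sample correlation matrix Z^T Z, Lam for Λ and psi2 for the diagonal of Ψ²; the minimization itself is left to an optimizer):

    import numpy as np

    def ml_discrepancy(S, Lam, psi2):
        Sigma = Lam @ Lam.T + np.diag(psi2)
        _, logdet = np.linalg.slogdet(Sigma)                 # log(det Sigma)
        return logdet + np.trace(np.linalg.solve(Sigma, S))  # + trace(Sigma^{-1} S)

    def ls_discrepancy(S, Lam, psi2):
        Sigma = Lam @ Lam.T + np.diag(psi2)
        return np.linalg.norm(S - Sigma, 'fro')**2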
Exploratory factor analysis (EFA)
The classical EFA (n > p)
EFA as a specific matrix decomposition:
All parameters are considered fixed!
The k-factor EFA represents the data Z as:

    min_{Λ,F,U,Ψ} ||Z − FΛ^T − UΨ||_F^2

subject to
  F^T F = I_k,  U^T U = I_p,  U^T F = 0_{p×k},
  Ψ diagonal.
Exploratory factor analysis (EFA)
The classical EFA (n > p)
EFA in block-matrix notation:

    ||Z − FΛ^T − UΨ||_F^2 = ||Z − [F U] [Λ Ψ]^T||_F^2 ,

where the EFA constraints imply:

    [F U]^T [F U] = [ F^T F  F^T U ]  =  [ I_k      0_{k×p} ]
                    [ U^T F  U^T U ]     [ 0_{p×k}  I_p     ]
Exploratory factor analysis (EFA)
The classical EFA (n > p)
EFALS: ALS algorithm for EFA
Let B = [F U] and A = [Λ Ψ]; then the EFA problem
transforms to:

    min_{B^T B = I_{p+k}} ||Z − BA^T||_F^2

EFALS:
  for fixed A, find B (orthonormal Procrustes problem);
  for fixed B = [F U], update Λ ← Z^T F and
  Ψ ← diag(U^T Z).
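A compact numpy sketch of this alternation for the classical case n > p + k (the starting values, iteration count and names are illustrative choices of mine, not the authors' implementation):

    import numpy as np

    def efals(Z, k, n_iter=200, seed=0):
        n, p = Z.shape
        rng = np.random.default_rng(seed)
        F = np.linalg.qr(rng.standard_normal((n, k)))[0]   # random orthonormal start
        U = np.linalg.qr(rng.standard_normal((n, p)))[0]
        for _ in range(n_iter):
            Lam = Z.T @ F                                  # loadings update
            Psi = np.diag(np.diag(U.T @ Z))                # diagonal uniquenesses
            A = np.hstack([Lam, Psi])                      # A = [Lambda Psi]
            P, _, Qt = np.linalg.svd(Z @ A, full_matrices=False)
            B = P @ Qt                                     # Procrustes step, B^T B = I
            F, U = B[:, :k], B[:, k:]
        return Lam, np.diag(Psi), F, U                     # loadings, uniquenesses, factors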
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Problems with EFA when p ≥ n:
  the sample correlation matrix is singular
  F^T F = I_k and U^T F = 0_{p×k} remain valid, but U^T U ≠ I_p
Then, the k-factor EFA data representation Z ≈ FΛ^T + UΨ
implies:

    Σ = ΛΛ^T + ΨU^T UΨ
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
New EFA constraints for p ≥ n:
  F^T F = I_k, U^T F = 0_{p×k} and Ψ diagonal
  U^T UΨ = Ψ (instead of U^T U = I_p)
Implications: from Sylvester's law of nullity,

    rank(U^T) + rank(F) − n ≤ rank(U^T F) = 0  ⇒  rank(U) ≤ n − k

Lemma:
If p > n, U^T UΨ = Ψ implies that Ψ² is only positive semidefinite,
i.e. Ψ² cannot be positive definite.
Lemma:
If p > n, and r is the number of zero diagonal entries in Ψ,
then rank(Ψ) = p − r ≤ n − k.
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Generalized EFA (GEFA) for p ≥ n:
The k-factor GEFA represents the data Z as:

    min_{Λ,F,U,Ψ} ||Z − FΛ^T − UΨ||_F^2

subject to
  F^T F = I_k, U^T F = 0_{p×k} and Ψ diagonal,
  U^T UΨ = Ψ.
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Another GEFA formulation:
Lemma:
If p > n and U^T UΨ = Ψ, then FF^T + UU^T = I_n and
rank(F) = k imply F^T F = I_k and U^T F = 0_{p×k}. The
converse implication is not generally true.
Then, the k-factor GEFA problem becomes:

    min_{Λ,F,U,Ψ} ||Z − FΛ^T − UΨ||_F^2

subject to
  rank(F) = k,  FF^T + UU^T = I_n,
  U^T UΨ = Ψ (diagonal).
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
GEFA in block-matrix notation:
For B = [F U] and A = [Λ Ψ],

    ||Z − FΛ^T − UΨ||_F^2 = ||Z − BA^T||_F^2 ,

where the GEFA constraints imply:

    BB^T = [F U] [ F^T ] = FF^T + UU^T = I_n
                 [ U^T ]
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
GEFALS: ALS algorithm for GEFA
With B = [F U] and A = [Λ Ψ], the GEFA problem is:

    min_{BB^T = I_n} ||Z − BA^T||_F^2

GEFALS:
  for fixed A, find B (orthonormal Procrustes problem);
  for fixed B = [F U], update Λ ← Z^T F and
  Ψ ← diag(U^T Z).
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Inside the GEFA objective function:

    ||Z − BA^T||_F^2 = ||Z||_F^2 + trace(B^T B A^T A) − 2 trace(B^T Z A) ,

where now B^T B ≠ I_{p+k}. However, U^T UΨ = Ψ implies:

    trace(B^T B A^T A) = trace(Λ^T Λ) + trace(Ψ²) ,

i.e. this term does not depend on F and U (and B).
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Convergence properties of GEFALS
1. Global minimizers of f(Λ) and f(Ψ) exist.
2. There is a unique global minimizer of f(Ψ).
3. There is no unique global minimizer of f(Λ).
4. GEFALS is a globally convergent algorithm, i.e. it
   converges from any starting value.
5. However, there is no guarantee that the minimizer found
   is the global minimum of the problem.
6. GEFALS has the convergence rate of steepest descent.
7. Conjugate gradient versions are under construction (with Lars
   Eldén).
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Numerical performance of GEFALS
1. Try several runs with different starting values.
2. The GEFALS convergence is linear (as in gradient
   descent), i.e. the first few steps reduce f sharply, followed
   by a number of steps with little descent.
3. The standard escape: develop methods employing
   second-order derivatives.
4. Their convergence is quadratic, i.e. they are faster than
   gradient methods; however, they are only locally convergent,
   i.e. only a "good" starting value ensures convergence.
5. To imitate second-order behaviour: start with low
   accuracy, and use the result as a starting value for a
   second run with high accuracy.
Exploratory factor analysis (EFA)
Generalized EFA (GEFA) for n > p & p ≥ n
Rotational freedom in EFA and GEFA:
For any orthogonal Q:

    Z = FΛ^T + UΨ = FQQ^TΛ^T + UΨ

Replace Λ by a p × k lower triangular L to:
  1. ease interpretation
  2. avoid rotational freedom
Updating: replace Λ ← Z^T F by L ← tril(Z^T F).
All previous derivations remain valid!
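The lower-triangular update is a one-liner in numpy (a sketch; Z and F as in the EFALS/GEFALS iterations above):

    import numpy as np

    def lower_triangular_loadings(Z, F):
        return np.tril(Z.T @ F)   # p x k lower-trapezoidal L, removes rotational freedom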
Exploratory factor analysis (EFA)
Numerical illustrations with GEFA
GEFA of Harman’s five socio-economic variables
data:
Var      Λ (≈ML)        Ψ²         L              Ψ²
1         .99   .05    .0150      1.00   0       .0173
2        -.01   .88    .2292       .03   .88     .2307
3         .97   .16    .0182       .98   .11     .0158
4         .40   .80    .2001       .44   .78     .2009
5        -.03   .98    .0318       .02   .98     .0292
fit             .075271                  .075232
Exploratory factor analysis (EFA)
Numerical illustrations with GEFA
Thurstone’s 26-variable box data:
L. L. Thurstone collected 20 boxes and measured their
three dimensions x (length), y (width) and z (height):

  #   x  y  z      #   x  y  z
  1   3  2  1     11   4  4  1
  2   3  2  2     12   4  4  2
  3   3  3  1     13   4  4  3
  4   3  3  2     14   5  2  1
  5   3  3  3     15   5  2  2
  6   4  2  1     16   5  3  2
  7   4  2  2     17   5  3  3
  8   4  3  1     18   5  4  1
  9   4  3  2     19   5  4  2
 10   4  3  3     20   5  4  3

The variables are 26 functions of x, y and z, i.e.:
n = 20, p = 26 and k = 3.
Thurstone, L. L. (1947) Multiple-Factor Analysis,
University of Chicago Press: Chicago, IL, p. 141.
Exploratory factor analysis (EFA)
Numerical illustrations with GEFA
GEFA of Thurstone’s 26-variable box data:
Variable          dependence        Λ                   Ψ²        L                   Ψ²
                  (x  y  z)
x                  ×  0  0        .98   .12   .17     .0000     1.00    0     0      .0000
y                  0  ×  0        .15   .98  -.09     .0000      .25   .97    0      .0000
z                  0  0  ×       -.11   .36   .92     .0000      .10   .23   .96     .0000
xy                 ×  ×  0        .59   .80   .01     .0000      .68   .73  -.00     .0000
xz                 ×  0  ×        .30   .36   .88     .0000      .49   .20   .84     .0000
yz                 0  ×  ×       -.01   .71   .70     .0000      .20   .59   .77     .0000
x²y                ×  ×  0        .75   .63   .06     .0191      .82   .54  -.00     .0191
xy²                ×  ×  0        .44   .89  -.06     .0001      .52   .84  -.03     .0000
x²z                ×  0  ×        .52   .32   .77     .0198      .68   .15   .68     .0198
xz²                ×  0  ×        .13   .39   .90     .0000      .33   .24   .90     .0000
y²z                0  ×  ×        .05   .82   .52     .0298      .25   .73   .60     .0298
yz²                0  ×  ×       -.05   .58   .79     .0000      .16   .45   .85     .0000
x/y                ×  ×  0        .53  -.81   .16     .0279      .44  -.87  -.05     .0279
y/x                ×  ×  0       -.54   .80  -.19     .0290     -.46   .87   .02     .0290
x/z                ×  0  ×        .48  -.22  -.79     .0811      .31  -.15  -.89     .0811
z/x                ×  0  ×       -.54   .26   .76     .0476     -.36   .20   .88     .0476
y/z                0  ×  ×        .17   .29  -.90     .0566      .04   .40  -.87     .0566
z/y                0  ×  ×       -.17  -.27   .90     .0651     -.03  -.38   .88     .0651
2x + 2y            ×  ×  0        .71   .70   .05     .0000      .79   .61   .00     .0000
2x + 2z            ×  0  ×        .59   .32   .74     .0000      .74   .15   .65     .0000
2y + 2z            0  ×  ×        .03   .85   .52     .0000      .23   .76   .61     .0000
√(x² + y²)         ×  ×  0        .80   .59   .08     .0000      .87   .49  -.01     .0000
√(x² + z²)         ×  0  ×        .80   .26   .53     .0001      .91   .10   .39     .0001
√(y² + z²)         0  ×  ×        .07   .93   .34     .0000      .25   .86   .44     .0000
xyz                ×  ×  ×        .27   .67   .67     .0017      .47   .54   .68     .0017
√(x² + y² + z²)    ×  ×  ×        .68   .64   .34     .0001      .80   .52   .28     .0001
fit                                .591916                       .591920