Dimension reduction: modern aspects
Nickolay T. Trendafilov
Department of Mathematics and Statistics, The Open University, UK
Department of Mathematics, Linköping University, May 16, 2013, Linköping, Sweden

Contents
1 Principal component analysis (PCA)
  Basic facts and definitions
  The interpretation and related difficulties
  Example: the Pitprop data
  Sparse PCA
  Genesis and the way forward
  Function-constrained sparse components
2 Exploratory factor analysis (EFA)
  Intro/Motivation: comparison with PCA
  The classical EFA (n > p)
  Generalized EFA (GEFA) for n > p and p ≥ n
  Numerical illustrations with GEFA

Principal component analysis (PCA): basic facts and definitions

Origins and background
PCA was first defined in the form used nowadays by Pearson (1901). He found the best-fitting line, in the least squares sense, to the data points, which is known today as the first principal component. Hotelling (1933) showed that the loadings of the components are the eigenvectors of the sample covariance matrix. Note that PCA as a scientific tool appeared first in Psychometrika. The quantitative measurement of human intelligence and personality in the 1930s was arguably as challenging and important then as the contemporary interest in gene and tissue engineering, signal processing, or the analysis of huge climate, financial or internet data.

General
Main goal: analyzing high-dimensional multivariate data.
Main tools: singular value decomposition (SVD) of the (raw) data matrix; eigenvalue decomposition (EVD) of the correlation matrix.
Main problems:
- PCA might be too slow for very large data;
- the results involve all input variables, which complicates the interpretation.

Definition 1: PCA of a data matrix
Let $X$ be an $n \times p$ ($n > p$) data matrix containing measurements of $p$ random variables on $n$ subjects (individuals). Let $X$ be centered, i.e. $X^\top 1_n = 0_p$, and let the columns (variables) have unit length, i.e. $\mathrm{diag}(X^\top X) = 1_p$. Denote the SVD of $X$ by $X = FDA^\top$, where
1 $F$ is $n \times n$ orthogonal,
2 $A$ is $p \times p$ orthogonal, and
3 $D$ is $n \times p$ diagonal, whose elements are called the singular values of $X$ and are in decreasing order.

PCA dimension reduction: truncated SVD
For any $r$ ($\ll p$), let $X_r = F_r D_r A_r^\top$ be the truncated SVD of $X$, where $F_r$ and $A_r$ denote the first $r$ columns of $F$ and $A$ respectively ($F_r^\top F_r = A_r^\top A_r = I_r$), and $D_r$ is the diagonal matrix with the first $r$ singular values of $X$:
1 $Y_r = F_r D_r$ contains the component scores, also known as principal components (PCs),
2 $A_r$ contains the component loadings, and
3 $D_r^2$ contains the variances of the first $r$ PCs.
It is well known that $X_r$ gives the best least squares (LS) approximation of rank $r$ to $X$.
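To make Definition 1 concrete, here is a minimal sketch (assuming NumPy; the function name, data and the choice r = 3 are illustrative, not part of the talk) of PCA via the truncated SVD: it centres and scales $X$, truncates the SVD, and returns the scores $Y_r = F_r D_r$, the loadings $A_r$, the variances $D_r^2$ and the rank-$r$ approximation $X_r$.

```python
import numpy as np

def pca_truncated_svd(X, r):
    """PCA of a data matrix via the truncated SVD (Definition 1 above)."""
    X = X - X.mean(axis=0)                            # centre the columns: X' 1_n = 0_p
    X = X / np.linalg.norm(X, axis=0)                 # unit-length columns: diag(X'X) = 1_p
    F, d, At = np.linalg.svd(X, full_matrices=False)  # X = F D A', d in decreasing order
    Fr, dr, Ar = F[:, :r], d[:r], At[:r].T
    Yr = Fr * dr                                      # component scores Y_r = F_r D_r
    Xr = Yr @ Ar.T                                    # best rank-r LS approximation to X
    return Yr, Ar, dr**2, Xr

# illustrative call on random data: n = 100 observations, p = 10 variables, r = 3
Yr, Ar, variances, Xr = pca_truncated_svd(np.random.default_rng(0).standard_normal((100, 10)), 3)
```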
...more on truncated SVD
A right multiplication of $X = FDA^\top$ by $A_r$ gives
$$XA_r = FD \begin{bmatrix} I_r \\ 0_{(p-r)\times r} \end{bmatrix} = F \begin{bmatrix} D_r \\ 0_{(p-r)\times r} \end{bmatrix} = F_r D_r = Y_r , \qquad (1)$$
i.e. the principal components $Y_r$ are linear combinations of the original variables $X$ with coefficients $A_r$; another right multiplication of (1) by $A_r^\top$ gives
$$XA_rA_r^\top = Y_rA_r^\top = F_rD_rA_r^\top = X_r ,$$
i.e. the best LS approximation $X_r$ to $X$ of rank $r$ is given by the orthogonal projection of $X$ onto the $r$-dimensional subspace of $\mathbb{R}^p$ spanned by the columns of $A_r$.

PCA optimality
The variances of the first $r$ PCs are given by
$$Y_r^\top Y_r = A_r^\top X^\top X A_r = D_r F_r^\top F_r D_r = D_r^2 .$$
This also shows that the principal components are uncorrelated, as $Y_r^\top Y_r$ is diagonal. Then, the total variance explained by the first $r$ PCs is given by $\mathrm{trace}(D_r^2) = \mathrm{trace}(A_r^\top X^\top X A_r)$.
It is important to stress that the PCs are the only linear combinations of the original variables which are uncorrelated and whose matrix of coefficients $A_r$ is orthonormal.

Definition 2: PCA of a correlation matrix
Let $R$ be a $p \times p$ sample correlation matrix. Denote the EVD of $R$ by $R = AD^2A^\top$, where
1 $A$ is $p \times p$ orthogonal, and
2 $D^2$ is $p \times p$ diagonal, with elements in decreasing order.
Let $R_r = A_rD_r^2A_r^\top$ be the truncated EVD of $R$ ($r \ll p$):
- $A_r$ contains the first $r$ eigenvectors of $R$ and is the matrix of the component loadings of the first $r$ PCs;
- $D_r^2$ contains the first $r$ eigenvalues of $R$, which are the variances of the first $r$ PCs;
- $R_r$ gives the best LS approximation of rank $r$ to $R$.
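A quick sketch (assuming NumPy; the random data are illustrative) connecting Definition 2 with Definition 1: for a centred data matrix with unit-length columns, $X^\top X$ is the sample correlation matrix, and its EVD reproduces the variances $D^2$ and, up to sign, the loadings $A$ obtained from the SVD of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 6))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)        # now R = X'X is the sample correlation matrix

# Definition 1: SVD of the data matrix
F, d, At = np.linalg.svd(X, full_matrices=False)
A_svd, var_svd = At.T, d**2

# Definition 2: EVD of the correlation matrix
R = X.T @ X
evals, A_evd = np.linalg.eigh(R)         # eigh returns eigenvalues in increasing order
order = np.argsort(evals)[::-1]
evals, A_evd = evals[order], A_evd[:, order]

print(np.allclose(var_svd, evals))                    # variances agree
print(np.allclose(np.abs(A_svd), np.abs(A_evd)))      # loadings agree up to sign
```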
Principal component analysis (PCA): the interpretation and related difficulties

PCA interpretation: the classic case
PCA is interpreted by considering the magnitudes of the component loadings, which indicate how strongly each of the original variables contributes to the PC. Even if a reduced dimension of $r$ PCs is considered for further analysis, each PC is still a linear combination of all original variables. This complicates the interpretation of the PCs, especially when $p$ is large. Which loadings are easily interpretable?

Thurstone's simple structure concept... (Thurstone, 1947, p. 335)
1 Each row of the factor matrix should have at least one zero.
2 If there are $r$ common factors, each column of the factor matrix should have at least $r$ zeros.
3 For every pair of columns of the factor matrix there should be several variables whose entries vanish in one column but not in the other.
4 For every pair of columns of the factor matrix, a large proportion of the variables should have vanishing entries in both columns when there are four or more factors.
5 For every pair of columns of the factor matrix there should be only a small number of variables with non-vanishing entries in both columns.

PCA interpretation via simple structure rotation
The oldest approach to the PCA interpretation problem is the simple structure rotation, which is based on the following simple identity:
$$X = FDA^\top = FQQ^{-1}DA^\top ,$$
where $Q$ can be any non-singular transformation matrix. Two types of matrices $Q$ are used in PCA and FA: orthogonal and oblique. $B = ADQ^{-\top}$ is the matrix of loadings rotated by $Q$; for orthogonal $Q$ it reduces to $B = ADQ$. (As an orthogonal matrix $Q$ can be viewed as a rotation, these methods are known as rotation methods.)

Doing simple structure rotation in PCA (and FA)
Steps:
1 low-dimensional data approximation,...
2 followed by rotation of the PC loadings;
3 the rotation is found by optimizing a certain criterion which defines/formalizes the perception of simple (interpretable) structure.
Example: the VARIMAX criterion maximizes the variance of the squared rotated loadings $B$, i.e. $\max_Q V$, where
$$V = \sum_{j=1}^{r} V_j \quad \text{and} \quad V_j = \sum_{i=1}^{p} b_{ij}^4 - \frac{1}{p}\Big(\sum_{i=1}^{p} b_{ij}^2\Big)^2 .$$
Drawbacks of the rotated components:
- loadings that are still difficult to interpret;
- correlated components, which also do not explain decreasing amounts of variance.
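As an illustration of the VARIMAX criterion above, here is a small sketch (assuming NumPy; this SVD-based iteration is the commonly used varimax algorithm and is offered only as an illustration, not as the author's implementation) that finds an orthogonal $Q$ maximizing $V$ and returns the rotated loadings $B = (AD)Q$.

```python
import numpy as np

def varimax(AD, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a p x r loading matrix AD; returns B = AD @ Q and Q."""
    p, r = AD.shape
    Q = np.eye(r)
    v_old = 0.0
    for _ in range(max_iter):
        B = AD @ Q
        # working matrix whose SVD updates Q towards a larger varimax criterion V
        G = AD.T @ (B**3 - B @ np.diag((B**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        Q = U @ Vt
        v_new = s.sum()
        if v_new < v_old * (1 + tol):   # stop when V no longer increases noticeably
            break
        v_old = v_new
    return AD @ Q, Q
```

A typical call would be `B, Q = varimax(loadings)`, after which $B$ is inspected (and thresholded) instead of the original loadings.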
PCA interpretation via rotation methods
Traditionally, PCs are considered easily interpretable if there are plenty of small component loadings indicating the negligible importance of the corresponding variables.
Jolliffe, 2002, p. 269: "The most common way of doing this is to ignore (effectively set to zero) coefficients whose absolute values fall below some threshold."
Thus, implicitly, the simplicity and interpretability of the PCs are associated with the sparseness of the component loadings.

Example: the Pitprop data
The Pitprop data consist of 14 variables which were measured for each of 180 pitprops cut from Corsican pine timber. One variable is compressive strength, and the other 13 variables are physical measurements on the pitprops (Jeffers, 1967).

Table: Jeffers's Pitprop data: loadings of the first six PCs and their interpretation, obtained by normalizing each column and then taking normalized loadings greater than .7 only (Jeffers, 1967).

Vars       Component loadings (AD)
             1     2     3     4     5     6
topdiam    .83   .34  -.28  -.10   .08   .11
length     .83   .29  -.32  -.11   .11   .15
moist      .26   .83   .19   .08  -.33  -.25
testsg     .36   .70   .48   .06  -.34  -.05
ovensg     .12  -.26   .66   .05  -.17   .56
ringtop    .58  -.02   .65  -.07   .30   .05
ringbut    .82  -.29   .35  -.07   .21   .00
bowmax     .60  -.29  -.33   .30  -.18  -.05
bowdist    .73   .03  -.28   .10   .10   .03
whorls     .78  -.38  -.16  -.22  -.15  -.16
clear     -.02   .32  -.10   .85   .33   .16
knots     -.24   .53   .13  -.32   .57  -.15
diaknot   -.23   .48  -.45  -.32  -.08   .57

In Jeffers's interpretation this rule associates PC1 with topdiam, length, ringtop, ringbut, bowmax, bowdist and whorls; PC2 with moist and testsg; PC3 with testsg, ovensg and ringtop; PC4 with clear; PC5 with knots; and PC6 with diaknot.

Table: Jeffers's Pitprop data: loadings rotated by varimax and their interpretation, obtained by normalizing each column and then keeping normalized loadings greater than .59 in absolute value only.

Vars       VARIMAX loadings                              Normalized loadings (|·| > .59)
             1     2     3     4     5     6              1     2     3     4     5     6
topdiam    .91   .26  -.01   .03   .01   .08            .97
length     .94   .19  -.00   .03   .00   .10           1.0
moist      .13   .96  -.14   .08   .08   .04                  1.0
testsg     .13   .95   .24   .03   .06  -.03                   .98
ovensg    -.14   .03   .90  -.03  -.18  -.03                        1.0
ringtop    .36   .19   .61  -.03   .28  -.49                         .68
ringbut    .62  -.02   .47  -.13  -.01  -.55            .66
bowmax     .54  -.10  -.10   .11  -.56  -.23                                    -.64
bowdist    .77   .03  -.03   .12  -.16  -.12            .82
whorls     .68  -.10   .02  -.40  -.35  -.34            .73
clear      .03   .08  -.04   .97   .00  -.00                              1.0
knots     -.06   .14  -.14   .04   .87   .09                                    1.0
diaknot    .10   .04  -.07  -.01   .15   .93                                          1.0
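The normalise-and-threshold rule applied in both Pitprop tables above is easy to state in code; here is a small sketch (assuming NumPy; the function name and default cut-off are illustrative).

```python
import numpy as np

def threshold_interpretation(loadings, cut=0.7):
    """Normalize each column by its largest absolute loading and blank out (set to zero)
    entries whose normalized absolute value does not exceed `cut`."""
    L = np.asarray(loadings, dtype=float)
    normalized = L / np.abs(L).max(axis=0)       # column-wise normalization
    return np.where(np.abs(normalized) > cut, normalized, 0.0)

# e.g. threshold_interpretation(AD, 0.7) or threshold_interpretation(B, 0.59)
# applies the rule described in the table captions above.
```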
Sparse PCA: genesis and the way forward

Abandoning the rotation methods
Ignoring the small loadings is subjective and misleading, especially for PCs of a covariance matrix (Cadima & Jolliffe, 1995).
Cadima & Jolliffe, 1995: "One of the reasons for this is that it is not just loadings but also the size (standard deviation) of each variable which determines the importance of that variable in the linear combination. Therefore it may be desirable to put more emphasis on simplicity than on variance maximization."

Alternative to rotation: modify PCA to produce explicitly simple principal components
The first method to directly construct sparse components was proposed by Hausman (1982): it finds PC loadings from a prescribed set of values, say $S = \{-1, 0, 1\}$.
Jolliffe & Uddin (2000) were the first to modify the original PCs to additionally satisfy the varimax criterion (simplified component technique, SCoT), while still explaining successively decreasing portions of the variance:
$$\max_{a^\top a = 1} \; a^\top R a + \tau\, V(a) ,$$
where $\tau$ is a tuning parameter controlling the relative importance of $a^\top R a$ and $V(a)$.

Genuine sparse component analysis
Jolliffe, Trendafilov & Uddin (2003) were the first to modify the original PCs to additionally satisfy the LASSO constraint, which drives many loadings to exact zeros (SCoTLASS):
$$\max_{\substack{\|a\|_2 = 1,\ \|a\|_1 \le \tau \\ a \perp \{a_1, a_2, \ldots, a_{i-1}\}}} a^\top R a ,$$
where $a_1, \ldots, a_{i-1}$ are the already found vectors of sparse component loadings. The requirement $\|a\|_1 \le \tau$ is known as the LASSO constraint.

SCoTLASS solution
1 SCoTLASS is first reformulated as
$$\max_{\substack{\|a\|_2 = 1 \\ a \perp \{a_1, a_2, \ldots, a_{i-1}\}}} F_\mu(a) = a^\top R a - \mu P(\|a\|_1 - \tau) ,$$
where $P(x)$ is a penalty term, e.g. $P(x) = \max(x, 0)$.
2 Adopt $\|a\|_1 = a^\top \mathrm{sign}(a) \approx a^\top \tanh(1000a)$ and solve:
(i) for orthonormal loadings and correlated components,
$$\frac{da_i}{dt} = (I_p - A_iA_i^\top)\,\nabla F_\mu(a_i), \quad i = 1, 2, \ldots, r ,$$
where $A_i = [a_1, a_2, \ldots, a_i]$ and $A_i^\top A_i = I_i$; and
(ii) for oblique loadings and uncorrelated components,
$$\frac{da_i}{dt} = \Big(I_p - \frac{a_1a_1^\top R}{a_1^\top R a_1} - \cdots - \frac{a_ia_i^\top R}{a_i^\top R a_i}\Big)\,\nabla F_\mu(a_i) .$$

Efficient numerical procedures for sparse PCA
Zou, Hastie & Tibshirani (2006) transform the standard PCA into a regression form to propose a fast algorithm (SPCA) for sparse PCA, applicable to large data:
$$\min_{a,\, b} \|X - Xab^\top\|_F^2 + \lambda_1\|a\|_2^2 + \lambda_2\|a\|_1 \quad \text{subject to } \|b\|_2 = 1 ,$$
where $b \in \mathbb{R}^p$ is an auxiliary vector.
During the last ten years, since the first paper on sparse PCA appeared, an enormous number of works on the topic have been published.

Major weakness of the existing methods: they produce sparse loadings that are not completely orthonormal, and the corresponding sparse components are correlated. Only SCoTLASS (and a recently proposed method) is capable of producing either orthonormal loadings or uncorrelated sparse components. In the following, new definitions of sparse PCA will be considered, some of which can result simultaneously in nearly orthonormal loadings and nearly uncorrelated sparse components.

Sparse PCA: function-constrained sparse components

Types of problems seeking sparse minimizers $x$ of $f(x)$ through the $\ell_1$ norm
(Wright, S. (2011). Gradient algorithms for regularized optimization, SPARS11, Edinburgh, Scotland, http://pages.cs.wisc.edu/~swright):
- Weighted form: $\min f(x) + \tau\|x\|_1$, for some $\tau > 0$;
- $\ell_1$-constrained form (variable selection): $\min f(x)$ subject to $\|x\|_1 \le \tau$;
- Function-constrained form: $\min \|x\|_1$ subject to $f(x) \le \bar{f}$.

...and restated accordingly for sparse PCA
For a given $p \times p$ correlation matrix $R$, find a vector of loadings $a$ ($\|a\|_2 = 1$) by solving one of the following:
- Weighted form: $\max a^\top R a - \tau\|a\|_1$, for some $\tau > 0$;
- $\ell_1$-constrained form (variable selection): $\max a^\top R a$ subject to $\|a\|_1 \le \tau$, $\tau \in [1, \sqrt{p}\,]$;
- Function-constrained form: $\min \|a\|_1$ subject to $a^\top R a \ge \lambda$, where $\lambda$ is an eigenvalue of $R$.
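To make the first of these forms concrete, here is a rough sketch (assuming NumPy; the projected-gradient iteration, step size and starting point are illustrative choices, not the author's algorithm) for one sparse loading vector: it ascends $a^\top R a - \tau\|a\|_1$ while keeping $\|a\|_2 = 1$, using the smooth $\tanh$ approximation of $\mathrm{sign}(a)$ mentioned in the SCoTLASS solution above.

```python
import numpy as np

def sparse_loading(R, tau=1.0, step=0.01, n_iter=2000):
    """Gradient ascent on f(a) = a'Ra - tau*||a||_1 over the unit sphere ||a||_2 = 1."""
    a = np.linalg.eigh(R)[1][:, -1]                   # start from the leading eigenvector
    for _ in range(n_iter):
        grad = 2 * R @ a - tau * np.tanh(1000 * a)    # smooth surrogate of the l1 subgradient
        grad = grad - (a @ grad) * a                  # project onto the tangent space of the sphere
        a = a + step * grad
        a = a / np.linalg.norm(a)                     # retract back to the unit sphere
    return a
```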
Weakly correlated sparse components (WCPC)
The following version of sparse PCA,
$$\min_{A^\top A = I_r} \|A\|_1 + \mu\,\|A^\top R A - D^2\|_F ,$$
seeks sparse loadings $A$ which additionally diagonalize $R$, i.e. they are supposed to produce sparse components which are as weakly correlated as possible. $D^2$ is the diagonal matrix of the original PCs' variances.

Sparse components approximating the PCs' variances (VarFit)
The next version of sparse PCA is
$$\min_{A^\top A = I_r} \|A\|_1 + \mu\,\|\mathrm{diag}(A^\top R A) - D^2\|_F ,$$
in which the variances of the sparse components should better fit the initial variances $D^2$, without paying attention to the off-diagonal elements of $R$. As a result, $A^\top R A$ is expected to be less similar to a diagonal matrix than in WCPC, and the resulting sparse components more correlated.

Table: Function-constrained sparse loadings, Jeffers's Pitprop data. WCPC yields 45 zero loadings out of 78 (13 variables × 6 components), VarFit yields 63 out of 78. The explained variance and the correlations among the resulting sparse components are as follows.

                   WCPC (45/78 zero loadings)               VarFit (63/78 zero loadings)
Comp                1     2     3     4     5     6          1     2     3     4     5     6
%Var              29.5  12.7  12.2   7.7   6.6   6.1       27.8  14.8  13.1   7.7   7.7   7.7
%Cvar             29.5  42.2  54.4  62.1  68.7  74.8       27.8  42.6  55.7  63.4  71.1  78.8
%Cvar (adj.)      29.5  42.1  54.2  61.8  68.0  73.8       27.8  41.0  53.7  61.1  68.0  74.3

Correlations among the WCPC sparse components:
 1.00   .09  -.07  -.01   .06   .07
  .09  1.00  -.02  -.08   .06  -.03
 -.07  -.02  1.00  -.09   .23   .17
 -.01  -.08  -.09  1.00   .01   .07
  .06   .06   .23   .01  1.00   .00
  .07  -.03   .17   .07   .00  1.00

Correlations among the VarFit sparse components:
 1.00  -.18  -.27  -.21   .11  -.02
 -.18  1.00   .19  -.20   .07  -.13
 -.27   .19  1.00   .08  -.34   .08
 -.21  -.20   .08  1.00  -.18   .03
  .11   .07  -.34  -.18  1.00  -.01
 -.02  -.13   .08   .03  -.01  1.00
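A small sketch (assuming NumPy; function names are illustrative) of how the two objectives above are evaluated for a candidate loading matrix $A$ with orthonormal columns; the actual optimization over such $A$ is a matrix manifold problem and is not shown here.

```python
import numpy as np

def wcpc_objective(A, R, D2, mu=10.0):
    """||A||_1 + mu * ||A'RA - D^2||_F  (WCPC: push A'RA towards the diagonal D^2)."""
    return np.abs(A).sum() + mu * np.linalg.norm(A.T @ R @ A - np.diag(D2))

def varfit_objective(A, R, D2, mu=10.0):
    """||A||_1 + mu * ||diag(A'RA) - D^2||_F  (VarFit: fit only the variances)."""
    return np.abs(A).sum() + mu * np.linalg.norm(np.diag(A.T @ R @ A) - D2)

# D2 is a vector holding the r largest eigenvalues of R (the original PCs' variances).
```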
Exploratory factor analysis (EFA): intro/motivation and comparison with PCA

Motivation: why is PCA more popular than EFA?
PCA offers:
- a simple geometrical meaning;
- stable and fast computation: EVD or SVD;
- no distributional assumptions.
What does EFA have to offer?
- a complicated model (difficult to justify);
- rather slow algorithms;
- restrictive distributional assumptions.

Goal: make EFA as attractive as PCA
- EFA as a specific data matrix decomposition;
- SVD-based algorithms;
- better fit than PCA.

Harman's five socio-economic variables data
The data contain n = 12 observations (census tracts: small areal subdivisions of the city of Los Angeles) and p = 5 variables: 'total population' = 1, 'median school years' = 2, 'total employment' = 3, 'miscellaneous professional services' = 4 and 'median house value' = 5. Harman, H. (1976) Modern Factor Analysis, 3rd Edition, Table 2.1, p. 14.

A motivational example: Harman's data

Variable      PCA $\Lambda_p$    EFA-LS $\Lambda_{f,l}$   $\Psi_l^2$   EFA-ML $\Lambda_{f,m}$   $\Psi_m^2$   rEFA-ML $\Lambda_{f,m}Q$
POPULATION     .58   .81          .62   .78                .00          .02  1.00                .01          .59   .81
SCHOOL         .77  -.54          .70  -.52                .23          .90  -.01                .19          .76  -.53
EMPLOYMENT     .67   .73          .70   .68                .04          .14   .97                .04          .67   .72
SERVICES       .93  -.10          .88  -.14                .20          .80   .42                .19          .92  -.11
HOUSE          .79  -.56          .78  -.60                .03          .96  -.00                .07          .81  -.56

It seems, therefore, that PCA and EFA do the same job for Harman's data, but is this really a correct conclusion? The answer is no, because the EFA solution is not simply provided by the factor loadings matrix $\Lambda_{f,l}$ or $\Lambda_{f,m}$. Let's check how well the sample correlation matrix $Z^\top Z$ is fitted by PCA and EFA. The PCA fit is $\|Z^\top Z - \Lambda_p\Lambda_p^\top\| = .24$, while $\|Z^\top Z - \Lambda_{f,l}\Lambda_{f,l}^\top - \Psi_l^2\| = .04$ and $\|Z^\top Z - \Lambda_{f,m}\Lambda_{f,m}^\top - \Psi_m^2\| = .06$ are obtained for LS-EFA and ML-EFA respectively. These results imply that, in fact, PCA and EFA are quite different.

The classical EFA (n > p)

Definition: $z \in \mathbb{R}^{p\times 1}$ is a standardized manifest random variable. The EFA model is
$$z = \Lambda f + \Psi u ,$$
where
- $f \in \mathbb{R}^{k\times 1}$ ($k \ll p$) are random common factors,
- $\Lambda \in \mathbb{R}^{p\times k}$ is a matrix of fixed factor loadings,
- $u \in \mathbb{R}^{p\times 1}$ are random unique factors, and
- $\Psi \in \mathbb{R}^{p\times p}$ is a diagonal matrix of fixed coefficients called uniquenesses.

Assumptions:
$E(f) = 0_{k\times 1}$, $E(u) = 0_{p\times 1}$, $E(uu^\top) = I_p$, $E(fu^\top) = O_{k\times p}$, $E(zz^\top) = \Sigma$ (correlation matrix), $E(ff^\top) = \Phi$ (correlation matrix).

Correlation structure: the k-factor model and the assumptions imply
$$\Sigma = \Lambda\Phi\Lambda^\top + \Psi^2 .$$
If $\Phi = I_k$, the common factors are uncorrelated (orthogonal) and the model reduces to
$$\Sigma = \Lambda\Lambda^\top + \Psi^2 .$$
Otherwise ($\Phi \ne I_k$), the common factors are oblique.

Sample version (n > p): let $Z$ be an $n \times p$ standardized data matrix. The k-factor EFA model and the assumptions imply that the data $Z$ are represented as
$$Z \approx F\Lambda^\top + U\Psi ,$$
subject to $F^\top F = I_k$, $U^\top U = I_p$, $U^\top F = 0_{p\times k}$, and $\Psi$ diagonal.

Factor extraction: find $\Lambda, \Psi$ such that $\Sigma = \Lambda\Lambda^\top + \Psi^2$ fits $Z^\top Z$ as closely as possible (i.e. $\Sigma \approx Z^\top Z$) in some sense:
- Maximum likelihood FA: $\min_{\Lambda,\Psi}\ [\log(\det\Sigma) + \mathrm{trace}(\Sigma^{-1}Z^\top Z)]$;
- Least squares FA: $\min_{\Lambda,\Psi}\ \|Z^\top Z - \Sigma\|_F^2$.
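A minimal sketch (assuming NumPy; names are illustrative) of the two discrepancy functions just defined, evaluated for given $\Lambda$ and $\Psi$ against the sample correlation matrix $R = Z^\top Z$; the same kind of Frobenius-norm comparison underlies the PCA-versus-EFA fits reported in the Harman example above.

```python
import numpy as np

def efa_discrepancies(R, Lam, Psi):
    """Return (ML, LS) discrepancies of Sigma = Lam Lam' + Psi^2 from the sample correlation R.
    Lam is p x k, Psi is a length-p vector of uniquenesses."""
    Sigma = Lam @ Lam.T + np.diag(Psi**2)
    _, logdet = np.linalg.slogdet(Sigma)
    ml = logdet + np.trace(np.linalg.solve(Sigma, R))   # log det(Sigma) + trace(Sigma^{-1} Z'Z)
    ls = np.linalg.norm(R - Sigma) ** 2                 # ||Z'Z - Sigma||_F^2
    return ml, ls
```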
EFA as a specific matrix decomposition
All parameters are considered fixed! The k-factor EFA represents the data $Z$ through
$$\min_{\Lambda, F, U, \Psi} \|Z - F\Lambda^\top - U\Psi\|_F^2 ,$$
subject to $F^\top F = I_k$, $U^\top U = I_p$, $U^\top F = 0_{p\times k}$, and $\Psi$ diagonal.

EFA in block-matrix notation:
$$\|Z - F\Lambda^\top - U\Psi\|_F^2 = \left\| Z - [F\ U]\begin{bmatrix}\Lambda^\top \\ \Psi\end{bmatrix} \right\|_F^2 ,$$
where the EFA constraints imply
$$[F\ U]^\top[F\ U] = \begin{bmatrix} F^\top F & F^\top U \\ U^\top F & U^\top U \end{bmatrix} = \begin{bmatrix} I_k & 0_{k\times p} \\ 0_{p\times k} & I_p \end{bmatrix} .$$

EFALS: ALS algorithm for EFA
Let $B = [F\ U]$ and $A = [\Lambda\ \Psi]$; then the EFA problem transforms into
$$\min_{B^\top B = I_{p+k}} \|Z - BA^\top\|_F^2 .$$
EFALS:
- for fixed $A$, find $B$ (orthonormal Procrustes problem);
- for fixed $B = [F\ U]$, update $\Lambda \leftarrow Z^\top F$ and $\Psi \leftarrow \mathrm{diag}(U^\top Z)$.
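Here is a rough sketch of the EFALS alternating scheme just described (assuming NumPy and the classical case with $n \ge p + k$; the random start, fixed iteration count and function name are illustrative choices, not the talk's actual code).

```python
import numpy as np

def efals(Z, k, n_iter=500, seed=0):
    """Sketch of EFALS: minimise ||Z - F Lam' - U Psi||_F^2 over B = [F U] with B'B = I_{p+k}."""
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    Lam = rng.standard_normal((p, k))
    Psi = np.ones(p)
    for _ in range(n_iter):
        # Procrustes step: for fixed A = [Lam diag(Psi)], B maximizes trace(B' Z A)
        A = np.hstack([Lam, np.diag(Psi)])                      # p x (p + k)
        P, _, Qt = np.linalg.svd(Z @ A, full_matrices=False)
        B = P @ Qt                                              # n x (p + k), B'B = I_{p+k}
        F, U = B[:, :k], B[:, k:]
        # loading / uniqueness updates from the slide
        Lam = Z.T @ F
        Psi = np.diag(U.T @ Z).copy()
    return Lam, Psi, F, U
```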
Generalized EFA (GEFA) for n > p and p ≥ n

Problems with EFA when p ≥ n:
- the sample correlation matrix is singular;
- $F^\top F = I_k$ and $U^\top F = 0_{p\times k}$ remain valid, but $U^\top U \ne I_p$.
Then, the k-factor EFA data representation $Z \approx F\Lambda^\top + U\Psi$ implies
$$\Sigma = \Lambda\Lambda^\top + \Psi U^\top U\Psi .$$

New EFA constraints for p ≥ n:
- $F^\top F = I_k$, $U^\top F = 0_{p\times k}$ and $\Psi$ diagonal;
- $U^\top U\Psi = \Psi$ (instead of $U^\top U = I_p$).
Implications: from Sylvester's law of nullity, $\mathrm{rank}(U^\top) + \mathrm{rank}(F) - n \le \mathrm{rank}(U^\top F) = 0$, hence $\mathrm{rank}(U) \le n - k$.
Lemma: If $p > n$, then $U^\top U\Psi = \Psi$ implies that $\Psi^2$ is positive semidefinite, i.e. $\Psi^2$ cannot be positive definite.
Lemma: If $p > n$ and $r$ is the number of zero diagonal entries in $\Psi$, then $\mathrm{rank}(\Psi) = p - r \le n - k$.

Generalized EFA (GEFA) for p ≥ n: the k-factor EFA represents the data $Z$ through
$$\min_{\Lambda, F, U, \Psi} \|Z - F\Lambda^\top - U\Psi\|_F^2 ,$$
subject to $F^\top F = I_k$, $U^\top F = 0_{p\times k}$, $\Psi$ diagonal, and $U^\top U\Psi = \Psi$.

Another GEFA formulation
Lemma: If $p > n$ and $U^\top U\Psi = \Psi$, then $FF^\top + UU^\top = I_n$ and $\mathrm{rank}(F) = k$ imply $F^\top F = I_k$ and $U^\top F = O_{p\times k}$. The converse implication is not generally true.
Then, the k-factor GEFA problem becomes
$$\min_{\Lambda, F, U, \Psi} \|Z - F\Lambda^\top - U\Psi\|_F^2 ,$$
subject to $\mathrm{rank}(F) = k$, $FF^\top + UU^\top = I_n$, and $U^\top U\Psi = \Psi$ ($\Psi$ diagonal).

GEFA in block-matrix notation: for $B = [F\ U]$ and $A = [\Lambda\ \Psi]$,
$$\|Z - F\Lambda^\top - U\Psi\|_F^2 = \|Z - BA^\top\|_F^2 ,$$
where the GEFA constraints imply
$$BB^\top = [F\ U]\begin{bmatrix}F^\top \\ U^\top\end{bmatrix} = FF^\top + UU^\top = I_n .$$

GEFALS: ALS algorithm for GEFA
With $B = [F\ U]$ and $A = [\Lambda\ \Psi]$, the GEFA problem is
$$\min_{BB^\top = I_n} \|Z - BA^\top\|_F^2 .$$
GEFALS:
- for fixed $A$, find $B$ (orthonormal Procrustes problem);
- for fixed $B = [F\ U]$, update $\Lambda \leftarrow Z^\top F$ and $\Psi \leftarrow \mathrm{diag}(U^\top Z)$.

Inside the GEFA objective function:
$$\|Z - BA^\top\|_F^2 = \|Z\|_F^2 + \mathrm{trace}(\underbrace{B^\top B}_{\ne I_{p+k}}\, A^\top A) - 2\,\mathrm{trace}(B^\top ZA) .$$
However, $U^\top U\Psi = \Psi$ implies
$$\mathrm{trace}(B^\top B\, A^\top A) = \mathrm{trace}(\Lambda^\top\Lambda) + \mathrm{trace}(\Psi^2) ,$$
i.e. this term does not depend on $F$ and $U$ (and hence not on $B$).

Convergence properties of GEFALS
1 Global minimizers of $f(\Lambda)$ and $f(\Psi)$ exist.
2 There is a unique global minimizer of $f(\Psi)$.
3 There is no unique global minimizer of $f(\Lambda)$.
4 GEFALS is a globally convergent algorithm, i.e. it converges from any starting value.
5 However, there is no guarantee that the minimizer found is the global minimum of the problem.
6 GEFALS has the rate of convergence of steepest descent.
7 Conjugate gradient versions are under construction (with Lars Eldén).

Numerical performance of GEFALS
1 Try several runs with different starting values.
2 The GEFALS convergence is linear (as in gradient descent), i.e. the first few steps reduce $f$ sharply, followed by a number of steps with little descent.
3 The standard escape: developing methods employing second-order derivatives.
4 Their convergence is quadratic, i.e. they are faster than gradient methods; however, they are locally convergent, i.e. only a "good" starting value ensures convergence.
5 To imitate second-order behavior: start with low accuracy, and use the result as the starting value for a second run with high accuracy.

Rotational freedom in EFA and GEFA: for any orthogonal $Q$,
$$Z = F\Lambda^\top + U\Psi = FQQ^\top\Lambda^\top + U\Psi .$$
Replace $\Lambda$ by a $p \times k$ lower triangular $L$ to
1 ease interpretation, and
2 avoid rotational freedom.
Updating: replace $\Lambda \leftarrow Z^\top F$ by $L \leftarrow \mathrm{tril}(Z^\top F)$. All previous derivations remain valid!
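A rough sketch (assuming NumPy; the random start and iteration count are illustrative) of the GEFALS alternation for $p \ge n$, including the lower-triangular update $L \leftarrow \mathrm{tril}(Z^\top F)$ just described; the finer constraint $U^\top U\Psi = \Psi$ is not policed explicitly in this illustration.

```python
import numpy as np

def gefals(Z, k, n_iter=500, lower_triangular=True, seed=0):
    """Sketch of GEFALS: alternate a Procrustes step for B = [F U] under B B' = I_n
    with the loading/uniqueness updates from the slides."""
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    Lam = rng.standard_normal((p, k))
    Psi = np.ones(p)
    for _ in range(n_iter):
        A = np.hstack([Lam, np.diag(Psi)])                      # p x (p + k)
        P, _, Qt = np.linalg.svd(Z @ A, full_matrices=False)    # Z A is n x (p + k), n <= p + k
        B = P @ Qt                                              # n x (p + k), B B' = I_n
        F, U = B[:, :k], B[:, k:]
        Lam = Z.T @ F
        if lower_triangular:
            Lam = np.tril(Lam)                                  # fix rotational freedom: L = tril(Z'F)
        Psi = np.diag(U.T @ Z).copy()
    return Lam, Psi, F, U
```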
Numerical illustrations with GEFA

GEFA of Harman's five socio-economic variables data:

Var    Λ (≈ ML)       Ψ²        L              Ψ²
1       .99   .05     .0150     1.00   0       .0173
2      -.01   .88     .2292      .03   .88     .2307
3       .97   .16     .0182      .98   .11     .0158
4       .40   .80     .2001      .44   .78     .2009
5      -.03   .98     .0318      .02   .98     .0292
fit          .075271                   .075232

Thurstone's 26-variable box data
L. L. Thurstone collected 20 boxes and measured their three dimensions x (length), y (width) and z (height):

#   x y z    #   x y z    #    x y z    #    x y z
1   3 2 1    6   4 2 1    11   4 4 1    16   5 3 2
2   3 2 2    7   4 2 2    12   4 4 2    17   5 3 3
3   3 3 1    8   4 3 1    13   4 4 3    18   5 4 1
4   3 3 2    9   4 3 2    14   5 2 1    19   5 4 2
5   3 3 3    10  4 3 3    15   5 2 2    20   5 4 3

The variables are 26 functions of x, y and z, i.e. n = 20, p = 26 and k = 3. Thurstone, L. L. (1947) Multiple-Factor Analysis, University of Chicago Press: Chicago, IL, p. 141.

GEFA of Thurstone's 26-variable box data (each variable is marked with × under the box dimensions x, y, z it involves):

Variable         x y z      Λ                   Ψ²        L                   Ψ²
x                × 0 0     .98  .12  .17       .0000     1.00   0     0       .0000
y                0 × 0     .15  .98 -.09       .0000      .25  .97    0       .0000
z                0 0 ×    -.11  .36  .92       .0000      .10  .23  .96       .0000
xy               × × 0     .59  .80  .01       .0000      .68  .73 -.00       .0000
xz               × 0 ×     .30  .36  .88       .0000      .49  .20  .84       .0000
yz               0 × ×    -.01  .71  .70       .0000      .20  .59  .77       .0000
x²y              × × 0     .75  .63  .06       .0191      .82  .54 -.00       .0191
xy²              × × 0     .44  .89 -.06       .0001      .52  .84 -.03       .0000
x²z              × 0 ×     .52  .32  .77       .0198      .68  .15  .68       .0198
xz²              × 0 ×     .13  .39  .90       .0000      .33  .24  .90       .0000
y²z              0 × ×     .05  .82  .52       .0298      .25  .73  .60       .0298
yz²              0 × ×    -.05  .58  .79       .0000      .16  .45  .85       .0000
x/y              × × 0     .53 -.81  .16       .0279      .44 -.87 -.05       .0279
y/x              × × 0    -.54  .80 -.19       .0290     -.46  .87  .02       .0290
x/z              × 0 ×     .48 -.22 -.79       .0811      .31 -.15 -.89       .0811
z/x              × 0 ×    -.54  .26  .76       .0476     -.36  .20  .88       .0476
y/z              0 × ×     .17  .29 -.90       .0566      .04  .40 -.87       .0566
z/y              0 × ×    -.17 -.27  .90       .0651     -.03 -.38  .88       .0651
2x + 2y          × × 0     .71  .70  .05       .0000      .79  .61  .00       .0000
2x + 2z          × 0 ×     .59  .32  .74       .0000      .74  .15  .65       .0000
2y + 2z          0 × ×     .03  .85  .52       .0000      .23  .76  .61       .0000
√(x² + y²)       × × 0     .80  .59  .08       .0000      .87  .49 -.01       .0000
√(x² + z²)       × 0 ×     .80  .26  .53       .0001      .91  .10  .39       .0001
√(y² + z²)       0 × ×     .07  .93  .34       .0000      .25  .86  .44       .0000
xyz              × × ×     .27  .67  .67       .0017      .47  .54  .68       .0017
√(x² + y² + z²)  × × ×     .68  .64  .34       .0001      .80  .52  .28       .0001
fit                             .591916                        .591920