Factor Analysis - 2nd TUTORIAL

Subject marks

File sub_marks.csv shows correlation coefficients between subject scores for a sample of 220 boys.

> sub_marks<-read.csv("sub_marks.csv",header=TRUE,sep=";")

> sub_marks

X Gaelic English History Arithmetic Algebra Geometry

1 Gaelic 1.00 0.44 0.41 0.29 0.33 0.25

2 English 0.44 1.00 0.35 0.35 0.32 0.33

3 History 0.41 0.35 1.00 0.16 0.19 0.18

4 Arithmetic 0.29 0.35 0.16 1.00 0.59 0.47

5 Algebra 0.33 0.32 0.19 0.59 1.00 0.46

6 Geometry 0.25 0.33 0.18 0.47 0.46 1.00

> row.names(sub_marks)<-sub_marks[,1]

> sub_marks<-sub_marks[,-1]

Each subject score is positively correlated with each of the scores in the other subjects, indicating a general tendency for those who do well in one subject to do well in others. The highest correlations are between the three mathematical subjects and, to a slightly lesser extent, between the three humanities subjects, suggesting that there is more in common within each of these two groups than between them.

In order to reduce the dimension of the problem and to explain the observed correlations through some related latent factors we fit a factor model using the principal factor method.

First of all we need to compute an initial estimate of the communalities by calculating the squared multiple correlation coefficient R_i^2 of each variable with the remaining ones. We obtain it as a function of the diagonal elements of the inverse correlation matrix: h_i^2(0) = R_i^2 = 1 - 1/r^ii, where r^ii is the i-th diagonal element of R^-1.

> R<-sub_marks

> solve(R)

Gaelic English History Arithmetic Algebra Geometry

Gaelic 1.43202290 -0.38819247 -0.39347215 -0.07755424 -0.21742892 -0.02260943

English -0.38819247 1.42013153 -0.25662358 -0.21571520 -0.05994464 -0.19644236

History -0.39347215 -0.25662358 1.25888064 0.04382975 -0.02990229 -0.05038962

Arithmetic -0.07755424 -0.21571520 0.04382975 1.71005669 -0.74956910 -0.37623964

Algebra -0.21742892 -0.05994464 -0.02990229 -0.74956910 1.69992943 -0.35014869

Geometry -0.02260943 -0.19644236 -0.05038962 -0.37623964 -0.35014869 1.41744949

and then estimate the communalities

> h2.zero<-1-1/(diag(solve(R)))

> h2.zero

> h2.zero<-round(h2.zero,2)

> h2.zero

Gaelic English History Arithmetic Algebra Geometry

0.30 0.30 0.21 0.42 0.41 0.29
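The initial communality estimates can be cross-checked outside R; below is a minimal NumPy sketch (the matrix entries are typed in from the correlation table above):

```python
import numpy as np

# Correlation matrix from sub_marks.csv
# (order: Gaelic, English, History, Arithmetic, Algebra, Geometry)
R = np.array([
    [1.00, 0.44, 0.41, 0.29, 0.33, 0.25],
    [0.44, 1.00, 0.35, 0.35, 0.32, 0.33],
    [0.41, 0.35, 1.00, 0.16, 0.19, 0.18],
    [0.29, 0.35, 0.16, 1.00, 0.59, 0.47],
    [0.33, 0.32, 0.19, 0.59, 1.00, 0.46],
    [0.25, 0.33, 0.18, 0.47, 0.46, 1.00],
])

# Initial communalities: h2_i(0) = 1 - 1/r^ii, where r^ii is the i-th
# diagonal element of the inverse correlation matrix
h2_zero = 1 - 1 / np.diag(np.linalg.inv(R))
print(np.round(h2_zero, 2))  # matches the R output: 0.30 0.30 0.21 0.42 0.41 0.29
```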


Now we can compute the reduced correlation matrix by substituting the estimated communalities for the diagonal elements (the 1's) of the original correlation matrix.

> R.psi<-R

> i<-1

> for (i in 1:nrow(R.psi)){

+ R.psi[i,i]<-h2.zero[i]

+ }

> R.psi

Gaelic English History Arithmetic Algebra Geometry

Gaelic 0.30 0.44 0.41 0.29 0.33 0.25

English 0.44 0.30 0.35 0.35 0.32 0.33

History 0.41 0.35 0.21 0.16 0.19 0.18

Arithmetic 0.29 0.35 0.16 0.42 0.59 0.47

Algebra 0.33 0.32 0.19 0.59 0.41 0.46

Geometry 0.25 0.33 0.18 0.47 0.46 0.29

>

R.psi is still square and symmetric, but it is no longer positive definite. Its decomposition through the spectral theorem shows that only two eigenvalues are positive

> eigen(R.psi)

$values

[1] 2.06689350 0.43185860 -0.07389683 -0.11987084 -0.17312229 -0.20186214

$vectors

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] -0.3894948 -0.4631310 -0.3323422 -0.08042293 0.49656626 0.5199098

[2,] -0.4068092 -0.3304872 0.4836944 -0.57629533 0.02029114 -0.3984924

[3,] -0.2898160 -0.5460418 -0.1179882 0.51251652 -0.56060428 -0.1642360

[4,] -0.4659483 0.4148356 -0.1362116 -0.32995348 -0.55776080 0.4150706

[5,] -0.4674485 0.3627504 -0.4779927 0.12767051 0.27396261 -0.5745187

[6,] -0.4039689 0.2728550 0.6282010 0.52304263 0.22930424 0.2038842

> eigen.va<-eigen(R.psi)$values
> eigen.ve<-eigen(R.psi)$vectors

This means that two factors might be enough in order to explain the observed correlations.
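The same eigen step can be sketched in NumPy (R.psi entries typed in from the output above) to confirm that exactly two eigenvalues are positive:

```python
import numpy as np

# Reduced correlation matrix: original correlations with the rounded
# initial communality estimates on the diagonal
R_psi = np.array([
    [0.30, 0.44, 0.41, 0.29, 0.33, 0.25],
    [0.44, 0.30, 0.35, 0.35, 0.32, 0.33],
    [0.41, 0.35, 0.21, 0.16, 0.19, 0.18],
    [0.29, 0.35, 0.16, 0.42, 0.59, 0.47],
    [0.33, 0.32, 0.19, 0.59, 0.41, 0.46],
    [0.25, 0.33, 0.18, 0.47, 0.46, 0.29],
])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending
# order; reverse to get them largest first, as R's eigen() does
values = np.linalg.eigvalsh(R_psi)[::-1]
print(np.round(values, 4))  # two positive values, four negative ones
```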

The estimate of the factor loading matrix will then be obtained as Lambda = Gamma_1 L_1^(1/2), where Gamma_1 has in its columns the first two eigenvectors and L_1 has on its diagonal the first two eigenvalues.

> l.diag<-diag(eigen.va[1:2])

> l.diag

[,1] [,2]

[1,] 2.066894 0.0000000

[2,] 0.000000 0.4318586

> gamma<-eigen.ve[,1:2]

> gamma

[,1] [,2]

[1,] -0.3894948 -0.4631310

[2,] -0.4068092 -0.3304872

[3,] -0.2898160 -0.5460418

[4,] -0.4659483 0.4148356

[5,] -0.4674485 0.3627504

[6,] -0.4039689 0.2728550


We can now compute the estimated factor loadings:

> lambda<-gamma%*%sqrt(l.diag)

> round(lambda,2)

[,1] [,2]

[1,] -0.56 -0.30

[2,] -0.58 -0.22

[3,] -0.42 -0.36

[4,] -0.67 0.27

[5,] -0.67 0.24

[6,] -0.58 0.18

The first factor seems to measure overall ability in the six subjects, while the second contrasts humanities and mathematics subjects.

Communalities are, for each variable, the part of its variance that is explained by the common factors.

To estimate the communalities we need to sum the squared factor loadings for each subject:

> lambda<-round(lambda,2)

> communality<-apply(lambda^2,1,sum)

> communality

[1] 0.4036 0.3848 0.3060 0.5218 0.5065 0.3688

Or, equivalently:

> communality<-diag(lambda%*%t(lambda))

This shows, for example, that 40% of the variance in Gaelic scores is explained by the two common factors. Of course, the larger the communality, the better the variable serves as an indicator of the associated factors.

To evaluate the goodness of fit of this model we can compute the residual correlation matrix (R - Lambda Lambda'):

> R-lambda%*%t(lambda)

Gaelic English History Arithmetic Algebra Geometry

Gaelic 0.5964 0.0492 0.0668 -0.0042 0.0268 -0.0208

English 0.0492 0.6152 0.0272 0.0208 -0.0158 0.0332

History 0.0668 0.0272 0.6940 -0.0242 -0.0050 0.0012

Arithmetic -0.0042 0.0208 -0.0242 0.4782 0.0763 0.0328

Algebra 0.0268 -0.0158 -0.0050 0.0763 0.4935 0.0282

Geometry -0.0208 0.0332 0.0012 0.0328 0.0282 0.6312

Since the off-diagonal elements are fairly small and close to zero, we can conclude that the model fits the data adequately.
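The whole principal-factor step (loadings, communalities and residual matrix) can be reproduced in one NumPy sketch. The signs of the eigenvectors, and hence of the loadings, are arbitrary, but the communalities and residuals are not:

```python
import numpy as np

# Correlation matrix from sub_marks.csv
# (order: Gaelic, English, History, Arithmetic, Algebra, Geometry)
R = np.array([
    [1.00, 0.44, 0.41, 0.29, 0.33, 0.25],
    [0.44, 1.00, 0.35, 0.35, 0.32, 0.33],
    [0.41, 0.35, 1.00, 0.16, 0.19, 0.18],
    [0.29, 0.35, 0.16, 1.00, 0.59, 0.47],
    [0.33, 0.32, 0.19, 0.59, 1.00, 0.46],
    [0.25, 0.33, 0.18, 0.47, 0.46, 1.00],
])

# Reduced matrix: rounded initial communalities on the diagonal,
# as in the R session above
h2_zero = np.round(1 - 1 / np.diag(np.linalg.inv(R)), 2)
R_psi = R.copy()
np.fill_diagonal(R_psi, h2_zero)

# Spectral decomposition; keep the two factors with positive eigenvalues
values, vectors = np.linalg.eigh(R_psi)
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]
lam = vectors[:, :2] * np.sqrt(values[:2])   # loadings: Gamma_1 L_1^(1/2)

communality = (lam ** 2).sum(axis=1)         # row sums of squared loadings
residual = R - lam @ lam.T                   # residual correlation matrix
off_diag = residual - np.diag(np.diag(residual))
print(np.round(communality, 2))
print(round(np.abs(off_diag).max(), 3))      # small -> adequate fit
```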

Athletics data

File AthleticsData.sav contains measurements over 9 different athletics disciplines on 1000 students:


1. PINBALL
2. BILLIARD
3. GOLF
4. 1500 m
5. 2 Km row
6. 12 min RUN
7. BENCH
8. CURL
9. MAX PUSHUP

The aim here is to reduce the dimension of the problem by measuring some latent factors that impact their performances.

The dataset is in SPSS format (extension .sav). To read the file we need to load an R package that provides a function for this conversion.

> library(Hmisc)

> library(foreign)

> AthleticsData <- spss.get("AthleticsData.sav")

> x<-AthleticsData

>

> x[1:5,]

PINBALL BILLIARD GOLF X.1500M X.2KROW X.12MINTR BENCH

1 -1.1225055 0.009316132 -1.5267935 -0.9483176 -0.1647701 -0.05203922 1.3593056

2 0.3286001 -0.745125995 -0.8488870 0.6849068 0.1455623 0.13481553 -0.4906018

3 0.5442109 0.823572688 0.5519436 -0.6842024 -0.5152493 -0.24014598 -0.8188845

4 1.7282347 -0.142108710 0.9537609 0.9312700 -1.0275236 -0.89791136 -0.9732271

5 0.8650813 0.363277424 -0.5669886 0.9757308 1.2200180 0.24952087 1.2293106

CURL MAXPUSHU

1 -0.83766332 -0.04783271

2 0.22148232 0.38120977

3 -0.62012251 -1.13213981

4 -1.12976703 -0.33035037

5 -0.03462911 0.40703588

The R-function factanal performs maximum-likelihood factor analysis on a covariance (correlation) matrix or data matrix. It takes the following main arguments:

x: A formula or a numeric matrix or an object that can be coerced to a numeric matrix.

factors: The number of factors to be fitted.

data: An optional data frame (or similar: see ‘model.frame’), used only if ‘x’ is a formula. By default the variables are taken from ‘environment(formula)’.

covmat: A covariance matrix, or a covariance list as returned by ‘cov.wt’. Of course, correlation matrices are covariance matrices.

n.obs: The number of observations, used if ‘covmat’ is a covariance matrix.

start: ‘NULL’ or a matrix of starting values, each column giving an initial set of uniquenesses.

scores: Type of scores to produce, if any. The default is none, ‘"regression"’ gives Thompson's scores, ‘"Bartlett"’ gives Bartlett's weighted least-squares scores. Partial matching allows these names to be abbreviated.

rotation: character. ‘"none"’ or the name of a function to be used to rotate the factors: it will be called with first argument the loadings matrix, and should return a list with component ‘loadings’ giving the rotated loadings, or just the rotated loadings.

To begin with, let’s analyze the AthleticsData with a 2 factor model.

> fit.2 <- factanal(x,factors=2,rotation="none")

> fit.2

Call: factanal(x = x, factors = 2, rotation = "none")

Uniquenesses:

PINBALL BILLIARD GOLF X.1500M X.2KROW X.12MINTR BENCH CURL

0.938 0.962 0.955 0.361 0.534 0.536 0.301 0.540

MAXPUSHU

0.560

Loadings:

Factor1 Factor2

PINBALL 0.249

BILLIARD 0.192

GOLF 0.206

X.1500M 0.793

X.2KROW 0.413 0.544

X.12MINTR 0.681

BENCH 0.813 -0.193

CURL 0.673

MAXPUSHU 0.545 0.379

Factor1 Factor2

SS loadings 1.734 1.579

Proportion Var 0.193 0.175

Cumulative Var 0.193 0.368

Test of the hypothesis that 2 factors are sufficient.

The chi square statistic is 652.4 on 19 degrees of freedom.

The p-value is 4.3e-126

Near the bottom of the output, we can see that the significance level of the χ2 fit statistic is very small. This indicates that the hypothesis that a 2 factor model fits the data is rejected. Since we are in a purely exploratory vein, let’s fit a 3 factor model.

> fit.3 <- factanal(x,factors=3,rotation="none")

> fit.3

Call: factanal(x = x, factors = 3, rotation = "none")


Uniquenesses:

PINBALL BILLIARD GOLF X.1500M X.2KROW X.12MINTR BENCH CURL

0.635 0.414 0.455 0.361 0.520 0.538 0.302 0.536

MAXPUSHU

0.540

Loadings:

Factor1 Factor2 Factor3

PINBALL 0.425 0.429

BILLIARD 0.443 0.624

GOLF 0.447 0.585

X.1500M 0.799

X.2KROW 0.408 0.496 -0.260

X.12MINTR 0.672

BENCH 0.729 -0.280 -0.297

CURL 0.605 -0.158 -0.270

MAXPUSHU 0.512 0.317 -0.312

Factor1 Factor2 Factor3

SS loadings 1.912 1.545 1.243

Proportion Var 0.212 0.172 0.138

Cumulative Var 0.212 0.384 0.522

Test of the hypothesis that 3 factors are sufficient.

The chi square statistic is 12.94 on 12 degrees of freedom.

The p-value is 0.373

These results are much more promising. Although the sample size is reasonably large, N = 1000, the significance level of .373 indicates that the hypothesis that a 3 factor model fits the data cannot be rejected. Changing from two factors to three has produced a huge improvement.
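The reported p-value is just the upper-tail probability of a chi-square distribution evaluated at the test statistic. For an even number of degrees of freedom this tail has a closed form, so the 3-factor p-value can be checked without any statistics library (a sketch, with the statistic and df taken from the output above):

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square distribution.
    This closed-form series is valid only when df is even."""
    assert df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

# Statistic and degrees of freedom from the 3-factor fit above
p = chi2_sf_even_df(12.94, 12)
print(round(p, 3))  # 0.373, matching the reported p-value
```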

The output reports the uniquenesses, i.e. the variances of the unique factors. As the algorithm fits the model using the correlation matrix, the communalities can be obtained as 1 minus the corresponding uniquenesses. For instance, the communality for the variable PINBALL is 1-0.635=0.365.

The last table in the output reports the sum of squared loadings for each factor, e.g. 0.425^2+0.443^2+…+0.512^2=1.912 for the first factor. It represents the part of the total variance that is explained by that factor. If we divide it by the total variance (9 in this case, since each standardized variable contributes 1) we obtain the proportion of the total variance explained: the first factor explains 21.2% of the total variance.

The unrotated factors do not have a clear interpretation. Some procedures have been developed to search automatically for a suitable rotation. For example, the VARIMAX procedure attempts to find an orthogonal rotation that is close to simple structure by finding factors with few large loadings and as many near-zero loadings as possible. In order to improve the understanding of the problem let's try to rotate the axes with the VARIMAX procedure:
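To make the VARIMAX idea concrete, here is a minimal sketch of the classic iterative algorithm (Kaiser's varimax in its common SVD formulation), applied to the unrotated principal-factor loadings from the subject-marks example above. This is an illustration of the criterion being maximized, not R's internal implementation:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-10):
    """Minimal varimax sketch: find an orthogonal T maximizing the sum,
    over columns of L @ T, of the variance of the squared loadings."""
    p, k = L.shape
    T = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        T = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d < d_old * (1 + tol):
            break
    return L @ T, T

def criterion(L):
    # Sum over columns of the variance of the squared loadings
    return (L ** 2).var(axis=0).sum()

# Unrotated principal-factor loadings from the subject-marks example
L = np.array([[-0.56, -0.30], [-0.58, -0.22], [-0.42, -0.36],
              [-0.67,  0.27], [-0.67,  0.24], [-0.58,  0.18]])
L_rot, T = varimax(L)
print(np.round(L_rot, 2))
```

Because T is orthogonal, the rotation leaves each row's sum of squared loadings, i.e. the communalities, unchanged; only the distribution of loading across factors moves toward simple structure.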

> fit.3 <- factanal(x,factors=3,rotation="varimax")

> fit.3

Call: factanal(x = x, factors = 3, rotation = "varimax")

Uniquenesses:

PINBALL BILLIARD GOLF X.1500M X.2KROW X.12MINTR BENCH CURL

0.635 0.414 0.455 0.361 0.520 0.538 0.302 0.536

MAXPUSHU


0.540

Loadings:

Factor1 Factor2 Factor3

PINBALL 0.131 0.590

BILLIARD 0.765

GOLF 0.735

X.1500M 0.779 -0.179

X.2KROW 0.585 0.372

X.12MINTR 0.678

BENCH -0.119 0.816 0.137

CURL 0.674

MAXPUSHU 0.433 0.522

Factor1 Factor2 Factor3

SS loadings 1.613 1.584 1.502

Proportion Var 0.179 0.176 0.167

Cumulative Var 0.179 0.355 0.522

Test of the hypothesis that 3 factors are sufficient.

The chi square statistic is 12.94 on 12 degrees of freedom.

The p-value is 0.373

As expected from the invariance of the factor model to orthogonal rotations the estimates of the communalities do not change after rotation.

We can "clean up" the factor pattern in several ways. One way is to hide small loadings, to reduce the visual clutter in the factor pattern. Another is to reduce the number of decimal places from 3 to 2. A third way is to sort the loadings to make the simple structure more obvious. The following command does all three:

> print(fit.3, digits = 2, cutoff = .2, sort = TRUE)

Call: factanal(x = x, factors = 3, rotation = "varimax")

Uniquenesses:

PINBALL BILLIARD GOLF X.1500M X.2KROW X.12MINTR BENCH CURL

0.64 0.41 0.46 0.36 0.52 0.54 0.30 0.54

MAXPUSHU

0.54

Loadings:

Factor1 Factor2 Factor3

X.1500M 0.78

X.2KROW 0.58 0.37

X.12MINTR 0.68

BENCH 0.82

CURL 0.67

MAXPUSHU 0.43 0.52

PINBALL 0.59

BILLIARD 0.76

GOLF 0.73

Factor1 Factor2 Factor3

SS loadings 1.61 1.58 1.50

Proportion Var 0.18 0.18 0.17

Cumulative Var 0.18 0.36 0.52

Test of the hypothesis that 3 factors are sufficient.

The chi square statistic is 12.94 on 12 degrees of freedom.


The p-value is 0.373

Now it is obvious that there are 3 factors. The traditional approach to naming factors is to:

Examine the variables that load heavily on the factor

Try to decide what construct is common to these variables

Name the factor after that construct

The first factor is something that is common to strong performance in a 1500 meter run, a 2000 meter row, and a 12 minute run. A good name for this factor might be "Endurance". The other two factors might be named "Strength" and "Hand-Eye Coordination".

Sometimes we may want to calculate an individual's score on the latent variable(s). In factor analysis this is not straightforward, because the factors are random variables with their own probability distribution. There are various methods for obtaining predicted factor scores; the function factanal produces scores only if a data matrix is supplied and used. The first method is the regression method of Thompson, the second the weighted least-squares method of Bartlett.

> scores_thomson<-factanal(x, factors = 3, scores = "regression")$scores
> scores_bartlett<-factanal(x, factors = 3, scores = "Bartlett")$scores
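Both scoring formulas are simple enough to write down directly: regression (Thompson) scores are F = X R^-1 Lambda, and Bartlett scores are F = X Psi^-1 Lambda (Lambda' Psi^-1 Lambda)^-1, where X holds standardized observations, Lambda the loadings and Psi the diagonal matrix of uniquenesses. The NumPy sketch below uses a small made-up loading matrix, not the factanal internals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-factor model for 5 standardized variables
Lambda = np.array([[0.8, 0.1], [0.7, 0.2], [0.6, 0.0],
                   [0.1, 0.8], [0.2, 0.7]])
psi = 1 - (Lambda ** 2).sum(axis=1)        # uniquenesses (diagonal of Psi)
R = Lambda @ Lambda.T + np.diag(psi)       # implied correlation matrix
X = rng.standard_normal((4, 5))            # 4 fake standardized observations

# Regression (Thompson) scores: F = X R^-1 Lambda
F_reg = X @ np.linalg.inv(R) @ Lambda

# Bartlett weighted least-squares scores:
# F = X Psi^-1 Lambda (Lambda' Psi^-1 Lambda)^-1
W = Lambda / psi[:, None]                  # Psi^-1 Lambda
F_bart = X @ W @ np.linalg.inv(Lambda.T @ W)
print(F_reg.shape, F_bart.shape)           # one row of scores per observation
```

A defining property of the Bartlett scores is that the weighted residuals are orthogonal to the loadings: Lambda' Psi^-1 (x - Lambda f) = 0 for every observation.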

Example

File intel_test.txt shows correlations between the scores of 75 children on 10 WPPSI intelligence tests:

X1: information
X2: vocabulary
X3: arithmetic
X4: similarities
X5: comprehension
X6: animal houses
X7: figures completion
X8: labyrinths
X9: geometric design
X10: block design

> cor.m<-as.matrix(read.table("c:\\temp\\intel_test.txt"))

>

> cor.m

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10

1 1.000 0.755 0.592 0.532 0.627 0.460 0.407 0.387 0.461 0.459

2 0.755 1.000 0.644 0.520 0.617 0.497 0.511 0.417 0.406 0.583

3 0.592 0.644 1.000 0.388 0.529 0.449 0.436 0.428 0.412 0.602

4 0.532 0.528 0.388 1.000 0.475 0.442 0.280 0.214 0.361 0.424

5 0.627 0.617 0.529 0.475 1.000 0.398 0.373 0.372 0.350 0.433

6 0.460 0.497 0.449 0.442 0.398 1.000 0.545 0.446 0.366 0.575

7 0.407 0.511 0.436 0.280 0.373 0.545 1.000 0.542 0.308 0.590

8 0.387 0.417 0.428 0.214 0.372 0.446 0.542 1.000 0.375 0.654

9 0.461 0.406 0.412 0.361 0.355 0.366 0.308 0.375 1.000 0.502

10 0.459 0.583 0.602 0.424 0.433 0.575 0.590 0.654 0.502 1.000

Looking at the correlation matrix, one can see a strong correlation between the 10 tests: all the correlation values are positive and mostly vary between 0.4 and 0.6.


Factor analysis according to a maximum likelihood approach:

> res<-factanal(covmat=cor.m,factors=2,n.obs=75,rotation="none")

>

> res

Call: factanal(factors = 2, covmat = cor.m, n.obs = 75, rotation = "none")

Uniquenesses:

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10

0.215 0.249 0.452 0.622 0.482 0.553 0.534 0.481 0.679 0.177

Loadings:

Factor1 Factor2

[1,] 0.789 -0.403

[2,] 0.834 -0.234

[3,] 0.740

[4,] 0.587 -0.185

[5,] 0.676 -0.247

[6,] 0.654 0.140

[7,] 0.641 0.235

[8,] 0.630 0.351

[9,] 0.564

[10,] 0.807 0.414

Factor1 Factor2

SS loadings 4.872 0.685

Proportion Var 0.487 0.069

Cumulative Var 0.487 0.556

Test of the hypothesis that 2 factors are sufficient.

The chi square statistic is 16.51 on 26 degrees of freedom.

The p-value is 0.923

Compute the percentage of variability in each variable that is explained by the model (the communalities):

> round(apply(res$loadings^2,1,sum),3)

[1] 0.785 0.751 0.548 0.378 0.518 0.447 0.466 0.519 0.321 0.823
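Both these communalities and the SS loadings table can be recovered from the printed loadings. A NumPy check (loadings typed in from the output above; the two entries hidden by the default 0.1 print cutoff are set to 0, so the match is only approximate):

```python
import numpy as np

# Unrotated ML loadings for the 10 WPPSI tests; blanks in the printout
# (|loading| < 0.1) entered as 0.0
L = np.array([[0.789, -0.403], [0.834, -0.234], [0.740, 0.0],
              [0.587, -0.185], [0.676, -0.247], [0.654, 0.140],
              [0.641, 0.235], [0.630, 0.351], [0.564, 0.0],
              [0.807, 0.414]])

communality = (L ** 2).sum(axis=1)     # explained variance per variable
ss_loadings = (L ** 2).sum(axis=0)     # column sums: the "SS loadings" row
prop_var = ss_loadings / L.shape[0]    # divide by the number of variables
print(np.round(communality, 3))
print(np.round(ss_loadings, 3), np.round(prop_var, 3))
```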

Rotate the factors with VARIMAX. Such a rotation acts on the factor loadings by increasing the contrast between the smaller weights, pushing them toward zero, and the larger weights, pushing them toward one.

> res.rot<-factanal(covmat=cor.m,factors=2,n.obs=75,rotation="varimax")

>

> res.rot

Call: factanal(factors = 2, covmat = cor.m, n.obs = 75, rotation = "varimax")

Uniquenesses:

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10

0.215 0.249 0.452 0.622 0.482 0.553 0.534 0.481 0.679 0.177


Loadings:

Factor1 Factor2

[1,] 0.852 0.245

[2,] 0.769 0.399

[3,] 0.563 0.481

[4,] 0.555 0.266

[5,] 0.662 0.281

[6,] 0.382 0.549

[7,] 0.308 0.609

[8,] 0.220 0.686

[9,] 0.375 0.424

[10,] 0.307 0.854

Factor1 Factor2

SS loadings 2.904 2.653

Proportion Var 0.290 0.265

Cumulative Var 0.290 0.556

Test of the hypothesis that 2 factors are sufficient.

The chi square statistic is 16.51 on 26 degrees of freedom.

The p-value is 0.923
