Revised: 2/17/2016

Chapter 5. Principal Components Analysis

5:1 Analysis of Principal Components

5:2 Introduction

PCA is a method to study the structure of the data, with emphasis on determining the patterns of covariances among variables. Thus, PCA is the study of the structure of the variance-covariance matrix. In practical terms, PCA is a method to identify variables or sets of variables that are highly correlated with each other. The results can be used for multiple purposes:

- To construct a new set of variables that are linear combinations of the original variables, that contain exactly the same information as the original variables, but that are orthogonal to each other.
- To identify patterns of multicollinearity in a data set and use the results to address the collinearity problem in multiple linear regression.
- To identify variables or factors, underlying the original variables, that are responsible for the variation in the data.
- To find the effective number of dimensions over which the data set exhibits variation, with the purpose of reducing the number of dimensions of the problem.
- To create a few orthogonal variables that contain most of the information in the data and that simplify the identification of groupings in the observations.

PCA is applied to a single group of variables; there is no distinction between explanatory and response variables. In multiple linear regression (MLR), PCA is applied only to the set of X variables, to study multicollinearity.

5:2.1 Example 1

Koenigs et al. (1982) used PCA to identify environmental gradients and to relate vegetation gradients to environmental variables. Forty-eight environmental variables, including soil chemical and physical characteristics, were measured in 40 different sites. In addition, over 17 vegetation variables, including % cover, height, and density, were measured in each site. PCA was applied both to the environmental and the vegetational variables. The first 3 PC's of the vegetational variables explained 73.3% of all the variation in the data. These 3 dimensions are interpreted as the gradients naturally "perceived" by the plant community, and they were well correlated with environmental variables related to soil moisture.

5:2.2 Example 2

Jeffers (J.N.R. Jeffers. 1967. Two case studies in the application of principal component analysis. Applied Statistics 16:225-236) applied PCA to a sample of 40 winged aphids on which 19 different morphological characteristics had been measured. The characteristics measured included body length, body width, forewing length, leg length, length of various antennal segments, number of spiracles, etc. Winged aphids are difficult to identify, so the study used PCA to determine the number of distinct taxa present in the sample. Although PCA is not a formal procedure to define clusters or groups of observations, it simplifies the data so they can be inspected graphically. The first two PC's explained 85% of the total variance in the correlation matrix. When the 40 samples were plotted on the first two PC's, they formed 4 major groups distributed in an "interesting" S shape. Although 19 traits were measured, the data only contained information equivalent to 2-3 independent traits.

Figure 5-1. Use of principal components to facilitate the identification of taxa of aphids.

5:3 Model and concept

PCA does not have a model to be tested, although it is assumed that the variables are linearly related.
The analysis can be thought of as looking at the same set of data from a different perspective. The perspective is changed by moving the origin of the coordinate system to the centroid of the data and then rotating the axes. Given a set of p variables (X1, ..., Xp), PCA calculates a set of p linear combinations of the variables (PC1, ..., PCp) such that:

- The total variation in the new set of variables or principal components is the same as in the original variables.
- The first PC contains as much variance as possible, i.e., as much variance as can be captured in a single axis.
- The second PC is orthogonal to the first one (their correlation is 0), and contains as much of the remaining variance as possible.
- The third PC is orthogonal to all previous PC's and also contains as much of the remaining variance as possible, and so on for the rest of the PC's.

This is achieved by calculating a matrix of coefficients whose columns are the eigenvectors of the variance-covariance matrix or of the correlation matrix of the data set. Some basic consequences of the procedure are:

- All original variables are involved in the calculation of the PC scores (i.e., the location of each observation in the new set of axes formed by the PC's).
- The sum of the variances of the PC's equals the sum of the variances of the original variables when PCA is based on the variance-covariance matrix, or the sum of the variances of the standardized variables when PCA is based on the correlation matrix.
- There are p eigenvalues (p = number of variables in the data), each one associated with one eigenvector and one PC. These eigenvalues are the variances of the data along each PC. Thus, the sum of the eigenvalues based on the variance-covariance matrix is equal to the sum of the variances of the original variables.
- PCA based on the correlation matrix is equivalent to PCA based on the variance-covariance matrix of the standardized variables. Because standardized variables have variance = 1, the sum of the eigenvalues is p, the number of variables.

5:4 Assumptions and potential problems

5:4.1 Normality

For descriptive PCA no specific distribution is assumed. If the variables have a multivariate normal distribution, the results of the analysis are enhanced and tend to be clearer. Normality can be assessed by using the Analyze -> Distribution platform in JMP, or PROC UNIVARIATE in SAS. Transformations can be applied to approach normality as described in Figure 4.6 of Tabachnick and Fidell (1996). Multivariate normality can be assessed by looking at the pairwise scatterplots. If the variables are normal and linearly related, the data will tend to exhibit multivariate normality. A stricter test of multivariate normality can be carried out by calculating the jackknifed squared Mahalanobis distance for each observation and then testing the hypothesis that its distribution is a chi-square (χ²) distribution with as many degrees of freedom as variables considered in the PCA. This test is very sensitive, so it is recommended that a very low α be used (e.g., 0.001 or 0.0005).

Open the file spartina.jmp. In JMP, select the Analyze -> Multivariate platform and include all variables in the Response box; then click OK. In the results of the Multivariate platform, select Outlier Analysis. This will display the Mahalanobis distance for all points in the data set. In the picture below, the observations were already sorted by increasing distance, so the plot looks like an ordered string of dots.

Click on the red triangle by Outlier Analysis and select Save Jackknife Distance.
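For readers working outside JMP, the following minimal Python sketch (not part of the original handout) computes the jackknifed squared Mahalanobis distances directly. The file name spartina.csv and the use of pandas/numpy are assumptions; the 14 variable names match the PROC PRINCOMP code shown later in this chapter.

import numpy as np
import pandas as pd

# Assumed CSV export of spartina.jmp containing the 14 variables used below.
cols = ["h2s", "sal", "eh7", "ph", "acid", "p", "k",
        "ca", "mg", "na", "mn", "zn", "cu", "nh4"]
X = pd.read_csv("spartina.csv")[cols].to_numpy(dtype=float)
n, p = X.shape

# Jackknifed squared Mahalanobis distance: for each observation, the mean and
# the variance-covariance matrix are computed with that observation held out.
D2 = np.empty(n)
for i in range(n):
    rest = np.delete(X, i, axis=0)
    diff = X[i] - rest.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(rest, rowvar=False))
    D2[i] = diff @ S_inv @ diff

# Under multivariate normality, D2 should behave approximately like a
# chi-square variable with p degrees of freedom (here p = 14).
print(pd.Series(D2).describe())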
In JMP, this creates a new column in the data table, labeled Jackknife Distance. Create a new column in which you calculate the squared jackknifed Mahalanobis distance, which in the next picture is labeled Dsq. Note that the two labels, "Mahalanobis" and "Jackknife," both refer to the same statistical distance D. The difference is that the latter is calculated using the jackknife procedure, whereby D for each observation is calculated while holding that observation out of the data used to obtain the variance-covariance matrix. D² is the variable that should be distributed like a χ² random variable with p degrees of freedom, where p = 14 in this case.

In order to determine how close D² is to a χ², it is necessary to create a column of expected quantiles for the χ². For this, the data must be sorted in increasing order of D². The column with the χ² quantile is created with a formula as indicated in the next figure. Conceptually, the χ² quantile is the value of a χ² with p degrees of freedom that is greater than x% of the values, where x is the proportion of rows (observations) with lower values of D².

For the final step, simply regress D² on the χ² quantile. The line should be straight, with slope 1 and a zero intercept. In the specific Spartina example, there is at least one outlier that throws off multivariate normality.

5:4.2 Linearity

PCA assumes (i.e., only accounts for) linear relationships among variables. Lack of linearity works against the ability to "concentrate" variation in few PC's. Linearity can be examined by looking at the pairwise scatterplots. If two variables are not linearly related, a transformation is applied. The variable to be transformed should be carefully selected so as not to disrupt its linearity with the rest of the variables.

5:4.3 Sample size

A potential problem of PCA is that results are not reliable (i.e., they differ from sample to sample) if sample sizes are small. The problem is not as grave as for factor analysis, and it diminishes as the variables exhibit higher correlations. Although some textbooks recommend 300 cases or observations, this is probably more appropriate for social studies where many variables cannot be measured directly (e.g., intelligence). In agriculture and biology, scientists routinely use data sets with 30 or more observations and the results pass peer review.

5:4.4 Outliers

Multivariate outliers can be a major problem in PCA, because just one or a few observations can completely distort the results. As indicated in the Data Screening topic, multivariate outliers can be identified by testing the jackknifed squared Mahalanobis distance. Transformations can help. If outliers remain after transformations, observations can be considered for deletion, but this has to be fully reported, and it has to be understood that eliminating observations just because they do not fit the rest of the data can have negative implications for the correspondence between sample and population. If the sample size is very large, then deletion of a few outliers will not severely restrict the applicability of the results.
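Continuing the Python sketch above, the χ² comparison described in 5:4.1 and the outlier screening of 5:4.4 could be done along the following lines. This is only an illustration: the (i - 0.5)/n plotting positions are one common convention and may not match the exact formula used in the JMP column.

from scipy import stats
import numpy as np

# Continues the sketch above (D2 and p already computed).
n = len(D2)
D2_sorted = np.sort(D2)

# Expected chi-square quantiles for the ordered D2 values.
probs = (np.arange(1, n + 1) - 0.5) / n        # plotting positions (one common choice)
chi2_q = stats.chi2.ppf(probs, df=p)

# The regression of D2 on the chi-square quantile should have slope ~1 and
# intercept ~0 if the data are multivariate normal.
slope, intercept = np.polyfit(chi2_q, D2_sorted, 1)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))

# Flag candidate multivariate outliers using a very low alpha (e.g., 0.001).
cutoff = stats.chi2.ppf(1 - 0.001, df=p)
print("possible outliers (row indices):", np.where(D2 > cutoff)[0])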
5:5 Geometry of PCA

The principal components are obtained by rotating a set of orthogonal (perpendicular and independent) axes, pivoting at the centroid of the data. First, the direction that maximizes the variance of the scores (the perpendicular projections of the data points on the moving axes) on the first axis is determined, and that axis (PC1) is then fixed in that position. The rotation continues, with PC1 now acting as the axis of rotation, until the second axis maximizes the variance of the scores; that is the position for PC2. The procedure continues until PC p-1 is set. Because of the orthogonality, setting PC p-1 also sets the last PC.

5:6 Procedure for analysis

5:6.1 JMP procedure

There are two ways to get principal components in JMP: through the Multivariate platform and through the Graph -> Spinning Plot platform. Both give all the numerical information. The Spinning Plot also displays a biplot, which is described below. When using the Spinning Plot, make sure you select Principal Components and not "Std Principal Components." The latter gives the standardized PC scores.

5:6.2 SAS code and output

proc princomp data=spartina out=spartpc;
  var h2s sal eh7 ph acid p k ca mg na mn zn cu nh4;
run;

In this example, PCA is done on the correlation matrix, which is equivalent to saying that the PC's were calculated on the basis of the standardized variables. Using the correlation matrix is the default option for PROC PRINCOMP, and it is the most common choice. Alternatively, the analysis can use the covariance matrix, which just centers the data (i.e., all the principal components go through the centroid of the sample). The rationale for choosing the correlation or the covariance matrix for PCA is discussed below.

Box 1. Simple Statistics

Principal Component Analysis
45 Observations
14 Variables

Variable        Mean              StD
H2S        -601.7777778       30.6956385
SAL          30.26666667       3.71972629
EH7        -314.4000000       36.9559935
PH            4.602222222      1.246994366
ACID          3.861777778      2.506354913
P            32.29688889      27.58669395
K           797.6228889      297.6023371
CA         2365.318889      1718.327317
MG         3075.109333       939.406676
NA        16596.71111       6882.42337
MN           38.10054667      24.48057096
ZN           17.87524000       8.27980582
CU            3.988576667      1.036991704
NH4          87.45520000      47.27275022

Each eigenvalue represents the amount of variation in the original sample that is explained by the corresponding PC. In this example PCA was based on the correlations, that is, on the standardized variables. Each standardized variable has a mean of zero and a variance of 1. Thus the sum of the variances of the original variables is equal to the number of variables, and the first PC accounts for 4.924/14, or 0.3517, of the total sample variance.

The eigenvectors are vectors of coefficients that can be used to obtain the values of the projections of each observation on each new axis or PC.
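As a hedged illustration of what PROC PRINCOMP computes from the correlation matrix, the following Python sketch (not part of the original handout; it assumes the same spartina.csv export used earlier) obtains the eigenvalues, the proportion of variance per PC, and the scores. The results should agree with Boxes 3 and 4 below, except that the sign of each eigenvector (and of its scores) is arbitrary.

import numpy as np
import pandas as pd

cols = ["h2s", "sal", "eh7", "ph", "acid", "p", "k",
        "ca", "mg", "na", "mn", "zn", "cu", "nh4"]
X = pd.read_csv("spartina.csv")[cols]          # assumed CSV export of spartina.jmp

# PCA on the correlation matrix = PCA on the covariance matrix of the
# standardized variables.
Z = (X - X.mean()) / X.std(ddof=1)             # standardized data matrix (n x p)
R = np.corrcoef(Z, rowvar=False)               # correlation matrix (p x p)

eigval, eigvec = np.linalg.eigh(R)             # eigh: for a symmetric matrix
order = np.argsort(eigval)[::-1]               # sort from largest to smallest
eigval, eigvec = eigval[order], eigvec[:, order]

print(eigval.sum())                            # equals p (= 14) for a correlation matrix
print(eigval / eigval.sum())                   # proportion of variance per PC (Box 3)

# Scores: W = Z V (worked out term by term in Table 1 below).
W = Z.to_numpy() @ eigvec
print(W.var(axis=0, ddof=1)[:3])               # variances of the scores = eigenvalues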
Box 2. Correlation Matrix

        H2S     SAL     EH7     PH      ACID    P       K       CA      MG      NA      MN      ZN      CU      NH4
H2S   1.0000  0.0958  0.3997  0.2735  -.3738  -.1154  0.0690  0.0933  -.1078  -.0038  0.1415  -.2724  0.0127  -.4262
SAL   0.0958  1.0000  0.3093  -.0513  -.0125  -.1857  -.0206  0.0880  -.0100  0.1623  -.2536  -.4208  -.2660  -.1568
EH7   0.3997  0.3093  1.0000  0.0940  -.1531  -.3054  0.4226  -.0421  0.2985  0.3425  -.1113  -.2320  0.0945  -.2390
PH    0.2735  -.0513  0.0940  1.0000  -.9464  -.4014  0.0192  0.8780  -.1761  -.0377  -.4751  -.7222  0.1814  -.7460
ACID  -.3738  -.0125  -.1531  -.9464  1.0000  0.3829  -.0702  -.7911  0.1305  -.0607  0.4204  0.7147  -.1432  0.8495
P     -.1154  -.1857  -.3054  -.4014  0.3829  1.0000  -.2265  -.3067  -.0632  -.1632  0.4954  0.5574  -.0531  0.4897
K     0.0690  -.0206  0.4226  0.0192  -.0702  -.2265  1.0000  -.2652  0.8622  0.7921  -.3475  0.0736  0.6931  -.1176
CA    0.0933  0.0880  -.0421  0.8780  -.7911  -.3067  -.2652  1.0000  -.4184  -.2482  -.3090  -.6999  -.1122  -.5826
MG    -.1078  -.0100  0.2985  -.1761  0.1305  -.0632  0.8622  -.4184  1.0000  0.8995  -.2194  0.3452  0.7121  0.1082
NA    -.0038  0.1623  0.3425  -.0377  -.0607  -.1632  0.7921  -.2482  0.8995  1.0000  -.3101  0.1170  0.5601  -.1070
MN    0.1415  -.2536  -.1113  -.4751  0.4204  0.4954  -.3475  -.3090  -.2194  -.3101  1.0000  0.6033  -.2335  0.5270
ZN    -.2724  -.4208  -.2320  -.7222  0.7147  0.5574  0.0736  -.6999  0.3452  0.1170  0.6033  1.0000  0.2121  0.7207
CU    0.0127  -.2660  0.0945  0.1814  -.1432  -.0531  0.6931  -.1122  0.7121  0.5601  -.2335  0.2121  1.0000  0.0137
NH4   -.4262  -.1568  -.2390  -.7460  0.8495  0.4897  -.1176  -.5826  0.1082  -.1070  0.5270  0.7207  0.0137  1.0000

The logic behind the use of the eigenvectors is just a change of axes: just as the location of an observation can be expressed as a vector of p dimensions (in the example p = 14), where the p dimensions are the measured variables, each observation can also be expressed as a vector of p PC values or scores. The scores for all PC's for all observations can be saved into a SAS data set by using the OUT=filename option in PROC PRINCOMP. This option creates a SAS file with all the information contained in the file specified by the DATA= option, plus the scores for all observations on all PC's. In JMP, from the Spinning Plot red triangle, select Save Principal Components.

To further explain the eigenvectors, consider the first observation in the Spartina data set. The score for that observation on PC1 can be calculated by multiplying the standardized value of each variable by the corresponding element of the first column of the matrix of eigenvectors and adding all the terms, as shown in Table 1.

Box 3. Eigenvalues

Eigenvalues of the Correlation Matrix
          Eigenvalue   Difference   Proportion   Cumulative
PRIN1       4.92391      1.22868     0.351708      0.35171
PRIN2       3.69523      2.08810     0.263945      0.61565
PRIN3       1.60713      0.27222     0.114795      0.73045
PRIN4       1.33490      0.64330     0.095350      0.82580
PRIN5       0.69160      0.19103     0.049400      0.87520
PRIN6       0.50057      0.11513     0.035755      0.91095
PRIN7       0.38544      0.00467     0.027531      0.93848
PRIN8       0.38077      0.21480     0.027198      0.96568
PRIN9       0.16597      0.02298     0.011855      0.97754
PRIN10      0.14299      0.05613     0.010214      0.98775
PRIN11      0.08687      0.04158     0.006205      0.99395
PRIN12      0.04529      0.01544     0.003235      0.99719
PRIN13      0.02985      0.02036     0.002132      0.99932
PRIN14      0.00949      .           0.000678      1.00000
Box 4. Eigenvectors

         PRIN1     PRIN2     PRIN3     PRIN4     PRIN5     PRIN6     PRIN7
H2S    -.163637  0.009086  0.231669  0.689722  0.014386  -.419348  0.300094
SAL    -.107894  0.017324  0.605727  -.270389  0.508742  0.010076  0.383770
EH7    -.123813  0.225247  0.458251  0.301313  -.166758  0.596651  -.296867
PH     -.408217  -.027467  -.282670  0.081726  0.091618  0.191256  0.056897
ACID   0.411680  -.000362  0.204919  -.165831  -.162713  -.024061  0.117085
P      0.273196  -.111277  -.160543  0.199965  0.747115  -.017903  -.336928
K      -.033446  0.487887  -.022907  0.043000  -.061998  -.016587  -.067421
CA     -.358562  -.180445  -.206595  -.054385  0.206152  0.427579  0.104949
MG     0.079033  0.498653  -.049515  -.036561  0.103793  0.034182  -.044195
NA     -.017130  0.470439  0.050575  -.054358  0.239519  -.060440  -.181661
MN     0.277082  -.182164  0.019849  0.483078  0.038899  0.299511  0.124567
ZN     0.404195  0.088823  -.176373  0.150047  -.007768  0.034351  -.072907
CU     -.010788  0.391707  -.376740  0.102023  0.063434  0.077993  0.562581
NH4    0.398754  -.025968  -.010607  -.104087  -.005857  0.381686  0.395252

         PRIN8     PRIN9     PRIN10    PRIN11    PRIN12    PRIN13    PRIN14
H2S    -.073755  0.168302  0.295840  0.222927  -.015407  0.006864  -.079812
SAL    0.100873  -.175066  -.227621  0.088425  -.156210  -.094878  0.089376
EH7    -.312742  -.226136  0.083754  -.023086  0.055421  -.033492  -.023123
PH     -.029538  0.023918  0.146959  0.041662  -.331152  0.025938  0.750134
ACID   -.152610  0.095416  0.101118  0.344782  0.455459  0.351392  0.477337
P      -.398662  0.077828  -.017685  -.034542  0.064822  0.065467  0.014741
K      -.115096  0.559085  -.555004  0.217893  -.030301  -.249524  0.072785
CA     0.185889  0.186412  0.073763  0.511310  0.346574  0.079545  -.307040
MG     0.170996  -.011293  0.111582  0.118799  -.397791  0.690127  -.192283
NA     0.449939  0.088170  0.439200  -.216233  0.363391  -.276211  0.143663
MN     0.531706  0.086117  -.361647  -.269913  0.077826  0.172893  0.140813
ZN     0.208525  -.439455  0.014406  0.568635  -.222750  -.396331  0.041311
CU     -.277074  -.376706  -.129195  -.192872  0.305087  -.000372  -.043094
NH4    -.145025  0.420100  0.393717  -.130247  -.301510  -.230796  -.117317

Table 1. Example showing how to calculate the PC1 score for observation 1. The values of the original variables are standardized because this PCA was performed on the correlation matrix.

PC1_1 = -0.164*[-610.00-(-601.78)]/30.70
      + -0.108*[33.00-(30.27)]/3.72
      + -0.124*[-290.00-(-314.40)]/36.96
      + -0.408*[5.00-(4.60)]/1.25
      + 0.412*[2.34-(3.86)]/2.51
      + 0.273*[20.24-(32.30)]/27.59
      + -0.033*[1441.67-(797.62)]/297.60
      + -0.359*[2150.00-(2365.32)]/1718.33
      + 0.079*[5169.05-(3075.11)]/939.41
      + -0.017*[35184.50-(16596.71)]/6882.42
      + 0.277*[14.29-(38.10)]/24.48
      + 0.404*[16.45-(17.88)]/8.28
      + -0.011*[5.02-(3.99)]/1.04
      + 0.399*[59.52-(87.46)]/47.27
      = -1.097

In matrix notation the calculation of PC scores is straightforward: simply multiply the matrix of standardized values in the original axes, called Z (the standardized data matrix), by the matrix of eigenvectors V to obtain an n x p matrix of scores W:

W = Z V

The calculations in matrix notation are illustrated in the file HOW03.xls.

5:6.3 Loadings

"Loadings" are the correlations between each one of the original variables and each one of the principal components. Therefore, there are as many loadings as coefficients in the matrix of eigenvectors. Loadings can be used to interpret the results of the PCA, because a high loading for a variable in a PC indicates that the PC has a strong common component or relationship with that variable.
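Both ways of obtaining the loadings (from the eigenvectors and eigenvalues, using the formula given below, or by correlating the saved scores with the original variables) can be sketched in Python, continuing the code above. This is a hedged illustration, not the JMP/SAS output itself.

import numpy as np

# Continues the earlier sketch (Z, eigval, eigvec, W already computed).

# Loadings from the eigenvectors: r_ij = w_ij * s_j / s_i, where s_j = sqrt(eigenvalue j)
# and s_i = 1 because the variables are standardized (correlation-based PCA).
loadings = eigvec * np.sqrt(eigval)

# Equivalent check: correlate the scores with the standardized variables.
r_check = np.array([[np.corrcoef(Z.iloc[:, i], W[:, j])[0, 1]
                     for j in range(W.shape[1])] for i in range(Z.shape[1])])
print(np.allclose(loadings, r_check))          # True

# Example from the text: loading of H2S on PC1 = -0.163637*sqrt(4.92391) = -0.3631
print(loadings[0, 0])                          # may differ in sign from the SAS value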
Loadings are interpreted by looking at the set of loadings for each PC and identifying groups of variables with loadings that are large and negative and groups with loadings that are large and positive. This is then used to interpret the PC as an underlying factor that reflects increases in the variables with positive loadings and decreases in the variables with negative loadings. Loadings can be calculated as a function of the eigenvectors and the standard deviations of the PC's and the original variables, or they can be calculated directly by saving the PC scores and correlating them with the original variables.

r_ij = w_ij * s_j / s_i

where r_ij is the loading of variable i on PC j, w_ij is the element of the eigenvector that goes with variable i in PC j, and s_i and s_j are the standard deviations of variable i and of PC j (the standard deviation of PC j is the square root of its eigenvalue). In the Spartina example, the loading for H2S on PC1 is -0.163637*sqrt(4.92391) = -0.3631. The loading for Mg on PC2 is 0.498653*sqrt(3.69523) = 0.9586.

5:6.4 Interpretation of results

Interpretation of the results depends on the main goal of the PCA. We will consider two main types of goals:

1. Reduction of the dimensionality of the data set and/or identification of underlying factors.
2. Analysis of collinearity among X variables in a regression problem.

5:6.4.1 Identification of fewer underlying factors

In the first case, the interpretation depends on whether the analysis identifies a clear subset of PC's that explain a large proportion of all of the variance. When this is the case, it is possible to try to interpret the most important PC's as underlying factors. In the Spartina example, the first two principal components represent two systems of variables that tend to vary together. PC1 is associated with pH, buffer acidity, NH4, and Ca, and appears to represent a gradient of acidity-alkalinity. Similar interpretations can be reached for the next two principal components.

The interpretation is simplified by using Gabriel's biplots. A Gabriel's biplot has two components (hence the name biplot) plotted on the same set of axes represented by a pair of PC's (typically PC1 vs. PC2): a scatterplot of the observations, and a plot of vectors that represent the loadings or correlations of each original variable with each PC in the biplot. The first element is obtained by plotting PC1 vs. PC2 for each observation as dots on the graph. The second element is obtained by drawing a vector, for each variable, that goes from the origin (0,0) to the point (r_X,PC1, r_X,PC2), where r_X,PC1 and r_X,PC2 are the loadings of the variable X with PC1 and PC2, respectively. Therefore, the Gabriel's biplot has as many points as observations and as many vectors as variables in the data set. For the Spartina example, there are 45 points and 14 vectors (Figure 1). Groups of vectors that point in about the same (or directly opposite) directions indicate variables that tend to change together.

Note that JMP draws the "rays" or vectors in the Spinning Plot by linking the origin to the PC scores or coordinates of a fictitious point that is at the average of all original variables except for the one the ray represents, for which it has a value equal to 3 standard deviations. This facilitates viewing the rays and the scatter of points together, and it preserves the interpretability of the relative lengths of the vectors, but the vectors no longer have a total length of 1 (over all PC's).
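A simple loading-based biplot can be sketched with Python/matplotlib, continuing the code above. This is only an approximation of Figure 1 and does not reproduce JMP's 3-standard-deviation ray convention; the scaling of the scores is an arbitrary choice so that points and rays fit on the same axes.

import matplotlib.pyplot as plt

# Continues the earlier sketch (W = scores, loadings, cols defined above).
fig, ax = plt.subplots(figsize=(7, 7))

# Points: observations plotted on their PC1 and PC2 scores, scaled to fit with
# the loading vectors (which lie between -1 and 1).
scale = abs(W[:, :2]).max()
ax.scatter(W[:, 0] / scale, W[:, 1] / scale, s=15, color="gray")

# Rays: one vector per variable, from the origin to (loading on PC1, loading on PC2).
for i, name in enumerate(cols):
    ax.arrow(0, 0, loadings[i, 0], loadings[i, 1], head_width=0.02, color="black")
    ax.annotate(name, (loadings[i, 0], loadings[i, 1]))

ax.axhline(0, lw=0.5); ax.axvline(0, lw=0.5)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
plt.show()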
I graphed the loadings for the Spartina example in the HW03.xls file to illustrate this point, with diamonds marking what would be the tips of the rays in a biplot.

Why do the vectors have a length equal to 1.0? Think of the vector for one of the variables, say pH, and imagine it poking through the 14-dimensional space formed by the principal axes or components. The length of the vector is the length of the hypotenuse of a right triangle. By applying the Pythagorean theorem several times, one can calculate the squared length of the vector, which is the sum of the squares of its 14 coordinates. Recall that each coordinate is the correlation between the corresponding PC and pH, because of the way the vector for pH was constructed. Therefore, the square of each coordinate is the R2, or proportion of the variance in pH explained by each PC. Since the PC's are orthogonal to each other and they contain all of the variance in the sample, no portion of the variance of pH is explained by more than one component, and the sum of the variance explained by all components equals the total variance in pH. Therefore, the sum of the individual r2's, which is the squared length of the vector, must equal 1, and so must the length of the vector itself.

The points on the plot help to see how the observations may form different groups or vary along the "gradients" represented by the combination of both PC's. Vectors with a length close to 1 (say > 0.85) represent variables that have a strong association with the two components, i.e., the two PC's capture a great deal of the variation of the original variable. When the vector length is close to 1, the relationships between that vector and others that are also close to 1 will be accurately displayed in the plot. The direction of a vector shows the sign of the correlations (loadings). Moreover, the angle between any two vectors shows the degree of correlation between the variation of the two variables that is captured on the PC1-PC2 plane (in fact, r = cos[angle]). When vectors tend to form "bundles" they can be interpreted as systems of variables that describe the gradient. For example, Ca, pH, NH4, acid, and Zn form such a system.

Figure 1. Gabriel's biplot for the Spartina example. Numbers next to each point indicate the location of the sample.

5:6.4.2 Identification of collinearity

When the PCA is performed on a set of X variables that are considered as explanatory variables for a multiple linear regression problem, the interpretation is different from the above. The main goal in this case is to determine whether there are variables in the set that tend to be almost perfect linear combinations of other variables in the set. These variables have to be identified and considered for deletion.

Identification of variables that may be causing a collinearity problem is achieved by calculating the Condition Number (CN) for each PC. Keep in mind that collinearity is a problem for MLR, not for PCA; we use PCA to work on the MLR problem.

CN_i = sqrt(λ_1 / λ_i)

Hence, the CN for PCi is the square root of the quotient between the largest eigenvalue and the eigenvalue for the PC under consideration. For the Spartina example:

Eigenvalue   Condition number
  4.924           1.00
  3.695           1.15
  1.607           1.75
  1.335           1.92
  0.692           2.67
  0.501           3.14
  0.385           3.57
  0.381           3.60
  0.166           5.45
  0.143           5.87
  0.087           7.53
  0.045          10.43
  0.030          12.83
  0.010          22.77

A value of CN of 30 or greater identifies a PC implicated in the collinearity. The variables involved can be identified by requesting the COLLINOINT option in PROC REG in SAS.
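Continuing the Python sketch, the condition numbers in the table above can be reproduced directly from the eigenvalues; the CN >= 30 cutoff is the rule of thumb quoted in the text.

import numpy as np

# Continues the earlier sketch (eigval already sorted in decreasing order).
CN = np.sqrt(eigval[0] / eigval)
for lam, cn in zip(eigval, CN):
    print(f"{lam:8.3f}  {cn:8.2f}")

# Flag components implicated in collinearity using the CN >= 30 rule of thumb.
print("PC's with CN >= 30:", np.where(CN >= 30)[0] + 1)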
5:7 Some issues and potential problems in PCA

5:7.1 Use of correlation or covariance matrix?

The most typical choice is to use the correlation matrix to perform PCA, because this removes the impact of differences in the units used to express the different variables. When the covariance matrix is used (by choosing Principal Components on Covariance in JMP or by specifying the COV option in the PROC PRINCOMP statement in SAS), those variables that are expressed in units that yield values of large numerical magnitude will tend to dominate the first PC's. A change of units in a variable, say from g to kg, will tend to reduce its contribution to the first PC's. In most cases this would be an undesirable artifact, because the results would depend on the units used. A nice example of a situation where the use of the covariance matrix is recommended is given by Lattin et al. (2003), page 112 (Lattin, J., J. Douglas Carroll, and Paul E. Green. 2003. Analyzing Multivariate Data. Thomson Brooks/Cole).

5:7.2 Interpretation depends on goal

In a sense, when PCA is performed to identify underlying factors and to reduce the dimensionality of the problem, one hopes to find a high degree of relation among some subgroups of variables. On the other hand, in MLR one hopes to find that all measured X variables will increase our ability to explain Y. In the case of multiple linear regression (MLR), it is desirable to have many eigenvalues close to 1.

5:7.3 Interpretability

One of the main problems with PCA is that, oftentimes, the new axes identified are difficult to interpret, and they may all involve a combination with a significant "component" of each original variable. There is no formal procedure to interpret PC's or to deal with lack of interpretability. Interpretation can be easier if the problem allows rotation of the PC's to transform them into more understandable "factors." This type of analysis, closely related to PCA, is called Factor Analysis, and it is widely used in the social sciences.

5:7.4 How many PC's should be retained?

In using PCA for reduction of dimensionality, one must decide how many components to keep for further analyses and explanation. In terms of visual presentation of results, it is very hard to convey results in more than 2 or 3 dimensions. There are at least three options for deciding how many PC's to use: scree plots, retaining PC's whose eigenvalues exceed a critical value, and retaining sufficient PC's to account for a critical proportion of the total variance in the original data.

5:7.4.1 Scree plot

The scree plot is a graph of the eigenvalues in decreasing order. The y-axis shows the eigenvalues and the x-axis shows their order. The graph is inspected visually to identify "elbows," and the location of these breaks in the line is used to select a given number of PC's.

Figure 5-2. Use of a scree plot to decide how many PC's to retain.

This choice is subjective, and focuses on finding "breaks" in the continuity of the line. In this case, keeping 3 or 5 PC's would be acceptable choices.

5:7.4.2 Retain if > average

When PCA is based on the correlation matrix, the sum of the eigenvalues equals p, the number of variables. Thus, the average value of the eigenvalues is 1.0. Any PC whose eigenvalue is greater than 1 explains more than the average amount of variation, and can be kept.

5:7.4.3 Retain as many as necessary for 80%
Finally, a sufficient number of PC's can be retained to account for a desired proportion of the total variance in the original data. This proportion can be chosen subjectively.
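The three retention criteria of section 5:7.4 can be sketched in Python as follows, continuing the earlier code. The 80% cutoff is only an example of a subjectively chosen proportion.

import numpy as np
import matplotlib.pyplot as plt

# Continues the earlier sketch (eigval sorted in decreasing order).
k = np.arange(1, len(eigval) + 1)

# 1. Scree plot: look for an "elbow" in the sequence of eigenvalues.
plt.plot(k, eigval, marker="o")
plt.xlabel("Component"); plt.ylabel("Eigenvalue")
plt.show()

# 2. Eigenvalue-greater-than-average rule (average = 1 for a correlation matrix).
print("Retain (eigenvalue > 1):", int((eigval > 1).sum()))

# 3. Smallest number of PC's whose cumulative proportion of variance reaches 80%.
cum_prop = np.cumsum(eigval) / eigval.sum()
print("Retain (cumulative >= 0.80):", int(np.argmax(cum_prop >= 0.80) + 1))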