Principal Components Analysis (PCA) Part I

CHAPTER 7: PRINCIPAL COMPONENTS ANALYSIS (PCA) PART 1

Purpose:
In this lab, you will learn how to conduct and interpret Principal Components Analysis (PCA). It is part of a group of techniques called ordinations. PCA is a quantitative technique for describing populations with multiple types of measurements for each observation. PCA is a descriptive technique and, as such, has no hypotheses associated with the procedure. It is a technique that allows you to reduce values from several variables into a single variable (data reduction). It will allow you to look at a point on a graph and describe that observation in terms of all of the measured variables. The analysis is useful for finding patterns in data with multiple continuous variables. The output (Component scores) from the analysis can be used in other statistical procedures.

Background:
PCA is a descriptive technique for situations in which there is more than one measured continuous variable. It generates equations called principal components that describe the variation in the data. The first component describes the axis of the greatest variation; the next describes the next greatest variation in your data but is independent of the first component, etc. There are as many components as there are variables.

[Figure 7-1: Height and width of limpet shells. x-axis: Height; y-axis: Width.]

For example, let's assume that you are measuring width and height of limpet shells (a limpet is a marine snail with a conical shell) and that Figure 7-1 represents your measurements of shell width and height. Figure 7-2 illustrates both the first and second principal components. The first principal component (also referred to as a factor) would be the axis or line that indicates the greatest difference among the points (largest variation).
In this example, the smallest limpets in both height and width are found at one end of the first principal component and the largest limpets in both height and width are found at the other end. The second principal component is at right angles to the first and describes the next greatest dimension. In this example, limpets at one end of the second principal component have a large width and a small height while limpets at the other end have a small width and a large height. There are only two measurements so there are only two components.

[Figure 7-2: First and Second Principal Components or Factors. x-axis: Height; y-axis: Width; the two lines are labeled Component 1 (Factor 1) and Component 2 (Factor 2).]

In Figure 7-3, the data have been remapped (rotated) so that the first and second principal components now form the x- and y-axes. Notice that low values (i.e. scores) for Component 1 indicate limpets that are narrow and short (small), whereas high scores for Component 1 indicate wide and tall limpets (large). High scores for Component 2 indicate limpets that are short and wide whereas low scores for Component 2 indicate tall and narrow limpets. By using values for both Component 1 and Component 2, one can determine the shape of a particular limpet relative to all of the others.

[Figure 7-3: Remapped or rotated data using Component 1 and Component 2 as coordinates. Points A, B and C mark individual limpets.]

The limpet marked as A in Figure 7-3 has a low score for Component 1, indicating it is one of the smaller limpets, and it has a higher score for Component 2, indicating that it is a little flattened. The limpet indicated by B would be characterized as a medium-sized limpet that is very flat. The limpet indicated by C would be characterized as one of the larger limpets and is very tall and narrow.

Equations for Principal Components
The Principal Components Analysis determines the position of the components, generates an equation for each component, and then converts the original data values to scores.
The equation for a principal component in the example is:

\[
\text{Component score} = \alpha \left( \frac{Y_{Height} - \bar{Y}_{Height}}{s_{Height}} \right) + \beta \left( \frac{Y_{Width} - \bar{Y}_{Width}}{s_{Width}} \right)
\]

where \alpha and \beta are coefficients computed in the analysis, \bar{Y} is the mean of a variable and s is its standard deviation.

Let's assume that we want to compute a Component 1 score for a limpet that was 5.9 mm high and 10.6 mm wide. The analysis will determine the values in Table 7-1.

Table 7-1: Descriptive statistics and results of PCA on limpet data.

                                          Height        Width
Measurements for the limpet               5.9           10.6
Mean (\bar{Y}) for all limpets            5.29          11.21
Standard deviation (s) for all limpets    1.73          5.18
Coefficients for Component 1              α = 0.589     β = 0.589
Coefficients for Component 2              α = -0.963    β = 0.963

The Component 1 score would be:

\[
0.589 \left( \frac{5.9 - 5.29}{1.73} \right) + 0.589 \left( \frac{10.6 - 11.21}{5.18} \right) = 0.138
\]

A score of zero indicates that the value is average for the dataset. A negative score indicates that the value is lower than average for the dataset and a positive score indicates a value that is higher than average for the dataset. The score is slightly positive so this limpet would be just a bit above average size.

Each observation in the dataset will have a computed score for each Component. In our example, there were measurements for 30 limpet shells. Therefore, each of the thirty limpet shells will have both a Component 1 score and a Component 2 score. The scores can be plotted to provide a graph like that in Figure 7-3.

Let's compute a Component 2 score for the same limpet. The Component 2 score would be:

\[
-0.963 \left( \frac{5.9 - 5.29}{1.73} \right) + 0.963 \left( \frac{10.6 - 11.21}{5.18} \right) = -0.453
\]

Recall that low values for this Component indicate a tall narrow limpet. Because the score is negative, this limpet is taller and narrower than most.

Output from a PCA Analysis
There are four important parts to a PCA analysis: Percent Variance Explained, Component Loadings, Coefficients and Component scores. With Systat 10.0 the latter two are obtained only as options and are stored in files rather than displayed in the output.
As we have already discussed these, we will concentrate on the two former.

Percent Variance Explained tells you how much of the variance in your data is explained by each component and is useful for determining if a component is worth examining. Because the first component always explains the greatest variance, its value will always be greater than the others. As a rule of thumb, do not bother with components that explain less than 10% of the variance in your data. For this example, percent variance explained by Component 1 = 73.04%. Percent variance explained by Component 2 = 26.96%. Because each component explains more than 10% of the variance, they are both worth examining.

Component Loadings are the most important part of the output because they tell you how to interpret the Component scores; they tell you the relative role of each variable in computing a Component score. A loading is a correlation between the original data and the Component scores. The absolute value of the loading tells you how important the variable is in computing the score. If a variable is highly related to the score (i.e. is very important in determining the value), the absolute value of the correlation will be high. As a rule of thumb, loadings with absolute values less than 0.300 are not considered in the interpretation. The sign of the loading tells you in which way the variable is related to the scores. If the sign is negative, it means that the higher the value of the variable, the lower the score. Table 7-2 illustrates loadings for the limpet shell data.

Table 7-2: Component loadings from PCA of limpet shell data.

            Component 1 Loadings    Component 2 Loadings
Height      0.856                   -0.517
Width       0.856                    0.517

The absolute values of both the height and width Component 1 loadings (0.856 and 0.856 respectively) are greater than 0.300 and the loadings are equal. Therefore they are both important and both contribute equally to the Component 1 score.
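The link between loadings and percent variance explained can be checked numerically. The sketch below is an illustration under stated assumptions, not Systat's procedure: it assumes the PCA is run on the correlation matrix of the two standardized variables, and it uses the height-width correlation r = 0.465 reported in Exercise 1. The function name `pca_from_correlation` and the sign convention (Width loadings made positive) are choices made for this example.

```python
import numpy as np

def pca_from_correlation(r):
    """PCA of two standardized variables whose Pearson correlation is r.

    Returns (loadings, percent_variance). Rows of `loadings` are the
    variables (Height, Width); columns are Components 1 and 2.
    """
    corr = np.array([[1.0, r], [r, 1.0]])
    eigenvalues, eigenvectors = np.linalg.eigh(corr)   # ascending order
    order = np.argsort(eigenvalues)[::-1]              # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Eigenvector signs are arbitrary; flip each column so the Width
    # entry is positive, matching the sign pattern in Table 7-2.
    eigenvectors = eigenvectors * np.sign(eigenvectors[-1, :])
    # A loading is the correlation between a variable and the component
    # scores: eigenvector entry times the square root of the eigenvalue.
    loadings = eigenvectors * np.sqrt(eigenvalues)
    percent_variance = 100.0 * eigenvalues / eigenvalues.sum()
    return loadings, percent_variance

loadings, percent_variance = pca_from_correlation(0.465)
for name, row in zip(("Height", "Width"), loadings):
    print(f"{name}: {row[0]:.3f} {row[1]:.3f}")
# Height: 0.856 -0.517
# Width: 0.856 0.517
print(", ".join(f"{p:.2f}%" for p in percent_variance))
# 73.25%, 26.75%
```

Under these assumptions the loadings match Table 7-2 to three decimals. The percentages (73.25% and 26.75%) are close to, but not identical with, the 73.04% and 26.96% quoted in the text, so treat this as a consistency check rather than a reproduction of the Systat output.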
For Component 1, the sign of the height loading is positive, which means that, if a score is high, the value for height was high. The sign of the width loading is also positive, which means that, if a score is high, the value for width was high. So large Component 1 scores indicate limpets that have high values for both height and width.

For Component 2, the absolute values of both the height and width loadings (0.517 and 0.517) are greater than 0.300 and equal to each other, which indicates they are both important and contribute equally to Component 2 scores. For Component 2, the sign of the height loading is NEGATIVE, which means that, if a score is high, the value for height was LOW. The sign of the width loading is positive, which means that high scores indicate limpets with wide bases. So large Component 2 scores indicate limpets that have low values for height and high values for width, which means that the limpet is flattened. Low scores for Component 2 indicate that limpets have high values for height and low values for width, which means that the limpet is tall and narrow.

Assumptions:
There are two assumptions that need to be met for a successful ordination:
1. Variables should not be highly skewed. If the values are highly skewed, then the probability values based on the normal distribution are not accurate. Mathematical transforms are used to create new, approximately normally distributed variables from old variables that were not.
   a. How do you check? Use a statistics program to compute skewness for each variable. If the absolute value of skewness is greater than 2.0, then you need to transform the variable. The transform normalizes the data. Try the appropriate transform and recompute skewness.
   b. If the data are highly skewed, which transform do you use?
      i. Skewed positive: Ln(Y) or √Y or Y^(1/n) (use a larger n for very high skewness). If there are zero values, add 1.0 to the value before taking the log or root.
      ii.
Skewed negative: 1/Y (if there are zero values, add 1.0 to the value) or Y^n (use a larger n for very high skewness).
      iii. If all else fails, rank the data. If you rank the data, you will not have to recompute skewness because we know the distribution of ranks.
      iv. NOTE: If data consist of percentages, they also need to be transformed using the angular transform. The angular transform is equal to arcsin(√p), where p is a proportion. If there are zero values, add 0.1 to all proportions.
2. The variables are independent of each other. If two variables have a high positive correlation with each other, they really measure only one aspect. If both are included in the PCA, the importance of that particular aspect is overemphasized. For example, weight and length of animals usually have a high positive correlation; small animals are short and light while big animals are long and heavy. Both variables measure size. If both are included in a PCA, size is overemphasized. If two variables have a large negative correlation with each other, the effects of both may be canceled out. Therefore it is important to use variables that are not highly correlated with each other.
   a. How do you find out if variables are correlated? Use a statistics program to compute Pearson Product Moment Correlations for all possible pairs of variables. A general rule of thumb is that, if the absolute value of the correlation coefficient (r) is greater than 0.700, you can consider the two variables to be correlated.
   b. If two variables are correlated, how do you determine which to use? The first rule is to eliminate variables that are correlated with several other variables. If a variable is correlated with several others, it is usually because it measures a very broad parameter and will not yield as much useful information as the other variables. For example, let's assume that you are interested in the distribution of plant species; you are recording elevation and measuring several habitat variables.
You will find that many of the variables are correlated with elevation. If you keep elevation and eliminate the others, you will have less understanding about the mechanisms that limit the distribution of the plants.
      i. If the first rule doesn't apply, choose the variable with the most biological significance to keep in the analysis.
      ii. Don't keep variables with large numbers of zeros or variables for which there were problems with data collection (e.g. poor measurements, missing data, etc.).
      iii. If none of the other rules apply, choose the variable with the largest coefficient of dispersion (CD = Variance/Mean).

Sample size
Multivariate analyses require an adequate number of observations. Tabachnick and Fidell (1994) recommend 5 observations per variable. So if you are measuring four variables, you would need a minimum of 4 × 5 or 20 observations. If you do not have a sufficient number of observations, you must eliminate variables from the analysis.

Exercise 1:
Conduct the PCA analysis on the limpet shell measurements data.

Data File: The data for this lab are in a file entitled “Limpet PCA 01”, which is located in the BIO 156 folder on the Student Data Server. Please copy this file to YOUR diskette.

In this problem, your goal is to find out if shell shape has an effect on limpet mortality from crab predation. You have taken three shell measurements (length, width and height) on each of 30 limpets. You also know which of these limpets were successfully attacked and consumed by a crab and which were able to get away. You are to use Principal Components Analysis to quantify aspects of limpet shell shape based on three measurements: length, width and height. You are going to use scores generated in the PCA as measures of an aspect of shell shape and then you will see if there appear to be differences in shell shape between limpets that were eaten and those that were not.

Check Sample Size and Assumptions
1.
How many observations need to be collected at a minimum for these data (three variables: length, width and height)?
   There are 3 variables so there should be 3 × 5 or 15 observations (shells) at a minimum.
2. Are any of the variables skewed? Use a computer program (e.g. Systat™; see "Using Systat 10.0: Compute skewness" for instructions) to compute skewness.
   No.
   [Figure 7-4: Output from Systat™ 10.0. Descriptive statistics - measure of skewness.]
3. Will you need to transform any of the variables?
   No, because they are not percentages and are not skewed.
4. Are any of the variables highly correlated (r ≥ 0.700)? (See "Using Systat 10.0: Correlations" for Systat™ 10.0 instructions.)
   Yes, length was correlated with width (r = 0.814) and length was also correlated with height (r = 0.783), but width was not correlated with height (r = 0.465).
   [Figure 7-5: Output from Systat™ 10.0. Pearson Product Moment Correlation coefficients.]
5. If so, which variables will you discard? List your logic below.
   Yes, we should discard length. If we kept length, we would have to get rid of both width and height in our analysis, but if we got rid of length we could keep both height and width.

Perform the PCA analysis
1. Plot the data to see the relationship of width to height (e.g. Systat™ 10.0; see "Using Systat 10.0: Plotting your data" for instructions) (Figure 7-6).
   [Figure 7-6: Plot of height versus width. x-axis: HEIGHT; y-axis: WIDTH.]
2. Use a statistical software program to analyze the data (e.g. Systat™; see "Systat 10.0: Principal Components Analysis" for instructions).
3. Determine which Components to investigate by examining the Percent of Total Variance Explained in the output (Figure 7-7) and note Components that explain 10% or more of the variance. In this case, both Components explain greater than 10% so we will use them both.
   [Figure 7-7: Output from Systat™ 10.0. PCA – Percentage of Total Variance Explained for limpet data.]
4. Next, find the Component Loadings (correlations of Component scores to original variables) and use them to interpret the Components.
For interpretation, ignore any variables for which the absolute value of the loading is less than 0.300. In this case, for Component 1, the loading for Width = 0.856. Because the absolute value is greater than 0.300, we will use Width in our interpretation of Component 1. Note that the loading for Width for Component 1 is positive; that means that values of the Component 1 scores are positively correlated with the values of Width. Low values of Width will tend to produce low Component 1 scores and high values of Width will tend to produce high Component 1 scores. The loading for Height for Component 1 (0.856) is also positive and its absolute value is greater than 0.300. Therefore low values for Height will tend to produce low Component 1 scores and high values of Height will tend to produce high Component 1 scores. Component 1 can then be interpreted as size, with low scores indicating short narrow limpets and high scores indicating tall wide limpets.
   [Figure 7-8: Output from Systat™ 10.0. PCA – Principal Component Loadings for limpet data.]
   Loadings for Component 2 are Width = 0.517 and Height = -0.517. Note that Height is negatively correlated with the scores for Component 2. Therefore low values of Width and LARGE values of Height will produce low Component 2 scores. High values of Width and low values of Height will produce high Component 2 scores. So, Component 2 can be interpreted as pointedness, with low scores indicating tall narrow limpets and high scores indicating short wide limpets.
5. Now find the Coefficients (Figure 7-9). These are what the computer used to compute the Component scores that you saved to a dataset.
   [Figure 7-9: Output from Systat™ 10.0. PCA Coefficients for computing Component scores.]
6. Merge the original file and the Component scores file to create a file that contains all of the variables in the original dataset plus the scores (see "Using Systat 10.0: Merging files" for instructions).
7.
Open the dataset you created that contains the scores and then plot the scores using the Eaten$ variable as a symbol (see "Using Systat 10.0: Plotting your data" for instructions). How does this plot compare to the plot of height vs. width (Figure 7-6)? Notice that, except for the “Y”s and “N”s, the plot is a rotated version of Figure 7-6.
8. Annotate the axes of the plot to make it easily interpretable. Either paste your graph into a Word document or insert it. Use the drawing tools to add arrows and descriptive axis labels based on the loadings. Your results should look like Figure 7-10.
   [Figure 7-10: Plot of Components 1 and 2 from PCA of limpet shell measurements. "Y" indicates the limpet was eaten and "N" indicates the limpet was not eaten. x-axis: Component 1; y-axis: Component 2; both axes are annotated with Width and Height arrows.]
9. Resolving apparent contradictions in the loadings. Notice that limpets in the upper right-hand corner have high scores for both Components 1 and 2. Also notice that, for Component 1, high values for Height tend to produce high scores, but, for Component 2, low values for Height tend to produce high scores. How can this be? Remember that Component 1 explains more of the variation than Component 2 so it takes precedence; limpets in the upper right-hand corner of Figure 7-10 ARE tall, but the high Component 2 scores indicate that, of the tall limpets, these are some of the shortest. Likewise, limpets in the lower right-hand corner of Figure 7-10 would be the tallest of the tall limpets. Therefore Component 2 fine-tunes the meaning of Component 1.
10. Interpret the experiment. Recall that, if a limpet is labeled with a “Y”, it was eaten by a crab. If a limpet is labeled with an “N”, it was not eaten by a crab. Notice that eaten and not-eaten limpets are not separated along the Component 1 axis. However, they are fairly well separated along the Component 2 axis. What shape of limpet was most likely to be eaten by a crab?
Here, we notice that limpets that were eaten tend to have low Component 2 scores. Therefore the limpets with tall narrow shells appear to be more likely to be consumed by crabs. Why do you think the crabs prefer that shape? It turns out that the muscle and tendons attaching the limpet body to the shell attach at the point of the shell. If a crab pinches the top of the shell off, the shell becomes detached from the limpet body and the limpet is vulnerable. If a shell is tall and pointed, it is easier for the crab to pinch the top of the shell off.

Using Systat 10.0: Correlations (using the Limpet Shell Measurement data as an example)
This procedure computes the Pearson Product Moment Correlation coefficient for each possible pair of a set of variables.
1) From the STATISTICS pull-down menu, select CORRELATIONS and then SIMPLE. You will see the window shown in Figure 7-11.
   [Figure 7-11: Systat™ 10.0 Correlation window]
2) To select a variable, click on it and then click on the ADD button. In this case select LENGTH, WIDTH and HEIGHT. All three should show up in the box labeled “Variable(s)”.
3) Click on OK. You will see the output shown in Figure 7-12 in the Analysis Window.
   [Figure 7-12: Systat™ 10.0 output of Pearson Product Moment Correlation matrix.]

Using Systat 10.0: Plotting your data
1) From the GRAPH pull-down menu, select SCATTERPLOT from PLOTS.
2) Click on HEIGHT and then click on ADD for the X variable (see Figure 7-13).
   [Figure 7-13: Selecting variables to plot in Systat™ 10.0]
3) Click on WIDTH and then click on ADD for the Y variable.
4) Click on the APPEARANCES button in the bottom right-hand corner of the window and then select COLOR AND FILL.
5) Click on SELECT 1ST COLOR. Then select BLACK for the 1st Color (see Figure 7-14).
6) Click on SELECT 1ST FILL. Then select solid as the fill pattern. Click on CONTINUE.
7) Click on the APPEARANCES button again and select SYMBOL AND LABEL.
   [Figure 7-14: Selecting patterns and colors for plots in Systat™ 10.0]
8) Specify size 2 for Symbol size. Note: if you have a character variable like Eaten$ which contains “Y”s and “N”s, you can select that variable as a symbol; the plot will then show “Y”s and “N”s for the points (Figure 7-15).
9) Click on CONTINUE.
10) Click on OK. Your picture will look like Figure 7-16.
   [Figure 7-15: Window for specifying a variable as a symbol.]
   [Figure 7-16: Example of plotting height and width variables for limpet shell measurements. x-axis: HEIGHT; y-axis: WIDTH.]
11) Save your graph so that you can insert it into a drawing or word processing document. Double-click on the graph and then, from the FILE pull-down menu, select SAVE AS. Note: you can also select COPY GRAPH from the EDIT pull-down menu so that you can paste it into a document.

Systat 10.0: Principal Components Analysis
Now we are ready to compute the PCA. The PCA generates components that describe the variation in the data. We will first run the analysis and then examine each part to understand its meaning.
1. From the STATISTICS pull-down menu, select DATA REDUCTION and then FACTOR ANALYSIS.
2. Select variables to be used in the analysis. To select, click on a variable and then click on the ADD button (see Figure 7-17).
3. Specify the number of Components. Click on NUMBER OF FACTORS and enter the number (no more than the number of variables; in this case 2).
   [Figure 7-17: Systat™ 10.0. Window for specifying elements of a Principal Components Analysis (PCA).]
4. Click on the SAVE button.
5. Create a data file that has all of the original data plus Component scores. Click on FACTOR SCORES and SAVE DATA WITH SCORES (see Figure 7-18). Then click on OK.
6. You will come back to the window shown in Figure 7-15. Click on OK.
7. Specify the file name (e.g. “Limpet Component Scores”) and save it to YOUR diskette.
8. The output from the PCA. The output will appear in the OUTPUT window.
Put your name at the top and then save your output as an RTF file, print it, or both. Unfortunately, SYSTAT does not present the output in the best order.

Important parts of the output
a) Percent of Variance Explained. This tells you how much of the total variation in all of the variables can be explained by a component. The percent of total variance explained for Component 1 is listed under the column labeled “1”. For Component 2 it is listed under column “2”, etc.
b) Component Loadings. Loadings are simple correlations of the Component scores with the original variables. They are used to interpret the components. Component loadings for Factor 1 are listed under column “1” and Component loadings for Factor 2 are listed under column “2”.
c) Factor Score Coefficients. These are the coefficients used by the PCA to compute Component scores.
d) Component Scores. The Component scores are NOT displayed in the SYSTAT results but are stored in the file that you created in step 7.

Using Systat™ 10.0: Merging files
1. From the DATA pull-down menu, select MERGE. You will see a window (Figures 7-18 and 7-19). In this example we will merge two files, “Limpet PCA 01.syd” and “Limpet PCA scores.syd”, to create a new file with variables from each of the files.
2. Click on the top BROWSE button and find the “Limpet PCA 01.syd” file (Figure 7-18).
   [Figure 7-18: Systat™ 10.0: Merge File window – upper half.]
3. Click on the variables to keep from the first file and then click on the ADD button. In this case we want to keep all of the variables from the “Limpet PCA 01.syd” file (Figure 7-18).
4. Click on the bottom BROWSE button and find the “Limpet PCA scores.syd” file (Figure 7-19).
5. Click on the variables from the second file and then click on the ADD button. In this case we want to keep only FACTOR(1) and FACTOR(2), which contain the component scores for Components 1 and 2 respectively (Figure 7-19).
   [Figure 7-19: Systat™ 10.0: Merge File window - lower half.]
6.
Make sure the “Save File” box is checked (Figure 7-19).
7. Then click on OK (Figure 7-18). You will then be prompted to give a name to the new file (e.g. “Limpet PCA all.syd”) and to save it. Save it to your diskette.

Using Systat™ 10.0: Compute skewness
1. From the STATISTICS pull-down menu, select DESCRIPTIVE STATISTICS and then BASIC STATISTICS.
2. Double-click on the variables for which you wish to compute skewness.
3. Make sure the SKEWNESS box is checked.
4. Click on OK.

Name ____________________________ Pts_________

On Your Own
Problem: Antlion larvae (Figure 7-20) build conical pits in which to trap ants (Figure 7-21). You are interested in determining what habitat characteristics might be associated with the presence or absence of antlion larvae. You have measured soil particle size, slope, density of grass and amount of canopy cover in 24 randomly selected locations in which antlions were not present and 16 randomly selected areas in which antlions were present.

[Figure 7-20: Antlion larvae (from en.wikipedia.org).]
[Figure 7-21: Antlion pit (from en.wikipedia.org).]

1. How many observations need to be collected at a minimum for these data (four variables: soil particle size, slope, density of grass and canopy cover)?
2. Are any of the variables skewed?
3. Will you need to transform any of the variables?
4. Are any of the variables highly correlated (r ≥ 0.700)?
5. If so, which variables will you discard? List your logic below.
6. Perform the PCA analysis.
7. Which components should you investigate?
8. What are the Component Loadings for this problem and how do you interpret them?
9. Merge the original file and the component scores file to create a file that contains all of the variables in the original dataset plus the scores.
10. Plot Factor 1 versus Factor 2, Factor 1 versus Factor 3 and Factor 2 versus Factor 3. Annotate the axes of the plot to make it easily interpretable. Use ANTLION$ as the variable for the symbols.
11. Interpret the results.
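For readers working outside Systat, the rule-of-thumb checks used in this chapter (5 observations per variable, |skewness| > 2.0, |r| > 0.700, components explaining at least 10% of the variance) can be sketched in Python with NumPy. Everything below is an assumption made for illustration: the helper names are hypothetical, the data are randomly generated (they are not the limpet or antlion data), the skewness formula is the simple moment-based one and may differ slightly from what Systat reports, and the PCA is taken on the correlation matrix.

```python
import numpy as np

def min_observations(n_variables, per_variable=5):
    """Minimum sample size under the 5-observations-per-variable rule."""
    return n_variables * per_variable

def skewness(values):
    """Moment-based skewness (population form)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def correlated_pairs(data, names, threshold=0.700):
    """List variable pairs whose |Pearson r| exceeds the threshold."""
    r = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(r[i, j]), 3)))
    return pairs

def percent_variance(data):
    """Percent of total variance per component, from the correlation matrix."""
    r = np.corrcoef(data, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(r))[::-1]
    return 100.0 * eigenvalues / eigenvalues.sum()

# Made-up data for illustration only: 40 observations of 4 variables,
# with x3 deliberately built to track x2.
rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 0.2, 40)
x2 = rng.normal(10.0, 3.0, 40)
x3 = 2.0 * x2 + rng.normal(0.0, 1.0, 40)
x4 = rng.normal(50.0, 15.0, 40)
data = np.column_stack([x1, x2, x3, x4])
names = ["x1", "x2", "x3", "x4"]

print("minimum observations:", min_observations(len(names)))
print("needs transform:",
      [name for name, col in zip(names, data.T) if abs(skewness(col)) > 2.0])
flagged = correlated_pairs(data, names)
print("highly correlated pairs:", flagged)

# Drop x3 (correlated with x2) before the PCA, then keep components
# that explain at least 10% of the variance.
kept = data[:, [0, 1, 3]]
pct = percent_variance(kept)
print("components worth examining:",
      [i + 1 for i, p in enumerate(pct) if p >= 10.0])
```

The thresholds are passed as parameters so that a different rule of thumb can be substituted without changing the functions.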

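The Component score arithmetic from Table 7-1 can be checked with a few lines of Python. This is a minimal sketch using only the values printed in that table; the function name `component_score` is a hypothetical helper, not part of Systat.

```python
def component_score(values, means, sds, coefficients):
    """Sum of coefficient * standardized value over all variables."""
    return sum(
        c * (y - m) / s
        for y, m, s, c in zip(values, means, sds, coefficients)
    )

# Values from Table 7-1, in the order (Height, Width).
limpet = (5.9, 10.6)
means = (5.29, 11.21)
sds = (1.73, 5.18)

score1 = component_score(limpet, means, sds, (0.589, 0.589))
score2 = component_score(limpet, means, sds, (-0.963, 0.963))

print(round(score1, 3))  # 0.138
print(round(score2, 3))  # -0.453
```

Each variable is standardized (mean subtracted, divided by the standard deviation) before it is weighted, so height in millimetres and width in millimetres contribute on the same scale.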