Principal Components Analysis (PCA) Part I

CHAPTER 7:
PRINCIPAL COMPONENTS ANALYSIS (PCA) PART 1
Purpose:
In this lab, you will learn how to conduct and interpret Principal Components Analysis (PCA).
It is part of a group of techniques called Ordinations. PCA is a quantitative technique for
describing populations with multiple types of measurements for each observation. PCA is a
descriptive technique and, as such, has no hypotheses associated with the procedure. It is a
technique that allows you to reduce values from several variables into a single variable (data
reduction). It will allow you to look at a point on a graph and be able to describe that
observation in terms of all of the measured variables. The analysis is useful for finding
patterns in data with multiple continuous variables. The output (Component scores) from the
analysis can be used in other statistical procedures.
Background:
PCA is a descriptive technique for situations in which there is more than one measured continuous
variable. It generates equations called principal components that describe the variation in the data.
The first component describes the axis of the greatest variation; the next describes the next
greatest variation in your data but is independent of the first component, and so on. There are as
many components as there are variables.
Figure 7-1: Height and width of limpet shells (x-axis: Height; y-axis: Width).
For example, let’s assume that you are measuring the width and height of limpet shells (limpets
are marine snails with conical shells) and that Figure 7-1 represents your measurements of
shell width and height.
Figure 7-2: First and Second Principal Components or Factors (x-axis: Height; y-axis: Width; lines labeled "Component 1 or Factor 1" and "Component 2 or Factor 2").
Figure 7-2 illustrates both the first and second principal components. The first principal
component (also referred to as a factor) would be the axis or line that indicates the greatest
difference among the points (largest variation). In this example, the smallest limpets in both
height and width are found at one end of the first principal component and the largest
limpets in both height and width are found at the other end. The second principal component
is at right angles to the first and describes the next greatest dimension. In this example,
limpets at one end of the second principal component have a large width and a small height
while limpets at the other end have a small width and a large height. There are only two
measurements so there are only two components.
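The geometry described above can be checked numerically. The following is a minimal Python sketch, assuming both measurements are standardized first (as in the score equation used later in this chapter) and that the two variables are positively correlated. The height and width values are hypothetical, not the limpet data from this lab. For two standardized variables with correlation r, the variance along the "both increase together" axis is 1 + r and the variance along the "one increases while the other decreases" axis is 1 - r.

```python
import math

def zscores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def sample_variance(values):
    """Sample variance with an n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Hypothetical shell measurements in mm (NOT the data from this lab).
heights = [2.1, 3.4, 4.0, 5.2, 5.9, 6.5, 7.3, 8.0]
widths = [6.0, 4.5, 12.0, 9.0, 10.6, 18.0, 11.0, 16.5]

z_height, z_width = zscores(heights), zscores(widths)
n = len(heights)

# Pearson correlation between height and width.
r = sum(h * w for h, w in zip(z_height, z_width)) / (n - 1)

# Unit-length directions of the two component axes (valid when r > 0).
axis1 = (1 / math.sqrt(2), 1 / math.sqrt(2))    # height and width increase together
axis2 = (-1 / math.sqrt(2), 1 / math.sqrt(2))   # width increases while height decreases

# Position of every shell along each axis.
proj1 = [axis1[0] * h + axis1[1] * w for h, w in zip(z_height, z_width)]
proj2 = [axis2[0] * h + axis2[1] * w for h, w in zip(z_height, z_width)]

# A dot product of zero means the axes are at right angles.
dot = axis1[0] * axis2[0] + axis1[1] * axis2[1]

print(f"r = {r:.3f}")
print(f"variance along axis 1 = {sample_variance(proj1):.3f} (1 + r = {1 + r:.3f})")
print(f"variance along axis 2 = {sample_variance(proj2):.3f} (1 - r = {1 - r:.3f})")
```

Because r is positive here, the first axis carries more of the variation than the second, which is why it is the first component.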
In Figure 7-3, the data have been remapped (rotated) so that the first and second principal
components now form the x- and y-axes. Notice that low values (i.e. scores) for Component 1
indicate limpets that are narrow and short (small), whereas high scores for Component 1
indicate wide and tall limpets (large). High scores for Component 2 indicate limpets that are
short and wide whereas low scores for Component 2 indicate tall and narrow limpets.
By using values for both Component 1 and Component 2, one can determine the shape of
that particular limpet relative to all of the others.
Figure 7-3: Remapped or rotated data using Component 1 and Component 2 as coordinates (x-axis: Component 1; y-axis: Component 2; points A, B and C are marked).
The limpet marked as A in Figure 7-3 has a low score for Component 1, indicating it is one of
the smaller limpets, and it has a higher score for Component 2, indicating that it is a little
flattened. The limpet indicated by B would be characterized as a medium-sized limpet that is
very flat. The limpet indicated by C would be characterized as one of the larger limpets and is
very tall and narrow.
Equations for Principal Components
The Principal Components Analysis determines the position of the components, generates
an equation for each component, and then converts the original data values to scores. The equation
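for a principal component in the example is:
$$\text{Component score} = \alpha \left( \frac{Y_{Height} - \bar{Y}_{Height}}{s_{Height}} \right) + \beta \left( \frac{Y_{Width} - \bar{Y}_{Width}}{s_{Width}} \right)$$
where $\alpha$ and $\beta$ are coefficients computed in the analysis, $\bar{Y}$ is the mean of a variable and $s$ is its standard deviation.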
Let’s assume that we want to compute a Component 1 score for a limpet that was 5.9 mm high
and 10.6 mm wide. The analysis will determine that:
Table 7-1: Descriptive statistics and results of PCA on limpet data.
                                         Height       Width
Measurements for the limpet              5.9          10.6
Mean (Ȳ) for all limpets                 5.29         11.21
Standard deviation (s) for all limpets   1.73         5.18
Coefficients for Component 1             α = 0.589    β = 0.589
Coefficients for Component 2             -0.963       0.963
The Component 1 score would be:
$$0.589 \left( \frac{5.9 - 5.29}{1.73} \right) + 0.589 \left( \frac{10.6 - 11.21}{5.18} \right) = 0.138$$
A score of zero indicates that the value is average for the dataset. A negative score indicates that
the value is lower than average for the data set and a positive score indicates a value that is
higher than average for the dataset. The score is slightly positive so this limpet would be just a bit
above average size.
Each observation in the dataset will have a computed score for each Component. In our
example, there were measurements for 30 limpet shells. Therefore, each of the thirty limpet
shells will have both a Component 1 score and a Component 2 score. The scores can be plotted
to provide a graph like that in Figure 7-3.
Let’s compute a Component 2 score for the same limpet.
The Component 2 score would be:
$$-0.963 \left( \frac{5.9 - 5.29}{1.73} \right) + 0.963 \left( \frac{10.6 - 11.21}{5.18} \right) = -0.453$$
Recall that low values for this Component indicate a tall narrow limpet. Therefore, this limpet is
higher and narrower than most.
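The two score calculations can be reproduced in a few lines of Python. This is a minimal sketch that uses only the values in Table 7-1; the function name component_score is an illustrative choice, not something produced by Systat.

```python
def component_score(values, means, sds, coefficients):
    """Sum of coefficient * standardized value over all variables."""
    return sum(
        c * (y - m) / s
        for y, m, s, c in zip(values, means, sds, coefficients)
    )

# Values from Table 7-1, always in the order (Height, Width).
limpet = (5.9, 10.6)
means = (5.29, 11.21)
sds = (1.73, 5.18)
coef_component1 = (0.589, 0.589)
coef_component2 = (-0.963, 0.963)

score1 = component_score(limpet, means, sds, coef_component1)
score2 = component_score(limpet, means, sds, coef_component2)

print(round(score1, 3))  # 0.138
print(round(score2, 3))  # -0.453
```

The same function works for any number of variables as long as the four sequences list the variables in the same order.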
Output from a PCA Analysis
There are four important parts to a PCA analysis: Percent Variance Explained, Component
Loadings, Coefficients and Component Scores. With Systat 10.0 the latter two are obtained
only as options and are stored in files rather than displayed in the output. As we have already
discussed these, we will concentrate on the former two.
Percent Variance Explained tells you how much of the variance in your data is explained by each
component and is useful for determining if a component is worth examining. Because the first
component always explains the greatest variance, its value will always be greater than the others.
As a rule of thumb, do not bother with components that explain less than 10% of the variance in
your data. For this example, percent variance explained by Component 1 = 73.04% and percent
variance explained by Component 2 = 26.96%. Because each component explains more than 10% of
the variance, they are both worth examining.
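The 10% rule of thumb can be sketched in Python as follows. The variance belonging to each component is commonly called its eigenvalue (a term not used elsewhere in this chapter). The two eigenvalues below are an assumption: they are back-calculated from the percentages quoted above, using the fact that two standardized variables have a total variance of 2. The function names are illustrative.

```python
def percent_variance_explained(eigenvalues):
    """Convert the variance of each component into a percentage of the total."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

def components_worth_examining(percentages, cutoff=10.0):
    """Return the (1-based) numbers of components that explain at least `cutoff` percent."""
    return [i + 1 for i, pct in enumerate(percentages) if pct >= cutoff]

# Assumed eigenvalues, back-calculated from 73.04% and 26.96% of a total variance of 2.
eigenvalues = [1.4608, 0.5392]

percentages = percent_variance_explained(eigenvalues)
keep = components_worth_examining(percentages)

for number, pct in enumerate(percentages, start=1):
    print(f"Component {number}: {pct:.2f}% of variance")
print("Components to examine:", keep)
```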
Component Loadings are the most important part of the output because they tell you how to
interpret the Component Scores; they tell you the relative role of each variable in computing a
Component score. A loading is a correlation between the original data and the Component
scores. The absolute value of the loading tells you how important the variable is in computing
the score. If a variable is highly related to the score (i.e. is very important in determining the
value), the absolute value of the correlation will be high. As a rule of thumb, loadings with
absolute values less than 0.300 are not considered in the interpretation. The sign of the
loading tells you in which way the variables are related to the scores. If the sign is negative, it
means that the higher the value of the variable, the lower the score. Table 7-2 illustrates loadings
for the limpet shell data.
Table 7-2: Component loadings from PCA of limpet shell data.
          Component 1 Loadings   Component 2 Loadings
Height    0.856                  -0.517
Width     0.856                  0.517
The absolute values of both height and width Component 1 loadings (0.856 and 0.856
respectively) are greater than 0.300 and the loadings are equal. Therefore they are both important
and both contribute equally to the Component 1 score. For Component 1, the sign of the height
loading is positive which means that, if a score is high, the value for height was high. The sign
of the width loading is also positive which means that, if a score is high, the value for width was
high. So large Component 1 scores indicate limpets that have both high values for height and
width.
For Component 2, the absolute values of both height and width Component 2 loadings (0.517
and 0.517) are greater than 0.300 and equal to each other which indicates they are both important
and contribute equally to Component 2 scores. For Component 2, the sign of the height loading
is NEGATIVE, which means that, if a score is high, the value for height was LOW. The sign of
the width loading is positive which means that high scores indicate limpets with wide bases. So
large Component 2 scores indicate limpets that have low values for height and high values for
width, which means that the limpet is flattened. Low scores for Component 2 indicate that
limpets have high values for height and low values for width, which means that the limpet is tall
and narrow.
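Because a loading is a correlation between an original variable and the Component scores, it can be computed directly once the scores are known. The Python sketch below does this for hypothetical height and width measurements (not the limpet data), assuming a two-variable PCA on standardized, positively correlated variables. The scores are computed only up to a scale factor, which does not change a correlation.

```python
import math

def zscores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length sequences."""
    zx, zy = zscores(xs), zscores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical shell measurements in mm (NOT the data from this lab).
heights = [2.1, 3.4, 4.0, 5.2, 5.9, 6.5, 7.3, 8.0]
widths = [6.0, 4.5, 12.0, 9.0, 10.6, 18.0, 11.0, 16.5]

z_height, z_width = zscores(heights), zscores(widths)

# Scores up to a scale factor: Component 1 = size, Component 2 = flatness.
comp1 = [h + w for h, w in zip(z_height, z_width)]
comp2 = [w - h for h, w in zip(z_height, z_width)]

loadings = {
    ("Height", 1): pearson(heights, comp1),
    ("Width", 1): pearson(widths, comp1),
    ("Height", 2): pearson(heights, comp2),
    ("Width", 2): pearson(widths, comp2),
}

for (variable, component), value in loadings.items():
    used = "use" if abs(value) >= 0.300 else "ignore"
    print(f"{variable:6s} Component {component}: loading = {value:6.3f} ({used})")
```

The sign pattern matches Table 7-2: both loadings are positive and equal for Component 1, while for Component 2 the Height loading is negative and the Width loading is positive.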
Assumptions:
There are two assumptions that need to be met for a successful ordination:
1. Variables should not be highly skewed. If the values are highly skewed, then the probability
values based on the normal distribution are not accurate. Mathematical transforms are used to
create new variables that are normally distributed from the old variables that were not.
a. How do you check? Use a statistics program to compute skewness for each variable. If the
absolute value of skewness is greater than 2.0, then you need to transform the variable. The
transform normalizes the data. Try the appropriate transform and recompute skewness.
b. If the data are highly skewed, which transform do you use?
i. Skewed positive – $\ln(Y)$ or $\sqrt{Y}$ or $Y^{1/n}$ (use a larger n for very high skewness). If
there are zero values, add 1.0 to the value before taking the log or root.
ii. Skewed negative – $1/Y$ (if there are zero values, add 1.0 to the value) or $Y^{n}$ (use a
larger n for very high skewness).
iii. If all else fails, rank the data. If you rank the data, you will not have to recompute
skewness because we know the distribution of ranks.
iv. NOTE: If data consist of percentages, they also need to be transformed using the
angular transform. The angular transform is equal to $\arcsin\sqrt{p}$, where p is a
proportion. If there are zero values, add 0.1 to all proportions.
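The check-transform-recheck cycle described in this assumption can be sketched in Python. Two things here are assumptions: the skewness formula (the moment-based coefficient m3 / m2^1.5; a statistics package may apply a small-sample correction and report a slightly different number) and the data, which are hypothetical and chosen to be strongly right-skewed. The function names are illustrative.

```python
import math

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (assumed formula; packages may differ slightly)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_transform(values):
    """Natural log of each value; add 1.0 first if any value is zero."""
    shift = 1.0 if any(v == 0 for v in values) else 0.0
    return [math.log(v + shift) for v in values]

def angular_transform(p):
    """Arcsine of the square root of a proportion p (0 <= p <= 1), in radians."""
    return math.asin(math.sqrt(p))

# Hypothetical, strongly right-skewed measurements (NOT data from this lab).
raw = [3.0 ** k for k in range(11)]

skew_raw = skewness(raw)
# Rule of thumb from the text: transform when |skewness| > 2.0, then recompute.
transformed = log_transform(raw) if abs(skew_raw) > 2.0 else raw
skew_after = skewness(transformed)

print(f"skewness before transform: {skew_raw:.2f}")
print(f"skewness after transform:  {skew_after:.2f}")
print(f"angular transform of 25%:  {angular_transform(0.25):.4f} radians")
```

If the skewness is still above 2.0 in absolute value after the first transform, the text suggests trying a stronger one (for example a higher root) or ranking the data.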
2. The variables are independent of each other. If two variables have a high positive correlation
with each other, they really measure only one aspect. If both are included in the PCA, the
importance of that particular aspect is overemphasized. For example, weight and length of
animals usually have a high positive correlation; small animals are short and light while big
animals are long and heavy. Both variables measure size. If both are included in a PCA, size is
overemphasized. If two variables have a large negative correlation with each other, the effects of
both may be canceled out. Therefore it is important to use variables that are not highly correlated
with each other.
a. How do you find out if variables are correlated? Use a statistics program to compute Pearson
Product Moment Correlations for all possible pairs of variables. A general rule of thumb to
use is that, if the absolute value of the correlation coefficient (r) is greater than 0.700, you can
consider the two variables to be correlated.
b. If two variables are correlated, how do you determine which to use?
i. Eliminate variables that are correlated with several other variables.
If a variable is correlated with several others, it is usually because it measures a very broad
parameter and will not yield as much useful information as the other variables. For example,
let’s assume that you are interested in the distribution of plant species; you are recording
elevation and measuring several habitat variables. You will find that many of the variables
are correlated with elevation. If you keep elevation and eliminate the others, you will have
less understanding about the mechanisms that limit the distribution of the plants.
ii. If the first rule doesn’t apply, choose the variable with the most biological
significance to keep in the analysis.
iii. Don’t keep variables with large numbers of zeros or variables for which there were
problems with data collection (e.g. poor measurements, missing data etc.).
iv. If all of the other rules don’t apply, choose the variable with the largest coefficient of
dispersion (CD = Variance/Mean).
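The first elimination rule can be sketched in Python: list, for each variable, the other variables it is highly correlated with, and treat any variable with two or more such partners as the first candidate to drop. The correlations used here are the ones reported for the limpet exercise later in this chapter; the function name and the "two or more" reading of "several" are assumptions made for illustration.

```python
def correlated_partners(correlations, threshold=0.700):
    """Map each variable to the variables it is highly correlated with (|r| > threshold)."""
    partners = {}
    for (a, b), r in correlations.items():
        partners.setdefault(a, [])
        partners.setdefault(b, [])
        if abs(r) > threshold:
            partners[a].append(b)
            partners[b].append(a)
    return partners

# Pearson correlations reported for the limpet exercise in this chapter.
correlations = {
    ("length", "width"): 0.814,
    ("length", "height"): 0.783,
    ("width", "height"): 0.465,
}

partners = correlated_partners(correlations)

# Rule i: a variable correlated with several others is the first candidate to drop.
candidates = [name for name, others in partners.items() if len(others) >= 2]

print(partners)
print("Candidates to drop:", candidates)
```

Dropping length leaves width and height, which are not highly correlated with each other, so both can stay in the analysis.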
Sample size
Multivariate analyses require an adequate number of observations. Tabachnick and Fidell (1994)
recommend 5 observations per variable. So if you are measuring four variables, you would need
a minimum of 4 x 5 or 20 observations. If you do not have a sufficient number of observations,
you must eliminate variables from the analysis.
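The sample-size rule is simple arithmetic, sketched here in Python. The function names are illustrative, and the 12-observation case is a hypothetical example.

```python
def minimum_observations(n_variables, per_variable=5):
    """Minimum number of observations under the 5-observations-per-variable rule."""
    return n_variables * per_variable

def has_enough_observations(n_observations, n_variables, per_variable=5):
    """True if the dataset meets the minimum for the given number of variables."""
    return n_observations >= minimum_observations(n_variables, per_variable)

print(minimum_observations(4))          # 20, as in the example above
print(has_enough_observations(12, 4))   # False: 12 observations are too few for 4 variables
```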
Exercise 1: Conduct the PCA analysis on the limpet shell measurements data.
Data File: The data for this lab are in a file entitled “Limpet PCA 01”, which is located in the
BIO 156 folder on the Student Data Server. Please copy this file to YOUR diskette.
In this problem, your goal is to find out if shell shape has an effect on limpet mortality from crab
predation. You have taken three shell measurements (length, width and height) on each of 30 limpets.
You also know which of these limpets were successfully attacked and consumed by a crab and
which were able to get away. You are to use Principal Components Analysis to quantify aspects
of limpet shell shape based on the three measurements. You are going to
use scores generated in the PCA as measures of an aspect of shell shape and then you will see if
there appear to be differences in shell shape between limpets that were eaten and those that were
not.
Check Sample Size and Assumptions
1. How many observations need to be collected at a minimum for these data (three variables: length,
width and height)? There are 3 variables so there should be 3x5 or 15 observations (shells) at
a minimum.
2. Are any of the variables skewed? Use a computer program (e.g. Systat™; see “Using Systat™
10.0: Compute skewness” for instructions) to compute skewness. No.
Figure 7-4: Output from Systat™ 10.0. Descriptive statistics - measure of skewness.
3. Will you need to transform any of the variables? No, because they are not percentages and are
not skewed.
4. Are any of the variables highly correlated (r≥0.700)? (See “Using Systat 10.0: Correlations”
for Systat™ 10.0 instructions.) Yes, length was correlated with width (r=0.814) and length was
also correlated with height (r=0.783) but width was not correlated with height (r=0.465).
Figure 7-5: Output from Systat™ 10.0. Pearson Product Moment Correlation coefficients.
5. If so, which variables will you discard? List your logic below. Yes, we should discard
length. If we kept length, we would have to get rid of both width and height in our analysis,
but if we got rid of length we could keep both height and width.
Perform the PCA analysis
1. Plot the data to see the relationship of width to height (e.g. Systat™ 10.0; see “Using Systat
10.0: Plotting your data” for instructions) (Figure 7-6).
Figure 7-6: Plot of height versus width (x-axis: HEIGHT; y-axis: WIDTH).
2. Use a statistical software program to analyze the data (e.g. Systat™; see “Systat 10.0:
Principal Components Analysis” for instructions).
3. Determine which Components to investigate by examining the Percent of Total Variance
Explained in the output (Figure 7-7) and note Components that explain 10% or more of the
variance. In this case, both Components explain greater than 10% so we will use them both.
Figure 7-7: Output from Systat™ 10.0. PCA – Percentage of Total Variance Explained for limpet data.
4. Next, find the Component Loadings (correlations of Component scores with the
original variables) and use them to interpret the Components. For interpretation, ignore
any variables for which the absolute value of the loading is less than 0.300. In this case,
for Component 1, the loading for Width=0.856. Because the absolute value is
greater than 0.300, we will use Width in our interpretation of Component 1. Note that the
loading for Width for Component 1 is positive; that means that values of the
Component 1 scores are positively correlated with the values of Width. Low values of
Width will tend to produce low Component 1 scores and high values of Width will tend to
produce high Component 1 scores. The loading for Height for Component 1 (0.856) is also
positive and its absolute value is greater than 0.300. Therefore low values for Height will tend
to produce low Component 1 scores and high values of Height will tend to produce high
Component 1 scores. Component 1 can then be interpreted as size with low scores indicating
short narrow limpets and high scores indicating tall wide limpets.
Figure 7-8: Output from Systat™ 10.0. PCA – Principal Component Loadings for limpet data.
Loadings for Component 2 are Width=0.517 and Height=-0.517. Note that Height is
negatively correlated with the scores for Component 2. Therefore low values of Width and
LARGE values of Height will produce low Component 2 scores. High values of Width and low
values of Height will produce high Component 2 scores. So, Component 2 can be interpreted
as pointedness with low scores indicating tall narrow limpets and high scores indicating short
wide limpets.
5. Now find the Coefficients. These are what the computer used to compute the
Component scores that you saved to a dataset (Figure 7-9).
Figure 7-9: Output from Systat™ 10.0. PCA Coefficients for computing Component scores.
6. Merge the original file and the Component scores file to create a file that contains all of
the variables in the original dataset plus the scores (see “Using Systat™ 10.0: Merging files”
for instructions).
7. Open the dataset you created that contains the scores and then plot the scores using the Eaten$
variable as a symbol (see “Using Systat 10.0: Plotting your data” for instructions).
How does this plot compare to the plot of height vs. width (Figure 7-6)? Notice that, except
for the “Y”s and “N”s, the plot is a rotated version of Figure 7-6.
8. Annotate the axes of the plot to make it easily interpretable. Either paste your graph into a
Word document or insert it. Use the drawing tools to add arrows and descriptive axis labels
based on the loadings. Your results should look like Figure 7-10.
Figure 7-10: Plot of Components 1 and 2 from PCA of limpet shell measurements. "Y"
indicates the limpet was eaten and "N" indicates the limpet was not eaten (x-axis: Component 1;
y-axis: Component 2; axes annotated with Width and Height).
9. Resolving apparent contradictions in the loadings. Notice that limpets in the upper right-hand
corner have high scores for both Components 1 and 2. Also notice that, for
Component 1, high values for Height tend to produce high scores, but, for Component 2, low
values for Height tend to produce high scores. How can this be? Remember that Component 1
explains more of the variation than Component 2 so it takes precedence; limpets in the upper
right-hand corner of Figure 7-10 ARE tall, but the high Component 2 scores indicate
that, of the tall limpets, these are some of the shortest. Likewise, limpets in the lower
right-hand corner of Figure 7-10 would be the tallest of the tall limpets. Therefore Component 2
fine-tunes the meaning of Component 1.
10. Interpret the experiment. Recall that, if a limpet is labeled with a “Y”, it was eaten by a crab.
If a limpet is labeled with an “N”, it was not eaten by a crab. Notice that eaten and not-eaten
limpets are not separated along the Component 1 axis. However, they are fairly well separated
along the Component 2 axis.
What shape limpet was most likely to be eaten by a crab?
Here, we notice that limpets that were eaten tend to have low Component 2 scores. Therefore
the limpets with tall narrow shells appear to be more likely to be consumed by crabs.
Why do you think the crabs prefer that shape?
It turns out that the muscle and tendons attaching the limpet body to the shell attach at the
point of the shell. If a crab pinches the top of the shell off, the shell becomes detached from
the limpet body and the limpet is vulnerable. If a shell is tall and pointed, it is easier for the
crab to pinch the top of the shell off.
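The visual comparison of eaten and not-eaten limpets can be backed up with a simple summary: the mean score of each group on each Component. The Python sketch below uses hypothetical scores, not the results of this exercise, and the function name is illustrative. It is a descriptive summary only, not a hypothesis test.

```python
from statistics import mean

# Hypothetical (Component 1 score, Component 2 score, Eaten$) rows; NOT the lab's data.
rows = [
    (-1.2, 0.9, "N"), (0.3, 1.4, "N"), (1.1, 0.6, "N"), (-0.4, 1.0, "N"),
    (-0.9, -0.7, "Y"), (0.2, -1.1, "Y"), (1.3, -0.5, "Y"), (-0.4, -1.6, "Y"),
]

def mean_score_by_group(rows, component_index):
    """Average the chosen Component score (0 or 1) separately for each Eaten$ label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[2], []).append(row[component_index])
    return {label: mean(scores) for label, scores in groups.items()}

means_c1 = mean_score_by_group(rows, 0)
means_c2 = mean_score_by_group(rows, 1)

print("Component 1 means:", means_c1)
print("Component 2 means:", means_c2)
```

In this made-up example the two groups have nearly the same mean on Component 1 but clearly different means on Component 2, which is the pattern described above for Figure 7-10.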
Using Systat 10.0: Correlations (using the Limpet Shell Measurement data as an
example).
Computing the Pearson Product Moment Correlation coefficient for each possible pair of a set of
variables.
1) From the STATISTICS pull-down menu, select
CORRELATIONS and then SIMPLE. You will see the
window shown in Figure 7-11.
2) To select a variable, click on it and then click on the
ADD button. In this case select LENGTH, WIDTH and
HEIGHT. All three should show up in the box labeled
“Variable(s)”.
3) Click on OK. You will see the following output in the
Analysis Window (Figure 7-12):
Figure 7 - 11: Systat™ 10.0 Correlation
window
Figure 7 - 12: Systat™ 10.0 output of Pearson
Product Moment Correlation matrix.
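For readers who want to see what the correlation matrix contains, the following Python sketch computes the Pearson Product Moment Correlation for every possible pair of variables. The measurements are hypothetical stand-ins for LENGTH, WIDTH and HEIGHT, so the output will not match Figure 7-12.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements in mm (NOT the data in "Limpet PCA 01").
data = {
    "LENGTH": [8.0, 10.5, 12.1, 14.0, 15.2, 17.8, 19.5, 22.0],
    "WIDTH": [6.0, 4.5, 12.0, 9.0, 10.6, 18.0, 11.0, 16.5],
    "HEIGHT": [2.1, 3.4, 4.0, 5.2, 5.9, 6.5, 7.3, 8.0],
}

# One correlation for each possible pair of variables.
matrix = {
    (a, b): pearson(data[a], data[b])
    for a, b in combinations(data, 2)
}

for (a, b), r in matrix.items():
    flag = "highly correlated" if abs(r) > 0.700 else "not highly correlated"
    print(f"{a:6s} x {b:6s} r = {r:6.3f} ({flag})")
```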
Using Systat 10.0: Plotting your data
1) From the GRAPH pull-down menu, select
SCATTERPLOT from PLOTS.
2) Click on HEIGHT and then click on ADD for the X
variable. (See Figure 7-13).
Figure 7 - 13: Selecting variables to plot in
Systat™ 10.0
3) Click on Width and then click on ADD for the Y variable.
4) Click on the APPEARANCES button in the bottom right-hand
corner of the window and then select COLOR AND FILL.
5) Click on SELECT 1ST COLOR. Then select BLACK for the 1st
Color (see Figure 7-14).
6) Click on SELECT 1ST FILL. Then select solid as the fill pattern.
Click on CONTINUE.
7) Click on the APPEARANCES button again and select SYMBOL
AND LABEL.
Figure 7 - 14: Selecting
patterns and colors for
plots in Systat™ 10.0
8) Specify size 2 for Symbol size. Note: if you have a character variable like Eaten$ which
contains “Y”s and “N”s, you can select that variable as a symbol; the plot will then show “Y”s and
“N”s for the points (Figure 7-15).
Figure 7-15: Window for specifying a variable as a symbol.
9) Click on CONTINUE.
10) Click on OK. Your picture will look like Figure 7-16.
Figure 7-16: Example of plotting height and width variables for limpet shell measurements
(x-axis: HEIGHT; y-axis: WIDTH).
11) Save your graph so that you can insert it into a drawing or word processing document to add
annotations. Double-click on the graph and then, from the FILE pull-down
menu, select SAVE AS. Note: you can also select COPY GRAPH from the EDIT pull-down
menu so that you can paste it into a document.
Systat 10.0: Principal Components Analysis
Now we are ready to compute the PCA. The PCA generates components that describe the
variation in the data. We will first run the analysis and then examine each part to
understand its meaning.
1. From the STATISTICS pull-down menu, select DATA REDUCTION and then FACTOR
ANALYSIS.
2. Select variables to be used in the analysis. To select, click on a variable and then click on the
ADD button (see Figure 7-17).
Figure 7-17: Systat™ 10.0. Window for specifying elements of a Principal Components
Analysis (PCA).
3. Specify the number of Components. Click on NUMBER OF FACTORS and enter the number
(no more than the number of variables; in this case 2).
4. Click on the SAVE button.
5. Create a data file that has all of the original data plus Component scores. Click on FACTOR
SCORES and SAVE DATA WITH SCORES (see Figure 7-18). Then click on OK.
6. You will come back to the window shown in Figure 7-17. Click on OK.
7. Specify the file name (e.g. “Limpet Component Scores”) and save it to YOUR diskette.
8. The output from the PCA will appear in the OUTPUT window. Put your name at the top and
then save your output as an RTF file and/or print it. Unfortunately, SYSTAT does not present
the output in the best order.
Important parts of the output
a) Percent of Variance Explained. This tells you how much of the total variation in all of the
variables can be explained by a component. The percent of total variance explained for
Component 1 is listed under the column labeled “1”. For Component 2 it is listed under
column 2 etc.
b) Component Loadings. Loadings are simple correlations of the Component scores with the
original variables. They are used to interpret the components. Component loadings for
Factor 1 are listed under column “1” and Component loadings for Factor 2 are listed under
column 2.
c) Factor Score Coefficients. These are the coefficients used by the PCA to compute Component
scores.
d) Component Scores. The Component scores are NOT displayed in the SYSTAT results but
are stored in the File that you created in step 7.
Using Systat™ 10.0: Merging files.
1. From the DATA pull-down menu, select MERGE.
You will see a window (Figures 7-18 and 7-19). In this
example we will merge two files, “Limpet PCA
01.syd” and “Limpet PCA scores.syd” to create a new
file with variables from each of the files.
2. Click on the top BROWSE button and find the
“Limpet PCA 01.syd” file (Figure 7-18).
Figure 7 - 18: Systat™ 10.0: Merge File
window – upper half.
3. Click on the variables to keep from the first file and
then click on the ADD button. In this case we want to
keep all of the variables from the “Limpet PCA
01.syd” file (Figure 7-18).
4. Click on the bottom BROWSE button and find the
“Limpet PCA scores.syd” file (Figure 7-19).
5. Click on the variables from the second file and then
click on the ADD button. In this case we want to
keep only FACTOR(1) and FACTOR(2) which
contain the component scores for Components 1 and 2
respectively (Figure 7-19).
Figure 7 - 19: Systat™ 10.0: Merge File
window - lower half.
6. Make sure the “Save File” box is checked (Figure 7-19).
7. Then click on OK (Figure 7-18). You will then be prompted to give a name to the new file (e.g.
“Limpet PCA all.syd”) and to save it. Save it to your diskette.
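The idea behind the merge can be sketched in Python. This is an illustration only: it does not read Systat .syd files, the rows are hypothetical, and it assumes that both files list the limpets in the same row order so that rows can be matched by position. The function name merge_by_row is an illustrative choice.

```python
def merge_by_row(first_rows, second_rows, keep_from_second):
    """Combine two tables row by row, keeping all of the first and selected columns of the second."""
    if len(first_rows) != len(second_rows):
        raise ValueError("both files must have the same number of rows")
    merged = []
    for first, second in zip(first_rows, second_rows):
        row = dict(first)  # keep every variable from the first file
        row.update({name: second[name] for name in keep_from_second})
        merged.append(row)
    return merged

# Hypothetical rows standing in for the original data file and the scores file.
original = [
    {"LENGTH": 12.1, "WIDTH": 10.6, "HEIGHT": 5.9, "EATEN$": "Y"},
    {"LENGTH": 9.4, "WIDTH": 8.2, "HEIGHT": 3.1, "EATEN$": "N"},
]
scores = [
    {"WIDTH": 10.6, "HEIGHT": 5.9, "FACTOR(1)": 0.138, "FACTOR(2)": -0.453},
    {"WIDTH": 8.2, "HEIGHT": 3.1, "FACTOR(1)": -1.02, "FACTOR(2)": 0.51},
]

# Keep only the score columns from the second file, as in step 5.
merged = merge_by_row(original, scores, ["FACTOR(1)", "FACTOR(2)"])

for row in merged:
    print(row)
```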
Using Systat™ 10.0: Compute skewness
1. From the STATISTICS pull-down menu, select DESCRIPTIVE STATISTICS and then
BASIC STATISTICS.
2. Double click on the variables for which you wish to compute skewness.
3. Make sure the SKEWNESS box is checked.
4. Click on OK.
Name ____________________________
Pts_________
On Your Own
Problem: Antlion larvae (Figure 7-20) build conical pits in which to trap ants (Figure 7-21). You
are interested in determining what habitat characteristics might be associated with the presence
or absence of antlion larvae. You have measured soil particle size, slope, density of grass and
amount of canopy cover in 24 randomly selected locations in which antlions were not present and
16 randomly selected locations in which antlions were present.
Figure 7-20: Antlion larvae (from en.wikipedia.org).
Figure 7-21: Antlion pit (from en.wikipedia.org).
1. How many observations need to be collected at a minimum for these data (four variables: soil
particle size, slope, density of grass and canopy cover)?
2. Are any of the variables skewed?
3. Will you need to transform any of the variables?
4. Are any of the variables highly correlated (r≥0.700)?
5. If so, which variables will you discard? List your logic below.
6. Perform the PCA analysis.
7. Which components should you investigate?
8. What are the Component Loadings for this problem and how do you interpret them?
9. Merge the original file and the component scores file to create a file that contains all of the
variables in the original dataset plus the scores
10. Plot Factor 1 versus Factor 2, Factor 1 versus Factor 3 and Factor 2 versus Factor 3. Annotate
the axes of the plot to make it easily interpretable. Use ANTLION$ as the variable for the
symbols.
11. Interpret the results.