Components and Cluster analysis

Bio 286
Worksheet 7 – PCA and Cluster
Worksheet 7 – Principal Components Analysis and Cluster Analysis
1) PCA – use the file called “Ourworld_modified.syd”. This is a data file consisting
of information collected on a country-by-country basis. The countries are the objects
in a multivariate sense. Ultimately we want to determine whether:
a. Military spending, Gross National Product, birth rates, death rates, and
population estimates vary between urban and rural countries. We are
concerned that some of the dependent variables might be colinear,
and are therefore worried about running either 7 separate ANOVAs or even
a multivariate test (MANOVA). To determine whether this is a
reasonable concern, first run a PCA on the 7 continuous variables of
interest (POP_1983, POP_1986, POP_1990, BIRTH_82, DEATH_82,
GNP_82, MIL).
1. Go to Analyze, Multivariate Methods, Principal Components and
put the continuous (blue icon) variables listed above in the
Y window (DO NOT USE Urban_Metric).
i. Make sure you use the default method.
ii. Run the model.
2. Now take a look at the following output:
i. Factor loading plot (right graph)
ii. Percent of variance explained (left graph). Recall that
components with eigenvalues > 1 are considered informative. The scale
of the left graph is the percent of variance explained.
3. Interpret Factors 1 and 2 (with respect to original variables)
4. You should see from the loading plot that the original
variables don’t load exactly on the PC axes. If you
want to make the match better you can use a rotation. Click on
the red icon by PRINCIPAL COMPONENTS and then
FACTOR ANALYSIS and make sure the rotation method is
VARIMAX. Make sure the factoring method is PRINCIPAL
COMPONENTS and the prior communality is also PRINCIPAL
COMPONENTS. VARIMAX generally produces better alignment of the
original variables with the PC axes. Run the model. Does the new
loading plot align better?
5. Now save the PC scores. Click on the red icon by PRINCIPAL
COMPONENTS, then on SAVE PRINCIPAL COMPONENTS,
and save 2 components. These will be added to your data file.
6. Convince yourself that the composite variables (factor 1 and 2)
actually do a good job as proxies for the original variables (use
GRAPH BUILDER – use the linear fit icon)
a. Plot (scatterplot) POP_1983, POP_1986, POP_1990
vs PRIN1
b. Plot BIRTH_82, DEATH_82, GNP_82, MIL vs
PRIN2
c. Do the PCs do a good job as proxies?
7. Now address the question posed above after accounting for colinearity: Do military spending, Gross National Product, birth
rates, death rates, and population estimates vary between urban
and rural countries?
a. First plot PRIN2 vs PRIN1 using GRAPH
BUILDER. Use URBAN as a grouping variable.
Make sure to use the overlay function (upper right
corner)
b. If you want to create a confidence ellipse, click
on the icon that looks like an oval. If any fitted
lines are present, unclick them.
c. How might you test this question?
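The PCA steps in part 1 can also be sketched outside JMP. The example below is a minimal sketch under stated assumptions, not the JMP procedure itself: it assumes NumPy is available, it uses a small synthetic data set in place of Ourworld_modified.syd, and the helper name pca_correlation is hypothetical. It runs PCA on the correlation matrix so that each eigenvalue can be compared with 1 and converted to a percent of variance.

```python
import numpy as np

def pca_correlation(X):
    """PCA on the correlation matrix; returns eigenvalues, loadings, scores."""
    X = np.asarray(X, dtype=float)
    # z-score each variable so that all variables are on the same scale
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # loadings: correlation of each original variable with each component
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    scores = Z @ eigvecs                       # component scores per object
    return eigvals, loadings, scores

# Synthetic stand-in data (NOT the worksheet file): two hidden factors.
rng = np.random.default_rng(0)
n = 100
size = rng.normal(size=n)   # assumed to drive the three population columns
dev = rng.normal(size=n)    # assumed to drive birth, death, GNP, military
X = np.column_stack([
    size + rng.normal(scale=0.05, size=n),   # stand-in for POP_1983
    size + rng.normal(scale=0.05, size=n),   # stand-in for POP_1986
    size + rng.normal(scale=0.05, size=n),   # stand-in for POP_1990
    dev + rng.normal(scale=0.3, size=n),     # stand-in for BIRTH_82
    dev + rng.normal(scale=0.3, size=n),     # stand-in for DEATH_82
    -dev + rng.normal(scale=0.3, size=n),    # stand-in for GNP_82
    -dev + rng.normal(scale=0.3, size=n),    # stand-in for MIL
])

eigvals, loadings, scores = pca_correlation(X)
percent = 100 * eigvals / eigvals.sum()      # percent of variance explained
n_informative = int((eigvals > 1).sum())     # eigenvalue > 1 rule
print("eigenvalues:", np.round(eigvals, 2))
print("percent of variance:", np.round(percent, 1))
print("components with eigenvalue > 1:", n_informative)
```

Because the component scores are uncorrelated with one another, the first few can stand in for the original variables in later tests, which is the reason for saving them in step 5.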
2) PCA Regression – Here you are going to use PCA regression to test hypotheses
concerning the relationship between degree of urbanization (as a continuous
variable) and military spending, Gross National Product, birth rates, and death rates.
a. First run a multiple regression on the relationship between POP_1983,
POP_1986, POP_1990, BIRTH_82, DEATH_82, GNP_82, MIL and
Urban_Metric. The Urban_Metric score indicates the degree of
urbanization in a country: higher scores are more urbanized.
1. Use ANALYZE, FIT MODEL
i. Put the predictor variables in the CONSTRUCT MODEL
EFFECTS window and Urban_Metric in the Y window
ii. PERSONALITY should be STANDARD LEAST
SQUARES
iii. Run the model
iv. Look at the table PARAMETER ESTIMATES and
specifically at the VIF scores. Any value > 10 is commonly
taken to indicate serious colinearity. Is there any evidence of
colinearity?
2. Given that there is evidence of colinearity, one solution would be
to use PCA regression. Given the results of question 1, we know
that the original variables load on two principal components.
We also know the factors (PCs) are independent, so colinearity
will not be an issue if we use the factors as predictor variables.
i. Use ANALYZE, FIT MODEL
ii. Put Prin1 and Prin2 in the CONSTRUCT MODEL
EFFECTS window and Urban_Metric in the Y window
iii. PERSONALITY should be STANDARD LEAST
SQUARES
iv. Run the model
v. Look at the table PARAMETER ESTIMATES and
specifically at the VIF scores. Any value > 10 is commonly
taken to indicate serious colinearity. Is there any evidence of
colinearity?
vi. Now interpret the results in terms of the original
variables
a. Use GRAPH, GRAPH BUILDER
b. Put Prin2 on the X axis and Urban_Metric on the Y axis
c. Click the upper icon showing a line fitted
to data
d. Interpret this result in terms of the original
variables (those that load on Prin2) and also offer a
general interpretation of the relationship.
Remember that the points are countries.
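The VIF check and the PCA regression in part 2 can be sketched numerically as well. This is a minimal sketch, assuming NumPy and synthetic stand-in data rather than the worksheet file; the helper names vif and pc_scores and the simulated response urban are hypothetical. It relies on the identity that the VIF of predictor j, 1 / (1 - R_j^2), equals the j-th diagonal element of the inverse of the predictors' correlation matrix.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(R))

def pc_scores(X, n_components):
    """Scores on the leading principal components of the correlation matrix."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    top = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues
    return Z @ eigvecs[:, top]

# Synthetic stand-in data (NOT the worksheet file).
rng = np.random.default_rng(1)
n = 100
size = rng.normal(size=n)   # assumed to drive the population columns
dev = rng.normal(size=n)    # assumed to drive the other four columns
X = np.column_stack([
    size + rng.normal(scale=0.05, size=n),   # stand-in for POP_1983
    size + rng.normal(scale=0.05, size=n),   # stand-in for POP_1986
    size + rng.normal(scale=0.05, size=n),   # stand-in for POP_1990
    dev + rng.normal(scale=0.3, size=n),     # stand-in for BIRTH_82
    dev + rng.normal(scale=0.3, size=n),     # stand-in for DEATH_82
    -dev + rng.normal(scale=0.3, size=n),    # stand-in for GNP_82
    -dev + rng.normal(scale=0.3, size=n),    # stand-in for MIL
])
# Hypothetical response standing in for Urban_Metric
urban = -0.8 * dev + rng.normal(scale=0.5, size=n)

raw_vif = vif(X)                 # VIFs of the original predictors
scores = pc_scores(X, 2)         # two components, as in the worksheet
pc_vif = vif(scores)             # VIFs of the component scores

# Ordinary least squares of the response on the two component scores
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, urban, rcond=None)

print("VIF, original variables:", np.round(raw_vif, 1))
print("VIF, component scores:", np.round(pc_vif, 3))
print("intercept and slopes:", np.round(coef, 3))
```

The near-duplicate population columns inflate one another's VIFs, while the component scores are uncorrelated by construction and so have VIFs of 1. The slopes still have to be read back through the loadings to say anything about the original variables.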
3) Cluster analysis – this exercise shows how different distance measures
affect the clustering pattern
a. Use ANALYZE, MULTIVARIATE METHODS, CLUSTER
1. Put POP_1983, POP_1986, POP_1990, BIRTH_82,
DEATH_82, GNP_82, MIL in the Y window
2. Put URBAN (red icon) in the LABEL window.
3. Use OPTION – Hierarchical
4. Click on STANDARDIZE DATA (which transforms the variable
values to z scores)
5. Try the AVERAGE method first
6. Below the cluster diagram is a SCREE plot and a small diamond
icon that you can drag to show the number of clusters at a
particular distance. Try it out.
7. If you want you can click on the CLUSTERING HISTORY icon
to see how the cases were entered as clustering progressed
8. Now repeat the process for Centroid, Ward, Single and
Complete. These methods are all described in the help file for
CLUSTER
9. Notice the differences in clustering as a function of the distance
(linkage) method. The differences can be subtle or very large
(especially for SINGLE linkage).
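The effect of the method can be seen on a toy example. The following is a naive sketch, assuming NumPy; it is not JMP's algorithm, the function names zscore and agglomerate are hypothetical, and only the Single, Complete and Average rules are implemented (Centroid and Ward are omitted). Five made-up observations lie on a line with slowly growing gaps, which is the situation where single linkage tends to chain.

```python
import numpy as np

def zscore(X):
    """Standardize each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def agglomerate(X, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns a sorted list of index tuples."""
    # Linkage rule: how the distance between two clusters is summarized
    summarize = {"single": np.min, "complete": np.max,
                 "average": np.mean}[linkage]
    # Pairwise Euclidean distances between all observations
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]   # start with one cluster each
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = summarize(dist[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return sorted(tuple(sorted(c)) for c in clusters)

# Made-up one-variable data: gaps of 1.0, 1.1, 1.2 and 1.3 between neighbors
X = zscore(np.array([[0.0], [1.0], [2.1], [3.3], [4.6]]))

single = agglomerate(X, 2, "single")
complete = agglomerate(X, 2, "complete")
average = agglomerate(X, 2, "average")
print("single:  ", single)     # [(0, 1, 2, 3), (4,)]
print("complete:", complete)   # [(0, 1), (2, 3, 4)]
print("average: ", average)    # [(0, 1), (2, 3, 4)]
```

Single linkage keeps absorbing the nearest neighbor and leaves the last point on its own, whereas complete and average linkage split the same data into two more balanced groups.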