Bio 286 Worksheet 7 – PCA and Cluster

Worksheet 7 – Principal Components Analysis and Cluster Analysis

1) PCA – use the file called “Ourworld_modified.syd”. This is a data file consisting of information collected on a country-by-country basis. The countries are objects in a multivariate sense. Ultimately we want to determine if:

   a. Military spending, Gross National Product, Birth rates, Death rates, and population estimates vary between urban and rural countries.

   We are concerned that some of the dependent variables might be collinear, and are therefore worried about doing either 7 separate ANOVAs or even a multivariate evaluation (MANOVA). To determine whether this is a reasonable concern, first run a PCA on the 7 continuous variables of interest (POP_1983, POP_1986, POP_1990, BIRTH_82, DEATH_82, GNP_82, MIL).

   1. Go to Analyze, Multivariate Methods, Principal Components and put the 7 continuous (blue icon) variables listed above in the Y window (DO NOT USE Urban_Metric).
      i. Make sure you use the default method.
      ii. Run the model.
   2. Now take a look at the following output:
      i. Factor Loading plot (right graph)
      ii. Percent of variance explained (left graph). Recall that components with eigenvalues > 1 are considered informative. The scale of the left graph is the percent of variance explained.
   3. Interpret Factors 1 and 2 (with respect to the original variables).
   4. OK, you should see from the loading plot that the original variables don’t load exactly on the PC axes. If you want to make the match better, you can use a rotation. Click on the red icon by PRINCIPAL COMPONENTS and then FACTOR ANALYSIS and make sure the rotation method is VARIMAX. Make sure the factoring method is PRINCIPAL COMPONENTS and the prior communality is also PRINCIPAL COMPONENTS. VARIMAX produces better alignment of the original variables with the PC axes. Run the model. Does the new loading plot align better?
   5. Now save the PC scores. Click on the red icon by PRINCIPAL COMPONENTS, then on SAVE PRINCIPAL COMPONENTS, and save 2 components. These will be added to your data file.
   6. Convince yourself that the composite variables (factors 1 and 2) actually do a good job as proxies for the original variables (use GRAPH BUILDER with the linear fit icon).
      a. Plot (scatterplot) POP_1983, POP_1986, POP_1990 vs PRIN1
      b. Plot BIRTH_82, DEATH_82, GNP_82, MIL vs PRIN2
      c. Do the PCs do a good job as proxies?
   7. Now address the question posed above after accounting for collinearity: Do Military spending, Gross National Product, Birth rates, Death rates, and population estimates vary between urban and rural countries?
      a. First plot PRIN2 vs PRIN1 using GRAPH BUILDER. Use URBAN as a grouping variable. Make sure to use the overlay function (upper right corner).
      b. If you want to create a confidence ellipse, click on the icon that looks like an oval. If there are lines present, unclick them.
      c. How might you test this question?

2) PCA Regression – Here you are going to use PCA regression to test hypotheses concerning the relationship between degree of urbanization (as a continuous variable) and Military spending, Gross National Product, Birth rates, Death rates.

   a. First run a multiple regression on the relationship between POP_1983, POP_1986, POP_1990, BIRTH_82, DEATH_82, GNP_82, MIL and Urban_Metric. The Urban_Metric score indicates the degree of urbanization in a country: higher scores are more urbanized.
      1. Use ANALYZE, FIT MODEL
         i. Put the predictor variables in the CONSTRUCT MODEL EFFECTS window and Urban_Metric in the Y window.
         ii. PERSONALITY should be STANDARD LEAST SQUARES.
         iii. Run the model.
         iv. Look at the table PARAMETER ESTIMATES and specifically at the VIF scores: any value > 10 indicates significant collinearity. Is there any evidence of collinearity?
      2. Given that there is evidence of collinearity, one solution is to use PCA regression. Given the results of question 1, we know that the original variables load on two principal components. We also know the factors (PCs) are independent of each other, so collinearity will not be an issue if we use the factors as predictor variables.
         i. Use ANALYZE, FIT MODEL
         ii. Put Prin1 and Prin2 in the CONSTRUCT MODEL EFFECTS window and Urban_Metric in the Y window.
         iii. PERSONALITY should be STANDARD LEAST SQUARES.
         iv. Run the model.
         v. Look at the table PARAMETER ESTIMATES and specifically at the VIF scores: any value > 10 indicates significant collinearity. Is there any evidence of collinearity?
         vi. OK, now interpret the results in terms of the original variables.
            a. Use GRAPH, GRAPH BUILDER
            b. Put Prin2 on the X axis and Urban_Metric on the Y axis.
            c. Click the upper icon showing the linear fit of a line to data.
            d. Interpret these results in terms of the original variables (those that load on Prin2) and also offer a general interpretation of the relationship. Remember the points are countries.

3) Cluster analysis – this exercise shows you how different between-cluster distance measures (linkage methods) affect the clustering pattern.

   a. Use ANALYZE, MULTIVARIATE METHODS, CLUSTER
      1. Put POP_1983, POP_1986, POP_1990, BIRTH_82, DEATH_82, GNP_82, MIL in the Y window.
      2. Put URBAN (red icon) in the LABEL window.
      3. Use OPTION – Hierarchical
      4. Click on STANDARDIZE DATA (which transforms the variable values to z scores).
      5. Try the AVERAGE method first.
      6. Below the cluster diagram is a SCREE plot and a small diamond icon that you can drag to show the number of clusters at a particular distance. Try it out.
      7. If you want, you can click on the CLUSTERING HISTORY icon to see how the cases were joined as clustering progressed.
      8. Now repeat the process for Centroid, Ward, Single and Complete. These methods are all described in the help file for CLUSTER.
      9. Notice the differences in clustering as a function of the method used. The differences can be subtle or very large (especially for SINGLE linkage).
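
For anyone who wants to repeat the method comparison in part 3 outside of JMP, the following is a minimal Python sketch, not the worksheet's procedure. It assumes NumPy and SciPy are available and uses synthetic stand-in data, because the Ourworld file is not loaded here. The function name compare_linkages, the choice of 3 clusters, and the use of the sample standard deviation for the z scores are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Linkage methods named in the worksheet (SciPy's spelling).
METHODS = ["average", "centroid", "ward", "single", "complete"]


def compare_linkages(X, n_clusters):
    """Return {method: flat cluster labels} for each linkage method.

    Hypothetical helper for illustration; not part of JMP or SciPy.
    """
    # Standardize each column to z scores (sample SD is an assumption).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    labels = {}
    for method in METHODS:
        # Build the hierarchical tree; Euclidean distance is the default.
        tree = linkage(Z, method=method)
        # Cut the tree so that at most n_clusters flat clusters are formed.
        labels[method] = fcluster(tree, t=n_clusters, criterion="maxclust")
    return labels


# Synthetic stand-in data (NOT the Ourworld file): 3 loose groups, 7 variables.
rng = np.random.default_rng(0)
centers = rng.normal(scale=3.0, size=(3, 7))
X = np.vstack([c + rng.normal(size=(20, 7)) for c in centers])

results = compare_linkages(X, n_clusters=3)
for method, lab in results.items():
    # Cluster sizes; cluster numbering is arbitrary and differs by method.
    print(method, np.bincount(lab)[1:])
```

Comparing the printed cluster sizes across methods gives a rough numerical counterpart to comparing the dendrograms by eye. Single linkage in particular tends to chain observations together, so it may produce one large cluster and a few very small ones.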