Statistics: Continuous Methods
STAT452/652, Fall 2008

Computer Lab 4
Thursday, October 16, 2008
Ansari Business Building, 610
11:00AM-12:15PM

CORRELATION

Instructor: Ilya Zaliapin

Topic: Correlation coefficients

Goals:
• Learn how to compute Pearson’s and Spearman’s correlation coefficients in Minitab
• Learn how to find and interpret the statistical significance of a correlation
• Learn how to interpret the results of a correlation analysis

Assignments:
1) Download the data set correlation.MTW from the course web site to your Minitab session; it contains four samples: W, X, Y, and Z.
2) Decide which samples are from the Normal distribution (you can use the probability plot option with the Anderson-Darling test).
3) Perform a visual correlation analysis of the data, decide which pairs of variables (out of 6 possible pairs) show an association, and discuss whether the association is monotone or linear.
4) Compute Pearson’s correlation for each pair.
5) Compute the significance (P-value) of each Pearson’s correlation; discuss which P-values can be used for analysis and which should be dismissed (use the results of step 2).
6) Compute Spearman’s correlation for each pair.
7) Compare the results of steps 3)-6), discuss, and draw conclusions.

Report: A printed report for this Lab is due on Thursday, October 23 in class. BW printouts are OK. Reports will not be accepted by mail.

1. Introduction

Applied research often focuses on relationships (associations) between different processes or phenomena. In the probabilistic context, one deals with relationships (associations, dependence) between random variables. In practice, establishing dependence (or independence), which is a property of the entire joint distribution of the random variables, is commonly replaced with the simpler task of establishing correlation, which only reflects some partial properties of the joint distribution.
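As a small illustration of the last point, the sketch below uses a made-up five-point sample (not part of the Lab data) in which Y is completely determined by X, yet the Pearson correlation defined in Section 2 is zero. The function name pearson_r is a hypothetical helper introduced only for this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed directly from the product-moment definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical toy data: Y = X^2, so Y depends on X deterministically.
xs = [-2, -1, 0, 1, 2]
ys = [v * v for v in xs]

# The relationship is symmetric about X = 0, so the linear association cancels out.
r_toy = pearson_r(xs, ys)
print(r_toy)
```

Here the dependence is as strong as it can be, but it is neither linear nor monotone, so a correlation coefficient does not detect it.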
A coefficient of correlation is a scalar measure of association between paired observations (Xi,Yi), i = 1,…, n. Different coefficients of correlation formalize association in different ways, which leads to different results of the correlation analysis and different interpretations of those results. In class, we have discussed three correlation coefficients: Kendall’s τ, Pearson’s r, and Spearman’s ρ. The latter two will be considered in this Lab.

2. Pearson coefficient of correlation

Definition. Consider a paired sample (Xi,Yi), i = 1,…, n. The Pearson’s coefficient of correlation is defined as

$$ r(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}, $$

where $\bar{X}$ and $\bar{Y}$ are the sample means.

2.1 Calculation

Recall that Pearson’s r measures the linear relationship between observations. In Minitab, Pearson’s r can be computed using the menu Stat/Basic Statistics/Correlation.

The correlation sub-window allows choosing variables for analysis and providing other parameters. Notice that if you choose the Store matrix option, NOTHING will be displayed. Analysis results (values of r and the corresponding P-values) are shown in the Session window.

2.2 Significance

The P-values computed by Minitab correspond to testing the hypothesis H0: “The population correlation is 0” versus Ha: “The population correlation is not 0”. The test statistic is

$$ U = \frac{r\,\sqrt{n-2}}{\sqrt{1 - r^2}}, $$

which, for independent jointly Normal (X,Y) (that is, under H0), has the Student distribution with (n-2) degrees of freedom.

2.3 Example

In this example, we perform Pearson correlation analysis for three samples, X, Y, and Z, each of length 100. First, we construct a scatterplot to get an idea of possible relationships among the observations (Fig. 1).
Figure 1: Scatterplot of X, Y, and Z (Minitab matrix plot of X, Y, Z)

The scatterplot suggests that the pair (X,Z) has a strong positive association (positive correlation), the pair (X,Y) does not seem to have an association, and the pair (Y,Z) might have a slight positive association. Next, we proceed with a formal correlation analysis. The results are summarized in Table 1, where the cells above the diagonal display the values of Pearson’s r and the cells below the diagonal display the corresponding P-values. We see that our visual impression is confirmed by the formal analysis: the correlations between (X,Z) and (Y,Z) are significant (the corresponding P-values are less than 0.001), while the correlation between (X,Y) is not significant (the P-value is 0.2).

Table 1: Results of the correlation analysis for observations X, Y, and Z (values above the diagonal: Pearson’s r; values below the diagonal: P-values)

        X         Y         Z
X       1         -0.129    0.883
Y       0.2       1         0.350
Z       <0.001    <0.001    1

We notice that although both correlations r(X,Z) and r(Y,Z) are significant, the first one reflects a strong association and the second one a weak association between the random variables, as indicated by the absolute values of r.

3. Spearman coefficient of correlation

Definition. Consider a paired sample (Xi,Yi), i = 1,…, n. Let (R^X_i, R^Y_i) be the corresponding ranks for the sample. The Spearman’s coefficient of correlation is defined as the Pearson’s correlation between the ranks:

$$ \rho(X,Y) = r\left(R^X_i, R^Y_i\right). $$

3.1 Calculation

Minitab has no built-in routine for Spearman’s correlation. However, you can easily obtain it by computing the ranks and then applying Pearson’s correlation to them.
The observation ranks are computed using the menu Data/Ranks.

Notice that since you work with ranks, the P-values reported by the Pearson’s correlation routine will not be valid.