Correlation & Regression – (Estimation & Significance) XBUS 6240 – Understanding Managerial Statistics – Fall 2012 M.D. Harper, Ph.D. *Correlation. Describes relationship between two variables. *Regression. Uses one variable to predict another variable. *Estimation. Correlation: Determine coefficient of correlation. Regression: Determine regression model. *Significance. Correlation: Make inference that correlation is significantly different from zero. Regression: Make inference that least-squares regression slope is significantly different from zero. (Summary) Given „n‟ pairs of data labeled „X‟ and „Y‟. Mean of X is X = X/n Mean of Y is Y = Y/n Sum of Squared Errors about the Mean (SS): SS of Variable Y: SSYY=( Y –Y )2=Y2–(Y)2/n SS of Variable X: SSXX=( X –X )2=X2–(X)2/n SS between variables: SSXY=( X –X )*( Y –Y ) =XY–(X)*(Y)/n Estimation. Coefficient of Correlation: R=SSXY/√(SSXX*SSYY) For Regression, let Y=Dependent Variable & X=Independent Variable. Regress Y on X. Least-squares Regression Model: Ŷ=b0+b1*X where b1 = SSXY/SSXX & b0 =Y – b1 *X Significance. Given Level of Significance = . Correlation. Let tt = √[ (n–2)*R2/(1–R2) ]. Then, using the Excel function, when 2*(1–T.DIST( tt , n–2 , 1 ) ) < , the correlation is significant. Least-Squares Regression. Let Ft = (n–2)*R2/(1–R2) = tt2 . Then, using the Excel function, when 1–F.DIST( Ft , 1 , n–2 , 1) < , the regression is significant. Page 1 of 12 (Review) Given paired data A survey was administered to 5 subjects containing the following items. [A questionnaire was given to 5 people containing the following questions.] 1. How many miles did you travel to get here today? _____ 2. How many minutes did it take to get here today? _____ 3. Please indicate your gender. (Circle one) Male Female 4. What was the temperature in Fahrenheit on your travel here today? _____ 5. Please indicate your classification. (Circle one) Fresh Soph Jr Sr Disagree Agree 6. It was very difficult for me to travel here today. 1 2 3 4 5 . The results: Subject Miles Minutes Gender Temp oF Class Difficult 1 1 4 M 68 Jr 2 2 3 6 F 58 Fresh 1 3 3 20 M 48 Soph 2 4 5 15 F 50 Jr 3 5 8 20 F 65 Soph 2 . Consider Miles and Minutes. Miles Minutes Subject X Y X*X Y*Y X*Y 1 1 4 1 16 4 2 3 6 9 36 18 3 3 20 9 400 60 4 5 15 25 225 75 5 8 20 64 400 160 SS=Sum of Squares SSXX = 108–20*20/5 = 28 Sum 20 65 Sum 108 1077 317 SSYY = 1077–65*65/5 = 232 SS 28 232 57 SSXY = 317–20*65/5 = 57 SS = Sum of Squares of Error = Sum of Squared Errors SSXX= ( X –X )2 = X2–(X)*(X)/n = 108–20*20/5 = 28 SSYY= ( Y –Y )2 = Y2–(Y)*(Y)/n = 1077–65*65/5 = 232 SSXY= ( X –X )*( Y –Y ) = X*Y–(X)*(Y)/n = 317–20*65/5 = 57 Summary n=5 SS=Sum of Squares 2 SSXX = 108–20*20/5 = 28 X=20 X =108 2 SSYY = 1077–65*65/5 = 232 Y=65 Y =1077 SSXY = 317–20*65/5 = 57 X*Y=317 Page 2 of 12 Correlation & Regression Concepts Y=Minutes Consider the relationship between the two variables. Miles Minutes Subject 20 X Y 15 1 1 4 10 2 3 6 5 3 3 20 0 4 5 15 0 2 4 6 X=Miles 5 8 20 n=5 X=20 Y=65 X2=108 Y2=1077 X*Y=317 8 10 SS=Sum of Squares SSXX = 108–20*20/5 = 28 SSYY = 1077–65*65/5 = 232 SSXY = 317–20*65/5 = 57 1. As „Miles‟ increases, „Minutes‟ increases. Thus, there is a positive relationship or positive association between „Miles‟ and „Minutes‟. If „Minutes‟ decreased as „Miles‟ increased, the relationship would be negative. . 2. To describe the relationship between the two variables, consider the coefficient of correlation, R = SSXY / √( SSXX*SSYY ) , Range of Sample Correlation is { –1 ≤ R ≤ +1 } R = 57 / √( 28*232) ≈ 0.7072 , Excel: =Correl( X-array , Y-array ) ≈ 0.7072 3. R=0 implies no linear correlation 4. R close to ±1 implies strong linear correlation 5. R close to 0 implies weak linear correlation 6. To represent the relationship between the two variables, consider a least-squares regression line with the equation in intercept-slope form, Y=b0+b1*X, where b0 and b1 are constants, b0=intercept & b1=slope For slope, b1=SSXY/SSXX, Excel function, “=slope( Y-array , X-array )” For intercept, b0=Y–b1*X , Excel function, “=intercept( Y-array , X-array )” b1=SSXY/SSXX = 57/28 = 2.035714286 ≈ 2.0357 b0= Y – b1 *X = (65/5) – (57/28)*(20/5) = 4.857142857 ≈ 4.857 Least-squares Regression Model: Ŷ=4.857+2.0357*X 7. Since X is used to estimate Y, Y is called the “Dependent Variable” and X is called the “Independent Variable”. 8. Since X is used to estimate Y, the terminology is “Regress Y on X” or “Regress the Dependent Variable on the Independent Variable”. 9. The independent variable is used to estimate the dependent variable. (See Excel Examples) Page 3 of 12 Correlation Concepts Y=Minutes Does there seem to be a relationship between the two variables? Miles Minutes Subject 20 X Y 15 1 1 4 10 2 3 6 5 3 3 20 0 0 2 4 6 8 4 5 15 X=Miles 5 8 20 n=5 X=20 Y=65 X =108 Y2=1077 X*Y=317 2 10 SS=Sum of Squares SSXX = 108–20*20/5 = 28 SSYY = 1077–65*65/5 = 232 SSXY = 317–20*65/5 = 57 1. As „Miles‟ increases, „Minutes‟ increases. Thus, there is a positive relationship or positive association between „Miles‟ and „Minutes‟. If „Minutes‟ decreased as „Miles‟ increased, the relationship would be negative. 2. To represent the relationship between the two variables, consider the coefficient of correlation, R = SSXY / √( SSXX*SSYY ) Range of Sample Correlation is { –1 ≤ R ≤ +1 } Using Excel: =Correl( X-array , Y-array ) R = 57 / √( 28*232) ≈ 0.7072 3. R=0 implies no linear correlation 4. R close to ±1 implies strong linear correlation 5. R close to 0 implies weak linear correlation . Note: Sample Covariance is defined as Cov(X,Y)=SSXY/(n–1) =57/(5–1) = 14.25 Sample Variances are defined as SX2=SSXX/(n–1) SY2=SSYY/(n–1) =28/(5–1) = 7 =232/(5–1) = 58 Sample Standard Deviations are defined as SX=√[SSXX/(n–1)] & SY=√[SSYY/(n–1)] =√(7) ≈2.64575 =√(58) ≈7.61577 Thus, Correlation is often defined as R=Cov(X,Y)/[ SX * SY] =14.25/√(7*58)≈0.7072 (See Excel Examples) Correlation: Relationship vs. Causation Page 4 of 12 Matrix Representations Consider three sets of data, A, B, & C, along with their univariate and bivariate measures. Univariate Measures: Index:1,2,3,4,5 Sum = X X2 SS = X2 – (X)2/n S2 A: 0,2,1,2,1 6 10 2.8 0.7 B: 4,1,2,0,2 9 25 8.8 2.2 C: 2,0,0,2,0 4 8 4.8 1.2 Bivariate Measures: Variance/ Correlation Sum(XY) SSxy Covariance r Sxy Index:1,2,3,4,5 A B C A B C A B C A B C A: 0,2,1,2,1 10 6 4 2.8 –4.8 –0.8 0.7 –1.2 –0.2 –0.967 –0.218 B: 4,1,2,0,2 25 8 8.8 0.8 2.2 0.2 0.123 C: 2,0,0,2,0 8 4.8 1.2 Correlation Matrix A B C A -0.967 -0.218 B 0.123 C Covariance Matrix A B C A -1.2 -0.2 B 0.2 C Variance Matrix A B C A 0.7 B 2.2 C 1.2 Variance/Covariance/Correlation Matrix A B C A 0.7 –0.967 –0.218 B –1.2 2.2 0.123 C –0.2 0.2 1.2 (See Excel Example) Page 5 of 12 Regression Concept Y=Minutes Can we estimate one variable using another variable? Miles Minutes Subject 20 X Y 15 1 1 4 10 2 3 6 5 3 3 20 0 4 5 15 0 2 4 6 X=Miles 5 8 20 n=5 X=20 Y=65 X2=108 Y2=1077 X*Y=317 8 10 SSXX = 108–20*20/5 = 28 SSYY = 1077–65*65/5 = 232 SSXY = 317–20*65/5 = 57 1. To represent the relationship between the two variables, consider an arbitrary regression line with the equation in intercept-slope form, Y=b0+b1*X, where b0 and b1 are constants, b0=intercept & b1=slope {Example: To illustrate the line through the data, visually select the line Y=5+2*X through the data where b0=5 and b1=2. } 2. The line Y=5+2*X can be used to estimate Y given a value of X, expressed as E[Y|X]. {For example: if X=3, then Y=11, expressed as E[Y|X=3]=5+2*3=11; and, if X=7, then Y=19, expressed as E[Y|X=7]=5+2*7=19.} 3. The line, Y=5+2*X, is called a regression model because it estimates “Y given X”. 4. Since X is used to estimate Y, Y is called the “Dependent Variable” and X is called the “Independent Variable”. 5. Since X is used to estimate Y, the terminology is “Regress Y on X” or “Regress the Dependent Variable on the Independent Variable”. 6. Statements about the relationship between variables. The independent variable is used to estimate the dependent variable. The independent variable contributes to the estimation of the dependent variable. The independent variable explains the variability of the dependent variable. The independent variable is used to model the dependent variable. 7. A common procedure yields the least-squares regression equation. Least-squares estimates b0 & b1 for the intercept and slope can be obtained using equations or functions in Excel. Y=b0+b1*X, where b0 and b1 are constants, b0=intercept & b1=slope For slope, b1=SSXY/SSXX, Excel function, “=slope( Y-array , X-array )” For intercept, b0=Y–b1*X , Excel function, “=intercept( Y-array , X-array )” b1=SSXY/SSXX = 57/28 = 2.035714286 ≈ 2.0357 b0= Y – b1 *X = (65/5) – (57/28)*(20/5) = 4.857142857 ≈ 4.857 (See Excel Examples) Page 6 of 12 Subject 1 Miles Minutes Estimate X Y Ŷ=5+2X 1 4 6 Error =(Ŷ–Y) 2 2 3 6 12 6 3 3 20 26 6 4 5 15 25 10 5 8 20 36 16 Y=Minutes Regression Definitions and Terminology 20 15 10 5 0 0 2 4 6 8 X=Miles 10 Consider the data and the regression line, Y=5+2*X. The regression line: Y=5+2*X Now, the mean of Y isY=Y/n=13 But the estimate for E[Y|X] for X=3 is: Model (Ŷ=11): Ŷ=b0+b1*X=5+2*3=11 Error, = (Ŷ – Y) between Model and Data Data for X=3 is (Y=6): Y=0+1*X+ Y=Minutes 20 15 10 5 0 0 2 4 6 8 X=Miles Model of Data: Y=0+1*X+ (Example: Data sets above.) Model of Fit: Ŷ=b0+b1*X (Example: Ŷ=5+2*X ) (Regression Model) Mean of Y: isY=Y/n (Example:Y= Y=0+1*X+ Ŷ=b0+b1*X 10 0 and 1 are Parameters. is an error term. b0 and b1 are Estimates of the Parameters. E[Y|X] is the Parameter, population mean of Y|X. Ŷ=b0+b1*X is the Estimate of the Parameter. Y is the Dependent Variable representing Data X is the Independent Variable representing Data 0 is the “Intercept” parameter 1 is the “Slope” parameter. is the error term Ŷ is the estimate of E[Y|X], the population mean of Y conditioned on X representing the value from a regression equation for a given value of X. b0 is the estimate of 0 (the intercept) b1 is the estimate of 1 (the slope) Y is the sample mean or the estimate of E[Y] not conditioned on X or the estimate of the mean of Y independent of X Regression Model of Data Regression Equation Regression Model of the Fit of the Data Page 7 of 12 (Optional) Least-Squares Regression Derivation The regression model that minimizes the sum of squared errors is called “Least-squares regression”. Model of Data: Y=0+1*X+ Solve for Error term: = [ Y – ( 0+1*X) ] Substitute Estimates: = [ Y – ( b0+b1*X) ] Square Error term: 2 = [ Y – ( b0+b1*X) ]2 Sum over all data: 2 = [ Y – ( b0+b1*X) ]2 Find b0 and b1 that minimizes the squared error. Min 2 = Min [ Y – ( b0+b1*X) ]2 b0,b1 b0,b1 Using calculus: Taking derivative, ∂(2)/∂(b0) = 2 [ Y – ( b0+b1*X) ] (–1) = 0 Summing through, Y – n*b0 – b1*X = 0 Collecting terms, b0 = Y/n – b1*X/n Alternative form, b0 = Y – b1 *X Taking derivative, ∂(2)/∂(b1) = 2 [ Y – ( b0+b1*X) ] (–X) = 0 Summing through, XY – b0*X – b1*X2 = 0 Substituting for b0, XY – ( Y/n – b1*X/n )*X – b1*X2 = 0 Collecting terms, b1 XY – (X)*(Y)/n ]/[ X2 – (X)2/n ] Alternative form, b1 = SSXY/SSXX The regression model that minimizes the sum of squared errors is called “Least-squares regression”. Model of Data: Y=0+1*X+ Least-squares Regression Model: Ŷ=b0+b1*X , for b1=SSXY/SSXX and b0=Y – b1 *X Example. Subject 1 2 3 4 5 Sum SS X=Miles 1 3 3 5 8 20 Y=Minutes 4 6 20 15 20 65 SS=Sum of Squares X*X 1 9 9 25 64 108 28 SSXX = 108–20*20/5 = 28 SSYY = 1077–65*65/5 = 232 Y*Y 16 36 400 225 400 1077 232 SSXY = 317–20*65/5 = 57 X*Y 4 18 60 75 160 317 57 Least-squares estimates: b1=SSXY/SSXX = 57/28 = 2.035714286 ≈ 2.0357 b0= Y – b1 *X = (65/5) – (57/28)*(20/5) = 4.857142857 ≈ 4.857 Least-squares Regression Model: Ŷ=4.857+2.0357*X Page 8 of 12 Sum of Squares Y=Minutes 20 Regression Line: Ŷ=b0+b1*X 15 Y:Y =Y/n Ŷ: Ŷ=b0+b1*X 10 Y: Y=0+1*X+ 5 0 0 2 4 6 8 X=Miles 10 Definitions. Model of Data: Y=0+1*X+ Least-squares Regression Model: Ŷ=b0+b1*X , for b1=SSXY/SSXX and b0=Y – b1 *X Sample mean of Data :Y = Y/n Total Sum of Squares, TSS = (Y –Y)2 = SSYY = 232 Sum of Squares of Model, SSM = (Ŷ –Y )2 = (SSXY)2/SSXX = (572/28) ≈ 116.04 (Sum of Squares of Regression) Sum of Squares of Error, SSE = (Y – Ŷ )2 = SSYY – (SSXY)2/SSXX ≈ 115.96 (Sum of Squares of Residuals) From Example: SSYY = 232; SSXX = 28; SSXY = 57 Identities. Total TSS (Y –Y)2 SSYY = Model = SSM = (Ŷ –Y )2 = [ (SSXY)2/SSXX ] + + + + Error SSE (Y – Ŷ )2 [ SSYY – (SSXY)2/SSXX ] 1. Total variation is explained by either the variation due to the model or variation due to random error. 2. As Ŷ approachesY, SSM approaches zero and SSE approaches TSS. 3. As Ŷ approachesY, the total variation is explained less by the model and more by random error. 4. As Ŷ approachesY, the regression slope approaches zero and the correlation coefficient approaches zero. Consider the estimate of the regression line, Ŷ=b0+b1*X = (Y – b1 *X ) + b1 * X , as b1 approaches zero, Ŷ approachesY. Page 9 of 12 R2 – Coefficient of Determination Total TSS (Y –Y)2 SSYY = Model = SSM = (Ŷ –Y )2 = [ (SSXY)2/SSXX ] + + + + Error SSE (Y – Ŷ )2 [ SSYY – (SSXY)2/SSXX ] Divide by TSS: TSS/TSS = SSM/TSS + SSE/TSS 2 1 = [ (SSXY) /(SSXX *SSYY ) ] + [ 1 – (SSXY)2/(SSXX *SSYY )] 4. Define: Coefficient of Determination, R2= SSM/TSS Thus: 1 = R2 + [ 1 – R2 ] Coefficient of Determination R2= SSM/TSS = (SSXY)2/(SSXX*SSYY) 0 ≤ R2 ≤ 1 1. R2 = SSM/TSS represents the percent of total variation explained by the model. 2. 1–R2 = SSE/TSS represents the percent of total variation explained by random error. The correlation coefficient, r = sqrt(R2) = SSXY/sqrt( SSXX*SSYY ), –1 ≤ r ≤ +1 Example. Subject 1 2 3 4 5 Sum SS X=Miles 1 3 3 5 8 20 Y=Minutes 4 6 20 15 20 65 SS=Sum of Squares X*X 1 9 9 25 64 108 28 SSXX = 108–20*20/5 = 28 SSYY = 1077–65*65/5 = 232 Y*Y 16 36 400 225 400 1077 232 SSXY = 317–20*65/5 = 57 X*Y 4 18 60 75 160 317 57 Coefficient of Determination, R2 = SSM/TSS = (SSXY)2/(SSXX*SSYY) = (572/(28*232) = 0.500153941 ≈ 0.5 Total Sum of Squares, TSS = (Y –Y)2 = SSYY = 232 Sum of Squares of Model, SSM = (Ŷ –Y )2 = (SSXY)2/SSXX = (572/28) ≈ 116.04 = TSS*R2 = [SSYY ]*[ (SSXY)2/(SSXX*SSYY) ] = (SSXY)2/SSXX = (572/28) ≈ 116.04 Sum of Squares of Error, SSE = (Y – Ŷ )2 = SSYY – (SSXY)2/SSXX ≈ 115.96 = TSS*(1–R2) = [SSYY]*[1 – (SSXY)2/(SSXX*SSYY) ] = SSYY – (SSXY)2/SSXX ≈ 115.96 Correlation coefficient, r = sqrt(R2) = =57/sqrt(28*232) = 0.707215625 ≈ 0.7 Page 10 of 12 ANOVA – Analysis of Variance – Regression Significance Consider the values: Subject 1 2 3 4 5 Sum X=Miles 1 3 3 5 8 20 Y=Minutes 4 6 20 15 20 65 ANOVA Table SOURCE MODEL (Regression) ERROR (Residuals) TOTAL (Corrected Total) DF SS MS 1 SSM MSM=SSM/1 n-2 SSE SSXY =57 SSXX =28 SSYY =232 Ft R2 MSM/MSE SSM/TSS MSE=SSE/(n-2) n-1 TSS Using Algebra to combine relationships: R2 = SSM/TSS ≈ 0.500153941 TSS = (Y –Y)2 = SSYY = 232 SSM = (Ŷ –Y )2 = TSS*R2 ≈ 116.04 SSE = (Y – Ŷ )2 = TSS*(1–R2) ≈ 115.96 r = sqrt (R2) ≈ 0.7072 ANOVA Table SOURCE DF SS MS Ft R2 MODEL 1 SSM=TSS*R2 MSM=SSM/1 MSM/MSE SSM/TSS 2 ERROR n-2 SSE=TSS*(1–R ) MSE=SSE/(n-2) TOTAL n-1 TSS=SSYY ANOVA Table SOURCE DF SS MS Ft R2 P-value MODEL 1 116.04 116.04 3.00 0.50 0.182 ERROR 3 115.96 38.65 TOTAL 4 232 Let = Level of Significance. Then regression is significant when P-value < . P-value is determined using Excel, =1 – F.DIST( Ft , 1, DFE=(n–2) , 1 ) For this example, P-value = 1 – F.DIST(3,1,3,1) ≈ 0.182 For =0.10, Regression is not significant since P-value>. Significant regression implies slope is not zero. Regression is not significant implies slope is zero. Page 11 of 12 Correlation Significance Consider three sets of data, A, B, & C, along with their univariate and bivariate measures. Univariate Measures: Index:1,2,3,4,5 Sum = X X2 SS = X2 – (X)2/n S2 A: 0,2,1,2,1 6 10 2.8 0.7 B: 4,1,2,0,2 9 25 8.8 2.2 C: 2,0,0,2,0 4 8 4.8 1.2 Bivariate Measures: Variance/ Correlation Sum(XY) SSxy Covariance r Sxy Index:1,2,3,4,5 A B C A B C A B C A B C A: 0,2,1,2,1 10 6 4 2.8 –4.8 –0.8 0.7 –1.2 –0.2 –0.967 –0.218 B: 4,1,2,0,2 25 8 8.8 0.8 2.2 0.2 0.123 C: 2,0,0,2,0 8 4.8 1.2 Given Level of Significance = . Let tt = √[ (n–2)*R2/(1–R2) ]. Then, determine P-value using Excel, P-value = 2*(1–T.DIST( tt , n–2 , 1 ) ) If P-value < /2 , correlation is significant (correlation is not zero). If P-value > /2 , correlation is not significant (correlation is essentially zero). Correlation Matrix A B C A -0.967* -0.218 B 0.123 C R2 A B C A 0.935* 0.048 B 0.015 C *significant at =0.10 Let =0.05 RAB is significant RAC is not significant RBC is not significant A A B C P-value B C 0.0072* 0.7244 0.8437 Let =0.10 RAB is significant RAC is not significant RBC is not significant Page 12 of 12