g-emba-stat-lec05-b-regression-fall-2012

Correlation & Regression – (Estimation & Significance)
XBUS 6240 – Understanding Managerial Statistics – Fall 2012
M.D. Harper, Ph.D.
*Correlation. Describes relationship between two variables.
*Regression. Uses one variable to predict another variable.
*Estimation.
Correlation: Determine coefficient of correlation.
Regression: Determine regression model.
*Significance.
Correlation: Make inference that correlation is significantly different from zero.
Regression: Make inference that least-squares regression slope is significantly
different from zero.
(Summary)
Given n pairs of data labeled X and Y.
Mean of X is X̄ = ΣX/n
Mean of Y is Ȳ = ΣY/n
Sum of Squared Errors about the Mean (SS):
SS of Variable Y: SSYY = Σ(Y – Ȳ)² = ΣY² – (ΣY)²/n
SS of Variable X: SSXX = Σ(X – X̄)² = ΣX² – (ΣX)²/n
SS between variables: SSXY = Σ(X – X̄)*(Y – Ȳ) = ΣXY – (ΣX)*(ΣY)/n
Estimation.
Coefficient of Correlation: R=SSXY/√(SSXX*SSYY)
For Regression, let Y=Dependent Variable & X=Independent Variable.
Regress Y on X.
Least-squares Regression Model: Ŷ=b0+b1*X
where b1 = SSXY/SSXX & b0 = Ȳ – b1*X̄
Significance.
Given Level of Significance = α.
Correlation. Let tt = √[ (n–2)*R²/(1–R²) ]. Then, using the Excel function,
when 2*(1–T.DIST( tt , n–2 , 1 ) ) < α , the correlation is significant.
Least-Squares Regression. Let Ft = (n–2)*R²/(1–R²) = tt² . Then, using the Excel
function, when 1–F.DIST( Ft , 1 , n–2 , 1) < α , the regression is significant.
(Review)
Given paired data
A survey was administered to 5 subjects containing the following items.
[A questionnaire was given to 5 people containing the following questions.]
1. How many miles did you travel to get here today? _____
2. How many minutes did it take to get here today? _____
3. Please indicate your gender. (Circle one) Male Female
4. What was the temperature in Fahrenheit on your travel here today? _____
5. Please indicate your classification. (Circle one) Fresh Soph Jr Sr
6. It was very difficult for me to travel here today. Disagree 1 2 3 4 5 Agree
The results:
Subject  Miles  Minutes  Gender  Temp (°F)  Class  Difficult
1        1      4        M       68         Jr     2
2        3      6        F       58         Fresh  1
3        3      20       M       48         Soph   2
4        5      15       F       50         Jr     3
5        8      20       F       65         Soph   2
Consider Miles and Minutes.
         Miles  Minutes
Subject  X      Y        X*X   Y*Y    X*Y
1        1      4        1     16     4
2        3      6        9     36     18
3        3      20       9     400    60
4        5      15       25    225    75
5        8      20       64    400    160
Sum      20     65       108   1077   317
SS                       28    232    57
SS = Sum of Squares of Error = Sum of Squared Errors
SSXX = Σ(X – X̄)² = ΣX² – (ΣX)*(ΣX)/n = 108–20*20/5 = 28
SSYY = Σ(Y – Ȳ)² = ΣY² – (ΣY)*(ΣY)/n = 1077–65*65/5 = 232
SSXY = Σ(X – X̄)*(Y – Ȳ) = ΣX*Y – (ΣX)*(ΣY)/n = 317–20*65/5 = 57
Summary
n=5, ΣX=20, ΣY=65, ΣX²=108, ΣY²=1077, ΣX*Y=317
SSXX = 28, SSYY = 232, SSXY = 57
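The computational formulas above can also be checked outside Excel. The following is a minimal Python sketch, not part of the original handout; the function name sums_of_squares and the variable names are illustrative assumptions.

```python
# Paired data from the survey: X = Miles, Y = Minutes.
miles = [1, 3, 3, 5, 8]
minutes = [4, 6, 20, 15, 20]

def sums_of_squares(x, y):
    """Return (SSXX, SSYY, SSXY) using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)               # sum of X*X
    sum_yy = sum(v * v for v in y)               # sum of Y*Y
    sum_xy = sum(a * b for a, b in zip(x, y))    # sum of X*Y
    ssxx = sum_xx - sum_x * sum_x / n
    ssyy = sum_yy - sum_y * sum_y / n
    ssxy = sum_xy - sum_x * sum_y / n
    return ssxx, ssyy, ssxy

ssxx, ssyy, ssxy = sums_of_squares(miles, minutes)
print(ssxx, ssyy, ssxy)  # 28.0 232.0 57.0
```

The three values match the SS row of the table.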
Correlation & Regression Concepts
[Figure: scatter plot of Y=Minutes against X=Miles]
Consider the relationship between the two variables.
Subject  X=Miles  Y=Minutes
1        1        4
2        3        6
3        3        20
4        5        15
5        8        20
n=5, ΣX=20, ΣY=65, ΣX²=108, ΣY²=1077, ΣX*Y=317
SS=Sum of Squares
SSXX = 108–20*20/5 = 28
SSYY = 1077–65*65/5 = 232
SSXY = 317–20*65/5 = 57
1. As 'Miles' increases, 'Minutes' increases. Thus, there is a positive relationship or
positive association between 'Miles' and 'Minutes'. If 'Minutes' decreased as
'Miles' increased, the relationship would be negative.
2. To describe the relationship between the two variables,
consider the coefficient of correlation,
R = SSXY / √( SSXX*SSYY ) , Range of Sample Correlation is { –1 ≤ R ≤ +1 }
R = 57 / √( 28*232) ≈ 0.7072 , Excel: =Correl( X-array , Y-array ) ≈ 0.7072
3. R=0 implies no linear correlation
4. R close to ±1 implies strong linear correlation
5. R close to 0 implies weak linear correlation
6. To represent the relationship between the two variables, consider a least-squares
regression line with the equation in intercept-slope form,
Y=b0+b1*X, where b0 and b1 are constants, b0=intercept & b1=slope
For slope, b1=SSXY/SSXX, Excel function, “=slope( Y-array , X-array )”
For intercept, b0 = Ȳ – b1*X̄ , Excel function, “=intercept( Y-array , X-array )”
b1=SSXY/SSXX = 57/28 = 2.035714286 ≈ 2.0357
b0 = Ȳ – b1*X̄ = (65/5) – (57/28)*(20/5) = 4.857142857 ≈ 4.857
Least-squares Regression Model: Ŷ=4.857+2.0357*X
7. Since X is used to estimate Y, Y is called the “Dependent Variable” and X is called the
“Independent Variable”.
8. Since X is used to estimate Y, the terminology is “Regress Y on X” or “Regress the
Dependent Variable on the Independent Variable”.
9. The independent variable is used to estimate the dependent variable.
(See Excel Examples)
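As a cross-check on the Excel functions named above, the correlation and the least-squares estimates can be reproduced from the sums of squares. This is a minimal Python sketch; the helper name predict is an assumption introduced for illustration and does not come from the handout.

```python
import math

# Sums of squares and totals from the worked example.
ssxx, ssyy, ssxy = 28.0, 232.0, 57.0
n, sum_x, sum_y = 5, 20, 65

r = ssxy / math.sqrt(ssxx * ssyy)     # coefficient of correlation
b1 = ssxy / ssxx                      # least-squares slope
b0 = sum_y / n - b1 * sum_x / n       # least-squares intercept

def predict(x):
    """Estimate Minutes for a given number of Miles (hypothetical helper)."""
    return b0 + b1 * x

print(round(r, 4), round(b1, 4), round(b0, 3))  # 0.7072 2.0357 4.857
```

Evaluating predict at the mean of X (4 miles) returns the mean of Y (13 minutes), since the least-squares line always passes through the point of means.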
Correlation Concepts
[Figure: scatter plot of Y=Minutes against X=Miles]
Does there seem to be a relationship between the two variables?
Subject  X=Miles  Y=Minutes
1        1        4
2        3        6
3        3        20
4        5        15
5        8        20
n=5, ΣX=20, ΣY=65, ΣX²=108, ΣY²=1077, ΣX*Y=317
SS=Sum of Squares
SSXX = 108–20*20/5 = 28
SSYY = 1077–65*65/5 = 232
SSXY = 317–20*65/5 = 57
1. As 'Miles' increases, 'Minutes' increases. Thus, there is a positive relationship or
positive association between 'Miles' and 'Minutes'. If 'Minutes' decreased as
'Miles' increased, the relationship would be negative.
2. To represent the relationship between the two variables, consider the coefficient of
correlation,
R = SSXY / √( SSXX*SSYY )
Range of Sample Correlation is { –1 ≤ R ≤ +1 }
Using Excel: =Correl( X-array , Y-array )
R = 57 / √( 28*232) ≈ 0.7072
3. R=0 implies no linear correlation
4. R close to ±1 implies strong linear correlation
5. R close to 0 implies weak linear correlation
Note:
Sample Covariance is defined as
Cov(X,Y) = SSXY/(n–1) = 57/(5–1) = 14.25
Sample Variances are defined as
SX² = SSXX/(n–1) = 28/(5–1) = 7
SY² = SSYY/(n–1) = 232/(5–1) = 58
Sample Standard Deviations are defined as
SX = √[SSXX/(n–1)] = √(7) ≈ 2.64575
SY = √[SSYY/(n–1)] = √(58) ≈ 7.61577
Thus, Correlation is often defined as R = Cov(X,Y)/[ SX * SY ] = 14.25/√(7*58) ≈ 0.7072
(See Excel Examples)
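The covariance form of the correlation can be verified with Python's standard statistics module. This is a sketch only; the covariance is computed by hand here so that the example does not depend on newer library functions, and all variable names are illustrative.

```python
import statistics

miles = [1, 3, 3, 5, 8]
minutes = [4, 6, 20, 15, 20]
n = len(miles)
mean_x, mean_y = statistics.mean(miles), statistics.mean(minutes)

# Sample covariance: SSXY/(n-1), written out from the definition.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(miles, minutes)) / (n - 1)

var_x = statistics.variance(miles)      # SSXX/(n-1)
var_y = statistics.variance(minutes)    # SSYY/(n-1)
sd_x = statistics.stdev(miles)
sd_y = statistics.stdev(minutes)

r = cov_xy / (sd_x * sd_y)
print(cov_xy, var_x, var_y, round(r, 4))
```

The (n–1) divisors cancel in the ratio, which is why this definition agrees with R = SSXY/√(SSXX*SSYY).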
Correlation: Relationship vs. Causation
Matrix Representations
Consider three sets of data, A, B, & C, along with their univariate and bivariate measures.
Univariate Measures:
Set  Index:1,2,3,4,5  Sum=ΣX  ΣX²  SS=ΣX²–(ΣX)²/n  S²
A    0,2,1,2,1        6       10   2.8             0.7
B    4,1,2,0,2        9       25   8.8             2.2
C    2,0,0,2,0        4       8    4.8             1.2
Bivariate Measures:
Sum(XY)
     A     B     C
A    10    6     4
B          25    8
C                8
SSxy
     A     B     C
A    2.8   –4.8  –0.8
B          8.8   0.8
C                4.8
Variance/Covariance Sxy
     A     B     C
A    0.7   –1.2  –0.2
B          2.2   0.2
C                1.2
Correlation r
     A     B       C
A          –0.967  –0.218
B                  0.123
C
Correlation Matrix
     A     B       C
A          -0.967  -0.218
B                  0.123
C
Covariance Matrix
     A     B     C
A          -1.2  -0.2
B                0.2
C
Variance Matrix
     A     B     C
A    0.7
B          2.2
C                1.2
Variance/Covariance/Correlation Matrix
(diagonal: variances; above the diagonal: correlations; below the diagonal: covariances)
     A      B       C
A    0.7    –0.967  –0.218
B    –1.2   2.2     0.123
C    –0.2   0.2     1.2
(See Excel Example)
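The matrices for sets A, B, and C can be rebuilt from the raw values. The sketch below is one possible way to do this in Python; the dictionary layout and the helper name ss are assumptions made for illustration.

```python
import math

data = {"A": [0, 2, 1, 2, 1], "B": [4, 1, 2, 0, 2], "C": [2, 0, 0, 2, 0]}

def ss(x, y):
    """SSxy = sum(X*Y) - sum(X)*sum(Y)/n; with y == x this is SSxx."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

names = list(data)
n = len(data["A"])

# Variance/covariance: SSxy/(n-1). Diagonal entries are the variances.
cov = {(p, q): ss(data[p], data[q]) / (n - 1) for p in names for q in names}

# Correlation: covariance divided by the product of standard deviations.
corr = {(p, q): cov[p, q] / math.sqrt(cov[p, p] * cov[q, q])
        for p in names for q in names}

for p in names:
    print(p,
          [round(cov[p, q], 1) for q in names],
          [round(corr[p, q], 3) for q in names])
```

Both dictionaries are symmetric, so the handout's upper-triangle display loses no information.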
Regression Concept
[Figure: scatter plot of Y=Minutes against X=Miles]
Can we estimate one variable using another variable?
Subject  X=Miles  Y=Minutes
1        1        4
2        3        6
3        3        20
4        5        15
5        8        20
n=5, ΣX=20, ΣY=65, ΣX²=108, ΣY²=1077, ΣX*Y=317
SSXX = 108–20*20/5 = 28
SSYY = 1077–65*65/5 = 232
SSXY = 317–20*65/5 = 57
1. To represent the relationship between the two variables, consider an arbitrary
regression line with the equation in intercept-slope form,
Y=b0+b1*X, where b0 and b1 are constants, b0=intercept & b1=slope
{Example: To illustrate the line through the data, visually select
the line Y=5+2*X through the data where b0=5 and b1=2. }
2. The line Y=5+2*X can be used to estimate Y given a value of X, expressed as E[Y|X].
{For example: if X=3, then Y=11, expressed as E[Y|X=3]=5+2*3=11;
and, if X=7, then Y=19, expressed as E[Y|X=7]=5+2*7=19.}
3. The line, Y=5+2*X, is called a regression model because it estimates “Y given X”.
4. Since X is used to estimate Y, Y is called the “Dependent Variable” and X is called the
“Independent Variable”.
5. Since X is used to estimate Y, the terminology is “Regress Y on X” or “Regress the
Dependent Variable on the Independent Variable”.
6. Statements about the relationship between variables.
The independent variable is used to estimate the dependent variable.
The independent variable contributes to the estimation of the dependent variable.
The independent variable explains the variability of the dependent variable.
The independent variable is used to model the dependent variable.
7. A common procedure yields the least-squares regression equation.
Least-squares estimates b0 & b1 for the intercept and slope can be obtained using
equations or functions in Excel.
Y=b0+b1*X, where b0 and b1 are constants, b0=intercept & b1=slope
For slope, b1=SSXY/SSXX, Excel function, “=slope( Y-array , X-array )”
For intercept, b0 = Ȳ – b1*X̄ , Excel function, “=intercept( Y-array , X-array )”
b1=SSXY/SSXX = 57/28 = 2.035714286 ≈ 2.0357
b0 = Ȳ – b1*X̄ = (65/5) – (57/28)*(20/5) = 4.857142857 ≈ 4.857
(See Excel Examples)
Regression Definitions and Terminology
         Miles  Minutes  Estimate  Error
Subject  X      Y        Ŷ=5+2X    =(Ŷ–Y)
1        1      4        7         3
2        3      6        11        5
3        3      20       11        –9
4        5      15       15        0
5        8      20       21        1
[Figure: scatter plot of Y=Minutes against X=Miles with the regression line Y=5+2*X]
Consider the data and the regression line, Y=5+2*X.
The regression line: Y=5+2*X
Now, the mean of Y is Ȳ=ΣY/n=13
But the estimate for E[Y|X] for X=3 is:
Model (Ŷ=11): Ŷ=b0+b1*X=5+2*3=11
Data for X=3 is (Y=6): Y=β0+β1*X+ε
Error, ε = (Ŷ – Y) between Model and Data
[Figure: scatter plot of Y=Minutes against X=Miles showing the data Y=β0+β1*X+ε, the fitted line Ŷ=b0+b1*X, and the mean Ȳ]
Model of Data: Y=β0+β1*X+ε (Example: Data sets above.)
Model of Fit (Regression Model): Ŷ=b0+b1*X (Example: Ŷ=5+2*X )
Mean of Y: Ȳ=ΣY/n (Example: Ȳ=13)
β0 and β1 are Parameters. ε is an error term.
b0 and b1 are Estimates of the Parameters.
E[Y|X] is the Parameter, population mean of Y|X.
Ŷ=b0+b1*X is the Estimate of the Parameter.
Y is the Dependent Variable representing Data
X is the Independent Variable representing Data
β0 is the “Intercept” parameter
β1 is the “Slope” parameter
ε is the error term
Ŷ is the estimate of E[Y|X], the population mean of Y conditioned on X,
representing the value from a regression equation for a given value of X.
b0 is the estimate of β0 (the intercept)
b1 is the estimate of β1 (the slope)
Ȳ is the sample mean,
or the estimate of E[Y] not conditioned on X,
or the estimate of the mean of Y independent of X
Y=β0+β1*X+ε is the Regression Model of Data
Ŷ=b0+b1*X is the Regression Equation, or Regression Model of the Fit of the Data
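To make the error term concrete, the sketch below evaluates the visually selected line Ŷ=5+2*X at each data point and compares its total squared error with that of the least-squares line. This is an illustrative Python sketch rather than part of the handout; the function name squared_error is an assumption.

```python
miles = [1, 3, 3, 5, 8]
minutes = [4, 6, 20, 15, 20]

def squared_error(b0, b1):
    """Sum of squared errors (Y-hat minus Y) for the line Y-hat = b0 + b1*X."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(miles, minutes))

# Estimates and errors for the visually selected line Y-hat = 5 + 2*X.
estimates = [5 + 2 * x for x in miles]
errors = [est - y for est, y in zip(estimates, minutes)]
print(estimates)  # [7, 11, 11, 15, 21]
print(errors)     # [3, 5, -9, 0, 1]

sse_visual = squared_error(5, 2)
# Least-squares estimates from the handout: b1 = 57/28, b0 = 13 - b1*4.
sse_least_squares = squared_error(65 / 5 - (57 / 28) * (20 / 5), 57 / 28)
print(sse_visual, round(sse_least_squares, 2))
```

The visually chosen line is close to the least-squares line, so its squared error is only slightly larger; no other straight line gives a smaller total than the least-squares fit.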
(Optional)
Least-Squares Regression Derivation
The regression model that minimizes the sum of squared errors
is called “Least-squares regression”.
Model of Data: Y = β0 + β1*X + ε
Solve for Error term: ε = [ Y – ( β0 + β1*X ) ]
Substitute Estimates: ε = [ Y – ( b0 + b1*X ) ]
Square Error term: ε² = [ Y – ( b0 + b1*X ) ]²
Sum over all data: Σε² = Σ[ Y – ( b0 + b1*X ) ]²
Find b0 and b1 that minimize the squared error.
Min over b0,b1 of Σε² = Min over b0,b1 of Σ[ Y – ( b0 + b1*X ) ]²
Using calculus:
Taking derivative, ∂(Σε²)/∂(b0) = 2 Σ[ Y – ( b0 + b1*X ) ](–1) = 0
Summing through, ΣY – n*b0 – b1*ΣX = 0
Collecting terms, b0 = ΣY/n – b1*ΣX/n
Alternative form, b0 = Ȳ – b1*X̄
Taking derivative, ∂(Σε²)/∂(b1) = 2 Σ[ Y – ( b0 + b1*X ) ](–X) = 0
Summing through, ΣXY – b0*ΣX – b1*ΣX² = 0
Substituting for b0, ΣXY – ( ΣY/n – b1*ΣX/n )*ΣX – b1*ΣX² = 0
Collecting terms, b1 = [ ΣXY – (ΣX)*(ΣY)/n ]/[ ΣX² – (ΣX)²/n ]
Alternative form, b1 = SSXY/SSXX
The regression model that minimizes the sum of squared errors is called “Least-squares
regression”.
Model of Data: Y = β0 + β1*X + ε
Least-squares Regression Model: Ŷ=b0+b1*X , for b1=SSXY/SSXX and b0 = Ȳ – b1*X̄
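The two "summing through" equations in the derivation can be checked numerically: at the least-squares estimates, both should evaluate to zero. A minimal Python sketch under that reading, with illustrative variable names:

```python
miles = [1, 3, 3, 5, 8]
minutes = [4, 6, 20, 15, 20]
n = len(miles)
sum_x, sum_y = sum(miles), sum(minutes)
sum_xx = sum(x * x for x in miles)
sum_xy = sum(x * y for x, y in zip(miles, minutes))

# Closed-form least-squares estimates from the derivation.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

# Left-hand sides of the two "summing through" equations; both should be
# zero (up to floating-point rounding) at the minimizing b0 and b1.
d_b0 = sum_y - n * b0 - b1 * sum_x
d_b1 = sum_xy - b0 * sum_x - b1 * sum_xx
print(b0, b1, d_b0, d_b1)
```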
Example.
Subject    1   2   3    4    5    Sum   SS
X=Miles    1   3   3    5    8    20
Y=Minutes  4   6   20   15   20   65
X*X        1   9   9    25   64   108   28
Y*Y        16  36  400  225  400  1077  232
X*Y        4   18  60   75   160  317   57
SS=Sum of Squares
SSXX = 108–20*20/5 = 28
SSYY = 1077–65*65/5 = 232
SSXY = 317–20*65/5 = 57
Least-squares estimates:
b1=SSXY/SSXX = 57/28 = 2.035714286 ≈ 2.0357
b0 = Ȳ – b1*X̄ = (65/5) – (57/28)*(20/5) = 4.857142857 ≈ 4.857
Least-squares Regression Model: Ŷ=4.857+2.0357*X
Sum of Squares
[Figure: scatter plot of Y=Minutes against X=Miles showing the data Y=β0+β1*X+ε, the regression line Ŷ=b0+b1*X, and the mean Ȳ=ΣY/n]
Definitions.
Model of Data: Y=β0+β1*X+ε
Least-squares Regression Model: Ŷ=b0+b1*X , for b1=SSXY/SSXX and b0 = Ȳ – b1*X̄
Sample mean of Data: Ȳ = ΣY/n
Total Sum of Squares, TSS = Σ(Y – Ȳ)² = SSYY = 232
Sum of Squares of Model, SSM = Σ(Ŷ – Ȳ)² = (SSXY)²/SSXX = 57²/28 ≈ 116.04
(Sum of Squares of Regression)
Sum of Squares of Error, SSE = Σ(Y – Ŷ)² = SSYY – (SSXY)²/SSXX ≈ 115.96
(Sum of Squares of Residuals)
From Example: SSYY = 232; SSXX = 28; SSXY = 57
Identities.
Total = Model + Error
TSS = SSM + SSE
Σ(Y – Ȳ)² = Σ(Ŷ – Ȳ)² + Σ(Y – Ŷ)²
SSYY = [ (SSXY)²/SSXX ] + [ SSYY – (SSXY)²/SSXX ]
1. Total variation is explained by either the variation due to the model or variation due to
random error.
2. As Ŷ approaches Ȳ, SSM approaches zero and SSE approaches TSS.
3. As Ŷ approaches Ȳ, the total variation is explained less by the model and more by
random error.
4. As Ŷ approaches Ȳ, the regression slope approaches zero and the correlation
coefficient approaches zero. Consider the estimate of the regression line,
Ŷ=b0+b1*X = (Ȳ – b1*X̄) + b1*X; as b1 approaches zero, Ŷ approaches Ȳ.
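The decomposition of the total sum of squares can be confirmed by computing each piece directly from the fitted values, without the shortcut formulas. A minimal Python sketch, with illustrative variable names:

```python
miles = [1, 3, 3, 5, 8]
minutes = [4, 6, 20, 15, 20]

# Least-squares estimates from the handout.
b1 = 57 / 28
b0 = 65 / 5 - b1 * (20 / 5)

mean_y = sum(minutes) / len(minutes)
fitted = [b0 + b1 * x for x in miles]

tss = sum((y - mean_y) ** 2 for y in minutes)            # total
ssm = sum((f - mean_y) ** 2 for f in fitted)             # model (regression)
sse = sum((y - f) ** 2 for y, f in zip(minutes, fitted)) # error (residuals)

print(round(tss, 2), round(ssm, 2), round(sse, 2))  # 232.0 116.04 115.96
```

The identity TSS = SSM + SSE holds for the least-squares line; for an arbitrary line the cross-product term does not vanish and the pieces need not add up.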
R² – Coefficient of Determination
Total = Model + Error
TSS = SSM + SSE
Σ(Y – Ȳ)² = Σ(Ŷ – Ȳ)² + Σ(Y – Ŷ)²
SSYY = [ (SSXY)²/SSXX ] + [ SSYY – (SSXY)²/SSXX ]
Divide by TSS:
TSS/TSS = SSM/TSS + SSE/TSS
1 = [ (SSXY)²/(SSXX*SSYY) ] + [ 1 – (SSXY)²/(SSXX*SSYY) ]
Define: Coefficient of Determination, R² = SSM/TSS
Thus: 1 = R² + [ 1 – R² ]
Coefficient of Determination
R² = SSM/TSS = (SSXY)²/(SSXX*SSYY)
0 ≤ R² ≤ 1
1. R² = SSM/TSS represents the proportion of total variation explained by the model.
2. 1–R² = SSE/TSS represents the proportion of total variation attributed to random error.
The correlation coefficient, r = ±sqrt(R²) = SSXY/sqrt( SSXX*SSYY ), –1 ≤ r ≤ +1,
where r takes the sign of SSXY.
Example.
Subject    1   2   3    4    5    Sum   SS
X=Miles    1   3   3    5    8    20
Y=Minutes  4   6   20   15   20   65
X*X        1   9   9    25   64   108   28
Y*Y        16  36  400  225  400  1077  232
X*Y        4   18  60   75   160  317   57
SS=Sum of Squares
SSXX = 108–20*20/5 = 28
SSYY = 1077–65*65/5 = 232
SSXY = 317–20*65/5 = 57
Coefficient of Determination, R² = SSM/TSS
= (SSXY)²/(SSXX*SSYY) = 57²/(28*232) = 0.500153941 ≈ 0.5
Total Sum of Squares, TSS = Σ(Y – Ȳ)² = SSYY = 232
Sum of Squares of Model, SSM = Σ(Ŷ – Ȳ)² = (SSXY)²/SSXX = 57²/28 ≈ 116.04
= TSS*R² = [SSYY]*[ (SSXY)²/(SSXX*SSYY) ] = (SSXY)²/SSXX = 57²/28 ≈ 116.04
Sum of Squares of Error, SSE = Σ(Y – Ŷ)² = SSYY – (SSXY)²/SSXX ≈ 115.96
= TSS*(1–R²) = [SSYY]*[ 1 – (SSXY)²/(SSXX*SSYY) ] = SSYY – (SSXY)²/SSXX ≈ 115.96
Correlation coefficient, r = sqrt(R²) = 57/sqrt(28*232) = 0.707215625 ≈ 0.7
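One detail worth checking numerically is the sign of r: R² is never negative, so the square root alone gives only the magnitude, and the sign has to come from SSXY. The sketch below is illustrative Python; the function name r_squared_and_r is an assumption, and the second call uses the A and B data sets from the Matrix Representations page.

```python
import math

def r_squared_and_r(ssxx, ssyy, ssxy):
    """Return (R-squared, r), giving r the sign of SSXY."""
    r2 = ssxy ** 2 / (ssxx * ssyy)
    r = math.copysign(math.sqrt(r2), ssxy)
    return r2, r

# Miles and Minutes: positive association.
r2, r = r_squared_and_r(28, 232, 57)

# Sets A and B: SSxy is negative, so r is negative.
r2_ab, r_ab = r_squared_and_r(2.8, 8.8, -4.8)

print(round(r2, 4), round(r, 4), round(r2_ab, 3), round(r_ab, 3))
```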
ANOVA – Analysis of Variance – Regression Significance
Consider the values:
Subject    1   2   3    4    5    Sum
X=Miles    1   3   3    5    8    20
Y=Minutes  4   6   20   15   20   65
SSXY=57, SSXX=28, SSYY=232
ANOVA Table
SOURCE                   DF   SS   MS             Ft       R²
MODEL (Regression)       1    SSM  MSM=SSM/1      MSM/MSE  SSM/TSS
ERROR (Residuals)        n-2  SSE  MSE=SSE/(n-2)
TOTAL (Corrected Total)  n-1  TSS
Using Algebra to combine relationships:
R² = SSM/TSS ≈ 0.500153941
TSS = Σ(Y – Ȳ)² = SSYY = 232
SSM = Σ(Ŷ – Ȳ)² = TSS*R² ≈ 116.04
SSE = Σ(Y – Ŷ)² = TSS*(1–R²) ≈ 115.96
r = sqrt(R²) ≈ 0.7072
ANOVA Table
SOURCE  DF   SS              MS             Ft       R²
MODEL   1    SSM=TSS*R²      MSM=SSM/1      MSM/MSE  SSM/TSS
ERROR   n-2  SSE=TSS*(1–R²)  MSE=SSE/(n-2)
TOTAL   n-1  TSS=SSYY
ANOVA Table
SOURCE  DF  SS      MS      Ft    R²    P-value
MODEL   1   116.04  116.04  3.00  0.50  0.182
ERROR   3   115.96  38.65
TOTAL   4   232
Let α = Level of Significance. Then regression is significant when P-value < α.
P-value is determined using Excel, =1 – F.DIST( Ft , 1, DFE=(n–2) , 1 )
For this example, P-value = 1 – F.DIST(3,1,3,1) ≈ 0.182
For α=0.10, Regression is not significant since P-value > α.
Significant regression implies the slope is significantly different from zero.
When regression is not significant, there is insufficient evidence that the slope differs from zero.
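The ANOVA quantities and the P-value can be reproduced without Excel. The sketch below is an assumption-laden illustration in Python, not the handout's method: in place of F.DIST it uses the fact that, with 1 numerator degree of freedom, the upper-tail F probability equals the two-tailed t probability at t = √Ft, and it integrates the t density numerically with Simpson's rule. The function names t_pdf and two_tailed_p are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """2*(1 - CDF(|t|)) via Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3        # P(0 < T < |t|)
    return 1 - 2 * area

n, ssxx, ssyy, ssxy = 5, 28.0, 232.0, 57.0
tss = ssyy
ssm = ssxy ** 2 / ssxx
sse = tss - ssm
msm = ssm / 1
mse = sse / (n - 2)
f_stat = msm / mse
p_value = two_tailed_p(math.sqrt(f_stat), n - 2)

print(round(f_stat, 2), round(p_value, 3))
```

With α=0.10 the computed P-value exceeds α, which agrees with the conclusion in the ANOVA table.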
Correlation Significance
Consider three sets of data, A, B, & C, along with their univariate and bivariate measures.
Univariate Measures:
Set  Index:1,2,3,4,5  Sum=ΣX  ΣX²  SS=ΣX²–(ΣX)²/n  S²
A    0,2,1,2,1        6       10   2.8             0.7
B    4,1,2,0,2        9       25   8.8             2.2
C    2,0,0,2,0        4       8    4.8             1.2
Bivariate Measures:
Sum(XY)
     A     B     C
A    10    6     4
B          25    8
C                8
SSxy
     A     B     C
A    2.8   –4.8  –0.8
B          8.8   0.8
C                4.8
Variance/Covariance Sxy
     A     B     C
A    0.7   –1.2  –0.2
B          2.2   0.2
C                1.2
Correlation r
     A     B       C
A          –0.967  –0.218
B                  0.123
C
Given Level of Significance = α.
Let tt = √[ (n–2)*R²/(1–R²) ].
Then, determine P-value using Excel,
P-value = 2*(1–T.DIST( tt , n–2 , 1 ) )
If P-value < α , correlation is significant (correlation is not zero).
If P-value > α , correlation is not significant (correlation is not distinguishable from zero).
Correlation Matrix
     A     B        C
A          -0.967*  -0.218
B                   0.123
C
R²
     A     B       C
A          0.935*  0.048
B                  0.015
C
P-value
     A     B        C
A          0.0072*  0.7244
B                   0.8437
C
*significant at α=0.10
Let α=0.05
RAB is significant
RAC is not significant
RBC is not significant
Let α=0.10
RAB is significant
RAC is not significant
RBC is not significant
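The P-values for the three pairs can be recomputed from the raw data. The following is a sketch in Python under stated assumptions: the two-tailed t probability is obtained by numerically integrating the t density with Simpson's rule as a stand-in for Excel's T.DIST, and the helper names (correlation, t_pdf, two_tailed_p) are hypothetical.

```python
import math
from itertools import combinations

data = {"A": [0, 2, 1, 2, 1], "B": [4, 1, 2, 0, 2], "C": [2, 0, 0, 2, 0]}

def ss(x, y):
    """SSxy = sum(X*Y) - sum(X)*sum(Y)/n."""
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / len(x)

def correlation(x, y):
    return ss(x, y) / math.sqrt(ss(x, x) * ss(y, y))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """2*(1 - CDF(|t|)) via Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

alpha = 0.10
results = {}
for p, q in combinations(data, 2):
    n = len(data[p])
    r = correlation(data[p], data[q])
    t_stat = math.sqrt((n - 2) * r * r / (1 - r * r))
    p_value = two_tailed_p(t_stat, n - 2)
    results[p + q] = (r, p_value)
    verdict = "significant" if p_value < alpha else "not significant"
    print(p + q, round(r, 3), round(p_value, 4), verdict)
```

Because the P-value is already two-tailed, it is compared directly with α here.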