STAT 211
Handout 10 (Chapter 12): Simple Linear Regression and Correlation



Is there a relationship between two numeric random variables?
Can we describe this relationship with a model?
Can we use this model to predict future values?
Example 1: Infestation of crops by insects has long been of great concern to farmers and agricultural scientists. A certain article reports normally distributed data on x = age of cotton plant (days) and y = % damaged squares. The data follow.
X:  9  12  12  15  18  18  21  21  27  30  30  33
Y: 11  12  23  30  29  52  41  65  60  72  84  93
How do we demonstrate the relationship between age of the cotton plant and % damaged
squares?
Scatterplots: Recall that in the graph, it will not matter which you choose as X or Y.
Scatter plot of x versus y using MINITAB:
[Figure: scatterplot of Y (% damaged squares) against X (age in days); the points show a strong increasing linear pattern.]
Correlations
A) PEARSON’S CORRELATION COEFFICIENT
-measures the strength & direction of the linear relationship between X and Y
-X and Y must be numerical variables
The formula looks like:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \,\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
where sx and sy are the standard deviations for x and y. Notice we are looking at how far
each point deviates from the average X and Y value.
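MINITAB does the computation in this course, but the formula is easy to check by hand. Here is a minimal Python/NumPy sketch (an illustration, not part of the original handout) applying it to the Example 1 data:

import numpy as np

# Example 1 data: x = age of cotton plant (days), y = % damaged squares
x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)

n = len(x)
# Numerator: sum of cross-deviations from the means
num = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator: (n-1) * sx * sy, with sample standard deviations (ddof=1)
den = (n - 1) * x.std(ddof=1) * y.std(ddof=1)
r = num / den
print(round(r, 3))  # ~ 0.949, matching the MINITAB output later in the handout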
Properties of Pearson’s Correlation Coefficient
- r < 0 implies a negative relationship, r > 0 implies a positive relationship, r = 0 no apparent relationship.
- −1 ≤ r ≤ 1
- r = 1 or −1 happens only when the points lie on an exact line.
- |r| ≥ 0.8 implies a strong relationship, 0.5 < |r| < 0.8 a moderate relationship, and |r| ≤ 0.5 a weak one.
Note: It is pretty hard to tell the difference in graphs with correlations under 0.5! However, the larger the sample size, the easier it is to see the correlation.
CAUTION: A weak correlation does not mean the points are just scattered noise. You could have a curved shape (e.g., a parabola) that fits the data perfectly, yet it will have a very weak Pearson's correlation coefficient.
- The correlation is the same no matter which variable is designated X or Y. For example, the correlation between your grade and the hours you study for the class is the same as the correlation between the hours you study and your grade.
- The value of r does not depend upon the units of measurement.
Using the MINITAB program the following is computed for example 1.
Pearson correlation of X and Y = 0.949
P-Value = 0.000
Note the Pearson's correlation between x and y is 0.949. The P-value = 0.000 is for testing
$H_0: \rho_{XY} = 0$ (the true correlation is zero; there is no linear relationship between x and y) versus
$H_a: \rho_{XY} \neq 0$ (the true correlation is not zero; there is a linear relationship between x and y).
r n2
The formal test has the test statistics t 
=9.5186 with n-2=10 degrees of freedom. Since
1 r2
this is a two tailed test, P-value=2P(t>9.5186)
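As a sketch (assuming SciPy is available; not part of the original handout), the same test statistic and two-tailed P-value:

import numpy as np
from scipy import stats

r, n = 0.949, 12
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # ~ 9.52
p_value = 2 * stats.t.sf(t, df=n - 2)          # two-tailed: 2*P(T > t)
print(round(t, 2), round(p_value, 6))          # P-value ~ 0.000

# stats.pearsonr(x, y) on the raw data returns (r, two-tailed P-value) directly.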
The correlation for x vs. x and y vs. y is 1. Why?
IMPORTANT !!!! Just because two variables are highly correlated does NOT mean that one
causes the other!!!!
B) SPEARMAN'S RANK CORRELATION (rs)
Recall, Pearson's r is calculated with means and thus would be affected by outliers. Spearman's method provides us with an alternative that is robust to outliers. Spearman's can identify both linear and nonlinear monotonic relationships. The actual observed data are not used – the ranks are. That is, replace the smallest X with a 1, the next with a 2, and so on. The same is done for Y. For example:

for (12.3, 2.7) (10.4, 3.2) (13.2, 3.0) you would use (2, 1) (1, 3) (3, 2)

Then the Pearson formula is applied to these ranks. Thus $r_s$ is interpreted just like r: it takes values between −1 and 1, with values near ±1 indicating a strong relationship and values near 0 a weak one.
Using the MINITAB program the following is computed for example 1.
Spearman’s correlation of x and y = 0.958
P-Value = 0.000
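A minimal sketch (again assuming SciPy; not part of the handout) showing that ranking first and then applying Pearson's formula is exactly Spearman's rs:

import numpy as np
from scipy import stats

x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)

# Rank each variable separately (ties receive averaged ranks) ...
rank_x, rank_y = stats.rankdata(x), stats.rankdata(y)
# ... then apply Pearson's formula to the ranks
r_s = stats.pearsonr(rank_x, rank_y)[0]

rho, p = stats.spearmanr(x, y)       # the direct route
print(round(r_s, 3), round(rho, 3))  # identical; ~ 0.958 as in the output above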
Purposes for fitting equations to data sets:
- To summarize and condense a data set in order to obtain predictive formulas
- To reject or confirm a proposed mathematical relation
- To assist in the search for a mathematical relation
- To perform a quantitative comparison of two or more data sets
Regression analysis is a general approach for obtaining a prediction function from sample data. We work with a dependent variable (Y; also called the response or endogenous variable), independent variables (X's; also called predictor or exogenous variables), and the predicted value for a given level of X, $\hat{y}$. The method of least squares finds the particular line for which the aggregate deviation of the data points above or below it is minimized.
Least Squares Regression Line: slope & intercept
Now if we have an idea that there is some sort of relationship, we are interested in some way to
summarize that relationship. We want to find a line that “best fits” the points – least squares
regression line - so that we can predict values for Y if we know X. The “best fit” line is the line
that in some sense is closest to all of the data points simultaneously.
The vertical distances from each point to the line are drawn. These are the residuals, denoted by e. If the point is above the line, this distance is positive; below, it is negative. If we added up all of the e's, we would get zero (we will check this in lecture). If we add up all of the squares of the residuals, we get a measure of how far away from the data our line is. The "best" line will be the one with the minimum sum of the squares – thus it is called the "Least Squares Regression Line".
Simple Linear Regression: $Y = \beta_0 + \beta_1 x + e$, where n observations are included and the parameters $\beta_0$ and $\beta_1$ are constants whose "true" values are unknown and must be estimated from the data. The model assumes:
- The variables Y and x are theoretically related to one another by the equation of the straight line.
- The data set is typical of the behavior of the process under study.
- The $Y_i$, i = 1, 2, …, n, are pairwise statistically independent of one another.
- The $Y_i$ are random variables possessing the same variance, $\sigma^2$.
- For each data point, there are no outliers that have arisen under unusual, accidental, or careless circumstances.
- The uncontrolled random error e associated with Y is normally and independently distributed with mean 0 and constant variance $\sigma^2$.
Obtaining the best estimates for $\beta_0$ (intercept) and $\beta_1$ (slope):
- The estimate of $\varepsilon_i$ (the residual): $e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)$
- Minimize $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$ (i.e., take first derivatives with respect to $b_0$ and $b_1$ and set them to zero)
- The resulting equations are called the normal equations:

$$\sum_{i=1}^{n} y_i = n\,b_0 + b_1 \sum_{i=1}^{n} x_i \qquad \sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$

Solving them gives

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}, \qquad \text{then} \qquad b_0 = \bar{y} - b_1\,\bar{x}$$

- The estimated linear regression equation: $\hat{y} = b_0 + b_1 x$
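A short Python sketch (an illustration, not part of the handout) computing b1 and b0 from these formulas for Example 1, and checking that the residuals sum to zero:

import numpy as np

x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope
b0 = ybar - b1 * xbar                                           # intercept
print(round(b1, 4), round(b0, 3))  # 3.2847  -19.67

# Residuals from the least squares line always add up to zero
e = y - (b0 + b1 * x)
print(round(e.sum(), 10))  # 0.0 up to floating-point error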
If you write y as a function of x (regress y on x), the following output is obtained for example 1.
The regression equation is
Y = -19.7 + 3.2847 X

Predictor     Coef  SE Coef      T      P
Constant   -19.670    7.524  -2.61  0.026
X           3.2847   0.3440   9.55  0.000

S = 9.094    R-Sq = 90.1%    R-Sq(adj) = 89.1%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  7541.7  7541.7  91.19  0.000
Residual Error  10   827.0    82.7
Total           11  8368.7
The slope and intercept for your equation are found under the Coef column. Which is the slope and which is the intercept? What is the least squares regression line in this example?
Minitab gives you the following output for simple linear regression
Predictor    Coef   SE Coef       T                P
Constant     b0     $s_{b_0}$     $b_0/s_{b_0}$    p-value
x            b1     $s_{b_1}$     $b_1/s_{b_1}$    p-value

where

$$s_{b_0} = s\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad \text{and} \qquad s_{b_1} = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Analysis of Variance
Source          DF    SS    MS    F        P
Regression      1     SSR   MSR   MSR/MSE  p-value
Residual Error  n-2   SSE   MSE
Total           n-1   SST

The fitted value $\hat{y} = -19.67 + 3.2847x$ has two interpretations:
(i) It is a point estimate of the mean value of y when x is given.
(ii) It is a point prediction of an individual y to be observed when x is given.
Estimate of the average y when x = 11: $-19.67 + 3.2847(11) = 16.46$
Predicted y when x = 11: $-19.67 + 3.2847(11) = 16.46$
An estimate of the average y when x = 2 should not be computed using the equation, and neither should a predicted y when x = 2: x = 2 lies outside the range of the observed data, so using the line would be extrapolation.
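As a sketch, a tiny helper (hypothetical, not MINITAB functionality) that applies the fitted equation and refuses to extrapolate outside the observed x-range of 9 to 33 days:

def predict_damage(age, b0=-19.67, b1=3.2847, x_min=9.0, x_max=33.0):
    """Fitted value b0 + b1*x for Example 1; guards against extrapolation."""
    if not (x_min <= age <= x_max):
        raise ValueError("x outside the observed data range: extrapolation")
    return b0 + b1 * age

print(round(predict_damage(11), 2))  # 16.46, as computed above
# predict_damage(2) raises ValueError: x = 2 is outside [9, 33]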
SSE = 827 is the unexplained variation: the part of y's variation that cannot be attributed to the approximate linear relationship with x.
SST = 8368.7 is the total variation: the sum of squared deviations of y from the sample mean of y.
Coefficient of Determination, R²: measures what percent of Y's variation is explained by the X variables via the regression model. It tells us the proportion of SST that is explained by the fitted equation; SSE/SST is the proportion of SST that is not explained by the model.

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Only in simple linear regression, $R^2 = r^2$ where r is Pearson's correlation coefficient.
$$\text{Adjusted } R^2 = 1 - \frac{n-1}{n-2}\cdot\frac{SSE}{SST} = \frac{(n-1)R^2 - 1}{n-2}$$

Both will always be between 0 and 1, indicating:
(i) a strong linear relationship between X and Y if it is close to 1, and
(ii) a very weak relationship between X and Y if it is close to 0.
(iii) It is 0 (no linear relationship) when SSE = SST.
90.1% of Y's variation is explained by the X variable via the regression model in the example above. Notice that $R^2 = r^2 = (0.949)^2$ only when you have one X and one Y variable in your regression model (simple linear regression).
Even though the estimated variance and the coefficient of determination are given in your ANOVA table, here is how they are calculated:

$$\hat{\sigma}^2 = s^2 = SSE/(n-2) = 827/10 = 82.7 \qquad \text{(MSE)}$$

$$\hat{\sigma} = s = \sqrt{82.7} = 9.094 \qquad \text{(Root MSE)}$$

$$r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{827}{8368.7} = 0.901 \qquad \text{(coefficient of determination)}$$
Approximately 90.1% of the observed variation in % damaged squares (y) can be attributed to
the probabilistic linear relationship with age of cotton plant (x).
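These quantities are just arithmetic on the ANOVA table; a quick Python sketch (not part of the handout) confirms them:

SSE, SST, n = 827.0, 8368.7, 12

mse = SSE / (n - 2)                         # estimated variance, 82.7
root_mse = mse ** 0.5                       # estimated sigma, ~ 9.094
r_sq = 1 - SSE / SST                        # ~ 0.901
adj_r_sq = ((n - 1) * r_sq - 1) / (n - 2)   # ~ 0.891, matching R-Sq(adj)

print(mse, round(root_mse, 3), round(r_sq, 3), round(adj_r_sq, 3))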
The following residuals ($e = y - \hat{y}$) and predicted or fitted values ($\hat{y}$) were computed by Minitab. If you compute them by hand, you will get slightly different numbers because of rounding. My suggestion is to use these numbers as given on the output unless otherwise asked.
Obs     X      Y    Fit  SE Fit  Residual  St Resid
  1   9.0  11.00   9.89    4.75      1.11      0.14
  2  12.0  12.00  19.75    3.93     -7.75     -0.94
  3  12.0  23.00  19.75    3.93      3.25      0.40
  4  15.0  30.00  29.60    3.24      0.40      0.05
  5  18.0  29.00  39.45    2.76    -10.45     -1.21
  6  18.0  52.00  39.45    2.76     12.55      1.45
  7  21.0  41.00  49.31    2.63     -8.31     -0.95
  8  21.0  65.00  49.31    2.63     15.69      1.80
  9  27.0  60.00  69.02    3.45     -9.02     -1.07
 10  30.0  72.00  78.87    4.19     -6.87     -0.85
 11  30.0  84.00  78.87    4.19      5.13      0.64
 12  33.0  93.00  88.73    5.04      4.27      0.56
Notice that some of the residuals are positive and others are negative; they all add up to zero. The intercept and the slope have been calculated for this data. Since the estimated slope is a number other than zero, we may think x and y are linearly related. Is it true in the population?
$H_0: \beta_1 = 0$ (no linear relation between x and y)
$H_a: \beta_1 \neq 0$ (x and y are linearly related)

Test statistic: $t = \dfrac{b_1 - 0}{s_{b_1}} = \dfrac{3.2847 - 0}{0.344} = 9.55 > t_{\alpha/2;\,df} = t_{0.025;\,10} = 2.228$. $H_0$ is rejected and the data support a linear relationship between x and y.
Note that $b_1$ is the estimated slope, $s_{b_1}$ is the standard error for $b_1$, and 5% significance is used with the error degrees of freedom. The formal test for comparing the slope with any value $\beta_{10}$ would be the following:

$H_0: \beta_1 = \beta_{10}$
$H_a: \beta_1 \neq \beta_{10}$

Test statistic: $t = \dfrac{b_1 - \beta_{10}}{s_{b_1}}$, where $\beta_{10}$ is the value the slope is compared with.

Decision making can be done using the error degrees of freedom, as for any other t test we have discussed before (either the P-value or the rejection region method).
The 100(1−α)% confidence interval for $\beta_1$ is $b_1 \pm t_{\alpha/2;\,df}\, s_{b_1}$.
For our example, the 95% confidence interval for $\beta_1$ is $3.2847 \pm 2.228(0.344) = (2.518,\ 4.051)$.
You can also test $H_a: \beta_1 > \beta_{10}$ or $H_a: \beta_1 < \beta_{10}$.
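A sketch (assuming SciPy; not part of the handout) reproducing the slope's standard error, the t statistic, and the 95% confidence interval:

import numpy as np
from scipy import stats

x = np.array([9, 12, 12, 15, 18, 18, 21, 21, 27, 30, 30, 33], dtype=float)
y = np.array([11, 12, 23, 30, 29, 52, 41, 65, 60, 72, 84, 93], dtype=float)
n = len(x)

fit = stats.linregress(x, y)
b1, s_b1 = fit.slope, fit.stderr       # ~ 3.2847 and ~ 0.344
# s_b1 is s / sqrt(sum((x - xbar)^2)), as in the table of formulas above

t = (b1 - 0) / s_b1                    # ~ 9.55, testing H0: beta1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)  # ~ 2.228 for df = 10

ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
print(round(t, 2), tuple(round(v, 3) for v in ci))  # 9.55 (2.518, 4.051)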
[Figure: plot of the residuals (RESI1) against X for Example 1.]
Example 2: Find the best predicted value for the number of viewers (in millions) given that the salary (in millions of dollars) of a television star is $16 million. How does the predicted value compare to the actual number of viewers, which was 24 million?
The regression equation is
Viewers = 6.76 - 0.0111 Salary
Predictor      Coef  SE Coef      T      P
Constant      6.760    1.459   4.63  0.004
Salary     -0.01106  0.03791  -0.29  0.780

S = 3.266    R-Sq = 1.4%    R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF     SS     MS     F      P
Regression       1   0.91   0.91  0.09  0.780
Residual Error   6  63.99  10.67
Total            7  64.90

Obs  Salary  Viewers   Fit  SE Fit  Residual  St Resid
  1     100     7.00  5.65    3.12      1.35    1.40 X
  2      14     4.40  6.61    1.21     -2.21   -0.73
  3      14     5.90  6.61    1.21     -0.71   -0.23
  4      35     1.60  6.37    1.24     -4.77   -1.58
  5      12    10.40  6.63    1.23      3.77    1.25
  6       7     9.60  6.68    1.31      2.92    0.98
  7       5     8.90  6.70    1.35      2.20    0.74
  8       1     4.20  6.75    1.44     -2.55   -0.87

X denotes an observation whose X value gives it large influence.

[Figure: scatterplot of Viewers (millions) against Salary (millions of dollars); no clear linear trend is visible.]

Linear Model:       y = a + bx
Logarithmic Model:  y = a + b ln(x)
Power Model:        y = ax^b
Quadratic Model:    y = ax² + bx + c
Exponential Model:  y = ab^x
Logistic Model:     y = c / (1 + ae^(-bx))
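As a check on the Example 2 output above, here is a minimal Python sketch (assuming SciPy, with the Example 2 data as listed in the table; the array names are ours) reproducing the key numbers:

import numpy as np
from scipy import stats

# Example 2: salary (millions of dollars) and viewers (millions)
salary  = np.array([100, 14, 14, 35, 12, 7, 5, 1], dtype=float)
viewers = np.array([7.0, 4.4, 5.9, 1.6, 10.4, 9.6, 8.9, 4.2])

fit = stats.linregress(salary, viewers)
print(round(fit.intercept, 2), round(fit.slope, 5))   # ~ 6.76  ~ -0.01106
print(round(fit.pvalue, 3), round(fit.rvalue**2, 3))  # ~ 0.780  ~ 0.014

# The slope's P-value (0.780) and R-Sq (1.4%) show no significant linear
# relationship, so the fitted line is a poor tool for predicting viewers.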