Simple Linear Regression

• Our objective is to study the relationship
between two variables X and Y.
• One way to study the relationship between
two variables is by means of regression.
• Regression analysis is the process of
estimating a functional relationship between
X and Y. A regression equation is often used
to predict a value of Y for a given value of X.
• Another way to study the relationship between two variables is correlation, which measures the direction and the strength of the linear relationship.
First-Order Linear Model =
Simple Linear Regression Model
Yi  b0  b1Xi  e i
where
y = dependent variable
x = independent variable
b0= y-intercept
b1= slope of the line
e = error variable
3
Simple Linear Model
Yi  b0  b1Xi  e i
This model is
– Simple: only one X
– Linear in the parameters: no parameter appears as an exponent or is multiplied or divided by another parameter
– Linear in the predictor variable (X): X appears only to the first power.
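To make the model concrete, here is a minimal Python sketch that simulates data from this model (NumPy is assumed to be available; the parameter values 2.0, 0.5 and the noise level are arbitrary, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(42)
n = 50
beta0, beta1, sigma = 2.0, 0.5, 1.0      # illustrative values, not from the text
x = rng.uniform(0, 10, size=n)           # predictor X
eps = rng.normal(0.0, sigma, size=n)     # error variable, Normal(0, sigma^2)
y = beta0 + beta1 * x + eps              # Y_i = beta0 + beta1 * X_i + eps_i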
Examples
• Multiple Linear Regression:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
• Polynomial Linear Regression:
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
• Linear Regression:
$\log_{10}(Y_i) = \beta_0 + \beta_1 X_i + \beta_2 \exp(X_i) + \varepsilon_i$
• Nonlinear Regression:
$Y_i = \beta_0 / (1 + \beta_1 \exp(\beta_2 X_i)) + \varepsilon_i$
Linear or nonlinear refers to the parameters.
Deterministic Component of Model
[Figure: the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$; $\hat{\beta}_0$ is the y-intercept and $\hat{\beta}_1$ (the slope) equals rise/run.]
Mathematical vs Statistical Relation
[Figure: scatterplot of y versus x with the fitted line $\hat{y} = -5.3562 + 3.3988x$; the observed points scatter around the line rather than lying exactly on it.]
Error
• The scatterplot shows that the points are not
on a line, and so, in addition to the
relationship, we also describe the error:
yi  b 0  b1 xi  e i , i=1,2,...,n
• The Y’s are the response (or dependent) variable, the x’s are the predictors (or independent variables), and the $\varepsilon$’s are the errors. We assume that the errors are normal, mutually independent, and have variance $\sigma^2$.
Least Squares:
Minimize
$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
The minimizing y-intercept and slope are denoted $\hat{\beta}_0$ and $\hat{\beta}_1$. We use the notation
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• The quantities $R_i = y_i - \hat{y}_i$ are called the residuals. If the errors are normal, the residuals should look approximately normal.
• Error: $Y_i - E(Y_i)$, unknown; Residual: $R_i = y_i - \hat{y}_i$, estimated, i.e., known.
Minimizing error
[Figure: least squares chooses the line that minimizes the sum of squared vertical deviations of the observed points from the line.]
• The Simple Linear Regression Model
$y = \beta_0 + \beta_1 x + \varepsilon$
• The Least Squares Regression Line
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
where
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$
$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$
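A minimal sketch of these formulas in Python (NumPy is assumed to be available; the helper name ls_fit is illustrative, not from the source):

import numpy as np

def ls_fit(x, y):
    """Least squares slope and intercept via the SS_x and SS_xy formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    ss_x = np.sum(x**2) - np.sum(x)**2 / n             # SS_x
    ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # SS_xy
    b1 = ss_xy / ss_x                                  # slope
    b0 = y.mean() - b1 * x.mean()                      # intercept
    return b0, b1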
What form does the error take?
• Each observation may be decomposed
into two parts:
y  yˆ  ( y  yˆ )
• The first part is used to determine the
fit, and the second to estimate the error.
• We estimate the standard deviation of the error using the sum of squared errors:
$SSE = \sum (Y_i - \hat{Y}_i)^2 = SS_y - \frac{SS_{xy}^2}{SS_x}$
Estimate of 2
• We estimate 2 by
SSE
s 
 MSE
n2
2
13
Example
• An educational economist wants to
establish the relationship between an
individual’s income and education. He takes
a random sample of 10 individuals and asks
for their income (in $1000s) and education (in years). The results are shown below. Find the least squares regression line.
Education (years): 11 12 11 15 8 10 11 12 17 11
Income ($1000s):   25 33 22 41 18 28 32 24 53 26
Dependent and Independent
Variables
• The dependent variable is the one that we
want to forecast or analyze.
• The independent variable is hypothesized
to affect the dependent variable.
• In this example, we wish to analyze income, and we hypothesize that an individual’s education affects income. Hence, y is income and x is education.
First Step:
$\sum x_i = 118, \quad \sum x_i^2 = 1450, \quad \sum y_i = 302, \quad \sum y_i^2 = 10072, \quad \sum x_i y_i = 3779$
Sum of Squares:
x )( y )
(118)(302)

 x y 
 3779 
 215.4
(
SS xy
i
i i
SS x
i
n
10

 x 
n
Therefore,
2
i
(
xi ) 2
(118) 2
 1450 
 57.6
10
SS xy 215.4
ˆ
b1 

 3.74
SS x
57.6
302
118
ˆ
ˆ
b 0  y  b1 x 
 3.74
 13.93
10
10
17
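As a quick check, the same numbers can be reproduced in Python (data taken from the example table; NumPy assumed available):

import numpy as np

educ = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11], dtype=float)     # x
income = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)  # y
n = len(educ)

ss_xy = np.sum(educ * income) - educ.sum() * income.sum() / n  # 215.4
ss_x = np.sum(educ**2) - educ.sum()**2 / n                     # 57.6
b1 = ss_xy / ss_x                                              # ~3.74
b0 = income.mean() - b1 * educ.mean()                          # ~-13.93
print(b0, b1)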
The Least Squares Regression
Line
• The least squares regression line is
$\hat{y} = -13.93 + 3.74x$
• Interpretation of coefficients:
• The sample slope $\hat{\beta}_1 = 3.74$ tells us that, on average, each additional year of education increases an individual’s income by $3.74 thousand.
• The y-intercept is $\hat{\beta}_0 = -13.93$. This would be the expected (or average) income for an individual with zero years of education, which is meaningless here (x = 0 is far outside the observed range).
Error Variable:
• $\varepsilon$ is normally distributed.
• $E(\varepsilon) = 0$.
• The variance of $\varepsilon$ is $\sigma^2$.
• The errors are independent of each other.
• The estimator of $\sigma^2$ is
$s_\varepsilon^2 = \frac{SSE}{n-2} = MSE$
where
$SSE = \sum (Y_i - \hat{Y}_i)^2 = SS_y - \frac{SS_{xy}^2}{SS_x} \quad \text{and} \quad SS_y = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$
Example (continued)
• For the previous example,
$SS_y = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 10072 - \frac{(302)^2}{10} = 951.6$
Hence, SSE is
$SSE = SS_y - \frac{SS_{xy}^2}{SS_x} = 951.6 - \frac{(215.4)^2}{57.6} = 146.09$
$s_\varepsilon^2 = \frac{SSE}{n-2} = \frac{146.09}{10-2} = 18.26 \quad \text{and} \quad s_\varepsilon = \sqrt{s_\varepsilon^2} = 4.27$
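A self-contained Python check of these quantities (income data and the earlier SS values are taken from the example; NumPy assumed available):

import numpy as np

income = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)
n, ss_xy, ss_x = 10, 215.4, 57.6                 # from the earlier calculations

ss_y = np.sum(income**2) - income.sum()**2 / n   # 951.6
sse = ss_y - ss_xy**2 / ss_x                     # ~146.09
mse = sse / (n - 2)                              # ~18.26 (s_eps squared)
se = np.sqrt(mse)                                # ~4.27  (s_eps)
print(ss_y, sse, mse, se)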
Interpretation of $s_\varepsilon$
• The value of $s_\varepsilon$ can be compared with the mean value of y to provide a rough guide as to whether $s_\varepsilon$ is small or large.
• Since $\bar{y} = 30.2$ and $s_\varepsilon = 4.27$, we would conclude that $s_\varepsilon$ is relatively small, indicating that the regression line fits the data quite well.
Example
• Car dealers across North America use the red book to determine a car’s selling price on the basis of important features. One of these is the car’s current odometer reading.
• To examine this issue, 100 three-year-old cars in mint condition were randomly selected. Their selling price and odometer reading were recorded.
Portion of the data file
Odometer   Price
37388      5318
44758      5061
45833      5008
30862      5795
…          …
34212      5283
33190      5259
39196      5356
36392      5133
Example (Minitab Output)
Regression Analysis

The regression equation is
Price = 6533 - 0.0312 Odometer

Predictor    Coef        StDev      T        P
Constant     6533.38     84.51      77.31    0.000 (SIGNIFICANT)
Odometer    -0.031158    0.002309  -13.49    0.000 (SIGNIFICANT)

S = 151.6    R-Sq = 65.0%    R-Sq(adj) = 64.7%

Analysis of Variance
Source        DF    SS         MS        F        P
Regression     1    4183528    4183528   182.11   0.000
Error         98    2251362    22973
Total         99    6434890
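For readers working in Python rather than Minitab, an equivalent fit can be obtained with scipy.stats.linregress. This sketch uses the earlier education/income data, since the full odometer data set is not reproduced here:

import numpy as np
from scipy import stats

educ = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)

res = stats.linregress(educ, income)
print("intercept:", res.intercept)   # y-intercept (b0)
print("slope:", res.slope)           # slope (b1)
print("R-squared:", res.rvalue**2)   # coefficient of determination
print("p-value:", res.pvalue)        # two-sided test of H0: slope = 0
print("slope std err:", res.stderr)  # standard error of the slope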
Example
• The least squares regression line is
yˆ  6533.38  0.031158 x
Price
6000
5500
5000
20000
30000
Odometer
40000
50000
25
Interpretation of the coefficients
• bˆ1  0.031158 means that for each additional
mile on the odometer, the price decreases by an
average of 3.1158 cents.
• bˆ0  6533.38 means that when x = 0 (new car),
the selling price is $6533.38 but x = 0 is not in
the range of x. So, we cannot interpret the value
of y when x=0 for this problem.
• R2=65.0% means that 65% of the variation of y
can be explained by x. The higher the value of
R2, the better the model fits the data.
26
R² and R² adjusted
• R² measures the degree of linear association
between X and Y.
• So, a high R² does not necessarily indicate that
the estimated regression line is a good fit.
• Also, an R² close to 0 does not necessarily
indicate that X and Y are unrelated (relation can
be nonlinear)
• As more and more X’s are added to the model,
R² always increases. R²adj accounts for the
number of parameters in the model.
Scatter Plot
[Figure: "Odometer vs. Price" line fit plot — scatterplot of Price against Odometer with the fitted line.]
Testing the slope
• Are X and Y linearly related?
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
• Test Statistic:
$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}, \quad \text{where} \quad s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{SS_x}}$
Testing the slope (continue)
• The Rejection Region:
Reject $H_0$ if $t < -t_{\alpha/2, n-2}$ or $t > t_{\alpha/2, n-2}$.
• If we are testing that high x values lead to high y values, $H_A: \beta_1 > 0$. Then the rejection region is $t > t_{\alpha, n-2}$.
• If we are testing that high x values lead to low y values (or low x values lead to high y values), $H_A: \beta_1 < 0$. Then the rejection region is $t < -t_{\alpha, n-2}$.
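A hedged sketch of this test in Python, reusing the education/income numbers computed earlier (SciPy assumed available; the two-sided p-value comes from the t distribution with n − 2 degrees of freedom):

import numpy as np
from scipy import stats

n, b1, se, ss_x = 10, 3.74, 4.27, 57.6           # from the education/income example
s_b1 = se / np.sqrt(ss_x)                        # standard error of the slope
t_stat = (b1 - 0) / s_b1                         # test statistic under H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(t_stat, p_value)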
Assessing the model
Example:
• Excel output

             Coefficients   Standard Error   t Stat    P-value
Intercept    6533.4         84.512322        77.307    1E-89
Odometer    -0.031          0.0023089       -13.49     4E-24

• Minitab output

Predictor    Coef        StDev      T        P
Constant     6533.38     84.51      77.31    0.000
Odometer    -0.031158    0.002309  -13.49    0.000
Coefficient of Determination
$R^2 = \frac{SS_{xy}^2}{SS_x \cdot SS_y} = 1 - \frac{SSE}{SS_y}$
For the data in the odometer example, we obtain
$R^2 = 1 - \frac{SSE}{SS_y} = 1 - \frac{2{,}251{,}363}{6{,}434{,}890} = 1 - 0.3499 = 0.6501$
$R^2_{adj} = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SS_y}$
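A short Python check of both quantities using the odometer numbers quoted above (here p counts the parameters in the model, intercept plus slope):

n, p = 100, 2                       # 100 cars, 2 parameters
sse, ss_y = 2_251_363, 6_434_890    # from the odometer example
r2 = 1 - sse / ss_y                          # ~0.650
r2_adj = 1 - (n - 1) / (n - p) * sse / ss_y  # ~0.647
print(r2, r2_adj)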
Using the Regression Equation
• Suppose we would like to predict the
selling price for a car with 40,000 miles on
the odometer
yˆ  6,533  0.0312 x
 6,533  0.0312(40,000)
 $5,285
33
Prediction and Confidence
Intervals
• Prediction Interval of y for x = x_g: the interval for predicting the particular value of y for a given x,
$\hat{y} \pm t_{\alpha/2, n-2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}$
• Confidence Interval of E(y | x = x_g): the interval for estimating the expected value of y for a given x,
$\hat{y} \pm t_{\alpha/2, n-2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}$
Solving by Hand
(Prediction Interval)
• From previous calculations we have the following estimates:
$\hat{y} = 5285, \quad s_\varepsilon = 151.6, \quad SS_x = 4{,}309{,}340{,}160, \quad \bar{x} = 36{,}009$
• Thus a 95% prediction interval for x = 40,000 is
$5{,}285 \pm 1.984(151.6)\sqrt{1 + \frac{1}{100} + \frac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 303$
• The prediction is that the selling price of the car will fall between $4,982 and $5,588.
Solving by Hand
(Confidence Interval)
• A 95% confidence interval for E(y | x = 40,000) is
$5{,}285 \pm 1.984(151.6)\sqrt{\frac{1}{100} + \frac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 35$
• The mean selling price of the car will fall between $5,250 and $5,320.
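Both intervals can be reproduced in Python from the quantities given above (SciPy assumed available for the t critical value):

import numpy as np
from scipy import stats

n, se, ss_x, x_bar = 100, 151.6, 4_309_340_160, 36_009  # odometer example
x_g, y_hat = 40_000, 5_285                              # prediction point
t_crit = stats.t.ppf(0.975, df=n - 2)                   # ~1.984

leverage = 1 / n + (x_g - x_bar) ** 2 / ss_x
pi_half = t_crit * se * np.sqrt(1 + leverage)  # prediction-interval half width (~303)
ci_half = t_crit * se * np.sqrt(leverage)      # confidence-interval half width (~35)
print(y_hat - pi_half, y_hat + pi_half)
print(y_hat - ci_half, y_hat + ci_half)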
Prediction and Confidence
Intervals’ Graph
[Figure: predicted line with prediction-interval and confidence-interval bands plotted against Odometer; the prediction interval is wider than the confidence interval.]
Notes
• No matter how strong the statistical relation between X and Y is, no cause-and-effect pattern is necessarily implied by the regression model. For example, although a positive and significant relationship is observed between vocabulary (X) and writing speed (Y), this does not imply that an increase in X causes an increase in Y. Other variables, such as age, may affect both X and Y: older children have a larger vocabulary and faster writing speed.
Regression Diagnostics
How to diagnose violations and how to deal with
observations that are unusually large or small?
• Residual Analysis:
Non-normality
Heteroscedasticity (non-constant variance)
Non-independence of the errors
Outlier
Influential observations
Standardized Residuals
• The standardized residuals are calculated as
$\text{standardized residual}_i = \frac{r_i}{s_\varepsilon}, \quad \text{where } r_i = y_i - \hat{y}_i.$
• The standard deviation of the i-th residual is
$s_{r_i} = s_\varepsilon \sqrt{1 - h_i}, \quad \text{where } h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_x}$
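A minimal Python sketch of these formulas, using the more precise $s_\varepsilon\sqrt{1-h_i}$ denominator from the second bullet (NumPy assumed available; the helper name is illustrative):

import numpy as np

def standardized_residuals(x, y):
    """Residuals divided by s_e * sqrt(1 - h_i), following the formulas above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    ss_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)                   # r_i = y_i - y_hat_i
    se = np.sqrt(np.sum(resid ** 2) / (n - 2))  # s_e
    h = 1 / n + (x - x.mean()) ** 2 / ss_x      # leverage h_i
    return resid / (se * np.sqrt(1 - h))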
Non-normality:
• The errors should be normally distributed. To check the normality of the errors, use a histogram of the residuals, a normal probability plot of the residuals, or a formal test such as the Shapiro-Wilk test (a sketch of this check follows below).
• Dealing with non-normality:
– Transformation on Y
– Other types of regression (e.g., Poisson or logistic regression)
– Nonparametric methods (e.g., nonparametric regression, i.e., smoothing)
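A hedged Python sketch of the Shapiro-Wilk check on residuals; the data here are simulated purely for illustration (NumPy and SciPy assumed available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=50)   # simulated data (illustrative)

slope, intercept = np.polyfit(x, y, 1)               # least squares fit
residuals = y - (intercept + slope * x)

stat, p_value = stats.shapiro(residuals)             # Shapiro-Wilk normality test
print(stat, p_value)   # a small p-value would suggest non-normal errors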
Heteroscedasticity:
• The error variance $\sigma_\varepsilon^2$ should be constant. When this requirement is violated, the condition is called heteroscedasticity.
• To diagnose heteroscedasticity (or homoscedasticity), one method is to plot the residuals against the predicted values of y (or against x). If the points are distributed evenly around the expected value of the errors, which is 0, the error variance is constant. Alternatively, use a formal test such as the Breusch-Pagan test.
Example
• A classic example of heteroscedasticity is that of
income versus expenditure on meals. As one's
income increases, the variability of food
consumption will increase. A poorer person will
spend a rather constant amount by always
eating less expensive food; a wealthier person
may occasionally buy inexpensive food and at
other times eat expensive meals. Those with
higher incomes display a greater variability of
food consumption.
Dealing with heteroscedasticity
• Transform Y
• Re-specify the model (e.g., are important X’s missing?)
• Use Weighted Least Squares instead of Ordinary Least Squares (a sketch follows below):
$\min \sum_{i=1}^{n} \frac{\varepsilon_i^2}{Var(\varepsilon_i)}$
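A minimal sketch of weighted least squares in Python, assuming the per-observation error variances are available or have been estimated (the helper name wls_fit is illustrative, not from the source):

import numpy as np

def wls_fit(x, y, var):
    """Weighted least squares: minimize sum(eps_i^2 / Var(eps_i)).

    `var` holds an (assumed known or estimated) error variance for each
    observation; the weights are their reciprocals.
    """
    x, y, var = map(lambda a: np.asarray(a, float), (x, y, var))
    w = 1.0 / var
    X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
    W = np.diag(w)
    # Solve the weighted normal equations (X' W X) beta = X' W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta  # [b0, b1]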
Non-independence of error
variable:
• The values of error should be
independent. When the data are time
series, the errors often are correlated (i.e.,
autocorrelated or serially correlated). To
detect autocorrelation, we plot the residuals against the time periods; if no pattern appears, the errors can be treated as independent.
Outlier:
• An outlier is an observation that is unusually small or large. Possible causes and remedies:
1. An error in recording the data → detect the error and correct it.
2. The point should not have been included in the data (it belongs to another population) → discard the point from the sample.
3. The observation is genuinely unusually small or large, belongs to the sample, and involves no recording error → do NOT remove it.
Influential Observations
• One or more observations have a large
influence on the statistics.
[Figures: "Scatter Plot of One Influential Observation" and "Scatter Plot Without the Influential Observation", illustrating how a single point can change the fitted relationship.]
Influential Observations
• Look into Cook’s distance, DFFITS, and DFBETAS (Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996). Applied Linear Statistical Models, 4th edition, Irwin, pp. 378–384).
Multicollinearity
• A common issue in multiple regression is
multicollinearity. This exists when some or
all of the predictors in the model are highly
correlated. In such cases, the estimated
coefficient of any variable depends on
which other variables are in the model.
Also, the standard errors of the coefficients become very large.
Multicollinearity
• Look at the correlation coefficients among the X’s: if a correlation exceeds 0.8, suspect multicollinearity.
• Look at variance inflation factors (VIF): VIF > 10 is usually a sign of multicollinearity (see the sketch after this list).
• If there is multicollinearity:
– Use a transformation on the X’s, e.g., centering or standardization. Example: Cor(X, X²) = 0.991; after standardization, Cor = 0!
– Remove the X that causes the multicollinearity
– Factor analysis
– Ridge regression
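A minimal Python sketch of variance inflation factors, computed directly from the definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors (NumPy assumed available; the helper name vif is illustrative):

import numpy as np

def vif(X):
    """VIF for each column of the predictor matrix X (n observations by k predictors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])     # intercept plus other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)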
Exercise
• It is doubtful that any sport collects more statistics than baseball. Fans are always interested in determining which factors lead to successful teams. The table below lists the team batting average and the team winning percentage for the 14 league teams at the end of a recent season.
Team batting average   Winning %
0.254                  0.414
0.269                  0.519
0.255                  0.500
0.262                  0.537
0.254                  0.352
0.247                  0.519
0.264                  0.506
0.271                  0.512
0.280                  0.586
0.256                  0.438
0.248                  0.519
0.255                  0.512
0.270                  0.525
0.257                  0.562

y = winning % and x = team batting average
a) LS Regression Line
$\sum x_i = 3.642, \quad \sum x_i^2 = 0.948622, \quad \sum y_i = 7.001, \quad \sum y_i^2 = 3.548785, \quad \sum x_i y_i = 1.824562$
$SS_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} = 1.824562 - \frac{(3.642)(7.001)}{14} = 0.003302$
$SS_x = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 0.948622 - \frac{(3.642)^2}{14} = 0.001182$
$\hat{\beta}_1 = \frac{SS_{xy}}{SS_x} = \frac{0.003302}{0.001182} \approx 2.794$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 0.5001 - (2.794)(0.2601) \approx -0.2267$
• The least squares regression line is
$\hat{y} = -0.2267 + 2.794x$
• The meaning of $\hat{\beta}_1 \approx 2.794$ is that, on average, an increase of 0.010 in the team batting average is associated with an increase of about 0.028 (2.8 percentage points) in the winning percentage.
b) Standard Error of Estimate
$SSE = SS_y - \frac{SS_{xy}^2}{SS_x} = \left(\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right) - \frac{SS_{xy}^2}{SS_x} = \left(3.548785 - \frac{7.001^2}{14}\right) - \frac{0.003302^2}{0.001182} = 0.03856$
$s_\varepsilon^2 = \frac{SSE}{n-2} = \frac{0.03856}{14-2} = 0.00321 \quad \text{and} \quad s_\varepsilon = \sqrt{s_\varepsilon^2} = 0.0567$
• Since $s_\varepsilon = 0.0567$ is small relative to $\bar{y} = 0.5$, the regression line fits the data quite well.
c) Do the data provide sufficient evidence at the 5% significance level to conclude that a higher team batting average leads to a higher winning percentage?
$H_0: \beta_1 = 0$
$H_A: \beta_1 > 0$
Test statistic: $t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} = 1.69$ (p-value = 0.058)
Conclusion: Do not reject $H_0$ at $\alpha = 0.05$. There is not enough evidence to conclude that a higher team batting average leads to a higher winning percentage.
d) Coefficient of Determination
$R^2 = \frac{SS_{xy}^2}{SS_x \cdot SS_y} = 1 - \frac{SSE}{SS_y} = 1 - \frac{0.03856}{0.04778} \approx 0.193$
About 19.3% of the variation in the winning percentage can be explained by the batting average.
e) Predict with 90% confidence the winning percentage of a team whose batting average is 0.275.
$\hat{y} = -0.2267 + 2.794(0.275) = 0.5417$
$\hat{y} \pm t_{\alpha/2, n-2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}$
$0.5417 \pm (1.782)(0.0567)\sqrt{1 + \frac{1}{14} + \frac{(0.275 - 0.2601)^2}{0.001182}} = 0.5417 \pm 0.1134$
90% PI for y: (0.4283, 0.6551)
• The prediction is that the winning percentage of the team will fall between about 42.8% and 65.5%.
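As a check on these hand calculations, the exercise can be reproduced end to end in Python (NumPy and SciPy assumed available; the data come from the table above):

import numpy as np
from scipy import stats

ba = np.array([0.254, 0.269, 0.255, 0.262, 0.254, 0.247, 0.264,
               0.271, 0.280, 0.256, 0.248, 0.255, 0.270, 0.257])
win = np.array([0.414, 0.519, 0.500, 0.537, 0.352, 0.519, 0.506,
                0.512, 0.586, 0.438, 0.519, 0.512, 0.525, 0.562])
n = len(ba)

res = stats.linregress(ba, win)           # slope ~2.79, intercept ~-0.23
resid = win - (res.intercept + res.slope * ba)
se = np.sqrt(np.sum(resid**2) / (n - 2))  # ~0.0567
r2 = res.rvalue**2                        # ~0.19

# 90% prediction interval at x_g = 0.275
x_g = 0.275
y_hat = res.intercept + res.slope * x_g
ss_x = np.sum((ba - ba.mean())**2)
half = stats.t.ppf(0.95, n - 2) * se * np.sqrt(1 + 1/n + (x_g - ba.mean())**2 / ss_x)
print(y_hat - half, y_hat + half)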