Chapter 11: Linear Regression

Simple Linear Regression
• Linear regression model
• Prediction
• Limitations
• Correlation
Example: Computer Repair
A company markets and repairs small computers. How quickly (Time) an electronic component (a computer unit) can be repaired is very important to the efficiency of the company. The variables in this example are Time and Units.
Hmm… how long will it take me to repair this unit?

Goal: to predict the length of repair Time for a given number of computer Units.
Computer Repair Data
Units | Minutes     Units | Minutes
  1   |   23          6   |   97
  2   |   29          7   |  109
  3   |   49          8   |  119
  4   |   64          9   |  149
  4   |   74          9   |  145
  5   |   87         10   |  154
  6   |   96         10   |  166
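For readers who want to follow along in code, the table above can be entered directly. A minimal Python sketch (the variable names `units` and `minutes` are my own):

```python
# Computer repair data from the table above:
# number of units repaired and repair time in minutes.
units   = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
print(n, sum(units), sum(minutes))  # 14 84 1361
```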
Graphical Summary of Two Quantitative Variables

Scatterplot of the response variable against the explanatory variable:
• What is the overall (average) pattern?
• What is the direction of the pattern?
• How much do data points vary around the overall (average) pattern?
• Are there any potential outliers?
Summary for Computer Repair Data

Scatterplot (Time vs. Units)

Some simple conclusions:
• Time is linearly related to the number of computer Units.
• (The length of) Time increases as (the number of) Units increases.
• Data points are close to the line.
• No potential outliers.
Numerical Summary of Two Quantitative Variables

• Regression model
• Correlation
Linear Regression Model

Y: the response variable
X: the explanatory variable

Y = b0 + b1·X + error

[Figure: the regression line in the (X, Y) plane; b0 is the intercept and b1 is the slope, the change in Y per one-unit increase in X.]
Linear Regression Model

The regression line models the relationship between X and Y on average.
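The least-squares estimates behind the fitted line are the standard formulas b̂1 = S_xy/S_xx and b̂0 = ȳ − b̂1·x̄. A Python sketch applying them to the computer repair data (which reproduces the fitted equation used later for prediction):

```python
units   = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
x_bar = sum(units) / n
y_bar = sum(minutes) / n

# cross-product sum S_xy and sum of squares S_xx
s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(units, minutes))
s_xx = sum((x - x_bar) ** 2 for x in units)

b1 = s_xy / s_xx          # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate
print(round(b0, 2), round(b1, 2))  # 4.16 15.51
```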
Prediction

Ŷ: the predicted value of Y for a given X value.

Regression equation: Ŷ = b̂0 + b̂1·X

E.g., how long will it take to repair 3 computer units?
Fitted equation: Ŷ = 4.16 + 15.51·X, so Ŷ = 4.16 + 15.51 × 3 = 50.69 minutes.
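Plugging a value of X into the fitted equation is all a prediction requires. A small Python check (the `predict` helper name is my own):

```python
b0, b1 = 4.16, 15.51   # fitted coefficients from the regression equation above

def predict(x):
    """Predicted repair time (minutes) for x computer units."""
    return b0 + b1 * x

print(round(predict(3), 2))  # 50.69 minutes
```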
The Limitation of the Regression Equation

The regression equation cannot be used to predict the Y value for X values which are (far) beyond the range in which the data were observed.

E.g., predicting weight (WT) from height (HT): Ŷ = -205 + 5·X
Given a height of 40", the regression equation gives a weight of -205 + 5 × 40 = -5 pounds!!
The Unpredicted Part

The value Y − Ŷ is the part the regression equation (model) cannot predict; it is called the "residual."

[Figure: the residual is the vertical gap between a data point and the regression line.]
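Residuals can be computed point by point as Y − Ŷ. A sketch using the first few computer repair observations and the fitted coefficients from the Prediction slide:

```python
b0, b1 = 4.16, 15.51              # fitted coefficients
units   = [1, 2, 3]               # first three observations
minutes = [23, 29, 49]

# residual = observed Y minus predicted Y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(units, minutes)]
print([round(e, 2) for e in residuals])  # [3.33, -6.18, -1.69]
```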
Correlation between X and Y

X and Y might be related to each other in many ways: linearly or in a curved fashion.

[Figure: two scatterplots of y vs. x, one showing a linear relationship and one showing a curved relationship.]
Examples of Different Levels of Correlation

[Figure: four scatterplots of y vs. x]
• r = .98 — strong linearity
• r = .71 — moderate linearity
• r = -.09 — nearly uncorrelated
• r = .00 — a clearly curved pattern (linear correlation misses it)
(Pearson) Correlation Coefficient of X and Y

• A measure of the strength of the "LINEAR" association between X and Y.
• The correlation coefficient of X and Y is:

  r_xy = Σᵢ₌₁ⁿ (yᵢ − ȳ)(xᵢ − x̄) / √(S_yy · S_xx) = S_xy / √(S_yy · S_xx)
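The formula translates directly to code. A minimal Python sketch (the `corr` helper name is my own):

```python
import math

def corr(xs, ys):
    """Pearson correlation coefficient of paired data, from the formula above."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_yy * s_xx)

print(corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 for a perfect increasing line
```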
Correlation Coefficient of X and Y

• −1 ≤ r ≤ 1.
• The magnitude of r measures the strength of the linear association between X and Y.
• The sign of r indicates the direction of the association: "−" → negative association; "+" → positive association.
Correlation Coefficient

• When r is almost 0:
  → the best line to fit the data points is nearly horizontal;
  → the value of X won't change our prediction of Y.
• When |r| is almost 1:
  → a line fits the data points almost perfectly.
Goodness of Fit of the SLR Model

• For a single data point: the residual.
• For the whole dataset: R².
• R² (= r²) is the proportion of variation in Y explained by (the variation in) X.
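For a least-squares line, R² computed as the explained proportion of variation, 1 − SS_res/S_yy, equals r². A quick numerical check on made-up data (these numbers are illustrative, not the repair data):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# least-squares fit, then residual sum of squares
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r2_from_r  = s_xy ** 2 / (s_xx * s_yy)   # r squared
r2_from_ss = 1 - ss_res / s_yy           # proportion of variation explained
print(round(r2_from_r, 6) == round(r2_from_ss, 6))  # True
```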
Table for Computing Mean, St. Deviation, and Corr. Coef.

i     | yi     yi − ȳ    (yi − ȳ)²    | xi     xi − x̄    (xi − x̄)²    | (yi − ȳ)(xi − x̄)
1     | y1     y1 − ȳ    (y1 − ȳ)²    | x1     x1 − x̄    (x1 − x̄)²    | (y1 − ȳ)(x1 − x̄)
2     | y2     y2 − ȳ    (y2 − ȳ)²    | x2     x2 − x̄    (x2 − x̄)²    | (y2 − ȳ)(x2 − x̄)
…     | …                              | …                              | …
n     | yn     yn − ȳ    (yn − ȳ)²    | xn     xn − x̄    (xn − x̄)²    | (yn − ȳ)(xn − x̄)
Total | Σyi    0         Σ(yi − ȳ)²   | Σxi    0         Σ(xi − x̄)²   | Σ(yi − ȳ)(xi − x̄)
      | ȳ      0         S_yy         | x̄      0         S_xx         | S_xy (→ r_xy)
Example: Computer Repair Time
Σ yi = 1361, n = 14, ȳ = 1361/14 = 97.2143
S_yy = Σ (yi − ȳ)² = 27768.35
Σ xi = 84, x̄ = 84/14 = 6
S_xx = Σ (xi − x̄)² = 114
S_xy = Σ (yi − ȳ)(xi − x̄) = 1768
r_xy = S_xy / √(S_yy · S_xx) = .9937
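These summary numbers can be verified directly from the computer repair table. A Python sketch:

```python
import math

units   = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
y_bar = sum(minutes) / n   # 97.2143
x_bar = sum(units) / n     # 6.0

s_yy = sum((y - y_bar) ** 2 for y in minutes)                        # ~27768.4
s_xx = sum((x - x_bar) ** 2 for x in units)                          # 114
s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(units, minutes))  # 1768

r = s_xy / math.sqrt(s_yy * s_xx)
print(round(r, 4))  # 0.9937
```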
Exercise
(1) Fill in the following table, then compute the mean and standard deviation of Y and X.
(2) Compute the correlation coefficient of Y and X.
(3) Draw a scatterplot.

i     | xi    xi − x̄   (xi − x̄)²  | yi    yi − ȳ   (yi − ȳ)²  | (yi − ȳ)(xi − x̄)
1     | -.3   -.3       .09        | .1    -.9       .81        | .27
2     | -.2   -.2       .04        | .4    -.6       .36        | .12
3     | -.1             .01        | .7                         |
4     | .1              .01        | 1.2   .2                   |
5     | .2              .04        | 1.6   .6                   |
6     | .3    .3        .09        | 2.0                        |
Total | 0               *          | 6.0                        | *

(Empty cells are for you to fill in.)
The Influence of Outliers

• The slope becomes bigger (pulled toward the outliers).
• The r value becomes smaller (less linear).

[Figure: scatterplot of Y3 vs. X3 with the fitted line dragged toward an outlying point.]
The Influence of Outliers

• The slope becomes clear (pulled toward the outliers).
• The |r| value becomes larger (more linear: 0.159 → 0.935).

[Figure: scatterplot of y vs. x in which a distant point lines up with the trend and strengthens it.]
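The effect can be demonstrated numerically. A sketch with made-up data (not the data behind the plots above) in which a single outlier drags the fitted slope in its direction and sharply shrinks |r|:

```python
import math

def slope_and_corr(xs, ys):
    """Least-squares slope and Pearson r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / math.sqrt(s_xx * s_yy)

xs = [1, 2, 3, 4, 5]
ys = [1.0, 2.1, 2.9, 4.2, 5.0]              # a tight linear cloud
b1_clean, r_clean = slope_and_corr(xs, ys)

# add one outlier far below the pattern
b1_out, r_out = slope_and_corr(xs + [10], ys + [1.0])

print(round(r_clean, 3), round(r_out, 3))   # |r| shrinks dramatically
```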