Linear Regression

The simple linear regression model describes the relationship
between one dependent variable (y) and one independent
variable (x).
The dependent variable is a random variable whose variation is
predicted by the independent variable.
The relationship is specified as:
y = b0 + b1x
Where:
• b0 and b1 are fixed numbers. The value of b0 determines the
point at which the straight line crosses the y-axis; that is,
the y-intercept.
• The value of b1 determines the slope of the line. Generally,
specified as the amount by which y changes for every unit
change in x.
If b1 < 0 the slope is negative
If b1 > 0 the slope is positive
If b1 = 0 the slope is zero
-- the line is horizontal (parallel to the x-axis).
STA/FIN Class Notes
Warning!! These notes contain direct references to copyrighted material
Do not quote, copy, or replicate without permission: mailto:nina@uri.edu
Class Notes to accompany: Introductory Statistics, 8th Ed, By Neil A. Weiss
Prepared by: Nina Kajiji, Ph.D.
Last Update: 1-DEC-10
Regression Analysis: Simple
Notation Used
Sxx = Σx² - (Σx)² / n      OR   Σ(x - x̄)²
Sxy = Σxy - (Σx)(Σy) / n   OR   Σ(x - x̄)(y - ȳ)
Syy = Σy² - (Σy)² / n      OR   Σ(y - ȳ)²
b1 = Sxy / Sxx
b0 = (1/n)(Σy - b1Σx)
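The notation above can be checked numerically. The following is a minimal sketch, assuming the Corvette age and price data from the worked example later in these notes; the variable names are illustrative only. It computes each quantity in both the shortcut form and the deviation form so the two can be compared.

```python
# Corvette data from the worked example in these notes (age x, price y).
x = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
y = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(x)

# Shortcut (computing) forms.
sxx = sum(v * v for v in x) - sum(x) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n

# Deviation (defining) forms; these should agree with the shortcut forms
# up to floating-point rounding.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx_dev = sum((v - x_bar) ** 2 for v in x)
sxy_dev = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
syy_dev = sum((v - y_bar) ** 2 for v in y)

# Slope and intercept of the regression line.
b1 = sxy / sxx
b0 = (sum(y) - b1 * sum(x)) / n

print(f"Sxx = {sxx:.2f}, Sxy = {sxy:.2f}, Syy = {syy:.2f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
```

The shortcut forms need only running totals, which is why they suit hand or spreadsheet calculation; the deviation forms show more directly what each quantity measures.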
The Least-Squares Criterion
The straight line that best fits a set of data points is the one for
which the sum of squared errors is smallest.
Regression Line
The straight line that fits a set of data points the best according
to the least-squares criterion is the regression line.
The Regression Equation
The equation of the regression line is stated as:
ŷ = b0 + b1x
Estimation & Prediction
Estimation or prediction is simply using the regression equation
on the sample x-values to provide an estimate of the
corresponding y-value. That is,
ŷ = b0 + b1x
The Total Sum of Squares
The total squared deviation of the observed y-values from their
mean is called the total sum of squares. The formula used is:
SST = Syy = Σ(y - ȳ)²
Error Sum Of Squares
The total squared error is called the error sum of squares. The
formula used is:
SSE = Syy - (Sxy)²/Sxx = Σ(y - ŷ)²
Sum Of Squares Due To Regression
The total amount of variation explained by the regression line is
called the regression sum of squares. The formula used is:
SSR = (Sxy)²/Sxx = Σ(ŷ - ȳ)²
The Regression Identity
SST = SSR + SSE
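As a rough illustration of the three sums of squares and the regression identity, here is a small sketch that again assumes the Corvette data from the worked example in these notes. The names `y_hat` and `identity_holds` are made up for this example.

```python
import math

# Corvette data from the worked example in these notes (age x, price y).
x = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
y = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((a - x_bar) ** 2 for a in x)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# Fit the regression line, then compute the predicted value for each x.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

# Each sum of squares computed directly from its definition.
sst = sum((b - y_bar) ** 2 for b in y)                # total
sse = sum((b - f) ** 2 for b, f in zip(y, y_hat))     # error
ssr = sum((f - y_bar) ** 2 for f in y_hat)            # regression

# The regression identity SST = SSR + SSE, up to floating-point rounding.
identity_holds = math.isclose(sst, ssr + sse)

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print("SST == SSR + SSE:", identity_holds)
```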
The Coefficient Of Determination
The r-square, or the coefficient of determination, is the
proportional reduction (often reported as a percentage) obtained
in the total squared error by using the regression equation
instead of the sample mean to predict the observed y-values.
In other words, it is the proportion of variation in the dependent
variable that is explained by the independent variable.
r² = SSR / SST
OR
r² = 1 - (SSE / SST)
OR
r² = (Sxy)² / (Sxx Syy)
Linear Correlation
• The square root of r² gives the magnitude of the linear
correlation (r) between x & y; r takes the same sign as the slope b1.
• r ranges from -1 to +1.
• Positive values indicate a positive correlation; negative values
indicate a negative correlation.
• If x & y are correlated it does not, by itself, mean that one
causes the other.
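The link between r² and a signed r can be sketched as follows, assuming the Corvette data from the worked example in these notes. Because the square root alone loses the sign, this sketch attaches the sign of b1 and then compares the result with the direct formula r = Sxy / Sqrt(Sxx Syy).

```python
import math

# Corvette data from the worked example in these notes (age x, price y).
x = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
y = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n

b1 = sxy / sxx
r_squared = sxy ** 2 / (sxx * syy)

# The square root gives only the magnitude; r takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), b1)

# Direct formula for comparison; it carries the sign automatically.
r_direct = sxy / math.sqrt(sxx * syy)

print(f"r-square = {r_squared:.4f}, r = {r:.2f}")
```

For these data the slope is negative (price falls with age), so r is negative even though r² is positive.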
Other Terms
Extrapolation
Using the regression equation to make predictions for x-values
outside the range of x-values in the sample data.
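One way to guard against accidental extrapolation is to check each new x-value against the sampled range before predicting. The sketch below assumes the Corvette data from the worked example in these notes; the helper `predict` and its return convention are hypothetical, not part of the notes.

```python
# Corvette data from the worked example in these notes (age x, price y).
x = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
y = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
b1 = sxy / sxx
b0 = (sum(y) - b1 * sum(x)) / n


def predict(x_new, b0, b1, x_min, x_max):
    """Return (predicted y, True if x_new lies outside the sampled x range)."""
    y_hat = b0 + b1 * x_new
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_hat, is_extrapolation


# Age 3 is inside the sampled ages (1 to 6); age 10 is outside.
inside_pred, inside_flag = predict(3, b0, b1, min(x), max(x))
outside_pred, outside_flag = predict(10, b0, b1, min(x), max(x))

print(f"age 3:  {inside_pred:.2f}, extrapolation: {inside_flag}")
print(f"age 10: {outside_pred:.2f}, extrapolation: {outside_flag}")
```

The equation still returns a number for age 10, but nothing in the sample supports the assumption that the same straight line holds that far out.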
Outliers & Influential Observations
Recall: An outlier is an observation that lies outside the overall
pattern of the data. In the regression context, an outlier is a data
point that lies far from the regression line.
An influential observation is a data point whose removal causes
the regression equation to change considerably.
Scatter Plots
A plot of x & y to visualize the pattern of the sample data. If the
plot shows a non-linear relationship between x (the predictor
variable) and y (the response variable), DO NOT use linear
regression methods.
Four Data Sets Having the Same Summary Statistics
(Source: Anscombe, 1973)

        Data Set 1       Data Set 2       Data Set 3       Data Set 4
          x1     y1        x2     y2        x3     y3        x4     y4
           4    4.26        4    3.10        4    5.39        8    6.58
           5    5.68        5    4.74        5    5.73        8    5.76
           6    7.24        6    6.13        6    6.04        8    7.71
           7    4.82        7    7.26        7    6.42        8    8.84
           8    6.95        8    8.14        8    6.77        8    8.47
           9    8.81        9    8.77        9    7.11        8    7.04
          10    8.04       10    9.14       10    7.46        8    5.25
          11    8.33       11    9.26       11    7.81        8    5.56
          12   10.84       12    9.13       12    8.15        8    7.91
          13    7.58       13    8.74       13   12.74        8    6.89
          14    9.96       14    8.10       14    8.84       19   12.50
Mean    9.00    7.50     9.00    7.50     9.00    7.50     9.00    7.50
STD.    3.32    2.03     3.32    2.03     3.32    2.03     3.32    2.03
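The point of the four data sets can be checked with a short script. This is a sketch only: it uses the values exactly as listed in these notes, the helper `fit_line` is a made-up name, and `statistics.stdev` is assumed to match the sample standard deviation reported in the STD. row.

```python
from statistics import mean, stdev

xs = list(range(4, 15))  # 4, 5, ..., 14 (shared by Data Sets 1 to 3)
datasets = {
    "Data Set 1": (xs, [4.26, 5.68, 7.24, 4.82, 6.95, 8.81,
                        8.04, 8.33, 10.84, 7.58, 9.96]),
    "Data Set 2": (xs, [3.1, 4.74, 6.13, 7.26, 8.14, 8.77,
                        9.14, 9.26, 9.13, 8.74, 8.1]),
    "Data Set 3": (xs, [5.39, 5.73, 6.04, 6.42, 6.77, 7.11,
                        7.46, 7.81, 8.15, 12.74, 8.84]),
    "Data Set 4": ([8] * 10 + [19], [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                                     5.25, 5.56, 7.91, 6.89, 12.5]),
}


def fit_line(x, y):
    """Return (b0, b1) of the least-squares line for the points (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


summary = {}
for name, (x, y) in datasets.items():
    b0, b1 = fit_line(x, y)
    summary[name] = (mean(x), stdev(x), mean(y), stdev(y), b0, b1)
    print(f"{name}: mean x = {mean(x):.2f}, std x = {stdev(x):.2f}, "
          f"mean y = {mean(y):.2f}, std y = {stdev(y):.2f}, "
          f"line: y = {b0:.2f} + {b1:.2f}x")
```

The summary statistics and fitted lines come out nearly identical across the four sets, which is why a scatter plot is needed to tell them apart.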
[Figure: scatter plots of Data Set 1, Data Set 2, Data Set 3, and Data Set 4]
Inferential Methods
Assumptions For Regression Inferences
1. Population Regression Line (Assumption I): There is a
straight line, y = β0 + β1x such that for each x-value, the
mean of the corresponding population of y-values lies on that
straight line.
2. Equal Standard Deviations (Assumption II): The standard
deviation, σ, of the population of y-values corresponding to a
particular x-value is the same, regardless of the x-value.
3. Normality (Assumption III): For each x-value, the
corresponding population of y-values is normally distributed.
Standard Error Of The Estimate
The standard error of the estimate is defined by:
se = Sqrt(SSE / (n - 2))
It provides us with an estimate of the common population standard
deviation, σ, of the y-values at each x-value. The se indicates
roughly how far the observed y-values are from the predicted
y-values, on average.
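A minimal sketch of the calculation, assuming the Corvette data from the worked example in these notes. The function name `standard_error` is hypothetical; it uses the shortcut SSE = Syy - (Sxy)²/Sxx given earlier.

```python
import math


def standard_error(x, y):
    """Standard error of the estimate, se = Sqrt(SSE / (n - 2))."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sse = syy - sxy ** 2 / sxx
    # n - 2 because two parameters (b0 and b1) were estimated from the data.
    return math.sqrt(sse / (n - 2))


# Corvette data from the worked example in these notes (age x, price y).
ages = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
prices = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]

se = standard_error(ages, prices)
print(f"se = {se:.3f}")
```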
Residual Analysis For the Regression Model
If the assumptions for regression inferences are met, then the
following two conditions should hold.
1. A plot of the residuals against the x-values should fall
roughly in a horizontal band centered and symmetric about
the x-axis. If a pattern is indicated, you probably need to use
some analytical method other than simple linear regression.
For example, the following graph shows a quadratic pattern
in the residuals.
[Figure: Linearity Test. Residual (dependent variable: Min) plotted against Units.]
2. A normal probability plot of the residuals should be roughly
linear. For example, the following graph shows that the
normality assumption is violated.
[Figure: Normal Probability Plot. Sorted Residual (dependent variable: Min) plotted against Expected Residual.]
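The coordinates for both diagnostic plots can be produced without plotting software. The sketch below assumes the Corvette data from the worked example in these notes rather than the data behind the figures. The normal scores use the plotting positions (i - 0.375) / (n + 0.25), which is one common convention and an assumption here; the figures in these notes may have used a different one and a different scale for the expected residuals.

```python
from statistics import NormalDist

# Corvette data from the worked example in these notes (age x, price y).
x = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
y = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((a - x_bar) ** 2 for a in x)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y.
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Points for the residual plot: x-value against residual.
residual_plot_points = list(zip(x, residuals))

# Points for the normal probability plot: normal score against sorted residual.
# Assumed plotting positions: (i - 0.375) / (n + 0.25) for i = 1, ..., n.
std_normal = NormalDist()
normal_scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
normal_plot_points = list(zip(normal_scores, sorted(residuals)))

for score, res in normal_plot_points:
    print(f"{score:6.2f} {res:8.2f}")
```

If the printed pairs fall close to a straight line when plotted, the normality assumption looks reasonable; strong curvature suggests it is violated.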
Hypothesis Tests For The Slope
Ho: β1 = 0
Ha: β1 ≠ 0
The test statistic has a t-distribution with df = (n - 2):
t = (b1 - β1) / (se / Sqrt(Sxx))
Under Ho, β1 = 0, so this reduces to t = b1 / (se / Sqrt(Sxx)).
If the value of the test statistic falls in the rejection region, then
reject the null; otherwise do not reject the null.
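The test can be sketched in a few lines, assuming the Corvette data from the worked example in these notes. The critical value 2.306 is the table value quoted there for df = 8 and is hard-coded rather than computed.

```python
import math

# Corvette data from the worked example in these notes (age x, price y).
x = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
y = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n

b1 = sxy / sxx
sse = syy - sxy ** 2 / sxx
se = math.sqrt(sse / (n - 2))

# Test statistic under the null hypothesis that the population slope is 0.
t_stat = b1 / (se / math.sqrt(sxx))

# Two-tailed critical value for df = n - 2 = 8 at the 5% level (from a t-table).
t_crit = 2.306
reject_null = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject the null: {reject_null}")
```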
Confidence Intervals for the Slope
The endpoints of the confidence interval for β1 are:
b1 ± tα/2 · se / Sqrt(Sxx)
with df = n - 2
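As a quick check of the interval formula, the following sketch starts from the summary values reported in the worked example in these notes (n, Sxx, Sxy, SSE) instead of the raw data. The critical value 2.306 is the table value for df = 8 quoted there.

```python
import math

# Summary values copied from the worked example in these notes.
n = 10
sxx = 30.90
sxy = -862.20
sse = 1623.709
t_crit = 2.306  # t-table value for a 95% interval with df = n - 2 = 8

b1 = sxy / sxx
se = math.sqrt(sse / (n - 2))

# Margin of error for the slope: t(alpha/2) * se / Sqrt(Sxx).
margin = t_crit * se / math.sqrt(sxx)
lower, upper = b1 - margin, b1 + margin

print(f"b1 = {b1:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes zero agrees with rejecting the null hypothesis β1 = 0 at the matching significance level.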
Confidence Intervals for Means in Regression
ŷp ± tα/2 · se · Sqrt( 1/n + (xp - Σx/n)² / Sxx )
with df = (n - 2)
Confidence Intervals for a Population y-value given an x-value
ŷp ± tα/2 · se · Sqrt( 1 + 1/n + (xp - Σx/n)² / Sxx )
with df = (n - 2)
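The two intervals differ only by the extra 1 under the square root, which the sketch below makes visible. It starts from the summary values in the worked example in these notes; the choice xp = 4 (a four-year-old car) is a hypothetical value picked for illustration.

```python
import math

# Summary values copied from the worked example in these notes.
n = 10
sum_x = 41
sum_y = 1772
sxx = 30.90
sxy = -862.20
sse = 1623.709
t_crit = 2.306  # t-table value for 95% with df = n - 2 = 8

b1 = sxy / sxx
b0 = (sum_y - b1 * sum_x) / n
se = math.sqrt(sse / (n - 2))

x_p = 4  # hypothetical x-value at which to estimate or predict
y_hat_p = b0 + b1 * x_p

# Term shared by both intervals: 1/n + (xp - x-bar)^2 / Sxx.
shared = 1 / n + (x_p - sum_x / n) ** 2 / sxx

# Confidence interval for the mean of y at x_p.
margin_mean = t_crit * se * math.sqrt(shared)
# Prediction interval for a single y-value at x_p (note the extra 1).
margin_pred = t_crit * se * math.sqrt(1 + shared)

print(f"y-hat at x = {x_p}: {y_hat_p:.2f}")
print(f"mean:   ({y_hat_p - margin_mean:.2f}, {y_hat_p + margin_mean:.2f})")
print(f"single: ({y_hat_p - margin_pred:.2f}, {y_hat_p + margin_pred:.2f})")
```

The interval for a single y-value is always wider, because it must cover the scatter of individual observations around the line as well as the uncertainty in the line itself.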
Aside: The F distribution
• Not symmetric
• Does not have zero at its center
Using the F-Table
• Need a significance level, numerator degrees of freedom (n1 - 1),
and denominator degrees of freedom (n2 - 1)
• To find the left-tailed critical value, take the reciprocal of the
right-tailed value with the degrees of freedom reversed.
• For example: if n1 = 10, n2 = 7, and alpha = 0.05 split equally
between the two tails (0.025 in each), the right-tail value for (9, 6)
degrees of freedom is 5.5234 (from table). The left-tail value is
1/4.32 = 0.2315, where 4.32 is the right-tail critical value for (6, 9)
degrees of freedom.
Regression Example << Regrsam.xls >>

          Age (x)   Price (y)       xy    Sqr(x)    Sqr(y)
             6         125         750      36       15625
             6         115         690      36       13225
             6         130         780      36       16900
             2         260         520       4       67600
             2         219         438       4       47961
             5         150         750      25       22500
             4         190         760      16       36100
             5         163         815      25       26569
             1         260         260       1       67600
             4         160         640      16       25600
TOTALS      41        1772        6403     199      339680

Syy = 339680 - Sqr(1772) / 10 = 25681.60
Sxx = 199 - Sqr(41) / 10 = 30.90
Sxy = 6403 - (41)(1772) / 10 = -862.20
b1 = Sxy / Sxx = -27.90
b0 = (1/10)(1772 - b1(41)) = 291.60
SST = Syy = 25681.600
SSR = (Sxy*Sxy) / Sxx = 24057.891
SSE = SST - SSR = 1623.709
Se = Sqrt(SSE / (n-2)) = 14.247
r-square = (1 - (SSE / SST))*100 = 93.68%
r = -Sqrt(r-square) = -0.97 (negative because b1 < 0)
Hypothesis Testing
Ho: β1 = 0
Ha: β1 ≠ 0
t = b1 / (Se / Sqrt(Sxx)) = -10.887
Critical t at the 5% significance level (two-tailed) with df = 8: ±2.306
Since the calculated value is in the rejection region
(|t| > 2.306), we reject the null hypothesis.
There is enough evidence to conclude that the
age of Corvettes is useful for predicting the price
of the Corvettes.
Confidence Interval for the Slope at 95%
-27.9 + 2.306 * (14.247 / Sqrt(30.90)) = -21.99
-27.9 - 2.306 * (14.247 / Sqrt(30.90)) = -33.81
CI = (-33.81, -21.99)