Slides Prepared by
JOHN S. LOUCKS
St. Edward’s University
© 2003 South-Western/Thomson Learning™
Slide 1
Chapter 16
Regression Analysis: Model Building






General Linear Model
Determining When to Add or Delete Variables
Analysis of a Larger Problem
Variable-Selection Procedures
Residual Analysis
Multiple Regression Approach to Analysis of Variance and Experimental Design
Slide 2
General Linear Model

Models in which the parameters (β0, β1, . . . , βp) all have
exponents of one are called linear models.

First-Order Model with One Predictor Variable

  y = \beta_0 + \beta_1 x_1 + \varepsilon

Second-Order Model with One Predictor Variable

  y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon

Second-Order Model with Two Predictor Variables with Interaction

  y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon
Slide 3
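These second-order and interaction models are still linear in the parameters, so they can be fit by ordinary least squares once the squared and cross-product terms are added to the design matrix. A minimal Python sketch of that idea, using simulated data and illustrative names (not part of the original slides):

```python
# Fit y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2 by least squares.
# The data and coefficient values below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 5 + 2*x1 - 1.5*x2 + 0.3*x1**2 + 0.1*x2**2 + 0.4*x1*x2 + rng.normal(0, 1, 50)

# Design matrix: the model is nonlinear in x1 and x2 but linear in the
# parameters, so ordinary least squares still applies.
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1*x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of b0 ... b5
```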
General Linear Model
Often the problem of nonconstant variance can be
corrected by transforming the dependent variable to a
different scale.


Logarithmic Transformations
Most statistical packages provide the ability to apply logarithmic
transformations using either the base-10 (common log) or the base
e = 2.71828... (natural log).

Reciprocal Transformation
Use 1/y as the dependent variable instead of y.
Slide 4
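As a rough illustration (simulated data, not from the slides), either transformation amounts to refitting the same linear model with a transformed response:

```python
# Refit the model with log(y) or 1/y as the response when the spread of the
# residuals grows with the level of y.  Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 60)
y = np.exp(0.3 + 0.25 * x + rng.normal(0, 0.2, 60))   # variability grows with E(y)

X = np.column_stack([np.ones_like(x), x])
coef_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)   # natural-log transformation
coef_rec, *_ = np.linalg.lstsq(X, 1.0 / y, rcond=None)     # reciprocal transformation
print(coef_log, coef_rec)
```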
General Linear Model

Models in which the parameters (β0, β1, . . . , βp) have exponents
other than one are called nonlinear models. In some cases we can
perform a transformation of variables that will enable us to use
regression analysis with the general linear model.

Exponential Model
The exponential model involves the regression equation:

  E(y) = \beta_0 \beta_1^x

We can transform this nonlinear model to a linear model by taking
the logarithm of both sides.
Slide 5
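A small sketch of that idea (simulated data, illustrative values): taking natural logs of E(y) = β0·β1^x gives log E(y) = log β0 + x·log β1, which is linear in x.

```python
# Estimate the exponential model E(y) = b0 * b1**x by regressing log(y) on x,
# then transforming the fitted intercept and slope back.  Simulated data only.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 40)
y = 2.0 * 1.3**x * np.exp(rng.normal(0, 0.05, 40))   # true b0 = 2.0, b1 = 1.3

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)  # fits log(b0) and log(b1)
b0_hat, b1_hat = np.exp(coef)
print(b0_hat, b1_hat)   # roughly 2.0 and 1.3
```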
Variable Selection Procedures

• Stepwise Regression
• Forward Selection
• Backward Elimination
  These three are iterative; one independent variable at a time is
  added or deleted based on the F statistic.
• Best-Subsets Regression
  Different subsets of the independent variables are evaluated.

Slide 6
Variable Selection Procedures

F Test
To test whether the addition of x2 to a model involving x1 (or the
deletion of x2 from a model involving x1 and x2) is statistically
significant:

  F = \frac{[\mathrm{SSE(reduced)} - \mathrm{SSE(full)}] / \text{number of extra terms}}{\mathrm{MSE(full)}}

  F = \frac{[\mathrm{SSE}(x_1) - \mathrm{SSE}(x_1, x_2)] / 1}{\mathrm{SSE}(x_1, x_2) / (n - p - 1)}

The p-value corresponding to the F statistic is the criterion used to
determine whether a variable should be added or deleted.
Slide 7
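A hedged sketch of this test in Python (simulated data; variable names are illustrative, and the division by 1 reflects the single extra term):

```python
# Partial F test: does adding x2 to a model already containing x1
# significantly reduce SSE?  Data are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 3 + 2*x1 + 1.5*x2 + rng.normal(0, 1, n)

def sse(X, y):
    """Sum of squared errors from an OLS fit of y on X (X includes the intercept)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return resid @ resid

X_reduced = np.column_stack([np.ones(n), x1])      # model with x1 only
X_full = np.column_stack([np.ones(n), x1, x2])     # model with x1 and x2
p = 2                                              # predictors in the full model

F = (sse(X_reduced, y) - sse(X_full, y)) / (sse(X_full, y) / (n - p - 1))
p_value = stats.f.sf(F, 1, n - p - 1)              # upper-tail F probability
print(F, p_value)
```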
Stepwise Regression

Start: compute the F stat. and p-value for each indep. variable in the model.
Any p-value > alpha to remove?
  Yes: the indep. variable with the largest p-value is removed from the model;
       recompute and check again.
  No:  compute the F stat. and p-value for each indep. variable not in the model.
       Any p-value < alpha to enter?
         Yes: the indep. variable with the smallest p-value is entered into the
              model; return to the start.
         No:  stop.
Slide 8
Forward Selection

• This procedure is similar to stepwise regression, but does not
  permit a variable to be deleted.
• The forward-selection procedure starts with no independent
  variables.
• It adds variables one at a time as long as a significant reduction
  in the error sum of squares (SSE) can be achieved.
Slide 9
Forward Selection

Start with no indep. variables in the model.
Compute the F stat. and p-value for each indep. variable not in the model.
Any p-value < alpha to enter?
  Yes: the indep. variable with the smallest p-value is entered into the model;
       recompute and check again.
  No:  stop.
Slide 10
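A rough Python sketch of the forward-selection loop shown in the flowchart (X is assumed to be a pandas DataFrame of candidate predictors; the 0.05 threshold and the use of statsmodels are assumptions, not part of the slides):

```python
# Forward selection: repeatedly enter the not-yet-included variable with the
# smallest p-value, as long as that p-value is below alpha-to-enter.
import statsmodels.api as sm

def forward_selection(X, y, alpha_to_enter=0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate when it is added to the current model
        pvals = {}
        for var in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            pvals[var] = model.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_to_enter:
            break                      # no remaining candidate is significant
        selected.append(best)          # enter the variable with the smallest p-value
        remaining.remove(best)
    return selected
```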
Backward Elimination

• This procedure begins with a model that includes all the
  independent variables the modeler wants considered.
• It then attempts to delete one variable at a time by determining
  whether the least significant variable currently in the model can
  be removed because its p-value is greater than the user-specified
  or default alpha to remove.
• Once a variable has been removed from the model it cannot
  reenter at a subsequent step.
Slide 11
Backward Elimination

Start with all indep. variables in the model.
Compute the F stat. and p-value for each indep. variable in the model.
Any p-value > alpha to remove?
  Yes: the indep. variable with the largest p-value is removed from the model;
       recompute and check again.
  No:  stop.
Slide 12
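A comparable sketch of the backward-elimination loop in the flowchart (same assumptions: a pandas DataFrame of predictors, statsmodels for the p-values, and 0.05 as the default alpha to remove):

```python
# Backward elimination: start with every candidate variable, then repeatedly
# drop the one with the largest p-value until all remaining p-values <= alpha.
import statsmodels.api as sm

def backward_elimination(X, y, alpha_to_remove=0.05):
    selected = list(X.columns)
    while selected:
        model = sm.OLS(y, sm.add_constant(X[selected])).fit()
        pvals = model.pvalues.drop('const')   # p-values of the predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha_to_remove:
            break                             # every remaining variable is significant
        selected.remove(worst)                # remove the least significant variable
    return selected
```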
Example: Clarksville Homes
Tony Zamora, a real estate investor, has just
moved to Clarksville and wants to learn about the
city’s residential real estate market. Tony has
randomly selected 25 house-for-sale listings from the
Sunday newspaper and collected the data listed on the
next three slides.
Develop, using the backward elimination
procedure, a multiple regression model to predict the
selling price of a house in Clarksville.
Slide 13
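For readers who want to retrace the Excel walkthrough that follows in Python, a possible data setup is sketched below, reusing the backward_elimination sketch from the previous slide; only the first five listings are shown here, and the column names follow the worksheet on the next slide.

```python
# Clarksville data in a pandas DataFrame (first five listings only; the full
# example uses all 25).  With all rows entered, backward_elimination(homes, price)
# would retrace the variable removals shown in the Excel output that follows.
import pandas as pd

homes = pd.DataFrame({
    "House Size": [21, 11, 19, 38, 24],   # hundreds of square feet
    "Bedrooms":   [ 4,  2,  3,  5,  4],
    "Bathrooms":  [ 2,  1,  2,  4,  3],
    "Cars":       [ 2,  0,  2,  3,  2],   # garage size
})
price = pd.Series([290, 95, 170, 375, 350])   # selling price ($000)
```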
Using Excel to Perform the
Backward Elimination Procedure

Worksheet (showing partial data)

Segment     Selling Price   House Size     Number of   Number of   Garage Size
of City     ($000)          (00 sq. ft.)   Bedrms.     Bathrms.    (cars)
Northwest   290             21             4           2           2
South        95             11             2           1           0
Northeast   170             19             3           2           2
Northwest   375             38             5           4           3
West        350             24             4           3           2
South       125             10             2           2           0
West        310             31             4           4           2
West        275             25             3           2           2

Note: Rows 10-26 are not shown.
Slide 14
Using Excel to Perform the
Backward Elimination Procedure

Worksheet (showing partial data)

Segment     Selling Price   House Size     Number of   Number of   Garage Size
of City     ($000)          (00 sq. ft.)   Bedrms.     Bathrms.    (cars)
Northwest   340             27             5           3           3
Northeast   215             22             4           3           2
Northwest   295             20             4           3           2
South       190             24             4           3           2
Northwest   385             36             5           4           3
West        430             32             5           4           2
South       185             14             3           2           1
South       175             18             4           2           2

Note: Rows 2-9 are hidden and rows 18-26 are not shown.
Slide 15
Using Excel to Perform the
Backward Elimination Procedure

Worksheet (showing partial data)

Segment     Selling Price   House Size     Number of   Number of   Garage Size
of City     ($000)          (00 sq. ft.)   Bedrms.     Bathrms.    (cars)
Northeast   190             19             4           2           2
Northwest   330             29             4           4           3
West        405             33             5           4           3
Northeast   170             23             4           2           2
West        365             34             5           4           3
Northwest   280             25             4           2           2
South       135             17             3           1           1
Northeast   205             21             4           3           2
West        260             26             4           3           2

Note: Rows 2-17 are hidden.
Slide 16
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.898964443
R Square             0.80813707
Adjusted R Square    0.769764484
Standard Error       45.87155025
Observations         25
Slide 17
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

ANOVA
              df      SS          MS          F          Significance F
Regression     4      177260      44315       21.06027   6.1385E-07
Residual      20      42083.98    2104.199
Total         24      219344
Slide 18
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

              Coeffic.    Std. Err.   t Stat    P-value
Intercept     -59.416     54.6072     -1.0881   0.28951
House Size      6.50587    3.24687     2.0037   0.05883
Bedrooms       29.1013    26.2148      1.1101   0.28012
Bathrooms      26.4004    18.8077      1.4037   0.17574
Cars          -10.803     27.329      -0.3953   0.6968
Slide 19
Using Excel to Perform the
Backward Elimination Procedure



• Cars (garage size) is the independent variable with the highest
  p-value (.697), which is greater than .05.
• Cars is removed from the model.
• Multiple regression is performed again on the remaining
  independent variables.
Slide 20
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.898130279
R Square             0.806637998
Adjusted R Square    0.779014855
Standard Error       44.94059302
Observations         25
Slide 21
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

ANOVA
              df      SS          MS          F          Significance F
Regression     4      177260      44315       21.06027   6.1385E-07
Residual      20      42083.98    2104.199
Total         24      219344
Slide 22
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

              Coeffic.    Std. Err.   t Stat    P-value
Intercept     -47.342     44.3467     -1.0675   0.29785
House Size      6.02021    2.94446     2.0446   0.05363
Bedrooms       23.0353    20.8229      1.1062   0.28113
Bathrooms      27.0286    18.3601      1.4721   0.15581
Slide 23
Using Excel to Perform the
Backward Elimination Procedure



• Bedrooms is the independent variable with the highest
  p-value (.281), which is greater than .05.
• Bedrooms is removed from the model.
• Multiple regression is performed again on the remaining
  independent variables.
Slide 24
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.891835053
R Square             0.795369762
Adjusted R Square    0.776767013
Standard Error       45.1685807
Observations         25
Slide 25
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

ANOVA
              df      SS          MS          F         Significance F
Regression     2      174459.6    87229.79    42.7555   2.63432E-08
Residual      22      44884.42    2040.201
Total         24      219344
Slide 26
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

              Coeffic.    Std. Err.   t Stat    P-value
Intercept     -12.349     31.2392     -0.3953   0.69642
House Size      7.94652    2.38644     3.3299   0.00304
Bathrooms      30.3444    18.2056      1.6668   0.10974
Slide 27
Using Excel to Perform the
Backward Elimination Procedure



• Bathrooms is the independent variable with the highest
  p-value (.110), which is greater than .05.
• Bathrooms is removed from the model.
• Regression is performed again on the remaining
  independent variable.
Slide 28
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.877228487
R Square             0.769529819
Adjusted R Square    0.759509376
Standard Error       46.88202186
Observations         25
Slide 29
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

ANOVA
              df      SS          MS          F          Significance F
Regression     1      168791.7    168791.7    76.79599   8.67454E-09
Residual      23      50552.25    2197.924
Total         24      219344
Slide 30
Using Excel to Perform the
Backward Elimination Procedure

Value Worksheet (partial)

              Coeffic.    Std. Err.   t Stat    P-value
Intercept      -9.8669    32.3874     -0.3047   0.76337
House Size     11.3383     1.29384     8.7633   8.7E-09
Slide 31
Using Excel to Perform the
Backward Elimination Procedure


• House Size is the only independent variable remaining in the model.
• The estimated regression equation is:

  \hat{y} = -9.8669 + 11.3383(\text{House Size})

• The Adjusted R Square value is .760.
Slide 32
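As a quick illustration of the units (price in $000, house size in hundreds of square feet), a 2,000-square-foot house (House Size = 20) would be predicted to sell for about -9.8669 + 11.3383(20) ≈ 216.9, i.e. roughly $217,000.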
Variable-Selection Procedures

Best-Subsets Regression
• The three preceding procedures are one-variable-at-a-time methods
  offering no guarantee that the best model for a given number of
  variables will be found.
• Some statistical software packages include best-subsets regression
  that enables the user to find, given a specified number of
  independent variables, the best regression model.
• Typical output identifies the two best one-variable estimated
  regression equations, the two best two-variable regression
  equations, and so on.
Slide 33
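The packages referred to above do this efficiently; purely as an illustration of the idea, the sketch below fits every subset of a given size with statsmodels and keeps the two best by adjusted R-square (the function and argument names are assumptions, not any package's actual interface):

```python
# Best-subsets idea: evaluate every subset of n_vars predictors and keep the
# 'keep' best models, here ranked by adjusted R-square.
from itertools import combinations
import statsmodels.api as sm

def best_subsets(X, y, n_vars, keep=2):
    results = []
    for subset in combinations(X.columns, n_vars):
        model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        results.append((model.rsquared_adj, subset))
    results.sort(reverse=True)          # largest adjusted R-square first
    return results[:keep]
```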
Example: PGA Tour Data
The Professional Golfers Association keeps a variety
of statistics regarding performance measures. Data
include the average driving distance, percentage of
drives that land in the fairway, percentage of greens hit
in regulation, average number of putts, percentage of
sand saves, and average score.
The variable names and definitions are shown on the
next slide.
Slide 34
Example: PGA Tour Data

Variable Names and Definitions
Drive: average length of a drive in yards
Fair: percentage of drives that land in the fairway
Green: percentage of greens hit in regulation (a par-3
green is “hit in regulation” if the player’s first
shot lands on the green)
Putt: average number of putts for greens that have
been hit in regulation
Sand: percentage of sand saves (landing in a sand
trap and still scoring par or better)
Score: average score for an 18-hole round
Slide 35
Example: PGA Tour Data

Sample Data

Drive    Fair    Green   Putt    Sand    Score
277.6    .681    .667    1.768   .550    69.10
259.6    .691    .665    1.810   .536    71.09
269.1    .657    .649    1.747   .472    70.12
267.0    .689    .673    1.763   .672    69.88
267.3    .581    .637    1.781   .521    70.71
255.6    .778    .674    1.791   .455    69.76
272.9    .615    .667    1.780   .476    70.19
265.4    .718    .699    1.790   .551    69.73
Slide 36
Example: PGA Tour Data

Sample Data (continued)

Drive    Fair    Green   Putt    Sand    Score
272.6    .660    .672    1.803   .431    69.97
263.9    .668    .669    1.774   .493    70.33
267.0    .686    .687    1.809   .492    70.32
266.0    .681    .670    1.765   .599    70.09
258.1    .695    .641    1.784   .500    70.46
255.6    .792    .672    1.752   .603    69.49
261.3    .740    .702    1.813   .529    69.88
262.2    .721    .662    1.754   .576    70.27
Slide 37
Example: PGA Tour Data

Sample Data (continued)

Drive    Fair    Green   Putt    Sand    Score
260.5    .703    .623    1.782   .567    70.72
271.3    .671    .666    1.783   .492    70.30
263.3    .714    .687    1.796   .468    69.91
276.6    .634    .643    1.776   .541    70.69
252.1    .726    .639    1.788   .493    70.59
263.0    .687    .675    1.786   .486    70.20
263.0    .639    .647    1.760   .374    70.81
253.5    .732    .693    1.797   .518    70.26
266.2    .681    .657    1.812   .472    70.96
Slide 38
Example: PGA Tour Data

Sample Correlation Coefficients

          Score    Drive    Fair     Green    Putt
Drive     -.154
Fair      -.427    -.679
Green     -.556    -.045     .421
Putt       .258    -.139     .101     .354
Sand      -.278    -.024     .265     .083    -.296
Slide 39
Example: PGA Tour Data

Best Subsets Regression of SCORE

Vars   R-sq   R-sq(a)   C-p    s         D  F  G  P  S
 1     30.9    27.9     26.9   .39685          X
 1     18.2    14.6     35.7   .43183       X
 2     54.7    50.5     12.4   .32872    X  X
 2     54.6    50.5     12.5   .32891          X  X
 3     60.7    55.1     10.2   .31318    X  X     X
 3     59.1    53.3     11.4   .31957    X  X  X
 4     72.2    66.8      4.2   .26913    X  X  X  X
 4     60.9    53.1     12.1   .32011    X  X     X  X
 5     72.6    65.4      6.0   .27499    X  X  X  X  X
Slide 40
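The C-p column is Mallows' Cp statistic, which the slide does not define. Assuming the usual definition, with SSE_p the error sum of squares of a subset model containing p predictors and MSE_full the mean square error of the model with all five predictors:

  C_p = \frac{\mathrm{SSE}_p}{\mathrm{MSE}_{\mathrm{full}}} - \bigl(n - 2(p + 1)\bigr)

Subset models with Cp close to the number of estimated parameters (p + 1) are regarded as having little bias; in the output above, the best four-variable model has the smallest value shown, Cp = 4.2.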
Example: PGA Tour Data
The regression equation
Score = 74.678 - .0398(Drive) - 6.686(Fair)
- 10.342(Green) + 9.858(Putt)
Predictor     Coef       Stdev      t-ratio    p
Constant      74.678     6.952      10.74      .000
Drive          -.0398     .01235    -3.22      .004
Fair          -6.686     1.939      -3.45      .003
Green        -10.342     3.561      -2.90      .009
Putt           9.858     3.180       3.10      .006

s = .2691    R-sq = 72.4%    R-sq(adj) = 66.8%
Slide 41
Example: PGA Tour Data
Analysis of Variance

SOURCE        DF     SS         MS        F       P
Regression     4     3.79469    .94867    13.10   .000
Error         20     1.44865    .07243
Total         24     5.24334
Slide 42
Residual Analysis: Autocorrelation

Durbin-Watson Test for Autocorrelation
• Statistic

  d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}

• The statistic ranges in value from zero to four.
• If successive values of the residuals are close together
  (positive autocorrelation), the statistic will be small.
• If successive values are far apart (negative autocorrelation),
  the statistic will be large.
• A value of two indicates no autocorrelation.
Slide 43
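A minimal check of the statistic, computed directly from a residual series (the residuals below are made up for illustration):

```python
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the sum of squared residuals.
import numpy as np

def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.array([0.5, 0.6, 0.4, -0.3, -0.5, -0.2, 0.1, 0.3])
print(durbin_watson(e))   # about 0.64 here: successive residuals are close
                          # together, suggesting positive autocorrelation
```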
End of Chapter 16
Slide 44