Nominal Variables Examples

Chapter 20.3: Nominal Independent Variables
ANOVA and Multiple Regression
The two-sample t-test and ANOVA can be viewed as special cases of multiple regression
with nominal independent variables. Consider Example 15.1 in which the goal was to
compare three advertising strategies (Convenience, Quality and Price) for an apple juice
product. Suppose first that we just wanted to compare Convenience and Quality. We
could do a two-sample t-test:
[Figure: Oneway Analysis of Sales By Strategy — side-by-side plot of Sales (400 to 800) for the Convenience and Quality strategies]

t-Test (assuming equal variances)
            Estimate  Std Error  Lower 95%  Upper 95%  t-Test  DF  Prob > |t|
Difference    -75.45      30.01    -136.20     -14.70  -2.514  38      0.0163

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Strategy   1        56927.03      56927.0   6.3206    0.0163
Error     38       342248.95       9006.6
C. Total  39       399175.97

Means for Oneway Anova
Level        Number     Mean  Std Error  Lower 95%  Upper 95%
Convenience      20  577.550     21.221     534.59     620.51
Quality          20  653.000     21.221     610.04     695.96
Std Error uses a pooled estimate of error variance
We could also consider a regression model. Let I1 = 0 if Convenience and I1 = 1 if Quality. A regression model would look like this:

Y = β0 + β1 I1 + ε
[Figure: Bivariate Fit of Sales By I1 — scatter plot of Sales (400 to 800) against I1 (0 to 1) with the fitted line]

Linear Fit
Sales = 577.55 + 75.45 I1

Summary of Fit
RSquare                     0.142611
RSquare Adj                 0.120048
Root Mean Square Error      94.90285
Mean of Response             615.275
Observations (or Sum Wgts)        40

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      1        56927.03      56927.0   6.3206    0.0163
Error     38       342248.95       9006.6
C. Total  39       399175.97

Parameter Estimates
Term       Estimate  Std Error  t Ratio  Prob > |t|
Intercept    577.55   21.22092    27.22      <.0001
I1            75.45   30.01092     2.51      0.0163
Notice that the coefficient on I1 equals the difference between the mean of quality (I1=1)
and the mean of convenience (I1=0). This makes sense because the coefficient on I1 is
the average change in sales for a one unit increase in I1. This just equals the difference
between the mean of quality and the mean of convenience. Also notice that the p-value
for testing whether the coefficient on I1 is zero (H0: β1 = 0) is the same as the p-value for the two-sided two-sample t-test.
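To see this equivalence numerically, here is a minimal sketch in Python with NumPy. The data below are a small hypothetical example, not the apple-juice data: regressing sales on the 0/1 dummy recovers the Convenience group mean as the intercept and the Quality-minus-Convenience mean difference as the slope.

```python
import numpy as np

# Hypothetical mini-dataset (not the Example 15.1 data): sales under two strategies.
convenience = np.array([520.0, 610.0, 575.0, 590.0])
quality = np.array([640.0, 660.0, 655.0, 657.0])

y = np.concatenate([convenience, quality])
i1 = np.concatenate([np.zeros(4), np.ones(4)])  # I1 = 1 for Quality

# Least-squares fit of Y = b0 + b1*I1.
X = np.column_stack([np.ones_like(i1), i1])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# b0 is the Convenience mean; b1 is the Quality-minus-Convenience mean difference.
assert np.isclose(b0, convenience.mean())
assert np.isclose(b1, quality.mean() - convenience.mean())
```

The same algebra is what makes the intercept 577.55 and slope 75.45 in the JMP output above match the two group means exactly.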
What happens if we also want to incorporate the price strategy into the analysis, i.e., use a
one-way ANOVA with three levels for convenience, quality and price? The one-way
ANOVA analysis is the following:
[Figure: Oneway Analysis of Sales By Strategy — side-by-side plot of Sales (400 to 800) for Convenience, Price and Quality, with comparison circles (Each Pair, Student's t, 0.05)]

Oneway Anova
Summary of Fit
Rsquare                     0.101882
Adj Rsquare                  0.07037
Root Mean Square Error      94.31038
Mean of Response            613.0667
Observations (or Sum Wgts)        60

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Strategy   2        57512.23      28756.1   3.2330    0.0468
Error     57       506983.50       8894.4
C. Total  59       564495.73

Means for Oneway Anova
Level        Number     Mean  Std Error  Lower 95%  Upper 95%
Convenience      20  577.550     21.088     535.32     619.78
Price            20  608.650     21.088     566.42     650.88
Quality          20  653.000     21.088     610.77     695.23
Std Error uses a pooled estimate of error variance

Means Comparisons
Dif = Mean[i] - Mean[j]   Quality    Price  Convenience
Quality                     0.000   44.350       75.450
Price                     -44.350    0.000       31.100
Convenience               -75.450  -31.100        0.000

Alpha = 0.05
Comparisons for each pair using Student's t
t = 2.00247

Abs(Dif) - LSD   Quality    Price  Convenience
Quality          -59.721  -15.371       15.729
Price            -15.371  -59.721      -28.621
Convenience       15.729  -28.621      -59.721
Positive values show pairs of means that are significantly different.
There is a multiple regression equivalent to the ANOVA, but it involves creating two dummy variables: I1 = 1 if quality, I1 = 0 if convenience or price, and I2 = 1 if price, I2 = 0 if convenience or quality. The regression model looks like this:

Y = β0 + β1 I1 + β2 I2 + ε

In this case the intercept β0 represents the mean for convenience, β0 + β1 represents the mean for quality and β0 + β2 represents the mean for price. β1 represents the difference between the means for quality and convenience and β2 represents the difference between the means for price and convenience.
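A minimal sketch of this three-level dummy coding in Python with NumPy (using made-up numbers, not the Example 15.1 data) confirms that the fitted coefficients are exactly these group-mean differences:

```python
import numpy as np

# Hypothetical mini-dataset with three strategies (not the real data).
groups = {"Convenience": [560.0, 580.0, 590.0],
          "Quality": [640.0, 650.0, 660.0],
          "Price": [600.0, 610.0, 605.0]}

y, i1, i2 = [], [], []
for name, vals in groups.items():
    for v in vals:
        y.append(v)
        i1.append(1.0 if name == "Quality" else 0.0)  # I1 = 1 for Quality
        i2.append(1.0 if name == "Price" else 0.0)    # I2 = 1 for Price

# Least-squares fit of Y = b0 + b1*I1 + b2*I2.
X = np.column_stack([np.ones(len(y)), i1, i2])
b0, b1, b2 = np.linalg.lstsq(X, np.array(y), rcond=None)[0]

m = {k: np.mean(v) for k, v in groups.items()}
assert np.isclose(b0, m["Convenience"])               # intercept = Convenience mean
assert np.isclose(b1, m["Quality"] - m["Convenience"])
assert np.isclose(b2, m["Price"] - m["Convenience"])
```

The baseline category (Convenience, with I1 = I2 = 0) is absorbed into the intercept, which is why only two dummies are needed for three levels.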
Response Sales
Whole Model
[Figure: Actual by Predicted Plot — Sales Actual vs. Sales Predicted; P = 0.0468, RSq = 0.10, RMSE = 94.31]

Summary of Fit
RSquare                     0.101882
RSquare Adj                  0.07037
Root Mean Square Error      94.31038
Mean of Response            613.0667
Observations (or Sum Wgts)        60

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      2        57512.23      28756.1   3.2330    0.0468
Error     57       506983.50       8894.4
C. Total  59       564495.73

Parameter Estimates
Term       Estimate  Std Error  t Ratio  Prob > |t|
Intercept    577.55   21.08844    27.39      <.0001
I1            75.45   29.82356     2.53      0.0142
I2            31.10   29.82356     1.04      0.3014

Effect Tests
Source  Nparm  DF  Sum of Squares  F Ratio  Prob > F
I1          1   1       56927.025   6.4003    0.0142
I2          1   1        9672.100   1.0874    0.3014

[Figure: Residual by Predicted Plot — Sales Residual vs. Sales Predicted]
Notice that the coefficients on I1 and I2 are the differences in sample means between quality and convenience and between price and convenience, respectively. Also notice that the p-value for the F-test of H0: β1 = β2 = 0 is identical to the p-value of the F-test from the one-way ANOVA of the hypothesis that the means of convenience, quality and price are equal. This is the case because H0: β1 = β2 = 0 is equivalent to the means of convenience, quality and price being equal.
Combining Nominal Independent Variables and Continuous
Independent Variables
Multiple linear regression can accommodate both categorical independent variables and continuous independent variables. For example (Keller and Warrack, section 20.3), suppose you want to predict used-car prices from their odometer reading and their color (white, silver and other). To represent the situation of three possible colors, we need two indicator variables. In general, to represent a nominal variable with m possible categories, we must create m − 1 indicator variables. Here, we create two indicator variables:

I1 = 1 if the color is white
   = 0 if the color is not white
I2 = 1 if the color is silver
   = 0 if the color is not silver

The category “Other colors” is defined by I1 = 0, I2 = 0.
Our regression model is

Y = β0 + β1 X1 + β2 I1 + β3 I2 + ε

where X1 equals the odometer reading.
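As a sketch of how the design matrix for this model is built, here is the dummy-coding bookkeeping in Python with NumPy. The odometer readings below are made up for illustration, not taken from the data file:

```python
import numpy as np

# One row of the design matrix for Y = b0 + b1*X1 + b2*I1 + b3*I2 + e,
# where X1 = odometer reading and the dummies encode the three colors.
def design_row(odometer, color):
    i1 = 1.0 if color == "white" else 0.0   # I1: white indicator
    i2 = 1.0 if color == "silver" else 0.0  # I2: silver indicator
    return [1.0, odometer, i1, i2]          # "other" is the baseline: I1 = I2 = 0

# Hypothetical cars (odometer, color):
cars = [(37400.0, "white"), (44800.0, "other"), (31200.0, "silver")]
X = np.array([design_row(o, c) for o, c in cars])

# With the fitted coefficients reported below, predictions are X @ beta.
beta = np.array([16700.646, -0.05554, 90.481959, 295.47602])
pred = X @ beta
```

A statistics package builds exactly this matrix behind the scenes when a nominal variable is included in a regression.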
Response Price
Whole Model
[Figure: Actual by Predicted Plot — Price Actual vs. Price Predicted; P < .0001, RSq = 0.70, RMSE = 284.54]

Summary of Fit
RSquare                      0.69803
RSquare Adj                 0.688594
Root Mean Square Error      284.5421
Mean of Response            14822.82
Observations (or Sum Wgts)       100

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      3        17966997      5988999  73.9709    <.0001
Error     96         7772564        80964
C. Total  99        25739561

Parameter Estimates
Term        Estimate  Std Error  t Ratio  Prob > |t|
Intercept  16700.646   184.3331    90.60      <.0001
Odometer    -0.05554   0.004737   -11.72      <.0001
I-1        90.481959   68.16886     1.33      0.1876
I-2        295.47602   76.36998     3.87      0.0002

Effect Tests
Source    Nparm  DF  Sum of Squares   F Ratio  Prob > F
Odometer      1   1        11129137  137.4575    <.0001
I-1           1   1          142641    1.7618    0.1876
I-2           1   1         1211971   14.9692    0.0002

[Figure: Residual by Predicted Plot — Price Residual vs. Price Predicted]
The interpretation of the coefficient 90.48 on I1 is that a white car sells, on average, for $90.48 more than a car of the “Other colors” category with the same number of odometer miles. The interpretation of the coefficient 295.48 on I2 is that a silver car sells, on average, for $295.48 more than a car of the “Other colors” category with the same number of odometer miles. The interpretation of the coefficient −0.05554 on odometer is that for two cars of the same color whose odometer readings differ by one mile, the car with the additional mile sells for 5.55 cents less on average.
For each color car, there is a regression line for how the average price depends on
odometer ( X 1 ). The regression lines for each color car are parallel. The indicator
variables can be interpreted as shifting the intercepts of the regression lines. See the
picture on page 708:
For white cars: Ŷ = (16700.65 + 90.48) − .056 X1 = 16791.13 − .056 X1
For silver cars: Ŷ = (16700.65 + 295.48) − .056 X1 = 16996.13 − .056 X1
For all other cars: Ŷ = 16700.65 − .056 X1.
Note: We will not consider it here, but it would be wise to investigate whether there is an interaction between color and odometer. A more general multiple regression model, which would allow for different slopes and intercepts for each color of car, would be

Y = β0 + β1 X1 + β2 I1 + β3 I2 + β4 X1·I1 + β5 X1·I2 + ε.
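A sketch of the corresponding design-matrix row in Python (just the dummy and interaction bookkeeping, no data from the example):

```python
# One row of the design matrix for the interaction model
# Y = b0 + b1*X1 + b2*I1 + b3*I2 + b4*X1*I1 + b5*X1*I2 + e.
def interaction_row(x1, color):
    i1 = 1.0 if color == "white" else 0.0   # I1: white indicator
    i2 = 1.0 if color == "silver" else 0.0  # I2: silver indicator
    return [1.0, x1, i1, i2, x1 * i1, x1 * i2]

# With the interaction terms, each color gets its own slope as well as its
# own intercept: white cars have slope b1 + b4, silver cars have slope
# b1 + b5, and other cars keep the baseline slope b1.
```

Testing H0: β4 = β5 = 0 in this model would check whether the parallel-lines assumption above is reasonable.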
Example: Problem 20.23
The president of a company that manufactures car seats has been concerned about the
number and cost of machine breakdowns. The problem is that the machines are old and
becoming quite unreliable. However, the cost of replacing them is quite high, and the
president is not certain that the cost can be made up in today’s slow economy.
In Exercise 18.85, a simple linear regression model was used to analyze the relationship
between welding machine breakdowns and the age of the machine. The analysis proved
to be so useful to company management that they decided to expand the model to include
other machines. Data were gathered for two other machines. These data as well as the
original data are stored in file Xr20-23 in the following way:
Column 1: Cost of repairs
Column 2: Age of machine
Column 3: Machine (1= welding machine; 2= lathe; 3=stamping machine)
(a) Develop a multiple regression model
(b) Interpret the coefficients
(c) Can we conclude that welding machines cost more to repair than stamping machines
if the machines are of the same age?
Solution:
(a) Y = Cost of repairs. We want to include as independent variables both the continuous variable Age of Machine (X1) and the nominal variable Machine. To include the nominal variable, we need to create two indicator variables (because the nominal variable takes on three values). Let

I1 = 1 if the machine is a welding machine
   = 0 if the machine is not a welding machine
I2 = 1 if the machine is a lathe
   = 0 if the machine is not a lathe

Our multiple regression model is

Y = β0 + β1 X1 + β2 I1 + β3 I2 + ε
Response Repairs
Whole Model
[Figure: Actual by Predicted Plot — Repairs Actual vs. Repairs Predicted; P < .0001, RSq = 0.59, RMSE = 48.591]

Summary of Fit
RSquare                     0.593778
RSquare Adj                 0.572016
Root Mean Square Error      48.59141
Mean of Response            340.6457
Observations (or Sum Wgts)        60

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
Model      3       193271.36      64423.8  27.2852    <.0001
Error     56       132223.03       2361.1
C. Total  59       325494.39

Parameter Estimates
Term        Estimate  Std Error  t Ratio  Prob > |t|
Intercept  119.25213   35.00037     3.41      0.0012
Age         2.538233   0.402311     6.31      <.0001
I-1        -11.75534   19.70184    -0.60      0.5531
I-2        -199.3737   30.71301    -6.49      <.0001

Effect Tests
Source  Nparm  DF  Sum of Squares  F Ratio  Prob > F
Age         1   1       93984.714  39.8050    <.0001
I-1         1   1         840.574   0.3560    0.5531
I-2         1   1       99497.022  42.1397    <.0001

[Figure: Residual by Predicted Plot — Repairs Residual vs. Repairs Predicted]
The residual by predicted plot does not show any gross departures from a random scatter.
(b) For each additional month of age, repair costs increase on average by $2.54 for machines of the same type; welding machines cost on average $11.76 less to repair than stamping machines of the same age; lathes cost on average $199.37 less to repair than stamping machines of the same age.
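The fitted equation can be packaged as a small prediction function, a Python sketch using the coefficients reported above (the function name and the sample age are ours):

```python
# Predicted repair cost from the fitted model
# Repairs-hat = 119.25213 + 2.538233*Age - 11.75534*I1 - 199.3737*I2,
# where I1 = 1 for a welding machine, I2 = 1 for a lathe, and a
# stamping machine is the baseline with I1 = I2 = 0.
def predicted_repair_cost(age, machine):
    i1 = 1.0 if machine == "welding" else 0.0
    i2 = 1.0 if machine == "lathe" else 0.0
    return 119.25213 + 2.538233 * age - 11.75534 * i1 - 199.3737 * i2

# At any fixed age, a welding machine is predicted to cost $11.76 less to
# repair than a stamping machine, and a lathe $199.37 less.
gap = predicted_repair_cost(100, "stamping") - predicted_repair_cost(100, "welding")
```

Because the model has no age-by-machine interaction, these gaps are the same at every age.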
(c) To test whether welding machines cost more on average to repair than stamping machines of the same age, we want to test whether the coefficient on I-1 is zero, i.e., we want to test H0: β2 = 0. The p-value for the two-sided test is 0.5531, and the estimated coefficient is negative. Thus, there is not enough evidence to conclude that welding machines cost more on average to repair than stamping machines of the same age.
Practice Problems from Chapter 20: 20.6, 20.8, 20.20, 20.24