Dummy Variables

Modeling Qualitative Variables (Dummy Variables) with Regression

I. Modeling values as base and differences

On July 19th, 2011, Dell offered a base Inspiron 600 for $299.99, with Windows 7 Home Premium included in the base price. A buyer was allowed to customize this computer, and one of the choices was the edition of Microsoft Office 2010:

1. If you wanted Microsoft Office Home and Student 2010, add $119 (price becomes $418.99)

2. If you wanted Microsoft Office Home and Business 2010, add $199 (price becomes $498.99)

3. If you wanted Microsoft Office Professional 2010, add $349 (price becomes $648.99)

The prices of the computer under the different options become

1. Base price: $299.99

2. Microsoft Office Home and Student 2010: $299.99 + $119 = $418.99

3. Microsoft Office Home and Business 2010: $299.99 + $199 = $498.99

4. Microsoft Office Professional 2010: $299.99 + $349 = $648.99

Because a computer represents Yes and No as 1 and 0, we can code each choice with a 0/1 indicator (a dummy variable) and model the choices above using the following equation:

\[ \text{Price} = 299.99 + 119\,X_1 + 199\,X_2 + 349\,X_3 \]

where

\(X_1\) = 1 if you choose option 1 (Home and Student), 0 otherwise

\(X_2\) = 1 if you choose option 2 (Home and Business), 0 otherwise

\(X_3\) = 1 if you choose option 3 (Professional), 0 otherwise

For example, if you choose option 2, the price would be 299.99 + 119(0) + 199(1) + 349(0) = 299.99 + 199 = 498.99.
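The base-plus-differences equation can be checked with a few lines of Python. This is only a minimal sketch: the names `BASE_PRICE`, `ADD_ONS`, `dummies`, and `price` are illustrative and do not come from the handout, and option 0 is assumed to stand for the base computer with no Office upgrade.

```python
BASE_PRICE = 299.99
# Add-on price for each Office option, keyed by option number (values from the list above).
ADD_ONS = {1: 119.00, 2: 199.00, 3: 349.00}


def dummies(option):
    """Return (X1, X2, X3) for the chosen option; option 0 is the base computer."""
    return tuple(1 if option == k else 0 for k in sorted(ADD_ONS))


def price(option):
    """Evaluate Price = 299.99 + 119*X1 + 199*X2 + 349*X3 for the chosen option."""
    x = dummies(option)
    total = BASE_PRICE + sum(ADD_ONS[k] * x[k - 1] for k in sorted(ADD_ONS))
    return round(total, 2)


for option in range(4):
    print(option, dummies(option), price(option))
```

Every dummy is 0 for the base computer, so the intercept alone gives its price; each coefficient is the difference between one option and the base.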

How would you model the price of the following computers?

Computer 1, base computer with 1 year support $338.99

Computer 2, 90 day support, $299.99

Computer 3, 2 year support, $418.99

II. Using averages: If you are purchasing the same type of computer across many dealers, some prices will be higher and some will be lower, so we model the average price instead. What are the average prices of the following computers?

Average price = 300 + 50 (option 1) + 200 (option 2) – 140 (option 3)

If there had been 5 options, how many dummy variables would have been needed?

In general, we will model a qualitative variable with c levels using c-1 dummy variables:

\[ \text{Population mean of } Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{c-1} X_{c-1} \]

where \(X_i\) = 1 if level i and 0 otherwise, i = 1, 2, …, c-1.

In this case what would be the base average price?

What would \(\beta_3\) represent?

What would have been the population mean of level 2?
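The c-1 coding rule can be sketched in Python as follows. The function names and level labels are hypothetical. To avoid inventing data, the coefficients reuse the Section I prices as stand-ins for \(\beta_0, \dots, \beta_3\); they are not the averages asked about in the questions above.

```python
def dummy_code(value, levels, base):
    """Return one 0/1 dummy per non-base level, in the order given by `levels`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == lvl else 0 for lvl in levels if lvl != base]


def model_mean(value, levels, base, beta):
    """beta[0] is the base-level mean; beta[i] is the shift for the i-th non-base level."""
    x = dummy_code(value, levels, base)
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


# Hypothetical labels for c = 4 levels, so c - 1 = 3 dummy variables are needed.
LEVELS = ["base", "level 1", "level 2", "level 3"]
BETA = [299.99, 119.0, 199.0, 349.0]  # stand-in coefficients taken from Section I

for lvl in LEVELS:
    print(lvl, dummy_code(lvl, LEVELS, "base"), round(model_mean(lvl, LEVELS, "base", BETA), 2))
```

The base level never gets its own dummy: it is the case where every dummy is 0, which is why c levels need only c-1 variables.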

III. Errors due to sampling: If you are able to obtain only a sample of prices, the sample averages, and the estimated changes in average price due to changing options, are subject to sampling error.

Example: You are measuring the tensile strength provided by 4 suppliers. Supplier 1 has been your supplier in the past and will be considered your base level. You create 3 dummy variables: \(X_1\) = 1 if supplier 2, 0 otherwise; \(X_2\) = 1 if supplier 3, 0 otherwise; and \(X_3\) = 1 if supplier 4, 0 otherwise. After taking random samples of size 5 from each supplier, you find the following result:

ANOVA

             df    SS          MS          F           Significance F
Regression    3    63.2855     21.09517    3.461629    0.041366
Residual     16    97.504      6.094
Total        19    160.7895

            Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept   19.52          1.103993         17.68128    6.34E-12    17.17964    21.86036
x1          4.74           1.561282         3.035968    0.007866    1.430231    8.049769
x2          3.32           1.561282         2.126458    0.049376    0.010231    6.629769
x3          1.64           1.561282         1.050419    0.309133    -1.66977    4.949769

\[ \text{Predicted tensile strength} = 19.52 + 4.74\,X_1 + 3.32\,X_2 + 1.64\,X_3 \]

a. What is the sample average for supplier 2?

Supplier 2 has \(X_1\) = 1 and the other dummy variables equal to 0, so its sample average tensile strength is 19.52 + 4.74 = 24.26. In the same way, the sample average tensile strength for supplier 4 would be 19.52 + 1.64 = 21.16.
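As a quick check, the sample average for each supplier can be recovered from the fitted coefficients. The sketch below uses the coefficients from the regression output; the names `INTERCEPT`, `SLOPES`, and `sample_mean` are illustrative only.

```python
INTERCEPT = 19.52  # sample average for the base level (supplier 1)
# Coefficient of the dummy variable for each non-base supplier.
SLOPES = {2: 4.74, 3: 3.32, 4: 1.64}


def sample_mean(supplier):
    """Intercept plus that supplier's dummy coefficient (0 for the base supplier)."""
    return round(INTERCEPT + SLOPES.get(supplier, 0.0), 2)


for supplier in (1, 2, 3, 4):
    print(f"supplier {supplier}: {sample_mean(supplier)}")
```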

b. What happens to the average tensile strength in the sample when you go from your base supplier to supplier 3?

When you go from your base supplier to supplier 3, the sample average tensile strength improves by 3.32.

c. Using part b to infer about the differences in population mean tensile strength, what would be the largest error you would expect?

The sample slope of 3.32 has a standard error of 1.561282. This is multiple regression with k = c-1 = 3 independent variables, so the degrees of freedom of the t are n-k-1 = 20-3-1 = 16. The margin of error is then

\[ t_{16} \cdot S_{b_2} = 2.1199 \times 1.561282 = 3.309761712 \]

Conclusion: With 95% confidence, when you change from the original supplier to supplier 3, the average tensile strength will improve by 3.32 with a margin of error of 3.31. (Excel gives the range in the table above.)
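The margin-of-error arithmetic can be reproduced with a short sketch. The critical value 2.1199 is hard-coded from the text rather than looked up from a t distribution, and the function names are illustrative assumptions.

```python
T_CRIT_16 = 2.1199  # t value for 95% confidence with 16 degrees of freedom, as quoted in the text


def residual_df(n, k):
    """Degrees of freedom for the t: n - k - 1, with k independent variables."""
    return n - k - 1


def confidence_interval(coef, std_err, t_crit=T_CRIT_16):
    """Return (lower, upper, margin) for a slope estimate."""
    margin = t_crit * std_err
    return coef - margin, coef + margin, margin


df = residual_df(n=20, k=3)
lower, upper, margin = confidence_interval(3.32, 1.561282)
print(df, round(margin, 2), round(lower, 4), round(upper, 4))
```

The resulting limits should match the Lower 95% and Upper 95% entries that Excel reports for x2, up to rounding.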

What would be the interpretation of the confidence interval for \(\beta_1\)?

d. Can you conclude that the average tensile strength of supplier 1 differs from that of supplier 4? This is a two-sided t-test, with \(b_3\) = 1.64, \(S_{b_3}\) = 1.561282 and n-k-1 = 16. Try this between now and next class.

e. Can you conclude that the average tensile strength differs among the four suppliers? In this case the means would be equal if all the differences were zero. Multiple slopes are tested using an F test.

\(H_0\): \(\beta_1 = \beta_2 = \beta_3 = 0\) (equal means)

\(H_1\): at least one \(\beta_i\) is not zero (at least two means differ)

T.S. F = MSR/MSE = 21.09517/6.094 = 3.46

R.R. Reject \(H_0\) if F > \(F_{(3,16)}\) = 3.24

Conclusion: We can say that the mean tensile strength differs for at least two suppliers.
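The F test can be retraced from the sums of squares in the ANOVA table. In this minimal sketch the critical value 3.24 is taken from the rejection region stated above, not computed, and the variable names are illustrative.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
SS_REGRESSION, DF_REGRESSION = 63.2855, 3
SS_RESIDUAL, DF_RESIDUAL = 97.504, 16
F_CRITICAL = 3.24  # F(3, 16) critical value, as stated in the rejection region

msr = SS_REGRESSION / DF_REGRESSION  # mean square for regression
mse = SS_RESIDUAL / DF_RESIDUAL      # mean square error
f_stat = msr / mse
reject_h0 = f_stat > F_CRITICAL

print(round(msr, 5), round(mse, 3), round(f_stat, 2), reject_h0)
```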

Compare this with the F test of suppliers in One-way Analysis of Variance.

IV. If there are other variables in the model, you will have to add “holding all other variables constant” when interpreting a single coefficient. If you believe there are interactions between a qualitative variable and another independent variable, you would add product terms to the model.
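To illustrate what adding product terms means, the sketch below builds one row of regression inputs from the dummy variables and a second independent variable. The function `design_row` and the variable `z` are hypothetical and are not part of the handout's example.

```python
def design_row(dummies, z):
    """Build [1, X1..Xk, Z, X1*Z..Xk*Z] for one observation.

    `dummies` holds the 0/1 dummy variables; `z` is a hypothetical second
    independent variable. The product terms let the slope on Z differ by level.
    """
    return [1] + list(dummies) + [z] + [x * z for x in dummies]


# An observation at the second non-base level with z = 2.5:
print(design_row([0, 1, 0], 2.5))
```

For an observation at the base level every product term is 0, so the coefficient of Z alone is the base-level slope; each product-term coefficient is the change in that slope for one of the other levels.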
