Regression Analysis: How to DO It - Kellogg School of Management

Regression Analysis
• The Motorpool Example
• Looking just at two-dimensional shadows, we
don’t see the true effects of the variables.
• We need a way to look at all the dimensions of
a relationship at the same time.
The Regression Model
Costs = α + β₁·Mileage + β₂·Age + β₃·Make + ε

• Costs: the dependent variable
• Mileage, Age, Make: the explanatory (or independent) variables
• α, β₁, β₂, β₃: the coefficients
• ε: the residual term
• The right-hand side has a linear mathematical structure.
What we’re talking about is sometimes explicitly called “linear regression
analysis,” since it assumes that the underlying relationship is linear (i.e., a
straight line in two dimensions, a plane in three, and so on)!
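For concreteness, here is a minimal sketch of fitting such a model by ordinary least squares in Python. The data values are hypothetical stand-ins for the Motorpool sample (which is not reproduced here); only the model structure comes from the slides.

```python
# Fit Costs = alpha + b1*Mileage + b2*Age + b3*Make + residual by least squares.
import numpy as np

mileage = np.array([15.0, 22.0, 8.0, 30.0, 12.0])    # thousands of miles (hypothetical)
age     = np.array([0.0, 2.0, 1.0, 3.0, 1.0])        # years (hypothetical)
make    = np.array([1.0, 0.0, 1.0, 0.0, 1.0])        # 1 = Honda, 0 = Ford
costs   = np.array([600.0, 900.0, 400.0, 1200.0, 550.0])  # annual $ (hypothetical)

X = np.column_stack([np.ones_like(mileage), mileage, age, make])  # leading 1s -> alpha
coefs, *_ = np.linalg.lstsq(X, costs, rcond=None)
alpha, b1, b2, b3 = coefs
print(f"Costs_pred = {alpha:.2f} + {b1:.2f}*Mileage + {b2:.2f}*Age + {b3:.2f}*Make")
```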
Why Spend All This Time on such a
Limited Tool?
• Some interesting relationships are linear.
• All relationships are locally linear!
• Several of the most commonly encountered
nonlinear relationships in management can be
translated into linear relationships, studied
using regression analysis, and the results then
untranslated back to the original problem!
(This is part of what we’ll learn in Sessions 3
and 4.)
A Few Final Assumptions Concerning ε
• The validity of regression analysis depends on several assumptions concerning the
residual term.
– E[ε] = 0. This is purely a cosmetic assumption. The estimate of α will include any on-average
residual effects which are different from zero.
– ε varies normally across the population. While a substantive assumption, this is typically true,
due to the Central Limit Theorem, since the residual term is the total of a myriad of other,
unidentified explanatory variables. If this assumption is not correct, all statements regarding
confidence intervals for individual predictions might be invalid.
• The following additional assumptions will be discussed later in the course.
– StdDev[ε] does not vary with the values of the explanatory variables. (This is called the
homoskedasticity assumption.) Again, if this assumption is not correct, all statements
regarding confidence intervals for individual predictions might be invalid.
– ε is uncorrelated with the explanatory variables of the model. The regression analysis will
“attribute” as much of the variation in the dependent variable as it can to the explanatory
variables. If some unidentified factor covaries with one of the explanatory variables, the
estimate of that explanatory variable’s coefficient (i.e., the estimate of its effect in the
relationship) will suffer from “specification bias,” since the explanatory variable will have both
its own effect, and some of the effect of the unidentified variable, attributed to it. This is why,
when doing a regression for the purpose of estimating the effect of some explanatory variable
on the dependent variable, we try to work with the most “complete” model possible.
1. Predictions
• Given an individual, and some information
about that individual, predict what the
dependent variable will be.
– What annual maintenance and repair cost (Costs)
would you predict for a new (Age = 0) Honda
(Make = 1) driven 15,000 miles (Mileage = 15)?
• Regress the dependent variable onto the given
variables to get the “prediction equation”.
Then make the prediction.
Predictions
The prediction equation:
Costspred = 107.34 + 29.65 · Mileage + 73.96 · Age + 47.43 · Make ( + 0 )
1.1 A Prediction for an Individual
Costspred = 107.34 + 29.65 · 15 + 73.96 · 0 + 47.43 · 1 = $599.49
The margin of error in the prediction (at the 95%-confidence level) is
2.2010 · $55.75 = $122.70,
and so a 95%-confidence interval for the prediction is $599.49 ± $122.70.
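The arithmetic can be sketched directly; with the rounded coefficients quoted above, the printed prediction differs from the slides' $599.49 by a few cents.

```python
# Individual prediction and 95% interval from the prediction equation above.
coef = {"const": 107.34, "Mileage": 29.65, "Age": 73.96, "Make": 47.43}
pred = coef["const"] + coef["Mileage"] * 15 + coef["Age"] * 0 + coef["Make"] * 1
margin = 2.2010 * 55.75    # t-statistic (11 residual df) * std error of prediction
print(f"${pred:.2f} ± ${margin:.2f}")   # about $599.52 ± $122.71
```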
1.2 Prediction: The Estimated Mean
for a Subgroup of Similar Individuals
• Estimate the mean annual costs for new Hondas driven 15,000 miles.
– The estimate for the group is what we’d predict for any one member
of the group. The margin of error in the estimate is computed using
the standard error of the estimated mean.
$599.49 ± 2.2010 · $26.67
$599.49 ± $58.69
1.3 Sources of Error
Y = α + β₁X₁ + … + βₖXₖ + ε
Ypred = a + b₁X₁ + … + bₖXₖ + 0

(standard error of the prediction)² = (standard error of the estimated mean)² + (standard error of the regression, StdDev(ε))²
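A quick numerical check of this decomposition, using the Motorpool standard errors quoted earlier ($55.75 for the prediction, $26.67 for the estimated mean); the implied StdDev(ε) estimate below is a back-calculation, not a number from the slides.

```python
# (std error of prediction)^2 = (std error of estimated mean)^2 + StdDev(eps)^2
import math

se_pred, se_mean = 55.75, 26.67
se_reg = math.sqrt(se_pred**2 - se_mean**2)   # implied std error of the regression
print(f"{se_reg:.2f}")                        # about 48.96
assert math.isclose(se_pred, math.sqrt(se_mean**2 + se_reg**2))
```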
The Standard Error of the Regression
• Using the prediction equation, we
predict for each sample
observation.
• The difference between the
prediction and the actual value of
the dependent variable (i.e., the
error) is an estimate of that
individual’s residual.
• StdDev() is estimated from these.
Indeed, the regression “process” simply
finds the coefficient estimates which
minimize the standard error of the
regression (or equivalently, which minimize
the sum of the squared residuals)!
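A minimal sketch of that estimate, assuming the usual degrees-of-freedom correction (n − k − 1, with k explanatory variables); the actual and predicted values below are hypothetical, for illustration only.

```python
# Estimate StdDev(eps) from the sample residuals: sqrt(SSE / (n - k - 1)).
import numpy as np

def standard_error_of_regression(y, y_pred, k):
    """Estimate StdDev(eps) from actual vs. predicted values of the dependent variable."""
    residuals = np.asarray(y) - np.asarray(y_pred)
    n = len(residuals)
    return float(np.sqrt(np.sum(residuals**2) / (n - k - 1)))

y      = [600.0, 900.0, 400.0, 1200.0, 550.0, 700.0]   # actual Costs (hypothetical)
y_pred = [580.0, 940.0, 430.0, 1150.0, 560.0, 720.0]   # predictions (hypothetical)
print(standard_error_of_regression(y, y_pred, k=3))
```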
A Brief Digression
• What annual maintenance and repair cost (Costs) would you
predict for a Honda (Make=1) driven 15,000 miles (Mileage =
15)?
– Regress Costs onto just Mileage and Make.
Costspred = $678.00 .
A Brief Digression (continued)
• The prediction made using the reduced model is precisely
what we would get if we predicted Age from Mileage and
Make, and then Costs from all 3!
The reason we don’t take this latter
approach is that the standard error of
the prediction here is based on the
assumption that the age of the car is
precisely 1.061546 years, instead of
actually being unknown.
Still, it’s reassuring to see that the
numbers all fit together.
2. Estimating an Effect
• An additional thousand miles of driving in the
course of a year adds, on average, how much
to the year’s maintenance and repair costs?
– It is ESSENTIAL to note that the additional driving
changes neither the car’s Age, nor its Make. In
order to hold them constant while varying
Mileage, we need to work with a model including
ALL of the explanatory variables.
Estimating an Effect
The coefficient of Mileage in the most-complete model is our estimate of
the impact of a one-unit (1,000 mile) change in Mileage. That coefficient is
$29.65 per thousand miles. (That is, 29.65 units of the dependent variable per
unit of the explanatory variable.)
The predictions below examine the impact of an additional thousand miles
of driving for a two-year-old Ford and two two-year-old Hondas. Each
difference in predictions is $29.65 greater for the car driven an additional
1,000 miles.
Estimating an Effect
Each coefficient is an estimate
of the “true” coefficient, and is
subject to sampling error.
One standard-deviation’s-worth
of uncertainty in the estimate is
given by the standard error of
the coefficient.
For example, a 95%-confidence interval for the coefficient of Mileage in
the full model is
29.65 ± 2.2010 · 3.92
29.65 ± 8.62
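In code, using the numbers above (the ±8.62 comes from the unrounded standard error, 3.915):

```python
# 95%-confidence interval for the Mileage coefficient.
coef, se_coef, t_stat = 29.65, 3.915, 2.2010
margin = t_stat * se_coef
print(f"{coef} ± {margin:.2f}")   # 29.65 ± 8.62
```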
3. The Explanatory Power of the Model
• Why do maintenance and repair costs vary
from car to car across the current fleet?
– A partial answer is, “Because Mileage, Age, and
Make vary from car to car across the fleet.”
• Indeed, variations in those three variables can
potentially explain 80.78% of the overall variability in
Costs across the fleet!
• This is the adjusted coefficient of determination for our
model.
The Explanatory Power of the Model
• Names can vary: The {adjusted, corrected, unbiased}
{coefficient of determination, r-squared} all refer to the
same thing.
– Without an adjective, the {coefficient of determination, r-squared} refers to a number slightly larger than the “correct”
number, and is a throwback to pre-computer days.
• When a new variable is added to a model, which actually
contributes nothing to the model (i.e., its true coefficient is
0), the adjusted coefficient of determination will, on
average, remain unchanged.
– Depending on chance, it might go up or down a bit.
– If negative, interpret it as 0%.
– The thing without the adjective will always go up. That’s
obviously not quite “right.”
The Explanatory Power of the Model
• Subtracting the adjusted coefficient of determination from 100%
yields the fraction of the population-wide variation in the
dependent variable which must be explained by terms still
lumped together in the residual.
– If your goal is to explain everything, you want the adjusted coefficient of
determination to be large.
– If your goal is to explain something, a very small value might be perfectly
acceptable.
4. The Relative Explanatory Importance of the
Explanatory Variables: The Beta-Weights
• What explains why maintenance
and repair costs vary from car to
car across the current fleet?
– (This is the same question as
before, but now we seek a more
detailed answer.)
– Compare the absolute values of
the beta-weights.
Variations in Mileage across the population are roughly twice as important as are
variations in Age (1.1531 vs. 0.5597), in helping to explain why Costs vary across the
population.
In turn, the fact that the cars vary in Age is more than twice as important as is the fact
that some are Fords, and others Hondas (0.5597 vs. 0.2193), in helping to explain why
Costs vary.
The Beta-Weights
• You can’t compare regression coefficients directly,
since they may carry different dimensions.
• The beta-weights are dimensionless, and combine how
much each explanatory variable varies, with how much
that variability leads to variability in the dependent
variable.
– Specifically, they are the product of each explanatory
variable’s standard deviation (how much it varies) and its
coefficient (how much its variation affects the dependent
variable), divided by the standard deviation of the
dependent variable (just to remove all dimensionality).
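A sketch of the beta-weight formula just described. The sample data are hypothetical placeholders, so this will not reproduce the 1.1531 quoted for Mileage; only the formula itself comes from the slides.

```python
# beta-weight = coefficient * StdDev(x) / StdDev(y), a dimensionless measure.
import numpy as np

def beta_weight(coefficient, x, y):
    """How many std deviations of y correspond to one std deviation of x."""
    return coefficient * np.std(x, ddof=1) / np.std(y, ddof=1)

mileage = np.array([15.0, 22.0, 8.0, 30.0, 12.0])        # hypothetical sample
costs   = np.array([600.0, 900.0, 400.0, 1200.0, 550.0])  # hypothetical sample
print(beta_weight(29.65, mileage, costs))
```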
5. The Significance Levels of the
t-Ratios (the p-values)
• How strong is the evidence that Mileage does
play a role in the relationship involving all three
explanatory variables?
– “Strength of evidence” evokes memories of
hypothesis testing!
– If we wish to conclude that the evidence supports
the inclusion of Mileage in our model, we must
take the opposite as our null hypothesis:
• Mileage would not belong if it had no effect on Costs,
i.e., if its true coefficient were 0.
The Significance Levels of the
t-Ratios (the p-values)
• Null hypothesis: “The true coefficient of
Mileage is 0.”
– Our estimate is 29.65.
– One standard-deviation’s-worth of
uncertainty in the estimate is 3.915.
– Our estimate is 7.5726 standard
deviations away from the hypothesized
true value.
– If the truth really were 0, we’d see
something this far away (or further) only
0.0011% of the time.
– The data is an overwhelmingly strong
contradiction to the null hypothesis, and
therefore …
The evidence is overwhelmingly strong in support of the statement that
the true coefficient of Mileage differs from 0, and Mileage does belong in
our model.
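That 0.0011% can be recovered from the t-ratio with a two-tailed t-test. A sketch, assuming the Motorpool regression's 11 residual degrees of freedom (which is what makes 2.2010 the 95% t-statistic):

```python
# Two-sided significance level of the Mileage t-ratio.
from scipy.stats import t

t_ratio, df = 7.5726, 11
p_two_sided = 2 * t.sf(abs(t_ratio), df)
print(f"{p_two_sided:.4%}")   # about 0.0011%, matching the slide
```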
Does Make Belong in our Model?
• Null hypothesis: “The true
coefficient of Make is 0.”
– Our estimate is 47.43.
– One standard-deviation’s-worth of
uncertainty in the estimate is 28.98.
– Our estimate is 1.6366 standard
deviations away from the
hypothesized true value.
– If the truth really were 0, we’d see
something this far away (or further)
only 12.9983% of the time.
– The data is a bit of a contradiction to
the null hypothesis, and therefore …
There’s only a bit of evidence in support of the statement that the true
coefficient of Make differs from 0, and that Make does belong in our model.
So, what should we do? Leave Make in, or take it out?
Does Make Belong in our Model?
• It depends: Remember, the belief decision must stand on three legs.
– If the Fords and Hondas came from a joint production facility …
• I’d lean towards leaving it out.
– If the Fords came from Detroit, and the Hondas from Kyoto …
• I’d lean towards leaving it in.
– More data might clarify the situation … the standard error of the coefficient would drop.
• If the coefficient stayed around 47, the significance level would get closer to zero,
building stronger evidence for including the variable.
• If the coefficient shrank towards 0, there would continue to be no real evidence
supporting Make’s inclusion, and even if it did belong, its estimated effect would be
small.
The Significance Levels of the t-Ratios
• Imagine that you have a model.
– You introduce a new variable into that model.
– The adjusted coefficient of determination increases.
• Does this mean that the new variable belongs in
your model?
– Not necessarily! Adding garbage to your model will
increase the adjusted coefficient of determination a
little bit around half of the time.
– The significance level (of the new variable) tells you if
the adjusted coefficient of determination went up by
enough to support keeping the new variable.
Summary

1. Predictions
What annual maintenance and repair cost (Costs) would you predict for a new (Age = 0) Honda (Make = 1) driven 15,000 miles (Mileage = 15)?
– Regress the dependent variable onto all known (for this individual) explanatory variables.
– Look at (prediction) ± (~2)·(standard error of prediction).
Estimate the mean annual costs for new Hondas (note the plural!) driven 15,000 miles.
– Regress the dependent variable onto all known explanatory variables.
– Look at (prediction) ± (~2)·(standard error of estimated mean).

2. Estimating an Effect
An additional thousand miles of driving in the course of a year adds, on average, how much to the year’s maintenance and repair costs?
– Regress the dependent variable onto all explanatory variables (use the most complete model).
– Look at (estimated coefficient) ± (~2)·(standard error of coefficient).

3. The Explanatory Power of the Model
Why do maintenance and repair costs vary from car to car across the current fleet?
– Look at the adjusted coefficient of determination to see how much of the variation in the dependent variable can be jointly explained by variations in the included explanatory variables.

4. The Relative Explanatory Importance of the Explanatory Variables
Variation in which explanatory variable is most important in explaining why maintenance and repair costs vary from car to car across the current fleet?
– Compare the absolute values of the beta-weights.

5. The Significance Levels of the t-Ratios
How strong is the evidence that Mileage does play a role in the relationship involving all three explanatory variables?
– The smaller the significance level, the stronger the evidence that this variable has a non-zero coefficient in this model.
Regression Analysis:
How to DO It
Example: The “car discount” dataset
Discounts on Car Purchases
• Of course, no one pays list price for a new car. Realizing
this, the owner of a new-car dealership has decided to
conduct a study, to attempt to better understand the
relationship between customer characteristics and
customer success in negotiating a discount from his
salespeople.
• He collects data on a sample of 100 purchasers of mid-size
cars (he has already sold several thousand of these cars):
– Specifically, he notes the age, annual income, and sex (men
were represented by 0, and women by 1, in the coding of sex) of
each purchaser (obtained from credit records), together with
the discount from list price which the purchaser finally received.
Discounts on Car Purchases
Discount    Age    Income    Sex
    1003     28     47658      1
    1394     41     32126      1
    2542     21     28374      1
    1658     47     29321      0
    1374     29     38016      1
    1536     43     25343      0
    1402     54     30310      0
     692     35     45709      0
     947     41     46242      0
    1415     19     27933      1
       …      …         …      …

Discount ($) negotiated on the purchase of a car; age of purchaser (years), annual income ($), and sex (M/F = 0/1).
Discounts on Car Purchases
• Why mid-size cars only?
– To avoid needing to include model/price of car
• Other possible explanatory variables?
– About purchaser
• Negotiation training
• Preparatory research
• Significant other
– About salesperson
• Identity
• Biases
Look at the Univariate Statistics
• This will give you a sense of how each variable
varies individually
– Estimate of population mean (or proportion)
– Standard deviation and extremes
– 95%-confidence interval for population mean
(or proportion)
• Estimate ± (~2)·(standard error of the mean)
• Estimate ± “margin of error” (at 95%-confidence level)
Univariate statistics

                               Discount         Age       Income         Sex
mean                            1268.24        37.1     35705.17        0.46
standard deviation           538.665375  9.91122209   10273.7291  0.50090827
standard error of the mean   53.8665375  0.99112221   1027.37291  0.05009083
minimum                             130          19        19119           0
median                           1310.5          37      34401.5           0
maximum                            2542          58        64648           1
range                              2412          39        45529           1
skewness                         -0.018       0.154        0.452       0.163
kurtosis                         -0.710      -0.633       -0.270      -2.014

number of observations: 100
t-statistic for computing 95%-confidence intervals: 1.9842
For example, $1,268.24 ± 1.9842·$53.87, or 46% ± 1.9842·5.01% .
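The interval arithmetic can be sketched directly; for the mean Discount, for example:

```python
# 95%-confidence interval for the population mean Discount, from the
# univariate statistics above: estimate ± t * (standard error of the mean).
mean, se_mean, t_stat = 1268.24, 53.8665375, 1.9842   # n = 100 -> 99 df
margin = t_stat * se_mean
print(f"${mean:,.2f} ± ${margin:.2f}")                # $1,268.24 ± $106.88
```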
The Full Regression
• The “most-complete” model provides …
– The best predictive model (pretty much)
– The most accurate estimate of the “pure effect” of
each explanatory variable on the dependent variable
• Specifically, the difference in the dependent variable
typically associated with one unit of difference in one
explanatory variable when the others are held constant.
Regression: Discount

                       constant         Age        Income         Sex
coefficient          1971.72565  9.48991379     -0.035313  446.294355
std error of coef    146.147064   3.6320188    0.00366827  64.5567912
t-ratio                 13.4914      2.6128       -9.6266      6.9132
significance            0.0000%     1.0423%       0.0000%     0.0000%
beta-weight                          0.1746       -0.6735      0.4150

standard error of regression: 301.19175
coefficient of determination: 69.68%
adjusted coef of determination: 68.74%
number of observations: 100
residual degrees of freedom: 96
t-statistic for computing 95%-confidence intervals: 1.9850
The Adjusted Coefficient of
Determination in the Full Model
• How much of the “story” (how much of the
overall variation in the dependent variable) is
potentially explained by the fact that the
explanatory variables themselves vary across
the population?
• r² = 1 – Var(ε) / Var(Y) (roughly) = 68.74%
– How can it be increased?
• By including new relevant variables
• Including a new “garbage” variable will leave it, on
average, unchanged
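A sketch of the relationship between the two numbers above, assuming the standard adjustment formula (not given explicitly on the slides):

```python
# adjusted r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1), with k explanatory variables.
r2, n, k = 0.6968, 100, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"{adj_r2:.2%}")   # 68.73%, matching the 68.74% output up to rounding
```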
The Coefficients
• The coefficient of an explanatory variable in the
most-complete model …
– Is an estimate of the average difference in the
dependent variable for two distinct individuals who
differ (by one unit) only in that explanatory variable.
– Is an estimate of the average difference we’d expect to
see in a specific individual if one aspect alone were
slightly different (and all other aspects were the same.)
• coefficient ± (~2)·(standard error of coefficient)
Regression: Discount (the same full-model output shown earlier)
$9.49 ± 1.9850·$3.63 per year of Age (with same Income and Sex), or
-$0.0353 ± 1.9850·$0.0037 per dollar of Income (with same Age and Sex), or
$446.29 ± 1.9850·$64.56 more for a woman (1) than for a man (0) (with
same Age and Income)
Predictions, using most-recent regression

coefficients: constant 1971.7256, Age 9.4899138, Income -0.035313, Sex 446.29435

values for prediction (Age, Income, Sex)   (30, 35000, 1)  (31, 35000, 1)  (30, 36000, 1)  (30, 35000, 0)
predicted value of Discount                      1466.762        1476.252        1431.449        1020.468
standard error of prediction                     305.5644        305.2995        305.8382        305.2382
standard error of regression                     301.1917        301.1917        301.1917        301.1917
standard error of estimated mean                 51.50843        49.91316        53.10864        49.53689

confidence level: 95.00%   t-statistic: 1.9850   residual degrees of freedom: 96

confidence limits for prediction
  lower                                          860.2218        870.2374        824.3652        414.5748
  upper                                          2073.303        2082.267        2038.533        1626.361
confidence limits for estimated mean
  lower                                          1364.519        1377.175        1326.029        922.1379
  upper                                          1569.006        1575.329        1536.869        1118.798
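As a check, each predicted value in the table is just the prediction equation evaluated at the given values; for the first column (a 30-year-old woman earning $35,000):

```python
# Sketch: reproducing the first predicted Discount from the coefficients above.
coefs  = [1971.7256, 9.4899138, -0.035313, 446.29435]   # constant, Age, Income, Sex
values = [1.0, 30.0, 35000.0, 1.0]                      # the 1.0 multiplies the constant
pred = sum(c * v for c, v in zip(coefs, values))
print(f"{pred:.3f}")   # 1466.762
```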
Tests involving Coefficients
• In the full model, how strongly does the evidence
support saying, “the true coefficient of Sex is ≥ $200”?
• H0: coefficient of Sex ≤ $200, significance 0.01204% (overwhelmingly strong
evidence against H0, hence supporting the original statement)
estimate/prediction of unknown quantity: 446.294
measure of uncertainty: 64.557
sample size: 100
number of explanatory variables in regression (0 if dealing with a population mean): 3

significance level of data with respect to null hypothesis:
  null hypothesis “true value ≥ 200”: 100.00000%
  null hypothesis “true value = 200”: 0.02408%
  null hypothesis “true value ≤ 200”: 0.01204%
(from t-distribution with 96 degrees of freedom)
From Session-1’s “Hypothesis_Testing_Tool.xls”
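A minimal sketch of what that tool computes for the null hypothesis “true value ≤ 200”, assuming a one-sided t-test on the coefficient:

```python
# One-sided test of H0: true Sex coefficient <= $200, using the numbers above.
from scipy.stats import t

estimate, se, null_value, df = 446.294, 64.557, 200.0, 96
t_stat = (estimate - null_value) / se      # about 3.815 standard errors above 200
p_one_sided = t.sf(t_stat, df)             # upper-tail probability
print(f"{p_one_sided:.5%}")                # about 0.01204%, matching the tool
```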
Tests involving Coefficients
• Other statements?

Statement     Significance level of data              Strength of evidence
              (with respect to opposite statement)    supporting statement
Sex ≥ $200    0.01204%                                overwhelming
Sex ≥ $300    1.28444%                                very strong
Sex ≥ $350    6.95385%                                somewhat strong
Sex ≥ $400    23.75235%                               quite weak
From Session-1’s “Hypothesis_Testing_Tool.xls”
Predictions
• Based on ANY model, what would we predict
the dependent variable to be, if all we knew
about an individual were the given values for
the listed explanatory variables?
• Prediction ± (~2)·(standard error of the prediction)
• What would we expect to see, on average,
across a large pool of similar individuals?
• Prediction ± (~2)·(std. error of the estimated mean)
Prediction, using most-recent regression

                        constant        Age     Income        Sex
coefficients            1971.726   9.489914   -0.03531   446.2944
values for prediction                     30      35000          1

predicted value of Discount:       1466.762
standard error of prediction:      305.5644
standard error of regression:      301.1917
standard error of estimated mean:  51.50843
confidence level: 95.00%   t-statistic: 1.9850   residual degrees of freedom: 96

confidence limits for prediction:      lower 860.2218, upper 2073.303
confidence limits for estimated mean:  lower 1364.519, upper 1569.006
$1,466.76 ± 1.9850·$305.56, an individual prediction
for a 30-year-old woman earning $35,000/year
$1,466.76 ± 1.9850·$51.51, an estimate of the large-group mean
for 30-year-old women earning $35,000/year
Significance
• The significance level of the t-ratio (for each
variable separately)
• Sometimes called the “p-value” for that variable
– How strong is the evidence that, in a model already
containing all of the other explanatory variables, this
variable “belongs” (i.e., has a non-zero coefficient of
its own)?
– Equivalently, is this a variable whose value we’d like to
know when predicting for a specific individual?
• Close to zero = strong evidence it DOES belong
(our null hypothesis is that it doesn’t)
Regression: Discount (the same full-model output shown earlier)
Significance (continued)
• Null hypothesis: “In the current model, the
true coefficient of this variable is 0.”
– The coefficient of this variable is our estimate
– (coefficient) / (standard error of the coefficient)
tells us how many standard deviations away from
the hypothesized truth (0) the estimate is
• significance = Pr(we’d be this far away just by chance)
– Close to 0% = (recall coin-flipping story)
• highly contradictory to null hypothesis
• strongly supportive of alternative (it DOES belong)
Significance (continued)
• The significance level deals with the marginal
contribution of a variable to the current model.
• Adding an irrelevant explanatory variable to a
regression model will increase the adjusted
coefficient of determination about half the time.
The significance level tells us whether the adjusted
coefficient of determination went up by enough to argue
that the new variable is relevant.
The Beta-Weights
• Why is Discount varying from one sale to the
next?
– What’s the relative explanatory “power” of
(variation in) each of the explanatory variables (in
explaining the currently-observed variability in the
dependent variable across the population)?
– The comparative magnitudes of the beta-weights
(for all of the explanatory variables together in the
model) answer this question.
Regression: Discount (the same full-model output shown earlier)
• Why does discount vary across the population?
• Primarily, because Income varies.
• Secondarily, because some purchasers are men and
others are women (i.e., Sex varies).
The Beta-Weights (continued)
• Each answers the question:
– If two individuals have the same values for all the
explanatory variables in the model except one, and for
this one their values differ by one standard-deviation’s-worth
of variability (in this variable), then their predicted values
for the dependent variable would differ by how many standard
deviations (of variability in the dependent variable)?
• “Typical” variation in each of the explanatory variables alone
can explain (relatively) how much of the observed variability
in the dependent variable?
We Can Explore Other Models
• We can drop variables
– Are older or younger purchasers currently getting
larger discounts?
• We can change the dependent variable
– Are the female purchasers, on average, older or
younger than the male purchasers?
– What’s the impact of aging on purchaser income?
Are the female purchasers, on average, older or younger than the male
purchasers?
Regression: Age

                       constant          Sex
coefficient          38.9074074   -3.9291465
std error of coef    1.32861376   1.95893412
t-ratio                 29.2842      -2.0058
significance            0.0000%      4.7639%
beta-weight                          -0.1986

standard error of regression: 9.76327735
coefficient of determination: 3.94%
adjusted coef of determination: 2.96%
number of observations: 100
residual degrees of freedom: 98
t-statistic for computing 95%-confidence intervals: 1.9845
Male purchasers are, on average, 38.91 years old.
Female purchasers are, on average, 3.93 years younger than the men.
If the “pure” effect of an additional year of age is to increase a purchaser’s
discount, then what explains the negative coefficient of Age below?
Regression: Discount

                       constant          Age
coefficient          1817.16511   -14.795825
std error of coef    202.794627   5.28272248
t-ratio                  8.9606      -2.8008
significance            0.0000%      0.6142%
beta-weight                          -0.2722

standard error of regression: 520.957868
coefficient of determination: 7.41%
adjusted coef of determination: 6.47%
number of observations: 100
residual degrees of freedom: 98
t-statistic for computing 95%-confidence intervals: 1.9845
• An older patron is likely to have a higher income (which typically is
associated with a smaller discount)
• An older patron is more likely to be male (which typically is associated
with a smaller discount)
A Reconciliation across Models
• On these next three slides, we’ll focus on the “older people have
higher incomes” effect:
• As a patron ages by a year (and his/her sex stays unchanged!),
his/her discount typically drops by $8.47.
Regression: Discount

                       constant          Age          Sex
coefficient          1292.48764   -8.4694585   630.367989
std error of coef    178.496906   4.34612096   85.9945283
t-ratio                  7.2410      -1.9487       7.3303
significance            0.0000%      5.4216%      0.0000%
beta-weight                          -0.1558       0.5862
As the patron ages by a year (and his/her sex stays unchanged!),
his/her income typically rises by $508.58.
Regression: Income

                       constant          Age          Sex
coefficient          19234.7835    508.57672   -5212.6301
std error of coef    3542.55077     86.25558   1706.69615
t-ratio                  5.4296       5.8962      -3.0542
significance            0.0000%      0.0000%      0.2913%
beta-weight                           0.4906      -0.2541
The combined age and income effects are precisely what we
originally estimated for an additional year of age, when income
was not held constant.
Regression: Discount (the same full-model output shown earlier)
impact of a year of Age (Income and Sex held constant):   9.48991379
additional Income per year of Age:                        508.57672
impact of that additional Income:      508.57672 · (-0.035313) ≈ -17.959372
net consequence of aging a year and
earning more as a result:              9.48991379 - 17.959372 ≈ -8.4694585
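This reconciliation is the standard omitted-variable identity: the reduced-model Age coefficient equals the full-model Age coefficient plus the full-model Income coefficient times the effect of Age on Income. A quick check:

```python
# Sketch: verifying the reconciliation with the coefficients quoted above.
b_age_full    = 9.48991379    # Age coefficient, full model
b_income_full = -0.035313     # Income coefficient, full model
age_on_income = 508.57672     # Age coefficient in the Income regression

b_age_reduced = b_age_full + b_income_full * age_on_income
print(f"{b_age_reduced:.7f}")   # -8.4694559, matching -8.4694585 up to rounding
```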
Conclusion
To the extent that Income covaries with Age, if Income is omitted
from our model, Age gets “blamed” for part of Income’s effect on
Discount.
Regression: Discount (the same Age-and-Sex model output shown earlier)
This yields the most accurate possible predictions based on Age
and Sex alone, but grossly misestimates the pure effect of Age.
And that is why we try to use the “most-complete” model to
estimate the pure effect of any variable on the dependent variable
… and why our next session will focus on building the model itself.