Regression Analysis:
How to DO It
Example: The “car discount” dataset
The slides marked with this symbol will be skipped
during our first discussion of this dataset. After we
cover “hypothesis testing,” we’ll return to them.
Discounts on Car Purchases
• Of course, no one pays list price for a new car. Realizing
this, the owner of a new-car dealership has decided to
conduct a study, to attempt to understand better the
relationship between customer characteristics, and
customer success in negotiating a discount from his
salespeople.
• He collects data on a sample of 100 purchasers of mid-size
cars (he has already sold several thousand of these cars):
– Specifically, he notes the age, annual income, and sex (men
were represented by 0, and women by 1, in the coding of sex) of
each purchaser (obtained from credit records), together with
the discount from list price which the purchaser finally received.
Discounts on Car Purchases
Discount   Age   Income   Sex
    1003    28    47658     1
    1394    41    32126     1
    2542    21    28374     1
    1658    47    29321     0
    1374    29    38016     1
    1536    43    25343     0
    1402    54    30310     0
     692    35    45709     0
     947    41    46242     0
    1415    19    27933     1
       …     …        …     …

Discount ($) negotiated on the purchase
of a car: age of purchaser (years),
annual income ($), and sex (M/F = 0/1).
Discounts on Car Purchases
• Why mid-size cars only?
– To avoid needing to include model/price of car
• Other possible explanatory variables?
– About purchaser
• Negotiation training
• Preparatory research
• Significant other
– About salesperson
• Identity
• Biases
Look at the Univariate Statistics
• This will give you a sense of how each variable
varies individually
– Estimate of population mean (or proportion)
– Standard deviation and extremes
– 95%-confidence interval for population mean
(or proportion)
• Estimate ± (~2)·(standard error of the mean)
• Estimate ± “margin of error” (at 95%-confidence level)
Univariate statistics
                              Discount      Age           Income        Sex
mean                          1268.24       37.1          35705.17      0.46
standard deviation            538.665375    9.91122209    10273.7291    0.50090827
standard error of the mean    53.8665375    0.99112221    1027.37291    0.05009083
minimum                       130           19            19119         0
median                        1310.5        37            34401.5       0
maximum                       2542          58            64648         1
range                         2412          39            45529         1
skewness                      -0.018        0.154         0.452         0.163
kurtosis                      -0.710        -0.633        -0.270        -2.014

number of observations        100
t-statistic for computing
95%-confidence intervals      1.9842
For example, $1,268.24 ± 1.9842·$53.87, or 46% ± 1.9842·5.01% .
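As a rough check on the arithmetic above, the standard error of the mean and the 95%-confidence interval for mean Discount can be recomputed from the summary figures in the table. This is only a sketch: the variable names are illustrative, and the t-statistic 1.9842 is copied from the table rather than derived.

```python
import math

# Summary figures for Discount, copied from the univariate table
n = 100
mean_discount = 1268.24
sd_discount = 538.665375
t_stat = 1.9842  # t-statistic for 95% confidence, as given in the table

# Standard error of the mean = standard deviation / sqrt(sample size)
se_mean = sd_discount / math.sqrt(n)

# Margin of error and the resulting 95%-confidence interval
margin = t_stat * se_mean
lower, upper = mean_discount - margin, mean_discount + margin

print(f"standard error of the mean: {se_mean:.4f}")
print(f"95% CI for mean Discount: {lower:.2f} to {upper:.2f}")
```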
The Full Regression
• The “most-complete” model provides …
– The best predictive model (pretty much)
– The most accurate estimate of the “pure effect” of
each explanatory variable on the dependent variable
• Specifically, the difference in the dependent variable
typically associated with one unit of difference in one
explanatory variable when the others are held constant.
Regression: Discount
                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression       301.19175
coefficient of determination       69.68%
adjusted coef of determination     68.74%
number of observations             100
residual degrees of freedom        96
t-statistic for computing
95%-confidence intervals           1.9850
The Adjusted Coefficient of
Determination in the Full Model
• How much of the “story” (how much of the
overall variation in the dependent variable) is
potentially explained by the fact that the
explanatory variables themselves vary across
the population?
• Adjusted r² = 1 – Var(residual) / Var(Y) (roughly) = 68.74%
– How can it be increased?
• By including new relevant variables
• Including a new “garbage” variable will leave it, on
average, unchanged
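The 68.74% figure can be roughly reproduced from two numbers that appear elsewhere in the deck: the standard error of regression (taken here as the residual standard deviation) and the standard deviation of Discount. The sketch below assumes that reading; the variable names are illustrative.

```python
# Figures copied from the regression output and the univariate table
se_regression = 301.19175   # standard error of regression (residual std. dev.)
sd_discount = 538.665375    # standard deviation of Discount
n, k = 100, 3               # observations, explanatory variables

# Adjusted coefficient of determination: 1 - Var(residual) / Var(Y)
adj_r2 = 1 - se_regression**2 / sd_discount**2

# The unadjusted version undoes the degrees-of-freedom correction
r2 = 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)

print(f"adjusted coef of determination: {adj_r2:.2%}")
print(f"coefficient of determination:   {r2:.2%}")
```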
The Coefficients
• The coefficient of an explanatory variable in the
most-complete model …
– Is an estimate of the average difference in the
dependent variable for two distinct individuals who
differ (by one unit) only in that explanatory variable.
– Is an estimate of the average difference we’d expect to
see in a specific individual if one aspect alone were
slightly different (and all other aspects were the same.)
• coefficient ± (~2)·(standard error of coefficient)
Regression: Discount
                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression       301.19175
coefficient of determination       69.68%
adjusted coef of determination     68.74%
number of observations             100
residual degrees of freedom        96
t-statistic for computing
95%-confidence intervals           1.9850
$9.49 ± 1.9850·$3.63 per year of Age (with same Income and Sex), or
-$0.0353 ± 1.9850·$0.0037 per dollar of Income (with same Age and Sex), or
$446.29 ± 1.9850·$64.56 more for a woman (1) than for a man (0) (with
same Age and Income)
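The three intervals above all follow the same "coefficient ± t·(standard error)" pattern, so they can be computed in one loop. A minimal sketch, with the figures copied from the regression output and illustrative names:

```python
t_stat = 1.9850  # t-statistic for 95% confidence with 96 residual degrees of freedom

# (coefficient, standard error) pairs copied from the full regression
coefs = {
    "Age": (9.48991379, 3.6320188),
    "Income": (-0.035313, 0.00366827),
    "Sex": (446.294355, 64.5567912),
}

intervals = {}
for name, (b, se) in coefs.items():
    margin = t_stat * se  # margin of error at the 95%-confidence level
    intervals[name] = (b - margin, b + margin)
    print(f"{name}: {b:.4f} ± {margin:.4f} -> ({b - margin:.4f}, {b + margin:.4f})")
```

None of the three intervals contains zero, which is consistent with the small significance levels reported for these variables.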
Predictions, using most-recent regression
                                   constant     Age          Income       Sex
coefficients                       1971.7256    9.4899138    -0.035313    446.29435

values for prediction
  Age                              30           31           30           30
  Income                           35000        35000        36000        35000
  Sex                              1            1            1            0

predicted value of Discount        1466.762     1476.252     1431.449     1020.468
standard error of prediction       305.5644     305.2995     305.8382     305.2382
standard error of regression       301.1917     301.1917     301.1917     301.1917
standard error of estimated mean   51.50843     49.91316     53.10864     49.53689

confidence limits for prediction
  lower                            860.2218     870.2374     824.3652     414.5748
  upper                            2073.303     2082.267     2038.533     1626.361
confidence limits for estimated mean
  lower                            1364.519     1377.175     1326.029     922.1379
  upper                            1569.006     1575.329     1536.869     1118.798

confidence level                   95.00%
t-statistic                        1.9850
residual degr. freedom             96
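Each predicted value in the table is the fitted linear equation evaluated at the given Age, Income, and Sex. The sketch below recomputes the four point predictions; the function name `predict_discount` is hypothetical, and small differences from the table come from the rounded Income coefficient.

```python
# Coefficients of the full model (constant, Age, Income, Sex), copied from the slide
constant, b_age, b_income, b_sex = 1971.72565, 9.48991379, -0.035313, 446.294355

def predict_discount(age, income, sex):
    """Point prediction of Discount from the fitted linear model."""
    return constant + b_age * age + b_income * income + b_sex * sex

# The four scenarios shown on the slide: (Age, Income, Sex)
scenarios = [(30, 35000, 1), (31, 35000, 1), (30, 36000, 1), (30, 35000, 0)]
predictions = [predict_discount(*s) for s in scenarios]

for s, p in zip(scenarios, predictions):
    print(s, round(p, 2))
```

Changing one input at a time, as the scenarios do, shows each coefficient directly: one more year of Age adds about $9.49, $1,000 more Income removes about $35.31, and switching Sex from 1 to 0 removes about $446.29.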
Tests involving Coefficients
• In the full model, how strongly does the evidence
support saying, “Sex≥$200”?
• H0: Sex≤$200, significance 0.01204% (overwhelmingly strong
evidence against H0, hence supporting original statement)
estimate/prediction of unknown quantity            446.294
measure of uncertainty                             64.557
sample size                                        100
number of explanatory variables in regression,
or 0 if dealing with a population mean             3

Null hypothesis          significance level of data with
                         respect to null hypothesis
true value ≥ 200         100.00000%
true value = 200         0.02408%
true value ≤ 200         0.01204%
(from t-distribution with 96 degrees of freedom)
From Session-2’s “Hypothesis_Testing_Tool.xls”
Tests involving Coefficients
• Other statements?
Statement    Significance level of data     Strength of evidence
             (with respect to opposite      supporting statement
             statement)
Sex≥$200     0.01204%                       overwhelming
Sex≥$300     1.28444%                       very strong
Sex≥$350     6.95385%                       somewhat strong
Sex≥$400     23.75235%                      quite weak
From Session-1’s “Hypothesis_Testing_Tool.xls”
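The significance levels in this table can be approximated without the spreadsheet tool. The sketch below is not the tool's own method: it integrates the Student-t density numerically (Simpson's rule) using only the standard library, and the function names `t_pdf` and `t_upper_tail` are hypothetical. For each statement "Sex ≥ c", the null hypothesis is "Sex ≤ c", and the significance level is the upper-tail probability of the t-ratio (estimate − c) / (standard error).

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 <= T <= |t|)
    upper = 0.5 - area            # P(T >= |t|)
    return upper if t >= 0 else 1.0 - upper

# Coefficient of Sex and its standard error, from the full regression
estimate, std_error, df = 446.294355, 64.5567912, 96

p_values = {}
for threshold in (200, 300, 350, 400):
    t_ratio = (estimate - threshold) / std_error
    p_values[threshold] = t_upper_tail(t_ratio, df)
    print(f"Sex >= ${threshold}: t = {t_ratio:.4f}, significance = {p_values[threshold]:.5%}")
```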
Predictions
• Based on ANY model, what would we predict
the dependent variable to be, if all we knew
about an individual were the given values for
the listed explanatory variables?
• Prediction ± (~2)·(standard error of the prediction)
• What would we expect to see, on average,
across a large pool of similar individuals?
• Prediction ± (~2)·(std. error of the estimated mean)
Prediction, using most-recent regression
                                   constant     Age          Income       Sex
coefficients                       1971.726     9.489914     -0.03531     446.2944
values for prediction                           30           35000        1

predicted value of Discount        1466.762
standard error of prediction       305.5644
standard error of regression       301.1917
standard error of estimated mean   51.50843

confidence level                   95.00%
t-statistic                        1.9850
residual degr. freedom             96

confidence limits for prediction
  lower                            860.2218
  upper                            2073.303
confidence limits for estimated mean
  lower                            1364.519
  upper                            1569.006
$1,466.76 ± 1.9850·$305.56, an individual prediction
for a 30-year-old woman earning $35,000/year
$1,466.76 ± 1.9850·$51.51, an estimate of the large-group mean
for 30-year-old women earning $35,000/year
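The two intervals differ only in which standard error is used. The sketch below recomputes both from the figures on the slide. It also uses the standard identity (not stated on the slide, so treat it as an assumption) that the squared standard error of prediction is the squared standard error of regression plus the squared standard error of the estimated mean; that identity reproduces the 305.5644 shown.

```python
import math

# Figures copied from the prediction output for a 30-year-old woman earning $35,000
prediction = 1466.762
se_regression = 301.1917
se_mean = 51.50843
t_stat = 1.9850

# Uncertainty about one individual = noise around the line + uncertainty in the line
se_prediction = math.sqrt(se_regression**2 + se_mean**2)

individual = (prediction - t_stat * se_prediction, prediction + t_stat * se_prediction)
group_mean = (prediction - t_stat * se_mean, prediction + t_stat * se_mean)

print(f"standard error of prediction: {se_prediction:.4f}")
print(f"individual prediction limits: {individual[0]:.2f} to {individual[1]:.2f}")
print(f"estimated-mean limits:        {group_mean[0]:.2f} to {group_mean[1]:.2f}")
```

The individual interval is far wider because it is dominated by the standard error of regression, which does not shrink as the sample grows.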
Significance
• The significance level of the t-ratio (for each
variable separately)
• Sometimes called the “p-value” for that variable
– How strong is the evidence that, in a model already
containing all of the other explanatory variables, this
variable “belongs” (i.e., has a non-zero coefficient of
its own)?
– Equivalently, is this a variable whose value we’d like to
know when predicting for a specific individual?
• Close to zero = strong evidence it DOES belong
(our null hypothesis is that it doesn’t)
Regression: Discount
                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression       301.19175
coefficient of determination       69.68%
adjusted coef of determination     68.74%
number of observations             100
residual degrees of freedom        96
t-statistic for computing
95%-confidence intervals           1.9850
Significance (continued)
• Null hypothesis: “In the current model, the
true coefficient of this variable is 0.”
– The coefficient of this variable is our estimate
– (coefficient) / (standard error of the coefficient)
tells us how many standard deviations away from
the hypothesized truth (0) the estimate is
• significance = Pr(we’d be this far away just by chance)
– Close to 0% = (recall coin-flipping story)
• highly contradictory to null hypothesis
• strongly supportive of alternative (it DOES belong)
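The t-ratio on each line of the regression output is just the coefficient divided by its standard error. A minimal sketch, using the figures from the full Discount regression: comparing each |t-ratio| with the 1.9850 shown in the output is a quick way to see whether the two-sided significance level falls below 5%.

```python
t_critical = 1.9850  # from the regression output (96 residual degrees of freedom)

# (coefficient, standard error) pairs copied from the full regression
coefs = {
    "constant": (1971.72565, 146.147064),
    "Age": (9.48991379, 3.6320188),
    "Income": (-0.035313, 0.00366827),
    "Sex": (446.294355, 64.5567912),
}

# Number of standard errors between each estimate and the hypothesized value 0
t_ratios = {name: b / se for name, (b, se) in coefs.items()}

for name, t in t_ratios.items():
    verdict = "below 5%" if abs(t) > t_critical else "above 5%"
    print(f"{name}: t-ratio = {t:.4f} (significance {verdict})")
```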
Significance (continued)
• The significance level deals with the marginal
contribution of a variable to the current model.
• Adding an irrelevant explanatory variable to a
regression model will increase the adjusted
coefficient of determination roughly a third of the time
(whenever the new variable’s t-ratio happens to exceed 1 in magnitude).
The significance level tells us if the coefficient of
determination went up by enough to argue that
the new variable is relevant.
The Beta-Weights
• Why is Discount varying from one sale to the
next?
– What’s the relative explanatory “power” of
(variation in) each of the explanatory variables (in
explaining the currently-observed variability in the
dependent variable across the population)?
– The comparative magnitudes of the beta-weights
(for all of the explanatory variables together in the
model) answer this question.
Regression: Discount
                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

standard error of regression       301.19175
coefficient of determination       69.68%
adjusted coef of determination     68.74%
number of observations             100
residual degrees of freedom        96
t-statistic for computing
95%-confidence intervals           1.9850
• Why does discount vary across the population?
• Primarily, because Income varies.
• Secondarily, because some purchasers are men and
others are women (i.e., Sex varies).
The Beta-Weights (continued)
• Each answers the question:
– If two individuals have the same values for all the
explanatory variables in the model except one, and for
this one their values differ by one standard-deviation’s-worth
of variability (in this variable), then
their predicted values for the dependent variable
would differ by how many standard deviations (of
variability in the dependent variable)?
• “Typical” variation in each of the explanatory variables alone
can explain (relatively) how much of the observed variability
in the dependent variable?
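The definition above translates into a one-line formula: beta-weight = coefficient × (standard deviation of the explanatory variable) / (standard deviation of the dependent variable). The sketch below applies it to the figures from the univariate table and the full regression; it is the standard formula rather than something the slides spell out, and it reproduces the reported beta-weights.

```python
sd_discount = 538.665375  # standard deviation of the dependent variable

# (coefficient, standard deviation of the explanatory variable)
inputs = {
    "Age": (9.48991379, 9.91122209),
    "Income": (-0.035313, 10273.7291),
    "Sex": (446.294355, 0.50090827),
}

# Rescale each coefficient into "standard deviations of Discount
# per standard deviation of the explanatory variable"
beta_weights = {name: b * sd_x / sd_discount for name, (b, sd_x) in inputs.items()}

# Rank by magnitude to see which variable's variation matters most
for name, beta in sorted(beta_weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {beta:.4f}")
```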
We Can Explore Other Models
• We can drop variables
– Are older or younger purchasers currently getting
larger discounts?
• We can change the dependent variable
– Are the female purchasers, on average, older or
younger than the male purchasers?
– What’s the impact of aging on purchaser income?
Are the female purchasers, on average, older or younger than the male
purchasers?
Regression: Age
                     constant      Sex
coefficient          38.9074074    -3.9291465
std error of coef    1.32861376    1.95893412
t-ratio              29.2842       -2.0058
significance         0.0000%       4.7639%
beta-weight                        -0.1986

standard error of regression       9.76327735
coefficient of determination       3.94%
adjusted coef of determination     2.96%
number of observations             100
residual degrees of freedom        98
t-statistic for computing
95%-confidence intervals           1.9845
Male purchasers are, on average, 38.91 years old.
Female purchasers are, on average, 3.93 years younger than the men.
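With a single 0/1 explanatory variable, the regression simply reports group means: the constant is the mean for the group coded 0, and the coefficient is the gap between the groups. A small sketch (illustrative names) that also checks the result against the overall mean Age of 37.1, using the 46% share of women from the univariate table:

```python
# Regression of Age on Sex (men = 0, women = 1), copied from the output
constant, b_sex = 38.9074074, -3.9291465
share_female = 0.46  # mean of Sex from the univariate table

mean_age_men = constant            # Sex = 0
mean_age_women = constant + b_sex  # Sex = 1

# Weighting the two group means by group share should recover the overall mean
overall_mean_age = (1 - share_female) * mean_age_men + share_female * mean_age_women

print(f"men:     {mean_age_men:.2f}")
print(f"women:   {mean_age_women:.2f}")
print(f"overall: {overall_mean_age:.2f}")
```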
If the “pure” effect of an additional year of age is to increase a purchaser’s
discount, then what explains the negative coefficient of Age below?
Regression: Discount
                     constant      Age
coefficient          1817.16511    -14.795825
std error of coef    202.794627    5.28272248
t-ratio              8.9606        -2.8008
significance         0.0000%       0.6142%
beta-weight                        -0.2722

standard error of regression       520.957868
coefficient of determination       7.41%
adjusted coef of determination     6.47%
number of observations             100
residual degrees of freedom        98
t-statistic for computing
95%-confidence intervals           1.9845
• An older patron is likely to have a higher income (which typically is
associated with a smaller discount)
• An older patron is more likely to be male (which typically is associated
with a smaller discount)
A Reconciliation across Models
• On these next three slides, we’ll focus on the “older people have
higher incomes” effect:
• As a patron ages by a year (and his/her sex stays unchanged!),
his/her discount typically drops by $8.47.
Regression: Discount
                     constant      Age           Sex
coefficient          1292.48764    -8.4694585    630.367989
std error of coef    178.496906    4.34612096    85.9945283
t-ratio              7.2410        -1.9487       7.3303
significance         0.0000%       5.4216%       0.0000%
beta-weight                        -0.1558       0.5862
As the patron ages by a year (and his/her sex stays unchanged!),
his/her income typically rises by $508.58.
Regression: Income
                     constant      Age           Sex
coefficient          19234.7835    508.57672     -5212.6301
std error of coef    3542.55077    86.25558      1706.69615
t-ratio              5.4296        5.8962        -3.0542
significance         0.0000%       0.0000%       0.2913%
beta-weight                        0.4906        -0.2541
The combined age and income effects are precisely what we
originally estimated for an additional year of age, when income
was not held constant.
Regression: Discount
                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150

  9.48991379    impact of Age
  508.57672     additional Income
  -17.959372    impact of additional Income
  -8.4694585    net consequence of aging a year and earning more as a result
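The reconciliation above is plain arithmetic on coefficients from two regressions: the full Discount model, and the regression of Income on Age and Sex. The sketch below reproduces it for Age and, as an extra check not shown on the slide, applies the same logic to Sex. Names are illustrative, and tiny discrepancies come from the rounded Income coefficient.

```python
# Full model: Discount on Age, Income, Sex
b_age_full, b_income_full, b_sex_full = 9.48991379, -0.035313, 446.294355

# Auxiliary model: Income on Age, Sex
income_per_year_of_age = 508.57672
income_gap_for_women = -5212.6301

# Direct effect of a year of Age, plus its indirect effect through Income
impact_via_income = income_per_year_of_age * b_income_full
net_age = b_age_full + impact_via_income

# Same decomposition for Sex (extra check, not on the slide)
net_sex = b_sex_full + income_gap_for_women * b_income_full

print(f"impact of additional Income: {impact_via_income:.4f}")
print(f"net Age coefficient when Income is omitted: {net_age:.4f}")
print(f"net Sex coefficient when Income is omitted: {net_sex:.4f}")
```

Both results match the coefficients of the Age-and-Sex regression of Discount (-8.4694585 and 630.367989), which is the sense in which the omitted variable's effect is "blamed" on the variables that remain.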
Conclusion
To the extent that Income covaries with Age, if Income is omitted
from our model, Age gets “blamed” for part of Income’s effect on
Discount.
Regression: Discount
                     constant      Age           Sex
coefficient          1292.48764    -8.4694585    630.367989
std error of coef    178.496906    4.34612096    85.9945283
t-ratio              7.2410        -1.9487       7.3303
significance         0.0000%       5.4216%       0.0000%
beta-weight                        -0.1558       0.5862
This yields the most accurate possible predictions based on Age
and Sex alone, but grossly misestimates the pure effect of Age.
And that is why we try to use the “most-complete” model to
estimate the pure effect of any variable on the dependent variable
… and why our next session will focus on building the model itself.
Summary: Questions a
Regression Study can Answer
Make an Individual Prediction
Predict a variable (with an unknown value) for an individual, given
some specific information about that individual.
• Regress the variable-to-be-predicted (the dependent variable)
onto the known variables (the independent or explanatory
variables), and make a prediction.
• The margin of error in the prediction is (~2)∙(the standard error
of the prediction).
Example: “Predict the discount from list price that a 30-year-old
woman who buys an intermediate-sized vehicle from the dealership
would receive.”
$1668.77 ± 1.9847∙$420.06
Estimate a Group Mean
Estimate the mean value of a variable, across a (large) group of
individuals who share certain specific characteristics.
• Regress the first variable onto the others. Then make a
prediction of the variable for one of the individuals (which will
be used as the estimate of the mean across this group of similar
individuals).
• The margin of error in the estimated mean is (~2)∙(the standard
error of the estimated mean).
Example: “Estimate the mean discount received by 30-year-old
women (plural!) who buy intermediate-sized vehicles from the
dealership.”
$1668.77 ± 1.9847∙$65.60
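The $1,668.77 used in these two summary examples is not the full-model prediction. The arithmetic suggests (this is an inference, not something the slides state) that it comes from the regression of Discount on Age and Sex alone, since Income is not given in the question. A sketch under that assumption, with the two margins of error side by side:

```python
# Discount regressed on Age and Sex only (coefficients from that output)
constant, b_age, b_sex = 1292.48764, -8.4694585, 630.367989

# 30-year-old woman: Age = 30, Sex = 1
prediction = constant + b_age * 30 + b_sex * 1

# Standard errors and t-statistic as quoted in the summary examples
t_stat = 1.9847
se_prediction = 420.06       # for one individual
se_estimated_mean = 65.60    # for the mean across a large group

individual_margin = t_stat * se_prediction
group_margin = t_stat * se_estimated_mean

print(f"prediction: {prediction:.2f}")
print(f"individual: ± {individual_margin:.2f}")
print(f"group mean: ± {group_margin:.2f}")
```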
Estimate a “Pure” Difference (1)
What is the mean difference in the value of the dependent variable
typically associated with a one-unit difference in another variable,
when everything else of relevance remains unchanged?
• Regress the dependent variable onto all of the other variables in
the study (the “most complete” model), and look at the
coefficient of the “other” variable.
• The margin of error in the estimated mean associated difference
is (~2)∙(the standard error of the coefficient).
Example: What is the average difference in negotiated discount
associated with an incremental year of age of the purchaser of an
intermediate-sized car from the dealership, when all other
characteristics of that purchaser remain unchanged?
$9.49 ± 1.9850∙$3.63
Estimate a “Pure” Effect (2)
Example: What is the average difference in negotiated discount
associated with an incremental year of age of the purchaser of an
intermediate-sized car from the dealership, when all other
characteristics of that purchaser remain unchanged?
Example: What is the average effect of an incremental year of age
on negotiated discount?
If you’re willing to assert that the linkage between age and
negotiated discount is causal (we’ll discuss “causality” in our next
class), then the “average pure difference” and “average pure effect”
questions can be viewed as the same.
Estimate a Confounded Difference (1)
What is the mean difference in the value of the dependent variable
typically associated with a one-unit difference in another variable,
when all remaining variables consequently may take different
values themselves?
• Regress the dependent variable onto just the one variable, and
look at the coefficient of the explanatory variable.
• The margin of error in the estimated mean difference is (~2)∙(the
standard error of the coefficient).
Example: As 30-year-old purchasers age by a year, estimate the
average change in their negotiated discounts.
-$14.80 ± 1.9845∙$5.28
Estimate a Confounded Difference (2)
Example (continued): As 30-year-old purchasers age by a year,
estimate the average change in their negotiated discounts.
The older purchasers would, on average, receive smaller discounts.
This is because, as Age increases for purchasers, Income tends to
increase as well. The additional Age increases Discount, the
additional Income tends to decrease Discount, and the net effect
just happens to be a decrease.
Measure the Potential Explanatory
Power of a Model
How much of the variation in the dependent variable is potentially
explained by the fact that several explanatory variables vary from
one individual to the next?
• Regress the first variable (the dependent variable) onto the other
variables (the independent or explanatory variables), and look at
the adjusted coefficient of determination.
Example: “How much of the variation in negotiated Discounts on
intermediate-size cars can be potentially explained by the facts that
Age, Income, and Sex all vary from one purchaser to the next?”
68.74%
Rank the Explanatory Variables by
Relative Explanatory Importance
When all the variables are considered together, typical variation in
which one would lead to the greatest expected variation in the
dependent variable?
• Regress the variable-to-be-predicted (the dependent variable)
onto the explanatory variables.
• Compare the magnitudes (absolute values) of the beta-weights
of the explanatory variables.
Example: “Why does Discount vary from one purchaser to the
next?”
“Because Income varies (-0.6735). And secondarily, because Sex
varies (some are men, and others women) (0.4150).”
Evaluate a Variable’s Model Inclusion
(1)
Given a particular regression model, how strong is the (supporting)
evidence that a specific one of the explanatory variables has a true
non-zero effect on the dependent variable (and therefore "belongs" in the
model)?
• To see if evidence supports a claim, we always take the opposite as the
null hypothesis: in this case, to say that a variable does not belong in
the model we say “H0: coefficient (of the explanatory variable) = 0.”
• The displayed significance level for that variable is with respect to the
“doesn’t belong” null hypothesis, so a large numeric significance level
indicates little or no evidence that the variable belongs in the model.
• However, a small significance level provides strong evidence against
the null hypothesis, and therefore strong evidence that the explanatory
variable plays a non-zero role in the relationship.
Evaluate a Variable’s Model Inclusion
(2)
Example:
Regression: Discount
                     constant      Age           Income        Sex
coefficient          1971.72565    9.48991379    -0.035313     446.294355
std error of coef    146.147064    3.6320188     0.00366827    64.5567912
t-ratio              13.4914       2.6128        -9.6266       6.9132
significance         0.0000%       1.0423%       0.0000%       0.0000%
beta-weight                        0.1746        -0.6735       0.4150
We see here overwhelmingly-strong evidence that Income and Sex have non-zero effects and “belong”
in our model, and very strong evidence that Age belongs as well.