Defining the Goldilocks problem
Jane E. Miller, PhD
The Chicago Guide to Writing about Multivariate Analysis, 2nd edition.
Overview
• Defining the Goldilocks problem
• Understanding why type of variable matters
• Understanding why range of values matters
• Outlining the steps to avert Goldilocks problems
– Later podcasts fill in the details
What is “the Goldilocks problem”
in multivariate regression?
• As Goldilocks discovered, she and each of the
Three Bears preferred different-sized chairs.
– One chair was too big,
– One chair was too small,
– One chair was just right!
• Likewise, different variables in a multivariate
regression often require different-sized
contrasts to illustrate the meaning of their
coefficients.
Review: Interpretation of
regression coefficients
• Ordinary least squares (OLS) coefficients (βs) estimate the
change in the dependent variable (Y) for a 1-unit
increase in an independent variable (Xi), with the
result in the units of the dependent variable.
• Logit coefficients estimate the effect of a 1-unit
increase in Xi on the log-odds of the
outcome under study.
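As a rough illustration of the logit case, a log-odds coefficient can be exponentiated to give an odds ratio for a contrast of any size. The function name and the coefficient value below are hypothetical, chosen only for illustration; they do not come from the book.

```python
import math

def odds_ratio(beta_log_odds, contrast=1.0):
    """Odds ratio for a `contrast`-unit increase in Xi, given a logit coefficient."""
    return math.exp(beta_log_odds * contrast)

# Hypothetical logit coefficient on a continuous Xi (not from the source).
beta = 0.05

print(round(odds_ratio(beta), 3))      # odds ratio for a 1-unit increase
print(round(odds_ratio(beta, 10), 3))  # odds ratio for a 10-unit increase
```

The same coefficient yields very different odds ratios depending on the contrast, which is why the size of the contrast has to be chosen and stated.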
Common pitfalls in interpreting
regression coefficients
• Assessing which independent variables are the
“most important” by directly comparing the
sizes of the estimated coefficients (βs).
• Direct comparison of βs implies that a 1-unit
increase in each independent variable is the
pertinent contrast for that variable.
– Problematic because many multivariate models
include different:
• Types of variables (levels of measurement).
• Ranges and scales of continuous variables.
Why does type of variable matter?
• Continuous independent variables
• Categorical independent variables
– Nominal
– Ordinal
Considerations for contrast size:
Continuous variables
• Different continuous variables have different
levels and ranges of values:
– Age in a sample of students might vary from 5 to
17 years
• A 12-unit range among values in the single to double
digits
– Their annual family incomes could vary from $0 to
$millions
• A million+ unit range with a median value likely in the
five digits
Problem: Directly comparing βs for
continuous variables with different scales
• Although a 1-year increase in age might be a
relevant contrast, a $1 increase in annual
family income in the US today would be trivial.
• Directly comparing the βs on age and income
implicitly assumes that a 1-unit increase fits
the scale of both variables.
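The scale mismatch can be sketched in a few lines. Both coefficients below are hypothetical (they are not estimates from the book), and the outcome is assumed to be measured in grams.

```python
def effect_for_contrast(beta_per_unit, contrast_size):
    """Predicted change in Y for an increase of `contrast_size` units in X."""
    return beta_per_unit * contrast_size

# Hypothetical coefficients (not from the source); outcome assumed to be in grams.
beta_age_per_year = 10.0
beta_income_per_dollar = 0.002

age_effect = effect_for_contrast(beta_age_per_year, 1)                   # 1 year
income_effect_1 = effect_for_contrast(beta_income_per_dollar, 1)         # $1
income_effect_10k = effect_for_contrast(beta_income_per_dollar, 10_000)  # $10,000

print(age_effect, income_effect_1, income_effect_10k)
```

Compared 1 unit to 1 unit, income looks negligible next to age; with a $10,000 contrast the ranking reverses, even though neither coefficient has changed.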
Considerations for contrast size:
Categorical variables
• The numeric codes used as shorthand for categorical
variables have no mathematical meaning.
– E.g., dummy variable “boy” coded
1 = boy
0 = girl
– No such thing as a 1-unit increase in “genderness.”
• Such binary variables only span a 1-unit range, so
multiunit changes are not applicable.
Problem: Interpreting directionality of
nominal variable codes
• The values of nominal variables such as
gender or race have no natural order.
– Any rank ordering of categories of those variables
is arbitrary.
• An artifact of how the analyst chose to code
the categories.
– Could equally well code gender as 1 = male,
2 = female
– Thus the directionality implied by a 1-unit increase
is misleading.
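A minimal sketch of why the sign of such a coefficient is an artifact of coding: with a single two-category predictor, the OLS slope is the difference in group means, and its sign flips when the codes are reversed. The birth weights are made-up values for illustration, and `ols_slope` is a hypothetical helper written here for a one-predictor model.

```python
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple (one-predictor) OLS regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical birth weights in grams: first three are boys, last three are girls.
weights = [3400, 3500, 3600, 3200, 3300, 3400]
boy = [1, 1, 1, 0, 0, 0]                   # 1 = boy, 0 = girl
girl = [1 - b for b in boy]                # 1 = girl, 0 = boy
male1_female2 = [2 - b for b in boy]       # 1 = male, 2 = female

print(ols_slope(boy, weights))             # positive: boys heavier than girls
print(ols_slope(girl, weights))            # same size, opposite sign
print(ols_slope(male1_female2, weights))   # same size, opposite sign
```

All three codings describe the same gap between the two groups; only the direction implied by "a 1-unit increase" changes.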
Codes for ordinal variables
• Codes for categories of ordinal variables are
rankable and might appear to have numeric
meaning.
– E.g., categories for self-rated health might be
coded:
1: excellent
2: very good
3: good
4: fair
5: poor
Problem: Interpreting ordinal
values as if they were continuous
• Unlike integer values of a continuous variable
like age in years, the numeric distance between
categories of an ordinal variable cannot be
assumed to be uniform.
• E.g., respondents might perceive a bigger
difference between “good” and “fair” health
than between “very good” and “excellent”
health.
Problem: Interpreting ordinal
values as if they were continuous
• Unlike integer values of a continuous variable,
the numeric distance between categories of an
ordinal variable cannot be assumed to be
uniform, even when categories have numeric
units attached. E.g., income groups often
• Are of varying widths
– E.g., <$20K, $20K–$39K, $40K–$79K, $80K–$160K
• Include an open-ended top category (e.g.,
>$160K).
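The unequal spacing can be made explicit by computing the dollar width behind each category code. This sketch assumes each band is a half-open interval, so the band labeled up to $39K is treated as running to $40,000; the integer codes 1 through 5 are also an assumption, since the slide lists only the ranges.

```python
# Income categories from the example above: (code, lower bound, upper bound) in dollars.
# Assumptions: codes 1-5 are hypothetical; each band is [lower, upper);
# the top category is open-ended, so it has no upper bound.
income_groups = [
    (1, 0, 20_000),
    (2, 20_000, 40_000),
    (3, 40_000, 80_000),
    (4, 80_000, 160_000),
    (5, 160_000, None),
]

def band_width(lower, upper):
    """Width of a band in dollars, or None if the band is open-ended."""
    return None if upper is None else upper - lower

widths = {code: band_width(lower, upper) for code, lower, upper in income_groups}
print(widths)
```

A move from code 1 to code 2 and a move from code 4 to code 5 are both "1-unit increases" in the coded variable, yet they correspond to very different income differences.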
Problem: Comparing βs on categorical
and continuous variables
• βs on continuous and categorical variables are
interpreted differently, so if a model includes both
types of independent variables, you cannot compare
their βs without considering the pertinent-sized
contrast for each variable.
– For mother’s age (a continuous variable), the contrast can
span more than 1 unit (year) across cases.
– For gender, the contrast is one category versus the other,
and no more than a “1-unit” increase is possible.
– Even if the β on boy exceeds the β on mother’s age (117.2 and
10.7, respectively), one cannot conclude that gender is a
“more important” determinant of birth weight than mother’s age.
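Using the two coefficients quoted above, a short calculation shows how the comparison depends on the age contrast chosen. The coefficients are assumed here to be in grams of birth weight, and the age contrasts of 1, 5, 10, and 15 years are illustrative choices, not values from the book.

```python
beta_boy = 117.2         # boys versus girls (coefficient quoted on the slide)
beta_mothers_age = 10.7  # per 1-year increase in mother's age (quoted on the slide)

def age_effect(years):
    """Predicted difference in birth weight for a `years`-year difference in mother's age."""
    return beta_mothers_age * years

# Age difference at which the age effect equals the boy-girl difference.
years_to_match = beta_boy / beta_mothers_age

for years in (1, 5, 10, 15):  # illustrative contrasts
    print(years, round(age_effect(years), 1))
print(round(years_to_match, 1))
```

An age difference of roughly 11 years corresponds to about the same predicted difference in birth weight as the boy-girl contrast, so the larger β on boy does not by itself make gender the "more important" predictor.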
Why does range of values matter?
• When is a 1-unit change
– Too big?
– Too small?
When is a 1-unit increase too big?
• For independent variables whose values in
your data:
– fall mostly between 0 and 1,
– are clustered within a few units of one another,
– or are by definition restricted to between 0.0 and
1.0, e.g.,
• Variables measured in proportions
• Gini coefficients
• In such situations, apply a <1.0 unit contrast to
assess the effect of a change in Xi on Y.
Proportions versus percentages
• Researchers are often sloppy about variables
measured in proportions, instead labeling them as
percentages (or vice versa)
– The percentage equivalent of a proportion is by definition
100 times as large.
• Must convey the correct scale of the variable used in
the model so β can be interpreted correctly.
– For variables measured as a proportion, a 1-unit increase is
too large,
– For those measured in percentages, a 1-unit increase often
is too small.
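The factor of 100 carries straight through to the coefficient. In this sketch the β value is hypothetical, and the helper name is introduced only for illustration.

```python
def beta_proportion_to_percentage(beta_per_proportion_unit):
    """Convert a β estimated with X as a proportion (0.0-1.0)
    into the β that would apply with X in percentage points (0-100)."""
    return beta_per_proportion_unit / 100.0

# Hypothetical β for X measured as a proportion (not from the source).
beta_prop = 250.0
beta_pct = beta_proportion_to_percentage(beta_prop)

# The same real-world contrast expressed on each scale:
# 0.05 on the proportion scale equals 5 percentage points.
effect_prop = beta_prop * 0.05
effect_pct = beta_pct * 5

print(beta_prop, beta_pct, effect_prop, effect_pct)
```

The two βs differ by a factor of 100, but the predicted effect of the same real-world contrast is identical, which is why the table and text must say which scale was used.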
When is a 1-unit increase too small?
• For independent variables with
– A high level or wide range of values
– Imprecise measurement of values
• E.g., for blood pressure, a difference of 1 millimeter
of mercury (mm Hg) is too small to be
– clinically meaningful
– observed with precision
• In such situations, apply a >1.0 unit contrast to
assess the effect of a change in Xi on Y.
Goldilocks issues for
the dependent variable: range of values
• Evaluate what a 1-unit increase means given
the range and scale of the dependent variable.
• β = 1.0 on a dummy variable is
– a trivially small effect in a model predicting birth
weight (which ranges from about 400 to 5,900
grams).
– a substantial effect in a model predicting grade
point average on the usual 4-point scale.
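One way to see the difference is to express β = 1.0 as a share of each dependent variable's range. The birth weight range comes from the example above; treating grade point average as running from 0.0 to 4.0 is an assumption about "the usual 4-point scale."

```python
def share_of_range(beta, y_min, y_max):
    """β expressed as a fraction of the range of the dependent variable."""
    return beta / (y_max - y_min)

beta = 1.0

birth_weight_share = share_of_range(beta, 400, 5900)  # grams, range from the slide
gpa_share = share_of_range(beta, 0.0, 4.0)            # assumed 0.0-4.0 GPA scale

print(f"{birth_weight_share:.4%}")
print(f"{gpa_share:.0%}")
```

The same coefficient is a tiny fraction of the birth weight range but a quarter of the GPA scale.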
Goldilocks issues for the dependent
variable: model specification
• Ordinal dependent variables such as birth
weight categories (e.g., very low, low, normal,
and high birth weight) should not be modeled
using OLS models.
– OLS models imply that the numeric codes for
those categories are values of a continuous
dependent variable.
– Instead, use techniques such as ordered logit or
other methods for ordered categorical dependent
variables (Powers and Xie 2000).
Prose interpretation and
comparison of βs is critical
• If βs are reported only as numbers, whether in a table
or in prose, you leave it to readers to:
– Notice the different types and scales of the variables,
– Figure out pertinent-sized contrasts for each variable in
the model.
• Readers will then be more likely to make Goldilocks
errors when they assess the meaning of the βs on
different variables in your model.
• Your job as the author is to write about the results in
ways that avert Goldilocks errors of interpretation.
Steps for resolving Goldilocks problems
1. Getting acquainted with the units and distribution
of your independent and dependent variables.
2. Applying theoretical and empirical criteria to
choose a suitably-sized contrast for each
independent variable.
3. Using precise, complete labeling of units and
categories in prose, tables, and charts.
4. Interpreting the results in prose to clearly
communicate the substantive meaning of the βs
based on suitably-sized contrasts for each variable.
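Steps 1 and 2 can be started with a quick summary of each variable's distribution. This is only a sketch: the ages are hypothetical data, `describe` is a helper defined here, and the standard deviation and interquartile range are offered as common empirical candidates for a contrast rather than as the book's prescribed criteria. Theoretical criteria still need to be weighed separately.

```python
import statistics

def describe(values):
    """Summary statistics that help in judging what size of contrast is plausible."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical ages (years) for a sample of students.
ages = [5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

summary = describe(ages)
for name, value in summary.items():
    print(name, round(value, 2))
```

Running the same summary for every independent and dependent variable makes it easier to spot variables for which a 1-unit contrast would be too big or too small.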
Summary
• A “one-size-fits-all” approach to interpreting
regression coefficients is often misleading
because variables
– Have different types (levels of measurement),
– Have different units of measurement,
– Have varying distributions of values,
– Occur in different real-world circumstances.
• These issues require careful thought about
how to present βs to convey the substantive
meaning of the β for each variable.
Suggested resources
• Miller, J. E. 2013. The Chicago Guide to Writing
about Multivariate Analysis, 2nd Edition.
– Chapter 10, on the Goldilocks problem
– Chapter 4, on types of variables, units, and
distribution
• Miller, J. E. and Y. V. Rodgers, 2008. “Economic
Importance and Statistical Significance:
Guidelines for Communicating Empirical
Research.” Feminist Economics 14 (2): 117–49.
Suggested online resources
• Podcasts on
– Interpreting multivariate regression coefficients
– Resolving the Goldilocks problem
• Measurement and variables
• Model specification
• Presenting results
Suggested practice exercises
• Study guide to The Chicago Guide to Writing
about Multivariate Analysis, 2nd Edition.
– Questions #1, 2, and 7 in the problem set for
chapter 10.
– Suggested course extensions for chapter 10:
• “Reviewing” exercises #1 through 5.
• “Applying statistics and writing” question #1.
• “Revising” questions #1, 2, 3, 5, and 9.
– “Getting to know your variables” assignment
Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at
http://press.uchicago.edu/books/miller/multivariate/index.html