Testing whether a multivariate specification can be simplified Jane E. Miller, PhD

The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
Overview
• Initial model specifications with full set of
independent variables (IVs)
• How to test whether simpler specification fits the
data as well
• Approaches to simplifying model specification
– Creating a reference category that combines categories
– Collapsing other categories
• Presenting results of analyses to evaluate model
specification
Estimated coefficients from an OLS model of birth weight in grams

Variable                     Coefficient    Standard error
Intercept                      3,317.8**         25.1
Race/Hispanic origin
  (Non-Hispanic white)
  Mexican American               –23.0           22.7
  Non-Hispanic black            –172.6**         17.5
Mother’s education
  < High school (<HS)            –55.5**         19.3
  = High school (=HS)            –53.9**         14.8
  (> High school; >HS)

** denotes p < .01
Reference categories in parentheses
Testing whether a specification
could be simplified
• For a specification that involves several multicategory variables, might be able to simplify the
specification if some terms can be omitted without
worsening the overall fit of the model
• E.g., for a three-category variable, might it be
possible to
– Combine one of the modeled categories with the
reference category?
– Combine the two modeled categories with one another?
Example 1: Revising the reference
category for race
• The estimated coefficient for Mexican American is
not statistically significantly different from zero
– I.e., predicted birth weight is not statistically significantly
different for Mexican American than for non-Hispanic
white infants (the reference category)
• βMexicanAmerican = –23.0; standard error = 22.7
• Since birth weights for those two racial/ethnic
groups are so close, could combine them to create
the reference category
– Reference category now includes BOTH Non-Hispanic
white AND Mexican American infants
Test a revised race/ethnicity specification
• Replace Specification A
BW = f (NHB, MA, other independent variables)
• With Specification B
BW = f (NHB, same set of other independent variables)
– Reference category is now non-Hispanic white and
Mexican American infants
• Compare overall fit of specifications A and B
– Goodness-of-fit (GOF) statistics
– If fit of the model with the simpler racial/ethnic specification is
not statistically significantly worse than that of the more
detailed specification, the simpler specification is preferred as
the more parsimonious one
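The slides do not say which goodness-of-fit statistic to use for this comparison. One common choice for nested OLS specifications is an F-test on the change in the residual sum of squares, and the sketch below assumes that choice. The function name and the example inputs are hypothetical; they are not values from the birth weight model.

```python
def nested_f_statistic(rss_simple, rss_detailed, n_dropped, df_resid_detailed):
    """F-statistic for H0: the simpler (nested) specification fits as well.

    rss_simple        -- residual sum of squares of the simpler model
    rss_detailed      -- residual sum of squares of the more detailed model
    n_dropped         -- number of coefficients removed by the simplification
    df_resid_detailed -- residual degrees of freedom of the detailed model
    """
    if rss_simple < rss_detailed:
        raise ValueError("a nested model cannot fit better than the model containing it")
    numerator = (rss_simple - rss_detailed) / n_dropped
    denominator = rss_detailed / df_resid_detailed
    return numerator / denominator


# Made-up inputs, for illustration only (not from the slides).
f_stat = nested_f_statistic(rss_simple=1010.0, rss_detailed=1000.0,
                            n_dropped=1, df_resid_detailed=500)
print(round(f_stat, 2))
```

An F-statistic below the critical value for (n_dropped, df_resid_detailed) degrees of freedom would suggest that the simpler specification does not fit statistically significantly worse.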
Combining two modeled categories
with one another
• Testing differences between βs from one model
• To formally test statistical significance of differences
between coefficients, e.g., H0: βj = βk, calculate the
test statistic:
– Divide the difference between the estimated coefficients
(βj − βk) by the standard error of the difference
– Compare the value of the test statistic against the critical
value with one degree of freedom
Standard error of the difference
• The standard error of the difference is calculated as
√[var(βj) − (2 × cov(βj, βk)) + var(βk)]
– var(βj) and var(βk) are the variances of βj and βk,
respectively
– cov(βj, βk) is the covariance between βj and βk
• The complete variance-covariance matrix for a
regression can be requested as part of the output
• The variance of each coefficient can be calculated
from its standard error (s.e.): var(βj) = [s.e.(βj)]²
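The standard error of a difference between two coefficients can be sketched as a small helper. This is a minimal illustration using only the Python standard library; the function names are hypothetical and the inputs are the entries you would read off the variance-covariance matrix.

```python
import math


def variance_from_se(se):
    """Variance of a coefficient, recovered from its standard error."""
    return se ** 2


def se_of_difference(var_j, var_k, cov_jk):
    """Standard error of (b_j - b_k).

    The covariance term is subtracted because the quantity of interest
    is a difference between the two coefficients, not a sum.
    """
    variance = var_j + var_k - 2.0 * cov_jk
    if variance < 0:
        raise ValueError("inputs do not form a valid variance-covariance matrix")
    return math.sqrt(variance)


# Simple check with round numbers: 4 + 9 - 2*2 = 9, so the s.e. is 3.
print(se_of_difference(4.0, 9.0, 2.0))
```

Note that squaring a rounded standard error only approximates the variance reported in the variance-covariance matrix, so take the variances from the matrix itself when it is available.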
Example 2: Testing whether β<HS = β=HS
• From the table, β<HS = –55.5 and β=HS = –53.9
• The difference between β<HS and β=HS is calculated
β<HS – β=HS = –55.5 – (–53.9) = –1.6
• For that model,
• var(β<HS) = 370.9
• var(β=HS) = 218.8
• cov(β<HS, β=HS) = 137.8
• Plugging those values into the formula for the standard error
of the difference yields
√[370.9 − (2 × 137.8) + 218.8]
= √314.1
= 17.72
Example 2, cont.: Test statistic for β<HS = β=HS
• To calculate the test statistic, divide the difference
between β<HS and β=HS by the standard error of the
difference:
(β<HS – β=HS)/s.e.(β<HS – β=HS)
= –1.6/17.7 = –0.09
• |–0.09| = 0.09 < 1.96 (the critical value for a two-tailed
t-test at p < .05 with ∞ degrees of freedom)
• Thus we cannot reject the null hypothesis that β<HS =
β=HS
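The arithmetic in Example 2 can be reproduced in a few lines. This sketch uses the coefficients, variances, and covariance reported on the slides; the variable names are illustrative, and the 1.96 critical value assumes a two-tailed test at the .05 level with a large sample.

```python
import math

# Values reported on the slides for the birth weight model.
b_lt_hs, b_eq_hs = -55.5, -53.9
var_lt_hs, var_eq_hs, cov_lt_eq = 370.9, 218.8, 137.8

# Difference between the two coefficients.
diff = b_lt_hs - b_eq_hs

# Standard error of the difference: sqrt(var + var - 2 * cov).
se_diff = math.sqrt(var_lt_hs + var_eq_hs - 2 * cov_lt_eq)

# Test statistic and comparison against the critical value.
t_stat = diff / se_diff
CRITICAL_VALUE = 1.96
reject_null = abs(t_stat) > CRITICAL_VALUE

print(round(diff, 1), round(se_diff, 2), round(t_stat, 2), reject_null)
```

Because the comparison is two-tailed, the sign of the test statistic does not affect the conclusion; only its absolute value is compared with the critical value.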
TEST statement
• Alternatively, request the test statistic for equality of
coefficients for pairs of coefficients as part of the
regression procedure
• E.g., to test whether predicted birth weight is
statistically significantly different for infants whose
mothers have <HS than for those whose mothers have =HS
– Specify “TEST ‘<HS’ = ‘=HS’ ” in your SAS syntax
– Output for H0: β<HS = β=HS reports an F-statistic of 0.01 with
a p-value of 0.93
• Conclusion: No statistically significant difference between the
estimated coefficients for <HS and =HS
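The F-statistic from a test of a single restriction equals the square of the corresponding t-statistic, so the reported output can be cross-checked by hand. The sketch below does that with the values from the slides. It approximates the p-value with the standard normal distribution, which is an assumption that suits a large sample; it is not how SAS computes the value.

```python
import math

# Hand-calculated test statistic for H0: b_<HS = b_=HS, from the slide values.
t_stat = (-55.5 - (-53.9)) / math.sqrt(370.9 + 218.8 - 2 * 137.8)

# With one restriction, F = t squared.
f_stat = t_stat ** 2

# Two-tailed p-value under a standard normal approximation (large-sample assumption).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(f_stat, 2), round(p_value, 2))
```

Both rounded values agree with the F-statistic and p-value quoted from the SAS output.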
Collapsing the education classification
• Because
– We cannot reject the null hypothesis that β<HS = β=HS based on
the estimates from the model
– Both β<HS and β=HS are statistically significantly different
from zero
– β<HS and β=HS are empirically very similar (–55.5 and –53.9,
respectively)
• Could simplify the specification by creating one
dummy to capture ≤HS
– Collapses the categories of <HS and =HS
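Collapsing the two categories amounts to recoding the dummy variables. The sketch below shows one way to do that; the function names, the dummy names (lt_hs, eq_hs, le_hs), and the category labels are hypothetical, chosen only to mirror the notation on the slides.

```python
EDUCATION_LEVELS = ("<HS", "=HS", ">HS")


def detailed_education_dummies(education):
    """Two dummies for the detailed specification; >HS is the reference category."""
    if education not in EDUCATION_LEVELS:
        raise ValueError(f"unknown education level: {education!r}")
    return {"lt_hs": int(education == "<HS"), "eq_hs": int(education == "=HS")}


def collapsed_education_dummy(education):
    """One dummy for the collapsed specification; >HS is still the reference category."""
    if education not in EDUCATION_LEVELS:
        raise ValueError(f"unknown education level: {education!r}")
    return {"le_hs": int(education in ("<HS", "=HS"))}


for level in EDUCATION_LEVELS:
    print(level, detailed_education_dummies(level), collapsed_education_dummy(level))
```

The reference category is unchanged by the recoding; only the two modeled categories are merged into a single indicator.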
Test a revised education specification
• Replace Specification A
BW = f (<HS, =HS, other independent variables)
• With Specification C
BW = f (≤HS, same set of other independent variables)
• Compare overall fit of specifications A and C
– GOF statistics
– If fit of the simpler specification is not statistically
significantly worse than that of the more detailed
education specification, the simpler specification would be
the parsimonious specification
Caveat about combining categories
• Only combine categories for which it makes
substantive sense to do so
– E.g., < HS and > HS aren’t adjacent ordinal categories, so
you would NOT combine them with one another to
compare against = HS.
– However, for some research questions, you could combine
non-Hispanic blacks with Mexican-Americans because
both are considered racial/ethnic minority groups in the
US
Describing exploratory work
on model specification
• Always explain in your methods or results section
how you arrived at your final model specification
• Describe the criteria you used to decide which
independent variables to include in both initial and
final models
– Theoretical criteria about which variables and
classifications were used in initial specification
– Empirical criteria used to test simplifications to that
specification
– Theoretical criteria might override empirical criteria due to
the role of that variable in your specific research question
Example description of exploratory
work on model specification
• “Although birth weight for Mexican American infants
was not statistically significantly different from that
of non-Hispanic white infants, because race/ethnicity
is a variable of primary interest for our research
question, we retained it as a separate category in
our sequence of models.”
– Theoretical criteria, used if race/ethnicity is of central
interest in the analysis
Alternative description of exploratory
work on model specification
• “Our initial model specification compared three
racial/ethnic categories (Mexican American, non-Hispanic black,
and non-Hispanic white). However,
birth weight for Mexican American infants was not
statistically significantly different from that of non-Hispanic
white infants, so we combined those
two groups to create the revised reference category
for race/ethnicity.”
– Empirical criteria, to be used if race/ethnicity not a key IV
Summary
• Initial model specifications often include a full set of
independent variables (IVs) related to the substantive
research question. E.g.,
– Detailed classifications of 1+ categorical variables
– All pertinent main effects and interaction terms
• If some of those variables are not statistically significant
in the initial model, test simpler specifications to assess
whether there is a statistically significant loss of fit
• Decisions about which IVs to include in the final model
should be based on both theoretical and empirical
criteria related to your research question and data
Suggested resources
• Miller, J. E. 2013. The Chicago Guide to Writing about
Multivariate Analysis, 2nd Edition. University of
Chicago Press, chapters 11 and 15.
Suggested online resources
• Podcasts on
– Testing statistical significance of differences between
coefficients
– Comparing overall goodness of fit across models
– Creating variables and specifying models to test for
interactions
Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at
http://press.uchicago.edu/books/miller/multivariate/index.html