Testing whether a multivariate specification can be simplified
Jane E. Miller, PhD
The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.

Overview
• Initial model specifications with a full set of independent variables (IVs)
• How to test whether a simpler specification fits the data as well
• Approaches to simplifying model specification
  – Creating a reference category that combines categories
  – Collapsing other categories
• Presenting results of analyses to evaluate model specification

Estimated coefficients from an OLS model of birth weight in grams

Variable                                      Coefficient   Standard error
Intercept                                     3,317.8**     25.1
Race/Hispanic origin (Non-Hispanic white)
  Mexican American                            –23.0         22.7
  Non-Hispanic black                          –172.6**      17.5
Mother’s education (> High school; >HS)
  < High school (<HS)                         –55.5**       19.3
  = High school (=HS)                         –53.9**       14.8

** denotes p < .01
Reference category in parentheses

Testing whether a specification could be simplified
• For a specification that involves several multicategory variables, might be able to simplify the specification if some terms can be omitted without worsening the overall fit of the model
• E.g., for a three-category variable, might it be possible to
  – Combine one of the modeled categories with the reference category?
  – Combine the two modeled categories with one another?
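As a rough first screen for which terms might be candidates for simplification, each coefficient in the table above can be divided by its standard error and compared with a critical value. The Python sketch below does that; the variable names are illustrative (not from the slides), and it assumes the large-sample two-tailed critical value of 2.576 for p < .01.

```python
# Coefficients and standard errors copied from the birth weight table.
# Dictionary keys and variable names are illustrative, not from the slides.
estimates = {
    "Mexican American": (-23.0, 22.7),
    "Non-Hispanic black": (-172.6, 17.5),
    "<HS": (-55.5, 19.3),
    "=HS": (-53.9, 14.8),
}

# Assumption: large-sample two-tailed critical value for p < .01.
CRITICAL_01 = 2.576

# Test statistic for H0: coefficient = 0 is coefficient / standard error.
t_ratios = {name: coef / se for name, (coef, se) in estimates.items()}
significant = {name: abs(t) > CRITICAL_01 for name, t in t_ratios.items()}

for name, t in t_ratios.items():
    flag = "**" if significant[name] else ""
    print(f"{name:20s} t = {t:6.2f} {flag}")
```

Only the Mexican American coefficient falls short of the critical value, which matches the ** markers in the table and makes it the natural candidate for merging with the reference category.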

Example 1: Revising the reference category for race
• The estimated coefficient for Mexican American is not statistically significantly different from zero
  – I.e., predicted birth weight is not statistically significantly different for Mexican American than for non-Hispanic white infants (the reference category)
• βMexicanAmerican = –23.0; standard error = 22.7
• Since birth weights for those two racial/ethnic groups are so close, could combine them to create the reference category
  – Reference category now includes BOTH non-Hispanic white AND Mexican American infants

Test a revised race/ethnicity specification
• Replace Specification A
  BW = f (NHB, MA, other independent variables)
• With Specification B
  BW = f (NHB, same set of other independent variables)
  – Reference category is now non-Hispanic white and Mexican American infants
• Compare overall fit of specifications A and B
  – Goodness-of-fit (GOF) statistics
  – If fit of the model with the simpler racial/ethnic specification is not statistically significantly worse than that of the more detailed specification, the simpler model is the parsimonious specification

Combining two modeled categories with one another
• Testing differences of βs from one model
• To formally test statistical significance of differences between coefficients, e.g., H0: βj = βk, calculate the test statistic:
  – Divide the difference between the estimated coefficients (βj − βk) by the standard error of the difference
  – Compare the value of the test statistic against the critical value with one degree of freedom

Standard error of the difference
• The standard error of the difference is calculated as:
  √[var(βj) − (2 × cov(βj, βk)) + var(βk)]
  – var(βj) and var(βk) are the variances of βj and βk, respectively
  – cov(βj, βk) is the covariance between βj and βk
• The complete variance-covariance matrix for a regression can be requested as part of the output
• The variance of each coefficient can be calculated from its standard error (s.e.): var(βj) = [s.e.(βj)]²

Example 2: Testing whether β<HS = β=HS
• From the table, β<HS = –55.5 and β=HS = –53.9
• The difference between β<HS and β=HS is calculated as
  β<HS – β=HS = –55.5 – (–53.9) = –1.6
• For that model,
  – var(β<HS) = 370.9
  – var(β=HS) = 218.8
  – cov(β<HS, β=HS) = 137.8
• Plugging those values into the formula for the standard error of the difference yields
  √[370.9 − (2 × 137.8) + 218.8] = √314.1 = 17.72

Example 2, cont.: Test statistic for β<HS = β=HS
• To calculate the test statistic, divide the difference between β<HS and β=HS by the standard error of the difference:
  (β<HS – β=HS)/s.e.(β<HS – β=HS) = –1.6/17.72 = –0.09
• In absolute value, 0.09 < 1.96 (the critical value for a two-tailed t-test at p < .05 with ∞ degrees of freedom)
• Thus we cannot reject the null hypothesis that β<HS = β=HS

TEST statement
• Alternatively, request the test statistic for equality of a pair of coefficients as part of the regression procedure
• E.g., to test whether predicted birth weight is statistically significantly different for infants whose mothers have <HS than for those whose mothers have =HS
  – Specify “TEST ‘<HS’ = ‘=HS’ ” in your SAS syntax
  – Output for H0: β<HS = β=HS reports an F-statistic of 0.01 with a p-value of 0.93
• Conclusion: No statistically significant difference between the estimated coefficients for <HS and =HS

Collapsing the education classification
• Because
  – Cannot reject the null hypothesis that β<HS = β=HS based on the estimates from the model
  – Both β<HS and β=HS are statistically significantly different from zero
  – β<HS and β=HS are empirically very similar (–55.5 and –53.9, respectively)
• Could simplify the specification by creating one dummy to capture ≤HS
  – Collapses the categories of <HS and =HS

Test a revised education specification
• Replace Specification A
  BW = f (<HS, =HS, other independent variables)
• With Specification C
  BW = f (≤HS, same set of other independent variables)
• Compare overall fit of specifications A and C
  – GOF statistics
  – If fit of the simpler specification is not statistically significantly worse than that of the more detailed education specification, the simpler model would be the parsimonious specification

Caveat about combining categories
• Only combine categories for which it makes substantive sense to do so
  – E.g., <HS and >HS aren’t adjacent ordinal categories, so you would NOT combine them with one another to compare against =HS
  – However, for some research questions, you could combine non-Hispanic blacks with Mexican Americans because both are considered racial/ethnic minority groups in the US

Describing exploratory work on model specification
• Always explain in your methods or results section how you arrived at your final model specification
• Describe the criteria you used to decide which independent variables to include in both initial and final models
  – Theoretical criteria about which variables and classifications were used in the initial specification
  – Empirical criteria used to test simplifications to that specification
  – Theoretical criteria might override empirical criteria due to the role of that variable in your specific research question

Example description of exploratory work on model specification
• “Although birth weight for Mexican American infants was not statistically significantly different from that of non-Hispanic white infants, because race/ethnicity is a variable of primary interest for our research question, we retained it as a separate category in the sequence of models.”
  – Theoretical criteria, used if race/ethnicity is of central interest in the analysis

Alternative description of exploratory work on model specification
• “Our initial model specification compared three racial/ethnic categories (Mexican American, non-Hispanic black, and non-Hispanic white). However, birth weight for Mexican American infants was not statistically significantly different from that of non-Hispanic white infants, so we combined those two groups to create the revised reference category for race/ethnicity.”
  – Empirical criteria, to be used if race/ethnicity is not a key IV

Summary
• Initial model specifications often include a full set of independent variables (IVs) related to the substantive research question. E.g.,
  – Detailed classifications of 1+ categorical variables
  – All pertinent main effects and interaction terms
• If some of those variables are not statistically significant in the initial model, test simpler specifications to assess whether there is a statistically significant loss of fit
• Decisions about which IVs to include in the final model should be based on both theoretical and empirical criteria related to your research question and data

Suggested resources
• Miller, J. E. 2013. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition. University of Chicago Press, chapters 11 and 15.

Suggested online resources
• Podcasts on
  – Testing statistical significance of differences between coefficients
  – Comparing overall goodness of fit across models
  – Creating variables and specifying models to test for interactions

Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at http://press.uchicago.edu/books/miller/multivariate/index.html
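
Supplementary sketch: the hand calculation in Example 2 can be reproduced in a few lines of Python. This is an illustration only; the function and variable names are hypothetical, and the inputs are the coefficients, variances, and covariance quoted in the example. It uses the standard result that the variance of a difference is var(βj) − 2 × cov(βj, βk) + var(βk), and that for a single restriction the F-statistic equals the squared t-statistic.

```python
import math

def se_of_difference(var_j, var_k, cov_jk):
    """Standard error of (b_j - b_k): sqrt(var_j - 2*cov_jk + var_k)."""
    return math.sqrt(var_j - 2 * cov_jk + var_k)

# Values quoted in Example 2 (names are illustrative).
b_lt_hs, b_eq_hs = -55.5, -53.9         # coefficients for <HS and =HS
var_lt_hs, var_eq_hs = 370.9, 218.8     # variances of the two coefficients
cov_lt_eq = 137.8                       # covariance between them

diff = b_lt_hs - b_eq_hs                                   # about -1.6
se_diff = se_of_difference(var_lt_hs, var_eq_hs, cov_lt_eq)  # about 17.72
t_stat = diff / se_diff                                    # about -0.09
f_stat = t_stat ** 2                                       # about 0.01

# Assumption: two-tailed test at p < .05 with a large-sample critical value.
reject_null = abs(t_stat) > 1.96

print(f"difference = {diff:.1f}, s.e. = {se_diff:.2f}")
print(f"t = {t_stat:.2f}, F = {f_stat:.2f}, reject H0: {reject_null}")
```

The squared t-statistic rounds to 0.01, in line with the F-statistic reported for the TEST statement, and the null hypothesis of equal coefficients is not rejected.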