Defining the Goldilocks problem
Jane E. Miller, PhD
The Chicago Guide to Writing about Multivariate Analysis, 2nd edition.

Overview
• Defining the Goldilocks problem
• Understanding why type of variable matters
• Understanding why range of values matters
• Outlining the steps to avert Goldilocks problems
  – Later podcasts fill in the details

What is “the Goldilocks problem” in multivariate regression?
• As Goldilocks discovered, she and each of the Three Bears preferred different-sized chairs.
  – One chair was too big,
  – One chair was too small,
  – One chair was just right!
• Likewise, different variables in a multivariate regression often require different-sized contrasts to illustrate the meaning of their coefficients.

Review: Interpretation of regression coefficients
• Ordinary least squares (OLS) coefficients (βs) estimate the change in the dependent variable (Y) for a 1-unit increase in the independent variable (Xi), with the result in the units of the dependent variable.
• Logit coefficients estimate the effect of a 1-unit increase in Xi on the log-odds of the outcome under study.

Common pitfalls in interpreting regression coefficients
• Assessing which independent variables are the “most important” by directly comparing the sizes of the estimated coefficients (βs).
• Direct comparison of βs implies that a 1-unit increase in each independent variable is the pertinent contrast for that variable.
  – Problematic because many multivariate models include different:
    • Types of variables (levels of measurement).
    • Ranges and scales of continuous variables.

Why does type of variable matter?
• Continuous independent variables
• Categorical independent variables
  – Nominal
  – Ordinal

Considerations for contrast size: Continuous variables
• Different continuous variables have different levels and ranges of values:
  – Age in a sample of students might vary from 5 to 17 years
    • A 12-unit range among values in the single to double digits
  – Their annual family incomes could vary from $0 to $millions
    • A million+ unit range with a median value likely in the five digits

Problem: Directly comparing βs for continuous variables with different scales
• Although a 1-year increase in age might be a relevant contrast, a $1 increase in annual family income in the US today would be trivial.
• Directly comparing the βs on age and income implicitly assumes that a 1-unit increase fits the scale of both variables.

Considerations for contrast size: Categorical variables
• The numeric codes used as shorthand for categorical variables have no mathematical meaning.
  – E.g., dummy variable “boy” coded 1 = boy, 0 = girl
  – No such thing as a 1-unit increase in “genderness.”
• Such binary variables only span a 1-unit range, so multiunit changes are not applicable.

Problem: Interpreting directionality of nominal variable codes
• The values of nominal variables such as gender or race have no natural order.
  – Any rank ordering of categories of those variables is arbitrary.
    • An artifact of how the analyst chose to code the categories.
  – Could equally well code gender as 1 = male, 2 = female
  – Thus the directionality implied by a 1-unit increase is misleading.
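
The point about age and income can be made concrete with a little arithmetic: in an OLS model, the predicted change in Y for a contrast of a given size is β multiplied by that contrast. The sketch below is only an illustration. The coefficient values, the $10,000 income contrast, and the helper name `effect_of_contrast` are all hypothetical and do not come from the book or the slides.

```python
# Hypothetical OLS coefficients (illustrative values only, not from the source).
# Each is the change in the dependent variable per 1-unit increase in the predictor.
betas = {
    "age_years": 2.0,          # per 1 year of age
    "income_dollars": 0.0005,  # per $1 of annual family income
}

# Contrast sizes assumed to fit each variable's scale (also hypothetical).
contrasts = {
    "age_years": 1,             # a 1-year difference in age
    "income_dollars": 10_000,   # a $10,000 difference in income
}


def effect_of_contrast(beta, contrast):
    """Predicted change in Y for a `contrast`-unit increase in X under OLS."""
    return beta * contrast


effects = {
    name: effect_of_contrast(betas[name], contrasts[name]) for name in betas
}

for name in betas:
    print(f"{name}: beta = {betas[name]}, contrast = {contrasts[name]}, "
          f"effect = {effects[name]:g}")
```

With these made-up numbers, the β on age is far larger than the β on income, yet the income contrast yields the larger predicted change once each variable is given a contrast that fits its scale. That reversal is exactly what a direct comparison of βs hides.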
Codes for ordinal variables
• Codes for categories of ordinal variables are rankable and might appear to have numeric meaning.
  – E.g., categories for self-rated health might be coded:
    1: excellent
    2: very good
    3: good
    4: fair
    5: poor

Problem: Interpreting ordinal values as if they were continuous
• Unlike integer values of a continuous variable like age in years, the numeric distance between categories of an ordinal variable cannot be assumed to be uniform.
• E.g., respondents might perceive a bigger difference between “good” and “fair” health than between “very good” and “excellent” health.

Problem: Interpreting ordinal values as if they were continuous
• Unlike integer values of a continuous variable, the numeric distance between categories of an ordinal variable cannot be assumed to be uniform, even when categories have numeric units attached.
• E.g., income groups often
  – Are of varying widths
    • E.g., <$20K, $20K–$39K, $40K–$79K, $80K–$160K
  – Include an open-ended top category (e.g., >$160K).

Problem: Comparing βs on categorical and continuous variables
• Given the different interpretations of βs on continuous and categorical variables, if a model includes both types of independent variables, you cannot compare their βs without considering the pertinent size contrast for each variable.
  – For mother’s age (a continuous variable), the contrast can vary by more than 1 unit (year) across cases.
  – For gender, the contrast is one category versus the other, and no more than a “1-unit” increase is possible.
  – Even if the β for boy exceeds the β for mother’s age (117.2 and 10.7, respectively), one cannot conclude that gender is a “more important” determinant of birth weight than mother’s age.

Why does range of values matter?
• When is a 1-unit change
  – Too big?
  – Too small?

When is a 1-unit increase too big?
• For independent variables whose values in your data:
  – fall mostly between 0 and 1,
  – are clustered within a few units of one another,
  – or are by definition restricted to between 0.0 and 1.0, e.g.,
    • Variables measured in proportions
    • Gini coefficients
• In such situations, apply a <1.0 unit contrast to assess the effect of a change in Xi on Y.

Proportions versus percentages
• Researchers are often sloppy about variables measured in proportions, instead labeling them as percentages (or vice versa).
  – The percentage equivalent of a proportion is by definition 100 times as large.
• Must convey the correct scale of the variable used in the model so β can be interpreted correctly.
  – For variables measured as a proportion, a 1-unit increase is too large.
  – For those measured in percentages, a 1-unit increase often is too small.

When is a 1-unit increase too small?
• For independent variables with
  – A high level or wide range of values
  – Imprecise measurement of values
• E.g., for blood pressure, a 1 millimeter of mercury (mm Hg) difference is too small to be
  – clinically meaningful
  – observed with precision
• In such situations, apply a >1.0 unit contrast to assess the effect of a change in Xi on Y.

Goldilocks issues for the dependent variable: range of values
• Evaluate what a 1-unit increase means given the range and scale of the dependent variable.
• β = 1.0 on a dummy variable is
  – A trivially small effect in a model predicting birth weight (which ranges from about 400 to 5,900 grams).
  – A substantial effect in a model predicting grade point average on the usual 4-point scale.
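
The scale arithmetic in the preceding slides can be sketched in a few lines. The first part shows that a coefficient on a proportion and a coefficient on the equivalent percentage describe the same effect once the contrast is sized to match. The second part expresses β = 1.0 as a share of the dependent variable's range, using the birth weight and grade point ranges quoted above. The coefficient value of 40.0, the assumption that the grade point scale runs from 0.0 to 4.0, and the helper name `share_of_range` are hypothetical choices made for illustration.

```python
import math

# Hypothetical coefficient on a predictor measured as a PROPORTION (0.0 to 1.0).
beta_per_proportion_unit = 40.0

# The same predictor expressed as a PERCENTAGE (0 to 100) is 100 times as large,
# so its coefficient in an otherwise identical OLS model is 100 times smaller.
beta_per_percentage_point = beta_per_proportion_unit / 100

# A 5-percentage-point increase is a 0.05-unit change on the proportion scale.
effect_from_proportion = beta_per_proportion_unit * 0.05
effect_from_percentage = beta_per_percentage_point * 5
same_effect = math.isclose(effect_from_proportion, effect_from_percentage)
print("Same predicted change on either scale:", same_effect)


def share_of_range(beta, low, high):
    """Express a coefficient as a fraction of the dependent variable's range."""
    return beta / (high - low)


# Birth weight range in grams as quoted in the slides.
birth_weight_share = share_of_range(1.0, 400, 5900)
# Grade point average: assumes the 4-point scale runs from 0.0 to 4.0.
gpa_share = share_of_range(1.0, 0.0, 4.0)

print(f"beta = 1.0 as a share of the birth weight range: {birth_weight_share:.5f}")
print(f"beta = 1.0 as a share of the GPA range: {gpa_share:.2f}")
```

Dividing by the range is only one rough way to judge size. It is used here because the slides frame the comparison in terms of each dependent variable's range.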
Goldilocks issues for the dependent variable: model specification
• Ordinal dependent variables such as birth weight categories (e.g., very low, low, normal, and high birth weight) should not be modeled using OLS models.
  – OLS models imply that the numeric codes for those categories are values of a continuous dependent variable.
  – Instead, use techniques such as ordered logit or other methods for ordered categorical dependent variables (Powers and Xie 2000).

Prose interpretation and comparison of βs is critical
• If βs are reported only in a table or prose, you leave it to readers to:
  – Notice the different types and scales of the variables,
  – Figure out pertinent-sized contrasts for each variable in the model.
• Readers will then be more likely to make Goldilocks errors when they assess the meaning of the βs on different variables in your model.
• Your job as the author is to write about the results in ways that avert Goldilocks errors of interpretation.

Steps for resolving Goldilocks problems
1. Getting acquainted with the units and distribution of your independent and dependent variables.
2. Applying theoretical and empirical criteria to choose a suitably sized contrast for each independent variable.
3. Using precise, complete labeling of units and categories in prose, tables, and charts.
4. Interpreting the results in prose to clearly communicate the substantive meaning of the βs based on suitably sized contrasts for each variable.
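
Steps 1 and 2 can be sketched as a short routine: summarize a variable's distribution, then use that summary to size a contrast. The slides do not say which empirical criterion to use, so the interquartile range below is only one possible choice and is an assumption of this sketch. The blood pressure readings, the coefficient, and the function name `describe` are likewise hypothetical.

```python
import statistics

# Hypothetical systolic blood pressure readings in mm Hg (made-up data).
systolic_mm_hg = [102, 110, 115, 118, 120, 122, 125, 130, 138, 150]


def describe(values):
    """Step 1: summarize the level and spread of a variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "median": median,
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),
    }


summary = describe(systolic_mm_hg)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")

# Step 2 (one possible empirical criterion, assumed here): use the
# interquartile range as the contrast instead of a 1-unit increase.
beta_per_mm_hg = 0.02  # hypothetical OLS coefficient per 1 mm Hg
effect_of_one_unit = beta_per_mm_hg * 1
effect_of_iqr = beta_per_mm_hg * summary["iqr"]

print(f"Predicted change in Y for a 1 mm Hg increase: {effect_of_one_unit:.3f}")
print(f"Predicted change in Y for an IQR-sized increase: {effect_of_iqr:.3f}")
```

An empirical criterion such as this should still be checked against theoretical ones, for example whether the chosen difference in blood pressure is clinically meaningful.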
Summary
• A “one-size-fits-all” approach to interpreting regression coefficients is often misleading because variables
  – Have different types (levels of measurement),
  – Have different units of measurement,
  – Have varying distributions of values,
  – Occur in different real-world circumstances.
• These issues require careful thought about how to present βs to convey the substantive meaning of the β for each variable.

Suggested resources
• Miller, J. E. 2013. The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
  – Chapter 10, on the Goldilocks problem
  – Chapter 4, on types of variables, units, and distribution
• Miller, J. E. and Y. V. Rodgers. 2008. “Economic Importance and Statistical Significance: Guidelines for Communicating Empirical Research.” Feminist Economics 14 (2): 117–49.

Suggested online resources
• Podcasts on
  – Interpreting multivariate regression coefficients
  – Resolving the Goldilocks problem
    • Measurement and variables
    • Model specification
    • Presenting results

Suggested practice exercises
• Study guide to The Chicago Guide to Writing about Multivariate Analysis, 2nd Edition.
  – Questions #1, 2, and 7 in the problem set for chapter 10.
  – Suggested course extensions for chapter 10:
    • “Reviewing” exercises #1 through 5.
    • “Applying statistics and writing” question #1.
    • “Revising” questions #1, 2, 3, 5, and 9.
  – “Getting to know your variables” assignment

Contact information
Jane E. Miller, PhD
jmiller@ifh.rutgers.edu
Online materials available at http://press.uchicago.edu/books/miller/multivariate/index.html