Estimating Interaction Effects Using Multiple Regression
Herman Aguinis, Ph.D.
Mehalchin Term Professor of Management
The Business School, University of Colorado at Denver
www.cudenver.edu/~haguinis

Overview
• What is an Interaction Effect?
• The "So What" Question: Importance of Interaction Effects for Theory and Practice
• Estimating Interaction Effects Using Moderated Multiple Regression (MMR)
• Problems with MMR
• Aguinis, Beaty, Boik, & Pierce (2005, J. of Applied Psychology)
• The "Now What" Question: Addressing Problems with MMR
• Some Conclusions

What is an Interaction Effect?
• The relationship between X and Y depends on Z (i.e., a moderator)
[Diagram: path models showing Z moderating the X → Y relationship]
• Other terms used:
– Population control variable (Gaylord & Carroll, 1948); subgrouping variable (Frederiksen & Melville, 1954); predictability variable (Ghiselli, 1956); referent variable (Toops, 1959); modifier variable (Grooms & Endler, 1960); homologizer variable (Johnson, 1966)

Importance of Interaction Effects: Theory
• Going beyond main effects
• We typically say "it depends"
• More complex models
• "If we want to know how well we are doing in the biological, psychological, and social sciences, an index that will serve us well is how far we have advanced in our understanding of the moderator variables of our field" (Hall & Rosenthal, 1991, p. 447)

Importance of Interaction Effects: Practice
For example, personnel selection:
• Test bias: The relationship between a test and a criterion depends on gender or ethnicity
• "No bias exists if the regression equations relating the test and the criterion are indistinguishable for the groups in question" (Standards, 1999, p. 79)
• In other words, the X–Y relationship differs depending on the value of Z (e.g., 1 = Female, 0 = Male)

Illustration of Gender as a Moderator in Personnel Selection
[Figure: Job Performance regressed on Test Scores, with separate regression lines for women (Ŷwomen) and men (Ŷmen) and a common line (Ŷcommon)]

Importance of Interaction Effects: Practice
• Management in general
– Does an intervention work equally well for, say, Cantonese and American employees working in Hong Kong? (categorical moderator)
• Example: a performance management system for teaching at a university in Hong Kong. Would the same evaluation methods lead to employee (i.e., faculty) satisfaction regardless of the national origin of faculty members?

Estimating Interaction Effects
• Moderated Multiple Regression (MMR)
• Ŷ = a + b1 X + b2 Z + b3 X·Z, where
– Y = criterion (continuous variable)
– X = predictor (typically continuous)
– Z = moderator (continuous or categorical)
– X·Z = product term carrying information about the moderating effect (i.e., the interaction between X and Z)

Statistical Significance Test
• Step 1: Ŷ = a + b1 X + b2 Z (yields R²₁, with k₁ predictors)
• Step 2: Ŷ = a + b1 X + b2 Z + b3 X·Z (yields R²₂, with k₂ predictors)
• F = [(R²₂ − R²₁) / (k₂ − k₁)] / [(1 − R²₂) / (N − k₂ − 1)]; H₀: ρ²₁ = ρ²₂
• Equivalently, H₀: β3 = 0 (using a t statistic)

Estimating Interaction Effects Using Moderated Multiple Regression (MMR)
• Ŷ = a + b1 X + b2 Z + b3 X·Z
• For example:
– Personnel selection: Y = measure of performance, X = test score, Z = gender
– Additional research areas: training, turnover, performance appraisal, return on investment, mentoring, self-efficacy, job satisfaction, organizational commitment, and career development, among others

Interpreting Interactions (Z is continuous)
• Ŷ = a + b1 X + b2 Z + b3 X·Z
• b3 = 2 means that a one-unit increase in Z increases the slope of Y on X by 2 points (and, symmetrically, a one-unit increase in X increases the slope of Y on Z by 2 points)

Interpreting Interactions (Z is binary, dummy coded)
• Ŷ = a + b1 X + b2 Z + b3 X·Z
• b3 = estimated difference between the slope of Y on X in the group coded 1 and the slope in the group coded 0.
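The two-step model comparison on the "Statistical Significance Test" slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the software discussed later in the deck: the function name `mmr_f_test`, the simulated data, and the coefficient values are all illustrative assumptions.

```python
import numpy as np

def mmr_f_test(y, x, z):
    """Hierarchical F test for the X*Z product term:
    F = [(R2_2 - R2_1)/(k2 - k1)] / [(1 - R2_2)/(N - k2 - 1)]."""
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)

    def r_squared(*cols):
        X = np.column_stack((np.ones(n),) + cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid @ resid / sst

    r2_1 = r_squared(x, z)           # Step 1: additive model (k1 = 2)
    r2_2 = r_squared(x, z, x * z)    # Step 2: adds the product term (k2 = 3)
    f = (r2_2 - r2_1) / ((1.0 - r2_2) / (n - 3 - 1))
    return r2_1, r2_2, f

# Illustrative data with a built-in moderating effect (b3 = 0.5)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = (rng.random(n) < 0.5).astype(float)   # dummy-coded moderator
y = 1.0 + 0.3 * x + 0.2 * z + 0.5 * x * z + rng.normal(scale=0.5, size=n)

r2_1, r2_2, f = mmr_f_test(y, x, z)
print(r2_1, r2_2, f)
```

With a nonzero b3 in the population, R²₂ exceeds R²₁ and the F statistic for the product term is large; with b3 = 0 the increment is negligible.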
• b2 = estimated difference in Y between a member of the group coded 1 and a member of the group coded 0 when X = 0 (i.e., the difference between the two groups' intercepts).
• b1 = estimated slope of Y on X for the group coded 0.
• a = estimated Y score for members of the group coded 0 when X = 0 (i.e., that group's intercept).

Pervasive Use of MMR in the Organizational Sciences
• Recent review: MMR was used in over 600 attempts to detect moderating effects of categorical variables in AMJ, JAP, and PP between 1977 and 1998 (Aguinis, Beaty, Boik, & Pierce, 2005, JAP)

Selected Research on MMR
• Aguinis (2004, Regression Analysis for Categorical Moderators, Guilford Press)
• Aguinis, Beaty, Boik, and Pierce (2005, J. of Applied Psychology)
• Aguinis, Boik, and Pierce (2001, Organizational Research Methods)
• Aguinis, Petersen, and Pierce (1999, Organizational Research Methods)
• Aguinis and Pierce (1998, Organizational Research Methods)
• Aguinis and Pierce (1998, Ed. & Psychological Measurement)
• Aguinis and Stone-Romero (1997, J. of Applied Psychology)
• Aguinis, Bommer, and Pierce (1996, Ed. & Psychological Measurement)
• Aguinis (1995, J. of Management)

Methodology: Monte Carlo Simulations
• Research question: Does MMR do a good job of estimating moderating effects?
• Difficulty: We don't know the population
• Solution: Monte Carlo methodology
– Create a population
– Generate random samples
– Perform MMR analyses on the samples
– Compare the population against the samples
– Assess the percentage of hits and misses

Problems with MMR
1. We don't find moderators
2. If we find them, they are small
Why should we care?
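The Monte Carlo recipe on the methodology slide (create a population with a known effect, draw random samples, run MMR on each, count hits) can be sketched as below. Everything here is an illustrative assumption rather than a setup from the studies cited: the function names, the effect magnitudes, N = 300, the number of replications, and the use of 3.84 as the large-df approximation to the α = .05 critical value of F with 1 numerator df.

```python
import numpy as np

def f_stat(y, x, z):
    """F statistic for adding the X*Z product term to y ~ 1 + x + z."""
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)

    def r2(*cols):
        X = np.column_stack((np.ones(n),) + cols)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1.0 - resid @ resid / sst

    r2_red, r2_full = r2(x, z), r2(x, z, x * z)
    return (r2_full - r2_red) / ((1.0 - r2_full) / (n - 4))

def mmr_power(n, b3, reps=1000, seed=1):
    """Create a population with a known moderating effect b3, draw
    `reps` random samples of size n, test the product term in each,
    and return the rejection (hit) rate.  3.84 approximates the
    alpha = .05 critical value of F(1, large df)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        z = (rng.random(n) < 0.5).astype(float)  # dummy-coded moderator
        y = 0.3 * x + 0.2 * z + b3 * x * z + rng.normal(size=n)
        hits += f_stat(y, x, z) > 3.84
    return hits / reps

p_null = mmr_power(300, b3=0.0)  # no moderating effect: hit rate near alpha
p_mod = mmr_power(300, b3=0.4)   # real moderating effect: higher hit rate
print(p_null, p_mod)
```

Under the null the hit rate should hover around .05 (the Type I error rate); with a genuine product term it rises toward 1 as N and b3 grow, which is exactly the "power" the simulations in this literature estimate.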
• Theory: Failure to find support for correct hypotheses (derailment of the theory-advancement process; model misspecification)
• Practice: Erroneous decision making (e.g., over- and under-prediction of performance; implementation of ineffective interventions)
– Ethical implications
– Legal implications

Some Culprits for Erroneous Estimation of Moderating Effects
• Small total sample size
• Unequal sample size across moderator-based groups
• Range restriction (i.e., truncation) in predictor variable X
• Scale coarseness
• Violation of the homogeneity of error variance assumption
• Unreliability of measurement
• Artificial dichotomization/polychotomization of continuous variables
• Interactive effects among these artifacts

Unequal Sample Size Across Moderator-based Subgroups
• Applies to categorical moderators (e.g., gender, national origin)
• In many research situations, n₁ ≠ n₂
• Two studies examined this issue (Aguinis & Stone-Romero, 1997; Stone, Alliger, & Aguinis, 1994; see also Aguinis, 1995)
• Effective sample size: N′ = 2n₁n₂ / (n₁ + n₂)
• Conclusion: n₁ needs to be .3·n₂ or larger to detect medium moderating effects

Truncation in Predictor X
• Non-random sampling
• Pervasive in field settings (systematic in personnel selection/test validation research: observations (X, Y) are available only when X > x)
• Aguinis and Stone-Romero (1997) (categorical moderator); McClelland and Judd (1993) (continuous moderator)
• Truncation has a dramatic impact on power
– N = 300, medium moderating effect: power = .81
– Same conditions with truncation = .80: power = .51
• Conclusion: Even mild levels of truncation can have a substantial detrimental effect on power

Violation of the Homogeneity of Error Variance Assumption
• Applies to categorical moderators
• Error variance: the variance in Y that remains after predicting Y from X; it is assumed to be equal across subgroups (e.g., women, men)
• σ²e(i) = σ²Y(i) (1 − ρ²XY(i))
• Distinct from the homoscedasticity assumption

Regression of Homoscedastic Data
[Figure: scatter plot of Criterion (Y) on Predictor (X) with a fitted regression line for the total sample of women and men]

Regression for Subgroups
[Figures: separate scatter plots of Criterion (Y) on Predictor (X), with fitted regression lines, for women and for men]

Artificial Polychotomization of Continuous Variables
• Median splits and other common methods of "simplifying the data" before conducting ANOVAs
• Cohen (1983) showed this practice is inappropriate
• In the context of MMR, some have used a median-split procedure on a continuous predictor Z and compared correlations across the resulting groups
• MMR always performs better than comparing artificially created subgroups (Stone-Romero & Anderson, 1994)
• Conclusion: Do not polychotomize truly continuous predictors

Interactions Among Artifacts
• Concurrent manipulation of truncation, N, n₁ and n₂, and moderating-effect magnitude (Aguinis & Stone-Romero, JAP, 1997)
• Results: Methodological artifacts have interactive effects on power
• Even if conditions are favorable regarding one factor (e.g., N), unfavorable conditions regarding other factors (e.g., truncation) will lead to low power
• Conclusion: Relying on a single strategy (e.g., increasing N) to improve power will not be successful if other methodological and statistical artifacts are present

Aguinis, Beaty, Boik, & Pierce (2005, JAP)
• Q1: What is the size of observed moderating effects of categorical variables in published research?
• Q2: What would the size of moderating effects of categorical variables in published research be under conditions of perfect reliability?
• Q3: What is the a priori power of MMR to detect moderating effects of categorical variables in published research?
• Q4: Do MMR tests reported in published research have sufficient statistical power to detect moderating effects conventionally defined as small, medium, and large?
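The effective-sample-size formula from the unequal-subgroup-size slide, N′ = 2n₁n₂/(n₁ + n₂), is just the harmonic mean of the two subgroup sizes: for a fixed total N it is maximized by a balanced split, which is why lopsided moderator groups cost power. A minimal numeric check (the subgroup sizes are illustrative):

```python
def effective_n(n1, n2):
    """Effective sample size for a two-group categorical moderator:
    N' = 2*n1*n2 / (n1 + n2), the harmonic mean of n1 and n2."""
    return 2 * n1 * n2 / (n1 + n2)

# For a fixed total of 200 observations, N' is largest for a balanced
# split and drops sharply as the split becomes lopsided:
print(effective_n(100, 100))  # 100.0
print(effective_n(150, 50))   # 75.0
print(effective_n(180, 20))   # 36.0
```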
Method
• Review of all articles published from 1969 to 1998 in the Academy of Management Journal (AMJ), Journal of Applied Psychology (JAP), and Personnel Psychology (PP)
• Criteria for study inclusion:
– At least one MMR analysis
– The MMR analysis included a continuous criterion Y, a continuous predictor X, and a categorical moderator Z

Effect Size and Power Computation
• Total of 636 MMR analyses
• Moderator sample sizes were available for 507 (79.72%)
• Moderator-group sample sizes and predictor–criterion rs were available for 261 (41.04%)
• Effect sizes and power were computed for the 261 MMR analyses for which ns and rs were available; we used SD information when available and assumed homogeneity of error variance otherwise

Results (I): Frequency of MMR Use over Time
[Figure: number of MMR analyses per publication year, 1977–1997]

Q1: Size of Observed Effects (I)
• Effect size metric: f² = (R²₂ − R²₁) / (1 − R²₂)
• Median f² = .002
• Mean (SD) = .009 (.025)
– 95% CI = .0089 to .0091
– 25th percentile = .0004
– 75th percentile = .0053
• Effect-size values over time: r(261) = .15, p < .05

Q1: Size of Observed Effects (II)
• AMJ (k = 6): Mean (SD) = .040 (.047), Median = .025
• JAP (k = 236): Mean (SD) = .007 (.024), Median = .002
• PP (k = 19): Mean (SD) = .017 (.025), Median = .006
• F(2, 258) = 4.97, p = .008, η² = .04
• Tukey HSD tests: AMJ > JAP and PP > JAP

Q1: Size of Observed Effects (III)
• Gender (k = 63): Mean (SD) = .005 (.011), Median = .002
• Ethnicity (k = 45): Mean (SD) = .002 (.002), Median = .001
• Other (k = 153): Mean (SD) = .013 (.031), Median = .002
• F(2, 258) = 8.71, p < .001, η² = .06
• Tukey HSD tests: Other > Ethnicity

Q1: Size of Observed Effects (IV)
• Personnel Selection (k = 20): Mean (SD) = .010 (.023), Median = .001
• Other (k = 241): Mean (SD) = .009 (.025), Median = .002
• t(259) = −.226, p = ns
• Work Attitudes (k = 96): Mean (SD) = .005 (.015), Median = .002
• Other (k = 165): Mean (SD) = .011 (.029), Median = .002
• t(259) = −0.95, p = ns

Q2: Construct-level Effects (I)
• Median f² = .003
– An increase of .001 over the median observed effect size
• Mean = .017
– An increase of .008 over the mean observed effect size

Q3: Statistical Power (I) and (II)
[Figures: a priori statistical power of the published MMR tests plotted against effect size]

Q4: Power to Detect Small, Medium, and Large Effects
• Small f² (.02): mean power = .84; 72% of tests would have power of .80 or higher
• Medium f² (.15): mean power = .98
• Large f² (.35): mean power = 1.0

Some Conclusions
• We expected effect sizes to be small, but not this small (i.e., a median of .002)
• Computation of construct-level effect sizes did not improve things by much (i.e., a median of .003)
• More encouraging results:
– None of the 95% CIs around the mean effect size for the various comparisons included zero
– Effect sizes have increased over time
– Given the observed sample sizes, mean power is sufficient to detect effects ≥ .02
– 72% of studies had sufficient power to detect an effect ≥ .02

Some Implications
• Are theories in dozens of research domains incorrect in hypothesizing moderators?
• Are hundreds of researchers in dozens of disparate domains wrong, and are population moderating effects really so small?
• Could be, but more likely methodological artifacts decrease the observed effect sizes substantially vis-à-vis their population counterparts
• More attention needs to be paid to design and analysis issues that decrease observed effect sizes
• Conventional definitions of effect size (f²) for moderators should probably be revised

The "Now What" Question
• Before data are collected:
– Larger sample size*
– More reliable measures*
– Avoid truncated samples*
– Use non-coarse scales (e.g., program by Aguinis, Bommer, & Pierce, 1996, Ed. & Psych. Measurement)
– Equalize sample size across moderator-based subgroups
– Use computer programs in the public domain to estimate the sample size needed for a desired power level
– Gather information on research-design trade-offs
• *Easier said than done!

Tools to Improve Moderating Effect Estimation (Aguinis, 2004)
• Scale coarseness
– Aguinis, Bommer, and Pierce (1996, Educational & Psychological Measurement)
• Homogeneity of error variance
– Aguinis, Petersen, and Pierce (1999, Organizational Research Methods)
• Power estimation and research-design trade-offs
– Aguinis, Pierce, and Stone-Romero (1994, Educational & Psychological Measurement)
– Aguinis and Pierce (1998, Educational & Psychological Measurement)
– Aguinis, Boik, and Pierce (2001, Organizational Research Methods)

Assessment of Assumption Compliance
• DeShon and Alexander's (1996) 1.5 rule of thumb
• Bartlett's homogeneity test:
M = [(Σᵢvᵢ) ln(Σᵢvᵢsᵢ² / Σᵢvᵢ) − Σᵢvᵢ ln sᵢ²] / [1 + (1/(3(k − 1)))(Σᵢ1/vᵢ − 1/Σᵢvᵢ)]
• k = number of subgroups
• nₖ = number of observations in each subgroup
• s² = subgroup variance on the criterion
• v = degrees of freedom on which s² is based

Homogeneity is not Met... Now What?
• Use alternatives to MMR
– Alexander and colleagues' normalized-t approximation:
zᵢ = c + (c³ + 3c)/b − (4c⁷ + 33c⁵ + 240c³ + 855c)/(10b² + 8bc⁴ + 1000b),
where a = vᵢ − .5; b = 48a²; c = √[a ln(1 + tᵢ²/vᵢ)]; and vᵢ = nₖ − 2
– OR James's second-order approximation:
[Equation: James's second-order critical value h(α), a lengthy expression in weighted subgroup statistics; see Aguinis (2004) for the complete formula]

Program ALTMMR
• Calculates:
– Error variance ratio (the largest ratio when there are more than 2 subgroups)
– Bartlett's M
– James's J
– Alexander's A
• Uses sample descriptive data:
– nₖ, sx, sy, rxy
– User sets p = .05 or .01 (for all but James's statistic)

Program ALTMMR
• Described in detail in Aguinis (2004)
• Available at www.cudenver.edu/~haguinis/ (click on the MMR icon on the left side of the page)
• Executable on-line or locally

Power Estimation
• Program POWER
– Aguinis, Pierce, and Stone-Romero (1994, Ed. & Psych. Measurement)
• Program MMRPWR
– Aguinis and Pierce (1998, Ed. & Psych. Measurement)
• Program MMRPOWER
– Aguinis, Boik, and Pierce (2001, Organizational Research Methods)

Program MMRPOWER
• Problems/challenges with the POWER and MMRPWR programs:
– Based on extrapolation from simulations, so the range of values covered is limited
– Absence of factors known to affect the power of MMR (e.g., unreliability)
• Theoretical approximation to power:
[Equation: power expressed as the probability that an F statistic with k − 1 and N − 2k degrees of freedom exceeds its critical value under the alternative; see Aguinis, Boik, & Pierce (2001) for the complete expression]

Program MMRPOWER
• Described in detail in Aguinis (2004)
• Available at www.cudenver.edu/~haguinis/ (click on the MMR icon on the left side of the page)
• Executable on-line or locally

Some Conclusions
• Observed moderating effects are very small
• MMR is a low-power test for detecting effect sizes as typically observed
• Researchers are not aware of problems with MMR
• There are implications for both theory and practice
• User-friendly programs are available that allow researchers to improve moderating-effect estimation
• Using these tools will allow researchers to make more informed decisions regarding the operation of moderating effects
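As a closing aside, Bartlett's M from the "Assessment of Assumption Compliance" slide can be computed directly from subgroup summary statistics; it needs only the subgroup variances and their degrees of freedom, not the raw data. This sketch is not the ALTMMR program: the function name is my own and the variance and df values are illustrative.

```python
import math

def bartlett_m(variances, dfs):
    """Bartlett's M for homogeneity of subgroup (error) variances:
    M = [(sum v_i) ln(pooled s^2) - sum v_i ln s_i^2]
        / [1 + (1/(3(k-1))) (sum 1/v_i - 1/sum v_i)],
    compared against a chi-square with k - 1 degrees of freedom."""
    k = len(variances)
    v_sum = sum(dfs)
    s2_pooled = sum(v * s2 for v, s2 in zip(dfs, variances)) / v_sum
    numerator = v_sum * math.log(s2_pooled) - sum(
        v * math.log(s2) for v, s2 in zip(dfs, variances))
    correction = 1 + (sum(1 / v for v in dfs) - 1 / v_sum) / (3 * (k - 1))
    return numerator / correction

# Two subgroups whose Y-on-X error variances differ (4.0 vs. 9.0):
m = bartlett_m([4.0, 9.0], dfs=[58, 38])
print(m)
```

Equal subgroup variances give M = 0, and M grows as the variances diverge, which is what makes it usable as a pre-check before trusting the ordinary MMR F test.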