The Power of Replication: How (Not) to Interpret Empirical Findings
Michael Price
Georgia State University and NBER

The Basic Problem: Interpretation
• What parameters are measured by the study?
• Are the parameters that are measured applicable in other environments?
• How likely are the parameters that are measured to reflect the “truth”?

The Basic Problem: Interpretation
• What is the maintained theory and how does interpretation depend on the maintained theory?
  – Revealed altruism and the difference between acts of omission versus acts of commission
  – The importance of endowment on reference points and “framing” effects
• The ability of individuals (research partners) to sort and the availability of substitutes
  – Allowing individuals to avoid the ask in charity and subsequent patterns of giving
  – The effect of social comparisons on energy use in dorms/apartments versus single family homes in a small town

The Basic Problem: Interpretation
• The basic mechanics of scientific discovery…
  – The more independent researchers working on a problem, the less likely that the initial finding is “true”
  – The greater the research “bias” and the more sensitive a finding is to the maintained model, the less likely the finding is “true”
• The power of replication…
  – The more times any given study is replicated, the more likely that the findings are “true”

The Importance of the Maintained Model: “Framing” Effects
• A number of studies report that seemingly innocuous changes to a game lead to dramatic differences in outcomes
  – Payoffs to the recipient in a dictator game depend on whether the choice is framed as giving to or taking from the recipient
  – Final allocations differ between payoff-equivalent common pool resource and public goods games
• Are such differences “anomalies”?
  – The answer depends on the maintained model…

“Framing” Effects – Standard Model
• Any model with utility defined over final payoffs does not distinguish between acts of omission and acts of commission
  – Not sharing with a recipient in the dictator game is an act of omission
  – Taking from a recipient in the dictator game is an act of commission
• Consider an individual who is asked how to split $10 with another party
  – Giving $X to the recipient is the same as taking $(10 – X) from the recipient
  – Would thus expect final allocations to be independent of endowment

“Framing” Effects – Moral Costs
• Suppose that individuals feel guilty when chosen actions are deemed “selfish”
  – Such feelings would motivate giving in the dictator game
  – Concept is similar to social pressures in DellaVigna et al. (2012)
• U.S. law distinguishes between acts of omission and acts of commission when assigning liability
• Assume that feelings of guilt are stronger for acts of commission
  – Assignment of property rights and the associated action space will impact the split
  – Final payoff to the dictator will be lower when asked to take from the recipient

“Framing” the Results
• Suppose that one observes small but statistically significant differences in dictator payoff under Give and Take frames
• Interpretation of the data depends upon the maintained model
  – If one believes the “true” model is that of moral costs, the differences are predicted by theory and reflect that the games are different
  – If one believes the “true” model is defined over final payoffs only, the differences are at odds with theory and reflect “framing”

“Framing” the Results
• Example shows how researcher “bias” can influence what is viewed as the “true” state of the world
• One should thus ask how likely it is that the maintained model is valid
• Design replication studies that take on defining characteristics of the model
  – Inequality aversion predicts that indifference curves are backward above the 45 degree line
  – Efficiency preferences suggest player 1 would strictly prefer the bundle with payoffs (9, 7, 6) to
    one with payoffs (11, 7, 4)

Sorting and Non-Compliance
• Experiments may differ in the ability of, and/or the costs for, subjects to sort
  – Giving dictators the option to forgo potential profits to avoid being asked to share versus forcing them to share
  – Warning potential donors that a solicitor will be coming to their door during a given time period versus showing up unannounced
• Sorting fundamentally alters what parameters (motives) are reflected in subsequent actions
  – Donations to an unexpected solicitor reflect social pressures and altruism
  – With sorting, the importance of social pressures is lower

Sorting and Non-Compliance
• Randomize subjects into different remuneration schemes – conditional bonus, loss-framed bonus, piece rate
  – Typical experiment will focus on contemporaneous effects of the various compensation schemes
  – But the choice of compensation scheme may impact who remains with the company over the long run
• Long-run impacts will depend on what types of workers elect to remain with the company
  – Potential differences in the relative superiority of contract types in the short and long run…
  – Suppose that attrition is correlated with treatment – e.g., low productivity workers are less likely to remain if paid via piece rate

Sorting and Non-Compliance
• Nature of scientific discovery is that research tends to focus on contemporaneous effects first
• A number of examples highlight the benefits of replication studies that examine treatment effects over a longer horizon
  – Appearance of solicitor versus use of charitable raffle
  – Providing potential donors unconditional versus conditional gifts

Sorting and Non-Compliance
• A fundamental challenge in designing/interpreting experiments is the issue of compliance (exposure to treatment)
  – Parents who are offered an incentive to attend a parent academy but elect not to
  – Households that are sent, but do not open/read, a letter that includes a normative appeal to conserve energy
• In such instances what the experiment captures is an
    intent-to-treat effect – randomization is an imperfect instrument

Sorting and Non-Compliance
• Recall that the estimated treatment effect under IV is given by:

  $$\beta_{IV} = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}$$

  where Y is the outcome, Z indicates random assignment to treatment, and D indicates actual exposure to treatment
• If one cannot observe or model compliance, what is estimated is

  $$\beta = p\,E[Y \mid Z = 1, D = 1] + (1 - p)\,E[Y \mid Z = 1, D = 0] - E[Y \mid Z = 0]$$

  where p is the share of assigned subjects who are actually exposed to treatment

The Availability of Substitutes
• Growing body of work explores the impact of social comparisons on residential energy use
• Opower reports average reductions in consumption in the range of 2-3%
  – However, treatment leads to increased consumption in some utilities and reductions of up to 4-5% in others
• Studies that explore the effects within dorms or apartments report effects in the 15-20% range

The Availability of Substitutes
• Intuitively, the impact of such programs will depend on the ability of individuals to substitute away from in-home energy use
• Those living in dorms or large apartment complexes have more options to substitute away from in-home use
  – Watch TV or study in common rooms of dorm
  – Wash/dry clothes in common laundry room rather than apartment
• Cities with more amenities – movie theaters, public libraries, coffee houses, etc.
    – provide more substitution possibilities

The Availability of Substitutes
• Data used to analyze the impacts of such programs rarely include controls for substitutes
• Extent to which availability of substitutes predicts variation in estimated treatment effects is an unanswered question
  – Facilitate better predictions for those wishing to implement such policies
  – Facilitate deeper understanding of channels through which messages impact behavior

A Related Concern… Partner Selection
• Implementation of field experiments requires consent of a willing partner
  – Charity that is willing to test effectiveness of a given fund-raising technique
  – Utility that is willing to explore the effectiveness of price changes or targeted messages during periods of peak demand
  – School district that is willing to explore the effectiveness of teacher incentives/curriculum change
• What if those willing to implement experiments are fundamentally different from others?

A Related Concern… Partner Selection
• Utilities that are willing to explore the role of targeted messages as a means to manage peak demand are
  – More likely to face capacity/transmission constraints during peak periods (unobserved differences in consumers/market structure)
  – More likely to have implemented other strategies to manage demand during peak periods (unobserved differences in margins that can adjust)
  – More likely to believe that consumers will respond in desired way to treatment (unobserved differences in consumers/market structure)
• Extent to which such selection would impact estimated treatment effects is unknown… but can be understood through replication

Types of Replication
• Various levels of replication
  – Re-analyze existing data to check robustness of results
  – Implement experiment using similar protocol but different subject pool
  – Employ new research design to test the interpretation/validity of prior findings
• When to implement, and the benefits of, any given strategy depend on the underlying cause of concern

Types of Replication – Re-Analysis
• Want to re-analyze existing data when you believe that results are sensitive to modeling choices
  – Functional form assumptions or choice of controls
  – Rules for selecting relevant sample
• More common with naturally occurring data where identification relies upon choice of instrument
• However, there is scope for re-analysis of experimental data
  – Power of underlying statistical tests
  – Assumptions of linear treatment effect or specification of underlying model of interest
  – Potential imbalance across observables that affect outcomes

Types of Replication – Re-run Original Design
• Maniadis et al. (2014) provide a model that highlights conditions that influence the likelihood that a stated research finding is “true”
  – Prior belief on the existence/magnitude of a particular association
  – Number of independent research teams working on a problem
  – Extent to which interpretation of finding is influenced by maintained model – potential for researcher “bias”
  – Number of replication studies that report similar findings
• Framework highlights conditions under which one may want to re-run the original experiment using a new subject pool

Types of Replication – Re-run Original Design
• The likelihood of a false positive is greater the lower the prior one places on the existence/magnitude of a reported effect
  – Concern is not with choice of design per se but the likelihood that findings reflect “luck” or a draw from a small sample
  – Concern exacerbated by tendency for journals to publish “unexpected” results
• The likelihood of a false positive for an initial finding is greater the more independent research teams are exploring a question

Types of Replication – New Study Design
• The likelihood of a false positive is greater the more likely it is that the researcher is “biased”
  – Design protocol in a way that “forces” result
  – Interpret data in a way that is colored by maintained model
  – Results depend on ability of subjects to sort/availability of substitutes
• When the underlying concern is research “bias”, want to explore new study designs
  – Introduce sorting in the dictator game
  – Examine choices in regions where models have distinct predictions
  – Examine choice across domains with more/fewer substitutes and control for such

Take Away Thoughts…
• A number of factors influence what any given experiment measures and how to interpret the results
• Nature of scientific discovery suggests the power of replication
  – Tendency for journals to publish “novel” or “unexpected” findings
  – Sensitivity of results to maintained model and how that influences the design
  – Heterogeneity in treatment effects and influence of partner selection and characteristics of environment on such

Take Away Thoughts…
• Various levels of replication that address different concerns
  – Re-analyze existing data
  – Re-run original design with new subject pool
  – Design new set of experiments to explore robustness of a result
• Intuitive criteria allow the researcher/practitioner to determine which results should be replicated and what approach to take
• Replication need not be a dirty word or something we shy away from… embrace it and do not be afraid to question prior findings
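
Illustration: how likely is a finding to be “true”?
The conditions attributed above to Maniadis et al. (2014), namely the prior, the number of competing teams, researcher “bias”, and the number of replications, can be made concrete with a short Python sketch. This is an illustration under stated assumptions, not the authors' code. It uses a standard Bayesian post-study probability with a prior `prior`, significance level `alpha`, and statistical power `power`. All function names, parameter names, and numeric values are hypothetical choices for illustration. The competing-teams and bias formulas follow one common statement of this kind of framework and should be checked against the paper before being relied on.

```python
# Sketch only: every name and default value below is an assumption made for
# illustration, not something taken from the slides.

def psp(prior, alpha=0.05, power=0.8):
    """Probability that a statistically significant finding is true (Bayes' rule)."""
    true_pos = power * prior            # effect exists and is detected
    false_pos = alpha * (1.0 - prior)   # no effect, but the test rejects anyway
    return true_pos / (true_pos + false_pos)


def psp_competing_teams(prior, teams, alpha=0.05, power=0.8):
    """Same probability when `teams` independent teams test the hypothesis
    and a positive result is reported if at least one team finds it."""
    beta = 1.0 - power
    true_pos = (1.0 - beta ** teams) * prior
    false_pos = (1.0 - (1.0 - alpha) ** teams) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def psp_with_bias(prior, bias, alpha=0.05, power=0.8):
    """Same probability when a share `bias` of would-be negative results
    ends up reported as positive (design or interpretation forces the result)."""
    beta = 1.0 - power
    true_pos = (power + bias * beta) * prior
    false_pos = (alpha + bias * (1.0 - alpha)) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def update_after_replications(prob, successes, failures=0, alpha=0.05, power=0.8):
    """Update the probability after independent, unbiased replications.

    Each successful replication multiplies the odds by power / alpha;
    each failed replication multiplies them by (1 - power) / (1 - alpha).
    """
    odds = prob / (1.0 - prob)
    odds *= (power / alpha) ** successes
    odds *= ((1.0 - power) / (1.0 - alpha)) ** failures
    return odds / (1.0 + odds)


if __name__ == "__main__":
    initial = psp(prior=0.1)
    print("single study:          ", round(initial, 3))
    print("ten competing teams:   ", round(psp_competing_teams(0.1, teams=10), 3))
    print("researcher bias of 0.1:", round(psp_with_bias(0.1, bias=0.1), 3))
    print("after one replication: ", round(update_after_replications(initial, 1), 3))
```

Under these assumed values, a lone significant finding with a prior of 0.1 has a probability of about 0.64 of being true. Ten competing teams pull that down to roughly 0.22 and a bias of 0.1 to about 0.39, while a single successful independent replication raises 0.64 to about 0.97.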