Assessing the Correspondence between Experimental Results
Obtained in the Lab and Field: A Review of Recent Social Science
Research
Alexander Coppock and Donald P. Green∗
Columbia University
July 25, 2013
PRELIMINARY DRAFT NOT FOR CITATION WITHOUT PERMISSION
Abstract
A small but growing social science literature examines the correspondence between experimental results obtained in lab and field settings. We review this literature and reanalyze a set
of recent experiments carried out in parallel in both lab and field. Using a standardized format that calls attention to both the point estimates and the statistical uncertainty surrounding
them, we analyze the overall pattern of lab-field correspondence. Lab-field correspondence is
found to be quite strong (Spearman’s ρ = 0.73). Finally, we discuss some of the limitations of
the current manner in which lab-field comparisons are constructed and suggest directions for
future research.
∗ Please direct comments to ac3242@columbia.edu (Alexander Coppock) or dpg2110@columbia.edu (Donald P. Green). We are grateful to Oliver Armantier, Peter Aronow, Lindsay Dolan, Uri Gneezy, John List, Jennifer Jerit, and Andrej Tusicisny for helpful comments, Glenn Harrison and Nicholas Valentino for graciously providing replication data, and Johannes Abeler and Ido Erev for providing supplementary information.

1 Introduction

Lab experiments and field experiments offer complementary approaches to the study of cause-and-effect. Both methods attempt to isolate the causal influence of one or more interventions by
eliminating the systematic intrusion of confounding factors. Experiments carried out in lab or
field settings typically allocate subjects randomly to treatment and control groups, ensuring that
those assigned to each group have the same expected potential outcomes. Apparent differences in
outcomes between treatment and control groups therefore reflect either the effect of the treatment
or random sampling variability.
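To make this logic explicit in standard potential-outcomes notation (our gloss; none of the studies reviewed below use this notation), random assignment makes treatment status Z_i independent of the potential outcomes, so the difference in observed group means is unbiased for the average treatment effect:

\{Y_i(1), Y_i(0)\} \;\perp\; Z_i \quad \Longrightarrow \quad E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] \;=\; E[Y_i(1)] - E[Y_i(0)] \;=\; \text{ATE}.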
The interpretation of experimental results, however, is a point of contention between lab and
field researchers. Although the line between lab and field is sometimes blurry (Gerber and Green
2012), lab and field studies typically differ in terms of who the subjects are, the context in which
the subjects receive the treatments, and the manner in which outcomes are measured. At the
risk of caricaturing lab and field studies, one might say that the canonical lab study involves
undergraduate subjects, an artificial environment in which subjects know that they are being
studied, and outcomes that are measured shortly after the experimental intervention. The canonical
field study assesses the effects of a real-world intervention on those who would ordinarily encounter
it, does not alert subjects to the fact that they are being studied, and measures outcomes sometime
after the intervention occurs.
For concreteness, consider the contrast between lab and field experiments on vote choice. Großer
and Schram (2006) study voter turnout in elections by providing undergraduate subjects with
a schedule of monetary payoffs that varied according to both the electoral outcome and each subject’s private voting costs; the experimental intervention was whether some subjects’ turnout decisions were observed and by whom. Approximately 90 seconds after the intervention,
the researchers measured two outcomes: whether subjects voted and, if so, for which candidate.
By contrast, Gerber (2004) presents a field experiment that assesses the effects of information
on voter turnout and candidate choice. Randomly selected registered voters were sent mail from
an actual candidate during state legislative elections; several days later, outcomes were assessed
by post-election interviews with subjects in the treatment and control groups. The subjects who
received direct mail were unaware that they were part of a research study, and respondents to the
survey were asked about their candidate preference and turnout before any mention was made of
the mailings.
These two studies illustrate some of the ways in which field and lab studies may differ. The lab
study relies on a convenience sample of undergraduate subjects, whereas the field study draws its
subjects from the voter rolls. The lab study presents subjects with an abstract election campaign in
which the only information available to subjects is controlled by the experimenter; the field study
takes place in the context of an actual election, which means that the intervention must compete
with subjects’ background knowledge and competing information. Outcomes in the two studies
are measured at different points in time, with the lab study gauging responses immediately after
presentation of the stimulus and the field study assessing effects days later. Finally, the lab study is
obtrusive in the sense that subjects are aware of the fact that researchers are studying their voting
behavior; the field study’s post-election interview notifies subjects that research is being conducted,
but the connection to the experimental stimulus remains opaque.
Each of these design features may have important implications for how the results are interpreted. Results may be distorted if subjects know they are being watched, especially if they perceive
a connection between treatments and outcomes. Failure to control or measure how subjects receive
the treatment in field settings leads to uncertainty about the meaning of an apparent treatment
effect. Outcomes measured immediately after the administration of a treatment may be a poor
guide to long-term outcomes. Attrition of subjects from follow-up measurement may introduce
bias. Even if these methodological concerns were inconsequential, there remains the substantive
concern that a lab study addresses a different type of causal effect than a corresponding field study,
which assesses the effects of information in a context of competing information and demands for
voter attention. When researchers argue about the relative merits of lab and field experimentation
(Camerer 2011, Levitt and List 2007a, Falk and Heckman 2009, Gneezy and List 2006), these
issues - obtrusiveness, treatment fidelity, outcome measurement, and context-dependent treatment
effects - tend to occupy center stage.
One way to address this controversy is to turn the sensitivity of results into an empirical
question. Recent years have seen the emergence of a literature that assesses whether results obtained
in the lab are echoed in the field and vice versa. Although no one to our knowledge has attempted
a comprehensive assessment of the literature, two of the most prominent articles to address this
question have come to different conclusions about lab-field correspondence. Levitt and List (2007a)
expresses skepticism about the potential for laboratory findings to generalize to the field, citing the
ways in which a typical laboratory experiment changes the decision environment, whereas Camerer
(2011) stresses evidence showing agreement across the two domains.
Levitt and List describe five features of typical laboratory experimentation that are likely to
change behavior relative to the field: moral and ethical considerations, scrutiny, decision context,
selection, and stakes. For each of these features, Levitt and List review lab-field comparisons, noting
instances where lab results were sensitive to the contextual factors noted above. They emphasize
the theoretical differences between lab and field, and why those differences should matter when
extrapolating from one context to the other. Levitt and List conclude, “Theory is the tool that
permits us to take results from one environment to predict in another, and generalizability of
laboratory evidence should be no exception.” (p. 170)
Writing in response to Levitt and List, Camerer (2011) discusses six close comparisons of lab and
field studies conducted in parallel settings.1 The criteria by which Camerer judges correspondence
are study-specific: one pair arrived at the same effect sign, another recovered similar coefficients,
another displayed modest “prosociality” correlations across contexts. In the remainder of the paper,
Camerer reviews lab-field correspondence in which the study designs are imperfectly matched or
in which the subject populations are quite different, and finds general agreement. Camerer (p. 35)
concludes, “There is no replicated evidence that experimental economics lab data fail to generalize
to central empirical features of field data.”
In addition to disputing the current state of lab-field comparability, the two papers disagree as
to whether lab experiments ought to generalize in the first place.2 Levitt and List (p. 170) suggest
that “perhaps the most fundamental question in experimental economics is whether findings from
the lab are likely to provide reliable inferences outside of the laboratory.” Camerer (p. 5) replies:
[T]he phrase “promise of generalizability” in my title describes two kinds of promise.
The first “promise” is whether every lab experiment actually promises to generalize
to a particular field setting. The answer is “No.” . . . The second kind of “promise” is
whether generalizability is likely to be “promising” (i.e., typically accurate) for those
lab experiments that are specifically designed to have features much like those in closely
parallel settings. The answer is “Yes.”

1. Two of these studies are included in our set of twelve, but the remaining four failed to meet our inclusion criteria.
2. Although the attempt to settle the question of lab-field correspondence empirically is relatively new in the social sciences, the debate over the purpose of lab experimentation and its “external validity” is an old one in other fields such as social psychology (Sears 1986, Mook 1983).
We set out to study whether lab-field correspondence is strong under what might reasonably be
described as “closely parallel” conditions.
The question is an important one, as the stakes are high. All else being equal, if correspondence
is strong, the marginal research dollar might be better spent in the lab than in the field. Field
experimentation can be expensive, logistically challenging, and ethically encumbered. If laboratory
experiments can consistently predict field treatment effects, then the lab offers clear advantages, in
particular, the relative ease of conducting experiments and systematically extending them through
replication. Further, theoretical control over incentives and behaviors might allow for the identification of some effects that are beyond the reach of field experiments (the impact of changes in
national election voting rules, for example).
The aim of this paper is to review the literature on lab-field correspondence. We begin by describing the criteria used to identify close lab-field comparisons. Next, we provide a brief narrative
summary of each of the studies and compare lab and field results using a standardized format that
calls attention to both the point estimates and the statistical uncertainty surrounding them. Although this collection of studies is too small to test specific theories about lab-field correspondence,
we can assess the overall level of lab-field agreement in the extant literature. After summarizing
the overall pattern of statistical results, which reveals a high degree of correspondence (Spearman’s
ρ = 0.73), we discuss some of the limitations of the current manner in which lab-field comparisons
are constructed and suggest directions for future research.
2 Method
In order to construct a systematic review of lab-field correspondence, we sought to 1) identify a
set of lab-field comparisons, 2) make explicit what we mean by correspondence, and 3) standardize
analytic procedures to facilitate cross-study comparison.
Study Selection. We gathered a comprehensive set of recent studies in the social sciences that
mention or reference the comparison of results across lab and field. We expanded on the many
comparisons noted by Camerer, Levitt, and List by following chains of citations and by conducting searches on the terms “lab experiment,” “field experiment,” “lab-field correspondence,” and
“generalize from lab to field.” This initial search netted approximately 80 journal articles and
unpublished manuscripts, which are listed in the appendix. The investigations are from many
domains: experimental economics, sociology, and political science. We whittled this sample of 80
articles down to twelve in three steps:
I: Explicit Pairing of Studies
We kept only those studies whose authors set out to make an explicit lab-field comparison.
Either the authors conducted parallel lab and field experiments themselves or they named
a specific field (lab) experiment their lab (field) study was addressing. We excluded several
field experiments that sought to test in the field some well-established laboratory result - loss aversion, for example - but did not name a particular comparison study.3 In a small
number of cases, we excluded studies reporting lab and field experiments that were not, in
our judgment, sufficiently parallel (such as King and Ahmad 2010).
II: Definitions of Lab and Field
As noted above, the distinction between the lab and field experimentation is not always clear.
Lab and field experiments may differ along a number of dimensions, including treatments,
subjects, contexts, and outcome measures. The “explicit pair” filter eliminated almost all
borderline cases, which obviated the need for strict definitions of lab and field. We nevertheless excluded “lab-in-the-field” studies in which subjects played a laboratory game outside a
university context (such as Benz and Meier 2008).
3. We could have chosen a representative lab study against which to assess lab-field correspondence, but the correspondence might be strong or weak depending on the comparison chosen. The main advantage of this approach is that it limits our own discretion.

III: Estimation of a Treatment Effect
The set of lab and field studies was further restricted to randomized experiments that estimated treatment effects. In other words, admissible experiments had to assess the effect of
a randomly assigned manipulation. Measurement studies, by contrast, estimate the level of
an outcome variable rather than a change in that variable. An example of pure measurement
from the laboratory is a dictator game in which an average level of pro-sociality is estimated.
In principle, a comparison of laboratory measurements and field measurements is possible
and potentially informative (Benz and Meier 2008). However, similar baseline measurements
would be no guarantee of similar treatment response across lab and field. The investigation of
treatment effects generally requires a fully randomized design, but we relaxed the definition
of randomization to include pseudo-randomizations such as assignment according to the day
on which subjects showed up at the lab.4
Data Collection and Reanalysis. The data used in the analyses were gathered from publicly-available
datasets, correspondence with the authors, or from tables and charts in the original papers. In
some cases, we corrected minor mistakes in the original analyses. Further, we standardized the
presentation format so that we could assess correspondence within particular studies, and across
the entire set. For each of the studies, we collected mean outcomes (without covariate adjustment)
in each treatment group, standard errors, and group sizes. From these data, we calculated treatment
effects and 95% confidence intervals. Most of these studies have multiple treatment groups (arms)
and measure a single outcome, but one (Jerit et al. 2011) has only two treatment groups and
measures a large number of outcomes.
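As a concrete illustration of this step, the sketch below uses made-up summary statistics (not values from any of the twelve studies) to compute a difference-in-means treatment effect and a normal-approximation 95% confidence interval from reported group means and standard errors:

import numpy as np

# Hypothetical reported statistics for one two-arm comparison.
mean_treat, se_treat = 4.2, 0.6  # treatment-group mean and standard error
mean_ctrl, se_ctrl = 3.1, 0.5    # control-group mean and standard error

# Difference-in-means estimate of the treatment effect.
ate = mean_treat - mean_ctrl

# Standard error of the difference, treating the two groups as independent.
se_ate = np.sqrt(se_treat**2 + se_ctrl**2)

# Normal-approximation 95% confidence interval.
ci_low, ci_high = ate - 1.96 * se_ate, ate + 1.96 * se_ate
print(f"ATE = {ate:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")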
Assessing Correspondence. One issue that arises in existing debates about lab-field correspondence
is the question of how to assess correspondence statistically. Intuition suggests that one should
simply compare treatment effects. However, for many comparisons, outcomes are not measured
using the same scale. For example, in one of the studies discussed below, the lab outcome is measured by the number of computer mazes solved and the field outcome by playground footrace times. There is
no universally accepted technique for determining the appropriate theoretical maze-to-footspeed
conversion ratio. It might be argued that treatment effects could be put in percentage terms: a 10% increase in mazes completed compared with a 10% increase in velocity. This too may have problems if small percentage changes are substantively quite large in some domains but not others.

4. Several studies would have been excluded because either they did not use a fully randomized design or they did not report their randomization procedures. See Green and Tusicisny (2013) for a critique of this practice. There is the further issue of accounting for clustered random assignment; we use the authors’ reported standard errors but recognize that these estimates probably understate the true sampling variability.
Another technique applied by some researchers to assess correspondence is to compare the sign
and statistical significance of effects: if the treatment effects are positive and significant in both lab
and field, the results are said to indicate strong correspondence. This approach has two weaknesses.
First, it is not clear that lab and field studies would be thought to be in strong agreement if they
both recovered insignificant effects. Second, this approach may conflate the magnitude of the effect
size with the power of the study. For example, even if the estimated effects were the same, a large
field experiment might generate a significant p-value, whereas a small lab experiment may not.
In order to sidestep the issue of incomparable scaling and sample-size-dependent conclusions,
we assess correspondence using rank-order correlations. We collect mean outcomes in each experimental group in both the lab and field studies. Spearman’s ρ assesses the degree to which the
ordering of means in the lab corresponds to the ordering in the field. This approach has a number
of advantages. First, it is robust to the scaling problem described above. Second, it keeps the question of effect sign separate from the question of statistical uncertainty. Third, it accommodates
what some researchers describe as “general” or “qualitative” (Kessler and Vesterlund, forthcoming)
correspondence, insofar as larger effects in the lab are associated with larger effects in the field.
One weakness of our approach is that correlations tend to be exaggerated in absolute value when N
is small (Student 1908). For this reason, our overall conclusions are based on the full set of lab-field
comparisons, which we standardize for purposes of meta-analysis.
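A minimal sketch of this calculation for a single lab-field pair, with hypothetical group means (the treatment arms must be listed in the same order for lab and field); scipy.stats.spearmanr returns the rank-order correlation between the two sets of means:

from scipy.stats import spearmanr

# Hypothetical mean outcomes by treatment arm, one entry per arm.
lab_means = [0.21, 0.35, 0.48, 0.62]        # e.g., proportion cooperating
field_means = [110.0, 150.0, 145.0, 190.0]  # e.g., kilograms picked

# Spearman's rho: do the arms rank the same way in lab and field?
rho, p_value = spearmanr(lab_means, field_means)
print(f"Spearman's rho = {rho:.2f}")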
In Figure 1 below, laboratory results are plotted on the x-axis and field results are plotted on
the y-axis. Perfect correspondence, by our metric, would be represented by means falling along
any strictly increasing line, where the error bars (both horizontal and vertical) do not overlap. The
numeric data used to generate these charts may be found in the Appendix Table.
Figure 1: 12 Lab-Field Comparisons. Each panel presents unstandardized means by experimental
group, except for the last panel, which presents estimated treatment effects.
[Figure 1 appears here. Twelve panels, with lab results on the x-axis and field results on the y-axis: Erev, Bornstein, and Galili (1993) compared with Bornstein, Erev, and Rosen (1990); Gneezy and Rustichini (2000); Valentino, Traugott, and Hutchings (2002); Gneezy and Rustichini (2004) compared with Gneezy et al. (2003); Gneezy, Haruvy, and Yafe (2004); List (2006); Shang, Reed, and Croson (2008); Rondeau and List (2008); Harrison and List (2008); Armantier and Boly (forthcoming); Abeler and Marklein (2013); and Jerit, Barabas, and Clifford (forthcoming), for which Pearson correlation = 0.21 and Spearman correlation = 0.26.]
3 Review of Twelve Lab-Field Comparisons
The purpose of the following section is to characterize each lab-field comparison, which we
present in chronological order. Special attention is devoted to describing the subject pools, treatments, contexts and outcome measures, as these components shed light on whether lab-field correspondence is expected or surprising. In an effort to standardize comparisons across studies, we
summarize four key aspects of each study: 1) purpose, 2) treatments, 3) subjects and context, and
4) outcome measures and results.
Erev, Bornstein, and Galili (1993) compared with Bornstein, Erev, and Rosen
(1990)
Study purpose. This pair of studies examines the effect of competition in two domains: a
laboratory “give some” game and orange-picking by high-school students. In both contexts, subjects
had an incentive to free-ride – the competition treatment was expected to mitigate this problem.
Subjects and Context. Male undergraduate lab subjects were assigned to teams of three. Each subject could choose to take IS 3 or to give IS 3 to each of the other group members. If all three players took, each received a payoff of IS 3; if all three gave, each received a payoff of IS 6. The field subjects were male high school students recruited for three
hours’ work in an orange grove. The subjects were arranged into teams of four, and payoffs were
determined by the quantity of oranges picked.
Treatments. In the lab, the subjects in the Control treatment played the “give some” game
as described above. In the Competition treatment, pairs of groups competed to have the largest
proportion of givers. Winning this competition yielded an additional IS 9 for each group member. In
the field, subjects in the Team condition (analogous to the lab Control condition) received 1/4th of
the total group payoff (calculated according to a IS/kg ratio). Those in the Competition condition
received these same payoffs, except that the groups of four were split into dyads. Whichever dyad
picked more oranges received a bonus.
Outcome measures and results. Outcomes in the lab were measured by participation – whether
subjects gave or not. The proportion of givers in the Control condition was lower than in the
Competition condition. Outcomes in the field were measured in kilograms of oranges picked by each group of four. In the field as well, production was higher in the Competition condition.
Gneezy and Rustichini (2000)
Study purpose. This study sought to identify the effect of small incentives on intrinsic motivation. The hypothesis is that small incentives may actually decrease effort relative to no (external)
incentives.
Subjects and Context. Undergraduate subjects in the lab experiment answered IQ test questions.
The field experiment was carried out with high school students during their annual door-to-door
fundraising service day.
Treatments. In the lab, the treatment conditions5 were Control (0 NIS), Small (0.1 NIS),
and Large (1 NIS) per correct answer.6 In the field, some students were offered incentives for
performance: the treatments were Control (no incentive), Small (1% of donations raised), and
Large (10% of donations raised).
Outcome measures and results. Outcomes were measured by the number of IQ questions answered correctly in the lab and by donations raised in the field. In the lab, Small < Control < Large. In the field,
Small < Large < Control.
Valentino, Traugott, and Hutchings (2002)
Study purpose. This study explored the effect of political advertising, with and without racial
cues, on support for George W. Bush during the summer preceding the 2000 Presidential election.
Subjects and Context. Subjects in the lab study were recruited from Ann Arbor area business
and office buildings. Subjects were randomly assigned to one of four treatment groups, viewed
the experimental stimuli on a computer, and filled out a survey on the same computer. Subjects
in the field study were selected using an area probability sample from the Detroit metropolitan
area. If respondents were home and opened the door (55.3% response rate), interviewers played
the experimental stimuli on a laptop, and the relevant political questions were also answered by the subject on the laptop.

5. These are our naming conventions, for clarity.
6. A fourth treatment, 3 NIS per correct answer, was included in the lab experiment, but we do not directly compare it to the field experiment, which did not have a corresponding proportional increase in payment.
Treatments. The two studies used identical experimental stimuli. In the Control condition,
subjects were shown product advertisements. In the Neutral condition, subjects were shown a
Bush political advertisement that did not include any racial cues. In the White condition, the
same advertisement audio was played, this time with white actors in the video. In the Black +
White condition, images of black actors receiving government services were included under the
audio “Democrats want to spend your tax dollars on wasteful government programs.”
Outcome measures and results. The outcome measure in both lab and field was candidate
preference, where Gore = -1, No Preference = 0, Bush = 1. The mean of this variable in each
group represents the percentage point lead for Bush. In the lab, Control < Neutral < White <
Black + White. In the field, Control < Black + White < Neutral < White.
Gneezy and Rustichini (2004) compared with Gneezy et al. (2003)
Study purpose. This pair of papers explores the heterogeneous effects of competition on performance by gender. The hypothesis is that competition will affect males’ performance but not
females’.
Subjects and Context. The lab study tests the effect of competition on the ability to solve
computer mazes quickly among undergraduates. The field study tests the effect of competition on
gym class race times among 9- and 10-year-olds.
Treatments. Subjects were brought into the lab in groups of six – three men and three women.
In the No Competition treatment, the payoffs each subject received depended only on his or her
own performance in solving the computer mazes. In the Competition treatment, only the subject
solving the most mazes would receive any payoff at all. Subjects in the field ran two races: a baseline race run alone, followed by a second race run either alone (the No Competition condition) or against a classmate (the Competition condition).
Outcome measures and results. In the lab, outcomes were measured based on the number of
mazes completed. In the field, outcomes were measured by the difference in speed between the first
and second races run.7 The treatment effect among males is positive in both lab and field. Among
females, the treatment effects have different signs in the lab and the field, but this is most likely
due to sampling variability rather than any real differences between the lab and field contexts.8
Gneezy, Haruvy, and Yafe (2004)
Study purpose. This study sought to explore the “diner’s dilemma” – the differences in consumption behavior when individuals in a group paid the full cost of their own consumption versus
a fraction of the group’s total consumption.
Subjects and Context. Subjects in both experiments were recruited on and around a university
campus. The field experiment took place in a restaurant where groups of six subjects (who did not
know one another) were invited to eat lunch together. This design is not unobtrusive – subjects
knew they were participating in an experiment, and further, lunch with strangers is relatively
uncommon. In the lab experiment, groups of six subjects were asked to choose the number of units
to produce from a production table with costs and revenues detailed for each production level.
Treatments. Three treatments were directly comparable across the lab and field experiments:
Individual Pay, in which individuals paid their private costs; Even Split, in which individuals paid
the average of all six group members’ costs; and 1/6th Pay, in which individuals paid for 1/6th of
their private consumption. This last treatment was included because it is theoretically equivalent
to the Even Split treatment from the standpoint of the marginal costs incurred by the subjects.
Outcome measures and results. Outcomes were measured by the cost of individuals’ meals in
the field experiment and in production units in the lab. The rankings of the three treatments
proved to be identical across lab and field: Individual Pay < Even Split < 1/6th Pay. Because the
difference between Even Split and Individual Pay was only found to be significant in the field, the
authors described these studies as not corresponding well to one another.
7. In the original paper, these differences were negative when race times improved. We reversed the sign to maintain consistency with our other graphs.
8. In the meta-analysis to follow, our rank-order correlation metric will take this uncertainty into account: a pair of insignificant results will not unduly understate (or overstate) the potential for lab-field correspondence.
List (2006)
Study purpose. Inspired by previous laboratory experiments, the author conducted a field experimental test of “gift exchange”, in which subjects reciprocate generosity even when there is
no clear benefit of doing so. In these experiments, a sum of money was offered to dealers for sports
cards that would be rated at a particular level of quality if they were graded. Because the quality of
the cards is only partially observable to buyers, sellers can offer cards of lower quality without the
buyers being aware of it. Gift exchange, then, would be observed if dealers supply higher quality
when offered larger sums.
Subjects and Context. The lab and field experiments both took place at a sports card show.
List reports the results of a number of experiments, but the pair of experiments that is most directly comparable is Lab-Market and the Floor (Cards). In the Lab-Market experiment, an
experimental market was constructed in a room adjacent to the show. Buyers bought 1990 Leaf
Frank Thomas baseball cards from sellers. The sellers were recruited from the dealers at the
show, and the buyers were recruited from the public visiting the show. In the field experiment,
confederates, recruited from the public visiting the show, approached dealers to buy a Thomas
card. The principal differences between the lab and the field experiments are that 1) the dealers
in the lab were those who agreed to participate in an experiment, and 2) the lab interactions took
place outside of the normal sports card show context.
Treatments. The treatments for lab and field experiments were identical. Confederates approached dealers in the $20 condition and offered $20 for a Thomas card that would receive a PSA
9 if it were graded. Dealers in the $65 condition were offered $65 for a Thomas card that would
receive a PSA 10 if graded. Note that both the amount offered and the requested quality differ
across treatment conditions. This feature of the experimental design complicates the substantive
interpretation of the results but does not matter for the assessment of lab-field correspondence.
Outcome measures and results. Outcomes were measured by the actual quality supplied by the
dealers – after the experiment, all cards were later submitted to experts for grading. In both lab
and field, dealers in the $65 condition supplied higher quality than dealers in the $20 condition.
The treatment effect is on the order of a single PSA grade level in both lab and field.
Shang, Reed, and Croson (2008)
Study purpose. This study examined the effects of gender-specific appeals to donate to a public
radio station on contributions. A standard technique in public radio fundraising is to inform new
potential donors of the past donation behavior of others. To the extent that this mechanism increases donations, it may operate through an “identity congruency” channel – new donors may feel
more generous when previous donors “like them” were more generous. Shang, Reed, and Croson
report the results of a field experiment and four lab experiments exploring this channel by manipulating the gender congruency of the information. We report the results here of a comparison
between the field experiment and what we consider to be the most parallel lab experiment (experiment 3A), though the substantive results of the comparison do not change according to which lab
experiment is chosen.
Subjects, Context, and Treatments. The field experiment was carried out during an on-air
fundraising drive for a public radio station. 76 listeners who called in to the station intending
to donate were immediately randomized into one of two treatments, the gender match condition
or the gender mismatch condition. In the match condition, male (female) callers were told “We
had another member; he (she) contributed $240.” In the mismatch condition, the pronouns were
switched.
The lab experiment took place in a university context with 184 undergraduate subjects, who
were asked to imagine that they were planning to call a public radio station to donate $25. The
subjects are then told to imagine that “[d]uring your conversation with a volunteer on the phone, you
were told they had just spoken with another donor, Mary (Tom), and that she (he) had contributed
$70 this year.” Subjects were asked to write down how much they would hypothetically donate to
the radio station. It should be noted that the main purpose of the lab experiment was to explore how
the treatment response differed for those subjects with high or low identity esteem and for those
who focus primarily on themselves or others. These characteristics are unmeasured in the field, so
we cannot compare lab-field correspondence according to these subject characteristics.
Outcome measures and results. The field outcomes were measured in actual donations and the
lab outcomes in hypothetical donations. In the field, subjects donated more in the match condition
than in the mismatch condition, but this pattern was reversed in the lab.9
Rondeau and List (2008)
Study purpose. This study examines the effect of various solicitation strategies on donation
behavior.
Subjects and Context. The field experiment was conducted in collaboration with the Sierra
Club. Donations to a K-12 environmental education program were solicited from 3,000 Sierra
Club members. In all treatment conditions, the solicitation included a money-back guarantee if
a threshold level of donations was not met. The laboratory experiment was conducted at the
University of Victoria with subjects recruited from introductory economics classes. All subjects
were given $12, which they could donate to a public good. If a threshold amount donated to the
public good was reached, each individual would receive a private benefit, and if not, all donations
would be refunded.
Treatments. There were four directly comparable treatments in lab and field: Low Control,
High Control, Matching, and Challenge. In the field (lab) the threshold level was $2,500 ($22.50)
in Low Control, and $5,000 ($45) in the other conditions. The only difference between the Low
Control and the High Control condition was the level of the threshold. In the challenge condition
(CH), subjects were informed that $2,500 ($22.50) had already been contributed. In the matching
condition (MA), subjects were informed that every dollar they contributed would be matched 1:1,
up to $2,500 ($22.50).
Outcome measures and results. The outcome of interest in both studies is the dollar amount
contributed by each subject. In the original analysis, the authors reported average field donations
only among the set of subjects who made a positive contribution. This technique is prone to bias
because making a positive contribution may be affected by treatment. We report the unconditional
results, which change the relative rankings of the treatments across lab and field. In the lab,
Matching < Challenge < Low Control < High Control. In the field, Low Control < Matching <
High Control < Challenge.

9. The authors did not respond to our requests for the original data, so these estimates are based on a reconstruction of the laboratory data from reported statistics. We aggregated the interacted results presented in Figure 3 on page 358.
Harrison and List (2008)
Study purpose. The winner’s curse refers to the conjecture that those who win auctions may
pay more than their private valuations for the auctioned goods. A key prediction of common value
auction theory is that the revenue earned by the seller depends on the information structure of the
auction market. Sellers will earn less under a Symmetric Information Structure (SIS) – all bidders
have a private (noisy) signal of the good’s value – and more under an Asymmetric Information
Structure (AIS) – insiders know the good’s value with certainty, and outsiders receive noisy private
signals. The original study considered quite a few other predictions of auction theory, but the SIS
versus AIS treatments are the ones that are directly comparable across lab and field.
Subjects, Context, and Treatments. The lab and field experiments were carried out at a sportscard trading show in Tucson, AZ. Lab subjects were recruited from among the dealers and non-dealers at
the show. In the lab, the main treatments (SIS and AIS) were assigned using session randomization.
In the lab, auctions were held with either 4 or 7 participants at a time. In the SIS condition, all
participants received a private noisy signal about the good’s value (centered at the common value
of $94.33), and all participants knew that the other participants received different private signals.
In the AIS condition, one participant (the “insider”) was told the actual value of the good, and all
participants knew that one participant among them was the insider. Because groups of subjects
were treated together, the unit of observation is the individual auction, not the individual subject.
Field subjects were sportscard show attendees, both dealers and non-dealers, who approached
the experimenters’ table and asked to purchase a particular pack of sportscards. The subjects
were then informed that there would be an auction for the pack at a specified time in the future.
Unopened pack auctions are not an uncommon event at sportscard trading shows. The field auctions
consisted of 4 subjects at a time. In the SIS condition, dealers were matched with dealers, and
non-dealers were matched with non-dealers. In the AIS condition, 3 non-dealers were matched with
1 dealer. In both treatments, the subjects were informed of the composition of their auction group.
Again, the unit of observation is the auction.
Outcome measures and results. Outcomes are measured by the revenues generated by the
auctions in each of the two conditions. In both the lab and the field, the AIS condition generated
more revenue than did the SIS condition, but this difference was only significantly different from
zero in the field.
Armantier and Boly (forthcoming)
Study purpose. These experiments measure the sensitivity of bribe-taking behavior to three
experimental manipulations. The design of the field and lab experiments was quite similar: in
both, subjects were asked to grade 20 exams, the 11th of which contained a bribe and a request
not to grade the exam harshly. Correctly grading this 11th exam meant giving it a failing grade.
Treatments. There were four treatment groups: Control, High Wage, High Bribe, and Monitoring. Subjects in the control condition received a wage and were offered a bribe equal to 1/5 of that
wage. The High Wage treatment is the same as Control except the wage is 40% higher. The High
Bribe treatment is the same as Control except the bribe is twice as large. The Monitoring treatment
is the same as Control except the graders are informed that 5 of the 20 exams they graded would
be checked for accuracy and that there would be a financial penalty for any unmarked mistakes.
Subjects and Context. The field experiment was carried out in Burkina Faso, where part-time
work grading exams is common. As such, the subjects did not suspect that they were participating
in an experiment. The lab experiment was carried out in a lab in Montreal. Unlike the field
subjects, the lab subjects knew they were participating in an experiment, and they were explicitly
told by the experimenters that the 11th exam included a bribe, which they could accept or reject.
Outcome measures and results. Outcomes were measured by the proportion of graders taking
the bribe. In the field, bribe-taking showed the following pattern: High Wage < Monitoring <
Control < High Bribe. In the lab, the pattern was High Wage < Monitoring = High Bribe <
Control. Substantively, the main point of agreement between lab and field is that Monitoring does
little to change behavior. The lab and field disagree, however, on the effects of High Wage and
High Bribe: The field saw an increase in corruption from higher bribes but the lab did not, and
the lab saw a decrease in corruption from the High Wage treatment but the field did not.
Abeler and Marklein (2013)
Study purpose. This study assessed the effects of restricted versus unrestricted vouchers on
consumption. A restricted voucher allows a subject to spend X Euros on only one type of good,
whereas an unrestricted voucher allows a subject to spend X Euros on any type of good. The
prediction is that among subjects who would consume more than X Euros of the restricted good,
consumption will be unaffected by which voucher subjects receive.
Subjects and Context. The field experiment was carried out in a wine restaurant in southern
Germany. Diners received vouchers that could be used towards their meals. The lab subject pool
consisted of undergraduates majoring in fields other than economics. They were asked to choose
a bundle of fictional goods, given prices and a set of available resources. The optimal bundle (the
bundle that maximized revenue minus costs) included more units of both housing and clothing than
were covered by the grants.
Treatments. In the field experiment, subjects in the Cash condition received an 8 Euro voucher
they could apply to their entire bill. Subjects in the Label condition received vouchers that could
be used on beverages only.10 Lab subjects in the Cash condition received an unrestricted voucher
they could apply to either housing or clothing. Subjects in the Label condition received a voucher
that could only be spent on one of the goods, but not both.
Outcome measures and results. The authors report many outcome measures for the field experiment, but the total amount spent on beverages is the measure that is most closely comparable
across experiments. In the lab experiment, outcomes were measured in units consumed of the labeled good. The optimal consumption was 13 units of the labeled good. Both experiments recover a
positive treatment effect. Subjects spend more on the labeled good in both experimental contexts.
Jerit, Barabas, and Clifford (forthcoming)
Study purpose. This study investigated the effects of newspaper information on political knowledge and opinions.
10. This study employed a pseudo-randomization: all of the beverage vouchers were given out over the course of a week, and the unrestricted vouchers were given out over the next week. The nominal standard errors reported here do not reflect the extra uncertainty associated with this procedure.
Subjects and Context. The authors go to great lengths to create comparable lab and field subject
pools. They selected a random sample of 12,000 registered voters who were not subscribers to the
Tallahassee Democrat newspaper and randomly assigned 6,000 to their field experiment and 6,000
to their lab experiment. Of those assigned to the field experiment, 1,135 returned questionnaires.
417 of those assigned to the lab experiment accepted their invitations to participate.
Treatments. Subjects in the treatment condition of both studies were exposed to Tallahassee
Democrat stories from two Sunday editions in the spring of 2011. Field subjects in the treatment
condition were sent two free Sunday papers. Lab subjects in the treatment condition were shown
4 stories taken from those two papers and given a survey directly after reading them.
Outcome measures and results. Outcomes in both studies were measured with an identical
survey that included 17 questions concerning political knowledge, preferences, and beliefs. We
present treatment effects rather than group means, as there are only two groups but 17 outcome
variables. Because the lab and field studies’ outcomes are in the same scale, we present both the
Pearson correlation coefficient (0.21) and the rank-order (Spearman) correlation coefficient (0.26)
as measures of correspondence. However, because the treatments had no discernible effect on a
large number of the outcome measures, this low correlation is in part due to sampling variability –
lab-field differences are all the more difficult to discern when treatment effects are weak.
Despite the randomized design, non-compliance limits the comparability of lab and field subjects. The subjects who arrived at the lab represent a population of laboratory compliers, and
the treatment effects recovered in the lab are the complier average causal effects (CACEs) for that
population. The CACEs among that population may or may not be similar to the average treatment effects for the entire population. A related problem is the attrition in the field experimental
groups: the authors implicitly assume that conditional on responding to the questionnaire, potential outcomes are in expectation equal across treatment groups. The authors note that response
rates were not significantly different between treatment and control in the field. Nevertheless, any
conclusions will necessarily be restricted to the set of individuals who respond to the questionnaire.
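In standard instrumental-variables notation (ours, not the authors’), with random assignment Z, actual receipt of treatment D, and one-sided noncompliance, the complier average causal effect is the intent-to-treat effect rescaled by the compliance rate:

\widehat{\text{CACE}} \;=\; \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]},

so the lab estimates speak only to the subset of subjects who would show up when invited.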
These complications underline the difficulty of extrapolating from the lab to the field even when
the subject pools are drawn from the same underlying population. It may be still more difficult
when the populations are different, and when non-random selections of those populations sort into
the lab and field samples, as is demonstrated by the randomization check presented in the appendix
of the original article.
4 Results & Discussion
We present the results of these twelve close comparisons in Figure 1. With the exception of the
Jerit et al. study, these graphs present treatment group means and 95% confidence intervals. Lab
results are presented on the horizontal axis, and field results are presented on the vertical axis. For
the Jerit et al. study, we present treatment effects, not group means. Taken together, these twelve
studies demonstrate a reasonably strong correlation between group means in the lab and group
means in the field.
In order to provide some measure of this correlation, we combined the twelve studies to form a
single dataset.11 In order to facilitate cross-study comparison, we recorded each study’s estimates
of the average treatment effects in lab and field (rather than each study’s treatment and control
group averages). We standardized these treatment effects using the Cohen’s d procedure as shown
in Equations 1 and 2. This process allows us to compare effect sizes in standard units.
\text{Standardized Average Treatment Effect} = \frac{\mu_{\text{Treat}} - \mu_{\text{Control}}}{\sigma_{\text{Control}}} \tag{1}

\text{Standardized ATE Standard Error} = \frac{\sqrt{\sigma^2_{\text{Treat}} / N_{\text{Treat}} + \sigma^2_{\text{Control}} / N_{\text{Control}}}}{\sigma_{\text{Control}}} \tag{2}
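As a small sketch of this standardization (the group means, standard deviations, and sizes below are placeholders rather than values from any of the studies):

import numpy as np

def standardize(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    # Standardized treatment effect and its standard error, per Equations 1 and 2.
    d = (mean_t - mean_c) / sd_c                        # Equation 1
    se = np.sqrt(sd_t**2 / n_t + sd_c**2 / n_c) / sd_c  # Equation 2
    return d, se

# Hypothetical lab and field arms from one study pair.
lab_d, lab_se = standardize(4.2, 3.1, 1.8, 1.6, 60, 62)
field_d, field_se = standardize(152.0, 118.0, 75.0, 70.0, 40, 41)
print(lab_d, lab_se, field_d, field_se)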
11. When constructing an overall assessment of lab-field correspondence, we included only a single treatment effect pair from the Jerit et al. study. Which pair we included caused changes in the overall correlation ranging from .699 to .762. Figure 2 is generated using the treatment effect pair that presents the “median case” for lab-field correspondence. We believe that including all 17 pairs gives the Jerit et al. study too much weight, but doing so reduces the rank-order correlation to .56.
[Figure 2 appears here: “Standardized Treatment Effects across all Lab-Field Pairs.” Lab standardized treatment effects are plotted on the x-axis and field standardized treatment effects on the y-axis; the plotted points carry the study labels defined in the caption below. Rank-Order Correlation: 0.73.]
Figure 2: Each point represents a standardized lab-field treatment effect pair. A = Erev, Bornstein,
and Galili (1993) and Bornstein, Erev, and Rosen (1990), B = Gneezy and Rustichini (2000), C
= Valentino, Traugott and Hutchings (2002), D = Gneezy and Rustichini (2004) and Gneezy et
al. (2003), E = Gneezy, Haruvy, and Yafe (2004), F = List (2006), G = Shang, Reed, and Croson
(2008), H = Rondeau and List (2008), I = Harrison and List (2008), J = Armantier and Boly
(forthcoming), K = Abeler and Marklein (2013), L= Jerit, Barabas, and Clifford (forthcoming).
Figure 2 shows our results. Treatment effects in the lab and treatment effects in the field show an upward-sloping relationship, with a rank-order correlation of 0.73. To the extent that lab and field
treatment effects are measured with error, the observed relationship may understate the underlying
correlation between lab and field treatment effects.
In order to be sure that the correlation was not driven by any single pair or a particular set of
pairs of treatment effects, we carried out the following robustness check. First, we calculated the
set of rank-order correlations that would occur if we dropped any single pair of treatment effects.
Next, we calculated the set of correlations that would occur if we dropped any two pairs, and so
on, out to any ten pairs.12 The results of this procedure can be seen in Figure 3, moving leftward
from the center of the graph. As we move rightward from the center of the graph, we take our 21 observed pairs and first duplicate any single pair, then any two pairs, and so on out to any ten pairs. With some abuse of notation, we label the x-axis as ranging from $\binom{21}{11}$ to $\binom{21}{31}$ – strictly speaking, beyond $\binom{21}{21}$, we are appending $\binom{21}{X}$ combinations to our 21 observed pairs.
On the y-axis of Figure 3, we plot the 5-number summary of the set of rank-order correlations.
Unsurprisingly, the median correlation stays constant across all permutations. As we remove or
duplicate more observations, the maximum and minimum correlations spread out – even crossing
zero and reaching 1 in the case of dropping any ten pairs. A striking feature of this graph, however,
is that the interquartile range stays within the narrow band of about .6 to .8.
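A sketch of the removal half of this procedure under simplifying assumptions: fabricated standardized effects for 21 pairs, exhaustive enumeration of every way of dropping k pairs (the actual analysis randomly samples 50,000 combinations once more than six pairs are dropped), and the 5-number summary of the resulting rank-order correlations; the duplication half is analogous.

from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
# Hypothetical standardized lab and field effects for 21 pairs.
lab = rng.normal(size=21)
field = 0.7 * lab + rng.normal(scale=0.5, size=21)

def five_number_summary_after_dropping(k):
    # 5-number summary of Spearman correlations after dropping any k pairs.
    rhos = []
    for dropped in combinations(range(21), k):
        keep = [i for i in range(21) if i not in dropped]
        rho, _ = spearmanr(lab[keep], field[keep])
        rhos.append(rho)
    return np.percentile(rhos, [0, 25, 50, 75, 100])

for k in range(1, 4):  # exhaustive enumeration is quick for small k
    print(k, np.round(five_number_summary_after_dropping(k), 2))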
[Figure 3 appears here: “Rank Order Correlation of Standardized Treatment Effects in Lab and Field.” The x-axis indexes the combinations of lab-field pairs used in the calculation, running from pairs being removed, through the 21 observed pairs, to pairs being duplicated; the y-axis is the rank-order correlation coefficient, with the maximum, 75th percentile, median, 25th percentile, and minimum plotted at each point.]
Figure 3: We calculate the rank order correlations of the 21 treatment effect pairs included in the
meta-analysis, under all possible combinations duplicating or removing any 1, 2, 3... or 10 of them.
12. Above dropping any 6 pairs, we randomly sampled 50,000 of the possible combinations.
In summary, our review of parallel lab and field experiments indicates a strong overall pattern
of lab-field correspondence. Although many of the studies we reviewed lacked sufficient power to
discern the level of correspondence with precision, the pattern of lab-field correspondence becomes
clear when all of the studies are pooled. As Figure 2 indicates, stronger effects in the lab are
associated with stronger effects in the field. The apparent correlation between lab and field results
would presumably be even stronger if each of the studies had estimated lab and field effects with
greater precision.
The overall pattern of agreement is surprising given the many ways in which lab and field
studies differ. In the collection of twelve studies we examined, seven of the lab studies involved
undergraduate subjects. In six of the lab studies, subjects encountered an abstract rendering of a
real-life situation, such as the exchange of tokens meant to simulate the division of a restaurant bill.
In all twelve of the lab studies and nine of the field studies, outcomes were measured immediately
after subjects encountered the treatment. Our collection of studies is too sparse to permit a more
fine-grained analysis of how changes in lab-field similarity in terms of subjects, treatments, contexts,
and outcomes affect correspondence. Nevertheless, correspondence is remarkably high given that
our collection of lab and field studies often diverge on more than one of these dimensions.
That said, we recognize that caution is warranted when drawing inferences about lab-field
correspondence based on the current state of the literature. As noted earlier, studies of lab-field correspondence have emerged in an ad hoc fashion, without any attempt to systematically
investigate variation in subjects, treatments, contexts, and outcomes. The lack of systematic
procedures raises two concerns. One is the so-called “file-drawer problem” (Rosenthal 1979). If
authors or journal editors have a preference for noteworthy findings – either demonstrations of very
high or very low correspondence – the distribution of correspondence reported in academic work
might be unrepresentative of the broader set of studies that were conducted. We might see an
exaggerated level of lab-field correspondence because high correspondence findings are especially
likely to find their way into published articles, conference papers, or manuscripts posted online. As
this literature matures, it will be interesting to see whether future research findings diverge from
the strong correspondence apparent in currently available work. The threat of publication bias also
underscores the need for preregistration of experiments and institutional arrangements facilitating
the reporting of results even in the absence of publication.
A second concern has to do with the way in which lab-field comparisons are chosen. In most of
the studies considered here, a field experiment was conducted to confirm, validate, or challenge a
laboratory result. If the lab is to function as a cost-effective substitute for field research, it makes
sense to take the opposite approach: start with a field experiment, and look for parallel tests in the
lab. Similarly, if the aim is to calibrate lab designs so that their results agree with field research
findings (Camerer 2012, p. 47), field experimentation is a natural starting point.
How might one go about looking for field research that lends itself to telling lab-field comparisons? One approach is to start with well-developed field literatures that examine the effects
of treatments that can be delivered in lab or field contexts. For example, field experiments on
the accountability of public officials have assessed how voters respond to revelations of politicians’
behavior (Chong et al. 2010, Humphreys and Weinstein 2010, Banerjee et al. 2010). Field experiments on tax compliance (Fellner, Sausgruber, and Traxler 2013, Slemrod et al. 2001) have
distinguished between the effectiveness of threats versus appeals to morals. In tests of the psychology of sunk cost, field experimental results have shown that charging for health products does
not increase use relative to free provision (Cohen and Dupas 2010, Ashraf, Berry, and Shapiro
2010). The challenge in each of these instances is to devise lab experiments that address the same
causal or policy question, which in some cases involves outcomes that are expressed long after the
administration of a treatment.
The debate over the relative merits of laboratory and field experiments is often framed in terms
of the questions that each is better equipped to address: a schematic version of this debate is
that field experiments can answer “real-world” questions and lab experiments can isolate causal
mechanisms from “real-world” noise. We have attempted to advance this debate by offering some
evidence that when studies in the lab and field attempt to answer similar questions, they arrive at
similar answers. The task going forward is to investigate the question of lab-field correspondence
in a more systematic fashion, designing research specifically to assess the conditions under which
correspondence is maximized.
References
Anderson, Craig A., and Brad J. Bushman. 1997. “External Validity of ‘Trivial’ Experiments:
The Case of Laboratory Aggression.” Review of General Psychology 1 (1): 19–41.
Anderson, Craig A., James J. Lindsay, and Brad J. Bushman. 1999. “Research in the Psychological
Laboratory: Truth or Triviality?” Current Directions in Psychological Science 8 (1): 3–9.
Ashraf, Nava, James Berry, and Jesse M. Shapiro. 2010. “Can Higher Prices Stimulate Product Use?
Evidence from a Field Experiment in Zambia.” American Economic Review 100 (December):
2383–2413.
Banerjee, Abhijit, Selvan Kumar, Rohini Pande, and Felix Su. 2010. “Do Informed Voters Make Better
Choices? Experimental Evidence from Urban India.” Unpublished Manuscript.
Camerer, Colin F. 2011. “The Promise and Success of Lab-Field Generalizability in Experimental
Economics: A Critical Reply to Levitt and List.” Unpublished Manuscript.
Chong, Alberto, Ana De La O, Dean S. Karlan, and Leonard Wantchekon. 2010. “Information
Dissemination and Local Governments’ Electoral Returns: Evidence from a Field Experiment in
Mexico.”
Cohen, Jessica, and Pascaline Dupas. 2010. “Free Distribution or Cost-sharing? Evidence from a
Randomized Malaria Prevention Experiment.” Quarterly Journal of Economics 125 (1): 1–45.
Croson, Rachel, and Uri Gneezy. 2009. “Gender Differences in Preferences.” Journal of Economic
Literature 47 (2): 448–474.
Falk, Armin, and James J. Heckman. 2009. “Lab Experiments are a Major Source of Knowledge in
the Social Sciences.” Science 326 (5952): 535–8.
Fellner, Gerlinde, Rupert Sausgruber, and Christian Traxler. 2013. “Testing Enforcement Strategies
in the Field: Threat, Moral Appeal and Social Information.” Journal of the European Economic
Association 11 (3).
Gerber, Alan S. 2004. “Does Campaign Spending Work?: Field Experiments Provide Evidence and
Suggest New Theory.” American Behavioral Scientist 47 (5): 541–574.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton.
Harrison, Glenn W., and John A. List. 2004. “Field Experiments.” Journal of Economic Literature
42 (4): 1009–1055.
Henry, P. J. 2008. “College Sophomores in the Laboratory Redux: Influences of a Narrow Data Base
on Social Psychology’s View of the Nature of Prejudice.” Psychological Inquiry 19 (2): 49–71.
Herberich, David H., Steven D. Levitt, and John A. List. 2009. “Can Field Experiments Return
Agricultural Economics to the Glory Days?” American Journal of Agricultural Economics 91
(5): 1259–1265.
Humphreys, Macartan, and Jeremy Weinstein. 2010. “Policing Politicians: Citizen Empowerment
and Political Accountability in Uganda.” Unpublished Manuscript.
Kessler, Judd, and Lise Vesterlund. N.d. “The External Validity of Laboratory Experiments:
Qualitative rather than Quantitative Effects.” Unpublished Manuscript. Forthcoming.
Levitt, Steven D., and John A. List. 2007a. “Viewpoint: On the Generalizability of Lab Behaviour
to the Field.” Canadian Journal of Economics 40 (2): 347–370.
Levitt, Steven D., and John A. List. 2007b. “What Do Laboratory Experiments Measuring Social
Preferences Reveal About the Real World?” Journal of Economic Perspectives 21 (2): 153–174.
Levitt, Steven D., and John A. List. 2008. “Homo Economicus Evolves.” Science 319 (5865):
909–10.
Levitt, Steven D., and John A. List. 2009. “Field Experiments in Economics: The Past, the Present,
and the Future.” European Economic Review 53 (1): 1–18.
Mecklenburg, Sheri H., Patricia J. Bailey, and Mark R. Larson. 2008. “The Illinois Field Study:
A Significant Contribution to Understanding Real World Eyewitness Identification Issues.” Law
and Human Behavior 32 (1): 22–7.
Mook, Douglas G. 1983. “In Defense of External Invalidity.” American Psychologist (April): 379–
387.
Sears, David O. 1986. “College Sophomores in the Laboratory: Influences of a Narrow Data Base
on Social Psychology’s View of Human Nature.” Journal of Personality and Social Psychology
51 (3): 515–530.
Slemrod, Joel, Marsha Blumenthal, and Charles Christian. 2001. “Taxpayer Response to an Increased Probability of Audit: Evidence from a Controlled Experiment in Minnesota.” Journal of
Public Economics 79 (3): 455–483.
Student. 1908. “Probable Error of a Correlation Coefficient.” Biometrika 6 (2): 302–310.
Appendix: Articles and Unpublished Manuscripts Considered for
Inclusion
Abeler, Johannes, and Felix Marklein. 2013. “Fungibility, Labels, and Consumption.”
Alevy, Jonathan E., Michael S. Haigh, and John A. List. 2007. “Information Cascades: Evidence
from a Field Experiment with Financial Market Professionals.” The Journal of Finance 62 (1):
151–180.
Ansolabehere, Stephen, Shanto Iyengar, Adam Simon, and Nicholas A. Valentino. 1994. “Does
Attack Advertising Demobilize the Electorate?” American Political Science Review 88 (4): 829–
838.
Armantier, Olivier, and Amadou Boly. 2013. “Comparing Corruption in the Laboratory and in the
Field in Burkina Faso and in Canada.” The Economic Journal.
Arnot, Chris, Peter C. Boxall, and Sean B. Cash. 2006. “Do Ethical Consumers Care About
Price? A Revealed Preference Analysis of Fair Trade Coffee Purchases.” Canadian Journal of
Agricultural Economics 2003: 555–565.
Ashraf, Nava, Dean S. Karlan, and Wesley Yin. 2006. “Tying Odysseus to the Mast: Evidence
From a Commitment Savings Product in the Philippines.” The Quarterly Journal of Economics
121 (2): 635–672.
Bateson, Melissa, Daniel Nettle, and Gilbert Roberts. 2006. “Cues of Being Watched Enhance
Cooperation in a Real-World Setting.” Biology Letters 2 (3): 412–4.
Bellemare, Charles, and Bruce Shearer. 2007. “Gift Exchange within a Firm: Evidence from a
Field Experiment.” SSRN Electronic Journal.
Benz, Matthias, and Stephan Meier. 2008. “Do People Behave in Experiments as in the
Field? Evidence from Donations.” Experimental Economics 11 (3): 268–281.
Bertrand, Marianne, Dean S. Karlan, Sendhil Mullainathan, Eldar Shafir, and Jonathan Zinman.
2005. “What’s Psychology Worth? A Field Experiment in the Consumer Credit Market.”
Bornstein, Gary, Ido Erev, and Orna Rosen. 1990. “Intergroup Competition as a Structural Solution
to Social Dilemmas.” Social Behaviour 5 (4): 247–260.
Brooks, Charles M., Patrick J. Kaufmann, and Donald R. Lichtenstein. 2008. “Trip Chaining
Behavior in Multi-Destination Shopping Trips: A Field Experiment and Laboratory Replication.”
Journal of Retailing 84 (1): 29–38.
Burke, Raymond R., Bari A. Harlam, Barbara E. Kahn, and Leonard M. Lodish. 1992. “Comparing
Dynamic Consumer Choice in Real and Computer-Simulated Environments.” Journal of Consumer
Research 19 (1): 71–82.
Camerer, Colin F. 1998. “Can Asset Markets Be Manipulated? A Field Experiment with Racetrack
Betting.” Journal of Political Economy 106 (3): 457.
Carpenter, Jeffrey, and Erika Seki. 2011. “Do Social Preferences Increase Productivity? Field
Experimental Evidence from Fishermen in Toyama Bay.” Economic Inquiry 49 (2): 612–630.
Chang, Jae Bong, Jayson L. Lusk, and F. Bailey Norwood. 2009. “How Closely Do Hypothetical
Surveys and Laboratory Experiments Predict Field Behavior?” American Journal of Agricultural
Economics 91 (2): 518–534.
Charness, Gary, and Marie Claire Villeval. 2009. “Cooperation and Competition in Intergenerational Experiments in the Field and the Laboratory.” American Economic Review 99 (3):
956–978.
Charness, Gary, and Uri Gneezy. 2009. “Incentives to Exercise.” Econometrica 77 (3): 909–931.
Clinton, Joshua D., and John S. Lapinski. 2004. “Targeted Advertising and Voter Turnout: An
Experimental Study of the 2000 Presidential Election.” Journal of Politics 66 (1): 69–96.
Cohn, Alain, Ernst Fehr, and Lorenz Goette. 2013. “Fair Wages and Effort Provision: Combining
Evidence from the Lab and the Field.” SSRN Electronic Journal (January).
Correll, Shelly J., Stephen Benard, and In Paik. 2007. “Getting a Job: Is There a Motherhood
Penalty?” American Journal of Sociology 112 (5): 1297–1338.
Cummings, Ronald G., and Laura O. Taylor. 1999. “Unbiased Value Estimates for Environmental
Goods: A Cheap Talk Design for the Contingent Valuation Method.” American Economic Review
89 (3): 649–665.
Engelbrecht-Wiggans, Richard, John A. List, and David H. Reiley. 2006. “Demand Reduction
in Multi-Unit Auctions with Varying Numbers of Bidders: Theory and Evidence from a Field
Experiment.” International Economic Review 47 (1): 203–231.
Erev, Ido, Gary Bornstein, and Rachely Galili. 1993. “Constructive Intergroup Competition as
a Solution to the Free Rider Problem: A Field Experiment.” Journal of Experimental Social
Psychology 29: 463–478.
Falk, Armin. 2007. “Gift Exchange in the Field.” Econometrica 75 (5): 1501–1511.
Fehr, Ernst, and Lorenz Goette. 2007. “Do Workers Work More if Wages Are High? Evidence from
a Randomized Field Experiment.” American Economic Review 97 (1): 298–317.
Frank, M. G., and T. Gilovich. 1988. “The Dark Side of Self- and Social Perception: Black Uniforms
and Aggression in Professional Sports.” Journal of Personality and Social Psychology 54 (1):
74–85.
Gneezy, Ayelet, Alex Imas, Amber Brown, Leif D. Nelson, and Michael I. Norton. 2012. “Paying
to Be Nice: Consistency and Costly Prosocial Behavior.” Management Science 58 (1): 179–187.
Gneezy, Uri, and Aldo Rustichini. 2000. “Pay Enough or Don’t Pay at All.” The Quarterly Journal
of Economics 115 (3): 791–810.
Gneezy, Uri, and Aldo Rustichini. 2004. “Gender and Competition at a Young Age.” American
Economic Review 94 (2): 377–381.
Gneezy, Uri, Ernan Haruvy, and Hadas Yafe. 2004. “The Inefficiency of Splitting the Bill.” The
Economic Journal 114: 265–280.
Gneezy, Uri, and John A. List. 2006. “Putting Behavioral Economics to Work: Testing for Gift
Exchange in Labor Markets Using Field Experiments.” Econometrica 74 (5): 1365–1384.
Gneezy, Uri, John A. List, and George Wu. 2006. “The Uncertainty Effect: When a Risky Prospect
is Valued Less than its Worst Possible Outcome.” The Quarterly Journal of Economics 121 (4):
1283–1309.
Gneezy, Uri, Muriel Niederle, and Aldo Rustichini. 2003. “Performance in Competitive Environments: Gender Differences.” The Quarterly Journal of Economics 118 (3): 1049–1074.
Grosser, Jens, and Arthur Schram. 2010. “Public Opinion Polls, Voter Turnout, and Welfare: An
Experimental Study.” American Journal of Political Science 54 (3): 700–717.
Haan, Marco, and Peter Kooreman. 2002. “Free Riding and the Provision of Candy Bars.” Journal
of Public Economics 83: 277–291.
Harrison, Glenn W., and John A. List. 2008. “Naturally Occurring Markets and Exogenous Laboratory Experiments: A Case Study of the Winner’s Curse.” The Economic Journal 118 (2004):
822–843.
Harrison, Glenn W., John A. List, and Charles Towe. 2007. “Naturally Occurring Preferences
and Exogenous Laboratory Experiments: A Case Study of Risk Aversion.” Econometrica 75 (2):
433–458.
Harrison, Glenn W., Morten T. Lau, and Melonie B. Williams. 2002. “Estimating Individual
Discount Rates in Denmark: A Field Experiment.” American Economic Review 92 (5): 1606–
1617.
Hennig-Schmidt, H., B. Rockenbach, and A. Sadrieh. 2010. “In Search of Workers’ Real Effort Reciprocity: A Field and a Laboratory Experiment.” Journal of the European Economic Association
8 (4): 817–837.
Hossain, Tanjim, and John Morgan. 2005. “A Test of the Revenue Equivalence Theorem using
Field Experiments on eBay.” SSRN Electronic Journal.
Jerit, Jennifer, Jason Barabas, and Scott Clifford. N.d. “Comparing Contemporaneous Laboratory
and Field Experiments on Media Effects.” Public Opinion Quarterly. Forthcoming.
Karlan, Dean S. 2005. “Using Experimental Economics to Measure Social Capital and Predict
Financial Decisions.” American Economic Review 95 (5): 1688–1699.
Karlan, Dean S., and John A. List. 2007. “Does Price Matter in Charitable Giving? Evidence from
a Large-Scale Natural Field Experiment.” American Economic Review 97 (5): 1774–1793.
King, Eden B., and Afra S. Ahmad. 2010. “An Experimental Field Study Of Interpersonal Discrimination Toward Muslim Job Applicants.” Personnel Psychology 63 (4): 881–906.
Kube, Sebastian, Michel André Maréchal, and Clemens Puppe. 2006. “Putting Reciprocity to
Work - Positive Versus Negative Responses in the Field.” University of St. Gallen Economics
Discussion Paper 2006-27.
Landry, Craig, Andreas Lange, John A. List, and Michael K. Price. 2006. “Toward an Understanding of the Economics of Charity: Evidence from a Field Experiment.” The Quarterly Journal of
Economics 121 (2): 747–782.
Laury, Susan K., and Laura O. Taylor. 2008. “Altruism Spillovers: Are Behaviors in Context-Free
Experiments Predictive of Altruism Toward a Naturally Occurring Public Good?” Journal of
Economic Behavior & Organization 65 (1): 9–29.
Leonard, Kenneth L., and Melkiory C. Masatu. 2008. “Moving from the Lab to the Field: Exploring
Scrutiny and Duration Effects in Lab Experiments.” Economics Letters 100 (2): 284–287.
Levitt, Steven D., John A. List, and David H. Reiley. 2010. “What Happens in the Field Stays in
the Field: Exploring Whether Professionals Play Minimax in Laboratory Experiments.” Econometrica 78 (4): 1413–1434.
Lindsay, R. C. L., and Gary L. Wells. 1985. “Improving Eyewitness Identifications from Lineups:
Simultaneous versus Sequential Lineup Presentation.” Journal of Applied Psychology 70 (3):
556–564.
List, John A. 2001. “Do Explicit Warnings Eliminate the Hypothetical Bias in Elicitation Procedures? Evidence from Field Auctions for Sportscards.” American Economic Review 91 (5):
1498–1507.
List, John A. 2003. “Does Market Experience Eliminate Market Anomalies?” The Quarterly
Journal of Economics 118 (1): 41–71.
List, John A. 2004a. “The Nature and Extent of Discrimination in the Marketplace: Evidence from
the Field.” The Quarterly Journal of Economics 119 (1): 49–89.
List, John A. 2004b. “Young, Selfish and Male: Field Evidence of Social Preferences.” The Economic
Journal 114 (492): 121–149.
List, John A. 2006a. “Friend or Foe? A Natural Experiment of the Prisoner’s Dilemma.” The
Review of Economics and Statistics 88 (3): 463–471.
List, John A. 2006b. “The Behavioralist Meets the Market: Measuring Social Preferences and
Reputation Effects in Actual Transactions.” Journal of Political Economy 114 (1): 1–37.
List, John A. 2006c. “Using Hicksian Surplus Measures to Examine Consistency of Individual
Preferences: Evidence from a Field Experiment.” Scandinavian Journal of Economics 108 (1):
115–134.
List, John A., and David Lucking-Reiley. 2000. “Demand Reduction in Multiunit Auctions: Evidence from a Sportscard Field Experiment.” American Economic Review 90 (4): 961–972.
List, John A., and David Lucking-Reiley. 2002. “The Effects of Seed Money and Refunds on
Charitable Giving: Experimental Evidence from a University Capital Campaign.” Journal of
Political Economy 110 (1): 215–233.
List, John A., and Michael K. Price. 2005. “Conspiracies and Secret Price Discounts in the Marketplace: Evidence from Field Experiments.” Rand Journal of Economics 36 (3): 700–717.
Lucking-Reiley, David. 1999. “Using Field Experiments to Test Equivalence Between Auction
Formats: Magic on the Internet.” American Economic Review 89 (5): 1063–1080.
Martin, Richard, and John Randal. 2005. “Voluntary Contributions to a Public Good: A Natural
Field Experiment.” Unpublished manuscript, Victoria University.
Matsen, Egil, and Bjarne Strom. 2006. “Joker: Choice in a Simple Game with Large Stakes.”
Norwegian University of Science and Technology Working Paper Series 2006:19.
Meier, Stephan. 2007. “Do Subsidies Increase Charitable Giving in the Long Run? Matching
Donations in a Field Experiment.” Journal of the European Economic Association 5 (6): 1203–
1222.
Mellström, Carl, and Magnus Johannesson. 2008. “Crowding Out in Blood Donation: Was Titmuss
Right?” Journal of the European Economic Association 6 (4): 845–863.
Nevin, John R. 1974. “Laboratory Experiments for Estimating Consumer Demand: A Validation
Study.” Journal of Marketing Research 11 (3): 261–268.
Normann, Hans-Theo, Till Requate, and Israel Waichman. 2012. “Do Short-Term Laboratory
Experiments Provide Valid Descriptions of Long-Term Economic Interactions?” SSRN Electronic
Journal.
Östling, Robert, Joseph Wang, Eileen Chou, and Colin F. Camerer. 2011. “Testing Game Theory
in the Field: Swedish LUPI Lottery Games.” American Economic Journal: Microeconomics 3
(August): 1–33.
Palacios-Huerta, Ignacio, and Oscar Volij. 2008. “Experientia Docet: Professionals Play Minimax
in Laboratory Experiments.” Econometrica 76 (1): 71–115.
Reiley, David H. 2006. “Field Experiments On The Effects Of Reserve Prices In Auctions: More
Magic On The Internet.” The RAND Journal of Economics 37 (1): 195–211.
Rondeau, Daniel, and John A. List. 2008. “Matching and Challenge Gifts to Charity: Evidence
from Laboratory and Natural Field Experiments.” Experimental Economics 11 (3): 253–267.
Shang, Jen, Americus Reed, and Rachel Croson. 2008. “Identity Congruency Effects on Donations.”
Journal of Marketing Research 45: 351–361.
Stoop, Jan. 2012. “From the Lab to the Field: Envelopes, Dictators and Manners.” University
Library of Munich, Germany MPRA Paper 37048.
Stoop, Jan, Charles Noussair, and Daan van Soest. 2010. “From the Lab to the Field: Cooperation
among Fishermen.” University Library of Munich, Germany MPRA Paper (28924).
Stutzer, Alois, Lorenz Goette, and Michael Zehnder. 2011. “Active Decisions and Prosocial Behaviour: a Field Experiment on Blood Donation.” The Economic Journal 121 (556): F476–F493.
Thaler, Richard H., and Shlomo Benartzi. 2004. “Save More Tomorrow: Using Behavioral Economics to Increase Employee Saving.” Journal of Political Economy 112 (1).
Tonin, Mirco, and Michael Vlassopoulos. 2009. “Disentangling the Sources of Pro-social Behavior
in the Workplace: A Field Experiment.”
Valentino, Nicholas A., Michael W. Traugott, and Vincent L. Hutchings. 2002. “Group Cues and
Ideological Constraint: A Replication of Political Advertising Effects Studies in the Lab and in
the Field.” Political Communication (19): 29–48.
Appendix Table: Results for Single Outcome Studies
Erev, Bornstein, and Galili (1993) compared with Bornstein, Erev, and Rosen (1990)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Control                  280.00     11.22       12      0.47    0.09     30
  Competition              380.00     11.22       12      0.78    0.05     60
Lab summary statistics were generously provided by the original authors. The field estimates
are reported on page 472. We estimated standard errors on the basis of the reported F-statistic
and an assumption of equal variances.
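For readers who wish to reproduce this kind of calculation, the sketch below shows one way to back out per-group standard errors from a reported F-statistic in the two-group, equal-variance case. The function and its inputs are ours for illustration (the F value is back-calculated from the entries above rather than taken from the original article), and the snippet is not the original authors' code. The same logic applies where only t-statistics are reported, with t used directly in place of the square root of F.

    import math

    # Minimal sketch (not the original authors' code): recover per-group
    # standard errors of the mean from a reported F-statistic for a
    # two-group comparison, assuming equal variances. With two groups,
    # F = t^2, and t equals the difference in means divided by the
    # standard error of that difference.
    def se_from_f(mean_1, mean_2, n_1, n_2, f_stat):
        t = math.sqrt(f_stat)
        se_diff = abs(mean_1 - mean_2) / t
        s_pooled = se_diff / math.sqrt(1 / n_1 + 1 / n_2)  # pooled standard deviation
        return s_pooled / math.sqrt(n_1), s_pooled / math.sqrt(n_2)

    # Illustrative inputs: group means of 280 and 380 with 12 observations
    # each and an F of roughly 39.7 (back-calculated for this example).
    print(se_from_f(280, 380, 12, 12, 39.7))  # approximately (11.2, 11.2)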
Gneezy and Rustichini (2000)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Control                  238.60     30.27       30     28.40    2.20     40
  Small                    153.60     26.14       30     23.07    2.33     40
  Large                    219.30     28.86       30     34.70    1.40     40
Descriptive statistics were reported in Tables 1 and 4 on pages 797-800.
Valentino, Traugott, and Hutchings (2002)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Control                   -0.25      0.09       59     -0.26    0.10     77
  Neutral                   -0.15      0.09       65      0.07    0.10     75
  White                     -0.13      0.11       53      0.19    0.09     84
  Black + White              0.00      0.13       38     -0.04    0.10     78
The data were generously provided by the original authors.
Gneezy and Rustichini (2004) compared with Gneezy et al. (2003)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  No Competition (M)         0.06      0.07       12     11.23    0.75     30
  No Competition (F)         0.02      0.06       12      9.73    0.64     30
  Competition (M)            0.16      0.04       63     15.03    0.75     30
  Competition (F)           -0.01      0.04       53     10.87    0.87     30
Field descriptive statistics were reported in Tables 1 and 2 on pages 379-380 of the field study.
Complete results were recovered from Figures 1 and 2 on page 1055 of the lab study.
Gneezy, Haruvy, and Yafe (2004)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Individual Pay            37.29      2.56       24      3.17    0.17     12
  Even Split                50.92      2.93       24      3.42    0.29     12
  1/6th Pay                 57.42      6.28       12      4.67    0.19     12
Complete results were presented in Tables 1 and 7 on pages 270-275.
List (2006)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  $20                        2.10      0.13       50      3.10    0.16     30
  $65                        3.20      0.14       50      4.10    0.11     30
These estimates were reported in Table 3 on page 17. Group sizes were reported in Table 1
on page 7.
Shang, Reed, and Croson (2006)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Match                    141.88     14.09       35     24.65    0.90     89
  Mismatch                 105.70      8.84       41     27.58    0.82     95
The lab point estimates were reconstructed from Figure 3 on page 358 and the standard errors
were estimated, with additional assumptions, from the t-statistics reported on page 357. The
field estimates were reported on page 353.
Rondeau and List (2008)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  High Control               1.84      0.41      748      7.53    0.39     48
  Challenge                  2.16      0.48      749      6.84    0.49     48
  Matching                   1.65      0.31      750      5.76    0.46     48
  Low Control                1.27      0.24      747      7.33    0.48     48
Dataset accessed from Daniel Rondeau’s personal website.
Harrison and List (2008)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  AIS                        9.43      2.43       15     95.42   14.72     42
  SIS                        6.94      1.73       16     93.65   15.19     38
Data were generously provided by the original authors.
Armantier and Boly (2008)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Control                    0.49      0.08       37      0.67    0.09     30
  High Wage                  0.38      0.08       40      0.48    0.09     31
  High Bribe                 0.69      0.07       45      0.66    0.08     32
  Monitoring                 0.41      0.07       44      0.66    0.08     32
All data come from Table 1 on page 26.
Abeler and Marklein (2010)
  Condition            Field Mean  Field SE  Field N  Lab Mean  Lab SE  Lab N
  Cash                      15.03      1.34       34     14.42    0.54     45
  Label                     18.94      1.44       37     16.70    0.60     46
Summary statistics for the field experiment were generously provided by the original authors.
Complete lab data were recovered from Figure 2 on page 21.