Doing Synthesis and Meta-Analysis in Applied Linguistics
Lourdes Ortega, University of Hawai‘i at Mānoa
National Tsing Hua University, Taiwan, June 8, 2011

Please cite as: Ortega, L. (2011). Doing synthesis and meta-analysis in applied linguistics. Invited workshop at Tsing Hua University, Taipei, June 8, 2011.
Copyright © Lourdes Ortega, 2011

Research synthesis (including meta-analysis)
1. What is it?
2. Why do it?
3. How do we do it?
4. An example…
5. Challenges?
6. Value?

What is research synthesis?
The reviewing continuum (secondary research), from narrative to systematic:
LIT REVIEW … SYNTHESIS … META-ANALYSIS

So, what is meta-analysis, specifically?
…one specific kind of research synthesis…
• Secondary analysis of quantitative analyses
• Each primary study is a data point
• Goal: what are the main ‘effects’ or ‘relationships’ found across many studies?
• Strictly speaking, only quantitative studies apply

Why do it?
Traditional literature reviews…
…have led to unending debates:
• What does the evidence “say”?
• According to whom?
• How do we know who is right?
e.g.: Critical Period Hypothesis (Hyltenstam et al. vs. Birdsong)
e.g.: error correction (Ferris vs. Truscott)

Typical strategies of traditional reviews?
• Tables summarizing many studies, e.g. from Krashen et al. (1979)
• Vote-counting technique, e.g.: error correction in L2 writing

Limitations:
• Idiosyncratic methodology: no specific set of methods, up to mysterious expertise
• Evidentiary warrants difficult to judge: experts are always vested, therefore vulnerable to charge of bias
• Statistical significance has serious pitfalls: over-reliance on statistical significance (but magnitude, not just generalizability, is of interest to social scientists!)

What does the evidence “say”? According to whom? How do we know who is right?
SOLUTION in the late 1970s
Methods for reviewing, from “art” into “science”:
• Systematic, not arbitrary
• More than the sum of the parts
• Replicable
Secondary, yes... but empirically accountable, & discovering new truths in old data

How do we do it?
Norris & Ortega (2006a, 2006b)
Norris, J. M., & Ortega, L. (2007). The future of research synthesis in applied linguistics: Beyond art or science. TESOL Quarterly, 41, 805-815.
Norris, J. M., & Ortega, L. (2010). Timeline: Research synthesis. Language Teaching, 43, 461-479.
Ortega, L. (2010). Research synthesis. In B. Paltridge & A. Phakiti (Eds.), Companion to research methods in applied linguistics (pp. 111-126). London: Continuum.
Norris, J. M. (2012). Meta-analysis. In C. Chapelle (Ed.), Encyclopedia of applied linguistics. Malden, MA: Wiley.

What are the definitional features of all syntheses (including meta-analyses)?
1. Principled selection of primary studies
2. Systematic coding of each study for main variables
3. Direct use of the evidence reported (not the authors’ interpretations) across studies

1. Principled selection of studies
Sampling is central to empirical research: what population are we trying to understand?
• Random [experimental]
• Purposive [qualitative]
Sampling is central to synthesis, as well:
• Complete [secondary research should be based on the full universe of studies that have investigated the same thing]

Search & Retrieval of Literature
The literature search is a key step in systematic synthesis (some direction: In'nami & Koizumi, 2010): identify all studies that are relevant
• Exhaustive [electronic, hand, footnote chasing, invisible college]
• Replicable [fully explained in report]
1st: electronic searches
2nd: other techniques:
• Manual searches of journals
• Footnote chasing
• Forward searches with Web of Science
• Website searches of key contributing scholars
• Polite email requests to authors & experts

Inclusion & Exclusion criteria
All potentially relevant studies must then be examined to decide: Include or Exclude (“apples or oranges?”)
• Exclusion criteria [explain each reason for exclusion and give examples]
• Inclusion criteria [all criteria satisfied]
Full rationale: [tables, appendices, philosophy of inclusivity or selectivity]

What are the definitional features of all syntheses (including meta-analyses)?
1. Principled selection of studies
   Literature search + Study eligibility criteria, Inclusion/exclusion

2. Systematic coding of each study
Eliciting evidence with consistency, just as when surveying, interviewing, or testing participants
Asking research questions of the literature:
• What variables are important?
• How (and how well) have they been investigated?
• What are the findings across studies?

Coding book to identify study features that answer questions
• Publication features: Year; Author; Published or fugitive? (journal, book, dissertation, presentation)
• Methodological features: Sample size; Design; Reliability; Stats used; etc.
• Substantive features: e.g., How was “explicit” instruction defined? e.g., How was “learning” measured? e.g., Means, sd, etc.?
Multiple coders

What are the definitional features of all syntheses (including all meta-analyses)?
1. Principled selection of studies
2. Systematic coding of each study for main variables
   Coding book, Standardization, Intercoder reliability

3. Trust the evidence, not the authors
Record carefully what authors report and how they report it,…
But ultimately, analyze what the evidence they present tells us, not what they say it means…
Seeking an objective view across studies of the accumulated state of knowledge…

When aggregating and averaging findings is the goal, as in meta-analysis…
How do we compare, combine, and interpret findings across numerous quantitative studies of the same thing?
effect sizes & confidence intervals

Effect size: What is it?
An estimate of the magnitude or strength of a quantitative finding:
…how much difference?
…how much improvement?
…how closely related?

Effect sizes: absolute scales
1. percent
   Study 1: Experimental group = 30% better than control
   Study 2: Experimental group = 20% better than control
2. correlation
   Study 1: Motivation & achievement, r = .36
   Study 2: Motivation & achievement, r = .78
3. known measure
   Study 1: Pre-post TOEFL score: 450 → 575
   Study 2: Pre-post TOEFL score: 450 → 495
Q: What happens when studies do not report findings on comparable scales?

Effect sizes: standardized
d is also simple to calculate and to interpret, and it incorporates variability differences between groups
Effect size d = (average of the experimental group minus average of the control group) divided by the pooled standard deviation of both groups.

Effect sizes: standardized
Difference between experimental and control groups in standard deviation units (Cohen’s d)
[Figure: difference between experimental and control groups; panels: No sizeable effect (d = 0.10); Very large effect (d = 3.00)]

Effect sizes for meta-analysis
Study 1: effect size 1
Study 2: effect size 2
Study 3: effect size 3
Study 4: effect size 4
Study 5: effect size 5
Study …: …
= average effect size

Interpreting effect sizes: What does d really tell us?
d > .80 / d < .80 / d > .30 / d < .30

"The terms 'small,' 'medium,' and 'large' are relative, not only to each other, but to the area of behavioral science or even more particularly to the specific content and research method being employed in any given investigation..." (Cohen, 1988, p. 25)

The average is not enough: Confidence Intervals
“The margin of error in an observation”
The stroll from the hotel to the University is, on average, 10 minutes, plus or minus 3 minutes (95% certainty):
• Lower bound = 7 minutes
• Average = 10 minutes
• Upper bound = 13 minutes

Confidence Intervals in Meta-analysis
CIs tell us about the certainty with which we can interpret an average effect size.

Effect Sizes and Confidence Intervals in Meta-analysis
Avg. effect of instructional treatment:
• N = 49
• K = 98
• Mean d = .96
• SD d = .87
• 95% CI lower = .78
• 95% CI upper = 1.14
We can be 95% certain that the actual effect of instruction lies between .78 and 1.14

Why does it help to focus on effect sizes?
e.g., effects of smoking research in the 1960s (U.S. Department of Health, Education, and Welfare Report, 1967)
• There is a statistically significant difference in mortality rates between smokers and non-smokers.
• Smoking up to half a pack a day (or less than 10 cigarettes a day) increases the chance of mortality by 40% when compared to non-smokers.
• Smoking two packs or more a day increases the risk of death by three times, to 120%, when compared to non-smokers.

And what about small effects: can they be important too?
r = .034, a truly ‘tiny’ effect!
Regular aspirin consumption and decrease in heart attacks = 3.4% decrease = at least 3 out of 100 who would not have a heart attack if they regularly took aspirin.
d = .30, a small magnitude effect!
Effects of reading tutorials for underachieving students, the same for untrained peer tutoring and for highly trained teachers engaging in longer hours of tutoring.
Both are important!
Interpreting effect sizes: complex, contextualized, not absolute

What are the definitional features of all syntheses (including all meta-analyses)?
1. Principled selection of studies
2. Systematic coding of each study for main variables
3. Direct use of the evidence reported (not the authors’ interpretations)
   Effect sizes, Confidence Intervals, Other kinds of new data based on old

How do we do it? An example of synthesis + meta-analysis
In applied linguistics, the first full-blown synthesis and meta-analysis:
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417-528.

Step 1: Problem Specification
Effects of instruction: inductive, traditional grammar, dictogloss, recasts, garden path, input flood, input enhancement, input processing, consciousness-raising, task-based interaction

Focus of Norris & Ortega (L2 instruction, L2 learning)
RQ 1 & 2: Instruction overall? By type?
RQ 3: Effect of outcome measures?
RQ 4: Instructional intensity?
RQ 5: Durability of effects?
RQ 6: Quality of research practices?

Step 2: Literature search
1st: electronic searches
2nd: other techniques:
• Manual searches of 14 journals
• Footnote chasing of 25 reviews
• Footnote chasing of each study included

Step 3: Study eligibility criteria
Potentially relevant: 250 >> relevant for synthesis: 77 >> adequate for meta-analysis: 49

Step 4: Coding of study features
• Type of instruction: FonF, FonFS, explicit, implicit
• Type of outcome measure: metalinguistic, selected, constrained, free
• Intensity of instruction: brief (less than 1 hr), short (between 1 and 2 hrs), medium (between 3 and 6 hrs), long (more than 7 hrs)
• Durability of effects: effect sizes on delayed tests

Steps 5 & 6: Analyze, display, interpret
Findings RQ 1 & 2 (effectiveness)
Findings RQ 3 (type of measure)
Findings RQ 4 (intensity)
Findings RQ 5 (durability)

RQ 1-5 (meta-analysis part): How effective is L2 instruction?
• Clearly more effective than no instruction or only meaningful exposure to L2
  d = 0.96 based on 49 studies
• Explicit instruction is superior in the short term to implicit instruction
  d = 1.13 versus d = 0.54, based on 69 and 29 contrasts, respectively
• But focus on form and on formS are equally effective
  d = 1.00 form versus 0.93 formS, based on 43 and 55 contrasts, respectively
• Effects are durable
  delayed post-tests from 22 studies: d = 1.02

RQ6 (synthesis part): Research practices
• Too many variables in a single design: need to simplify designs, increase N
• No pre-test (18%), no true control group (83%): need to always include both
• Poor reporting standards (52% no sd, 84% no instrument reliability, 57% no set alpha): editors need to demand better reporting
• Misuse of statistical inference (no assumptions checked or met, parametric stats on small samples, no consideration of magnitude): the field needs better training in statistics if they insist on using such methods

Since then…accumulation of meta-analyses
In 2000, when Norris & Ortega was published, there were only 2 other published systematic syntheses in applied linguistics. As of 2010, Norris & Ortega identified 23 in their Timeline, most published since 2006.
• Motivation: Masgoret & Gardner (2003)
• Interaction: Keck et al. (2006), Mackey & Goo (2007)
• Oral feedback: Russell & Spada (2006), Lyster & Saito (2010), Li (2010)
• Use of glosses in CALL: Taylor (2006 & 2009), Abraham (2008)

Some challenges for research synthesis in L2 research…

Publication bias: “file drawer problem”
Well known phenomenon, present in all the social sciences (Rosenthal, 1979; Rothstein et al., 2005)
Little understood in applied linguistics
• Include fugitive literature
• Check for publication bias

Quality: “garbage in, garbage out”
The quality of a synthesis can only be as good as the quality of the primary studies that are synthesized in it...
But how do we judge quality? Publication type? Methodology ratings? Exclusions?
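One way to act on the “Check for publication bias” point above is a fail-safe N in the spirit of Rosenthal (1979): an estimate of how many unpublished null-result studies would have to sit in file drawers before the combined result stopped being significant. The slides do not say which check to use, so this is a minimal sketch under assumptions: the function name `fail_safe_n` and the z-scores are hypothetical, and the criterion is a one-tailed p = .05 (z = 1.645).

```python
def fail_safe_n(z_scores, z_alpha=1.645):
    """Fail-safe N: (sum of Z)^2 / z_alpha^2 - k.

    z_scores: one standard normal deviate per included study.
    z_alpha:  critical z for the chosen significance level
              (1.645 corresponds to one-tailed p = .05).
    """
    k = len(z_scores)
    return (sum(z_scores) ** 2) / (z_alpha ** 2) - k


# Hypothetical z-scores for five primary studies (illustration only).
example_z = [2.0, 1.5, 2.5, 1.0, 3.0]
n_fs = fail_safe_n(example_z)
print(f"Fail-safe N: {n_fs:.1f}")
```

For these made-up values the estimate is about 32 hidden null studies. A small fail-safe N relative to the number of included studies suggests that the average effect could be fragile; other checks may be preferable depending on the data.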
Ethics
Anticipate consequences of synthesis:
• Would it prematurely close the area for research?
• Would it be taken as a personal attack on researchers/labs?
• What is the potential for findings to be (mis)appropriated by audiences (policy makers, teachers, …)?

High-tech statistication, cookie-cutter approach
“... conceptual vacuum when technical meta-analytic expertise is not coupled with deep knowledge of the theoretical and conceptual issues at stake in the research domain under review…” (Norris & Ortega, 2006b, p. 37)

Meta-analysis only, no interest in quantitative synthesis of other kinds/scope
• Thomas (1994), (2006)
• Ortega (2003)
• ?????
New-generation meta-analyses bypass synthesis:
• Li (2010)
• Lyster & Saito (2010)
• Plonsky (2011)
• Spada & Tomita (2010)

Qualitative synthesis?
No interest either in exploring qualitative synthesis…
Only Téllez & Waxman (2006) in applied linguistics
Yet, much contemporary research in applied linguistics is qualitative and increasingly more is mixed-methods… both worth synthesizing!
And there are options to draw from in education, health sciences, and other fields!
• Meta-ethnography (Noblit & Hare, 1988; see Téllez & Waxman, 2006)
• Qualitative Comparative Analysis (Ragin, 1999)
• Critical Interpretive Synthesis (Dixon-Woods et al., 2006)

Value?
There is huge value in systematic synthesis (including meta-analysis):
Secondary research, yes... but:
• Empirically accountable
• Conceptually illuminating: discovering new truths in old data

Sustained progress…
• Much improvement in certain reporting practices (LL, MLJ in particular)
• Larger N in primary studies = more trustworthy analyses
• Use of increasingly sophisticated techniques in meta-analyses… study quality criteria, weighting (by N, reliability, variance), fixed/random effects models, sensitivity analysis, trim-and-fill estimations, publication bias, etc.
• Use of meta-analytic software, e.g.: http://www.meta-analysis.com

But only if applied linguists cultivate “the will to synthesis”
“we envision synthetic methodologies as advancing our ability to produce new knowledge by carefully building upon, expanding, and transforming what has been accumulated over time ... However, ... all knowledge is bound by context and purpose...” (Norris & Ortega, 2006b, p. 37)

Thank You
lortega@hawaii.edu

References
Abraham, L. B. (2008). Computer-mediated glosses in second language reading comprehension and vocabulary learning: A meta-analysis. Computer Assisted Language Learning, 21, 199-226.
Dixon-Woods, M., Bonas, S., Booth, A., Jones, D. R., Miller, T., Sutton, A. J., et al. (2006). How can systematic reviews incorporate qualitative research? A critical perspective. Qualitative Research, 6, 27-44.
Keck, C. M., Iberri-Shea, G., Tracy-Ventura, N., & Wa-Mbaleka, S. (2006). Investigating the empirical link between task-based interaction and acquisition: A meta-analysis. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 91-131). Amsterdam: John Benjamins.
Krashen, S., Long, M. H., & Scarcella, R. (1979). Accounting for child-adult differences in second language rate and attainment. TESOL Quarterly, 13, 573-582.
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis. Language Learning, 60, 309-365.
Lyster, R., & Saito, K. (2010). Oral feedback in classroom SLA: A meta-analysis. Studies in Second Language Acquisition, 32(2).
Mackey, A., & Goo, J. M. (2007). Interaction research in SLA: A meta-analysis and research synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A collection of empirical studies (pp. 407-452). New York: Oxford University Press.
Masgoret, A.-M., & Gardner, R. C. (2003). Attitudes, motivation, and second language learning: A meta-analysis of studies conducted by Gardner and associates. Language Learning, 53, 123-163.
Noblit, G. W., & Hare, R. D. (1988). Meta-ethnography: Synthesizing qualitative studies. Newbury Park, CA: Sage.
Norris, J. M. (2012). Meta-analysis. In C. Chapelle (Ed.), Encyclopedia of applied linguistics. Malden, MA: Wiley.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417-528.
Norris, J. M., & Ortega, L. (Eds.). (2006a). Synthesizing research on language learning and teaching. Amsterdam: John Benjamins.
Norris, J. M., & Ortega, L. (2006b). The value and practice of research synthesis for language learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 3-50). Amsterdam: John Benjamins.
Norris, J. M., & Ortega, L. (2007). The future of research synthesis in applied linguistics: Beyond art or science. TESOL Quarterly, 41, 805-815.
Norris, J. M., & Ortega, L. (2010). Research timeline: Research synthesis. Language Teaching, 43, 461-479.
Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics, 24, 492-518.
Ortega, L. (2010). Research synthesis. In B. Paltridge & A. Phakiti (Eds.), Companion to research methods in applied linguistics (pp. 111-126). London: Continuum.
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-analysis. Language Learning, 61(4).
Ragin, C. C. (1999). Using Qualitative Comparative Analysis to study causal complexity. Health Services Research, 34(5, Part 2), 1225-1239.
Russell, J., & Spada, N. (2006). The effectiveness of corrective feedback for the acquisition of L2 grammar: A meta-analysis of the research. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 133-164). Amsterdam: John Benjamins.
Spada, N., & Tomita, Y. (2010). Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60, 263-308.
Taylor, A. M. (2006). The effects of CALL versus traditional L1 glosses on L2 reading comprehension. CALICO Journal, 23, 309-318.
Taylor, A. M. (2009). CALL-based versus paper-based glosses: Is there a difference in reading comprehension? CALICO Journal, 27, 147-160.
Téllez, K., & Waxman, H. C. (2006). A meta-synthesis of qualitative research on effective teaching practices for English Language Learners. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 245-277). Amsterdam: John Benjamins.
Thomas, M. (1994). Assessment of L2 proficiency in second language acquisition research. Language Learning, 44, 307-336.
Thomas, M. (2006). Research synthesis and historiography: The case of assessment of second language proficiency. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 279-298). Amsterdam: John Benjamins.