«A chi-square test showed that...» – or did it really?

Bård Uri Jensen
http://privat.hihm.no/buj/
bard.jensen@hihm.no

Allowing [statistical software] to do our thinking is a sure recipe for disaster.
(Good & Hardin, 2012, p. xi)

«Simple» statistical tests
• chi-square (χ²) test
• t-test

Statistical hypothesis testing
1. Formulate a hypothesis
   E.g. In Norwegian L2, Vietnamese speakers make more TENSE errors than Somali speakers.
2. Formulate a null-hypothesis
   Vietnamese and Somali speakers have the same rate of TENSE errors.
3. «Disprove» the null-hypothesis = demonstrate its unlikelihood
   E.g. less than 5% chance of observing data this extreme if the null-hypothesis is true = «Significance»
• We choose α according to what we consider an acceptable risk of false conclusions
  Often 5% in linguistic research

Conditions of use
• Independent observations: chi-square test, t-test
• Parametric assumptions: t-test
• The dangers of repeated testing: any test

A simple example from ornithology

A simple example from corpus linguistics
An important condition of use for
• chi-squared test
• t-test
The observations should be independent.
The observations should be of different individuals.

«Chi-square is a much-abused test in second language research studies, and often one of its assumptions (that of independence of data) is violated as a matter of course.»
Larson-Hall (2010, p. 206)
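The independence condition above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reanalysis of any study: the number of speakers, the number of tokens per speaker and the Beta(2, 5) distribution of per-speaker error rates are all hypothetical choices made for illustration. Both groups are drawn from the same population, so the null-hypothesis is true and every «significant» result is a false conclusion.

```python
import math
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, two-sided p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 0.0, 1.0  # degenerate table: no evidence against the null
    stat = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    p = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p


def false_positive_rate(tokens_per_speaker, speakers_per_group=24,
                        n_sim=1000, alpha=0.05, seed=1):
    """Share of simulations in which a true null-hypothesis is rejected.

    Hypothetical setup: every speaker has an individual error rate drawn
    from Beta(2, 5); both groups come from the same population.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        table = []
        for _group in range(2):
            errors = 0
            for _speaker in range(speakers_per_group):
                rate = rng.betavariate(2, 5)  # this speaker's own error rate
                errors += sum(rng.random() < rate
                              for _ in range(tokens_per_speaker))
            total = speakers_per_group * tokens_per_speaker
            table.append((errors, total - errors))
        _, p = chi_square_2x2(*table[0], *table[1])
        rejections += p < alpha
    return rejections / n_sim


if __name__ == "__main__":
    # One observation per individual: the observations are independent.
    print("1 token per speaker:  ", false_positive_rate(1))
    # Many observations per individual, pooled as if they were independent.
    print("20 tokens per speaker:", false_positive_rate(20))
```

With one observation per individual, the share of false rejections should stay close to the chosen α. When each individual contributes many pooled observations, it should come out several times higher, even though no real group difference exists.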
Example 1: Chi-squared test, non-independent observations
• Blom & Paradis 2013
  Journal of Speech, Language, and Hearing Research
  On past tense production in L2 children with language impairment
• 48 children with English as L2
• Overregularization of past tense
  Hypothesis: Less common in verb stems ending in /d/ or /t/

              overregularization   zero marking
  d# or t#            16                69
  others              42                98

• χ²(1) = 3.45, p (one-sided) = 0.032
• Problem: n = 85 + 140 = 225 observations, but only N = 48 children
• Observations are not independent, so the result is invalid.

Example 1: Chi-squared test, non-independent observations
• Solution A: Pick just one observation from each author/speaker
  “To exclude the author as one more relevant factor, the database was cleaned so that there is only one example for each verb from any single author.”
  Sokolova 2012, p. 94
• Solution B:
  • Calculate average values for each informant
  • Use the average values as independent observations
  • Test significance with an appropriate test, e.g. t-test or U-test
  Gujord 2013
Both these solutions might require a larger corpus!
• «Solution» C: Alter the research question
  Danckaert 2011

Example 2: T-test, non-independent observations
• Klavan 2012
  PhD thesis from Tartu University
  Investigation of adposition ‘peal’ and adessive case
• 450 observations of each, from 2 corpora
• t = 8.02, p < 0.001
• Conclusion: adessive phrases are longer than ‘peal’-phrases
• Problem: Observations are not independent.
• The conclusion is invalid.

Example 3: T-test, non-normal populations
• Hunter (2011, p. 48)
  PhD thesis from Birmingham University
  On grammaticality judgements by L2 students
• Conclusion:
  the accuracy (max. = 1) for the teacher group (M = .98, SD = .14) was significantly higher than the student group (M = .64, SD = .49), t(1) = 4.9, p < .001.
• Problem: Mean = 0.98, Maximum value = 1, Standard deviation = 0.14
• The distribution cannot possibly be normal: with this mean and standard deviation, a normal distribution would place over 40% of the values above the maximum.
• The result is invalid.

Example 4: Repeated testing
• Leedham 2011
  PhD thesis, The Open University
  Features in the writing of Chinese students in UK universities
• Conclusion:
  There are differences in frequencies of certain phrases between 3rd year students and younger students
• Problem:
  Repeated testing without adjusting the probability values
• Some of the results are not valid.

Moral
There are no simple tests.
1. You should understand the conditions of the test.
2. You should take the conditions into account.
3. You should document properly how you perform the test, what numbers you put into it, and how the conditions are met.
«A chi-square test showed that the difference is significant.» - or did it really?

Is it really that important?
• «[C]ompared to other social sciences (e.g., psychology, communication, sociology, anthropology, …) or branches of linguistics (e.g., psycholinguistics, phonetics, sociolinguistics…), most of corpus linguistics has paradoxically only begun to develop this methodological awareness.»
  Gries (forthcoming, p. 1)
• «It has become increasingly apparent over a period of several years that psychologists, taken in the aggregate, employ the chi-square test incorrectly.»
  Lewis and Burke (1949)

Whose responsibility is it?
«Corpus linguistics needs to ‘catch up’ [...]»
Gries (forthcoming, p. 1)

References (http://privat.hihm.no/buj)

Boneau, A. C. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57(1), 49-64.
Good, P. I., & Hardin, J. W. (2012). Common errors in statistics (and how to avoid them). Hoboken: John Wiley.
Gries, S. (forthcoming). Quantitative designs and statistical techniques. http://www.linguistics.ucsb.edu/faculty/stgries/research/InProgr_STG_QuantDesAndMethCorpLing_CUPHb.pdf
Larson-Hall, J. (2010). A Guide to Doing Statistics in Second Language Research Using SPSS. New York: Routledge.
Lewis, D., & Burke, C. J. (1949). The use and misuse of the chi-square test. Psychological Bulletin, 46(6), 433-489.

Blom & Paradis (2013). Past Tense Production by English Second Language Learners With and Without Language Impairment. Journal of Speech, Language, and Hearing Research, 56, 281-294.
Danckaert, L. (2011). On the left periphery of Latin embedded clauses. Ph.D. thesis. University of Gent.
Gujord, A. H. (2013). Grammatical encoding of past time in L2 Norwegian: The roles of L1 influence and verb semantics. Ph.D. thesis. University of Bergen.
Hunter, J. D. (2011). A multi-method investigation of the effectiveness and utility of delayed corrective feedback in second-language oral production. Ph.D. thesis. University of Birmingham.
Klavan, J. (2012). Evidence in linguistics: corpus-linguistic and experimental methods for studying grammatical synonymy. Ph.D. thesis. University of Tartu.
Leedham, M. (2011). A corpus-driven study of features of Chinese students’ undergraduate writing in UK universities. Ph.D. thesis. The Open University.
Sokolova, S. (2012). Asymmetries in Linguistic Construal: Russian Prefixes and the Locative Alternation. Ph.D. thesis. University of Tromsø.