z-squared: the origin and use of χ² - or what I wish I had been told about statistics (but had to work out for myself)

Sean Wallis
Survey of English Usage, University College London
s.wallis@ucl.ac.uk

Outline
• What is the point of statistics?
  – Linguistic alternation experiments
  – How inferential statistics works
• Introducing z tests
  – Two types (single-sample and two-sample)
  – How these tests are related to χ²
• Comparing experiments and 'effect size'
  – Swing and 'skew'
• Low frequency events and small samples

What is the point of statistics?
• Analyse data you already have – corpus linguistics (observational science)
• Design new experiments – collect new data, add annotation; experimental linguistics in the lab (experimental science)
• Try new methods – pose the right question (philosophy of science)
• We are going to focus on z and χ² tests (a little maths)

What is 'inferential statistics'?
• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the result?
  – Suppose we repeat the experiment: will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future 'ghost' experiments from one experiment
  – Infer from the sample to the population
• Let us consider one type of experiment: linguistic alternation experiments

Alternation experiments
• Imagine a speaker forming a sentence as a series of decisions/choices. They can
  – add: choose to extend a phrase or clause, or stop
  – select: choose between constructions
• Choices will be constrained
  – grammatically
  – semantically
• Research question: within these constraints, what factors influence the particular choice?

Alternation experiments
• Laboratory experiment (cued)
  – pose the choice to subjects
  – observe the one they make
  – manipulate different potential influences
• Observational experiment (uncued)
  – observe the choices speakers make when they make them (e.g. in a corpus)
  – extract data for different potential influences
    • sociolinguistic: subdivide data by genre, etc.
    • lexical/grammatical: subdivide data by elements in the surrounding context

Statistical assumptions
• A random sample taken from the population
  – Not always easy to achieve
    • multiple cases from the same texts and speakers, etc.
    • there may be limited historical data available
  – Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
  – This is easy to satisfy in linguistics!
• Repeated sampling tends to form a Binomial distribution
  – This requires slightly more explanation... (see the simulation sketch below)
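Not part of the original slides: a minimal Python sketch of the 'ghost experiments' idea, simulating repeated 10-toss coin experiments and tallying the number of heads x in each. All function names here are illustrative, not from the deck.

```python
import random
from collections import Counter

def toss_experiment(tosses=10, p_heads=0.5):
    """One experiment: the number of heads in a fixed run of coin tosses."""
    return sum(random.random() < p_heads for _ in range(tosses))

def repeat_experiments(n, tosses=10):
    """Repeat the experiment n times and tally each outcome x (heads count)."""
    return Counter(toss_experiment(tosses) for _ in range(n))

# The tally approaches the Binomial distribution as n grows
for n in (10, 100, 10000):
    freq = repeat_experiments(n)
    print(n, dict(sorted(freq.items())))
```

Running this shows the frequency distribution filling out with repetition, which is exactly what the plots on the next slides depict.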
The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
  – We toss a coin 10 times and get 5 heads; repeating the experiment adds to the frequency distribution
(Figure: frequency F of each outcome x after N = 1, 4, 8, 12, 16, 20 and 24 repetitions; the distribution fills out as N grows.)

Binomial → Normal
• The Binomial (discrete) distribution tends to match the Normal (continuous) distribution

The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean x̄ = P
  – standard deviation s = √( P(1 − P) / n )
  – With more data in the experiment, s will be smaller
  – Divide x by 10 for the probability scale p
• 95% of the curve is within ~2 standard deviations of the mean, with 2.5% in each tail (the correct figure is 1.95996!)

The single-sample z test...
• Is an observation more than z standard deviations from the expected population mean P?
  – If yes, the result is significant

...gives us a 'confidence interval'
• P ± z·s is the confidence interval for P
  – Enough for a test
  – But we need the interval about the observation p
• The interval about p is called the Wilson score interval (Wilson, 1927)
  – This interval is asymmetric
  – It reflects the Normal interval about P: if P is at the upper limit of p, p is at the lower limit of P
• To calculate w⁻ and w⁺ we use this formula (see the code sketch below):

  w⁻, w⁺ = ( p + z²/2n ∓ z·√( p(1 − p)/n + z²/4n² ) ) / ( 1 + z²/n )

Plotting confidence intervals
• E.g. plot the probability p of adding successive attributive adjectives to an NP in ICE-GB
  – You can easily see that the first two falls are significant, but the last is not
(Figure: p with Wilson score intervals for 0 to 4 adjectives, falling from about 0.25 towards 0.)
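A sketch of the Wilson score interval formula above in Python, using z = 1.95996 for a 95% interval; the function name is mine, not the author's.

```python
from math import sqrt

def wilson_interval(p, n, z=1.95996):
    """Wilson (1927) score interval (w-, w+) for an observed
    proportion p based on a sample of size n."""
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Observed proportion from the worked example below: p(b | a) = 20/30
print(wilson_interval(20 / 30, 30))  # an asymmetric interval about p
```

Note that the interval is not centred on p: for small n or extreme p it is pulled towards 0.5, which is what makes it reliable where the symmetric Normal interval fails.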
A simple experiment
• Consider two binary variables, A and B. Each one is subdivided:
  – A = {a, ¬a}, e.g. NP has AJP? {yes, no} – the dependent variable
  – B = {b, ¬b}, e.g. speaker gender {male, female} – the independent variable
  – Does B 'affect' A?
• We perform an experiment (or sample a corpus)
  – We find 45 cases (NPs) classified by A and B – a 'contingency table':

         a   ¬a    Σ
    b   20    5   25
   ¬b   10   10   20
    Σ   30   15   45

• Q1. Does B cause a to differ from A?
  – Does speaker gender affect the decision to include an AJP?

Does B cause a to differ from A?
• Compare column 1 (a) with column 3 (A)
  – Probability of picking b at random (gender = male): p(b) = 25/45 = 5/9 ≈ 0.556
• Next, examine a (has AJP)
  – New probability of picking b: p(b | a) = 20/30 = 2/3 ≈ 0.667
• Confidence interval for p(b)
  – population standard deviation s = √( p(b)(1 − p(b)) / n ) = √( (5/9 × 4/9) / 30 )
  – p(b) ± z·s = (0.378, 0.733)
  – (Centring the same ±z·s interval on the observation instead gives 0.667 ± 0.178 = (0.489, 0.845).)
• Not significant: p(b | a) = 0.667 is inside the confidence interval for p(b)

Visualising this test
• Confidence interval for p(b): P = expected value, E = expected distribution
(Figure: Normal distribution E about P = p(b) = 0.556 with 95% interval (0.378, 0.733); the observation p(b | a) = 0.667 falls inside it.)

The single-sample z test
• Compares an observation with a given value
  – We used it to compare p(b | a) with p(b)
  – This is a 'goodness of fit' test
  – It is identical to a standard 2 × 1 χ² test
  – There is no need to test p(¬b | a) against p(¬b)
• Note that p(b) is given
  – All of the variation is assumed to be in the estimation of p(b | a)
  – We could also compare p(b | ¬a) (no AJP) with p(b)
• Q2. Does B cause a to differ from ¬a?
  – Does speaker gender affect the presence/absence of an AJP?

z test for 2 independent proportions
• Method: combine observed values
  – take the difference (subtract): |p1 − p2|
  – calculate an 'averaged' confidence interval
(Figure: the two observed distributions O1 (about p1 = p(b | a)) and O2 (about p2 = p(b | ¬a)), and the difference distribution D about mean x̄ = 0.)
• New confidence interval for the difference D = |O1 − O2|
  – standard deviation s′ = √( p̂(1 − p̂)(1/n1 + 1/n2) )
  – pooled probability estimate p̂ = p(b) = 25/45 = 5/9
  – compare x̄ ± z·s′, with mean x̄ = 0, against D: the difference is significant if |p1 − p2| > z·s′

Does B cause a to differ from ¬a?
• Compare column 1 (a) with column 2 (¬a)
  – Probabilities (speaker gender = male):
    • p(b | a) = 20/30 = 2/3 ≈ 0.667
    • p(b | ¬a) = 5/15 = 1/3 ≈ 0.333
• Confidence interval
  – pooled probability estimate p̂ = p(b) = 5/9 ≈ 0.556
  – standard deviation s′ = √( (5/9 × 4/9)(1/30 + 1/15) )
  – z·s′ = 0.308
• Significant: |p(b | a) − p(b | ¬a)| = 0.333 > z·s′ = 0.308
  (both tests are sketched in code below)
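The two tests just described, applied to the worked 2 × 2 table; a minimal sketch (z = 1.95996 for a two-tailed 5% test), not the author's spreadsheet implementation.

```python
from math import sqrt

Z = 1.95996  # two-tailed 5% critical value

# Contingency table from the slides: rows b / not-b, columns a / not-a
b_a, b_nota = 20, 5
nb_a, nb_nota = 10, 10

n_a, n_nota = b_a + nb_a, b_nota + nb_nota   # column totals: 30, 15
N = n_a + n_nota                             # grand total: 45
p_b = (b_a + b_nota) / N                     # p(b) = 25/45

# Q1: single-sample ('goodness of fit') z test: p(b | a) vs given p(b)
p_b_given_a = b_a / n_a
s = sqrt(p_b * (1 - p_b) / n_a)
print("Q1 significant?", abs(p_b_given_a - p_b) > Z * s)            # False

# Q2: z test for two independent proportions: p(b | a) vs p(b | not-a)
p_b_given_nota = b_nota / n_nota
s2 = sqrt(p_b * (1 - p_b) * (1 / n_a + 1 / n_nota))
print("Q2 significant?", abs(p_b_given_a - p_b_given_nota) > Z * s2)  # True
```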
z test for 2 independent proportions
• Identical to a standard 2 × 2 χ² test – so you can use the usual method!
• BUT: these tests have different purposes
  – The 2 × 1 goodness of fit test compares a single value a with its superset A
    • it assumes only a varies
  – The 2 × 2 test compares two values a, ¬a within a set A
    • both values may vary
• Q: Do we need χ²?

Larger χ² tests
• χ² is popular because it can be applied to contingency tables with many values
  – r × 1 goodness of fit χ² tests (r ≥ 2)
  – r × c χ² tests for homogeneity (r, c ≥ 2)
• z tests have 1 degree of freedom
  – strength: significance is due to only one source
  – strength: easy to plot values and confidence intervals
  – weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
  – examine the χ² contributions for each row or column
  – focus on alternation – try to test for a speaker choice

How big is the effect?
• These tests do not measure the strength of the interaction between two variables
  – They test whether the strength of an interaction is greater than would be expected by chance
  – With lots of data, a tiny change would be significant
• Don't use χ², p or z values to compare two different experiments
  – A result significant at p < 0.01 is not 'better' than one significant at p < 0.05
• There are a number of ways of measuring 'association strength' or 'effect size'

Percentage swing
• Compare probabilities of a DV value (a, AJP) across a change in the IV (gender):
  – swing d = p(a | ¬b) − p(a | b) = 10/20 − 20/25 = −0.3
• As a proportion of the initial value
  – % swing d% = d / p(a | b) = −0.3/0.8 = −37.5%
• We can even calculate confidence intervals on d or d%
  – Use the z test for two independent proportions (we are comparing differences in p values)

Cramér's φ
• Can be used on any χ² table
  – Mathematically well defined
  – Probabilistic (c.f. swing d ∈ [−1, +1], while d% has no fixed range)
• φ = 0: no relationship between A and B
• φ = 1: B strictly determines A
• A straight line between these two extremes – an 'averaged' swing

  φ = 0:                      φ = 1:
         a    ¬a    Σ                a   ¬a    Σ
    b   0.5   0.5    1          b    1    0    1
   ¬b   0.5   0.5    1         ¬b    0    1    1
    Σ    1     1     2          Σ    1    1    2

• Based on χ² (see the sketch below)
  – φ = √( χ² / N ) (2 × 2 tables), where N = grand total
  – φc = √( χ² / ((k − 1)N) ) (r × c tables), where k = min(r, c)
• Can be used for r × 1 goodness of fit tests
  – Recalibrate using the methods in Wallis (2012)
  – A better indicator than percentage swing
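A sketch computing both effect-size measures for the example table. The χ² value here is computed from expected cell frequencies in the usual way (no continuity correction); that computation is standard but is my addition, not spelled out on the slides.

```python
from math import sqrt

# Example table: rows b / not-b (IV), columns a / not-a (DV)
table = [[20, 5], [10, 10]]

row_sums = [sum(row) for row in table]            # 25, 20
col_sums = [sum(col) for col in zip(*table)]      # 30, 15
N = sum(row_sums)                                 # 45

# Percentage swing: change in p(a) across the IV
d = table[1][0] / row_sums[1] - table[0][0] / row_sums[0]  # p(a|¬b) - p(a|b)
d_pct = d / (table[0][0] / row_sums[0])
print(f"swing d = {d:.3f}, d% = {d_pct:.1%}")              # -0.300, -37.5%

# Cramér's phi = sqrt(chi-squared / (k - 1)N); (k - 1) = 1 for a 2x2 table
chi2 = sum((table[i][j] - row_sums[i] * col_sums[j] / N) ** 2
           / (row_sums[i] * col_sums[j] / N)
           for i in range(2) for j in range(2))
phi = sqrt(chi2 / N)
print(f"chi-squared = {chi2:.3f}, phi = {phi:.3f}")        # 4.500, 0.316
```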
Significantly better?
• Suppose we have two similar experiments
  – How do we test if one result is significantly stronger than another?
• Test the swings (sketched in code below)
  – Use a z test for two samples from different populations
  – Combined standard deviation s′ = √( s1² + s2² )
  – Test |d1(a) − d2(a)| > z·s′
(Figure: bar chart of the two swings d1(a) and d2(a) on a scale from 0 to −0.7.)

  Experiment 1:                 Experiment 2:
         a   ¬a    Σ                  a   ¬a    Σ
    b   20    5   25             b   50    5   55
   ¬b   10   10   20            ¬b   10   10   20
    Σ   30   15   45             Σ   60   15   75
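A sketch of the swing-comparison test above. One assumption of mine: each swing's standard deviation is estimated with the usual Gaussian formula for a difference of proportions, √(p1(1 − p1)/n1 + p2(1 − p2)/n2); the combined s′ = √(s1² + s2²) follows the slide. Newcombe's Wilson-based method (next slide) is the more accurate refinement.

```python
from math import sqrt

Z = 1.95996

def swing_and_sd(table):
    """Swing d = p(a|¬b) - p(a|b) for a 2x2 table [[b_a, b_nota],
    [nb_a, nb_nota]], with a Gaussian standard deviation for the
    difference of proportions (an assumption; see lead-in note)."""
    n1, n2 = sum(table[0]), sum(table[1])
    p1, p2 = table[0][0] / n1, table[1][0] / n2
    d = p2 - p1
    s = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, s

exp1 = [[20, 5], [10, 10]]   # experiment 1 from the slides
exp2 = [[50, 5], [10, 10]]   # experiment 2 from the slides

d1, s1 = swing_and_sd(exp1)
d2, s2 = swing_and_sd(exp2)
s_comb = sqrt(s1 ** 2 + s2 ** 2)
print(f"d1 = {d1:.3f}, d2 = {d2:.3f}")
print("significantly different?", abs(d1 - d2) > Z * s_comb)
```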
Modern improvements on z and χ²
• 'Continuity correction' for small n
  – Yates' χ² test – can be used elsewhere
• Wilson's score interval
  – The correct formula for intervals on p
• Newcombe (1998) improves on the 2 × 2 χ² test
  – Uses the Wilson interval
  – Better than χ² and log-likelihood (etc.) for low-frequency events

Conclusions
• The basic idea of all of these tests is to predict future results if the experiment were repeated
  – 'Significant' = effect > 0 (e.g. 19 times out of 20)
• They are based on the Binomial distribution, approximated by the Normal distribution – with many uses
  – Plotting confidence intervals
  – Use goodness of fit or single-sample z tests to compare a sample, a, with a point it is dependent on, A
  – Use 2 × 2 tests or two independent sample z tests to compare two observed samples (a, ¬a)
  – When using larger r × c tests, simplify as far as possible to identify the source of variation!

Conclusions
• Two methods for measuring the 'size' of an experimental effect
  – Simple idea, easy to report: absolute or percentage swing
  – More reliable, but possibly less intuitive: Cramér's φ
• You can compare two experiments
  – Is the absolute swing significantly greater? Use a type of z test!
  – A similar approach is possible with φ
• Take care with small samples / low frequencies
  – Use Wilson's and Newcombe's methods instead!

References
• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873-890.
• Wallis, S.A. 2009. Binomial distributions, probability and Wilson's confidence interval. London: Survey of English Usage.
• Wallis, S.A. 2010. z-squared: The origin and use of χ². London: Survey of English Usage.
• Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.

• Assorted statistical tests:
  – www.ucl.ac.uk/english-usage/staff/sean/resources/2x2chisq.xls