z-squared: the origin and use of χ²
- or what I wish I had been told about statistics
(but had to work out for myself)
Sean Wallis
Survey of English Usage
University College London
s.wallis@ucl.ac.uk
Outline
• What is the point of statistics?
  – Linguistic alternation experiments
  – How inferential statistics works
• Introducing z tests
  – Two types (single-sample and two-sample)
  – How these tests are related to χ²
• Comparing experiments and 'effect size'
  – Swing and 'skew'
• Low frequency events and small samples
What is the point of statistics?
• Analyse data you already have (observational science)
  – corpus linguistics
• Design new experiments (experimental science)
  – collect new data, add annotation
  – experimental linguistics in the lab
• Try new methods (philosophy of science)
  – pose the right question
• We are going to focus on z and χ² tests (a little maths)
What is 'inferential statistics'?
• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
  – Will we get the same result again?
• Inferential statistics is a method of inferring the behaviour of future 'ghost' experiments from one experiment
  – Infer from the sample to the population
• Let us consider one type of experiment
  – Linguistic alternation experiments
Alternation experiments
• Imagine a speaker forming a sentence as a series of decisions/choices. They can
  – add: choose to extend a phrase or clause, or stop
  – select: choose between constructions
• Choices will be constrained
  – grammatically
  – semantically
• Research question:
  – within these constraints, what factors influence the particular choice?
Alternation experiments
• Laboratory experiment (cued)
  – pose the choice to subjects
  – observe the one they make
  – manipulate different potential influences
• Observational experiment (uncued)
  – observe the choices speakers make when they make them (e.g. in a corpus)
  – extract data for different potential influences
    • sociolinguistic: subdivide data by genre, etc
    • lexical/grammatical: subdivide data by elements in surrounding context
Statistical assumptions
• A random sample taken from the population
  – Not always easy to achieve
    • multiple cases from the same text and speakers, etc
    • may be limited historical data available
  – Be careful with data concentrated in a few texts
• The sample is tiny compared to the population
  – This is easy to satisfy in linguistics!
• Repeated sampling tends to form a Binomial distribution
  – This requires slightly more explanation...
The Binomial distribution
• Repeated sampling tends to form a Binomial distribution
  – We toss a coin 10 times, and get 5 heads
  – Repeat the whole experiment N times and tally the outcomes
[Figure sequence: frequency F of observing x heads (x-axis 1 to 9), plotted for N = 1, 4, 8, 12, 16, 20 and 24 repetitions; the tally fills out a bell-like shape as N grows]
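To make the repeated-sampling idea concrete, here is a minimal Python sketch (my own illustration, not code from the slides): re-run the 10-toss experiment N times and tally how often each number of heads occurs. The choice of N values mirrors the figure sequence above.

```python
# Repeated-sampling sketch: tally heads across N repeated experiments.
import random
from collections import Counter

def repeat_experiment(n_repeats, tosses=10, p_heads=0.5):
    """Tally the number of heads seen in each of n_repeats experiments."""
    counts = Counter()
    for _ in range(n_repeats):
        heads = sum(random.random() < p_heads for _ in range(tosses))
        counts[heads] += 1
    return counts

for n in (1, 4, 8, 24):
    freq = repeat_experiment(n)
    print(n, dict(sorted(freq.items())))
```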
Binomial → Normal
• The Binomial (discrete) distribution tends to match the Normal (continuous) distribution
[Figure: Binomial frequencies F for x = 1 to 9 overlaid with the corresponding Normal curve]
The central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean x̄ = P
  – standard deviation s = √( P(1 – P) / n )
  – With more data in the experiment, s will be smaller
  – Divide x by 10 for the probability scale p
  – 95% of the curve is within ~2 standard deviations of the mean (the correct figure is z = 1.95996!)
[Figure: Normal curve F over the probability scale p (0.1 to 0.7), centred on the mean, with 2.5% in each tail beyond ± z·s and 95% in between]
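These two formulas translate directly into Python. A minimal sketch (my own illustration), using the coin example with P = 0.5 and n = 10:

```python
import math

def normal_interval(P, n, z=1.95996):
    """95% Normal interval about the population mean P."""
    s = math.sqrt(P * (1 - P) / n)   # standard deviation s = sqrt(P(1 - P)/n)
    return P - z * s, P + z * s

# Coin example: P = 0.5, n = 10 tosses per experiment
print(normal_interval(0.5, 10))   # roughly (0.19, 0.81)
```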
The single-sample z test...
• Is an observation > z standard deviations from the expected population mean?
  – If yes, the result is significant
[Figure: Normal distribution about the population mean P, with 2.5% tails beyond P ± z·s; an observation p falling in a tail is significant]
...gives us a "confidence interval"
• P ± z·s is the confidence interval for P
  – Enough for a test
  – But we need the interval about the observation p
[Figure: the Normal interval about P compared with the interval (w–, w+) about the observation p]
...gives us a "confidence interval"
• The interval about p is called the Wilson score interval
• This interval is asymmetric
• It reflects the Normal interval about P:
  – If P is at the upper limit of p, p is at the lower limit of P (Wilson, 1927)
[Figure: asymmetric interval (w–, w+) about the observation p, with P at one limit]
...gives us a "confidence interval"
• The interval about p is called the Wilson score interval
• To calculate w– and w+ we use this formula (Wilson, 1927):

  w–, w+ = ( p + z²/2n ∓ z·√( p(1 – p)/n + z²/4n² ) ) / ( 1 + z²/n )

[Figure: the interval (w–, w+) plotted about the observation p]
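The formula is easy to compute. A minimal sketch in Python (my own helper, not from the slides; 1.95996 is the two-tailed 95% z given earlier):

```python
import math

def wilson_interval(p, n, z=1.95996):
    """Wilson (1927) score interval (w-, w+) about an observed proportion p."""
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(wilson_interval(2 / 3, 30))   # e.g. p(b | a) = 2/3, n = 30, from the worked example below
```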
Plotting confidence intervals
• E.g. plot the probability of adding successive attributive adjectives to an NP in ICE-GB
  – You can easily see that the first two falls are significant, but the last is not
[Figure: p (0.00 to 0.25) against the number of attributive adjectives (0 to 4), each point plotted with its confidence interval]
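A sketch of how such a plot could be drawn with matplotlib, repeating the wilson_interval helper so the snippet stands alone. The data points and sample sizes here are invented for illustration; they are not the ICE-GB figures behind the slide.

```python
import math
import matplotlib.pyplot as plt

def wilson_interval(p, n, z=1.95996):
    centre, denom = p + z * z / (2 * n), 1 + z * z / n
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

xs = [0, 1, 2, 3]               # adjectives already present (illustrative)
ps = [0.22, 0.10, 0.05, 0.04]   # hypothetical observed probabilities
ns = [1000, 220, 22, 5]         # hypothetical sample sizes

lows, highs = zip(*(wilson_interval(p, n) for p, n in zip(ps, ns)))
yerr = [[p - lo for p, lo in zip(ps, lows)],    # asymmetric lower errors
        [hi - p for p, hi in zip(ps, highs)]]   # asymmetric upper errors

plt.errorbar(xs, ps, yerr=yerr, fmt='o-', capsize=3)
plt.xlabel('attributive adjectives already in NP')
plt.ylabel('p(add another)')
plt.show()
```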
A simple experiment
• Consider two binary variables, A and B
  – Each one is subdivided:
    • A = {a, ¬a} e.g. NP has AJP? {yes, no} (A = dependent variable)
    • B = {b, ¬b} e.g. speaker gender {male, female} (B = independent variable)
  – Does B 'affect' A?
• We perform an experiment (or sample a corpus)
  – We find 45 cases (NPs) classified by A and B:

          a    ¬a    Σ
    b    20     5   25
    ¬b   10    10   20
    Σ    30    15   45

  – This is a 'contingency table'
• Q1. Does B cause a to differ from A?
  – Does speaker gender affect the decision to include an AJP?
Does B cause a to differ from A?
• Compare column 1 (a) and column 3 (A)
  – Probability of picking b at random (gender = male)
    • p(b) = 25/45 = 5/9 = 0.556
• Next, examine a (has AJP)
  – New probability of picking b
    • p(b | a) = 20/30 = 2/3 = 0.667
  – Confidence interval for p(b)
    • population standard deviation s = √( p(b)(1 – p(b)) / n ) = √( (5/9 × 4/9) / 30 )
    • p(b) ± z·s = (0.378, 0.733)
• Not significant: p(b | a) is inside the c.i. for p(b)
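The same test in a few lines of Python, a sketch using the figures above (my own code, not from the slides):

```python
import math

z = 1.95996
p_b = 25 / 45          # p(b): probability of a male speaker overall
p_b_given_a = 20 / 30  # p(b | a): probability of a male speaker given AJP
n = 30                 # size of the a (has AJP) column

s = math.sqrt(p_b * (1 - p_b) / n)        # population standard deviation
lower, upper = p_b - z * s, p_b + z * s   # c.i. for p(b): (0.378, 0.733)

significant = not (lower <= p_b_given_a <= upper)
print(f"c.i. = ({lower:.3f}, {upper:.3f}), significant = {significant}")
```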
Visualising this test
• Confidence interval for p(b)
  – P = expected value, E = expected distribution
[Figure: expected distribution E centred on P = p(b) = 0.556, with interval (0.378, 0.733); the observation p(b | a) = 0.667 falls inside]
The single-sample z test
• Compares an observation with a given value
  – We used it to compare p(b | a) with p(b)
  – This is a "goodness of fit" test
  – Identical to a standard 2 × 1 χ² test
  – No need to test p(¬b | a) with p(¬b)
• Note that p(b) is given
  – All of the variation is assumed to be in the estimation of p(b | a)
  – Could also compare p(b | ¬a) (no AJP) with p(b)
• Q2. Does B cause a to differ from ¬a?
  – Does speaker gender affect presence / absence of AJP?
z test for 2 independent proportions
• Method: combine observed values
  – take the difference (subtract) |p1 – p2|
  – calculate an 'averaged' confidence interval
[Figure: two observed distributions O1 and O2 on the probability scale, where p1 = p(b | a) and p2 = p(b | ¬a)]
z test for 2 independent proportions
• New confidence interval D = |O1 – O2|
  – standard deviation s' = √( p̂(1 – p̂) (1/n1 + 1/n2) )
  – p̂ = p(b) = 25/45 = 5/9
  – compare the difference in p, x̄ = |p1 – p2|, with z·s'
[Figure: difference distribution D with mean x̄ = 0; the result is significant if |p1 – p2| > z·s']
Does B cause a to differ from ¬a?
• Compare column 1 (a) and column 2 (¬a)
  – Probabilities (speaker gender = male)
    • p(b | a) = 20/30 = 2/3 = 0.667
    • p(b | ¬a) = 5/15 = 1/3 = 0.333
  – Confidence interval
    • pooled probability estimate p̂ = p(b) = 5/9 = 0.556
    • standard deviation s' = √( p̂(1 – p̂) (1/n1 + 1/n2) ) = √( (5/9 × 4/9) (1/30 + 1/15) )
    • z·s' = 0.308
• Significant: |p(b | a) – p(b | ¬a)| > z·s'
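Again, a minimal Python sketch of this calculation (my own code, using the figures above):

```python
import math

z = 1.95996
p1, n1 = 20 / 30, 30   # p(b | a)
p2, n2 = 5 / 15, 15    # p(b | ¬a)
p_hat = 25 / 45        # pooled estimate p(b)

s_prime = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
print(f"z.s' = {z * s_prime:.3f}")                 # 0.308
print("significant:", abs(p1 - p2) > z * s_prime)  # True: 0.333 > 0.308
```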
z test for 2 independent proportions
• Identical to a standard 2 × 2 χ² test
  – So you can use the usual method!
• BUT: these tests have different purposes
  – 2 × 1 goodness of fit compares a single value a with its superset A
    • assumes only a varies
  – 2 × 2 test compares two values a, ¬a within a set A
    • both values may vary
• Q: Do we need χ²?
[Figure: a and ¬a as subsets of A, labelled 2 × 2 χ²]
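The equivalence is easy to check numerically. In this sketch (my own illustration), Pearson's χ² for the 2 × 2 table coincides with the square of the pooled two-proportion z score:

```python
import math

table = [[20, 5], [10, 10]]   # rows b, ¬b; columns a, ¬a
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
N = sum(row)

# Pearson chi-square: sum of (O - E)^2 / E over the four cells
chisq = sum((table[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
            for i in range(2) for j in range(2))

# Pooled two-proportion z score for the same table
p1, p2 = table[0][0] / col[0], table[0][1] / col[1]   # p(b | a), p(b | ¬a)
p_hat = row[0] / N
s_prime = math.sqrt(p_hat * (1 - p_hat) * (1 / col[0] + 1 / col[1]))
z_score = (p1 - p2) / s_prime

print(chisq, z_score ** 2)   # both 4.5: chi-square = z squared
```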
Larger χ² tests
• χ² is popular because it can be applied to contingency tables with many values
  • r × 1 goodness of fit χ² tests (r ≥ 2)
  • r × c χ² tests for homogeneity (r, c ≥ 2)
• z tests have 1 degree of freedom
  • strength: significance is due to only one source
  • strength: easy to plot values and confidence intervals
  • weakness: multiple values may be unavoidable
• With larger χ² tests, evaluate and simplify:
  • Examine χ² contributions for each row or column (see the sketch below)
  • Focus on alternation – try to test for a speaker choice
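A minimal helper (my own sketch) for inspecting the per-cell χ² contributions, (O – E)²/E, in an r × c table; large contributions point at the rows or columns driving significance:

```python
def chisq_contributions(table):
    """Per-cell chi-square contributions (O - E)^2 / E for an r x c table."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    N = sum(row)
    return [[(table[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
             for j in range(len(col))] for i in range(len(row))]

for r in chisq_contributions([[20, 5], [10, 10]]):
    print([f"{x:.2f}" for x in r])
```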
How big is the effect?
• These tests do not measure the strength of the interaction between two variables
  – They test whether the strength of an interaction is greater than would be expected by chance
    • With lots of data, a tiny change would be significant
  – Don't use χ², p or z values to compare two different experiments
    • A result significant at p < 0.01 is not 'better' than one significant at p < 0.05
• There are a number of ways of measuring 'association strength' or 'effect size'
Percentage swing
• Compare probabilities of a DV value (a, AJP) across a change in the IV (gender):
  – swing d = p(a | ¬b) – p(a | b) = 10/20 – 20/25 = –0.3
• As a proportion of the initial value
  – % swing d% = d / p(a | b) = –0.3 / 0.8 = –37.5%

          a    ¬a    Σ
    b    20     5   25
    ¬b   10    10   20
    Σ    30    15   45

• We can even calculate confidence intervals on d or d%
  – Use the z test for two independent proportions (we are comparing differences in p values)
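In Python the swing computation is one line per quantity; a sketch using the table above:

```python
# Swing and percentage swing for the worked example.
p_a_given_b = 20 / 25       # p(a | b):  male speakers, NP has AJP
p_a_given_not_b = 10 / 20   # p(a | ¬b): female speakers, NP has AJP

d = p_a_given_not_b - p_a_given_b   # swing d = -0.3
d_pct = d / p_a_given_b             # percentage swing d% = -37.5%
print(f"d = {d:.2f}, d% = {d_pct:.1%}")
```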
Cramér's φ
• Can be used on any χ² table
  – Mathematically well defined
  – Probabilistic (c.f. swing d ∈ [–1, +1], d% = ?)
    • φ = 0 ↔ no relationship between A and B
    • φ = 1 ↔ B strictly determines A
    • straight line between these two extremes (an 'averaged' swing)

  φ = 0:                    φ = 1:
          a    ¬a    Σ              a    ¬a    Σ
    b    0.5   0.5   1        b     1     0    1
    ¬b   0.5   0.5   1        ¬b    0     1    1
    Σ     1     1    2        Σ     1     1    2

  – Based on χ²
    • φ = √( χ² / N )              (2 × 2)   N = grand total
    • φc = √( χ² / ((k – 1)N) )    (r × c)   k = min(r, c)
• Can be used for r × 1 goodness of fit tests
  – Recalibrate using methods in Wallis (2012)
  – Better indicator than percentage swing
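A sketch of the φ formula in Python (my own code), combining it with the χ² computation shown earlier; for a 2 × 2 table k = 2, so this reduces to √(χ²/N):

```python
import math

def cramers_phi(table):
    """Cramér's phi: sqrt(chi-square / ((k - 1) N)), k = min(rows, cols)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    N = sum(row)
    chisq = sum((table[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
                for i in range(len(row)) for j in range(len(col)))
    k = min(len(row), len(col))
    return math.sqrt(chisq / ((k - 1) * N))

print(cramers_phi([[20, 5], [10, 10]]))   # ~0.316 for the worked example
```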
Significantly better?
• Suppose we have two similar experiments
  – How do we test if one result is significantly stronger than another?
• Test swings
  • Use the z test for two samples from different populations
  • Use s' = √( s1² + s2² )
  • Test |d1(a) – d2(a)| > z·s'

  Experiment 1:                 Experiment 2:
          a    ¬a    Σ                  a    ¬a    Σ
    b    20     5   25            b    50     5   55
    ¬b   10    10   20            ¬b   10    10   20
    Σ    30    15   45            Σ    60    15   75

[Figure: bar chart comparing the swings d1(a) and d2(a) on a scale from 0 to –0.7]
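A sketch of this comparison in Python. The slides do not spell out how s1 and s2 are estimated, so the code below assumes the usual unpooled variance for a difference of two proportions; treat this as one reasonable choice rather than the authors' exact method.

```python
import math

def swing_and_sd(table):
    """Swing d = p(a|¬b) - p(a|b) and its standard deviation, assuming the
    unpooled variance estimate for a difference of two proportions."""
    (a_b, na_b), (a_nb, na_nb) = table   # rows: b, ¬b
    n_b, n_nb = a_b + na_b, a_nb + na_nb
    p_b, p_nb = a_b / n_b, a_nb / n_nb
    d = p_nb - p_b
    s = math.sqrt(p_b * (1 - p_b) / n_b + p_nb * (1 - p_nb) / n_nb)
    return d, s

d1, s1 = swing_and_sd([(20, 5), (10, 10)])   # experiment 1
d2, s2 = swing_and_sd([(50, 5), (10, 10)])   # experiment 2
s_prime = math.sqrt(s1 ** 2 + s2 ** 2)
print("significant:", abs(d1 - d2) > 1.95996 * s_prime)
```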
Modern improvements on z and χ²
• 'Continuity correction' for small n
  – Yates' χ² test – can be used elsewhere
• Wilson's score interval
  – The correct formula for intervals on p
[Figure: the asymmetric Wilson interval (w–, w+) about p]
• Newcombe (1998) improves on the 2 × 2 χ² test
  – Uses the Wilson interval
  – Better than χ² and log-likelihood (etc.) for low-frequency events
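A sketch of Newcombe's Wilson-based interval for the difference between two proportions, as I read the method in Newcombe (1998); verify against the paper before relying on it. The difference is significant if the interval excludes zero.

```python
import math

def wilson(p, n, z=1.95996):
    """Wilson (1927) score interval about an observed proportion p."""
    centre, denom = p + z * z / (2 * n), 1 + z * z / n
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom, (centre + spread) / denom

def newcombe_interval(p1, n1, p2, n2, z=1.95996):
    """Score interval for p1 - p2, built from the two Wilson intervals."""
    l1, u1 = wilson(p1, n1, z)
    l2, u2 = wilson(p2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

# Worked example: p(b | a) vs p(b | ¬a); interval excludes 0, so significant
print(newcombe_interval(20 / 30, 30, 5 / 15, 15))
```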
Conclusions
• The basic idea of all of these tests is
  – Predict future results if the experiment were repeated
    • 'Significant' = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by the Normal distribution – many uses
    • Plotting confidence intervals
    • Use goodness of fit or single-sample z tests to compare a sample, a, with a point it is dependent on, A
    • Use 2 × 2 tests or two independent-sample z tests to compare two observed samples (a, ¬a)
    • When using larger r × c tests, simplify as far as possible to identify the source of variation!
Conclusions
• Two methods for measuring the 'size' of an experimental effect
  – Simple idea, easy to report
    • absolute or percentage swing
  – More reliable, but possibly less intuitive
    • Cramér's φ
  – You can compare two experiments
    • Is absolute swing significantly greater? Use a type of z test!
    • A similar approach is possible with φ
• Take care with small samples / low frequencies
  – Use Wilson and Newcombe's methods instead!
References
• Newcombe, R.G. 1998. Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine 17: 873–890.
• Wallis, S.A. 2009. Binomial distributions, probability and Wilson's confidence interval. London: Survey of English Usage.
• Wallis, S.A. 2010. z-squared: The origin and use of χ². London: Survey of English Usage.
• Wallis, S.A. 2012. Goodness of fit measures for discrete categorical data. London: Survey of English Usage.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209–212.
• Assorted statistical tests:
  – www.ucl.ac.uk/english-usage/staff/sean/resources/2x2chisq.xls