Discrete Markov Chain Monte Carlo

Chapter 2: Randomization tests
There is a name for the method used by Connor and Simberloff to compare the observed
number of checkerboards (10 for the finch data) with what you could expect to get if
species had distributed themselves purely at random. The method belongs to the general
area of statistics called hypothesis testing, and more specifically, the method is an
instance of a randomization test. (In ecology, randomization tests are often named for
the models they are based on, called null models.) Until the 1980s, randomization tests
tended to be limited to comparatively simple kinds of data sets, because there was no
general and well-understood method for generating random data sets in more complicated
situations. Computers and related theory have been changing that during the last two
decades, so that randomization tests have grown in importance, and are now used much
more often than in the past. Much of the mathematics in this book is tied to the theory of
how to create random data sets in order to carry out randomization tests. The following
example, which comes from a court case, illustrates the use of randomization tests in one
of the simpler situations where it is easy to see how to generate random data sets.
2.1 Martin vs. Westvaco: An Introduction to Randomization Tests
Robert Martin turned 55 during 1991. Earlier in that same year the Westvaco
Corporation, which makes paper products, decided to downsize. They laid off several
members of their engineering department, where Bob Martin worked, and he was one of
those who lost their jobs. Later that year, he hired a lawyer to sue Westvaco, claiming he
had been laid off because of his age. A major piece of Martin's case was based on a
statistical analysis of the ages of the employees at Westvaco.
At the time the layoffs began, Bob Martin was one of 50 people working at various jobs
in the engineering department of Westvaco's envelope division. Some were paid by the
hour; others, like Martin, who had more education and greater responsibility, were
salaried. Over the course of the spring, Westvaco's management went through five
rounds of planning for a reduction in force. In Round 1, they decided to eliminate 11
positions. In Round 2 they added 9 more to the list. By the time the layoffs ended, after
all 5 rounds, only 22 of the 50 workers had kept their jobs, and the average age in the
department had fallen from 48 to 46.
Display 2.1 shows the data provided by Westvaco to Martin's lawyers.1 Each row
corresponds to one worker, and each column corresponds to some feature: job title,
whether hourly or salaried, month and year of birth, month and year of hire, and age as of the first of January, 1991
(shortly before the layoffs). The last column tells how the worker fared in the
downsizing: a 1 means chosen for layoff in Round 1 of planning for the reduction in
force, a 2 means Round 2, and similarly for 3, 4 or 5; however, 0 means "not chosen for
layoff."
1 The statistical analysis in the lawsuit (CA No. 92-03121-F) used all 50 employees in the
Engineering Department of the Envelope Division, with separate analyses for exempt
(salaried) and non-exempt (hourly) workers.
02/17/16 George W. Cobb, Mount Holyoke College, NSF#0089004
page 2.1
Row  Job title                 Pay  Birth mo/yr  Hire mo/yr  RIF  AGE 1/1/91
 1   Engineering Clerk          H      9/66         7/89      0       25
 2   Engineering Tech II        H      4/53         8/78      0       38
 3   Engineering Tech II        H     10/35         7/65      0       56
 4   Secretary to Engin Manag   H      2/43         9/66      0       48
 5   Engineering Tech II        H      8/38         9/74      1       53
 6   Engineering Tech II        H      8/36         3/60      1       55
 7   Engineering Tech II        H      1/32         2/63      1       59
 8   Parts Crib Attendant       H     11/69        10/89      1       22
 9   Engineering Tech II        H      5/36         4/77      2       55
10   Engineering Tech II        H      8/27        12/51      2       64
11   Technical Secretary        H      5/36        11/73      2       55
12   Engineering Tech II        H      2/36         4/62      3       55
13   Engineering Tech II        H      9/58        11/76      4       33
14   Engineering Tech II        H      7/56         5/77      4       35
15   Customer Serv Engineer     S      4/30         9/66      0       61
16   Customer Serv Engr Assoc   S      2/62         5/88      0       29
17   Design Engineer            S     12/43         9/67      0       48
18   Design Engineer            S      3/37         6/74      0       54
19   Design Engineer            S      3/36         2/78      0       55
20   Design Engineer            S      1/31         3/67      0       60
21   Engineering Assistant      S      6/60         7/86      0       31
22   Engineering Associate      S      2/57         4/85      0       34
23   Engineering Manager        S      2/32        11/63      0       59
24   Machine Designer           S      9/59         3/90      0       32
25   Packaging Engineer         S      3/38        11/83      0       53
26   Prod Spec - Printing       S     12/44        11/74      0       47
27   Proj Eng-Elec              S      9/43         4/71      0       48
28   Project Engineer           S      7/49         9/73      0       42
29   Project Engineer           S      8/43         4/64      0       48
30   Project Engineer           S      6/34         8/81      0       57
31   Supv Engineering Serv      S      4/54         6/72      0       37
32   Supv Machine Shop          S     11/37         3/64      0       54
33   Chemist                    S      8/22         4/54      1       69
34   Design Engineer            S      9/38        12/87      1       53
35   Engineering Associate      S      2/61         9/85      1       30
36   Machine Designer           S      2/39         4/85      1       52
37   Machine Parts Cont-Supv    S     10/28         8/53      1       63
38   Prod Specialist            S      9/27        10/43      1       64
39   Project Engineer           S      7/25         9/59      1       66
40   Chemist                    S     12/30        10/52      2       61
41   Design Engineer            S      4/60         5/89      2       31
42   Electrical Engineer        S     11/49         3/86      2       42
43   Machine Designer           S      3/35        12/68      2       56
44   Machine Parts Cont Coor    S      9/37        10/67      2       54
45   VH Prod Specialist         S      5/35         9/55      2       56
46   Printing Coordinator       S      2/41         1/62      3       50
47   Prod Dev Engineer          S      6/59        11/85      3       32
48   Prod Specialist            S      7/32         1/55      4       59
49   VH Prod Specialist         S      3/42         4/62      4       49
50   Engineering Associate      S      8/68         5/89      5       23

Display 2.1 The data in Martin versus Westvaco.
On balance, the patterns in the Martin data show that the percentage of people laid off
was higher for older workers than for younger ones. One of the main arguments in the
case was about what those patterns mean: are the patterns "real," or could they be due
just to natural variation? There's no way to repeat Westvaco's actual decision process,
which means there's no way to measure the variability in that process. In fact, it's hard to
say precisely what "natural variability" really means. It is possible, however, first to
define a simple, artificial, age-neutral decision process (a null model), then to repeat that
process, and use the results to ask whether that process is variable enough to give results
as extreme as Westvaco's.
A comprehensive analysis to answer that question would be quite involved. For now,
though, you can get a pretty good idea of how the analysis goes by working with just a
subset of the data. Here are the ages and Row IDs of the ten hourly workers involved in
the second of the five rounds of layoffs, arranged from youngest to oldest. The ages of
the three who were laid off are marked with an asterisk:

Age:     25   33   35   38   48   55   55*  55*  56   64*
Row ID:   1   13   14    2    4   12    9   11    3   10
Deciding what to make of the data requires balancing two points of view. On one hand, the
pattern in the data is pretty striking. Of the five people under age 50, all kept their jobs.
Of the five who were 55 or older, only two kept their jobs. On the other hand, the
numbers of people involved are pretty small: just three out of ten. Should you take
seriously a pattern involving so few people? The two viewpoints correspond to two sides
of an argument that was at the center of the statistical part of the Martin case. Here's a
simplified version.2
Martin: Look at the pattern in the data: All three of the workers laid off were much
older than average. That's evidence of age bias.
Westvaco: Not so fast! You're only looking at ten people total, and only three jobs were
eliminated. Just one small change and the picture would be entirely different. For
example, suppose it had been the 25-year-old instead of the 64-year-old who was
laid off. Switch the 25 and the 64, and you get a totally different set of averages:
Actual data:   25  33  35  38  48  55* 55* 55  56  64*
Altered data:  25* 33  35  38  48  55* 55* 55  56  64
(* = laid off)

Average ages:
               Laid off   Kept
Actual data      58.0     41.4
Altered data     45.0     47.0
2 I owe the idea of a dialog to Statistics (1978) by David Freedman, Robert Pisani, and
Roger Purves, now in a third edition (1998).
See! Just one small change and the average age of the three who were laid off is
actually lower than the average age of the others.
Martin: Not so fast, yourself! Of all the possible changes, you picked the one that is
most favorable to your side. If you'd switched one of the 55-year-olds who got
fired with the 55-year-old who kept his job, the averages wouldn't change at all.
Why not compare what actually happened with all the possibilities that
might have happened? Start with the ten workers, and pick three at random. Do
this over and over, to see what typically happens, and compare the actual data
with these results.
Westvaco: But you'd be ignoring relevant information, things like worker qualifications,
and which positions were easiest to do without.
Martin: I agree. But you're changing the subject. Remember our question: "Is the
sample large enough to support a conclusion?" That's a pretty narrow question. It
doesn't say anything about why the workers were chosen. At this point, we're just
asking "If you treat all ten workers alike, and pick three at random without regard
to age, how likely is it that their average age will be 58 or more?"
You can use simulation to estimate the probability p that if you draw three workers at
random, just by chance you will get an average age of 58 years or more.
Randomization tests by simulation: generate, compare, estimate
Generate a large number (NReps) of random data sets. Here, each “data
set” is a random subset of 3 workers’ ages chosen from the 10.
Compare each random data set with the actual data. Is the average age for
the random data set greater than or equal to 58? (Yes/No)
Estimate the probability p by the observed proportion of Yes answers:
p̂ = (# Yes)/(# Repetitions).
If p̂ is tiny, you know that an average age of 58 is too extreme to occur just by chance.
Some other explanation is needed.
For most applications, it will be necessary to carry out the three steps on a computer, but
this deliberately simplified example is one you can do by drawing marbles out of a
bucket or the equivalent.
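In fact, this example is small enough that you can skip simulation entirely: there are only 120 equally likely ways to choose 3 of the 10 workers, so the probability can be found by exhaustive enumeration. Here is a sketch in Python (used as a stand-in for the S-Plus introduced later in the chapter):

```python
from itertools import combinations

# The ten hourly workers' ages from Round 2 of the Martin data.
ages = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]

# Enumerate all C(10,3) = 120 equally likely samples of three workers.
# We combine over index positions so that the tied 55s count separately.
samples = list(combinations(range(10), 3))
n_yes = sum(1 for s in samples
            if sum(ages[i] for i in s) / 3 >= 58)

p = n_yes / len(samples)
print(len(samples), n_yes, p)   # → 120 6 0.05
```

Six of the 120 samples average 58 or more, so p = 6/120 = 0.05 exactly, which agrees with the roughly 5% that the 1000-repetition simulation reports later in the section.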
Activity 2.1 (Physical simulation): Did Westvaco Discriminate?
Step 1. Generate random data sets
Write each of the ten ages on identical squares cut from 3x5 cards, and put them
in a box: 25, 33, 35, 38, 48, 55, 55, 55, 56, 64.
Mix the squares thoroughly and draw out three at random without replacement.
Step 2. Compare each random data set with the actual data (55, 55, 64)
Compute the average age for the sample. Is the value ≥ 58? (Record Yes or No.)
Step 3. Estimate the value of p using the observed proportion.
Repeat Steps 1 and 2 ten times. Combine your results with those from the rest of
the class before you compute the proportion p̂ = (# Yes) / (# Repetitions).
Your chance model in the physical simulation is completely age neutral: All sets of three
workers have exactly the same chance of being selected for layoff, regardless of age. The
simulation tells you what sort of results are reasonable to expect from that sort of age-blind process. Here are the first four of 1000 repetitions from such a model:
Simulation: in each repetition, three of the ten ages
25  33  35  38  48  55  55  55  56  64
are drawn at random (the drawn ages are underlined in the original display).
Average ages for the first four repetitions: 42.67, 48.00, 42.67, 37.00.
Display 2.2 is a plot that shows the distribution of average ages for 1000 repetitions of
the sampling process.

[Histogram: x-axis "Average age of those chosen," 30 to 60; y-axis "Number of times," 0 to 50]

Display 2.2 Results of 1000 repetitions
The distribution of average age of those chosen for layoff by the chance model
Out of 1000 repetitions, only 49, or about 5%, gave an average age of 58 or older. So it is
not at all likely that just by chance you'd pick workers as old as the three Westvaco
picked. Did the company discriminate? There's no way to tell just from the numbers
alone. However, if your simulations had told you that an average of 58 or older is easy to
get by chance alone, then the data would provide no evidence of discrimination. If, on
the other hand, it turns out to be very unlikely to get a value this big just by chance,
statistical logic says to conclude that the pattern is "real," that is, more than just
coincidence. It is then up to the company to explain why their decision-making process
led to such a large average age for those laid off.
The logic of the last paragraph may take some time to get used to, but it can help to recast
the logic in the form of a real argument between two people. Here's an imaginary version
of such an argument.
Martin: Look at the pattern in the data: All three of the workers laid off were much
older than average.
Westvaco: So what? I claim you could get a result like that just by chance. If chance
alone can account for the pattern, there's no reason to look for any other explanation.
Martin: OK, let's test your claim. If it's easy to get an average as big as 58 by drawing
at random, I'll agree that we can't rule out chance as one possible explanation. But if an
average that big is really hard to get from random draws, we agree that chance alone can't
account for the pattern. Right?
Westvaco: Right.
Martin: Here are the results of my simulations. If you look at the three hourly workers
laid off in round two, the probability is only 5% that you could get an average age of 58
or more. And if you do the same computations for the entire engineering department, the
probability is a lot less, about 0.01, or one out of 100. What do you say to that?
Westvaco: Well ... I'll agree that it's really hard to get patterns that extreme just by
chance, but that by itself still doesn't prove discrimination.3
In principle we can apply the same three steps to the finch data, using the number of
checkerboards in place of average age for comparing data sets in Step 2. Our estimate in
Step 3 would then give an answer to the question Connor and Simberloff asked: “If you
generate data sets purely at random, so that each data set has the same chance as each of
the others, how likely are you to get 10 or more checkerboards?” Although in principle
this three-step approach will answer the question, in practice it is hard to carry out Step 1,
because there is no quick and simple way to generate random data sets. In a sense, much
of the rest of this entire book deals with the mathematics of solving this problem, along
with related questions that have yet to be answered.
3 In the actual case, an analysis based on all 50 employees in the department gave a p-value much less than
.05. Martin and Westvaco reached a settlement out of court before the case went to trial.
2.2 An informal introduction to S-Plus
A useful reference:
http://lib.stat.cmu.edu/S/cheatsheet
Opening a new script file in S-Plus:
Click on the S-Plus icon
Click OK to use existing data
Start a new script file (File > New > Script file)
Warm-up
There is a standard statistical vocabulary to describe choosing a random subset from some larger set: The
larger set that you choose from is called a population; the random subset that you choose is called a
sample. In the Martin example, the population is the set of ten ages {25, 33, 35, 38, 48, 55, 55, 55, 56,
64}. The set of three chosen (e.g., {55, 55, 64}) is the sample.
Several sets of lines of S-Plus code are shown below. For each set of lines, first make a guess about what
the code will do. Then type the code into the top part of the split window of the script file. This is where
you can enter and edit code. (Commands and keystrokes for editing are pretty much the same as in
Microsoft Word.) Finally, click on the “run” button, the solid triangle in the left margin of the second
toolbar, in the column below File. This will execute your code in the bottom half of the window, and let
you check whether your guess was correct.
Populations as vectors
1a
Pop <- c(0,0,0,0,0,1,1,1,1,1)
Pop
1b
zeros <- rep(0,5)
zeros
1c
ones <- rep(1,5)
Pop <- c(zeros, ones)
Pop
1d
Pop <- rep(c(0,1),5)
Pop
1e
sort(Pop)
Pop
Populations and samples
2a
sample(Pop,3,replace=F)
2b
sum(sample(Pop,3,replace=F))
2c
Pop2 <- c(25,33,35,38,48,55,55,55,56,64)
sample(Pop2,3,replace=F)
2d
fired <- sample(Pop2,3,replace=F)
fired
mean(fired)
mean(fired) >= 58
2e
mean(sample(Pop2,3,replace=F)) >= 58
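If you want to try the same warm-ups outside S-Plus, they translate almost line for line. Here is a Python rendering of snippets 1a through 2e (Python's `random.sample`, like `sample(..., replace=F)`, draws without replacement); the variable names mirror the S-Plus ones:

```python
import random

# 1a-1d: building a population "vector" (here, a Python list)
Pop = [0] * 5 + [1] * 5          # like c(rep(0,5), rep(1,5))
print(Pop)                       # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(sorted([0, 1] * 5))        # like sort(rep(c(0,1),5)); note that sorted()
                                 # returns a new list, just as sort(Pop) in 1e
                                 # leaves Pop itself unchanged

# 2a-2e: sampling without replacement
Pop2 = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
fired = random.sample(Pop2, 3)   # like sample(Pop2, 3, replace=F)
m = sum(fired) / 3               # average age of the three drawn
print(m, m >= 58)                # the Yes/No comparison from 2d
```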
Using a programming loop to create many samples⁴

Read through the following S-Plus code to see what it does. Note that a # separates
comments from the executable code.

3
# Draw random samples of size 3, without replacement, from a given population,
# and determine whether the average is >= 58.
# Repeat this process NRep times, and find the proportion of samples that
# have an average age of 58 or more.
#
NRep <- 1000        # NRep is the number of repetitions (= number of samples)
NYes <- 0           # NYes will keep track of how many samples have a mean
                    #   of 58 or more.
for (i in 1:NRep)   # This is the S-Plus language for a loop. The commands
                    #   enclosed between { and } will be executed NRep times,
                    #   once for each value of i
{                   # Begin the body of the loop
NYes <- NYes + (mean(sample(Pop2,3,replace=F)) >= 58)
}                   # End the loop
pHat <- NYes/NRep   # Compute the observed proportion
pHat                # Print the value of pHat

4 In S-Plus, programming loops slow execution and should generally be avoided. I use them here because
they make it easier to follow the logic. Later on, you’ll learn ways to avoid using loops.
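For readers following along in another language, the same loop can be sketched in Python; the variable names mirror the S-Plus code, and the fixed seed is my addition so that a run is reproducible:

```python
import random

random.seed(1)                        # fixed seed: repeat runs give the same estimate
Pop2 = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]

NRep = 1000                           # number of repetitions (= number of samples)
NYes = 0                              # count of samples with mean age 58 or more
for i in range(NRep):
    fired = random.sample(Pop2, 3)    # draw 3 ages without replacement
    NYes += (sum(fired) / 3 >= 58)    # True counts as 1, False as 0

pHat = NYes / NRep
print(pHat)                           # close to the exact value 6/120 = 0.05
```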
Exercises: the Martin case
Here are the ages of the hourly workers at the time of each of the first four rounds of
layoffs. Those chosen in the given round are marked with an asterisk; those already
chosen in a previous round are shown in parentheses:

Round 1:  *22   25   33   35   38   48  *53  *55   55   55   55   56  *59   64
Round 2:  (22)  25   33   35   38   48  (53) (55) *55  *55   55   56  (59) *64
Round 3:  (22)  25   33   35   38   48  (53) (55) (55) (55) *55   56  (59) (64)
Round 4:  (22)  25  *33  *35   38   48  (53) (55) (55) (55) (55)  56  (59) (64)
4. Guess the p-value for each of Rounds 1, 3, and 4. (Note that for Round 3, you don’t
need to guess: you can use logic to figure out the p-value.)
5. Use S-plus to estimate the p-values for Rounds 1, 3, and 4.
Preliminary Investigation: How many replications do you need?
6. Go back to the data for Round 2 of the reduction in force, and use the S-Plus code to
get values of p̂ for each of the following values of NRep:
1, 5, 25, 100, 500, 2500, 10000
(The last one may take 30 seconds or so.) Then put your values of p̂ versus NRep in a
table, along with the values from the others in the class:

NRep = # samples    Values of p-hat
1
2
3
5
etc.
Based on all the data, what is your best estimate for the value of p? Make a rough plot by
hand of p̂ versus NRep. Describe, as quantitatively as you can, the pattern that relates
the variability in the values of the estimates to the number of repetitions. Roughly how
many repetitions are needed to be confident that any given estimate will be within .01 of
the true value? (This is your first look at a question that you will study more
systematically in Chapter 3.)
2.3 Randomization tests, I: The two-sample permutation test
The Martin example is typical of a large class of situations. Here is another instance:
Example 1. Calcium and blood pressure.
To test whether taking calcium supplements can reduce blood pressure, investigators used
a chance device to divide 21 male subjects into two groups. One group of 10 men, the
treatment group, were given calcium supplements and told to take them every day for 12
weeks. The other 11 men, the control group, were given pills that looked the same as the
supplements (a placebo), and given the same instructions: take one every day. Neither
the subjects themselves nor the people giving out the pills and taking blood pressure
readings knew which pills contained the calcium. (The experiment was double blind.)
Subjects had their blood pressure read at the beginning of the study and again at the end.
The numbers below tell the reduction in systolic blood pressure (when the heart is
contracted), in millimeters of mercury. (Positive values are good; negative values mean
that the blood pressure went up.)
Calcium: 7, -4, 18, 17, -3, -5, 1, 10, 11, -2
Placebo: -1, 12, -1, -3, 3, -5, 5, 2, -11, -1, -3
Here are the same numbers, arranged in order, with the values in the treatment (calcium)
group marked with an asterisk:

-11  -5* -5  -4* -3* -3  -3  -2* -1  -1  -1  1*  2  3  5  7*  10*  11*  12  17*  18*
Notice that for this example, as for Martin, there are two groups to compare, in this
instance those assigned to the treatment group, and those assigned to the placebo group.5
Here also, as in the Martin example, the information we have available for comparing the
two groups is quantitative, and we can judge the results using the average reduction in
blood pressure for the calcium group, which was 5 millimeters of mercury.
5 There is one extremely important difference between this example and Martin, however. For
the calcium study, the two groups in fact were chosen purely at random. For the Martin example, there
wasn’t any actual randomization; instead, random selection was the null model being tested. For a
randomized controlled experiment like the calcium study, the randomization was a deliberate part of the
experimental design. Although the source of the randomization makes no difference in how you calculate
the p-value, it makes a tremendous difference in what it tells you. For Martin, a tiny p-value tells you to
reject the null model of random selection. For the calcium study, that’s not a logical option, because the
randomization actually occurred. That randomization guarantees that there are only two possible
explanations for the observed difference between the treatment and control groups: chance, or the
treatment.
Was the calcium supplement effective in lowering blood pressure? Here’s how the logic
goes: The only differences between the two groups were (1) the calcium, and (2)
differences created by the random assignment. Assume for the moment that the calcium
had no effect. Then the observed reduction of 5 mm Hg in the calcium group was due
purely to chance, that is, to the random assignment. To see whether chance is a
believable way to account for the average of 5, we ask, “If you take the 21 blood pressure
values, and choose 10 of them at random, how likely is it that you’ll get an average of 5
or more?” If this probability, the p-value, is tiny, we conclude that chance is not a
believable explanation; it must be due to the calcium treatment.
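This is the same generate-compare-estimate recipe used for Martin, and it can be sketched in Python as a cross-check (the seed and variable names are my own); with 10,000 repetitions the estimate should land near 0.08:

```python
import random

random.seed(2)
calcium = [7, -4, 18, 17, -3, -5, 1, 10, 11, -2]      # treatment group, n1 = 10
placebo = [-1, 12, -1, -3, 3, -5, 5, 2, -11, -1, -3]  # control group, n2 = 11
combined = calcium + placebo                          # all 21 blood-pressure reductions

NRep = 10000
NYes = 0
for _ in range(NRep):
    # Null model: any 10 of the 21 values are equally likely to be
    # "the calcium group."
    fake_calcium = random.sample(combined, 10)
    NYes += (sum(fake_calcium) / 10 >= 5)             # observed calcium average: 5 mm Hg

pHat = NYes / NRep
print(pHat)                                           # estimated p-value, near 0.08
```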
Exercise:
7. (a) I used 10,000 repetitions to estimate this probability, and got 0.0813. What do you
conclude? (b) Use the S-Plus code from before to compute the p-value, with NRep =
1000. How far is your estimate from mine? Which value is more reliable?
Example 2. Hospital carpets.
In a hospital, noise can be an irritation that interferes with a patient’s recovery. Putting
down carpeting in the rooms would cut down on noise, but the carpeting might tend to
harbor bacteria. To study this possibility, doctors at a Montana hospital conducted an
experiment to see whether rooms with carpeting had higher levels of airborne bacteria
than rooms with bare floors. They began with 16 rooms and randomly chose eight to
have carpeting installed. The other eight were left bare. At the end of their test period,
they pumped air from each room over a culture medium (agar in a petri dish), allowed
enough time for the bacterial colonies to grow, and recorded the number of colonies per
cubic foot of air. Here are the results:
Carpeted floors
Room #           212   216   220   223   225   226   227   228   Average
Colonies/cu.ft.  11.8   8.2   7.1  13.0  10.8  10.1  14.6  14.0   11.2

Bare floors
Room #           210   214   215   217   221   222   224   229   Average
Colonies/cu.ft.  12.1   8.3   3.8   7.2  12.0  11.2  10.1  13.7    9.8

Display 2.3 Levels of airborne bacteria for 16 hospital rooms
Exercise:
8. Estimate the p-value for testing the hypothesis that carpeting had no effect on the
levels of airborne bacteria. (Find the chance that if you choose 8 values at random from
the 16 bacteria levels, you’ll get an average of 11.2 or more. Use 10,000 repetitions.)
The three examples, Martin, calcium, and carpets, all have the same abstract structure:
Summary: Two-sample permutation tests
Data: Two groups (samples) of numerical values, n1 in Group 1 and n2 in
Group 2.
Test statistic: Average (mean) of the values in Group 1.
Observed value: Group 1 average for the actual data
Null model: All possible ways to choose n1 values (a random sample)
from the combined set of n1 + n2 values (the population) are equally
likely.
p-value: The chance that the average for a random sample is at least as
large as the observed value.
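The summary box translates directly into a small reusable routine. Here is a sketch in Python (the function `perm_test` and its interface are my own naming, not the chapter's); as a check, applying it to the Martin Round 2 subset should reproduce the roughly 5% figure found earlier:

```python
import random

def perm_test(group1, group2, n_reps=10000, seed=0):
    """Two-sample permutation test, following the summary box.

    Test statistic: the mean of Group 1.  p-value: the chance that a random
    sample of len(group1) values drawn from the pooled data has a mean at
    least as large as the observed Group 1 mean.
    """
    rng = random.Random(seed)
    pool = list(group1) + list(group2)
    n1 = len(group1)
    observed = sum(group1) / n1
    n_yes = sum(
        sum(rng.sample(pool, n1)) / n1 >= observed
        for _ in range(n_reps)
    )
    return n_yes / n_reps

# Martin Round 2: the three laid off (Group 1) vs. the seven kept (Group 2).
laid_off = [55, 55, 64]
kept = [25, 33, 35, 38, 48, 55, 56]
phat = perm_test(laid_off, kept)
print(phat)    # should land close to the exact p-value, 6/120 = 0.05
```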
Example 3. Speed limits and traffic deaths.
The year 1996 offered an unusual opportunity to scientists who study traffic safety. Until
that year, states had to keep highway speeds at 55 miles per hour or below in order to
receive federal money. Then, toward the end of 1995, a new federal law took effect, one
that allowed states to raise their speed limits. Thirty-two states did just that, either at the
beginning of 1996 or at some point during the year. The other 18 states and the District
of Columbia kept the 55 mph limit. Conventional wisdom had it that increasing the
speed limit would lead to more highway deaths. The change in the law gave scientists a
chance to test this hypothesis. The numbers in Display 2.4⁶ show the percentage change
in numbers of highway traffic deaths between 1995 and 1996, for all 50 states and DC.
States that kept 55 mph
AK -29.0    NH -20.0
CT  -4.4    NJ  44.1
DC -80.0    NY  -9.7
HI -25.0    OR -16.4
IN -13.2    SC  32.1
KY   3.4    VA  -9.1
LA  -5.4    VT -41.2
ME -14.3    WI  41.4
MN  10.8    WV  23.2
ND -50.0

States that raised the speed limit
AL  24.5    KS -13.3    OH   1.6
AR  41.3    MA  33.3    OK  34.1
AZ   0.0    MD  -1.8    PA  -7.0
CA   4.4    MI  -7.9    RI  18.2
CO -19.1    MO  50.7    SD  22.2
DE  30.0    MS  17.6    TN   4.0
FL   8.2    MT  17.9    TX  14.8
GA  32.1    NC   5.4    UT   9.4
IA  41.4    NE  62.5    WA  34.3
ID -17.9    NM   3.4    WY -31.5
IL   9.4    NV  17.9

Display 2.4 Percentage change in traffic deaths
Here is how the features of this example correspond to the elements of the abstract
summary. The two groups are the states that increased the 55 mph speed limit (Group 1)
and those that kept it (Group 2). The test statistic is the average percent change in
highway deaths for Group 1. Its observed value is the actual average for those states,
which works out to 13.2. The null model is that all ways to choose 32 numbers from the
6 Data from Ramsey, Fred L. and Daniel W. Schafer (2002). The Statistical Sleuth, 2nd ed. Pacific Grove,
CA: Duxbury. Original source: “Report to Congress: The effect of increased speed limits in the post-NMSL
era,” National Highway Traffic Safety Administration, February, 1998.
set of 51 listed in the table are equally likely. The p-value is the chance of getting a
group average of 13.2 or more; this works out to about 0.005.
Discussion question
9. What would change, and what would be the same, if you defined Group 1 to be the
states that didn’t raise their speed limits?
Example 4. O-rings.
The explosion of the Challenger space shuttle has received a lot of attention from
statisticians because the disaster and loss of the astronauts’ lives could have been
prevented by fairly simple data analysis. The explosion was caused by failure of O-ring
seals that allowed rocket fuel to leak and explode, and an investigation concluded that the
O-ring failures were themselves caused by the low temperature at the time of the launch.
The summary below⁷ shows the relationship between air temperature at launch time and
the number of “O-ring incidents” per launch for 24 launches.
Launch temperature   O-ring incidents per launch
Below 65             1 1 1 3
Above 65             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2
Exercises:
10. Identify the components (samples, null model, test statistic, observed value, p-value)
for the Martin and calcium examples.
11. Guess: Will the p-value for Example 4 turn out to be closest to .5, .1, .05, .01, or
.001? After you guess, use S-plus to estimate the p-value.
12. The faulty analysis before the launch ignored all the 0s in the data and looked only at
the temperature on the days for which the launches had problems with the O-rings.
Ignoring 0s gives the following summary:
Launch temperature   O-ring incidents per launch
Below 65             1 1 1 3
Above 65             1 1 2
Guess: Will the p-value turn out to be closest to .5, .1, .05, .01, or .001? Then use S-plus
to estimate the p-value.
7 Data from Ramsey, Fred L. and Daniel W. Schafer (2002). The Statistical Sleuth, 2nd ed. Pacific Grove,
CA: Duxbury, p. 86. Original source: Feynman, Richard P. (1988). What Do You Care What Other
People Think? New York: W. W. Norton.
2.4 Randomization tests, II: Fisher’s exact test
So far, the data values in the examples have been quantitative: ages, reduction in blood
pressure, levels of airborne bacteria. What if the data values are categorical? One of the
simplest randomization tests is Fisher’s exact test, which is used to test hypotheses about
data that can be summarized in a 2x2 table of counts.
Example 5. The Salem witchcraft hysteria
The year 1692 saw nineteen convicted witches hanged in Salem Village (now Danvers)
Massachusetts. Almost three centuries later, historians examining documents related to
the trials discovered a striking pattern relating trial testimony and geography. Those who
testified against the accused witches tended to live in the western part of Salem Village;
those who testified in defense of the accused tended to live in the eastern part, which was
wealthier, more commercial, more cosmopolitan, and closer to the town of Salem, at the
time the second busiest port in the colonies.8 A total of 61 residents testified in the trials,
of whom 35 lived in the western part of the village, and 26 in the eastern part. Of the 35
“westerners,” 30 were “accusers” and only 5 were “defenders;” of the 26 easterners, only
2 were accusers; the remaining 24 were defenders:
                     Testimony
Geography   Accuser   Defender   Total   % accuser
West           30         5        35      85.7%
East            2        24        26       7.7%
Total          32        29        61

Display 2.5 Geography and testimony in the Salem witch trials of 1692
Is it possible to get a pattern as extreme as this just by chance? If the relationship
between geography and testimony were purely random, how likely would a pattern this
extreme be? Represent the residents who testified by poker chips, 35 marked “West” and
26 marked “East.” Put all 61 chips in a bag, mix thoroughly, and draw out 32 at random.
Call these Accusers; count how many of the accuser chips say West, how many say East,
and record the results in a table.9 I just did this, and got:
                        Testimony
Geography      Accuser   Defender   Total   % accuser
West              19         16       35       54.3%
East              13         13       26       50.0%
Total             32         29       61
Display 2.6 Results for a sample of accusers drawn at random
8 Historians Boyer and Nissenbaum cite this pattern as part of the evidence in support of an economically based interpretation of the witchcraft hysteria. See Boyer, Paul and Stephen Nissenbaum (1974). Salem Possessed: The Social Origins of Witchcraft, Cambridge: Harvard University Press.
9 The chips left in the bag are the defenders. You could count them to fill in the rest of the table, or you could get the missing values by subtraction, since they are determined by what you draw out.
02/17/16 George W. Cobb, Mount Holyoke College, NSF#0089004
page 2.14
My random data set is not nearly as extreme as the actual one.
I repeated the whole process (random draws, count, compare) 10,000 times, using the
S-Plus code in Display 2.7, and not once did I get a table as extreme as the actual data.
Conclusion: If you draw at random, it is all but impossible to get a data table like the
observed one. In other words, “It’s just a chance relationship” is not a believable
explanation for the data.
For the S-plus simulation, I used 0s and 1s to represent East and West. There were 26
people from the East who testified, and 35 from the West, so my population has 26 0s
and 35 1s.
pop <- c(rep(0,26),rep(1,35))
phat <- 0
NRep <- 10000
for (i in 1:NRep){
phat <- phat + (sum(sample(pop,32,replace=F))>=30)/NRep
}
phat
Display 2.7 S-Plus code for drawing random samples of accusers and estimating p
Here’s an abstract version of the same analysis:
Step 1: Generate random data sets
Population: 61 individuals, 35 of them 1s (West) and 26 of them 0s (East).
Sample: A subset of 32.
Null model: The sample is chosen completely randomly; all subsets of 32
are equally likely.
Step 2. Compare random data sets with the actual data.
Test statistic: Number of 1s in the sample (= number of “West” chips
among the randomly chosen “accusers”).
Actual data value: There were, in fact, 30 residents of the western part of
Salem Village among the accusers.
Compare: Record a Yes if there are 30 or more 1s in the sample.
Step 3. Estimate.
Out of 10,000 data sets, none had as many as 30 1s.
Because the p-value is so very tiny, we reject the null model. It is not a believable
explanation for the actual data.
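The three steps above can also be carried out exactly, with no simulation: under the null model, the number of West chips among the 32 accusers follows a hypergeometric distribution. Here is a sketch in Python (standing in for the S-Plus used elsewhere in this chapter); the exact sum confirms why 10,000 random tables never produced one as extreme as the data.

```python
from math import comb

# Population: 61 chips, 35 marked West (1) and 26 marked East (0).
# Draw 32 at random as "accusers"; sum P(30 or more West) exactly:
# each term is (ways to pick k West)(ways to pick 32-k East) / (ways to pick 32).
p = sum(comb(35, k) * comb(26, 32 - k) for k in range(30, 33)) / comb(61, 32)
# p is astronomically small -- far too small for 10,000 repetitions to hit.
```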
Drill Exercises:
13. A small version of the witch data.
Assume that only 10 people had testified, of whom 4 lived in the west, 6 in the east.
Assume also that 3 were accusers, and that all 3 came from the west. Set up the
population, null model, test statistic and observed value. Then find the p-value and state
your conclusion.
14. There is more than one way to define the population and null model for S-plus. Two
easy variations are (a) to reverse the labels for 1s and 0s, so that 1 represents East and 0
represents West, and/or (b) to reverse the roles of "in the sample" and "not in the sample,"
so that the sample represents Defenders, and those not in the sample are the Accusers.
For each of these variations, tell what the test statistic would be, and how to compute the
p-value. Verify that the p-value is the same for all these variations.
15. A more substantive variation reverses the roles of population and sample: Let the 1s
and 0s in the population tell whether an individual was an Accuser or Defender. Let “in
the sample” correspond to “West,” and “not in the sample” to East. Define a test statistic,
and tell how to modify the S-Plus code in Display 2.7 to compute the p-value.
(Optional: Run your modified S-Plus code, and verify the non-obvious fact that, apart
from random variation, its p-value is equal to the p-value from the original code.)
Example 6. US v. Gilbert
For several years in the 1990s, Kristen Gilbert worked as a nurse in the intensive care
unit (ICU) of the Veteran’s Administration hospital in Northampton, Massachusetts.
Over the course of her time there, other nurses came to suspect that she was killing
patients by injecting them with the heart stimulant epinephrine.10 Part of the evidence
against Gilbert was a statistical analysis of more than one thousand 8-hour shifts during
the time Gilbert worked in the ICU. Was there an association between Gilbert’s presence
on the ICU and whether or not someone died on the shift?11 Here are the data:
                         Death on shift?
K.G. present
on shift?          Yes        No     Total    % Yes
Yes                 40       217       257    15.6%
No                  34      1350      1384     2.5%
Total               74      1567      1641
Display 2.8 Data on possible association between nurse Gilbert’s presence in the ICU of
the Northampton VA hospital and deaths on a shift.
10 A synthetic form of adrenaline.
11 As of this writing, the actual data are not yet public. They figured prominently in the grand jury testimony that led to Gilbert's indictment, but at the subsequent trial, the judge ruled that they could not be shown to the jury.
Drill Exercises:
16. Define the population of 0s and 1s: What does a 0 represent? A 1? How many 0s,
and how many 1s, are in the population? (Note: There is more than one right way to do
this.)
17. Define the null model: If you think of drawing a random sample from the
population, what does "drawn out" (that is, in the sample) represent? What does "not
drawn out" represent?
18. Define the test statistic: What does the number of 1s in the sample represent? What
is the observed value of the test statistic?
19. p-value. Modify the S-Plus code in Display 2.7 so that it would compute the p-value. (Optional: Compute the p-value using your code.)
Example 7. Anthrax.
After Senator Tom Daschle received a letter containing anthrax, the Hart Senate Office
Building was fumigated in an attempt to kill the spores. After the first fumigation, public
health officials conducted a multi-part test to see whether the building was safe to work
in. In the first phase of the test, 17 strips capable of detecting live anthrax spores were
placed throughout the test area, and later were checked for anthrax. Five of the 17 were
positive. In the second phase of the test, another 17 strips were placed in the same
locations, but this time suitably protected technicians walked around on the carpet,
moving the room air in the process, to simulate normal office traffic. This time 16 of 17
strips were positive. The results of these tests led to a second, more vigorous, and
successful fumigation.
Exercises
20. Summarize the test results in a 2x2 table.
21. Define a suitable population, null model, and test statistic for Fisher’s exact test.
22. Use S-plus to compute an appropriate p-value.
Discussion question:
23. Just as each strip in the test for anthrax can show a false positive (indicate anthrax
when none is present) or a false negative (indicating no anthrax when it is in fact
present), statistical tests can also show false positives and false negatives. For the study
of calcium supplements, what would a false positive be? A false negative?
Summary. Fisher’s exact test is appropriate when (1) you want to compare two
randomly chosen groups of individuals, and (2) the feature of the individuals that you use
to make the comparison is dichotomous – reducible to yes/no. Think of randomly
drawing marbles from a bucket. This gives two randomly chosen groups: those drawn
out, and those left in. The marbles are of two colors; color is the feature used to compare
the two groups. For the Salem witch data (Example 5), we asked, “What if the
‘accusers’ and ‘defenders’ had been chosen at random?” In that example, the actual
accusers and defenders were not chosen at random, but we wanted to test whether the
observed data was consistent with random selection. So under our null model, the
‘accusers’ and ‘defenders’ were the randomly chosen groups. The feature used for the
comparison was geography, east or west. For the Gilbert data (Example 6) we asked,
“What if the deaths had occurred on randomly chosen shifts?” Here, also, the actual
shifts were not chosen that way, but we wanted to compare the actual data with what we
would be likely to get if the shifts had been chosen randomly. Thus shifts with and
without deaths were the randomly chosen groups. The feature used for comparing groups
was whether or not Gilbert was present on the shift.
2.4 Randomization tests, III: variations.
Variation 1: Dichotomizing a numerical variable.
Although Fisher’s exact test is designed for dichotomous populations – those with just
two kinds of individuals – it is possible to use the test when the feature you use to
compare groups is quantitative. To turn a quantitative variable into a dichotomous one,
pick a threshold value of the variable, and replace its actual value with a Yes or No
answer to the question “Is the value of the variable greater than or equal to the
threshold?” Once you’ve replaced the numbers with Yes/No answers, you can carry out
Fisher’s exact test.
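The dichotomizing step is a single line of code. Here is a Python sketch (the helper name `dichotomize` is mine, not the text's), using the Martin ages as the example:

```python
def dichotomize(values, threshold):
    # Replace each value with the Yes/No answer to
    # "is the value greater than or equal to the threshold?"
    return ["Y" if v >= threshold else "N" for v in values]

ages = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
# dichotomize(ages, 40) gives N N N N Y Y Y Y Y Y
```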
Example 8. Martin vs. Westvaco
Here, once again, are the ages of the ten hourly workers involved in the second round of
layoffs at Westvaco, with the ages of those laid off indicated by underlining.
25 33 35 38 48 55 55 55 56 64
a. One way to choose a threshold is to go by the law. According to federal
employment law, the “protected class” (of workers who cannot be fired
because of their age) begins at age 40. If we use 40 as the threshold, then the
set of ten ages becomes
N N N N Y Y Y Y Y Y
We can summarize the new, dichotomized version of what actually happened
in a 2x2 table:
                    40 or older?
Laid off?        No      Yes    Total
No                4        3       7
Yes               0        3       3
Total             4        6      10
If we use as our test statistic the number of workers aged 40 or more among
those laid off, we get a p-value of about .17:
p = P(3 or more Y in a random sample of 3 chosen from 4 N, 6 Y) ≈ .17.
b. Notice that using Fisher's test when you have a quantitative variable ignores
relevant information.12 Here, for example, all three workers laid off were very
much older than the threshold age of 40, but the test in (a) didn't use that
information. In general, it is better not to use Fisher’s test in a situation like
this; the permutation test would ordinarily be preferred. However, you can
sometimes get a better version of Fisher’s test by changing the threshold. For
example, you could choose as your threshold the median (half-way point) of
the set of observed ages. The resulting test is called the median test. For the
Martin data we have an even number of values, so there are two middle values
48 and 55, and the median is the number half-way between them, 51.5. Using
51.5 as a threshold, and replacing ages by Yes/No answers to “Is the age 51.5
or older?” gives a population of 5 Ns, 5 Ys.
Drill exercise: (24) Summarize the data in a 2x2 table.
For this threshold, the p-value is
p = P(3 or more Y in a random sample of 3 chosen from 5 N, 5 Y) = 10/120 ≈ .08.
c. Drill exercise. (25) Repeat the test using 55 as the threshold. Summarize the
data in a 2x2 table, and explain why the p-value should be less than the two
previous p-values. (Its actual value is 1/30 ≈ .03.)
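The p-value in part (a) is a one-line hypergeometric computation: drawing 3 at random from 4 N and 6 Y, the only way to get 3 or more Y is to get exactly 3. A quick check in Python (standing in for S-Plus):

```python
from math import comb

# P(3 or more Y in a random sample of 3 chosen from 4 N, 6 Y):
# all three draws must be Y, so p = C(6,3) / C(10,3) = 20/120 = 1/6.
p = comb(6, 3) / comb(10, 3)  # about .17
```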
Example 9. Calcium and blood pressure.
If we want to apply Fisher’s test to the data in Example 1, we have to reduce the
quantitative data to dichotomous data by choosing a threshold value and asking, “Is the
change in blood pressure greater than or equal to the threshold?”
Drill exercises.
26. One natural choice for the threshold is 0. Values 0 or greater indicate that the blood
pressure did not go down. Carry out Fisher’s test using 0 as your threshold.
27. Carry out the median test.
28. Compare p-values for the two tests. Why do you think the p-values differ in the way
that they do?
12 Ignoring relevant information can reduce the power of a statistical test by making it less likely that the test will reject null models that should be rejected.
Drill exercises
29. Tell how to conduct a median test:
Null model.
a. What is the population?
b. What constitutes a random sample?
c. What are the objects that are equally likely according to the null model?
Test statistic
d. Tell what test statistic to use.
Variation 2: Transforming to ranks
Back before the days of cheap computers, it was often not practical to estimate p-values
by simulation. Statisticians who wanted to do randomization tests found a clever way
around the problem. Their solution was based on the fact that if your population consists
of consecutive integers, like {1, 2, 3, …, n}, there is a theoretical analysis that gives a
workable approximation to p-values.13 Of course most populations don’t consist of
consecutive integers, but you can force them to if you replace the actual data values with
their ranks: order the values from smallest to largest, assign rank 1 to the smallest, rank 2
to the next smallest, etc. Once you’ve assigned ranks, you can do a two-sample
permutation test on the ranks. The resulting test is called the Wilcoxon rank sum test.14
Here’s how the ranking works for the Martin data.
Example 10: A rank test for the Martin data
Age    25  33  35  38  48  55  55  55  56  64
Rank    1   2   3   4   5   7   7   7   9  10
Notice how the ranking handles ties: the three 55s have ranks 6, 7, and 8, so we assign
each of the 55s the average of those ranks, (6+7+8)/3 = 7.
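The tie-handling rule, giving each tied value the average of the ranks it occupies, can be coded directly. A Python sketch (the function name avg_ranks is mine):

```python
def avg_ranks(values):
    # Rank 1 = smallest. For a value v, the ranks it occupies run from its
    # first position to its last position in sorted order (1-based); tied
    # values all receive the average of those ranks.
    order = sorted(values)
    first = {v: order.index(v) + 1 for v in set(values)}
    last = {v: len(order) - order[::-1].index(v) for v in set(values)}
    return [(first[v] + last[v]) / 2 for v in values]

ages = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
# avg_ranks(ages) reproduces the table above: the three 55s each get rank 7.
```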
Exercises
30. Carry out the Wilcoxon test on the Martin data, but first, guess whether the p-value
will be larger or smaller than for the permutation test using the actual age. (It will not be
the same.) What do you think is the reason for the difference in p-values?
31. Carry out the Wilcoxon test on the calcium data of Example 1. As in (30), before
you run the simulation, guess the p-value.
13 Find the mean and variance of {1, 2, …, n}. Then use the fact that for large enough n1 and n2, the distribution of the average of the values in a random sample of size n1 will be approximately normal.
14 The p-value you get from the Wilcoxon test will always be equal to the p-value you would get using a different approach, called the Mann-Whitney U-test.
Variation 3: Paired data
Until now, all the data sets have had the same structure: two groups of values. To
generate random data sets with the same structure, you combined the two groups into a
single population, then randomly chose exactly enough values for Group 1, leaving the
rest for Group 2. This structure is just one of a great many that are possible. Often data
come in the form of pairs:
Example 11. Beestings.
When a bee stings and leaves his stinger behind in his victim, does he also leave with it
some odor that tells other bees, “Drill here!”? To answer this question, J.B. Free
designed a randomized experiment.15 First, he took a square board and from it suspended
16 cotton balls on threads in a 4x4 arrangement. Half the cotton balls had been
previously stung, the other half were brand new, and the positions of the two kinds were
randomized. Apparatus completed, Free went to a beehive, opened the top and jerked the
array of cotton balls up and down, inviting stings. Later, he counted the numbers of new
stingers his provocation had garnered. He repeated all this eight more times, with the
results shown in Display 2.9.
Occasion    I   II  III   IV    V   VI  VII  VIII   IX    Ave
Stung      27    9   33   33    4   22   21    33   70   28.0
Fresh      33    9   21   15    6   16   19    15   10   16.0
Display 2.9 Numbers of new stinger left by bees
in previously stung and fresh cotton balls
On average, there were 28 stingers left in the cotton balls that had been previously stung,
only 16 in those that had not. Given the grand total of 396 new stingers, is the average of
28 for Stung too big to be due just to the random assignment? Suppose for the moment
that the presence of stingers in the cotton balls had no effect. Then within each pair, it
would be just a matter of chance as to which number got assigned to Stung, and which to
Fresh. We can create random data sets by regarding each occasion as a tiny population of
just two values, and randomly choosing the value that gets assigned to Stung, leaving the
other for Fresh. Equivalently, we can toss a coin for each occasion, choosing the first
value if the coin lands heads, the second if tails. Display 2.10 shows an instance of this:
Occasion    I   II  III   IV    V   VI  VII  VIII   IX    Ave
Stung      27    9   33   33    4   22   21    33   70   28.0
Fresh      33    9   21   15    6   16   19    15   10   16.0

Coin toss   1    1    0    0    0    1    0     0    1
"Stung"    27    9   21   15    6   22   19    15   70   22.7
"Fresh"    33    9   33   33    4   16   21    33   10   21.3
Display 2.10 Generating a random data set for the bee sting data
If the coin toss lands heads (1) the first value in a pair is assigned to “Stung”;
if tails (0), the second value is assigned to “Stung.”
15 Free, J. B. (1961). “The stinging response of honeybees,” Animal Behavior 9, pp. 193-196.
Once we have a way to generate random data sets, we’re in business. We can carry out
the randomization test by the usual 3-step algorithm: generate, compare, estimate.
Display 2.11 shows S-plus code for doing this.16 The p-value turns out to be about .04.
Apparently, bees are more likely to sting where others have stung before.
Stung <- c(27,9,33,33,4,22,21,33,70)
Fresh <- c(33,9,21,15,6,16,19,15,10)
StungMean <- mean(Stung)
nPairs <- length(Stung)
NRep <- 1000
NYes <- 0
for (i in 1:NRep){
tosses <- sample(c(0,1),nPairs,replace=T)
ave <- sum(tosses*Stung + (1-tosses)*Fresh)/nPairs
NYes <- NYes + (ave >= StungMean)
}
pHat <- NYes/NRep
pHat
Display 2.11. S-plus code for a permutation test for paired data
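With only nine pairs there are 2^9 = 512 equally likely assignments, so instead of simulating we can enumerate all of them and get the p-value exactly. A Python sketch of the same test (standing in for the S-Plus above):

```python
from itertools import product

stung = [27, 9, 33, 33, 4, 22, 21, 33, 70]
fresh = [33, 9, 21, 15, 6, 16, 19, 15, 10]
obs_total = sum(stung)  # 252, i.e. an average of 28.0

# Enumerate all 2^9 = 512 equally likely within-pair assignments:
# heads (1) keeps the pair as observed, tails (0) swaps it.
hits = 0
for tosses in product([0, 1], repeat=len(stung)):
    total = sum(s if t == 1 else f for t, s, f in zip(tosses, stung, fresh))
    if total >= obs_total:  # "Stung" average at least as big as observed
        hits += 1
p = hits / 2 ** len(stung)  # exact p-value: 20/512, about .039
```

The exact value, 20/512 ≈ .039, agrees with the simulation estimate of about .04 quoted above.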
Exercise:
32. Explain how the line of code
ave <- sum(tosses*Stung + (1-tosses)*Fresh)/nPairs
works to give the average of the nine values randomly assigned to “Stung”.
Example 12. Radioactive twins.
Would you agree to inhale an aerosol of radioactive Teflon particles? Seven pairs of
identical twins once did. They were part of a study of the effect of environment on the
health of lungs.17 One twin in each pair had been living in a rural environment, the other
in an urban environment. The numbers in Display 2.12 tell the percent of radioactivity
remaining one hour after inhaling the aerosol. Lower values are better: they indicate that
a larger percentage of the particles had been cleared from the lungs.
Twin Pair     I     II    III    IV     V    VI   VII    Ave
Rural       10.1  51.8  33.5  32.8  69.0  38.8  54.6   41.5
Urban       28.1  36.2  40.7  38.8  71.0  47.0  57.0   45.5
Display 2.12 Percentage of radioactivity remaining in the lungs,
for seven pairs of twins living in two environments
16 Remember that if you use loops in your code, S-plus will punish you by running slowly. Chapter 3 will show you an alternative to loops.
17 Camner, Per and Klas Phillipson (1973). “Urban factor and tracheobronchial clearance,” Archives of Environmental Health 27, p. 82.
Exercises:
33. Modify the S-plus code in Display 2.11 to carry out a permutation test. What do you
conclude about the effect of environment on health?
34. For the bee data, the justification for a permutation test comes from the fact that
conditions (stung or fresh) were randomly assigned. The twin study, however, is an
observational study, with no randomization of the conditions (rural or urban) possible.
What is the justification for using a permutation test?
Additional exercises involving variations on the permutation test
35. Dichotomize the O-ring data of Example 4, replacing the number of incidents with
Yes (1 or more incidents) or No (0 incidents), and summarize the results in a 2x2 table.
Then carry out Fisher’s exact test. How does your p-value here compare with the one
based directly on the numbers of incidents?
36. Dichotomize the data on traffic deaths (Example 3) using the sign of the change, i.e.,
whether the deaths went up or down. Summarize the results in a 2x2 table, and carry out
Fisher’s exact test.
37. Use the bee sting data of Example 11. Order all 18 data values and assign ranks.
Then carry out a permutation test on the pairs of ranks, using suitably modified code from
Display 2.11. This test is called the signed rank test.
38. Use the bee sting data one more time. This time assign ranks separately for each
pair: assign a 1 to the larger value and a 0 to the smaller value. If the two values are
equal, simply omit that pair from the analysis. Carry out a permutation test on the pairs
of ranks. This test is called the sign test.
39. Compare the p-values from the three tests using the bee sting data: the permutation
test using the numbers of stings, the signed rank test, and the sign test. The first test uses
the actual data, the second gives up some of that information by converting to ranks, and
the third test gives up still more information by looking only at which value in a pair was
the larger one. Based on the bee sting data, how does giving up information appear to
affect p-values?
2.5 Randomization tests, IV: Chi-square tests.
Introduction. Fisher’s exact test applies to data sets you can summarize in a 2x2 table.
Such data sets have the same structure as the results of drawing a sample from a bucket
containing red and blue marbles: there are two groups (drawn out, left in) and two kinds
of individuals (the two colors). What if there are more than two kinds of individuals, or
more than two samples?
Example 13. Victoria’s descendants
Some people claim there is an association between a person’s birthday and the day of the
year on which they die. According to the theory, people who are dying tend to “hang on”
until their birthday. Display 2.13 shows counts for 82 descendants of Queen Victoria,
classified by month of birth (row) and month of death (column). Those who died in the
same month as they were born in appear on the main diagonal. Those who died in a
month just before or just after their birth month appear just below or just above the main
diagonal.
        Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec  Total
Jan       1    0    0    0    1    2    0    0    1    0    1    0      6
Feb       1    0    0    1    0    0    0    0    0    1    0    2      5
Mar       1    0    0    0    2    1    0    0    0    0    0    1      5
Apr       3    0    2    0    0    0    1    0    1    3    1    1     12
May       2    1    1    1    1    1    1    1    1    1    1    0     12
Jun       2    0    0    0    1    0    0    0    0    0    0    0      3
Jul       2    0    2    1    0    0    0    0    1    1    1    2     10
Aug       0    0    0    3    0    0    1    0    0    1    0    2      7
Sep       0    0    0    1    1    0    0    0    0    0    1    0      3
Oct       1    1    0    2    0    0    1    0    0    1    1    0      7
Nov       0    1    1    1    2    0    0    2    0    1    1    0      9
Dec       0    1    1    0    0    0    1    0    0    0    0    0      3
Total    13    4    7   10    8    4    5    3    4    9    7    8     82
Display 2.13. Month of birth (row) and month of death (column)
for 82 descendants of Queen Victoria
If the claim of association is true, we would expect to find a tendency for counts to be
higher on or near the main diagonal, lower near the southwest and northeast corners. If,
on the other hand, there is no association, we would expect the count in a cell to be equal
to the product of the cell’s column total times its row fraction. For example, consider the
upper left cell, which corresponds to those born in January who also died in January.
The row and column totals tell us that 6 of 82 descendants, or 7.32%, were born in
January; 13 died in January. If there is no association between month of birth and
month of death, we would expect the fraction of January births to be the same in each
column. In particular, we would expect 7.32% of the 13 January deaths, or 13(0.0732) =
0.95, to be January births. The goal of this section is to apply the same kind of thinking
to all the cells of the table, and combine the results to define a test
statistic that can serve as the basis of a randomization test.
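The January calculation extends to every cell at once: expected count = (row total)(column total)/n. A Python sketch (standing in for S-Plus), with the margins read off Display 2.13:

```python
row_totals = [6, 5, 5, 12, 12, 3, 10, 7, 3, 7, 9, 3]   # births, Jan..Dec
col_totals = [13, 4, 7, 10, 8, 4, 5, 3, 4, 9, 7, 8]    # deaths, Jan..Dec
n = sum(row_totals)  # 82 descendants

# Expected count for each cell under "no association":
# (row fraction)(column total) = row_total * col_total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
# Upper-left cell (born January, died January): 6 * 13 / 82, about 0.95.
```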
The chi-square test statistic. One of the most common methods in all of statistics is the
chi-square test. This test is so flexible and broadly applicable that almost any data set
based on sorting and counting can be studied using it.18 The chi-square test has two parts,
which are often presented as a single package, without noting or distinguishing between
one part and the other.19 This is unfortunate, because one part – a measure of distance
that serves as a test statistic -- is much more useful than the other – a shortcut method for
approximating p-values. In what follows, I’ll describe and illustrate the more useful part,
which gives a general-purpose method for carrying out Step 2 (the comparison step) of
the p-value algorithm.20
For a concrete illustration, consider the summary table for the Martin example. For 2x2
tables, the chi-square distance isn’t something you really need, because its value is
completely determined by the entry in the upper left cell of the table.21 However, you do
need chi-square (or some alternative) for tables larger than 2x2, and the 2x2 case is a
simple starting point for learning.
                     Laid off?
55 or older?      Yes      No    Total
Yes                 3       2       5
No                  0       5       5
Total               3       7      10
Display 2.14. Summary table for the Martin example
The null model corresponds to choosing a random subset of size 3 (those laid off) from
{25, 33, 35, 38, 48, 55, 55, 55, 56, 64} and counting the number of people who are 55 or
older. Since 5 of the 10, or 50% of those in our population are 55 or older, we would
expect that, on the average over the long run, 50% of those in a randomly chosen subset
would be 55 or older. For subsets of size 3, this long run average – the expected value –
would be 50% of 3, or 1.5. Because the table entries have to add to give the same row
and column totals as for the actual data, we can fill in the entire table:
18 If you think about the last statement, it should sound too good to be true, and it is. Nevertheless, along with regression and analysis of variance, chi-square methods constitute one of the three principal groups of classical statistical methods.
19 If you've seen the chi-square test before, you yourself may well have been given the package tour.
20 The less useful part of the chi-square test is an approximation based on mathematical theory, one that you can sometimes use as a substitute for Steps 1 and 3 of the p-value algorithm. Back in 1901 when this approximation was invented (by Karl Pearson), before the days of computers, it was a major breakthrough, and statisticians had no choice but to use it. Without the approximation, there would have been no way to compute p-values. However, the approximation only works well for a restricted class of data sets, whereas the randomization algorithm is not restricted in this way, and can be used to estimate p-values to any desired accuracy.
21 This means that the p-value will be the same regardless of whether you use the chi-square distance or the cell entry in the comparison step of the algorithm.
                     Laid off?
55 or older?      Yes      No    Total
Yes               1.5     3.5       5
No                1.5     3.5       5
Total               3       7      10
Display 2.15. Expected values for the Martin example
We can now use these expected values to compare tables. In effect, we invent a way to
measure the “distance” between two tables, and compare tables according to how far they
are from the table of expected values.
Step 2a:    Observed      Expected      Obs - Exp
             3   2    -   1.5  3.5   =   1.5  -1.5
             0   5        1.5  3.5      -1.5   1.5

Step 2b:    (Obs - Exp)^2    Expected     (O-E)^2/E
             2.25  2.25   /  1.5  3.5  =  1.5  0.64
             2.25  2.25      1.5  3.5     1.5  0.64

Step 2c:    Chi-square = sum of (O-E)^2/E = 4.29
Display 2.16. The chi-square distance between observed and expected counts
By looking closely at the way the chi-square value is defined, you can convince yourself
that it does behave like a distance.
• Observed close to expected → chi-square near zero. Consider first a table whose
observed counts are exactly equal to the expected counts. All of the entries in the
table of differences will be zeros, and the chi-square distance will be zero as well.
In other words, the chi-square distance from a table to itself is zero, just as it
should be.
• Observed far from expected → chi-square large. Now consider a table of
observed counts that are far from their expected values. At least some of the
differences (Obs - Exp) in Step 2a will be far from 0.22 When these differences
are squared, in Step 2b, the resulting values will be large, and that will make the
chi-square value large.
22 Notice that because the observed and expected counts have the same marginal totals, the marginal totals for the table of differences will all be 0.
• Why divide (O-E)^2 by E? Dividing by E is a technical adjustment, designed to
give all the cells an equal chance to contribute to the chi-square total. To see how
this works, consider two extreme cases. First, suppose that the expected value is
1. If 1 is the expected value, we might get observed values of 2, or 5, or even 10,
but 10 would be a very major departure from expectation. On the low side, the
observed count can never be less than 0, so -1 is the lowest possible value for
O - E. Next, suppose that instead of 1, the expected value is 101. An observed count
of 102, or 105, or even 110 is, in percentage terms, still quite close to the expected
value, even though the differences (O-E) are the same as in the first case. On the
low side, (O-E) can easily go far below -1.
Now compare the two cases. In the first, a value of (O-E) = 4 indicates major
departure from expectation: 5 instead of 1. In the second case, a value of (O-E) = 4
indicates a departure of less than 4% from the expectation. Dividing (O-E)^2 by E puts
these departures in perspective. In the first case, (O-E)^2/E = 16; in the second case,
(O-E)^2/E = 0.16.
Once you have defined the chi-square distance, you can use it to compare data sets. A
random data set is more extreme than the actual Martin data, for example, if and only if
its chi-square distance from the table of expected values is greater than or equal to 4.29.
You calculate the p-value in the usual way, as the fraction of random data sets that are at
least as extreme as the actual data.
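The comparison step can be packaged as a function that computes the chi-square distance of any table of counts from its table of expected values. A Python sketch (standing in for S-Plus; the function name is mine):

```python
def chi_square_distance(obs):
    # Chi-square distance from a table of counts to its table of expected
    # values, where expected = (row total)(column total) / n for each cell.
    n = sum(map(sum, obs))
    row_tot = [sum(row) for row in obs]
    col_tot = [sum(col) for col in zip(*obs)]
    total = 0.0
    for i, row in enumerate(obs):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count for cell (i, j)
            total += (o - e) ** 2 / e
    return total

martin = [[3, 2],
          [0, 5]]
# chi_square_distance(martin) is about 4.29, as in Display 2.16.
```

The same function works unchanged for tables larger than 2x2, such as the 12x12 table of Display 2.13.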
Testing for association in two-way tables of counts
For tables larger than 2x2, the arithmetic is messier, but the logic is the same as in the
Martin example. Here’s a version of the randomization algorithm suitable for larger
tables.
Step 0. Expected value = (row fraction)(column total).
Observed chi-square: Follow Steps 2a-2c below for the actual data.
Step 1. Generate random data sets with the same row and column totals as the
actual data.
Step 2. Compare values of the chi-square statistic. For each random data set,
compute
a. Observed - Expected
b. (Obs - Exp)^2 / Exp
c. Chi-square = sum of (O-E)^2/E
d. Compare: is chi-square for the random data set at least as big as for the
actual data?
Step 3. Estimate: p ≈ # Yes / # Datasets
For the data in Display 2.13, the value of chi-square is 115.6. To see how this value
compares with the values we’d get from random data sets, we need a method for
generating 12x12 tables of counts with the same margins as in Display 2.13.
Generating random tables of counts with given margins.
You can generate data sets by physical simulation, much as in the Martin example.
Here’s how it works for Victoria’s descendants.
Step 1. Label by rows. Put 82 chips in a bucket, with labels determined by the
row (birth month) totals. Thus 6 of the chips say “Jan”, 5 say “Feb”, 5 “Mar”, 12
“Apr”, and so on.23
Step 2. Draw out by columns. Mix the chips thoroughly. Then draw them out in
stages determined by the column totals: The first 13 chips you draw out
correspond to deaths in January; the next 4 correspond to deaths in February, …,
the last 8 correspond to deaths in December.
In S-plus, you can carry out the same two steps:
Step 1. Label by rows. Just as rep(1, 3) creates a vector (1, 1, 1), rep(1:3,
c(4,3,1)) creates a vector (1, 1, 1, 1, 2, 2, 2, 3). For the Victoria data, we want
rep(1:12, c(6, 5, 5, 12, 12, 3, 10, 7, 3, 7, 9, 3)). Rather than type in all the row
totals, we use the command rowSums to tell S-plus to compute the totals for us.
Pop <- rep(1:12, rowSums(ActualData))
More generally, the number of rows won't necessarily be 12, but it will be
given by the first element of the vector that gives the dimension of the data,
dim(ActualData)[1]. Thus we create the bucket of labeled chips with the
command
Pop <- rep(1:dim(ActualData)[1], rowSums(ActualData))
Step 2. Draw out by columns. In the same way, we create a vector ColGroups of
column labels. Following the column totals, this vector will have thirteen 1s for
January, four 2s for February, etc. For our particular example, we could use
rep(1:12, c(13, 4, 7, 10, 8, 4, 5, 3, 4, 9, 7, 8)). The S-plus code uses the more
general
ColGroups <- rep(1:dim(ActualData)[2], colSums(ActualData))
To create a random data set, first permute the row labels using
23. Notice that the births tend to come between April and August. Our null model, which fixes the row totals, copies the actual distribution of the birth months.
Permutation <- sample(Pop, length(Pop))
Then line up the vector of permuted row labels next to the vector of column
labels, and count. Here’s how it works for the Martin example:
Row labels:           1  1  1  1  1  2  2  2  2  2
Permuted row labels:  1  2  2  1  2  1  1  1  2  2
Column labels:        1  1  1  2  2  2  2  2  2  2
The last two rows give 10 vertical pairs (Permuted row label, column label).
Sorting and counting these gives a 2x2 summary table:
                         Column label (2 = laid off, 1 = retained)
Row label                     1        2       Total
1 (under 55)                  1        4         5
2 (55 or older)               2        3         5
Total                         3        7        10

Display 2.17. Summary table from sorting and counting
S-plus does the counting for us, in response to the command “table”:
RandomTable <- table(Permutation, ColGroups)
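In Python (one possible translation, for readers without S-plus), the same two steps look like this: rep becomes a list comprehension, sample becomes random.shuffle, and table becomes a Counter. The 2x2 Martin table is used as a small example.

```python
import random
from collections import Counter

actual = [[3, 2],      # the 2x2 Martin table, as a small example
          [0, 5]]
n_rows, n_cols = 2, 2

# Step 1. Label by rows: one "chip" per individual, labeled by its row.
pop = [i for i in range(n_rows) for _ in range(sum(actual[i]))]

# Step 2. Draw out by columns: permute the chips, then group the draws
# according to the column totals.
col_totals = [sum(c) for c in zip(*actual)]          # [3, 7]
col_groups = [j for j in range(n_cols) for _ in range(col_totals[j])]

random.shuffle(pop)
counts = Counter(zip(pop, col_groups))
random_table = [[counts[i, j] for j in range(n_cols)]
                for i in range(n_rows)]

# Whatever the permutation, the random table has the actual margins.
assert [sum(r) for r in random_table] == [5, 5]
assert [sum(c) for c in zip(*random_table)] == [3, 7]
```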
Display 2.18 shows the S-plus code for (1) computing expected values, (2) computing
chi-square distance, and (3) estimating the p-value by generating random data sets and
finding the fraction whose chi-square value is at least as large as for the actual data. Out
of 10,000 random data sets, xxx gave a chi-square value of 115.6 or more. Conclusion:
It isn’t at all unusual to get data as extreme as the actual data. Queen Victoria’s
descendants provide no evidence for the claim that birth month and death month are
associated.
##############################################################
#
#   Chi-Square Tests by randomization
#
##############################################################
#
##############
#
# expected. This function takes a matrix of non-negative entries
# and returns a matrix of expected values computed assuming there
# is no association between rows and columns: the (i,j) element of
# the matrix equals the total for row i times the proportion for
# column j.
#
##############
#
expected <- function(A){
  RowTotals <- matrix(rowSums(A), dim(A)[1], 1)
  GrandTotal <- sum(A)
  ColFractions <- matrix(colSums(A), 1, dim(A)[2]) / GrandTotal
  Expected <- RowTotals %*% ColFractions
  return(Expected)
}
#
##############
#
# ChiSqDistance. This function takes matrices of observed and
# expected counts and returns the usual chi-square statistic,
# with value equal to the sum of
#     (observed - expected)^2 / expected.
#
##############
#
ChiSqDistance <- function(Observed, Expected){
  ChiSq <- sum((Observed - Expected)^2 / Expected)
  return(ChiSq)
}
#
##############
#
# ChiSqSim. Carries out a randomization chi-square test for
# independence by creating random data sets, computing the
# chi-square distance from expected values computed assuming
# no association, and estimating the p-value as the fraction of
# random data sets whose chi-square values are at least as large
# as the value for the actual data.
# Input: a matrix of counts and the number of repetitions.
#
##############
#
ChiSqSim <- function(ActualData, NReps){
  Exp <- expected(ActualData)                 # Expected values
  ActChiSq <- ChiSqDistance(ActualData, Exp)  # Observed value of the
                                              #   chi-square distance
  NYes <- 0                                   # NYes counts # of random data
                                              #   sets with bigger chi-square
  Pop <- rep(1:dim(ActualData)[1], rowSums(ActualData))
                                              # Pop = the contents of the bucket
  PopSize <- length(Pop)                      # Number of chips in the bucket
  ColGroups <- rep(1:dim(ActualData)[2], colSums(ActualData))
                                              # ColGroups tells which draws go
                                              #   with which columns
  for (i in 1:NReps){                         # Repeat this loop once for
                                              #   each random data set
    Permutation <- sample(Pop, PopSize)       # Create a random permutation
                                              #   of the chips in the bucket
    RandomData <- table(Permutation, ColGroups)
                                              # Summarize the results in an
                                              #   r x c table of counts
    NYes <- NYes + (ChiSqDistance(RandomData, Exp) >= ActChiSq)
                                              # Add 1 to NYes if the random
                                              #   data set has a chi-square
                                              #   value at least as large as
                                              #   for the actual data
  }
  p.hat <- NYes / NReps
  return(p.hat)
}
#
# The Martin data
#
Martin <- matrix(c(3,2,0,5), 2, 2, byrow=T)
Martin
expected(Martin)
ChiSqDistance(Martin, expected(Martin))
ChiSqSim(Martin, 1000)
#
# Victoria's Descendants
#
Victoria
expected(Victoria)
ChiSqDistance(Victoria, expected(Victoria))
ChiSqSim(Victoria, 1000)

Display 2.18 S-plus code for randomization chi-square test for two-way tables
Appendix 2.1 Randomization tests: A summary
GIVEN:
1. A null model (or model and null hypothesis):
a. A set, called the population. (Subsets of the population are samples; the number
of elements in the sample is the sample size, n.)
b. A finite (though often very large) collection of equally likely samples.
2. A test statistic (or metric):
a. A function or rule that assigns a real number to each sample. (The function itself
is the test statistic.)
b. A way to tell which of two values of the test statistic is more extreme. (Often,
larger values are more extreme.)
3. Observed data
A particular sample (and so, automatically, a particular value of the test statistic.)
COMPUTE:
1. The p-value (more formally, the observed significance level):
The p-value is the probability, computed using the null model, of getting a value of
the test statistic at least as extreme as the observed value.
According to the null model, all the samples are equally likely, so the p-value is just
the fraction of samples that have values of the test statistic at least as extreme as the
observed value. There are three general approaches to computing p-values, brute
force, mathematical theory, and simulation.
a. Brute force: List all the samples, compute values of the test statistic for each
sample, and count.
b. Mathematical theory: Find a shortcut by applying mathematical ideas (e.g., the
theory of permutations and combinations) to the structure of the set of samples in
the null model.
c. Simulation: Use physical apparatus or a computer to generate a large number of
random samples, and estimate the p-value using the fraction of samples that give a
value of the test statistic at least as extreme as the observed value.
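As an illustration of the brute-force approach, here is a small Python example (the null model is invented for illustration, not taken from the exercises): all ten subsets of size two from {1, ..., 5} are equally likely, the test statistic is the sum, and the observed sample is {3, 5}.

```python
from itertools import combinations

population = range(1, 6)                      # {1, 2, 3, 4, 5}
samples = list(combinations(population, 2))   # all 10 equally likely samples
observed_stat = 3 + 5                         # observed sample {3, 5}

# Brute force: count the samples at least as extreme as the observed one.
p_value = sum(1 for s in samples if sum(s) >= observed_stat) / len(samples)
print(p_value)   # 0.2 -- only {3,5} and {4,5} have sums of 8 or more
```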
INTERPRETATION:
The p-value measures how surprising the observed data would be if the null model were
true. A moderate sized p-value means that the observed value of the test statistic is pretty
much what you would expect to get for data generated by the null model. A tiny p-value
raises doubts about the null model: If the model is correct (or approximately correct),
then it would be very unusual to get such an extreme value of the test statistic for data
generated by the model. In other words, either the null model is wrong, or else a very
unlikely outcome has occurred.
More on the logic of inference
Contrast the time sequence for what actually happens when you generate statistical data
with the time sequence for how you reason about it after the fact. Here's what actually
happens when you produce data:
1. The setting: There is a particular given chance mechanism for producing the data,
such as tossing a coin 100 times.
2. Before producing data, you can use probability calculations to determine which groups
of outcomes are likely, which are unlikely. (In probability, which is a branch of
mathematics, the model is known, and you use it to deduce the chances for various events
that have not yet happened.)
3. The chance mechanism produces the data, e.g., 91 heads in 100 tosses of a fair coin.
Now consider the analysis using the logic of inference. What was once assumed fixed
(the chance mechanism) is now regarded as unknown. In the coin example, you don't
know the value p for the probability of heads. To make matters worse, what was initially
unknown and variable – the number of heads you would get if you were to toss 100 times
-- is now fixed: you got 91 heads in 100 tosses. It may appear to make no sense to
compute the probability of something that has already happened. However, according to
the logic of classical inference, probabilities only apply to the data. There is no way to
answer the question we really want answered: "How likely is it that p=.5?" No wonder
people find this hard!
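For the coin example, the probability in question can be computed exactly rather than simulated. A Python sketch:

```python
from math import comb

# P(91 or more heads in 100 tosses of a fair coin)
p = sum(comb(100, k) for k in range(91, 101)) / 2 ** 100
print(p)   # on the order of 1e-18: effectively impossible for a fair coin
```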
Here's the time sequence for the steps in the inference.
1. The setting: You have fixed, known data, and you want to test whether a particular
model is reasonably consistent with the data.
2. You spell out the (tentative) model you want to test.
3. Now, in your imagination, you "go back in time," ignoring for the moment the fact
that you already have the data, and ask "Which outcomes are likely, and which are
unlikely?" In particular, you ask, "How likely is it to get the data we actually got?" In
formal inference, this probability is the p-value.
4. The p-value is used not for prediction, the way probabilities are ordinarily used, but as
a measure of surprise: "If I believe the model, how surprised should I be to get data like
what I actually got?" If the p-value is small enough, the model is rejected. This is a
typical of the way probability is used in statistics: after the fact. (Though statistics is a
mathematical science, it is not a branch of mathematics the way probability is.)
Drill and practice with the abstract structure: Null models, test statistics, p-values
40. Samples of size 1. Find p-values for the following situations.
a. Null model: All elements of S1 = {1, 2, …, 20} are equally likely.
Test statistic: t(x) = x, for x ∈ S1; larger values are more extreme.
Observed data: x0 = 3.
b. Null model: All elements of S2 = {1, 2, …, N} are equally likely.
Test statistic: t(x) = x, for x ∈ S2; larger values are more extreme.
Observed data: x0 = 3.
c. Same as (b), but x0 = N-3.
d. Null model: All 26 letters of the English alphabet are equally likely.
Test statistic: t=1 if the letter drawn is a vowel, 0 otherwise.
Observed data: x0 = e.
41. Samples of size 2 or more. Find p-values for the following situations.
a. Null model: All subsets of size two drawn from {1, 2, …, 6} are equally likely.
Test statistic: t({x,y}) = x+y; larger values are more extreme.
Observed data: {x0, y0}= {3,6}.
b. Null model: All pairs (i,j) with i<j chosen from {1, 2, …, 10} are equally likely.
Test statistic: t1(i,j) = j – i ; larger values are more extreme.
Observed data: (1,3).
c. Same as (b), but t2(i,j) = (j – i)².
d. Same as (b), but t3(i,j) = least common multiple of i and j.
e. Null model: All samples of size three chosen with replacement from {1, 2, …, 6} are
equally likely. (Note that each sample is like a roll of three fair dice.)
Test statistic: Sum of the sample values; larger values are more extreme.
Observed data: (5,5,4).
42. More complicated structures. Find p-values for the following situations.
a. Null model: All 3 x 3 matrices of 0s and 1s with row totals 2, 1, 2 and column totals
2, 1, 2 are equally likely.
Test statistic: Number of checkerboard units (see bottom of page 3).
Observed data: The table on page 1.
b. Null model: All 2x2 matrices of 0s, 1s and 2s are equally likely.
Test statistic: Sum of squares of the elements of the matrix.
Observed data: Sum of squares = 10
S-Plus exercises: very basic drill
43 – 45. Write S-Plus code to do the following; then run your code as a check.
43. Sample. Take a simple random sample of size 3 from {1, 2, …, 20}
44. Test statistic. Take a simple random sample of size 3 from {1, 2, …, 20} and find
the number of elements in the sample with values of 18 or more.
45. p-hat. Take NRep random samples of size 3 from the same population, and find p-hat,
the proportion of samples with two or more elements with values 18 or more.
Martin v. Westvaco
46. The table below classifies salaried workers using two Yes/No questions: Under 40?
and Laid off? (In employment law, 40 is a special age, because only those 40 or older
belong to what is called the "protected class," the group covered by the law against age
discrimination.)
                  Laid off?
Under 40?      Yes     No     Total     % Yes
Yes              4      5        9      44.4%
No              14     13       27      51.9%
Total           18     18       36      50.0%

Display 2.19 Martin data for salaried workers
a. Set up a null model: Tell the population (how many of what kinds of items); tell the
sample size, and describe the set of equally likely samples.
b. What is the test statistic?
c. What is its observed value?
d. Use S-Plus to find the p-value.
47. 50 or older. The average age of the salaried workforce in Westvaco’s engineering
department was older than in many companies: ¾ of the 36 employees were over 40.
Here are data like those in (7), except that ages are divided at 50 instead of 40; repeat
parts (a) – (d):
                  Laid off?
Under 50?      Yes     No     Total     % Yes
Yes              5     10       15      33.3%
No              13      8       21      61.9%
Total           18     18       36      50.0%
Verbal/interpretive practice
48. Martin, continued. Does the evidence in the second table (8) provide stronger or
weaker support for Martin's case? Explain. How do you account for the different
messages from the two tables? Both provide evidence; how do you judge the evidence
from the two tables taken together?
49. P-values. Write a short paragraph explaining the logic of p-values and significance
testing in your own words.
50. Number of samples. Write a short paragraph summarizing your current
understanding of the relationship between the number of repetitions in a simulation
and the stability and reliability of the estimated p-value.
51. A trustworthy friend? A friend wants to bet with you on the outcome of a coin toss.
The coin looks fair, but you decide to do a little checking. You flip the coin: it lands
heads. You flip again: also heads. A third flip: heads. Flip: heads. Flip: heads. You
continue to flip, and the coin lands heads nineteen times in twenty tosses. Don't try any
calculations, but explain why the evidence -- 19 heads in 20 tosses -- makes it hard to
believe the coin is fair.
52. The logic in (1) relies on the fact that a certain probability is small. Describe in
words what this probability is, and tell how you could use simulation to estimate it.
53. Snow in July? A friendly tornado puts you and your dog Toto down in Kansas, and a
booming voice from behind a screen tells you that the date is July 4 (hypothesis).
However, you see snow in the air (data), and make an inference that it is not really July 4.
Describe in words what probability your inference is based on.
54. Which test statistic? The Physical Simulation asked you to use the average age to
summarize the set of three ages of the workers chosen for layoff. How different would
your conclusions have been if you had chosen some other summary? (There is a well-developed mathematical theory for deciding which summaries work well, but that theory
would take us off on a tangent. All the same, you can still think about the issues even
without the theory.) Some other possible summaries are listed at the end of this question.
Are any of them equivalent to the average age? Which summary do you like best, and
why?
Sum of the ages of the three who were laid off
Average age difference
(= average age of those laid off - average age of those retained)
Number of employees 55 or older who were laid off
Age of the youngest worker who was laid off
Age of the oldest worker who was laid off
Middle of the ages of the three who were laid off
55. How unlikely is "too unlikely"? The probability you estimated in the Physical
Simulation is in fact exactly equal to 0.05. What if it had been 0.01 instead? Or 0.10?
How would that have changed your conclusions? (In a typical court case, a probability of
0.025 or less is required to serve as evidence of discrimination. Some scientific
publications use a cut-off value of 0.05, or sometimes 0.01.)
56. At the end of Round 3, there were only six hourly workers left. Their ages were 25,
33, 34, 38, 48, and 56. The 33- and 34-year-olds were chosen for layoff. Think about how
you would repeat the Physical Simulation using the data for Round 4.
a. What is the population? (Give a list.)
b. How big is the sample?
c. Define, in words, the probability you would estimate if you were to do the simulation.
d. Write out the rule for estimating the probability, using the format from Step A1 as a
guide.
e. Give your best estimate for the probability by choosing from
1%, 5%, 20%, 50%, 80%, 95%, and 99%.
f. Is the actual outcome easy to get just by chance, or hard?
g. Does this one part of the data (Round 4, hourly) provide evidence in Martin's favor?
57. After the first three, the next hourly worker laid off by Westvaco was the other
55-year-old. What's wrong with the following argument?
Lawyer for Westvaco: "I grant you that if you choose three workers at random, the
probability of getting an average age of 58 or older is only .05. But if you extend the
analysis to include the fourth person laid off, the average age is lower, only 57.25 (=
[55+55+55+64] / 4). An average of 57.25 or older is more likely than an average of 58 or
older. So in fact, if you look at all four who were laid off instead of just the first three,
the evidence of age bias is weaker than you claim."
58. Use the data from Rounds 2 and 3 combined. Tell how to simulate the chance of
getting an average age of 57.25 or more using the methods of the Physical Simulation:
What is the population? the sample size? Tell how to estimate the probability, following
the Physical Simulation as a guide. Give your best estimate of the probability. Then tell
how to use this probability in judging the evidence from Rounds 2 and 3 combined.
59. Sketch a dot graph like Display 1.4 to illustrate what you think simulations would
look like for the following scenario:
Three workers were laid off from a set of ten whose ages were the same as in the Martin
case. The ages of those laid off were 48, 55, and 55. If you choose three workers at
random, the probability of getting an average of 52.66 or older is .166.
60. For some situations, it is possible to find probabilities by counting equally likely
outcomes instead of by simulating. Suppose only two workers had been laid off, with an
average age of 59.5 years. It is straightforward, though tedious, to list all possible pairs
of workers who might have been chosen. Here's the beginning of a systematic listing.
The first nine outcomes all include the 25-year-old and one other. The next eight
outcomes all include the 33-year-old and one other, but not the 25-year-old, since the pair
(25,33) was already counted.
Count   Pair chosen (* = laid off)                    Average age
  1     25* 33* 35  38  48  55  55  55  56  64           29.0
  2     25* 33  35* 38  48  55  55  55  56  64           30.0
  3     25* 33  35  38* 48  55  55  55  56  64           31.5
  …
  9     25* 33  35  38  48  55  55  55  56  64*          44.5
 10     25  33* 35* 38  48  55  55  55  56  64           34.0
 11     25  33* 35  38* 48  55  55  55  56  64           35.5
etc.
How many possible pairs are there? (Don't list them all!) How many give an average
age of 59.5 years or older? (Do list them.) If the pair is chosen completely at random,
then all possibilities are equally likely, and the probability of getting an average age of
59.5 or older equals the number of possibilities with an average of 59.5 or more divided
by the total number of possibilities. What is the probability for this situation? Is the
evidence of age bias stronger or weaker than in the example?
61. It is possible to use the same approach of listing and counting possibilities to find the
probability of getting an average of 58 or more when drawing three at random. It turns
out there are 120 possibilities. List the ones that give an average of 58 or more, and
compute the probability. How does this number compare with the results of the class
simulation in Physical Simulation? Why do the two probabilities differ (if they do)?
62. How would your reasoning and conclusions change if the five oldest workers among
the entire group of ten were all age 55 (so that the ages of the ten were 25, 33, 35, 38, 48,
55, 55, 55, 55, 55), and the three chosen for layoff were all 55? Is the evidence of age
bias stronger or weaker than in the actual case?
63. The law on age discrimination applies only to people 40 or older. Suppose that
instead of looking at actual ages, you look only at whether each worker is less than 40, or
40 or older. Tell what summary statistic you would use, and tell how you would set up
the model for simulating an age-neutral process for choosing three workers to be laid off.
Conclude by discussing whether you think it is better to use actual ages, or just the
information about whether a person is 40 or older.
Applied problems: creating your own null models and test statistics
64. More Martin.
a. Use the 2x2 summary tables for salaried workers (in 7 and 8) to create a 3x2
summary table with three age groups: under 40, 40 to 49, and 50 or older.
b. Describe a null model that corresponds to the hypothesis of no discrimination.
c. Invent/define a test statistic that will be larger if older workers are more likely to be
chosen for layoff.
d. Compute the observed value of your test statistic.
e. Find the p-value for your combination of null model, test statistic, and observed
value.
65. Horse racing
The data set below shows the starting position of winning horses in 144 races.24 All races
took place in the US, and each race had eight horses. Position 1, nearest the inside rail, is
hypothesized to be advantageous.
Starting position    1    2    3    4    5    6    7    8
Number of wins      29   19   18   25   17   10   15   11
a. Describe a null model that corresponds to the hypothesis that starting position has no
effect.
b. Invent/describe a test statistic that will have larger values if lower numbered starting
positions are more advantageous.
c. For the data given here, tell whether the p-value will be < 0.01, ≥ 0.01 but ≤ 0.1, or >
0.1.
70. Spatial data.
One way to record the spatial distribution of a plant species is to subdivide a larger area
into a grid of small squares (quadrats), and record whether the plant (Carex arenaria in
Display 2.20) is present (1) or absent (0), in each square.25
a. Suppose you want to test the hypothesis that the plants distribute themselves
randomly, without regard to how close they are to others of the same species. Thus
the chance of finding a plant in any one quadrat doesn’t depend on whether or not
there are plants in neighboring quadrats. Consider two possible null models, one that
keeps only the total number of ones fixed, and a second that keeps the row and column
sums fixed. Which model is more appropriate, and why?
b. Consider two alternatives to the null model: (i) The presence of the plant in any one
quadrat makes its presence in neighboring quadrats more likely. (ii) The presence of
24. New York Post, August 30, 1955, p. 42. Reprinted in Siegel, S. and Castellan, N.J. (1988), Non-parametric Statistics for the Behavioral Sciences, 2nd ed., New York: McGraw-Hill, p. 47.
25. Strauss, D. (1992). The many faces of logistic regression. American Statistician 46, 321-327.
the plant in a quadrat makes its presence in neighboring quadrats less likely. Devise
and define two test statistics, one for each alternative.
        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
  1     0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  0  1  0  0
  2     1  1  1  0  1  1  1  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0
  3     1  0  1  1  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  1  0
  4     1  0  1  1  0  0  1  0  0  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0
  5     0  1  0  1  1  0  1  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  1
  6     1  1  1  1  1  1  0  0  1  0  1  0  0  0  1  0  0  0  0  0  1  1  0  1
  7     1  1  1  1  1  0  1  1  0  1  0  0  0  1  0  0  0  1  1  0  0  1  0  0
  8     1  0  0  0  1  1  0  1  1  0  0  0  0  1  0  0  1  0  0  0  0  0  0  1
  9     1  0  0  1  1  0  1  0  1  1  0  0  0  0  0  0  0  0  0  0  0  1  0  0
 10     1  1  0  0  0  1  0  1  0  0  1  0  0  1  0  0  0  0  0  0  0  1  0  0
 11     0  0  0  1  0  1  0  0  0  0  1  1  0  0  0  0  1  0  0  0  0  0  0  0
 12     0  0  0  0  0  1  0  0  0  0  0  1  0  1  0  1  0  0  0  0  0  0  1  0
 13     0  0  0  0  0  1  1  0  0  0  0  0  1  1  0  0  1  0  0  1  1  1  1  0
 14     1  0  0  0  1  1  0  0  0  0  0  0  0  1  1  0  0  1  1  0  0  0  0  0
 15     0  0  1  0  1  0  0  0  0  0  0  0  0  1  1  1  1  1  1  0  0  0  1  0
 16     0  0  1  0  1  0  1  0  0  0  1  0  1  1  0  1  0  1  0  0  0  0  0  0
 17     1  0  1  1  0  0  1  1  1  1  1  1  0  1  0  0  0  0  0  0  0  0  0  0
 18     0  1  1  1  1  0  0  0  0  0  0  0  0  1  1  0  0  0  1  0  0  0  0  0
 19     0  1  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  0  1  1
 20     1  1  0  0  0  1  1  0  0  0  0  0  0  0  1  1  0  1  1  0  0  0  0  0
 21     1  0  0  0  0  0  0  0  1  1  0  0  0  0  0  1  0  0  1  0  0  0  0  0
 22     0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1  1  0  0  0  0
 23     1  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0
 24     0  1  0  0  0  0  0  0  0  0  0  0  0  1  1  1  0  1  1  1  1  0  1  0

Display 2.20 Presence (1) or absence (0) of the plant Carex arenaria
in the squares of a 24 x 24 array.
71. Capture-mark-recapture.
An entomologist wants to estimate the size of an animal population. Suppose, for
example, you want to know the number of giant cockroaches that feed in the vicinity of a
particular stand of trees in a Costa Rican rainforest. The entomologist captures and
marks 25 roaches, then releases them. After waiting long enough for the marked roaches
to mix thoroughly with the rest of the roach population, he captures a sample of 50
cockroaches, and finds that 14 of them are marked.
a. Let N be the unknown number of cockroaches in the population. Tell how to use a
null model to test the hypothesis that N is greater than or equal to 100.
b. Tell how to use your null model to find the set of possible population sizes N that are
not rejected by a test of the sort in (a).
c. What assumptions about the cockroaches are necessary in order to make the random
sampling model appropriate?
For the cockroach data we assume that the 50 insects caught in the sample are a
random subset of the population. (This situation differs from the two previous
examples. Here our goal is not to test whether the data are consistent with choosing
at random. Rather, we accept random selection as a reasonable assumption,26 in order
to use it to make deductions about the size of the population.) For this example, the
two randomly chosen groups are the 50 roaches that got caught, and the N-50 that
didn’t get caught. The feature used for comparing groups is whether or not a roach
was marked.
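To make part (a) concrete, here is one possible Python sketch (the exact form of the test, and the function name, are my own choices, not the chapter's). Under the hypothesis N = 100, the number of marked roaches among the 50 caught follows the hypergeometric distribution; since catching many marked roaches points toward a small population, the relevant tail is the probability of 14 or more marked roaches.

```python
from math import comb

def p_marked_at_least(n_pop, n_marked=25, n_sample=50, observed=14):
    """Hypergeometric tail: P(a random sample of n_sample individuals
    from a population of n_pop containing n_marked marked individuals
    includes at least `observed` marked ones)."""
    total = comb(n_pop, n_sample)
    return sum(comb(n_marked, k) * comb(n_pop - n_marked, n_sample - k)
               for k in range(observed, min(n_marked, n_sample) + 1)) / total

# If even N = 100 makes 14 or more marked roaches unsurprising (a
# large p-value), the data do not reject the hypothesis N >= 100.
print(round(p_marked_at_least(100), 3))
```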
26. Biologists have broken down the assumption of random selection into a list of conditions about the population. In order for the assumption to be reasonable, there must be no movement of individuals in or out of the population during the time of the study, and the individuals must move around enough that the marked individuals get well mixed into the population before the sample is caught.
Short Investigations
72. What is the effect of ties on the permutation test? For example, compare drawing
three at random from {1, 2, …, N} and drawing three at random from {1, 1, 2, …, N-1}.
73. What is the effect of the “cut point” (threshold) when you turn a quantitative variable
like age into a categorical variable like “Under 40” or “40 or older”?
74. What is the effect of the choice of test statistic on the performance of the test?
75. If you use simulations to estimate a p-value, different simulations will give different
results. Large simulations (many samples; large value of NRep) are more stable and
reliable than small ones. Invent a way to tell from a set of simulation results whether or
not you’ve done enough to stop.
Related research questions
76. APPLIED QUESTIONS:
No matter what area of application you are interested in, there are two closely related
research questions to be answered:
a. Null model: What choice or choices will provide useful evidence for evaluating a
given scientific hypothesis?
b. Test statistic: Same question.
The questions and answers will depend very much on the applied context. In the
field of community ecology, questions of this sort have been a source of vigorous
debate, the subject of voluminous published research, and, on several occasions, the
excuse for indecorous name-calling, for more than 20 years.
77. MATHEMATICAL QUESTIONS:
If you estimate the p-value using simulation, there are two generic research questions
you can ask about the estimation process:
a. Convergence: Does the observed proportion converge to the correct p-value as
you increase the number of samples? If not, can you alter the simulation to get
convergence to the right value?
b. Rate: How fast is the convergence? In other words, how many samples do you
need to get an estimate that is “close enough” to the true p-value?