MODERN SCIENCE AND
THE BAYESIAN-FREQUENTIST CONTROVERSY
Bradley Efron
Abstract
The 250-year debate between Bayesians and frequentists is unusual among philosophical arguments in actually having important practical consequences. Whenever
noisy data is a major concern, scientists depend on statistical inference to pursue nature’s mysteries. 19th Century science was broadly Bayesian in its statistical methodology, while frequentism dominated 20th Century scientific practice. This brings up a
pointed question: which philosophy will predominate in the 21st Century? One thing
is already clear – statistical inference will play an increased role in scientific progress as
scientists attack bigger, messier problems in biology, medicine, neuroscience, the environment, and other fields that have resisted traditional deterministic analyses. This
talk argues that a combination of frequentist and Bayesian thinking will be needed to
deal with the massive data sets scientists are now bringing us. Three examples are
given to suggest how such combinations might look in practice. A large portion of the
talk is based on my presidential address to the American Statistical Association and a
related column in Amstat News.
Autumn in Stanford brings with it two notable natural phenomena: the days get short
and it starts to rain a lot. These are both “scientific facts” but they involve quite different
kinds of science. Shorter days exemplify hard-edged science, precise and so predictable
that you can sell an almanac saying exactly how short each day will be, down to the nearest
second. The almanacs try to predict rainfall too but they’re not nearly so successful. Rainfall
is a famously random phenomenon, as centuries of unhappy farmers can testify. (My father’s
almanac went further, predicting good or bad fishing weather for each day, indicated by a
full fish icon, an empty fish, or a half fish for borderline days.)
Hard-edged science still dominates public perceptions, but the attention of modern scientists has swung heavily toward rainfall-like subjects, the kind where random behavior plays
a major role. A cartoon history of western thought might recognize three eras: an unpredictable pre-scientific world ruled by willful gods and magic; the precise clockwork universe
of Newton and Laplace; and the modern scientific perspective of an understandable world,
but one where predictability is tempered by a heavy dose of randomness. Deterministic
Newtonian science is majestic, and the basis of modern science too, but a few hundred years
of it pretty much exhausted nature’s storehouse of precisely predictable events. Subjects like
biology, medicine, and economics require a more flexible scientific world view, the kind we
statisticians are trained to understand.
These thoughts were very much in my mind at “Phystat2003”, a conference of particle
physicists and statisticians held at the Stanford Linear Accelerator Center last year. It’s at
least slightly amazing to me that the physicists, who were the convening force, were eager
to confer with us. One can’t imagine “Phystat1903”, back when the physics world disdained
statistics. “If your experiment needs statistics you ought to have done a better experiment”
in Lord Rutherford’s infamous words.
Rutherford lived in a rich man’s world of scientific experimentation, where nature generously provided boatloads of data, enough for the law of large numbers to squelch any
noise. Nature has gotten more tight-fisted with modern physicists. They are asking harder
questions, ones where the data is thin on the ground, and where efficient inference becomes
a necessity. In short, they have started playing in our ball park.
The question of greatest interest at Phystat2003 concerned the mass of the neutrino,
a famously elusive particle that is much lighter than an electron, and may weigh almost
nothing at all. Heroic experiments, involving house-sized vats of cleaning fluid in abandoned
mine shafts, yielded only a few dozen or a few hundred neutrinos. This left lots of room for
experimental noise, and in fact the best unbiased estimate of neutrino mass turned out to be
negative. Mass itself can’t be negative of course. Given a negative estimate, the physicists
wished to establish a statistical upper bound for the true mass, the smaller the better from
the point of view of further experimentation. As a result the particle physics literature now
contains a healthy debate on Bayesian versus frequentist ways of setting the bound. The
current favorite is the “Feldman-Cousins” method, developed by two prominent physicists,
a likelihood-ratio based system of one-sided confidence intervals.
The physicists I talked with were really bothered by our 250-year-old Bayesian-frequentist
argument. Basically there’s only one way of doing physics, but there seem to be at least
two ways to do statistics, and they don’t always give the same answers. This says something
about the special nature of our field. Most scientists study some aspect of nature, rocks,
stars, particles; we study scientists, or at least scientific data. Statistics is an information
science, the first and most fully developed information science. Maybe it’s not surprising
then that there is more than one way to think about an abstract subject like “information”.
The Bayesian-Frequentist debate reflects two different attitudes to the process of doing
science, both quite legitimate. Bayesian statistics is well-suited to individual researchers,
or a research group, trying to use all the information at its disposal to make the quickest
possible progress. In pursuing progress, Bayesians tend to be aggressive and optimistic
with their modeling assumptions. Frequentist statisticians are more cautious and defensive.
One definition says that a frequentist is a Bayesian trying to do well, or at least not too
badly, against any possible prior distribution. The frequentist aims for universally acceptable
conclusions, ones that will stand up to adversarial scrutiny. The FDA for example doesn’t
care about Pfizer’s prior opinion of how well its new drug will work; it wants objective proof.
Pfizer, on the other hand, may care very much about its own opinions in planning future drug
development.
Bayesians excel at combining information from different sources, “coherence” being the
technical word for correct combination. On the other hand, a common frequentist tactic is
to pull problems apart, focusing for the sake of objectivity on a subset of the data that can
be analyzed optimally. I’ll show examples of both tactics soon.
Broadly speaking, Bayesian statistics dominated 19th Century statistical practice while
the 20th Century was more frequentist. What’s going to happen in the 21st Century? One
thing that’s already happening is that scientists are bringing statisticians much bigger data
sets to analyze, with millions of data points and thousands of parameters to consider all
at once. Microarrays, a favorite technology of modern bioscience, are the poster boy for
scientific giganticism.
Classical statistics was fashioned for small problems, a few hundred data points at most,
a few parameters. Some new thinking is definitely called for on our part. I strongly suspect
that statistics is in for a burst of new theory and methodology, and that this burst will
feature a combination of Bayesian and frequentist reasoning. I’m going to argue that in
some ways huge data sets are actually easier to handle for both schools of thought.
Here’s a real-life example I used to illustrate Bayesian virtues to the physicists. A
physicist friend of mine and her husband found out, thanks to the miracle of sonograms,
that they were going to have twin boys. One day at breakfast in the student union she
suddenly asked me what was the probability that the twins would be identical rather than
fraternal. This seemed like a tough question, especially at breakfast. Stalling for time, I
asked if the doctor had given her any more information. “Yes”, she said, “he told me that
the proportion of identical twins was one third”. This is the population proportion of course,
and my friend wanted to know the probability that her twins would be identical.
Bayes would have lived in vain if I didn’t answer my friend using Bayes’ rule. According
to the doctor the prior odds ratio of identical to nonidentical is one-third to two-thirds, or
one half. Because identical twins are always the same sex but fraternal twins are random, the
likelihood ratio for seeing “both boys” in the sonogram is a factor of two in favor of Identical.
Bayes’ rule says to multiply the prior odds by the likelihood ratio to get the current odds: in
this case 1/2 times 2 equals 1; in other words, equal odds on identical or nonidentical given
the sonogram results. So I told my friend that her odds were 50-50 (wishing the answer had
come out something else, like 63-37, to make me seem more clever). Incidentally, the twins
are a couple of years old now, and “couldn’t be more nonidentical” according to their mom.
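For readers who like to see the arithmetic spelled out, here is a minimal sketch of that odds calculation in Python; the function and variable names are mine, and the numbers are the ones quoted above.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * likelihood_ratio

# Doctor's prior: one third identical, two thirds fraternal.
prior_odds = (1 / 3) / (2 / 3)            # = 0.5

# Sonogram shows "both boys": probability 1 if identical, 1/2 if fraternal.
likelihood_ratio = 1.0 / 0.5              # = 2

odds = posterior_odds(prior_odds, likelihood_ratio)   # = 1.0
prob_identical = odds / (1 + odds)                     # = 0.5, i.e. 50-50
print(prob_identical)
```

The same two-line recipe, prior odds times likelihood ratio, applies whenever both the prior and the likelihood are actually available.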
Now Bayes rule is a very attractive way of reasoning, and fun to use, but using Bayes rule
doesn’t make one a Bayesian. Always using Bayes rule does, and that’s where the practical
difficulties begin: the kind of expert opinion that gave us the prior odds one-third to two-thirds usually doesn’t exist, or may be controversial or even wrong. The likelihood ratio can
cause troubles too. Typically the numerator is easy enough, being the probability of seeing
the data at hand given our theory of interest; but the denominator refers to probabilities
under other theories, which may not be clearly defined in our minds. This is why Bayesians
have to be such aggressive math modelers. Frequentism took center stage in the 20th Century
in order to avoid all this model specification.
Figure 1 concerns a more typical scientific inference problem, of the sort that is almost
always handled frequentistically these days. It involves a breast cancer study that attracted
national attention when it appeared in the New England Journal of Medicine in 2001. Dr.
Hedenfalk and his associates were studying two genetic mutations that each lead to increased
breast cancer risk, called “BRCA1” and “BRCA2” by geneticists. These are different mutations on different chromosomes. Hedenfalk et al. wondered if the tumors resulting from the
two different mutations were themselves genetically different.
• BRCA1 (7 Tumors)
-1.29 -1.41 -0.55 -1.04 1.28 -0.27 -0.57
• BRCA2 (8 Tumors)
-0.70 1.33 1.14 4.67 0.21 0.65 1.02 0.16
Figure 1:
Expression data for the first of 3226 genes, microarray study of breast cancer;
Hedenfalk et al. (2001).
To answer the question they took tumor material from 15 breast cancer patients, seven
from women with the BRCA1 mutation and eight with BRCA2. A separate microarray was
developed for each of the 15 tumors, each microarray having the same 3226 genes on it. Here
we only see the data for the first gene: 7 genetic activity numbers for the BRCA1 cases and
8 activity numbers for the BRCA2’s. These numbers don’t have much meaning individually,
even for microbiologists, but they can be compared with each other statistically. The question
of interest is “are the expression levels different for BRCA1 compared to BRCA2?” It looks
like this gene might be more active in the BRCA2 tumors, since those 8 numbers are mostly
positive, while 6 of the 7 BRCA1’s are negative.
A standard frequentist answer to this question uses Wilcoxon’s nonparametric two-sample test (which amounts to the usual t-test, except with ranks replacing the original
numbers). We order the 15 expression values from smallest to largest and compute “W”,
the sum of ranks for the BRCA2 values. The biggest W could be is 92, if all 8 BRCA2
numbers were larger than all 7 BRCA1’s; at the opposite end of the scale, if the 8 BRCA2’s
were all smaller than the 7 BRCA1’s, we’d get W = 36. For the gene 1 data we actually get
W = 83, which looks pretty big. It is big, by the usual frequentist criterion. Its two-sided
p-value, the probability of getting a W at least this extreme, is only .024 under the null
hypothesis that there is no real expression difference. We’d usually put a star next to .024
to indicate significance, according to Fisher’s famous .05 cutoff point. Notice that this analysis requires very little from the statistician, no prior probabilities or likelihoods, only the
specification of a null hypothesis. It’s no wonder that hypothesis testing is wildly popular
with scientists, and has been for a hundred years.
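For anyone who wants to check the W = 83 figure, here is a rough Python sketch using the gene 1 numbers from Figure 1. Enumerating all C(15, 8) = 6435 rank assignments is my own way of getting an exact null p-value; it should land close to the .024 quoted above.

```python
import itertools
import numpy as np

# Gene 1 expression values from Figure 1
brca1 = np.array([-1.29, -1.41, -0.55, -1.04, 1.28, -0.27, -0.57])
brca2 = np.array([-0.70, 1.33, 1.14, 4.67, 0.21, 0.65, 1.02, 0.16])

# Wilcoxon rank-sum statistic: rank all 15 values together,
# then sum the ranks belonging to the BRCA2 group.
combined = np.concatenate([brca1, brca2])
ranks = combined.argsort().argsort() + 1          # ranks 1..15 (no ties here)
W = ranks[len(brca1):].sum()                      # observed W, should be 83

# Exact two-sided p-value under the null: every assignment of 8 of the
# 15 ranks to the BRCA2 group is equally likely.
null_Ws = np.array([sum(c) for c in itertools.combinations(range(1, 16), 8)])
center = null_Ws.mean()                           # = (36 + 92) / 2 = 64
p_two_sided = np.mean(abs(null_Ws - center) >= abs(W - center))

print(W, p_two_sided)                             # roughly W = 83, p near .024
```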
The .05 significance cutoff has been used literally millions of times since Fisher proposed
it in the early Nineteen Hundreds. It has become a standard of objective comparison in
all areas of science. I don’t think that .05 could stand up to such intense use if it wasn’t
producing basically correct scientific inferences most of the time. But .05 was intended to
apply to a single comparison, not 3226 comparisons at once.
I computed W for each of the 3226 genes in the BRCA microarray data. The histogram
in Figure 2 shows the results, which range from 8 genes with the smallest possible W, W = 36,
to 7 with W = 92, the largest possible, and with all intermediate values represented many
times over. There’s more about the analysis of this data set in Efron (2004).
Figure 2:
Wilcoxon statistics for 3226 genes from a breast cancer study
It looks like something is definitely going on here. The histogram is much wider than
the theoretical Wilcoxon null density (the smooth curve) that would apply if none of the
genes behaved differently for BRCA1 vs BRCA2. 580 of the genes, 18% of them, achieve
significance according to the usual one-at-a-time .05 criterion. That’s a lot more than the
null hypothesis 5%, but now it isn’t so clear how to assess significance for any one gene given
so many candidates. Does the W = 83 we saw for gene 1 really indicate significance?
I was saying earlier that huge data sets are in some ways easier to analyze than the small
ones we’ve been used to. Here is a big-data-set kind of answer to assessing the significance
of the observed Wilcoxon value W = 83: 36 of the 3226 genes, gene 1 and 35 others,
have W = 83; under the null hypothesis that there’s no real difference between BRCA1
and BRCA2 expressions, we would expect to see only 9 genes with W = 83. Therefore the
expected false discovery rate is 9 out of 36, or 25%. If Hedenfalk decides to investigate gene
1 further he has a 25% chance of wasting his time. Investigator time is usually precious,
so he might prefer to focus attention on those genes with more extreme W values, having
smaller false discovery rates. For example there are 13 genes with W = 90, and these have
a false discovery rate of only 8%.
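The arithmetic behind that rate is simple enough to write down as a little function. Here is a hedged sketch (the function name and its interface are mine, not part of any standard package); it just divides the expected number of null genes at a given W value by the number actually observed there.

```python
def local_fdr(n_genes, n_observed_at_w, null_prob_at_w, p0=1.0):
    """False discovery rate at a given Wilcoxon value w:
    (expected number of null genes at w) / (observed number of genes at w).
    p0 is the prior proportion of null genes, taken as roughly 1 here."""
    expected_null = p0 * n_genes * null_prob_at_w
    return expected_null / n_observed_at_w

# Numbers quoted in the text for W = 83: 36 genes observed, about 9 expected
# under the Wilcoxon null, so the null probability at 83 is roughly 9/3226.
print(local_fdr(3226, 36, 9 / 3226))   # -> 0.25, the 25% false discovery rate
```

Plugging in the 13 genes at W = 90, an 8% rate corresponds to an expected null count of roughly one gene.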
The “9 out of 36” calculation looks definitely frequentist. In fact the original false
discovery rate theory developed by Benjamini and Hochberg in 1995 was phrased entirely
in frequentist terms, not very much different philosophically from Fisher’s .05 cutoff or
Neyman-Pearson testing. Their work is a good example of the kind of “new theory” that I
hope statisticians will be developing in response to the challenge of massive data sets.
It turns out that the false discovery rate calculations also have a very nice Bayesian
rationale. We assume that a priori a proportion p0 of the genes are null, and that these
genes have W’s following the null Wilcoxon density f0(w). In a usual microarray experiment
we’d expect most of the genes to be null, with p0 no smaller than, say, 90%. The remainder of
the genes are non-null, and follow some other density, let’s call it f1(w), for their Wilcoxon
scores. These are the “interesting genes”, the ones we want to identify and report back to
the investigators. If we know p0, f0, and f1, then Bayes’ rule tells us right away what the
probability is of a gene being null or non-null, given its Wilcoxon score W.
The catch is that to actually carry out Bayes’ rule we need to know the prior quantities p0, f0(w), and f1(w). This looks pretty hopeless without an alarming amount of prior
modeling and guesswork. But an interesting thing happens with a large data set like this
one: we can use the data to estimate the prior quantities, and then use these estimates to
approximate Bayes rule. When we do so the answer turns out much the same as before, for
example null probability 9 out of 36 given W = 83.
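In symbols, if f(w) = p0 f0(w) + (1 − p0) f1(w) is the mixture density of all 3226 Wilcoxon scores, Bayes’ rule gives Prob{null | W = w} = p0 f0(w)/f(w); the empirical Bayes move is to estimate f(w) directly from the observed histogram. Here is a rough Python sketch of that plug-in idea, under my own simplifying choices (p0 set to 1, and the null probabilities f0(w) supplied by the user); it is meant to illustrate the logic, not to reproduce the analysis in Efron (2004).

```python
import numpy as np
from collections import Counter

def empirical_null_prob(W_scores, null_probs, p0=1.0):
    """Estimate Prob{null | W = w} = p0 * f0(w) / f-hat(w) for each observed w,
    where f-hat(w) is simply the observed proportion of genes with that score
    and null_probs maps w to the Wilcoxon null probability f0(w)."""
    W_scores = np.asarray(W_scores)
    n = len(W_scores)
    counts = Counter(W_scores.tolist())
    return {w: min(1.0, p0 * null_probs[w] * n / counts[w]) for w in counts}

# Toy check using the numbers quoted in the text: 36 of 3226 scores at W = 83,
# the rest parked at the null center 64, with made-up f0 values for the toy.
toy_scores = [83] * 36 + [64] * (3226 - 36)
toy_null = {83: 9 / 3226, 64: (3226 - 36) / 3226}
print(empirical_null_prob(toy_scores, toy_null)[83])   # -> 0.25, i.e. 9 out of 36
```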
This is properly called an “empirical Bayes” approach. Empirical Bayes estimates combine the two statistical philosophies: the prior quantities are estimated frequentistically
in order to carry out Bayesian calculations. Empirical Bayes analysis goes back to Robbins
and Stein in the 1950’s, but they were way ahead of their time. The kind of massively
parallel data sets that really benefit from empirical Bayes analysis seem to be much more a
21st Century phenomenon.
The BRCA data set is big by classical standards, but it is big in an interesting way:
it repeats the same “small” data structure again and again, so we are presented with 3226
similar two-sample comparisons. This kind of parallel structure gives the statistician a terrific
advantage; it’s just what we need to bring empirical Bayes methods to bear. Statisticians
are not passive observers of the scientific scene. The fact that we can successfully analyze
ANOVA problems leads scientists to plan their experiments in ANOVA style. In the same
way we can influence the design of big data sets by demonstrating impressively successful
analyses of parallel structures.
We have a natural advantage here: it’s a lot easier to manufacture high-throughput
devices if they have a parallel design. The familiar medical breakthrough story on TV,
showing what looks like a hundred eyedroppers squirting at once, illustrates parallel design
in action. Microarrays, flow cytometry, proteomics, time-of-flight spectroscopy, all refer to
machines of this sort that are going to provide us with huge data sets nicely suited for
empirical Bayes methods.
Figure 3 shows another example. It concerns an experiment comparing 7 normal children
with 7 dyslexic kids. A diffusion tensor imaging scan (related to fMRI scanning) was done
for each child, providing measurements of activity at 16000 locations in the brain. At each
of these locations a two-sample t-test was performed comparing the normal and dyslexic
kids. The figure shows the signs of the t-statistics for 580 of the positions on one horizontal
slice of the brain scan. (There are 40 other slices with similar pictures.) Squares indicate
positive t-statistics, x’s negative, with filled-in squares indicating values exceeding two; these
are positions that would be considered significantly different between the two groups by the
standard .05 one-at-a-time criterion.
We can use the same false discovery rate empirical Bayes analysis here, with one important difference: the geometry of the brain scan lets us see the large amount of spatial
correlation. Better results are obtained by averaging the data over small contiguous blocks
of brain position, better in the sense of giving more cases with small false discovery rates.
The best way of doing so is one of those interesting questions raised by the new technology.
There’s one last thing to say about my false discovery rate calculations for the BRCA
data: they may not be right! At first glance the “9 out of 36 equals 25% false discoveries”
argument looks too simple to be wrong. The “9” in the numerator, which comes from
Wilcoxon’s null hypothesis distribution, is the only place where any theory is involved. But
that’s where potential trouble lies. If we only had one gene’s data, say for gene 1 as before,
we would have to use the Wilcoxon null, but with thousands of genes to consider at once,
most of which are probably null, we can empirically estimate the null distribution itself.
Doing so gives far fewer significant genes in this case, as you can read about in Efron (2004).
Estimating the null hypothesis itself from the data sounds a little crazy, but that’s what I
Figure 3:
580 t-statistics from a brain imaging study of dyslexia; solid squares t ≥ 2;
empty squares t ≥ 0; x’s t < 0. From data in Schwartzman, Dougherty, and Taylor (2004).
meant about huge data sets presenting new opportunities as well as difficulties.
How could Wilcoxon’s null hypothesis possibly go astray? There is a simple answer in
this case: There was dependence across microarrays. The empirical Bayes analysis doesn’t
need independence within a microarray, but the validity of Wilcoxon’s null requires independence across any one gene’s 15 measurements. For some reason this wasn’t true for the
BRCA experiment, particularly the BRCA2 data. Table 1 shows that the BRCA2 measurements were substantially correlated in groups of four.

       1      2      3      4      5      6      7      8
1   1.00   0.02   0.02   0.23  -0.36  -0.35  -0.39  -0.34
2   0.02   1.00   0.10  -0.08  -0.30  -0.30  -0.23  -0.33
3   0.02   0.10   1.00  -0.17  -0.21  -0.26  -0.31  -0.27
4   0.23  -0.08  -0.17   1.00  -0.30  -0.23  -0.27  -0.32
5  -0.36  -0.30  -0.21  -0.30   1.00  -0.02   0.11   0.22
6  -0.35  -0.30  -0.26  -0.23  -0.02   1.00   0.15   0.13
7  -0.39  -0.23  -0.31  -0.27   0.11   0.15   1.00   0.07
8  -0.34  -0.33  -0.27  -0.32   0.22   0.13   0.07   1.00

Table 1: Correlations of the 8 BRCA2 microarray measurements (after first subtracting each
gene’s mean value).
I have to apologize for going on so long about empirical Bayes, which has always been
one of my favorite topics, and now at last seems to be going from ugly duckling to swan
in the world of statistical applications. Here is another example of Bayesian-frequentist
convergence, equally dear to my heart.
Figure 4 tells the unhappy story of how people’s kidneys get worse as they grow older.
The 157 dots represent 157 healthy volunteers, with the horizontal axis their age and the
vertical axis a measure of total kidney function. I’ve used the “lowess” curve-fitter, a complicated sort of robust moving average, to summarize the decline of kidney function with
age. The fitted curve goes steadily downward except for a plateau in the 20’s.
How accurate is the lowess fit? This is one of those questions that has gone from hopeless
to easy with the advent of high-speed computation. A simple bootstrap analysis gives the
answer in, literally, seconds. We resample the 157 points, that is, take a random sample of
157 points with replacement from the original 157 (so some of the original points appear
once, twice, three times or more, and others don’t appear at all in the resample). Then the
lowess curve-fitter is applied to the resampled data set, giving a bootstrap version of the
original curve.
In Figure 5 I’ve repeated the whole process 100 times, yielding 100 bootstrap lowess
curves. Their spread gives a quick and dependable picture of the statistical variability in
the original curve. For instance we can see that the variability is much greater near the high
end of the age scale, at the far right, than it is in the plateau.
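For concreteness, here is a minimal sketch of the resample-and-refit recipe in Python, using the lowess smoother from statsmodels in place of the original curve-fitter, and simulated stand-in data since the 157 kidney measurements aren’t reproduced here.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)

# Stand-in data: 157 (age, kidney score) pairs; not the real study data.
age = rng.uniform(20, 90, size=157)
kidney = 2.0 - 0.05 * age + rng.normal(scale=1.0, size=157)

grid = np.linspace(age.min(), age.max(), 50)   # common grid for comparing curves

def lowess_on_grid(x, y, grid):
    """Fit a lowess curve to (x, y) and evaluate it on the grid."""
    fit = lowess(y, x, frac=1/3, return_sorted=True)
    return np.interp(grid, fit[:, 0], fit[:, 1])

original_curve = lowess_on_grid(age, kidney, grid)

# 100 bootstrap replications: resample the 157 points with replacement,
# refit lowess each time, and keep the curves to assess variability.
boot_curves = []
for _ in range(100):
    idx = rng.integers(0, len(age), size=len(age))
    boot_curves.append(lowess_on_grid(age[idx], kidney[idx], grid))
boot_curves = np.array(boot_curves)

# Pointwise spread of the bootstrap curves, e.g. a central 90% band at each age.
lo, hi = np.percentile(boot_curves, [5, 95], axis=0)
```

Plotting the 100 curves, or the pointwise band (lo, hi), gives the kind of picture shown in Figure 5.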
The bootstrap was originally developed as a purely frequentist device. Nevertheless
the bootstrap picture has a Bayesian interpretation: if we could put an “uninformative”
prior on the collection of possible age-kidney curves, that is, a prior that reflects a lack of
specific opinions, then the resulting Bayes analysis would tend to agree with the bootstrap
distribution. The bootstrap-objective Bayes relationship is pursued in Efron and Tibshirani
(1998).
This brings up an important trend in Bayesian statistics. Objectivity is one of the principal reasons that frequentism dominated 20th century applications; a frequentist method
like Wilcoxon’s test, which is completely devoid of prior opinion, has a clear claim to being
objective – a crucial fact when scientists communicate with their skeptical colleagues. Uninformative priors, the kind that also have a claim to objectivity, are the Bayesian response.
Figure 4:
Kidney function versus age for 157 normal volunteers, and lowess fit
Bayesian statistics has seen a strong movement away from subjectivity and towards objective
uninformative priors in the past 20 years.
Technical improvements, the computer implementation of Markov Chain Monte Carlo
methods, have facilitated this trend, but the main reason, I believe, is a desire to compete
with frequentism in the domain of real-world applications. Whatever the reason, the effect
has been to bring Bayesian and frequentist practice closer together.
In practice it isn’t easy to specify an uninformative prior, especially in messy-looking
problems like choosing a possibly jagged regression curve. What looks uninformative enough
often turns out to subtly force answers in one direction or another. The bootstrap connection
is intriguing because it suggests a simple way of carrying out a genuinely objective Bayesian
analysis, but this is only a suggestion so far.
Perhaps I’ve let my enthusiasm for empirical Bayes and the bootstrap run away with the
main point I started out to make. The bottom line is that we have entered an era of massive
scientific data collection, with a demand for answers to large-scale inference problems that
lie beyond the scope of classical statistics. In the struggle to find these answers the statistics
profession needs to use both frequentist and Bayesian ideas, and new combinations of the
two. Moreover I think this is already beginning to happen... which was the real point of the
examples.
Figure 5:
100 bootstrap replications of the lowess fit
It took enormously long for the statistical point of view to develop. Two thousand years
separate Aristotelian logic from Bayes theorem, its natural probabilistic extension. Another
150 years went past before Fisher, Neyman, and other frequentists developed a statistical
theory satisfactory for general scientific use in situations where Bayes theorem is difficult to
apply. The truth is that statistical reasoning does not come naturally to the human brain.
We are cause and effect thinkers, ideal perhaps for avoiding the jaws of the sabre-toothed
tiger, but less effective in dealing with partial correlation or regression to the mean.
Once it caught on though, statistical reasoning proved to be a scientific success story.
Starting from just about zero in 1900, statistics spread from one field to another, becoming
the dominant mode of quantitative thinking in literally dozens of fields, from agriculture,
education, and psychology to medicine, biology, and economics, with the hard sciences knocking on our door now. A graph of statistics during the 20th Century shows a steadily rising
curve of activity and influence. Statisticians, a naturally modest bunch, tend to think of
their field as a small one, but it is a discipline with a long arm, reaching into almost every
area of science and social science these days.
Our curve has taken a bend upwards in the 21st Century. A new generation of scientific
devices, typified by microarrays, produce data on a gargantuan scale – with millions of data
points and thousands of parameters to consider at the same time. These experiments are
“deeply statistical”. Common sense, and even good scientific intuition, won’t do the job
by themselves. Careful statistical reasoning is the only way to see through the haze of
randomness to the structure underneath. Massive data collection, in astronomy, psychology,
biology, medicine, and commerce, is a fact of 21st Century science, and a good reason to
buy statistics futures if they are ever offered on the NASDAQ.
I find the microarray story particularly encouraging for statistics. The first fact is that
the biologists did come to us for answers to the inference problems raised by their avalanche
of microarray data. This is our payoff for being helpful colleagues in the past, doing all
those ANOVAs, t-tests, and randomized clinical trials that have become a standard part of
biomedical research. And indeed we seem to be helping again, providing a solid set of new
analytic tools for microarray experiments. The benefit goes both ways. Microarrays are
helping out inside our field too, raising difficult new problems in large-scale simultaneous
inference, stimulating a new burst of methodology and theory, and refocusing our attention
on underdeveloped areas like empirical Bayes.
Ken Alder’s 2002 book, “The Measure of All Things”, brilliantly relates the story of
the meter, one ten-millionth of the distance from the equator to the pole, and how its length
was determined in post-revolutionary France. Most of the book concerns the difficulties of
the “savants” in carrying out their arduous astronomical-geographical measurements. One
savant, Pierre Mechain, couldn’t quite reconcile his readings, and wound up fudging the
answers, driving himself to near-madness and death.
Near the conclusion of “Measure” Alder suddenly springs his main point, forgiving
Mechain as laboring under an obsolete, overly precise, notion of scientific reality:
“Approach the world instead through the veil of uncertainty and science would never be
the same. And nor would savants. During the course of the next century science learned to
manage uncertainty. The field of statistics that would one day emerge from the insights of
Legendre, Laplace, and Gauss would transform the physical sciences, inspire the biological
sciences, and give birth to the social sciences. In the process ‘savants’ became scientists.”
Right on, Ken! Alder’s new world of science has been a long time emerging but there is
no doubt that 21st Century scientists are committed to the statistical point of view. This
puts the pressure on us, the statisticians, to fulfill our end of the bargain. We have been
up to the task in the past and I suspect we will succeed again, though it may take a couple
more Fishers, Neymans, and Walds to do the trick.
References
Alder, K. (2002). “The Measure of All Things: The Seven-Year Odyssey That Transformed the
World”, Little, Brown, New York.
Benjamini, Y. and Hochberg, Y. (1995). “Controlling the False Discovery Rate: A Practical
and Powerful Approach to Multiple Testing”, Journal of the Royal Statistical Society B 57, 289-300.
Efron, B. (2005). “Bayesians, Frequentists, and Scientists”. To appear, Journal of the American
Statistical Association 100.
Efron, B. (2004). “Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null
Hypothesis”, Journal of the American Statistical Association 99, 96-104.
Efron, B. (2004). “Life in a Random Universe”, Amstat News 330, 2-3.
Efron, B. (2003). “Robbins, Empirical Bayes, and Microarrays”, Annals of Statistics 24, 366-378.
Hedenfalk, I., Duggan, D., Chen, Y. et al. (2001). “Gene Expression Profiles in Hereditary
Breast Cancer”, New England Journal of Medicine 344, 539-548.
Schwartzman, A., Dougherty, R., and Taylor, J. (2005). “Cross-Subject Comparison of Principal
Diffusion Direction Maps”. To appear, Magnetic Resonance in Medicine (armins@stanford.edu).