Significance results aren't the end of an experimenter's work

Significance
Sometimes we observe a surprising event
or sequence of events. E.g. you flip a coin
ten times and get ten heads. You ask
yourself:
“Is the result significant?”
I.e. does this result indicate that the coin is
biased or did you just observe an unlikely
(but not impossible) run?
Since the chance of getting 10 consecutive
heads is 1 in 2^10 (1 in 1,024), it seems fair to conclude:
“Either the coin is unfair or the coin is fair
and I’ve just seen an unusual run of heads.”
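As a quick check of that arithmetic, here is a
minimal Python sketch (the 0.5 simply encodes
the assumption that the coin is fair):

    # Probability of ten heads in ten tosses of a fair coin.
    p_ten_heads = 0.5 ** 10
    print(p_ten_heads)   # 0.0009765625, i.e. 1 in 1,024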
Let H = the coin is fair.
We call this a “statistical hypothesis” about
the coin.
Given the result of 10 consecutive heads,
could we conclude:
The probability of H is very low, say 5%?
No. From the frequency perspective, this
does not make sense! Either the coin is
fair (symmetrical structure) or it isn’t. There
is no probability about it.
We could conclude:
Either the coin is unfair or it is fair and an
unlikely event occurred.
Alternatively:
If H is true, then an unlikely event occurred.
Recall the example from last class: we
found that the probability of getting between
1079 & 1089 Amazing cars that don’t
require a tune-up was 99%.
Let’s say the government testers found that
only 1050 cars went 100,000 km without
requiring a tune-up. What does this mean?
If the hypothesis that 99% of Amazing cars
are reliable is true, then an event has
occurred that has a probability <1%.
We have found a significant result. We
say that:
The government data is significant at the
1% level.
We use frequency-type probabilities to
come up with significance tests, i.e. to tell
us when observed results are meaningful.
How do we do this? Design a test:
Example: does this new drug, X, help
alleviate allergy symptoms?
1. Select two groups: one gets X, the
other doesn’t.
2. H = X has no effect (“null hypothesis”).
3. Measure the results of the two groups
(this calls for some decision).
4. Divide the results into Big (i.e. X helps
a lot) and Little (no real difference).
5. Choose these categories so that there
is a low probability (according to your
model) of getting Big (often 1% is
used).
6. Therefore, if the results fall into Big,
they are significant (at the 1% level);
a sketch of steps 4-6 follows below.
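As a hedged illustration of steps 4-6 (the
matched-pairs design and all numbers here are
my own, not the lecture's): pair each treated
subject with an untreated one; if H is true,
each pair is as likely to favour either member,
like a fair coin, and we choose Big so that its
probability under H is at most 1%.

    import math

    # Hypothetical matched-pairs design: 50 treated/untreated pairs.
    # Under H ("X has no effect") each pair favours the treated
    # subject with probability 1/2, like a fair coin toss.
    n = 50

    def pr_at_least(k, n):
        """Pr(k or more pairs favour X in n pairs) when H is true."""
        return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

    # "Big" = {c or more pairs favour X}: pick the smallest cutoff c
    # whose probability under H is at most 1%.
    c = next(k for k in range(n + 1) if pr_at_least(k, n) <= 0.01)
    print(c, pr_at_least(c, n))   # prints 34 and roughly 0.008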
Let’s worry about #5 above for a moment.
There is a range of possible results your
experiment might yield.
We divide the possible data into two groups:
S (significant) and N (not significant) such
that:
1. Pr(S/H) < Pr(N/H). This is clearly
important, for otherwise the result
wouldn't be useful in moving us
away from the null hypothesis.
2. Pr(S/H) = p (usually 1% or 5%).
What does all this mean? If you’re a
medical experimenter, you need to make
some choices on how to design your test for
X.
The experimenter records the data for the
two groups given X. Assume she decides
to measure how often the subjects in each
group sneeze per day.
She designs her test so that:
1. The probability of getting a certain
number of sneezes is 1% if the Null
Hypothesis is true.
How do experimenters do this?
Logicians aren’t in the business of test
design. However, we can analyze the
reasoning behind reported results.
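One standard way it could be done, sketched
with made-up numbers (the Poisson model and
the rate of 20 sneezes a day are my
assumptions, not the lecture's): model daily
sneeze counts under the Null Hypothesis and
put the cutoff in the lower tail, since fewer
sneezes favour the drug.

    import math

    # Hypothetical null model: if H ("X has no effect") is true,
    # a subject's daily sneeze count is Poisson with mean 20.
    lam = 20.0

    def poisson_cdf(c, lam):
        """Pr(X <= c) for a Poisson(lam) count."""
        return sum(math.exp(-lam) * lam ** k / math.factorial(k)
                   for k in range(c + 1))

    # Find the largest cutoff c with Pr(X <= c | H) <= 1%; counts
    # at or below c then constitute a significant result.
    c = 0
    while poisson_cdf(c + 1, lam) <= 0.01:
        c += 1
    print(c, poisson_cdf(c, lam))   # c = 9, tail probability ~ 0.5%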
Let’s say that our researcher finds that all
subjects who got X sneezed n times per
day. She tells us that the significance
level (or “p-value”) of this result is 1%.
What does this mean?
NOT: “There is a 1% chance that the Null
Hypothesis is true”.
Either X has an effect or it doesn’t. Rather:
If H is true, then using a certain statistic
(sneeze frequency) that summarizes data
from an experiment like this, the probability
of obtaining the data we obtained is 1%.
I.e.:
Either H is true and something with a
probability <1% happened, or H is false.
A simple example:
H = this coin is fair.
Experiment: I toss it seven times.
Result: Seven heads.
• There are 2^7 = 128 possible outcomes
(if H is true).
• Only one of these outcomes is seven
consecutive heads.
• So, if H is true, I got a result that had
<1% (0.78125%) probability of
happening.
Accordingly, I report:
“Either H is false or H is true and something
unusual (Pr < 1%) happened by chance.”
I say that the “p-value” or “level of
significance” is 0.01 (really, 0.0078125).
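A small simulation sketch can check that
figure (the million-trial count is arbitrary):

    import random

    # Estimate the chance of seven heads in seven tosses of a fair
    # coin; the exact value is 1/128 = 0.0078125.
    trials = 1_000_000
    hits = sum(all(random.random() < 0.5 for _ in range(7))
               for _ in range(trials))
    print(hits / trials)   # should come out near 0.0078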
Significance results aren’t the end of an
experimenter’s work. They leave a great
deal to be done:
• Causation vs. correlation
Just because two things happen
together, it does not follow that one
caused the other.
• Which causal processes?
We can know a result is significant
without knowing how it came about.
• Common cause?
We need to find out if one thing caused
the other (X and lower sneezing) or if
the two are the joint results of something
else.
The significance results do give us a
common measure for different experiments,
but that is only the beginning of the story.
Power
Significance results don't necessarily
change your beliefs. For example, if
nothing significant is observed, you may still
reject the null hypothesis and keep looking.
The Neyman-Pearson Framework is a
method that leads to a definite decision, one
or the other of:
• Reject a hypothesis H if the outcome
falls in R.
• Accept H if the outcome falls in A.
They emphasize two kinds of error:
• Reject H when it is true (type I).
• Accept H when it is false (type II).
N & P say we should minimize both types of
error.
To do this you must design a test that:
1. Keeps Pr(R/H) low, while
2. Maximizing Pr(R/~H).
Number two is called the power of the test.
N & P want to maximize power.
A significance test can help with 1.
The power is harder to define because it is
not always clear what it means for a
statistical hypothesis to be false: there are
very many ways this might be the case,
many of which could elude us until science
progresses. We just focus on the logical
point:
N-P tests try to maximize the probability
of rejecting a false hypothesis while
minimizing the probability of rejecting a
true hypothesis.
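To make the logical point concrete, here is a
hedged sketch (the 100 tosses, the cutoff of
61 heads, and the specific alternative of a
70% heads bias are all my inventions): test
H = "this coin is fair" and compute both
error probabilities.

    import math

    def binom_tail(k, n, p):
        """Pr(k or more heads in n tosses when Pr(heads) = p)."""
        return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j)
                   for j in range(k, n + 1))

    n, cutoff = 100, 61   # reject H when 61+ heads appear

    # Type I error: rejecting H when it is true.
    print(binom_tail(cutoff, n, 0.5))   # Pr(R | H), about 0.018

    # Power against one specific alternative, ~H read as "heads
    # come up 70% of the time" (an invented alternative).
    print(binom_tail(cutoff, n, 0.7))   # Pr(R | ~H), about 0.98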
Confidence & Inductive Behaviour
We often want to draw conclusions about a
whole population based on a sample.
What we want to know is how reliable
conclusions from samples to wholes are.
Interval estimates:
1. The number of males in this room is
between 20 and 50.
2. The proportion of female students is
between 0.45 and 0.55.
3. Etc.
How reliable are estimates like this?
The Bayesian has a quick answer: Sample
the population and use Bayes’ rule to
calculate the belief-type probability that the
number of males in this room is between 20
and 50.
But on the frequency perspective, either the
number of males is between 20 and 50 or it
isn’t. There is no frequency (and, hence, no
probability) about it.
What can the frequency-type theorist say
about the reliability of interval estimates?
Opinion Polls
1. We think of polls as Bernoulli trials:
sampling with replacement with
constant probability p.
2. The poll divides the population into the
Gs and the ~Gs (the proportion of Gs
gives p).
When p is known, Bernoulli’s Theorem tells
us that:
For any small margin of error, there is a
high probability that a large sample from
the population will have a proportion of
Gs close to p.
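A quick simulation sketch illustrates the
theorem (p = 0.3 is an arbitrary choice):

    import random

    # Bernoulli's theorem in action: as samples grow, the sample
    # proportion crowds ever closer to the fixed p.
    p = 0.3
    for n in (100, 10_000, 1_000_000):
        s = sum(random.random() < p for _ in range(n)) / n
        print(n, s)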
We can use Normal Fact III (from chapter
17 of the book) as follows, where
σ = √[p(1-p)]:
• The probability that the proportion of
Gs in our sample is within σ/√n of p is
0.68.
• The probability that the proportion of
Gs in our sample is within 2σ/√n of p is
0.95.
• The probability that the proportion of
Gs in our sample is within 3σ/√n of p is
0.99.
(These figures are checked by simulation
just below.)
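Here is that check, a hedged simulation
sketch (p = 0.4, n = 400, and the repetition
count are arbitrary choices):

    import math, random

    # Draw many samples and count how often the sample proportion
    # lands within 1, 2, and 3 multiples of sigma/sqrt(n) of p.
    p, n, reps = 0.4, 400, 10_000
    se = math.sqrt(p * (1 - p) / n)   # sigma / sqrt(n)
    hits = [0, 0, 0]
    for _ in range(reps):
        s = sum(random.random() < p for _ in range(n)) / n
        for i in (0, 1, 2):
            hits[i] += abs(s - p) <= (i + 1) * se
    print([h / reps for h in hits])   # roughly [0.68, 0.95, 0.99]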
But we want to go from our sample to p.
How do we do this?
Recall that σ = √[p(1-p)], so
σ/√n = √[p(1-p)/n].
In chapter 17 (exercises) we see that the
margin of error is largest when p = ½.
In that case, σ/√n = 1/(2√n). We conclude that:
• The probability is at least (worst case)
0.68 that the proportion, s, of Gs in the
sample is within 1/(2√n) of p.
• The probability is at least 0.95 that the
proportion, s, of Gs in the sample is
within 1/√n of p.
• The probability is at least 0.99 that the
proportion, s, of Gs in the sample is
within 3/(2√n) of p.
So, if n = 10,000, then there is a 95%
probability that the proportion s of Gs in the
sample is within 1/√10,000 = 1% of p (no
matter what p is).
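These worst-case margins are simple to
compute; a minimal sketch:

    import math

    # Worst-case (p = 1/2) margins of error for a sample of size n.
    def margins(n):
        root = math.sqrt(n)
        return {"68%": 1 / (2 * root),
                "95%": 1 / root,
                "99%": 3 / (2 * root)}

    print(margins(10_000))   # 95% margin = 0.01, i.e. 1%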
But this allows us to go from samples to
wholes as follows:
Let s be the proportion of Gs in your sample
of 10,000. You know that:
There is at least a 95% probability that s
is within 1% of p.
But if that is the case, then:
There is at least a 95% probability that p
is within 1% of s.
Remember:
• p is fixed. This is just a fact about the
population under study.
• What can vary is s, the proportion of
Gs found in a sample.
So you sample 10,000 people and find that
25% of them are Gs. Do you conclude:
There is a 95% probability that the
proportion of Gs lies between 0.24 &
0.26?
No. From the frequency perspective this is
nonsense: either the proportion of Gs is
between 0.24 and 0.26 or it isn’t. What
should we say?
1. The interval 0.24-0.26 is called a
confidence interval. It is an interval
estimate of the proportion of Gs.
2. The 95% figure is our confidence in
the interval estimate. That is, we
made the estimate by a method that
gives the correct interval 95% of the
time.
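Point 2 can itself be checked by simulation,
in a hedged sketch that reuses the slide's
figures (a true proportion of 0.25 and
samples of 10,000; the repetition count is
arbitrary):

    import math, random

    # "Right 95% of the time": repeatedly sample, build the interval
    # s +/- 1/sqrt(n), and count how often it captures the fixed p.
    p, n, reps = 0.25, 10_000, 1_000
    caught = 0
    for _ in range(reps):
        s = sum(random.random() < p for _ in range(n)) / n
        caught += abs(s - p) <= 1 / math.sqrt(n)
    print(caught / reps)   # at least 0.95 (higher here, since p != 1/2)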
A confidence statement has the form: On
the basis of our data, we estimate that an
unknown quantity lies in an interval. This
estimate is made according to a procedure
that is right with probability at least 0.95.
Example (p. 233):
1026 residents were polled and 38%
favoured a complete ban on smoking in
restaurants. The result is considered
accurate to within three percentage
points, 19 times out of 20.
What this means:
1. The proportion of residents who favour
a ban lies in the interval 0.35-0.41.
2. The method used to get this figure is
right 95% of the time.
Why choose plus or minus 3%? Recall that
the 95% worst-case margin of error is:
1/√n = 1/√1026 ≈ 0.031 ≈ 3%.
The pollsters chose their numbers so that
the confidence interval would be 95% with a
3% margin of error.
We can also calculate the 99% confidence
interval: its margin of error is
3/(2√1026) ≈ 0.047, about 4.7%.
The 68% confidence interval has margin
1/(2√1026) ≈ 0.016, about 1.6%.
Behaviour or inference?
Neyman denies that we ever make inductive
inferences such as:
“The probability of x is y”.
All we can do, he says, is make
frequency-type claims about our methods
(confidence intervals):
“If I use this procedure repeatedly, I will
be right 68/95/99% of the time.”
Or:
“If I reject H, I am using a method that
will only lead me astray 1% of the time.”
BUT: Why isn’t this an inference? It seems
that you infer that H is false.
It is good to get clear on how the inferences
work, but we needn’t be dogmatic.
THE PROPENSITY INTERPRETATION
We have spent some time examining the
frequency theory of probability.
It is an important perspective on probability,
for it allows us to view probability
objectively, as a measure of occurrences in
the world rather than as a personal
assessment.
Recall Von Mises:
The essentially new idea which appeared about
1919…was to consider the theory of probability as a
science of the same order as geometry or theoretical
mechanics…just as the subject matter of geometry is
the study of space phenomena, so probability theory
deals with mass phenomena and repetitive events.
This perspective is useful in dealing with
any area where we need to draw
conclusions from statistics:
• Medicine, demographics, genetic
studies, etc.
Single cases
Consider an event, E. E might be a flipping
of a coin or the decay of a radioactive atom.
What is the objective probability of E?
Frequency theorists typically deny that this
question has an answer. We can talk about
the probability of heads in a sequence of
tosses, or the probability of decay in a group
of atoms, but that is all.
Here is Von Mises again:
We can say nothing about the probability of death of
an individual even if we know his condition of life and
health in detail. The phrase ‘probability of death’,
when it refers to a single person, has no meaning at all
for us. This is one of the most important
consequences of our definition of probability.
It has no meaning, of course, because he
defines probability as a relative frequency of
events.
Dissatisfaction
To many philosophers this seems
unsatisfactory.
For instance, Popper thought quantum
mechanics requires an objective notion of
probability that also applies to single cases.
• E.g., the probability that this atom will
decay at this time is p.
Suppose some new substance is
discovered, but there is only a small
quantity of it in the universe, say one gram.
• Suppose a person, S, ingests all of the
substance.
• It seems to make sense to say ‘there is
a probability it will kill him’.
A possible response
Equate the probability of E with the
frequency of E in the collective/sequence.
Popper: We toss two dice, A and B.
• A is ‘loaded’: it is weighted so that 5
comes up ¼ of the time.
• The material in B is uniformly distributed.
We toss the two dice as follows: A is thrown
many times, B only occasionally. Therefore:
• The frequency of 5s in the combined
sequence will be about ¼.
Now we are about to toss B (a single event).
• What is the probability of 5?
• If we equate it with the frequency in the
collective, the answer is ¼.
• But B is not loaded, so the probability
should be 1/6.
(A simulation of the mixed sequence follows
below.)
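Here is that simulation, as a hedged sketch
(I read "B only occasionally" as a 95/5 mix
of A- and B-throws; the proportion is my
choice):

    import random

    # Popper's mixed sequence: mostly throws of the loaded die A
    # (5 comes up 1/4 of the time), occasionally of the fair die B.
    def throw_A():
        return 5 if random.random() < 0.25 else random.choice([1, 2, 3, 4, 6])

    def throw_B():
        return random.randint(1, 6)

    rolls = [throw_A() if random.random() < 0.95 else throw_B()
             for _ in range(100_000)]
    print(sum(r == 5 for r in rolls) / len(rolls))   # close to 1/4
    # Yet a single toss of B alone gives 5 with probability 1/6.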
Dispositions
Because of cases such as these,
philosophers introduced the propensity
interpretation of probability.
On this view, the probability of E occurring
is a measure of the propensity of a given
situation to give rise to E.
A propensity is a disposition or tendency for
something to occur (as we have seen).
Some points to note:
• Something can have a disposition of
strength p to X even if it never in fact
Xs, or Xs only once.
• So, if objective probabilities are
dispositions, we can meaningfully
assign probabilities to single cases.
Properties of situations
Suppose we toss a fair coin. Should the
propensity theorist say the coin has a
propensity of 0.5 to come up heads?
No.
• Suppose we toss it on a surface that
has a number of slots. Now there is a
real probability of the coin landing on its
edge.
• In this situation, the probability of heads
is less than 0.5, even though the coin is
fair.
Propensities are properties of situations or
experimental set-ups.
• A coin might have a propensity in
situation S to come up heads 50% of the
time.
• Propensities are relations between
objects and set-ups.
Propensities and frequencies
Of course, there is a relation between
propensities and frequencies.
If a situation, S, has a propensity, p, to
produce event E, then if we repeatedly bring
about S, the frequency of Es should be
close to p.
This is one way in which we can test a
propensity claim: by repeating the
situation.
Question: if we can’t repeat the situation,
how could we test a propensity claim?
The nature of dispositions
There is some philosophical controversy
over what exactly a disposition is.
Some think of them as true
counterfactual conditionals:
• If F were the case, then G would be the
case.
• E.g.: Johnny is disposed to get angry
when punched = if Johnny were to be
punched, then he would become angry.
Others think of them as a unique property
that exists in an object or situation.
This is something to think about.
Assigning propensities
Suppose S is a 40 year old Canadian. We
know that z% of 40 year old Canadians die
before reaching 41.
• So, we say the propensity that S dies
within a year is z.
But S is also a smoker. We know that x% of
40 year old Canadian smokers die within
the year.
• Should the propensity that S dies within
a year be z or x?
Response: use the narrowest reference
class available.
Since 40 year old Canadian smokers is a
subset of 40 year old Canadians, the
answer is x.
Problem
A slightly different example:
• As before, x% of 40 year old Cdn.
smokers die within the next year.
• But y% of 40 year old Cdn. recreational
swimmers die within the next year.
Now, suppose S both smokes and swims.
Which reference class do we use?
• Each is equally narrow.
• There seems to be no way to uniquely
assign a propensity.
Objectively based vs. objective
Recall the frequency principle from a few
weeks back:
• It tells you to equate the single case
probability of an event to the frequency
of that event in the long run.
From a frequency dogmatist’s point of view,
the single case probability has no meaning,
but we might use this as a useful guideline.
Some people have suggested that
propensity assignments are like this.
• They deny that there is any objective
probability for single events.
• I.e., all such assignments are subjective.
• But you can base those assignments on
frequency-type evidence.
Hence, single case assignments can be
objectively based but are not objective.
Reference classes again
Hence, on this view it is no surprise that we
come up against the reference class
problem.
• Where there is insufficient frequency-type
evidence, there is no objective
basis for a single case probability
assignment.
But perhaps there is a way around this.
Recall S, who is a member of two equally
narrow reference classes with probabilities x
and y.
But suppose we have further information
about S in particular, for example:
• S is from a long line of smokers, none of
whom have died of cancer before the
age of 80.
We can use this information…
Non-statistical information
This gives us reason to favour y over x as
the probability of S dying.
• Since S in particular seems unlikely to
die from smoking this year, we seem
justified in choosing the rec. swimmers’
class in assigning our probability.
While this procedure will be subjective and
imprecise, it is reasonable and can help us
assign propensities to single cases.
Summing up
Still, in simple set-ups it seems reasonable
to suppose that there is an objective
probability of an event occurring in a single
case, e.g.:
• Tossing a die, spinning a roulette wheel,
etc.
It seems, however, difficult to accept
objective propensities in more complex
cases.
Still, we might want to assign probabilities to
such cases. Clearly there is more work to
be done…
Alternate theories
I. State of the universe (Miller): the state of
the entire universe at T has a propensity to
bring about a single event E.
• Objection: such a state isn’t repeatable,
so how can we test for such properties?
II. Relevant conditions (Fetzer): the set of
conditions causally related to E has a
propensity to bring about single event E.
• Objection: it is not easy to determine all
causally relevant conditions.
III. Long run propensity (Gillies):
propensities are properties of repeatable
conditions that determine the frequencies of
E in the long run.
• Objection: no assignment for single
cases.
Homework
• Do the exercises at the end of chapters
18 & 19.
• Read the sections on inference vs.
behaviour again. This will come in
handy later.