Bernoulli's Theorem

PROBABILITY AS FREQUENCY
We have now taken a detailed look at both
belief-type theories of probability.
Belief-type:
- Logical probability
- Personal probability
Frequency-type:
- Limiting frequency
- Propensity theory
It is time to consider the frequency-type
approaches.
This week: limiting frequency.
Next week: limiting frequency (cont’d);
propensity theory.
Relative frequency – Von Mises
Von Mises:
- “just as the subject matter of geometry is the study of space phenomena, so probability theory deals with mass phenomena and repetitive events”.
- Probability deals with “problems in which either the same event repeats itself again and again, or a great number of uniform elements are involved at the same time”.
Possible topics:
- Repeated tosses of a coin
- Canadians born in 1980 who are male
- The collection of molecules in a gas
Ruled out:
- Probability that S dies this year
- Chances of the Leafs winning the 2008 Stanley Cup
Relative frequency – the basic idea
Let C represent a collective of some repetitive events or mass phenomena, e.g. tosses of a coin.
Let k represent some attribute of that collective, e.g. ‘heads’.
If in the first n members of C, k occurs m(k) times, then the relative frequency of k is m(k)/n.
Limiting frequency defined
Axiom of convergence:
If k is an arbitrary attribute of a collective, C, then lim(n→∞) m(k)/n exists.
From this, we define the probability of k in C as follows:
Pr(k/C) = lim(n→∞) m(k)/n
This is the Limiting Frequency theory of probability.
In other words, we assume that as n increases, the ratio m(k)/n approaches a fixed value.
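To see the axiom at work, here is a minimal Python sketch (the fair coin, the seed, and the checkpoints are illustrative choices, not part of von Mises's account): it tracks m(heads)/n as n grows, and the ratio settles toward 0.5.

    import random

    # Track the relative frequency of heads, m(k)/n, as n grows.
    random.seed(1)  # fixed seed so the run is reproducible

    heads = 0
    checkpoints = {10, 100, 1_000, 10_000, 100_000}
    for n in range(1, 100_001):
        heads += random.random() < 0.5      # True counts as 1 head
        if n in checkpoints:
            print(f"n = {n:>6}: m(heads)/n = {heads / n:.4f}")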
The “long run”
If you toss a coin once, the relative
frequency of heads will be either 1 or 0.
Suppose the first toss is heads, the second
tails, the third tails again, the fourth heads.
- The relative frequency of heads therefore goes from 1 to 0.5 to 0.33 to 0.5.
- We can see that the relative frequency changes quite drastically.
The idea behind the axiom of convergence
(and, therefore, the limiting frequency notion
of probability) is that the relative frequency
of heads will stabilize over the long run.
- The more we toss the coin, the less the ratio will fluctuate.
- Graphically:
Picturing the situation
[Figure: the ratio of heads plotted against the number of tosses; it fluctuates between 0 and 1.0 at first, then settles toward 0.5.]
Stability is a core idea of the frequency
theory—we shall come back to it in a
moment.
First, notice that probability theory is taken
to apply to collectives where the outcomes
are random. Let’s see why.
Randomness
Suppose a coin tossing machine is set up so that the results are: 1H-2T, 1H-2T, 1H-2T, 1H-2T, etc.
- Then lim(n→∞) m(k)/n = 1/3.
But this does not define a probability; it is, after all, mechanically determined.
So von Mises specifies:
- Admissible collectives cannot be subject to a gambling system, i.e. a rule that would allow you to win money in the long run.
- In the case of coin tossing, you can’t win at better than 50%.
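A small Python sketch of why this matters (the sequence length and the betting rule are illustrative): the machine's output has a perfectly well-defined limiting frequency, yet a trivial gambling system wins every bet, so it is not an admissible collective.

    # The machine's output H,T,T,H,T,T,... has limiting relative
    # frequency of heads 1/3, but it fails von Mises's randomness
    # requirement: a simple betting rule wins every time.
    sequence = ["H", "T", "T"] * 10_000     # 1H-2T repeated

    freq = sequence.count("H") / len(sequence)
    print(f"relative frequency of heads: {freq:.4f}")     # 0.3333

    # Gambling system: bet on heads at tosses 1, 4, 7, ... only.
    selected = sequence[::3]
    print(f"win rate of the system: {selected.count('H') / len(selected):.2f}")   # 1.00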
Okay, back to stability…
Stability
What is the relationship between stability,
relative frequency and probability?
Suppose you and five friends each toss a
fair coin 10 times and record the number of
heads. The results are as follows:
Person      # of Heads (k)   k/n (rel. freq.)
You         7                0.7
Friend 1    5                0.5
Friend 2    3                0.3
Friend 3    8                0.8
Friend 4    4                0.4
Friend 5    2                0.2
The (sample) average or mean number of
heads is:
(7 + 5 + 3 + 8 + 4 + 2)/6 = 4.83.
We expect the average number of heads to
be near 5 even though only one person got
exactly 5 heads.
Next consider the experiment discussed in
the book in which 250 students toss a fair
coin 20 times each, and then record the
number of heads they observe.
There is a wide variety of results but as we
would expect, they mostly cluster around 10
heads in 20 tosses:
[Figure: histogram of the number of heads in 20 tosses for the 250 students; the bars peak around 10 heads.]
Is there a way to measure how close the
results are to what we expect? Yes!
Sample Standard Deviation
Consider N trials, with observed results Xi and their mean X̄. The sample standard deviation is:
SD = √[(1/N) × Σ(Xi − X̄)²]
So, for example, the standard deviation in the first experiment is:
SD = √{(1/6) × [(7−4.83)² + (5−4.83)² + (3−4.83)² + (8−4.83)² + (4−4.83)² + (2−4.83)²]}
= √{(1/6) × [(2.17)² + (0.17)² + (−1.83)² + (3.17)² + (−0.83)² + (−2.83)²]}
= √{(1/6) × [4.7089 + 0.0289 + 3.3489 + 10.0489 + 0.6889 + 8.0089]}
= √{(1/6) × 26.8334} = √4.4722 ≈ 2.11
The typical result lies within about one SD of the mean: 4.83 ± 2.11.
The SD for the 2nd experiment is 2.9: most of the results are within 9.79 ± 2.9.
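These numbers are easy to verify in Python: statistics.pstdev uses the same 1/N formula defined above.

    import statistics

    heads = [7, 5, 3, 8, 4, 2]          # heads in 10 tosses, six people

    mean = statistics.mean(heads)       # 29/6 = 4.83
    sd = statistics.pstdev(heads)       # sqrt((1/N) * sum((x - mean)^2))
    print(f"mean = {mean:.2f}")         # 4.83
    print(f"SD   = {sd:.2f}")           # 2.11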
The smaller the standard deviation, the
more closely clustered are the results
around the mean.
The relative frequency of heads in the first
experiment is
k̄/n = 4.83/10 = 0.483 (close to ½).
In the second:
k̄/n = 9.79/20 ≈ 0.49 (closer to ½).
- In the long run, we expect the relative frequency with which an event, E, occurs to approach a value, p, which we equate with the probability of E (on the limiting frequency view of probability).
- We also expect the SD of the relative frequency to shrink.
Repeated, independent trials of some event,
E, with constant probability, p, are called
Bernoulli trials.
Imagine you have an ideal urn with 100
balls: 30 are green, 70 red.
You draw with replacement 10 times.
Since the probability of drawing green is 0.3, you would expect to get p × n = 0.3 × 10 = 3 green balls.
The most probable number of green balls, k₀, is roughly pn.
Theorem: for a large number of trials, the most probable relative frequency, k₀/n, is essentially just p.
More accurately:
p − (1−p)/n ≤ k₀/n ≤ p + p/n
(You can see that as n approaches infinity,
the relative frequency approaches p).
We can use the above formula to calculate
the most probable number of greens:
pn − (1−p) ≤ k₀ ≤ pn + p
Note: there can be more than one most
probable number!
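A quick Python check of these bounds for the urn example (binom_pmf is a helper written here, not a library function):

    from math import comb

    def binom_pmf(k, n, p):
        # Probability of exactly k successes in n independent trials.
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 10, 0.3                      # 10 draws, Pr(green) = 0.3
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    k0 = max(range(n + 1), key=lambda k: pmf[k])

    print(f"most probable number of greens: k0 = {k0}")           # 3
    print(f"bounds: {n*p - (1 - p):.1f} <= k0 <= {n*p + p:.1f}")  # 2.3 <= k0 <= 3.3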
What if, instead of a single run of n draws, we repeatedly perform runs of n draws?
If we average the number of green balls over many runs of n draws, we would expect pn green balls per run on average:
Theorem: the expected relative frequency is p.
This is similar to the frequency principle.
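A Python sketch of the theorem for the urn (the number of runs is an arbitrary choice): averaging the green counts over many runs of 10 draws gives roughly pn = 3, i.e. an expected relative frequency of p = 0.3.

    import random

    random.seed(2)
    n, p, runs = 10, 0.3, 100_000

    total_greens = sum(
        sum(random.random() < p for _ in range(n))  # greens in one run
        for _ in range(runs)
    )
    print(f"average greens per run:      {total_greens / runs:.3f}")        # ~3.0
    print(f"expected relative frequency: {total_greens / (runs * n):.3f}")  # ~0.3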
Of course, the larger the sample, the more
likely the relative frequency will be close to
p. How large? The larger the better. We
ask:
For some small margin of error, ε, what is the probability that the relative frequency of greens in n trials will be within ε of p?
This is called the accuracy probability:
Theorem: as the number of trials increases, the accuracy probability approaches 1.
Relative frequencies tend to converge on probabilities.
This theorem tells us that:
Pr(p − ε ≤ k/n ≤ p + ε) → 1 (as n → ∞)
Or:
Pr(pn − nε ≤ k ≤ pn + nε) → 1 (as n → ∞)
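A simulation makes the convergence visible (here ε = 0.05; the sample sizes and repetition count are arbitrary choices): the estimated accuracy probability climbs toward 1 as n grows.

    import random

    random.seed(3)
    p, eps, reps = 0.3, 0.05, 2_000

    for n in (10, 100, 1_000, 10_000):
        hits = sum(
            abs(sum(random.random() < p for _ in range(n)) / n - p) <= eps
            for _ in range(reps)
        )
        print(f"n = {n:>5}: Pr(|k/n - p| <= {eps}) ~ {hits / reps:.3f}")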
Bernoulli’s Theorem
1. The difference between p and the relative frequency can be made as small as you want if you increase the number of trials sufficiently.
2. The accuracy probability can be made as close as you want to 1, provided you perform enough trials.
This is the idea behind Bernoulli’s Theorem:
For any arbitrarily small error, ε, and any arbitrarily small difference, x, there is a number of trials, N, such that for any n > N:
Pr(p − ε ≤ k/n ≤ p + ε) > 1 − x
Example: You flip a coin seven times in a row. What is the expected number of heads?
pn = 0.5 × 7 = 3.5
But we know that you cannot get 3.5 heads. So, we think that the most probable number will be the integer or integers closest to pn: 3 or 4.
This is confirmed by calculating the most probable number of heads:
pn − (1−p) ≤ k₀ ≤ pn + p, i.e. 3 ≤ k₀ ≤ 4.
Note: there is no guarantee that the expected number will equal the most probable number.
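A short Python check confirms the tie between 3 and 4:

    from math import comb

    n = 7
    # With p = 0.5, p^k * (1-p)^(n-k) = 0.5^n for every k.
    pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]

    best = max(pmf)
    modes = [k for k, prob in enumerate(pmf) if prob == best]
    print(f"expected number of heads: {0.5 * n}")   # 3.5 (not attainable)
    print(f"most probable numbers:    {modes}")     # [3, 4]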
Example: The probability that a given car ride leads to an accident is 1/5000. In Toronto, there are 250,000 car rides per day. What is the most probable number of accidents today?
pn = 250,000 × (1/5000) = 50; since pn − (1−p) ≤ k₀ ≤ pn + p gives 49.0002 ≤ k₀ ≤ 50.0002, the most probable number is 50.
Normal Approximations
If you drew a line joining the tops of each bar in the histogram from the 250-student experiment above, it would look something like a “bell”:
[Figure: the same histogram with a smooth bell-shaped curve joining the tops of the bars, peaking near 10 heads.]
Bell curves are often called Normal curves
because they occur very often.
Two key properties of normal curves:
1. The theoretical mean. This is the value at which the curve peaks (μ).
2. The theoretical standard deviation. This measures the width of the curve (σ).
[Figure: a bell-shaped curve of human height, centred on the mean μ; σ measures the width of the curve.]
Example: Human height (for a given
gender) comes in indefinitely many values,
but they tend to cluster in a bell shape
around a mean.
Recall that the theoretical mean is the most frequently occurring outcome, hence: μ = pn.
What is the probability that a given height
measurement is close to that mean?
Normal fact I:
The probability that E is within σ of pn is about 0.68.
The probability that E is within 2σ of pn is about 0.95.
The probability that E is within 3σ of pn is about 0.99.
We won’t prove these. We will simply use
these results.
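For reference, these values follow from the Normal distribution's error function, which Python exposes as math.erf: for a Normal variable X, Pr(|X − μ| ≤ zσ) = erf(z/√2).

    from math import erf, sqrt

    # Pr(|X - mu| <= z * sigma) for a Normal distribution.
    for z in (1, 2, 3):
        print(f"within {z} SD of the mean: {erf(z / sqrt(2)):.4f}")
    # within 1 SD: 0.6827;  2 SD: 0.9545;  3 SD: 0.9973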
b(k,n,p) = the probability of getting k
occurrences of event E in n trials
where Pr(E) is p.
Abraham De Moivre showed that Normal
distributions approximate Bernoulli trials:
Normal fact II:
A binomial distribution, b(k,n,p), is approximated by a Normal distribution in which
μ = pn and
σ = √[p(1−p)n]
(b(k,n,p) is called a binomial distribution because it concerns Bernoulli trials with exactly two outcomes: heads/tails, green/red, etc.)
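A numerical glimpse of De Moivre's result, using the classroom experiment's parameters (n = 20, p = 0.5, an illustrative choice): the binomial probabilities and the Normal density with μ = pn and σ = √[p(1−p)n] nearly coincide.

    from math import comb, exp, pi, sqrt

    n, p = 20, 0.5
    mu = p * n                          # theoretical mean, pn = 10
    sigma = sqrt(p * (1 - p) * n)       # theoretical SD, about 2.24

    for k in (6, 8, 10, 12, 14):
        binom = comb(n, k) * p**k * (1 - p)**(n - k)
        normal = exp(-(k - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
        print(f"k = {k:>2}: binomial = {binom:.4f}, normal = {normal:.4f}")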
How long will it take the relative frequency
of some event to converge on the
probability of that event?
Consider a Bernoulli trial in which the
events have probability p. You perform n
trials and want to know how likely it is that
your result, k, will be close to pn:
Normal Fact III:
The probability that k is within σ of pn is about 0.68.
The probability that k is within 2σ of pn is about 0.95.
The probability that k is within 3σ of pn is about 0.99.
Example: Amazing Cars Inc. claims that
99% of its cars can drive 100,000 km before
requiring a tune up. The government tests
1100 cars by driving them for 100,000 km.
If Amazing is telling the truth, what is the
probability that the government will find that
between 1079 and 1099 of Amazing’s cars
don’t require a tune up?
pn = the mean = 0.99 × 1100 = 1089
SD = √[p(1−p)n] = √(0.99 × 0.01 × 1100) = √10.89 = 3.3
So, 3SD = 9.9 ≈ 10: the range 1079 to 1099 is pn ± 3SD, so by Normal Fact III the probability is about 0.99.
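As a sanity check, a Python sketch comparing the exact binomial probability over 1079–1099 with the 3-SD estimate:

    from math import comb, sqrt

    n, p = 1100, 0.99
    mu = p * n                          # 1089
    sigma = sqrt(p * (1 - p) * n)       # 3.3

    exact = sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(1079, 1100)      # k = 1079, ..., 1099
    )
    print(f"mean = {mu:.0f}, SD = {sigma:.2f}, 3*SD = {3 * sigma:.1f}")
    print(f"Pr(1079 <= k <= 1099) = {exact:.4f}")   # about 0.99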
Example: Let’s do question #6 from chapter
17:
Homework
- Questions from chapters 16 & 17.