Lecture05

Math 507, Lecture 5, Fall 2003, The Product Rule, Bayes’ Rule,
and Extended Independence (Sec. 2.12–2.14)
1) The Product Rule
a) Notation: In advanced probability courses it is common to denote the intersection
of events by concatenating them rather than writing an intersection symbol
between them. That is, if A and B are events we write AB instead of A ∩ B. Our
book does not use this notation, but I will use it in this lecture to simplify what I
have to type (i.e., writing the letters adjacent does not make me use the equation
editor!).
b) Example: Suppose a bag contains four green beads and seven red ones. If I pull
two out (without replacement), what is the probability that both are red?
i)
Intuitively my probability of getting red on the first draw is 7/11, my
probability of then getting another red on the second draw is 6/10=3/5, and
my total probability of getting both red is (7/11)*(3/5)=21/55, which is about
0.38.
ii)
Formally we can solve the problem by counting. There are C(11,2)=55 2-subsets of the 11 beads, and C(7,2)=21 of them contain 2 red beads. Assuming
all outcomes are equally likely, the probability of 2 reds is 21/55. We see that
our intuitive procedure yields the correct answer, but can we justify it
formally?
c) Theorem (The Product Rule for probabilities): Suppose A and B are events in a
sample space S with probability measure P. We already know that
P(B|A)=P(AB)/P(A). Clearing denominators we see P(AB)=P(A)P(B|A). In other
words, the probability that A and B both happen equals the probability that A
happens times the probability that B happens if A does.
d) Example Revisited: Again suppose we have four green and seven red beads in a
bag and we choose two of them without replacement. Let F be the event that the
first bead chosen is red and S be the event that the second bead chosen is red.
Then the event that both beads are red is FS. We can find its probability from the
product rule as follows: P(FS)=P(F)P(S|F)=(7/11)*(6/10)=21/55. Our intuition is
now justified.
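To double-check the arithmetic, here is a small Python sketch (my own illustration, not from the text) that compares the product-rule answer with a direct count over all 2-subsets of the bag:

    from itertools import combinations
    from fractions import Fraction

    # Bag with 7 red and 4 green beads (the labels are arbitrary).
    beads = ["R"] * 7 + ["G"] * 4

    # Direct count: the fraction of 2-subsets that are all red.
    pairs = list(combinations(range(len(beads)), 2))
    all_red = sum(1 for pair in pairs if all(beads[i] == "R" for i in pair))
    print(Fraction(all_red, len(pairs)))        # 21/55

    # Product rule: P(F)P(S|F) = (7/11)(6/10).
    print(Fraction(7, 11) * Fraction(6, 10))    # 21/55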
e) The product rule generalizes to more events. For instance, suppose A, B, C, and D
are events in a sample space S with probability measure P. Then
P(ABCD)=P(A)P(B|A)P(C|AB)P(D|ABC). This same pattern works with any
number of events.
f) Example Re-revisited: Now suppose we have four green and seven red beads and
we want the probability that four beads chosen (without replacement) from the
bag are all red. Let A, B, C, and D, be the events that the first, second, third, and
fourth beads are red, respectively. Then the probability that all four beads are red
is P(ABCD)=(7/11)*(6/10)*(5/9)*(4/8)=(7/11)*(3/5)*(5/9)*(1/2)=7/66, which is
about 0.11. Note that, for instance, the fourth factor 4/8=1/2 is the probability that
the fourth bead is red after we have already removed three red beads.
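The same kind of check works for the four-draw version; the sketch below (again my own illustration) compares the chained product with the counting answer C(7,4)/C(11,4):

    from fractions import Fraction
    from math import comb

    # Chained product rule: P(A)P(B|A)P(C|AB)P(D|ABC).
    chained = Fraction(7, 11) * Fraction(6, 10) * Fraction(5, 9) * Fraction(4, 8)

    # Counting: all-red 4-subsets over all 4-subsets of the 11 beads.
    counted = Fraction(comb(7, 4), comb(11, 4))

    print(chained, counted)   # 7/66 7/66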
2) Independence of Several Events
a) Independence of three events
i)
If A, B, and C are events in some sample space S with a probability
measure P, then we say the events are independent if each pair is independent
(i.e., P(AB)=P(A)P(B), P(AC)=P(A)P(C), and P(BC)=P(B)P(C)) and in
addition P(ABC)=P(A)P(B)P(C).
ii)
Intuitively this is what is needed to guarantee that all the events and their
complements are proportionally represented in each other in all relevant ways.
It is possible to satisfy some of these equations while failing to satisfy others.
iii)
Contrary Example: Roll a red die and a clear die. Let A be the event “the
red die is 1,” B be the event “the clear die is 3,” and C be the event “both dice
have the same number.” Then P(A)=P(B)=P(C)=6/36=1/6. Clearly
P(AB)=1/36=P(A)P(B), P(AC)=1/36=P(A)P(C), and P(BC)=1/36=P(B)P(C).
The event ABC, however, is empty (since you cannot have the red die 1, the
clear die 3, and both dice the same), so P(ABC)=0. But
P(A)P(B)P(C)=(1/6)(1/6)(1/6)=1/216. Thus A, B, and C are not independent.
(A short enumeration check of this example appears at the end of this section.)
iv)
Example: Roll red, green, and clear dice. Let A be the event “the red die
is 1,” B be the event “the green die is 1,” and C be the event “the clear die is
1.” It is easy to test that events A, B, and C are independent under the uniform
model. In particular P(ABC)=1/216=(1/6)(1/6)(1/6)=P(A)P(B)P(C).
b) Independence of more events. A collection of events is independent if, for every
subset of them, the probability of the intersection equals the product of the
probabilities of the events in that subset. For example, events A, B, C, and D are
independent if every pair is independent, every triple is independent, and
P(ABCD)=P(A)P(B)P(C)P(D). (Consider rolling four dice and getting all 1’s, or
flipping four coins and getting HTTH.)
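Here, as promised, is a short Python check of the contrary example in 2(a)(iii); it is my own sketch, enumerating the 36 equally likely outcomes for the red and clear dice and verifying that A, B, and C are pairwise independent even though P(ABC) differs from P(A)P(B)P(C):

    from fractions import Fraction
    from itertools import product

    # All 36 equally likely outcomes (red, clear).
    outcomes = list(product(range(1, 7), repeat=2))

    def prob(event):
        """Probability of an event given as a predicate on an outcome (red, clear)."""
        return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

    A = lambda o: o[0] == 1        # the red die is 1
    B = lambda o: o[1] == 3        # the clear die is 3
    C = lambda o: o[0] == o[1]     # both dice show the same number

    # Pairwise independence holds...
    assert prob(lambda o: A(o) and B(o)) == prob(A) * prob(B)
    assert prob(lambda o: A(o) and C(o)) == prob(A) * prob(C)
    assert prob(lambda o: B(o) and C(o)) == prob(B) * prob(C)

    # ...but the three-way condition fails, so A, B, C are not independent.
    print(prob(lambda o: A(o) and B(o) and C(o)))   # 0
    print(prob(A) * prob(B) * prob(C))              # 1/216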
3) Bayes’ Rule
a) Bayes’ Rule is a simple formula relating the values of P(A|B) and P(B|A). It has
several forms and interesting consequences.
b) Theorem 2.16 (Bayes’ Rule)
i)
Given events H and E in a sample space S with probability measure P it
holds that P(H|E)=P(H)P(E|H)/P(E).
ii)
Proof: By definition of conditional probability, P(H|E)=P(HE)/P(E). By
the product rule P(HE)=P(H)P(E|H). Therefore P(H|E)=P(H)P(E|H)/P(E).
iii)
Here we use H and E to stand for Hypothesis and Evidence. We
sometimes conceive of Bayes’ Rule as telling us how to revise the probability
of a hypothesis based on the observation of some particular piece of evidence.
iv)
Example: You are living in a dorm. One night the fire alarm goes off. How
likely is it that there is a fire? Here H is the event “there is a fire” and E is the
event “the fire alarm goes off.” You want to know P(H|E). You estimate that
all things being equal a fire is unlikely on a given night, setting P(H)=0.001
(roughly one fire in three years). You know that in a typical semester of about
100 days there are about 3 fire alarms (typically false alarms), so you estimate
P(E)=0.03. Finally you guess that it is nearly certain someone would set off
the alarm if there really were a fire, so you estimate P(E|H)=0.98. By Bayes’
Rule, P(H|E)=P(H)P(E|H)/P(E)=(0.001)(0.98)/(0.03) ≈ 0.033.
v)
Notes to the example: From one point of view the alarm is almost
meaningless. There is only a 3.3% chance of a fire. Why is it so low? Your
probabilities say that in 1000 days you should expect about 30 alarms but only one
fire. Thus your chance of a fire, given the alarm, is about 1/30. From another
point of view the alarm carries a lot of weight: The alarm raises the likelihood
of a fire thirtyfold, from 0.1% to 3.3% (that is 1/1000 to 1/30). This is how the
evidence (alarm) causes you to revise your estimate of the hypothesis (fire). In
any case the difference between P(H|E) and P(E|H) is large: 0.033 to 0.98, a
clear example of how these quantities need not be equal.
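The fire-alarm arithmetic is easy to script; this sketch (an illustration using the same rough estimates as above) applies the simple form of Bayes’ Rule:

    def bayes_simple(p_h, p_e_given_h, p_e):
        """Simple Bayes' Rule: P(H|E) = P(H) P(E|H) / P(E)."""
        return p_h * p_e_given_h / p_e

    p_fire = 0.001              # P(H): fire on a given night
    p_alarm = 0.03              # P(E): alarm on a given night
    p_alarm_given_fire = 0.98   # P(E|H)

    print(bayes_simple(p_fire, p_alarm_given_fire, p_alarm))   # about 0.033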
c) Theorem 2.17 (Bayes’ Rule, extended form)
i)
Under the same circumstances as before, writing H′ for the complement of H,
P(H|E) = P(H)P(E|H) / [P(H)P(E|H) + P(H′)P(E|H′)].
ii)
Proof: This is the same equation as in the simpler statement except that in
that case the denominator was simply P(E). It is easy to see that the sets EH
and EH′ partition E. A Venn diagram of the sample space S makes it clear:
the event E is partitioned into the section EH (the part of E inside H) and
the section EH′ (the part of E outside H). Since these sets are disjoint and
have union E, we have P(E) = P(EH) + P(EH′). By the product rule the
right-hand side becomes P(H)P(E|H) + P(H′)P(E|H′). So we have the same
equation as before, but with a fancy expansion of P(E) in the denominator.
iii)
Example (medical testing)
(1) A drug company has designed a test for a disease. Through extensive
testing, the company reports that the test produces only 1% false positive
results (i.e., a healthy person tests positive) and only 2% false negative
results (i.e., a person with the disease tests negative). Let P be the event
“someone tests positive,” N be the event “someone tests negative,” H be
the event “someone is healthy,” and D be the event “someone has the
disease.” Then the company is reporting P(P|H)=0.01 (or equivalently
P(N|H)=0.99) and P(N|D)=0.02 (or equivalently P(P|D)=0.98).
(2) Suppose you test positive for the disease. How likely is it that you in fact
have the disease? It is tempting but incorrect to say 98% since
P(P|D)=0.98. But you want to know P(D|P), which may be quite different.
It turns out you do not have enough information yet. Oddly enough you
must also know P(D), the prevalence of the disease in your population.
Why? As in the case of the fires and fire alarms, if the disease is rare, then
false positives will dominate true ones. If the disease is common, true
positives will dominate false ones.
(3) Suppose the disease is rare, occurring in only 0.05% of the population.
Then applying the second form of Bayes’ Rule we get
P(D|P) = P(D)P(P|D) / [P(D)P(P|D) + P(H)P(P|H)]
       = (0.0005)(0.98) / [(0.0005)(0.98) + (0.9995)(0.01)] ≈ 0.047 = 4.7%.
(4) Thus with a positive test your chance of having the disease is still just
below 5%. Why? Roughly speaking among 2000 randomly chosen people
you expect to have 20 positive tests but only 1 person with the disease.
Thus about 95% of your positives are false. Still this is a dramatic increase
in the probability of having the disease, from 0.05% to 4.7%, almost a
hundredfold increase.
(5) On the other hand, suppose the disease is common. Suppose 10% of
people in your “population” have the disease. Then
P(D|P) = P(D)P(P|D) / [P(D)P(P|D) + P(H)P(P|H)]
       = (0.1)(0.98) / [(0.1)(0.98) + (0.9)(0.01)] ≈ 0.916 = 91.6%.
(6) Now the positive test gives you over a 90% chance of having the disease.
How can this be? Now among 1000 people, 100 will have the disease and
98 of them will test positive. Similarly 900 will not have the disease and 9
of them will test positive. Now fewer than 10% of your positives are false.
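A small Python sketch (mine, not the book’s) of the extended form makes it easy to rerun the medical-testing example with different prevalences; the two calls below reproduce the 4.7% and 91.6% figures:

    def p_disease_given_positive(p_d, p_pos_given_d, p_pos_given_h):
        """Extended Bayes: P(D|+) = P(D)P(+|D) / [P(D)P(+|D) + P(H)P(+|H)], H = healthy."""
        numerator = p_d * p_pos_given_d
        return numerator / (numerator + (1 - p_d) * p_pos_given_h)

    p_pos_given_d = 0.98   # 2% false negatives
    p_pos_given_h = 0.01   # 1% false positives

    print(p_disease_given_positive(0.0005, p_pos_given_d, p_pos_given_h))  # about 0.047
    print(p_disease_given_positive(0.10, p_pos_given_d, p_pos_given_h))    # about 0.916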
d) Revision of Probabilities by Bayes
i)
Is this all nonsense? Do not the test results speak for themselves? Do they
not give the same information regardless of who takes them and how many
people have the disease? No! The subtle but crucial point is that Bayes’ Rule
lets us revise a probability based on new evidence. Revision implies the
existence of a prior probability to be revised. The new probability depends not
only on the evidence but also on our prior estimate of the probability. If we
already know an event is likely, then evidence in its favor may make it nearly
certain. The same evidence will be less compelling, however, if we know the
event to be inherently unlikely. The less likely the event, the stronger the
evidence must be in order to make it probable.
ii)
For example, suppose your neighbor goes to play in the U.S. Open chess
tournament (a big tournament drawing hundreds or thousands of players).
Later you overhear on the radio that someone from your town won the
tournament. Do you excitedly call up your neighbor to find out if he won? It
depends on your original estimate of his likelihood of winning. If he is a weak
player with no realistic chance of winning, then you probably call him up to
find out who did win. If he is one of the strongest players in the country, one
with serious chances of winning the tournament, you call up excitedly to see if
he is in fact the winner.
iii)
More simply, a positive test for AIDS is more worrisome for a
promiscuous drug addict than for a chaste person who avoids drugs. The same
evidence is more compelling when the hypothesis it supports already has a
higher prior probability.
e) Theorem 2.19 (Bayes’ Rule for multiple hypotheses)
i)
Suppose you have n events (hypotheses) H1, …, Hn that partition the
sample space S and you have an event (evidence) E in S. Then for i between 1
and n,
P(Hi|E) = P(Hi)P(E|Hi) / [P(H1)P(E|H1) + … + P(Hn)P(E|Hn)].
ii)
Proof: This is identical to the proof of the extended form of Bayes’ Rule
except that you partition E into n blocks by dividing it among the H’s. Then
you use this partition to expand the denominator P(E) in the simple form of
Bayes’ Rule.
iii)
Example: At a college, 40% of the students are freshmen, 25%
sophomores, 20% juniors, and 15% seniors (These are your H’s, partitioning
the whole population of students). Suppose 5% of freshmen are on the honor
roll (the honor roll is your evidence E, so this is the percentage of E among
freshmen), 10% of sophomores, 18% of juniors, and 22% of seniors. What
percentage of honor roll students are sophomores? By Bayes’ Rule the probability is
P(sophomore|E) = (0.25)(0.1) / [(0.4)(0.05) + (0.25)(0.1) + (0.2)(0.18) + (0.15)(0.22)] ≈ 0.219 = 21.9%.
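The same computation with more terms in the denominator handles any number of hypotheses; this sketch (my own) reproduces the honor-roll example:

    def bayes_multiple(priors, likelihoods, i):
        """P(H_i|E) when H_1, ..., H_n partition S, given P(H_j) and P(E|H_j)."""
        total = sum(p * l for p, l in zip(priors, likelihoods))
        return priors[i] * likelihoods[i] / total

    priors = [0.40, 0.25, 0.20, 0.15]        # freshmen, sophomores, juniors, seniors
    likelihoods = [0.05, 0.10, 0.18, 0.22]   # P(honor roll | class)

    print(bayes_multiple(priors, likelihoods, 1))   # about 0.219 for sophomores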
f) Note: Theorem 2.18 and remark 2 on p. 50 are interesting but not essential to our
work. In particular you may want to look at remark 3 before Theorem 2.18 if you
have ever wondered what “odds” are and how they relate to probability.
4) Statistical Interlude
a) Recall that one of the crucial jobs of statistics is to help people summarize data to
look for the real information and patterns to be found in it. A couple lectures ago
we talked about using a single number to summarize a collection of numbers. In
particular we looked at various numbers that might do this job (mean, median,
mode), and we looked at data resistant to summary (e.g., phone books). The key is
to make sure the summary is clear and accurate, just as you would in writing a
summary of a book.
b) God, in His wisdom, has designed the human eye to see certain relationships
easily. Statisticians take advantage of this by drawing pictures of data that the eye
can easily scan.
c) One of the most powerful such tools is the histogram.
i)
To make a histogram one must have a collection of “interval” or “ratio”
data. Typically this means that the data represent counts or measurements.
ii)
One takes the range of the data and partitions it into a convenient number
of subranges, usually of equal width. Then one counts the number of data
values falling into each of these ranges (frequency classes or bins). Next one
draws a horizontal axis containing the range of data values and a vertical axis
labeled frequency with values going as high as that of the most frequent
(modal) class. Finally one draws over every subrange of values a bar whose
height (or area) represents the frequency with which data falls into that
subrange. Usually the bars touch without overlap to indicate that they partition
the data values (i.e., nothing falls in the cracks).
iii)
The histogram quickly yields much information about the data, indicating
what values are typical, what values are uncommon, and whether any values
are extremely different from the main body of values. This information is
obvious to the eye which easily compares heights and areas of rectangles.
iv)
To convert a frequency histogram into a relative frequency histogram,
simply divide every frequency by the total number of data values. Now the
heights of the bars represent the proportion of data falling into each subrange.
This is useful in making histograms from populations of different sizes
visually comparable.
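As a concrete illustration of these steps, here is a short Python sketch (using numpy and matplotlib, which the lecture does not assume, and made-up sample data) that bins a data set and draws frequency and relative-frequency histograms:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up measurements standing in for real interval/ratio data.
    data = [72.5, 68.1, 75.0, 79.3, 80.2, 55.4, 62.0, 77.7, 81.0, 49.8]

    # Partition the range into equal-width subranges (bins) and count frequencies.
    edges = np.arange(45, 90, 5)
    freq, _ = np.histogram(data, bins=edges)
    rel_freq = freq / len(data)              # relative frequencies sum to 1

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(edges[:-1], freq, width=5, align="edge", edgecolor="black")
    ax1.set(xlabel="Value", ylabel="Frequency")
    ax2.bar(edges[:-1], rel_freq, width=5, align="edge", edgecolor="black")
    ax2.set(xlabel="Value", ylabel="Relative frequency")
    plt.show()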
v)
Example: The website
http://www.amstat.org/publications/jse/jse_data_archive.html has a file
poverty.dat (described in poverty.txt) with information from 1990 on birth,
death, and infant mortality rates, and life expectancies of men and women in
97 countries around the world. Let us begin by looking at the data on
women’s life expectancies.
(1) One challenge in constructing a histogram is to decide how many bars to
have or, equivalently, to decide how wide to make the subranges. Too few
bars make the histogram coarse and blocky, obscuring patterns and
crucial details. Too many bars produce a snaggletoothed histogram with
lots of little gaps and bars of height one. The ideal normally lies
somewhere in between. Here are attempts that divide women’s life
expectancy into 1-year, 10-year, and 2.5-year subranges. The last seems to
work best.
(a) One year:
[Histogram: Life Expectancy of Women in 97 Countries; frequency vs. female life expectancy (age), 1-year bins from 40 to 82.]
(b) Ten year:
[Histogram: Life Expectancy of Women in 97 Countries; frequency vs. female life expectancy (age), 10-year bins from 40 to 90.]
(c) 2.5 year:
[Histogram: Life Expectancy of Women in 97 Countries; frequency vs. female life expectancy (age), 2.5-year bins from 37.5 to 82.5.]
(2) All three histograms are valid and somewhat useful, but the last seems to
give us the clearest sense of the “shape” of the data. What do we see? All
the values fall between about 42 and 83. Values in the upper 70’s and low
80’s are most common. Higher values drop off very quickly. Perhaps this
suggests a large number of nations share about equally in the benefits of
modern medicine and public health policy. On the other hand the
frequencies decrease rather uniformly from 75 down to 42, with perhaps a
small gap in the 60’s. It would be interesting to look for anything the
nations with life expectancies below 60 have in common.
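The effect of bin width is easy to experiment with; in the sketch below (my own, with random stand-in data where the 97 values from poverty.dat would go), the three panels match the 1-year, 10-year, and 2.5-year attempts above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in data: replace with the 97 female life expectancies from poverty.dat.
    rng = np.random.default_rng(0)
    female_le = rng.uniform(42, 83, size=97)

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, width in zip(axes, [1, 10, 2.5]):
        edges = np.arange(37.5, 85 + width, width)
        ax.hist(female_le, bins=edges, edgecolor="black")
        ax.set(title=f"{width}-year bins", xlabel="Age", ylabel="Frequency")
    plt.show()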
vi)
It is also interesting to compare the life expectancy data for men and
women. This involves trying to show two histograms at once in such a way as
to make their data visually comparable. One of the most straightforward
approaches is simply to make separate histograms for men and women using
the same horizontal and vertical scales and then display the two histograms,
one above the other. Here are some other approaches (a short plotting sketch follows the displays):
(1) Double bar graph:
[Double bar graph: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, with separate bars for male and female life expectancy.]
(2) Stacked histogram:
[Stacked histogram: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, with male and female counts stacked in each bin.]
(3) Three-dimensional bar graphs side-by-side:
[Three-dimensional bar graph: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, with male and female bars in separate rows.]
(4) A different approach using line graphs:
[Line graph: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, one line for male and one for female life expectancy.]
(5) The point of all these displays of the same data is to show that one can
take many creative approaches to studying the same data and
communicating what is there.
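One way to build comparable displays like those above is to give both data sets the same bin edges; this sketch (my own illustration, with stand-in data in place of the poverty.dat values) draws an overlaid version and a double bar version:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in data: replace with the male and female life expectancies from poverty.dat.
    rng = np.random.default_rng(1)
    male_le = rng.uniform(38, 78, size=97)
    female_le = male_le + rng.uniform(0, 9, size=97)

    edges = np.arange(37.5, 90, 2.5)   # shared bins keep the plots comparable

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    ax1.hist(male_le, bins=edges, alpha=0.5, label="Male")        # overlaid histograms
    ax1.hist(female_le, bins=edges, alpha=0.5, label="Female")
    ax1.set(xlabel="Age", ylabel="Frequency")
    ax1.legend()
    ax2.hist([male_le, female_le], bins=edges, label=["Male", "Female"])  # double bars
    ax2.set(xlabel="Age", ylabel="Frequency")
    ax2.legend()
    plt.show()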
vii)
One interesting apparent pattern is a tendency of women to live longer
than men. How can we investigate this further?
(1) A natural approach is to construct a histogram of the difference between
women’s and men’s life expectancies in each country. That produces the
following histogram.
[Histogram: Female Minus Male Life Expectancy in 97 Countries; frequency vs. difference in years, 1-year bins from -3 to 9.]
(2) From this histogram we see one nation standing out in having men live
about two years longer than women. It would be interesting to find out
what the story is there. Otherwise women live longer than men in almost
every nation, with the bulk of the differences being between three and
seven years. This has some interesting implications in terms of how
societies will deal with large numbers of widows. It also suggests (not
proves) that death from childbirth is far less common worldwide than it
once was.
(3) A different approach to this matter calls for a scatterplot, a graph in which
the average life expectancies for men and women in each country are
treated as an ordered pair and graphed. This is a powerful tool for spotting
relationships between sets of data, as the following graph shows.
[Scatterplot: Male vs. Female Life Expectancy in 97 Countries, 1990; female life expectancy (30 to 90) vs. male life expectancy (30 to 80), one point per country, with the line y = x drawn for reference.]
(4) Here we see a strong, positive “linear relationship”. The data roughly fit
along a line (not the line shown). Generally if men live longer in a
country, then so do women. The line is not a regression line. It is simply
the line y=x, showing those countries in which men and women have
equal life expectancies. Dots above the line indicate nations where women
live longer than men. Only a very few dots lie below the line, confirming
the pattern we saw in the previous graph.
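The scatterplot with the y = x reference line takes only a few lines of Python; again the data here are stand-ins for the paired values in poverty.dat:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in data: replace with the paired (male, female) life expectancies.
    rng = np.random.default_rng(2)
    male_le = rng.uniform(38, 78, size=97)
    female_le = male_le + rng.uniform(-2, 9, size=97)

    plt.scatter(male_le, female_le, s=15)
    plt.plot([30, 90], [30, 90], color="gray")   # the line y = x (not a regression line)
    plt.xlabel("Male Life Expectancy")
    plt.ylabel("Female Life Expectancy")
    plt.title("Male vs. Female Life Expectancy in 97 Countries, 1990")
    plt.show()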