Lecture05

Math 507, Lecture 5, Fall 2003, The Product Rule, Bayes’ Rule,
and Extended Independence (Sec. 2.12–2.14)
1) The Product Rule
a) Notation: In advanced probability courses it is common to denote the intersection
of events by concatenating them rather than writing an intersection symbol
between them. That is, if A and B are events we write AB instead of A ∩ B. Our
book does not use this notation, but I will use it in this lecture to simplify what I
have to type (i.e., writing the letters adjacent does not make me use the equation
editor!).
b) Example: Suppose a bag contains four green beads and seven red ones. If I pull
two out (without replacement), what is the probability that both are red?
i)
Intuitively my probability of getting red on the first draw is 7/11, my
probability of then getting another red on the second draw is 6/10=3/5, and
my total probability of getting both red is (7/11)*(3/5)=21/55, which is about
0.38.
ii)
Formally we can solve the problem by counting. There are C(11,2)=55 2-subsets of the 11 beads, and C(7,2)=21 of them contain 2 red beads. Assuming
all outcomes are equally likely, the probability of 2 reds is 21/55. We see that
our intuitive procedure yields the correct answer, but can we justify it
formally?
c) Theorem (The Product Rule for probabilities): Suppose A and B are events in a
sample space S with probability measure P. We already know that
P(B|A)=P(AB)/P(A). Clearing denominators we see P(AB)=P(A)P(B|A). In other
words, the probability that A and B both happen equals the probability that A
happens times the probability that B happens if A does.
d) Example Revisited: Again suppose we have four green and seven red beads in a
bag and we choose two of them without replacement. Let F be the event that the
first bead chosen is red and S be the event that the second bead chosen is red.
Then the event that both beads are red is FS. We can find its probability from the
product rule as follows: P(FS)=P(F)P(S|F)=(7/11)*(6/10)=21/55. Our intuition is
now justified.
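To double-check the arithmetic, here is a small Python sketch (my own illustration, not from the text) that compares the product-rule answer with a direct count over all 2-subsets of the bag:

    from itertools import combinations
    from fractions import Fraction

    # Bag with 7 red and 4 green beads (the labels are arbitrary).
    beads = ["R"] * 7 + ["G"] * 4

    # Direct count: the fraction of 2-subsets that are all red.
    pairs = list(combinations(range(len(beads)), 2))
    all_red = sum(1 for pair in pairs if all(beads[i] == "R" for i in pair))
    print(Fraction(all_red, len(pairs)))        # 21/55

    # Product rule: P(F)P(S|F) = (7/11)(6/10).
    print(Fraction(7, 11) * Fraction(6, 10))    # 21/55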
e) The product rule generalizes to more events. For instance, suppose A, B, C, and D
are events in a sample space S with probability measure P. Then
P(ABCD)=P(A)P(B|A)P(C|AB)P(D|ABC). This same pattern works with any
number of events.
f) Example Re-revisited: Now suppose we have four green and seven red beads and
we want the probability that four beads chosen (without replacement) from the
bag are all red. Let A, B, C, and D, be the events that the first, second, third, and
fourth beads are red, respectively. Then the probability that all four beads are red
is P(ABCD)=(7/11)*(6/10)*(5/9)*(4/8)=(7/11)*(3/5)*(5/9)*(1/2)=7/66, which is
about 0.11. Note that, for instance, the fourth factor 4/8=1/2 is the probability that
the fourth bead is red after we have already removed three red beads.
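The same kind of check works for the four-draw version; the sketch below (again my own illustration) compares the chained product with the counting answer C(7,4)/C(11,4):

    from fractions import Fraction
    from math import comb

    # Chained product rule: P(A)P(B|A)P(C|AB)P(D|ABC).
    chained = Fraction(7, 11) * Fraction(6, 10) * Fraction(5, 9) * Fraction(4, 8)

    # Counting: all-red 4-subsets over all 4-subsets of the 11 beads.
    counted = Fraction(comb(7, 4), comb(11, 4))

    print(chained, counted)   # 7/66 7/66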
2) Independence of Several Events
a) Independence of three events
i)
If A, B, and C are events in some sample space S with a probability
measure P, then we say the events are independent if each pair is independent
(i.e., P(AB)=P(A)P(B), P(AC)=P(A)P(C), and P(BC)=P(B)P(C)) and in
addition P(ABC)=P(A)P(B)P(C).
ii)
Intuitively this is what is needed to guarantee that all the events and their
complements are proportionally represented in each other in all relevant ways.
It is possible to satisfy some of these equations while failing to satisfy others.
iii)
Contrary Example: Roll a red die and a clear die. Let A be the event “the
red die is 1,” B be the event “the clear die is 3,” and C be the event “both dice
have the same number.” Then P(A)=P(B)=P(C)=6/36=1/6. Clearly
P(AB)=1/36=P(A)P(B), P(AC)=1/36=P(A)P(C), and P(BC)=1/36=P(B)P(C).
The event ABC, however, is empty (since you cannot have the red die 1, the
clear die 3, and both dice the same), so P(ABC)=0. But
P(A)P(B)P(C)=(1/6)(1/6)(1/6)=1/216. Thus A, B, and C are not independent.
(A short enumeration check of this example appears at the end of this section.)
iv)
Example: Roll red, green, and clear dice. Let A be the event “the red die
is 1,” B be the event “the green die is 1,” and C be the event “the clear die is
1.” It is easy to test that events A, B, and C are independent under the uniform
model. In particular P(ABC)=1/216=(1/6)(1/6)(1/6)=P(A)P(B)P(C).
b) Independence of more events. A collection of events is independent if, for every
subset of them, the probability of the intersection equals the product of the
probabilities of the events in that subset. For example, events A, B, C, and D are
independent if every pair is independent, every triple is independent, and
P(ABCD)=P(A)P(B)P(C)P(D). (Consider rolling four dice and getting all 1’s, or
flipping four coins and getting HTTH.)
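Here, as promised, is a short Python check of the contrary example in 2(a)(iii); it is my own sketch, enumerating the 36 equally likely outcomes for the red and clear dice and verifying that A, B, and C are pairwise independent even though P(ABC) differs from P(A)P(B)P(C):

    from fractions import Fraction
    from itertools import product

    # All 36 equally likely outcomes (red, clear).
    outcomes = list(product(range(1, 7), repeat=2))

    def prob(event):
        """Probability of an event given as a predicate on an outcome (red, clear)."""
        return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

    A = lambda o: o[0] == 1        # the red die is 1
    B = lambda o: o[1] == 3        # the clear die is 3
    C = lambda o: o[0] == o[1]     # both dice show the same number

    # Pairwise independence holds...
    assert prob(lambda o: A(o) and B(o)) == prob(A) * prob(B)
    assert prob(lambda o: A(o) and C(o)) == prob(A) * prob(C)
    assert prob(lambda o: B(o) and C(o)) == prob(B) * prob(C)

    # ...but the three-way condition fails, so A, B, C are not independent.
    print(prob(lambda o: A(o) and B(o) and C(o)))   # 0
    print(prob(A) * prob(B) * prob(C))              # 1/216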
3) Bayes’ Rule
a) Bayes’ Rule is a simple formula relating the values of P(A|B) and P(B|A). It has
several forms and interesting consequences.
b) Theorem 2.16 (Bayes’ Rule)
i)
Given events H and E in a sample space S with probability measure P it
holds that P(H|E)=P(H)P(E|H)/P(E).
ii)
Proof: By definition of conditional probability, P(H|E)=P(HE)/P(E). By
the product rule P(HE)=P(H)P(E|H). Therefore P(H|E)=P(H)P(E|H)/P(E).
iii)
Here we use H and E to stand for Hypothesis and Evidence. We
sometimes conceive of Bayes’ Rule as telling us how to revise the probability
of a hypothesis based on the observation of some particular piece of evidence.
iv)
Example: You are living in a dorm. One night the fire alarm goes off. How
likely is it that there is a fire? Here H is the event “there is a fire” and E is the
event “the fire alarm goes off.” You want to know P(H|E). You estimate that
all things being equal a fire is unlikely on a given night, setting P(H)=0.001
(roughly one fire in three years). You know that in a typical semester of about
100 days there are about 3 fire alarms (typically false alarms), so you estimate
P(E)=0.03. Finally you guess that it is nearly certain someone would set off
the alarm if there really were a fire, so you estimate P(E|H)=0.98. By Bayes’
Rule, P(H|E)=P(H)P(E|H)/P(E)=(0.001)(0.98)/(0.03) ≈ 0.033.
v)
Notes to the example: From one point of view the alarm is almost
meaningless. There is only a 3.3% chance of a fire. Why is it so low? Your
probabilities say that in 1000 days you should expect about 30 alarms but only one
fire. Thus your chance of a fire, given the alarm, is about 1/30. From another
point of view the alarm carries a lot of weight: The alarm raises the likelihood
of a fire thirtyfold, from 0.1% to 3.3% (that is 1/1000 to 1/30). This is how the
evidence (alarm) causes you to revise your estimate of the hypothesis (fire). In
any case the difference between P(H|E) and P(E|H) is large: 0.033 to 0.98, a
clear example of how these quantities need not be equal.
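The fire-alarm arithmetic is easy to script; this sketch (an illustration using the same rough estimates as above) applies the simple form of Bayes’ Rule:

    def bayes_simple(p_h, p_e_given_h, p_e):
        """Simple Bayes' Rule: P(H|E) = P(H) P(E|H) / P(E)."""
        return p_h * p_e_given_h / p_e

    p_fire = 0.001              # P(H): fire on a given night
    p_alarm = 0.03              # P(E): alarm on a given night
    p_alarm_given_fire = 0.98   # P(E|H)

    print(bayes_simple(p_fire, p_alarm_given_fire, p_alarm))   # about 0.033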
c) Theorem 2.17 (Bayes’ Rule, extended form)
i)
Under the same circumstances as before, writing H′ for the complement of H,
P(H|E) = P(H)P(E|H) / [P(H)P(E|H) + P(H′)P(E|H′)].
ii)
Proof: This is the same equation as in the simpler statement except that in
that case the denominator was simply P(E). It is easy to see that the sets EH
and EH′ partition E. A Venn diagram of the sample space S makes it clear:
the event E is partitioned into the section EH (the part of E inside H) and
the section EH′ (the part of E outside H). Since these sets are disjoint and
have union E, we have P(E) = P(EH) + P(EH′). By the product rule the
right-hand side becomes P(H)P(E|H) + P(H′)P(E|H′). So we have the same
equation as before, but with a fancy expansion of P(E) in the denominator.
iii)
Example (medical testing)
(1) A drug company has designed a test for a disease. Through extensive
testing, the company reports that the test produces only 1% false positive
results (i.e., a healthy person tests positive) and only 2% false negative
results (i.e., a person with the disease tests negative). Let P be the event
“someone tests positive,” N be the event “someone tests negative,” H be
the event “someone is healthy,” and D be the event “someone has the
disease.” Then the company is reporting P(P|H)=0.01 (or equivalently
P(N|H)=0.99) and P(N|D)=0.02 (or equivalently P(P|D)=0.98).
(2) Suppose you test positive for the disease. How likely is it that you in fact
have the disease? It is tempting but incorrect to say 98% since
P(P|D)=0.98. But you want to know P(D|P), which may be quite different.
It turns out you do not have enough information yet. Oddly enough you
must also know P(D), the prevalence of the disease in your population.
Why? As in the case of the fires and fire alarms, if the disease is rare, then
false positives will dominate true ones. If the disease is common, true
positives will dominate false ones.
(3) Suppose the disease is rare, occurring in only 0.05% of the population.
Then applying the second form of Bayes’ Rule we get
P(D|P) = P(D)P(P|D) / [P(D)P(P|D) + P(H)P(P|H)]
       = (0.0005)(0.98) / [(0.0005)(0.98) + (0.9995)(0.01)] ≈ 0.047 = 4.7%.
(4) Thus with a positive test your chance of having the disease is still just
below 5%. Why? Roughly speaking among 2000 randomly chosen people
you expect to have 20 positive tests but only 1 person with the disease.
Thus about 95% of your positives are false. Still this is a dramatic increase
in the probability of having the disease, from 0.05% to 4.7%, almost a
hundredfold increase.
(5) On the other hand, suppose the disease is common. Suppose 10% of
people in your “population” have the disease. Then
P(D|P) = P(D)P(P|D) / [P(D)P(P|D) + P(H)P(P|H)]
       = (0.1)(0.98) / [(0.1)(0.98) + (0.9)(0.01)] ≈ 0.916 = 91.6%.
(6) Now the positive test gives you over a 90% chance of having the disease.
How can this be? Now among 1000 people, 100 will have the disease and
98 of them will test positive. Similarly 900 will not have the disease and 9
of them will test positive. Now fewer than 10% of your positives are false.
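A small Python sketch (mine, not the book’s) of the extended form makes it easy to rerun the medical-testing example with different prevalences; the two calls below reproduce the 4.7% and 91.6% figures:

    def p_disease_given_positive(p_d, p_pos_given_d, p_pos_given_h):
        """Extended Bayes: P(D|+) = P(D)P(+|D) / [P(D)P(+|D) + P(H)P(+|H)], H = healthy."""
        numerator = p_d * p_pos_given_d
        return numerator / (numerator + (1 - p_d) * p_pos_given_h)

    p_pos_given_d = 0.98   # 2% false negatives
    p_pos_given_h = 0.01   # 1% false positives

    print(p_disease_given_positive(0.0005, p_pos_given_d, p_pos_given_h))  # about 0.047
    print(p_disease_given_positive(0.10, p_pos_given_d, p_pos_given_h))    # about 0.916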
d) Revision of Probabilities by Bayes
i)
Is this all nonsense? Do not the test results speak for themselves? Do they
not give the same information regardless of who takes them and how many
people have the disease? No! The subtle but crucial point is that Bayes’ Rule
lets us revise a probability based on new evidence. Revision implies the
existence of a prior probability to be revised. The new probability depends not
only on the evidence but also on our prior estimate of the probability. If we
already know an event is likely, then evidence in its favor may make it nearly
certain. The same evidence will be less compelling, however, if we know the
event to be inherently unlikely. The less likely the event, the stronger the
evidence must be in order to make it probable.
ii)
For example, suppose your neighbor goes to play in the U.S. Open chess
tournament (a big tournament drawing hundreds or thousands of players).
Later you overhear on the radio that someone from your town won the
tournament. Do you excitedly call up your neighbor to find out if he won? It
depends on your original estimate of his likelihood of winning. If he is a weak
player with no realistic chance of winning, then you probably call him up to
find out who did win. If he is one of the strongest players in the country, one
with serious chances of winning the tournament, you call up excitedly to see if
he is in fact the winner.
iii)
More simply, a positive test for AIDS is more worrisome for a
promiscuous drug addict than for a chaste person who avoids drugs. The same
evidence is more compelling when the hypothesis it supports already has a
higher prior probability.
e) Theorem 2.19 (Bayes’ Rule for multiple hypotheses)
i)
Suppose you have n events (hypotheses) H1, …, Hn that partition the
sample space S and you have an event (evidence) E in S. Then for i between 1
and n,
P(Hi|E) = P(Hi)P(E|Hi) / [P(H1)P(E|H1) + … + P(Hn)P(E|Hn)].
ii)
Proof: This is identical to the proof of the extended form of Bayes’ Rule
except that you partition E into n blocks by dividing it among the H’s. Then
you use this partition to expand the denominator P(E) in the simple form of
Bayes’ Rule.
iii)
Example: At a college, 40% of the students are freshmen, 25%
sophomores, 20% juniors, and 15% seniors (These are your H’s, partitioning
the whole population of students). Suppose 5% of freshmen are on the honor
roll (the honor roll is your evidence E, so this is the percentage of E among
freshmen), 10% of sophomores, 18% of juniors, and 22% of seniors. What
percentage of honor roll students are sophomores? By Bayes’ Rule the probability is
P(sophomore|E) = (0.25)(0.1) / [(0.4)(0.05) + (0.25)(0.1) + (0.2)(0.18) + (0.15)(0.22)] ≈ 0.219 = 21.9%.
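The same computation with more terms in the denominator handles any number of hypotheses; this sketch (my own) reproduces the honor-roll example:

    def bayes_multiple(priors, likelihoods, i):
        """P(H_i|E) when H_1, ..., H_n partition S, given P(H_j) and P(E|H_j)."""
        total = sum(p * l for p, l in zip(priors, likelihoods))
        return priors[i] * likelihoods[i] / total

    priors = [0.40, 0.25, 0.20, 0.15]        # freshmen, sophomores, juniors, seniors
    likelihoods = [0.05, 0.10, 0.18, 0.22]   # P(honor roll | class)

    print(bayes_multiple(priors, likelihoods, 1))   # about 0.219 for sophomores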
f) Note: Theorem 2.18 and remark 2 on p. 50 are interesting but not essential to our
work. In particular you may want to look at remark 3 before Theorem 2.18 if you
have ever wondered what “odds” are and how they relate to probability.
4) Statistical Interlude
a) Recall that one of the crucial jobs of statistics is to help people summarize data to
look for the real information and patterns to be found in it. A couple lectures ago
we talked about using a single number to summarize a collection of numbers. In
particular we looked at various numbers that might do this job (mean, median,
mode), and we looked at data resistant to summary (e.g., phone books). The key is
to make sure the summary is clear and accurate, just as you would in writing a
summary of a book.
b) God, in His wisdom, has designed the human eye to see certain relationships
easily. Statisticians take advantage of this by drawing pictures of data that the eye
can easily scan.
c) One of the most powerful such tools is the histogram.
i)
To make a histogram one must have a collection of “interval” or “ratio”
data. Typically this means that the data represent counts or measurements.
ii)
One takes the range of the data and partitions it into a convenient number
of subranges, usually of equal width. Then one counts the number of data
values falling into each of these ranges (frequency classes or bins). Next one
draws a horizontal axis containing the range of data values and a vertical axis
labeled frequency with values going as high as that of the most frequent
(modal) class. Finally one draws over every subrange of values a bar whose
height (or area) represents the frequency with which data falls into that
subrange. Usually the bars touch without overlap to indicate that they partition
the data values (i.e., nothing falls in the cracks).
iii)
The histogram quickly yields much information about the data, indicating
what values are typical, what values are uncommon, and whether any values
are extremely different from the main body of values. This information is
obvious to the eye which easily compares heights and areas of rectangles.
iv)
To convert a frequency histogram into a relative frequency histogram,
simply divide every frequency by the total number of data values. Now the
heights of the bars represent the proportion of data falling into each subrange.
This is useful in making histograms from populations of different sizes
visually comparable.
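As a concrete illustration of these steps, here is a short Python sketch (using numpy and matplotlib, which the lecture does not assume, and made-up sample data) that bins a data set and draws frequency and relative-frequency histograms:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up measurements standing in for real interval/ratio data.
    data = [72.5, 68.1, 75.0, 79.3, 80.2, 55.4, 62.0, 77.7, 81.0, 49.8]

    # Partition the range into equal-width subranges (bins) and count frequencies.
    edges = np.arange(45, 90, 5)
    freq, _ = np.histogram(data, bins=edges)
    rel_freq = freq / len(data)              # relative frequencies sum to 1

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(edges[:-1], freq, width=5, align="edge", edgecolor="black")
    ax1.set(xlabel="Value", ylabel="Frequency")
    ax2.bar(edges[:-1], rel_freq, width=5, align="edge", edgecolor="black")
    ax2.set(xlabel="Value", ylabel="Relative frequency")
    plt.show()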
v)
Example: The website
http://www.amstat.org/publications/jse/jse_data_archive.html has a file
poverty.dat (described in poverty.txt) with information from 1990 on birth,
death, and infant mortality rates, and life expectancies of men and women in
97 countries around the world. Let us begin by looking at the data on
women’s life expectancies.
(1) One challenge in constructing a histogram is to decide how many bars to
have or, equivalently, to decide how wide to make the subranges. Too few
bars make the histogram coarse and blocky, obscuring patterns and
crucial details. Too many bars produce a snaggletoothed histogram with
lots of little gaps and bars of height one. The ideal normally lies
somewhere in between. Here are attempts that divide women’s life
expectancy into 1-year, 10-year, and 2.5-year subranges. The last seems to
work best.
(a) One year:
[Histogram: Life Expectancy of Women in 97 Countries; frequency vs. female life expectancy (age), 1-year bins from 40 to 82.]
(b) Ten year:
[Histogram: Life Expectancy of Women in 97 Countries; frequency vs. female life expectancy (age), 10-year bins from 40 to 90.]
(c) 2.5 year:
[Histogram: Life Expectancy of Women in 97 Countries; frequency vs. female life expectancy (age), 2.5-year bins from 37.5 to 82.5.]
(2) All three histograms are valid and somewhat useful, but the last seems to
give us the clearest sense of the “shape” of the data. What do we see? All
the values fall between about 42 and 83. Values in the upper 70’s and low
80’s are most common. Higher values drop off very quickly. Perhaps this
suggests a large number of nations share about equally in the benefits of
modern medicine and public health policy. On the other hand the
frequencies decrease rather uniformly from 75 down to 42, with perhaps a
small gap in the 60’s. It would be interesting to look for anything the
nations with life expectancies below 60 have in common.
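The effect of bin width is easy to experiment with; in the sketch below (my own, with random stand-in data where the 97 values from poverty.dat would go), the three panels match the 1-year, 10-year, and 2.5-year attempts above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in data: replace with the 97 female life expectancies from poverty.dat.
    rng = np.random.default_rng(0)
    female_le = rng.uniform(42, 83, size=97)

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, width in zip(axes, [1, 10, 2.5]):
        edges = np.arange(37.5, 85 + width, width)
        ax.hist(female_le, bins=edges, edgecolor="black")
        ax.set(title=f"{width}-year bins", xlabel="Age", ylabel="Frequency")
    plt.show()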
vi)
It is also interesting to compare the life expectancy data for men and
women. This involves trying to show two histograms at once in such a way as
to make their data visually comparable. One of the most straightforward
approaches is simply to make separate histograms for men and women using
the same horizontal and vertical scales and then display the two histograms,
one above the other. Here are some other approaches (a short plotting sketch follows the displays):
(1) Double bar graph:
[Double bar graph: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, with separate bars for male and female life expectancy.]
(2) Stacked histogram:
[Stacked histogram: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, with male and female counts stacked in each bin.]
(3) Three-dimensional bar graphs side-by-side:
[Three-dimensional bar graph: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, with male and female bars in separate rows.]
(4) A different approach using line graphs:
[Line graph: Life Expectancy in 97 Countries by Sex; frequency vs. age in 2.5-year bins from 37.5 to 82.5, one line for male and one for female life expectancy.]
(5) The point of all these displays of the same data is to show that one can
take many creative approaches to studying the same data and
communicating what is there.
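One way to build comparable displays like those above is to give both data sets the same bin edges; this sketch (my own illustration, with stand-in data in place of the poverty.dat values) draws an overlaid version and a double bar version:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in data: replace with the male and female life expectancies from poverty.dat.
    rng = np.random.default_rng(1)
    male_le = rng.uniform(38, 78, size=97)
    female_le = male_le + rng.uniform(0, 9, size=97)

    edges = np.arange(37.5, 90, 2.5)   # shared bins keep the plots comparable

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    ax1.hist(male_le, bins=edges, alpha=0.5, label="Male")        # overlaid histograms
    ax1.hist(female_le, bins=edges, alpha=0.5, label="Female")
    ax1.set(xlabel="Age", ylabel="Frequency")
    ax1.legend()
    ax2.hist([male_le, female_le], bins=edges, label=["Male", "Female"])  # double bars
    ax2.set(xlabel="Age", ylabel="Frequency")
    ax2.legend()
    plt.show()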
vii)
One interesting apparent pattern is a tendency of women to live longer
than men. How can we investigate this further?
(1) A natural approach is to construct a histogram of the difference between
women’s and men’s life expectancies in each country. That produces the
following histogram.
[Histogram: Female Minus Male Life Expectancy in 97 Countries; frequency vs. difference in years, 1-year bins from -3 to 9.]
(2) From this histogram we see one nation standing out in having men live
about two years longer than women. It would be interesting to find out
what the story is there. Otherwise women live longer than men in almost
every nation, with the bulk of the differences being between three and
seven years. This has some interesting implications in terms of how
societies will deal with large numbers of widows. It also suggests (not
proves) that death from childbirth is far less common worldwide than it
once was.
(3) A different approach to this matter calls for a scatterplot, a graph in which
the average life expectancies for men and women in each country are
treated as an ordered pair and graphed. This is a powerful tool for spotting
relationships between sets of data, as the following graph shows.
[Scatterplot: Male vs. Female Life Expectancy in 97 Countries, 1990; female life expectancy (30 to 90) vs. male life expectancy (30 to 80), one point per country, with the line y = x drawn for reference.]
(4) Here we see a strong, positive “linear relationship”. The data roughly fit
along a line (not the line shown). Generally if men live longer in a
country, then so do women. The line is not a regression line. It is simply
the line y=x, showing those countries in which men and women have
equal life expectancies. Dots above the line indicate nations where women
live longer than men. Only a very few dots lie below the line, confirming
the pattern we saw in the previous graph.
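The scatterplot with the y = x reference line takes only a few lines of Python; again the data here are stand-ins for the paired values in poverty.dat:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in data: replace with the paired (male, female) life expectancies.
    rng = np.random.default_rng(2)
    male_le = rng.uniform(38, 78, size=97)
    female_le = male_le + rng.uniform(-2, 9, size=97)

    plt.scatter(male_le, female_le, s=15)
    plt.plot([30, 90], [30, 90], color="gray")   # the line y = x (not a regression line)
    plt.xlabel("Male Life Expectancy")
    plt.ylabel("Female Life Expectancy")
    plt.title("Male vs. Female Life Expectancy in 97 Countries, 1990")
    plt.show()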