PROBABILITY AS FREQUENCY

We have now taken a detailed look at both belief-type theories of probability:

  Belief-type             Frequency-type
  Logical probability     Limiting frequency
  Personal probability    Propensity theory

It is time to consider the frequency-type approaches. This week: limiting frequency. Next week: limiting frequency (cont'd); propensity theory.

1 Relative frequency – Von Mises

Von Mises: "just as the subject matter of geometry is the study of space phenomena, so probability theory deals with mass phenomena and repetitive events". Probability deals with "problems in which either the same event repeats itself again and again, or a great number of uniform elements are involved at the same time".

Possible topics:
- repeated tosses of a coin
- Canadians born in 1980 who are male
- the collection of molecules in a gas

Ruled out:
- the probability that S dies this year
- the chances of the Leafs winning the 2008 Stanley Cup

2 Relative frequency – the basic idea

Let C represent a collective of some repetitive events or mass phenomena, e.g. tosses of a coin. Let k represent some attribute of that collective, e.g. 'heads'. If in the first n members of C, k occurs m(k) times, then the relative frequency of k is m(k)/n.

3 Limiting frequency defined

Axiom of convergence: if k is an arbitrary attribute of a collective, C, then lim(n→∞) m(k)/n exists.

From this, we define the probability of k in C as follows:

  Pr(k/C) = lim(n→∞) m(k)/n

This is the limiting frequency theory of probability. In other words, we assume that as n increases, the ratio m(k)/n approaches a fixed value.

4 The "long run"

If you toss a coin once, the relative frequency of heads will be either 1 or 0. Suppose the first toss is heads, the second tails, the third tails again, and the fourth heads. The relative frequency of heads therefore goes from 1 to 0.5 to 0.33 to 0.5. We can see that the relative frequency changes quite drastically at first. The idea behind the axiom of convergence (and, therefore, the limiting frequency notion of probability) is that the relative frequency of heads will stabilize over the long run. The more we toss the coin, the less the ratio will fluctuate. Graphically:

5 Picturing the situation

[Figure: ratio of heads (y-axis, 0 to 1.0) plotted against number of tosses (x-axis); the curve fluctuates early on and levels off near 0.5.]

Stability is a core idea of the frequency theory; we shall come back to it in a moment. First, notice that probability theory is taken to apply to collectives where the outcomes are random. Let's see why.

6 Randomness

Suppose a coin-tossing machine is set up so that the results are one head followed by two tails, over and over: HTT, HTT, HTT, etc. Then lim(n→∞) m(H)/n = 1/3. But this does not define a probability: the sequence is, after all, mechanically determined. So von Mises specifies that admissible collectives cannot be subject to a gambling system, i.e. a rule that would allow you to win money in the long run. In the case of ordinary coin tossing, you can't win at better than 50%. Okay, back to stability…

7 Stability

What is the relationship between stability, relative frequency, and probability? Suppose you and five friends each toss a fair coin 10 times and record the number of heads. The results are as follows:

  Person     # of heads (k)   k/n (rel. freq.)
  You              7               0.7
  Friend 1         5               0.5
  Friend 2         3               0.3
  Friend 3         8               0.8
  Friend 4         4               0.4
  Friend 5         2               0.2

The (sample) average or mean number of heads is (7 + 5 + 3 + 8 + 4 + 2)/6 = 4.83. We expect the average number of heads to be near 5, even though only one person got exactly 5 heads.
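An aside: the stabilization posited by the axiom of convergence (slides 4 and 5) is easy to watch in simulation. Below is a minimal sketch in Python; it is not from the lecture, and the fair coin, the seed, and the checkpoints are illustrative assumptions. It tracks the relative frequency m(heads)/n within a single long run of tosses:

    import random

    # Simulate one long collective of coin tosses and report m(heads)/n
    # at a few checkpoints along the way.
    rng = random.Random(42)  # fixed seed, so the run is reproducible
    heads = 0
    for n in range(1, 100_001):
        heads += rng.random() < 0.5  # True counts as 1 head
        if n in (10, 100, 1_000, 10_000, 100_000):
            print(f"n = {n:>6}: m(heads)/n = {heads / n:.4f}")

The printed ratios fluctuate noticeably at small n and settle near 0.5 as n grows; on the limiting frequency view, the value they converge to just is the probability of heads.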
8

Next consider the experiment discussed in the book, in which 250 students each toss a fair coin 20 times and record the number of heads they observe. There is a wide variety of results but, as we would expect, they mostly cluster around 10 heads in 20 tosses:

[Figure: histogram of the 250 students' results; x-axis: number of heads in 20 tosses (0 to 20); y-axis: number of students (0 to 35); the bars peak around 10.]

Is there a way to measure how close the results are to what we expect? Yes!

9 Sample Standard Deviation

Consider N trials, with observed results Xᵢ and mean X̄. The sample standard deviation is:

  SD = √[(1/N) × Σ(Xᵢ − X̄)²]

So, for example, the standard deviation in the first experiment is:

  SD = √{(1/6) × [(7 − 4.83)² + (5 − 4.83)² + (3 − 4.83)² + (8 − 4.83)² + (4 − 4.83)² + (2 − 4.83)²]}
     = √{(1/6) × [(2.17)² + (0.17)² + (−1.83)² + (3.17)² + (−0.83)² + (−2.83)²]}
     = √{(1/6) × [4.7089 + 0.0289 + 3.3489 + 10.0489 + 0.6889 + 8.0089]}
     = √{(1/6) × 26.8334}
     = √4.4722
     ≈ 2.11

So the results spread out roughly one SD on either side of the mean: 4.83 ± 2.11. The SD for the second experiment is 2.9: most of those results are within 9.76 ± 2.9.

10

The smaller the standard deviation, the more closely clustered the results are around the mean. The relative frequency of heads in the first experiment is k_mean/n = 4.83/10 = 0.483 (close to ½). In the second: k_mean/n = 9.76/20 = 0.49 (closer to ½). In the long run, we expect the relative frequency with which an event, E, occurs to approach a value, p, which we equate with the probability of E (on the limiting frequency view of probability). We also expect the SD of the relative frequency to shrink.

11

Repeated, independent trials of some event, E, with constant probability, p, are called Bernoulli trials. Imagine you have an ideal urn with 100 balls: 30 green and 70 red. You draw with replacement 10 times. Since the probability of drawing green is 0.3, you would expect to get p × n = 0.3 × 10 = 3 green balls. The most probable number of green balls, k₀, is roughly pn.

12

Theorem: for a large number of trials, the most probable relative frequency, k₀/n, is essentially just p. More accurately:

  p − (1 − p)/n ≤ k₀/n ≤ p + p/n

(You can see that as n approaches infinity, the relative frequency approaches p.) Multiplying through by n, we can use the formula to calculate the most probable number of greens:

  pn − (1 − p) ≤ k₀ ≤ pn + p

For the urn, this gives 3 − 0.7 ≤ k₀ ≤ 3 + 0.3, so k₀ = 3. Note: there can be more than one most probable number!

13

What if instead of n draws, we repeatedly draw n balls? If we average the number of green balls over many trials of n draws, we would expect to get pn green balls on average. Theorem: the expected relative frequency is p. This is similar to the frequency principle.

14

Of course, the larger the sample, the more likely the relative frequency will be close to p. How large? The larger the better. We ask: for some small margin of error, ε, what is the probability that the relative frequency of greens in n trials will be within ε of p? This is called the accuracy probability.

Theorem: as the number of trials increases, the accuracy probability approaches 1. Relative frequencies tend to converge on probabilities. This theorem tells us that:

  Pr(p − ε ≤ k/n ≤ p + ε) → 1 (as n → ∞)

Or, equivalently:

  Pr(pn − εn ≤ k ≤ pn + εn) → 1 (as n → ∞)

15 Bernoulli's Theorem

1. The difference between p and the relative frequency can be made as small as you want, if you increase the number of trials sufficiently.
2. The accuracy probability can be made as close as you want to 1, provided you perform enough trials.

This is the idea behind Bernoulli's Theorem: for any arbitrarily small error, ε, and any arbitrarily small difference, x, there is a number of trials, N, such that for any n > N:

  Pr(p − ε ≤ k/n ≤ p + ε) > 1 − x
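Bernoulli's Theorem can also be watched numerically. Below is a minimal sketch in Python; it is not from the lecture, and while p = 0.3 is the urn's probability of green from slide 11, the margin ε = 0.05, the repetition count, and the seed are illustrative assumptions. It estimates the accuracy probability for increasing n by brute-force simulation:

    import random

    def accuracy_probability(n, p=0.3, eps=0.05, reps=1000, seed=7):
        """Estimate Pr(p - eps <= k/n <= p + eps) by simulating `reps`
        batches of n draws, each draw succeeding with probability p."""
        rng = random.Random(seed)
        hits = 0
        for _ in range(reps):
            k = sum(rng.random() < p for _ in range(n))  # greens in this batch
            hits += (p - eps <= k / n <= p + eps)
        return hits / reps

    for n in (10, 100, 1_000, 10_000):
        print(f"n = {n:>6}: estimated accuracy probability = {accuracy_probability(n):.3f}")

The estimates climb toward 1 as n grows, which is exactly the convergence the theorem describes.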
16

Example: You flip a coin seven times in a row. What is the expected number of heads? pn = 0.5 × 7 = 3.5. But we know that you cannot get 3.5 heads. So we think that the most probable number will be the integer or integers closest to pn. This is confirmed by calculating the most probable number of heads:

  pn − (1 − p) ≤ k₀ ≤ pn + p
  3.5 − 0.5 ≤ k₀ ≤ 3.5 + 0.5
  3 ≤ k₀ ≤ 4

So 3 heads and 4 heads are both most probable numbers. Note: there is no guarantee that the expected number will equal the most probable number.

17

Example: The probability that a given car ride leads to an accident is 1/5000. In Toronto, there are 250,000 car rides per day. What is the most probable number of accidents today? The expected number is pn = (1/5000) × 250,000 = 50, and since pn − (1 − p) ≤ k₀ ≤ pn + p pins k₀ between 49.0002 and 50.0002, the most probable number is 50.

18 Normal Approximations

If you drew a line joining the tops of each bar in the histogram of slide 8, it would look something like a "bell":

[Figure: the same histogram (number of heads, 0 to 20, against number of students, 0 to 35) with a bell-shaped curve traced along the tops of the bars.]

19

Bell curves are often called Normal curves because they occur very often. Two key properties of Normal curves:

1. The theoretical mean, μ. This is the point at which the curve peaks.
2. The theoretical standard deviation, σ. This measures the width of the curve.

[Figure: a bell-shaped Normal curve of human heights, peaked at the mean.]

20

Example: Human height (for a given gender) comes in indefinitely many values, but the values tend to cluster in a bell shape around a mean. Recall that the theoretical mean is the most numerous outcome, hence μ = pn. What is the probability that a given height measurement is close to that mean?

Normal fact I: The probability that E is within σ of pn is about 0.68. The probability that E is within 2σ of pn is about 0.95. The probability that E is within 3σ of pn is about 0.99. We won't prove these; we will simply use them.

21

Let b(k, n, p) = the probability of getting k occurrences of event E in n trials, where Pr(E) = p. This is a binomial distribution, i.e. it concerns Bernoulli trials in which there are two outcomes (heads/tails, green/red, etc.). Abraham De Moivre showed that Normal distributions approximate binomial distributions:

Normal fact II: A binomial distribution b(k, n, p) is approximated by a Normal distribution in which μ = pn and σ = √[(1 − p)pn].

22

How long will it take the relative frequency of some event to converge on the probability of that event? Consider Bernoulli trials in which the events have probability p. You perform n trials and want to know how likely it is that your result, k, will be close to pn.

Normal fact III: The probability that k is within σ of pn is about 0.68. The probability that k is within 2σ of pn is about 0.95. The probability that k is within 3σ of pn is about 0.99.

23

Example: Amazing Cars Inc. claims that 99% of its cars can drive 100,000 km before requiring a tune-up. The government tests 1100 cars by driving them for 100,000 km. If Amazing is telling the truth, what is the probability that the government will find that between 1079 and 1099 of Amazing's cars don't require a tune-up?

  pn = the mean = 0.99 × 1100 = 1089
  SD = √[(1 − p)pn] = √(0.01 × 0.99 × 1100) = √10.89 = 3.3
  So, 3SD = 9.9 ≈ 10

The interval from 1079 to 1099 is 1089 ± 10, i.e. about 3 SD on either side of the mean, so by Normal fact III the probability is about 0.99.

24

Example: Let's do question #6 from chapter 17.

25 Homework

Questions from chapters 16 & 17.
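Postscript: the Amazing Cars answer can be checked exactly. Below is a minimal sketch in Python; it is not part of the slides, and the helper name binomial_probability is my own. It sums the binomial distribution b(k, n, p) directly instead of using the Normal approximation:

    from math import comb, sqrt

    def binomial_probability(k, n, p):
        """b(k, n, p): probability of exactly k successes in n Bernoulli trials."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Amazing Cars example (slide 23): p = 0.99, n = 1100 cars tested.
    p, n = 0.99, 1100
    mu = p * n                      # theoretical mean: 1089.0
    sigma = sqrt((1 - p) * p * n)   # theoretical SD: 3.3

    # Exact probability that between 1079 and 1099 cars pass the test.
    exact = sum(binomial_probability(k, n, p) for k in range(1079, 1100))
    print(mu, sigma, exact)

The exact sum comes out close to 1, consistent with the "about 0.99" that Normal fact III gives; that agreement is why the Normal approximation is safe at a sample this large.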