Chapter 4 Probability: Studying Randomness

Randomness and Probability
• Random: Process where the outcome of a particular trial is not known in advance, although a distribution of outcomes may be known for a long series of repetitions
• Probability: The proportion of times a particular outcome occurs in a long series of repetitions of a random process
• Independence: When the outcome of one trial does not affect the probabilities of outcomes of subsequent trials

Probability Models
• Probability Model:
– Listing of possible outcomes
– Probability corresponding to each outcome
• Sample Space (S): Set of all possible outcomes of a random process
• Event: Outcome or set of outcomes of a random process (subset of S)
• Venn Diagram: Graphic description of a sample space and events

Rules of Probability
• The probability of an event A, denoted P(A), must lie between 0 and 1: $0 \le P(A) \le 1$
• For the sample space S, P(S) = 1
• Disjoint events have no common outcomes. For 2 disjoint events A and B, P(A or B) = P(A) + P(B)
• The complement of an event A is the event that A does not occur, denoted $A^c$. $P(A) + P(A^c) = 1$
• When the sample space is finite, the probability of any event A is the sum of the probabilities of the individual outcomes that make up the event

Assigning Probabilities to Events
• Assign probabilities to each individual outcome and add up the probabilities of all outcomes comprising the event
• When each outcome is equally likely, count the number of outcomes corresponding to the event and divide by the total number of outcomes
• Multiplication Rule: A and B are independent events if knowledge that one occurred does not affect the probability that the other occurred.
If A and B are independent, then P(A and B) = P(A)P(B)
• Multiplication rule extends to any finite number of independent events

Example - Casualties at Gettysburg
• Results from Battle of Gettysburg

Counts
Army    Killed   Wounded   Captured/Missing   Safe Survival   Total
North   3155     14525     5365               72324           95369
South   2592     12709     12227              49972           77500

Proportions
Army    Killed   Wounded   Captured/Missing   Safe Survival   Total
North   0.0331   0.1523    0.0563             0.7584          1.0000
South   0.0334   0.1640    0.1578             0.6448          1.0000

Killed, Wounded, and Captured/Missing are considered casualties. What is the probability that a randomly selected Northern soldier was a casualty? A Southern soldier? Obtain the distribution across armies.

Random Variables
• Random Variable (RV): Variable that takes on the value of a numeric outcome of a random process
• Discrete RV: Can take on a finite (or countably infinite) set of possible outcomes
• Probability Distribution: List of values a random variable can take on and their corresponding probabilities
– Individual probabilities must lie between 0 and 1
– Probabilities sum to 1
• Notation:
– Random variable: X
– Values X can take on: $x_1, x_2, \ldots, x_k$
– Probabilities: $P(X=x_1) = p_1, \ldots, P(X=x_k) = p_k$

Example: Wars Begun by Year (1482-1939)
• Distribution of numbers of wars started by year
• X = # of wars started in a randomly selected year
• Levels: $x_1=0$, $x_2=1$, $x_3=2$, $x_4=3$, $x_5=4$
• Probability Distribution:

#Wars         0        1        2        3        4
Probability   0.5284   0.3231   0.1070   0.0328   0.0087

[Histogram: number of years (frequency) by number of wars started]

Masters Golf Tournament 1st Round Scores

[Histogram: frequency by score, scores 63 to 90]

Score   Frequency   Probability
63      1           0.000288
64      2           0.000576
65      6           0.001728
66      16          0.004608
67      46          0.013249
68      67          0.019297
69      151         0.043491
70      238         0.068548
71      337         0.097062
72      428         0.123272
73      467         0.134505
74      498         0.143433
75      397         0.114343
76      293         0.084389
77      203         0.058468
78      125         0.036002
79      78          0.022465
80      50          0.014401
81      28          0.008065
82      17          0.004896
83      7           0.002016
84      7           0.002016
85      4           0.001152
86      3           0.000864
87      1           0.000288
88      2           0.000576

Continuous Random Variables
• Variable can take on any value along a continuous range of numbers (interval)
• Probability distribution is described by a smooth density curve
• Probabilities of ranges of values for X correspond to areas under the density curve
– Curve must lie on or above the horizontal axis
– Total area under the curve is 1
• Special case: Normal distributions

Means and Variances of Random Variables
• Mean: Long-run average a random variable will take on (also the balance point of the probability distribution)
• Expected Value is another term for the mean, although we do not really expect that a realization of X will necessarily be close to its mean. Notation: E(X)
• Mean of a discrete random variable:
$E(X) = \mu_X = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k = \sum x_i p_i$

Examples - Wars & Masters Golf

#Wars (x)   Probability (p)   x*p
0           0.5284            0.0000
1           0.3231            0.3231
2           0.1070            0.2140
3           0.0328            0.0983
4           0.0087            0.0349
Sum         1.0000            0.6703

$\mu = 0.67$

Score (x)   prob (p)   x*p
63          0.000288   0.0181
64          0.000576   0.0369
65          0.001728   0.1123
66          0.004608   0.3041
67          0.013249   0.8877
68          0.019297   1.3122
69          0.043491   3.0009
70          0.068548   4.7984
71          0.097062   6.8914
72          0.123272   8.8756
73          0.134505   9.8188
74          0.143433   10.6141
75          0.114343   8.5757
76          0.084389   6.4136
77          0.058468   4.5020
78          0.036002   2.8082
79          0.022465   1.7748
80          0.014401   1.1521
81          0.008065   0.6532
82          0.004896   0.4015
83          0.002016   0.1673
84          0.002016   0.1694
85          0.001152   0.0979
86          0.000864   0.0743
87          0.000288   0.0251
88          0.000576   0.0507
Sum         1          73.54

$\mu = 73.54$

Statistical Estimation/Law of Large Numbers
• In practice we won’t know $\mu$ but will want to estimate it
• We can select a sample of individuals and observe the sample mean: $\bar{x}$
• By selecting a large enough sample we can be very confident that our sample mean will be arbitrarily close to the true parameter value
• Margin of error measures the upper bound (with a high level of confidence) on our sampling error.
It decreases as the sample size increases

Rules for Means
• Linear Transformations: a + bX (where a and b are constants): $E(a+bX) = \mu_{a+bX} = a + b\mu_X$
• Sums of random variables: X + Y (where X and Y are random variables): $E(X+Y) = \mu_{X+Y} = \mu_X + \mu_Y$
• Linear Functions of Random Variables: $E(a_1X_1 + \cdots + a_nX_n) = a_1\mu_1 + \cdots + a_n\mu_n$ where $E(X_i) = \mu_i$

Example: Masters Golf Tournament
• Mean by Round (Note ordering): $\mu_1 = 73.54$, $\mu_2 = 73.07$, $\mu_3 = 73.76$, $\mu_4 = 73.91$
• Mean Score per hole (18) for round 1: $E((1/18)X_1) = (1/18)\mu_1 = (1/18)(73.54) = 4.09$
• Mean Score versus par (72) for round 1: $E(X_1 - 72) = \mu_{X_1-72} = \mu_1 - 72 = 73.54 - 72 = +1.54$ (1.54 over par)
• Mean Difference (Round 1 - Round 4): $E(X_1 - X_4) = \mu_1 - \mu_4 = 73.54 - 73.91 = -0.37$
• Mean Total Score: $E(X_1+X_2+X_3+X_4) = \mu_1 + \mu_2 + \mu_3 + \mu_4 = 73.54 + 73.07 + 73.76 + 73.91 = 294.28$ (6.28 over par)

Variance of a Random Variable
• Variance: Measure of the spread of the probability distribution. Average squared deviation from the mean
• Standard Deviation: (Positive) Square Root of Variance

$V(X) = \sigma_X^2 = (x_1 - \mu_X)^2 p_1 + \cdots + (x_k - \mu_X)^2 p_k = \sum (x_i - \mu_X)^2 p_i = \sum x_i^2 p_i - \mu_X^2 = E(X^2) - \mu_X^2$
(the last form is useful when X takes on integer values)

Rules for Variances (X, Y RVs; a, b constants)

$V(a + bX) = \sigma_{a+bX}^2 = b^2 \sigma_X^2$
$V(aX + bY) = \sigma_{aX+bY}^2 = a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab\rho\sigma_X\sigma_Y$
where $\rho$ is the correlation between X and Y

Special Cases:
• X and Y are independent (outcome of one does not alter the distribution of the other): $\rho = 0$, last term drops out
• a=b=1 and $\rho = 0$: $V(X+Y) = \sigma_X^2 + \sigma_Y^2$
• a=1, b=-1 and $\rho = 0$: $V(X-Y) = \sigma_X^2 + \sigma_Y^2$
• a=b=1 and $\rho \ne 0$: $V(X+Y) = \sigma_X^2 + \sigma_Y^2 + 2\rho\sigma_X\sigma_Y$
• a=1, b=-1 and $\rho \ne 0$: $V(X-Y) = \sigma_X^2 + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y$

Wars & Masters (Round 1) Golf Scores

Wars (x)   Prob (p)   (x-μ)     (x-μ)^2   ((x-μ)^2)*p
0          0.5284     -0.6703   0.4493    0.2374
1          0.3231     0.3297    0.1087    0.0351
2          0.1070     1.3297    1.7681    0.1892
3          0.0328     2.3297    5.4275    0.1780
4          0.0087     3.3297    11.0869   0.0965
Sum        1.0000                         0.7362

$\sigma^2 = 0.7362$, $\sigma = 0.8580$

Score (x)   prob (p)   (x-μ)^2    ((x-μ)^2)*p
63          0.000288   111.0916   0.031996
64          0.000576   91.0116    0.052426
65          0.001728   72.9316    0.126034
66          0.004608   56.8516    0.261989
67          0.013249   42.7716    0.566674
68          0.019297   30.6916    0.592263
69          0.043491   20.6116    0.896415
70          0.068548   12.5316    0.859021
71          0.097062   6.4516     0.626207
72          0.123272   2.3716     0.292352
73          0.134505   0.2916     0.039222
74          0.143433   0.2116     0.03035
75          0.114343   2.1316     0.243734
76          0.084389   6.0516     0.510691
77          0.058468   11.9716    0.699952
78          0.036002   19.8916    0.716143
79          0.022465   29.8116    0.669731
80          0.014401   41.7316    0.600974
81          0.008065   55.6516    0.448803
82          0.004896   71.5716    0.350437
83          0.002016   89.4916    0.180427
84          0.002016   109.4116   0.220588
85          0.001152   131.3316   0.151304
86          0.000864   155.2516   0.134146
87          0.000288   181.1716   0.052181
88          0.000576   209.0916   0.120444
Sum         1                     9.474503

$\sigma^2 = 9.47$, $\sigma = 3.08$

Masters Scores (Rounds 1 & 4)
$\mu_1 = 73.54$, $\mu_4 = 73.91$, $\sigma_1^2 = 9.48$, $\sigma_4^2 = 11.95$, $\rho = 0.24$
• Variance of Round 1 scores vs Par: $V(X_1 - 72) = \sigma_1^2 = 9.48$
• Variance of Sum and Difference of Round 1 and Round 4 Scores:

Sum $(X_1 + X_4)$:
$V(X_1 + X_4) = \sigma_1^2 + \sigma_4^2 + 2\rho\sigma_1\sigma_4 = 9.48 + 11.95 + 2(0.24)\sqrt{(9.48)(11.95)} = 9.48 + 11.95 + 5.11 = 26.54$
$\sigma_{X_1+X_4} = \sqrt{26.54} = 5.15$

Difference $(X_1 - X_4)$:
$V(X_1 - X_4) = \sigma_1^2 + \sigma_4^2 - 2\rho\sigma_1\sigma_4 = 9.48 + 11.95 - 2(0.24)\sqrt{(9.48)(11.95)} = 9.48 + 11.95 - 5.11 = 16.32$
$\sigma_{X_1-X_4} = \sqrt{16.32} = 4.04$

General Rules of Probability
• Union of a set of events: Event that any (at least one) of the events occurs
• Disjoint events: Events that share no common sample points. If A, B, and C are pairwise disjoint, the probability of their union is: P(A)+P(B)+P(C)
• Intersection of two (or more) events: The event that both (all) events occur.
• Addition Rule: P(A or B) = P(A)+P(B)-P(A and B)
• Conditional Probability: The probability B occurs given A has occurred: P(B|A)
• Multiplication Rule (generalized to conditional prob): P(A and B)=P(A)P(B|A)=P(B)P(A|B)

Conditional Probability
• Generally of interest when one event precedes another in time (but this is not necessary)
• When P(A) > 0 (otherwise it is trivial):
$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$
and, likewise, when P(B) > 0:
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
• Contingency Table: Table that cross-classifies individuals or probabilities across 2 or more event classifications
• Tree Diagram: Graphical description of cross-classification of 2 or more events

John Snow London Cholera Death Study
• 2 Water Companies (Let D be the event of death):
– Southwark & Vauxhall (S): 264913 customers, 3702 deaths
– Lambeth (L): 171363 customers, 407 deaths
– Overall: 436276 customers, 4109 deaths

$P(D) = \frac{4109}{436276} = .0094$ (94 per 10000 people)
$P(D \mid S) = \frac{3702}{264913} = .0140$ (140 per 10000 people)
$P(D \mid L) = \frac{407}{171363} = .0024$ (24 per 10000 people)

Note that the probability of death is almost 6 times higher for S&V customers than for Lambeth customers (this was important in showing how cholera spread)

John Snow London Cholera Death Study

                 Cholera Death
Water Company    Yes            No               Total
S&V              3702 (.0085)   261211 (.5987)   264913 (.6072)
Lambeth          407 (.0009)    170956 (.3919)   171363 (.3928)
Total            4109 (.0094)   432167 (.9906)   436276 (1.0000)

Contingency Table with joint probabilities (in body of table) and marginal probabilities (on edge of table)

John Snow London Cholera Death Study

Water User
  Company S&V (.6072)
    Death D (.0140): joint probability .0085
    Death D^c (.9860): joint probability .5987
  Company L (.3928)
    Death D (.0024): joint probability .0009
    Death D^c (.9976): joint probability .3919

Tree Diagram obtaining joint probabilities by multiplication rule

Example: Florida lotto
• You select 6 distinct numbers from 1 to 53 (no replacement)
• State randomly draws 6 numbers from 1 to 53
• Probability you match all 6 numbers:
– First state draw: P(match 1st) = 6/53
– Given you match 1st, you have 5 left and state has 52 left: P(match 2nd given matched 1st) = 5/52
– Process continues: P(match 3rd given 1&2) = 4/51
– P(match 4th given 1&2&3) = 3/50
– P(match 5th given 1&2&3&4) = 2/49
– P(match 6th given 1&2&3&4&5) = 1/48

Multiplication rule: $P(\text{match all}) = \frac{6}{53} \cdot \frac{5}{52} \cdot \frac{4}{51} \cdot \frac{3}{50} \cdot \frac{2}{49} \cdot \frac{1}{48} = \frac{1}{22{,}957{,}480}$

Bayes’s Rule - Updating Probabilities
• Let $A_1, \ldots, A_k$ be a set of events that partition a sample space (mutually exclusive and exhaustive), such that:
– each set has known $P(A_i) > 0$ (each event can occur)
– for any 2 distinct sets $A_i$ and $A_j$, $P(A_i \text{ and } A_j) = 0$ (events are disjoint)
– $P(A_1) + \cdots + P(A_k) = 1$ (each outcome belongs to one of the events)
• If C is an event such that
– 0 < P(C) < 1 (C can occur, but will not necessarily occur)
– We know the probability C will occur given each event $A_i$: $P(C \mid A_i)$
• Then we can compute the probability of $A_i$ given C occurred:
$P(A_i \mid C) = \frac{P(A_i \text{ and } C)}{P(C)} = \frac{P(C \mid A_i) P(A_i)}{P(C \mid A_1) P(A_1) + \cdots + P(C \mid A_k) P(A_k)}$

Northern Army at Gettysburg

Regiment       Label   Initial #   Casualties   P(Ai)    P(C|Ai)   P(C|Ai)*P(Ai)   P(Ai|C)
I Corps        A1      10022       6059         0.1051   0.6046    0.0635          0.2630
II Corps       A2      12884       4369         0.1351   0.3391    0.0458          0.1896
III Corps      A3      11924       4211         0.1250   0.3532    0.0442          0.1828
V Corps        A4      12509       2187         0.1312   0.1748    0.0229          0.0949
VI Corps       A5      15555       242          0.1631   0.0156    0.0025          0.0105
XI Corps       A6      9839        3801         0.1032   0.3863    0.0399          0.1650
XII Corps      A7      8589        1082         0.0901   0.1260    0.0113          0.0470
Cav Corps      A8      11501       852          0.1206   0.0741    0.0089          0.0370
Arty Reserve   A9      2546        242          0.0267   0.0951    0.0025          0.0105
Sum                    95369       23045        1                  0.2416 = P(C)   1.0002

• Regiments: partition of soldiers ($A_1, \ldots, A_9$). Casualty: event C
• P(Ai) = (size of regiment) / (total soldiers) = (Column 3)/95369
• P(C|Ai) = (# casualties) / (regiment size) = (Col 4)/(Col 3)
• P(C|Ai) P(Ai) = P(Ai and C) = (Col 5)*(Col 6)
• P(C) = sum(Col 7)
• P(Ai|C) = P(Ai and C) / P(C) = (Col 7)/.2416

Independent Events
• Two events A and B are independent if P(B|A)=P(B) and P(A|B)=P(A); otherwise they are dependent (not independent).
• Cholera Example: P(D) = .0094, P(D|S) = .0140, P(D|L) = .0024. Not independent (which firm would you prefer?)
• Union Army Example: P(C) = .2416, P(C|A1) = .6046, P(C|A5) = .0156. Not independent: casualty risk is almost 40 times higher for A1 than for A5
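
The Bayes's rule columns for the Northern army, and the comparison of P(C|A1) with P(C|A5) used in the Union Army example, can be recomputed from the unit sizes and casualty counts alone. The following is a minimal Python sketch and not part of the original notes; the variable names (`prior`, `likelihood`, `joint`, `posterior`, `risk_ratio`) are illustrative assumptions, and the counts are copied from the Northern Army table.

```python
# Unit sizes and casualty counts copied from the Northern Army at Gettysburg table.
units = ["I Corps", "II Corps", "III Corps", "V Corps", "VI Corps",
         "XI Corps", "XII Corps", "Cav Corps", "Arty Reserve"]
initial = [10022, 12884, 11924, 12509, 15555, 9839, 8589, 11501, 2546]
casualties = [6059, 4369, 4211, 2187, 242, 3801, 1082, 852, 242]

total = sum(initial)  # total soldiers across the partition A1..A9

# P(Ai): share of the army belonging to each unit.
prior = [n / total for n in initial]

# P(C|Ai): casualty rate within each unit.
likelihood = [c / n for c, n in zip(casualties, initial)]

# P(Ai and C) = P(C|Ai) * P(Ai), by the multiplication rule.
joint = [lik * pri for lik, pri in zip(likelihood, prior)]

# P(C): sum of the joint probabilities over the partition.
p_c = sum(joint)

# P(Ai|C) = P(Ai and C) / P(C), by Bayes's rule.
posterior = [j / p_c for j in joint]

print(f"P(C) = {p_c:.4f}")
for name, pri, lik, post in zip(units, prior, likelihood, posterior):
    print(f"{name:13s} P(Ai)={pri:.4f}  P(C|Ai)={lik:.4f}  P(Ai|C)={post:.4f}")

# Independence check: if C were independent of the unit, every P(C|Ai)
# would equal P(C). Compare A1 (index 0) with A5 (index 4).
risk_ratio = likelihood[0] / likelihood[4]
print(f"P(C|A1) / P(C|A5) = {risk_ratio:.1f}")
```

Because the posterior is computed from unrounded values, it sums to exactly 1 (up to floating-point error), whereas the table's rounded entries sum to 1.0002.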