Probabilistic Models

Example #1
A production lot of 10,000 parts is tested for defects. It is expected that a defective part occurs once in every 1,000 parts. A sample of 500 is tested, with 2 defective parts found. Should you conclude that this is a "bad" lot? Is the number of defective parts within the tolerance?

To understand and analyze this test, we need the right model for the events. We need to identify an "event" and its probability.

Basic probability properties: P(event) = 0 if the event cannot occur, P(event) = 1 if the event must occur, and 0 < P(E) < 1 otherwise.

If A is the set of all possible events, then P(A) = 1.

If distinct events are disjoint, then P(A_1 or A_2 or A_3) = P(A_1) + P(A_2) + P(A_3), and summing over all possible events,
\[ \sum_j P(A_j) = 1. \]

[Figure: disjoint events A_1, A_2, A_3 inside the event space A.]

If all events are equally likely, then P(A_j) = 1/n for n events.

For example we can consider:
A_1 = only the first part is defective
A_2 = only the second part is defective
A_3 = only the third part is defective
Then A_1 and A_3 are disjoint, so P(A_1 or A_3) = P(A_1) + P(A_3).

Note: if A_1 and A_2 are not disjoint, then P(A_1 or A_2) ≠ P(A_1) + P(A_2).

[Figure: overlapping events A_1 and A_2 inside the event space A.]

In this case we have P(A_1 or A_2) = P(A_1) + P(A_2) − P(A_1 and A_2).

In our example: the event space is all possible outcomes of the test - large and complicated! So we would like to break it down into smaller tests: testing each part individually. Defective / not defective is similar to a coin flip:

Fair coin: P(H) = 1/2, P(T) = 1/2.
Biased coin: P(T) = p, P(H) = 1 − p = q.

It is useful to designate H, T (success or failure) by 1 and 0. Each test is then a Bernoulli trial with parameter p: P(1) = p. Now we can calculate the probability of events involving more than one test of a part, by noting that the tests of the parts are assumed to be independent.

Conditional Probability

P(A|B) = probability that event A occurs given that B occurs:
\[ P(A \mid B) = \frac{P(A, B)}{P(B)}, \]
where P(A, B) is the probability that both A and B occur, and P(B) is the probability that B occurs.

Simple example showing dependence: flip a coin three times, and find P(observe 2 heads | second is a head). If we list all possible outcomes (for a fair coin) we have

HHH  HHT  THH  THT  HTH  HTT  TTH  TTT

The first four outcomes are those with H as the second outcome. Then P(observe 2 heads | second is a head) = P(2 out of 3 | second is head) = 2/4 = 1/2.

Note: P(2 out of 3) is the unconditional probability, i.e. the probability of 2 successes in 3 Bernoulli trials:
\[ P(2 \text{ out of } 3) = \binom{3}{2}\left(\frac{1}{2}\right)^2 \frac{1}{2} = \frac{3}{8}. \]
We can also calculate P(observe 2 heads | second is a head) using
\[ \frac{P(2 \text{ out of 3 and second is } H)}{P(\text{second is } H)} = \frac{2/8}{1/2} = \frac{1}{2}. \]

Simple example showing independence: successive outcomes of the coin flip (Bernoulli trials) are independent, so
\[ P(X_1 = H \mid X_2 = T) = P(X_1 = H). \]
Then, since
\[ P(X_1 = H \mid X_2 = T) = \frac{P(X_1 = H, X_2 = T)}{P(X_2 = T)}, \]
it must be that
\[ P(X_1 = H, X_2 = T) = P(X_1 = H)\,P(X_2 = T). \]
In general the joint probability density for two independent random variables X_1, X_2 is written as
\[ P(X_1, X_2) = f(x_1, x_2) = f(x_1)\,f(x_2), \]
that is, the density for the joint distribution is just the product of the individual densities.
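The conditional-probability calculation in the coin-flip example can be checked by brute-force enumeration of the eight equally likely outcomes. The following is a minimal sketch (not part of the original notes; the variable names are just illustrative):

```python
from itertools import product

# All 2^3 equally likely outcomes of three fair coin flips.
outcomes = list(product("HT", repeat=3))

# B: the second flip is a head.  A: exactly two heads are observed.
B = [w for w in outcomes if w[1] == "H"]
A = [w for w in outcomes if w.count("H") == 2]

p_A = len(A) / len(outcomes)                                    # unconditional: 3/8
p_B = len(B) / len(outcomes)                                    # 1/2
p_A_and_B = len([w for w in A if w[1] == "H"]) / len(outcomes)  # 2/8
p_A_given_B = p_A_and_B / p_B                                   # (2/8)/(1/2) = 1/2

print(p_A, p_A_given_B)  # 0.375 0.5
```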
The statement of a problem can make a big difference.

Example: Given a room of n people, what is the probability that at least 2 have the same birthday? We have to find P(X ≥ 2), where X is the number of people sharing a birthday. Stated this way it is difficult, because we would have to consider all possible combinations. But we can instead use the complement: P(X ≥ 2) = 1 − P(X = 1), where
P(X = 1) = P(the 2nd person is different from the 1st) · P(the 3rd person is different from the 2nd and 1st) ⋯ P(the n-th person is different from all the rest),
so
\[ P(X \ge 2) = 1 - \left(1 - \frac{1}{365}\right)\left(1 - \frac{2}{365}\right)\cdots\left(1 - \frac{n-1}{365}\right). \]

P(X ≥ 2):  0.315   0.5   0.75   0.98
n:          17      23    32     56

Recall our problem. Define X_j = outcome of the j-th test, with P(X_j = 1) = p, where p is the probability of a defective part. Here "success" is the event of finding a defective part. Then, if the individual tests are independent,
P(X_1 = 1 and X_2 = 0) = P(X_1 = 1) · P(X_2 = 0).
Again, the treatment of "and" and "or" in probabilities of events can be viewed from the set point of view:
P(X_1 = 1 or X_2 = 0) = P(X_1 = 1) + P(X_2 = 0) − P(X_1 = 1 and X_2 = 0).

Let's write the event space for the first two tests, listing the outcome of the first test and then the second:

00   10   01   11

The outcomes with X_2 = 0 are 00 and 10; the outcomes with X_1 = 1 are 10 and 11. So
P(X_1 = 1) = p
P(X_2 = 0) = 1 − p
P(X_1 = 1 and X_2 = 0) = p(1 − p), using independence
P(X_1 = 1 or X_2 = 0) = p + (1 − p) − p(1 − p) = 1 − p(1 − p).

Note: we also have to be careful about specifying order:
P(X_1 = 1 and X_2 = 0) ≠ P(1 success and 1 failure),
since
P(1 success and 1 failure) = P(X_1 = 1, X_2 = 0) + P(X_1 = 0, X_2 = 1) = 2p(1 − p).

So if we ask for the probability of one defective part in the first n trials, we need to take into account the number of ways that could occur:

1 0 0 0 ⋯ 0
0 1 0 0 ⋯ 0
0 0 1 0 ⋯ 0   and so on.

Similarly, if we ask for P(two defective parts in n trials):

1 1 0 0 ⋯ 0
1 0 1 0 ⋯ 0
0 0 1 1 ⋯ 0   and so on.

For any one of these sequences of n trials with j defectives (successes), the probability of that particular sequence is p^j (1 − p)^(n−j). Since the order does not affect this probability, and the trials are identical, we count the number of combinations we can have in these sequences, i.e. the number of ways to choose j positions out of n:
\[ \binom{n}{j} = \frac{n!}{j!\,(n-j)!}. \]
So the probability of j defectives in n trials is
\[ P(j \text{ out of } n) = \binom{n}{j} p^j (1-p)^{n-j}. \]
This is called the binomial distribution B(n, p) with parameters n and p, where n is the number of Bernoulli trials and p is the success probability for any one trial.

Note: to use the binomial distribution we must have a sequence of tests which are
1. identical Bernoulli trials (only two possible outcomes), and
2. independent.

So if Y is the number of defectives found in a batch of size n,
\[ P(Y = j) = P(\text{all possible sequences with } j \text{ successes}) = \binom{n}{j} p^j (1-p)^{n-j}. \]

Back to our example: we have a test of 500 parts with 2 defectives, and we have to determine whether that is enough to decide that we have a bad lot of parts. Then not only do we have to calculate the probability of this event, but also give a tolerance for the number of defectives observed. Tolerance is typically defined in terms of a 90%, 95%, or 99% chance of observing some event, under the assumption that the lot is good. First we calculate the probability of the observed event, P(Y = 2) for Y ∼ B(500, 0.001):
\[ P(Y = 2) = \binom{500}{2} (0.001)^2 (0.999)^{498} \approx 0.0758. \]
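This binomial probability is straightforward to check numerically; here is a small sketch (illustrative only, not part of the original notes):

```python
from math import comb

def binom_pmf(j, n, p):
    """P(Y = j) for Y ~ B(n, p)."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

n, p = 500, 0.001          # sample size and assumed defect rate (1 in 1,000)
print(binom_pmf(2, n, p))  # ~0.0758
```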
How do we compare this to the tolerance? If the lot is good, then the probability of observing k or more defectives is
\[ P(Y \ge k) = \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j}. \]

[Figure: bar plot of the probability mass function P(Y = j) of the discrete binomial variable Y against j, with P(Y = 0) ≈ 0.6 and P(Y = 1) ≈ 0.3.]

So P(Y ≥ k) adds up the mass for all values j ≥ k. Here we find:
P(Y ≥ 1) = 1 − P(Y = 0) = 1 − 0.6064 = 0.3936
P(Y ≥ 2) = 1 − [P(Y = 0) + P(Y = 1)] = 1 − (0.6064 + 0.3035) = 0.0901
P(Y ≥ 3) = 1 − [P(Y = 0) + P(Y = 1) + P(Y = 2)] = 1 − (0.6064 + 0.3035 + 0.0758) = 0.0143

So if we define a 90% "tolerance" - usually called a confidence interval - then this defines a "rare" event as one that occurs with probability 10% or less. Similarly, defining as "rare" an event which occurs with 5% probability corresponds to a 95% confidence interval. Thus for a 90% confidence interval we would conclude that our observation of 2 defectives is a "rare" event for a good batch (since P(Y ≥ 2) ≈ 0.09 < 0.10). For a 95% confidence interval, we would conclude that observing 2 defectives is not "rare". Thus the choice of the confidence interval influences how many false positives or false negatives we obtain in our testing; the choice depends on our tolerance for either one.

So far our calculation of probabilities and confidence intervals is based on discrete probability mass functions. We can generalize to other approximations by noting that we are calculating probabilities for Y, where Y is a sum of identically distributed, independent random variables: each X_j = 0 or 1 with success probability p, and Y = Σ X_j is the number of successes, with the outcome of each trial independent of the others. There are many results for sums of random variables, the most famous being the Central Limit Theorem. This lets us approximate the distribution of the sum with a normal distribution:
\[ Y = \sum X_j \approx N(\mu, \sigma^2), \quad \text{where } \mu = E\Big[\sum X_j\Big] \text{ and } \sigma^2 = \mathrm{Var}\Big[\sum X_j\Big]. \]

Let's first review expected value. The expected value is by definition the weighted average, where the weight is given by the probability mass function P(X = x_i) = f(x_i) for the discrete values x_i of the random variable X:
\[ E[X] = \sum_{\text{all } i} x_i f(x_i), \qquad E[g(X)] = \sum_{\text{all } i} g(x_i) f(x_i) \ \text{ for } g \text{ a function}. \]
For X a Bernoulli random variable, f(1) = p and f(0) = 1 − p, so E[X] = 1·p + 0·(1 − p) = p.

For a binomial random variable, P(Y = j) = \binom{n}{j} p^j (1-p)^{n-j}, so
\[ E[Y] = \sum_{j=0}^{n} j \binom{n}{j} p^j (1-p)^{n-j}. \]
This can be calculated by rearranging terms in the sum. We also note that since Y = \sum_{k=1}^{n} X_k, i.e. Y is the sum of the outcomes of n Bernoulli trials, then E[Y] = E[\sum_{k=1}^{n} X_k]. Since the X_k are identical, and we can commute the sums,
\[ E\Big[\sum_{k=1}^{n} X_k\Big] = \sum_{k=1}^{n} E[X_k] = nE[X_k] = np. \]

Similarly we can use this idea to calculate the variance:
\[ \mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2 - 2XE[X] + (E[X])^2] = E[X^2] - (E[X])^2 \]
(since E[X] is just a number, not a random variable). For the Bernoulli trials:
E[X] = p, E[X²] = 1²·p + 0²·(1 − p) = p, so Var[X] = E[X²] − p² = p − p² = p(1 − p).

Now we can calculate, using Y = Σ X_k and E[Σ X_k] = np:
\[ \mathrm{Var}[Y] = \mathrm{Var}\Big[\sum_{k=1}^{n} X_k\Big] = E\Bigg[\Big(\sum_{k=1}^{n} X_k - E\Big[\sum_{k=1}^{n} X_k\Big]\Big)^2\Bigg] = E\Bigg[\Big(\sum_{k=1}^{n} (X_k - p)\Big)^2\Bigg]. \]
Expanding the square,
\[ \mathrm{Var}[Y] = E\Bigg[\sum_{k=1}^{n} (X_k - p)^2 + 2\sum_{j}\sum_{k<j} (X_k - p)(X_j - p)\Bigg] = \sum_{k=1}^{n} E[(X_k - p)^2] + 2\sum_{j}\sum_{k<j} E[(X_k - p)(X_j - p)]. \]
Note that
\[ E[(X_k - p)(X_j - p)] = \sum_{\text{all } x_k, x_j} (x_k - p)(x_j - p) f(x_j, x_k) = \sum_{\text{all } x_k, x_j} (x_k - p)(x_j - p) f(x_j) f(x_k), \]
since the X_j, X_k are independent (Bernoulli trials). So
\[ E[(X_k - p)(X_j - p)] = E[X_k - p]\,E[X_j - p] = 0, \quad \text{since } E[X_k] = p, \]
and therefore
\[ \mathrm{Var}[Y] = \sum_{k=1}^{n} E[(X_k - p)^2] = \sum_{k=1}^{n} \mathrm{Var}[X_k] = np(1 - p). \]
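These identities, and the tail probabilities quoted above, can be checked directly from the binomial pmf. A brief sketch (illustrative only; exact summation is just one way to do the check):

```python
from math import comb

def binom_pmf(j, n, p):
    return comb(n, j) * p**j * (1 - p)**(n - j)

n, p = 500, 0.001
pmf = [binom_pmf(j, n, p) for j in range(n + 1)]

mean = sum(j * pmf[j] for j in range(n + 1))
var = sum((j - mean) ** 2 * pmf[j] for j in range(n + 1))
print(mean, n * p)            # both 0.5
print(var, n * p * (1 - p))   # both ~0.4995

# Tail probabilities used in the tolerance discussion:
for k in (1, 2, 3):
    print(k, 1 - sum(pmf[:k]))  # ~0.3936, ~0.0901, ~0.0143
```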
Thus the sum of independent random variables X_i has expected value nE[X_i] and variance n Var[X_i]. In addition, the Central Limit Theorem gives an approximation to the density of a sum of i.i.d. random variables:
\[ \lim_{n \to \infty} \frac{\sum_{i=1}^{n} X_i - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}} \sim N(0, 1). \]
This says that the standardized sum of n i.i.d. random variables X_i approaches a normal distribution. A normal distribution with mean μ and variance σ² has probability density function of the form
\[ p(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \mu)^2}{2\sigma^2}}, \qquad -\infty < y < +\infty. \]
Note: this is a continuous random variable. You can verify that
\[ \int_{-\infty}^{+\infty} p(y)\,dy = 1, \qquad E[Y] = \int_{-\infty}^{+\infty} y\,p(y)\,dy = \mu, \qquad \mathrm{Var}[Y] = \int_{-\infty}^{+\infty} (y - \mu)^2 p(y)\,dy = \sigma^2. \]
By the definition of E[Y],
\[ E[Y - \mu] = 0, \qquad E[cY] = cE[Y] = c\mu, \qquad \mathrm{Var}[cY] = E[(cY - c\mu)^2] = c^2\,\mathrm{Var}[Y]. \]
So in the limit above, we subtract off the mean and divide by the standard deviation, leaving us with a random variable with mean 0 and variance 1. The proof that the density tends towards a normal distribution is not covered here, but the implications are significant: the Central Limit Theorem (CLT) says that we can take any random variables which have bounded mean and variance (the random variables can be discrete or continuous), and if we take a sum of these random variables, as n → ∞ the density of the sum will be normal. Then we can approximate the probability of observations for the sums by using the normal distribution.

For example, in our previous example we considered the probability P(Σ_{i=1}^n X_i > k) for some k. The CLT says we can consider instead the standardized variable
\[ Z = \frac{\sum X_i - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}}. \]

[Figure: standard normal density p(z), centered at 0.]

Then
\[ P\left(\frac{\sum X_i - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}} > \frac{k - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}}\right) \approx P\left(Z > \frac{k - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}}\right) = P(Z > z). \]
Notice that in this case the comparison is between Σ_{i=1}^n X_i, a sum of discrete random variables which take only positive values, and Z, a continuous random variable with range over all reals. So we would expect that this approximation may not be valid for all values of Y = Σ_{i=1}^n X_i for finite n. We can then find P(Z > z) using the density p(z). Typically we call the range of likely variation a confidence interval, which is then defined in terms of values of Z. The confidence interval could be one-sided or two-sided, depending on the application.

We can compare our previous results to the approximation with the normal distribution, e.g. P(Y ≤ 1) computed from the binomial distribution versus P(Y ≤ 1) computed from the normal approximation.

[Figure: comparison of the binomial and normal values; standard normal density p(z) with the tail area P(Z > z*) shaded to the right of z*.]

Example #2
Suppose a bus can arrive at any time between 11:00 am and 11:15 am, with equal probability. If you arrive at the bus stop at 11:00 am, with what probability will you wait 10 minutes or more for the bus? With what probability will you wait a total of 300 minutes or more over the whole month (assuming you arrive at 11:00 am every day)? For which value of total minutes would you question the validity of the bus schedule?

Let X = the waiting time for the bus on one day. Then
\[ P(X > 10) = 1 - P(X \le 10) = 1 - \int_0^{10} f(x)\,dx, \]
where f(x) is the uniform probability density
\[ f(x) = \begin{cases} \dfrac{1}{15}, & 0 \le x \le 15 \\ 0, & \text{otherwise.} \end{cases} \]

Let Y = total amount of time spent waiting over 30 days, Y = \sum_{i=1}^{30} X_i. Then
\[ P(Y > 300) = P\Big(\sum_{i=1}^{30} X_i > 300\Big) = P\left( \frac{\sum X_i - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}} > \frac{300 - nE[X_i]}{\sqrt{n}\,\sqrt{\mathrm{Var}[X_i]}} \right), \]
where
\[ E[X_i] = \int_0^{15} \frac{x}{15}\,dx = \frac{15}{2}, \qquad \mathrm{Var}[X_i] = \int_0^{15} \frac{(x - 15/2)^2}{15}\,dx = \frac{1}{15}\left[\frac{x^3}{3} - \frac{15x^2}{2} + \frac{15^2 x}{4}\right]_0^{15} = 18.75. \]
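A quick Monte Carlo sketch (illustrative, not part of the original notes) confirms the daily waiting-time quantities used here; the exact values are P(X > 10) = 1 − 10/15 = 1/3, E[X] = 7.5, and Var[X] = 18.75.

```python
import random

random.seed(0)
N = 200_000  # number of simulated days

# Daily waiting time is uniform on [0, 15] minutes.
waits = [random.uniform(0, 15) for _ in range(N)]

print(sum(w > 10 for w in waits) / N)   # ~1/3

mean = sum(waits) / N
var = sum((w - mean) ** 2 for w in waits) / N
print(mean, var)                        # ~7.5, ~18.75
```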
Using these values,
\[ P(Y > 300) = P\Big(\sum_{i=1}^{30} X_i > 300\Big) \approx P\left(Z > \frac{300 - 30\cdot\frac{15}{2}}{\sqrt{30 \cdot 18.75}}\right) = P(Z > 3.1623) < 0.01. \]
If we chose a 90% or 95% confidence level, the corresponding cutoffs would be
P(Z > 1.28) ≈ 0.10 and P(Z > 1.65) ≈ 0.05 (equivalently, P(Z ≤ 1.28) ≈ 0.90 and P(Z ≤ 1.65) ≈ 0.95).
Note that 3.1623 is much larger than these levels. Then we could conclude that we have observed a rare event, given the assumption about the bus arrivals, so we could question this assumption.

Example #3
Test a drug on 100 patients, with probability p of benefit from the drug. How many patients would we need to observe receiving benefits from the drug in order to accept its assumed effectiveness? In this case we have to choose the confidence level and determine the number of patients that satisfies that confidence level. If Y is the number of patients receiving a benefit, this says we want to find y such that
P(Y ≥ y) = 0.95.
Then we need to identify E[Y] and Var[Y], and in particular we would like to write Y as a sum of random variables. Here Y = Σ X_i, where X_i is the Bernoulli trial for each individual patient. So E[X_i] = p ⟹ E[Y] = 100p and Var[Y] = 100p(1 − p). With the assumed effectiveness p = 0.9, E[Y] = 90 and Var[Y] = 100 · (0.9)(0.1) = 100 · 0.09 = 9, so by the normal approximation
\[ P\left(\frac{Y - 90}{\sqrt{100 \cdot 0.09}} \ge -1.65\right) = 0.95 \implies Y \ge -1.65 \cdot 3 + 90 \approx 85.05. \]
So, in order to accept the drug's effectiveness of 0.9, we would want to see more than 85 patients receiving benefits from the drug.

"Least Squares" - Linear Regression

Data: view it as a random variable, or as a function plus a random variable at each data point. Let the data points be
Y_i = f(t_i) + ε_i,
where ε_i is the error (the usual assumption is ε_i ∼ N(0, σ²)). What is σ? It depends on f(t). Suppose the data points fit a linear function f(t) = a + bt.

[Figure: data points y_i plotted against t, with the fitted line a + bt.]

In general, we would like to minimize the "errors" ε_i. In fact, we will minimize the sum of squared deviations about a + bt ("like the variance about a mean"):
\[ F(a, b) = \sum_{i=1}^{n} \big(y_i - (a + bt_i)\big)^2. \]
Setting the partial derivatives to zero,
\[ \frac{\partial F}{\partial a} = \sum_{i=1}^{n} 2\big(y_i - (a + bt_i)\big)(-1) = 0, \qquad \frac{\partial F}{\partial b} = \sum_{i=1}^{n} 2\big(y_i - (a + bt_i)\big)(-t_i) = 0, \]
which gives the normal equations
\[ \sum_{i=1}^{n} y_i = na + b\sum_{i=1}^{n} t_i, \qquad \sum_{i=1}^{n} t_i y_i = a\sum_{i=1}^{n} t_i + b\sum_{i=1}^{n} t_i^2. \]
Solving for a and b:
\[ a = \frac{\sum_{i=1}^{n} y_i - b\sum_{i=1}^{n} t_i}{n}, \qquad b = \frac{n\sum_{i=1}^{n} y_i t_i - \sum_{i=1}^{n} y_i \sum_{i=1}^{n} t_i}{n\big(\sum_{i=1}^{n} t_i^2\big) - \big(\sum_{i=1}^{n} t_i\big)^2}. \]
Note that ŷ_i = a + bt_i is the estimate of y_i. Then
\[ R^2 = \frac{\text{variation of the estimate}}{\text{variation of the actual data}} = \frac{\sum_{i=1}^{n} (\hat y_i - \bar y)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}. \]

We can apply the least-squares fit to the following data:
\[ \sum_{i=1}^{n} x_i = 217.3, \qquad \sum_{i=1}^{n} x_i^2 = 3671, \qquad \sum_{i=1}^{n} y_i = 1049.2, \qquad \sum_{i=1}^{n} x_i y_i = 17657. \]
Then a = 29.4751, b = 3.065, and R² = 0.7178.

Example (from Larsen & Marx)
Crickets make their chirping sound by sliding one wing cover very rapidly back and forth over the other. Biologists have long been aware that there is a linear relationship between temperature and the frequency with which a cricket chirps, although the slope and y-intercept of the relationship vary from species to species. Listed below are 15 frequency-temperature observations recorded for the striped ground cricket, Nemobius fasciatus fasciatus. Plot these data and find the equation of the least-squares line, y = a + bx. Suppose a cricket of this species is observed to chirp 18 times per second. What would be the estimated temperature?

Observation   Chirps per second, x_i   Temperature, y_i (°F)
 1            20.0                     88.6
 2            16.0                     71.6
 3            19.8                     93.3
 4            18.4                     84.3
 5            17.1                     80.6
 6            15.5                     75.2
 7            14.7                     69.7
 8            17.1                     82.0
 9            15.4                     69.4
10            16.2                     83.3
11            15.0                     79.6
12            17.2                     82.6
13            16.0                     80.6
14            17.0                     83.5
15            14.4                     76.3
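Below is a short sketch of this fit using the formulas above (illustrative; the rounded values in the comments are simply what these formulas give for this data set):

```python
# Least-squares fit of temperature (y, deg F) on chirp rate (x, chirps/sec)
# for the 15 striped-ground-cricket observations listed above.
x = [20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
     15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.4]
y = [88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
     69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3]

n = len(x)
Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = (Sy - b * Sx) / n

y_hat = [a + b * xi for xi in x]
y_bar = Sy / n
R2 = sum((yh - y_bar) ** 2 for yh in y_hat) / sum((yi - y_bar) ** 2 for yi in y)

print(a, b, R2)    # roughly a = 25.2, b = 3.29
print(a + b * 18)  # estimated temperature at 18 chirps/sec, about 84.5 deg F
```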
Note that X and Y do not have to be related linearly in order to use linear regression.

Example: If y ≈ a e^{bx}, then we can take logarithms of both sides to obtain a linear problem:
\[ \ln y = \ln(a e^{bx}) = \ln a + bx. \]
So, setting W = ln y, we have W = ln a + bx, and we can apply the linear regression formulae to W and x:
\[ \ln a = \frac{\sum W_i - b\sum x_i}{n}, \qquad b = \frac{n\sum W_i x_i - \sum W_i \sum x_i}{n\big(\sum x_i^2\big) - \big(\sum x_i\big)^2}. \]

Example (from Larsen & Marx)
Mistletoe is a plant that grows parasitically in the upper branches of large trees. Left unchecked, it can seriously stunt a tree's growth. Recently an experiment was done to test a theory that older trees are less susceptible to mistletoe growth than younger trees. A number of shoots were cut from 3-, 4-, 9-, 15-, and 40-year-old Ponderosa pines. These were then side-grafted to 3-year-old nursery stock and planted in a preserve. Each tree was "inoculated" with mistletoe seeds. Five years later, a count was made of the number of mistletoe plants in each stand of trees. (A stand consisted of approximately ten trees; there were three stands of each of the four youngest age groups and two stands of the oldest.) The results are shown below:

Age of trees, x (years)   Number of mistletoe plants, y
 3                        28, 33, 22
 4                        10, 36, 24
 9                        15, 22, 10
15                        6, 14, 9
40                        1, 1

So if we try to approximate the mistletoe data using y = a + bx, we get a = 25.9375, b = −0.7117, and R² = 0.6508.

For y = a x^b, we use z = ln y, v = ln x, α = ln a:
\[ z = \ln a + b \ln x = \alpha + bv. \]
Then we get α = 4.9, b = −1.1572, and r² = 0.826. Note: here r² was calculated using the linear formulation,
\[ r^2 = \frac{\sum_{i=1}^{n} (\hat z_i - \bar z)^2}{\sum_{i=1}^{n} (z_i - \bar z)^2}, \]
where ẑ_i = α + b v_i and ŷ_i = e^{α} x_i^{b}.

For y = a e^{bx}, we use w = ln y ⟹ w = ln a + bx. Then we get ln a = 3.56 and b = −0.0893. Again r² was calculated using the linear formulation,
\[ r^2 = \frac{\sum_{i=1}^{n} (\hat w_i - \bar w)^2}{\sum_{i=1}^{n} (w_i - \bar w)^2}, \]
where ŵ_i = ln a + b x_i and ŷ_i = e^{\ln a + b x_i}.
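The linearization idea is easy to prototype. The sketch below (not part of the original notes) uses synthetic data generated from an assumed model y = a e^{bx} with a = 30 and b = −0.09, fits it by regressing w = ln y on x with the formulas above, and recovers parameters close to those used to generate the data.

```python
import math
import random

random.seed(1)

# Synthetic data from y = a*exp(b*x) with small multiplicative noise
# (illustrative values only; these are not the mistletoe measurements).
a_true, b_true = 30.0, -0.09
xs = [3, 4, 9, 15, 40] * 3
ys = [a_true * math.exp(b_true * x) * math.exp(random.gauss(0, 0.05)) for x in xs]

# Linearize: w = ln y = ln a + b x, then apply the usual least-squares formulas.
ws = [math.log(y) for y in ys]
n = len(xs)
Sx, Sw = sum(xs), sum(ws)
Sxx = sum(x * x for x in xs)
Sxw = sum(x * w for x, w in zip(xs, ws))

b = (n * Sxw - Sx * Sw) / (n * Sxx - Sx ** 2)
ln_a = (Sw - b * Sx) / n

print(math.exp(ln_a), b)  # should be close to a_true = 30, b_true = -0.09
```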