Bio/statistics Handout 9: Common probability functions

There are various standard probability functions that serve as reference models when trying to decide whether a given phenomenon is 'unexpected' or not.

a) What does 'random' really mean?

Consider a bacterial cell that moves one cell length, either to the right or left, in each unit of time. If we start some large number of these cells at x = 0 and wait t units of time, we can determine a function, p_t(x), which is the fraction of bacteria that end up at position x ∈ {0, ±1, ±2, …} at time t. We can now ask: What do we expect the function x → p_t(x) to look like? Suppose that we think that the bacterium is moving 'randomly'. Two questions then arise:

  How do we translate our intuitive notion of the English term 'random' into a prediction for p_t(x)?

  Granted we have a prediction for each t and x, how far must p_t(x) be from its predicted value before we must accept that the bacterium is not moving randomly?
(9.1)

These questions go straight to the heart of what is called the 'scientific method'. We made a hypothesis: 'The bacterium moves left or right at random.' We want first to generate some testable predictions from the hypothesis (the first point in (9.1)), and then compare these predictions with experiment. The second point in (9.1) asks for criteria to use to evaluate whether the experiment confirms or rejects our hypothesis. The first question in (9.1) is the province of 'probability theory' and the second the province of 'statistics'. This handout addresses the first question in (9.1), while aspects of the second are addressed in subsequent handouts.

b) A mathematical translation of the term 'random'

To say that an element is chosen from a given set at 'random' is traditionally given the following mathematical definition:

  Probabilities are defined using the probability function that assigns all elements the same probability. Thus, if the set has some N elements, then the probability of any given element appearing is 1/N.
(9.2)

The probability function that assigns this constant value to all elements is called the uniform probability function. Here is an archetypal example: A coin has probability 1/2 of landing heads up and so probability 1/2 of landing tails up. The coin is flipped T times. Our sample space is the set S that consists of the N = 2^T possible sequences (±1, ±1, …, ±1), where +1 is in the k'th spot when the k'th flip landed heads up, while −1 sits in this slot when the k'th flip landed tails up. Note that after setting T = t, this same sample space describes all of the possibilities for the moves of our bacterium from Section a), above. You might think that the uniform probability distribution is frightfully dull; after all, how much can you say about a constant?

c) Some standard counting solutions

The uniform probability distribution becomes interesting when you consider the probabilities for certain subsets, or the probabilities for the values of certain random variables. To motivate our interest in the uniform probability distribution, consider first its appearance in the case of the bacteria. Here, our model of random behavior takes S as just described. Now define f so that f(ε_1, …, ε_N) = ε_1 + ε_2 + ··· + ε_N and ask for the probabilities of the possible values of f. With regard to the bacteria, f tells us the position of the bacterium after N steps in the model where it moves right or left with equal probability at each step. The probabilities for the possible values of f provide the theoretical predictions for the measured function p_t(x).
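To make the model concrete before deriving anything, here is a minimal simulation sketch in Python (the function name, cell count, and seed are my own choices, not part of the handout) that estimates p_t(x) by running many independent ±1 walks:

    import numpy as np

    rng = np.random.default_rng(0)

    def empirical_p_t(t, n_cells=100_000):
        """Estimate p_t(x): the fraction of cells ending at each position x."""
        steps = rng.choice([-1, 1], size=(n_cells, t))  # each row is one cell's walk
        finals = steps.sum(axis=1)                      # f(eps_1,...,eps_t) = sum of the steps
        positions, counts = np.unique(finals, return_counts=True)
        return dict(zip(positions, counts / n_cells))

    print(empirical_p_t(10))

The formulas derived below say what this empirical table should look like when the walk really is random.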
In general, if S has N elements and a subset K ⊂ S has k elements, then the uniform probability distribution gives probability k/N to the set K. Even so, it may be some task to count the elements in any given set. Moreover, the degree of difficulty may depend on the manner in which the set is described. There are, however, some standard counting formulas available to facilitate things.

For example, let S denote the sample space with 2^N elements as described in Section a) above. For n ∈ {0, 1, …, N}, let K_n denote the subset of elements in S with n occurrences of +1, thus with N−n occurrences of −1. Note that K_n has an alternate description, this as the set of elements on which f(ε_1, …, ε_N) = ε_1 + ··· + ε_N has value 2n−N. In any event, here is a basic fact:

  The set K_n has N!/(n!(N−n)!) members.
(9.3)

In this regard, remember that k! is defined for any positive integer k as k(k−1)(k−2)···1. Also, 0! is defined to be equal to 1. For those who don't like to take facts without proof, I explain in the last section below how to derive (9.3) and also the formulae that follow. By the way, N!/(n!(N−n)!) arises often enough in counting problems to warrant its own symbol, the binomial coefficient

  \binom{N}{n} = N!/(n!(N−n)!) ,
(9.4)

read 'N choose n'.

Here is another standard counting formula: Let b ≥ 1 be given, and let S denote the set of b^N elements of the form (ε_1, …, ε_N) where each ε_k is now in {1, …, b}. If N > b, then every element of S has at least two equal entries (this is the pigeonhole principle). If N ≤ b, then elements with no two entries alike can be present. Fix b and let E_b denote the subset of those N-tuples (ε_1, …, ε_N) where ε_k ≠ ε_{k′} when k ≠ k′.

  The set E_b has b!/(b−N)! members.
(9.5)

The case b = N in (9.5) provides the following:

  There are N! ways to order a set of N distinct elements.
(9.6)

Here, a set of elements is 'ordered' simply by listing them one after the other. For example, the set that consists of 1 apple and 1 orange has two orderings, (apple, orange) and (orange, apple).
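If you would like to see (9.3) and (9.5) confirmed by brute force for small cases, the following sketch (the helper names are mine) counts the relevant tuples directly and compares the counts with the factorial formulas:

    import itertools, math

    def count_Kn(N, n):
        """Count the (+1/-1)-tuples of length N with exactly n entries equal to +1."""
        return sum(1 for eps in itertools.product([1, -1], repeat=N)
                   if sum(e == 1 for e in eps) == n)

    N, n = 6, 2
    assert count_Kn(N, n) == math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

    def count_no_repeats(b, N):
        """Count the length-N tuples from {1,...,b} with no two entries alike."""
        return sum(1 for eps in itertools.product(range(1, b + 1), repeat=N)
                   if len(set(eps)) == N)

    b, N = 5, 3
    assert count_no_repeats(b, N) == math.factorial(b) // math.factorial(b - N)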
d) Some standard probability functions

The counting formulae just presented can be used to derive some probability functions that you will almost surely see again and again in your scientific career.

The Equal Probability Binomial: Let S denote the sample space with the 2^N elements of the form (ε_1, …, ε_N) where each ε_k can be +1 or −1. For any given integer n ∈ {0, 1, …, N}, let K_n denote the event that there are precisely n occurrences of +1 in the N-tuple (ε_1, …, ε_N). Then the uniform probability function assigns K_n the probability

  P(n) = 2^{−N} N!/(n!(N−n)!) .
(9.7)

The fact that K_n has probability P(n) follows from (9.3). The assignment of P(n) to an integer n defines a probability function on the (N+1)-element set {0, 1, …, N}, a probability function that is decidedly not uniform. We will investigate some of its properties below.

Equation (9.7) can be used to answer the first question in (9.1) with regard to our random bacterium in Section a). To see how this comes about, let S denote the sample space with the 2^N elements of the form (±1, ±1, …, ±1). Let f again denote the random variable that assigns ε_1 + ··· + ε_N to any given (ε_1, …, ε_N). Then the event f = 2n−N is exactly our set K_n. This understood, P(f = x) = 0 unless both N and x are even, or both are odd, in which case

  P(f = x) = 2^{−N} N!/( ((N+x)/2)! ((N−x)/2)! ) .
(9.8)

If we believe that the bacterium chooses left and right with equal probability, then we should be comparing our experimentally determined p_t(x) with the N = t version of (9.8).

The Binomial Probability Function: The probability function P in (9.7) is an example of what is called the binomial probability function on the set {0, …, N}. The 'generic' version of the binomial probability distribution requires the choice of a number q ∈ [0, 1]. With q chosen, the q-version assigns the following probability to an integer n:

  P_q(n) = N!/(n!(N−n)!) q^n (1−q)^{N−n} .
(9.9)

The latter can also be seen as stemming ultimately from a uniform distribution. In particular, when q is a rational number, say a/b where a ∈ {0, …, b}, this comes about in the following manner: Take the b^N-element sample space of N-tuples (ε_1, …, ε_N) where each ε_k can be any integer from the set {1, …, b}. Give this sample space the uniform distribution. Thus, each of its elements has probability b^{−N}. Now define a function, g_a, on this sample space that assigns to any given such N-tuple the number of its entries ε_k that obey ε_k ≤ a. For example, g_1 counts the number of entries that equal 1, while g_2 counts the number that are either 1 or 2. As is explained next, P_{a/b}(n) is the probability, as decreed by the uniform probability function, that the random variable g_a takes value n.

To see how the q = a/b version of (9.9) comes about, use S_b to denote the just described sample space with b^N elements. Let S denote our old sample space with the 2^N N-tuples of the form (±1, …, ±1). Now define a map, F_a, from S_b to S as follows: Take F_a(ε_1, …, ε_N) = (ε′_1, …, ε′_N) where each ε′_k is set equal to +1 when ε_k ≤ a; otherwise ε′_k = −1. This understood, the random variable g_a has value n on any given element if and only if F_a maps the element to the set K_n. Now, there are precisely a^n (b−a)^{N−n} points in S_b that map to any given element in K_n. This is because there are a possibilities for each ε_k where the corresponding ε′_k is +1, and (b−a) possibilities for each ε_k where the corresponding ε′_k is −1. Since there are a^n (b−a)^{N−n} elements in S_b that map to any given element in K_n, and N!/(n!(N−n)!) elements in K_n, the set of points in S_b where g_a = n must have probability

  N!/(n!(N−n)!) a^n (b−a)^{N−n} b^{−N} .
(9.10)

To obtain the q = a/b version of (9.9), take the factor b^{−N} and write it as b^{−n} b^{−(N−n)}, and then write b^{−n} a^n as q^n while writing b^{−(N−n)} (b−a)^{N−n} as (1−q)^{N−n}.

Even in the case that q is not rational, the probability function in (9.9) arises from a related problem on a sample space with 2^N elements. Take the sample space to be our friend S whose elements are the N-tuples of the form (±1, …, ±1). However, now take not the uniform probability, but the probability function whereby +1 occurs in any given entry with probability q and −1 occurs with probability (1−q). Define the random variable, g, to assign to any given element the number of appearances of +1. Then (9.9) gives the probability that g = n. To see why, note that the event g = n is just our set K_n. With this new probability, each element in K_n has probability q^n (1−q)^{N−n}. As (9.3) gives the number of elements in K_n, its probability is therefore given by (9.9).

The probability function in (9.9) is relevant to our bacterial walking scenario when we make the hypothesis that the bacterium moves to the right at any given step with probability q, thus to the left with probability 1−q. I'll elaborate on this in a subsequent handout.
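As an illustration of (9.9), the sketch below (parameter values and names are my own) evaluates P_q(n) with Python's math.comb and checks it against a biased-coin simulation of the random variable g:

    import math
    import numpy as np

    def P_q(n, N, q):
        """The binomial probability of (9.9)."""
        return math.comb(N, n) * q**n * (1 - q)**(N - n)

    rng = np.random.default_rng(1)
    N, q, trials = 20, 0.3, 200_000
    counts = (rng.random((trials, N)) < q).sum(axis=1)  # g = number of +1 entries per trial

    for n in (2, 6, 10):
        print(n, P_q(n, N, q), np.mean(counts == n))    # formula vs. simulated frequency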
The Poisson probability function: This is a probability function on the sample space {0, 1, 2, …}, the non-negative integers. As you can see, this sample space has an infinite number of elements. Even so, I trust that you find it reasonable that we define a probability function on this set to be a function, f, with f(n) ∈ [0, 1] for each n, and such that

  ∑_{n=0,1,…} f(n) = f(0) + f(1) + f(2) + ···
(9.11)

is a convergent series with limit equal to 1. As with the binomial probability function, there is a whole family of Poisson functions, one for each choice of a positive real number. Let λ > 0 denote the given choice. The λ-version of the Poisson function assigns to any given non-negative integer n the probability

  P(n) = (1/n!) λ^n e^{−λ} .
(9.12)

You can see that ∑_{n=0,1,…} P(n) = 1 if you know about power series expansions of the exponential function. In particular, the function e^λ has the power series expansion

  e^λ = 1 + λ + (1/2)λ² + (1/6)λ³ + ··· + (1/n!)λ^n + ··· .
(9.13)

Granted (9.13), then

  ∑_{n=0,1,…} (1/n!) λ^n e^{−λ} = e^{−λ} (∑_{n=0,1,…} (1/n!) λ^n) = e^{−λ} e^{λ} = 1.

The Poisson probability enters when trying to decide whether an observed pattern is or is not random. For example, suppose that on average, some number, λ, of newborns in the United States exhibit a certain birth defect each year. Suppose that some number, n, of such births are observed in 2004. Does this constitute an unexpected clustering that should be investigated? If the defects are unrelated and if the causative agent is similar in all cases over the years, then the probability of n occurrences in a given year should be very close to the value of the λ-version of the Poisson function P(n).

The Poisson function is an N → ∞ limit of the binomial function. To be more precise, the λ-version of the Poisson probability P(n) is the N → ∞ limit of the versions of (9.9) with q set equal to 1 − e^{−λ/N}. This is to say that

  (1/n!) λ^n e^{−λ} = lim_{N→∞} N!/(n!(N−n)!) (1 − e^{−λ/N})^n e^{−(λ/N)(N−n)} .
(9.14)

The proof that (9.14) holds takes us in directions that we don't have time for here. Let me just say that it uses the approximation to the factorial known as Stirling's formula,

  ln(k!) ≈ k ln(k) − k ,
(9.15)

with an error of size roughly ln(k). I give some further examples of how the Poisson function is used in a separate handout.
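Here is a quick numeric look at the limit in (9.14); the particular λ, n, and N values are arbitrary choices of mine:

    import math

    lam, n = 3.0, 4
    poisson = lam**n * math.exp(-lam) / math.factorial(n)

    for N in (10, 100, 1000, 10_000):
        q = 1 - math.exp(-lam / N)                       # the q of (9.14)
        binom = math.comb(N, n) * q**n * (1 - q)**(N - n)
        print(N, binom, poisson)                         # binom approaches poisson as N grows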
e) Means and standard deviations

Let me remind you that if P is a probability function on some subset S of the set of integers, then its mean, μ, is

  μ = ∑_{n∈S} n P(n)
(9.16)

and the square of its standard deviation, σ, is

  σ² = ∑_{n∈S} (n−μ)² P(n) .
(9.17)

In this regard, keep in mind that when S has an infinite number of elements, then μ and σ are defined only when the corresponding sums on the right-hand sides of (9.16) and (9.17) are those of convergent series.

The mean and standard deviation characterize any given probability function to some extent. More to the point, both the mean and standard deviation are often used in applications of probability and statistics. The mean and standard deviation for the binomial probability on {0, 1, …, N} are

  μ = Nq and σ² = Nq(1−q) .
(9.18)

For the Poisson probability function on {0, 1, …}, they are

  μ = λ and σ² = λ .
(9.19)

I describe a slick method for computing the relevant sums in the next section.

To get more of a feel for the binomial probability function, note first that the mean for the q = 1/2 version is N/2. This conforms to the expectation that half of the entries in the 'average' N-tuple (ε_1, …, ε_N) should be +1 and half should be −1. Meanwhile, in the general version, the assertion that the mean is Nq suggests that the fraction q of the entries in the 'average' N-tuple should be +1 and the fraction (1−q) should be −1. To get a sense for the standard deviation, one can ask for the value of n that makes P_q(n) largest. To see where this is, note that

  P_q(n+1)/P_q(n) = ((N−n)/(n+1)) (q/(1−q)) .
(9.20)

This ratio is less than 1 if and only if

  n > Nq − (1−q) .
(9.21)

Thus the probabilities increase with n until n exceeds Nq − (1−q) and decrease thereafter; since (1−q) < 1, this means that P_q(n) peaks at a value of n that is within ±1 of the mean.

As I remarked in Handout 3, the standard deviation indicates the extent to which the probabilities concentrate about the mean. To see this, consider the following basic fact:

Theorem: Suppose that P is a probability function on a subset of {…, −1, 0, 1, …} with a well defined mean, μ, and standard deviation, σ. For any R ≥ 1, the probability assigned to the set where |n − μ| > Rσ is less than 1/R².

For example, this says that the probability of being 2σ away from the mean is less than 1/4, and the probability of being 3σ away is less than 1/9. This theorem justifies the focus in the literature on the mean and standard deviation, since knowing these two numbers gives you rigorous bounds for probabilities. The probability bound stated in the theorem is known as the Chebychev inequality. Here is the proof: Let S denote the sample space here, and let E ⊂ S denote the set where |n − μ| > Rσ. The probability of E is then ∑_{n∈E} P(n). However, since |n − μ| > Rσ for n ∈ E, one has

  1 ≤ (n − μ)²/(R²σ²)
(9.22)

on E. Thus,

  ∑_{n∈E} P(n) ≤ ∑_{n∈E} ((n − μ)²/(R²σ²)) P(n) .
(9.23)

To finish the story, note that the right side of (9.23) only gets larger when we allow the sum to include all points in S instead of restricting to points in E. Thus, we learn that

  ∑_{n∈E} P(n) ≤ ∑_{n∈S} ((n − μ)²/(R²σ²)) P(n) .
(9.24)

The definition of σ² in (9.17) can now be invoked to identify the sum on the right-hand side of (9.24) with 1/R².
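To see how conservative the Chebychev bound is in practice, this sketch (parameters are my own choices) compares an exact binomial tail probability with the bound 1/R²:

    import math

    N, q = 100, 0.5
    mu = N * q
    sigma = math.sqrt(N * q * (1 - q))

    def P(n):
        return math.comb(N, n) * q**n * (1 - q)**(N - n)

    for R in (1, 2, 3):
        tail = sum(P(n) for n in range(N + 1) if abs(n - mu) > R * sigma)
        print(R, tail, 1 / R**2)                         # exact tail vs. Chebychev bound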
f) Characteristic polynomials

The slick computation of the mean and standard deviation that I mentioned involves the introduction of the notion of the characteristic polynomial. The latter is an often useful way to encode any given probability function on a subset of {…, −1, 0, 1, …}. In the case when the subset is {0, 1, 2, …, N} and the probability function is the binomial function from (9.9), the characteristic polynomial is the function of x given by

  ϕ(x) = P_q(0) + x P_q(1) + x² P_q(2) + ··· + x^N P_q(N) .
(9.25)

Thus, ϕ is a degree N polynomial in the variable x. As it turns out, the polynomial in (9.25) can be factored completely:

  ϕ(x) = (qx + (1−q))^N .
(9.26)

Indeed, to see why (9.26) is true, consider multiplying out an N-fold product of the form

  (a_1 x + b_1)(a_2 x + b_2) ··· (a_N x + b_N) .
(9.27)

A given term in the resulting sum can be labeled as (ε_1, …, ε_N) where ε_k = +1 if the k'th factor in (9.27) contributed a_k x, while ε_k = −1 if the k'th factor contributed b_k. The power of x for such a term is equal to the number of ε_k that are +1. Thus the number of terms that contribute to the coefficient of x^n in (9.27) is N!/(n!(N−n)!), the number of elements in the set K_n that appears in (9.3). In the case of (9.26), all versions of a_k are equal to q and all versions of b_k are equal to (1−q), so each term that contributes to x^n is q^n (1−q)^{N−n} x^n. As there are N!/(n!(N−n)!) of them, the coefficient of x^n in (9.25) is P_q(n) as claimed.

Now, in general, the characteristic polynomial for a probability function, P, on a subset of {0, 1, …} has the form

  ϕ(x) = P(0) + P(1) x + P(2) x² + ··· = ∑_n P(n) x^n .
(9.28)

Here are two of the salient features of this polynomial: First, the values of ϕ, its derivative, and its second derivative at x = 1 are

  1 = ϕ(1) ,  μ = (dϕ/dx)|_{x=1} ,  σ² = (d²ϕ/dx²)|_{x=1} − μ(μ−1) .
(9.29)

Second, the values of ϕ, its derivative, and its higher order derivatives at x = 0 determine P, since

  (1/n!) (d^nϕ/dx^n)|_{x=0} = P(n) .
(9.30)

To explain (9.29), note that ϕ(1) = P(0) + P(1) + ··· = 1. Meanwhile, the derivative of ϕ at x = 1 is 1·P(1) + 2·P(2) + ···, and this is the mean μ. With the help of (9.17), a very similar argument establishes the third point in (9.29).

In the case of the binomial distribution,

  dϕ/dx = Nq (qx + (1−q))^{N−1} .
(9.31)

Set x = 1 here to find the mean equal to Nq as claimed. Meanwhile,

  d²ϕ/dx² = N(N−1)q² (qx + (1−q))^{N−2} .
(9.32)

Setting x = 1 here finds the right-hand side of (9.29) equal to

  N(N−1)q² − N²q² + Nq = Nq(1−q) ,
(9.33)

which is the asserted value for σ².

For the Poisson probability function, the characteristic polynomial is the infinite power series

  ϕ(x) = P(0) + x P(1) + x² P(2) + ··· = (1 + λx + (1/2)λ²x² + (1/6)λ³x³ + ··· + (1/n!)λ^n x^n + ···) e^{−λ} .
(9.34)

As can be seen by replacing λ in (9.13) with λx, the sum on the right here is e^{λx}. Thus,

  ϕ(x) = e^{λ(x−1)} .
(9.35)

In particular, the first and second derivatives of this function at x = 1 are equal to λ and λ², respectively. With (9.29), this last fact serves to justify the claim that the mean and the square of the standard deviation for the Poisson probability are both equal to λ: indeed, μ = λ and σ² = λ² − λ(λ−1) = λ. The characteristic polynomial for a probability function is often used to simplify seemingly hard computations in the manner just illustrated.

g) Loose ends about counting elements in various sets

My purpose in this last section is to explain where the formulae in (9.3) and (9.5) come from. To start, consider (9.5). There are b choices for ε_1. With ε_1 chosen, there are b−1 choices for ε_2, one less than for ε_1 since we are not allowed to have these two equal. Given choices for ε_1 and ε_2, there are b−2 choices for ε_3. Continuing in this vein finds b−k choices available for ε_{k+1} if (ε_1, …, ε_k) have been chosen. Thus, the total number of choices is b(b−1)···(b−N+1), and this is the claim in (9.5).

To see how (9.3) arises, let me introduce the following notation: Let m_n(N) denote the number of elements in the (ε_1, …, ε_N) version of K_n. If we are counting elements in this set, then we can divide this version of K_n into two subsets, one where ε_1 = 1 and the other where ε_1 = −1. The number of elements in the first is m_{n−1}(N−1), since the (N−1)-tuple (ε_2, …, ε_N) must have n−1 occurrences of +1. The number in the second is m_n(N−1), since in this case the (N−1)-tuple (ε_2, …, ε_N) must have all of the n occurrences of +1. Thus, we see that

  m_n(N) = m_{n−1}(N−1) + m_n(N−1) .
(9.36)

This formula looks much like a matrix equation. Indeed, fix some integer T ≥ 1 and make a T-component vector, m(N), whose components are the values of m_n(N) for the cases 1 ≤ n ≤ T. Equation (9.36) asserts that m(N) = A m(N−1), where A is the matrix with A_{k,k} and A_{k,k−1} both equal to 1 and all other entries equal to zero. Iterating this equation then finds

  m(N) = A^{N−1} m(1) ,
(9.37)

where m(1) is the vector with top component 1 and all others equal to zero. Now, we don't have the machinery to realistically compute A^{N−1}, so instead let's just verify that the expression in (9.3) gives the solution to (9.36). In this regard, note that m(N) is uniquely determined by (9.37) for each N > 1 from m(1), and so if we believe that we have a set {m(1), m(2), …} of solutions, then we need only plug in our candidate and see if (9.36) holds. This is to say that in order to verify that (9.3) is correct, one need only check that the recursion in (9.36) holds. This amounts to verifying that

  N!/(n!(N−n)!) = (N−1)!/((n−1)!(N−n)!) + (N−1)!/(n!(N−n−1)!) .
(9.38)

I leave this to you as an exercise.
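The 'slick' computation can also be carried out symbolically. The sketch below (it uses the sympy library; the concrete N = 20 is my choice) builds the characteristic polynomials (9.26) and (9.35) and reads off μ and σ² via (9.29):

    import sympy as sp

    x, q, lam = sp.symbols('x q lam', positive=True)

    # Binomial: phi(x) = (q x + (1-q))^N, per (9.26), with N = 20.
    phi = (q*x + (1 - q))**20
    mu = sp.diff(phi, x).subs(x, 1)
    var = sp.diff(phi, x, 2).subs(x, 1) - mu*(mu - 1)
    print(sp.simplify(mu), sp.simplify(var))   # equals 20*q and 20*q*(1 - q)

    # Poisson: phi(x) = exp(lam*(x - 1)), per (9.35).
    phi_p = sp.exp(lam*(x - 1))
    mu_p = sp.diff(phi_p, x).subs(x, 1)
    var_p = sp.diff(phi_p, x, 2).subs(x, 1) - mu_p*(mu_p - 1)
    print(mu_p, sp.simplify(var_p))            # equals lam and lam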
Exercises:

1. Let A denote the 4×4 version of the matrix in (9.37). Thus,

        [1 0 0 0]
    A = [1 1 0 0]
        [0 1 1 0]
        [0 0 1 1] .

a) Present the steps of the reduced row echelon reduction of A to verify that it is invertible.
b) Find A^{−1} using Fact 2.3.5 of LA&A.

2. Let β denote a fixed number in (0, 1). Now define a probability function, P, on the set {0, 1, 2, …} by setting P(n) = (1−β) β^n.
a) Verify that P(0) + P(1) + ··· = 1, and thus verify that P is a probability function.
b) Sum the series P(0) + x P(1) + x² P(2) + ··· to verify that the characteristic function is ϕ(x) = (1−β)/(1−βx).
c) Use the formula in (9.29) to compute the mean and standard deviation of P.
d) In the case β = 1/2, the mean is 1 and the standard deviation is √2. As 6 ≥ μ + 3σ, the Theorem in Section e) asserts that the probability for the set {6, 7, …} should be less than 1/9. Verify this prediction by summing P(6) + P(7) + ···.
e) In the case β = 2/3, the mean is 2 and σ = √6. Verify the prediction of the Theorem in Section e) that {7, 8, …} has probability less than 1/4 by summing P(7) + P(8) + ···.

3. This exercise fills in some of the details in the verification of (9.3).
a) Multiply both sides of (9.38) by (n−1)!(N−n−1)! and divide both sides of the result by (N−1)!. Give the resulting equation.
b) Use this last result to verify (9.38).
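For parts d) and e) of Exercise 2, here is a short numeric cross-check (not a substitute for summing the series by hand):

    def tail(beta, start, terms=200):
        """Approximate P(start) + P(start+1) + ... for P(n) = (1-beta)*beta^n."""
        return sum((1 - beta) * beta**n for n in range(start, start + terms))

    print(tail(1/2, 6), 1/9)   # part d): the tail equals (1/2)^6 = 0.015625 < 1/9
    print(tail(2/3, 7), 1/4)   # part e): the tail equals (2/3)^7 ~ 0.0585 < 1/4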