Chapter 2  Bernoulli Random Variables in n Dimensions

1. Introduction

This chapter is dedicated to my STAT 305C students at Iowa State University in the Fall 2006 semester. It was, in no small part, their thoughtful questions throughout the course, especially in relation to histogram uncertainty, that convinced me to address the issues in this chapter in a rigorous way, and in a format that I believe is accessible to those who have a general interest in randomness.

There are many phenomena that involve only two possible recordable or measurable outcomes. Decisions ranging from the yes/no type to the success/failure type abound in everyday life. Will I get to work on time today, or won't I? Will I pass my exam, or won't I? Will the candidate get elected, or not? Will my friend succeed in her business, or won't she? Will my house withstand an earthquake of 6+ magnitude, or won't it? Will I meet an interesting woman at the club tonight, or won't I? Will my sister's cancer go into remission, or won't it? And the list of examples could go on for volumes. They all entail an element of uncertainty; else why would one ask the question? With enough knowledge, this uncertainty can be captured by an assigned probability for one of the outcomes. It doesn't matter which outcome is assigned the said probability, since the other outcome will then have a probability that is one minus the assigned probability. The act of asking any of the above questions, and then recording the outcome, is the essence of what in the realm of probability and statistics is termed a Bernoulli random variable, as now defined.

Definition 1.1 Let X denote a random variable (i.e. an action, operation, observation, etc.) the result of which is a recorded zero or one. Let the probability that the recorded outcome is one be specified as p. Then X is said to be a Bernoulli(p) random variable.
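Definition 1.1 describes a Bernoulli trial operationally: perform the action, record a 0 or a 1. A short simulation makes the role of the parameter p concrete; the sketch below (Python, with the function and variable names being my own) shows that over many repetitions the fraction of recorded ones settles near p.

```python
import random

def bernoulli(p, rng):
    """Perform one Bernoulli(p) trial: record 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)   # fixed seed so the run is reproducible
trials = [bernoulli(0.7, rng) for _ in range(10_000)]

# The empirical frequency of ones should be close to p = 0.7.
freq = sum(trials) / len(trials)
print(round(freq, 3))
```

Each recorded outcome is a number in {0, 1}; the probability p only shows itself in the long-run frequency, which is exactly the point of the definition.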
This definition specifically avoided the use of any real mathematical notation, in order to allow the reader not to be unduly distracted from the conceptual meaning of a Ber(p) random variable. While this works for a single random variable, when we address larger collections of them it is extremely helpful to have a more compact notation. For this reason, we now give a more mathematical version of the above definition.

Definition 1.2 Let X be a random variable whose sample space is S_X = {0,1}, and let p denote the probability of the set {1}. In compact notation, this is often written as Pr[X = 1] = p. Then X is said to be a Bernoulli(p), or, simply, a Ber(p) random variable.

Since this author feels that many people grasp concepts better with visuals, the probability structure of a Ber(p) random variable is shown in Figure 1. At one level, Figure 1 is very simple. The values that X can take on are included on the horizontal axis, and the probabilities associated with them are included on the vertical axis. However, conceptually, the implications of Figure 1 are deep.

Figure 1. The probability structure for a Ber(p=0.7) random variable. The horizontal axis includes the sample space of X; the vertical axis gives Pr[X = x].

X is a 1-dimensional (1-D) random variable, since the values that it can take on, which make up its sample space S_X = {0,1}, are simply numbers, or scalars. So, these numbers can be identified as a subset of the real line, which in Figure 1 is the horizontal axis. Since probabilities are also just numbers, they require only one axis, which in Figure 1 is the vertical line. But what if X were a 2-D random variable; that is, its sample space was a collection of ordered pairs? As we will see presently, then we would need to use a plane (i.e. an area associated with, say, a horizontal line and a vertical line). In that case, the probabilities would have to be associated with a third line (e.g.
a line coming out of the page). To summarize this concept, the probability description for any random variable requires that one first identify its sample space. In the case of Figure 1, that entailed drawing a line, and then marking the values zero and one on that line. Second, one then associates probability information with the sample space. In the case of Figure 1, that entailed drawing a line perpendicular to the first line, and including numerical probabilities associated with zero and one.

Another conceptually deep element of Figure 1 is an element that Figure 1 (as does almost any probability figure in any textbook in the area) fails to highlight. It is the fact that, in Figure 1, the probability 0.7 is not, I repeat, NOT the probability associated with the number 1. Rather, it is the probability associated with the set {1}. While many might argue that this distinction is overly pedantic, I can assure you that ignoring this distinction is, in my opinion, one of the most significant sources of confusion for students taking a first course in probability and statistics (and even for some students in graduate level courses I have taught). Ignoring this distinction in the 1-D case shown in Figure 1 might well cause no problems. But ignoring it for higher-dimensional cases can result in big problems. So, let's get it straight here and now.

Definition 1.3 The probability entity Pr(•) is a measure of the size of a set.

In view of this definition, Pr(1) makes no sense, since 1 is a number, not a set. However, Pr({1}) makes perfect sense, since {1} is a set (as denoted using { }), and this set contains only the number 1. Since Pr(A) measures the "size" of a set A, we can immediately apply natural reasoning to arrive at what some books term "axioms of probability". These include the following:

Axiom 1. Pr(S_X) = 1.

Axiom 2. Pr(∅) = 0, where ∅ = { }; that is, ∅ is the empty set.

Axiom 3. Let A and B be two subsets of S_X. Then Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
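For a finite sample space, a probability measure can be represented as a sum of point masses, and the three axioms can then be checked directly with ordinary set operations. The sketch below (Python; the dictionary `masses` and the function name `prob` are my own) does this for the Ber(0.7) variable of Figure 1.

```python
# Point masses for the Ber(0.7) variable of Figure 1: Pr({0}) = 0.3, Pr({1}) = 0.7.
masses = {0: 0.3, 1: 0.7}

def prob(event):
    """Pr(A): the 'size' of the set A, computed as the sum of its point masses."""
    return sum(masses[x] for x in event)

S_X = {0, 1}
A, B = {0}, {0, 1}   # two subsets of S_X (B happens to be S_X itself)

print(round(prob(S_X), 12))   # Axiom 1: Pr(S_X) = 1
print(prob(set()))            # Axiom 2: Pr(empty set) = 0 (an empty sum)
# Axiom 3: Pr(A u B) = Pr(A) + Pr(B) - Pr(A n B), up to float rounding
print(abs(prob(A | B) - (prob(A) + prob(B) - prob(A & B))) < 1e-12)
```

Note that `prob` is applied only to sets, never to bare numbers, which mirrors the point of Definition 1.3.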
The first axiom simply says that when one performs the action and records a resulting number, the probability that the number is in S_X must equal one. When you think about it, by definition, it cannot be a number that is not in S_X. The second axiom simply states that the probability that you get no number when you perform the action and record a number must be zero. To appreciate the reasonableness of the third axiom, we will use the visual aid of the Venn diagram shown in Figure 2.

Figure 2. The yellow rectangle corresponds to the entire sample space, S_X. The "size" (i.e. probability) of this set equals one. The blue and red circles are clearly subsets of S_X. The probability of A is the area in blue. The probability of B is the area in red. The black area where A and B intersect is equal to Pr(A ∩ B).

Since Pr(•) is a measure of size, it can be visualized as area, as is done in Figure 2. Imagining the sample space, S_X, to be the interior of the rectangle, it follows that the area shown in yellow must be assigned a value of one. The circle in blue has an area whose size is Pr(A), and the circle in red has a size that is Pr(B). These two circles have a common area, as shown in black, and that area has a size that is Pr(A ∩ B). Finally, it should be mentioned that the union of two sets is, itself, a set. And that set includes all the elements that are in either set. If there are elements that are common to both of those sets, it is a mistake to misinterpret that to mean that those elements are repeated twice (once in each set). They are not repeated. They are simply common to both sets.

Clearly, if sets A and B have no common elements, then A ∩ B = ∅. Hence, from Axiom 2, the rightmost term in Axiom 3 is zero. In relation to Figure 2 above, that would mean that the blue and red circles did not intersect. Hence, the area associated with their union would simply be the sum of their areas. We will encounter this situation often in this chapter.
For this reason, we now formally state this as a special case of Axiom 3.

Axiom 3' (A Special Case): Let A and B be two subsets of S_X. If A ∩ B = ∅, then Pr(A ∪ B) = Pr(A) + Pr(B).

We are now in a position to address the above axioms and underlying concepts in relation to the Ber(p) random variable, X, whose sample space is S_X = {0,1}. To this end, let's begin by identifying all the possible subsets of S_X. Since S_X has only two elements in it, there are four possible subsets of this set. These are {0}, {1}, S_X and ∅. The first two sets here are clearly subsets of S_X. The set S_X is also, formally speaking, a subset of itself. However, since this subset is, in fact, the set itself, it is sometimes called an improper subset. Nonetheless, it is a subset of S_X. The last subset of S_X, namely the empty set, ∅, is simply, by definition, a subset of any set. Even so, it has a real significance, as we will presently describe. And so, the collection of all the possible subsets of S_X = {0,1} is the following set:

A_X = { {0}, {1}, S_X, ∅ }.

It is crucially important to understand that A_X is, itself, a set. And the elements of this set are, themselves, sets. Why is this of such conceptual importance? It is because Pr(•) is a measure of the "size" of a set. Hence, Pr(•) measures the size of the elements of A_X. It does not measure the size of the elements of S_X = {0,1}, since the elements of this set are numbers, and not sets.

In relation to Figure 1, we have the following results:

(i) Pr({0}) = 0.3;
(ii) Pr({1}) = 0.7;
(iii) Since {0} ∩ {1} = ∅, we have Pr({0} ∪ {1}) = Pr({0}) + Pr({1}) = 0.3 + 0.7 = 1.0;
(iv) Since {0} ∪ {1} = {0,1} = S_X, we could also arrive at the rightmost value, 1.0, in (iii) via Axiom 1; namely, Pr({0,1}) = Pr(S_X) = 1.0.

The practical beauty of the set A_X is that any question one could fathom in relation to X can be identified as one of the elements of A_X.
Here are some examples:

Question 1: What is the probability that you either fail ({0}) or succeed ({1}) in understanding this material? Well, since "or" represents a union set operation, the "event" that you either fail or succeed is simply {0} ∪ {1} = S_X, which is an element of A_X.

Question 2: What is the probability that you fail? Since here "failure" has been identified with the number 0, the "event" that you fail is a set that includes only the number 0; that is, {0}. And, of course, this set is in A_X.

Question 3: What is the probability that you only partially succeed in understanding this material? Well, our chosen sample space does not recognize partial success. It has only two elements in it: 0 = failure, and 1 = success. And so, while this is a valid question for one to ask, the element of A_X that corresponds to this event of partial success is the empty set, ∅. So, the probability that you partially succeed in this setting is zero.

2. Two-Dimensional Bernoulli Random Variables

It might seem to some (especially those who have some background in probability and statistics) that the developments in the last section were belabored and overly pedantic or complicated. If that is the case, wonderful! Those individuals should then have no trouble in following this and subsequent sections. If, on the other hand, some troubles are encountered, then it is suggested that these individuals return to the last section and review it. For all of the basic concepts covered there are simply repeated in this and future sections, albeit in two dimensions. However, in fairness, it should be mentioned that the richness of this topic is most readily exposed in the context of not one, but two random variables. It is far more common to encounter situations where the relationship between two variables is of primary interest, as opposed to the nature of a single variable. In this respect, this section is distinct from the last.
It requires that the reader take a different perspective on the material.

Definition 2.1 Let X1 ~ Ber(p1) and X2 ~ Ber(p2) be Bernoulli random variables. Then the 2-dimensional (2-D) random variable X = (X1, X2) is said to be a 2-D Bernoulli random variable.

The first item to address in relation to any random variable is its sample space. The possible values that the 2-D variable X = (X1, X2) can take on are not numbers, but, rather, ordered pairs of numbers. Hence, the sample space for X is

S_X = { (0,0), (1,0), (0,1), (1,1) }.   (2.1)

Key things to note here include the fact that since X is 2-D, its sample space is contained in the plane, and not the line. Hence, to visualize its probability description will require three dimensions. Also, since S_X now has 4 elements (as opposed to 2 elements in the 1-D case), its probability description will require the specification of 3 probabilities (not only one, as in the 1-D case). Define the following probabilities:

p_ij = Pr( { (i,j) } ) ;  i, j ∈ {0,1}.   (2.2)

Even though (2.2) defines four probabilities (p00, p10, p01, p11), in view of Axiom 1 above, only three of these four quantities need be specified, since the fourth must be one minus the sum of the other three.

Figure 3. Visual description of the probability structure of a 2-D Bernoulli random variable: a lump of probability p_ij sits above each point (i,j) of the sample space in the (x1, x2) plane.

Having defined the sample space for X, and having a general idea of what its probability description is, the next natural step is to identify all the possible subsets of (2.1). Why? Because, remember, any question one can fathom to ask in relation to X corresponds to one of these subsets. And so, having all possible subsets of S_X in hand can give confidence in answering any question that one might pose. It can also illuminate questions that one might not otherwise contemplate asking. Since this set contains 4 elements, the total number of subsets of this set will be 2^4 = 16.
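The count 2^4 = 16 can be confirmed programmatically. The following sketch (Python, using the standard itertools module; the variable names are my own) enumerates every subset of the sample space (2.1):

```python
from itertools import combinations

# The sample space (2.1) of the 2-D Bernoulli random variable X = (X1, X2).
S_X = [(0, 0), (1, 0), (0, 1), (1, 1)]

# All subsets: for each size k = 0..4, every way of choosing k elements.
subsets = [set(c) for k in range(len(S_X) + 1)
                  for c in combinations(S_X, k)]

print(len(subsets))   # 2**4 = 16, from the empty set up to S_X itself
```

Running this prints 16: one empty set, four singletons, six 2-element sets, four 3-element sets, and S_X itself.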
Let's carefully develop this collection, since doing so illustrates a procedure that can be used for higher-dimensional variables as well.

A procedure for determining the collection, A_X, of all the subsets of (2.1):

(i) All sets containing only a single element: {(0,0)}, {(1,0)}, {(0,1)}, {(1,1)}.

(ii) All sets containing two elements:
- pair (0,0) with each of the 3 elements to its right: {00, 10}, {00, 01}, {00, 11}
- pair (1,0) with each of the two elements to its right: {10, 01}, {10, 11}
- pair (0,1) with the one remaining element to its right: {01, 11}
[Notation: for simplicity we use 10 to mean the element (1,0), etc.]

(iii) All sets containing 3 elements:
- pair {00, 10} with the first element to its right: {00, 10, 01}
- pair {00, 10} with the second element to its right: {00, 10, 11}
- pair {00, 01} with the element to the right of 01: {00, 01, 11}
- pair {10, 01} with the element to its right: {10, 01, 11}

(iv) S_X and ∅.

If you count the total number of sets in (i)–(iv) you will find there are 16. Specifically,

A_X = { {00}, {10}, {01}, {11}, {00,10}, {00,01}, {00,11}, {10,01}, {10,11}, {01,11}, {00,10,01}, {00,10,11}, {00,01,11}, {10,01,11}, S_X, ∅ }.   (2.3)

It is important to note that the four singleton sets {(0,0)}, {(1,0)}, {(0,1)} and {(1,1)} have no elements in common with one another. Since they are each a 1-element set, to say that two of them have an element in common would be to say that they each contain one and the same element. While the ordered pairs (0,0) and (0,1) do, indeed, have the same first coordinate, their second coordinates are different. As shown in Figure 3, they are two distinctly separate points in the plane. Thus, the intersection of the sets {(0,0)} and {(0,1)} is the empty set. A second point to note is that any element (i.e. set) in the collection (2.3) can be expressed as a union of two or more of these disjoint singleton sets. For example, {(0,0), (1,1)} = {(0,0)} ∪ {(1,1)}.
Hence, from Axiom 3' above,

Pr( {(0,0), (1,1)} ) = Pr( {(0,0)} ∪ {(1,1)} ) = Pr( {(0,0)} ) + Pr( {(1,1)} ) = p00 + p11.

It follows that if we know the probabilities of the singleton sets, then we can compute the probability of any set in A_X. We now state this in a formal way.

Fact: The probability structure of a 2-D Bernoulli random variable is completely specified when 3 of the 4 probabilities {p00, p10, p01, p11} are specified.

In view of this fact, and the above Definition 2.1, it should be apparent that Definition 2.1 is incomplete, in the sense that it does not define a unique 2-D Bernoulli random variable. This is because in that definition only two parameters were specified; namely, p1 and p2. Even so, the given definition is a natural extension of the definition of a 1-D Bernoulli random variable. We now offer an alternative to Definition 2.1 that does completely and unambiguously define a 2-D Bernoulli random variable.

Definition 2.1' The random variable X = (X1, X2) is said to be a completely defined 2-D Bernoulli random variable if its sample space is S_X = {(0,0), (1,0), (0,1), (1,1)} and if any three of the four singleton set probabilities {p00, p10, p01, p11} are specified.

This alternative definition eliminates the lack of complete specification of the 2-D Bernoulli random variable, but at the expense of not seeming to be a natural extension of the 1-D random variable. Now, let's address the question of how the specification of {p00, p10, p01, p11} leads to the specification of p1 and p2. To this end, it is of crucial conceptual importance to understand what is meant when one refers to "the event that X1 equals one" within the 2-D framework. Remember: ANY question one can ask in relation to X = (X1, X2) can be identified as one unique set in the collection of sets given by (2.3). This includes questions such as: what is the probability that X1 equals one?
In the 2-D sample space for X, "the event that X1 equals one" (often written as [X1 = 1]) is the set {(1,0), (1,1)}. This set includes all elements whose first coordinate is 1, but whose second coordinate can be anything. Why? Because there was no mention of X2; only X1. If you are having difficulty with this, then consider when you were first learning about x, y and graphing in high school math. If there is no y, then you would identify the relation x = 1 as just the point 1.0 on the x-axis. However, in the x-y plane, the relation x = 1 is a vertical line that intersects the x-axis at the location 1.0. You are allowing y to be anything, because no information about y was given.

And so, we have the following relation between p1 = Pr[X1 = 1] and {p00, p10, p01, p11}:

p1 = Pr[X1 = 1] = Pr( {(1,0), (1,1)} ) = Pr({(1,0)}) + Pr({(1,1)}) = p10 + p11 = Σ_{j=0}^{1} p_{1j}.   (2.4a)

Similarly,

p2 = Pr[X2 = 1] = Pr( {(0,1), (1,1)} ) = Pr({(0,1)}) + Pr({(1,1)}) = p01 + p11 = Σ_{i=0}^{1} p_{i1}.   (2.4b)

From (2.4) we observe more of the missing details when one specifies only p1 and p2 in relation to a 2-D Bernoulli random variable. If these parameters are specified, then one still needs to specify one of the four parameters {p00, p10, p01, p11} for a complete, unambiguous description of the probability structure of X = (X1, X2).

There is one common situation where specification of only p1 and p2 is sufficient to completely specify the probability structure of X = (X1, X2). It is the situation where X1 ~ Ber(p1) and X2 ~ Ber(p2) are statistically independent. In more simple terms, this situation is one wherein knowledge of the value of X1 has no influence on the probability that X2 equals any specified value. For example, if you toss a coin and you read "heads" (map "heads" to 1), then that result in no way changes the probability that you will get a "heads" on a second flip, does it?
If we agree that it does not, then the 2-coin flip is an example of a 2-D Bernoulli random variable where the two components of X = (X1, X2) are statistically independent. As another example, consider parts inspection at the delivery dock of a company. If a randomly selected part passes inspection, it is natural to assume that the probability that the next selected part passes is not influenced by knowledge that the first part passed. While this is a natural assumption in parts inspection protocols, it may not necessarily be true. If the parts were manufactured in the presence of an unknown systematic manufacturing malfunction, then the fact that the first part passed may well influence the probability that the second part passes; for example, if there is only one good part, then the probability that the second part passes, given that the first part passed, will be zero. We will presently address the mathematical details of what is required for X1 and X2 to be assumed to be statistically independent. However, in order to expedite that investigation, it is appropriate to first address yet another source of major confusion to novices in this area.

Unions and Intersections, And's and Or's, and Commas and Commas

It should be apparent by now that probability is intimately related to sets. As noted above, it is, in fact, a well-defined measure of the "size" of a set. Yet, as was noted above, the vast majority of textbooks dealing with probability and statistics use a notation that, at the very least, de-emphasizes the notion of sets. For example, in the case of the 2-D Bernoulli random variable X = (X1, X2), most books will use notation such as

Pr[X1 = 1] = p1.   (2.5a)

If one realizes that Pr(A) measures the "size" (i.e. probability) of the set A, then it must be that, depending on how you read (2.5a), either X1 = 1 is a set, or [X1 = 1] is a set.
In either case, it is very likely that a student who has had some prior exposure to sets has never seen either one of the above expressions for a set. To this point, we have been using a more common notation for a set; namely the {•} notation. Let's rewrite (2.5a) in this more common notation:

p1 = Pr( { (1,0), (1,1) } ).   (2.5b)

There is nothing ambiguous or vague about (2.5b). The set in question has two elements in it; namely the element (1,0) and the element (1,1). In particular, (1,0) is an element, and not a set. Whereas { (1,0) } is a set, and that set contains the element (1,0). One might argue that (2.5a) is clear, and that (2.5b) involves too many unnecessary symbols that can cause confusion. However, let's consider the following probability:

Pr[ X1 = 1 ∩ X2 = 1 ].   (2.5c)

This expression includes a set operation symbol, namely the intersection symbol ∩. This suggests that X1 = 1 and X2 = 1 are sets. Moreover, (2.5c) suggests that these two sets may have elements in common. But what exactly is the set X1 = 1? Well, if we ignore X2, then we have only a 1-D Bernoulli random variable, whose sample space is {0,1}. In that case, the expression X1 = 1 means the set {1}. However, if we include X2 in our investigation, then the expression X1 = 1 means { (1,0), (1,1) }. These are two distinctly different sets, and yet each set is expressed as X1 = 1. A seasoned student of probability might argue that one must keep in mind the entire setting when interpreting the meaning of X1 = 1. However, for a student who has no prior background in the field, it is often not so easy to keep the entire setting in mind. Before we can reconcile this ambiguity, we need to first address the set notion of a union (∪). The union of two sets is a set whose elements include all of the elements of each set, but where elements common to the two sets are not counted twice.
For example,

{ (1,0), (1,1) } ∪ { (0,1), (1,1) } = { (1,0), (0,1), (1,1) }.   (2.5d)

Notice that the common element (1,1) is not counted twice in this union of the sets. Unfortunately, the same type of notation used in (2.5c) for intersections is commonly also used for unions. Specifically, we have the expression

Pr[ X1 = 1 ∪ X2 = 1 ].   (2.5e)

Since (2.5c) and (2.5e) each involve a set operation, the above discussion related to the ambiguity and vagueness of expressions such as X1 = 1 applies to (2.5e). Now let's reconcile these two types of expressions. In doing so, we will discover that there are commas, and there are commas. To this end, we will express (2.5c) in the unambiguous notation associated with a 2-D Bernoulli random variable:

Pr[ X1 = 1 ∩ X2 = 1 ] = Pr( { (1,0), (1,1) } ∩ { (0,1), (1,1) } ) = Pr( { (1,1) } ) = p11.   (2.5f)

The leftmost expression in (2.5f) is ambiguous when not accompanied by a note that we are dealing only with a 2-D random variable. (What if we actually had a 3-D random variable?) The middle expression is unambiguous. Furthermore, any student with even a cursory exposure to sets would be able to identify the single element, (1,1), that is common to both sets. The equality of the leftmost and rightmost expressions reveals that in this 2-D framework, we can refer to the element (1,1) as "the element whose first component is one and whose second component is one." Hence the comma that separates the two components of the element (1,1) may be read as an "and" comma. Similarly, rewriting (2.5e) and referring to (2.5c) gives

Pr[ X1 = 1 ∪ X2 = 1 ] = Pr( {(1,0)} ∪ {(0,1)} ∪ {(1,1)} ) = Pr( { (1,0), (0,1), (1,1) } ).   (2.5g)

Hence, the commas that separate the elements (1,0), (0,1) and (1,1) may be read as "or" commas. After a bit of reflection, the reader may find all of this to be obvious. In that event, this brief digression will have served its purpose. In that case, let's proceed to the following examples to further assess the reader's grasp of this topic.
Example 2.1 Let X = (X1, X2) ~ Ber({p00, p10, p01, p11}). Notice that there is no assumption of independence here.

(a) Clearly state the sets corresponding to the following events:

(i) [X1 < X2]: Answer: {(x1, x2) ∈ S_X | x1 < x2} = { (0,1) }.
(ii) [X1 ≤ X2]: Answer: {(x1, x2) ∈ S_X | x1 ≤ x2} = { (0,0), (0,1), (1,1) }.
(iii) [X1 = X2]: Answer: {(x1, x2) ∈ S_X | x1 = x2} = { (0,0), (1,1) }.
(iv) [ |X1 − X2| = 1 ]: Answer: {(x1, x2) ∈ S_X | |x1 − x2| = 1} = { (0,1), (1,0) }.

(b) Compute the probabilities of the events in (a), in terms of {p00, p10, p01, p11}:

(i) Pr[X1 < X2]: Answer: Pr( { (0,1) } ) = p01.
(ii) Pr[X1 ≤ X2]: Answer: Pr( { (0,0), (0,1), (1,1) } ) = p00 + p01 + p11.
(iii) Pr[X1 = X2]: Answer: Pr( { (0,0), (1,1) } ) = p00 + p11.
(iv) Pr[ |X1 − X2| = 1 ]: Answer: Pr( { (0,1), (1,0) } ) = p01 + p10.

Hopefully, the reader felt that the answers in the above example were self-evident, once the sets in question were clearly described as such. The next example is similar to the last one. However, it extends the conceptual understanding of this topic to arrive at a very important and useful quantity; namely the cumulative distribution function (CDF).

Example 2.2 Again, let X = (X1, X2) ~ Ber({p00, p10, p01, p11}). Now, let (x1, x2) be any pair of real numbers (i.e. any point in the plane). Notice here that (x1, x2) is not constrained to be an element of S_X.

(a) Develop an expression for Pr[X1 ≤ x1] = F_X1(x1) as a function of x1, while ignoring X2.

Solution: If we want to, we can approach this problem in exactly the same manner as in the above example; namely, by clearly describing the set corresponding to the expressed event [X1 ≤ x1]; namely, [X1 ≤ x1] = {(u, v) ∈ S_X | u ≤ x1}. However, since our interest here is only in the random variable X1, whose sample space is extremely simple (i.e. {0,1}), we will choose that simpler approach. The p-value for this random variable in terms of the 2-D probabilities is given above in (2.4a).
The probability description for X1 ~ Ber(p1 = 0.7) was shown in Figure 1 above. But that figure only utilized x-values in the range [0, 2]. The expression we are to develop here should consider any value of x1. The following expression is hopefully clear from Figure 1:

F_X1(x1) = 0 for x1 < 0;  F_X1(x1) = 1 − p1 for 0 ≤ x1 < 1;  F_X1(x1) = 1 for x1 ≥ 1.   (2.6)

This expression is plotted below for the value p1 = 0.7.

Figure 4. Graph of Pr[X1 ≤ x1] = F_X1(x1) given by (2.6).

So, how exactly did Figure 4 arise from Figure 1? Well, Figure 1 shows where the "lumps" of probability are, and also gives the values of these "lumps". For example, the "lump" at location x1 = 0 is the probability Pr[X1 = 0] = Pr({(0,0), (0,1)}) = 0.3. If the reader is confused by the fact that Figure 1 is for the Ber(p) random variable, X, while Figure 4 is for the Ber(p1) random variable, X1, it should be remembered that when we discussed Figure 1 there was only one random variable. However, now there are two. And so, now, we need some way to distinguish one from the other. Nonetheless, both are Bernoulli random variables. And so, both will have the general probability structure illustrated in Figure 1, albeit with possibly differing p-values.

So, again: how did Figure 4 arise from Figure 1? The key to answering this question is to observe that Pr[X1 ≤ x1] = F_X1(x1) is the totality of the probability associated with the interval (−∞, x1]. Hence, as long as x1 < 0, the value of F_X1(x1) will be zero, since the first "lump" of probability is at x1 = 0. So, at this location, F_X1(x1) will experience an increase in probability, in the amount Pr[X1 = 0] = 1 − p1. This increase, or jump, in F_X1(x1) is shown in Figure 4. As we allow x1 to continue its travel to the right of zero, since there are no lumps of probability in the interval (0,1), the value of F_X1(x1) will remain at the value 1 − p1 throughout this region.
When x1 = 1, the value of F_X1(x1) will increase by an amount p1, since that is the value of the "lump" of probability at this location: Pr[X1 = 1] = p1. Hence, when x1 = 1 we have F_X1(1) = 1.0. In words, the probability that the random variable X1 takes on a value less than or equal to one is 1.0. It follows that there are no more "lumps" of probability to be "accumulated" as x1 continues its travel to the right beyond the number 1. This is the reason that F_X1(x1) remains flat to the right of x1 = 1 in Figure 4. It is this "accumulating" feature of F_X1(x1), as x1 "travels" from left to right, that is responsible for the following definition.

Definition 2.3 Let X be any 1-D random variable. Then Pr[X ≤ x] = F_X(x) is called the cumulative distribution function (CDF) for X.

Example 2.2 continued: (b) Develop an expression for Pr[X1 ≤ x1] = F_X1(x1) as a function of x1, while not ignoring X2.

Solution: Again, as in (a), we write [X1 ≤ x1] = {(u, v) ∈ S_X | u ≤ x1} = {(u,0), (u,1) | u ≤ x1}. As we compute the probability of this set, let's actually identify the actual set that corresponds to a given value of x1:

(i) for x1 < 0: [X1 ≤ x1] = ∅, so Pr[X1 ≤ x1] = 0;

(ii) for 0 ≤ x1 < 1: [X1 ≤ x1] = {(0,0), (0,1)}, so Pr[X1 ≤ x1] = Σ_{j=0}^{1} p_{0j};

(iii) for x1 ≥ 1: [X1 ≤ x1] = {(0,0), (0,1), (1,0), (1,1)}, so Pr[X1 ≤ x1] = Σ_{j=0}^{1} p_{0j} + Σ_{j=0}^{1} p_{1j}.

Notice that the rightmost expressions in (ii) and (iii) above are summations. It is fair to argue that the summation notation is unduly heavy, in the sense that (ii), for example, could have been written more simply as p00 + p01. Not only is this a fair argument, it points to yet another example where the biggest stumbling block for a novice might be the notation, and not the concept. However, in this particular situation the summation notation was chosen (at the risk of frustrating some novices) in order to highlight a concept that is central in dealing with two (or more) random variables.
We now state this concept for the more general case of two random variables, say, X and Y, whose joint probability structure is specified by a collection of joint probabilities, say, {p_xy, where x ∈ S_X and y ∈ S_Y}.

Fact 2.1 Consider a 2-D random variable (X, Y) having a discrete 2-D sample space S_(X,Y) = {(x, y) | x ∈ S_X and y ∈ S_Y} = S_X × S_Y, and corresponding joint probabilities {p_xy, where x ∈ S_X and y ∈ S_Y}. Then Pr[X = x] = Σ_{y ∈ S_Y} p_xy.

In many books on the subject, Fact 2.1 is stated as a theorem, and often it is accompanied by a proof. However, we do not believe that this fact is worthy of the theorem label. It is an immediate consequence of the realization that the set [X = x] = {(x, y) | y ∈ S_Y} is a subset of S_X × S_Y.

A reader who has had a course in integral calculus might recognize that integration is synonymous with accumulation. The above Fact 2.1 says, in words: to obtain Pr[X = x], integrate the joint probabilities over the values of y. For the benefit of such readers, consider the following example.

Example 2.3 Consider a random variable, say, U, whose sample space is S_U = [0,1] (i.e. the closed interval with left endpoint 0, and with right endpoint 1). Furthermore, assume that U has a uniform probability distribution on this interval. Call this distribution f_U(u). The meaning of the term uniform here is that the probability of any sub-interval of S_U = [0,1] depends only on the width of that interval, and not on its location. For a subinterval of width 0.1 (be it the interval (0, 0.1) or (0.2, 0.3), or [0.8, 0.9]), the probability that U falls in the interval is 0.1. This distribution is shown in Figure 5 below. It follows that the probability that U falls in the interval [0, u] is equal to u. Another way of expressing this is Pr[U ≤ u] = u. But this is exactly the definition of the CDF for U. And so Pr[U ≤ u] = F_U(u) = u. This CDF is also shown in Figure 5 below. Notice that this CDF is linear in u, and has a slope equal to 1.0.
The derivative of this CDF is, therefore, just its slope, which is exactly f_U(u) = 1. Hence, here, we can conclude that f_U(u) is the derivative of F_U(u); or, equivalently, F_U(u) is the integral of f_U(u).

Figure 5. Graphs of the CDF F_U(u) (thick line) and its derivative f_U(u) (thin line).

The above example is a demonstration of the following general definition, which holds for any random variable.

Definition 2.4 Let W be any random variable, and let F_W(w) = Pr[W ≤ w] be its cumulative distribution function (CDF). Then the (possibly generalized) derivative of F_W(w) is f_W(w), which is called the (possibly generalized) probability density function (PDF) for W.

In Example 2.3 above, indeed, the derivative of the CDF F_U(u) = u is exactly the PDF f_U(u) = 1. However, in the case of X ~ Ber(p), with a CDF having the general structure illustrated in Figure 4 above, we see that the CDF has a slope equal to zero except at the jump locations. And at these locations the slope is infinite (or, if you like, undefined). What is the derivative of such a function? Well, properly speaking, the derivative does not exist at the jump locations. Hence, properly speaking, X ~ Ber(p) does not have a PDF. However, "generally speaking" (i.e. in the generalized sense) we can say that its derivative has the form illustrated in Figure 1 above. Specifically, the PDF is identically zero, except at the jump locations, where it contains "lumps" of probability. [For those readers who are familiar with Dirac-δ functions, these lumps are, in fact, weighted δ-functions whose weights are the probability values.] The key points here are two:

Key Point #1: Every random variable has a well-defined CDF.

Key Point #2: If the CDF is not differentiable then, properly speaking, the PDF does not exist. Nonetheless, if we allow generalized derivatives, then it does exist everywhere except at a discrete number of locations.
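Key Point #1 can be made concrete in a few lines of code. The Python sketch below (a translation of the idea, not part of the chapter's Matlab material) implements the step-function CDF of the Ber(p = 0.7) variable of Figure 1, and recovers the two "lumps" of probability as the sizes of its jumps.

```python
def ber_cdf(x, p=0.7):
    """CDF of X ~ Ber(p): a step function that is flat everywhere,
    with a jump of size 1-p at x = 0 and a jump of size p at x = 1
    (compare Figure 4)."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0

vals = [ber_cdf(x) for x in (-1.0, 0.0, 0.5, 1.0, 2.0)]

# The jump sizes at 0 and 1 recover the "lumps" of probability:
jump_at_0 = ber_cdf(0.0) - ber_cdf(-1e-9)      # 1 - p = 0.3
jump_at_1 = ber_cdf(1.0) - ber_cdf(1 - 1e-9)   # p = 0.7
```

The function is well-defined for every real x, even though its derivative exists only in the generalized (δ-function) sense.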
In the next chapter we will discuss the relation between the CDF and PDF of a wide variety of random variables. However, for the time being, let's return to Bernoulli random variables. In particular, there are two topics that still need to be addressed before we move on to n-D Bernoulli random variables. One is the topic of statistical independence, and the second is the topic of conditional probability. As we shall see shortly, these two topics are strongly connected.

Definition 2.5 Let (X,Y) be a 2-D random variable with sample space S_(X,Y). Let A be a subset of this space that relates only to X, and let B be a subset that relates only to Y. Then the subsets (i.e. events) A and B are said to be (statistically) independent events if Pr(A ∩ B) = Pr(A) Pr(B). If all events relating to X are independent of all events relating to Y, then the random variables X and Y are said to be (statistically) independent.

Before we investigate just exactly how the notion of statistical independence relates to a 2-D Bernoulli random variable, let's demonstrate its practical implications in an example.

Example 2.4 Consider the act of tossing a fair coin twice. Let Xk correspond to the action that is the kth toss, and let a "heads" correspond to one and a "tails" correspond to zero. Then X = (X1, X2) is a 2-D Bernoulli random variable. Since the coin is assumed to be a fair coin, we have p1 = Pr[X1 = 1] = Pr{(1,0), (1,1)} = 0.5 and p2 = Pr[X2 = 1] = Pr{(0,1), (1,1)} = 0.5. But because the coin is fair, each of the four possible outcomes {(0,0)}, {(1,0)}, {(0,1)}, {(1,1)} should have the same probability. Hence, Pr{(1,1)} = 0.25. Rewriting this probability in the usual notation gives

Pr{(1,1)} = 0.25 = Pr[X1 = 1 ∩ X2 = 1] = 0.5 · 0.5 = Pr[X1 = 1] Pr[X2 = 1].

So we see that the events [X1 = 1] and [X2 = 1] are statistically independent. In exactly the same manner, one can show that all of the events related to X1 (i.e. [X1 = 0] and [X1 = 1]) are independent of all the events related to X2 (i.e.
[X2 = 0] and [X2 = 1]). We can conclude that the assumption of a fair coin, and in particular, that the above four outcomes have equal probability, is equivalent to the assumption that X1 and X2 are statistically independent.

Now, let's look more closely at a 2-D Bernoulli random variable X = (X1, X2) with specified probabilities {p00, p10, p01, p11}. Without loss of generality, let's assume the first three probabilities have been specified. Then p11 = 1 − (p00 + p10 + p01). We now address the question:

UNDER WHAT CONDITIONS ARE THE EVENTS [X1 = 1] AND [X2 = 1] INDEPENDENT?

ANSWER: Let's first express the condition for independence in terms of the usual notation. Then we will translate the condition in terms of sets. These events are independent if:

Pr[X1 = 1 ∩ X2 = 1] = Pr[X1 = 1] Pr[X2 = 1].   (2.7a)

In terms of sets, (2.7a) becomes

Pr{(1,1)} = Pr{(1,0), (1,1)} · Pr{(0,1), (1,1)}.   (2.7b)

In terms of the specified probabilities, (2.7b) becomes

p11 = (p10 + p11)(p01 + p11).   (2.7c)

Even though (2.7c) is the condition on the specified probabilities for these events to be independent, we can arrive at a simpler expression by using the fact that p11 = 1 − (p00 + p10 + p01). First, let's expand (2.7c) as

p11 = p10 p01 + p11 (p10 + p01 + p11) = p10 p01 + p11 (1 − p00).   (2.7d)

Subtracting p11 from each side of (2.7d), and rearranging terms, gives

p00 p11 = p10 p01.   (2.7e)

Equation (2.7e) is the condition needed for the events [X1 = 1] and [X2 = 1] to be independent. Using exactly the same procedure, one can show that (2.7e) is also the condition for independence of the other pairs of events. We state this formally in the following fact.

Fact 2.2 The components of the 2-D Bernoulli random variable X = (X1, X2) with specified probabilities {p00, p10, p01, p11} are statistically independent if and only if the condition p00 p11 = p10 p01 holds.

Example 2.4 above is a special case of this fact. Since we assumed p00 = p10 = p01 = p11 = 0.25, clearly, the above condition holds.
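Fact 2.2 reduces checking independence to a single multiplication. A minimal Python sketch (the second set of probability values below is made up purely for illustration):

```python
def bernoulli_2d_independent(p00, p10, p01, p11, tol=1e-12):
    """Fact 2.2: the components of a 2-D Bernoulli random variable are
    independent if and only if p00*p11 == p10*p01 (up to rounding)."""
    return abs(p00 * p11 - p10 * p01) < tol

# The fair-coin case of Example 2.4: all four probabilities equal 0.25.
fair = bernoulli_2d_independent(0.25, 0.25, 0.25, 0.25)       # True

# A dependent case: the mass is concentrated on (0,0) and (1,1),
# so the components tend to agree far more often than independence allows.
dependent = bernoulli_2d_independent(0.45, 0.05, 0.05, 0.45)  # False
```

In the dependent case p00·p11 = 0.2025 while p10·p01 = 0.0025, so the condition fails decisively.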
This equality of the elemental probabilities is a sufficient condition for independence, but it is not necessary. Consider the following example.

Example 2.5 Suppose a person has very good control over the number of rotations the coin makes while in the air. In particular, suppose the following numerical probabilities: p10 = p01 = 0.2. Now, we need to find the numerical value of p00 (if there is one) such that the relation p00 p11 = p10 p01 holds. To this end, express this condition as:

p00 p11 = p00 [ 1 − (p00 + p10 + p01) ] = p10 p01.   (2.8a)

This equation can be rewritten as a quadratic equation in the unknown p00:

p00² − [ 1 − (p10 + p01) ] p00 + p10 p01 = 0.   (2.8b)

Applying the quadratic formula to (2.8b) gives

p00 = (1/2) [ 1 − (p10 + p01) ± √( [1 − (p10 + p01)]² − 4 p10 p01 ) ].   (2.8c)

Inserting the above numerical information into (2.8c) gives

p00 = (1/2) [ 0.6 ± √(0.36 − 0.16) ] = (1/2) [ 0.6 ± 0.4472 ] = 0.5236 or 0.0764.   (2.8d)

Notice that for the chosen values p10 = p01 = 0.2 there are two possible choices for p00. Furthermore, they add up to 0.6. Hence, if we choose the first for p00, then the second is exactly p11. It should also be noted that (2.8c) indicates that for certain choices of p10 and p01 there will be no value of p00 that makes the components of X independent. Specifically, if both p10 and p01 are large enough that the term inside the square root is negative, then there is no real-valued solution for p00.

We now address the concept of conditional probability in relation to a 2-D Bernoulli random variable. First, however, we give the following definition of conditional probability in the general setting.

Definition 2.6 Let A and B be two subsets of a sample space S, and suppose that Pr(B) > 0. Then the probability of A given B, written as Pr(A | B), is defined as

Pr(A | B) = Pr(A ∩ B) / Pr(B).   (2.9)

To understand (2.9) we refer to the Venn diagram in Figure 2. What "given B" means is that our sample space is now restricted to the set B.
Stated another way, nothing outside of the set B exists. So, in Figure 2, only the red circle exists now. Equation (2.9) is the probability of that portion of the set A that is in the set B. The quantity Pr(A ∩ B), which is the black area in Figure 2, is the "size" of the intersection relative to the entire sample space. Since our new sample space is the smaller one, B, expressing the probability of this intersection relative to B demands that we "scale it up" by dividing that probability by the probability of B, as is done in (2.9).

Now that we have Definition 2.6, we can give an alternative to the definition of statistical independence in Definition 2.5. Specifically,

Definition 2.5′ Events A and B (where it is assumed that Pr(B) > 0) are said to be statistically independent events if Pr(A | B) = Pr(A).

Remark In relation to Figure 2, this means that restricting our sample space to B does not alter the probability of A; that is, if under the condition B the probability of A is not changed, then A and B are statistically independent. Geometrically, A must occupy the same fraction of B that it occupies of the entire sample space. Referring to Figure 2, all that is required is that the overlap of A and B be just enough so that the black intersection area equals the product of the blue and red areas.

We now proceed to relate the concept of conditional probability to a 2-D Bernoulli random variable. Because the sample space for this random variable is so simple, it offers a clear picture of both the meaning and value of conditional probability.

Example 2.6 Again, let X = (X1, X2) ~ Ber({p00, p10, p01, p11}). Develop the expression for Pr[X2 = j | X1 = i].

Solution:

Pr[X2 = j | X1 = i] = Pr[X2 = j ∩ X1 = i] / Pr[X1 = i] = pij / (pi0 + pi1).   (2.10a)

In particular,

Pr[X2 = 1 | X1 = 0] = p01 / (p00 + p01) = p_2|[X1=0]   (2.10b)
and

Pr[X2 = 1 | X1 = 1] = p11 / (p10 + p11) = p_2|[X1=1].   (2.10c)

The probabilities (2.10b) and (2.10c) are the p-values for X2, conditioned on the events [X1 = 0] and [X1 = 1], respectively. As simple as it was to obtain (2.10), it can be an extremely valuable tool. Specifically, if one has reliable numerical values for {p00, p10, p01, p11}, then (2.10) is a prediction model, in the sense that, once we have obtained a numerical value for X1, it allows us to predict the probability that X2 will equal zero or one. Remember, if X1 and X2 are independent, then the numerical information associated with X1 is irrelevant, in the sense that it does not alter the probability that X2 will equal zero or one. But there are many situations where these random variables are not independent.

3. n-Dimensional Bernoulli Random Variables

Definition 2.7 Let X = [X1, …, Xn], where each Xk ~ Ber(pk). Then X is said to be an n-D Bernoulli random variable.

The p-values {pk, k = 1, …, n} in the above definition are not generally sufficient to describe X unambiguously. The reason lies in the fact that the sample space for X includes 2^n distinct elements. Hence, to completely describe the probability structure of X requires the specification of 2^n − 1 probabilities. Specifically, we need to specify all but one of { Pr{(x1, …, xn)} = p_x1x2…xn ; xk ∈ {0,1} }. There is one situation wherein the n p-values {pk} are sufficient to completely describe X; namely, when the n random variables comprising X are mutually independent. We address that situation next, and in doing so we will demonstrate the value of the uniform random variable considered in Example 2.3 above.

Using a Random Number Generator to Simulate n-D iid Bernoulli Random Variables

In this section we address the problem of simulating data associated with a Bernoulli random variable. This simulation will utilize a uniform random number generator. And so, first, we will formally define what we mean by a uniform random number generator.
Definition 2.8 A uniform random number generator is a program that, when called, produces a "random" number that lies in the interval [0,1].

In fact, the above definition is not very formal. But it describes in simple terms the gist of a uniform random number generator. The following definition is formal, and allows the generation of n numbers at a time.

Definition 2.8′ Define the n-D random variable U = [U1, U2, …, Un], where each Uk is a random variable that has a uniform distribution on the interval [0,1], and where these n random variables are mutually independent. The two assumptions that these variables each have the same distribution and that they are mutually independent are typically phrased as the assumption that they are independent and identically distributed (iid). Then U is an n-D uniform random number generator.

The following example uses the uniform random number generator in Matlab to demonstrate this definition.

Example 2.7 Here, we give examples of an n-D uniform random variable, U, using the Matlab command "rand", for n = 1, 2 and 25:

(i) U = rand(1,1) is a 1-D uniform random variable. Each time this command is executed, the result is a "randomly selected" number in the interval [0,1]. For example:

>> rand(1,1)
ans = 0.9501

(ii) U = rand(1,2) is a 2-D uniform random variable. For example,

>> rand(1,2)
ans = 0.2311 0.6068

(iii) U = rand(5,5) is a 25-D uniform random variable. For example,

>> rand(5,5)
ans =
0.3340 0.5298 0.4329 0.6405 0.2259
0.2091 0.5798 0.3798 0.7604 0.7833
0.6808 0.4611 0.5678 0.7942 0.0592
0.6029 0.0503 0.4154 0.3050 0.8744
0.0150 0.7680 0.9708 0.9901 0.7889

It is important to note that the command rand(m,n) is the m×n-D random variable. The numbers shown above are a result of executing the command. They are not random variables. They are numbers. A random variable is an action, algorithm, or operation that, when conducted, yields numbers.
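For readers working in a language other than Matlab, the same demonstration can be carried out with any uniform random number generator. A Python sketch using only the standard library, where random.random plays the role of rand (the seed is fixed purely so the illustration is reproducible):

```python
import random

random.seed(1)  # fixed seed: makes this illustration reproducible

u1 = random.random()                        # a 1-D uniform draw, like rand(1,1)
u2 = [random.random() for _ in range(2)]    # a 2-D uniform draw, like rand(1,2)
u25 = [[random.random() for _ in range(5)]  # a 25-D draw arranged as a
       for _ in range(5)]                   # 5x5 array, like rand(5,5)

# Every generated value lies in the interval [0, 1).
all_in_unit = all(0.0 <= v < 1.0 for row in u25 for v in row)
```

As with the Matlab transcript above, the resulting numbers are not the random variable; the act of calling the generator is.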
We now proceed to show how the uniform random number generator can be used to simulate measurements of a Bernoulli random variable. Let's begin with a 1-D random variable. Again, we will use Matlab commands to this end.

Using U to arrive at X ~ Ber(p): For U ~ Uniform[0,1], define the random variable X in the following way: map the event [U ≤ 1−p] to the event [X=0], and map the event [1−p < U ≤ 1] to the event [X=1]. Recall from Example 2.3 above that Pr[U ≤ 1−p] = 1−p. Hence, it follows that Pr[X = 0] = 1−p. Therefore, since X can take on only the value zero or one, we have Pr[X = 1] = p; that is, X is a Ber(p) random variable. Here is Matlab code that corresponds to X ~ Ber(p):

p=0.7;
u=rand(1,1);
if u <= 1-p
    x=0
else
    x=1
end

For example:

>> p=0.7; u=rand(1,1); if u <=1-p, x=0, else x=1, end
x = 1

Now, suppose that we want to simulate multiple measurements associated with this random variable X ~ Ber(p). Well, we could simply embed the code in a DO loop, and repeat the above operation the desired number of times. However, it turns out that Matlab is a programming language that is not well-suited to DO loops. If the loop count is small, it works fine. But if you wanted to simulate, say, one million values associated with X, then it would take a long time. In fact, Matlab was designed in a way that makes it very fast for batch or vector operations. With this in mind, the code below is offered. It includes no IF/ELSE commands, and it requires no DO loop for multiple measurements. We will give the code for the case of one million values associated with X.

p=0.7;
m=1000000;
u=rand(1,m);
u=u-(1-p);
x=ceil(u);

The command u=rand(1,m) results in a 1×1000000 vector of numbers between 0 and 1. The command u=u-(1-p) shifts every number to the left by an amount 1−p. Thus, since here p=0.7, every number that was in the interval [0,0.3] has been shifted to a number in the interval [-0.3,0].
In particular, not only is every such number now at or below zero, but the closest integer at or to the right of it is zero. The ceil command rounds each number up to the next integer; ceil is short for ceiling, or "round up to the nearest integer". Similarly, numbers originally in the interval (0.3,1] are moved to the interval (0,0.7]. Since they are still positive, the next highest integer associated with them is one. Here is an example of running the above code. Rather than showing x, which contains one million zeros/ones, we included a command that adds these numbers. This sum is the number of ones, since zeros contribute nothing to a sum.

>> p=0.7; m=1000000; u=rand(1,m); u=u-(1-p); x=ceil(u);
>> sum(x)
ans = 700202

Notice that the relative frequency of ones is 700202/1000000, which is pretty close to the 0.7 p-value for X. In fact, if we were to pretend that these numbers were collected from an experiment, then we would estimate the p-value for X by this relative frequency value. The value of running a simulation is that you know the truth. The truth in the simulation is that the p-value is 0.7. And so the simulation using 1000000 measurements appears to give a pretty accurate estimate of the true p-value. We will next pursue more carefully what this example has just demonstrated.

Using U = [U1, …, Un] to Simulate X = [X1, …, Xn] Independent and Identically Distributed (iid) Ber(p) Random Variables, and then, from these, Investigating the Probability Structure of the Random Variable p̂ = (1/n) Σ_{k=1}^{n} Xk = X̄

In this subsection we are interested in using simulations to gain some idea of how many subjects, n, would be required to obtain a good estimate of the p-value of a typical subject. The experiment is based on the question: What is the probability that a typical American believes that we should withdraw from Iraq? We will identify the set {1} with an affirmative response, and the set {0} with opposition.
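The shift-and-ceil recipe above is not Matlab-specific. Here is a Python sketch of the same batch simulation (standard library only; math.ceil replaces Matlab's ceil, and a smaller m is used so it runs quickly; the seed is fixed purely for reproducibility):

```python
import math
import random

random.seed(0)  # reproducible illustration

p = 0.7
m = 100_000
u = [random.random() for _ in range(m)]     # like u = rand(1,m)
x = [math.ceil(ui - (1 - p)) for ui in u]   # shift left by 1-p, then round up

ones = sum(x)    # number of ones, since zeros contribute nothing to the sum
phat = ones / m  # relative frequency of ones: an estimate of p
```

Exactly as in the Matlab run, values of u at or below 1−p become 0 after the shift-and-ceil, and values above 1−p become 1, so the relative frequency phat should land close to the true p-value of 0.7.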
We will ask this question of n independent subjects and record their responses. Let Xk ~ Ber(p) be the response of the kth subject. Notice that we are assuming that each subject has the same probability, p, of believing that we should withdraw. Thus, X = [X1, …, Xn] is an n-D random variable whose components are iid Ber(p) random variables. After we conduct the survey, our next action will be to estimate p using the estimator

p̂ = (1/n) Σ_{k=1}^{n} Xk = X̄.   (2.7)

Notice that (2.7) is a random variable that is a composite action: it includes first recording the responses of n subjects, and then taking an average of those responses.

But suppose we were considering conducting an experiment where only 100 measurements were being considered. Well, if we run the above code for this value of m, we get a sum equal to 74. Running it a second time gives a sum equal to 67. And if we run the code 500 times, we can plot a histogram of the sum data, to get a better idea of the amount of uncertainty of the p-value estimator for m=100. Here is the Matlab code that allows us to conduct this investigation of how good an estimate of the true p-value we can expect:

p=0.7;
n=500;
m=100;
u=rand(m,n);
u=u-(1-p);
x=ceil(u);
phat=0.01*sum(x);
hist(phat)

Figure 6. Histogram of the p-value estimator (2.7) associated with m=100 subjects, using 500 simulations.

Notice that the histogram is reasonably well centered about the true p-value of 0.7. Based on the 500 estimates of (2.7) for m=100, the sample mean and standard deviation of the estimator (2.7) are 0.7001 and 0.0442, respectively. Were we to use a 2-σ reporting error for our estimate of (2.7) for m=100, it would be ~±0.09 (or 9%). To get an idea of how the reporting error may be influenced by the number of subjects (m in the code above) used in (2.7), we embedded the above code in a DO loop over the values m=100, 1000, and 10,000.
For each value of m we computed the sample standard deviation of the 500 estimates. The code and results are given below.

>> %PROGRAM NAME: phatstd
m=[100 1000 10000];
phatstdvec=[];
p=0.7;
n=500;
for i=1:3
    u=rand(m(i),n);
    u=u-(1-p);
    x=ceil(u);
    phat=(1/m(i))*sum(x);
    phatstd=std(phat);
    phatstdvec=[phatstdvec phatstd];
end
phatstdvec

phatstdvec = 0.0469 0.0140 0.0046

Closer examination of these 3 numbers associated with the chosen 3 values of m would reveal that the standard deviation of (2.7) appears to be inversely proportional to √m.

In the next example, we demonstrate the power of knowledge of the conditional PDF of a 2-D Bernoulli random variable, in relation to a process that is a time-indexed collection of random variables. In general, such a process is known as a random process:

Definition 2.9 A time-indexed collection of random variables, {Xt}, is known as a random process. If the joint PDF of any finite collection {Xt, Xt+1, …, Xt+m} does not depend on the time origin, t, then the process is said to be a stationary random process.

The universe is rife with time-dependent variables that take on only one of two possible values. Consider just a few such processes from a wide range of settings: whether a person is breathing normally or not; whether a drop in air temperature causes a chemical phase change or not; whether farmers will get more than 2 inches of rain in July or not; whether your cell phone receives a correct bit of information or not; whether a cooling pump performs as designed or not; whether you get married in any given year or not; whether a black hole exists in a sector of the galaxy or not. All of these examples are time-dependent. In the following example we address what might be termed a Bernoulli, or binary, random process.

Example 2.8 Samples of a mixing process are taken once every hour. If the chemical composition is not within required limits, a value of one is entered into the data log. Otherwise, a value of zero is entered.
Figure 7 below shows two randomly selected 200-hour segments of the data log for a process that is deemed to be pretty much under control. From these data, we see that, for the most part, the process is in control. However, when it goes out of control, there seems to be a tendency for it to remain out of control for more than one hour.

(a) Under federal regulations, the mixture associated with an out-of-control period must be discarded. Management would like to have a computer model for simulating this control data log. It should be a random model that captures key information, such as the mean and standard deviation of a simulated data log. Pursue the design of such a model.

Well, having had Professor Sherman's STAT 305C course, you immediately recall the notion of a Bernoulli random variable. And so your first thought is to define the events [X=0] and [X=1] to correspond to "in" and "out of" control, respectively. To estimate the p-value for X, you add up all the 'ones' in the lower segment given in Figure 7, and divide this number by 200. This yields the p-value estimate p = 12/200 = 0.06. You then proceed to simulate a data log segment by using the following Matlab commands:

>> u = rand(1,200);
>> u = u - 0.94;
>> y = ceil(u);
>> stem(y)

The stem plot is shown in Figure 8 below.

Figure 7. Two 200-hour segments of the control data log for a mixing process.

Even though Figure 8 has general similarities to Figure 7, it lacks the "grouping" tendency of the 'ones'. Hence, management feels the model is inadequate.

Figure 8.
Simulation of a 200-hour data log segment using a Ber(0.06) random variable.

(b) Use the concept of a 2-D Bernoulli random variable whose components are not assumed to be statistically independent as the basis for your model. Specifically, X1 is the process control state at any time, t, and X2 is the state at time t+1.

To this end, you need to configure the data to correspond to X = (X1, X2). You do this in the following way. For simplicity, consider the following measurements associated with a 10-hour segment: [0 0 0 1 0 0 0 0 1 0]. This array represents 10 measurements of X1. Now, for each measurement of X1, the measurement that follows it is the corresponding measurement of X2. Since you have no measurement following the 10th measurement of X1, this means that you have only 9 measurements of X = (X1, X2); namely

X1: 0 0 0 1 0 0 0 0 1
X2: 0 0 1 0 0 0 0 1 0

Of these 9 ordered pairs, you have 5 (0,0) elements. And so, your estimate for p00 would be 5/9. Using this procedure on the second data set in Figure 7, you arrive at the following numerical estimates: p10 = p01 = 0.01 and p11 = 0.05. It follows that p00 = 0.93. Since p1 = Pr[X1 = 1] = p10 + p11 = 0.06, your first measurement of your 200-hour simulation is that of a Ber(0.06) random variable. You simulate a numerical value for this variable in exactly the way you did in part (a). If that number is 0, your p-value for simulating the second number is obtained using (2.10b), and if your first number was a 1, then you use the p-value given by (2.10c). Specifically,

Pr[X2 = 1 | X1 = 0] = p01 / (p00 + p01) = p_2|[X1=0] = 0.01 / (0.93 + 0.01) = 0.0106

Pr[X2 = 1 | X1 = 1] = p11 / (p10 + p11) = p_2|[X1=1] = 0.05 / (0.01 + 0.05) = 0.8333.

The Matlab code for this simulation is shown below.
%PROGRAM NAME: berprocess.m
% This program generates a realization
% of a Ber(p00,p10,p01,p11) correlated process
npts = 200;
y = zeros(npts+1,1);
% Stationary joint probabilities between Y(k) and Y(k-1)
p01 = 0.01;
p10 = p01;
p11 = 0.05;
p00 = 1 - (p11 + p10 + p01);
pvec = [p00 p10 p01 p11]
% Marginal p for any Y(k)
p = p11 + p10
% -------------------------
x = rand(npts+1,1);
y(1) = ceil(x(1) - (1-p));        % Initial condition
for k = 2:npts+1
    if y(k-1) == 0
        pk = p01/(p00 + p01);     % Pr[Y(k)=1 | Y(k-1)=0], per (2.10b)
        y(k) = ceil(x(k) - (1-pk));
    else
        pk = p11/(p10 + p11);     % Pr[Y(k)=1 | Y(k-1)=1], per (2.10c)
        y(k) = ceil(x(k) - (1-pk));
    end
end
stem(y(1:200))
xlabel('Time')
ylabel('y(t)')
title('Time Series for Process Control State')

Running this code twice gives the simulation segments in Figure 9 below.

Figure 9. Two 200-hour data log simulations using a 2-D Bernoulli random variable with probabilities p10 = p01 = 0.01, p11 = 0.05, and hence p00 = 0.93.

Management feels that this model captures the grouping tendency of the ones, and so your model is approved. Congratulations!

Before we leave this example, let's think about the reasonableness of the 'ones' grouping tendency. What this says is that when the process does go out of control, it has a tendency to remain out of control for more than one hour. In fact, the above conditional probability

Pr[X2 = 1 | X1 = 1] = p11 / (p10 + p11) = p_2|[X1=1] = 0.05 / (0.01 + 0.05) = 0.8333

states that if the process is out of control during one hour, then there is an 83% chance that it will remain out of control the next hour. This can point to a number of possible sources responsible for the process going out of control.
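For reference, the berprocess.m logic can be mirrored in Python (a standard-library sketch, using the same probability estimates p10 = p01 = 0.01 and p11 = 0.05 from the example; the seed is fixed purely for reproducibility):

```python
import math
import random

random.seed(2)  # reproducible illustration

# Stationary joint probabilities between consecutive states, from Example 2.8.
p01 = 0.01
p10 = p01
p11 = 0.05
p00 = 1.0 - (p11 + p10 + p01)   # 0.93
p = p11 + p10                   # marginal Pr[Y(k) = 1] = 0.06

npts = 200
y = [0] * (npts + 1)
y[0] = math.ceil(random.random() - (1 - p))   # initial condition: Ber(p)
for k in range(1, npts + 1):
    if y[k - 1] == 0:
        pk = p01 / (p00 + p01)  # Pr[out of control | in control], per (2.10b)
    else:
        pk = p11 / (p10 + p11)  # Pr[out of control | out of control], per (2.10c)
    y[k] = math.ceil(random.random() - (1 - pk))

frac_ones = sum(y) / len(y)     # should be roughly the marginal p = 0.06
```

Because Pr[1 | 1] is about 0.83, the simulated 'ones' tend to arrive in runs, reproducing the grouping tendency that the naive iid Ber(0.06) model missed.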
Specifically, if the time constant associated with either transient nonhomogeneities in the chemicals, or with a partially blocked mixing valve, is on the order of an hour, then one might have reason to investigate these sources. If either of these sources has a time constant on the order of hours, then the above model can be used for early detection of the source. Specifically, we can use a sliding window to collect overlapping data segments, and estimate the probabilities associated with X = (X1, X2). If a blockage in the mixing valve takes hours to dissolve, then one might expect the above probability value 0.8333 to increase. We can use this logic to construct a hypothesis test for determining whether we think the valve is blocked or not. We will discuss hypothesis testing presently. Perhaps this commentary will help to motivate the reader to look forward to that topic.

4. Functions of n-D Bernoulli Random Variables

Having a solid grasp of the sample space and probability description of an n-D Bernoulli random variable is crucial in order to appreciate the simplicity of the material in this section. If the reader finds this material difficult, then it is suggested that the previous sections be reviewed. Again, the concepts are: (i) a random variable, which is an action that, if repeated, could lead to different results; (ii) the sample space associated with the variable, which is the set of all measurable values that the variable could take on; and (iii) the probabilities associated with subsets of the sample space. The reader should place the primary focus on the nature of the action and on the set of measurable values that could result. In a sense, the computation of probabilities is "after the fact"; that is, once the events of concern have been clearly identified as subsets of the sample space, the probability of those events is almost trivial to compute. If the reader can accept and appreciate this view, then this section will be simple.
Furthermore, as we arrive at some of the more classic random variables in most textbooks, the reader will not only understand their origins better, but will be able to readily relax the assumptions upon which they are based, if need be.

4.1 Functions of a 1-D Bernoulli Random Variable

A function is an operation, an algorithm, or an action. Due to the extremely simple nature of X ~ Ber(p), there are not many operations that one can perform on X. One is the following:

Example 2.9 For X ~ Ber(p), perform the following operation on X: Y = aX + b. Since X is a random variable, and Y is a function of X, it follows that Y is also a random variable. In this case, it is the operation of 'multiplying X by the constant a, and then adding the constant b to it.' The first step in understanding Y is to identify its sample space. To this end, perform the above operation on each element in S_X = {0,1}, and the reader should arrive at S_Y = {b, a+b}. It should be apparent that the following sets are equivalent: {0} ↔ {b} and {1} ↔ {a+b}. Since they are equivalent, they must have the same probabilities: Pr[Y = a+b] = Pr[X = 1] = p and Pr[Y = b] = 1−p. Since a and b are any constants the reader chooses, it follows that any random variable Y that can take on only one of two possible numerical values is, basically, just a veiled Bernoulli random variable.

Example 2.10 For X ~ Ber(p), perform the following operation on X: Y = aX² + cX + b. Even though this operation is more complicated than that of the last example, the reader should not feel intimidated. Simply proceed to identify the sample space for Y, in exactly the same manner as was done in the last example: {0} ↔ {b} and {1} ↔ {a+c+b}. Hence, S_Y = {b, a+c+b}, and so, again, we see that Y is simply a veiled Bernoulli random variable.

4.2 Functions of a 2-D Bernoulli Random Variable

Let X ~ Ber(p00, p10, p01, p11).
Since X is completely and unambiguously defined by its sample space, S_X = {(0,0), (1,0), (0,1), (1,1)}, and the associated probabilities, {p00, p10, p01, p11}, the reader should feel confident that he/she can easily accommodate any function of X. Consider the following examples.

Example 2.11 Perform the following operation on X: Y = aX1 + bX2. We then have the following equivalent sets, or events, and their associated probabilities:

{(0,0)} ↔ {0},   so Pr[Y = 0] = p00
{(1,0)} ↔ {a},   so Pr[Y = a] = p10
{(0,1)} ↔ {b},   so Pr[Y = b] = p01
{(1,1)} ↔ {a+b}, so Pr[Y = a+b] = p11

Hence, the sample space for Y is S_Y = {0, a, b, a+b}, and the elemental subsets of this set have the above probabilities. Armed with this complete description of Y, the reader should feel competent and unafraid to answer any question one might pose in relation to Y.

For the chosen numerical values a = b = 1, we have S_Y = {0, 1, 2}. Notice that we do not have S_Y = {0,1,1,2}. Why? Because S_Y is the set of possible values that Y can take on, and so it makes no sense to include the number 1 twice. In this case, the subset {1} of S_Y is equivalent to the subset {(1,0), (0,1)} of S_X. With this awareness of the equivalence of sets, it is almost trivial to compute the probability

Pr[Y = 1] = Pr{(1,0), (0,1)} = Pr({(1,0)} ∪ {(0,1)}) = Pr{(1,0)} + Pr{(0,1)} = p10 + p01.

If the reader feels that the above equation is unduly belabored, good. Then the material is becoming so conceptually clear and simple that we are succeeding in conveying the same. If the reader is confused or unsure as to the reasons for the equalities in the above equation, then the reader should return to the previous sections and fill in any gaps in understanding.

Before we leave this example, consider the application where X = [X1, X2] corresponds to the measurement of significant (1), versus insignificant (0), rainfall on two consecutive days. Suppose that on any given day, the probability of significant rainfall is p. Then Pr[X1 = 1] = Pr[X2 = 1] = p.
If we assume that X1 and X2 are independent, then we arrive at the following probabilities for Y:

Pr[Y = 0] = Pr{(0,0)} = p00 = (1-p)(1-p)
Pr[Y = 1] = Pr{(1,0), (0,1)} = p10 + p01 = 2p(1-p)
Pr[Y = 2] = Pr{(1,1)} = p11 = p^2

Notice that the rightmost quantities in these three equations are a consequence of the assumption of independence, whereas the middle quantities entail no such assumption. This leads to the question: Is it reasonable to assume that if a region experiences a significant amount of rainfall on any given day, then it might be more likely to experience a significant amount on the next day? If the region is prone to experiencing longer weather fronts, where storms linger for more than one day, then the answer to this question would be yes. In that case, X1 and X2 are not independent. Hence, the rightmost expressions in the above equations are wrong, whereas the middle expressions are still correct. The caveat here is that one must have reliable numerical estimates of these probabilities. If one only has information about any given day, and not about two consecutive days, then one might resort to assuming the variables are independent. This is not necessarily a wrong assumption. But it is one that should be clearly noted in presenting probability information to farmers.

Example 2.12 Perform the following operation on X: Y = |X1 - X2|. This operation is not as fabricated as it might seem. Consider sending a text message to your friend. Most communications networks convert text messages into a sequence of zeros and ones. Each 0/1 is called an information bit. Now, let the event that you send a 1 correspond to [X1 = 1], and let the event that your friend correctly receives it be [X2 = 1]. Then, here, the event [Y = 1] corresponds to a bit error in the transmission. We then have the following equivalent sets, or events, and their associated probabilities:

{(0,0), (1,1)} ≡ {0}   Pr[Y = 0] = p00 + p11
{(1,0), (0,1)} ≡ {1}   Pr[Y = 1] = p10 + p01.

Hence, Y ~ Ber(p = p10 + p01).
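The reduction in Example 2.12 can be sketched in Python as well; again the joint probabilities below are illustrative, not values from the text.

```python
# Example 2.12: Y = |X1 - X2| collapses the four outcomes of a 2-D Bernoulli
# X into a single Ber(p10 + p01) variable. Joint probabilities are made up.
joint = {(0, 0): 0.45, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.45}

pr_Y = {0: 0.0, 1: 0.0}
for (x1, x2), pxy in joint.items():
    pr_Y[abs(x1 - x2)] += pxy  # (0,0),(1,1) -> 0 ; (1,0),(0,1) -> 1

print(pr_Y)  # Pr[Y=0] = p00 + p11, Pr[Y=1] = p10 + p01
```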
Even though we have the joint probability information in the parameters {p00, p10, p01, p11}, it is more useful to compute the conditional probability information. After all, what you are really concerned with is the event that your friend correctly receives a 0/1, given that you sent a 0/1. As in Example 2.8, these conditional probabilities are given by

Pr[X2 = 1 | X1 = 0] = p01 / (p00 + p01)   and   Pr[X2 = 0 | X1 = 1] = p10 / (p10 + p11).

Usually, it is presumed that each bit you send to your friend is as likely to be a zero as it is a one; that is, X1 ~ Ber(p = 0.5). In this case, the above error probabilities become

Pr[X2 = 1 | X1 = 0] = p01 / 0.5   and   Pr[X2 = 0 | X1 = 1] = p10 / 0.5.

If it is further assumed that p10 = p01, then we arrive at the usual expression for the bit transmission error:

e_bit = Pr[X2 = 1 | X1 = 0] = Pr[X2 = 0 | X1 = 1] = 2 p01.

4.3 Functions of an n-D Bernoulli Random Variable

Recall that a complete and unambiguous description of an n-D Bernoulli random variable requires specification of the probabilities associated with the 2^n elemental sets in the sample space for X. This sample space can be expressed as:

S_X = {(x1, x2, ..., xn) | every xk ∈ {0,1}}.

Denote the probability associated with the elemental set {(x1, x2, ..., xn)} as p_{x1 x2 ... xn}. If one has access to m numerical measurements {(x1^(j), x2^(j), ..., xn^(j))}, j = 1, ..., m, of X, then one can estimate p_{x1 x2 ... xn} by the relative number of occurrences of the element (x1, x2, ..., xn) in relation to the number of measurements, m. In this section we will restrict our attention to the more common setting, wherein the components of X are mutually independent. It follows that

p_{x1 x2 ... xn} = Pr[X1 = x1 ∩ X2 = x2 ∩ ... ∩ Xn = xn] = ∏_{k=1}^{n} Pr[Xk = xk].    (2.8a)

Now, since Xk ~ Ber(pk), we have

Pr[Xk = xk] = 1 - pk for xk = 0;  pk for xk = 1.    (2.8b)

We are now in a position to consider some classic functions of X = [X1, ..., Xn].

Example 2.13 Define the random variable Y, which is the smallest index, k, such that Xk = 1. For example, suppose n = 5.
Then, in relation to the elements (0,1,0,0,1) and (0,0,1,1,0), the associated values for Y are 2 and 3, respectively. More generally, the sample space for Y is S_Y = {1, 2, ..., n}. Before we compute the probabilities associated with the elemental subsets of this sample space, let's find the equivalent events in S_X. Specifically,

[Y = k] ≡ [X1 = 0 ∩ X2 = 0 ∩ ... ∩ X_{k-1} = 0 ∩ Xk = 1].

Hence, in view of the assumption that the components of X = [X1, ..., Xn] are mutually independent, we have

Pr[Y = k] = Pr[X1 = 0 ∩ X2 = 0 ∩ ... ∩ X_{k-1} = 0 ∩ Xk = 1] = [∏_{j=1}^{k-1} (1 - pj)] pk.    (2.9a)

If we assume that, not only are the components of X mutually independent, but that they all have exactly the same p-value, then we obtain the following well known geometric probability model for Y:

Pr[Y = k] = [∏_{j=1}^{k-1} (1 - p)] p = p(1 - p)^{k-1};  1 ≤ k ≤ n.    (2.9b)

Of course, the expression (2.9b) is simpler than (2.9a). It should be, since all the p-values are assumed to be the same. However, (2.9a) is often a more realistic situation than (2.9b). All too often, the assumption of equal p-values is born of convenience or of lack of understanding. We will see this same assumption in the next example.

Example 2.14 In Example 2.7 we used Matlab to investigate the probability description for the p-value estimator associated with the assumedly independent and identically distributed (iid) Ber(p) components of X = [X1, ..., Xn], given by (2.7), and repeated here:

p̂ = (1/n) ∑_{k=1}^{n} Xk = X̄.

In this example, we will obtain the actual probability model for p̂. Furthermore, we will obtain it in the more general (and often more realistic) case, where the components are independent, but they do not have one and the same p-value. To this end, we first give the sample space for the following random variable:

Y = ∑_{k=1}^{n} Xk.

The sample space for Y is S_Y = {0, 1, 2, ..., n-1, n}. Notice that this sample space has n+1 elements in it. With this, we are in a position to identify the subset of S_X that corresponds to the elemental subset {k} of S_Y.
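Before proceeding, the geometric expressions (2.9a) and (2.9b) of Example 2.13 can be checked with a short Python sketch; the p-values below are illustrative, and the final loop confirms that equal p-values reduce (2.9a) to (2.9b).

```python
# Y is the index of the first component equal to 1.
def pr_first_one(k, pvals):
    """Pr[Y=k] per (2.9a): product of (1 - p_j) for j < k, times p_k."""
    out = pvals[k - 1]
    for pj in pvals[:k - 1]:
        out *= 1 - pj
    return out

# Unequal p-values, per (2.9a); these numbers are made up for illustration
pvals = [0.2, 0.5, 0.9]
print([pr_first_one(k, pvals) for k in (1, 2, 3)])

# Identical p-values reduce to the geometric form p*(1-p)**(k-1) of (2.9b)
p, n = 0.3, 5
for k in range(1, n + 1):
    assert abs(pr_first_one(k, [p] * n) - p * (1 - p) ** (k - 1)) < 1e-12
```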
We begin with the two simplest subsets: [Y = 0] ≡ {(0,0,...,0)} and [Y = n] ≡ {(1,1,...,1)}. The key point here is that the only way that Y can take on the value zero (n) is if every component of X takes on the value zero (one). In the more general setting wherein the components are assumed to be mutually independent, but with not necessarily the same p-value, the probabilities of these two events are simply

Pr[Y = 0] = Pr{(0,0,...,0)} = Pr[X1 = 0 ∩ ... ∩ Xn = 0] = ∏_{k=1}^{n} Pr[Xk = 0] = ∏_{k=1}^{n} (1 - pk)

and

Pr[Y = n] = Pr{(1,1,...,1)} = Pr[X1 = 1 ∩ ... ∩ Xn = 1] = ∏_{k=1}^{n} Pr[Xk = 1] = ∏_{k=1}^{n} pk.

Next, we consider the slightly more challenging events [Y = 1] and [Y = n-1]. In particular, the only way that Y can take on the value one is if one and only one of the components of X equals one. Similarly, the only way that Y can take on the value n-1 is if one and only one of the components of X equals zero. Hence, we have the following equivalent events:

[Y = 1] ≡ "only one component of X equals 1" ≡ {(1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)}

and

[Y = n-1] ≡ "only one component of X equals 0" ≡ {(0,1,...,1), (1,0,1,...,1), ..., (1,...,1,0)}.

Notice that the elements making up these sets are distinct. For example, the element (1,0,...,0) is a point in n-D space that is distinctly different than the point (0,1,0,...,0). Sure, many of the components of these two elements are the same. But the elements are distinctly separate points in n-D space. Their intersection is, therefore, the empty set.
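The two extreme probabilities Pr[Y = 0] and Pr[Y = n] just derived amount to running products; a Python sketch with illustrative p-values:

```python
# Pr[Y=0] and Pr[Y=n] for independent, not identically distributed,
# components. The p-values below are made up for illustration.
pvals = [0.2, 0.5, 0.9, 0.4]

pr_Y0 = 1.0  # Pr[Y=0]: every component equals 0
pr_Yn = 1.0  # Pr[Y=n]: every component equals 1
for pk in pvals:
    pr_Y0 *= 1 - pk
    pr_Yn *= pk

print(pr_Y0, pr_Yn)
```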
Hence, the probability of the event [Y = 1] is simply the sum of the elemental probabilities associated with the set {(1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)}; that is:

Pr[Y = 1] = Pr{(1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)}
          = Pr{(1,0,...,0)} + ... + Pr{(0,...,0,1)}
          = p1 ∏_{j=2}^{n} (1 - pj) + ... + [∏_{j=1}^{n-1} (1 - pj)] pn
          = ∑_{k=1}^{n} [ pk ∏_{j=1, j≠k}^{n} (1 - pj) ].

Similarly,

Pr[Y = n-1] = Pr{(0,1,...,1), (1,0,1,...,1), ..., (1,...,1,0)}
            = Pr{(0,1,...,1)} + ... + Pr{(1,...,1,0)}
            = (1 - p1) ∏_{j=2}^{n} pj + ... + [∏_{j=1}^{n-1} pj] (1 - pn)
            = ∑_{k=1}^{n} [ (1 - pk) ∏_{j=1, j≠k}^{n} pj ].

While to many readers these expressions may seem formidable, if not downright ugly, such readers should carefully assess whether their queasiness is due to the unfamiliar notation, or to a lack of conceptual understanding. The first portions of each of these equations were "long hand" expressions, wherein each probability is noted individually. The second portions of these equations use summation (∑) and product (∏) notation, in order to make the expressions more compact. The reader who is unfamiliar with this type of notation should not misconstrue discomfort over that with lack of conceptual understanding.

Now, let's consider the most general event [Y = k]. This event is the event that k of the n components of X take on the value one, while the others take on the value zero, right? Well, one way corresponds to the event {(1_1, 1_2, ..., 1_k, 0_{k+1}, ..., 0_n)}. Notice that we have included subscripts on the 0's and 1's simply to make it clear as to their positions in the ordering of the n components of this element. It should be clear that the probability of this event is simply

Pr{(1_1, 1_2, ..., 1_k, 0_{k+1}, ..., 0_n)} = ∏_{j=1}^{k} pj ∏_{j=k+1}^{n} (1 - pj).    (2.10)

The question now is: How many ways can one position k ones and (n-k) zeros in n slots? Each way will have a corresponding probability; just as the way of positioning all the ones first (followed by all the zeros) resulted in the above probability. The answer to this question begins by answering a similar question: How many ways can one order n distinctly different objects in n slots?
Well, in the first slot we can place any one of the n distinct objects. Once we have chosen one of them, then in the second slot we have only n-1 distinct objects to choose from. Having chosen one of them, we are left to choose from n-2 distinct objects to place in the third slot, and so on. Think of this as a tree, where each branch has branches, and each of those branches has branches, and so on. The question is to figure out how many total branches there are. Let's identify the first slot with the biggest diameter branches. Then we have n of these, corresponding to the n distinct objects. Now, each one of these main branches has n-1 branches of slightly smaller diameter. And each one of those branches has n-2 slightly smaller branches, and so on. So, the total number of branches is

n · (n-1) · (n-2) · ... · 2 · 1 = n!  (read as "n factorial").

Now, if each of the k ones and the n-k zeros were distinctly different (say, different colors, or different aromas!), then there would be n! different possible ways to order them. However, all the ones look alike, and all the zeros look alike. And so there will be fewer than n! ways to order them. How many fewer? Well, how many ways can one order the k ones, assuming they are distinct? The answer is k! ways. And how many ways can one order n-k zeros, assuming they are distinct? The answer is (n-k)! ways. But since the ones and zeros are not distinct, the number of ways we can position them into the n slots is:

n! / [k! (n-k)!]  (read as "n choose k").    (2.11)

Now, each way we position the k ones in the n slots has a corresponding probability, as demonstrated above. And so, there can be a lot of probabilities to compute, even for a modest value of n. For example, if n = 10, then for k = 5 we have to compute

10! / (5! 5!) = (10 · 9 · 8 · 7 · 6) / (5 · 4 · 3 · 2 · 1) = 252

different probabilities, and then add them up to get Pr[Y = 5].
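The counting argument above, and the resulting sum of 252 product terms for Pr[Y = 5] when n = 10, can be verified by brute force in a short Python sketch; the p-values are illustrative.

```python
from itertools import product
from math import factorial, prod

# "n choose k" = n!/(k!(n-k)!): ways to position k ones and n-k zeros in n slots
def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

n, k = 10, 5
ways = [x for x in product((0, 1), repeat=n) if sum(x) == k]
assert len(ways) == choose(n, k) == 252  # the count worked in the text

# With unequal p-values, Pr[Y=5] really is a sum of 252 product terms
pvals = [0.1 * (j + 1) / 2 for j in range(n)]  # made-up p-values
pr_Y5 = sum(prod(p if xi else 1 - p for xi, p in zip(x, pvals)) for x in ways)
print(pr_Y5)
```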
And so, even if we were to now proceed to develop a general expression for this probability, it would be really ugly! Furthermore, given specified p-values {pj}, j = 1, ..., n, corresponding to the n Ber(pj) components of X, we would still need a calculator, if not a computer, to compute this probability for the events [Y = k] for all of the n+1 possible values of k: k = 0, 1, ..., n. For these reasons, we will focus our attention on an algorithm for computing the elemental probabilities associated first with S_X. Then, having numerical values for these probabilities, we will develop an algorithm that uses them to compute the probability of each [Y = k] event. We do this in the Appendix to this chapter, since the development is mainly one of writing a program. We will use Matlab as our programming language. However, those familiar with other languages (e.g. C++) may prefer to write their own code.

Recall that the only assumption, thus far, has been that the n components of X are independent. We will now make the further assumption that they all have one and the same p-value. In this case (2.10) becomes

Pr{(1_1, 1_2, ..., 1_k, 0_{k+1}, ..., 0_n)} = ∏_{j=1}^{k} p ∏_{j=k+1}^{n} (1 - p) = p^k (1 - p)^{n-k}.    (2.12)

However, since the p-values are identical, then so long as an element of S_X has k 1's (and, consequently, n-k 0's), that element has the probability (2.12). And the number of distinct elements of this type is given by (2.11). Hence, under the assumption that the components of X are iid, we have the following probability description for their sum, Y:

Pr[Y = k] = [n! / (k! (n-k)!)] p^k (1 - p)^{n-k};  k = 0, 1, ..., n.    (2.13)

The probability model (2.13) is known as the Binomial probability model. Notice that it entails two parameters: n, the number of Bernoulli random variables being added, and p, the p-value associated with each and every Bernoulli variable. Having (2.13), it is trivial to arrive at the probability model for our p-value estimator

p̂ = (1/n) ∑_{k=1}^{n} Xk = X̄ = (1/n) Y.
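The Binomial model (2.13) can be sketched directly in Python; n = 10 and p = 0.7 below are illustrative choices, and the assertion confirms that the n+1 probabilities sum to one.

```python
from math import factorial

# The Binomial model (2.13): Pr[Y=k] = [n!/(k!(n-k)!)] p^k (1-p)^(n-k),
# for the sum Y of n iid Ber(p) components.
def binom_pmf(k, n, p):
    c = factorial(n) // (factorial(k) * factorial(n - k))
    return c * p**k * (1 - p) ** (n - k)

n, p = 10, 0.7
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12  # a valid probability description of S_Y

mode = max(range(n + 1), key=lambda k: pmf[k])
print(mode)  # the most probable count is 7 for these n and p
```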
The sample space for p̂ is obtained by dividing each element in S_Y by n. Hence, we have the following equivalent elemental events: [p̂ = k/n] ≡ [Y = k]. It follows that the probability description for p̂ is

Pr[p̂ = k/n] = Pr[Y = k] = [n! / (k! (n-k)!)] p^k (1 - p)^{n-k}.    (2.14)

The following is a Matlab code to compute this theoretical probability model for p̂:

    n = 100; p = 0.7;
    xvec = 0; pvec = (1-p)^n;
    for k = 1 : n
        xvec = [xvec , k/n];
        pval = ( factorial(n)/(factorial(k)*factorial(n-k)) ) * p^k * (1-p)^(n-k);
        pvec = [pvec , pval];
    end
    stem(xvec, pvec)

This code gives the following figure.

Figure 10. Graph of the Binomial distribution (2.13) for n=100 and p=0.7.

A comparison of Figure 7 and Figure 10 shows that Figure 7 has a shape reasonably similar to the theoretical model in Figure 10.

APPENDIX

To begin, we should figure out just how many elemental events are associated with S_X. Well, in the first slot there can be a zero or a one. Then for each of these "branches", the second slot can be a zero or a one, and so on. It turns out that there are a total of 2^n "branches"; that is, there are 2^n elements in S_X. For example, if n = 10, then S_X has 2^10 = 1024 elements in it. Now let's compute the following numbers:

n(0) = "the number of elements that contain no 1's" = 1
n(1) = "the number of elements that contain one 1" = n
n(2) = "the number of elements that contain two 1's" = n! / [(n-2)! 2!]

In general,

n(k) = "the number of elements that contain k 1's" = n! / [(n-k)! k!].
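The counts n(k) above can be verified by enumerating S_X outright. The following is a Python sketch (the Appendix itself uses Matlab), with n = 10 as in the worked example.

```python
from itertools import product
from math import factorial

# Enumerate S_X for n=10 and check that the number of elements containing
# exactly k ones matches n(k) = n!/((n-k)! k!).
n = 10
S_X = list(product((0, 1), repeat=n))
assert len(S_X) == 2**n  # 1024 elements for n=10

for k in range(n + 1):
    n_k = sum(1 for x in S_X if sum(x) == k)
    assert n_k == factorial(n) // (factorial(n - k) * factorial(k))
print("all counts verified")
```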