PROBABILITY MODELS IN ENGINEERING
COURSE NOTES
ECE2191

Dr Faezeh Marzbanrad
Department of Electrical and Computer Systems Engineering
Monash University

Lecturers:
Dr Faezeh Marzbanrad (Clayton)
Dr Wynita Griggs (Clayton)
Dr Mohamed Hisham (Malaysia)

2020

Contents

1 Preliminary Concepts
  1.1 Probability Models in Engineering
  1.2 Review of Set Theory
  1.3 Operations on sets
  1.4 Other Notations
  1.5 Random Experiments
    1.5.1 Tree Diagrams
    1.5.2 Coordinate System
2 Probability Theory
  2.1 Definition of Probability
    2.1.1 Relative Frequency Definition
    2.1.2 Axiomatic Definition
  2.2 Joint Probabilities
  2.3 Conditional Probabilities
    2.3.1 Bayes's Theorem
  2.4 Independence
  2.5 Basic Combinatorics
    2.5.1 Sequence of Experiments
    2.5.2 Sampling with Replacement and with Ordering
    2.5.3 Sampling without Replacement and with Ordering
    2.5.4 Sampling without Replacement and without Ordering
    2.5.5 Sampling with Replacement and without Ordering
3 Random Variables
  3.1 The Notion of a Random Variable
  3.2 Discrete Random Variables
    3.2.1 Probability Mass Function
    3.2.2 The Cumulative Distribution Function
    3.2.3 Expected Value and Moments
    3.2.4 Conditional Probability Mass Function and Expectation
    3.2.5 Common Discrete Random Variables
  3.3 Continuous Random Variables
    3.3.1 The Probability Density Function
    3.3.2 Conditional CDF and PDF
    3.3.3 The Expected Value and Moments
    3.3.4 Important Continuous Random Variables
  3.4 The Markov and Chebyshev Inequalities
4 Two or More Random Variables
  4.1 Pairs of Random Variables
    4.1.1 Joint Cumulative Distribution Function
    4.1.2 Joint Probability Density Functions
    4.1.3 Joint Probability Mass Functions
    4.1.4 Conditional Probabilities and Densities
    4.1.5 Expected Values and Moments Involving Pairs of Random Variables
    4.1.6 Independence of Random Variables
    4.1.7 Pairs of Jointly Gaussian Random Variables
  4.2 Multiple Random Variables
    4.2.1 Vector Random Variables
    4.2.2 Joint and Conditional PMFs, CDFs and PDFs
    4.2.3 Expectations Involving Multiple Random Variables
    4.2.4 Multi-Dimensional Gaussian Random Variables
5 Random Sums and Sequences
  5.1 Independent and Identically Distributed Random Variables
  5.2 Mean and Variance of Sums of Random Variables
  5.3 The Sample Mean
  5.4 Laws of Large Numbers
  5.5 The Central Limit Theorem
  5.6 Convergence of Sequences of Random Variables
    5.6.1 Sure Convergence
    5.6.2 Almost-Sure Convergence
    5.6.3 Convergence in Probability
    5.6.4 Convergence in the Mean Square Sense
    5.6.5 Convergence in Distribution
  5.7 Confidence Intervals

1 Preliminary Concepts

1.1 Probability Models in Engineering

In many real-world situations the outcome is uncertain. Many systems involve phenomena with unpredictable variation and randomness. We often deal with random experiments in which the outcome varies unpredictably when the experiment is repeated under the same conditions. In such cases, deterministic models are not appropriate, since they predict the same outcome for each repetition of an experiment. Probability models are intended for such random experiments. In engineering problems in particular, the occurrence of many events is either uncertain or the outcome cannot be specified by a precise value or formula.
The exact value of the power line voltage during high activity in the summer is an example that cannot be described in any deterministic way. In communications, the events can frequently be reduced to a series of binary digits, and the sequence of these digits is uncertain; this is precisely how it carries information. Probability models therefore play a fundamental role in engineering applications.

1.2 Review of Set Theory

In random experiments we are interested in the occurrence of events that are represented by sets. Before proceeding with further discussion of events and random experiments, we present some essential concepts from set theory. As we will see, the definitions and concepts presented here will clarify and unify the mathematical foundations of probability theory.

Definition 1.1. Set: A set is an unordered collection of objects. We typically use a capital letter to denote a set, listing the objects within braces or by graphing. The notation 𝐴 = {𝑥 : 𝑥 > 0, 𝑥 ≤ 2} is read as "the set 𝐴 contains all 𝑥 such that 𝑥 is greater than zero and less than or equal to two." The notation 𝜁 ∈ 𝐴 is read as "the object zeta is in the set 𝐴." Two sets are equal if they have exactly the same objects in them; i.e., 𝐴 = 𝐵 if 𝐴 contains exactly the same elements that are contained in 𝐵.

Definition 1.2. Null set: denoted ∅, is the empty set and contains no objects.

Definition 1.3. Universal set: denoted 𝑆, is the set of all objects in the universe. The universe can be anything we define it to be. For example, we sometimes consider 𝑆 = 𝑅, the set of all real numbers.

Definition 1.4. Subset: If every object in set 𝐴 is also an object in set 𝐵, then 𝐴 is a subset of 𝐵, denoted by 𝐴 ⊂ 𝐵. The expression 𝐵 ⊃ 𝐴, read as "𝐵 contains 𝐴," is equivalent to 𝐴 ⊂ 𝐵.

Definition 1.5. Union: The union of sets 𝐴 and 𝐵, denoted 𝐴 ∪ 𝐵, is the set of objects that belong to 𝐴 or 𝐵 or both, i.e., 𝐴 ∪ 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 or 𝜁 ∈ 𝐵}.

Definition 1.6. Intersection: The intersection of sets 𝐴 and 𝐵, denoted 𝐴 ∩ 𝐵, is the set of objects common to both 𝐴 and 𝐵; i.e., 𝐴 ∩ 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 and 𝜁 ∈ 𝐵}. Note that if 𝐴 ⊂ 𝐵, then 𝐴 ∩ 𝐵 = 𝐴. In particular, we always have 𝐴 ∩ 𝑆 = 𝐴.

Definition 1.7. Complement: The complement of a set 𝐴, denoted 𝐴𝑐, is the collection of all objects in 𝑆 not included in 𝐴; i.e., 𝐴𝑐 = {𝜁 ∈ 𝑆 : 𝜁 ∉ 𝐴}.

Definition 1.8. Difference: The relative complement or difference of sets 𝐴 and 𝐵 is the set of elements in 𝐴 that are not in 𝐵: 𝐴 − 𝐵 = {𝜁 : 𝜁 ∈ 𝐴 and 𝜁 ∉ 𝐵}. Note that 𝐴 − 𝐵 = 𝐴 ∩ 𝐵𝑐.

These definitions and relationships among sets are illustrated in Figure 1.1. These diagrams are called Venn diagrams, which represent sets by simple plane areas within the universal set, pictured as a rectangle. Venn diagrams are important visual aids for understanding relationships among sets.

Figure 1.1: Venn diagrams representing sets: (a) universal set 𝑆; (b) set 𝐴; (c) set 𝐵; (d) 𝐴𝑐; (e) 𝐴 ∪ 𝐵; (f) 𝐴 ∩ 𝐵; (g) 𝐴 ⊂ 𝐵; (h) disjoint sets 𝐴 and 𝐵; (i) 𝐴 − 𝐵.

Theorem 1.1
Let 𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴. Then 𝐴 = 𝐵.

Proof. Since the empty set is a subset of any set, if 𝐴 = ∅ then 𝐵 ⊂ 𝐴 implies that 𝐵 = ∅. Similarly, if 𝐵 = ∅ then 𝐴 ⊂ 𝐵 implies that 𝐴 = ∅. The theorem is obviously true if 𝐴 and 𝐵 are both empty. If 𝐴 and 𝐵 are nonempty, since 𝐴 ⊂ 𝐵, if 𝜁 ∈ 𝐴 then 𝜁 ∈ 𝐵. Since 𝐵 ⊂ 𝐴, if 𝜁 ∈ 𝐵 then 𝜁 ∈ 𝐴. We therefore conclude that 𝐴 = 𝐵.

The converse of the above theorem is also true: if 𝐴 = 𝐵, then 𝐴 ⊂ 𝐵 and 𝐵 ⊂ 𝐴.
Example 1.1
Let 𝐴 = {(𝑥, 𝑦) : 𝑦 ≤ 𝑥 }, 𝐵 = {(𝑥, 𝑦) : 𝑥 ≤ 𝑦 + 1}, 𝐶 = {(𝑥, 𝑦) : 𝑦 < 1}, and 𝐷 = {(𝑥, 𝑦) : 0 ≤ 𝑦}. Find and sketch 𝐸 = 𝐴 ∩ 𝐵, 𝐹 = 𝐶 ∩ 𝐷, 𝐺 = 𝐸 ∩ 𝐹, and 𝐻 = {(𝑥, 𝑦) : (−𝑥, 𝑦 + 1) ∈ 𝐺 }.

Solution. We first sketch the boundaries of the given sets 𝐴, 𝐵, 𝐶, and 𝐷. If the boundary of a region is included in the set, it is indicated with a solid line; if not, it is indicated with a dotted line. We have 𝐸 = 𝐴 ∩ 𝐵 = {(𝑥, 𝑦) : 𝑥 − 1 ≤ 𝑦 ≤ 𝑥 } and 𝐹 = 𝐶 ∩ 𝐷 = {(𝑥, 𝑦) : 0 ≤ 𝑦 < 1}. The set 𝐺 is the set of all ordered pairs (𝑥, 𝑦) satisfying both 𝑥 − 1 ≤ 𝑦 ≤ 𝑥 and 0 ≤ 𝑦 < 1. Using 1⁻ to denote a value just less than 1, the second inequality may be expressed as 0 ≤ 𝑦 ≤ 1⁻. We may then express the set 𝐺 as

𝐺 = {(𝑥, 𝑦) : max{0, 𝑥 − 1} ≤ 𝑦 ≤ min{𝑥, 1⁻}}.

The set 𝐻 is obtained from 𝐺 by folding about the 𝑦-axis and translating down one unit. This can be seen from the definitions of 𝐺 and 𝐻 by noting that (𝑥, 𝑦) ∈ 𝐻 if (−𝑥, 𝑦 + 1) ∈ 𝐺; hence, we replace 𝑥 with −𝑥 and 𝑦 with 𝑦 + 1 in the above result for 𝐺 to obtain

𝐻 = {(𝑥, 𝑦) : max{0, −𝑥 − 1} ≤ 𝑦 + 1 ≤ min{−𝑥, 1⁻}},

or

𝐻 = {(𝑥, 𝑦) : max{−1, −𝑥 − 2} ≤ 𝑦 ≤ min{−1 − 𝑥, 0⁻}}.

The sets are illustrated in Figure 1.2.

Figure 1.2: Sketches of the sets defined in Example 1.1.

1.3 Operations on sets

Throughout probability theory it is often required to establish relationships between sets. The set operations ∪ and ∩ operate on sets in much the same way the operations + and × operate on real numbers. Similarly, the special sets ∅ and 𝑆 correspond to the additive identity 0 and the multiplicative identity 1, respectively. This correspondence between operations on sets and operations on real numbers is made explicit by the theorem below, which can be proved by applying the definitions of the basic set operations stated above.

Theorem 1.2: Properties of Set Operations

Commutative Properties:
𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴 (1.1)
𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴 (1.2)

Associative Properties:
𝐴 ∪ (𝐵 ∪ 𝐶) = (𝐴 ∪ 𝐵) ∪ 𝐶 (1.3)
𝐴 ∩ (𝐵 ∩ 𝐶) = (𝐴 ∩ 𝐵) ∩ 𝐶 (1.4)

Distributive Properties:
𝐴 ∩ (𝐵 ∪ 𝐶) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶) (1.5)
𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) (1.6)

De Morgan's Laws:
(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵𝑐 (1.7)
(𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵𝑐 (1.8)

Identities involving ∅ and 𝑆:
𝐴 ∪ ∅ = 𝐴 (1.9)
𝐴 ∩ 𝑆 = 𝐴 (1.10)
𝐴 ∩ ∅ = ∅ (1.11)
𝐴 ∪ 𝑆 = 𝑆 (1.12)

Identities involving complements:
𝐴 ∩ 𝐴𝑐 = ∅ (1.13)
𝐴 ∪ 𝐴𝑐 = 𝑆 (1.14)
(𝐴𝑐)𝑐 = 𝐴 (1.15)

Example 1.2
Prove De Morgan's rules.

Solution. First suppose that 𝜁 ∈ (𝐴 ∪ 𝐵)𝑐; then 𝜁 ∉ (𝐴 ∪ 𝐵). In particular, we have 𝜁 ∉ 𝐴, which implies 𝜁 ∈ 𝐴𝑐. Similarly, we have 𝜁 ∉ 𝐵, which implies 𝜁 ∈ 𝐵𝑐. Hence 𝜁 is in both 𝐴𝑐 and 𝐵𝑐; that is, 𝜁 ∈ 𝐴𝑐 ∩ 𝐵𝑐. We have shown that (𝐴 ∪ 𝐵)𝑐 ⊂ 𝐴𝑐 ∩ 𝐵𝑐. To prove inclusion in the other direction, suppose that 𝜁 ∈ 𝐴𝑐 ∩ 𝐵𝑐. This implies that 𝜁 ∈ 𝐴𝑐, so 𝜁 ∉ 𝐴. Similarly, 𝜁 ∈ 𝐵𝑐, so 𝜁 ∉ 𝐵. Therefore, 𝜁 ∉ (𝐴 ∪ 𝐵) and so 𝜁 ∈ (𝐴 ∪ 𝐵)𝑐. We have shown that 𝐴𝑐 ∩ 𝐵𝑐 ⊂ (𝐴 ∪ 𝐵)𝑐. This proves that (𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵𝑐.

To prove the second De Morgan rule, apply the first De Morgan rule to 𝐴𝑐 and 𝐵𝑐 to obtain (𝐴𝑐 ∪ 𝐵𝑐)𝑐 = (𝐴𝑐)𝑐 ∩ (𝐵𝑐)𝑐 = 𝐴 ∩ 𝐵, where we used the identity (𝐴𝑐)𝑐 = 𝐴. Now take complements of both sides: 𝐴𝑐 ∪ 𝐵𝑐 = (𝐴 ∩ 𝐵)𝑐.

[Exercise] Use a Venn diagram to demonstrate De Morgan's rules.
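Set identities such as De Morgan's laws are also easy to check numerically for concrete finite sets. The following short Python sketch is our illustration (not part of the original notes); the particular sets S, A, and B are arbitrary choices.

```python
# Check De Morgan's laws on a small finite universe.
S = set(range(10))        # universal set S = {0, 1, ..., 9}
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(X):
    return S - X          # X^c = S \ X

# (A ∪ B)^c == A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)
# (A ∩ B)^c == A^c ∪ B^c
assert complement(A & B) == complement(A) | complement(B)
print("De Morgan's laws hold for this example")
```

Of course, a check on one example is not a proof; the proof above covers all sets.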
Additional insight into operations on sets is provided by the correspondence between the algebra of set inclusion and Boolean algebra. An element either belongs to a set or it does not. Thus, interpreting sets as Boolean (logical) variables having values of 0 or 1, the ∪ operation as the logical "OR", the ∩ operation as the logical "AND", and the complement 𝑐 as the logical "NOT", any expression involving set operations can be treated as a Boolean expression.

Theorem 1.3
Negative Absorption Theorem: 𝐴 ∪ (𝐴𝑐 ∩ 𝐵) = 𝐴 ∪ 𝐵. (1.16)

Proof. Using the distributive property, 𝐴 ∪ (𝐴𝑐 ∩ 𝐵) = (𝐴 ∪ 𝐴𝑐) ∩ (𝐴 ∪ 𝐵) = 𝑆 ∩ (𝐴 ∪ 𝐵) = 𝐴 ∪ 𝐵.

Theorem 1.4
Principle of Duality: Any set identity remains true if the symbols ∪, ∩, 𝑆, and ∅ are replaced with the symbols ∩, ∪, ∅, and 𝑆, respectively.

Proof. The proof follows by applying De Morgan's Laws and renaming sets 𝐴𝑐, 𝐵𝑐, etc. as 𝐴, 𝐵, etc.

Properties of set operations are easily extended to deal with any finite number of sets. To do this, we need notation for the union and intersection of a collection of sets.

Definition 1.9. Union: We define the union of a collection of sets (or "set of sets") {𝐴𝑖 : 𝑖 ∈ 𝐼 } (1.17) by:

⋃_{𝑖∈𝐼} 𝐴𝑖 = {𝜁 ∈ 𝑆 : 𝜁 ∈ 𝐴𝑖 for some 𝑖 ∈ 𝐼 } (1.18)

Definition 1.10. Intersection: We define the intersection of a collection of sets {𝐴𝑖 : 𝑖 ∈ 𝐼 } (1.19) by:

⋂_{𝑖∈𝐼} 𝐴𝑖 = {𝜁 ∈ 𝑆 : 𝜁 ∈ 𝐴𝑖 for every 𝑖 ∈ 𝐼 } (1.20)

Theorem 1.5: Properties of Set Operations (extended)

Commutative and Associative Properties:

⋃_{𝑖=1}^{𝑛} 𝐴𝑖 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = 𝐴𝑖1 ∪ 𝐴𝑖2 ∪ ... ∪ 𝐴𝑖𝑛, (1.21)
⋂_{𝑖=1}^{𝑛} 𝐴𝑖 = 𝐴1 ∩ 𝐴2 ∩ ... ∩ 𝐴𝑛 = 𝐴𝑖1 ∩ 𝐴𝑖2 ∩ ... ∩ 𝐴𝑖𝑛, (1.22)

where 𝑖1 ∈ {1, 2, ..., 𝑛} = 𝐼1, 𝑖2 ∈ 𝐼2 = 𝐼1 ∩ {𝑖1}𝑐, and 𝑖𝑙 ∈ 𝐼𝑙 = 𝐼𝑙−1 ∩ {𝑖𝑙−1}𝑐 for 𝑙 = 2, 3, ..., 𝑛. In other words, the union (or intersection) of 𝑛 sets is independent of the order in which the unions (or intersections) are taken.

Distributive Properties:

𝐵 ∩ (⋃_{𝑖=1}^{𝑛} 𝐴𝑖) = ⋃_{𝑖=1}^{𝑛} (𝐵 ∩ 𝐴𝑖) (1.23)
𝐵 ∪ (⋂_{𝑖=1}^{𝑛} 𝐴𝑖) = ⋂_{𝑖=1}^{𝑛} (𝐵 ∪ 𝐴𝑖) (1.24)

De Morgan's Laws:

(⋂_{𝑖=1}^{𝑛} 𝐴𝑖)𝑐 = ⋃_{𝑖=1}^{𝑛} 𝐴𝑖𝑐 (1.25)
(⋃_{𝑖=1}^{𝑛} 𝐴𝑖)𝑐 = ⋂_{𝑖=1}^{𝑛} 𝐴𝑖𝑐 (1.26)

Throughout much of probability, it is useful to decompose a set into a union of simpler, nonoverlapping sets. This is an application of the "divide and conquer" approach to problem solving. The necessary terminology is established in the following definitions.

Definition 1.11. Mutually Exclusive: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 are mutually exclusive (or disjoint) if 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 and 𝑗 with 𝑖 ≠ 𝑗.

Definition 1.12. Partition: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 form a partition of the set 𝐵 if they are mutually exclusive and 𝐵 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = ⋃_{𝑖=1}^{𝑛} 𝐴𝑖.

Definition 1.13. Collectively Exhaustive: The sets 𝐴1, 𝐴2, ..., 𝐴𝑛 are collectively exhaustive if 𝑆 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑛 = ⋃_{𝑖=1}^{𝑛} 𝐴𝑖.

Example 1.3
Let 𝑆 = {(𝑥, 𝑦) : 𝑥 ≥ 0, 𝑦 ≥ 0}, 𝐴 = {(𝑥, 𝑦) : 𝑥 + 𝑦 < 1}, 𝐵 = {(𝑥, 𝑦) : 𝑥 < 𝑦}, and 𝐶 = {(𝑥, 𝑦) : 𝑥𝑦 > 1/4}. Are the sets 𝐴, 𝐵, and 𝐶 mutually exclusive, collectively exhaustive, and/or a partition of 𝑆?

Solution. Since 𝐴 ∩ 𝐶 = ∅, the sets 𝐴 and 𝐶 are mutually exclusive; however, 𝐴 ∩ 𝐵 ≠ ∅ and 𝐵 ∩ 𝐶 ≠ ∅, so 𝐴 and 𝐵, and 𝐵 and 𝐶, are not mutually exclusive. Since 𝐴 ∪ 𝐵 ∪ 𝐶 ≠ 𝑆, the events are not collectively exhaustive. The events 𝐴, 𝐵, and 𝐶 are not a partition of 𝑆 since they are neither mutually exclusive nor collectively exhaustive.

Definition 1.14. Cartesian Product: The Cartesian product of sets 𝐴 and 𝐵 is the set of ordered pairs of elements of 𝐴 and 𝐵:

𝐴 × 𝐵 = {𝜁 = (𝜁1, 𝜁2) : 𝜁1 ∈ 𝐴, 𝜁2 ∈ 𝐵}. (1.27)

The Cartesian product of sets 𝐴1, 𝐴2, ..., 𝐴𝑛 is the set of 𝑛-tuples (ordered lists of 𝑛 elements) of elements of 𝐴1, 𝐴2, ..., 𝐴𝑛:

𝐴1 × 𝐴2 × ... × 𝐴𝑛 = {𝜁 = (𝜁1, 𝜁2, ..., 𝜁𝑛) : 𝜁1 ∈ 𝐴1, 𝜁2 ∈ 𝐴2, ..., 𝜁𝑛 ∈ 𝐴𝑛 }. (1.28)
An important example of a Cartesian product is the usual 𝑛-dimensional real Euclidean space:

𝑅𝑛 = 𝑅 × 𝑅 × ... × 𝑅 (𝑛 terms). (1.29)

1.4 Other Notations

Some special sets of real numbers will often be encountered:

(𝑎, 𝑏) = {𝑥 : 𝑎 < 𝑥 < 𝑏},
(𝑎, 𝑏] = {𝑥 : 𝑎 < 𝑥 ≤ 𝑏},
[𝑎, 𝑏) = {𝑥 : 𝑎 ≤ 𝑥 < 𝑏},
[𝑎, 𝑏] = {𝑥 : 𝑎 ≤ 𝑥 ≤ 𝑏}.

Note that if 𝑎 > 𝑏, then (𝑎, 𝑏) = (𝑎, 𝑏] = [𝑎, 𝑏) = [𝑎, 𝑏] = ∅. If 𝑎 = 𝑏, then (𝑎, 𝑏) = (𝑎, 𝑏] = [𝑎, 𝑏) = ∅ and [𝑎, 𝑏] = {𝑎}. The notation (𝑎, 𝑏) is also used to denote an ordered pair; we depend on the context to determine whether (𝑎, 𝑏) represents an open interval of real numbers or an ordered pair.

1.5 Random Experiments

To further clarify the basics of random experiments, we begin with a few simple definitions.

Definition 1.15. Experiment: An experiment is a procedure we perform (quite often hypothetical) that produces some result. Often the letter 𝐸 is used to designate an experiment. For example, the experiment 𝐸5 might consist of tossing a coin five times.

Definition 1.16. Outcome: An outcome is a possible result of an experiment. The letter 𝜁 is often used to represent outcomes. For example, the outcome 𝜁1 of experiment 𝐸5 might represent the sequence of tosses heads-heads-tails-heads-tails; or, concisely, HHTHT.

Definition 1.17. Event: An event is a certain set of outcomes of an experiment. For example, the event 𝐶 associated with experiment 𝐸5 might be 𝐶 = {all outcomes consisting of an even number of heads}.

Definition 1.18. Sample space: The sample space is the collection or set of "all possible" distinct (collectively exhaustive and mutually exclusive) outcomes of an experiment. The letter 𝑆 is used to designate the sample space, which is the universal set of outcomes of an experiment. Note that in the coin tossing experiment the coin may land on edge, but experience has shown us that such a result is highly unlikely to occur; therefore, our sample space for such experiments typically excludes such unlikely outcomes. For now, we assume all outcomes to be distinct; consequently, we are considering only the set of simple outcomes that are collectively exhaustive and mutually exclusive.

A sample space is called discrete if it is a finite or a countably infinite set. It is called continuous, or a continuum, otherwise. The set of all real numbers between 0 and 1 is an example of an uncountable sample space. For now, we only deal with discrete sample spaces.

Example 1.4
Consider the experiment of flipping a fair coin once, where fair means that the coin is not biased in weight to a particular side. There are two possible outcomes: a head (𝜁1 = 𝐻) or a tail (𝜁2 = 𝑇). Thus, the sample space 𝑆 consists of two outcomes, 𝜁1 = 𝐻 and 𝜁2 = 𝑇.

Example 1.5
Now consider flipping the coin until a tail occurs, at which point the experiment is terminated. The sample space consists of a collection of sequences of coin tosses. The outcomes are 𝜁𝑛, 𝑛 = 1, 2, 3, .... The final toss in any particular sequence is a tail and terminates the sequence; the tosses prior to the occurrence of the tail must be heads. The possible outcomes that may occur are:

𝜁1 = (𝑇), 𝜁2 = (𝐻, 𝑇), 𝜁3 = (𝐻, 𝐻, 𝑇), ...

Note that in this case 𝑛 can extend to infinity. This is a combined sample space resulting from conducting independent but identical experiments. In this example, the sample space is countably infinite.

Example 1.6
A cubical die with numbered faces is rolled and the result observed.
The sample space consists of six possible outcomes, 𝜁1 = 1, 𝜁2 = 2, ..., 𝜁6 = 6, indicating the possible observed faces of the cubical die.

Example 1.7
Now consider the experiment of rolling two dice and observing the results. The sample space consists of 36 outcomes:

𝜁1 = (1, 1), 𝜁2 = (1, 2), ..., 𝜁6 = (1, 6), 𝜁7 = (2, 1), 𝜁8 = (2, 2), ..., 𝜁36 = (6, 6);

the first component in the ordered pair indicates the result of the toss of the first die, and the second component indicates the result of the toss of the second die. Alternatively, we can consider this experiment as two distinct experiments, each consisting of rolling a single die. The sample spaces (𝑆1 and 𝑆2) for each of the two experiments are identical, namely, the same as in Example 1.6. We may now consider the sample space of the original experiment, 𝑆, to be the combination of the sample spaces 𝑆1 and 𝑆2, which consists of all possible combinations of the elements of both 𝑆1 and 𝑆2. This is another example of a combined sample space. Several interesting events can also be defined from this experiment, such as:

𝐴 = {the sum of the outcomes of the two rolls = 4},
𝐵 = {the outcomes of the two rolls are identical},
𝐶 = {the first roll was bigger than the second}.

The choice of a particular sample space depends upon the questions that are to be answered concerning the experiment. Suppose that in Example 1.7 we were asked to record after each roll the sum of the numbers shown on the two faces. Then the sample space could be represented by only eleven outcomes, 𝜁1 = 2, 𝜁2 = 3, ..., 𝜁11 = 12. However, the original sample space is in some way more fundamental: the sum of the die faces can be determined from the numbers on the die faces, but the sum is not sufficient to specify the sequence of numbers that occurred.

1.5.1 Tree Diagrams

Many experiments consist of a sequence of simpler "sub-experiments", for example, the sequential tossing of a coin or the sequential rolling of a die. A tree diagram is a useful graphical representation of a sequence of experiments, particularly when each sub-experiment has a small number of possible outcomes.

Example 1.8
The coin in Example 1.4 is tossed twice. Illustrate the sample space with a tree diagram.

Let 𝐻𝑖 and 𝑇𝑖 denote the outcome of a head or a tail on the 𝑖th toss, respectively. The sample space is:

𝑆 = {𝐻1𝐻2, 𝐻1𝑇2, 𝑇1𝐻2, 𝑇1𝑇2}

The tree diagram illustrating the sample space for this sequence of two coin tosses is shown in Figure 1.3.

Figure 1.3: Tree diagram for Example 1.8

Each node represents an outcome of one coin toss and the branches of the tree connect the nodes. The number of branches to the right of each node corresponds to the number of outcomes for the next coin toss (or experiment). A sequence of samples connected by branches in a left-to-right path from the origin to a terminal node represents a sample point for the combined experiment. There is a one-to-one correspondence between the paths in the tree diagram and the sample points in the sample space for the combined experiment.

1.5.2 Coordinate System

A coordinate system representation is another way to illustrate the sample space, especially useful for a combination of two experiments with numerical outcomes. With this method, each axis lists the outcomes of one sub-experiment. In Example 1.7, where a die is tossed twice, the coordinate system can represent the sample space as shown in Figure 1.4.

Figure 1.4: Coordinate system representation for Example 1.7
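Combined sample spaces like this one are easy to enumerate programmatically. The Python sketch below is our illustration (not part of the original notes): it builds the 36-point sample space of Example 1.7 with itertools.product and counts the outcomes in the events 𝐴, 𝐵, and 𝐶 defined above.

```python
from itertools import product

# Combined sample space: all ordered pairs (first roll, second roll).
S = list(product(range(1, 7), repeat=2))
assert len(S) == 36

A = [(d1, d2) for (d1, d2) in S if d1 + d2 == 4]   # sum of the two rolls is 4
B = [(d1, d2) for (d1, d2) in S if d1 == d2]       # the two rolls are identical
C = [(d1, d2) for (d1, d2) in S if d1 > d2]        # first roll bigger than second

print(len(A), len(B), len(C))   # 3 6 15
```

Dividing these counts by 36 gives the event probabilities once each outcome is assigned probability 1/36, which anticipates the counting approach of Section 2.5.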
Note that there are 36 sample points in the experiment. Additionally, we distinguish between sample points with regard to order; e.g., (1, 2) is different from (2, 1).

Further Reading
1. John D. Enderle, David C. Farden, Daniel J. Krause, Basic Probability Theory for Biomedical Engineers, Morgan & Claypool, 2006: sections 1.1 and 1.2
2. Scott L. Miller, Donald Childers, Probability and Random Processes: With Applications to Signal Processing and Communications, 2nd ed., Elsevier, 2012: section 2.1
3. Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering, 3rd ed., Pearson, 2007: sections 1.3 and 2.1
4. Charles W. Therrien, Probability for Electrical and Computer Engineers, CRC Press, 2004: chapter 1

2 Probability Theory

2.1 Definition of Probability

Now that the concepts of experiments, outcomes, and events have been introduced, the next step is to assign probabilities to various outcomes and events. This requires a careful definition of probability. It should be clear from our everyday usage of the word probability that it is a measure of the likelihood of various events. In general terms, probability is a function of an event that produces a numerical quantity measuring the likelihood of that event. More specifically, probability is a real number between 0 and 1, with probability 0 meaning that the event is extremely unlikely to occur and probability 1 meaning that the event is almost certain to occur. Several approaches to probability theory have been taken; two definitions are discussed in this section.

2.1.1 Relative Frequency Definition

The relative frequency definition of probability is based on observation or experimental evidence and not on prior knowledge. If an experiment is repeated 𝑁 times and a certain event 𝐴 occurs in 𝑁𝐴 out of the 𝑁 trials, then the probability of 𝐴 is defined to be

𝑃 (𝐴) = lim_{𝑁→∞} 𝑁𝐴/𝑁 (2.1)

For example, if a six-sided die is rolled a large number of times and the numbers on the face of the die come up in approximately equal proportions, then we could say that the probability of each number on the upturned face of the die is 1/6. The difficulty with this definition is determining when 𝑁 is sufficiently large and, indeed, whether the limit actually exists. We will certainly use this definition in practice, relating deduced probabilities to the physical world, but we will not develop probability theory from it.
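The behaviour behind definition (2.1) can be explored by simulation. The following Python sketch is our illustration (not part of the original notes): it estimates the probability of one face of a fair die by its relative frequency and shows the estimate settling near 1/6 as 𝑁 grows; the fixed seed is only there to make the run reproducible.

```python
import random

random.seed(1)  # arbitrary seed for a reproducible run

for N in (100, 10_000, 1_000_000):
    # Count occurrences of the event A = {face 3 comes up} in N rolls.
    N_A = sum(1 for _ in range(N) if random.randint(1, 6) == 3)
    print(N, N_A / N)   # relative frequency N_A/N approaches P(A) = 1/6 ≈ 0.1667
```

In practice there is no way to take 𝑁 to infinity, which is exactly the difficulty with the relative frequency definition noted above.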
2.1.2 Axiomatic Definition

For now, we consider the event space (denoted by 𝐹) to be simply the space containing all events to which we wish to assign a probability. We start with three axioms that any method for assigning probabilities must satisfy:

1. For any event 𝐴 ∈ 𝐹, 𝑃 (𝐴) ≥ 0 (a negative probability does not make sense).
2. If 𝑆 is the sample space for a given experiment, 𝑃 (𝑆) = 1 (probabilities are normalized so that the maximum value is unity).
3. If 𝐴 ∩ 𝐵 = ∅, then 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵). In general, if 𝐴1, 𝐴2, ... are mutually exclusive events in 𝐹, i.e. 𝐴𝑖 ∩ 𝐴𝑗 = ∅ for all 𝑖 ≠ 𝑗, then:

𝑃 (⋃_{𝑖=1}^{∞} 𝐴𝑖) = Σ_{𝑖=1}^{∞} 𝑃 (𝐴𝑖)

The following theorem is a direct consequence of the axioms of probability and is useful for solving probability problems.

Theorem 2.1
Assuming that all events indicated are in the event space 𝐹, we have:
(i) 𝑃 (𝐴𝑐) = 1 − 𝑃 (𝐴),
(ii) 𝑃 (∅) = 0,
(iii) 0 ≤ 𝑃 (𝐴) ≤ 1,
(iv) 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵),
(v) 𝑃 (𝐵) ≤ 𝑃 (𝐴) if 𝐵 ⊂ 𝐴.

Proof.
(i) Since 𝑆 = 𝐴 ∪ 𝐴𝑐 and 𝐴 ∩ 𝐴𝑐 = ∅, we apply the second and third axioms of probability to obtain 𝑃 (𝑆) = 1 = 𝑃 (𝐴) + 𝑃 (𝐴𝑐), from which (i) follows.
(ii) Applying (i) with 𝐴 = 𝑆 we have 𝐴𝑐 = ∅, so that 𝑃 (∅) = 1 − 𝑃 (𝑆) = 0.
(iii) From (i) we have 𝑃 (𝐴) = 1 − 𝑃 (𝐴𝑐); from the first axiom we have 𝑃 (𝐴) ≥ 0 and 𝑃 (𝐴𝑐) ≥ 0; consequently, 0 ≤ 𝑃 (𝐴) ≤ 1.
(iv) Let 𝐶 = 𝐵 ∩ 𝐴𝑐. Then 𝐴 ∪ 𝐶 = 𝐴 ∪ (𝐵 ∩ 𝐴𝑐) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐴𝑐) = 𝐴 ∪ 𝐵, and 𝐴 ∩ 𝐶 = 𝐴 ∩ 𝐵 ∩ 𝐴𝑐 = ∅, so that 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴 ∪ 𝐶) = 𝑃 (𝐴) + 𝑃 (𝐶). Now we find 𝑃 (𝐶). Since 𝐵 = 𝐵 ∩ 𝑆 = 𝐵 ∩ (𝐴 ∪ 𝐴𝑐) = (𝐵 ∩ 𝐴) ∪ (𝐵 ∩ 𝐴𝑐) and (𝐵 ∩ 𝐴) ∩ (𝐵 ∩ 𝐴𝑐) = ∅, we have 𝑃 (𝐵) = 𝑃 (𝐵 ∩ 𝐴𝑐) + 𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐶) + 𝑃 (𝐴 ∩ 𝐵), so 𝑃 (𝐶) = 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵). Substituting this into 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐶) gives (iv).
(v) We have 𝐴 = 𝐴 ∩ (𝐵 ∪ 𝐵𝑐) = (𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐵𝑐), and if 𝐵 ⊂ 𝐴, then 𝐴 = 𝐵 ∪ (𝐴 ∩ 𝐵𝑐). Since 𝐵 ∩ (𝐴 ∩ 𝐵𝑐) = ∅, it follows that 𝑃 (𝐴) = 𝑃 (𝐵) + 𝑃 (𝐴 ∩ 𝐵𝑐) ≥ 𝑃 (𝐵).

[Exercise] Visualize this theorem by drawing a Venn diagram.

Example 2.1
Given 𝑃 (𝐴) = 0.4, 𝑃 (𝐴 ∩ 𝐵𝑐) = 0.2, and 𝑃 (𝐴 ∪ 𝐵) = 0.6, find 𝑃 (𝐴 ∩ 𝐵) and 𝑃 (𝐵).

Solution. We have 𝑃 (𝐴) = 𝑃 (𝐴 ∩ 𝐵) + 𝑃 (𝐴 ∩ 𝐵𝑐), so that 𝑃 (𝐴 ∩ 𝐵) = 0.4 − 0.2 = 0.2. Similarly, 𝑃 (𝐵𝑐) = 𝑃 (𝐵𝑐 ∩ 𝐴) + 𝑃 (𝐵𝑐 ∩ 𝐴𝑐) = 0.2 + 1 − 𝑃 (𝐴 ∪ 𝐵) = 0.6. Hence, 𝑃 (𝐵) = 1 − 𝑃 (𝐵𝑐) = 0.4.

Note that since probabilities are non-negative (Theorem 2.1 (iii)), Theorem 2.1 (iv) implies that the probability of the union of two events is no greater than the sum of the individual event probabilities:

𝑃 (𝐴 ∪ 𝐵) ≤ 𝑃 (𝐴) + 𝑃 (𝐵) (2.2)

This can be extended to Boole's inequality, described as follows.

Theorem 2.2
Boole's inequality: Let 𝐴1, 𝐴2, ... all belong to 𝐹. Then

𝑃 (⋃_{𝑖=1}^{∞} 𝐴𝑖) = Σ_{𝑘=1}^{∞} (𝑃 (𝐴𝑘) − 𝑃 (𝐴𝑘 ∩ 𝐵𝑘)) ≤ Σ_{𝑘=1}^{∞} 𝑃 (𝐴𝑘), where 𝐵𝑘 = ⋃_{𝑖=1}^{𝑘−1} 𝐴𝑖.

Proof. Note that 𝐵1 = ∅, 𝐵2 = 𝐴1, 𝐵3 = 𝐴1 ∪ 𝐴2, ..., 𝐵𝑘 = 𝐴1 ∪ 𝐴2 ∪ ... ∪ 𝐴𝑘−1; as 𝑘 increases, the size of 𝐵𝑘 is non-decreasing. Let 𝐶𝑘 = 𝐴𝑘 ∩ 𝐵𝑘𝑐; thus 𝐶𝑘 = 𝐴𝑘 ∩ (𝐴1𝑐 ∩ 𝐴2𝑐 ∩ ... ∩ 𝐴𝑘−1𝑐) consists of all elements in 𝐴𝑘 that are not in any 𝐴𝑖, 𝑖 = 1, 2, ..., 𝑘 − 1. Then

𝐵𝑘+1 = ⋃_{𝑖=1}^{𝑘} 𝐴𝑖 = 𝐵𝑘 ∪ 𝐶𝑘

and 𝑃 (𝐵𝑘+1) = 𝑃 (𝐵𝑘) + 𝑃 (𝐶𝑘). We have 𝑃 (𝐵2) = 𝑃 (𝐶1), 𝑃 (𝐵3) = 𝑃 (𝐶1) + 𝑃 (𝐶2), and in general

𝑃 (𝐵𝑘+1) = 𝑃 (⋃_{𝑖=1}^{𝑘} 𝐴𝑖) = Σ_{𝑖=1}^{𝑘} 𝑃 (𝐶𝑖)

The desired result follows by noting that 𝑃 (𝐶𝑖) = 𝑃 (𝐴𝑖) − 𝑃 (𝐴𝑖 ∩ 𝐵𝑖).

Example 2.2
Let 𝑆 = [0, 1] (the set of real numbers 𝑥 : 0 ≤ 𝑥 ≤ 1). Let 𝐴1 = [0, 0.5], 𝐴2 = (0.45, 0.7), 𝐴3 = [0.6, 0.8), and assume 𝑃 (𝜁 ∈ 𝐼) = length of the interval 𝐼 ∩ 𝑆, so that 𝑃 (𝐴1) = 0.5, 𝑃 (𝐴2) = 0.25, and 𝑃 (𝐴3) = 0.2. Find 𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3).

Solution. Let 𝐶1 = 𝐴1, 𝐶2 = 𝐴2 ∩ 𝐴1𝑐 = (0.5, 0.7), and 𝐶3 = 𝐴3 ∩ 𝐴1𝑐 ∩ 𝐴2𝑐 = [0.7, 0.8). Then 𝐶1, 𝐶2, and 𝐶3 are mutually exclusive and 𝐴1 ∪ 𝐴2 ∪ 𝐴3 = 𝐶1 ∪ 𝐶2 ∪ 𝐶3; hence 𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3) = 𝑃 (𝐶1 ∪ 𝐶2 ∪ 𝐶3) = 0.5 + 0.2 + 0.1 = 0.8. Note that for this example Boole's inequality yields 𝑃 (𝐴1 ∪ 𝐴2 ∪ 𝐴3) ≤ 0.5 + 0.25 + 0.2 = 0.95.

2.2 Joint Probabilities

Suppose that we have two sets, 𝐴 and 𝐵. We saw a few results in the previous section that dealt with how to calculate the probability of the union of two sets, 𝐴 ∪ 𝐵. At least as frequently, we are interested in calculating the probability of the intersection of two sets, 𝐴 ∩ 𝐵.

Definition 2.1. Joint probability: The probability of the intersection of two sets 𝐴 and 𝐵, 𝑃 (𝐴 ∩ 𝐵), is referred to as the joint probability of the sets 𝐴 and 𝐵 and is usually denoted by 𝑃 (𝐴, 𝐵). Extending to an arbitrary number of sets, the joint probability of the sets 𝐴1, 𝐴2, ..., 𝐴𝑀, denoted 𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑀), is 𝑃 (𝐴1 ∩ 𝐴2 ∩ ... ∩ 𝐴𝑀).
From the relative frequency definition, in practice we may let 𝑛𝐴,𝐵 be the number of times that 𝐴 and 𝐵 simultaneously occur in 𝑛 trials. Then

𝑃 (𝐴, 𝐵) = lim_{𝑛→∞} 𝑛𝐴,𝐵/𝑛 (2.3)

Example 2.3
A standard deck of playing cards has 52 cards that can be divided in several manners. There are four suits (spades, hearts, diamonds, and clubs), each of which has 13 cards (ace, 2, 3, 4, ..., 10, jack, queen, king). There are two red suits (hearts and diamonds) and two black suits (spades and clubs). Also, the jacks, queens, and kings are referred to as face cards, while the others are number cards. Suppose the cards are sufficiently shuffled (randomized) and one card is drawn from the deck. The experiment has 52 outcomes corresponding to the 52 individual cards that could have been selected; hence, each outcome has a probability of 1/52. Define the events: 𝐴 = {red card selected}, 𝐵 = {number card selected}, 𝐶 = {heart selected}. Since the event 𝐴 consists of 26 outcomes (there are 26 red cards), 𝑃 (𝐴) = 26/52 = 1/2. Likewise, 𝑃 (𝐵) = 40/52 = 10/13 and 𝑃 (𝐶) = 13/52 = 1/4. Events 𝐴 and 𝐵 have 20 outcomes in common, hence 𝑃 (𝐴, 𝐵) = 20/52 = 5/13. Likewise, 𝑃 (𝐵, 𝐶) = 10/52 = 5/26 and 𝑃 (𝐴, 𝐶) = 13/52 = 1/4. It is interesting to note that in this example 𝑃 (𝐴, 𝐶) = 𝑃 (𝐶), because 𝐶 ⊂ 𝐴 and as a result 𝐴 ∩ 𝐶 = 𝐶.

2.3 Conditional Probabilities

Often the occurrence of one event may be dependent upon the occurrence of another. In Example 2.3, the event 𝐴 = {a red card is selected} had a probability of 𝑃 (𝐴) = 1/2. If it is known that event 𝐶 = {a heart is selected} has occurred, then the event 𝐴 is now certain (probability equal to 1), since all cards in the heart suit are red. Likewise, if it is known that the event 𝐶 did not occur, then there are 39 cards remaining, 13 of which are red (all the diamonds); hence, the probability of event 𝐴 in that case becomes 1/3. Clearly, the probability of event 𝐴 depends on the occurrence of event 𝐶. We say that the probability of 𝐴 is conditional on 𝐶, or that it is the probability of 𝐴 conditioned on knowing that 𝐶 has occurred.

Definition 2.2. Conditional probability: the probability of 𝐴 given knowledge that the event 𝐵 has occurred is referred to as the conditional probability of 𝐴 given 𝐵, denoted by 𝑃 (𝐴|𝐵):

𝑃 (𝐴|𝐵) = 𝑃 (𝐴, 𝐵)/𝑃 (𝐵) (2.4)

provided that 𝑃 (𝐵) is nonzero.

The conditional probability measure is a legitimate probability measure that satisfies each of the axioms of probability. Note carefully that, in general, 𝑃 (𝐵|𝐴) ≠ 𝑃 (𝐴|𝐵). If we interpret probability as relative frequency, then 𝑃 (𝐴|𝐵) should be the relative frequency of the event 𝐴 ∩ 𝐵 in experiments where 𝐵 occurred. Suppose that the experiment is performed 𝑛 times, that event 𝐵 occurs 𝑛𝐵 times, and that event 𝐴 ∩ 𝐵 occurs 𝑛𝐴,𝐵 times. The relative frequency of interest is then

𝑃 (𝐴, 𝐵)/𝑃 (𝐵) = lim_{𝑛→∞} (𝑛𝐴,𝐵/𝑛)/(𝑛𝐵/𝑛) = lim_{𝑛→∞} 𝑛𝐴,𝐵/𝑛𝐵 (2.5)

provided that 𝑃 (𝐵) is nonzero. We may find in some cases that conditional probabilities are easier to compute than the corresponding joint probabilities, and hence this formula offers a convenient way to compute joint probabilities:

𝑃 (𝐴, 𝐵) = 𝑃 (𝐵|𝐴)𝑃 (𝐴) = 𝑃 (𝐴|𝐵)𝑃 (𝐵) (2.6)

This idea can be extended to more than two events. Consider finding the joint probability of three events 𝐴, 𝐵, and 𝐶:

𝑃 (𝐴, 𝐵, 𝐶) = 𝑃 (𝐶 |𝐴, 𝐵)𝑃 (𝐴, 𝐵) = 𝑃 (𝐶 |𝐴, 𝐵)𝑃 (𝐵|𝐴)𝑃 (𝐴) (2.7)

In general, for 𝑀 events 𝐴1, 𝐴2, ..., 𝐴𝑀,

𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑀) = 𝑃 (𝐴𝑀 |𝐴1, 𝐴2, ..., 𝐴𝑀−1)𝑃 (𝐴𝑀−1 |𝐴1, 𝐴2, ..., 𝐴𝑀−2) ... 𝑃 (𝐴2 |𝐴1)𝑃 (𝐴1) (2.8)
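Both the card probabilities of Example 2.3 and the defining relation (2.4) can be confirmed by brute-force enumeration. The Python sketch below is our illustration (not part of the original notes); the suit/rank encoding is an arbitrary choice, and, following the example, the ace counts as a number card.

```python
from fractions import Fraction
from itertools import product

# The 52-card deck as (suit, rank) pairs.
suits = ["spades", "hearts", "diamonds", "clubs"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = list(product(suits, ranks))

def prob(event):
    # Each card is equally likely, so P = |event| / 52.
    return Fraction(sum(event(c) for c in deck), len(deck))

A = lambda c: c[0] in ("hearts", "diamonds")      # red card selected
B = lambda c: c[1] not in ("J", "Q", "K")         # number card selected
C = lambda c: c[0] == "hearts"                    # heart selected

P_AB = prob(lambda c: A(c) and B(c))
print(P_AB)                  # 5/13, matching Example 2.3
print(P_AB / prob(B))        # P(A|B) from definition (2.4): 1/2
```

Here the conditional probability P(A|B) = 1/2 also matches intuition: among the 40 number cards, exactly half are red.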
Example 2.4
Return to the experiment of drawing cards from a deck as described in Example 2.3. Suppose now that we select two cards at random from the deck. When we select the second card, we do not return the first card to the deck; in this case, we say that we are selecting cards without replacement. As a result, the probabilities associated with selecting the second card are slightly different if we have knowledge of which card was drawn on the first selection. To illustrate this, let 𝐴 = {first card was a spade} and 𝐵 = {second card was a spade}. The probability of the event 𝐴 can be calculated as in the previous example to be 𝑃 (𝐴) = 13/52 = 1/4. Likewise, if we have no knowledge of what was drawn on the first selection, the probability of the event 𝐵 is the same, 𝑃 (𝐵) = 1/4. To calculate the joint probability of 𝐴 and 𝐵, we have to do some counting. To begin, when we select the first card there are 52 possible outcomes. Since this card is not returned to the deck, there are only 51 possible outcomes for the second card. Hence, this experiment of selecting two cards from the deck has 52 × 51 possible outcomes, each of which is equally likely. Similarly, there are 13 × 12 outcomes that belong to the joint event 𝐴 ∩ 𝐵. Therefore, the joint probability of 𝐴 and 𝐵 is 𝑃 (𝐴, 𝐵) = (13 × 12)/(52 × 51) = 1/17. The conditional probability of the second card being a spade given that the first card is a spade is then 𝑃 (𝐵|𝐴) = 𝑃 (𝐴, 𝐵)/𝑃 (𝐴) = (1/17)/(1/4) = 4/17. However, calculating this conditional probability directly is easier than calculating the joint probability: given that we know the first card selected was a spade, there are now 51 cards left in the deck, 12 of which are spades; thus 𝑃 (𝐵|𝐴) = 12/51 = 4/17.

2.3.1 Bayes's Theorem

The concept of conditional probability leads us to the following theorem.

Theorem 2.3
For any events 𝐴 and 𝐵 such that 𝑃 (𝐵) ≠ 0,

𝑃 (𝐴|𝐵) = 𝑃 (𝐵|𝐴)𝑃 (𝐴)/𝑃 (𝐵) (2.9)

Proof. From Definition 2.2,

𝑃 (𝐴, 𝐵) = 𝑃 (𝐴|𝐵)𝑃 (𝐵) = 𝑃 (𝐵|𝐴)𝑃 (𝐴). (2.10)

The result follows directly by dividing the preceding equation by 𝑃 (𝐵).

Theorem 2.3 is useful for calculating certain conditional probabilities since, in many problems, it may be quite difficult to compute 𝑃 (𝐴|𝐵) directly, whereas calculating 𝑃 (𝐵|𝐴) may be straightforward.

Theorem 2.4: Theorem of Total Probability
Let 𝐵1, 𝐵2, ..., 𝐵𝑛 be a set of mutually exclusive and collectively exhaustive events. That is, 𝐵𝑖 ∩ 𝐵𝑗 = ∅ for all 𝑖 ≠ 𝑗 and

⋃_{𝑖=1}^{𝑛} 𝐵𝑖 = 𝑆 ⇒ Σ_{𝑖=1}^{𝑛} 𝑃 (𝐵𝑖) = 1 (2.11)

Then

𝑃 (𝐴) = Σ_{𝑖=1}^{𝑛} 𝑃 (𝐴|𝐵𝑖)𝑃 (𝐵𝑖) (2.12)

Proof. From the Venn diagram in Figure 2.1, it can be seen that the event 𝐴 can be written as

𝐴 = (𝐴 ∩ 𝐵1) ∪ (𝐴 ∩ 𝐵2) ∪ ... ∪ (𝐴 ∩ 𝐵𝑛) ⇒ 𝑃 (𝐴) = 𝑃 ({𝐴 ∩ 𝐵1} ∪ {𝐴 ∩ 𝐵2} ∪ ... ∪ {𝐴 ∩ 𝐵𝑛}) (2.13)

Also, since the 𝐵𝑖 are all mutually exclusive, the {𝐴 ∩ 𝐵𝑖} are also mutually exclusive, so that

𝑃 (𝐴) = Σ_{𝑖=1}^{𝑛} 𝑃 (𝐴, 𝐵𝑖) = Σ_{𝑖=1}^{𝑛} 𝑃 (𝐴|𝐵𝑖)𝑃 (𝐵𝑖) (by Theorem 2.3). (2.14)

Figure 2.1: Venn diagram used to help prove the theorem of total probability

By combining the results of Theorems 2.3 and 2.4, we get what has come to be known as Bayes's theorem.

Theorem 2.5: Bayes's Theorem
Let 𝐵1, 𝐵2, ..., 𝐵𝑛 be a set of mutually exclusive and collectively exhaustive events. Then

𝑃 (𝐵𝑖 |𝐴) = 𝑃 (𝐴|𝐵𝑖)𝑃 (𝐵𝑖) / Σ_{𝑗=1}^{𝑛} 𝑃 (𝐴|𝐵𝑗)𝑃 (𝐵𝑗) (2.15)

𝑃 (𝐵𝑖) is often referred to as the a priori probability of event 𝐵𝑖, while 𝑃 (𝐵𝑖 |𝐴) is known as the a posteriori probability of event 𝐵𝑖 given 𝐴.

Example 2.5
A certain auditorium has 30 rows of seats.
Row 1 has 11 seats, Row 2 has 12 seats, Row 3 has 13 seats, and so on to the back of the auditorium, where Row 30 has 40 seats. A door prize is to be given away by randomly selecting a row (with equal probability of selecting any of the 30 rows) and then randomly selecting a seat within that row (with each seat in the row equally likely to be selected). Find the probability that Seat 15 was selected given that Row 20 was selected, and also find the probability that Row 20 was selected given that Seat 15 was selected.

Solution. The first task is straightforward. Given that Row 20 was selected, there are 30 possible seats in Row 20 that are equally likely to be selected. Hence, 𝑃 (Seat 15|Row 20) = 1/30. Without the help of Bayes's theorem, finding the probability that Row 20 was selected given that we know Seat 15 was selected would seem to be a formidable problem. Using Bayes's theorem,

𝑃 (Row 20|Seat 15) = 𝑃 (Seat 15|Row 20)𝑃 (Row 20)/𝑃 (Seat 15).

The two terms in the numerator on the right-hand side are both equal to 1/30. The term in the denominator is calculated with the help of the theorem of total probability. Noting that Row 𝑘 has 𝑘 + 10 seats, so that Seat 15 exists only in rows 𝑘 ≥ 5,

𝑃 (Seat 15) = Σ_{𝑘=5}^{30} (1/(𝑘 + 10)) (1/30) = 0.0342

With this calculation completed, the a posteriori probability of Row 20 being selected given that Seat 15 was selected is

𝑃 (Row 20|Seat 15) = (1/30)(1/30)/0.0342 = 0.0325

Note that the a priori probability that Row 20 was selected is 1/30 = 0.0333. Therefore, the additional information that Seat 15 was selected makes the event that Row 20 was selected slightly less likely. In some sense, this may be counterintuitive, since we know that if Seat 15 was selected, there are certain rows that could not have been selected (i.e., Rows 1-4 have fewer than 15 seats), and therefore we might expect Row 20 to have a slightly higher probability of being selected compared to when we have no information about which seat was selected. To see why the probability actually goes down, try computing the probability that Row 5 was selected given that Seat 15 was selected. The event that Seat 15 was selected makes some rows much more probable, while it makes others less probable and a few rows now impossible.
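The numbers in Example 2.5 are easy to reproduce. The minimal Python sketch below is our illustration (not part of the original notes): it applies the theorem of total probability (2.12) and Bayes's theorem (2.15) directly, with rows indexed 1 to 30 and row 𝑘 holding 𝑘 + 10 seats.

```python
# P(Row k) = 1/30 for every row; P(Seat 15 | Row k) = 1/(k + 10)
# when the row has at least 15 seats (k >= 5), and 0 otherwise.
p_row = 1 / 30

# Theorem of total probability: P(Seat 15) = sum_k P(Seat 15 | Row k) P(Row k)
p_seat15 = sum((1 / (k + 10)) * p_row for k in range(5, 31))
print(round(p_seat15, 4))               # 0.0342

# Bayes: P(Row 20 | Seat 15) = P(Seat 15 | Row 20) P(Row 20) / P(Seat 15)
p_row20_given_seat15 = (1 / 30) * p_row / p_seat15
print(round(p_row20_given_seat15, 4))   # 0.0325
```

Replacing Row 20 by Row 5 in the last step shows the opposite effect: the front rows that barely contain Seat 15 become more probable, which is the resolution of the apparent paradox discussed above.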
2.4 Independence

In Example 2.5, it was seen that observing one event can change the probability of occurrence of another event. In that particular case, knowing that Seat 15 was selected lowered the probability that Row 20 was selected. We say that the event 𝐴 = {Row 20 was selected} is statistically dependent on the event 𝐵 = {Seat 15 was selected}. If the description of the auditorium were changed so that each row had an equal number of seats (say all 30 rows had 20 seats each), then observing the event 𝐵 = {Seat 15 was selected} would not give us any new information about the likelihood of the event 𝐴 = {Row 20 was selected}. In that case, we say that the events 𝐴 and 𝐵 are statistically independent. Mathematically, two events 𝐴 and 𝐵 are independent if 𝑃 (𝐴|𝐵) = 𝑃 (𝐴). That is, the a priori probability of event 𝐴 is identical to the a posteriori probability of 𝐴 given 𝐵. Note that if 𝑃 (𝐴|𝐵) = 𝑃 (𝐴), then the following conditions also hold: 𝑃 (𝐵|𝐴) = 𝑃 (𝐵) and 𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵). Furthermore, if 𝑃 (𝐴|𝐵) ≠ 𝑃 (𝐴), then the other two conditions also do not hold. We can thereby conclude that any of these three conditions can be used as a test for independence, and the other two forms must follow. We use the last form as the definition of independence since it is symmetric in the events 𝐴 and 𝐵.

Definition 2.3. Independence: Two events are statistically independent if and only if

𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵) (2.16)

Example 2.6
Consider the experiment of tossing two numbered dice and observing the numbers that appear on the two upper faces. For convenience, let the dice be distinguished by color, with the first die tossed being red and the second being white. Let: 𝐴 = {number on the red die is less than or equal to 2}, 𝐵 = {number on the white die is greater than or equal to 4}, 𝐶 = {the sum of the numbers on the two dice is 3}. As mentioned in the preceding text, there are several ways to establish independence (or lack thereof) of a pair of events. One possible way is to compare 𝑃 (𝐴, 𝐵) with 𝑃 (𝐴)𝑃 (𝐵). Note that for the events defined here, 𝑃 (𝐴) = 1/3, 𝑃 (𝐵) = 1/2, and 𝑃 (𝐶) = 1/18. Also, of the 36 possible outcomes of the experiment, six belong to the event 𝐴 ∩ 𝐵, and hence 𝑃 (𝐴, 𝐵) = 1/6. Since 𝑃 (𝐴)𝑃 (𝐵) = 1/6 as well, we conclude that the events 𝐴 and 𝐵 are independent. This agrees with intuition, since we would not expect the outcome of the roll of one die to affect the outcome of the other. What about the events 𝐴 and 𝐶? Of the 36 possible outcomes of the experiment, two belong to the event 𝐴 ∩ 𝐶, and hence 𝑃 (𝐴, 𝐶) = 1/18. Since 𝑃 (𝐴)𝑃 (𝐶) = 1/54, the events 𝐴 and 𝐶 are not independent. Again, this is intuitive, since whenever the event 𝐶 occurs the event 𝐴 must also occur, and so the two must be dependent. Finally, we look at the pair of events 𝐵 and 𝐶. Clearly, 𝐵 and 𝐶 are mutually exclusive: if the white die shows a number greater than or equal to 4, there is no way the sum can be 3. Hence, 𝑃 (𝐵, 𝐶) = 0, and since 𝑃 (𝐵)𝑃 (𝐶) = 1/36, these two events are also dependent.

Note that mutually exclusive events are not the same as independent events. For two events 𝐴 and 𝐵 with 𝑃 (𝐴) ≠ 0 and 𝑃 (𝐵) ≠ 0, 𝐴 and 𝐵 can never be both independent and mutually exclusive. Thus, mutually exclusive events are necessarily statistically dependent.
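The independence tests of Example 2.6 reduce to comparing counts over the 36 equally likely outcomes. The Python sketch below is our illustration (not part of the original notes); exact fractions avoid any floating-point rounding in the comparisons.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))   # (red, white) outcomes

def P(event):
    # Equally likely outcomes: P = (# outcomes in event) / 36.
    return Fraction(sum(event(r, w) for (r, w) in S), len(S))

A = lambda r, w: r <= 2          # red die <= 2
B = lambda r, w: w >= 4          # white die >= 4
C = lambda r, w: r + w == 3      # sum of the dice is 3

print(P(lambda r, w: A(r, w) and B(r, w)) == P(A) * P(B))   # True: A, B independent
print(P(lambda r, w: A(r, w) and C(r, w)) == P(A) * P(C))   # False: A, C dependent
print(P(lambda r, w: B(r, w) and C(r, w)) == P(B) * P(C))   # False: B, C dependent
```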
Generalizing the definition of independence to three events: 𝐴, 𝐵, and 𝐶 are mutually independent if each pair of events is independent,

𝑃 (𝐴, 𝐵) = 𝑃 (𝐴)𝑃 (𝐵) (2.17)
𝑃 (𝐴, 𝐶) = 𝑃 (𝐴)𝑃 (𝐶) (2.18)
𝑃 (𝐵, 𝐶) = 𝑃 (𝐵)𝑃 (𝐶) (2.19)

and, in addition,

𝑃 (𝐴, 𝐵, 𝐶) = 𝑃 (𝐴)𝑃 (𝐵)𝑃 (𝐶) (2.20)

Definition 2.4. The events 𝐴1, 𝐴2, ..., 𝐴𝑛 are independent if any subset of 𝑘 < 𝑛 of these events is independent, and in addition

𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑛) = 𝑃 (𝐴1)𝑃 (𝐴2) ... 𝑃 (𝐴𝑛) (2.21)

There are basically two ways in which we can use the idea of independence. We can compute joint or conditional probabilities and apply one of the definitions as a test for independence. Alternatively, we can assume independence and use the definitions to compute joint or conditional probabilities that otherwise may be difficult to find. The latter approach is used extensively in engineering applications. For example, certain types of noise signals can be modeled in this way. Suppose we have some time waveform 𝑋 (𝑡) which represents a noisy signal that we wish to sample at various points in time, 𝑡1, 𝑡2, ..., 𝑡𝑛. Perhaps we are interested in the probabilities that these samples might exceed some threshold, so we define the events 𝐴𝑖 = {𝑋 (𝑡𝑖) > 𝑇 }, 𝑖 = 1, 2, ..., 𝑛. In some cases, we can assume that the value of the noise at one point in time does not affect the value of the noise at another point in time. Hence, we assume that these events are independent, and therefore 𝑃 (𝐴1, 𝐴2, ..., 𝐴𝑛) = 𝑃 (𝐴1)𝑃 (𝐴2) ... 𝑃 (𝐴𝑛).

2.5 Basic Combinatorics

In many situations, the probability of each possible outcome of an experiment is taken to be equally likely. The card drawing and dice rolling examples fall into this category, where the probability of a certain event 𝐴 can be obtained by counting:

𝑃 (𝐴) = (number of outcomes in 𝐴)/(number of outcomes in the entire sample space) (2.22)

Sometimes, when the scope of the experiment is fairly small, it is straightforward to count the number of outcomes. On the other hand, for problems where the experiment is fairly complicated, the number of outcomes involved can quickly become astronomical, and the corresponding exercise in counting can be quite daunting. In this section, we present some fairly simple tools that are helpful for counting the number of outcomes in a variety of commonly encountered situations.

2.5.1 Sequence of Experiments

Suppose a combined experiment (𝐸 = 𝐸1 × 𝐸2 × 𝐸3 × ... × 𝐸𝑘) is performed, where the first experiment 𝐸1 has 𝑛1 possible outcomes, followed by a second experiment 𝐸2 which has 𝑛2 possible outcomes, and so on. A sequence of 𝑘 such experiments thus has

𝑛 = 𝑛1 𝑛2 ... 𝑛𝑘 = ∏_{𝑖=1}^{𝑘} 𝑛𝑖 (2.23)

possible outcomes. This result allows us to quickly calculate the number of sample points in a sequence of experiments.

Example 2.7
How many odd two-digit numbers can be formed from the digits 2, 7, 8, and 9, if each digit can be used only once?

Solution. As the first experiment, there are two ways of selecting a digit for the unit's place (either 7 or 9). For each outcome of the first experiment, there are three ways of selecting a digit for the ten's place in the second experiment, excluding the digit used for the unit's place. The number of outcomes in the combined experiment is therefore 2 × 3 = 6.

Example 2.8
An analog-to-digital converter outputs an 8-bit word to represent an input analog voltage in the range −5 to +5 V. Determine the total number of words possible and the maximum sampling (quantization) error.

Solution. Since each bit (or binary digit) in a computer word is either a one or a zero, and there are 8 bits, the total number of computer words is 𝑛 = 2⁸ = 256. To determine the maximum sampling error, first compute the range of voltage assigned to each computer word, which equals 10 V/256 words = 0.0390625 V/word, and then divide by two (i.e., round off to the nearest level), which yields a maximum error of 0.01953125 V.

2.5.2 Sampling with Replacement and with Ordering

Suppose we choose 𝑘 objects in order from a set 𝐴 with 𝑛 distinct objects, in such a way that after selecting each object and noting its identity in an ordered list, we place it back in the set before the next choice is made. Therefore the same choice can be repeated. We will refer to the set 𝐴 as the "population." The experiment produces an ordered 𝑘-tuple (𝑥1, 𝑥2, ..., 𝑥𝑘), where 𝑥𝑖 ∈ 𝐴 and 𝑖 = 1, 2, ..., 𝑘. Equation 2.23 with 𝑛1 = 𝑛2 = ... = 𝑛𝑘 = 𝑛 implies that

number of distinct ordered 𝑘-tuples = 𝑛^𝑘.

Example 2.9
How many 𝑘-digit binary numbers are there?

Solution. There are 2^𝑘 different binary numbers. Note that the digits are "ordered," and repeated 0 and 1 digits are possible.
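The 𝑛^𝑘 count can be checked by direct enumeration; the following Python sketch is our illustration (not part of the original notes), with arbitrary values of 𝑛 and 𝑘.

```python
from itertools import product

n, k = 5, 3   # population size n, number of draws k (arbitrary choices)

# All ordered k-tuples with replacement from a population of n objects.
tuples = list(product(range(1, n + 1), repeat=k))
assert len(tuples) == n ** k                             # 5**3 = 125

# Example 2.9: k-digit binary numbers for k = 4.
assert len(list(product([0, 1], repeat=4))) == 2 ** 4    # 16
```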
Example 2.10
An urn contains five balls numbered 1 to 5. Suppose we select two balls from the urn with replacement. How many distinct ordered pairs are possible? What is the probability that the two draws yield the same number?

Solution. The number of ordered pairs is 5² = 25. Figure 2.2 shows the 25 possible pairs. Five of the 25 outcomes have the two draws with the same number; if we suppose that all pairs are equiprobable, then the probability that the two draws yield the same number is 5/25 = 0.2.

Figure 2.2: Possible outcomes in sampling with replacement and with ordering of two balls from an urn containing five distinct balls

2.5.3 Sampling without Replacement and with Ordering

Suppose we choose 𝑘 objects in order from a population 𝐴 of 𝑛 distinct objects, without replacement. Clearly, 𝑘 ≤ 𝑛. The number of possible outcomes in the first draw is 𝑛1 = 𝑛; the number of possible outcomes in the second draw is 𝑛2 = 𝑛 − 1, namely all 𝑛 objects except the one selected in the first draw; and so on, up to 𝑛𝑘 = 𝑛 − (𝑘 − 1) in the final draw. The number of distinct ordered 𝑘-tuples is

𝑃ₖⁿ = 𝑛(𝑛 − 1) ... (𝑛 − 𝑘 + 1) = 𝑛!/(𝑛 − 𝑘)! (2.24)

The quantity 𝑃ₖⁿ is also called the number of permutations of 𝑛 things taken 𝑘 at a time, or the number of 𝑘-permutations.

Example 2.11
An urn contains five balls numbered 1 to 5. Suppose we select two balls in succession without replacement. How many distinct ordered pairs are possible? What is the probability that the first ball has a number larger than that of the second ball?

Solution. Equation 2.24 states that the number of ordered pairs is 5 × 4 = 20, as shown in Figure 2.3. Ten ordered pairs (in the dashed triangle) have the first number larger than the second number; thus the probability of this event is 10/20 = 0.5.

Figure 2.3: Possible outcomes in sampling without replacement and with ordering.

Example 2.12
An urn contains five balls numbered 1 to 5. Suppose we draw three balls with replacement. What is the probability that all three balls are different?

Solution. From Equation 2.23 there are 5³ = 125 possible outcomes, which we will suppose are equiprobable. The number of these outcomes for which the three draws are different is given by Equation 2.24: 5 × 4 × 3 = 60. Thus the probability that all three balls are different is 60/125 = 0.48.

In many problems of interest, we seek the number of different ways that we can rearrange or order several items. This number of permutations can easily be determined from Equation 2.24 as follows. Consider drawing 𝑛 objects from an urn containing 𝑛 distinct objects until the urn is empty, i.e., sampling without replacement with 𝑘 = 𝑛. Thus, the number of possible orderings (permutations) of 𝑛 distinct objects is

number of permutations of 𝑛 objects = 𝑛(𝑛 − 1) ... (2)(1) = 𝑛! (2.25)

2.5.4 Sampling without Replacement and without Ordering

Suppose we pick 𝑘 objects from a set of 𝑛 distinct objects without replacement and record the result without regard to order. (You can imagine that you have no record of the order in which the selection was done.) We call the resulting subset of 𝑘 selected objects a combination of size 𝑘. The number of different combinations of size 𝑘 from a set of size 𝑛 (𝑘 ≤ 𝑛) is

𝐶ₖⁿ = 𝑛(𝑛 − 1) ... (𝑛 − 𝑘 + 1)/𝑘! = 𝑛!/((𝑛 − 𝑘)! 𝑘!) (2.26)

The expression 𝐶ₖⁿ is also called a binomial coefficient and is read "n choose k." Note that choosing 𝑘 objects out of a set of 𝑛 is equivalent to choosing the 𝑛 − 𝑘 objects that are to be left out, since

𝐶ₖⁿ = 𝐶ₙ₋ₖⁿ (2.27)
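Python's standard library exposes these counts directly: math.perm(n, k) computes 𝑃ₖⁿ and math.comb(n, k) computes 𝐶ₖⁿ (both available from Python 3.8). The sketch below is our illustration (not part of the original notes); it cross-checks the formulas against itertools enumeration for the urn of Examples 2.11 and 2.13.

```python
import math
from itertools import combinations, permutations

n, k = 5, 2
balls = range(1, n + 1)

# Without replacement, with ordering (2.24): P_k^n = n!/(n-k)!
assert len(list(permutations(balls, k))) == math.perm(n, k) == 20

# Without replacement, without ordering (2.26): C_k^n = n!/((n-k)! k!)
assert len(list(combinations(balls, k))) == math.comb(n, k) == 10

# Symmetry (2.27): choosing k to keep = choosing n-k to leave out.
assert math.comb(n, k) == math.comb(n, n - k)
```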
Note that, from Equation 2.25, there are 𝑘! possible orders in which the 𝑘 selected objects could have been drawn. Thus the number of 𝑘-permutations, i.e., the total number of distinct ordered samples of 𝑘 objects, is

𝑃ₖⁿ = 𝐶ₖⁿ 𝑘! (2.28)

Example 2.13
Find the number of ways of selecting two balls from five balls numbered 1 to 5, without replacement and without regard to order.

Solution. From Equation 2.26:

𝐶₂⁵ = 5!/(2! 3!) = 10

Figure 2.4 shows the 10 pairs.

Figure 2.4: Possible outcomes in sampling without replacement and without ordering.

Example 2.14
Find the number of distinct permutations of 2 white balls and 3 black balls.

Solution. This problem is equivalent to the following sampling problem: assume 5 possible positions for the balls, then pick a combination of 2 positions out of 5 and arrange the 2 white balls accordingly. Each combination leads to a distinct arrangement (permutation) of 2 white balls and 3 black balls. Thus the number of distinct permutations of 2 white balls and 3 black balls is 𝐶₂⁵. The 10 distinct permutations with 2 whites (zeros) and 3 blacks (ones) are:

00111 01011 01101 01110 10011 10101 10110 11001 11010 11100

Note that the positions of the whites (zeros) can be represented by the pair of numbers on the two selected balls in Figure 2.4.

Example 2.14 shows that sampling without replacement and without ordering is equivalent to partitioning the set of 𝑛 distinct objects into two sets: 𝐵, containing the 𝑘 items that are picked from the urn, and 𝐵𝑐, containing the 𝑛 − 𝑘 left behind. Suppose we partition a set of 𝑛 distinct objects into 𝐽 subsets 𝐵1, 𝐵2, ..., 𝐵𝐽, where subset 𝐵𝑗 is assigned 𝑘𝑗 elements and 𝑘1 + 𝑘2 + ... + 𝑘𝐽 = 𝑛. It can be shown that the number of distinct partitions is

𝑛!/(𝑘1! 𝑘2! ... 𝑘𝐽!) (2.29)

which is called the multinomial coefficient. The binomial coefficient is the special case of the multinomial coefficient with 𝐽 = 2.

2.5.5 Sampling with Replacement and without Ordering

Suppose we pick 𝑘 objects from a set of 𝑛 distinct objects with replacement and record the result without regard to order. This can be done by filling out a form which has 𝑛 columns, one for each distinct object. Each time an object is selected, an "×" is placed in the corresponding column. For example, if we are picking 5 objects from 4 distinct objects, one possible form would record: Object 1: ××, Object 2: (none), Object 3: ×, Object 4: ××. Note that this form can be summarized by the sequence

×× | | × | ××

where the "|"s indicate the lines between columns, and where nothing appears between consecutive "|"s if the corresponding object was not selected. Each different arrangement of 5 ×s and 3 |s leads to a distinct form. If we identify ×s with "white balls" and |s with "black balls," then this problem becomes similar to Example 2.14, and the number of different arrangements is given by 𝐶₃⁸. In the general case the form will involve 𝑘 ×s and (𝑛 − 1) |s. Thus the number of different ways of picking 𝑘 objects from a set of 𝑛 distinct objects with replacement and without ordering is given by

𝐶ₖⁿ⁻¹⁺ᵏ = 𝐶ₙ₋₁ⁿ⁻¹⁺ᵏ (2.30)

Example 2.15
Find the number of ways of selecting two balls from five balls numbered 1 to 5, with replacement but without regard to order.

Solution. From Equation 2.30:

𝐶₂⁵⁻¹⁺² = 𝐶₂⁶ = 6!/(2! 4!) = 15

Figure 2.5 shows the 15 pairs. Note that because of the replacement after each selection, the same ball can be selected twice in a pair.

Figure 2.5: Possible outcomes in sampling with replacement and without ordering.
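Equation (2.30) can likewise be confirmed with itertools.combinations_with_replacement; the Python sketch below is our illustration (not part of the original notes), applied to Example 2.15.

```python
import math
from itertools import combinations_with_replacement

n, k = 5, 2
pairs = list(combinations_with_replacement(range(1, n + 1), k))

# Stars-and-bars count (2.30): C(n-1+k, k)
assert len(pairs) == math.comb(n - 1 + k, k) == 15
print(pairs[:5])   # [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]
```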
Further Reading
1. John D. Enderle, David C. Farden, Daniel J. Krause, Basic Probability Theory for Biomedical Engineers, Morgan & Claypool, 2006: sections 1.2.3 to 1.9
2. Scott L. Miller, Donald Childers, Probability and Random Processes: With Applications to Signal Processing and Communications, 2nd ed., Elsevier, 2012: sections 2.2 to 2.7
3. Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering, 3rd ed., Pearson, 2007: sections 2.2 to 2.6

3 Random Variables

In most random experiments, we are interested in a numerical attribute of the outcome of the experiment. A random variable is defined as a function that assigns a numerical value to the outcome of the experiment.

3.1 The Notion of a Random Variable

The outcome of a random experiment need not be a number. However, we are usually interested not in the outcome itself, but rather in some measurement or numerical attribute of the outcome. For example, in 𝑛 tosses of a coin, we may be interested in the total number of heads and not in the specific order in which heads and tails occur. In a randomly selected Web document, we may be interested only in the length of the document. In each of these examples, a measurement assigns a numerical value to the outcome of the random experiment. Since the outcomes are random, the results of the measurements will also be random. Hence it makes sense to talk about the probabilities of the resulting numerical values.

Definition 3.1. Random variable: A random variable is a real-valued function of the elements of a sample space, 𝑆. A random variable 𝑋 is a function that assigns a real number, 𝑋 (𝜁), to each outcome 𝜁 in the sample space, 𝑆, of a random experiment, 𝐸. If the mapping 𝑋 (𝜁) is such that the random variable 𝑋 takes on a finite or countably infinite number of values, we refer to 𝑋 as a discrete random variable; whereas, if the range of 𝑋 (𝜁) is an uncountably infinite number of points, we refer to 𝑋 as a continuous random variable.

Figure 3.1 illustrates how a random variable assigns a number to an outcome in the sample space. The sample space 𝑆 is the domain of the random variable, and the set 𝑆𝑥 of all values taken on by 𝑋 is the range of the random variable. Thus 𝑆𝑥 is a subset of the set of all real numbers. We will use capital letters (𝑋, 𝑌, etc.) to denote random variables and lower-case letters (𝑥, 𝑦, etc.) to denote possible values of the random variables.

Figure 3.1: A random variable assigns a number 𝑋 (𝜁) to each outcome 𝜁 in the sample space 𝑆 of a random experiment.

Since 𝑋 (𝜁) is a random variable whose numerical value depends on the outcome of an experiment, we cannot describe the random variable by stating its value; rather, we describe the probabilities that the variable takes on a specific value or values (e.g. 𝑃 (𝑋 = 3) or 𝑃 (𝑋 > 8)).

Example 3.1
A coin is tossed three times and the sequence of heads and tails is noted. The sample space for this experiment is 𝑆 = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. (a) Let 𝑋 be the number of heads in the three tosses. Find the random variable 𝑋 (𝜁) for each outcome 𝜁. (b) Now find the probability of the event {𝑋 = 2}.

Solution. (a) 𝑋 assigns each outcome 𝜁 in 𝑆 a number from the set 𝑆𝑥 = {0, 1, 2, 3}. The table below lists the eight outcomes of 𝑆 and the corresponding values of 𝑋.

𝜁:      HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
𝑋 (𝜁):   3    2    2    2    1    1    1    0

(b) Note that 𝑋 (𝜁) = 2 if and only if 𝜁 is in {HHT, HTH, THH}; therefore,

𝑃 (𝑋 = 2) = 𝑃 ({HHT, HTH, THH}) = 𝑃 ({HHT}) + 𝑃 ({HTH}) + 𝑃 ({THH}) = 3/8
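A random variable really is just a function of the outcome, and Example 3.1 can be written out programmatically. The Python sketch below is our illustration (not part of the original notes): each outcome is a string of H/T characters, and 𝑋 counts the heads.

```python
from itertools import product

# Sample space of three coin tosses, e.g. 'HHT'.
S = ["".join(t) for t in product("HT", repeat=3)]

def X(zeta):
    return zeta.count("H")   # random variable: number of heads

# Equivalent event A = {zeta : X(zeta) = 2} and its probability (fair coin).
A = [zeta for zeta in S if X(zeta) == 2]
print(A)                 # ['HHT', 'HTH', 'THH']
print(len(A) / len(S))   # P(X = 2) = 3/8 = 0.375
```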
ζ :      HHH   HHT   HTH   THH   HTT   THT   TTH   TTT
X(ζ) :    3     2     2     2     1     1     1     0

(b) Note that X(ζ) = 2 if and only if ζ is in {HHT, HTH, THH}, therefore:

$$P(X = 2) = P(\{HHT, HTH, THH\}) = P(\{HHT\}) + P(\{HTH\}) + P(\{THH\}) = 3/8$$

Example 3.1 shows a general technique for finding the probabilities of events involving the random variable X. Let the underlying random experiment have sample space S. To find the probability of a subset B of ℝ, e.g., B = {x_k}, we need to find the outcomes in S that are mapped to B, i.e.:

$$A = \{\zeta : X(\zeta) \in B\} \qquad (3.1)$$

as shown in Figure 3.2. If event A occurs, then X(ζ) ∈ B, so event B occurs. Conversely, if event B occurs, then the value X(ζ) implies that ζ is in A, so event A occurs. Thus the probability that X is in B is given by:

$$P(X \in B) = P(A) = P(\{\zeta : X(\zeta) \in B\}) \qquad (3.2)$$

We refer to A and B as equivalent events. In some random experiments the outcome ζ is already the numerical value we are interested in. In such cases we simply let X(ζ) = ζ, the identity function, to obtain a random variable.

Figure 3.2: An illustration of P(X ∈ B) = P(ζ ∈ A).

3.2 Discrete Random Variables

Definition 3.2. Discrete random variable: a random variable X that assumes values from a countable set, that is, S_X = {x_1, x_2, x_3, ...}. A discrete random variable is said to be finite if its range is finite, that is, S_X = {x_1, x_2, ..., x_n}.

We are interested in finding the probabilities of events involving a discrete random variable X. Since the sample space is discrete, we only need to obtain the probabilities for the events A_k = {ζ : X(ζ) = x_k} in the underlying random experiment. The probabilities of all events involving X can be found from the probabilities of the A_k's.

3.2.1 Probability Mass Function

Definition 3.3. Probability mass function: the probability mass function (PMF), P_X(x), of a random variable X is a function that assigns a probability to each possible value of the random variable. The probability that the random variable X takes on the specific value x is the value of the probability mass function at x. That is:

$$P_X(x) = P(X = x) = P(\{\zeta : X(\zeta) = x\}) \quad \text{for } x \text{ a real number} \qquad (3.3)$$

Note that we use the convention that upper case letters represent random variables while lower case letters represent fixed values that the random variable can assume. The PMF satisfies the following properties, which provide all the information required to calculate probabilities for events involving the discrete random variable X:

(i) $P_X(x) \geq 0$ for all $x$

(ii) $\sum_{x \in S_X} P_X(x) = \sum_k P_X(x_k) = \sum_k P(A_k) = 1$

(iii) $P(X \in B) = \sum_{x \in B} P_X(x)$, where $B \subset S_X$

Example 3.2
Let X be the number of heads in three independent tosses of a fair coin. Find the PMF of X.

Solution. As seen in Example 3.1:

$$P_X(0) = P(X = 0) = P(\{TTT\}) = 1/8$$
$$P_X(1) = P(X = 1) = P(\{HTT\}) + P(\{THT\}) + P(\{TTH\}) = 3/8$$
$$P_X(2) = P(X = 2) = P(\{HHT\}) + P(\{HTH\}) + P(\{THH\}) = 3/8$$
$$P_X(3) = P(X = 3) = P(\{HHH\}) = 1/8$$

Note that P_X(0) + P_X(1) + P_X(2) + P_X(3) = 1.

Figure 3.3 shows the graph of P_X(x) versus x for the random variable in this example. In general, the graph of the PMF of a discrete random variable has vertical arrows of height P_X(x_k) at the values x_k in S_X. The relative values of the PMF at different points indicate the relative likelihoods of occurrence. Finally, let us consider the relationship between relative frequencies and the PMF, previewed by the short simulation below.
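As a concrete preview, here is a minimal simulation sketch in Python (not part of the original derivation; the sample size n = 10,000 and the random seed are arbitrary choices) that repeats the three-toss experiment of Example 3.2 and compares the relative frequency of each value of X with its PMF:

```python
import random
from collections import Counter

# Simulate Example 3.2: X = number of heads in three tosses of a fair coin.
# Compare the relative frequency f_k(n) = N_k(n)/n with the PMF P_X(k).
random.seed(0)          # arbitrary seed, for repeatability
n = 10_000              # number of repetitions of the three-toss experiment

counts = Counter(
    sum(1 for _ in range(3) if random.random() < 0.5)  # heads in 3 tosses
    for _ in range(n)
)

pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}  # exact PMF from Example 3.2
for k in sorted(pmf):
    print(f"k={k}: f_k(n) = {counts[k]/n:.4f}, P_X(k) = {pmf[k]:.4f}")
```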
Suppose we perform 𝑛 independent repetitions to obtain 𝑛 observations of the discrete random variable 𝑋 . Let 𝑁𝑘 (𝑛) be the number of times the event 𝑋 = 𝑥𝑘 occurs and let 𝑓𝑘 (𝑛) = 𝑁𝑘 (𝑛)/𝑛 be the corresponding relative frequency. As 𝑛 becomes large we expect that 𝑓𝑘 (𝑛) → 𝑃𝑋 (𝑥𝑘 ). Therefore the graph of relative frequencies should approach the graph of the PMF. For the experiment in Example 3.2, 1000 repetitions of an experiment of tossing a coin may generate a graph of relative frequencies shown in Figure 3.3. Figure 3.3: Relative frequencies and corresponding PMF for the experiment in Example 3.2 3.2.2 The Cumulative Distribution Function The PMF of a discrete random variable was defined in terms of events of the form {𝑋 = 𝑏}. The cumulative distribution function is an alternative approach which uses events of the form {𝑋 ≤ 𝑏}. The cumulative distribution function has the advantage that it is not limited to discrete random variables and applies to all types of random variables. Definition 3.4. Cumulative distribution function: The cumulative distribution function (CDF) of a random variable 𝑋 is defined as the probability of the event {𝑋 ≤ 𝑥 }: 𝐹𝑋 (𝑥) = 𝑃 (𝑋 ≤ 𝑥) for −∞ < 𝑥 < +∞ (3.4) In other words, the CDF is the probability that the random variable 𝑋 takes on a value in the set (−∞, 𝑥]. In terms of the underlying sample space, the CDF is the probability of the event 32 3 Random Variables {𝜁 : 𝑋 (𝜁 ) ≤ 𝑥 }. The event {𝑋 ≤ 𝑥 } and its probability vary as 𝑥 is varied; since 𝐹𝑋 (𝑥) is a function of the variable 𝑥. From the definition of CDF, the following property can be derived: 𝑃 (𝑋 > 𝑥) = 1 − 𝐹𝑋 (𝑥) (3.5) The CDF has the following interpretation in terms of relative frequency. Suppose that the experiment that yields the outcome 𝜁 and hence 𝑋 (𝜁 ) is performed a large number of times. 𝐹𝑋 (𝑏) is then the long-term proportion of times in which 𝑋 (𝜁 ) ≤ 𝑏. Like the PMF, the CDF summarizes the probabilistic properties of a random variable. Knowledge of either of them allows the other function to be calculated. For example, suppose that the PMF is known. The CDF can then be calculated from the expression: Õ Õ 𝐹𝑋 (𝑥) = 𝑃 (𝑋 = 𝑦) = 𝑃𝑋 (𝑦) (3.6) 𝑦 ≤𝑥 𝑦 ≤𝑥 In other words, the value of 𝐹𝑋 (𝑥) is constructed by simply adding together the probabilities 𝑃𝑋 (𝑥) for values 𝑦 that are no larger than 𝑥. Note that: 𝑃 (𝑎 < 𝑋 ≤ 𝑏) = 𝐹𝑋 (𝑏) − 𝐹𝑋 (𝑎) (3.7) The CDF is an increasing step function with steps at the values taken by the random variable. The heights of the steps are the probabilities of taking these values. Mathematically, the PMF can be obtained from the CDF through the relationship: 𝑃𝑋 (𝑥) = 𝐹𝑋 (𝑥) − 𝐹𝑋 (𝑥 − ) (3.8) where 𝐹𝑋 (𝑥 − ) is the limiting value from below of the cumulative distribution function. If there is no step in the cumulative distribution function at a point 𝑥, then 𝐹𝑋 (𝑥) = 𝐹𝑋 (𝑥 − ) and 𝑃𝑋 (𝑥) = 0. If there is a step at a point 𝑥, then 𝐹𝑋 (𝑥) is the value of the CDF at the top of the step, and 𝐹𝑋 (𝑥 − ) is the value of the CDF at the bottom of the step, so that 𝑃𝑋 (𝑥) is the height of the step. These relationships are illustrated in the following example. Example 3.3 Similar to Example 3.2, let 𝑋 be the number of heads in three tosses of a fair coin. Find the CDF of X. Solution. From Example 3.2, we know that 𝑋 takes on only the values 0, 1, 2, and 3 with probabilities 1/8, 3/8, 3/8, and 1/8, respectively, so 𝐹𝑋 (𝑥) is simply the sum of the probabilities of the outcomes from {0, 1, 2, 3} that are less than or equal to 𝑥. 
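As a quick numerical companion to this solution (a minimal sketch; the evaluation points are arbitrary), the CDF can be computed by summing PMF values over all k ≤ x, as in Equation 3.6:

```python
# A small sketch of the computation in Example 3.3: F_X(x) is the sum of the
# PMF values P_X(k) over all k <= x (Equation 3.6).
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def F(x):
    """CDF of the number of heads in three fair-coin tosses."""
    return sum(p for k, p in pmf.items() if k <= x)

for x in (-1, 0, 0.5, 1, 1.99, 2, 3, 5):
    print(f"F_X({x}) = {F(x):.3f}")
```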
The resulting CDF is a non-decreasing staircase function that grows from 0 to 1. It has jumps at the points 0, 1, 2, 3 of magnitudes 1/8, 3/8, 3/8, and 1/8, respectively. 33 3 Random Variables Let us take a closer look at one of these discontinuities, say, in the vicinity of 𝑥 = 1. For a small positive number 𝛿, we have: 𝐹𝑋 (1− ) = 𝐹𝑋 (1 − 𝛿) = 𝑃 (𝑋 ≤ 1 − 𝛿) = 𝑃 ({0 heads}) = 1/8 so the limit of the CDF as 𝑥 approaches 1 from the left is 1/8. However, 𝐹𝑋 (1) = 𝑃 (𝑋 ≤ 1) = 𝑃 ({0 or 1 heads}) = 1/8 + 3/8 = 1/2 Thus the CDF is continuous from the right and equal to 1/2 at the point 𝑥 = 1. Indeed, we note the magnitude of the step at the point 𝑥 = 1 is 𝑃 (𝑋 = 1) = 1/2 − 1/8 = 3/8. The CDF can be written compactly in terms of the unit step function: 1 3 3 1 𝐹𝑋 (𝑥) = 𝑢 (𝑥) + 𝑢 (𝑥 − 1) + 𝑢 (𝑥 − 2) + 𝑢 (𝑥 − 3) 8 8 8 8 3.2.3 Expected Value and Moments Expected Value In some situations we are interested in a few parameters that summarize the information provided by the PMF. For example, Figure 3.4 shows the results of many repetitions of an experiment that produces two random variables. It can be observed that the random variable 𝑌 varies about the value 0, whereas the random variable 𝑋 varies around the value 5. It is also clear that 𝑋 is more spread out than 𝑌 . We may just need some parameters that quantify these properties. Figure 3.4: The graphs show 150 repetitions of the experiments yielding 𝑋 and 𝑌 . It is clear that 𝑋 is centered about the value 5 while 𝑌 is centered about 0. It is also clear that 𝑋 is more spread out than 𝑌 (Taken from Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,3rd ed. Pearson, 2007). Definition 3.5. Expected value: The expected value or expectation or mean of a discrete random variable 𝑋 , with a probability mass function 𝑃𝑋 (𝑥) is defined by: Õ 𝑚𝑋 = 𝐸 [𝑋 ] = 𝑥𝑘 𝑃𝑋 (𝑥𝑘 ) (3.9) 𝑘 34 3 Random Variables 𝐸 [𝑋 ] provides a summary measure of the average value taken by the random variable and is also known as the mean of the random variable. The expected value 𝐸 [𝑋 ] is defined if the above sum converges absolutely, that is: Õ 𝐸 [|𝑋 |] = |𝑥𝑘 |𝑃𝑋 (𝑥𝑘 ) < ∞ (3.10) 𝑘 otherwise the expected value does not exist. Random variables with unbounded expected value are not uncommon and appear in models where outcomes that have extremely large values are not that rare. Examples include the sizes of files in Web transfers, frequencies of words in large bodies of text, and various financial and economic problems. If we view 𝑃𝑋 (𝑥) as the distribution of mass on the points 𝑥 1, 𝑥 2, ... on the real line, then 𝐸 [𝑋 ] represents the center of mass of this distribution. Example 3.4 Revisiting Example 3.1, let 𝑋 be the number of heads in three tosses of a fair coin. Find 𝐸 [𝑋 ]. Solution. From Example 3.2 and the pmf of 𝑋 : 𝐸 [𝑋 ] = 3 Õ 𝑘𝑃𝑋 (𝑘) = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5 𝑘=0 The use of the term “expected value” does not mean that we expect to observe 𝐸 [𝑋 ] when we perform the experiment that generates 𝑋 . For example, the expected value of the number of heads in Example 3.4 is 1.5, but its outcomes can only be 0, 1, 2 or 3. 𝐸 [𝑋 ] can be explained as an average of 𝑋 in a large number of observations of 𝑋 . Suppose we perform 𝑛 independent repetitions of the experiment that generates 𝑋 , and we record the observed values as 𝑥 (1), 𝑥 (2), ..., 𝑥 (𝑛), where 𝑥 ( 𝑗) is the observation in the 𝑗 𝑡 ℎ experiment. 
Let N_k(n) be the number of times x_k is observed (k = 1, 2, ..., K), and let f_k(n) = N_k(n)/n be the corresponding relative frequency. The arithmetic average, or sample mean, of the observations is:

$$\langle X \rangle_n = \frac{x(1) + x(2) + \dots + x(n)}{n} = \frac{x_1 N_1(n) + x_2 N_2(n) + \dots + x_K N_K(n)}{n} \qquad (3.11)$$
$$= x_1 f_1(n) + x_2 f_2(n) + \dots + x_K f_K(n) \qquad (3.12)$$
$$= \sum_k x_k f_k(n) \qquad (3.13)$$

The first numerator adds the observations in the order in which they occur, while the second counts how many times each x_k occurs and then computes the total. As n becomes large, we expect the relative frequencies to approach the probabilities P_X(x_k):

$$\lim_{n \to \infty} f_k(n) = P_X(x_k) \quad \text{for all } k \qquad (3.14)$$

Equation 3.13 then implies that:

$$\langle X \rangle_n = \sum_k x_k f_k(n) \to \sum_k x_k P_X(x_k) = E[X] \qquad (3.15)$$

Thus we expect the sample mean to converge to E[X] as n becomes large.

We can also easily find the expected value of functions of a random variable. Let X be a discrete random variable and let Z = g(X). Since X is discrete, Z = g(X) assumes a countable set of values of the form g(x_k), where x_k ∈ S_X. One way to find the expectation of Z is to use Equation 3.9, which requires that we first find the PMF of Z. Another way is to use:

$$E[Z] = E[g(X)] = \sum_k g(x_k) P_X(x_k) \qquad (3.16)$$

Let Z be the function Z = a g(X) + b h(X) + c, where a, b, and c are real numbers; then:

$$E[Z] = a E[g(X)] + b E[h(X)] + c \qquad (3.17)$$

This further implies that:

$$E[g(X) + h(X)] = E[g(X)] + E[h(X)] \qquad (3.18)$$
$$E[aX] = a E[X] \qquad (3.19)$$
$$E[X + c] = E[X] + c \qquad (3.20)$$
$$E[c] = c \qquad (3.21)$$

Variance of a Random Variable

We usually need more information about X than the expected value E[X] provides. For example, if we know only that E[X] = 0, then X could be zero all the time, or it could take on extremely large positive and negative values. We are therefore interested not only in the mean of a random variable, but also in the extent of the random variable's variation about its mean. Let the deviation of the random variable X about its mean be X − E[X], which can take on positive and negative values. Since we are interested in the magnitude of the variations only, it is convenient to work with the square of the deviation, (X − E[X])², which is always positive.

Definition 3.6. Variance: The variance of the random variable X is defined as:

$$\sigma_X^2 = VAR[X] = E[(X - m_X)^2] = \sum_{x \in S_X} (x - m_X)^2 P_X(x) \qquad (3.22)$$

The variance is a positive quantity that measures the spread of the distribution of the random variable about its mean value. Larger values of the variance indicate that the distribution is more spread out. For example, in Figure 3.4, X has a larger variance than Y.

Definition 3.7. Standard deviation: The standard deviation of the random variable X is defined by:

$$\sigma_X = STD(X) = VAR[X]^{1/2} \qquad (3.23)$$

By taking the square root of the variance, we obtain a quantity with the same units as X.

An alternative expression for the variance can be obtained as follows:

$$VAR[X] = E[(X - m_X)^2] = E[X^2 - 2 m_X X + m_X^2] \qquad (3.24)$$
$$= E[X^2] - 2 m_X E[X] + m_X^2 \qquad (3.25)$$
$$= E[X^2] - m_X^2 \qquad (3.26)$$

E[X²] is called the second moment of X.

Definition 3.8. Moment: The nth moment of X is defined as E[Xⁿ].
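Before working the next example by hand, the following minimal sketch (using the PMF of Example 3.2) shows how the first and second moments, and hence the variance via Equation 3.26, follow from Equation 3.16 with g(x) = x and g(x) = x²:

```python
# A minimal sketch: compute E[X], the second moment E[X^2], and VAR[X]
# directly from a PMF, using Equation 3.16 with g(x) = x and g(x) = x**2.
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}  # PMF of Example 3.2

mean = sum(x * p for x, p in pmf.items())            # first moment
second_moment = sum(x**2 * p for x, p in pmf.items())
variance = second_moment - mean**2                   # Equation 3.26

print(mean, second_moment, variance)  # 1.5, 3.0, 0.75
```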
Example 3.5
Revisiting Example 3.1, let X be the number of heads in three tosses of a fair coin. Find VAR[X].

Solution.

$$E[X^2] = \sum_{k=0}^{3} k^2 P_X(k) = 0(1/8) + 1^2(3/8) + 2^2(3/8) + 3^2(1/8) = 3$$
$$VAR[X] = E[X^2] - (E[X])^2 = 3 - (1.5)^2 = 0.75$$

Let Y = X + c; then:

$$VAR[X + c] = E[(X + c - (E[X] + c))^2] \qquad (3.27)$$
$$= E[(X - E[X])^2] = VAR[X] \qquad (3.28)$$

Adding a constant to a random variable does not affect the variance. Let Z = cX; then:

$$VAR[cX] = E[(cX - c E[X])^2] \qquad (3.29)$$
$$= E[c^2 (X - E[X])^2] \qquad (3.30)$$
$$= c^2 VAR[X] \qquad (3.31)$$

Scaling a random variable by c scales the variance by c² and the standard deviation by |c|. Note that a random variable that is equal to a constant, X = c, with probability 1 has zero variance:

$$VAR[X] = E[(X - c)^2] = E[0] = 0$$

Finally, the variance is a special case of the central moments, for n = 2, where the nth central moment is defined as follows.

Definition 3.9. Central moments: The nth central moment of a random variable is defined as E[(X − m_X)ⁿ].

3.2.4 Conditional Probability Mass Function and Expectation

In many situations we have partial information about a random variable X or about the outcome of its underlying random experiment. We are interested in how this information changes the probability of events involving the random variable.

Definition 3.10. Conditional probability mass function: Let X be a discrete random variable with PMF P_X(x), and let C be an event that has nonzero probability, P(C) > 0. The conditional probability mass function of X is defined by the conditional probability:

$$P_{X|C}(x) = P(X = x \mid C) \quad \text{for } x \text{ a real number} \qquad (3.32)$$

Applying the definition of conditional probability, we have:

$$P_{X|C}(x) = \frac{P(\{X = x\} \cap C)}{P(C)} \qquad (3.33)$$

As illustrated in Figure 3.5, this expression has a nice intuitive interpretation: the conditional probability of the event {X = x_k} is given by the probabilities of the outcomes ζ for which both X(ζ) = x_k and ζ is in C, normalized by P(C).

Figure 3.5: Conditional PMF of X given event C.

The conditional PMF has the same properties as the PMF. If S is partitioned by the events A_k = {X = x_k}, then C = ∪_k (A_k ∩ C), and:

$$\sum_{x_k \in S_X} P_{X|C}(x_k) = \sum_k \frac{P(\{X = x_k\} \cap C)}{P(C)} = \frac{1}{P(C)} \sum_k P(A_k \cap C) = \frac{P(C)}{P(C)} = 1$$

Most of the time the event C is defined in terms of X, for example C = {a ≤ X ≤ b}. For x_k ∈ S_X, we have the following result:

$$P_{X|C}(x_k) = \begin{cases} \dfrac{P_X(x_k)}{P(C)} & \text{if } x_k \in C \\[4pt] 0 & \text{if } x_k \notin C \end{cases} \qquad (3.34)$$

Example 3.6
Let X be the number of heads in three tosses of a fair coin. Find the conditional PMF of X given that we know the observed number of heads was less than 2.

Solution. We condition on the event C = {X < 2}. From Example 3.2, P(C) = P_X(0) + P_X(1) = 1/8 + 3/8 = 1/2. Therefore:

$$P_{X|C}(0) = \frac{P_X(0)}{P(C)} = \frac{1/8}{1/2} = 1/4, \qquad P_{X|C}(1) = \frac{P_X(1)}{P(C)} = \frac{3/8}{1/2} = 3/4$$

and P_{X|C}(x_k) is zero otherwise. Note that P_{X|C}(0) + P_{X|C}(1) = 1.

Many random experiments have natural ways of partitioning the sample space S into the union of disjoint events B_1, B_2, ..., B_n. Let P_{X|B_i}(x) be the conditional PMF of X given event B_i. The theorem on total probability allows us to find the PMF of X in terms of the conditional PMFs:

$$P_X(x) = \sum_{i=1}^{n} P_{X|B_i}(x) P(B_i) \qquad (3.35)$$

Definition 3.11. Conditional expected value: Let X be a discrete random variable, and suppose that we know that event B has occurred. The conditional expected value of X given B is defined as:

$$m_{X|B} = E[X \mid B] = \sum_{x \in S_X} x P_{X|B}(x) = \sum_k x_k P_{X|B}(x_k) \qquad (3.36)$$

where we apply the absolute convergence requirement on the summation.

Definition 3.12.
Conditional Variance: Let 𝑋 be a discrete random variable, and suppose that we know that event 𝐵 has occurred. The conditional variance of 𝑋 given 𝐵 is defined as: Õ 𝜎𝑋2 |𝐵 = 𝑉 𝐴𝑅 [𝑋 |𝐵] = 𝐸 [(𝑋 − 𝑚𝑋 |𝐵 ) 2 |𝐵] = (𝑥𝑘 − 𝑚𝑋 |𝐵 ) 2 𝑃𝑋 |𝐵 (𝑥𝑘 ) (3.37) 𝑘 2 − 𝑚𝑋2 |𝐵 = 𝐸 [𝑋 |𝐵] (3.38) Note that the variation is measured with respect to 𝑚𝑋 |𝐵 not 𝑚𝑋 . Let 𝐵 1, 𝐵 2, ..., 𝐵𝑛 be the partition of 𝑆, and let 𝑃𝑋 |𝐵𝑖 (𝑥) be the conditional PMF of 𝑋 given event 𝐵𝑖 . 𝐸 [𝑋 ] can be calculated from the conditional expectation 𝐸 [𝑋 |𝐵𝑖 ]: 𝐸 [𝑋 ] = 𝑛 Õ 𝐸 [𝑋 |𝐵𝑖 ]𝑃 (𝐵𝑖 ) (3.39) 𝑖=1 By the theorem on total probability we have: 𝐸 [𝑋 ] = Õ 𝑥𝑘 𝑃𝑋 (𝑥𝑘 ) = 𝑘 Õ 𝑥𝑘 { 𝑛 Õ 𝑃𝑋 |𝐵𝑖 (𝑥𝑘 )𝑃 (𝐵𝑖 )} 𝑘 𝑛 Õ 𝑛 Õ Õ = { 𝑥𝑘 𝑃𝑋 |𝐵𝑖 (𝑥𝑘 )}𝑃 (𝐵𝑖 ) = 𝐸 [𝑋 |𝐵𝑖 ]𝑃 (𝐵𝑖 ) 𝑖=1 (3.40) 𝑖=1 (3.41) 𝑖=1 𝑘 where we first express 𝑃𝑋 (𝑥𝑘 ) in terms of the conditional PMFs, and we then change the order of summation. Using the same approach we can also show: 𝐸 [𝑔(𝑋 )] = 𝑛 Õ 𝐸 [𝑔(𝑋 )|𝐵𝑖 ]𝑃 (𝐵𝑖 ) (3.42) 𝑖=1 Example 3.7 Let 𝑋 be the number of heads in three tosses of a fair coin. Find the expected value and variance of 𝑋 ,if we know that at least one head was observed. Solution. We are given 𝐶 = {𝑋 > 0}, so for 𝑥𝑘 = 1, 2, 3: 𝑃 (𝐶) = 1 − 𝑃𝑋 (0) = 7/8 39 3 Random Variables 𝐸 [𝑋 |𝐶] = Õ 𝑥𝑘 𝑃𝑋 |𝐶 (𝑥𝑘 ) = 1( 𝑃𝑋 (2) 𝑃𝑋 (3) 𝑃𝑋 (1) ) + 2( ) + 3( ) 𝑃 (𝐶) 𝑃 (𝐶) 𝑃 (𝐶) 𝑘 3/8 3/8 1/8 ) + 2( ) + 3( ) 7/8 7/8 7/8 = 12/7 ≈ 1.7 = 1( which is larger than 𝐸 [𝑋 ] = 1.5 found in Example 3.4 𝐸 [𝑋 2 |𝐶] = Õ 𝑥𝑘2 𝑃𝑋 |𝐶 (𝑥𝑘 ) = 1( 𝑘 3/8 1/8 3/8 ) + 4( ) + 9( ) = 24/7 7/8 7/8 7/8 𝑉 𝐴𝑅 [𝑋 |𝐶] = 𝐸 [𝑋 2 |𝐶] − (𝐸 [𝑋 |𝐶]) 2 ≈ 0.49 3.2.5 Common Discrete Random Variables In this section we present the most important of the discrete random variables and their basic properties and applications. Bernoulli Random Variable Definition 3.13. Bernoulli trial: A Bernoulli trial involves performing an experiment once and noting whether a particular event 𝐴 occurs. The outcome of the Bernoulli trial is said to be a “success” if 𝐴 occurs and a “failure” otherwise. We can view the outcome of a single Bernoulli trial as the outcome of a toss of a coin for which the probability of heads (success) is 𝑝 = 𝑃 (𝐴). The probability of 𝑘 successes in 𝑛 Bernoulli trials is then equal to the probability of 𝑘 heads in 𝑛 tosses of the coin. Definition 3.14. Bernoulli random variable: Let 𝐴 be an event related to the outcomes of some random experiment. The Bernoulli random variable 𝐼𝐴 equals one if the event 𝐴 occurs, and zero otherwise, and is given by the indicator function for 𝐴: ( 1 if 𝜁 ∈ 𝐴 𝐼𝐴 (𝜁 ) = (3.43) 0 if 𝜁 ∉ 𝐴 𝐼𝐴 is a discrete random variable with range = {0, 1}. • The PMF of 𝐼𝐴 is: 𝑃𝐼 (1) = 𝑝 and 𝑃𝐼 (0) = 1 − 𝑝 = 𝑞 (3.44) where 𝑃 (𝐴) = 𝑝. • The mean of 𝐼𝐴 is 𝐸 [𝐼𝐴 ] = 1 × 𝑃𝐼 (1) + 0 × 𝑃𝐼 (0) = 𝑝 The sample mean in 𝑛 independent Bernoulli trials is simply the relative frequency of successes and converges to 𝑝 as 𝑛 increases. 0𝑁 0 (𝑛) + 1𝑁 1 (𝑛) →𝑝 (3.45) h𝐼𝐴 i𝑛 = 𝑛 • The variance of 𝐼𝐴 can be found as follows: 𝐸 [𝐼𝐴2 ] = 1 × 𝑃𝐼 (1) + 0 × 𝑃𝐼 (0) = 𝑝 𝜎𝐼2 = 𝑉 𝐴𝑅 [𝐼𝐴 ] = 𝑝 − 𝑝 2 = 𝑝 (1 − 𝑝) = 𝑝𝑞 40 (3.46) 3 Random Variables The variance is quadratic in 𝑝, with value zero at 𝑝 = 0 and 𝑝 = 1 and maximum at 𝑝 = 1/2. This agrees with intuition since values of 𝑝 close to 0 or to 1 imply a preponderance of successes or failures and hence less variability in the observed values. The maximum variability occurs when which corresponds to the case that is most difficult to predict. Every Bernoulli trial, regardless of the event 𝐴, is equivalent to the tossing of a biased coin with probability of heads 𝑝. 
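The following minimal simulation sketch (the choice p = 0.3, the seed, and the sample size are arbitrary) checks that the sample mean and sample variance of Bernoulli trials approach p and pq:

```python
import random

# A minimal sketch: simulate Bernoulli trials with success probability p and
# check that the sample mean approaches p and the sample variance approaches pq.
random.seed(1)
p = 0.3
n = 100_000
samples = [1 if random.random() < p else 0 for _ in range(n)]

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print(f"sample mean     {mean:.4f}  (theory: p  = {p})")
print(f"sample variance {var:.4f}  (theory: pq = {p * (1 - p):.4f})")
```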
Binomial Random Variable

Consider n independent Bernoulli trials in which k successes occur (e.g., Example 3.1). Outcomes of the repeated trials are represented as n-element vectors whose elements are taken from S = {0, 1}, so the repeated experiment has sample space Sⁿ = {0, 1}ⁿ, which is referred to as a Cartesian space. For example, consider the following outcome:

$$\zeta_k = (\underbrace{1, 1, \dots, 1}_{k \text{ times}}, \underbrace{0, 0, \dots, 0}_{n-k \text{ times}})$$

The probability of this outcome occurring is:

$$P(\zeta_k) = p^k (1-p)^{n-k} \qquad (3.47)$$

In fact, the order of the 1s and 0s in the sequence is irrelevant: any outcome with exactly k 1s and n − k 0s has the same probability. The number of outcomes in the event of exactly k successes is just the number of combinations of n trials taken k successes at a time.

Theorem 3.1: Binomial probability law
Let k be the number of successes in n independent Bernoulli trials; then the probabilities of k are given by the binomial probability law:

$$P_n(k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0, \dots, n \qquad (3.48)$$

where $\binom{n}{k}$ is the binomial coefficient (see Equation 2.26).

Now let the random variable X represent the number of successes in the sequence of n trials.

Definition 3.15. Binomial random variable: Let X be the number of times a certain event A occurs in n independent Bernoulli trials. X is called the binomial random variable. For example, X could be the number of heads in n tosses of a coin (as seen in Examples 3.2 to 3.5, where n = 3 and p = 1/2).

• The PMF of the binomial random variable X is:

$$P(X = k) = P_X(k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0, \dots, n \qquad (3.49)$$

• The expected value of X is:

$$E[X] = \sum_{k=0}^{n} k P_X(k) = \sum_{k=0}^{n} k \binom{n}{k} p^k (1-p)^{n-k} \qquad (3.50)$$

Since the k = 0 term contributes zero,

$$E[X] = \sum_{k=1}^{n} k \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!} p^{k-1} (1-p)^{n-k} \qquad (3.51)$$
$$= np \sum_{j=0}^{n-1} \frac{(n-1)!}{j!(n-1-j)!} p^j (1-p)^{n-1-j} \qquad (3.52)$$

Note that the summation $\sum_{j=0}^{n-1} \frac{(n-1)!}{j!(n-1-j)!} p^j (1-p)^{n-1-j}$ is equal to one, since it adds all the terms of a binomial PMF with parameters n − 1 and p, so:

$$E[X] = np \times 1 = np \qquad (3.54)$$

This agrees with our intuition, since we expect a fraction p of the outcomes to result in success.

To find the variance of X:

$$E[X^2] = \sum_{k=0}^{n} k^2 \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k} = \sum_{k=1}^{n} k \frac{n!}{(k-1)!(n-k)!} p^k (1-p)^{n-k} \qquad (3.55)$$
$$= np \sum_{j=0}^{n-1} (j+1) \binom{n-1}{j} p^j (1-p)^{n-1-j} \qquad (3.56)$$
$$= np \left( \sum_{j=0}^{n-1} j \binom{n-1}{j} p^j (1-p)^{n-1-j} + \sum_{j=0}^{n-1} \binom{n-1}{j} p^j (1-p)^{n-1-j} \right) \qquad (3.57)$$

In the third line, the first sum is the mean of a binomial random variable with parameters n − 1 and p, and hence equal to (n − 1)p. The second sum is the sum of the binomial probabilities and hence equal to 1. Therefore,

$$E[X^2] = np(np + 1 - p) \qquad (3.58)$$
$$VAR[X] = E[X^2] - E[X]^2 = np(np + 1 - p) - (np)^2 = np(1-p) = npq \qquad (3.59)$$

We see that the variance of the binomial random variable is n times the variance of a Bernoulli random variable. We also observe that values of p close to 0 or to 1 imply smaller variance, and that the maximum variability occurs when p = 1/2.

The binomial random variable arises in applications where there are two types of objects (i.e., heads/tails, correct/erroneous bits, good/defective items, active/silent speakers) and we are interested in the number of type 1 objects in a randomly selected batch of size n, where the type of each object is independent of the types of the other objects in the batch. A small numerical check of these formulas is sketched below.
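The sketch below (the parameters n = 8 and p = 0.25 are arbitrary illustrative choices) tabulates the PMF of Equation 3.49 and confirms that it sums to one, with mean np and variance np(1 − p):

```python
from math import comb

# A small sketch: tabulate the binomial PMF of Equation 3.49 and verify that
# its mean and variance match np and np(1-p).
n, p = 8, 0.25   # arbitrary illustrative parameters

pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2

print(f"sum of PMF = {sum(pmf):.6f}")                 # should be 1
print(f"mean       = {mean:.4f} (np  = {n * p})")
print(f"variance   = {var:.4f} (npq = {n * p * (1 - p):.4f})")
```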
Example 3.8
A binary communications channel introduces a bit error in a transmission with probability p. Let X be the number of errors in n independent transmissions. Find the probability of one or fewer errors.

Solution. X is a binomial random variable, and the probability of k errors in n bit transmissions is given by the PMF in Equation 3.49:

$$P(X \le 1) = \binom{n}{0} p^0 (1-p)^n + \binom{n}{1} p^1 (1-p)^{n-1} = (1-p)^n + np(1-p)^{n-1}$$

Geometric Random Variable

Definition 3.16. Geometric random variable: The geometric random variable is defined as the number X of independent Bernoulli trials until the first occurrence of a success.

Note that the event X = k occurs if and only if the underlying experiment finds k − 1 consecutive failures followed by one success. If the probability of "success" in each Bernoulli trial is p, then:

• The PMF is:

$$P_X(k) = P(00\ldots01) = (1-p)^{k-1} p = q^{k-1} p \quad \text{for } k = 1, 2, \dots \qquad (3.60)$$

Note that the PMF decays geometrically with k, with ratio q = 1 − p. As p increases, the PMF decays more rapidly.

• The probability that X ≤ k can be written in closed form:

$$P(X \le k) = \sum_{j=1}^{k} q^{j-1} p = p \sum_{j=0}^{k-1} q^j = p \, \frac{1 - q^k}{1 - q} = 1 - q^k \qquad (3.61)$$

• The expectation of X is:

$$E[X] = \sum_{k=1}^{\infty} k p q^{k-1} = p \sum_{k=1}^{\infty} k q^{k-1} \qquad (3.62)$$

This expression can be evaluated by differentiating the series:

$$\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k \qquad (3.63)$$

to obtain:

$$\frac{1}{(1-x)^2} = \sum_{k=0}^{\infty} k x^{k-1} \qquad (3.64)$$

Letting x = q:

$$E[X] = \frac{p}{(1-q)^2} = 1/p \qquad (3.65)$$

which is finite as long as p > 0.

• If Equation 3.64 is differentiated once more, we obtain:

$$\frac{2}{(1-x)^3} = \sum_{k=0}^{\infty} k(k-1) x^{k-2} \qquad (3.66)$$

Letting x = q and multiplying both sides by pq gives:

$$\frac{2pq}{(1-q)^3} = pq \sum_{k=0}^{\infty} k(k-1) q^{k-2} = \sum_{k=0}^{\infty} (k^2 - k) p q^{k-1} = E[X^2] - E[X]$$

So the second moment and variance are:

$$E[X^2] = \frac{2pq}{(1-q)^3} + E[X] = 2q/p^2 + 1/p = \frac{1+q}{p^2} \qquad (3.67)$$
$$VAR[X] = E[X^2] - E[X]^2 = \frac{1+q}{p^2} - 1/p^2 = q/p^2 \qquad (3.68)$$

We see that the mean and variance increase as p, the success probability, decreases.

Sometimes we are interested in M, the number of failures before a success occurs, which is also referred to as a geometric random variable. Its PMF is:

$$P(M = k) = (1-p)^k p \quad \text{for } k = 0, 1, 2, \dots \qquad (3.69)$$

The geometric random variable is the only discrete random variable that satisfies the memoryless property:

$$P(X \ge k + j \mid X > j) = P(X \ge k) \qquad (3.70)$$

The above expression states that if a success has not occurred in the first j trials, then the probability of having to perform at least k more trials is the same as the probability of initially having to perform at least k trials. Thus, each time a failure occurs, the system "forgets" and begins anew as if it were performing the first trial.

The geometric random variable arises in applications where one is interested in the time (i.e., number of trials) that elapses between the occurrences of events in a sequence of independent experiments. Examples where the modified geometric random variable arises include the number of customers awaiting service in a queueing system, and the number of white dots between successive black dots in a scan of a black-and-white document.

Example 3.9
A production line yields two types of devices. Type 1 devices occur with probability α and work for a relatively short time that is geometrically distributed with parameter r. Type 2 devices work much longer, occur with probability 1 − α, and have a lifetime that is geometrically distributed with parameter s. Let X be the lifetime of an arbitrary device. Find the PMF, mean, and variance of X.

Solution. The random experiment that generates X involves selecting a device type and then observing its lifetime.
We can partition the sets of outcomes in this experiment into event 𝐵 1 consisting of those outcomes in which the device is type 1, and 𝐵 2 consisting of those outcomes in which the device is type 2. From the theorem of total probability: 𝑃𝑋 (𝑘) = 𝑃𝑋 |𝐵1 (𝑘)𝑃 (𝐵 1 ) + 𝑃𝑋 |𝐵2 (𝑘)𝑃 (𝐵 2 ) = (1 − 𝑟 )𝑘−1𝑟 (𝛼) + (1 − 𝑠)𝑘−1𝑠 (1 − 𝛼) for 𝑘 = 1, 2, ... The conditional mean and second moment of each device type is that of a geometric random 44 3 Random Variables variable with the corresponding parameter: 𝐸 [𝑋 |𝐵 1 ] = 1/𝑟 𝐸 [𝑋 |𝐵 2 ] = 1/𝑠 𝐸 [𝑋 2 |𝐵 1 ] = (1 + 1 − 𝑟 )/𝑟 2 𝐸 [𝑋 2 |𝐵 2 ] = (1 + 1 − 𝑠)/𝑠 2 The mean and the second moment of 𝑋 are then: 𝐸 [𝑋 ] = (𝐸 [𝑋 |𝐵 1 ]) (𝛼) + (𝐸 [𝑋 |𝐵 2 ]) (1 − 𝛼) = 𝛼/𝑟 + (1 − 𝛼)/𝑠 𝐸 [𝑋 2 ] = 𝐸 [𝑋 2 |𝐵 1 ] (𝛼) + 𝐸 [𝑋 2 |𝐵 2 ] (1 − 𝛼) = 𝛼 (2 − 𝑟 )/𝑟 2 + (1 − 𝛼) (2 − 𝑠)/𝑠 2 𝑉 𝐴𝑅 [𝑋 ] = 𝐸 [𝑋 2 ] − 𝐸 [𝑋 ] 2 = 𝛼 (2 − 𝑟 )/𝑟 2 + (1 − 𝛼) (2 − 𝑠)/𝑠 2 − (𝛼/𝑟 + (1 − 𝛼)/𝑠) 2 Note that we do not use the conditional variances to find 𝑉 𝐴𝑅 [𝑋 ], since the Equation 3.42 does not similarly apply to the conditional variances. Poisson Random Variable In many applications, we are interested in counting the number of occurrences of an event in a certain time period or in a certain region in space. The Poisson random variable arises in situations where the events occur “completely at random” in time or space. For example, the Poisson random variable arises in counts of emissions from radioactive substances, in counts of demands for telephone connections, and in counts of defects in a semiconductor chip, in queuing theory and in communication networks. The number of customers arriving at a cashier in a store during some time interval may be well modeled as a Poisson random variable as may the number of data packets arriving at a node in a computer network. • The PMF of the Poisson random variable is given by: 𝑃𝑋 (𝑘) = 𝛼 𝑘 −𝛼 𝑒 , 𝑘 = 0, 1, 2, ... 𝑘! (3.71) where 𝛼 is the average number of event occurrences in a specified time interval or region in space. The PMF sums to one, as required, since: ∞ Õ 𝛼𝑘 𝑒 −𝛼 = 𝑒 −𝛼 ∞ Õ 𝛼𝑘 𝑘! 𝑘! 𝑘=0 = 𝑒 −𝛼 𝑒 𝛼 = 1 𝑘=0 where we used the fact that the second summation is the infinite series expansion for 𝑒 𝛼 . • The mean can be found as follows: ∞ ∞ ∞ Õ Õ Õ 𝛼 𝑘 −𝛼 𝛼𝑘 𝛼 (𝑘−1) −𝛼 −𝛼 𝐸 [𝑋 ] = 𝑘 𝑒 =𝑒 =𝑒 𝛼 = 𝑒 −𝛼 𝛼𝑒 𝛼 = 𝛼 𝑘! (𝑘 − 1)! (𝑘 − 1)! 𝑘=0 𝑘=1 𝑘=1 45 (3.72) 3 Random Variables • [Exercise−] It can be shown that the variance is: 𝑉 𝐴𝑅 [𝑋 ] = 𝛼 (3.73) One of the applications of the Poisson probabilities is to approximate the binomial probabilities when the number of repeated trials, 𝑛 , is very large and the probability of success in each individual trial,𝑝 , is very small. Then the binomial random variable can be well approximated by a Poisson random variable. That is, the Poisson random variable is a limiting case of the binomial random variable. Let 𝑛 approach infinity and 𝑝 approach 0 in such a way that lim𝑛→∞ 𝑛𝑝 = 𝛼, then the binomial PMF converges to the PMF of Poisson random variable: 𝛼𝑘 𝑛 𝑘 (3.74) 𝑝 (1 − 𝑝)𝑛−𝑘 → 𝑒 −𝛼 , for 𝑘 = 0, 1, 2, ... 𝑘 𝑘! The Poisson random variable appears in numerous physical situations because many models are very large in scale and involve very rare events. For example, the Poisson PMF gives an accurate prediction for the relative frequencies of the number of particles emitted by a radioactive mass during a fixed time period. The Poisson random variable also comes up in situations where we can imagine a sequence of Bernoulli trials taking place in time or space. 
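Before making this connection precise, here is a small numerical check of the limit in Equation 3.74 (the values n = 1000 and p = 0.002, giving α = 2, are arbitrary illustrative choices):

```python
from math import comb, exp, factorial

# A small numerical check of Equation 3.74: for large n and small p with
# np = alpha, the binomial PMF is close to the Poisson PMF.
n, p = 1000, 0.002          # arbitrary illustrative values
alpha = n * p

for k in range(6):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = alpha**k / factorial(k) * exp(-alpha)
    print(f"k={k}: binomial {binom:.6f}  Poisson {poisson:.6f}")
```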
Suppose we count the number of event occurrences in a T-second interval. Divide the time interval into a very large number, 𝑛, of sub-intervals. A pulse in a sub-interval indicates the occurrence of an event. Each sub-interval can be viewed as one in a sequence of independent Bernoulli trials if the following conditions hold: (1) At most one event can occur in a sub-interval, that is, the probability of more than one event occurrence is negligible; (2) the outcomes in different sub-intervals are independent; and (3) the probability of an event occurrence in a sub-interval is 𝑝 = 𝛼/𝑛 where 𝛼 is the average number of events observed in a 1-second interval. The number 𝑁 of events in 1 second is a binomial random variable with parameters 𝑛 and 𝑝 = 𝛼/𝑛. Thus as 𝑛 → ∞ 𝑁 becomes a Poisson random variable with parameter 𝛼. Example 3.10 An optical communication system transmits information at a rate of 109 bits/second. The probability of a bit error in the optical communication system is 10−9 . Find the probability of five or more errors in 1 second. Solution. Each bit transmission corresponds to a Bernoulli trial with a “success” corresponding to a bit error in transmission. The probability of 𝑘 errors in 𝑛 = 109 transmissions (1 second) is then given by the binomial probability with 𝑛 = 109 and 𝑝 = 10−9 . The Poisson approximation uses 𝛼 = 𝑛𝑝 = 109 (10−9 ) = 1. Thus: 𝑃 (𝑋 ≥ 5) = 1 − 𝑃 (𝑋 < 5) = 1 − 4 Õ 𝛼𝑘 𝑒 −𝛼 𝑘! 𝑘=0 = 1 − 𝑒 (1 + 1/1! + 1/2! + 1/3! + 1/4!) = 0.00366 −1 Uniform Random Variable Definition 3.17. Uniform random variable: The discrete uniform random variable 𝑋 takes on values in a set of consecutive integers 𝑆𝑋 = { 𝑗 + 1, ..., 𝑗 + 𝐿} with equal probability. 46 3 Random Variables • The PMF of the uniform random variable is: 𝑃𝑋 (𝑘) = 1/𝐿 for 𝑘 ∈ { 𝑗 + 1, ..., 𝑗 + 𝐿} • [Exercise−] It can be shown that the mean is: 𝐸 [𝑋 ] = 𝑗 + 𝐿+1 2 (3.75) • [Exercise−] It is easy to show that the variance is: 𝑉 𝐴𝑅 [𝑋 ] = 𝐿2 − 1 12 (3.76) This random variable occurs whenever outcomes are equally likely, e.g., toss of a fair coin or a fair die, spinning of an arrow in a wheel divided into equal segments, selection of numbers from an urn. Example 3.11 Let 𝑋 be the time required to transmit a message, where 𝑋 is a uniform random variable with 𝑆𝑋 = {1, ..., 𝐿}. Suppose that a message has already been transmitting for 𝑚 time units, find the probability that the remaining transmission time is 𝑗 time units and the expected value of the remaining transmission time. Solution. We are given the condition 𝐶 = {𝑋 > 𝑚}, so for 𝑚 + 1 ≤ 𝑚 + 𝑗 ≤ 𝐿: 𝑃𝑋 |𝐶 (𝑚 + 𝑗) = 𝐸 [𝑋 |𝐶] = 𝑃 (𝑋 = 𝑚 + 𝑗) 1/𝐿 1 = = , for 𝑚 + 1 ≤ 𝑚 + 𝑗 ≤ 𝐿 𝑃 (𝑋 > 𝑚) (𝐿 − 𝑚)/𝐿 𝐿 − 𝑚 𝐿 Õ 𝑗=𝑚+1 𝑗 (1/𝐿 − 𝑚) = 𝐿 +𝑚 + 1 2 The expectation can also be directly calculated from Equation 3.75, replacing the parameters 𝐿 and 𝑗 by 𝐿 − 𝑚 and 𝑚, respectively. 3.3 Continuous Random Variables Consider a discrete uniform random variable, 𝑋 , that takes on values from the set {0, 1/𝑁 , 2/𝑁 , ..., (𝑁 − 1)/𝑁 }, with PMF of 1/𝑁 . If 𝑁 is a large number so that it appears that the random number can be anything in the continuous range [0, 1), i.e. 𝑁 → ∞, then the PMF approaches zero! That is, each point has zero probability of occurring, or in other words, every possible outcome has probability zero. Yet, something has to occur! Since a continuous random variable typically has a zero probability of taking on a specific value, the pmf cannot be used to characterize the probabilities of 𝑋 . Therefore we define it by its CDF property. 47 3 Random Variables Definition 3.18. 
Continuous random variable: A random variable whose CDF 𝐹𝑋 (𝑥) is continuous everywhere, and which, in addition, is sufficiently smooth that it can be written as an integral of some non-negative function 𝑓 (𝑥): ∫ 𝑥 𝐹𝑋 (𝑥) = 𝑓 (𝑡)𝑑𝑡 (3.77) −∞ For continuous random variables, we calculate probabilities as integrals of “probability densities” over intervals of the real line. A random variable can also be of mixed type, that is a random variable with a CDF that has jumps on a countable set of points 𝑥 0, 𝑥 1, 𝑥 2, ... but that also increases continuously over at least one interval of values of 𝑥. The CDF for these random variables has the form: 𝐹𝑋 (𝑥) = 𝑝𝐹 1 (𝑥) + (1 − 𝑝)𝐹 2 (𝑥) where 0 < 𝑝 < 1 and 𝐹 1 (𝑥) is the CDF of a discrete random variable and 𝐹 2 (𝑥) is the CDF of a continuous random variable. Random variables of mixed type can be viewed as being produced by a two-step process: A coin is tossed; if the outcome of the toss is heads, a discrete random variable is generated according to 𝐹 1 (𝑥) otherwise, a continuous random variable is generated according to 𝐹 2 (𝑥). 3.3.1 The Probability Density Function While the CDF represents a mathematical tool to statistically describe a random variable, it is often quite cumbersome to work with CDFs or to infer various properties of a random variable from its CDF. To help circumvent these problems, an alternative and often more convenient description known as the probability density function is often used. Definition 3.19. Probability density function: The probability density function of 𝑋 (PDF), if it exists, is defined as the derivative of 𝐹𝑋 (𝑥): 𝑓𝑋 (𝑥) = 𝑑𝐹𝑋 (𝑥) 𝑑𝑥 (3.78) The PDF represents the “density” of probability at the point 𝑥 in the following sense: The probability that 𝑋 is in a small interval in the vicinity of 𝑥, i.e. 𝑥 < 𝑋 ≤ 𝑥 + ℎ, is: 𝑃 (𝑥 < 𝑋 ≤ 𝑥 + ℎ) = 𝐹𝑋 (𝑥 + ℎ) − 𝐹𝑋 (𝑥) = 𝐹𝑋 (𝑥 + ℎ) − 𝐹𝑋 (𝑥) ℎ ℎ (3.79) If the CDF has a derivative at 𝑥, then as ℎ becomes very small, 𝑃 (𝑥 < 𝑋 ≤ 𝑥 + ℎ) ≈ 𝑓𝑋 (𝑥)ℎ (3.80) Thus represents the “density” of probability at the point 𝑥 in the sense that the probability that 𝑋 is in a small interval in the vicinity of 𝑥 is approximately 𝑓𝑋 (𝑥)ℎ. The derivative of the CDF, when it exists, is positive since the CDF is a non-decreasing function of 𝑥, thus: 𝑓𝑋 (𝑥) ≥ 0 (3.81) Note that the PDF specifies the probabilities of events of the form “𝑋 falls in a small interval of width 𝑑𝑥 about the point 𝑥”. Therefore probabilities of events involving 𝑋 in a certain range can 48 3 Random Variables be expressed in terms of the PDF by adding the probabilities of intervals of width 𝑑𝑥. As the widths of the intervals approach zero, we obtain an integral in terms of the PDF: ∫ 𝑏 𝑃 (𝑎 ≤ 𝑋 ≤ 𝑏) = 𝑓𝑋 (𝑥)𝑑𝑥 (3.82) 𝑎 The probability of an interval is therefore the area under 𝑓𝑋 (𝑥) in that interval. Figure 3.6: (a) The probability density function specifies the probability of intervals of infinitesimal width. (b) The probability of an interval [𝑎, 𝑏] is the area under the PDF in that interval. (Taken from Alberto Leon-Garcia, Probability, statistics, and random processes for electrical engineering,3rd ed. Pearson, 2007) The probability of any event that consists of the union of disjoint intervals can thus be found by adding the integrals of the PDF over each of the intervals. The CDF of 𝑋 can be obtained by integrating the PDF: ∫ 𝑥 𝐹𝑋 (𝑥) = 𝑓𝑋 (𝑡)𝑑𝑡 (3.83) −∞ Since the probabilities of all events involving 𝑋 can be written in terms of the CDF, it then follows that these probabilities can be written in terms of the PDF. 
Thus the PDF completely specifies the behavior of continuous random variables. By letting 𝑥 tend to infinity in Equation 3.83, we obtain: ∫ ∞ 1= 𝑓𝑋 (𝑡)𝑑𝑡 (3.84) −∞ A valid PDF can be formed by normalising any non-negative, piecewise continuous function 𝑔(𝑥) that has a finite integral over all real values of 𝑥. Example 3.12 The PDF for the random variable 𝑋 is: ( 𝑓𝑋 (𝑥) = 𝛽𝑥 2 0 −1 < 𝑥 < 2 otherwise Find 𝛽 so that 𝑓𝑋 (𝑥) is a PDF, and find the CDF 𝐹𝑋 (𝑥). 49 3 Random Variables Solution. We require: 1= ∫ ∞ ∫ 2 𝑓𝑋 (𝑡)𝑑𝑡 = 𝛽 −∞ 𝑥 2𝑑𝑥 = (𝛽/3) (8 + 1) = 3𝛽 −1 So, 𝛽 = 1/3, which is positive, as required. To find the CDF: ∫ 𝑥 ∫ 𝑥 𝐹𝑋 (𝑥) = 𝑓𝑋 (𝑡)𝑑𝑡 = (1/3)𝑡 2𝑑𝑡 = (1/9) (𝑥 3 + 1) −∞ −1 Finally, since 𝑓𝑋 (𝑥) = 0 for 𝑥 > 2, 𝐹𝑋 (𝑥) = 1 for𝑥 ≥ 2. PDF of Discrete Random Variables The derivative of the CDF does not exist at points where the CDF is not continuous. As seen in section 3.2.2, CDF of discrete random variables has discontinuities, where the notion of PDF cannot be applied. We can generalize the definition of the PDF by noting the relation between the unit step function 𝑢 (𝑥) and Dirac delta function 𝛿 (𝑥): ( 1 𝑥≥0 𝑢 (𝑥) = (3.85) 0 𝑥<0 ∫ 𝑥 𝑢 (𝑥) = 𝛿 (𝑡)𝑑𝑡 (3.86) −∞ Recall that the delta function 𝛿 (𝑥) is zero everywhere except at 𝑥 = 0, where it is unbounded. To maintain the right continuity of the step function at 0, we use the convention: ∫ 0 𝑢 (0) = 1 = 𝛿 (𝑡)𝑑𝑡 (3.87) −∞ The PDF for a discrete random variable can be defined by: 𝑓𝑋 (𝑥) = Õ 𝑑 𝑃𝑋 (𝑥𝑘 )𝛿 (𝑥 − 𝑥𝑘 ) 𝐹𝑋 (𝑥) = 𝑑𝑥 (3.88) 𝑘 Thus the generalized definition of PDF places a delta function of weight 𝑃 (𝑋 = 𝑥𝑘 ) at the points 𝑥𝑘 where the CDF is discontinuous. 50 3 Random Variables Example 3.13 Find the PDF of 𝑋 in Example 3.3. Solution. We found that the CDF of 𝑋 is: 1 3 3 1 𝐹𝑋 (𝑥) = 𝑢 (𝑥) + 𝑢 (𝑥 − 1) + 𝑢 (𝑥 − 2) + 𝑢 (𝑥 − 3) 8 8 8 8 Therefore the PDF of 𝑋 is given by: 1 3 3 1 𝑓𝑋 (𝑥) = 𝛿 (𝑥) + 𝛿 (𝑥 − 1) + 𝛿 (𝑥 − 2) + 𝛿 (𝑥 − 3) 8 8 8 8 3.3.2 Conditional CDF and PDF Definition 3.20. Conditional cumulative distribution function: Suppose that event 𝐶 is given and that 𝑃 (𝐶) > 0. The conditional CDF of 𝑋 given 𝐶 is defined by: 𝐹𝑋 |𝐶 (𝑥) = 𝑃 ({𝑋 ≤ 𝑥 } ∩ 𝐶) 𝑃 (𝐶) (3.89) and satisfies all the properties of a CDF. The conditional PDF of 𝑋 given 𝐶 is then defined by: 𝑓𝑋 |𝐶 (𝑥) = 𝑑 𝐹𝑋 |𝐶 (𝑥) 𝑑𝑥 (3.90) Example 3.14 The lifetime 𝑋 of a machine has a continuous CDF 𝐹𝑋 (𝑥). Find the conditional CDF and PDF given the event 𝐶 = {𝑋 > 𝑡 } (i.e., “machine is still working at time 𝑡”). Solution. The conditional CDF is: 𝐹𝑋 |𝐶 (𝑥) = 𝑃 (𝑋 ≤ 𝑥 |𝑋 > 𝑡) = 𝑃 ({𝑋 ≤ 𝑥 } ∩ {𝑋 > 𝑡 }) 𝑃 (𝑋 > 𝑡) The intersection of the two events in the numerator is equal to the empty set when 𝑥 < 𝑡 and to {𝑡 < 𝑋 ≤ 𝑥 } when 𝑥 ≥ 𝑡. Then: ( 𝐹 (𝑥)−𝐹 (𝑡 ) 𝑋 𝑋 𝑥 >𝑡 1−𝐹𝑋 (𝑡 ) 𝐹𝑋 |𝐶 (𝑥) = 0 𝑥 ≤𝑡 The conditional pdf is found by differentiating with respect to 𝑥: 𝑓𝑋 |𝐶 (𝑥) = 𝑓𝑋 (𝑥) 1 − 𝐹𝑋 (𝑡) 51 3 Random Variables Now suppose that we have a partition of the sample space 𝑆 into the union of disjoint events 𝐵 1, 𝐵 2, ..., 𝐵𝑛 . Let 𝐹𝑋 |𝐵𝑖 (𝑥) be the conditional CDF of 𝑋 given event 𝐵𝑖 . The theorem on total probability allows us to find the CDF of 𝑋 in terms of the conditional CDFs: 𝐹𝑋 (𝑥) = 𝑃 (𝑋 ≤ 𝑥) = 𝑛 Õ 𝑃 (𝑋 ≤ 𝑥 |𝐵𝑖 )𝑃 (𝐵𝑖 ) = 𝑖=1 𝑛 Õ 𝐹𝑋 |𝐵𝑖 (𝑥)𝑃 (𝐵𝑖 ) (3.91) 𝑖=1 The PDF is obtained by differentiation: 𝑛 𝑓𝑋 (𝑥) = Õ 𝑑 𝐹𝑋 (𝑥) = 𝑓𝑋 |𝐵𝑖 (𝑥)𝑃 (𝐵𝑖 ) 𝑑𝑥 𝑖=1 (3.92) 3.3.3 The Expected Value and Moments Expected Value We discussed the expectation for discrete random variables in Section 3.2.3, and found that the sample mean of independent observations of a random variable approaches 𝐸 [𝑋 ]. 
Suppose we perform a series of such experiments for continuous random variables. Since for continuous random variables we have 𝑃 (𝑋 = 𝑥) = 0 for any specific value of 𝑥, we divide the real line into small intervals and count the number of times the observations fall in the interval 𝑥𝑘 < 𝑋 < 𝑥𝑘 + Δ. As 𝑛 becomes large, then the relative frequency 𝑓𝑘 (𝑛) = 𝑁𝑘 (𝑛)/𝑛 will approach 𝑓𝑋 (𝑥𝑘 )Δ, the probability of the interval. We calculate the sample mean in terms of the relative frequencies and let 𝑛 → ∞: Õ Õ h𝑋 i𝑛 = 𝑥𝑘 𝑓𝑘 (𝑛) → 𝑥𝑘 𝑓𝑋 (𝑥𝑘 )Δ 𝑘 𝑘 The expression on the right-hand side approaches an integral as we decrease Δ. Thus, the expected value or mean of a continuous random variable 𝑋 is defined by: ∫ +∞ 𝐸 [𝑋 ] = 𝑡 𝑓𝑋 (𝑡)𝑑𝑡 (3.93) −∞ The expected value 𝐸 [𝑋 ] is defined if the above integral converges absolutely, that is, ∫ +∞ 𝐸 [|𝑋 |] = |𝑡 |𝑓𝑋 (𝑡)𝑑𝑡 < ∞ −∞ We already discussed 𝐸 [𝑋 ] for discrete random variables in detail, but the definition in Equation 3.93 is applicable if we express the PDF of a discrete random variable using delta (𝛿) functions: ∫ +∞ Õ 𝐸 [𝑋 ] = 𝑡 𝑃𝑋 (𝑥𝑘 )𝛿 (𝑡 − 𝑥𝑘 )𝑑𝑡 −∞ = = Õ 𝑘 Õ 𝑘 ∫ +∞ 𝑃𝑋 (𝑥𝑘 ) 𝑡𝛿 (𝑡 − 𝑥𝑘 )𝑑𝑡 −∞ 𝑃𝑋 (𝑥𝑘 )𝑥𝑘 𝑘 Example 3.15 The PDF of the uniform random variable is a constant value over a certain range and zero 52 3 Random Variables outside that range: ( 𝑓𝑋 (𝑥) = 1 𝑏−𝑎 𝑎 ≤𝑥 ≤𝑏 𝑥 < 𝑎 𝑎𝑛𝑑 𝑥 > 𝑏 0 Find the expectation 𝐸 [𝑋 ]. Solution. ∫ 𝑏 𝐸 [𝑋 ] = 𝑡 𝑎 𝑎 +𝑏 1 𝑑𝑡 = 𝑏 −𝑎 2 which is the midpoint of the interval [𝑎, 𝑏]. The result in Example 3.15 could have been found immediately by noting that 𝐸 [𝑋 ] = 𝑚 when the PDF is symmetric about a point 𝑚, i.e. 𝑓𝑋 (𝑚 − 𝑥) = 𝑓𝑋 (𝑚 + 𝑥) for all 𝑥, then assuming that the mean exists, ∫ ∫ +∞ 0= +∞ (𝑚 − 𝑡) 𝑓𝑋 (𝑡)𝑑𝑡 = 𝑚 − 𝑡 𝑓𝑋 (𝑡)𝑑𝑡 −∞ −∞ The first equality above follows from the symmetry of 𝑓𝑋 (𝑡) about 𝑡 = 𝑚 and the odd symmetry of (𝑚 − 𝑡) about the same point. We then have that 𝐸 [𝑋 ] = 𝑚. The following expressions are useful when 𝑋 is a nonnegative random variable: ∫ ∞ 𝐸 [𝑋 ] = (1 − 𝐹𝑋 (𝑡))𝑑𝑡 if 𝑋 continuous and nonnegative (3.94) 0 𝐸 [𝑋 ] = ∞ Õ 𝑃 (𝑋 > 𝑘) if 𝑋 nonnegative, integer-valued (3.95) 𝑘=0 Functions of a Random Variable The concept of expectation can be applied to the functions of random variables as well. This will allow us to define many other parameters that describe various aspects of a continuous random variable. Definition 3.21. Given a continuous random variable 𝑋 with PDF 𝑓𝑋 (𝑥), the expected value of a function, 𝑔(𝑋 ), of that random variable is given by: ∫ +∞ 𝐸 [𝑔(𝑋 )] = 𝑔(𝑥) 𝑓𝑋 (𝑥)𝑑𝑥 (3.96) −∞ Example 3.16 If 𝑌 = 𝑎𝑋 + 𝑏 where 𝑋 is a continuous random variable with expected value of 𝐸 [𝑋 ] and 𝑎 and 𝑏 are constant values, find 𝐸 [𝑌 ]. Solution. ∫ +∞ 𝐸 [𝑌 ] = 𝐸 [𝑎𝑋 + 𝑏] = ∫ +∞ (𝑎𝑥 + 𝑏) 𝑓𝑋 (𝑥)𝑑𝑥 = 𝑎 −∞ 𝑥 𝑓𝑋 (𝑥)𝑑𝑥 + 𝑏 = 𝑎𝐸 [𝑋 ] + 𝑏 −∞ 53 3 Random Variables In general, expectation is a linear operation and expectation operator can be exchanged (in order) with any other linear operation. For any linear combination of functions: ∫ ∞ Õ Õ Õ ∫ ∞ Õ 𝐸 [ 𝑎𝑘 𝑔𝑘 (𝑋 )] = ( 𝑎𝑘 𝑔𝑘 (𝑥))𝑓𝑋 (𝑥)𝑑𝑥 = 𝑎𝑘 𝑔𝑘 (𝑥) 𝑓𝑋 (𝑥)𝑑𝑥 = 𝑎𝑘 𝐸 [𝑔𝑘 (𝑋 )] −∞ 𝑘 𝑘 𝑘 −∞ 𝑘 (3.97) Moments Definition 3.22. Moment: The 𝑛𝑡ℎ moment of a continuous random variable 𝑋 is defined as: ∫ +∞ 𝑥 𝑛 𝑓𝑋 (𝑥)𝑑𝑥 (3.98) 𝐸 [𝑋 𝑛 ] = −∞ The zeroth moment is simply the area under the PDF and must be one for any random variable. The most commonly used moments are the first and second moments. The first moment is the expected value. For some random variables, the second moment might be a more meaningful characterization than the first. 
For example, suppose 𝑋 is a sample of a noise waveform. We might expect that the distribution of the noise is symmetric about zero and hence the first moment will be zero. It only shows that the noise does not have a bias. However, the second moment of the random noise is in some sense a measure of the strength of the noise, which can give us some useful physical insight into the power of the noise. Under certain conditions, a PDF is completely specified if the expected values of all the moments of 𝑋 are known. Variance Similar to the definition of variance for discrete random variables, for continuous random variables 𝑋 , the variance is defined as: 𝑉 𝐴𝑅 [𝑋 ] = 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 2 ] = 𝐸 [𝑋 2 ] − 𝐸 [𝑋 ] 2 (3.99) and the standard deviation is defined by: 𝑆𝑇 𝐷 [𝑋 ] = 𝜎𝑋 = 𝑉 𝐴𝑅 [𝑋 ] 1/2 . Example 3.17 Find the variance of the continuous uniform random variable in Example 3.15. Solution. ∫ 𝑏 (𝑥 − 𝑉 𝐴𝑅 [𝑋 ] = 𝑎 Let 𝑦 = 𝑥 − 𝑎+𝑏 2 , 1 𝑉 𝐴𝑅 [𝑋 ] = 𝑏 −𝑎 ∫ 𝑎 +𝑏 2 1 ) 𝑑𝑥 2 𝑏 −𝑎 (𝑏−𝑎)/2 −(𝑏−𝑎)/2 𝑦 2𝑑𝑦 = (𝑏 − 𝑎) 2 12 The properties derived in section 3.2.3 can be similarly derived for the variance of continuous random variables: 𝑉 𝐴𝑅 [𝑐] = 0 (3.100) 54 3 Random Variables 𝑉 𝐴𝑅 [𝑋 + 𝑐] = 𝑉 𝐴𝑅 [𝑋 ] (3.101) 𝑉 𝐴𝑅 [𝑐𝑋 ] = 𝑐 2𝑉 𝐴𝑅 [𝑋 ] (3.102) where 𝑐 is a constant. The mean and variance are the two most important parameters used in summarizing the PDF of a random variable. Other parameters and moments are occasionally used. For example, the skewness defined by 𝐸 [(𝑋 − 𝐸 [𝑋 ]) 3 ]/𝑆𝑇 𝐷 [𝑋 ] 3 measures the degree of asymmetry about the mean. It is easy to show that if a PDF is symmetric about its mean, then its skewness is zero. The point to note with these parameters of the PDF is that each involves the expected value of a higher power of 𝑋 . 3.3.4 Important Continuous Random Variables The Uniform Random Variable The uniform random variable arises in situations where all values in an interval of the real line are equally likely to occur. • As introduced in Example 3.15, the uniform random variable 𝑈 in the interval [𝑎, 𝑏] has PDF: ( 1 𝑎 ≤𝑥 ≤𝑏 𝑓𝑈 (𝑥) = 𝑏−𝑎 (3.103) 0 𝑥 < 𝑎 𝑎𝑛𝑑 𝑥 > 𝑏 • and CDF: 𝐹𝑈 (𝑥) = 0 𝑥−𝑎 𝑏−𝑎 1 𝑥 <𝑎 𝑎 ≤𝑥 ≤𝑏 𝑥 >𝑏 (3.104) • As found in Examples 3.15 and 3.17 𝐸 [𝑈 ] = 𝑉 𝐴𝑅 [𝑈 ] = 𝑎 +𝑏 2 (3.105) (𝑏 − 𝑎) 2 12 (3.106) The uniform random variable appears in many situations that involve equally likely continuous random variables. Obviously 𝑈 can only be defined over intervals that are finite in length. The Exponential Random Variable The exponential random variable arises in the modeling of the time between occurrence of events (e.g., the time between customer demands for call connections), and in the modeling of the lifetime of devices and systems. • The exponential random variable 𝑋 with parameter 𝜆 has PDF: ( 𝜆𝑒 −𝜆𝑥 𝑥 ≥ 0 𝑓𝑋 (𝑥) = 0 𝑥<0 55 (3.107) 3 Random Variables • and CDF: ( 𝐹𝑋 (𝑥) = 1 − 𝑒 −𝜆𝑥 0 𝑥≥0 𝑥<0 (3.108) The parameter 𝜆 is the rate at which events occur, so 𝐹𝑋 (𝑥), the probability of an event occurring by time 𝑥, increases at the rate 𝜆 increases. • The expectation is given by: ∫ ∞ 𝐸 [𝑋 ] = 𝑡𝜆𝑒 −𝜆𝑡 𝑑𝑡 0 ∫ ∫ using integration by parts ( 𝑢𝑑𝑣 = 𝑢𝑣 − 𝑣𝑑𝑢), with 𝑢 = 𝑡 and 𝑑𝑣 = 𝜆𝑒 −𝜆𝑡 𝑑𝑡: ∞ ∫ ∞ −𝜆𝑡 𝐸 [𝑋 ] = −𝑡𝑒 + 𝑒 −𝜆𝑡 𝑑𝑡 0 (3.109) 0 = lim 𝑡𝑒 −𝜆𝑡 − 0 + ( 𝑡 →∞ −𝑒 −𝜆𝑡 ∞ ) 𝜆 0 −𝑒 −𝜆𝑡 1 1 + = 𝑡 →∞ 𝜆 𝜆 𝜆 = lim (3.110) where we have used the fact that 𝑒 −𝜆𝑡 and 𝑡𝑒 −𝜆𝑡 go to zero as 𝑡 approaches infinity. • [Exercise−] It can be shown that the variance is: 𝑉 𝐴𝑅 [𝑋 ] = 1 𝜆2 (3.111) In event inter-arrival situations, 𝜆 is in units of events/second and 1/𝜆 is in units of seconds per event inter-arrival. 
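As a quick check of Equations 3.107–3.111 (a minimal sketch; the rate λ = 2, the seed, and the sample size are arbitrary), exponential samples can be drawn by inverting the CDF, since solving u = 1 − e^{−λx} gives x = −ln(1 − u)/λ:

```python
import random
from math import log

# A minimal sketch: draw exponential samples by inverting the CDF of Equation
# 3.108 (X = -ln(1-U)/lambda for U uniform on [0,1)) and check E[X], VAR[X].
random.seed(2)
lam = 2.0       # arbitrary rate, in events per second
n = 200_000

xs = [-log(1.0 - random.random()) / lam for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(f"sample mean     {mean:.4f}  (theory: 1/lambda   = {1/lam:.4f})")
print(f"sample variance {var:.4f}  (theory: 1/lambda^2 = {1/lam**2:.4f})")
```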
The exponential random variable satisfies the memoryless property: 𝑃 (𝑋 > 𝑡 + ℎ|𝑋 > 𝑡) = 𝑃 (𝑋 > ℎ) (3.112) The expression on the left side is the probability of having to wait at least ℎ additional seconds given that one has already been waiting 𝑡 seconds.The expression on the right side is the probability of waiting at least ℎ seconds when one first begins to wait. Thus the probability of waiting at least an additional ℎ seconds is the same regardless of how long one has already been waiting! This property can be proved as follows: 𝑃 (𝑋 > 𝑡 + ℎ ∩ 𝑋 > 𝑡) for ℎ > 0 𝑃 (𝑋 > 𝑡) 𝑃 (𝑋 > 𝑡 + ℎ) 𝑒 −𝜆 (𝑡 +ℎ) = = 𝑃 (𝑋 > 𝑡) 𝑒 −𝜆𝑡 = 𝑒 −𝜆ℎ = 𝑃 (𝑋 > ℎ) 𝑃 (𝑋 > 𝑡 + ℎ|𝑋 > 𝑡) = The memoryless property of the exponential random variable makes it the cornerstone for the theory of Markov chains, which is used extensively in evaluating the performance of computer systems and communications networks. It can be shown that the exponential random variable is the only continuous random variable that satisfies the memoryless property. 56 3 Random Variables The Gaussian (Normal) Random Variable There are many real-world situations where one deals with a random variable 𝑋 that consists of the sum of a large number of “small” random variables. The exact description of the PDF of 𝑋 in terms of the component random variables can become quite complex and unwieldy. However, under very general conditions, as the number of components becomes large, the CDF of 𝑋 approaches that of the Gaussian random variable. This random variable appears so often in problems involving randomness that it is known as the “normal” random variable. Figure 3.7: Probability density function of Gaussian random variable. • The PDF for the Gaussian random variable 𝑋 is given by: 1 −(𝑥−𝑚) 2 /2𝜎 2 𝑒 𝑓𝑋 (𝑥) = √ 2𝜋𝜎 −∞ <𝑥 < ∞ (3.113) where 𝑚 and 𝜎 > 0 are real numbers, denoting the mean and standard deviation of 𝑋 . As shown in Figure 3.7, the Gaussian PDF is a “bell-shaped” curve centered and symmetric about 𝑚 and whose “width” increases with 𝜎. In general, the Gaussian PDF is centered about the point 𝑥 = 𝑚 and has a width that is proportional to 𝜎. The special case when 𝑚 = 0 and 𝜎 = 1, is called “standard normal” random variable. Because Gaussian random variables are so commonly used in such a wide variety of applications, it is standard practice to introduce a shorthand notation to describe a Gaussian random variable, 𝑋 ∼ 𝑁 (𝑚, 𝜎 2 ). • The CDF of the Gaussian random variable is given by: ∫ 𝑥 0 2 2 1 𝑃 (𝑋 ≤ 𝑥) = √ 𝑒 −(𝑥 −𝑚) /2𝜎 𝑑𝑥 0 2𝜋𝜎 −∞ The change of variable 𝑡 = (𝑥 0 − 𝑚)/𝜎 results in: ∫ (𝑥−𝑚)/𝜎 2 1 𝑥 −𝑚 𝐹𝑋 (𝑥) = √ 𝑒 −𝑡 /2𝑑𝑡 = Φ( ) 𝜎 2𝜋 −∞ where Φ(𝑥) is the CDF of a Gaussian random variable with 𝑚 = 0 and 𝜎 = 1: ∫ 𝑥 2 1 Φ(𝑥) = √ 𝑒 −𝑡 /2𝑑𝑡 2𝜋 −∞ 57 (3.114) (3.115) 3 Random Variables Therefore any probability involving an arbitrary Gaussian random variable can be expressed in terms of Φ(𝑥). • Note that the PDF of a Gaussian random variable is symmetric about the point 𝑚. Therefore the mean is 𝐸 [𝑋 ] = 𝑚 (as also defined above). • Since 𝜎 is the standard deviation, the variance is 𝑉 𝐴𝑅 [𝑋 ] = 𝜎 2 . In electrical engineering it is customary to work with the Q-function, which is defined by: ∫ ∞ 2 1 𝑄 (𝑥) = 1 − Φ(𝑥) = √ 𝑒 −𝑡 /2𝑑𝑡 (3.116) 2𝜋 𝑥 𝑄 (𝑥) is simply the probability of the “tail” of the PDF. 
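In practice Q(x) can be evaluated without look-up tables through the complementary error function, using the identity Q(x) = ½ erfc(x/√2), which follows from Equations 3.115 and 3.116. A minimal sketch (the parameters m and σ below are only illustrative):

```python
from math import erfc, sqrt

# Q(x) via the complementary error function: Q(x) = 0.5 * erfc(x / sqrt(2)).
def Q(x: float) -> float:
    return 0.5 * erfc(x / sqrt(2.0))

# P(X > x) for a Gaussian with mean m and standard deviation sigma,
# as in Equation 3.118: P(X > x) = Q((x - m) / sigma).
m, sigma = 0.0, 2.0   # arbitrary illustrative parameters
print(f"Q(0)     = {Q(0):.4f}")           # 1/2, by symmetry
print(f"P(X > 4) = {Q((4 - m) / sigma):.4f}")
```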
The symmetry of the PDF implies that:

$$Q(0) = 1/2 \quad \text{and} \quad Q(-x) = 1 - Q(x) \qquad (3.117)$$

From Equation 3.114, which corresponds to P(X ≤ x), the following can be derived:

$$P(X > x) = Q\!\left(\frac{x - m}{\sigma}\right) \qquad (3.118)$$

Figure 3.8: Standardized integrals related to the Gaussian CDF and the Φ and Q functions.

Figure 3.8 shows the standardized integrals related to the Gaussian CDF and the Φ and Q functions. It can be shown that it is impossible to express the CDF integral in closed form. However, as with other important integrals that cannot be expressed in closed form (e.g., Bessel functions), one can always look up values of the required CDF in tables, or use numerical approximations of the desired integral to any desired accuracy. The following expression has been found to give good accuracy for Q(x) over the entire range 0 < x < ∞:

$$Q(x) \approx \left( \frac{1}{(1-a)x + a\sqrt{x^2 + b}} \right) \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \qquad (3.119)$$

where a = 1/π and b = 2π. In some problems, we are interested in finding the value of x for which Q(x) = 10⁻ᵏ. Table 3.1 gives these values for k = 1, ..., 10.

Table 3.1: Look-up table for Q(x) = 10⁻ᵏ.

k      x such that Q(x) = 10⁻ᵏ
1      1.2815
2      2.3263
3      3.0902
4      3.7190
5      4.2649
6      4.7535
7      5.1993
8      5.6120
9      5.9978
10     6.3613

The Gaussian random variable plays a very important role in communication systems, where transmission signals are corrupted by noise voltages resulting from the thermal motion of electrons. It can be shown from physical principles that these voltages will have a Gaussian PDF.

Example 3.18
A communication system accepts a positive voltage V as input and outputs a voltage Y = αV + N, where α = 10⁻² and N is a Gaussian random variable with parameters m = 0 and σ = 2. Find the value of V that gives P(Y < 0) = 10⁻⁶.

Solution. The probability P(Y < 0) is written in terms of N as follows:

$$P(Y < 0) = P(\alpha V + N < 0) = P(N < -\alpha V) = \Phi\!\left(\frac{-\alpha V}{\sigma}\right) = Q\!\left(\frac{\alpha V}{\sigma}\right) = 10^{-6}$$

From Table 3.1 we see that the argument of the Q-function should be αV/σ = 4.753. Thus V = 950.6.

The Gamma Random Variable

The gamma random variable is a versatile random variable that appears in many applications. For example, it is used to model the time required to service customers in queueing systems, the lifetime of devices and systems in reliability studies, and the defect clustering behavior in VLSI chips.

• The PDF of the gamma random variable has two parameters, α > 0 and λ > 0, and is given by:

$$f_X(x) = \frac{\lambda (\lambda x)^{\alpha - 1} e^{-\lambda x}}{\Gamma(\alpha)} \quad 0 < x < \infty \qquad (3.120)$$

where Γ is the gamma function, which is defined by:

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x} \, dx \quad \alpha > 0 \qquad (3.121)$$

The gamma function has the following properties:

$$\Gamma(1/2) = \sqrt{\pi}$$
$$\Gamma(\alpha + 1) = \alpha \, \Gamma(\alpha) \quad \text{for } \alpha > 0$$
$$\Gamma(m + 1) = m! \quad \text{for } m \text{ a non-negative integer}$$

• The CDF of the gamma random variable is given by:

$$F_X(x) = \frac{\gamma(\alpha, \lambda x)}{\Gamma(\alpha)} \quad 0 < x < \infty \qquad (3.122)$$

where the incomplete gamma function γ is given by:

$$\gamma(\alpha, \beta) = \int_0^{\beta} x^{\alpha - 1} e^{-x} \, dx \qquad (3.123)$$

• The mean of the gamma random variable is:

$$E[X] = \alpha / \lambda \qquad (3.124)$$

• The variance of the gamma random variable is:

$$VAR[X] = \alpha / \lambda^2 \qquad (3.125)$$

The versatility of the gamma random variable is due to the richness of the gamma function Γ(α).

Figure 3.9: Probability density function of the gamma random variable.

The PDF of the gamma random variable can assume a variety of shapes, as shown in Figure 3.9. By varying the parameters λ and α it is possible to fit the gamma PDF to many types of experimental data. The exponential random variable is obtained by letting α = 1.
By letting 𝜆 = 1/2 and 𝛼 = 𝑘/2, 60 3 Random Variables where 𝑘 is a positive integer, we obtain the Chi-square random variable, which appears in certain statistical problems and wireless communications applications. The m-Erlang random variable is obtained when 𝛼 = 𝑚 a positive integer. The m-Erlang random variable is used in the system reliability models and in queueing systems models and plays a fundamental role in the study of wireline telecommunication networks. In general, the CDF of the gamma random variable does not have a closed-form expression. However, the special case of the m-Erlang random variable does have a closed-form expression. 3.4 The Markov and Chebyshev Inequalities In general, the mean and variance of a random variable do not provide enough information to determine the CDF/PDF. However, the mean and variance of a random variable 𝑋 do allow us to obtain bounds for probabilities of the form 𝑃 (|𝑋 | ≥ 𝑡). Definition 3.23. Markov inequality: Suppose first that 𝑋 is a nonnegative random variable with mean 𝐸 [𝑋 ].The Markov inequality then states that: 𝑃 (𝑋 ≥ 𝑎) ≤ 𝐸 [𝑋 ] 𝑎 for 𝑋 non negative Markov inequality can be obtained as follows: ∫ 𝑎 ∫ ∞ ∫ 𝐸 [𝑋 ] = 𝑡 𝑓𝑋 (𝑡)𝑑𝑡 + 𝑡 𝑓𝑋 (𝑡)𝑑𝑡 ≥ 0 𝑎 ∞ ∫ 𝑡 𝑓𝑋 (𝑡)𝑑𝑡 ≥ 𝑎 (3.126) ∞ 𝑎𝑓𝑋 (𝑡)𝑑𝑡 ≥ 𝑎𝑃 (𝑋 ≥ 𝑎) 𝑎 Example 3.19 The mean height of children in a kindergarten class is 70 cm. Find the bound on the probability that a kid in the class is taller than 140 cm. Solution. The Markov inequality gives 𝑃 (𝐻 ≥ 140) ≤ 70/140 = 0.5 The bound in the above example appears to be ridiculous. However, a bound, by its very nature, must take the worst case into consideration. One can easily construct a random variable for which the bound given by the Markov inequality is exact. The reason we know that the bound in the above example is ridiculous is that we have knowledge about the variability of the children’s height about their mean. Definition 3.24. Chebyshev inequality: Suppose that the mean 𝐸 [𝑋 ] = 𝑚 and the variance 𝑉 𝐴𝑅 [𝑋 ] = 𝜎 2 of a random variable are known, and that we are interested in bounding 𝑃 (|𝑋 −𝑚| ≥ 𝑎). The Chebyshev inequality states that: 𝑃 (|𝑋 − 𝑚| ≥ 𝑎) ≤ 𝜎2 𝑎2 (3.127) The Chebyshev inequality is a consequence of the Markov inequality. Let 𝐷 2 = (𝑋 − 𝑚) 2 be the squared deviation from the mean. Then the Markov inequality applied to 𝐷 2 gives: 𝑃 (𝐷 2 ≥ 𝑎 2 ) ≤ 𝐸 [(𝑋 − 𝑚) 2 ] 𝜎 2 = 2 𝑎2 𝑎 61 3 Random Variables and note that {𝐷 2 ≥ 𝑎 2 } and {|𝑋 −𝑚| ≥ 𝑎} are equivalent events. Suppose that a random variable 𝑋 has zero variance; then the Chebyshev inequality implies that 𝑃 (𝑋 = 𝑚) = 1, i.e. random variable is equal to its mean with probability one, hence constant in almost all experiments. Example 3.20 If 𝑋 is a Gaussian random variable with mean 𝑚 and variance 𝜎 2 , Find the upper bound for 𝑃 (|𝑋 − 𝑚| ≥ 𝑘𝜎) according to the Chebyshev inequality. Solution. The Chebyshev inequality for 𝑎 = 𝑘𝜎 gives: 𝑃 (|𝑋 − 𝑚| ≥ 𝑘𝜎) ≤ 1 𝑘2 if 𝑘 = 2 the Chebyshev inequality gives the upper bound 0.25. Also we know that for Gaussian random variables: 𝑃 (|𝑋 − 𝑚| ≥ 2𝜎) = 2𝑄 (2) ≈ 0.0456. We see that for certain random variables, the Chebyshev inequality can give rather loose bounds. Nevertheless, the inequality is useful in situations in which we have no knowledge about the distribution of a given random variable other than its mean and variance. 
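The looseness noted in Example 3.20 is easy to see numerically. The following minimal sketch compares the Chebyshev bound on P(|X − m| ≥ kσ) with the exact Gaussian value 2Q(k), evaluating Q through the complementary error function:

```python
from math import erfc, sqrt

# A brief numerical companion to Example 3.20: compare the Chebyshev bound
# on P(|X - m| >= k*sigma) with the exact Gaussian value 2*Q(k).
def Q(x: float) -> float:
    return 0.5 * erfc(x / sqrt(2.0))

for k in (1, 2, 3, 4):
    chebyshev = 1.0 / k**2
    exact = 2.0 * Q(k)
    print(f"k={k}: Chebyshev bound {chebyshev:.4f}   exact Gaussian {exact:.6f}")
```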
We will later use the Chebyshev inequality to prove that the arithmetic average of independent measurements of the same random variable is highly likely to be close to the expected value of the random variable when the number of measurements is large.

If more information is available than just the mean and variance, then it is possible to obtain bounds that are tighter than the Markov and Chebyshev inequalities. Consider the Markov inequality again. The region of interest is $A = \{t \ge a\}$, so let $I_A(t)$ be the indicator function, i.e. $I_A(t) = 1$ if $t \in A$ and $I_A(t) = 0$ otherwise. The key step in the derivation is to note that $t/a \ge 1$ in the region of interest. In effect we bounded $I_A(t)$ by $t/a$ and then have:
$$P(X \ge a) = \int_0^{\infty} I_A(t)\, f_X(t)\,dt \le \int_0^{\infty} \frac{t}{a}\, f_X(t)\,dt = \frac{E[X]}{a}$$
By changing the upper bound on $I_A(t)$, we can obtain different bounds on $P(X \ge a)$. Consider the bound $I_A(t) \le e^{s(t - a)}$, also shown in Figure 3.10, where $s > 0$; then the following bound can be obtained.

Definition 3.25. Chernoff bound: For a random variable $X$,
$$P(X \ge a) \le \int_0^{\infty} e^{s(t - a)}\, f_X(t)\,dt = e^{-sa}\, E[e^{sX}] \qquad (3.128)$$
This bound is called the Chernoff bound, which can be seen to depend on the expected value of an exponential function of $X$. This function is called the moment generating function.

Figure 3.10: Bounds on the indicator function for $A = \{t \ge a\}$.
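Because Equation 3.128 holds for every $s > 0$, the bound is usually minimized over $s$. The sketch below (an added illustration, not from the notes) does this numerically for a standard Gaussian $X$, whose moment generating function is the known closed form $E[e^{sX}] = e^{s^2/2}$:

```python
# Chernoff bound for a standard Gaussian, minimized over s, vs exact Q(a).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def chernoff(a):
    # Minimize e^{-sa} E[e^{sX}] = exp(s^2/2 - s a) over s > 0.
    res = minimize_scalar(lambda s: np.exp(s**2 / 2 - s * a),
                          bounds=(1e-6, 50), method="bounded")
    return res.fun

for a in [1, 2, 3]:
    print(f"a={a}: Chernoff <= {chernoff(a):.4e}, exact Q(a) = {norm.sf(a):.4e}")
```

The minimizing $s$ is $s = a$, giving the bound $e^{-a^2/2}$, which decays at the same exponential rate as the true tail $Q(a)$ and is therefore much tighter than the Chebyshev bound for large $a$.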
Further Reading

1. Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering, 3rd ed., Pearson, 2007: chapters 3 and 4.
2. Scott L. Miller, Donald Childers, Probability and Random Processes: With Applications to Signal Processing and Communications, 2nd ed., Elsevier, 2012: sections 2.8 and 2.9, and chapters 3 and 4.
3. Anthony Hayter, Probability and Statistics for Engineers and Scientists, 4th ed., Brooks/Cole, Cengage Learning, 2012: chapters 2 to 5.

4 Two or More Random Variables

Many random experiments involve several random variables. In some experiments a number of different quantities are measured. For example, the voltage signals at several points in a circuit at some specific time may be of interest. Other experiments involve the repeated measurement of a certain quantity, such as the repeated measurement ("sampling") of the amplitude of an audio or video signal that varies with time. In this chapter, we extend the random variable concepts already introduced to two or more random variables. In a sense we have already covered all the fundamental concepts of probability and random variables, and we are "simply" elaborating on the case of two or more random variables. Nevertheless, there are significant analytical techniques that need to be learned.

4.1 Pairs of Random Variables

Some experiments involve two random variables, e.g. the study of a system with a random input. Due to the randomness of the input, the output will naturally be random as well. Quite often it is necessary to characterize the relationship between the input and the output, and a pair of random variables can be used to do so: one for the input and another for the output. Another class of examples involves spatial coordinates in two dimensions: a pair of random variables can be used to probabilistically describe the position of an object which is subject to various random forces.

There are endless examples of situations where we are interested in two random quantities that may or may not be related to one another, for example, the height and weight of a student, or the temperature and relative humidity at a certain place and time.

Consider an experiment $E$ whose outcomes lie in a sample space $S$. A two-dimensional random variable is a mapping of the points in the sample space to ordered pairs $(x, y)$. Usually, when dealing with a pair of random variables, the sample space naturally partitions itself so that it can be viewed as a combination of two simpler sample spaces. For example, suppose the experiment was to observe the height and weight of a typical student. The range of student heights could fall within some set which we call sample space $S_1$, while the range of student weights could fall within the space $S_2$. The overall sample space of the experiment can then be viewed as $S_1 \times S_2$. For any outcome $s \in S$ of this experiment, the pair of random variables $(X, Y)$ is merely a mapping of the outcome $s$ to a pair of numerical values $(x(s), y(s))$. In the case of our height/weight experiment, it would be natural to choose $x(s)$ to be the height of the student, while $y(s)$ is the weight of the student.

While the density functions $f_X(x)$ and $f_Y(y)$ do partially characterize the experiment, they do not completely describe the situation. It would be natural to expect that the height and weight are somehow related to each other. While it may not be very rare to have a student 180 cm tall, nor unusual to have a student who weighs 55 kg, it is probably rare indeed to have a student who is both 180 cm tall and weighs 55 kg. Therefore, to characterize the relationship between a pair of random variables, it is necessary to look at the joint probabilities of events relating to both random variables.

4.1.1 Joint Cumulative Distribution Function

We start with the notion of a joint CDF.

Definition 4.1. Joint cumulative distribution function: The joint CDF of a pair of random variables $(X, Y)$ is $F_{X,Y}(x, y) = P(X \le x, Y \le y)$. That is, the joint CDF is the joint probability of the two events $\{X \le x\}$ and $\{Y \le y\}$.

As with the CDF of a single random variable, not just any function can be a joint CDF. The joint CDF of a pair of random variables satisfies properties similar to those satisfied by the CDFs of single random variables.

• Since the joint CDF is a probability, it must take on a value between 0 and 1, i.e. $0 \le F_{X,Y}(x, y) \le 1$.

• $F_{X,Y}(x, y)$ evaluated at either $x = -\infty$ or $y = -\infty$ (or both) must be zero, and $F_{X,Y}(\infty, \infty)$ must be one, i.e.
$$F_{X,Y}(-\infty, -\infty) = 0, \quad F_{X,Y}(-\infty, y) = 0, \quad F_{X,Y}(x, -\infty) = 0, \quad F_{X,Y}(\infty, \infty) = 1$$

• For $x_1 \le x_2$ and $y_1 \le y_2$, $\{X \le x_1\} \cap \{Y \le y_1\}$ is a subset of $\{X \le x_2\} \cap \{Y \le y_2\}$, so that $F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2)$. That is, the joint CDF is a monotonic, non-decreasing function of both $x$ and $y$.

• Since the event $\{X \le \infty\}$ must happen, $\{X \le \infty\} \cap \{Y \le y\} = \{Y \le y\}$, so that $F_{X,Y}(\infty, y) = F_Y(y)$. Likewise, $F_{X,Y}(x, \infty) = F_X(x)$. In the context of joint CDFs, $F_X(x)$ and $F_Y(y)$ are referred to as the marginal CDFs of $X$ and $Y$, respectively.

• Consider using a joint CDF to evaluate the probability that the pair of random variables $(X, Y)$ falls into a rectangular region bounded by the points $(x_1, y_1)$, $(x_2, y_1)$, $(x_1, y_2)$ and $(x_2, y_2)$ (the white rectangle in Figure 4.1).
Evaluating $F_{X,Y}(x_2, y_2)$ gives the probability that the pair of random variables falls anywhere below or to the left of the point $(x_2, y_2)$; this includes all of the area in the desired rectangle, plus everything below and to the left of the desired rectangle. The probability of the pair falling to the left of the rectangle can be subtracted off using $F_{X,Y}(x_1, y_2)$. Similarly, the region below the rectangle can be subtracted off using $F_{X,Y}(x_2, y_1)$ (the two shaded regions). In subtracting off these two quantities, we have subtracted the probability of the pair falling both below and to the left of the desired rectangle (the dark-shaded region) twice. Hence we must add back this probability using $F_{X,Y}(x_1, y_1)$. That is:
$$P(x_1 < X \le x_2,\ y_1 < Y \le y_2) = F_{X,Y}(x_2, y_2) - F_{X,Y}(x_1, y_2) - F_{X,Y}(x_2, y_1) + F_{X,Y}(x_1, y_1) \ge 0 \qquad (4.1)$$

Figure 4.1: Illustrating the evaluation of the probability of a pair of random variables falling in a rectangular region.

Equation 4.1 tells us how to calculate the probability of the pair of random variables falling in a rectangular region. Often, we are also interested in calculating the probability of the pair falling in a non-rectangular region (e.g., a circle or triangle). This can be done by forming the required region from many infinitesimal rectangles and then repeatedly applying Equation 4.1.

Example 4.1
Consider a pair of random variables which are uniformly distributed over the unit square (i.e., $0 < x < 1$, $0 < y < 1$). Find the joint CDF.

Solution. The CDF is:
$$F_{X,Y}(x, y) = \begin{cases} 0, & x < 0 \text{ or } y < 0 \\ xy, & 0 \le x \le 1,\ 0 \le y \le 1 \\ x, & 0 \le x \le 1,\ y > 1 \\ y, & x > 1,\ 0 \le y \le 1 \\ 1, & x > 1,\ y > 1 \end{cases}$$
Even this very simple example leads to a rather cumbersome function. Nevertheless, it is straightforward to verify that this function does indeed satisfy all the properties of a joint CDF. From this joint CDF, the marginal CDF of $X$ can be found to be:
$$F_X(x) = F_{X,Y}(x, \infty) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x \le 1 \\ 1, & x > 1 \end{cases}$$
Hence, the marginal CDF of $X$ is that of a uniform random variable. The same statement holds for $Y$ as well.
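The rectangle formula of Equation 4.1 is easy to verify numerically for this example. The sketch below (added for illustration, not from the notes) evaluates Equation 4.1 with the joint CDF $F_{X,Y}(x, y) = xy$ of Example 4.1 and cross-checks it with a Monte Carlo estimate:

```python
# Check Equation 4.1 against Monte Carlo for the uniform unit square.
import numpy as np

def F(x, y):
    """Joint CDF of a pair uniform on the unit square (Example 4.1)."""
    return np.clip(x, 0, 1) * np.clip(y, 0, 1)

x1, x2, y1, y2 = 0.2, 0.7, 0.1, 0.5
p_cdf = F(x2, y2) - F(x1, y2) - F(x2, y1) + F(x1, y1)   # Equation 4.1

rng = np.random.default_rng(0)
X, Y = rng.random(1_000_000), rng.random(1_000_000)
p_mc = np.mean((x1 < X) & (X <= x2) & (y1 < Y) & (Y <= y2))

print(p_cdf)   # 0.5 * 0.4 = 0.2 exactly
print(p_mc)    # close to 0.2
```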
4.1.2 Joint Probability Density Functions

As seen in Example 4.1, even the simplest joint random variables can lead to CDFs which are quite unwieldy. As a result, working with joint CDFs can be difficult. In order to avoid extensive use of joint CDFs, attention is now turned to the two-dimensional equivalent of the PDF.

Definition 4.2. Joint probability density function: The joint PDF of a pair of random variables $(X, Y)$ evaluated at the point $(x, y)$ is:
$$f_{X,Y}(x, y) = \lim_{\varepsilon_x \to 0,\ \varepsilon_y \to 0} \frac{P(x \le X < x + \varepsilon_x,\ y \le Y < y + \varepsilon_y)}{\varepsilon_x \varepsilon_y} \qquad (4.2)$$
Similar to the one-dimensional case, the joint PDF is the probability that the pair of random variables $(X, Y)$ lies in an infinitesimal region defined by the point $(x, y)$, normalized by the area of the region.

For a single random variable, the PDF was the derivative of the CDF. By applying Equation 4.1 to the definition of the joint PDF, a similar relationship is obtained.

Theorem 4.1
The joint PDF $f_{X,Y}(x, y)$ can be obtained from the joint CDF $F_{X,Y}(x, y)$ by taking a partial derivative with respect to each variable. That is,
$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y) \qquad (4.3)$$

Proof. Using Equation 4.1 (and abbreviating $F = F_{X,Y}$),
$$P(x \le X < x + \varepsilon_x,\ y \le Y < y + \varepsilon_y) = F(x + \varepsilon_x, y + \varepsilon_y) - F(x, y + \varepsilon_y) - F(x + \varepsilon_x, y) + F(x, y)$$
$$= \big[F(x + \varepsilon_x, y + \varepsilon_y) - F(x, y + \varepsilon_y)\big] - \big[F(x + \varepsilon_x, y) - F(x, y)\big]$$
Dividing by $\varepsilon_x$ and taking the limit as $\varepsilon_x \to 0$ results in
$$\lim_{\varepsilon_x \to 0} \frac{P(x \le X < x + \varepsilon_x,\ y \le Y < y + \varepsilon_y)}{\varepsilon_x} = \frac{\partial}{\partial x} F_{X,Y}(x, y + \varepsilon_y) - \frac{\partial}{\partial x} F_{X,Y}(x, y)$$
Dividing by $\varepsilon_y$ and taking the limit as $\varepsilon_y \to 0$ gives the desired result:
$$f_{X,Y}(x, y) = \lim_{\varepsilon_y \to 0} \frac{\frac{\partial}{\partial x} F_{X,Y}(x, y + \varepsilon_y) - \frac{\partial}{\partial x} F_{X,Y}(x, y)}{\varepsilon_y} = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x, y)$$

This theorem shows that we can obtain a joint PDF from a joint CDF by differentiating with respect to each variable. The converse of this statement is that we can obtain a joint CDF from a joint PDF by integrating with respect to each variable. Specifically:
$$F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\,du\,dv \qquad (4.4)$$

Example 4.2
Consider the pair of random variables with uniform distribution in Example 4.1. Find the joint PDF.

Solution. By differentiating the joint CDF with respect to both $x$ and $y$, the joint PDF is
$$f_{X,Y}(x, y) = \begin{cases} 1, & 0 < x < 1 \text{ and } 0 < y < 1 \\ 0, & \text{otherwise} \end{cases}$$
which is much simpler than the joint CDF.

From the definition of the joint PDF and its relationship with the joint CDF, several properties of joint PDFs can be inferred:

(i) $f_{X,Y}(x, y) \ge 0$

(ii) $\displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$

(iii) $\displaystyle f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$ and $\displaystyle f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$

(iv) $\displaystyle P(x_1 < X \le x_2,\ y_1 < Y \le y_2) = \int_{y_1}^{y_2}\int_{x_1}^{x_2} f_{X,Y}(x, y)\,dx\,dy$

Property (i) follows directly from the definition of the joint PDF, since both the numerator and denominator there are nonnegative. Property (ii) results from the relationship in Equation 4.4 together with the fact that $F_{X,Y}(\infty, \infty) = 1$; this is the normalization integral for joint PDFs. These first two properties form a set of sufficient conditions for a function of two variables to be a valid joint PDF. Property (iii) is obtained by noting that the marginal CDF of $X$ is $F_X(x) = F_{X,Y}(x, \infty)$. Using Equation 4.4 then results in $F_X(x) = \int_{-\infty}^{x}\int_{-\infty}^{\infty} f_{X,Y}(u, y)\,dy\,du$. Differentiating this expression with respect to $x$ produces the expression in property (iii) for the marginal PDF of $X$; a similar derivation produces the marginal PDF of $Y$. Hence, the marginal PDFs are obtained by integrating out the unwanted variable in the joint PDF. The last property is obtained by combining Equations 4.1 and 4.4.

Property (iv) of joint PDFs specifies how to compute the probability that a pair of random variables takes on a value in a rectangular region. Often, we are interested in computing the probability that the pair of random variables falls in a region which is not rectangular. In general, suppose we wish to compute $P((X, Y) \in A)$, where $A$ is the region illustrated in Figure 4.2. This general region can be approximated as a union of many non-overlapping rectangular regions, as shown in the figure. As the rectangles are made ever smaller, the approximation improves, to the point where the representation becomes exact in the limit of infinitesimally small rectangles. That is, any region can be represented as an infinite number of infinitesimal rectangular regions, so that $A = \bigcup R_i$, where $R_i$ represents the $i$th rectangular region.
The probability that the random pair falls in $A$ is then computed as:
$$P((X, Y) \in A) = \sum_i P((X, Y) \in R_i) = \sum_i \iint_{R_i} f_{X,Y}(x, y)\,dx\,dy \qquad (4.5)$$
The sum of the integrals over the rectangular regions can be replaced by an integral over the original region $A$:
$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy \qquad (4.6)$$
This important result shows that the probability of a pair of random variables falling in some two-dimensional region $A$ is found by integrating the joint PDF of the two random variables over the region $A$.

Figure 4.2: Approximation of an arbitrary region by a series of infinitesimal rectangles.

Example 4.3
Suppose that a pair of random variables has the joint PDF given by:
$$f_{X,Y}(x, y) = c\, e^{-x} e^{-y/2}\, u(x)\, u(y)$$
Find (a) the constant value $c$ and (b) the probability of the event $\{X > Y\}$.

Solution. (a) The constant $c$ is found using the normalization integral:
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = \int_0^{\infty}\int_0^{\infty} c\, e^{-x} e^{-y/2}\,dx\,dy = 1 \ \Rightarrow\ c = 1/2$$
(b) This probability can be viewed as the probability of the pair $(X, Y)$ falling in the region $A = \{(x, y) : x > y\}$. This probability is calculated as:
$$P(X > Y) = \iint_{x > y} f_{X,Y}(x, y)\,dx\,dy = \int_0^{\infty}\int_y^{\infty} \frac{1}{2} e^{-x} e^{-y/2}\,dx\,dy = \int_0^{\infty} \frac{1}{2} e^{-3y/2}\,dy = 1/3$$
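Both parts of Example 4.3 can be verified directly with numerical integration. The sketch below (an added check, not from the notes) uses scipy's `dblquad`, whose integrand convention is $f(y, x)$ with the inner variable first:

```python
# Numerical verification of Example 4.3: normalization and P(X > Y).
import numpy as np
from scipy.integrate import dblquad

f = lambda y, x: 0.5 * np.exp(-x) * np.exp(-y / 2)   # dblquad wants f(y, x)

total, _ = dblquad(f, 0, np.inf, 0, np.inf)          # whole first quadrant
p_xy, _ = dblquad(f, 0, np.inf, 0, lambda x: x)      # region {0 < y < x}

print(total)   # ~1.0, confirming c = 1/2
print(p_xy)    # ~0.3333, confirming P(X > Y) = 1/3
```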
4.1.3 Joint Probability Mass Functions

When the random variables are discrete rather than continuous, it is often more convenient to work with probability mass functions (PMFs) rather than PDFs or CDFs. It is straightforward to extend the concept of the PMF to a pair of random variables.

Definition 4.3. Joint probability mass function: The joint PMF for a pair of discrete random variables $X$ and $Y$ is given by:
$$P_{X,Y}(x, y) = P(\{X = x\} \cap \{Y = y\})$$
In particular, suppose the random variable $X$ takes on values from the set $\{x_1, x_2, \dots, x_M\}$ and the random variable $Y$ takes on values from the set $\{y_1, y_2, \dots, y_N\}$. Here, either $M$ or $N$ could be potentially infinite, or both could be finite. Several properties of the joint PMF analogous to those developed for joint PDFs should be apparent:

(i) $0 \le P_{X,Y}(x, y) \le 1$ (4.7)

(ii) $\displaystyle\sum_{m=1}^{M}\sum_{n=1}^{N} P_{X,Y}(x_m, y_n) = 1$ (4.8)

(iii) $\displaystyle\sum_{n=1}^{N} P_{X,Y}(x_m, y_n) = P_X(x_m), \qquad \sum_{m=1}^{M} P_{X,Y}(x_m, y_n) = P_Y(y_n)$ (4.9)

(iv) $\displaystyle P((X, Y) \in A) = \sum_{(x, y) \in A} P_{X,Y}(x, y)$ (4.10)

Furthermore, the joint PDF or the joint CDF of a pair of discrete random variables can be related to the joint PMF through the use of delta functions or step functions by:
$$f_{X,Y}(x, y) = \sum_{m=1}^{M}\sum_{n=1}^{N} P_{X,Y}(x_m, y_n)\,\delta(x - x_m)\,\delta(y - y_n) \qquad (4.11)$$
$$F_{X,Y}(x, y) = \sum_{m=1}^{M}\sum_{n=1}^{N} P_{X,Y}(x_m, y_n)\,u(x - x_m)\,u(y - y_n) \qquad (4.12)$$
Usually, it is most convenient to work with PMFs when the random variables are discrete. However, if the random variables are mixed (i.e., one is discrete and one is continuous), then it becomes necessary to work with PDFs or CDFs, since the PMF will not be meaningful for the continuous random variable.

Example 4.4
Two discrete random variables $N$ and $M$ have a joint PMF given by:
$$P_{N,M}(n, m) = \frac{(n + m)!}{n!\,m!} \frac{a^n b^m}{(a + b + 1)^{n + m + 1}}, \qquad n = 0, 1, 2, \dots,\ m = 0, 1, 2, \dots$$
Find the marginal PMFs $P_N(n)$ and $P_M(m)$.

Solution. The marginal PMF of $N$ can be found by summing over $m$ in the joint PMF:
$$P_N(n) = \sum_{m=0}^{\infty} P_{N,M}(n, m) = \sum_{m=0}^{\infty} \frac{(n + m)!}{n!\,m!} \frac{a^n b^m}{(a + b + 1)^{n + m + 1}}$$
To evaluate this series, the following identity is used:
$$\sum_{m=0}^{\infty} \frac{(n + m)!}{n!\,m!} x^m = \left(\frac{1}{1 - x}\right)^{n + 1}$$
The marginal PMF then reduces to
$$P_N(n) = \frac{a^n}{(a + b + 1)^{n + 1}} \sum_{m=0}^{\infty} \frac{(n + m)!}{n!\,m!}\left(\frac{b}{a + b + 1}\right)^m = \frac{a^n}{(a + b + 1)^{n + 1}}\left(\frac{1}{1 - \frac{b}{a + b + 1}}\right)^{n + 1} = \frac{a^n}{(1 + a)^{n + 1}}$$
Likewise, by symmetry, the marginal PMF of $M$ is
$$P_M(m) = \frac{b^m}{(1 + b)^{m + 1}}$$
Hence, the random variables $M$ and $N$ both follow a geometric distribution.

4.1.4 Conditional Probabilities and Densities

The notion of conditional distribution functions and conditional density functions can be extended to the case where the conditioning event is related to another random variable. For example, we might want to know the distribution of a random variable representing the score a student achieves on a test given the value of another random variable representing the number of hours the student studied for the test. Or, perhaps we want to know the probability density function of the outside temperature, given that the humidity is known to be below 50%.

To start with, consider a pair of discrete random variables $X$ and $Y$ with a PMF $P_{X,Y}(x, y)$. Suppose we would like to know the PMF of the random variable $X$ given that the value of $Y$ has been observed. Then, according to the definition of conditional probability:
$$P(X = x | Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P_{X,Y}(x, y)}{P_Y(y)} \qquad (4.13)$$
We refer to this as the conditional PMF of $X$ given $Y$. By way of notation we write:
$$P_{X|Y}(x|y) = \frac{P_{X,Y}(x, y)}{P_Y(y)} \qquad (4.14)$$

Example 4.5
Using the joint PMF given in Example 4.4 along with the marginal PMF found in that example, find the conditional PMF $P_{N|M}(n|m)$.

Solution.
$$P_{N|M}(n|m) = \frac{P_{M,N}(m, n)}{P_M(m)} = \frac{(n + m)!}{n!\,m!} \frac{a^n b^m}{(a + b + 1)^{n + m + 1}} \cdot \frac{(1 + b)^{m + 1}}{b^m} = \frac{(n + m)!}{n!\,m!} \frac{a^n (1 + b)^{m + 1}}{(a + b + 1)^{n + m + 1}}$$
Note that the conditional PMF of $N$ given $M$ is quite different from the marginal PMF of $N$. That is, knowing $M$ changes the distribution of $N$.

The simple result developed in Equation 4.13 can be extended to the case of continuous random variables and PDFs.

Definition 4.4. Conditional probability density function: The conditional PDF of a random variable $X$ given that $Y = y$ is:
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \qquad (4.15)$$
Integrating both sides of this equation with respect to $x$ produces the conditional CDF:

Definition 4.5. Conditional cumulative distribution function: The conditional CDF of a random variable $X$ given that $Y = y$ is:
$$F_{X|Y}(x|y) = \frac{\int_{-\infty}^{x} f_{X,Y}(x', y)\,dx'}{f_Y(y)} \qquad (4.16)$$
Usually, the conditional PDF is much easier to work with, so the conditional CDF will not be discussed further.

Example 4.6
A certain pair of random variables has a joint PDF given by:
$$f_{X,Y}(x, y) = \frac{2abc}{(ax + by + c)^3}\, u(x)\, u(y)$$
for some positive constants $a$, $b$, and $c$. Find the conditional PDFs of $X$ given $Y$ and of $Y$ given $X$.

Solution. The marginal PDFs are easily found to be:
$$f_X(x) = \int_0^{\infty} f_{X,Y}(x, y)\,dy = \frac{ac}{(ax + c)^2}\, u(x), \qquad f_Y(y) = \int_0^{\infty} f_{X,Y}(x, y)\,dx = \frac{bc}{(by + c)^2}\, u(y)$$
The conditional PDF of $X$ given $Y$ then works out to be:
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{2a(by + c)^2}{(ax + by + c)^3}\, u(x)$$
The conditional PDF of $Y$ given $X$ is determined in a similar way:
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{2b(ax + c)^2}{(ax + by + c)^3}\, u(y)$$
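A quick numerical spot-check of the marginal found in Example 4.6 is sketched below (added for illustration, not from the notes), using the assumed values $a = b = c = 1$; any positive constants would work the same way:

```python
# Spot-check the marginal PDF of Example 4.6 by integrating out x.
import numpy as np
from scipy.integrate import quad

a, b, c = 1.0, 1.0, 1.0
f_joint = lambda x, y: 2 * a * b * c / (a * x + b * y + c) ** 3

for y in [0.0, 0.5, 2.0]:
    num, _ = quad(lambda x: f_joint(x, y), 0, np.inf)   # integrate out x
    closed = b * c / (b * y + c) ** 2                   # fY(y) from the example
    print(f"y = {y}: numeric = {num:.6f}, closed-form = {closed:.6f}")
```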
Example 4.7
$X$ and $Y$ are two Gaussian random variables with a joint PDF:
$$f_{X,Y}(x, y) = \frac{1}{\pi\sqrt{3}} \exp\!\left(-\frac{2}{3}(x^2 - xy + y^2)\right)$$
Find the marginal PDFs and the conditional PDF of $X$ given $Y$.

Solution. The marginal PDF is found as follows:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \frac{1}{\pi\sqrt{3}} \exp\!\left(-\frac{2}{3}x^2\right) \int_{-\infty}^{\infty} \exp\!\left(-\frac{2}{3}(y^2 - xy)\right) dy$$
$$= \frac{1}{\pi\sqrt{3}} \exp\!\left(-\frac{x^2}{2}\right) \int_{-\infty}^{\infty} \exp\!\left(-\frac{2}{3}\Big(y^2 - xy + \frac{x^2}{4}\Big)\right) dy = \frac{1}{\pi\sqrt{3}} \exp\!\left(-\frac{x^2}{2}\right) \int_{-\infty}^{\infty} \exp\!\left(-\frac{2}{3}(y - x/2)^2\right) dy$$
Now the integrand is a Gaussian-looking function. If the appropriate constant is added to the integrand, it becomes a valid PDF and hence must integrate to one. In this case, the constant is $\sqrt{2/(3\pi)}$. Therefore, the integral as written must evaluate to $\sqrt{3\pi/2}$. So:
$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)$$
and we see that $X$ is a zero-mean, unit-variance Gaussian (i.e., standard normal) random variable. By symmetry, the marginal PDF of $Y$ must also be of the same form. The conditional PDF of $X$ given $Y$ is
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{\frac{1}{\pi\sqrt{3}}\exp\!\left(-\frac{2}{3}(x^2 - xy + y^2)\right)}{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{y^2}{2}\right)} = \sqrt{\frac{2}{3\pi}}\, \exp\!\left(-\frac{2}{3}\Big(x - \frac{y}{2}\Big)^2\right)$$
So, the conditional PDF of $X$ given $Y$ is also Gaussian. But, given that it is known that $Y = y$, the mean of $X$ is now $y/2$ (instead of zero), and the variance of $X$ is $3/4$ (instead of one). In this example, knowledge of $Y$ has shifted the mean and reduced the variance of $X$.

In addition to conditioning on a random variable taking on a point value such as $Y = y$, the conditioning can also occur on an interval of the form $y_1 \le Y \le y_2$. To simplify notation, let the conditioning event be $A = \{y_1 \le Y \le y_2\}$. The relevant conditional PMF, PDF, and CDF are then given, respectively, by:
$$P_{X|A}(x) = \frac{\sum_{y = y_1}^{y_2} P_{X,Y}(x, y)}{\sum_{y = y_1}^{y_2} P_Y(y)} \qquad (4.17)$$
$$f_{X|A}(x) = \frac{\int_{y_1}^{y_2} f_{X,Y}(x, y)\,dy}{\int_{y_1}^{y_2} f_Y(y)\,dy} \qquad (4.18)$$
$$F_{X|A}(x) = \frac{F_{X,Y}(x, y_2) - F_{X,Y}(x, y_1)}{F_Y(y_2) - F_Y(y_1)} \qquad (4.19)$$

Example 4.8
Using the joint PDF of Example 4.7, determine the conditional PDF of $X$ given that $Y > y_0$.

Solution.
$$\int_{y_0}^{\infty} f_{X,Y}(x, y)\,dy = \int_{y_0}^{\infty} \frac{1}{\pi\sqrt{3}} \exp\!\left(-\frac{2}{3}(x^2 - xy + y^2)\right) dy = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) \int_{y_0}^{\infty} \sqrt{\frac{2}{3\pi}}\, \exp\!\left(-\frac{2}{3}\Big(y - \frac{x}{2}\Big)^2\right) dy$$
$$= \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) Q\!\left(\frac{2y_0 - x}{\sqrt{3}}\right)$$
Since the marginal PDF of $Y$ is a zero-mean, unit-variance Gaussian PDF,
$$\int_{y_0}^{\infty} f_Y(y)\,dy = \int_{y_0}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{y^2}{2}\right) dy = Q(y_0)$$
Therefore, the PDF of $X$ given $Y > y_0$ is:
$$f_{X|Y > y_0}(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) \frac{Q\!\left(\frac{2y_0 - x}{\sqrt{3}}\right)}{Q(y_0)}$$
Note that when the conditioning event was a point condition on $Y$, the conditional PDF of $X$ was Gaussian; yet, when the conditioning event is an interval condition on $Y$, the resulting conditional PDF of $X$ is not Gaussian at all.
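The conditional-moment results of Example 4.7 can also be checked by simulation. The sketch below (added for illustration, not from the notes) relies on the fact, shown later in Section 4.1.7, that this joint PDF is that of a jointly Gaussian pair with zero means, unit variances, and correlation coefficient 1/2; it conditions on $Y$ lying in a narrow window around $y_0 = 1$:

```python
# Monte Carlo check of Example 4.7: E[X | Y=y] ~ y/2 and VAR[X | Y=y] ~ 3/4.
import numpy as np

rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 1.0]]                   # unit variances, rho = 1/2
xy = rng.multivariate_normal([0, 0], cov, size=2_000_000)
X, Y = xy[:, 0], xy[:, 1]

y0 = 1.0
sel = np.abs(Y - y0) < 0.02                      # condition on Y near y0
print(X[sel].mean())                             # ~ y0/2 = 0.5
print(X[sel].var())                              # ~ 3/4
```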
4.1.5 Expected Values and Moments Involving Pairs of Random Variables

We are often interested in how two variables $X$ and $Y$ vary together. In particular, we are interested in whether the variations of $X$ and $Y$ are correlated: for example, if $X$ increases, does $Y$ tend to increase or to decrease? The joint moments of $X$ and $Y$ provide this information.

Definition 4.6. Let $g(x, y)$ be an arbitrary two-dimensional function. The expected value of $g(X, Y)$, where $X$ and $Y$ are random variables, is
$$E[g(X, Y)] = \iint g(x, y)\, f_{X,Y}(x, y)\,dx\,dy \qquad (4.20)$$
For discrete random variables, the equivalent expression in terms of the joint PMF is:
$$E[g(X, Y)] = \sum_m \sum_n g(x_m, y_n)\, P_{X,Y}(x_m, y_n) \qquad (4.21)$$
If the function $g(x, y)$ is actually a function of only a single variable, say $x$, then this definition reduces to the definition of expected values for functions of a single random variable:
$$E[g(X)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x)\, f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} g(x)\left(\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy\right) dx = \int_{-\infty}^{\infty} g(x)\, f_X(x)\,dx \qquad (4.22)$$
To start with, consider an arbitrary linear function of the two variables, $g(x, y) = ax + by$, where $a$ and $b$ are constants. Then:
$$E[aX + bY] = \iint (ax + by)\, f_{X,Y}(x, y)\,dx\,dy = a\iint x\, f_{X,Y}\,dx\,dy + b\iint y\, f_{X,Y}\,dx\,dy = aE[X] + bE[Y]$$
This result merely states that expectation is a linear operation.

Definition 4.7. Correlation: The correlation between two random variables is defined as:
$$R_{X,Y} = E[XY] = \iint xy\, f_{X,Y}(x, y)\,dx\,dy \qquad (4.23)$$
Furthermore, two random variables which have a correlation of zero are said to be orthogonal. One instance in which the correlation appears is in calculating the second moment of a sum of two random variables, that is, the expected value of $g(X, Y) = (X + Y)^2$:
$$E[(X + Y)^2] = E[X^2 + 2XY + Y^2] = E[X^2] + E[Y^2] + 2E[XY] \qquad (4.24)$$
Hence the second moment of the sum is the sum of the second moments plus twice the correlation.

Definition 4.8. Covariance: The covariance between two random variables is:
$$COV(X, Y) = E[(X - E[X])(Y - E[Y])] = \iint (x - E[X])(y - E[Y])\, f_{X,Y}(x, y)\,dx\,dy \qquad (4.25)$$
If two random variables have a covariance of zero, they are said to be uncorrelated.

Theorem 4.2
The correlation and covariance are strongly related to one another as follows:
$$COV(X, Y) = R_{X,Y} - E[X]E[Y] \qquad (4.26)$$
Proof.
$$COV(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY - E[X]Y - E[Y]X + E[X]E[Y]]$$
$$= E[XY] - E[X]E[Y] - E[Y]E[X] + E[X]E[Y] = E[XY] - E[X]E[Y]$$

As a result, if either $X$ or $Y$ (or both) has a mean of zero, correlation and covariance are equivalent. The covariance appears when calculating the variance of a sum of two random variables:
$$VAR[X + Y] = VAR[X] + VAR[Y] + 2\,COV(X, Y) \qquad (4.27)$$
This result can be obtained from Equation 4.24 by replacing $X$ with $X - E[X]$ and $Y$ with $Y - E[Y]$.

Another statistical parameter related to a pair of random variables is the correlation coefficient, which is nothing more than a normalized version of the covariance.

Definition 4.9. Correlation coefficient: The correlation coefficient of two random variables $X$ and $Y$, $\rho_{XY}$, is defined as
$$\rho_{XY} = \frac{COV(X, Y)}{\sqrt{VAR(X)\,VAR(Y)}} = \frac{E[(X - E[X])(Y - E[Y])]}{\sigma_X \sigma_Y} \qquad (4.28)$$
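These quantities are straightforward to estimate from data, and Theorem 4.2 can be checked directly on samples. The sketch below (an added illustration with an assumed linear model $Y = 0.5X + \text{noise}$, not from the notes) computes the sample correlation, covariance, and correlation coefficient:

```python
# Sample-based check of Theorem 4.2 and Definition 4.9.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(1.0, 2.0, size=1_000_000)
Y = 0.5 * X + rng.normal(0.0, 1.0, size=X.size)    # correlated with X

R = np.mean(X * Y)                                 # correlation E[XY]
cov = np.mean((X - X.mean()) * (Y - Y.mean()))     # covariance
rho = cov / (X.std() * Y.std())                    # correlation coefficient

print(np.isclose(cov, R - X.mean() * Y.mean()))    # Theorem 4.2: True
print(rho)                                         # ~ 0.707 for this model
```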
The next theorem quantifies the nature of the normalization.

Theorem 4.3
The correlation coefficient is less than 1 in absolute value.

Proof. Consider taking the second moment of $X + aY$, where $a$ is a real constant:
$$E[(X + aY)^2] = E[X^2] + 2aE[XY] + a^2 E[Y^2] \ge 0$$
Since this is true for any $a$, we can tighten the bound by choosing the value of $a$ that minimizes the left-hand side. This value of $a$ turns out to be
$$a = -\frac{E[XY]}{E[Y^2]}$$
Plugging in this value gives
$$E[X^2] + \frac{E[XY]^2}{E[Y^2]} - 2\frac{E[XY]^2}{E[Y^2]} \ge 0 \ \Rightarrow\ E[XY]^2 \le E[X^2]\,E[Y^2]$$
If we replace $X$ with $X - E[X]$ and $Y$ with $Y - E[Y]$, the result is
$$\big(COV(X, Y)\big)^2 \le VAR[X]\,VAR[Y]$$
Rearranging terms then gives the desired result:
$$|\rho_{XY}| = \left|\frac{COV(X, Y)}{\sqrt{VAR[X]\,VAR[Y]}}\right| \le 1$$

Note that we can also infer from the proof that equality holds if $Y$ is a constant times $X$. That is, a correlation coefficient of 1 (or $-1$) implies that $X$ and $Y$ are completely correlated (knowing $Y$ determines $X$). Furthermore, uncorrelated random variables will have a correlation coefficient of zero. Therefore, as its name implies, the correlation coefficient is a quantitative measure of the correlation between two random variables. It should be emphasized at this point that zero correlation is not to be confused with independence. These two concepts are not the same.

Example 4.9
Consider once again the joint PDF of Example 4.7. Find $R_{X,Y}$, $COV(X, Y)$ and $\rho_{X,Y}$.

Solution. The correlation for these random variables is:
$$E[XY] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{xy}{\pi\sqrt{3}} \exp\!\left(-\frac{2}{3}(x^2 - xy + y^2)\right) dy\,dx$$
In order to evaluate this integral, the joint PDF is rewritten as $f_{X,Y}(x, y) = f_{Y|X}(y|x)\, f_X(x)$ and then those terms involving only $x$ are pulled outside the inner integral over $y$:
$$E[XY] = \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) \left[\int_{-\infty}^{\infty} y\,\sqrt{\frac{2}{3\pi}}\, \exp\!\left(-\frac{2}{3}\Big(y - \frac{x}{2}\Big)^2\right) dy\right] dx$$
The inner integral (in square brackets) is the expected value of a Gaussian random variable with a mean of $x/2$ and variance of $3/4$, which thus evaluates to $x/2$. Hence,
$$E[XY] = \frac{1}{2}\int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) dx$$
The remaining integral is the second moment of a Gaussian random variable with zero mean and unit variance, which integrates to 1. The correlation of these two random variables is therefore $E[XY] = 1/2$. Since both $X$ and $Y$ have zero means, $COV(X, Y)$ is also equal to $1/2$. Finally, the correlation coefficient is also $\rho_{XY} = 1/2$, due to the fact that both $X$ and $Y$ have unit variance.

The concepts of correlation and covariance can be generalized to higher-order moments, as given in the following definitions.

Definition 4.10. Joint moment: The $(m, n)$th joint moment of two random variables $X$ and $Y$ is:
$$E[X^m Y^n] = \iint x^m y^n\, f_{X,Y}(x, y)\,dx\,dy \qquad (4.29)$$
Definition 4.11. Joint central moment: The $(m, n)$th joint central moment of two random variables $X$ and $Y$ is:
$$E[(X - E[X])^m (Y - E[Y])^n] = \iint (x - E[X])^m (y - E[Y])^n\, f_{X,Y}(x, y)\,dx\,dy \qquad (4.30)$$
These higher-order joint moments are not frequently used.

As with single random variables, a conditional expected value can also be defined, for which the expectation is carried out with respect to the appropriate conditional density function.

Definition 4.12. The conditional expected value of a function $g(X)$ of a random variable $X$ given that $Y = y$ is:
$$E[g(X)|Y] = \int_{-\infty}^{\infty} g(x)\, f_{X|Y}(x|y)\,dx \qquad (4.31)$$
Conditional expected values can be particularly useful in calculating expected values of functions of two random variables that can be factored into the product of two one-dimensional functions. That is, consider a function of the form $g(x, y) = g_1(x)\,g_2(y)$.
Then:
$$E[g_1(X)\,g_2(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x)\,g_2(y)\, f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x)\,g_2(y)\, f_X(x)\, f_{Y|X}(y|x)\,dx\,dy$$
$$= \int_{-\infty}^{\infty} g_1(x)\, f_X(x) \left(\int_{-\infty}^{\infty} g_2(y)\, f_{Y|X}(y|x)\,dy\right) dx = \int_{-\infty}^{\infty} g_1(x)\, f_X(x)\, E_Y[g_2(Y)|X]\,dx = E_X\big[g_1(X)\, E_Y[g_2(Y)|X]\big]$$
Here, the subscripts on the expectation operator have been included for clarity, to emphasize that the outer expectation is with respect to the random variable $X$, while the inner expectation is with respect to the random variable $Y$ (conditioned on $X$). This result allows us to break a two-dimensional expectation into two one-dimensional expectations. This technique was used in Example 4.9, where the correlation between two variables was essentially written as:
$$R_{X,Y} = E_X\big[X\, E_Y[Y|X]\big] \qquad (4.32)$$
In that example, the conditional PDF of $Y$ given $X$ was Gaussian, so finding the conditional mean was accomplished by inspection. The outer expectation then required finding the second moment of a Gaussian random variable, which is also straightforward.

4.1.6 Independence of Random Variables

The concept of independent events was introduced in Section 2.4. In this section, we extend this concept to the realm of random variables. To make that extension, consider the events $A = \{X \le x\}$ and $B = \{Y \le y\}$ related to the random variables $X$ and $Y$. The two events $A$ and $B$ are statistically independent if $P(A, B) = P(A)P(B)$. Restated in terms of the random variables, this condition becomes
$$P(X \le x, Y \le y) = P(X \le x)\,P(Y \le y) \ \Rightarrow\ F_{X,Y}(x, y) = F_X(x)\,F_Y(y) \qquad (4.33)$$
Hence, two random variables are statistically independent if their joint CDF factors into a product of the marginal CDFs. Differentiating both sides of this equation with respect to both $x$ and $y$ reveals that the same statement applies to the PDF as well. That is, for statistically independent random variables, the joint PDF factors into a product of the marginal PDFs:
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \qquad (4.34)$$
It is not difficult to show that the same statement applies to PMFs as well. The preceding condition can also be restated in terms of conditional PDFs. Dividing both sides of Equation 4.34 by $f_X(x)$ results in
$$f_{Y|X}(y|x) = f_Y(y) \qquad (4.35)$$
A similar result involving the conditional PDF of $X$ given $Y$ could have been obtained by dividing both sides by the PDF of $Y$. In other words, if $X$ and $Y$ are independent, knowing the value of the random variable $X$ should not change the distribution of $Y$, and vice versa.

Example 4.10
Revisiting Example 4.7 once again, determine whether $X$ and $Y$ are independent.

Solution. The marginal PDF of $X$ is
$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right)$$
while the conditional PDF of $X$ given $Y$ is
$$f_{X|Y}(x|y) = \sqrt{\frac{2}{3\pi}}\, \exp\!\left(-\frac{2}{3}\Big(x - \frac{y}{2}\Big)^2\right)$$
Since these are not equal, the two random variables are not independent.

Example 4.11
Suppose the random variables $X$ and $Y$ are uniformly distributed on the square defined by $0 \le x, y \le 1$. Are these two random variables independent?

Solution. The joint PDF of $X$ and $Y$ is:
$$f_{X,Y}(x, y) = \begin{cases} 1, & 0 \le x, y \le 1 \\ 0, & \text{otherwise} \end{cases}$$
and the marginal PDFs of $X$ and $Y$ are:
$$f_X(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad f_Y(y) = \begin{cases} 1, & 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases}$$
These random variables are statistically independent since $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$.

Theorem 4.4
Let $X$ and $Y$ be two independent random variables, and consider forming two new random variables $U = g_1(X)$ and $V = g_2(Y)$. These new random variables $U$ and $V$ are also independent.

Another important result deals with the correlation, covariance, and correlation coefficients of independent random variables.
Theorem 4.5
If $X$ and $Y$ are independent random variables, then $R_{X,Y} = E[X]E[Y]$, $COV(X, Y) = 0$, and $\rho_{X,Y} = 0$.

Proof.
$$E[XY] = \iint xy\, f_{X,Y}(x, y)\,dx\,dy = \int x\, f_X(x)\,dx \int y\, f_Y(y)\,dy = E[X]E[Y]$$
The conditions involving covariance and correlation coefficient follow directly from this result.

Therefore, independent random variables are necessarily uncorrelated, but the converse is not always true. Uncorrelated random variables do not have to be independent, as demonstrated by the next example.

Example 4.12: Uncorrelated but Dependent Random Variables
Consider a pair of random variables $X$ and $Y$ that are uniformly distributed over the unit circle, so that:
$$f_{X,Y}(x, y) = \begin{cases} 1/\pi, & x^2 + y^2 \le 1 \\ 0, & \text{otherwise} \end{cases}$$
The marginal PDF of $X$ can be found as follows:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\,dy = \frac{2}{\pi}\sqrt{1 - x^2}, \qquad -1 \le x \le 1$$
By symmetry, the marginal PDF of $Y$ must take on the same functional form. Hence, the product of the marginal PDFs is
$$f_X(x)\, f_Y(y) = \frac{4}{\pi^2}\sqrt{(1 - x^2)(1 - y^2)}, \qquad -1 \le x, y \le 1$$
Clearly, this is not equal to the joint PDF, and therefore the two random variables are dependent. This conclusion could have been reached in a simpler manner: note that if we are told that $X = 1$, then necessarily $Y = 0$, whereas if we know that $X = 0$, then $Y$ can range anywhere from $-1$ to 1. Therefore, conditioning on different values of $X$ leads to different distributions for $Y$.

Next, the correlation between $X$ and $Y$ is calculated:
$$R_{X,Y} = E[XY] = \iint_{x^2 + y^2 \le 1} \frac{xy}{\pi}\,dx\,dy = \frac{1}{\pi}\int_{-1}^{1} x \left(\int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} y\,dy\right) dx$$
Since the inner integrand is an odd function (of $y$) and the limits of integration are symmetric about zero, the integral is zero. Hence, $R_{X,Y} = 0$. Note from the marginal PDFs just found that both $X$ and $Y$ are zero-mean. So, it is seen for this example that while the two random variables are uncorrelated, they are not independent.

4.1.7 Pairs of Jointly Gaussian Random Variables

As with single random variables, the most common and important example of a two-dimensional probability distribution is that of a joint Gaussian distribution. Jointly Gaussian random variables appear in numerous applications in electrical engineering. They are frequently used to model signals in signal processing applications, and they are the most important model used in communication systems that involve dealing with signals in the presence of noise. They also play a central role in many statistical methods.

Definition 4.13. Jointly Gaussian random variables: A pair of random variables $X$ and $Y$ is said to be jointly Gaussian if their joint PDF is of the general form:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho_{XY}^2}} \exp\!\left(-\frac{\left(\frac{x - m_X}{\sigma_X}\right)^2 - 2\rho_{XY}\left(\frac{x - m_X}{\sigma_X}\right)\left(\frac{y - m_Y}{\sigma_Y}\right) + \left(\frac{y - m_Y}{\sigma_Y}\right)^2}{2(1 - \rho_{XY}^2)}\right) \qquad (4.36)$$
where $m_X$ and $m_Y$ are the means of $X$ and $Y$, respectively; $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively; and $\rho_{XY}$ is the correlation coefficient of $X$ and $Y$. It can be shown that this joint PDF results in Gaussian marginal PDFs:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\!\left(-\frac{(x - m_X)^2}{2\sigma_X^2}\right) \qquad (4.37)$$
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\!\left(-\frac{(y - m_Y)^2}{2\sigma_Y^2}\right) \qquad (4.38)$$
Furthermore, if $X$ and $Y$ are jointly Gaussian, then the conditional PDF of $X$ given $Y = y$ is also Gaussian, with a mean of $m_X + \rho_{XY}(\sigma_X/\sigma_Y)(y - m_Y)$ and a variance of $\sigma_X^2(1 - \rho_{XY}^2)$.
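Jointly Gaussian pairs with a prescribed $(m_X, m_Y, \sigma_X, \sigma_Y, \rho_{XY})$ are easy to generate: one standard technique (sketched below as an added illustration, not from the notes) multiplies independent standard normals by a Cholesky factor of the covariance matrix and adds the means, then checks the sample statistics:

```python
# Generate jointly Gaussian pairs via a Cholesky factor and check moments.
import numpy as np

m_x, m_y, s_x, s_y, rho = 1.0, -2.0, 2.0, 0.5, -0.9
C = np.array([[s_x**2, rho * s_x * s_y],
              [rho * s_x * s_y, s_y**2]])       # covariance matrix
L = np.linalg.cholesky(C)

rng = np.random.default_rng(3)
Z = rng.standard_normal((2, 1_000_000))         # independent standard normals
X, Y = (L @ Z) + np.array([[m_x], [m_y]])

print(X.mean(), Y.mean())                       # ~ 1.0, -2.0
print(np.corrcoef(X, Y)[0, 1])                  # ~ -0.9
```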
Figure 4.3 shows the joint Gaussian PDF for three different values of the correlation coefficient. In Figure 4.3(a), the correlation coefficient is $\rho_{XY} = 0$ and thus the two random variables are uncorrelated. Figure 4.3(b) shows the joint PDF when the correlation coefficient is large and positive, $\rho_{XY} = 0.9$. Note how the surface has become taller and thinner and largely lies above the line $y = x$. In Figure 4.3(c), the correlation is large and negative, $\rho_{XY} = -0.9$. This is the same picture as in Figure 4.3(b), except rotated by 90°; the surface now lies largely above the line $y = -x$. In all three figures, the means of both $X$ and $Y$ are zero and the variances of both $X$ and $Y$ are 1. Changing the means would simply translate the surface but would not change its shape. Changing the variances would expand or contract the surface along either the $X$- or $Y$-axis, depending on which variance was changed.

Figure 4.3: The joint Gaussian PDF: (a) $m_X = m_Y = 0$, $\sigma_X = \sigma_Y = 1$, $\rho_{XY} = 0$; (b) $m_X = m_Y = 0$, $\sigma_X = \sigma_Y = 1$, $\rho_{XY} = 0.9$; (c) $m_X = m_Y = 0$, $\sigma_X = \sigma_Y = 1$, $\rho_{XY} = -0.9$.

Example 4.13
The joint Gaussian PDF is given by Equation 4.36. Consider the following equation:
$$\left(\frac{x - m_X}{\sigma_X}\right)^2 - 2\rho_{XY}\left(\frac{x - m_X}{\sigma_X}\right)\left(\frac{y - m_Y}{\sigma_Y}\right) + \left(\frac{y - m_Y}{\sigma_Y}\right)^2 = c^2$$
This is the equation for an ellipse. Plotting these ellipses for different values of $c$ results in what is known as a contour plot. Figure 4.4 shows such plots for the two-dimensional joint Gaussian PDF.

Figure 4.4: Contour plots for joint Gaussian random variables.

Theorem 4.6
Uncorrelated Gaussian random variables are independent.

Proof. Uncorrelated Gaussian random variables have a correlation coefficient of zero. Plugging $\rho_{XY} = 0$ into the general joint Gaussian PDF results in
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y} \exp\!\left(-\frac{\left(\frac{x - m_X}{\sigma_X}\right)^2 + \left(\frac{y - m_Y}{\sigma_Y}\right)^2}{2}\right)$$
This clearly factors into the product of the marginal Gaussian PDFs:
$$f_{X,Y}(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma_X} \exp\!\left(-\frac{(x - m_X)^2}{2\sigma_X^2}\right) \frac{1}{\sqrt{2\pi}\,\sigma_Y} \exp\!\left(-\frac{(y - m_Y)^2}{2\sigma_Y^2}\right) = f_X(x)\, f_Y(y)$$

While Example 4.12 demonstrated that this property does not hold for all random variables, it is true for Gaussian random variables. This allows us to give a stronger interpretation to the correlation coefficient when dealing with Gaussian random variables. Previously, it was stated that the correlation coefficient is a quantitative measure of the amount of correlation between two variables. While this is true, it is a rather vague statement. In the case of Gaussian random variables, we can make the connection between correlation and statistical dependence: for jointly Gaussian random variables, the correlation coefficient can indeed be viewed as a quantitative measure of statistical dependence.

4.2 Multiple Random Variables

In many applications, it is necessary to deal with a large number of random variables, and often the number of variables can be arbitrary. We therefore extend the concepts developed previously for single random variables and pairs of random variables to allow for an arbitrary number of random variables. A common example is multidimensional Gaussian random variables; most non-Gaussian random variables are difficult to deal with in many dimensions. One of the main goals here is to develop a vector/matrix notation which will allow us to represent potentially large sequences of random variables with a compact notation.

4.2.1 Vector Random Variables

The notion of a random variable is easily generalized to the case where several quantities are of interest.

Definition 4.14.
Vector random variables: A vector random variable $\mathbf{X}$ is a function that assigns a vector of real numbers to each outcome $\zeta$ in $S$, the sample space of the random experiment. We use uppercase boldface notation for vector random variables. By convention $\mathbf{X}$ is a column vector ($n$ rows by 1 column), so the vector random variable with components $X_1, X_2, \dots, X_n$ corresponds to
$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} = [X_1, X_2, \dots, X_n]^T \qquad (4.39)$$
where $T$ denotes the transpose of a matrix or vector. Possible values of the vector random variable are denoted by $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$, where $x_i$ corresponds to the value of $X_i$.

Example 4.14: Samples of an Audio Signal
Let the outcome of a random experiment be an audio signal $X(t)$. Let the random variable $X_k = X(kT)$ be the sample of the signal taken at time $kT$. An MP3 codec processes the audio in blocks of $n$ samples $\mathbf{X} = [X_1, X_2, \dots, X_n]^T$. $\mathbf{X}$ is a vector random variable.

Each event $A$ involving $\mathbf{X} = [X_1, X_2, \dots, X_n]^T$ has a corresponding region in an $n$-dimensional real space $R^n$. As before, we use "rectangular" product-form sets in $R^n$ as building blocks. For the $n$-dimensional random variable $\mathbf{X} = [X_1, X_2, \dots, X_n]^T$ we are interested in events that have the product form:
$$A = \{X_1 \text{ in } A_1\} \cap \{X_2 \text{ in } A_2\} \cap \dots \cap \{X_n \text{ in } A_n\} \qquad (4.40)$$
where each $A_k$ is a one-dimensional event (i.e., a subset of the real line) that involves $X_k$ only. The event $A$ occurs when all of the events $\{X_k \text{ in } A_k\}$ occur jointly. We are interested in obtaining the probabilities of these product-form events:
$$P(A) = P(\mathbf{X} \in \mathbf{A}) = P(\{X_1 \text{ in } A_1\} \cap \{X_2 \text{ in } A_2\} \cap \dots \cap \{X_n \text{ in } A_n\}) \qquad (4.41)$$
$$\triangleq P(X_1 \text{ in } A_1,\ X_2 \text{ in } A_2,\ \dots,\ X_n \text{ in } A_n) \qquad (4.42)$$
In principle, this probability is obtained by finding the probability of the equivalent event in the underlying sample space, that is,
$$P(A) = P(\{\zeta \text{ in } S : \mathbf{X}(\zeta) \text{ in } \mathbf{A}\}) \qquad (4.43)$$
$$= P(\{\zeta \text{ in } S : X_1(\zeta) \in A_1,\ X_2(\zeta) \in A_2,\ \dots,\ X_n(\zeta) \in A_n\}) \qquad (4.44)$$

4.2.2 Joint and Conditional PMFs, CDFs and PDFs

The concepts of PMF, CDF, and PDF are easily extended to an arbitrary number of random variables.

Definition 4.15. For a vector of $N$ random variables $\mathbf{X} = [X_1, X_2, \dots, X_N]^T$, with possible values $\mathbf{x} = [x_1, x_2, \dots, x_N]^T$, the joint PMF, CDF, and PDF are given, respectively, by:
$$P_{\mathbf{X}}(\mathbf{x}) = P_{X_1, X_2, \dots, X_N}(x_1, x_2, \dots, x_N) = P(X_1 = x_1,\ X_2 = x_2,\ \dots,\ X_N = x_N) \qquad (4.45)$$
$$F_{\mathbf{X}}(\mathbf{x}) = F_{X_1, X_2, \dots, X_N}(x_1, x_2, \dots, x_N) = P(X_1 \le x_1,\ X_2 \le x_2,\ \dots,\ X_N \le x_N) \qquad (4.46)$$
$$f_{\mathbf{X}}(\mathbf{x}) = f_{X_1, X_2, \dots, X_N}(x_1, x_2, \dots, x_N) = \frac{\partial^N}{\partial x_1\, \partial x_2 \cdots \partial x_N} F_{X_1, X_2, \dots, X_N}(x_1, x_2, \dots, x_N) \qquad (4.47)$$
Marginal CDFs can be found for a subset of the variables by evaluating the joint CDF at infinity for the unwanted variables. For example, if we are only interested in a subset $\{X_1, X_2, \dots, X_M\}$ of $\mathbf{X} = [X_1, X_2, \dots, X_N]^T$, where $N \ge M$:
$$F_{X_1, \dots, X_M}(x_1, \dots, x_M) = F_{X_1, \dots, X_N}(x_1, \dots, x_M, \infty, \infty, \dots, \infty) \qquad (4.48)$$
Marginal PDFs are found from the joint PDF by integrating out the unwanted variables. Similarly, marginal PMFs are obtained from the joint PMF by summing out the unwanted variables:
$$f_{X_1, \dots, X_M}(x_1, \dots, x_M) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f_{X_1, \dots, X_N}(x_1, \dots, x_N)\,dx_{M+1}\,dx_{M+2}\cdots dx_N \qquad (4.49)$$
$$P_{X_1, \dots, X_M}(x_1, \dots, x_M) = \sum_{x_{M+1}}\sum_{x_{M+2}}\cdots\sum_{x_N} P_{X_1, \dots, X_N}(x_1, \dots, x_N) \qquad (4.50)$$
Similar to what was done for pairs of random variables, we can also establish conditional PMFs and PDFs.

Definition 4.16.
For a set of $N$ random variables $X_1, X_2, \dots, X_N$, the conditional PMF and PDF of $X_1, X_2, \dots, X_M$ conditioned on $X_{M+1}, X_{M+2}, \dots, X_N$ are given by
$$P_{X_1, \dots, X_M | X_{M+1}, \dots, X_N}(x_1, \dots, x_M | x_{M+1}, \dots, x_N) = \frac{P(X_1 = x_1,\ X_2 = x_2,\ \dots,\ X_N = x_N)}{P(X_{M+1} = x_{M+1},\ \dots,\ X_N = x_N)} \qquad (4.51)$$
$$f_{X_1, \dots, X_M | X_{M+1}, \dots, X_N}(x_1, \dots, x_M | x_{M+1}, \dots, x_N) = \frac{f_{X_1, \dots, X_N}(x_1, \dots, x_N)}{f_{X_{M+1}, \dots, X_N}(x_{M+1}, \dots, x_N)} \qquad (4.52)$$
Using conditional PDFs, many interesting factorization results can be established for joint PDFs involving multiple random variables. For example, consider four random variables $X_1, X_2, X_3, X_4$:
$$f_{X_1, X_2, X_3, X_4}(x_1, x_2, x_3, x_4) = f_{X_1 | X_2, X_3, X_4}(x_1 | x_2, x_3, x_4)\, f_{X_2, X_3, X_4}(x_2, x_3, x_4)$$
$$= f_{X_1 | X_2, X_3, X_4}(x_1 | x_2, x_3, x_4)\, f_{X_2 | X_3, X_4}(x_2 | x_3, x_4)\, f_{X_3, X_4}(x_3, x_4)$$
$$= f_{X_1 | X_2, X_3, X_4}(x_1 | x_2, x_3, x_4)\, f_{X_2 | X_3, X_4}(x_2 | x_3, x_4)\, f_{X_3 | X_4}(x_3 | x_4)\, f_{X_4}(x_4)$$
Almost endless other possibilities exist as well.

Definition 4.17. A set of $N$ random variables is statistically independent if any subset of the random variables is independent of any other disjoint subset. In particular, any joint PDF of $M \le N$ variables should factor into a product of the corresponding marginal PDFs.

As an example, consider three random variables $X$, $Y$, $Z$. For these three random variables to be independent, we must have each pair independent. This implies that:
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y), \qquad f_{X,Z}(x, z) = f_X(x)\, f_Z(z), \qquad f_{Y,Z}(y, z) = f_Y(y)\, f_Z(z) \qquad (4.53)$$
In addition, the joint PDF of all three must also factor into a product of the marginals:
$$f_{X,Y,Z}(x, y, z) = f_X(x)\, f_Y(y)\, f_Z(z) \qquad (4.54)$$
Note that all three conditions in Equation 4.53 follow directly from the single condition in Equation 4.54. Hence, Equation 4.54 is a necessary and sufficient condition for three variables to be statistically independent. Naturally, this result can be extended to any number of variables. That is, the elements of a random vector $\mathbf{X} = [X_1, X_2, \dots, X_N]^T$ are independent if
$$f_{\mathbf{X}}(\mathbf{x}) = \prod_{n=1}^{N} f_{X_n}(x_n) \qquad (4.55)$$

4.2.3 Expectations Involving Multiple Random Variables

For a vector of random variables $\mathbf{X} = [X_1, X_2, \dots, X_N]^T$, we can construct a corresponding mean vector that is a column vector of the same dimension and whose components are the means of the elements of $\mathbf{X}$. Mathematically, $\mathbf{m} = E[\mathbf{X}] = [E[X_1], E[X_2], \dots, E[X_N]]^T$. Two other important quantities associated with the random vector are the correlation and covariance matrices.

Definition 4.18. For a random vector $\mathbf{X} = [X_1, X_2, \dots, X_N]^T$, the correlation matrix is defined as $\mathbf{R}_{XX} = E[\mathbf{X}\mathbf{X}^T]$. That is, the $(i, j)$th element of the $N \times N$ matrix $\mathbf{R}_{XX}$ is $E[X_i X_j]$. Similarly, the covariance matrix is defined as $\mathbf{C}_{XX} = E[(\mathbf{X} - \mathbf{m})(\mathbf{X} - \mathbf{m})^T]$, so that the $(i, j)$th element of $\mathbf{C}_{XX}$ is $COV(X_i, X_j)$.

Theorem 4.7
Correlation matrices and covariance matrices are symmetric and positive definite.

Proof. Recall that a square matrix $\mathbf{R}_{XX}$ is symmetric if $\mathbf{R}_{XX} = \mathbf{R}_{XX}^T$. Equivalently, the $(i, j)$th element must be the same as the $(j, i)$th element. This is clearly the case here, since $E[X_i X_j] = E[X_j X_i]$. Recall that the matrix is positive definite if $\mathbf{z}^T \mathbf{R}_{XX} \mathbf{z} > 0$ for any vector $\mathbf{z}$ such that $\|\mathbf{z}\| > 0$:
$$\mathbf{z}^T \mathbf{R}_{XX} \mathbf{z} = \mathbf{z}^T E[\mathbf{X}\mathbf{X}^T] \mathbf{z} = E[\mathbf{z}^T \mathbf{X}\mathbf{X}^T \mathbf{z}] = E[(\mathbf{z}^T \mathbf{X})^2] \qquad (4.56)$$
Note that $\mathbf{z}^T \mathbf{X}$ is a scalar random variable (a linear combination of the components of $\mathbf{X}$).
Since the second moment of any random variable is positive (except for the pathological case of a random variable which is identically equal to zero), the correlation matrix is positive definite. As an aside, this also implies that the eigenvalues of the correlation matrix are all positive. Identical steps can be followed to prove that the same properties hold for the covariance matrix.

Next, consider a linear transformation of a vector random variable. That is, create a new set of $M$ random variables $\mathbf{Y} = [Y_1, Y_2, \dots, Y_M]^T$ according to:
$$Y_1 = a_{1,1}X_1 + a_{1,2}X_2 + \dots + a_{1,N}X_N + b_1$$
$$Y_2 = a_{2,1}X_1 + a_{2,2}X_2 + \dots + a_{2,N}X_N + b_2$$
$$\vdots$$
$$Y_M = a_{M,1}X_1 + a_{M,2}X_2 + \dots + a_{M,N}X_N + b_M \qquad (4.57)$$
The number of new variables, $M$, does not have to be the same as the number of original variables, $N$. To write this type of linear transformation in a compact fashion, define a matrix $\mathbf{A}$ whose $(i, j)$th element is the coefficient $a_{i,j}$, and a column vector $\mathbf{b} = [b_1, b_2, \dots, b_M]^T$. Then the linear transformation of Equation 4.57 is written in vector/matrix form as $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$. The next theorem describes the relationship between the means and the correlation and covariance matrices of $\mathbf{X}$ and $\mathbf{Y}$.

Theorem 4.8
For a linear transformation of vector random variables of the form $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$, the means of $\mathbf{X}$ and $\mathbf{Y}$ are related by
$$\mathbf{m}_Y = \mathbf{A}\mathbf{m}_X + \mathbf{b} \qquad (4.58)$$
Also, the correlation matrices of $\mathbf{X}$ and $\mathbf{Y}$ are related by:
$$\mathbf{R}_{YY} = \mathbf{A}\mathbf{R}_{XX}\mathbf{A}^T + \mathbf{A}\mathbf{m}_X\mathbf{b}^T + \mathbf{b}\mathbf{m}_X^T\mathbf{A}^T + \mathbf{b}\mathbf{b}^T \qquad (4.59)$$
and the covariance matrices of $\mathbf{X}$ and $\mathbf{Y}$ are related by:
$$\mathbf{C}_{YY} = \mathbf{A}\mathbf{C}_{XX}\mathbf{A}^T \qquad (4.60)$$
Proof. For the mean vector,
$$\mathbf{m}_Y = E[\mathbf{Y}] = E[\mathbf{A}\mathbf{X} + \mathbf{b}] = \mathbf{A}E[\mathbf{X}] + \mathbf{b} = \mathbf{A}\mathbf{m}_X + \mathbf{b} \qquad (4.61)$$
Similarly, for the correlation matrix,
$$\mathbf{R}_{YY} = E[\mathbf{Y}\mathbf{Y}^T] = E[(\mathbf{A}\mathbf{X} + \mathbf{b})(\mathbf{A}\mathbf{X} + \mathbf{b})^T] = E[\mathbf{A}\mathbf{X}\mathbf{X}^T\mathbf{A}^T] + E[\mathbf{b}\mathbf{X}^T\mathbf{A}^T] + E[\mathbf{A}\mathbf{X}\mathbf{b}^T] + E[\mathbf{b}\mathbf{b}^T]$$
$$= \mathbf{A}\mathbf{R}_{XX}\mathbf{A}^T + \mathbf{b}\mathbf{m}_X^T\mathbf{A}^T + \mathbf{A}\mathbf{m}_X\mathbf{b}^T + \mathbf{b}\mathbf{b}^T \qquad (4.62)$$
To prove the result for the covariance matrix, write $\mathbf{Y} - \mathbf{m}_Y$ as
$$\mathbf{Y} - \mathbf{m}_Y = (\mathbf{A}\mathbf{X} + \mathbf{b}) - (\mathbf{A}\mathbf{m}_X + \mathbf{b}) = \mathbf{A}(\mathbf{X} - \mathbf{m}_X) \qquad (4.63)$$
Then,
$$\mathbf{C}_{YY} = E[(\mathbf{Y} - \mathbf{m}_Y)(\mathbf{Y} - \mathbf{m}_Y)^T] = E[\mathbf{A}(\mathbf{X} - \mathbf{m}_X)(\mathbf{X} - \mathbf{m}_X)^T\mathbf{A}^T] = \mathbf{A}E[(\mathbf{X} - \mathbf{m}_X)(\mathbf{X} - \mathbf{m}_X)^T]\mathbf{A}^T = \mathbf{A}\mathbf{C}_{XX}\mathbf{A}^T \qquad (4.64)$$
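Theorem 4.8 is easy to confirm empirically: generate samples of $\mathbf{X}$ with a known covariance, transform them, and compare the sample covariance of $\mathbf{Y}$ with $\mathbf{A}\mathbf{C}_{XX}\mathbf{A}^T$. The sketch below (an added illustration with an arbitrarily chosen $\mathbf{A}$, $\mathbf{b}$, and $\mathbf{C}_{XX}$, not from the notes) does exactly that:

```python
# Empirical check of Theorem 4.8: C_YY = A C_XX A^T for Y = AX + b.
import numpy as np

rng = np.random.default_rng(4)
C_XX = np.array([[2.0, 0.3, 0.0],
                 [0.3, 1.0, -0.4],
                 [0.0, -0.4, 0.5]])       # a valid (positive definite) covariance
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])          # M = 2 new variables from N = 3
b = np.array([5.0, -1.0])

X = rng.multivariate_normal(np.zeros(3), C_XX, size=500_000).T   # shape 3 x n
Y = A @ X + b[:, None]

print(A @ C_XX @ A.T)                     # theoretical C_YY
print(np.cov(Y))                          # sample estimate, close to the above
```

Note that the shift $\mathbf{b}$ drops out of the covariance, exactly as Equation 4.60 predicts.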
4.2.4 Multi-Dimensional Gaussian Random Variables

Recall from the study of two-dimensional random variables that the functional form of the joint Gaussian PDF was fairly complicated. It would seem that the prospects of forming a joint Gaussian PDF for an arbitrary number of dimensions are grim. However, the vector/matrix notation developed in the previous sections makes this task manageable and, in fact, the resulting joint Gaussian PDF is quite simple.

Definition 4.19. The joint Gaussian PDF for a vector of $N$ random variables $\mathbf{X}$, with mean vector $\mathbf{m}_X$ and covariance matrix $\mathbf{C}_{XX}$, is given by
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^N \det(\mathbf{C}_{XX})}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \mathbf{m}_X)^T \mathbf{C}_{XX}^{-1} (\mathbf{x} - \mathbf{m}_X)\right) \qquad (4.65)$$

Example 4.15
To demonstrate the use of this matrix notation, suppose $\mathbf{X}$ is a two-element vector and the mean vector and covariance matrix are given by their general forms:
$$\mathbf{m}_X = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \text{and} \qquad \mathbf{C}_{XX} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$
The determinant of the covariance matrix is
$$\det(\mathbf{C}_{XX}) = \sigma_1^2\sigma_2^2 - (\rho\sigma_1\sigma_2)^2 = \sigma_1^2\sigma_2^2(1 - \rho^2)$$
while the inverse is
$$\mathbf{C}_{XX}^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1 - \rho^2)}\begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix} = \frac{1}{1 - \rho^2}\begin{bmatrix} \sigma_1^{-2} & -\rho\sigma_1^{-1}\sigma_2^{-1} \\ -\rho\sigma_1^{-1}\sigma_2^{-1} & \sigma_2^{-2} \end{bmatrix}$$
The quadratic form in the exponent then works out to be
$$(\mathbf{x} - \mathbf{m}_X)^T \mathbf{C}_{XX}^{-1} (\mathbf{x} - \mathbf{m}_X) = \frac{\left(\frac{x_1 - m_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1 - m_1}{\sigma_1}\right)\left(\frac{x_2 - m_2}{\sigma_2}\right) + \left(\frac{x_2 - m_2}{\sigma_2}\right)^2}{1 - \rho^2}$$
Plugging all these results into the general form for the joint Gaussian PDF gives
$$f_{X_1, X_2}(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^2\sigma_1^2\sigma_2^2(1 - \rho^2)}}\exp\!\left(-\frac{\left(\frac{x_1 - m_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1 - m_1}{\sigma_1}\right)\left(\frac{x_2 - m_2}{\sigma_2}\right) + \left(\frac{x_2 - m_2}{\sigma_2}\right)^2}{2(1 - \rho^2)}\right) \qquad (4.66)$$
This is exactly the form of the two-dimensional joint Gaussian PDF defined in Equation 4.36.

Example 4.16
Suppose $X_1, X_2, \dots, X_n$ are jointly Gaussian random variables with $COV(X_i, X_j) = 0$ for $i \ne j$. Show that $X_1, X_2, \dots, X_n$ are independent random variables.

Solution. Since $COV(X_i, X_j) = 0$ for all $i \ne j$, all of the off-diagonal elements of the covariance matrix of $\mathbf{X}$ are zero. In other words, $\mathbf{C}_{XX}$ is a diagonal matrix of the general form:
$$\mathbf{C}_{XX} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{bmatrix}$$
The determinant of a diagonal matrix is the product of the diagonal entries, so in this case $\det(\mathbf{C}_{XX}) = \sigma_1^2\sigma_2^2\cdots\sigma_N^2$. The inverse is also trivial to compute: a diagonal matrix with entries $\sigma_1^{-2}, \sigma_2^{-2}, \dots, \sigma_N^{-2}$. The quadratic form that appears in the exponent of the Gaussian PDF becomes
$$(\mathbf{x} - \mathbf{m}_X)^T \mathbf{C}_{XX}^{-1} (\mathbf{x} - \mathbf{m}_X) = \sum_{n=1}^{N}\left(\frac{x_n - m_n}{\sigma_n}\right)^2$$
The joint Gaussian PDF for a vector of uncorrelated random variables is then
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^N\sigma_1^2\sigma_2^2\cdots\sigma_N^2}}\exp\!\left(-\frac{1}{2}\sum_{n=1}^{N}\left(\frac{x_n - m_n}{\sigma_n}\right)^2\right) = \prod_{n=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left(-\frac{(x_n - m_n)^2}{2\sigma_n^2}\right)$$
This shows that for any number of uncorrelated Gaussian random variables, the joint PDF factors into the product of marginal PDFs, and hence uncorrelated Gaussian random variables are independent. This is a generalization of the same result for two Gaussian random variables.
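Equation 4.65 can be evaluated directly with a few lines of linear algebra and compared against a library implementation. The sketch below (an added check with an arbitrarily chosen mean vector and covariance matrix, not from the notes) does this with scipy's `multivariate_normal`:

```python
# Direct evaluation of Equation 4.65 vs scipy's multivariate normal density.
import numpy as np
from scipy.stats import multivariate_normal

m = np.array([1.0, -1.0, 0.5])
C = np.array([[1.0, 0.2, 0.0],
              [0.2, 2.0, 0.3],
              [0.0, 0.3, 0.5]])            # positive definite covariance

def gaussian_pdf(x, m, C):
    """Joint Gaussian PDF of Equation 4.65."""
    d = x - m
    N = len(m)
    norm_const = np.sqrt((2 * np.pi) ** N * np.linalg.det(C))
    return np.exp(-0.5 * d @ np.linalg.solve(C, d)) / norm_const

x = np.array([0.5, 0.0, 1.0])
print(gaussian_pdf(x, m, C))
print(multivariate_normal(mean=m, cov=C).pdf(x))   # should match
```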
Further Reading

1. Scott L. Miller, Donald Childers, Probability and Random Processes: With Applications to Signal Processing and Communications, Elsevier, 2012: sections 5.1 to 5.7 and 6.1 to 6.3.
2. Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering, 3rd ed., Pearson, 2007: sections 5.1 to 5.9 and 6.1 to 6.4.
3. Charles W. Therrien, Probability for Electrical and Computer Engineers, CRC Press, 2004: chapter 5.

5 Random Sums and Sequences

Many problems involve the counting of the number of occurrences of events, the measurement of cumulative effects, or the computation of arithmetic averages in a series of measurements. Usually these problems can be reduced to the problem of finding, exactly or approximately, the distribution of a random variable that consists of the sum of $n$ independent, identically distributed random variables. In this chapter, we investigate sums of random variables and their properties as $n$ becomes large.

5.1 Independent and Identically Distributed Random Variables

In many applications, we are able to observe an experiment repeatedly. Each new observation can occur with an independent realization of whatever random phenomena control the experiment. This sort of situation gives rise to independent and identically distributed (IID or i.i.d.) random variables.

Definition 5.1. Independent and identically distributed: A sequence of random variables $X_1, X_2, \dots, X_n$ is IID if
$$F_{X_i}(x) = F_X(x) \qquad \forall i = 1, 2, \dots, n \qquad (5.1)$$
and
$$F_{X_1, X_2, \dots, X_n}(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i) \qquad (5.2)$$
For continuous random variables, the CDFs can be replaced with PDFs in Equations 5.1 and 5.2, while for discrete random variables, the CDFs can be replaced by PMFs.

Suppose, for example, we wish to measure the voltage produced by a certain sensor. The sensor might be measuring the relative humidity outside; it converts the humidity to a voltage level which we can then easily measure. However, as with any measuring equipment, the voltage we measure is random due to noise generated in the sensor as well as in the measuring equipment. Suppose the voltage we measure is represented by a random variable $X$ given by $X = v(h) + N$, where $v(h)$ is the true voltage that should be presented by the sensor when the humidity is $h$, and $N$ is the noise in the measurement. Assuming that the noise is zero-mean, $E[X] = v(h)$. That is, on the average, the measurement will be equal to the true voltage $v(h)$. Furthermore, if the variance of the noise is sufficiently small, then the measurement will tend to be close to the true value we are trying to measure. But what if the variance is not small? Then the noise will tend to distort our measurement, making our system unreliable. In such a case, we might be able to improve our measurement system by taking several measurements, which allows us to "average out" the effects of the noise.

Suppose we have the ability to make several measurements and observe a sequence of measurements $X_1, X_2, \dots, X_n$. It is reasonable to expect that the noise that corrupts a given measurement has the same distribution each time (and hence the $X_i$ are identically distributed) and is independent of the noise in any other measurement (so that the $X_i$ are independent). The $n$ measurements then form a sequence of IID random variables. A fundamental question is then: how do we process an IID sequence to extract the desired information from it? In the preceding case, the parameter of interest, $v(h)$, happens to be the mean of the distribution of the $X_i$. This turns out to be a fairly common problem, and we address it in the following sections.

5.2 Mean and Variance of Sums of Random Variables

Let $X_1, X_2, \dots, X_n$ be a sequence of random variables, and let $S_n$ be their sum:
$$S_n = X_1 + X_2 + \dots + X_n$$
It was shown in Section 3.2.3 that, regardless of the statistical dependence of the $X_i$, the expected value of a sum of $n$ random variables is equal to the sum of the expected values:
$$E[X_1 + X_2 + \dots + X_n] = E[X_1] + E[X_2] + \dots + E[X_n]$$
Thus knowledge of the means of the $X_i$ suffices to find the mean of $S_n$. The following example shows that in order to compute the variance of a sum of random variables, we need to know the variances and covariances of the $X_i$.

Example 5.1
Find the variance of $Z = X + Y$.

Solution.
The variance of $Z$ is:
$$VAR[Z] = E[(Z - E[Z])^2] = E[(X + Y - E[X] - E[Y])^2] = E\big[\big((X - E[X]) + (Y - E[Y])\big)^2\big]$$
$$= E[(X - E[X])^2 + (Y - E[Y])^2 + (X - E[X])(Y - E[Y]) + (Y - E[Y])(X - E[X])]$$
$$= VAR[X] + VAR[Y] + COV(X, Y) + COV(Y, X) = VAR[X] + VAR[Y] + 2\,COV(X, Y)$$
In general, the covariance $COV(X, Y)$ is not equal to zero, so the variance of a sum is not necessarily equal to the sum of the individual variances.

The result in Example 5.1 can be generalized to the case of $n$ random variables:
$$VAR[X_1 + X_2 + \dots + X_n] = E\left[\sum_{j=1}^{n}(X_j - E[X_j]) \sum_{k=1}^{n}(X_k - E[X_k])\right] = \sum_{j=1}^{n}\sum_{k=1}^{n} E[(X_j - E[X_j])(X_k - E[X_k])]$$
$$= \sum_{k=1}^{n} VAR[X_k] + \sum_{j=1}^{n}\sum_{\substack{k=1 \\ j \ne k}}^{n} COV(X_j, X_k) \qquad (5.3)$$
Thus in general, the variance of a sum of random variables is not equal to the sum of the individual variances.

An important special case is when the $X_j$ are independent random variables. If $X_1, X_2, \dots, X_n$ are independent, then $COV(X_j, X_k) = 0$ for $j \ne k$ and:
$$VAR[X_1 + X_2 + \dots + X_n] = \sum_{k=1}^{n} VAR[X_k] \qquad (5.4)$$
Now suppose $X_1, X_2, \dots, X_n$ are $n$ IID random variables, each with mean $m$ and variance $\sigma^2$. Then the sum of the $X_i$, $S_n$, has the following mean:
$$E[S_n] = E[X_1] + E[X_2] + \dots + E[X_n] = nm \qquad (5.5)$$
The covariance of pairs of independent random variables is zero, so:
$$VAR[S_n] = \sum_{k=1}^{n} VAR[X_k] = n\,VAR[X_i] = n\sigma^2 \qquad (5.6)$$

5.3 The Sample Mean

Definition 5.2. Sample mean: Let $X$ be a random variable for which the mean, $E[X] = m$, is unknown. Let $X_1, X_2, \dots, X_n$ denote $n$ independent, repeated measurements of $X$; i.e. the $X_i$ are IID random variables with the same PDF as $X$. The sample mean of the sequence is used to estimate $E[X]$:
$$M_n = \frac{1}{n}\sum_{j=1}^{n} X_j \qquad (5.7)$$
The sample variance is then defined as:
$$\sigma_n^2 = \frac{1}{n}\sum_{j=1}^{n}(X_j - M_n)^2 \qquad (5.8)$$
The sample mean is itself a random variable, so it will exhibit random variation. Our aim is to verify whether $M_n$ can be a good estimator of $E[X] = m$. A good estimator is expected to have the following two properties:

1. On the average, it should give the correct expected value (with no bias): $E[M_n] = m$.
2. It should not vary too much about the correct value of this parameter; that is, $E[(M_n - m)^2]$ (the variance) should be small.

The expected value of the sample mean is given by:
$$E[M_n] = E\left[\frac{1}{n}\sum_{j=1}^{n} X_j\right] = \frac{1}{n}\sum_{j=1}^{n} E[X_j] = m \qquad (5.9)$$
since $E[X_j] = E[X] = m$ for all $j$. Thus the sample mean is equal to $E[X] = m$ on the average. For this reason, we say that the sample mean is an unbiased estimator for $m$. The mean square error of the sample mean about $m$ is equal to the variance of $M_n$, that is,
$$E[(M_n - m)^2] = E[(M_n - E[M_n])^2] \qquad (5.10)$$
Note that $M_n = S_n / n$, where $S_n = X_1 + X_2 + \dots + X_n$. From Equation 5.6, $VAR[S_n] = n\sigma^2$, since the $X_j$ are IID random variables. Thus
$$VAR[M_n] = \frac{1}{n^2} VAR[S_n] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \qquad (5.11)$$
Therefore the variance of the sample mean approaches zero as the number of samples is increased. This implies that the probability that the sample mean is close to the true mean approaches one as $n$ becomes very large.
We can formalize this statement by using the Chebyshev inequality from Equation 3.127:

$$P(|M_n - E[M_n]| \geq \varepsilon) \leq \frac{VAR[M_n]}{\varepsilon^2} \tag{5.12}$$

Substituting for $E[M_n]$ and $VAR[M_n]$, we obtain

$$P(|M_n - m| \geq \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2} \tag{5.13}$$

If we consider the complement, we obtain

$$P(|M_n - m| < \varepsilon) \geq 1 - \frac{\sigma^2}{n\varepsilon^2} \tag{5.14}$$

Thus for any choice of error $\varepsilon$ and probability $1 - \delta$, we can select the number of samples $n$ so that $M_n$ is within $\varepsilon$ of the true mean with probability $1 - \delta$ or greater. The following example illustrates this.

Example 5.2
A voltage of constant but unknown value is to be measured. Each measurement $X_j$ is actually the sum of the desired voltage $v$ and a noise voltage $N_j$ of zero mean and standard deviation of 1 microvolt ($\mu V$):

$$X_j = v + N_j$$

Assume that the noise voltages are independent random variables. How many measurements are required so that the probability that $M_n$ is within $\varepsilon = 1\,\mu V$ of the true mean is at least 0.99?

Solution. Each measurement $X_j$ has mean $v$ and variance 1, so from Equation 5.14 we require that $n$ satisfy:

$$1 - \frac{\sigma^2}{n\varepsilon^2} = 1 - \frac{1}{n} = 0.99$$

This implies that $n = 100$. Thus if we were to repeat the measurement 100 times and compute the sample mean, then, on average, at least 99 times out of 100 the resulting sample mean would be within $1\,\mu V$ of the true mean.

5.4 Laws of Large Numbers

Note that if we let $n$ approach infinity in Equation 5.14, we obtain

$$\lim_{n \to \infty} P(|M_n - m| < \varepsilon) = 1 \tag{5.15}$$

Equation 5.14 requires that the $X_j$ have finite variance, but it can be shown that this limit holds even if the variance of the $X_j$ does not exist.

Theorem 5.1: Weak Law of Large Numbers
Let $X_1, X_2, \ldots$ be a sequence of IID random variables with finite mean $E[X] = m$. Then, for $\varepsilon > 0$,

$$\lim_{n \to \infty} P(|M_n - m| < \varepsilon) = 1 \tag{5.16}$$

The weak law of large numbers states that for a large enough fixed value of $n$, the sample mean using $n$ samples will be close to the true mean with high probability. It does not address the question of what happens to the sample mean as a function of $n$ as we make additional measurements. This question is taken up by the strong law of large numbers.

Suppose we make a series of independent measurements of the same random variable. Let $X_1, X_2, \ldots$ be the resulting sequence of IID random variables with mean $m$. Now consider the sequence of sample means that results from the above measurements: $M_1, M_2, \ldots$, where $M_j$ is the sample mean computed using $X_1$ through $X_j$. We expect that with high probability, each particular sequence of sample means approaches $m$ and stays there:

$$P\Big(\lim_{n \to \infty} M_n = m\Big) = 1 \tag{5.17}$$

that is, with virtual certainty, every sequence of sample mean calculations converges to the true mean of the quantity. (The proof of this result is beyond the level of this unit.)

Theorem 5.2: Strong Law of Large Numbers
Let $X_1, X_2, \ldots$ be a sequence of IID random variables with finite mean $E[X] = m$ and finite variance. Then

$$P\Big(\lim_{n \to \infty} M_n = m\Big) = 1 \tag{5.18}$$

Equation 5.18 appears similar to Equation 5.16, but in fact it makes a dramatically different statement. It states that with probability 1, every sequence of sample mean calculations will eventually approach and stay close to $E[X] = m$. This is the type of convergence we expect in physical situations where statistical regularity holds.
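The difference between the two laws is easiest to appreciate by watching individual sample-mean trajectories. The following sketch assumes exponential samples with mean $m = 2$ (an arbitrary choice) and computes the running sample means $M_1, M_2, \ldots, M_n$ for a few independent experiments; each trajectory settles onto $m$ and stays there, as the strong law predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2.0, 100_000   # hypothetical true mean and sequence length

for trial in range(3):
    x = rng.exponential(scale=m, size=n)
    # running sample means M_1, M_2, ..., M_n
    running = np.cumsum(x) / np.arange(1, n + 1)
    print(f"trial {trial}: M_100 = {running[99]:.4f}, "
          f"M_10000 = {running[9999]:.4f}, M_100000 = {running[-1]:.4f}")
```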
Although under certain conditions the theory predicts the convergence of sample means to expected values, there are still gaps between the mathematical theory and the real world: we can never actually carry out an infinite number of measurements and compute an infinite number of sample means. Nevertheless, the strong law of large numbers demonstrates the remarkable consistency between the theory and the observed physical behaviour.

Note that the relative frequencies discussed in previous chapters are special cases of sample averages. If we apply the weak law of large numbers to the relative frequency of an event $A$, $f_A(n)$, in a sequence of independent repetitions of a random experiment, we obtain

$$\lim_{n \to \infty} P(|f_A(n) - P(A)| < \varepsilon) = 1 \tag{5.19}$$

If we apply the strong law of large numbers, we obtain:

$$P\Big(\lim_{n \to \infty} f_A(n) = P(A)\Big) = 1 \tag{5.20}$$

Example 5.3
In order to estimate the probability of an event $A$, a sequence of Bernoulli trials is carried out and the relative frequency of $A$ is observed. How large should $n$ be in order to have a 0.95 probability that the relative frequency is within 0.01 of $p = P(A)$?

Solution. Let $X = I_A$ be the indicator function of $A$. From Equations 3.45 and 3.46, the mean of $X$ is $m = p$ and the variance is $\sigma^2 = p(1-p)$. Since $p$ is unknown, $\sigma^2$ is also unknown. However, it is easy to show that $p(1-p)$ is at most $1/4$ for $0 \leq p \leq 1$. Therefore, by Equation 5.13,

$$P(|f_A(n) - p| \geq \varepsilon) \leq \frac{\sigma^2}{n\varepsilon^2} \leq \frac{1}{4n\varepsilon^2}$$

The desired accuracy is $\varepsilon = 0.01$, and the desired probability gives:

$$1 - 0.95 = \frac{1}{4n\varepsilon^2}$$

Solving for $n$, we obtain $n = 50{,}000$. It has already been pointed out that the Chebyshev inequality gives very loose bounds, so we expect that this value for $n$ is probably overly conservative.
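A quick simulation supports this suspicion. The sketch below makes the arbitrary assumption $p = 0.3$ and estimates, by repetition, the probability that the relative frequency lands within 0.01 of $p$ for several values of $n$ much smaller than the Chebyshev bound demands.

```python
import numpy as np

rng = np.random.default_rng(3)
p, eps = 0.3, 0.01          # hypothetical P(A) and accuracy target
trials = 20_000             # repetitions used to estimate the probability

for n in [5_000, 10_000, 50_000]:
    # relative frequency of A in n Bernoulli trials, repeated many times
    f_A = rng.binomial(n, p, size=trials) / n
    hit = np.mean(np.abs(f_A - p) < eps)
    print(f"n={n:6d}  P(|f_A - p| < {eps}) ~= {hit:.3f}")
```

For this choice of $p$, around $n = 10{,}000$ trials already achieve the 0.95 target, so the Chebyshev-based figure of 50,000 is indeed very conservative.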
In the next section, we present a better estimate for the required value of $n$.

5.5 The Central Limit Theorem

Probably the most important result dealing with sums of random variables is the central limit theorem, which states that under some mild conditions, such sums converge in distribution to a Gaussian random variable. This result provides the basis for many theoretical models of random phenomena, and it explains why the Gaussian random variable appears in so many diverse applications. In nature, many macroscopic phenomena result from the addition of numerous independent, microscopic processes; this gives rise to the Gaussian random variable. In many man-made problems, we are interested in averages that often consist of the sum of independent random variables. This again gives rise to the Gaussian random variable.

Let $X_1, X_2, \ldots$ be a sequence of IID random variables with finite mean $E[X] = m$ and finite variance $\sigma^2$, and let $S_n$ be the sum of the first $n$ random variables in the sequence. We know from Equations 5.5 and 5.6 that if the $X_j$ are IID, then $S_n$ has mean $nm$ and variance $n\sigma^2$. The central limit theorem states that, as $n$ becomes large, the CDF of a suitably normalized version of $S_n$ approaches that of a Gaussian random variable; this enables us to approximate the CDF of $S_n$ with that of a Gaussian random variable.

Central Limit Theorem: Let $X_j$ be a sequence of IID random variables with mean $m$ and variance $\sigma^2$. Define a new random variable, $Z$, as a (shifted and scaled) sum of the $X_j$:

$$Z = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \frac{X_j - m}{\sigma} \tag{5.21}$$

Note that $Z$ has been constructed such that $E[Z] = 0$ and $VAR[Z] = 1$. In the limit as $n$ approaches infinity, the random variable $Z$ converges in distribution to a standard Gaussian random variable.

Several remarks about this theorem are in order at this point. First, no restrictions are put on the distribution of the $X_j$: the theorem applies to any infinite sum of IID random variables, regardless of the distribution. From a practical standpoint, the central limit theorem implies that the sum of a sufficiently large (but finite) number of IID random variables is approximately Gaussian distributed. Of course, the goodness of this approximation depends on how many terms are in the sum and also on the distribution of the individual terms. Figures 5.1 to 5.3 compare the exact CDF and the Gaussian approximation for sums of Bernoulli, uniform, and exponential random variables, respectively. In all three cases, the approximation improves as the number of terms in the sum increases.

Figure 5.1: (a) The CDF of the sum of five independent Bernoulli random variables with $p = 1/2$ and the CDF of a Gaussian random variable of the same mean and variance. (b) The CDF of the sum of 25 independent Bernoulli random variables with $p = 1/2$ and the CDF of a Gaussian random variable of the same mean and variance.

Figure 5.2: The CDF of the sum of five independent discrete, uniform random variables from the set {0, 1, 2, ..., 9} and the CDF of a Gaussian random variable of the same mean and variance.

Figure 5.3: (a) The CDF of the sum of five independent exponential random variables of mean 1 and the CDF of a Gaussian random variable of the same mean and variance. (b) The CDF of the sum of 50 independent exponential random variables of mean 1 and the CDF of a Gaussian random variable of the same mean and variance.

The central limit theorem guarantees that the sum converges in "distribution" to Gaussian, but this does not necessarily imply convergence in "density". As a counterexample, suppose that the $X_j$ are discrete random variables; then the sum must also be a discrete random variable. Strictly speaking, the density of $Z$ would then not exist, and it would not be meaningful to say that the density of $Z$ is Gaussian. From a practical standpoint, the probability density of $Z$ would be a series of impulses. While the envelope of these impulses would have a Gaussian shape, the density is clearly not Gaussian. If the $X_j$ are continuous random variables, convergence in density generally occurs as well.

The IID assumption is not needed in many cases. The central limit theorem also applies to independent random variables that are not necessarily identically distributed. Loosely speaking, all that is required is that no term (or small number of terms) dominates the sum; the resulting infinite sum of independent random variables will then approach a Gaussian distribution as the number of terms in the sum goes to infinity. The central limit theorem also applies to some cases of dependent random variables, but we will not consider such cases here.
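A comparison in the spirit of Figure 5.3 can be reproduced numerically. This sketch assumes exponential summands with mean 1 (so $m = \sigma = 1$) and uses scipy's standard normal CDF to check how close the empirical CDF of the normalized sum $Z$ from Equation 5.21 is to the Gaussian limit at a few points.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
m = sigma = 1.0              # mean and std dev of an exponential with mean 1
trials = 100_000             # number of sums used to estimate the CDF

for n in [5, 50]:
    s_n = rng.exponential(scale=m, size=(trials, n)).sum(axis=1)
    z = (s_n - n * m) / (sigma * np.sqrt(n))   # normalized sum, Eq. 5.21
    for x in [-1.0, 0.0, 1.0]:
        emp = np.mean(z <= x)                  # empirical CDF of Z at x
        print(f"n={n:2d}  x={x:+.1f}  P(Z<=x) ~= {emp:.4f}  Phi(x)={norm.cdf(x):.4f}")
```

As in the figures, the agreement is already reasonable at $n = 5$ and noticeably better at $n = 50$.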
Example 5.4
The times between events in a certain random experiment are IID exponential random variables with mean $m$ seconds. Find the probability that the 1000th event occurs in the time interval $(1000 \pm 50)m$.

Solution. Let $X_i$ be the time between events and let $S_n$ be the time of the $n$th event; then $S_n = X_1 + X_2 + \ldots + X_n$. We know that the mean and variance of the exponential random variable $X_j$ are given by $E[X_j] = m$ and $VAR[X_j] = m^2$. The mean and variance of $S_n$ are then $E[S_n] = nE[X_j] = nm$ and $VAR[S_n] = n\,VAR[X_j] = nm^2$. The central limit theorem then gives:

$$P(950m \leq S_{1000} \leq 1050m) = P\Big(\frac{950m - 1000m}{m\sqrt{1000}} \leq Z_n \leq \frac{1050m - 1000m}{m\sqrt{1000}}\Big) \simeq Q(-1.58) - Q(1.58) = 1 - 2Q(1.58) = 0.8866$$

Thus as $n$ becomes large, $S_n$ is very likely to be close to its mean $nm$. We can therefore conjecture that the long-term average rate at which events occur is

$$\frac{n \text{ events}}{S_n \text{ seconds}} = \frac{n}{nm} = \frac{1}{m} \text{ events/second}$$

5.6 Convergence of Sequences of Random Variables

We discussed the convergence of the sequence of arithmetic averages $M_n$ of IID random variables to the expected value $m$:

$$M_n \to m \quad \text{as } n \to \infty \tag{5.22}$$

The weak law and strong law of large numbers describe two ways in which the sequence of random variables $M_n$ converges to the constant value $m$. In this section we consider the more general situation where a sequence of random variables (usually not IID) $X_1, X_2, \ldots$ converges to some random variable $X$:

$$X_n \to X \quad \text{as } n \to \infty \tag{5.23}$$

We will describe several ways in which this convergence can take place. Note that Equation 5.22 is a special case of Equation 5.23 where the limiting random variable $X$ is given by the constant $m$.

To understand the meaning of Equation 5.23, we first need to revisit the definition of a vector random variable $\mathbf{X} = (X_1, X_2, \ldots, X_n)$. $\mathbf{X}$ was defined as a function that assigns a vector of real values to each outcome $\zeta$ from some sample space $S$:

$$\mathbf{X}(\zeta) = (X_1(\zeta), X_2(\zeta), \ldots, X_n(\zeta)) \tag{5.24}$$

The randomness in the vector random variable is induced by the randomness in the underlying probability law governing the selection of $\zeta$. We obtain a sequence of random variables by letting $n$ increase without bound; that is, a sequence of random variables is a function that assigns a countably infinite number of real values to each outcome $\zeta$ from some sample space $S$:

$$\mathbf{X}(\zeta) = (X_1(\zeta), X_2(\zeta), \ldots, X_n(\zeta), \ldots) \tag{5.25}$$

From now on, we will use the notation $\{X_n(\zeta)\}$ or $\{X_n\}$ instead of $\mathbf{X}(\zeta)$ to denote the sequence of random variables. A sequence of random variables can thus be viewed as a sequence of functions of $\zeta$. On the other hand, it is often more natural to imagine that each point $\zeta$ in $S$ produces a particular sequence of real numbers, $x_1, x_2, x_3, \ldots$, where $x_1 = X_1(\zeta)$, $x_2 = X_2(\zeta)$, and so on. This sequence is called the sample sequence for the point $\zeta$.

Example 5.5
Let $\zeta$ be selected at random from the interval $S = [0, 1]$, where we assume that the probability that $\zeta$ is in a sub-interval of $S$ is equal to the length of the sub-interval. For $n = 1, 2, \ldots$ we define the sequence of random variables:

$$V_n(\zeta) = \zeta\Big(1 - \frac{1}{n}\Big)$$

The two ways of looking at sequences of random variables are evident here. First, we can view $V_n(\zeta)$ as a sequence of functions of $\zeta$, as shown in Figure 5.4(a). Alternatively, we can imagine that we first perform the random experiment that yields $\zeta$ and that we then observe the corresponding sequence of real numbers $V_n(\zeta)$, as shown in Figure 5.4(b).
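The second view, one sample sequence per outcome, is easy to mimic in code. A minimal sketch of Example 5.5 (the three outcomes drawn here are arbitrary): pick a few outcomes $\zeta$ uniformly from $[0, 1]$ and print the first terms of each sample sequence $V_n(\zeta)$, which visibly approach $\zeta$.

```python
import numpy as np

rng = np.random.default_rng(5)

for zeta in rng.uniform(size=3):      # three realizations of the experiment
    n = np.arange(1, 6)
    v = zeta * (1 - 1 / n)            # sample sequence V_n(zeta)
    terms = ", ".join(f"{t:.4f}" for t in v)
    print(f"zeta={zeta:.4f}:  {terms}, ...  -> converges to {zeta:.4f}")
```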
Figure 5.4: Two ways of looking at sequences of random variables: (a) a sequence of random variables as a sequence of functions of $\zeta$; (b) a sequence of random variables as a sequence of real numbers determined by $\zeta$.

The standard methods from calculus can be used to determine the convergence of the sample sequence for each point $\zeta$. Intuitively, we say that the sequence of real numbers $x_n$ converges to the real number $x$ if the difference $|x_n - x|$ approaches zero as $n$ approaches infinity. More formally, we say that:

The sequence $x_n$ converges to $x$ if, given any $\varepsilon > 0$, we can specify an integer $N$ such that for all values of $n$ beyond $N$ we can guarantee that $|x_n - x| < \varepsilon$.

Thus if a sequence converges, then for any $\varepsilon$ we can find an $N$ so that the sequence remains inside a $2\varepsilon$ corridor about $x$, as shown in Figure 5.5.

Figure 5.5: Convergence of a sequence of numbers

If we make $\varepsilon$ smaller, $N$ becomes larger. Hence we arrive at our intuitive view that $x_n$ becomes closer and closer to $x$. If the limiting value $x$ is not known, we can still determine whether a sequence converges by applying the Cauchy criterion:

The sequence $x_n$ converges if and only if, given $\varepsilon > 0$, we can specify an integer $N'$ such that for all $m$ and $n$ greater than $N'$, $|x_n - x_m| < \varepsilon$.

The Cauchy criterion states that the maximum variation in the sequence for points beyond $N'$ is less than $\varepsilon$.

Example 5.6
Let $V_n(\zeta)$ be the sequence of random variables from Example 5.5. Does the sequence of real numbers corresponding to a fixed $\zeta$ converge?

Solution. From Figure 5.4(a), we expect that for a fixed value $\zeta$, $V_n(\zeta)$ will converge to the limit $\zeta$. Therefore, we consider the difference between the $n$th number in the sequence and the limit:

$$|V_n(\zeta) - \zeta| = \Big|\zeta\Big(1 - \frac{1}{n}\Big) - \zeta\Big| = \Big|\frac{\zeta}{n}\Big| \leq \frac{1}{n}$$

where the last inequality follows from the fact that $\zeta$ is at most one. In order to keep the above difference less than $\varepsilon$, we choose $n$ so that

$$|V_n(\zeta) - \zeta| \leq \frac{1}{n} < \varepsilon$$

that is, we select $n > N = 1/\varepsilon$. Thus the sequence of real numbers $V_n(\zeta)$ converges to $\zeta$.

When we talk about the convergence of sequences of random variables, we are concerned with questions such as: Do all (or almost all) sample sequences converge, and if so, do they all converge to the same value or to different values? The first two definitions of convergence address these questions.

5.6.1 Sure Convergence

Definition 5.3. Sure Convergence: The sequence of random variables $\{X_n(\zeta)\}$ converges surely to the random variable $X(\zeta)$ if the sequence of functions $X_n(\zeta)$ converges to the function $X(\zeta)$ as $n \to \infty$ for all $\zeta$ in $S$:

$$X_n(\zeta) \to X(\zeta) \quad \text{as } n \to \infty \quad \text{for all } \zeta \in S$$

Example 5.7
Let $X$ be a random variable uniformly distributed over $[0, 1)$. Then define the random sequence

$$X_n = \frac{X}{1 + n^2}, \quad n = 1, 2, 3, \ldots$$

In this case, for any realization $X = x$, a sequence is produced of the form:

$$x_n = \frac{x}{1 + n^2}$$

which converges to $\lim_{n \to \infty} x_n = 0$. We say that the sequence converges surely to $\lim_{n \to \infty} X_n = 0$.

Sure convergence requires that the sample sequence corresponding to every $\zeta$ converges. Note that it does not require that all the sample sequences converge to the same value; that is, the sample sequences for different points $\zeta$ and $\zeta'$ can converge to different values.

Example 5.8
Let $X$ be a random variable uniformly distributed over $[0, 1)$. Then define the random sequence

$$X_n = \frac{n^2 X}{1 + n^2}, \quad n = 1, 2, 3, \ldots$$

In this case, for any realization $X = x$, a sequence is produced of the form:

$$x_n = \frac{n^2 x}{1 + n^2}$$

which converges to $\lim_{n \to \infty} x_n = x$. We say that the sequence converges surely to the random variable $\lim_{n \to \infty} X_n = X$. In this case, the value that the sequence converges to depends on the particular realization of the random variable $X$.
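A small numerical check of Examples 5.7 and 5.8 (a sketch; the two realizations are drawn with numpy and are arbitrary): for every realization $x$, the first sequence heads to 0, while the second heads to that realization's own value $x$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = np.array([1, 10, 100, 1000])

for x in rng.uniform(size=2):          # two realizations X = x
    seq7 = x / (1 + n**2)              # Example 5.7: limit is 0 for every x
    seq8 = (n**2 * x) / (1 + n**2)     # Example 5.8: limit is x itself
    print(f"x={x:.4f}  Ex5.7 terms={np.round(seq7, 4)}  "
          f"Ex5.8 terms={np.round(seq8, 4)}")
```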
5.6.2 Almost-Sure Convergence

Definition 5.4. Almost-Sure Convergence: The sequence of random variables $\{X_n(\zeta)\}$ converges almost surely to the random variable $X(\zeta)$ if the sequence of functions $X_n(\zeta)$ converges to the function $X(\zeta)$ as $n \to \infty$ for all $\zeta$ in $S$, except possibly on a set of probability zero; that is:

$$P(\zeta : X_n(\zeta) \to X(\zeta) \text{ as } n \to \infty) = 1$$

In Figure 5.6 we illustrate almost-sure convergence for the case where the sample sequences converge to the same value $x$; we see that almost all sequences must eventually enter and remain inside a $2\varepsilon$ corridor. In almost-sure convergence, some of the sample sequences may not converge, but these must all belong to $\zeta$s that lie in a set of probability zero.

Figure 5.6: Almost-sure convergence for sample sequences

The strong law of large numbers is an example of almost-sure convergence. Note that sure convergence implies almost-sure convergence.

Example 5.9
As an example of a sequence that converges almost surely, consider the random sequence

$$X_n = \frac{\sin(n\pi X)}{n\pi X}$$

where $X$ is a random variable uniformly distributed over $[0, 1)$. For almost every realization $X = x$, the sequence

$$x_n = \frac{\sin(n\pi x)}{n\pi x}$$

converges to $\lim_{n \to \infty} x_n = 0$. The one exception is the realization $X = 0$, in which case the sequence becomes $x_n = 1$, which converges, but not to the same value. Therefore, we say that the sequence $X_n$ converges almost surely to $\lim_{n \to \infty} X_n = 0$, since the one exception to this convergence occurs with zero probability; that is, $P(X = 0) = 0$.

5.6.3 Convergence in Probability

Definition 5.5. Convergence in Probability: The sequence of random variables $\{X_n(\zeta)\}$ converges in probability to the random variable $X(\zeta)$ if, for any $\varepsilon > 0$:

$$P(|X_n(\zeta) - X(\zeta)| > \varepsilon) \to 0 \quad \text{as } n \to \infty$$

In Figure 5.7 we illustrate convergence in probability for the case where the limiting random variable is a constant $x$; we see that at the specified time $n_0$, most sample sequences must be within $\varepsilon$ of $x$. However, the sequences are not required to remain inside a $2\varepsilon$ corridor. The weak law of large numbers is an example of convergence in probability. Thus we see that the fundamental difference between almost-sure convergence and convergence in probability is the same as that between the strong law and the weak law of large numbers.

Figure 5.7: Convergence in probability for sample sequences

Example 5.10
Let $X_k$, $k = 1, 2, 3, \ldots$ be a sequence of IID Gaussian random variables with mean $m$ and variance $\sigma^2$. Suppose we form the sequence of sample means $M_n = \frac{1}{n}\sum_{k=1}^{n} X_k$, $n = 1, 2, 3, \ldots$. Since the $M_n$ are linear combinations of Gaussian random variables, they are also Gaussian, with $E[M_n] = m$ and $VAR[M_n] = \sigma^2/n$. Therefore, the probability that the sample mean is removed from the true mean by more than $\varepsilon$ is

$$P(|M_n - m| > \varepsilon) = 2Q\Big(\sqrt{\frac{n\varepsilon^2}{\sigma^2}}\Big)$$

As $n \to \infty$, this quantity clearly approaches zero, so this sequence of sample means converges in probability to the true mean.
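The probability in Example 5.10 can be evaluated directly: $Q(x)$ is the standard Gaussian tail probability, available in scipy as `norm.sf`. A sketch with the arbitrary choices $\sigma = 1$ and $\varepsilon = 0.1$:

```python
import numpy as np
from scipy.stats import norm

sigma, eps = 1.0, 0.1       # hypothetical standard deviation and tolerance

for n in [10, 100, 1000, 10_000]:
    # P(|M_n - m| > eps) = 2 Q(sqrt(n eps^2 / sigma^2)); Q(x) = norm.sf(x)
    p = 2 * norm.sf(np.sqrt(n * eps**2 / sigma**2))
    print(f"n={n:6d}  P(|M_n - m| > {eps}) = {p:.6f}")
```

The printed probabilities fall towards zero as $n$ grows, which is exactly the convergence-in-probability statement.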
5.6.4 Convergence in the Mean Square Sense

Definition 5.6. Convergence in the Mean Square Sense: The sequence of random variables $\{X_n(\zeta)\}$ converges in the mean square (MS) sense to the random variable $X(\zeta)$ if:

$$E[(X_n(\zeta) - X(\zeta))^2] \to 0 \quad \text{as } n \to \infty$$

Mean square convergence is of great practical interest in electrical engineering applications because of its analytical simplicity and because of the interpretation of $E[(X_n - X)^2]$ as the "power" in an error signal.

Example 5.11
Consider the sequence of sample means of IID Gaussian random variables described in Example 5.10. This sequence also converges in the MS sense, since:

$$E[(M_n - m)^2] = VAR[M_n] = \frac{\sigma^2}{n}$$

This mean square error converges to 0 as $n \to \infty$, so the random sequence converges in the MS sense.

5.6.5 Convergence in Distribution

Definition 5.7. Convergence in Distribution: The sequence of random variables $\{X_n\}$ with cumulative distribution functions $\{F_n(x)\}$ converges in distribution to the random variable $X$ with cumulative distribution $F(x)$ if:

$$F_n(x) \to F(x) \quad \text{as } n \to \infty$$

for all $x$ at which $F(x)$ is continuous.

The central limit theorem is an example of convergence in distribution.

Example 5.12
Consider once again the sequence of sample means of IID Gaussian random variables described in Example 5.10. Since $M_n$ is Gaussian with mean $m$ and variance $\sigma^2/n$, its CDF takes the form

$$F_{M_n}(x) = 1 - Q\Big(\frac{x - m}{\sigma/\sqrt{n}}\Big)$$

For any $x > m$, $\lim_{n \to \infty} F_{M_n}(x) = 1$, while for any $x < m$, $\lim_{n \to \infty} F_{M_n}(x) = 0$. Thus, the limiting form of the CDF is:

$$\lim_{n \to \infty} F_{M_n}(x) = u(x - m)$$

where $u(x)$ is the unit step function. Note that the point $x = m$ is not a point of continuity of the limiting CDF, so convergence is not required there.

It should be noted, as was seen in the preceding sequence of examples, that a random sequence may converge in several of the different senses. In fact, one form of convergence may imply convergence in several other forms. Table 5.1 summarizes these relationships. For example, convergence in distribution is the weakest form of convergence and does not necessarily imply any of the other forms; conversely, if a sequence converges in any of the other modes presented, it will also converge in distribution.

Table 5.1: Relationships between convergence modes, showing whether the convergence mode in each row implies the convergence mode in each column

This ↓ implies this →   Sure   Almost Sure   Probability   Mean Square   Distribution
Sure                     X      Yes           Yes           No            Yes
Almost Sure              No     X             Yes           No            Yes
Probability              No     No            X             No            Yes
Mean Square              No     No            Yes           X             Yes
Distribution             No     No            No            No            X

5.7 Confidence Intervals

Consider once again the problem of estimating the mean of a distribution from $n$ IID random variables. When the sample mean $M_n$ is formed, it could be said that (hopefully) the true mean is "close" to the sample mean. While this is a vague statement, with the help of the central limit theorem we can make it mathematically precise. If a sufficient number of samples are taken, the sample mean can be well approximated by a Gaussian random variable with mean $E[M_n] = m$ (Equation 5.9) and variance $VAR[M_n] = \sigma^2/n$ (Equation 5.11). Using the Gaussian distribution, the probability of the sample mean being within some amount $\varepsilon$ of the true mean can be easily calculated:

$$P(|M_n - m| < \varepsilon) = P(m - \varepsilon < M_n < m + \varepsilon) = 1 - 2Q(\varepsilon\sqrt{n}/\sigma) \tag{5.26}$$

Stated another way, let $\varepsilon_a$ be the value of $\varepsilon$ such that the right-hand side of the above equation is $1 - a$; that is,

$$\varepsilon_a = \frac{\sigma}{\sqrt{n}} Q^{-1}(a/2) \tag{5.27}$$

where $Q^{-1}$ is the inverse of the Q-function. Then, given $n$ samples which lead to a sample mean $M_n$, the true mean will fall in the interval $(M_n - \varepsilon_a, M_n + \varepsilon_a)$ with probability $1 - a$.
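As a numerical companion (a sketch, not part of the derivation above): $Q^{-1}(a/2)$ is scipy's inverse survival function `norm.isf(a/2)`, so both the constants $c_a$ tabulated in Table 5.2 below and the interval itself can be computed directly. The measurement data here are hypothetical, generated with a known $\sigma$ for the demonstration.

```python
import numpy as np
from scipy.stats import norm

# c_a = Q^{-1}(a/2); Q^{-1} is the inverse survival function norm.isf
for a in [0.10, 0.05, 0.01, 0.001, 0.0001]:
    print(f"a = {a:<7}  c_a = {norm.isf(a / 2):.2f}")

# Hypothetical measurements with known sigma; 95% confidence interval for m
sigma, a = 2.0, 0.05
x = np.random.default_rng(7).normal(10.0, sigma, size=100)
m_n = x.mean()
eps_a = sigma / np.sqrt(len(x)) * norm.isf(a / 2)   # Eq. 5.27
print(f"M_n = {m_n:.3f}, 95% confidence interval = "
      f"({m_n - eps_a:.3f}, {m_n + eps_a:.3f})")
```

The printed $c_a$ values reproduce the entries of Table 5.2.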
The interval $(M_n - \varepsilon_a, M_n + \varepsilon_a)$ is referred to as the confidence interval; the probability $1 - a$ is the confidence level, and $a$ is the level of significance. The confidence level and level of significance are usually expressed as percentages. The corresponding values of the quantity $c_a = Q^{-1}(a/2)$ are provided in Table 5.2 for several typical values of $a$.

Table 5.2: Reference values for calculating confidence intervals

Confidence Level (1 − a)·100%   Level of Significance a·100%   c_a = Q⁻¹(a/2)
90                               10                              1.64
95                               5                               1.96
99                               1                               2.58
99.9                             0.1                             3.29
99.99                            0.01                            3.89

Example 5.13
Suppose the IID random variables each have a variance of $\sigma^2 = 4$. A sample of $n = 100$ values is taken, and the sample mean is found to be $M_n = 10.2$. (a) Determine the 95% confidence interval for the true mean $m$. (b) Suppose we want to be 99% confident that the true mean falls within $\pm 0.5$ of the sample mean. How many samples need to be taken in forming the sample mean?

Solution. (a) In this case $\sigma/\sqrt{n} = 0.2$, and the appropriate value of $c_a$ is $c_{0.05} = 1.96$ from Table 5.2. The 95% confidence interval is then:

$$\Big(M_n - \frac{\sigma}{\sqrt{n}} c_{0.05},\; M_n + \frac{\sigma}{\sqrt{n}} c_{0.05}\Big) = (9.808, 10.592)$$

(b) To ensure this level of confidence, it is required that

$$\frac{\sigma}{\sqrt{n}} c_{0.01} = 0.5$$

and therefore

$$n = \Big(\frac{c_{0.01}\,\sigma}{0.5}\Big)^2 = \Big(\frac{2.58 \times 2}{0.5}\Big)^2 = 106.5$$

Since $n$ must be an integer, it is concluded that at least 107 samples must be taken.

In summary, to achieve a level of significance specified by $a$, we note that, by virtue of the central limit theorem, the quantity

$$\hat{Z}_n = \frac{M_n - m}{\sigma/\sqrt{n}} \tag{5.28}$$

approximately follows a standard normal distribution. We can then easily specify a symmetric interval about zero in which a standard Gaussian random variable will fall with probability $1 - a$. As long as $n$ is sufficiently large, the original distribution of the IID random variables does not matter.

Note that in order to form the confidence interval as specified, the standard deviation of the $X_j$ must be known. While in some cases this may be a reasonable assumption, in many applications the standard deviation is also unknown. The most obvious thing to do in that case is to replace the true standard deviation with the sample standard deviation.

Further Reading
1. Scott L. Miller, Donald Childers, Probability and Random Processes: With Applications to Signal Processing and Communications, Elsevier, 2012: Chapter 7.
2. Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering, 3rd ed., Pearson, 2007: Chapter 7.