INTRODUCTION TO PROBABILITY THEORY
With Contemporary Applications

Lester L. Helms
University of Illinois at Urbana-Champaign

DOVER PUBLICATIONS, INC.
Mineola, New York

Copyright
Copyright © 1997, 2010 by Lester L. Helms
All rights reserved.

Bibliographical Note
This Dover edition, first published in 2010, is an unabridged republication of the work originally published in 1997 by W. H. Freeman and Company, New York. The author has provided a new errata list for this edition.

Library of Congress Cataloging-in-Publication Data
Helms, L. L. (Lester La Verne), 1927-
Introduction to probability theory : with contemporary applications / Lester L. Helms. Dover ed.
p. cm.
Originally published: New York : W. H. Freeman, 1997. With new errata list.
Includes bibliographical references and index.
ISBN-13: 978-0-486-47418-2
ISBN-10: 0-486-47418-6
1. Probabilities. I. Title.
QA273.H52 2010
519.2-dc22
2009034243

Manufactured in the United States by Courier Corporation
47418601
www.doverpublications.com

In memory of David Michael Helms 1955-1990

CONTENTS

Preface ..... ix
Errata ..... xi

1 Classical Probability ..... 1
1.1 Beginnings ..... 1
1.2 Basic Rules ..... 2
1.3 Counting ..... 5
1.4 Equally Likely Case ..... 12
1.5 Other Models ..... 17

2 Axioms of Probability ..... 25
2.1 Introduction ..... 25
2.2 Set Theory ..... 26
2.3 Countable Sets ..... 31
2.4 Axioms ..... 35
2.5 Properties of Probability Functions ..... 41
2.6 Conditional Probability and Independence ..... 45
2.7 Some Applications ..... 52

3 Random Variables ..... 60
3.1 Introduction ..... 60
3.2 Random Variables ..... 60
3.3 Independent Random Variables ..... 72
3.4 Generating Functions ..... 81
3.5 Gambler’s Ruin Problem ..... 92
3.6 Appendix ..... 96

4 Expectation ..... 99
4.1 Introduction ..... 99
4.2 Expected Value ..... 99
4.3 Properties of Expectation ..... 107
4.4 Covariance and Correlation ..... 117
4.5 Conditional Expectation ..... 125
4.6 Entropy ..... 133

5 Stochastic Processes ..... 144
5.1 Introduction ..... 144
5.2 Markov Chains ..... 145
5.3 Random Walks ..... 159
5.4 Branching Processes ..... 167
5.5 Prediction Theory ..... 172

6 Continuous Random Variables ..... 181
6.1 Introduction ..... 181
6.2 Random Variables ..... 182
6.3 Distribution Functions ..... 190
6.4 Joint Distribution Functions ..... 199
6.5 Computations with Densities ..... 209
6.6 Multivariate and Conditional Densities ..... 216

7 Expectation Revisited ..... 226
7.1 Introduction ..... 226
7.2 Riemann-Stieltjes Integral ..... 226
7.3 Expectation and Conditional Expectation ..... 234
7.4 Normal Density ..... 243
7.5 Covariance and Covariance Functions ..... 255

8 Continuous Parameter Markov Processes ..... 267
8.1 Introduction ..... 267
8.2 Poisson Process ..... 267
8.3 Birth and Death Processes ..... 273
8.4 Markov Chains ..... 278
8.5 Matrix Calculus ..... 285
8.6 Stationary Distributions ..... 293

Solutions to Exercises ..... 300
Standard Normal Distribution Function ..... 346
Symbols ..... 347
Index ..... 348

PREFACE

In addition to exposing a student to diverse applications of probability theory through numerous examples, a probability textbook should convince the student that there is a coherent set of rules for dealing with probabilities and that there are powerful methodologies for solving probability problems. Aside from routine differentiation and integration methods, as far as possible I have based this book on the following three topics from the calculus: (1) the principle of mathematical induction, (2) the existence of limits of monotone sequences, and (3) power series. With three or four exceptions, complete proofs of theorems are included for the benefit of the highly motivated student.

The transition from calculus to probability theory is not easy for the typical student. New concepts, which are not amenable to “plug and chug” methods, are introduced at each turn of the page.
At the risk of verbosity, I have endeavored to err on the side of readability to make this transition easier. Even so, the student will need to have pen and scratch pad ready for writing out some of the details.

Again for the benefit of the student, solutions to all the exercises are included at the end of the text. Students need immediate reassurance that they have worked a problem correctly so that they can get on with the learning process and should not be made to wait until the next class meeting.

Some of the exercises are tagged with a caution symbol in the form of a hand; these should not be attempted without Mathematica or Maple V software. Some of these exercises stipulate that answers should be calculated to n decimal places to ensure that mathematical software is actually used rather than a hand calculator and a crude approximation.

At most, two or three classroom periods should be allotted to Chapter 1. A standard one-semester course might consist of Chapters 1-4, one section from Chapter 5, and Chapters 6-7. On the other hand, an instructor who believes in the inevitability of a digitized science might offer a one-semester course based only on Chapters 1-5. There is more than enough material for a two-semester course.

The manuscript was classroom tested during the fall semester of 1994 and the spring semester of 1996. Many examples and exercises have been added to the original manuscript at the suggestion of the students. This book was written for students, and I would welcome suggestions from them, via e-mail at l-helms@math.uiuc.edu, on how it might be improved.

I would like to express my appreciation to W. H. Freeman reviewers S. James Taylor of the University of Virginia and Cathleen M. Zucco of LeMoyne College for their many suggestions on how to improve the book. I also thank Mary Louise Byrd, Project Editor at W. H. Freeman, for maintaining a reasonable production schedule. I especially thank Holly Hodder, Senior Editor at W. H. Freeman, for her interest in publishing this book.

July 1996

ERRATA

1. In line 15 of page 44, change 3/16 to 5/32.

2. In line 12 of page 48, interchange .116 and .5.

3. In line 16 of page 48, change the sentence beginning “Since ...” to: That someone must be able to pass on a B allele and therefore must be of genotype OB, BB, or AB; in the first and third cases there is a 50-50 chance that the B allele will be passed to the child. The computation of $P(E \mid F^c)$ is similar to that of $P(E \mid F)$ except for an additional term $P(E \cap F^c \mid F_{AB}')P(F_{AB}')$ in the numerator. Thus,
$$P(E \mid F^c) = \frac{(.5)(.116) + .007 + (.5)(.038)}{.877}$$

4. Line $11^{-}$ of page 48 should read
$$\frac{P(E \mid F)}{P(E \mid F^c)} = \frac{.528}{.096}$$

5. In line 2 of page 49, change .89 to .85.

6. The following sentence should be added to line 2 on page 106: Before trying to verify (4.2) below, the reader should do Exercise 4.2.8 first.

7. The first sentence of line 14 on page 106 should read: A similar argument applies to the fourth starting pattern, but care must be taken with the second and third starting patterns.

8. Equation (4.2) on page 106 should be changed to read:
$$g_T(n) = g_T(n-1) - \frac{1}{8}\,g_T(n-3) + \frac{1}{16}\,g_T(n-4), \qquad n \ge 4.$$

9. The display equation following Equation (4.3) on page 106 should read:
$$\hat g_T(t) - 1 - t - t^2 - t^3 = t\left(\hat g_T(t) - 1 - t - t^2\right) - \frac{t^3}{8}\left(\hat g_T(t) - 1\right) + \frac{t^4}{16}\,\hat g_T(t)$$

10. The last display equation following Equation (4.3) on page 106 should read:
$$\hat g_T(t) = \frac{16 + 2t^3}{16 - 16t + 2t^3 - t^4}.$$

11. The 30 on lines 3 and 4 of page 107 should be changed to 18.

12. Exercise 4.2.8 on page 107 should be replaced by: Consider a sequence of Bernoulli trials with probability of success $p = 1/2$, define a waiting time $T$ by putting $T = n$ if the word 101 appears for the first time at the end of the $n$th trial, and let $g_T(n) = P(T > n)$. Show that
$$g_T(n) = \frac{1}{2}\,g_T(n-1) + \left(\frac{1}{2}\,g_T(n-1) - \frac{1}{4}\,g_T(n-2)\right) + \frac{1}{8}\,g_T(n-3)$$
for $n \ge 3$. Using the fact that $g_T(0) = g_T(1) = g_T(2) = 1$, find the generating function $\hat g_T$ and $E[T]$.

13. The following sentences should be added to Definition 5.2 on page 154: (3) The state $j$ is periodic of period $d(j)$ if $d(j)$ is the greatest common divisor of the set $\{n \in N : p_{jj}^{(n)} > 0\}$ and is called aperiodic if $d(j) = 1$; (4) the chain $\{X_n\}_{n=0}^{\infty}$ is aperiodic if each state is aperiodic.

14. In line $6^{-}$ on page 154, insert “an aperiodic” before the word irreducible.

15. In line $8^{-}$ of page 168, add right parenthesis.

16. In line 6 of page 172, replace “with” by “by”.

17. In line $9^{-}$ of page 179, replace “and” by “<”.

18. In line $5^{-}$ of page 179, replace “and” by “<”.

19. The $D$ in line 3 on page 238 (after Figure 7.1) should have an exponent 2 as in $D^2$.

20. Solution 2.5.3 on page 308 should be 17/24.

21. Solution 4.2.8 on page 319 should read: Let $T = n$ if the word 101 occurs on the $n$th trial for the first time. Then
$$\hat g_T(t) = \frac{8 + 2t^2}{8 - 8t + 2t^2 - t^3}$$
and $E[T] = \hat g_T(1) = 10$.

22. Solution 4.2.9 on page 319 should read:
$$P(T \le 11) = 1 - P(T > 11) = 1 - g_T(11) = \frac{435}{1024} = .4248.$$

1 CLASSICAL PROBABILITY

1.1 BEGINNINGS

As far back as 3500 B.C., devices were used in conjunction with board games to inject an element of uncertainty into the game. A heel bone or knucklebone of a hooved animal was commonly used. Dice made from clay were in existence even before the Greek and Roman empires. Just how the outcomes of these devices were measured or weighted, if at all, is unknown. It may be that the outcomes were ascribed to fate, the gods, or whatever, with no attempt being made to associate numbers with outcomes.

At the end of the fifteenth century and beginning of the sixteenth century, numbers began to be associated with the outcomes of gaming devices, and by that time empirical odds had been established for some devices by inveterate gamblers.
In the first half of the sixteenth century, the Italian physician and mathematician Girolamo Cardano (1501-1576) made the abstraction from empiricism to theoretical concept in his book Liber de Ludo Aleae, “The Book of Games of Chance,” which was published posthumously in 1663. An English translation of this book by Sydney Gould can be found in Cardano, The Gambling Scholar by Oystein Ore (see the Supplemental Reading List at the end of this chapter). Among other things, Cardano calculated the odds of getting various scores with two dice and with three dice.

During the period 1550-1650, several mathematicians were involved in calculating the chances of winning at gambling. Sometime between 1613 and 1623, Galileo (1564-1642) wrote a paper on dice without alluding to any prior work, as though the calculation of probabilities had become commonplace by then.

Some historians mark 1654 as the birth of the theory of probability. It was in this year that a gambler, the Chevalier de Méré, proposed several problems to Blaise Pascal (1623-1662), who in turn communicated the problems to Pierre de Fermat (1601-1665). Thus began a correspondence between Pascal and Fermat about probabilities that some authors claim to be the beginning of the theory of probability. On the heels of this correspondence, in 1657, another seminal work appeared: De Ratiociniis in Ludo Aleae by Christianus Huygens (1629-1695), in which the concept of expectation was introduced for the first time.

In the following sections, a theory of probability will be developed (as opposed to the theory of probability, since there are several approaches to probability) as expeditiously as possible using contemporary notation and terminology.

1.2 BASIC RULES

Game playing and gambling have been common forms of recreation among all classes of people for hundreds of years.
In fact, the desire to win at gambling was a primary driving force in the development of probability theory during the sixteenth century. As a result, much of the early work dealt with dice and with answering questions about perceived discrepancies in empirical odds. Some of the gamblers were quite astute at recognizing discrepancies on the order of 1/100. The point is that the gamblers of the sixteenth century were aware of some kind of empirical law according to which there was predictability about the frequency of occurrence of a specified outcome of a game, even though there was no way of predicting the outcome of a particular play of the game.

Consider an experiment or game in which the outcome is uncertain and consider some attribute of an outcome. Let $A$ be the collection of outcomes having that attribute. Suppose the experiment or game is repeated $N$ times and $N(A)$ is the number of repetitions for which the outcome is in $A$. The ratio $N(A)/N$ is called the relative frequency of $A$. The fact that the ratio $N(A)/N$ seems to stabilize near some real number $p$ when $N$ is large, written
$$\frac{N(A)}{N} \approx p,$$
is an empirical law. This law can no more be proved than Newton’s law of cooling can be proved. Of course, the number $p$ depends upon $A$, and it is customary to denote it by $P(A)$, so that the empirical law is usually written
$$\frac{N(A)}{N} \approx P(A).$$
The number $P(A)$ is called the probability of $A$. This tendency of relative frequencies to stabilize near a real number is illustrated in Figure 1.1. The graph depicts the relative frequency of getting a head in flipping a coin $N$ times for $N$ up to 500, calculated at multiples of 5 and rounded off. The number 500 was chosen in advance of any coin flips.

Just what was Cardano’s contribution?
It was the observation that for most simple games of chance, the probability of a particular outcome is simply the reciprocal of the total number of outcomes for the game, an observation that seemed to agree with empirical odds established by the gamblers. For example, if the game consists of rolling a fair die (i.e., a nearly perfect cubical die), then the outcomes 1, 2, 3, 4, 5, 6 represent the number of pips on the top surface of the die after coming to rest, and so each outcome has an associated probability of 1/6. Cardano also considered the roll of two dice; for purposes of argument, a red die and a white die. There are six outcomes for the red die. Each of the outcomes of the red die can be paired with one of six outcomes for the white die, and so there are a total of 36 possible outcomes for the roll of the two dice. Thus, Cardano assigned each outcome a probability of 1/36.

Cardano’s assignment of probabilities is universally accepted for most simple games of chance: rolling a die, rolling two dice, ..., rolling $n$ dice; flipping a coin, flipping a coin two times in succession, ..., flipping a coin $n$ times in succession; flipping $n$ coins simultaneously; and dealing a hand of $n$ cards from a well-shuffled deck of playing cards.

Consider two collections of outcomes $A$ and $B$ having no outcomes in common; i.e., $A$ and $B$ are mutually exclusive. If $A \cup B$ denotes the collection of outcomes in $A$ or in $B$ and $N(A \cup B)$ denotes the number of times the outcome is in $A \cup B$ in $N$ repetitions of the game, then $N(A \cup B) = N(A) + N(B)$. Since
$$\frac{N(A \cup B)}{N} = \frac{N(A)}{N} + \frac{N(B)}{N},$$
it follows from the empirical law that
$$P(A \cup B) = P(A) + P(B) \tag{1.1}$$
whenever $A$ and $B$ are mutually exclusive. Note also that $0 \le N(A)/N \le 1$, and it follows from the empirical law that
$$0 \le P(A) \le 1. \tag{1.2}$$
In particular, if $\Omega$ is the collection of all outcomes of the game, then $N(\Omega) = N$ and
$$P(\Omega) = 1. \tag{1.3}$$
The properties of probabilities expressed by Equations 1.1, 1.2, and 1.3 embody the basic rules for more general probability models.

Returning to Cardano’s assignment of probabilities, if $A$ consists of the outcomes $\omega_1, \omega_2, \ldots, \omega_k$ and $N(\omega_i)$ is the number of times the outcome is $\omega_i$ in $N$ repetitions of a game, then $N(A) = N(\omega_1) + \cdots + N(\omega_k)$ and $P(A) = P(\omega_1) + \cdots + P(\omega_k)$ by the empirical law. Letting $|A|$ denote the number of outcomes in $A$,
$$P(A) = \frac{|A|}{|\Omega|}. \tag{1.4}$$
We now have the basic rules for calculating probabilities associated with simple games of chance. Such calculations are reduced to counting outcomes. The reader should develop a systematic procedure for identifying and labeling outcomes, as in the following example.

EXAMPLE 1.1 Consider an experiment in which a coin is flipped three times in succession. The outcomes can be labeled using three-letter words made up from an alphabet of H and T (or 1 and 0). The label TTH stands for an outcome for which the first two flips resulted in tails and the third in heads. All possible outcomes can be listed: HHH, THH, HTH, HHT, HTT, THT, TTH, TTT. Consider the attribute “the number of heads in the outcome is 2.” If $A$ is the collection of outcomes having this attribute, then $|A| = 3$, and so $P(A) = 3/8$. ■

EXAMPLE 1.2 Consider an experiment in which a bowl contains five chips numbered 1, 2, 3, 4, 5. The chips are thoroughly mixed and one of them is selected blindly, the remaining chips are thoroughly mixed again, and then one of the remaining chips is selected blindly. An outcome of this experiment can be labeled using a two-letter word made up from an alphabet consisting of the digits 1, 2, 3, 4, 5, with the proviso that no digit can be repeated. All possible outcomes can be listed: 12, 13, 14, 15, 21, 23, 24, 25, 31, 32, 34, 35, 41, 42, 43, 45, 51, 52, 53, 54. Consider the attribute “the first digit is less than the second.” If $A$ is the collection of outcomes with this attribute, then $|A| = 10$, and so $P(A) = 10/20 = 1/2$.
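The counting in Examples 1.1 and 1.2 can also be checked by machine. The following Python sketch is an illustration added here, not part of the original text; the names `probability`, `coin_words`, `chip_words`, and `relative_frequency` are hypothetical, and the seed and number of repetitions are arbitrary choices. It lists the outcomes with the standard `itertools` module, applies the rule $P(A) = |A|/|\Omega|$, and compares the result for Example 1.1 with a simulated relative frequency $N(A)/N$.

```python
from fractions import Fraction
from itertools import permutations, product
import random


def probability(event, outcomes):
    """Equally likely rule: P(A) = |A| / |Omega|."""
    favorable = [w for w in outcomes if event(w)]
    return Fraction(len(favorable), len(outcomes))


# Example 1.1: three coin flips, labeled by three-letter words over {H, T}.
coin_words = ["".join(w) for w in product("HT", repeat=3)]
p_two_heads = probability(lambda w: w.count("H") == 2, coin_words)

# Example 1.2: two chips drawn without replacement from chips 1..5.
chip_words = list(permutations(range(1, 6), 2))
p_increasing = probability(lambda w: w[0] < w[1], chip_words)


def relative_frequency(n_repetitions, seed=0):
    """Simulate N repetitions of Example 1.1 and return N(A)/N."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    hits = 0
    for _ in range(n_repetitions):
        flips = [rng.choice("HT") for _ in range(3)]
        if flips.count("H") == 2:
            hits += 1
    return hits / n_repetitions


if __name__ == "__main__":
    print(len(coin_words), p_two_heads)    # 8 outcomes, P(A) = 3/8
    print(len(chip_words), p_increasing)   # 20 outcomes, P(A) = 1/2
    print(relative_frequency(20_000))      # expected to lie near 0.375
```

Exact fractions are used for the counted probabilities so that they can be compared directly with the values in the examples; only the simulated relative frequency is a floating-point number.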
■

EXERCISES 1.2

The last problem requires the principle of mathematical induction, which states: If $P(n)$ is a statement concerning the positive integer $n$ that satisfies (i) $P(1)$ is true and (ii) $P(n + 1)$ is true whenever $P(n)$ is true, then $P(n)$ is true for all integers $n \ge 1$.

1. Consider an experiment in which a coin is flipped four times in succession. If $A$ is the collection of outcomes having two heads, determine $P(A)$.

2. If a coin is flipped $n$ times in succession, what is the relationship between $|\Omega|$ and $n$?

3. Consider four distinguishable coins (e.g., a penny, a nickel, a dime, and a quarter). If the four coins are tossed simultaneously and $A$ consists of all outcomes having two heads, determine $P(A)$.

4. If four coins of like kind are tossed simultaneously and $A$ consists of all outcomes having three heads, determine $P(A)$.

5. A simultaneous toss of two indistinguishable dice results in a configuration (if the positions of the two dice are interchanged, a new configuration is not obtained). What is the total number of configurations?

6. Use the principle of mathematical induction to prove that [...] and
$$1^2 + 2^2 + 3^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$$
for every integer $n \ge 1$.

1.3 COUNTING

In the previous section, outcomes were given labels that were words made up using a specified alphabet. This procedure is just a special case of more general schemes. If $a_1, \ldots, a_n$ are $n$ distinct objects, $A$ will denote the collection consisting of these objects, written $A = \{a_1, \ldots, a_n\}$. If $B = \{b_1, \ldots, b_m\}$ is a second collection, we can form a new collection, denoted by $A \times B$, which consists of all ordered pairs $(a_i, b_j)$ with $i = 1, \ldots, n$ and $j = 1, \ldots, m$. Since we can form a rectangular array with $n$ rows and $m$ columns in which the element of the $i$th row and $j$th column is $(a_i, b_j)$, the total number of ordered pairs in the array is $n \times m$. Therefore,
$$|A \times B| = n \times m. \tag{1.5}$$
More generally, let $A_1, \ldots, A_r$ be $r$ collections having $n_1, \ldots, n_r$ members, respectively. We can then form the collection $A_1 \times \cdots \times A_r$ of ordered $r$-tuples $(a_1, \ldots, a_r)$ where each $a_i$ belongs to $A_i$, $i = 1, \ldots, r$. In this case,
$$|A_1 \times \cdots \times A_r| = n_1 \times n_2 \times \cdots \times n_r. \tag{1.6}$$
The proof of this result requires a mathematical induction argument. The essential step in the argument is as follows if we agree that the ordered $r$-tuple $(a_1, \ldots, a_r)$ is the same as the ordered pair $((a_1, \ldots, a_{r-1}), a_r)$. By the induction argument, the number of ordered $(r-1)$-tuples $(a_1, \ldots, a_{r-1})$ is $n_1 \times \cdots \times n_{r-1}$, and it follows from Equation 1.5 that the number of ordered $r$-tuples is $(n_1 \times \cdots \times n_{r-1}) \times n_r$.

In the particular case that all of the $A_1, \ldots, A_r$ are the same collection $A$ and $|A| = n$, the number of ordered $r$-tuples $(a_1, \ldots, a_r)$ where each $a_i$ belongs to $A$ is given by
$$|\underbrace{A \times \cdots \times A}_{r \text{ times}}| = n^r. \tag{1.7}$$
Ordered $r$-tuples of the type just described have another name in probability theory. The collection $A$ is called a population and the ordered $r$-tuple $(a_1, \ldots, a_r)$ is called an ordered sample of size $r$ with replacement from a population $A$ of size $n$. Such an ordered sample can be thought of as being formed by successively selecting $r$ elements from $A$ with each element being returned to $A$ before the next element is chosen.

Theorem 1.3.1 The number of ordered samples of size $r$ with replacement from a population of size $n$ is $n^r$.

EXAMPLE 1.3 Suppose a die is rolled three times in succession. The outcome can be regarded as an ordered sample of size 3 with replacement from the population $A = \{1, 2, 3, 4, 5, 6\}$. In this case $n = 6$ and $r = 3$, and so the total number of outcomes is $6^3 = 216$. ■

EXAMPLE 1.4 Two dice are thrown 24 times in succession. The outcome can be described as an ordered sample of size 24 from a population of size 36 with replacement. The total number of outcomes is $36^{24}$.
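Theorem 1.3.1 can be checked by brute force for small cases. The Python sketch below is an added illustration with hypothetical names, not the author's method: it builds every ordered sample with replacement using `itertools.product` and compares the count with $n^r$. For Example 1.4 the list of outcomes is far too long to build, so only the count is computed.

```python
from itertools import product


def ordered_samples_with_replacement(population, r):
    """All ordered r-tuples (a_1, ..., a_r) with each a_i in the population."""
    return list(product(population, repeat=r))


die = [1, 2, 3, 4, 5, 6]

# Example 1.3: three rolls of one die, n = 6 and r = 3.
three_rolls = ordered_samples_with_replacement(die, 3)

# Equation 1.5 with A = B = die: ordered pairs for one throw of two dice.
two_dice = ordered_samples_with_replacement(die, 2)

# Example 1.4: 24 throws of two dice; count only, without listing.
count_example_1_4 = len(two_dice) ** 24

if __name__ == "__main__":
    print(len(three_rolls), 6 ** 3)   # both counts are 216
    print(len(two_dice))              # 36
    print(count_example_1_4)          # the exact integer 36**24
```

Python integers have arbitrary precision, so the last value is exact rather than a floating-point approximation.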
■

If in forming an ordered sample from a population $A$ we choose not to return an element to $A$, then we obtain an ordered sample of size $r$ without replacement from a population $A$ of size $n$.

Theorem 1.3.2 The number of ordered samples of size $r$ without replacement from a population of size $n$ is $n(n - 1) \times \cdots \times (n - r + 1)$.

Again a mathematical induction argument is needed to prove this result. Simply put, there are $n$ choices for the first member of the sample, $n - 1$ choices for the second, and upon making the $r$th choice there are only $n - (r - 1) = n - r + 1$ choices for the $r$th member of the sample, and so the total number of choices is $n(n - 1) \times \cdots \times (n - r + 1)$. Because the latter product arises quite frequently, it is convenient to introduce a symbol for it; namely,
$$(n)_r = n(n - 1) \times \cdots \times (n - r + 1). \tag{1.8}$$
Note that $(n)_n = n(n - 1) \times \cdots \times 2 \times 1 = n!$.

EXAMPLE 1.5 The game of solitaire is played with a deck of 52 cards. The game commences with 28 cards placed on a table in a prescribed order as drawn from the deck and constitutes an ordered sample of size 28 without replacement from a population of size 52. The number of such samples is $(52)_{28} = 52 \times 51 \times \cdots \times 25$. ■

In some cases, order is irrelevant. For example, it is not necessary to hold the cards of a poker hand in the order in which they are dealt from a deck. An unordered sample of size $r$ from a population $A$ of size $n$ is just a subpopulation of $A$ having $r$ members. $C(n, r)$ will denote the number of such unordered samples. Such a sample is also called a combination of $n$ things taken $r$ at a time.

Theorem 1.3.3 The number of unordered samples of size $r$ from a population of size $n$ is
$$C(n, r) = \frac{(n)_r}{r!} = \frac{n(n - 1) \times \cdots \times (n - r + 1)}{r!}.$$
A convincing argument can be made as follows. The number of ordered samples of size $r$ without replacement from the population of size $n$ is $(n)_r$.
Each such ordered sample can be obtained by first selecting an unordered sample of size $r$ from the population, which can be done in $C(n, r)$ ways, and then taking an ordered sample of size $r$ without replacement from the subpopulation of size $r$. Since the latter can be done in $r!$ ways, $(n)_r = C(n, r) \times r!$.

EXAMPLE 1.6 Suppose a poker hand of 5 cards is dealt from a well-shuffled deck of 52 cards. Since a poker hand can be regarded as unordered, the total number of poker hands is $C(52, 5) = 2{,}598{,}960$. ■

Note that
$$C(n, r) = \frac{n(n - 1) \times \cdots \times (n - r + 1)}{r!} = \frac{n!}{r!\,(n - r)!}$$
since $n! = (n(n - 1) \times \cdots \times (n - r + 1)) \times (n - r)!$. As a matter of notational convenience, $0!$ is equal to 1 by definition and $C(n, 0) = 1$ if calculated formally using the last displayed equation. Another commonly used notation for $C(n, r)$ is $\binom{n}{r}$. The two will be used interchangeably. It is implicit in the above definition of $C(n, r)$ that $0 \le r \le n$. Again as a matter of notational convenience, we will put $C(n, r) = \binom{n}{r} = 0$ if $r < 0$ or if $r > n$. The $\binom{n}{r}$ are called binomial coefficients because of their association with the binomial theorem:
$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \tag{1.9}$$
This theorem can be used to derive useful relationships connecting the coefficients. For example, putting $a = b = 1$,
$$2^n = \sum_{k=0}^{n} \binom{n}{k} = \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{n}. \tag{1.10}$$

EXAMPLE 1.7 If $n$ is a positive integer with $n \ge 2$, then
$$\binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots \pm \binom{n}{n} = 0,$$
where the coefficient of $\binom{n}{n}$ is $+$ or $-$ as $n$ is even or odd, respectively. This can be seen as follows. Taking $b = 1$ in Equation 1.9,
$$(a + 1)^n = \sum_{k=0}^{n} \binom{n}{k} a^k.$$
Taking $a = -1$,
$$0 = (-1 + 1)^n = \sum_{k=0}^{n} (-1)^k \binom{n}{k}.$$
■

EXAMPLE 1.8 If $m$ and $n$ are positive integers and $t$ is any real number, then
$$(1 + t)^{m+n} = (1 + t)^m (1 + t)^n.$$
Applying the binomial theorem three times,
$$\sum_{k=0}^{m+n} \binom{m+n}{k} t^k = \left(\sum_{i=0}^{m} \binom{m}{i} t^i\right)\left(\sum_{j=0}^{n} \binom{n}{j} t^j\right) = \sum_{i=0}^{m} \sum_{j=0}^{n} \binom{m}{i} \binom{n}{j} t^{i+j}.$$
Collecting terms with a factor of $t^k$,
$$\sum_{k=0}^{m+n} \binom{m+n}{k} t^k = \sum_{k=0}^{m+n} \left(\sum_{i=0}^{k} \binom{m}{i} \binom{n}{k-i}\right) t^k.$$
Equating corresponding coefficients of $t^k$, it follows that
$$\binom{m+n}{k} = \sum_{i=0}^{k} \binom{m}{i} \binom{n}{k-i}. \tag{1.11}$$
■

Returning to Equation 1.8, note that the right side of the equation makes sense if $n$ is any real number. For any real number $x$ and any positive integer $r$, we define
$$(x)_r = x(x - 1) \times \cdots \times (x - r + 1)$$
and
$$C(x, r) = \binom{x}{r} = \frac{(x)_r}{r!} = \frac{x(x - 1) \times \cdots \times (x - r + 1)}{r!}.$$
We also define $(x)_0 = 1$ and $C(x, 0) = \binom{x}{0} = 1$; for any negative integer $r$, we define $\binom{x}{r} = 0$.

EXAMPLE 1.9 If $r$ is a nonnegative integer, then $\binom{-1}{r} = (-1)^r$ since
$$\binom{-1}{r} = \frac{(-1)(-2) \times \cdots \times (-1 - r + 1)}{r!} = \frac{(-1)^r\, r!}{r!} = (-1)^r.$$
■

There are many more equations relating to binomial coefficients. The reader interested in pursuing the subject further should consult the books by Feller and Tucker (see the Supplementary Reading List at the end of the chapter).

It was tacitly assumed in the preceding discussions that the elements of a sample are distinguishable. But there are probability models in physics in which some elementary particles behave as though they are indistinguishable. A general scheme for dealing with such particles can be described as follows. Consider $r$ indistinguishable balls and $n$ distinguishable boxes numbered $1, 2, \ldots, n$. If the balls are distributed among the boxes in some way, the result is called a configuration. If a ball in Box 1 is interchanged with a ball in Box 2, the configuration does not change. The total number of configurations can be calculated using the following device. The label

* | * | *** | ** | *

signifies that there are a total of five boxes with one ball in Box 1, one ball in Box 2, three balls in Box 3, two balls in Box 4, and one ball in Box 5, which is to the right of the last vertical bar. In general, there are $(n - 1) + r = n + r - 1$ symbols in the label because the number of vertical bars is one less than the number $n$ of boxes.
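The label device can be made concrete with a short Python sketch. It is an illustration added here, and the helper names `label`, `counts_from_label`, and `configurations` are hypothetical. The sketch writes the label for a given assignment of ball counts to boxes (without the spaces used in the printed label), confirms that the label has $n + r - 1$ symbols, and lists all configurations for a small case by brute force so that the count can be compared with the formula derived in the text.

```python
from itertools import product


def label(counts):
    """Stars-and-bars label for the box occupancy counts, e.g. (1, 1, 3, 2, 1)."""
    return "|".join("*" * c for c in counts)


def counts_from_label(text):
    """Recover the number of balls in each box from a label."""
    return tuple(len(part) for part in text.split("|"))


def configurations(n_boxes, r_balls):
    """Brute-force list of all ways to put r indistinguishable balls in n boxes."""
    return [c for c in product(range(r_balls + 1), repeat=n_boxes)
            if sum(c) == r_balls]


if __name__ == "__main__":
    counts = (1, 1, 3, 2, 1)            # the configuration used in the text
    text = label(counts)
    n, r = len(counts), sum(counts)
    print(text)                          # *|*|***|**|*
    print(len(text) == n + r - 1)        # True: r asterisks and n - 1 bars
    print(counts_from_label(text) == counts)
    print(len(configurations(3, 2)))     # small case: 3 boxes, 2 balls
```

The brute-force listing grows quickly with the number of boxes and balls, so it is only meant for checking small cases by hand.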
A label is completely specified if r of the n + r - 1 positions are selected to be filled by asterisks; i.e., by selecting a subpopulation of size r. The total number of ways of selecting such subpopulations is C(n + r - 1, r).

Theorem 1.3.4 The total number of ways of distributing r indistinguishable balls into n boxes is $\binom{n+r-1}{r}$.

EXAMPLE 1.10 Suppose two dice, indistinguishable to the naked eye, are tossed. To determine the total number of possible configurations, consider boxes numbered 1, 2, ..., 6 and consider the dice as balls that are placed into the boxes. In this case, n = 6 and r = 2, so that the total number of configurations is $\binom{6+2-1}{2} = 21$. ■

Dice do not behave as though they are indistinguishable; in fact, two dice behave as though there are 36 outcomes with each having the same probability.

EXERCISES 1.3

The reader should review Maclaurin and Taylor series expansions before starting on these problems.

1. If m and n are positive integers with $0 < n \le m$, show that $\binom{m}{n} = \binom{m}{m-n}$ by (a) expressing both sides in terms of factorials and (b) interpreting each side as the number of ways of selecting a subpopulation.

2. If n and r are positive integers with $1 \le r \le n$, show that C(n, r) = C(n - 1, r) + C(n - 1, r - 1). (This equation validates the triangular array

            1
          1   1
        1   2   1
      1   3   3   1
    1   4   6   4   1

commonly called "Pascal's Triangle" in Western cultures.)

3. If n is a positive integer, show that the Maclaurin series expansion of the function $f(t) = (1 + t)^n$ is
$$f(t) = \sum_{k=0}^{n} \binom{n}{k} t^k.$$
How does this result relate to the binomial theorem?

4. If $\alpha$ is any real number, show that the Maclaurin series expansion of the function $f(t) = (1 + t)^\alpha$ is
$$(1 + t)^\alpha = \sum_{k=0}^{\infty} \binom{\alpha}{k} t^k$$
(which is valid for |t| < 1).

5. If n is a positive integer, use the binomial expansion of $(1 + t)^n$ to show that
$$n 2^{n-1} = \sum_{k=1}^{n} k \binom{n}{k} = \binom{n}{1} + 2\binom{n}{2} + \cdots + n\binom{n}{n}.$$

6. If n is a positive integer with $n \ge 2$, show that
$$n(n-1) 2^{n-2} = 2 \cdot 1 \binom{n}{2} + 3 \cdot 2 \binom{n}{3} + \cdots + n(n-1) \binom{n}{n}.$$

7.
A die is tossed n times in succession. What is the probability that a 1 will not appear?

8. A coin is flipped 2n times in succession. What is the probability that the number of heads and tails will be equal?

9. A rectangular box in 3-space is subdivided into 2n congruent rectangular boxes numbered 1, 2, ..., 2n. (a) If n indistinguishable particles are distributed in the 2n boxes, what is the total number of configurations? (b) If all configurations have the same probability of occurrence, what is the probability that boxes numbered 1, 2, ..., n will be empty?

10. If x > 0 and k is a nonnegative integer, show that

The remaining problems are too tedious to do manually. Mathematical software such as Mathematica or Maple V is appropriate.

11. A coin is flipped 20 times in succession. Find the probability, accurate to six decimal places, that the number of heads and tails will be equal.

12. Suppose 20 dice, indistinguishable to the naked eye, are tossed. What is the total number of possible configurations?

EQUALLY LIKELY CASE

This section will address only probability models of the type described in the previous section. For such models, calculating probabilities consists of two steps: counting the total number of outcomes and counting the number of outcomes in a given collection. In doing so it is important either to make a complete list of all outcomes or to give a precise mathematical description of all such outcomes.

Consider an experiment in which two dice, one red and one white, are tossed simultaneously. The outcome of the experiment is a complicated picture that can be recorded only partially by a camera. It is not necessary to go that far, however, since we are interested only in the number of pips showing on the two dice, and that information can be summarized by creating a name or label for it; e.g., the ordered pair (i, j), $1 \le i, j \le 6$, can be used as a label for the outcome in which the red die shows i and the white die shows j.
Each such outcome is an ordered sample of size 2 with replacement from a population of size 6. The total number of such outcomes is $6^2 = 36$. The collection of all 36 labels is shown in Figure 1.2.

    (1,1) (2,1) (3,1) (4,1) (5,1) (6,1)
    (1,2) (2,2) (3,2) (4,2) (5,2) (6,2)
    (1,3) (2,3) (3,3) (4,3) (5,3) (6,3)
    (1,4) (2,4) (3,4) (4,4) (5,4) (6,4)
    (1,5) (2,5) (3,5) (4,5) (5,5) (6,5)
    (1,6) (2,6) (3,6) (4,6) (5,6) (6,6)

FIGURE 1.2 Outcomes for two dice.

In tossing two dice, we are not usually interested in the number of pips on each die but rather in the sum of the two numbers; i.e., the score. For example, consider the score of 4. We can identify the outcomes with a score of 4 as those in the third diagonal from the upper left corner in Figure 1.2; namely, (1,3), (2,2), and (3,1). If A is the collection of these outcomes, then P(A) = 3/36 = 1/12. In general, if x is one of the scores 2, 3, ..., 12 and p(x) denotes the probability of the collection of outcomes having the score x, then p(x) can be calculated in the same way. The results are shown in Figure 1.3:

FIGURE 1.3 Scores for two dice.

Coin flipping is an experiment that most people have performed. Consider an experiment in which a coin is flipped n times in succession. An outcome of this experiment can be labeled by an n-letter word using an alphabet made up of T and H (or 0 and 1). For example, TTHTHH is the label for an outcome of an experiment of flipping a coin six times in succession with tails occurring on the first, second, and fourth flips and heads appearing on the remaining flips. If n is large it is impractical to make a list of all outcomes, but we can count the total number of outcomes because an outcome is an ordered sample of size n with replacement from a population {T, H} of size 2. Thus, the total number of outcomes is $2^n$ by Theorem 1.3.1.

EXAMPLE 1.11 Suppose a coin is flipped 10 times in succession. The total number of outcomes is $2^{10} = 1024$.
What are the chances that there will be three heads in the outcome? To answer this question, let A be the collection of outcomes having three heads. Since each outcome has the same probability assigned to it, we need only count the number of outcomes having three heads. A label for an outcome consists of 10 letter positions that are filled by H's or T's. There are C(10, 3) ways of selecting three positions to be filled with H's and the remaining seven positions with T's. Thus,
$$P(A) = \frac{\binom{10}{3}}{2^{10}} = \frac{15}{128}.$$
■

Notice in the wording of the question posed in this example that "three heads" is used rather than the "exactly three heads" that is commonly used in elementary algebra books. Three means exactly three; the prefix "exactly" is redundant.

In both the two-dice and the coin-flipping experiments, the outcomes were regarded as ordered samples with replacement. The following example illustrates counting ordered samples without replacement.

EXAMPLE 1.12 (The Birthday Problem) Consider a class of 30 students. Each student has a birthday that can be any one of the days numbered 1, 2, ..., 365. Assume that the 30 birthdays of the students constitute an ordered sample of size 30 with replacement from a population of size 365 and that all outcomes have the same chance of occurring. What are the chances that no two of them will have the same birthday? Let A be the collection of outcomes for which there are no repetitions of birthdays; i.e., A consists of the ordered samples of size 30 without replacement from a population of size 365. Thus, $|A| = (365)_{30}$ and
$$P(A) = \frac{(365)_{30}}{365^{30}},$$
which is equal to .29 rounded to two decimal places. Thus, it is unlikely that no two will have the same birthday. ■

A poker hand can be considered an ordered sample without replacement or an unordered sample as far as calculating probabilities is concerned, provided there is total adherence to whichever of the two is adopted. It is customary to consider poker hands as unordered samples.
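The two-step recipe of this section (count all outcomes, then count those in the collection of interest) can be checked on the two-dice, coin-flipping, and birthday examples. The following Python sketch is one way to do so; the helper names `probability` and `falling` are our own, and the direct enumeration is practical only because these outcome lists are small.

```python
from fractions import Fraction
from itertools import product

def probability(event, outcomes):
    """Equally likely case: outcomes in the event divided by all outcomes."""
    outcomes = list(outcomes)
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

def falling(n, r):
    """(n)_r = n(n - 1) x ... x (n - r + 1), with (n)_0 = 1."""
    result = 1
    for i in range(r):
        result *= n - i
    return result

# Two dice: ordered samples of size 2 with replacement from {1, ..., 6}.
dice = product(range(1, 7), repeat=2)
print(probability(lambda w: sum(w) == 4, dice))         # 1/12

# Ten coin flips: words of length 10 over the alphabet {H, T}.
flips = product("HT", repeat=10)
print(probability(lambda w: w.count("H") == 3, flips))  # 15/128

# Birthday problem: 365^30 outcomes is far too many to list,
# so the count (365)_30 is used instead of enumeration.
p_no_match = Fraction(falling(365, 30), 365 ** 30)
print(round(float(p_no_match), 2))                      # 0.29
```

Exact fractions are used so that the results can be compared directly with the values in the examples.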
In counting outcomes, it is important not to introduce order.

EXAMPLE 1.13 Suppose a poker hand of 5 cards is dealt from a well-shuffled deck of 52 playing cards. What is the probability of getting a royal flush; i.e., 10, J, Q, K, A of the same suit? Regarding a poker hand as an unordered sample of size 5 from a population of size 52, the total number of outcomes is $\binom{52}{5}$. Let A be the collection of outcomes that are royal flushes. We can form a royal flush in the following way. We first select a suit from among the four suits, which can be done in four ways; having selected the suit, the royal flush is then completely determined. Thus, $P(A) = 4 / \binom{52}{5} = .0000015$. ■

A common mistake is to introduce order into the following example where there should be none.

EXAMPLE 1.14 Consider a poker hand as described in the previous example. What is the probability of getting two pairs; i.e., a hand of the type {x, x, y, y, z} where x, y, and z are distinct face values? There are 13 face values. We first choose a subpopulation of size 2, which can be done in $\binom{13}{2}$ ways, to specify the face value for each of the pairs. We then choose the face value for the singleton card from among the remaining 11 face values, which can be done in 11 ways. All face values have now been selected. Since there are four cards having the face value of the singleton, there are four choices for the singleton. We now go to the lower of the face values of the two selected for the pairs. Since there are four cards having that face value, we select a subpopulation of size 2, which can be done in $\binom{4}{2}$ ways. Having done this we now select a subpopulation of size 2 from the four cards having the other face value for a pair, which also can be done in $\binom{4}{2}$ ways. If A is the collection of outcomes that have two pairs, then
$$P(A) = \frac{\binom{13}{2} \times 11 \times 4 \times \binom{4}{2} \times \binom{4}{2}}{\binom{52}{5}},$$
which is approximately 1/20.
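The count in Example 1.14 can be cross-checked by listing hands. In the following Python sketch, `two_pair_count` applies the counting argument of the example and `two_pair_brute_force` enumerates unordered hands and classifies them; both function names are our own. To keep the enumeration quick, the comparison is run on a hypothetical reduced deck with 6 face values rather than 13, which is an assumption made only for speed.

```python
from collections import Counter
from itertools import combinations
from math import comb

def two_pair_count(values, suits=4):
    """Counting argument of Example 1.14: choose the two pair values, the
    singleton value, the singleton card, then two cards for each pair."""
    return comb(values, 2) * (values - 2) * suits * comb(suits, 2) ** 2

def two_pair_brute_force(values, suits=4):
    """Enumerate every unordered 5-card hand; keep those of type {x, x, y, y, z}."""
    deck = [(v, s) for v in range(values) for s in range(suits)]
    hits = 0
    for hand in combinations(deck, 5):
        pattern = sorted(Counter(v for v, _ in hand).values())
        if pattern == [1, 2, 2]:
            hits += 1
    return hits

# Reduced deck: 6 face values x 4 suits = 24 cards.
assert two_pair_count(6) == two_pair_brute_force(6)

# Full deck, using the counting argument only.
p = two_pair_count(13) / comb(52, 5)
print(two_pair_count(13), round(p, 4))  # 123552 0.0475
```

Replacing `comb(values, 2)` by `values * (values - 1)` doubles the count and makes the comparison fail, which is the ordering mistake the example warns against.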
■

It might appear that order was introduced into this calculation when we chose to look first at the pair with the lower face value, but the order was already there once the two face values were chosen. In choosing the face values for the two pairs, it would have been incorrect to say that this could be done in 13 × 12 ways, because this would regard "a pair of jacks and a pair of kings" as different from "a pair of kings and a pair of jacks".

An unordered sample of size r from a population of size n is called a random sample if each sample has the same probability $1/\binom{n}{r}$ of occurring. A poker hand is a random sample of size 5 from a population of size 52.

We will conclude this section on counting by looking at a commonly used sampling model. Consider a population consisting of $n_1$ Type 1 individuals and $n_2$ Type 2 individuals. The population size is then $n = n_1 + n_2$. Suppose a random sample of size r is selected from the population. Since the population contains individuals of both types, we can ask for the probability that the random sample will contain k Type 1 individuals where $0 \le k \le r$. Of course, k cannot exceed the number of Type 1 individuals, so we must also have $k \le n_1$; i.e., $0 \le k \le \min\{r, n_1\}$. Let A be the collection of samples having k individuals of Type 1. Then
$$P(A) = \frac{\binom{n_1}{k} \binom{n_2}{r-k}}{\binom{n}{r}}$$
for $0 \le k \le \min\{r, n_1\}$.

EXAMPLE 1.15 On a given day, a machine produces 100 items. Assuming that 10 of the items are defective, what is the probability that a random sample of size 5 from the output will contain 3 defective items? Let A be the collection of samples having 3 defective items. Then
$$P(A) = \frac{\binom{10}{3} \binom{90}{2}}{\binom{100}{5}}.$$
■

More generally, suppose a population of size n contains $n_1$ individuals of Type 1, $n_2$ individuals of Type 2, ..., $n_k$ individuals of Type k. If a random sample of size r is taken from the population, what is the probability that the random sample will contain $r_1$ Type 1 individuals, $r_2$ Type 2 individuals, ..., $r_k$ Type k individuals? Let A be the collection of such samples.
Then
$$P(A) = \frac{\binom{n_1}{r_1} \times \cdots \times \binom{n_k}{r_k}}{\binom{n}{r}}$$
where $r = r_1 + r_2 + \cdots + r_k$.

EXAMPLE 1.16 If a bridge hand of 13 cards is dealt from a well-shuffled deck of 52 playing cards, what is the probability that the hand will contain three hearts, five diamonds, two spades, and three clubs? Since the hand is a random sample of size 13 from a population of size 52, the total number of outcomes is $\binom{52}{13}$. Let A be the collection of samples as described. Then
$$P(A) = \frac{\binom{13}{3} \binom{13}{5} \binom{13}{2} \binom{13}{3}}{\binom{52}{13}}.$$
■

EXERCISES 1.4

In sampling problems, the student should first decide whether the sample is unordered or ordered and, in the latter case, whether with replacement or without replacement.

1. Instead of the usual dice, consider two (regular) tetrahedral dice with faces bearing 1, 2, 3, 4 pips. If the two tetrahedral dice are rolled simultaneously, find the probability p(x) that the total score will be x where x can be one of the integers 2, ..., 8.

2. If three tetrahedral dice are rolled simultaneously, find the probability p(x) that the total score will be x where x can be one of the integers 3, ..., 12.

3. If three cubical dice are rolled simultaneously, find the probability p(x) that the total score will be x where x can be one of the integers 3, 4, ..., 18.

4. If you purchase a single ticket for a lottery in which a random sample of size 6 is selected from the population {1, 2, ..., 54}, what is the probability that you hold the winning ticket?

5. In some state lotteries, a winning ticket must have six numbers between 1 and 48 listed in the same order as the numbers were successively drawn at random without replacement. What is the probability that the purchaser of a single ticket will hold the winning ticket?

6. If 1000 raffle tickets are sold, of which 50 are winning tickets and you purchase 10 tickets, what is the probability that you will have 2 winning tickets?

7. In a group of four people, what is the probability that no two will have the same birth month?

8.
If a poker hand of 5 cards is dealt from a well-shuffled deck of 52 playing cards, what is the probability of getting a full house (i.e., 3 cards with the same face value and 2 cards with the same face value)?

9. If a poker hand of 5 cards is dealt from a well-shuffled deck of 52 playing cards, what is the probability of getting a straight flush (i.e., 5 cards in sequence in the same suit with the ace counting as a 1 or as the highest card)?

10. In a fish-tagging survey, 100 bass are netted, tagged, and released. After waiting long enough for the tagged fish to disperse, a second sample of 100 bass is taken, of which 5 are observed to be tagged. If the number of bass in the lake is n, what is the probability that a random sample of size 100 will contain 5 tagged fish? If you were asked to estimate the number of bass in the lake, what would you estimate?

OTHER MODELS

In the early stages of probability theory, a controversy arose between M. de Roberval and Blaise Pascal over the assignment of equal probabilities to outcomes. The basic issue of the controversy can be described as follows. Suppose a coin is flipped until a head appears with a maximum of two flips. It was argued by M. de Roberval that the outcomes H, TH, TT are equally likely and each should be assigned probability 1/3; Pascal, however, reasoned that they should be assigned the probabilities 1/2, 1/4, and 1/4, respectively, on the grounds that the coin could be flipped twice and the result of the second flip simply ignored after getting a head on the first flip; thus, the two outcomes HH and HT would have probabilities adding to 1/2.

Whether or not outcomes should be assigned equal probabilities in the case of simple games of chance depends on what one calls an outcome. Consider, for example, rolling two dice.
If we declare the score obtained an outcome, then the possible outcomes are 2, 3, ..., 12, and we have previously seen that these outcomes should not be assigned equal probabilities but rather those given in Figure 1.3. This suggests that we should have available a more general model.

Consider an experiment with a finite number of outcomes $\omega_1, \omega_2, \ldots, \omega_n$ and let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$. For each i = 1, 2, ..., n, let $p(\omega_i)$ be a weight associated with $\omega_i$ satisfying

(i) $0 \le p(\omega_i) \le 1$.
(ii) $\sum_{i=1}^{n} p(\omega_i) = 1$.

The weight $p(\omega_i)$ will be called the probability of $\omega_i$. If $A = \{\omega_{i_1}, \ldots, \omega_{i_k}\}$ is a collection of outcomes, we define
$$P(A) = \sum_{j=1}^{k} p(\omega_{i_j});$$
i.e., P(A) is the sum of the weights of the outcomes in A. It can be seen that Equation 1.1 is satisfied as follows. If $A = \{\omega_{i_1}, \ldots, \omega_{i_k}\}$ and $B = \{\omega_{j_1}, \ldots, \omega_{j_\ell}\}$ have no outcomes in common, then $A \cup B$ consists of the outcomes $\omega_{i_1}, \ldots, \omega_{i_k}, \omega_{j_1}, \ldots, \omega_{j_\ell}$ and $P(A \cup B)$ is the sum of the weights associated with the latter outcomes. Thus,
$$P(A \cup B) = p(\omega_{i_1}) + \cdots + p(\omega_{i_k}) + p(\omega_{j_1}) + \cdots + p(\omega_{j_\ell}) = [p(\omega_{i_1}) + \cdots + p(\omega_{i_k})] + [p(\omega_{j_1}) + \cdots + p(\omega_{j_\ell})] = P(A) + P(B).$$
Similarly, Equations 1.2 and 1.3 are satisfied.

EXAMPLE 1.17 Consider an experiment for which the outcome can be described by a four-letter word using the alphabet 0, 1. If $\omega$ is such an outcome, a weight $p(\omega)$ can be associated with $\omega$ by forming a product in which each 1 in $\omega$ is replaced by 1/3 and each 0 by 2/3. For example, if $\omega = 1110$, then $p(\omega) = 1/3 \cdot 1/3 \cdot 1/3 \cdot 2/3 = (1/3)^3 (2/3)^1$. Note that the exponent of 1/3 is just the sum of the digits in $\omega$ and the exponent of 2/3 is 4 minus the sum of the digits in $\omega$. There are 16 outcomes and 16 associated weights. It is tedious to do so, but the 16 outcomes and weights can be listed and the sum of the weights shown to be 1. ■

EXAMPLE 1.18 (n Bernoulli Trials) Fix 0 < p < 1. Let q = 1 - p and let n be any positive integer. Let $\Omega$ be the collection of all words of length n using the alphabet {0, 1}.
We can think of 0 and 1 as an encoding of failure and success or tail and head, respectively, in n repetitions of a basic experiment in which the probability of success is p and the probability of failure is q. If $\omega = \{x_j\}_{j=1}^{n}$ is an element of $\Omega$, we associate with $\omega$ the weight
$$p(\omega) = p^{\sum_{j=1}^{n} x_j}\, q^{\,n - \sum_{j=1}^{n} x_j}.$$
Clearly, $p(\omega) \ge 0$ for each $\omega \in \Omega$. We need only verify that the sum of all the weights is 1. Each outcome $\omega = \{x_j\}_{j=1}^{n}$ such that $\sum_{j=1}^{n} x_j = k$ has associated weight $p^k q^{n-k}$. The number of such outcomes is equal to the number of ways of selecting k of the n letter positions to be filled with 1's, which is $\binom{n}{k}$. Thus, the sum of the weights of outcomes with $\sum_{j=1}^{n} x_j = k$ is $\binom{n}{k} p^k q^{n-k}$. If we now add the sums of these weights for k = 0, ..., n, we obtain by the binomial theorem
$$\sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} = (p + q)^n = 1.$$
This model goes by the name n Bernoulli trials. We have just seen that if we let $A_k$ be the collection of outcomes having k successes, then $P(A_k) = \binom{n}{k} p^k q^{n-k}$. ■

EXAMPLE 1.19 A distributor plans to use an optical character recognition scanner to transfer the contents of a catalog of parts to a computer. Each part has a 12-digit part number and a 13-digit stock number. The probability that the scanner will misread a digit depends upon the digit being scanned; e.g., it is more likely that an 8 will be misread as a 3 than as a 1. Assuming that the maximum probability that a digit will be misread is .01, what is the probability that the part number and stock number will be recorded without error? We can view this experiment as a succession of 25 trials in which success is interpreted to mean that a digit is read correctly and failure is interpreted to mean that a digit is misread. Assuming that probabilities are assigned in accordance with the Bernoulli model with p = .99, let A be the collection of outcomes having no misreads. A consists of a single outcome with probability $P(A) = (.99)^{25} = .78$.
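The Bernoulli weights of Example 1.18 can be checked by listing every word for a small n. In the following Python sketch, the function names `weight` and `prob_k_successes` and the choices n = 8 and p = 0.3 are our own, made only for illustration; floating-point arithmetic is used, so the comparisons allow for rounding.

```python
from itertools import product
from math import comb, isclose

def weight(word, p):
    """Weight p^k q^(n-k) of a 0/1 word with k ones, where q = 1 - p."""
    k = sum(word)
    return p ** k * (1 - p) ** (len(word) - k)

def prob_k_successes(n, k, p):
    """P(A_k): add the weights of all words of length n with exactly k ones."""
    return sum(weight(w, p) for w in product((0, 1), repeat=n) if sum(w) == k)

n, p = 8, 0.3

# The weights of all 2^n words add to 1.
total = sum(weight(w, p) for w in product((0, 1), repeat=n))
assert isclose(total, 1.0)

# Summing weights over A_k agrees with C(n, k) p^k q^(n-k).
for k in range(n + 1):
    assert isclose(prob_k_successes(n, k, p), comb(n, k) * p ** k * (1 - p) ** (n - k))

# Example 1.19: 25 digits, each read correctly with probability .99.
print(round(0.99 ** 25, 2))  # 0.78
```

The same check with n = 4 and p = 1/3 covers the 16 weights of Example 1.17.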
■

EXAMPLE 1.20 Consider an experiment in which a coin is flipped until a head appears for the first time with a maximum of five flips. The outcomes can be labeled H, TH, TTH, TTTH, TTTTH, TTTTT with weights P(H) = 1/2, P(TH) = 1/4, P(TTH) = 1/8, P(TTTH) = 1/16, P(TTTTH) = 1/32, and P(TTTTT) = 1/32. ■

The weights attached to the outcomes in this example were constructed using Pascal's line of reasoning.

The previous example suggests a coin-flipping experiment in which a coin is flipped until a head appears for the first time, at which time the experiment terminates. The outcomes of this experiment can be described as an infinite sequence of labels H, TH, TTH, TTTH, .... Does this list describe all outcomes? What about the possibility that a head never appears? We can include or not include an unending label TTT... for this possibility as we choose. We will not include such a label on the grounds that in any instance in which this experiment has been performed, the experiment terminates in a finite number of steps. By analogy with the previous example, we can assign weights as follows: p(H) = 1/2, p(TH) = 1/4, p(TTH) = 1/8, .... Note that even if we included the outcome with label TTT..., there would be no weight left for it because
$$p(H) + p(TH) + \cdots = \sum_{n=1}^{\infty} \frac{1}{2^n}$$
and the sum of this geometric series is 1.

This model suggests an even more general model. Consider an experiment for which there is an infinite sequence of outcomes $\omega_1, \omega_2, \ldots$. Let $\Omega = \{\omega_1, \omega_2, \ldots\}$, and for each $i \ge 1$ let $p(\omega_i)$ be a weight associated with $\omega_i$ satisfying

(i) $0 \le p(\omega_i) \le 1$.
(ii) $\sum_{i=1}^{\infty} p(\omega_i) = 1$.

If $A = \{\omega_{i_1}, \ldots, \omega_{i_m}\}$ is any finite subcollection of outcomes, we define
$$P(A) = \sum_{k=1}^{m} p(\omega_{i_k});$$
if $A = \{\omega_{i_1}, \omega_{i_2}, \ldots\}$ is an infinite sequence of outcomes, we define
$$P(A) = \sum_{k=1}^{\infty} p(\omega_{i_k}).$$
Rather than making the distinction between finite sums and infinite sums as in the last two equations, we usually just write $P(A) = \sum p(\omega_{i_k})$, the range of k being clear from the description of A.
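The coin-flipping weights 1/2, 1/4, 1/8, ... give a concrete instance of this model. The following Python sketch uses exact fractions to form P(A) for a finite collection and to approximate P(A) for an infinite one by a partial sum. The names `weight` and `prob`, the cutoff N = 60, and the choice of the event "the first head appears on an odd-numbered flip" are our own illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def weight(n):
    """Weight of the outcome whose first head is on flip n (n - 1 tails, then H)."""
    return Fraction(1, 2 ** n)

def prob(indices):
    """P(A) for the collection A of outcomes indexed by `indices` (a finite sum)."""
    return sum(weight(n) for n in indices)

# Finite collection A = {H, TH, TTH}.
print(prob([1, 2, 3]))  # 7/8

# The weight not yet used after N outcomes is exactly 1/2^N,
# so the partial sums of the weights approach 1.
N = 60
print(1 - prob(range(1, N + 1)) == weight(N))  # True

# Infinite collection, approximated by a partial sum:
# first head on flip 1, 3, 5, ...
odd = prob(range(1, N + 1, 2))
print(float(odd))
```

The last value is a partial sum of a geometric series with ratio 1/4 and first term 1/2, whose full sum is 2/3.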
Again, Equations 1.1, 1.2, and 1.3 are satisfied.

EXAMPLE 1.21 A pair of dice are rolled until a score of 6 appears for the first time, at which time the experiment is terminated. A typical outcome can be labeled by the word * * ··· * 6 where * represents a score other than 6; if there are n asterisks preceding the 6 with $n \ge 0$, the weight or probability associated with the outcome is
$$p(\underbrace{*\,*\,\cdots\,*}_{n\ \text{times}}\,6) = \left(\frac{31}{36}\right)^{n} \frac{5}{36}.$$
Note that the weights are nonnegative, and since the weights constitute the terms of a convergent geometric series,
$$\sum_{n=0}^{\infty} \left(\frac{31}{36}\right)^{n} \frac{5}{36} = 1.$$
■

The last model we will describe involves the concept of conditional probability. Consider two collections of outcomes A and B associated with an experiment. Before performing the experiment, we have some notion of what P(A) should be. Instead of performing the experiment and observing the outcome, an impartial observer views the outcome and relates only partial information to us; namely, that the outcome is in B. Quite often in a situation like this, we would adjust our estimate of the chance that the outcome is in A. For example, suppose the experiment consists of selecting a person at random from a given population consisting of men and women in equal numbers. Before performing the experiment, the probability that the selected person is a man is 1/2. But if the experiment is performed and an impartial observer tells us only that the selected person is color-blind, then we would adjust our estimate of the probability that the person is a man to be much higher, because color blindness is much more prevalent in men than in women.

To see how probabilities should be changed in the light of partial information, we go back to the empirical law. Suppose the experiment in question is repeated N times. Since the impartial observer conveys information to us when the outcome is in B, we can ignore all repetitions for which the outcome is not in B.
Let $A \cap B$ denote the collection of outcomes that are in both A and B. The number of repetitions for which the outcome is in B is N(B), and among these $N(A \cap B)$ are also in A. The relative frequency of occurrence of outcomes in A among those in B is $N(A \cap B)/N(B)$. This ratio should stabilize near the new probability when N is large. Thus,
$$\frac{N(A \cap B)}{N(B)} = \frac{N(A \cap B)/N}{N(B)/N} \approx \frac{P(A \cap B)}{P(B)}.$$
Of course, P(B) must be positive for the quotient to be defined. This new probability is called the conditional probability of A given B and is denoted by P(A | B). We therefore define
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{1.12}$$
Note that
$$P(A \cap B) = P(A \mid B) P(B). \tag{1.13}$$

EXAMPLE 1.22 Two dice are rolled and we are informed that the score is 6. What is the probability that there is a 3 on each die? Let A be the collection of outcomes for which there is a 3 on each die and let B be the collection of outcomes for which the score is 6. Then $P(A \mid B) = P(A \cap B)/P(B) = (1/36)/(5/36) = 1/5$. ■

Note that the conditional probability can be viewed in the following way. As soon as we are told that the score is 6, we are dealing with the population {(5,1), (4,2), ..., (1,5)}. Since there are only five outcomes in this new population, the probability of the outcome (3,3) is 1/5.

There are probability models for which the probability mechanism is not specified by giving the probability of each outcome but rather by a mixture of such probabilities and conditional probabilities.

EXAMPLE 1.23 Suppose a bowl contains 10 red chips and 5 white chips. An experiment consists of selecting a chip at random from the bowl. If the drawn chip is red, it and 5 other red chips are returned to the bowl; if the chip is white, it is discarded. A second chip is then selected at random from the bowl. What is the probability that both chips will be red? This model is not described in such a way that the probability of each outcome is known; it is described in terms of probabilities of some outcomes and conditional probabilities.
Let B be the collection of outcomes for which the first chip selected is red and let A be the collection for which the second chip is red. Then P(A | B) = 3/4 and P(B) = 2/3 so that $P(B \cap A) = P(A \mid B) P(B) = 1/2$ by Equation 1.13. ■

EXERCISES 1.5

The reader should review infinite series, sums of infinite series, and infinite geometric series before doing the following exercises.

1. Determine the sum of the series $\sum_{n=0}^{\infty} (1/2)^{3n}$.

2. Determine the sum of the series $\sum_{n=4}^{\infty} (1/4)^{2n}$.

3. Suppose a pair of dice are rolled until a score of 7 appears for the first time, whereupon the experiment ends. An outcome with n scores different from 7 followed by a 7 is assigned probability $(5/6)^n (1/6)$, $n \ge 0$. What is the probability that the experiment will terminate on an odd number of rolls of the dice?

4. If a pair of dice are rolled, what is the probability that the score will be greater than or equal to 8?

5. Suppose a coin is flipped 10 times in succession. For i = 1, 2, ..., 10 let $A_i$ be the collection of outcomes for which there is a head on the ith flip. Calculate $P(A_1)$, $P(A_2)$, $P(A_1 \cap A_2)$, and $P(A_2 \mid A_1)$. How are the first three probabilities related? How are the second and fourth related? How do these numbers change if the 10 is replaced by 20?

6. In the notation of Problem 5, calculate $P(A_j \mid A_i)$ for $1 \le i < j \le n$.

7. Bowl 1 contains 10 red chips and 5 white chips. Bowl 2 contains 10 red chips and 10 white chips. A chip is selected at random from Bowl 1, transferred to Bowl 2, and then a second chip is selected at random from Bowl 2. What is the probability that both chips will be red?

8. A pair of dice, one red and one white, are rolled. Let A be the collection of outcomes for which the number of pips on the red die is less than or equal to 2 and let B be the collection of outcomes for which the number of pips on the white die is greater than or equal to 4. Calculate P(A | B).
What does this say about the partial information "the number of pips on the white die is greater than or equal to 4"?

9. A man has n keys of which one will open his lock and the others will not. If he tries the keys randomly one at a time, what is the probability that the lock will be opened on the rth try where $1 \le r \le n$?

10. Consider an experiment in which the outcomes are the positive integers 1, 2, .... For each $k \ge 1$, let
$$p(k) = \frac{1}{k(k+1)} = \frac{1}{k} - \frac{1}{k+1}.$$
Can the p(k) serve as weights for a probability model?

SUPPLEMENTAL READING LIST

1. F. N. David (1962). Games, Gods, and Gambling. New York: Hafner Publishing Co.
2. W. Feller (1957). An Introduction to Probability Theory and Its Applications, 2nd ed. New York: Wiley.
3. Oystein Ore (1953). Cardano, The Gambling Scholar. Princeton, N.J.: Princeton University Press.
4. M. A. Todhunter (1965). A History of the Mathematical Theory of Probability. New York: Chelsea.
5. A. Tucker (1984). Applied Combinatorics. New York: Wiley.

AXIOMS OF PROBABILITY

INTRODUCTION

Rules for calculating probabilities associated with simple games of chance were developed in the works of P. R. de Montmort (1678-1719) and A. de Moivre (1667-1754). These rules also began to be applied in mortality tables and life insurance calculations as early as the late seventeenth century. Most of the effort during this period was concentrated on specific problems dealing with combinations. But eventually problems required more than combinatorial methods, and powerful tools had to be developed for their solutions. Terms such as "gain" and "duration of play" were commonly used during this period and evolved into an abstract concept known as a "chance variable," much like "momentum" in mechanics. In any particular application, a chance variable was defined in some natural way, not as a mathematical entity but rather by its properties.

The publication of Foundations of the Theory of Probability by A. N.
Kolmogorov in 1933 marked the beginning of a rapid development of probability theory and its application to diverse fields, particularly during and immediately after World War II. The reader interested in alternatives to the axiomatic probability model discussed in this chapter should read the book by Hamming listed in the Supplemental Reading List.

The content of this chapter is rather abstract. A real appreciation of probability theory cannot be gained without some firsthand experience with a random device. Experiment with flipping a coin many times; you will be surprised by some of the facets of randomness.

SET THEORY

A typical exercise in probability theory involving two dice will start out "Let A be the event 'the score is 11.'" For our purposes, this layman's description of A is an abbreviated form of "Let A be the collection of outcomes $\omega$ with score 11," and consequently A is a subcollection of all possible outcomes. Later, we will define an event to be a subcollection of the collection of all possible outcomes.

We saw in Chapter 1 that probability theory pertains to collections of outcomes. Such collections are called sets. One starting point for developing mathematics is the set N of natural numbers 1, 2, ..., which is denoted by N = {1, 2, 3, ...} and eventually leads to the set R of real numbers. Algebraic and order properties of real numbers will be taken for granted.

A primitive notion of set theory is that of membership. We write $x \in X$ if x is a member or element of the set X. If x is not a member of X, we write $x \notin X$. If X and Y are two sets, we write $X \subset Y$ if $x \in X$ implies $x \in Y$ and say that X is contained in or is a subset of Y. We say that two sets X and Y are equal, written X = Y, if $X \subset Y$ and $Y \subset X$. It sometimes happens in manipulating sets that we end up with something that has no members. As a matter of notational convenience, we use $\emptyset$ to signify a set that has no elements and call $\emptyset$ the empty set.
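For finite sets, the relations just introduced correspond closely to operations on Python's built-in `set` type. The following sketch is only an illustration with small sets of our own choosing; the helper name `is_subset` is hypothetical and simply restates the definition of containment.

```python
def is_subset(X, Y):
    """X is contained in Y if every member of X is also a member of Y."""
    return all(x in Y for x in X)

X = {1, 2, 3}
Y = {1, 2, 3, 4, 5}
Z = {3, 2, 1}

print(2 in X)        # True:  2 is a member of X
print(7 not in X)    # True:  7 is not a member of X
print(is_subset(X, Y), is_subset(Y, X))  # True False

# Equality means containment in both directions.
print(is_subset(X, Z) and is_subset(Z, X), X == Z)  # True True

# The empty set has no members and is contained in every set.
empty = set()
print(len(empty), is_subset(empty, X))  # 0 True
```

The built-in comparison `X <= Y` performs the same containment test as `is_subset(X, Y)`.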
We need a procedure for specifying sets. To obtain one, let p(x) be a sentence containing a variable x. Then {x : p(x)} will denote the set of objects x for which p(x) is true. For example, consider the sentence "x = 1 or x = 2 or x = 3." Then {x : p(x)} consists of the natural numbers 1, 2, and 3. This set is usually written {1, 2, 3} for brevity.

EXAMPLE 2.1 Let p(x) be the sentence "$x \in N$ and $x^2 \le 40$." Then {x : p(x)} = {1, 2, 3, 4, 5, 6}. ■

EXAMPLE 2.2 If $a, b \in R$ with $a \le b$, then we have the usual definitions of closed, open, and semiclosed intervals:
$$[a, b] = \{x : x \in R \text{ and } a \le x \le b\}$$
$$(a, b) = \{x : x \in R \text{ and } a < x < b\}$$
$$[a, b) = \{x : x \in R \text{ and } a \le x < b\}$$
$$(a, b] = \{x : x \in R \text{ and } a < x \le b\}.$$
Infinite intervals are defined similarly; e.g., $[a, +\infty) = \{x : x \in R \text{ and } x \ge a\}$. ■

For the remainder of this section, we will assume that we are dealing with a universe U. All objects under consideration will be members of U, and all sets will be subcollections of U.

Given $X \subset U$, the complement of X (relative to U), denoted by $X^c$, is defined by
$$X^c = \{x : x \notin X\}.$$
The set specified on the right should contain "$x \in U$" as part of its description, but this part is customarily omitted when it is understood that we are dealing with a fixed universe. It is easy to see that
$$\emptyset^c = U, \qquad U^c = \emptyset.$$
If X and Y are two subsets of U, the union of X and Y, denoted by $X \cup Y$, is defined by
$$X \cup Y = \{x : x \in X \text{ or } x \in Y\};$$
the intersection of X and Y, denoted by $X \cap Y$, is defined by
$$X \cap Y = \{x : x \in X \text{ and } x \in Y\}.$$
If $X \cap Y = \emptyset$, we say that X and Y are mutually exclusive or disjoint.

These concepts can be illustrated as follows. For U take the points inside a rectangle in a plane, and for a subset A of U take the points within and on a simple closed curve (e.g., a circle). If this is done for subsets X, Y, Z, ... of U, the resulting picture is called a Venn diagram. The operations on sets defined above can be depicted as in Figure 2.1. Venn diagrams can be helpful for understanding set operations.
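The set-builder notation and the three operations can be tried out on a small finite universe. In the following Python sketch the universe U = {1, ..., 10}, the set Y of even numbers, and the helper name `complement` are our own illustrative assumptions; X is built from the sentence of Example 2.1, restricted to this universe.

```python
U = set(range(1, 11))                # a small universe, chosen only for illustration

X = {x for x in U if x * x <= 40}    # {x : p(x)} with p(x) the sentence "x^2 <= 40"
Y = {x for x in U if x % 2 == 0}     # the even members of U

def complement(S, universe=U):
    """Complement relative to the universe: members of the universe not in S."""
    return universe - S

print(sorted(X))                     # [1, 2, 3, 4, 5, 6]
print(sorted(X | Y))                 # union:        [1, 2, 3, 4, 5, 6, 8, 10]
print(sorted(X & Y))                 # intersection: [2, 4, 6]
print(sorted(complement(X)))         # [7, 8, 9, 10]

# A set and its complement are disjoint, and together they fill the universe.
print(X & complement(X) == set(), X | complement(X) == U)  # True True

# The complement of the empty set is U, and the complement of U is empty.
print(complement(set()) == U, complement(U) == set())      # True True
```

Each printed line corresponds to one region that would be shaded in a Venn diagram of X and Y inside U.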
FIGURE 2.1 Venn diagrams illustrating set operations. (Panels: shaded region $X \cup Y$; shaded region $X^c$.)

If $X_1, X_2, \ldots, X_n$ is a finite sequence of sets, their union is denoted and defined by

$$\bigcup_{i=1}^{n} X_i = \{x : x \in X_i \text{ for some } i = 1, \ldots, n\}$$

and their intersection by

$$\bigcap_{i=1}^{n} X_i = \{x : x \in X_i \text{ for all } i = 1, \ldots, n\}.$$

Similarly, if $X_1, X_2, \ldots$ is an infinite sequence of sets, then the union and intersection of the sets are defined by

$$\bigcup_{i=1}^{\infty} X_i = \{x : x \in X_i \text{ for some } i \ge 1\}$$

and

$$\bigcap_{i=1}^{\infty} X_i = \{x : x \in X_i \text{ for all } i \ge 1\},$$

respectively. As was the case with sums, rather than making the distinction between finite unions (intersections) and infinite unions (intersections), we usually just write $\cup X_i$ ($\cap X_i$) if the range on $i$ is easily ascertained.

The following example requires the use of the Archimedean property of the real numbers, which states that if $r$ is a real number, then there is a positive integer $n$ such that $n > r$.

EXAMPLE 2.3 For each $n \ge 1$, let $A_n = [0, 1/n)$. Since $0 \in A_n$ for all $n \ge 1$, $0 \in \cap A_n$. Clearly, $\cap A_n$ cannot contain any negative numbers. But what about positive numbers? Assume $x > 0$. By the Archimedean property, there is a positive integer $m$ such that $m > 1/x$. It follows that $x > 1/m$ so that $x \notin A_m$, and thus $x \notin \cap A_n$. It follows that $\cap A_n = \{0\}$. ■

FIGURE 2.2 $X \cap (Y \cup Z) = (X \cap Y) \cup (X \cap Z)$.

The union, intersection, and complement operations on sets are subject to algebraic laws that in some cases, but not all, are the same as the algebraic laws for real numbers. Corresponding to the addition and multiplication of real numbers we have commutative laws:

$$X \cup Y = Y \cup X \qquad X \cap Y = Y \cap X.$$

A proof of the commutative law for union requires that two things be proved; namely, that $X \cup Y \subset Y \cup X$ and $Y \cup X \subset X \cup Y$. Consider the first relation. Suppose $x \in X \cup Y$. Then $x \in X$ or $x \in Y$; but this statement is the same as $x \in Y$ or $x \in X$, and so $x \in Y \cup X$. Thus, $x \in X \cup Y$ implies $x \in Y \cup X$.
At a crucial point in this argument, there is a claim that the statement “$x \in X$ or $x \in Y$” is equivalent to the statement “$x \in Y$ or $x \in X$.” To justify this claim, we could move on to formal “truth tables,” but we will not. The equivalence of the two statements is taken for granted as something from logic.

If $X$, $Y$, and $Z$ are three sets, there are associative laws:

$$X \cup (Y \cup Z) = (X \cup Y) \cup Z \qquad X \cap (Y \cap Z) = (X \cap Y) \cap Z.$$

The associative laws permit us to omit the parentheses altogether since they can be reinserted in any manner; e.g.,

$$A \cup B \cup C \cup D = ((A \cup B) \cup C) \cup D = A \cup (B \cup (C \cup D)).$$

There are also distributive laws:

$$X \cap (Y \cup Z) = (X \cap Y) \cup (X \cap Z) \qquad X \cup (Y \cap Z) = (X \cup Y) \cap (X \cup Z).$$

A convincing, but not rigorous, argument that the first distributive law is true can be made by examining a Venn diagram for $X \cap (Y \cup Z)$ as in Figure 2.2. The two top shaded regions represent $X \cap Y$ and $X \cap Z$. Their union is the lower shaded region $X \cap (Y \cup Z)$.

The effect of complementation on unions and intersections is the subject of de Morgan's laws:

$$(X \cup Y)^c = X^c \cap Y^c \qquad (X \cap Y)^c = X^c \cup Y^c.$$

There are also more general distributive laws:

$$X \cap (\cup Y_n) = \cup (X \cap Y_n) \qquad X \cup (\cap Y_n) = \cap (X \cup Y_n)$$

and more general de Morgan's laws:

$$(\cup X_n)^c = \cap X_n^c \qquad (\cap X_n)^c = \cup X_n^c.$$

The following special relations hold for all $X \subset U$:

$$X \cap \emptyset = \emptyset \qquad X \cup U = U \qquad X \cup X^c = U$$
$$X \cup \emptyset = X \qquad X \cap U = X \qquad X \cap X^c = \emptyset.$$

Venn diagrams must be recognized for what they are: doodles. Equations relating sets cannot be proved using Venn diagrams. Such proofs require repeated applications of the laws stated above. Venn diagrams can be used legitimately to prove negative results, however.

EXAMPLE 2.4 Consider the equation

$$X \cap Y \cap Z = X \cap Y \cap (Y \cup Z).$$

Is this equation true for all subsets $X, Y, Z$ of a given $U$? The answer is no if we can construct a $U$ and $X, Y, Z$ for which the equation is not true.
By constructing a Venn diagram for three sets, labeling the parts $1, 2, \ldots, 8$ as in Figure 2.3, and defining $U = \{1, 2, \ldots, 8\}$, $X = \{1, 4, 5, 7\}$, $Y = \{2, 5, 6, 7\}$, and $Z = \{3, 4, 6, 7\}$, we obtain

$$X \cap Y \cap Z = \{7\} \ne \{5, 7\} = X \cap Y \cap (Y \cup Z).$$

We thus have a specific example of $X$, $Y$, and $Z$ for which the above equation is not true, and therefore the equation is not always true. ■

FIGURE 2.3 Counterexample.

Care must be taken when going beyond the relations listed above. For example, if $X \cup Y = X \cup Z$, there may be a temptation to conclude that $Y = Z$ because the analogous result is true in arithmetic. But the conclusion would not be valid. For example, let $U = \{1, 2, 3, 4\}$, $X = \{1, 2, 4\}$, $Y = \{2, 3\}$, and $Z = \{2, 3, 4\}$. Then $Y \ne Z$, but $X \cup Y = \{1, 2, 3, 4\} = X \cup Z$.

EXERCISES 2.2

1. Which of the following statements are correct? (a) $2 \in \{1, 2, 3\}$, (b) $2 \subset \{1, 2, 3\}$, (c) $\{2\} \in \{1, 2, 3\}$, (d) $\{2\} \subset \{1, 2, 3\}$.
2. Consider a universe $U$ consisting of ordered pairs $(i, j)$, $1 \le i, j \le 6$, where $i$ represents the number of pips on a red die and $j$ the number on a white die. Express the lay statement “the number of pips on the red die is greater than the number of pips on the white die” as a proposition concerning elements of $U$, and identify the set $A$ specified by the proposition.
3. If $A_n = [0, 2^{1/n})$, $n \ge 1$, determine $\cap A_n$.
4. If $A_n = \{(x, y) : x \in R, y \in R, 0 \le y \le x^n, 0 \le x < 1\}$, $n \ge 1$, determine $\cap A_n$.
5. If $A_n = \{(x, y) : x \in R, y \in R, 0 \le y \,\ldots\, x^n, 0 \le x < 1\}$, $n \ge 1$, determine $\cap A_n$.
6. If $A$ is any subset of the universe $U$, show that $(A^c)^c = A$.
7. If $A$ and $B$ are any two sets, show that $A \subset B$ if and only if $B^c \subset A^c$.
8. Is it true that $X \cap (Y \cup Z) = (X \cap Y) \cup Z$ for all subsets $X$, $Y$, and $Z$ of the universe $U$? If not, give an example to show that the equation is not true in general.
9. Prove that $(X \cup Y) \cap (X \cap Y)^c = (X \cap Y^c) \cup (X^c \cap Y)$ for all subsets $X$, $Y$ of the universe $U$.
10. If $A_n = [0, |\sin(n\pi/2)|\,]$, $n \ge 1$, determine $\bigcup_{n=1}^{\infty} \left( \bigcap_{k \ge n} A_k \right)$ and $\bigcap_{n=1}^{\infty} \left( \bigcup_{k \ge n} A_k \right)$.
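The two counterexamples of this section (Example 2.4 and the failed cancellation of unions) can be checked mechanically, and a small exhaustive search illustrates the remark that negative results need only one witness. The sketch below is an illustration rather than part of the original text; the helper names `subsets`, `find_counterexample`, `example_2_4`, and `de_morgan` are hypothetical, and the three-element universe `small_U` is an arbitrary choice.

```python
from itertools import combinations

def subsets(universe):
    """List every subset of a finite universe."""
    items = sorted(universe)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def find_counterexample(identity, universe):
    """Return (X, Y, Z) violating identity(X, Y, Z, universe), or None."""
    all_subsets = subsets(universe)
    for X in all_subsets:
        for Y in all_subsets:
            for Z in all_subsets:
                if not identity(X, Y, Z, universe):
                    return X, Y, Z
    return None

def example_2_4(X, Y, Z, U):
    # The proposed equation of Example 2.4.
    return (X & Y & Z) == (X & Y & (Y | Z))

def de_morgan(X, Y, Z, U):
    # First de Morgan law; Z is unused but kept for a uniform signature.
    return (U - (X | Y)) == ((U - X) & (U - Y))

# The sets used in Example 2.4 (regions of Figure 2.3 labeled 1..8).
X, Y, Z = {1, 4, 5, 7}, {2, 5, 6, 7}, {3, 4, 6, 7}
lhs = X & Y & Z
rhs = X & Y & (Y | Z)
print(lhs, rhs, lhs == rhs)

# Cancellation fails for unions: X2 ∪ Y2 = X2 ∪ Z2 although Y2 != Z2.
X2, Y2, Z2 = {1, 2, 4}, {2, 3}, {2, 3, 4}
print((X2 | Y2) == (X2 | Z2), Y2 == Z2)

# Exhaustive search over all triples of subsets of a tiny universe.
small_U = {1, 2, 3}
witness = find_counterexample(example_2_4, small_U)
no_witness = find_counterexample(de_morgan, small_U)
print(witness is not None, no_witness is None)
```

Finding a witness disproves an equation; finding none over one small universe does not prove it, in the same way that a Venn diagram does not.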
COUNTABLE SETS

Let $A$ and $B$ be two nonempty sets and consider $A \times B$, the collection of all ordered pairs $(x, y)$ with $x \in A$, $y \in B$. A function or mapping $f$ from $A$ to $B$ is a subset $f \subset A \times B$ with the property that

$$(x, y) \in f \text{ and } (x, z) \in f \text{ implies } y = z. \tag{2.1}$$

The domain of $f$ is the set $\{x : x \in A \text{ and } (x, y) \in f \text{ for some } y \in B\}$. We will assume that $A$ is chosen so that $A$ is the domain of $f$. The range of $f$ is the set $\{y : y \in B \text{ and } (x, y) \in f \text{ for some } x \in A\}$. If $(x, y) \in f$, then $y$ is written $f(x)$ in the usual calculus terminology so that $f$ consists of all pairs $(x, f(x))$ as $x$ ranges over $A$. All of the above is condensed into a single symbol $f : A \to B$.

EXAMPLE 2.5 Let $A = B = R$ and consider

$$f = \{(x, y) : x \in R, y \in R, -1 \le x \le 1, y \ge 0, x^2 + y^2 = 1\}.$$

Then $f$ is a semicircle in the $xy$-plane. In the usual notation, $f(x) = \sqrt{1 - x^2}$ with domain $\{x : x \in R, -1 \le x \le 1\}$ and range $\{y : y \in R, 0 \le y \le 1\}$. ■

EXAMPLE 2.6 Let $A = B = R$ and consider

$$g = \{(x, y) : x \in R, y \in R, -1 \le x \le 1, x^2 + y^2 = 1\}.$$

In this case, for each $x$ with $-1 < x < 1$, there are two values $g(x) = \pm\sqrt{1 - x^2}$ such that $(x, g(x)) \in g$. Therefore, $g$ is not a function or mapping because Condition 2.1 is not satisfied. ■

Finite and infinite sequences are specific examples of mappings. When we speak of a finite sequence of real numbers we are dealing with a collection of ordered pairs $(k, a_k)$ where $k \in \{1, 2, \ldots, n\}$ and $a_k \in R$. If we let $a$ be the collection of such pairs, then $a : \{1, 2, \ldots, n\} \to R$. Similarly, an infinite sequence of real numbers is a mapping $a : N \to R$; if we put $a_k = a(k)$, the usual notation for $a$ is then $a = \{a_k\}_{k=1}^{\infty}$. Care must be taken to distinguish between the terms of an infinite sequence and the range of the sequence. For example, if $a = \{(-1)^k\}_{k=1}^{\infty}$, then $-1, +1, -1, \ldots$ are the terms of the sequence, but the range is the set $\{-1, +1\}$.

We can use these concepts to make precise the commonly used term “finite.” A set $X$ is finite if for some $n \in N$ and some set $B$ it is the range of a mapping $a : \{1, 2, \ldots, n\} \to B$.
A set is infinite if not finite. The set $X$ is countable if it is the range of an infinite sequence; i.e., the range of a mapping $a : N \to B$ for some $B$ containing $X$. We can always replace $B$ by $X$. By definition, the empty set $\emptyset$ is countable. Finite sets are countable because if $X$ is the range of the finite sequence $\{a_k\}_{k=1}^{n}$, then it is the range of the infinite sequence $\{a_k\}_{k=1}^{\infty}$ where $a_k = a_n$ for all $k > n$. The set of natural numbers $N = \{1, 2, \ldots\}$ is countable because $N$ is the range of the map $I : N \to N$ where $I(n) = n$, $n \ge 1$. The set of even positive integers $\{2, 4, 6, \ldots\}$ is countable because it is the range of the map $a : N \to N$ with $a(n) = 2n$, $n \ge 1$. The set of negative integers $\{\ldots, -2, -1\}$ is countable because it is the range of the map $a : N \to R$ with $a(n) = -n$, $n \ge 1$.

FIGURE 2.4 Countable union.

A countable set can be finite or infinite. If we wish to exclude the finite case, we say that $X$ is countably infinite if $X$ is infinite and countable. The union of two countable sets is again countable; in fact, the union of finitely many countable sets is again countable.

The proof of the following theorem is more palatable if it is looked upon as a programming problem. An algorithm is given for each $n \ge 1$ for calculating the $k$th term of a sequence $\{x_{n,k}\}_{k=1}^{\infty}$, and we would like to define a single algorithm for listing all of the $x_{n,k}$.

Theorem 2.3.1 The union of a countable collection of countable sets is countable.

PROOF: We will assume that the collection is countably infinite. In this case, the collection is the range of a sequence $\{X_j\}_{j=1}^{\infty}$ of countable sets. We can assume that each of the $X_j$ is the range of an infinite sequence by repeating one of its elements infinitely many times if this is not the case. Letting $X = \bigcup_{j=1}^{\infty} X_j$, we must show that there is a mapping $a : N \to X$ having $X$ as its range. For $j \ge 1$, let $X_j = \{a_{j,k}\}_{k=1}^{\infty}$. The terms of $X_j$ appear in the $j$th row of the array shown in Figure 2.4.
An informal argument can be made for arranging the elements of this array as a sequence by following the path indicated in Figure 2.4. A map $a : N \to X$ can be constructed using the same idea but following a diagonal from lower left to upper right, dropping down to the next diagonal, following the next diagonal from lower left to upper right, and so forth. We will illustrate the construction by using the identity

$$1 + 2 + \cdots + n = \frac{n(n+1)}{2}$$

to identify the element in the array corresponding to $a(100)$. Note that

$$1 + 2 + \cdots + 13 = \frac{13 \cdot 14}{2} = 91,$$

which is the total number of elements in the array located in the first 13 diagonals. Starting at $a_{14,1}$, if we move 9 positions along the 14th diagonal (counting $a_{14,1}$ as the first position, which is $a(92)$), we arrive at the element in the array designated by $a(100)$; it is easy to calculate that $a(100) = a_{6,9}$. ■

EXAMPLE 2.7 The set of integers $Z = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ is countable. This is true since $Z$ is the union of the countable sets $\{\ldots, -2, -1\}$, $\{0\}$, and $\{1, 2, \ldots\}$. ■

EXAMPLE 2.8 If $q \in N$, let $Z_q = \{\ldots, -2/q, -1/q, 0/q, 1/q, \ldots\}$. Then $Z_q$ is countable since $Z_q$ is the union of three sets $\{\ldots, -2/q, -1/q\}$, $\{0/q\}$, and $\{1/q, 2/q, \ldots\}$, each of which is easily seen to be countable. ■

EXAMPLE 2.9 The set $Q$ of rational numbers $p/q$, where $q \in N$ and $p \in Z$, is countable. This follows from the fact that $Q = \bigcup_{q=1}^{\infty} Z_q$ and that each $Z_q$ is countable and from Theorem 2.3.1. ■

Theorem 2.3.2 If $Y$ is countable and $X \subset Y$, then $X$ is countable.

PROOF: We can assume that $X \ne \emptyset$, because otherwise $X$ is countable by definition. Since $Y$ is countable, there is a mapping $a : N \to Y$ with $Y$ the range of the map. Let $x_0$ be a fixed element of $X$. Define $\beta : N \to X$ by putting

$$\beta(n) = \begin{cases} a(n) & \text{if } a(n) \in X \\ x_0 & \text{if } a(n) \in Y \cap X^c \end{cases}$$

for $n \ge 1$. The range of $\beta$ is then $X$ and consequently $X$ is countable. ■

The set $R$ of real numbers is not countable. In view of the previous theorem, it suffices to show that $[0, 1)$ is not countable. This is done by using a method known as Cantor's diagonalization procedure.
Each $x$ in $[0, 1)$ has a decimal representation $x = .d_1 d_2 \cdots$ where $d_i \in \{0, 1, 2, \ldots, 9\}$, $i \ge 1$. But the representation is not unique. For example, $1/2 = .500\cdots = .499\cdots$. We will achieve uniqueness when this happens by using the representation that has all zeros beyond some point. Assume that $[0, 1)$ is countable. Then $[0, 1) = \{x_1, x_2, \ldots\}$. Suppose $x_i$ has the unique decimal representation $x_i = .d_{i1} d_{i2} \cdots$, $i \ge 1$, and consider the array

$$\begin{array}{l} .d_{11}\, d_{12}\, d_{13} \cdots \\ .d_{21}\, d_{22}\, d_{23} \cdots \\ .d_{31}\, d_{32}\, d_{33} \cdots \\ \quad \vdots \end{array}$$

Consider the diagonal starting at $d_{11}$. For each $j \ge 1$, choose $e_j$ different from $d_{jj}$, 0, and 9. Then $y = .e_1 e_2 \cdots$ represents a real number in $[0, 1)$ that is different from each $x_i$. But we assumed that the decimal representation of every real number in $[0, 1)$ appears in the above array, and we have a contradiction. Our assumption that $[0, 1)$ is countable leads to a contradiction.

EXERCISES 2.3

The last problem requires the use of the well-ordering property of the natural numbers $N$, which states that if $A \subset N$ and $A \ne \emptyset$, then $A$ has a least element.

1. If $f = \{(x, y) : x \in R, y \in R, y = x^2, -1 \le x < 1\}$, determine its domain and range.
2. If in the customary notation of the calculus $f(x) = \sqrt{1 - x^4}$, describe $f$ as a subset of $R \times R$ and determine its domain and range.
3. If in the customary notation of the calculus $f(x) = 1/\sqrt{1 - x^2}$, describe $f$ as a subset of $R \times R$ and determine its domain and range.
4. If $q \in N$ and $X = \{p/q : p \in N\}$, show that $X$ is countable.
5. Let $X_1, X_2, \ldots, X_m$ be a finite sequence of countably infinite sets. Show that $X = X_1 \cup \cdots \cup X_m$ is countable.
6. Show that the set $X$ of all infinite sequences of 0's and 1's is uncountable.
7. Show that $N \times N = \{(m, n) : m \in N, n \in N\}$ is countable by considering the collection of finite sets $A_k = \{(m, n) : m + n = k\}$.
8. Let $A$ and $B$ be countable sets. Show that $A \times B$ is countable.
9. Which of the following sets are countable?
(a) The set of circles in the plane having centers with rational coordinates and rational radii.
(b) The set of all polynomials $P(x) = a_n x^n + \cdots + a_1 x + a_0$ having integer coefficients.
(c) The set of all intervals $(a, b) \subset R$ having rational endpoints.
10. Let $X_1, X_2, \ldots$ be an infinite sequence of countably infinite sets. Show that $X = \bigcup_{n=1}^{\infty} X_n$ is countable.

AXIOMS

If $A$ and $B$ are disjoint collections of outcomes, we have seen that $P(A \cup B) = P(A) + P(B)$. More generally, if $A_1, \ldots, A_n$ are disjoint, it follows from the empirical law that

$$P\left( \bigcup_{j=1}^{n} A_j \right) = \sum_{j=1}^{n} P(A_j). \tag{2.4}$$

If the total number of outcomes is finite, there is no more to be said in regard to Equation 2.4. But what if $\{A_j\}$ is an infinite sequence of disjoint collections? An experiment with an infinite number of outcomes was discussed in Chapter 1; namely, flipping a coin until a head appears for the first time. In this case, it is possible to have an infinite sequence $\{A_j\}$ of disjoint collections of outcomes, and so it makes sense to ask if

$$P\left( \bigcup_{j=1}^{\infty} A_j \right) = \sum_{j=1}^{\infty} P(A_j). \tag{2.5}$$

This is a moot question for all the examples with finitely many outcomes and has an affirmative answer for the single model just described. Since Equation 2.5 is compatible with every example we have considered, can we assume that Equation 2.5, in addition to Equations 1.1, 1.2, and 1.3, is valid in a general model for probability theory? We can, but we cannot have everything we would like to have. It turns out that we cannot assume that Equation 2.5 is valid for all sequences $\{A_j\}$ of disjoint collections of outcomes and at the same time assume that $P(A)$ is meaningful for all possible $A$. We must give up one of the two assumptions. We will give up the latter, and so $P(A)$ may not be meaningful for some $A$.

Let $\Omega$ denote the collection of all outcomes for a given experiment. The following definitions are needed to limit the $A$ for which $P(A)$ will be defined.

Definition 2.1 A collection $\mathcal{A}$ of subsets of $\Omega$ is an algebra if
1. $A, B \in \mathcal{A}$ implies $A \cup B \in \mathcal{A}$.
2. $A \in \mathcal{A}$ implies $A^c \in \mathcal{A}$.
3. $\Omega \in \mathcal{A}$.
■

That is, $\mathcal{A}$ is an algebra of subsets of $\Omega$ if it is closed under the operations of union and complementation and $\Omega \in \mathcal{A}$. A mathematical induction argument can be used to show that an algebra $\mathcal{A}$ is closed under finite unions; i.e., if $A_1, A_2, \ldots, A_n \in \mathcal{A}$, then $\bigcup_{j=1}^{n} A_j \in \mathcal{A}$. The important thing to remember about algebras is that by starting with a finite number of elements of the algebra and performing a finite number of union, intersection, and complementation operations on them, the result is still in the algebra.

EXAMPLE 2.10 Let $\mathcal{A}$ be an algebra of subsets of $\Omega$ and let $A_1, A_2, A_3, A_4$ be elements of the algebra with some or all having nonempty intersections. If we need to restructure the union $\bigcup_{i=1}^{4} A_i$ into a union of disjoint sets in $\mathcal{A}$, we could let $B_1 = A_1$, $B_2 = A_2 \cap A_1^c$, $B_3 = A_3 \cap (A_1 \cup A_2)^c$, and $B_4 = A_4 \cap (A_1 \cup A_2 \cup A_3)^c$. Then $B_j \subset A_j$, $1 \le j \le 4$, the $B_j$ are disjoint, and $\cup A_j = \cup B_j$. ■

We need to postulate more if we want to deal with infinite sequences.

Definition 2.2 A collection $\mathcal{F}$ of subsets of $\Omega$ is a $\sigma$-algebra if
1. $\mathcal{F}$ is an algebra.
2. If $\{A_j\}$ is an infinite sequence in $\mathcal{F}$, then $\cup A_j \in \mathcal{F}$. ■

The important thing to remember about $\sigma$-algebras is that by starting with a sequence of elements of the $\sigma$-algebra and performing countably many union, intersection, and complementation operations on them, the result is still in the $\sigma$-algebra.

EXAMPLE 2.11 Let $\Omega$ be a finite set of outcomes and let $\mathcal{A}$ be the collection of all subsets of $\Omega$. Then $\mathcal{A}$ is an algebra. ■

EXAMPLE 2.12 Let $\Omega$ be any set and let $\mathcal{F}$ be the collection of all subsets of $\Omega$. Then $\mathcal{F}$ is a $\sigma$-algebra. Clearly, $\Omega \in \mathcal{F}$ since $\Omega \subset \Omega$. If $\{A_n\}$ is a finite or infinite sequence in $\mathcal{F}$, then $\cup A_n$ is a subset of $\Omega$ and therefore is in $\mathcal{F}$. If $A \in \mathcal{F}$, then $A^c \subset \Omega$ and $A^c \in \mathcal{F}$. ■
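The restructuring of a union into disjoint pieces described in Example 2.10 follows a simple pattern: each $B_j$ keeps only the points of $A_j$ not already seen in $A_1, \ldots, A_{j-1}$. The sketch below is an illustration under stated assumptions and is not from the original text; the function name `disjointify` and the four sample sets are hypothetical.

```python
def disjointify(sets):
    """Return B_1, B_2, ... with B_j = A_j minus (A_1 ∪ ... ∪ A_{j-1})."""
    seen = set()          # running union A_1 ∪ ... ∪ A_{j-1}
    result = []
    for A in sets:
        result.append(set(A) - seen)   # A_j intersected with the complement of 'seen'
        seen |= set(A)
    return result

# Four overlapping sets standing in for A_1, ..., A_4.
A = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5, 6}]
B = disjointify(A)
print(B)

# The three properties claimed in Example 2.10.
contained = all(b <= a for a, b in zip(A, B))
pairwise_disjoint = all(B[i].isdisjoint(B[j])
                        for i in range(len(B)) for j in range(i + 1, len(B)))
same_union = set().union(*A) == set().union(*B)
print(contained, pairwise_disjoint, same_union)
```

Only differences and unions are used, so every $B_j$ produced this way stays inside any algebra that contains the $A_j$.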
If $\mathcal{C}$ is any collection of subsets of a set $\Omega$, then there is a “smallest $\sigma$-algebra,” denoted by $\sigma(\mathcal{C})$, that contains $\mathcal{C}$. In discussing probability, we began with objects $\omega$ that were used to form collections $A$ that have now been used to form $\sigma$-algebras $\mathcal{F}$. This process has taken us through three hierarchical levels of set theory, and to prove the result just stated would require going to a fourth hierarchical level. This fourth level is left to more advanced texts.

For the time being, we have all the concepts needed to describe a general probability model. $\Omega$ will be a fixed collection of outcomes.

Definition 2.3 A probability space is a triple $(\Omega, \mathcal{F}, P)$ where $\mathcal{F}$ is a nonempty $\sigma$-algebra of subsets of $\Omega$ and $P$ is a mapping from $\mathcal{F}$ to $R$ satisfying
1. $P(\Omega) = 1$.
2. $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$.
3. If $\{A_j\}$ is a finite or infinite disjoint sequence in $\mathcal{F}$, then $P(\cup A_j) = \sum P(A_j)$. ■

All the simple games of chance described in Chapter 1 for which $\Omega$ is finite, $\mathcal{F}$ is the collection of all subsets of $\Omega$, and $P$ is defined as described there result in probability spaces $(\Omega, \mathcal{F}, P)$.

Definition 2.4 If $(\Omega, \mathcal{F}, P)$ is a probability space, elements of $\mathcal{F}$ are called events. ■

We return to a model discussed in Section 1.5.

EXAMPLE 2.13 Let $\Omega = \{\omega_1, \omega_2, \ldots\}$ be countably infinite, $\mathcal{F}$ the $\sigma$-algebra of all subsets of $\Omega$, and $p(\omega)$ a weight function as defined in Section 1.5. Define $P(A)$, $A \in \mathcal{F}$, as in Section 1.5. Then $(\Omega, \mathcal{F}, P)$ is a probability space. To show this, we need only verify Item 3 of Definition 2.3. Let $\{A_j\}$ be a sequence of disjoint events in $\mathcal{F}$ and let $A = \cup A_j$. Suppose $A = \{\omega_{i_1}, \omega_{i_2}, \ldots\}$. The fact that the series $\sum_k p(\omega_{i_k})$ is a convergent series of nonnegative terms with sum $P(A)$ means that the terms of the series can be rearranged without affecting the sum of the series. This fact about absolutely convergent series is proved or at least discussed in most calculus books. We rearrange the terms of the series so that the terms $p(\omega_{i_k})$
with $\omega_{i_k} \in A_1$ come first, then the terms $p(\omega_{i_k})$ with $\omega_{i_k} \in A_2$ second, and so on, to obtain

$$P(A) = \sum_{\omega_{i_k} \in A_1} p(\omega_{i_k}) + \sum_{\omega_{i_k} \in A_2} p(\omega_{i_k}) + \cdots = P(A_1) + P(A_2) + \cdots.$$

Therefore, $P(\cup A_j) = \sum P(A_j)$ and $(\Omega, \mathcal{F}, P)$ is a probability space. ■

We have previously encountered the following situation. Suppose a coin is flipped until a head appears for the first time with a maximum of $n$ flips. Suppose $n$ is a large positive integer. Let $A$ be the event “the experiment terminates on the fifth flip.” We can think of this experiment continuing through all $n$ flips of the coin and simply ignoring what happens after the fifth flip. If $\omega \in A$, then the first four letters of $\omega$ are T, the fifth is H, and there are two choices for each of the remaining $n - 5$ letters. Thus, $|A| = 2^{n-5}$, so that $P(A) = 2^{n-5}/2^n = 1/2^5$, and it appears that $P(A)$ does not depend upon $n$ at all! This computation is based on Pascal's reasoning in which we think of the coin as continuing to be flipped beyond the fifth flip and simply ignoring everything beyond the fifth flip. Why bother to mention the number $n$ at all? If we eliminate mentioning $n$ at all, then we are confronted with a conceptual experiment in which a coin is continually flipped. We can, in fact, construct a probability space $(\Omega, \mathcal{F}, P)$ with $\Omega$ consisting of outcomes $\omega$ that are words of infinite length using the alphabet T, H, and probabilities for events such as $A$ described above are calculated by fixing some large $n$.

We will consider a more general model. In the following example, $\Omega$ will denote the set of all infinite sequences $\{x_i\}_{i=1}^{\infty}$ where each $x_i$ is a 1 or a 0. We can think of 1 and 0 as an encoding of H and T or S and F, respectively, where S stands for success and F stands for failure. This $\Omega$ is uncountable; i.e., not countable. (See Exercise 2.3.6.) Therefore, none of the models we have discussed pertain to $\Omega$. The model depends upon a parameter $p$, called the probability of success, with $0 < p < 1$. The number $q = 1 - p$ is called the probability of failure.
Whenever $p$ and $q$ appear, these conditions on $p$ and $q$ will be taken for granted without comment.

EXAMPLE 2.14 (Infinite Sequence of Bernoulli Trials) Fix $0 < p < 1$ and let $q = 1 - p$. Let $\Omega$ be the set described above and let $\mathcal{F}_0$ be the collection of subsets $A$ of $\Omega$ of the form

$$A = \{\omega : \omega = \{x_j\}_{j=1}^{\infty}, x_{i_1} = \delta_1, \ldots, x_{i_n} = \delta_n\}, \tag{2.6}$$

where $n$ is any positive integer, $1 \le i_1 < i_2 < \cdots < i_n$, and each $\delta_i$ is a 0 or a 1. We think of the $x_i$ as the results of successive trials. For $\mathcal{F}$ we take $\sigma(\mathcal{F}_0)$, the smallest $\sigma$-algebra containing $\mathcal{F}_0$. As an illustration of how probabilities are to be computed, consider the event “1 on the second trial, 0 on the fourth trial, and 1 on the eighth trial”; i.e., the event

$$A = \{\omega : \omega = \{x_i\}_{i=1}^{\infty}, x_2 = 1, x_4 = 0, x_8 = 1\}.$$

Then $P(A) = p^2 q = p^2(1 - p)$. Note that

$$P(A) = p^{x_2 + x_4 + x_8} q^{3 - (x_2 + x_4 + x_8)}.$$

For an event $A$ of the type described in Equation 2.6, its probability is defined to be

$$P(A) = p^{\sum_{j=1}^{n} \delta_j} q^{\,n - \sum_{j=1}^{n} \delta_j}. \tag{2.7}$$

Note that $\sum_{j=1}^{n} \delta_j$ is the number of 1's in the trials numbered $i_1, i_2, \ldots, i_n$ and $n - \sum_{j=1}^{n} \delta_j$ is the number of 0's in the same trials. It cannot be done here, but it is possible to extend the definition of $P$ so that $P(A)$ is defined for all $A \in \mathcal{F}$. Any set of outcomes that can be expressed in terms of events placing restrictions on only a finite number of trials will also be an event.

Consider the event $A$ described by “a 1 eventually appears in the outcome $\omega$”; i.e.,

$$A = \{\omega : \omega = \{x_j\}_{j=1}^{\infty}, x_j = 1 \text{ for some } j \ge 1\}.$$

If we let $A_j$ be the event “1 appears for the first time on the $j$th trial,” then $A = \bigcup_{j=1}^{\infty} A_j \in \mathcal{F}$ for the reasons just cited, and the $A_j$ are disjoint. Since $P(A_j) = q^{j-1} p$ and the latter is the general term of a geometric series,

$$P(A) = \sum_{j=1}^{\infty} P(A_j) = \sum_{j=1}^{\infty} q^{j-1} p = p \sum_{j=1}^{\infty} q^{j-1} = p \cdot \frac{1}{1 - q} = 1. \; ■$$

There is no reason to limit the number of results of each trial to just the 0 and 1 of the preceding example. We can allow the possibility that each trial results in one of $k$ possibilities $r_1, \ldots, r_k$ with associated weights $p_1, p_2, \ldots, p_k$, where $0 \le p_i \le 1$, $i = 1, \ldots, k$.
Suppose $n \ge 1$, $1 \le i_1 < i_2 < \cdots < i_n$, and $\delta_1, \ldots, \delta_n \in \{r_1, \ldots, r_k\}$. For the event

$$A = \{\omega : \omega = \{x_j\}_{j=1}^{\infty}, x_{i_1} = \delta_1, \ldots, x_{i_n} = \delta_n\},$$

we can define

$$P(A) = p_1^{n_1} \times p_2^{n_2} \times \cdots \times p_k^{n_k},$$

where $n_i$ is the number of trials resulting in $r_i$, $1 \le i \le k$. This model is applicable to an unending sequence of throws of a die where the result of each throw is one of the integers 1, 2, 3, 4, 5, 6 with weight 1/6 associated with each.

EXERCISES 2.4

1. Using mathematical induction, write out a formal proof that an algebra $\mathcal{A}$ is closed under finite unions; i.e., for every $n \ge 1$, $A_1, \ldots, A_n \in \mathcal{A}$ implies $\bigcup_{j=1}^{n} A_j \in \mathcal{A}$.
2. If $\mathcal{A}$ is an algebra of subsets of $\Omega$, show that $\mathcal{A}$ is closed under finite intersections.
3. Let $\mathcal{F}$ be a $\sigma$-algebra of subsets of $\Omega$. Show that $\mathcal{F}$ is closed under countable intersections; i.e., if $\{A_j\}$ is a finite or infinite sequence in $\mathcal{F}$, then $\cap A_j \in \mathcal{F}$.
4. Let $\Omega$ be an uncountable set and let $\mathcal{F}$ be the collection of subsets $A$ of $\Omega$ such that either $A$ is countable or $A^c$ is countable. Show that $\mathcal{F}$ is a $\sigma$-algebra.
5. Consider an infinite sequence of Bernoulli trials with probability of success $p$. What is the probability that a success (or 1) will occur for the first time on an even-numbered trial?
6. An experiment consists of tossing a pair of dice until a score of 8 is observed for the first time, whereupon the experiment is terminated. What is the probability that it will terminate on an odd number of tosses of the dice?
7. A bowl contains $w$ white chips, $r$ red chips, and $b$ black chips. Chips are successively selected at random from the bowl with replacement. What is the probability that a white chip will appear before a black chip?
8. If $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$ and $\{A_j\}_{j=1}^{\infty}$ is an increasing sequence in $\mathcal{F}$ (i.e., $A_n \subset A_{n+1}$ for all $n \ge 1$), show that there is a disjoint sequence $\{B_j\}_{j=1}^{\infty}$ in $\mathcal{F}$ such that $\bigcup_{j=1}^{\infty} A_j = \bigcup_{j=1}^{\infty} B_j$.
9.
If $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$, $\{A_j\}_{j=1}^{\infty}$ is a decreasing sequence in $\mathcal{F}$ (i.e., $A_{n+1} \subset A_n$ for all $n \ge 1$), and $A = \bigcap_{j=1}^{\infty} A_j$, show that there is a decreasing sequence $\{B_j\}_{j=1}^{\infty}$ in $\mathcal{F}$ such that $A_n = A \cup B_n$, $A \cap B_n = \emptyset$ for all $n \ge 1$, and $\bigcap_{j=1}^{\infty} B_j = \emptyset$.

PROPERTIES OF PROBABILITY FUNCTIONS

Throughout this section, $(\Omega, \mathcal{F}, P)$ will be a fixed probability space as described in Definition 2.3. We will now deduce several properties of the probability function $P$ from the axioms listed in Definition 2.3.

Consider two events $A, B \in \mathcal{F}$. Since $\Omega = A \cup A^c$, intersecting both sides of this equation with $B$ we obtain

$$B = B \cap \Omega = B \cap (A \cup A^c) = (B \cap A) \cup (B \cap A^c);$$

i.e., we can decompose $B$ into two parts according to whether an outcome in $B$ is in $A$ or not in $A$. Since $B \cap A$ and $B \cap A^c$ are disjoint, by Item 3 of Definition 2.3,

$$P(B) = P(B \cap A) + P(B \cap A^c) \quad \text{for all } A, B \in \mathcal{F}. \tag{2.8}$$

If we put $B = \Omega$ and use Item 1 of Definition 2.3, then

$$1 = P(\Omega) = P(\Omega \cap A) + P(\Omega \cap A^c),$$

so that

$$P(A^c) = 1 - P(A) \quad \text{for all } A \in \mathcal{F}. \tag{2.9}$$

In particular, $P(\emptyset) = 1 - P(\Omega) = 0$.

EXAMPLE 2.15 Consider $n$ flips of a coin and let $A$ be the event “the outcome $\omega$ has one or more heads.” Calculating $P(A)$ directly is complicated, but calculating $P(A^c)$ is easily done because $A^c$ consists of just one outcome having a label of $n$ T's. Since each outcome has probability $1/2^n$, $P(A) = 1 - P(A^c) = 1 - (1/2^n)$. ■

Suppose now that $A, B \in \mathcal{F}$, $A \subset B$. Then $A \cap B = A$, and so Equation 2.8 becomes

$$P(B) = P(A) + P(B \cap A^c).$$

Since $P(B \cap A^c) \ge 0$ by Item 2 of Definition 2.3,

$$P(A) \le P(B) \quad \text{whenever } A, B \in \mathcal{F},\ A \subset B. \tag{2.10}$$

If $A$ and $B$ are any two events, then

$$A \cup B = (A \cap B^c) \cup (A \cap B) \cup (A^c \cap B);$$

i.e., $A \cup B$ can be split into three parts: (1) those outcomes in $A$ but not in $B$, (2) those outcomes in both $A$ and $B$, and (3) those outcomes in $B$ but not in $A$. Thus,

$$P(A \cup B) = P(A \cap B^c) + P(A \cap B) + P(A^c \cap B).$$

Applying Equation 2.8 to the first and third terms on the right and simplifying, we obtain

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
(2.11)

EXAMPLE 2.16 A card is selected at random from a deck of 52 cards. What is the probability that the card selected will be a king or a spade? Let $A$ be the event “the outcome $\omega$ is a king” and let $B$ be the event “the outcome $\omega$ is a spade.” The required probability is

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 1/13 + 1/4 - 1/52 = 4/13. \; ■$$

If $A$, $B$, and $C$ are any three events, then

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$

More generally, if $A_1, \ldots, A_N$ are any events, then

$$P(A_1 \cup \cdots \cup A_N) = \sum_{i=1}^{N} P(A_i) - \sum_{1 \le i_1 < i_2 \le N} P(A_{i_1} \cap A_{i_2}) + \sum_{1 \le i_1 < i_2 < i_3 \le N} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots + (-1)^{N-1} P(A_1 \cap A_2 \cap \cdots \cap A_N)$$
$$= \sum_{r=1}^{N} (-1)^{r-1} \sum_{1 \le i_1 < \cdots < i_r \le N} P(A_{i_1} \cap \cdots \cap A_{i_r}). \tag{2.12}$$

This result goes by the name inclusion/exclusion principle and can be proved using mathematical induction.

Returning to Equation 2.11,

$$P(A \cup B) \le P(A) + P(B) \quad \text{for all } A, B \in \mathcal{F}$$

since $P(A \cap B) \ge 0$ by Item 2 of Definition 2.3. This inequality is a special case of a more general inequality whose proof will require the following lemma.

Lemma 2.5.1 If $\{A_j\}_{j=1}^{\infty}$ is a sequence of events, then there is a disjoint sequence $\{B_j\}_{j=1}^{\infty}$ of events such that $B_j \subset A_j$ for all $j \ge 1$, $\bigcup_{j=1}^{n} A_j = \bigcup_{j=1}^{n} B_j$ for all $n \ge 1$, and $\cup B_j = \cup A_j$.

PROOF: Let $B_1 = A_1$ and $B_j = A_j \cap \left( \bigcup_{i=1}^{j-1} A_i \right)^c$ for $j \ge 2$. Clearly, $B_j \subset A_j$ for all $j \ge 1$. For $1 \le i \le j - 1$,

$$B_i \cap B_j \subset A_i \cap B_j \subset \left( \bigcup_{i=1}^{j-1} A_i \right) \cap B_j = \emptyset.$$

Thus, $B_i \cap B_j = \emptyset$ whenever $1 \le i \le j - 1$, $j \ge 1$. This means that the $B_j$ are disjoint. Clearly, $\bigcup_{j=1}^{n} B_j \subset \bigcup_{j=1}^{n} A_j$. Suppose $\omega \in \bigcup_{j=1}^{n} A_j$. Then there is a smallest integer $k \le n$ such that $\omega \in A_k$. Thus, $\omega \in A_k \cap \left( \bigcup_{i=1}^{k-1} A_i \right)^c = B_k \subset \bigcup_{j=1}^{n} B_j$, and it follows that $\bigcup_{j=1}^{n} A_j \subset \bigcup_{j=1}^{n} B_j$ and therefore that the two are equal. The proof of the last assertion is essentially the same. ■

Theorem 2.5.2 (Boole's Inequality) If $\{A_j\}$ is any sequence of events, then

$$P(\cup A_j) \le \sum P(A_j).$$

PROOF: By Lemma 2.5.1, there is a disjoint sequence of events $\{B_j\}$ such that $B_j \subset A_j$, $j \ge 1$, and $\cup B_j = \cup A_j$. By Inequality 2.10, $P(B_j) \le P(A_j)$, $j \ge 1$.
Since the $B_j$ are disjoint,

$$P(\cup A_j) = P(\cup B_j) = \sum P(B_j) \le \sum P(A_j). \; ■$$

Theorem 2.5.3 Let $\{A_j\}_{j=1}^{\infty}$ be a sequence of events.
(i) If $A_1 \subset A_2 \subset \cdots$ is an increasing sequence and $A = \bigcup_{j=1}^{\infty} A_j$, then $P(A) = \lim_{n \to \infty} P(A_n)$.
(ii) If $A_1 \supset A_2 \supset \cdots$ is a decreasing sequence and $A = \bigcap_{j=1}^{\infty} A_j$, then $P(A) = \lim_{n \to \infty} P(A_n)$.

PROOF: (i) Let $\{A_j\}_{j=1}^{\infty}$ be an increasing sequence of events and let $A = \bigcup_{j=1}^{\infty} A_j \in \mathcal{F}$. Note that $\bigcup_{j=1}^{n} A_j = A_n$. By Lemma 2.5.1, there is a disjoint sequence of events $\{B_j\}_{j=1}^{\infty}$ such that $B_j \subset A_j$, $j \ge 1$, $\bigcup_{j=1}^{n} A_j = \bigcup_{j=1}^{n} B_j$, and $\cup A_j = \cup B_j$. By Item 3 of Definition 2.3,

$$P(A) = P\left( \bigcup_{j=1}^{\infty} A_j \right) = P\left( \bigcup_{j=1}^{\infty} B_j \right) = \sum_{j=1}^{\infty} P(B_j) = \lim_{n \to \infty} \sum_{j=1}^{n} P(B_j) = \lim_{n \to \infty} P\left( \bigcup_{j=1}^{n} B_j \right) = \lim_{n \to \infty} P(A_n).$$

(ii) Let $\{A_j\}_{j=1}^{\infty}$ be a decreasing sequence of events and let $A = \bigcap_{j=1}^{\infty} A_j \in \mathcal{F}$. Then $\{A_j^c\}_{j=1}^{\infty}$ is an increasing sequence of events, and $A^c = \left( \bigcap_{j=1}^{\infty} A_j \right)^c = \bigcup_{j=1}^{\infty} A_j^c$. By the first part of the proof,

$$1 - P(A) = P(A^c) = \lim_{n \to \infty} P(A_n^c) = \lim_{n \to \infty} (1 - P(A_n)) = 1 - \lim_{n \to \infty} P(A_n),$$

and so $P(A) = \lim_{n \to \infty} P(A_n)$. ■

EXERCISES 2.5

1. In manufacturing brass cylindrical sleeves, 5 percent are defective because the outer diameter is too small and 3 percent are defective because the inner diameter is too large. What is the best you can say about the probability that a sleeve selected at random from a lot will be defective?
2. If $A$, $B$, and $C$ are any three events, show that $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$.
3. Consider three events $A$, $B$, $C$ for which $P(A) = 1/3$, $P(B) = 1/4$, $P(C) = 1/2$, $P(A \cap B) = 1/8$, $P(A \cap C) = 1/8$, $P(B \cap C) = 3/16$, and $P(A \cap B \cap C) = 1/32$. Calculate $P(A \cup B \cup C)$.
4. An integer is chosen at random between 0000 and 9999. (a) Use the inclusion/exclusion principle to calculate the probability that at least one 1 will appear in the number. (b) Calculate the same probability assuming that the experiment is that of four Bernoulli trials.
5. Show that the probability that one and only one of the events $A$ and $B$ will occur is $P(A) + P(B) - 2P(A \cap B)$.
6.
The mid-seventeenth century gambler Chevalier de Méré thought that the probability of getting at least one ace with the throw of four dice is equal to the probability of getting at least one double ace in 24 throws of two dice. Was de Méré correct?
7. Consider an infinite sequence of Bernoulli trials with probability of success $p$. If $\omega_0$ is any outcome, show that $P(\{\omega_0\}) = 0$. (Note: There is no significance to the fact that each outcome has probability 0 whereas the aggregate of all outcomes has probability 1! After all, points in the interval $[0, 1]$ have zero length, but the aggregate $[0, 1]$ has length 1.)
8. If $P(A) = .8$ and $P(B) = .75$, show that $P(A \cap B) \ge .55$. More generally, show that if $A$ and $B$ are any two events, then $\min(P(A), P(B)) \ge P(A \cap B) \ge P(A) + P(B) - 1$.
9. If $A_1, \ldots, A_n$ are any events, show that $P(A_1 \cap \cdots \cap A_n) \ge P(A_1) + \cdots + P(A_n) - (n - 1)$.

CONDITIONAL PROBABILITY AND INDEPENDENCE

Conditional probabilities are defined for general probability spaces as in Equation 1.12.

Definition 2.5 For $B \in \mathcal{F}$ with $P(B) > 0$ and $A \in \mathcal{F}$, define

$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

Since $P(A|B)$ associates with each $A \in \mathcal{F}$ a real number, it is a function from $\mathcal{F}$ to $R$, which we denote by $P(\cdot|B)$ and call a conditional probability function. An immediate consequence of the definition is the equation

$$P(A \cap B) = P(A|B)P(B), \quad A, B \in \mathcal{F},\ P(B) > 0, \tag{2.13}$$

which is sometimes called the law of compound probabilities. If $P(B) = 0$, we usually define $P(A|B) = 0$, which is consistent with Equation 2.13 since $P(A \cap B) = 0$ whenever $P(B) = 0$.

It was pointed out at the end of Section 1.5 that some probability models are described not by specifying the probability of each outcome but rather by a combination of outcome probabilities and conditional probabilities as in the following example.

EXAMPLE 2.17 A bowl contains 10 red balls and 10 white balls.
An EXAMPLE 2.17 experiment consists of selecting a ball at random from the bowl, replacing it by a ball of the other color, putting the replacement into the bowl, and then selecting a second ball at random from the bowl. There are four outcomes of the experiment: (R,R), (R,W), (W,R), and (W,W). Probabilities of these four outcomes are not given explicitly, but the model is described so that they can be determined. To do this, let R । denote the event “first ball selected is red” and let R2 denote the event “second ball selected is red.” The following numbers are the given data: 1 P(Ri) = 5 1 R(R?) = j 11 = 55 11 PtfzlRS) = 55 *.) PCRzl 9 = 55 9 P(R5lRD = 55 46 2 AXIOMS OF PROBABILITY As an illustration of these computations, consider PfRalPi)- Given that the outcome is in R i, at the time of the second selection there are 9 red and 11 white balls in the bowl, and so the probability that the second ball will be red is 9/20. Probabilities of individual outcomes can be calculated using Equation 2.13; e.g., * % 9 1 9 P((R, R)) = P(R1 IT R2) = P(R2|R j )P(R,) = - • - = ■ All the theorems proved for probability functions in the previous section are true for conditional probability functions P(-|B) with a fixed B G S', P(B) > 0. To see this, define P(A) = P(A|B) for A G S\ SinceP(O) = P(OAB)/P(B) = 1, Item 1 of Definition 2.3 is satisfied. Since for A G S', 0 s P(A) = P(A|B) = P(AAB)/P(B) 1, Item 2 is satisfied. Let {Ay} be a finite or infinite disjoint sequence in S'. Then = P(UA,|B) = Since the events Ay A B are disjoint, =ZPWj). and Item 3 of Definition 2.3 is satisfied. Since the theorems proved in the previous section were consequences of Items 1,2, and 3 in Definition 2.3, these same theorems are true for conditional probability functions P(-|B) for fixed B G S', P(B) > 0. For example, if {Ay}“_ 1 is an increasing sequence of events with A = U ”=1 Ay, it is not necessary to give a proof that lim P(A„|B) = P(A|B). 
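As a hedged illustration of Equation 2.13, the following Python sketch tabulates the four outcome probabilities of Example 2.17 from the given data. The function name `two_draw_model` and the dictionary layout are assumptions made for this sketch; they are not notation from the text.

```python
from fractions import Fraction


def two_draw_model(red, white):
    """Outcome probabilities for the two-draw experiment of Example 2.17.

    Hypothetical helper: a ball is drawn, replaced by a ball of the other
    color, and then a second ball is drawn.
    """
    total = red + white
    # P(R1) and P(R1^c): the first ball is drawn from the original bowl.
    p_first = {"R": Fraction(red, total), "W": Fraction(white, total)}
    # P(R2 | first draw): one ball of the drawn color has changed color.
    p_second_red = {"R": Fraction(red - 1, total), "W": Fraction(red + 1, total)}

    outcomes = {}
    for first in "RW":
        for second in "RW":
            cond = p_second_red[first] if second == "R" else 1 - p_second_red[first]
            # Law of compound probabilities (Equation 2.13).
            outcomes[(first, second)] = cond * p_first[first]
    return outcomes


probs = two_draw_model(10, 10)
for outcome, p in probs.items():
    print(outcome, p)
```

With 10 red and 10 white balls this gives 9/40 for (R,R) and (W,W) and 11/40 for the two mixed outcomes, and the four values sum to 1.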
One of the most useful applications of conditional probabilities is known as Bayes’ rule. Let $A_1, A_2, \ldots, A_n$ be a finite disjoint collection of events that exhausts $\Omega$; i.e., $\Omega = \bigcup_{j=1}^{n} A_j$. We think of the $A_1, \ldots, A_n$ as a stratification of $\Omega$. If $B$ is any other event with $P(B) > 0$, $1 \le i \le n$, then
$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(B \mid A_i)P(A_i)}{P(B)}.$$
Since $B = \bigcup_{j=1}^{n} (B \cap A_j)$ with the latter events disjoint,
$$P(B) = \sum_{j=1}^{n} P(B \cap A_j) = \sum_{j=1}^{n} P(B \mid A_j)P(A_j),$$
and so
$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)P(A_j)}. \tag{2.14}$$
Note that all the probabilities $P(A_1), \ldots, P(A_n)$, $P(B \mid A_1), \ldots, P(B \mid A_n)$ must be given data to apply Bayes’ rule and that the $A_1, \ldots, A_n$ are disjoint and exhaust $\Omega$. It was tacitly assumed in this discussion that $P(A_i) > 0$, $1 \le i \le n$. This is always true of at least one $A_i$, and the last equation is true assuming only that $P(B) > 0$.

EXAMPLE 2.18 A bowl contains three red balls and one white ball. A ball is selected at random from the bowl, replaced by a ball of the other color, and returned to the bowl. A second ball is then selected at random from the bowl. Given that the second ball is red, what is the probability that the first ball was red? Let $R_i$ be the event “$i$th ball is red,” $i = 1, 2$. Then $R_1$ and $R_1^c$ are disjoint and exhaust $\Omega$. We are given the data $P(R_1) = 3/4$, $P(R_1^c) = 1/4$, $P(R_2 \mid R_1) = 1/2$, $P(R_2 \mid R_1^c) = 1$. Thus,
$$P(R_1 \mid R_2) = \frac{P(R_2 \mid R_1)P(R_1)}{P(R_2 \mid R_1)P(R_1) + P(R_2 \mid R_1^c)P(R_1^c)} = \frac{\tfrac{1}{2} \cdot \tfrac{3}{4}}{\tfrac{1}{2} \cdot \tfrac{3}{4} + 1 \cdot \tfrac{1}{4}} = \frac{3}{5}. \ ■$$

The next application of Bayes’ rule has to do with the settlement of paternity cases in a court of law and necessitates a crude review of genetics related to blood types. In conceiving a child, each parent contributes one of the alleles O, A, or B to form one of the pairs OO, AO, AA, BO, BB, AB, called genotypes. Both A and B are dominant over O; neither A nor B is dominant over the other. The observed blood types, called phenotypes, of the child can be O, A, B, or AB. Figure 2.5 gives combinations of genotypes and phenotypes as well as the proportion of each combination in the general population.
Genotype    Phenotype    Proportion
OO          O            .479
OA          A            .310
AA          A            .050
OB          B            .116
BB          B            .007
AB          AB           .038

FIGURE 2.5 Frequencies of genotypes and phenotypes.

EXAMPLE 2.19 (Paternity Index) Jane of blood type A claims in court that Dick of blood type B is the father of her child of blood type B. The following calculations are made to support her claim. Consider an experiment in which a person is selected at random from the population of adult males. Let $E$ be the event “The child of the person selected is of blood type B” and let $F$ be the event “The person selected is Dick.” The genotype of the child is either OB or BB. Since the mother has blood type A, her genotype can only be OA, and she passed on the O allele to her child. The genotype of Dick is unknown, but it must be either OB or BB. Let $F_{OB}$ and $F_{BB}$ be the events “Dick’s genotype is OB” and “Dick’s genotype is BB,” respectively. Then
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)} = \frac{P(E \cap F \mid F_{OB})P(F_{OB}) + P(E \cap F \mid F_{BB})P(F_{BB})}{P(F)}.$$
From Figure 2.5, $P(E \cap F \mid F_{OB}) = .116$, $P(F_{OB}) = .5$, $P(E \cap F \mid F_{BB}) = 1$, $P(F_{BB}) = .007$, and $P(F) = .123$. Therefore,
$$P(E \mid F) = \frac{(.116)(.5) + .007}{.123} = .528.$$
We now calculate $P(E \mid F^c)$, the probability that the child is of blood type B given that someone other than Dick is the father. Since that someone must have blood type B and therefore genotype OB or BB, and in the first case there is a 50-50 chance that the B allele will be passed to the child,
$$P(E \mid F^c) = (.116)(.5) + .007 = .065.$$
The quantity
$$\frac{P(E \mid F)}{P(E \mid F^c)} = \frac{1}{.123} \approx 8.1$$
is called the paternity index and is interpreted to mean that a person of blood type B is eight times more likely to be the father than some other person. The paternity index is just as applicable to one man of blood type B as it is to any other man of the same blood type. This is a useful index, but it does not give us the probability that Dick is the father of the child with blood type B.
By Bayes’ rule,
$$P(F \mid E) = \frac{P(E \mid F)P(F)}{P(E \mid F)P(F) + P(E \mid F^c)P(F^c)}.$$
We can use the calculations above to obtain the two conditional probabilities on the right, but to complete the computation we need to know $P(F)$. Jane claims that this number should be 1 and Dick claims that it should be 0. In this situation it is customary to compromise by using the figure $P(F) = .5$, in which case $P(F \mid E) = .89$. ■

The reader might question the applicability of Bayes’ rule in paternity cases but not Bayes’ rule itself. The basic premise in this example is that Jane chose an adult male at random from the population of adult males and the chosen person fathered her child.

As in Chapter 1, we will interpret $P(A \mid B)$ as the probability of $A$ given the partial information that the outcome is in $B$. It sometimes happens that the partial information is irrelevant as far as the event $A$ is concerned; i.e., $P(A \mid B) = P(A)$ or, using Equation 2.13, $P(A \cap B) = P(A)P(B)$. In this case, the events $A$ and $B$ are said to be independent. We will reformulate the definition so that it is not required that $P(B) > 0$.

Definition 2.6 The events $A, B \in \mathcal{F}$ are independent events if $P(A \cap B) = P(A)P(B)$. ■

The definition is now symmetric in $A$ and $B$.

EXAMPLE 2.20 Consider a roll of two dice, one red and one white. Let $R_i$, $i = 1, \ldots, 6$, be the event “$i$ pips on the red die” and let $W_j$, $j = 1, \ldots, 6$, be the event “$j$ pips on the white die.” Any pair $R_i$ and $W_j$ are independent events since
$$P(R_i \cap W_j) = P(\{(i,j)\}) = \frac{1}{36} = \frac{1}{6} \cdot \frac{1}{6} = P(R_i)P(W_j). \ ■$$

Generally speaking, any event specified solely by conditions on a red die will be independent of any event specified solely by conditions on a white die. Let $A$ be the event “even number of pips on the red die” and let $B$ be the event “odd number of pips on the white die.” By examining Figure 1.2, $P(A \cap B) = 1/4 = 1/2 \cdot 1/2 = P(A)P(B)$.

Theorem 2.6.1 If $A$ and $B$ are independent events, then each of the pairs $A$ and $B^c$, $A^c$ and $B$, $A^c$ and $B^c$ are independent.
PROOF: Consider the pair $A$ and $B^c$. By Equation 2.8,
$$P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)(1 - P(B)) = P(A)P(B^c),$$
and so $A$ and $B^c$ are independent. Similarly for the other two pairs. ■

If $A$, $B$, and $C$ are any three events, independence of the three could be taken to mean that the three pairs $A$ and $B$, $A$ and $C$, and $B$ and $C$ are independent pairs. This type of independence is called pairwise independence. In some models there is a stronger built-in independence.

Definition 2.7 The events $A_1, \ldots, A_n$ are mutually independent if
$$P(A_{i_1} \cap A_{i_2}) = P(A_{i_1})P(A_{i_2}), \qquad i_1 \ne i_2,\ 1 \le i_1, i_2 \le n,$$
$$P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) = P(A_{i_1})P(A_{i_2})P(A_{i_3}) \quad \text{for } i_1, i_2, i_3 \text{ distinct},\ 1 \le i_1, i_2, i_3 \le n,$$
$$\vdots$$
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2) \times \cdots \times P(A_n). \ ■$$

The total number of conditions imposed in this definition is easily calculated using Equation 1.10 and is $2^n - n - 1$. It is possible for events to be pairwise independent but not mutually independent.

EXAMPLE 2.21 Suppose a pair of dice are rolled, one red and one white. Let $A$ be the event “odd number of pips on the red die,” $B$ the event “odd number of pips on the white die,” and $C$ the event “the score is odd.” Checking Figures 1.2 and 1.3, it is easy to see that $A$, $B$, and $C$ are pairwise independent, but $P(A \cap B \cap C) = 0 \ne (1/2)^3 = P(A)P(B)P(C)$ since $A \cap B \cap C = \emptyset$. ■

Caveat: Independent and mutually exclusive are not the same.

Theorem 2.6.2 If $A_1, A_2, \ldots, A_n$ are mutually independent events, then $B_1, B_2, \ldots, B_n$ are also mutually independent where each $B_j$ is $A_j$ or $A_j^c$.

In rolling a pair of dice, the numbers of pips on each die constitute independent events. Coin flipping also has built-in independence.

EXAMPLE 2.22 Consider an infinite sequence of Bernoulli trials with probability of success $p$. Suppose $\delta_1, \delta_2, \ldots, \delta_n \in \{0, 1\}$ are given and $1 \le i_1 < i_2 < \cdots < i_n$. For $j = 1, 2, \ldots, n$, let $A_{i_j} = \{\omega : \omega = \{x_i\}_{i=1}^{\infty}, x_{i_j} = \delta_j\}$; i.e., if $\omega \in A_{i_j}$, then the result of the $i_j$th trial is $\delta_j$ and nothing else is known about the results of the other trials.
The events $A_{i_1}, \ldots, A_{i_n}$ are then mutually independent events. According to Equation 2.7, $P(A_{i_j}) = p^{\delta_j} q^{1 - \delta_j}$, and since $A_{i_1} \cap \cdots \cap A_{i_n} = \{\omega : \omega = \{x_i\}_{i=1}^{\infty}, x_{i_1} = \delta_1, \ldots, x_{i_n} = \delta_n\}$,
$$P(A_{i_1} \cap \cdots \cap A_{i_n}) = \prod_{j=1}^{n} p^{\delta_j} q^{1 - \delta_j} = P(A_{i_1}) \times \cdots \times P(A_{i_n}).$$
Since this is true for any set of integers $1 \le i_1 < i_2 < \cdots < i_n$, the $2^n - n - 1$ conditions for mutual independence are fulfilled. ■

Theorem 2.6.3 Let $A_1, A_2, \ldots, A_n$ be mutually independent events and let $I = \{i_1, \ldots, i_k\}$, $J = \{j_1, \ldots, j_{n-k}\}$ be nonempty disjoint subsets of $\{1, 2, \ldots, n\}$. Then any event constructed from the $A_{i_1}, \ldots, A_{i_k}$ is independent of any event constructed from the $A_{j_1}, \ldots, A_{j_{n-k}}$.

This theorem is proved in more advanced texts. For now we must be satisfied with verifying it in specific cases in the exercises.

EXERCISES 2.6
Bayes’ rule is needed for some of the following problems.
1. Let $A$, $B$, and $C$ be mutually independent events. Show that $A$, $B^c$, and $C$ are mutually independent.
2. Let $A_1, A_2, \ldots, A_n$ be mutually independent events. Show that $B_1, B_2, \ldots, B_n$ are mutually independent events where $B_i$ is $A_i$ or $A_i^c$, $i = 1, 2, \ldots, n$.
3. (a) If $A$, $B$, and $C$ are any three events, show that $P(A \cap B \cap C) = P(A \mid B \cap C)P(B \mid C)P(C)$ provided the conditional probabilities are defined. (b) State a generalization for events $A_1, A_2, \ldots, A_n$.
4. A bowl contains 10 red balls and 10 white balls. A ball is selected at random from the bowl, replaced by a ball of the other color, and returned to the bowl. This procedure is repeated two more times. An outcome is defined to be an ordered triple $(i, j, k)$ where $i$ is the number of red balls in the bowl after the first return, $j$ is the number after the second return, and $k$ is the number after the third return. Determine the probability of each outcome.
5. A coin is flipped twice in succession. Let $A$ be the event “head on the first flip,” $B$ the event “head on the second flip,” and $C$ the event “the two flips match.” (a) Are the events $A$, $B$, and $C$ pairwise independent? (b) Are they mutually independent? In both cases, justify your answer.
6. Binary digits 0 and 1 are transmitted over a communications channel. If a 1 is sent, it will be received as a 1 with probability .95 and as a 0 with probability .05; if a 0 is sent, it will be received as a 0 with probability .99 and as a 1 with probability .01. If the probabilities that a 0 or 1 is sent are equal, what is (a) the probability that a 1 was sent given that a 1 was received and (b) the probability that a 0 was sent given that a 0 was received?
7. If, in the previous problem, three successive binary digits are transmitted with independence between digits, what is the probability that 111 was sent given that 101 was received?
8. There are three chests with two drawers each, and each drawer contains a gold coin or a silver coin. Chest 1 contains two gold coins, Chest 2 contains a gold coin and a silver coin, and Chest 3 contains two silver coins. (Gold coins? A very old problem.) A chest is selected at random, and then one of its two drawers is selected at random and opened. If a gold coin is observed, what is the probability that the other drawer contains a gold coin?
9. Let $A$, $B$, $C$, and $D$ be mutually independent events. Show that $A \cup B$ and $C \cap D$ are independent events.
10. Consider an event $A$ and an infinite sequence of disjoint events $\{A_j\}_{j=1}^{\infty}$ such that $A$ and $A_j$ are independent for each $j \ge 1$. Show that $A$ and $\bigcup A_j$ are independent events.
11. A mechanical system consists of components $A$, $B_1$, $B_2$, $C$, and $D$ as indicated in the diagram.
The system will function if there is a path from $\alpha$ to $\beta$ along which all components are functioning. (a) If in a specified period of time, $A$, $C$, and $D$ will each malfunction with probability .05 while $B_1$ and $B_2$ will each malfunction with probability .2, all independently of one another, what is the probability that the system will function during the period? (b) If $B_3$ is added to the system in parallel with $B_1$ and $B_2$ and with the same probability of malfunction, what is the probability that the system will function?

The following problem does not require mathematical software such as Mathematica or Maple V, but using a hand calculator is a bit tedious.
12. Suppose in the previous exercise that the components $B_1$ and $B_2$ are replaced by $B_1, \ldots, B_m$ connected in parallel and $C$ is replaced by $C_1, \ldots, C_n$ connected in parallel. Assume that each $B_i$ will malfunction with probability .4 and each $C_j$ will malfunction with probability .6. If the $B_i$ cost $100 each and the $C_j$ cost $80 each, how many of the $B_i$ and $C_j$ components are required to ensure that the total system will function with probability at least .88 and will minimize the cost?

2.7 SOME APPLICATIONS

The first application will deal with such questions as “How secure is your remotely operated garage door opener? Your computer password? Your answering machine access number? Your telephone calling card number?” Such applications require an extension of the definition of mutually independent events to countable collections. Fix the probability space $(\Omega, \mathcal{F}, P)$.

Definition 2.8 The events $A_1, A_2, \ldots$ are mutually independent if every finite subcollection consists of mutually independent events. ■

EXAMPLE 2.23 Consider an infinite sequence of Bernoulli trials with probability of success $p$. For each $j \ge 1$, let $\delta_j \in \{0, 1\}$ and define
$$A_j = \{\omega : \omega = \{x_i\}_{i=1}^{\infty}, x_j = \delta_j\}, \qquad j \ge 1.$$
If $\{A_{i_1}, \ldots, A_{i_n}\}$ is any subcollection of the $A_j$, we saw in the previous section that the $A_{i_1}, \ldots, A_{i_n}$ are mutually independent events. Thus, the events $A_1, A_2, \ldots$ are mutually independent. ■

Consider any sequence of events $\{A_j\}_{j=1}^{\infty}$ and an outcome $\omega$. What is to be meant by the statement that $\omega$ belongs to infinitely many of the $A_j$? It should mean that no matter how far out you go in the sequence, the $\omega$ should belong to an $A_j$ out beyond that point; i.e., for every $k \ge 1$, $\omega$ belongs to some $A_j$ with $j \ge k$ or, in the language of set theory, $\omega \in \bigcup_{j \ge k} A_j$. Since this is true for every $k \ge 1$, $\omega \in \bigcap_{k \ge 1} \bigcup_{j \ge k} A_j$.

Definition 2.9 If $\{A_j\}_{j=1}^{\infty}$ is any sequence of events, we define $\{A_n \text{ i.o.}\} = \bigcap_{k \ge 1} \bigcup_{j \ge k} A_j$. ■

Note that $\{A_n \text{ i.o.}\}$, read “$A_n$ infinitely often,” is an event, because $\mathcal{F}$ is closed under countable unions and intersections. What is the complement of $\{A_n \text{ i.o.}\}$? If $\omega$ is in the complement, then it is not true that $\omega \in A_j$ for infinitely many $j$; i.e., $\omega \in A_j$ for at most finitely many $A_j$. Formally, by de Morgan’s laws, $\{A_n \text{ i.o.}\}^c = \bigcup_{k \ge 1} \bigcap_{j \ge k} A_j^c$. This brings us to a famous theorem that for some reason is called a lemma.

Lemma 2.7.1 (Borel-Cantelli) Let $\{A_j\}_{j=1}^{\infty}$ be an infinite sequence of events.
(i) If $\sum_{j=1}^{\infty} P(A_j)$ converges, then $P(\{A_n \text{ i.o.}\}) = 0$.
(ii) If the $A_j$ are mutually independent events and $\sum_{j=1}^{\infty} P(A_j)$ diverges, then $P(\{A_n \text{ i.o.}\}) = 1$.

PROOF: (i) Assume that the series $\sum_{j=1}^{\infty} P(A_j)$ converges. Since the sequence $\{\bigcup_{j \ge k} A_j\}_{k=1}^{\infty}$ is a decreasing sequence and $\{A_n \text{ i.o.}\} = \bigcap_{k=1}^{\infty} \bigcup_{j \ge k} A_j$,
$$P(\{A_n \text{ i.o.}\}) = \lim_{k\to\infty} P\Big(\bigcup_{j \ge k} A_j\Big)$$
by Theorem 2.5.3. By Theorem 2.5.2,
$$P\Big(\bigcup_{j \ge k} A_j\Big) \le \sum_{j=k}^{\infty} P(A_j).$$
Thus,
$$0 \le P(\{A_n \text{ i.o.}\}) \le \sum_{j=k}^{\infty} P(A_j)$$
for all $k \ge 1$. Since the series $\sum_{j=1}^{\infty} P(A_j)$ converges, the sum on the right has the limit zero as $k \to \infty$. Since $P(\{A_n \text{ i.o.}\})$ does not depend upon $k$, $P(\{A_n \text{ i.o.}\}) = 0$.
(ii) It is easily checked using calculus that the graph of the equation $y = 1 - x$ lies below the graph of the equation $y = e^{-x}$ for $x \ge 0$; i.e., $1 - x \le e^{-x}$ for all $x \ge 0$. Therefore,
$$P(A_j^c) = 1 - P(A_j) \le e^{-P(A_j)}, \qquad j \ge 1.$$
(2.15)
Consider $\{A_n \text{ i.o.}\}^c = \bigcup_{k \ge 1} \bigcap_{j \ge k} A_j^c$. Since $\{\bigcap_{j=k}^{r} A_j^c\}_{r=k}^{\infty}$ is a decreasing sequence and $\bigcap_{j \ge k} A_j^c \subset \bigcap_{j=k}^{r} A_j^c$ for all $r \ge k$, it follows that $P(\bigcap_{j \ge k} A_j^c) \le \lim_{r\to\infty} P(\bigcap_{j=k}^{r} A_j^c)$. Since $A_k, \ldots, A_r$ are mutually independent events, $A_k^c, \ldots, A_r^c$ are mutually independent events and
$$P\Big(\bigcap_{j \ge k} A_j^c\Big) \le \lim_{r\to\infty} \prod_{j=k}^{r} P(A_j^c) \le \lim_{r\to\infty} \prod_{j=k}^{r} e^{-P(A_j)} = \lim_{r\to\infty} e^{-\sum_{j=k}^{r} P(A_j)}$$
by Inequality 2.15. Since the series $\sum_{j=k}^{\infty} P(A_j)$ diverges to $+\infty$ for each $k \ge 1$, the limit on the right is zero, and so $P(\bigcap_{j \ge k} A_j^c) = 0$ for all $k \ge 1$. Therefore,
$$0 \le P(\{A_n \text{ i.o.}\}^c) = P\Big(\bigcup_{k \ge 1} \bigcap_{j \ge k} A_j^c\Big) \le \sum_{k=1}^{\infty} P\Big(\bigcap_{j \ge k} A_j^c\Big) = 0$$
and $P(\{A_n \text{ i.o.}\}^c) = 0$. Thus, $P(\{A_n \text{ i.o.}\}) = 1$, as was to be proved. ■

Lemma 2.7.2 If $\{A_{i_1}, A_{i_2}, \ldots\}$ is any subcollection of the collection $\{A_1, A_2, \ldots\}$, then $\{A_{i_n} \text{ i.o.}\} \subset \{A_n \text{ i.o.}\}$.

PROOF: If $\omega$ is in infinitely many of the $A_{i_j}$, then it is in infinitely many of the $A_j$. ■

EXAMPLE 2.24 (Password Problem) Consider an infinite sequence of Bernoulli trials with probability of success $p$. Consider the four-letter word 1001. You may substitute the binary representation of your social security number (which may require up to 30 binary digits 0 and 1), computer password, answering machine access number, or telephone calling card number for this number. What is the probability that the word 1001 will appear infinitely often in an outcome $\omega = \{x_j\}_{j=1}^{\infty}$? For each $j \ge 1$, let
$$B_j = \{\omega : \omega = \{x_i\}_{i=1}^{\infty}, x_j = 1, x_{j+1} = 0, x_{j+2} = 0, x_{j+3} = 1\}.$$
If $\omega \in B_1$, then $\omega$ looks like 1001.... If $\omega \in \{B_n \text{ i.o.}\}$, then 1001 appears in $\omega$ infinitely often. Although the events $B_1, B_2, \ldots$ are not mutually independent, the events $B_1, B_5, B_9, \ldots$ are mutually independent because they are based on nonoverlapping sets of four trials. Since each $B_j$ has probability $p^2q^2$ and $\sum_{j=0}^{\infty} P(B_{1+4j}) = \sum_{j=0}^{\infty} p^2q^2$ diverges, $P(\{B_{1+4n} \text{ i.o.}\}) = 1$ by (ii) of the Borel-Cantelli lemma. Since $B_1, B_5, B_9, \ldots$ is a subcollection of the collection $B_1, B_2, \ldots$ and $\{B_{1+4n} \text{ i.o.}\} \subset \{B_n \text{ i.o.}\}$ by the preceding lemma, $P(\{B_n \text{ i.o.}\}) = 1$.
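The role of the nonoverlapping blocks in Example 2.24 can be made concrete with a small Python sketch. It assumes, as in the example, that the events $B_1, B_5, B_9, \ldots$ are mutually independent with common probability $p^2q^2$, so that the probability that 1001 fills at least one of the first $m$ such blocks is $1 - (1 - p^2q^2)^m$. The function name is hypothetical and chosen only for this illustration.

```python
def prob_word_in_first_blocks(p, m):
    """P(at least one of B_1, B_5, ..., B_{1+4(m-1)} occurs) for the word 1001.

    Hypothetical helper: relies only on the independence of nonoverlapping
    blocks of four trials, each matching 1001 with probability p^2 q^2.
    """
    q = 1.0 - p
    block = p**2 * q**2              # P(B_j), the same for every j
    return 1.0 - (1.0 - block) ** m  # complement of "no block matches"


for m in (1, 10, 100, 1000):
    print(m, prob_word_in_first_blocks(0.5, m))
```

For $0 < p < 1$ the value increases toward 1 as $m$ grows, which is consistent with, though weaker than, the conclusion $P(\{B_n \text{ i.o.}\}) = 1$.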
■

How safe is your remotely operated garage door opener, computer password, answering machine access number, or telephone calling card number? It depends upon how long it would take a random generator of 1’s and 0’s to hit the electronic combination. The question should not be “Can it be violated?” but rather “How long will it take?” But that is another mathematical problem to which we will return in Chapter 4.

To illustrate the inclusion/exclusion principle, consider a deck of cards that are numbered $1, 2, \ldots, N$. Suppose the deck is thoroughly shuffled and the cards are dealt one by one onto positions numbered $1, 2, \ldots, N$. A match occurs at position $j$ if the card numbered $j$ is at that position. If all $N!$ arrangements of the deck are equally likely, what is the probability that there will be at least one match? Let $A_j$ be the event “there is a match at the $j$th position.” The answer to the question lies in calculating $P(A_1 \cup \cdots \cup A_N)$. This probability can be calculated using the inclusion/exclusion principle given by Equation 2.12:
$$P(A_1 \cup \cdots \cup A_N) = \sum_{i=1}^{N} P(A_i) - \sum_{1 \le i_1 < i_2 \le N} P(A_{i_1} \cap A_{i_2}) + \sum_{1 \le i_1 < i_2 < i_3 \le N} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots + (-1)^{N-1} P(A_1 \cap A_2 \cap \cdots \cap A_N)$$
$$= \sum_{r=1}^{N} (-1)^{r-1} \sum_{1 \le i_1 < \cdots < i_r \le N} P(A_{i_1} \cap \cdots \cap A_{i_r}).$$
Consider a typical term $P(A_{i_1} \cap \cdots \cap A_{i_r})$ where $1 \le i_1 < i_2 < \cdots < i_r \le N$. For an outcome to be in $A_{i_1} \cap \cdots \cap A_{i_r}$ there must be matches at the $i_1, \ldots, i_r$ positions. The number of outcomes with such matches is $(N - r)!$. Thus,
$$P(A_{i_1} \cap \cdots \cap A_{i_r}) = \frac{(N - r)!}{N!}$$
and
$$\sum_{1 \le i_1 < i_2 < \cdots < i_r \le N} P(A_{i_1} \cap \cdots \cap A_{i_r}) = \binom{N}{r} \frac{(N - r)!}{N!} = \frac{1}{r!}$$
since the sum on the left has $\binom{N}{r}$ terms corresponding to the number of ways of choosing a subset $\{i_1, \ldots, i_r\}$ from $\{1, \ldots, N\}$. Therefore,
$$P(A_1 \cup \cdots \cup A_N) = \sum_{r=1}^{N} (-1)^{r-1} \binom{N}{r} \frac{(N - r)!}{N!} = \sum_{r=1}^{N} \frac{(-1)^{r-1}}{r!} = -\sum_{r=1}^{N} \frac{(-1)^{r}}{r!}.$$
The last sum is a partial sum for the Maclaurin series expansion $e^{-x} = \sum_{r=0}^{\infty} (-1)^r x^r / r!$ with $x = 1$ except for a missing $r = 0$ term. If $N$ is large, the sum $\sum_{r=1}^{N} (-1)^r / r!$ can be approximated by $e^{-1} - 1$.
Thus, for large $N$, the probability of at least one match is
$$P(A_1 \cup \cdots \cup A_N) \approx 1 - e^{-1}$$
and the probability of no match is approximately $1/e$. Actually, the approximation of the probability of no match by $1/e$ is quite good even for $N$ as small as 6, the error in this case being on the order of .0002.

The following example appears in many guises.

EXAMPLE 2.25 (Coupon Collector Problem) Any one of $N$ different coupons (e.g., baseball cards) is included in a commercial product. Assume independence between purchases and that the coupons are equally likely to appear in a product. If a collector purchases the product $n$ times, $n \ge N$, what is the probability that a complete set of coupons will be collected? Suppose the coupons are labeled $1, 2, \ldots, N$. Let $A_j$ be the event that coupon $j$ does not appear among the $n$ purchases, $j = 1, \ldots, N$. The probability that a complete set is not collected in $n$ purchases is then $P(A_1 \cup \cdots \cup A_N)$. By the inclusion/exclusion principle,
$$P(A_1 \cup \cdots \cup A_N) = \sum_{r=1}^{N} (-1)^{r-1} \sum_{1 \le i_1 < \cdots < i_r \le N} P(A_{i_1} \cap \cdots \cap A_{i_r})$$
$$= \sum_{i=1}^{N} P(A_i) - \sum_{1 \le i < j \le N} P(A_i \cap A_j) + \cdots + (-1)^{N-1} P(A_1 \cap \cdots \cap A_N).$$
Note that the last term $P(A_1 \cap \cdots \cap A_N) = 0$ since it is impossible for no coupon to appear. Consider $A_i$. The probability that coupon $i$ will not appear with a particular purchase is $1 - (1/N)$, and the probability that it will not appear in $n$ purchases is $(1 - (1/N))^n$. Thus,
$$\sum_{i=1}^{N} P(A_i) = \binom{N}{1}\Big(1 - \frac{1}{N}\Big)^n.$$
Consider $A_i$ and $A_j$, $i \ne j$. The probability that coupons $i$ and $j$ will not appear with a particular purchase is $1 - (2/N)$, and the probability that they will not appear in $n$ purchases is $(1 - (2/N))^n$. Since there are $\binom{N}{2}$ choices of $i$ and $j$ with $1 \le i < j \le N$,
$$\sum_{1 \le i < j \le N} P(A_i \cap A_j) = \binom{N}{2}\Big(1 - \frac{2}{N}\Big)^n.$$
Similarly,
$$\sum_{1 \le i_1 < \cdots < i_r \le N} P(A_{i_1} \cap \cdots \cap A_{i_r}) = \binom{N}{r}\Big(1 - \frac{r}{N}\Big)^n.$$
Therefore,
$$P(A_1 \cup \cdots \cup A_N) = \sum_{r=1}^{N} (-1)^{r-1} \binom{N}{r}\Big(1 - \frac{r}{N}\Big)^n.$$
For example, if $N = 6$ and $n = 25$, then $P(A_1 \cup \cdots \cup A_6) = .062$ and the probability of collecting a complete set of 6 coupons with 25 purchases is .938. ■

EXERCISES 2.7
1.
A deck of cards numbered $1, 2, \ldots, 10$ is shuffled and the cards are dealt one by one onto positions $1, 2, \ldots, 10$. Calculate the exact probability of at least one match.
2. Consider an infinite number of Bernoulli trials with probability of success $p \ne 1/2$, $0 < p < 1$. An equalization occurs as of some trial if there is an equal number of heads and tails. Equalization can occur only on an even number of trials. (a) If $A_{2n}$ is the event “Equalization occurs on the $2n$th trial,” show that
$$P(A_{2n}) = \binom{2n}{n} p^n (1 - p)^n.$$
(b) What is the probability that an infinite number of equalizations will occur?

The next two problems relate to an infinite sequence of Bernoulli trials with probability of success 1/2. A run of length $r$ beginning on the $n$th trial occurs if there are 1’s on the $n$ through $(n + r - 1)$ trials followed immediately by a 0. For integers $n, r \ge 1$, let $A_{n,r}$ be the event consisting of those outcomes for which there is a run of length greater than or equal to $r$ beginning on the $n$th trial.
3. If $r$ is a fixed positive integer, determine $P(A_{n,r} \text{ i.o.})$.
4. If for $n \ge 1$ and $\delta > 0$, $r_n = (1 + \delta)\log_2 n$, determine $P(A_{n,r_n} \text{ i.o.})$.

The following problems pertain to the game of craps, which is played with two dice according to the following rules:
• You win on the first roll if you roll a score of 7 or 11.
• You lose on the first roll if you roll a score of 2, 3, or 12.
• If you do not roll a 2, 3, 7, 11, or 12 on the first roll, the score becomes your “point” for subsequent rolls.
• You win on subsequent rolls if you roll your point without having rolled a 7 and lose if you roll a 7 without having rolled your point.
5. Assuming independence between trials, describe appropriate $\Omega$, $\mathcal{F}$, and $P$.
6. What is the probability that you will win with a point of 8?
7. What is the probability that you will win at craps?
8. What is the probability that the game will terminate?

The following problems require mathematical software, such as Mathematica or Maple V, or much patience.
9. A deck of 52 cards numbered $1, 2, \ldots, 52$ is shuffled and the cards are dealt one by one onto positions $1, 2, \ldots, 52$. Calculate the probability of at least one match without using the approximation $1 - 1/e$.
10. A commercial product includes a coupon that can be either a worthless coupon or one of eight collectible coupons. If 30 percent of the coupons are worthless and the collectible coupons occur in equal proportions, how many products must be purchased to be 95 percent confident of obtaining a complete set of collectible coupons, assuming that the coupons are inserted randomly into the product?

SUPPLEMENTAL READING LIST
R. W. Hamming (1991). The Art of Probability for Scientists and Engineers. Redwood City, Calif.: Addison-Wesley.

3 RANDOM VARIABLES

3.1 INTRODUCTION

The score obtained upon rolling two dice and the number of heads in $n$ flips of a coin are examples of random variables. It is possible to forgo the apparatus of the first two chapters and deal directly with a primitive concept of random variables by specifying certain probability statements about the random variables. Eventually, however, the study of algebraic and limiting properties of random variables would lead to the considerations of the first two chapters.

One of the problems we will study in this chapter is the gambler’s ruin problem, which apparently appeared in print for the first time in a paper by Huygens around the beginning of the eighteenth century. This problem was solved by James Bernoulli in a paper published posthumously in 1713. A more modern method, the method of difference equations, will be used to solve the ruin problem. Another important methodology for solving probability problems involves generating functions, which were introduced by de Moivre around 1740 and treated exhaustively by Laplace at the end of the eighteenth century.
The reader wanting to learn more applications of generating functions, or of probability theory in general, would be well advised to read the book by Feller listed at the end of the chapter.

3.2 RANDOM VARIABLES

Unless otherwise specified, $(\Omega, \mathcal{F}, P)$ will be a fixed probability space. At this juncture, we are not going to give the most general definition of a random variable but will keep things as simple as possible.

Definition 3.1 A map $X : \Omega \to R$ is a random variable if the range of $X$ is a countable set $\{x_1, x_2, \ldots\}$, finite or infinite, and $\{\omega : X(\omega) = x_j\} \in \mathcal{F}$ for all $j \ge 1$. ■

A random variable as just defined is customarily called a “discrete random variable,” but the prefix “discrete” will be dropped because no other type of random variable will be considered until much later. The definition will be extended to allow the possibility that $X$ can take on the value $+\infty$.

In all the probability models considered so far, except for an infinite sequence of Bernoulli trials, $\mathcal{F}$ consists of all subsets of $\Omega$. When this is the case, $\{\omega : X(\omega) = x_j\}$ is just another subset of $\Omega$ and therefore is in $\mathcal{F}$. In most cases, showing that $X$ is a random variable simply amounts to verifying that the range of $X$ is countable.

The notation for the event $\{\omega : X(\omega) = x_j\}$ will be compressed to $(X = x_j)$ by suppressing the $\omega$. The same is true for other events; e.g., the event $\{\omega : a < X(\omega) \le b\}$ will be compressed to $(a < X \le b)$.

Definition 3.2 Let $X$ be a random variable with range $\{x_1, x_2, \ldots\}$. The density function $f_X$ is the real-valued function on the range of $X$ defined by
$$f_X(x_j) = P(X = x_j), \qquad j = 1, 2, \ldots \ ■$$

The range of $X$ can be finite or infinite. The density function $f_X$ will be denoted simply by $f$ if the meaning is clear from the context. It is important to keep in mind that the domains of $f_X$ and $P$ are not the same. $P$ is a function on $\mathcal{F}$ with values in $R$, whereas $f_X$ is a function on the range $\{x_1, x_2, \ldots\}$ of the random variable $X$ with values in $R$.

EXAMPLE 3.1 Let $X$ be the score upon rolling two dice. The function $p$ defined on $\{2, 3, \ldots, 12\}$ as in Figure 1.3 is the density function of $X$; e.g., $f_X(7) = p(7) = 6/36$. ■

EXAMPLE 3.2 (Binomial Density) Consider $n$ Bernoulli trials with probability of success $p$. Let $X$ be the number of successes in $n$ trials; i.e., if $\omega = \{x_j\}_{j=1}^{n}$ with $x_j \in \{0, 1\}$, then $X(\omega) = \sum_{j=1}^{n} x_j$. Since the range of $X$ is $\{0, 1, \ldots, n\}$, $X$ is a random variable. If $\sum_{j=1}^{n} x_j = k$, then there are $k$ successes in $\omega$ and $n - k$ failures, and so $\omega$ has probability $p^k q^{n-k}$. But there are $\binom{n}{k}$ outcomes with exactly $k$ successes, and so
$$f_X(k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \ldots, n$$
is the density function. This density function is called the binomial density function with parameters $n$ and $p$ and is denoted by $b(\cdot\,; n, p)$. ■

EXAMPLE 3.3 Between a source $S$ and a collector $C$ there is an absorption medium as indicated below.

[Diagram: source S, absorber, collector C]

The probability that a given particle emitted from $S$ is not absorbed is $p$, and the probability that it is absorbed is $1 - p$, $0 < p < 1$. Assume that the particles are absorbed independently of each other. If $n$ particles are emitted, the probability that exactly $k$ particles will reach $C$ is given by the binomial density
$$b(k; n, p) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, \ldots, n. \ ■$$

EXAMPLE 3.4 (Geometric Density Function) Consider an infinite sequence of Bernoulli trials with probability of success $p$ as described in Example 2.14. If $X$ is defined as the first trial at which success occurs, then we have a small problem in that $X(\omega_0)$ is not defined for the outcome $\omega_0 = (0, 0, \ldots)$ consisting of all 0’s. It was shown in Example 2.14 that a success will eventually occur with probability 1. We can therefore define $X(\omega_0)$ however we choose, the result having no bearing on the computation of probabilities. We choose to define $X(\omega_0) = +\infty$. The range of $X$ is then $N \cup \{+\infty\}$, which is countable. For $k \in N$,
$$(X = k) = \{\omega : \omega = \{x_j\}_{j=1}^{\infty}, x_1 = 0, \ldots, x_{k-1} = 0, x_k = 1\} \in \mathcal{F},$$
and since $(X = +\infty) = (\bigcup_{k=1}^{\infty} (X = k))^c \in \mathcal{F}$, $X$ is a random variable.
We saw in Section 2.4 that /x(k) = P(X = k) = pqk~l, k = 1,2,... This density function is called a geometric density function with parameters p and q = 1 — p, 0 < p < 1. ■ Note that the definition of a random variable has been extended in this example because Definition 3.1 requires that the values of X be real numbers and R does not contain +<». If the value +<» is allowed, the random variable is called an extended real-valued random variable. The only situation in which a random variable will be allowed to take on the value +<» is that in which X measures some waiting time as in the previous example. In either case, the criterion is still the same, because if (X = Xy) E S' for all real values Xy of 3.2 63 RANDOM VARIABLES X, then (X = +») = (Uj(X = Xj))c G 3% since S' is a a-algebra. In most instances, the original definition is applicable. The geometric density is applicable to physical systems for which aging is not a factor. For example, the waiting time, in discrete units, for a radioactive atom to decay has a geometric density in which the parameter p can be determined from the half-life of the atom. EXAMPLE 3.5 (Negative Binomial Density) Consider an infinite se­ quence of Bernoulli trials with probability of success p. Fix a positive integer r and let Tr be the trial at which the rth success occurs for the first time. If for the outcome w the rth success never occurs, then we put Tr(o>) = +<». The range of Tr is {r, r + 1,.. .,+<»}. If x s r is a positive integer, it is easy to see that (Tr = x) is a condition on just finitely many trials, and so Tr is an extended real-valued random variable. Each outcome in the event (Tr = x) has probability prqx~r- Since r — 1 of the trials preceding the xth trial must be successes, the number of outcomes in this event is ( X J ). Thus, P(Jr = x) = ( _ * J )prqx T, x = r, r + 1,... Changing the scale by replacing x by x + r, P(Tr-r = x) = (X+rT _~^prqx = (X + ^-1W> x = 0,1,2,... P(Tr-r = x)= {~^pr{~q)x, x = 0,l,2,... 
by Exercise 1.3.10. It follows that $T_r - r$ has the density function

$$f(x) = \binom{-r}{x} p^r (-q)^x, \qquad x = 0, 1, 2, \ldots$$

which is called the negative binomial density with parameters r and p. The name arises from the fact that

$$\sum_{x=0}^{\infty} \binom{-r}{x} (-q)^x = (1 - q)^{-r}$$

by the generalized binomial theorem (see Exercise 1.3.4). Accordingly,

$$\sum_{x=0}^{\infty} \binom{-r}{x} p^r (-q)^x = p^r (1 - q)^{-r} = 1,$$

and it follows that $P(T_r < +\infty) = P(T_r \in \{r, r+1, \ldots\}) = 1$. This means that the rth success will eventually occur with probability 1. ■

Caveat: In the discussions that follow, assume that all random variables are real-valued unless explicitly stated otherwise.

The next density, the Poisson density, can be obtained as a limiting case of the binomial density as follows. Consider a sequence of experiments described by a binomial density for which the probability of success depends upon n; that is, consider a sequence of binomial densities $b(\cdot\,; n, p_n)$, $n \ge 1$. Assume that as n increases, $p_n$ varies in such a way that $np_n \to \lambda > 0$ for some fixed $\lambda$. Fix an integer $k \ge 0$. Then

$$\lim_{n\to\infty} b(k; n, p_n) = \lim_{n\to\infty} \binom{n}{k} p_n^k (1 - p_n)^{n-k}.$$

For large n, $p_n \approx \lambda/n$, $1 - p_n \approx 1 - (\lambda/n)$, and

$$\binom{n}{k} p_n^k (1 - p_n)^{n-k} \approx \frac{n(n-1) \times \cdots \times (n-k+1)}{k!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = \frac{\lambda^k}{k!} \cdot \frac{n(n-1) \times \cdots \times (n-k+1)}{n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}.$$

Since k is fixed, there are a fixed number of factors in the last product, and the limit of the product is the product of the limits. Since $\lim_{n\to\infty} (1 - (j/n)) = 1$ for j = 1, ..., k - 1, $\lim_{n\to\infty} (1 - (\lambda/n))^n = e^{-\lambda}$ from the calculus, and $\lim_{n\to\infty} (1 - (\lambda/n))^{-k} = 1$,

$$\lim_{n\to\infty} \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{\lambda^k e^{-\lambda}}{k!}.$$

Therefore,

$$\lim_{n\to\infty} b(k; n, p_n) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k \ge 0.$$

Can the function $f(k) = (\lambda^k e^{-\lambda})/k!$, $k \ge 0$, serve as the density of a random variable X? According to Definition 3.1, the domain of the density function $f_X$ is the range $\{x_1, x_2, \ldots\}$ of X. Whenever it is convenient to do so, we will define $f_X(x) = 0$ for $x \notin \{x_1, x_2, \ldots\}$. With this convention in mind, a density function has the following properties:

$0 \le f(x) \le 1$ for all $x \in R$. (3.1)

There is a countable set $\{x_1, x_2, \ldots\}$ such that $\sum_j f(x_j) = 1$ and $f(x) = 0$ whenever $x \notin \{x_1, x_2, \ldots\}$.
(3.2)

Conversely, given such a function, we can construct a probability space $(\Omega, \mathcal{F}, P)$ by taking $\Omega = \{x_1, x_2, \ldots\}$, $\mathcal{F}$ the collection of all subsets of $\Omega$, and defining P using the weight function $f(x_j)$ as in Section 1.5. The random variable X defined on $\Omega$ by $X(x_j) = x_j$ then has f as its density function. To show that this construction can be applied to the function f(k) of the previous paragraph, we need the Maclaurin series expansion

$$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}, \qquad -\infty < x < \infty. \tag{3.3}$$

EXAMPLE 3.6 (Poisson Density Function) Fix a positive number $\lambda$ and let

$$f(k) = \begin{cases} (\lambda^k e^{-\lambda})/k! & \text{if } k = 0, 1, \ldots \\ 0 & \text{otherwise.} \end{cases}$$

Clearly, $f(x) \ge 0$ for all $x \in R$. Since {0, 1, ...} is a countable set, $f(x) = 0$ for all $x \notin \{0, 1, \ldots\}$, and $\sum_{k=0}^{\infty} f(k) = e^{-\lambda} \sum_{k=0}^{\infty} \lambda^k/k! = 1$ by Equation 3.3, the function f satisfies 3.1 and 3.2. Thus, there is a probability space $(\Omega, \mathcal{F}, P)$ and a random variable X having f as its density function. This density function is called a Poisson density function with parameter $\lambda$ and is usually denoted by $p(\cdot\,; \lambda)$. ■

The Poisson density is usually applied in situations in which there are a large number n of trials with a small probability p that an event will occur in each trial and with $\lambda = np$ moderate in magnitude.

EXAMPLE 3.7 An electronic system has a periodic operating cycle of 0.01 second. In each of the cycles, an event can occur with probability .001. What is the probability of observing fewer than 15 events in a 100-second time interval? During 100 seconds, 10,000 cycles will be observed. Letting $\lambda = 10{,}000(.001) = 10$, the probability that k events will be observed is given approximately by the Poisson density p(k; 10). The required probability is

$$\sum_{k=0}^{14} p(k; 10) = \sum_{k=0}^{14} \frac{10^k e^{-10}}{k!} \approx .9165.$$ ■

EXAMPLE 3.8 (Uniform Density Function) For fixed $n \in N$, let $\Omega = \{1, 2, \ldots, n\}$ and let f(k) = 1/n for k = 1, 2, ..., n. Then f satisfies 3.1 and 3.2, and there is a random variable X having f as its density function.
■

If A is any set of real numbers and X is a random variable, then

$$(X \in A) = \{\omega : X(\omega) \in A\} = \bigcup_{x_j \in A} \{\omega : X(\omega) = x_j\}$$

belongs to $\mathcal{F}$ since each set $\{\omega : X(\omega) = x_j\} \in \mathcal{F}$. Since the events in the union are disjoint,

$$P(X \in A) = \sum_{x_j \in A} P(X = x_j) = \sum_{x_j \in A} f_X(x_j). \tag{3.4}$$

This equation allows us to compute probabilities related to the random variable X.

EXAMPLE 3.9 Suppose X has a geometric density with parameters p and q = 1 - p. Then

$$P(X \ge 10) = P(X \in [10, \infty)) = \sum_{j=10}^{\infty} pq^{j-1} = q^9.$$ ■

Let X be any random variable and let $\varphi$ be a real-valued function on R. Given any $\omega \in \Omega$, it makes sense to form the composite function $\varphi(X(\omega))$. This composite map is denoted by $\varphi(X)$. The range of $\varphi(X)$ is the countable set $\{\varphi(x_1), \varphi(x_2), \ldots\}$. Let y be in the range of $\varphi(X)$. To show that $(\varphi(X) = y) \in \mathcal{F}$, let $x_{j_1}, x_{j_2}, \ldots$ be those values of X for which $\varphi(x_{j_i}) = y$. Then $(\varphi(X) = y) = \bigcup_i (X = x_{j_i}) \in \mathcal{F}$, since X is a random variable. This shows that $\varphi(X)$ is a random variable.

EXAMPLE 3.10 If X is a random variable, then $X^2$ is the random variable defined for each $\omega$ by $X^2(\omega) = (X(\omega))^2$, $\sin X$ is the random variable defined for each $\omega$ by $(\sin X)(\omega) = \sin(X(\omega))$, $|X|$ is the random variable defined for each $\omega$ by $|X|(\omega) = |X(\omega)|$, and so forth. ■

Given a random variable X and a real-valued function $\varphi$ on R, how do we determine the density function $f_Z$ of the random variable $Z = \varphi(X)$? There is no algorithm for generating the density function $f_Z$.

EXAMPLE 3.11 Let X be a random variable having the geometric density

$$f_X(x) = pq^{x-1}, \qquad x = 1, 2, \ldots$$

and let Y = min(X, 5). The range of Y is the set {1, 2, 3, 4, 5}. If $X(\omega)$ is 1, 2, 3, or 4, then $Y(\omega) = X(\omega)$ and $f_Y(x) = f_X(x)$, x = 1, 2, 3, 4. If $X(\omega) \ge 5$, then $Y(\omega) = \min(X(\omega), 5) = 5$ so that $(Y = 5) = (X \ge 5) = \bigcup_{x=5}^{\infty} (X = x)$. Therefore,

$$f_Y(5) = \sum_{x=5}^{\infty} pq^{x-1} = q^4.$$

Therefore,

$$f_Y(y) = \begin{cases} pq^{y-1} & \text{if } y = 1, 2, 3, 4 \\ q^4 & \text{if } y = 5 \\ 0 & \text{otherwise.} \end{cases}$$
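The density found by hand in Example 3.11 can be sanity-checked numerically. The sketch below is an illustration only: it truncates the geometric range at a large cutoff `N` (an assumption made so that the sum is finite), groups the values of X by the value of min(X, 5), and compares the result with the closed form. The value p = 0.3 and all names are arbitrary choices.

```python
from collections import defaultdict

p = 0.3   # illustrative success probability (assumption)
q = 1 - p
N = 500   # truncation point for the infinite range (assumption)

# Geometric density f_X(x) = p q^(x-1), x = 1, 2, ..., truncated at N.
f_X = {x: p * q ** (x - 1) for x in range(1, N + 1)}

# Add the probability of each x to the value y = min(x, 5) that it maps to.
f_Y = defaultdict(float)
for x, prob in f_X.items():
    f_Y[min(x, 5)] += prob

# Closed form from Example 3.11.
closed_form = {y: p * q ** (y - 1) for y in range(1, 5)}
closed_form[5] = q**4

for y in range(1, 6):
    print(y, round(f_Y[y], 6), round(closed_form[y], 6))
```

The truncation discards only the tail mass q^N, which is negligible for the cutoff chosen here.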
■

Consider now two random variables X and Y on the same probability space with ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively, and let $\psi$ be a real-valued function of two variables. Then for each $\omega \in \Omega$, $\psi(X(\omega), Y(\omega))$ defines a new map from $\Omega$ to R that is denoted by $\psi(X, Y)$. The range of $\psi(X, Y)$ is a subset of the set $\{\psi(x_i, y_j) : i \ge 1, j \ge 1\} = \bigcup_{i \ge 1} \{\psi(x_i, y_j) : j \ge 1\}$, which is countable by Theorem 2.3.1, and is therefore countable. Let z be any value of $\psi(X, Y)$ and let $(x_{i_1}, y_{j_1}), (x_{i_2}, y_{j_2}), \ldots$ be those ordered pairs for which $\psi(x_{i_k}, y_{j_k}) = z$. Then

$$(\psi(X, Y) = z) = \bigcup_k \{\omega : X(\omega) = x_{i_k}, Y(\omega) = y_{j_k}\} = \bigcup_k \left((X = x_{i_k}) \cap (Y = y_{j_k})\right) \in \mathcal{F}$$

since X and Y are random variables. Thus, $Z = \psi(X, Y)$ is a random variable.

EXAMPLE 3.12 If X and Y are random variables, then X + Y, X - Y, XY, $X^2 + Y^2$, max(X, Y), min(X, Y), sin XY, and so forth, are all random variables. ■

More generally, if $X_1, \ldots, X_n$ are n random variables and $\psi$ is a real-valued function of n variables, then $\psi(X_1, \ldots, X_n)$ defined by

$$\psi(X_1, \ldots, X_n)(\omega) = \psi(X_1(\omega), \ldots, X_n(\omega))$$

for $\omega \in \Omega$ is a random variable. Finding the density function of $Z = \psi(X_1, \ldots, X_n)$ can be difficult, depending upon $\psi$. We will show how this can be done when n = 2 in some cases using the joint density of two random variables.

Definition 3.3 Let X and Y be random variables with ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively. The joint density function $f_{X,Y}$ of X and Y is defined on the set of ordered pairs $\{(x_i, y_j) : i \ge 1, j \ge 1\}$ by

$$f_{X,Y}(x_i, y_j) = P(X = x_i, Y = y_j), \qquad i, j \ge 1.$$ ■

Equation 3.4 can be extended to two or more random variables. Let X and Y be two random variables with ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively, and let A be any subset of $R \times R$. Since $((X, Y) \in A) = \bigcup_{(x_i, y_j) \in A} (X = x_i, Y = y_j)$,

$$P((X, Y) \in A) = \sum_{(x_i, y_j) \in A} P(X = x_i, Y = y_j) = \sum_{(x_i, y_j) \in A} f_{X,Y}(x_i, y_j). \tag{3.5}$$

In calculating probabilities pertaining to the pair X, Y, there is some latitude in the choice of A. The set A can usually be defined by replacing X and Y by typical values $x_i$ and $y_j$, respectively.
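Equation 3.5 translates directly into a sum over the pairs in A. The sketch below assumes, purely for illustration, two fair dice with joint density 1/36 on each pair; `prob_in` and the predicate-based description of A are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction
from typing import Callable, Dict, Tuple

JointDensity = Dict[Tuple[int, int], Fraction]


def prob_in(joint: JointDensity, in_A: Callable[[int, int], bool]) -> Fraction:
    """P((X, Y) in A): sum of f_{X,Y}(x_i, y_j) over the pairs in A (Equation 3.5)."""
    return sum((w for (x, y), w in joint.items() if in_A(x, y)), Fraction(0))


# Illustrative joint density (assumption): two fair dice, 1/36 on each pair.
dice = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

# A = {(i, j) : i + j = 7}, obtained by replacing X, Y with typical values i, j.
print(prob_in(dice, lambda i, j: i + j == 7))       # 1/6
# A = {(i, j) : |i - j| <= 1}
print(prob_in(dice, lambda i, j: abs(i - j) <= 1))  # 4/9
```

Describing A by a predicate on typical values mirrors the "formally replace X by i and Y by j" recipe; exact fractions avoid rounding in the comparison.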
EXAMPLE 3.13 Suppose two dice are rolled, one red and one white. Let X be the number of pips on the red die, let Y be the number on the white die, and let Z be the maximum of the two numbers; i.e., Z = max(X, Y). The joint density of X and Y is $f_{X,Y}(x, y) = 1/36$, x, y = 1, 2, ..., 6. Suppose we want to find the joint density of X and Z. Both X and Z have range {1, 2, ..., 6}. Let x and z be typical values of X and Z, respectively. Since $Z \ge X$, $f_{X,Z}(x, z) = P(X = x, Z = z) = 0$ if z < x. If z = x, then we must have $Y \le x$. Hence, $(X = x, Z = x) = (X = x, Y \le x)$. To put the event $(X = x, Y \le x)$ into the form $((X, Y) \in A)$, formally replace X by i and Y by j in the first event to define $A = \{(i, j) : i = x, j \le x\}$. Thus, by Equation 3.5,

$$f_{X,Z}(x, z) = P(X = x, Z = x) = P((X, Y) \in A) = \sum_{i = x,\ 1 \le j \le x} \frac{1}{36} = \frac{x}{36}$$

whenever z = x. If z > x, the event (X = x, Z = z) can occur only if Y = z; i.e., (X = x, Z = z) = (X = x, Y = z) whenever z > x, and thus $f_{X,Z}(x, z) = P(X = x, Y = z) = 1/36$ whenever z > x. In summary,

$$f_{X,Z}(x, z) = \begin{cases} 0 & \text{if } z < x \\ x/36 & \text{if } z = x \\ 1/36 & \text{if } z > x. \end{cases} \tag{3.6}$$ ■

Definition 3.4 The joint density function $f_{X_1, \ldots, X_n}$ of the random variables $X_1, \ldots, X_n$ is the real-valued function

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n).$$ ■

Of course, if any $x_i$ is not in the range of $X_i$, then the probability on the right is zero. If it is clear from context, the joint density will be denoted simply by $f(x_1, \ldots, x_n)$, keeping in mind that the order of the variables in the argument of f corresponds to the order of the random variables. If the joint density $f_{X_1, \ldots, X_n}$ of $X_1, \ldots, X_n$ is known, then the joint density of any subcollection of the $X_i$'s can be determined. For example, suppose we want to determine the joint density of $X_1, \ldots, X_{n-1}$. Let $\{x_{n1}, x_{n2}, \ldots\}$ be the range of $X_n$. Then

$$\Omega = \bigcup_k (X_n = x_{nk}).$$

Intersecting both sides of this equation with $(X_1 = x_1, \ldots, X_{n-1} = x_{n-1})$,

$$(X_1 = x_1, \ldots, X_{n-1} = x_{n-1}) = \bigcup_k (X_1 = x_1, \ldots, X_{n-1} = x_{n-1}, X_n = x_{nk}).$$

Since the events on the right are disjoint,

$$f_{X_1, \ldots, X_{n-1}}(x_1, \ldots, x_{n-1}) = \sum_k f_{X_1, \ldots, X_n}(x_1, \ldots, x_{n-1}, x_{nk}).$$

Alternatively,

$$f_{X_1, \ldots, X_{n-1}}(x_1, \ldots, x_{n-1}) = \sum_{x_n} f_{X_1, \ldots, X_n}(x_1, \ldots, x_{n-1}, x_n) \tag{3.7}$$

since $f_{X_1, \ldots, X_n}(x_1, \ldots, x_{n-1}, x_n) = 0$ whenever $x_n$ is not one of the values of $X_n$. This procedure can be repeated as often as necessary to obtain the joint density of any subcollection.

EXAMPLE 3.14 A point with integer coordinates (X, Y) is chosen at random from the triangle with vertices at (1, 1), (n, 1), and (n, n) where n is a fixed positive integer. Since the total number of points with integer coordinates (x, y) in the triangle is 1 + 2 + ··· + n = (n(n + 1))/2, the density function of the pair (X, Y) is

$$f_{X,Y}(x, y) = \begin{cases} 2/(n(n+1)) & \text{if } 1 \le y \le x,\ x = 1, 2, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$

For x = 1, 2, ..., n, the density $f_X(x)$ is given by

$$f_X(x) = \sum_{y=1}^{x} \frac{2}{n(n+1)} = \frac{2x}{n(n+1)}.$$

Thus,

$$f_X(x) = \begin{cases} 2x/(n(n+1)) & \text{if } x = 1, 2, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$

For y = 1, 2, ..., n, the density $f_Y(y)$ is given by

$$f_Y(y) = \sum_{x=y}^{n} \frac{2}{n(n+1)} = \frac{2(n - y + 1)}{n(n+1)}.$$

Thus,

$$f_Y(y) = \begin{cases} (2(n - y + 1))/(n(n+1)) & \text{if } y = 1, 2, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$ ■

EXERCISES 3.2

1. A bus tour operator uses a bus with a capacity of 45 passengers but sells 50 tickets. If one person out of 12 is a no-show, what is the probability that everyone who shows up for the tour will be accommodated?

2. What is the maximum number of tickets the bus tour operator should sell to be able to accommodate all that show up with probability at least equal to .90?

3. An electronic system has an operating cycle of 0.01 second. During successive time intervals of length $5 \times 10^{-6}$, an event may occur with probability p = .0005. What is the approximate probability that fewer than 8 events will occur during 10 cycles?

4. The random variables X and Y have the joint density function

$$f_{X,Y}(x, y) = \begin{cases} 2/(n(n+1)) & \text{if } 1 \le y \le -x + n + 1,\ 1 \le x \le n \\ 0 & \text{otherwise} \end{cases}$$

where x and y are positive integers. Find $f_X(x)$ and $f_Y(y)$.

5. Suppose a pair of dice are rolled and Z is the larger of the number of pips on each. What is the density of Z?

6.
Denote the general term of the binomial density with parameters n and p by

$$b(k; n, p) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \ldots, n.$$

Find a recursion formula for calculating b(k; n, p) from b(k - 1; n, p), and put the ratio of the two in the form 1 + ···. For what value or values of k is b(k; n, p) a maximum?

7. The two random variables X and Y have the joint density

$$f_{X,Y}(x, y) = \frac{\alpha^x \beta^y e^{-(\alpha + \beta)}}{x!\, y!}, \qquad x, y = 0, 1, 2, \ldots$$

where $\alpha, \beta > 0$. What are the densities of X and Y?

8. Suppose the random variables X and Y have ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively, and their joint density has the form

$$f_{X,Y}(x_i, y_j) = f(x_i) g(y_j), \qquad i, j = 1, 2, \ldots$$

Express the densities of X and Y in terms of f and g.

9. The random variables X and Y have the joint density function tabulated below. Find the densities of X and Y and calculate P(Y X).

          x = 1   x = 2   x = 3   x = 4
  y = 1    .03     .01     .07     .01
  y = 2    .07     .01     .06     .13
  y = 3    .04     .08     .06     .02
  y = 4    .06     .09     .03     .07
  y = 5    .05     .06     .03     .02

10. The cards of a deck are numbered 1, 2, ..., 50. The deck is thoroughly shuffled and then two cards are dealt. Let X and Y be the numbers on the first and second cards, respectively. Use Equation 3.5 to calculate P(|X - Y| > 2).

The following problems require mathematical software such as Mathematica or Maple V.

11. A jumbo jet with a capacity of 365 passengers is oversold by 10 tickets. If one person out of 25 is a no-show, what is the probability that all of those who show up will board the jet?

12. What is the maximum number of tickets the airline should sell to be able to accommodate all who show up with probability .99?

3.3 INDEPENDENT RANDOM VARIABLES

The discussion of the empirical law in Chapter 1 does not include the phrase "under identical conditions," which is usually a part of the discussion.
The phrase means that an experiment should be repeated in such a way that the outcome of a repetition should not be influenced by the outcome of previous repetitions; i.e., it should be independent of previous outcomes of the experiment. If $X_1, X_2, \ldots$ denote the outcomes of successive repetitions, then the events $(X_1 = x_1), (X_2 = x_2), \ldots$ should be independent events. The formal definition will be given in terms of joint densities.

Definition 3.5 The random variables $X_1, \ldots, X_n$ are independent if

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \times f_{X_2}(x_2) \times \cdots \times f_{X_n}(x_n)$$

for all $x_1, \ldots, x_n \in R$. ■

That is, $X_1, \ldots, X_n$ are independent if their joint density is the product of their individual densities. We frequently use the fact that if $X_1, \ldots, X_n$ are independent random variables and $\{i_1, \ldots, i_k\}$ is a subset of {1, 2, ..., n}, then $X_{i_1}, \ldots, X_{i_k}$ are independent random variables. Consider, for example, $X_1, \ldots, X_{n-1}$. By Equation 3.7,

$$\begin{aligned} f_{X_1, \ldots, X_{n-1}}(x_1, \ldots, x_{n-1}) &= \sum_{x_n} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\ &= \sum_{x_n} f_{X_1}(x_1) \times \cdots \times f_{X_{n-1}}(x_{n-1}) f_{X_n}(x_n) \\ &= f_{X_1}(x_1) \times \cdots \times f_{X_{n-1}}(x_{n-1}) \sum_{x_n} f_{X_n}(x_n) \\ &= f_{X_1}(x_1) \times \cdots \times f_{X_{n-1}}(x_{n-1}). \end{aligned}$$

Thus, $X_1, \ldots, X_{n-1}$ are independent random variables.

EXAMPLE 3.15 Consider the roll of two dice, one red and one white. Let X be the number of pips on the red die and Y the number on the white die. Since $f_X(x) = P(X = x) = 1/6$ for x = 1, ..., 6, $f_Y(y) = P(Y = y) = 1/6$ for y = 1, ..., 6, and $f_{X,Y}(x, y) = 1/36$ for $1 \le x, y \le 6$,

$$f_{X,Y}(x, y) = f_X(x) f_Y(y)$$

for all $x, y \in R$. Thus, X and Y are independent random variables. ■

Independence of random variables X and Y implies much more. Suppose A and B are any sets of real numbers and the ranges of X and Y are $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively. Then

$$P(X \in A, Y \in B) = P(X \in A) P(Y \in B).$$

This follows from independence, since

$$\begin{aligned} P(X \in A, Y \in B) &= \sum_{x_i \in A,\ y_j \in B} f_{X,Y}(x_i, y_j) = \sum_{x_i \in A,\ y_j \in B} f_X(x_i) f_Y(y_j) \\ &= \left(\sum_{x_i \in A} f_X(x_i)\right) \left(\sum_{y_j \in B} f_Y(y_j)\right) = P(X \in A) P(Y \in B). \end{aligned}$$

For example, P(X > x, Y > y) = P(X > x)P(Y > y). More generally, if $X_1, X_2,$
$\ldots, X_n$ are independent random variables and $A_1, A_2, \ldots, A_n$ are any sets of real numbers, then

$$P(X_1 \in A_1, \ldots, X_n \in A_n) = P(X_1 \in A_1) \times \cdots \times P(X_n \in A_n).$$

EXAMPLE 3.16 Consider n Bernoulli trials with probability of success p. For j = 1, ..., n, let $X_j = 1$ if there is a success on the jth trial and let $X_j = 0$ if there is a failure on the jth trial. Note that the set $(X_j = x_j)$ involves a condition imposed solely on the jth trial. It follows that the events $(X_1 = x_1), \ldots, (X_n = x_n)$ are mutually independent, and therefore

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \times \cdots \times P(X_n = x_n);$$

that is,

$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \times \cdots \times f_{X_n}(x_n),$$

and therefore $X_1, \ldots, X_n$ are independent random variables. ■

Example 3.16 is a special case of a more general model. Instead of allowing each trial to have just two outcomes 0 and 1, we could allow r outcomes on each trial. For example, in eight repeated tosses of a die, the outcome of each toss is one of the integers 1, 2, ..., 6. A typical outcome might look like $\omega = (3, 5, 2, 1, 2, 3, 3, 4)$. As in the Bernoulli case, we can define random variables $X_1, X_2, \ldots, X_8$ to specify outcomes of individual tosses; i.e., for this outcome, $X_1(\omega) = 3$, $X_2(\omega) = 5$, ..., $X_8(\omega) = 4$. In the Bernoulli case, we also counted the number of successes; in tossing a die, we could count the number of times $Y_i$ the outcome i appears among the eight tosses of the die. For the above outcome, $Y_1(\omega) = 1$, $Y_2(\omega) = 2$, $Y_3(\omega) = 3$, $Y_4(\omega) = 1$, $Y_5(\omega) = 1$, $Y_6(\omega) = 0$.

EXAMPLE 3.17 (Multinomial Density Function) Consider a basic experiment with r outcomes that we choose to label as 1, 2, ..., r having probabilities $p_1, \ldots, p_r$, respectively. Consider the compound experiment of n independent repetitions of this basic experiment. An outcome $\omega$ of the compound experiment is an ordered n-tuple $\omega = (i_1, \ldots, i_n)$ where each $i_j \in \{1, 2, \ldots, r\}$. We associate with each $\omega = (i_1, \ldots, i_n)$ the weight $p(\omega) = p_{i_1} \times \cdots \times p_{i_n}$.
Letting $\Omega$ be the collection of all such outcomes, $\Omega$ is finite and we can take for $\mathcal{F}$ the collection of all subsets of $\Omega$. For j = 1, ..., n, we can define a random variable $X_j$ by putting $X_j(\omega) = i_j$ whenever $\omega = (i_1, \ldots, i_n)$. Probabilities have been defined so that

$$P(X_1 = i_1, \ldots, X_n = i_n) = p_{i_1} \times \cdots \times p_{i_n}$$

for $i_j \in \{1, \ldots, r\}$, j = 1, ..., n. Since $\sum_{j=1}^{r} p_j = 1$,

$$\begin{aligned} P(X_1 = i_1) &= \sum_{i_2, \ldots, i_n = 1}^{r} P(X_1 = i_1, X_2 = i_2, \ldots, X_n = i_n) = \sum_{i_2, \ldots, i_n = 1}^{r} p_{i_1} \times \cdots \times p_{i_n} \\ &= \sum_{i_2 = 1}^{r} \cdots \sum_{i_{n-1} = 1}^{r} \left(\sum_{i_n = 1}^{r} p_{i_1} \times \cdots \times p_{i_n}\right) \\ &= \sum_{i_2 = 1}^{r} \cdots \sum_{i_{n-1} = 1}^{r} p_{i_1} \times \cdots \times p_{i_{n-1}} \left(\sum_{i_n = 1}^{r} p_{i_n}\right) \\ &= \sum_{i_2 = 1}^{r} \cdots \sum_{i_{n-1} = 1}^{r} p_{i_1} \times \cdots \times p_{i_{n-1}} = \cdots = p_{i_1}. \end{aligned}$$

Similarly, $P(X_j = i_j) = p_{i_j}$, j = 1, ..., n. Thus,

$$P(X_1 = i_1, \ldots, X_n = i_n) = P(X_1 = i_1) \times \cdots \times P(X_n = i_n),$$

and $X_1, \ldots, X_n$ are independent random variables. For $1 \le k \le r$, let $Y_k(\omega)$ be the number of k's in the outcome $\omega = (i_1, \ldots, i_n)$. Note that $Y_1 + \cdots + Y_r = n$. Let $n_1, \ldots, n_r$ be nonnegative integers with $n = n_1 + \cdots + n_r$. Any outcome $\omega$ for which the number of 1's is $n_1$, the number of 2's is $n_2$, and so on, has probability $p_1^{n_1} \times \cdots \times p_r^{n_r}$. Since there are many outcomes fitting this description,

$$P(Y_1 = n_1, \ldots, Y_r = n_r) = C(n; n_1, \ldots, n_r)\, p_1^{n_1} \times \cdots \times p_r^{n_r}$$

where $C(n; n_1, \ldots, n_r)$ is the number of such outcomes. We can calculate this constant as follows. The number of ways of selecting $n_1$ positions out of the n positions to be filled with 1's is $\binom{n}{n_1}$; having done this, the number of ways of selecting $n_2$ positions out of the remaining $n - n_1$ to be filled with 2's is $\binom{n - n_1}{n_2}$, and so forth. Thus,

$$P(Y_1 = n_1, \ldots, Y_r = n_r) = \binom{n}{n_1} \binom{n - n_1}{n_2} \times \cdots \times \binom{n - n_1 - \cdots - n_{r-1}}{n_r}\, p_1^{n_1} \times \cdots \times p_r^{n_r}.$$

Expressing the binomial coefficients in terms of factorials and simplifying,

$$P(Y_1 = n_1, \ldots, Y_r = n_r) = \frac{n!}{n_1!\, n_2! \cdots n_r!}\, p_1^{n_1} \times \cdots \times p_r^{n_r}.$$

This joint density of $Y_1, \ldots, Y_r$ is called the multinomial density. ■

EXAMPLE 3.18 Suppose a die is tossed 12 times in succession. The probability that there will be two 1's, one 2, four 3's, one 4, three 5's, and one 6 is

$$P(Y_1 = 2, Y_2 = 1, Y_3 = 4, Y_4 = 1, Y_5 = 3, Y_6 = 1) = \frac{12!}{2!\,1!\,4!\,1!\,3!\,1!} \left(\frac{1}{6}\right)^{12} \approx .00076.$$
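The multinomial density is easy to evaluate mechanically. The following sketch, with the hypothetical helper name `multinomial_density`, computes the factorial coefficient exactly in integer arithmetic and then multiplies by the product of powers; it is one straightforward way to reproduce the value in Example 3.18, not a prescribed method.

```python
from math import factorial, prod
from typing import Sequence


def multinomial_density(counts: Sequence[int], probs: Sequence[float]) -> float:
    """P(Y_1 = n_1, ..., Y_r = n_r) = n!/(n_1! ... n_r!) * p_1^n_1 ... p_r^n_r."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    coefficient = factorial(n)
    for n_k in counts:
        coefficient //= factorial(n_k)  # each partial quotient is an integer
    return coefficient * prod(p**n_k for p, n_k in zip(probs, counts))


# Example 3.18: a fair die tossed 12 times.
value = multinomial_density([2, 1, 4, 1, 3, 1], [1 / 6] * 6)
print(round(value, 5))  # 0.00076
```

With r = 2 the same function reduces to the binomial density b(k; n, p), which gives a quick consistency check.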
■

Definition 3.6 The random variables of the sequence $X_1, X_2, \ldots$ are independent if for every $n \ge 1$, $X_1, X_2, \ldots, X_n$ are independent random variables. ■

Definition 3.7 The sequence of random variables $X_1, X_2, \ldots$ is called an infinite sequence of Bernoulli random variables with probability of success p if they are independent, $P(X_j = 1) = p$, and $P(X_j = 0) = 1 - p$ for all $j \ge 1$. ■

EXAMPLE 3.19 Consider an infinite sequence of Bernoulli trials with probability of success p as described in Example 2.14. For $j \ge 1$, let $X_j = 1$ if there is a success on the jth trial and let $X_j = 0$ if there is a failure on the jth trial. It was shown in Example 2.22 that the events $(X_1 = x_1), \ldots, (X_n = x_n)$ are mutually independent for every $n \ge 1$. Thus, the random variables of the sequence $X_1, X_2, \ldots$ are independent. It was shown in Example 2.22 that

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{j=1}^{n} p^{x_j} q^{1 - x_j}.$$

The probability $P(X_2 = x_1, \ldots, X_{n+1} = x_n)$ is also equal to the product on the right side of this equation. More generally, we have the following property of the joint density of n consecutive $X_j$'s:

$$f_{X_1, \ldots, X_n} = f_{X_{k+1}, \ldots, X_{k+n}} \tag{3.8}$$

for all $n, k \ge 1$; i.e., the joint density of $X_{k+1}, X_{k+2}, \ldots, X_{k+n}$ is independent of k. In particular, the probability of getting r successes in the first n trials is the same as the probability of getting r successes in any n consecutive trials. ■

Theorem 3.3.1 Let X and Y be independent random variables with ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively, and let Z = X + Y. Then

$$f_Z(z) = \sum_i f_X(x_i) f_Y(z - x_i) = \sum_j f_X(z - y_j) f_Y(y_j). \tag{3.9}$$

PROOF: For fixed $z \in R$, $f_Z(z) = P(Z = z) = P(X + Y = z)$. Stratify the event (X + Y = z) according to the values of X; i.e., write

$$(X + Y = z) = \bigcup_i (X + Y = z, X = x_i).$$

Since $(X + Y = z, X = x_i) = (Y = z - x_i, X = x_i)$,

$$(X + Y = z) = \bigcup_i (X = x_i, Y = z - x_i).$$

Since the events on the right are disjoint,

$$P(X + Y = z) = \sum_i P(X = x_i, Y = z - x_i).$$
Now, by independence of X and Y,

$$f_Z(z) = \sum_i f_{X,Y}(x_i, z - x_i) = \sum_i f_X(x_i) f_Y(z - x_i).$$

A similar argument applies to the second assertion. ■

If the random variables of Theorem 3.3.1 are of a particular type, the formula takes on a simpler form.

Definition 3.8 The random variable X is nonnegative integer-valued if $f_X(x) = 0$ whenever $x \notin \{0, 1, 2, \ldots\}$. ■

Theorem 3.3.2 If X and Y are independent nonnegative integer-valued random variables and Z = X + Y, then

$$f_Z(z) = \begin{cases} \sum_{x=0}^{z} f_X(x) f_Y(z - x) & \text{if } z = 0, 1, 2, \ldots \\ 0 & \text{otherwise.} \end{cases}$$

PROOF: Note that Z is also nonnegative integer-valued. By Theorem 3.3.1, for z = 0, 1, 2, ...,

$$f_Z(z) = \sum_{x=0}^{\infty} f_X(x) f_Y(z - x).$$

Note that $f_Y(z - x) = 0$ whenever x > z and the infinite limit can be replaced by z. ■

EXAMPLE 3.20 Let X and Y be independent random variables having Poisson densities with parameters $\lambda_1$ and $\lambda_2$, respectively. Let Z = X + Y. Consider any $z \in \{0, 1, 2, \ldots\}$. By Theorem 3.3.2 and the binomial theorem,

$$f_Z(z) = \sum_{x=0}^{z} \frac{\lambda_1^x e^{-\lambda_1}}{x!} \cdot \frac{\lambda_2^{z-x} e^{-\lambda_2}}{(z-x)!} = \frac{e^{-(\lambda_1 + \lambda_2)}}{z!} \sum_{x=0}^{z} \binom{z}{x} \lambda_1^x \lambda_2^{z-x} = \frac{(\lambda_1 + \lambda_2)^z e^{-(\lambda_1 + \lambda_2)}}{z!}.$$

It follows that Z has a Poisson density with parameter $\lambda_1 + \lambda_2$. ■

EXAMPLE 3.21 Let X and Y be independent random variables having binomial densities $b(\cdot\,; m, p)$ and $b(\cdot\,; n, p)$, respectively. The range of Z = X + Y is then {0, 1, ..., m + n}. Suppose z is in the range of Z. By Theorem 3.3.2 and Equation 1.11,

$$\begin{aligned} f_Z(z) &= \sum_{x=0}^{z} b(x; m, p)\, b(z - x; n, p) \\ &= \sum_{x=0}^{z} \binom{m}{x} p^x q^{m-x} \binom{n}{z-x} p^{z-x} q^{n-(z-x)} \\ &= p^z q^{(m+n)-z} \sum_{x=0}^{z} \binom{m}{x} \binom{n}{z-x} \\ &= \binom{m+n}{z} p^z q^{(m+n)-z}. \end{aligned}$$

Thus, Z has the binomial density $b(\cdot\,; m + n, p)$. ■

The last two examples have extensions to sums of finitely many random variables. The following lemma will be needed. Recall that a density function $f_X(x)$ is zero when x is not in the range of X.

Lemma 3.3.3 If $X_1, X_2, \ldots, X_n$ are independent random variables, then $X_1 + X_2 + \cdots + X_{n-1}$ and $X_n$ are independent random variables.

PROOF: (n = 4 case) Let $\alpha$ and $\beta$ be values of $X_1 + X_2 + X_3$ and $X_4$, respectively. Then

$$f_{X_1 + X_2 + X_3, X_4}(\alpha, \beta) = P(X_1 + X_2 + X_3 = \alpha, X_4 = \beta) = \sum_{x_3} P(X_1 + X_2 + X_3 = \alpha, X_3 = x_3, X_4 = \beta)$$
$$= \sum_{x_3} P(X_1 + X_2 = \alpha - x_3, X_3 = x_3, X_4 = \beta) = \sum_{x_3} \sum_{x_2} P(X_1 = \alpha - x_3 - x_2, X_2 = x_2, X_3 = x_3, X_4 = \beta).$$

Since $X_1$, $X_2$, $X_3$, and $X_4$ are independent random variables,

$$f_{X_1 + X_2 + X_3, X_4}(\alpha, \beta) = \sum_{x_3} \sum_{x_2} P(X_1 = \alpha - x_3 - x_2) P(X_2 = x_2) P(X_3 = x_3) P(X_4 = \beta).$$

Using the fact that $X_1$, $X_2$, and $X_3$ are independent random variables,

$$\begin{aligned} f_{X_1 + X_2 + X_3, X_4}(\alpha, \beta) &= \sum_{x_3} \sum_{x_2} P(X_1 = \alpha - x_3 - x_2, X_2 = x_2, X_3 = x_3) P(X_4 = \beta) \\ &= \left(\sum_{x_3} \sum_{x_2} P(X_1 + X_2 + X_3 = \alpha, X_2 = x_2, X_3 = x_3)\right) P(X_4 = \beta). \end{aligned}$$

Applying Equation 3.7 two times in succession,

$$f_{X_1 + X_2 + X_3, X_4}(\alpha, \beta) = f_{X_1 + X_2 + X_3}(\alpha) f_{X_4}(\beta).$$ ■

This result is a special case of a more general result. Consider a collection $X_1, \ldots, X_n, X_{n+1}, \ldots, X_{n+m}$ of independent random variables. Let $\varphi$ and $\psi$ be real-valued functions of n and m variables, respectively. Then $\varphi(X_1, \ldots, X_n)$ and $\psi(X_{n+1}, \ldots, X_{n+m})$ are independent random variables. Any potential for gaining insight into probability theory by proving this result would be overwhelmed by the cumbersome notation required at this stage.

Theorem 3.3.4 Let $X_1, X_2, \ldots, X_k$ be independent random variables.

(i) If each $X_i$ has a binomial density with parameters $n_i$ and p, then $X_1 + \cdots + X_k$ has a binomial density with parameters $n_1 + \cdots + n_k$ and p.

(ii) If each $X_i$ has a Poisson density with parameter $\lambda_i$, then $X_1 + \cdots + X_k$ has a Poisson density with parameter $\lambda_1 + \cdots + \lambda_k$.

(iii) If each $X_i$ has a negative binomial density with parameters $r_i$ and p, then $X_1 + \cdots + X_k$ has a negative binomial density with parameters $r_1 + \cdots + r_k$ and p.

Assertions (i) and (ii) were proved in the k = 2 case in the last two examples; the general cases use these results and a mathematical induction argument. Assertion (iii) is left as an exercise in the k = 2 case, the general case again being an easy application of mathematical induction.

The problem of finding the density function of $Z = \varphi(X_1, \ldots, X_n)$ can be difficult. Sometimes independence makes it possible, as in the following example.
EXAMPLE 3.22 Let X and Y be independent random variables each of which has a uniform density on {1, 2, ..., n} and let Z = max(X, Y). The range of Z is then {1, 2, ..., n}. Suppose z is in the range. Rather than calculating $f_Z(z) = P(Z = z)$, we calculate $P(Z \le z)$ for reasons that will become apparent.

$$\begin{aligned} P(Z \le z) &= P(\max(X, Y) \le z) = P(X \le z, Y \le z) \\ &= \sum_{x=1}^{z} \sum_{y=1}^{z} P(X = x, Y = y) = \sum_{x=1}^{z} \sum_{y=1}^{z} P(X = x) P(Y = y) = \frac{z^2}{n^2}. \end{aligned}$$

By Equation 2.8, for $2 \le z \le n$,

$$\begin{aligned} P(Z = z) &= P((Z \le z) \cap (Z \le z - 1)^c) \\ &= P(Z \le z) - P((Z \le z) \cap (Z \le z - 1)) \\ &= P(Z \le z) - P(Z \le z - 1) \\ &= \frac{z^2 - (z-1)^2}{n^2} = \frac{2z - 1}{n^2}. \end{aligned}$$

It is easy to see that this result holds for z = 1 also. Thus,

$$f_Z(z) = \begin{cases} (2z - 1)/n^2 & \text{if } z = 1, 2, \ldots, n \\ 0 & \text{otherwise.} \end{cases} \tag{3.10}$$ ■

EXERCISES 3.3

Problem 1 requires the following fact from the calculus. If $\sum_{n=0}^{\infty} a_n x^n$ and $\sum_{n=0}^{\infty} b_n x^n$ are power series, the product of their sums can be written $\sum_{n=0}^{\infty} c_n x^n$, where $c_n = \sum_{k=0}^{n} a_k b_{n-k}$ for $n \ge 0$, on the common interval of convergence.

1. If a and b are any real numbers and z is a nonnegative integer, show that

$$\sum_{x=0}^{z} \binom{a}{x} \binom{b}{z - x} = \binom{a + b}{z}.$$

2. Let X and Y be independent random variables having negative binomial densities with parameters r and p and s and p, respectively. Derive the density of Z = X + Y.

3. Let X and Y be independent random variables each having a uniform density on {1, 2, ..., n}. Calculate P(X > Y) and P(X = Y).

4. Let N be a random variable having a Poisson density with parameter $\lambda > 0$. Given that N = n, n Bernoulli trials are performed, the number X of successes is counted, and the number Y of failures is counted. Show that X and Y are independent random variables.

5. Let X and Y be independent random variables having geometric densities with the same parameter p. Calculate $P(X \ge Y)$ and P(X = Y).

6. Let X and Y be as in Problem 5. Find the density of Z = X + Y.

7. Let X and Y be independent random variables having uniform densities on {1, 2, ..., n} and let Z = X + Y. Find the density of Z.

8. Let X and Y be independent random variables and let $\varphi$ and $\psi$ be two real-valued functions on R.
Show that $\varphi(X)$ and $\psi(Y)$ are independent random variables.

Solving the following problem without the benefit of Mathematica or Maple V software would be tedious.

9. Let X and Y be independent random variables with X having a binomial density $b(\cdot\,; 10, 1/2)$ and Y having a uniform density on {1, 2, 3}. Find the density of Z = X + Y accurate to three decimal places.

3.4 GENERATING FUNCTIONS

In some instances, the problem of finding the density function of a sum of two random variables can be transformed into a purely algebraic problem using generating functions.

Definition 3.9 Let $\{a_j\}_{j=0}^{\infty}$ be a sequence of real numbers. If the power series $\sum_{j=0}^{\infty} a_j t^j$ has $(-t_0, t_0)$ as its interval of convergence for some $t_0 > 0$, then the function $A(t) = \sum_{j=0}^{\infty} a_j t^j$ is called its generating function. ■

EXAMPLE 3.23 If $a_j = 1$ for all $j \ge 0$, then $A(t) = \sum_{j=0}^{\infty} t^j$ has (-1, 1) as its interval of convergence and A(t) = 1/(1 - t). If $a_j = 1/j!$ for all $j \ge 0$, then $A(t) = \sum_{j=0}^{\infty} t^j/j!$ has $(-\infty, \infty)$ as its interval of convergence and $A(t) = e^t$. If $a_0 = a_1 = 0$ and $a_j = 1$ for all $j \ge 2$, then $A(t) = \sum_{j=2}^{\infty} t^j$ has (-1, 1) as its interval of convergence and $A(t) = t^2/(1 - t)$. ■

Returning to the notation of Definition 3.9, if there is an $M \in R$ such that $|a_j| \le M$ for all $j \ge 0$, then the series $\sum_{j=0}^{\infty} a_j t^j$ converges absolutely at least for -1 < t < 1, since the general term of the series $\sum_{j=0}^{\infty} |a_j t^j|$ is dominated by the general term of the series $\sum_{j=0}^{\infty} M|t|^j$, which is known to converge for |t| < 1, and thus the interval of convergence of $\sum_{j=0}^{\infty} a_j t^j$ contains the interval (-1, 1).

An important result about generating functions is the fact that if a function can be represented as the sum of a power series on an open interval containing 0, then that representation is unique; i.e., if

$$f(t) = \sum_{j=0}^{\infty} a_j t^j = \sum_{j=0}^{\infty} b_j t^j$$

on an open interval containing 0, then $a_j = b_j$ for all $j \ge 0$.

EXAMPLE 3.24 Suppose we have found that a generating function is given by

$$A(t) = \frac{1}{1 - t^2}, \qquad |t| < 1.$$

What is the sequence $\{a_j\}_{j=0}^{\infty}$?
We can interpret $1/(1 - t^2)$ as the sum of the geometric series $\sum_{j=0}^{\infty} t^{2j}$. Thus,

$$A(t) = \sum_{j=0}^{\infty} a_j t^j = \sum_{j=0}^{\infty} t^{2j}$$

and corresponding coefficients of $t^j$ are equal. Noting that coefficients of even powers of t on the right are equal to 1 and coefficients of odd powers of t on the right are 0, $a_j = (1/2)(1 - (-1)^{j+1})$, j = 0, 1, 2, .... ■

Another important property of power series is the following. If $\{a_j\}_{j=0}^{\infty}$ and $\{b_j\}_{j=0}^{\infty}$ are sequences of real numbers and $c_j = a_0 b_j + \cdots + a_j b_0$, $j \ge 0$, then the power series $\sum_{j=0}^{\infty} c_j t^j$ converges absolutely on the common interval of convergence of the series $\sum_{j=0}^{\infty} a_j t^j$ and $\sum_{j=0}^{\infty} b_j t^j$, and

$$\sum_{j=0}^{\infty} c_j t^j = \left(\sum_{j=0}^{\infty} a_j t^j\right) \left(\sum_{j=0}^{\infty} b_j t^j\right).$$

It is important to remember that the method of generating functions applies only to nonnegative integer-valued random variables.

Definition 3.10 If X is a nonnegative integer-valued random variable with density $f_X$, its generating function is the function $\hat{f}_X$ on [-1, 1] defined by

$$\hat{f}_X(t) = \sum_{x=0}^{\infty} f_X(x) t^x, \qquad -1 \le t \le 1.$$ ■

Note that the generating function of X is the same as the generating function of the sequence $\{f_X(x)\}_{x=0}^{\infty}$. Since $|f_X(x)| \le 1$ for all $x \ge 0$, $\hat{f}_X(t)$ is certainly defined on (-1, 1). But since $\sum_{x=0}^{\infty} f_X(x) = 1$, the power series converges absolutely when |t| = 1. Thus, $\hat{f}_X$ is defined on [-1, 1].

EXAMPLE 3.25 Let X have a geometric density with parameter p, 0 < p < 1. Then

$$\hat{f}_X(t) = \sum_{x=1}^{\infty} pq^{x-1} t^x = pt \sum_{x=1}^{\infty} (qt)^{x-1} = pt \sum_{y=0}^{\infty} (qt)^y = \frac{pt}{1 - qt}.$$ ■

EXAMPLE 3.26 Let X have a binomial density with parameters n and p. Then

$$\hat{f}_X(t) = \sum_{x=0}^{\infty} b(x; n, p) t^x = \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x} t^x = \sum_{x=0}^{n} \binom{n}{x} (pt)^x q^{n-x} = (pt + q)^n.$$ ■

EXAMPLE 3.27 Let X have a Poisson density with parameter $\lambda > 0$. Then

$$\hat{f}_X(t) = \sum_{x=0}^{\infty} \frac{\lambda^x e^{-\lambda}}{x!} t^x = e^{-\lambda} \sum_{x=0}^{\infty} \frac{(\lambda t)^x}{x!} = e^{-\lambda} e^{\lambda t} = e^{\lambda(t-1)}.$$ ■

EXAMPLE 3.28 Let X have a negative binomial density with parameters r and p. Then

$$\hat{f}_X(t) = \sum_{x=0}^{\infty} \binom{-r}{x} p^r (-q)^x t^x = p^r \sum_{x=0}^{\infty} \binom{-r}{x} (-qt)^x.$$

By the generalized binomial theorem,

$$\hat{f}_X(t) = p^r (1 - qt)^{-r}.$$ ■

The following theorem is one justification for introducing generating functions.
Theorem 3.4.1 If X and Y are independent nonnegative integer-valued random variables and Z = X + Y, then $\hat{f}_Z(t) = \hat{f}_X(t) \hat{f}_Y(t)$ for all $t \in [-1, 1]$.

PROOF: First note that

$$\hat{f}_X(t) \hat{f}_Y(t) = \left(\sum_{x=0}^{\infty} f_X(x) t^x\right) \left(\sum_{y=0}^{\infty} f_Y(y) t^y\right) = \sum_{z=0}^{\infty} c_z t^z$$

where $c_z = \sum_{x=0}^{z} f_X(x) f_Y(z - x)$. By Theorem 3.3.2,

$$\sum_{z=0}^{\infty} c_z t^z = \sum_{z=0}^{\infty} f_Z(z) t^z = \hat{f}_Z(t).$$

Thus, $\hat{f}_Z(t) = \hat{f}_X(t) \hat{f}_Y(t)$. ■

EXAMPLE 3.29 Let X be a random variable taking on the values 1, 2, 3 with probabilities .02, .53, .45, respectively, and let Y be a random variable taking on the values 1, 2, 3, 4 with equal probabilities. What is the density of Z = X + Y, assuming that X and Y are independent? The generating function of X is $\hat{f}_X(t) = .02t + .53t^2 + .45t^3$, the generating function of Y is $\hat{f}_Y(t) = .25t + .25t^2 + .25t^3 + .25t^4$, and the generating function of Z is

$$\begin{aligned} \hat{f}_Z(t) &= \hat{f}_X(t) \hat{f}_Y(t) \\ &= (.02t + .53t^2 + .45t^3)(.25t + .25t^2 + .25t^3 + .25t^4) \\ &= .005t^2 + .1375t^3 + .25t^4 + .25t^5 + .245t^6 + .1125t^7. \end{aligned}$$

Therefore, $f_Z(2) = .005$, $f_Z(3) = .1375$, $f_Z(4) = f_Z(5) = .25$, $f_Z(6) = .245$, and $f_Z(7) = .1125$. ■

Corollary 3.4.2 If $X_1, \ldots, X_n$ are independent nonnegative integer-valued random variables and $Z = X_1 + \cdots + X_n$, then $\hat{f}_Z(t) = \prod_{i=1}^{n} \hat{f}_{X_i}(t)$.

PROOF: The statement is trivially true when n = 1. Assume it is true for n - 1. Since $X_1 + \cdots + X_{n-1}$ and $X_n$ are independent by Lemma 3.3.3,

$$\hat{f}_Z(t) = \hat{f}_{X_1 + \cdots + X_{n-1}}(t) \cdot \hat{f}_{X_n}(t)$$

by Theorem 3.4.1. By the induction hypothesis, $\hat{f}_{X_1 + \cdots + X_{n-1}}(t) = \prod_{i=1}^{n-1} \hat{f}_{X_i}(t)$. Therefore, $\hat{f}_Z(t) = \prod_{i=1}^{n} \hat{f}_{X_i}(t)$. ■

This corollary provides an alternative proof of the three assertions in Theorem 3.3.4.

EXAMPLE 3.30 Let $X_1, \ldots, X_k$ be independent random variables and let $Z = X_1 + \cdots + X_k$. If each $X_i$ has a negative binomial density with parameters $r_i$ and p, then

$$\hat{f}_Z(t) = \prod_{i=1}^{k} p^{r_i} (1 - qt)^{-r_i} = p^r (1 - qt)^{-r}$$

where $r = r_1 + \cdots + r_k$. But by the generalized binomial theorem,

$$p^r (1 - qt)^{-r} = \sum_{z=0}^{\infty} \binom{-r}{z} p^r (-qt)^z.$$

Therefore,

$$\hat{f}_Z(t) = \sum_{z=0}^{\infty} \binom{-r}{z} p^r (-q)^z t^z$$

and it follows that

$$f_Z(z) = \binom{-r}{z} p^r (-q)^z, \qquad z = 0, 1, 2, \ldots;$$

i.e., Z has a negative binomial density with parameters $r = r_1 + \cdots + r_k$ and p.
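Theorem 3.4.1 says that the density of a sum of independent nonnegative integer-valued random variables can be read off from a product of generating functions, and for finitely many values that product is ordinary polynomial multiplication. The sketch below reproduces the coefficients of Example 3.29 in exact arithmetic; `gf_product` and the list-of-coefficients representation (index = power of t) are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction
from typing import List


def gf_product(a: List[Fraction], b: List[Fraction]) -> List[Fraction]:
    """Coefficients of the product of two polynomials; index = power of t."""
    c = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            c[i + j] += a_i * b_j
    return c


# Example 3.29: coefficient lists indexed by the power of t (index 0 is t^0).
f_X = [Fraction(0), Fraction(2, 100), Fraction(53, 100), Fraction(45, 100)]
f_Y = [Fraction(0)] + [Fraction(1, 4)] * 4

f_Z = gf_product(f_X, f_Y)
for z, prob in enumerate(f_Z):
    if prob:
        print(z, float(prob))
```

Applying `gf_product` repeatedly gives the product in Corollary 3.4.2 for any finite number of independent summands.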
■

Having tediously calculated the probabilities $p(x)$ of getting a score of $x$ upon rolling three dice in Exercise 1.4.3, the reader will appreciate the ease with which these probabilities can be calculated using generating functions. Let $X_1$, $X_2$, and $X_3$ denote the number of pips on each of the dice and let $X = X_1 + X_2 + X_3$. The generating function of each $X_i$ is

$$\hat f_{X_i}(t) = \frac{t}{6} + \frac{t^2}{6} + \cdots + \frac{t^6}{6}.$$

Since $X_1$, $X_2$, and $X_3$ are independent, by Corollary 3.4.2,

$$\hat f_X(t) = \Big(\frac{t}{6} + \frac{t^2}{6} + \cdots + \frac{t^6}{6}\Big)^3.$$

Expanding the expression on the right side using mathematical software,

$$\hat f_X(t) = \frac{1}{216}t^3 + \frac{1}{72}t^4 + \frac{1}{36}t^5 + \frac{5}{108}t^6 + \frac{5}{72}t^7 + \frac{7}{72}t^8 + \frac{25}{216}t^9 + \frac{1}{8}t^{10}$$
$$\qquad + \frac{1}{8}t^{11} + \frac{25}{216}t^{12} + \frac{7}{72}t^{13} + \frac{5}{72}t^{14} + \frac{5}{108}t^{15} + \frac{1}{36}t^{16} + \frac{1}{72}t^{17} + \frac{1}{216}t^{18}.$$

Since $f_X(x)$ is just the coefficient of $t^x$, the probabilities can be read off; e.g., $P(X = 13) = 7/72$.

Generating functions are particularly useful for solving difference equations, as in the following example.

EXAMPLE 3.31 Consider an infinite sequence of Bernoulli trials with probability of success $p$. For each $n \ge 1$, let $p_n$ be the probability of the event $E_n$: "an even number of successes in the first $n$ trials." We will use the fact that the probability of an even number of successes in trials $1, \ldots, n - 1$ is the same as the probability of an even number of successes in trials $2, \ldots, n$. If the first trial results in a failure, in order for an outcome to be in $E_n$ there must be an even number of successes in trials $2, \ldots, n$, and if the first trial results in success, there must be an odd number of successes in trials $2, \ldots, n$. Thus,

$$p_n = q p_{n-1} + p(1 - p_{n-1}).$$

Decomposing $E_n$ in this way makes sense only for $n \ge 2$. In one trial there is only one way to get an even number of successes, namely none at all, by that trial resulting in failure. Thus, $p_1 = q$. If the equation above is to hold when $n = 1$, we must have

$$q = p_1 = q p_0 + p(1 - p_0),$$

and so $p_0$ must be taken equal to 1.
Therefore, the $p_n$ satisfy the difference equation

$$p_n = q p_{n-1} + p(1 - p_{n-1}), \qquad n \ge 1 \tag{3.11}$$

and the initial condition $p_0 = 1$. To solve the equation subject to the initial condition, let $P(t)$ be the generating function of the sequence $\{p_n\}_{n=0}^{\infty}$; i.e., $P(t) = \sum_{n=0}^{\infty} p_n t^n$. Multiplying both sides of Equation 3.11 by $t^n$ and summing over $n = 1, 2, \ldots$,

$$\sum_{n=1}^{\infty} p_n t^n = qt \sum_{n=1}^{\infty} p_{n-1} t^{n-1} + pt \sum_{n=1}^{\infty} t^{n-1} - pt \sum_{n=1}^{\infty} p_{n-1} t^{n-1}$$
$$= qt \sum_{n=0}^{\infty} p_n t^n + pt \sum_{n=0}^{\infty} t^n - pt \sum_{n=0}^{\infty} p_n t^n.$$

Since $\sum_{n=1}^{\infty} p_n t^n = P(t) - p_0 = P(t) - 1$ and $\sum_{n=0}^{\infty} t^n = 1/(1 - t)$,

$$P(t) - 1 = qtP(t) + \frac{pt}{1 - t} - ptP(t).$$

Solving for $P(t)$,

$$P(t) = \frac{1}{1 - qt + pt} + \frac{pt}{(1 - t)(1 - qt + pt)}.$$

Applying the method of partial fraction expansions to the second term on the right,

$$P(t) = \frac{1}{1 - qt + pt} + \frac{p}{1 - q + p}\cdot\frac{1}{1 - t} - \frac{p}{1 - q + p}\cdot\frac{1}{1 - qt + pt}.$$

Since $1 - q + p = 2p$,

$$P(t) = \frac{1}{2}\cdot\frac{1}{1 - t} + \frac{1}{2}\cdot\frac{1}{1 - qt + pt}$$

and

$$2P(t) = \frac{1}{1 - t} + \frac{1}{1 - qt + pt}.$$

Regarding the two terms on the right as sums of geometric series,

$$\sum_{n=0}^{\infty} 2p_n t^n = \sum_{n=0}^{\infty} t^n + \sum_{n=0}^{\infty} (q - p)^n t^n = \sum_{n=0}^{\infty} \big(1 + (q - p)^n\big) t^n.$$

Equating coefficients of $t^n$, we obtain $p_n = \frac{1}{2}\big(1 + (q - p)^n\big)$, $n \ge 1$. This solution for $p_n$ is much more enlightening than the solution

$$p_n = b(0; n, p) + b(2; n, p) + b(4; n, p) + \cdots + b(2m; n, p)$$

where $m$ is the greatest integer such that $2m \le n$. ■

It is often necessary to interchange the order of summation of two infinite series. The essential facts will be presented; proofs are contained in the appendix at the end of the chapter.

A map $a : N \times N \to R$ is called a double sequence, and its value at $(i, j)$ is denoted by $a_{i,j}$. We also write $a = \{a_{i,j}\} = \{a_{i,j}\}_{i,j=1}^{\infty}$. The double sequence $\{a_{i,j}\}$ converges and has limit $L$ if for every $\epsilon > 0$ there are integers $M, N \ge 1$ such that $|a_{m,n} - L| < \epsilon$ for all $m \ge M$, $n \ge N$. In this case we write $\lim_{i,j \to \infty} a_{i,j} = L$.

Given a double sequence $\{a_{i,j}\}$, the formal expression

$$\sum_{i,j=1}^{\infty} a_{i,j}$$

can be formed and is called a double series.
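For each $m \ge 1$, $n \ge 1$, the following partial sum can be formed:

$$S_{m,n} = \sum_{\substack{1 \le i \le m \\ 1 \le j \le n}} a_{i,j}.$$

EXAMPLE 3.32 Consider the double sequence $\{a_{i,j}\}$ defined by $a_{i,j} = (-1)^{i+j}(1/j^i)$. Since $S_{m,n}$ is a sum of finitely many terms, the terms can be added in any order. Fixing $j$ and summing over $i$,

$$S_{m,n} = \sum_{j=1}^{n} \Big(\sum_{i=1}^{m} a_{i,j}\Big).$$

On the other hand, fixing $i$ and summing over $j$,

$$S_{m,n} = \sum_{i=1}^{m} \Big(\sum_{j=1}^{n} a_{i,j}\Big).$$

Definition 3.11 The double series $\sum_{i,j=1}^{\infty} a_{i,j}$ is said to converge and have sum $S$ if $\lim_{i,j \to \infty} S_{i,j} = S$; i.e., if for each $\epsilon > 0$ there are integers $M, N \ge 1$ such that $|S_{i,j} - S| < \epsilon$ for all $i \ge M$, $j \ge N$. ■

If the $a_{i,j} \ge 0$ for all $i, j \ge 1$, we say that $\sum_{i,j=1}^{\infty} a_{i,j}$ diverges to $+\infty$ if for every $L \in R$ there are integers $M, N \ge 1$ such that $S_{i,j} \ge L$ for all $i \ge M$, $j \ge N$.

Given the double series $\sum_{i,j=1}^{\infty} a_{i,j}$ we can form the iterated sums

$$\sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}\Big) \qquad \text{and} \qquad \sum_{j=1}^{\infty} \Big(\sum_{i=1}^{\infty} a_{i,j}\Big).$$

Proofs often depend upon showing that the latter two iterated series are equal; i.e., the order of summation can be interchanged.

Theorem 3.4.3 If $\sum_{i,j=1}^{\infty} a_{i,j}$ is a double series with $a_{i,j} \ge 0$ for all $i, j \ge 1$, then

$$\sum_{i,j=1}^{\infty} a_{i,j} = \sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}\Big) = \sum_{j=1}^{\infty} \Big(\sum_{i=1}^{\infty} a_{i,j}\Big)$$

even if any sum is $+\infty$.

The double series $\sum_{i,j=1}^{\infty} a_{i,j}$ converges absolutely if the double series $\sum_{i,j=1}^{\infty} |a_{i,j}|$ converges.

Theorem 3.4.4 If the double series $\sum_{i,j=1}^{\infty} a_{i,j}$ converges absolutely, then

$$\sum_{i,j=1}^{\infty} a_{i,j} = \sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}\Big) = \sum_{j=1}^{\infty} \Big(\sum_{i=1}^{\infty} a_{i,j}\Big).$$

The next application of generating functions has to do with the sum of a random number of random variables. Consider any infinite sequence of random variables $\{X_j\}_{j=1}^{\infty}$ and let $N$ be any random variable taking on values in $\{1, 2, \ldots\}$. An outcome $\omega$ determines an infinite sequence $X_1(\omega), X_2(\omega), \ldots$ as well as a positive integer $N(\omega)$, and we can form the sum of the first $N(\omega)$ terms of the infinite sequence, which is denoted by

$$S_N(\omega) = X_1(\omega) + X_2(\omega) + \cdots + X_{N(\omega)}(\omega).$$

$S_N$ is called the sum of a random number of random variables. If $N$ is a constant $n$, then $S_n$ is the sum of a fixed number of random variables. That $S_N$ is a random variable follows from the fact that

$$(S_N = s) = \bigcup_{n=1}^{\infty} (S_N = s, N = n) = \bigcup_{n=1}^{\infty} \big((N = n) \cap (S_n = s)\big)$$

and the fact that each $S_n$ is a random variable.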
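The stratification of $(S_N = s)$ over the values of $N$ can be turned into a small computation. The sketch below is only an illustration and makes assumptions beyond the paragraph above: $N$ takes finitely many values, the $X_j$ are nonnegative integer-valued with a common density, and $N, X_1, X_2, \ldots$ are independent, so that $P(S_N = s) = \sum_n P(N = n)P(S_n = s)$. The names `convolve` and `random_sum_density` and the sample densities are hypothetical.

```python
from fractions import Fraction

def convolve(f, g):
    """Density of the sum of two independent nonnegative integer-valued variables."""
    h = [Fraction(0)] * (len(f) + len(g) - 1)
    for x, fx in enumerate(f):
        for y, gy in enumerate(g):
            h[x + y] += fx * gy
    return h

def random_sum_density(f_N, f_X):
    """Density of S_N = X_1 + ... + X_N under the independence assumption.

    f_N maps n -> P(N = n) for finitely many n >= 1; f_X[x] = P(X_j = x).
    """
    max_n = max(f_N)
    total = [Fraction(0)] * (max_n * (len(f_X) - 1) + 1)
    f_Sn = [Fraction(1)]                 # density of the empty sum S_0 = 0
    for n in range(1, max_n + 1):
        f_Sn = convolve(f_Sn, f_X)       # density of S_n
        weight = f_N.get(n, Fraction(0))
        for s, prob in enumerate(f_Sn):
            total[s] += weight * prob    # contribution of the stratum (N = n)
    return total

# Hypothetical example: N uniform on {1, 2, 3}, each X_j Bernoulli with p = 1/2.
f_N = {n: Fraction(1, 3) for n in (1, 2, 3)}
f_X = [Fraction(1, 2), Fraction(1, 2)]
f_SN = random_sum_density(f_N, f_X)
print(f_SN)
```

Exact fractions are used so that the resulting probabilities can be checked to sum to 1 without rounding error.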
Theorem 3.4.5 If $\{X_j\}_{j=1}^{\infty}$ is an infinite sequence of independent nonnegative integer-valued random variables all having the same density function $f$, $N$ is a positive integer-valued random variable, and $N, X_1, X_2, \ldots$ are independent, then

$$\hat f_{S_N}(t) = \hat f_N\big(\hat f_{X_1}(t)\big).$$

PROOF: By stratifying the event $(S_N = s)$ according to the values of $N$ and using the fact that $S_N = S_n$ on the event $(N = n)$,

$$\hat f_{S_N}(t) = \sum_{s=0}^{\infty} f_{S_N}(s) t^s = \sum_{s=0}^{\infty} P(S_N = s) t^s = \sum_{s=0}^{\infty} \sum_{n=1}^{\infty} P(S_N = s, N = n) t^s = \sum_{s=0}^{\infty} \sum_{n=1}^{\infty} P(S_n = s, N = n) t^s.$$

Since $N, X_1, \ldots, X_n$ are independent random variables, $X_1 + \cdots + X_n$ and $N$ are independent random variables by Lemma 3.3.3. By Theorem 3.4.4,

$$\hat f_{S_N}(t) = \sum_{s=0}^{\infty} \sum_{n=1}^{\infty} P(S_n = s) P(N = n) t^s = \sum_{n=1}^{\infty} \Big(\sum_{s=0}^{\infty} P(S_n = s) t^s\Big) P(N = n)$$
$$= \sum_{n=0}^{\infty} \hat f_{S_n}(t) f_N(n) = \sum_{n=0}^{\infty} \big(\hat f_{X_1}(t)\big)^n f_N(n)$$

where the terms corresponding to $n = 0$ in the last two expressions can be included since $f_N(0) = 0$. Since $\hat f_N(t) = \sum_{n=1}^{\infty} t^n f_N(n)$, $\hat f_{S_N}(t) = \hat f_N(\hat f_{X_1}(t))$. ■

EXAMPLE 3.33 Suppose the wind carries $N$ seeds onto a given plot of land where $N$ has a Poisson density with parameter $\lambda > 0$ and each seed has probability $p$ of germinating, independently of the number of seeds and independently of the other seeds. Let $X_1, X_2, \ldots$ be an infinite sequence of Bernoulli random variables with $P(X_j = 1) = p$. The number of germinating seeds is then $S_N = X_1 + \cdots + X_N$. Since $\hat f_{X_1}(t) = pt + q$ and $\hat f_N(t) = e^{\lambda(t-1)}$,

$$\hat f_{S_N}(t) = \hat f_N(pt + q) = e^{\lambda(pt + q - 1)} = e^{\lambda p(t - 1)}.$$

It follows that $S_N$ has a Poisson density with parameter $\lambda p$. ■

EXERCISES 3.4

1. Let $X$ be a random variable having a uniform density on $\{1, 2, \ldots, n\}$. Find the generating function of $X$.
2. The sequence of real numbers $\{a_j\}_{j=1}^{\infty}$ has the generating function $A(t) = 1 - (1 - t^2)^{1/2}$. Find a formula for the $a_j$.
3. If the random variable $X$ has each of the following generating functions, what is the corresponding density function? (a) $\hat f_X(t) = \ldots$ (b) $\hat f_X(t) = \ldots$ (c) $\hat f_X(t) = t/(8 - 7t)^5$
4. A die is rolled to determine how many times a coin is to be flipped and then the coin is flipped that many times. Let $X$ be the number of heads so obtained.
Find the generating function of $X$.
5. Consider the generating function … Describe a compound experiment and an associated random variable having this generating function.
6. If the random variable $X$ has the generating function $\hat f_X(t) = e^{2(t^2 - 1)}$, what is the density function of $X$?
7. Consider an infinite sequence of Bernoulli random variables $X_1, X_2, \ldots$ with probability of success $p$. Let $E_n$ be the event "there are an even number of successes in the first $n$ trials." Express $E_n$ in terms of the random variables $X_1, X_2, \ldots$ and use the theorems of probability theory to derive Equation 3.11 by stratifying $E_n$ according to the values of $X_1$; i.e., $p_n = P(E_n) = P(E_n \cap (X_1 = 0)) + P(E_n \cap (X_1 = 1))$, and so forth.

The following problems require software such as Mathematica or Maple V.

8. Consider an infinite sequence of Bernoulli trials with probability of success $p = 1/2$. For $n \ge 1$, let $q_n$ be the probability that the pattern 11 will not appear in the first $n$ trials (i.e., the probability that there will not be two consecutive 1's). Derive a difference equation for the $q_n$, specify initial conditions, and find a formula for the $q_n$.
9. If 10 dice are tossed simultaneously, what is the probability of getting a score of 42?
10. If $X$ has the binomial density $b(\cdot\,; 5, .5)$, $Y$ has a uniform density on $\{1, 2, \ldots, 6\}$, and $X$, $Y$ are independent random variables, find the density of $Z = X + Y$.
11. For $j = 1, \ldots, 10$, the random variable $X_j$ takes on the values 1 and 0 with probabilities $p_j$ and $1 - p_j$, respectively, where $p_j = (.95)^{j/2}$. If $X = X_1 + \cdots + X_{10}$, find the density of $X$ assuming that $X_1, \ldots, X_{10}$ are independent.

GAMBLER'S RUIN PROBLEM

Suppose a gambler and an opponent have a combined capital of $a$ units and the gambler has $x$ units of capital where $1 \le x \le a - 1$. The gambler wagers one unit on successive plays of a game in which the probability that he will win one unit is $p$ and that he will lose one unit is $q = 1 - p$, where $0 < p < 1$.
The gambler is ruined if his capital ever reaches zero units; his opponent is ruined if the gambler's capital ever reaches $a$ units. What is the probability that the gambler will be ruined eventually? Since it is conceivable that the wagers could go on forever and neither be ruined, it is of interest to also find the probability that the opponent will be ruined. A more immediate question concerns a probability model for which these questions make sense.

Let $\{X_j\}$ be an infinite sequence of Bernoulli trials with probability of success $p$ so that the $X_j$'s are independent, $P(X_j = 1) = p$, and $P(X_j = 0) = q$. If for each $j \ge 1$ we let $Y_j = 2X_j - 1$, then the $Y_j$'s are independent random variables with $P(Y_j = 1) = p$ and $P(Y_j = -1) = q$. For $j \ge 1$, $Y_j$ represents the gambler's gain on the $j$th play of the game. His capital as of the $j$th play is then

$$S_j = x + Y_1 + \cdots + Y_j, \qquad j \ge 1.$$

The gambler is ruined if $S_1 = 0$ or $0 < S_1 < a, \ldots, 0 < S_{j-1} < a, S_j = 0$ for some $j \ge 2$. The probability of eventual ruin $q_x$, which depends upon $x$, is given by

$$q_x = P(S_1 = 0 \text{ or } 0 < S_1 < a, \ldots, 0 < S_{j-1} < a, S_j = 0 \text{ for some } j \ge 2).$$

Since the indicated events are mutually exclusive,

$$q_x = P(S_1 = 0) + P(0 < S_1 < a, \ldots, 0 < S_{j-1} < a, S_j = 0 \text{ for some } j \ge 2).$$

Suppose $1 < x < a - 1$. Then ruin cannot occur on the first wager, and

$$q_x = P(0 < S_1 < a, \ldots, 0 < S_{j-1} < a, S_j = 0 \text{ for some } j \ge 2).$$

We now show that

$$q_x = p q_{x+1} + q q_{x-1}, \qquad 1 < x < a - 1.$$

A "probabilistic argument" can be made as follows. The first wager can result in winning one unit, with probability $p$, whereupon the gambler's capital becomes $x + 1$ and the probability of subsequent ruin is $q_{x+1}$; since an event determined solely by the first wager and an event determined by subsequent wagers are independent, the probability of winning the first wager and then being ruined is $p q_{x+1}$. Similarly, the probability of losing the first wager and then being ruined is $q q_{x-1}$.
Since these two possibilities are mutually exclusive, $q_x = p q_{x+1} + q q_{x-1}$ for $1 < x < a - 1$. The same argument applies when $x = 1$, with the exception that the probability of losing the first wager and then being ruined is $q \cdot 1$, since ruin has already occurred on the first wager. Thus, $q_1 = p q_2 + q$, and if the equation above is to hold when $x = 1$, we must have $q_0 = 1$. Similarly, $q_{a-1} = q q_{a-2}$, and we must have $q_a = 0$. The $q_x$ must then satisfy the difference equation

$$q_x = p q_{x+1} + q q_{x-1}, \qquad 1 \le x \le a - 1 \tag{3.12}$$

subject to the boundary conditions

$$q_0 = 1, \qquad q_a = 0. \tag{3.13}$$

One way of solving such a problem is to try known functions successively until we come across a solution; e.g., $q_x = A$, $q_x = Bx$, $q_x = Cx^2, \ldots, q_x = D\lambda^x$, and so forth, where $A, B, C, D, \ldots$ are constants. It is easy to check that if $A$ is any constant, then $q_x = A$ satisfies the difference equation but does not satisfy the boundary conditions. Trying $q_x = B\lambda^x$ results in a quadratic equation in $\lambda$ that has two roots $\lambda = 1$ and $\lambda = q/p$. At this point, we must consider two cases according to whether $p \ne q$ or $p = q = 1/2$, since there is only one solution in the latter case.

Suppose first that $p \ne q$ so that there are two distinct roots of the quadratic equation. In this case there are two solutions $q_x = A$ and $q_x = B(q/p)^x$, but neither satisfies both boundary conditions. Noting that the difference equation has the property that if $q_x^{(1)}$ and $q_x^{(2)}$ are two solutions, then $q_x^{(1)} + q_x^{(2)}$ is also a solution, we might try

$$q_x = A + B\Big(\frac{q}{p}\Big)^x.$$

In this case, $A$ and $B$ can be chosen so that both boundary conditions are satisfied and satisfy the equations

$$A + B = 1, \qquad A + B\Big(\frac{q}{p}\Big)^a = 0.$$

Solving for $A$ and $B$,

$$q_x = \frac{(q/p)^a - (q/p)^x}{(q/p)^a - 1}, \qquad 1 \le x \le a - 1 \tag{3.14}$$

provided $p \ne q$.

Suppose now that $p = q = 1/2$. Again $q_x = A$ is a solution of the difference equation

$$q_x = \tfrac{1}{2} q_{x+1} + \tfrac{1}{2} q_{x-1}, \qquad 1 \le x \le a - 1$$

but does not satisfy both boundary conditions. This time $q_x = Bx$ satisfies the difference equation but not the boundary conditions.
The function $q_x = A + Bx$ will satisfy all conditions and leads to the solution

$$q_x = 1 - \frac{x}{a}, \qquad 1 \le x \le a - 1 \tag{3.15}$$

provided $p = q = 1/2$.

Equations 3.14 and 3.15 provide the answer to the first of the two questions originally raised about the probability of eventual ruin. What about the second question pertaining to the probability $p_x$ that the gambler will wipe out his adversary? It is not necessary to repeat the arguments given above, since we can interpret $p_x$ as the probability of ruin for the adversary, in which case $x$ is replaced by $a - x$ and $p$ by $q$ in the equations above. In the $p \ne q$ case,

$$p_x = \frac{(p/q)^a - (p/q)^{a-x}}{(p/q)^a - 1} = \frac{(q/p)^x - 1}{(q/p)^a - 1}, \qquad 1 \le x \le a - 1, \tag{3.16}$$

and in the $p = q = 1/2$ case,

$$p_x = \frac{x}{a}, \qquad 1 \le x \le a - 1. \tag{3.17}$$

Returning to the probability of ruin, we have found a solution to the problem, depending upon whether $p \ne q$ or $p = q = 1/2$. How do we know that the $q_x$ is the real solution to our problem? Perhaps there is some other solution $q_x$ that satisfies the difference equation and the boundary conditions. It is a question of the uniqueness of the solution. Suppose $q_x^{(1)}$ and $q_x^{(2)}$ are two solutions. Then $u_x = q_x^{(1)} - q_x^{(2)}$ will satisfy the equation

$$u_x = p u_{x+1} + q u_{x-1}, \qquad 1 \le x \le a - 1 \tag{3.18}$$

and the boundary conditions

$$u_0 = 0, \qquad u_a = 0. \tag{3.19}$$

Assume that $u_x \not\equiv 0$. By replacing $u_x$ by $-u_x$, if necessary, we can assume that $u_y > 0$ for some $y \in \{0, 1, \ldots, a\}$. Consider the finite set of numbers $\{u_0, u_1, \ldots, u_a\}$. There is some $m$ for which $u_m$ is the largest of the numbers in this set; i.e., $u_m \ge u_x$ for $x = 1, 2, \ldots, a - 1$ and $u_m > 0$. If there is more than one such $m$, we can assume by the well-ordering principle that $m$ is the smallest integer with this property. Then $u_{m-1} < u_m$. But

$$u_m = p u_{m+1} + q u_{m-1} < p u_{m+1} + q u_m \le p u_m + q u_m = u_m,$$

a contradiction. The assumption that $u_x \not\equiv 0$ leads to a contradiction and therefore $u_x \equiv 0$; i.e., $q_x^{(1)} = q_x^{(2)}$ for $x = 1, 2, \ldots, a - 1$. Thus, the $q_x$ given by Equation 3.14 or 3.15 is the only solution of the difference equation satisfying the boundary conditions.
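As a numerical sanity check, the closed forms in Equations 3.14 and 3.15 can be evaluated and substituted back into the difference equation 3.12 and the boundary conditions 3.13. The Python sketch below does this in floating-point arithmetic; the function names and the sample values $a = 10$, $p = .45$ are hypothetical choices made for illustration and are not taken from the text.

```python
def ruin_probability(x, a, p):
    """Probability q_x of eventual ruin, from Equations 3.14 and 3.15."""
    q = 1.0 - p
    if p == q:                            # fair game, Equation 3.15
        return 1.0 - x / a
    r = q / p
    return (r**a - r**x) / (r**a - 1.0)   # Equation 3.14

def check_difference_equation(a, p, tol=1e-12):
    """Check q_x = p*q_{x+1} + q*q_{x-1} for 1 <= x <= a-1 and q_0 = 1, q_a = 0."""
    q = 1.0 - p
    values = [ruin_probability(x, a, p) for x in range(a + 1)]
    ok = abs(values[0] - 1.0) < tol and abs(values[a]) < tol
    for x in range(1, a):
        ok = ok and abs(values[x] - (p * values[x + 1] + q * values[x - 1])) < tol
    return ok

print(check_difference_equation(a=10, p=0.45))   # unfair game
print(check_difference_equation(a=10, p=0.5))    # fair game
print(round(ruin_probability(3, 10, 0.5), 3))
```

The check only confirms that the formulas satisfy the equations for the chosen parameters; uniqueness is what the argument above establishes.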
What happens if the gambler decides to wager one-half unit each time instead of one unit? Will this improve his chances of avoiding eventual ruin? The effect of this change is to double the number of units. In the $p = q = 1/2$ case, the probability of eventual ruin is

$$1 - \frac{2x}{2a} = 1 - \frac{x}{a}, \tag{3.20}$$

and there is no change in the probability of eventual ruin. Suppose $p \ne q$. In this case the probability of eventual ruin is

$$\frac{(q/p)^{2a} - (q/p)^{2x}}{(q/p)^{2a} - 1} = q_x \cdot \frac{(q/p)^a + (q/p)^x}{(q/p)^a + 1}.$$

In the usual situation in which the game is unfair to the gambler, $q > p$, the second factor on the right is greater than 1 so that wagering half a unit instead of a whole unit actually increases the probability of eventual ruin.

EXAMPLE 3.34 Suppose the gambler has an initial capital of \$100 and he decides in advance to continue placing wagers of \$10 on a game with $p = .45$ until he has increased his capital by \$10 or has been ruined. He then has 10 units to wager. By Equation 3.14, the probability of eventual ruin is $q_{10} = .204$. Thus, there is a probability of .796 of achieving the goal of increasing his capital by \$10. Of course, if upon winning the \$10 the gambler gets greedy and continues to play against an adversary who for all practical purposes is infinitely rich, then it is simply a question of how long it will take for ruin to occur. But that is another mathematical problem. ■

EXERCISES 3.5

1. What is the probability that the wagering will eventually terminate?
2. If $q > p$, what is the gambler's probability of eventual ruin against an infinitely rich adversary?
3. If $\{i_1, \ldots, i_m\}$ and $\{j_1, \ldots, j_n\}$ are disjoint sets of positive integers, it is known that events of the type $(Y_{i_1} = \delta_1, \ldots, Y_{i_m} = \delta_m)$ and $(Y_{j_1} = \epsilon_1, \ldots, Y_{j_n} = \epsilon_n)$ are independent. Show that the events $(Y_1 = i)$ and $(Y_2 + \cdots + Y_j = y$ for some $j \ge 2)$ are independent.
4.
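Modify the gambler's ruin problem by allowing the possibility of a tie on each play of the game so that there are positive numbers $\alpha$, $\beta$, $\gamma$ with $\alpha + \beta + \gamma = 1$ such that $P(Y_j = 1) = \alpha$, $P(Y_j = 0) = \beta$, and $P(Y_j = -1) = \gamma$, and let $q_x$ be the probability of eventual ruin for the gambler. Derive a difference equation for the $q_x$ and appropriate boundary conditions. Solve for the $q_x$ and draw conclusions.

APPENDIX

Proof of Theorem 3.4.3: We need only deal with the first equation because the second can be obtained by interchanging the role of $i$ and $j$. Suppose first that $\sum_{i,j=1}^{\infty} a_{i,j}$ converges and has sum $S$. Clearly, $S_{i,j} \le S$ for all $i, j \ge 1$. Since $\lim_{i,j \to \infty} S_{i,j} = S$, given $\epsilon > 0$ there are integers $M, N \ge 1$ such that

$$S - \epsilon < \sum_{\substack{1 \le i \le m \\ 1 \le j \le n}} a_{i,j} \le S \qquad \text{for all } m \ge M,\ n \ge N.$$

Since

$$\sum_{\substack{1 \le i \le m \\ 1 \le j \le n}} a_{i,j} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}$$

is valid for finite sums, $S - \epsilon < \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j} \le S$ for all $m \ge M$, $n \ge N$. Since $\sum_{j=1}^{n} a_{i,j}$ is an increasing sequence for each $i$, with $m$ fixed we can take the limit as $n \to \infty$ to obtain $S - \epsilon < \sum_{i=1}^{m} \sum_{j=1}^{\infty} a_{i,j} \le S$. Since the middle expression increases with $m$ and is bounded above by $S$, the series $\sum_{i=1}^{\infty} \big(\sum_{j=1}^{\infty} a_{i,j}\big)$ converges and

$$S - \epsilon \le \sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}\Big) \le S.$$

Since $\epsilon$ is arbitrary,

$$\sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}\Big) = S.$$

Suppose now that $\sum_{i=1}^{\infty} \big(\sum_{j=1}^{\infty} a_{i,j}\big)$ converges to $S$ in $R$. Given $\epsilon > 0$, there is an $M \ge 1$ such that $S - \epsilon < \sum_{i=1}^{m} \big(\sum_{j=1}^{\infty} a_{i,j}\big) < S + \epsilon$ for all $m \ge M$, from which it follows that for each $i = 1, \ldots, m$, the series $\sum_{j=1}^{\infty} a_{i,j}$ converges. Thus, $S - \epsilon < \lim_{n \to \infty} \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j} < S + \epsilon$, and there is an $N \ge 1$ such that $S - \epsilon < \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j} = S_{m,n} < S + \epsilon$ for all $m \ge M$, $n \ge N$. This shows that $\sum_{i,j=1}^{\infty} a_{i,j}$ converges and has sum $S = \sum_{i=1}^{\infty} \big(\sum_{j=1}^{\infty} a_{i,j}\big)$.

Assume that the double series $\sum_{i,j=1}^{\infty} a_{i,j}$ diverges to $+\infty$. Given any $L \in R$, there are integers $M, N \ge 1$ such that

$$\sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j} = \sum_{\substack{1 \le i \le m \\ 1 \le j \le n}} a_{i,j} \ge L \qquad \text{for all } m \ge M,\ n \ge N.$$

Thus, $\sum_{i=1}^{m} \big(\sum_{j=1}^{\infty} a_{i,j}\big) \ge L$ for $m \ge M$. Thus, the sequence $\big\{\sum_{i=1}^{m} \big(\sum_{j=1}^{\infty} a_{i,j}\big)\big\}_{m=1}^{\infty}$ diverges to $+\infty$, and so

$$\sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}\Big) = +\infty.$$

Finally, suppose that the series $\sum_{i=1}^{\infty} \big(\sum_{j=1}^{\infty} a_{i,j}\big)$ diverges to $+\infty$.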
To deal with this case, note that $\lim_{n \to \infty} S_{n,n}$ exists as a real number or $\lim_{n \to \infty} S_{n,n} = +\infty$ since $\{S_{n,n}\}_{n=1}^{\infty}$ is an increasing sequence of real numbers. In the latter case, given $L \in R$ there is an $M \ge 1$ such that $S_{n,n} \ge L$ for all $n \ge M$, and therefore $S_{m,n} \ge L$ for all $m \ge M$, $n \ge M$; i.e., $\lim_{m,n \to \infty} S_{m,n} = \sum_{i,j=1}^{\infty} a_{i,j} = +\infty$. On the other hand, if $\lim_{n \to \infty} S_{n,n} = S \in R$, then it is easy to see that $\lim_{m,n \to \infty} S_{m,n} = S$, and by the first part of the proof $\sum_{i=1}^{\infty} \big(\sum_{j=1}^{\infty} a_{i,j}\big)$ converges to $S \in R$, a contradiction. Therefore, $\lim_{m,n \to \infty} S_{m,n} = \sum_{i,j=1}^{\infty} a_{i,j} = +\infty$. ■

The following functions will be needed for the next proof. For $x \in R$, let

$$x^+ = \max(x, 0), \qquad x^- = \max(-x, 0).$$

Then it is easy to see, by considering the two cases $x \ge 0$ and $x \le 0$ separately, that

$$x = x^+ - x^-, \qquad |x| = x^+ + x^-, \qquad 0 \le x^+ \le |x|, \qquad 0 \le x^- \le |x|.$$

Proof of Theorem 3.4.4: Since $0 \le a_{i,j}^{\pm} \le |a_{i,j}|$, the double series $\sum_{i,j=1}^{\infty} a_{i,j}^{\pm}$ converges. By Theorem 3.4.3,

$$\sum_{i,j=1}^{\infty} a_{i,j}^{\pm} = \sum_{i=1}^{\infty} \Big(\sum_{j=1}^{\infty} a_{i,j}^{\pm}\Big) = \sum_{j=1}^{\infty} \Big(\sum_{i=1}^{\infty} a_{i,j}^{\pm}\Big).$$

Taking the difference between the $+$ and $-$ versions results in the conclusions of the theorem. ■

SUPPLEMENTAL READING LIST

W. Feller (1957). An Introduction to Probability Theory and Its Applications, 2nd ed. New York: Wiley.

EXPECTATION

INTRODUCTION

The concept of expectation was first formalized in print by Huygens in the middle of the seventeenth century, and it has played an essential role in probability theory ever since. The expected value of a random variable is a number that summarizes information about a random variable. From the time of its inception until the middle of the twentieth century, the concept of expected value developed along two paths: the discrete and continuous cases. Although it is possible to treat both paths simultaneously, we will stay with the discrete for the time being. Among other things, we will determine the expected duration of play in the gambler's ruin problem, discuss prediction and filtering theory, and look briefly at some applications to communication theory.
EXPECTED VALUE

Unless specified otherwise in examples, $(\Omega, \mathcal{F}, P)$ will be a fixed probability space.

The idea behind expected value is very simple. If a gambler wagers on a game in which he can win one unit with probability $p = 3/4$ and lose two units with probability $q = 1/4$ and he plays 100 games, then according to the empirical law for relative frequencies, the gambler would expect to win about 75 games and lose about 25 games. Thus, he would expect to win about $75 \cdot 1$ units and lose about $25 \cdot 2$ units with a net gain of $75 \cdot 1 - 25 \cdot 2$ units. Putting this on a per-game basis, he would expect a net gain per game of

$$\frac{75 \cdot 1 - 25 \cdot 2}{100} = 1 \cdot \frac{3}{4} + (-2) \cdot \frac{1}{4}$$

where the coefficients 3/4 and 1/4 are the probabilities of winning 1 and $-2$ units, respectively.

Definition 4.1 Let $X$ be a random variable with range $\{x_1, x_2, \ldots\}$, finite or infinite. The expected value of $X$, denoted by $E[X]$, is defined as the real number

$$E[X] = \sum_i x_i f_X(x_i)$$

provided that the series converges absolutely, and $X$ is said to have finite expectation; if $P(X \ge 0) = 1$ and the series diverges, $E[X]$ is defined as $+\infty$. ■

If the range of $X$ is finite, the series on the right is a finite sum and there is no question of convergence, absolute or otherwise. The question of absolute convergence is appropriate only when the series is infinite. Recall that if a series converges but not absolutely, then it is conditionally convergent. In the case of a conditionally convergent series, a rearrangement of the terms of the series can alter the sum of the series. If the sum in the series above is conditionally convergent, then one person listing the values of $X$ in one order might arrive at a different sum than would some other person listing the values in another order. Under absolute convergence, the order in which the values of $X$ are listed does not matter.

EXAMPLE 4.1 Let $X$ have a uniform density on $\{0, 1, \ldots, n\}$ so that $f_X(x) = 1/(n + 1)$, $x = 0, 1, \ldots, n$.
By Exercise 1.2.6,

$$E[X] = \sum_{x=0}^{n} \frac{x}{n + 1} = \frac{1}{n + 1} \sum_{x=0}^{n} x = \frac{1}{n + 1} \cdot \frac{n(n + 1)}{2} = \frac{n}{2}. \;■$$

EXAMPLE 4.2 Let $X$ have a binomial density with parameters $n$ and $p$. Then

$$E[X] = \sum_{x=0}^{n} x \binom{n}{x} p^x q^{n-x} = np \sum_{x=1}^{n} \frac{(n - 1)!}{(x - 1)!\,(n - x)!}\, p^{x-1} q^{(n-1)-(x-1)} = np \sum_{x=0}^{n-1} b(x; n - 1, p) = np,$$

the last equation holding because $\sum_{x=0}^{n-1} b(x; n - 1, p)$ is the sum of all the probabilities making up a binomial density. ■

EXAMPLE 4.3 Let $X$ have a geometric density with parameter $p$ so that $f_X(x) = pq^{x-1}$, $x = 1, 2, \ldots$. Regard the series $\sum_{x=0}^{\infty} q^x$ as a power series in $q$ with $(-1, 1)$ as its interval of convergence. Since

$$\sum_{x=0}^{\infty} q^x = \frac{1}{1 - q}$$

within the interval of convergence,

$$\frac{1}{(1 - q)^2} = \frac{d}{dq}\Big(\frac{1}{1 - q}\Big) = \frac{d}{dq} \sum_{x=0}^{\infty} q^x = \sum_{x=0}^{\infty} \frac{d}{dq} q^x = \sum_{x=0}^{\infty} x q^{x-1}$$

with the latter series also converging absolutely in the interval $(-1, 1)$. Returning to the geometric density,

$$E[X] = \sum_{x=1}^{\infty} x p q^{x-1} = p \sum_{x=0}^{\infty} x q^{x-1} = \frac{p}{(1 - q)^2} = \frac{1}{p}. \;■$$

EXAMPLE 4.4 Let $X$ have a Poisson density with parameter $\lambda > 0$ so that

$$E[X] = \sum_{x=0}^{\infty} x e^{-\lambda} \frac{\lambda^x}{x!}$$

provided that the series converges absolutely. Since the terms of the series are nonnegative, absolute convergence and convergence are the same thing and we need only verify the latter. Since

$$\sum_{x=0}^{\infty} x \frac{\lambda^x}{x!} = \lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x - 1)!} = \lambda \sum_{x=0}^{\infty} \frac{\lambda^x}{x!}$$

and the latter series is the Maclaurin series expansion of $e^{\lambda}$, which is known to converge absolutely on $(-\infty, +\infty)$, $E[X]$ is defined and $E[X] = \lambda e^{-\lambda} e^{\lambda} = \lambda$. ■

If $X$ is a random variable and $\phi$ is a real-valued function on $R$, then $Z = \phi(X)$ is also a random variable. According to the definition of expected value, to calculate $E[Z]$ we must first determine the density $f_Z$ of the random variable $Z$. This need not be done according to the following theorem.

Theorem 4.2.1 Let $X$ be a random variable with range $\{x_1, x_2, \ldots\}$ and let $\phi$ be a real-valued function on $R$. Then $E[\phi(X)]$ is defined and

$$E[\phi(X)] = \sum_j \phi(x_j) f_X(x_j)$$

provided the series converges absolutely.

In applying this result, the sum on the right is formed by replacing $X$ in $\phi(X)$ by a typical value $x_j$, multiplying by the probability that $X$ takes on that value, and then summing over $j$.

PROOF: Assume that the series converges absolutely.
Any rearrangement of the series will not affect the convergence or sum of the series. Let $\{z_1, z_2, \ldots\}$ be the range of $Z = \phi(X)$. By rearranging the terms of the series,

$$\sum_j \phi(x_j) f_X(x_j) = \sum_i \sum_{j:\,\phi(x_j) = z_i} \phi(x_j) f_X(x_j) = \sum_i z_i \sum_{j:\,\phi(x_j) = z_i} f_X(x_j) = \sum_i z_i P(\phi(X) = z_i) = \sum_i z_i f_Z(z_i) = E[Z].$$

The same steps applied with $\phi$ replaced by $|\phi|$ show that the series $\sum_i z_i f_Z(z_i)$ converges absolutely. ■

EXAMPLE 4.5 Let $X$ be a random variable with binomial density $b(\cdot\,; n, p)$ and let $\phi(x) = x^2$, $x \in R$. Then

$$E[\phi(X)] = \sum_{k=0}^{n} k^2 \binom{n}{k} p^k q^{n-k} = \sum_{k=0}^{n} k(k - 1) \binom{n}{k} p^k q^{n-k} + \sum_{k=0}^{n} k \binom{n}{k} p^k q^{n-k}.$$

We have seen that the second sum on the right is equal to $np$. Since

$$\sum_{k=0}^{n} k(k - 1) \binom{n}{k} p^k q^{n-k} = n(n - 1)p^2 \sum_{k=2}^{n} \binom{n - 2}{k - 2} p^{k-2} q^{n-k} = n(n - 1)p^2 \sum_{k=0}^{n-2} \binom{n - 2}{k} p^k q^{(n-2)-k}$$
$$= n(n - 1)p^2 \sum_{k=0}^{n-2} b(k; n - 2, p) = n(n - 1)p^2,$$

$E[X^2] = n(n - 1)p^2 + np = n^2p^2 - np^2 + np$. ■

It is not hard to construct examples of random variables $X$ for which $E[X]$ is not defined as a real number.

EXAMPLE 4.6 Let $X$ be a random variable with density $f_X(x) = 1/(x(x + 1))$, $x = 1, 2, \ldots$ (see Exercise 1.5.10). Since the series

$$\sum_{x=1}^{\infty} x f_X(x) = \sum_{x=1}^{\infty} \frac{1}{x + 1}$$

is the divergent harmonic series except for a missing first term, the series $\sum_{x=1}^{\infty} x f_X(x)$ does not converge, and therefore $E[X]$ is not defined as a real number. ■

In some instances, as in the previous example, when the terms of the series $\sum_j x_j f_X(x_j)$ are nonnegative but the series does not converge, we say that the series diverges to $+\infty$ and we write $E[X] = +\infty$.

Theorem 4.2.2 If $X$ has finite expectation and $c$ is any real number, then

(i) If $P(X \ge 0) = 1$, then $E[X] \ge 0$.
(ii) If $P(X = c) = 1$, then $E[X] = c$.
(iii) $E[cX] = cE[X]$.

PROOF: If $P(X \ge 0) = 1$, then $f_X(x) = 0$ whenever $x < 0$, and so $E[X] = \sum_j x_j f_X(x_j) = \sum_{x_j \ge 0} x_j f_X(x_j) \ge 0$. If $P(X = c) = 1$, then $f_X(c) = 1$ and $f_X(x) = 0$ whenever $x \ne c$, so that $E[X] = \sum_j x_j f_X(x_j) = c f_X(c) = c$. By Theorem 4.2.1, $E[cX] = \sum_j c x_j f_X(x_j) = cE[X]$. ■

If the density function of a nonnegative integer-valued random variable $X$ is not known but its generating function is, the expected value of $X$ can be calculated indirectly using the generating function. The following notation is useful for this purpose.
Let $f$ be a real-valued function defined on an interval having $a$ as its right endpoint. Then $f(a-)$ is defined to be $\lim_{x \to a-} f(x)$, even if infinite.

Theorem 4.2.3 (Abel) Let $\{a_j\}_{j=0}^{\infty}$ be a sequence with $a_j \ge 0$ and generating function $A(t)$ on $(-1, 1)$, and let $A(1) = \sum_{j=0}^{\infty} a_j$. Then

$$A(1-) = \lim_{t \to 1-} A(t) = A(1),$$

even if the series diverges to $+\infty$.

PROOF: Suppose first that $\sum_{j=0}^{\infty} a_j$ diverges to $+\infty$. Given any $M$, there is an $N \ge 1$ such that $\sum_{j=0}^{n} a_j > M$ for all $n \ge N$. Since

$$\lim_{t \to 1-} \sum_{j=0}^{\infty} a_j t^j \ge \lim_{t \to 1-} \sum_{j=0}^{n} a_j t^j = \sum_{j=0}^{n} a_j > M$$

for all $n \ge N$, $A(1-) \ge M$. Since $M$ is arbitrary, $A(1-) = +\infty$. Suppose now that $\sum_{j=0}^{\infty} a_j < +\infty$. Let $L = \sum_{j=0}^{\infty} a_j$. Given any $\epsilon > 0$, there is an $N \ge 1$ such that $\sum_{j=0}^{n} a_j > L - \epsilon$ for all $n \ge N$. Thus,

$$L \ge \lim_{t \to 1-} \sum_{j=0}^{\infty} a_j t^j \ge \lim_{t \to 1-} \sum_{j=0}^{n} a_j t^j = \sum_{j=0}^{n} a_j > L - \epsilon.$$

Since $\epsilon$ is arbitrary, $A(1-) = \lim_{t \to 1-} \sum_{j=0}^{\infty} a_j t^j = L = \sum_{j=0}^{\infty} a_j$. ■

The point of Abel's theorem is that if we formally put $t = 1$ in the equation $A(t) = \sum_{j=0}^{\infty} a_j t^j$, then $A(1-) = A(1)$, even if infinite. In working with power series on $(-1, 1)$ having nonnegative coefficients, we will put $A(1) = A(1-)$, even if infinite.

Let $X$ be a nonnegative integer-valued random variable with density function $f_X$ and generating function $\hat f_X$. In this case, it is possible for $E[X] = \sum_{x=0}^{\infty} x f_X(x) = +\infty$. We will need a tail probability function defined by

$$g_X(x) = P(X > x), \qquad x = 0, 1, \ldots$$

and its generating function

$$\hat g_X(t) = \sum_{x=0}^{\infty} g_X(x) t^x, \qquad -1 < t < 1.$$

Note that $g_X$ is not a density function. The functions $\hat f_X$ and $\hat g_X$ are related by the following equation:

$$\hat g_X(t) = \frac{1 - \hat f_X(t)}{1 - t}, \qquad -1 < t < 1. \tag{4.1}$$

To see this, write

$$(1 - t)\hat g_X(t) = (1 - t) \sum_{x=0}^{\infty} g_X(x) t^x = \sum_{x=0}^{\infty} g_X(x) t^x - \sum_{x=0}^{\infty} g_X(x) t^{x+1}$$
$$= g_X(0) + \sum_{x=1}^{\infty} g_X(x) t^x - \sum_{x=1}^{\infty} g_X(x - 1) t^x = 1 - f_X(0) - \sum_{x=1}^{\infty} \big(g_X(x - 1) - g_X(x)\big) t^x.$$

Noting that $g_X(x - 1) - g_X(x) = P(X > x - 1) - P(X > x) = P(X = x) = f_X(x)$ for $x \ge 1$,

$$(1 - t)\hat g_X(t) = 1 - \hat f_X(t).$$

This establishes Equation 4.1. Since the interval of convergence of the power series defining $\hat f_X$ contains $(-1, 1)$,

$$\hat f_X'(t) = \frac{d}{dt} \sum_{x=0}^{\infty} f_X(x) t^x = \sum_{x=0}^{\infty} x f_X(x) t^{x-1}, \qquad -1 < t < 1,$$

and therefore $E[X] = \sum_{x=0}^{\infty} x f_X(x) = \hat f_X'(1)$, even if infinite. Similarly, $E[X(X - 1)] = \sum_{x=0}^{\infty} x(x - 1) f_X(x) = \hat f_X''(1)$, even if infinite.

Theorem 4.2.4 If $X$ is a nonnegative integer-valued random variable, then

$$E[X] = \hat f_X'(1) = \hat g_X(1)$$

whether finite or infinite.

PROOF: By Equation 4.1 and the mean value theorem for derivatives,

$$\hat g_X(t) = \frac{1 - \hat f_X(t)}{1 - t} = \frac{\hat f_X(1) - \hat f_X(t)}{1 - t} = \hat f_X'(\xi)$$

where $t < \xi < 1$. By Abel's theorem,

$$\hat g_X(1) = \hat g_X(1-) = \lim_{t \to 1-} \hat f_X'(\xi) = \hat f_X'(1-) = \hat f_X'(1) = E[X]. \;■$$

EXAMPLE 4.7 Let $X$ have a negative binomial density with parameters $r$ and $p$ so that $\hat f_X(t) = p^r(1 - qt)^{-r}$. Then $\hat f_X'(t) = r p^r q (1 - qt)^{-r-1}$, and so $E[X] = \hat f_X'(1) = r(q/p)$. ■

It was shown in Section 2.7 that a remotely operated garage door opener is anything but secure. Let $\{X_j\}_{j=1}^{\infty}$ be an infinite sequence of Bernoulli random variables with probability of success $p = 1/2$. We have seen that the word 1001 will occur infinitely often in an outcome with probability 1.

EXAMPLE 4.8 (Password Problem) Consider the sequence just described and define a random waiting time $T$ by putting $T = n$ if the word 1001 appears for the first time at the end of the $n$th trial so that $T \ge 4$. Let $g_T(n) = P(T > n)$; i.e., $g_T(n)$ is the probability that the word 1001 does not occur in the first $n$ trials. Note that $g_T(0) = g_T(1) = g_T(2) = g_T(3) = 1$. The only way for the word 1001 not to appear in the first $n$ trials for $n \ge 4$ is for an outcome to begin with one of the following starting patterns:

$$0\cdots, \quad 11\cdots, \quad 101\cdots, \quad 1000\cdots,$$

and the word 1001 does not subsequently appear. The event consisting of those outcomes with the starting pattern $0\cdots$ and the word 1001 does not subsequently appear in the remaining $n - 1$ trials has probability $(1/2)g_T(n - 1)$. A similar argument applies to the other three starting patterns. Therefore, the $g_T(n)$ must satisfy the difference equation

$$g_T(n) = \frac{1}{2} g_T(n - 1) + \frac{1}{4} g_T(n - 2) + \frac{1}{8} g_T(n - 3) + \frac{1}{16} g_T(n - 4), \qquad n \ge 4 \tag{4.2}$$

subject to the initial conditions

$$g_T(0) = g_T(1) = g_T(2) = g_T(3) = 1.$$
(4.3)

Multiplying both sides of Equation 4.2 by $t^n$ and summing over $n \ge 4$,

$$\hat g_T(t) - 1 - t - t^2 - t^3 = \frac{t}{2}\big(\hat g_T(t) - 1 - t - t^2\big) + \frac{t^2}{4}\big(\hat g_T(t) - 1 - t\big) + \frac{t^3}{8}\big(\hat g_T(t) - 1\big) + \frac{t^4}{16}\hat g_T(t).$$

Solving for $\hat g_T$,

$$\hat g_T(t) = \frac{16 + 8t + 4t^2 + 2t^3}{16 - 8t - 4t^2 - 2t^3 - t^4}.$$

Since $\hat f_T(t) = 1 - (1 - t)\hat g_T(t)$, which is a rational function of $t$, we could in principle determine the density $f_T$ by applying the method of partial fraction expansions to $\hat f_T(t)$; this requires, however, finding the roots of the polynomial in the denominator of $\hat g_T$. If all we are interested in is $E[T]$, then these problems can be avoided by using Theorem 4.2.4 to obtain $E[T] = \hat g_T(1) = 30$. On the average, it will take about 30 trials for the word 1001 to appear in an outcome. ■

EXERCISES 4.2

1. Let $X$ have a geometric density with parameter $p$. Find $E[X^2]$.
2. If $X$ has a Poisson density function $p(\cdot\,; \lambda)$, calculate $E[X^2]$.
3. A random sample of size 3 is drawn from a bowl containing 10 white and 5 red balls. If $X$ is the number of white balls in the sample, find $E[X]$.
4. Let $X$ be a random variable having a Poisson density with parameter $\lambda > 0$. Calculate $E[1/(1 + X)]$.
5. Let $X$ be a random variable having a negative binomial density with parameters $r \ge 2$ and $p$. Calculate $E[1/(X + 1)]$.
6. If $X$ is a nonnegative integer-valued random variable, show that $E[X] = \sum_{x=1}^{\infty} P(X \ge x)$.
7. Let $\{X_j\}_{j=1}^{\infty}$ be a sequence of independent nonnegative integer-valued random variables all having the same density function for which $E[X_j] = E[X_1]$ is defined as a real number, and let $N$ be a positive integer-valued random variable such that $E[N]$ is defined as a real number. Assume that $N, X_1, X_2, \ldots$ are independent. If $S_N = X_1 + X_2 + \cdots + X_N$, show that $E[S_N] = E[N]E[X_1]$.
8. A remotely operated garage door opener has an electronic combination lock of 10 binary digits.
If a random device transmits a signal that has the probability properties of an infinite sequence of Bernoulli trials with probability of success $p = 1/2$, what is the expected number of digits required to activate the opener?

The following problem requires mathematical software such as Mathematica or Maple V.

9. Consider the random variable $T$ of Example 4.8. Calculate $P(T \geq 11)$.

4.3 PROPERTIES OF EXPECTATION

In Chapter 3, we defined functions of random variables such as $\phi(X, Y)$ and $\psi(X_1, \ldots, X_n)$. In the case of a single random variable $X$, it was shown that the expected value of $Z = \phi(X)$ could be calculated without going through the intermediate step of determining the density function of $Z$. A similar result applies to a function of several random variables.

Theorem 4.3.1 If $X_1, \ldots, X_n$ are random variables and $\psi$ is a real-valued function of $n$ variables, then
$$E[\psi(X_1, \ldots, X_n)] = \sum_{x_1, \ldots, x_n} \psi(x_1, \ldots, x_n) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \tag{4.4}$$
provided the multiple series on the right converges absolutely.

Operationally, the sum on the right is obtained by replacing $X_1, \ldots, X_n$ by typical values $x_1, \ldots, x_n$, respectively, multiplying by the probability that the random variables will take on those values, and then summing in any order over all possible values of the random variables. The proof of Theorem 4.3.1 amounts to a justification of the rearrangement of the terms of the series. The reader is referred to Theorem 12-42 in the book by Apostol listed at the end of the chapter.

Theorem 4.3.2 If $X$ and $Y$ are random variables with finite expectations and $P(X \geq Y) = 1$, then $E[X] \geq E[Y]$.

PROOF: Since $f_{X,Y}(x_i, y_j) = 0$ whenever $x_i < y_j$, by Theorem 4.3.1,
$$\begin{aligned}
E[X] &= \sum_{x_i, y_j} x_i f_{X,Y}(x_i, y_j) = \sum_{x_i \geq y_j} x_i f_{X,Y}(x_i, y_j) \\
&\geq \sum_{x_i \geq y_j} y_j f_{X,Y}(x_i, y_j) = \sum_{x_i, y_j} y_j f_{X,Y}(x_i, y_j) = E[Y].
\end{aligned}$$
■

Theorem 4.3.3 If $X_1, \ldots, X_n$ are any random variables with finite expectations and $c_1, \ldots, c_n$ are any real constants, then $\sum_{j=1}^{n} c_jX_j$ has finite expectation and
$$E\Bigl[\sum_{j=1}^{n} c_jX_j\Bigr] = \sum_{j=1}^{n} c_jE[X_j].$$

PROOF: By Theorem 4.2.2, we can assume that the $c_j = 1$, $j = 1, \ldots, n$. Taking $\psi(x_1, \ldots, x_n) = x_1 + \cdots + x_n$ in Theorem 4.3.1, we must first show that the series therein is absolutely convergent. Since $|x_1 + \cdots + x_n| \leq |x_1| + \cdots + |x_n|$,
$$\begin{aligned}
\sum_{x_1, \ldots, x_n} |x_1 + \cdots + x_n| f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)
&\leq \sum_{x_1, \ldots, x_n} (|x_1| + \cdots + |x_n|) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= \sum_{x_1, \ldots, x_n} |x_1| f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) + \cdots + \sum_{x_1, \ldots, x_n} |x_n| f_{X_1, \ldots, X_n}(x_1, \ldots, x_n).
\end{aligned}$$
By Theorem 3.4.3, a suitable order of iterated summation can be chosen so that
$$\sum_{x_1, \ldots, x_n} |x_j| f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \sum_{x_j} |x_j| \Bigl(\sum_{x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\Bigr) = \sum_{x_j} |x_j| f_{X_j}(x_j),$$
and so
$$\sum_{x_1, \ldots, x_n} |x_1 + \cdots + x_n| f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \leq \sum_{x_1} |x_1| f_{X_1}(x_1) + \cdots + \sum_{x_n} |x_n| f_{X_n}(x_n).$$
Since each $X_j$ has finite expectation, each term on the right is finite and the multiple series converges absolutely. Therefore, $E[X_1 + \cdots + X_n]$ is defined, and
$$\begin{aligned}
E[X_1 + \cdots + X_n] &= \sum_{x_1, \ldots, x_n} (x_1 + \cdots + x_n) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= \sum_{x_1, \ldots, x_n} x_1 f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) + \cdots + \sum_{x_1, \ldots, x_n} x_n f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= E[X_1] + \cdots + E[X_n].
\end{aligned}$$
■

EXAMPLE 4.9 Consider an infinite sequence of Bernoulli random variables $\{X_j\}_{j=1}^{\infty}$ with probability of success $p$ and let $S_n = X_1 + \cdots + X_n$. Then $E[X_j] = 1 \cdot p + 0 \cdot q = p$, $j = 1, 2, \ldots$. By Theorem 4.3.3, $E[S_n] = np$, a result obtained previously using the fact that $S_n$ has the binomial density $b(\cdot\,; n, p)$. ■

The introduction of auxiliary random variables as in the next example can simplify the computation of expected value.

EXAMPLE 4.10 Suppose a population of $n$ objects consists of $n_1$ objects of Type 1, $n_2$ objects of Type 2, ..., $n_s$ objects of Type $s$, where $n = n_1 + n_2 + \cdots + n_s$. A random sample of size $r \leq n$ is taken without replacement from the population.
Let $X_1$ be the number of Type 1 objects, $X_2$ the number of Type 2 objects, ..., $X_s$ the number of Type $s$ objects in the sample. To calculate $E[X_j]$, we define auxiliary random variables $I_{j,1}, \ldots, I_{j,r}$ as follows. Let $I_{j,k} = 1$ or $0$ according to whether the $k$th object in the sample is of Type $j$ or not. The value of $I_{j,k}$ is determined by looking at the $k$th object chosen from the population and totally disregarding the other choices. This amounts to selecting just one object. Thus, $P(I_{j,k} = 1) = n_j/n$, $P(I_{j,k} = 0) = (n - n_j)/n$, and therefore $E[I_{j,k}] = 0 \cdot ((n - n_j)/n) + 1 \cdot (n_j/n) = n_j/n$. Since $X_j = I_{j,1} + \cdots + I_{j,r}$,
$$E[X_j] = \sum_{k=1}^{r} E[I_{j,k}] = \frac{rn_j}{n}$$
by Theorem 4.3.3. ■

Theorem 4.3.4 If $X$ and $Y$ are independent random variables with finite expectations, then $XY$ has finite expectation and $E[XY] = E[X]E[Y]$.

PROOF: Let $x_i$ and $y_j$ be arbitrary elements of the range of $X$ and $Y$, respectively. We must first show that
$$\sum_{i,j} |x_iy_j| f_{X,Y}(x_i, y_j) < +\infty.$$
By Theorem 3.4.3,
$$\sum_{i,j} |x_iy_j| f_{X,Y}(x_i, y_j) = \sum_{i,j} |x_i||y_j| f_X(x_i)f_Y(y_j) = \sum_i |x_i| \Bigl(\sum_j |y_j| f_Y(y_j)\Bigr) f_X(x_i).$$
Since the sum within the parentheses is a constant, it can be taken outside the summation over $i$ to obtain
$$\sum_{i,j} |x_iy_j| f_{X,Y}(x_i, y_j) = \Bigl(\sum_i |x_i| f_X(x_i)\Bigr)\Bigl(\sum_j |y_j| f_Y(y_j)\Bigr) < +\infty.$$
Therefore, $XY$ has finite expectation and
$$E[XY] = \sum_{i,j} x_iy_j f_{X,Y}(x_i, y_j) = \sum_i x_i \Bigl(\sum_j y_j f_Y(y_j)\Bigr) f_X(x_i) = E[Y]\sum_i x_i f_X(x_i) = E[Y]E[X].$$
■

It is important to remember that Theorem 4.3.4 applies only to independent random variables.

EXAMPLE 4.11 Suppose two dice, one red and one white, are rolled. Let $X$ and $Y$ be the number of pips on the red and white die, respectively. By Exercise 1.2.6,
$$E[X] = E[Y] = \sum_{j=1}^{6} j \cdot \frac{1}{6} = \frac{1}{6}\sum_{j=1}^{6} j = \frac{7}{2}.$$
Since $X$ and $Y$ are independent random variables, $E[XY] = E[X]E[Y] = 49/4$. ■

Definition 4.2 The random variable $X$ has a finite second moment if $E[X^2]$ is finite. ■

We will need the following fact: if $x$ is any real number, then $|x| \leq x^2 + 1$. To see this, note that if $|x| \leq 1$, then $|x| \leq x^2 + 1$, whereas if $|x| \geq 1$, then $|x| \leq |x|^2 \leq x^2 + 1$.
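As an informal check of Theorem 4.3.4 and Example 4.11, the expectations can be computed by brute-force enumeration. The sketch below is illustrative only: the variable names are assumptions, and the dependent case $Y = X$ is a hypothetical example added to show what can go wrong without independence.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 36)  # each (red, white) outcome is equally likely

# Independent case: X = red die, Y = white die.
e_x = sum(Fraction(x, 6) for x in faces)
e_xy = sum(x * y * p for x, y in product(faces, faces))

# Dependent case (hypothetical): take Y = X, so that E[XY] = E[X^2].
e_xx = sum(Fraction(x * x, 6) for x in faces)

print(e_x, e_xy, e_x * e_x)  # 7/2 49/4 49/4
print(e_xx)                  # 91/6
```

With $Y = X$ the product rule fails: $E[X^2] = 91/6$, whereas $(E[X])^2 = 49/4$.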
Consider a random variable $X$ with finite second moment and range $\{x_1, x_2, \ldots\}$. Since
$$\sum_j |x_j| f_X(x_j) \leq \sum_j (x_j^2 + 1) f_X(x_j) = E[X^2] + 1,$$
$X$ has finite expectation. In this case, we can define a parameter $\mu_X$ by
$$\mu_X = E[X],$$
which is called the mean or expected value of $X$.

Consider the random variable $(X - \mu_X)^2 = X^2 - 2\mu_XX + \mu_X^2$. Since $X^2$ and $X$ have finite expectation, $(X - \mu_X)^2$ has finite expectation by Theorems 4.2.2 and 4.3.3, and we can define a second parameter
$$\sigma_X^2 = E[(X - \mu_X)^2],$$
called the variance of $X$. The variance of $X$ is also denoted by $\operatorname{var} X$. $\sigma_X = \sqrt{\operatorname{var} X}$ is called the standard deviation of $X$. If the random variable $X$ is clear from the context, the subscript $X$ on $\mu_X$ and $\sigma_X$ will be omitted. Since $(X - \mu_X)^2 = X^2 - 2\mu_XX + \mu_X^2$ and $E[\mu_XX] = \mu_XE[X] = (E[X])^2$, by Theorem 4.2.2
$$\operatorname{var} X = \sigma^2 = E[(X - \mu_X)^2] = E[X^2] - (E[X])^2.$$
It is easily checked that $\operatorname{var}(aX) = a^2 \operatorname{var} X$ and $\operatorname{var}(X + c) = \operatorname{var} X$.

EXAMPLE 4.12 Let $X$ be a random variable having a uniform density on $\{0, 1, \ldots, n\}$. It was shown in the previous section that $E[X] = n/2$. By Exercise 1.2.6,
$$E[X^2] = \sum_{x=0}^{n} x^2 \cdot \frac{1}{n + 1} = \frac{1}{n + 1}\bigl[1^2 + 2^2 + \cdots + n^2\bigr] = \frac{n(2n + 1)}{6}$$
and so
$$\operatorname{var} X = \frac{n(2n + 1)}{6} - \Bigl(\frac{n}{2}\Bigr)^2 = \frac{n(n + 2)}{12}.$$
■

EXAMPLE 4.13 Let $X$ be a random variable having a binomial density $b(\cdot\,; n, p)$. It was shown in the previous section that $E[X] = np$ and that $E[X^2] = n^2p^2 - np^2 + np$, so that
$$\operatorname{var} X = E[X^2] - (E[X])^2 = np(1 - p).$$
■

EXAMPLE 4.14 Let $X_1, \ldots, X_n$ be independent random variables all having the same density function and finite second moments. Let $\mu = E[X_j]$ and $\sigma^2 = \operatorname{var} X_j$, $1 \leq j \leq n$, and let $S_n = X_1 + \cdots + X_n$. By Theorem 4.3.3, $E[S_n] = n\mu$. The variance of $S_n$ can be calculated using the equation
$$(S_n - n\mu)^2 = \Bigl(\sum_{j=1}^{n} (X_j - \mu)\Bigr)^2 = \sum_{j=1}^{n} (X_j - \mu)^2 + \sum_{i \neq j} (X_i - \mu)(X_j - \mu),$$
provided the terms on the right have finite expectations. The terms of the first sum have finite expectations because the $X_j$ have finite second moments.
By independence and Theorem 4.3.4, the terms of the second sum have finite expectations, and $E[(X_i - \mu)(X_j - \mu)] = E[X_i - \mu]E[X_j - \mu] = 0$. Thus,
$$\operatorname{var} S_n = E[(S_n - n\mu)^2] = \sum_{j=1}^{n} E[(X_j - \mu)^2] = n\sigma^2.$$
■

Let $X$ be a nonnegative integer-valued random variable with finite second moment. We have seen that $E[X]$ can be calculated from the generating function $\hat{f}_X$ by the equation $E[X] = \hat{f}_X'(1)$. $\operatorname{var} X$ can also be calculated from the generating function. In fact,
$$\operatorname{var} X = \hat{f}_X''(1) + \hat{f}_X'(1) - [\hat{f}_X'(1)]^2. \tag{4.5}$$
To see this, recall that $\hat{f}_X(t) = \sum_{x=0}^{\infty} f_X(x)t^x$ so that
$$\hat{f}_X''(t) = \sum_{x=0}^{\infty} x(x - 1) f_X(x) t^{x-2}$$
on the interval $(-1, 1)$. By Abel's theorem,
$$\hat{f}_X''(1) = \sum_{x=0}^{\infty} x(x - 1) f_X(x) = E[X(X - 1)] = E[X^2] - E[X],$$
and so $E[X^2] = \hat{f}_X''(1) + \hat{f}_X'(1)$ and $\operatorname{var} X = \hat{f}_X''(1) + \hat{f}_X'(1) - [\hat{f}_X'(1)]^2$.

EXAMPLE 4.15 Let $X$ be a random variable having a Poisson density with parameter $\lambda > 0$. Then $\hat{f}_X(t) = e^{\lambda(t-1)}$, $\hat{f}_X'(t) = \lambda e^{\lambda(t-1)}$, $\hat{f}_X''(t) = \lambda^2 e^{\lambda(t-1)}$. Thus,
$$\operatorname{var} X = \hat{f}_X''(1) + \hat{f}_X'(1) - [\hat{f}_X'(1)]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
■

The mean and variance of a random variable $X$ are just two parameters that summarize some of the information in its density function. Even though in most cases they do not determine the density, they can provide information about probabilities.

Lemma 4.3.5 (Markov's Inequality) If $X$ is any random variable with finite expectation and $t > 0$, then
$$P(|X| \geq t) \leq \frac{E[|X|]}{t}.$$

PROOF: Let $\{x_1, x_2, \ldots\}$ be the range of $X$. Since the series defining $E[X]$ is absolutely convergent, $E[|X|] < +\infty$. By Theorem 4.2.1,
$$E[|X|] = \sum_j |x_j| f_X(x_j) \geq \sum_{|x_j| \geq t} |x_j| f_X(x_j) \geq t \sum_{|x_j| \geq t} f_X(x_j) = tP(|X| \geq t).$$
■

The next inequality is an easy consequence of Markov's inequality.

Theorem 4.3.6 (Chebyshev's Inequality) Let $X$ be a random variable with mean $\mu$ and finite variance $\sigma^2$. Then
$$P(|X - \mu| \geq \delta) \leq \frac{\sigma^2}{\delta^2}$$
for all $\delta > 0$.

PROOF: By Markov's inequality,
$$P(|X - \mu| \geq \delta) = P((X - \mu)^2 \geq \delta^2) \leq \frac{E[(X - \mu)^2]}{\delta^2} = \frac{\sigma^2}{\delta^2}.$$
■

Consider an infinite sequence of Bernoulli trials $\{X_j\}_{j=1}^{\infty}$ with probability of success $p$ and let $S_n = X_1 + \cdots + X_n$ be the number of successes in $n$ trials. We know that $S_n$ has a $b(\cdot\,; n, p)$ density, that $E[S_n] = np$, and that $\operatorname{var} S_n = np(1 - p)$.
By Chebyshev's inequality,
$$P\Bigl(\Bigl|\frac{S_n}{n} - p\Bigr| \geq \delta\Bigr) = P(|S_n - np| \geq n\delta) \leq \frac{np(1 - p)}{n^2\delta^2} = \frac{p(1 - p)}{n\delta^2}.$$
By maximizing the function $g(p) = p(1 - p)$, $0 \leq p \leq 1$, it is easily seen that the maximum value of $g$ is $1/4$. Thus,
$$P\Bigl(\Bigl|\frac{S_n}{n} - p\Bigr| \geq \delta\Bigr) \leq \frac{1}{4n\delta^2}. \tag{4.6}$$
Since $S_n$ represents the number of successes in $n$ trials, $S_n/n$ represents the relative frequency of successes in $n$ trials. Taking the limit as $n \to \infty$,
$$\lim_{n \to \infty} P\Bigl(\Bigl|\frac{S_n}{n} - p\Bigr| \geq \delta\Bigr) = 0 \tag{4.7}$$
for all $\delta > 0$; i.e., given a prescribed error $\delta > 0$, the probability that the relative frequency $S_n/n$ will differ from $p$ by more than $\delta$ goes to zero as $n \to \infty$. This sounds suspiciously like the empirical law for relative frequencies, but it is, in fact, a mathematical theorem.

Inequality 4.6 can be used to determine how many repetitions of an experiment are required to pin down the probability of success $p$ when it is unknown.

EXAMPLE 4.16 Consider an infinite sequence of Bernoulli trials with probability of success $p$. How many repetitions are required to be 97 percent confident that the relative frequency of success $S_n/n$ will be within .05 of $p$? That is, how do we choose $n$ so that
$$P\Bigl(\Bigl|\frac{S_n}{n} - p\Bigr| \geq .05\Bigr) \leq .03?$$
By Inequality 4.6, if we choose $n$ so that
$$\frac{1}{4n(.05)^2} \leq .03,$$
then the above condition will be satisfied. Therefore,
$$n \geq \frac{1}{4(.05)^2(.03)}$$
and so $n$ can be taken to be 3334. ■

The number $n = 3334$ in this example is rather large, but it must be remembered that nothing has been assumed about $p$. Any preliminary information about $p$ can reduce the number $n$ by several factors; e.g., if it is known that $p$ pertains to an event that is relatively uncommon, say $p \leq 1/10$, then $p(1 - p) \leq 9/100$ and $n$ can be reduced to 1200.

The fact that $\lim_{n \to \infty} P(|(S_n/n) - p| \geq \delta) = 0$ for all $\delta > 0$ was first proved by Jacob Bernoulli around 1713. It is a special case of a slightly more general result.

Theorem 4.3.7 (Weak Law of Large Numbers) Let $\{X_j\}_{j=1}^{\infty}$ be a sequence of independent random variables all having the same density function and finite second moments.
If $S_n = X_1 + \cdots + X_n$ and $\mu = E[X_j]$, $j \geq 1$, then
$$\lim_{n \to \infty} P\Bigl(\Bigl|\frac{S_n}{n} - \mu\Bigr| \geq \delta\Bigr) = 0$$
for all $\delta > 0$.

PROOF: Let $\sigma^2 = \operatorname{var} X_j$, $j \geq 1$. By Theorem 4.3.3 and Example 4.14, $E[S_n] = n\mu$ and $\operatorname{var} S_n = n\sigma^2$. Thus, for each $\delta > 0$,
$$P\Bigl(\Bigl|\frac{S_n}{n} - \mu\Bigr| \geq \delta\Bigr) = P(|S_n - n\mu| \geq n\delta) \leq \frac{\operatorname{var} S_n}{n^2\delta^2} = \frac{\sigma^2}{n\delta^2} \to 0$$
as $n \to \infty$. ■

There is a strong law of large numbers that reflects the empirical law more precisely than the weak law. The strong law is beyond the scope of this book.

EXERCISES 4.3

1. Suppose two dice are rolled, one red and one white. Let $X$ be the number of pips on the red die, let $Y$ be the number of pips on the white die, and let $Z$ be the larger of the two numbers of pips. The joint density $f_{X,Z}$ is given by Equation 3.6. Calculate $E[XZ]$.
2. Let $X$ and $Y$ be as in Problem 1 and let $U = \min(X, Y)$. Calculate $E[U]$.
3. Let $X$ be a random variable having a geometric density with parameter $p$. Use the generating function $\hat{f}_X(t)$ to find $\operatorname{var} X$.
4. Let $X$ be a random variable having generating function $\hat{f}(t) =$ Calculate $\operatorname{var} X$.
5. Let $X$ be a random variable having a negative binomial density with parameters $r$ and $p$. Calculate $\operatorname{var} X$.
6. A manufacturer produces items of which 3 percent are defective. The manufacturer contracts to sell 10,000 items to a buyer with the stipulation that if the number of defective items exceeds $d$ units, then the buyer can claim a full refund. How should $d$ be chosen so that the manufacturer does not have to give a refund to more than 5 percent of the buyers?
7. Consider an infinite sequence of Bernoulli trials with probability of success $p$ for which it is known that $p \leq 1/4$. How many trials are required to be 90 percent confident that the relative frequency of successes $S_n/n$ will be within .05 of $p$?

The next three problems pertain to a population of $n$ objects of which $n_1$ are of Type 1, $n_2$ are of Type 2, ..., $n_s$ are of Type $s$.

8. If Type 1 objects have value $V_1$, Type 2 have value $V_2$, . . .
, Type $s$ have value $V_s$, and $V$ is the value of a random sample of size $r \leq n$ without replacement from the population, derive a formula for $E[V]$.
9. A commercial fisherman is allowed to net 50 game fish each month from a lake in which 30 percent of the fish are largemouth bass, 10 percent are smallmouth bass, 20 percent are white bass, and 40 percent are walleyes. If the largemouth bass average 2.5 pounds, the smallmouth bass 1.8 pounds, the white bass 1.2 pounds, and the walleye 2.4 pounds, what is the expected weight of his catch?
10. For $j = 1, \ldots, s$, let $X_j$ be the number of Type $j$ objects in a random sample of size $r \leq n$ without replacement from the population. Use the auxiliary random variables of Example 4.10 to calculate $\operatorname{var} X_j$.
11. Let $X$ be a random variable having a finite second moment. If $\mu = E[X]$ and $\sigma^2 = \operatorname{var} X = 0$, show that $X = \mu$ with probability 1.
12. If $X$ is a nonnegative integer-valued random variable, then $E[X] = \hat{f}_X'(1) = \hat{g}_X(1)$. Assuming that $E[X^2]$ is finite, express $\operatorname{var} X$ in terms of $\hat{g}_X$.
13. Calculate the standard deviation $\sigma_T$ of the waiting time $T$ of Example 4.8.

4.4 COVARIANCE AND CORRELATION

It was shown in the previous section that the expected value of a sum of random variables is the sum of the expected values. Is this true of variances? Generally speaking, it is not true.

A simple inequality will be needed for the next result. If $a$, $b$ are any real numbers, then $(a + b)^2 \leq 2(a^2 + b^2)$. This follows from the fact that $a^2 - 2ab + b^2 = (a - b)^2 \geq 0$, so that $2ab \leq a^2 + b^2$ and $(a + b)^2 = a^2 + 2ab + b^2 \leq 2(a^2 + b^2)$.

Lemma 4.4.1 If $X_1, \ldots, X_n$ are random variables with finite second moments and $c_1, \ldots, c_n$ are any real numbers, then $\sum_{j=1}^{n} c_jX_j$ has a finite second moment.

PROOF: If the random variable $X$ with range $\{x_1, x_2, \ldots\}$ has finite second moment and $c \in R$, then $cX$ has finite second moment since $\sum_j (cx_j)^2 f_X(x_j) = c^2 \sum_j x_j^2 f_X(x_j) < +\infty$. Thus, each $c_jX_j$ has finite second moment and it can be assumed that $c_j = 1$, $1 \leq j \leq n$.
We prove the result for the $n = 2$ case first. Since $(X_1 + X_2)^2 \leq 2(X_1^2 + X_2^2)$, by Theorem 4.3.2 $E[(X_1 + X_2)^2] \leq 2(E[X_1^2] + E[X_2^2]) < +\infty$, and $X_1 + X_2$ has finite second moment. The general case follows from a mathematical induction argument. ■

Consider two random variables $X$ and $Y$ with finite second moments. Since $\mu_{X+Y} = E[X + Y] = \mu_X + \mu_Y$,
$$\begin{aligned}
\operatorname{var}(X + Y) &= E[((X + Y) - (\mu_X + \mu_Y))^2] \\
&= E[((X - \mu_X) + (Y - \mu_Y))^2] \\
&= E[(X - \mu_X)^2] + E[(Y - \mu_Y)^2] + 2E[(X - \mu_X)(Y - \mu_Y)] \\
&= \operatorname{var} X + \operatorname{var} Y + 2E[(X - \mu_X)(Y - \mu_Y)].
\end{aligned}$$
The last term will be given a name of its own. But we must first establish that it is finite.

Theorem 4.4.2 (Schwarz's Inequality) If $X$ and $Y$ have finite second moments, then
$$(E[XY])^2 \leq E[X^2]E[Y^2]; \tag{4.8}$$
equality holds if and only if $P(X = 0) = 1$ or $P(Y = aX) = 1$ for some constant $a$.

PROOF: Either $P(X = 0) = 1$ or $P(X = 0) < 1$. In the first case, equality holds in Equation 4.8 because both sides are zero. We can therefore assume that $P(X = 0) < 1$, which means that $X$ takes on some value $x_0 \neq 0$ with positive probability, so that $E[X^2] = \sum_j x_j^2 f_X(x_j) > 0$. Define a quadratic function by the equation
$$g(\lambda) = E[(Y - \lambda X)^2] = E[Y^2] - 2\lambda E[XY] + \lambda^2 E[X^2].$$
This function has a minimum value at
$$\lambda_0 = \frac{E[XY]}{E[X^2]}.$$
Thus, $0 \leq E[(Y - \lambda_0X)^2] \leq E[(Y - \lambda X)^2]$ for all real numbers $\lambda$. Replacing $\lambda_0$ by $E[XY]/E[X^2]$,
$$\begin{aligned}
E[(Y - \lambda_0X)^2] &= E[Y^2] - 2\lambda_0E[XY] + \lambda_0^2E[X^2] \\
&= E[Y^2] - 2\frac{(E[XY])^2}{E[X^2]} + \frac{(E[XY])^2}{E[X^2]} \\
&= E[Y^2] - \frac{(E[XY])^2}{E[X^2]}
\end{aligned}$$
and so
$$0 \leq E[(Y - \lambda_0X)^2] = E[Y^2] - \frac{(E[XY])^2}{E[X^2]} \leq E[(Y - \lambda X)^2].$$
On the one hand, this implies that $(E[XY])^2 \leq E[X^2]E[Y^2]$; on the other hand, if there is equality then $E[(Y - \lambda_0X)^2] = 0$. If $Y - \lambda_0X$ takes on some nonzero value with positive probability, we would have $E[(Y - \lambda_0X)^2] > 0$, a contradiction. Thus, $P(Y - \lambda_0X = 0) = 1$. ■

If $X$ and $Y$ have finite second moments, then we know that both $E[(X - \mu_X)^2]$ and $E[(Y - \mu_Y)^2]$ are finite.
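Applying Inequality 4.8 to the random variables $X - \mu_X$ and $Y - \mu_Y$,
$$(E[(X - \mu_X)(Y - \mu_Y)])^2 \leq E[(X - \mu_X)^2]E[(Y - \mu_Y)^2] < +\infty,$$
and therefore $E[(X - \mu_X)(Y - \mu_Y)]$ is defined.

Definition 4.3 If $X$ and $Y$ have finite second moments, the covariance of $X$ and $Y$, denoted by $\operatorname{cov}(X, Y)$, is defined by
$$\operatorname{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)].$$
Alternatively,
$$\operatorname{cov}(X, Y) = E[XY] - E[X]E[Y].$$
■

Note that $\operatorname{cov}(X, c) = E[(X - \mu_X)(c - c)] = 0$ whenever $c$ is a constant, that $\operatorname{cov}(X, X) = E[X^2] - (E[X])^2 = \operatorname{var} X$, and also that $\operatorname{cov}(X, Y) = 0$ whenever $X$ and $Y$ are independent, by Theorem 4.3.4.

We now return to the variance of a sum.

Theorem 4.4.3 If $X_1, \ldots, X_n$ have finite second moments, then
$$\operatorname{var}\Bigl(\sum_{j=1}^{n} X_j\Bigr) = \sum_{j=1}^{n} \operatorname{var} X_j + 2\sum_{1 \leq i < j \leq n} \operatorname{cov}(X_i, X_j).$$

PROOF: Since $E[\sum_{j=1}^{n} X_j] = \sum_{j=1}^{n} \mu_{X_j}$,
$$\begin{aligned}
\operatorname{var}\Bigl(\sum_{j=1}^{n} X_j\Bigr) &= E\Bigl[\Bigl(\sum_{j=1}^{n} X_j - \sum_{j=1}^{n} \mu_{X_j}\Bigr)^2\Bigr] = E\Bigl[\Bigl(\sum_{j=1}^{n} (X_j - \mu_{X_j})\Bigr)^2\Bigr] \\
&= E\Bigl[\sum_{i,j=1}^{n} (X_i - \mu_{X_i})(X_j - \mu_{X_j})\Bigr] \\
&= \sum_{j=1}^{n} E[(X_j - \mu_{X_j})^2] + \sum_{i \neq j} E[(X_i - \mu_{X_i})(X_j - \mu_{X_j})] \\
&= \sum_{j=1}^{n} \operatorname{var} X_j + 2\sum_{1 \leq i < j \leq n} \operatorname{cov}(X_i, X_j).
\end{aligned}$$
■

Corollary 4.4.4 If $X_1, \ldots, X_n$ are independent random variables having finite second moments, then
$$\operatorname{var}\Bigl(\sum_{j=1}^{n} X_j\Bigr) = \sum_{j=1}^{n} \operatorname{var} X_j.$$

PROOF: For $i \neq j$, $X_i$ and $X_j$ are independent and $\operatorname{cov}(X_i, X_j) = 0$. ■

There is a more general version of Theorem 4.4.3.

Theorem 4.4.5 Let $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ have finite second moments and let $a_1, \ldots, a_m, b_1, \ldots, b_n$ be arbitrary real numbers. Then
$$\operatorname{cov}\Bigl(\sum_{i=1}^{m} a_iX_i, \sum_{j=1}^{n} b_jY_j\Bigr) = \sum_{i=1}^{m}\sum_{j=1}^{n} a_ib_j \operatorname{cov}(X_i, Y_j).$$

PROOF: By Theorem 4.3.3,
$$E\Bigl[\sum_{i=1}^{m} a_iX_i\Bigr] = \sum_{i=1}^{m} a_iE[X_i] \quad\text{and}\quad E\Bigl[\sum_{j=1}^{n} b_jY_j\Bigr] = \sum_{j=1}^{n} b_jE[Y_j].$$
Since
$$E\Bigl[\Bigl(\sum_{i=1}^{m} a_iX_i\Bigr)\Bigl(\sum_{j=1}^{n} b_jY_j\Bigr)\Bigr] = E\Bigl[\sum_{i=1}^{m}\sum_{j=1}^{n} a_ib_jX_iY_j\Bigr] = \sum_{i=1}^{m}\sum_{j=1}^{n} a_ib_jE[X_iY_j],$$
$$\begin{aligned}
\operatorname{cov}\Bigl(\sum_{i=1}^{m} a_iX_i, \sum_{j=1}^{n} b_jY_j\Bigr) &= E\Bigl[\Bigl(\sum_{i=1}^{m} a_iX_i\Bigr)\Bigl(\sum_{j=1}^{n} b_jY_j\Bigr)\Bigr] - E\Bigl[\sum_{i=1}^{m} a_iX_i\Bigr]E\Bigl[\sum_{j=1}^{n} b_jY_j\Bigr] \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n} a_ib_jE[X_iY_j] - \Bigl(\sum_{i=1}^{m} a_iE[X_i]\Bigr)\Bigl(\sum_{j=1}^{n} b_jE[Y_j]\Bigr) \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n} a_ib_j(E[X_iY_j] - E[X_i]E[Y_j]) \\
&= \sum_{i=1}^{m}\sum_{j=1}^{n} a_ib_j \operatorname{cov}(X_i, Y_j).
\end{aligned}$$
■

EXAMPLE 4.17 Consider an experiment in which balls numbered $1, 2, \ldots, n$ are distributed at random in $n$ boxes so that the total number of outcomes is $n!$.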
Let $S_n$ be the number of matches; i.e., the number of balls in boxes having the same number. The range of $S_n$ is $\{0, 1, \ldots, n\}$. Suppose we want to calculate $E[S_n]$ and $\operatorname{var} S_n$. For $j = 1, \ldots, n$, let $X_j = 1$ or $0$ according to whether the $j$th ball is in the $j$th box or not. Then $S_n = X_1 + \cdots + X_n$. Since $P(X_j = 1) = (n - 1)!/n! = 1/n$, $E[X_j] = 1/n$. Thus, $E[S_n] = 1$. Since $X_j^2 = X_j$,
$$\operatorname{var} X_j = E[X_j^2] - (E[X_j])^2 = E[X_j] - (E[X_j])^2 = \frac{1}{n} - \frac{1}{n^2} = \frac{n - 1}{n^2}.$$
We now calculate $E[X_jX_k]$ for $j \neq k$. Now $X_jX_k$ is $1$ or $0$ according to whether the $j$th and $k$th balls are in the corresponding boxes or not. Thus, $P(X_jX_k = 1) = (n - 2)!/n! = 1/(n(n - 1))$, so that for $j \neq k$,
$$\operatorname{cov}(X_j, X_k) = E[X_jX_k] - E[X_j]E[X_k] = \frac{1}{n(n - 1)} - \frac{1}{n^2} = \frac{1}{n^2(n - 1)}.$$
By Theorem 4.4.3,
$$\operatorname{var} S_n = \sum_{j=1}^{n} \operatorname{var} X_j + 2\sum_{1 \leq i < j \leq n} \operatorname{cov}(X_i, X_j).$$
Since all the terms in the second sum are equal to $1/(n^2(n - 1))$ and the number of terms is the number of ways of selecting two distinct integers $i$ and $j$ from $\{1, \ldots, n\}$ without regard to order,
$$\operatorname{var} S_n = n \cdot \frac{n - 1}{n^2} + 2\binom{n}{2}\frac{1}{n^2(n - 1)} = 1.$$
Therefore, $E[S_n] = 1$ and $\operatorname{var} S_n = 1$. ■

There are good reasons for replacing the random variable $X$ by a centered and normalized random variable $(X - \mu_X)/\sigma_X$.

Definition 4.4 If $X$ and $Y$ are two random variables having finite second moments, the correlation between $X$ and $Y$, denoted by $\rho(X, Y)$, is defined by
$$\rho(X, Y) = E\Bigl[\Bigl(\frac{X - \mu_X}{\sigma_X}\Bigr)\Bigl(\frac{Y - \mu_Y}{\sigma_Y}\Bigr)\Bigr] = \frac{\operatorname{cov}(X, Y)}{\sigma_X\sigma_Y}.$$

The following result is of interest in its own right and also tells us something about $\rho(X, Y)$.

If $X$ and $Y$ are independent random variables with finite second moments, then $\rho(X, Y) = 0$ since $\operatorname{cov}(X, Y) = 0$ in this case. The converse is not true in general. It is possible for $\rho(X, Y) = 0$ without $X$ and $Y$ being independent. Also, replacing $X$ and $Y$ in $\rho(X, Y)$ by certain linear functions of $X$ and $Y$, respectively, does not change the correlation; i.e., $\rho(aX + b, cY + d) = \rho(X, Y)$ whenever $a > 0$, $c > 0$.
This follows from the fact that $\operatorname{var}(aX + b) = E[((aX + b) - (a\mu_X + b))^2] = E[a^2(X - \mu_X)^2] = a^2 \operatorname{var} X$, $\operatorname{var}(cY + d) = c^2 \operatorname{var} Y$, and
$$\begin{aligned}
\operatorname{cov}(aX + b, cY + d) &= E[((aX + b) - (a\mu_X + b))((cY + d) - (c\mu_Y + d))] \\
&= E[ac(X - \mu_X)(Y - \mu_Y)] = ac \operatorname{cov}(X, Y),
\end{aligned}$$
so that
$$\rho(aX + b, cY + d) = \frac{ac \operatorname{cov}(X, Y)}{a\sigma_X c\sigma_Y} = \rho(X, Y) \tag{4.9}$$
whenever $a > 0$, $c > 0$.

Theorem 4.4.6 Let $X$ and $Y$ be random variables with finite second moments, $\sigma_X > 0$, $\sigma_Y > 0$. Then $|\rho(X, Y)| \leq 1$ with equality if and only if there are constants $a$ and $b$ such that $P(Y = aX + b) = 1$.

PROOF: Let $X^* = (X - \mu_X)/\sigma_X$, $Y^* = (Y - \mu_Y)/\sigma_Y$. By Equation 4.9, $\rho(X^*, Y^*) = \rho(X, Y)$. Since $E[X^{*2}] = E[((X - \mu_X)/\sigma_X)^2] = (1/\sigma_X^2)E[(X - \mu_X)^2] = 1$ and likewise $E[Y^{*2}] = 1$, by Inequality 4.8,
$$\rho(X, Y)^2 = \rho(X^*, Y^*)^2 \leq E[X^{*2}]E[Y^{*2}] = 1.$$
Thus, $|\rho(X, Y)| \leq 1$ with equality if and only if $P(Y^* = aX^*) = 1$ for some $a \in R$, in which case there are constants $a$ and $b$ such that $P(Y = aX + b) = 1$. ■

EXAMPLE 4.18 Let $X_1$, $X_2$, and $X_3$ be independent random variables with $\sigma_{X_1}^2 = 2$, $\sigma_{X_2}^2 = 4$, and $\sigma_{X_3}^2 = 3$, respectively, and consider the problem of calculating the correlation between the random variables $2X_1 - 3X_2 + 5X_3$ and $X_1 + 2X_2 - 4X_3$. By independence, $\operatorname{cov}(X_i, X_j) = 0$ whenever $i \neq j$. By Theorem 4.4.5,
$$\begin{aligned}
\operatorname{cov}(2X_1 - 3X_2 + 5X_3, X_1 + 2X_2 - 4X_3)
&= (2)(1)\operatorname{cov}(X_1, X_1) + (2)(2)\operatorname{cov}(X_1, X_2) + (2)(-4)\operatorname{cov}(X_1, X_3) \\
&\quad + (-3)(1)\operatorname{cov}(X_2, X_1) + (-3)(2)\operatorname{cov}(X_2, X_2) + (-3)(-4)\operatorname{cov}(X_2, X_3) \\
&\quad + (5)(1)\operatorname{cov}(X_3, X_1) + (5)(2)\operatorname{cov}(X_3, X_2) + (5)(-4)\operatorname{cov}(X_3, X_3) \\
&= 2\operatorname{cov}(X_1, X_1) - 6\operatorname{cov}(X_2, X_2) - 20\operatorname{cov}(X_3, X_3) \\
&= 2\sigma_{X_1}^2 - 6\sigma_{X_2}^2 - 20\sigma_{X_3}^2 = -80.
\end{aligned}$$
Since the random variables $2X_1$, $-3X_2$, and $5X_3$ are independent,
$$\operatorname{var}(2X_1 - 3X_2 + 5X_3) = \operatorname{var}(2X_1) + \operatorname{var}(-3X_2) + \operatorname{var}(5X_3) = 4\sigma_{X_1}^2 + 9\sigma_{X_2}^2 + 25\sigma_{X_3}^2 = 119.$$
Similarly, $\operatorname{var}(X_1 + 2X_2 - 4X_3) = 66$. Therefore,
$$\rho(2X_1 - 3X_2 + 5X_3, X_1 + 2X_2 - 4X_3) = \frac{-80}{\sqrt{119}\sqrt{66}} \approx -.903.$$
■

The correlation between $X$ and $Y$ measures the linear dependence between $X$ and $Y$. In the case $|\rho(X, Y)| = 1$, there is a linear functional relationship between $X$ and $Y$.
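The arithmetic in Example 4.18 can be double-checked with a few lines of Python. This is only a sketch under the example's assumptions (independent $X_1$, $X_2$, $X_3$ with the stated variances); the helper `cov_linear` is a hypothetical name, and it implements only the special case of Theorem 4.4.5 in which all cross covariances vanish.

```python
import math

variances = [2, 4, 3]  # var X1, var X2, var X3 from Example 4.18
a = [2, -3, 5]         # coefficients of 2*X1 - 3*X2 + 5*X3
b = [1, 2, -4]         # coefficients of X1 + 2*X2 - 4*X3


def cov_linear(a, b, variances):
    """cov(sum a_i X_i, sum b_i X_i) when the X_i are independent.

    Only the diagonal terms a_i * b_i * var(X_i) survive, because
    cov(X_i, X_j) = 0 for i != j.
    """
    return sum(ai * bi * v for ai, bi, v in zip(a, b, variances))


cov_ab = cov_linear(a, b, variances)
var_a = cov_linear(a, a, variances)  # variance is the covariance with itself
var_b = cov_linear(b, b, variances)
rho = cov_ab / math.sqrt(var_a * var_b)

print(cov_ab, var_a, var_b, round(rho, 3))  # -80 119 66 -0.903
```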
It is possible for two random variables $U$ and $V$ to be related functionally in the same way as two random variables $X$ and $Y$ with wide disparities between the correlations $\rho(U, V)$ and $\rho(X, Y)$.

EXAMPLE 4.19 Let $X$ be a random variable that takes on values $-1, 0, 1$ with probabilities $1/4$, $1/2$, $1/4$, respectively, and let $Y = X^2$. It is easy to calculate that $\rho(X, Y) = 0$. If we let $U = X + 1$ and $V = U^2$, then $U$ and $V$ have the same functional relationship as $X$ and $Y$. It is also easy to calculate that $\rho(U, V) = 2\sqrt{2}/3 \approx .94$ if use is made of the fact that $X^3 = X$ and $X^4 = X^2$; for example,
$$\begin{aligned}
\operatorname{var} V &= E[V^2] - (E[V])^2 \\
&= E[(X + 1)^4] - (E[(X + 1)^2])^2 \\
&= E[X^4] + 4E[X^3] + 6E[X^2] + 4E[X] + 1 - (E[X^2] + 2E[X] + 1)^2 \\
&= E[X^2] + 4E[X] + 6E[X^2] + 4E[X] + 1 - (E[X^2] + 2E[X] + 1)^2.
\end{aligned}$$
Since $E[X] = 0$ and $E[X^2] = 1/2$, $\operatorname{var} V = 9/4$. Even though $U$ and $V$ are functionally related in the same way as $X$ and $Y$, one pair has correlation zero and the other pair has correlation close to 1. ■

EXERCISES 4.4

1. The joint density function $f_{X,Y}(x, y)$ of the random variables $X$ and $Y$ is tabulated below. Calculate $\rho(X, Y)$.
$$\begin{array}{c|ccccc}
 & y = 1 & y = 2 & y = 3 & y = 4 & y = 5 \\
\hline
x = -1 & 1/20 & 2/20 & 1/20 & 0 & 0 \\
x = 0 & 0 & 3/20 & 2/20 & 1/20 & 0 \\
x = 1 & 1/20 & 2/20 & 3/20 & 0 & 0 \\
x = 2 & 0 & 1/20 & 1/20 & 1/20 & 1/20
\end{array}$$
2. Let $X$ and $Y$ be random variables with $\rho(X, Y) = 3/4$, $\operatorname{var} X = 2$, and $\operatorname{var} Y = 1$. Calculate $\operatorname{var}(X + 2Y)$.
3. A bowl contains $r$ red balls and $b$ black balls. An unordered random sample of size 2 is selected from the bowl. Let $X$ be the number of red balls and $Y$ the number of black balls in the sample. Calculate $\rho(X, Y)$ without using the joint density of $X$ and $Y$.
4. Let $X_1$, $X_2$, and $X_3$ be independent random variables with $\sigma_{X_1}^2 = 4$, $\sigma_{X_2}^2 = 3$, and $\sigma_{X_3}^2 = 1$. Calculate $\rho(X_1 + 2X_2 - X_3, 3X_1 - X_2 + X_3)$.
5. A bowl contains three balls numbered 1, 2, 3. Two balls are successively selected at random from the bowl without replacement. If $X$ is the number on the first ball and $Y$ the number on the second ball, calculate $\rho(X, Y)$.
6.
Suppose $n$ distinguishable balls are randomly distributed into $r$ boxes. If $S_r$ is the number of empty boxes, $S_r = X_1 + \cdots + X_r$ where $X_i$ is $1$ or $0$ according to whether box $i$ is empty or not, $1 \leq i \leq r$.
(a) Calculate $E[X_i]$.
(b) Calculate $E[X_iX_j]$, $i \neq j$.
(c) Calculate $E[S_r]$.
(d) Calculate $\operatorname{var} S_r$.
7. Consider a basic experiment with $r$ outcomes $1, 2, \ldots, r$ having probabilities $p_1, p_2, \ldots, p_r$, respectively, and consider $n$ independent repetitions of this basic experiment. For $i = 1, 2, \ldots, r$, let $Y_i$ be the number of trials resulting in the outcome $i$. Writing $Y_i = I_{i,1} + I_{i,2} + \cdots + I_{i,n}$ where $I_{i,j} = 1$ or $0$ according to whether the $j$th trial results in $i$ or not,
(a) Calculate $E[I_{i,k}I_{j,\ell}]$ for $i \neq j$, $k = \ell$, $1 \leq k, \ell \leq n$.
(b) Calculate $E[I_{i,k}I_{j,\ell}]$ for $i \neq j$, $k \neq \ell$.
(c) Calculate $E[Y_i]$ and $E[Y_iY_j]$, $i \neq j$.
(d) Calculate $\operatorname{var} Y_i$.
(e) Calculate $\rho(Y_i, Y_j)$, $i \neq j$.
8. Let $X$ and $Y$ be two random variables that take on only two values each. If $\operatorname{cov}(X, Y) = 0$, show that $X$ and $Y$ are independent.
9. Let $X$ and $Y$ be random variables with finite second moments. The linear function $aX + b$ of $X$ is called the best mean square linear predictor of $Y$ if
$$E[(Y - aX - b)^2] \leq E[(Y - cX - d)^2]$$
for all real numbers $c, d \in R$. Calculate $a$ and $b$.

4.5 CONDITIONAL EXPECTATION

We have seen in some instances that conditional probabilities can be used to simplify computations and, in fact, some probability models are defined in terms of conditional probabilities. We will look at this concept in the context of random variables.

Let $X$ and $Y$ be two random variables with ranges $\{x_1, x_2, \ldots\}$ and $\{y_1, y_2, \ldots\}$, respectively. If $P(X = x_j) > 0$, then $P(Y = y_k \mid X = x_j)$ is defined, and we will let $f_{Y|X}(y_k \mid x_j)$ denote this conditional probability. Thus,
$$f_{Y|X}(y_k \mid x_j) = \frac{f_{X,Y}(x_j, y_k)}{f_X(x_j)}.$$
When $P(X = x_j) = f_X(x_j) = 0$, the above quotient is undefined, and we define $f_{Y|X}(y_k \mid x_j) = 0$ whenever $f_X(x_j) = 0$. The function $f_{Y|X}(y_k \mid x_j)$ of the two variables $x_j$, $y_k$ is called the conditional density of $Y$ given $X = x_j$.
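The definition of the conditional density translates directly into a small computation. The sketch below assumes a hypothetical joint density stored as a Python dictionary; the table values and the function names `marginal_x` and `conditional_y_given_x` are illustrative assumptions, not part of the text.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint density f_{X,Y}(x, y), chosen only for illustration.
joint = {
    (0, 1): Fraction(1, 4),
    (0, 2): Fraction(1, 4),
    (1, 1): Fraction(1, 6),
    (1, 2): Fraction(1, 3),
}


def marginal_x(joint):
    """Marginal density f_X obtained by summing the joint density over y."""
    fx = defaultdict(Fraction)
    for (x, _y), prob in joint.items():
        fx[x] += prob
    return fx


def conditional_y_given_x(joint, y, x):
    """f_{Y|X}(y | x), defined to be 0 whenever f_X(x) = 0."""
    fx = marginal_x(joint).get(x, Fraction(0))
    if fx == 0:
        return Fraction(0)
    return joint.get((x, y), Fraction(0)) / fx


print(conditional_y_given_x(joint, 2, 1))  # 2/3
print(conditional_y_given_x(joint, 1, 5))  # 0, since f_X(5) = 0
```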
The $x_j$ variable is usually thought of as a parameter. It follows from the definition that
$$f_{X,Y}(x_j, y_k) = f_{Y|X}(y_k \mid x_j) f_X(x_j). \tag{4.10}$$

EXAMPLE 4.20 A bowl contains chips numbered from 1 to 10. A chip is selected at random from the bowl. If the chip selected is numbered $x$, $1 \leq x \leq 10$, then a second chip is selected at random from the chips numbered $1, 2, \ldots, x$. This is an experiment for which the probability model is defined in terms of conditional densities. Let $X$ be the number on the first chip and $Y$ the number on the second chip. Then
$$f_X(x) = \begin{cases} 1/10 & \text{for } x = 1, 2, \ldots, 10 \\ 0 & \text{otherwise.} \end{cases}$$
The remainder of the description of the experiment specifies $f_{Y|X}(y \mid x)$. For $x = 1, 2, \ldots, 10$,
$$f_{Y|X}(y \mid x) = \begin{cases} 1/x & \text{for } y = 1, 2, \ldots, x \\ 0 & \text{otherwise.} \end{cases}$$
Thus,
$$f_{X,Y}(x, y) = \begin{cases} 1/(10x) & \text{for } 1 \leq y \leq x,\ x = 1, 2, \ldots, 10 \\ 0 & \text{otherwise.} \end{cases}$$
■

Conditional probabilities can also be defined for collections of random variables. In what follows, $x_j$ will denote a typical value of $X_j$ and $y$ a typical value of $Y$.

Definition 4.5 If $Y, X_1, X_2, \ldots, X_m$ are random variables, the conditional density of $Y$ given $X_1, X_2, \ldots, X_m$ is the function
$$f_{Y|X_1, \ldots, X_m}(y \mid x_1, \ldots, x_m) = \frac{f_{X_1, \ldots, X_m, Y}(x_1, \ldots, x_m, y)}{f_{X_1, \ldots, X_m}(x_1, \ldots, x_m)}$$
whenever the denominator is different from zero and is equal to zero otherwise. ■

The conditional density satisfies the following equation:
$$f_{X_1, \ldots, X_m, Y}(x_1, \ldots, x_m, y) = f_{Y|X_1, \ldots, X_m}(y \mid x_1, \ldots, x_m) f_{X_1, \ldots, X_m}(x_1, \ldots, x_m). \tag{4.11}$$

Theorem 4.5.1 If the random variable $Y$ is independent of the collection of random variables $\{X_1, \ldots, X_m\}$ (i.e., $f_{X_1, \ldots, X_m, Y}(x_1, \ldots, x_m, y) = f_{X_1, \ldots, X_m}(x_1, \ldots, x_m) f_Y(y)$), then
$$f_{Y|X_1, \ldots, X_m}(y \mid x_1, \ldots, x_m) = f_Y(y)$$
whenever $f_{X_1, \ldots, X_m}(x_1, \ldots, x_m) > 0$.

PROOF: The result follows directly from the definition of the conditional density. ■

EXAMPLE 4.21 Consider an infinite sequence of Bernoulli random variables $\{X_j\}_{j=1}^{\infty}$ with probability of success $p$. Fixing $n > 1$,
$$f_{X_n|X_1, \ldots, X_{n-1}}(x_n \mid x_1, \ldots, x_{n-1}) = f_{X_n}(x_n)$$
whenever $f_{X_1, \ldots, X_{n-1}}(x_1, \ldots, x_{n-1}) > 0$. This follows from the fact that the $X_1$, ...
, $X_n$ are independent random variables, so that
$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \times \cdots \times f_{X_n}(x_n) = f_{X_1, \ldots, X_{n-1}}(x_1, \ldots, x_{n-1}) f_{X_n}(x_n),$$
and therefore $X_n$ is independent of the collection $\{X_1, \ldots, X_{n-1}\}$. ■

Let $\{y_1, y_2, \ldots\}$ be the range of the random variable $Y$. For any values $x_1, \ldots, x_n$ of $X_1, \ldots, X_n$, respectively, such that $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0$, the conditional density $f_{Y|X_1, \ldots, X_n}(y \mid x_1, \ldots, x_n)$ is a density function as a function of $y$ since $f_{Y|X_1, \ldots, X_n}(y_k \mid x_1, \ldots, x_n) = P(Y = y_k \mid X_1 = x_1, \ldots, X_n = x_n)$, the conditional probabilities on the right are nonnegative, and the union of the disjoint events $(Y = y_k)$ is all of $\Omega$, so that
$$\begin{aligned}
\sum_{y_k} f_{Y|X_1, \ldots, X_n}(y_k \mid x_1, \ldots, x_n) &= \sum_{y_k} P(Y = y_k \mid X_1 = x_1, \ldots, X_n = x_n) \\
&= P\Bigl(\bigcup_k (Y = y_k) \Bigm| X_1 = x_1, \ldots, X_n = x_n\Bigr) \\
&= P(\Omega \mid X_1 = x_1, \ldots, X_n = x_n) = 1.
\end{aligned}$$

Definition 4.6 Let $\{Y, X_1, \ldots, X_n\}$ be a collection of random variables with $E[Y]$ finite. The conditional expectation of $Y$ given $X_1 = x_1, \ldots, X_n = x_n$ is defined by
$$E[Y \mid X_1 = x_1, \ldots, X_n = x_n] = \sum_{y_j} y_j f_{Y|X_1, \ldots, X_n}(y_j \mid x_1, \ldots, x_n)$$
whenever $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0$ and is defined arbitrarily when $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = 0$. ■

The definition of $E[Y]$ required that the defining series be absolutely convergent, but no mention is made of absolute convergence of the series above defining $E[Y \mid X_1 = x_1, \ldots, X_n = x_n]$. The absolute convergence is inherent in the requirement that $E[Y]$ be finite; i.e., that $E[|Y|] < +\infty$. By Theorem 3.4.3,
$$\begin{aligned}
+\infty > \sum_{y_j} |y_j| f_Y(y_j) &= \sum_{y_j} |y_j| \sum_{x_1, \ldots, x_n} f_{X_1, \ldots, X_n, Y}(x_1, \ldots, x_n, y_j) \\
&= \sum_{y_j} \sum_{x_1, \ldots, x_n} |y_j| f_{X_1, \ldots, X_n, Y}(x_1, \ldots, x_n, y_j) \\
&= \sum_{y_j} \sum_{x_1, \ldots, x_n} |y_j| f_{Y|X_1, \ldots, X_n}(y_j \mid x_1, \ldots, x_n) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= \sum_{x_1, \ldots, x_n} \Bigl(\sum_{y_j} |y_j| f_{Y|X_1, \ldots, X_n}(y_j \mid x_1, \ldots, x_n)\Bigr) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n).
\end{aligned}$$
Thus, for any term with $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0$, the series within the parentheses converges, and so the series defining $E[Y \mid X_1 = x_1, \ldots, X_n = x_n]$ converges absolutely.

EXAMPLE 4.22 Consider the random variables $X$ and $Y$ of Example 4.20. Suppose $x \in \{1, 2, \ldots, 10\}$.
By Exercise 1.2.6,
$$E[Y \mid X = x] = \sum_{y=1}^{x} y f_{Y|X}(y \mid x) = \frac{1}{x}\sum_{y=1}^{x} y = \frac{1}{x} \cdot \frac{x(x + 1)}{2} = \frac{x + 1}{2}$$
for $x = 1, 2, \ldots, 10$. ■

We will now consider operational properties of the conditional expectation.

Theorem 4.5.2 If $Y, X_1, \ldots, X_n$ are any random variables with $E[Y]$ finite, then
$$E[Y] = \sum_{x_1, \ldots, x_n} E[Y \mid X_1 = x_1, \ldots, X_n = x_n] f_{X_1, \ldots, X_n}(x_1, \ldots, x_n).$$

PROOF: By definition of the conditional expectation,
$$\begin{aligned}
\sum_{x_1, \ldots, x_n} E[Y \mid X_1 = x_1, \ldots, X_n = x_n] f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)
&= \sum_{x_1, \ldots, x_n} \sum_{y_k} y_k f_{Y|X_1, \ldots, X_n}(y_k \mid x_1, \ldots, x_n) f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \\
&= \sum_{x_1, \ldots, x_n} \sum_{y_k} y_k f_{X_1, \ldots, X_n, Y}(x_1, \ldots, x_n, y_k) \\
&= \sum_{y_k} \sum_{x_1, \ldots, x_n} y_k f_{X_1, \ldots, X_n, Y}(x_1, \ldots, x_n, y_k) \\
&= \sum_{y_k} y_k f_Y(y_k) = E[Y].
\end{aligned}$$
The interchange of order of summation is justifiable by absolute convergence and Theorem 3.4.4. ■

EXAMPLE 4.23 Consider the random variables $X$ and $Y$ of Example 4.22. The expected value of $Y$ is given by
$$E[Y] = \sum_{x=1}^{10} E[Y \mid X = x] f_X(x) = \sum_{x=1}^{10} \frac{x + 1}{2} \cdot \frac{1}{10} = \frac{1}{20}\sum_{x=1}^{10} (x + 1) = 3.25.$$
■

Theorem 4.5.3 Let $Y$ be a random variable with finite expectation that is independent of $X_1, \ldots, X_n$. Then
$$E[Y \mid X_1 = x_1, \ldots, X_n = x_n] = E[Y]$$
whenever $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0$.

PROOF: Suppose $f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) > 0$. By Theorem 4.5.1,
$$E[Y \mid X_1 = x_1, \ldots, X_n = x_n] = \sum_{y_k} y_k f_{Y|X_1, \ldots, X_n}(y_k \mid x_1, \ldots, x_n) = \sum_{y_k} y_k f_Y(y_k) = E[Y].$$
■

We mention in passing that if $\{Y_1, \ldots, Y_m\}$ and $\{X_1, \ldots, X_n\}$ are two collections of random variables and $\psi(Y_1, \ldots, Y_m)$ has finite expectation, then
$$E[\psi(Y_1, \ldots, Y_m) \mid X_1 = x_1, \ldots, X_n = x_n] = \sum_{y_1, \ldots, y_m} \psi(y_1, \ldots, y_m) f_{Y_1, \ldots, Y_m|X_1, \ldots, X_n}(y_1, \ldots, y_m \mid x_1, \ldots, x_n).$$
The proof of this result again involves rearranging the terms of an infinite series. Using this result, properties of conditional expected value analogous to those of expected value can be proved; e.g., the conditional expectation of a sum of random variables is equal to the sum of the conditional expectations.

In the remainder of this section, we will deal with the expected duration of play for the gambler's ruin problem.
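Before turning to that problem, Theorem 4.5.2 can be checked numerically on the chip experiment of Examples 4.20 through 4.23. The following sketch, with function names that are assumptions made for illustration, computes $E[Y]$ two ways using exact fractions: by averaging the conditional expectations and directly from the joint density.

```python
from fractions import Fraction

xs = range(1, 11)
f_x = {x: Fraction(1, 10) for x in xs}  # first chip is uniform on 1..10


def cond_density(y, x):
    """f_{Y|X}(y | x): second chip is uniform on 1..x."""
    return Fraction(1, x) if 1 <= y <= x else Fraction(0)


def cond_expectation(x):
    """E[Y | X = x], summed directly from the conditional density."""
    return sum(y * cond_density(y, x) for y in range(1, x + 1))


# Theorem 4.5.2: average the conditional expectations over f_X.
e_y_total = sum(cond_expectation(x) * f_x[x] for x in xs)

# Direct computation from the joint density f_{X,Y} = f_{Y|X} * f_X.
e_y_direct = sum(y * cond_density(y, x) * f_x[x] for x in xs for y in xs)

print(cond_expectation(4), e_y_total, e_y_direct)  # 5/2 13/4 13/4
```

Both routes give $13/4 = 3.25$, in agreement with Example 4.23.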
Consider an infinite sequence of Bernoulli random variables \(\{X_j\}_{j=1}^{\infty}\) with probability of success \(p\) and the associated gambler's ruin problem. We will use the notation of Section 3.5 where we were able to calculate the probabilities of eventual ruin given in Equations 3.14 and 3.15. We will now consider how long the play will last. If the gambler's initial capital is \(x\), let \(T_x = n\) if play terminates on the \(n\)th play (i.e., either the gambler or his adversary is ruined on the \(n\)th play) and let \(D_x = E[T_x]\) be the expected duration of play.

Suppose \(1 < x < a - 1\). If the gambler wins one unit on the first play (with probability \(p\)), then his capital becomes \(x + 1\), and the subsequent expected duration of play is \(D_{x+1}\); if he loses one unit on the first play (with probability \(q\)), then his capital becomes \(x - 1\), and the subsequent expected duration of play is \(D_{x-1}\). Since one play has already taken place and these two possibilities are mutually exclusive,
\[ D_x = p(D_{x+1} + 1) + q(D_{x-1} + 1) \quad \text{if } 1 < x < a - 1. \]
If \(x = 1\) and the gambler loses on the first play, then his subsequent expected duration is zero; since one play has already taken place, the second term in this equation becomes just \(q\). Similarly, the first term becomes just \(p\) if \(x = a - 1\) and he wins on the first play. This means that this equation holds for \(x = 1\) and \(x = a - 1\) provided we put \(D_0 = D_a = 0\). Therefore, the expected duration \(D_x\) satisfies the difference equation
\[ D_x = pD_{x+1} + qD_{x-1} + 1 \quad \text{if } 1 \le x \le a - 1 \tag{4.12} \]
subject to the boundary conditions
\[ D_0 = 0, \quad D_a = 0. \tag{4.13} \]
It should be emphasized that the derivation of the difference equation and boundary conditions is heuristic and not mathematical. Were it not for the constant term in Equation 4.12, we could solve this problem as in Section 3.5. The procedure for solving this problem is as follows.
Note that if \(u_x\) satisfies the equation
\[ u_x = pu_{x+1} + qu_{x-1} \quad \text{for } 1 \le x \le a - 1 \tag{4.14} \]
and \(v_x^{(p)}\) satisfies the equation
\[ v_x^{(p)} = pv_{x+1}^{(p)} + qv_{x-1}^{(p)} + 1 \quad \text{for } 1 \le x \le a - 1, \tag{4.15} \]
then \(u_x + v_x^{(p)}\) satisfies Equation 4.12. Since Equation 4.14 is the same as Equation 3.12 in Section 3.5, we can use the results of Section 3.5 to solve Equation 4.14 depending upon whether \(p \ne q\) or \(p = q\). In the \(p \ne q\) case,
\[ u_x = A + B\left(\frac{q}{p}\right)^x, \]
where \(A\) and \(B\) are arbitrary constants, and in the \(p = q\) case \(u_x = A + Bx\). Since there are two arbitrary constants \(A\) and \(B\) in these solutions, it suffices to find some \(v_x^{(p)}\), called a particular solution, of Equation 4.15. In the \(p \ne q\) case, we can take
\[ v_x^{(p)} = \frac{x}{q - p}, \]
and in the \(p = q\) case, \(v_x^{(p)} = -x^2\). We can therefore find a solution to Equation 4.12 in the \(p \ne q\) case of the form
\[ D_x = \frac{x}{q - p} + A + B\left(\frac{q}{p}\right)^x, \quad 1 \le x \le a - 1, \]
and in the \(p = q\) case of the form
\[ D_x = -x^2 + A + Bx, \quad 1 \le x \le a - 1. \]
Choosing \(A\) and \(B\) to satisfy the boundary conditions 4.13, in the \(p \ne q\) case,
\[ D_x = \frac{x}{q - p} - \frac{a}{q - p} \cdot \frac{1 - (q/p)^x}{1 - (q/p)^a}, \quad 1 \le x \le a - 1, \tag{4.16} \]
and in the \(p = q\) case,
\[ D_x = x(a - x), \quad 1 \le x \le a - 1. \tag{4.17} \]
Against an infinitely rich adversary in the unfair case \(q > p\), \(\lim_{a \to \infty} D_x = x/(q - p)\); in the fair case \(q = p\), \(\lim_{a \to \infty} D_x = +\infty\).

EXAMPLE 4.24 Suppose a gambler and his adversary each have $100 and $1 is wagered each time in a fair game. The expected duration is then \(D_{100} = 100(200 - 100) = 10{,}000\). If one-half dollar is wagered each time, then this has the effect of doubling the units, and the expected duration is then \(D_{200} = 200(400 - 200) = 40{,}000\). We saw in Section 3.5 that doubling the number of units has no effect on the probability of eventual ruin in the fair case; doubling the units by wagering one-half unit on each play only prolongs the agony. ■

The heuristic argument used to derive the difference equation 4.12 and the boundary conditions 4.13 should not be confused with a mathematical derivation. Although a proper mathematical argument can be made, the details are too tedious at this stage.

EXERCISES 4.5
1.
An experiment consists of selecting an integer \(X\) at random from \(\{1, 2, \ldots, 100\}\) and then selecting an integer \(Y\) at random from \(\{1, 2, \ldots, X\}\). Calculate \(E[Y]\) and \(\operatorname{var} Y\).
2. An experiment consists of selecting an integer \(X\) at random from \(\{0, 1, \ldots, 100\}\) and then selecting an integer \(Y\) at random from \(\{0, 1, \ldots, X\}\). Use the results of Example 4.1 and Example 4.12 to identify \(E[Y \mid X = x]\) and \(E[Y^2 \mid X = x]\) for \(x = 0, 1, \ldots, 100\) and then calculate \(E[Y]\) and \(\operatorname{var} Y\).
3. Let \(X_1\) and \(X_2\) be independent random variables with Poisson densities \(p(\cdot\,; \lambda_1)\) and \(p(\cdot\,; \lambda_2)\), respectively. If \(n\) is a positive integer, show that the conditional density of \(X_1\) given that \(X_1 + X_2 = n\) is a binomial density with parameters \(n\) and \(p = \lambda_1/(\lambda_1 + \lambda_2)\).
4. If \(X\) and \(Y\) are independent random variables with binomial densities \(b(\cdot\,; m, p)\) and \(b(\cdot\,; n, p)\), respectively, calculate \(E[X \mid X + Y = z]\), \(z = 0, 1, \ldots, m + n\).
5. A number \(P\) is selected from the set \(\{1/10, 2/10, \ldots, 9/10\}\) according to a uniform density. Given that \(P = j/10\), a number \(X\) is selected from \(\{1, 2, \ldots, 100\}\) according to a binomial density with parameters \(n = 100\) and \(p = j/10\). Calculate \(E[X]\).
6. Let \(\{X_j\}\) be a sequence of random variables having finite expectations and let \(N\) be a nonnegative integer-valued random variable that is independent of each \(X_j\). Show that \(E[X_N \mid N = n] = E[X_n]\) whenever \(f_N(n) > 0\).
7. Let \(\{X_j\}\) be a sequence of independent random variables having the same mean \(\mu\) and finite variance \(\sigma^2\), let \(S_0 = 0\), and let \(S_n = X_1 + \cdots + X_n\), \(n \ge 1\). Also let \(N\) be a nonnegative integer-valued random variable having finite mean and finite variance that is independent of the \(X_j\). Use the result of the previous problem to show that \(E[S_N] = \mu E[N]\) and \(\operatorname{var} S_N = \sigma^2 E[N] + \mu^2 \operatorname{var} N\).
8. If \(c\) is a constant and \(X\) is any random variable, show that \(E[c \mid X = x] = c\) whenever \(f_X(x) > 0\).
9. Let \(X\) and \(Y\) be random variables with \(Y\) having a finite second moment, and let \(\phi(X)\) be a real-valued function of \(X\) having finite second moment.
Show that \(E[\phi(X)Y \mid X = x] = \phi(x)E[Y \mid X = x]\) whenever \(f_X(x) > 0\).
10. Calculate the expected duration of play \(D_x\) for the modified gambler's ruin problem described in Exercise 3.5.4.

4.6 ENTROPY

In 1948, in a fundamental paper on the transmission of information (see the Supplemental Reading List at the end of the chapter), C. E. Shannon proposed a measure to quantify the uncertainty of an event. The basic idea of his measure is that frequently occurring events convey less information than infrequently occurring events. For example, the frequently occurring letter E in an English message conveys less information than the infrequently occurring Q, X, or Z.

The two words uncertainty and information are used repeatedly in what follows, and it is necessary to have some understanding of the relationship between the two. For example, consider a random variable \(X\) that takes on the values \(1, \ldots, 6\) with equal probabilities. Initially, there is uncertainty about the value of \(X\). But if \(X\) is observed and we are told that the value of \(X\) is 3 or 4, then there is a decrease in uncertainty and an increase in information.

Consider an event \(A\) with probability \(p\). A measure of the uncertainty of \(A\) should be some nonnegative monotone decreasing function \(I(p)\) of \(p\) so that \(I(p)\) is large when \(p\) is small. Moreover, if \(A_1\) and \(A_2\) are independent events with probabilities \(p_1\) and \(p_2\), respectively, then the uncertainty of \(A_1 \cap A_2\) is \(I(p_1 p_2)\). If it becomes known that \(A_2\) has occurred, the uncertainty \(I(p_1 p_2)\) should be decreased by \(I(p_2)\), and we should be left with the uncertainty \(I(p_1)\); i.e., \(I(p_1 p_2) - I(p_2) = I(p_1)\) or
\[ I(p_1 p_2) = I(p_1) + I(p_2). \tag{4.18} \]
This additive property of uncertainty for independent events is an assumption on our part. While we are at it, we might as well assume that \(I(p)\) is a continuous function of \(p\).
The property expressed by Equation 4.18 is reminiscent of the log function, and it should not come as a surprise that \(I(p)\) can be shown to be the log function except possibly for a multiplicative factor. The function
\[ I(p) = -\log p = \log \frac{1}{p}, \quad 0 < p \le 1, \]
satisfies all of the above requirements. But \(I(p)\) has not been completely determined, because there is more than one choice for the base of the log function. In communication theory, the base 2 is used because an on-off relay records one unit of information, called a bit. In this section, it will be understood that the log function is to the base 2. If an event \(A\) has probability 1/2, then \(I(1/2) = -\log 1/2 = 1\) bit.

Having defined the uncertainty of a single event, we now define the uncertainty associated with a random variable as the average uncertainty of the events \((X = x)\).

Definition 4.7 Let \(X\) be a random variable with range \(\{x_1, x_2, \ldots\}\). The entropy or uncertainty of \(X\) is the quantity
\[ H(X) = -\sum_j f_X(x_j) \log f_X(x_j) = \sum_j f_X(x_j) \log \frac{1}{f_X(x_j)} \]
where \(f_X(x) \log f_X(x) = 0\) whenever \(f_X(x) = 0\). ■

It must be emphasized that \(H(X)\) is determined by the values of the density function and only indirectly by the random variable \(X\).

EXAMPLE 4.25 Let \(X\) be a random variable taking on the values \(-1\), \(0\), and \(1\) with probabilities 1/4, 1/2, and 1/4, respectively, and let \(Y\) be a random variable taking on the values \(0\), \(e\), and \(\pi\) with probabilities 1/2, 1/4, and 1/4, respectively. Then
\[ H(X) = -\frac{1}{4}\log\frac{1}{4} - \frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} = \frac{3}{2} \]
and
\[ H(Y) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{4}\log\frac{1}{4} = \frac{3}{2}. \] ■

It is apparent from this example that the entropy of a random variable is totally unrelated to the meaning of the random variable and is determined solely by the values of its density function. This situation would be better portrayed if the notation \(H(f)\), where \(f\) is a density function, were used instead of \(H(X)\). \(H(X)\) is the expected value of a function of \(X\) in a rather complicated way; namely, \(H(X) = E[-\log f_X(X)] = -\sum_j f_X(x_j) \log f_X(x_j)\).
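Definition 4.7 translates directly into a few lines of code. The following is a minimal sketch, with a hypothetical helper name `entropy_bits`, that evaluates the entropy of a finite density in bits and applies it to the two densities of Example 4.25. It takes only the list of probabilities as input, which reflects the remark above that the entropy depends on the density and not on the values the random variable takes.

```python
import math


def entropy_bits(probs):
    """Entropy -sum p log2 p of a finite density, with 0 log 0 taken as 0."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be nonnegative and sum to 1")
    # Terms with p == 0 are skipped, matching the convention in Definition 4.7.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Densities of Example 4.25; the values -1, 0, 1 and 0, e, pi play no role.
h_x = entropy_bits([0.25, 0.5, 0.25])
h_y = entropy_bits([0.5, 0.25, 0.25])
```

Both calls receive the same multiset of probabilities, so they necessarily return the same number of bits, in agreement with the example.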
Consider a typical term of the form \(h(p) = -p \log p\) in the definition of \(H(X)\). The function \(h\) is continuous on \((0, 1]\). If we define \(h(p)\) to be zero when \(p = 0\), then \(h\) is also continuous at 0 since \(\lim_{p \to 0+} (-p \log p) = 0\) by l'Hôpital's rule. The graph of \(h(p)\) is depicted in Figure 4.1. Since the terms in the series defining \(H(X)\) are nonnegative, \(H(X)\) is defined even if the series diverges to \(+\infty\). It is possible for \(H(X)\) to be infinite.

EXAMPLE 4.26 Consider a random variable \(X\) having density function
\[ f_X(n) = \frac{c}{n \log^2 n}, \quad n = 2, 3, \ldots \]
where \(c\) is chosen so that \(\sum_{n=2}^{\infty} f_X(n) = 1\). The integral test for infinite series can be used to show that the series \(\sum_{n=2}^{\infty} 1/(n \log n)\) diverges to \(+\infty\) and that the series \(\sum_{n=2}^{\infty} 1/(n \log^2 n)\) converges. The entropy of \(X\) is then
\[ \begin{aligned} H(X) &= -\sum_{n=2}^{\infty} \frac{c}{n \log^2 n} \log\left(\frac{c}{n \log^2 n}\right) \\ &= \sum_{n=2}^{\infty} \frac{c}{n \log^2 n} \left(-\log c + \log n + 2 \log\log n\right) \\ &= c \sum_{n=2}^{\infty} \left( \frac{-\log c}{n \log^2 n} + \frac{1}{n \log n} + \frac{2 \log\log n}{n \log^2 n} \right). \end{aligned} \]
Since the sum of the first terms converges, the sum of the second terms diverges to \(+\infty\), and the sum of the third terms is nonnegative, the series defining \(H(X)\) diverges to \(+\infty\). Thus, \(H(X) = +\infty\).

EXAMPLE 4.27 Let \(X\) be a random variable having a uniform density on \(\{1, 2, \ldots, n\}\). Then
\[ H(X) = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n \text{ bits}. \] ■

FIGURE 4.1 Graph of \(-p \log p\).

The following lemma will be needed to establish an important property of \(H(X)\).

Lemma 4.6.1 \(\ln x \le x - 1\) for all \(x > 0\) with equality holding if and only if \(x = 1\).

This result is proved by showing that the line \(y = x - 1\) is tangent to the curve \(y = \ln x\) when \(x = 1\) and that the graph of the latter lies below the tangent line since \(y = \ln x\) is concave downward.

The assumption that the random variables of the following theorem have the same range is not essential because their ranges can be replaced by the union of their ranges insofar as densities are concerned.

Theorem 4.6.2 (Gibbs' Inequality) Let \(X\) and \(Y\) be discrete random variables having the same range such that \(f_X(z) = 0\) if and only if \(f_Y(z) = 0\).
Then
\[ \sum_j f_X(z_j) \log \frac{f_Y(z_j)}{f_X(z_j)} \le 0 \tag{4.19} \]
with equality holding if and only if \(f_X(z_j) = f_Y(z_j)\) for all \(j\).

PROOF: We need only consider those \(z_j\) for which \(f_X(z_j) > 0\). By Lemma 4.6.1,
\[ \sum_j f_X(z_j) \ln \frac{f_Y(z_j)}{f_X(z_j)} \le \sum_j f_X(z_j) \left( \frac{f_Y(z_j)}{f_X(z_j)} - 1 \right) = \sum_j f_Y(z_j) - \sum_j f_X(z_j) = 1 - 1 = 0. \tag{4.20} \]
Since \(\log a = (\ln a)/(\ln 2)\), multiplying both sides of this inequality by \(1/\ln 2\) we obtain Inequality 4.19. If \(f_X(z_j) = f_Y(z_j)\) for all \(j\), then the left side of Inequality 4.19 is zero and there is equality therein. Assume now that there is equality in Inequality 4.19. Multiplying both sides by \(\ln 2\),
\[ \sum_j f_X(z_j) \ln \frac{f_Y(z_j)}{f_X(z_j)} = 0. \]
Thus, the left member of Inequality 4.20 is zero, and therefore
\[ \sum_j f_X(z_j) \left( \left( \frac{f_Y(z_j)}{f_X(z_j)} - 1 \right) - \ln \frac{f_Y(z_j)}{f_X(z_j)} \right) = 0. \tag{4.21} \]
Since the terms of the sum are nonnegative, they must all be zero; i.e.,
\[ \ln \frac{f_Y(z_j)}{f_X(z_j)} = \frac{f_Y(z_j)}{f_X(z_j)} - 1 \]
and \(f_X(z_j) = f_Y(z_j)\) by Lemma 4.6.1. ■

EXAMPLE 4.28 Consider all random variables \(X\) that take on exactly \(n\) values \(x_1, \ldots, x_n\) with positive probabilities. Let \(Y\) be a random variable such that \(P(Y = x_i) = 1/n\), \(i = 1, \ldots, n\). Then \(H(X) \le H(Y) = \log n\) for all such \(X\); i.e., the entropy is a maximum when the density is uniform. This follows from Gibbs' inequality, since
\[ \sum_{i=1}^{n} f_X(x_i) \log \frac{1/n}{f_X(x_i)} \le 0 \]
so that
\[ H(X) = \sum_{i=1}^{n} f_X(x_i) \log \frac{1}{f_X(x_i)} \le -\sum_{i=1}^{n} f_X(x_i) \log \frac{1}{n} = \log n = H(Y) \]
by Example 4.27. ■

If a random variable \(X\) takes on values including \(-1\) and \(1\), then \(X^2\) has the effect of lumping \(-1\) and \(1\) together. Such lumping reduces entropy. Upon observing \(X^2\), there is a loss of information gained as compared to observing \(X\).

Theorem 4.6.3 If \(X\) is a discrete random variable and \(\phi\) is any real-valued function on the range of \(X\), then \(H(\phi(X)) \le H(X)\).

PROOF: Suppose first that \(\{x_{i_1}, x_{i_2}, \ldots\}\) is a subset of the range of \(X\), finite or infinite. For each \(j \ge 1\), let \(p_j = f_X(x_{i_j})\). Consider first the finite case \(\{x_{i_1}, \ldots, x_{i_k}\}\). Since \(p_i \le \sum_{j=1}^{k} p_j\),
\[ p_i \log p_i \le p_i \log\Bigl(\sum_{j=1}^{k} p_j\Bigr), \quad i = 1, \ldots, k. \]
Adding corresponding members of these \(k\) inequalities,
\[ \sum_{i=1}^{k} p_i \log p_i \le \Bigl(\sum_{i=1}^{k} p_i\Bigr) \log\Bigl(\sum_{i=1}^{k} p_i\Bigr). \]
If the sequence \(\{x_{i_j}\}\) is infinite, since \(p \log p\) is continuous on \([0, 1]\) we can let \(k \to \infty\) in this inequality to obtain \(\sum_i p_i \log p_i\)
\(\le \bigl(\sum_i p_i\bigr) \log\bigl(\sum_i p_i\bigr)\) in the finite or infinite case. Therefore,
\[ -\Bigl(\sum_i p_i\Bigr) \log\Bigl(\sum_i p_i\Bigr) \le -\sum_i p_i \log p_i. \]
Now let \(\{z_1, z_2, \ldots\}\) be the range of \(Z = \phi(X)\). Then
\[ H(Z) = H(\phi(X)) = -\sum_j f_Z(z_j) \log f_Z(z_j). \]
Consider a fixed \(z_j\) and let \(\{x_{j,1}, x_{j,2}, \ldots\}\) be the set of values of \(X\) such that \(z_j = \phi(x_{j,k})\). Since the probabilities of the \(x_{j,1}, x_{j,2}, \ldots\) are lumped together to produce \(f_Z(z_j)\), by the above result
\[ -f_Z(z_j) \log f_Z(z_j) \le -\sum_k f_X(x_{j,k}) \log f_X(x_{j,k}). \]
Summing over \(j\),
\[ H(Z) \le \sum_j \sum_k \{-f_X(x_{j,k}) \log f_X(x_{j,k})\}. \]
Since the terms in the iterated sums on the right are nonnegative, Theorem 3.4.3 can be applied to obtain
\[ H(Z) \le -\sum_i f_X(x_i) \log f_X(x_i) = H(X). \] ■

Since the definition of uncertainty applies to any discrete density, it makes sense to discuss the joint uncertainty of two random variables.

Definition 4.8 If \(X\) and \(Y\) are discrete random variables, the joint uncertainty or joint entropy is defined by
\[ H(X, Y) = -\sum_{i,j} f_{X,Y}(x_i, y_j) \log f_{X,Y}(x_i, y_j). \] ■

Theorem 4.6.4 If \(X\) and \(Y\) are discrete random variables, then \(H(X, Y) \le H(X) + H(Y)\) with equality holding if and only if \(X\) and \(Y\) are independent.

PROOF: Since
\[ H(X) = -\sum_i f_X(x_i) \log f_X(x_i) = -\sum_i \sum_j f_{X,Y}(x_i, y_j) \log f_X(x_i) \]
and
\[ H(Y) = -\sum_j f_Y(y_j) \log f_Y(y_j) = -\sum_i \sum_j f_{X,Y}(x_i, y_j) \log f_Y(y_j), \]
by Theorem 3.4.3,
\[ \begin{aligned} H(X) + H(Y) &= -\sum_i \sum_j f_{X,Y}(x_i, y_j) \bigl(\log f_X(x_i) + \log f_Y(y_j)\bigr) \\ &= -\sum_i \sum_j f_{X,Y}(x_i, y_j) \log f_X(x_i) f_Y(y_j). \end{aligned} \]
By Gibbs' inequality,
\[ \sum_{i,j} f_{X,Y}(x_i, y_j) \log \frac{f_X(x_i) f_Y(y_j)}{f_{X,Y}(x_i, y_j)} \le 0, \]
from which it follows that
\[ \begin{aligned} H(X, Y) &= -\sum_i \sum_j f_{X,Y}(x_i, y_j) \log f_{X,Y}(x_i, y_j) \\ &\le -\sum_i \sum_j f_{X,Y}(x_i, y_j) \log f_X(x_i) f_Y(y_j) \\ &= H(X) + H(Y). \end{aligned} \]
There is equality in this application of Gibbs' inequality if and only if \(f_{X,Y}(x_i, y_j) = f_X(x_i) f_Y(y_j)\) for all \(i\) and \(j\); i.e., if and only if \(X\) and \(Y\) are independent. ■

Since the concept of uncertainty applies to any discrete density function, we can define conditional uncertainty.

Definition 4.9
1. Let \(X\) and \(Y\) be discrete random variables.
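The conditional uncertainty or conditional entropy of \(Y\) given that \(X = x\) is defined by
\[ H(Y \mid X = x) = -\sum_j f_{Y|X}(y_j \mid x) \log f_{Y|X}(y_j \mid x). \]
2. The conditional uncertainty or conditional entropy of \(Y\) given \(X\) is defined by
\[ H(Y \mid X) = \sum_i f_X(x_i) H(Y \mid X = x_i); \]
i.e., \(H(Y \mid X)\) is the weighted average of the \(H(Y \mid X = x_i)\). ■

Note that
\[ \begin{aligned} H(Y \mid X) &= -\sum_i f_X(x_i) \sum_j f_{Y|X}(y_j \mid x_i) \log f_{Y|X}(y_j \mid x_i) \\ &= -\sum_{i,j} f_{X,Y}(x_i, y_j) \log f_{Y|X}(y_j \mid x_i). \end{aligned} \]

Theorem 4.6.5 If \(X\) and \(Y\) are discrete random variables, then
\[ H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y). \]

PROOF:
\[ \begin{aligned} H(X, Y) &= -\sum_{i,j} f_{X,Y}(x_i, y_j) \log f_{X,Y}(x_i, y_j) \\ &= -\sum_{i,j} f_{Y|X}(y_j \mid x_i) f_X(x_i) \log f_{Y|X}(y_j \mid x_i) f_X(x_i) \\ &= -\sum_{i,j} f_{Y|X}(y_j \mid x_i) f_X(x_i) \log f_{Y|X}(y_j \mid x_i) - \sum_{i,j} f_{Y|X}(y_j \mid x_i) f_X(x_i) \log f_X(x_i) \\ &= \sum_i f_X(x_i) H(Y \mid X = x_i) - \sum_i f_X(x_i) \log f_X(x_i) \\ &= H(Y \mid X) + H(X). \end{aligned} \]
The second equality of the theorem follows by interchanging the roles of \(X\) and \(Y\). ■

We will now examine a procedure for selecting a density function called the maximum entropy principle. Consider a random variable \(X\) that takes on values \(1, 2, \ldots, 6\) with unknown probabilities. What is known is that \(E[X] = 9/2\) rather than the 7/2 it would be if \(X\) took on the six values with equal probabilities. For \(i = 1, \ldots, 6\), let \(p_i = P(X = i)\). Can we choose \(p_1, \ldots, p_6\) so that \(E[X] = \sum_{i=1}^{6} i p_i = 9/2\), and how do we choose them? We might see if we can choose the \(p_1, \ldots, p_6\) to maximize the entropy
\[ H(X) = -\sum_{i=1}^{6} p_i \log p_i, \]
subject to the conditions that
\[ \sum_{i=1}^{6} p_i = 1 \tag{4.22} \]
and
\[ \sum_{i=1}^{6} i p_i = \frac{9}{2}. \tag{4.23} \]
Recall from the calculus that this is a maximization problem subject to two constraints, which can be dealt with by the method of Lagrange multipliers. Let
\[ L(p_1, \ldots, p_6) = H(X) - \lambda\Bigl(\sum_{i=1}^{6} p_i - 1\Bigr) - \mu\Bigl(\sum_{i=1}^{6} i p_i - \frac{9}{2}\Bigr). \]
In the calculation that follows, log is treated as the natural logarithm; changing the base multiplies \(H(X)\) by a positive constant and so does not change the maximizing density. Setting \((\partial/\partial p_i)L = 0\), \(i = 1, \ldots, 6\),
\[ \frac{\partial}{\partial p_i}\Bigl(p_i \log \frac{1}{p_i}\Bigr) - \lambda - \mu i = 0, \quad i = 1, \ldots, 6 \]
or
\[ -1 - \log p_i - \lambda - \mu i = 0, \quad i = 1, \ldots, 6. \]
Thus, \(\log p_i = -(1 + \lambda + \mu i)\) or
\[ p_i = e^{-(1 + \lambda + \mu i)}. \]
Let \(x = e^{-\mu}\) and \(y = e^{1 + \lambda}\). Then \(p_i = x^i / y\) and Equations 4.22 and 4.23 become
\[ \sum_{i=1}^{6} x^i = y \]
and
\[ \sum_{i=1}^{6} i x^i = \frac{9}{2} y. \]
It follows from the first of these two equations that \(x\) cannot be zero.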
Therefore,
\[ \sum_{i=1}^{6} i x^i = \frac{9}{2} \sum_{i=1}^{6} x^i. \]
Dividing by \(x\),
\[ \sum_{i=1}^{6} i x^{i-1} = \frac{9}{2} \sum_{i=1}^{6} x^{i-1}. \]
Writing out the terms of this equation and clearing of fractions,
\[ 3x^5 + x^4 - x^3 - 3x^2 - 5x - 7 = 0. \]
This equation has only one positive root \(x \approx 1.449254\). It follows that \(y \approx 26.663653\), and using the equation \(p_i = x^i / y\),
\[ p_1 = .05435, \quad p_2 = .07877, \quad p_3 = .11416, \quad p_4 = .16544, \quad p_5 = .23977, \quad p_6 = .34749. \]

EXERCISES 4.6
1. Calculate the entropy of a random variable \(X\) having density \(f(1) = 1/2\), \(f(2) = 1/4\), \(f(3) = 1/8\), \(f(4) = 1/16\), \(f(5) = 1/16\).
2. Let \(X\) be a random variable having a geometric density \(f_X(x) = (1/2)^x\), \(x = 1, 2, \ldots\). Calculate the entropy \(H(X)\). How much information is gained upon observing that \(X = 3\)?
3. Consider a random variable \(X\) that has a uniform density on \(\{1, 2, \ldots, 2m\}\). If successive pairs of integers are lumped together (i.e., 1 and 2, 3 and 4, etc., are lumped together), by how much is the uncertainty decreased?
4. Let \(X\) be selected at random from the set of integers \(\{1, 2, \ldots, n\}\). Given that \(X = x\), \(Y\) is then selected at random from the set of integers \(\{1, 2, \ldots, x\}\). Calculate \(H(X, Y)\) without using the joint density of \(X\) and \(Y\).
5. If \(X\) is the score on tossing two dice, calculate \(H(X)\).
6. If a card is drawn at random from a deck of 52 cards and a king of diamonds is observed, how much information has been gained?
7. If a card is drawn at random from a deck of 52 cards and you are told that a king has been observed, how much information has been gained?
8. If \(X\) and \(Y\) are random variables, use Lemma 4.6.1 to show that \(H(X \mid Y) \le H(X)\) by calculating \(H(X \mid Y) - H(X)\).
9. Consider a random variable \(X\) that takes on the values \(1, 2, \ldots, 6\). Given that \(E[X] = 4\), determine the density of \(X\) that maximizes \(H(X)\).

SUPPLEMENTAL READING LIST
1. T. M. Apostol (1957). Mathematical Analysis. Reading, Mass.: Addison-Wesley.
2. C. E. Shannon (1948). A Mathematical Theory of Communication. Monograph B-1598. Bell System Technical Journal.
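As a numerical companion to the maximum entropy example of Section 4.6 (values \(1, \ldots, 6\) with \(E[X] = 9/2\)), the root of the polynomial and the resulting density can be recomputed in a few lines. This is a minimal sketch rather than the method used in the text: the root is located by plain bisection on the interval \([1, 2]\), an interval chosen here by inspection because the polynomial changes sign on it, and the function names are hypothetical.

```python
def constraint_polynomial(x: float) -> float:
    """3x^5 + x^4 - x^3 - 3x^2 - 5x - 7, written as sum_i (2i - 9) x^(i-1)."""
    return sum((2 * i - 9) * x ** (i - 1) for i in range(1, 7))


def bisect_root(f, lo: float, hi: float, tol: float = 1e-12) -> float:
    """Locate a root of f in [lo, hi] by bisection; f must change sign there."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid              # sign change lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # sign change lies in [mid, hi]
    return (lo + hi) / 2


x = bisect_root(constraint_polynomial, 1.0, 2.0)
y = sum(x ** i for i in range(1, 7))            # normalizing constant, Equation 4.22
p = [x ** i / y for i in range(1, 7)]           # p_i = x^i / y
mean = sum(i * p_i for i, p_i in enumerate(p, start=1))
```

The computed `x`, `y`, and `p` should agree with the values quoted in the text up to the rounding shown there, and `mean` recovers the constraint \(E[X] = 9/2\) to within floating-point error. The same sketch can be adapted to Exercise 4.6.9 by replacing the coefficients \(2i - 9\) with \(i - 4\).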
5 STOCHASTIC PROCESSES

5.1 INTRODUCTION

The topics discussed in this chapter have been selected not only to illustrate the concepts introduced in the previous chapters but also to expose the reader to the breadth and depth of applications of probability theory. In this elementary treatment, we will only scratch the surface of these topics. Topics within this chapter are independent, and subsequent chapters are independent of the topics of this chapter.

We first take up a model for randomly evolving processes having the property that probability statements about future developments given the past history depend only upon the immediate past and not the remote past. The section on random walks was chosen primarily because the topic involves more applications of generating functions and difference equations. After random walks comes a section on branching processes, which were developed as a model for survival of family names and nuclear chain reactions. Because some of the great successes of probability theory have to do with prediction theory and communication theory in general, the chapter concludes with an application to prediction theory.

Each of these topics could be expanded to book length and has been. Having learned some of the techniques for dealing with such topics, the reader can pursue them in greater depth in the book by Karlin and Taylor listed in the Supplemental Readings at the end of the chapter. More substantial applications to engineering can be found in the book by Helstrom. The section on prediction theory just barely scratches the surface of this subject. An excellent additional source is the book by Kendall and Ord.

5.2 MARKOV CHAINS

Several of the examples discussed in the previous chapters share a common structure that will be elaborated upon in this section. Consider a countable set \(S = \{s_1, s_2, \ldots\}\), finite or infinite, called a state space and consisting of objects called states.
Since we can encode the states by giving \(s_j\) the label \(j\), we can assume that \(S = \{1, 2, \ldots, N\}\) for some \(N\) or \(S = \{1, 2, \ldots\}\).

Definition 5.1 A sequence of random variables \(\{X_n\}_{n=0}^{\infty}\) is called a Markov chain if for all \(n \ge 1\) and \(j_0, j_1, \ldots, j_n \in S\),
\[ P(X_n = j_n \mid X_0 = j_0, \ldots, X_{n-1} = j_{n-1}) = P(X_n = j_n \mid X_{n-1} = j_{n-1}). \tag{5.1} \] ■

The significance of a Markov chain lies in the fact that if \((X_n = j_n)\) is a future event, then the conditional probability of this event given the past history \((X_0 = j_0, \ldots, X_{n-1} = j_{n-1})\) depends only upon the immediate past \((X_{n-1} = j_{n-1})\) and not upon the remote past \((X_0 = j_0, \ldots, X_{n-2} = j_{n-2})\).

Let \(\{X_n\}_{n=0}^{\infty}\) be a Markov chain. If \(X_n = j\), we say that the chain is in the state \(j\) at time \(n\). The probabilities
\[ p_{i,j}^{n-1,n} = P(X_n = j \mid X_{n-1} = i), \quad n \ge 1,\ i, j \in S \]
are called one-step transition probabilities and depend upon the time that a transition from \(i\) to \(j\) takes place. If \(P(X_n = j \mid X_{n-1} = i)\) is defined and is independent of \(n\), the probabilities
\[ p_{i,j} = P(X_n = j \mid X_{n-1} = i), \quad n \ge 1,\ i, j \in S \]
are called stationary transition probabilities; if the conditional probability is not defined, we put \(p_{i,j} = 0\). The numbers \(p_{i,j}\) can be displayed in matrix form:
\[ \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots \\ p_{2,1} & p_{2,2} & \cdots \\ \vdots & \vdots & \\ p_{i,1} & p_{i,2} & \cdots \\ \vdots & \vdots & \end{bmatrix} \]
If \(|S| = N\), this is an \(N \times N\) matrix; if \(S\) is infinite, there are an infinite number of rows and columns. The matrix is customarily symbolized by \(P = [p_{i,j}]\). The \(i\)th row of \(P\) is the conditional density of \(X_n\) given that \(X_{n-1} = i\). Clearly, each \(p_{i,j} \ge 0\). Since the union of the disjoint events \((X_n = j)\), \(j = 1, 2, \ldots\), is \(\Omega\),
\[ \sum_j p_{i,j} = \sum_j P(X_n = j \mid X_{n-1} = i) = P(\Omega \mid X_{n-1} = i) = 1. \]
A matrix \(P = [p_{i,j}]\) with the last two properties is called a stochastic matrix. The density of \(X_0\) is denoted by \(\pi_0\) (i.e., \(\pi_0(j) = f_{X_0}(j)\), \(j \in S\)) and is called the initial density.

Suppose \(n \ge 1\) and \(j_0, j_1, \ldots, j_n \in S\). If the Markov chain \(\{X_n\}_{n=0}^{\infty}\) has stationary transition probabilities, then
\[ \begin{aligned} P(X_0 = j_0, \ldots, X_n = j_n) &= P(X_n = j_n \mid X_0 = j_0, \ldots, X_{n-1} = j_{n-1})\, P(X_0 = j_0, \ldots, X_{n-1} = j_{n-1}) \\ &= P(X_n = j_n \mid X_{n-1} = j_{n-1})\, P(X_0 = j_0, \ldots, X_{n-1} = j_{n-1}) \\ &= p_{j_{n-1},j_n}\, P(X_{n-1} = j_{n-1} \mid X_0 = j_0, \ldots, X_{n-2} = j_{n-2})\, P(X_0 = j_0, \ldots, X_{n-2} = j_{n-2}) \\ &= p_{j_{n-1},j_n}\, p_{j_{n-2},j_{n-1}} \times \cdots \times p_{j_0,j_1}\, P(X_0 = j_0) \\ &= \pi_0(j_0)\, p_{j_0,j_1} \times \cdots \times p_{j_{n-1},j_n}. \end{aligned} \tag{5.2} \]
It should be noted that if at some stage the given event has zero probability, then the final result is still true since both sides are then zero.

The last equation provides the means for constructing Markov chains on a state space \(S = \{1, 2, \ldots\}\). Given a stochastic matrix \(P = [p_{i,j}]\) and a density function \(\pi_0\) on \(S\), a probability space \((\Omega, \mathcal{F}, P)\) and random variables \(\{X_n\}_{n=0}^{\infty}\) can be constructed so that the probabilities \(P(X_0 = j_0, \ldots, X_n = j_n)\) are defined by Equation 5.2.

Equation 5.2 can be used to reformulate the definition of a Markov chain. If \(1 \le m < n\) and \(j_0, j_1, \ldots, j_n \in S\), then
\[ P(X_{m+1} = j_{m+1}, \ldots, X_n = j_n \mid X_0 = j_0, \ldots, X_m = j_m) = P(X_{m+1} = j_{m+1}, \ldots, X_n = j_n \mid X_m = j_m), \tag{5.3} \]
i.e., the probability of any future event given the past depends only upon the immediate past \((X_m = j_m)\) and not upon the remote past \((X_0 = j_0, \ldots, X_{m-1} = j_{m-1})\). To see this, first consider the left side of the equation. By Equation 5.2,
\[ \begin{aligned} P(X_{m+1} = j_{m+1}, \ldots, X_n = j_n \mid X_0 = j_0, \ldots, X_m = j_m) &= \frac{\pi_0(j_0)\, p_{j_0,j_1} \times \cdots \times p_{j_{m-1},j_m} \times p_{j_m,j_{m+1}} \times \cdots \times p_{j_{n-1},j_n}}{\pi_0(j_0)\, p_{j_0,j_1} \times \cdots \times p_{j_{m-1},j_m}} \\ &= p_{j_m,j_{m+1}} \times \cdots \times p_{j_{n-1},j_n}. \end{aligned} \]
Since
\[ \begin{aligned} P(X_m = j_m, \ldots, X_n = j_n) &= \sum_{j_0,\ldots,j_{m-1}} \pi_0(j_0)\, p_{j_0,j_1} \times \cdots \times p_{j_{n-1},j_n} \\ &= \Bigl( \sum_{j_0,\ldots,j_{m-1}} \pi_0(j_0)\, p_{j_0,j_1} \times \cdots \times p_{j_{m-1},j_m} \Bigr) p_{j_m,j_{m+1}} \times \cdots \times p_{j_{n-1},j_n} \end{aligned} \]
and
\[ P(X_m = j_m) = \sum_{j_0,\ldots,j_{m-1}} \pi_0(j_0)\, p_{j_0,j_1} \times \cdots \times p_{j_{m-1},j_m}, \]
it follows that
\[ P(X_m = j_m, \ldots, X_n = j_n) = P(X_m = j_m)\, p_{j_m,j_{m+1}} \times \cdots \times p_{j_{n-1},j_n}. \]
Dividing by \(P(X_m = j_m)\),
\[ P(X_{m+1} = j_{m+1}, \ldots, X_n = j_n \mid X_m = j_m) = p_{j_m,j_{m+1}} \times \cdots \times p_{j_{n-1},j_n}. \]
Thus, both sides of Equation 5.3 are equal to the product \(p_{j_m,j_{m+1}} \times \cdots \times p_{j_{n-1},j_n}\). This establishes Equation 5.3.

EXAMPLE 5.1 (Binary Information Source) A Markov information source is a sequential mechanism for which the chance that a certain symbol will be produced may depend upon the preceding symbol.
Suppose the possible symbols are 0 and 1. If at some stage a 0 is produced, then at the next stage a 1 will be produced with probability \(p\) and a 0 will be produced with probability \(1 - p\); if a 1 is produced, at the next stage a 0 will be produced with probability \(q\) and a 1 will be produced with probability \(1 - q\). Since the \(p\) and \(q\) do not depend upon the number of times a symbol has been produced, this experiment can be described by the stationary transition matrix (rows and columns indexed by the symbols 0 and 1, in that order)
\[ P = \begin{bmatrix} 1 - p & p \\ q & 1 - q \end{bmatrix}. \] ■

EXAMPLE 5.2 (Ehrenfest Diffusion Model) In 1907, P. and T. Ehrenfest described a conceptual experiment for the movement of \(N\) molecules between two containers A and B. The state of the system at any given time is the number of molecules in A so that \(S = \{0, 1, \ldots, N\}\). At any given time, a molecule is chosen at random from among the \(N\) and moved to the other container. This chance mechanism is repeated indefinitely. Since the mechanism does not depend upon how many changes have occurred, the process has stationary transition probabilities given by \(p_{i,i-1} = i/N\), \(p_{i,i+1} = 1 - (i/N)\) and \(p_{i,j} = 0\) otherwise. In this case,
\[ P = \begin{bmatrix} 0 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 1/N & 0 & 1 - (1/N) & 0 & \cdots & 0 & 0 & 0 \\ \vdots & & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 - (1/N) & 0 & 1/N \\ 0 & 0 & 0 & 0 & \cdots & 0 & 1 & 0 \end{bmatrix}. \] ■

EXAMPLE 5.3 (Random Walk on the Integers) Consider the space \(S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}\). Let \(\{Y_j\}_{j=0}^{\infty}\) be a sequence of independent random variables such that \(Y_0\) has a specified density \(\pi_0\) and \(P(Y_j = 1) = p\), \(P(Y_j = -1) = q\) where \(j \ge 1\), \(p, q > 0\), \(p + q = 1\). Let \(X_0 = Y_0\) and, for \(n \ge 1\), let \(X_n = \sum_{j=0}^{n} Y_j\). Then the sequence \(\{X_n\}_{n=0}^{\infty}\) is a Markov chain with stationary transition probabilities
\[ p_{i,j} = P(X_n = j \mid X_{n-1} = i) = \begin{cases} p & \text{if } j = i + 1 \\ q & \text{if } j = i - 1 \\ 0 & \text{otherwise.} \end{cases} \]
To show that \(\{X_n\}_{n=0}^{\infty}\) is a Markov chain, consider \(P(X_n = j \mid X_0 = j_0, \ldots, X_{n-1} = j_{n-1})\). Note that \(X_0 = j_0, X_1 = j_1, \ldots, X_{n-1} = j_{n-1}\) if and only if \(Y_0 = j_0, Y_1 = j_1 - j_0, \ldots, Y_{n-1} = j_{n-1} - j_{n-2}\). By independence,
\[ \begin{aligned} P(X_n = j \mid X_0 = j_0, X_1 = j_1, \ldots, X_{n-1} = j_{n-1}) &= P\Bigl(\sum_{i=0}^{n} Y_i = j \Bigm| Y_0 = j_0, Y_1 = j_1 - j_0, \ldots, Y_{n-1} = j_{n-1} - j_{n-2}\Bigr) \\ &= \frac{P(Y_0 = j_0, \ldots, Y_{n-1} = j_{n-1} - j_{n-2}, Y_n = j - j_{n-1})}{P(Y_0 = j_0, \ldots, Y_{n-1} = j_{n-1} - j_{n-2})} \\ &= P(Y_n = j - j_{n-1}). \end{aligned} \]
Also by independence and Lemma 3.3.3,
\[ \begin{aligned} P(X_n = j \mid X_{n-1} = j_{n-1}) &= P\Bigl(\sum_{i=0}^{n} Y_i = j \Bigm| \sum_{i=0}^{n-1} Y_i = j_{n-1}\Bigr) \\ &= \frac{P\bigl(\sum_{i=0}^{n-1} Y_i = j_{n-1}, Y_n = j - j_{n-1}\bigr)}{P\bigl(\sum_{i=0}^{n-1} Y_i = j_{n-1}\bigr)} \\ &= P(Y_n = j - j_{n-1}). \end{aligned} \]
Therefore,
\[ P(X_n = j \mid X_0 = j_0, \ldots, X_{n-1} = j_{n-1}) = P(X_n = j \mid X_{n-1} = j_{n-1}) \]
and the sequence \(\{X_n\}_{n=0}^{\infty}\) is a Markov chain. Since
\[ p_{i,j} = P(X_n = j \mid X_{n-1} = i) = P(Y_n = j - i) = \begin{cases} p & \text{if } j = i + 1 \\ q & \text{if } j = i - 1 \\ 0 & \text{otherwise,} \end{cases} \]
the chain has stationary transition probabilities. The chain \(\{X_n\}_{n=0}^{\infty}\) is interpreted as follows. A particle starts off at an initial position \(j_0\) in accordance with the initial density \(\pi_0\). The particle will then jump to \(j_0 + 1\) with probability \(p\) or jump to \(j_0 - 1\) with probability \(q\); in general, if after \(n\) jumps it is at \(i\), it will then jump to \(i + 1\) with probability \(p\) or to \(i - 1\) with probability \(q\), independently of \(n\). ■

Let \(\{X_n\}_{n=0}^{\infty}\) be a Markov chain with stationary transition probabilities. The conditional probabilities
\[ p_{i,j}(n) = P(X_{m+n} = j \mid X_m = i) \]
are independent of \(m\) (see Exercise 5.2.11) and are called \(n\)-step transition probabilities. More generally, if \(m, n \ge 1\) and \(j_1, \ldots, j_n \in S\), the conditional probabilities
\[ P(X_{m+1} = j_1, \ldots, X_{m+n} = j_n \mid X_m = i) \]
are independent of \(m\). This property of a Markov chain with stationary transition probabilities is called the stationarity property. The property simply means that an integer \(m \ge 1\) can be subtracted from all the indices appearing in a conditional probability; e.g.,
\[ P(X_4 = j_4, X_5 = j_5, X_6 = j_6 \mid X_3 = j_3) = P(X_1 = j_4, X_2 = j_5, X_3 = j_6 \mid X_0 = j_3). \]
Letting \(P(n) = [p_{i,j}(n)]\), \(P(n)\) is called the \(n\)-step transition matrix. We define
\[ p_{i,j}(0) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j. \end{cases} \]
If \(n = 1\), then \(p_{i,j}(1) = P(X_{m+1} = j \mid X_m = i) = p_{i,j}\), and therefore \(P(1) = P\).

Theorem 5.2.1 (Chapman-Kolmogorov Equation) For all \(m, n \ge 1\) and \(i, j \in S\),
\[ p_{i,j}(m + n) = \sum_k p_{i,k}(m)\, p_{k,j}(n). \tag{5.4} \]

PROOF: We can assume that \(P(X_0 = i) > 0\), because otherwise \(p_{i,j}(m + n) = P(X_{m+n} = j \mid X_0 = i) = 0\), \(p_{i,k}(m) = P(X_m = k \mid X_0 = i) = 0\), and both sides are zero. Suppose \(1 \le m < n\), \(j_0, \ldots, j_n \in S\), and \(P(X_0 = j_0) > 0\). Then it follows from Equation 5.3 that
\[ \begin{aligned} P(X_0 = j_0, \ldots, X_m = j_m, \ldots, X_n = j_n) &= P(X_{m+1} = j_{m+1}, \ldots, X_n = j_n \mid X_0 = j_0, \ldots, X_m = j_m) \times P(X_0 = j_0, \ldots, X_m = j_m) \\ &= P(X_{m+1} = j_{m+1}, \ldots, X_n = j_n \mid X_m = j_m) \times P(X_1 = j_1, \ldots, X_m = j_m \mid X_0 = j_0)\, P(X_0 = j_0). \end{aligned} \]
Summing over \(j_1, \ldots, j_{m-1}, j_{m+1}, \ldots, j_{n-1}\),
\[ P(X_0 = j_0, X_m = j_m, X_n = j_n) = P(X_n = j_n \mid X_m = j_m)\, P(X_m = j_m \mid X_0 = j_0)\, P(X_0 = j_0). \tag{5.5} \]
Replacing \(n\) by \(m + n\), \(j_0\) by \(i\), \(j_m\) by \(k\), and \(j_n\) by \(j\) in Equation 5.5,
\[ \begin{aligned} p_{i,j}(m + n) &= P(X_{m+n} = j \mid X_0 = i) \\ &= \sum_k P(X_{m+n} = j, X_m = k \mid X_0 = i) \\ &= \sum_k \frac{P(X_{m+n} = j, X_m = k, X_0 = i)}{P(X_0 = i)} \\ &= \sum_k P(X_{m+n} = j \mid X_m = k)\, P(X_m = k \mid X_0 = i) \\ &= \sum_k p_{k,j}(n)\, p_{i,k}(m). \end{aligned} \] ■

Equation 5.4 can be interpreted in terms of matrix multiplication. Noting that the first factors \(p_{i,1}(m), p_{i,2}(m), \ldots\) constitute the \(i\)th row of \(P(m)\) and the second factors \(p_{1,j}(n), p_{2,j}(n), \ldots\) constitute the \(j\)th column of \(P(n)\), the sum on the right of Equation 5.4 is the sum of the products of the elements of the \(i\)th row of \(P(m)\) with the corresponding elements of the \(j\)th column of \(P(n)\); but this is just the definition of the element in the \(i\)th row and \(j\)th column of the product of the two matrices \(P(m)\) and \(P(n)\). In terms of matrix multiplication, the Chapman-Kolmogorov equation simply says that \(P(m + n) = P(m)P(n)\). Since \(P(1) = P\), \(P(n + 1) = P(n)P\) for all \(n \ge 1\); iterating this result, we see that \(P(n) = P^n\) and that \(P(n)\) is simply the \(n\)th power of the transition matrix \(P\).

FIGURE 5.1 Stopping and restarting a chain.

Note also that \(P(n)\) is a stochastic matrix for each \(n \ge 1\). Since \(P(1) = P\), the statement is true for \(n = 1\).
Assuming the statement is true for $n - 1$, it follows from Equation 5.4 and Theorem 3.4.3 that

$$
\sum_j p_{i,j}(n) = \sum_j \sum_k p_{i,k}(n-1)\,p_{k,j}(1) = \sum_k p_{i,k}(n-1) \sum_j p_{k,j}(1) = \sum_k p_{i,k}(n-1) = 1
$$

and that $P(n)$ has nonnegative entries. It follows from the principle of mathematical induction that $P(n)$ is a stochastic matrix for every $n \ge 1$.

Figure 5.1 is a graphical illustration of the Chapman-Kolmogorov equation. If at some time the Markov chain is in the state $i$, the probability of going from $i$ to $j$ in $m + n$ steps can be obtained by stopping the chain after $m$ steps in state $k$, restarting the chain with initial state $k$, ending up in state $j$ after $n$ additional steps, and summing over $k$.

In the case of the Ehrenfest diffusion model, Example 5.2, it is reasonable to ask how the molecules will be distributed between the two containers after much time has elapsed; i.e., what happens to $p_{i,j}(n)$, the probability that starting from state $i$ the chain will be in state $j$ after $n$ transitions, as $n \to \infty$. In general, determining the limiting behavior of the $p_{i,j}(n)$ can be difficult. To keep things as simple as possible in this chapter, we will limit the remaining discussion to Markov chains having a finite state space $S = \{1, 2, \ldots, r\}$ and $r \times r$ transition matrix $P = [p_{i,j}]$. We assume that $r \ge 2$ to avoid the trivial case of a chain with just one state.

Theorem 5.2.2 If there is an integer $N$ such that $p_{i,j}(N) > 0$ for $1 \le i, j \le r$, then

$$
\lim_{n \to \infty} p_{i,j}(n) = \pi_j, \qquad j = 1, \ldots, r
$$

exists and is independent of $i$.

PROOF: We first prove the result assuming that $N = 1$; i.e., $p_{i,j} > 0$, $1 \le i, j \le r$. If $r = 2$ and

$$
P = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix},
$$

then $P^n = P$ for all $n \ge 1$ and the assertion is true with $\pi_1 = \pi_2 = 1/2$. We can therefore assume that

$$
\delta = \min_{1 \le i,j \le r} p_{i,j} < \frac{1}{2}.
$$

Consider a fixed $j$ and define

$$
m_n = \min_{1 \le i \le r} p_{i,j}(n), \qquad M_n = \max_{1 \le i \le r} p_{i,j}(n).
$$

If we can show that the sequence $\{m_n\}$ increases, the sequence $\{M_n\}$ decreases, and $\lim_{n \to \infty}(M_n - m_n) = 0$, then there would be a number $\pi_j$ such that

$$
\lim_{n \to \infty} M_n = \lim_{n \to \infty} m_n = \pi_j,
$$

and since $m_n \le p_{i,j}(n) \le M_n$, it would follow that

$$
\lim_{n \to \infty} p_{i,j}(n) = \pi_j, \qquad j = 1, \ldots, r.
$$

Since $m_n$ is the minimum of a finite collection of numbers, it must be one of them, say $m_n = p_{k_n,j}(n)$. By Equation 5.4,

$$
\begin{aligned}
M_{n+1} = \max_{1 \le i \le r} p_{i,j}(n+1)
&= \max_{1 \le i \le r} \sum_{k=1}^{r} p_{i,k}\,p_{k,j}(n) \\
&= \max_{1 \le i \le r} \Bigl( p_{i,k_n}\,p_{k_n,j}(n) + \sum_{k \ne k_n} p_{i,k}\,p_{k,j}(n) \Bigr) \\
&\le \max_{1 \le i \le r} \Bigl( p_{i,k_n}\,m_n + M_n \sum_{k \ne k_n} p_{i,k} \Bigr) \\
&= \max_{1 \le i \le r} \bigl( p_{i,k_n}\,m_n + M_n(1 - p_{i,k_n}) \bigr) \\
&= \max_{1 \le i \le r} \bigl( M_n - (M_n - m_n)\,p_{i,k_n} \bigr) \\
&= M_n - (M_n - m_n) \min_{1 \le i \le r} p_{i,k_n} \\
&\le M_n - (M_n - m_n)\,\delta.
\end{aligned}
$$

Therefore, $M_{n+1} \le M_n - (M_n - m_n)\delta \le M_n$, and the sequence $\{M_n\}$ is decreasing. A similar argument shows that the sequence $\{m_n\}$ is increasing, and $m_{n+1} \ge m_n + (M_n - m_n)\delta \ge m_n$. By combining the last two inequalities,

$$
M_{n+1} - m_{n+1} \le (M_n - m_n) - 2\delta(M_n - m_n) = (1 - 2\delta)(M_n - m_n).
$$

Since $M_1 - m_1 \le 1$,

$$
0 \le M_n - m_n \le (1 - 2\delta)^{n-1}, \qquad n \ge 1,
$$

and therefore $M_n - m_n \to 0$ as $n \to \infty$. Thus, $\lim_{n \to \infty} p_{i,j}(n) = \pi_j$ exists. Since $\lim_{n \to \infty} p_{i,j}(n) = \lim_{n \to \infty} M_n$ and the latter does not depend on $i$, $\pi_j$ is independent of $i$.

Suppose now that $N$ is a positive integer for which $p_{i,j}(N) > 0$ for $1 \le i, j \le r$, let $\hat{P} = [\hat{p}_{i,j}] = P^N = [p_{i,j}(N)]$, and let $\hat{P}(n) = \hat{P}^n = P^{nN} = [p_{i,j}(nN)]$. Since $\hat{P}(1) = \hat{P}$ is a stochastic matrix with $\hat{p}_{i,j}(1) > 0$, $1 \le i, j \le r$, the first part of the proof implies that

$$
\lim_{n \to \infty} \hat{p}_{i,j}(n) = \lim_{n \to \infty} p_{i,j}(nN) = \pi_j
$$

exists for $1 \le i, j \le r$, independently of $i$. This means that given $\varepsilon > 0$, for each pair $i, j$ there is an $N_{i,j} \ge 1$ such that $|p_{i,j}(nN) - \pi_j| < \varepsilon$ whenever $n \ge N_{i,j}$. Letting $N_\varepsilon = \max_{1 \le i,j \le r} N_{i,j}$, $|p_{i,j}(nN) - \pi_j| < \varepsilon$ simultaneously for all $i, j$ with $1 \le i, j \le r$ whenever $n \ge N_\varepsilon$. Any positive integer $n$ can be written $n = k(n)N + \ell(n)$ where $\lim_{n \to \infty} k(n) = +\infty$ and $0 \le \ell(n) < N$. By Equation 5.4,

$$
p_{i,j}(n) = \sum_k p_{i,k}(\ell(n))\,p_{k,j}(k(n)N).
$$

Since $\sum_k p_{i,k}(\ell(n)) = 1$,

$$
p_{i,j}(n) - \pi_j = \sum_k p_{i,k}(\ell(n))\bigl(p_{k,j}(k(n)N) - \pi_j\bigr).
$$

Since $\lim_{n \to \infty} k(n) = +\infty$, there is an $M \ge 1$ such that $k(n) \ge N_\varepsilon$ whenever $n \ge M$. Thus, for $n \ge M$, $k(n) \ge N_\varepsilon$ and

$$
|p_{i,j}(n) - \pi_j| \le \sum_k p_{i,k}(\ell(n))\,|p_{k,j}(k(n)N) - \pi_j| < \varepsilon \sum_k p_{i,k}(\ell(n)) = \varepsilon.
$$

Therefore,

$$
\lim_{n \to \infty} p_{i,j}(n) = \pi_j, \qquad 1 \le j \le r,
$$

independently of $i$. ■

Definition 5.2 (1) The state $j$ can be reached from the state $i$ if there is a positive integer $n$ such that $p_{i,j}(n) > 0$; (2) the transition matrix $P$ or chain $\{X_n\}_{n=0}^{\infty}$ is irreducible if each state can be reached from every other state. ■

Consider a Markov chain with irreducible transition matrix $P = [p_{i,j}]$, and suppose there is a positive integer $N$ such that $p_{i,j}(N) > 0$ for all $i, j \in S$. Then $\nu_j = \lim_{n \to \infty} p_{i,j}(n)$ is defined for all $i, j \in S$, independently of $i$, according to Theorem 5.2.2. Taking $n = 1$ and letting $m \to \infty$ in the Chapman-Kolmogorov equation,

$$
\nu_j = \sum_{k=1}^{r} \nu_k\,p_{k,j}, \qquad j = 1, \ldots, r. \tag{5.6}
$$

The $\nu_j$ are clearly nonnegative. Since $\sum_{j=1}^{r} p_{i,j}(n) = 1$ and

$$
\sum_{j=1}^{r} \nu_j = \sum_{j=1}^{r} \lim_{n \to \infty} p_{i,j}(n) = \lim_{n \to \infty} \sum_{j=1}^{r} p_{i,j}(n) = 1,
$$

$\{\nu_j\}_{j=1}^{r}$ is a probability density called the asymptotic distribution or limiting distribution. It is left as an exercise to show that the density $\{\nu_j\}_{j=1}^{r}$ satisfying Equation 5.6 is unique. According to Equation 5.6, the determination of the $\nu_j$ amounts to solving a system of linear equations.

EXAMPLE 5.4 Consider a Markov chain with state space $S = \{1, 2, 3\}$ and stationary transition matrix

$$
P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1 & 0 & 0 \\ 1/2 & 1/2 & 0 \end{pmatrix}.
$$

In this case, Equation 5.6 becomes

$$
\begin{aligned}
\nu_1 &= \nu_2 + \tfrac{1}{2}\nu_3 \\
\nu_2 &= \tfrac{1}{2}\nu_1 + \tfrac{1}{2}\nu_3 \\
\nu_3 &= \tfrac{1}{2}\nu_1.
\end{aligned}
$$

These equations are not linearly independent, and one of them must be discarded and replaced by the equation $\nu_1 + \nu_2 + \nu_3 = 1$. If this is done and the resulting equations are solved, we find

$$
\nu_1 = \frac{4}{9}, \qquad \nu_2 = \frac{1}{3}, \qquad \nu_3 = \frac{2}{9}. \qquad ■
$$

EXAMPLE 5.5 Consider the Ehrenfest diffusion model with $S = \{0, 1, \ldots, N\}$, $p_{i,i-1} = i/N$, $p_{i,i+1} = 1 - (i/N)$, and $p_{i,j} = 0$ otherwise. In this case, Equation 5.6 reads

$$
\nu_j = \sum_{k=0}^{N} \nu_k\,p_{k,j}, \qquad j = 0, \ldots, N.
$$
The equation corresponding to $j = 0$ is

$$
\nu_0 = \frac{1}{N}\,\nu_1,
$$

and the equation corresponding to $j = N$ is

$$
\nu_N = \frac{1}{N}\,\nu_{N-1}.
$$

For $1 \le j \le N - 1$,

$$
\nu_j = \frac{N - j + 1}{N}\,\nu_{j-1} + \frac{j + 1}{N}\,\nu_{j+1}.
$$

Disregarding the $N$ in the denominators on the right side, the equations suggest that $\nu_j$ has a form that allows the $j + 1$ and $N - j + 1$ coefficients to be cancelled. This suggests that the $\nu_j$ have the form $\binom{N}{j}$; but since the sum of all the $\nu_j$ is equal to 1 and $\sum_{j=0}^{N} \binom{N}{j} = 2^N$, the $\nu_j$ must have the form

$$
\nu_j = \frac{1}{2^N}\binom{N}{j}, \qquad j = 0, \ldots, N.
$$

It is easily verified that the $\nu_j$ given by this equation do in fact satisfy the above equations and represent the asymptotic distribution. The interpretation is that whatever the initial number of molecules in container A, ultimately each of the $N$ molecules is assigned to one of the two containers with equal probabilities. ■

EXERCISES 5.2

1. A gambler and his adversary have a combined capital of $N$ units. In successive wagers, the gambler can win one unit with probability $p$ or lose one unit with probability $q = 1 - p$. Describe the transition matrix.

2. Consider a Markov chain with state space $S = \{1, 2, 3, 4\}$ and transition matrix

P = 1/2 0 0 0 0 1/3 0 1/4 0 0 1 3/4 1/2 2/3 0

Find the smallest integer $n$ such that $p_{i,j}(n) > 0$, $i, j = 1, 2, 3, 4$, and determine the asymptotic distribution of the chain.

3. Determine the asymptotic distribution of the binary information source of Example 5.1.

4. Consider a Markov chain with state space $S = \{1, 2, 3\}$ and transition matrix

$$
P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{pmatrix}.
$$

Find the smallest integer $n$ for which $p_{i,j}(n) > 0$ for $i, j = 1, 2, 3$ and find the asymptotic distribution of the chain.

5. Let $P = [p_{i,j}]$ be an $N \times N$ transition matrix and suppose that $\mu_j = \sum_{k} \mu_k\,p_{k,j}$, $j = 1, \ldots, N$. Show that for each $n \ge 1$,

$$
\mu_j = \sum_{k=1}^{N} \mu_k\,p_{k,j}(n), \qquad j = 1, \ldots, N.
$$

6. An $N \times N$ transition matrix $P = [p_{i,j}]$ is doubly stochastic if $\sum_{i=1}^{N} p_{i,j} = 1$ for $j = 1, \ldots, N$. Assuming that the limits exist, show that

$$
\nu_j = \lim_{n \to \infty} p_{i,j}(n) = \frac{1}{N}, \qquad j = 1, \ldots, N.
$$

7. Consider a Markov chain with state space $S = \{1, 2, 3, 4, 5\}$ and transition matrix

$$
P = \begin{pmatrix}
1/3 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/3 & 1/6 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/3 & 1/6 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/3 & 1/6 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/3
\end{pmatrix}.
$$

Find the asymptotic distribution of the chain.

8. Let $P = [p_{i,j}]$ be an $N \times N$ transition matrix for which there is an $n \ge 1$ such that $p_{i,j}(n) > 0$ for all $i, j = 1, \ldots, N$. Let $\{\mu_j\}_{j=1}^{N}$ be a probability density that satisfies the equation

$$
\mu_j = \sum_{k=1}^{N} \mu_k\,p_{k,j}, \qquad j = 1, \ldots, N.
$$

Show that $\{\mu_j\}_{j=1}^{N}$ is unique.

9. Consider a Markov chain with state space $S = \{1, 2, 3\}$ and transition matrix

$$
P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}.
$$

Calculate $P(2n)$ and $P(2n - 1)$ for all $n \ge 1$ and draw conclusions about the asymptotic distribution of the chain.

10. $N$ red balls and $N$ white balls are placed into two containers A and B so that each contains $N$ balls. The number of red balls in A is the state of a system. At each step of a continuing process, a ball is selected at random from each container and transferred to the other container. Determine the transition matrix $P$ and find the asymptotic distribution.

11. Let $\{X_n\}_{n=0}^{\infty}$ be a Markov chain with stationary transition probabilities. If $m, n \ge 1$ and $i, j \in S$, show that $P(X_{m+n} = j \mid X_m = i)$ is independent of $m$.

The next two problems require mathematical software such as Mathematica or Maple V.

12. Consider a Markov chain with state space $S = \{1, 2, 3, 4, 5\}$ and transition matrix

$$
P = \begin{pmatrix}
.1 & .2 & .1 & .5 & .1 \\
.2 & .2 & .3 & .1 & .2 \\
.2 & .2 & .2 & .2 & .2 \\
.3 & .3 & .2 & .2 & 0 \\
.3 & .3 & .3 & 0 & .1
\end{pmatrix}.
$$

Find the asymptotic distribution of the chain.

13. Consider a Markov chain with state space $S = \{1, 2, 3, 4, 5, 6\}$ and transition matrix

$$
P = \begin{pmatrix}
0 & .12 & .38 & 0 & .40 & .10 \\
.11 & 0 & .29 & .20 & 0 & .40 \\
.10 & .10 & 0 & .15 & .25 & .40 \\
.40 & 0 & 0 & 0 & .30 & .30 \\
.05 & .05 & .30 & .40 & 0 & .20 \\
0 & 0 & 0 & .30 & .30 & .40
\end{pmatrix}.
$$

Find the asymptotic distribution of the chain.

5.3 RANDOM WALKS

The probability model for the gambler's ruin problem consists of an infinite sequence of independent random variables $\{Y_j\}_{j=1}^{\infty}$ with $P(Y_j = 1) = p$ and $P(Y_j = -1) = q = 1 - p$ where $0 < p < 1$. Fixing integers $x$ and $a$ with $1 \le x \le a - 1$, $S_n = x + Y_1 + \cdots + Y_n$ is the gambler's capital as of the $n$th play of a game, with $x$ representing the gambler's initial capital. The $S_n$ can also serve as a model for a particle taking a random walk on the integer points of the line. We interpret $x$ as the initial position of the particle, and starting at $x$ the particle will jump one unit to the right with probability $p$ and one unit to the left with probability $q$; i.e., its position after the first jump will be $S_1 = x + Y_1$. Starting from this new position, the particle will jump one unit to the right with probability $p$ and one unit to the left with probability $q$, so that its position after the second jump will be $S_2 = x + Y_1 + Y_2$. After the $n$th jump, its position will be $S_n = x + Y_1 + \cdots + Y_n$. Let $q_x$ be the probability that the particle will reach 0 before reaching $a$. Since only the interpretation of the $S_n$ has changed and not the probability model, the $q_x$ are given by Equations 3.14 and 3.15. In the language of random walks, 0 and $a$ are boundary points for the interval of integers $\{0, 1, \ldots, a\}$, and $q_x$ is the probability of absorption at 0.

Let $\{Y_j\}$ be an infinite sequence of independent random variables as described above and let $S_n = Y_1 + \cdots + Y_n$. Then $S_n$ can be interpreted as the position of a particle after $n$ moves with the particle starting at 0 and successively jumping one unit to the right with probability $p$ and one unit to the left with probability $q = 1 - p$. After the first jump, the particle is no longer at 0 and may or may not eventually return to 0. We will endeavor to calculate the probability that the particle will eventually return to 0. The notation introduced above will be used throughout this section.
Recall that $Z$ is the set of integers $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$.

Definition 5.3 If $p = q$, the sequence $\{S_n\}_{n=1}^{\infty}$ is called a symmetric random walk on $Z$; if $p \ne q$, the sequence $\{S_n\}_{n=1}^{\infty}$ is called a random walk on $Z$ with drift to the left if $q > p$ and to the right if $p > q$. ■

We will introduce two sequences of numbers $\{u_j\}$ and $\{f_j\}$ by defining

$$
\begin{aligned}
u_j &= P(S_j = 0), & j \ge 1 \\
f_j &= P(S_1 \ne 0, \ldots, S_{j-1} \ne 0, S_j = 0), & j \ge 1.
\end{aligned}
$$

Since the numbers $u_0$ and $f_0$ have no meaning, we are free to define them however we choose and put $u_0 = 1$, $f_0 = 0$. Since $|u_j| \le 1$ and $|f_j| \le 1$ for all $j \ge 0$, the generating functions

$$
U(s) = \sum_{j=0}^{\infty} u_j s^j, \qquad F(s) = \sum_{j=0}^{\infty} f_j s^j
$$

converge absolutely in the interval $(-1, 1)$. The probability that the particle will eventually return to the origin is $P(S_n = 0 \text{ for some } n \ge 1)$. Since this event can be stratified according to the first time the particle reaches the origin,

$$
P(S_n = 0 \text{ for some } n \ge 1) = \sum_{j=1}^{\infty} P(S_1 \ne 0, \ldots, S_{j-1} \ne 0, S_j = 0) = \sum_{j=1}^{\infty} f_j.
$$

Letting $f = \sum_{j=1}^{\infty} f_j$, $1 - f$ is the probability that the particle will never return to the origin. The $\{u_j\}$ and $\{f_j\}$ sequences are related by the equation

$$
u_j = f_0 u_j + f_1 u_{j-1} + \cdots + f_j u_0, \qquad j \ge 1. \tag{5.8}
$$

Note that the first term on the right is zero since $f_0 = 0$. This equation follows from the fact that

$$
(S_j = 0) = \bigcup_{k=1}^{j} (S_1 \ne 0, \ldots, S_{k-1} \ne 0, S_k = 0, S_j = 0)
$$

and the following argument. The event $(S_1 \ne 0, \ldots, S_{k-1} \ne 0, S_k = 0)$ depends only upon $Y_1, \ldots, Y_k$, and since $S_k = Y_1 + \cdots + Y_k = 0$, the condition $S_j = 0$ is the same as the condition $Y_{k+1} + \cdots + Y_j = 0$, which depends only upon $Y_{k+1}, \ldots, Y_j$. By independence and the fact that the joint density of $Y_{k+1}, \ldots, Y_j$ is the same as the joint density of $Y_1, \ldots, Y_{j-k}$,

$$
\begin{aligned}
u_j = P(S_j = 0)
&= \sum_{k=1}^{j} P(S_1 \ne 0, \ldots, S_{k-1} \ne 0, S_k = 0)\,P(Y_{k+1} + \cdots + Y_j = 0) \\
&= \sum_{k=1}^{j} P(S_1 \ne 0, \ldots, S_{k-1} \ne 0, S_k = 0)\,P(Y_1 + \cdots + Y_{j-k} = 0) \\
&= \sum_{k=1}^{j} f_k u_{j-k} = \sum_{k=0}^{j} f_k u_{j-k}.
\end{aligned}
$$

This establishes Equation 5.8.
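Equation 5.8 can be checked exactly for small $j$. The following is a minimal sketch, not part of the original text: the function names `return_probabilities` and `first_return_probabilities` are illustrative choices, and the value $p = 1/3$ is an arbitrary assumption. The $u_j$ are computed by counting paths with equally many jumps in each direction, and the $f_j$ are computed independently by following only those paths that have not yet revisited 0.

```python
from fractions import Fraction
from math import comb

def return_probabilities(p, n_max):
    """u[j] = P(S_j = 0) for steps +1 (probability p) and -1 (probability 1 - p)."""
    q = 1 - p
    u = [Fraction(1)]  # u_0 = 1 by convention
    for j in range(1, n_max + 1):
        if j % 2:
            u.append(Fraction(0))  # an odd number of jumps cannot end at 0
        else:
            half = j // 2
            u.append(comb(j, half) * p**half * q**half)
    return u

def first_return_probabilities(p, n_max):
    """f[j] = P(S_1 != 0, ..., S_{j-1} != 0, S_j = 0), following paths that avoid 0."""
    q = 1 - p
    f = [Fraction(0)]  # f_0 = 0 by convention
    alive = {1: p, -1: q}  # distribution of S_1
    for j in range(1, n_max + 1):
        # `alive` holds the mass of S_j on paths with S_1, ..., S_{j-1} all nonzero;
        # the mass sitting at 0 is a first return and is removed from further tracking
        f.append(alive.pop(0, Fraction(0)))
        nxt = {}
        for pos, mass in alive.items():
            nxt[pos + 1] = nxt.get(pos + 1, 0) + mass * p
            nxt[pos - 1] = nxt.get(pos - 1, 0) + mass * q
        alive = nxt
    return f

p = Fraction(1, 3)  # assumed value, chosen only for illustration
u = return_probabilities(p, 12)
f = first_return_probabilities(p, 12)
# Equation 5.8: u_j = f_0 u_j + f_1 u_{j-1} + ... + f_j u_0
holds = all(u[j] == sum(f[k] * u[j - k] for k in range(j + 1)) for j in range(1, 13))
print(holds)
```

Exact rational arithmetic is used so that the two sides of Equation 5.8 can be compared with `==` rather than up to rounding error.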
Multiplying both sides of Equation 5.8 by $s^j$, summing over $j \ge 1$, and using the fact that $f_0 = 0$,

$$
\sum_{j=1}^{\infty} u_j s^j = \sum_{j=0}^{\infty} (f_0 u_j + \cdots + f_j u_0)\,s^j.
$$

Since $u_0 = 1$, the left side of this equation is $U(s) - 1$, and according to the discussion preceding Definition 3.10, the right side is equal to

$$
\Bigl(\sum_{i=0}^{\infty} f_i s^i\Bigr)\Bigl(\sum_{j=0}^{\infty} u_j s^j\Bigr) = U(s)F(s).
$$

Thus, $U(s) - 1 = F(s)U(s)$ and the generating functions are related as follows:

$$
U(s) = \frac{1}{1 - F(s)}. \tag{5.9}
$$

Since the $u_j$ and $f_j$ are nonnegative, by Abel's theorem (Theorem 4.2.3) both limits $\lim_{s \to 1-} U(s) = \sum_{j=0}^{\infty} u_j$ and $\lim_{s \to 1-} F(s) = \sum_{j=0}^{\infty} f_j = f \le 1$ exist, even if the first is infinite.

Theorem 5.3.1 $f < 1$ if and only if $\sum_{j=0}^{\infty} u_j < +\infty$.

PROOF: Note that $0 \le f < 1$ or $f = 1$. When $f < 1$,

$$
\sum_{j=0}^{\infty} u_j = \lim_{s \to 1-} U(s) = \frac{1}{1 - \lim_{s \to 1-} F(s)} = \frac{1}{1 - f} < \infty;
$$

when $f = 1$, $\lim_{s \to 1-} F(s) = 1$ and $\lim_{s \to 1-} U(s) = \sum_{j=0}^{\infty} u_j = +\infty$. By the latter case, $\sum_{j=0}^{\infty} u_j < +\infty$ implies that $f < 1$. ■

This theorem gives us a workable criterion for deciding if the particle will eventually return to the origin with probability 1. Before applying the criterion, it is necessary to take up approximations to factorials. A sequence $\{a_j\}_{j=1}^{\infty}$ is said to be asymptotically equivalent to the sequence $\{b_j\}_{j=1}^{\infty}$, written $a_j \sim b_j$, if $\lim_{j \to \infty} a_j/b_j = 1$. It is easy to see that if $a_j \sim b_j$ and $c_j \sim d_j$, then $a_j/c_j \sim b_j/d_j$. The following relationship is known as Stirling's formula:

$$
n! \sim \sqrt{2\pi}\,n^{n + (1/2)}\,e^{-n}. \tag{5.10}
$$

An elementary proof of this result can be found in the book by R. Ash listed at the end of this chapter.

Returning to the series $\sum_{j=0}^{\infty} u_j$, note that $u_j = 0$ whenever $j$ is odd because a return to the origin can occur only in an even number of jumps (i.e., the number of jumps to the right must be equal to the number of jumps to the left). Consider $u_{2n}$ for $n \ge 1$. Since the number of jumps to the left and to the right must be equal,

$$
u_{2n} = \binom{2n}{n} p^n q^n, \qquad n \ge 1.
$$
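Stirling's formula, Equation 5.10, is easy to examine numerically. The sketch below is an illustration only: the function name `stirling_ratio` is an assumption, and logarithms (via the standard library function `math.lgamma`, which returns $\log \Gamma(x)$, so that `lgamma(n + 1)` is $\log n!$) are used to avoid overflow for large $n$. It prints the ratio of $n!$ to $\sqrt{2\pi}\,n^{n+(1/2)}e^{-n}$, which should approach 1.

```python
import math

def stirling_ratio(n):
    """Return n! / (sqrt(2*pi) * n**(n + 1/2) * exp(-n)), computed via logarithms."""
    log_factorial = math.lgamma(n + 1)  # log(n!)
    log_stirling = 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n
    return math.exp(log_factorial - log_stirling)

for n in (1, 10, 100, 1000):
    print(n, stirling_ratio(n))
```

The ratio stays slightly above 1 and decreases toward 1 as $n$ grows, which is all that asymptotic equivalence requires.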
By Stirling's formula,

$$
u_{2n} \sim \frac{(4pq)^n}{\sqrt{n\pi}}.
$$

Since

$$
\lim_{n \to \infty} \frac{u_{2n}}{(4pq)^n/\sqrt{n\pi}} = 1,
$$

there is an $N \ge 1$ such that

$$
\frac{u_{2n}}{(4pq)^n/\sqrt{n\pi}} < 2 \qquad \text{for all } n \ge N,
$$

and since $\sqrt{n\pi} \ge 1$ for all positive integers $n$,

$$
u_{2n} < \frac{2(4pq)^n}{\sqrt{n\pi}} \le 2(4pq)^n \qquad \text{for all } n \ge N.
$$

We now take up the $p \ne q$ and $p = q$ cases separately. Suppose first that $p \ne q$. In this case, $4pq = 4p(1 - p) < 1$ since the maximum value $1/4$ of $p(1 - p)$ is attained only when $p = q = 1/2$. By the comparison test for positive series, the series $\sum_{n=0}^{\infty} u_{2n}$ converges since the geometric series $\sum_{n=0}^{\infty} (4pq)^n$ converges. Thus, $p \ne q$ implies that $\sum_{j=0}^{\infty} u_j$ converges. By Theorem 5.3.1, $f < 1$. This means that in the $p \ne q$ case, there is a positive probability that the particle will never return to the origin.

Consider now the $p = q = 1/2$ case. Then $4pq = 1$ and $u_{2n} \sim 1/\sqrt{\pi n}$. That is,

$$
\lim_{n \to \infty} \frac{u_{2n}}{1/\sqrt{\pi n}} = 1
$$

and there is an $N \ge 1$ such that

$$
\frac{u_{2n}}{1/\sqrt{\pi n}} \ge \frac{1}{2} \qquad \text{for all } n \ge N
$$

or

$$
u_{2n} \ge \frac{1}{2\sqrt{\pi n}} \qquad \text{for all } n \ge N.
$$

By comparison with the divergent $p$-series $\sum_{n=1}^{\infty} 1/n^{1/2}$, the series $\sum_{n=0}^{\infty} u_{2n}$ diverges. Thus, $p = q = 1/2$ implies that $\sum_{j=0}^{\infty} u_j$ diverges. By Theorem 5.3.1, $f = 1$ and the particle will return to the origin with probability 1. In summary, we have the following theorem.

Theorem 5.3.2 If $p \ne q$, there is a positive probability that the random walk $\{S_n\}_{n=1}^{\infty}$ will never return to the origin; if $p = q$, the random walk $\{S_n\}_{n=1}^{\infty}$ will return to the origin with probability 1.

Can the probability of eventually returning to the origin be determined in the $p \ne q$ case? With a little more work we can answer this question since the $u_j$ are known. In fact,

$$
U(s) = \sum_{j=0}^{\infty} \binom{2j}{j} p^j q^j s^{2j}.
$$

Using the easily verified fact that $\binom{2n}{n} = (-4)^n \binom{-1/2}{n}$,

$$
U(s) = \sum_{j=0}^{\infty} \binom{-1/2}{j} (-4)^j (pq s^2)^j = \sum_{j=0}^{\infty} \binom{-1/2}{j} (-4pqs^2)^j = (1 - 4pqs^2)^{-1/2}.
$$

By Equation 5.9,

$$
F(s) = 1 - (1 - 4pqs^2)^{1/2}.
$$

Since $F(1) = 1 - (1 - 4pq)^{1/2}$ and also $F(1) = \sum_{j=0}^{\infty} f_j = f$, $f = 1 - (1 - 4pq)^{1/2}$. Noting that $1 - 4pq = 1 - 4p(1 - p) = (1 - 2p)^2 = (q - p)^2$, $f = 1 - |q - p|$.

Theorem 5.3.3 The random walk $\{S_n\}_{n=1}^{\infty}$ will return to the origin with probability $f = 1 - |q - p|$.
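Theorem 5.3.3 can be compared with a direct numerical computation. The following is a rough sketch under stated assumptions: the function name `return_probability`, the truncation at 400 terms, and the test value $p = 0.7$ are all illustrative choices. It builds the $u_j$ from $u_{2n} = \binom{2n}{n}p^nq^n$, solves Equation 5.8 for the $f_j$ one at a time, and sums them.

```python
from math import isclose

def return_probability(p, n_terms=400):
    """Approximate f = f_1 + f_2 + ... by recovering the f_j from Equation 5.8."""
    q = 1.0 - p
    # u[j] = P(S_j = 0): zero for odd j, C(2n, n) (pq)^n for j = 2n
    u = [0.0] * (n_terms + 1)
    u[0] = 1.0
    for n in range(1, n_terms // 2 + 1):
        # C(2n, n) / C(2n - 2, n - 1) = (2n)(2n - 1) / n^2
        u[2 * n] = u[2 * n - 2] * (2 * n) * (2 * n - 1) / (n * n) * p * q
    # Equation 5.8 with f_0 = 0 and u_0 = 1: f_j = u_j - sum_{k=1}^{j-1} f_k u_{j-k}
    f = [0.0] * (n_terms + 1)
    for j in range(1, n_terms + 1):
        f[j] = u[j] - sum(f[k] * u[j - k] for k in range(1, j))
    return sum(f)

p = 0.7  # assumed value for illustration
print(return_probability(p), 1 - abs((1 - p) - p))
print(isclose(return_probability(p), 1 - abs((1 - p) - p), abs_tol=1e-9))
```

For $p \ne q$ the terms decay geometrically, so a few hundred terms suffice. In the symmetric case the partial sums approach 1 only slowly, so a truncated sum there stays visibly below 1.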
We can also use the previous result to determine the expected number of jumps to return to the origin. Define a waiting time random variable $T$ by putting $T = n$ on $(S_1 \ne 0, \ldots, S_{n-1} \ne 0, S_n = 0)$ for $n \ge 1$ and $T = +\infty$ otherwise. Then $P(T = n) = P(S_1 \ne 0, \ldots, S_{n-1} \ne 0, S_n = 0) = f_n$. Consider the $p \ne q$ case. Since $f = \sum_{n=0}^{\infty} f_n < 1$,

$$
P(T = +\infty) = 1 - P(T < \infty) = 1 - f > 0
$$

and $E[T] = +\infty$. Now consider the $p = q = 1/2$ case. This time $P(T < +\infty) = f = 1$. Note that $\hat{f}_T(s) = \sum_{n=0}^{\infty} P(T = n)s^n = \sum_{n=0}^{\infty} f_n s^n = F(s)$. Therefore,

$$
E[T] = \hat{f}_T'(1) = F'(1) = \lim_{s \to 1-} \frac{s}{\sqrt{1 - s^2}} = +\infty.
$$

In summary, we have the following theorem.

Theorem 5.3.4 The expected number of jumps for return to the origin is $+\infty$ for the one-dimensional random walk $\{S_n\}_{n=1}^{\infty}$.

In the symmetric case, the random walk will return to the origin with probability 1, but the expected time for doing so is infinite.

A two-dimensional random walk on the points in the plane with integer coordinates can be described as follows. If at a given time a particle is at a point $(x, y)$ with integer coordinates, then it will jump to one of the four neighboring points $(x + 1, y)$, $(x - 1, y)$, $(x, y + 1)$, $(x, y - 1)$ with specified probabilities independently of what has taken place previously. A three-dimensional random walk on the points in 3-space with integer coordinates can be described similarly, except that jumps to six neighboring points will be allowed. To simplify the discussion, we will consider only symmetric two- and three-dimensional random walks.

For each $j \ge 1$, let $(X_j, Y_j)$ be an ordered pair of random variables with joint density function

$$
f_{X_j,Y_j}(x_j, y_j) =
\begin{cases}
1/4 & \text{if } (x_j, y_j) = (\pm 1, 0) \text{ or } (0, \pm 1) \\
0 & \text{otherwise.}
\end{cases}
$$

We will assume that a probability space $(\Omega, \mathcal{F}, P)$ can be constructed so that the pairs $(X_1, Y_1), (X_2, Y_2), \ldots$ are independent; i.e., for every $n \ge 1$,

$$
f_{X_1,Y_1,\ldots,X_n,Y_n}(x_1, y_1, \ldots, x_n, y_n) = f_{X_1,Y_1}(x_1, y_1) \times \cdots \times f_{X_n,Y_n}(x_n, y_n).
$$

Note that for each $j \ge 1$, the random variables $X_j$ and $Y_j$ are not independent.
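The lack of independence between $X_j$ and $Y_j$ can be seen from a single pair of values. The short sketch below is illustrative only; the dictionary `joint` and the helper `marginal` are hypothetical names. It tabulates the joint density above and compares $P(X_j = 1, Y_j = 1)$ with $P(X_j = 1)P(Y_j = 1)$.

```python
from fractions import Fraction

# joint density of one step (X_j, Y_j) of the symmetric walk in the plane
joint = {(1, 0): Fraction(1, 4), (-1, 0): Fraction(1, 4),
         (0, 1): Fraction(1, 4), (0, -1): Fraction(1, 4)}

def marginal(axis, value):
    """P(X_j = value) for axis 0, or P(Y_j = value) for axis 1."""
    return sum(mass for point, mass in joint.items() if point[axis] == value)

p_x1 = marginal(0, 1)                    # P(X_j = 1)
p_y1 = marginal(1, 1)                    # P(Y_j = 1)
p_both = joint.get((1, 1), Fraction(0))  # P(X_j = 1, Y_j = 1)
print(p_x1, p_y1, p_both, p_both == p_x1 * p_y1)
```

A jump moves the particle along exactly one axis, so the event $(X_j = 1, Y_j = 1)$ is impossible even though each of its components has positive probability.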
For each $n \ge 1$, let $S_n = X_1 + \cdots + X_n$ and $T_n = Y_1 + \cdots + Y_n$. Then the sequence of pairs $\{(S_n, T_n)\}_{n=1}^{\infty}$ describes a two-dimensional random walk starting at the origin on the points in the plane with integer coordinates. A particle taking such a random walk can be at the origin as of the $n$th jump if and only if both $S_n = 0$ and $T_n = 0$. As before, we can define

$$
u_n = P(S_n = 0, T_n = 0), \qquad n \ge 1
$$

with $u_0 = 1$ and also define

$$
f_n = P(|S_1| + |T_1| \ne 0, \ldots, |S_{n-1}| + |T_{n-1}| \ne 0, S_n = 0, T_n = 0), \qquad n \ge 1
$$

with $f_0 = 0$. The probability $f_n$ is the probability that the particle will return to the origin for the first time on the $n$th jump, and $f = \sum_{j=1}^{\infty} f_j$ is the probability that the particle will eventually return to the origin. The generating functions $U(s)$ and $F(s)$ are related as in Equation 5.9, so that again $f < 1$ if and only if $\sum_{j=0}^{\infty} u_j < +\infty$. As in the one-dimensional random walk, a return to the origin can occur only in an even number of jumps. For $S_{2n} = 0$ and $T_{2n} = 0$, the number $k$ of jumps to the right must be equal to the number $k$ of jumps to the left, and the number $n - k$ of jumps up must be equal to the number $n - k$ of jumps down where $k = 0, \ldots, n$. By the multinomial density and Equation 1.11,

$$
u_{2n} = \sum_{k=0}^{n} \frac{(2n)!}{k!\,k!\,(n-k)!\,(n-k)!} \Bigl(\frac{1}{4}\Bigr)^{2n} = \frac{1}{4^{2n}} \binom{2n}{n}^2.
$$

By Stirling's formula,

$$
u_{2n} \sim \frac{1}{n\pi}.
$$

Thus, there is a positive integer $N \ge 1$ such that

$$
u_{2n} \ge \frac{1}{2}\,\frac{1}{n\pi} \qquad \text{for all } n \ge N.
$$

Since the series $\sum_{n=1}^{\infty} 1/n$ diverges, the series $\sum_{n=0}^{\infty} u_n$ diverges and therefore $f = 1$. Thus, a symmetric random walk in the plane will return to the origin with probability 1.

The situation changes, however, in higher dimensions. In the three-dimensional case, the random walk starting at the origin takes place on the points in 3-space with integer coordinates, and the particle will jump to any one of its nearest neighbors with probability 1/6. As in the previous cases, a return to the origin can occur only in an even number of jumps.
For this to happen, the number of jumps in the positive $x$-direction must be equal to the number of jumps in the negative $x$-direction, and the same for the $y$-direction and $z$-direction. In this case,

$$
\begin{aligned}
u_{2n} &= \sum_{0 \le j \le k \le n} \frac{(2n)!}{j!\,j!\,(k-j)!\,(k-j)!\,(n-k)!\,(n-k)!} \Bigl(\frac{1}{6}\Bigr)^{2n} \\
&= \frac{1}{2^{2n}} \binom{2n}{n} \sum_{0 \le j \le k \le n} \Bigl( \frac{n!}{3^n\,j!\,(k-j)!\,(n-k)!} \Bigr)^2.
\end{aligned}
$$

Since the quantity in the parentheses is the general term of a multinomial density,

$$
\begin{aligned}
u_{2n} &\le \frac{1}{2^{2n}} \binom{2n}{n} \max_{j,k} \Bigl( \frac{n!}{3^n\,j!\,(k-j)!\,(n-k)!} \Bigr) \sum_{0 \le j \le k \le n} \frac{n!}{3^n\,j!\,(k-j)!\,(n-k)!} \\
&= \frac{1}{2^{2n}} \binom{2n}{n} \max_{j,k} \Bigl( \frac{n!}{3^n\,j!\,(k-j)!\,(n-k)!} \Bigr).
\end{aligned}
$$

The indicated maximum will be achieved when the three factorials in the denominator are equal and, since their arguments sum to $n$, when each is equal to $(n/3)!$, assuming that $n/3$ is an integer. Putting aside such technical details,

$$
u_{2n} \le \frac{1}{2^{2n}} \binom{2n}{n} \frac{n!}{3^n\,((n/3)!)^3}.
$$

Applying Stirling's formula, Equation 5.10, to the factorials:

$$
u_{2n} \le \frac{1}{2^{2n}} \binom{2n}{n} \frac{n!}{3^n\,((n/3)!)^3} \sim \frac{3\sqrt{3}}{2\pi\sqrt{\pi}\,n^{3/2}}.
$$

By the comparison test, the series $\sum_{n=0}^{\infty} u_n$ can be compared with the convergent $p$-series $\sum_{n=1}^{\infty} 1/n^{3/2}$ with $p = 3/2 > 1$, and therefore the series $\sum_{n=0}^{\infty} u_n$ converges. In this case, $f < 1$ and there is a positive probability that a return to the origin will never occur. The technical details glossed over previously can be taken care of by using the fact that for $0 \le j \le k \le n$,

$$
j!\,(k-j)!\,(n-k)! \ge \Bigl( \Gamma\Bigl(\frac{n}{3} + 1\Bigr) \Bigr)^3,
$$

where $\Gamma$ is the gamma function (see Section 6.5), and making use of known estimates of the gamma function for large values of the argument.

EXERCISES 5.3

The following terminology will be used in connection with a particle taking a random walk on the integers $\{0, 1, \ldots, a\}$, $a \ge 2$. The boundary point 0 is an elastic barrier for the walk if there is a number $\delta$ with $0 < \delta < 1$ such that the particle upon reaching 1 will jump to 2 with probability $p$ or remain at 1 with probability $\delta q$ or jump to 0 with probability $(1 - \delta)q$.

1. Verify that

$$
\binom{2n}{n} = (-4)^n \binom{-1/2}{n}
$$

for every positive integer $n$.

2. Consider a random walk on $Z$ that jumps two units to the right with probability $p$ and one unit to the left with probability $q$, $0 < p < 1$, $p + q = 1$. If a particle starts at 0, for what values of $p$ is return to 0 certain?

3. Let $q_x$ be the probability that a random walk on $\{0, 1, \ldots, a\}$ with elastic barrier at 0 as described above will hit 0 before hitting $a$. Find a difference equation for the $q_x$, find boundary conditions, and determine $q_x$.

4. Let $T_x$, $1 \le x \le a - 1$, be the waiting time for the random walk of the previous problem to hit either 0 or $a$. Calculate $D_x = E[T_x]$, $1 \le x \le a - 1$, in the $p \ne q$ case.

5.4 BRANCHING PROCESSES

If a neutron collides with the nucleus of an atom, the nucleus may split and give rise to new neutrons, which in turn may collide with other nuclei and give rise to more neutrons, and so forth. This is an example of a branching process. Another commonly cited example involves the survival of family names, assumed to be passed on to male offspring. Starting with one individual, $k$ offspring may be produced with probability $p_k$, $k = 0, 1, \ldots$. The number of offspring is a random variable $X_1$ that describes the size of the first generation. Each of the $X_1$ offspring can then produce $k$ offspring with probability $p_k$, $k = 0, 1, 2, \ldots$, independently of $X_1$ and independently of the number of offspring of individuals of the same generation. The total number of the offspring of the $X_1$ individuals is then a random variable $X_2$ that describes the size of the second generation, and so forth. Continuing in this way, there is a sequence of random variables $X_0, X_1, \ldots$ where $X_0$ is the size of the initial generation and $X_j$ describes the size of the $j$th generation.

A careful construction of a branching process in terms of random variables requires an infinite sequence of independent nonnegative integer-valued random variables all having the same density $p(k) = p_k$, $k = 0, 1, \ldots$. We will assume that there is such a sequence of random variables.
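The verbal description of the process translates directly into a simulation. The following is a minimal sketch, assuming a finitely supported offspring density passed as a list; the function name `simulate_generations` and the sample density are illustrative assumptions, not part of the text. Each individual in the current generation independently draws its number of offspring from the same density, and the draws are summed to give the next generation size.

```python
import random

def simulate_generations(offspring_density, generations, rng):
    """Return [X_0, X_1, ..., X_generations] for one run of the branching process.

    offspring_density[k] is p_k, the probability that an individual has k offspring.
    """
    counts = list(range(len(offspring_density)))
    sizes = [1]  # start from a single individual
    for _ in range(generations):
        current = sizes[-1]
        if current == 0:
            sizes.append(0)  # an extinct population stays extinct
            continue
        # each individual reproduces independently with the same density
        offspring = rng.choices(counts, weights=offspring_density, k=current)
        sizes.append(sum(offspring))
    return sizes

rng = random.Random(2024)  # seeded only so that a run can be repeated
print(simulate_generations([0.25, 0.25, 0.5], 10, rng))
```

Passing the random number generator as an argument keeps the function free of global state, so repeated runs with different seeds can be compared.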
We commence with the density function $p$ just described and assume throughout that $X_0 = 1$. Let $X_1$ be a random variable having density function $p$ and let $Y_1^{(1)}, Y_2^{(1)}, \ldots$ be a sequence of independent random variables that all have the same density $p$ and that are also independent of $X_1$. We then let

$$
X_2 = Y_1^{(1)} + Y_2^{(1)} + \cdots + Y_{X_1}^{(1)};
$$

i.e., $X_2$ is the sum of a random number of random variables. Letting $\hat{p}$ denote the generating function of the density $p$, by Theorem 3.4.5,

$$
\hat{f}_{X_2}(t) = \hat{f}_{X_1}(\hat{p}(t)).
$$

Since $\hat{f}_{X_1} = \hat{p}$,

$$
\hat{f}_{X_2}(t) = \hat{p}(\hat{p}(t)).
$$

Now let $Y_1^{(2)}, Y_2^{(2)}, \ldots$ be a sequence of independent random variables all having the same density $p$ and independent of all previously mentioned random variables, and let

$$
X_3 = Y_1^{(2)} + Y_2^{(2)} + \cdots + Y_{X_2}^{(2)}.
$$

Again by Theorem 3.4.5,

$$
\hat{f}_{X_3}(t) = \hat{f}_{X_2}(\hat{p}(t)) = \hat{p}(\hat{p}(\hat{p}(t))).
$$

Continuing in this manner, a sequence $X_1, X_2, \ldots$ is obtained whose generating functions satisfy

$$
\hat{f}_{X_{j+1}}(t) = \hat{f}_{X_j}(\hat{p}(t)) \qquad \text{for all } j \ge 1. \tag{5.11}
$$

We will now show using mathematical induction that

$$
\hat{f}_{X_{j+1}}(t) = \hat{p}(\hat{f}_{X_j}(t)) \qquad \text{for all } j \ge 1. \tag{5.12}
$$

Since $\hat{f}_{X_2}(t) = \hat{f}_{X_1}(\hat{p}(t))$ and $\hat{f}_{X_1} = \hat{p}$, $\hat{f}_{X_2}(t) = \hat{p}(\hat{f}_{X_1}(t))$ and Equation 5.12 is true for $j = 1$. Suppose Equation 5.12 is true for $j - 1$. By Equation 5.11,

$$
\hat{f}_{X_{j+1}}(t) = \hat{f}_{X_j}(\hat{p}(t)) = \hat{p}(\hat{f}_{X_{j-1}}(\hat{p}(t))) = \hat{p}(\hat{f}_{X_j}(t)),
$$

and the assertion is true for $j$. It follows from the principle of mathematical induction that Equation 5.12 is true for all $j \ge 1$.

The parameters $\mu = E[X_1] = \hat{p}'(1)$ and $\sigma^2 = \operatorname{var} X_1 = \hat{p}''(1) + \hat{p}'(1) - (\hat{p}'(1))^2$, assumed to be finite, are useful in describing qualitative properties of the branching process. Since $E[X_{j+1}] = \hat{f}_{X_{j+1}}'(1)$, by Equation 5.12,

$$
\hat{f}_{X_{j+1}}'(t) = \hat{p}'(\hat{f}_{X_j}(t))\,\hat{f}_{X_j}'(t)
$$

and

$$
E[X_{j+1}] = \hat{p}'(\hat{f}_{X_j}(1))\,\hat{f}_{X_j}'(1) = \hat{p}'(1)\,\hat{f}_{X_j}'(1) = \mu\,E[X_j].
$$

Iterating this result, $E[X_{j+1}] = \mu^{j+1}$. Therefore, $E[X_j] = \mu^j$ for all $j \ge 1$, and the expected size of the $j$th generation increases or decreases geometrically according to whether $\mu > 1$ or $\mu < 1$.

Consider now the probability that the branching process will eventually terminate; i.e., $P(X_n = 0 \text{ for some } n \ge 1)$. We will want to exclude from consideration some special cases.
Suppose first that $p_0 = 0$; i.e., the probability that an individual will have zero offspring is zero; in this case, extinction will never occur and we can henceforth assume that $p_0 > 0$. Suppose now that $p_0 = 1$. Then extinction will occur with the first generation, and the $p_0 = 1$ case will be excluded. Henceforth, we will assume that $0 < p_0 < 1$.

Consider the probability $q_j$ that the size of the $j$th generation will be zero; i.e., $q_j = P(X_j = 0) = \hat{f}_{X_j}(0)$. By Equation 5.12,

$$
q_{j+1} = \hat{f}_{X_{j+1}}(0) = \hat{p}(\hat{f}_{X_j}(0)) = \hat{p}(q_j).
$$

Thus, the $q_j$'s are related by the equation

$$
q_{j+1} = \hat{p}(q_j) \qquad \text{for all } j \ge 1. \tag{5.13}
$$

Since $\hat{p}$ is supposedly given as part of the data describing the branching process and $q_1 = P(X_1 = 0) = p_0$, in principle Equation 5.13 can be used to determine the sequence $\{q_j\}_{j=1}^{\infty}$ by iteration. In general, however, $\hat{p}(s)$ will have nonlinear terms $s^j$, $j \ge 2$, which makes it difficult to find a formula for the $q_j$. As an alternative, we might examine the long-range behavior of the $q_j$ by considering $\lim_{j \to \infty} q_j$, if it exists. Assuming that the limit exists and using the fact that $\hat{p}(s)$ is continuous on $[0, 1]$,

$$
q = \lim_{j \to \infty} q_{j+1} = \lim_{j \to \infty} \hat{p}(q_j) = \hat{p}(q);
$$

i.e., $q$ is a solution of the equation $s = \hat{p}(s)$. Note that $s = 1$ solves this equation since $\hat{p}(1) = 1$, but there may be other solutions as well.

FIGURE 5.2 Graphs of $t = s$ and $t = \hat{p}(s)$, $p_0 + p_1 = 1$.

To determine other roots of the equation $s = \hat{p}(s)$, we will examine the function $\hat{p}(\cdot)$ in greater detail. Since $0 < p_0 < 1$, $p_j > 0$ for some $j \ge 1$, and since $\hat{p}'(s) = \sum_{j=1}^{\infty} j p_j s^{j-1}$, $\hat{p}'(s) > 0$ on $(0, 1)$ and $\hat{p}(s)$ is strictly increasing on $[0, 1]$. Since $q_1 = p_0 > 0$, $q_2 = \hat{p}(q_1) > \hat{p}(0) = p_0 = q_1$. Assume that $q_j > q_{j-1}$. Then $q_{j+1} = \hat{p}(q_j) > \hat{p}(q_{j-1}) = q_j$. By mathematical induction, $q_j < q_{j+1}$ for all $j \ge 1$. Thus, $\{q_j\}_{j=1}^{\infty}$ is a monotone increasing sequence that is bounded above by 1, and therefore $q = \lim_{j \to \infty} q_j$ exists with $0 < q \le 1$.
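The iteration in Equation 5.13 is simple to carry out numerically. The sketch below is an illustration under stated assumptions: the helper names `generating_function` and `extinction_sequence` are hypothetical, and the offspring density $p_0 = 1/4$, $p_1 = 1/4$, $p_2 = 1/2$ is chosen only because the equation $s = \hat{p}(s)$ then reduces to $2s^2 - 3s + 1 = 0$, with roots $1/2$ and $1$.

```python
def generating_function(offspring_density):
    """Return p_hat(s) = sum_k p_k s^k for a finitely supported offspring density."""
    def p_hat(s):
        total = 0.0
        for p_k in reversed(offspring_density):  # Horner's rule
            total = total * s + p_k
        return total
    return p_hat

def extinction_sequence(offspring_density, n):
    """Return [q_1, ..., q_n] with q_1 = p_0 and q_{j+1} = p_hat(q_j) (Equation 5.13)."""
    p_hat = generating_function(offspring_density)
    q = [offspring_density[0]]
    while len(q) < n:
        q.append(p_hat(q[-1]))
    return q

qs = extinction_sequence([0.25, 0.25, 0.5], 200)  # assumed density, for illustration
print(qs[:5], qs[-1])
```

In this run the $q_j$ increase from $q_1 = p_0$ and settle at the root $1/2$ rather than at the root 1, in line with the monotone behavior just established.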
An alternative approach to solutions of the equation $s = \hat{p}(s)$ is to look at points of intersection of the graphs of the equations $t = s$ and $t = \hat{p}(s)$, $0 \le s \le 1$, since the $s$-coordinate of a point of intersection is a solution of the equation $s = \hat{p}(s)$. It will be necessary to consider two cases. Suppose first that $p_0 + p_1 = 1$. Thus, $\hat{p}(s) = p_0 + p_1 s$ with $0 < p_1 < 1$, and the two graphs are as depicted in Figure 5.2. In this case, it is clear that there is only one solution to the equation $s = \hat{p}(s)$; namely, $s = 1$, so that $q = 1$. This means that for large $j$, the probability that extinction will occur with the $j$th generation is very close to 1. Note that $\mu = \hat{p}'(1) = p_1 < 1$ in the $p_0 + p_1 = 1$ case.

Suppose now that $p_0 + p_1 < 1$ so that $p_j > 0$ for some $j \ge 2$. In this case, $\hat{p}''(s) = \sum_{j=2}^{\infty} j(j - 1)p_j s^{j-2} > 0$ on $(0, 1)$ and the function $\hat{p}(s)$ is convex and strictly increasing as depicted in Figure 5.3. In this case, it is clear that there are at most two solutions of the equation $s = \hat{p}(s)$. We will now show that $q$ is the smallest solution of this equation. Let $r > 0$ be any solution. Then $q_1 = p_0 = \hat{p}(0) < \hat{p}(r) = r$. Assume that $q_{j-1} < r$. Then $q_j = \hat{p}(q_{j-1}) < \hat{p}(r) = r$. Thus, the sequence $\{q_j\}_{j=1}^{\infty}$ is bounded above by $r$, and therefore $q = \lim_{j \to \infty} q_j \le r$; i.e., $q$ is the smallest solution of the equation $s = \hat{p}(s)$.

EXAMPLE 5.6 According to a statistical study by A. J. Lotka, the number of male offspring of an American male is given by the modified geometric density $p_0 = .4823$ and $p_k = (.2126)(.5893)^{k-1}$, $k \ge 1$. The generating function $\hat{p}$ is then

$$
\hat{p}(t) = .4823 + \frac{.2126\,t}{1 - .5893\,t}.
$$

Using software such as Mathematica or Maple V to approximate the solution of the equation

$$
.4823 + \frac{.2126\,t}{1 - .5893\,t} = t,
$$

the probability of extinction is .8183. ■

We will now relate the probability of ultimate extinction to the expected number of offspring of a single individual in the $p_0 + p_1 < 1$ case.
If $q < 1$, then there is a point $s_0$ in $(q, 1)$ such that $\hat{p}'(s_0) = 1$ by the mean value theorem; since $\hat{p}'(s)$ is strictly increasing on $[0, 1]$ and continuous from the left at 1 by Abel's theorem (Theorem 4.2.3), $\mu = \hat{p}'(1) > 1$. Thus, if $\mu = \hat{p}'(1) \le 1$, then $q = 1$. Suppose now that $q = 1$ so that the graph of $t = \hat{p}(s)$ intersects the graph of $t = s$ in only one point. It follows that $\mu = \hat{p}'(1) \le 1$. Thus, $q = 1$ if and only if $\mu \le 1$. In the $p_0 + p_1 = 1$ case, $q = 1$ and $\mu \le 1$. We thus have the following theorem.

Theorem 5.4.1 $q = \lim_{j \to \infty} q_j = 1$ if and only if $\mu \le 1$.

EXAMPLE 5.7 Suppose $p_0 = 1/8$, $p_1 = 1/4$, and $p_2 = 5/8$. Then $\hat{p}(s) = 1/8 + (1/4)s + (5/8)s^2$ and the equation $s = \hat{p}(s)$ has the two solutions $1/5$, $1$. Therefore, the probability of ultimate extinction is $1/5$ with $\mu = 3/2$. ■

FIGURE 5.3 Graphs of $t = s$ and $t = \hat{p}(s)$, $p_0 + p_1 < 1$.

EXERCISES 5.4

1. What is the probability of ultimate extinction $q$ for a branching process with

$$
\hat{p}(s) = \frac{3}{8}s^3 + \frac{3}{8}s^2 + \frac{1}{8}s + \frac{1}{8}?
$$

2. Consider the branching process with $p_0 = 1/8$, $p_1 = 3/8$, $p_2 = 3/8$, $p_3 = 1/8$, and $p_n = 0$ for all $n \ge 4$. Calculate the probability $q_3$ that extinction will occur with the third generation.

3. Find a formula for the probability of ultimate extinction $q$ for a branching process with $\hat{p}(s) = \alpha + \beta s^2$ where $0 < \alpha < 1$ and $\alpha + \beta = 1$.

4. If $\mu = E[X_1]$ and $\sigma^2 = \operatorname{var} X_1$, show that $\operatorname{var} X_{j+1} = \mu^2 \operatorname{var} X_j + \mu^j \sigma^2$ for $j \ge 1$.

5. Show that $\operatorname{var} X_j = \sigma^2(\mu^{2j-2} + \mu^{2j-3} + \cdots + \mu^{j-1})$ for $j \ge 1$.

The following problems require mathematical software such as Mathematica or Maple V.

6. Consider the branching process with $p_0 = 1/4$, $p_1 = 1/2$, $p_2 = 1/8$, $p_3 = 3/32$, $p_4 = 1/32$, and $p_n = 0$ for all $n \ge 5$. Approximate the probability of ultimate extinction $q$.

7. Consider a branching process for which the number of offspring of an individual has a Poisson density with parameter $\lambda = 2$. Approximate the probability of ultimate extinction $q$.

8. Consider a branching process for which the number of offspring of an individual has a binomial density with parameters $n = 5$ and $p = .25$. Approximate the probability of ultimate extinction $q$.

9. Consider the branching process with $p_0 = 1/8$, $p_1 = 3/8$, $p_2 = 3/8$, $p_3 = 1/8$, and $p_n = 0$ for all $n \ge 4$. Calculate $q_1$ through $q_{10}$.

5.5 PREDICTION THEORY

Consider a sequence $\{X_j\}$ of random variables with finite second moments where $j$ is allowed to range from $-\infty$ to $+\infty$. The sequence may correspond to a random process that has been going for some time. Suppose the index $n$ corresponds to the present and the random variables $\ldots, X_{n-2}, X_{n-1}$ correspond to observations in the past. How can the past observations $\ldots, X_{n-2}, X_{n-1}$ be used to predict $X_n$? That is, is there some function $\psi(\ldots, X_{n-2}, X_{n-1})$ of the past that predicts $X_n$? Because prediction entails some probability of error, there must be some criterion for choosing a predictor. There also must be some internal coherence in the sequence $\{X_j\}$. For example, if $X_n$ is independent of the past, then the past is of no use for predicting $X_n$.

Throughout this section, $\{X_j\}$ will denote a two-sided sequence of random variables $\{X_j\}_{j=-\infty}^{\infty}$ having finite second moments. The construction of such sequences is similar to the construction of infinite sequences of Bernoulli random variables.

Definition 5.4 The sequence $\{X_j\}$ is a stationary sequence if for each finite sequence of integers $j_1 < j_2 < \cdots < j_k$ and integers $n$,

$$
f_{X_{j_1+n},\ldots,X_{j_k+n}}(x_{j_1}, \ldots, x_{j_k}) = f_{X_{j_1},\ldots,X_{j_k}}(x_{j_1}, \ldots, x_{j_k}). \qquad ■
$$

We have seen that a (one-sided) sequence of Bernoulli random variables has this property for positive integers $n$. Stationarity is stronger than what is required for this section.

Definition 5.5 The sequence of random variables $\{X_j\}$ is weakly stationary if $E[X_j] = \mu$ independently of $j$ and the covariance $E[(X_j - \mu)(X_k - \mu)]$ depends only upon $|j - k|$, $-\infty < j, k < +\infty$.
■

Since $E[(X_j - \mu)(X_j - \mu)]$ is independent of $j$,
$$\sigma^2 = \operatorname{var} X_0 = E[(X_0 - \mu)^2] = E[(X_j - \mu)^2] = \operatorname{var} X_j, \qquad -\infty < j < +\infty.$$
We will assume throughout that $\sigma > 0$. If $\{X_j\}$ is a weakly stationary process, the function
$$R(n) = E[(X_j - \mu)(X_{j+n} - \mu)], \qquad -\infty < n < +\infty,$$
is independent of $j$ and is called the covariance function of the sequence. Note that $R(0) = \sigma^2$ and that
$$R(-n) = E[(X_j - \mu)(X_{j-n} - \mu)] = E[(X_{j-n} - \mu)(X_j - \mu)] = R(n),$$
and so $R(n) = R(-n) = R(|n|)$. The function
$$\rho(n) = \frac{R(n)}{\sigma^2}, \qquad -\infty < n < +\infty,$$
is called the correlation function of the sequence $\{X_j\}$. Note that $\rho(0) = 1$.

EXAMPLE 5.8 Let $\{X_j\}$ be a two-sided sequence of independent random variables having the same density function and let $\sigma^2$ be the common variance. Then $\operatorname{cov}(X_j, X_k) = 0$ whenever $j \ne k$ and $\operatorname{cov}(X_j, X_j) = \operatorname{var} X_j = \sigma^2$. Thus,
$$R(n) = \begin{cases} \sigma^2 & \text{if } n = 0 \\ 0 & \text{if } n \ne 0. \end{cases}$$
The sequence $\{X_j\}$ is both stationary and weakly stationary. ■

The sequence of the previous example can be used to construct other weakly stationary sequences.

EXAMPLE 5.9 (Moving Average Process) Let $\{Y_j\}$ be a two-sided sequence of independent random variables with finite second moments having the same density function, and let $\mu = E[Y_0]$, $\sigma^2 = \operatorname{var} Y_0$. Now let $a_0, \dots, a_{m-1}$ be a finite sequence of real numbers and define
$$X_j = a_0 Y_j + a_1 Y_{j-1} + \cdots + a_{m-1} Y_{j-m+1}, \qquad -\infty < j < +\infty.$$
Each $X_j$ has finite second moments by Lemma 4.4.1, and
$$E[X_j] = E\Big[\sum_{k=0}^{m-1} a_k Y_{j-k}\Big] = \mu \sum_{k=0}^{m-1} a_k.$$
This shows that $E[X_j]$ is independent of $j$. We will now show that $\operatorname{cov}(X_i, X_{i+n})$ is independent of $i$. Define $a_j = 0$ for $j \notin \{0, 1, \dots, m-1\}$ and assume for the time being that $n \ge 0$. Since the $Y_j$ are independent random variables and $E[(Y_{i-j} - \mu)(Y_{i+n-k} - \mu)] = 0$ except when $j = k - n$,
$$\begin{aligned}
\operatorname{cov}(X_i, X_{i+n}) &= E\Big[\Big(X_i - \mu\sum_{j=0}^{m-1} a_j\Big)\Big(X_{i+n} - \mu\sum_{k=0}^{m-1} a_k\Big)\Big] \\
&= E\Big[\Big(\sum_{j=0}^{m-1} a_j (Y_{i-j} - \mu)\Big)\Big(\sum_{k=0}^{m-1} a_k (Y_{i+n-k} - \mu)\Big)\Big] \\
&= \sum_{j,k=0}^{m-1} a_j a_k E[(Y_{i-j} - \mu)(Y_{i+n-k} - \mu)] \\
&= \sum_{k=0}^{m-1} a_{k-n} a_k \operatorname{var} Y_{i+n-k} \\
&= \sum_{k=n}^{m-1} a_{k-n} a_k \operatorname{var} Y_{i+n-k}.
\end{aligned}$$
Therefore,
$$\operatorname{cov}(X_i, X_{i+n}) = \begin{cases} \sigma^2(a_0 a_n + \cdots + a_{m-1-n} a_{m-1}) & \text{if } n \le m-1 \\ 0 & \text{if } n > m-1. \end{cases}$$
Clearly, $\operatorname{cov}(X_i, X_{i+n})$ is independent of $i$.
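The covariance formula for the moving average process can be evaluated mechanically. The following is a minimal Python sketch, not part of the original text; the function name `ma_covariance` and the sample coefficients are hypothetical, and the sketch only covers the case $n \ge 0$ treated so far.

```python
def ma_covariance(a, n, sigma2=1.0):
    """cov(X_i, X_{i+n}) for X_j = a[0]*Y_j + ... + a[m-1]*Y_{j-m+1}, n >= 0.

    Evaluates sigma2 * (a[0]*a[n] + ... + a[m-1-n]*a[m-1]); the index i
    never enters, which is the point of the derivation.
    """
    if n < 0:
        raise ValueError("this sketch handles n >= 0 only")
    m = len(a)
    if n > m - 1:
        return 0.0
    return sigma2 * sum(a[k - n] * a[k] for k in range(n, m))


coeffs = [1.0, 2.0, 3.0]  # hypothetical coefficients a_0, a_1, a_2
covs = [ma_covariance(coeffs, n) for n in range(5)]
print(covs)  # [14.0, 8.0, 3.0, 0.0, 0.0]
```

The covariance vanishes as soon as the lag exceeds $m - 1$, because $X_i$ and $X_{i+n}$ then share none of the $Y_j$.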
The covariance is also independent of $i$ when $n$ is replaced by $-n$, because then $\operatorname{cov}(X_i, X_{i-n}) = \operatorname{cov}(X_{i-n}, X_i)$, which is independent of $i$. In the particular case that $a_k = 1/\sqrt{m}$, $k = 0, \dots, m-1$,
$$R(n) = \begin{cases} \sigma^2(1 - (|n|/m)) & \text{if } |n| \le m-1 \\ 0 & \text{if } |n| \ge m. \end{cases}$$

If we want to predict $X_n$ using past observations $\dots, X_{n-2}, X_{n-1}$, there are many ways to choose a predictor $\hat{X}_n = \psi(\dots, X_{n-2}, X_{n-1})$, and we must formulate some criterion for deciding which is the best. One possible criterion for choosing a best predictor $\hat{X}_n$ is to choose $\hat{X}_n$ so that $E[(X_n - \hat{X}_n)^2]$ is a minimum, where $E[(X_n - \hat{X}_n)^2]$ is a measure of the distance between $X_n$ and $\hat{X}_n$, called the mean square error.

Consider a finite collection of random variables $Y, Y_1, \dots, Y_p$ having finite second moments and zero means and let $\mathcal{L}$ be the collection of all linear combinations of the $Y_1, \dots, Y_p$; i.e.,
$$\mathcal{L} = \Big\{\sum_{i=1}^{p} a_i Y_i : a_1, \dots, a_p \in R\Big\}.$$
An element of $\mathcal{L}$ will be denoted by $\hat{Y}$ and called a linear predictor of $Y$. It is easy to see that if $\hat{Y}_1$ and $\hat{Y}_2$ are in $\mathcal{L}$ and $a, b$ are any two real numbers, then $a\hat{Y}_1 + b\hat{Y}_2$ is in $\mathcal{L}$. A proof of the following theorem would take us too far astray from probability theory. Proofs can be found in books on measure theory or Hilbert space theory.

Theorem 5.5.1 There is a $\hat{Y}^* \in \mathcal{L}$ such that
$$E[(Y - \hat{Y}^*)^2] \le E[(Y - \hat{Y})^2] \quad \text{for all } \hat{Y} \in \mathcal{L}. \tag{5.14}$$

The $\hat{Y}^*$ of this theorem is called a minimum mean square linear predictor of $Y$. The quantity $E[(Y - \hat{Y}^*)^2]$ is called the minimum mean square error. It is possible to prove this result using calculus by writing
$$E[(Y - \hat{Y})^2] = E[Y^2] - 2\sum_{i=1}^{p} a_i E[Y Y_i] + \sum_{i,j=1}^{p} a_i a_j E[Y_i Y_j]$$
and minimizing the expression on the right as a quadratic function of the variables $a_1, \dots, a_p$.

Theorem 5.5.2 A predictor $\hat{Y}^*$ has minimum mean square error if and only if $E[(Y - \hat{Y}^*)\hat{Y}] = 0$ for every linear predictor $\hat{Y}$ in $\mathcal{L}$. Moreover, if $\hat{Y}_1^*$ and $\hat{Y}_2^*$ are any two linear predictors with minimum mean square error, then $\hat{Y}_1^* = \hat{Y}_2^*$ with probability 1.
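PROOF: Suppose first that $E[(Y - \hat{Y}^*)\hat{Y}] = 0$ for all $\hat{Y} \in \mathcal{L}$. Then
$$\begin{aligned}
E[(Y - \hat{Y})^2] &= E[(Y - \hat{Y}^* + \hat{Y}^* - \hat{Y})^2] \\
&= E[(Y - \hat{Y}^*)^2] + 2E[(Y - \hat{Y}^*)(\hat{Y}^* - \hat{Y})] + E[(\hat{Y}^* - \hat{Y})^2].
\end{aligned}$$
Since $\hat{Y}^* - \hat{Y} \in \mathcal{L}$, $E[(Y - \hat{Y}^*)(\hat{Y}^* - \hat{Y})] = 0$ by hypothesis. Therefore, $E[(Y - \hat{Y})^2] \ge E[(Y - \hat{Y}^*)^2]$ for all $\hat{Y} \in \mathcal{L}$, and $\hat{Y}^*$ has minimum mean square error.

Now let $\hat{Y}^*$ be the predictor of Theorem 5.5.1 with minimum mean square error. Note that if $\hat{Y} \in \mathcal{L}$ and $E[\hat{Y}^2] = \operatorname{var} \hat{Y} = 0$, then $\hat{Y} = 0$ with probability 1, and therefore $E[(Y - \hat{Y}^*)\hat{Y}] = 0$. We can therefore assume that $E[\hat{Y}^2] > 0$. Suppose that $E[(Y - \hat{Y}^*)\hat{Y}] = A \ne 0$ for some $\hat{Y} \in \mathcal{L}$. Consider
$$Z = \hat{Y}^* + \frac{A}{E[\hat{Y}^2]}\hat{Y},$$
which is in $\mathcal{L}$ since $\hat{Y}$ and $\hat{Y}^*$ are in $\mathcal{L}$. Writing $Z - \hat{Y}^* = (A/E[\hat{Y}^2])\hat{Y}$,
$$\begin{aligned}
E[(Y - Z)^2] &= E[((Y - \hat{Y}^*) + (\hat{Y}^* - Z))^2] \\
&= E[(Y - \hat{Y}^*)^2] + 2E[(Y - \hat{Y}^*)(\hat{Y}^* - Z)] + E[(\hat{Y}^* - Z)^2] \\
&= E[(Y - \hat{Y}^*)^2] - 2E\Big[(Y - \hat{Y}^*)\frac{A}{E[\hat{Y}^2]}\hat{Y}\Big] + E\Big[\frac{A^2}{E[\hat{Y}^2]^2}\hat{Y}^2\Big] \\
&= E[(Y - \hat{Y}^*)^2] - 2\frac{A}{E[\hat{Y}^2]}E[(Y - \hat{Y}^*)\hat{Y}] + \frac{A^2}{E[\hat{Y}^2]^2}E[\hat{Y}^2] \\
&= E[(Y - \hat{Y}^*)^2] - \frac{2A^2}{E[\hat{Y}^2]} + \frac{A^2}{E[\hat{Y}^2]} \\
&= E[(Y - \hat{Y}^*)^2] - \frac{A^2}{E[\hat{Y}^2]} \\
&< E[(Y - \hat{Y}^*)^2].
\end{aligned}$$
But this contradicts the fact that $\hat{Y}^*$ minimizes the mean square error. The assumption that $E[(Y - \hat{Y}^*)\hat{Y}] = A \ne 0$ for some $\hat{Y} \in \mathcal{L}$ leads to a contradiction, and therefore $E[(Y - \hat{Y}^*)\hat{Y}] = 0$ for all $\hat{Y} \in \mathcal{L}$.

Finally, suppose that $\hat{Y}_1^*$ and $\hat{Y}_2^*$ are both minimum mean square linear predictors of $Y$. Then
$$0 = E[(Y - \hat{Y}_i^*)\hat{Y}] = E[Y\hat{Y}] - E[\hat{Y}_i^*\hat{Y}] \quad \text{for } i = 1, 2, \ \hat{Y} \in \mathcal{L}.$$
Therefore, $E[(\hat{Y}_1^* - \hat{Y}_2^*)\hat{Y}] = 0$ for all $\hat{Y} \in \mathcal{L}$. Since the first factor is in $\mathcal{L}$, we can replace $\hat{Y}$ by $\hat{Y}_1^* - \hat{Y}_2^*$ to obtain $E[(\hat{Y}_1^* - \hat{Y}_2^*)^2] = 0$, and therefore $\hat{Y}_1^* = \hat{Y}_2^*$ with probability 1 (see Exercise 4.3.11). ■

We now return to the two-sided weakly stationary sequence $\{X_j\}$ and the problem of predicting $X_n$ using the past $\dots, X_{n-2}, X_{n-1}$. Computationally, we cannot expect to use all of the past $\dots, X_{n-2}, X_{n-1}$ to predict $X_n$ and must decide upon how many observations in the immediate past we will use. Suppose it has been decided to use just $p$ observations $X_{n-p}, \dots, X_{n-1}$. The most general predictor, not necessarily linear, will then have the form $\hat{X}_n = \psi(X_{n-p}, \dots, X_{n-1})$.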
It is true, but cannot be proved here, that if we put
$$\psi(x_{n-p}, \dots, x_{n-1}) = E[X_n \mid X_{n-p} = x_{n-p}, \dots, X_{n-1} = x_{n-1}],$$
then $\hat{X}_n = \psi(X_{n-p}, \dots, X_{n-1})$ is the minimum mean square predictor of $X_n$. This is not the same as the minimum mean square linear predictor of $X_n$. In practice, the computation of $E[X_n \mid X_{n-p} = x_{n-p}, \dots, X_{n-1} = x_{n-1}]$ requires complete knowledge of the joint density $f_{X_{n-p},\dots,X_{n-1},X_n}$; even if the joint density were known, the calculation of the conditional density might be intractable. The prediction problem is easier to handle if we limit ourselves to linear prediction.

To apply the above theorems, we must assume that the random variables $X_j$ have been centered; i.e., that $E[X_j] = 0$, $-\infty < j < +\infty$. Let $p$ and $n$ be fixed positive integers and let $\mathcal{L}$ be the collection of all linear combinations of $X_{n-p}, \dots, X_{n-1}$. A typical element of $\mathcal{L}$ will be denoted by $\hat{X}_n$. Let $\hat{X}_n^* = \sum_{j=1}^{p} a_j X_{n-j}$ be a minimum mean square linear predictor of $X_n$. Then
$$E[(X_n - \hat{X}_n^*)\hat{X}_n] = 0 \quad \text{for all } \hat{X}_n \in \mathcal{L}.$$
These equations hold if and only if
$$E[(X_n - \hat{X}_n^*)X_{n-j}] = 0 \quad \text{for } j = 1, \dots, p,$$
or
$$E[(X_n - a_1 X_{n-1} - \cdots - a_p X_{n-p})X_{n-j}] = 0 \quad \text{for } j = 1, \dots, p,$$
or
$$E[X_n X_{n-j}] = a_1 E[X_{n-1}X_{n-j}] + \cdots + a_p E[X_{n-p}X_{n-j}] \quad \text{for } j = 1, \dots, p.$$
Thus, the above condition is equivalent to
$$R(j) = a_1 R(1 - j) + \cdots + a_p R(p - j) \quad \text{for } j = 1, \dots, p; \tag{5.15}$$
i.e., the $a_1, \dots, a_p$ must satisfy the linear equations
$$\begin{aligned}
R(1) &= a_1 R(0) + \cdots + a_p R(p - 1) \\
R(2) &= a_1 R(-1) + \cdots + a_p R(p - 2) \\
&\ \ \vdots \\
R(p) &= a_1 R(1 - p) + \cdots + a_p R(0).
\end{aligned}$$
Since there is at least one minimum mean square linear predictor $\hat{X}_n^*$, there is at least one solution $a_1, \dots, a_p$ of this system of equations. We can also calculate the minimum mean square error $\sigma_p^2$ using $\hat{X}_n^*$ as follows:
$$\begin{aligned}
\sigma_p^2 &= E[(X_n - \hat{X}_n^*)^2] = E[(X_n - \hat{X}_n^*)(X_n - \hat{X}_n^*)] \\
&= E[(X_n - \hat{X}_n^*)X_n] - E[(X_n - \hat{X}_n^*)\hat{X}_n^*].
\end{aligned}$$
The second term on the right is zero since $\hat{X}_n^* \in \mathcal{L}$.
Thus,
$$\sigma_p^2 = E[X_n X_n] - E\Big[\sum_{j=1}^{p} a_j X_{n-j} X_n\Big] = R(0) - \sum_{j=1}^{p} a_j R(j).$$

EXAMPLE 5.10 Let $\{X_j\}$ be a two-sided weakly stationary process with covariance function
$$R(n) = \begin{cases} 1 - (|n|/3) & \text{for } n = 0, \pm 1, \pm 2 \\ 0 & \text{otherwise.} \end{cases}$$
Then $R(0) = 1$, $R(1) = 2/3$, and $R(2) = 1/3$. Suppose we take $p = 2$ so that the minimum mean square linear predictor $\hat{X}_n^* = a_1 X_{n-1} + a_2 X_{n-2}$ will be used to predict $X_n$. The coefficients $a_1, a_2$ must then satisfy the equations
$$\begin{aligned}
\frac{2}{3} &= a_1 + \frac{2}{3}a_2 \\
\frac{1}{3} &= \frac{2}{3}a_1 + a_2.
\end{aligned}$$
Solving for $a_1$ and $a_2$ gives $a_1 = 4/5$ and $a_2 = -1/5$, so that
$$\hat{X}_n^* = \frac{4}{5}X_{n-1} - \frac{1}{5}X_{n-2},$$
and the minimum mean square error is $\sigma_2^2 = 1 - (4/5)(2/3) + (1/5)(1/3) = 8/15$. ■

EXERCISES 5.5

1. Let $\{Y_j\}$ be a sequence of independent random variables having the same density function with $E[Y_j] = 0$ and $\operatorname{var} Y_j = 1$. For each $j \ge 1$, let $X_j = (1/4)Y_j + (1/2)Y_{j-1} + (1/4)Y_{j-2}$. Find the covariance function for the $\{X_j\}$ sequence, a minimum mean square linear predictor $\hat{X}_n^*$ of $X_n$ based on the last three observations $X_{n-1}, X_{n-2}, X_{n-3}$, and the minimum mean square error $\sigma_3^2$.
2. Would there be any improvement in the minimum mean square error in Problem 1 if the last four observations $X_{n-1}, X_{n-2}, X_{n-3}, X_{n-4}$ were used to predict $X_n$? Verify your answer by determining $\hat{X}_n^*$ and $\sigma_4^2$.
3. Consider a stationary sequence $\{X_n\}_{n=-\infty}^{+\infty}$ that satisfies the equation
$$X_n = aX_{n-1} + \varepsilon_n, \qquad -\infty < n < +\infty,$$
where $\{\varepsilon_n\}_{n=-\infty}^{+\infty}$ is a stationary process with $\sigma_\varepsilon^2 = \operatorname{var} \varepsilon_n > 0$ for which $E[\varepsilon_n X_m] = 0$ for all integers $m$ and $n$. Show that $|a| < 1$.
4. Consider a stationary process $\{X_n\}_{n=-\infty}^{+\infty}$ that satisfies the equation
$$X_{n+1} = a_1 X_n + a_2 X_{n-1} + \varepsilon_{n+1},$$
where $\{\varepsilon_n\}_{n=-\infty}^{+\infty}$ is a stationary process with $E[\varepsilon_n] = 0$, $\sigma_\varepsilon^2 = \operatorname{var} \varepsilon_n > 0$, and $E[\varepsilon_n X_m] = 0$ for all integers $m$ and $n$. If $\rho$ is the correlation function of the $X_n$ process, show that
$$\begin{aligned}
\rho(1) &= a_1 + a_2\rho(1) \\
\rho(2) &= a_1\rho(1) + a_2
\end{aligned}$$
and determine $a_1$ and $a_2$ in terms of $\rho(1)$ and $\rho(2)$.

Solving the following problem without the benefit of mathematical software such as Mathematica or Maple V would be extremely tedious.

5. As a result of a statistical study of a stationary process, the values $R(0)$, $R(1)$, $R(2)$, $R(3)$, and $R(4)$ of the covariance function $R(n)$ have been estimated to be 2, -1.68, -1.46, 1.22, and 1.08, respectively. If the last four observed values of the process are, in the order observed, -2.25, -1.25, .25, and 3.75, what is the minimum mean square linear predictor of the next value?

SUPPLEMENTAL READING LIST

R. B. Ash (1970). Basic Probability Theory. New York: Wiley.
C. W. Helstrom (1991). Probability and Stochastic Processes for Engineers, 2nd ed. New York: Macmillan.
S. Karlin and H. M. Taylor (1975). A First Course in Stochastic Processes, 2nd ed. New York: Academic Press.
M. Kendall and J. K. Ord (1990). Time Series, 3rd ed. New York: Oxford University Press.

CONTINUOUS RANDOM VARIABLES

INTRODUCTION

At one time, a chance variable or random variable $X$ was an undefined entity with an associated function $F(x)$, called the distribution function of $X$, that specified the probability that $X \le x$. Probability theory at that time dealt with properties of distribution functions. The concept of probability space did not enter into the picture. Rapidly expanding applications of probability theory eventually necessitated a renewed look at the foundations. Most of this chapter discusses random variables as they were dealt with before the development of the probability space model.

Familiarity with the evaluation of double integrals by means of iterated integrals will be taken for granted. There will be situations in which it is necessary to interchange the order of integration of iterated integrals. The following statement justifies this procedure. Let $f : R^2 \to R$ be a nonnegative real-valued function that is Riemann integrable on each finite rectangle, and let $(a, b)$, $(c, d)$ be two intervals of real numbers, finite or infinite. Then
$$\int_a^b \Big(\int_c^d f(x, y)\,dy\Big)dx = \int_c^d \Big(\int_a^b f(x, y)\,dx\Big)dy.$$
Proofs of this result can be found in most calculus books.
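The interchange of the order of integration can be illustrated numerically. The sketch below is not from the text: it uses a simple midpoint rule and a hypothetical nonnegative integrand $f(x, y) = xy^2$ on the finite rectangle $(0, 1) \times (0, 2)$, whose exact double integral is $4/3$.

```python
def midpoint(g, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of g over (lo, hi)."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))


def f(x, y):
    # Hypothetical integrand, nonnegative on the rectangle used below.
    return x * y ** 2


a, b, c, d = 0.0, 1.0, 0.0, 2.0

# Integrate in y first, then in x.
dy_then_dx = midpoint(lambda x: midpoint(lambda y: f(x, y), c, d), a, b)
# Integrate in x first, then in y.
dx_then_dy = midpoint(lambda y: midpoint(lambda x: f(x, y), a, b), c, d)

print(dy_then_dx, dx_then_dy)  # both approximate 4/3
```

Agreement of the two numbers is only a sanity check on one example, not a proof of the general statement.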
RANDOM VARIABLES

The random variables considered in the previous chapters are customarily called discrete random variables, meaning that their ranges are countable sets. But because there are meaningful numerical attributes of outcomes of experiments not having this property, we must look at the concept of random variables anew.

Let $(\Omega, \mathcal{F}, P)$ be a probability space. In previous chapters, a mapping $X : \Omega \to R$ was called a random variable if its range is a countable set $\{x_1, x_2, \dots\}$ and $(X = x) \in \mathcal{F}$ for all $x$ in the range of $X$. Since $(X \le x) = \bigcup_{x_i \le x}(X = x_i) \in \mathcal{F}$, events of the type $(X \le x)$, $x \in R$, also belong to $\mathcal{F}$. We will take the latter property as the definition of a random variable.

Definition 6.1 A mapping $X : \Omega \to R$ is called a random variable if $(X \le x) = \{\omega : X(\omega) \le x\} \in \mathcal{F}$ for all $x \in R$. ■

In some instances, we allow $X$ to take on the value $+\infty$, which is not in $R$, particularly for waiting times; in this case, $X$ is called an extended real-valued random variable. The criterion is exactly the same; i.e., $(X \le x) \in \mathcal{F}$ for all $x \in R$.

The fact that $(X \le x) \in \mathcal{F}$ for $x \in R$ means that $P(X \le x)$ is defined for all $x \in R$ and defines a function on $R$.

Definition 6.2 If $X$ is a random variable, the function $F_X : R \to R$ defined by
$$F_X(x) = P(X \le x), \qquad x \in R,$$
is called the distribution function of the random variable $X$. ■

Consider $a, b \in R$ with $a \le b$. Since $P(a < X \le b) = P((X \le b) \cap (X \le a)^c) = P(X \le b) - P(X \le a)$, probabilities of the type $P(a < X \le b)$ can be calculated using $F_X$ by the equation
$$P(a < X \le b) = F_X(b) - F_X(a). \tag{6.1}$$

To illustrate these concepts, we need to enlarge our collection of probability spaces. In many experimental situations, the outcome of the experiment is a real number, and it is natural to take $\Omega = R$. It should be permissible to speak of the outcome being in some interval of real numbers. This means that $\mathcal{F}$ should at least include all intervals of the form $(a, b)$, $[a, b)$, $(a, b]$, $[a, b]$, $(a, +\infty)$, $[a, +\infty)$, and so forth.
It is a fact, but cannot be proved here, that there is a smallest $\sigma$-algebra $\mathcal{F}$ of subsets of $R$ that contains all intervals of the type just described. $\mathcal{F}$, however, does not contain all subsets of $R$. The reader may take comfort in the fact that any subset of $R$ encountered at this level will be in $\mathcal{F}$. As usual, subsets of $R$ in $\mathcal{F}$ are called events. Now that we have settled on $\Omega$ and $\mathcal{F}$, what do we do for a probability function $P$?

EXAMPLE 6.1 Consider a conceptual experiment in which a number is selected at random from the interval $(0, 1)$. What does this mean? It should mean that the probability that the number selected will be in the interval $(1/8, 1/4)$ should be the same as the probability that it will be in the interval $(7/8, 1)$ and also that it is twice as likely to be in the interval $(1/8, 3/8)$. This suggests that $P$ should be determined by the length of the interval, provided the interval is a subinterval of $(0, 1)$; i.e., $P((a, b)) = b - a$ whenever $0 \le a \le b \le 1$. Since no probability should be assigned to points outside $(0, 1)$, if $(a, b)$ is any interval, $P((a, b))$ should be equal to $P((a, b) \cap (0, 1))$; e.g., $P((-3, 1/2)) = P((0, 1/2)) = 1/2$. It can be shown that there is a probability function $P$ defined on $\mathcal{F}$ with these properties. ■

Example 6.1 can be modified by replacing the interval $(0, 1)$ by an interval $(a, b)$ with $-\infty < a < b < +\infty$, as indicated below:

[Diagram: the interval $(a, b)$ on the real line with a subinterval $(c, d)$ marked.]

and defining
$$P((c, d)) = \frac{d - c}{b - a} \tag{6.2}$$
whenever $a \le c \le d \le b$, $P((-\infty, a)) = 0$, and $P((b, +\infty)) = 0$. $P$ is then called a uniform probability measure on $(a, b)$. For this example and Example 6.1, it should be noted that $P(\{x\}) = 0$ for all $x \in R$. For example, if $x \in (a, b)$, then $\{x\} \subset (x - (1/n), x + (1/n)) \subset (a, b)$ for large $n$, and so
$$0 \le P(\{x\}) \le P\Big(\Big(x - \frac{1}{n}, x + \frac{1}{n}\Big)\Big) = \frac{2/n}{b - a} \to 0 \quad \text{as } n \to \infty.$$
Since $P(\{x\})$ does not depend upon $n$, $P(\{x\}) = 0$. Similar arguments can be used to show the same when $x \le a$ or $x \ge b$.
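The uniform probability of an interval can be computed directly from lengths. The following Python sketch is an illustration only; the function name `uniform_prob` is hypothetical, and it assumes, as in Example 6.1, that an arbitrary interval $(c, d)$ is first intersected with $(a, b)$ and that the length of the intersection is divided by $b - a$.

```python
def uniform_prob(c, d, a=0.0, b=1.0):
    """P((c, d)) under the uniform probability measure on (a, b)."""
    lo, hi = max(c, a), min(d, b)      # intersect (c, d) with (a, b)
    return max(hi - lo, 0.0) / (b - a)  # empty intersection has probability 0


print(uniform_prob(-3, 0.5))           # 0.5, as in Example 6.1
print(uniform_prob(1 / 8, 1 / 4))      # 0.125
print(uniform_prob(7 / 8, 1))          # 0.125
print(uniform_prob(1 / 8, 3 / 8))      # 0.25

# Upper bounds on P({x}) from shrinking intervals around x = 0.3.
x = 0.3
bounds = [uniform_prob(x - 1 / n, x + 1 / n) for n in (10, 100, 1000)]
```

The list `bounds` decreases toward 0 as $n$ grows, which mirrors the argument that a single point has probability zero.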
Since single points are assigned zero probability, $P((c, d]) = P([c, d)) = P([c, d]) = P((c, d))$.

Another way to modify Example 6.1, in addition to replacing $(0, 1)$ by $(a, b)$, is to take $\Omega = (a, b)$ and define $P$ only for intervals $(c, d) \subset (a, b)$ by Equation 6.2.

Every reader is familiar with experiments for which the above model is appropriate. The familiar pointer mounted on a circular disk is an example of an experiment in which a number between 0 and $2\pi$ is selected at random, although the interpretation of the outcome usually involves digitizing the outcome by assigning digits to equal sectors of the disk.

With such examples in mind, we can now exhibit nondiscrete random variables.

EXAMPLE 6.2 Let $(\Omega, \mathcal{F}, P)$ be a probability measure space where $\Omega = R$ and $P$ is the uniform probability measure on $(a, b)$, $-\infty < a < b < +\infty$. For each $\omega \in \Omega$, let $X(\omega) = \omega$. If $x \in R$, then $(X \le x) = \{\omega : X(\omega) \le x\} = \{\omega : \omega \le x\} = (-\infty, x] \in \mathcal{F}$ and $X$ is a random variable. $X$ is not discrete since it can take on every value in $R$. The distribution function $F_X$ of $X$ can be calculated as follows:

(i) If $x < a$, then $F_X(x) = P(X \le x) = P((-\infty, x]) = P((-\infty, x] \cap (a, b)) = P(\emptyset) = 0$ since $(-\infty, x] \cap (a, b) = \emptyset$.
(ii) If $a \le x < b$, then $F_X(x) = P(X \le x) = P((-\infty, x] \cap (a, b)) = P((a, x]) = P((a, x)) = (x - a)/(b - a)$.
(iii) If $x \ge b$, then $F_X(x) = P(X \le x) = P((-\infty, x] \cap (a, b)) = P((a, b)) = 1$.

Thus,
$$F_X(x) = \begin{cases} 0 & \text{if } x < a \\ (x - a)/(b - a) & \text{if } a \le x < b \\ 1 & \text{if } x \ge b. \end{cases}$$ ■

The random variable of this example is called a continuous random variable. The choice of "continuous" as a modifier is a traditional but poor one in that continuous in this context is analogous to a continuous distribution of mass as opposed to a discrete distribution and is in no way related to the concept of continuity of a function as studied in the calculus.

Other examples of continuous random variables can be constructed as follows.
Let $f : R \to R$ be a real-valued nonnegative function that is Riemann integrable on every subinterval of $R$ and for which the improper integral $\int_{-\infty}^{+\infty} f(t)\,dt$ is defined and equal to 1. Let $\Omega = R$, let $\mathcal{F}$ be the smallest $\sigma$-algebra containing all intervals of real numbers, and define $P(A)$ for $A \in \mathcal{F}$ by putting
$$P((a, b)) = \int_a^b f(t)\,dt$$
for any interval $(a, b)$. The integral on the right is also equal to $P((a, b])$, $P([a, b))$, and $P([a, b])$. Consider, for example, $P((a, b])$. Since
$$P((a, b)) \le P((a, b]) \le P\Big(\Big(a, b + \frac{1}{n}\Big)\Big) = \int_a^{b + (1/n)} f(t)\,dt$$
and the Riemann integral $\int_a^x f(t)\,dt$ is a continuous function of its upper limit,
$$P((a, b]) \le \lim_{n \to \infty} \int_a^{b + (1/n)} f(t)\,dt = \int_a^b f(t)\,dt = P((a, b)).$$
Therefore, $P((a, b]) = P((a, b))$. In calculating probabilities $P(I)$ for an interval $I$, we can remove or adjoin endpoints to $I$ without affecting the probabilities. Now define $X : \Omega \to R$ by putting $X(\omega) = \omega$ for all $\omega \in R$. Then $(X \le x) = \{\omega : X(\omega) \le x\} = \{\omega : \omega \le x\} = (-\infty, x]$ and
$$F_X(x) = P(X \le x) = P((-\infty, x]) = P((-\infty, x)) = \int_{-\infty}^{x} f(t)\,dt.$$

Caveat: Generally speaking, endpoints can be removed or adjoined in this way only when probabilities are computed by integrating a Riemann integrable function.

EXAMPLE 6.3 Consider the function
$$f(x) = \begin{cases} 0 & \text{if } x < 0 \\ e^{-x} & \text{if } x \ge 0. \end{cases}$$
Since $f$ is nonnegative and
$$\int_{-\infty}^{+\infty} f(t)\,dt = \int_0^{\infty} e^{-t}\,dt = \lim_{b \to +\infty} \int_0^b e^{-t}\,dt = \lim_{b \to +\infty} [-e^{-b} + 1] = 1,$$
there is a random variable $X$ with distribution function
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x} & \text{if } x \ge 0. \end{cases}$$
The graphs of $f$ and $F_X$ are shown in Figure 6.1. The functions $f$ and $F_X$ are called the exponential density function and the exponential distribution function, respectively. ■

As in the discrete case, it is necessary to perform various algebraic operations on random variables and deal with functions of random variables. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Given random variables $X$ and $Y$, we can define $X + Y$ by putting $(X + Y)(\omega) = X(\omega) + Y(\omega)$ for $\omega \in \Omega$ and define $XY$ by putting $(XY)(\omega) = X(\omega)Y(\omega)$ for $\omega \in \Omega$.
More generally, if $\phi$ is a function of $n$ real variables $x_1, \dots, x_n$ and $X_1, \dots, X_n$ are random variables, we can define $\phi(X_1, \dots, X_n)(\omega) = \phi(X_1(\omega), \dots, X_n(\omega))$; the above sum and product operations are special cases obtained by taking $\phi(x, y) = x + y$ and $\phi(x, y) = xy$, respectively.

Lemma 6.2.1 If $X$ is a random variable, then $(X < x)$, $(X \ge x)$, $(X > x) \in \mathcal{F}$ for all $x \in R$. If $a, b \in R$ with $a < b$, then $(a < X \le b) \in \mathcal{F}$.

PROOF: Let $Q = \{r_1, r_2, \dots\}$ be the countable collection of rational numbers. Suppose $x \in R$ and $X(\omega) < x$. Then there is an $r_j \in Q$ such that $X(\omega) \le r_j < x$. Conversely, if $r_j < x$ and $X(\omega) \le r_j$, then $X(\omega) < x$. Thus,
$$(X < x) = \bigcup_{r_j \in Q,\ r_j < x} (X \le r_j) \in \mathcal{F}$$
by definition of a random variable. Since $(X \le x) \in \mathcal{F}$, $(X > x) = (X \le x)^c \in \mathcal{F}$. By the first part of the proof, $(X < x) \in \mathcal{F}$, and so $(X < x)^c = (X \ge x) \in \mathcal{F}$. If $a < b$, then $(a < X \le b) = (X \le b) \cap (X > a) \in \mathcal{F}$. ■

FIGURE 6.1 Exponential density and distribution functions: $f(x) = e^{-x}$, $x \ge 0$, and $F_X(x) = 1 - e^{-x}$, $x \ge 0$.

It is easily seen that $(a \le X \le b)$, $(a < X < b)$, and $(a \le X < b)$ also belong to $\mathcal{F}$.

Theorem 6.2.2 If $X$, $Y$ are random variables and $a, b \in R$, then $aX + bY$, $XY$, and $|X|$ are all random variables.

PROOF: We first show that $aX$ is a random variable. If $a = 0$, then $aX = 0$ and $(aX \le x) = \emptyset$ if $x < 0$ and $(aX \le x) = \Omega$ if $x \ge 0$, and $aX$ is a random variable. If $a > 0$, then $(aX \le x) = (X \le x/a) \in \mathcal{F}$ for all $x \in R$, and if $a < 0$, then $(aX \le x) = (X \ge x/a) \in \mathcal{F}$ for all $x \in R$. Thus, $aX$ is a random variable. We now show that $X + Y$ is a random variable. Consider $(X + Y > z)$, $z \in R$. If $\omega$ is in this set, then $X(\omega) > z - Y(\omega)$ and there is a rational number $r_j$ such that $X(\omega) > r_j > z - Y(\omega)$. The converse is also true. Therefore,
$$(X + Y > z) = \bigcup_{r_j \in Q} [(X > r_j) \cap (Y > z - r_j)] \in \mathcal{F},$$
and therefore $(X + Y \le z) \in \mathcal{F}$. To show that $XY$ is a random variable, we first show that $X^2$ is a random variable. If $x < 0$, then $(X^2 \le x) = \emptyset \in \mathcal{F}$; if $x \ge 0$, then $(X^2 \le x) = (-\sqrt{x} \le X \le \sqrt{x}) \in \mathcal{F}$, as was to be proved.
Since $XY = (1/4)((X + Y)^2 - (X - Y)^2)$, $XY$ is a random variable by the previous steps. Since $(|X| \le x) = \emptyset$ for $x < 0$ and $(|X| \le x) = (-x \le X \le x)$ for $x \ge 0$, $|X|$ is a random variable. ■

It follows from this theorem that if $n$ is a positive integer, $X$ is a random variable, and $a_0, \dots, a_n$ are constants, then $p(X) = a_0 X^n + a_1 X^{n-1} + \cdots + a_n$ is a random variable; i.e., a polynomial function of a random variable is again a random variable. This result can be extended to continuous functions. That is, if $\phi : R \to R$ is a continuous function and $X$ is a random variable, then $\phi(X)$ is a random variable. Likewise, if $\phi$ is a continuous function of $n$ variables $x_1, \dots, x_n$ and $X_1, \dots, X_n$ are random variables, then $\phi(X_1, \dots, X_n)$ is a random variable. There is, however, trouble lurking beyond this point. In the case of a discrete random variable $X$, $\phi(X)$ is a random variable for any function $\phi : R \to R$. This fact need not be true for nondiscrete random variables. But since we will have no need to go beyond continuous functions of random variables, we will leave this matter where it belongs; namely, in a graduate course in real analysis.

One of the central problems we will take up has to do with finding the distribution function of $Y = \phi(X)$ knowing the distribution function of $X$.

EXAMPLE 6.4 Let $X$ be a random variable having the distribution function
$$F_X(x) = \int_{-\infty}^{x} f(t)\,dt$$
where
$$f(x) = \begin{cases} 1 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}$$
If $Y = X^2$, what is the distribution function of $Y$? If $y < 0$, then $F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(\emptyset) = 0$. If $0 \le y < 1$, then
$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \int_{-\sqrt{y}}^{\sqrt{y}} f(t)\,dt = \int_0^{\sqrt{y}} 1\,dt = \sqrt{y}.$$
If $y \ge 1$, then $F_Y(y) = P(Y \le y) = P(X^2 \le y) = \int_{-\sqrt{y}}^{\sqrt{y}} f(t)\,dt = 1$. Therefore,
$$F_Y(y) = \begin{cases} 0 & \text{if } y < 0 \\ \sqrt{y} & \text{if } 0 \le y < 1 \\ 1 & \text{if } y \ge 1. \end{cases}$$
The graph of $F_Y$ is shown in Figure 6.2.

FIGURE 6.2 Distribution function of $Y = X^2$.
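The calculation in this example can be checked with a short Python sketch. It is not part of the original text: the names `F_X` and `F_Y` are illustrative, the simulation uses the standard library's `random.random()` as a stand-in for choosing $X$ at random from $(0, 1)$, and the seed and sample size are arbitrary choices.

```python
import math
import random


def F_X(x):
    """Distribution function of X in Example 6.4 (density 1 on (0, 1))."""
    return min(max(x, 0.0), 1.0)


def F_Y(y):
    """P(X**2 <= y), computed as F_X(sqrt(y)) - F_X(-sqrt(y)) for y >= 0.

    Single points carry zero probability here, so including or excluding
    the endpoint -sqrt(y) does not change the value.
    """
    if y < 0:
        return 0.0
    r = math.sqrt(y)
    return F_X(r) - F_X(-r)


# Empirical check at y = 0.25, where the formula gives sqrt(0.25) = 0.5.
random.seed(0)
samples = [random.random() for _ in range(100_000)]
empirical = sum(x * x <= 0.25 for x in samples) / len(samples)
print(F_Y(0.25), empirical)
```

The empirical fraction should be close to $F_Y(0.25) = 0.5$, with the usual sampling error of a simulation.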
■

Consider a conceptual experiment in which a point is chosen at random from a region $S$ in the plane; e.g., a region encompassed by a simple closed curve. "At random" should mean that the probabilities that the chosen point will be in congruent subregions of $S$ should be the same and that the probability that the chosen point will be in disjoint subregions should be the sum of the probabilities of being in each. These criteria suggest that probabilities should be determined by areas; e.g., if $A \subset S$ is the shaded region depicted in Figure 6.3, then the probability that the chosen point will be in $A$ is given by
$$P(A) = \frac{|A|}{|S|}$$
where $|A|$ denotes the area of $A$. More generally, if $S$ is a region in the $n$-dimensional space $R^n$ and $A$ is a subregion, then $P(A)$ is defined by the same equation, with $|A|$ representing the $n$-dimensional volume of $A$. Probabilities defined in this way are called geometric probabilities.

FIGURE 6.3 Geometric probabilities.

EXAMPLE 6.5 Suppose a point is chosen at random from a region $S$ in the plane consisting of points $(x, y)$ with $x^2 + y^2 \le 1$; i.e., a point in a disk of radius 1 and having center at $(0, 0)$. Let $A$ be the set of points in a disk of radius 1/2 and having the same center. In this case, $|A| = \pi(1/2)^2 = \pi/4$, $|S| = \pi$, and
$$P(A) = \frac{|A|}{|S|} = \frac{1}{4}.$$ ■

EXERCISES 6.2

1. An experiment consists of choosing a point at random from a disk $D$ in the plane with center at $(0, 0)$ of radius 1. If $X$ is the distance of the point from the origin, find the distribution function of $X$.
2. An experiment consists of selecting a point at random from a ball in 3-space with center at the origin and radius 1. If $X$ is the distance from the origin, find the distribution function of $X$.
3. By choosing a point $X$ at random from the interval $[0, 1]$, the line segment $[0, 1]$ is broken into two line segments $[0, X]$ and $[X, 1]$. What is the probability that the length of the shorter segment will be less than or equal to one-fourth of the length of the longer segment?
4.
A point $X$ is chosen at random from the interval $[0, 1]$. What is the probability that the roots of the equation $4y^2 + 7Xy + 1 = 0$ will be real?
5. If
$$f(x) = \begin{cases} 0 & \text{if } x < -1 \\ x + 1 & \text{if } -1 \le x < 0 \\ 1 - x & \text{if } 0 \le x < 1 \\ 0 & \text{if } x \ge 1, \end{cases}$$
calculate $F(x) = \int_{-\infty}^{x} f(t)\,dt$ for each real number $x$.
6. Consider the function $g(x) = \dots$, $x \in R$. Find a constant $c$ such that $\int_{-\infty}^{+\infty} f(t)\,dt = 1$ where $f(t) = cg(t)$. If $X$ is a random variable having distribution function $F_X(x) = \int_{-\infty}^{x} f(t)\,dt$, calculate $P(-1 \le X \le 1)$.
7. Let
$$f(t) = \begin{cases} 1/2 & \text{if } -1 < t < 1 \\ 0 & \text{otherwise,} \end{cases}$$
$F(x) = \int_{-\infty}^{x} f(t)\,dt$, $x \in R$, and $X$ be a random variable having distribution function $F$. Calculate $F(x)$ for each $x \in R$. If Y = X12, what is the distribution function of $Y$?
8. Let $X$ be a random variable having the distribution function $F_X(x) = \int_{-\infty}^{x} f(t)\,dt$ where
$$f(t) = \begin{cases} 1 & \text{if } 0 < t < 1 \\ 0 & \text{otherwise.} \end{cases}$$
If $Y = X^3$, what is the distribution function of $Y$?
9. If
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ x^2 & \text{if } 0 \le x < 1/2 \\ x - (1/4) & \text{if } 1/2 \le x < 5/4 \\ 1 & \text{if } x \ge 5/4, \end{cases}$$
find a function $f(x)$, $x \in R$, such that $F(x) = \int_{-\infty}^{x} f(t)\,dt$ for all $x \in R$.
10. If $X : \Omega \to R$, show that the following statements are equivalent.
(a) $(X \le x) \in \mathcal{F}$ for all $x \in R$.
(b) $(X < x) \in \mathcal{F}$ for all $x \in R$.
(c) $(X \ge x) \in \mathcal{F}$ for all $x \in R$.
(d) $(X > x) \in \mathcal{F}$ for all $x \in R$.

DISTRIBUTION FUNCTIONS

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $X$ be a random variable. The distribution function $F_X$ of $X$ was defined in the previous section and is given by
$$F_X(x) = P(X \le x), \qquad x \in R.$$
If the random variable $X$ is known from context, the subscript $X$ will be suppressed. Before looking at properties of distribution functions, we review some definitions from the calculus. Consider a function $f : R \to R$ and let $a \in R$.

1. If there is a number $L$ with the property that for each $\epsilon > 0$ there is a $\delta > 0$ such that $|f(x) - L| < \epsilon$ whenever $a < x < a + \delta$, we write $\lim_{x \to a+} f(x) = L$. $L$ is usually denoted by $f(a+)$.
2.
If there is a number $l$ with the property that for each $\epsilon > 0$ there is a $\delta > 0$ such that $|f(x) - l| < \epsilon$ whenever $a - \delta < x < a$, we write $\lim_{x \to a-} f(x) = l$. $l$ is usually denoted by $f(a-)$.
3. If there is a number $L$ with the property that for each $\epsilon > 0$ there is an $M \in R$ such that $|f(x) - L| < \epsilon$ whenever $x > M$, we write $\lim_{x \to +\infty} f(x) = L$. $L$ is usually denoted by $f(+\infty)$.
4. If there is a number $l$ with the property that for each $\epsilon > 0$ there is an $m \in R$ such that $|f(x) - l| < \epsilon$ whenever $x < m$, we write $\lim_{x \to -\infty} f(x) = l$. $l$ is usually denoted by $f(-\infty)$.

Theorem 6.3.1 If $F$ is a distribution function, then
(i) $0 \le F(x) \le 1$ for all $x \in R$.
(ii) $F(x) \le F(y)$ whenever $x \le y$.
(iii) $F(-\infty) = 0$ and $F(+\infty) = 1$.
(iv) $F(x+) = \lim_{y \to x+} F(y)$ exists for each $x \in R$ and $F(x) = F(x+)$; $F(x-) = \lim_{y \to x-} F(y)$ exists for each $x \in R$.
(v) $F$ is right-continuous at each $x \in R$; i.e., $F(x) = F(x+)$ for all $x \in R$.
In addition, $F(x-) = P(X < x)$.

PROOF: (i) Since $F(x)$ is a probability, (i) is trivially true.

(ii) If $x \le y$, then $(X \le x) \subset (X \le y)$ and $F(x) = P(X \le x) \le P(X \le y) = F(y)$, so that (ii) is true.

(iii) We will prove only the second part of (iii), the proof of the first part being similar. We first prove that $\lim_{n \to \infty} F(n) = 1$. Note that $\{(X \le n)\}_{n=1}^{\infty}$ is an increasing sequence of events. For any $\omega \in \Omega$, $X(\omega)$ is a real number and there is an $n$ such that $X(\omega) \le n$; i.e., $\Omega \subset \bigcup_{n=1}^{\infty}(X \le n)$. Since the opposite relation is always true, the events $(X \le n)$ increase to $\Omega$. Therefore, $\lim_{n \to \infty} F(n) = \lim_{n \to \infty} P(X \le n) = P(\Omega) = 1$ by Theorem 2.5.3. Thus, for each $\epsilon > 0$ there is a positive integer $N$ such that $|F(N) - 1| < \epsilon$, which implies that $1 - \epsilon < F(N) \le 1$. If $x \ge N$, then $1 - \epsilon < F(N) \le F(x) \le 1 < 1 + \epsilon$, and so $|F(x) - 1| < \epsilon$ whenever $x \ge N$. This proves that $F(+\infty) = \lim_{x \to +\infty} F(x) = 1$.

(iv) Fix $x \in R$. Since the sequence of events $\{(X \le x + (1/n))\}_{n=1}^{\infty}$ is a decreasing sequence and $(X \le x) = \bigcap_{n=1}^{\infty}(X \le x + (1/n))$, $F(x) = \lim_{n \to \infty} F(x + (1/n))$ by Theorem 2.5.3.
Thus, for each $\epsilon > 0$ there is a positive integer $N$ such that $|F(x) - F(x + (1/N))| < \epsilon$, which implies that $F(x + (1/N)) < F(x) + \epsilon$. Suppose $x < y < x + (1/N)$. Then
$$F(x) - \epsilon < F(x) \le F(y) \le F\Big(x + \frac{1}{N}\Big) < F(x) + \epsilon;$$
i.e., $|F(y) - F(x)| < \epsilon$ whenever $x < y < x + (1/N)$. This shows that $F(x) = F(x+) = \lim_{y \to x+} F(y)$. In the second part of (iv), we can show only that the left limit $\lim_{y \to x-} F(y)$ exists; it need not be equal to $F(x)$. To show that the left limit at $x$ exists, note that the sequence of events $\{(X \le x - (1/n))\}_{n=1}^{\infty}$ is an increasing sequence with $\bigcup_{n=1}^{\infty}(X \le x - (1/n)) = (X < x)$. By Theorem 2.5.3,
$$\lim_{n \to \infty} F\Big(x - \frac{1}{n}\Big) = \lim_{n \to \infty} P\Big(X \le x - \frac{1}{n}\Big) = P(X < x).$$
Thus, given $\epsilon > 0$, there is a positive integer $N$ such that
$$P(X < x) - \epsilon < F\Big(x - \frac{1}{N}\Big) \le P(X < x).$$
Let $\delta = 1/N$. Suppose $x - \delta = x - (1/N) < y < x$. Let $M$ be a positive integer such that $y < x - (1/M) < x$. Then
$$P(X < x) - \epsilon < F\Big(x - \frac{1}{N}\Big) \le F(y) \le F\Big(x - \frac{1}{M}\Big) \le P(X < x);$$
i.e., $|F(y) - P(X < x)| < \epsilon$. We have shown that for each $\epsilon > 0$ there is a $\delta > 0$ such that $|F(y) - P(X < x)| < \epsilon$ whenever $x - \delta < y < x$; i.e., we have shown not only that the left limit at $x$ exists but also that
$$F(x-) = \lim_{y \to x-} F(y) = P(X < x).$$

(v) Statement (v) is just a restatement of the first part of (iv). ■

Corollary 6.3.2 For each $x \in R$, $P(X = x) = F(x+) - F(x-) = F(x) - F(x-)$.

PROOF: $P(X = x) = P((X \le x) \cap (X < x)^c) = P(X \le x) - P(X < x) = F(x) - F(x-)$. ■

Since $P(X < x) \le P(X \le x)$, we always have
$$F(x-) \le F(x) = F(x+),$$
but there may not be equality on the left. Note that $F$ is continuous at $x$ if and only if $F(x-) = F(x)$ and that $F$ can have jump discontinuities only as depicted in Figure 6.4, the magnitude of the jump at $x$ being $F(x) - F(x-)$. How large is the set of points of discontinuity of $F$?

Theorem 6.3.3 The set of points of discontinuity of a distribution function $F$ is at most countable.

PROOF: For each $n \ge 1$, let $D_n = \{x : F(x) - F(x-) \ge 1/n\}$.
Each $D_n$ is empty or finite, because otherwise the sum of the jumps of $F$ at points in $D_n$ would exceed 1, which cannot happen since the total increase in $F$ is 1. Since $D = \bigcup_{n=1}^{\infty} D_n$ is the set of points at which $F$ has a jump discontinuity, $D$ is at most countable, by Theorem 2.3.1. ■

EXAMPLE 6.6 Consider a random variable $X$ with distribution function
$$F(x) = \begin{cases} 0 & \text{if } x < 1 \\ (1/4)(x - 1) & \text{if } 1 \le x < 2 \\ 1/2 & \text{if } 2 \le x < 4 \\ (1/2)(x - 3) & \text{if } 4 \le x < 5 \\ 1 & \text{if } x \ge 5. \end{cases}$$
Clearly, $F(2-) = 1/4$ and $F(2) = F(2+) = 1/2$. $F$ is not continuous at 2, and $P(X = 2) = 1/4$. ■

FIGURE 6.4 Jump discontinuity.

Theorem 6.3.4 Given a function $F : R \to R$ with properties (i)-(v) in Theorem 6.3.1, there is a probability space and a random variable $X$ having $F$ as its distribution function.

Sketch of Proof: Take $\Omega = R$ and take $\mathcal{F}$ to be the smallest $\sigma$-algebra containing all intervals of real numbers. For each $\omega \in \Omega$, let $X(\omega) = \omega$. Since $(X \le a) = \{\omega : X(\omega) \le a\} = \{\omega : \omega \le a\} = (-\infty, a] \in \mathcal{F}$, $X$ is a random variable. For an interval $(a, b]$, define $P((a, b]) = F(b) - F(a)$. The function $P$ can then be extended to all events in $\mathcal{F}$. Thus,
$$F_X(x) = P(X \le x) = P((-\infty, x]) = \lim_{n \to -\infty} P((n, x]) = \lim_{n \to -\infty} (F(x) - F(n)) = F(x).$$ ■

If $f : R \to R$ is nonnegative and Riemann integrable on every interval of real numbers with $\int_{-\infty}^{+\infty} f(t)\,dt = 1$, then Theorem 6.3.4 applies to the function
$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$
Thus, there is a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $X$ having $F$ as its distribution function. The function $f$ is called a density function for $F$ and for $X$. Is the converse true? That is, given a random variable $X$ with distribution function $F$, is there a nonnegative Riemann integrable function $f : R \to R$ such that the above equation holds? If there were such a function $f$, $F$ would certainly have to be a continuous function, since the indefinite integral of $f$ is a continuous function of its upper limit.
In Example 6.6, the distribution function $F$ is not continuous and consequently does not have a density function. A positive answer to the question just posed requires that $F$ be at least continuous. But that is not enough to ensure that $F$ has a density function. There is a criterion for determining if there is such a function.

Definition 6.3 The function $F : R \to R$ is absolutely continuous if for each $\epsilon > 0$ there is a $\delta > 0$ such that
\[ \sum_{i=1}^{n} |F(\beta_i) - F(\alpha_i)| < \epsilon \]
whenever $n \ge 1$ and $(\alpha_1, \beta_1), \ldots, (\alpha_n, \beta_n)$ are nonoverlapping intervals with $\sum_{i=1}^{n} |\beta_i - \alpha_i| < \delta$. ■

If $F$ is absolutely continuous, then $F$ is continuous, as can be seen by taking $n = 1$ in this definition. The following theorem settles the question asked above.

Theorem 6.3.5 Let $F$ be a distribution function on $R$. Then $F$ is the indefinite integral of a function $f$ if and only if $F$ is absolutely continuous.

The function $f$ of this theorem can be identified, at most points of $R$, as the derivative $F'(x)$ of $F$. We will not elaborate on what is meant by "most points" at this stage. Such matters, as well as the proof of this theorem, are best left to advanced analysis courses. It should also be noted that the function $f$ is not unique. If $g : R \to R$ agrees with $f$ except at a finite number of points (or even at countably many points), then $F(x) = \int_{-\infty}^{x} g(t)\,dt$ also. In practice, the following fact can be used to establish that the distribution function $F$ has a density function. If $f(x) = F'(x)$ wherever the derivative is defined and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$, then $f$ is a density function for $F$.

Consider a situation in which it is known that the random variable $X$ has a density function $f$ and we want to find a density function, if there is one, for the random variable $Y = \phi(X)$ where $\phi : R \to R$ is continuous. The procedure for doing this is best illustrated by means of examples.

EXAMPLE 6.7 Suppose the random variable $X$ has the density
\[ f_X(x) = \begin{cases} e^{-x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases} \]
If $Y = X^2$, what is the density of $Y$?
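We first calculate the distribution function of $X$. For $x < 0$, $F_X(x) = \int_{-\infty}^{x} 0\,dt = 0$; for $x \ge 0$, $F_X(x) = \int_{0}^{x} e^{-t}\,dt = 1 - e^{-x}$. Thus,
\[ F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x} & \text{if } x \ge 0. \end{cases} \]
Let $G$ be the distribution function of $Y$. For $y < 0$, $G(y) = P(Y \le y) = 0$. Suppose $y \ge 0$. Then
\[ G(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = P(-\sqrt{y} < X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}). \]
Therefore,
\[ G(y) = \begin{cases} 0 & \text{if } y < 0 \\ F_X(\sqrt{y}) - F_X(-\sqrt{y}) & \text{if } y \ge 0. \end{cases} \]
Since $-\sqrt{y} \le 0$ for $y \ge 0$, $F_X(-\sqrt{y}) = 0$ and
\[ G(y) = \begin{cases} 0 & \text{if } y < 0 \\ 1 - e^{-\sqrt{y}} & \text{if } y \ge 0. \end{cases} \]
The density $g(y) = G'(y)$ is therefore given by
\[ g(y) = \begin{cases} 0 & \text{if } y \le 0 \\ \dfrac{1}{2\sqrt{y}}\, e^{-\sqrt{y}} & \text{if } y > 0. \end{cases} \]
■

EXAMPLE 6.8 Suppose the random variable $X$ has the density
\[ f_X(x) = \begin{cases} 1 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
and let $Y = \sqrt{X}$. Then
\[ F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases} \]
We can assume that $X$ takes on values in $[0, 1]$. The same will be true of $Y$. Therefore, if $G$ is the distribution function of $Y$, then $G(y) = 0$ if $y < 0$ and $G(y) = 1$ if $y > 1$. Suppose $0 \le y \le 1$. Then
\[ G(y) = P(Y \le y) = P(0 \le \sqrt{X} \le y) = P(0 \le X \le y^2) = F_X(y^2) - F_X(0) = F_X(y^2) = y^2. \]
Hence,
\[ G(y) = \begin{cases} 0 & \text{if } y < 0 \\ y^2 & \text{if } 0 \le y < 1 \\ 1 & \text{if } y \ge 1. \end{cases} \]
The density $g(y) = G'(y)$ is therefore given by
\[ g(y) = \begin{cases} 2y & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \]
■

In both of these examples, the distribution function of $Y = \phi(X)$ is obtained by converting $P(Y \le y)$ into a probability statement about $X$ using the properties of the function $\phi$.

EXAMPLE 6.9 (Cauchy Density) A source of light is mounted on one of two parallel walls, which are a unit distance apart as depicted in Figure 6.5. An angle $\Theta$ is chosen at random from the interval $(-\pi/2, \pi/2)$, measured from the perpendicular to the wall at the source, and a light beam is cast in that direction. The density function of $\Theta$ is then
\[ f_\Theta(\theta) = \begin{cases} 1/\pi & \text{if } -\pi/2 < \theta < \pi/2 \\ 0 & \text{otherwise,} \end{cases} \]
and the distribution function $F_\Theta(\theta)$ is given by
\[ F_\Theta(\theta) = \begin{cases} 0 & \text{if } \theta < -\pi/2 \\ (1/\pi)(\theta + (\pi/2)) & \text{if } -\pi/2 \le \theta < \pi/2 \\ 1 & \text{if } \theta \ge \pi/2. \end{cases} \]

FIGURE 6.6 $X = \tan\Theta$.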
Let $X$ be the directed distance from the nearest point on the opposite wall to the point where the light beam hits the opposite wall. Then $X = \tan\Theta$. Since $\Theta$ can take on values between $-\pi/2$ and $\pi/2$, $X$ can take on values between $-\infty$ and $+\infty$. For $-\infty < x < +\infty$, $F_X(x) = P(X \le x) = P(\tan\Theta \le x)$. To determine $P(\tan\Theta \le x)$, we must convert the statement $\tan\Theta \le x$ into a statement about $\Theta$. Since $\arctan x$ is an increasing function on $(-\infty, +\infty)$, $\Theta = \arctan(\tan\Theta) \le \arctan x$ whenever $\tan\Theta \le x$ and conversely. Thus, $P(\tan\Theta \le x) = P(\Theta \le \arctan x)$, but we should keep in mind that $\Theta$ takes on values between $-\pi/2$ and $\pi/2$, so that
\[ F_X(x) = P(\tan\Theta \le x) = P\left(-\frac{\pi}{2} < \Theta \le \arctan x\right) = F_\Theta(\arctan x) - F_\Theta\left(-\frac{\pi}{2}\right) = \frac{1}{\pi}\left(\arctan x + \frac{\pi}{2}\right). \]
Therefore,
\[ f_X(x) = F_X'(x) = \frac{1}{\pi(1 + x^2)}, \qquad -\infty < x < +\infty. \tag{6.3} \]
Equation 6.3 can be obtained by looking at the graph of the tangent function in Figure 6.6; for $\tan\Theta \le x$, $\Theta$ must be in the interval from $-\pi/2$ to $\arctan x$, and according to the definition of "random,"
\[ F_X(x) = P(\tan\Theta \le x) = P\left(-\frac{\pi}{2} < \Theta \le \arctan x\right) = \frac{\arctan x - (-\pi/2)}{\pi} \]
so that
\[ f_X(x) = F_X'(x) = \frac{1}{\pi(1 + x^2)}, \qquad -\infty < x < +\infty. \]
This function is known as the Cauchy density. ■

EXERCISES 6.3

1. Calculate $F'(x)$ for the distribution function of Example 6.6 and verify that $F$ is not the indefinite integral of $F'$.

2. Let $X$ be a random variable with distribution function
\[ F(x) = \begin{cases} 0 & \text{if } x < 0 \\ x^2/4 & \text{if } 0 \le x < 1 \\ 1/2 & \text{if } 1 \le x < 2 \\ (1/2)(x - 1) & \text{if } 2 \le x < 3 \\ 1 & \text{if } x \ge 3. \end{cases} \]
Calculate (a) $P(0 < X < 1)$. (b) $P(0 < X \le 1)$. (c) $P(X = 1)$. (d) $P(1/2 < X < 5/2)$.

3. Let $X$ be a random variable having density function
\[ f_X(x) = \begin{cases} 1/\pi & \text{if } -\pi/2 < x < \pi/2 \\ 0 & \text{otherwise} \end{cases} \]
and let $Y = \sin X$. Find a density function for $Y$.

4. Let $X$ be a random variable with distribution function
\[ F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x} & \text{if } x \ge 0. \end{cases} \]
If $M > 0$, let $Y = \min(X, M)$. Determine the distribution function of $Y$. Does $Y$ have a density function?

5. Let $X$ be a random variable with distribution function
\[ F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x} & \text{if } x \ge 0 \end{cases} \]
and let $Y = \sqrt{X}$.
Since $P(X \ge 0) = 1$, $\sqrt{X}$ is defined. What is the density function of $Y$?

6. Let $X$ be a random variable having a density function
\[ f_X(x) = \begin{cases} e^{-x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \]
and let $Y = \log X$. What is the density of $Y$?

7. Consider a searchlight that is mounted on a wall. An angle $\Theta$ is chosen at random between $-\pi/2$ and $\pi/2$, as measured from a perpendicular to the wall, and the light beam falls on an object 100 units away. If $X$ denotes the distance of the object from the wall, what are the distribution and density functions of $X$?

6.4 JOINT DISTRIBUTION FUNCTIONS

It is not unusual in experimental situations to consider two or more numerical attributes of an outcome simultaneously. For the time being, we will consider only two attributes. Let $X$, $Y$ be two random variables on the probability space $(\Omega, \mathcal{F}, P)$. Then $P(X \le x, Y \le y)$ is meaningful and defines a function $F_{X,Y}$ of two real variables.

Definition 6.4 If $x, y \in R$, the function of two real variables
\[ F_{X,Y}(x, y) = P(X \le x, Y \le y) \]
is called the joint distribution function of the pair $(X, Y)$. ■

EXAMPLE 6.10 Suppose a point is chosen at random from the unit square $U$ in the plane with opposite vertices at $(0, 0)$ and $(1, 1)$. Clearly, we should take $\Omega = U$ and "random" should mean that the probabilities that the outcome will be in two congruent regions in $\Omega$ will be the same. This suggests that the probability that the outcome will be in a region $A$ within $\Omega$ should be equal to the area of the region divided by the area of the unit square, which is 1. Thus, $P(A) = \mathrm{Area}(A)$. If $\omega = (x, y) \in \Omega$, let $X(\omega) = x$ and let $Y(\omega) = y$. Then $X$ and $Y$ should be random variables with
\[ F_{X,Y}(x, y) = \begin{cases} 0 & \text{if } x < 0 \text{ or } y < 0 \\ x & \text{if } 0 \le x \le 1,\ y > 1 \\ y & \text{if } x > 1,\ 0 \le y \le 1 \\ xy & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 1 & \text{if } x > 1,\ y > 1. \end{cases} \]
As an example of the computation of $F_{X,Y}(x, y)$, suppose $0 \le x \le 1$, $0 \le y \le 1$. Then $F_{X,Y}(x, y)$ is the area of the shaded rectangle with opposite vertices at $(0, 0)$ and $(x, y)$ as depicted in Figure 6.7. Since the area is $xy$, $F_{X,Y}(x, y) = xy$.
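The formula of Example 6.10 can be checked by simulation. The following Python sketch is not part of the original text; the function names, the number of trials, and the seed are illustrative assumptions. It estimates $P(X \le x, Y \le y)$ as the fraction of simulated points of the unit square falling in the region $\{u \le x, v \le y\}$ and compares this with the piecewise formula.

```python
import random


def joint_cdf_exact(x, y):
    """F_{X,Y}(x, y) for a point chosen at random from the unit square."""
    # Clamping each coordinate to [0, 1] reproduces all five cases
    # of the piecewise formula in Example 6.10.
    cx = min(max(x, 0.0), 1.0)
    cy = min(max(y, 0.0), 1.0)
    return cx * cy


def joint_cdf_estimate(x, y, trials=200_000, seed=0):
    """Fraction of simulated points (X, Y) with X <= x and Y <= y."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatability
    hits = 0
    for _ in range(trials):
        px = rng.random()  # x-coordinate of the chosen point
        py = rng.random()  # y-coordinate of the chosen point
        if px <= x and py <= y:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for point in [(0.3, 0.5), (0.7, 2.0), (-1.0, 0.5)]:
        print(point, joint_cdf_exact(*point), joint_cdf_estimate(*point))
```

With this many trials the estimate should typically agree with the exact value to about two decimal places; the agreement is statistical, not exact.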
■

FIGURE 6.7 $F_{X,Y}(x, y)$, $0 \le x \le 1$, $0 \le y \le 1$.

Joint distribution functions have properties similar to those listed in Theorem 6.3.1 for a distribution function of a single random variable.

Theorem 6.4.1 If $F$ is a joint distribution function, then

(i) $0 \le F(x, y) \le 1$ for all $(x, y) \in R^2$.

(ii) If $(x_1, y_1)$ and $(x_2, y_2)$ are any two points in $R^2$ with $x_1 \le x_2$ and $y_1 \le y_2$, then $F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0$.

(iii) $\lim_{x \to +\infty,\, y \to +\infty} F(x, y) = 1$; $\lim_{x \to -\infty,\, y \to -\infty} F(x, y) = 0$; for each $y \in R$, $\lim_{x \to -\infty} F(x, y) = 0$; and for each $x \in R$, $\lim_{y \to -\infty} F(x, y) = 0$.

(iv) For each $(a, b) \in R^2$, $\lim_{x \to a+,\, y \to b+} F(x, y) = F(a, b)$, $\lim_{x \to a+} F(x, b) = F(a, b)$, and $\lim_{y \to b+} F(a, y) = F(a, b)$.

(v) For each $(a, b) \in R^2$, $\lim_{x \to a-} F(x, b)$ and $\lim_{y \to b-} F(a, y)$ exist.

The inequality of (ii) is easily reconstructed using Figure 6.8 by associating with each of the vertices a $+$ or $-$ sign, starting with a $+$ at the upper right vertex and alternating signs, and then applying the signs to the value of $F$ at the corresponding point. Except for (ii), proofs of these statements are similar to the proofs of the statements in Theorem 6.3.1.

Proof of (ii) Let $(x_1, y_1)$ and $(x_2, y_2)$ be two arbitrary points in the plane with $x_1 \le x_2$ and $y_1 \le y_2$. Then
\[ F(x_2, y_2) - F(x_1, y_2) - F(x_2, y_1) + F(x_1, y_1) \ge 0. \tag{6.4} \]
This inequality follows from the fact that
\begin{align*}
0 &\le P(x_1 < X \le x_2,\ y_1 < Y \le y_2) \\
&= P((X \le x_2) \cap (y_1 < Y \le y_2) \cap (X \le x_1)^c) \\
&= P(X \le x_2,\ y_1 < Y \le y_2) - P(X \le x_1,\ y_1 < Y \le y_2) \\
&= P((X \le x_2) \cap (Y \le y_2) \cap (Y \le y_1)^c) - P((X \le x_1) \cap (Y \le y_2) \cap (Y \le y_1)^c) \\
&= P(X \le x_2, Y \le y_2) - P(X \le x_2, Y \le y_1) - P(X \le x_1, Y \le y_2) + P(X \le x_1, Y \le y_1) \\
&= F_{X,Y}(x_2, y_2) - F_{X,Y}(x_2, y_1) - F_{X,Y}(x_1, y_2) + F_{X,Y}(x_1, y_1)
\end{align*}
since $(X \le x_2) \cap (Y \le y_2) \cap (Y \le y_1) = (X \le x_2) \cap (Y \le y_1)$, and so forth. ■

In the case of a single random variable $X$, Theorem 6.3.5 gives a necessary and sufficient condition that $F_X$ have a density function; namely, that $F_X$ be absolutely continuous.
This concept can be extended to joint distribution functions, but it is best left to more advanced courses.

Definition 6.5 A nonnegative Riemann integrable function $f : R^2 \to R$ is a density function for the joint distribution function $F$ if
\[ F(x, y) = \iint_{A_{x,y}} f(u, v)\,da \qquad \text{for all } x, y \in R \]
where $A_{x,y} = \{(u, v) : u \le x, v \le y\}$ and $da$ denotes integration with respect to area. ■

In practice, the preceding double integral over $A_{x,y}$ is calculated using iterated integrals, since
\[ \iint_{A_{x,y}} f(u, v)\,da = \int_{-\infty}^{x} \left( \int_{-\infty}^{y} f(u, v)\,dv \right) du = \int_{-\infty}^{y} \left( \int_{-\infty}^{x} f(u, v)\,du \right) dv. \]
We will assume in the remainder of this section that all joint distribution functions have density functions.

FIGURE 6.8 Alternating signs.

Most distribution functions come about by starting with a nonnegative Riemann integrable function $f : R^2 \to R$ with total integral 1 and constructing a probability space $(\Omega, \mathcal{F}, P)$ and a pair of random variables $(X, Y)$ such that $F_{X,Y}$ has density function $f$ by imitating the construction of the previous section as follows. Let $\Omega = R^2$, let $\mathcal{F}$ be the smallest $\sigma$-algebra containing all rectangles in $R^2$, and for $\omega = (x, y) \in \Omega$ let $X(\omega) = x$, $Y(\omega) = y$. Then $X$ and $Y$ are random variables. For any rectangle $I \subset R^2$, define
\[ P(I) = \iint_{I} f(u, v)\,da, \]
noting that including or excluding the edges of $I$ has no effect on the value of the double integral. The probability function $P$ can then be extended to all events in $\mathcal{F}$. Since $(X \le x, Y \le y) = \{\omega : X(\omega) \le x, Y(\omega) \le y\} = \{(u, v) : u \le x, v \le y\}$,
\[ F_{X,Y}(x, y) = P(X \le x, Y \le y) = \iint_{A_{x,y}} f(u, v)\,da \]
where $A_{x,y} = \{(u, v) : u \le x, v \le y\}$. It follows that the pair $(X, Y)$ has $f$ as density function. It can be shown that for $A \in \mathcal{F}$,
\[ P((X, Y) \in A) = \iint_{A} f(u, v)\,da. \]
Without getting involved deeply in integration theory, for computational purposes we must limit the class of regions $A$ for which this probability can be calculated.
For example, if $a < b$, $\phi_1$, $\phi_2$ are continuous functions on $[a, b]$ with $\phi_1(x) \le \phi_2(x)$, $x \in [a, b]$, and $A = \{(u, v) : a \le u \le b,\ \phi_1(u) \le v \le \phi_2(u)\}$, then
\[ P((X, Y) \in A) = \iint_{A} f(u, v)\,da = \int_{a}^{b} \left( \int_{\phi_1(u)}^{\phi_2(u)} f(u, v)\,dv \right) du. \]

EXAMPLE 6.11 Let $X$ and $Y$ be random variables with joint density function
\[ f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \]
Suppose we want to calculate $P(X^2 + Y^2 \le 1)$. We must first define a region $A \subset R^2$ such that $(X^2 + Y^2 \le 1) = ((X, Y) \in A)$. This is done by formally replacing $X$ and $Y$ by typical values $u$ and $v$, respectively. In this case, we let $A = \{(u, v) : u^2 + v^2 \le 1\}$, as shown in Figure 6.9. It is easy to see that $(X(\omega), Y(\omega)) \in A$ if and only if $X^2(\omega) + Y^2(\omega) \le 1$. Therefore,
\[ P(X^2 + Y^2 \le 1) = P((X, Y) \in A) = \iint_{A} f_{X,Y}(u, v)\,da. \]
Recalling that the integrand vanishes outside the unit square $U$ and is equal to 1 inside, the integral is equal to $\mathrm{Area}(A \cap U) = \pi/4$. Thus, $P(X^2 + Y^2 \le 1) = \pi/4$. ■

FIGURE 6.9 $P(X^2 + Y^2 \le 1)$.

Knowing the joint distribution function or density function of the random variables $X$ and $Y$, the corresponding individual distribution or density functions can be obtained.

Theorem 6.4.2 If $X$ and $Y$ have the joint distribution function $F_{X,Y}$, then

(i) $F_X(x) = \lim_{y \to +\infty} F_{X,Y}(x, y)$.

(ii) $F_Y(y) = \lim_{x \to +\infty} F_{X,Y}(x, y)$.

In addition, if $X$ and $Y$ have a joint density $f_{X,Y}$, then

(iii) $f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, v)\,dv$, $x \in R$, and

(iv) $f_Y(y) = \int_{-\infty}^{+\infty} f_{X,Y}(u, y)\,du$, $y \in R$,

are densities for $X$ and $Y$, respectively.

PROOF: The first step in proving (i) is to show that
\[ F_X(x) = \lim_{n \to +\infty} F_{X,Y}(x, n). \]
Since the sequence of events $\{(X \le x, Y \le n)\}$ increases to the event $(X \le x)$ as $n \to +\infty$ for each $x \in R$,
\[ \lim_{n \to +\infty} F_{X,Y}(x, n) = \lim_{n \to +\infty} P(X \le x, Y \le n) = P(X \le x) = F_X(x). \]
The rest of the proof of (i) is the same as the corresponding part of the proof of (iii) in Theorem 6.3.1. To prove (iii), let
\[ g(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, v)\,dv. \]
Since
\[ F_X(x) = P(X \le x,\ -\infty < Y < +\infty) = \int_{-\infty}^{x} \left( \int_{-\infty}^{+\infty} f_{X,Y}(u, v)\,dv \right) du = \int_{-\infty}^{x} g(u)\,du, \]
it follows that $g$ is a density for $X$.
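Formula (iii) of Theorem 6.4.2 lends itself to a direct numerical check. The Python sketch below is an illustration only: the joint density $f(u, v) = u + v$ on the unit square is a hypothetical example chosen for this purpose (it is not taken from the text), and the function names and the midpoint rule are likewise assumptions. For this density the inner integral can be done by hand, giving $f_X(x) = x + 1/2$ for $0 \le x \le 1$.

```python
def joint_density(u, v):
    """Hypothetical joint density (an assumption for illustration):
    f(u, v) = u + v on the unit square and 0 elsewhere."""
    if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0:
        return u + v
    return 0.0


def marginal_x(x, lo=0.0, hi=1.0, steps=1000):
    """Midpoint-rule approximation of the integral of f(x, v) dv over [lo, hi].

    The interval [lo, hi] must contain every v at which f(x, v) is nonzero.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        v = lo + (k + 0.5) * h  # midpoint of the k-th subinterval
        total += joint_density(x, v)
    return total * h


if __name__ == "__main__":
    for x in (0.0, 0.3, 0.9, 1.5):
        print(x, marginal_x(x))
```

Because the integrand is linear in $v$, the midpoint rule is exact here up to rounding; for a general density the number of steps would have to be chosen with more care.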
■

When $f_X$ and $f_Y$ are obtained in this way, they are called marginal density functions.

EXAMPLE 6.12 Suppose a point is chosen at random from a disk $D$ with center $(0, 0)$ and radius 1. Let $X$ be the $x$-coordinate of the chosen point and $Y$ the $y$-coordinate. The joint density $f_{X,Y}$ is then
\[ f_{X,Y}(u, v) = \begin{cases} 1/\pi & \text{if } u^2 + v^2 \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
The density of $X$ is then given by $f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, v)\,dv$, $-\infty < x < +\infty$. To evaluate the integral, it is necessary to consider the cases $x < -1$, $-1 \le x \le 1$, and $x > 1$ separately. When $x < -1$, the function $f_{X,Y}(x, v)$ vanishes on the vertical line through $x$ on the $u$-axis, and so $f_X(x) = 0$. The same is true when $x > 1$. When $-1 \le x \le 1$, $f_{X,Y}(x, \cdot)$ is equal to $1/\pi$ on the line segment joining $(x, -\sqrt{1 - x^2})$ to $(x, \sqrt{1 - x^2})$ as depicted in Figure 6.10 and equal to 0 at other points of the vertical line through $x$, so that
\[ f_X(x) = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\,dv = \frac{2}{\pi}\sqrt{1 - x^2}. \]

FIGURE 6.10 Marginal density.

Therefore,
\[ f_X(x) = \begin{cases} \dfrac{2}{\pi}\sqrt{1 - x^2} & \text{if } -1 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
■

This example illustrates a technique for finding the density function of a random variable $X$. The consideration of a second random variable $Y$ can result in the determination of the joint density $f_{X,Y}$, from which $f_X$ can be obtained as above. Suppose now that we are given the joint density function of two random variables $X$ and $Y$ and we would like to find the density $f_Z$ of the sum $Z = X + Y$, assuming there is such a density.

Theorem 6.4.3 Let $X$ and $Y$ be random variables having a joint density $f_{X,Y}$ and let $Z = X + Y$. The density of $Z$ exists and is given by
\[ f_Z(z) = \int_{-\infty}^{+\infty} f_{X,Y}(u, z - u)\,du = \int_{-\infty}^{+\infty} f_{X,Y}(z - v, v)\,dv, \qquad z \in R. \tag{6.5} \]

PROOF: Consider first the distribution function $F_Z$. Since $F_Z(z) = P(Z \le z) = P(X + Y \le z)$,
\[ F_Z(z) = \iint_{\{u + v \le z\}} f_{X,Y}(u, v)\,da = \int_{-\infty}^{+\infty} \left( \int_{-\infty}^{z - u} f_{X,Y}(u, v)\,dv \right) du. \]
With $u$ fixed, let $w = v + u$ in the inner integral to obtain
\[ F_Z(z) = \int_{-\infty}^{+\infty} \left( \int_{-\infty}^{z} f_{X,Y}(u, w - u)\,dw \right) du = \int_{-\infty}^{z} \left( \int_{-\infty}^{+\infty} f_{X,Y}(u, w - u)\,du \right) dw. \]
If we let
\[ g(z) = \int_{-\infty}^{+\infty} f_{X,Y}(x, z - x)\,dx = \int_{-\infty}^{+\infty} f_{X,Y}(u, z - u)\,du, \]
then $F_Z(z) = \int_{-\infty}^{z} g(w)\,dw$.
Therefore, $g$ is a density for the random variable $Z$. ■

The formula for the density of $Z$ in Theorem 6.4.3 takes on a more usable form in the case of independent random variables.

Definition 6.6 The random variables $X$ and $Y$ are independent if
\[ P(X \le x, Y \le y) = P(X \le x)P(Y \le y) \qquad \text{for all } x, y \in R. \]
■

This definition is clearly equivalent to the requirement that $F_{X,Y}(x, y) = F_X(x)F_Y(y)$ for all $x, y \in R$. If $X$ and $Y$ are independent random variables and $\phi$ and $\psi$ are continuous real-valued functions on $R$, then $\phi(X)$ and $\psi(Y)$ are also independent random variables. As in the discrete case, the proof of this fact will be omitted. If $X$ and $Y$ have a joint density $f_{X,Y}$, then we know that $X$ has a density $f_X$ and $Y$ has a density $f_Y$.

Theorem 6.4.4 Let $X$ and $Y$ be random variables having a joint density. Then $X$ and $Y$ are independent if and only if $f_X(x)f_Y(y)$ is a density for $X$ and $Y$.

PROOF: Suppose $f_X(x)f_Y(y)$ is a joint density for $X$ and $Y$. Then
\begin{align*}
P(X \le x, Y \le y) &= \int_{-\infty}^{x} \left( \int_{-\infty}^{y} f_X(u)f_Y(v)\,dv \right) du \\
&= \int_{-\infty}^{x} f_X(u) \left( \int_{-\infty}^{y} f_Y(v)\,dv \right) du \\
&= \int_{-\infty}^{x} f_X(u)P(Y \le y)\,du \\
&= P(Y \le y) \int_{-\infty}^{x} f_X(u)\,du \\
&= P(Y \le y)P(X \le x),
\end{align*}
and therefore $X$ and $Y$ are independent random variables. On the other hand, if $X$ and $Y$ are independent, then
\begin{align*}
F_{X,Y}(x, y) &= P(X \le x, Y \le y) = P(X \le x)P(Y \le y) \\
&= \left( \int_{-\infty}^{x} f_X(u)\,du \right) \left( \int_{-\infty}^{y} f_Y(v)\,dv \right) \\
&= \iint_{A_{x,y}} f_X(u)f_Y(v)\,da,
\end{align*}
so that $f_X(x)f_Y(y)$ is a joint density for $X$ and $Y$. ■

Theorem 6.4.5 If the random variables $X$ and $Y$ are independent with densities $f_X$ and $f_Y$, respectively, and $Z = X + Y$, then $Z$ has the density
\[ f_Z(z) = \int_{-\infty}^{+\infty} f_X(x)f_Y(z - x)\,dx = \int_{-\infty}^{+\infty} f_X(z - y)f_Y(y)\,dy, \qquad z \in R. \]

PROOF: Theorem 6.4.3 ■

The above equations take on simpler forms if, in addition, $X$ and $Y$ are nonnegative random variables. In this case,
\[ f_Z(z) = \begin{cases} \int_{0}^{z} f_X(x)f_Y(z - x)\,dx & \text{if } z \ge 0 \\ 0 & \text{if } z < 0, \end{cases} \tag{6.6} \]
with a similar result holding for integration with respect to $y$ instead of $x$. Since $Z$ is nonnegative, $f_Z(z) = 0$ for $z < 0$.
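For $z > 0$, the integral over $(-\infty, +\infty)$ can be replaced by the integral over $(0, +\infty)$ since $f_X(x) = 0$ for $x < 0$; since $f_Y(z - x) = 0$ when $x > z$, the integral over $(0, +\infty)$ can be replaced by the integral over $(0, z)$.

Caveat: In real life, independence of random variables is the exception rather than the rule.

EXAMPLE 6.13 Let $X$ and $Y$ have the joint density
\[ f_{X,Y}(x, y) = \begin{cases} e^{-x-y} & \text{if } x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise.} \end{cases} \]
Since the joint density vanishes outside the first quadrant, the pair $(X, Y)$ is in the first quadrant with probability 1 and outside with probability 0. Thus, $f_X(x) = 0$ if $x < 0$ and $f_Y(y) = 0$ if $y < 0$. Suppose $x \ge 0$. Then
\[ f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, v)\,dv = \int_{0}^{+\infty} e^{-x-v}\,dv = e^{-x}. \]
Thus,
\[ f_X(x) = \begin{cases} e^{-x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases} \]
Similarly,
\[ f_Y(y) = \begin{cases} e^{-y} & \text{if } y \ge 0 \\ 0 & \text{if } y < 0. \end{cases} \]
Since $f_{X,Y}(x, y) = f_X(x)f_Y(y)$, the random variables $X$ and $Y$ are independent. Now let $Z = X + Y$. Again $f_Z(z) = 0$ if $z < 0$. For $z \ge 0$,
\[ f_Z(z) = \int_{-\infty}^{+\infty} f_X(x)f_Y(z - x)\,dx = \int_{0}^{+\infty} e^{-x}f_Y(z - x)\,dx. \]
Since $f_Y(z - x) = 0$ when $z - x < 0$,
\[ f_Z(z) = \int_{0}^{z} e^{-x}e^{-(z - x)}\,dx = ze^{-z}. \]
Therefore,
\[ f_Z(z) = \begin{cases} ze^{-z} & \text{if } z \ge 0 \\ 0 & \text{if } z < 0. \end{cases} \]
■

EXERCISES 6.4

1. Let $X$ and $Y$ be independent random variables having density functions
\[ f_X(x) = \begin{cases} 1 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad f_Y(y) = \begin{cases} 1 & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases} \]
Calculate $P(Y < X)$.

2. Let $X$ and $Y$ be independent random variables having density functions
\[ f_X(x) = \begin{cases} 2e^{-2x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad f_Y(y) = \begin{cases} 3e^{-3y} & \text{if } y \ge 0 \\ 0 & \text{if } y < 0. \end{cases} \]
Calculate $P(Y < X)$.

3. Let
\[ f_{X,Y}(x, y) = \begin{cases} c(1 - x^2 - y^2) & \text{if } 0 \le x^2 + y^2 \le 1,\ x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise} \end{cases} \]
be a density function. Find the value of $c$ and calculate $P(X \ge 1/2)$.

4. Let
\[ f_{X,Y}(x, y) = \begin{cases} xe^{[\ldots]} & \text{if } x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise.} \end{cases} \]
Find $f_X(x)$ and $f_Y(y)$.

5. Let $X$ and $Y$ be independent random variables with densities
\[ f_X(x) = \begin{cases} [\ldots] & \text{if } 0 < x < 1 \\ [\ldots] & \text{otherwise} \end{cases} \qquad f_Y(y) = \begin{cases} [\ldots] & \text{if } y > 0 \\ [\ldots] & \text{if } y < 0. \end{cases} \]
What is the density of $Z = X + Y$?

6. Let $X$ and $Y$ be the random variables of Problem 2 and let $Z = \min(X, Y)$. Find the density of $Z$.

7. Let $X$ and $Y$ be the random variables of Problem 2 and let $Z = X + Y$. Find the density of $Z$.

8.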
Let $X$ and $Y$ be the random variables of Problem 1 and let $Z = X + Y$. Find the density of $Z$. (Hint: Calculate the distribution function by considering the intersection of the region $\{(u, v) : v \le z - u\}$ with the unit square $U$ in the $uv$-plane.)

6.5 COMPUTATIONS WITH DENSITIES

We begin this section by cataloging several common density functions, special cases of which were seen in the previous section.

EXAMPLE 6.14 A density commonly used for many board games that employ a spinner is the uniform density on $[a, b]$ defined by
\[ f(x) = \begin{cases} 1/(b - a) & \text{if } a \le x \le b \\ 0 & \text{otherwise.} \end{cases} \]
■

EXAMPLE 6.15 The continuous version of the discrete geometric density is the exponential density with parameter $\lambda > 0$ defined by
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{if } x \le 0. \end{cases} \]
■

Waiting times connected with continuously varying random processes sometimes have an exponential density. One of the best known density functions is the normal density (or Gauss density or Laplace density).

EXAMPLE 6.16 Consider the function $e^{-x^2/2}$ for $x \ge 0$ and let $c = \int_{0}^{+\infty} e^{-x^2/2}\,dx$. The constant $c$ can be determined indirectly by calculating
\[ c^2 = \iint_{A} e^{-(x^2 + y^2)/2}\,da \]
where $A = \{(x, y) : x \ge 0, y \ge 0\}$ is the first quadrant in the plane. Transforming to polar coordinates,
\[ c^2 = \int_{0}^{\pi/2} \left( \int_{0}^{+\infty} e^{-r^2/2}\,r\,dr \right) d\theta = \frac{\pi}{2}. \]
Thus, $c = \int_{0}^{+\infty} e^{-x^2/2}\,dx = \sqrt{\pi/2}$, and so $\int_{-\infty}^{+\infty} e^{-x^2/2}\,dx = 2c = \sqrt{2\pi}$. If we define
\[ \phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad x \in R, \tag{6.7} \]
then $\phi$ can serve as the density function of a random variable. The density $\phi$ is called the standard normal density. The graph of the standard normal density is depicted in Figure 6.11. ■

FIGURE 6.11 Standard normal density.

The corresponding distribution function defined by
\[ \Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt, \qquad x \in R, \]
is called the standard normal distribution function. Values of $\Phi(x)$ have been calculated using Maple V software and are given in the Standard Normal Distribution Function table (see p. 346).
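Values of $\Phi$ can also be computed directly rather than read from a table. The Python sketch below is not from the original text; it relies on the standard identity $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$ and on the error function `math.erf` from the Python standard library. The function name `std_normal_cdf` is an illustrative choice.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    # Print a few values rounded to four places, as in a typical table.
    for x in (0.0, 0.75, 1.5, 2.5):
        print(x, round(std_normal_cdf(x), 4))
```

Rounded to four decimal places, such values can be compared against the entries of the table.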
$\Phi(x)$ can be determined for negative values of $x$ from this table by using the fact that $\phi$ is a symmetric function, so that $\Phi(x) = 1 - \Phi(-x)$. For example, $\Phi(-.75) = 1 - \Phi(.75) = .2266$.

If $X$ is any random variable with density $f_X$ and $Y = aX + b$, $a, b \in R$, $a \ne 0$, then $f_Y$ can be expressed in terms of $f_X$ as follows. Suppose first that $a > 0$. Since $F_Y(y) = P(Y \le y) = P(X \le (y - b)/a) = F_X((y - b)/a)$, $f_Y(y) = (1/a)f_X((y - b)/a)$. If $a < 0$, then $F_Y(y) = P(X \ge (y - b)/a) = P(X > (y - b)/a) = 1 - F_X((y - b)/a)$ and $f_Y(y) = (-1/a)f_X((y - b)/a)$. Since $-a = |a|$ when $a < 0$, the two cases can be combined by writing
\[ f_Y(y) = \frac{1}{|a|}\, f_X\!\left( \frac{y - b}{a} \right). \tag{6.8} \]

EXAMPLE 6.17 Consider a random variable $X$ having a standard normal density $\phi(x)$ and $Y = \sigma X + \mu$ where $\mu, \sigma \in R$, $\sigma > 0$. Then $f_Y(y) = (1/\sigma)\phi((y - \mu)/\sigma)$, $y \in R$. Therefore,
\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y - \mu)^2/2\sigma^2}, \qquad y \in R. \]
A random variable $Y$ having this density is said to have a normal density $n(\mu, \sigma^2)$ with parameters $\mu$ and $\sigma$, called the mean and standard deviation, respectively. The latter terms will be justified later. For the time being, $\mu$ and $\sigma$ are just parameters. ■

If $X$ and $Y$ are independent random variables having normal densities, what can we say about the density of the sum?

Theorem 6.5.1 Let $X$ and $Y$ be independent random variables having normal densities $n(\mu_X, \sigma_X^2)$ and $n(\mu_Y, \sigma_Y^2)$, respectively. Then $Z = X + Y$ has a normal density $n(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$.

Sketch of Proof: By Theorem 6.4.3,
\[ f_Z(z) = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma_X}\, e^{-(x - \mu_X)^2/2\sigma_X^2}\, \frac{1}{\sqrt{2\pi}\,\sigma_Y}\, e^{-(z - x - \mu_Y)^2/2\sigma_Y^2}\,dx. \]
The next step is to combine the two exponents and then complete the square on $x$. The result is that a factor of $1/\left(\sqrt{2\pi}\sqrt{\sigma_X^2 + \sigma_Y^2}\right)$ along with an exponential function of a quadratic in $z$ can be taken outside the integral, leaving the total integral of a normal density, which is 1. Although the algebra is tedious, readers should carry out these steps at least once in their lifetime.
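Before carrying out the algebra, the conclusion of Theorem 6.5.1 can be tested numerically. The Python sketch below is an illustration, not part of the original text: it approximates the convolution integral by a midpoint rule over the interval $\mu_X \pm 10\sigma_X$ (outside of which the first factor is negligible) and compares the result with the $n(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$ density. The function names, the truncation interval, the step count, and the parameter values are all assumptions made for the example.

```python
import math


def normal_density(y, mu, sigma):
    """Density n(mu, sigma^2) evaluated at y (sigma is the standard deviation)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def sum_density(z, mu_x, sigma_x, mu_y, sigma_y, steps=4000):
    """Midpoint-rule approximation of the convolution integral for f_Z(z)."""
    lo = mu_x - 10.0 * sigma_x          # truncate where f_X is negligible
    h = 20.0 * sigma_x / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += normal_density(x, mu_x, sigma_x) * normal_density(z - x, mu_y, sigma_y)
    return total * h


if __name__ == "__main__":
    mu_x, sigma_x, mu_y, sigma_y = 1.0, 1.0, -0.5, 2.0  # arbitrary test values
    sigma_z = math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    for z in (-2.0, 0.5, 1.2, 4.0):
        numeric = sum_density(z, mu_x, sigma_x, mu_y, sigma_y)
        claimed = normal_density(z, mu_x + mu_y, sigma_z)
        print(z, numeric, claimed)
```

Agreement of the two columns is evidence for the theorem at the chosen parameter values, not a substitute for completing the square.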
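■

If the random variable $Y$ has a normal density with parameters $\mu$ and $\sigma$, then probabilities of the type $P(a \le Y \le b)$ can be expressed in terms of $\Phi$ as follows. Since $Y = \sigma X + \mu$, where $X$ has a standard normal density,
\[ P(a \le Y \le b) = P(a \le \sigma X + \mu \le b) = P\left( \frac{a - \mu}{\sigma} \le X \le \frac{b - \mu}{\sigma} \right) = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right). \]

EXAMPLE 6.18 Suppose the random variable $X$ has a normal density with parameters $\mu = 100$ and $\sigma = 10$. According to the Standard Normal Distribution Function table (see p. 346) and the fact that $\Phi(x) = 1 - \Phi(-x)$, $x \in R$,
\begin{align*}
P(75 \le X \le 125) &= \Phi\left( \frac{125 - 100}{10} \right) - \Phi\left( \frac{75 - 100}{10} \right) = \Phi(2.5) - \Phi(-2.5) \\
&= \Phi(2.5) - (1 - \Phi(2.5)) = 2\Phi(2.5) - 1 = .9876.
\end{align*}
■

If $X$ is any random variable with density $f_X$ and $Y = X^2$, then the density of $Y$ can be obtained as follows. We first express $F_Y$ in terms of $F_X$. Since $Y$ is nonnegative, $F_Y(y) = 0$ if $y < 0$. Suppose $y > 0$. Then
\[ F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = P(-\sqrt{y} < X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}). \]
Since the derivative of the latter expression is $(1/(2\sqrt{y}))(F_X'(\sqrt{y}) + F_X'(-\sqrt{y}))$,
\[ f_Y(y) = \begin{cases} 0 & \text{if } y \le 0 \\ \dfrac{1}{2\sqrt{y}} \left( f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right) & \text{if } y > 0. \end{cases} \tag{6.9} \]

EXAMPLE 6.19 Let $X$ have a standard normal density and let $Y = X^2$. Applying the formula above,
\[ f_Y(y) = \begin{cases} 0 & \text{if } y \le 0 \\ \dfrac{1}{\sqrt{2\pi y}}\, e^{-y/2} & \text{if } y > 0. \end{cases} \tag{6.10} \]
■

The last density is a special case of a family of density functions having the form $x^{\alpha - 1}e^{-\lambda x}$ for positive $x$ except for a multiplicative constant, where $\alpha$ and $\lambda$ are positive parameters. To determine the multiplicative constant, we must evaluate the integral
\[ \int_{0}^{+\infty} x^{\alpha - 1}e^{-\lambda x}\,dx. \]
Letting $y = \lambda x$, this integral becomes
\[ \lambda^{-\alpha} \int_{0}^{+\infty} y^{\alpha - 1}e^{-y}\,dy. \]
Putting aside the evaluation of the latter integral for the time being, for each $\alpha > 0$ let
\[ \Gamma(\alpha) = \int_{0}^{+\infty} y^{\alpha - 1}e^{-y}\,dy. \]
Then
\[ \int_{0}^{+\infty} x^{\alpha - 1}e^{-\lambda x}\,dx = \frac{\Gamma(\alpha)}{\lambda^{\alpha}}. \]
The reciprocal of this constant is therefore the required multiplicative constant.

EXAMPLE 6.20 The density $\Gamma(\alpha, \lambda)$ defined by
\[ \Gamma(\alpha, \lambda)(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \dfrac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1}e^{-\lambda x} & \text{if } x > 0 \end{cases} \tag{6.11} \]
is called a gamma density with parameters $\alpha$ and $\lambda$.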
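That the multiplicative constant $\lambda^{\alpha}/\Gamma(\alpha)$ normalizes the gamma density can be checked numerically. The Python sketch below is illustrative and not part of the original text: `math.gamma` from the standard library supplies $\Gamma(\alpha)$, while the function names, the truncation point, and the step count are assumptions. The midpoint rule used here is only appropriate when $\alpha \ge 1$, since for $\alpha < 1$ the density is unbounded near 0.

```python
import math


def gamma_density(x, alpha, lam):
    """Gamma density with parameters alpha and lam, evaluated at x."""
    if x <= 0.0:
        return 0.0
    return (lam ** alpha) / math.gamma(alpha) * x ** (alpha - 1.0) * math.exp(-lam * x)


def total_mass(alpha, lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of the density over (0, upper).

    `upper` is an assumed truncation point; it should be large enough that the
    tail beyond it is negligible for the chosen lam.
    """
    h = upper / steps
    total = 0.0
    for k in range(steps):
        total += gamma_density((k + 0.5) * h, alpha, lam)
    return total * h


if __name__ == "__main__":
    print(total_mass(2.5, 1.5))   # should be close to 1
    print(total_mass(1.0, 2.0))   # exponential case alpha = 1
```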
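■

Returning to $\Gamma(\alpha)$ as a function of $\alpha$, the recurrence relation
\[ \Gamma(\alpha + 1) = \alpha\Gamma(\alpha), \qquad \alpha > 0, \tag{6.12} \]
can be obtained by applying integration by parts to the integral
\[ \Gamma(\alpha + 1) = \int_{0}^{+\infty} x^{(\alpha + 1) - 1}e^{-x}\,dx. \]
Since $\Gamma(1) = \int_{0}^{+\infty} e^{-y}\,dy = 1$, it is easy to show by an induction argument that for every positive integer $n$, $\Gamma(n + 1) = n!$. Since the density given in Equation 6.10 is a $\Gamma(1/2, 1/2)$ density, it follows that the multiplicative constants in Equations 6.10 and 6.11 must be equal in this case. Therefore,
\[ \Gamma\left( \tfrac{1}{2} \right) = \sqrt{\pi}. \]
From this result, $\Gamma(\alpha)$ can be calculated using Equation 6.12 for any $\alpha > 0$ that is an odd multiple of $1/2$. For example,
\[ \Gamma\left( \tfrac{5}{2} \right) = \tfrac{3}{2}\,\Gamma\left( \tfrac{3}{2} \right) = \tfrac{3}{2} \cdot \tfrac{1}{2}\,\Gamma\left( \tfrac{1}{2} \right) = \tfrac{3}{4}\sqrt{\pi}. \]

EXAMPLE 6.21 Let $X$ have a normal density $n(0, \sigma^2)$ and let $Y = X^2$. Then $f_Y(y) = 0$ if $y < 0$. Suppose $y > 0$. Then, by Equation 6.9,
\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma\sqrt{y}}\, e^{-y/2\sigma^2}. \]
Since $\sqrt{\pi} = \Gamma(1/2)$,
\[ f_Y(y) = \begin{cases} \dfrac{(1/2\sigma^2)^{1/2}}{\Gamma(1/2)}\, y^{(1/2) - 1}e^{-y/2\sigma^2} & \text{if } y > 0 \\ 0 & \text{if } y \le 0. \end{cases} \]
It follows that $Y$ has a $\Gamma(1/2, 1/2\sigma^2)$ density. ■

The exponential density with parameter $\lambda$ is the special case $\Gamma(1, \lambda)$ of the gamma density. According to Example 6.13, if $X$ and $Y$ are independent and have exponential densities with the same parameter $\lambda = 1$, then $Z = X + Y$ has a $\Gamma(2, 1)$ density. This suggests the following theorem.

FIGURE 6.12 The bell curve.

Theorem 6.5.2 Let $X$ and $Y$ be independent random variables having gamma densities $\Gamma(\alpha_1, \lambda)$ and $\Gamma(\alpha_2, \lambda)$, respectively. If $Z = X + Y$, then $Z$ has a $\Gamma(\alpha_1 + \alpha_2, \lambda)$ density.

PROOF: By Equation 6.6, for $z > 0$,
\begin{align*}
f_Z(z) &= \int_{0}^{z} \frac{\lambda^{\alpha_1}}{\Gamma(\alpha_1)}\, x^{\alpha_1 - 1}e^{-\lambda x}\, \frac{\lambda^{\alpha_2}}{\Gamma(\alpha_2)}\, (z - x)^{\alpha_2 - 1}e^{-\lambda(z - x)}\,dx \\
&= \frac{\lambda^{\alpha_1 + \alpha_2}}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, e^{-\lambda z} \int_{0}^{z} x^{\alpha_1 - 1}(z - x)^{\alpha_2 - 1}\,dx.
\end{align*}
Making the substitution $y = x/z$ in the integral,
\[ \int_{0}^{z} x^{\alpha_1 - 1}(z - x)^{\alpha_2 - 1}\,dx = z^{\alpha_1 + \alpha_2 - 1} \int_{0}^{1} y^{\alpha_1 - 1}(1 - y)^{\alpha_2 - 1}\,dy. \]
The latter integral is a constant that will be denoted by $B(\alpha_1, \alpha_2)$. Therefore,
\[ f_Z(z) = \begin{cases} \dfrac{\lambda^{\alpha_1 + \alpha_2}}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, B(\alpha_1, \alpha_2)\, z^{\alpha_1 + \alpha_2 - 1}e^{-\lambda z} & \text{if } z > 0 \\ 0 & \text{if } z \le 0. \end{cases} \]
Disregarding the fractional constant in the description of $f_Z(z)$, this function has the form of a gamma density and therefore must be a $\Gamma(\alpha_1 + \alpha_2, \lambda)$ density. This concludes the proof.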
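Theorem 6.5.2 can be illustrated by simulation. The Python sketch below is not part of the original text; the choice $\alpha_1 = 1$, $\alpha_2 = 2$, the function names, the number of trials, and the seed are assumptions made for the example. It uses `random.gammavariate` from the Python standard library, whose second argument is a scale parameter equal to $1/\lambda$ in the notation of this section. For the $\Gamma(3, \lambda)$ density the distribution function can be found by integrating by parts twice, giving $1 - e^{-\lambda z}(1 + \lambda z + (\lambda z)^2/2)$ for $z > 0$, and this is compared with the empirical fraction of simulated sums not exceeding $z$.

```python
import math
import random


def gamma3_cdf(z, lam):
    """Distribution function of the Gamma(3, lam) density at z."""
    if z <= 0.0:
        return 0.0
    t = lam * z
    return 1.0 - math.exp(-t) * (1.0 + t + t * t / 2.0)


def estimate_sum_cdf(z, alpha1, alpha2, lam, trials=200_000, seed=1):
    """Empirical P(X + Y <= z) for independent gamma variables X and Y."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatability
    scale = 1.0 / lam          # gammavariate expects a scale, not a rate
    hits = 0
    for _ in range(trials):
        s = rng.gammavariate(alpha1, scale) + rng.gammavariate(alpha2, scale)
        if s <= z:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for z in (1.0, 2.0, 4.0):
        print(z, estimate_sum_cdf(z, 1.0, 2.0, 1.0), gamma3_cdf(z, 1.0))
```

The two columns should typically agree to about two decimal places; as with any simulation, the agreement is only statistical.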
■

The fact that the above function is a gamma density permits us to determine the constant $B(\alpha_1, \alpha_2)$. Since the function is a gamma density $\Gamma(\alpha_1 + \alpha_2, \lambda)$, we must have
\[ \frac{B(\alpha_1, \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} = \frac{1}{\Gamma(\alpha_1 + \alpha_2)}, \]
and therefore
\[ B(\alpha_1, \alpha_2) = \int_{0}^{1} x^{\alpha_1 - 1}(1 - x)^{\alpha_2 - 1}\,dx = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}. \]
The reader should observe the remarkable fact that we have evaluated a family of integrals using probability methods.

EXAMPLE 6.22 Returning to the latter part of the above proof, consider the function
\[ \beta(\alpha_1, \alpha_2)(x) = \begin{cases} \dfrac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, x^{\alpha_1 - 1}(1 - x)^{\alpha_2 - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases} \]
Since this function is nonnegative and the multiplicative constant has been chosen so that its total integral is 1, it is a density function called the beta density with parameters $\alpha_1$ and $\alpha_2$. ■

A final remark is in order about the graph of the standard normal density shown in Figure 6.11. The standard normal density is shown there using the same units on the axes. The usual practice is to use a much larger unit on the $y$-axis, which produces the distorted view of the standard normal density seen in Figure 6.12. Figure 6.11 shows that the normal density is spread out more uniformly than shown on the distorted graph. The distorted graph is commonly called the "bell curve" and is the favorite of book cover designers.

EXERCISES 6.5

1. If the random variable $X$ has a normal density $n(\mu, \sigma^2)$ and $Y = aX + b$, what is the form and parameters of the density of $Y$?

2. A random variable $X$ has a normal density $n(\mu, \sigma^2)$ and it is known that $P(X \le 100) = .9938$ and $P(X \le 60) = .9332$. Find $\mu$ and $\sigma$.

3. Let $X$ and $Y$ be independent random variables with standard normal densities. Assuming that $X^2$ and $Y^2$ are independent, find the density of $Z = X^2 + Y^2$.

4. Let $X$ and $Y$ be independent random variables having standard normal densities. Calculate the probability that the pair $(X, Y)$ will be between the lines through the origin making angles $\pi/6$ and $\pi/4$ radians with the $x$-axis and also in the first quadrant.

5.
A point in the plane is chosen in such a way that its $x$-coordinate $X$ is $n(0, \sigma^2)$, its $y$-coordinate $Y$ is $n(0, \sigma^2)$, and the two are independent. Find the density of the distance $Z = \sqrt{X^2 + Y^2}$ of the point from the origin. The density of $Z$ is called the Rayleigh density.

6. If the random variable $X$ has a uniform density on the interval $[a, b]$, $a < b$, find a function $\phi$ such that $Y = \phi(X)$ has a uniform density on $[0, 1]$.

7. Let $X$ have a standard normal density $n(0, 1)$. Find a function $\phi$ such that $Y = \phi(X)$ has a uniform density on $(0, 1)$. (Hint: $\Phi$ is strictly increasing and has an inverse function $\Phi^{-1}$.)

6.6 MULTIVARIATE AND CONDITIONAL DENSITIES

Given discrete random variables $X$ and $Y$, the conditional density of $Y$ given $X = x$ was defined in Section 4.5 to be the function
\[ f_{Y|X}(y \mid x) = P(Y = y \mid X = x). \]
This is not possible for continuous random variables, because if $x$ is a point of continuity of $F_X$, then $P(X = x) = 0$ and the above conditional probability is not defined. In particular, if $F_X$ is continuous at every point, then the conditional probability is not defined for any $x$. Even if $P(X = x) > 0$, $P(Y = y \mid X = x)$ would still be equal to zero for most $y$ since $P(Y = y) = 0$ whenever $F_Y$ is continuous at $y$. In the case of continuous random variables, it is necessary to avoid events of the type $(Y = y)$ or $(X = x)$. To keep the discussion at the introductory level, we will assume that $X$ and $Y$ are random variables with a joint density function. Consider the conditional probability $P(Y \le y \mid x \le X \le x + \Delta x)$.
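Assuming that the given event has positive probability,
\[ P(Y \le y \mid x \le X \le x + \Delta x) = \frac{P(Y \le y,\ x \le X \le x + \Delta x)}{P(x \le X \le x + \Delta x)} = \frac{\int_{-\infty}^{y} \int_{x}^{x + \Delta x} f_{X,Y}(u, v)\,du\,dv}{\int_{x}^{x + \Delta x} f_X(u)\,du}. \]
We might try defining the conditional distribution of $Y$ given $X = x$ by
\[ F_{Y|X}(y \mid X = x) = \lim_{\Delta x \to 0} P(Y \le y \mid x \le X \le x + \Delta x) = \lim_{\Delta x \to 0} \frac{\int_{-\infty}^{y} \left( \frac{1}{\Delta x} \int_{x}^{x + \Delta x} f_{X,Y}(u, v)\,du \right) dv}{\frac{1}{\Delta x} \int_{x}^{x + \Delta x} f_X(u)\,du}. \]
Assuming that the limit can be taken under the first integral sign,
\[ F_{Y|X}(y \mid X = x) = \int_{-\infty}^{y} \left( \lim_{\Delta x \to 0} \frac{\frac{1}{\Delta x} \int_{x}^{x + \Delta x} f_{X,Y}(u, v)\,du}{\frac{1}{\Delta x} \int_{x}^{x + \Delta x} f_X(u)\,du} \right) dv. \]
This suggests that the conditional density of $Y$ given that $X = x$ should be defined as
\[ f_{Y|X}(y \mid x) = \lim_{\Delta x \to 0} \frac{(1/\Delta x) \int_{x}^{x + \Delta x} f_{X,Y}(u, y)\,du}{(1/\Delta x) \int_{x}^{x + \Delta x} f_X(u)\,du}. \]
If the functions $f_X$ and $f_{X,Y}$ are continuous at $x$ and $(x, y)$, respectively, then the denominator and numerator will have limits $f_X(x)$ and $f_{X,Y}(x, y)$, respectively, since they represent the average values of these functions near the points of continuity. In this case, we would have
\[ f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)} \]
provided $f_X(x) > 0$.

Definition 6.7 If $X$ and $Y$ are random variables having a joint density, the conditional density of $Y$ given $X$ is defined for $x, y \in R$ by
\[ f_{Y|X}(y \mid x) = \begin{cases} \dfrac{f_{X,Y}(x, y)}{f_X(x)} & \text{if } f_X(x) > 0 \\ 0 & \text{if } f_X(x) = 0. \end{cases} \]
■

Note that the definition does not require any of the assumptions made in the above heuristic argument. Note also that the equation
\[ f_{X,Y}(x, y) = f_{Y|X}(y \mid x)f_X(x), \qquad x, y \in R, \]
holds at all points $x$ for which $f_X(x) > 0$. As in the discrete case, experiments are sometimes defined in terms of densities and conditional densities.

EXAMPLE 6.23 A point $X$ is chosen at random from the interval $[0, 1]$. Given that $X = x$, a point $Y$ is then chosen at random from the interval $[0, x]$. Suppose we are required to find the density of $Y$. The density of $X$ is clearly uniform on $[0, 1]$; i.e.,
\[ f_X(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
The information concerning $Y$ is given in conditional form; namely,
\[ f_{Y|X}(y \mid x) = \begin{cases} 1/x & \text{if } 0 \le y \le x \\ 0 & \text{otherwise.} \end{cases} \]
That is, given that $X = x$, $Y$ is uniform on $[0, x]$. The joint density $f_{X,Y}(x, y)$ is given by
\[ f_{X,Y}(x, y) = f_{Y|X}(y \mid x)f_X(x) = \begin{cases} 1/x & \text{if } 0 \le y \le x,\ 0 < x \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
Since $Y$ takes on values in $[0, 1]$, $f_Y(y) = 0$ if $y < 0$ or $y > 1$. Suppose $0 < y < 1$.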
Then
$$f_Y(y) = \int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dx = \int_{y}^{1} \frac{1}{x}\,dx = -\ln y.$$
Thus,
$$f_Y(y) = \begin{cases} -\ln y & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$
■

EXAMPLE 6.24 To assess the reliability of a component of a system, fatigue or wear and tear must be taken into consideration. Suppose it is known that if a component has survived up to time $t$, then it will fail in a small time interval $(t, t + \Delta t)$ with probability approximately proportional to the length $\Delta t$ of the interval, where the constant of proportionality depends upon $t$; i.e., the conditional probability is approximately equal to $\beta(t)\Delta t$ for some nonnegative function $\beta(t)$ on $(0, \infty)$. Let $T$ denote the time at which the component will fail. We will use an intuitive argument to determine the distribution function of $T$ from the given data $\beta(t)$. According to the above assumption,
$$P(t < T \le t + \Delta t \mid T > t) \approx \beta(t)\Delta t.$$
On the other hand, if we let $F_T$ denote the distribution function of $T$, then
$$P(t < T \le t + \Delta t \mid T > t) = \frac{P(t < T \le t + \Delta t)}{P(T > t)} = \frac{F_T(t + \Delta t) - F_T(t)}{1 - F_T(t)}.$$
Assuming that $F_T$ has a continuous density function $f_T$, by the mean value theorem of the integral calculus,
$$P(t < T \le t + \Delta t \mid T > t) = \frac{f_T(\tau)\Delta t}{1 - F_T(t)}$$
for some $\tau$ with $t \le \tau \le t + \Delta t$. Thus,
$$\beta(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \frac{f_T(t)}{1 - F_T(t)}.$$
Therefore,
$$\beta(t) = -\frac{d}{dt}\bigl(\ln(1 - F_T(t))\bigr).$$
Integrating from $0$ to $t$ and using the fact that $F_T(0) = 0$,
$$\ln(1 - F_T(t)) = -\int_{0}^{t} \beta(s)\,ds,$$
so that
$$F_T(t) = 1 - e^{-\int_0^t \beta(s)\,ds}$$
and
$$f_T(t) = \beta(t)\,e^{-\int_0^t \beta(s)\,ds}.$$
Since any real-life component will eventually fail, we should require that $\lim_{t \to +\infty} F_T(t) = 1$ and therefore that
$$\int_{0}^{+\infty} \beta(s)\,ds = +\infty.$$
■

EXAMPLE 6.25 Consider a system made up of two components that are connected in series with associated failure rates $\beta_1(t)$ and $\beta_2(t)$. Let $T$ be the time of failure of the system and let $T_1, T_2$ be the times of failure of the two components. If the two components fail independently of each other, what is the failure rate $\beta(t)$ of the system?
Since the components are connected in series, $T = \min(T_1, T_2)$. By the assumed independence,
$$P(T > t) = P(\min(T_1, T_2) > t) = P(T_1 > t, T_2 > t) = P(T_1 > t)P(T_2 > t) = (1 - F_{T_1}(t))(1 - F_{T_2}(t)).$$
Thus,
$$1 - F_T(t) = e^{-\int_0^t \beta_1(s)\,ds}\,e^{-\int_0^t \beta_2(s)\,ds} = e^{-\int_0^t (\beta_1(s) + \beta_2(s))\,ds}$$
and $F_T(t) = 1 - e^{-\int_0^t (\beta_1(s) + \beta_2(s))\,ds}$, and therefore $\beta(t) = \beta_1(t) + \beta_2(t)$. ■

We need not limit ourselves to just two random variables. If $X_1, \ldots, X_n$ are $n$ random variables, we can define the joint distribution function $F_{X_1,\ldots,X_n}$ by the equation
$$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n).$$
The distribution function $F_{X_1,\ldots,X_n}$ and random variables $X_1, \ldots, X_n$ are said to have the Riemann integrable function $f_{X_1,\ldots,X_n} : R^n \to R$ as joint density function if
$$P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1}\left(\cdots\left(\int_{a_n}^{b_n} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_n\right)\cdots\right)dx_1.$$
More generally, if $A \subset R^n$ and $A$ is in the smallest $\sigma$-algebra of subsets of $R^n$ containing all n-dimensional rectangles in $R^n$, then
$$P((X_1, \ldots, X_n) \in A) = \int\cdots\int_A f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dV_n$$
where $dV_n$ denotes integration with respect to n-dimensional volume. To calculate a probability of the type $P(g(X_1, \ldots, X_n) \le a)$ where $g$ is some function of $n$ variables, the probability is put into the form
$$P(g(X_1, \ldots, X_n) \le a) = P((X_1, \ldots, X_n) \in A) = \int\cdots\int_A f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dV_n$$
where $A = \{(x_1, \ldots, x_n) : g(x_1, \ldots, x_n) \le a\}$.

EXAMPLE 6.26 Consider the unit cube $U \subset R^n$ defined by
$$U = \{(x_1, \ldots, x_n) : 0 \le x_1 \le 1, \ldots, 0 \le x_n \le 1\}.$$
An experiment consists of choosing a point at random from $U$. The latter statement means that the probability that the outcome will be in a region $A$, if $A$ is in the $\sigma$-algebra described above, is taken to be the volume of $A \cap U$, $\mathrm{Vol}(A \cap U)$, divided by the volume of $U$, which is 1. If $X_1, \ldots, X_n$ denote the first, second, ..., nth coordinates, respectively, of the chosen point, then
$$P((X_1, \ldots, X_n) \in A) = \mathrm{Vol}(A \cap U).$$
In particular, if we want to evaluate $P(X_1 \le X_2 \le \cdots \le X_n)$, we put $A = \{(x_1, \ldots, x_n) : x_1 \le x_2 \le \cdots \le x_n\}$, so that
$$P(X_1 \le X_2 \le \cdots \le X_n) = \int\cdots\int_{A \cap U} 1\,dV_n.$$
AA U The multiple integral can be calculated using iterated integrals by fixing xb...,x„-! and integrating with respect to xn between x„-! and 1, then integrating with respect tox„-i between x„-2 and 1, and so forth. Thus, f1 P(Xj < X2 < • • • s Xn /f1 \ \ dxn jdxn-i... Idx! • Xn-2 / / •1 \ (1 - xn-i)dx„-x • • • Idx, • Xn — 2 / I 222 6 CONTINUOUS RANDOM VARIABLES Jo («-D! «!’ Independence of continuous random variables is defined as before; i.e., the random variables Xj, ... ,Xn are independent if P(X1 < xu...,X„ < x„) = P(Xi < Xi)X ••• XP(X„ < x„) for all .. xn ER or, equivalently, Fx,... xAxi,...,x„) = Fx,(xl) X •• • X FXn(xn) forallx1}.. .,xn ER. Assuming that Xb ... ,X„ have a joint density function fxy...,x„> then each X, has a density function fXi>i = and the random variables are independent if and only if fx,....X„ (Xb • • • > Xn ) = fx, <X1) X • • • x fXn (Xn ) for all Xy ..., xn ER. More precisely, if and only if fXl (xi) X • • • X fXn (x„) is a joint density for Xi,... ,X„. As in the discrete case, if Xi,... , X„ are independent random variables and </>!,...,</>„ are real-valued continuous functions on R, then the random variables (Xi),..., <t>n (XM) are indepen­ dent random variables, as are the two random variables Xi + • • • + Xn-i and Xn. LetX, Y, and Z be independent random variables with each having a uniform density on [0, 1 ]. The joint density is the product of the three densities and is given by EXAMPLE 6.27 fx,Y,z(x, y,z) = if 0 < x, y, z otherwise. 1 Consider P(Z S XY). This probability can be calculated by defining A = {(x,y, z) : z S xy}so that The region of integration consists of all points (x, y, z) ER3 that lie above the surface z = xy. Since the integrand vanishes outside the unit cube, we can integrate over the region consisting of points (x,y, z) that are above the surface 6.6 223 MULTIVARIATE AND CONDITIONAL DENSITIES z = xy, below the surfacez = 1, and above the unit square in the xy-plane. 
By fixing $x$ and $y$ with $0 \le x, y \le 1$, we can integrate with respect to $z$ from $xy$ to 1, so that
$$P(Z \ge XY) = \int_0^1\int_0^1\left(\int_{xy}^1 1\,dz\right)dx\,dy = \int_0^1\int_0^1 (1 - xy)\,dx\,dy = \frac{3}{4}.$$
■

The proofs of the following theorems will be reserved as exercises.

Theorem 6.6.1 Let $X_1, \ldots, X_n$ be independent random variables. If $X_i$ has a gamma density $\Gamma(\alpha_i, \lambda)$, $i = 1, \ldots, n$, then $X_1 + \cdots + X_n$ has the gamma density $\Gamma(\alpha_1 + \cdots + \alpha_n, \lambda)$.

Theorem 6.6.2 Let $X_1, \ldots, X_n$ be independent random variables. If $X_i$ has a normal density $n(\mu_i, \sigma_i^2)$, then $X_1 + \cdots + X_n$ has an $n(\mu, \sigma^2)$ density where $\mu = \mu_1 + \cdots + \mu_n$ and $\sigma^2 = \sigma_1^2 + \cdots + \sigma_n^2$.

Let $X_1, \ldots, X_n$ be random variables having a joint density function $f_{X_1,\ldots,X_n}$. If $1 \le m < n$, define
$$f_{X_{m+1},\ldots,X_n \mid X_1,\ldots,X_m}(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m) = \frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{f_{X_1,\ldots,X_m}(x_1, \ldots, x_m)}$$
provided the denominator is positive. As before,
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_{m+1},\ldots,X_n \mid X_1,\ldots,X_m}(x_{m+1}, \ldots, x_n \mid x_1, \ldots, x_m)\,f_{X_1,\ldots,X_m}(x_1, \ldots, x_m),$$
provided the second factor on the right is positive.

EXAMPLE 6.28 A point $X$ is chosen at random from the interval $[0, 1]$. Given that $X = x$, a point $Y$ is chosen at random from the interval $[0, x]$. Finally, given that $Y = y$, a point $Z$ is chosen at random from the interval $[0, y]$. What is the density of $Z$? The given information specifies the following densities and conditional densities:
$$f_X(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise,} \end{cases}$$
$$f_{Y|X}(y \mid x) = \begin{cases} 1/x & \text{if } 0 < y \le x \le 1 \\ 0 & \text{otherwise,} \end{cases}$$
$$f_{Z|X,Y}(z \mid x, y) = f_{Z|Y}(z \mid y) = \begin{cases} 1/y & \text{if } 0 \le z < y \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Since
$$f_{X,Y,Z}(x, y, z) = f_{Z|X,Y}(z \mid x, y) f_{X,Y}(x, y) = f_{Z|Y}(z \mid y) f_{Y|X}(y \mid x) f_X(x)$$
provided $0 \le z < y \le x \le 1$,
$$f_{X,Y,Z}(x, y, z) = \begin{cases} \dfrac{1}{xy} & \text{if } 0 \le z < y \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
For $0 < z \le 1$,
$$f_Z(z) = \int_{-\infty}^{+\infty}\left(\int_{-\infty}^{+\infty} f_{X,Y,Z}(x, y, z)\,dx\right)dy = \int_z^1\left(\int_y^1 \frac{1}{xy}\,dx\right)dy = \frac{1}{2}(\ln z)^2,$$
and $f_Z(z) = 0$ otherwise. ■

EXERCISES 6.6
1. Let $X_1, \ldots, X_n$ be independent random variables each having an exponential density with parameter $\lambda > 0$. Find the density of $Z = X_1 + \cdots + X_n$.
2. Let $X_1, \ldots, X_n$ be independent random variables each having a standard normal density.
Find the density of $Z = \sqrt{X_1^2 + X_2^2 + \cdots + X_n^2}$.
3. Write out complete proofs of Theorems 6.6.1 and 6.6.2.
4. Consider a disk $D$ with center at $(0, 0)$ and radius 1. A point with x-coordinate $X$ is chosen at random from the line segment joining $(-1, 0)$ to $(1, 0)$, and then a point is chosen at random from the line segment joining $(x, -\sqrt{1 - x^2})$ to $(x, \sqrt{1 - x^2})$. Let $Y$ be the y-coordinate of the latter point. What is the density of $Y$?
5. Let $X_1, X_2, X_3$ be independent random variables each having an exponential density with the same parameter $\lambda = 1$. Calculate the probability $P(X_1 \le 2X_2 \le 3X_3)$.
6. Random variables $X_1, X_2, \ldots, X_n$ are defined as follows. A point $X_1$ is selected at random from the interval $[0, 1]$; given that $X_1 = x_1$, a point $X_2$ is selected at random from the interval $[0, x_1]$; and at the last step, given that $X_{n-1} = x_{n-1}$, a point $X_n$ is selected at random from the interval $[0, x_{n-1}]$. What is the density of $X_n$?
7. Let $X_1, \ldots, X_n$ be independent random variables each having a uniform density on $[0, 1]$. Let $U = \min(X_1, \ldots, X_n)$ and $V = \max(X_1, \ldots, X_n)$. Find the joint distribution function of $U$ and $V$ and verify that
$$f_{U,V}(u, v) = \begin{cases} n(n-1)(v - u)^{n-2} & \text{if } 0 \le u \le v \le 1 \\ 0 & \text{otherwise} \end{cases}$$
is the joint density of $U$ and $V$.
8. Let $X$, $Y$, and $Z$ be random variables with joint density
$$f_{X,Y,Z}(x, y, z) = \begin{cases} z^2 e^{-z(1+x+y)} & \text{if } x \ge 0,\ y \ge 0,\ z \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
Find $f_X$, $f_Y$, $f_Z$, $f_{X,Y}$, and $f_{X,Y|Z}$.
9. A system consists of two components operating in parallel with associated failure times $T_1$ and $T_2$ and failure rates $\beta_1(t)$ and $\beta_2(t)$, respectively. Let $T$ be the time of failure of the system. Assuming that the two components fail independently of each other, what is the density of $T$?

EXPECTATION REVISITED

INTRODUCTION

Although this chapter discusses most of the concepts introduced in Chapter 4, the introduction of the Riemann-Stieltjes integral makes it possible to formulate a definition of expected value that combines the discrete and continuous cases.
An additional condition, however, must be imposed to have an effective means of calculating expected values using the calculus. One of the hallmarks of probability theory is a classic theorem known as the DeMoivre-Laplace limit theorem, which states that a sum of binomial probabilities can be approximated by the integral of a function known as the "bell-shaped" normal density. Although the proof of this result is tedious, a complete proof is given with enough detail that a postcalculus student can follow the arguments. Reading the proof at this stage is not essential for learning probability. The chapter concludes with applications to certain types of sequences of random variables called stationary processes, which occur in filtering theory and prediction theory. Processes of this type have their origin in the works of G. U. Yule and E. Slutsky during the period 1920-1940.

RIEMANN-STIELTJES INTEGRAL

If $X$ is a discrete random variable with range $\{x_1, x_2, \ldots\}$ and density function $f_X$, the expected value of $X$ was defined to be
$$E[X] = \sum_j x_j f_X(x_j),$$
provided the series converges absolutely. This definition cannot be used for continuous random variables for several reasons. What is needed is a formulation of expected value that applies to continuous random variables and that reduces to the definition above in the case of discrete random variables. Fortunately, there is a type of integral that can do both, called the Riemann-Stieltjes integral.

Let us quickly review the Riemann integral. Let $[a, b] \subset R$, $a < b$, be a finite interval and let $\phi : [a, b] \to R$. A collection of points $\{x_0, x_1, \ldots, x_n\}$ with $a = x_0 \le x_1 \le \cdots \le x_n = b$ is called a partition of $[a, b]$ and is denoted by $\pi$. The norm of the partition $\pi$ is denoted by $|\pi|$ and defined by
$$|\pi| = \max\{x_1 - x_0, x_2 - x_1, \ldots, x_n - x_{n-1}\}.$$
For $i = 1, \ldots, n$, let $\xi_i$ be any point of the interval $[x_{i-1}, x_i]$ and let $\Delta x_i = x_i - x_{i-1}$.
The Riemann integral of $\phi$ over $[a, b]$ is then defined as
$$\int_a^b \phi(t)\,dt = \lim_{|\pi| \to 0} \sum_{i=1}^n \phi(\xi_i)\,\Delta x_i$$
provided the limit on the right exists. There is nothing sacred about using the weight $\Delta x_i$ for the subinterval $[x_{i-1}, x_i]$. We could just as well use some other weighting scheme. Let $F$ be any distribution function on $R$. Using the same notation as above, we could replace $\Delta x_i$ by $\Delta F_i = F(x_i) - F(x_{i-1})$ and define an integral of $\phi$ over $[a, b]$ with respect to $F$ by
$$\int_a^b \phi(t)\,dF(t) = \lim_{|\pi| \to 0} \sum_{i=1}^n \phi(\xi_i)\,\Delta F_i$$
provided the limit on the right exists. To be more precise about the existence of the limit, if $\pi_1$ and $\pi_2$ are any two partitions of $[a, b]$, we say that $\pi_2$ is finer than $\pi_1$ if $\pi_1 \subset \pi_2$.

Definition 7.1 The function $\phi : [a, b] \to R$ is Riemann-Stieltjes integrable over $[a, b]$ with respect to $F$ if there is a real number $L$ such that for every $\varepsilon > 0$ there is a partition $\pi_\varepsilon$ of $[a, b]$ for which
$$\left|\sum_{i=1}^n \phi(\xi_i)\,\Delta F_i - L\right| < \varepsilon$$
for all partitions $\pi$ finer than $\pi_\varepsilon$. The number $L$ is denoted by
$$\int_a^b \phi(t)\,dF(t) \quad \text{or} \quad \int_a^b \phi\,dF.$$
■

The Riemann-Stieltjes integral shares many of the properties of the Riemann integral:
1. If $\phi_1, \phi_2$ are Riemann-Stieltjes integrable over $[a, b]$ with respect to $F$ and $c_1, c_2 \in R$, then $c_1\phi_1 + c_2\phi_2$ is Riemann-Stieltjes integrable over $[a, b]$ with respect to $F$ and
$$\int_a^b (c_1\phi_1 + c_2\phi_2)\,dF = c_1\int_a^b \phi_1\,dF + c_2\int_a^b \phi_2\,dF.$$
2. Let $a < c < b$ and $\phi : [a, b] \to R$. If two of the following three integrals exist, then so does the third and
$$\int_a^b \phi\,dF = \int_a^c \phi\,dF + \int_c^b \phi\,dF.$$
3. If $a < b$ and $\phi$ is Riemann-Stieltjes integrable over $[a, b]$ with respect to $F$, then
$$\int_b^a \phi\,dF = -\int_a^b \phi\,dF.$$
4. If $\phi$ is continuous on $[a, b]$, then $\phi$ is Riemann-Stieltjes integrable over $[a, b]$ with respect to $F$.

There is also an integration by parts formula that will be omitted because it will not be needed here. Proofs of the above statements can be found in the book by Apostol listed at the end of this chapter.

EXAMPLE 7.1 Let $\phi$ be continuous on $[a, b]$ and let
$$F(x) = \begin{cases} 0 & \text{if } x < c \\ 1 & \text{if } x \ge c \end{cases}$$
where $a < c < b$.
Then
$$\int_a^b \phi(t)\,dF(t) = \phi(c).$$
This can be seen as follows. Since $\phi$ is continuous at $c$, given any $\varepsilon > 0$ there is a $\delta > 0$ such that $|\phi(x) - \phi(c)| < \varepsilon$ whenever $|x - c| < \delta$, $x \in [a, b]$. Fix a partition $\pi_\varepsilon$ of $[a, b]$ with $|\pi_\varepsilon| < \delta$. Consider any partition $\pi = \{x_0, x_1, \ldots, x_n\}$ finer than $\pi_\varepsilon$. Then $c \in (x_{i_0 - 1}, x_{i_0}]$ for some $i_0 = 1, 2, \ldots, n$. Since $F(x_i) - F(x_{i-1}) = 0$ for $i \ne i_0$ and $F(x_{i_0}) - F(x_{i_0 - 1}) = 1$,
$$\sum_{i=1}^n \phi(\xi_i)\,\Delta F_i = \phi(\xi_{i_0})$$
where $\xi_{i_0} \in [x_{i_0 - 1}, x_{i_0}]$. Since $|\xi_{i_0} - c| \le |\pi| \le |\pi_\varepsilon| < \delta$, $|\phi(\xi_{i_0}) - \phi(c)| < \varepsilon$, and therefore
$$\left|\sum_{i=1}^n \phi(\xi_i)\,\Delta F_i - \phi(c)\right| = |\phi(\xi_{i_0}) - \phi(c)| < \varepsilon$$
for any partition $\pi$ finer than $\pi_\varepsilon$. Thus, $L = \phi(c)$ satisfies the above definition and $\int_a^b \phi(t)\,dF(t) = \phi(c)$. ■

Note that continuity of $\phi$ was required only at the point $c$ in this example. The same type of argument can be used to show that if $\phi : [a, b] \to R$ is continuous and there are a finite number of points $c_1, c_2, \ldots, c_n$ with $a < c_1 < c_2 < \cdots < c_n \le b$ such that $F$ only increases by jumps at $c_1, c_2, \ldots, c_n$, then
$$\int_a^b \phi(t)\,dF(t) = \sum_{i=1}^n \phi(c_i)\bigl(F(c_i) - F(c_i-)\bigr).$$
In particular, if $X$ is a discrete random variable with range $\{c_1, \ldots, c_n\}$ and $F = F_X$, then $P(X = c_i) = F_X(c_i) - F_X(c_i-) = f_X(c_i)$ and
$$E[\phi(X)] = \sum_{i=1}^n \phi(c_i) f_X(c_i) = \int_a^b \phi(t)\,dF_X(t).$$

Improper integrals of the type $\int_a^{+\infty} \phi\,dF$ and $\int_{-\infty}^a \phi\,dF$ can be defined in the usual way as the limits
$$\lim_{b \to +\infty} \int_a^b \phi\,dF \quad \text{and} \quad \lim_{b \to -\infty} \int_b^a \phi\,dF,$$
respectively, provided the limits exist in $R$. If both limits do exist, then $\int_{-\infty}^{+\infty} \phi\,dF$ is defined by the equation
$$\int_{-\infty}^{+\infty} \phi\,dF = \int_{-\infty}^a \phi\,dF + \int_a^{+\infty} \phi\,dF, \qquad a \in R.$$

Definition 7.2 The improper integral $\int_{-\infty}^{+\infty} \phi\,dF$ is absolutely convergent if the improper integral $\int_{-\infty}^{+\infty} |\phi|\,dF$ is defined. ■

We can now formulate the definition of expected value of any random variable.

Definition 7.3 The expected value $E[X]$ of the random variable $X$ is defined by
$$E[X] = \int_{-\infty}^{+\infty} x\,dF_X(x)$$
provided the improper integral is absolutely convergent; i.e., $\int_{-\infty}^{+\infty} |x|\,dF_X(x) < +\infty$. ■

If $X$ is a discrete random variable, this definition of $E[X]$ agrees with the definition given in Section 4.2.
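As a rough numerical illustration of how the Riemann-Stieltjes integral reproduces a discrete sum, the sketch below evaluates sums of the form $\sum \phi(\xi_i)\,\Delta F_i$ against a step distribution function. The fair-die distribution, the names `stieltjes_sum` and `die_cdf`, the interval $[0, 7]$, and the choice of right endpoints as the points $\xi_i$ are all assumptions made for this illustration and do not come from the text.

```python
import math


def stieltjes_sum(phi, F, a, b, n):
    """Riemann-Stieltjes sum of phi with respect to F over [a, b].

    Uses a uniform partition with n subintervals and takes the right
    endpoint of each subinterval as the point xi_i (an arbitrary choice).
    """
    h = (b - a) / n
    total = 0.0
    for i in range(1, n + 1):
        x_prev = a + (i - 1) * h
        x_i = a + i * h
        total += phi(x_i) * (F(x_i) - F(x_prev))  # phi(xi_i) * Delta F_i
    return total


def die_cdf(x):
    """Distribution function of a fair die: jumps of 1/6 at 1, 2, ..., 6."""
    if x < 1:
        return 0.0
    if x >= 6:
        return 1.0
    return math.floor(x) / 6.0


# The sum should be close to the discrete value (1 + 2 + ... + 6) / 6 = 3.5.
print(stieltjes_sum(lambda t: t, die_cdf, 0.0, 7.0, 7000))
```

Only the subintervals that contain a jump of the distribution function contribute to the sum, which is the mechanism used in Example 7.1.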
We have seen that if the distribution function $F_X$ is absolutely continuous, then there is a density function $f_X$ such that
$$F_X(x) = \int_{-\infty}^x f_X(t)\,dt, \qquad x \in R,$$
and
$$F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt.$$
Suppose, in addition, that $f_X$ is continuous on the interval $[a, b]$. Then by the mean value theorem of the integral calculus, there is a number $c$ with $a \le c \le b$ such that
$$F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt = f_X(c)(b - a).$$
Suppose now that the function $\phi : [a, b] \to R$ is continuous on $[a, b]$ and let $\pi = \{x_0, \ldots, x_n\}$ be any partition of $[a, b]$. Then
$$\sum_{i=1}^n \phi(\xi_i)\,\Delta F_i = \sum_{i=1}^n \phi(\xi_i)\int_{x_{i-1}}^{x_i} f_X(t)\,dt = \sum_{i=1}^n \phi(\xi_i) f_X(\eta_i)\,\Delta x_i$$
where $\eta_i \in [x_{i-1}, x_i]$, $i = 1, \ldots, n$. Then
$$\int_a^b \phi(t)\,dF_X(t) = \lim_{|\pi| \to 0} \sum_{i=1}^n \phi(\xi_i)\,\Delta F_i = \lim_{|\pi| \to 0} \sum_{i=1}^n \phi(\xi_i) f_X(\eta_i)\,\Delta x_i.$$
If the $\xi_i$ and $\eta_i$ were the same in the second sum, the latter limit would exist since $\phi$ is Riemann-Stieltjes integrable with respect to $F_X$ and would be equal to $\int_a^b \phi(t) f_X(t)\,dt$; the same result is valid even if $\xi_i$ and $\eta_i$ are not the same, by a result known as Duhamel's principle. Therefore, assuming that $F_X$ has a density $f_X$ that is continuous on $[a, b]$ and $\phi$ is also continuous on $[a, b]$,
$$\int_a^b \phi(x)\,dF_X(x) = \int_a^b \phi(x) f_X(x)\,dx.$$

Theorem 7.2.1 Let $X$ be a random variable with $E[X]$ defined and having a continuous density. Then
$$E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx.$$

PROOF: Since $\int_0^b |x| f_X(x)\,dx = \int_0^b |x|\,dF_X(x)$ for all $b > 0$, $\int_0^{+\infty} |x| f_X(x)\,dx = \int_0^{+\infty} |x|\,dF_X(x) < +\infty$. Similarly, $\int_{-\infty}^0 |x| f_X(x)\,dx = \int_{-\infty}^0 |x|\,dF_X(x) < +\infty$. Since the integrals $\int_{-\infty}^0 x f_X(x)\,dx$ and $\int_0^{+\infty} x f_X(x)\,dx$ are absolutely convergent,
$$E[X] = \int_{-\infty}^{+\infty} x\,dF_X(x) = \int_{-\infty}^0 x\,dF_X(x) + \int_0^{+\infty} x\,dF_X(x) = \int_{-\infty}^0 x f_X(x)\,dx + \int_0^{+\infty} x f_X(x)\,dx = \int_{-\infty}^{+\infty} x f_X(x)\,dx.$$
■

If $Z = \phi(X)$ and we want to calculate $E[Z]$, we must first find the density function of $Z$, assuming there is one. The following theorem allows us to bypass this step by using $f_X$ rather than $f_Z$ as a weight function. The proof will be omitted. Proofs require approximating the random variable $X$ by a discrete random variable and applying Theorem 4.2.1.
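The identity $\int_a^b \phi\,dF_X = \int_a^b \phi f_X\,dx$ obtained earlier in this section for a continuous density can be checked numerically. The sketch below is only an illustration under stated assumptions: the exponential distribution with parameter 1, the interval $[0, 5]$, the function $\phi(x) = x$, the midpoint choice for the points $\xi_i$, and the helper names `stieltjes_sum` and `riemann_sum` are all hypothetical choices, not taken from the text.

```python
import math


def stieltjes_sum(phi, F, a, b, n):
    """Sum of phi(xi_i) * (F(x_i) - F(x_{i-1})), with midpoints as xi_i."""
    h = (b - a) / n
    return sum(
        phi(a + (i - 0.5) * h) * (F(a + i * h) - F(a + (i - 1) * h))
        for i in range(1, n + 1)
    )


def riemann_sum(g, a, b, n):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i - 0.5) * h) for i in range(1, n + 1)) * h


def F(x):
    # Exponential distribution function with parameter 1 (assumed example).
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def f(x):
    # Corresponding density, continuous on [0, 5].
    return math.exp(-x) if x >= 0 else 0.0


lhs = stieltjes_sum(lambda x: x, F, 0.0, 5.0, 5000)    # approximates the integral of x dF(x)
rhs = riemann_sum(lambda x: x * f(x), 0.0, 5.0, 5000)  # approximates the integral of x f(x) dx
exact = 1.0 - 6.0 * math.exp(-5.0)                     # closed form of the integral over [0, 5]
print(lhs, rhs, exact)
```

Both sums approximate the same number, even though the first never evaluates the density.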
Theorem 7.2.2 Let $X$ be a random variable having a continuous density and let $\phi : R \to R$ be a continuous function. If the integral $\int_{-\infty}^{+\infty} \phi(x) f_X(x)\,dx$ converges absolutely, then $E[\phi(X)]$ is defined and
$$E[\phi(X)] = \int_{-\infty}^{+\infty} \phi(x) f_X(x)\,dx.$$

EXAMPLE 7.2 Let $X$ have a uniform density on $[a, b]$, $a < b$. Then
$$E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \int_a^b x\,\frac{1}{b - a}\,dx = \frac{a + b}{2}.$$
Note that $E[X]$ is the midpoint of the interval $[a, b]$. If we take $\phi(x) = x^2$, then
$$E[X^2] = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx = \int_a^b x^2\,\frac{1}{b - a}\,dx = \frac{1}{3}(b^2 + ab + a^2).$$
■

To calculate $E[\phi(X)]$, the integral is set up by replacing $X$ in the argument of $\phi$ by a typical value $x$, multiplying by the density of $X$, and integrating with respect to $x$.

EXAMPLE 7.3 Let $X$ be a random variable having a $\Gamma(1/2, 1/2)$ density. Then
$$E[X^2] = \int_{-\infty}^{+\infty} x^2\,\Gamma(1/2, 1/2)(x)\,dx = \int_0^{+\infty} x^2\,\frac{1}{\sqrt{2\pi}}\,x^{(1/2)-1} e^{-x/2}\,dx = \frac{1}{\sqrt{2\pi}}\int_0^{+\infty} x^{(5/2)-1} e^{-x/2}\,dx.$$
Note that the integrand looks like a $\Gamma(5/2, 1/2)$ density. It would be if it were multiplied by the constant
$$\frac{(1/2)^{5/2}}{\Gamma(5/2)} = \frac{1}{3\sqrt{2\pi}}.$$
Thus,
$$E[X^2] = 3\int_0^{+\infty} \frac{1}{3\sqrt{2\pi}}\,x^{(5/2)-1} e^{-x/2}\,dx.$$
Since the $\Gamma(5/2, 1/2)(x)$ density is zero for $x < 0$, the last integral is the total integral of $\Gamma(5/2, 1/2)$ over $(-\infty, +\infty)$ and has value 1. Therefore, $E[X^2] = 3$. ■

There are random variables for which $E[X]$ is not defined according to the criterion in Definition 7.3.

EXAMPLE 7.4 Let $X$ be a random variable having the Cauchy density
$$f_X(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}, \qquad x \in R.$$
Since
$$\int_{-\infty}^{+\infty} |x|\,\frac{1}{\pi}\,\frac{1}{1 + x^2}\,dx = \frac{2}{\pi}\int_0^{+\infty} \frac{x}{1 + x^2}\,dx = +\infty,$$
the integral
$$\int_{-\infty}^{+\infty} x\,\frac{1}{\pi}\,\frac{1}{1 + x^2}\,dx$$
is not absolutely convergent, and therefore $E[X]$ is not defined according to Definition 7.3. ■

If the random variable $X$ takes on only nonnegative values with probability 1, then $f_X(x) = 0$ for $x < 0$ and the integral defining $E[X]$ has the form $\int_0^{+\infty} x f_X(x)\,dx$. Absolute convergence and convergence are the same in this case. If the integral is not convergent, then as in calculus we write $E[X] = \int_0^{+\infty} x f_X(x)\,dx = +\infty$.
In addition, if $T$ is a waiting time that takes on the value $+\infty$ with positive probability, we then put $E[T] = +\infty$.

EXERCISES 7.2
1. Let $X$ be a random variable having a uniform density on $[0, \pi]$. Calculate $E[\sin X]$.
2. Let $X$ be a random variable having the standard normal density $\phi(x) = (1/\sqrt{2\pi})\,e^{-x^2/2}$, $-\infty < x < \infty$. Calculate $E[|X|]$.
3. Let $X$ be a random variable having a uniform density on $[0, 1]$. Calculate $E[\min(X, 1/2)]$ and $E[\max(X, 1/2)]$.
4. Let $X_1, \ldots, X_n$ be independent random variables each having a uniform density on $[0, 1]$, let $U = \min(X_1, \ldots, X_n)$, and let $V = \max(X_1, \ldots, X_n)$. Calculate $E[U]$, $E[U^2]$, $E[V]$, and $E[V^2]$.
5. Let $X$ be a random variable having an exponential density with parameter $\lambda > 0$. Calculate $E[X]$ and $E[X^2]$, if defined.
6. Let $X$ be a random variable having a gamma density $\Gamma(\alpha, \lambda)$ where $\alpha, \lambda > 0$. If defined, calculate $E[X^r]$ where $r$ is a positive integer.
7. Let $X$ be a random variable having a beta density $\beta(\alpha_1, \alpha_2)$ where $\alpha_1, \alpha_2 > 0$. If defined, calculate $E[X^r]$ where $r$ is a positive integer.
8. Let $F$ be a distribution function that increases only by jumps at the points $c_1 < c_2 < \cdots < c_m$ and let $\phi : R \to R$ be continuous at $c_1, \ldots, c_m$. Prove that
$$\int_{-\infty}^{+\infty} \phi(t)\,dF(t) = \sum_{i=1}^m \phi(c_i)\bigl(F(c_i) - F(c_i-)\bigr).$$
9. Let $X$ be a random variable having a density function $f_X$ for which $f_X(x) = 0$ if $x < 0$. Assuming that $E[X]$ is defined, show that
$$E[X] = \int_0^{+\infty} P(X > x)\,dx = \int_0^{+\infty} (1 - F_X(x))\,dx.$$
(See Exercise 4.2.6.)
10. If $X$ and $Y$ are random variables having densities with $Y \ge X \ge 0$ with probability 1, show that $E[Y] \ge E[X] \ge 0$.

EXPECTATION AND CONDITIONAL EXPECTATION

The reason for defining the expected value of a random variable in terms of a Riemann-Stieltjes integral in the previous section was to convince the reader that there is a way of treating the discrete and continuous cases simultaneously and also of treating random variables that are neither discrete nor continuous.
Operationally, the discrete case uses summation and the continuous case uses integration.

Let $X_1, \ldots, X_n$ be $n$ random variables, let $\phi : R^n \to R$ be a real-valued continuous function of $n$ variables, and let $Z = \phi(X_1, \ldots, X_n)$. Then the expected value $E[Z] = \int_{-\infty}^{+\infty} z\,dF_Z(z)$ is defined, provided the integral is absolutely convergent. The definition, however, requires the determination of $F_Z$ knowing the probability characteristics of $X_1, \ldots, X_n$. The expected value of $Z$ can be calculated using the following theorem.

Theorem 7.3.1 Let $X_1, \ldots, X_n$ be $n$ random variables having a joint density $f_{X_1,\ldots,X_n}$ and let $\phi$ be a real-valued continuous function of $n$ variables. If $Z = \phi(X_1, \ldots, X_n)$, then
$$E[Z] = \int\cdots\int_{R^n} \phi(x_1, \ldots, x_n) f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dV_n,$$
where $dV_n$ denotes integration with respect to volume, provided the integral converges absolutely.

In a slightly more advanced course dealing with probability measures, this theorem is relatively easy to prove.

EXAMPLE 7.5 Let $(X, Y)$ be the coordinates of a point chosen at random from the unit square $U = \{(x, y) : 0 \le x \le 1, 0 \le y \le 1\}$ in the plane. Then
$$f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$
and
$$E[XY] = \iint_{R^2} xy f_{X,Y}(x, y)\,dA = \int_0^1\left(\int_0^1 xy\,dy\right)dx = \frac{1}{4}.$$
In this example, note that the integral for calculating $E[XY]$ is obtained by replacing $X$ and $Y$ by $x$ and $y$, respectively, to form the integrand $xy$, multiplying by the joint density, and then integrating with respect to area.

Granted Theorem 7.3.1, we can establish properties of the expected value.

Theorem 7.3.2 Let $X$ and $Y$ be random variables having a joint density with $E[X]$ and $E[Y]$ defined and let $a, b \in R$.
(i) If $P(X \ge 0) = 1$, then $E[X] \ge 0$.
(ii) The expected value of $aX + bY$ is defined, and $E[aX + bY] = aE[X] + bE[Y]$.
(iii) If $P(X \ge Y) = 1$, then $E[X] \ge E[Y]$.
(iv) $|E[X]| \le E[|X|]$.

PROOF:
(i) If $P(X \ge 0) = 1$, then $f_X(x) = 0$ for $x < 0$ and
$$E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx = \int_0^{+\infty} x f_X(x)\,dx \ge 0.$$
(ii) Putting aside the question of whether or not $E[aX + bY]$ is defined,
$$E[aX + bY] = \iint_{R^2} (ax + by) f_{X,Y}(x, y)\,dA = a\int_{-\infty}^{+\infty} x\left(\int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dy\right)dx + b\int_{-\infty}^{+\infty} y\left(\int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dx\right)dy = a\int_{-\infty}^{+\infty} x f_X(x)\,dx + b\int_{-\infty}^{+\infty} y f_Y(y)\,dy = aE[X] + bE[Y].$$
The steps involved in showing that $E[aX + bY]$ is defined are similar, using the inequality $|ax + by| \le |a||x| + |b||y|$.
(iii) Since $P(X \ge Y) = 1$, $P(X - Y \ge 0) = 1$ and $E[X] - E[Y] = E[X - Y] \ge 0$ by (ii) and (i).
(iv) Since $-|X| \le X \le |X|$, $-E[|X|] \le E[X] \le E[|X|]$ by (iii), and therefore $|E[X]| \le E[|X|]$. ■

EXAMPLE 7.6 Let $(X, Y)$ be the coordinates of a point chosen at random from the part of a disk $D$ with center $(0, 0)$ and radius 1 in the first quadrant. The joint density is
$$f_{X,Y}(x, y) = \begin{cases} 4/\pi & \text{if } x^2 + y^2 \le 1,\ x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$
and the marginal densities are
$$f_X(x) = \begin{cases} (4/\pi)\sqrt{1 - x^2} & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise,} \end{cases} \qquad f_Y(y) = \begin{cases} (4/\pi)\sqrt{1 - y^2} & \text{if } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Since $f_X(x) f_Y(y) \ne f_{X,Y}(x, y)$ for all $(x, y)$ in the part of a small disk centered at $(0, 0)$ that lies in the first quadrant, $X$ and $Y$ are not independent random variables. Since
$$E[X] = E[Y] = \int_0^1 y\,\frac{4}{\pi}\sqrt{1 - y^2}\,dy = \frac{4}{3\pi},$$
$E[X + Y] = 4/(3\pi) + 4/(3\pi) = 8/(3\pi)$. ■

Theorem 7.3.3 Let $X$ and $Y$ be independent random variables for which $E[X]$ and $E[Y]$ are defined. Then $E[XY]$ is defined and $E[XY] = E[X]E[Y]$.

PROOF: Since
$$\iint_{R^2} |xy| f_X(x) f_Y(y)\,dA = \iint_{R^2} |x||y| f_X(x) f_Y(y)\,dA = \int_{-\infty}^{+\infty} |x| f_X(x)\left(\int_{-\infty}^{+\infty} |y| f_Y(y)\,dy\right)dx = \left(\int_{-\infty}^{+\infty} |y| f_Y(y)\,dy\right)\left(\int_{-\infty}^{+\infty} |x| f_X(x)\,dx\right) < +\infty,$$
$E[XY]$ is defined. The same calculation with $|xy|$ replaced by $xy$ shows that $E[XY] = E[X]E[Y]$. ■

EXAMPLE 7.7 Let $X$ and $Y$ be independent random variables having exponential densities with parameters $\lambda_1$ and $\lambda_2$, respectively. By Exercise 7.2.5, $E[X] = 1/\lambda_1$ and $E[Y] = 1/\lambda_2$. By Theorem 7.3.3, $E[XY] = 1/(\lambda_1\lambda_2)$. ■

The preceding theorem can be extended to functions of $X$ and $Y$ as follows.

Theorem 7.3.4 Let $X$ and $Y$ be independent random variables and let $\phi$ and $\psi$ be continuous real-valued functions with $E[\phi(X)]$ and $E[\psi(Y)]$ defined.
Then $E[\phi(X)\psi(Y)]$ is defined, and
$$E[\phi(X)\psi(Y)] = E[\phi(X)]E[\psi(Y)].$$

All of the theorems and corollaries of Section 4.4 hold for arbitrary random variables. The proof of Schwarz's inequality (Equation 4.8) is precisely the same as in Section 4.4 as soon as it is established that $E[X^2] = 0$ implies that $P(X = 0) = 1$. Consider the case that $X$ has a density function $f_X$. For every positive integer $n$,
$$0 = E[X^2] = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx \ge \int_{\{|x| \ge 1/n\}} x^2 f_X(x)\,dx \ge \frac{1}{n^2}P\left(|X| \ge \frac{1}{n}\right).$$
Thus, $P(|X| \ge 1/n) = 0$ for every positive integer $n$, and since the events $(|X| \ge 1/n)$ increase to the event $(|X| > 0)$ as $n \to \infty$, $P(|X| > 0) = 0$, so that $P(X = 0) = P(|X| = 0) = 1$. This is all that is needed to replicate the proof of Schwarz's inequality. It is not essential that $X$ have a density function.

The following example can also be interpreted as a random walk in the plane if the bonds are thought of as instantaneous displacements of a randomly moving particle. For a more comprehensive treatment of chain molecules, see the book by P. J. Flory listed at the end of the chapter.

EXAMPLE 7.8 (Chain Molecules) Consider a chain molecule formed in the following way. Let $\ell$ be a fixed positive number representing the distance between two molecules or the length of the bond between the two. Starting with an initial molecule at the origin of the xy-plane, a bond is formed between it and molecule #1 of length $\ell$ making an angle $\theta_1$ with the positive x-axis, where $\theta_1$ is chosen at random from the interval $[0, 2\pi]$; starting from the position of molecule #1, a bond is formed between it and molecule #2 of length $\ell$ making an angle $\theta_2$ with the positive x-axis, where $\theta_2$ is chosen at random from the interval $[0, 2\pi]$, independently of $\theta_1$, and so forth, as shown in Figure 7.1. If there are $n$ bonds in the chain with an initial molecule at the origin, what is the expected value of the square of the distance of the nth molecule from the origin?
For $i = 1, \ldots, n$, let $(X_i, Y_i)$ be the change in the coordinates in going from the $(i-1)$st molecule to the $i$th molecule. Then
$$X_i = \ell\cos\theta_i, \qquad Y_i = \ell\sin\theta_i.$$

FIGURE 7.1 Chain molecule.

The position of molecule #n is then
$$\left(\sum_{i=1}^n X_i,\ \sum_{i=1}^n Y_i\right).$$
If $D$ denotes the distance of molecule #n from the initial molecule, then
$$D^2 = \left(\sum_{i=1}^n X_i\right)^2 + \left(\sum_{i=1}^n Y_i\right)^2 = \sum_{i=1}^n X_i^2 + \sum_{i \ne j} X_i X_j + \sum_{i=1}^n Y_i^2 + \sum_{i \ne j} Y_i Y_j.$$
Note that
$$E[X_i^2] = E[\ell^2\cos^2\theta_i] = \frac{\ell^2}{2\pi}\int_0^{2\pi} \cos^2\theta_i\,d\theta_i = \frac{\ell^2}{2\pi}\int_0^{2\pi} \frac{1 + \cos 2\theta_i}{2}\,d\theta_i = \frac{\ell^2}{2}$$
and that
$$E[X_i X_j] = \ell^2 E[\cos\theta_i\cos\theta_j] = \ell^2 E[\cos\theta_i]E[\cos\theta_j] = \ell^2\left(\frac{1}{2\pi}\int_0^{2\pi} \cos\theta_i\,d\theta_i\right)\left(\frac{1}{2\pi}\int_0^{2\pi} \cos\theta_j\,d\theta_j\right) = 0$$
by the independence of $\theta_i$ and $\theta_j$ for $i \ne j$ and Theorem 7.3.4. Similarly, $E[Y_i^2] = \ell^2/2$ and $E[Y_i Y_j] = 0$. Therefore,
$$E[D^2] = n\frac{\ell^2}{2} + n\frac{\ell^2}{2} = n\ell^2.$$
It should be emphasized that this is not the square of the expected value but the expected value of the square. By Schwarz's inequality, $(E[D])^2 \le E[D^2] = n\ell^2$, from which it follows that $E[D] \le \ell\sqrt{n}$. ■

In considering a single random variable $X$, most of the information concerning $X$ is embodied in its density function. For some purposes, we would like to summarize that information in a few parameters. Consider a random variable $X$ that has a finite second moment and density function $f_X$; i.e., $E[X^2] = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx < +\infty$. In Section 4.3, we saw that $|x| \le x^2 + 1$ for all $x \in R$, so that
$$\int_{-\infty}^{+\infty} |x| f_X(x)\,dx \le \int_{-\infty}^{+\infty} (x^2 + 1) f_X(x)\,dx = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx + 1 < +\infty,$$
and it follows that $E[X]$ is defined. Thus, $\mu_X = E[X]$ is finite and is called the mean of $X$. Since $(x - \mu_X)^2 = x^2 - 2\mu_X x + \mu_X^2$,
$$\int_{-\infty}^{+\infty} (x - \mu_X)^2 f_X(x)\,dx = \int_{-\infty}^{+\infty} x^2 f_X(x)\,dx - 2\mu_X\int_{-\infty}^{+\infty} x f_X(x)\,dx + \mu_X^2\int_{-\infty}^{+\infty} f_X(x)\,dx = E[X^2] - \mu_X^2 < +\infty.$$
Thus, $E[(X - \mu_X)^2]$ is finite, and we define a second parameter
$$\sigma_X^2 = E[(X - \mu_X)^2] = E[X^2] - \mu_X^2 = E[X^2] - (E[X])^2,$$
called the variance of $X$. The variance of $X$ is also denoted by $\mathrm{var}\,X$. It is easily checked that
$$\mathrm{var}(cX) = c^2\,\mathrm{var}\,X, \qquad \mathrm{var}(X + c) = \mathrm{var}\,X.$$

EXAMPLE 7.9 Let $X$ have a uniform density on $[a, b]$, $a < b$. It was shown in the previous section that $E[X] = (a + b)/2$ and $E[X^2] = (1/3)(b^2 + ab + a^2)$. Thus, $\mu_X = (a + b)/2$ and $\sigma_X^2 = \mathrm{var}\,X = (b - a)^2/12$. ■

EXAMPLE 7.10 Let $X$ have a $\Gamma(\alpha, \lambda)$ density. By Exercise 7.2.6, for each positive integer $r$,
$$E[X^r] = \frac{(\alpha + r - 1)(\alpha + r - 2) \times \cdots \times \alpha}{\lambda^r}.$$
In particular, $E[X] = \alpha/\lambda$ and $E[X^2] = ((\alpha + 1)\alpha)/\lambda^2$, so that $\mu_X = \alpha/\lambda$ and $\mathrm{var}\,X = \alpha/\lambda^2$. ■

EXAMPLE 7.11 Let $X$ have an exponential density with parameter $\lambda > 0$. Since this density is the same as the $\Gamma(1, \lambda)$ density, $\mu_X = 1/\lambda$ and $\mathrm{var}\,X = 1/\lambda^2$ by Example 7.10. ■

EXAMPLE 7.12 Let $X$ be a random variable having a standard normal density. Since
$$\int_{-\infty}^{+\infty} |x|\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = 2\int_0^{+\infty} x\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx < +\infty,$$
$E[X]$ is defined and
$$E[X] = \int_{-\infty}^{+\infty} x\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = 0,$$
since the integrand is an odd function. To find $\mathrm{var}\,X = E[X^2] - (E[X])^2 = E[X^2]$, we need only determine $E[X^2]$. Letting $Y = X^2$, by Example 6.19 $Y$ has a $\Gamma(1/2, 1/2)$ density. By Example 7.10, $E[Y] = 1$. Thus, $\mu_X = E[X] = 0$ and $\sigma_X^2 = \mathrm{var}\,X = 1$. ■

Now let $Y$ be a random variable having an $n(\mu, \sigma^2)$ density. By definition, $Y = \sigma X + \mu$ where $X$ has a standard normal density. It follows that $\mu_Y = E[Y] = \sigma E[X] + \mu = \mu$ and $\sigma_Y^2 = \mathrm{var}\,Y = \mathrm{var}(\sigma X + \mu) = \sigma^2\,\mathrm{var}\,X = \sigma^2$. From Example 6.17,
$$f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(y - \mu)^2/2\sigma^2},$$
and $\mu_Y$ and $\sigma_Y$ can be readily identified by examining the parameters $\mu$ and $\sigma$ in the function $f_Y$.

Although the random variable $X$ is not required to have a density in the next lemma, the proof will assume a density.

Lemma 7.3.5 (Markov's Inequality) If $X$ is any random variable for which $E[X]$ is defined and $t > 0$, then
$$P(|X| \ge t) \le \frac{E[|X|]}{t}.$$

PROOF:
$$E[|X|] = \int_{-\infty}^{+\infty} |x| f_X(x)\,dx \ge \int_{\{|x| \ge t\}} |x| f_X(x)\,dx \ge t\int_{\{|x| \ge t\}} f_X(x)\,dx = tP(|X| \ge t).$$
■

As in the discrete case, Chebyshev's inequality is an easy consequence of Markov's inequality.
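The step from Markov's inequality to Chebyshev's inequality can be sketched as follows, assuming that $X$ has mean $\mu$ and finite variance $\sigma^2$ and applying Lemma 7.3.5 to the random variable $(X - \mu)^2$ with $t = \delta^2$ for a fixed $\delta > 0$. This is the standard route, offered here as an illustration rather than as the text's own proof.

```latex
\begin{align*}
P(|X - \mu| \ge \delta)
  &= P\big((X - \mu)^2 \ge \delta^2\big) \\
  &\le \frac{E\big[(X - \mu)^2\big]}{\delta^2}
   = \frac{\sigma^2}{\delta^2}.
\end{align*}
```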
Theorem 7.3.6 (Chebyshev's Inequality) Let $X$ be a random variable with mean $\mu$ and finite variance $\sigma^2$. Then
$$P(|X - \mu| \ge \delta) \le \frac{\sigma^2}{\delta^2}$$
for all $\delta > 0$.

Suppose now that the random variables $Y, X_1, \ldots, X_n$ have a joint density. We can then consider the conditional density of $Y$ given $X_1, \ldots, X_n$ and define the conditional expectation of $Y$ given $X_1, \ldots, X_n$ by the equation
$$E[Y \mid X_1 = x_1, \ldots, X_n = x_n] = \int_{-\infty}^{+\infty} y f_{Y|X_1,\ldots,X_n}(y \mid x_1, \ldots, x_n)\,dy,$$
provided the integral is absolutely convergent.

Theorem 7.3.7 If $Y, X_1, \ldots, X_n$ have a joint density and $E[Y]$ is defined, then $E[Y \mid X_1 = x_1, \ldots, X_n = x_n]$ is defined, and
$$E[Y] = \int\cdots\int_{R^n} E[Y \mid X_1 = x_1, \ldots, X_n = x_n]\,f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \ldots dx_n.$$

Sketch of Proof: Since $E[Y]$ is defined,
$$+\infty > \int_{-\infty}^{+\infty} |y| f_Y(y)\,dy = \int_{-\infty}^{+\infty} |y|\left(\int\cdots\int_{R^n} f_{Y,X_1,\ldots,X_n}(y, x_1, \ldots, x_n)\,dx_1 \ldots dx_n\right)dy = \int\cdots\int_{R^n}\left(\int_{-\infty}^{+\infty} |y| f_{Y|X_1,\ldots,X_n}(y \mid x_1, \ldots, x_n)\,dy\right) f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \ldots dx_n.$$
Thus, the integral within parentheses is finite for "most" points $(x_1, \ldots, x_n)$ in $R^n$. The same calculation with $|y|$ replaced by $y$ will establish the final result. ■

EXAMPLE 7.13 Let $(X, Y)$ be the coordinates of a point chosen at random from a triangle with vertices at $(0, 0)$, $(1, 0)$, and $(1, 1)$. Intuitively, given that $X = x$ with $0 < x \le 1$, the point $(x, Y)$ is then chosen at random from the line segment joining $(x, 0)$ to $(x, x)$; i.e., given that $X = x$, $Y$ has a uniform density on $(0, x)$, and therefore $E[Y \mid X = x] = x/2$ for $0 < x \le 1$. This intuitive argument can be justified by the following formal calculations. The joint density of $X$ and $Y$ is
$$f_{X,Y}(x, y) = \begin{cases} 2 & \text{if } 0 \le y \le x,\ 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Clearly, $f_X(x) = 0$ for $x < 0$ and $x > 1$. For $0 \le x \le 1$,
$$f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x, y)\,dy = \int_0^x 2\,dy = 2x.$$
Therefore,
$$f_X(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
and
$$f_{Y|X}(y \mid x) = \begin{cases} 1/x & \text{if } 0 \le y \le x,\ 0 < x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
For $0 < x \le 1$,
$$E[Y \mid X = x] = \int_{-\infty}^{+\infty} y f_{Y|X}(y \mid x)\,dy = \int_0^x y\,\frac{1}{x}\,dy = \frac{1}{x}\,\frac{x^2}{2} = \frac{x}{2}.$$

Care must be taken in making heuristic arguments of the type appearing in this example.
Such arguments can allow one to discover facts, but they should be formally verified as above before staking one's reputation on the result.

The last theorem can be stated in a more general context. Suppose $Y_1, \dots, Y_m, X_1, \dots, X_n$ are random variables having a joint density and $\psi$ is a continuous real-valued function of $m$ variables. If $E[\psi(Y_1, \dots, Y_m)]$ is defined, then
$$E[\psi(Y_1, \dots, Y_m) \mid X_1 = x_1, \dots, X_n = x_n] = \int \cdots \int_{R^m} \psi(y_1, \dots, y_m)\, f_{Y_1, \dots, Y_m \mid X_1, \dots, X_n}(y_1, \dots, y_m \mid x_1, \dots, x_n)\, dy_1 \cdots dy_m.$$
Starting with this result, it is possible to develop such concepts as conditional variance of $Y$ given $X_1 = x_1, \dots, X_n = x_n$, denoted by $\operatorname{var}(Y \mid X_1 = x_1, \dots, X_n = x_n)$, and so forth.

EXERCISES 7.3
1. Let $X$ and $Y$ be independent random variables, both having an exponential density with parameter $\lambda = 1$. If $Z = \max(X, Y)$, calculate $E[Z]$ without using the density of $Z$.
2. A point $(X, Y)$ is selected at random from the unit square $\{(x, y) : 0 \le x \le 1, 0 \le y \le 1\}$. Given that $X = x$ and $Y = y$, a point $(U, V)$ is selected at random from the rectangle $\{(u, v) : 0 \le u \le x, 0 \le v \le y\}$. Calculate $E[U \mid X = x, Y = y]$.
3. Let $\Lambda$ be a random variable having an exponential density with parameter 1. Given that $\Lambda = \lambda$, the random variables $X_1, \dots, X_n$ are independent random variables each having an exponential density with parameter $\lambda$. Find $E[\Lambda \mid X_1 = x_1, \dots, X_n = x_n]$.
4. Let $X$ and $Y$ be random variables having the joint density
$$f_{X,Y}(x, y) = \begin{cases} 8xy & \text{if } 0 \le y \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Find $E[Y \mid X = x]$ and $E[X \mid Y = y]$.
5. Let $\Lambda$ be a random variable having a gamma density $\Gamma(\alpha, \beta)$ and given that $\Lambda = \lambda$, $X$ has an exponential density with parameter $\lambda$. Find the density of $X$ and the conditional density of $\Lambda$ given $X = x$.
6. Let $U$ and $V$ be the random variables of Exercises 6.6.7 and 7.2.4 and let $R = V - U$ be the range of the $X_1, \dots, X_n$. Calculate $E[R]$ and $\operatorname{var} R$.
7. Consider the random variables $U$ and $V$ of the previous problem. Determine $f_{U \mid V}(u \mid v)$ and calculate $E[U \mid V = v]$.
8. Consider the random variables $X$, $Y$, and $Z$ having the joint density of Exercise 6.6.8. Determine $f_{Z \mid X,Y}(z \mid x, y)$ and calculate $E[Z \mid X = x, Y = y]$.
9. Let $X$ and $Y$ be the coordinates of a point chosen at random from the triangle in the plane with vertices at $(-1, 0)$, $(0, 1)$, and $(1, 0)$. Determine $E[Y \mid X = x]$ without calculations.
10. A number $X$ is chosen at random from the interval $[0, 1]$. Given that $X = x$, a number $Y$ is chosen at random from the interval $[0, x]$. Calculate $E[X \mid Y = y]$.

7.4 NORMAL DENSITY

Consider a sequence $\{X_j\}_{j=1}^{n}$ of $n$ Bernoulli trials with probability of success $p = 1/2$ and let $S_n = \sum_{j=1}^{n} X_j$. We know that $S_n$ has the binomial density function
$$b(k; n, 1/2) = \binom{n}{k} 2^{-n}, \qquad k = 0, 1, \dots, n,$$
with $E[S_n] = n/2$ and $\sigma_{S_n} = \sqrt{n}/2$. If we were to examine a bar graph of this density, it would be centered about the point $x = n/2$ on the $x$-axis, and the variance $n/4$ would be large for large $n$, indicating that the density is spread out far from the mean. As $n$ becomes large, the individual probabilities would also become small. To eliminate the spreading effect, we consider the normalized sum
$$S_n^* = \frac{S_n - (n/2)}{\sqrt{n}/2},$$
which is centered about $E[S_n^*] = 0$ and has variance $\operatorname{var} S_n^* = \operatorname{var}((S_n - (n/2))/(\sqrt{n}/2)) = (4/n) \operatorname{var} S_n = 1$. The bar graph of $S_{36}^*$ is depicted in Figure 7.2. The area of the rectangle centered above 0 represents the probability
$$P(S_{36}^* = 0) = P(S_{36} = 18) = \binom{36}{18} 2^{-36} \approx .132.$$

FIGURE 7.2 $P(S_{36}^* = i)$.

If we compare Figure 7.2 with the graph of the standard normal density depicted in Figure 6.11, it might appear that one of the two could be used to approximate the other. At one time it was impractical to calculate $P(\alpha \le S_n \le \beta)$ because of the large number of arithmetic operations required even when $n$ is moderately large, and so the normal density was used to approximate this probability. With the advent of fast computers, such calculations can now be done in fractions of a second for moderate values of $n$. Even though approximation of
binomial probabilities is not as important as it once was, there are valid reasons for looking into normal approximations to the binomial density. The following theorem was first proved by DeMoivre (1667-1754). The proof of the theorem involves only elementary facts from the calculus. Reading the proof is not essential for learning probability at this stage.

Theorem 7.4.1 (Central Limit Theorem) Let $\{X_j\}_{j=1}^{n}$ be a sequence of $n$ Bernoulli trials with probability of success $p = 1/2$ and let $S_n = \sum_{j=1}^{n} X_j$. Then
$$\lim_{n \to \infty} P\left( \left| S_n - \frac{n}{2} \right| < x \frac{\sqrt{n}}{2} \right) = \frac{1}{\sqrt{2\pi}} \int_{-x}^{x} e^{-t^2/2}\, dt.$$

The probability on the left is the same as $P(|S_n^*| < x)$. The proof of the theorem requires the following approximations.

Lemma 7.4.2 There are functions $\gamma$ and $\delta$ such that
(i) $\ln(1 + x) = x(1 + \gamma(x))$ if $|x| < 1$.
(ii) $|\gamma(x)| \le \delta(|x|)$ if $|x| < 1$.
(iii) $\delta$ is nondecreasing on $[0, 1)$ and $\lim_{x \to 0+} \delta(x) = 0$.

PROOF: The function $\ln(1 + x)$ has the Maclaurin series expansion
$$\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots = x(1 + \gamma(x))$$
for $|x| < 1$ where $\gamma(x) = -(x/2) + (x^2/3) - (x^3/4) + \cdots$. Let
$$\delta(x) = \frac{x}{2} + \frac{x^2}{3} + \frac{x^3}{4} + \cdots, \qquad |x| < 1,$$
so that $|\gamma(x)| \le \delta(|x|)$. Since the series defining $\delta$ converges absolutely in $(-1, +1)$ and the sum of a power series is a continuous function on its interval of convergence, $\lim_{x \to 0+} \delta(x) = \delta(0) = 0$. It is clear that $\delta$ is increasing on $[0, 1)$. ■

We will also need the following discrete version of the mean value theorem of the integral calculus.

Lemma 7.4.3 If $a_1, \dots, a_n$ are nonnegative real numbers and $b_1, \dots, b_n$ are any real numbers, then there is an $m$ with $|m| \le \max_{1 \le j \le n} |b_j|$ such that
$$\sum_{j=1}^{n} a_j b_j = m \sum_{j=1}^{n} a_j.$$

PROOF: We can assume that some $a_j > 0$ because otherwise we can take $m = 0$. Since
$$\left| \sum_{j=1}^{n} a_j b_j \right| \le \sum_{j=1}^{n} a_j |b_j| \le \left( \max_{1 \le j \le n} |b_j| \right) \sum_{j=1}^{n} a_j,$$
we can take $m = \sum_{j=1}^{n} a_j b_j \big/ \sum_{j=1}^{n} a_j$. ■

We will also need to approximate the exponential function. It follows from the Maclaurin series expansion of $e^y$ that $e^y = 1 + \theta(y)$ where $\theta(y) = y + y^2/2! + y^3/3! + \cdots$ satisfies (1) $\lim_{y \to 0} \theta(y) = 0$, (2) $|\theta(y)| \le \theta(|y|)$, and (3) $\theta(y)$ is nondecreasing on $[0, +\infty)$.
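Proof of Theorem 7.4.1 We prove the theorem for an even number of trials first. The number $x$ will be fixed throughout the proof. Consider only those $n$ for which $x\sqrt{n/2} < n$. Let $P_n(x) = P(|S_{2n} - n| < x\sqrt{n/2})$. Since $S_{2n}$ has the binomial density $b(\cdot\,; 2n, 1/2)$,
$$P_n(x) = \sum_{|k - n| < x\sqrt{n/2}} \binom{2n}{k} 2^{-2n}.$$
Substituting $j$ for $k - n$,
$$P_n(x) = \sum_{|j| < x\sqrt{n/2}} \binom{2n}{n + j} 2^{-2n}.$$
Note that $\binom{2n}{n+j} = \binom{2n}{n-j}$, so that the term in the sum corresponding to $j$ is equal to the term corresponding to $-j$. Consider the terms of the sum for which $j \ge 0$. Since
$$\binom{2n}{n+j} = \frac{(2n)!}{(n+j)!\,(n-j)!}, \qquad (n+j)! = (n+j) \times \cdots \times (n+1)\, n!, \qquad \frac{n!}{(n-j)!} = n(n-1) \times \cdots \times (n-j+1),$$
we can write
$$\binom{2n}{n+j} 2^{-2n} = \frac{(2n)!\, 2^{-2n}}{n!\, n!} \cdot \frac{n(n-1) \times \cdots \times (n-j+1)}{(n+j)(n+j-1) \times \cdots \times (n+1)}.$$
By the above remark, this equation holds for $j < 0$ (with $|j|$ in place of $j$ on the right). Therefore,
$$P_n(x) = \sum_{|j| < x\sqrt{n/2}} \frac{(2n)!\, 2^{-2n}}{n!\, n!} \cdot \frac{n(n-1) \times \cdots \times (n-j+1)}{(n+j)(n+j-1) \times \cdots \times (n+1)}.$$
Letting $P_n = (2n)!\, 2^{-2n}/(n!\, n!)$ and applying Stirling's formula, Equation 5.10, to the factorials, $P_n \sim 1/\sqrt{\pi n}$. This means that $P_n/(1/\sqrt{\pi n}) \to 1$ as $n \to \infty$; i.e.,
$$\frac{P_n}{1/\sqrt{\pi n}} = 1 + \delta_n$$
where $\lim_{n \to \infty} \delta_n = 0$. Thus, we can write $P_n = (1/\sqrt{\pi n})(1 + \delta_n)$. Therefore,
$$P_n(x) = \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}} (1 + \delta_n)\, \frac{n(n-1) \times \cdots \times (n-j+1)}{(n+j)(n+j-1) \times \cdots \times (n+1)}. \tag{7.1}$$
For $|j| < x\sqrt{n/2}$, let
$$D_{n,j} = \frac{n(n-1) \times \cdots \times (n-j+1)}{(n+j)(n+j-1) \times \cdots \times (n+1)} = \frac{1}{(1 + (j/n))(1 + (j/(n-1))) \times \cdots \times (1 + (j/(n-j+1)))}.$$
Taking natural logarithms and approximating the $\ln$ function using the $\gamma$ function of Lemma 7.4.2,
$$\ln D_{n,j} = -\sum_{i=0}^{j-1} \ln\left(1 + \frac{j}{n-i}\right) = -\sum_{i=0}^{j-1} \frac{j}{n-i}\left(1 + \gamma\left(\frac{j}{n-i}\right)\right).$$
Writing $j/(n-i) = (j/n)(1 + (i/(n-i)))$,
$$\ln D_{n,j} = -\frac{j^2}{n} - \frac{j}{n} \sum_{i=0}^{j-1} \left( \frac{i}{n-i} + \gamma\left(\frac{j}{n-i}\right) + \frac{i}{n-i}\, \gamma\left(\frac{j}{n-i}\right) \right).$$
Applying Lemma 7.4.3 to the second sum on the right,
$$\ln D_{n,j} = -\frac{j^2}{n} - \frac{j}{n}\, \gamma_{n,j} \sum_{i=0}^{j-1} 1 = -\frac{j^2}{n} - \gamma_{n,j} \frac{j^2}{n}$$
where
$$|\gamma_{n,j}| \le \max_{0 \le i \le j-1} \left| \frac{i}{n-i} + \gamma\left(\frac{j}{n-i}\right) + \frac{i}{n-i}\, \gamma\left(\frac{j}{n-i}\right) \right|.$$
Recalling the $\delta$ function of Lemma 7.4.2,
$$\left| \gamma\left(\frac{j}{n-i}\right) \right| \le \delta\left(\frac{|j|}{n - |j|}\right).$$
Noting that
$$\frac{|j|}{n - |j|} \le \frac{x\sqrt{n/2}}{n - x\sqrt{n/2}}$$
and letting $\Delta_{n,x}$ denote the quantity on the right, $\lim_{n \to \infty} \Delta_{n,x} = 0$ and $|\gamma(j/(n-i))| \le \delta(\Delta_{n,x})$. Also note that $|i/(n-i)| \le |j|/(n - |j|) \le \Delta_{n,x}$. Thus,
$$|\gamma_{n,j}| \le \delta(\Delta_{n,x}) + \Delta_{n,x} + \Delta_{n,x}\, \delta(\Delta_{n,x}).$$
Letting $\Lambda_{n,x}$ denote the quantity on the right, $|\gamma_{n,j}| \le \Lambda_{n,x}$ where $\lim_{n \to \infty} \Lambda_{n,x} = 0$. Thus,
$$D_{n,j} = e^{-j^2/n}\, e^{-\gamma_{n,j}(j^2/n)}.$$
Using the approximation $e^y = 1 + \theta(y)$ discussed after Lemma 7.4.3,
$$D_{n,j} = e^{-j^2/n} \left( 1 + \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) \right).$$
Thus, Equation 7.1 can be written
$$P_n(x) = \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n} (1 + \delta_n) \left( 1 + \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) \right)$$
$$= \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n} + \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n} \left( \delta_n + \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) + \delta_n\, \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) \right).$$
We will now show that the second sum on the right has the limit zero as $n \to \infty$. Applying Lemma 7.4.3 to the second sum, it can be written
$$\Lambda_n \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n} \tag{7.2}$$
where
$$|\Lambda_n| \le \max_{|j| < x\sqrt{n/2}} \left| \delta_n + \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) + \delta_n\, \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) \right|.$$
Since $|\theta(y)| \le \theta(|y|)$ and $\theta(y)$ is nondecreasing on $[0, \infty)$,
$$\left| \theta\left( -\gamma_{n,j} \frac{j^2}{n} \right) \right| \le \theta\left( |\gamma_{n,j}| \frac{j^2}{n} \right) \le \theta\left( \Lambda_{n,x} \frac{x^2}{2} \right)$$
and
$$|\Lambda_n| \le |\delta_n| + \theta\left( \Lambda_{n,x} \frac{x^2}{2} \right) + |\delta_n|\, \theta\left( \Lambda_{n,x} \frac{x^2}{2} \right) \to 0 \text{ as } n \to \infty,$$
and therefore $\lim_{n \to \infty} \Lambda_n = 0$. Note also that the sum in 7.2 is bounded in $n$, since it has at most $2x\sqrt{n/2} + 1$ terms, each at most $1/\sqrt{\pi n}$:
$$\sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n} \le \frac{2x\sqrt{n/2} + 1}{\sqrt{\pi n}},$$
and therefore the quantity in 7.2 has the limit zero as $n \to \infty$. It follows that
$$\lim_{n \to \infty} P_n(x) = \lim_{n \to \infty} \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n}$$
provided the limit on the right exists. If we let $x_j = j\sqrt{2/n}$, then the points $\{x_j : |j| < x\sqrt{n/2}\}$ constitute a partition of the interval $[-x, x]$ into subintervals $[x_{j-1}, x_j]$ of length $\Delta x_j = \sqrt{2/n}$, and
$$\sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{\pi n}}\, e^{-j^2/n} = \sum_{|j| < x\sqrt{n/2}} \frac{1}{\sqrt{2\pi}}\, e^{-x_j^2/2}\, \Delta x_j.$$
Since the sum on the right is just a Riemann sum defining the integral $\int_{-x}^{x} (1/\sqrt{2\pi})\, e^{-t^2/2}\, dt$,
$$\lim_{n \to \infty} P_n(x) = \frac{1}{\sqrt{2\pi}} \int_{-x}^{x} e^{-t^2/2}\, dt.$$
This completes the proof for even $n$.

To take care of odd $n$ it is necessary to do an epsilon argument. We will use $\phi(x)$ in place of $(1/\sqrt{2\pi})\, e^{-x^2/2}$ for the rest of the proof. Since $\int_{-x}^{x} \phi(t)\, dt$ is a continuous function of $x$, given $\epsilon > 0$ there is an $h > 0$ such that
$$\int_{-x}^{x} \phi(t)\, dt - \epsilon < \int_{-x+h}^{x-h} \phi(t)\, dt \qquad \text{and} \qquad \int_{-x-h}^{x+h} \phi(t)\, dt < \int_{-x}^{x} \phi(t)\, dt + \epsilon.$$
Since
$$\lim_{n \to \infty} P(|S_{2n} - n| < (x - h)\sqrt{n/2}) = \int_{-x+h}^{x-h} \phi(t)\, dt$$
and
$$\lim_{n \to \infty} P(|S_{2n} - n| < (x + h)\sqrt{n/2}) = \int_{-x-h}^{x+h} \phi(t)\, dt,$$
there is a positive integer $N$ such that for all $n \ge N$
(i) $(1/2) - (h/2)\sqrt{2n} < 0$.
(ii) $P(|S_{2n} - n| < (x - h)\sqrt{n/2}) > \int_{-x}^{x} \phi(t)\, dt - \epsilon$.
(iii) $P(|S_{2n} - n| < (x + h)\sqrt{n/2}) < \int_{-x}^{x} \phi(t)\, dt + \epsilon$.

Since $X_i(\omega) = 0$ or 1, $|X_i(\omega) - (1/2)| = 1/2$ for all $i$, and therefore
$$\left| S_{2n+1}(\omega) - \frac{2n+1}{2} \right| = \left| S_{2n}(\omega) + X_{2n+1}(\omega) - n - \frac{1}{2} \right| \le |S_{2n}(\omega) - n| + \left| X_{2n+1}(\omega) - \frac{1}{2} \right| = |S_{2n}(\omega) - n| + \frac{1}{2}.$$
Suppose $|S_{2n}(\omega) - n| < (x - h)\sqrt{n/2}$. By (i), for $n \ge N$,
$$\left| S_{2n+1}(\omega) - \frac{2n+1}{2} \right| < (x - h)\sqrt{\frac{n}{2}} + \frac{1}{2} < x\sqrt{\frac{n}{2}} < x\sqrt{\frac{2n+1}{4}}.$$
Similarly, if $|S_{2n+1}(\omega) - (2n+1)/2| < x\sqrt{(2n+1)/4}$, then
$$|S_{2(n+1)}(\omega) - (n+1)| < (x + h)\sqrt{\frac{n+1}{2}}.$$
It follows from these relations and (ii) and (iii) above that for $n \ge N$,
$$\int_{-x}^{x} \phi(t)\, dt - \epsilon < P\left( |S_{2n} - n| < (x - h)\sqrt{\frac{n}{2}} \right) \le P\left( \left| S_{2n+1} - \frac{2n+1}{2} \right| < x\sqrt{\frac{2n+1}{4}} \right)$$
$$\le P\left( |S_{2(n+1)} - (n+1)| < (x + h)\sqrt{\frac{n+1}{2}} \right) < \int_{-x}^{x} \phi(t)\, dt + \epsilon.$$
This shows that
$$\lim_{n \to \infty} P\left( \left| S_{2n+1} - \frac{2n+1}{2} \right| < x\sqrt{\frac{2n+1}{4}} \right) = \int_{-x}^{x} \phi(t)\, dt. \; ■$$

The original central limit theorem has a tendency to underestimate the binomial probabilities, as can be seen from Table 7.1 in the $n = 36$, $p = 1/2$ case. The central limit theorem was improved upon by Laplace (1749-1827). Let $\{X_j\}_{j=1}^{n}$ be a sequence of $n$ Bernoulli random variables with probability of success $p$ and let $S_n = \sum_{j=1}^{n} X_j$. The following result gives a better approximation of binomial probabilities. The proof belongs in a more advanced text.
$$P(\alpha \le S_n \le \beta) \approx \Phi\left( x_\beta + \frac{h}{2} \right) - \Phi\left( x_\alpha - \frac{h}{2} \right) \tag{7.3}$$
where $h = 1/\sqrt{npq}$ and $x_t = (t - np)h$.

TABLE 7.1 Normal Approximation of Binomial Probabilities

Number of Successes       Probability    Normal Approximation
$17 \le S_n \le 19$       .3833          .2611
$16 \le S_n \le 20$       .5950          .4950
$15 \le S_n \le 21$       .7570          .6827
$14 \le S_n \le 22$       .8675          .8176
$13 \le S_n \le 23$       .9347          .9044
$12 \le S_n \le 24$       .9711          .9545
$11 \le S_n \le 25$       .9887          .9804
$10 \le S_n \le 26$       .9960          .9923
$9 \le S_n \le 27$        .9988          .9973

EXAMPLE 7.14 Suppose $n = 36$ and $p = 1/2$. According to Table 7.1, $P(13 \le S_n \le 23) = .9347$. If we use Equation 7.3 to approximate this probability, then $h = 1/3$, $x_{13} = -5/3$, $x_{23} = 5/3$, and
$$P(13 \le S_n \le 23) \approx \Phi\left( \frac{5}{3} + \frac{1}{6} \right) - \Phi\left( -\frac{5}{3} - \frac{1}{6} \right) \approx .9332,$$
a much better approximation than that given in Table 7.1. ■

The following theorem is a weakened version of the last approximation but has the advantage of being easier to apply in some situations.

Theorem 7.4.4 (DeMoivre-Laplace Limit Theorem) Let $\{X_j\}_{j=1}^{n}$ be a sequence of $n$ Bernoulli random variables with probability of success $p$, let $S_n = \sum_{j=1}^{n} X_j$, and let $S_n^* = (S_n - np)/\sqrt{npq}$. Then for fixed $a < b$,
$$P(a \le S_n^* \le b) \approx \Phi(b) - \Phi(a). \tag{7.4}$$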
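The comparison in Example 7.14 can be reproduced numerically. The sketch below is an illustration rather than the author's procedure: it computes the exact binomial probability, the approximation $\Phi(x_\beta) - \Phi(x_\alpha)$ without the $h/2$ terms, and the approximation of Equation 7.3. The function names are hypothetical, and $\Phi$ is evaluated with the standard library's error function.

```python
import math


def phi(x: float) -> float:
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binomial_prob(n: int, p: float, lo: int, hi: int) -> float:
    """Exact P(lo <= S_n <= hi) for a sum of n Bernoulli(p) random variables."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(lo, hi + 1))


def plain_normal(n: int, p: float, lo: int, hi: int) -> float:
    """Phi(x_hi) - Phi(x_lo), where x_t = (t - n p) h and h = 1 / sqrt(n p q)."""
    h = 1.0 / math.sqrt(n * p * (1.0 - p))
    return phi((hi - n * p) * h) - phi((lo - n * p) * h)


def corrected_normal(n: int, p: float, lo: int, hi: int) -> float:
    """Equation 7.3: Phi(x_hi + h/2) - Phi(x_lo - h/2)."""
    h = 1.0 / math.sqrt(n * p * (1.0 - p))
    return phi((hi - n * p) * h + h / 2.0) - phi((lo - n * p) * h - h / 2.0)


n, p, lo, hi = 36, 0.5, 13, 23   # the case treated in Example 7.14
print("exact     ", round(binomial_prob(n, p, lo, hi), 4))
print("plain     ", round(plain_normal(n, p, lo, hi), 4))
print("corrected ", round(corrected_normal(n, p, lo, hi), 4))
```

For this case the three values should agree, to the precision printed, with the figures .9347, .9044, and .9332 quoted in the text.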
EXAMPLE 7.15 A survey is undertaken to determine how many voters in a population of eligible voters favor candidate A. Assume that the unknown proportion of voters who favor A is $p$ and that voters act independently of one another. Suppose we want to determine how many should be polled so that the observed proportion of favorable voters is within .05 of $p$ with probability at least .95. We can look upon the polling as a sequence of Bernoulli trials $\{X_j\}_{j=1}^{n}$ with unknown probability of success $p$. The observed proportion of favorable voters will then be $S_n/n$, and we want to choose $n$ so that
$$P\left( \left| \frac{S_n}{n} - p \right| < .05 \right) \ge .95.$$
We could proceed as in Section 4.3 by using Inequality 4.6 to require that
$$P\left( \left| \frac{S_n}{n} - p \right| \ge .05 \right) \le \frac{1}{4n(.05)^2} \le .05;$$
i.e., that $n \ge 2000$. Since Inequality 4.6 assumes virtually nothing about the density of $S_n/n$, a better result might be obtained by invoking Theorem 7.4.4. Note that
$$P\left( \left| \frac{S_n}{n} - p \right| < .05 \right) = P\left( \frac{|S_n - np|}{\sqrt{npq}} < (.05)\sqrt{n/pq} \right) = P\left( |S_n^*| < (.05)\sqrt{n/pq} \right).$$
By the Standard Normal Distribution Function table (see page 346), the approximate solution of the equation $\Phi(x) - \Phi(-x) = 2\Phi(x) - 1 = .95$ is $x = 1.96$. Since $P(|S_n^*| \le x) \approx \Phi(x) - \Phi(-x) = .95$, we should choose $n$ so that $(.05)\sqrt{n/pq} \ge 1.96$, in which case we would have $P(|S_n^*| < (.05)\sqrt{n/pq}) \ge .95$. This requires that
$$n \ge \left( \frac{1.96}{.05} \right)^2 pq.$$
Since $pq = p(1 - p) \le 1/4$, if we choose $n$ so that
$$n \ge \left( \frac{1.96}{.05} \right)^2 \frac{1}{4} \ge \left( \frac{1.96}{.05} \right)^2 pq,$$
we would then have $P(|(S_n/n) - p| < .05) \ge .95$. We therefore take $n$ to be the smallest integer for which
$$n \ge \left( \frac{1.96}{.05} \right)^2 \frac{1}{4} = 384.16;$$
i.e., we take $n = 385$. By polling 385 eligible voters and using $S_n/n$ to estimate the unknown $p$, we know that our estimate will be within .05 of $p$ 19 times out of 20. ■

The central limit theorem is valid in much more general situations than those dealt with here. For example, if $\{X_j\}$ is a sequence of independent random variables having the same distribution function with $\mu = E[X_1]$, $\sigma^2 = \operatorname{var} X_1 < +\infty$, and $S_n = \sum_{j=1}^{n} X_j$, then
$$\lim_{n \to \infty} P\left( a \le \frac{S_n - n\mu}{\sigma\sqrt{n}} \le b \right) = \Phi(b) - \Phi(a). \tag{7.5}$$

EXERCISES 7.4
Equations 7.4 and 7.5 were used to obtain answers to the following problems.
1. Approximate $\sum_{j=?}^{?} \binom{64}{j} 2^{-64}$.
2. Approximate $\sum_{j=?}^{?} \binom{128}{j} (1/4)^j (3/4)^{128-j}$.
3. A jumbo jet with a seating capacity of 360 passengers is allowed a maximum weight of 59,000 pounds for passengers. If the average weight of a passenger is 160 pounds with a standard deviation of $\sigma = 48$ pounds, what is the approximate probability that the weight limit will be exceeded, assuming the 360 passengers that board are a random sample from the population?
4. A national polling agency would like to determine the percentage of eligible voters who favor their client within 3 percentage points with 90 percent confidence. How many eligible voters should be polled?
5. Consider a particle taking a random walk on the integers starting at 0 with $p = 1/2$. What is the approximate probability that the particle will be within 30 units of 0 after 1000 steps?
6. Consider a particle taking a random walk on the integers starting at 0 with $p = .45$. What is the approximate probability that the particle will be to the right of $-50$ after 1000 steps?
7. The $n$ real numbers $a_1, \dots, a_n$ are rounded off to the nearest integers $a_1 + X_1, \dots, a_n + X_n$, respectively, where the round-off errors $X_1, \dots, X_n$ are assumed to be independent and have a uniform density on $[-1/2, 1/2]$. Use the central limit theorem to find a number $A > 0$, depending upon $n$, such that $P(|\sum_{j=1}^{n} X_j| < A) = .99$. (See the final paragraph of this section.)
8. A programmer decides to carry $m$ significant figures to the right of the decimal point and round off the result of any addition, multiplication, or division operation to that many figures. Assume that $10^6$ elementary operations are performed, that successive round-off errors are independent and have a uniform density on $[-(1/2)10^{-m}, (1/2)10^{-m}]$, and that the final error is the sum of all the round-off errors.
Find an upper bound, which does not depend upon $m$, for the probability that the final error will be less than $5 \times 10^{-m+2}$ in absolute value.

7.5 COVARIANCE AND COVARIANCE FUNCTIONS

The covariance $\operatorname{cov}(X, Y)$ between two random variables with finite second moments was defined in Section 4.4. All the definitions, lemmas, and theorems of that section are valid for any random variables, whether discrete, continuous, or a mixture of the two. We can and will use the properties of $\operatorname{cov}(X, Y)$ described in Section 4.4.

Assuming that the random variables $X$, $Y$ have finite second moments with $\sigma_X > 0$ and $\sigma_Y > 0$, the correlation between $X$ and $Y$ is defined just as in the discrete case by the equation
$$\rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\operatorname{var} X}\, \sqrt{\operatorname{var} Y}}.$$
As pointed out above, Inequality 4.8, Schwarz's inequality, holds for any random variables with finite second moment.

Theorem 7.5.1 (Schwarz's Inequality) If $X$ and $Y$ are any random variables with finite second moments, then
$$(E[XY])^2 \le E[X^2]\, E[Y^2] \tag{7.6}$$
with equality holding if and only if $P(X = 0) = 1$ or $P(Y = aX) = 1$ for some constant $a$.

As in the discrete case, $|\rho(X, Y)| \le 1$ with equality if and only if there are constants $a, b \in R$ such that $P(Y = aX + b) = 1$.

If $x = (x_1, \dots, x_n)$ is a point in $R^n$, the length of the vector $x$ is the quantity $(\sum_{i=1}^{n} x_i^2)^{1/2}$. By analogy, if $\{x_1, \dots, x_n\}$ is the range of a random variable $X$, we could define the length of $X$, written $\|X\|$, by the equation
$$\|X\| = \left( \sum_{i=1}^{n} x_i^2 f_X(x_i) \right)^{1/2} = \sqrt{E[X^2]}.$$
Since the quantity on the right makes sense for any random variable with finite second moments, we can extend this notion as follows.

Definition 7.4 If $X$ is any random variable with finite second moment, define $\|X\| = \sqrt{E[X^2]}$; $\|X\|$ is called the norm of $X$. ■

What is to be made of the equation $\|X\| = \sqrt{E[X^2]} = 0$? Putting $Y = 1$ in Inequality 7.6, $(E[X])^2 \le E[X^2]$ and $E[X] = 0$. According to the discussion following Theorem 7.3.4, $\operatorname{var} X = 0$, and therefore $P(X = 0) = 1$.
Note that this does not mean that $X(\omega) = 0$ for every $\omega \in \Omega$. If we have two random variables $X$ and $Y$ with $\|X - Y\| = \sqrt{E[(X - Y)^2]} = 0$, we can conclude only that $P(X = Y) = 1$. Two random variables $X$ and $Y$ with this property are said to be equal in the probability sense and will be regarded as the same in this section.

As in vector calculus, once we have a concept of length, we can go on to distances.

Definition 7.5 If $X$ and $Y$ are random variables with finite second moments, the mean square distance between $X$ and $Y$ is the quantity
$$\|X - Y\| = \sqrt{E[(X - Y)^2]}. \; ■$$

The following inequality is the analog of the geometrical fact that the length of a side of a triangle is less than or equal to the sum of the lengths of the other two sides.

Lemma 7.5.2 (Triangle Inequality) If $X$ and $Y$ are any random variables with finite second moments, then
$$\|X + Y\| \le \|X\| + \|Y\|. \tag{7.7}$$

PROOF: Since
$$\|X + Y\|^2 = E[(X + Y)^2] = E[X^2] + 2E[XY] + E[Y^2]$$
and $E[XY] \le \sqrt{E[X^2]}\, \sqrt{E[Y^2]}$ by Schwarz's inequality,
$$\|X + Y\|^2 \le E[X^2] + 2\sqrt{E[X^2]}\, \sqrt{E[Y^2]} + E[Y^2] = \left( \sqrt{E[X^2]} + \sqrt{E[Y^2]} \right)^2 = (\|X\| + \|Y\|)^2,$$
and therefore $\|X + Y\| \le \|X\| + \|Y\|$. ■

Replacing $X$ by $X - Z$ and $Y$ by $Z - Y$ in 7.7, we obtain the inequality
$$\|X - Y\| \le \|X - Z\| + \|Z - Y\|, \tag{7.8}$$
which is also referred to as the triangle inequality. Another inequality can be obtained from Inequality 7.7 by replacing $X$ by $X - Y$ to obtain $\|X\| \le \|X - Y\| + \|Y\|$ or $\|X - Y\| \ge \|X\| - \|Y\|$. Interchanging $X$ and $Y$ in the latter inequality and using the fact that $\|X - Y\| = \|Y - X\|$, $\|X - Y\| \ge \|Y\| - \|X\|$. We thus obtain another version of the triangle inequality:
$$\|X - Y\| \ge \big|\, \|X\| - \|Y\| \,\big|. \tag{7.9}$$

With the above definitions in mind, we can now discuss convergence of sequences of random variables.

Definition 7.6 Let $X, X_1, X_2, \dots$ be random variables having finite second moments. The sequence $\{X_n\}_{n=1}^{\infty}$ converges in mean square to $X$, written $\text{ms-}\lim_{n \to \infty} X_n = X$, if
$$\lim_{n \to \infty} \|X_n - X\| = 0$$
or, equivalently,
$$\lim_{n \to \infty} E[(X_n - X)^2] = 0.$$
■

We can use Inequality 7.8 to show that the mean square limit of a sequence $\{X_n\}$ is unique in the probability sense if it exists. Suppose
$$\lim_{n \to \infty} \|X_n - X\| = 0 \qquad \text{and} \qquad \lim_{n \to \infty} \|X_n - X'\| = 0.$$
Since
$$0 \le \|X - X'\| \le \|X - X_n\| + \|X_n - X'\| \to 0 \text{ as } n \to \infty,$$
$\|X - X'\| = 0$, and therefore $X = X'$ with probability 1.

Now that we have a means for taking limits, we can deal with infinite series. If $\{X_j\}_{j=1}^{\infty}$ is an infinite sequence of random variables having finite second moments, we can form the infinite series $\sum_{j=1}^{\infty} X_j$. Letting $S_n = \sum_{j=1}^{n} X_j$, $n \ge 1$, if there is a random variable $S$ with finite second moment such that $S = \text{ms-}\lim_{n \to \infty} S_n$, we write
$$S = \text{ms-}\lim_{n \to \infty} \sum_{j=1}^{n} X_j = \text{ms-}\sum_{j=1}^{\infty} X_j$$
and say that the series $\sum_{j=1}^{\infty} X_j$ converges in the mean square sense.

Convergence in the mean square sense also implies convergence of means.

Lemma 7.5.3 Let $X, X_1, X_2, \dots$ and $Y, Y_1, Y_2, \dots$ be random variables with finite second moments. If $\text{ms-}\lim_{n \to \infty} X_n = X$ and $\text{ms-}\lim_{n \to \infty} Y_n = Y$, then
(i) $\lim_{n \to \infty} E[X_n] = E[X]$.
(ii) $\lim_{n \to \infty} E[X_n Y] = E[XY]$.
(iii) $\lim_{n \to \infty} E[X_n^2] = E[X^2]$.
(iv) $\lim_{n \to \infty} E[X_n Y_n] = E[XY]$.

PROOF: Replacing $X$ and $Y$ in Inequality 7.6 by 1 and $|X_n - X|$, respectively, by Theorem 7.3.2:
$$|E[X_n] - E[X]|^2 \le (E[|X_n - X|])^2 \le E[(X_n - X)^2] \to 0$$
as $n \to \infty$. Thus, $\lim_{n \to \infty} E[X_n] = E[X]$ and (i) is proved. By Schwarz's inequality,
$$0 \le |E[X_n Y] - E[XY]|^2 = |E[(X_n - X)Y]|^2 \le E[(X_n - X)^2]\, E[Y^2] \to 0$$
as $n \to \infty$, and $\lim_{n \to \infty} E[X_n Y] = E[XY]$, so that (ii) is true. Part (iii) is the same as the statement that $\lim_{n \to \infty} \|X_n\|^2 = \|X\|^2$. If we can show that $\lim_{n \to \infty} \|X_n\| = \|X\|$, then (iii) would be proved by continuity of the function $f(x) = x^2$. By Inequality 7.9,
$$\big|\, \|X_n\| - \|X\| \,\big| \le \|X_n - X\| \to 0$$
as $n \to \infty$ and (iii) is true. To prove (iv), by Schwarz's inequality:
$$|E[X_n Y_n] - E[XY]| = |E[(X_n - X)Y_n] + E[X(Y_n - Y)]| \le E[|(X_n - X)Y_n|] + E[|X(Y_n - Y)|]$$
$$\le \sqrt{E[(X_n - X)^2]}\, \sqrt{E[Y_n^2]} + \sqrt{E[X^2]}\, \sqrt{E[(Y_n - Y)^2]}.$$
Since $\lim_{n \to \infty} E[(X_n - X)^2] = 0$, $\lim_{n \to \infty} E[(Y_n - Y)^2] = 0$ by hypothesis, and $\lim_{n \to \infty} E[Y_n^2] = E[Y^2]$ by (iii), $\lim_{n \to \infty} E[X_n Y_n] = E[XY]$.
■

In the remainder of this section, we will consider a family of random variables $\{X_t : t \in T\}$ having finite second moments where $T$ is the set of all real numbers $R$ or the set of integers $Z$, assuming that such a family exists.

Definition 7.7 The family or process $\{X_t : t \in T\}$ is weakly stationary if
1. $E[X_s] = E[X_t]$ for all $s, t \in T$.
2. $\operatorname{cov}(X_s, X_t) = \operatorname{cov}(X_{s+h}, X_{t+h})$ for all $h, s, t \in T$. ■

The function $r(h) = \operatorname{cov}(X_t, X_{t+h}) = \operatorname{cov}(X_0, X_h)$, $h \in T$, is called the covariance function of the process. To avoid trivialities, we assume that $r(0) = \operatorname{cov}(X_0, X_0) = \operatorname{var} X_0 = \sigma^2 > 0$. Note that for $h > 0$, $r(-h) = \operatorname{cov}(X_t, X_{t-h}) = \operatorname{cov}(X_{t-h}, X_t) = r(h)$, and therefore $r(h) = r(-h) = r(|h|)$, $h \in T$. The normalized covariance function
$$\rho(h) = \frac{r(h)}{r(0)}, \qquad h \in T,$$
is called the correlation function of the $\{X_t : t \in T\}$ process.

EXAMPLE 7.16 Let $\{X_j\}_{j=-\infty}^{\infty}$ be a sequence of independent random variables having the same distribution function, finite second moments, and $\sigma^2 = \operatorname{var} X_0$. Then for any $\nu \in Z$,
$$r(\nu) = E[(X_j - E[X_j])(X_{j+\nu} - E[X_{j+\nu}])] = \begin{cases} \sigma^2 & \text{if } \nu = 0 \\ 0 & \text{if } \nu \ne 0. \end{cases} \; ■$$

EXAMPLE 7.17 Let $U$ and $V$ be uncorrelated random variables (i.e., $\rho(U, V) = 0$) with zero means and unit variances. For $\lambda \in R$, let
$$X_t = U \cos \lambda t + V \sin \lambda t, \qquad t \in R.$$
Since $E[X_t] = E[U] \cos \lambda t + E[V] \sin \lambda t = 0$, the covariance function is given by
$$r(h) = \operatorname{cov}(X_t, X_{t+h}) = E[X_t X_{t+h}] = E[(U \cos \lambda t + V \sin \lambda t)(U \cos \lambda(t + h) + V \sin \lambda(t + h))]$$
$$= \cos \lambda t \cos \lambda(t + h)\, E[U^2] + \cos \lambda t \sin \lambda(t + h)\, E[UV] + \sin \lambda t \cos \lambda(t + h)\, E[UV] + \sin \lambda t \sin \lambda(t + h)\, E[V^2].$$
Since $E[U^2] = 1$, $E[V^2] = 1$, and $E[UV] = 0$,
$$r(h) = \cos \lambda t \cos \lambda(t + h) + \sin \lambda t \sin \lambda(t + h) = \cos \lambda h.$$
It follows that the process $\{X_t : t \in T\}$ is weakly stationary. Since we can write
$$c_1 \cos x + c_2 \sin x = \sqrt{c_1^2 + c_2^2}\, \sin(x + \theta)$$
where $\tan \theta = c_1/c_2$ (so that $\theta = \arctan(c_1/c_2)$ when $c_2 > 0$), $X_t$ can be written
$$X_t = \sqrt{U^2 + V^2}\, \sin(\lambda t + \Theta)$$
where $\Theta = \arctan(U/V)$, a random variable. Thus, $X_t$ is a periodic function with random amplitude $\sqrt{U^2 + V^2}$, random phase shift $\Theta$, and fixed frequency $\lambda/2\pi$. ■
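The identity $r(h) = \cos \lambda h$ of Example 7.17 can be verified exactly for a simple choice of $U$ and $V$. The sketch below assumes, purely for illustration, that $U$ and $V$ are independent and each takes the values $+1$ and $-1$ with probability $1/2$, so that they have zero means, unit variances, and are uncorrelated. The expectation is then an average over four equally likely outcomes; the value of $\lambda$ and the function names are likewise illustrative.

```python
import math
from itertools import product


def x_value(t: float, u: float, v: float, lam: float) -> float:
    """X_t = U cos(lam t) + V sin(lam t) for one outcome (u, v)."""
    return u * math.cos(lam * t) + v * math.sin(lam * t)


def covariance(t: float, h: float, lam: float) -> float:
    """E[X_t X_{t+h}] when U and V are independent and each is +1 or -1 with probability 1/2.

    E[X_t] = 0 for this choice, so the expectation of the product is the covariance.
    """
    outcomes = list(product((-1.0, 1.0), repeat=2))  # four equally likely (u, v) pairs
    total = sum(x_value(t, u, v, lam) * x_value(t + h, u, v, lam) for u, v in outcomes)
    return total / len(outcomes)


lam = 1.5  # illustrative frequency parameter (an assumption, not from the text)
for t in (0.0, 0.7, 3.0):
    for h in (0.0, 0.5, 2.0):
        print(f"t={t:3.1f} h={h:3.1f}  cov={covariance(t, h, lam):+.6f}  cos(lam h)={math.cos(lam * h):+.6f}")
```

The printed covariance depends on $h$ but not on $t$, which is exactly the weak stationarity asserted in the example.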
We now consider an example that can serve as a model for the sound produced by $n$ different tuning forks that are struck at random times.

EXAMPLE 7.18 Let $U_0, \dots, U_n, V_0, \dots, V_n$ be uncorrelated random variables with zero means. Assume that the $U_j$ and $V_j$ have common variances $\sigma_j^2$, $j = 0, \dots, n$, and let $\sigma^2 = \sigma_0^2 + \cdots + \sigma_n^2$. Also let $\lambda_0, \dots, \lambda_n$ be distinct real numbers. Set
$$X_t = \sum_{j=0}^{n} (U_j \cos \lambda_j t + V_j \sin \lambda_j t), \qquad t \in R.$$
Note that $E[U_i U_j] = E[V_i V_j] = E[U_i V_j] = 0$ whenever $i \ne j$ and $E[U_j^2] = E[V_j^2] = \sigma_j^2$. Clearly,
$$E[X_t] = \sum_{j=0}^{n} (E[U_j] \cos \lambda_j t + E[V_j] \sin \lambda_j t) = 0$$
and
$$E[X_t X_{t+h}] = E\left[ \left( \sum_{j=0}^{n} (U_j \cos \lambda_j t + V_j \sin \lambda_j t) \right) \left( \sum_{k=0}^{n} (U_k \cos \lambda_k (t + h) + V_k \sin \lambda_k (t + h)) \right) \right]$$
$$= \sum_{j=0}^{n} \left( E[U_j^2] \cos \lambda_j t \cos \lambda_j (t + h) + E[V_j^2] \sin \lambda_j t \sin \lambda_j (t + h) \right) = \sum_{j=0}^{n} \sigma_j^2 \cos \lambda_j h.$$
The process $\{X_t : t \in R\}$ is therefore weakly stationary with covariance function
$$r(h) = \sum_{j=0}^{n} \sigma_j^2 \cos \lambda_j h.$$
Since $r(0) = \sum_{j=0}^{n} \sigma_j^2 = \sigma^2$, the correlation function is given by
$$\rho(h) = \sum_{j=0}^{n} \frac{\sigma_j^2}{\sigma^2} \cos \lambda_j h, \qquad h \in R. \tag{7.10}$$
As in Example 7.17, $X_t$ can be thought of as a mixture of $n + 1$ sound waves with random amplitudes $\sqrt{U_j^2 + V_j^2}$, random phase shifts $\Theta_j = \arctan(U_j/V_j)$, and fixed frequencies $\lambda_j/2\pi$, $j = 0, \dots, n$. ■

Equation 7.10 suggests that the correlation function $\rho(h)$ of a weakly stationary process $\{X_t : t \in T\}$ has a representation
$$\rho(h) = \int_{-\infty}^{\infty} \cos \lambda h\, dF(\lambda), \qquad h \in R, \tag{7.11}$$
where $F$ is a distribution function on $R$. Without going into details, such a function exists and is called the spectral distribution function. In Example 7.18, the function $F$ increases only at the points $\lambda_j$ by jumps of $\sigma_j^2/\sigma^2$, $j = 0, \dots, n$.

For a weakly stationary process of the type $\{X_j\}_{j=-\infty}^{\infty}$, the correlation function $\rho(\nu)$ is defined only for $\nu \in Z$, and in this case the integrand $\cos \lambda \nu$ in Equation 7.11 is a periodic function of $\lambda$ of period $2\pi$ since $\cos(\lambda + 2n\pi)\nu = \cos \lambda \nu$. In this case, Equation 7.11 can be written
$$\rho(\nu) = \sum_{j=-\infty}^{\infty} \int_{(2j-1)\pi}^{(2j+1)\pi} \cos \lambda \nu\, dF(\lambda).$$
It can be shown, but will not be done here, that the latter equation can be written
$$\rho(\nu) = \int_{(-\pi, \pi]} \cos \lambda \nu\, dF(\lambda), \qquad \nu \in Z,$$
where $F$ is a distribution function with $F(-\pi) = 0$ and $F(\pi) = 1$. In the particular case that $\sum_{\nu} |\rho(\nu)| < \infty$, the spectral distribution function $F$ has a density function $f$, called the spectral density function, so that
$$\rho(\nu) = \int_{-\pi}^{\pi} f(\lambda) \cos \lambda \nu\, d\lambda;$$
moreover,
$$f(\lambda) = \frac{1}{2\pi} + \frac{1}{\pi} \sum_{\nu=1}^{\infty} \rho(\nu) \cos \lambda \nu, \qquad -\pi < \lambda < \pi. \tag{7.12}$$

It is sometimes useful to smooth data by using a moving average, as in the next example.

EXAMPLE 7.19 Consider a sequence $\{X_j\}_{j=-\infty}^{\infty}$ of uncorrelated random variables having the same means $\mu$ and variances $\sigma^2$. Fix $m \ge 1$ and for each integer $n$, let
$$Y_n = a_1 X_n + a_2 X_{n-1} + \cdots + a_m X_{n-m+1} = \sum_{j=1}^{m} a_j X_{n-j+1}$$
where $a_1, \dots, a_m$ are constants. Put $a_j = 0$ if $j \notin \{1, 2, \dots, m\}$. The sequence $\{Y_j\}_{j=-\infty}^{\infty}$ is called a moving average process. Note that
$$E[Y_n] = \mu(a_1 + \cdots + a_m) \qquad \text{and} \qquad \operatorname{var} Y_n = \sigma^2 (a_1^2 + \cdots + a_m^2).$$
To show that the $Y_n$ process is weakly stationary, fix $\nu \ge 0$ and consider
$$\operatorname{cov}(Y_n, Y_{n+\nu}) = \sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j E[(X_{n-i+1} - \mu)(X_{n+\nu-j+1} - \mu)].$$
Since the terms of this sum are nonzero only when the subscripts of the $X$'s are equal, the sum is equal to
$$\sigma^2 \sum_{j=1}^{m} a_{j-\nu}\, a_j.$$
Since this quantity does not depend upon $n$, the $\{Y_j\}_{j=-\infty}^{\infty}$ sequence is weakly stationary with covariance function
$$r(\nu) = \sigma^2 \sum_{j=1}^{m} a_{j-\nu}\, a_j.$$
Since $j - \nu \ge 1$ is required for the first factor to be nonzero, we must have $\nu \le j - 1$, and therefore
$$r(\nu) = \sigma^2 \sum_{j=\nu+1}^{m} a_{j-\nu}\, a_j.$$
Therefore,
$$r(\nu) = \begin{cases} \sigma^2 (a_{m-\nu} a_m + a_{m-1-\nu} a_{m-1} + \cdots + a_1 a_{\nu+1}) & \text{if } 0 \le \nu \le m - 1 \\ 0 & \text{if } \nu \ge m. \end{cases}$$
In particular, if $a_j = 1/\sqrt{m}$, $j = 1, \dots, m$, then for $\nu \ge 0$,
$$r(\nu) = \begin{cases} \sigma^2 (1 - (\nu/m)) & \text{if } 0 \le \nu \le m - 1 \\ 0 & \text{if } \nu \ge m. \end{cases}$$
Since $r(-\nu) = r(\nu) = r(|\nu|)$,
$$r(\nu) = \begin{cases} \sigma^2 (1 - (|\nu|/m)) & \text{if } |\nu| \le m - 1 \\ 0 & \text{if } |\nu| \ge m. \end{cases} \; ■$$

Let $\{X_j\}_{j=-\infty}^{\infty}$ be a weakly stationary process.
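Suppose there is a real number $\lambda$ with $|\lambda| < 1$ such that
$$X_n = \lambda X_{n-1} + N_n, \qquad n \in Z,$$
where the $N_j$'s, representing a “noise” component, are uncorrelated with zero means and variance $\sigma^2$. By iteration,
$$X_n = \lambda(\lambda X_{n-2} + N_{n-1}) + N_n = \lambda^2 X_{n-2} + \lambda N_{n-1} + N_n = \cdots = \lambda^j X_{n-j} + \sum_{i=0}^{j-1} \lambda^i N_{n-i}. \tag{7.13}$$
Thus,
$$E\left[ \left( X_n - \sum_{i=0}^{j-1} \lambda^i N_{n-i} \right)^2 \right] = E[(\lambda^j X_{n-j})^2] = \lambda^{2j} E[X_{n-j}^2].$$
Since $E[X_{n-j}^2] = \operatorname{var} X_{n-j} + (E[X_{n-j}])^2$ and the $\{X_j\}_{j=-\infty}^{\infty}$ process is weakly stationary, the quantities on the right are independent of $n - j$, and so
$$E\left[ \left( X_n - \sum_{i=0}^{j-1} \lambda^i N_{n-i} \right)^2 \right] = c \lambda^{2j}$$
for some constant $c \ge 0$. Since $|\lambda| < 1$, $\lim_{j \to \infty} \lambda^{2j} = 0$, and so
$$X_n = \text{ms-}\lim_{j \to \infty} \sum_{i=0}^{j-1} \lambda^i N_{n-i} = \text{ms-}\sum_{i=0}^{\infty} \lambda^i N_{n-i}.$$
The significance of this equation lies in the fact that the $\{X_j\}$ process has a “representation” in terms of a sequence of random variables that are much easier to analyze. This is apparent in the following calculation of the covariance function of the $\{X_j\}_{j=-\infty}^{\infty}$ process. By Lemma 7.5.3,
$$E[X_n] = \lim_{j \to \infty} \sum_{i=0}^{j-1} \lambda^i E[N_{n-i}] = 0$$
and
$$E[X_n^2] = \lim_{j \to \infty} E\left[ \left( \sum_{i=0}^{j-1} \lambda^i N_{n-i} \right)^2 \right] = \lim_{j \to \infty} \sum_{i=0}^{j-1} \lambda^{2i} \sigma^2 = \frac{\sigma^2}{1 - \lambda^2},$$
since $E[N_{n-i}^2] = \sigma^2$ and $E[N_{n-i} N_{n-l}] = 0$ when $i \ne l$. To calculate $\operatorname{cov}(X_n, X_{n+k})$, note that $\operatorname{cov}(X_n, X_{n+k}) = E[X_n X_{n+k}]$, since $E[X_n] = 0$ for all $n \in Z$. By Equation 7.13, for $k \ge 1$,
$$X_{n+k} = \lambda^k X_n + \sum_{i=0}^{k-1} \lambda^i N_{n+k-i},$$
so that
$$E[X_n X_{n+k}] = \lambda^k E[X_n^2] + \lim_{l \to \infty} \sum_{j=0}^{l} \sum_{i=0}^{k-1} \lambda^{j+i} E[N_{n-j} N_{n+k-i}].$$
Since $n + k - i \ge n + 1$ for $i = 0, \dots, k - 1$ and $n - j \le n$ for $j = 0, \dots, l$, all of the terms in the double sum are zero. Thus,
$$E[X_n X_{n+k}] = \lambda^k \frac{\sigma^2}{1 - \lambda^2}.$$
The covariance function of the $\{X_j\}_{j=-\infty}^{\infty}$ process is
$$r(k) = \frac{\sigma^2}{1 - \lambda^2} \lambda^{|k|}, \qquad k \in Z,$$
and the correlation function is given by
$$\rho(k) = \lambda^{|k|}, \qquad k \in Z.$$

EXERCISES 7.5
1. Let $a_1, \dots, a_m, b_1, \dots, b_m$ be positive constants, let $Z_1, \dots, Z_m$ be independent random variables having a uniform density on $[0, 2\pi]$, and let
$$X_n = \sum_{j=1}^{m} a_j \cos(n b_j + Z_j).$$
Show that the sequence $\{X_j\}_{j=-\infty}^{\infty}$ is weakly stationary and determine its correlation function.
2. Let $\{X_j\}_{j=-\infty}^{\infty}$ be a weakly stationary process, let $a_1, \dots, a_m$ be constants, and let
$$Y_n = \sum_{j=1}^{m} a_j X_{n-j+1}.$$
Show that the process $\{Y_j\}_{j=-\infty}^{\infty}$ is weakly stationary and determine its correlation function in terms of the correlation function of the $X_j$ process.
3.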
Consider a sequence of random variables $\{X_j\}_{j=-\infty}^{\infty}$ defined as the moving average
$$X_j = N_j + \alpha N_{j-1}, \qquad -\infty < j < \infty,$$
where $\{N_j\}_{j=-\infty}^{\infty}$ is a weakly stationary sequence of uncorrelated random variables having unit variances. Find the spectral density function of the $X_j$ process.
4. Consider a sequence of random variables $\{X_j\}_{j=-\infty}^{\infty}$ defined as the moving average
$$X_j = N_j + \alpha N_{j-1} + \beta N_{j-2}, \qquad -\infty < j < \infty,$$
where $\{N_j\}_{j=-\infty}^{\infty}$ is a weakly stationary sequence having unit variances. Find the spectral density of the $X_j$ process.
5. Let $\Lambda$ and $\Theta$ be independent random variables where $\Lambda$ takes on the values $\lambda_1, \dots, \lambda_m$ with probabilities $p_1, \dots, p_m$ and $\Theta$ has a uniform density on $[0, 2\pi]$, and let
$$X_t = \cos(\Lambda t + \Theta), \qquad t \in R.$$
Show that the process $\{X_t : t \in R\}$ is weakly stationary and determine its correlation function.

SUPPLEMENTAL READING LIST
1. T. M. Apostol (1957). Mathematical Analysis. Reading, Mass.: Addison-Wesley.
2. P. J. Flory (1969). Statistical Mechanics of Chain Molecules. New York: Wiley/Interscience.

8 CONTINUOUS PARAMETER MARKOV PROCESSES

8.1 INTRODUCTION

In this chapter, we will pursue a path that is less dependent upon the structure of the probability space and more dependent upon macro properties of nonrandom functions. In the next section, starting with a few heuristic principles governing the probabilistic behavior of an evolving system in small time intervals, a system of differential equations is derived and solved, resulting in time dependent probability functions that describe a process known as the Poisson process. This process plays an important role in waiting time models.

The Poisson process is a special case of a more general class of processes that can be described by time dependent probability functions, called continuous parameter Markov chains. Starting with a set of equations that the probability functions must satisfy, it is shown that the functions satisfy a system of differential equations. Using matrix calculus, a method is developed for solving such systems.
8.2 POISSON PROCESS

Consider an experimental situation in which events occur at random times. For example, calls to a mainframe computer may arrive at random times $0 < t_1 \le t_2 \le \cdots$. If a counter is initiated at time 0, it will increase to 1 at time $t_1$, increase to 2 at time $t_2$, and so forth. The outcome in this case is a function $\omega(t)$ on $[0, \infty)$ with graph as depicted in Figure 8.1.

FIGURE 8.1 Counter outcome.

A probability space for this type of experimental situation would consist of all such $\omega$. The construction of such a probability space is better left to more advanced texts. A different approach, which avoids such constructions, will be followed here. This approach entails the derivation of some equations based on heuristic arguments.

In the following discussion, $o(h)$ (read “little o of h”) will be a generic symbol for a function of $h$ that satisfies the condition
$$\lim_{h \to 0} \frac{o(h)}{h} = 0.$$
We will assume the following properties of the counter process just described. Independently of the number of occurrences of events in the interval $(0, t)$, for small $h > 0$:
(i) The probability that an event will occur in the interval $(t, t + h)$ is $\lambda h + o(h)$,
(ii) The probability that no event will occur in the interval $(t, t + h)$ is $1 - \lambda h + o(h)$, and
(iii) The probability of two or more events occurring in the interval $(t, t + h)$ is $o(h)$,
where $\lambda$ is a positive constant. These might be reasonable assumptions for the situation described above for periods when saturation is unlikely.

Assuming there is an appropriate probability space for which these assumptions are valid, let $P_n(t)$ be the probability that $n$ events will occur in the time interval $(0, t)$. Consider adjacent time intervals $(0, t)$ and $(t, t + h)$.
If $n \ge 1$ events occur in the interval $(0, t + h)$, then one of three things must be true: (1) $n$ of the events occur in $(0, t)$ and none occur in $(t, t + h)$, (2) $n - 1$ events occur in $(0, t)$ and one occurs in $(t, t + h)$, or (3) two or more of the $n$ events occur in $(t, t + h)$. Since these are mutually exclusive possibilities and there is independence between events occurring in $(0, t)$ and $(t, t + h)$,

$$P_n(t + h) = P_n(t)(1 - \lambda h + o(h)) + P_{n-1}(t)(\lambda h + o(h)) + o(h)$$

or

$$P_n(t + h) = P_n(t)(1 - \lambda h) + P_{n-1}(t)\lambda h + o(h),$$

where the single $o(h)$ in the second equation collects the terms $P_n(t)o(h) + P_{n-1}(t)o(h) + o(h)$ of the first. Therefore,

$$\frac{P_n(t + h) - P_n(t)}{h} = -\lambda P_n(t) + \lambda P_{n-1}(t) + \frac{o(h)}{h}.$$

Letting $h \to 0+$,

$$P_n'(t) = -\lambda P_n(t) + \lambda P_{n-1}(t), \qquad n \ge 1,\ t \ge 0, \tag{8.1}$$

assuming, of course, that the $P_n(t)$ are differentiable. It is necessary to consider the $n = 0$ case separately since only (1) holds in this case. Thus,

$$P_0(t + h) = P_0(t)(1 - \lambda h + o(h)),$$

so that

$$\frac{P_0(t + h) - P_0(t)}{h} = -\lambda P_0(t) + P_0(t)\frac{o(h)}{h}.$$

Letting $h \to 0+$, we obtain the differential equation

$$P_0'(t) = -\lambda P_0(t), \qquad t \ge 0. \tag{8.2}$$

In deriving Equations 8.1 and 8.2, we let $h \to 0+$ so that the above derivatives are really right derivatives. By considering the intervals $(0, t - h)$ and $(t - h, t)$, these equations can be seen to hold for left derivatives also and therefore for unrestricted derivatives. The $P_n(t)$, $n \ge 0$, must satisfy the initial conditions

$$P_0(0) = 1 \tag{8.3}$$

and

$$P_n(0) = 0, \qquad n \ge 1. \tag{8.4}$$

We will now undertake to solve these differential equations subject to the stated initial conditions. Consider first the differential equation $P_0'(t) = -\lambda P_0(t)$ subject to the initial condition $P_0(0) = 1$. It is easily seen that the solution is

$$P_0(t) = e^{-\lambda t}, \qquad t \ge 0.$$

Putting $n = 1$ in Equation 8.1 and substituting $e^{-\lambda t}$ for $P_0(t)$, we find that $P_1(t)$ satisfies the equation

$$P_1'(t) = -\lambda P_1(t) + \lambda e^{-\lambda t}$$

and the initial condition $P_1(0) = 0$. To solve the equation, write it as

$$P_1'(t) + \lambda P_1(t) = \lambda e^{-\lambda t}.$$
Multiplying both sides by the factor $e^{\lambda t}$, the equation can be written

$$\frac{d}{dt}\left(e^{\lambda t} P_1(t)\right) = \lambda.$$

Integrating,

$$P_1(t) = \lambda t e^{-\lambda t} + c e^{-\lambda t}.$$

The integration constant $c$ must be zero to satisfy the initial condition $P_1(0) = 0$. Thus,

$$P_1(t) = \lambda t e^{-\lambda t}.$$

Repeating the same steps for the $n = 2$ case, we find

$$P_2(t) = \frac{\lambda^2 t^2}{2!} e^{-\lambda t}.$$

By mathematical induction,

$$P_n(t) = \frac{\lambda^n t^n}{n!} e^{-\lambda t}, \qquad n \ge 0,\ t \ge 0. \tag{8.5}$$

The heuristic description has resulted in a collection of specific functions. If there is any validity to this procedure, we should be able to start with the end product, namely the $P_n(t)$ functions, and construct a probability model from which the differential equations above can be deduced. To construct such a process, we would take $\Omega$ to consist of all real-valued nondecreasing step functions $\omega$ on $[0, \infty)$ with $\omega(0) = 0$ that increase only by unit jumps. The probability function $P$ would be defined as follows. For each $t \ge 0$, let $X_t(\omega) = \omega(t)$. If $0 \le t_1 < t_2 < \cdots < t_k$ and $0 \le n_1 \le n_2 \le \cdots \le n_k$, let

$$P(X_{t_1} = n_1, \ldots, X_{t_k} = n_k) = P(X_{t_1} - X_0 = n_1, X_{t_2} - X_{t_1} = n_2 - n_1, \ldots, X_{t_k} - X_{t_{k-1}} = n_k - n_{k-1})$$
$$= P_{n_1}(t_1) \times P_{n_2 - n_1}(t_2 - t_1) \times \cdots \times P_{n_k - n_{k-1}}(t_k - t_{k-1});$$

i.e., probabilities are assigned so that the increments $X_{t_1} - X_0, X_{t_2} - X_{t_1}, \ldots, X_{t_k} - X_{t_{k-1}}$ are independent random variables with

$$P(X_{t_j} - X_{t_{j-1}} = n) = P_n(t_j - t_{j-1}) = \frac{\lambda^n (t_j - t_{j-1})^n}{n!} e^{-\lambda (t_j - t_{j-1})}, \qquad j = 1, \ldots, k,$$

where $t_0 = 0$.

We can now state the formal definition of a Poisson process on a given probability space $(\Omega, \mathcal{F}, P)$.

Definition 8.1 The family of random variables $\{X_t : t \ge 0\}$ is a Poisson process with rate $\lambda > 0$ if

1. $P(X_0 = 0) = 1$,
2. $X_{t_1} - X_{t_0}, \ldots, X_{t_k} - X_{t_{k-1}}$ are independent random variables whenever $0 \le t_0 \le t_1 \le \cdots \le t_k$, and
3. $P(X_t - X_s = n) = \dfrac{\lambda^n (t - s)^n}{n!} e^{-\lambda (t - s)}$, $n \in N$, whenever $0 \le s < t$. ■

Let $\{X_t : t \ge 0\}$ be a Poisson process with parameter $\lambda > 0$. Taking $s = 0$ in (3), $X_t$ has a Poisson density with parameter $\lambda t$, so that $E[X_t] = \lambda t$ and $\operatorname{var} X_t = \lambda t$ (see Example 4.15). We can consider the time at which the first event occurs by letting $W_1$ be the first time $t$ that $X_t = 1$.
Assuming that $W_1$ is a random variable, the density of $W_1$ can be obtained as follows. If $t \ge 0$, then $W_1 \le t$ if and only if $X_t \ge 1$, so that

$$F_{W_1}(t) = P(W_1 \le t) = P(X_t \ge 1) = P(X_t - X_0 \ge 1),$$

and therefore

$$F_{W_1}(t) = \sum_{n=1}^{\infty} \frac{\lambda^n t^n}{n!} e^{-\lambda t} = e^{-\lambda t}\left(e^{\lambda t} - 1\right) = 1 - e^{-\lambda t}.$$

It follows that the density of $W_1$ is given by

$$f_{W_1}(t) = \begin{cases} \lambda e^{-\lambda t} & \text{if } t \ge 0 \\ 0 & \text{if } t < 0; \end{cases}$$

i.e., $W_1$ has an exponential density with parameter $\lambda$, and it follows from Exercise 7.2.5 that $E[W_1] = 1/\lambda$. The parameter $\lambda$ is called the rate. The greater the rate of occurrence of events, the smaller the waiting time for an event to occur.

EXERCISES 8.2

1. Let $\{X_t : t \ge 0\}$ be a Poisson process with rate $\lambda > 0$. If $0 < s < t$ and $0 \le k \le n$, calculate $P(X_s = k \mid X_t = n)$.

2. Let $\{X_t^{(i)} : t \ge 0\}_{i=1}^{n}$ be independent Poisson processes with the same rate $\lambda > 0$. Find the density of the waiting time for all $n$ of the processes to have at least one event occur.

3. Let $\{X_t : t \ge 0\}$ be a Poisson process with rate $\lambda > 0$. By writing $X_t = \sum_{k=1}^{[t]} (X_k - X_{k-1}) + (X_t - X_{[t]})$, where $[t]$ denotes the largest integer less than or equal to $t$, show that $P(\lim_{t \to \infty} X_t = +\infty) = 1$ by considering the events $A_k = (X_k - X_{k-1} = 1)$.

4. Suppose the rate $\lambda = \lambda(t)$ is a nonnegative function on $[0, +\infty)$ that is Riemann integrable on each finite interval and define $P_n(t)$ as before. Then $P_n(t)$ satisfies Equations 8.1, 8.3, and 8.4. Let

$$G(t, s) = \sum_{n=0}^{\infty} P_n(t) s^n, \qquad -1 < s < 1,\ t \ge 0,$$

be the generating function of the sequence $\{P_n(t)\}_{n=0}^{\infty}$.

(a) Verify that $G(t, s)$ satisfies the equation

$$\frac{\partial G}{\partial t} = -\lambda(t)(1 - s)G, \qquad -1 < s < 1,\ t \ge 0,$$

and the initial condition $G(0, s) = 1$, $-1 < s < 1$.

(b) Verify that

$$G(t, s) = e^{-(1 - s)\int_0^t \lambda(u)\,du}$$

satisfies these conditions.

(c) Verify that

$$P_n(t) = \frac{1}{n!}\left(\int_0^t \lambda(u)\,du\right)^n e^{-\int_0^t \lambda(u)\,du}$$

and that $E[X_t] = \int_0^t \lambda(u)\,du$.

5. Use the results of the previous problem to determine $P_n(t)$, $n \ge 0$, and $E[X_t]$ for $\lambda(t) = 1/(1 + t)$.

6. Use the result of Problem 5 to approximate $P_{10}(100) = P(X_{100} = 10)$ when $\lambda(t) = 1/(1 + t)$.
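The solution of the Poisson differential equations can be checked numerically. The following is a minimal sketch, assuming illustrative values of the rate and time (the names `poisson_pn` and `integrate_poisson_odes` and the numbers used are ours, not part of the text): it integrates Equations 8.1 and 8.2 by a simple Euler scheme from the initial conditions 8.3 and 8.4, compares the result with the closed form 8.5, and confirms that the $P_n(t)$ sum to 1 and have mean $\lambda t$.

```python
import math


def poisson_pn(n, lam, t):
    """Closed form of Equation 8.5: P_n(t) = (lam*t)^n / n! * exp(-lam*t)."""
    return (lam * t) ** n / math.factorial(n) * math.exp(-lam * t)


def integrate_poisson_odes(lam, t_end, n_max, steps=20000):
    """Euler integration of P_0' = -lam*P_0 and
    P_n' = -lam*P_n + lam*P_{n-1}, n >= 1 (Equations 8.1 and 8.2),
    starting from P_0(0) = 1 and P_n(0) = 0 (conditions 8.3 and 8.4).

    The equation for P_n involves only P_n and P_{n-1}, so stopping
    at n_max does not disturb the lower-index solutions.
    """
    h = t_end / steps
    p = [1.0] + [0.0] * n_max
    for _ in range(steps):
        new = [p[0] - h * lam * p[0]]
        for n in range(1, n_max + 1):
            new.append(p[n] + h * (-lam * p[n] + lam * p[n - 1]))
        p = new
    return p


# Illustrative (hypothetical) parameter values.
lam, t = 2.0, 1.5

numeric = integrate_poisson_odes(lam, t, n_max=8)
exact = [poisson_pn(n, lam, t) for n in range(9)]
max_gap = max(abs(a - b) for a, b in zip(numeric, exact))

# The series is truncated at 60 terms; the neglected tail is negligible here.
total = sum(poisson_pn(n, lam, t) for n in range(60))
mean = sum(n * poisson_pn(n, lam, t) for n in range(60))

print("largest gap between Euler solution and Equation 8.5:", max_gap)
print("sum of P_n(t):", total, " mean:", mean, " lam*t:", lam * t)
```

The Euler scheme is chosen only for transparency; any standard ODE method would serve equally well.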
BIRTH AND DEATH PROCESSES

The Poisson process is an example of a birth process in which the population size only increases. Realistic models for population growth must not only incorporate deaths but also allow the possibility that birth and death rates depend upon population size. Assuming that there is a probability space $(\Omega, \mathcal{F}, P)$ and a process $\{X_t : t \ge 0\}$ reflecting population growth, we will assume that for $h > 0$,

1. $P(X_{t+h} - X_t = 1 \mid X_t = n) = \beta_n h + o(h)$,
2. $P(X_{t+h} - X_t = -1 \mid X_t = n) = \delta_n h + o(h)$,
3. $P(X_{t+h} - X_t = 0 \mid X_t = n) = 1 - (\beta_n + \delta_n)h + o(h)$, and
4. $P(|X_{t+h} - X_t| \ge 2) = o(h)$,

where $\beta_n \ge 0$ for all $n \ge 0$, $\delta_0 = 0$, and $\delta_n \ge 0$ for all $n \ge 1$. We will also assume that $X_0 = n_0$ where $n_0 \ge 1$ is the initial population size, so that $X_t$ represents the population size at time $t$. Note that there is no mention of independence between population size in $(0, t)$ and changes in $(t, t + h)$; in fact, independence will be lacking because birth and death rates in $(t, t + h)$ can depend upon population size in $(0, t)$.

Letting $P_n(t) = P(X_t = n)$, a system of differential equations for the $P_n(t)$ can be derived as follows. Note first of all that $P_{n_0}(0) = P(X_0 = n_0) = 1$ and that $P_n(0) = P(X_0 = n) = 0$ for all $n \ne n_0$. Consider first $P_0(t)$. For $h > 0$,

$$P_0(t + h) = P(X_{t+h} = 0)$$
$$= P(X_{t+h} = 0, X_t = 0) + P(X_{t+h} = 0, X_t = 1) + P(X_{t+h} = 0, X_t \ge 2)$$
$$= P(X_{t+h} = 0 \mid X_t = 0)P(X_t = 0) + P(X_{t+h} = 0 \mid X_t = 1)P(X_t = 1) + P(X_{t+h} = 0, X_t \ge 2)$$
$$= (1 - \beta_0 h + o(h))P_0(t) + (\delta_1 h + o(h))P_1(t) + o(h).$$

Therefore,

$$\frac{P_0(t + h) - P_0(t)}{h} = -\beta_0 P_0(t) + \delta_1 P_1(t) + \frac{o(h)}{h}.$$

Letting $h \to 0+$, we obtain

$$P_0'(t) = -\beta_0 P_0(t) + \delta_1 P_1(t), \qquad t \ge 0.$$

The derivative in this equation is a right derivative, but by replacing $t$ by $t - h$, the same equation holds for the left derivative and therefore for the derivative. Consider now $P_n(t)$ for $n \ge 1$.
Proceeding as before,

$$P_n(t + h) = \sum_{k=0}^{\infty} P(X_{t+h} = n, X_t = k) = \sum_{k=0}^{\infty} P(X_{t+h} - X_t = n - k, X_t = k)$$
$$= P(X_{t+h} - X_t = -1, X_t = n + 1) + P(X_{t+h} - X_t = 0, X_t = n) + P(X_{t+h} - X_t = 1, X_t = n - 1) + \sum_{|k - n| \ge 2} P(X_{t+h} - X_t = n - k, X_t = k).$$

Since

$$\sum_{|k - n| \ge 2} P(X_{t+h} - X_t = n - k, X_t = k) \le P(|X_{t+h} - X_t| \ge 2) = o(h),$$

$$P_n(t + h) = (\delta_{n+1} h + o(h))P_{n+1}(t) + (1 - (\beta_n + \delta_n)h + o(h))P_n(t) + (\beta_{n-1} h + o(h))P_{n-1}(t) + o(h).$$

Therefore,

$$\frac{P_n(t + h) - P_n(t)}{h} = \delta_{n+1} P_{n+1}(t) - (\beta_n + \delta_n)P_n(t) + \beta_{n-1} P_{n-1}(t) + \frac{o(h)}{h}.$$

Letting $h \to 0+$ (and also letting $h \to 0+$ after replacing $t$ by $t - h$),

$$P_n'(t) = \delta_{n+1} P_{n+1}(t) - (\beta_n + \delta_n)P_n(t) + \beta_{n-1} P_{n-1}(t).$$

The functions $P_0(t), P_1(t), \ldots$ therefore satisfy the system of differential equations

$$\begin{cases} P_0'(t) = -\beta_0 P_0(t) + \delta_1 P_1(t), & t \ge 0 \\ P_n'(t) = \delta_{n+1} P_{n+1}(t) - (\beta_n + \delta_n)P_n(t) + \beta_{n-1} P_{n-1}(t), & n \ge 1,\ t \ge 0 \end{cases} \tag{8.6}$$

subject to the initial conditions

$$P_{n_0}(0) = 1, \qquad P_n(0) = 0, \quad n \ne n_0. \tag{8.7}$$

We will now specialize by considering a pure birth process for which $\delta_n = 0$, $n \ge 0$. In this case, the system of differential equations becomes

$$\begin{cases} P_0'(t) = -\beta_0 P_0(t) \\ P_n'(t) = -\beta_n P_n(t) + \beta_{n-1} P_{n-1}(t), & n \ge 1. \end{cases} \tag{8.8}$$

Suppose $n_0 > 0$. The general solution of the first equation has the form $P_0(t) = c_0 e^{-\beta_0 t}$; to satisfy the initial condition in Equation 8.7, we must have $c_0 = 0$, in which case $P_0(t) = 0$ and the second equation in 8.8 for $n = 1$ reduces to

$$P_1'(t) = -\beta_1 P_1(t).$$

If $n_0 > 1$, again $P_1(t) = 0$ and the second equation in 8.8 reduces to

$$P_2'(t) = -\beta_2 P_2(t).$$

Continuing in this way, we arrive at the fact that $P_n(t) = 0$ for all $t \ge 0$ and $n < n_0$ and that

$$P_{n_0}(t) = e^{-\beta_{n_0} t}.$$

Consider the second equation in 8.8 for $n > n_0$; after multiplying both sides by $e^{\beta_n t}$, it can be written

$$\frac{d}{dt}\left(e^{\beta_n t} P_n(t)\right) = \beta_{n-1} e^{\beta_n t} P_{n-1}(t).$$

Integrating from 0 to $t$ and using the initial condition $P_n(0) = 0$ for $n > n_0$, we obtain the recurrence relation

$$P_n(t) = \beta_{n-1} e^{-\beta_n t} \int_0^t e^{\beta_n s} P_{n-1}(s)\,ds, \qquad n > n_0. \tag{8.9}$$

Since $P_{n_0}(t)$ is known, this equation can be used to generate the $P_n(t)$ successively. Note that $P_n(t) \ge 0$ for all $n \ge 1$, $t \ge 0$.
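The recurrence relation 8.9 lends itself to direct numerical evaluation. The sketch below is one possible way to do this, not a method taken from the text: it tabulates each $P_n$ on a time grid and evaluates the integral in 8.9 with the trapezoid rule. The function names, the grid size, and the linear rate $\beta_n = 0.5n$ are illustrative assumptions. As a check, $P_{n_0+1}(t)$ is compared with the value obtained by carrying out the integral in 8.9 by hand for $n = n_0 + 1$, namely $\beta_{n_0}(e^{-\beta_{n_0} t} - e^{-\beta_{n_0+1} t})/(\beta_{n_0+1} - \beta_{n_0})$ when the two rates differ.

```python
import math


def pure_birth_probs(beta, n0, n_max, t_end, steps=4000):
    """Evaluate P_n(t_end), n0 <= n <= n_max, for a pure birth process.

    beta : function n -> birth rate beta_n
    Uses P_{n0}(t) = exp(-beta_{n0} t) and recurrence 8.9,
    with the integral approximated by the trapezoid rule.
    """
    h = t_end / steps
    grid = [k * h for k in range(steps + 1)]
    prev = [math.exp(-beta(n0) * s) for s in grid]  # P_{n0} on the grid
    result = {n0: prev[-1]}
    for n in range(n0 + 1, n_max + 1):
        b_n, b_prev = beta(n), beta(n - 1)
        cur = [0.0]          # P_n(0) = 0 for n > n0
        integral = 0.0       # running value of int_0^t e^{b_n s} P_{n-1}(s) ds
        for k in range(1, steps + 1):
            f0 = math.exp(b_n * grid[k - 1]) * prev[k - 1]
            f1 = math.exp(b_n * grid[k]) * prev[k]
            integral += 0.5 * h * (f0 + f1)
            cur.append(b_prev * math.exp(-b_n * grid[k]) * integral)
        result[n] = cur[-1]
        prev = cur
    return result


def linear_rate(n):
    """Hypothetical birth rate beta_n = 0.5 * n, used only for illustration."""
    return 0.5 * n


t_end = 2.0
probs = pure_birth_probs(linear_rate, n0=1, n_max=6, t_end=t_end)

# Hand evaluation of 8.9 for n = n0 + 1 = 2 with beta_1 = 0.5, beta_2 = 1.0.
b1, b2 = linear_rate(1), linear_rate(2)
direct_p2 = b1 * (math.exp(-b1 * t_end) - math.exp(-b2 * t_end)) / (b2 - b1)

for n, value in probs.items():
    print(n, value)
print("P_2 from recurrence:", probs[2], " by hand:", direct_p2)
```

Because each $P_n$ depends only on $P_{n-1}$, the probabilities can be generated one index at a time, which is exactly the sequential structure noted above.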
It is conceivable that in certain populations the birth rates might be so great that the population "explodes" or becomes infinite in a finite time interval, and it is of interest to consider the probability

$$P(X_t = +\infty) = 1 - \sum_{n=0}^{\infty} P(X_t = n) = 1 - \sum_{n=0}^{\infty} P_n(t).$$

Thus, $P(X_t = +\infty) > 0$ if and only if $\sum_{n=0}^{\infty} P_n(t) < 1$. An explosion will not occur if $P(X_t = +\infty) = 0$ for all $t \ge 0$; i.e., if $\sum_{n=0}^{\infty} P_n(t) = 1$ for all $t \ge 0$. A criterion for the latter is given by the following theorem.

Theorem 8.3.1 $\sum_{n=0}^{\infty} P_n(t) = 1$ for all $t \ge 0$ if and only if the series $\sum_{n=0}^{\infty} 1/\beta_n$ diverges.

PROOF: Letting $S_k(t) = \sum_{n=0}^{k} P_n(t)$, $S_k'(t) = \sum_{n=0}^{k} P_n'(t)$. By Equation 8.8,

$$S_k'(t) = -\beta_0 P_0(t) + \sum_{n=1}^{k} \left(-\beta_n P_n(t) + \beta_{n-1} P_{n-1}(t)\right) = -\beta_k P_k(t).$$

Integrating from 0 to $t$ and using the initial conditions in 8.7,

$$1 - S_k(t) = \beta_k \int_0^t P_k(s)\,ds. \tag{8.10}$$

Since the terms defining the sums $S_k(t)$ are nonnegative, for each $t$ the left side decreases monotonically as $k$ increases, and so the right side decreases in the same way. Let

$$\mu(t) = \lim_{k \to \infty} \beta_k \int_0^t P_k(s)\,ds \ge 0.$$

Thus,

$$\int_0^t P_k(s)\,ds \ge \frac{\mu(t)}{\beta_k}.$$

Using the fact that $S_k(t) = \sum_{n=0}^{k} P_n(t) = \sum_{n=0}^{k} P(X_t = n) = P(0 \le X_t \le k) \le 1$ and summing over $k = 0, \ldots, n$,

$$t \ge \int_0^t S_n(s)\,ds \ge \mu(t) \sum_{k=0}^{n} \frac{1}{\beta_k}.$$

If the series $\sum_{k=0}^{\infty} 1/\beta_k$ diverges, then $\mu(t)$ must be zero for all $t$,

$$\lim_{k \to \infty} (1 - S_k(t)) = \lim_{k \to \infty} \beta_k \int_0^t P_k(s)\,ds = \mu(t) = 0,$$

and therefore $\sum_{n=0}^{\infty} P_n(t) = \lim_{k \to \infty} S_k(t) = 1$ for all $t \ge 0$. On the other hand, by Equation 8.10,

$$\int_0^t S_k(s)\,ds = \sum_{n=0}^{k} \int_0^t P_n(s)\,ds = \sum_{n=0}^{k} \frac{1 - S_n(t)}{\beta_n} \le \sum_{n=0}^{k} \frac{1}{\beta_n} \le \sum_{n=0}^{\infty} \frac{1}{\beta_n}.$$

Since $S_k(s)$ increases with $k$, the integral on the left increases with $k$ and

$$\lim_{k \to \infty} \int_0^t S_k(s)\,ds \le \sum_{n=0}^{\infty} \frac{1}{\beta_n}.$$

If the limit can be taken past the integral sign (which is permissible in this case),

$$\int_0^t \left(\lim_{k \to \infty} S_k(s)\right) ds \le \sum_{n=0}^{\infty} \frac{1}{\beta_n}.$$

If $\sum_{n=0}^{\infty} P_n(t) = \lim_{k \to \infty} S_k(t) = 1$ for all $t \ge 0$, we would have

$$t \le \sum_{n=0}^{\infty} \frac{1}{\beta_n}$$

for all $t \ge 0$. Since $t$ can be arbitrarily large, $\sum_{n=0}^{\infty} 1/\beta_n = +\infty$. Thus, $\sum_{k=0}^{\infty} P_k(t) = 1$ for all $t \ge 0$ implies that the series $\sum_{n=0}^{\infty} 1/\beta_n$ diverges.
■

Generally speaking, the differential equations in 8.6 for a birth and death process are difficult to solve because they must be solved simultaneously, as opposed to the pure birth case where they can be solved sequentially starting with the first equation in 8.8. There are methods of obtaining qualitative information about population growth even if the equations in 8.6 cannot be solved.

EXAMPLE 8.1 Consider a birth and death process with $\beta_n = \beta n$ and $\delta_n = \delta n$, $n \ge 0$, where $\beta, \delta > 0$. We will assume that $n_0 = 1$, so that $X_0 = 1$. The initial conditions are then

$$P_1(0) = 1, \qquad P_n(0) = 0, \quad n \ne 1.$$

Consider $M(t) = E[X_t] = \sum_{n=0}^{\infty} n P(X_t = n) = \sum_{n=0}^{\infty} n P_n(t)$. Note that $M(0) = \sum_{n=0}^{\infty} n P_n(0) = 1$ and that $M'(t) = \sum_{n=1}^{\infty} n P_n'(t)$, formally at least. Multiplying both sides of the second equation in 8.6 by $n$ and summing over $n = 1, 2, \ldots$,

$$M'(t) = \delta \sum_{n=1}^{\infty} n(n + 1)P_{n+1}(t) - (\beta + \delta) \sum_{n=1}^{\infty} n^2 P_n(t) + \beta \sum_{n=1}^{\infty} n(n - 1)P_{n-1}(t)$$
$$= (\beta - \delta) \sum_{n=1}^{\infty} n P_n(t) = (\beta - \delta)M(t).$$

The average population size $M(t)$ therefore satisfies the differential equation $M'(t) = (\beta - \delta)M(t)$ subject to the initial condition $M(0) = 1$. Thus,

$$M(t) = \begin{cases} e^{(\beta - \delta)t} & \text{if } \beta \ne \delta \\ 1 & \text{if } \beta = \delta. \end{cases}$$

This function also gives qualitative information about the long-term behavior of population size, since

$$\lim_{t \to \infty} M(t) = \begin{cases} +\infty & \text{if } \beta > \delta \\ 0 & \text{if } \beta < \delta \\ 1 & \text{if } \beta = \delta. \end{cases}$$

■

EXERCISES 8.3

1. Consider a pure birth process with $\beta_n = \beta n$, $\delta_n = 0$, $n \ge 0$, and $X_0 = 1$. Calculate $P_1(t)$, $P_2(t)$, and $P_3(t)$.

2. Consider the pure birth process of the previous problem. Use mathematical induction to prove that $P_n(t) = e^{-\beta t}(1 - e^{-\beta t})^{n-1}$, $n \ge 2$.

3. Explain what happens to the pure birth equations in 8.8 when $\beta_i > 0$, $i = 1, \ldots, n - 1$, $\beta_n = 0$.

4. Calculate $E[X_t]$ for a birth and death process with $\beta_n = \alpha + \beta n$, $\delta_n = \delta n$, $n \ge 0$, and $X_0 = 1$.

5. Consider the pure birth process for which $\delta_n = 0$, $\beta_n = \beta n$, $n \ge 0$. Then the $P_n(t)$ satisfy Equation 8.8 and the initial conditions in 8.7. Let

$$G(t, s) = \sum_{n=n_0}^{\infty} P_n(t) s^n, \qquad -1 < s < 1,\ t \ge 0,$$

be the generating function of the sequence $\{P_n(t)\}_{n=0}^{\infty}$.
(a) Verify that $G(t, s)$ satisfies the equation

$$\frac{\partial G}{\partial t} + \beta s(1 - s)\frac{\partial G}{\partial s} = 0$$

and the initial condition $G(0, s) = s^{n_0}$.

(b) Verify that

$$G(t, s) = s^{n_0}\left(\frac{e^{-\beta t}}{1 - s(1 - e^{-\beta t})}\right)^{n_0}$$

satisfies the conditions of the previous part.

(c) Determine $E[X_t]$.

(d) Determine $P_n(t)$, $n \ge 0$, in the $n_0 = 1$ case.

MARKOV CHAINS

Let $S = \{s_1, s_2, \ldots\}$ be a finite or countably infinite collection of objects called states. We will now describe a model for moving among the states in a random way.

Definition 8.2 A collection of random variables $\{X_t : t \ge 0\}$ is called a Markov chain if

$$P(X_{t_n} = s_{i_n} \mid X_{t_1} = s_{i_1}, \ldots, X_{t_{n-1}} = s_{i_{n-1}}) = P(X_{t_n} = s_{i_n} \mid X_{t_{n-1}} = s_{i_{n-1}})$$

whenever $0 \le t_1 < t_2 < \cdots < t_n$ and $s_{i_1}, \ldots, s_{i_n} \in S$. ■

This defining property is called the Markov property. If we regard $t_{n-1}$ as the present time, this property requires that the probability of a future event given the past does not depend upon the remote past or, put another way, the process does not have a memory.

Definition 8.3 The transition function of the Markov chain $\{X_t : t \ge 0\}$ is defined by

$$p_{i,j}(s, t) = P(X_t = s_j \mid X_s = s_i), \qquad 0 \le s \le t,\ s_i, s_j \in S.$$

The chain has stationary transition functions if the $p_{i,j}(s, t)$ depend only upon $t - s$. ■

We will consider only Markov chains $\{X_t : t \ge 0\}$ with stationary transition function

$$p_{i,j}(t) = P(X_{t+s} = s_j \mid X_s = s_i), \qquad i, j \ge 1,\ s, t \ge 0, \tag{8.11}$$

which satisfies the additional continuity condition that

$$\lim_{t \to 0+} p_{i,j}(t) = \delta_{i,j} \tag{8.12}$$

where $\delta_{i,j} = 1$ if $i = j$ and $\delta_{i,j} = 0$ if $i \ne j$. For fixed $t \ge 0$, the numbers $p_{i,j}(t)$ can be displayed in the following matrix form:

$$P(t) = \begin{pmatrix} p_{1,1}(t) & p_{1,2}(t) & \cdots \\ p_{2,1}(t) & p_{2,2}(t) & \cdots \\ \vdots & \vdots & \\ p_{i,1}(t) & p_{i,2}(t) & \cdots \\ \vdots & \vdots & \end{pmatrix},$$

which is called the stationary transition matrix. If $S$ is finite, this matrix is $|S| \times |S|$; if $S$ is countably infinite, it has an infinite number of rows and columns.

EXAMPLE 8.2 Let $\{X_t : t \ge 0\}$ be a Poisson process with rate $\lambda > 0$, let $0 \le t_1 < \cdots < t_k$ and $0 \le n_1 \le n_2 \le \cdots \le n_k$, and let $P_n(t) = P(X_t = n)$ as in Section 8.2.
Since the process has independent increments,

$$P(X_{t_k} = n_k \mid X_{t_1} = n_1, \ldots, X_{t_{k-1}} = n_{k-1}) = \frac{P(X_{t_1} - X_0 = n_1, X_{t_2} - X_{t_1} = n_2 - n_1, \ldots, X_{t_k} - X_{t_{k-1}} = n_k - n_{k-1})}{P(X_{t_1} - X_0 = n_1, \ldots, X_{t_{k-1}} - X_{t_{k-2}} = n_{k-1} - n_{k-2})}$$
$$= P(X_{t_k} - X_{t_{k-1}} = n_k - n_{k-1}) = P_{n_k - n_{k-1}}(t_k - t_{k-1}).$$

Calculating $P(X_{t_k} = n_k \mid X_{t_{k-1}} = n_{k-1})$ in the same way, it too is equal to $P_{n_k - n_{k-1}}(t_k - t_{k-1})$. According to Equation 8.5,

$$P(X_{t_k} = n_k \mid X_{t_1} = n_1, \ldots, X_{t_{k-1}} = n_{k-1}) = P(X_{t_k} = n_k \mid X_{t_{k-1}} = n_{k-1}) = P_{n_k - n_{k-1}}(t_k - t_{k-1})$$
$$= \frac{(\lambda(t_k - t_{k-1}))^{n_k - n_{k-1}}}{(n_k - n_{k-1})!} e^{-\lambda(t_k - t_{k-1})},$$

so that $\{X_t : t \ge 0\}$ is a Markov chain with stationary transition function

$$p_{m,n}(t) = \begin{cases} \dfrac{\lambda^{n-m} t^{n-m}}{(n - m)!} e^{-\lambda t} & \text{if } n \ge m,\ t \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

■

The stationary transition functions $p_{i,j}$ inherit an important property from the Markov property. The equation in (iii) of the following theorem is known as the Chapman-Kolmogorov equation.

Theorem 8.4.1 The transition functions $p_{i,j}$ have the following properties:

(i) $p_{i,j}(0) = \delta_{i,j}$, $i, j \ge 1$.
(ii) $\sum_j p_{i,j}(t) = 1$ for all $t \ge 0$.
(iii) $p_{i,j}(s + t) = \sum_k p_{i,k}(s) p_{k,j}(t)$ if $s, t \ge 0$.

PROOF: We first prove (iii). By the Markov property,

$$p_{i,j}(s + t) = P(X_{t+s} = s_j \mid X_0 = s_i) = \sum_k \frac{P(X_{t+s} = s_j, X_s = s_k, X_0 = s_i)}{P(X_0 = s_i)}$$
$$= \sum_k P(X_{t+s} = s_j \mid X_0 = s_i, X_s = s_k) P(X_s = s_k \mid X_0 = s_i)$$
$$= \sum_k P(X_{t+s} = s_j \mid X_s = s_k) P(X_s = s_k \mid X_0 = s_i)$$
$$= \sum_k p_{i,k}(s) p_{k,j}(t).$$

Since $P(X_0 = s_j \mid X_0 = s_i) = P(X_0 = s_j, X_0 = s_i)/P(X_0 = s_i)$ is 0 or 1 according to whether $i \ne j$ or $i = j$, respectively, $p_{i,j}(0) = \delta_{i,j}$ and (i) holds. Finally, $\sum_j p_{i,j}(t) = \sum_j P(X_t = s_j \mid X_0 = s_i) = P(X_t \in S \mid X_0 = s_i) = 1$ and (ii) holds. ■

The three properties listed in this theorem in conjunction with the continuity assumption 8.12 imply a good deal more about the transition functions $p_{i,j}$. Additional properties will be reviewed briefly for the purpose of clarifying applications. Proofs of these facts require a little more mathematical background than is presupposed here. Complete details are available in Chapter 14 of the book by Karlin and Taylor listed at the end of the chapter.
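The properties in Theorem 8.4.1 can be checked numerically for the Poisson transition function of Example 8.2. The following is a small sketch under stated assumptions: the function names and the particular values of $\lambda$, $s$, and $t$ are ours, chosen only for illustration. For the Poisson chain $p_{i,k}(s) p_{k,j}(t)$ vanishes unless $i \le k \le j$, so the sum in the Chapman-Kolmogorov equation is finite.

```python
import math


def p(m, n, t, lam):
    """Transition function of Example 8.2:
    p_{m,n}(t) = (lam*t)^(n-m) / (n-m)! * exp(-lam*t) if n >= m, else 0."""
    if n < m:
        return 0.0
    k = n - m
    return (lam * t) ** k / math.factorial(k) * math.exp(-lam * t)


def chapman_kolmogorov_gap(i, j, s, t, lam):
    """Absolute difference between the two sides of property (iii).
    Only intermediate states k with i <= k <= j contribute."""
    lhs = p(i, j, s + t, lam)
    rhs = sum(p(i, k, s, lam) * p(k, j, t, lam) for k in range(i, j + 1))
    return abs(lhs - rhs)


lam = 2.0  # illustrative rate

# Property (i): p_{i,j}(0) is the Kronecker delta.
print(p(2, 2, 0.0, lam), p(2, 5, 0.0, lam))

# Property (ii): the row sum, truncated far into the tail, is close to 1.
row_sum = sum(p(3, j, 1.5, lam) for j in range(3, 80))
print("row sum:", row_sum)

# Property (iii): Chapman-Kolmogorov equation.
print("gap:", chapman_kolmogorov_gap(1, 6, 0.7, 1.1, lam))
```

That the gap is zero up to rounding error reflects the binomial theorem applied to $(s + t)^{j-i}$.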
The first facts concern differentiability of the transition functions $p_{i,j}$.

Theorem 8.4.2 (i) For every $i$,

$$q_{i,i} = p_{i,i}'(0) = \lim_{t \to 0+} \frac{p_{i,i}(t) - 1}{t}$$

exists but may be $-\infty$.

(ii) For all $i, j$ with $i \ne j$,

$$q_{i,j} = p_{i,j}'(0) = \lim_{t \to 0+} \frac{p_{i,j}(t)}{t}$$

exists and is finite.

Since $\sum_j p_{i,j}(t) = 1$, it might be tempting to take the derivative term by term to show that

$$\sum_j q_{i,j} = 0.$$

This is certainly a valid step if the sum $\sum_j p_{i,j}(t)$ is a finite sum, as it is in many applications, but in general the most that can be proved is that for each $i$,

$$-q_{i,i} \ge \sum_{j \ne i} q_{i,j}.$$

We will assume in the remainder of this section that

$$\sum_{j \ne i} q_{i,j} = -q_{i,i} < +\infty \quad \text{for all } i. \tag{8.13}$$

This requirement means that all of the entries in the following matrix, called the q-matrix, are finite and that each row sum is zero:

$$Q = \begin{pmatrix} q_{1,1} & q_{1,2} & \cdots \\ q_{2,1} & q_{2,2} & \cdots \\ \vdots & \vdots & \end{pmatrix}.$$

In matrix notation, $Q\mathbf{1} = 0$ where $\mathbf{1}$ is a column vector all of whose entries are 1.

Not only are the $p_{i,j}(t)$ differentiable at 0, but $p_{i,j}'(t)$ is defined for all $t \ge 0$. Consider the equations

$$p_{i,j}(t + s) - p_{i,j}(t) = \sum_k p_{i,k}(s) p_{k,j}(t) - p_{i,j}(t) = \sum_{k \ne i} p_{i,k}(s) p_{k,j}(t) + (p_{i,i}(s) - 1) p_{i,j}(t).$$

Proceeding formally by dividing by $s$ and letting $s \to 0+$, we obtain the following equations, which are known as the Kolmogorov backward equations:

$$p_{i,j}'(t) = \sum_{k \ne i} q_{i,k} p_{k,j}(t) + q_{i,i} p_{i,j}(t), \qquad t \ge 0. \tag{8.14}$$

On the other hand, we can write

$$p_{i,j}(t + s) - p_{i,j}(s) = \sum_k p_{i,k}(s) p_{k,j}(t) - \sum_k p_{i,k}(s) p_{k,j}(0) = \sum_k p_{i,k}(s) p_{k,j}(t) - \sum_k p_{i,k}(s) \delta_{k,j} = \sum_k p_{i,k}(s)(p_{k,j}(t) - \delta_{k,j}).$$

Operating formally again by dividing by $t$ and letting $t \to 0+$, and then writing $t$ in place of $s$, we obtain the following equations, which are known as the Kolmogorov forward equations:

$$p_{i,j}'(t) = \sum_{k \ne j} p_{i,k}(t) q_{k,j} + q_{j,j} p_{i,j}(t), \qquad t \ge 0. \tag{8.15}$$

If the state space $S$ is infinite, then both sets of forward and backward equations represent an infinite system of differential equations that must be solved simultaneously. Both sets of equations take on deceptively simple forms if expressed in matrix notation.
Letting $P'(t) = [p_{i,j}'(t)]$ be the matrix with entries $p_{i,j}'(t)$, the backward equations take on the form

$$P'(t) = QP(t), \qquad t \ge 0, \tag{8.16}$$

and the forward equations take on the form

$$P'(t) = P(t)Q, \qquad t \ge 0, \tag{8.17}$$

with both equations subject to the initial continuity condition 8.12.

EXAMPLE 8.3 Consider a Poisson process $\{X_t : t \ge 0\}$ with rate $\lambda > 0$. In this case, the transition functions $p_{i,j}$ are given by

$$p_{i,j}(t) = \frac{(\lambda t)^{j-i}}{(j - i)!} e^{-\lambda t}, \qquad j \ge i \ge 0,$$

and it is easily seen that $q_{i,i} = p_{i,i}'(0) = -\lambda$, $q_{i,i+1} = p_{i,i+1}'(0) = \lambda$, and $q_{i,j} = 0$ for $j \notin \{i, i + 1\}$ and all $i \ge 0$. The q-matrix is given by

$$Q = \begin{pmatrix} -\lambda & \lambda & 0 & \cdots \\ 0 & -\lambda & \lambda & \cdots \\ 0 & 0 & -\lambda & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}.$$

■

Application of these results lies in the choice of the q-matrix. From the definition of the $q_{i,j}$, for all $i$,

$$p_{i,i}(h) = 1 + q_{i,i} h + o(h),$$

and for $i \ne j$,

$$p_{i,j}(h) = q_{i,j} h + o(h).$$

These equations frequently suggest how the $q_{i,j}$ should be chosen. It is then a matter of solving the Kolmogorov backward or forward equations for the $p_{i,j}(t)$.

EXAMPLE 8.4 Consider a system consisting of a single unit that has a failure rate of $\mu$ and a repair rate of $\lambda$; i.e., if the unit is in operation at time $t$, the probability that it will fail in the interval $(t, t + h)$ is $\mu h + o(h)$; if not in operation at time $t$, the probability that it will be repaired and put back into operation in the interval $(t, t + h)$ is $\lambda h + o(h)$. This system can be in one of the two states 0, 1 with the q-matrix

$$Q = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix};$$

e.g., the probability of going from state 1 to state 0 in a time interval of length $h$ is approximately $q_{1,0} h = \mu h$. In this case, there are four transition functions to be determined: $p_{0,0}, p_{0,1}, p_{1,0}, p_{1,1}$. The Kolmogorov forward equations are

$$p_{i,0}'(t) = p_{i,1}(t)\mu - \lambda p_{i,0}(t), \qquad i = 0, 1,$$
$$p_{i,1}'(t) = p_{i,0}(t)\lambda - \mu p_{i,1}(t), \qquad i = 0, 1.$$

Since $p_{i,1}(t) = 1 - p_{i,0}(t)$ and $p_{i,0}(t) = 1 - p_{i,1}(t)$, the equations can be written

$$p_{i,0}'(t) + (\lambda + \mu) p_{i,0}(t) = \mu,$$
$$p_{i,1}'(t) + (\lambda + \mu) p_{i,1}(t) = \lambda.$$

Multiplying both sides of the first equation by $e^{(\lambda + \mu)t}$, it can be written

$$\frac{d}{dt}\left(e^{(\lambda + \mu)t} p_{i,0}(t)\right) = \mu e^{(\lambda + \mu)t}.$$
Integrating and then multiplying both sides by $e^{-(\lambda + \mu)t}$,

$$p_{i,0}(t) = \frac{\mu}{\lambda + \mu} + c_{i,0} e^{-(\lambda + \mu)t}$$

where the integration constants must be chosen to satisfy the continuity condition 8.12. After so choosing the constants and applying the same procedure to the $p_{i,1}(t)$, we obtain

$$p_{0,0}(t) = \frac{\mu + \lambda e^{-(\lambda + \mu)t}}{\lambda + \mu}, \qquad p_{0,1}(t) = \frac{\lambda - \lambda e^{-(\lambda + \mu)t}}{\lambda + \mu},$$
$$p_{1,0}(t) = \frac{\mu - \mu e^{-(\lambda + \mu)t}}{\lambda + \mu}, \qquad p_{1,1}(t) = \frac{\lambda + \mu e^{-(\lambda + \mu)t}}{\lambda + \mu}.$$

■

EXERCISES 8.4

1. Consider the q-matrix

$$Q = \begin{pmatrix} -\lambda & \lambda & 0 & \cdots \\ 0 & -\lambda & \lambda & \cdots \\ 0 & 0 & -\lambda & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

and the corresponding Kolmogorov forward equations for $P(t) = [p_{i,j}(t)]$.

(a) Use the forward equations to show that $p_{i,0}(t) \equiv 0$ for all $i \ge 1$.
(b) For each $i \ge 1$, use an induction argument to show that $p_{i,j}(t) = 0$ for $j \le i - 1$.
(c) Calculate $p_{i,j}(t)$ for $j \ge i$.

2. A system consists of $m$ components, some of which are in operation and some of which are not at any given time $t$. If there are $k$ components in operation at time $t$, the probability that one of them will fail in the interval $(t, t + h)$ is $\mu k h + o(h)$, the probability that one will be put back into operation in the interval is $\lambda h + o(h)$, and the probability that two or more changes will occur is $o(h)$. If the state of the system is the number of components in operation, what is the q-matrix for the system?

3. A system consists of two components that are connected in parallel with one online and the other on standby. The one in operation at time $t$ will fail in the interval $(t, t + h)$ with probability $\lambda h + o(h)$. A component cannot fail while on standby. A component in failed condition at time $t$ will be repaired in the interval $(t, t + h)$ with probability $\mu h + o(h)$. The probability of two or more changes taking place in an interval of length $h$ is $o(h)$. If the state of the system is the number of failed components, what is the q-matrix for this system?

4. Suppose the system of the previous problem is modified so that a component on standby at time $t$ will fail in the interval $(t, t + h)$ with probability $\lambda_s h + o(h)$.
If the number of failed components is the state, what is the q-matrix for the system?

MATRIX CALCULUS

In the previous sections, we have seen that the differential equation $p'(t) = ap(t)$ has the solution $p(t) = ce^{at}$. We have seen also that the transition matrix $P(t) = [p_{i,j}(t)]$ satisfies the matrix equation $P'(t) = QP(t)$, where $Q$, the q-matrix, is a constant matrix. We could try to solve the equation $P'(t) = QP(t)$ by means of the function $P(t) = e^{tQ}$ if only we knew what $e^{tQ}$ represents. Since

$$e^{at} = 1 + at + \frac{a^2 t^2}{2!} + \frac{a^3 t^3}{3!} + \cdots,$$

we might try replacing $a$ by $Q$ to obtain

$$e^{tQ} = I + tQ + \frac{t^2 Q^2}{2!} + \frac{t^3 Q^3}{3!} + \cdots$$

where it is necessary to replace the number 1 by the matrix $I = [\delta_{i,j}]$. The individual terms involving powers of $Q$ on the right side make sense. Except for the fact that the right side is a sum of an infinite series all of whose terms are matrices, this equation can be taken as the definition of $e^{tQ}$. Since sums of infinite series are defined in terms of limits, we must digress to discuss sequences of matrices and limits of such sequences. In keeping within the bounds of introductory material, we will limit ourselves to matrices with a finite number of rows and columns.

We will deal with $r \times s$ matrices $A = [a_{i,j}]$ with $r$ rows and $s$ columns. Either $\{A^{(n)}\}$ or $A^{(1)}, A^{(2)}, \ldots$ will denote an infinite sequence of such matrices, all of size $r \times s$. The dependence upon $n$ is specified by putting the $n$ in parentheses, since the notation $A^n$ stands for the $n$th power of $A$ when $A$ is a square matrix. Thus, $A^{(n)} = [a_{i,j}^{(n)}]$. If $\lim_{n \to \infty} a_{i,j}^{(n)}$ exists and is equal to $a_{i,j}$ for each $i$ and $j$, we say that the sequence $\{A^{(n)}\}$ converges to $A = [a_{i,j}]$ and write

$$\lim_{n \to \infty} A^{(n)} = A.$$

Alternatively,

$$\lim_{n \to \infty} [a_{i,j}^{(n)}] = [\lim_{n \to \infty} a_{i,j}^{(n)}] = [a_{i,j}].$$

EXAMPLE 8.5

$$\lim_{n \to \infty} \begin{pmatrix} 2^{1/n} & (1 + (1/n))^n \\ 1 & (n^2 + 1)/(n^3 + n^2 + 1) \end{pmatrix} = \begin{pmatrix} 1 & e \\ 1 & 0 \end{pmatrix}.$$

■

Let $\{A^{(n)}\}$ be a convergent sequence of $r \times s$ matrices and let $\{B^{(n)}\}$ be a convergent sequence of $s \times t$ matrices with $\lim_{n \to \infty} A^{(n)} = A$ and $\lim_{n \to \infty} B^{(n)} = B$, respectively.
Then the sequence $\{A^{(n)} B^{(n)}\}$ converges and $\lim_{n \to \infty} A^{(n)} B^{(n)} = AB$. This follows from the fact that the entry in the $i$th row and $j$th column of $A^{(n)} B^{(n)}$ is $\sum_{k=1}^{s} a_{i,k}^{(n)} b_{k,j}^{(n)}$ with limit $\sum_{k=1}^{s} a_{i,k} b_{k,j}$, which is the corresponding entry in $AB$.

Given a sequence of matrices $\{A^{(n)}\}$, all of the same size, we can form the expression $\sum_{n=1}^{\infty} A^{(n)}$, since matrices of the same size can be added. The definition of the sum of the infinite series $\sum_{n=1}^{\infty} A^{(n)}$ mimics the calculus definition. For $n \ge 1$, let

$$S^{(n)} = [s_{i,j}^{(n)}] = \sum_{k=1}^{n} A^{(k)} = A^{(1)} + A^{(2)} + \cdots + A^{(n)}$$

where $s_{i,j}^{(n)} = \sum_{k=1}^{n} a_{i,j}^{(k)}$. If the sequence $\{S^{(n)}\}$ has a limit $S = [s_{i,j}]$, we say that the series converges and has sum $S$. Note that

$$s_{i,j} = \sum_{n=1}^{\infty} a_{i,j}^{(n)}.$$

In what follows, an infinite series will begin with a zero term as in $\sum_{n=0}^{\infty} A^{(n)}$. If the $n$th term is $A^n$ for some matrix $A$, $A^0$ by convention will be the identity matrix $I = [\delta_{i,j}]$. We will also adopt the notation $A^n = [a_{i,j}^{(n)}]$.

Definition 8.4 If $A$ is an $r \times r$ matrix, we define $e^A = \sum_{n=0}^{\infty} A^n/n!$ provided each of the series $\sum_{n=0}^{\infty} a_{i,j}^{(n)}/n!$ converges absolutely, $i, j = 1, \ldots, r$. ■

Lemma 8.5.1 Let $A = [a_{i,j}]$ be an $r \times r$ matrix and let $A^n = [a_{i,j}^{(n)}]$. If there is a constant $M$ such that $|a_{i,j}^{(n)}| \le M^n$ for $1 \le i, j \le r$ and $n \ge 1$, then $e^A$ is defined. Moreover, if $a_{i,j} \ge 0$ for $1 \le i, j \le r$, then $a_{i,j}^{(n)} \ge 0$ for $1 \le i, j \le r$, $n \ge 1$, and $\sum_{n=0}^{\infty} a_{i,j}^{(n)}/n! \ge 0$.

PROOF: Since the series $\sum_{n=0}^{\infty} M^n/n!$ converges, the series $\sum_{n=0}^{\infty} a_{i,j}^{(n)}/n!$ converges absolutely by the comparison test, and $e^A$ is defined. If the $a_{i,j}$ are nonnegative, then $a_{i,j}^{(2)} = \sum_{k=1}^{r} a_{i,k} a_{k,j} \ge 0$ for $1 \le i, j \le r$. It is easily seen that $a_{i,j}^{(n)} \ge 0$ for $1 \le i, j \le r$, $n \ge 1$, using mathematical induction. ■

If $a$ and $b$ are real numbers, then $ab = ba$ and $e^a e^b = e^{a+b}$; but if $A$ and $B$ are $r \times r$ matrices, it need not be true that $AB = BA$ nor that $e^A e^B = e^{A+B}$ when the latter are defined. For example, if

$$A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix},$$

then $AB \ne BA$. Granted even that $AB = BA$, we are confronted immediately with the binomial theorem in dealing with $e^{A+B}$.

Lemma 8.5.2 Let $A$ and $B$ be $r \times r$ matrices such that $AB = BA$.
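Then

$$(A + B)^n = \sum_{k=0}^{n} \binom{n}{k} A^k B^{n-k}.$$

PROOF: Since $\sum_{k=0}^{1} \binom{1}{k} A^k B^{1-k} = B + A = A + B$, the assertion is true for $n = 1$. Assume it is true for $n - 1$. Then

$$(A + B)^n = (A + B)(A + B)^{n-1} = (A + B) \sum_{k=0}^{n-1} \binom{n-1}{k} A^k B^{n-1-k}.$$

Since $B A^k B^{n-1-k} = B A A^{k-1} B^{n-1-k} = A B A^{k-1} B^{n-1-k} = \cdots = A^k B B^{n-1-k} = A^k B^{n-k}$,

$$(A + B)^n = \sum_{k=0}^{n-1} \binom{n-1}{k} A^{k+1} B^{n-1-k} + \sum_{k=0}^{n-1} \binom{n-1}{k} A^k B^{n-k} = \sum_{j=1}^{n} \binom{n-1}{j-1} A^j B^{n-j} + \sum_{j=0}^{n-1} \binom{n-1}{j} A^j B^{n-j}.$$

Since the sum of the binomial coefficients $\binom{n-1}{j-1} + \binom{n-1}{j}$ is $\binom{n}{j}$ (see Exercise 1.3.2),

$$(A + B)^n = A^n + \sum_{j=1}^{n-1} \binom{n}{j} A^j B^{n-j} + B^n = \sum_{j=0}^{n} \binom{n}{j} A^j B^{n-j}$$

and the assertion is true for $n$. By the principle of mathematical induction, the assertion is true for all $n \ge 1$. ■

For the proof of the following lemma, the entry in the $i$th row and $j$th column of the matrix $A$ will be denoted by $(A)_{i,j}$.

Lemma 8.5.3 Let $A$ and $B$ be $r \times r$ matrices for which $e^A$ and $e^B$ are defined and $AB = BA$. Then $e^{A+B}$ is defined and $e^{A+B} = e^A e^B$.

PROOF: Since $e^A = \sum_{n=0}^{\infty} A^n/n!$ and $e^B = \sum_{n=0}^{\infty} B^n/n!$,

$$(e^A e^B)_{i,j} = \sum_{k=1}^{r} \left(\sum_{n=0}^{\infty} \frac{a_{i,k}^{(n)}}{n!}\right)\left(\sum_{n=0}^{\infty} \frac{b_{k,j}^{(n)}}{n!}\right).$$

Since the indicated series are absolutely convergent, the product of their sums can be written $\sum_{n=0}^{\infty} c_{i,j,k}^{(n)}$ where

$$c_{i,j,k}^{(n)} = \sum_{\ell=0}^{n} \frac{a_{i,k}^{(\ell)}}{\ell!} \frac{b_{k,j}^{(n-\ell)}}{(n - \ell)!}.$$

By Lemma 8.5.2,

$$(e^A e^B)_{i,j} = \sum_{k=1}^{r} \sum_{n=0}^{\infty} \sum_{\ell=0}^{n} \frac{a_{i,k}^{(\ell)}}{\ell!} \frac{b_{k,j}^{(n-\ell)}}{(n - \ell)!} = \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{\ell=0}^{n} \binom{n}{\ell} \sum_{k=1}^{r} a_{i,k}^{(\ell)} b_{k,j}^{(n-\ell)} = \sum_{n=0}^{\infty} \frac{1}{n!} \left((A + B)^n\right)_{i,j},$$

so that $e^{A+B}$ is defined and $e^{A+B} = e^A e^B$. ■

Theorem 8.5.4 Let $Q = [q_{i,j}]$ be an $r \times r$ matrix for which $q_{i,j} \ge 0$ if $i \ne j$ and $\sum_{j=1}^{r} q_{i,j} = 0$, $i = 1, \ldots, r$. Then $P(t) = [p_{i,j}(t)] = e^{tQ}$ is defined, and the $p_{i,j}(t)$ have the following properties:

(i) $\lim_{t \to 0+} p_{i,j}(t) = \delta_{i,j}$.
(ii) $p_{i,j}(t) \ge 0$, $t \ge 0$.
(iii) $\sum_{j=1}^{r} p_{i,j}(t) = 1$, $t \ge 0$.
(iv) $p_{i,j}(s + t) = \sum_{k=1}^{r} p_{i,k}(s) p_{k,j}(t)$.
(v) $p_{i,j}'(t) = \sum_{k=1}^{r} q_{i,k} p_{k,j}(t)$.

PROOF: We first show that $e^{tQ}$ is defined by showing that there is a constant $M$ such that $|q_{i,j}^{(n)}| \le M^n$, $1 \le i, j \le r$, $n \ge 1$. To see this, let $M = \max_{1 \le i \le r} \left(\sum_{j=1}^{r} |q_{i,j}|\right)$. Since $|q_{i,j}| \le M$, the assertion is true for $n = 1$. Assume it is true for $n - 1$; i.e., $|q_{i,j}^{(n-1)}| \le M^{n-1}$ for $1 \le i, j \le r$. Then

$$|q_{i,j}^{(n)}| = \left|\sum_{k=1}^{r} q_{i,k} q_{k,j}^{(n-1)}\right| \le \sum_{k=1}^{r} |q_{i,k}| M^{n-1} \le M^n$$

and the assertion is true for $n$ whenever it is true for $n - 1$. It is therefore true for all $n \ge 1$ by mathematical induction. Since the series $\sum_{n=0}^{\infty} t^n M^n / n!$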
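converges, the series

$$\sum_{n=0}^{\infty} \frac{t^n}{n!} q_{i,j}^{(n)}$$

converges absolutely for $1 \le i, j \le r$. Thus, $e^{tQ}$ is defined for all $t \ge 0$. In fact, this argument shows that this power series in $t$ has $(-\infty, \infty)$ as its interval of convergence. Since

$$p_{i,j}(t) = \delta_{i,j} + \sum_{n=1}^{\infty} \frac{t^n}{n!} q_{i,j}^{(n)}$$

and the sum of a power series is continuous on its interval of convergence, $\lim_{t \to 0+} p_{i,j}(t) = \delta_{i,j}$, $1 \le i, j \le r$. This proves (i).

Consider the matrix $Q + aI$. The off-diagonal elements of $Q + aI$ are the same as those of $Q$. The diagonal elements of $Q$ are nonpositive, but an $a > 0$ can be chosen so that $Q + aI$ has nonnegative elements. Writing $e^{tQ} = e^{-atI} e^{t(Q + aI)}$, note that the entries of the matrix $e^{t(Q + aI)}$ are all nonnegative by Lemma 8.5.1. Since it is easily checked that $e^{-atI} = e^{-at} I$, the same is true for the entries of $e^{-atI}$. Thus, the entries of $P(t) = [p_{i,j}(t)] = e^{tQ}$ are nonnegative. This proves (ii). Since

$$p_{i,j}(t) = \delta_{i,j} + t q_{i,j} + \sum_{n=2}^{\infty} \frac{t^n}{n!} \sum_{k=1}^{r} q_{i,k}^{(n-1)} q_{k,j},$$

$$\sum_{j=1}^{r} p_{i,j}(t) = \sum_{j=1}^{r} \delta_{i,j} + t \sum_{j=1}^{r} q_{i,j} + \sum_{n=2}^{\infty} \frac{t^n}{n!} \sum_{k=1}^{r} q_{i,k}^{(n-1)} \sum_{j=1}^{r} q_{k,j} = 1$$

and (iii) is proved. Since $e^{sQ}$ and $e^{tQ}$ are defined, so is $e^{(s+t)Q}$, and $e^{(s+t)Q} = e^{sQ} e^{tQ}$; i.e.,

$$p_{i,j}(s + t) = (e^{(s+t)Q})_{i,j} = \sum_{k=1}^{r} (e^{sQ})_{i,k} (e^{tQ})_{k,j} = \sum_{k=1}^{r} p_{i,k}(s) p_{k,j}(t)$$

and (iv) is satisfied. It remains only to show that $P'(t) = [p_{i,j}'(t)] = QP(t)$. Since a power series can be differentiated term by term within its interval of convergence,

$$p_{i,j}'(t) = \sum_{n=1}^{\infty} \frac{t^{n-1}}{(n - 1)!} q_{i,j}^{(n)} = \sum_{n=1}^{\infty} \frac{t^{n-1}}{(n - 1)!} \sum_{k=1}^{r} q_{i,k} q_{k,j}^{(n-1)} = \sum_{k=1}^{r} q_{i,k} \sum_{n=1}^{\infty} \frac{t^{n-1}}{(n - 1)!} q_{k,j}^{(n-1)} = \sum_{k=1}^{r} q_{i,k} p_{k,j}(t)$$

and (v) is true. ■

Theorem 8.5.4 provides the answer to whether or not a q-matrix determines a Markov chain. Direct application of the theorem is impractical except possibly for the simplest cases, as will be seen in the following example, and useful information about the resulting Markov chain must be obtained indirectly, as in the next section.

EXAMPLE 8.6 Consider the q-matrix

$$Q = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \\ 1 & 0 & -1 \end{pmatrix}.$$

Direct multiplication gives

$$Q^2 = \begin{pmatrix} 1 & -2 & 1 \\ 1 & 1 & -2 \\ -2 & 1 & 1 \end{pmatrix}, \quad Q^3 = \begin{pmatrix} 0 & 3 & -3 \\ -3 & 0 & 3 \\ 3 & -3 & 0 \end{pmatrix}, \quad Q^4 = \begin{pmatrix} -3 & -3 & 6 \\ 6 & -3 & -3 \\ -3 & 6 & -3 \end{pmatrix},$$

$$Q^5 = \begin{pmatrix} 9 & 0 & -9 \\ -9 & 9 & 0 \\ 0 & -9 & 9 \end{pmatrix}, \quad Q^6 = \begin{pmatrix} -18 & 9 & 9 \\ 9 & -18 & 9 \\ 9 & 9 & -18 \end{pmatrix}.$$

Upon calculating $Q^7$, it is seen that $Q^7 = (-27)Q$, and from this it follows that, for $1 \le k \le 6$ and $n \ge 0$,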
$Q^{k+6n} = (-27)^n Q^k$. Thus,

$$e^{tQ} = I + \sum_{k=1}^{6} \sum_{n=0}^{\infty} \frac{(-27)^n t^{k+6n}}{(k + 6n)!} Q^k = I + \sum_{k=1}^{6} Q^k \sum_{n=0}^{\infty} \frac{(-27)^n t^{k+6n}}{(k + 6n)!}.$$

In particular,

$$p_{1,3}(t) = \sum_{n=0}^{\infty} (-27)^n t^{6n} \left(\frac{t^2}{(2 + 6n)!} - \frac{3t^3}{(3 + 6n)!} + \frac{6t^4}{(4 + 6n)!} - \frac{9t^5}{(5 + 6n)!} + \frac{9t^6}{(6 + 6n)!}\right).$$

■

EXERCISES 8.5

1. Show that $e^{atI} = e^{at} I$.

2. If $\lambda, \mu > 0$ and

$$Q = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix},$$

solve the matrix equation $P'(t) = QP(t)$, $t \ge 0$, using the methods of this section.

3. Solve the matrix equation $P'(t) = QP(t)$, $t \ge 0$, using the methods of this section, where

$$Q = \begin{pmatrix} -1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix}.$$

STATIONARY DISTRIBUTIONS

Let $Q = [q_{i,j}]$ be an $r \times r$ q-matrix and let $P(t) = [p_{i,j}(t)]$ be the associated matrix of transition functions. One of the problems related to the $P(t)$ is the long-range behavior of the $p_{i,j}(t)$ as $t \to \infty$. In general, the long-range behavior can be complicated, and we will deal only with the simplest case.

Letting $Q^n = [q_{i,j}^{(n)}]$, by definition of matrix multiplication,

$$q_{i,j}^{(2)} = \sum_{k=1}^{r} q_{i,k} q_{k,j}.$$

Writing $Q^3 = Q Q^2$,

$$q_{i,j}^{(3)} = \sum_{k=1}^{r} q_{i,k} q_{k,j}^{(2)} = \sum_{k=1}^{r} \sum_{\ell=1}^{r} q_{i,k} q_{k,\ell} q_{\ell,j}.$$

More generally,

$$q_{i,j}^{(n)} = \sum_{i_1=1}^{r} \sum_{i_2=1}^{r} \cdots \sum_{i_{n-1}=1}^{r} q_{i,i_1} q_{i_1,i_2} \times \cdots \times q_{i_{n-1},j}. \tag{8.18}$$

Definition 8.5 The q-matrix $Q = [q_{i,j}]$ is irreducible if for every pair $i, j \in \{1, \ldots, r\}$ with $i \ne j$ either $q_{i,j} > 0$ or there is a finite sequence $i_1, \ldots, i_m$ such that $q_{i,i_1} q_{i_1,i_2} \times \cdots \times q_{i_m,j} \ne 0$. ■

We can assume that the $i, i_1, \ldots, i_m, j$ in this definition are distinct, because if $i_p = i_q$ for some $p < q$, then the factor $q_{i_p,i_{p+1}} \times \cdots \times q_{i_{q-1},i_q}$ can be deleted since it is preceded by $q_{i_{p-1},i_p}$ and followed by $q_{i_q,i_{q+1}}$. The resulting product will remain nonzero. Since $q_{k,\ell} \ge 0$ whenever $k \ne \ell$, we can assume that $q_{i,i_1} \times \cdots \times q_{i_m,j} > 0$ in addition to the $i, i_1, \ldots, i_m, j$ being distinct.

EXAMPLE 8.7 The q-matrix

$$Q = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \\ 1 & 0 & -1 \end{pmatrix}$$

is irreducible since $q_{1,2} = 1 > 0$, $q_{1,2} q_{2,3} = 1 > 0$, $q_{2,3} q_{3,1} = 1 > 0$, $q_{2,3} = 1 > 0$, $q_{3,1} = 1 > 0$, and $q_{3,1} q_{1,2} = 1 > 0$. ■

There is an easy way to determine if an $r \times r$ q-matrix is irreducible by drawing a diagram as in Figure 8.2. The numbers in the circles represent the states 1, 2, ..., 6.
The arrow connecting 2 to 4 signifies that $q_{2,4} > 0$. If it is possible to find a path connecting all the states by following the arrows, then $Q$ is irreducible. In this case, the matrix $Q$ is irreducible since there is a path connecting all the states.

Theorem 8.6.1 If the q-matrix $Q = [q_{i,j}]$ is irreducible, then $p_{i,j}(t) > 0$ for all $t > 0$, $1 \le i,j \le r$.

PROOF: We first show that $p_{i,i}(t) > 0$ for $t > 0$. Since $\lim_{t \to 0+} p_{i,i}(t) = 1$, there is a $\delta > 0$ such that $p_{i,i}(t) > 0$ for all $0 < t \le \delta$. Consider any $t > 0$ and choose $k$ such that $k\delta \le t < (k+1)\delta$. Since $P(t) = P(k\delta)P(t - k\delta) = P(\delta)P(\delta) \times \cdots \times P(\delta)P(t - k\delta)$, $p_{i,i}(t)$ is a sum of terms as in Equation 8.18, with the $q_{k,\ell}$ replaced by $p_{k,\ell}$, and

$$p_{i,i}(t) \ge p_{i,i}(\delta)\, p_{i,i}(\delta) \times \cdots \times p_{i,i}(\delta)\, p_{i,i}(t - k\delta);$$

since $t - k\delta < \delta$, all the factors on the right are positive and therefore $p_{i,i}(t) > 0$ for all $t > 0$.

Suppose now that $i \ne j$. Let $m$ be the smallest integer for which there is a finite sequence $i_1, \ldots, i_{m-1}$ such that $q_{i,i_1} \times \cdots \times q_{i_{m-1},j} > 0$. Then $q^{(n)}_{i,j} = 0$ for all $n < m$. Consider

$$q^{(m)}_{i,j} = \sum_{i_1=1}^{r} \cdots \sum_{i_{m-1}=1}^{r} q_{i,i_1} \cdots q_{i_{m-1},j}.$$

Each term on the right must be nonnegative, because if some term were strictly negative then some factor $q_{i_j,i_{j+1}}$ would be strictly negative, and this can only happen if $i_j = i_{j+1}$, contradicting the minimality of $m$. Because all terms are nonnegative and at least one is positive, $q^{(m)}_{i,j} > 0$. But $q^{(n)}_{i,j} = 0$ for all $n < m$ implies that

$$p_{i,j}(t) = \sum_{n=m}^{\infty} \frac{t^n}{n!}\, q^{(n)}_{i,j}.$$

Since

$$\frac{m!}{t^m}\, p_{i,j}(t) = q^{(m)}_{i,j} + m! \sum_{n=m+1}^{\infty} \frac{t^{n-m}}{n!}\, q^{(n)}_{i,j}$$

and the sum on the right tends to 0 as $t \to 0+$, $p_{i,j}(t)$ has the same sign as $q^{(m)}_{i,j}$ for all small $t$, and therefore $p_{i,j}(t) > 0$ for all small $t$. Choosing $\delta > 0$ such that $p_{i,j}(t) > 0$ for $0 < t < \delta$, $p_{i,j}(t+s) \ge p_{i,j}(t)\, p_{j,j}(s) > 0$ for $0 < t < \delta$ and all $s > 0$. Therefore, $p_{i,j}(t) > 0$ for all $t > 0$. ■

To study the long-range behavior of the $p_{i,j}(t)$, we will recall the discrete parameter version first. Let $P = [p_{i,j}]$ be an $r \times r$ stochastic matrix; i.e., $p_{i,j} \ge 0$ for $1 \le i,j \le r$ and $\sum_{j=1}^{r} p_{i,j} = 1$, $1 \le i \le r$. Moreover, let $P^n = [p^{(n)}_{i,j}]$ be the matrix of $n$-step transition probabilities.
According to Theorem 5.2.2, if there is a positive integer $N$ such that $p^{(N)}_{i,j} > 0$ for $1 \le i,j \le r$, then $\lim_{n \to \infty} p^{(n)}_{i,j}$ exists and is independent of $i$. It should be noted that the existence of the limit in the following theorem does not require that the q-matrix be irreducible, but without irreducibility the limit may depend upon $i$.

Theorem 8.6.2 Let $Q = [q_{i,j}]$ be an irreducible $r \times r$ q-matrix and let $P(t) = [p_{i,j}(t)]$ be the associated matrix of Markov transition functions. Then there is a probability density $\pi = \{\pi_1, \ldots, \pi_r\}$ such that

$$\lim_{t \to \infty} p_{i,j}(t) = \pi_j, \quad j = 1, \ldots, r,$$

independently of $i$.

PROOF: Fix $j \in \{1, \ldots, r\}$ and let $\epsilon > 0$. Since $\lim_{h \to 0+} p_{i,j}(h) = \delta_{i,j}$, $1 \le i,j \le r$, there is an $h_0 > 0$ such that

$$\sum_{i=1}^{r} |p_{i,j}(h) - \delta_{i,j}| < \frac{\epsilon}{2}$$

whenever $0 < h < h_0$. Fixing $h$ with $0 < h < h_0$, we will now show that $\pi_j = \lim_{n \to \infty} p_{i,j}(nh)$ exists independently of $i$. To see this, note that $P(h) = [p_{i,j}(h)]$ is a stochastic matrix and that $p_{i,j}(h) > 0$ by Theorem 8.6.1, since $Q$ is irreducible. Moreover, $[p_{i,j}(nh)] = P(nh) = [P(h)]^n = [p^{(n)}_{i,j}(h)]$ and $\pi_j = \lim_{n \to \infty} p_{i,j}(nh)$ exists and is independent of $i$, since this is true of the $p^{(n)}_{i,j}(h)$ by Theorem 5.2.2. Any $t > 0$ can be written $t = n(t)h + r(t)$ where $n(t)$ is a nonnegative integer and $0 \le r(t) < h$. From the inequalities

$$|p_{i,j}(t) - \pi_j| \le |p_{i,j}(t) - p_{i,j}(n(t)h)| + |p_{i,j}(n(t)h) - \pi_j|$$

and

$$|p_{i,j}(t) - p_{i,j}(n(t)h)| = \left| \sum_{k=1}^{r} p_{i,k}(n(t)h)\, p_{k,j}(r(t)) - \sum_{k=1}^{r} p_{i,k}(n(t)h)\, p_{k,j}(0) \right| \le \sum_{k=1}^{r} |p_{k,j}(r(t)) - \delta_{k,j}| < \frac{\epsilon}{2},$$

it follows that

$$|p_{i,j}(t) - \pi_j| \le \frac{\epsilon}{2} + |p_{i,j}(n(t)h) - \pi_j|.$$

Since $n(t) \to \infty$ as $t \to \infty$, the second term on the right can be made less than $\epsilon/2$ for large $t$; i.e., $\lim_{t \to \infty} p_{i,j}(t) = \pi_j$. Clearly, $\pi_j \ge 0$, $1 \le j \le r$. Since $\sum_{j=1}^{r} p_{i,j}(t) = 1$ and there are only a finite number of terms in the sum, $\sum_{j=1}^{r} \pi_j = 1$. ■

Except for relatively simple q-matrices, using limits to calculate $\pi$ is difficult. Fortunately, there is a simple algebraic method for finding $\pi$, assuming that the q-matrix is irreducible.

Definition 8.6 The probability density $\pi$ on $\{1, \ldots, r\}$ is a stationary density for the Markov transition functions $p_{i,j}(t)$ if

$$\pi_j = \sum_{k=1}^{r} \pi_k\, p_{k,j}(t), \quad j = 1, \ldots, r.$$ ■
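As a numerical illustration of Theorem 8.6.2 and Definition 8.6, the following Python sketch approximates $P(t) = e^{tQ}$ for the q-matrix of Example 8.7 (taken with $q_{1,2} = q_{2,3} = q_{3,1} = 1$, as stated in that example). It is only a sketch under stated assumptions: the helper names `mat_mul` and `transition_matrix`, the truncation at 20 terms, and the 10 squarings are choices made for this illustration and do not come from the text. The power series is summed at the small time $t/2^{10}$, and the property $P(s+s) = P(s)P(s)$ is then applied repeatedly.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, t, squarings=10, terms=20):
    """Approximate P(t) = e^{tQ}.

    Assumed scheme (not from the text): sum the first `terms` terms of the
    power series at the small time h = t / 2**squarings, then square the
    result `squarings` times, using P(2s) = P(s) P(s).
    """
    r = len(Q)
    h = t / 2 ** squarings
    total = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(r)]
    term = [row[:] for row in total]
    for n in range(1, terms + 1):
        # After this step, term holds (hQ)^n / n!.
        term = [[x * h / n for x in row] for row in mat_mul(term, Q)]
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    for _ in range(squarings):
        total = mat_mul(total, total)
    return total


# q-matrix of Example 8.7 (rows are states 1, 2, 3).
Q = [[-1, 1, 0],
     [0, -1, 1],
     [1, 0, -1]]

for t in (0.5, 2.0, 20.0):
    P = transition_matrix(Q, t)
    print(t, [[round(x, 4) for x in row] for row in P])
```

For this particular $Q$ every column also sums to zero, so the uniform density $(1/3, 1/3, 1/3)$ satisfies the condition of Definition 8.6, and the printed rows of $P(t)$ should all approach it as $t$ grows, in line with Theorem 8.6.2.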
Theorem 8.6.3 Let $Q = [q_{i,j}]$ be an $r \times r$ irreducible q-matrix and let $P(t) = [p_{i,j}(t)]$ be the associated matrix of transition functions. Then $P(t)$ has a unique stationary density $\pi$ that satisfies the equation

$$\sum_{i=1}^{r} \pi_i\, q_{i,j} = 0, \quad j = 1, \ldots, r. \qquad (8.19)$$

PROOF: Let $\pi_j = \lim_{t \to \infty} p_{i,j}(t)$. Since

$$p_{i,j}(s+t) = \sum_{k=1}^{r} p_{i,k}(s)\, p_{k,j}(t),$$

letting $s \to \infty$ we obtain $\pi_j = \sum_{k=1}^{r} \pi_k\, p_{k,j}(t)$, and $\pi$ is a stationary density for $P(t)$. Let $\sigma$ be a second stationary density. Since

$$\sigma_j = \sum_{k=1}^{r} \sigma_k\, p_{k,j}(t), \quad t > 0,$$

letting $t \to \infty$,

$$\sigma_j = \sum_{k=1}^{r} \sigma_k\, \pi_j = \pi_j \sum_{k=1}^{r} \sigma_k = \pi_j, \quad j = 1, \ldots, r,$$

and $\sigma = \pi$. Thus $\pi$ is unique. Since

$$\pi_j = \sum_{k=1}^{r} \pi_k\, p_{k,j}(t) = \sum_{k=1}^{r} \pi_k \sum_{n=0}^{\infty} \frac{t^n}{n!}\, q^{(n)}_{k,j} = \sum_{n=0}^{\infty} \frac{t^n}{n!} \left( \sum_{k=1}^{r} \pi_k\, q^{(n)}_{k,j} \right),$$

the sum of the power series in $t$ on the right is a constant, and therefore all the coefficients of $t, t^2, \ldots$ must be zero; in particular

$$\sum_{k=1}^{r} \pi_k\, q^{(1)}_{k,j} = \sum_{k=1}^{r} \pi_k\, q_{k,j} = 0.$$ ■

Equation 8.19 alone does not determine the stationary density $\pi$ uniquely. The equation $\sum_{j=1}^{r} \pi_j = 1$ must be used in conjunction with Equation 8.19.

EXAMPLE 8.8 Consider the q-matrix

$$Q = \begin{bmatrix} -4 & 1 & 2 & 1 \\ 1 & -4 & 2 & 1 \\ 2 & 1 & -6 & 3 \\ 0 & 0 & 1 & -1 \end{bmatrix}.$$

It is easily seen that $Q$ is irreducible. In this case, Equation 8.19 becomes

$$\begin{aligned} -4\pi_1 + \pi_2 + 2\pi_3 &= 0 \\ \pi_1 - 4\pi_2 + \pi_3 &= 0 \\ 2\pi_1 + 2\pi_2 - 6\pi_3 + \pi_4 &= 0 \\ \pi_1 + \pi_2 + 3\pi_3 - \pi_4 &= 0. \end{aligned}$$

Applying the usual row and column operations, these equations are easily seen to be linearly dependent, and the equation $\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1$ must be used to find the stationary density $\pi$. In this case, $\pi_1 = 1/10$, $\pi_2 = 1/15$, $\pi_3 = 1/6$, $\pi_4 = 2/3$. If the $p_{i,j}(t)$ are the transition functions corresponding to the above q-matrix, then $\lim_{t \to \infty} p_{i,1}(t) = 1/10$, $\lim_{t \to \infty} p_{i,2}(t) = 1/15$, $\lim_{t \to \infty} p_{i,3}(t) = 1/6$, $\lim_{t \to \infty} p_{i,4}(t) = 2/3$ for $i = 1, 2, 3, 4$. ■

EXERCISES 8.6

1. Which of the following q-matrices are irreducible?

$$Q_1 = \begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 1 & 1 \\ 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 \\ 1 & 0 & 0 & 0 & -1 \end{bmatrix}, \quad Q_2 = \begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ 1 & 0 & -2 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 0 & 0 & 1 & 0 & -1 \end{bmatrix}$$

2. Consider the system of Exercise 8.4.3.
If the failure rate is $\lambda = .01$ and the repair rate is $\mu = 2$, what is the limiting distribution of the system?

3. Determine the limiting distribution of the transition functions corresponding to the following q-matrix:

$$Q = \begin{bmatrix} -3 & 0 & 1 & 1 & 1 \\ 1 & -3 & 0 & 1 & 1 \\ 1 & 1 & -4 & 1 & 1 \\ 1 & 0 & 1 & -2 & 0 \\ 0 & 0 & 0 & 1 & -1 \end{bmatrix}$$

The following problems require mathematical software such as Mathematica or Maple V.

4. Consider the Markov chain $P(t)$ determined by the q-matrix

$$Q = \begin{bmatrix} -1 & 1/2 & 1/2 \\ 1/3 & -1 & 2/3 \\ 1/4 & 3/4 & -1 \end{bmatrix}.$$

Approximate with three-place accuracy the Markov transition function $P(t) = [p_{i,j}(t)]$ when $t = 2$, and determine $\pi_j = \lim_{t \to \infty} p_{i,j}(t)$, $1 \le j \le 3$.

5. Approximate with three-place accuracy the limiting distribution of the transition function $P(t) = [p_{i,j}(t)]$ corresponding to the q-matrix

$$Q = \begin{bmatrix} -1 & .10 & .25 & .45 & .20 \\ .05 & -1 & .15 & .65 & .15 \\ .10 & .10 & -1 & .50 & .30 \\ .25 & .25 & .25 & -1 & .25 \\ .50 & .10 & .20 & .20 & -1 \end{bmatrix}$$

SUPPLEMENTAL READING LIST

S. Karlin and H. M. Taylor (1981). A Second Course in Stochastic Processes. New York: Academic Press.

SOLUTIONS TO EXERCISES

CHAPTER 1

Exercises 1.2

1. $P(A) = 3/8$.

2. $|\Omega| = 2^n$.

3. $P(A) = 3/8$.

4. $P(A) = 1/4$.

5. 21 configurations.

6. Let $P_1(n)$ be the statement that $1 + 2 + \cdots + n = n(n+1)/2$. Since $1 = 1(1+1)/2$, $P_1(1)$ is true. Assume $P_1(n)$ is true. Then

$$1 + 2 + \cdots + n + (n+1) = \frac{n(n+1)}{2} + (n+1) = \frac{(n+1)(n+2)}{2},$$

which is just the statement $P_1(n+1)$. Therefore, $P_1(n+1)$ is true whenever $P_1(n)$ is true, and $P_1(n)$ is true for all $n \ge 1$ by the principle of mathematical induction. Now let $P_2(n)$ be the statement $1^2 + 2^2 + \cdots + n^2 = n(n+1)(2n+1)/6$. Since $1^2 = 1(1+1)(2+1)/6$, $P_2(1)$ is true. Assume $P_2(n)$ is true. Then

$$\begin{aligned} 1^2 + 2^2 + \cdots + n^2 + (n+1)^2 &= \frac{n(n+1)(2n+1)}{6} + (n+1)^2 \\ &= (n+1)\left( \frac{2n^2 + n}{6} + n + 1 \right) \\ &= (n+1)\left( \frac{2n^2 + n + 6n + 6}{6} \right) \\ &= \frac{(n+1)(n+2)(2n+3)}{6}, \end{aligned}$$

which is just the statement $P_2(n+1)$.
Therefore, P2(n + 1) is true whenever P2(n) is true, and P2(n) is true for all n > 1 by the principle of mathematical induction. Exercises 1.3 1. (a) is obvious, (b) Since the number of ways of selecting n individuals out of m for inclusion in a sample is the same as the number of ways of selecting tn — n to not be included, the two binomial coefficients are equal. 2. Express C(n — 1, r) and C(m — 1, r — 1) in terms of factorials, take the sum, and simplify the resulting equation. 3. (a)/(0) = 1 and fork n,/(fc)(0) = n(n — 1) X • • • X (n — k + 1) = («)jt,/(<r)(0)/k! = (n)k/kl = ( ” ); for k > n,/(fc)(0) = 0, which is K also equal to ( ^ ) since k > n. Therefore,456789 4. Thekthderivative off(t) at t = Ois a(a — 1) X • • • X (a — k + l) = (a)j.. 5. Let a = 1 and b = t in Equation 1.9, differentiate with respect to t, and set t = 1. Put a = 1 and b = t in Equation 1.9, differentiate twice with respect to t, and put t = 1. 6. 7. Let A be the collection of outcomes not having a 1. Then A is an ordered sample of size n with replacement from the population {2,3,4,5,6} of size 5. The number of such samples is 5". Thus, P(A) = 5"/6". 8. Let A be the collection of outcomes for which the number of heads and tails are equal. The total number of outcomes is 22n. To count the number of outcomes in A, select n out of the 2m positions in a label to be filled with H’s, which can be done in ( j ways, and fill the remaining positions with T’s. Thus, P(A) = (^M)/22n. 9. (a) ( 1 ). (b) Let A be the collection of outcomes for which boxes 1,2,..., m are empty. The n particles are then distributed among the 302 SOLUTIONS TO EXERCISES remaining boxes numbered n + 1,..., 2n in ( -x \ ' k ' (~x)k _ (~x)(~x - 1) X - ■ X (~x - k+ 1) k'. kl __ 1 = ( 1 11. .176197. 12. 53,130. ) ways. Thus, J * (% + fc ~ 1) X ••• X (x) kl nUx + fc"1)* ' kl Exercises 1.4 1. p(2) = 1/16, p(3) = 1/8, p(4) = 3/16, p(5) = 1/4, p(6) = 3/16, p(7) = 1/8, p(8) = 1/16. 2. 
p(3) = 1/64, p(4) = 3/64, p(5) = 6/64, p(6) = 10/64, p(7) = 12/64, p(8) = 12/64, p(9) = 10/64, p(10) = 6/64, p(ll) = 3/64, p(12) = 1/64. 3. p(3) = 1/216, p(4) = 3/216, p(5) = 6/216, p(6) = 10/216, p(7) = 15/216, p(8) = 21/216, p(9) = 25/216, p( 10) = 27/216, p(ll) = 27/216, p(12) = 25/216, p(13) = 21/216, p(14) = 15/216, p(15) = 10/216, p(I6) = 6/216, p(I7) = 3/216, p(I8) = 1/216. 4. l/(564) = 3.8719 X 10"8; or one chance in 25,827,165. 5. l/(48)6 « .11318 X IO"9. 6- (5°)(958°) / 1000 A 10 ' or, what is the same, / 10 \ / 990 \ 2 ' 48 ' / 1000\ 507 7. (12)4/124 = .573, accurate to three decimal places. SOLUTIONS TO EXERCISES 303 8- 13X12X( * )Q) ' (?) 9. 10 * /(?). ■ 10. f 100Wn-100X 5 7^ 95 7 v 100 7 One would guess 2000. Exercises 1.5 1. 8/7. 2. 1/(15- 163). 3. 6/11. 4. 5/12. 5. An outcome is an ordered sample of size 10 from a population of size 2 with replacement, (a)P(A!) = P(A2) = 29/210 = 1/2, P (A i A A2) = 28/210 = 1/4,P(A2 | Ai) = 1/2. (b)P(A! A A2) = P(A,)P(A2). (c) P(A2 | Ai) = P(A2). (d) If 10 is replaced by 20, there is no change in any of these probabilities. 6. 1/2. 7. 22/63. 8. From Figure 1.2, P(A A B) = 1/6, P(B) = 1/2, so that P(A | B) = 1/3. Since P(A) = 1/3, the partial information does not change the probability of A. What does random mean in this exercise? If all n keys are placed in a row and tried one at a time, then we are dealing with an ordered sample of size n without replacement from a population of size n, and there are n! outcomes. The number of outcomes with the good key in the rth place is (n — 1)!. The required probability is (m — l)!/n! = 1/n. 9. 10. (a) 0 p(k) =£ 1. (b) Since = lim 11-------—) = 1, n -»oo I n +1/ the p(k) can be used as weights in a probability model. 304 SOLUTIONS TO EXERCISES CHAPTER 2 Exercises 2.2 1. (a) Correct, (b) Incorrect since 2 is not a set. (c) Incorrect since {1,2,3} has no sets as members. (d)‘Correct. 2. 
The proposition is G O and i > j.” A = : (i,j) G ft and i > j} = {(2,1), (3,1), (4,1), (5,1), (6,1), (3,2), (4,2), (5,2), (6,2), (4,3), (5,3), (6,3), (5,4), (6,4), (6,5)}. 3. Since 21/n > 1, [0,1] C A„ for all n > 1. Thus, [0,1] C AA„. Since lim„_x21/" = 1, there is nox > 1 in AA„. Thus, AA„ = [0,1]. 4. Examine the graph of the equation y = xn. AA„ = {(x,y) : 0 l,y = 0}. 5. AA„ = {(x,y) : 0 6. If co G A, then co is not in Af; i.e., to G (Ac)c. If to G (Ac)c, then to is not in Ac; i.e., co GA. Thus, (Ac)c C A and A = (Ac)c. 7. Assume A C B. If co G Bc, then co cannot be in A since if it were it would be in B also, which is impossible. Thus, co G A and Bc C Ac. Assume now that Bc C Ac. If co G A, then co £ Bc since if it were then it would be in Ac, which is impossible; thus, co G (Bc)c = B, and so A C B. 8. Not true in general. Let U = {1,2,3,4}, X = {1,2,3}, Y = {3,4}, Z = {2,3,4}. Then Y U Z = {2,3,4},X A (Y U Z) = {2,3}, whereas XAT = {3}, (XAK)UZ = {2,3,4}. Clearly,XA(KUZ) (XAK)UZ. 9. By de Morgan’s laws, the distributive laws, and the facts that X A Xc = 0, Y A Yc = 0, (X U K) A (X A K)c = (X U Y) A (Xf U Tc) = ((x u r) a xc) u ((x u y) a yc) = (x a xc) u (y a xc) u (X a yc) u (y a yc) = (y a xc) u (x a yf). x < l,y = 0} U {(x,y) : x = 1,0 x < y S 1}. 10. If n is odd, A„ = [0,1]; if n is even, An = [0,0] = {0}. For each « 1. (\a* „A = {0}>U:=1(nfca„Afc) = u:=1{o} = {O}. Also, u^„Afc = [0,i],n:=1(ufca„Afc) = [0,1], Exercises 2.3 1. The domain is [ — 1,1], The range is [0,1]. 2- f = {(x,y) : x G R,y G R,y — >/l — x4, — 1 < x < 1}. The domain is [ — 1,1], The range is [0,1]. 3- f ~ {(x,y) : x G R,y G R,y = 1/ y/1 — x2, — 1 < x < 1}. The domain is (— 1,1), and the range is [ 1, °°). 4. Define a : N -> X by putting a(p) = p/q. Then X is the range of a, and X is countable. 5. Fori = 1,2,..., m, let X, = {xj^x/z,...}. Since each X, is countably infinite, there is a mapping a, : N -> Xj having X,- as its range. 
Consider SOLUTIONS TO EXERCISES 305 the array *11 *21 *31 *12 *22 *32 *13 *23 *33 *ml *m2 *m3 • • • ••• The elements of this array can be arranged in a sequence {«(»)};., by going down the first column, then down the second column, and so forth. To identify a(n) with some a,- (m), let p be the number of columns to the left of the nth element a(n). The n — pm element of the (p + 1 )st column is then the nth element a(n). More precisely, each n G N has a unique representation n = pm + q where p G N U {0} and q G {1,2,..., m}. Letting X = UX,, define a : N —> X as follows. If n = pm + q as above, put «(«) = < *q(p + I). Then X is the range of a. 6. Assume X is countable; i.e., X is the range of an infinite sequence *{ n}"=1 where each xn is an infinite sequence of 0’s and l’s. Let xn = *{n,L:} “=1. Consider the infinite sequence y = *{y J “=| where yk = 1 — Xk,k. Since yn Xn.n for each n S l,y xn for all n S: 1. This is a contradiction since y G X but is not in the range of {x„}"=|. Therefore, X is not countable. 7. N X N = U“_2Ak. Since each Ak is countable, N X N is countable by Theorem 2.3.1. 8. We can assume that A and B are ranges of infinite sequences {ani}“ = j and {b„}"=1, respectively. Then A XB = U"=2{(am, bn) : m + n = k}. 9. All three are countable. 10. Suppose Xi = *i2> {ii» • • •}>1 — L Fix n G N and consider the set A = {m : m G N, ((m + l)(m + 2))/2 S: n}. A C N. Since (n + l)(n + 2) 3n — — M, 2------ 2 n G A and therefore A / 0. It follows from the well-ordering property that A has a least element m (m ) for which (mj(m) + 1)(mj(m) + 2) 2 306 SOLUTIONS TO EXERCISES Since m(n) is the smallest integer with this property, tn(n) - 1 does not have this property, and so + 1) “ 2 " and tn(n') is the largest element of N with this property. Now define /(n) so that m(n)(m(n) + 1) , n = ---------------------+ I («) /(«) ~ m(n) + 1. Define a : N —> X = U”=1X„ by where 1 putting «(«) = Xm(„)+2-/(n ),/(«)• Exercises 2.4 1. Consider the statement ft P(n) : A\,... 
, $A_n \in \mathcal{A}$ implies $\bigcup_{j=1}^{n} A_j \in \mathcal{A}$. When $n = 1$, the statement simply says that $A_1 \in \mathcal{A}$ implies $A_1 \in \mathcal{A}$, which is trivially true. Assume that $P(n)$ is true and consider $P(n+1)$. Let $A_1, \ldots, A_n, A_{n+1} \in \mathcal{A}$. By the induction hypothesis that $P(n)$ is true, $A_1 \cup \cdots \cup A_n \in \mathcal{A}$. Since $\mathcal{A}$ is an algebra, $A_1 \cup \cdots \cup A_n \cup A_{n+1} = (A_1 \cup \cdots \cup A_n) \cup A_{n+1} \in \mathcal{A}$. Therefore, $P(n+1)$ is true. It follows from the principle of mathematical induction that $P(n)$ is true for all positive integers $n$.

2. Suppose $A_1, \ldots, A_n \in \mathcal{A}$. Since $\mathcal{A}$ is closed under complementation, $A_1^c, \ldots, A_n^c \in \mathcal{A}$. Since $\mathcal{A}$ is closed under finite unions by Problem 1, $\bigcup_{j=1}^{n} A_j^c \in \mathcal{A}$. Since $\mathcal{A}$ is closed under complementation, $\left(\bigcup_{j=1}^{n} A_j^c\right)^c \in \mathcal{A}$. By de Morgan's laws, $\bigcap_{j=1}^{n} A_j = \left(\bigcup_{j=1}^{n} A_j^c\right)^c \in \mathcal{A}$.

3. Let $\{A_j\}$ be a finite or infinite sequence in the $\sigma$-algebra $\mathcal{F}$. If finite, the intersection is in $\mathcal{F}$ by Problem 2 since $\mathcal{F}$ is an algebra. Assume $\{A_j\}$ is an infinite sequence. Since $\mathcal{F}$ is closed under complementation, the sequence $\{A_j^c\}$ is in $\mathcal{F}$ and $\bigcup A_j^c$ is in $\mathcal{F}$. Thus, $\bigcap A_j = \left(\bigcup A_j^c\right)^c \in \mathcal{F}$.

4. $\Omega \in \mathcal{F}$ since $\Omega^c = \emptyset$ is countable. Suppose $A \in \mathcal{F}$. If $A$ is countable, then the complement of $A^c$ is countable and $A^c \in \mathcal{F}$; if $A$ is not countable, then $A^c$ is countable and $A^c \in \mathcal{F}$. In either case, $A^c \in \mathcal{F}$. Let $\{A_j\}$ be a sequence in $\mathcal{F}$. If every $A_j$ is countable, then $\bigcup A_j$ is countable and belongs to $\mathcal{F}$; if some $A_{j_0}$ is not countable, then $A_{j_0}^c$ is countable; $\left(\bigcup A_j\right)^c = \bigcap A_j^c$ is countable since it is a subset of the countable $A_{j_0}^c$, and so $\bigcup A_j \in \mathcal{F}$. In either case, $\bigcup A_j \in \mathcal{F}$.

5. For each $n \ge 1$, let $A_{2n}$ be the event "success occurs for the first time on trial numbered $2n$." Let $A = \bigcup_{n=1}^{\infty} A_{2n}$. Since the $A_{2n}$ are disjoint events, $P(A) = \sum_{n=1}^{\infty} P(A_{2n})$. But $P(A_{2n}) = q^{2n-1}p$ and

$$P(A) = \sum_{n=1}^{\infty} q^{2n-1} p = pq \sum_{n=1}^{\infty} (q^2)^{n-1} = \frac{pq}{1 - q^2} = \frac{q}{1 + q}.$$

6. $(5/36) + (31/36)^2 (5/36) + (31/36)^4 (5/36) + (31/36)^6 (5/36) + \cdots = .5373$.

7. Let $W_n$, $R_n$, and $B_n$ be the events that a white chip, a red chip, and a black chip are chosen on the $n$th drawing, respectively.
The event $A$ of interest can be decomposed into disjoint events according to the trial at which a white chip appears for the first time while being preceded by red chips; i.e., $A = W_1 \cup (R_1 \cap W_2) \cup (R_1 \cap R_2 \cap W_3) \cup \cdots$ and $P(A) = P(W_1) + P(R_1 \cap W_2) + P(R_1 \cap R_2 \cap W_3) + \cdots = w/(w+b)$.

8. Define the sequence $\{B_j\}_{j=1}^{\infty}$ by $B_1 = A_1$ and $B_j = A_j \cap A_{j-1}^c$ for $j \ge 2$. The $B_j$ are disjoint. Since each $B_j \subset A_j$, $j \ge 1$, $\bigcup B_j \subset \bigcup A_j$. Suppose $\omega \in \bigcup A_j$. Then there is a smallest $k \ge 1$ such that $\omega \in A_k$ and $\omega \notin \bigcup_{j=1}^{k-1} A_j$, so that $\omega \in A_k \cap A_{k-1}^c = B_k \subset \bigcup B_j$. Thus, $\bigcup A_j \subset \bigcup B_j$ and the two are equal.

9. Fix $n \ge 1$. Since $A = \bigcap_{j=1}^{\infty} A_j \subset A_n$, $A_n = A \cup (A_n \cap A^c)$. Let $B_n = A_n \cap A^c$. Since $A_{n+1} \subset A_n$, $B_{n+1} = A_{n+1} \cap A^c \subset A_n \cap A^c = B_n$ and $\{B_j\}_{j=1}^{\infty}$ is a decreasing sequence. Now $\bigcap_{j=1}^{\infty} B_j = \bigcap_{j=1}^{\infty} (A_j \cap A^c)$. It is easy to check that the last set is equal to $A^c \cap \bigcap_{j=1}^{\infty} A_j = A^c \cap A = \emptyset$.

Exercises 2.5

1. Let $A$ be the event "outer diameter of the sleeve $\omega$ is too large" and let $B$ be the event "inner diameter of the sleeve $\omega$ is too large." Then $P(A) = .05$ and $P(B) = .03$. Since we know nothing about $P(A \cap B)$, $P(A \cup B) \le P(A) + P(B) = .08$. Since $.05 = P(A) \le P(A \cup B)$ and $.03 = P(B) \le P(A \cup B)$, $.05 \le P(A \cup B) \le P(A) + P(B) = .08$.

2. Let $A$, $B$, and $C$ be three arbitrary events. Then, by Equation 2.11,

$$P(A \cup B \cup C) = P((A \cup B) \cup C) = P(A \cup B) + P(C) - P((A \cup B) \cap C).$$

But $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ and

$$\begin{aligned} P((A \cup B) \cap C) &= P((A \cap C) \cup (B \cap C)) \\ &= P(A \cap C) + P(B \cap C) - P((A \cap C) \cap (B \cap C)) \\ &= P(A \cap C) + P(B \cap C) - P(A \cap B \cap C). \end{aligned}$$

Thus,

$$P(A \cup B \cup C) = P(A) + P(B) - P(A \cap B) + P(C) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C).$$

3. 65/96.

4. (a) $P = 4 \cdot (1000/10^4) - 6 \cdot (100/10^4) + 4 \cdot (10/10^4) - (1/10^4) = .3439$. (b) $P = 1 - (9/10)^4 = .3439$.

5. The required probability is

$$\begin{aligned} P((A \cap B^c) \cup (B \cap A^c)) &= P(A \cap B^c) + P(B \cap A^c) - P((A \cap B^c) \cap (B \cap A^c)) \\ &= P(A) - P(A \cap B) + P(B) - P(A \cap B) \\ &= P(A) + P(B) - 2P(A \cap B). \end{aligned}$$

6. No.
The probability of getting at least one ace with the throw of four dice is $1 - (5/6)^4 \approx .51775$, whereas the probability of getting a double ace in 24 throws of a pair of dice is $1 - (35/36)^{24} \approx .49140$.

7. Suppose $\omega_0 = \{\delta_j\}_{j=1}^{\infty}$ where each $\delta_j$ is a 1 or a 0. For each $n \ge 1$, let $A_n = \{\omega : \omega = \{x_j\}_{j=1}^{\infty}, x_1 = \delta_1, \ldots, x_n = \delta_n\}$. Then $\{\omega_0\} = \bigcap_{j=1}^{\infty} A_j \subset A_n$ for all $n \ge 1$. Let $\alpha = \max(p, q) < 1$. Since $0 \le P(A_n) \le \alpha^n$ and $\lim_{n \to \infty} \alpha^n = 0$, $\lim_{n \to \infty} P(A_n) = 0$, and so $P(\{\omega_0\}) = 0$.

8. Since $A \cap B \subset A$ and $A \cap B \subset B$, $P(A \cap B) \le P(A)$ and $P(A \cap B) \le P(B)$, so that $P(A \cap B) \le \min(P(A), P(B))$. On the other hand, $P(A \cap B) = P(A) + P(B) - P(A \cup B) \ge P(A) + P(B) - 1$ since $P(A \cup B) \le 1$.

9. The inequality is trivially true when $n = 1$. Assume the inequality is true for $n$. By Problem 8,

$$\begin{aligned} P(A_1 \cap \cdots \cap A_n \cap A_{n+1}) &\ge P(A_1 \cap \cdots \cap A_n) + P(A_{n+1}) - 1 \\ &\ge (P(A_1) + \cdots + P(A_n) - (n-1)) + P(A_{n+1}) - 1 \\ &= P(A_1) + \cdots + P(A_{n+1}) - n, \end{aligned}$$

and the inequality is true for $n+1$ whenever it is true for $n$. By the principle of mathematical induction, it is true for all $n \ge 1$.

Exercises 2.6

1. By mutual independence, the pairs $A$ and $B$, $A$ and $C$, $B$ and $C$ are independent. By Theorem 2.6.1, the pairs $A$ and $B^c$, $B^c$ and $C$ are independent. Thus, the pairs $A$ and $B^c$, $A$ and $C$, $B^c$ and $C$ are independent. We need only show that $P(A \cap B^c \cap C) = P(A)P(B^c)P(C)$. By Equation 2.8, $P(A \cap B^c \cap C) = P((A \cap C) \cap B^c) = P(A \cap C) - P(A \cap C \cap B) = P(A)P(C) - P(A)P(B)P(C) = P(A)P(C)(1 - P(B)) = P(A)P(B^c)P(C)$. Thus, $A$, $B^c$, and $C$ are mutually independent.

2. Let $A_1, \ldots, A_n$ be mutually independent. This means that if $1 \le i_1 < \cdots < i_k \le n$, then

$$P(A_{i_1} \cap \cdots \cap A_{i_k}) = \prod_{j=1}^{k} P(A_{i_j}).$$

If we can show that this equation remains valid whenever some $A_{i_j}$ is replaced by its complement, then this procedure can be repeated as often as necessary to arrive at the $B_i$. Just by interchanging the positions of the $A_{i_j}$, we can assume that we want to replace $A_{i_1}$ by its complement. By Equation 2.8,

$$\begin{aligned} P(A_{i_1}^c \cap A_{i_2} \cap \cdots \cap A_{i_k}) &= P(A_{i_2} \cap \cdots \cap A_{i_k}) - P(A_{i_1} \cap \cdots \cap A_{i_k}) \\ &= \prod_{j=2}^{k} P(A_{i_j}) - \prod_{j=1}^{k} P(A_{i_j}) \\ &= \left( \prod_{j=2}^{k} P(A_{i_j}) \right) (1 - P(A_{i_1})) \\ &= P(A_{i_1}^c) P(A_{i_2}) \times \cdots \times P(A_{i_k}). \end{aligned}$$
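The two solutions above (mutual independence is preserved when events are replaced by their complements) can be checked by brute force on a small sample space. The sketch below is only an illustration: the three-coin-toss sample space, the events `A`, `B`, `C`, `D`, and the helper names `prob`, `complement`, and `mutually_independent` are all assumptions introduced here, not objects from the exercises.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical sample space: three tosses of a fair coin, 8 equally likely outcomes.
omega = list(product("HT", repeat=3))


def prob(event):
    """Probability of an event (a set of outcomes) under equal weights."""
    return Fraction(len(event), len(omega))


def complement(event):
    """Complement of an event relative to omega."""
    return set(omega) - event


def mutually_independent(events):
    """Check the product rule for every subcollection of two or more events."""
    for k in range(2, len(events) + 1):
        for sub in combinations(events, k):
            intersection = set(omega)
            product_of_probs = Fraction(1)
            for e in sub:
                intersection &= e
                product_of_probs *= prob(e)
            if prob(intersection) != product_of_probs:
                return False
    return True


A = {w for w in omega if w[0] == "H"}   # first toss is heads
B = {w for w in omega if w[1] == "H"}   # second toss is heads
C = {w for w in omega if w[2] == "H"}   # third toss is heads
D = {w for w in omega if w[0] == w[1]}  # first two tosses agree

print(mutually_independent([A, B, C]))
print(mutually_independent([A, complement(B), C]))
print(mutually_independent([A, B, D]))
```

The last call is included as a contrast: in this toy example `A`, `B`, and `D` are pairwise independent, but the product rule fails for the triple, so the check distinguishes pairwise from mutual independence.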
3. (a) $P(A \cap B \cap C) = P(A \cap (B \cap C)) = P(A \mid B \cap C)P(B \cap C) = P(A \mid B \cap C)P(B \mid C)P(C)$. (b) $P(A_1 \cap \cdots \cap A_n) = P(A_n \mid A_1 \cap \cdots \cap A_{n-1})P(A_{n-1} \mid A_1 \cap \cdots \cap A_{n-2}) \times \cdots \times P(A_2 \mid A_1)P(A_1)$, provided the conditional probabilities are defined.

4. The only outcomes with positive probabilities are (11,12,13), (11,12,11), (11,10,11), (11,10,9), (9,10,11), (9,10,9), (9,8,9), (9,8,7). Let $R_i$ be the event "the $i$th ball selected is red," $i = 1, 2, 3$. Then

$$P(11,12,13) = P(R_1^c \cap R_2^c \cap R_3^c) = P(R_3^c \mid R_1^c \cap R_2^c)P(R_2^c \mid R_1^c)P(R_1^c) = \frac{9}{100}.$$

Similarly, $P(11,12,11) = 27/200$, $P(11,10,11) = 11/80$, $P(11,10,9) = 11/80$, $P(9,10,11) = 11/80$, $P(9,10,9) = 11/80$, $P(9,8,9) = 27/200$, $P(9,8,7) = 9/100$.

5. We are given $P(A) = P(B) = P(C) = 1/2$, $P(A \cap B) = P(A \cap C) = P(B \cap C) = P(A \cap B \cap C) = 1/4$. (a) Since $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(B \cap C) = P(B)P(C)$, $A$, $B$, and $C$ are pairwise independent. (b) Since $P(A \cap B \cap C) = 1/4 \ne 1/8 = P(A)P(B)P(C)$, the three events are not mutually independent.

6. Let $A$ be the event "1 is sent" and let $B$ be the event "1 is received." By Bayes' rule, (a) $P(A \mid B) = .9896$. (b) $P(A^c \mid B^c) = .9519$.

7. For $i = 1, 2, 3$, let $A_i$ be the event "the $i$th digit sent is 1" and let $B_i$ be the event "the $i$th digit received is 1." By independence and the previous problem,

$$\begin{aligned} P(A_1 \cap A_2 \cap A_3 \mid B_1 \cap B_2^c \cap B_3) &= \frac{P((A_1 \cap B_1) \cap (A_2 \cap B_2^c) \cap (A_3 \cap B_3))}{P(B_1 \cap B_2^c \cap B_3)} \\ &= \frac{P(A_1 \cap B_1)P(A_2 \cap B_2^c)P(A_3 \cap B_3)}{P(B_1)P(B_2^c)P(B_3)} \\ &= P(A_1 \mid B_1)P(A_2 \mid B_2^c)P(A_3 \mid B_3) \\ &= (.9896)^2(.0481) = .0471. \end{aligned}$$

8. Let $C_i$ be the event "Chest $i$ is selected" and let $G$ be the event "Gold coin is observed." The given facts are $P(C_1) = P(C_2) = P(C_3) = 1/3$, $P(G \mid C_1) = 1$, $P(G \mid C_2) = 1/2$, $P(G \mid C_3) = 0$. Given that the observed coin is gold, the only way that the other coin can be gold is for the outcome to be in Chest 1, which means that we are required to find $P(C_1 \mid G)$. Apply Bayes' rule. $P(C_1 \mid G) = 2/3$.

9.
P((AUB)A(CAD)) = P((AACAD)U(B ACAD)) = P(AACAD) + P(BACAD)-P(AAB ACAD) = P(A)P(C)P(D)+P(B)P(C)P(D)P(A)P(B')P(C)P(D) = P(C)P(D){P(A)+P(B)-P(A)P(B)} = P(CA D){P(A) + P(B) -P(A AB)} = P(C n D)P(A U B). 10. Since the events A A Aj,j S: 1, are disjoint, P (A A (UA; )) = P(U(A A A;)) = EP(A AA;) = SP(A)P(A;) = P(A)P(UAy). 11. (a) .9007. (b) .9021. Not much is gained by adding B3. 12. Three of the B, and two of the Cj at a cost of $460. Exercises 2.7 1. 28,319/44,800 « .6321. 2. (a) For equalization to occur on the 2 nth trial, there must be n heads and n tails, (b) Using the fact thatp(l — p) < 1/4 and the limit ratio test, the infinite series =s(2")p”<i-p)” n=1 n=1 converges, and so P(A2ni.o.) = 0. 3. Let Bn,r be the event consisting of those outcomes for which there is a run of length r beginning on the nth trial. Since P(B„,r) = l/2r+1, the events Br+2,r > B2r+4,r.B»(r+2),r.... are independent events, and 00 CO 1 ^P(Bn[r+2),r) = 22 tTh = +°°’ n=l n=l P(BM(r+2),r j.o.) = 1. Since (B„(r+2),r j.o.) C (B„,r».o.) C (A„,r i.o.), P(A„,ri.O.) = 1. 4. Since CO CO 1 co J ^PCAn.rJ = 22 2(l+8)log2n “ 2E2 n l+S < °°’ n=1 5. n=1 n=1 P(A„,r„ i.O.) = 0. For ft, take all sequences w = {x;}"_, where each x, G {2,3,..., 12}. For n > 1,1 < i| < »2 < • • • < > and 5b ..., G {2,3,..., 12}, let A,-... in(8i,...,8„) = {w: w = {x,}“=1,x,- = 8i,...,xin = 8„} 312 SOLUTIONS TO EXERCISES and let S' be the smallest <r-algebra containing all such sets. Define P(Ait... i„(8b..., 8„)) = p(A;i(81)n---nA;„(S„)) n ;=i n = np<5<) J=1 where p is the weight function given in Figure 1.3. 6. E”=i (5/36) (25/36)""1 (5/36) = 25/(11 • 36). 7. 6/36 +2/36+ 2{ 1/36+ 2/45 +25/(11)(36)} « .4929. 8. The probability of losing is 1/36 + 2/36 + 1/36 + 2{1/18 + 1/15 + 5/66} .507. There is probability 1 that the game will terminate, since the sum of the probabilities of winning and losing is 1. 9. .63210558. 10. Let T be the number of purchases required to obtain a complete set of collectibles. 
Then P(T > m) =2(“1)r"1(,)(-3 + -7(1“ r\\" 8// ’ r=1 Since P(T > 55) > .05, P(T > 56) < .05, 56 purchases are required. CHAPTER 3 Exercises 3.2 1. .4050. 2. 47. 3. .2202. 4. fxW = fr(y) = 2(-x+n + l) h(h + 1) 0 2(—y+n-t-1) n(n + l) 0 ifx = 1, 2,... n otherwise. if y = 1,2,... n otherwise. 313 SOLUTIONS TO EXERCISES 5. Let z G {1,2,..., 6}. By Equation 3.6, 6. The recursion formula is kq The ratio is WiMO b(k — 1; n,p) = j+ k = li2....... kq Let tn = [(« + 1 )p], the greatest integer less than or equal to (n + l)p. If (n + l)p is not an integer, the b(k;n,p) are strictly increasing for k tn and then are strictly decreasing for k tn, so that the maximum value is b(m-t n,p). If (n + l)p is an integer, then both b(nr, n,p) andb(tn — 1; n,p) are maximum values. 7. 8. 9. fx is a Poisson density with parameter a, and fy is a Poisson density with parameter /?. fxM = cf(xi) where c = Sj°=ig(y;)» and/y(y; ) = dgty^ where d = fxM = -25 for x = 1,2,3,4. /y(l) = .12,/y(2) = .27,/y(3) = .20,fY(4) = .25,/y(5) = .16. P(Y > X) = .70 10. Think of O as points (i, j) in the plane with integer coordinates 1 i,j < 50, except that points on the diagonal i = j are not included. Points on the two diagonals adjacent to the diagonal are not included in the event. a2) = 2^= '9S11. .9341. 12. 372. Exercises 3.3 1. For all real t, (1 + t)a+b = (I + t)a(1 + t)b. By the generalized binomial theorem, z=0 x=0 y=0 7 314 SOLUTIONS TO EXERCISES X = 2>tz z =0 where * = £(“)( b )• VX ' VZ —X ' x=0 If two power series agree on an open interval about 0, then their coefficients must be equal. Thus, x=0 2. Forz = 0,1,2,..., 2 fzw = X (7 V<>'C7 * X )p,(-«)!” x =0 =r'(-,)-±(7)(27x) x=0 Thus, Z has a negative binomial density with parameters r + $ and p. 3. 4. p(x > y) = ((m + i)/2«),p(x = y) = i/«. 
fx.Y(x,y} = P(X = x,Y = y) = P(X = x,Y = y,N = x + y) = P(X = x,Y = y\N = x + y'jPtN = x + y) = (Aprc-AP(A?r Xl Since X Aa yi N, X fx(x) = P(X = x) = ^P(X = x\N = n)P(N = n) fl =±(:w-^ n=x Xl '*• ^(n~x)i SOLUTIONS TO EXERCISES 315 x!I c c Similarly, P(T = y) = and X and Y are independent random Thus, /x,y(x,y) = /x (* )/?(/)» variables. 5. P(X > T) = 1/(2 —p), P(X = K) = p/(2 - p). 6. For z = 2,3,..., z z-1 /z(z) = ^fxMfy(y') = ^pqx~lpqz~x~l = (z ~ Vp2qz~2. x=0 x=l z-1 if 2 7. /z(z) = 4 8. z < n -z+2n + l ---- rp----- if n < z 0 otherwise. 2n Suppose the ranges of X and Y are {xb x2,...} and {yb y2, • • ■}, respectively. Suppose <pi and ipj are in the ranges of <A(X) and ip(y), respectively. Let {x/p...,xla} be the set of values of X such that </>(x,m) = </>,- and let {y;i,... ,yjp} be the set of values of Y such that ip(y7n) = <Ay- Then (<A(X) = = ipj) = Um,n(X = xim,Y = yjn). Since the latter sets are disjoint, P(0(X) = <Ai><A(n = <Aj) = = X-Jp<r mtn = Ep<x “x<-> Ep<y = ».) \m /\n = P(Um(X = x,J)P(U„(y = yjn)) = P(<A(X) = <Ai)P(<A(T) = <A;)9. /z(D = /z(i3) = .ooo,;z(2) = ;Z(12) = .004,;z(3) = ;Z(11) = .018,/z(4) = /z(10) = .057,/z(5) = /z(9) = .122,/z(6) = /z(8) = . 189, fz(7) = .219. 316 SOLUTIONS TO EXERCISES Exercises 3.4 1. fx(t') = t(l ~ tn)/n(l — t). 2. SinceA(t) = 1 —(1 —12)1/2 = I- * =o X ( ) ( —= 0> a2j+i = ) for all j > 1. 0 for all j > 0,and«2j = ( —1)>+1( 3. (a) Geometric density with p = 1/3. (b) Poisson density with A = 1/4. (c) Negative binomial density with r = 5 and p = 1/8. 4. X =Xj+---+XW whereX[,X2, ••• is an infinite sequence of independent random variables with/xj (0 = (1/2) + (l/2)f, N has generating function fx(t} = (t/6)((l — t6)/(l — t))» and the random variables N,Xi,X2, ... are independent. Thus, . > ,> , x .5 + .5t 1 - (.5 + .5t)6 fx(t} - fN(fxSt) “ 6 x 1_(5+5r)- 5. 
$f(t)$ is the generating function of the sum of a random number of random variables, $S_N = X_1 + \cdots + X_N$, where the $X_j$'s are Bernoulli random variables with $p = 1/3$ and $N$ has a uniform density on $\{1, 2, \ldots, 6\}$. This could arise from tossing a die to determine how many times a basic S or F trial with $p = 1/3$ should be performed.

6. $f_X(2x) = (2^x e^{-2})/x!$ and $f_X(2x + 1) = 0$ for $x = 0, 1, \ldots$.

7. Letting $m = [n/2]$, $m$ is the largest integer such that $2m \le n$. Write $E_n = (\sum_{j=1}^n X_j \in \{0, 2, \ldots, 2m\})$. Stratifying $E_n$ using the values of $X_1$,
$$E_n = \Bigl(X_1 = 0, \sum_{j=1}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr) \cup \Bigl(X_1 = 1, \sum_{j=1}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr)$$
$$= \Bigl(X_1 = 0, \sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr) \cup \Bigl(X_1 = 1, 1 + \sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr),$$
and so
$$p_n = P(E_n) = P\Bigl(X_1 = 0, \sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr) + P\Bigl(X_1 = 1, 1 + \sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr).$$
Since $X_1$ and $\sum_{j=2}^n X_j$ are independent by Lemma 3.3.3,
$$p_n = q\,P\Bigl(\sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr) + p\Bigl(1 - P\Bigl(\sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}\Bigr)\Bigr).$$
Since $P(\sum_{j=2}^n X_j \in \{0, 2, \ldots, 2m\}) = P(\sum_{j=1}^{n-1} X_j \in \{0, 2, \ldots, 2m\}) = p_{n-1}$, $p_n = q p_{n-1} + p(1 - p_{n-1})$.

8. The difference equation is $q_n = (1/2)q_{n-1} + (1/4)q_{n-2}$, $n \ge 2$, subject to the initial conditions $q_0 = q_1 = 1$. The solution is
$$q_n = \frac{3 + \sqrt5}{2\sqrt5}\Bigl(\frac{1 + \sqrt5}{4}\Bigr)^n - \frac{3 - \sqrt5}{2\sqrt5}\Bigl(\frac{1 - \sqrt5}{4}\Bigr)^n.$$

9. .03262.

10. $f_Z(1) = 1/192$, $f_Z(2) = 1/32$, $f_Z(3) = 1/12$, $f_Z(4) = 13/96$, $f_Z(5) = 31/192$, $f_Z(6) = 1/6$, $f_Z(7) = 31/192$, $f_Z(8) = 13/96$, $f_Z(9) = 1/12$, $f_Z(10) = 1/32$, $f_Z(11) = 1/192$.

11. $f_X(0) = .00790$, $f_X(1) = .04972$, $f_X(2) = .13998$, $f_X(3) = .23204$, $f_X(4) = .25083$, $f_X(5) = .18474$, $f_X(6) = .09390$, $f_X(7) = .03252$, $f_X(8) = .00735$, $f_X(9) = .00098$, $f_X(10) = .00006$.

Exercises 3.5

1. Add the right sides of Equations 3.14 and 3.16 to verify that $p_x + q_x = 1$ in the $p \ne q$ case and the right sides of Equations 3.15 and 3.17 to verify that $p_x + q_x = 1$ in the $p = q = 1/2$ case.

2. $\lim_{a \to \infty} q_x = 1$ is the probability of eventual ruin against an infinitely rich adversary in the unfair situation $q > p$.

3. If $Y_2 + \cdots + Y_j = y$ for some $j \ge 2$, then there is a smallest integer for which this is true; i.e.,
$$(Y_2 + \cdots + Y_j = y \text{ for some } j \ge 2) = (Y_2 = y) \cup \bigcup_{k=0}^{\infty} (Y_2 \ne y, \ldots, Y_2 + \cdots + Y_{2+k} \ne y, Y_2 + \cdots + Y_{3+k} = y).$$
Since the events on the right side are disjoint and each is independent of $(Y_1 = 1)$, the events $(Y_1 = 1)$ and $(Y_2 + \cdots + Y_j = y \text{ for some } j \ge 2)$ are independent (see Exercise 2.6.9).

4. The difference equation for the probability of ruin $q_x$ is
$$q_x = \alpha q_{x+1} + \beta q_x + \gamma q_{x-1}, \quad 1 \le x \le a - 1,$$
subject to the boundary conditions $q_0 = 1$, $q_a = 0$. The equation is the same as the difference equation
$$q_x = \frac{\alpha}{1 - \beta} q_{x+1} + \frac{\gamma}{1 - \beta} q_{x-1}.$$
The solution to this equation is obtained by replacing $p$ by $\alpha/(1 - \beta)$ and $q$ by $\gamma/(1 - \beta)$ in Equations 3.14 and 3.15 to obtain
$$q_x = \frac{(\gamma/\alpha)^a - (\gamma/\alpha)^x}{(\gamma/\alpha)^a - 1}, \quad 1 \le x \le a - 1,$$
in the $\alpha \ne \gamma$ case and
$$q_x = 1 - \frac{x}{a}, \quad 1 \le x \le a - 1,$$
in the $\alpha = \gamma$ case.

CHAPTER 4

Exercises 4.2

1. $E[X^2] = (1 + q)/p^2$.

2. $E[X^2] = \lambda^2 + \lambda$.

3. $E[X] = 2$.

4. $E[1/(X + 1)] = (1 - e^{-\lambda})/\lambda$.

5. $E[1/(X + 1)] = p/(q(r - 1))$.

6. $E[X] = \hat g_X(1) = \sum_{x=0}^\infty g_X(x) = \sum_{x=0}^\infty P(X > x) = \sum_{x=0}^\infty P(X \ge x + 1) = \sum_{x=1}^\infty P(X \ge x)$.

7. The generating function of $S_N$ is given by $\hat f_{S_N}(t) = \hat f_N(\hat f_{X_1}(t))$ and $E[S_N] = \hat f'_{S_N}(1)$. Since $\hat f'_{S_N}(t) = \hat f'_N(\hat f_{X_1}(t))\hat f'_{X_1}(t)$, $E[N] = \hat f'_N(1)$, $E[X_1] = \hat f'_{X_1}(1)$, and $\hat f_{X_1}(1) = 1$,
$$E[S_N] = \hat f'_N(\hat f_{X_1}(1))\hat f'_{X_1}(1) = \hat f'_N(1)\hat f'_{X_1}(1) = E[N]E[X_1].$$

8. Let $T = n$ if the 10-digit combination occurs on the $n$th trial for the first time. Then [...] and $E[T] = \hat g_T(1) = 2046$.

9. $P(T \le 11) = 1 - P(T > 11) = 1 - g_T(11) = 279/1024$.

Exercises 4.3

1. $E[XZ] = 154/9$.

2. $E[U] = 91/36$.

3. $\operatorname{var} X = q/p^2$.

4. $\operatorname{var} X = 32$.

5. $\operatorname{var} X = r(q/p^2)$.

6. $d = 377$.

7. $n = 750$.

8. $E[V] = v_1 r(n_1/n) + \cdots + v_s r(n_s/n)$.

9. 106.5 pounds.

10. $\operatorname{var} X_i = r(n_i/n)((n - r)(n - n_i)/n(n - 1))$.

11. Let $\{x_1, x_2, \ldots\}$ be the range of $X$. By hypothesis, $\sum_j (x_j - \mu)^2 f_X(x_j) = 0$. For any $j$ with $x_j - \mu \ne 0$, $f_X(x_j) = 0$. Since $\sum_j f_X(x_j) = 1$, there is some $x_i$ in the range of $X$ with $f_X(x_i) > 0$ that cannot be different from $\mu$. Therefore, $x_i = \mu$ and $f_X(\mu) = 1$; i.e., $P(X = \mu) = 1$.

12. $\operatorname{var} X = 2\hat g'(1) + \hat g(1) - [\hat g(1)]^2$.

13. $\sigma_T = 27.09$.

Exercises 4.4

1. $\mu_X = .5$, $\sigma_X^2 = 1.05$, $\mu_Y = 2.6$, $\sigma_Y^2 = .94$, $E[XY] = 1.7$, $\rho(X, Y) \approx .40$.

2. $\operatorname{var}(X + 2Y) = 6 + 3$[...].

3. Note that $X + Y = 2$. Since $\operatorname{cov}(X, 2) = 0$ and $\operatorname{cov}(X, X) = \sigma_X^2$, $\operatorname{cov}(X, Y) = \operatorname{cov}(X, 2 - X) = \operatorname{cov}(X, 2) - \operatorname{cov}(X, X) = -\sigma_X^2$. Since $\sigma_{2-X} = \sigma_X$, $\rho(X, Y) = -\sigma_X^2/(\sigma_X \sigma_{2-X}) = -1$.

4. $\rho(X_1 + 2X_2 - X_3, 3X_1 - X_2 + X_3) = 5/(2\sqrt{170})$.

5. $\rho(X, Y) = -1/2$.

6. (a) $E[X_i] = (r - 1)^n/r^n = (1 - (1/r))^n$.
(b) $E[X_i X_j] = (r - 2)^n/r^n = (1 - (2/r))^n$.
(c) $E[S_r] = r((r - 1)^n/r^n) = r(1 - (1/r))^n$.
(d) Since $X_i^2 = X_i$, $\operatorname{var} X_i = (1 - (1/r))^n - (1 - (1/r))^{2n}$. For $i \ne j$, $\operatorname{cov}(X_i, X_j) = (1 - (2/r))^n - (1 - (1/r))^{2n}$. Thus,
$$\operatorname{var} S_r = \sum_{i=1}^r \operatorname{var} X_i + 2\sum_{1 \le i < j \le r} \operatorname{cov}(X_i, X_j).$$

7. (a) $E[I_{i,k} I_{j,k}] = 0$ for $i \ne j$.
(b) $E[I_{i,k} I_{j,\ell}] = p_i p_j$ for $k \ne \ell$.
(c) $E[Y_i] = np_i$ and $E[Y_i Y_j] = n(n - 1)p_i p_j$ for $i \ne j$.
(d) $\operatorname{var} Y_i = np_i(1 - p_i)$.
(e) $\operatorname{cov}(Y_i, Y_j) = -np_i p_j$ and $\rho(Y_i, Y_j) = -\sqrt{p_i p_j/((1 - p_i)(1 - p_j))}$ for $i \ne j$.

8. Suppose the range of $X$ is $\{a, b\}$ and the range of $Y$ is $\{c, d\}$. If $E$ is any event, let $I_E$ be the indicator function of $E$; i.e., $I_E(\omega) = 1$ or $0$ according to whether $\omega$ is in $E$ or not. Let $A = \{\omega : X(\omega) = a\}$, $C = \{\omega : Y(\omega) = c\}$. Then $X = aI_A + bI_{A^c}$, $Y = cI_C + dI_{C^c}$, $XY = acI_{A \cap C} + adI_{A \cap C^c} + bcI_{A^c \cap C} + bdI_{A^c \cap C^c}$. Thus,
$$E[XY] = (a - b)(c - d)P(A \cap C) + d(a - b)P(A) + b(c - d)P(C) + bd$$
and
$$E[X]E[Y] = (a - b)(c - d)P(A)P(C) + d(a - b)P(A) + b(c - d)P(C) + bd.$$
Since $\operatorname{cov}(X, Y) = E[XY] - E[X]E[Y] = 0$, $P(A \cap C) = P(A)P(C)$. Therefore, the pair $\{A, C\}$ are independent events, as are the pairs $\{A^c, C\}$, $\{A, C^c\}$, and $\{A^c, C^c\}$. Thus, $X$ and $Y$ are independent.

9. Minimize the function of two variables $g(a, b) = E[(Y - aX - b)^2]$. Then
$$a = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2}, \quad b = E[Y] - \frac{\operatorname{cov}(X, Y)}{\sigma_X^2}E[X].$$

Exercises 4.5

1. $E[Y] = 25.75$, $\operatorname{var} Y = 490.1875$.

2. $E[Y] = 25$, $\operatorname{var} Y = 500$.

3. Since $X_1 + X_2$ has a Poisson density $p(\cdot; \lambda_1 + \lambda_2)$ by Theorem 3.3.4, if $n$ is any positive integer,
$$P(X_1 = x \mid X_1 + X_2 = n) = \frac{P(X_1 = x, X_1 + X_2 = n)}{P(X_1 + X_2 = n)} = \frac{P(X_1 = x)P(X_2 = n - x)}{P(X_1 + X_2 = n)}$$
$$= \frac{p(x; \lambda_1)p(n - x; \lambda_2)}{p(n; \lambda_1 + \lambda_2)} = \binom{n}{x}\Bigl(\frac{\lambda_1}{\lambda_1 + \lambda_2}\Bigr)^x\Bigl(\frac{\lambda_2}{\lambda_1 + \lambda_2}\Bigr)^{n-x}.$$

4. By Theorem 3.3.4, $X + Y$ has a binomial density $b(\cdot; m + n, p)$. For $z = 0, \ldots, m + n$ and $x = 0, \ldots, z$,
$$f_{X|X+Y}(x \mid z) = \frac{P(X = x, X + Y = z)}{P(X + Y = z)} = \frac{P(X = x)P(Y = z - x)}{P(X + Y = z)} = \frac{b(x; m, p)b(z - x; n, p)}{b(z; m + n, p)} = \frac{\binom{m}{x}\binom{n}{z-x}}{\binom{m+n}{z}}.$$
By Equation 1.11,
$$E[X \mid X + Y = z] = m\frac{\binom{m+n-1}{z-1}}{\binom{m+n}{z}} = \frac{m}{m + n}z.$$

5. $E[X] = 50$.

6. Since $N$ and $X_n$ are independent, $f_{X_n|N}(x \mid n) = f_{X_n}(x)$ and
$$E[X_n \mid N = n] = \sum_x x f_{X_n|N}(x \mid n) = \sum_x x f_{X_n}(x) = E[X_n].$$

7. By Problem 6,
$$E[S_N] = \sum_{n=0}^\infty E[S_N \mid N = n]f_N(n) = \sum_{n=0}^\infty E[S_n]f_N(n) = \sum_{n=0}^\infty n\mu f_N(n) = \mu E[N].$$
Also,
$$E[S_N^2] = \sum_{n=0}^\infty E[S_N^2 \mid N = n]f_N(n) = \sum_{n=0}^\infty E[S_n^2]f_N(n) = \sum_{n=0}^\infty (\operatorname{var} S_n + (E[S_n])^2)f_N(n) = \sum_{n=0}^\infty (n\sigma^2 + n^2\mu^2)f_N(n) = \sigma^2 E[N] + \mu^2 E[N^2].$$
Therefore, $\operatorname{var} S_N = \sigma^2 E[N] + \mu^2 E[N^2] - \mu^2(E[N])^2 = \sigma^2 E[N] + \mu^2 \operatorname{var} N$.

8. Let $Z = c$. Then
$$f_{Z|X}(z \mid x) = \frac{P(Z = z, X = x)}{P(X = x)} = \begin{cases} 1 & \text{if } z = c \\ 0 & \text{if } z \ne c \end{cases}$$
whenever $f_X(x) > 0$. Therefore, $E[Z \mid X = x] = \sum_z z f_{Z|X}(z \mid x) = c$ whenever $f_X(x) > 0$.

9. By Inequality 4.8, $\phi(X)Y$ has finite expectation and $E[\phi(X)Y \mid X = x]$ is defined whenever $f_X(x) > 0$. Suppose $f_X(x) > 0$. Letting $Z = \phi(X)Y$, note that
$$E[Z \mid X = x] = \sum_z z f_{Z|X}(z \mid x) = \sum_{z \ne 0} z f_{Z|X}(z \mid x).$$
For $z \ne 0$, $f_{Z|X}(z \mid x) = P(\phi(X)Y = z, X = x)/P(X = x)$. Suppose $\phi(x) \ne 0$. Then
$$f_{Z|X}(z \mid x) = \frac{P(Y = z/\phi(x), X = x)}{P(X = x)} = f_{Y|X}\Bigl(\frac{z}{\phi(x)} \Bigm| x\Bigr)$$
and
$$E[Z \mid X = x] = \sum_{z \ne 0} z f_{Y|X}\Bigl(\frac{z}{\phi(x)} \Bigm| x\Bigr) = \phi(x)\sum_{z \ne 0} \frac{z}{\phi(x)} f_{Y|X}\Bigl(\frac{z}{\phi(x)} \Bigm| x\Bigr) = \phi(x)E[Y \mid X = x].$$
If $\phi(x) = 0$, then $f_{Z|X}(z \mid x) = P(0 = z, X = x)/P(X = x) = 0$ for $z \ne 0$, and
$$E[Z \mid X = x] = \sum_{z \ne 0} z f_{Z|X}(z \mid x) = 0 = \phi(x)E[Y \mid X = x]$$
whenever $f_X(x) > 0$.

10. $D_x$ will satisfy the difference equation
$$D_x = \alpha D_{x+1} + \beta D_x + \gamma D_{x-1} + 1, \quad 1 \le x \le a - 1,$$
and the boundary conditions $D_0 = 0$, $D_a = 0$. We can assume that $\beta < 1$, since otherwise neither gambler nor adversary ever wins or loses. Thus, $D_x$ satisfies
$$D_x = \frac{\alpha}{1 - \beta}D_{x+1} + \frac{\gamma}{1 - \beta}D_{x-1} + \frac{1}{1 - \beta}$$
subject to the boundary conditions $D_0 = 0$, $D_a = 0$. Letting $\tilde D_x = (1 - \beta)D_x$, $\tilde D_x$ satisfies the equation
$$\tilde D_x = \tilde p \tilde D_{x+1} + \tilde q \tilde D_{x-1} + 1$$
and the boundary conditions $\tilde D_0 = 0$, $\tilde D_a = 0$, where $\tilde p = \alpha/(1 - \beta)$, $\tilde q = \gamma/(1 - \beta)$.
Thus, $\tilde D_x$ satisfies Equations 4.12 and 4.13 with $p$ and $q$ replaced by $\tilde p = \alpha/(1 - \beta)$ and $\tilde q = \gamma/(1 - \beta)$, respectively, where $\tilde D_x = (1 - \beta)D_x$. $\tilde D_x$ is therefore given by Equations 4.16 and 4.17, so that in the $\alpha \ne \gamma$ case,
$$D_x = \frac{1}{1 - \beta}\Bigl(\frac{x}{\tilde q - \tilde p} - \frac{a}{\tilde q - \tilde p}\cdot\frac{1 - (\gamma/\alpha)^x}{1 - (\gamma/\alpha)^a}\Bigr), \quad 1 \le x \le a - 1,$$
and in the $\alpha = \gamma$ case,
$$D_x = \frac{x(a - x)}{1 - \beta}, \quad 1 \le x \le a - 1.$$

Exercises 4.6

1. $H(X) = 15/8$ bits.

2. (a) $H(X) = -\sum_{x=1}^\infty (1/2)^x \log (1/2)^x = \sum_{x=1}^\infty x(1/2)^x = (1/2)\sum_{x=1}^\infty x(1/2)^{x-1} = 2$ (see Example 4.3).
(b) 2 bits.

3. 1 bit.

4. Use the equation $H(X, Y) = H(Y|X) + H(X)$. Since $H(X) = \log n$ and $H(Y \mid X = i) = \log i$, $i = 1, \ldots, n$, $H(Y|X) = \sum_{i=1}^n (1/n)\log i = (1/n)\sum_{i=1}^n \log i = (\log(n!))/n$, and $H(X, Y) = (\log(n!))/n + \log n$.

5. $H(X) \approx 3.2744$.

6. 5.7 bits.

7. 2 bits.

8. By Lemma 4.6.1,
$$H(X|Y) - H(X) = -\sum_{i,j} f_{X,Y}(x_i, y_j)\log f_{X|Y}(x_i \mid y_j) + \sum_{i,j} f_{X,Y}(x_i, y_j)\log f_X(x_i)$$
$$= \frac{1}{\ln 2}\sum_{i,j} f_{X,Y}(x_i, y_j)\ln\frac{f_X(x_i)}{f_{X|Y}(x_i \mid y_j)} \le \frac{1}{\ln 2}\sum_{i,j} f_{X,Y}(x_i, y_j)\Bigl(\frac{f_X(x_i)}{f_{X|Y}(x_i \mid y_j)} - 1\Bigr) \le 0.$$

9. $p_1 \approx .10307$, $p_2 \approx .12273$, $p_3 \approx .14615$, $p_4 \approx .17403$, $p_5 \approx .20724$, $p_6 \approx .24678$.

CHAPTER 5

Exercises 5.2

2. $n = 5$ and $v_1 = 4/59$, $v_2 = 8/59$, $v_3 = 24/59$, $v_4 = 23/59$.

3. $v_1 = q/(p + q)$, $v_2 = p/(p + q)$.

4. $n = 2$ and $v_1 = v_2 = v_3 = 1/3$.

5. The result is true for $n = 1$ by hypothesis. Assume that the result is true for $n$. Then
$$\sum_{k=1}^N \mu_k p_{k,j}(n + 1) = \sum_{k=1}^N \sum_{\ell=1}^N \mu_k p_{k,\ell}\,p_{\ell,j}(n) = \sum_{\ell=1}^N \Bigl(\sum_{k=1}^N \mu_k p_{k,\ell}\Bigr)p_{\ell,j}(n) = \sum_{\ell=1}^N \mu_\ell p_{\ell,j}(n) = \mu_j,$$
and the result is true for $n + 1$. By the principle of mathematical induction, the result is true for all $n \ge 1$.

6. First show that each $P(n) = [p_{i,j}(n)]$ is doubly stochastic as follows. The result is true for $n = 1$ by hypothesis. Assume that $P(n)$ is doubly stochastic. Then
$$\sum_{i=1}^N p_{i,j}(n + 1) = \sum_{i=1}^N \sum_{k=1}^N p_{i,k}\,p_{k,j}(n) = \sum_{k=1}^N p_{k,j}(n)\sum_{i=1}^N p_{i,k} = \sum_{k=1}^N p_{k,j}(n) = 1,$$
and the result is true for $n + 1$. By the principle of mathematical induction, $P(n)$ is doubly stochastic for every $n \ge 1$. Since $\sum_{i=1}^N p_{i,j}(n) = 1$, $n \ge 1$,
$$1 = \lim_{n \to \infty}\sum_{i=1}^N p_{i,j}(n) = \sum_{i=1}^N \lim_{n \to \infty} p_{i,j}(n) = \sum_{i=1}^N v_j = Nv_j,$$
and therefore $v_j = 1/N$, $j = 1, \ldots, N$.

7. Since $P$ is doubly stochastic, $v_j = 1/5$, $j = 1, \ldots, 5$, by the previous exercise.
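The conclusion of Problems 6 and 7, that a doubly stochastic transition matrix with an asymptotic distribution has $v_j = 1/N$, can be checked numerically. The sketch below is illustrative only: the 3 x 3 matrix `P` is a made-up example (it is not the matrix of any exercise), and `mat_mul` and `mat_pow` are hypothetical helper names introduced here.

```python
from fractions import Fraction as F

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(p, n):
    """Return the n-step transition matrix P(n) = P^n for n >= 1."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result

# Hypothetical doubly stochastic matrix: every row and column sums to 1.
P = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(1, 4), F(1, 4), F(1, 2)]]

P50 = mat_pow(P, 50)

# Column sums of P(n) stay exactly 1 (the induction step of Problem 6).
column_sums = [sum(P50[i][j] for i in range(3)) for j in range(3)]

# Largest deviation of any entry of P(50) from 1/N = 1/3.
max_error = max(abs(float(x) - 1 / 3) for row in P50 for x in row)
print(column_sums, max_error)
```

Exact rational arithmetic is used so that the column sums can be compared with 1 without rounding; only the final comparison with 1/3 is done in floating point.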
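8. By Problem 5, $\mu_j = \sum_{k=1}^N \mu_k p_{k,j}(n)$ for all $n \ge 1$. Letting $v_j = \lim_{n \to \infty} p_{i,j}(n)$, $j = 1, \ldots, N$,
$$\mu_j = \sum_{k=1}^N \mu_k v_j = v_j, \quad j = 1, \ldots, N.$$
Thus, $\{\mu_j\}_{j=1}^N$ is the asymptotic distribution.

9.
$$P(2n - 1) = P^{2n-1} = [p_{i,j}(2n - 1)] = \begin{bmatrix} 0 & 1/2 & 1/2 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \quad n \ge 1,$$
$$P(2n) = P^{2n} = [p_{i,j}(2n)] = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 1/2 \\ 0 & 1/2 & 1/2 \end{bmatrix}, \quad n \ge 1.$$
The asymptotic distribution is not defined since $\lim_{n \to \infty} p_{i,j}(n)$ does not exist.

10.
$$P = \begin{bmatrix} 0 & 1 & 0 & 0 & \cdots & 0 & 0 \\ \frac{1}{N^2} & \frac{2(N-1)}{N^2} & \frac{(N-1)^2}{N^2} & 0 & \cdots & 0 & 0 \\ 0 & \frac{2^2}{N^2} & \frac{2 \cdot 2(N-2)}{N^2} & \frac{(N-2)^2}{N^2} & \cdots & 0 & 0 \\ \vdots & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix}$$

11.
$$P(X_{m+n} = j \mid X_m = i) = \frac{1}{P(X_m = i)}\sum_{j_1, \ldots, j_{n-1}} P(X_m = i, X_{m+1} = j_1, \ldots, X_{m+n-1} = j_{n-1}, X_{m+n} = j).$$
By Equation 5.2,
$$P(X_{m+n} = j \mid X_m = i) = \sum_{i_0, \ldots, i_{m-1}}\ \sum_{j_1, \ldots, j_{n-1}} \pi(i_0)p_{i_0,i_1} \times \cdots \times p_{i_{m-1},i}\,p_{i,j_1} \times \cdots \times p_{j_{n-1},j} \times \frac{1}{P(X_m = i)} = \sum_{j_1, \ldots, j_{n-1}} p_{i,j_1} \times \cdots \times p_{j_{n-1},j},$$
and the latter does not depend upon $m$.

12. $v_1 = .2126372201$, $v_2 = .2339009422$, $v_3 = .2144524159$, $v_4 = .2157489844$, $v_5 = .1232604374$.

13. $v_1 = .1098019693$, $v_2 = .0365008905$, $v_3 = .1201612072$, $v_4 = .2061555430$, $v_5 = .2261706688$, $v_6 = .3012097212$.

Exercises 5.3

1.
$$\binom{2n}{n} = \frac{(2n)(2n - 1)(2n - 2) \cdots 3 \cdot 2 \cdot 1}{n!\,n!} = \frac{2^n(2n - 1)(2n - 3) \cdots 3 \cdot 1}{n!} = \frac{(-1)^n 2^{2n}(-1/2)(-3/2) \cdots ((1/2) - n)}{n!} = \frac{(-4)^n(-1/2)(-3/2) \cdots (-(1/2) - n + 1)}{n!}.$$

2. $p = 1/3$.

3. $q_x$ satisfies the difference equation
$$q_x = pq_{x+1} + qq_{x-1}, \quad 2 \le x \le a - 1,$$
subject to the boundary conditions $q_a = 0$, $q_0 - \delta q_1 = 1 - \delta$. If $p \ne q$,
$$q_x = \frac{(1 - \delta)((q/p)^x - (q/p)^a)}{(1 - \delta(q/p)) - (1 - \delta)(q/p)^a}.$$
If $p = q$,
$$q_x = \frac{1 - \delta}{a(1 - \delta) + \delta}(a - x).$$

4. $D_x$ satisfies the difference equation
$$D_x = pD_{x+1} + qD_{x-1} + 1, \quad 1 \le x \le a - 1,$$
subject to the boundary conditions $D_a = 0$, $\delta D_1 = D_0$. In the $p \ne q$ case, if we put $D = (1 - \delta)$[...], then $D_x = E[T_x] = A + B(q/p)^x + $ [...], where [...].

Exercises 5.4

1. $q = -2 + \sqrt5 \approx .236$.

2. $q_3 = .204325$.

3. $q = (1 - |\alpha - \beta|)/2\beta$.

4. Start with the equation $\hat f_{X_{j+1}}(s) = \hat f_{X_j}(\phi(s))$, differentiate twice, and set $s = 1$ to obtain
$$\hat f''_{X_{j+1}}(1) = \hat f'_{X_j}(1)\phi''(1) + \hat f''_{X_j}(1)(\phi'(1))^2.$$
Use the facts that $\phi'(1) = \mu$, $\phi''(1) = \operatorname{var} X_1 - \phi'(1) + (\phi'(1))^2 = \sigma^2 - \mu + \mu^2$, $\hat f'_{X_j}(1) = \mu^j$, and $\hat f''_{X_j}(1) = \operatorname{var} X_j - \hat f'_{X_j}(1) + (\hat f'_{X_j}(1))^2 = \operatorname{var} X_j - \mu^j + \mu^{2j}$ to obtain
$$\hat f''_{X_{j+1}}(1) = \mu^j(\sigma^2 - \mu + \mu^2) + \mu^2(\operatorname{var} X_j - \mu^j + \mu^{2j}).$$
Then
$$\operatorname{var} X_{j+1} = \hat f''_{X_{j+1}}(1) + \hat f'_{X_{j+1}}(1) - (\hat f'_{X_{j+1}}(1))^2 = \mu^2 \operatorname{var} X_j + \sigma^2\mu^j.$$

5. For $j \ge 1$, let $P(j)$ be the proposition $\operatorname{var} X_j = \sigma^2(\mu^{2j-2} + \mu^{2j-3} + \cdots + \mu^{j-1})$. Since $\operatorname{var} X_1 = \sigma^2$ and $P(1)$ is the proposition $\operatorname{var} X_1 = \sigma^2(\mu^0) = \sigma^2$, $P(1)$ is true. Assume $P(j - 1)$ is true. By Problem 4,
$$\operatorname{var} X_j = \mu^2 \operatorname{var} X_{j-1} + \mu^{j-1}\sigma^2 = \mu^2\sigma^2(\mu^{2j-4} + \mu^{2j-5} + \cdots + \mu^{j-2}) + \mu^{j-1}\sigma^2 = \sigma^2(\mu^{2j-2} + \mu^{2j-3} + \cdots + \mu^j + \mu^{j-1}),$$
and it follows that $P(j)$ is true. Therefore, $P(j)$ is true for all $j \ge 1$ by the principle of mathematical induction.

6. $q = .706420$.

7. $q \approx .203188$.

8. $q = .552719$.

9. $q_1 = .125$, $q_2 = .177979$, $q_3 = .204325$, $q_4 = .218344$, $q_5 = .226058$, $q_6 = .230379$, $q_7 = .232824$, $q_8 = .234214$, $q_9 = .235007$, $q_{10} = .235461$.

Exercises 5.5

1. $R(0) = 3/8$, $R(\pm 1) = 1/4$, $R(\pm 2) = 1/16$, $R(n) = 0$ otherwise. $X_n^* = (6/5)X_{n-1} - (9/10)X_{n-2} + (2/5)X_{n-3}$ and $\sigma_3^2 = 21/160 \approx .131$.

2. The independence of $X_n = (1/4)Y_n + (1/2)Y_{n-1} + (1/4)Y_{n-2}$ and $X_{n-4} = (1/4)Y_{n-4} + (1/2)Y_{n-5} + (1/4)Y_{n-6}$ would seem to indicate that nothing would be gained by including $X_{n-4}$ in $X_n^*$. There is, however, a reduction in the mean square error by including $X_{n-4}$: $\sigma_4^2 \approx .117$, and
$$X_n^* = \frac{4}{3}X_{n-1} - \frac{6}{5}X_{n-2} + \frac{4}{5}X_{n-3} - \frac{1}{3}X_{n-4}.$$

3. Letting $\sigma_X^2 = \operatorname{var} X_n$, $\sigma_X^2 = a^2\sigma_X^2 + \sigma^2$ or $\sigma_X^2(1 - a^2) = \sigma^2 > 0$. Thus, $a^2 < 1$, and so $|a| < 1$.

4. $a_1 = (\rho_1(1 - \rho_2))/(1 - \rho_1^2)$, $a_2 = (\rho_2 - \rho_1^2)/(1 - \rho_1^2)$.

5. Predicted value = 2.946.

CHAPTER 6

Exercises 6.2

1. $F(x) = \begin{cases} [\ldots] & \text{if } x < 0 \\ [\ldots] & \text{if } 0 \le x < 1 \\ [\ldots] & \text{if } x \ge 1. \end{cases}$

2. $F_X(x) = \begin{cases} [\ldots] & \text{if } x < 0 \\ [\ldots] & \text{if } 0 \le x < 1 \\ [\ldots] & \text{if } x \ge 1. \end{cases}$

3. 2/5.

4. 3/7.

5. $F(x) = \begin{cases} 0 & \text{if } x \le -1 \\ (1/2)(x + 1)^2 & \text{if } -1 < x \le 0 \\ 1 - (1/2)(1 - x)^2 & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}$

6. $c = 1/2$ and $P(-1 \le X \le 1) = 1 - 1/e$.

7. $F(x) = \begin{cases} 0 & \text{if } x < -1 \\ (1/2)(x + 1) & \text{if } -1 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}$

8. $F_Y(y) = \begin{cases} [\ldots] & \text{if } y < 0 \\ [\ldots] & \text{if } 0 \le y < 1 \\ [\ldots] & \text{if } y \ge 1. \end{cases}$

9. $F_Y(y) = \begin{cases} [\ldots] & \text{if } y < 0 \\ [\ldots] & \text{if } 0 \le y < 1 \\ [\ldots] & \text{if } y \ge 1. \end{cases}$

$[\ldots] = \begin{cases} 0 & \text{if } x < 0 \\ 2x & \text{if } 0 \le x < 1/2 \\ 1 & \text{if } 1/2 \le x < 5/4 \\ 0 & \text{if } x \ge 5/4. \end{cases}$

10. (a) $\Rightarrow$ (b). For any $x \in R$, $(X < x) = \bigcup_{r_j \in Q,\, r_j < x}(X \le r_j) \in \mathcal{F}$ since each $(X \le r_j) \in \mathcal{F}$ by (a).
(b) $\Rightarrow$ (c). For any $x \in R$, $(X \ge x) = (X < x)^c \in \mathcal{F}$ since $(X < x) \in \mathcal{F}$ by (b).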
(c) $\Rightarrow$ (d). For any $x \in R$, $(X > x) = \bigcup_{r_j \in Q,\, r_j > x}(X \ge r_j) \in \mathcal{F}$ since each $(X \ge r_j) \in \mathcal{F}$ by (c).
(d) $\Rightarrow$ (a). For any $x \in R$, $(X \le x) = (X > x)^c \in \mathcal{F}$ since $(X > x) \in \mathcal{F}$ by (d).

Exercises 6.3

1. $F'(x) = \begin{cases} 1/4 & \text{if } 1 < x < 2 \\ 1/2 & \text{if } 4 < x < 5 \\ 0 & \text{otherwise.} \end{cases}$

$\int_{-\infty}^x F'(t)\,dt = \begin{cases} 0 & \text{if } x < 1 \\ (1/4)(x - 1) & \text{if } 1 \le x < 2 \\ 1/4 & \text{if } 2 \le x < 4 \\ (1/4) + (1/2)(x - 4) & \text{if } 4 \le x < 5 \\ 3/4 & \text{if } x \ge 5. \end{cases}$

Since $F(5) = 1 \ne 3/4 = \int_{-\infty}^5 F'(t)\,dt$, $F'$ is not a density for $F$.

2. (a) $P(0 < X < 1) = 1/4$.
(b) $P(0 < X \le 1) = 1/2$.
(c) $P(X = 1) = 1/4$.
(d) $P(1/2 < X < 5/2) = 11/16$.

3. Let $G(y) = P(Y \le y) = P(\sin X \le y)$. If $y < -1$, $G(y) = 0$ and if $y \ge 1$, $G(y) = 1$. Suppose $-1 \le y < 1$. Then $G(y) = P(\sin X \le y) = P(-1 \le \sin X \le y) = P(\arcsin(-1) \le X \le \arcsin y) = F_X(\arcsin y) - F_X(-\pi/2)$. Since
$$F_X(x) = \begin{cases} 0 & \text{if } x < -\pi/2 \\ (1/\pi)(x + (\pi/2)) & \text{if } -\pi/2 \le x < \pi/2 \\ 1 & \text{if } x \ge \pi/2, \end{cases}$$
$$G(y) = \begin{cases} 0 & \text{if } y < -1 \\ (1/\pi)(\arcsin y + (\pi/2)) & \text{if } -1 \le y < 1 \\ 1 & \text{if } y \ge 1 \end{cases}$$
and
$$g(y) = \begin{cases} 1/(\pi\sqrt{1 - y^2}) & \text{if } -1 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

4. Since $Y$ takes on values in $[0, +\infty)$ with probability 1, $G(y) = P(Y \le y) = 0$ whenever $y < 0$ and $G(y) = P(Y \le y) = 1$ whenever $y \ge M$. Suppose $0 \le y < M$. Then $G(y) = P(\min(X, M) \le y) = 1 - P(\min(X, M) > y) = 1 - P(X > y, M > y) = 1 - P(X > y) = P(X \le y) = 1 - e^{-y}$. Thus,
$$G(y) = \begin{cases} 0 & \text{if } y < 0 \\ 1 - e^{-y} & \text{if } 0 \le y < M \\ 1 & \text{if } y \ge M. \end{cases}$$

5. Since the graph of $G$ has a jump at $M$, $G$ is not continuous and does not have a density function.

6. $f_Y(y) = \begin{cases} 0 & \text{if } y < 0 \\ 2ye^{-y^2} & \text{if } y \ge 0. \end{cases}$

7. $F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - (2/\pi)\arccos(x/100) & \text{if } 0 \le x < 100 \\ 1 & \text{if } x \ge 100. \end{cases}$

$f_X(x) = \begin{cases} (2/\pi)(1/\sqrt{100^2 - x^2}) & \text{if } 0 < x < 100 \\ 0 & \text{otherwise.} \end{cases}$

Exercises 6.4

1. $P(Y < X) = 1/2$.

2. $P(Y < X) = 3/5$.

3. $c = 8/\pi$, $P(X > 1/2) = (2/3) - (\sqrt3/4\pi)$.

4. $f_X(x) = \begin{cases} e^{-x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}$

5. $f_Z(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 - e^{-z} & \text{if } 0 \le z < 1 \\ e^{-z}(e - 1) & \text{if } z \ge 1. \end{cases}$

6. $f_Z(z) = \begin{cases} 6(e^{-2z} - e^{-3z}) & \text{if } z > 0 \\ 0 & \text{if } z \le 0. \end{cases}$

7. $f_Y(y) = \begin{cases} 1/(1 + y)^2 & \text{if } y > 0 \\ 0 & \text{if } y \le 0. \end{cases}$

8. $F_Z(z) = \begin{cases} 0 & \text{if } z < 0 \\ z^2/2 & \text{if } 0 \le z < 1 \\ -1 + 2z - (z^2/2) & \text{if } 1 \le z < 2 \\ 1 & \text{if } z \ge 2. \end{cases}$

$f_Z(z) = \begin{cases} z & \text{if } 0 < z < 1 \\ 2 - z & \text{if } 1 \le z < 2 \\ 0 & \text{otherwise.} \end{cases}$

Exercises 6.5

1. Since $X = \sigma Z + \mu$ where $Z$ has a standard normal density, $Y = aX + b = a\sigma Z + a\mu + b$, and $Y$ has a $n(a\mu + b, a^2\sigma^2)$ density.

2. Since $P(X < 100) = \Phi((100 - \mu)/\sigma) = .9938$ and $P(X < 60) = \Phi((60 - \mu)/\sigma) = .9332$, using the table of the normal distribution function,
$$\frac{100 - \mu}{\sigma} = 2.5, \quad \frac{60 - \mu}{\sigma} = 1.5.$$
Solving for $\mu$ and $\sigma$, $\mu = 0$, $\sigma = 40$.

3. According to Example 6.19, $X^2$ and $Y^2$ both have $\Gamma(1/2, 1/2)$ densities. By Theorem 6.5.2, $Z = X^2 + Y^2$ has a $\Gamma(1, 1/2)$ density.

4. 1/24. Integrate by transforming to polar coordinates.

5. By Equation 6.9, $X^2$ and $Y^2$ both have $\Gamma(1/2, 1/2\sigma^2)$ densities. By Theorem 6.5.2, $W = X^2 + Y^2$ has a $\Gamma(1, 1/2\sigma^2)$ density. For $z \le 0$, $F_Z(z) = 0$. For $z > 0$, $F_Z(z) = P(Z \le z) = P(\sqrt W \le z) = P(W \le z^2) = F_W(z^2)$. Thus, $f_Z(z) = 2zf_W(z^2)$, and
$$f_Z(z) = \begin{cases} 0 & \text{if } z \le 0 \\ (z/\sigma^2)e^{-z^2/2\sigma^2} & \text{if } z > 0. \end{cases}$$

6. The function $\phi(x) = (x - a)/(b - a)$ maps the interval $(a, b)$ onto the interval $(0, 1)$. Let $Y = \phi(X)$. Since $X$ takes on values between $a$ and $b$, $Y$ takes on values between 0 and 1. Thus, $f_Y(y) = 0$ if $y \le 0$ or $y \ge 1$. If $0 < y < 1$, then $F_Y(y) = P(Y \le y) = P((X - a)/(b - a) \le y) = P(X \le y(b - a) + a) = F_X(y(b - a) + a)$, and therefore $f_Y(y) = F'_X(y(b - a) + a)(b - a) = 1$. Thus,
$$f_Y(y) = \begin{cases} 1 & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

7. Let $Y = \Phi(X)$. Since $\Phi$ takes on values between 0 and 1, the same is true of $Y$, and so $f_Y(y) = 0$ if $y \le 0$ or $y \ge 1$. For $0 < y < 1$, $F_Y(y) = P(\Phi(X) \le y) = P(X \le \Phi^{-1}(y)) = \Phi(\Phi^{-1}(y)) = y$. Thus, $f_Y(y) = 1$ for $0 < y < 1$ and $Y$ has a uniform density on $(0, 1)$.

Exercises 6.6

1. Since the exponential density is the same as the $\Gamma(1, \lambda)$ density, $Z = X_1 + \cdots + X_n$ has a $\Gamma(n, \lambda)$ density by Theorem 6.6.1.

2. Since each $X_i$ has a standard normal density, each $X_i^2$ has a $\Gamma(1/2, 1/2)$ density. By Theorem 6.6.1, $W = X_1^2 + \cdots + X_n^2$ has a $\Gamma(n/2, 1/2)$ density. Since $Z = \sqrt W$, $f_Z(z) = 0$ if $z \le 0$. For $z > 0$, $f_Z(z) = 2zf_W(z^2)$. Therefore,
$$f_Z(z) = \begin{cases} (2/(2^{n/2}\Gamma(n/2)))z^{n-1}e^{-z^2/2} & \text{if } z > 0 \\ 0 & \text{if } z \le 0. \end{cases}$$

3. (a) Proof of Theorem 6.6.1. Consider the proposition $P(n)$: $X_1 + \cdots + X_n$ has a $\Gamma(\alpha_1 + \cdots + \alpha_n, \lambda)$ density.
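$P(1)$ is trivially true since $X_1$ has a $\Gamma(\alpha_1, \lambda)$ density by hypothesis. Assume that $P(n - 1)$ is true. Then $X_1 + \cdots + X_{n-1}$ has a $\Gamma(\alpha_1 + \cdots + \alpha_{n-1}, \lambda)$ density. By Theorem 6.5.2, $(X_1 + \cdots + X_{n-1}) + X_n$ has a $\Gamma(\alpha_1 + \cdots + \alpha_n, \lambda)$ density. Therefore, $P(n)$ is true whenever $P(n - 1)$ is true. By the principle of mathematical induction, $P(n)$ is true for all $n \ge 1$.
(b) Proof of Theorem 6.6.2. Consider the proposition $P(n)$: $X_1 + \cdots + X_n$ has a $n(\mu_1 + \cdots + \mu_n, \sigma_1^2 + \cdots + \sigma_n^2)$ density. $P(1)$ is trivially true since $X_1$ has a $n(\mu_1, \sigma_1^2)$ density by hypothesis. Assume $P(n - 1)$ is true. Then $X_1 + \cdots + X_{n-1}$ has a $n(\mu_1 + \cdots + \mu_{n-1}, \sigma_1^2 + \cdots + \sigma_{n-1}^2)$ density. By Theorem 6.5.1, $(X_1 + \cdots + X_{n-1}) + X_n$ has a $n(\mu_1 + \cdots + \mu_n, \sigma_1^2 + \cdots + \sigma_n^2)$ density. Therefore, $P(n)$ is true whenever $P(n - 1)$ is true. By the principle of mathematical induction, $P(n)$ is true for all $n \ge 1$.

4. $f_Y(y) = \begin{cases} (1/2)\arcsin\sqrt{1 - y^2} & \text{if } -1 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$

5.
$$P(X_1 < 2X_2 < 3X_3) = \int_0^{+\infty}\Bigl(\int_{x_1/2}^{+\infty}\Bigl(\int_{(2/3)x_2}^{+\infty} f_{X_1,X_2,X_3}(x_1, x_2, x_3)\,dx_3\Bigr)dx_2\Bigr)dx_1 = \frac{18}{55}.$$

6. Use a mathematical induction argument to show that the density is $(1/n!)(-\log x)^n$ if $0 < x < 1$ and 0 otherwise.

7. $F_{U,V}(u, v) = \begin{cases} v^n - (v - u)^n & \text{if } 0 \le u < v \le 1 \\ 0 & \text{otherwise.} \end{cases}$

For $0 \le u < v \le 1$,
$$F_{U,V}(u, v) = \int_0^u\int_x^v n(n - 1)(y - x)^{n-2}\,dy\,dx = \int_0^u n(v - x)^{n-1}\,dx = v^n - (v - u)^n.$$

8. $f_X(x) = \begin{cases} 1/(1 + x)^2 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0. \end{cases}$

$f_Y(y) = \begin{cases} 1/(1 + y)^2 & \text{if } y \ge 0 \\ 0 & \text{if } y < 0. \end{cases}$

$f_Z(z) = \begin{cases} e^{-z} & \text{if } z \ge 0 \\ 0 & \text{if } z < 0. \end{cases}$

$f_{X,Y}(x, y) = \begin{cases} 2/(1 + x + y)^3 & \text{if } x \ge 0, y \ge 0 \\ 0 & \text{otherwise.} \end{cases}$

$f_{X,Y|Z}(x, y \mid z) = \begin{cases} z^2e^{-z(x + y)} & \text{if } x, y, z \ge 0 \\ 0 & \text{otherwise.} \end{cases}$

9. For $t \ge 0$, $F_T(t) = P(T \le t) = P(\max(T_1, T_2) \le t) = P(T_1 \le t, T_2 \le t) = P(T_1 \le t)P(T_2 \le t)$ and
$$f_T(t) = F_{T_1}(t)f_{T_2}(t) + f_{T_1}(t)F_{T_2}(t) = \bigl(1 - e^{-\int_0^t \beta_1(s)\,ds}\bigr)\beta_2(t)e^{-\int_0^t \beta_2(s)\,ds} + \beta_1(t)e^{-\int_0^t \beta_1(s)\,ds}\bigl(1 - e^{-\int_0^t \beta_2(s)\,ds}\bigr).$$

CHAPTER 7

Exercises 7.2

1. $E[\sin X] = 2/\pi$.

2. $E[|X|] = 2/$[...].

3. $E[\min(X, 1/2)] = 3/8$, $E[\max(X, 1/2)] = 5/8$.

4. $E[U] = 1/(n + 1)$, $E[U^2] = 2/(n + 1)(n + 2)$, $E[V] = n/(n + 1)$, $E[V^2] = n/(n + 2)$.

5. $E[X] = 1/\lambda$, $E[X^2] = 2/\lambda^2$.

6. $E[X^r] = ((\alpha + r - 1)(\alpha + r - 2) \times \cdots \times \alpha)/\lambda^r$.

7.
$$E[X^r] = \int_0^1 x^r\frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}x^{\alpha_1 - 1}(1 - x)^{\alpha_2 - 1}\,dx = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\cdot\frac{\Gamma(\alpha_1 + r)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2 + r)}\int_0^1\frac{\Gamma(\alpha_1 + \alpha_2 + r)}{\Gamma(\alpha_1 + r)\Gamma(\alpha_2)}x^{\alpha_1 + r - 1}(1 - x)^{\alpha_2 - 1}\,dx$$
$$= \frac{(\alpha_1 + r - 1)(\alpha_1 + r - 2) \times \cdots \times \alpha_1}{(\alpha_1 + \alpha_2 + r - 1)(\alpha_1 + \alpha_2 + r - 2) \times \cdots \times (\alpha_1 + \alpha_2)},$$
since the second integral is equal to 1.

8. Choose $a < b$ so that $a < c_i < b$, $i = 1, \ldots, m$. Then $F(x) = 0$ for $x < a$ and $F(x) = 1$ for $x > b$ and $\int_{-\infty}^{+\infty}\phi(t)\,dF(t) = \int_a^b\phi(t)\,dF(t)$. Since $\phi$ is continuous at each $c_i$, given $\epsilon > 0$, there is a $\delta_i > 0$ such that $|\phi(x) - \phi(c_i)| < \epsilon$ whenever $|x - c_i| < \delta_i$. Let $\delta = \min\{\delta_1, \ldots, \delta_m, c_2 - c_1, \ldots, c_m - c_{m-1}\}$. Then $|\phi(x) - \phi(c_i)| < \epsilon$ whenever $|x - c_i| < \delta$, $i = 1, \ldots, m$. Let $\pi_\epsilon$ be any partition of $[a, b]$ such that $|\pi_\epsilon| < \delta$. Let $\pi = \{x_0, \ldots, x_n\}$ be any partition of $[a, b]$ finer than $\pi_\epsilon$. Then each $c_i$ belongs to one and only one interval $(x_{j_i - 1}, x_{j_i}]$, $i = 1, \ldots, m$. Note that $F(x_k) - F(x_{k-1}) = 0$ if $k$ is not one of the $j_i$'s. Let $\xi_k$ be any point of $(x_{k-1}, x_k]$, $k = 1, \ldots, n$. Then
$$\Bigl|\sum_{k=1}^n \phi(\xi_k)(F(x_k) - F(x_{k-1})) - \sum_{i=1}^m \phi(c_i)(F(c_i) - F(c_i-))\Bigr| = \Bigl|\sum_{i=1}^m \phi(\xi_{j_i})(F(x_{j_i}) - F(x_{j_i - 1})) - \sum_{i=1}^m \phi(c_i)(F(c_i) - F(c_i-))\Bigr|$$
$$= \Bigl|\sum_{i=1}^m \phi(\xi_{j_i})(F(c_i) - F(c_i-)) - \sum_{i=1}^m \phi(c_i)(F(c_i) - F(c_i-))\Bigr| \le \sum_{i=1}^m |\phi(\xi_{j_i}) - \phi(c_i)|(F(c_i) - F(c_i-)) < \epsilon\sum_{i=1}^m (F(c_i) - F(c_i-)) = \epsilon.$$
This shows that $\int_{-\infty}^{+\infty}\phi(t)\,dF(t) = \int_a^b\phi(t)\,dF(t) = \lim_{|\pi| \to 0}\sum_{k=1}^n \phi(\xi_k)\Delta F_k = \sum_{i=1}^m \phi(c_i)(F(c_i) - F(c_i-))$.

9. $E[X] = \int_0^{+\infty} xf_X(x)\,dx = \int_0^{+\infty}\bigl(\int_0^x f_X(x)\,dy\bigr)dx$. Interchanging the order of integration,
$$E[X] = \int_0^{+\infty}\Bigl(\int_y^{+\infty} f_X(x)\,dx\Bigr)dy = \int_0^{+\infty} P(X > y)\,dy = \int_0^{+\infty}(1 - F_X(y))\,dy.$$

10. Since $Y \ge X$, $(X > x) \subset (Y > x)$ and $1 - F_X(x) \le 1 - F_Y(x)$. By Problem 9, $E[X] = \int_0^{+\infty}(1 - F_X(x))\,dx \le \int_0^{+\infty}(1 - F_Y(x))\,dx = E[Y]$.

Exercises 7.3

1. $E[\max(X, Y)] = 3/2$.

2. $E[U \mid X = x, Y = y] = x/2$.

3. $E[\Lambda \mid X_1 = x_1, \ldots, X_n = x_n] = (n + 1)/(1 + x_1 + \cdots + x_n)$.

4. $E[Y \mid X = x] = \begin{cases} (2/3)x & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$

$E[X \mid Y = y] = \begin{cases} (2/3)(y^2 + y + 1)/(y + 1) & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$

5. The density of $X$ is
$$f_X(x) = \begin{cases} (\alpha\beta^\alpha)/(x + \beta)^{\alpha + 1} & \text{if } x > 0 \\ 0 & \text{if } x \le 0. \end{cases}$$
The conditional density of $\Lambda$ given that $X = x$ is a $\Gamma(\alpha + 1, x + \beta)$ density.

6. $E[R] = (n - 1)/(n + 1)$, $\operatorname{var} R = 2(n - 1)/((n + 1)^2(n + 2))$.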
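The closed forms in Problem 6 can be cross-checked with exact arithmetic. This is only a sketch under an assumption about the exercise: that $R = V - U$ is the range of $n$ independent random variables that are uniform on $(0, 1)$, which is suggested by the joint density $n(n-1)(v-u)^{n-2}$ of $(U, V)$ used in the solution to Exercise 6.6.7. The function names are illustrative and do not come from the text.

```python
from fractions import Fraction

def range_moment(n, k):
    """E[R^k] for the range R = V - U of n independent uniform(0,1) variables.

    With joint density n(n-1)(v-u)^(n-2) on 0 < u < v < 1, substituting
    r = v - u gives E[R^k] = n(n-1) * integral_0^1 r^(k+n-2) (1-r) dr,
    which equals n(n-1) / ((n+k-1)(n+k)).
    """
    return Fraction(n * (n - 1), (n + k - 1) * (n + k))

def range_mean_and_variance(n):
    """Return (E[R], var R) computed from the first two moments."""
    mean = range_moment(n, 1)
    variance = range_moment(n, 2) - mean ** 2
    return mean, variance

# Compare with the closed forms stated in the solution for several n.
for n in range(2, 8):
    mean, variance = range_mean_and_variance(n)
    assert mean == Fraction(n - 1, n + 1)
    assert variance == Fraction(2 * (n - 1), (n + 1) ** 2 * (n + 2))

print(range_mean_and_variance(5))
```

Because `Fraction` is exact, agreement for each tested $n$ is an identity check rather than a numerical approximation.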
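7. $f_{U|V}(u \mid v) = \dfrac{(n - 1)(v - u)^{n-2}}{v^{n-1}}$.

$E[U \mid V = v] = \begin{cases} v/n & \text{if } 0 < v < 1 \\ 0 & \text{otherwise.} \end{cases}$

8. $f_{Z|X,Y}(z \mid x, y) = \frac{1}{2}(1 + x + y)^3z^2e^{-z(1 + x + y)}$ if $x, y, z \ge 0$.

$E[Z \mid X = x, Y = y] = \dfrac{3}{1 + x + y}$.

9. $E[Y \mid X = x] = \begin{cases} (1 + x)/2 & \text{if } -1 < x < 0 \\ (1 - x)/2 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$

10. $E[X \mid Y = y] = \begin{cases} (y - 1)/(\ln y) & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$

Exercises 7.4

1. .7745.

2. .5819.

3. $P(S_{360} \ge 59000) = .0622$.

4. $n = 757$.

5. $P(|S_{1000}| \le 30) = .6574$.

6. $P(S_{1000} \le -50) = .056$.

7. $\lambda = 2.575\sqrt{n/12}$.

8. Let $n = 10^6$. Since $\mu = E[X_j] = 0$ and $\sigma^2 = 10^{-2m}/12$, $P(|S_n| \le 5 \cdot 10^{-m+2}) = P(|S_n^*| \le \sqrt3) \approx 2\Phi(\sqrt3) - 1 \approx .9168$.

Exercises 7.5

1. The identities $\cos^2\alpha = (1 + \cos 2\alpha)/2$ and $\cos\alpha\cos\beta = (1/2)(\cos(\alpha + \beta) + \cos(\alpha - \beta))$ are used below.
$$E[X_n] = \sum_{j=1}^m a_jE[\cos(nb_j + Z_j)] = \sum_{j=1}^m a_j\frac{1}{2\pi}\int_0^{2\pi}\cos(nb_j + z)\,dz = 0.$$
$$E[X_n^2] = E\Bigl[\Bigl(\sum_{i=1}^m a_i\cos(nb_i + Z_i)\Bigr)^2\Bigr] = \sum_{i=1}^m a_i^2E[\cos^2(nb_i + Z_i)] + \sum_{i \ne j} a_ia_jE[\cos(nb_i + Z_i)]E[\cos(nb_j + Z_j)]$$
$$= \sum_{i=1}^m a_i^2\frac{1}{2\pi}\int_0^{2\pi}\cos^2(nb_i + z)\,dz = \sum_{i=1}^m a_i^2\frac{1}{2\pi}\int_0^{2\pi}\frac{1 + \cos(2(nb_i + z))}{2}\,dz = \sum_{i=1}^m\frac{a_i^2}{2}.$$
For $v \ge 1$,
$$r(v) = E[X_nX_{n+v}] = E\Bigl[\Bigl(\sum_{i=1}^m a_i\cos(nb_i + Z_i)\Bigr)\Bigl(\sum_{j=1}^m a_j\cos((n + v)b_j + Z_j)\Bigr)\Bigr]$$
$$= \sum_{i=1}^m a_i^2E[\cos(nb_i + Z_i)\cos((n + v)b_i + Z_i)] + \sum_{i \ne j} a_ia_jE[\cos(nb_i + Z_i)]E[\cos((n + v)b_j + Z_j)]$$
$$= \sum_{i=1}^m\frac{a_i^2}{2}E[\cos((2n + v)b_i + 2Z_i) + \cos(vb_i)] = \sum_{i=1}^m\frac{a_i^2}{2}\cdot\frac{1}{2\pi}\int_0^{2\pi}\cos((2n + v)b_i + 2z)\,dz + \sum_{i=1}^m\frac{a_i^2}{2}\cos vb_i = \sum_{i=1}^m\frac{a_i^2}{2}\cos vb_i.$$
Since $r(v) = r(-v) = r(|v|)$,
$$\rho(v) = \frac{\sum_{i=1}^m a_i^2\cos vb_i}{\sum_{i=1}^m a_i^2}, \quad v \in Z.$$

2. Let $\mu = E[X_0]$ and $r(v) = \operatorname{cov}(X_0, X_v)$. Then $E[Y_n] = \mu\sum_{j=1}^m a_j$ and
$$\operatorname{var} Y_n = E\Bigl[\Bigl(\sum_{i=1}^m a_i(X_{n-i+1} - \mu)\Bigr)\Bigl(\sum_{j=1}^m a_j(X_{n-j+1} - \mu)\Bigr)\Bigr] = \sum_{i=1}^m\sum_{j=1}^m a_ia_jE[(X_{n-i+1} - \mu)(X_{n-j+1} - \mu)] = \sum_{i=1}^m\sum_{j=1}^m a_ia_jr(i - j),$$
which is independent of $n$. For $v \ge 1$,
$$\operatorname{cov}(Y_n, Y_{n+v}) = \sum_{i=1}^m\sum_{j=1}^m a_ia_jE[(X_{n-i+1} - \mu)(X_{n+v-j+1} - \mu)] = \sum_{i=1}^m\sum_{j=1}^m a_ia_jr(i - j + v),$$
which is independent of $n$. Since $r(v) = r(-v) = r(|v|)$,
$$\rho_Y(v) = \frac{\sum_{i=1}^m\sum_{j=1}^m a_ia_jr(i - j + |v|)}{\sum_{i=1}^m\sum_{j=1}^m a_ia_jr(i - j)}, \quad v \in Z.$$

3. $f(\lambda) = \dfrac{1 + a^2 + 2a\cos\lambda}{2\pi(1 + a^2)}$, $-\pi < \lambda < \pi$.

4. $f(\lambda) = \dfrac{1 + \alpha^2 + \beta^2 + 2(\alpha + \alpha\beta)\cos\lambda + 2\beta\cos 2\lambda}{2\pi(1 + \alpha^2 + \beta^2)}$, $-\pi < \lambda < \pi$.

5. $E[X_t] = \sum_{j=1}^m p_j(1/2\pi)\int_0^{2\pi}\cos(\lambda_jt + \theta)\,d\theta = 0$, and
$$E[X_t^2] = \sum_{j=1}^m p_j\frac{1}{2\pi}\int_0^{2\pi}\cos^2(\lambda_jt + \theta)\,d\theta = \sum_{j=1}^m p_j\frac{1}{2\pi}\int_0^{2\pi}\frac{1 + \cos(2\lambda_jt + 2\theta)}{2}\,d\theta = \frac12.$$
For $h > 0$,
$$\operatorname{cov}(X_t, X_{t+h}) = E[\cos(\Lambda t + \Theta)\cos(\Lambda(t + h) + \Theta)] = \frac12E[\cos(\Lambda(2t + h) + 2\Theta) + \cos\Lambda h]$$
$$= \frac12\sum_{j=1}^m p_j\frac{1}{2\pi}\int_0^{2\pi}(\cos(\lambda_j(2t + h) + 2\theta) + \cos\lambda_jh)\,d\theta = \frac12\sum_{j=1}^m p_j\cos\lambda_jh.$$
Since $r(h) = r(-h) = r(|h|)$, $r(h) = (1/2)\sum_{j=1}^m p_j\cos\lambda_jh$, and therefore $\rho(h) = \sum_{j=1}^m p_j\cos\lambda_jh$.

CHAPTER 8

Exercises 8.2

1. $P(X_s = k \mid X_t = n) = \binom{n}{k}(s/t)^k(1 - (s/t))^{n-k}$.

2. For $i = 1, \ldots, n$, let $W^{(i)}$ be the first time that $X_t^{(i)} = 1$ and let $W = \max(W^{(1)}, \ldots, W^{(n)})$. Then $P(W \le t) = (1 - e^{-\lambda t})^n$ and $f_W(t) = n\lambda e^{-\lambda t}(1 - e^{-\lambda t})^{n-1}$ for $t \ge 0$ and $= 0$ for $t < 0$.

3. Note that $P(X_t - X_{[t]} < 0) = 0$. Thus, $P(X_t \ge X_{[t]} = \sum_{k=1}^{[t]}(X_k - X_{k-1})) = 1$, and it suffices to show that $P(\lim_{t \to +\infty} X_{[t]} = +\infty) = 1$. Since the events $A_1, A_2, \ldots$ are independent events and $P(A_k) = \lambda e^{-\lambda}$, $\sum_{k=1}^\infty P(A_k) = \infty$. By the Borel-Cantelli lemma, $P(A_n \text{ i.o.}) = 1$ and $P(\sum_{k=1}^\infty(X_k - X_{k-1}) = +\infty) = 1$. Since $\lim_{t \to \infty} X_{[t]} = \sum_{k=1}^\infty(X_k - X_{k-1})$, $P(\lim_{t \to \infty} X_{[t]} = \infty) = 1$.

5. $P_n(t) = \dfrac{(\ln(1 + t))^n}{n!}\cdot\dfrac{1}{1 + t}$, $E[X_t] = \ln(1 + t)$.

6. $P_{10}(100) \approx .012$.

Exercises 8.3

1. $P_1(t) = e^{-\beta t}$, $P_2(t) = e^{-\beta t} - e^{-2\beta t}$, $P_3(t) = e^{-\beta t} - 2e^{-2\beta t} + e^{-3\beta t}$.

2. By Problem 1, $P_1(t) = e^{-\beta t}$ and the assertion is true for $n = 1$. Assume that the assertion is true for $n - 1$; i.e., $P_{n-1}(t) = e^{-\beta t}(1 - e^{-\beta t})^{n-2}$. By Equation 8.9,
$$P_n(t) = (n - 1)\beta e^{-n\beta t}\int_0^t e^{n\beta s}e^{-\beta s}(1 - e^{-\beta s})^{n-2}\,ds = (n - 1)e^{-n\beta t}\int_0^t(e^{\beta s} - 1)^{n-2}\beta e^{\beta s}\,ds = e^{-n\beta t}(e^{\beta t} - 1)^{n-1} = e^{-\beta t}(1 - e^{-\beta t})^{n-1}.$$
Thus, the assertion is true for $n$ whenever it is true for $n - 1$, and it follows from the principle of mathematical induction that the assertion is true for all $n \in N$.

3. By Equation 8.9,
$$P'_{n+1}(t) = \beta_ne^{-\beta_{n+1}t}\int_0^t e^{\beta_{n+1}s}P_n(s)\,ds = 0$$
since $\beta_n = 0$. Thus, $P'_{n+1}(t) = 0$ and so $P_{n+1}(t) = c$. Using the initial condition $P_{n+1}(0) = 0$, $c = 0$ and $P(X_t = n + 1) = P_{n+1}(t) = 0$ for all $t$, as would be expected without any calculations.

4. As in Example 8.1, $M'(t) = $ [...]. If there were no deaths in the population, by Theorem 8.3.1 we would have $P(X_t < +\infty) = \sum_{n=1}^\infty P_{n-1}(t) = 1$ since the series $\sum_{n=1}^\infty 1/(\alpha + \beta n)$ diverges. A fortiori, $P(X_t < +\infty) = \sum_{n=1}^\infty P_{n-1}(t) = 1$ in the presence of deaths. Thus, $M'(t) = (\beta - \delta)M(t) + \alpha$, and so $M(t) = $ [...].

5. (c) $E[X_t] = n_0e^{\beta t}$.
(d) $P_0(t) = 0$, $P_n(t) = e^{-\beta t}(1 - e^{-\beta t})^{n-1}$ in the $n_0 = 1$ case.

Exercises 8.4

1. (a) Fix $i \ge 1$. The forward equation for $p_{i,0}(t)$ is
$$p'_{i,0}(t) = \sum_{k \ne 0} p_{i,k}(t)q_{k,0} + q_{0,0}p_{i,0}(t).$$
Since $q_{k,0} = 0$ if $k \ge 1$, the equation reduces to $p'_{i,0}(t) = -\lambda p_{i,0}(t)$, which has the general solution $p_{i,0}(t) = c_ie^{-\lambda t}$ where $c_i$ is a constant. Since the continuity condition requires that $\lim_{t \to 0+} p_{i,0}(t) = 0$, $c_i = 0$ and so $p_{i,0} = 0$.
(b) Fix $i > 1$. From part (a), $p_{i,0}(t) = 0$. Suppose $p_{i,k}(t) = 0$ for all $k \le j - 1 < i - 1$. The forward equation for $p_{i,j}(t)$ is
$$p'_{i,j}(t) = \sum_{k \ne j} p_{i,k}(t)q_{k,j} + q_{j,j}p_{i,j}(t).$$
Since $q_{j+1,j} = 0$, $q_{j-1,j} = \lambda$, and $q_{k,j} = 0$ for all other cases for which $k \ne j$, $p'_{i,j}(t) = \lambda p_{i,j-1}(t) - \lambda p_{i,j}(t)$. But $p_{i,j-1}(t) = 0$ by the induction hypothesis. Thus, $p'_{i,j}(t) = -\lambda p_{i,j}(t)$ and $p_{i,j}(t) = 0$ as in (a), and therefore $p_{i,j}(t) = 0$ whenever $j < i$.
(c) Suppose $j \ge i$. Then the forward equation for $p_{i,j}(t)$ is
$$p'_{i,j}(t) = \lambda p_{i,j-1}(t) - \lambda p_{i,j}(t)$$
so that
$$(e^{\lambda t}p_{i,j}(t))' = \lambda e^{\lambda t}p_{i,j-1}(t).$$
Integrating from 0 to $t$,
$$e^{\lambda t}p_{i,j}(t) - \delta_{i,j} = \lambda\int_0^t e^{\lambda s}p_{i,j-1}(s)\,ds$$
and
$$p_{i,j}(t) = \delta_{i,j}e^{-\lambda t} + \lambda e^{-\lambda t}\int_0^t e^{\lambda s}p_{i,j-1}(s)\,ds.$$
Since $p_{i,i-1}(t) = 0$, $p_{i,i}(t) = e^{-\lambda t}$. Thus,
$$p_{i,i+1}(t) = \lambda e^{-\lambda t}\int_0^t e^{\lambda s}e^{-\lambda s}\,ds = \lambda te^{-\lambda t}$$
and
$$p_{i,i+2}(t) = \lambda e^{-\lambda t}\int_0^t e^{\lambda s}\lambda se^{-\lambda s}\,ds = \frac{\lambda^2t^2}{2}e^{-\lambda t},$$
and so forth. By induction,
$$p_{i,j}(t) = \frac{\lambda^{j-i}t^{j-i}}{(j - i)!}e^{-\lambda t}.$$

2. $Q = $ [...].

3. $Q = $ [...].

4. $Q = $ [...].

Exercises 8.5

1. Since $I^n = I$ for all $n \ge 1$, [...].

2. Since $Q^2 = -(\lambda + \mu)Q$, $Q^3 = (\lambda + \mu)^2Q$, ..., $Q^n = (-1)^{n-1}(\lambda + \mu)^{n-1}Q$, $n \ge 1$,
$$P(t) = I + \frac{1 - e^{-(\lambda + \mu)t}}{\lambda + \mu}Q.$$
If $P(t) = [p_{i,j}(t)]$, then
$$p_{1,1}(t) = \frac{\mu + \lambda e^{-(\lambda + \mu)t}}{\lambda + \mu}, \quad p_{1,2}(t) = \frac{\lambda - \lambda e^{-(\lambda + \mu)t}}{\lambda + \mu}, \quad p_{2,1}(t) = \frac{\mu - \mu e^{-(\lambda + \mu)t}}{\lambda + \mu}, \quad p_{2,2}(t) = \frac{\lambda + \mu e^{-(\lambda + \mu)t}}{\lambda + \mu}.$$

3.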
Writing $Q$ as a block matrix,
$$Q = \begin{bmatrix} A & O \\ O & D \end{bmatrix}, \quad Q^n = \begin{bmatrix} A^n & O \\ O & D^n \end{bmatrix}, \quad n \ge 1,$$
where
$$A = D = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}.$$
By Problem 2 for $n \ge 1$, $A^n = (-1)^{n-1}2^{n-1}A$ and $D^n = (-1)^{n-1}2^{n-1}D$. Thus,
$$Q^n = \begin{bmatrix} (-1)^{n-1}2^{n-1}A & O \\ O & (-1)^{n-1}2^{n-1}D \end{bmatrix}$$
and
$$P(t) = I + \begin{bmatrix} \bigl((-1/2)\sum_{n=1}^\infty(-2t)^n/n!\bigr)A & O \\ O & \bigl((-1/2)\sum_{n=1}^\infty(-2t)^n/n!\bigr)D \end{bmatrix}.$$
Identifying entries in $P(t)$,
$$p_{1,1}(t) = p_{3,3}(t) = \frac{1 + e^{-2t}}{2}, \quad p_{2,1}(t) = p_{4,3}(t) = \frac{1 - e^{-2t}}{2}, \quad p_{1,2}(t) = p_{3,4}(t) = \frac{1 - e^{-2t}}{2}, \quad p_{2,2}(t) = p_{4,4}(t) = \frac{1 + e^{-2t}}{2}.$$
For all other pairs $(i, j)$, $p_{i,j}(t) = 0$.

Exercises 8.6

1. $Q_1$ is irreducible; $Q_2$ is not irreducible.

2. $\pi_1 = .995000$, $\pi_2 = .004975$, $\pi_3 = .000025$.

3. $\lim_{t \to \infty} p_{i,1}(t) = 1/6$, $\lim_{t \to \infty} p_{i,2}(t) = 1/24$, $\lim_{t \to \infty} p_{i,3}(t) = 1/8$, $\lim_{t \to \infty} p_{i,4}(t) = 1/3$, $\lim_{t \to \infty} p_{i,5}(t) = 1/3$.

4.
$$P(2) = [p_{i,j}(2)] = \begin{bmatrix} .285 & .365 & .350 \\ .214 & .419 & .367 \\ .205 & .391 & .404 \end{bmatrix}.$$
$\pi_1 = 12/53$, $\pi_2 = 21/53$, $\pi_3 = 20/53$.

5. $\pi_1 = .195$, $\pi_2 = .132$, $\pi_3 = .182$, $\pi_4 = .302$, $\pi_5 = .189$.

STANDARD NORMAL DISTRIBUTION FUNCTION

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,dt, \quad x \ge 0.$$
For $x < 0$, use the relation $\Phi(x) = 1 - \Phi(-x)$.

x      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990

SYMBOLS

$\in$, 26
$|A|$, 4, 188
$\{A_n : \text{i.o.}\}$, 53
$b(k : n, p)$, 61
$C(n, r)$, 7
$C(x, r)$, 9
$\operatorname{cov}(X, Y)$, 119, 225
$D_x$, 130
$E[\cdot]$, 100, 230
$E[Y \mid X_1 = x_1, \ldots, X_n = x_n]$, 128, 241
$f_X$, 61, 194
$f_{X,Y}$, 68, 201
$f_{X_1, \ldots, X_n}$, 69, 220
$\hat f_X$, 82
$f_{Y|X}(y \mid x)$, 126, 127
$f_{Y|X_1, \ldots, X_n}(y \mid x_1, \ldots, x_n)$, 126
$F_X$, 182
$F_{X,Y}$, 201
$\mathcal{F}$, 37
$g_X$, 104
$\hat g_X$, 104
$N$, 26
$N(A)$, 4
$(n)_r$, 7
$n(\mu, \sigma^2)$, 211
$o(h)$, 268
$p(k : \lambda)$, 65
$p$, 39
$p_x$, 94
$P$, 2, 37
$P(A \mid B)$, 22, 45
$q$, 39
$q_x$, 92
$Q$, 34
$R$, 34
$R(n)$, 173
$\operatorname{var} X$, 112
$(X = x)$, 61
$Z$, 34
$\subset$, 26
$\cup$, 27
$\cap$, 27
$X \cup Y$, 27
$X \cap Y$, 27
$A^c$, 27
$\beta(\alpha_1, \alpha_2)$, 215
$\Gamma(\alpha)$, 213
$\Gamma(\alpha, \lambda)$, 213
$\emptyset$, 18, 36
$(\Omega, \mathcal{F}, P)$, 37
$\mu_X$, 111
$\rho(n)$, 259
$\rho(X, Y)$, 255
$\phi(x)$, 210
$\Phi(x)$, 211
$\sigma_X$, 112, 239
$\sim$, 161

INDEX

$\sigma$-algebra, 37
  smallest, 37
Abel's theorem, 103
absolutely continuous, 194
absolutely convergent, 229
absorption, 159
algebra, 36
Archimedean property, 28
associative laws, 29
asymptotic distribution, 155
asymptotically equivalent, 161
Bayes' rule, 46
Bernoulli random variables, 76, 127, 130
Bernoulli trials, 19, 39, 55, 73, 86, 92, 105, 109, 114, 115, 243
beta density, 215
binary information source, 147
binomial density, 61, 77, 79
binomial theorem, 8
birth and death processes, 273
birthday problem, 14
bit, 134
Boole's inequality, 43
Borel-Cantelli lemma, 53
branching process, 167
  extinction, 169
Cantor's diagonalization procedure, 34
Cardano, 1, 3
Cauchy density, 196, 232
central limit theorem, 245
chain molecule, 273
Chapman-Kolmogorov equation, 150, 280
Chebyshev's inequality, 114, 241
commutative laws, 28
complement, 27
composite function, 66
compound probabilities, 45
conditional density, 126, 217
conditional entropy, 139
conditional expectation, 128, 241
conditional probability, 21, 22, 45
conditional probability function, 45
conditional uncertainty, 139
configuration, 5, 10
continuous random variable, 184
converges in mean square, 257
correlation, 122, 255
correlation function, 173, 259
countable, 32
countably infinite, 33
coupon collector problem, 57
covariance, 119
covariance function, 173, 259
de Morgan's laws, 30
DeMoivre, 245
DeMoivre-Laplace limit theorem, 253
density function, 61, 194
  beta, 215
  binomial, 61
  Cauchy, 232
  conditional, 217
  exponential, 210
  gamma, 213, 223
  Gauss, 210
  geometric, 62
  joint, 68, 69, 220
series, 88 doubly stochastic, 157 Ehrenfest diffusion model, 147,155 elastic barrier, 167 empirical law, 2 empty set, 26 entropy, 134 conditional, 139 joint, 138 maximum principle, 140 equalization, 58 events, 37,182 expected value, 99,100,111,230 binomial density, 100 geometric density, 101 negative binomial density, 105 Poisson density, 101 uniform density, 100 exponential density, 210 mean, 240 variance, 240 extended real-valued random variable, 62,182 Fermat, 2 finite, 32 finite second moment, 111 finite sequence, 32 functional 349 gambler’s ruin problem, 92, 130,159 gamma density, 213 mean, 240 variance, 240 gamma density function, 223 garage door opener, 52, 55, 105, 107 Gauss density, 210 generating function, 81,82 binomial density, 83 geometric density, 83 negative binomial density, 83 Poisson density, 83 random variable, 82 sequence, 81 sum of random variables, 84 geometric density function, 62 geometric probabilities, 188 Huygens, 2 inclusion/exdusion principle, 56,57 independence, continuous random variables, 222 independent, 49 independent random variables, 72, 75, 206 inequality Chebyshev’s, 114 Markov’s, 113 Schwarz’s, 118, 255 triangle, 256 infinite sequence, 32 infinitely often, 53 information, 133 initial density, 146 intersection, 27 irreducible, 154,293 joint conditional density, 217 joint density function, 68, 69,220 joint distribution function, 198, 220 joint entropy, 138 joint uncertainty, 138 Kolmogorov backward equations, 282 Kolmogorov forward equations, 282 Laplace density, 210 limiting distribution, 155 350 INDEX linear dependence, 123 linear predictor, 175 mapping, 31 marginal density function, 204 Markov chain, 145, 279 Markov property, 279 Markov’s inequality, 113, 240 match, 55 maximum entropy principle, 140 mean, 111,211,239 binomial density, 100 geometric density, 101 negative binomial density, 105 Poisson density, 101 uniform density, 100 mean square convergence, 257 mean square distance, 256 mean square error, 175 moment, finite 
second, 111 moving average process, 174, 262 multinomial density, 75 multinomial density function, 74 mutually exclusive, 27 mutually independent, 50, 53 n-step transition matrix, 149 n-step transition probabilities, 149 negative binomial density, 63, 79 nonnegative integer-valued random variable, 77 norm, 256 normal density, 210, 211 mean, 240 variance, 240 normal density function, 223 one-step transition probabilities, 145 ordered pairs, 5 ordered r-tuple, 6 ordered sample with replacement, 6 without replacement, 6 pairwise independence, 50 Pascal, 2 password problem, 55,106 paternity index, 47, 48 Poisson density, 64, 77,79, 91 Poisson density function, 65 Poisson process, 271 poker hand,14 population, 6 probability conditional, 21 of eventual ruin, 92,94 of failure, 39 space, 37 of success, 39 pure birth process, 274 q-matrix, 282 random sample, 15 random variable, 61, 182 continuous, 184 discrete, 61 entropy, 134 extended real-valued, 62, 182 nonnegative integer-valued, 77 uncertainty, 134 independent, 72, 75, 110,127,206, 222 random walk, 148 drift, 159 symmetric, 159 three-dimensional, 164 two-dimensional, 164 range, 32 Rayleigh density, 216 relative frequency, 2 reliability, 218 run, 58 sample random, 15 unordered, 7 Schwarz’s inequality, 118,255 set theory, 26 associative laws, 29 commutative laws, 28 de Morgan’s laws, 30 disjoint, 27 distributive laws, 29, 30 empty set, 26 equal, 26 intersection, 27 membership, 26 INDEX mutually exclusive, 27 subset, 26 union, 27 sets, 26 spectral density function, 261 spectral distribution function, 261 standard deviation, 112, 211 standard normal density, 210 standard normal distribution function, 211 state space, 145 states, 145,278 stationarity property, 149 stationary density, 296 stationary sequence, 173 stationary transition function, 279 stationary transition matrix, 279 stationary transition probabilities, 145 Stirling’s formula, 161 stochastic matrix, 146 stratification, 46 subset, 26 sum of random number of 
random variables, 89 symmetric random walk, 159 tail probability function, 104 transition function, 279 transition matrix, n -step, 149 transition probabilities 351 n-step, 149 stationary, 145 triangle inequality, 256 uncertainty, 133,134 conditional, 139 joint, 138 uniform density, 79, 210, 231 variance, 239 mean, 239 uniform density function, 66 uniform probability measure, 183 union, 27 unit square, 199 universe, 26 unordered sample, 7 varX, 239 variance, 112,239 binomial density, 112 Poisson density, 113 of a sum, 119 uniform density, 112 Venn diagram, 27 weak law of large numbers, 115 weakly stationary, 173,258 well-ordering property, 35 (continued from front flap) Trigonometry Refresher, A. Albert Klaf. (0-486-44227-6) Calculus: An Intuitive and Physical Approach (Second Edition), Morris Kline. (0-486-40453-6) The Philosophy of Mathematics: An Introductory Essay, Stephan Korner. (0-486-47185-3) Companion to Concrete Mathematics: Mathematical Techniques and Various Applications, Z. A. Melzak. (0-486-45781-8) Number Systems and the Foundations of Analysis, Elliott Mendelson. (0-486-45792-3) Experimental Statistics, Mary Gibbons Natrella. (0-486-43937-2) An Introduction to Identification, J. P. Norton. (0-486-46935-2) Beyond Geometry: Classic Papers from Riemann to Einstein, Edited with an Introduction and Notes by Peter Pesic. (0-486-45350-2) The Stanford Mathematics Problem Book: With Hints and Solutions, G. Polya and J. Kilpatrick. (0-486-46924-7) Splines and Variational Methods, P. M. Prenter. (0-486-46902-6) Probability Theory, A. Renyi. (0-486-45867-9) Logic for Mathematicians, J. Barkley Rosser. (0-486-46898-4) Partial Differential Equations: Sources and Solutions, Arthur David Snider. (0-486-45340-5) Introduction to Biostatistics: Second Edition, Robert R. Sokal and F. James Rohlf. (0-486-46961-1) Mathematical Programming, Steven Vajda. (0-486-47213^2) The Logic of Chance, John Venn. 
(0-486-45055-4) The Concept of a Riemann Surface, Hermann Weyl. (0-486-47004-0) Introduction to Projective Geometry, C. R. Wylie, Jr. (0-486-46895-X) Foundations of Geometry, C. R. Wylie, Jr. (0-486-47214-0) See every Dover book in print at www.doverpublications.com

MATHEMATICS

Introduction to PROBABILITY THEORY with CONTEMPORARY APPLICATIONS
Lester L. Helms

This introduction to probability theory transforms a highly abstract subject into a series of coherent concepts. Its extensive discussions and clear examples, written in plain language, expose students to the rules and methods of probability. Suitable for an introductory probability course, this volume requires abstract and conceptual thinking skills and a background in calculus.

Topics include classical probability, set theory, axioms, probability functions, random and independent random variables, expected values, and covariance and correlations. Additional subjects include stochastic processes, continuous random variables, expectation and conditional expectation, and continuous parameter Markov processes. Numerous exercises foster the development of problem-solving skills, and all problems feature step-by-step solutions.

ISBN-13: 978-0-486-47418-2
ISBN-10: 0-486-47418-6