An introduction to probability Shiu-Tang Li Index Page 1 Basic probability theory 2 1.1 Definitions and examples 2 1.2 Conditional probability 5 1.3 Random variable 6 1.4 (Cumulative) Distribution functions 7 1.5 Discrete random variables and continuous random variables 10 1.6 CDF/PDF/PMF of functions of random variables 12 1.7 Independence 14 1.8 Expected value and variance of random variables 16 1.9 Examples of discrete random variables and continuous random variables 20 1.10 Joint distributions 24 1.11 Moment generating functions 35 1.12 LLN and CLT 37 1 An introduction to probability Shiu-Tang Li 1 Basic probability theory 1.1 Definitions and examples Definition 1.1.1 A space (Ω, F, P ), where Ω is the sample space, F ⊂ 2Ω is the collection of sets that could be assigned probability (F is also called the collection of events), and P is the probability measure, is called a probability space if it satisfies (1) For any A ∈ F , 0 ≤ P (A) ≤ 1. (2) P (Ω) = 1. S P∞ (3) If A1 , A2 , · · · , An , · · · ∈ F , and Ai ∩Aj = ∅ for all i 6= j, then P ( ∞ j=1 Aj ) = j=1 P (Aj ). Remark 1.1.2 We also require F ⊂ 2Ω to be a σ-algebra, which means F satisfies the following properties: (1) ∅, Ω ∈ F . (2) If A ∈ F , then Ac ∈ F . S (3) If A1 , A2 , · · · , An , · · · ∈ F , then ∞ j=1 Aj ∈ F . By DeMorgan’s law, when we have both (1) and (2), then (3) is equivalent to (4) below: T (4) If A1 , A2 , · · · , An , · · · ∈ F , then ∞ j=1 Aj ∈ F . When Ω is a countable set, that is, Ω = {a1 , · · · , aN } or Ω = {a1 , · · · , an , · · · }, and every one point set is in F (that is, for all j we have {aj } ∈ F ), then F = 2Ω in these cases. Remark 1.1.3 Definition 1.1.1 implies some easy properties: (a) If A ∈ F , then P (Ac ) = 1 − P (A). (b) If A1 , A2 , · · · , AN ∈ F , and Ai ∩ Aj = ∅ for all i 6= j, then PN S P( N j=1 P (Aj ). (c) If A, B ∈ F , A ⊂ B, then P (A) ≤ P (B). j=1 Aj ) = Example 1.1.4 Let Ω = {1, 2, 3, 4, 5, 6}, the set of all possible outcomes when rolling a die. We usually take F = 2Ω , and if it is a fair die we let P ({j}) = 1/6 for 1 ≤ j ≤ 6. Example 1.1.5 Let Ω = {(1, 1), (1, 2), · · · , (1, 6), (2, 1), · · · , (6, 6)}, the set of all possible outcomes when rolling two dice. Note that #Ω = 36. Here we take F = 2Ω , and if it is a fair die we let P ({(i, j)}) = 1/36 for 1 ≤ i, j ≤ 6. Exercise 1.1.6 Let A ∈ F be the event that the first rolling is even and the second one is not 5. Calculate #A and P (A). Example 1.1.7 Let Ω = {(♠10, ♣A, ♣7, ♦4, ♥J), (♦3, ♠2, ♥2, ♦2, ♦4), · · · , (♠3, ♥4, ♥2, ♠7, ♣A)} be the set of all possible outcomes when we draw 5 cards from a poker deck, where the order is considered. It’s not hard to see #Ω = 52 × 51 × 50 × 49 × 48. Now we let F = 2Ω , and the probability for each particular hand is 1/#Ω. 2 An introduction to probability Shiu-Tang Li Exercise 1.1.8 Let A be the event that your hand is a full house. Calculate #A and P (A) using the probability space given in Example 1.1.7. Example 1.1.9 We now consider a similar probability space to the one given in Example 1.1.7. Let Ω = {{♠10, ♣A, ♣7, ♦4, ♥J}, {♦3, ♠2, ♥2, ♦2, ♦4}, · · · , {♠3, ♥4, ♥2, ♠7, ♣A}} be the set of all possible outcomes when we draw 5 cards from a poker deck, where the order 52×51×50×49×48 is NOT considered. We discover that #Ω = 52 = . We let F = 2Ω , and the 5 5! probability for each particular hand is 1/#Ω. Exercise 1.1.10 Let A be the event that your hand is a full house. Calculate #A and P (A) using the probability space given in Example 1.1.9. Example 1.1.11 Tom is waiting for a bus. 
He knows that a bus comes every 15 minutes, but there is no fixed schedule. Let Ω = [0, 15], the collection of all possible waiting times. Intuitively, for each exact moment t ∈ [0, 15] we have P({t}) = 0, and we BELIEVE that for any A1 = [a, b], A2 = (a, b], A3 = [a, b), A4 = (a, b) ⊂ [0, 15], P(Aj) = (b − a)/15. To define a 'good' collection of events F we need knowledge from measure theory. Warning: in this case F is not 2^Ω.

Example 1.1.12 (Coin tossing) We keep flipping an unfair coin with probability p of heads and q of tails, where of course p + q = 1. The sample space Ω is defined to be the set of all possible sequences {(a1, a2, · · ·) : aj ∈ {T, H} for all j ∈ N}. Besides, we introduce a notation: (T, H, ·, ·, T, · · ·) := {(a1, a2, · · ·) : a1 = T, a2 = H, a5 = T, aj ∈ {T, H} for j ≠ 1, 2, 5}.

Now we define an increasing sequence of σ-algebras {Fi} on Ω as follows. Let F1 = {∅, Ω, (T, · · ·), (H, · · ·)}, F2 := {∅, Ω, (T, · · ·), (H, · · ·), (·, T, · · ·), (·, H, · · ·), (H, T, · · ·), (T, H, · · ·), (T, T, · · ·), (H, H, · · ·), (T, T, · · ·) ∪ (H, H, · · ·), (T, H, · · ·) ∪ (H, T, · · ·), (T, · · ·) ∪ (·, T, · · ·), (T, · · ·) ∪ (·, H, · · ·), (H, · · ·) ∪ (·, T, · · ·), (H, · · ·) ∪ (·, H, · · ·)}, and so on. Fj consists of all the events that can be assigned a probability when we toss the coin j times. For example, we may take F2 as the σ-algebra and define probabilities on events in F2, such as P({(T, H, · · ·)}) = pq and P({(H, H, · · ·)}) = p². Actually, the σ-algebra that we use on sequences is ⋁_{j=1}^∞ Fj, the smallest σ-algebra that includes every Fj. We'll see this later.

Example 1.1.13 (Standard Birthday Problem) [Saeed Ghahramani] What is the probability that at least two students in a class of size n have the same birthday? Compute the numerical values of such probabilities for n = 23, 30, 50, and 60. Assume that the birth rates are constant throughout the year and that each year has 365 days.

Sol. We may let the sample space Ω := {(k1, · · ·, kn) : kj ∈ {1, · · ·, 365}}, F = 2^Ω, and for each (k1, · · ·, kn) ∈ Ω, P({(k1, · · ·, kn)}) = 1/365^n. The probability that no two students have the same birthday is P(n) = (365 × 364 × · · · × (365 − (n − 1)))/365^n. When n = 23, 30, 50, and 60, the corresponding values of 1 − P(n) are 0.507, 0.706, 0.970, and 0.995, respectively.

Exercise 1.1.14 Let Ω = N, let the collection of events be F = {A : A is finite or A^c is finite}, and let P({n}) = 1/2^{n+1} for every n ∈ N. For any finite set An = {a1, · · ·, an} ∈ F, where ai ∈ N for all 1 ≤ i ≤ n, define P(An) = Σ_{i=1}^n P({ai}). For any set B ∈ F such that B^c is finite, define P(B) = 1 − P(B^c).
1. Show that for any A, B ∈ F, we have A^c, A ∩ B ∈ F.
2. Show that this model (Ω, F, P) satisfies (1) and (2) of Definition 1.1.1, and that P(A ∪ B) = P(A) + P(B) for A, B ∈ F with A ∩ B = ∅, but that it fails to possess (3) of Definition 1.1.1, namely the countable additivity property.

Sol of the last assertion. If we had countable additivity, then 1 = P(Ω) = P(⋃_{n∈N} {n}) = Σ_{n=1}^∞ P({n}) = 1/2, which is absurd.

1.2 Conditional probability

Definition 1.2.1 Let (Ω, F, P) be the probability space, and let A, B ∈ F be two events such that P(B) > 0. The conditional probability of A given B, denoted by P(A|B), is defined to be P(A ∩ B)/P(B). If P(B) = 0, then we define P(A|B) := 0.
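To make Definition 1.2.1 concrete, here is a small computational sketch (Python, standard library only) that computes a conditional probability by brute-force enumeration on the two-dice space of Example 1.1.5. The particular events A and B below are chosen only for illustration; they do not appear in the text.

```python
from fractions import Fraction

# Sample space of Example 1.1.5: ordered outcomes of two fair dice,
# each outcome having probability 1/36.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """P(event) under the uniform measure on the 36 outcomes."""
    return Fraction(len(event), len(omega))

# Illustrative events (chosen here for the example, not from the text):
# B = "the first die is even", A = "the two dice sum to 8".
B = {w for w in omega if w[0] % 2 == 0}
A = {w for w in omega if w[0] + w[1] == 8}

# Definition 1.2.1: P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.
p_A_given_B = prob(A & B) / prob(B)

print(prob(A))        # 5/36
print(prob(B))        # 1/2
print(p_A_given_B)    # 1/6  (= (3/36) / (1/2))
```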
Theorem 1.2.2 (Law of total probability) Let {Bn }n be a sequence of mutually disS P∞ joint events s.t. ∞ n=1 Bn = Ω. Then we have P (A) = n=1 P (A ∩ Bn ). Remark 1.2.3 If we let B1 = B, B2 = B c , Bn = ∅ for n ≥ 3, then we have P (A) = P (A ∩ B) + P (A ∩ B c ). Theorem 1.2.4 (Baye’s theorem) Let {Bn }n be a sequence of mutually disjoint events S P (A∩Bn ) n )P (Bn ) P∞ s.t. ∞ = P∞P (A|B . n=1 Bn = Ω. Then P (Bn |A) = P (A∩Bj ) P (A|Bj )P (Bj ) j=1 j=1 The following example reveals how the information given beforehand can “change” the probability of a certain event. Example 1.2.5 Draw two cards from a poker deck of 52 cards, without replacement. Determine the conditional probability that both cards are aces, given that (a) one of the cards is the ace of spades. (b) the second card is an ace. (c) at least one of the cards is an ace. Sol. (a)1/17. (b)1/17. (c)1/33. Example 1.2.6 (An application to Baye’s theorem) In a village, 10% of the population has some disease. A test is administered that if a person is sick, the test will be positive 95% of the time and if the person is not sick, then the test would still has a 20% chance to be positive. If the test result of John is positive, what’s the probability that he is infected? Sol. If John does not take the test, the probability that he is infected is 10%, since we do not have any information about him. Once we’re given some information, we could make a more precise judgement. Let A be the event that John is infected, and B be the event that his test result is posi(B|A)P (A) tive. We’re asked to calculate P (A|B). By Baye’s theorem, it equals P (B|A)PP(A)+P (B|Ac )P (Ac ) 95%·10% = 95%·10%+20%·90% ≈ 34.5%. Exercise 1.2.7 [Grinstead & Snell] Prove that if P (A|C) ≥ P (B|C) and P (A|C c ) ≥ P (B|C c ), then P (A) ≥ P (B). 5 An introduction to probability Shiu-Tang Li 1.3 Random variable Definition 1.3.1 Let (Ω, F, P ) be the probability space. A random variable, abbreviated as r.v., is a mapping X : Ω → R so that {X ∈ O} := {ω ∈ Ω : X(ω) ∈ O} ∈ F for any open set O ⊂ R. Remark 1.3.2 Random variables help us analyze many properties of the original abstract space (Ω, F, P ). The condition {X ∈ O} ∈ F is the “measurability condition” - the behavior of a random variable cannot be too wild. Remark 1.3.3 By the definition of a r.v. X, {X ∈ D} ∈ F for every closed set D of R. Example 1.3.4 Roll a die twice. Let Ω = {(1, 1), · · · , (1, 6), (2, 1), · · · , (6, 6)}, F = 2Ω , P ({(i, j)}) = 1/36 for all i, j. Let X be the total number of 1’s: It means X((1, 2)) = 1, X((2, 6)) = 0, X((1, 1)) = 2, and so forth. It’s also easy to see {(i, j) : X(i, j) ∈ O} ∈ F for all open set O in R. 6 An introduction to probability Shiu-Tang Li 1.4 (Cumulative) Distribution functions Let (Ω, F, P ) be the probability space, and X be some random variable on this space. Since {X ≤ x} ∈ F for all x ∈ R, FX (x) = P (X ≤ x) is a well-defined function on R. Definition 1.4.1 The function FX (x) defined above is called the (cumulative) distribution function of X. It is abbreviated as CDF. Before we proceed to the properties of CDFs, we introduce the idea of convergence of sets and derive some more properties of probability measures. Definition 1.4.2 A sequence of sets An is said to converge to a set A, denoted by An → A, or A = limn→∞ An , if the following holds: (1) For any x ∈ A, there exists N = N (x) ∈ N so that x ∈ An for all n ≥ N . (2) For any x ∈ / A, there exists M = M (x) ∈ N so that x ∈ / An for all n ≥ M . 
By the way, it’s not hard to see the limit limn→∞ An is unique if it exists. (Which means the above definition is well-defined.) Theorem 1.4.3 Let An be an increasing sequence of sets. Then limn→∞ An = T Similarly, if An is a decreasing sequence of events, then limn→∞ An = ∞ n=1 An . S∞ n=1 An . S Definition 1.4.4 Let An be a sequence of sets. We know that { ∞ n=N An }N is a deT∞ creasing sequence of events and { n=N An }N is an increasing sequence of events. By the previous theorem, both the limits of these two sequences exist, and we define lim supn An := T∞ S limN →∞ ∞ n=N An . n=N An and lim inf n An := limN →∞ Theorem 1.4.5 Let An be a sequence of sets. limn→∞ An exists if and only if lim supn An = lim inf n An . If any one of these equivalent conditions holds, we have furthermore lim supn An = lim inf n An = limn→∞ An . Proof. (⇒) Assume that x ∈ lim supn An and x ∈ / lim inf n An . Since x ∈ lim supn An , x ∈ An T for infinitely many n’s. Since x ∈ / lim inf n An , x ∈ / ∞ n=N An for any N ∈ N, which means S∞ c c x ∈ n=N An for any N ∈ N. Therefore, x ∈ An for infinitely many n’s. This proves An 9 A. (⇐) Assume that lim supn An = lim inf n An . We claim that A = limn An exists and T A = lim supn An = lim inf n An . If x ∈ A, then x ∈ lim inf n An , which implies x ∈ ∞ n=N An for some N ∈ N. So (1) of Definition 1.4.2 is satisfied. If x ∈ / A, then x ∈ / lim supn An , which S∞ T∞ implies x ∈ / n=N An for all N large, which is equivalent to x ∈ n=N Acn for all N large, and this is exactly (2) of Definition 1.4.2. 7 An introduction to probability Shiu-Tang Li Theorem 1.4.6 (Continuity in probability) Let An ∈ F for all n ∈ N , and An → A. Then P (An ) → P (A). S Proof. We first prove the case An ↑ A. By Theorem 1.4.3, A = ∞ n=1 An . We define B1 = A1 , and Bn = An \ An−1 for n ≥ 2. We find that {Bn } is a sequence of pairwise disjoint events S S S S∞ so that nj=1 Aj = nj=1 Bj . Let n → ∞, we have further ∞ A = j j=1 j=1 Bj . By countable additivity of P , we have lim P (Aj ) = lim P ( n→∞ n→∞ = lim n→∞ = P( n [ Aj ) = lim P ( n→∞ j=1 n X P (Bj ) = ∞ X j=1 ∞ [ j=1 n [ Bj ) j=1 P (Bj ) j=1 Bj ) = P ( ∞ [ Aj ) = P (A). j=1 Next we prove the case An ↓ A. Since Acn ↑ Ac , we have P (Acn ) = 1 − P (An ) → P (Ac ) = 1 − P (A), and therefore P (An ) → P (A). T A ) → P (lim inf n→∞ An ) = P (A) as N → ∞ (We’ve For the general case, since P ( ∞ S∞ n=N n used Theorem 1.4.5 here), P ( n=N An ) → P (lim supn→∞ An ) = P (A) as N → ∞, and S∞ T P( ∞ n=N An ) ≤ P (AN ) ≤ P ( n=N An ), by the squeezing theorem we have P (AN ) → P (A) as N → ∞. Now we are able to study some properties of FX . Theorem 1.4.7 Let FX (x) be the CDF of the random variable X. We have (1) FX (x) is increasing. (2) FX (x) is right-continuous for every x ∈ R. (3) FX (x−), the left limit of FX (x), exists for every x ∈ R. (4) FX (+∞) := limx→∞ FX (x) = 1. (5) FX (−∞) := limx→−∞ FX (x) = 0. (6) P ({X = x}) = FX (x) − FX (x−) for every x ∈ R. Proof. (1) If x > y, then {X ≤ x} ⊃ {X ≤ y}, and FX (x) = P (X ≤ x) ≥ P (X ≤ y) = FX (y). (2) Let xn ↓ x, since {X ≤ xn } → {X ≤ x} (why?), by Theorem 1.4.6 we have FX (xn ) → FX (x). 8 An introduction to probability Shiu-Tang Li (3) First we note that {X < x} ∈ F . For any xn ↑ x, {X ≤ xn } → {X < x}, so FX (xn ) = P (X ≤ xn ) → P (X < x). FX (x−) is therefore given by P (X < x). (4) If xn ↑ +∞, then {X ≤ xn } → {X ∈ R} = Ω, and by Theorem 1.4.6 we have FX (xn ) → P (Ω) = 1. (5) Exercise. (6) Since P ({X = x}) = P (X ≤ x) − P (X < x), the result follows from (3). 
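The properties in Theorem 1.4.7 can also be checked numerically on a finite example. The sketch below uses the r.v. X of Example 1.3.4 (the number of 1's in two rolls of a fair die), tabulates F_X at a few points, and verifies that the jump F_X(x) − F_X(x−) equals P(X = x). The helper names are ours, and the left limit is computed as F_X(x − 1) only because X is integer-valued.

```python
from fractions import Fraction

# X from Example 1.3.4: number of 1's when a fair die is rolled twice.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def X(w):
    return (w[0] == 1) + (w[1] == 1)

def cdf(x):
    """F_X(x) = P(X <= x) under the uniform measure on the 36 outcomes."""
    return Fraction(sum(1 for w in outcomes if X(w) <= x), 36)

# Properties (1), (4), (5): increasing, tends to 1 on the right, 0 on the left.
print([cdf(x) for x in (-1, 0, 1, 2, 3)])   # [0, 25/36, 35/36, 1, 1]

# Property (6): P(X = x) equals the jump F_X(x) - F_X(x-).
for x in (0, 1, 2):
    jump = cdf(x) - cdf(x - 1)              # left limit, since X is integer-valued
    pmf = Fraction(sum(1 for w in outcomes if X(w) == x), 36)
    print(x, jump, pmf, jump == pmf)        # True in each case
```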
Now an interesting question arises: if now we are given a right continuous, increasing function F : R → R s.t. F (+∞) = 1 and F (−∞) = 0, can we always find some probability space (Ω, F, P ) and an r.v. X so that FX (x) ≡ F (x)? Luckily, the answer is affirmative. The interested readers may check [Varadhan]. 9 An introduction to probability Shiu-Tang Li 1.5 Discrete random variables and continuous random variables 1.5.1 Definitions Definition 1.5.1 A r.v. X which takes its values on a finite set or a countably infinite set is said to be a discrete random variable. That is, there exists x1 , x2 , · · · ∈ R s.t. P∞ i=1 P (X = xi ) = 1. (By the definition of r.v.s, {X = xi } ∈ F for all i ∈ N. (why?)) Definition 1.5.2 Let X be a discrete r.v. Its (probability) mass function pX (x), or PMF, is defined as pX (x) := P (X = x). Remark 1.5.3 Therefore, when X is discrete, we have FX (x) = sum has only countably many nonzero terms. P y≤x pX (y), where the Definition 1.5.4 A r.v. X is said to be a continuous random variable with (probRx ability) density function (PDF) fX (x) if FX (x) = P (X ≤ x) = −∞ fX (y) dy for all x ∈ R, where fX (x) ≥ 0 for all x ∈ R and fX is Riemann integrable on R. Remark 1.5.5 The PDF fX (x) is not unique. For example, we may change the values fX (x) for finitely many x’s without changing the value of FX (x) for all x ∈ R. Actually, the PDFs of X form an equivalence class, with equivalence relation defined by f ∼ g if Rx Rx f (y) dy = g(y) dy for all x ∈ R. Therefore, when we’re talking about the PDF −∞ −∞ fX (x) of X, we’re referring to an arbitrary candidate in the equivalence class. R∞ Remarks 1.5.6 (1) −∞ fX (y) dy = 1. (Verify it.) (2) When X is continuous, FX (x) = Rx f (y) dy is a continuous function, and hence P (X = x) = FX (x) − FX (x−) = 0 for −∞ X every x ∈ R. (3) In a more general setting (requiring the knowledge of measure theory), the Riemann integrability of fX may be loosen to the Lebesgue integrability. Theorem 1.5.7 Let X be a continuous r.v. If FX0 (x0 ) exists for some x0 ∈ R and fX is continuous at x0 , then FX0 (x0 ) = fX (x0 ). Proof. Use the fundamental theorem of calculus. The following theorem provides us with a way to “verify” if a r.v. X is a continuous r.v. by simply looking at its CDF. It also helps us to find a PDF candidate of X. 10 An introduction to probability Shiu-Tang Li Theorem 1.5.8 Let X be r.v. with CDF FX . Assume that FX is continuous on R and FX0 exists for all but finitely many values of R. We have X is a continuous r.v., with its PDF defined by fX (x) = FX0 (x) for which FX0 (x) exists and fX (x) = 0 for which FX0 (x) does not exist. Proof. Assume that FX0 (x) does not exist on {x1 , · · · , xn }, with x1 < · · · < xn . For y 0 < y ≤ Ry x1 , by the fundamental theorem of calculus we have FX (y) − FX (y 0 ) = y0 FX0 (z) dz, letting Ry y 0 → −∞ we have FX (y) = −∞ FX0 (z) dz. For x1 < y ≤ x2 , FX (y) = (FX (y) − FX (x1 )) + Ry R x1 0 FX (z) dz, again by the fundamental theorem of calculus, where FX (x1 ) = x1 FX0 (z) dz + −∞ the continuity of FX plays a role. The rest of the proof is left to the readers. We have to “warn” the reader that not every r.v. X is either continuous or discrete. The reader is invited to think upon the following two examples. Example 1.5.9 FX (x) = 0 for all x ≤ 0, FX (x) = x/2 for 0 < x < 1, and FX (x) = 1 for all x ≥ 1. Even if FX (x) is continuous for every x ∈ R, it may happen that X is not a continuous r.v., as can be seen in the following example. 
Example 1.5.10 Let FX be the Cantor-Lebesgue function. It can be proved that no fX Rx exists so that FX (x) = −∞ fX (y) dy for every x ∈ R. Again, the proof requires measure theory. In 1.9 we’ll see many examples of discrete and continuous random variables, and discover their properties. 11 An introduction to probability Shiu-Tang Li 1.6 CDF/PDF/PMF of functions of random variables Let X be a r.v. defined on (Ω, F, P ), and g : R → R be a continuous function. It’s not hard to see that g(X) is a r.v. by definition, for continuous functions pull open sets back to open sets. In this section we’ll see through some examples about how we compute the CDF/PDF/PMF of Y = g(X). Theorem 1.6.1 Let X be a discrete r.v. s.t. some function. Then g(X) is a discrete r.v. P∞ i=1 P (X = xi ) = 1, and g : R → R be S Proof. For each z ∈ R s.t. z = g(xi ) for some i ∈ N, we have {g(X) = z} ⊃ xi ∈{y:g(y)=z} {X = P P S xi }. Summing up all these z’s gives us 1 = P (Ω) ≥ z P (g(X) = z) ≥ z P ( xi ∈{y:g(y)=z} {X = P P P P xi }) = z xi ∈{y:g(y)=z} P (X = xi ) = ∞ i=1 P (X = xi ) = 1, so z P (g(X) = z) = 1. We’ve used the fact that we may sum up an absolutely convergent series in any order. Remark 1.6.2 When X is continuous, It is possible that g(X) is continuous, discrete, or neither of them. We’ll find it in the following examples. Exercise 1.6.3 Let P (X = 1) = P (X = −1) = 1/6, P (X = 2) = P (X = −2) = 1/4, P (X = 0) = 1/6. Find the PMF of Y = X 2 . Example 1.6.4 Let X be a continuous r.v. with density fX (x) = CDF of Y = X 2 . 2 √1 e−x /2 . 2π Find the √ √ Sol. First, P (Y ≤ x) = 0 for x < 0. For x ≥ 0, P (Y ≤ x) = P (− x ≤ X ≤ x) = R √x −y2 /2 R −√x −y2 /2 1 √1 √ e dy − 2π −∞ e dy. 2π −∞ We find that FY (x) = P (Y ≤ x) is differentiable for all x 6= 0. When x < 0, FY0 (x) = 0. √1 · √1 e−x/2 = √1 · √1 e−x/2 . We may When x > 0, FY0 (x) = 21 · √1x · √12π e−x/2 − −1 · 2 x x 2π 2π therefore define the PDF fY (x) of Y by fY (0) := 0 and fY (x) = FY0 (x) for x 6= 0. Example 1.6.5 Let X be a continuous r.v. with density fX (x) = CDF of Y = g(X), where x if x > 0 g(x) = 0 otherwise . 2 √1 e−x /2 . 2π Find the Sol. First, FY (x) = P (Y ≤ x) = 0 for any x < 0. We also observe that P (Y ≤ 0) = R0 2 P (Y = 0) = P (X ≤ 0) = −∞ √12π e−y /2 dy = 1/2. For x > 0, we have FY (x) = P (Y ≤ x) = 12 An introduction to probability P (Y ≤ 0) + P (0 < Y ≤ x) = P (Y ≤ 0) + P (0 < X ≤ x) = 1/2 + Shiu-Tang Li Rx 0 2 √1 e−y /2 2π dy. We find that FY (x) is differentiable everywhere except for x = 0, and FY (x) has a jump of size 1/2 at x = 0. Since P (Y = y) > 0 only when y = 0, and P (Y = 0) = 1/2 < 1, Y cannot be a discrete r.v. Besides, if Y is continuous, then P (Y = y) = 0 for y ∈ R, so Y cannot be a continuous r.v. 1 Example 1.6.6 Let X be a continuous r.v. with density fX (x) = b−a · 1(a,b) (x). Find X the CDF of Y = e . (Rmk. 1A (x) := 1 if x ∈ A, and 1A (x) := 0 if x ∈ / A. It is called the indicator function of set A.) Sol. For x ≤ 0, FY (x) = P (eX ≤ x) = 0. For x > 0, FY (x) = P (eX ≤ x) = P (X ≤ R ln(x) 1 ln(x)) = −∞ b−a 1(a,b) (y) dy. 1 1 For x > 0, x 6= ea , eb , we have FY0 (x) = b−a · x1 · 1(a,b) (ln(x)) = b−a · x1 · 1(ea ,eb ) (x) (To apply chain rule to (f ◦ g)(x) at x0 , f must be differentiable at g(x0 )). For x < 0, FY0 (x) = 0. A “version” of the PDF of Y is given by fY (x) = FY0 (x) for x 6= 0, ea , eb , and fY (x) = 0 otherwise. 13 An introduction to probability Shiu-Tang Li 1.7 Independence 1.7.1 Independence of events Definition 1.7.1 Let our probability space be (Ω, F, P ). 
A collection of events {Eα }α∈I ⊂ F is said to be independent if for any finite subcollection Eαn1 , · · · , Eαnk , αni ∈ I, we have T P ( 1≤i≤k Eαni ) = Π1≤i≤k P (Eαni ). Remark 1.7.2 If we take the index set I = {1, 2}, the above definition reads E1 and E2 are independent if P (E1 )P (E2 ) = P (E1 ∩ E2 ); if I = {1, 2, 3}, then the definition reads E1 , E2 , and E3 are independent if P (E1 )P (E2 )P (E3 ) = P (E1 ∩ E2 ∩ E3 ), P (E1 )P (E2 ) = P (E1 ∩ E2 ), P (E1 )P (E3 ) = P (E1 ∩ E3 ), and P (E2 )P (E3 ) = P (E2 ∩ E3 ). Remark 1.7.3 Assume that P (E2 ) > 0, then E1 and E2 are independent if and only if P (E1 ) = P (E1 |E2 ). Definition 1.7.4 Let our probability space be (Ω, F, P ). A collection of events {Eα }α∈I ⊂ F is said to be pairwise independent if for any β, γ ∈ I, we have P (Eβ ∩ Eγ ) = P (Eβ ) · P (Eγ ). Remark 1.7.5 Obviously, independence implies pairwise independence. Example 1.7.6 (pairwise independence does not imply independence) Let Ω = {ω1 , ω2 , ω3 , ω4 }, F = 2Ω , P ({ωi }) = 1/4 for all i. We find that E1 = {ω1 , ω2 }, E2 = {ω2 , ω3 }, and E3 = {ω1 , ω3 } are pairwise independent, but not independent. Example 1.7.7 Let us look at another example. Let Ω = {ω1 , ω2 , ω3 , ω4 , ω5 , ω6 , ω7 , ω8 }, F = 2Ω , P ({ωi }) = 1/8 for all i, and E1 = {ω1 , ω2 , ω3 , ω4 }, E2 = {ω2 , ω3 , ω4 , ω5 }, and E3 = {ω3 , ω4 , ω5 , ω6 }. We have P (E1 )P (E2 )P (E3 ) = P (E1 ∩ E2 ∩ E3 ) = 1/8, but they’re not independent. 1.7.2 Independence of r.v.s Definition 1.7.8 Let the probability space be (Ω, F, P ). A collection of random variables {Xα }α∈I is said to be independent if for any finite subcollection Xαn1 , · · · , Xαnk , αni ∈ I, T we have P ( 1≤i≤k Xα−1 (Oi )) = Π1≤i≤k P (Xα−1 (Oi )) for arbitrary open set Oi ⊂ R. ni ni Definition 1.7.9 Let the probability space be (Ω, F, P ). A collection of random variables {Xα }α∈I is said to be pairwise independent if for any β, γ ∈ I, we have P (Xβ−1 (O1 ) ∩ Xγ−1 (O2 )) = P (Xβ−1 (O1 )) · P (Xγ−1 (O2 )) for any open sets O1 , O2 ⊂ R. 14 An introduction to probability Shiu-Tang Li Exercise 1.7.10 Show that we may replace all the open sets in Definition 1.7.8 and Definition 1.7.9 with closed sets. Exercise 1.7.11 Let X1 , · · · , Xn be independent r.v.s and f1 , · · · , fn : R → R be continuous functions. Show that f1 (X1 ), · · · , fn (Xn ) are independent. Definition 1.7.12 A collection of random variables {Xα }α∈I is said to be independent and identically distributed, abbreviated as i.i.d, if they are independent and they have the same CDF. Exercise 1.7.13 Flip a fair dice twice. Let X1 be the number of the first roll and X2 be the number in the second roll. If X1 and X2 are independent, show that P ({X1 = i}∩{X2 = j}) = 1/36 for all 1 ≤ i, j ≤ 6. Exercise 1.7.14 Show that A and B are independent if one of them is either ∅ or Ω. Exercise 1.7.15 Let X(ω) ≡ c, c ∈ R. By the previous exercise, show that X and Y are independent for arbitrary r.v. Y which is defined on the same probability space as is X. Exercise 1.7.16 Let Ω = {ω1 , · · · , ωk }, F = 2Ω , and P ({ωj }) > 0 for all 1 ≤ j ≤ k. Show that if X is independent of itself, then X(ω) ≡ c for some c ∈ R. Sol. Assume X is not a constant r.v. We take ωi 6= ωj , ωi , ωj ∈ Ω such that X(ω1 ) = a, X(ω2 ) = b and a 6= b. Since X is independent to itself, we have 0 = P (∅) = P (X −1 (a) ∩ X −1 (b)) = P (X −1 (a)) · P (X −1 (b)) ≥ P (ωj )P (ωk ) > 0, a contradiction. 
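Example 1.7.6 is easy to verify by direct enumeration. The following sketch checks pairwise independence and the triple-product condition of Definition 1.7.1 for E1, E2, E3 on Ω = {ω1, ω2, ω3, ω4}; the points of Ω are encoded as the integers 1 through 4 purely for convenience.

```python
from fractions import Fraction
from itertools import combinations

# Example 1.7.6: Omega = {1, 2, 3, 4}, each point with probability 1/4.
omega = {1, 2, 3, 4}

def prob(event):
    return Fraction(len(event & omega), 4)

E1, E2, E3 = {1, 2}, {2, 3}, {1, 3}
events = [E1, E2, E3]

# Pairwise independence: P(Ei ∩ Ej) = P(Ei) P(Ej) for every pair.
pairwise = all(prob(A & B) == prob(A) * prob(B)
               for A, B in combinations(events, 2))

# Full independence (Definition 1.7.1) also requires the triple product.
triple = prob(E1 & E2 & E3) == prob(E1) * prob(E2) * prob(E3)

print(pairwise)  # True:  each pair meets in one point, 1/4 = (1/2)(1/2)
print(triple)    # False: the triple intersection is empty, 0 != 1/8
```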
15 An introduction to probability Shiu-Tang Li 1.8 Expected value and variance of random variables Now that we’ve introduced two types of r.v.s - continuous and discrete, we may define the expected value and variance for these two types of r.v.s. 1.8.1 Discrete case P∞ Definition 1.8.1 Let X be a discrete r.v. s.t. P (X = xi ) = 1. The expected P∞ i=1 value of X, denoted by E[X], is defined to be i=1 xi × P (X = xi ), provided that P∞ i=1 |xi | × P (X = xi ) < ∞. Expected value, as its name suggests, indicates the “average” amount of all possible outcomes, weighted by probabilities. Example 1.8.2 Let (Ω, F, P ) be given as in Example 1.1.4, and let X be the number rolled. Compute E[X]. Example 1.8.3 (Double bet strategy) John flips a fair coin, and he could bet any amount of money in a single flip: if he flips a tail, he wins the amount of money he bets, or otherwise he loses the bet. He adopts the following strategy: he begins by betting 1 dollar, and if he loses, he bets 2 dollars, and if he loses again, he bets 4 dollars, and so on. Assume that he plays the game for at most n times. We construct the probability space as Ω = {T, HT, HHT, · · · , |H ·{z · · H} T, H · · H}}, | ·{z n-1 times n times Ω F = 2 , P (T ) = 1/2, P (HT ) = 1/4, and so on. Let X be the total amount of money John wins. To compute E[X], by definition, E[X] = P (T ) · 1 + P (T H) · 1 + P (H · · · HT ) · 1 + n n−1 n n−1 P (H · · · H) · (−1 − 2 − · · · − 2n ) = 2 +2 2n +···+1 − 2 +2 2n +···+1 = 0. If we assume that John could play infinitely many times and he is financially supported to do so, then we may construct the probability space as Ω = {T, HT, HHT, · · · , H · · · HT, · · · }. In this case, he would win 1 dollar whatever, and thus E[X] = 1, in this “impractical” model. P Definition 1.8.4 Let X be a discrete r.v. s.t. ∞ i=1 P (X = xi ) = 1. The variance of X, P∞ denoted by V ar(X), is defined to be i=1 (xi − E[X])2 × P (X = xi ), provided that E[X] exists. V ar(X) could be +∞. Variance shows the tendency of a r.v. concentrating to its mean (expected value). Example 1.8.5 Let Ω = {ω1 , ω2 , ω3 , ω4 , ω5 }, F = 2Ω , P (wj ) = 1/5 for 1 ≤ j ≤ 5. Let X, Y be two r.v.s such that X(ωj ) = 4 + j and Y (ωj ) = 10 + 3j 2 . Which one is larger? 16 An introduction to probability Shiu-Tang Li V ar(X) or V ar(Y )? 1.8.2 Continuous case Definition 1.8.6 Let X be a continuous r.v. with density fX . The expected value of R∞ R∞ X, denoted by E[X], is defined to be −∞ x·fX (x) dx, provided that −∞ |x|·fX (x) dx < ∞. Definition 1.8.7 Let X be a continuous r.v. with density fX . The variance of X, R∞ denoted by V ar(X), is defined to be −∞ (x − E[X])2 · fX (x) dx, provided that E[X] exists. V ar(X) could be +∞. 1.8.3 Properties of expected values P∞ Theorem 1.8.8 Let X be a discrete r.v. s.t. = x ) = 1, and g : R → R be i=1 P (X P∞ P∞ i some function. Then E[g(X)] = i=1 g(xi ) · P (X = xi ), if i=1 |g(xi )| · P (X = xi ) < ∞. P Proof. First we note by Theorem 1.6.1 that g(X) is discrete. It also says z∈A P (g(X) = z) = 1, where A := {y : y = g(xi ) for some i ∈ N}. So by definition we have E[g(X)] = P we take a closer look at the proof of Theorem 1.6.1, we’ll z∈A z · P (g(X) = z). If P find that P (g(X) = z) = = z) = xi ∈{y:g(y)=z} P (X = xi ). This means z · P (g(X) P∞ P i=1 g(xi ) · xi ∈{y:g(y)=z} g(xi )P (X = xi ). Summing over all such z’s gives us E[g(X)] = P (X = xi ). From 1.6 we know that for some continuous function g : R → R, and some continuous r.v. X, g(X) could be neither a continuous r.v. nor a discrete r.v. 
This means E[g(X)] cannot be defined in such case. To remedy this we need to impose some restrictions on g, as is seen in the following theorem. Definition 1.8.9 g : R → R is called a piecewise strictly monotone function if there exists a1 , · · · , ak ∈ R so that g is strictly monotone on (−∞, a1 ], [ak , ∞), and [ai , ai+1 ] for all 1 ≤ i ≤ k − 1. Theorem 1.8.10 Let X be a continuous r.v. with density fX (x), and g : R → R be a continuously differentiable and piecewise strictly monotone function. Then g(X) is a R∞ R∞ continuous r.v., and E[g(X)] = −∞ g(x) · fX (x) dx, if −∞ |g(x)| · fX (x) dx < ∞ (To make sure the integral is well-defined). Proof. For simplicity we consider the case that g is strictly increasing on (−∞, a] and [b, ∞) and strictly decreasing on [a, b]. Let g(a) = c > d = g(b). Besides, there exists a0 > a, b0 < b s.t. g(a0 ) = g(a) and g(b0 ) = g(b). d 1 P (g(X) ≤ x) = g0 (g−1 · For x < d, P (g(X) ≤ x) = P (X ≤ g −1 (x)). Therefore, dx (x)) −1 fX (g (x)). For d < x < c, there exists a1 < a2 < a3 s.t. g(a1 ) = g(a2 ) = g(a3 ) = x. 17 An introduction to probability Shiu-Tang Li We have P (g(X) ≤ x) = P (a2 ≤ X ≤ a3 ) + P (X ≤ a1 ). Taking derivatives we have d d P (g(X) ≤ x) = g0 (a1 3 ) · fX (a3 ) − g0 (a1 2 ) · fX (a2 ) + g0 (a1 1 ) · fX (a1 ). For x > c, dx P (g(X) ≤ dx 1 x) = g0 (g−1 · fX (g −1 (x)). (x)) Rd Rc x −1 It follows that E[g(X)] = −∞ g0 (g−1 ·f (g (x)) dx+ X (x)) d R∞ x x −1 · f (a ) dx + · f (g (x)) dx. X 1 X 0 0 −1 g (a1 ) c g (g (x)) x x ·fX (a3 )− g0 (a ·fX (a2 )+ g 0 (a3 ) 2) First, g is monotone on [b0 , ∞), so the change of variable formula is applicable and we Rd R b0 g(y) R b0 x −1 0 have −∞ g0 (g−1 · f (g (x)) dx = · f (y)g (y) dy = g(y) · fX (y) dy. Similarly, X X 0 (x)) −∞ R ∞ −∞ g (y) R∞ x −1 · fX (g (x)) dx = a0 g(y) · fX (y) dy. We now observe that when x goes from d c g 0 (g −1 (x)) to c, a3 goes from b to a0 , a2 goes from b to a, and a1 goes from b0 to a. By the change of Rc x R a0 Rc variable formula again, we have d g0 (a ·fX (a3 ) dx = b g(y)·fX (y) dy, d g0−x ·fX (a2 ) dx = (a2 ) 3) Rc x Rb Ra Ra − b g(y) · fX (y) dy = a g(y) · fX (y) dy, and d g0 (a1 ) · fX (a1 ) dx = b0 g(y) · fX (y) dy. Adding up everything, we have E[g(X)] = R∞ −∞ g(y) · fX (y) dy. The reason that Theorem 1.8.10 is powerful is because it provides us a quick way to compute E[g(X)], without finding the PDF for g(X). We invite the reader to compute E[g(X)] in Example 1.6.5 and Example 1.6.6, using and without using Theorem 1.8.10, respectively. We remind the reader that in a more general setting, E[X] is defined even X is neither continuous nor discrete. We can get rid of a lot of restrictions on g in Theorem 1.8.10. in the more general framework. Corollary 1.8.11 Let X be a continuous r.v. with density fX (x), then E[p(X)] = R∞ p(x) · f (x) dx, where p(x) is a polynomial with degree ≥ 1, provided that |p(x)| · X −∞ −∞ fX (x) dx < ∞. R∞ Corollary 1.8.12 Let X be a continuous r.v. with density fX (x), then E[aX + b] = R∞ (ax + b) · f (x) dx, where a, b ∈ R, a = 6 0, provided that |x| · fX (x) dx < ∞. X −∞ −∞ R∞ Corollary 1.8.13 Let X be a continuous r.v. with density fX (x), then E[g(X)] = R∞ x g(X)·f (x) dx, where g(x) = e or g(x) = |x| , provided that |g(x)|·fX (x) dx < ∞. X −∞ −∞ R∞ Corollary 1.8.14 Let X be a continuous r.v. with density fX (x), and V ar(X) < ∞. Then we have V ar(X) = E[(X − E[X])2 ] = E[X 2 ] − (E[X])2 . Proof. V ar(X) = E[(X − E[X])2 ] by Corollary 1.8.11. To see V ar(X) = E[X 2 ] − (E[X])2 , R∞ note that it is true when E[X] = 0. 
When E[X] 6= 0, V ar(X) = −∞ (x−E[X])2 ·fX (x) dx = R∞ 2 R∞ R 2 ∞ x · f (x) dx − 2E[X] x · f (x) dx + (E[X]) f (x) dx = E[X 2 ] − (E[X])2 . X X −∞ −∞ −∞ X 18 An introduction to probability Shiu-Tang Li R∞ R∞ We’ve used the fact that ∞ > −∞ 2(x − E[X])2 + 2(E[X])2 · fX (x) dx ≥ −∞ x2 · R∞ fX (x) dx, so that −∞ x2 · fX (x) dx = E[X 2 ] by Corollary 1.8.11. Exercise 1.8.15 Try to state parallel arguments of Corollary 1.8.11 - 1.8.14, when X is a discrete r.v. 19 An introduction to probability Shiu-Tang Li 1.9 Examples of discrete random variables and continuous random variables 1.9.1 Discrete r.v.s Example 1.9.1 (Bernoulli r.v.) P (X = 1) = p, P (X = 0) = 1 − p, 0 < p < 1. It means to flip a coin with probability p to get a head. X is the number of heads. We say X satisfies the Bernoulli distribution, denoted by X ∼ Bernoulli(p). E[X] = p · 1 + (1 − p) · 0 = p. V ar(X) = E[X 2 ] − (E[X])2 = (p · 12 + (1 − p) · 02 ) − p2 = p − p2 . Example 1.9.2 (Binomial r.v.) P (X = k) = n k k p (1 − p)n−k , 0 < p < 1, 0 ≤ k ≤ n. It means to flip n independent coins, each with probability p to get a head. X is the total number of heads. We say X satisfies the Binomial(n, p) distribution, denoted by X ∼ Binomial(n, p). P E[X] = 0≤k≤n k · (1 − p))n−1 = np. n! pk (1 k!(n−k)! − p)n−k = np P (n−1)! k−1 (1 1≤k≤n (k−1)!(n−k)! p − p)n−k = np(p + n! pk (1 − p)n−k − (np)2 k!(n − k)! 0≤k≤n X X n! n! k· k(k − 1) · pk (1 − p)n−k + pk (1 − p)n−k − (np)2 = k!(n − k)! k!(n − k)! 0≤k≤n 0≤k≤n V ar(X) = E[X 2 ] − (E[X])2 = = n(n − 1)p2 X k2 · (n − 2)! pk−2 (1 − p)n−k + np − (np)2 (k − 2)!(n − k)! 2≤k≤n X = n(n − 1)p2 (p + (1 − p))n−2 + np − (np)2 = np − np2 . Example 1.9.3 (Geometric r.v.) P (X = k) = p(1 − p)k−1 , 0 < p < 1, k ∈ N. It means to flip a coin with probability p to get a head, and X is the first time to get a head. We write X ∼ Geo(p). P P P k−1 E[X] = ∞ kp(1 − p)k−1 = ∞ p(1 − p)k−1 + (1 − p) · ∞ + (1 − p)2 · k=1 k=1 k=1 p(1 − p) P P∞ 1 k−1 k−1 + · · · = 1−(1−p) · ∞ = p1 . k=1 p(1 − p) k=1 p(1 − p) 20 An introduction to probability V ar(X) = = = = ∞ X k=1 ∞ X 2 k−1 k p(1 − p) Shiu-Tang Li ∞ ∞ X 1 2 X 2 1 k−1 −( ) = (k + k)p(1 − p) − kp(1 − p)k−1 − ( )2 p p k=1 k=1 (k 2 + k)p(1 − p)k−1 − k=1 ∞ X 1 1 − ( )2 p p 1 1 p(1 − p)k−1 2 + 4(1 − p) + 6(1 − p)2 + · · · − − ( )2 p p k=1 1 1 − . 2 p p Theorem 1.9.4 (Memoryless property of geometric r.v.s) Let X ∼ Geo(p). Then for any n, m ∈ N, P (X > m + n|X > m) = P (X > n). Conversely, if for any n, m ∈ N, P P (X > m + n|X > m) = P (X > n), and n∈N P (X = n) = 1, then X ∼ Geo(p) for some 0 < p < 1, or P (X = 1) = 1. Proof. The proof of the first assertion is omitted. For the second assertion, we let p = P (X = 1). By the assumption we have P (X > 2) = P (X > 1)2 = (1 − p)2 , which shows P (X = 2) = 1 − p − (1 − p)2 = p(1 − p). By induction, P (X = k) = P (X > k − 1) − P (X > k) = P (X > k − 1) − P (X > k − 1)P (X > 1) = p · (1 − p − p(1 − p) − · · · − p(1 − p)k−2 ) = p(1 − p)k−1 . If 0 < p < 1, then X ∼ Geo(p). If p = 1, then P (X = 1) = 1. It cannot happen that p = 0 from the above deductions, for this would make P (X = n) = 0 for all n ∈ N. Example 1.9.5 (Poisson r.v.) P (X = k) = X ∼ P oisson(λ). E[X] = P∞ V ar(X) = k=0 k· P∞ k=0 e−λ λk k! k2 · =λ e−λ λk k! P∞ k=1 e−λ λk−1 (k−1)! − λ2 = P∞ k=0 e−λ λk , k! λ > 0, k ∈ N. We write = λ. k(k − 1) · e−λ λk k! + λ − λ2 = λ. Theorem 1.9.6 below shows the Poisson r.v. is some kind of “limit” of binomial r.v.s. Theorem 1.9.6 Let X ∼ P oisson(λ), and Xn ∼ Binomial(n, nλ ) for all n ∈ N. 
Then for any k ∈ N ∪ {0}, P (X = k) = limn→∞ P (Xn = k). Proof. n! λk λ n−k (1 − ) k!(n − k)! nk n λ −k λk e−λ λk e−λ n(n − 1) · · · (n − k + 1) (1 − nλ )n = · · · (1 − ) → . k! nk e−λ n k! P (Xn = k) = 21 An introduction to probability Shiu-Tang Li 1.9.2 Continuous r.v.s Example 1.9.7 (Uniform r.v.) fX (x) = E[X] = 1 b−a V ar(X) = Rb a 1 b−a x dx = Rb a b2 −a2 2 · 1 b−a = x2 dx − ( a+b )2 = 2 1 1 (x), b−a [a,b] b > a. Write X ∼ U nif (a, b). a+b . 2 a2 +ab+b2 3 − a2 +2ab+b2 4 = (b−a)2 . 12 Example 1.9.8 (Exponential r.v.) fX (x) = λe−λx 1[0,∞) (x), λ > 0. Write X ∼ exp(λ). E[X] = R∞ 0 V ar(X) = ∞ R ∞ xλe−λx dx = −xe−λx 0 + 0 e−λx dx = λ1 . R∞ 0 x2 λe−λx dx − 1 λ2 ∞ R ∞ = −x2 e−λx 0 + 0 2xe−λx dx − 1 λ2 = 2 λ2 − 1 λ2 = 1 . λ2 Theorem 1.9.9 (Memoryless property of exponential r.v.s) Let X ∼ exp(λ). Then for any s, t ≥ 0, P (X > s + t|X > s) = P (X > t). Conversely, if for any s, t ≥ 0, P (X > s + t|X > s) = P (X > t), and P (X > 0) = 1, then X ∼ exp(λ) for some λ > 0. [Norris] Proof. (⇒) For s, t ≥ 0, P (X > s + t|X > s) = P (X>s+t) P (X>s) = e−λ(s+t) e−λs = e−λt = P (X > t). (⇐) Since P (X > 0) = 1, and P (X > n1 ) → P (X > 0) (Theorem 1.4.6), we may find some n ∈ N s.t. P (X > n1 ) > 0. Also, by the assumption we have P (X > 1) = P (X > 1 n ) > 0, and thus we may find some λ > 0 so that P (X > 1) = e−λ . n For any q ∈ Q+ , we may write q = m/n, m, n ∈ N. We find that P (X > m/n)n = P (X > m) = P (X > 1)m = e−λm , which implies P (X > q) = P (X > m/n) = e−λ·m/n = e−λq . Now for any t > 0, we may find pn ↑ t and qn ↓ t so that pn , qn ∈ Q+ for all n ∈ N. We have e−λpn = P (X > pn ) ≥ P (X > t) ≥ P (X > qn ) = e−λqn . Let n → ∞ we have P (X > t) = e−λt = 1 − FX (t) for t > 0. Therefore, fX (t) = FX0 (t) = λe−λt for t > 0. (x−µ)2 Example 1.9.10 (Normal r.v.) fX (x) = σ√12π e− 2σ2 , σ > 0, µ ∈ R. X ∼ N (µ, σ 2 ). X is called a standard normal r.v. if X ∼ N (0, 1). Exercise 1.9.11 Using the fact R∞ 2 e−x dx = −∞ √ π to prove R∞ √1 e− −∞ σ 2π (x−µ)2 2σ 2 Write dx = 1. Z ∞ (x−µ)2 y2 1 1 − 2 2σ E[X] = x· √ e dx = (y + µ) · √ e− 2σ2 dy σ 2π σ 2π −∞ Z−∞ Z ∞ ∞ 2 2 y y 1 1 √ e− 2σ2 dy = µ. = y · √ e− 2σ2 dy + µ σ 2π −∞ −∞ σ 2π Z ∞ 22 An introduction to probability Remarks. (1) 2 − y2 2σ y·e R∞ 0 y· Shiu-Tang Li y2 √1 e− 2σ2 σ 2π dy < ∞ (use limit comparison test of integrals). (2) is an odd function. ∞ Z ∞ (x−µ)2 y2 1 1 − 2 2σ V ar(X) = (x − µ) · √ e y 2 · √ e− 2σ2 dy dx = σ 2π σ 2π −∞ Z ∞ −∞ 2 2 y y y 1 ∞ √ e− 2σ2 dy = σ 2 . = √ · (−σ 2 )e− 2σ2 −∞ + σ 2 σ 2π −∞ σ 2π Z 2 Exercise 1.9.12 Let X ∼ N (0, 1). Using integration by parts formula to show E[X n ] = 0 if n = 1, 3, 5, · · · and E[X n ] = 1 · 3 · · · (2k − 1) if n = 2k, k ∈ N. Example 1.9.13 (Cauchy r.v.) fX (x) = Since R∞ −∞ 1 π · 1 . 1+x2 1 |x|· π1 · 1+x 2 = ∞ (prove it), E[X] does not exist. Hence V ar(X) does not exist. Example 1.9.14 (Gamma r.v.) fX (x) = R ∞ x−1 −t t e dt. Write X ∼ Gamma(α, β). 0 β α α−1 −βx x e 1(0,∞) , Γ(α) α, β > 0, Γ(x) := Remarks 1.9.15 (1) Γ(n) = (n − 1)! for all n ∈ N. E[X] = α/β, V ar(X) = α/β 2 , calculations omitted. (2) When α = 1, β = λ, Gamma(α, β) ∼ exp(λ). 23 An introduction to probability Shiu-Tang Li 1.10 Joint distributions 1.10.1 Definitions Definition 1.10.1 Let X1 , · · · , Xn be r.v.s on (Ω, F, P ). The joint distribution function FX1 ,··· ,Xn (x1 , · · · , xn ) is defined to be P (X1 ≤ x1 , · · · , Xn ≤ xn ). Definition 1.10.2 Let X1 , · · · , Xn be discrete r.v.s on (Ω, F, P ). 
The joint probability mass function, or joint P M F , is defined to be pX1 ,··· ,Xn (x1 , · · · , xn ) = P (X1 = x1 , · · · , Xn = xn ). Definition 1.10.3 Let X1 , · · · , Xn be continuous r.v.s on (Ω, F, P ). The joint probability density function, or joint P DF , is defined to be a Riemann integrable function R xn R x1 fX1 ,··· ,Xn : Rn → R s.t. the iterated integral −∞ · · · −∞ fX1 ,··· ,Xn (y1 , · · · , yn ) dy1 · · · dyn is well-defined and equals P (X1 ≤ x1 , · · · , Xn ≤ xn ), and fX1 ,··· ,Xn ≥ 0. We recall a theorem from advanced calculus, which provides us with a different way to look at the iterated integral: Theorem 1.10.4 [Folland] Let R = [a, b] × [c, d], and let f be an Riemann integrable function on R. Suppose that, for each y ∈ [c, d], the function fy defined by fy = f (x, y) Rb is integrable on [a, b], and the function g(x) = a f (x, y) dx is integrable on [c, d]. Then RR Rd Rb f dA = c [ a f (x, y) dx] dy. R Remark 1.10.5 As what we’ve mentioned in Remark 1.5.5, the joint PDF fX1 ,··· ,Xn is not unique - they are a collection of “candidate functions” in some equivalence class. 1.10.2 Basic properties Theorem 1.10.6 P (y1 < X1 ≤ x1 , y2 < X2 ≤ x2 ) = FX1 ,X2 (x1 , x2 ) − FX1 ,X2 (x1 , y2 ) − FX1 ,X2 (y1 , x2 ) + FX1 ,X2 (y1 , y2 ). Proof. Exercise. What about the case of n r.v.s? Theorem 1.10.7 P (X1 ≤ x1 , X2 = x2 ) = FX1 ,X2 (x1 , x2 ) − limz↑x2 FX1 ,X2 (x1 , z). Proof. Exercise. How to write P (y1 < X1 ≤ x1 , X2 = x2 , X3 > x3 ) in terms of FX1 ,X2 ,X3 ? Theorem 1.10.8 Assume that the joint PDF of X1 and X2 exists. Then P (y1 < X1 ≤ Rx Rx x1 , y2 < X2 ≤ x2 ) = y22 y11 fX1 ,X2 (z1 , z2 ) dz1 dz2 . Proof. The proof follows from Theorem 1.10.6. How do you extend it to multi-dimensional case? Is it OK if we replace < with ≤ in the identity? 24 An introduction to probability Shiu-Tang Li Theorem 1.10.9 Assume that the joint PDF of X1 , X2 , · · · , Xn exists. Then the PDF R∞ R∞ of Xk is given by g(yk ) = −∞ · · · −∞ fX1 ,··· ,Xn (y1 , · · · , yn ) dy1 · · · db yk · · · dyn , 1 ≤ k ≤ n. Ry Proof. Show that −∞ g(yk ) dyk = P (Xk ≤ y) for all y ∈ R. Theorem 1.10.10 Assume that the joint PDF of X1 , X2 , · · · , Xn exists. Then FX1 ,··· ,Xn (x1 , · · · , xn ) is a continuous function on Rn . Proof. By Theorem 1.4.6, P (x − δ < Xk ≤ x + δ) → P (Xk = x) = 0 as δ ↓ 0. It T follows that FX1 ,··· ,Xn (x1 + δ, · · · , xn + δ) − FX1 ,··· ,Xn (x1 − δ, · · · , xn − δ) = P ( 1≤k≤n {Xk ≤ T P xk +δ})−P ( 1≤k≤n {Xk ≤ xk −δ}) ≤ 1≤k≤n P (xk −δ < Xk ≤ xk +δ) → 0 as δ ↓ 0. We may select δ enough so that FX1 ,··· ,Xn (x1 + δ, · · · , xn + δ) − FX1 ,··· ,Xn (x1 − δ, · · · , xn − δ) < for any given > 0. Therefore, for any k(z1 , · · · , zn ) − (x1 , · · · , xn )k < δ, FX1 ,··· ,Xn (x1 , · · · , xn ) − ≤ FX1 ,··· ,Xn (z1 , · · · , zn ) ≤ FX1 ,··· ,Xn (x1 , · · · , xn ) + . This completes the proof. Remark 1.10.11 When X1 , · · · , Xn are discrete r.v.s of the same probability space (Ω, F, P ), The joint PMF always exists (exercise). But it is not the case for continuous r.v.s, as shown in the following example. Example 1.10.12 Let X ∼ N (0, 1) and Y (ω) = X(ω) for every ω ∈ Ω. Show that the joint PDF of X and Y does not exist. Sol. Assume the joint PDF f (x, y) of X, Y exists. Let An,j,k := { 2jn < X ≤ j+1 , k < 2n 2n P Y ≤ k+1 }, j, k ∈ Z, n ∈ N. For each n ∈ N, we have j,k ( 21n )2 · sup{f (x, y) : (x, y) ∈ 2n R∞ R∞ ] × ( 2kn , k+1 ]} ↓ −∞ −∞ f (x, y) dx dy = 1 as n → ∞, since f (x, y) is Riemann inte( 2jn , j+1 2n 2n grable on R2 . Therefore, we may select N ∈ N s.t. k k+1 ( 2N , 2N ]} ≤ 32 . 
P 1 2 j,k ( 2N ) · sup{f (x, y) : (x, y) ∈ ( 2jN , j+1 ]× 2N P Besides, 1 = P (Ω) = P (∪j,k AN +1,j,k ) = P (∪j AN +1,j,j ) = j P (AN +1,j,j ) P R (j+1)/2N +1 R (j+1)/2N +1 P j j+1 1 2 f (x, y) dx dy ≤ = j j/2N +1 N +1 j ( 2N +1 ) · sup{f (x, y) : (x, y) ∈ ( 2N +1 , 2N +1 ] × j/2 P 1 2 j j+1 j j+1 1 3 ( 2Nj+1 , 2j+1 N +1 ]} ≤ 4 ·2· j ( 2N ) ·sup{f (x, y) : (x, y) ∈ ( 2N , 2N ]×( 2N , 2N ]} ≤ 4 , a contradiction. 1.10.3 Properties about expected values The following theorem generalizes Theorem 1.8.8, and the proof is omitted here. P P∞ Theorem 1.10.13 Let X1 , · · · , Xn be discrete r.v.s s.t. ∞ P (X1 = xj1 , · · · , Xn = jn =1 j1 =1 P∞ P n xjn ) = 1, and g : R → R be some function. Then E[g(X1 , · · · , Xn )] = jn =1 ∞ j1 =1 g(xj1 , · · · , xjn ) P∞ P∞ ·P (X1 = xj1 , · · · , Xn = xjn ), if jn =1 j1 =1 |g(xj1 , · · · , xjn )|P (X1 = xj1 , · · · , Xn = xjn ) < 25 An introduction to probability Shiu-Tang Li ∞. The following theorem generalizes Theorem 1.8.10. It is extremely complicated, or almost impossible, to prove it with advanced calculus. When we’ve learned measure theory, it then becomes tractable. The proof is again omitted here. Theorem 1.10.14 Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ), and g : Rn → R be a continuous function. Then g(X1 , · · · , Xn ) is a r.v., and if g(X1 , · · · , Xn ) is either continuous or discrete, then Z ∞ Z ∞ g(x1 , · · · , xn ) · fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn , ··· E[g(X1 , · · · , Xn )] = −∞ provided that R∞ −∞ ··· R∞ −∞ −∞ |g(x1 , · · · , xn )| · fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn < ∞. Example 1.10.15 Let X, Y be continuous r.v.s with joint PDF fX,Y (x, y). Then by R∞ R∞ Theorem 1.10.14, for all g : R → R continuous, E[g(X)] = −∞ −∞ g(x) · fX,Y (x, y) dx dy = R∞ R∞ R∞ g(x) · ( −∞ fX,Y (x, y) dy) dx = −∞ g(x) · fX (x) dx. This shows Theorem 1.10.14 is com−∞ patible with what we’ve learned before (see Theorem 1.8.10). Theorem 1.10.16 Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ). Then we have E[a1 X1 + · · · + an Xn ] = a1 E[X1 ] + · · · + an E[Xn ], where ak ∈ R for 1 ≤ k ≤ n, provided that E[|Xk |] < ∞ for all 1 ≤ k ≤ n. Proof. By Theorem 1.10.14, Z ∞ Z ∞ E[a1 X1 + · · · + an Xn ] = ··· (a1 x1 + · · · + an xn ) · fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn −∞ −∞ Z ∞ Z ∞ X = ak ··· xk · fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn 1≤k≤n −∞ −∞ = a1 E[X1 ] + · · · + an E[Xn ]. Remark 1.10.17 The above theorem also holds for X1 , · · · , Xn discrete (prove it). Actually, with a more general definition of expected values of r.v.s, Theorem 1.10.16 holds for arbitrary r.v.s that live on the same probability space. Example 1.10.18 (Coupon collector’s problem) Suppose there n different basketball player cards, and you want to collect all of them. The rule is as follows: each time you cannot buy the player card you want, but instead buy a random one, since the cards are all sealed. 26 An introduction to probability Shiu-Tang Li What is the expected number of cards that you have to buy to get a full collection of n cards? Sol. Let Xk be the number of cards you have to buy to raise the number of kinds of player cards from k − 1 to k. Obviously X1 = 1. It’s not hard to see X2 ∼ Geo( n−1 ), and n n−2 X3 ∼ Geo( n ), and so on. Therefore, the expected number of cards we have to buy is n n E[X1 + · · · + Xn ] = E[X1 ] + · · · + E[Xn ] (by the previous remark) = 1 + n−1 + n−2 + · · · + n1 , which equals n ln(n) + γ · n + 12 + o(1) as n becomes large. 
Definition 1.10.19 Let X, Y be either both continuous r.v.s with joint density fX,Y (x, y) or both discrete r.v.s. We define the covariance of X and Y to be E[(X−E[X])·(Y −E[Y ])], written by Cov(X, Y ), provided that E[|X|] < ∞, E[|Y |] < ∞, and E[|X − E[X]| · |Y − E[Y ]|] < ∞. Remark 1.10.20 In general, we may define Cov(X, Y ) for any r.v.s X and Y . We’ll talk about this in the measure-theoretic probability course. Theorem 1.10.21 Cov(X, Y ) = E[XY ] − E[X] · E[Y ]. Proof. We present the proof when X, Y are continuous with joint density f . By Theorem 1.10.14, Cov(X, Y ) = E[(X − E[X]) · (Y − E[Y ])] Z ∞Z ∞ = (x − E[X])(y − E[Y ])f (x, y) dx dy −∞ −∞ Z ∞Z ∞ (xy − yE[X] − xE[Y ] + E[X]E[Y ])f (x, y) dx dy = −∞ −∞ = E[XY ] − E[X] · E[Y ]. 1.10.4 Independence and joint distributions Theorem 1.10.22 Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ), and each Xk has PDF fXk (xk ) for 1 ≤ k ≤ n. If fX1 ,··· ,Xn (x1 , · · · , xn ) = Π1≤k≤n fXk (xk ), then X1 , · · · , Xn are independent; conversely, if X1 , · · · , Xn are independent continuous r.v.s, then one “version” of their joint PDF is given by Π1≤k≤n fXk (xk ). 27 An introduction to probability Shiu-Tang Li Proof. (⇒) We first notice that (By a slight revision of Theorem 1.10.8) Z bn Z b1 fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn ··· P (a1 < X1 < b1 , · · · , an < Xn < bn ) = a1 Z b1 an Z bn fX1 (x1 ) · · · fXn (xn ) dx1 · · · dxn ··· = a1 an =Π1≤k≤n P (ak < Xk < bk ), which proves the case P (X1 ∈ O1 , · · · , Xn ∈ On ) = Π1≤k≤n P (Xk ∈ Ok ) for Ok = (ak , bk ). We then consider O1 to be a general open set in R, and Ok = (ak , bk ) for k ≥ 2. Since S we may write O1 as a countable disjoint union of open sets, say O1 = ∞ j=1 (a1j , b1j ) (see [Munkres]; the union may include infinite open intervals), thus P (X1 ∈ O1 , · · · , Xn ∈ On ) = ∞ X P (X1 ∈ (a1j , b1j ), · · · , Xn ∈ On ) j=1 = ∞ X P (X1 ∈ (a1j , b1j )) × · · · × P (Xn ∈ On ) j=1 =P (X2 ∈ O2 ) · · · P (Xn ∈ On ) × ∞ X P (X1 ∈ (a1j , b1j )) j=1 =P (X1 ∈ O1 ) · · · P (Xn ∈ On ). Next we consider O1 , O2 to be general open sets in R, and Ok = (ak , bk ) for k ≥ 3, and then O1 , O2 , O3 to be general open sets in R, and so on, following the same deductions. (⇐) Since P (X1 ≤ a1 , · · · , Xn ≤ an ) =P (X1 < a1 , · · · , Xn < an ) (why?) =P (X1 < a1 ) · · · P (Xn < an ) Z a1 Z an fXn (xn ) dxn = fX1 (x1 ) dx1 × · · · × −∞ −∞ Z an Z a1 = ··· fX1 (x1 ) · · · fXn (xn ) dx1 · · · dxn , −∞ −∞ it follows that Π1≤k≤n fXk (xk ) is a version of their joint PDF. Corollary 1.10.23 If X1 , · · · , Xn are independent continuous r.v.s, then their joint PDF exists. 28 An introduction to probability Shiu-Tang Li Theorem 1.10.24 (Independence implies zero correlation) Let X, Y be independent r.v.s. If they are both continuous or both discrete, then Cov(X, Y ) = 0. Proof. (Continuous case) By theorem 1.10.22, the joint PDF of X,Y is given by fX (x)fY (y). By Theorem 1.10.14 and Theorem 1.10.21, Cov(X) =E[XY ] − E[X]E[Y ] Z ∞Z ∞ Z = xyfX (x)fY (y) dx dy − −∞ −∞ ∞ Z ∞ xfX (x) dx · −∞ yfY (y) dy −∞ =0. (Discrete case) Let X, Y be discrete r.v.s s.t. have P∞ P∞ i=1 j=1 P (X = xi , Y = yj ) = 1. We Cov(X) =E[XY ] − E[X]E[Y ] ∞ X ∞ ∞ ∞ X X X = xi yj P (X = xi , Y = yj ) − xi P (X = xi ) · yj P (Y = yi ) = i=1 j=1 ∞ X ∞ X i=1 j=1 i=1 xi yj P (X = xi )P (Y = yj ) − ∞ X i=1 j=1 xi P (X = xi ) · ∞ X yj P (Y = yi ) (by independence) j=1 =0. Remark 1.10.25 The above theorem can be generalized to arbitrary independent r.v.s X, Y . 
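Before turning to the counterexample below, here is a quick numerical illustration of Theorem 1.10.21 and Theorem 1.10.24: for two independent fair dice (an example chosen here, not taken from the text), the covariance computed as E[XY] − E[X]E[Y] is exactly zero. Exact rational arithmetic is used so no rounding is involved.

```python
from fractions import Fraction

# Two independent fair dice: joint PMF p(i, j) = 1/36 for 1 <= i, j <= 6.
values = range(1, 7)
p = Fraction(1, 36)

E_X  = sum(i * p for i in values for j in values)       # 7/2
E_Y  = sum(j * p for i in values for j in values)       # 7/2
E_XY = sum(i * j * p for i in values for j in values)   # 49/4

# Theorem 1.10.21: Cov(X, Y) = E[XY] - E[X] E[Y];
# Theorem 1.10.24: it vanishes here because X and Y are independent.
cov = E_XY - E_X * E_Y
print(E_X, E_Y, E_XY, cov)   # 7/2 7/2 49/4 0
```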
Example 1.10.26 (Zero correlation does not imply independence) Let Ω = {ω1 , ω2 }, P (ω1 ) = 2/3, P (ω2 ) = 1/3, X(ω1 ) = 1, X(ω2 ) = −2, Y (ω1 ) = 2, Y (ω2 ) = 1. It’s not hard to see Cov(X, Y ) = E[XY ] − E[X]E[Y ] = 0 − 0 · 5/3 = 0, but P (X = 1, Y = 2) = P (ω2 ) 6= P (ω2 )2 = P (X = 1) · P (Y = 2). Theorem 1.10.27 Let X1 , · · · , Xn be independent continuous r.v.s or independent disP crete r.v.s. Then V ar(X1 + · · · + Xn ) = ni=1 V ar(Xi ). 29 An introduction to probability Shiu-Tang Li Proof. We’ll prove the continuous case only. By Theorem 1.10.14, V ar(X1 + · · · + Xn ) = E[(X1 + · · · + Xn − E[X1 + · · · + Xn ])2 ] Z ∞ Z ∞ (x1 + · · · + xn − E[X1 + · · · + Xn ])2 fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn ··· = −∞ ∞ −∞ ∞ Z Z ··· = −∞ = n X X n X 2 (xj − E[Xj ]) + 2 (xi − E[Xi ])(xj − E[Xj ]) fX1 (x1 ) · · · fXn (xn ) dx1 · · · dxn −∞ j=1 1≤i<j≤n V ar(Xi ). i=1 1.10.5 Density transformation formula Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ). We’re interested in computing the joint PDF of g1 (X1 , · · · , Xn ), · · · , gn (X1 , · · · , Xn ) given that g1 , · · · , gn are “nice” enough. For example, if X, Y are continuous r.v.s with joint density f , how do we compute the joint PDF of U = X + Y and V = XY ? We would learn the technique in this section. Let us first recall the inverse mapping theorem (or inverse function theorem) from [Folland], and the change-of-variable formula from [Folland] (with some revisions). Theorem 1.10.28 (Inverse function theorem) Let U and V be open sets in Rn , a ∈ U , and b = f (a). Suppose that f : U → V is a mapping of class C 1 and the Frechet ∂f |x=a is nonzero). Then there exist neighderivative f 0 (a) is invertible (that is, the Jacobian ∂x borhoods M ⊂ U and N ⊂ V of a and b, respectively, so that f is a one-to-one map from M to N , and the inverse map f −1 from N to M is also of class C 1 . Moreover, if y = f (x) ∈ N , −1 then (f −1 )0 (y) = f 0 (x) . Theorem 1.10.29 (Change-of-variable formula) Given open sets U and V in Rn , let G : U → V be a one-to-one transformation of class C 1 whose derivative G 0 (u) is invertible for all u ∈ U . If f is integrable on G (U ), then f ◦ G is integrable on U , and Z Z Z Z ∂G n n ··· f (x) d x = · · · f (G (u)) · | |d u. ∂u G (U ) U And now it’s time for the main course of this section. Theorem 1.10.30 (Density transformation formula) Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ), and let g : Rn → Rn be a C 1 -mapping 30 An introduction to probability Shiu-Tang Li s.t. g 0 (x) is invertible for all x ∈ Rn , and g is 1-1 with inverse h. Then the joint PDF of (U1 , · · · , Un ) := g (X1 , · · · , Xn ) exists, and it is given by fU1 ,··· ,Un (u1 , · · · , un ) = ∂h fX1 ,··· ,Xn (h(u1 , · · · , un ))1g (Rn ) (u1 , · · · , un ) · | ∂u |. Proof. Fix an arbitrary open set O in Rn . By Theorem 1.10.28, g (Rn ) is an open set (why?), and h is of class C 1 on g (Rn ) so that h 0 (u) is invertible for all u ∈ g (Rn ). Besides, g −1 (O) = h (O ∩ g (Rn )) is an open set, since g is a continuous function. We have P ((U1 , · · · , Un ) ∈ O) =P (g (X1 , · · · , Xn ) ∈ O) =P ((X1 , · · · , Xn ) ∈ g −1 (O)) Z Z = ··· fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn (why?) 
g −1 (O) Z Z = ··· fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn h (O∩g (Rn )) Z Z ∂h fX1 ,··· ,Xn (h(u1 , · · · , un )) · | | du1 · · · dun (Theorem 1.10.29) = ··· ∂u O∩g (Rn ) Z Z ∂h = · · · fX1 ,··· ,Xn (h(u1 , · · · , un ))1g (Rn ) (u1 , · · · , un ) · | | du1 · · · dun . ∂u O If we take O = {(u1 , · · · , un ) : ui < ai }, then it follows that fX1 ,··· ,Xn (h(u1 , · · · , un ))1g (Rn ) (u1 , · · · , un )· ∂h | ∂u | is a version of the joint PDF of (U1 , · · · , Un ). Theorem 1.10.31 (Density transformation formula, a slight generalization) Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ), and let g : E → Rn be a C 1 -mapping s.t. g 0 (x) is invertible for all x ∈ E, where E is an open set in Rn , and g is 1-1 with inverse h. Besides, we assume P ((X1 , · · · , Xn ) ∈ / E) = 0. Then the joint PDF of (U1 , · · · , Un ) := g (X1 , · · · , Xn ) exists, and it is given by fU1 ,··· ,Un (u1 , · · · , un ) = ∂h fX1 ,··· ,Xn (h(u1 , · · · , un ))1g (E) (u1 , · · · , un ) · | ∂u |. Proof. Fix an arbitrary open set O in Rn . By Theorem 1.10.28, g (E) is an open set, and h is of class C 1 on g (E) so that h 0 (u) is invertible for all u ∈ g (E). Besides, g −1 (O) ∩ E = h (O ∩ g (Rn )) ∩ h (g (E)) = h (O ∩ g (E)) is an open set, since g is a continuous function. 31 An introduction to probability Shiu-Tang Li We have P ((U1 , · · · , Un ) ∈ O) =P (g (X1 , · · · , Xn ) ∈ O) =P ((X1 , · · · , Xn ) ∈ g −1 (O)) Z Z = ··· fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn g −1 (O) Z Z = ··· fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn g −1 (O)∩E Z Z = ··· fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn h (O∩g (E)) Z Z ∂h = ··· fX1 ,··· ,Xn (h(u1 , · · · , un )) · | | du1 · · · dun (Theorem 1.10.29) ∂u O∩g (E) Z Z ∂h = · · · fX1 ,··· ,Xn (h(u1 , · · · , un ))1g (E) (u1 , · · · , un ) · | | du1 · · · dun . ∂u O If we take O = {(u1 , · · · , un ) : ui < ai }, then it follows that fX1 ,··· ,Xn (h(u1 , · · · , un ))1g (E) (u1 , · · · , un )· | is a version of the joint PDF of (U1 , · · · , Un ). | ∂h ∂u Example 1.10.32 Let X and Y be two independent standard normal random variables. What is the joint density of U = X + Y and V = X − Y ? Sol. Note that g : (x, y) 7→ (x + y, x − y) is a C 1 mapping: R2 → R2 , and g 0 (x, y) = 1 1 is invertible for all (x, y) ∈ R. To compute inverse h of g, we let u = x + y, v = 1 −1 and y = u−v , and thus we have h : (u, v) 7→ ( u+v , u−v ), where x − y, so that x = u+v 2 2 2 2 1/2 1/2 2 h 0 (u, v) = . Also we notice that fX,Y (x, y) = fX (x)fY (y) = √12π e−x /2 · 1/2 −1/2 2 /2 1 −y √ e , and g (R2 ) = R2 . By Theorem 1.10.30, 2π R∞ ∂h fU,V (u, v) =fX,Y (h(u, v))1g (R2 ) (u, v) · | | ∂u u+v u − v 1/2 1/2 )fY ( ) · det =fX ( 1/2 −1/2 2 2 1 1 1 −u2 /4−v2 /4 2 2 = · e−(u+v) /8 · e−(u−v) /8 · = e . 2π 2 4π R ∞ 1 −u2 /4−v2 /4 Here is an interesting fact. By Theorem 1.10.9, fU (u) = −∞ 4π e dv = 2 1 √ e−v /4 −∞ 2 π 2 2 1 √ e−u /4 · 2 π 2 dv = 2√1 π e−u /4 , and similarly fV (v) = 2√1 π e−v /4 . It follows that fU,V (u, v) = fU (u)fV (v), which shows U = X + Y and V = X − Y are independent by Theorem 1.10.22. 32 An introduction to probability Shiu-Tang Li Example 1.10.33 Let X and Y be continuous independent random variables with joint PDF fX,Y . Compute the PDF of X + Y . Sol. Construct a C 1 -mapping g : R2 → R2 , (x, y) 7→ (x + y, y), and note that g (R2 ) = R2 1 1 0 and g (x, y) = is invertible for all (x, y) ∈ R. 
To compute inverse h of g, we let 0 1 u = x + y, v = y, so that x = u − v and y = v, and thus we have h : (u, v) 7→ (u − v, v), 1 −1 0 where h (u, v) = . By Theorem 1.10.30, 0 1 ∂h fU,V (u, v) =fX,Y (h(u, v))1g (R2 ) (u, v) · | | ∂u 1 −1 =fX,Y (u − v, v) · det 0 1 =fX,Y (u − v, v), so the PDF is given by fU (u) = R∞ −∞ fU,V (u, v) dv = R∞ −∞ fX,Y (u − v, v) dv. Remark 1.10.34 (Convolution Theorem) In the previous example, if X,Y are inR∞ dependent, then the PDF of X +Y is given by fU (u) = −∞ fX (u−v)·fY (v) dv = (fX ∗fY )(u). Example 1.10.35 Let X and Y be independent, standard normal random variables. Find the distribution of X/Y . Sol. Construct a C 1 -mapping g : E → R2 , (x, y) 7→ (x/y, y), where E := {(x, y) ∈ R2 : y 6= 0}, and P ((X, Y ) ∈ / E) = P (Y 6= 0) = 0. Besides, g (E) = E and g 0 (x, y) = 1/y −x/y 2 is invertible for all (x, y) ∈ E. To compute inverse h of g, we let u = 0 1 x/y, v = y, so that x = uv and y = v, and thus we have h : (u, v) 7→ (uv, v), where v u 0 h (u, v) = , which is invertible for all (u, v) ∈ g (E) = E. By Theorem 1.10.31, 0 1 fU,V (u, v) =fX,Y (h(u, v))1g (E) (u, v) · | ∂h | ∂u ∂h =fX,Y (h(u, v))| | (why?) ∂u v u =fX,Y (uv, v) · det 0 1 =fX (uv)fY (v)|v| 1 2 2 2 = e−u v /2 · e−v /2 |v|. 2π 33 An introduction to probability As a result, fU (u) = R f (u, v) dv = 2 R U,V Shiu-Tang Li R∞ 0 1 −u2 v 2 /2 −v 2 /2 v e e 2π 1 = π1 · 1+u 2 (X/Y is Cauchy!), and FX/Y (x) = P (X/Y ≤ x) = 1 (tan−1 (x) π + π2 ). Rx v=∞ v=0 x 1 −1 du = π tan (u) = 2 )v 2 /2 −1 −(1+u dv = π1 · 1+u 2e 1 · 1 −∞ π 1+u2 −∞ 34 An introduction to probability Shiu-Tang Li 1.11 Moment generating functions Definition 1.11.1 Let X be a discrete r.v. or a continuous r.v. The moment generating function (abbreviated as MGF) of X, is defined to be MX (t) := E[etX ] for t ∈ R. P Remarks 1.11.2 (1) MX (0) = 1. (2) Let t 6= 0. When X is discrete, and ∞ i=1 P (X = P∞ t·xi xi ) = 1, then MX (t) = i=1 e P (X = xi ) by Theorem 1.8.8. When X is continuous with R∞ PDF fX (x), then MX (t) = −∞ etx fX (x) dx by Theorem 1.8.10. We’ll soon see the power of MGF in the next few theorems. Theorem 1.11.3 (MGF determines CDF) If X and Y are both continuous r.v.s or both discrete r.v.s with MGF MX and MY respectively, and there exists an open interval I containing 0 so that MX (t) = MY (t) for all t ∈ I, then the CDF of X and Y are the same, namely FX (x) ≡ FY (x). Proof. The proof is omitted. The interested reader could see [Varadhan] for a proof, which requires the knowledge of characteristic functions. Theorem 1.11.4 Let X1 , · · · , Xn be independent continuous r.v.s or independent discrete r.v.s. If we let Y = X1 + · · · + Xn , then MY (t) = Πni=1 MXi (t) for all t ∈ R. Proof. We prove the continuous case only. By Theorem 1.10.14 and Theorem 1.10.21, if t is chosen s.t. E[etX1 ] < ∞, then Z ∞ Z ∞ tY MY (t) =E[e ] = ··· et(x1 +···+xn ) · fX1 ,··· ,Xn (x1 , · · · , xn ) dx1 · · · dxn −∞ Z ∞ Z −∞ ∞ = ··· et(x1 +···+xn ) · fX1 (x1 ) · · · fXn (xn ) dx1 · · · dxn −∞ Z ∞ −∞ =Πni=1 etxi fXi (xi ) dxi = Πni=1 MXi (t). −∞ Otherwise, if t is chosen s.t. E[etX1 ] = ∞, then MY (t) = Πni=1 MXi (t) = ∞. Table 1.11.5 Here we list the MGF of some common distributions. The readers are invited to compute any of them if interested. 
Distribution      MGF M(t)
Bernoulli(p)      1 − p + p e^t
Binomial(n, p)    (1 − p + p e^t)^n
Geo(p)            p e^t / (1 − (1 − p) e^t)  for t < −ln(1 − p);  +∞ otherwise
Poisson(λ)        e^{λ(e^t − 1)}
Unif(a, b)        (e^{tb} − e^{ta}) / (t(b − a))
exp(λ)            1 / (1 − t/λ)  for t < λ;  +∞ otherwise
N(µ, σ²)          e^{tµ + σ²t²/2}
Cauchy            does not exist (+∞ for all t ≠ 0)
Gamma(α, β)       (1 − t/β)^{−α}  for t < β

Example 1.11.6 (Important!) Let X1, · · ·, Xn be independent r.v.s, and Y = X1 + · · · + Xn. Based on Theorem 1.11.3, Theorem 1.11.4, and Table 1.11.5, we have the following observations.
(1) If Xi ∼ Bernoulli(p) for all 1 ≤ i ≤ n, then Y ∼ Binomial(n, p).
(2) If Xi ∼ Binomial(Ni, p) for all 1 ≤ i ≤ n, then Y ∼ Binomial(N1 + · · · + Nn, p).
(3) If Xi ∼ Poisson(λi) for all 1 ≤ i ≤ n, then Y ∼ Poisson(λ1 + · · · + λn).
(4) If Xi ∼ exp(λ) for all 1 ≤ i ≤ n, then Y ∼ Gamma(n, λ).
(5) If Xi ∼ N(µi, σi²) for all 1 ≤ i ≤ n, then Y ∼ N(µ1 + · · · + µn, σ1² + · · · + σn²).
(6) If Xi ∼ Gamma(αi, β) for all 1 ≤ i ≤ n, then Y ∼ Gamma(α1 + · · · + αn, β).

Next we'll see how the "moment generating" function MX(t) actually generates moments.

Theorem 1.11.7 Let X be either a continuous or a discrete r.v. such that there exists an open interval I containing 0 on which MX(t) < ∞. Then the n-th moment E[X^n] of X exists, and d^n/dt^n MX(t) |_{t=0} = E[X^n].

Proof. (In class.) Some details are omitted - we need the dominated convergence theorem from measure theory.

1.12 LLN and CLT

In this section we state without proof two very important theorems, the law of large numbers and the central limit theorem.

Theorem 1.12.1 (Strong law of large numbers, SLLN) Let X1, · · ·, Xn, · · · be a sequence of i.i.d. r.v.s with E[|X1|] < ∞. Then P({ω : lim_{n→∞} (X1(ω) + · · · + Xn(ω))/n = E[X1]}) = 1.

Theorem 1.12.2 (Weak law of large numbers, WLLN) Let X1, · · ·, Xn, · · · be a sequence of i.i.d. r.v.s with E[|X1|] < ∞. Then for any ε > 0, lim_{n→∞} P({ω : |(X1(ω) + · · · + Xn(ω))/n − E[X1]| > ε}) = 0.

Remark 1.12.3 As the names suggest, the SLLN implies the WLLN, but we will not give a proof here.

Theorem 1.12.4 (Central limit theorem, CLT) Let X1, · · ·, Xn, · · · be a sequence of i.i.d. r.v.s with E[X1²] < ∞. Then for any x ∈ R, we have
lim_{n→∞} P( (X1 + · · · + Xn − nE[X1]) / √(n · Var(X1)) ≤ x ) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy.
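To close, here is a small simulation sketch of the SLLN and the CLT for i.i.d. exp(1) random variables, for which E[X1] = Var(X1) = 1. The sample sizes, number of trials, and seed below are arbitrary illustrative choices, and the printed values are only approximations to E[X1] and to the standard normal CDF Φ(x).

```python
import math
import random

rng = random.Random(0)          # fixed seed, an arbitrary choice
lam = 1.0                       # X_i ~ exp(1):  E[X_1] = 1, Var(X_1) = 1

def sample_mean(n):
    return sum(rng.expovariate(lam) for _ in range(n)) / n

# SLLN (Theorem 1.12.1): the sample mean should be close to E[X_1] = 1.
print(sample_mean(100_000))

# CLT (Theorem 1.12.4): the normalized sum (S_n - n E[X_1]) / sqrt(n Var(X_1))
# should be <= x with probability close to the standard normal CDF Phi(x).
n, trials, x = 500, 2_000, 1.0
hits = sum(
    (sum(rng.expovariate(lam) for _ in range(n)) - n * 1.0) / math.sqrt(n * 1.0) <= x
    for _ in range(trials)
)
phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # Phi(1) ≈ 0.8413
print(hits / trials, phi)
```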