Conditional distributions (discrete case)

The basic idea behind conditional distributions is simple: suppose (X, Y) is a jointly distributed random vector with a discrete joint distribution. Then we can think of P{X = x | Y = y}, for every fixed y, as a distribution of probabilities indexed by the variable x.

Example 1. Recall the following joint distribution from your text:

                      possible values for X
                        1        2        3
   possible    3       1/6      1/6       0
   values      2       1/6       0       1/6
   for Y       1        0       1/6      1/6

Now suppose we know that Y = 3. Then the conditional distribution of X, given that we know Y = 3, is given by the following:

\[
P\{X = 1 \mid Y = 3\} = \frac{P\{X = 1, Y = 3\}}{P\{Y = 3\}} = \frac{1/6}{1/3} = \frac{1}{2},
\]

P{X = 2 | Y = 3} = 1/2 similarly, and P{X = x | Y = 3} = 0 for all other values of x.

Note that, in this example, ∑_x P{X = x | Y = 3} = 1. Therefore, the "conditional distribution" of X given that Y = 3 is indeed a genuine probability distribution. As such, it has an expectation, a variance, etc., as well. For instance,

\[
E(X \mid Y = 3) = \sum_x x\, P\{X = x \mid Y = 3\} = \Bigl(1 \times \tfrac12\Bigr) + \Bigl(2 \times \tfrac12\Bigr) = \frac{3}{2}.
\]

That is, if Y = 3, then our best predictor of X is 3/2. Whereas EX = 2, which means that our best predictor of X, in light of no additional information, is 2. You should check that also

\[
E(X \mid Y = 1) = \frac{5}{2}, \qquad E(X \mid Y = 2) = 2.
\]

Similarly,

\[
E(X^2 \mid Y = 3) = \Bigl(1^2 \times \tfrac12\Bigr) + \Bigl(2^2 \times \tfrac12\Bigr) = \frac{5}{2},
\]

whence

\[
\mathrm{Var}(X \mid Y = 3) = \frac{5}{2} - \Bigl(\frac{3}{2}\Bigr)^2 = \frac{1}{4}.
\]

You should compare this with the unconditional variance Var X = 2/3 (check the details!).

Fix a number y. In general, the conditional distribution of X given that Y = y is given by the following [function of x]:

\[
P\{X = x \mid Y = y\} = \frac{P\{X = x, Y = y\}}{P\{Y = y\}} = \frac{P\{X = x, Y = y\}}{\sum_z P\{X = z, Y = y\}}.
\]

It follows easily from this that: (i) P{X = x | Y = y} ≥ 0; and (ii) ∑_x P{X = x | Y = y} = 1. This shows that we really are studying a [here conditional] probability distribution. As such,

\[
E\bigl(g(X) \mid Y = y\bigr) = \sum_x g(x)\, P\{X = x \mid Y = y\},
\]

as long as the sum converges absolutely [or when g(x) ≥ 0 for all x].

Example 2. Choose and fix 0 < p < 1 and two integers n, m ≥ 1. Let X and Y be independent random variables; we suppose that X has a binomial distribution with parameters n and p, and Y has a binomial distribution with parameters m and p. Because X + Y can be thought of as the total number of successes in n + m tosses of independent p-coins, it follows that X + Y has a binomial distribution with parameters n + m and p. Our present goal is to find the conditional distribution of X, given that X + Y = k, for a fixed integer 0 ≤ k ≤ n + m. By independence,

\[
P\{X = j \mid X + Y = k\} = \frac{P\{X = j, X + Y = k\}}{P\{X + Y = k\}} = \frac{P\{X = j\} \cdot P\{Y = k - j\}}{P\{X + Y = k\}}.
\]

The numerator is zero unless j = 0, 1, ..., n. But if 0 ≤ j ≤ n, then

\[
P\{X = j \mid X + Y = k\}
= \frac{\binom{n}{j} p^j (1-p)^{n-j} \cdot \binom{m}{k-j} p^{k-j} (1-p)^{m-k+j}}{\binom{n+m}{k} p^k (1-p)^{n+m-k}}
= \frac{\binom{n}{j}\binom{m}{k-j}}{\binom{n+m}{k}}.
\]

That is, given that X + Y = k, the conditional distribution of X is a hypergeometric distribution! We can now read off the mean and the variance from facts we know about hypergeometrics. Namely, according to Example 1 on page 61 of these notes,

\[
E(X \mid X + Y = k) = \frac{kn}{n+m},
\]

and Example 2 on page 61 tells us that

\[
\mathrm{Var}(X \mid X + Y = k) = k \cdot \frac{n}{n+m} \cdot \frac{m}{n+m} \cdot \frac{n+m-k}{n+m-1}.
\]

(Check the details! Some care is needed; on page 61 the notation was slightly different than it is here: the variable b there is now n, N there is now n + m, etc.)
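None of the arithmetic above is hard, but it is easy to check mechanically. The following is a minimal Python sketch (the language and the helper name conditional_pmf_of_X are my choices, not part of the notes) that stores the joint table of Example 1 exactly as fractions, forms the conditional distribution of X given Y = 3, and recovers E(X | Y = 3) = 3/2 and Var(X | Y = 3) = 1/4.

```python
from fractions import Fraction as F

# Joint distribution of (X, Y) from Example 1, stored as {(x, y): P{X = x, Y = y}}.
joint = {
    (1, 3): F(1, 6), (2, 3): F(1, 6), (3, 3): F(0),
    (1, 2): F(1, 6), (2, 2): F(0),    (3, 2): F(1, 6),
    (1, 1): F(0),    (2, 1): F(1, 6), (3, 1): F(1, 6),
}

def conditional_pmf_of_X(joint, y):
    """Return {x: P{X = x | Y = y}} by dividing the row for Y = y by P{Y = y}."""
    p_y = sum(p for (x, yy), p in joint.items() if yy == y)        # P{Y = y}
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

pmf = conditional_pmf_of_X(joint, 3)              # X = 1 and X = 2 each get probability 1/2
mean = sum(x * p for x, p in pmf.items())         # E(X | Y = 3)    -> Fraction(3, 2)
second = sum(x * x * p for x, p in pmf.items())   # E(X^2 | Y = 3)  -> Fraction(5, 2)
var = second - mean ** 2                          # Var(X | Y = 3)  -> Fraction(1, 4)
print(mean, var)                                  # prints "3/2 1/4"
```

Calling conditional_pmf_of_X with y = 1 or y = 2 reproduces E(X | Y = 1) = 5/2 and E(X | Y = 2) = 2 in the same way.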
Example 3. Suppose X₁ and X₂ are independent, distributed respectively as Poisson(λ₁) and Poisson(λ₂), where λ₁, λ₂ > 0 are fixed constants. What is the distribution of X₁, given that X₁ + X₂ = n for a positive [fixed] integer n?

We will need the distribution of X₁ + X₂; therefore, let us begin with that. The possible values of X₁ + X₂ are 0, 1, ...; therefore the "convolution" formula for discrete distributions implies that for all n ≥ 0,

\[
\begin{aligned}
P\{X_1 + X_2 = n\}
&= \sum_{k=0}^{n} P\{X_1 = k\} \cdot P\{X_2 = n-k\} \\
&= \sum_{k=0}^{n} \frac{\lambda_1^k e^{-\lambda_1}}{k!} \cdot P\{X_2 = n-k\} \\
&= \sum_{k=0}^{n} \frac{\lambda_1^k e^{-\lambda_1}}{k!} \cdot \frac{\lambda_2^{n-k} e^{-\lambda_2}}{(n-k)!}
 = \frac{e^{-(\lambda_1+\lambda_2)}}{n!} \sum_{k=0}^{n} \binom{n}{k} \lambda_1^k \lambda_2^{n-k} \\
&= \frac{e^{-(\lambda_1+\lambda_2)}}{n!} \,(\lambda_1 + \lambda_2)^n \qquad \text{[binomial theorem]};
\end{aligned}
\]

and P{X₁ + X₂ = n} = 0 for other values of n. In other words, X₁ + X₂ is distributed as Poisson(λ₁ + λ₂).

[Aside: This and induction together prove that if X₁, ..., X_N are independent and distributed respectively as Poisson(λ₁), ..., Poisson(λ_N), then X₁ + ··· + X_N is distributed according to a Poisson distribution with parameter λ₁ + ··· + λ_N.]

Now, for every 0 ≤ k ≤ n,

\[
P\{X_1 = k \mid X_1 + X_2 = n\}
= \frac{P\{X_1 = k\} \cdot P\{X_2 = n-k\}}{P\{X_1 + X_2 = n\}}
= \frac{\dfrac{\lambda_1^k e^{-\lambda_1}}{k!} \cdot \dfrac{\lambda_2^{n-k} e^{-\lambda_2}}{(n-k)!}}{\dfrac{(\lambda_1+\lambda_2)^n e^{-(\lambda_1+\lambda_2)}}{n!}}
= \binom{n}{k} p^k q^{n-k},
\]

where

\[
p := \frac{\lambda_1}{\lambda_1+\lambda_2} \qquad\text{and}\qquad q := 1 - p = \frac{\lambda_2}{\lambda_1+\lambda_2}.
\]

That is, the conditional distribution of X₁, given that X₁ + X₂ = n, is Binomial(n, p).

Continuous conditional distributions

Once we understand conditional distributions in the discrete setting, we can predict how the theory should work in the continuous setting [although it is quite difficult to justify that what is about to be discussed is legitimate]. Suppose (X, Y) is jointly distributed with joint density f(x, y). Then we define the conditional density of X given Y = y [assuming that f_Y(y) ≠ 0] as

\[
f_{X\mid Y}(x \mid y) := \frac{f(x, y)}{f_Y(y)} \qquad \text{for all } -\infty < x < \infty.
\]

This leads to conditional expectations:

\[
E\bigl(g(X) \mid Y = y\bigr) = \int_{-\infty}^{\infty} g(x)\, f_{X\mid Y}(x \mid y)\, dx,
\]

etc. As such, we now have E(X | Y = y), Var(X | Y = y), etc., defined in the usual way using [now conditional] densities.

Example 4 (Uniform on a triangle, p. 414 of your text). Suppose (X, Y) is chosen uniformly at random from the triangle {(x, y) : x ≥ 0, y ≥ 0, x + y ≤ 2}. Find the conditional density of Y, given that X = x [for a fixed 0 ≤ x ≤ 2].

We know that f(x, y) = 1/2 if (x, y) is in the triangle and f(x, y) = 0 otherwise. Therefore, for all fixed 0 ≤ x ≤ 2,

\[
f_X(x) = \int_0^{2-x} f(x, y)\, dy = \frac{2-x}{2},
\]

and f_X(x) = 0 for other values of x. This yields the following: for all 0 ≤ y ≤ 2 − x,

\[
f_{Y\mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{1/2}{(2-x)/2} = \frac{1}{2-x},
\]

and f_{Y|X}(y | x) = 0 for all other values of y. In other words, given that X = x, the conditional distribution of Y is Uniform(0, 2 − x). Thus, for instance, P{Y > 1 | X = x} = 0 if 2 − x < 1 [i.e., x > 1], and P{Y > 1 | X = x} = (1 − x)/(2 − x) for 0 ≤ x ≤ 1.

Alternatively, we can work things out the longer way, by hand:

\[
P\{Y > 1 \mid X = x\} = \int_1^{\infty} f_{Y\mid X}(y \mid x)\, dy.
\]

Now the integrand is zero if y > 2 − x. Therefore, unless 0 ≤ x ≤ 1, the preceding probability is zero. When 0 ≤ x ≤ 1, we have

\[
P\{Y > 1 \mid X = x\} = \int_1^{2-x} \frac{1}{2-x}\, dy = \frac{1-x}{2-x}.
\]

As another example within this one, let us compute E(Y | X = x):

\[
E(Y \mid X = x) = \int_{-\infty}^{\infty} y\, f_{Y\mid X}(y \mid x)\, dy = \int_0^{2-x} \frac{y}{2-x}\, dy = \frac{2-x}{2}.
\]

[Alternatively, we can read this off from facts that we know about uniforms; for instance, we should be able to tell, without computation, that Var(Y | X = x) = (2 − x)²/12. Check this by direct computation also!]
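A quick way to sanity-check Example 4 is to simulate it. Below is a small Monte Carlo sketch in Python with NumPy (my choice of tool; the sample size, the conditioning point x0 = 0.5, and the window width eps are arbitrary illustrative values). It draws points uniformly from the triangle, keeps those whose X-coordinate lies close to x0, and compares the resulting Y values with the Uniform(0, 2 − x0) predictions: mean (2 − x0)/2, tail probability (1 − x0)/(2 − x0), and variance (2 − x0)²/12.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample uniformly from the triangle {x >= 0, y >= 0, x + y <= 2} by rejection
# sampling from the square [0, 2] x [0, 2]; about half of the points are kept.
N = 2_000_000
pts = rng.uniform(0.0, 2.0, size=(N, 2))
pts = pts[pts.sum(axis=1) <= 2.0]
X, Y = pts[:, 0], pts[:, 1]

# Condition on X lying in a thin window around x0 and inspect the Y values there.
x0, eps = 0.5, 0.01
Y_slice = Y[np.abs(X - x0) < eps]

print(Y_slice.mean())          # close to (2 - x0) / 2        = 0.75
print((Y_slice > 1).mean())    # close to (1 - x0) / (2 - x0) = 0.333...
print(Y_slice.var())           # close to (2 - x0)**2 / 12    = 0.1875
```

The thin-window conditioning is only an approximation to conditioning on the event {X = x0}, which has probability zero; shrinking eps (while increasing N) brings the empirical values closer to the exact ones computed in Example 4.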