CHAPTER 8
Conditional Expectation & Prediction
8.1 Introduction
We have already discussed the importance of being able to predict the probability of the conditional event $[Y \le y \mid X = x]$. To this end, we addressed both linear and nonlinear prediction models for the predictor random variable $\hat{Y} = g(X)$. In the context of the linear predictor $\hat{Y} = mX + b$, we showed that the predictor that achieves the minimum mean-squared error (mmse) must satisfy the 'orthogonality' condition:

$$\operatorname{Cov}(Y - \hat{Y},\, X) = 0. \qquad (8.1)$$

For certain types of nonlinear models (e.g. polynomials), we showed how (8.1) relates to the solution for the model parameters. In this chapter we return to this prediction problem, but in more generality, and with more relevance to the situation wherein we have discrete random variables. In particular, we will address the predictor that is, by far, the most popular; namely, the conditional expected value of Y given X.
8.2 The Conditional Expected Value Predictor
Given a 2-D random variable (X,Y), recall that the conditional pdf of Y given X = x is:

$$f_{Y|X=x}(y) = f(x,y)\Big/\!\int_{-\infty}^{\infty} f(x,y)\,dy = f(x,y)/f_X(x). \qquad (2.1)$$

The corresponding conditional expected value of Y given X = x is defined as:

$$E_Y(Y \mid X = x) = \int_{-\infty}^{\infty} y\, f_{Y|X=x}(y)\,dy. \qquad (2.2a)$$

Notice that (2.2a) is a non-random function of x. For notational convenience, denote (2.2a) as

$$E_Y(Y \mid X = x) = \mu(x). \qquad (2.2b)$$

We are now in a position to define

$$E_Y(Y \mid X) = \mu(X). \qquad (2.3)$$

Notice that (2.3) is a function of the random variable X, not of the value x. Hence, (2.3) is a random variable. In fact, it is a very important random variable, in the following sense.
Theorem 2.1 Let h(X) be a prediction model for Y, and define the mean squared prediction error:

$$\text{mse}_h = E\{[Y - h(X)]^2\}. \qquad (2.4)$$

Then (2.4) is minimized for $h(X) = \mu(X)$.

Proof: $\text{mse}_h = E\{[Y - h(X)]^2\} = E\{[\,(Y - \mu(X)) + (\mu(X) - h(X))\,]^2\}$. By linearity of $E(\cdot)$, this is:

$$\text{mse}_h = \text{mse}_\mu + 2E[(Y - \mu(X))(\mu(X) - h(X))] + E\{[\mu(X) - h(X)]^2\}, \qquad (2.5)$$

where $\text{mse}_\mu = E\{[Y - \mu(X)]^2\}$. We will now show that the middle term in (2.5) equals zero. To this end, write

$$E[(Y - \mu(X))(\mu(X) - h(X))] = E_X\{\,E_Y[(Y - \mu(X))(\mu(X) - h(X)) \mid X = x]\,\}.$$

The only term in the conditional expected value $E_Y[(Y - \mu(X))(\mu(X) - h(X)) \mid X = x]$ that depends on Y is Y itself. This is important, because in the definition (2.2a) the integral is over the sample space for Y. Hence, anything that does not depend on Y can be brought outside of the integral, which then equals one in such cases. Therefore, we have

$$E_Y[(Y - \mu(X))(\mu(X) - h(X)) \mid X = x] = [\,E_Y(Y \mid X = x) - \mu(x)\,][\,\mu(x) - h(x)\,] = 0. \qquad (2.6)$$

The last equality in (2.6) follows from the definition (2.2b). In view of (2.6), (2.5) becomes $\text{mse}_h = \text{mse}_\mu + E\{[\mu(X) - h(X)]^2\}$. Clearly, the predictor h(X) that minimizes this is $h(X) = \mu(X)$. □
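A standard consequence of (2.2a) and (2.3), used below whenever $\mu(X)$ is called an unbiased predictor of Y, is the law of total expectation (written here for the continuous case):

$$E[\mu(X)] = \int_{-\infty}^{\infty} \mu(x)\,f_X(x)\,dx = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} y\, f_{Y|X=x}(y)\,f_X(x)\,dy\,dx = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} y\, f(x,y)\,dy\,dx = E(Y).$$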
Example 2.1 Suppose that (X,Y) has a joint normal pdf. From Theorem 6.9 (p.221), the conditional expected value of Y given X = x is

$$E(Y \mid X = x) = \mu_Y + \rho\left(\frac{\sigma_Y}{\sigma_X}\right)(x - \mu_X). \qquad (2.7)$$

Now, recall that we showed in class that, for any (X,Y), the best unbiased linear prediction model is:

$$\hat{Y} = aX + b, \quad\text{where } a = \rho\,\frac{\sigma_Y}{\sigma_X}\ \text{ and } b = \mu_Y - a\,\mu_X. \qquad (2.8)$$

With minimal manipulation, it should be clear that (2.7) and (2.8) are identical. [See Example 3.6 for a more detailed development.] Hence, we have the following remarkable result:

Result 2.1 For (X,Y) jointly normal, the best linear prediction model for Y is the best among all prediction models. □
8.3 The Conditional Expected Value in the Case of Discrete Random Variables
There is a subtle distinction between (2.7) and (2.8); namely, that whereas (2.7) is a number, (2.8) is a random variable. In the jargon of estimation theory, (2.7) is an estimate of Y, whereas (2.8) is an estimator of Y. We began in Chapters 1 and 2 by addressing the distinction between a number, x, and the action, X, that it resulted from. Throughout the semester we have continued to emphasize the importance of this distinction. In this chapter, this distinction is of paramount importance for understanding the quantities $E(Y \mid X = x)$ and $E(Y \mid X)$, such as those illustrated in (2.7) and (2.8). In this section we will use Bernoulli random variables, both to contribute to a basic understanding of this distinction and to demonstrate the (in)appropriateness of using $E(Y \mid X)$ in the case of discrete random variables.
Example 3.1 Suppose that (X,Y) has a joint Bernoulli pdf. Then

$$\mu(x) = E_Y(Y \mid X = x) = \sum_{y=0}^{1} y\,\frac{\Pr[X = x, Y = y]}{\Pr[X = x]} = \frac{\Pr[X = x, Y = 1]}{\Pr[X = x]}\ ;\quad x \in \{0,1\}. \qquad (3.1a)$$

In particular,

$$\mu(0) = E_Y(Y \mid X = 0) = \frac{\Pr[X = 0, Y = 1]}{\Pr[X = 0]} = \frac{p_{01}}{1 - p_X} = p_{1|0}, \qquad (3.1b)$$

and

$$\mu(1) = E_Y(Y \mid X = 1) = \frac{\Pr[X = 1, Y = 1]}{\Pr[X = 1]} = p_{1|1}. \qquad (3.1c)$$

Now, if we simply replace x with X in (3.1), we obtain

$$\mu(X) = E_Y(Y \mid X) = \frac{\Pr[X, Y = 1]}{\Pr[X]} = p_{1|X}.$$

This expression makes no sense. On the other hand, write:

$$\mu(X) = p_{1|X} = I_0(X)\,p_{1|0} + I_1(X)\,p_{1|1}. \qquad (3.2)$$

Here, $[I_k(X) = 1] = [X = k]$ is the indicator function of the random variable X (i.e. it is, indeed, a function of X, and not of x): it equals one when X = k, and equals zero otherwise. Let's take a moment to study the random variable $W = \mu(X)$ given by (3.2) in more detail. The sample space for W is $S_W = \{p_{1|0}, p_{1|1}\}$. Since this sample space includes only two numbers, W is in some sense related to a Bernoulli random variable. Let's compute the pdf $f_W(w)$ on $S_W = \{p_{1|0}, p_{1|1}\}$:

$$f_W(p_{1|0}) = \Pr[W = p_{1|0}] = \Pr[X = 0] = 1 - p_X \quad\text{and}\quad f_W(p_{1|1}) = \Pr[W = p_{1|1}] = \Pr[X = 1] = p_X.$$

Then the expected value of W is:

$$\mu_W = E[\mu(X)] = p_{1|0}(1 - p_X) + p_{1|1}\,p_X = p_Y. \qquad (3.3)$$

From (3.3) we see that $W = \mu(X)$ is an unbiased predictor of Y. This should be no surprise since it is, after all, the conditional mean of Y. Let's now proceed to compute the variance of $\mu(X)$. To this end, we first compute

$$E(W^2) = E[\mu(X)^2] = p_{1|0}^2(1 - p_X) + p_{1|1}^2\,p_X. \qquad (3.4)$$

Hence, $\sigma_W^2 = \operatorname{Var}[\mu(X)] = p_{1|0}^2(1 - p_X) + p_{1|1}^2\,p_X - [\,p_{1|0}(1 - p_X) + p_{1|1}\,p_X\,]^2$, or

$$\sigma_W^2 = \operatorname{Var}[\mu(X)] = (p_{1|0}^2 + p_{1|1}^2)(1 - p_X)\,p_X - 2\,p_{1|0}\,p_{1|1}\,p_X(1 - p_X),$$

or

$$\sigma_W^2 = \operatorname{Var}[\mu(X)] = (p_{1|1} - p_{1|0})^2 (1 - p_X)\,p_X. \qquad (3.5)$$
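As a quick numerical illustration of (3.3)-(3.5), the following short Matlab sketch uses assumed probability values (they are not taken from the notes) and checks that the definition of the variance and the closed form (3.5) agree:

% Numerical check of (3.3)-(3.5) for a 2-D Bernoulli (X,Y).
% The three probabilities below are illustrative, assumed values.
pX   = 0.3;    % Pr[X = 1]
p1g0 = 0.2;    % Pr[Y = 1 | X = 0]
p1g1 = 0.7;    % Pr[Y = 1 | X = 1]
muW   = p1g0*(1 - pX) + p1g1*pX;           % (3.3): E[mu(X)] = p_Y
EW2   = p1g0^2*(1 - pX) + p1g1^2*pX;       % (3.4): E[mu(X)^2]
varW1 = EW2 - muW^2;                       % variance from the definition
varW2 = (p1g1 - p1g0)^2*(1 - pX)*pX;       % (3.5): the closed form
fprintf('E[mu(X)] = %.4f,  Var[mu(X)] = %.4f = %.4f\n', muW, varW1, varW2)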
Before we go any further with this case, the following very important comment is in order.
Important Comment 3.1: Recall that both X and Y are Bernoulli random variables. Recall also that the problem at hand is to predict a value for Y, given the event [X = x]. If we use as our predictor $\hat{Y} = W = E_Y(Y \mid X) = \mu(X)$, then in the case of the event [X = 0] we will have $\hat{y} = E_Y(Y \mid X = 0) = \mu(0) = p_{1|0}$, and in the case of the event [X = 1] we will have $\hat{y} = E_Y(Y \mid X = 1) = \mu(1) = p_{1|1}$. Notice that in either case our predictions are not elements of the sample space for Y, which is $S_Y = \{0,1\}$. Imagine you are asked: "Given that a person's right eye is blue (=1, versus brown = 0), what is your prediction as to the color of the left eye?" If you are using the 'best' predictor (in terms of minimum mean squared error), then your answer will not be 0, nor will it be 1. It will be a number between 0 and 1. For example, suppose you computed that number to be, say, 0.98. What kind of response do you think you would get? What does this number mean? Does it mean that the left eye will be 98% blue and 2% brown? If you gave that explanation, then a best case scenario would be a room full of laughter. A worst case response would not be nearly so pleasant.

And so, in the case of predicting one Bernoulli random variable, given the value of another Bernoulli random variable, the mmse predictor $\hat{Y} = W = E_Y(Y \mid X) = \mu(X)$ is a ridiculous predictor that makes no sense! In fact, in any situation where the sample spaces for X and Y are both discrete, this predictor is equally ridiculous. We will end this important comment with the following

THOUGHT QUESTION: Given that $\hat{Y} = W = E_Y(Y \mid X) = \mu(X)$ is an inappropriate predictor in the case of discrete random variables, how can it be used to arrive at an appropriate predictor?

ANSWER: The answer to this question is simple! Rather than announcing the predictor value, say, $\hat{y} = E_Y(Y \mid X = 1) = \mu(1) = p_{1|1}$, we will, instead, announce the event [Y = 1] with attendant probability $p_{1|1}$. This predictor is not the mmse predictor $\hat{y} = E_Y(Y \mid X = 1) = \mu(1) = p_{1|1}$!
We now consider two extreme cases in relation to this 2-D Bernoulli random variable (X,Y).

Case 1: ρ = 0. In this case $p_{1|1} = p_{1|0} = p_Y$. Hence, the variance (3.5) is zero! While this may seem strange, for this case we have $\mu(X) = p_{1|0} = p_Y$. Clearly, the variance of a constant is zero. Even though the mmse predictor is $\mu(x=0) = \mu(x=1) = p_{1|0} = p_Y$, this number is neither 0 nor 1. In relation to the above answer, the appropriate predictor is "$\hat{Y} = 1$ with probability $p_Y$, regardless of what x is." □

Case 2: ρ = 1. In this case $p_{1|1} = 1$ and $p_{1|0} = 0$. Hence, the variance (3.5) is $(1 - p_X)\,p_X = \sigma_X^2$. Since in this case we have Y = X (with probability 1), this is also $\sigma_Y^2$. More fundamentally, the appropriate predictor is $\hat{Y} = X$ (with probability 1). □
From Theorem 2.1 above we know that (3.2) is the best estimator of Y in relation to mean squared error. Hence, we are led to ask what this error is:

$$\text{mse}_\mu = E\{[Y - \mu(X)]^2\} = E(Y^2) - 2E[Y\mu(X)] + E[\mu(X)^2]. \qquad (3.6)$$

The first term on the right side of the equality in (3.6) is $E(Y^2) = \sigma_Y^2 + p_Y^2$. The last term in (3.6) is (3.4). The joint expectation in the middle term is

$$E[Y\mu(X)] = \sum_{y=0}^{1}\sum_{x=0}^{1} y\,\mu(x)\,p_{xy} = \mu(0)\,p_{01} + \mu(1)\,p_{11} = p_{1|0}\,p_{01} + p_{1|1}\,p_{11}.$$

Hence, (3.6) becomes

$$\text{mse}_\mu = \sigma_Y^2 + p_Y^2 + p_{1|0}^2(1 - p_X) + p_{1|1}^2\,p_X - 2(p_{1|0}\,p_{01} + p_{1|1}\,p_{11}). \qquad (3.7)$$

Since $p_{01} = p_{1|0}(1 - p_X)$ and $p_{11} = p_{1|1}\,p_X$, (3.7) becomes

$$\text{mse}_\mu = \sigma_Y^2 + p_Y^2 - p_{1|0}^2(1 - p_X) - p_{1|1}^2\,p_X. \qquad (3.8)$$

Continuing the above two cases in relation to (3.8), we have

Case 1: ρ = 0. In this case $p_{1|1} = p_{1|0} = p_Y$. Hence, $\text{mse}_\mu = \sigma_Y^2$.

Case 2: ρ = 1. In this case $p_{1|1} = 1$ and $p_{1|0} = 0$. Hence, $\text{mse}_\mu = \sigma_Y^2 + p_Y^2 - p_X$. Again, since in this case we have Y = X, then $p_X = p_Y$, and so $\text{mse}_\mu = \sigma_Y^2 + p_Y^2 - p_Y = 0$. Of course the prediction error will be zero, since by knowing X we know Y. □
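The following short Monte Carlo sketch (with assumed, illustrative probability values) checks the closed-form mse in (3.8) against a simulation of the predictor $\mu(X)$:

% Monte Carlo check of the mse formula (3.8) for a 2-D Bernoulli (X,Y).
% The probability values below are illustrative, assumed ones.
pX = 0.3;  p1g0 = 0.2;  p1g1 = 0.7;
pY  = p1g0*(1 - pX) + p1g1*pX;
n   = 1e6;
X   = rand(n,1) < pX;                          % simulated X
Y   = rand(n,1) < (p1g0 + (p1g1 - p1g0)*X);    % Pr[Y = 1 | X] is p1g0 or p1g1
muX = p1g0 + (p1g1 - p1g0)*X;                  % the mmse predictor mu(X)
mseSim     = mean((Y - muX).^2);
mseFormula = pY*(1 - pY) + pY^2 - p1g0^2*(1 - pX) - p1g1^2*pX;   % (3.8)
fprintf('simulated mse = %.4f,  formula (3.8) = %.4f\n', mseSim, mseFormula)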
Notice that in the last example the 'worst case' was $\text{mse}_\mu = \sigma_Y^2$, and the 'best case' was $\text{mse}_\mu = 0$. These bounds on $\text{mse}_\mu$ hold in general. If X is independent of Y, then using it will not help. One might as well simply predict Y to be its mean, in which case the error is $Y - \mu_Y$ and the mse is $\sigma_Y^2$.

And so, in what real sense is $\mu(X)$ the best predictor? Well, if one were presented with a large collection of 2-D Bernoulli random variables, then on the average, giving the ridiculous answer $\mu(x)$ would result in the smallest mean-squared prediction error. [At which point, I, for one, would say: So what! It's a ridiculous predictor.]
We now present two practical examples where the predictor described in the above ANSWER makes perfect sense.
Example 3.2 Suppose that to determine whether a person has cancer, two tests are performed. The result of each test can be either negative (0) or positive (1). Suppose that, based on a large data base, the 2-D Bernoulli random variable (X,Y) has the following joint probabilities: $p_{00} = 0.9$, $p_{10} = 0.02$, $p_{01} = 0.02$, $p_{11} = 0.06$. After the first test has been performed, the patient often would like feedback regarding the probabilities associated with the next test, yet to be performed. In this setting, the sample space for the optimal predictor is:

$$S_{\mu(X)} = \left\{\frac{p_{01}}{1 - p_X}\ ,\ \frac{p_{11}}{p_X}\right\} = \left\{\frac{0.02}{1 - (0.02 + 0.06)}\ ,\ \frac{0.06}{0.02 + 0.06}\right\} = \{\,2/92\ ,\ 6/8\,\}.$$

Hence, if the first test is negative, then the physician should tell the patient that the second test has a probability of 2/92 of being positive. If the first test is positive, then the patient should be told that the probability that the second test will be positive is 3/4. However, sometimes it may be in the patient's best interest not to give such quantitative feedback. In this situation, if the first test is negative, the patient could simply be told not to worry (yet). Whereas, if the first test is positive, then the patient could be told that, while it is more likely that the second test will be positive than negative, there is still a significant chance that it will be negative; and so, again, don't worry too much (yet). □
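A minimal Matlab sketch of these computations, using the joint probabilities given above:

% Conditional probabilities for Example 3.2, from the given joint pdf.
% pxy(i,j) = Pr[X = i-1, Y = j-1]  (rows index x, columns index y).
pxy = [0.90 0.02 ;     % x = 0:  p00, p01
       0.02 0.06];     % x = 1:  p10, p11
pX1  = sum(pxy(2,:));              % Pr[X = 1] = p10 + p11 = 0.08
p1g0 = pxy(1,2)/sum(pxy(1,:));     % Pr[Y = 1 | X = 0] = 0.02/0.92 = 2/92
p1g1 = pxy(2,2)/sum(pxy(2,:));     % Pr[Y = 1 | X = 1] = 0.06/0.08 = 3/4
muW  = p1g0*(1 - pX1) + p1g1*pX1;  % E[mu(X)] = p_Y, per (3.3)
fprintf('p_1|0 = %.4f,  p_1|1 = %.4f,  E[mu(X)] = %.4f\n', p1g0, p1g1, muW)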
Example 3.3 Consider a survey where each respondent is asked two questions. The set of possible responses for each question is {1, 2, 3, 4, 5}. Let the 2-D random variable (X,Y) correspond to recording the responses to the two questions. It is desired to determine how well the answer to the second question can be predicted from the answer to the first.

One of the investigators proposes using a linear model: $\hat{Y} = aX + b$. The method of least squares results in the model $\hat{Y} = 1.2X + 0.5$. Since $S_X = \{1,2,3,4,5\}$, the sample space for $\hat{Y}$ is therefore $S_{\hat{Y}} = \{1.7,\ 2.9,\ 4.1,\ 5.3,\ 6.5\}$. Since no element of this sample space is in $S_Y = \{1,2,3,4,5\}$, the investigator decides to "round down". However, that poses a problem; namely, that 3 is not in this modified space. He then decides to "round up". But that entails the same problem; namely, that 1 is not in the space. Finally, he decides to simply map the 5 numbers in $S_{\hat{Y}}$ directly to the 5 numbers in $S_Y$. While this resolves the above type of problem, it begs the question: how well-justified is the model, after all?

The approach of this investigator is an all too common one. And the ad hoc manner in which issues such as those above are resolved can lead others to question the value of the model, and can also lead to poor performance. A more rigorous approach is the optimal approach; namely, to use the conditional expected value of Y given X. However, there is a price; and it is that one needs to have enough data for reliable estimates of the joint probabilities. Denote these joint probabilities as $\Pr[X = x, Y = y] = p_{xy}$. Hence, $S_{(X,Y)} = \{(x,y) \mid x, y \in \{1,\dots,5\}\}$ and the corresponding elemental probabilities are $\{p_{xy} \mid x, y \in \{1,\dots,5\}\}$. From these it is easy to compute the marginal probabilities $\{p_x \mid x \in \{1,\dots,5\}\}$ and $\{p_y \mid y \in \{1,\dots,5\}\}$. Having these, it is then just as easy to compute the conditional probabilities: $\Pr[Y = y \mid X = x] = \Pr[X = x, Y = y]/\Pr[X = x] = p_{xy}/p_x = p_{y|x}$. From these, we formulate the conditional expected value:

$$\mu(x) = E[Y \mid X = x] = \sum_{y=1}^{5} y\,p_{y|x}. \qquad (3.9a)$$

Finally, the best predictor is:

$$\mu(X) = E[Y \mid X] = \sum_{y=1}^{5} y\,p_{y|X}. \qquad (3.9b)$$

As was the case in Example 3.1, we need to define what we mean by the quantity $p_{y|X}$. In that example, since the sum was only over the set {0,1}, the sum equaled simply $p_{1|X}$. We need to interpret (3.9b) in the same manner. Extrapolating on (3.2) we obtain

$$p_{y|X} = I_1(X)\,p_{y|1} + I_2(X)\,p_{y|2} + \cdots + I_5(X)\,p_{y|5}. \qquad (3.9c)$$

Hence, the meaning of (3.9b) is

$$\mu(X) = E[Y \mid X] = \sum_{y=1}^{5} y \sum_{x=1}^{5} I_x(X)\,p_{y|x}. \qquad (3.9d)$$

Suppose, for example, that we want to predict Y based on the event [X = 3]. Then (3.9d) becomes

$$\mu(3) = E[Y \mid X = 3] = \sum_{y=1}^{5} y \sum_{x=1}^{5} I_x(3)\,p_{y|x} = \sum_{y=1}^{5} y\,p_{y|3}. \qquad (3.10)$$

A FUN QUESTION: What is the sample space for (3.9d)?

To finish this example, let's consider a predictor, $\hat{Y}$, whose sample space is the same as that of Y; namely, {1,2,3,4,5}. From (3.10) we clearly have

$$\mu(x) = E[Y \mid X = x] = \sum_{y=1}^{5} y\,p_{y|x}. \qquad (3.11)$$

Hence, in the case where we have the event [X = x] we could use $\hat{Y} = y^*$, where $p_{y^*|x} = \max_{y \in S_Y}\{p_{y|x}\}$. As mentioned above, one should take caution in using this estimator due to the fact that there may be more than one (essential) maximum.

It should be clear that the mmse predictor (3.10) will almost never yield prediction values consistent with $S_Y$. A more appropriate predictor would be the one that uses (3.11) to arrive at a predictor that is consistent with $S_Y$. Even then, expanding on that predictor, one could choose to be more specific; e.g. to say: "You will have [Y = 1] with probability $p_{1|3}$, you will have [Y = 2] with probability $p_{2|3}$, etc."

Having said that, one might ask, "Why don't you just tell me the value of Y that has the highest probability given X = 3?" My answer would be in the form of a question: "Wouldn't you want to know if two different values of Y have very similar probabilities, given X = 3?" □
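To make (3.9a) and (3.11) concrete, here is a minimal sketch using a hypothetical conditional probability table; the numbers in P are assumed for illustration only and are not estimates from any actual survey:

% Sketch for Example 3.3 (hypothetical numbers): given a table
% P(y,x) = Pr[Y = y | X = x], compute the mmse predictor (3.9a) and the
% 'most probable value' predictor (3.11) for each x.
P = [0.40 0.20 0.10 0.05 0.05 ;   % Pr[Y = 1 | X = 1..5]   (assumed values;
     0.30 0.30 0.20 0.10 0.05 ;   %  each column sums to one)
     0.20 0.30 0.40 0.30 0.20 ;
     0.05 0.15 0.20 0.30 0.30 ;
     0.05 0.05 0.10 0.25 0.40];
yvals = (1:5)';
mu    = yvals' * P;               % mu(x) = sum_y y*Pr[Y = y | X = x]
[~, yStar] = max(P);              % most probable y for each x, per (3.11)
disp('   x      mu(x)   y*(x)')
disp([(1:5)'  mu'  yStar'])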
Before we proceed to address $E(Y \mid X)$ in the context of continuous random variables, such as that of Example 2.1, it is appropriate to pause here and respond to the following concern one might well have:

"So... OK. You're telling me that for discrete random variables the mmse predictor $E(Y \mid X)$ is not a good idea, right? But then, how do I use the (x,y) data that I will get to arrive at the predictor that you are arguing is better?"

ANSWER: We would use the (x,y) data to obtain estimates of all of the conditional probabilities.

Comment 3.2: The above answer highlights the fact that we must estimate all of the conditional probabilities. The number of conditional probabilities is the same as the number of joint probabilities. Hence, in this example we would need to estimate a total of 25 joint probabilities. To obtain reliable estimates of so many unknown parameters would require a suitably large amount of data, as opposed to estimating only a couple of parameters, such as the mean and variance. As with the point estimation of any unknown parameter, it would behoove one to consider the associated estimator in relation to things like confidence interval estimation and hypothesis testing. In relation to hypothesis testing, an obvious test might be one that considers whether two unknown conditional probabilities are in fact equal, versus one being larger than the other. □
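A minimal sketch of the estimation step described in the ANSWER and in Comment 3.2; the simulated responses below are only stand-ins for real survey data:

% Estimate the 25 joint probabilities p_xy from n recorded response pairs,
% then form the conditional probabilities p_{y|x} = p_xy / p_x and the
% predictor mu(x) of (3.9a).
n = 5000;
x = randi(5, n, 1);                            % responses to question 1 (simulated)
y = min(max(x + randi([-1, 1], n, 1), 1), 5);  % noisy, related responses to question 2
pxy = zeros(5, 5);
for k = 1:n
    pxy(x(k), y(k)) = pxy(x(k), y(k)) + 1/n;   % relative-frequency estimates
end
px    = sum(pxy, 2);                           % estimated marginal Pr[X = x]
pygx  = pxy ./ repmat(px, 1, 5);               % row x: estimates of Pr[Y = y | X = x]
muhat = pygx * (1:5)';                         % estimated mu(x), per (3.9a)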
Example 3.4 Consider an experiment that involves repeated rolling of a fair die. Let X = the act of recording the first roll number such that we roll a '6', and let Y = the act of recording the number of rolls resulting in a '1' prior to the first roll on which we get a '6'. The problem is to obtain the form of the best unbiased predictor: $\hat{Y} = E(Y \mid X)$.

Solution: We will begin by obtaining the expression for $\hat{y} = E(Y \mid X = x)$. In words, we now have the situation where we first roll a '6' on the xth roll. Define the (x−1)-D random variable $W = (W_1, W_2, \dots, W_{x-1})$, where $[W_k = 1]$ is the event that we roll a '1' on the kth roll, and $[W_k = 0]$ is the event that we do not roll a '1' on the kth roll. Assuming the rolls are mutually independent, then over the x − 1 rolls prior to rolling the first '6' (each of which, not being a '6', is equally likely to show any of the faces 1 through 5), $Y \mid X = x = \sum_{k=1}^{x-1} W_k$ is a binomial(n, p) random variable with parameters $n = x - 1$ and $p = p_W = 1/5$. In particular,

$$E(Y \mid X = x) = E\!\left(\sum_{k=1}^{x-1} W_k\right) = \sum_{k=1}^{x-1} E(W_k) = (x - 1)\,p_W = \frac{x-1}{5}.$$

And so, we obtain

$$\hat{Y} = E(Y \mid X) = \frac{X - 1}{5}. \qquad (3.12)$$
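A Monte Carlo sketch (not part of the original solution) that checks the formula $E(Y \mid X = x) = (x-1)/5$ against simulated die rolls:

% Simulate die rolls and compare the sample mean of Y given X = x
% against the formula E(Y | X = x) = (x-1)/5 from (3.12).
nTrials = 1e5;
X = zeros(nTrials,1);  Y = zeros(nTrials,1);
for t = 1:nTrials
    x = 0;  y = 0;
    while true
        x = x + 1;
        roll = randi(6);
        if roll == 6, break, end        % first '6' occurs on roll x
        if roll == 1, y = y + 1; end    % count the '1's before the first '6'
    end
    X(t) = x;  Y(t) = y;
end
for x = 2:5
    fprintf('x = %d:  sample mean of Y = %.3f,  (x-1)/5 = %.3f\n', ...
            x, mean(Y(X == x)), (x-1)/5);
end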
Now, we could stop here. But let’s instead explore some of the properties of this best predictor, (3.12). In this way, we can
bring a variety of major concepts to bear on a better understanding of it.
(a) What is the sample space for X = the act of recording the first roll number such that we roll a '6'?

Answer: $S_X = \{1, 2, 3, \dots\}$

(b) What is the pdf for X?

Answer: $f_X(x) = \Pr[X = x] = (5/6)^{x-1}(1/6)$ ; $x \in S_X$

(c) What are the mean and variance of X? [Hint: Does the pdf have a name?]

Answer: $X - 1 \sim \text{nbino}(k = 1,\ p = 1/6)$; equivalently, X is a geometric random variable on $\{1, 2, \dots\}$ with $p = 1/6$. And so, from

http://en.wikipedia.org/wiki/Negative_binomial_distribution

or

http://mathworld.wolfram.com/NegativeBinomialDistribution.html

$$\mu_X = 1 + \frac{1-p}{p} = 1 + \frac{5/6}{1/6} = 6 \qquad ; \qquad \sigma_X^2 = \frac{1-p}{p^2} = \frac{5/6}{(1/6)^2} = 30.$$

Now it is easy to obtain a complete description of the structure of (3.12):

(d) What is the sample space for $\hat{Y} = E(Y \mid X) = \dfrac{X-1}{5}$?

Answer: $S_{\hat{Y}} = \{\hat{y} \mid \hat{y} = (x-1)/5 \text{ with } x \in \{1, 2, \dots\}\} = \{0,\ 1/5,\ 2/5,\ \dots\}$

(e) What is the pdf for $\hat{Y}$?

Answer: $f_{\hat{Y}}(\hat{y}) = \Pr[\hat{Y} = \hat{y}] = \Pr\!\left[\dfrac{X-1}{5} = \hat{y}\right] = \Pr[X = 5\hat{y} + 1] = f_X(5\hat{y} + 1)$

(f) What are the mean and variance of $\hat{Y}$?

Answer: $\mu_{\hat{Y}} = E\!\left[\dfrac{X-1}{5}\right] = \dfrac{\mu_X - 1}{5} = \dfrac{6-1}{5} = 1$ (which, as it must for an unbiased predictor, equals $E(Y)$), and $\sigma_{\hat{Y}}^2 = \operatorname{Var}\!\left[\dfrac{X-1}{5}\right] = \dfrac{1}{25}\operatorname{Var}(X) = \dfrac{30}{25} = 1.2$.

(g) Use Matlab to obtain a plot of $f_{\hat{Y}}(\hat{y})$:
[Figure: 'The pdf of Yhat' — a stem plot of $f_{\hat{Y}}(\hat{y})$ versus $\hat{y} = 0.2(x-1)$ for $0 \le \hat{y} \le 4$.]
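A minimal sketch of Matlab commands that would produce such a plot (my reconstruction, using $f_{\hat{Y}}(\hat{y}) = f_X(5\hat{y}+1)$ from part (e)):

% Plot the pdf of Yhat = (X-1)/5, using f_Yhat(yhat) = f_X(5*yhat + 1)
% with f_X(x) = (5/6)^(x-1)*(1/6).
x     = 1:21;                      % first 21 values of X
yhat  = (x - 1)/5;                 % corresponding values of Yhat (0 to 4)
fYhat = (5/6).^(x - 1) * (1/6);    % = f_X(x) = Pr[Yhat = yhat]
stem(yhat, fYhat)
grid on
title('The pdf of Yhat')
xlabel('y = 0.2*(x-1)'), ylabel('f_{Yhat}(y)')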
(h) Even though $\hat{Y} = E(Y \mid X) = \dfrac{X-1}{5}$ is the best unbiased predictor of Y in the sense that it achieves the mmse, why is this an 'undesirable' predictor?

Answer: It is undesirable because it predicts values that are not in $S_Y$.

(i) How might you arrive at a more 'desirable' predictor (i.e. one whose sample space is consistent with that of Y)? [Hint: What is $f_{Y|X=x}(y) = \Pr[Y = y \mid X = x] = p_{y|x}$?]

Answer: Use the relative values of $f_{Y|X=x}(y) = \Pr[Y = y \mid X = x] = p_{y|x}$ to choose the y-value having the largest probability for a given value x. [Note: If you haven't figured it out, this estimator is the maximum likelihood estimator!]

To further investigate the answer to (i), consider the following Matlab code, which yields a sequence of plots of the pdf of $Y \mid X = x = \sum_{k=1}^{x-1} W_k \sim \text{binomial}(x-1,\ p = 0.2)$:
%PROGRAM NAME: ch8ex3_4.m
% This code uses the conditional pdf of Y|X=x ~ Bino(x-1, 0.2) to explore an
% alternative estimator of Y = the # of 1's rolled before the first '6',
% which occurs on the xth roll.
xmax = 50;
p = 0.2;
% For x = 1 there are no rolls before the first '6', so Pr[Y=0 | X=1] = 1.
for x = 2:xmax
    xm1   = x - 1;                    % number of rolls before the first '6'
    yvals = 0:xm1;                    % possible values of Y given X = x
    fygx  = binopdf(yvals, xm1, p);   % conditional pdf of Y given X = x
    stem(yvals, fygx)
    grid on
    title(['Conditional pdf of Y given X = ', num2str(x)])
    xlabel('y'), ylabel('Pr[Y = y | X = x]')
    pause                             % press a key to advance to the next x
end
We will now run this code. After each plot, the code is paused so that we can investigate the probability structure of $Y \mid X = x$. In particular, identify the highest-probability value of this random variable. Perhaps the plots will lead you to construct an alternative estimator of $Y \mid X$ to the mmse estimator $E(Y \mid X)$. □
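One way to implement the 'highest-probability value' idea suggested above is sketched below: for each x, the prediction is the y that maximizes the Binomial(x−1, 0.2) conditional pdf. This is an illustrative sketch, not code from the notes:

% For each x, choose the alternative prediction yStar(x) = the value of y
% that maximizes Pr[Y = y | X = x], where Y|X=x ~ Bino(x-1, 0.2).
xmax  = 50;
p     = 0.2;
yStar = zeros(1, xmax);
for x = 1:xmax
    yvals = 0:(x - 1);
    [~, idx] = max(binopdf(yvals, x - 1, p));
    yStar(x) = yvals(idx);            % conditional-mode prediction for [X = x]
end
stairs(1:xmax, yStar)
grid on
xlabel('x'), ylabel('yStar(x)')
title('Highest-probability prediction of Y given X = x')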
The last example addressed the tossing of a die. It is a classic type of example that is found in many textbooks on
probability and statistics. It is also the type of example that can be viewed as quite uninteresting to many students. It is,
however, readily couched in relation to many more interesting problems. Consider the following example.
Example 3.5 Suppose that a certain vehicle design includes 6 types of potential malfunctions, with severity ratings of 1 (very minor) to 6 (very major). In this context, then, the last example is concerned with predicting the number of very minor malfunctions that could precede the occurrence of the first very major malfunction. Now let's add an element of reality. Suppose that the probability of a malfunction is inversely proportional to its severity level. At this point we will assume that malfunctions that occur over time are mutually independent (not all that realistic, in general, but easier to deal with at this point). What we now have are independent, but non-identical, Bernoulli random variables. Hence, the pdf of $Y \mid X = x$ is no longer binomial(x−1, p). Even so, at this point in the course the student should not be intimidated by the lack of a 'convenient' pdf. Clearly, the process of arriving at the predictor (3.11) will be more involved. But the payoff might well be worth the effort. For example, suppose that a very minor malfunction has probability 0.95, while a very major malfunction has probability 0.01. For the moment, let's ignore the other levels of malfunction. Furthermore, let's assume that the cost associated with a very minor malfunction is, say, $100, while that associated with a very major malfunction is, say, $10,000. Then what is of interest in this problem setting is to predict the cost of vehicle upkeep prior to the first major cost. This author feels that many companies would be very interested in being able to predict such costs. □
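As a rough illustration only, the sketch below simulates the upkeep cost accumulated before the first very major malfunction. It makes the additional assumed modeling choices that malfunction events occur independently, that each event is very minor with probability 0.95 or very major with probability 0.01 (other severity levels ignored, as above), and that the costs are $100 and $10,000 respectively; none of these modeling details beyond the quoted numbers come from the notes.

% Simulated upkeep cost ($100 per very minor malfunction) accumulated
% before the first very major malfunction (which would cost $10,000).
% Event probabilities per malfunction opportunity: 0.95 (very minor),
% 0.01 (very major); these modeling choices are assumptions.
nTrials = 1e4;
cost = zeros(nTrials, 1);
for t = 1:nTrials
    while true
        u = rand;
        if u < 0.01                    % very major malfunction: stop accumulating
            break
        elseif u < 0.01 + 0.95         % very minor malfunction: $100 of upkeep
            cost(t) = cost(t) + 100;
        end                            % otherwise: an ignored severity level
    end
end
fprintf('Estimated mean upkeep cost before the first major malfunction: $%.0f\n', mean(cost))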
We now proceed to elaborate upon Example 2.1.

Example 3.6 Let $(X,Y) \sim N(\mu_X, \mu_Y, \sigma_X, \sigma_Y, \rho)$. Then the joint pdf is given by (see p.220):

$$f(x,y) = \left(2\pi\,\sigma_X\sigma_Y\sqrt{1-\rho^2}\right)^{-1}\exp\!\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^{\!2} + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^{\!2} - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\!\left(\frac{y-\mu_Y}{\sigma_Y}\right)\right]\right\}. \qquad (3.13)$$

Integrating (3.13) over the interval $(-\infty, \infty)$ for y (resp. x) gives (see pp.220-221):

$$f_X(x) = \frac{1}{\sigma_X\sqrt{2\pi}}\exp\!\left[-\frac{(x-\mu_X)^2}{2\sigma_X^2}\right] \quad ; \quad f_Y(y) = \frac{1}{\sigma_Y\sqrt{2\pi}}\exp\!\left[-\frac{(y-\mu_Y)^2}{2\sigma_Y^2}\right]. \qquad (3.14)$$

The conditional pdf of Y given X = x for continuous random variables X and Y is $f_{Y|X=x}(y) = \dfrac{f(x,y)}{f_X(x)}$.

Hence, from (3.13) and (3.14) we obtain (see Theorem 6.9 on p.221):

$$Y \mid X = x \ \sim\ N\!\left(\mu_{Y|X=x} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}(x - \mu_X)\ ,\ \ \sigma_{Y|X=x}^2 = \sigma_Y^2(1 - \rho^2)\right). \qquad (3.15)$$

From (3.15) we have

$$E(Y \mid X = x) = \mu_{Y|X=x} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}(x - \mu_X),$$

and hence:

$$\mu(X) = E(Y \mid X) = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}(X - \mu_X) = \left(\rho\,\frac{\sigma_Y}{\sigma_X}\right)X + \left(\mu_Y - \rho\,\frac{\sigma_Y}{\sigma_X}\,\mu_X\right).$$

Recalling that $\rho = \dfrac{\sigma_{XY}}{\sigma_X\sigma_Y}$, we have $\rho\,\dfrac{\sigma_Y}{\sigma_X} = \dfrac{\sigma_{XY}}{\sigma_X^2} = m$ and $\mu_Y - \rho\,\dfrac{\sigma_Y}{\sigma_X}\,\mu_X = \mu_Y - m\,\mu_X = b$. These are exactly the model parameters we obtained for the best linear unbiased predictor of Y given X, regardless of the underlying pdfs. Hence, we can conclude that in the case of jointly normal random variables, the best linear predictor of Y using X is the best among all unbiased predictors. [Ah!! If only everything in life were normal!]

Notice also that, given knowledge that X = x, the random variable $Y \mid X = x$ has variance that is reduced by the factor $1 - \rho^2$ relative to that of Y alone.
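A minimal simulation sketch that checks the conditional mean and variance in (3.15); the parameter values are assumed for illustration:

% Simulate jointly normal (X,Y) and compare the sample mean and variance of
% Y over a thin slice X near x0 against the formulas in (3.15).
muX = 1; muY = 2; sigX = 1.5; sigY = 2.0; rho = 0.7;   % assumed values
n = 1e6;
X = muX + sigX*randn(n,1);
Y = muY + rho*sigY*((X - muX)/sigX) + sigY*sqrt(1 - rho^2)*randn(n,1);
x0  = 2;                                  % condition on X near x0
idx = abs(X - x0) < 0.05;
condMeanTheory = muY + rho*(sigY/sigX)*(x0 - muX);
condVarTheory  = sigY^2*(1 - rho^2);
fprintf('E(Y|X=x0):   sample %.3f   theory %.3f\n', mean(Y(idx)), condMeanTheory)
fprintf('Var(Y|X=x0): sample %.3f   theory %.3f\n', var(Y(idx)),  condVarTheory)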
Example 3.7 Let $\{X_k\}_{k=-\infty}^{\infty}$ be a 'time'-indexed collection of random variables (i.e. a random process), and suppose that this collection adheres to the following conditions:

(C1): For any index k, $E(X_k) = \mu_k = \mu_X$ (i.e. all have the same mean);

(C2): For any index k, $\operatorname{Var}(X_k) = \sigma_k^2 = \sigma_X^2$ (i.e. all have the same variance);

(C3): For any indices j, k, $\operatorname{Cov}(X_j, X_{j+k}) = \sigma_{j,j+k} = c(k)$ (i.e. the covariance depends only on how far apart the indices are).

A random process that adheres to the above conditions is said to be a wide sense stationary (wss) process.

Now, suppose that the process has $\mu_X = 0$ and adheres to the following recursion:

$$X_k = a\,X_{k-1} + U_k \qquad (3.16)$$

where the random variables $\{U_k\}_{k=-\infty}^{\infty}$ are iid $\sim N(0, \sigma_U^2)$. [Notice that the random process $\{U_k\}_{k=-\infty}^{\infty}$ also satisfies the above conditions, and is therefore also a wss process. However, because $\operatorname{Cov}(U_j, U_{j+k}) = 0$ for $k \ne 0$, it is given a special name. It is called a white noise process.]

In view of (3.16) and the fact that $(X_k, X_{k+1})$ are jointly normal, the conditional pdf of $X_{k+1}$ given $X_k = x_k$ follows from (3.15) (with zero means and equal variances):

$$X_{k+1} \mid X_k = x_k \ \sim\ N\!\left(a\,x_k\ ,\ \sigma_{X_{k+1}|X_k=x_k}^2 = \sigma_X^2(1 - \rho^2)\right), \qquad (3.17)$$

where ρ is the correlation coefficient between $X_k$ and $X_{k+1}$. Moreover, multiplying both sides of (3.16) by $X_{k-1}$ and taking expected values of both sides gives

$$E(X_k X_{k-1}) = a\,E(X_{k-1}^2) + E(U_k X_{k-1}) \ \Rightarrow\ \operatorname{Cov}(X_k, X_{k-1}) = a\,\sigma_X^2 \ \Rightarrow\ a = \operatorname{Cov}(X_k, X_{k-1})/\sigma_X^2 = \rho.$$

[Note that since $\{U_k\}_{k=-\infty}^{\infty}$ is a white noise process, the random variable $U_k$ is independent of $X_{k-1}$, since the latter depends only on $\{U_j\}_{j=-\infty}^{k-1}$.] Hence,

$$X_{k+1} \mid X_k = x_k \ \sim\ N\!\left(\rho\,x_k\ ,\ \sigma_X^2(1 - \rho^2)\right). \qquad (3.18)$$

Specifically, from (3.17) and (3.18) we see that the random variable $X_{k+1} \mid X_k = x_k$ is simply

$$X_{k+1} \mid X_k = x_k \ =\ \rho\,x_k + U_{k+1}. \qquad (3.19)$$

Since $\rho\,x_k$ is non-random, it follows immediately that the variance of $X_{k+1} \mid X_k = x_k$ is exactly the variance of $U_{k+1}$. Again, since $\{U_k\}_{k=-\infty}^{\infty}$ is wss, it follows from (C2) that $\operatorname{Var}(U_{k+1}) = \sigma_U^2$; that is:

$$\sigma_U^2 = \sigma_X^2(1 - \rho^2). \qquad (3.20)$$

In words, then: using the predictor $\hat{X}_{k+1} = E(X_{k+1} \mid X_k) = \rho\,X_k$, the prediction 'error' is $X_{k+1} - \hat{X}_{k+1} = U_{k+1}$, whose variance is exactly (3.20). Furthermore, no predictor can do better than this. Why? Well, the predictor $\hat{X}_{k+1} = \rho\,X_k = E(X_{k+1} \mid X_k)$ is, after all, the best (mmse) predictor. Viewed another way, because the process $\{U_k\}_{k=-\infty}^{\infty}$ is a white noise process, it is totally random, or equivalently, totally unpredictable. □
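A minimal simulation sketch that checks (3.20) and the mse of the one-step predictor $\rho\,x_k$; the values of ρ and $\sigma_U$ are assumed for illustration:

% Simulate the AR(1) recursion X_k = rho*X_{k-1} + U_k and check that
% (i)  sigU^2 = sigX^2*(1 - rho^2), per (3.20), and
% (ii) the one-step predictor rho*x_k has mse equal to sigU^2.
rho  = 0.8;                             % assumed value
sigU = 1;                               % assumed value
n    = 1e6;
U = sigU*randn(n,1);
X = zeros(n,1);
X(1) = sigU/sqrt(1 - rho^2)*randn;      % start in the stationary distribution
for k = 2:n
    X(k) = rho*X(k-1) + U(k);
end
sigX2 = var(X);                         % sample estimate of sigX^2
err   = X(2:n) - rho*X(1:n-1);          % one-step prediction errors
fprintf('sigX^2*(1-rho^2) = %.3f,  sigU^2 = %.3f\n', sigX2*(1 - rho^2), sigU^2)
fprintf('mse of the predictor rho*x_k = %.3f\n', mean(err.^2))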