MATHEMATICAL STATISTICS
LECTURE NOTES
Prof. Dr. Onur KÖKSOY
MULTIVARIATE DISTRIBUTIONS
• Joint and Marginal Distributions
In this section we shall be concerned first with the bivariate case, that is, with situations where we are interested, at the same time, in a pair of random variables defined over a joint sample space. Later, we shall extend this discussion to the multivariate case, covering any finite number of random variables.
If 𝑋 and π‘Œ are discrete random variables, we write the probability that 𝑋 will take on
the value π‘₯ and π‘Œ will take on the value 𝑦 as 𝑃(𝑋 = π‘₯, π‘Œ = 𝑦). Thus, 𝑃(𝑋 = π‘₯, π‘Œ = 𝑦) is the
probability of the intersection of the events 𝑋 = π‘₯ and π‘Œ = 𝑦. We can now, in the bivariate
case, display the probabilities associated with all pairs of values 𝑋 and π‘Œ by means of a table.
Theorem: A bivariate function, 𝑝(π‘₯, 𝑦) = 𝑃(𝑋 = π‘₯, π‘Œ = 𝑦) can serve as the joint probability
distribution of a pair of discrete random variables 𝑋 and π‘Œ if and only if its values, 𝑝(π‘₯, 𝑦),
satisfy the conditions:
1. 𝑝(π‘₯, 𝑦) ≥ 0 for each pair of values (π‘₯, 𝑦) within its domain;
2. ∑π‘₯ ∑𝑦 𝑝(π‘₯, 𝑦) = 1, where the double summation extends over all possible pairs (π‘₯, 𝑦)
within its domain.
Definition: If 𝑋 and π‘Œ are discrete random variables, the function given by,
𝐹(π‘₯, 𝑦) = 𝑃(𝑋 ≤ π‘₯; π‘Œ ≤ 𝑦) = ∑ ∑ 𝑝(𝑠, 𝑑) π‘“π‘œπ‘Ÿ − ∞ < π‘₯ < ∞ ; −∞ < 𝑦 < ∞
𝑠≤π‘₯ 𝑑≤𝑦
where 𝑝(𝑠, 𝑑) is the value of the joint probability distribution of 𝑋 and π‘Œ at (𝑠, 𝑑), is called the
joint distribution function, or the joint cumulative distribution, of 𝑋 and π‘Œ.
Example: Determine the value of k for which the function given by
p(x, y) = kxy,  for x = 1, 2, 3 and y = 1, 2, 3
can serve as a joint probability distribution.
Solution: Substituting the various values of x and y, we get
p(1,1) = k, p(1,2) = 2k, p(1,3) = 3k, p(2,1) = 2k, p(2,2) = 4k, p(2,3) = 6k, p(3,1) = 3k, p(3,2) = 6k and p(3,3) = 9k.
To satisfy the first condition of the theorem, the constant k must be nonnegative, and to satisfy the second condition,
k + 2k + 3k + 2k + 4k + 6k + 3k + 6k + 9k = 1
so that 36k = 1 and k = 1/36.
Example: If the values of the joint distribution of X and Y are as shown in the table

 p(x, y)    x = 0    x = 1    x = 2
 y = 0       1/6      1/3     1/12
 y = 1       2/9      1/6       0
 y = 2      1/36       0        0

find F(1,1).
Solution:
F(1,1) = P(X ≤ 1, Y ≤ 1) = p(0,0) + p(0,1) + p(1,0) + p(1,1) = 1/6 + 2/9 + 1/3 + 1/6 = 8/9
Some Remarks: We also get
F(−2,1) = P(X ≤ −2, Y ≤ 1) = 0
F(3.7, 4.5) = P(X ≤ 3.7, Y ≤ 4.5) = 1
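The joint distribution function of a discrete pair is just a double sum over the table; a small sketch (not from the notes) that reproduces the values above:

```python
# Evaluate the joint CDF F(a, b) for the table above by summing p(x, y) over x <= a, y <= b.
from fractions import Fraction as F

p = {(0, 0): F(1, 6), (1, 0): F(1, 3), (2, 0): F(1, 12),
     (0, 1): F(2, 9), (1, 1): F(1, 6),
     (0, 2): F(1, 36)}          # cells omitted from the table are zero

def cdf(a, b):
    return sum(prob for (x, y), prob in p.items() if x <= a and y <= b)

print(cdf(1, 1))      # 8/9
print(cdf(-2, 1))     # 0
print(cdf(3.7, 4.5))  # 1
```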
Definition: A bivariate function with values f(x, y), defined over the xy-plane, is called a joint probability density function of the continuous random variables X and Y if and only if
P((X, Y) ∈ A) = ∬_A f(x, y) dx dy
for any region A in the xy-plane.
Theorem: A bivariate function can serve as a joint probability density function of continuous random variables X and Y if its values, f(x, y), satisfy the conditions:
1. f(x, y) ≥ 0 for −∞ < x < ∞, −∞ < y < ∞;
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
Definition: If X and Y are continuous random variables, the function given by
F(x, y) = P(X ≤ x, Y ≤ y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(s, t) ds dt,  for −∞ < x < ∞; −∞ < y < ∞
where f(s, t) is the value of the joint probability density of X and Y at (s, t), is called the joint distribution function of X and Y.
Note: f(x, y) = ∂²F(x, y)/∂x∂y
All definitions of this section can be generalized to the multivariate case where there are n random variables. Let the values of the joint probability distribution of n discrete random variables X1, X2, …, Xn be given by
p(x1, x2, …, xn) = P(X1 = x1, X2 = x2, …, Xn = xn)
for each n-tuple (x1, x2, …, xn) within the range of the random variables; the values of their joint distribution function are given by
F(x1, x2, …, xn) = P(X1 ≤ x1, X2 ≤ x2, …, Xn ≤ xn),
for −∞ < x1 < ∞; −∞ < x2 < ∞; …; −∞ < xn < ∞
In the continuous case, probabilities are again obtained by integrating the joint probability density, and the joint distribution function is given by
F(x1, x2, …, xn) = ∫_{−∞}^{xn} … ∫_{−∞}^{x2} ∫_{−∞}^{x1} f(t1, t2, …, tn) dt1 dt2 … dtn,
for −∞ < x1 < ∞; −∞ < x2 < ∞; …; −∞ < xn < ∞
Also, partial differentiation yields
f(x1, x2, …, xn) = ∂ⁿF(x1, x2, …, xn) / (∂x1 ∂x2 … ∂xn)
wherever these partial derivatives exist.
Example: If the joint probability density of X and Y is given by
f(x, y) = x + y,  for 0 < x < 1, 0 < y < 1
          0,      elsewhere
find the joint distribution function of these two random variables.
Solution:
F(x, y) = ∫_0^y ∫_0^x (s + t) ds dt = (1/2)xy(x + y),  for 0 < x < 1, 0 < y < 1
Example: Find the joint probability density function of the two random variables X and Y whose joint distribution function is given by
F(x, y) = (1 − e^(−x))(1 − e^(−y)),  for x > 0, y > 0
          0,                          elsewhere
Also use the joint probability density to determine P(1 < X < 3, 1 < Y < 2).
Solution: Since partial differentiation yields
f(x, y) = ∂²F(x, y)/∂x∂y = e^(−(x+y))
for x > 0 and y > 0, and 0 elsewhere, we find that the joint probability density of X and Y is given by
f(x, y) = e^(−(x+y)),  for x > 0, y > 0
          0,           elsewhere
Thus, integration yields
P(1 < X < 3, 1 < Y < 2) = ∫_1^2 ∫_1^3 e^(−(x+y)) dx dy = (e^(−1) − e^(−3))(e^(−1) − e^(−2)) = 0.074
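A numerical confirmation of this probability (a sketch using SciPy, not part of the original notes; the integration limits are the ones from the example):

```python
# P(1 < X < 3, 1 < Y < 2) for f(x, y) = exp(-(x+y)) on x > 0, y > 0.
import math
from scipy.integrate import dblquad

# dblquad integrates f(y, x): x runs over the outer limits, y over the inner ones.
prob, _ = dblquad(lambda y, x: math.exp(-(x + y)), 1, 3, lambda x: 1, lambda x: 2)
closed_form = (math.exp(-1) - math.exp(-3)) * (math.exp(-1) - math.exp(-2))
print(round(prob, 4), round(closed_form, 4))   # both about 0.074
```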
Example: If the joint probability distribution of three discrete random variables X, Y and Z is given by
p(x, y, z) = (x + y)z / 63,  for x = 1, 2; y = 1, 2, 3; z = 1, 2
find P(X = 2, Y + Z ≤ 3).
Solution:
P(X = 2, Y + Z ≤ 3) = p(2,1,1) + p(2,1,2) + p(2,2,1) = 3/63 + 6/63 + 4/63 = 13/63
Example: If the trivariate probability density of X1, X2 and X3 is given by
f(x1, x2, x3) = (x1 + x2) e^(−x3),  for 0 < x1 < 1, 0 < x2 < 1, x3 > 0
                0,                  elsewhere
find P((X1, X2, X3) ∈ A), where A is the region
{(x1, x2, x3) | 0 < x1 < 1/2, 1/2 < x2 < 1, x3 < 1}.
Solution:
P((X1, X2, X3) ∈ A) = P(0 < X1 < 1/2, 1/2 < X2 < 1, X3 < 1)
= ∫_0^1 ∫_{1/2}^1 ∫_0^{1/2} (x1 + x2) e^(−x3) dx1 dx2 dx3
= ∫_0^1 ∫_{1/2}^1 (1/8 + x2/2) e^(−x3) dx2 dx3
= ∫_0^1 (1/4) e^(−x3) dx3 = (1/4)(1 − e^(−1)) = 0.158
• Marginal Distributions
Definition: If X and Y are discrete random variables and p(x, y) is the value of their joint probability distribution at (x, y):
g(x) = ∑_y p(x, y) for each x within the range of X: the marginal distribution of X.
h(y) = ∑_x p(x, y) for each y within the range of Y: the marginal distribution of Y.
Definition: If X and Y are continuous random variables and f(x, y) is the value of their joint probability density at (x, y):
g(x) = ∫_{−∞}^{∞} f(x, y) dy for −∞ < x < ∞: the marginal density of X.
h(y) = ∫_{−∞}^{∞} f(x, y) dx for −∞ < y < ∞: the marginal density of Y.
• Joint Marginal Distributions:
Definition: If the joint probability distribution of the discrete random variables X1, X2, …, Xn has the values p(x1, x2, …, xn), the marginal distribution of X1 alone is given by
g(x1) = ∑_{x2} ∑_{x3} … ∑_{xn} p(x1, x2, …, xn)
for all values within the range of X1; the joint marginal distribution of X1, X2, X3 is given by
g(x1, x2, x3) = ∑_{x4} ∑_{x5} … ∑_{xn} p(x1, x2, …, xn)
for all values within the range of X1, X2 and X3; and other marginal distributions are defined in the same manner.
For the continuous case, let the continuous random variables be X1, X2, …, Xn with joint density f(x1, x2, …, xn). Then the marginal density of X2 alone is given by
h(x2) = ∫_{−∞}^{∞} … ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x1, x2, …, xn) dx1 dx3 … dxn,  for −∞ < x2 < ∞.
The joint marginal density of X1 and Xn is given by
h(x1, xn) = ∫_{−∞}^{∞} … ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x1, x2, …, xn) dx2 dx3 … dxn−1,
for −∞ < x1 < ∞, −∞ < xn < ∞, and so forth.
Example:

 p(x, y)    x = 0    x = 1    x = 2    h(y)
 y = 0       1/6      1/3     1/12     7/12
 y = 1       2/9      1/6       0      7/18
 y = 2      1/36       0        0      1/36
 g(x)       5/12      1/2     1/12
The column totals are the probabilities that X will take on the values 0, 1 and 2. In other words, they are the values
g(x) = ∑_{y=0}^{2} p(x, y),  for x = 0, 1, 2
of the probability distribution of X. By the same token, the row totals are the values
h(y) = ∑_{x=0}^{2} p(x, y),  for y = 0, 1, 2
of the probability distribution of Y.
Example: Given the joint probability density
f(x, y) = (2/3)(x + 2y),  for 0 < x < 1, 0 < y < 1
          0,              elsewhere
find the marginal densities of X and Y.
Solution: Performing the necessary integrations, we get
g(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 (2/3)(x + 2y) dy = (2/3)(x + 1),  0 < x < 1
Likewise,
h(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_0^1 (2/3)(x + 2y) dx = (1/3)(1 + 4y),  0 < y < 1
Example: Given the joint trivariate probability density
f(x1, x2, x3) = (x1 + x2) e^(−x3),  for 0 < x1 < 1, 0 < x2 < 1, x3 > 0
                0,                  elsewhere
find the joint marginal density of X1 and X3 and the marginal density of X1 alone.
Solution: Performing the necessary integration, we find that the joint marginal density of X1 and X3 is given by
m(x1, x3) = ∫_0^1 (x1 + x2) e^(−x3) dx2 = (x1 + 1/2) e^(−x3),
for 0 < x1 < 1 and x3 > 0, and m(x1, x3) = 0 elsewhere.
Using this result, we find that the marginal density of X1 alone is given by
g(x1) = ∫_0^∞ ∫_0^1 f(x1, x2, x3) dx2 dx3 = ∫_0^∞ m(x1, x3) dx3 = ∫_0^∞ (x1 + 1/2) e^(−x3) dx3 = x1 + 1/2
for 0 < x1 < 1, and g(x1) = 0 elsewhere.
• Covariance and Correlation:
Given any two random variables X1 and X2, the covariance of X1 and X2 is defined as
Cov(X1, X2) = E((X1 − μ1)(X2 − μ2)) = E(X1 X2) − μ1 μ2
Cov(X1, X2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x1 − μ1)(x2 − μ2) f(x1, x2) dx1 dx2,  if (X1, X2) are continuous r.v.'s
Cov(X1, X2) = ∑_{x1} ∑_{x2} (x1 − μ1)(x2 − μ2) p(x1, x2),  if (X1, X2) are discrete r.v.'s
Here, μi is the expected value of Xi.
In the continuous case,
μi = ∫_{−∞}^{∞} xi fi(xi) dxi
In the discrete case,
μi = ∑_{xi} xi pi(xi)
The correlation between X1 and X2 is defined as
ρ12 = Cov(X1, X2) / (σ1 σ2)
where σi (i = 1, 2) is the standard deviation of Xi.
The correlation between X1 and X2 is known as a measure of association, and −1 ≤ ρ12 ≤ 1.
Example: For the joint distribution

 p(x, y)    x = 0    x = 1    x = 2    h(y)
 y = 0       1/6      1/3     1/12     7/12
 y = 1       2/9      1/6       0      7/18
 y = 2      1/36       0        0      1/36
 g(x)       5/12      1/2     1/12

find the covariance of X and Y.
Solution:
E(XY) = (0·0·1/6) + (0·1·2/9) + (0·2·1/36) + (1·0·1/3) + (1·1·1/6) + (2·0·1/12) = 1/6
μ_X = E(X) = (0·5/12) + (1·1/2) + (2·1/12) = 2/3
μ_Y = E(Y) = (0·7/12) + (1·7/18) + (2·1/36) = 4/9
Cov(X, Y) = E(XY) − μ_X μ_Y = 1/6 − (2/3)(4/9) = −7/54
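The same computation can be done directly from the table; a minimal sketch (not part of the original notes):

```python
# Compute E(XY), E(X), E(Y) and Cov(X, Y) from the joint table; result is -7/54.
import numpy as np

joint = np.array([[1/6, 1/3, 1/12],   # y = 0
                  [2/9, 1/6, 0.0],    # y = 1
                  [1/36, 0.0, 0.0]])  # y = 2
x_vals = np.array([0, 1, 2])
y_vals = np.array([0, 1, 2])

E_X = (joint.sum(axis=0) * x_vals).sum()
E_Y = (joint.sum(axis=1) * y_vals).sum()
E_XY = sum(joint[j, i] * x * y
           for j, y in enumerate(y_vals) for i, x in enumerate(x_vals))
print(E_XY - E_X * E_Y, -7/54)   # both about -0.1296
```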
Example: Given
f(x, y) = 2,  for x > 0, y > 0, x + y < 1
          0,  elsewhere
find the covariance of the random variables.
Solution:
μ_X = ∫_0^1 ∫_0^{1−x} 2x dy dx = 1/3  and  μ_Y = ∫_0^1 ∫_0^{1−x} 2y dy dx = 1/3
E(XY) = ∫_0^1 ∫_0^{1−x} 2xy dy dx = 1/12
Cov(X, Y) = σ_XY = E(XY) − μ_X μ_Y = 1/12 − (1/3)(1/3) = −1/36
✓ The following are some properties of covariance:
1. |πΆπ‘œπ‘£(𝑋, π‘Œ)| ≤ πœŽπ‘‹ πœŽπ‘Œ where πœŽπ‘‹ and πœŽπ‘Œ are the standard deviations of 𝑋 and π‘Œ, respectively.
2. If 𝑐 is any constant, then πΆπ‘œπ‘£(𝑋, 𝑐) = 0.
3. For any constant π‘Ž and 𝑏, πΆπ‘œπ‘£(π‘Žπ‘‹, π‘π‘Œ) = π‘Žπ‘πΆπ‘œπ‘£(𝑋, π‘Œ).
4. For any constants a, 𝑏, 𝑐, π‘Žπ‘›π‘‘ 𝑑,
πΆπ‘œπ‘£(π‘Žπ‘‹1 + 𝑏𝑋2 , 𝑐𝑋3 + 𝑑𝑋4 ) = π‘Žπ‘πΆπ‘œπ‘£(𝑋1 , 𝑋3 ) + π‘Žπ‘‘πΆπ‘œπ‘£(𝑋1 , 𝑋4 ) + π‘π‘πΆπ‘œπ‘£(𝑋2 , 𝑋3 ) + π‘π‘‘πΆπ‘œπ‘£(𝑋2 , 𝑋4 )
Property 4 can be generalized to
Cov(∑_{i=1}^{m} a_i X_i, ∑_{j=1}^{n} b_j Y_j) = ∑_{i=1}^{m} ∑_{j=1}^{n} a_i b_j Cov(X_i, Y_j)
From Property 1, we deduce that −1 ≤ ρ12 ≤ 1. The correlation attains the values ±1 only if the two variables are linearly dependent.
Definition: Random variables X1, …, Xk are said to be mutually independent if, for every (x1, …, xk),
f(x1, …, xk) = ∏_{i=1}^{k} f_i(x_i)
where f_i(x_i) is the marginal p.d.f. of X_i.
Remark: If two random variables are independent, their correlation (or covariance) is zero. The converse is generally not true: zero correlation does not imply independence.
Example: If the joint probability distribution of X and Y is given by

 p(x, y)    x = −1    x = 0    x = 1    h(y)
 y = −1      1/6       1/3      1/6     2/3
 y = 0        0         0        0       0
 y = 1       1/6        0       1/6     1/3
 g(x)        1/3       1/3      1/3
Show that their covariance is zero even though the two random variables are not independent.
Solution:
μ_X = E(X) = ((−1)·1/3) + (0·1/3) + (1·1/3) = 0
μ_Y = E(Y) = ((−1)·2/3) + (0·0) + (1·1/3) = −1/3
E(XY) = ((−1)(−1)·1/6) + (0·(−1)·1/3) + (1·(−1)·1/6) + ((−1)·1·1/6) + (1·1·1/6) = 0
Thus, σ_XY = Cov(X, Y) = 0 − (0·(−1/3)) = 0; the covariance is zero, but the two random variables are not independent. For instance, p(x, y) ≠ g(x)·h(y) for x = −1 and y = −1.
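This "uncorrelated but dependent" behaviour can be verified mechanically; a minimal sketch (not part of the original notes):

```python
# Zero covariance but p(x, y) != g(x)h(y), so X and Y are uncorrelated yet dependent.
import numpy as np

x_vals = np.array([-1, 0, 1])
y_vals = np.array([-1, 0, 1])
joint = np.array([[1/6, 1/3, 1/6],   # y = -1
                  [0.0, 0.0, 0.0],   # y = 0
                  [1/6, 0.0, 1/6]])  # y = 1

g = joint.sum(axis=0)                       # marginal of X
h = joint.sum(axis=1)                       # marginal of Y
E_X = (g * x_vals).sum()
E_Y = (h * y_vals).sum()
E_XY = (joint * np.outer(y_vals, x_vals)).sum()
print(E_XY - E_X * E_Y)                     # 0.0 -> zero covariance
print(np.allclose(joint, np.outer(h, g)))   # False -> not independent
```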
The following result is very important for independent random variables.
✓ If X1, …, Xk are mutually independent, then, for any integrable functions g1(X1), …, gk(Xk),
E(∏_{i=1}^{k} g_i(X_i)) = ∏_{i=1}^{k} E(g_i(X_i))
Indeed,
E(∏_{i=1}^{k} g_i(X_i)) = ∫…∫ g1(x1)…gk(xk) f(x1, …, xk) dx1 … dxk
= ∫…∫ g1(x1)…gk(xk) f1(x1)…fk(xk) dx1 … dxk
= ∫ g1(x1) f1(x1) dx1 … ∫ gk(xk) fk(xk) dxk
= ∏_{i=1}^{k} E(g_i(X_i))
• Conditional Distributions:
Continuous Case: Let (X1, X2) be continuous r.v.'s with joint p.d.f. f(x1, x2) and marginal p.d.f.'s f1(·) and f2(·), respectively. Then the conditional p.d.f. of X2, given X1 = x1, where f1(x1) > 0, is defined to be
f_{2.1}(x2 | x1) = f(x1, x2) / f1(x1)
Remark: f_{2.1}(x2 | x1) is a p.d.f.
Proof:
∫_{−∞}^{∞} f_{2.1}(x2 | x1) dx2 = (∫_{−∞}^{∞} f(x1, x2) dx2) / f1(x1) = f1(x1) / f1(x1) = 1
The conditional expectation of X2, given X1 = x1 such that f1(x1) > 0, is the expected value of X2 with respect to the conditional p.d.f. f_{2.1}(x2 | x1), that is,
E(X2 | X1 = x1) = ∫_{−∞}^{∞} x2 f_{2.1}(x2 | x1) dx2 = μ_{X2|x1}
The conditional variance of X2, given X1 = x1 such that f1(x1) > 0, is
Var(X2 | X1 = x1) = E((X2 − μ_{X2|x1})² | X1 = x1) = E(X2² | X1 = x1) − (E(X2 | X1 = x1))²
Remark: X1 and X2 are independent if and only if
f_{2.1}(x2 | x1) = f2(x2) for all x2  and  f_{1.2}(x1 | x2) = f1(x1) for all x1
Remark: E_X(E(Y|X)) = E(Y): the law of iterated expectation.
Proof:
E_X(E(Y|X)) = ∫ E(Y|X = x) f_X(x) dx
= ∫ {∫ y f_{Y.X}(y|x) dy} f_X(x) dx
= ∫∫ y (f(x, y)/f_X(x)) f_X(x) dy dx
= ∫ y {∫ f(x, y) dx} dy
= ∫ y f_Y(y) dy
= E(Y)
Discrete Case:
E(u(X) | Y = y) = ∑_x u(x) p(x|y)
where
p(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
provided that P(Y = y) ≠ 0.
Example: If the joint density of X and Y is given by
f(x, y) = (2/3)(x + 2y),  for 0 < x < 1, 0 < y < 1
          0,              elsewhere
find the conditional mean and the conditional variance of X given Y = 1/2.
Solution: The marginal densities are
g(x) = ∫_0^1 (2/3)(x + 2y) dy = (2/3)(x + 1),  for 0 < x < 1
h(y) = ∫_0^1 (2/3)(x + 2y) dx = (1/3)(4y + 1),  for 0 < y < 1
so that
f(x|y) = f(x, y)/h(y) = (2/3)(x + 2y) / ((1/3)(1 + 4y)) = (2x + 4y)/(1 + 4y),  for 0 < x < 1.
Hence,
f(x | 1/2) = (2/3)(x + 1),  for 0 < x < 1
             0,             elsewhere
Thus,
E(X | 1/2) = μ_{X|1/2} = ∫_0^1 x·(2/3)(x + 1) dx = 5/9
E(X² | 1/2) = ∫_0^1 x²·(2/3)(x + 1) dx = 7/18
Var(X | 1/2) = 7/18 − (5/9)² = 13/162
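A numerical cross-check of these conditional moments (a sketch, not part of the original notes; it simply integrates against the conditional density found above):

```python
# E(X | Y = 1/2) and Var(X | Y = 1/2) for f(x | 1/2) = (2/3)(x + 1) on (0, 1).
from scipy.integrate import quad

cond_pdf = lambda x: (2/3) * (x + 1)
m1, _ = quad(lambda x: x * cond_pdf(x), 0, 1)
m2, _ = quad(lambda x: x**2 * cond_pdf(x), 0, 1)
print(m1, 5/9)                 # conditional mean, about 0.5556
print(m2 - m1**2, 13/162)      # conditional variance, about 0.0802
```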
• Linear Combinations of Random Variables:
Let X1, …, Xn be random variables having a joint distribution with joint p.d.f. f(x1, …, xn), and let α1, …, αn be given constants. Then
W = ∑_{i=1}^{n} α_i X_i
is a linear combination of the X's. The expected value and the variance of a linear combination are as follows:
E(W) = ∑_{i=1}^{n} α_i E(X_i)
and
Var(W) = ∑_{i=1}^{n} α_i² Var(X_i) + ∑_{i≠j} α_i α_j Cov(X_i, X_j),
where ∑_{i≠j} α_i α_j Cov(X_i, X_j) = 2 ∑_{i<j} α_i α_j Cov(X_i, X_j).
Example: Let X1, …, Xn be i.i.d. random variables with common expectation μ and common finite variance σ². The sample mean X̄_n = (1/n) ∑_{i=1}^{n} X_i is a particular linear combination, with α1 = α2 = … = αn = 1/n. Hence,
E(X̄_n) = (1/n) ∑_{i=1}^{n} E(X_i) = μ
(We have shown that in a random sample of n i.i.d. random variables, the sample mean has the same expectation as that of the individual variables.) Because X1, …, Xn are mutually independent, Cov(X_i, X_j) = 0 for all i ≠ j. Hence,
Var(X̄_n) = (1/n²) ∑_{i=1}^{n} Var(X_i) = σ²/n
(The variance of the sample mean is the population variance reduced by a factor of n.)
Moreover, from Chebyshev's inequality, for any ε > 0,
Pr(|X̄_n − μ| > ε) < σ²/(nε²)
Therefore, because lim_{n→∞} σ²/(nε²) = 0,
lim_{n→∞} Pr(|X̄_n − μ| > ε) = 0
This property is called the CONVERGENCE IN PROBABILITY of X̄_n to μ.
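A small simulation illustrating this convergence (a sketch, not part of the original notes; the normal population, parameter values and seed are arbitrary choices):

```python
# The fraction of samples with |sample mean - mu| > eps shrinks as n grows.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps = 5.0, 2.0, 0.2
for n in (10, 100, 1000, 10000):
    means = rng.normal(mu, sigma, size=(2000, n)).mean(axis=1)
    print(n, np.mean(np.abs(means - mu) > eps))   # frequency of large deviations
```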
Example: If the random variables X, Y and Z have the means μ_X = 2, μ_Y = −3, μ_Z = 4, the variances σ_X² = 1, σ_Y² = 5, σ_Z² = 2, and the covariances Cov(X, Y) = −2, Cov(X, Z) = −1, Cov(Y, Z) = 1, find the mean and variance of W = 3X − Y + 2Z.
Solution:
E(W) = E(3X − Y + 2Z) = 3E(X) − E(Y) + 2E(Z) = (3·2) − (−3) + (2·4) = 17
Var(W) = 9Var(X) + Var(Y) + 4Var(Z) − 6Cov(X, Y) + 12Cov(X, Z) − 4Cov(Y, Z)
= (9·1) + 5 + (4·2) − (6·(−2)) + (12·(−1)) − (4·1) = 18
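In matrix form the same computation is E(W) = a′μ and Var(W) = a′Σa; a minimal sketch (not part of the original notes) using the numbers of this example:

```python
# W = 3X - Y + 2Z written as a'(X, Y, Z); mean a'mu and variance a'Sigma a.
import numpy as np

a = np.array([3, -1, 2])
mu = np.array([2, -3, 4])
Sigma = np.array([[1, -2, -1],
                  [-2, 5, 1],
                  [-1, 1, 2]])        # variances on the diagonal, covariances off it
print(a @ mu)            # 17
print(a @ Sigma @ a)     # 18
```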
Theorem: If X1, …, Xn are r.v.'s and Y1 = ∑_{i=1}^{n} a_i X_i and Y2 = ∑_{i=1}^{n} b_i X_i, where a1, …, an and b1, …, bn are constants, then
Cov(Y1, Y2) = ∑_{i=1}^{n} a_i b_i Var(X_i) + ∑_{i<j} (a_i b_j + a_j b_i) Cov(X_i, X_j)
Corollary: If the r.v.'s X1, …, Xn are independent, Y1 = ∑_{i=1}^{n} a_i X_i and Y2 = ∑_{i=1}^{n} b_i X_i, then
Cov(Y1, Y2) = ∑_{i=1}^{n} a_i b_i Var(X_i)
Example: If the random variables X, Y and Z have the means μ_X = 3, μ_Y = 5, μ_Z = 2, the variances σ_X² = 8, σ_Y² = 12, σ_Z² = 18, and the covariances Cov(X, Y) = 1, Cov(X, Z) = −3, Cov(Y, Z) = 2, find the covariance of W = X + 4Y + 2Z and V = 3X − Y − Z.
Solution:
Cov(W, V) = Cov(X + 4Y + 2Z, 3X − Y − Z)
= 3Var(X) − 4Var(Y) − 2Var(Z) + 11Cov(X, Y) + 5Cov(X, Z) − 6Cov(Y, Z)
= (3·8) − (4·12) − (2·18) + (11·1) + (5·(−3)) − (6·2) = −76
✓ The following is a useful result:
If X1, …, Xn are mutually independent, then the m.g.f. of T_n = ∑_{i=1}^{n} X_i is
M_{T_n}(t) = ∏_{i=1}^{n} M_{X_i}(t)
where M_{X_i}(t) = E(e^{t X_i}) = ∑_{x_i} e^{t x_i} f(x_i).
Proof:
M_{T_n}(t) = E(e^{t ∑_{i=1}^{n} X_i}) = E(∏_{i=1}^{n} e^{t X_i}) = ∏_{i=1}^{n} E(e^{t X_i}) = ∏_{i=1}^{n} M_{X_i}(t)
Example: If X1, …, Xn are independent random variables having Poisson distributions with parameters λ_i, i = 1, 2, …, n, show that their sum T_n = ∑_{i=1}^{n} X_i has the Poisson distribution with parameter μ_n = ∑_{i=1}^{n} λ_i.
Solution:
M_{X_i}(t) = ∑_{x_i=0}^{∞} e^{t x_i} e^{−λ_i} λ_i^{x_i} / x_i! = e^{−λ_i} ∑_{x_i=0}^{∞} (e^t λ_i)^{x_i} / x_i! = e^{−λ_i(1 − e^t)} = exp(−λ_i(1 − e^t))
Then,
M_{T_n}(t) = ∏_{i=1}^{n} M_{X_i}(t) = ∏_{i=1}^{n} exp(−λ_i(1 − e^t)) = exp(−∑_{i=1}^{n} λ_i(1 − e^t)) = exp(−μ_n(1 − e^t)),
which is the m.g.f. of a Poisson distribution with parameter μ_n.
Example: If X1, …, Xk are independent random variables, each having a binomial distribution B(n_i, p), i = 1, 2, …, k, then their sum T_k has a binomial distribution.
Solution:
M_{X_i}(t) = ∑_{x_i=0}^{n_i} e^{t x_i} (n_i! / (x_i!(n_i − x_i)!)) p^{x_i} (1 − p)^{n_i − x_i} = ∑_{x_i=0}^{n_i} (p e^t)^{x_i} (n_i! / (x_i!(n_i − x_i)!)) (1 − p)^{n_i − x_i} = (p e^t + 1 − p)^{n_i}
Then,
M_{T_k}(t) = ∏_{i=1}^{k} M_{X_i}(t) = (p e^t + 1 − p)^{∑_{i=1}^{k} n_i}
That is, T_k is distributed like B(∑_{i=1}^{k} n_i, p).
Example: Suppose X1, …, Xn are independent random variables and the distribution of X_i is normal N(μ_i, σ_i²). Then the distribution of W = ∑_{i=1}^{n} α_i X_i is normal, namely
N(∑_{i=1}^{n} α_i μ_i, ∑_{i=1}^{n} α_i² σ_i²)
Example: Suppose X1, …, Xn are independent random variables and the distribution of X_i is gamma G(v_i, β). Then the distribution of T_n = ∑_{i=1}^{n} X_i is gamma, namely G(∑_{i=1}^{n} v_i, β), since
M_{T_n}(t) = ∏_{i=1}^{n} M_{X_i}(t) = ∏_{i=1}^{n} (1 − βt)^{−v_i} = (1 − βt)^{−∑_{i=1}^{n} v_i}
EXERCISES
1. Let X and Y have the joint probability function
p(x, y) = (x + y)/21,  x = 1, 2, 3;  y = 1, 2
a. Find the conditional probability function of X, given that Y = y.
b. P(X = 2 | Y = 2) = ?
c. Find the conditional probability function of Y, given that X = x.
d. Find the conditional mean μ_{Y|x} and the conditional variance σ²_{Y|x} when x = 3.
e. Graph the joint, marginal and conditional probability functions.
Solution:
p1(x) = ∑_{y=1}^{2} (x + y)/21 = (2x + 3)/21,  x = 1, 2, 3
p2(y) = ∑_{x=1}^{3} (x + y)/21 = (3y + 6)/21,  y = 1, 2
a. g(x|y) = p(x, y)/p2(y) = ((x + y)/21) / ((3y + 6)/21) = (x + y)/(3y + 6),  x = 1, 2, 3, when y = 1 or 2.
b. P(X = 2 | Y = 2) = g(2|2) = 4/12 = 1/3
c. h(y|x) = p(x, y)/p1(x) = (x + y)/(2x + 3),  y = 1, 2, when x = 1, 2 or 3.
d. μ_{Y|3} = E(Y | X = 3) = ∑_{y=1}^{2} y h(y|3) = ∑_{y=1}^{2} y(3 + y)/9 = (1·4/9) + (2·5/9) = 14/9
σ²_{Y|3} = E((Y − 14/9)² | X = 3) = ∑_{y=1}^{2} (y − 14/9)²(3 + y)/9 = (25/81 · 4/9) + (16/81 · 5/9) = 20/81
e. (The graphs of the joint, marginal and conditional probability functions were given as figures in the original notes.)
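Exercise 1 can also be checked by brute-force enumeration; a minimal sketch (not part of the original notes):

```python
# Conditional pmf of Y given X = 3, its mean and variance, for p(x, y) = (x + y)/21.
from fractions import Fraction as F

p = {(x, y): F(x + y, 21) for x in (1, 2, 3) for y in (1, 2)}
p1 = {x: sum(p[x, y] for y in (1, 2)) for x in (1, 2, 3)}        # (2x + 3)/21

h_given_3 = {y: p[3, y] / p1[3] for y in (1, 2)}                 # (3 + y)/9
mean = sum(y * pr for y, pr in h_given_3.items())
var = sum((y - mean) ** 2 * pr for y, pr in h_given_3.items())
print(mean, var)   # 14/9, 20/81
```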
2. Let X and Y have the joint probability density function
f(x, y) = 2,  0 ≤ x ≤ y ≤ 1
a. Find the marginal density functions.
b. h(y|x) = ?
c. Find the conditional mean μ_{Y|x} and the conditional variance σ²_{Y|x}.
d. Calculate P(3/4 ≤ Y ≤ 7/8 | X = 1/4).
Solution:
a. f1(x) = ∫_x^1 2 dy = 2(1 − x),  0 ≤ x ≤ 1
   f2(y) = ∫_0^y 2 dx = 2y,  0 ≤ y ≤ 1
b. h(y|x) = f(x, y)/f1(x) = 2/(2(1 − x)) = 1/(1 − x),  x ≤ y ≤ 1, 0 ≤ x ≤ 1.
c. μ_{Y|x} = E(Y|x) = ∫_x^1 y/(1 − x) dy = y²/(2(1 − x)) |_x^1 = (1 + x)/2,  0 ≤ x ≤ 1
   σ²_{Y|x} = E((Y − μ_{Y|x})² | x) = ∫_x^1 (y − (1 + x)/2)²/(1 − x) dy = (1/(3(1 − x)))(y − (1 + x)/2)³ |_x^1 = (1 − x)²/12
d. P(3/4 ≤ Y ≤ 7/8 | X = 1/4) = ∫_{3/4}^{7/8} h(y | 1/4) dy = ∫_{3/4}^{7/8} (1/(3/4)) dy = 1/6.
3. Let X and Y have the joint probability function
p(x, y) = (x + 2y)/18,  x = 1, 2;  y = 1, 2
a. Show that X and Y are dependent.
b. Find Cov(X, Y).
c. Find the correlation coefficient ρ.
Solution:
a. The marginal probability functions are, respectively,
p(x) = ∑_{y=1}^{2} (x + 2y)/18 = (2x + 6)/18,  x = 1, 2
p(y) = ∑_{x=1}^{2} (x + 2y)/18 = (3 + 4y)/18,  y = 1, 2
Since p(x, y) ≠ p(x)p(y), X and Y are dependent.
b. μ_X = E(X) = ∑_{x=1}^{2} x(2x + 6)/18 = (1·8/18) + (2·10/18) = 14/9
μ_Y = E(Y) = ∑_{y=1}^{2} y(3 + 4y)/18 = (1·7/18) + (2·11/18) = 29/18
E(XY) = ∑_{y=1}^{2} ∑_{x=1}^{2} xy(x + 2y)/18 = (1·1·3/18) + (2·1·4/18) + (1·2·5/18) + (2·2·6/18) = 45/18
Cov(X, Y) = E(XY) − μ_X μ_Y = 45/18 − (14/9)(29/18) = −1/162
c. σ_X² = Var(X) = ∑_{x=1}^{2} x²(2x + 6)/18 − (14/9)² = 20/81
σ_Y² = Var(Y) = ∑_{y=1}^{2} y²(3 + 4y)/18 − (29/18)² = 77/324
ρ = Cov(X, Y) / √(Var(X)Var(Y)) = (−1/162) / √((20/81)(77/324)) = −0.025
4. "Independence implies zero correlation, but zero correlation does not necessarily imply independence."
X and Y are independent ⇒ Cov(X, Y) = 0
(But the converse of this is not necessarily true.)
To show this, let X and Y have the joint probability function
p(x, y) = 1/3,  (x, y) = (0,1), (1,0), (2,1)
Show that Cov(X, Y) = 0, but X and Y are dependent.
Solution:
μ_X = E(X) = 1 and μ_Y = E(Y) = 2/3.
Cov(X, Y) = E(XY) − μ_X μ_Y = (0·1·1/3) + (1·0·1/3) + (2·1·1/3) − (1·2/3) = 0
That is, ρ = 0, but X and Y are dependent.
(Show that p(x, y) ≠ p(x)p(y).)
5. Let X and Y have the joint probability function
p(x, y) = (x + y)/21,  x = 1, 2, 3;  y = 1, 2
Show that X and Y are dependent.
Solution:
p(x) = ∑_{y=1}^{2} (x + y)/21 = (2x + 3)/21,  x = 1, 2, 3
p(y) = ∑_{x=1}^{3} (x + y)/21 = (6 + 3y)/21,  y = 1, 2
Since p(x)p(y) ≠ p(x, y), X and Y are dependent.
6. Let the joint probability function of X and Y be
p(x, y) = xy²/30,  x = 1, 2, 3;  y = 1, 2
Show that X and Y are independent.
Solution:
p(x) = ∑_{y=1}^{2} xy²/30 = x/6,  x = 1, 2, 3
p(y) = ∑_{x=1}^{3} xy²/30 = y²/5,  y = 1, 2
Since p(x)p(y) = p(x, y) for x = 1, 2, 3 and y = 1, 2, X and Y are independent.
7. There are eight similar chips in a bowl: three marked (0,0), two marked (1,0), two marked (0,1) and one marked (1,1). A player selects a chip at random and is given the sum of the two coordinates in dollars. If X and Y represent those two coordinates, respectively, their joint probability function is
p(x, y) = (3 − x − y)/8,  x = 0, 1;  y = 0, 1
Find the expected payoff.
Solution:
E(X + Y) = ∑_{y=0}^{1} ∑_{x=0}^{1} (x + y)(3 − x − y)/8 = (0·3/8) + (1·2/8) + (1·2/8) + (2·1/8) = 3/4
That is, the expected payoff is 75 cents.
8. Let X and Y have the joint probability density function
f(x, y) = (3/2) x²(1 − |y|),  −1 < x < 1, −1 < y < 1
Let A = {(x, y): 0 < x < 1, 0 < y < x}. Find the probability that (X, Y) falls into A.
Solution:
P((X, Y) ∈ A) = ∫_0^1 ∫_0^x (3/2) x²(1 − y) dy dx = ∫_0^1 (3/2) x²(y − y²/2)|_0^x dx = ∫_0^1 (3/2)(x³ − x⁴/2) dx
= (3/2)(x⁴/4 − x⁵/10)|_0^1 = 9/40.
9. Let X and Y have the joint probability density function
f(x, y) = 2,  0 ≤ x ≤ y ≤ 1
a. Find P(0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2).
b. Find the marginal densities f(x) and f(y).
c. Find E(X) and E(Y).
d. Show that X and Y are dependent.
Solution:
a. P(0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2) = ∫_0^{1/2} ∫_0^y 2 dx dy = 1/4.
b. f(x) = ∫_x^1 2 dy = 2(1 − x),  0 ≤ x ≤ 1
   f(y) = ∫_0^y 2 dx = 2y,  0 ≤ y ≤ 1
c. E(X) = ∫_0^1 ∫_x^1 2x dy dx = ∫_0^1 2x(1 − x) dx = 1/3.
   E(Y) = ∫_0^1 ∫_0^y 2y dx dy = ∫_0^1 2y² dy = 2/3.
d. f(x)f(y) ≠ f(x, y) ⇒ X and Y are dependent, because the support S is not a product space, since it is bounded by the diagonal line y = x.
ORDER STATISTICS
(The remarks, examples, and theorem for this section appeared as figures in the original notes and are not reproduced in this text version.)
FUNCTIONS OF RANDOM VARIABLES
• One-to-one Transformation
Frequently in statistics one encounters the need to derive the probability distribution of a function of one or more random variables. For example, suppose that X is a discrete random variable with probability distribution p(x), and suppose further that Y = u(X) defines a one-to-one transformation between the values of X and Y. We wish to find the probability distribution of Y. It is important to note that the one-to-one transformation implies that each value x is related to one, and only one, value y = u(x), and that each value y is related to one, and only one, value x = w(y), where w(y) is obtained by solving y = u(x) for x in terms of y. The random variable Y assumes the value y when X assumes the value w(y). Consequently, the probability distribution of Y is given by
g(y) = P(Y = y) = P[X = w(y)] = p[w(y)].
Theorem: Suppose that X is a discrete random variable with probability distribution p(x). Let Y = u(X) define a one-to-one transformation between the values of X and Y so that the equation y = u(x) can be uniquely solved for x in terms of y, say x = w(y). Then the probability distribution of Y is
g(y) = p[w(y)].
Example: Find the probability distribution of the random variable Y = 2X − 1. (The distribution of X and the worked solution were given as a figure in the original notes.)
Example: Let X1 and X2 be discrete random variables with joint probability distribution as given in the original figure.
Solution:
P(Y = 1) = g(1,1) = 1/18
P(Y = 2) = g(2,1) + g(2,2) = 2/18 + 2/18 = 2/9
P(Y = 3) = g(3,3) = 3/18 = 1/6
P(Y = 4) = g(4,2) = 4/18 = 2/9
P(Y = 6) = g(6,3) = 6/18 = 1/3
Example: The hospital period, in days, for patients following treatment for a certain type of kidney disorder is a random variable Y = X + 4, where X has the density function given in the original figure.
a. Find the probability density function of the random variable Y.
b. Using the density function of Y, find the probability that the hospital period for a patient following this treatment will exceed 8 days.
Example: The random variables X and Y, representing the weights of creams and toffees, respectively, in 1-kilogram boxes of chocolates containing a mixture of creams, toffees, and cordials, have the joint density function given in the original figure.
a. Find the probability density function of the random variable Z = X + Y.
b. Using the density function of Z, find the probability that, in a given box, the sum of the weights of creams and toffees accounts for at least 1/2 but less than 3/4 of the total weight.
• Not One-to-one Transformation
Example: Let X have the probability distribution given in the original figure. Find the probability distribution of the random variable Y = X².
• The CDF Technique
Example 4. Suppose that X ~ U[−1, 1]. Find the distribution of Y = exp(X).
Example 5. Suppose that X ~ U[−1, 1]. Find the distribution of Y = X².
Example 6. Suppose that X ~ U[−1, 1]. Find the distribution of Y = 1 − X².
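The worked solutions were not reproduced in this text version; as a hedged illustration of the CDF technique for Example 5, note that for X ~ U[−1, 1] the standard derivation gives F_Y(y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = √y for 0 ≤ y ≤ 1. A quick Monte Carlo sketch (not from the notes) is consistent with this:

```python
# Empirical CDF of Y = X^2 for X ~ U[-1, 1] versus the CDF-technique answer sqrt(y).
import numpy as np

rng = np.random.default_rng(2)
y_samples = rng.uniform(-1, 1, size=200_000) ** 2
for y in (0.1, 0.25, 0.5, 0.9):
    print(y, np.mean(y_samples <= y), np.sqrt(y))   # empirical vs sqrt(y)
```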
ESTIMATION
• General Remarks:
Suppose we have a population and a random variable X which results from some
measurement in the population.
e.g. : population = all people, X = IQ ,
population = all trees, X = lifespan
We are interested in knowing something about the distribution of X in the population.
If we draw an observation, say X, then X has this distribution. If we draw n observations X1, X2, …, Xn, then each Xi has this distribution; hence the Xi's are identically distributed. Furthermore, if the observations are a random sample from the population, then the Xi's will be independent. Hence X1, X2, …, Xn are i.i.d. (independent and identically distributed) random variables.
Inference : Infer something about the distribution of X in the population, based on a random
sample.
Two types of inference:
a. Estimation of a population parameter.
There are two types of estimation:
1. Point estimation: from the sample, produce a "best guess" as to the value of a parameter. Since we usually work with continuous distributions, P(exactly correct) = 0.
2. Interval estimation: produce some interval, i.e. a region, which we believe includes the true parameter value. Since this gives us an interval, we can find the probability that the interval includes the parameter.
b. Hypothesis testing: checking some preconceived notion about a population parameter.
Point Estimation
• Introduction:
Suppose we draw a random sample X1, X2, …, Xn from a population; X1, X2, …, Xn are i.i.d. Let θ represent some parameter of the p.d.f. f(x). We wish to estimate θ based on the sample X1, X2, …, Xn. Usually we form some function of the sample, θ̂(X1, X2, …, Xn), as an estimator of θ.
Now, θ̂ is a function of the Xi's and is called the estimator function, or simply the estimator. Once we draw a random sample, we compute a numeric value from this function, which we call the estimate of θ. For example, X̄ is an estimator of μ_X: once we draw a random sample and compute the sample average, we get a numeric value which is the estimate of μ_X.
Definition: An estimator is a rule, often expressed as a formula, that tells how to calculate the value of an estimate based on the measurements contained in a sample. For example, the sample mean
Ȳ = (1/n) ∑_{i=1}^{n} Y_i
is one possible point estimator of the population mean μ. Clearly, the expression for Ȳ is both a rule and a formula. It tells us to sum the sample observations and divide by the sample size n.
1. Techniques for Finding Point Estimators
1.1 Method of Moments
The method of moments equates the sample moments μ̂_j = (1/n) ∑_{i=1}^{n} X_i^j, j = 1, 2, …, to the corresponding population moments E(X^j) and solves the resulting equations for the unknown parameters; this notation μ̂_1, μ̂_2 is used in the examples below.
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from a Bernoulli distribution with parameter p:
a. Find the moment estimator for p.
b. Tossing a coin 10 times and equating heads to the value 1 and tails to the value 0, we obtained the following values:
0-1-1-0-1-0-1-1-1-0
Obtain a moment estimate for p, the probability of success (heads).
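The worked solution is not reproduced in the notes; as a sketch (assuming the standard result that the first population moment of a Bernoulli(p) variable is p, so the moment estimator is the sample mean):

```python
# Moment estimate of p for the coin data above: p_hat = first sample moment.
data = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
p_hat = sum(data) / len(data)
print(p_hat)   # 0.6
```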
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from a Gamma distribution with parameters α and β, use the method of moments to estimate α and β.
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from N(μ, σ²), use the method of moments to estimate μ and σ².
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from U(a, b), use the method of moments to estimate a and b.
μ̂1 = (a + b)/2 ⇒ 2μ̂1 = a + b ⇒ â = 2μ̂1 − b̂
μ̂2 = (a² + ab + b²)/3 ⇒ 3μ̂2 = â² + âb̂ + b̂² ⇒ 3μ̂2 = (2μ̂1 − b̂)² + (2μ̂1 − b̂)b̂ + b̂²
3μ̂2 = 4μ̂1² − 4μ̂1b̂ + b̂² + 2μ̂1b̂ − b̂² + b̂² = 3μ̂1² + (μ̂1 − b̂)²
√(3μ̂2 − 3μ̂1²) = b̂ − μ̂1
b̂ = μ̂1 + √(3μ̂2 − 3μ̂1²)  and  â = 2μ̂1 − b̂
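Applying these formulas to simulated data is a useful sanity check; a minimal sketch (not part of the original notes; the true endpoints, sample size and seed are arbitrary choices):

```python
# Method-of-moments estimates of (a, b) for U(a, b) using the formulas above.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(2.0, 7.0, size=5000)       # true a = 2, b = 7
m1 = x.mean()                              # first sample moment
m2 = np.mean(x ** 2)                       # second sample moment
b_hat = m1 + np.sqrt(3 * m2 - 3 * m1 ** 2)
a_hat = 2 * m1 - b_hat
print(a_hat, b_hat)                        # close to 2 and 7
```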
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from a Poisson distribution with parameter λ > 0, show that both (1/n) ∑_{i=1}^{n} X_i and (1/n) ∑_{i=1}^{n} X_i² − ((1/n) ∑_{i=1}^{n} X_i)² are moment estimators of λ.
1.2 Method of Maximum Likelihood
It is highly desirable to have a method that is generally applicable to the construction
of statistical estimators that have “good” properties. In this section we present an important
method for finding estimators of parameters proposed by geneticist/statistician Sir Ronald A.
Fisher around 1922 called the method of maximum likelihood. Even though the method of
moments is intuitive and easy to apply, it usually does not yield “good” estimators. The
method of maximum likelihood is intuitively appealing, because we attempt to find the values
of the true parameters that would have most likely produced the data that we in fact observed.
For most cases of practical interest, the performance of maximum likelihood estimators is
optimal for large enough data. This is one of the most versatile methods for fitting parametric
statistical models to data.
Definition: Let f(x1, x2, …, xn; θ) be the joint probability (or density) function of n random variables X1, X2, …, Xn with sample values x1, x2, …, xn. The likelihood function of the sample is given by
L(θ; x1, x2, …, xn) = f(x1, x2, …, xn; θ)   (= L(θ) in briefer notation)
We emphasize that L is a function of θ for fixed sample values.
Definition: The maximum likelihood estimators (MLEs) are those values of the parameters that maximize the likelihood function with respect to the parameter θ. That is,
L(θ̂; x1, x2, …, xn) = max_θ L(θ; x1, x2, …, xn) = max_θ ∏_{i=1}^{n} f(x_i; θ)
For computational convenience, L is transformed to
ln L(θ; x1, x2, …, xn) = ∑_{i=1}^{n} ln f(x_i; θ)
To find θ̂, we need only maximize ln L with respect to θ. In this regard, if L is a twice-differentiable function of θ, then a necessary (first-order) condition for ln L to attain a maximum at θ = θ̂ is
d ln L/dθ |_{θ=θ̂} = 0
Hence all we need to do is set d ln L/dθ = 0 and solve for the value of θ, θ̂, which makes this derivative vanish. If θ̂ = g(x1, x2, …, xn) is the value of θ that maximizes ln L, then θ̂ is termed the maximum likelihood estimate of θ; it is a realization of the maximum likelihood estimator θ̂ = g(X1, X2, …, Xn) and represents the parameter value most likely to have generated the sample realizations x_i, i = 1, …, n.
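In practice ln L is often maximized numerically; a hedged sketch (not part of the original notes; the exponential model, rate 2.5, sample size and seed are arbitrary illustration choices, and the closed-form MLE 1/x̄ for the exponential rate is used only as a cross-check):

```python
# Numeric MLE: minimize -ln L for i.i.d. exponential data with rate theta.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.exponential(scale=1 / 2.5, size=1000)     # true rate theta = 2.5

def neg_log_lik(theta):
    # ln L = n*ln(theta) - theta*sum(x) for f(x; theta) = theta*exp(-theta*x)
    return -(len(x) * np.log(theta) - theta * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())    # numeric MLE vs closed-form MLE
```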
Example: Suppose we have a biased coin for which one side is four times as likely to turn up on any given flip as the other. After n = 3 tosses we must determine if the coin is biased in favor of heads (H) or in favor of tails (T). If we define a success as getting heads on any flip of the coin, then the probability that heads occurs will be denoted as p (and thus the probability of tails on any single flip is 1 − p). For this problem θ = p, so that the Bernoulli probability function can be written as
P(X = x; θ) = p(x; θ) = p^x (1 − p)^(1−x)
So if heads occurs, x = 1 and p(1; θ) = p; and if tails occurs, x = 0 and p(0; θ) = 1 − p. Since one side of the coin is four times as likely to occur as the other, the possible values of θ = p are 1/5 or 4/5.
π‘Ž
(In general, if the odds in favor of some event A occurring are π‘Ž to 𝑏, then 𝑃(𝐴) = π‘Ž+𝑏; and
𝑏
the odds against A must be 𝑃(𝐴′) = π‘Ž+𝑏. For our problem, since the odds of one side are4 to 1
1
1
relative to the other, we have 𝑃(𝐴) = 1+4. Hence 5 is the probability of one side occurring and
4
5
is the probability of the other side occurring.)
For each of these θ's, the associated Bernoulli probability distribution is provided in the table below.

 x      p(x; 1/5)      p(x; 4/5)
 0        4/5            1/5
 1        1/5            4/5
Suppose we toss the combination (H, T, H) so that x1 = 1, x2 = 0 and x3 = 1. Let us express the likelihood function for n = 3 as
L(θ; x1, x2, x3) = ∏_{i=1}^{3} p(x_i; θ) = ∏_{i=1}^{3} p^{x_i} (1 − p)^{1−x_i} = p^{∑ x_i} (1 − p)^{3 − ∑ x_i}
Then,
𝐿(πœƒ; 1,0,1) = 𝑝2 (1 − 𝑝)
Clearly, the probability of the observed sample is a function of πœƒ = 𝑝.
1
1
12
1
4
4
4
42
4
16
For πœƒ = 5 , 𝐿 (5 ; 1,0,1) = 5 (1 − 5) = 125
For πœƒ = 5 , 𝐿 (5 ; 1,0,1) = 5 (1 − 5) = 125
4
So, if the coin is biased toward heads, πœƒ = 𝑝 = 5 and thus the probability of the event
16
1
(H,T,H) is 125 ; and if it is biased toward tails, πœƒ = 𝑝 = 5 and thus the probability of the event
(H,T,H) is
4
125
. Since the maximum likelihood function is
estimate of 𝑝 is 𝑝̂ =
4
5
16
125
, the maximum likelihood
, and this estimate yields the largest a priori probability of the given
event (H,T,H); that is, it is the value of 𝑝 that renders the observed sample combination
(H,T,H) most likely.
61
Example: Suppose that in n = 5 drawings (with replacement) from a vessel containing a large number of red and black balls we obtain two red and three black balls. What is the best estimate of the proportion of red balls in the vessel?
Solution:
Let p (respectively, 1 − p) denote the probability of getting a red (respectively, black) ball from the vessel on any draw. Clearly the desired proportion of red balls must coincide with p. Under sampling with replacement, the various drawings yield a set of independent events, and thus the probability of obtaining the given sequence of events is the product of the probabilities of the individual drawings. With two red and three black balls, the probability of the given sequence of events is p²(1 − p)³. But the observed sequence of outcomes is only one way of getting two red and three black balls. The total number of ways of getting two red and three black is C(5, 2) = 10.
Hence the implied binomial probability is B(2; 5, p) = 10p²(1 − p)³. Since the number of red balls observed is fixed, this expression is a function of p: the likelihood function of the sample. Thus the likelihood function for the observed number of red balls is
L(p; 2, 5) = 10p²(1 − p)³,  0 ≤ p ≤ 1
Hence the maximum likelihood method has us choose the value for p which makes the observed outcome of two red and three black balls the most probable outcome. To make our choice of p, let us perform the following experiment: we specify a whole range of possible p's and select the one that maximizes L. That is, we will choose the p, p̂, that maximizes the probability of getting the actual sample outcome. Hence p̂ best explains the realized sample. All this is carried out in the table below.
   p              L(p; 2, 5)
  0.0              0.0000
  0.1              0.0729
  0.2              0.2048
  0.3              0.3087
  0.4 (p̂ = 0.4)    0.3456 (max. value)
  0.5              0.3125
  0.6              0.2304
  0.7              0.1323
  0.8              0.0512
  0.9              0.0081
  1.0              0.0000
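The grid search in the table is easy to reproduce; a minimal sketch (not part of the original notes):

```python
# Evaluate L(p; 2, 5) = 10 p^2 (1 - p)^3 on a grid of p values and pick the maximizer.
import numpy as np

p = np.round(np.arange(0.0, 1.01, 0.1), 1)
L = 10 * p**2 * (1 - p)**3
for pi, Li in zip(p, L):
    print(pi, round(Li, 4))
print("p_hat =", p[np.argmax(L)])   # 0.4
```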
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from Geo(p), 0 ≤ p ≤ 1, find the MLE of p.
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from a Poisson distribution with parameter λ > 0, find the MLE of λ.
Example: Let X1, X2, …, Xn be i.i.d. N(μ, σ²) random variables.
a. If μ is unknown and σ² is known, find the MLE for μ.
b. If μ is known and σ² is unknown, find the MLE for σ².
c. If μ and σ² are both unknown, find the MLEs for both μ and σ².
Example: Given that the sample random variables X1, X2, …, Xn are i.i.d. from G(α, β), find the MLEs for the unknown parameters α and β.
1.3 Method of Least Squares
The last technique for obtaining a good point estimator of a parameter πœƒ is the method
of least squares. Least squares estimators are determined in a fashion such that desirable
properties of an estimator are essentially built into them by virtue of the process by which
they are constructed. In this regard, least squares estimators are best linear unbiased
estimators. Best means that out of the class of all unbiased linear estimators of πœƒ, the least
squares estimators have minimum variance, and thus minimum mean squared error.
Additionally, least squares estimators have the advantage that knowledge of the form of the population probability density function is not required.
Example: Let us determine a least squares estimator for the population mean μ. Given that the sample random variables Y1, Y2, …, Yn are i.i.d., it follows that E(Y_i) = μ and V(Y_i) = σ² for all i = 1, 2, …, n. Now, let Y_i = μ + ε_i, i = 1, 2, …, n, where ε_i is an observational random variable with E(ε_i) = 0 and V(ε_i) = σ². Under random sampling, ε_i accounts for the difference between Y_i and its mean μ.
The principle of least squares directs us to choose μ (the Y_i's are fixed, i = 1, 2, …, n) so as to minimize the sum of the squared deviations between the observed Y_i values and their mean μ. That is, we should choose μ so as to minimize
∑_{i=1}^{n} ε_i² = ∑_{i=1}^{n} (Y_i − μ)²
To this end we set
d(∑_{i=1}^{n} ε_i²)/dμ = −2 ∑_{i=1}^{n} (Y_i − μ) = 0
so as to obtain μ̂ = Ȳ. (Note that d²(∑_{i=1}^{n} ε_i²)/dμ² = 2n > 0, as required for a minimum.)
Hence the least squares estimator of the population mean μ is the sample mean Ȳ.
2. Properties of Point Estimators
Three different methods of finding estimators for population parameters have been
introduced in the preceding section. We have seen that it is possible to have several estimators
for the same parameter. For a practitioner of statistics, an important question is going to be
which of many available sample statistics, such as mean, median, smallest observation, or
largest observation, should be chosen to represent the entire sample? Should we use the
method of moments estimator, the maximum likelihood estimator, or an estimator obtained
through some other method of least squares? Now we introduce some common ways to
distinguish between them by looking at some desirable properties of these estimators.
2.1 Unbiasedness:
It is desirable to have the property that the expected value of an estimator of a
parameter is equal to the true value of the parameter. Such estimators are called unbiased
estimators.
Definition: Let θ̂ be a point estimator for a parameter θ. Then θ̂ is an unbiased estimator if
E(θ̂) = θ
If E(θ̂) ≠ θ, then θ̂ is said to be a biased estimator. The bias of a point estimator θ̂ is given by
B(θ̂) = E(θ̂) − θ
** Note that the bias is a constant.
Example: Let X1, X2, …, Xn depict a set of i.i.d. sample random variables taken from a Bernoulli population with parameter p. Show that the method of moments estimator of p is also an unbiased estimator.
Example: Let X1, X2, …, Xn depict a set of i.i.d. sample random variables taken from a population with finite mean μ. Show that the sample mean X̄ and (1/3)X̄ + (2/3)X1 are both unbiased estimators of μ.
Solution:
E(X̄) = E((1/n) ∑_{i=1}^{n} X_i) = (1/n) ∑_{i=1}^{n} E(X_i) = (1/n)·nμ = μ
E((1/3)X̄ + (2/3)X1) = E((1/3)(1/n) ∑_{i=1}^{n} X_i + (2/3)X1) = (1/(3n)) ∑_{i=1}^{n} E(X_i) + (2/3)E(X1) = (1/(3n))·nμ + (2/3)μ = μ
Hence, the sample mean X̄ and (1/3)X̄ + (2/3)X1 are both unbiased estimators of μ.
✓ For instance, if the i.i.d. sample random variables X_i, i = 1, 2, …, n, each have the same mean μ and variance σ² as the population random variable X, then:
• If θ̂ = X̄, then E(X̄) = μ for any population distribution. Hence, the sample mean is an unbiased estimator of the population mean.
• If a statistic θ̂ = ∑_{i=1}^{n} a_i X_i is a linear combination of the sample random variables X_i, i = 1, 2, …, n, then E(θ̂) = E(∑_{i=1}^{n} a_i X_i) = μ ∑_{i=1}^{n} a_i. If ∑_{i=1}^{n} a_i = 1, then θ̂ is an unbiased estimator of μ.
• For X a binomial random variable, if θ̂ = p̂ = X/n, then E(p̂) = p. Hence the sample proportion serves as an unbiased estimator of the population proportion p.
• If θ̂ is an unbiased estimator of a parameter θ, it does not follow that every function of θ̂, h(θ̂), is an unbiased estimator of h(θ).
✓ As we have seen, there can be many unbiased estimators of a parameter θ. Which one of these estimators should we choose?
• If we have to choose an unbiased estimator, it is desirable to choose the one with the least variance. Figure 1 shows sampling distributions for unbiased point estimators of θ. We would prefer that our estimator have the type of distribution indicated in Figure 1(b), because the smaller variance guarantees that in repeated sampling a higher fraction of values of θ̂2 will be close to θ.
• If an estimator is biased, then we should prefer the one with low bias as well as low variance. Generally, it is better to have an estimator that has low bias as well as low variance. Rather than using the bias and variance of a point estimator to characterize its goodness, we might employ E((θ̂ − θ)²), the average of the square of the distance between the estimator and its target parameter.
This leads us to the following definition:
Definition: The mean square error of a point estimator θ̂ is
MSE(θ̂) = E((θ̂ − θ)²)
The mean square error of an estimator θ̂, MSE(θ̂), is a function of both its variance and its bias. If B(θ̂) denotes the bias of the estimator θ̂, it can be shown that
MSE(θ̂) = Var(θ̂) + (B(θ̂))²
Definition: The unbiased estimator θ̂ that minimizes the mean square error is called the minimum variance unbiased estimator (MVUE) of θ.
Consider two unbiased estimators, θ̂1 and θ̂2, of a population parameter θ. Then MSE(θ̂1) = Var(θ̂1) and MSE(θ̂2) = Var(θ̂2). If Var(θ̂1) < Var(θ̂2), then θ̂1 is relatively more efficient than θ̂2 as an estimator of θ.
Figure 1(a): 𝐸(𝑇) = πœƒ ∢ 𝑇 is an unbiased estimator of πœƒ, its sampling distribution is centered
exactly on πœƒ.
Figure 1(b): 𝐸(𝑇) ≠ πœƒ ∢ 𝑇 is a biased estimator of πœƒ, its sampling distribution is not centered
on πœƒ.
Figure 1(c): 𝐸(𝑇) = 𝐸(𝑇′) = πœƒ , π‘‰π‘Žπ‘Ÿ(𝑇) < π‘‰π‘Žπ‘Ÿ(𝑇′) ∢ 𝑇 π‘Žπ‘›π‘‘ 𝑇′ are unbiased estimators
of πœƒ, its sampling distribution is centered on πœƒ. It seems preferable to choose the one with the
smallest variance since using 𝑇 yields a higher probability of obtaining an estimate that is
closer to πœƒ.
Figure 1(d): 𝐸(𝑇) ≠ πœƒ, 𝐸(𝑇′) = πœƒ , π‘‰π‘Žπ‘Ÿ(𝑇) < π‘‰π‘Žπ‘Ÿ(𝑇′) ∢ 𝑇 is a biased estimator of πœƒ and
the bias is small, then 𝑇 may be preferable to an unbiased estimator 𝑇’ whose variance is
large.
Example: Let Y1, Y2, …, Yn depict a set of i.i.d. sample random variables taken from a population with E(Y_i) = μ and V(Y_i) = σ². Show that
S′² = (1/n) ∑_{i=1}^{n} (Y_i − Ȳ)² is a biased estimator for σ², and that
S² = (1/(n−1)) ∑_{i=1}^{n} (Y_i − Ȳ)² is an unbiased estimator for σ².
Solution:
Bias(S′²) = E(S′²) − σ² = ((n − 1)/n)σ² − σ² = −σ²/n
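A simulation makes the bias visible; a minimal sketch (not part of the original notes; the normal population, σ² = 4, n = 5 and the seed are arbitrary choices):

```python
# For samples of size n = 5, S'^2 (divide by n) underestimates sigma^2 by about
# sigma^2/n on average, while S^2 (divide by n - 1) is unbiased.
import numpy as np

rng = np.random.default_rng(5)
sigma2, n = 4.0, 5
y = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
s2_biased = y.var(axis=1, ddof=0)     # S'^2
s2_unbiased = y.var(axis=1, ddof=1)   # S^2
print(s2_biased.mean(), sigma2 * (n - 1) / n)   # about 3.2
print(s2_unbiased.mean(), sigma2)               # about 4.0
```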
2.2. Efficiency:
We have seen that there can be more than one unbiased estimator for a parameter θ.
We have also mentioned that the one with the least variance is desirable. Here, we introduce
the concept of efficiency. If there are two unbiased estimators, it is desirable to have the one
with a smaller variance. Specifically, πœƒΜ‚ is termed an efficient (alternatively minimum variance
unbiased or best unbiased) estimator of θ if πœƒΜ‚ is unbiased and 𝑉(πœƒΜ‚) ≤ 𝑉(πœƒΜ‚’) for all possible
values of θ, where πœƒΜ‚’ is any other unbiased estimator of θ.
Hence πœƒΜ‚ is more efficient than πœƒΜ‚’ since the sampling distribution of πœƒΜ‚ is more closely
concentrated about θ than the sampling distribution of πœƒΜ‚’.
How do we determine whether or not a given estimator is efficient?
Definition: If θ̂1 and θ̂2 are two unbiased estimators of a population parameter θ, the efficiency of θ̂1 relative to θ̂2 is the ratio
e(θ̂1, θ̂2) = Var(θ̂2) / Var(θ̂1)
If θ̂1 and θ̂2 are two biased estimators of a population parameter θ, the efficiency of θ̂1 relative to θ̂2 is the ratio
e(θ̂1, θ̂2) = MSE(θ̂2) / MSE(θ̂1)
If Var(θ̂2) > Var(θ̂1) (respectively, MSE(θ̂2) > MSE(θ̂1)), or equivalently e(θ̂1, θ̂2) > 1, then θ̂1 is relatively more efficient than θ̂2.
Example: Compare the mean square errors (relative efficiency) of
S′² = (1/n) ∑_{i=1}^{n} (Y_i − Ȳ)²  and  S² = (1/(n−1)) ∑_{i=1}^{n} (Y_i − Ȳ)²
as estimators of σ² for a normal population.
Solution:
(n − 1)S²/σ² ~ χ²(n−1). So,
E((n − 1)S²/σ²) = n − 1 ⇒ (n − 1)E(S²)/σ² = n − 1 ⇒ E(S²) = σ²
Var((n − 1)S²/σ²) = 2(n − 1) ⇒ (n − 1)² Var(S²)/σ⁴ = 2(n − 1) ⇒ Var(S²) = 2σ⁴/(n − 1)
Because E(S²) = σ², S² is an unbiased estimator of σ². Then MSE(S²) = 2σ⁴/(n − 1).
nS′² = (n − 1)S² ⇒ S′² = ((n − 1)/n) S²
E(S′²) = ((n − 1)/n) E(S²) = ((n − 1)/n)σ² = σ² − σ²/n
Because E(S′²) ≠ σ², S′² is a biased estimator of σ², and its bias is −σ²/n.
Var(S′²) = ((n − 1)²/n²) Var(S²) = 2(n − 1)σ⁴/n²
MSE(S′²) = Var(S′²) + (−σ²/n)² = 2(n − 1)σ⁴/n² + σ⁴/n² = (2n − 1)σ⁴/n²
e(S², S′²) = MSE(S′²)/MSE(S²) = ((2n − 1)σ⁴/n²) / (2σ⁴/(n − 1)) = (2n − 1)(n − 1)/(2n²)
We now turn to a more comprehensive approach to finding an efficient or minimum variance unbiased estimator.
Definition: An unbiased estimator θ̂0 is said to be a uniformly minimum variance unbiased estimator (UMVUE) for the parameter θ if, for any other unbiased estimator θ̂,
Var(θ̂0) ≤ Var(θ̂)
for all possible values of θ. It is not always easy to find a UMVUE for a parameter.
Our search for any such estimator is facilitated by the notion of a lower limit CR(θ, n) (hereafter called the Cramér-Rao (1945, 1946) lower bound) on the variance of any unbiased estimator of a parameter θ. If θ̂ is any unbiased estimator of θ, then V(θ̂) ≥ CR(θ, n). This lower bound enables us to determine whether a given unbiased estimator has the (theoretically) smallest possible variance in the sense that, if V(θ̂) = CR(θ, n), then θ̂ represents the most efficient estimator of θ. In this regard, if we can find an unbiased estimator θ̂ for which V(θ̂) = CR(θ, n), then the most efficient estimator is actually a minimum variance bound estimator.
Theorem (Cramér-Rao Inequality): Let X1, X2, …, Xn be i.i.d. random sample variables from a population with p.d.f. (or p.f.) f_θ(x) that depends on a parameter θ. If θ̂ is an unbiased estimator of θ, then
Var(θ̂) ≥ 1 / (−E(d² ln L / dθ²)) = CR(θ, n)
Hence the variance of θ̂ is never smaller than CR(θ, n), which is constant for fixed n. If θ̂ is an unbiased estimator of θ and strict equality holds in the theorem, then θ̂ is the most efficient or minimum variance bound estimator of θ, and hence a uniformly minimum variance unbiased estimator (UMVUE) of θ. The realizations of the most efficient estimator are those that are most concentrated about θ and thus have the highest probability of being close to θ.
Example: Let X1, X2, …, Xn be a set of independent and identically distributed sample random variables taken from a normally distributed population. Assuming that σ² is known, what is the Cramér-Rao lower bound on the variance of any unbiased estimator of μ?
Solution: The Cramér-Rao lower bound works out to σ²/n. Since X̄ is an unbiased estimator of μ and Var(X̄) = σ²/n, it follows that θ̂ = μ̂ = X̄ is a minimum variance bound estimator of μ.
Example: Let X1, X2, …, Xn be a set of independent and identically distributed sample random variables taken from a normally distributed population. Assuming that μ is known, what is the Cramér-Rao lower bound on the variance of any unbiased estimator of σ²?
Solution: The Cramér-Rao lower bound works out to 2σ⁴/n. Since the MLE of σ², S′² = (1/n) ∑_{i=1}^{n} (X_i − μ)², is an unbiased estimator of σ² and Var(S′²) = 2σ⁴/n, it follows that θ̂ = σ̂² = S′² is a minimum variance bound estimator of σ².
Example: Let X1, X2, …, Xn be a set of independent and identically distributed sample random variables taken from a Poisson distributed population with unknown parameter λ. What is the Cramér-Rao lower bound on the variance of any unbiased estimator of λ?
Solution: The Cramér-Rao lower bound works out to λ/n. Since the sample mean X̄ is an unbiased estimator of λ and the variance of the sample mean is σ²/n = λ/n, it follows that Var(θ̂) = Var(X̄) = λ/n, and thus θ̂ = X̄ is a minimum variance bound estimator of λ.
2.3 Sufficiency:
In statistical inference problems on a parameter, one of the major questions is: can a specific statistic replace the entire data set without losing pertinent information?
Suppose X1, X2, …, Xn are i.i.d. random sample variables from a probability distribution with unknown parameter θ. In general, statisticians look for ways of reducing a set of data so that these data can be more easily understood without losing the meaning associated with the entire collection of observations. Intuitively, a statistic T is a sufficient statistic for a parameter θ if T contains all the information available in the data about the value of θ.
For example, the sample mean may contain all the relevant information about the
parameter πœ‡, and in that case 𝑇 = 𝑋̅ is called a sufficient statistic for πœ‡. An estimator that is a
function of a sufficient statistic can be deemed to be a “good” estimator, because it depends
on fewer data values. When we have a sufficient statistic 𝑇 for θ, we need to concentrate only
on 𝑇 because it exhausts all the information that the sample has about θ. That is, knowledge of
the actual 𝑛 observations does not contribute anything more to the inference about θ.
Sufficient statistics often can be used to develop estimators that are efficient (have
minimum variance among all unbiased estimators). In fact, as we shall now see, if a minimum
variance bound or most efficient estimator exists, it will be found to be a sufficient statistic.
Moreover, if an efficient estimator of θ exists, it is expressible as a function of a sufficient
statistic.
Definition:
Let X₁, X₂, …, Xₙ be a random sample from a probability distribution with unknown parameter θ. Then the statistic T = g(X₁, X₂, …, Xₙ) is said to be sufficient for θ if the conditional pdf or pf of X₁, X₂, …, Xₙ given T = t does not depend on θ for any value of t. An estimator of θ that is a function of a sufficient statistic for θ is said to be a sufficient estimator of θ.
87
Example: Let X₁, X₂, …, Xₙ be iid Bernoulli random variables with parameter p. Show that T = Σᵢ₌₁ⁿ Xᵢ is sufficient for p.
Solution:
T = Σᵢ₌₁ⁿ Xᵢ is the total number of successes in the n independent trials. If the value of T is known, can we gain any additional information about p by examining the value of any alternative estimator that also depends on X₁, X₂, …, Xₙ? According to our definition of sufficiency, the conditional distribution of X₁, X₂, …, Xₙ given T = t must be independent of p. Is it?
The joint probability function is

f(X₁, X₂, …, Xₙ; p) = p^(ΣXᵢ) (1 − p)^(n − ΣXᵢ),  0 ≤ p ≤ 1

Because T = Σᵢ₌₁ⁿ Xᵢ, we have

f(X₁, X₂, …, Xₙ; p) = p^T (1 − p)^(n − T),  0 ≤ p ≤ 1

Also, because T ~ B(n, p),

f(t; p) = C(n, t) p^t (1 − p)^(n − t)

where C(n, t) denotes the binomial coefficient. Therefore,

f(X₁, X₂, …, Xₙ | T = t) = f(X₁, X₂, …, Xₙ; p) / f(t; p) = p^t (1 − p)^(n − t) / [C(n, t) p^t (1 − p)^(n − t)] = 1 / C(n, t)

which does not depend on p. Therefore, T = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for p.
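The conclusion can also be seen by simulation. The following is a minimal Python sketch (all values illustrative): it conditions simulated Bernoulli samples on T = t and shows every arrangement occurring with relative frequency close to 1/C(n, t), whatever the value of p.

# Minimal sketch (illustrative): conditioned on T = sum(X_i) = t, every arrangement of the
# Bernoulli sample is equally likely (probability 1/C(n, t)), whatever the value of p.
import numpy as np
from math import comb
from collections import Counter

rng = np.random.default_rng(4)
n, t, reps = 4, 2, 200000

for p in (0.3, 0.7):
    data = rng.binomial(1, p, size=(reps, n))
    kept = data[data.sum(axis=1) == t]               # keep only samples with T = t
    counts = Counter(tuple(row) for row in kept)
    print(f"p = {p}: theoretical 1/C(n,t) = {1/comb(n, t):.3f}")
    for seq, c in sorted(counts.items()):
        print(f"  arrangement {seq}: conditional relative frequency {c/len(kept):.3f}")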
88
The definition of a sufficient statistic tells us how to check whether a particular estimator is sufficient for a parameter θ, but it does not tell us how to actually go about finding a sufficient statistic (if one exists). To address the issue of operationally determining a sufficient statistic, we turn to the following theorem, the Fisher-Neyman Factorization Theorem (1922, 1924, 1935):
Theorem: Fisher-Neyman Factorization Theorem:
Let X₁, X₂, …, Xₙ be a random sample taken from a population with probability density function f(x; θ). The estimator T = g(X₁, X₂, …, Xₙ, n) is a sufficient statistic for the parameter θ if and only if the likelihood function of the sample factors as the product of two nonnegative functions h(t; θ, n) and j(x₁, x₂, …, xₙ, n), or

L(θ; x₁, x₂, …, xₙ, n) = h(t; θ, n) · j(x₁, x₂, …, xₙ, n),

for every realization t = g(x₁, x₂, …, xₙ, n) of T and all admissible values of θ. Although the function j is independent of θ (it may possibly be a constant), the function h depends on the sample realizations only via the estimator T, and thus this estimator constitutes a sufficient statistic for θ.
Example:
Let the sample random variables X₁, X₂, …, Xₙ be drawn from a Poisson population with parameter λ > 0. Show that the minimum variance bound estimator X̄ is sufficient for λ.
Solution:
The likelihood of the sample is

L(λ; x₁, x₂, …, xₙ) = ∏ᵢ₌₁ⁿ e^(−λ) λ^(xᵢ) / xᵢ! = [e^(−nλ) λ^(n x̄)] · [1 / ∏ᵢ₌₁ⁿ xᵢ!] = h(x̄; λ, n) · j(x₁, x₂, …, xₙ)

Note that h depends upon the xᵢ, i = 1, …, n, only through the function or realization t = x̄. So, by the Factorization Theorem, X̄ is a sufficient statistic for λ.
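A small numerical sketch (illustrative, using two hypothetical data sets with the same total) makes the same point: the likelihood ratio of two Poisson samples with equal sums is free of λ, because the λ-dependent factor h cancels.

# Minimal sketch (illustrative): for Poisson data, the lambda-dependent part of the likelihood
# involves the sample only through sum(x) (equivalently x-bar); two samples with the same total
# give a likelihood ratio that is the same for every lambda.
import numpy as np
from scipy.stats import poisson

x = np.array([2, 0, 5, 1])   # sum = 8
y = np.array([3, 3, 1, 1])   # sum = 8 as well

for lam in (0.5, 2.0, 6.0):
    ratio = poisson.pmf(x, lam).prod() / poisson.pmf(y, lam).prod()
    print(f"lambda = {lam}: L(lambda; x)/L(lambda; y) = {ratio:.4f}")  # constant in lambda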
89
Example:
Example:
90
Definition: Minimal Sufficient Statistics
It may be the case that there is more than one sufficient statistic associated with the
parameter θ of some population probability density (or mass) function. Hence it is only
natural to ask whether one sufficient statistic is better than any other such statistic in the sense
that it achieves the highest possible degree of data reduction without loss of information about
θ. A sufficient statistic that satisfies this requirement will be termed a minimal sufficient
statistic. More formally, a sufficient statistic T for a parameter θ is termed a minimal sufficient
statistic if, for any other sufficient statistic 𝑇’ for θ, T is a function of 𝑇’. A general procedure
for finding a minimal sufficient statistic is provided by the following Theorem, the Lehmann–
Scheffé Theorem (1950, 1955, 1956):
Theorem: Lehmann–Scheffé Theorem:
Let L(θ; x₁, x₂, …, xₙ, n) denote the likelihood function of a random sample taken from the population probability density function f(x; θ). Suppose there exists a function T = g(X₁, X₂, …, Xₙ, n) such that, for any two sets of sample realizations {x₁, x₂, …, xₙ} and {y₁, y₂, …, yₙ}, the likelihood ratio

L(θ; x₁, x₂, …, xₙ, n) / L(θ; y₁, y₂, …, yₙ, n)

is independent of θ if and only if g(x₁, x₂, …, xₙ, n) = g(y₁, y₂, …, yₙ, n). Then T is a minimal sufficient statistic for θ.
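A minimal numerical sketch of the criterion (illustrative, assuming a normal population with σ = 1, for which T = ΣXᵢ, equivalently X̄, should come out as minimal sufficient for μ): the likelihood ratio of two samples is free of μ exactly when their sums agree.

# Minimal sketch (illustrative): Lehmann-Scheffe check for a normal sample with sigma = 1.
# The ratio L(mu; x)/L(mu; y) is free of mu exactly when sum(x) == sum(y),
# so T = sum(X_i) (equivalently X-bar) is a minimal sufficient statistic for mu.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.5, -0.5])          # sum = 3.0
y_same = np.array([0.0, 1.0, 2.0])      # sum = 3.0  -> ratio should not depend on mu
y_diff = np.array([0.0, 0.0, 0.0])      # sum = 0.0  -> ratio should depend on mu

for mu in (-1.0, 0.0, 2.0):
    r_same = norm.pdf(x, loc=mu).prod() / norm.pdf(y_same, loc=mu).prod()
    r_diff = norm.pdf(x, loc=mu).prod() / norm.pdf(y_diff, loc=mu).prod()
    print(f"mu = {mu:4.1f}: ratio(equal sums) = {r_same:.4f}, ratio(different sums) = {r_diff:.4f}")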
Example:
91
Let us now consider the importance or usefulness of a sufficient statistic (provided, of
course, that one exists). As will be seen next, sufficient statistics often can be used to
construct estimators that have minimum variance among all unbiased estimators or are
efficient. In fact, sufficiency is a necessary condition for efficiency in that an estimator cannot
be efficient unless it utilizes all the sample information.
We note first that if an unbiased estimator T of a parameter θ is a function of a
sufficient statistic S, then T has a variance that is smaller than that of any other unbiased
estimator of θ that is not dependent on S. Second, if T is a minimum variance bound or most
efficient estimator of θ, then T is also a sufficient statistic (although the converse does not
necessarily hold).
This first point may be legitimized by the Rao-Blackwell Theorem (presented next),
which establishes a connection between unbiased estimators and sufficient statistics. In fact,
this theorem informs us that we may improve upon an unbiased estimator by conditioning it
on a sufficient statistic. That is, an unbiased estimator and a sufficient statistic for a parameter
θ may be combined to yield a single estimator that is both unbiased and sufficient for θ and
has a variance no larger (and usually smaller) than that of the original unbiased estimator.
To see this let T be any unbiased estimator of a parameter θ and let the statistic S be
sufficient for θ. Then the Rao-Blackwell Theorem indicates that another estimator 𝑇’ can be
derived from S (i.e., expressed as a function of S) such that 𝑇’ is unbiased and sufficient for θ
with the variance of 𝑇’ uniformly less than or equal to the variance of T.
Theorem: Rao-Blackwell Theorem:
Let X₁, X₂, …, Xₙ be a random sample taken from the population probability density function f(x; θ) and let T = g(X₁, X₂, …, Xₙ, n) be any unbiased estimator of the parameter θ. For S = u(X₁, X₂, …, Xₙ, n) a sufficient statistic for θ, define T′ = E(T | S), where T is not a function of S alone. Then E(T′) = θ and V(T′) ≤ V(T) for all θ (with V(T′) < V(T) for some θ unless T = T′ with probability 1).
What this theorem tells us is that if we condition any unbiased estimator T on a sufficient
statistic S to obtain a new unbiased estimator 𝑇’, then 𝑇’ is also sufficient for θ and is a
uniformly better unbiased estimator of θ. Moreover, the resulting efficient estimator is unique.
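A minimal simulation sketch of the theorem (illustrative, assuming Bernoulli(p) data, the crude unbiased estimator T = X₁, and the sufficient statistic S = ΣXᵢ, for which E(X₁ | S) = S/n by symmetry):

# Minimal sketch (illustrative): Rao-Blackwellization for Bernoulli(p) data.
# Start from the crude unbiased estimator T = X_1 and condition on the sufficient statistic
# S = sum(X_i); since E(X_1 | S = s) = s/n, the improved estimator is T' = S/n = X-bar.
# The simulation shows both are unbiased and that Var(T') is far smaller than Var(T).
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 0.3, 20, 100000

data = rng.binomial(1, p, size=(reps, n))
T = data[:, 0]                 # crude unbiased estimator: the first observation
T_prime = data.mean(axis=1)    # E(T | S) = S/n, the Rao-Blackwellized estimator

print(f"E(T)  ~ {T.mean():.3f},  Var(T)  ~ {T.var():.4f}   (p(1-p)   = {p*(1-p):.4f})")
print(f"E(T') ~ {T_prime.mean():.3f},  Var(T') ~ {T_prime.var():.4f}   (p(1-p)/n = {p*(1-p)/n:.4f})")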
92
It was mentioned earlier that minimum variance bound estimators and sufficient statistics are related. In fact, as we shall now see, under certain conditions they are one and the same. To see this, let us return to the Factorization Theorem and rewrite the factorization criterion or sufficiency condition: if the score can be written in the form

d ln L / dθ = k(θ, n) (T − θ)

for some function k that does not depend on the sample values, then the sufficient statistic T is also a minimum variance bound estimator of the parameter θ.
Example:
93
Definition: Jointly Sufficient Statistics
Specifically, for X₁, X₂, …, Xₙ a set of sample random variables taken from the population probability density function f(x; θ), the statistics Sₖ = uₖ(X₁, X₂, …, Xₙ, n), k = 1, …, r, are said to be jointly sufficient if and only if the joint probability density function of X₁, X₂, …, Xₙ given S₁, S₂, …, S_r is independent of θ for any set of realizations sₖ = uₖ(x₁, x₂, …, xₙ, n), k = 1, …, r; that is, the conditional distribution of X₁, X₂, …, Xₙ given S₁, S₂, …, S_r, or f(X₁, X₂, …, Xₙ | S₁, S₂, …, S_r), does not depend on θ.
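For instance (a standard illustration): if X₁, X₂, …, Xₙ come from a normal population with both μ and σ² unknown, the likelihood can be written as

L(μ, σ²; x₁, …, xₙ) = (2πσ²)^(−n/2) exp{ −[ Σxᵢ² − 2μ Σxᵢ + nμ² ] / (2σ²) },

which depends on the sample only through s₁ = Σxᵢ and s₂ = Σxᵢ². Hence S₁ = ΣXᵢ and S₂ = ΣXᵢ² (equivalently, X̄ and the sample variance) are jointly sufficient for (μ, σ²).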
We next examine a set of statistics for their joint sufficiency by considering the
following Theorem, the Generalized Fisher-Neyman Factorization Theorem:
Theorem: Generalized Fisher-Neyman Factorization Theorem:
The statistics Sₖ = uₖ(X₁, X₂, …, Xₙ, n), k = 1, …, r, are jointly sufficient for θ if and only if the likelihood function of the sample factors as

L(θ; x₁, x₂, …, xₙ, n) = h(s₁, s₂, …, s_r; θ, n) · j(x₁, x₂, …, xₙ, n),

where the function j does not depend on θ.
94
Example:
95
We noted earlier that a minimal sufficient statistic is one that achieves the highest possible degree of data reduction without losing any information about a parameter θ. Moreover, a general procedure for finding a minimal sufficient statistic was provided by the Lehmann–Scheffé Theorem. Let us now extend this theorem to the determination of a set of jointly minimal sufficient statistics. We thus have the generalized Lehmann–Scheffé Theorem:
Theorem: Generalized Lehmann–Scheffé Theorem:
Example:
96
2.4 Consistency
Definition: An estimator θ̂ₙ (computed from a sample of size n) is said to be a consistent estimator of θ if it converges in probability to θ, that is, if for every ε > 0, lim(n→∞) P(|θ̂ₙ − θ| < ε) = 1.
e.g.:
Suppose that a coin, which has probability p of resulting in heads, is tossed n times. If
the tosses are independent, then Y, the number of heads among the n tosses, has a binomial
distribution. If the true value of p is unknown, the sample proportion Y/n is an estimator of p.
What happens to this sample proportion as the number of tosses n increases?
Our intuition leads us to believe that as n gets larger, Y/n should get closer to the true value of
p. That is, as the amount of information in the sample increases, our estimator should get
closer to the quantity being estimated.
97
Theorem: If θ̂ is an unbiased estimator of θ, then θ̂ is a consistent estimator of θ if

lim(n→∞) Var(θ̂) = 0.

If θ̂ is a biased estimator of θ, then θ̂ is a consistent estimator of θ if

lim(n→∞) Var(θ̂) = 0 and lim(n→∞) Bias(θ̂) = 0, or, equivalently, lim(n→∞) MSE(θ̂) = 0.
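The coin-tossing example above can be checked with a short simulation sketch (illustrative values): since Y/n is unbiased for p with Var(Y/n) = p(1 − p)/n → 0, the theorem says Y/n is consistent, and the simulated mean squared error indeed shrinks as n grows.

# Minimal sketch (illustrative): consistency of the sample proportion Y/n for Bernoulli(p) data.
# Var(Y/n) = p(1-p)/n -> 0 and Y/n is unbiased, so by the theorem Y/n is consistent for p.
import numpy as np

rng = np.random.default_rng(6)
p, reps = 0.6, 20000

for n in (10, 100, 1000, 10000):
    props = rng.binomial(n, p, size=reps) / n        # simulated values of Y/n
    mse = np.mean((props - p) ** 2)
    print(f"n = {n:6d}: p(1-p)/n = {p*(1-p)/n:.6f}, simulated MSE = {mse:.6f}")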
Example:
Solution:
98
Example:
Solution:
99
100