Theory of Statistics

BTRY 4090 / STSCI 4090, Spring 2010
Instructor: Ping Li
Department of Statistical Science
Cornell University
General Information
• Lectures: Tue, Thu 10:10-11:25 am, Stimson Hall G01
• Section: Mon 2:55 - 4:10 pm, Warren Hall 131
• Instructor: Ping Li, pingli@cornell.edu,
Office Hours: Tue, Thu 11:25 am -12 pm, 1192, Comstock Hall
• TA: Xiao Luo, lx42@cornell.edu.
Office hours: (1) Mon, 4:10 - 5:10 pm, Warren Hall 131; (2) Wed, 2:30 - 3:30 pm, Comstock Hall 1181.
• Prerequisites: BTRY 4080 or equivalent
• Textbook: Rice, Mathematical Statistics and Data Analysis, 3rd edition
• Exams:
– Prelim 1: In Class, Feb. 25, 2010
– Prelim 2: In Class, April 8, 2010
– Final Exam: Warren Hall 145, 2pm - 4:30pm, May 13, 2010
– Policy: Closed book, closed notes
• Programming: Some programming assignments. You can either use Matlab
or R. For practice, please download the Matlab examples in 4080 lecture
notes.
• Homework: Weekly
– Please turn in your homework either in class or to BSCB front desk
(Comstock Hall, 1198).
– No late homework will be accepted.
– Before computing your overall homework grade, the assignment with the lowest grade (if ≥ 25%) will be dropped; the one with the second-lowest grade (if ≥ 50%) will also be dropped.
– It is the students’ responsibility to keep copies of the submitted homework.
• Grading: Two formulas
  1. Homework: 30% + Two Prelims: 35% + Final: 35%
  2. Homework: 30% + Two Prelims: 25% + Final: 45%
  Your grade is whichever is higher.
• Course Letter Grade Assignment
A ≈ 90% (in the absolute scale)
C ≈ 60% (in the absolute scale)
In borderline cases, participation in section and class interactions will be used as
a determining factor.
Syllabus

Topic                                                                  Textbook
Random number generation
Probability, Random Variables, Joint Distributions, Expected Values   Chapters 1-4
Limit Theorems                                                         Chapter 5
Distributions Derived from the Normal Distribution                    Chapter 6
Estimation of Parameters and Fitting of Probability Distributions     Chapter 8
Testing Hypothesis and Assessing Goodness of Fit                      Chapter 9
Comparing Two Samples                                                  Chapter 11
The Analysis of Categorical Data                                       Chapter 13
Linear Least Squares                                                   Chapter 14
Chapters 1 to 4: Mostly Reviews
• Random number generation
• The method of random projections: a real example of using probability to solve computationally intensive (or infeasible) problems.
• Capture/Recapture method: an example of discrete probability and an introduction to parameter estimation using maximum likelihood.
• Conditional expectations, bivariate normal, and random projections
• Moment generating function and random projections
Nonuniform Sampling by Inversion
The goal: Sample X from a distribution F (x).
The inversion transform sampling:
• Sample U ∼ Uniform(0, 1).
• Output X = F −1 (U )
Proof:
Pr(X ≤ x) = Pr(F^{-1}(U) ≤ x) = Pr(U ≤ F(x)) = F(x)
Limitation: Need a closed-form F^{-1}, but many common distributions (e.g., the normal) do not have a closed-form F^{-1}.
Examples of Inversion Transform Sampling
• X ∼ Exponential(λ), i.e., F(x) = 1 − e^{−λx}, x ≥ 0.
  Let U ∼ Uniform(0, 1); then log(1 − U)/(−λ) ∼ Exponential(λ).
• X ∼ Pareto(α), i.e., F(x) = 1 − 1/x^α, x ≥ 1.
  Let U ∼ Uniform(0, 1); then 1/(1 − U)^{1/α} ∼ Pareto(α).

A small trick: if U ∼ Uniform(0, 1), then 1 − U ∼ Uniform(0, 1). Thus, we can replace 1 − U by U.
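A minimal Matlab sketch of these two examples (the parameter values below are made up for illustration):

lambda = 2; alpha = 3; n = 10^5;
U = rand(n,1);                    % U ~ Uniform(0,1)
Xexp = -log(U)/lambda;            % Exponential(lambda), using the 1-U -> U trick
Xpar = 1./U.^(1/alpha);           % Pareto(alpha)
mean(Xexp)                        % sanity check: should be close to 1/lambda = 0.5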
The Box-Muller Transform
U1 and U2 are i.i.d. samples from Uniform(0, 1). Then
N1 = sqrt(−2 log U1) cos(2πU2)
N2 = sqrt(−2 log U1) sin(2πU2)

are two i.i.d. samples from the standard normal N(0, 1).
Q: How to generate non-standard normals?
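A minimal Matlab sketch (the values of mu and sigma below are hypothetical; scaling X = mu + sigma*N1 answers the question for non-standard normals):

n = 10^5; U1 = rand(n,1); U2 = rand(n,1);
N1 = sqrt(-2*log(U1)).*cos(2*pi*U2);   % standard normal via Box-Muller
mu = 3; sigma = 2;
X = mu + sigma*N1;                     % approximately N(3, 4)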
An Introduction to Random Projections
Many applications require a data matrix A ∈ R^{n×D}.

For example, the term-by-document matrix may contain n = 10^10 documents (web pages) and D = 10^6 single words, or D = 10^12 double words (bi-gram model), or D = 10^18 triple words (tri-gram model).

Many matrix operations boil down to computing how close (how far) two rows (columns) of the matrix are. For example, linear least squares: (A^T A)^{-1} A^T y.

Challenges: The matrix may be too large to store, or computing A^T A is too expensive.
Random Projections:  Replace A by B = A × R.

R ∈ R^{D×k}: a random matrix, with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.
k is very small (e.g., k = 50 ∼ 100), but n and D are very large.

B approximately preserves the Euclidean distances and dot products between any two rows of A. In particular, E(BB^T) = k·AA^T, so (1/k) BB^T matches AA^T in expectation.
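A minimal Matlab sketch of this idea (the sizes n, D, k below are made-up toy values):

n = 5; D = 1000; k = 100;
A = randn(n, D);                          % a pretend data matrix
R = randn(D, k);                          % i.i.d. N(0,1) projection matrix
B = A * R;
d_true = sum((A(1,:) - A(2,:)).^2);       % squared distance between rows 1 and 2
d_hat  = sum((B(1,:) - B(2,:)).^2) / k;   % projected estimate (note the 1/k)
[d_true, d_hat]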
Consider the first two rows in A: u1, u2 ∈ R^D.
u1 = {u1,1 , u1,2 , u1,3 , ..., u1,i , ..., u1,D }
u2 = {u2,1 , u2,2 , u2,3 , ..., u2,i , ..., u2,D }
and first two rows in B:
v1 , v2 ∈ Rk .
v1 = {v1,1 , v1,2 , v1,3 , ..., v1,j , ..., v1,k }
v2 = {v2,1 , v2,2 , v2,3 , ..., v2,j , ..., v2,k }
v1 = RT u1 ,
v2 = RT u2 .
R = {rij }, i = 1 to D and j = 1 to k .
rij ∼ N (0, 1).
v1 = R^T u1,   v2 = R^T u2,   where R = {r_ij}, i = 1 to D, j = 1 to k.

v1,j = Σ_{i=1}^D r_ij u1,i,   v2,j = Σ_{i=1}^D r_ij u2,i,   v1,j − v2,j = Σ_{i=1}^D r_ij [u1,i − u2,i]

The Euclidean norm of u1:  Σ_{i=1}^D |u1,i|^2.     The Euclidean norm of v1:  Σ_{j=1}^k |v1,j|^2.

The Euclidean distance between u1 and u2:  Σ_{i=1}^D |u1,i − u2,i|^2.
The Euclidean distance between v1 and v2:  Σ_{j=1}^k |v1,j − v2,j|^2.
What are we hoping for?

• Σ_{j=1}^k |v1,j|^2 ≈ Σ_{i=1}^D |u1,i|^2, as close as possible.
• Σ_{j=1}^k |v1,j − v2,j|^2 ≈ Σ_{i=1}^D |u1,i − u2,i|^2, as close as possible.
• k should be as small as possible, for a specified level of accuracy.
Unbiased Estimator of d and m1 , m2
We need a good estimator: unbiased and with small variance.
Note that the estimation problem is essentially the same for d and for m1 (m2).
Thus, we can focus on estimating m1.

By random projections, we have k i.i.d. samples (why?):
v1,j = Σ_{i=1}^D r_ij u1,i,   j = 1, 2, ..., k
Because rij
∼ N (0, 1), we can develop estimators and analyze the properties
using normal and χ2 distributions. But we can also solve the problem without
using normals.
Unbiased Estimator of m1
v1,j = Σ_{i=1}^D r_ij u1,i,   j = 1, 2, ..., k,   (r_ij ∼ N(0, 1))

To get started, let's first look at the moments:

E(v1,j) = E( Σ_{i=1}^D r_ij u1,i ) = Σ_{i=1}^D E(r_ij) u1,i = 0
E(v1,j^2) = E[ ( Σ_{i=1}^D r_ij u1,i )^2 ]
          = E[ Σ_{i=1}^D r_ij^2 u1,i^2 + Σ_{i≠i'} r_ij u1,i r_i'j u1,i' ]
          = Σ_{i=1}^D E(r_ij^2) u1,i^2 + Σ_{i≠i'} E(r_ij r_i'j) u1,i u1,i'
          = Σ_{i=1}^D u1,i^2 + 0
          = m1

Great!  m1 is exactly what we are after.
Since we have k i.i.d. samples v1,j, we can simply average them to estimate m1.
An unbiased estimator of the Euclidean norm

m̂1 = (1/k) Σ_{j=1}^k |v1,j|^2,   where m1 = Σ_{i=1}^D |u1,i|^2

E(m̂1) = (1/k) Σ_{j=1}^k E|v1,j|^2 = (1/k) Σ_{j=1}^k m1 = m1
We need to analyze its variance to assess its accuracy.
Recall, our goal is to keep k (the number of projections) as small as possible.
Var(m̂1) = (1/k^2) Σ_{j=1}^k Var(|v1,j|^2) = (1/k) Var(|v1,j|^2)
        = (1/k) [ E(|v1,j|^4) − (E|v1,j|^2)^2 ]
        = (1/k) [ E( Σ_{i=1}^D r_ij u1,i )^4 − m1^2 ]

We can compute E( Σ_{i=1}^D r_ij u1,i )^4 directly, but it would be much easier if we take advantage of the χ2 distribution.
χ2 Distribution

If X ∼ N(0, 1), then Y = X^2 has a chi-square distribution with one degree of freedom, denoted by χ2_1.

If Xj, j = 1 to k, are i.i.d. normal, Xj ∼ N(0, 1), then Y = Σ_{j=1}^k Xj^2 follows a chi-square distribution with k degrees of freedom, denoted by χ2_k.

If Y ∼ χ2_k, then E(Y) = k, Var(Y) = 2k.
Recall, after random projections,

v1,j = Σ_{i=1}^D r_ij u1,i,   j = 1, 2, ..., k,   r_ij ∼ N(0, 1)

Therefore, v1,j also has a normal distribution:

v1,j ∼ N( 0, Σ_{i=1}^D |u1,i|^2 ) = N(0, m1)

Equivalently, v1,j/√m1 ∼ N(0, 1). Therefore,

(v1,j/√m1)^2 = v1,j^2/m1 ∼ χ2_1,   Var(v1,j^2/m1) = 2,   Var(v1,j^2) = 2 m1^2

Now we can figure out the variance formula for random projections.
Implication

Var(m̂1) = (1/k) Var(|v1,j|^2) = 2 m1^2 / k

Var(m̂1)/m1^2 = 2/k,   independent of m1

sqrt( Var(m̂1)/m1^2 ) is known as the coefficient of variation.

We have solved the variance using χ2_1.
We can actually figure out the distribution of m̂1 using χ2_k.
m̂1 = (1/k) Σ_{j=1}^k |v1,j|^2,   v1,j ∼ N(0, m1)

Because the v1,j's are i.i.d., we know

k m̂1 / m1 = Σ_{j=1}^k ( v1,j/√m1 )^2 ∼ χ2_k    (why?)

This will be useful for analyzing the error bound using probability inequalities.
We can also write down the moments of m̂1 directly using χ2_k.
Recall, if Y ∼ χ2_k, then E(Y) = k and Var(Y) = 2k.

=⇒  E(k m̂1/m1) = k,   Var(k m̂1/m1) = 2k

=⇒  Var(m̂1) = 2k · m1^2/k^2 = 2 m1^2/k
An unbiased estimator of the Euclidean distance

d̂ = (1/k) Σ_{j=1}^k |v1,j − v2,j|^2,   d = Σ_{i=1}^D |u1,i − u2,i|^2

k d̂ / d ∼ χ2_k,   Var(d̂) = 2 d^2/k

These can be derived in exactly the same way as for the estimator of m1.

Note that the coefficient of variation for d̂ satisfies

Var(d̂)/d^2 = 2/k,   independent of d,

meaning that the errors are pre-determined by k, a huge advantage.
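A small Matlab sketch checking Var(d̂)/d^2 = 2/k by simulation (all sizes below are toy illustration values):

D = 500; k = 100; T = 2000;
u1 = randn(1, D); u2 = randn(1, D);
d = sum((u1 - u2).^2);
dhat = zeros(T, 1);
for t = 1:T
  R = randn(D, k);
  dhat(t) = sum(((u1 - u2)*R).^2)/k;
end
[mean(dhat)/d, var(dhat)/d^2, 2/k]   % expect roughly [1, 0.02, 0.02]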
More probability problems
• What is the error probability P( |d̂ − d| ≥ εd )?
• How large should k be?
• What about the inner (dot) product a = Σ_{i=1}^D u1,i u2,i?
An unbiased estimator of the inner product
â = (1/k) Σ_{j=1}^k v1,j v2,j,   a = Σ_{i=1}^D u1,i u2,i

E(â) = a,   Var(â) = (m1 m2 + a^2)/k

Proof:

v1,j v2,j = [ Σ_{i=1}^D u1,i r_ij ] [ Σ_{i=1}^D u2,i r_ij ]
          = Σ_{i=1}^D u1,i u2,i r_ij^2 + Σ_{i≠i'} u1,i u2,i' r_ij r_i'j

=⇒  E(v1,j v2,j) = Σ_{i=1}^D u1,i u2,i E(r_ij^2) + Σ_{i≠i'} u1,i u2,i' E(r_ij r_i'j)
                = Σ_{i=1}^D u1,i u2,i · 1 + Σ_{i≠i'} u1,i u2,i' · 0
                = Σ_{i=1}^D u1,i u2,i = a

This proves the unbiasedness.
We first derive the variance of â using a complicated brute force method, then we
show a much simpler method using conditional expectation.

[v1,j v2,j]^2 = [ Σ_{i=1}^D u1,i u2,i r_ij^2 + Σ_{i≠i'} u1,i u2,i' r_ij r_i'j ]^2
             = [ Σ_{i=1}^D u1,i u2,i r_ij^2 ]^2 + 2 [ Σ_{i=1}^D u1,i u2,i r_ij^2 ][ Σ_{i≠i'} u1,i u2,i' r_ij r_i'j ] + [ Σ_{i≠i'} u1,i u2,i' r_ij r_i'j ]^2
             = Σ_{i=1}^D [u1,i u2,i]^2 r_ij^4 + 2 Σ_{i≠i'} u1,i u2,i u1,i' u2,i' [r_ij r_i'j]^2 + Σ_{i≠i'} [u1,i u2,i']^2 [r_ij r_i'j]^2 + ...
Why can we ignore the rest of the terms (after taking expectations)?
Recall r_ij ∼ N(0, 1), i.i.d.:

E(r_ij) = 0,   E(r_ij^2) = 1,   E(r_ij r_i'j) = E(r_ij) E(r_i'j) = 0
E(r_ij^3) = 0,   E(r_ij^4) = 3,   E(r_ij^2 r_i'j) = E(r_ij^2) E(r_i'j) = 0
Therefore,

E[v1,j v2,j]^2 = Σ_{i=1}^D 3 [u1,i u2,i]^2 + 2 Σ_{i≠i'} u1,i u2,i u1,i' u2,i' + Σ_{i≠i'} [u1,i u2,i']^2

But

a^2 = [ Σ_{i=1}^D u1,i u2,i ]^2 = Σ_{i=1}^D [u1,i u2,i]^2 + Σ_{i≠i'} u1,i u2,i u1,i' u2,i'

m1 m2 = [ Σ_{i=1}^D |u1,i|^2 ][ Σ_{i=1}^D |u2,i|^2 ] = Σ_{i=1}^D [u1,i u2,i]^2 + Σ_{i≠i'} [u1,i u2,i']^2

Therefore,

E[v1,j v2,j]^2 = m1 m2 + 2 a^2,   Var[v1,j v2,j] = m1 m2 + a^2
An unbiased estimator of the inner product
â = (1/k) Σ_{j=1}^k v1,j v2,j,   a = Σ_{i=1}^D u1,i u2,i

E(â) = a,   Var(â) = (m1 m2 + a^2)/k

The coefficient of variation

sqrt( Var(â)/a^2 ) = sqrt( (m1 m2 + a^2)/a^2 · (1/k) ),   not independent of a.

When two vectors u1 and u2 are almost orthogonal, a ≈ 0, so the coefficient of variation ≈ ∞.
=⇒ random projections may not be good for estimating inner products.
The joint distribution of v1,j = Σ_{i=1}^D u1,i r_ij and v2,j = Σ_{i=1}^D u2,i r_ij:

E(v1,j) = 0,   Var(v1,j) = Σ_{i=1}^D |u1,i|^2 = m1
E(v2,j) = 0,   Var(v2,j) = Σ_{i=1}^D |u2,i|^2 = m2
Cov(v1,j, v2,j) = E(v1,j v2,j) − E(v1,j) E(v2,j) = a

v1,j and v2,j are jointly normal (bivariate normal):

(v1,j, v2,j)^T ∼ N( µ = (0, 0)^T,  Σ = [ m1  a ;  a  m2 ] )

(What if we know m1 and m2 exactly? For example, by one scan of the data matrix.)
Summary of Random Projections
Random Projections:  Replace A by B = A × R.
• An elegant method, interesting probability exercise.
• Suitable for approximating Euclidean distances in massive, dense, and
heavy-tailed (some entries are excessively large) data matrices.
• It does not take advantage of data sparsity.
• We will come back to study its error probability bounds (and other things).
Capture/Recapture Method: Section 1.4, Example I
The method may be used to estimate the size of a wildlife population. Suppose
that t animals are captured, tagged, and released. On a later occasion, m
animals are captured, and it is found that r of them are tagged.
Assume the total population is N .
Q1: What is the probability mass function P{N = n}?
Q2: How large is the population N, estimated from m, r, and t?
Solution:
P{N = n} = C(t, r) C(n − t, m − r) / C(n, m),   where C(a, b) = a!/(b!(a − b)!) denotes the binomial coefficient.

To estimate N, we can choose N = n such that Ln = P{N = n} is maximized.

Ln = [ t!/(r!(t − r)!) ] [ (n − t)!/((m − r)!(n − t − m + r)!) ] / [ n!/(m!(n − m)!) ]
   ∝ [ (n − t)!/(n − t − m + r)! ] / [ n!/(n − m)! ]
   = (n − t)!(n − m)! / [ (n − t − m + r)! n! ]
The method of maximum likelihood

To find the n such that Ln is maximized:

Ln = (n − t)!(n − m)! / [ (n − t − m + r)! n! ]

If Ln has a global maximum, then maximizing is equivalent to finding the n such that

gn = Ln / L_{n−1} = (n − t)(n − m) / [ n(n − t − m + r) ] = 1
=⇒ n = mt/r

Indeed, if n < mt/r, then

(n − t)(n − m) − n(n − t − m + r) = mt − nr > 0

Thus, if n < mt/r, then gn > 1 and Ln is increasing; if n > mt/r, then Ln is decreasing.
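As a quick worked example with the values used in the plots below (t = 10, m = 20, r = 4), the maximum likelihood estimate is n̂ = mt/r = 20 × 10/4 = 50.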
How to plot Ln ?
Ln = (n − t)!(n − m)! / [ (n − t − m + r)! n! ]
   = [ (n − m)(n − m − 1)...(n − m − t + r + 1) ] / [ n(n − 1)...(n − t + 1) ]

log Ln = Σ_{j=1}^{t−r} log(n − m − j + 1) − Σ_{i=1}^{t} log(n − i + 1)
[Figure: Likelihood L_n and likelihood ratio g_n versus n, for t = 10, m = 20, r = 4.]
Matlab code
function cap_recap(t, m, r)
% Plot the likelihood L_n and the likelihood ratio g_n for the
% capture/recapture problem, over a range of population sizes n.
n0 = max(t+m-r, m)+5;
j = 1:(t-r); i = 1:t;
for n = n0:5*n0
  L(n-n0+1) = exp( sum(log(n-m+1-j)) - sum(log(n+1-i)) );
  g(n-n0+1) = (n-t)*(n-m)./n./(n-t-m+r);
end
figure;
plot(n0:5*n0, L, 'r', 'linewidth', 2); grid on;
xlabel('n'); ylabel('Likelihood');
title(['Likelihood (L_n): t = ' num2str(t) '  m = ' num2str(m) '  r = ' num2str(r)]);
figure;
plot(n0:5*n0, g, 'r', 'linewidth', 2); grid on;
xlabel('n'); ylabel('Likelihood Ratio');
title(['Likelihood ratio (g_n): t = ' num2str(t) '  m = ' num2str(m) '  r = ' num2str(r)]);
The Bivariate Normal Distribution
The random variables X and Y have a bivariate normal distribution if, for constants µx, µy, σx > 0, σy > 0, −1 < ρ < 1, their joint density function is given, for all −∞ < x, y < ∞, by

f(x, y) = 1/( 2π σx σy sqrt(1 − ρ^2) ) · exp{ −1/(2(1 − ρ^2)) [ (x − µx)^2/σx^2 + (y − µy)^2/σy^2 − 2ρ (x − µx)(y − µy)/(σx σy) ] }

If X and Y are independent, then ρ = 0, and

f(x, y) = 1/( 2π σx σy ) · exp{ −(1/2) [ (x − µx)^2/σx^2 + (y − µy)^2/σy^2 ] }
Denote that X and Y are jointly normal:

(X, Y)^T ∼ N( µ = (µx, µy)^T,  Σ = [ σx^2  ρσxσy ;  ρσxσy  σy^2 ] )

X and Y are marginally normal:
X ∼ N(µx, σx^2),   Y ∼ N(µy, σy^2)

X and Y are also conditionally normal:
X|Y = y ∼ N( µx + ρ (σx/σy)(y − µy),  (1 − ρ^2) σx^2 )
Y|X = x ∼ N( µy + ρ (σy/σx)(x − µx),  (1 − ρ^2) σy^2 )
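A small Matlab sketch checking the conditional mean and variance formulas by simulation (all parameter values below are made up):

mux = 1; muy = -2; sx = 2; sy = 3; rho = 0.5; T = 10^6;
X = normrnd(mux, sx, T, 1);
Y = muy + rho*(sy/sx)*(X - mux) + sqrt(1 - rho^2)*sy*randn(T, 1);
idx = abs(X - 2) < 0.05;                      % condition on X near x = 2
[mean(Y(idx)), muy + rho*(sy/sx)*(2 - mux)]   % conditional mean
[var(Y(idx)),  (1 - rho^2)*sy^2]              % conditional variance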
Bivariate Normal and Random Projections
A × R = B

v1 and v2, the first two rows in B, have k entries:
v1,j = Σ_{i=1}^D u1,i r_ij and v2,j = Σ_{i=1}^D u2,i r_ij.

v1,j and v2,j are bivariate normal:

(v1,j, v2,j)^T ∼ N( µ = (0, 0)^T,  Σ = [ m1  a ;  a  m2 ] )

m1 = Σ_{i=1}^D |u1,i|^2,   m2 = Σ_{i=1}^D |u2,i|^2,   a = Σ_{i=1}^D u1,i u2,i
Simplify calculations using conditional normality
v1,j | v2,j ∼ N( (a/m2) v2,j,  (m1 m2 − a^2)/m2 )

E(v1,j v2,j)^2 = E[ E( v1,j^2 v2,j^2 | v2,j ) ] = E[ v2,j^2 E( v1,j^2 | v2,j ) ]
              = E[ v2,j^2 ( (m1 m2 − a^2)/m2 + ( (a/m2) v2,j )^2 ) ]
              = m2 (m1 m2 − a^2)/m2 + 3 m2^2 a^2/m2^2 = m1 m2 + 2 a^2.

The unbiased estimator â = (1/k) Σ_{j=1}^k v1,j v2,j has variance

Var(â) = (m1 m2 + a^2)/k
Moment Generating Function (MGF)
Definition:  For a random variable X, its moment generating function (MGF) is defined as

M_X(t) = E(e^{tX}) = Σ_x p(x) e^{tx}               if X is discrete
M_X(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx      if X is continuous
MGF MX (t) uniquely determines the distribution of X .
MGF of Normal
Suppose X ∼ N(0, 1), i.e., f_X(x) = (1/√(2π)) e^{−x^2/2}.

M_X(t) = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x^2/2} dx
       = ∫_{−∞}^{∞} (1/√(2π)) e^{−(x^2 − 2tx)/2} dx
       = ∫_{−∞}^{∞} (1/√(2π)) e^{−(x^2 − 2tx + t^2 − t^2)/2} dx
       = e^{t^2/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x − t)^2/2} dx
       = e^{t^2/2}
Suppose Y ∼ N(µ, σ^2). Write Y = σX + µ, where X ∼ N(0, 1).

M_Y(t) = E(e^{tY}) = E(e^{µt + σtX}) = e^{µt} E(e^{σtX})

We can view σt as another t'. Hence

M_Y(t) = e^{µt} M_X(σt) = e^{µt} × e^{σ^2 t^2/2} = e^{µt + σ^2 t^2/2}
MGF of Chi-Square
If Xj, j = 1 to k, are i.i.d. N(0, 1), then Y = Σ_{j=1}^k Xj^2 ∼ χ2_k, a chi-square distribution with k degrees of freedom.

What is the density function? Well, since the MGF uniquely determines the distribution, we can analyze the MGF first.

By the independence of the Xj,

M_Y(t) = E( e^{Yt} ) = E( e^{t Σ_{j=1}^k Xj^2} ) = Π_{j=1}^k E( e^{t Xj^2} ) = [ E( e^{t Xj^2} ) ]^k
E( e^{t Xj^2} ) = ∫_{−∞}^{∞} e^{tx^2} (1/√(2π)) e^{−x^2/2} dx
               = ∫_{−∞}^{∞} (1/√(2π)) e^{−(x^2/2)(1 − 2t)} dx
               = ∫_{−∞}^{∞} (1/√(2π)) e^{−x^2/(2σ^2)} dx,   where σ^2 = 1/(1 − 2t)
               = σ ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−x^2/(2σ^2)} dx = σ
               = 1/(1 − 2t)^{1/2}

M_Y(t) = [ E( e^{t Xj^2} ) ]^k = 1/(1 − 2t)^{k/2},   (t < 1/2)
MGF for Random Projections
In random projections, the unbiased estimator d̂ = (1/k) Σ_{j=1}^k |v1,j − v2,j|^2 satisfies

k d̂ / d = Σ_{j=1}^k |v1,j − v2,j|^2 / d ∼ χ2_k

Q: What is the MGF of d̂?

Solution:

M_d̂(t) = E( e^{d̂ t} ) = E( e^{(k d̂/d)(dt/k)} ) = ( 1 − 2dt/k )^{−k/2}

where 2dt/k < 1, i.e., t < k/(2d).
Moments and MGF
M_X(t) = E( e^{tX} )
=⇒ M'_X(t) = E( X e^{tX} )
=⇒ M^{(n)}_X(t) = E( X^n e^{tX} )

Setting t = 0:
E[X^n] = M^{(n)}_X(0)
Example:  X ∼ χ2_k.  M_X(t) = 1/(1 − 2t)^{k/2}.

M'(t) = (−k/2)(1 − 2t)^{−k/2 − 1}(−2) = k (1 − 2t)^{−k/2 − 1}
M''(t) = k (−k/2 − 1)(1 − 2t)^{−k/2 − 2}(−2) = k(k + 2)(1 − 2t)^{−k/2 − 2}

Therefore,

E(X) = M'(0) = k,   E(X^2) = M''(0) = k^2 + 2k
Var(X) = (k^2 + 2k) − k^2 = 2k.
Example: MGF and Moments of â in Random Projections
The unbiased estimator of the inner product:  â = (1/k) Σ_{j=1}^k v1,j v2,j.

Using conditional expectation:

v1,j | v2,j ∼ N( (a/m2) v2,j,  (m1 m2 − a^2)/m2 ),   v2,j ∼ N(0, m2)

For simplicity, let
x = v1,j,   y = v2,j,   µ = (a/m2) v2,j = (a/m2) y,   σ^2 = (m1 m2 − a^2)/m2
E( exp(v1,j v2,j t) ) = E( exp(xyt) ) = E( E( exp(xyt) | y ) )

Using the MGF of x|y ∼ N(µ, σ^2):

E( exp(xyt) | y ) = e^{µyt + (σ^2/2)(yt)^2}

E( E( exp(xyt) | y ) ) = E( e^{µyt + (σ^2/2)(yt)^2} )

µyt + (σ^2/2)(yt)^2 = y^2 ( (a/m2) t + (σ^2/2) t^2 )

Since y ∼ N(0, m2), we know y^2/m2 ∼ χ2_1.
Using the MGF of χ2_1, we obtain

E( e^{µyt + (σ^2/2)(yt)^2} ) = E( e^{(y^2/m2) · m2 [ (a/m2) t + (σ^2/2) t^2 ]} )
  = ( 1 − 2 m2 [ (a/m2) t + (σ^2/2) t^2 ] )^{−1/2}
  = ( 1 − 2at − (m1 m2 − a^2) t^2 )^{−1/2}.

By independence,

M_â(t) = ( 1 − 2a (t/k) − (m1 m2 − a^2) (t^2/k^2) )^{−k/2}.

Now, we can use this MGF to calculate moments of â.
M_â(t) = ( 1 − 2a (t/k) − (m1 m2 − a^2) (t^2/k^2) )^{−k/2},

M^{(1)}_â(t) = (−k/2) [ 1 − 2a (t/k) − (m1 m2 − a^2) (t^2/k^2) ]^{−k/2 − 1} × ( −2a/k − (m1 m2 − a^2) (2t/k^2) )

The term in [...] will not matter after letting t = 0. Therefore,

E(â) = M^{(1)}_â(0) = (−k/2)(−2a/k) = a
Following a similar procedure, we can obtain
Var(â) = (m1 m2 + a^2)/k

E(â − a)^3 = (2a/k^2)( 3 m1 m2 + a^2 )

The centered third moment measures the skewness of the distribution and can be quite useful, for example, for testing normality.
Tail Probabilities
The tail probability P(X > t) is extremely important.

For example, in random projections,

P( |d̂ − d| ≥ εd )

tells us the probability that the difference (error) between the estimated Euclidean distance d̂ and the true distance d exceeds an ε fraction of the true distance d.

Q: Is it just the cumulative distribution function (CDF)?
Tail Probability Inequalities (Bounds)
P (X > t) ≤ ???
Reasons to study tail probability bounds:
• Even if the distribution of X is known, evaluating P (X > t) often requires
numerical methods.
• Often the exact distribution of X is unknown. Instead, we may know the
moments (mean, variance, MGF, etc).
• Theoretical reasons. For example, studying how fast the error decreases.
Several Tail Probability Inequalities (Bounds)
• Markov’s Inequality.
Only use the first moment. Most basic.
• Chebyshev’s Inequality.
Only use the second moment.
• Chernoff’s Inequality.
Use the MGF. Most accurate and popular among theorists.
Markov’s Inequality: Theorem A in Section 4.1
If X is a random variable with P(X ≥ 0) = 1, and for which E(X) exists, then

P(X ≥ t) ≤ E(X)/t

Proof:  Assume X is continuous with probability density f(x).

E(X) = ∫_0^∞ x f(x) dx ≥ ∫_t^∞ x f(x) dx ≥ ∫_t^∞ t f(x) dx = t P(X ≥ t)
See the textbook for the proof by assuming X is discrete.
Many extremely useful bounds can be obtained from Markov’s inequality.
Markov's inequality:  P(X ≥ t) ≤ E(X)/t.

If t = kE(X), then

P(X ≥ t) = P(X ≥ kE(X)) ≤ 1/k

The error decreases at the rate of 1/k, which is too slow.
The original Markov’s inequality only utilizes the first moment (hence its
inaccuracy).
Chebyshev’s Inequality: Theorem C in Section 4.1
Let X be a random variable with mean µ and variance σ^2. Then for any t > 0,

P( |X − µ| ≥ t ) ≤ σ^2/t^2

Proof:  Let Y = (X − µ)^2 = |X − µ|^2 and w = t^2. Then

P(Y ≥ w) ≤ E(Y)/w = E(X − µ)^2/w = σ^2/w

Note that |X − µ|^2 ≥ t^2 ⇐⇒ |X − µ| ≥ t. Therefore,

P( |X − µ| ≥ t ) = P( |X − µ|^2 ≥ t^2 ) ≤ σ^2/t^2
Chebyshev's inequality:  P( |X − µ| ≥ t ) ≤ σ^2/t^2.

If t = kσ, then

P( |X − µ| ≥ kσ ) ≤ 1/k^2

The error decreases at the rate of 1/k^2, which is faster than 1/k.
Chernoff Inequality
Ross, Proposition 8.5.2:  If X is a random variable with finite MGF M_X(t), then for any ε > 0,

P{X ≥ ε} ≤ e^{−tε} M_X(t),   for all t > 0
P{X ≤ ε} ≤ e^{−tε} M_X(t),   for all t < 0

Application:  One can choose the t that minimizes the upper bounds. This usually leads to accurate probability bounds, which decrease exponentially fast.
Proof:  Use Markov's Inequality.

For t > 0, because X > ε ⇐⇒ e^{tX} > e^{tε} (monotone transformation),

P(X > ε) = P( e^{tX} ≥ e^{tε} ) ≤ E( e^{tX} )/e^{tε} = e^{−tε} M_X(t)
Tail Bounds of Normal Random Variables
X ∼ N(µ, σ^2). Assume µ > 0. Need to know P( |X − µ| ≥ εµ ) ≤ ??

Chebyshev's inequality:

P( |X − µ| ≥ εµ ) ≤ σ^2/(ε^2 µ^2) = (1/ε^2)(σ^2/µ^2)

The bound is not good enough, only decreasing at the rate of 1/ε^2.
Tail Bounds of Normal Using Chernoff’s Inequality
Right tail bound:  P(X − µ ≥ εµ)

For any t > 0,

P(X − µ ≥ εµ) = P( X ≥ (1 + ε)µ )
             ≤ e^{−t(1+ε)µ} M_X(t)
             = e^{−t(1+ε)µ} e^{µt + σ^2 t^2/2}
             = e^{−t(1+ε)µ + µt + σ^2 t^2/2}
             = e^{−tεµ + σ^2 t^2/2}

What's next? Since the inequality holds for any t > 0, we can choose the t that minimizes the upper bound.
Right tail bound:  P(X − µ ≥ εµ)

Choose t = t* to minimize g(t) = −tεµ + σ^2 t^2/2:

g'(t) = −εµ + σ^2 t = 0  =⇒  t* = µε/σ^2  =⇒  g(t*) = −(ε^2/2)(µ^2/σ^2)

Therefore,

P(X − µ ≥ εµ) ≤ e^{−(ε^2/2)(µ^2/σ^2)},

decreasing at the rate of e^{−ε^2}.
Left tail bound:  P(X − µ ≤ −εµ)

For any t < 0,

P(X − µ ≤ −εµ) = P( X ≤ (1 − ε)µ )
              ≤ e^{−t(1−ε)µ} M_X(t)
              = e^{−t(1−ε)µ} e^{µt + σ^2 t^2/2}
              = e^{tεµ + σ^2 t^2/2}

Choose t = t* = −µε/σ^2 to minimize tεµ + σ^2 t^2/2. Therefore,

P(X − µ ≤ −εµ) ≤ e^{−(ε^2/2)(µ^2/σ^2)}
Combine left and right tail bounds:

P( |X − µ| ≥ εµ ) = P(X − µ ≥ εµ) + P(X − µ ≤ −εµ) ≤ 2 e^{−(ε^2/2)(µ^2/σ^2)}
Sample Size Selection Using Tail Bounds
Xi ∼ N(µ, σ^2), i.i.d., i = 1 to k.

An unbiased estimator of µ is µ̂ = (1/k) Σ_{i=1}^k Xi,   µ̂ ∼ N(µ, σ^2/k).

Choose k such that

P( |µ̂ − µ| ≥ εµ ) ≤ δ

We already know P( |µ̂ − µ| ≥ εµ ) ≤ 2 e^{−(ε^2/2) µ^2/(σ^2/k)}.
It suffices to select k such that

2 e^{−(ε^2/2) kµ^2/σ^2} ≤ δ
=⇒ e^{−(ε^2/2) kµ^2/σ^2} ≤ δ/2
=⇒ −(ε^2/2) kµ^2/σ^2 ≤ log(δ/2)
=⇒ (ε^2/2) kµ^2/σ^2 ≥ −log(δ/2) = log(2/δ)
=⇒ k ≥ log(2/δ) · 2σ^2/(ε^2 µ^2)
Suppose Xi ∼ N(µ, σ^2), i = 1 to k, i.i.d. Then µ̂ = (1/k) Σ_{i=1}^k Xi is an unbiased estimator of µ. If the sample size k satisfies

k ≥ log(2/δ) · 2σ^2/(ε^2 µ^2),

then with probability at least 1 − δ, the estimated µ is within a 1 ± ε factor of the true µ, i.e., |µ̂ − µ| ≤ εµ.
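A small Matlab simulation sketch of this sample-size rule (all numbers below are hypothetical illustration values):

mu = 10; sigma = 3; eps = 0.05; delta = 0.05;
k = ceil( log(2/delta) * 2*sigma^2 / (eps^2 * mu^2) );
T = 10^4; bad = 0;
for t = 1:T
  muhat = mean( normrnd(mu, sigma, k, 1) );
  bad = bad + ( abs(muhat - mu) >= eps*mu );
end
[k, bad/T, delta]    % the empirical failure rate should be below delta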
What affects sample size k ?
k ≥ log(2/δ) · 2σ^2/(ε^2 µ^2)

• δ: level of significance. Lower δ → more significant → larger k.
• σ^2/µ^2: noise/signal ratio. Higher σ^2/µ^2 → larger k.
• ε: accuracy. Lower ε → more accurate → larger k.
• The evaluation criterion. For example, |µ̂ − µ| ≤ εµ, or |µ̂ − µ| ≤ ε?
Exercise:  In random projections, d̂ is the unbiased estimator of the Euclidean distance d.

• Prove the exponential tail bound:  P( |d̂ − d| ≥ εd ) ≤ e^{???}
• Determine the sample size such that  P( |d̂ − d| ≥ εd ) ≤ δ
Section 4.6: Approximate Methods
Suppose we know E(X) = µX, Var(X) = σX^2. Suppose Y = g(X).

What about E(Y), Var(Y)?

In many cases, analytical solutions are not available (or too complicated).

How about Y = aX? Easy!  We know E(Y) = aE(X) = aµX, Var(Y) = a^2 σX^2.
The Delta Method
General idea:  Taylor expansion of Y = g(X) about X = µX.

Y = g(X) = g(µX) + (X − µX) g'(µX) + (1/2)(X − µX)^2 g''(µX) + ...

Taking expectations on both sides:

E(Y) = g(µX) + E(X − µX) g'(µX) + (1/2) E(X − µX)^2 g''(µX) + ...
=⇒ E(Y) ≈ g(µX) + (σX^2/2) g''(µX)
What about the variance?
Use the linear approximation only:

Y = g(X) = g(µX) + (X − µX) g'(µX) + ...

Var(Y) ≈ [g'(µX)]^2 σX^2

How good are these approximations? That depends on the nonlinearity of g(X) about µX.
Example B in Section 4.6
X ∼ U(0, 1), Y = √X. Compute E(Y) and Var(Y).

Exact Method

E(Y) = ∫_0^1 √x dx = x^{1/2 + 1}/(1/2 + 1) |_0^1 = 2/3

E(Y^2) = ∫_0^1 x dx = 1/2

Var(Y) = 1/2 − (2/3)^2 = 1/18 = 0.0556
Delta Method:  X ∼ U(0, 1), E(X) = 1/2, Var(X) = 1/12.

Y = g(X) = √X.   g'(X) = (1/2) X^{−1/2},   g''(X) = −(1/2)(1/2) X^{−3/2} = −(1/4) X^{−3/2}.

E(Y) ≈ sqrt(E(X)) + (Var(X)/2) · ( −(1/4) E(X)^{−3/2} )
     = sqrt(1/2) + (1/12)/2 · ( −(1/4)(1/2)^{−3/2} ) = 0.6776

Var(Y) ≈ Var(X) · [ (1/2) E(X)^{−1/2} ]^2
       = (1/12) · [ (1/2)(1/2)^{−1/2} ]^2 = 0.0417
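A quick Matlab sketch comparing these numbers with simulation:

X = rand(10^6, 1);           % X ~ Uniform(0,1)
Y = sqrt(X);
[mean(Y), 2/3, 0.6776]       % simulated vs. exact vs. delta-method mean
[var(Y), 1/18, 0.0417]       % simulated vs. exact vs. delta-method variance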
Delta Method for Sign Random Projections
The projected data v1,j and v2,j are bivariate normal:

(v1,j, v2,j)^T ∼ N( µ = (0, 0)^T,  Σ = [ m1  a ;  a  m2 ] ),   j = 1, 2, ..., k

One can use â = (1/k) Σ_{j=1}^k v1,j v2,j to estimate a without bias. One can also first estimate the angle θ = cos^{−1}( a/√(m1 m2) ) using

Pr( sign(v1,j) = sign(v2,j) ) = 1 − θ/π

and then estimate a using cos(θ̂) √(m1 m2). The delta method can help the analysis.

(Why sign random projections?)
The Delta Method for Two Variables
Z = g(X, Y).  E(X) = µX, E(Y) = µY, Var(X) = σX^2, Var(Y) = σY^2, Cov(X, Y) = σXY.

Taylor expansion of Z = g(X, Y) about (X = µX, Y = µY):

Z = g(µX, µY) + (X − µX) ∂g(µX, µY)/∂X + (Y − µY) ∂g(µX, µY)/∂Y
  + (1/2)(X − µX)^2 ∂^2g(µX, µY)/∂X^2 + (1/2)(Y − µY)^2 ∂^2g(µX, µY)/∂Y^2
  + (X − µX)(Y − µY) ∂^2g(µX, µY)/∂X∂Y + ...
Taking expectations of both sides of the expansion:

E(Z) ≈ g(µX, µY) + (σX^2/2) ∂^2g(µX, µY)/∂X^2 + (σY^2/2) ∂^2g(µX, µY)/∂Y^2 + σXY ∂^2g(µX, µY)/∂X∂Y

Only using the linear expansion yields

Var(Z) ≈ σX^2 [ ∂g(µX, µY)/∂X ]^2 + σY^2 [ ∂g(µX, µY)/∂Y ]^2 + 2 σXY [ ∂g(µX, µY)/∂X ][ ∂g(µX, µY)/∂Y ]
Chapter 5: Limit Theorems
X1, X2, ..., Xn are i.i.d. samples. What happens as n → ∞?
• The Law of Large Numbers
• The Central Limit Theorem
• The Normal Approximation
The Law of Large Numbers
Theorem 5.2.A:  Let X1, X2, ..., be a sequence of independent random variables with E(Xi) = µ and Var(Xi) = σ^2. Then, for any ε > 0, as n → ∞,

P( | (1/n) Σ_{i=1}^n Xi − µ | > ε ) → 0

The sequence of sample means {X̄n} is said to converge in probability to µ.
Proof:  Using Chebyshev's Inequality.

Because the Xi's are i.i.d., with X̄ = (1/n) Σ_{i=1}^n Xi,

E(X̄) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) nµ = µ

Var(X̄) = (1/n^2) Σ_{i=1}^n Var(Xi) = (1/n^2) nσ^2 = σ^2/n

Thus, by Chebyshev's Inequality,

P( |X̄ − µ| ≥ ε ) ≤ Var(X̄)/ε^2 = σ^2/(nε^2) → 0
[Figure: Sample mean versus n (log scale, n = 1 to 10^6) for i.i.d. samples from a normal distribution, illustrating the law of large numbers.]
[Figure: Sample mean versus n for i.i.d. samples from a gamma distribution.]
[Figure: Sample mean versus n for i.i.d. samples from a uniform distribution.]
Matlab Code
function TestLawLargeNumbers(MEAN)
N = 10^6;
figure;
c = ['r','k','b'];
for repeat = 1:3
  X = normrnd(MEAN, 1, 1, N);    % var = 1
  semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
  grid on; hold on;
end;
xlabel('n'); ylabel('Sample Mean');
title('Normal Distribution');

figure;
for repeat = 1:3
  X = gamrnd(MEAN.^2, 1./MEAN, 1, N);    % var = 1
  semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
  grid on; hold on;
end;
xlabel('n'); ylabel('Sample Mean');
title('Gamma Distribution');

figure;
for repeat = 1:3
  X = rand(1, N)*MEAN*2;
  semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
  grid on; hold on;
end;
xlabel('n'); ylabel('Sample Mean');
title('Uniform Distribution');
Monte Carlo Integration
To calculate

I(f) = ∫_0^1 f(x) dx,   for example f(x) = e^{−x^2/2},

numerical integration can be difficult, especially in high dimensions.
Monte Carlo integration:  Generate n i.i.d. samples Xi ∼ U(0, 1). Then by the LLN, as n → ∞,

(1/n) Σ_{i=1}^n f(Xi) → E(f(Xi)) = ∫_0^1 f(x) · 1 dx
Advantages
• Very flexible. The interval does not have to be [0,1]. The function f (x) can
be complicated. The function can be decomposed in various ways, e.g.,
f (x) = g(x) ∗ h(x), and one can sample from other distributions.
• Straightforward in high-dimensions, double integrals, triple integrals, etc.
Major disadvantage of Monte Carlo integration
The LLN converges at the rate of 1/√n, from the Central Limit Theorem.
Numerical integration converges at the rate of 1/n.
However, in high dimensions, the difference becomes smaller.
Also, there are more advanced Monte Carlo techniques to achieve better rates.
Examples for Monte Carlo Numerical Integration
Treat ∫_0^1 cos x dx as an expectation:

∫_0^1 cos x dx = ∫_0^1 1 × cos x dx = E(cos(x)),   x ∼ Uniform(0, 1)

Monte Carlo integration procedure (a minimal sketch follows below):
• Generate N i.i.d. samples xi ∼ Uniform(0, 1), i = 1 to N.
• Use the empirical expectation (1/N) Σ_{i=1}^N cos(xi) to approximate E(cos(x)).
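A minimal Matlab sketch of this procedure:

N = 10^6;
x = rand(N, 1);              % x ~ Uniform(0,1)
estimate = mean(cos(x));     % Monte Carlo estimate of the integral
[estimate, sin(1)]           % compare with the true value sin(1) = 0.8415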
True value: ∫_0^1 cos x dx = sin(1) = 0.8415

[Figure: Monte Carlo estimate versus N (log scale, N = 10 to 10^7), converging to 0.8415.]
Another example:  ∫_0^1 [ log^2(x + 0.1) / sqrt(sin(x + 0.1)) ] e^{−x^{0.15}} dx

[Figure: Monte Carlo estimate versus N (log scale, N = 10 to 10^7).]
Section 5.3: Central Limit Theorem and Normal Approximation
Central Limit Theorem:  Let X1, X2, ..., be a sequence of independent and identically distributed random variables, each having finite mean E(Xi) = µ and variance σ^2. Then as n → ∞,

P( (X1 + X2 + ... + Xn − nµ)/(√n σ) ≤ y ) → ∫_{−∞}^{y} (1/√(2π)) e^{−t^2/2} dt
Normal Approximation
(X1 + X2 + ... + Xn − nµ)/(√n σ) = (X̄ − µ)/sqrt(σ^2/n)   is approximately N(0, 1)

Non-rigorously, we may say X̄ is approximately N(µ, σ^2/n).

But we know exactly that E(X̄) = µ, Var(X̄) = σ^2/n.
Normal Distribution Approximates Binomial
Suppose X ∼ Binomial(n, p). For fixed p, as n → ∞,

Binomial(n, p) ≈ N(µ, σ^2),   µ = np, σ^2 = np(1 − p).
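For example, with n = 100 and p = 0.2 (one of the cases plotted below), Binomial(100, 0.2) ≈ N(20, 16).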
[Figures: Binomial(n, p = 0.2) probability mass function overlaid with the approximating normal density, for n = 10, 20, 50, 100, and 1000.]
Matlab code
function NormalApprBinomial(n,p)
% Overlay the Binomial(n,p) mass function with its normal approximation.
mu = n*p; sigma2 = n*p*(1-p);
figure;
bar((0:n), binopdf(0:n,n,p), 'g'); hold on; grid on;
x = mu - 3*sigma2 : 0.001 : mu + 3*sigma2;
plot(x, normpdf(x, mu, sqrt(sigma2)), 'r-', 'linewidth', 2);
xlabel('x'); ylabel('Density (mass) function');
title(['n = ' num2str(n) '   p = ' num2str(p)]);
Convergence in Distribution
Definition:
Let X1 , X2 , ..., be a sequence of random variables with
cumulative distributions F1 , F2 , ..., and let X be a random variable with
distribution function F . We say that Xn converges in distribution to X if
lim Fn (x) = F (x)
n→∞
at every point x at which F is continuous.
Theorem 5.3A: Continuity Theorem
Let Fn be a sequence of cumulative distribution functions with the corresponding
MGF Mn . Let F be a cumulative distribution function with MGF M .
If Mn (t)
→ M (t) for all t in an open interval containing zero,
then Fn (x)
→ F (x) at all continuity points of F .
Approximate Poisson by Normal
X ∼ Poi(λ) is approximately normal when λ is large.

Recall Poi(λ) approximates Bin(n, p) with λ ≈ np and large n.

Let Xn ∼ Poi(λn), where λ1, λ2, ... is an increasing sequence with λn → ∞.

Let Zn = (Xn − λn)/√λn, with CDF Fn. Let Z ∼ N(0, 1), with CDF F.

To show Fn → F, it suffices to show M_{Zn}(t) → M_Z(t) = e^{t^2/2}.
Proof:  If Y ∼ Poi(λ), then M_Y(t) = e^{λ(e^t − 1)}. Then, for Zn = (Xn − λn)/√λn,

M_{Zn}(t) = e^{−√λn t} · e^{λn( e^{t/√λn} − 1 )} = exp[ −t√λn + λn( e^{t/√λn} − 1 ) ] = exp[g(t, n)]
Recall e^t = 1 + t + t^2/2 + t^3/6 + ...

g(t, n) = −t√λn + λn( e^{t/√λn} − 1 )
        = −t√λn + λn( t/√λn + (1/2) t^2/λn + (1/6) t^3/λn^{3/2} + ... )
        = t^2/2 + (1/6) t^3/λn^{1/2} + ...  →  t^2/2

Therefore, M_{Zn}(t) → e^{t^2/2} = M_Z(t).
The Proof of Central Limit Theorem
Theorem 5.3.B:  Let X1, X2, ..., be a sequence of independent random variables having mean µ and variance σ^2, a common probability distribution function F, and MGF M defined in a neighborhood of zero. Then

lim_{n→∞} P( (Σ_{i=1}^n Xi − nµ)/(σ√n) ≤ x ) = ∫_{−∞}^{x} (1/√(2π)) e^{−z^2/2} dz,   −∞ < x < ∞

Proof:  Let Sn = Σ_{i=1}^n Xi and Zn = (Sn − nµ)/(σ√n). It suffices to show

M_{Zn}(t) → e^{t^2/2},   as n → ∞.
Note that M_{Sn}(t) = M^n(t). Hence

M_{Zn}(t) = e^{−(√n µ/σ) t} M_{Sn}( t/(σ√n) ) = e^{−(√n µ/σ) t} M^n( t/(σ√n) )

Taylor expand M(t) about zero:

M(t) = 1 + t M'(0) + (t^2/2) M''(0) + ... = 1 + tµ + (t^2/2)(σ^2 + µ^2) + ...
Therefore,

M_{Zn}(t) = e^{−(√n µ/σ) t} M^n( t/(σ√n) )
          = e^{−(√n µ/σ) t} [ 1 + µt/(σ√n) + (t^2/(2σ^2 n))(σ^2 + µ^2) + ... ]^n
          = exp[ −(√n µ/σ) t + n log( 1 + µt/(σ√n) + (t^2/(2σ^2 n))(σ^2 + µ^2) + ... ) ]
By Taylor expansion, log(1 + x) = x − x^2/2 + .... Therefore,

n log( 1 + µt/(σ√n) + (t^2/(2σ^2 n))(σ^2 + µ^2) )
  = n [ µt/(σ√n) + (t^2/(2σ^2 n))(σ^2 + µ^2) − (1/2)( µt/(σ√n) )^2 + ... ]
  = n [ µt/(σ√n) + t^2/(2n) + ... ]

Hence

M_{Zn}(t) = exp[ −(√n µ/σ) t + n log( 1 + µt/(σ√n) + (t^2/(2σ^2 n))(σ^2 + µ^2) + ... ) ] → e^{t^2/2}

The textbook assumed µ = 0 to start with, which simplifies the algebra.
Chapter 6: Distributions Derived From Normal
• χ2 distribution:  If X1, X2, ..., Xn are i.i.d. N(0, 1), then Σ_{i=1}^n Xi^2 ∼ χ2_n, the χ2 distribution with n degrees of freedom.

• t distribution:  If U ∼ χ2_n, Z ∼ N(0, 1), and Z and U are independent, then Z/sqrt(U/n) ∼ t_n, the t distribution with n degrees of freedom.

• F distribution:  If U ∼ χ2_m, V ∼ χ2_n, and U and V are independent, then (U/m)/(V/n) ∼ F_{m,n}, the F distribution with m and n degrees of freedom.
χ2 Distribution
If X1, X2, ..., Xn are i.i.d. N(0, 1), then Σ_{i=1}^n Xi^2 ∼ χ2_n, the χ2 distribution with n degrees of freedom.

• Z ∼ χ2_n, then the MGF is M_Z(t) = (1 − 2t)^{−n/2}.
• Z ∼ χ2_n, then E(Z) = n, Var(Z) = 2n.
• Z1 ∼ χ2_n, Z2 ∼ χ2_m, with Z1 and Z2 independent. Then Z = Z1 + Z2 ∼ χ2_{n+m}.
• χ2_n = Gamma(α = n/2, λ = 1/2).
If X ∼ Gamma(α, λ), then M_X(t) = ( λ/(λ − t) )^α = ( 1/(1 − t/λ) )^α.

If Z ∼ χ2_n, then M_Z(t) = (1 − 2t)^{−n/2} = ( 1/(1 − 2t) )^{n/2}.

Therefore, Z ∼ χ2_n = Gamma(n/2, 1/2), and the density function of Z ∼ χ2_n is

f_Z(z) = 1/( 2^{n/2} Γ(n/2) ) · z^{n/2 − 1} e^{−z/2},   z ≥ 0.
t Distribution
If U ∼ χ2_n, Z ∼ N(0, 1), and Z and U are independent, then Z/sqrt(U/n) ∼ t_n, the t distribution with n degrees of freedom.

Theorem 6.2.A:  The density function of Z ∼ t_n is

f_Z(z) = Γ[(n + 1)/2] / ( √(nπ) Γ(n/2) ) · ( 1 + z^2/n )^{−(n+1)/2}
[Figure: Densities of the t distribution with 1 through 10 degrees of freedom, together with the standard normal density.]
Matlab Code
function plot_tdensity
% Plot t densities for 1 to 10 degrees of freedom against the standard normal.
figure;
x = -5:0.01:5;
plot(x, tpdf(x,1), 'g-', 'linewidth', 2); hold on; grid on;
plot(x, tpdf(x,10), 'k-', 'linewidth', 2); hold on; grid on;
plot(x, normpdf(x), 'r', 'linewidth', 2);
for n = 2:9
  plot(x, tpdf(x,n)); hold on; grid on;
end;
xlabel('x'); ylabel('density');
legend('1 degree', '10 degrees', 'normal');
Things to know about tn distributions:
• It is widely used in statistical testing, the t-test.
• It is practically indistinguishable from normal, when n ≥ 45.
• It is a heavy-tailed distribution; only moments of order less than n exist.
• It is the Cauchy distribution when n = 1.
The F Distribution
If U ∼ χ2_m, V ∼ χ2_n, and U and V are independent, then Z = (U/m)/(V/n) ∼ F_{m,n}, the F distribution with m and n degrees of freedom.

Proposition 6.2.B:  If Z ∼ F_{m,n}, then the density is

f_Z(z) = Γ[(m + n)/2] / ( Γ(m/2) Γ(n/2) ) · (m/n)^{m/2} z^{m/2 − 1} ( 1 + (m/n) z )^{−(m+n)/2}

The F distribution is also widely used in statistical testing, the F-test.
The Cauchy Distribution
If X ∼ N(0, 1) and Y ∼ N(0, 1), and X and Y are independent, then Z = X/Y has the standard Cauchy distribution, with density

f_Z(z) = 1/( π(z^2 + 1) ),   −∞ < z < ∞

The Cauchy distribution does not have a finite mean: E(Z) is undefined.
It is also the t distribution with 1 degree of freedom.
Proof:

F_Z(z) = P(Z ≤ z) = P(X/Y ≤ z)
       = P(X ≤ Yz, Y > 0) + P(X ≥ Yz, Y < 0)
       = 2 P(X ≤ Yz, Y > 0)
       = 2 ∫_0^∞ ∫_0^{yz} f_{X,Y}(x, y) dx dy
       = 2 ∫_0^∞ ∫_0^{yz} (1/√(2π)) e^{−x^2/2} (1/√(2π)) e^{−y^2/2} dx dy
       = (1/π) ∫_0^∞ e^{−y^2/2} ∫_0^{yz} e^{−x^2/2} dx dy

Now what?  It actually appears easier to work with the PDF f_Z(z).
Use the fact

∂/∂x ∫_c^{g(x)} h(y) dy = h(g(x)) g'(x),   for any constant c.

f_Z(z) = (1/π) ∫_0^∞ e^{−y^2/2} y e^{−y^2 z^2/2} dy
       = (1/π) ∫_0^∞ e^{−(y^2/2)(z^2 + 1)} d(y^2/2)
       = (1/π) · 1/(z^2 + 1).

What's the problem when working directly with the CDF?
F_Z(z) = (1/π) ∫_0^∞ e^{−y^2/2} ∫_0^{yz} e^{−x^2/2} dx dy
       = (1/π) ∫_0^∞ ∫_0^{yz} e^{−(x^2 + y^2)/2} dx dy
       = (1/π) ∫_0^∞ ∫_{tan^{−1}(1/z)}^{π/2} e^{−r^2/2} r dθ dr
       = (1/π) ∫_0^∞ e^{−r^2/2} r [ π/2 − tan^{−1}(1/z) ] dr
       = [ π/2 − tan^{−1}(1/z) ]/π · ∫_0^∞ e^{−r^2/2} d(r^2/2)
       = [ π/2 − tan^{−1}(1/z) ]/π

Therefore,

f_Z(z) = (1/π) · 1/(z^2 + 1).
Section 6.3: Sample Mean and Sample Variance
Let X1 , X2 , ..., Xn be independent samples from N (µ, σ 2 ).
The sample mean:  X̄ = (1/n) Σ_{i=1}^n Xi

The sample variance:  S^2 = 1/(n − 1) Σ_{i=1}^n (Xi − X̄)^2
Theorem 6.3.A:  The random variable X̄ and the vector (X1 − X̄, X2 − X̄, ..., Xn − X̄) are independent.

Proof:  Read the book for a more rigorous proof. Let's only prove that X̄ and Xi − X̄ are uncorrelated (homework problem).
Corollary 6.3.A:  X̄ and S^2 are independently distributed.

Proof:  It follows immediately because S^2 is a function of (X1 − X̄, X2 − X̄, ..., Xn − X̄).
Joint Distribution of the Sample Mean and Sample Variance
Theorem 6.3.B:  (n − 1)S^2/σ^2 ∼ χ2_{n−1}.

Proof:  X1, X2, ..., Xn are independent normal variables, Xi ∼ N(µ, σ^2).

Intuitively, S^2 = 1/(n − 1) Σ_{i=1}^n (Xi − X̄)^2 should be closely related to a chi-squared distribution.
(n − 1)S^2 = Σ_{i=1}^n (Xi − X̄)^2
           = Σ_{i=1}^n (Xi − µ + µ − X̄)^2
           = Σ_{i=1}^n (Xi − µ)^2 − n(µ − X̄)^2
Let Y = (n − 1)S^2/σ^2. We know Σ_{i=1}^n ( (Xi − µ)/σ )^2 ∼ χ2_n, and

(n − 1)S^2/σ^2 = Σ_{i=1}^n ( (Xi − µ)/σ )^2 − ( (µ − X̄)/(σ/√n) )^2

so that

Y + ( (µ − X̄)/(σ/√n) )^2 = Σ_{i=1}^n ( (Xi − µ)/σ )^2,   where ( (µ − X̄)/(σ/√n) )^2 ∼ χ2_1.

The MGFs of both sides should be equal. Also, note that Y and X̄ are independent.
With Y = (n − 1)S^2/σ^2 and Y + ( (µ − X̄)/(σ/√n) )^2 = Σ_{i=1}^n ( (Xi − µ)/σ )^2,

equating the MGFs of both sides (also using independence):

E( e^{tY} ) (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}
=⇒ E( e^{tY} ) = (1 − 2t)^{−(n−1)/2}

Therefore,

(n − 1)S^2/σ^2 ∼ χ2_{n−1}
Corollary 6.3.B:  (X̄ − µ)/(S/√n) ∼ t_{n−1}.

Proof:

(X̄ − µ)/(S/√n) = [ (X̄ − µ)/(σ/√n) ] / sqrt( [(n − 1)S^2/σ^2] / (n − 1) ) = U / sqrt( V/(n − 1) )

where U ∼ N(0, 1) and V = (n − 1)S^2/σ^2 ∼ χ2_{n−1} are independent. Therefore, U/sqrt(V/(n − 1)) ∼ t_{n−1} by definition.
Chapter 8: Parameter Estimation
One of the most important chapters for 4090!
Assume n i.i.d. observations Xi, i = 1 to n. The Xi's have a density function with k parameters θ1, θ2, ..., θk, written as fX(x; θ1, θ2, ..., θk).
The task is to estimate θ1 , θ2 , ..., θk , from n samples X1 , X2 , ..., Xn .
———————————-
Where did the density function fX come from in the first place?
This is often a chicken-egg problem, but it is not a major concern for this class.
Two Basic Estimation Methods
Suppose X1 , X2 , ..., Xn are i.i.d. samples with density fX (x; θ1 , θ2 ).
• The method of moments
  Force (1/n) Σ_{i=1}^n Xi = E(X) and (1/n) Σ_{i=1}^n Xi^2 = E(X^2).
  Two equations, two unknowns (θ1, θ2).

• The method of maximum likelihood
  Find the θ1 and θ2 that maximize the joint probability (likelihood) Π_{i=1}^n fX(xi; θ1, θ2).
  An optimization problem, maybe convex.
The Method of Moments
Assume n i.i.d. observations Xi, i = 1 to n. The Xi's have a density function with k parameters θ1, θ2, ..., θk, written as fX(x; θ1, θ2, ..., θk).

Define the mth theoretical moment of X:   µm = E(X^m)

Define the mth empirical moment of X:   µ̂m = (1/n) Σ_{i=1}^n Xi^m

Solve a system of k equations:  µm = µ̂m, m = 1 to k.

What could be the difficulties?
Example 8.4.A:  Xi ∼ Poisson(λ), i.i.d., i = 1 to n.

Because E(Xi) = λ, the moment estimator would be

λ̂ = (1/n) Σ_{i=1}^n Xi = X̄

Properties of λ̂:

E(λ̂) = (1/n) Σ_{i=1}^n E(Xi) = λ,   Var(λ̂) = (1/n) Var(Xi) = λ/n
Xi ∼ Poisson(λ), i.i.d., i = 1 to n.

Because Var(Xi) = λ, we can also estimate λ by

λ̂2 = (1/n) Σ_{i=1}^n Xi^2 − ( (1/n) Σ_{i=1}^n Xi )^2

This estimator λ̂2 is no longer unbiased, because

E(λ̂2) = (λ + λ^2) − (λ/n + λ^2) = λ − λ/n
Moment estimators are in general biased.
Q: How to modify λ̂2 to obtain an unbiased estimator?
Example 8.4.B:  Xi ∼ N(µ, σ^2), i.i.d., i = 1 to n.

Solve for µ and σ^2 from the equations

µ = (1/n) Σ_{i=1}^n Xi,   σ^2 = (1/n) Σ_{i=1}^n Xi^2 − ( (1/n) Σ_{i=1}^n Xi )^2

The moment estimators are

µ̂ = X̄,   σ̂^2 = (1/n) Σ_{i=1}^n (Xi − X̄)^2

We already know that µ̂ and σ̂^2 are independent, and

µ̂ ∼ N(µ, σ^2/n),   n σ̂^2/σ^2 ∼ χ2_{n−1}
Example 8.4.C:  Xi ∼ Gamma(α, λ), i.i.d., i = 1 to n.

The first two moments are

µ1 = α/λ,   µ2 = α(α + 1)/λ^2

Equivalently,

α = µ1^2/(µ2 − µ1^2),   λ = µ1/(µ2 − µ1^2)

The moment estimators are

α̂ = µ̂1^2/(µ̂2 − µ̂1^2) = X̄^2/σ̂^2,   λ̂ = µ̂1/(µ̂2 − µ̂1^2) = X̄/σ̂^2
Example 8.4.D:  Assume that the random variable X has density

fX(x) = (1 + αx)/2,   |x| ≤ 1, |α| ≤ 1

Then α can be estimated from the first moment:

µ1 = ∫_{−1}^{1} x (1 + αx)/2 dx = α/3.
Therefore, the moment estimator would be
α̂ = 3X̄.
Consistency of Moment Estimators
Definition:  Let θ̂n be an estimator of a parameter θ based on a sample of size n. Then θ̂n is consistent in probability if, for any ε > 0,

P( |θ̂n − θ| ≥ ε ) → 0,   as n → ∞
Moment estimators are consistent if the conditions for Weak Law of Large
Numbers are satisfied.
A Simulation Study for Estimating Gamma Parameters
Consider a gamma distribution Gamma(α, λ) with α = 4 and λ = 0.5.

Generate n samples from Gamma(α = 4, λ = 0.5), for n = 5 to n = 10^5.
Estimate α and λ by moment estimators for every n.
Repeat the experiment 4 times.
[Figure: Moment estimates of α = 4 versus sample size n (log scale, n = 5 to 10^5), for four repeated experiments.]
[Figure: Moment estimates of λ = 0.5 versus sample size n (log scale, n = 5 to 10^5), for four repeated experiments.]
Matlab Code
function est_gamma
n = 10^5; al = 4; lam = 0.5; c = ['b','k','r','g'];
for t = 1:4
  X = gamrnd(al, 1/lam, n, 1);
  mu1 = cumsum(X)./(1:n)';
  mu2 = cumsum(X.^2)./(1:n)';
  est_al  = mu1.^2./(mu2 - mu1.^2);
  est_lam = mu1./(mu2 - mu1.^2);
  st = 5;
  figure(1);
  semilogx((st:n)', est_al(st:end), c(t), 'linewidth', 2); hold on; grid on;
  title(['Gamma: Moment estimate of \alpha = ' num2str(al)]);
  figure(2);
  semilogx((st:n)', est_lam(st:end), c(t), 'linewidth', 2); hold on; grid on;
  title(['Gamma: Moment estimate of \lambda = ' num2str(lam)]);
end;
The Method of Maximum Likelihood
Suppose that random variables X1 , X2 , ..., Xn have a joint density
f (x1 , x2 , ..., xn |θ). Given observed values Xi = xi , where i = 1 to n, the
likelihood of θ as a function of (x1 , x2 , .., xn ) is defined as
lik(θ) = f (x1 , x2 , ..., xn |θ)
The method of maximum likelihood seeks the θ that maximizes lik(θ).
The Log Likelihood in the I.I.D. Case
If the Xi's are i.i.d., then

lik(θ) = Π_{i=1}^n f(Xi | θ)

It is often more convenient to work with its logarithm, called the log likelihood:

l(θ) = Σ_{i=1}^n log f(Xi | θ)
Example 8.5.A:  Suppose X1, X2, ..., Xn are i.i.d. samples of Poisson(λ). Then the likelihood of λ is

lik(λ) = Π_{i=1}^n λ^{Xi} e^{−λ}/Xi!

The log likelihood is

l(λ) = Σ_{i=1}^n [ Xi log λ − λ − log Xi! ] = log λ Σ_{i=1}^n Xi − nλ + [ −Σ_{i=1}^n log Xi! ]

The part in [...] is useless for finding the MLE.
The log likelihood is

l(λ) = log λ Σ_{i=1}^n Xi − nλ − Σ_{i=1}^n log Xi!

The MLE is the solution to l'(λ) = 0, where

l'(λ) = (1/λ) Σ_{i=1}^n Xi − n

Therefore, the MLE is λ̂ = X̄, the same as the moment estimator.

For verification, check l''(λ) = −(1/λ^2) Σ_{i=1}^n Xi ≤ 0, meaning that l(λ) is a concave function and the solution to l'(λ) = 0 is indeed the maximum.
Example 8.5.B:  Given n i.i.d. samples Xi ∼ N(µ, σ^2), i = 1 to n, the log likelihood is

l(µ, σ^2) = Σ_{i=1}^n log fX(Xi; µ, σ^2) = −1/(2σ^2) Σ_{i=1}^n (Xi − µ)^2 − (n/2) log(2πσ^2)

∂l/∂µ = 1/(2σ^2) · 2 Σ_{i=1}^n (Xi − µ) = 0  =⇒  µ̂ = (1/n) Σ_{i=1}^n Xi

∂l/∂σ^2 = 1/(2σ^4) Σ_{i=1}^n (Xi − µ)^2 − n/(2σ^2) = 0  =⇒  σ̂^2 = (1/n) Σ_{i=1}^n (Xi − µ̂)^2.
Example 8.5.C:  Xi ∼ Gamma(α, λ), i.i.d., i = 1 to n.

The likelihood function is

lik(α, λ) = Π_{i=1}^n ( 1/Γ(α) ) λ^α Xi^{α−1} e^{−λXi}

The log likelihood function is

l(α, λ) = Σ_{i=1}^n [ −log Γ(α) + α log λ + (α − 1) log Xi − λXi ]

Taking derivatives:

∂l(α, λ)/∂α = −n Γ'(α)/Γ(α) + n log λ + Σ_{i=1}^n log Xi

∂l(α, λ)/∂λ = n α/λ − Σ_{i=1}^n Xi
The MLE solutions are

λ̂ = α̂/X̄,   −Γ'(α̂)/Γ(α̂) + log α̂ − log X̄ + (1/n) Σ_{i=1}^n log Xi = 0

We need an iterative scheme to solve for α̂ and λ̂. This is actually a difficult numerical problem because a naive method will not converge, or possibly because the Matlab implementation of the "psi" function Γ'(α̂)/Γ(α̂) is not that accurate.

As a last resort, one can always do an exhaustive search or binary search.

Our simulations show that the MLE is indeed better than the moment estimator.
[Figure: Moment estimate and MLE of α = 4 versus sample size (n = 10 to 100).]
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Gamma: Moment estimate of λ = 0.5
1
Moment
MLE
0.9
0.8
0.7
0.6
0.5
0.4
10
20
30
40
50
60
70
80
90
100
Matlab Code
function est_gamma_mle
close all; clear all;
n = 10^2; al = 4; lam = 0.5; c = ['b','k','r','g'];
for t = 1:3
  X = gamrnd(al, 1/lam, n, 1);
  % Find the moment estimators as starting points.
  mu1 = cumsum(X)./(1:n)';
  mu2 = cumsum(X.^2)./(1:n)';
  est_al  = mu1.^2./(mu2 - mu1.^2);
  est_lam = mu1./(mu2 - mu1.^2);
  % Exhaustive search in the neighborhood of the moment estimator.
  mu_log = cumsum(log(X))./(1:n)';
  m = 400;
  for i = 1:m
    al_m(:,i) = est_al - 2 + 0.01*(i-1);
    ind_neg = find(al_m(:,i) < 0);
    al_m(ind_neg,i) = eps;
    lam_m(:,i) = al_m(:,i)./mu1;
  end;
  L = log(lam_m).*al_m + (al_m-1).*(mu_log*ones(1,m)) - lam_m.*(mu1*ones(1,m)) - log(gamma(al_m));
  [dummy, ind] = max(L,[],2);
  for i = 1:n
    est_al_mle(i)  = al_m(i,ind(i));
    est_lam_mle(i) = lam_m(i,ind(i));
  end;
  st = 10;
  figure(1);
  plot((st:n)', est_al(st:end), [c(t) '--'], 'linewidth', 2); hold on; grid on;
  plot((st:n)', est_al_mle(st:end), c(t), 'linewidth', 2); hold on; grid on;
  title(['Gamma: Moment estimate of \alpha = ' num2str(al)]);
  legend('Moment','MLE');
  figure(2);
  plot((st:n)', est_lam(st:end), [c(t) '--'], 'linewidth', 2); hold on; grid on;
  plot((st:n)', est_lam_mle(st:end), c(t), 'linewidth', 2); hold on; grid on;
  title(['Gamma: Moment estimate of \lambda = ' num2str(lam)]);
  legend('Moment','MLE');
end;
Instructor: Ping Li
161
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Newton’s Method
To find the maximum or minimum of function f (x) is equivalent to find the x∗
such that f ′ (x∗ )
= 0.
Suppose x is close to x∗ . By Taylor expansions
f ′ (x∗ ) = f ′ (x) + (x∗ − x)f ′′ (x) + ... = 0
we obtain
′
f
(x)
∗
x ≈ x − ′′
f (x)
This gives an iterative formula.
In multi-dimensions, need invert a Hessian matrix (not just a reciprocal of f ′′ (x)).
162
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
MLE Using Newtons’ Method for Estimating Gamma Parameters
Xi ∼ Gamma(α, λ), i.i.d. i = 1 to n.
The log likelihood function
l(α, λ) =
n
X
i=1
− log Γ(α) + α log λ + (α − 1) log Xi − λXi
First derivatives
n
X
Γ′ (α)
∂l(α, λ)
= −n
+ n log λ +
log Xi
∂α
Γ(α)
i=1
n
∂l(α, λ)
α X
=n −
Xi
∂λ
λ i=1
163
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Second derivatives
∂ 2 l(α, λ)
′
=
−nψ
(α),
∂α2
Γ′ (α)
ψ(α) =
Γ(α)
∂ 2 l(α, λ)
α
= −n 2
∂λ2
λ
∂ 2 l(α, λ)
1
=n
∂λα
λ
We can use Newton’s method (two dimensions), starting with moment estimators.
The problem is actually more complicated because we have a constrained
α ≥ 0 and λ ≥ 0 may not be satisfied
during iterations, especially when sample size n is not large.
optimization problem. The constraints:
One the other hand, One-Step Newton’s method usually works well, starting with
an (already pretty good) estimator. Often more iterations do not help much.
164
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Gamma: One−step MLE of α = 4
4
Moment
One−step MLE
3.5
3
MSE
2.5
2
1.5
1
0.5
0
20
40
60
80
100
120
Sample size
140
160
180
200
165
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Gamma: One−step MLE of λ = 0.5
0.07
Moment
One−step MLE
0.06
0.05
MSE
0.04
0.03
0.02
0.01
0
20
40
60
80
100
120
Sample size
140
160
180
200
166
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Matlab Code for MLE Using One-Step Newton Updates
function est_gamma_mle_onestep
al =4; lam = 0.5;
N=[20:10:50, 80, 100, 150 200]; T = 10ˆ4;
X = gamrnd(al,1/lam,T,max(N));
for i = 1:length(N)
n = N(i);
mu1 = sum(X(:,1:n),2)./n;
mu2 = sum(X(:,1:n).ˆ2,2)./n;
est_al0 = mu1.ˆ2./(mu2-mu1.ˆ2);
est_lam0 = mu1./(mu2-mu1.ˆ2);
est_al0_mu(i) = mean(est_al0);
est_al0_var(i) = var(est_al0);
est_lam0_mu(i) = mean(est_lam0);
est_lam0_var(i) = var(est_lam0);
est_al_mle_s1 = est_al0;
est_lam_mle_s1= est_lam0;
d1_al = log(est_lam_mle_s1)+mean(log(X(:,1:n)),2) - psi(est_al_mle_s1);
d1_lam =est_al_mle_s1./est_lam_mle_s1 - mean(X(:,1:n),2);
d2_al = - psi(1,est_al_mle_s1);
d12 = 1./est_lam_mle_s1;
d2_lam = -est_al_mle_s1./est_lam_mle_s1.ˆ2;
for j = 1:T;
update(j,:) = (inv([d2_al(j) d12(j); d12(j) d2_lam(j)])*[d1_al(j);d1_lam(j)])’;
end;
est_al_mle_s1 = est_al_mle_s1 - update(:,1);
est_lam_mle_s1 = est_lam_mle_s1 - update(:,2);
est_lam_mle_s1 = est_al_mle_s1./mean(X(:,1:n),2);
167
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
est_al_mle_s1_mu(i) = mean(est_al_mle_s1);
est_al_mle_s1_var(i) = var(est_al_mle_s1);
est_lam_mle_s1_mu(i) = mean(est_lam_mle_s1);
est_lam_mle_s1_var(i) = var(est_lam_mle_s1);
end;
figure;
plot(N, (est_al0_mu-al).ˆ2+est_al0_var,’k--’,’linewidth’,2); hold on; grid on;
plot(N, (est_al_mle_s1_mu-al).ˆ2+est_al_mle_s1_var,’r-’,’linewidth’,2);
xlabel(’Sample size’);ylabel(’MSE’);
title([’Gamma: One-step MLE of \alpha = ’ num2str(al)]);
legend(’Moment’,’One-step MLE’);
figure;
plot(N, (est_lam0_mu-lam).ˆ2+est_lam0_var,’k--’,’linewidth’,2); hold on; grid on;
plot(N, (est_lam_mle_s1_mu-lam).ˆ2+est_lam_mle_s1_var,’r-’,’linewidth’,2);
title([’Gamma: One-step MLE of \lambda = ’ num2str(lam)]);
xlabel(’Sample size’);ylabel(’MSE’);
legend(’Moment’,’One-step MLE’);
Instructor: Ping Li
168
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
MLE of Multinomial Probabilities
Suppose X1 , X2 , ..., Xm , which are the counts of cells 1, 2, ..., m, follow a
multinomial distribution with total count of n and cell probabilities p1 , p2 , ..., pm .
To estimate p1 , p2 , ..., pm from the observations X1
= x1 , X2 = x2 , ...,
Xm = xm , write down the joint likelihood
f (x1 , x2 , ..., xm | p1 , p2 , ..., pm ) ∝
m
Y
pxi i
i=1
and the log likelihood
L(p1 , p2 , ..., pm ) =
m
X
i=1
A constrained optimization problem.
xi log pi ,
m
X
i=1
pi = 1
169
Cornell University, BTRY 4090 / STSCI 4090
Solution 1:
Spring 2010
Reduce to m − 1 variables.
L(p2 , ..., pm ) = x1 log(1 − p2 − p3 − .... − pm ) +
where
Instructor: Ping Li
m
X
i=2
m
X
xi log pi ,
i=2
pi ≤ 1, pi ≥ 0, pi ≤ 1
We do not have to worry about the inequality constraints unless they are violated.
170
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
∂L
−x1
xi
=
+
= 0,
∂pi
1 − p2 − p3 − ... − pm
pi
i = 2, 3, ..., m
x1
xi
=
p1
pi
x1
x2
x3
xm
=⇒
=
=
= ... =
=λ
p1
p2
p3
pm
=⇒
Therefore,
p1 =
x1
,
λ
=⇒ 1 =
p2 =
m
X
pi =
x2
,
λ
Pm
i=1
..., pm =
xi
=
n
λ
xm
,
λ
λ
xi
=⇒ λ = n =⇒ pi = , i = 1, 2, ..., m
n
i=1
171
Cornell University, BTRY 4090 / STSCI 4090
Solution 2:
Spring 2010
Instructor: Ping Li
Lagrange multiplier (essentially the same as solution 1)
Convert the original problem into an “unconstrained” problem
L(p1 , p2 , ..., pm ) =
m
X
i=1
xi log pi − λ
m
X
i=1
!
pi − 1
172
Cornell University, BTRY 4090 / STSCI 4090
Example A:
Spring 2010
Instructor: Ping Li
Hardy-Weinberg Equilibrium
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur in a
population with frequencies
(1 − θ)2 ,
2θ(1 − θ),
θ2 ,
respectively. Suppose we observe sample counts x1 , x2 , and x3 , with total =
Q: Estimate θ using MLE.
n.
173
Cornell University, BTRY 4090 / STSCI 4090
Solution:
Spring 2010
Instructor: Ping Li
The log likelihood can be written as
l(θ) =
3
X
xi log pi
i=1
=x1 log(1 − θ)2 + x2 log 2θ(1 − θ) + x3 log θ 2
∝2x1 log(1 − θ) + x2 log θ + x2 log(1 − θ) + 2x3 log θ
=(2x1 + x2 ) log(1 − θ) + (x2 + 2x3 ) log θ
Taking the first derivative
∂l(θ)
2x1 + x2
x2 + 2x3
=−
+
=0
∂θ
1−θ
θ
=⇒ θ̂ =
What is V ar(θ̂)?
x2 + 2x3
2n
174
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
1
(V ar(x2 ) + 4V ar(x3 ) + 4Cov(x2 , x3 ))
2
4n
1
= 2 (np2 (1 − p2 ) + 4np3 (1 − p3 )−4np2 p3 )
4n
1
2
p2 + 4p3 − (p2 + 2p3 )
=
4n
θ(1 − θ)
=
2n
V ar(θ̂) =
1
We will soon show the variance of MLE is asymptotically I(θ)
, where
I(θ) = −E
is the Fisher Information.
2
∂ l(θ)
∂θ 2
175
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
∂ 2 l(θ)
2x1 + x2
x2 + 2x3
=−
−
∂θ 2
(1 − θ)2
θ2
I(θ) = − E
2
∂ l(θ)
∂θ 2
2(1 − θ)2 + 2θ(1 − θ)
2θ(1 − θ) + 2θ 2
=n
+n
2
(1 − θ)
θ2
2n
2n
2n
=
+
=
1−θ
θ
θ(1 − θ)
Therefore, the “asymptotic variance” is V ar(θ̂)
which in this case is the exact variance.
=
θ(1−θ)
2n ,
176
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Review Properties of Multinomial Distribution
Suppose X1 , X2 , ..., Xm , which are the counts of cells 1, 2, ..., m, follow a
multinomial distribution with total count of n and cell probabilities p1 , p2 , ..., pm .
Marginal and conditional distributions
Xj ∼ Binomial(n, pj )
Xj |Xi ∼ Binomial n − Xi ,
pj
1 − pi
,
i 6= j
177
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Moments
E(Xj ) = npj ,
V ar(Xj ) = npj (1 − pj )
pj
E(Xj |Xi ) = (n − Xi )
1 − pi
2
E(Xi Xj ) = E(Xi E(Xj |Xi )) = E nXi − Xi
pj
2
2 2
n pi − npi (1 − pi ) − n pi
=
1 − pi
= npi pj (n − 1)
Cov(Xi , Xj ) = E(Xi Xj ) − E(Xi )E(Xj )
= npi pj (n − 1) − n2 pi pj = −npi pj
pj
1 − pi
178
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Large Sample Theory for MLE
Assume i.i.d. samples of size n, Xi , i
= 1 to n, with density f (x|θ).
The MLE of θ , denoted by θ̂ is given by
θ̂ = argmax
θ
Large sample theory says, as n
θ̂ ∼ N θ,
n
X
i=1
log f (xi |θ)
→ ∞, θ̂ is asymptotically unbiased and normal.
1
nI(θ)
,
approximately
I(θ) is the Fisher Information of θ : I(θ) = −E
h
∂2
∂θ 2
i
log f (X|θ) .
179
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Fisher Information
∂
I(θ) =E
log f (X|θ)
∂θ
2
2
∂
=−E
log f (X|θ)
∂θ 2
How to prove the equivalence of two definitions?
180
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Proof:
∂
E
log f (X|θ)
∂θ
2
=
Z ∂f
∂θ
∂2
log f (X|θ) = −
−E
∂θ 2
=−
2
Z f
Z
1
f dx =
f2
∂2f
∂θ 2
−
h
f2
2
∂ f
dx +
∂θ 2
Z ∂f
∂θ
i2
Z Therefore, it suffices to show (in fact assume):
Z
2
2
∂ f
∂
dx = 2
∂θ 2
∂θ
Z
f dx = 0
∂f
∂θ
2
1
dx
f
f dx
∂f
∂θ
2
1
dx
f
181
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Example: Normal Distribution
Given n i.i.d. samples, xi
∼ N (µ, σ 2 ), i = 1 to n.
log fX (x; µ, σ 2 ) = −
1
1
2
2
(x
−
µ)
−
log(2πσ
)
2
2σ
2
∂ 2 log fX (x; µ, σ 2 )
1
1
= − 2 =⇒ I(µ) = 2
∂µ2
σ
σ
∂ 2 log fX (x; µ, σ 2 )
(x − µ)2
1
=
−
+
∂(σ 2 )2
σ6
2σ 4
σ2
1
1
=⇒ I(σ ) = 6 − 4 = 4
σ
2σ
2σ
2
“Asymptotic” variances of MLE are in fact exact in this case.
182
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Example: Binomial Distribution
x ∼ Binomial(p, n): Pr (x = k) =
Log likelihood and Fisher Information:
n
k
pk (1 − p)n−k
l(p) = k log p + (n − k) log(1 − p)
k n−k
l (p) = −
=⇒ MLE p̂ =??
p
1−p
k
n−k
′′
l (p) = − 2 −
p
(1 − p)2
np
n − np
n
′′
=
I(p) = −E (l (p)) = 2 +
p
(1 − p)2
p(1 − p)
′
“Asymptotic” variance of MLE is also exact in this case.
183
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Intuition About the Asymptotic Distributions & Variances of MLE
The MLE θ̂ is the solution to the MLE equation l′ (θ̂)
= 0.
The Taylor expansion around the true θ
l′ (θ̂) ≈ l′ (θ) + (θ̂ − θ)l′′ (θ)
Let l′ (θ̂)
= 0 (because θ̂ is the MLE solution)
l′ (θ)
(θ̂ − θ) ≈ − ′′
l (θ)
What is the mean of l′ (θ)? What is the mean of l′′ (θ)?
184
Cornell University, BTRY 4090 / STSCI 4090
l′ (θ) =
Spring 2010
n
X
∂ log f (xi )
i=1
′
E (l (θ)) =
n
X
Instructor: Ping Li
E
i=1
∂θ
∂ log f (xi )
∂θ
=
n
X
i=1
∂f (xi )
∂θ
f (xi )
= nE
∂f (x)
∂θ
f (x)
!
=0
because
E
∂f (x)
∂θ
f (x)
!
=
Z
∂f (x)
∂θ
f (x)
f (x)dx =
Z
∂f (x)
∂
dx =
∂θ
∂θ
Z
f (x)dx = 0
185
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
E(l′ (θ)) = 0, and we know −E(l′′ (θ)) = nI(θ), the Fisher Information. Thus
l′ (θ)
l′ (θ)
≈
(θ̂ − θ) ≈ − ′′
l (θ)
nI(θ)
and
E(l′ (θ))
E(θ̂ − θ) ≈
=0
nI(θ)
Then, the variance
E(l′ (θ))2
nI(θ)
1
V ar(θ̂) ≈ 2 2
= 2
=
n I (θ)
n I(θ)
nI(θ)
186
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Sec. 8.7: Efficiency and Cramér-Rao Lower Bound
Definition:
Given two unbiased estimates, θ̂1 and θ̂2 , the efficiency of θ̂1
relative to θ̂2 is
ef f θ̂1 , θ̂2 =
V ar(θ̂2 )
V ar(θ̂1 )
For example, if the variance of θ̂2 is 0.8 times the variance of θ̂1 . Then θ̂1 is 80%
efficient relative to θ̂2 .
Asymptotic relative efficiency
Given two asymptotically unbiased
estimates, θ̂1 and θ̂2 , the asymptotic relative efficiency of θ̂1 relative to θ̂2 is
computed using their asymptotic variances (as sample size goes to infinity).
187
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Example 8.7.A:
Assume that the random variable X has density
1 + αx
,
fX (x) =
2
Method of Moments:
|x| ≤ 1, |α| ≤ 1
α can be estimated from the first moment
Z 1
1 + αx
α
µ1 =
x
dx = .
2
3
−1
Therefore, the moment estimator would be
α̂m = 3X̄.
whose variance
3 − α2
9
9
2
2
V ar(α̂m ) = V ar(X) =
E(X ) − E (X) =
n
n
n
188
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Maximum Likelihood Estimate:
Instructor: Ping Li
The first two derivatives
∂
x
log fX (x; α) =
∂α
1 + αx
∂2
−x2
log fX (x; α) =
2
∂α
(1 + αx)2
Therefore, the MLE is the solution to
n
X
i=1
Xi
= 0.
1 + α̂mle Xi
Can not compute the exact variance. We resort to approximate (asymptotic)
variance
V ar (α̂mle ) ≈
1
nI(α)
189
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Use the second derivatives to compute I(α)
I(α) = − E
=
=
=
When α
= 0, I(α) =
R1
Z
1
−1
Z 1
∂
log fX (x|α)
2
∂α
x2
1 + αx
dx
2
(1 + αx)
2
x2
dx
2(1 + αx)
−1
1+α
log 1−α
x2
dx
−1 2
2
− 2α
2α3
,
= 13 ,
which can also be obtained by taking limit of I(α).
α 6= 0
190
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The asymptotic relative efficiency of α̂m to α̂mle is
V ar(α̂mle )
=
V ar(α̂m )
log
2α3
3−α2
1+α
1−α −
2α
1
0.9
0.8
Efficiency
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
−1
−0.5
Why the efficiency is no larger than 1?
0
α
0.5
1
191
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Cramér-Rao Inequality
Theorem 8.7.A:
Let X1 , X2 , ..., Xn be i.i.d. with density function f (x; θ).
Let T be an unbiased estimate of θ . Then under smoothness assumption on
f (x; θ),
1
V ar(T ) ≥
nI(θ)
Thus, under reasonable assumptions, MLE is optimal or (asymptotically) optimal.
192
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Sec. 8.8: Sufficiency
Definition:
Let X1 , X2 , ..., Xn be i.i.d. samples with density f (x; θ). A
statistic T
= T (X1 , X2 , ..., Xn ) is said to be sufficient for θ if the conditional
distribution of X1 , X2 , ..., Xn , given T = t, does not depend on θ for any t.
In other words, given T , we can gain no more knowledge about θ .
193
Cornell University, BTRY 4090 / STSCI 4090
Example 8.8.A:
Spring 2010
Instructor: Ping Li
Let X1 , X2 , ..., Xn be a sequence of independent
Bernoulli random variables with P (Xi
= 1) = θ . Let T =
P (X1 = x1 , ..., Xn = xn |T = t) =
Pn
i=1
Xi .
P (X1 = x1 , ..., Xn = xn , T = t)
P (T = t)
θ t (1 − θ)n−t
= n t
n−t
t θ (1 − θ)
1
= n ,
t
which is independent of θ . Therefore, T
=
Pn
i=1
Xi is a sufficient statistic.
194
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Theorem 8.8.1.A: Factorization Theorem
A necessary and sufficient condition for T (X1 , ..., Xn ) to be sufficient for a
parameter θ is that the joint probability density (mass) function factors
f (x1 , x2 , .., xn ; θ) = g [T (x1 , x2 , ..., xn ), θ] h(x1 , x2 , ..., xn )
195
Cornell University, BTRY 4090 / STSCI 4090
Example 8.8.1.A:
Spring 2010
Instructor: Ping Li
X1 , X2 , ..., Xn are i.i.d. Bernoulli random variables with
success probability θ .
f (x1 , x2 , ..., xn ; θ) =
θ
Y
θ xi (1 − θ)1−xi
i=1
Pn
i=1 xi
Pn
i=1 xi
=θ
(1 − θ)
P ni=1 xi
θ
=
(1 − θ)n
1−θ
=g (T, θ) × h
n−
h(x1 , x2 , ..., xn ) = 1.
Pn
T (x1 , x2 , ..., xn ) = i=1 xi is the sufficient statistic.
T
θ
g(T, θ) = 1−θ
(1 − θ)n
196
Cornell University, BTRY 4090 / STSCI 4090
Example 8.8.1.B:
Spring 2010
Instructor: Ping Li
X1 , X2 , ..., Xn are i.i.d. normal N (µ, σ 2 ), both µ and
σ 2 are unknown.
n
Y
(x −µ)2
1
− i2σ2
√
f (x1 , x2 , ..., xn ; µ, σ ) =
e
2πσ
i=1
2
=
1
−
(xi −µ)2
i=1
2σ 2
Pn
e
(2π)n/2 σ n
Pn
−1 P n
2
1
x
−2µ
xi +nµ2 ]
[
i
2
i=1
i=1
2σ
e
=
(2π)n/2 σ n
Therefore,
Pn
2
i=1 xi and
Equivalently, we say T
Pn
i=1
xi are sufficient statistics.
= (X̄, S 2 ) is the sufficient statistic for normal with
unknown mean and variance.
197
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Proof of the Factorization Theorem (Discrete Case)
Theorem:
A necessary and sufficient condition for T (X1 , ..., Xn ) to be
sufficient for a parameter θ is that the joint probability mass function factors
P (X1 = x1 , .., Xn = xn ; θ) = g [T (x1 , ..., xn ), θ] h(x1 , ..., xn )
Proof of sufficient condition:
Assume
P (X1 = x1 , .., Xn = xn ; θ) = g [T (x1 , ..., xn ), θ] h(x1 , ..., xn )
Then the conditional distribution
P (X1 = x1 , ..., Xn = xn |T = t) =
P (X1 = x1 , ..., Xn = xn , T = t)
P (T = t)
198
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
But we assume P (X1 , .., Xn ) factors, i.e.,
P (T = t) =
X
P (X1 = x1 , ..., Xn = xn )
T (x1 ,...,xn )=t
=g(t, θ)
X
h(x1 , ..., xn )
T (x1 ,...,xn )=t
Note that t is constant. Thus, the conditional distribution
P (X1 = x1 , ..., Xn = xn , T = t)
P (X1 = x1 , ..., Xn = xn |T = t) =
P (T = t)
g(t, θ)h(x1 , ..., xn )
=P
T (x1 ,...,xn )=t g(t, θ)h(x1 , ..., xn )
h(x1 , ..., xn )
,
T (x1 ,...,xn )=t h(x1 , ..., xn )
which does not depend on θ .
=P
Therefore, T (X1 , ..., Xn ) is a sufficient statistic.
199
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Proof of necessary condition:
Instructor: Ping Li
Assume T (X1 , ..., Xn ) is sufficient. That
is, the conditional distribution (X1 , ..., Xn )|T does not depend on θ . Then
P (X1 = x1 , ..., Xn = xn ) =P (X1 = x1 , ..., Xn = xn |T = t)P (T = t)
=P (T = t)P (X1 = x1 , ..., Xn = xn |T = t)
=g(t, θ)h(x1 , ..., xn ),
where
h(x1 , ..., xn ) = P (X1 = x1 , ..., Xn = xn |T = t)
g(t, θ) = P (T = t)
Therefore, the probability mass function factors.
200
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Exponential Family
Definition:
Members of one parameter (θ ) exponential family have density
function (or frequency functions) of the form
f (x; θ) =



 exp [c(θ)T (x) + d(θ) + S(x)] if x ∈ A



0
otherwise
Where the set A does not depend on θ .
Many common distributions: normal, binomial, Poisson, gamma, are members of
this family.
201
Cornell University, BTRY 4090 / STSCI 4090
Example 8.8.C:
Spring 2010
Instructor: Ping Li
The frequency function of the Bernoulli distribution is
P (X = x) =θ x (1 − θ)1−x ,
x ∈ {0, 1}
θ
= exp x log
+ log(1 − θ)
1−θ
Therefore, this is a member of the exponential family, with
θ
c(θ) = log
1−θ
T (x) = x
d(θ) = log 1 − θ
S(x) = 0.
—————-
f (x; θ) = exp [c(θ)T (x) + d(θ) + S(x)].
202
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Sufficient statistics of exponential family
Suppose that X1 , X2 , ..., Xn is an i.i.d. sample from a member of the
exponential family. Then the joint probability is
n
Y
i=1
f (xi |θ) =
n
Y
exp [c(θ)T (xi ) + d(θ) + S(xi )]
i=1
"
= exp c(θ)
n
X
T (xi ) + nd(θ) exp
i=1
By the factorization theorem, we know
In the Bernoulli example,
#
Pn
i=1
Pn
i=1
T (xi ) =
"
n
X
i=1
S(xi )
#
T (xi ) is a sufficient statistic.
Pn
i=1
xi .
203
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The MLE of exponential family
If T (x) is a sufficient statistic for θ , then the MLE is a function of T .
Recall: if X
∼ N (µ, σ 2 ), then the MLE
n
1X
µ̂ =
xi
n i=1
n
1X
2
σ̂ =
(xi − µ̂)2
n i=1
We know that (
Pn
i=1
xi ,
Pn
i=1
x2i ) is sufficient statistic.
Note that normal is a member of the two-parameter exponential family.
204
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
k -parameter Exponential Family
Definition:
Members of k -parameter (θ ) exponential family have density
function (or frequency functions) of the form

f (x; θ) = exp 
k
X
j=1

cj (θ)Tj (x) + d(θ) + S(x)
205
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Normal Distribution and Exponential Family
Suppose X
∼ N (µ, σ 2 ). Then
2
1
1
1 2
µ
µ
2
2
√
f (x; µ, σ ) =
exp − log σ − 2 x + 2 x − 2
2
2σ
σ
2σ
2π
Does it really belong to a (2-dim) exponential family?
Well, suppose σ 2 is known, then it is clear that it does belong to a one-dim
exponential family.
f (x; θ) = exp [c(θ)T (x) + d(θ) + S(x)]
θ = µ,
T (x) = x,
µ2
d(θ) = − 2 ,
2σ
c(θ) =
µ
σ2
x2
1
1
2
S(x) = − 2 − log σ − log 2π
2σ
2
2
206
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
When σ 2 is unknown, then we need to re-parameterize the distribution by letting
µ
2
,
σ
= (θ1 , θ2 )
θ=
σ2
Then it belongs to a 2-dim exponential family
f (x; θ) = exp [c1 (θ)T1 (x) + c2 (θ)T2 (x) + d(θ) + S(x)]
µ
T1 (x) = x
c1 (θ) = 2 = θ1 ,
σ
1
1
c2 (θ) = − 2 = −
,
T2 (x) = x2
2σ
2θ2
1
µ2
1
θ12
2
d(θ) = − log σ − 2 = − log θ2 − θ2
2
2σ
2
2
1
S(x) = − log 2π
2
207
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Another Nice Property of Exponential Family
Suppose

f (x; θ) = exp 
k
X
j=1

cj (θ)Tj (x) + d(θ) + S(x)
Then
∂d(θ)
E (Ti (X)) = −
∂ci (θ)
Exercise: What about variances and covariances?
208
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Proof: Take derivatives on both sides of
R
Instructor: Ping Li
R
f dx =
R
∂ f dx
1, i.e., ∂ci (θ)
= 0.
∂ f dx
∂f
=
dx
∂ci (θ)
∂ci (θ)


Z
k
X
∂
exp 
cj (θ)Tj (x) + d(θ) + S(x) dx
=
∂ci (θ)
j=1


Z
k
X
∂d(θ)


dx
= exp
cj (θ)Tj (x) + d(θ) + S(x) Ti (x) +
∂c
(θ)
i
j=1
Z ∂d(θ)
= f Ti (x) +
dx
∂ci (θ)
Z
Therefore
E (Ti (X)) =
Z
f Ti (x)dx = −
Z
f
∂d(θ)
∂d(θ)
dx = −
∂ci (θ)
∂ci (θ)
209
Cornell University, BTRY 4090 / STSCI 4090
For example, X
Spring 2010
Instructor: Ping Li
∼ N (µ, σ 2 ) belongs to a 2-dim exponential family
µ
2
,
σ
θ = (θ1 , θ2 ) =
σ2
T1 (x) = x, T2 (x) = x2
Apply the previous result,
µ 2
∂d(θ)
E(T1 (x)) = E(x) = −
= − (−θ1 θ2 ) = 2 σ = µ
c1 (θ)
σ
as expected.
210
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Sec. 8.6: The Bayesian Approach to Parameter Estimation
The θ is the parameter to be estimated
The prior distribution
fΘ (θ).
The joint distribution
fX,Θ (x, θ) = fX|Θ (x|θ)fΘ (θ)
The marginal distribution
fX (x) =
Z
fX,Θ (x, θ)dθ =
Z
fX|Θ (x|θ)fΘ (θ)dθ
The posterior distribution
fX|Θ (x|θ)fΘ (θ)
fX,Θ (x, θ)
R
fΘ|X (θ|x) =
=
fX (x)
fX|Θ (x|u)fΘ (u)du
211
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Three main issues in Bayesian estimation
• Specify a prior (without looking at the data first).
• Calculate the posterior distribution, maybe computationally intensive.
• Choose appropriate estimators from the posterior distribution:
mean, median, mode, ...
212
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The Add-one Smoothing
Consider n + m trials having a common probability of success. Suppose,
however, that this success probability is not fixed in advance but is chosen from
U (0, 1).
Q: What is the conditional distribution of this success probability given that the
n + m trials result in n successes?
Solution:
Let X = trial success probability.
X ∼ U (0, 1).
Let N = total number of successes.
N |X = x ∼ Binomial(n + m, x).
213
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Solution:
Let X = trial success probability.
X ∼ U (0, 1).
Let N = total number of successes.
N |X = x ∼ Binomial(n + m, x).
P {N = n|X = x}fX (x)
P {N = n}
n+m n
m
x
(1
−
x)
= n
P {N = n}
∝ xn (1 − x)m
fX|N (x|n) =
Therefore, X|N
Here X
∼ Beta(n + 1, m + 1).
∼ U (0, 1) is the prior distribution.
214
Cornell University, BTRY 4090 / STSCI 4090
If X|N
Spring 2010
Instructor: Ping Li
∼ Beta(n + 1, m + 1), then
n+1
E(X|N ) =
(n + 1) + (m + 1)
Suppose we do not have a prior knowledge of the success probability X .
We observe n successes out of n + m trials.
The most intuitive estimate (in fact MLE) of X should be
X̂ =
n
n+m
Assuming a uniform prior on X leads to the add-one smoothing.
215
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Posterior distribution, assuming p ~ U(0,1)
10
m = 8, n = 0
m = 8, n = 2
m = 80, n = 20
9
8
7
PMF
6
5
4
3
2
1
0
0
0.2
0.4
0.6
p
Posterior distribution X|N
Posterior mean:
E(X) =
∼ Beta(n + 1, m + 1).
n+1
m+1 ,
n
Posterior mode (peak of density): m
.
0.8
1
216
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Estimating Binomial Parameter Under Beta Prior
X ∼ Bin(n, p).
p ∼ Beta(a, b).
Joint probability
n x
Γ(a + b) a−1
n−x
fX,P (x, p) =
p (1 − p)
p
(1 − p)b−1
x
Γ(a)Γ(b)
Γ(a + b) n x+a−1
n−x+b−1
p
(1 − p)
=
Γ(a)Γ(b) x
which is also a beta distribution
Beta(x + a, n − x + b).
Marginal distribution
fX (x) =
Z
1
fX,P (x, p)dp = g(n, x),
0
(very nice, why?)
217
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Therefor, the posterior distribution is also Beta, with parameters
(x + a, n − x + b). This is extremely convenient.
Moment estimator (using posterior mean)
x+a
x+a
=
p̂ = E(p|x) =
(x + a) + (n − x + b)
n+a+b
x
n
a
a+b
=
+
na+b+n a+ba+b+n
x
n:
the usual estimate without considering priors.
a
a+b :
the estimate when there are no data.
The add-one smoothing is a special case with a
What about the bias-variance trade-off??
= b = 1.
218
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The Bias-Variance Trade-off
Bayesian estimator (using posterior mean)
p̂ =
x+a
n+a+b
MLE
p̂M LE =
x
n
Assume p is fixed (conditional on p). Study the MSE ratio
MSE ratio
=
MSE(p̂)
MSE(p̂M LE )
We hope MSE ratio ≤ 1, especially when sample size n is reasonable.
219
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Asymptotic MSE ratio: when n is not too small
A
Asymptotic MSE ratio = 1 +
+O
n
We hope A
≤0
Exercise: Find A, which is a function of p, a, b.
1
n2
.
220
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
p = 0.5 , a = 1 , b = 1
1
0.9
0.8
MSE ratios
0.7
0.6
0.5
0.4
0.3
0.2
Exact MSE ratios
Asymptotic MSE ratios
0.1
0
0
20
40
60
n
80
100
221
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
p = 0.9 , a = 1 , b = 1
2
Exact MSE ratios
Asymptotic MSE ratios
MSE ratios
1.5
1
0.5
0
20
40
60
n
80
100
222
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Conjugate Priors
The prior distribution
fΘ (θ), belongs to family G.
The conditional distribution
fX|Θ (x|θ), belongs to family H .
The posterior distribution
fΘ|X (θ|x) =
fX|Θ (x|θ)fΘ (θ)
fX,Θ (x, θ)
= R
fX (x)
fX|Θ (x|u)fΘ (u)du
If the posterior distribution also belongs to G, then G is conjugate to H .
Conjugate priors were introduced mainly for the computational convenience.
223
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Examples of Conjugate priors:
Beta is conjugate to Binomial.
Gamma is conjugate to Poisson.
Dirichlet is conjugate to multinomial.
Gamma is conjugate to exponential.
Normal is conjugate to normal (with known variance).
Instructor: Ping Li
224
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Chapter 9: Testing Hypothesis
Suppose you have a coin which is possibly biased. You want to test whether the
coin is indeed biased (i.e., p
Suppose you observe k
6= 0.5), by tossing the coin n = 10 times.
= 8 heads (out of n = 10 tosses). It is reasonable to
guess that this coin is indeed biased. But how to make a precise statement?
Are n
= 10 tosses enough? How about n = 100? n = 1000? What is the
principled approach?
225
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Terminologies
Null hypothesis
H0 : p = 0.5
Alternative hypothesis
HA : p 6= 0.5
Type I error
Rejecting H0 when it is true
Significance level
P (Type I error) = P (Reject H0 |H0 ) = α
Type II error
Accepting H0 when it is false
P (Type II error) = P (Accept H0 |HA ) = β
Power
Goal:
1−β
Low α and high 1 − β .
226
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Let X1 , X2 , ..., Xn be an i.i.d. sample from a normal with
Example 9.2.A
known variance σ 2 and unknown mean µ. Consider two simple hypotheses:
H0 : µ = µ0
HA : µ = µ1
(µ1 > µ0 )
Under H0 , the null distribution likelihood is
f0 ∝
n
Y
i=1
exp −
"
1
1
2
(X
−
µ
)
=
exp
−
i
0
2σ 2
2σ 2
n
X
i=1
Under HA , the likelihood is
"
1
f1 ∝ exp − 2
2σ
Which hypothesis is more likely?
n
X
i=1
(Xi − µ1 )2
#
(Xi − µ0 )2
#
227
Cornell University, BTRY 4090 / STSCI 4090
Likelihood Ratio:
Because µ0
Spring 2010
Instructor: Ping Li
Small ratios =⇒ rejection. Sounds reasonable, but why?
1 Pn
2
f0 exp − 2σ2 i=1 (Xi − µ0 )
1 Pn
=
2
f1 exp − 2σ2 i=1 (Xi − µ1 )
"
#
n
1 X
2
2
= exp − 2
(Xi − µ0 ) − (Xi − µ1 )
2σ i=1
i
h n 2
2
2
X̄(µ
−
µ
)
+
µ
−
µ
= exp
0
1
1
0
2σ 2
− µ1 < 0 (by assumption), the likelihood is small if X̄ is large.
Suppose the significance level α
= 0.05. With how large X̄ can we reject H0 ?
Neyman-Pearson Lemma provides the answers.
228
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Neyman-Pearson Lemma
Suppose that H0 and HA are simple hypotheses and that the test that rejects
H0 whenever the likelihood ratio is less than c has significance level α. Then any
other test for which the significance level is ≤ α has power less than or equal to
that of the likelihood ratio test.
In other words, among all possible tests achieving significance level ≤
based on likelihood ratio maximizes the power.
α, the test
229
Cornell University, BTRY 4090 / STSCI 4090
Proof:
Let
Spring 2010
H0 : f (x) = f0 (x),
Instructor: Ping Li
HA : f (x) = fA (x)
Denote two tests

 0, if H is accepted
0
d(x) =
 1, if H0 is rejected

 0, if H is accepted
0
d∗ (x) =
 1, if H0 is rejected
The test d(x), based on the likelihood ratio, has a significance level α, i.e.,
d(x) = 1, whenever f0 (x) < cfA (x),
(c > 0)
Z
α = P (d(x) = 1| H0 ) = E(d(x)|H0 ) = d(x)f0 (x)dx
Assume the test d∗ (x) has smaller significance level, i.e.,
P (d∗ (x) = 1|H0 ) ≤ P (d(x) = 1|H0 ) = α
Z
=⇒ [d(x) − d∗ (x)] f0 (x)dx ≥ 0
230
Cornell University, BTRY 4090 / STSCI 4090
To show:
Spring 2010
Instructor: Ping Li
P (d∗ (x) = 1|HA ) ≤P (d(x) = 1|HA )
Equivalently, we need to show
Z
[d(x) − d∗ (x)] fA (x)dx≥0
We make use of a key inequality
d∗ (x) [cfA (x) − f0 (x)] ≤ d(x) [cfA (x) − f0 (x)]
d(x) = 1 whenever cfA (x) − f0 (x) > 0, and
d(x), d∗ (x) only take values in {0, 1}.
which is true because
More specifically, let M (x)
If M (x)
= cfA (x) − f0 (x).
> 0, then the right-hand-side of the inequality becomes M (x), but the
left-hand-side becomes M (x) (if d∗ (x) = 1) or 0 (if d∗ (x) = 0). Thus the
inequality holds, because M (x) > 0.
231
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
If M (x)
< 0, then the right-hand-side of the inequality becomes 0, but the
left-hand-side becomes M (x) (if d∗ (x) = 1) or 0 (if d∗ (x) = 0). Thus the
inequality also holds, because M (x) < 0.
Integrating both sides of the inequality yields
Z
Z
d∗ (x) [cfA (x) − f0 (x)] dx ≤ d(x) [cfA (x) − f0 (x)] dx
Z
Z
=⇒c [d(x) − d∗ (x)] fA dx ≥ [d(x) − d∗ (x)] f0 dx ≥ 0
232
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
f0
f1
Continue Example 9.2.A:
Instructor: Ping Li
≤ c =⇒ Reject H0 .
h n i
f0
2
2
= exp
2
X̄(µ
−
µ
)
+
µ
−
µ
≤c
0
1
1
0
f1
2σ 2
α = P (reject H0 |H0 ) = P (f0 ≤ cf1 |H0 )
Equivalently,
Reject H0 if
X̄ ≥ x0 ,
P (X̄ ≥ x0 |H0 ) = α.
2
Under H0 : X̄ ∼ N µ0 , σ /n
and
233
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
α =P (X̄ ≥ x0 |H0 )
X̄ − µ0
x −µ
√ > 0 √ 0
=P
σ/ n
σ/ n
x0 − µ0
√
=1 − Φ
σ/ n
σ
=⇒ x0 = µ0 + zα √
n
zα is the upper α point of the standard normal:
P (Z ≥ zα ) = α, where Z ∼ N (0, 1).
z0.05 = 1.645, z0.025 = 1.960
Therefore, the test rejects H0 if
X̄ ≥ µ0 + zα √σn .
—————
Q: What is β ? What is the power? Can we reduce both α and β when n is fixed?
234
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Uniformly Most Powerful Test
Neyman-Pearson Lemma requires that both hypotheses be simple. However,
most real-situations require composite hypothesis.
If the alternative H1 is composite, a test that is most powerful for every simple
alternative in H1 is uniformly most powerful (UMP).
235
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Continuing Example 9.2.A:
Instructor: Ping Li
Consider testing
H0 : µ = µ0
H1 : µ > µ0
For every µ1
> µ0 , the likelihood ratio test rejects H0 if X̄ ≥ x0 , where
σ
√
x0 = µ0 + zα
n
does not depend on µ1 .
Therefore, this test is most powerful for every µ1
> µ0 and hence it is UMP.
236
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Similarly, the test is UMP for testing (one-sided alternative)
H0 : µ < µ0
H1 : µ > µ0
However, the test is not UMP for testing (two-sided alternative)
H0 : µ = µ0
H1 : µ 6= µ0
Unfortunately, in typical composite situations, there is no UMP test.
Instructor: Ping Li
237
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
P-Value
Definition:
The p-value is the smallest significance level at which the null
hypothesis would be rejected.
The smaller the p-value, the stronger the evidence against the null hypothesis.
In a sense, calculating the p-value is more sensible than specifying (often
arbitrarily) the level of significance α.
238
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Confidence Intervals
Xn be an i.i.d. sample from a normal distribution
having unknown mean µ and known variance σ 2 . Consider testing
Example 9.3 A:
Let X1 , ...
H0 : µ = µ0
HA : µ 6= µ0
Consider a test that rejects H0 : for |X̄
− µ0 | ≥ x0 such that
P (|X̄ − µ0 | > x0 |H0 ) = α
Solve for x0 :
x0 =
√σ zα/2 .
n
239
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The test accepts H0 if
σ
σ
√
√
X̄ −
zα/2 ≤ µ0 ≤ X̄ +
zα/2
n
n
We say a 100(1 − α)% confidence interval for µ0 is
µ0 ∈
Duality:
σ
σ
X̄ − √ zα/2 , X̄ + √ zα/2
n
n
µ0 lies in the confidence interval for µ if and only if the hypothesis
test accepts. This result holds more generally.
240
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Duality of Confidence Intervals and Hypothesis Tests
Let θ be a parameter of a family of probability distributions.
random variables constituting the data by X.
Theorem 9.3.A:
Suppose that for every value θ0
θ ∈ Θ. Denote the
∈ Θ there is a test at level
α of the hypothesis: H0 : θ = θ0 . Denote that acceptance region of the test by
A(θ0 ). Then the set
C(X) = {θ : X ∈ A(θ)}
is a 100(1 − α)% confidence region for θ .
241
Cornell University, BTRY 4090 / STSCI 4090
Proof:
Spring 2010
Need to show
P [θ0 ∈ C(X)|θ = θ0 ] = 1 − α
By the definition of C(X), we know
P [θ0 ∈ C(X)|θ = θ0 ] = P [X ∈ A(θ0 )|θ = θ0 ]
By the definition of level of significance, we know
P [X ∈ A(θ0 )|θ = θ0 ] = 1 − α.
This completes the proof.
Instructor: Ping Li
242
Cornell University, BTRY 4090 / STSCI 4090
Theorem 9.3.B:
Spring 2010
Instructor: Ping Li
Suppose that C(X) is 100(1 − α)% confidence region for
θ ; that is, for every θ0 ,
P [θ0 ∈ C(X)|θ = θ0 ] = 1 − α
Then an acceptance region for a test at level α of H0
: θ = θ0 is
A(θ0 ) = {X|θ0 ∈ C(X)}
Proof:
P [X ∈ A(θ0 )|θ = θ0 ] = P [θ0 ∈ C(X)|θ = θ0 ] = 1 − α
243
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Generalized Likelihood Ratio Test
Likelihood ratio test:
A simple hypothesis versus a simple hypothesis. Optimal. Very limited use.
Generalized likelihood ratio test:
Composite hypotheses. Sub-optimal and widely-used.
Play the same role as MLE in parameter estimation.
244
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Assume a sample X1 , ... ,Xn from a distribution with unknown parameter θ .
H0 : θ ∈ ω0
HA : θ ∈ ω1
Let Ω
= ω0 ∪ ω1 . The test statistic
max
Λ=
θ∈ω0
lik(θ)
max lik(θ)
θ∈Ω
Reject H0 if Λ
≤ λ0 , such that
P (Λ ≤ λ0 |H0 ) = α
245
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Let X1 , ..., Xn be i.i.d. and
Example 9.4.A: Testing a Normal Mean
normally distributed with mean µ and known variance σ 2 . Test
H0 : µ = µ0
HA : µ 6= µ0
In other words, ω0
= {µ0 }, Ω = {−∞ < µ < ∞}.
max
θ∈ω0
− 2σ12
lik(µ) = √
n e
2πσ
max lik(µ)
θ∈Ω
1
1
− 2σ12
= √
n e
2πσ
Pn
i=1 (Xi −µ0 )
Pn
i=1 (Xi −X̄)
2
2
246
Cornell University, BTRY 4090 / STSCI 4090
max
lik(θ)
Spring 2010
Instructor: Ping Li
" n
#)
n
X
1 X
θ∈ω0
2
(Xi − µ0 ) −
(Xi − X̄)2
Λ=
= exp − 2
max lik(θ)
2σ i=1
i=1
θ∈Ω
(
" n
#)
1 X
= exp − 2
(X̄ − µ0 )(2Xi − µ0 − X̄)
2σ i=1
1
= exp − 2 n(X̄ − µ0 )2
2σ
(
(X̄ − µ0 )2
−2 log Λ =
σ 2 /n
Because under H0 , ∼
N (µ0 , σ 2 /n), we know, under H0 ,
−2 log Λ|H0 ∼ χ21
247
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The test rejects H0
(X̄ − µ0 )2
2
>
χ
1,α
σ 2 /n
χ21,0.05 = 3.841.
Equivalently, the test rejects H0 if
X̄ − µ0 ≥ zα/2 √σ
n
—————–
In this case, we know the sample null distribution exactly. When the sample
distribution is unknown (or not in a convenient form), we resort to the
approximation by central limit theorem.
248
Cornell University, BTRY 4090 / STSCI 4090
Theorem 9.4.A:
Spring 2010
Instructor: Ping Li
Under some smoothness conditions on the probability
density of mass functions, the null distribution of −2 log Λ tends to a chi-square
distribution with degrees of freedom equal to dimΩ− dimω0 , as the sample size
tends to infinity.
dimΩ = number of free parameters under Ω
dimω0 = number of free parameters under ω0 .
In Example 9.4.A, the null hypothesis specifies µ and σ 2 and hence there are no
free parameters under H0 , i.e., dimω0
= 0.
Under Ω, σ 2 is known (fixed) but µ is free, so dimΩ
= 1.
249
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Generalized Likelihood Ratio Tests for Multinomial Distribution
Goodness of fit:
Assume the multinomial probabilities pi are specified by
H0 : p = p(θ),
θ ∈ ω0
where θ is a (vector of) parameter(s) to be estimated.
We need to know whether the model p(θ) is good or not, according to the
observed data (cell counts).
We also need an alternative hypothesis. A common choice of Ω would be
Ω = {pi , i = 1, 2, ..., m|pi ≥ 0,
m
X
i=1
pi = 1}
250
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
max
Λ=
p∈ω0
Instructor: Ping Li
lik(p)
max lik(p)
p∈Ω
n
x1
xm
p
(
θ̂)
...p
(
θ̂)
1
m
x ,x ,...,xm
x1 xm
= 1 2
n
x1 ,x2 ,...,xm p̂1 ...p̂m
!xi
m
Y pi (θ̂)
=
i=1
θ̂ : the MLE under ω0
Λ=
m
Y
i=1
pi (θ̂)
p̂i
!np̂i
p̂i
p̂i =
,
xi
n :
the MLE under Ω.
−2 log Λ = −2n
m
X
i=1
p̂i log
pi (θ̂)
p̂i
!
251
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
−2 log Λ = − 2n
=2
m
X
m
X
p̂i log
i=1
np̂i log
i=1
m
X
!
pi (θ̂)
p̂i
!
np̂i
npi (θ̂)
Oi
=2
Oi log
Ei
i=1
Oi = np̂i = xi : the observed counts,
Ei = npi (θ̂) : the expected counts
−2 log Λ is asymptotically χ2s .
The degrees of freedom
s = dimΩ − dimω0 = (m − 1) − k .
k = length of the vector θ = number of parameters in the model.
252
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
G2 Test Versus X 2 Test
Generalized likelihood ratio test
G2 = −2 log Λ =2
m
X
np̂i
np̂i log
npi (θ̂)
i=1
!
=2
m
X
Oi log
i=1
Pearson’s Chi-square test
X2 =
h
m
xi − npi (θ̂)
X
i=1
npi (θ̂)
i2
G2 and X 2 are asymptotically equivalent.
=
m
2
X
[Oi − Ei ]
i=1
Ei
Oi
Ei
253
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
By Taylor expansions, about x
x log
Instructor: Ping Li
≈ x0 ,
x
x − x0 + x0
x − x0
=x log
= x log 1 +
x0
x0
x0
2
x − x0
(x − x0 )
=x
−
+ ...
2
x0
2x0
2
x − x0
(x − x0 )
= (x − x0 + x0 )
−
+ ...
2
x0
2x0
(x − x0 )2
=(x − x0 ) +
+ ...
2x0
254
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Under H0 , we expect np̂i
= xi ≈ npi (θ̂). Thus
!
m
X
np̂i
2
G =2
np̂i log
npi (θ̂)
i=1
"
#
m
2
X
(np̂i − npi (θ̂))
(np̂i − npi (θ̂)) +
=2
+ ...
2npi (θ̂)
i=1
≈
m
X
(np̂i − npi (θ̂))2
i=1
npi (θ̂)
= X2
It appears G2 test should be “more accurate,” but X 2 is actually more frequently
used.
255
Cornell University, BTRY 4090 / STSCI 4090
Example 9.5.A:
Spring 2010
Instructor: Ping Li
The Hardy-Weinberg equilibrium model assumes the cell
probabilities are
(1 − θ)2 ,
2θ(1 − θ),
θ2
The observed counts are 342, 500, and 187, respectively (total n
Using MLE, we estimate θ̂
=
2x3 +x2
2n
= 1029).
= 0.4246842.
The expected (estimated) counts are 340.6, 502.8, and 185.6, respectively.
G2 = 0.032499, X 2 = 0.0325041 (slightly different numbers in the Book)
Both G2 and X 2 are asymptotically χ2s where
s = (m − 1) − k = (3 − 1) − 1 = 1.
256
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
G2 = 0.032499, X 2 = 0.0325041, both asymptotically χ21 .
p-values
For G2 , p-value = 0.85694.
For X 2 , p-value = 0.85682
Very large p-values indicate that we should not reject H0 .
In other words, the model is very good.
Suppose we do want to reject H0 , we must use a significance level α
≥ 0.86.
257
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The Poisson Dispersion Test
Assume X
∼ P oi(λ), then E(X) = λ and V ar(X) = λ.
However, for many real data, the variance may considerably exceed the mean.
Over-dispersion is often caused by subject heterogeneity, which may require a
more flexible model to explain the data
Given counts x1 , ..., xn , consider
ω0 : xi ∼ P oi(λ), i = 1, 2, ..., n
Ω : xi ∼ P oi(λi ), i = 1, 2, ..., n
258
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Given counts x1 , ..., xn , consider
ω0 : xi ∼ P oi(λ), i = 1, 2, ..., n
Ω : xi ∼ P oi(λi ), i = 1, 2, ..., n
Under ω0 , the MLE is λ
Λ=
= x̄.
max
lik(λ)
max
lik(λi )
λ∈ω0
λi ∈Ω
Qn
xi −λ̂
/xi !
i=1 λ̂ e
= Qn
xi −λ̂i
λ̂
/xi !
i=1 i e
Qn
xi
n xi −x̄
Y
x̄ e /xi !
x̄
= Qni=1 xi −x
=
exi −x̄
i /x !
xi
i
i=1 xi e
i=1
n
X
xi
−2 log Λ = 2
xi log
∼ χ2n−1
x̄
i=1
(asymptotically)
259
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Tests for Normality
If X
∼ N (µ, σ 2 ), then
• The density function is symmetric about µ, with coefficient of skewness
b1 = 0, where
E(X − µ)3
b1 =
σ3
• The coefficient of kurtosis b2 = 3, where
E(X − µ)4
b2 =
σ4
These provide two simple tests for normality (among many tests).
260
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Two simple tests for normality
• Reject if the empirical coefficient of skewness |b̂1 | is large, where
Pn
1
3
(X
−
X̄)
i
i=1
b̂1 = nP
3/2
n
1
2
i=1 (Xi − X̄)
n
• Reject if the empirical coefficient of kurtosis |b̂2 − 3| is large, where
Pn
1
4
(X
−
X̄)
i
b̂2 = n Pni=1
2
1
2
i=1 (Xi − X̄)
n
Difficulty: The distributions of b̂1 and b̂2 have no closed-forms and one must
resort to a numerical procedure.
261
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Chapter 11: Comparing Two Samples
• Comparing two independent samples
For example, a sample X1 , ... , Xn is drawn from N (µX , σ 2 ); and an
independent sample Y1 , ..., Ym is drawn from N (µY , σ 2 ).
H0 : µX = µY
HA : µY 6= µY
• Comparing paired samples
For example, we observe pairs (Xi , Yi ), i
= 1 to n. We would like to test
the difference X and Y .
Pairing causes samples to be dependent, i.e., Cov(Xi , Yi )
= σXY .
262
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Section 11.2: Comparing Two Independent Samples
Example: In a medical study, a sample of subjects may be assigned to a
particular treatment, and another independent sample may be assigned to a
control treatment.
• Methods based on the normal distribution
• The analysis of power
263
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Methods Based on the Normal Distribution
A sample X1 , ... , Xn is drawn from N (µX , σ 2 );
An independent sample Y1 , ..., Ym is drawn from N (µY , σ 2 ).
The goal is to study the difference µX
− µY from the observations.
By the independence assumption,
X̄ − Ȳ ∼ N µX − µY , σ 2
Two scenarios:
• σ 2 is known.
• σ 2 is unknown.
1
1
+
n m
.
264
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Two Independent Normal Samples with Known Variance
X̄ − Ȳ ∼ N µX − µY , σ 2
1
1
+
n m
Assume σ 2 is known. Then
(X̄ − Ȳ ) − (µX − µY )
q
Z=
∼ N (0, 1)
1
σ n1 + m
The 100(1 − α)% confidence interval of is
(X̄ − Ȳ ) ± zα/2 σ
r
1
1
+
n m
However, σ 2 in general must be estimated from the data.
.
265
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Two Independent Normal Samples with Unknown Variance
The pooled sample variance
s2p
(n − 1)s2X + (m − 1)s2Y
=
m+n−2
is an estimate of the common variance σ 2 , where
n
s2X
1 X
=
(Xi − X̄)2
n − 1 i=1
m
1 X
2
sY =
(Yi − Ȳ )2
m − 1 i=1
are the sample variances of the X ’s and Y ’s.
s2p is the weighted average of s2X and s2Y .
266
Cornell University, BTRY 4090 / STSCI 4090
Theorem 11.2.A:
Spring 2010
Instructor: Ping Li
The test statistic
t=
(X̄ − Ȳ ) − (µX − µY )
q
∼ tm+n−2
1
sp n1 + m
a t distribution with m + n − 2 degrees of freedom.
Proof:
Recall in Chapter 6, if V
independent, then
√U
V /n
∼ tn .
∼ χ2n , U ∼ N (0, 1), and U and V are
267
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
s2p (m + n − 2)
(n − 1)s2X + (m − 1)s2Y
2
=
∼
χ
m+n−2
σ2
σ2
Let
Then
(X̄ − Ȳ ) − (µX − µY )
q
U=
∼ N (0, 1)
1
σ n1 + m
U
q
∼ tm+n−2
s2p /σ 2
That is,
U
=
2
2
sp /σ
(X̄−Ȳ )−(µX −µY )
σ
√1
1
n+m
sp /σ
(X̄ − Ȳ ) − (µX − µY )
q
=
∼ tm+n−2
1
sp n1 + m
268
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Three Types of Hypothesis Testing
The null hypothesis:
H0 : µX = µY
Three common alternative hypotheses
H1 : µX 6= µY
H2 : µX > µY
H3 : µX < µY
H1 is a two-sided alternative
H2 and H3 are one-sided alternatives
Instructor: Ping Li
269
Cornell University, BTRY 4090 / STSCI 4090
Using the test statistic t
=
Spring 2010
sp
X̄−Ȳ
√
1
1
n+m
Instructor: Ping Li
, the rejection regions are
For
H1 : |t| > tn+m−2,α/2
For
H2 : t > tn+m−2,α
For
H3 : t < −tn+m−2,α
Pay attention to the p-value calculation for H1 .
270
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The Equivalence between t-test and Likelihood Ratio Test
H0 : µX = µY ,
H1 : µX 6= µY .
Three parameters:
2
θ = µX , µY , σ .
max
Λ=
θ∈ω0
lik(µX , µY , σ 2 )
max lik(µX , µY , σ 2 )
θ∈Ω
We can show rejecting small Λ (i.e., rejecting large −2 log Λ) is equivalent to
rejecting large |t|
=
|X̄−Ȳ |
sp
√1
1
n+m
.
271
Cornell University, BTRY 4090 / STSCI 4090
Three parameters:
Spring 2010
Instructor: Ping Li
2
θ = µX , µY , σ .
ω0 = {µX = µY = µ0 , 0 < σ = σ0 < ∞}
Ω = {−∞ < µX , µY < ∞, 0 < σ < ∞}
lik(µX , µY , σ 2 )
m
n
2 Y
2
Y
1
(Xi − µX )
1
(Yi − µY )
√
√
=
exp −
exp
−
2
2
2σ
2σ
2πσ
2πσ
i=1
i=1
m+n
m+n
log 2π −
log σ 2
2"
2
#
n
m
X
1 X
2
− 2
(Xi − µX ) +
(Yi − µY )2
2σ i=1
i=1
l(µX , µY , σ 2 ) = −
272
Cornell University, BTRY 4090 / STSCI 4090
Under ω0
Spring 2010
Instructor: Ping Li
= {µX = µY = µ0 , 0 < σ = σ0 < ∞}
m+n
m+n
log 2π −
log σ02
2"
2
#
n
m
X
1 X
2
− 2
(Xi − µ0 ) +
(Yi − µ0 )2
2σ0 i=1
i=1
l(µ0 , σ02 ) = −
∼ N (µ0 , σ02 ) and Yi ∼ N (µ0 , σ02 ), Xi and Yi are
independent, we have m + n samples in N (µ0 , σ02 ).
In fact, since both Xi
Therefore, the MLEs are
1
µ̂0 =
m+n
σ̂02
1
=
m+n
"
n
X
m
X
#
n
m
Xi +
Yi =
X̄ +
Ȳ
m
+
n
m
+
n
i=1
i=1
" n
#
m
X
X
2
(Xi − µ̂0 ) +
(Yi − µ̂0 )2
i=1
i=1
273
Cornell University, BTRY 4090 / STSCI 4090
Thus, under the null ω0
Instructor: Ping Li
= {µX = µY = µ0 , 0 < σ = σ0 < ∞}
l(µ̂0 , σ̂02 ) = −
Under Ω
Spring 2010
m+n
m+n
m+n
log 2π −
log σ̂02 −
2
2
2
= {−∞ < µX , µY < ∞, 0 < σ < ∞}. We can show
µ̂X = X̄,
σ̂ 2 =
1
m+n
µ̂Y = Ȳ
" n
#
m
X
X
2
(Xi − µ̂X ) +
(Yi − µ̂Y )2
i=1
i=1
m+n
m+n
m+n
2
l(µ̂X , µ̂Y , σ̂ ) = −
log 2π −
log σ̂ −
2
2
2
2
274
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The negative log likelihood ratio is
2
m
+
n
σ̂
log 02
− l(µ̂0 , σ̂02 ) − l(µ̂X , µ̂Y , σ̂ 2 ) =
2
σ̂
σ̂02
Therefore, the test rejects for large values of σ̂2 .
Pn
Pn
2
(X
−
µ̂
)
+
(Yi − µ̂0 )2
σ̂02
i
0
i=1
i=1
Pn
= Pn
2
2
σ̂ 2
i=1 (Xi − X̄) +
i=1 (Yi − Ȳ )
mn
(X̄ − Ȳ )2
Pn
Pn
=1 +
2
m + n i=1 (Xi − X̄) + i=1 (Yi − Ȳ )2
Equivalently, the test rejects for large values of
|X̄ − Ȳ |
qP
Pn
n
2
2
i=1 (Xi − X̄) +
i=1 (Yi − Ȳ )
which is the t statistic.
275
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Power Analysis of Two-Sample t Test
Recall power = 1 - Type II error = P (reject H0 |HA ).
To compute the power, we must specify a simple alternative hypothesis. We
consider
H0 : µX − µY = 0
H1 : µX − µY = ∆.
For simplicity, we assume σ 2 is known and n
The t test rejects if
|X̄ − Ȳ | > zα/2 σ
q
= m.
2
n.
276
Cornell University, BTRY 4090 / STSCI 4090
power =P
=P
Spring 2010
Instructor: Ping Li
!
r
2
|X̄ − Ȳ | > zα/2 σ
|H1
n
!
r
2
X̄ − Ȳ > zα/2 σ
|H1 + P
n
X̄ − Ȳ < −zα/2 σ
2σ 2
Note that X̄ − Ȳ |H1 ∼ N µX − µY = ∆, n . Therefore,
r
!
2
|H1
n
q


2
zα/2 σ n − ∆
X̄
−
Ȳ
−
∆
p
p
=P 
>
|H1 
σ 2/n
σ 2/n
p
∆
=1 − Φ zα/2 −
n/2
σ
P
X̄ − Ȳ > zα/2 σ
277
r
2
|H1
n
!
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Therefore, the power can be computed from
∆p
∆p
power =1 − Φ zα/2 −
n/2 + Φ −zα/2 −
n/2
σ
σ
′
′
=1 − Φ zα/2 − ∆ + Φ −zα/2 − ∆
p
∆
′
where ∆ = σ n/2.
Three parameters, α, ∆, and n, affect the power.
• Larger α =⇒ smaller zα/2 =⇒ larger power.
• Larger |∆′ | =⇒ larger power.
• Larger |∆| =⇒ larger power.
• Larger n =⇒ larger power.
• Smaller σ =⇒ larger power.
What is the relation between α and power if ∆
= 0?
278
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Section 11.3: Comparing Paired Samples
In many cases, the samples are paired (and dependent),
for example, measurements before and after medical treatments.
Consider
(Xi , Yi ), i = 1, 2, ...n
(Xi , Yi ) is independent of (Xj , Yj ), if i 6= j
E(Xi ) = µX ,
2
V ar(Xi ) = σX
,
E(Yi ) = µY
V ar(Yi ) = σY2
279
Cornell University, BTRY 4090 / STSCI 4090
Let Di
Spring 2010
= Xi − Yi , and D̄ =
1
n
Instructor: Ping Li
Pn
i=1
Di . Then,
E(D̄) = µX − µY ,
1 2
2
V ar(D̄) =
σ + σY − 2ρσX σY
n X
Therefore, D̄ is still an unbiased estimator of µX
− µY , but it has smaller
variance if there exists positive correlation (ρ > 0).
280
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Paired Test Based on the Normal Distribution
This methods assume that Di
= Xi − Yi is i.i.d. normal with
E(Di ) = µD ,
2
V ar(Di ) = σD
2
In general, σD
needs to be estimated from the data.
Consider a two-sided test
H0 : µD = 0,
A t-test rejects for large values of |t|, where t
The rejection region is D̄
> tn−1,α/2 sD̄ .
HA : µD 6= 0
=
D̄−µD
sD̄ .
281
Cornell University, BTRY 4090 / STSCI 4090
Example 11.3.1.A:
Spring 2010
Instructor: Ping Li
Effect of cigarette smoking on platelet aggregation.
Before (X )
After (Y )
Difference (D )
25
27
2
25
29
4
27
37
10
44
56
12
30
46
16
67
82
15
53
57
4
53
80
27
52
61
9
60
59
-1
28
43
15
282
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
D̄ = 10.272
r
63.6182
sD̄ =
= 2.405
11
ρ = 0.8938
D̄
10.272
H0 : t =
=
= 4.271.
sD̄
2.405
Suppose α
= 0.01.
tα/2,n−1 = t0.005,10 = 3.169 < t.
Therefore, the test rejects H0 at significance level α
Alternatively, we say the p-value is smaller than 0.01.
= 0.01.
283
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
A Heuristic Explanation on GLRT
Why, under H0 , the test statistic
max
Λ=
θ∈ω0
lik(θ)
max lik(θ)
θ∈Ω
satisfies
−2 log Λ → χ2s ,
as n
→ ∞??
The heuristic argument
• Only considers s = 1.
• Utilizes Taylor expansion.
• Uses the fact that the MLE is asymptotically normal.
284
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Since s
= 1, we consider H0 : θ = θ0 .
Let l(θ)
= log lik(θ) and θ̂ be the MLE of θ ∈ Ω.
h
i
−2 log Λ = −2 l(θ0 ) − l(θ̂)
Applying Taylor expansion
(θ0 − θ̂)2 ′′
l(θ0 ) = l(θ̂) + (θ0 − θ̂)l (θ̂) +
l (θ̂) + ...
2
′
Because θ̂ is the MLE, we know l′ (θ̂)
= 0. Therefore,
h
i
−2 log Λ = −2 l(θ0 ) − l(θ̂) = −l′′ (θ̂)(θ0 − θ̂)2 + ...
285
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
The MLE is asymptotically normal, i.e., as n
θ̂ − θ0
q
Because nI(θ)
1
nI(θ)
Instructor: Ping Li
→ ∞,
p
= θ̂ − θ0
nI(θ) → N (0, 1)
= −E(l′′ (θ)), we can (heuristically) write, as n → ∞,
−2 log Λ = − l′′ (θ̂)(θ0 − θ̂)2
p
i2
h
≈ θ̂ − θ0
nI(θ)
→χ21
286
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Chapter 14: Linear Lease Squares
Materials:
• The basic procedure: Observe (xi , yi ). Assume y = β0 + β1 x.
Estimate β0 , β1 by minimizing
X
(yi − β0 − β1 xi )2
• Statistical analysis of linear square estimates
Assume y = β0 + β1 x + e, e ∼ N (0, σ 2 ), and x is constant.
What are the statistical properties of β0 and β1 , which are estimated by the
least square procedure?
• Matrix approach to multiple least squares
• Conditional expectation and best linear estimator
for better understanding of the basic procedure.
If X and Y are jointly normal, then linear regression is the best under MSE.
287
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Linear Lease Squares: The Basic Procedure
The basic procedure is to fit a straight line to a plot of points (xi , yi ),
y = β0 + β1 x
by minimizing
L(β0 , β1 ) =
n
X
i=1
(yi − β0 − β1 xi )2 ,
i.e., solving for β0 and β1 from
∂L(β0 , β1 )
=0
∂β0
∂L(β0 , β1 )
=0
∂β1
288
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Taking the first derivatives,
n
∂L(β0 , β1 ) X
=
2(yi − β0 − β1 xi )(−1)
∂β0
i=1
n
∂L(β0 , β1 ) X
=
2(yi − β0 − β1 xi )(−xi )
∂β1
i=1
Setting them to zero =⇒
β̂0 =ȳ − x̄βˆ1
Pn
Pn
i=1 xi yi − ȳP i=1 xi
P
β̂1 =
n
n
2 − x̄
x
i=1 i
i=1 xi
Instructor: Ping Li
289
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Statistical Properties of β̂0 and β̂1
Model:
yi = β0 + β1 xi + ei ,
i = 1, 2, ..., n
ei ∼ N (0, σ 2 ), i.i.d.
xi ’s are constants. The randomness of yi ’s is due to ei .
The coefficients β0 and β1 are estimated by least squares.
Q:
Under this model, what are E(β̂0 ), V ar(β̂0 ), E(β̂1 ), V ar(β̂1 ), etc.?
290
Cornell University, BTRY 4090 / STSCI 4090
According to the model:
Spring 2010
yi = β0 + β1 xi + ei , ei ∼ N (0, σ 2 ),
E(yi ) = β0 + β1 xi
E(ȳ) = β0 + β1 x̄
V ar(yi ) = σ 2
Cov(yi , yj ) = 0, if i 6= j
Therefore,
E(β̂0 ) = E(ȳ − x̄β̂1 ) = β0 + β1 x̄ − x̄E(β̂1 )
i.e., E(β̂0 )
Instructor: Ping Li
= β0 iff E(β̂1 ) = β1
291
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Pn
Pn
xi yi − ȳ i=1 xi
Pn
E(β̂1 ) =E Pi=1
n
2
i=1 xi
i=1 xi − x̄
Pn
Pn
xi E(yi ) − E(ȳ) i=1 xi
i=1
P
Pn
=
n
2
i=1 xi − x̄
i=1 xi
Pn
Pn
xi
0 + β1 xi ) − (β0 + β1 x̄)
i=1 xi (βP
i=1
Pn
=
n
2 − x̄
x
i=1 xi
i=1 i
Pn
Pn
2
β1
x − β1 x̄ i=1 xi
Pn
= Pni=1 2i
i=1 xi − x̄
i=1 xi
=β1
Theorem 14.2.A:
Unbiasedness
E(β̂0 ) = β0 ,
E(β̂1 ) = β1
292
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Another way to express β̂1 :
Pn
Pn
x
y
−
ȳ
xi
i i
i=1
i=1
Pn
β̂1 = Pn
2
i=1 xi − x̄
i=1 xi
Pn
(x − x̄)(yi − ȳ)
Pn i
= i=1
2
i=1 (xi − x̄)
Pn
(xi − x̄)yi
= Pi=1
n
2
i=1 (xi − x̄)
Note that
n
X
i=1
(xi − x̄) = 0,
n
X
i=1
(yi − ȳ) = 0.
293
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Theorem 14.2.B:
V ar(β̂1 ) =
Pn
2
(x
−
x̄)
V ar(yi )
i
i=1
Pn
2
[ i=1 (xi − x̄)2 ]
σ2
= Pn
2
i=1 (xi − x̄)
Exercises
V ar(β̂0 ) =
Pn
σ2
2
x
i
i=1
Pnn
,
2
(x
−
x̄)
i=1 i
−σ 2 x̄
Cov(β̂0 , β̂1 ) = Pn
.
2
(x
−
x̄)
i=1 i
294
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Residual Sum of Squares (RSS)
Definition
RSS =
n
X
i=1
(yi − β̂0 − β̂1 xi )2
We can show
E(RSS) = (n − 2)σ 2
In other words,
RSS
s =
n−2
2
is an unbiased estimator of σ 2 .
295
Cornell University, BTRY 4090 / STSCI 4090
E(RSS) = E
Spring 2010
" n
X
i=1
=E
" n
X
i=1
=E
" n
X
i=1
Instructor: Ping Li
(yi − β̂0 − β̂1 xi )2
#
(β0 + β1 xi + ei − β̂0 − β̂1 xi )2
(β0 − β̂0 + (β1 − β̂1 )xi + ei )2
=nV ar(β̂0 ) + V ar(β̂1 )
n
X
#
#
x2i + nσ 2 + 2Cov(β̂0 , β̂1 )
i=1
+ 2E
" n
X
i=1
ei β0 − β̂0 + (β1 − β̂1 )xi
=(n + 2)σ 2 + 2E
"
n
X
i=1
n
X
i=1
#
ei β0 − β̂0 + (β1 − β̂1 )xi
#
xi
296
Cornell University, BTRY 4090 / STSCI 4090
E
"
=E
"
=E
"
=E
"
"
n
X
i=1
n
X
i=1
n
X
i=1
n
X
i=1
=E β̂1
Spring 2010
ei β0 − β̂0 + (β1 − β̂1 )xi
Instructor: Ping Li
#
ei β0 − ȳ + x̄β̂1 + (β1 − β̂1 )xi
#
ei β0 − β0 − x̄β1 − ē + x̄β̂1 + (β1 − β̂1 )xi
ei −x̄β1 + x̄β̂1 + (β1 − β̂1 )xi
n
X
i=1
#
ei (x̄ − xi ) − σ 2
#
− σ2
#
297
Cornell University, BTRY 4090 / STSCI 4090
"
E β̂1
n
X
i=1
Spring 2010
ei (x̄ − xi )
Instructor: Ping Li
#
n
n
X
(x
−
x̄)y
i
i
Pi=1
ei
n
2
i=1 (xi − x̄) i=1
=E
"P
=E
"P
=E
Pn
= − σ2
(x̄ − xi )
n
i − x̄)(β0 + β1 xi
i=1 (xP
n
2
i=1 (xi − x̄)
2
(x
−
x̄)(x̄
−
x
)e
i
i
i
i=1
Pn
2
i=1 (xi − x̄)
+ ei )
#
n
X
i=1
#
ei (x̄ − xi )
Therefore,
E(RSS) = (n + 2)σ 2 + 2(−2σ 2 ) = (n − 2)σ 2
298
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
The Distributions β̂0 and β̂1
Model:
yi = β0 + β1 xi + ei , ei ∼ N (0, σ 2 ),
yi ∼ N (β0 + β1 xi , σ 2 )
β̂1 =
n
X
i=1
ci yi ∼ N (β1 , V ar(β̂1 ))
β̂0 = ȳ − x̄β̂1 ∼ N (β0 , V ar(β̂0 ))
s2
RSS
2
(n
−
2)
=
∼
χ
n−2
σ2
σ2
β̂0 − β0
β̂1 − β1
∼ tn−2 ,
∼ tn−2
sβ̂0
sβ̂1
What if ei is not normal? Central limit theorem and normal approximation.
299
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Hypothesis Testing
Once we know the distributions
β̂0 − β0
∼ tn−2 ,
sβ̂0
β̂1 − β1
∼ tn−2
sβ̂1
we can conduct hypothesis test, for example,
H0 : β1 = 0
HA : β1 6= 0
300
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Multiple Least Squares
Model:
yi = β0 + β1 xi,1 + ... + βp−1 xi,p−1 + ei ,
Observations:
ei ∼ N (0, σ 2 ) i.i.d.
(xi , yi ), i = 1 to n.
Multiple least squares: Estimate βj by minimizing the MSE
L(βj , j = 0, 1, ..., p − 1) =
n
X
i=1
(yi − β0 −
p−1
X
j=1
xi,j βj )2
301
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Matrix Approach to Linear Least Square





X=




1
x1,1
x1,2
...
x1,p−1
1
x2,1
x2,2
...
x2,p−1
1
x3,1
x3,2
...
x3,p−1
..
.
..
.
..
.
..
.
..
.
xn,2
...
xn,p−1
1 xn,1
L(β) =
n
X
i=1
(yi − β0 −
p−1
X
j=1





,









β=




β0
β1
β2
..
.
βp−1
2
xi,j βj )2 = kY − Xβk










302
Cornell University, BTRY 4090 / STSCI 4090
L(β) =
n
X
i=1
Spring 2010
(yi − β0 −
Instructor: Ping Li
p−1
X
j=1
2
xi,j βj )2 = kY − Xβk
Matrix/vector derivative
∂L(β)
=2(−XT ) (Y − Xβ)
∂β
T
T
= − 2 X Y − X Xβ = 0
=⇒
XT Xβ = XT Y
T −1 T
=⇒β̂ = X X
XY
303
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Statistical Properties of β
Model:
ei ∼ N (0, σ)2 i.i.d.
Y = Xβ + e,
Unbiasedness (Theorem 14.4.2.A):
E β̂ =E
T
XX
T
−1
−1
T
X Y
T
X [Xβ + e]
−1
−1
=E XT X
XT X β + E XT X
XT e
=E
=β
XX
304
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Covariance matrix of β̂ (Theorem 14.4.2.B)
−1
−1
V ar β̂ =V ar XT X
XT X β + XT X
XT e
−1
=V ar XT X
XT e
h
−1 T iT
T −1 T
T
= XX
X V ar(e) X X
X
T −1 T
T −1
= XX
X V ar(e) X X X
T −1 T T −1
2
=σ X X
X X X X
T −1
2
=σ X X
Note that V ar(e) is a diagonal matrix = σ 2 In×n
305
Cornell University, BTRY 4090 / STSCI 4090
Theorem 14.4.3.A:
Spring 2010
Instructor: Ping Li
An unbiased estimator of σ 2 is s2 , where
kY − Ŷk2
s =
n−p
2
Proof:
Lemma 14.4.3.A:
h i
−1
Ŷ = Xβ̂ = X XT X
XT Y = PY
P2 = P = PT
(I − P)2 = I − P = (I − P)T
Proof of Lemma 14.4.3.A
T −1 T
T −1 T
T −1 T
P =X X X
X X X X
X =X X X
X =P
2
306
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Therefore,
kY − Ŷk2 = k(I − P)Yk2 = Y T (I − P)T (I − P)Y = Y T (I − P)Y
and
T
E Y (I − P)Y =E
because
T
T
β X +e
T
(I − P) (Xβ + e)
T
T
T
=β X (I − P)Xβ + E e (I − P)e
T
=E e (I − P)e
T 2
=nσ − E e Pe
h i
−1
XT (I − P)X = XT X − XT X XT X
XT X = 0
307
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li

E eT (P)e =E 

=E 
=σ 2
p
X
j=1
p
X
j=1
p
X
"
n
X
i=1
#
ei Pij ej 

Pjj e2j 
Pjj = pσ 2
j=1
where we skip the proof of the very last step.
Combining the results, we obtain

E kY − Ŷk2 = (n − p)σ 2
308
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Properties of Residuals
Residuals:
ê = Y − Ŷ = (I − P)Y .
Covariance matrix of residuals:
V ar(ê) =(I − P)Var(Y)(I − P)T
=(I − P)σ 2 I(I − P)
=σ 2 (I − P)
=⇒ Residuals are correlated
Instructor: Ping Li
309
Cornell University, BTRY 4090 / STSCI 4090
Theorem 14.4.A:
Spring 2010
Instructor: Ping Li
The residuals are uncorrelated with the fitted values.
Proof:
T
T
E(ê Ŷ) =E Y (I − P)PY
T
2
=E Y (P − P )Y
T
=E Y (P − P)Y
=0
310
Cornell University, BTRY 4090 / STSCI 4090
Spring 2010
Instructor: Ping Li
Inference about β
T −1
2
V ar β̂ = σ X X
= σ 2 C.
Using s2 to estimate σ 2 , we obtain the distribution
β̂j − βj
∼ tn−p ,
sβ̂j
where sβ̂
j
√
= s cii
which allows us to conduct hypothesis test on the significance of the fit.
311
Download