Brief Review
Probability and Statistics
Probability distributions
Continuous distributions
Defn (density function)
Let x denote a continuous random variable. Then f(x) is
called the density function of x if
1) $f(x) \ge 0$
2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
3) $\int_a^b f(x)\,dx = P(a \le x \le b)$
The Normal distribution
(mean μ, standard deviation σ)
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The Exponential distribution
e x
f  x  
 0
x0
x0
[Graph: the exponential density]
The Gamma distribution
An important family of distributions
The Gamma Function, Γ(x)
An important function in mathematics
The Gamma function is defined for x > 0 by
$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\, du$$
The Gamma distribution
Let the continuous random variable X have
density function:
$$f(x) = \begin{cases} \dfrac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
Then X is said to have a Gamma distribution
with parameters α and λ.
Graph: The gamma distribution
[Curves shown for (α = 2, λ = 0.9), (α = 2, λ = 0.6), (α = 3, λ = 0.6)]
Comments
1. The set of gamma distributions is a family of
distributions (parameterized by α and λ).
2. Contained within this family are other distributions:
a. The Exponential distribution – in the case α = 1, the
gamma distribution becomes the exponential distribution
with parameter λ. The exponential distribution arises when
we measure the lifetime, X, of an object that does not
age. It is also used as a distribution for waiting times
between events occurring uniformly in time.
b. The Chi-square distribution – in the case α = n/2 and
λ = ½, the gamma distribution becomes the chi-square
(χ²) distribution with n degrees of freedom. Later we
will see that a sum of squares of independent standard
normal variates has a chi-square distribution, with degrees
of freedom equal to the number of independent terms in the
sum of squares.
The Exponential distribution
e x
f  x  
 0
x0
x0
The Chi-square (χ²) distribution with n d.f.
$$f(x) = \begin{cases} \dfrac{\left(\tfrac{1}{2}\right)^{n/2}}{\Gamma\!\left(\tfrac{n}{2}\right)}\, x^{\frac{n}{2}-1} e^{-\frac{x}{2}} = \dfrac{1}{2^{n/2}\,\Gamma\!\left(\tfrac{n}{2}\right)}\, x^{\frac{n}{2}-1} e^{-\frac{x}{2}} & x \ge 0 \\[6pt] 0 & x < 0 \end{cases}$$
Graph: The χ² distribution
[Curves shown for n = 4, n = 5, n = 6]
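A minimal numerical sketch (assuming SciPy is available; the parameter values are illustrative, not from the slides) checking that the gamma family contains the exponential and chi-square densities as the special cases described above:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 10, 50)
lam = 0.7          # rate parameter lambda (illustrative value)
n = 5              # chi-square degrees of freedom (illustrative value)

# alpha = 1: the gamma(alpha=1, rate=lam) density equals the exponential(lam) density
g1 = stats.gamma.pdf(x, a=1, scale=1/lam)
e1 = stats.expon.pdf(x, scale=1/lam)
assert np.allclose(g1, e1)

# alpha = n/2, lambda = 1/2: the gamma density equals the chi-square density with n d.f.
g2 = stats.gamma.pdf(x, a=n/2, scale=2)   # scale = 1/lambda = 2
c2 = stats.chi2.pdf(x, df=n)
assert np.allclose(g2, c2)
print("gamma(1, lam) == exponential(lam);  gamma(n/2, 1/2) == chi-square(n)")
```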
Defn (Joint density function)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous
random variables. Then
f(x) = f(x1 ,x2 ,x3 , ... , xn)
is called the joint density function of x = (x1 ,x2 ,x3 , ... , xn)
if
1) $f(\mathbf{x}) \ge 0$
2) $\int f(\mathbf{x})\,d\mathbf{x} = 1$
3) $\int_R f(\mathbf{x})\,d\mathbf{x} = P(\mathbf{x} \in R)$
Note:
$$\int f(\mathbf{x})\,d\mathbf{x} = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n$$
$$\int_R f(\mathbf{x})\,d\mathbf{x} = \int\!\cdots\!\int_R f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n$$
Defn (Marginal density function)
The marginal density of x1 = (x1 ,x2 ,x3 , ... , xp) (p < n)
is defined by:
$$f_1(\mathbf{x}_1) = \int f(\mathbf{x})\,d\mathbf{x}_2 = \int f(\mathbf{x}_1, \mathbf{x}_2)\,d\mathbf{x}_2$$
where x2 = (xp+1 ,xp+2 ,xp+3 , ... , xn)
The marginal density of x2 = (xp+1 ,xp+2 ,xp+3 , ... , xn) is
defined by:
$$f_2(\mathbf{x}_2) = \int f(\mathbf{x})\,d\mathbf{x}_1 = \int f(\mathbf{x}_1, \mathbf{x}_2)\,d\mathbf{x}_1$$
where x1 = (x1 ,x2 ,x3 , ... , xp)
Defn (Conditional density function)
The conditional density of x1 given x2 (defined on the previous
slide) (p < n) is defined by:
$$f_{1|2}(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f(\mathbf{x})}{f_2(\mathbf{x}_2)} = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_2(\mathbf{x}_2)}$$
The conditional density of x2 given x1 is defined by:
$$f_{2|1}(\mathbf{x}_2 \mid \mathbf{x}_1) = \frac{f(\mathbf{x})}{f_1(\mathbf{x}_1)} = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_1(\mathbf{x}_1)}$$
Marginal densities describe how the subvector
xi behaves ignoring xj.
Conditional densities describe how the
subvector xi behaves when the subvector xj is
held fixed.
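A small sketch (not from the slides; all parameter values are illustrative) recovering a marginal density by numerically integrating a joint density over x2, using a correlated bivariate normal as the joint f(x1, x2):

```python
import numpy as np
from scipy import stats

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
joint = stats.multivariate_normal(mean=mu, cov=Sigma)

x2_grid = np.linspace(-8, 8, 4001)          # integration grid for x2
dx2 = x2_grid[1] - x2_grid[0]

def marginal_f1(x1):
    # f1(x1) = integral of f(x1, x2) over x2, approximated by a Riemann sum
    pts = np.column_stack([np.full_like(x2_grid, x1), x2_grid])
    return joint.pdf(pts).sum() * dx2

# The marginal of x1 should be N(mu[0], Sigma[0, 0])
for x1 in (-1.0, 0.5, 2.0):
    exact = stats.norm.pdf(x1, loc=mu[0], scale=np.sqrt(Sigma[0, 0]))
    print(x1, marginal_f1(x1), exact)
```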
Defn (Independence)
The two sub-vectors (x1 and x2) are called independent
if:
f(x) = f(x1, x2) = f1(x1)f2(x2)
= product of marginals
or
the conditional density of xi given xj :
fi|j(xi |xj) = fi(xi) = marginal density of xi
Example (p-variate Normal)
The random vector x (p × 1) is said to have the
p-variate Normal distribution with
mean vector μ (p × 1) and
covariance matrix Σ (p × p)
(written x ~ Np(μ, Σ)) if:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$
Example (bivariate Normal)
The random vector x = (x1, x2)′ is said to have the bivariate
Normal distribution with mean vector μ = (μ1, μ2)′ and
covariance matrix
$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$
so that
$$f(x_1, x_2) = \frac{1}{2\pi\left(\sigma_{11}\sigma_{22} - \sigma_{12}^2\right)^{1/2}} \exp\!\left\{ -\tfrac{1}{2}\, Q(x_1, x_2) \right\}$$
where
$$Q(x_1, x_2) = (\mathbf{x}-\boldsymbol{\mu})' \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}^{-1} (\mathbf{x}-\boldsymbol{\mu})
= \frac{\sigma_{22}(x_1-\mu_1)^2 - 2\sigma_{12}(x_1-\mu_1)(x_2-\mu_2) + \sigma_{11}(x_2-\mu_2)^2}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}$$
Equivalently, in terms of σ1, σ2 and the correlation ρ,
$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left\{ -\tfrac{1}{2}\, Q(x_1, x_2) \right\}$$
$$Q(x_1, x_2) = \frac{1}{1-\rho^2}\left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\!\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right]$$
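A hedged numerical check (illustrative parameter values, not from the slides) that the explicit bivariate formula above agrees with a library implementation of the multivariate normal density:

```python
import numpy as np
from scipy import stats

mu1, mu2 = 1.0, 2.0
s1, s2, rho = 1.5, 0.8, 0.4

Sigma = np.array([[s1**2,     rho*s1*s2],
                  [rho*s1*s2, s2**2    ]])

def f_explicit(x1, x2):
    # f(x1, x2) = exp(-Q/2) / (2*pi*s1*s2*sqrt(1 - rho^2)), Q as in the slide
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    Q = (z1**2 - 2*rho*z1*z2 + z2**2) / (1 - rho**2)
    return np.exp(-Q/2) / (2*np.pi*s1*s2*np.sqrt(1 - rho**2))

mvn = stats.multivariate_normal(mean=[mu1, mu2], cov=Sigma)
for x in ([0.0, 1.0], [1.2, 2.5], [-1.0, 3.0]):
    assert np.isclose(f_explicit(*x), mvn.pdf(x))
print("explicit bivariate formula matches multivariate_normal.pdf")
```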
Theorem (Transformations)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of
continuous random variables with joint density
function f(x1 ,x2 ,x3 , ... , xn) = f(x). Let
y1 =f1(x1 ,x2 ,x3 , ... , xn)
y2 =f2(x1 ,x2 ,x3 , ... , xn)
...
yn =fn(x1 ,x2 ,x3 , ... , xn)
define a 1-1 transformation of x into y.
Then the joint density of y is g(y) given by:
g(y) = f(x)|J| where
$$J = \frac{\partial(\mathbf{x})}{\partial(\mathbf{y})} = \frac{\partial(x_1, x_2, x_3, \ldots, x_n)}{\partial(y_1, y_2, y_3, \ldots, y_n)} = \det\begin{bmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} & \cdots & \dfrac{\partial x_n}{\partial y_1} \\[8pt] \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_n}{\partial y_2} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial x_1}{\partial y_n} & \dfrac{\partial x_2}{\partial y_n} & \cdots & \dfrac{\partial x_n}{\partial y_n} \end{bmatrix} = \text{the Jacobian of the transformation}$$
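A concrete worked instance of the theorem (an added example, not from the slides): take n = 2 and the 1-1 transformation y1 = x1 + x2, y2 = x1 − x2, so that x1 = (y1 + y2)/2 and x2 = (y1 − y2)/2. Then
$$J = \det\begin{bmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} \\[6pt] \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} \end{bmatrix} = \det\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix} = -\tfrac{1}{2}, \qquad g(y_1, y_2) = f\!\left(\tfrac{y_1+y_2}{2},\, \tfrac{y_1-y_2}{2}\right)\cdot \tfrac{1}{2}.$$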
Corollary (Linear Transformations)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of
continuous random variables with joint density
function f(x1 ,x2 ,x3 , ... , xn) = f(x). Let
y1 = a11x1 + a12x2 + a13x3 , ... + a1nxn
y2 = a21x1 + a22x2 + a23x3 , ... + a2nxn
...
yn = an1x1 + an2x2 + an3x3 , ... + annxn
define a 1-1 transformation of x into y.
Then the joint density of y is g(y) given by:
$$g(\mathbf{y}) = f(\mathbf{x})\,\frac{1}{|\det(A)|} = f(A^{-1}\mathbf{y})\,\frac{1}{|\det(A)|}$$
$$\text{where } A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$$
Corollary (Linear Transformations for Normal
Random variables)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous
random variables having an n-variate Normal
distribution with mean vector μ and covariance matrix
Σ.
i.e.
x ~ Nn(μ, Σ)
Let
y1 = a11x1 + a12x2 + a13x3 , ... + a1nxn
y2 = a21x1 + a22x2 + a23x3 , ... + a2nxn
...
yn = an1x1 + an2x2 + an3x3 , ... + annxn
define a 1-1 transformation of x into y.
Then y = (y1 ,y2 ,y3 , ... , yn) ~ Nn(Aμ, AΣA′)
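A simulation sketch of the corollary (illustrative μ, Σ and A, not from the slides): draws of y = Ax should have sample mean close to Aμ and sample covariance close to AΣA′.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0, -2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 2.0, -1.0],
              [1.0, 0.0, 1.0]])

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of x
y = x @ A.T                                            # each row is A x

print("sample mean of y :", y.mean(axis=0))
print("A mu             :", A @ mu)
print("sample cov of y  :\n", np.cov(y, rowvar=False).round(3))
print("A Sigma A'       :\n", (A @ Sigma @ A.T).round(3))
```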
Defn (Expectation)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of
continuous random variables with joint density
function
f(x) = f(x1 ,x2 ,x3 , ... , xn).
Let U = h(x) = h(x1 ,x2 ,x3 , ... , xn)
Then
E U   E h(x)   h(x) f (x)dx
Defn (Conditional Expectation)
Let x = (x1 ,x2 ,x3 , ... , xn) = (x1 , x2 ) denote a
vector of continuous random variables with joint
density function
f(x) = f(x1 ,x2 ,x3 , ... , xn) = f(x1 , x2 ).
Let U = h(x1) = h(x1 ,x2 ,x3 , ... , xp)
Then the conditional expectation of U given x2 is
$$E[U \mid \mathbf{x}_2] = E[h(\mathbf{x}_1) \mid \mathbf{x}_2] = \int h(\mathbf{x}_1)\, f_{1|2}(\mathbf{x}_1 \mid \mathbf{x}_2)\, d\mathbf{x}_1$$
Defn (Variance)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of
continuous random variables with joint density
function
f(x) = f(x1 ,x2 ,x3 , ... , xn).
Let U = h(x) = h(x1 ,x2 ,x3 , ... , xn)
Then
$$\sigma_U^2 = \operatorname{Var}[U] = E\!\left[(U - E[U])^2\right] = E\!\left[(h(\mathbf{x}) - E[h(\mathbf{x})])^2\right]$$
Defn (Conditional Variance)
Let x = (x1 ,x2 ,x3 , ... , xn) = (x1 , x2 ) denote a
vector of continuous random variables with joint
density function
f(x) = f(x1 ,x2 ,x3 , ... , xn) = f(x1 , x2 ).
Let U = h(x1) = h(x1 ,x2 ,x3 , ... , xp)
Then the conditional variance of U given x2 is
$$\operatorname{Var}[U \mid \mathbf{x}_2] = E\!\left[ \left( h(\mathbf{x}_1) - E[h(\mathbf{x}_1) \mid \mathbf{x}_2] \right)^2 \,\middle|\, \mathbf{x}_2 \right]$$
Defn (Covariance, Correlation)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of
continuous random variables with joint density
function
f(x) = f(x1 ,x2 ,x3 , ... , xn).
Let U = h(x) = h(x1 ,x2 ,x3 , ... , xn) and
V = g(x) =g(x1 ,x2 ,x3 , ... , xn)
Then the covariance of U and V is
$$\operatorname{Cov}[U, V] = E\!\left[(U - E[U])(V - E[V])\right] = E\!\left[(h(\mathbf{x}) - E[h(\mathbf{x})])(g(\mathbf{x}) - E[g(\mathbf{x})])\right]$$
and the correlation of U and V is
$$\rho_{UV} = \frac{\operatorname{Cov}[U, V]}{\sqrt{\operatorname{Var}(U)\operatorname{Var}(V)}}$$
Properties
• Expectation
• Variance
• Covariance
• Correlation
1. E[a1x1 + a2x2 + a3x3 + ... + anxn]
= a1E[x1] + a2E[x2] + a3E[x3] + ... + anE[xn]
or E[a'x] = a'E[x]
2. E[UV] = E[h(x1)g(x2)]
= E[U]E[V] = E[h(x1)]E[g(x2)]
if x1 and x2 are independent
3. Var[a1x1 + a2x2 + a3x3 + ... + anxn]
$$= \sum_{i=1}^{n} a_i^2 \operatorname{Var}[x_i] + 2\sum_{i<j} a_i a_j \operatorname{Cov}[x_i, x_j]$$
or Var[a'x] = a′Σa, where
$$\boldsymbol{\Sigma} = \begin{bmatrix} \operatorname{Var}(x_1) & \operatorname{Cov}(x_1, x_2) & \cdots & \operatorname{Cov}(x_1, x_n) \\ \operatorname{Cov}(x_2, x_1) & \operatorname{Var}(x_2) & \cdots & \operatorname{Cov}(x_2, x_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{Cov}(x_n, x_1) & \operatorname{Cov}(x_n, x_2) & \cdots & \operatorname{Var}(x_n) \end{bmatrix}$$
4. Cov[a1x1 + a2x2 + ... + anxn ,
b1x1 + b2x2 + ... + bnxn]
$$= \sum_{i=1}^{n} a_i b_i \operatorname{Var}[x_i] + \sum_{i \ne j} a_i b_j \operatorname{Cov}[x_i, x_j]$$
or Cov[a'x, b'x] = a′Σb
5. $E[U] = E_{\mathbf{x}_2}\!\left[\, E[U \mid \mathbf{x}_2] \,\right]$
6. $\operatorname{Var}[U] = E_{\mathbf{x}_2}\!\left[\, \operatorname{Var}[U \mid \mathbf{x}_2] \,\right] + \operatorname{Var}_{\mathbf{x}_2}\!\left[\, E[U \mid \mathbf{x}_2] \,\right]$
Statistical Inference
Making decisions from data
There are two main areas of Statistical Inference
• Estimation – deciding on the value of a
parameter
– Point estimation
– Confidence Interval, Confidence region Estimation
• Hypothesis testing
– Deciding if a statement (hypothesis) about a
parameter is True or False
The general statistical model
Most data fits this situation
Defn (The Classical Statistical Model)
The data vector
x = (x1 ,x2 ,x3 , ... , xn)
The model
Let f(x| q) = f(x1 ,x2 , ... , xn | q1 , q2 ,... , qp)
denote the joint density of the data vector x =
(x1 ,x2 ,x3 , ... , xn) of observations where the
unknown parameter vector q  W (a subset of
p-dimensional space).
An Example
The data vector
x = (x1 ,x2 ,x3 , ... , xn), a sample from the normal
distribution with mean μ and variance σ²
The model
Then f(x| μ, σ²) = f(x1 ,x2 , ... , xn | μ, σ²), the joint
density of x = (x1 ,x2 ,x3 , ... , xn), takes on the form:
$$f(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, e^{-\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}}$$
where the unknown parameter vector θ = (μ, σ²) ∈ Ω
= {(x, y) | −∞ < x < ∞, 0 ≤ y < ∞}.
Defn (Sufficient Statistics)
Let x have joint density f(x| q) where the unknown
parameter vector q  W.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is called a set of
sufficient statistics for the parameter vector q if the
conditional distribution of x given S = (S1(x) ,S2(x)
,S3(x) , ... , Sk(x)) is not functionally dependent on the
parameter vector q.
A set of sufficient statistics contains all of the
information concerning the unknown parameter vector
A Simple Example illustrating Sufficiency
Suppose that we observe a Success-Failure experiment
n = 3 times. Let q denote the probability of Success.
Suppose that the data collected are x1, x2, x3 where
xi takes on the value 1 if the ith trial is a Success and 0 if
the ith trial is a Failure.
The following table gives the possible values of (x1, x2, x3):

(x1, x2, x3)   f(x1, x2, x3|q)   S = Σxi   g(S|q)         f(x1, x2, x3|S)
(0, 0, 0)      (1 - q)^3         0         (1 - q)^3      1
(1, 0, 0)      (1 - q)^2 q       1         3(1 - q)^2 q   1/3
(0, 1, 0)      (1 - q)^2 q       1         3(1 - q)^2 q   1/3
(0, 0, 1)      (1 - q)^2 q       1         3(1 - q)^2 q   1/3
(1, 1, 0)      (1 - q) q^2       2         3(1 - q) q^2   1/3
(1, 0, 1)      (1 - q) q^2       2         3(1 - q) q^2   1/3
(0, 1, 1)      (1 - q) q^2       2         3(1 - q) q^2   1/3
(1, 1, 1)      q^3               3         q^3            1

The data can be generated in two equivalent ways:
1. Generating (x1, x2, x3) directly from f(x1, x2, x3|q), or
2. Generating S from g(S|q) and then generating (x1, x2, x3) from f(x1,
x2, x3|S). Since the second step does not involve q, no additional
information about q is obtained by knowing (x1, x2, x3) once S is
determined.
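A brute-force check (an added sketch, not in the slides) that the conditional distribution of (x1, x2, x3) given S = Σxi does not involve the parameter, for this n = 3 success-failure example:

```python
from itertools import product
from fractions import Fraction

def conditional_given_S(theta):
    theta = Fraction(theta)
    # joint probability of each outcome and the distribution of S
    f = {x: theta**sum(x) * (1 - theta)**(3 - sum(x)) for x in product((0, 1), repeat=3)}
    g = {s: sum(p for x, p in f.items() if sum(x) == s) for s in range(4)}
    # f(x | S) = f(x) / g(S) for the S value that x produces
    return {x: f[x] / g[sum(x)] for x in f}

# The conditional probabilities (1, 1/3, 1/3, ..., 1) do not depend on theta
print(conditional_given_S(Fraction(1, 4)) == conditional_given_S(Fraction(2, 3)))   # True
```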
The Sufficiency Principle
Any decision regarding the parameter q should
be based on a set of Sufficient statistics S1(x),
S2(x), ...,Sk(x) and not otherwise on the value of
x.
A useful approach in developing a statistical
procedure
1. Find sufficient statistics
2. Develop estimators , tests of hypotheses etc.
using only these statistics
Defn (Minimal Sufficient Statistics)
Let x have joint density f(x| q) where the
unknown parameter vector q  W.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set
of Minimal Sufficient statistics for the
parameter vector q if S = (S1(x) ,S2(x) ,S3(x) , ...
, Sk(x)) is a set of Sufficient statistics and can be
calculated from any other set of Sufficient
statistics.
Theorem (The Factorization Criterion)
Let x have joint density f(x| q) where the unknown
parameter vector q  W.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of
Sufficient statistics for the parameter vector q if
f(x| q) = h(x)g(S, q)
= h(x)g(S1(x) ,S2(x) ,S3(x) , ... , Sk(x), q).
This is useful for finding Sufficient statistics
i.e. If you can factor out q-dependence with a set of
statistics then these statistics are a set of Sufficient
statistics
Defn (Completeness)
Let x have joint density f(x| q) where the unknown
parameter vector q  W.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of
Complete Sufficient statistics for the parameter vector
q if S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of
Sufficient statistics and whenever
E[f(S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) ] = 0
then
P[f(S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) = 0] = 1
Defn (The Exponential Family)
Let x have joint density f(x| q) where the
unknown parameter vector q  W. Then f(x| q)
is said to be a member of the exponential family
of distributions if:
$$f(\mathbf{x} \mid \boldsymbol{\theta}) = \begin{cases} h(\mathbf{x})\, g(\boldsymbol{\theta}) \exp\!\left\{ \displaystyle\sum_{i=1}^{k} S_i(\mathbf{x})\, p_i(\boldsymbol{\theta}) \right\} & a_i < x_i < b_i,\ i = 1, \ldots, n \\[8pt] 0 & \text{otherwise} \end{cases}$$
for q ∈ W, where
1) - ∞ < ai < bi < ∞ are not dependent on q.
2) W contains a nondegenerate k-dimensional
rectangle.
3) g(q), ai ,bi and pi(q) are not dependent on x.
4) h(x), ai ,bi and Si(x) are not dependent on q.
If in addition.
5) The Si(x) are functionally independent for i = 1, 2,..., k.
6) ∂Si(x)/∂xj exists and is continuous for all i = 1, 2,..., k; j = 1,
2,..., n.
7) pi(q) is a continuous function of q for all i = 1, 2,..., k.
8) R = {[p1(q), p2(q), ..., pk(q)] | q ∈ W} contains a nondegenerate
k-dimensional rectangle.
Then
the set of statistics S1(x), S2(x), ...,Sk(x) form a Minimal
Complete set of Sufficient statistics.
Examples
Suppose we repeat a success-failure experiment
independently n times. Suppose that q is the
probability of success. Note: 0 ≤ q ≤ 1
Let
$$x_i = \begin{cases} 1 & i\text{th repetition is a Success} \\ 0 & i\text{th repetition is a Failure} \end{cases}$$
$$f(x_i) = \begin{cases} q & x_i = 1 \\ 1 - q & x_i = 0 \end{cases} = q^{x_i} (1 - q)^{1 - x_i}$$
The joint density of x1, x2, x3, ..., xn is
$$f(x_1, x_2, x_3, \ldots, x_n \mid q) = f(x_1) f(x_2) f(x_3) \cdots f(x_n)$$
$$= q^{x_1}(1-q)^{1-x_1}\, q^{x_2}(1-q)^{1-x_2} \cdots q^{x_n}(1-q)^{1-x_n} = q^{\sum_i x_i} (1-q)^{n - \sum_i x_i}$$
$$= q^{S} (1-q)^{n-S} = (1-q)^n \left(\frac{q}{1-q}\right)^{S} = (1-q)^n\, e^{S \ln\left(\frac{q}{1-q}\right)}$$
Thus
$$f(\mathbf{x} \mid q) = h(\mathbf{x})\, g(q)\, \exp\{ S(\mathbf{x})\, p(q) \}$$
where
$$h(\mathbf{x}) = 1, \quad g(q) = (1-q)^n, \quad S(\mathbf{x}) = \sum_{i=1}^{n} x_i, \quad p(q) = \ln\!\left(\frac{q}{1-q}\right)$$
Defn (The Likelihood function)
Let x have joint density f(x|q) where the unknown
parameter vector q ∈ W. Then for a
given value of the observation vector x, the
Likelihood function, Lx(q), is defined by:
Lx(q) = f(x|q) with q ∈ W
The log-Likelihood function lx(q) is defined by:
lx(q) = ln Lx(q) = ln f(x|q) with q ∈ W
The Likelihood Principle
Any decision regarding the parameter q should
be based on the likelihood function Lx(q) and not
otherwise on the value of x.
If two data sets result in the same likelihood
function the decision regarding q should be the
same.
Some statisticians find it useful to plot the
likelihood function Lx(q) given the value of x.
It summarizes the information contained in x
regarding the parameter vector q.
An Example
The data vector
x = (x1 ,x2 ,x3 , ... , xn), a sample from the normal
distribution with mean μ and variance σ²
The joint distribution of x
Then f(x| μ, σ²) = f(x1 ,x2 , ... , xn | μ, σ²), the joint
density of x = (x1 ,x2 ,x3 , ... , xn), takes on the form:
$$f(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, e^{-\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}}$$
where the unknown parameter vector θ = (μ, σ²) ∈ Ω
= {(x, y) | −∞ < x < ∞, 0 ≤ y < ∞}.
The Likelihood function
Assume data vector is known
x = (x1 ,x2 ,x3 , ... , xn)
The Likelihood function
Then L(μ, σ) = f(x| μ, σ²) = f(x1 ,x2 , ... , xn | μ, σ²):
$$L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2}$$
or
$$L(\mu, \sigma) = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( x_i^2 - 2\mu x_i + \mu^2 \right) \right\}
= \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, \exp\!\left\{ -\frac{1}{2\sigma^2} \left( \sum_{i=1}^{n} x_i^2 - 2\mu \sum_{i=1}^{n} x_i + n\mu^2 \right) \right\}$$
Since
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}, \quad\text{or equivalently}\quad (n-1)s^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2,$$
and since
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad\text{so that}\quad \sum_{i=1}^{n} x_i = n\bar{x},$$
hence
$$L(\mu, \sigma) = \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, \exp\!\left\{ -\frac{1}{2\sigma^2} \left( (n-1)s^2 + n\bar{x}^2 - 2\mu n\bar{x} + n\mu^2 \right) \right\}
= \frac{1}{(2\pi)^{n/2}\, \sigma^n}\, \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ (n-1)s^2 + n(\bar{x} - \mu)^2 \right] \right\}$$
Now consider the following data (n = 10):
57.1  72.3  75.0  57.8  50.3  48.0  49.6  53.1  58.5  53.7
mean = 57.54,  s = 9.2185
$$L(\mu, \sigma) = \frac{1}{(6.2832)^{5}\, \sigma^{10}}\, \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ 9\,(9.2185)^2 + 10\,(57.54 - \mu)^2 \right] \right\}$$
Likelihood n = 10
[Surface plot of L(μ, σ) over μ and σ]
Contour Map of Likelihood n = 10
[Contour plot of L(μ, σ), μ from 50 to 70, σ up to 20]
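A small sketch evaluating the likelihood surface above on a (μ, σ) grid, using the n = 10 sample from the slide (the grid itself is illustrative):

```python
import numpy as np

x = np.array([57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 49.6, 53.1, 58.5, 53.7])
n = len(x)

def likelihood(mu, sigma):
    # L(mu, sigma) = (2*pi)^(-n/2) * sigma^(-n) * exp(-sum((x_i - mu)^2) / (2*sigma^2))
    return (2*np.pi)**(-n/2) * sigma**(-n) * np.exp(-np.sum((x - mu)**2) / (2*sigma**2))

mu_grid = np.linspace(50, 70, 5)
sigma_grid = np.linspace(5, 20, 4)
for mu in mu_grid:
    print(mu, [f"{likelihood(mu, s):.2e}" for s in sigma_grid])
```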
Now consider the following data (n = 100):
57.1  72.3  75.0  57.8  50.3  48.0  49.6  53.1  58.5  53.7
77.8  43.0  69.8  65.1  71.1  44.4  64.4  52.9  56.4  43.9
49.0  37.6  65.5  50.4  40.7  66.9  51.5  55.8  49.1  59.5
64.5  67.6  79.9  48.0  68.1  68.0  65.8  61.3  75.0  78.0
61.8  69.0  56.2  77.2  57.5  84.0  45.5  64.4  58.7  77.5
81.9  77.1  58.7  71.2  58.1  50.3  53.2  47.6  53.3  76.4
69.8  57.8  65.9  63.0  43.5  70.7  85.2  57.2  78.9  72.9
78.6  53.9  61.9  75.2  62.2  53.2  73.0  38.9  75.4  69.7
68.8  77.0  51.2  65.6  44.7  40.4  72.1  68.1  82.2  64.7
83.1  71.9  65.4  45.0  51.6  48.3  58.5  65.3  65.9  59.6
mean = 62.02,  s = 11.8571
$$L(\mu, \sigma) = \frac{1}{(6.2832)^{50}\, \sigma^{100}}\, \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ 99\,(11.8571)^2 + 100\,(62.02 - \mu)^2 \right] \right\}$$
Likelihood n = 100
[Surface plot of L(μ, σ) over μ and σ]
Contour Map of Likelihood n = 100
[Contour plot of L(μ, σ), μ from 50 to 70, σ up to 20]
The Sufficiency Principle
Any decision regarding the parameter q should
be based on a set of Sufficient statistics S1(x),
S2(x), ...,Sk(x) and not otherwise on the value of
x.
If two data sets result in the same values for the
set of Sufficient statistics the decision regarding
q should be the same.
Theorem (Birnbaum - Equivalency of the
Likelihood Principle and Sufficiency Principle)
Lx (q)  Lx (q)
1
2
if and only if
S1(x1) = S1(x2),..., and Sk(x1) = Sk(x2)
The following table (repeated from above) gives the possible values of (x1, x2, x3):

(x1, x2, x3)   f(x1, x2, x3|q)   S = Σxi   g(S|q)         f(x1, x2, x3|S)
(0, 0, 0)      (1 - q)^3         0         (1 - q)^3      1
(1, 0, 0)      (1 - q)^2 q       1         3(1 - q)^2 q   1/3
(0, 1, 0)      (1 - q)^2 q       1         3(1 - q)^2 q   1/3
(0, 0, 1)      (1 - q)^2 q       1         3(1 - q)^2 q   1/3
(1, 1, 0)      (1 - q) q^2       2         3(1 - q) q^2   1/3
(1, 0, 1)      (1 - q) q^2       2         3(1 - q) q^2   1/3
(0, 1, 1)      (1 - q) q^2       2         3(1 - q) q^2   1/3
(1, 1, 1)      q^3               3         q^3            1
The Likelihood function
[Plots of Lx(q), 0 ≤ q ≤ 1, for each of S = 0, 1, 2, 3]
Estimation Theory
Point Estimation
Defn (Estimator)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q W.
Then an estimator of the parameter f(q) = f(q1
,q2 , ... , qk) is any function T(x)=T(x1 ,x2 ,x3 , ... ,
xn) of the observation vector.
Defn (Mean Square Error)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where the
unknown parameter vector q  W.
Let T(x) be an estimator of the parameter
f(q). Then the Mean Square Error of T(x) is
defined to be:
$$M.S.E.\!\left[T(\mathbf{x}) \mid \boldsymbol{\theta}\right] = E\!\left[\left(T(\mathbf{x}) - f(\boldsymbol{\theta})\right)^2\right] = \int \left(T(\mathbf{x}) - f(\boldsymbol{\theta})\right)^2 f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x}$$
Defn (Uniformly Better)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W.
Let T(x) and T*(x) be estimators of the
parameter f(q). Then T(x) is said to be
uniformly better than T*(x) if:
$$M.S.E.\!\left[T(\mathbf{x}) \mid \boldsymbol{\theta}\right] \le M.S.E.\!\left[T^{*}(\mathbf{x}) \mid \boldsymbol{\theta}\right] \quad \text{whenever } \boldsymbol{\theta} \in \Omega$$
Defn (Unbiased )
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W.
Let T(x) be an estimator of the parameter f(q).
Then T(x) is said to be an unbiased estimator of
the parameter f(q) if:
E T x    T (x) f (x | θ)dx  f θ 
Theorem (Cramer Rao Lower bound)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W. Suppose
that:
i) f ( x | θ) exists for all x and for all θ  W .
θ

f ( x | θ)
ii)
f ( x | θ)dx  
dx

θ
θ

f ( x | θ)
iii)
t x  f ( x | θ)dx   t x 
dx

θ
θ
2

 f (x | θ)  

iv) 0  E  q     for all θ  W
i

 


Let M denote the p x p matrix with ijth element.
  2 ln f (x | θ) 
mij   E 
 i, j  1,2,  , p
 q i q j 
Then V = M-1 is the lower bound for the covariance
matrix of unbiased estimators of q.
That is, var(c' θ̂ ) = c'var( θ̂)c ≥ c'M-1c = c'Vc where θ̂
is a vector of unbiased estimators of q.
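A quick worked instance of the bound (an added illustrative special case, assuming a sample x1, …, xn from N(μ, σ²) with σ² known):
$$\ln f(\mathbf{x} \mid \mu) = -\tfrac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad -E\!\left[\frac{\partial^2 \ln f(\mathbf{x} \mid \mu)}{\partial \mu^2}\right] = \frac{n}{\sigma^2},$$
$$\text{so } V = M^{-1} = \frac{\sigma^2}{n}; \quad \operatorname{var}(\bar{x}) = \frac{\sigma^2}{n} \text{ attains the bound, making } \bar{x} \text{ an unbiased estimator of minimum variance for } \mu.$$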
Defn (Uniformly Minimum Variance
Unbiased Estimator)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W. Then
T*(x) is said to be the UMVU (Uniformly
minimum variance unbiased) estimator of f(q)
if:
1) E[T*(x)] = f(q) for all q  W.
2) Var[T*(x)] ≤ Var[T(x)] for all q  W
whenever E[T(x)] = f(q).
Theorem (Rao-Blackwell)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W.
Let S1(x), S2(x), ...,SK(x) denote a set of sufficient
statistics.
Let T(x) be any unbiased estimator of f(q).
Then T*[S1(x), S2(x), ...,Sk (x)] = E[T(x)|S1(x),
S2(x), ...,Sk (x)] is an unbiased estimator of f(q)
such that:
Var[T*(S1(x), S2(x), ...,Sk(x))] ≤ Var[T(x)]
for all q  W.
Theorem (Lehmann-Scheffe')
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W.
Let S1(x), S2(x), ...,SK(x) denote a set of
complete sufficient statistics.
Let T*[S1(x), S2(x), ...,Sk (x)] be an unbiased
estimator of f(q). Then:
T*(S1(x), S2(x), ...,Sk(x)) is the UMVU
estimator of f(q).
Defn (Consistency)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where
the unknown parameter vector q  W. Let Tn(x)
be an estimator of f(q). Then Tn(x) is called a
consistent estimator of f(q) if for any e > 0:
lim PTn x   f θ  e   0 for all θ  W
n
Defn (M. S. E. Consistency)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x|q) where the
unknown parameter vector q  W. Let Tn(x) be an
estimator of f(q). Then Tn(x) is called a M. S. E.
consistent estimator of f(q) if:
$$\lim_{n \to \infty} M.S.E.\!\left[T_n \mid \boldsymbol{\theta}\right] = \lim_{n \to \infty} E\!\left[\left(T_n(\mathbf{x}) - f(\boldsymbol{\theta})\right)^2\right] = 0 \quad \text{for all } \boldsymbol{\theta} \in \Omega$$
Methods for Finding Estimators
1. The Method of Moments
2. Maximum Likelihood Estimation
Method of Moments
Let x1, … , xn denote a sample from the density
function
f(x; q1, … , qp) = f(x; q)
The kth moment of the distribution being
sampled is defined to be:
mk q1 ,
,q p   E  x k  



x k f  x;q1 ,
,q p  dx
The kth sample moment is defined to be:
$$m_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k$$
To find the method of moments estimators of
q1, … , qp we set up the equations:
$$m_1(q_1, \ldots, q_p) = m_1, \quad m_2(q_1, \ldots, q_p) = m_2, \quad \ldots, \quad m_p(q_1, \ldots, q_p) = m_p$$
We then solve these equations for q1, … , qp.
The solutions $\hat{q}_1, \ldots, \hat{q}_p$
are called the method of moments estimators.
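A sketch of the method of moments for the gamma(α, λ) distribution introduced earlier (simulated data with illustrative parameter values). Using m1(α, λ) = α/λ and m2(α, λ) = α(α+1)/λ², solving the two moment equations gives α̂ = m1²/(m2 − m1²) and λ̂ = m1/(m2 − m1²):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_true, lam_true = 3.0, 0.6                        # illustrative values
x = rng.gamma(shape=alpha_true, scale=1/lam_true, size=100_000)

m1 = np.mean(x)            # first sample moment
m2 = np.mean(x**2)         # second sample moment
alpha_hat = m1**2 / (m2 - m1**2)
lam_hat = m1 / (m2 - m1**2)
print(alpha_hat, lam_hat)  # should be close to 3.0 and 0.6
```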
The Method of Maximum Likelihood
Suppose that the data x1, … , xn has joint density
function
f(x1, … , xn ; q1, … , qp)
where q  (q1, … , qp) are unknown parameters
assumed to lie in W (a subset of p-dimensional
space).
We want to estimate the parameters q1, … , qp
Definition: Maximum Likelihood Estimation
Suppose that the data x1, … , xn has joint density
function
f(x1, … , xn ; q1, … , qp)
Then the Likelihood function is defined to be
L(q) = L(q1, … , qp)
= f(x1, … , xn ; q1, … , qp)
the Maximum Likelihood estimators of the parameters
q1, … , qp are the values that maximize
L(q) = L(q1, … , qp)
the Maximum Likelihood estimators of the parameters
q1, … , qp are the values $\hat{q}_1, \ldots, \hat{q}_p$ such that
$$L(\hat{q}_1, \ldots, \hat{q}_p) = \max_{q_1, \ldots, q_p} L(q_1, \ldots, q_p)$$
Note:
maximizing $L(q_1, \ldots, q_p)$ is equivalent to maximizing
$$l(q_1, \ldots, q_p) = \ln L(q_1, \ldots, q_p),$$
the log-likelihood function.
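A sketch of numerical maximum likelihood for a N(μ, σ²) sample, minimizing the negative log-likelihood with SciPy. The data reuse the n = 10 sample shown earlier; the optimizer settings are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 49.6, 53.1, 58.5, 53.7])
n = len(x)

def neg_log_lik(params):
    mu, log_sigma = params               # optimize log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return n*np.log(sigma) + np.sum((x - mu)**2) / (2*sigma**2)

res = minimize(neg_log_lik, x0=[50.0, 1.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# The closed-form MLEs are xbar and sqrt(mean((x - xbar)^2))
print(mu_hat, x.mean())
print(sigma_hat, np.sqrt(np.mean((x - x.mean())**2)))
```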
Application
The General Linear Model
Consider the random variable Y with
1. E[Y] = g(U1 ,U2 , ... , Uk)
= b1f1(U1 ,U2 , ... , Uk) + b2f2(U1 ,U2 , ... , Uk) +
... + bpfp(U1 ,U2 , ... , Uk)
$$= \sum_{i=1}^{p} \beta_i f_i(U_1, U_2, \ldots, U_k)$$
and
2. var(Y) = s2
• where b1, b2 , ... ,bp are unknown parameters
• and f1 ,f2 , ... , fp are known functions of the
nonrandom variables U1 ,U2 , ... , Uk.
• Assume further that Y is normally distributed.
Thus the density of Y is:
f(Y|b1, b2 , ... ,bp, s2) = f(Y| b, s2)
$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ Y - g(U_1, U_2, \ldots, U_k) \right]^2 \right\}$$
$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ Y - \sum_{i=1}^{p} \beta_i f_i(U_1, U_2, \ldots, U_k) \right]^2 \right\}$$
$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ Y - \beta_1 X_1 - \beta_2 X_2 - \cdots - \beta_p X_p \right]^2 \right\}$$
where $X_i = f_i(U_1, U_2, \ldots, U_k)$, i = 1, 2, … , p
Now suppose that n independent observations of Y,
(y1, y2, ..., yn), are made
corresponding to n sets of values of (U1 ,U2 , ... , Uk):
(u11 ,u12 , ... , u1k),
(u21 ,u22 , ... , u2k),
...
(un1 ,un2 , ... , unk).
Let xij = fj(ui1 ,ui2 , ... , uik), j = 1, 2, ..., p; i = 1, 2, ..., n.
Then the joint density of y = (y1, y2, ... yn) is:
f(y1, y2, ..., yn|b1, b2 , ... ,bp, s2) = f(y|b, s2)
$$= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ y_i - g(u_{i1}, u_{i2}, \ldots, u_{ik}) \right]^2 \right\}$$
$$= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ y_i - \sum_{j=1}^{p} \beta_j f_j(u_{i1}, u_{i2}, \ldots, u_{ik}) \right]^2 \right\}
= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ y_i - \sum_{j=1}^{p} \beta_j x_{ij} \right]^2 \right\}$$
$$= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} (\mathbf{y} - X\boldsymbol{\beta})'(\mathbf{y} - X\boldsymbol{\beta}) \right\}
= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ \mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta} + \boldsymbol{\beta}'X'X\boldsymbol{\beta} \right] \right\}$$
$$= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{ -\frac{1}{2\sigma^2}\, \boldsymbol{\beta}'X'X\boldsymbol{\beta} \right\} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ \mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta} \right] \right\}$$
$$= h(\mathbf{y})\, g(\boldsymbol{\beta}, \sigma^2)\, \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ \mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta} \right] \right\}$$
Thus f(y|b,s2) is a member of the
exponential family of distributions
and S = (y'y, X'y) is a Minimal Complete
set of Sufficient Statistics.
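A small simulation sketch of the model above (the design matrix and parameter values are illustrative): the sufficient statistics (y'y, X'y) are computed from the data, and the maximum likelihood / least-squares estimates depend on the data only through them.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
sigma = 1.5
y = X @ beta_true + rng.normal(scale=sigma, size=n)

S1 = y @ y          # y'y
S2 = X.T @ y        # X'y
beta_hat = np.linalg.solve(X.T @ X, S2)     # solves (X'X) beta = X'y
rss = S1 - S2 @ beta_hat                    # y'y - y'X beta_hat
sigma2_hat = rss / n                        # ML estimate of sigma^2
print(beta_hat, sigma2_hat)
```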
Hypothesis Testing
Defn (Test of size )
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x| q) where
the unknown parameter vector q  W.
Let w be any subset of W.
Consider testing the the Null Hypothesis
H0: q  w
against the alternative hypothesis
H1: q  w.
Let A denote the acceptance region for the test.
(all values x = (x1 ,x2 ,x3 , ... , xn) of such that the
decision to accept H0 is made.)
and let C denote the critical region for the test
(all values x = (x1 ,x2 ,x3 , ... , xn) of such that the
decision to reject H0 is made.).
Then the test is said to be of size  if
Px  C    f (x | θ)dx   for all θ  w and
C
Px  C    f (x | θ)dx   for at least one θ0  w
C
Defn (Power)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x| q) where the
unknown parameter vector q  W.
Consider testing the Null Hypothesis
H0: q ∈ w
against the alternative hypothesis
H1: q ∉ w,
where w is any subset of W. Then the Power of the test for
q ∉ w is defined to be:
$$\pi_C(\boldsymbol{\theta}) = P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x}$$
Defn (Uniformly Most Powerful (UMP) test of size α)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of
observations having joint density f(x| q) where the
unknown parameter vector q ∈ W.
Consider testing the Null Hypothesis
H0: q ∈ w
against the alternative hypothesis
H1: q ∉ w,
where w is any subset of W.
Let C denote the critical region for the test. Then
the test is called the UMP test of size α if:
$$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x} \le \alpha \ \text{ for all } \boldsymbol{\theta} \in \omega, \text{ and}$$
$$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x} = \alpha \ \text{ for at least one } \boldsymbol{\theta}_0 \in \omega,$$
and for any other critical region C* such that:
$$P[\mathbf{x} \in C^*] = \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x} \le \alpha \ \text{ for all } \boldsymbol{\theta} \in \omega, \text{ and}$$
$$P[\mathbf{x} \in C^*] = \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x} = \alpha \ \text{ for at least one } \boldsymbol{\theta}_0 \in \omega,$$
then
$$\int_C f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x} \ge \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\, d\mathbf{x} \ \text{ for all } \boldsymbol{\theta} \notin \omega.$$
Theorem (Neyman-Pearson Lemma)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having
joint density f(x| q) where the unknown parameter vector q ∈ W =
{q0, q1}.
Consider testing the Null Hypothesis
H0: q = q0
against the alternative hypothesis
H1: q = q1.
Then the UMP test of size α has critical region:
$$C = \left\{ \mathbf{x} \;:\; \frac{f(\mathbf{x} \mid \boldsymbol{\theta}_0)}{f(\mathbf{x} \mid \boldsymbol{\theta}_1)} \le K \right\}$$
where K is chosen so that
$$\int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\, d\mathbf{x} = \alpha.$$
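A sketch of the lemma for a simple-vs-simple problem not covered in the slides: x1, …, xn iid N(θ, 1), testing H0: θ = 0 against H1: θ = 1. Here the likelihood ratio f(x|θ0)/f(x|θ1) is small exactly when x̄ is large, so the UMP size-α critical region reduces to {x̄ > z₁₋α/√n}; the simulation estimates its size and power.

```python
import numpy as np
from scipy import stats

n, alpha = 25, 0.05
cutoff = stats.norm.ppf(1 - alpha) / np.sqrt(n)   # reject H0 when xbar > cutoff

rng = np.random.default_rng(4)
reps = 200_000
size  = np.mean(rng.normal(0.0, 1.0, (reps, n)).mean(axis=1) > cutoff)
power = np.mean(rng.normal(1.0, 1.0, (reps, n)).mean(axis=1) > cutoff)
print("estimated size :", size)    # should be close to 0.05
print("estimated power:", power)   # power of the UMP test at theta = 1
```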
Defn (Likelihood Ratio Test of size )
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having
joint density f(x| q) where the unknown parameter vector q  W.
Consider testing the the Null Hypothesis
H0: q  w
against the alternative hypothesis
H1: q  w.
where w is any subset of W
Then the Likelihood Ratio (LR) test of size a has critical region:
 max f (x | θ)



C  x θw
 K
f ( x | θ)
 max

θW
where K is chosen so that
Px  C    f (x | θ)dx   for all θ  w and
Px  C    f (x | θ)dx   for at least one θ0  w
C
C
Theorem (Asymptotic distribution of the
Likelihood Ratio test criterion)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having
joint density f(x| q) where the unknown parameter vector q ∈ W.
Consider testing the Null Hypothesis
H0: q ∈ w
against the alternative hypothesis
H1: q ∉ w,
where w is any subset of W. Let
$$\lambda(\mathbf{x}) = \frac{\max_{\boldsymbol{\theta} \in \omega} f(\mathbf{x} \mid \boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} f(\mathbf{x} \mid \boldsymbol{\theta})}.$$
Then, under proper regularity conditions, U = −2 ln λ(x)
possesses an asymptotic Chi-square distribution under H0, with degrees of
freedom equal to the difference between the number of
independent parameters in W and w.
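A simulation sketch of the theorem for an illustrative case not in the slides: x1, …, xn iid N(μ, 1), testing H0: μ = 0 against an unrestricted alternative. In this case −2 ln λ(x) works out to n·x̄², whose distribution under H0 should match the chi-square with 1 degree of freedom (W has 1 free parameter, w has 0).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 50, 100_000
xbar = rng.normal(0.0, 1.0, (reps, n)).mean(axis=1)   # samples generated under H0
U = n * xbar**2                                        # -2 ln(lambda) in closed form

# Compare simulated quantiles of U with chi-square(1) quantiles
for q in (0.5, 0.9, 0.95, 0.99):
    print(q, round(float(np.quantile(U, q)), 3), round(float(stats.chi2.ppf(q, df=1)), 3))
```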