Lecture 1 - Department of Computer and Information Science

732A36 Theory of Statistics
Course within the Master's program in Statistics and Data Mining
Fall semester 2011
Course details
• Course web: www.ida.liu.se/~732A36
• Course responsible, tutor and examiner: Anders Nordgaard
• Course period: Nov 2011–Jan 2012
• Examination: written exam in January 2012 plus compulsory assignments
• Course literature: Garthwaite PH, Jolliffe IT and Jones B (2002). Statistical Inference. 2nd ed. Oxford University Press, Oxford. ISBN 0-19-857226-3
Department of Computer and Information Science (IDA)
Linköpings universitet, Sweden
April 13, 2015
Course contents
• Statistical inference in general
• Point estimation (unbiasedness, consistency, efficiency, sufficiency, completeness)
• Information and likelihood concepts
• Maximum-likelihood and method-of-moments estimation
• Classical hypothesis testing (power functions, the Neyman–Pearson lemma, maximum likelihood ratio tests, Wald's test)
• Confidence intervals
• …
Course contents, cont.
• Statistical decision theory (loss functions, risk concepts, prior distributions, sequential tests)
• Bayesian inference (estimation, hypothesis testing, credible intervals, predictive distributions)
• Non-parametric inference
• Computer-intensive methods for estimation
Details about teaching and examination
• Teaching is (as usual) sparse: a mixture of lectures and problem seminars.
• Lectures: an overview and some details of each chapter covered. They do not cover the full contents!
• Problem seminars: discussions of solutions to recommended exercises. Students should be prepared to present solutions on the board!
• Towards the end of the course a couple of larger compulsory assignments (whose solutions need to be worked out with the help of a computer) will be distributed.
• The course finishes with a written exam.
Prerequisites
• Good understanding of calculus and algebra
• Good understanding of the concept of expectation (including variance calculations)
• Familiarity with families of probability distributions (normal, exponential, binomial, Poisson, gamma (chi-square), beta, …)
• Skills in computer programming (e.g. with R, SAS, Matlab)
Statistical inference in general
Population – Model – Sample
Conclusions about the population are drawn from the sample with the assistance of a specified model.
The two paradigms: Neyman–Pearson (frequentist) and Bayesian
Population – Model – Sample
• Neyman–Pearson: the model specifies the probability distribution for the data obtained in a sample, including a number of unknown population parameters.
• Bayesian: the model specifies the probability distribution for the data obtained in a sample and a probability distribution (prior) for each of the unknown population parameters of that distribution.
How is inference made?
• Point estimation: find the "best" approximation of an unknown population parameter.
• Interval estimation: find a range of values that with high certainty covers the unknown population parameter.
  – Can be extended to regions if the parameter is multidimensional.
• Hypothesis testing: make statements about the population (values of parameters, probability distributions, issues of independence, …) along with a quantitative measure of "certainty".
Tools for making inference
• Criteria for a point estimate to be "good"
• "Algorithmic" methods to find point estimates (maximum likelihood, least squares, method of moments)
• Classical methods of constructing hypothesis tests (the Neyman–Pearson lemma, maximum likelihood ratio tests, …)
• Classical methods of constructing confidence intervals (regions)
• Decision theory (making use of loss and risk functions, utility and cost) to find point estimates and hypothesis tests
• Using prior distributions to construct tests, credible intervals and predictive distributions (Bayesian inference)
Tools for making inference, cont.
• Using the theory of randomization to form non-parametric tests (tests not depending on any probability distribution behind the data)
• Computer-intensive methods (bootstrap and cross-validation techniques)
• Advanced models for data that make use of auxiliary information (explanatory variables): generalized linear models, generalized additive models, spatio-temporal models, …
The univariate population-sample model
• The population to be investigated is such that the values that come out in a sample x1, x2, … are governed by a probability distribution.
• The probability distribution is represented by a probability density (or mass) function f(x).
• Alternatively, the sample values can be seen as the outcomes of independent random variables X1, X2, …, all with probability density (or mass) function f(x).
Point estimation (frequentist paradigm)
• We have a sample x = (x1, …, xn) from a population.
• The population involves an unknown parameter θ.
• The functional forms of the distributional functions may be known or unknown, but they depend on the unknown θ. Denote generally by f(x; θ) the probability density or mass function of the distribution.
• A point estimate of θ is a function of the sample values,
  θ̂ = θ̂(x1, …, xn) = θ̂(x),
  such that its values should be close to the unknown θ.
"Standard" point estimates
• The sample mean x̄ is a point estimate of the population mean μ:
  x̄ = (1/n)·Σ_{i=1}^{n} xi = θ̂(x1, …, xn)
• The sample variance s² is a point estimate of the population variance σ²:
  s² = (1/(n−1))·Σ_{i=1}^{n} (xi − x̄)² = θ̂(x1, …, xn)
• The sample proportion p of a specific event (a specific value or range of values) is a point estimate of the corresponding population proportion π:
  p = #{xi : event is satisfied} / n = θ̂(x1, …, xn)
Assessing a point estimate
• A point estimate has a sampling distribution.
• Replace the sample observations x1, …, xn with their corresponding random variables X1, …, Xn in the functional expression:
  θ̂ = θ̂(X1, …, Xn)
• Thus the point estimate is a random variable that is observed in the sample (the point estimator).
• As a random variable, the point estimator has a probability distribution that can be deduced from f(x; θ).
• The point estimator/estimate is assessed by investigating its sampling distribution, in particular its mean and variance.
Unbiasedness
• A point estimator is unbiased for θ if the mean of its sampling distribution is equal to θ:
  E(θ̂) = E[θ̂(X1, …, Xn)] = θ
• The bias of a point estimator of θ is
  bias(θ̂) = E(θ̂) − θ
• Thus a point estimator with bias = 0 is unbiased; otherwise it is biased.
Examples (within the univariate population-sample model)
• The sample mean is always unbiased for estimating the population mean:
  E(X̄) = E[(1/n)·Σ_{i=1}^{n} Xi] = (1/n)·Σ_{i=1}^{n} E(Xi) = μ
• Is the sample mean an unbiased estimate of the population median?
• Why do we divide by n−1 in the sample variance (and not by n)?
  E[Σ_{i=1}^{n} (Xi − X̄)²] = Σ_{i=1}^{n} E(Xi²) − n·E(X̄²) = …
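The divide-by-(n−1) question can be probed empirically. The simulation below (an illustrative sketch, not from the slides; the sample size and repetition count are arbitrary choices) shows that dividing the sum of squared deviations by n systematically underestimates σ², while dividing by n−1 is unbiased on average.

```python
import random

random.seed(1)
n, reps, mu, sigma = 5, 200_000, 0.0, 2.0  # true sigma^2 = 4

sum_div_n1 = sum_div_n = 0.0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)  # sum of squared deviations
    sum_div_n1 += ss / (n - 1)          # unbiased version
    sum_div_n += ss / n                 # biased version

print(sum_div_n1 / reps)  # close to 4.0
print(sum_div_n / reps)   # close to (n-1)/n * 4 = 3.2, biased downwards
```

The biased version undershoots by exactly the factor (n−1)/n in expectation, which is what the expansion E[Σ(Xi − X̄)²] = Σ E(Xi²) − n·E(X̄²) delivers when worked out.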
Consistency
• A point estimator is (weakly) consistent if
  Pr(|θ̂ − θ| > ε) → 0 as n → ∞ for any ε > 0
• Thus the point estimator should converge in probability to θ.
• Theorem: a point estimator is consistent if
  bias(θ̂) → 0 and Var(θ̂) → 0 as n → ∞
• Proof: use Chebyshev's inequality in terms of
  MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + [bias(θ̂)]²
Examples
• The sample mean is a consistent estimator of the population mean. What probability law can be applied?
• What do we require for the sample variance to be a consistent estimator of the population variance?
  Var(s²) = [1/(n−1)]² · Var(Σ_{i=1}^{n} Xi² − n·X̄²)
          = [1/(n−1)]² · [Var(Σ_{i=1}^{n} Xi²) + n²·Var(X̄²) − 2n·Cov(Σ_{i=1}^{n} Xi², X̄²)] = …
Efficiency
• Assume we have two unbiased estimators of θ, i.e.
  θ̂1, θ̂2 : E(θ̂1) = E(θ̂2) = θ
• If Var(θ̂1) ≤ Var(θ̂2) for all θ, with strict inequality for at least one value of θ, then θ̂1 is said to be more efficient than θ̂2.
• The efficiency of an unbiased estimator θ̂j is defined as
  eff(θ̂j) = min_i Var(θ̂i) / Var(θ̂j) ≤ 1
Example
• Let θ̂1 = X̄ = (1/n)·Σ_{i=1}^{n} Xi, with n > 2, and θ̂2 = (X1 + Xn)/2.
  E(X1) = E(Xn) = μ ⇒ E(θ̂1) = μ; E(θ̂2) = μ
  ⇒ both estimators are unbiased.
  Var(θ̂1) = (1/n²)·Var(Σ_{i=1}^{n} Xi) = (1/n²)·n·σ² = σ²/n
  Var(θ̂2) = (1/4)·[Var(X1) + Var(Xn)] = (1/4)·2σ² = σ²/2 > σ²/n, since n > 2
  ⇒ θ̂1 is more efficient than θ̂2.
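The variance comparison in the example can be checked by simulation. The sketch below (illustrative only; the normal population and the constants are my own choices, not from the slides) estimates the sampling variance of both estimators.

```python
import random

random.seed(3)
n, reps, mu, sigma = 10, 100_000, 5.0, 1.0

est1, est2 = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    est1.append(sum(xs) / n)           # theta_hat1 = x-bar
    est2.append((xs[0] + xs[-1]) / 2)  # theta_hat2 = (x_1 + x_n)/2

def var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

print(var(est1))  # close to sigma^2/n  = 0.1
print(var(est2))  # close to sigma^2/2 = 0.5 -> theta_hat1 is more efficient
```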
Likelihood function
• For a sample x, the likelihood function for θ is defined as
  L(θ; x) = ∏_{i=1}^{n} f(xi; θ)
• The log-likelihood function is
  l(θ; x) = ln L(θ; x) = Σ_{i=1}^{n} ln f(xi; θ)
• Both measure how likely (or expected) the sample is.
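As a small sketch (my own illustration; the toy sample is hypothetical), the log-likelihood of an Exp(θ) sample, with mean-parameterized density f(x; θ) = (1/θ)·e^{−x/θ}, can be evaluated on a grid; its peak lands at the sample mean, anticipating the maximum-likelihood estimator.

```python
import math

def loglik(theta, xs):
    """l(theta; x) = sum_i ln f(x_i; theta) = -n*ln(theta) - sum(x_i)/theta."""
    return -len(xs) * math.log(theta) - sum(xs) / theta

xs = [0.5, 1.2, 2.3, 0.9, 1.6]            # toy sample; x-bar = 1.3
grid = [k / 10 for k in range(5, 51)]     # candidate theta values 0.5 .. 5.0
best = max(grid, key=lambda t: loglik(t, xs))
print(best)  # 1.3, the sample mean
```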
Fisher information
• The (Fisher) information about θ contained in a sample x is defined as
  I(θ) = E[(∂l(θ; X)/∂θ)²] = E[(∂l(θ; X1, …, Xn)/∂θ)²]
• Theorem: under some regularity conditions (interchangeability of integration and differentiation),
  I(θ) = −E[∂²l(θ; X)/∂θ²]
• In particular, the range of X cannot depend on θ (as it does, for example, in a population where X ~ U(0, θ)).
Why is it a measure of information about θ?
• L(θ; x) and l(θ; x) are related to the probability of the obtained sample. How does this probability change with θ? This is measured by ∂l/∂θ.
• If ∂l/∂θ is close to 0, the probability is not much affected by slight changes of θ ⇒ the sample does not contain much information about θ.
• If ∂l/∂θ is largely positive or largely negative, the probability changes a lot if θ is slightly changed.
• (∂l/∂θ)² measures the amount of information about θ in the particular sample.
• E[(∂l/∂θ)²] measures generally the amount of information about θ in a sample from the current distribution.
Example
• X ~ Exp(θ), i.e. f(x; θ) = (1/θ)·e^{−x/θ}; X fulfils the regularity conditions.
  L(θ; x) = ∏_{i=1}^{n} f(xi; θ) = ∏_{i=1}^{n} (1/θ)·e^{−xi/θ} = θ^{−n}·e^{−(1/θ)·Σ_{i=1}^{n} xi}
  l(θ; x) = ln L(θ; x) = −n·ln θ − (1/θ)·Σ_{i=1}^{n} xi
  ∂l/∂θ = −n/θ + (1/θ²)·Σ_{i=1}^{n} xi
  ∂²l/∂θ² = n/θ² − (2/θ³)·Σ_{i=1}^{n} xi
  ⇒ I(θ) = −E[∂²l/∂θ²] = −n/θ² + (2/θ³)·Σ_{i=1}^{n} E(Xi) = −n/θ² + (2/θ³)·n·θ = n/θ²
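The result I(θ) = n/θ² can be verified numerically. The sketch below (my own check, not from the slides; θ, n and the repetition count are arbitrary) averages −∂²l/∂θ² over simulated Exp(θ) samples.

```python
import random

random.seed(4)
theta, n, reps = 2.0, 8, 100_000

def d2_loglik(theta, xs):
    """Second derivative of l(theta; x) = -n*ln(theta) - sum(x_i)/theta."""
    return len(xs) / theta**2 - 2 * sum(xs) / theta**3

acc = 0.0
for _ in range(reps):
    xs = [random.expovariate(1 / theta) for _ in range(n)]
    acc += -d2_loglik(theta, xs)  # Monte-Carlo estimate of -E[d^2 l / d theta^2]

print(acc / reps)  # close to n/theta^2 = 8/4 = 2.0
```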
Cramér–Rao inequality
• Under the same regularity conditions as for the previous theorem, the following holds for any unbiased estimator θ̂:
  Var(θ̂) ≥ 1/I(θ)
• The lower bound is attained if and only if
  ∂l/∂θ = I(θ)·(θ̂ − θ)
Proof:
E(θ̂) = E[θ̂(X1, …, Xn)] = ∫_{y1} … ∫_{yn} θ̂(y1, …, yn)·f(y1; θ)···f(yn; θ) dy1 … dyn
  [note that f(y1; θ)···f(yn; θ) = f(y1, …, yn; θ)]
     = ∫_{y1} … ∫_{yn} θ̂(y1, …, yn)·L(θ; y1, …, yn) dy1 … dyn = ∫_y θ̂(y)·L(θ; y) dy
But as θ̂ is unbiased, E(θ̂) = θ, and so
∂E(θ̂)/∂θ = ∂/∂θ ∫_y θ̂(y)·L(θ; y) dy = ∫_y θ̂(y)·(∂L(θ; y)/∂θ) dy   [regularity conditions]
Now, since d ln g(x)/dx = g′(x)/g(x),
∂l(θ; y)/∂θ = (∂L(θ; y)/∂θ)/L(θ; y) ⇒ ∂L(θ; y)/∂θ = (∂l(θ; y)/∂θ)·L(θ; y)
⇒ ∂E(θ̂)/∂θ = ∫_y θ̂(y)·(∂l(θ; y)/∂θ)·L(θ; y) dy = E[θ̂(X)·∂l(θ; X)/∂θ]
E ˆ

   1
On t he ot her hand:



l  ; X  

 E ˆ X  
 1



T he t heoretcal
i correlat ion bet ween two variables U and V
CovU , V 
 U , V  
sat isfies  U , V   1
Var U   Var V 
 CovU , V  
Var U   Var V   CovU , V   Var U   Var V 
2
CovU , V   E U  V   E U   E V 
l  ; X 
Let U  ˆ and V 

l  ; y 
 l  ; X  
E ˆ   ; E 
 f  y1 ;     f  yn ; dy1  dyn 




 y


l  ; y 
y   L ; y dy 



 L ; y dy 
y
Regularit y





L

;
y
d
y


y 
condit ions

1  0

Department of Computer and Information Science (IDA)
Linköpings universitet, Sweden
April 13, 2015
28
l  ; X  
 l  ; X  

 l  ; X  
 Covˆ,
  E ˆ 
  E ˆ  E 










 1  0  1

 l  ; X  
 1  1  Var ˆ  Var
  Var ˆ




2
   l  ; X   2    l  ; X    2 

  E 

E
 




 


   
 
 


   l  ; X   2 

  l  ; X   2 
2 
ˆ


 Var ˆ   E  

0

Var


E
 
 

 






 



 


   l  ; X   2  

ˆ
 Var    E  
 

 


 
 



1
 I1
Department of Computer and Information Science (IDA)
Linköpings universitet, Sweden
April 13, 2015
29
Example
• X ~ Exp(θ):
  Var(X) = E(X²) − [E(X)]² = ∫₀^∞ x²·(1/θ)·e^{−x/θ} dx − θ²
         = [−x²·e^{−x/θ}]₀^∞ + ∫₀^∞ 2x·e^{−x/θ} dx − θ²
         = 0 + 2θ·∫₀^∞ x·(1/θ)·e^{−x/θ} dx − θ² = 2θ·θ − θ² = θ²
  ⇒ Var(X̄) = θ²/n = 1/I(θ)
  ⇒ X̄ as an estimator of θ attains the Cramér–Rao lower bound.
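The claim that X̄ attains the Cramér–Rao bound can be checked by simulation. In this sketch (illustrative only; θ, n and the repetition count are my own choices) the sampling variance of X̄ over many Exp(θ) samples is compared with θ²/n = 1/I(θ).

```python
import random

random.seed(5)
theta, n, reps = 3.0, 20, 50_000

means = []
for _ in range(reps):
    xs = [random.expovariate(1 / theta) for _ in range(n)]
    means.append(sum(xs) / n)  # x-bar for this sample

mbar = sum(means) / reps
var_xbar = sum((m - mbar) ** 2 for m in means) / (reps - 1)
print(var_xbar)  # close to theta^2/n = 9/20 = 0.45, the Cramer-Rao bound
```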
Sufficiency
• A function T of the sample values of a sample x, i.e. T = T(x) = T(x1, …, xn), is a statistic that is sufficient for the parameter θ if the conditional distribution of the sample random variables given T does not depend on θ, i.e.
  f_{X1,…,Xn | T(X1,…,Xn)=t}(y1, …, yn | t) cannot be written as a function of θ
• What does it mean in practice?
  – If T is sufficient for θ, then no more information about θ than what is contained in T can be obtained from the sample.
  – It is enough to work with T when deriving point estimates of θ.
Example
Assume x = (x1, x2) is a sample from Exp(θ).
Let T(x1, x2) = x1 + x2 and assume T = t is observed ⇒ x2 = t − x1.
f_{X1,X2,T}(y1, y2, t) = f_{X1,X2}(y1, t − y1) = θ⁻¹·e^{−y1/θ} · θ⁻¹·e^{−(t−y1)/θ} = θ⁻²·e^{−t/θ}
f_T(t)? Derive it by differentiating F_T(t) = Pr(T ≤ t):
Pr(T ≤ t) = Pr(X1 + X2 ≤ t) = Pr(X2 ≤ t − X1)
 = ∫_{y1=0}^{t} ∫_{y2=0}^{t−y1} θ⁻¹·e^{−y1/θ} · θ⁻¹·e^{−y2/θ} dy2 dy1
 = ∫_{y1=0}^{t} θ⁻¹·e^{−y1/θ} · [1 − e^{−(t−y1)/θ}] dy1
 = ∫_{y1=0}^{t} θ⁻¹·e^{−y1/θ} dy1 − ∫_{y1=0}^{t} θ⁻¹·e^{−t/θ} dy1
 = 1 − e^{−t/θ} − t·θ⁻¹·e^{−t/θ}
⇒ f_T(t) = d/dt [1 − e^{−t/θ} − t·θ⁻¹·e^{−t/θ}] = θ⁻¹·e^{−t/θ} − θ⁻¹·e^{−t/θ} + t·θ⁻²·e^{−t/θ} = t·θ⁻²·e^{−t/θ}
⇒ f_{X1,X2 | T=t}(y1, y2 | t) = θ⁻²·e^{−t/θ} / (t·θ⁻²·e^{−t/θ}) = 1/t, not depending on θ

The factorization theorem:
T is sufficient for θ if and only if the likelihood function can be written
  L(θ; x) = K1(T(x); θ) · K2(x),
i.e. it can be factorized into two non-negative functions such that the first depends on x only through the statistic T (and also on θ) and the second does not depend on θ.
Example, cont.
• X ~ Exp(θ). Let T(x) = Σ_{i=1}^{n} xi.
  L(θ; x) = ∏_{i=1}^{n} f(xi; θ) = ∏_{i=1}^{n} (1/θ)·e^{−xi/θ} = θ^{−n}·e^{−(1/θ)·Σ_{i=1}^{n} xi}
  = K1(Σ_{i=1}^{n} xi; θ) · K2(x), with K2(x) ≡ 1
  ⇒ Σ_{i=1}^{n} xi is sufficient for θ.
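The two-observation sufficiency example can also be seen empirically. The simulation below (my own sketch, not from the slides; t, the acceptance band, and the sample counts are arbitrary choices) keeps Exp(θ) pairs whose sum lands near a fixed t; conditional on T ≈ t, X1 behaves like a Uniform(0, t) variable with mean t/2 whatever θ is, so T carries all the information about θ.

```python
import random

random.seed(6)

def conditional_x1_mean(theta, t=4.0, band=0.05, want=2000):
    """Mean of X1 among Exp(theta) pairs whose sum X1 + X2 is within `band` of t."""
    kept = []
    while len(kept) < want:
        x1 = random.expovariate(1 / theta)
        x2 = random.expovariate(1 / theta)
        if abs((x1 + x2) - t) < band:
            kept.append(x1)
    return sum(kept) / len(kept)

m1 = conditional_x1_mean(theta=1.0)
m5 = conditional_x1_mean(theta=5.0)
print(m1, m5)  # both close to t/2 = 2.0, regardless of theta
```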