Lecture 6: Statistics: Learning models from data
DS GA 1002 Statistical and Mathematical Models
http://www.cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15
Carlos Fernandez-Granda
10/19/2015
Learning a model
Of interest in itself: characterization of the data
Useful to perform inference
Models may be parametric (fit using a frequentist or a Bayesian framework) or nonparametric
Parametric models
Assumption: Data sampled from known distribution with a
small number of unknown parameters
Justification: Theoretical (Central Limit Theorem), empirical . . .
Parametric models: Frequentist framework
Parametric models: Bayesian framework
Nonparametric models
Likelihood
Aim: Find parameters that fit the data best
Criterion: What value of the parameters makes the data most likely?
For a fixed vector of iid data x the likelihood is defined as
L_x(θ) := ∏_{i=1}^n p_{X_i}(x_i, θ)
if the distribution of the data is discrete, and
L_x(θ) := ∏_{i=1}^n f_{X_i}(x_i, θ)
if the distribution is continuous
Maximum-likelihood estimator
The likelihood is a function of the parameters θ
For fixed θ the likelihood is the pmf/pdf evaluated at the data
It quantifies how likely the data are according to the model
Maximum-likelihood (ML) estimator:
θ̂_ML(x) := arg max_θ L_x(θ) = arg max_θ log L_x(θ)
Maximizing the log-likelihood is equivalent, and often more convenient
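Since the logarithm is monotone, both criteria select the same parameter. A minimal numerical sketch (not from the slides; the simulated data and parameter grid are made up for illustration) for Bernoulli observations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)  # simulated iid Bernoulli(0.3) data

theta = np.linspace(0.01, 0.99, 99)  # grid of candidate parameter values
n1 = int(x.sum())
n0 = len(x) - n1

likelihood = theta**n1 * (1 - theta)**n0                      # L_x(theta)
log_likelihood = n1 * np.log(theta) + n0 * np.log(1 - theta)  # log L_x(theta)

# The likelihood and the log-likelihood peak at the same theta
assert np.argmax(likelihood) == np.argmax(log_likelihood)
print(theta[np.argmax(log_likelihood)])  # matches n1 / (n0 + n1)
```

The log-likelihood is preferred in practice because products of many small probabilities underflow, while their logarithms sum safely.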
Examples
Parameter of a Bernoulli:
p̂_ML = n₁ / (n₀ + n₁)
Parameters of a Gaussian:
µ̂_ML = (1/n) ∑_{i=1}^n x_i
σ̂²_ML = (1/n) ∑_{i=1}^n (x_i − µ̂_ML)²
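As a quick check (a sketch with simulated data, not part of the slides), the closed-form Gaussian estimates can be computed directly; note that σ̂²_ML divides by n, not n − 1:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=68.0, scale=3.0, size=1000)  # simulated heights (inches)

mu_ml = x.mean()                     # (1/n) * sum of x_i
sigma2_ml = ((x - mu_ml)**2).mean()  # (1/n) * sum of (x_i - mu_ml)^2

print(mu_ml, sigma2_ml)  # close to the true values 68 and 9
```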
Example: Fitting a Gaussian
[Figure: Gaussian fits to height data (inches), for n = 20 and n = 1000.]
Parametric models: Frequentist framework
Parametric models: Bayesian framework
Nonparametric models
Bayesian framework
Frequentist statistics: parameters are deterministic
Bayesian statistics: parameters are random with known distribution
Greater modeling flexibility, but stronger assumptions
Bayesian framework
Two modeling choices:
1. Prior distribution of the parameters
2. Conditional distribution of the data given the parameters
This is the likelihood!
Aim: Determine the posterior distribution of the parameters given the data
Posterior distribution
By Bayes’ Rule, if the data are discrete
p_{Θ|X}(θ|x) = p_Θ(θ) p_{X|Θ}(x|θ) / ∑_u p_Θ(u) p_{X|Θ}(x|u)
for discrete parameters and
f_{Θ|X}(θ|x) = f_Θ(θ) p_{X|Θ}(x|θ) / ∫ f_Θ(u) p_{X|Θ}(x|u) du
for continuous parameters
Posterior distribution
By Bayes’ Rule, if the data are continuous
p_{Θ|X}(θ|x) = p_Θ(θ) f_{X|Θ}(x|θ) / ∑_u p_Θ(u) f_{X|Θ}(x|u)
for discrete parameters and
f_{Θ|X}(θ|x) = f_Θ(θ) f_{X|Θ}(x|θ) / ∫ f_Θ(u) f_{X|Θ}(x|u) du
for continuous parameters
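For a discrete parameter this is a finite renormalization of prior times likelihood. A minimal sketch (the three-point prior and the observed data are made up for illustration) with a Bernoulli likelihood:

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.8])    # possible parameter values
prior = np.array([0.25, 0.5, 0.25])  # p_Theta

x = np.array([1, 1, 0, 1, 1])        # observed data
n1 = int(x.sum())
n0 = len(x) - n1
likelihood = theta**n1 * (1 - theta)**n0  # p_{X|Theta}(x | theta)

posterior = prior * likelihood
posterior /= posterior.sum()  # divide by the sum over all parameter values
print(posterior)  # most of the mass moves to theta = 0.8
```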
Credible intervals
Once we know the posterior, we can compute 1 − α credible intervals
Now we can talk about the probability of the parameter belonging
to a fixed interval
Unlike frequentist confidence intervals, credible intervals depend
on the choice of prior
Conditional mean
We often need a point estimate of the parameter
The mean of the posterior distribution is optimal if our criterion is
mean squared error
E[Θ | X] = arg min_{θ̂(X)} E[(θ̂(X) − Θ)²]
Example
Parameter of a Bernoulli
[Figure: prior distributions (uniform and skewed) on [0, 1].]
[Figure: posteriors for n₀ = 91, n₁ = 9; n₀ = 1, n₁ = 3; and n₀ = 3, n₁ = 1, marking the posterior mean under the uniform prior, the posterior mean under the skewed prior, and the ML estimator.]
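For this Bernoulli example the posterior has a standard closed form: with a Beta(a, b) prior the posterior is Beta(a + n₁, b + n₀), whose mean is (a + n₁)/(a + b + n₀ + n₁); the uniform prior is the special case Beta(1, 1). A sketch comparing the posterior mean with the ML estimator:

```python
def posterior_mean(n0, n1, a=1.0, b=1.0):
    """Mean of the Beta(a + n1, b + n0) posterior; a = b = 1 is the uniform prior."""
    return (a + n1) / (a + b + n0 + n1)

def ml_estimate(n0, n1):
    return n1 / (n0 + n1)

for n0, n1 in [(91, 9), (1, 3), (3, 1)]:
    print(n0, n1, posterior_mean(n0, n1), ml_estimate(n0, n1))
# With little data the posterior mean is pulled toward 1/2;
# with n0 + n1 = 100 it is close to the ML estimate.
```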
Bayesian interpretation of ML estimator
The ML estimator is the mode of the posterior if the prior is uniform
Uniform priors are only possible if the parameter space is bounded
Parametric models: Frequentist framework
Parametric models: Bayesian framework
Nonparametric models
Nonparametric methods
A parametric model may not be available or may not fit the data well
Alternative: estimate the distribution directly from the data
Very challenging: infinitely many different distributions could have
generated the data
Empirical cdf
Let X1 , X2 , . . . be an iid sequence with cdf FX
The empirical cdf at x ∈ R is
F̂_n(x) := (1/n) ∑_{i=1}^n 1_{X_i ≤ x}
F̂_n is an unbiased and consistent estimator of F_X
F̂_n → F_X in mean square as n → ∞
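A direct implementation of the definition (a sketch; the height values below are made up for illustration):

```python
import numpy as np

def empirical_cdf(samples, x):
    """F_n(x): fraction of samples at or below x, i.e. (1/n) * sum of 1{X_i <= x}."""
    samples = np.asarray(samples)
    x = np.atleast_1d(x)
    return np.mean(samples[:, None] <= x, axis=0)

heights = np.array([63.1, 65.2, 66.0, 67.4, 68.0, 68.9, 70.1, 71.3, 72.0, 74.5])
print(empirical_cdf(heights, [66.0, 70.0]))  # [0.3 0.6]
```

The result is a step function that jumps by 1/n at each sample, as in the figures below.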
Example: Heights, n = 10
[Figure: true cdf and empirical cdf of height data (inches), n = 10.]
Example: Heights, n = 100
[Figure: true cdf and empirical cdf of height data (inches), n = 100.]
Estimating the pdf
Let X1 , X2 , . . . be an iid sequence with pdf fX
The kernel density estimator of f_X at x ∈ R is
f̂_{h,n}(x) := (1/(nh)) ∑_{i=1}^n k((x − X_i)/h),
where k is a kernel with maximum at 0 that decreases away from 0 and satisfies
k(x) ≥ 0 for all x ∈ R
∫_R k(x) dx = 1
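A minimal sketch with a Gaussian kernel, which is nonnegative, peaks at 0, and integrates to 1 (the simulated data are made up for illustration):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal pdf: k(u) >= 0, maximum at 0, integrates to 1."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def kde(samples, x, h):
    """f_hat_{h,n}(x) = (1/(n*h)) * sum of k((x - X_i) / h)."""
    samples = np.asarray(samples)
    u = (np.atleast_1d(x)[:, None] - samples) / h
    return gaussian_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=500)
grid = np.linspace(-3, 3, 61)
estimate = kde(data, grid, h=0.25)
print(estimate.max())  # roughly 0.4, the peak of the N(0, 1) pdf
```

Smaller bandwidths h track the data more closely but produce noisier estimates, as the abalone figures below illustrate.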
Example: Abalone weights
[Figure: kernel density estimates of abalone weight (grams) with bandwidths 0.05, 0.25, and 0.5, compared to the true pdf.]
Example: Abalone weights
[Figure: estimated pdf of abalone weight (grams).]