Introduction to Parametric Density Estimation

CHAPTER 4:
Parametric Methods
Parametric Estimation





Given a sample X = { x^t }t
Goal: infer the probability distribution p(x)
Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = { μ, σ² }
Problem: How can we obtain θ from X?
Assumption: X contains samples of a one-dimensional random variable
Later, multivariate estimation: X contains multiple measurements, not only a single one.
Example: Gaussian distribution
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
http://en.wikipedia.org/wiki/Normal_distribution
Maximum Likelihood Estimation


A density function p with parameters θ is given, and x^t ~ p(x|θ)
Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏t p(x^t|θ)
We look for the θ that “maximizes the likelihood of the sample”!

Log likelihood
L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)

Maximum likelihood estimator (MLE)
θ* = argmaxθ L(θ|X)
Homework: Sample: 0, 3, 3, 4, 5 and x ~ N(μ, σ²). Use MLE to find (μ, σ²)!
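Below is a minimal sketch (plain Python, standard library only) of what the Gaussian MLE works out to: maximizing L(θ|X) for N(μ, σ²) gives the sample mean and the 1/N sample variance. The function names and the sample values are illustrative, not the homework sample.

# A minimal sketch of Gaussian maximum likelihood estimation (plain Python).
# Maximizing L(mu, sigma^2 | X) for a univariate Gaussian yields the sample
# mean and the 1/N (biased) sample variance; the data below is illustrative.
import math

def gaussian_mle(xs):
    """Return the ML estimates (mu, sigma^2) for a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n                               # mu_ML = (1/N) * sum_t x^t
    var = sum((x - mu) ** 2 for x in xs) / n       # sigma^2_ML uses 1/N, not 1/(N-1)
    return mu, var

def log_likelihood(xs, mu, var):
    """L(theta|X) = sum_t log p(x^t | mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

sample = [1.0, 2.0, 2.0, 3.0]                      # illustrative, not the homework sample
mu_hat, var_hat = gaussian_mle(sample)
print(mu_hat, var_hat, log_likelihood(sample, mu_hat, var_hat))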
Bayes’ Estimator

Treat θ as a random variable with prior p(θ)
Bayes’ rule: p (θ|X) = p(X|θ) * p(θ) / p(X)
Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)
Maximum Likelihood (ML): θML = argmaxθ p(X|θ)

Bayes’ Estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Comments:
ML just takes the value of θ that maximizes the likelihood p(X|θ)
Compared with ML, MAP additionally considers the prior p(θ)
Bayes’ estimator averages over all possible values of θ, each weighted by how probable it is given the data (i.e., by the posterior distribution p(θ|X)).
For MAP see: http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
For comparison see: http://metaoptimize.com/qa/questions/7885/what-is-the-relationship-between-mle-map-em-point-estimation
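A hedged sketch of the three estimators side by side, for the special case of estimating a Gaussian mean with known variance σ² and a conjugate Gaussian prior N(μ0, s0²) (assumptions made here so that everything has a closed form; in this case the posterior is Gaussian, so MAP and the Bayes’ estimator coincide). The numbers are illustrative.

# Sketch: ML vs. MAP vs. Bayes' estimator for a Gaussian mean with known
# variance and a conjugate Gaussian prior N(mu0, s0^2). Illustrative values.
def estimators(xs, sigma2, mu0, s0_2):
    n = len(xs)
    xbar = sum(xs) / n
    theta_ml = xbar                                   # argmax_theta p(X|theta)
    # Posterior N(mu_n, s_n^2) via Bayes' rule with the conjugate prior:
    precision = n / sigma2 + 1.0 / s0_2
    mu_n = (n / sigma2 * xbar + mu0 / s0_2) / precision
    theta_map = mu_n                                  # argmax_theta p(theta|X) (posterior mode)
    theta_bayes = mu_n                                # E[theta|X] (posterior mean)
    return theta_ml, theta_map, theta_bayes

print(estimators([2.0, 3.0, 4.0], sigma2=1.0, mu0=0.0, s0_2=0.5))

With little data or a strong prior the MAP/Bayes estimates are pulled toward μ0; with lots of data all three approach the ML estimate.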
Parametric Classification
gi x   p x | Ci P Ci 
kind of p(Ci|x)
or equivalent ly
gi x   log p x | Ci   log P Ci 
2


1
x  i  
p x | Ci  
exp

2
2i
2i 


1
x  i 
gi x    log 2  log i 
 log P Ci 
2
2
2i
2
(Figure: Data → ML/MAP → P(x|Ci))
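A minimal sketch of evaluating the discriminant gi(x) = log p(x|Ci) + log P(Ci) with Gaussian class likelihoods and picking the class with the largest value. The class parameters and the test point below are illustrative.

# Sketch: evaluate g_i(x) = log p(x|C_i) + log P(C_i) for given (illustrative)
# class parameters and classify x by the largest discriminant value.
import math

def g(x, mu_i, var_i, prior_i):
    log_lik = -0.5 * math.log(2 * math.pi * var_i) - (x - mu_i) ** 2 / (2 * var_i)
    return log_lik + math.log(prior_i)

classes = {"C1": (2.0, 1.0, 0.6), "C2": (5.0, 2.0, 0.4)}  # (mu_i, sigma_i^2, P(C_i))
x = 3.1
scores = {c: g(x, *params) for c, params in classes.items()}
print(max(scores, key=scores.get), scores)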
Parametric Classification
gi x   p x | Ci P Ci 
kind of p(Ci|x)
or equivalent ly
gi x   log p x | Ci   log P Ci 
Using Bayes Theorem
P(C1|x)=P(C1)xP(x|C1)/P(x)
P(C2|x)=P(C2)xP(x|C2)/P(x)
As P(x) is the same in both formulas, we can drop it!
Given the sample X = { x^t, r^t }, t = 1, ..., N, where

ri^t = 1 if x^t ∈ Ci
ri^t = 0 if x^t ∈ Cj, j ≠ i

ML estimates are

P̂(Ci) = ∑t ri^t / N

mi = ∑t x^t ri^t / ∑t ri^t

si² = ∑t (x^t - mi)² ri^t / ∑t ri^t

Discriminant becomes

gi(x) = -(1/2) log 2π - log si - (x - mi)² / (2si²) + log P̂(Ci)
Equal variances: a single boundary halfway between the means
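A quick check of the equal-variance case (additionally assuming equal priors P(C1) = P(C2), an assumption made here only to get the simplest form):

g1(x) = g2(x)
⇒  -(x - m1)²/(2s²) + log P(C1) = -(x - m2)²/(2s²) + log P(C2)
⇒  (x - m1)² = (x - m2)²          (using P(C1) = P(C2))
⇒  x = (m1 + m2)/2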
Variances are different: two boundaries
Homework!
Model Selection
Remark: this will be discussed in more depth later (Topic 11).
Cross-validation: measure generalization accuracy by testing on data unused during training
Regularization: penalize complex models; E' = error on data + λ · model complexity
Akaike’s information criterion (AIC), Bayesian information criterion (BIC); see the sketch below
Minimum description length (MDL): Kolmogorov complexity, shortest description of the data
Structural risk minimization (SRM)
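A minimal sketch of the AIC/BIC item above, assuming the standard definitions AIC = 2k − 2 log L̂ and BIC = k log N − 2 log L̂ applied to a fitted 1-D Gaussian (k = 2 free parameters: μ and σ²). The data is illustrative.

# Sketch: AIC and BIC for a fitted 1-D Gaussian (illustrative data).
import math

def gaussian_fit_loglik(xs):
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

xs = [0.9, 1.2, 2.1, 2.4, 3.0]
k, n = 2, len(xs)
loglik = gaussian_fit_loglik(xs)
aic = 2 * k - 2 * loglik
bic = k * math.log(n) - 2 * loglik
print(aic, bic)            # lower values indicate a better complexity/fit trade-off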
CHAPTER 5:
Multivariate Methods
Normal Distribution: http://en.wikipedia.org/wiki/Normal_distribution
Z-score: see http://en.wikipedia.org/wiki/Standard_score
Multivariate Data



Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples
X = [ X1^1  X2^1  ...  Xd^1
      X1^2  X2^2  ...  Xd^2
      ...
      X1^N  X2^N  ...  Xd^N ]      (an N × d data matrix: N instances, d features)
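A tiny sketch (assuming NumPy) of this data matrix in code: an N × d array whose rows are the instances x^t and whose columns are the features X1, ..., Xd. The values are illustrative.

# Sketch: the N x d data matrix as a NumPy array (illustrative values).
import numpy as np

X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4],
              [5.9, 3.0, 5.1]])    # N = 4 instances, d = 3 features
N, d = X.shape
print(N, d, X[0])                  # X[t] is the t-th instance (a d-vector)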
Example:
  16   0   0
   0  16  -3
   0  -3   1
Multivariate Parameters
Mean : E x   μ  1 ,..., d 
T
Covariance : ij  CovX i , X j 
Correlatio n : Corr X i , X j   ij 

  CovX   E X  μ X  μ 
T

ij
i  j
  12  12   1d 


2
 21  2   2d 


 


2 
 d 1  d 2   d 
Correlation: http://en.wikipedia.org/wiki/Correlation
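A sketch (assuming NumPy) that treats the 3×3 example matrix above as a covariance matrix Σ (an assumption made here for illustration) and converts it to the correlation matrix via ρij = σij / (σi σj).

# Sketch: covariance -> correlation for the example matrix above.
import numpy as np

Sigma = np.array([[16.0,  0.0,  0.0],
                  [ 0.0, 16.0, -3.0],
                  [ 0.0, -3.0,  1.0]])
sd = np.sqrt(np.diag(Sigma))            # standard deviations sigma_i
R = Sigma / np.outer(sd, sd)            # correlation matrix, rho_ij
print(R)                                # e.g. rho_23 = -3 / (4 * 1) = -0.75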
Parameter Estimation
Sample mean m:
mi = (1/N) ∑_{t=1}^N xi^t ,  i = 1, ..., d

Covariance matrix S:
sij = (1/N) ∑_{t=1}^N (xi^t - mi)(xj^t - mj)
or  S = (1/N) ∑_{t=1}^N (x^t - m)(x^t - m)^T   with m = (m1, ..., md)^T

Correlation matrix R:
rij = sij / (si sj)
http://en.wikipedia.org/wiki/Multivariate_normal_distribution
http://webscripts.softpedia.com/script/Scientific-Engineering-Ruby/Statistics-and-Probability/Multivariate-Gaussian-Distribution-35454.html
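A sketch (assuming NumPy) of these estimators: sample mean m, covariance matrix S with the 1/N convention used on this slide (note that np.cov defaults to the 1/(N−1) convention), and the correlation matrix R. The data is illustrative.

# Sketch: sample mean, covariance (1/N convention), and correlation matrix.
import numpy as np

X = np.array([[2.0, 4.0], [3.0, 6.5], [4.0, 7.0], [5.0, 9.5]])  # N x d, illustrative
N, d = X.shape
m = X.mean(axis=0)                 # m_i = (1/N) sum_t x_i^t
D = X - m                          # deviations x^t - m
S = D.T @ D / N                    # s_ij = (1/N) sum_t (x_i^t - m_i)(x_j^t - m_j)
sd = np.sqrt(np.diag(S))
R = S / np.outer(sd, sd)           # r_ij = s_ij / (s_i * s_j)
print(m, S, R, sep="\n")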
Multivariate Normal Distribution
x ~ Nd(μ, Σ)

p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp( -(1/2) (x - μ)^T Σ⁻¹ (x - μ) )     (5.9)

Mahalanobis distance between x and μ: (x - μ)^T Σ⁻¹ (x - μ)
http://en.wikipedia.org/wiki/Mahalanobis_distance
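A sketch (assuming NumPy) that evaluates Eq. (5.9) directly; the squared Mahalanobis distance (x − μ)^T Σ⁻¹ (x − μ) appears in the exponent. The parameter values are illustrative.

# Sketch: evaluate the d-variate normal density, Eq. (5.9), with NumPy.
import numpy as np

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.inv(Sigma) @ diff        # squared Mahalanobis distance
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * maha2)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, -0.5])
print(mvn_pdf(x, mu, Sigma))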
Mahalanobis Distance
The Mahalanobis distance is based on correlations between variables, by which different patterns can be identified and analyzed. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant.

x ~ Nd(μ, Σ)
Mahalanobis distance between x and μ: (x - μ)^T Σ⁻¹ (x - μ)
(this is the quadratic form in the exponent of the density p(x) above)
http://www.analyzemath.com/Calculators/inverse_matrix_3by3.html
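A sketch (assuming NumPy) contrasting Euclidean and Mahalanobis distance: two points at the same Euclidean distance from μ can have very different Mahalanobis distances once the variances in Σ are taken into account. The values are illustrative.

# Sketch: Euclidean vs. Mahalanobis distance for an illustrative Sigma.
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[9.0, 0.0],
                  [0.0, 1.0]])            # large variance along the first axis
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x):
    diff = x - mu
    return float(np.sqrt(diff @ Sigma_inv @ diff))

a = np.array([3.0, 0.0])                  # far along the high-variance axis
b = np.array([0.0, 3.0])                  # same Euclidean distance, low-variance axis
print(np.linalg.norm(a - mu), mahalanobis(a))   # 3.0, 1.0
print(np.linalg.norm(b - mu), mahalanobis(b))   # 3.0, 3.0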
Multivariate Normal Distribution
Mahalanobis distance: (x - μ)^T Σ⁻¹ (x - μ)
measures the distance from x to μ in terms of Σ
(normalizes for differences in variances and correlations)

Bivariate case: d = 2

Σ = [ σ1²      ρσ1σ2
      ρσ1σ2    σ2²   ]

Remark: ρ is the correlation between the two variables

p(x1, x2) = 1 / (2π σ1 σ2 √(1 - ρ²)) · exp( -(1 / (2(1 - ρ²))) · (z1² - 2ρ z1 z2 + z2²) )

zi = (xi - μi) / σi   (called the z-score for xi)
Z-score: see http://en.wikipedia.org/wiki/Standard_score
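A sketch (assuming NumPy) checking that the bivariate formula above, written with z-scores and the correlation ρ, matches the general d = 2 density. The parameter values are illustrative.

# Sketch: bivariate normal via z-scores and rho vs. the general d = 2 formula.
import numpy as np

mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.6

def bivariate_pdf(x1, x2):
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2           # z-scores
    coef = 1.0 / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho ** 2))
    quad = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (1 - rho ** 2)
    return coef * np.exp(-0.5 * quad)

def general_pdf(x):
    mu = np.array([mu1, mu2])
    Sigma = np.array([[s1 ** 2, rho * s1 * s2],
                      [rho * s1 * s2, s2 ** 2]])
    diff = x - mu
    maha2 = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * maha2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

print(bivariate_pdf(1.5, -1.8), general_pdf(np.array([1.5, -1.8])))  # should agree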
Bivariate Normal
Model Selection
Assumption                   Covariance matrix        No. of parameters
Shared, Hyperspheric         Si = S = s²I             1
Shared, Axis-aligned         Si = S, with sij = 0     d
Shared, Hyperellipsoidal     Si = S                   d(d+1)/2
Different, Hyperellipsoidal  Si                       K · d(d+1)/2
As we increase complexity (a less restricted S), bias decreases and variance increases.
Assume simple models (allow some bias) to control variance (regularization); see the sketch below for the parameter counts.
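A small sketch of the parameter counts in the table above as a function of the dimensionality d and the number of classes K; the function name and the example values are illustrative.

# Sketch: number of covariance parameters under each assumption in the table.
def covariance_parameter_counts(d, K):
    return {
        "shared, hyperspheric (S_i = S = s^2 I)": 1,
        "shared, axis-aligned (diagonal S)": d,
        "shared, hyperellipsoidal (full S)": d * (d + 1) // 2,
        "different, hyperellipsoidal (full S_i per class)": K * d * (d + 1) // 2,
    }

print(covariance_parameter_counts(d=5, K=3))   # e.g. 1, 5, 15, 45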