Topic 5

CHAPTER 4:
Parametric Methods
Parametric Estimation





Given a sample X = { x^t }, t = 1, ..., N,
goal: infer the probability distribution p(x).
Parametric estimation:
Assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = { μ, σ² }
Problem: How can we obtain θ from X?
Assumption: X contains samples of a one-dimensional random variable.
Later, multivariate estimation: X contains multiple measurements, not only a single one.
Example: Gaussian Distribution
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
http://en.wikipedia.org/wiki/Normal_distribution
Maximum Likelihood Estimation


The density function p with parameters θ is given, and x^t ~ p(x | θ).
Likelihood of θ given the sample X:
l(θ | X) = p(X | θ) = ∏t p(x^t | θ)
We look for the θ that "maximizes the likelihood of the sample"!

Log likelihood
L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)

Maximum likelihood estimator (MLE)
θ* = argmaxθ L(θ|X)
Homework: Sample: 0, 3, 3, 4, 5 and x~N(,)? Use MLE to find(,)!
Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p₀^x (1 – p₀)^(1 – x)
L(p₀ | X) = log ∏t p₀^(x^t) (1 – p₀)^(1 – x^t)
MLE: p₀ = ∑t x^t / N

Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x₁, x₂, ..., x_K) = ∏i p_i^(x_i)
L(p₁, p₂, ..., p_K | X) = log ∏t ∏i p_i^(x_i^t)
MLE: p_i = ∑t x_i^t / N
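A minimal NumPy sketch of these closed-form estimates; the 0/1 and 1-of-K samples below are made-up illustrations, not from the lecture:

```python
import numpy as np

# Bernoulli MLE: p0 = sum_t x^t / N (fraction of successes)
x = np.array([1, 0, 1, 1, 0, 1])          # hypothetical 0/1 sample
p0_hat = x.mean()

# Multinomial MLE: p_i = sum_t x_i^t / N, with each x^t one-hot over K states
X = np.array([[1, 0, 0],                  # hypothetical 1-of-K encoded sample
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
p_hat = X.mean(axis=0)                    # per-state relative frequencies

print(p0_hat, p_hat)
```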
Gaussian (Normal) Distribution

p x  
 x   2 
1
exp
2
2
 2 
 x   2 
1
p x  
exp

2
2 
2


μ
σ
http://en.wikipedia.org/wiki/Probability_density_function
p(x) = N ( μ, σ2)
MLE for μ and σ2:
m
s2 
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
t
x

t
N
 x
t
m

2
t
N
5
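A minimal NumPy sketch of these two estimates, applied to the homework sample 0, 3, 3, 4, 5 from the earlier slide; note that the MLE variance divides by N, not N – 1:

```python
import numpy as np

x = np.array([0.0, 3.0, 3.0, 4.0, 5.0])   # homework sample

m = x.mean()                              # m = sum_t x^t / N
s2 = np.mean((x - m) ** 2)                # s^2 = sum_t (x^t - m)^2 / N  (ddof = 0)

print(m, s2)
```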
Bias and Variance
Unknown parameter θ
Estimator di = d (Xi) on sample Xi
Bias: b_θ(d) = E[d] – θ
Variance: E[(d – E[d])²]
Mean square error of the estimator d:
r(d, θ) = E[(d – θ)²]
        = (E[d] – θ)² + E[(d – E[d])²]
        = Bias² + Variance
(Bias²: error in the model itself; Variance: variation/randomness of the estimate.)
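This decomposition can be checked empirically; a small simulation sketch, assuming we estimate the variance of a standard normal with the divide-by-N (MLE) estimator, which is biased:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                     # true parameter: variance of N(0, 1)
N, M = 10, 100_000              # sample size and number of repeated samples

estimates = np.empty(M)
for i in range(M):
    x = rng.normal(0.0, 1.0, N)
    estimates[i] = np.mean((x - x.mean()) ** 2)   # d(X_i): MLE variance (biased)

bias = estimates.mean() - theta
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)
print(bias, variance, bias**2 + variance, mse)    # bias^2 + variance matches MSE
```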
Bayes’ Estimator

Treat θ as a random variable with prior p(θ).
Bayes' rule: p(θ | X) = p(X | θ) p(θ) / p(X)
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ | X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X | θ)
Bayes' estimator: θ_Bayes = E[θ | X] = ∫ θ p(θ | X) dθ

Comments:
- ML just takes the maximum value of the density function.
- Compared with ML, MAP additionally considers the prior.
- The Bayes' estimator averages over all possible values of θ, weighted by how probable each value is given the data (the posterior p(θ | X)).
For MAP see: http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
For comparison see: http://metaoptimize.com/qa/questions/7885/what-is-the-relationship-between-mle-map-em-point-estimation
Skip today
Bayes’ Estimator: Example

x^t ~ N(θ, σ₀²) and θ ~ N(μ, σ²)

θ_ML = m
θ_MAP = θ_Bayes = E[θ | X] = [ (N/σ₀²) / (N/σ₀² + 1/σ²) ] m + [ (1/σ²) / (N/σ₀² + 1/σ²) ] μ

As N grows (or as the prior variance σ² grows), the estimate converges to m.
Parametric Classification
gi x   p x | Ci P Ci 
or equivalent ly
kind of p(Ci|x)
gi x   log p x | Ci   log P Ci 
2


1
x  i  
p x | Ci  
exp

2
2i
2i 


1
x  i 
gi x    log 2  log i 
 log P Ci 
2
2
2i
2
(Diagram: Data → ML/MAP → P(x | Ci).)
Parametric Classification
gi x   p x | Ci P Ci 
or equivalent ly
kind of p(Ci|x)
gi x   log p x | Ci   log P Ci 
Using Bayes Theorem
P(C1|x)=P(C1)xP(x|C1)/P(x)
P(C2|x)=P(C2)xP(x|C2)/P(x)
As P(x) is the same in both formulas, we can drop it!

Given the sample X = { x^t, r^t }, t = 1, ..., N, where

r_i^t = 1 if x^t ∈ Ci, and r_i^t = 0 if x^t ∈ Cj, j ≠ i

ML estimates are:
P̂(Ci) = ∑t r_i^t / N
m_i = ∑t x^t r_i^t / ∑t r_i^t
s_i² = ∑t (x^t – m_i)² r_i^t / ∑t r_i^t

The discriminant becomes:
gi(x) = –(1/2) log 2π – log s_i – (x – m_i)² / (2 s_i²) + log P̂(Ci)
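A compact sketch of this recipe for two classes in 1-D; the labelled training values are hypothetical:

```python
import numpy as np

# hypothetical labelled 1-D training data
x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3, 3.0])
r = np.array([0,   0,   0,   1,   1,   1,   1])   # class index per sample

def fit_class(x, r, i):
    xi = x[r == i]
    prior = len(xi) / len(x)              # P_hat(Ci)
    m = xi.mean()                         # m_i
    s2 = np.mean((xi - m) ** 2)           # s_i^2 (MLE, divide by class count)
    return prior, m, s2

def g(xnew, prior, m, s2):
    # g_i(x) = -1/2 log 2*pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(Ci)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
            - (xnew - m) ** 2 / (2 * s2) + np.log(prior))

params = [fit_class(x, r, i) for i in (0, 1)]
xnew = 2.0
scores = [g(xnew, *p) for p in params]
print(int(np.argmax(scores)))             # predicted class for xnew
```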
Equal variances: a single boundary at halfway between the means.
Variances are different: two boundaries.
Homework!
Regression
r = f(x) + ε, where ε ~ N(0, σ²)
estimator for r: g(x | θ)
p(r | x) ~ N(g(x | θ), σ²)

Maximizing the probability of the sample again:
L(θ | X) = log ∏(t=1..N) p(x^t, r^t)
         = log ∏(t=1..N) p(r^t | x^t) + log ∏(t=1..N) p(x^t)
Skip to 20!
Regression: From LogL to Error
L(θ | X) = log ∏(t=1..N) [1 / (√(2π) σ)] exp[ –(r^t – g(x^t | θ))² / (2σ²) ]
         = –N log(√(2π) σ) – [1 / (2σ²)] ∑(t=1..N) [r^t – g(x^t | θ)]²

E(θ | X) = (1/2) ∑(t=1..N) [r^t – g(x^t | θ)]²
Linear Regression
g x | w , w   w x  w
t
t
1
0
1
0
t
t
r

Nw

w
x

0
1
t
t
r x
t
t
t
 N

A
t
x

t
 
 w 0  x  w1  x
t
t
t 2
t
t
t

x


r
t 

w 0 
 t

w

y

w 
t t
t 2

r
x

 1
t x 
 t

w  A 1 y
 
Relationship to what we discussed in Topic 2??
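A minimal sketch solving w = A⁻¹ y for the two unknowns; the x and r arrays are made up:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # hypothetical inputs x^t
r = np.array([1.1, 2.9, 5.2, 7.1, 8.8])      # hypothetical targets r^t
N = len(x)

A = np.array([[N,        x.sum()],
              [x.sum(),  (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)               # w = A^{-1} y
print(w0, w1)
```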
Polynomial Regression
Here we get k+1 equations with k+1 unknowns!


g(x^t | w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0

With the N × (k+1) design matrix D and the target vector r,

D = [ 1   x^1   (x^1)²   ...   (x^1)^k ]      r = [ r^1 ]
    [ 1   x^2   (x^2)²   ...   (x^2)^k ]          [ r^2 ]
    [ ...                              ]          [ ... ]
    [ 1   x^N   (x^N)²   ...   (x^N)^k ]          [ r^N ]

the least-squares solution is w = (Dᵀ D)⁻¹ Dᵀ r.
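The same solution in matrix form; a sketch using a Vandermonde-style design matrix D (data again hypothetical):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])     # hypothetical inputs
r = np.array([0.1, 0.4, 1.1, 2.2, 4.1, 6.3])     # hypothetical targets
k = 2                                            # polynomial degree

D = np.vander(x, k + 1, increasing=True)         # rows: [1, x^t, (x^t)^2, ..., (x^t)^k]
w = np.linalg.solve(D.T @ D, D.T @ r)            # w = (D^T D)^{-1} D^T r
print(w)                                         # [w0, w1, ..., wk]
```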
Other Error Measures

Square error:
E(θ | X) = (1/2) ∑(t=1..N) [r^t – g(x^t | θ)]²

Relative square error:
E(θ | X) = ∑(t=1..N) [r^t – g(x^t | θ)]² / ∑(t=1..N) [r^t – r̄]²   (r̄ is the mean of the r^t)

Absolute error:
E(θ | X) = ∑t |r^t – g(x^t | θ)|

ε-sensitive error:
E(θ | X) = ∑t 1(|r^t – g(x^t | θ)| > ε) · (|r^t – g(x^t | θ)| – ε)
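The four measures written as small functions; a sketch assuming NumPy arrays r (targets) and y = g(x^t | θ) (predictions):

```python
import numpy as np

def square_error(r, y):
    return 0.5 * np.sum((r - y) ** 2)

def relative_square_error(r, y):
    return np.sum((r - y) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, y):
    return np.sum(np.abs(r - y))

def eps_sensitive_error(r, y, eps):
    d = np.abs(r - y)
    return np.sum((d > eps) * (d - eps))   # zero inside the eps-tube
```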
Bias and Variance

 

E r  g x  | x  E r  E r | x  | x  E r | x   g x 
2
2
noise


2
squared error

E X E r | x   gx  | x  E r | x   E X gx   E X gx   E X gx 
2
2
bias
2
variance
To be revisited next week!
Estimating Bias and Variance

M samples Xi = { x_i^t, r_i^t }, i = 1, ..., M, are used to fit gi(x), i = 1, ..., M.

ḡ(x) = (1/M) ∑i gi(x)

Bias²(g) = (1/N) ∑t [ ḡ(x^t) – f(x^t) ]²
Variance(g) = (1/(N M)) ∑t ∑i [ gi(x^t) – ḡ(x^t) ]²
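A sketch of these two estimates; the target f, the noise level, and the degree-1 fitted model gi are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)                  # assumed "true" function
x_eval = np.linspace(0, 3, 50)               # the N evaluation points x^t
M, n_train, noise = 200, 20, 0.3

preds = np.empty((M, x_eval.size))
for i in range(M):
    xt = rng.uniform(0, 3, n_train)
    rt = f(xt) + rng.normal(0, noise, n_train)
    w = np.polyfit(xt, rt, deg=1)            # fit g_i on sample X_i
    preds[i] = np.polyval(w, x_eval)

g_bar = preds.mean(axis=0)                   # (1/M) sum_i g_i(x)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)    # (1/N) sum_t [g_bar - f]^2
variance = np.mean((preds - g_bar) ** 2)     # (1/(N M)) sum_t sum_i [g_i - g_bar]^2
print(bias2, variance)
```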
Bias/Variance Dilemma

Example: gi(x) = 2 has no variance and high bias; gi(x) = ∑t r_i^t / N has lower bias but some variance.
As we increase complexity,
bias decreases (a better fit to the data) and
variance increases (the fit varies more with the data).
Bias/Variance dilemma: (Geman et al., 1992)
(Figure: the target function f, the fitted functions gi and their average ḡ, illustrating bias and variance.)
Polynomial Regression (figure): best fit = "min error".
Model Selection
Remark: will be discussed in more depth later: Topic 11




Cross-validation: measure generalization accuracy by testing on data unused during training.
Regularization: penalize complex models; E' = error on data + λ · model complexity.
Akaike's information criterion (AIC), Bayesian information criterion (BIC).
Minimum description length (MDL): Kolmogorov complexity, shortest description of the data.
Structural risk minimization (SRM).
Bayesian Model Selection

Prior on models, p(model):
p(model | data) = p(data | model) p(model) / p(data)

Regularization, when the prior favors simpler models.
Bayes: MAP of the posterior, p(model | data).
Average over a number of models with high posterior (voting, ensembles: Chapter 15).
CHAPTER 5:
Multivariate Methods
Normal Distribution: http://en.wikipedia.org/wiki/Normal_distribution
Z-score: see http://en.wikipedia.org/wiki/Standard_score
Multivariate Data



Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

    [ X_1^1   X_2^1   ...   X_d^1 ]
X = [ X_1^2   X_2^2   ...   X_d^2 ]
    [ ...                         ]
    [ X_1^N   X_2^N   ...   X_d^N ]
Example (a covariance matrix):
[ 16    0    0 ]
[  0   16   -3 ]
[  0   -3    1 ]
Multivariate Parameters
Mean: E[x] = μ = [μ1, ..., μd]ᵀ
Covariance: σij = Cov(Xi, Xj)
Correlation: Corr(Xi, Xj) = ρij = σij / (σi σj)

                                  [ σ1²   σ12   ...   σ1d ]
Σ = Cov(X) = E[(X – μ)(X – μ)ᵀ] = [ σ21   σ2²   ...   σ2d ]
                                  [ ...                   ]
                                  [ σd1   σd2   ...   σd² ]
Correlation: http://en.wikipedia.org/wiki/Correlation
Parameter Estimation
Sample mean m:
m_i = ∑(t=1..N) x_i^t / N,  i = 1, ..., d

Covariance matrix S:
s_ij = ∑(t=1..N) (x_i^t – m_i)(x_j^t – m_j) / N
or S = ∑(t=1..N) (x^t – m)(x^t – m)ᵀ / N, with m = (m1, ..., md)ᵀ

Correlation matrix R:
r_ij = s_ij / (s_i s_j)
http://en.wikipedia.org/wiki/Multivariate_normal_distribution
http://webscripts.softpedia.com/script/Scientific-Engineering-Ruby/Statistics-and-Probability/Multivariate-Gaussian-Distribution-35454.html
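A sketch of these estimates in NumPy; X is an assumed N × d data matrix, and the cross-checks note that np.cov divides by N – 1 unless bias=True:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                 # hypothetical N x d data matrix

m = X.mean(axis=0)                            # sample mean vector
S = (X - m).T @ (X - m) / len(X)              # covariance matrix, divided by N
R = S / np.outer(np.sqrt(np.diag(S)), np.sqrt(np.diag(S)))   # correlation matrix

# cross-check against library routines
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
assert np.allclose(R, np.corrcoef(X, rowvar=False))
print(m, S, R, sep="\n")
```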
Multivariate Normal Distribution
x ~ N_d(μ, Σ)

p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ –(1/2) (x – μ)ᵀ Σ⁻¹ (x – μ) ]    (5.9)

The quadratic form (x – μ)ᵀ Σ⁻¹ (x – μ) is the Mahalanobis distance between x and μ.
http://en.wikipedia.org/wiki/Mahalanobis_distance
Mahalanobis Distance
The Mahalanobis distance is based on the correlations between variables, by which different patterns can be identified and analyzed. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant.
x ~ N_d(μ, Σ), with Mahalanobis distance (x – μ)ᵀ Σ⁻¹ (x – μ) between x and μ:

p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ –(1/2) (x – μ)ᵀ Σ⁻¹ (x – μ) ]
http://www.analyzemath.com/Calculators/inverse_matrix_3by3.html
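A small sketch of the squared Mahalanobis distance and the density it appears in; μ, Σ and x are arbitrary assumed values:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])              # assumed covariance matrix
x = np.array([1.0, -1.0])

diff = x - mu
maha2 = diff @ np.linalg.inv(Sigma) @ diff  # (x - mu)^T Sigma^{-1} (x - mu)

d = len(mu)
p = np.exp(-0.5 * maha2) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
print(maha2, p)
```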
Multivariate Normal Distribution


Mahalanobis distance: (x – μ)ᵀ Σ⁻¹ (x – μ)
measures the distance from x to μ in terms of Σ (normalizes for differences in variances and correlations).

Bivariate case, d = 2:
Σ = [ σ1²       ρ σ1 σ2 ]
    [ ρ σ1 σ2   σ2²     ]
Remark: ρ is the correlation between the two variables.

p(x1, x2) = [1 / (2π σ1 σ2 √(1 – ρ²))] exp[ –(1/(2(1 – ρ²))) (z1² – 2ρ z1 z2 + z2²) ]

where zi = (xi – μi) / σi
Z-score: see http://en.wikipedia.org/wiki/Standard_score
zi is called the z-score for xi.
Bivariate Normal
Independent Inputs: Naive Bayes

If the xi are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σi) Euclidean distance:

p(x) = ∏(i=1..d) pi(xi) = [1 / ((2π)^(d/2) ∏(i=1..d) σi)] exp[ –(1/2) ∑(i=1..d) ((xi – μi) / σi)² ]

If the variances are also equal, this reduces to the Euclidean distance.
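A sketch confirming that, with a diagonal Σ, the joint density is the product of the d univariate densities (all values assumed):

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([1.0, 2.0, 0.5])           # per-dimension std devs (diagonal Sigma)
x = np.array([0.3, 0.0, -0.8])

z = (x - mu) / sigma
p_product = np.prod(np.exp(-0.5 * z ** 2) / (np.sqrt(2 * np.pi) * sigma))

# same value written as the d-variate formula with diagonal Sigma
d = len(x)
p_joint = np.exp(-0.5 * np.sum(z ** 2)) / ((2 * np.pi) ** (d / 2) * np.prod(sigma))
assert np.isclose(p_product, p_joint)
print(p_joint)
```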
Parametric Classification

If p(x | Ci) ~ N_d(μi, Σi):

p(x | Ci) = [1 / ((2π)^(d/2) |Σi|^(1/2))] exp[ –(1/2) (x – μi)ᵀ Σi⁻¹ (x – μi) ]

Discriminant functions are
gi(x) = log p(x | Ci) + log P(Ci)
      = –(d/2) log 2π – (1/2) log|Σi| – (1/2) (x – μi)ᵀ Σi⁻¹ (x – μi) + log P(Ci)
Estimation of Parameters
P̂(Ci) = ∑t r_i^t / N
m_i = ∑t r_i^t x^t / ∑t r_i^t
S_i = ∑t r_i^t (x^t – m_i)(x^t – m_i)ᵀ / ∑t r_i^t

gi(x) = –(1/2) log|S_i| – (1/2) (x – m_i)ᵀ S_i⁻¹ (x – m_i) + log P̂(Ci)
skip
Different Si

Quadratic discriminant:

gi(x) = –(1/2) log|Si| – (1/2) [ xᵀ Si⁻¹ x – 2 xᵀ Si⁻¹ mi + miᵀ Si⁻¹ mi ] + log P̂(Ci)
      = xᵀ Wi x + wiᵀ x + wi0

where
Wi  = –(1/2) Si⁻¹
wi  = Si⁻¹ mi
wi0 = –(1/2) miᵀ Si⁻¹ mi – (1/2) log|Si| + log P̂(Ci)
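A sketch of these quadratic-discriminant coefficients for one class, with assumed estimates m_i, S_i and P̂(Ci):

```python
import numpy as np

m_i = np.array([1.0, 2.0])                  # assumed class mean
S_i = np.array([[1.0, 0.3],
                [0.3, 2.0]])                # assumed class covariance
P_i = 0.4                                   # assumed prior P_hat(Ci)

S_inv = np.linalg.inv(S_i)
W_i = -0.5 * S_inv
w_i = S_inv @ m_i
w_i0 = (-0.5 * m_i @ S_inv @ m_i
        - 0.5 * np.log(np.linalg.det(S_i)) + np.log(P_i))

def g_i(x):
    return x @ W_i @ x + w_i @ x + w_i0     # quadratic discriminant value

print(g_i(np.array([0.5, 1.5])))
```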
(Figure: the class likelihoods p(x | Ci), the posterior for C1, and the discriminant where P(C1 | x) = 0.5.)
Initially skip!
Common Covariance Matrix S

Shared common sample covariance S:
S = ∑i P̂(Ci) Si

The discriminant reduces to
gi(x) = –(1/2) (x – mi)ᵀ S⁻¹ (x – mi) + log P̂(Ci)

which is a linear discriminant
gi(x) = wiᵀ x + wi0
where wi = S⁻¹ mi and wi0 = –(1/2) miᵀ S⁻¹ mi + log P̂(Ci)
Initially skip!
Common Covariance Matrix S
Likely covered in April!
Diagonal S

When the xj, j = 1, ..., d, are independent, Σ is diagonal:
p(x | Ci) = ∏j p(xj | Ci)    (Naive Bayes' assumption)

gi(x) = –(1/2) ∑(j=1..d) ( (x_j^t – m_ij) / s_j )² + log P̂(Ci)

Classify based on weighted Euclidean distance (in s_j units) to the nearest mean.
Diagonal S (figure): variances may be different.
Diagonal S, equal variances

Nearest mean classifier: classify based on Euclidean distance to the nearest mean.

gi(x) = –||x – mi||² / (2s²) + log P̂(Ci)
      = –(1/(2s²)) ∑(j=1..d) (x_j^t – m_ij)² + log P̂(Ci)

Each mean can be considered a prototype or template, and this is template matching.
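A sketch of the nearest-mean rule (equal priors and equal variances assumed, so only the Euclidean distance to each prototype matters):

```python
import numpy as np

means = np.array([[0.0, 0.0],               # assumed class means (prototypes)
                  [3.0, 1.0],
                  [1.0, 4.0]])

def nearest_mean(x, means):
    d2 = np.sum((means - x) ** 2, axis=1)   # squared Euclidean distances
    return int(np.argmin(d2))               # class of the nearest prototype

print(nearest_mean(np.array([2.5, 0.5]), means))
```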
Model Selection
Assumption                    Covariance matrix       No. of parameters
Shared, hyperspheric          Si = S = s²I            1
Shared, axis-aligned          Si = S, with sij = 0    d
Shared, hyperellipsoidal      Si = S                  d(d+1)/2
Different, hyperellipsoidal   Si                      K · d(d+1)/2

As we increase complexity (a less restricted S), bias decreases and variance increases.
Assume simple models (allow some bias) to control variance (regularization).
skip!
Discrete Features

Binary features: pij = p(xj = 1 | Ci)
If the xj are independent (Naive Bayes'):
p(x | Ci) = ∏(j=1..d) pij^(xj) (1 – pij)^(1 – xj)

The discriminant is linear:
gi(x) = log p(x | Ci) + log P(Ci)
      = ∑j [ xj log pij + (1 – xj) log(1 – pij) ] + log P(Ci)

Estimated parameters:
p̂ij = ∑t x_j^t r_i^t / ∑t r_i^t
skip!
Discrete Features

Multinomial (1-of-nj) features: xj ∈ {v1, v2, ..., v_nj}
pijk = p(zjk = 1 | Ci) = p(xj = vk | Ci)

If the xj are independent:
p(x | Ci) = ∏(j=1..d) ∏(k=1..nj) pijk^(zjk)
gi(x) = ∑j ∑k zjk log pijk + log P(Ci)

Estimated parameters:
p̂ijk = ∑t z_jk^t r_i^t / ∑t r_i^t
skip!
Multivariate Regression
r^t = g(x^t | w0, w1, ..., wd) + ε

Multivariate linear model:
g(x^t | w0, w1, ..., wd) = w0 + w1 x_1^t + w2 x_2^t + ... + wd x_d^t
E(w0, w1, ..., wd | X) = (1/2) ∑t [ r^t – (w0 + w1 x_1^t + ... + wd x_d^t) ]²

Multivariate polynomial model:
Define new higher-order variables
z1 = x1, z2 = x2, z3 = x1², z4 = x2², z5 = x1 x2
and use the linear model in this new z space
(basis functions, kernel trick, SVM: Chapter 10)
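A sketch of the multivariate linear model solved by least squares, plus the z-space trick for the degree-2 terms in two inputs (data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 2))                 # hypothetical inputs (x1, x2)
r = (1 + 2 * X[:, 0] - X[:, 1]
     + 0.5 * X[:, 0] * X[:, 1]
     + rng.normal(0, 0.1, len(X)))                   # hypothetical targets

# multivariate linear model: design matrix with columns [1, x1, x2]
D_lin = np.column_stack([np.ones(len(X)), X])
w_lin, *_ = np.linalg.lstsq(D_lin, r, rcond=None)

# polynomial via new variables z = (x1, x2, x1^2, x2^2, x1*x2), still a linear model
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])
D_poly = np.column_stack([np.ones(len(X)), Z])
w_poly, *_ = np.linalg.lstsq(D_poly, r, rcond=None)
print(w_lin, w_poly)
```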