Parametric and Multivariate Methods
Christopher Simpkins
chris.simpkins@gatech.edu
Parametric and Multivariate Methods
Maximum Likelihood Estimation
Bias and Variance
Parametric Classification
Model Selection
Multivariate Data
Parametric Estimation
Given X = {x^t}_t where x^t ∼ p(x)
Assume a distribution for p(x|θ) and estimate the sufficient statistics, θ, using X (the data)
Example: if X ∼ N(µ, σ²), we estimate µ and σ², the sufficient statistics of the Gaussian distribution, using X
Maximum Likelihood Estimation
Say we have X = {x^t}_t where x^t ∼ p(x|θ). Maximum likelihood estimation finds the parameter(s) θ that make the observed sample X most likely
Because the x^t are i.i.d., the likelihood of θ given a sample X is the product of the individual x^t likelihoods:
   l(θ|X) ≡ p(X|θ) = ∏_{t=1}^{N} p(x^t|θ)
In MLE, we find the parameters θ that make X most likely to be drawn: θ* = argmax_θ L(θ|X)
For mathematical convenience and computational expedience we
use the log likelihood, which turns the product into a sum:
   L(θ|X) ≡ log l(θ|X) = ∑_{t=1}^{N} log p(x^t|θ)
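To make the argmax concrete, here is a minimal sketch (mine, not from the lecture) that evaluates L(θ|X) as a sum of log densities and maximizes it numerically for Gaussian data; it assumes numpy and scipy are available.

```python
# A minimal sketch: numerical MLE by minimizing the negative log likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)      # sample X = {x^t}

def neg_log_lik(theta):
    mu, log_sigma = theta                         # optimize log(sigma) so sigma > 0
    return -np.sum(norm.logpdf(X, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_lik, x0=[0.0, 0.0])     # theta* = argmax_theta L(theta|X)
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                          # close to the true (2.0, 1.5)
```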
Example: Bernoulli Density
Given the Bernoulli distribution, P(x) = p^x (1 − p)^{1−x}, x ∈ {0, 1}
The expectation and variance are:
   E[X] = ∑_x x P(x) = 1 · p + 0 · (1 − p) = p
   Var(X) = ∑_x (x − E[X])² P(x) = p(1 − p)
The log likelihood of a given Bernoulli sample X = {x^t}_t is
   L(p|X) = log ∏_{t=1}^{N} p^{x^t} (1 − p)^{1−x^t}
          = ∑_t x^t log p + ∑_t (1 − x^t) log(1 − p)
          = ∑_t x^t log p + (N − ∑_t x^t) log(1 − p)
Example: Bernoulli Density (continued)
If we set the derivative dL/dp = 0 and solve for p, we get the MLE for p:
   p̂ = ∑_t x^t / N
Note that p̂ is itself a random variable, with a different estimate p̂_i for each sample X_i.
As N increases, the variance of p̂ decreases and the p̂_i become more similar
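A small simulation (my own sketch, not from the lecture) of that claim: p̂ computed from independent samples stays centered on the true p while its spread shrinks as N grows. It assumes numpy.

```python
# A minimal sketch: p_hat = sum_t x^t / N is the sample mean, and its spread
# across repeated samples shrinks as N grows.
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.4
for N in (10, 100, 1000):
    # draw 500 independent samples X_i of size N and compute p_hat_i for each
    p_hats = rng.binomial(1, p_true, size=(500, N)).mean(axis=1)
    print(N, p_hats.mean().round(3), p_hats.var().round(5))
# the mean of the p_hat_i stays near p_true while their variance shrinks with N
```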
Example: Multinomial Density
The multinomial distribution is a generalization of the Bernoulli to K > 2 states
Given {x^t}_{t=1}^{N} where
   x_i^t = 1 if experiment t chose state i, and 0 otherwise
We can follow the same procedure as for the Bernoulli and get:
   p̂_i = ∑_t x_i^t / N
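A quick sketch (mine, not from the lecture): with one-hot indicators x_i^t, the multinomial MLE p̂_i is just the fraction of trials that landed in state i.

```python
# A minimal sketch: the multinomial MLE is the observed frequency of each state.
import numpy as np

rng = np.random.default_rng(2)
K = 3
states = rng.choice(K, size=1000, p=[0.2, 0.5, 0.3])   # which state each trial chose
x = np.eye(K)[states]                                   # one-hot indicators x_i^t
p_hat = x.mean(axis=0)                                  # p_hat_i = sum_t x_i^t / N
print(p_hat)                                            # close to [0.2, 0.5, 0.3]
```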
Example: Gaussian Density
The Gaussian (normal) density is denoted N(µ, σ²), where E[X] ≡ µ and Var(X) ≡ σ²
The Gaussian density function is:
   p(x) = (1 / (√(2π) σ)) exp[ −(x − µ)² / (2σ²) ]
This leads to the log likelihood function
   L(µ, σ|X) = −(N/2) log(2π) − N log σ − ∑_t (x^t − µ)² / (2σ²)
and setting the derivatives of the log likelihood equal to 0 we get the MLE estimators for µ (m) and σ² (s²):
   m = ∑_t x^t / N
   s² = ∑_t (x^t − m)² / N
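The closed-form estimators are one line each in code; a minimal sketch (mine, not from the lecture), assuming numpy:

```python
# A minimal sketch: the closed-form Gaussian MLEs m and s^2.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # sample with mu = 2.0, sigma = 1.5

m  = x.mean()                                  # m  = sum_t x^t / N
s2 = np.mean((x - m) ** 2)                     # s2 = sum_t (x^t - m)^2 / N
print(m, np.sqrt(s2))                          # close to (2.0, 1.5)
```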
Evaluating an Estimator: Bias
Let d(X) be an estimator for θ. The bias of d is:
   b_θ(d) = E[d(X)] − θ
If b_θ(d) = 0 for all θ, then d is an unbiased estimator
For example, m is an unbiased estimator of µ:
   E[m] − µ = E[∑_t x^t / N] − µ
            = (1/N) ∑_t E[x^t] − µ
            = Nµ/N − µ
            = µ − µ = 0
This means that, though m may be different from µ for a particular
sample, as we take more samples the average of their sample
means will approach the population mean
Evaluating an Estimator: Variance
The variance of an estimator tells us how much it varies from
sample to sample
m is a consistent estimator because Var(m) → 0 as N → ∞.
s² is a biased estimator of σ² (details in the book)
The mean square error of an estimator d (after a bunch of tedious algebra; look in the book if you're interested) is:
   r(d, θ) = E[(d − θ)²] = Var(d) + (b_θ(d))²
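A small simulation (my own sketch, not from the lecture) that checks the decomposition r(d, θ) = Var(d) + bias² for the estimator s², which also makes its bias visible.

```python
# A minimal sketch: bias, variance, and MSE of the MLE variance estimator s^2,
# estimated by repeated sampling.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, N = 0.0, 4.0, 10
samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, N))

s2 = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
bias = s2.mean() - sigma2                 # E[s2] - sigma^2, about -sigma^2/N here
var  = s2.var()
mse  = np.mean((s2 - sigma2) ** 2)
print(bias, var, mse, var + bias ** 2)    # mse is approximately var + bias^2
```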
The Bayes’ Estimator
The estimators we’ve seen so far are frequentist: they are based only on the sample data
We can also take a Bayesian approach, which takes advantage of
expert knowledge and helps deal with small data sets
We use Bayes’ Rule to combine a prior density p(θ) with evidence
(the sample data) to get a posterior estimate of θ:
   p(θ|X) = p(X|θ) p(θ) / p(X)
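A minimal sketch (mine, not from the lecture) of the Bayesian idea using the standard Beta-Bernoulli conjugate pair: the prior acts like pseudo-counts, which is exactly what helps with small data sets.

```python
# A minimal sketch: a Beta prior on the Bernoulli parameter combined with data
# via Bayes' rule gives a Beta posterior.
import numpy as np

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.7, size=5)            # tiny data set

a_prior, b_prior = 2.0, 2.0                 # Beta(2, 2) prior: mild belief p ~ 0.5
a_post = a_prior + x.sum()                  # posterior is Beta(a + #ones, b + #zeros)
b_post = b_prior + len(x) - x.sum()

print(x.mean())                             # MLE from 5 points: can be extreme
print(a_post / (a_post + b_post))           # posterior mean: pulled toward the prior
```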
Parametric Classification
Remember Bayesian decision theory: P(C_i|x) = p(x|C_i) P(C_i) / p(x), and the discriminant
   g_i(x) = p(x|C_i) P(C_i)   or   g_i(x) = log p(x|C_i) + log P(C_i)
If we assume p(x|C_i) ∼ N(µ_i, σ_i²), then
   g_i(x) = −(1/2) log 2π − log σ_i − (x − µ_i)² / (2σ_i²) + log P(C_i)
Plugging in the estimators m_i, s_i², and P̂(C_i), and assuming equal variances and equal priors, the discriminant simplifies to
   g_i(x) = −(x − m_i)²
and we choose C_i if |x − m_i| = min_k |x − m_k|
In other words, we assign a query point to the class whose mean
is nearest
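A minimal sketch (mine, not from the lecture) of the resulting nearest-mean classifier; fit_class_means and predict are hypothetical helper names.

```python
# A minimal sketch: the nearest-mean rule the discriminant reduces to under
# equal priors and equal variances.
import numpy as np

def fit_class_means(X, y):
    """Estimate m_i for each class from labeled 1-D training data."""
    return {c: X[y == c].mean() for c in np.unique(y)}

def predict(x, means):
    """Choose C_i with |x - m_i| = min_k |x - m_k|."""
    return min(means, key=lambda c: abs(x - means[c]))

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
y = np.array([0] * 50 + [1] * 50)
means = fit_class_means(X, y)
print(means, predict(1.0, means), predict(2.5, means))
```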
Tuning Model Complexity: Bias/Variance Dilemma
Bias/variance dilemma: as model complexity increases, small
changes in data set cause big changes in learned hypotheses
(variance), but complex models fit the data better
Bias means the model class does not contain the true hypothesis,
which we call underfitting
Variance means the model class is too general (complex) and learns the noise, which we call overfitting
So how do we find the right model complexity?
Model Selection
Regularization: augment the error function with a parameter, λ,
that penalizes model complexity
   E′ = error on data + λ · model complexity
Akaike or Bayesian Information Criterion (AIC, BIC) - estimate
discrepancy between test error and training error
Structural risk minimization: choose the simplest model that gives us good empirical error
Minimum description length (MDL): information-theoretic
Kolmogorov complexity: shortest description of the data
Bayesian model selection: p(model|data) = p(data|model) p(model) / p(data)
The way we really do it: cross-validation
As complexity increases, validation error decreases to a point, then increases. This “elbow” is the optimal model complexity
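A minimal sketch (mine, not from the lecture) of model selection by a held-out validation set, fitting polynomials of increasing order; the noisy sine target and the split are arbitrary choices. Training error keeps falling while validation error eventually turns back up.

```python
# A minimal sketch: picking polynomial order with a train/validation split.
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 5, 60))
y = 2 * np.sin(1.5 * x) + rng.normal(0, 0.4, size=x.shape)      # noisy target

train = rng.permutation(len(x))[:40]                            # simple split
valid = np.setdiff1d(np.arange(len(x)), train)

for order in range(1, 9):
    coeffs = np.polyfit(x[train], y[train], order)              # fit on training set
    err = lambda idx: np.mean((np.polyval(coeffs, x[idx]) - y[idx]) ** 2)
    print(order, round(err(train), 3), round(err(valid), 3))
# the order with the lowest validation error is the model we keep
```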
Example: Model Selection
[Figure: (a) data and fitted polynomials of increasing order; (b) training and validation error vs. polynomial order]
Multivariate Data
A data set can be viewed as a matrix (one row per instance, one column per attribute):
   X = [ X_1^1  X_2^1  ···  X_d^1 ]
       [ X_1^2  X_2^2  ···  X_d^2 ]
       [   ⋮      ⋮            ⋮  ]
       [ X_1^N  X_2^N  ···  X_d^N ]
And we can calculate various statistics from the data matrix, like the
mean vector, E[x] = µ = [µ1 , ..., µd ]T , or attribute covariances ...
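A minimal sketch (mine, not from the lecture) of computing such statistics from a data matrix with numpy:

```python
# A minimal sketch: column-wise statistics of an N x d data matrix.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 3))        # N = 100 instances, d = 3 attributes

mean_vector = X.mean(axis=0)         # estimate of mu = [mu_1, ..., mu_d]^T
print(mean_vector)
```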
Covariance
The variance of a single variable X is given by:
   σ² = (1/n) ∑_{i=1}^{n} (x_i − X̄)²
The covariance of two variables, X and Y, is given by:
   cov(X, Y) = (1/n) ∑_{i=1}^{n} (x_i − X̄)(y_i − Ȳ)
Covariance tells you how two variables vary together:
If the covariance between two variables is positive, then as one variable increases the other tends to increase.
If the covariance between two variables is negative, then as one variable increases the other tends to decrease.
If the covariance between two variables is zero, then there is no linear relationship between them (which is weaker than the variables being independent).
Covariance Matrix
For a vector of variables ⟨X_1, ..., X_n⟩, such as the features of a data set, we can construct a matrix which represents the covariance between each pair of variables X_i and X_j, where i and j are indices into the feature vector.
   cov(X) = [ var(X_1)        cov(X_1, X_2)  ···  cov(X_1, X_n) ]
            [ cov(X_2, X_1)   var(X_2)       ···  cov(X_2, X_n) ]
            [      ⋮               ⋮                    ⋮       ]
            [ cov(X_n, X_1)   cov(X_n, X_2)  ···  var(X_n)      ]
Notice that:
along the diagonal we have simply the variance of an individual variable, and
the matrix is symmetric, that is, cov(X_i, X_j) = cov(X_j, X_i).
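A minimal sketch (mine, not from the lecture) checking both properties numerically with numpy's np.cov:

```python
# A minimal sketch: a sample covariance matrix has variances on its diagonal
# and is symmetric.
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)   # make two columns covary

C = np.cov(X, rowvar=False)                   # rows are instances, columns variables
print(np.allclose(np.diag(C), X.var(axis=0, ddof=1)))       # diagonal = variances
print(np.allclose(C, C.T))                                  # symmetric
```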
Using a Covariance Matrix
Consider the following data set [1]:
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
Are any of these attributes related?
[1] http://archive.ics.uci.edu/ml/datasets/Adult
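A hedged sketch (mine, not from the lecture) of one way to explore that question with pandas. It assumes a local copy of the file adult.data from the URL above and that the columns appear in the order listed on this slide (the file also carries a final income label column).

```python
# A hedged sketch: covariances/correlations between the numeric Adult attributes.
# Assumes "adult.data" is available locally and pandas is installed.
import pandas as pd

cols = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country",
        "income"]                      # assumed final label column in the file
df = pd.read_csv("adult.data", header=None, names=cols, skipinitialspace=True)

numeric = df.select_dtypes(include="number")
print(numeric.cov())                   # covariances between the numeric attributes
print(numeric.corr())                  # scale-free version: correlations
```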
A Look Ahead
Clustering - finding groups within unlabeled data
Feature selection - selecting (or deriving) features to improve
learning
Dimensionality reduction - dealing with high dimensional data by
transforming the data into a lower dimensional space