Introduction to Machine Learning
CSE474/574: Generative Models
Varun Chandola <chandola@buffalo.edu>
Outline

1. Generative Models for Discrete Data
   - Bayesian Concept Learning
   - Likelihood
   - Adding a Prior
   - Posterior
   - Posterior Predictive Distribution
2. Steps for Learning a Generative Model
   - Incorporating Prior
   - Beta Distribution
   - Conjugate Priors
   - Estimating Posterior
   - Using Predictive Distribution
   - Need for Prior
   - Need for Bayesian Averaging
3. Learning Gaussian Models
   - Estimating Parameters
   - Estimating Posterior
Generative Models

Let us go back to our tumor example:
- X represents the data, with multiple discrete attributes
  - Is X a discrete or a continuous random variable?
- Y represents the class (benign or malignant)
- Most probable class:

  P(Y = c|X = x, θ) ∝ P(X = x|Y = c, θ) P(Y = c|θ)

- P(X = x|Y = c, θ) = p(x|y = c, θ), the class-conditional density
  - How is the data distributed for each class?
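A minimal Python sketch of this decision rule, with made-up class-conditional probabilities and class priors for a single discrete attribute (all names and numbers are illustrative, not from the lecture):

```python
# Hypothetical class-conditional probabilities p(x | y = c) for one
# discrete attribute of a tumor (values are illustrative only).
p_x_given_c = {
    "benign":    {"small": 0.7, "large": 0.3},
    "malignant": {"small": 0.2, "large": 0.8},
}
p_c = {"benign": 0.8, "malignant": 0.2}   # class priors p(y = c)

def class_posterior(x):
    """Return p(y = c | x) via Bayes' rule: posterior ∝ likelihood × prior."""
    unnorm = {c: p_x_given_c[c][x] * p_c[c] for c in p_c}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

print(class_posterior("large"))   # the malignant class becomes more probable
```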
Bayesian Concept Learning

- A concept assigns binary labels to examples
- We want to find out: P(x | x ∈ c)
Concept Learning on the Number Line

- I give you a set of numbers (training set D) belonging to a concept
- Choose the most likely hypothesis (concept)
- Assume that the numbers are between 1 and 100
- Hypothesis Space (H):
  - All powers of 2
  - All powers of 4
  - All even numbers
  - All prime numbers
  - Numbers close to a fixed number (say 12)
  - ...

Socrative Game: go to http://m.socrative.com/student/ and enter class ID ubmachinelearning2016
Ready?

Hypothesis Space (H):
1. Even numbers
2. Odd numbers
3. Squares
4. Powers of 2
5. Powers of 4
6. Powers of 16
7. Multiples of 5
8. Multiples of 10
9. Numbers within 20 ± 5
10. All numbers between 1 and 100

Training sets to consider:
- D = {}
- D = {16}
- D = {60}
- D = {16, 19, 15, 20, 18}
- D = {16, 4, 64, 32}
Computing Likelihood

- Why choose the powers-of-2 concept over the even-numbers concept for D = {16, 4, 64, 32}?
- Avoid suspicious coincidences
- Choose the concept with the higher likelihood
- What is the likelihood of the above D being generated by the powers-of-2 concept?
- What is the likelihood for the even-numbers concept?
Likelihood

- Why choose one hypothesis over another?
- Avoid suspicious coincidences
- Choose the concept with the higher likelihood

  p(D|h) = ∏_{x∈D} p(x|h)

Log likelihood:

  log p(D|h) = ∑_{x∈D} log p(x|h)
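A small sketch of this likelihood for the number game, assuming each example is drawn uniformly from the concept's extension, i.e. p(x|h) = 1/|h| if x ∈ h and 0 otherwise (the hypothesis extensions below are my own encoding of the slide's list; this assumption reproduces the likelihood values used later):

```python
# Each hypothesis is represented by its extension: the set of numbers
# in 1..100 that belong to the concept (an assumed encoding).
hypotheses = {
    "even":        {n for n in range(1, 101) if n % 2 == 0},
    "powers_of_2": {2 ** k for k in range(1, 7)},        # 2, 4, ..., 64
    "all":         set(range(1, 101)),
}

def likelihood(D, h):
    """p(D|h) = prod_x p(x|h), with p(x|h) = 1/|h| if x in h, else 0."""
    if not all(x in h for x in D):
        return 0.0
    return (1.0 / len(h)) ** len(D)

D = [16, 4, 64, 32]
for name, h in hypotheses.items():
    print(name, likelihood(D, h))
# powers_of_2 gives (1/6)^4 ≈ 0.77 × 10⁻³, even gives (1/50)^4 ≈ 0.16 × 10⁻⁶
```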
Bayesian Concept Learning

D = {16, 4, 64, 32}

Hypothesis Space (H):
1. Even numbers
2. Odd numbers
3. Squares
4. Powers of 2
5. Powers of 4
6. Powers of 16
7. Multiples of 5
8. Multiples of 10
9. Numbers within 20 ± 5
10. All numbers between 1 and 100
Adding a Prior

- Inside information about the hypotheses
- Some hypotheses are more likely a priori
- The most likely hypothesis a priori may not be the right one (the prior can be wrong)
Posterior

Revised estimate for h after observing the evidence (D) and the prior:

Posterior ∝ Likelihood × Prior

  p(h|D) = p(D|h)p(h) / ∑_{h′∈H} p(D|h′)p(h′)

For D = {16, 4, 64, 32}:

     h                      Prior   Likelihood     Posterior
 1   Even                   0.3     0.16 × 10⁻⁶    0.621 × 10⁻³
 2   Odd                    0.075   0              0
 3   Squares                0.075   0              0
 4   Powers of 2            0.1     0.77 × 10⁻³    0.997
 5   Powers of 4             0.075   0              0
 6   Powers of 16           0.075   0              0
 7   Multiples of 5         0.075   0              0
 8   Multiples of 10        0.075   0              0
 9   Numbers within 20 ± 5  0.075   0              0
10   All Numbers            0.075   0.01 × 10⁻⁶    0.009 × 10⁻³
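A sketch of this normalization using the prior and likelihood values from the table (only the three hypotheses with nonzero likelihood are included; the rest drop out of the sum):

```python
# Priors and likelihoods for D = {16, 4, 64, 32}, taken from the table above.
prior      = {"even": 0.3,     "powers_of_2": 0.1,     "all": 0.075}
likelihood = {"even": 0.16e-6, "powers_of_2": 0.77e-3, "all": 0.01e-6}

unnorm = {h: likelihood[h] * prior[h] for h in prior}
z = sum(unnorm.values())                      # ∑_{h′} p(D|h′) p(h′)
posterior = {h: v / z for h, v in unnorm.items()}
print(posterior)   # powers_of_2 ends up near 0.997, as in the table
```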
Finding the Best Hypothesis

Maximum a Priori Estimate:

  ĥ_prior = arg max_h p(h)

Maximum Likelihood Estimate (MLE):

  ĥ_MLE = arg max_h p(D|h) = arg max_h log p(D|h) = arg max_h ∑_{x∈D} log p(x|h)

Maximum a Posteriori (MAP) Estimate:

  ĥ_MAP = arg max_h p(D|h)p(h) = arg max_h (log p(D|h) + log p(h))
MAP and MLE

- ĥ_prior: most likely hypothesis based on the prior
- ĥ_MLE: most likely hypothesis based on the evidence
- ĥ_MAP: most likely hypothesis based on the posterior

  ĥ_prior = arg max_h log p(h)
  ĥ_MLE = arg max_h log p(D|h)
  ĥ_MAP = arg max_h (log p(D|h) + log p(h))
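A sketch of the three point estimates as argmax operations over the same dictionaries of priors and likelihoods (reusing the illustrative values from the earlier sketch):

```python
import math

prior      = {"even": 0.3,     "powers_of_2": 0.1,     "all": 0.075}
likelihood = {"even": 0.16e-6, "powers_of_2": 0.77e-3, "all": 0.01e-6}

h_prior = max(prior, key=prior.get)                    # arg max_h p(h)
h_mle   = max(likelihood, key=likelihood.get)          # arg max_h p(D|h)
h_map   = max(prior, key=lambda h: math.log(likelihood[h]) + math.log(prior[h]))
print(h_prior, h_mle, h_map)   # even, powers_of_2, powers_of_2
```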
Interesting Properties

- As the amount of data increases, the MAP estimate converges towards the MLE. Why?
- MAP/MLE are consistent estimators:
  - If the true concept is in H, the MAP/ML estimates converge to it
  - If c ∉ H, the MAP/ML estimates converge to the hypothesis in H that is closest to the truth
Posterior Predictive Distribution

- New input: x*
- What is the probability that x* is also generated by the same concept as D, i.e., p(x* ∈ c|D)?
- Option 0: Treat h_prior as the true concept: p(x* ∈ c|D) = p(x* ∈ h_prior)
- Option 1: Treat h_MLE as the true concept: p(x* ∈ c|D) = p(x* ∈ h_MLE)
- Option 2: Treat h_MAP as the true concept: p(x* ∈ c|D) = p(x* ∈ h_MAP)
- Option 3: Bayesian averaging:

  p(x* ∈ c|D) = ∑_h p(x* ∈ h) p(h|D)
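A sketch of Option 3 for the number game: the predictive probability that a new number x* belongs to the concept is a posterior-weighted vote over hypotheses (the posterior values and hypothesis extensions below are illustrative, roughly matching the earlier sketches):

```python
# Illustrative posterior over hypotheses and the extension of each hypothesis.
posterior = {"even": 0.62e-3, "powers_of_2": 0.997, "all": 0.009e-3}
extension = {
    "even":        {n for n in range(1, 101) if n % 2 == 0},
    "powers_of_2": {2 ** k for k in range(1, 7)},
    "all":         set(range(1, 101)),
}

def predictive(x_star):
    """p(x* ∈ c | D) = sum_h 1[x* ∈ h] * p(h | D)  (Bayesian averaging)."""
    return sum(p for h, p in posterior.items() if x_star in extension[h])

print(predictive(8))    # high: 8 is covered by the dominant powers_of_2 hypothesis
print(predictive(10))   # small: only the low-posterior even and all hypotheses cover it
```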
Steps for Learning a Generative Model

- Example: D is a sequence of N binary values (0s and 1s), e.g., coin tosses
- What is the best distribution to describe D?
- What is the probability of observing a head in the future?
- Step 1: Choose the form of the model
  - Hypothesis space: all possible distributions. Too complicated!
  - Revised hypothesis space: all Bernoulli distributions (X ∼ Ber(θ), 0 ≤ θ ≤ 1)
  - θ is the hypothesis
  - Still infinite (θ can take infinitely many values)
Compute Likelihood

Likelihood of D:

  p(D|θ) = θ^N1 (1 − θ)^N0

Maximum Likelihood Estimate:

  θ̂_MLE = arg max_θ p(D|θ) = arg max_θ θ^N1 (1 − θ)^N0 = N1 / (N0 + N1)

- We can stop here (the MLE approach)
- Probability of getting a head next: p(x* = 1|D) = θ̂_MLE
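A minimal sketch of this estimate on a toy coin-toss sequence (the data is made up):

```python
D = [1, 0, 1, 1, 0, 1, 1, 0]          # hypothetical coin tosses, 1 = heads
N1 = sum(D)                            # number of heads
N0 = len(D) - N1                       # number of tails
theta_mle = N1 / (N0 + N1)             # maximizes theta^N1 * (1 - theta)^N0
print(theta_mle)                       # 0.625; also p(x* = 1 | D) under the MLE approach
```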
Incorporating Prior

- The prior encodes our prior belief about θ
- How do we set a Bayesian prior?
  1. A point estimate: θ_prior = 0.5
  2. A probability distribution over θ (treating θ as a random variable)
- Which one?
- For a Bernoulli distribution, 0 ≤ θ ≤ 1, so use a Beta distribution

[Figure: Beta distribution pdfs p(θ) for θ ∈ [0, 1]]
Beta Distribution as Prior

- A continuous random variable defined between 0 and 1:

  Beta(θ|a, b) ≜ p(θ|a, b) = (1 / B(a, b)) θ^(a−1) (1 − θ)^(b−1)

- a and b are the (hyper-)parameters of the distribution
- B(a, b) is the beta function:

  B(a, b) = Γ(a)Γ(b) / Γ(a + b),   Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du

- If x is an integer, Γ(x) = (x − 1)!
- a and b "control" the shape of the pdf
- We can stop here as well (the prior approach): p(x* = 1) = θ_prior
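A small sketch of this density using only the standard library's gamma function (the parameter values are arbitrary):

```python
import math

def beta_pdf(theta, a, b):
    """Beta(theta | a, b) = theta^(a-1) * (1-theta)^(b-1) / B(a, b)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # beta function B(a, b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

print(beta_pdf(0.5, 2, 2))   # 1.5: the symmetric Beta(2, 2) peaks at theta = 0.5
```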
Conjugate Priors

Another reason to choose the Beta distribution:

  p(D|θ) = θ^N1 (1 − θ)^N0
  p(θ) ∝ θ^(a−1) (1 − θ)^(b−1)

Posterior ∝ Likelihood × Prior:

  p(θ|D) ∝ θ^N1 (1 − θ)^N0 θ^(a−1) (1 − θ)^(b−1) ∝ θ^(N1+a−1) (1 − θ)^(N0+b−1)

- The posterior has the same form as the prior
- The Beta distribution is a conjugate prior for the Bernoulli/Binomial distribution
Estimating Posterior

Posterior:

  p(θ|D) ∝ θ^(N1+a−1) (1 − θ)^(N0+b−1) = Beta(θ|N1 + a, N0 + b)

- We start with the belief that E[θ] = a / (a + b)
- After observing N trials with N1 heads and N0 tails, we update our belief to:

  E[θ|D] = (a + N1) / (a + b + N)
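A sketch of the conjugate update on the toy tosses from before: with a Beta(a, b) prior, the posterior is Beta(a + N1, b + N0) and the belief about θ shifts from a/(a + b) to (a + N1)/(a + b + N) (the prior hyper-parameters are made up):

```python
a, b = 2, 2                            # hypothetical Beta prior hyper-parameters
D = [1, 0, 1, 1, 0, 1, 1, 0]           # same toy tosses as before
N1, N0 = sum(D), len(D) - sum(D)
N = N1 + N0

prior_mean = a / (a + b)                       # E[theta] before seeing data
a_post, b_post = a + N1, b + N0                # posterior is Beta(a + N1, b + N0)
posterior_mean = (a + N1) / (a + b + N)        # E[theta | D]
print(prior_mean, posterior_mean)              # 0.5 -> 0.583...
```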
Using Posterior

- We know that the posterior over θ is a Beta distribution
- MAP estimate:

  θ̂_MAP = arg max_θ p(θ|a + N1, b + N0) = (a + N1 − 1) / (a + b + N − 2)

- What happens if a = b = 1?
- We can stop here as well (the MAP approach)
- Probability of getting a head next: p(x* = 1|D) = θ̂_MAP
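A sketch of the MAP estimate under the same Beta posterior; note that with a = b = 1 (a uniform prior) it reduces to the MLE N1/(N0 + N1):

```python
def theta_map(a, b, N1, N0):
    """Mode of Beta(theta | a + N1, b + N0), valid when a + N1 > 1 and b + N0 > 1."""
    return (a + N1 - 1) / (a + b + N1 + N0 - 2)

N1, N0 = 5, 3                           # counts from the toy tosses
print(theta_map(2, 2, N1, N0))          # 0.6 (pulled toward the prior mean 0.5)
print(theta_map(1, 1, N1, N0))          # 0.625 = N1 / (N0 + N1), i.e. the MLE
```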
True Bayesian Approach

- All values of θ are possible
- The prediction for an unknown input x* is given by Bayesian averaging:

  p(x* = 1|D) = ∫_0^1 p(x = 1|θ) p(θ|D) dθ
              = ∫_0^1 θ Beta(θ|a + N1, b + N0) dθ
              = E[θ|D]
              = (a + N1) / (a + b + N)

- This is the same as using E[θ|D] as a point estimate for θ
The Black Swan Paradox

- Why use a prior?
- Consider D = {tails, tails, tails}, so N1 = 0, N = 3
- θ̂_MLE = 0, hence p(x* = 1|D) = 0!
- We would never predict heads: the black swan paradox
- How does the Bayesian approach help?

  p(x* = 1|D) = a / (a + b + 3)
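A sketch of the paradox and its Bayesian fix: after three tails the MLE predicts heads with probability 0, while the posterior predictive with a Beta(a, b) prior stays strictly positive (the uniform prior below is an arbitrary choice):

```python
D = [0, 0, 0]                           # tails, tails, tails
N1, N0 = sum(D), len(D) - sum(D)
a, b = 1, 1                             # hypothetical uniform Beta prior

p_heads_mle   = N1 / (N0 + N1)                  # 0.0 -> the black swan problem
p_heads_bayes = (a + N1) / (a + b + N1 + N0)    # a / (a + b + 3) = 0.2
print(p_heads_mle, p_heads_bayes)
```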
Why is the MAP Estimate Insufficient?

- The MAP estimate is only one part of the posterior: the θ at which the posterior probability is maximum
- But is that enough? What about the posterior variance of θ?

  var[θ|D] = (a + N1)(b + N0) / ((a + b + N)² (a + b + N + 1))

- If the variance is high, then θ̂_MAP is not trustworthy
- Bayesian averaging helps in this case
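A sketch of this variance as a sanity check on θ̂_MAP: with few observations the posterior is wide, and it tightens as N grows (the counts are made up):

```python
def posterior_var(a, b, N1, N0):
    """var[theta | D] for a Beta(a + N1, b + N0) posterior."""
    N = N1 + N0
    return ((a + N1) * (b + N0)) / ((a + b + N) ** 2 * (a + b + N + 1))

print(posterior_var(2, 2, 5, 3))        # few tosses: relatively wide posterior
print(posterior_var(2, 2, 500, 300))    # many tosses: the variance is tiny
```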
Multivariate Gaussian

The pdf of an MVN in d dimensions:

  N(x|µ, Σ) ≜ 1 / ((2π)^(d/2) |Σ|^(1/2)) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
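A sketch that evaluates this density directly from the formula using numpy (the mean and covariance below are made up):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a d-dimensional Gaussian, straight from the formula."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_pdf(np.array([0.5, -1.0]), mu, Sigma))
```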
Estimating Parameters of MVN

Problem Statement: Given a set of N independent and identically distributed (iid) samples, D, learn the parameters (µ, Σ) of a Gaussian distribution that generated D.

MLE approach: maximize the log-likelihood.

Result:

  µ̂_MLE = (1/N) ∑_{i=1}^N x_i ≜ x̄

  Σ̂_MLE = (1/N) ∑_{i=1}^N (x_i − x̄)(x_i − x̄)ᵀ
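A sketch of both MLE formulas on synthetic data (note the 1/N factor in the covariance estimate, not 1/(N − 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))           # N iid samples, d = 2 (synthetic data)
N = X.shape[0]

mu_mle = X.mean(axis=0)                                  # (1/N) sum_i x_i
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / N                  # (1/N) sum_i (x_i - xbar)(x_i - xbar)^T
print(mu_mle, Sigma_mle, sep="\n")
```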
Estimating Posterior

- We need a posterior for both µ and Σ: p(µ) and p(Σ)
- What distribution do we use as a prior over µ? A Gaussian distribution:

  p(µ) = N(µ|m0, V0)

- What distribution do we use as a prior over Σ? An Inverse-Wishart distribution:

  p(Σ) = IW(Σ|S, ν) = (1 / Z_IW) |Σ|^(−(ν+D+1)/2) exp( −(1/2) tr(S⁻¹ Σ⁻¹) )

  where Z_IW = |S|^(−ν/2) 2^(νD/2) Γ_D(ν/2)
Calculating Posterior

Posterior for µ (also an MVN):

  p(µ|D, Σ) = N(m_N, V_N)
  V_N⁻¹ = V_0⁻¹ + N Σ⁻¹
  m_N = V_N (Σ⁻¹(N x̄) + V_0⁻¹ m_0)

Posterior for Σ (also an Inverse-Wishart):

  p(Σ|D, µ) = IW(S_N, ν_N)
  ν_N = ν_0 + N
  S_N⁻¹ = S_0 + S_µ

where S_µ = ∑_{i=1}^N (x_i − µ)(x_i − µ)ᵀ is the scatter matrix around µ.
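A sketch of the posterior update for µ with Σ treated as known (the prior parameters m0, V0 and the synthetic data are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])   # covariance assumed known
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=50)
N, xbar = X.shape[0], X.mean(axis=0)

m0 = np.zeros(2)                              # prior mean for mu
V0 = 10.0 * np.eye(2)                         # broad prior covariance for mu

VN_inv = np.linalg.inv(V0) + N * np.linalg.inv(Sigma)        # V_N^{-1} = V_0^{-1} + N Sigma^{-1}
VN = np.linalg.inv(VN_inv)
mN = VN @ (np.linalg.inv(Sigma) @ (N * xbar) + np.linalg.inv(V0) @ m0)
print(mN)    # posterior mean: close to xbar because the prior is broad
```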