Bayesian Learning, Cont’d
Administrivia
• Various homework bugs:
• Due: Oct 12 (Tues), not Oct 9 (Sat)
• Problem 3 should read:
• (duh)
• (some) info on naive Bayes in Sec. 4.3 of text
Administrivia
• Another bug in last time’s lecture:
• The multivariate Gaussian PDF should look like (a quick numerical sketch follows below):

$f(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$
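Not from the original slides: a minimal numpy sketch of the corrected density formula. The helper name `mvn_pdf` and the test values are made up for illustration.

```python
import numpy as np

# Sketch: evaluate the multivariate Gaussian density at a point x,
# given mean vector mu and covariance matrix Sigma.
def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    # diff @ solve(Sigma, diff) computes diff^T Sigma^{-1} diff
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_pdf(x, mu, Sigma))
```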
5 minutes of math...
• Joint probabilities
• Given $d$ different random vars, $X_1, \dots, X_d$
• The “joint” probability of them taking on the simultaneous values $x_1, \dots, x_d$ is given by $\Pr[X_1 = x_1, \dots, X_d = x_d]$
• Or, for shorthand, $\Pr[x_1, \dots, x_d]$
• Closely related to the “joint PDF”, $f(x_1, \dots, x_d)$
5 minutes of math...
• Independence:
• Two random variables are statistically independent iff: $f(x_1, x_2) = f(x_1)\, f(x_2)$
• Or, equivalently (usually for discrete RVs): $\Pr[x_1, x_2] = \Pr[x_1] \Pr[x_2]$
• For multivariate RVs: $f(x_1, \dots, x_d) = \prod_{i=1}^{d} f(x_i)$ (see the sketch below)
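Not from the original slides: a toy sketch for the discrete case, where independence amounts to the joint table equaling the outer product of its marginals. All numbers are made up.

```python
import numpy as np

# Toy joint distribution over two discrete RVs
# (rows = values of X1, columns = values of X2).
joint = np.array([[0.10, 0.20],
                  [0.15, 0.30],
                  [0.05, 0.20]])

p_x1 = joint.sum(axis=1)   # marginal of X1
p_x2 = joint.sum(axis=0)   # marginal of X2

# Independent iff joint == outer product of marginals, everywhere
independent = np.allclose(joint, np.outer(p_x1, p_x2))
print(independent)         # False for this particular table
```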
Exercise
• Suppose you’re given the PDF:
• where z is a normalizing constant.
• What must z be to make this a legitimate PDF?
• Are the two variables independent? Why or why not?
• What about the PDF:
Parameterizing PDFs
• Given training data, $[X, Y]$, w/ discrete labels $Y$
• Break the data out into per-class sets $X_{Y=1}$, $X_{Y=2}$, etc.
• Want to come up with models $f(X \mid Y\!=\!1)$, $f(X \mid Y\!=\!2)$, etc.
• Suppose the individual $f()$s are Gaussian; need the params μ and σ
• How do you get the params? (see the sketch after this slide)
• Now, what if the $f()$s are something really funky you’ve never seen before in your life, with parameters $\Theta$?
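A minimal sketch of one way to do this, assuming 1-D features and Gaussian class-conditionals; the function name and toy data are my own.

```python
import numpy as np

# Fit a 1-D Gaussian to each class by estimating its mean and standard
# deviation from the training points carrying that label.
def fit_class_gaussians(X, Y):
    """X: 1-D array of feature values, Y: array of discrete labels."""
    params = {}
    for label in np.unique(Y):
        x_c = X[Y == label]                        # the subset X_{Y=label}
        # np.std with its default ddof=0 is the maximum likelihood estimate
        params[label] = (x_c.mean(), x_c.std())    # (mu, sigma) for that class
    return params

# Toy usage
X = np.array([1.0, 1.2, 0.8, 5.1, 4.9, 5.3])
Y = np.array([0, 0, 0, 1, 1, 1])
print(fit_class_gaussians(X, Y))
```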
Maximum likelihood
• Principle of maximum likelihood:
• Pick the parameters that make the data as probable (or, in general, “likely”) as possible
• Regard the probability function as a function of two variables, data and parameters: $f(X; \Theta)$
• The function $L(\Theta) = f(X; \Theta)$ is the “likelihood function”
• Want to pick the $\Theta$ that maximizes $L$
Example
• Consider the exponential PDF: $f(x; \tau) = \frac{1}{\tau} e^{-x/\tau}$, for $x \ge 0$
• Can think of this as either a function of $x$ or of $\tau$ (sketched in code below)
Exponential as a fn of x (plot)
Exponential as a fn of τ (plot)
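A small sketch of the two views, assuming the scale parameterization $f(x;\tau) = \frac{1}{\tau} e^{-x/\tau}$ written above; the test values are arbitrary.

```python
import numpy as np

# Assumed parameterization: f(x; tau) = (1/tau) * exp(-x/tau), x >= 0
def exp_pdf(x, tau):
    return (1.0 / tau) * np.exp(-x / tau)

# View 1: fix tau, vary x  -> a PDF over x (integrates to 1)
xs = np.linspace(0, 10, 5)
print(exp_pdf(xs, tau=2.0))

# View 2: fix x, vary tau  -> a likelihood function of tau
#         (need not integrate to 1 over tau)
taus = np.linspace(0.5, 5.0, 5)
print(exp_pdf(1.3, taus))
```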
Max likelihood params
• So, for a fixed set of data, $X$, want the parameter $\Theta$ that maximizes $L$
• Hold $X$ constant, optimize over $\Theta$
• How?
• More important: $f()$ is usually a function of a single data point (possibly a vector), but $L$ is a function of a set of data
• How do you extend $f()$ to a set of data?
IID Samples
• In supervised learning, we usually assume that data points are sampled independently and from the same distribution
• IID assumption: data are independent and identically distributed
• ⇒ the joint PDF can be written as a product of the individual (marginal) PDFs: $f(x_1, \dots, x_N; \Theta) = \prod_{i=1}^{N} f(x_i; \Theta)$ (see the log-likelihood sketch below)
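A sketch of why the product form is convenient: taking logs turns the product into a sum, which is easier (and numerically safer) to work with. The exponential log-density here uses the same assumed parameterization as before.

```python
import numpy as np

# Under the IID assumption, log L(Theta) = sum_i log f(x_i; Theta)
def log_likelihood(data, log_pdf, theta):
    return np.sum([log_pdf(x, theta) for x in data])

# Example: log-density of the (assumed) exponential model f(x; tau) = (1/tau) exp(-x/tau)
def exp_log_pdf(x, tau):
    return -np.log(tau) - x / tau

data = np.array([0.8, 1.7, 0.3, 2.4])
print(log_likelihood(data, exp_log_pdf, 1.5))
```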
The max likelihood recipe
• Start with IID data
• Assume a model for an individual data point, $f(X; \Theta)$
• Construct the joint likelihood function (PDF): $L(\Theta) = \prod_{i=1}^{N} f(x_i; \Theta)$
• Find the params $\Theta$ that maximize $L$
• (If you’re lucky): differentiate $L$ w.r.t. $\Theta$, set the result $= 0$, and solve
• Repeat for each class (a worked sketch follows)
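A hedged sketch of the whole recipe, using the assumed exponential model from earlier: setting the derivative of the log-likelihood to zero gives $\hat\tau = \bar{x}$, and a coarse grid search over the log-likelihood agrees. Toy data and labels are made up.

```python
import numpy as np

# For f(x; tau) = (1/tau) exp(-x/tau), log L(tau) = -N log(tau) - sum(x)/tau;
# d/dtau log L = 0  =>  tau_hat = mean(x). Grid search below confirms it.
def fit_tau(x):
    taus = np.linspace(0.1, 10.0, 1000)
    loglik = np.array([np.sum(-np.log(t) - x / t) for t in taus])
    return taus[np.argmax(loglik)]

# "Repeat for each class": fit one tau per label
X = np.array([0.5, 0.9, 1.1, 3.8, 4.2, 5.0])
Y = np.array([0, 0, 0, 1, 1, 1])
for label in np.unique(Y):
    x_c = X[Y == label]
    print(label, fit_tau(x_c), x_c.mean())   # grid maximizer ~ sample mean
```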
Exercise
• Find the maximum likelihood estimator of μ for the univariate Gaussian: $f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
• Find the maximum likelihood estimator of β for the degenerate gamma distribution:
• Hint: consider the log of the likelihood fns in both cases (a numeric check follows below)
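Not the derivation the exercise asks for, just a numeric sanity check with made-up data: the Gaussian log-likelihood, viewed as a function of μ with σ fixed, peaks at the sample mean.

```python
import numpy as np

data = np.array([2.1, 1.8, 2.5, 2.2, 1.9])
sigma = 1.0

# Terms of log L that do not depend on mu are dropped; they do not move the peak.
mus = np.linspace(0.0, 4.0, 2001)
loglik = np.array([np.sum(-0.5 * ((data - m) / sigma) ** 2) for m in mus])
print(mus[np.argmax(loglik)], data.mean())   # both ~ 2.1
```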
Putting the parts together
(diagram: training data [X, Y])
5 minutes of math...
• Marginal probabilities
• If you have a joint PDF: $f(x_1, x_2)$
• ... and want to know about the probability of just one RV (regardless of what happens to the others)
• Marginal PDF of $X_1$ or $X_2$: $f(x_1) = \int f(x_1, x_2)\, dx_2$, and $f(x_2) = \int f(x_1, x_2)\, dx_1$ (see the sketch below)
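A numeric sketch of marginalization on a grid; the joint used here is a product of two standard Gaussians purely so the answer is easy to check.

```python
import numpy as np

# Joint f(x1, x2) tabulated on a grid; integrate out x2 to get f(x1).
x1 = np.linspace(-5, 5, 401)
x2 = np.linspace(-5, 5, 401)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

joint = np.exp(-0.5 * (X1**2 + X2**2)) / (2 * np.pi)

# Simple Riemann sum over the x2 axis
marginal_x1 = joint.sum(axis=1) * (x2[1] - x2[0])
print(marginal_x1[200], 1 / np.sqrt(2 * np.pi))   # both ~ 0.3989 at x1 = 0
```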
5 minutes of math...
• Conditional probabilities
• Suppose you have a joint PDF, $f(H, W)$
• Now you get to see one of the values, e.g., $H = 183\,\text{cm}$
• What’s your probability estimate of $W$, given this new knowledge? $f(W \mid H = 183\,\text{cm}) = \frac{f(H = 183\,\text{cm},\, W)}{f(H = 183\,\text{cm})}$ (see the sketch below)
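A toy sketch of conditioning in the discrete case: pick the row (bin) matching the observed value and renormalize. The table values are made up.

```python
import numpy as np

# Toy joint table over (H, W): rows = height bins, columns = weight bins.
joint_HW = np.array([[0.05, 0.10, 0.05],
                     [0.10, 0.30, 0.10],
                     [0.05, 0.15, 0.10]])

h_obs = 1                                   # e.g. the bin containing H = 183cm
f_W_given_H = joint_HW[h_obs] / joint_HW[h_obs].sum()
print(f_W_given_H)                          # a proper distribution: sums to 1
```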
Everything’s random...
• Basic Bayesian viewpoint:
• Treat (almost) everything as a random variable
• Data/independent var: $X$ vector
• Class/dependent var: $Y$
• Parameters: $\Theta$
• E.g., mean, variance, correlations, multinomial params, etc.
• Use Bayes’ Rule to assess probabilities of classes: $\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\, \Pr(Y)}{\Pr(X)}$ (see the sketch below)
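A sketch putting Bayes’ Rule together with Gaussian class-conditionals; the priors and per-class parameters are invented for illustration.

```python
import numpy as np

# 1-D Gaussian class-conditional density f(x | y)
def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

priors = {0: 0.6, 1: 0.4}                    # Pr(Y)
params = {0: (1.0, 0.5), 1: (3.0, 1.0)}      # (mu, sigma) per class

x = 2.2
# Bayes' Rule: Pr(Y | X) = Pr(X | Y) Pr(Y) / Pr(X)
evidence = sum(priors[y] * gauss_pdf(x, *params[y]) for y in priors)
posterior = {y: priors[y] * gauss_pdf(x, *params[y]) / evidence for y in priors}
print(posterior)                             # class posteriors, sum to 1
```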