Lecture 5: Graphical Models
Machine Learning
CUNY Graduate Center
Today
• Logistic Regression
– Maximum Entropy Formulation
• Decision Trees Redux
– Now using Information Theory
• Graphical Models
– Representing conditional dependence
graphically
1
Logistic Regression
Optimization
• Take the gradient in terms of w
2
Optimization
• We know the gradient of the error function,
but how do we find the maximum value?
• Setting to zero is nontrivial
• Numerical approximation
3
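A minimal sketch of the numerical approach, assuming NumPy, a feature matrix X of shape (N, D), and binary labels y; the function name fit_logistic_gd is only illustrative:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, lr=0.1, n_iters=1000):
    # Batch gradient descent on the logistic regression error (negative log-likelihood).
    # The gradient of the error with respect to w is X^T (sigmoid(Xw) - y).
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)
        w -= lr * grad / X.shape[0]   # step against the gradient (numerical approximation)
    return w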
Entropy
• Measure of uncertainty, or Measure of
“Information”
• High uncertainty equals high entropy.
• Rare events are more “informative” than
common events.
4
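A quick illustration of these points, assuming NumPy; entropy here is measured in bits:

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a discrete distribution p (entries sum to 1).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # by convention 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))         # 1.0   -- high uncertainty
print(entropy([0.99, 0.01]))       # ~0.08 -- a rare event carries more "information"
print(entropy([0.25] * 4))         # 2.0   -- uniform over 4 outcomes: maximum entropy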
Examples of Entropy
• Uniform distributions have higher
entropy.
5
Maximum Entropy
• Logistic Regression is also known as
Maximum Entropy.
• Entropy is concave.
– We can expect the optimization to converge.
• Constrain this optimization to enforce good
classification.
• Maximize the likelihood of the data while
keeping the distribution as even (high
entropy) as possible.
– Include as many useful features as possible.
6
Maximum Entropy with
Constraints
• (Figure from the Klein and Manning tutorial)
7
Optimization formulation
• Let the weights represent the likelihood of
each value for each feature.
• For each feature i: (constraint shown on slide)
8
Solving MaxEnt formulation
• Convex optimization with a concave
objective function and linear constraints.
• Lagrange Multipliers
• The dual representation of this problem is the
maximum likelihood estimation of Logistic
Regression (one constraint for each feature i).
9
Decision Trees
• Nested ‘if’-statements for classification
• Each Decision Tree Node contains a
feature and a split point.
• Challenges:
– Determine which feature and split point to use
– Determine which branches are worth
including at all (Pruning)
10
Decision Trees
(Figure: example decision tree; the root splits on color (blue / green / brown), lower nodes split on height and weight thresholds such as <66, <150, <170, <145, <140, <64, and the leaves are labeled m or f)
11
Ranking Branches
• Last time, we used classification accuracy
to measure value of a branch.
(Figure: a node with 6M / 6F split on height < 68 into branches with 1M / 5F and 5M / 1F)
50% Accuracy before Branch
83.3% Accuracy after Branch
33.3% Accuracy Improvement
12
Ranking Branches
• Measure Decrease in Entropy of the class
distribution following the split
(Figure: a node with 6M / 6F split on height < 68 into branches with 1M / 5F and 5M / 1F)
H(x) = 2 before Branch
83.3% Accuracy after Branch
33.3% Accuracy Improvement
13
InfoGain Criterion
• Calculate the decrease in Entropy across a
split point.
• This represents the amount of information
contained in the split.
• This is relatively indifferent to the position on
the decision tree.
– More applicable to N-way classification.
– Accuracy represents the mode of the distribution
– Entropy can be reduced while leaving the mode
unaffected.
14
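A small sketch of the InfoGain computation on the height < 68 example from the earlier slide (6M / 6F split into 1M / 5F and 5M / 1F), assuming NumPy:

import numpy as np

def entropy(labels):
    # Entropy (bits) of the empirical class distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def info_gain(labels, left_mask):
    # Decrease in class entropy across a split defined by a boolean mask.
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after

labels = np.array(["M"] * 6 + ["F"] * 6)
left = np.array([True] + [False] * 5 + [True] * 5 + [False])   # 1M/5F vs 5M/1F
print(info_gain(labels, left))   # ~0.35 bits of information gained by the split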
Graphical Models and
Conditional Independence
• More generally about probabilities, but
used in classification and clustering.
• Both Linear Regression and Logistic
Regression use probabilistic models.
• Graphical Models allow us to structure and
visualize probabilistic models and the
relationships between variables.
15
(Joint) Probability Tables
• Represent multinomial joint probabilities
between K variables as K-dimensional
tables
• Assuming D binary variables, how big is
this table?
• What if we had multinomials with M
entries?
16
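As a concrete check: with D binary variables the full joint table has 2^D entries (2^D − 1 free parameters, since the entries must sum to 1); with M-ary multinomials it grows to M^D.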
Probability Models
• What if the variables are independent?
• If x and y are independent:
• The original distribution can be factored
• How big is this table, if each variable is
binary?
17
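Worked out for the binary case: if all D variables are independent, p(x_1, …, x_D) = p(x_1) · p(x_2) · … · p(x_D), so only D numbers are needed (one per variable) instead of 2^D − 1.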
Conditional Independence
• Independence assumptions are
convenient (Naïve Bayes), but rarely true.
• More often some groups of variables are
dependent, but others are independent.
• Still others are conditionally
independent.
18
Conditional Independence
• Two variables x and z can be conditionally
independent given a third variable y.
• E.g. y = flu?, x = achiness?, z = headache?
19
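The definition being used: x and z are conditionally independent given y when p(x, z | y) = p(x | y) p(z | y), or equivalently p(x | y, z) = p(x | y). In the example, once we know whether the patient has the flu, learning about a headache tells us nothing more about achiness.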
Factorization of a joint
• Assume
• How do you factorize:
20
Factorization of a joint
• What if there is no conditional
independence?
• How do you factorize:
21
Structure of Graphical Models
• Graphical models allow us to represent
dependence relationships between variables
visually
– Graphical models are directed acyclic graphs
(DAG).
– Nodes: random variables
– Edges: Dependence relationship
– No Edge: Independent variables
– Direction of the edge: indicates a parent-child
relationship
– Parent: Source – Trigger
– Child: Destination – Response
22
Example Graphical Models
(Figure: example graphs over two nodes x and y)
• Parents of a node i are denoted πi
• Factorization of the joint in a graphical
model:
23
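Written out, with π_i the set of parents of node i, the factorization is p(x_1, …, x_M) = Π_i p(x_i | x_{π_i}); a node with no parents contributes its marginal p(x_i).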
Basic Graphical Models
• Independent Variables
(Figure: nodes x, y, z with no edges)
• Observations
(Figure: the same graph with the observed node shaded grey)
• When we observe a variable, (fix its value from data)
we color the node grey.
• Observing a variable allows us to condition on it. E.g.
p(x,z|y)
• Given an observation we can generate pdfs for the
other variables.
24
Example Graphical Models
• X = cloudy?
• Y = raining?
• Z = wet ground?
• Markov Chain: x → y → z
25
Example Graphical Models
• Markov Chain: x → y → z
• Are x and z conditionally independent
given y?
26
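Yes: the chain factorizes as p(x, y, z) = p(x) p(y|x) p(z|y), so p(x, z | y) = p(x) p(y|x) p(z|y) / p(y) = p(x|y) p(z|y), which is exactly conditional independence of x and z given y.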
Example Graphical Models
• Markov Chain: x → y → z
27
One Trigger Two Responses
• X = achiness?
• Y = flu?
• Z = fever?
(Figure: y → x, y → z)
28
Example Graphical Models
(Figure: y → x, y → z)
• Are x and z conditionally independent
given y?
29
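Yes: here the joint is p(y) p(x|y) p(z|y), so p(x, z | y) = p(x|y) p(z|y); achiness and fever are conditionally independent given the flu, even though they are not marginally independent.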
Example Graphical Models
(Figure: y → x, y → z)
30
Two Triggers One Response
• X = rain?
• Y = wet sidewalk?
• Z = spilled coffee?
(Figure: x → y ← z)
31
Example Graphical Models
(Figure: x → y ← z)
• Are x and z conditionally independent
given y?
32
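No: the joint is p(x) p(z) p(y | x, z), so x and z are marginally independent, but conditioning on y couples them (“explaining away”); e.g. if the sidewalk is wet and we then learn coffee was spilled, rain becomes less likely.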
Example Graphical Models
(Figure: x → y ← z)
33
Factorization
(Figure: graphical model over x0–x5)
34
Factorization
(Figure: graphical model over x0–x5)
35
How Large are the probability
tables?
36
Model Parameters as Nodes
• Treating model parameters as a random
variable, we can include these in a
graphical model
• Multivariate Bernoulli
(Figure: µ0 → x0, µ1 → x1, µ2 → x2)
37
Model Parameters as Nodes
• Treating model parameters as a random
variable, we can include these in a
graphical model
• Multinomial
(Figure: µ → x0, x1, x2)
38
Naïve Bayes Classification
(Figure: y → x0, y → x1, y → x2)
• Observed variables xi are independent given the class
variable y
• The distribution can be optimized using maximum likelihood
on each variable separately.
• Can easily combine various types of distributions
39
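A compact sketch of training and prediction for discrete features; the function names and data layout (a list of feature tuples X with class labels y) are illustrative, not from the lecture:

from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    # Maximum likelihood for Naive Bayes: the class prior and each
    # per-feature conditional are estimated separately by counting.
    n = len(y)
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    cond = defaultdict(Counter)              # cond[(i, c)][value] = count
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            cond[(i, c)][v] += 1
    return prior, cond

def predict(xs, prior, cond):
    # Choose the class maximizing p(y) * prod_i p(x_i | y).
    best, best_score = None, -1.0
    for c, pc in prior.items():
        score = pc
        for i, v in enumerate(xs):
            counts = cond[(i, c)]
            score *= counts[v] / max(sum(counts.values()), 1)
        if score > best_score:
            best, best_score = c, score
    return best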
Graphical Models
• Graphical representation of dependency
relationships
• Directed Acyclic Graphs
• Nodes as random variables
• Edges define dependency relations
• What can we do with Graphical Models
– Learn parameters – to fit data
– Understand independence relationships between
variables
– Perform inference (marginals and conditionals)
– Compute likelihoods for classification.
40
Plate Notation
(Figure: the expanded graph y → x0, x1, …, xn, and the equivalent plate notation with y → xi and a plate labeled n drawn around xi)
• To indicate a repeated variable, draw a
plate around it.
41
Completely Observed Graphical Models
• Suppose we have observations for every node.

  Flu  Fever  Sinus  Ache  Swell  Head
  Y    L      Y      Y     Y      N
  N    M      N      N     N      N
  Y    H      N      N     Y      Y
  Y    M      Y      N     N      Y

• In the simplest (least general) graph, assume each variable is independent: train 6 separate models.
  (Graph: Fl, Fe, Si, Ac, Sw, He with no edges)
42
Completely Observed Graphical Models
• Suppose we have observations for every node (the same observation table as on the previous slide).
• In the second simplest (most general) graph, assume no independence: build a 6-dimensional joint probability table and divide by the total count.
  (Graph: Fl, Fe, Si, Ac, Sw, He, fully connected)
43
Maximum Likelihood Conditional Probability Tables
• Consider this graphical model (nodes x0–x5).
• Each node i has a conditional probability table θ_i.
• Given the tables, we can construct the pdf:
  p(x | θ) = Π_{i=0}^{M−1} p(x_i | π_i, θ_i)
• We have M variables in x and N data points X.
• Use maximum (log) likelihood to find the best settings of θ:
  θ* = argmax_θ ln p(X | θ) = argmax_θ Σ_{n=0}^{N−1} ln p(X_n | θ)
     = argmax_θ Σ_{n=0}^{N−1} Σ_{i=0}^{M−1} ln p(x_{i,n} | θ_i)
44
Maximum likelihood
45
Maximum Likelihood CPTs
• Count functions: count the number of times something appears in the data.
• First, Kronecker's delta function:
  δ(x_n, x_m) = 1 if x_n = x_m, 0 otherwise
• Counts:
  m(x_i) = Σ_{n=0}^{N−1} δ(x_i, x_{i,n})
  m(X) = Σ_{n=0}^{N−1} δ(X, X_n)
• Counts marginalize like probabilities, e.g.
  N = Σ_{x_1} m(x_1),  m(x_1) = Σ_{x_2} m(x_1, x_2),  m(x_1, x_2) = Σ_{x_3} m(x_1, x_2, x_3)
46
Maximum Likelihood CPTs
• Rewrite the log likelihood in terms of counts:
  l(θ) = Σ_{n=0}^{N−1} ln p(X_n | θ)
       = Σ_{n=0}^{N−1} Σ_X δ(X_n, X) ln p(X | θ)
       = Σ_X m(X) ln p(X | θ)
       = Σ_X m(X) Σ_{i=0}^{M−1} ln p(x_i | π_i, θ_i)
       = Σ_{i=0}^{M−1} Σ_{x_i, π_i} Σ_{X \ {x_i, π_i}} m(X) ln p(x_i | π_i, θ_i)
       = Σ_{i=0}^{M−1} Σ_{x_i, π_i} m(x_i, π_i) ln p(x_i | π_i, θ_i)
• Define a function: θ(x_i, π_i) = p(x_i | π_i, θ_i)
• Constraint: Σ_{x_i} θ(x_i, π_i) = 1
47
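The constrained maximization (via the Lagrange multipliers on the next slide) gives the counting estimate θ(x_i, π_i) = m(x_i, π_i) / m(π_i). A sketch of computing it directly; the data layout (a list of dicts) and the small flu example are illustrative:

from collections import Counter

def mle_cpts(data, parents):
    # data:    list of fully observed assignments, e.g. {"Flu": "Y", "Fever": "L"}
    # parents: dict mapping each variable to a tuple of its parents
    cpts = {}
    for var, pa in parents.items():
        joint, marg = Counter(), Counter()   # m(x_i, pi_i) and m(pi_i)
        for row in data:
            pa_val = tuple(row[p] for p in pa)
            joint[(row[var], pa_val)] += 1
            marg[pa_val] += 1
        # Normalizing the counts gives the maximum likelihood CPT.
        cpts[var] = {k: cnt / marg[k[1]] for k, cnt in joint.items()}
    return cpts

data = [{"Flu": "Y", "Fever": "L"}, {"Flu": "N", "Fever": "M"},
        {"Flu": "Y", "Fever": "H"}, {"Flu": "Y", "Fever": "M"}]
cpts = mle_cpts(data, {"Flu": (), "Fever": ("Flu",)})
# cpts["Flu"][("Y", ())] == 3/4, cpts["Fever"][("L", ("Y",))] == 1/3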
Maximum Likelihood
• Use Lagrange Multipliers
48
Maximum A Posteriori Training
• Bayesians would never do that; the θs
need a prior.
49
Conditional Dependence Test
• Can check conditional independence in a graphical model
– “Is achiness (x3) independent of the flu (x0) given fever (x1)?”
– “Is achiness (x3) independent of sinus infections (x2) given fever (x1)?”
50
D-Separation and Bayes Ball
(Figure: graphical model over x0–x5)
• Intuition: nodes are separated, or blocked, by sets of nodes.
  – E.g. nodes x1 and x2 “block” the path from x0 to x5, so x0 ⊥⊥ x5 | x1, x2 (x0 is conditionally independent of x5 given x1 and x2).
51
Bayes Ball Algorithm
• Shade nodes xc
• Place a “ball” at each node in xa
• Bounce balls around the graph according
to rules
• If no balls reach xb, then cond. ind.
52
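The ten bounce rules themselves are on the next slide; an equivalent check that is easy to code is the ancestral moral graph method: keep only ancestors of the nodes involved, “marry” co-parents, drop edge directions, delete the observed nodes, and test reachability. A sketch (the encoding of the DAG as a parents dict is an assumption of this example):

def d_separated(a, b, observed, parents):
    # parents: dict mapping each node to the set of its parents (defines the DAG).
    # Returns True when a and b are conditionally independent given `observed`.
    relevant, stack = set(), [a, b, *observed]
    while stack:                                  # 1. ancestral subgraph
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, ()))
    adj = {n: set() for n in relevant}
    for n in relevant:                            # 2. moralize and drop directions
        ps = [p for p in parents.get(n, ()) if p in relevant]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    seen, stack = {a}, [a]                        # 3. block observed nodes, test reachability
    while stack:
        n = stack.pop()
        for m in adj[n]:
            if m == b:
                return False                      # unblocked path found
            if m not in seen and m not in observed:
                seen.add(m); stack.append(m)
    return True

chain = {"x": set(), "y": {"x"}, "z": {"y"}}
print(d_separated("x", "z", {"y"}, chain))        # True: the chain is blocked by y
collider = {"x": set(), "z": set(), "y": {"x", "z"}}
print(d_separated("x", "z", {"y"}, collider))     # False: observing y couples the triggers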
Ten rules of Bayes Ball Theorem
53
Bayes Ball Example
x0 ⊥⊥ x4 |x2?
(Figure: graphical model over x0–x5)
54
Bayes Ball Example
x0 ⊥⊥ x5 |x1, x2 ?
(Figure: graphical model over x0–x5)
55
Undirected Graphs
• What if we allow undirected graphs?
• What do they correspond to?
• Not Cause/Effect, or Trigger/Response,
but general dependence
• Example: Image pixels, where each pixel is
a Bernoulli variable
– P(x11,…, x1M,…, xM1,…, xMM)
– Bright pixels have bright neighbors
• No parents, just probabilities.
• Grid models are called Markov
Random Fields
56
Undirected Graphs
(Figure: undirected graph over nodes A, B, C, D)
• Undirected separability is easy.
• To check conditional independence of A
and B given C, check graph reachability
between A and B without passing through
any node in C.
57
Next Time
• More fun with Graphical Models
• Read Chapter 8.1, 8.2
58