Introduction and Bayesian Network Representation

Introduction to Probabilistic Graphical Models
Eran Segal
Weizmann Institute
Logistics

Staff:
• Instructor: Eran Segal (eran.segal@weizmann.ac.il, room 149)
• Teaching Assistant: Ohad Manor (ohad.manor@weizmann.ac.il, room 125)

Course information:
• WWW: http://www.weizmann.ac.il/math/pgm

Course book:
• “Bayesian Networks and Beyond”, Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)
Course structure

• One weekly meeting
  - Sun: 9am-11am
• Homework assignments
  - 2 weeks to complete each
  - 40% of final grade
• Final exam
  - 3-hour class exam, date will be announced
  - 60% of final grade
Probabilistic Graphical Models

• Tool for representing complex systems and performing sophisticated reasoning tasks
• Fundamental notion: modularity
  - Complex systems are built by combining simpler parts
• Why have a model?
  - Compact and modular representation of complex systems
  - Ability to execute complex reasoning patterns
  - Make predictions
  - Generalize from particular problems
Probabilistic Graphical Models

• Increasingly important in machine learning
• Many classical probabilistic problems in statistics, information theory, pattern recognition, and statistical mechanics are special cases of the formalism
  - Graphical models provide a common framework
  - Advantage: specialized techniques developed in one field can be transferred between research communities
Representation: Graphs

• Intuitive data structure for modeling highly-interacting sets of variables
• Explicit model of modularity
• Data structure that allows for the design of efficient general-purpose algorithms

Reasoning: Probability Theory

• Well-understood framework for modeling uncertainty
  - Partial knowledge of the state of the world
  - Noisy observations
  - Phenomena not covered by our model
  - Inherent stochasticity
• Clear semantics
• Can be learned from data
Probabilistic Reasoning

In this course we will learn:
• Semantics of probabilistic graphical models (PGMs)
  - Bayesian networks
  - Markov networks
• Answering queries in a PGM (“inference”)
• Learning PGMs from data (“learning”)
• Modeling temporal processes with PGMs
  - Hidden Markov Models (HMMs) as a special case
Course Outline

Week  Topic                                           Reading
1     Introduction, Bayesian network representation   1-3
2     Bayesian network representation cont.           1-3
3     Local probability models                        5
4     Undirected graphical models                     4
5     Exact inference                                 9,10
6     Exact inference cont.                           9,10
7     Approximate inference                           12
8     Approximate inference cont.                     12
9     Learning: Parameters                            16,17
10    Learning: Parameters cont.                      16,17
11    Learning: Structure                             18
12    Partially observed data                         19
13    Learning undirected graphical models            20
14    Template models                                 6
15    Dynamic Bayesian networks                       15
A Simple Example

• We want to model whether our neighbor will inform us of the alarm being set off
• The alarm can go off if
  - There is a burglary
  - There is an earthquake
• Whether our neighbor calls depends on whether the alarm is set off
A Simple Example

• Variables: Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
• Full joint distribution:

  E  B  A  N  Prob.
  F  F  F  F  0.01
  F  F  F  T  0.04
  F  F  T  F  0.05
  F  F  T  T  0.01
  F  T  F  F  0.02
  F  T  F  T  0.07
  F  T  T  F  0.2
  F  T  T  T  0.1
  T  F  F  F  0.01
  T  F  F  T  0.07
  T  F  T  F  0.13
  T  F  T  T  0.04
  T  T  F  F  0.06
  T  T  F  T  0.05
  T  T  T  F  0.1
  T  T  T  T  0.05

• 2^4 - 1 = 15 independent parameters
A Simple Example

• Bayesian network: Earthquake (E) and Burglary (B) are parents of Alarm (A); Alarm is the parent of NeighborCalls (N)

  P(B):              P(E):
  B=F   B=T          E=F   E=T
  0.9   0.1          0.7   0.3

  P(A | E, B):
  E  B   A=F   A=T
  F  F   0.99  0.01
  F  T   0.1   0.9
  T  F   0.3   0.7
  T  T   0.01  0.99

  P(N | A):
  A   N=F  N=T
  F   0.9  0.1
  T   0.2  0.8

7 independent parameters
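
To make the compact representation concrete, here is a minimal Python sketch (not part of the slides; the variable names and dictionary encodings are illustrative, and the CPT values are taken from the tables as reconstructed above). It evaluates the factored joint P(E,B,A,N) = P(E)·P(B)·P(A|E,B)·P(N|A) and checks that it sums to 1 over all 16 assignments.

    # Minimal sketch: evaluating the factored joint with the CPT values above.
    from itertools import product

    P_B = {False: 0.9, True: 0.1}                      # prior on Burglary (as read off the slide)
    P_E = {False: 0.7, True: 0.3}                      # prior on Earthquake
    P_A = {(False, False): 0.01, (False, True): 0.9,   # P(A=T | E, B)
           (True, False): 0.7,   (True, True): 0.99}
    P_N = {False: 0.1, True: 0.8}                      # P(N=T | A)

    def joint(e, b, a, n):
        p_a = P_A[(e, b)] if a else 1 - P_A[(e, b)]
        p_n = P_N[a] if n else 1 - P_N[a]
        return P_E[e] * P_B[b] * p_a * p_n

    # The factored joint sums to 1 over all 16 assignments.
    print(sum(joint(e, b, a, n) for e, b, a, n in product([False, True], repeat=4)))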
Example Bayesian Network

• The “Alarm” network for monitoring intensive care patients
  - 37 variables
  - 509 parameters (full joint: 2^37)

[Figure: the ALARM network structure over its 37 variables, e.g., PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, VENTLUNG, SAO2, ARTCO2, CATECHOL, HR, HREKG, BP, ...]
Application: Clustering Users

• Input: TV shows that each user watches
• Output: TV show “clusters”
  - Assumption: shows watched by the same users are similar

Class 1: Power rangers, Animaniacs, X-men, Tazmania, Spider man
Class 2: Young and restless, Bold and the beautiful, As the world turns, Price is right, CBS eve news
Class 3: Tonight show, Conan O’Brien, NBC nightly news, Later with Kinnear, Seinfeld
Class 4: 60 minutes, NBC nightly news, CBS eve news, Murder she wrote, Matlock
Class 5: Seinfeld, Friends, Mad about you, ER, Frasier
App.: Recommendation Systems

• Given user preferences, suggest recommendations
  - Example: Amazon.com
• Input: movie preferences of many users
• Solution: model correlations between movie features
  - Users that like comedy often like drama
  - Users that like action often do not like cartoons
  - Users that like Robert De Niro films often like Al Pacino films
• Given a user’s preferences, we can predict the probability that new movies match those preferences
Diagnostic Systems

• Diagnostic indexing for the home health site at Microsoft
  - Enter symptoms → recommended multimedia content
• Online troubleshooters
App.: Finding Regulatory Networks

• Model: P(Level | Module, Regulators)
  - The expression level of a gene in each experiment is a function of the module the gene belongs to and of the expression of the module’s regulators
  - Which module does gene “g” belong to?

[Figure: Bayesian network over Gene, Experiment, Module, Regulator1-Regulator3 (e.g., HAP4, CMK1, BMH1, GIC2), and expression Level]
App.: Finding Regulatory Networks

[Figure: inferred regulatory network of numbered modules and their regulators. Legend: experimentally tested regulator; enriched cis-regulatory motif; regulator (transcription factor); regulator (signaling molecule); regulation supported in literature; inferred regulation; module (number). Example module groups: energy and cAMP signaling; DNA and RNA processing; amino acid metabolism. Example regulators: Gat1, Hap4, Msn4, Xbp1, Sip2, Kin82, Tpk1, Cmk1, Ppt1, Bmh1, Gcn20.]
Prerequisites

• Probability theory
  - Conditional probabilities
  - Joint distributions
  - Random variables
• Information theory
• Function optimization
• Graph theory
• Computational complexity
Probability Theory

• A probability distribution P over (Ω, S) is a mapping from events in S such that:
  - P(α) ≥ 0 for all α ∈ S
  - P(Ω) = 1
  - If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)
• Conditional probability: P(α | β) = P(α ∩ β) / P(β)
• Chain rule: P(α ∩ β) = P(α) P(β | α)
• Bayes rule: P(α | β) = P(β | α) P(α) / P(β)
• Conditional independence: (α ⊥ β | γ) if P(α | β ∩ γ) = P(α | γ)
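As a quick numeric illustration of Bayes rule, here is a minimal Python sketch (the numbers are made up, not from the slides): P(a|b) = P(b|a)·P(a) / P(b), with P(b) obtained by total probability.

    P_a = 0.01            # prior P(a)
    P_b_given_a = 0.9     # likelihood P(b|a)
    P_b_given_not_a = 0.05

    P_b = P_b_given_a * P_a + P_b_given_not_a * (1 - P_a)   # total probability
    P_a_given_b = P_b_given_a * P_a / P_b                    # Bayes rule
    print(P_a_given_b)                                       # ≈ 0.154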
Random Variables & Notation

• Random variable: a function from Ω to a value
  - Categorical / ordinal / continuous
• Val(X) – set of possible values of RV X
• Upper case letters denote RVs (e.g., X, Y, Z)
• Upper case bold letters denote sets of RVs (e.g., X, Y)
• Lower case letters denote RV values (e.g., x, y, z)
• Lower case bold letters denote RV set values (e.g., x)
• Values for categorical RVs with |Val(X)|=k: x1, x2, …, xk
• Marginal distribution over X: P(X)
• Conditional independence: X is independent of Y given Z if
  ∀x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z): (X=x ⊥ Y=y | Z=z) in P
Expectation

• Discrete RVs: E_P[X] = Σ_x x·P(x)
• Continuous RVs: E_P[X] = ∫ x·p(x) dx
• Linearity of expectation:
  E_P[X+Y] = Σ_{x,y} (x+y)·P(x,y)
           = Σ_{x,y} x·P(x,y) + Σ_{x,y} y·P(x,y)
           = Σ_x x·P(x) + Σ_y y·P(y)
           = E_P[X] + E_P[Y]
• Expectation of products (when X ⊥ Y in P):
  E_P[X·Y] = Σ_{x,y} x·y·P(x,y)
           = Σ_{x,y} x·y·P(x)·P(y)        (independence assumption)
           = (Σ_x x·P(x)) · (Σ_y y·P(y))
           = E_P[X]·E_P[Y]
Variance

• Variance of an RV:
  Var_P[X] = E_P[(X − E_P[X])^2]
           = E_P[X^2 − 2·X·E_P[X] + E_P[X]^2]
           = E_P[X^2] − 2·E_P[X]·E_P[X] + E_P[X]^2
           = E_P[X^2] − E_P[X]^2
• If X and Y are independent: Var[X+Y] = Var[X] + Var[Y]
• Var[aX+b] = a^2·Var[X]
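
A minimal numeric check of these identities (a sketch with simulated samples, not part of the slides): linearity of expectation, Var[X+Y] = Var[X] + Var[Y] for independent X, Y, and Var[aX+b] = a^2·Var[X].

    import random

    random.seed(0)
    xs = [random.choice([0, 1, 2]) for _ in range(100000)]   # samples of X
    ys = [random.choice([0, 5]) for _ in range(100000)]      # independent samples of Y

    def mean(v):
        return sum(v) / len(v)

    def var(v):
        m = mean(v)
        return mean([(t - m) ** 2 for t in v])

    print(mean([x + y for x, y in zip(xs, ys)]), mean(xs) + mean(ys))   # ≈ equal
    print(var([x + y for x, y in zip(xs, ys)]), var(xs) + var(ys))      # ≈ equal
    print(var([3 * x + 7 for x in xs]), 9 * var(xs))                    # Var[aX+b] = a^2 Var[X]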
Information Theory

• Entropy: H_P(X) = −Σ_x P(x)·log P(x)
  - We use log base 2 to interpret entropy as bits of information
  - The entropy of X is a lower bound on the avg. # of bits needed to encode values of X
  - 0 ≤ H_P(X) ≤ log|Val(X)| for any distribution P(X)
• Conditional entropy: H_P(X|Y) = H_P(X,Y) − H_P(Y) = −Σ_{x,y} P(x,y)·log P(x|y)
  - Information only helps: H_P(X|Y) ≤ H_P(X)
• Mutual information: I_P(X;Y) = H_P(X) − H_P(X|Y) = Σ_{x,y} P(x,y)·log [P(x|y) / P(x)]
  - 0 ≤ I_P(X;Y) ≤ H_P(X)
  - Symmetry: I_P(X;Y) = I_P(Y;X)
  - I_P(X;Y) = 0 iff X and Y are independent
• Chain rule of entropies: H_P(X1,…,Xn) = H_P(X1) + H_P(X2|X1) + … + H_P(Xn|X1,…,Xn−1)
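
A minimal Python sketch of these quantities (the joint table is illustrative, not from the slides): entropy, conditional entropy, and mutual information computed directly from a joint P(x, y).

    from math import log2

    P = {('x0', 'y0'): 0.4, ('x0', 'y1'): 0.1,
         ('x1', 'y0'): 0.1, ('x1', 'y1'): 0.4}

    Px = {}
    Py = {}
    for (x, y), p in P.items():
        Px[x] = Px.get(x, 0) + p
        Py[y] = Py.get(y, 0) + p

    H_X  = -sum(p * log2(p) for p in Px.values())
    H_Y  = -sum(p * log2(p) for p in Py.values())
    H_XY = -sum(p * log2(p) for p in P.values())
    H_X_given_Y = H_XY - H_Y             # H(X|Y) = H(X,Y) - H(Y)
    I_XY = H_X - H_X_given_Y             # I(X;Y) = H(X) - H(X|Y)
    print(H_X, H_X_given_Y, I_XY)        # information only helps: H(X|Y) <= H(X)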
Distances Between Distributions

• Relative entropy (KL divergence): D(P‖Q) = Σ_x P(x)·log [P(x) / Q(x)]
  - D(P‖Q) ≥ 0
  - D(P‖Q) = 0 iff P = Q
  - Not a distance metric (no symmetry and no triangle inequality)
• L1 distance: ‖P − Q‖_1 = Σ_x |P(x) − Q(x)|
• L2 distance: ‖P − Q‖_2 = [Σ_x (P(x) − Q(x))^2]^(1/2)
• L∞ distance: ‖P − Q‖_∞ = max_x |P(x) − Q(x)|
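
A minimal Python sketch computing these four quantities for two small distributions (the distributions are illustrative, not from the slides).

    from math import log2, sqrt

    P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
    Q = {'a': 0.4, 'b': 0.4, 'c': 0.2}

    kl   = sum(P[x] * log2(P[x] / Q[x]) for x in P)       # D(P||Q), not symmetric
    l1   = sum(abs(P[x] - Q[x]) for x in P)
    l2   = sqrt(sum((P[x] - Q[x]) ** 2 for x in P))
    linf = max(abs(P[x] - Q[x]) for x in P)
    print(kl, l1, l2, linf)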
Optimization Theory

• Find values θ1,…,θn such that f(θ1,…,θn) = max_{θ'1,…,θ'n} f(θ'1,…,θ'n)
• Optimization strategies
  - Solve the gradient analytically and verify a local maximum
  - Gradient search: guess initial values, and improve iteratively
    - Gradient ascent
    - Line search
    - Conjugate gradient
• Lagrange multipliers
  - Solve a maximization problem with constraints C: c_j(θ) = 0 for j = 1,…,n
  - Maximize L(θ, λ) = f(θ) + Σ_{j=1}^n λ_j·c_j(θ)
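
A minimal gradient-ascent sketch (illustrative objective, not from the slides): maximize f(θ) = −(θ − 3)^2, whose gradient is f'(θ) = −2(θ − 3).

    def grad(theta):
        return -2.0 * (theta - 3.0)

    theta = 0.0          # initial guess
    step = 0.1           # fixed step size (line search / conjugate gradient would adapt it)
    for _ in range(100):
        theta += step * grad(theta)
    print(theta)         # converges to the maximizer theta = 3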
Graph Theory

• Undirected graph
• Directed graph
• Complete graph (every two nodes connected)
• Acyclic graph
• Partially directed acyclic graph (PDAG)
• Induced graph
• Sub-graph
• Graph algorithms
  - Shortest path from node X1 to all other nodes (BFS)
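
A minimal BFS sketch (the graph is illustrative, not from the slides): shortest path length, in edges, from a source node to every other node of an undirected graph.

    from collections import deque

    graph = {'X1': ['X2', 'X3'], 'X2': ['X1', 'X4'],
             'X3': ['X1', 'X4'], 'X4': ['X2', 'X3', 'X5'], 'X5': ['X4']}

    def bfs_distances(source):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:          # first visit gives the shortest distance
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    print(bfs_distances('X1'))   # {'X1': 0, 'X2': 1, 'X3': 1, 'X4': 2, 'X5': 3}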
Representing Joint Distributions

• Random variables: X1,…,Xn
• P is a joint distribution over X1,…,Xn
• If X1,…,Xn are binary, we need 2^n parameters to describe P
• Can we represent P more compactly?
  → Key: exploit independence properties
Independent Random Variables

• Two variables X and Y are independent if
  - P(X=x|Y=y) = P(X=x) for all values x, y
  - Equivalently, knowing Y does not change predictions of X
• If X and Y are independent then:
  - P(X, Y) = P(X|Y)·P(Y) = P(X)·P(Y)
• If X1,…,Xn are independent then:
  - P(X1,…,Xn) = P(X1)···P(Xn)
  - O(n) parameters
  - All 2^n probabilities are implicitly defined
  - Cannot represent many types of distributions
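
A minimal sketch of this point (illustrative numbers, not from the slides): a joint over n independent binary variables is represented with n parameters instead of a 2^n-entry table, and all 2^n probabilities are still implicitly defined.

    from itertools import product

    p = [0.9, 0.3, 0.5]                 # p[i] = P(X_i = 1); only n = 3 parameters

    def joint(assignment):              # assignment: tuple of 0/1 values, one per variable
        prob = 1.0
        for pi, xi in zip(p, assignment):
            prob *= pi if xi == 1 else 1 - pi
        return prob

    # All 2^n probabilities are implicitly defined and sum to 1.
    print(sum(joint(a) for a in product([0, 1], repeat=len(p))))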
Conditional Independence

• X and Y are conditionally independent given Z if
  - P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z
  - Equivalently, if we know Z, then knowing Y does not change predictions of X
• Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)
Conditional Parameterization

• S = Score on test, Val(S) = {s0, s1}
• I = Intelligence, Val(I) = {i0, i1}

• Joint parameterization P(I,S) – 3 parameters:
  I   S   P(I,S)
  i0  s0  0.665
  i0  s1  0.035
  i1  s0  0.06
  i1  s1  0.24

• Conditional parameterization P(I), P(S|I) – 3 parameters:
  P(I):   i0: 0.7, i1: 0.3
  P(S|I): I=i0: s0 0.95, s1 0.05
          I=i1: s0 0.2,  s1 0.8

• Alternative parameterization: P(S) and P(I|S)
Conditional Parameterization

• S = Score on test, Val(S) = {s0, s1}
• I = Intelligence, Val(I) = {i0, i1}
• G = Grade, Val(G) = {g0, g1, g2}
• Assume that G and S are independent given I
• Joint parameterization
  - 2·2·3 = 12 entries, i.e., 12 − 1 = 11 independent parameters
• Conditional parameterization
  - P(I,S,G) = P(I)·P(S|I)·P(G|I,S) = P(I)·P(S|I)·P(G|I)
  - P(I) – 1 independent parameter
  - P(S|I) – 2·1 independent parameters
  - P(G|I) – 2·2 independent parameters
  - 7 independent parameters in total
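
A minimal sketch of this factorization in Python (P(I) and P(S|I) use the values from the previous slide; the P(G|I) numbers are illustrative, not from the slides): build P(I,S,G) = P(I)·P(S|I)·P(G|I) from its CPDs and check it sums to 1.

    from itertools import product

    P_I = {'i0': 0.7, 'i1': 0.3}                                    # 1 independent parameter
    P_S_given_I = {'i0': {'s0': 0.95, 's1': 0.05},                  # 2 independent parameters
                   'i1': {'s0': 0.2,  's1': 0.8}}
    P_G_given_I = {'i0': {'g0': 0.2, 'g1': 0.3, 'g2': 0.5},         # 4 independent parameters
                   'i1': {'g0': 0.7, 'g1': 0.2, 'g2': 0.1}}

    def joint(i, s, g):
        return P_I[i] * P_S_given_I[i][s] * P_G_given_I[i][g]

    # 7 independent parameters in total; the 12-entry joint sums to 1.
    print(sum(joint(i, s, g)
              for i, s, g in product(P_I, ['s0', 's1'], ['g0', 'g1', 'g2'])))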
Naïve Bayes Model

• Class variable C, Val(C) = {c1,…,ck}
• Evidence variables X1,…,Xn
• Naïve Bayes assumption: evidence variables are conditionally independent given C

  P(C, X1,…,Xn) = P(C)·Π_{i=1}^n P(Xi | C)

• Applications in medical diagnosis, text classification
• Used as a classifier:

  P(C=c1 | x1,…,xn) / P(C=c2 | x1,…,xn)
    = [P(C=c1) / P(C=c2)] · Π_{i=1}^n [P(xi | C=c1) / P(xi | C=c2)]

• Problem: double counting of correlated evidence
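
A minimal naïve Bayes classifier sketch (the CPD values are illustrative, not from the slides): score each class by P(C=c)·Π_i P(x_i | C=c) and pick the highest-scoring class.

    P_C = {'c1': 0.6, 'c2': 0.4}
    P_X_given_C = {                       # P(X_i = 1 | C) for two binary evidence variables
        'c1': [0.8, 0.3],
        'c2': [0.2, 0.9],
    }

    def classify(x):                      # x: observed 0/1 values for X_1,...,X_n
        scores = {}
        for c in P_C:
            score = P_C[c]
            for p, xi in zip(P_X_given_C[c], x):
                score *= p if xi == 1 else 1 - p
            scores[c] = score             # proportional to P(C=c | x_1,...,x_n)
        return max(scores, key=scores.get)

    print(classify([1, 0]))   # 'c1'
    print(classify([0, 1]))   # 'c2'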
Bayesian Network (Informal)

• Directed acyclic graph G
  - Nodes represent random variables
  - Edges represent direct influences between random variables
• Local probability models

[Figure: three small example networks: Example 1 over I and S; Example 2 over I, S, and G; a Naïve Bayes network C → X1, X2, …, Xn]
Bayesian Network (Informal)

• Represents a joint distribution
  - Specifies the probability P(X=x)
  - Specifies the probability P(X=x | E=e)
• Allows for reasoning patterns
  - Prediction (e.g., intelligent → high scores)
  - Explanation (e.g., low score → not intelligent)
  - Explaining away (different causes for the same effect interact)

[Figure: Example 2 network over I, S, G]
Bayesian Network Structure

• Directed acyclic graph G
  - Nodes X1,…,Xn represent random variables
• G encodes the local Markov assumptions
  - Xi is independent of its non-descendants given its parents
  - Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))

[Figure: example DAG over nodes A, B, C, D, E, F, G illustrating E ⊥ {A,C,D,F} | B]
Independency Mappings (I-Maps)

• Let P be a distribution over X
• Let I(P) be the set of independencies (X ⊥ Y | Z) that hold in P
• A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

• Example: two structures over I and S
  - G1: no edge between I and S, so I(G1) = {I ⊥ S}
  - G2: I → S, so I(G2) = ∅

• P1(I,S), for which I(P1) = {I ⊥ S}:
  I   S   P1(I,S)
  i0  s0  0.25
  i0  s1  0.25
  i1  s0  0.25
  i1  s1  0.25

• P2(I,S), for which I(P2) = ∅:
  I   S   P2(I,S)
  i0  s0  0.4
  i0  s1  0.3
  i1  s0  0.2
  i1  s1  0.1
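
A minimal Python sketch checking whether I ⊥ S holds in each of the two distributions above, i.e., whether P(i, s) = P(i)·P(s) for all values (so whether I(P) contains I ⊥ S).

    P1 = {('i0', 's0'): 0.25, ('i0', 's1'): 0.25, ('i1', 's0'): 0.25, ('i1', 's1'): 0.25}
    P2 = {('i0', 's0'): 0.4,  ('i0', 's1'): 0.3,  ('i1', 's0'): 0.2,  ('i1', 's1'): 0.1}

    def independent(P, tol=1e-9):
        P_i = {}
        P_s = {}
        for (i, s), p in P.items():
            P_i[i] = P_i.get(i, 0) + p
            P_s[s] = P_s.get(s, 0) + p
        return all(abs(P[(i, s)] - P_i[i] * P_s[s]) < tol for (i, s) in P)

    print(independent(P1))   # True  -> I(P1) = {I ⊥ S}
    print(independent(P2))   # False -> I(P2) is empty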
Factorization Theorem

If G is an I-map of P, then P(X1,…,Xn) = Π_{i=1}^n P(Xi | Pa(Xi))

Proof:
• wlog. X1,…,Xn is an ordering consistent with G
• By the chain rule: P(X1,…,Xn) = Π_{i=1}^n P(Xi | X1,…,Xi−1)
• From the ordering assumption: Pa(Xi) ⊆ {X1,…,Xi−1} and {X1,…,Xi−1} \ Pa(Xi) ⊆ NonDesc(Xi)
• Since G is an I-map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P)
  ⇒ P(Xi | X1,…,Xi−1) = P(Xi | Pa(Xi))
Factorization Implies I-Map

If P(X1,…,Xn) = Π_{i=1}^n P(Xi | Pa(Xi)), then G is an I-map of P

Proof:
• Need to show (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), i.e., that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi))
• wlog. X1,…,Xn is an ordering consistent with G

  P(Xi | NonDesc(Xi)) = P(Xi, NonDesc(Xi)) / P(NonDesc(Xi))
                      = Π_{k=1}^i P(Xk | Pa(Xk)) / Π_{k=1}^{i−1} P(Xk | Pa(Xk))
                      = P(Xi | Pa(Xi))
Bayesian Network Definition

• A Bayesian network is a pair (G, P)
  - P factorizes over G
  - P is specified as a set of CPDs associated with G’s nodes
• Parameters
  - Joint distribution: 2^n
  - Bayesian network (bounded in-degree k): n·2^k
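
A minimal sketch of this parameter comparison (the choices of n and k are illustrative): full joint over n binary variables (2^n) versus a Bayesian network with in-degree bounded by k (n·2^k).

    def joint_params(n):
        return 2 ** n

    def bn_params(n, k):
        return n * 2 ** k

    for n in [10, 20, 37]:
        print(n, joint_params(n), bn_params(n, k=4))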
Bayesian Network Design

• Variable considerations
  - Clarity test: can an omniscient being determine its value?
  - Hidden variables?
  - Irrelevant variables
• Structure considerations
  - Causal order of variables
  - Which independencies (approximately) hold?
• Probability considerations
  - Zero probabilities
  - Orders of magnitude
  - Relative values