Lecture 4: Graphical models

Bayesian networks

Some general graphical model concepts:
• nodes (vertices)
• links (edges)

A graph can be
• disconnected or connected
• undirected or directed (the edges are one-directional arrows)
• cyclic (possible to start in one node and “come back”) or acyclic
Examples:

Transport routes: [figure: a route graph with nodes F, S, I1, I2A and I2B]
Acyclic, but not completely directed
Junction trees: [figure: a graph with eight nodes A–H transformed into a
junction tree with six clique nodes ABD, BCE, BDG, BEG, DFG and EGH]
From 8 nodes to 6 nodes (Source: Wikipedia)
Bayesian (belief) networks
A Bayesian network is a connected directed acyclic graph (DAG) in
which
• the nodes represent random variables
• the links represent direct relevance relationships among variables
Examples:
X → Y

This small network has two nodes representing the random variables X and Y.
The directed link gives a relevance relationship between the two variables,
meaning Pr(Y = y | X = x, I) ≠ Pr(Y = y | I)
Y ← X → Z

This network has three nodes representing the random variables X, Y and Z.
The directed links give relevance relationships that mean
Pr(Y = y | X = x, I) ≠ Pr(Y = y | I)
Pr(Z = z | X = x, I) ≠ Pr(Z = z | I)
but also (as will be seen below)
Pr(Z = z | Y = y, X = x, I) = Pr(Z = z | X = x, I)
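These statements follow from the way the network factorizes the joint
distribution: Pr(X, Y, Z | I) = Pr(X | I) Pr(Y | X, I) Pr(Z | X, I). A minimal
numeric sketch (plain Python, hypothetical probability tables) checking the
last equality by summing over the joint:

```python
# Diverging network Y <- X -> Z with hypothetical (made-up) tables.
p_x = {0: 0.3, 1: 0.7}                        # Pr(X | I)
p_y_given_x = {0: {0: 0.9, 1: 0.1},           # Pr(Y | X, I)
               1: {0: 0.2, 1: 0.8}}
p_z_given_x = {0: {0: 0.6, 1: 0.4},           # Pr(Z | X, I)
               1: {0: 0.1, 1: 0.9}}

def joint(x, y, z):
    # Pr(X, Y, Z | I) = Pr(X | I) * Pr(Y | X, I) * Pr(Z | X, I)
    return p_x[x] * p_y_given_x[x][y] * p_z_given_x[x][z]

# Pr(Z = 1 | X = 0, Y = 1) computed from the joint ...
num = joint(0, 1, 1)
den = sum(joint(0, 1, z) for z in (0, 1))
print(num / den)            # 0.4
# ... equals Pr(Z = 1 | X = 0) read directly from its table:
print(p_z_given_x[0][1])    # 0.4
```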
Structures in a Bayesian network

There are two classifications for nodes: parent nodes and child nodes.
[figure: a node with links from its parent nodes and to its child nodes]
Thus, a node can be solely a parent node, solely a child node, or both!
Probability “tables”
Each node represents a random variable.
This random variable has either assigned probabilities (nominal
scale or discrete) or an assigned probability density function
(continuous scale) for its states.
For a node that is solely a parent node:
The assigned probabilities or density function are conditional on
background information only (may be expressed as unconditional)
For a node that is a child node (solely or joint parent/child):
The assigned probabilities or density function are conditional on the
states of its parent nodes (and on background information).
Example: Probability tables

X → Y

X has the states x1 and x2:

X      Probability
x1     Pr(X = x1 | I)
x2     Pr(X = x2 | I)

Y has the states y1 and y2:

       X:  x1                       x2
Y: y1      Pr(Y = y1 | X = x1, I)   Pr(Y = y1 | X = x2, I)
   y2      Pr(Y = y2 | X = x1, I)   Pr(Y = y2 | X = x2, I)
Example: Dyes on banknotes (from previous lectures)

Node A with two states:
A : “Dye is present”
Ā : “Dye is absent”

A      Probability
A      0.001
Ā      0.999

Node B with two states:
B : “Result is positive”
B̄ : “Result is negative”

       A:  A      Ā
B: B       0.99   0.02
   B̄       0.01   0.98
More about the structure…

Ancestors and descendants:
A node X is an ancestor of a node Y, and Y is in turn a descendant of X, if
there is a unidirectional path from X to Y.

[figure: a DAG with nodes A–I]

Ancestor   Descendants
A          D, E, G, I
B          E, F, G, H, I
C          F, H, I
E          G, I
F          H, I
G          I
H          I
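Descendant sets are just graph reachability. A small sketch in plain Python
(the edge list below is my reconstruction from the descendant table, since the
figure is not reproduced here) that recomputes the table by depth-first search:

```python
# Edges chosen so that they reproduce the descendant table above.
children = {"A": ["D", "E"], "B": ["E", "F"], "C": ["F"], "D": [],
            "E": ["G"], "F": ["H"], "G": ["I"], "H": ["I"], "I": []}

def descendants(node):
    seen, stack = set(), list(children[node])
    while stack:                       # depth-first search
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children[n])
    return sorted(seen)

for node in "ABCEFGH":
    print(node, descendants(node))     # e.g. A ['D', 'E', 'G', 'I']
```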
Different connections:

• diverging connection: B ← A → C
• serial connection: A → B → C
• converging connection: A → C ← B
Conditional independence and d-separation

1) Diverging connection: B ← A → C

There is a path between B and C even if it is not unidirectional.
B may be relevant for C (and vice versa).
However, if the state of A is known, this relevance is lost: the path is
blocked.
B and C are conditionally independent given A.
Example:
Assume the old rat Willie was caught in a trap.
We have also found a sack of wheat grains with a small hole where grains have
leaked out, and we suspect that Willie made this hole.
Examining the sack and Willie we find
• traces of wheat grain in Willie’s jaw
• traces of saliva at the damage on the sack that match the DNA of Willie

The whole scenario can be described with three random variables:
A with the two states:
A1: “Willie made the hole in the sack”
A2: “Willie has not been near the hole in the sack”
B with the two states:
B1: “Traces of wheat grain found in Willie’s jaw”
B2: “No traces of wheat grain found in Willie’s jaw”
C with the two states:
C1: “Match between saliva DNA and Willie’s DNA”
C2: “No match in DNA between saliva and Willie”
Note that the states of B and C are actually given, but the description gives
a complete model.
First we assume none of the states are given:
Is B relevant for C ?
Yes, because if B1 is true, i.e. we have found wheat grains in Willie’s
jaw, the conditional probability of obtaining a match in DNA would
be different from the corresponding conditional probability if B2 was
true.
Now assume for A that the state A2 is given, i.e. Willie was never near the
sack.
Under this condition B can no longer be relevant for C, since whether we find
a match in DNA between the saliva trace and Willie or not can have nothing to
do with the grains we have found in Willie’s jaw.
Now assume for A that the state A1 is given, i.e. Willie made the hole.
Under this condition it is tempting to think that B is relevant for C, but
the relevance is actually lost. Whether we find a match in DNA or not
cannot have any impact on whether we find grains in the jaw or not
once we have stated that Willie made the hole.
The scenario can be described with the Bayesian network B ← A → C,
i.e. a diverging connection.
When a state of a node is assumed to be given we say the node is
instantiated
In the example, once A is instantiated the relevance relationship
between B and C is lost.
B and C are thus conditionally independent given a state of A.
2) Serial connection: A → B → C

There is a path between A and C (unidirectional from A to C).
A may be relevant for C (and vice versa).
If the state of B is known, this relevance is lost: the path is blocked.
A and C are conditionally independent given (a state of) B.
Example: The Willie case with another description

Let the random variables be
A with the two states
A1: “Willie made the hole in the sack”
A2: “Willie did not make the hole”,
B with the two states
B1: “Willie left saliva on the damage”
B2: “Willie left no saliva”,
C with the two states
C1: “There is a match in DNA”
C2: “There is no match”
Assuming no state is given, there is a relevance relationship between A and C:
Pr(C1 | A1, I) ≠ Pr(C1 | A2, I)
Now, assume state B1 of B is given, i.e. we assume there was contact between
Willie’s jaw and the damage.
A can no longer be relevant for C: once we have stated that Willie left
saliva, it does not matter for C whether he made the hole or not.
The scenario can be described with the Bayesian network A → B → C.
Once B is instantiated, the relevance relationship between A and C is lost.
A and C are conditionally independent given a state of B.
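The same kind of numeric check as for the diverging case works here, now for
the serial chain A → B → C (a sketch with hypothetical tables): conditioning
on B makes A irrelevant for C, while marginally A is relevant.

```python
p_a = {0: 0.4, 1: 0.6}                                    # Pr(A | I)
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # Pr(B | A, I)
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # Pr(C | B, I)

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

for a in (0, 1):   # Pr(C = 1 | A = a, B = 1): identical for both values of a
    print(joint(a, 1, 1) / sum(joint(a, 1, c) for c in (0, 1)))   # 0.7, 0.7

for a in (0, 1):   # Pr(C = 1 | A = a): differs when B is not observed
    print(sum(joint(a, b, 1) for b in (0, 1)) / p_a[a])           # 0.28, 0.58
```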
3) Converging connection: A → C ← B

There is a path between A and B (not unidirectional).
A may be relevant for B (and vice versa).
If the state of C is (completely) unknown, this relevance does not exist: the
path is blocked.
If the state of C is known (exactly, or through a modification of the state
probabilities), the path is opened.
A and B are conditionally dependent given information about the state of C;
otherwise they are (conditionally) independent.
Example: Paternity testing (child, mother and the true father)

Let
A be a random variable representing the mother’s genotype in a specific locus,
B be a random variable representing the true father’s genotype in the same
locus,
C be a random variable representing the child’s genotype in that locus.

[figure: nodes A (alleles A1, A2), B (alleles B1, B2) and C (alleles C1, C2);
A: mother’s genotype, B: true father’s genotype, C: child’s genotype]
If we know nothing about C (C1 and C2 are both unknown), then
• information about A cannot have any impact on B, and vice versa.
If we on the other hand know the genotype of the child (C1 and C2 are both
known, or one of them is), then
• knowledge of the genotype of the mother has impact on the probabilities of
the different genotypes that can be possessed by the true father, since the
child must have inherited half of the genotype from the mother and the other
half from the father.
Bayesian network: A → C ← B
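The qualitative behaviour of the converging connection (“explaining away”) can
be checked numerically. A deliberately simple sketch, not a genetics model:
A and B are fair coin flips and C = 1 exactly when at least one of them is 1.

```python
p_a = {0: 0.5, 1: 0.5}
p_b = {0: 0.5, 1: 0.5}
p_c_given_ab = {(a, b): {0: 0.0 if (a or b) else 1.0,   # C is a deterministic
                         1: 1.0 if (a or b) else 0.0}   # OR of A and B
                for a in (0, 1) for b in (0, 1)}

def joint(a, b, c):
    return p_a[a] * p_b[b] * p_c_given_ab[(a, b)][c]

# C unobserved: Pr(A = 1 | B = 1) = Pr(A = 1) = 0.5 (path blocked)
print(sum(joint(1, 1, c) for c in (0, 1)) /
      sum(joint(a, 1, c) for a in (0, 1) for c in (0, 1)))        # 0.5

# C = 1 observed: learning B = 1 lowers Pr(A = 1) (path opened)
pr_a1_c1 = (sum(joint(1, b, 1) for b in (0, 1)) /
            sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1)))
pr_a1_b1c1 = joint(1, 1, 1) / sum(joint(a, 1, 1) for a in (0, 1))
print(pr_a1_c1, pr_a1_b1c1)     # 0.667 vs. 0.5
```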
d-separation (directional separation)

In a directed acyclic graph (DAG) the concept of d-separation is defined as:

Let SX, SY and SZ be three disjoint subsets of variables included in the DAG.
The sets SX and SY are d-separated given SZ if every path between a variable X
in SX and a variable Y in SY is
either
• a serial connection through a variable Z in SZ or a diverging connection
diverging from a variable Z in SZ
or
• a converging connection converging to a variable W not in SZ (or SX or SY)
of which no descendants belong to SZ (or SX or SY).
[figure: a DAG with variables X1, X2 (red area), Y1, Y2, Y3 (blue area),
Z1, Z2, Z3 (green area) and W1, W2, illustrating that {X1, X2} and
{Y1, Y2, Y3} are d-separated given {Z1, Z2, Z3}]
• No direct link from the red area to the blue area or vice versa
• No convergence from the blue area and the red area to the green area
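d-separation can also be tested mechanically. The sketch below is one standard
way to do it (the “reachable” procedure, Algorithm 3.1 in Koller & Friedman’s
textbook): the nodes reachable from x given observed set Z are exactly those
not d-separated from x given Z. Here `parents` maps each node to its parent
list; the example graphs are my own.

```python
def reachable(parents, x, Z):
    """Nodes with an active trail to x, given the observed set Z."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    # Z and its ancestors: converging connections into these are opened.
    anc, stack = set(), list(Z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])
    visited, result = set(), set()
    queue = [(x, "up")]   # "up": trail arrives from a child; "down": from a parent
    while queue:
        n, d = queue.pop()
        if (n, d) in visited:
            continue
        visited.add((n, d))
        if n not in Z:
            result.add(n)
        if d == "up" and n not in Z:
            queue += [(p, "up") for p in parents[n]]          # serial / diverging
            queue += [(c, "down") for c in children[n]]
        elif d == "down":
            if n not in Z:
                queue += [(c, "down") for c in children[n]]   # serial
            if n in anc:
                queue += [(p, "up") for p in parents[n]]      # converging, opened
    return result - {x}

# Serial chain A -> B -> C: observing B blocks the only path.
parents = {"A": [], "B": ["A"], "C": ["B"]}
print(reachable(parents, "A", set()))    # {'B', 'C'}
print(reachable(parents, "A", {"B"}))    # set()
```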
The Markov property - formal definition of a Bayesian network

Consider a variable X in a DAG.
Let PA(X) be the set of all parents of X and DE(X) the set of all descendants
of X.
Let SY be a set of variables that does not include any variables in DE(X),
i.e. that are not descendants of X.
Then, the DAG is a Bayesian network if and only if
Pr(X | PA(X), SY, I) = Pr(X | PA(X), I)
i.e. X is conditionally independent of SY given PA(X).
This is also known as the Markov property.
Note: by Pr(X | …) we mean the probability of X having a particular state.
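An equivalent statement of the Markov property is that the joint distribution
factorizes as Pr(X1, …, Xn | I) = Pr(X1 | PA(X1), I) · … · Pr(Xn | PA(Xn), I).
A minimal sketch (hypothetical three-node network and tables) of computing a
joint probability from the node tables:

```python
# Network A -> B, A -> C, B -> C with made-up tables; cpt[node] maps a
# tuple of parent states to a distribution over the node's own states.
parents = {"A": (), "B": ("A",), "C": ("A", "B")}
cpt = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.5, 1: 0.5}},
    "C": {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.7, 1: 0.3},
          (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.2, 1: 0.8}},
}

def joint(assignment):
    """Probability of a full assignment, e.g. {'A': 0, 'B': 1, 'C': 0}."""
    p = 1.0
    for node, pa in parents.items():
        pa_states = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_states][assignment[node]]
    return p

print(joint({"A": 0, "B": 1, "C": 0}))    # 0.6 * 0.2 * 0.7 = 0.084
```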
Example

[figure: a DAG with nodes Y1, Y2, Y3, Y4, W1, W2 and X, where PA(X) = {Y3, W1}]

Pr(X | PA(X), SY, I)
= Pr(X | {Y3, W1}, {Y1, Y2, Y3, Y4}, I)
= Pr(X | Y3, W1, Y1, Y2, Y4, I)
= [serial links Y1 → Y3 → X, Y2 → Y3 → X, Y4 → (W2 →) W1 → X]
= Pr(X | Y3, W1, I) = Pr(X | PA(X), I)

Pr(X | PA(X), SW, I)
= Pr(X | {Y3, W1}, {W1, W2}, I)
= Pr(X | Y3, W1, W2, I)
= [serial link W2 → W1 → X]
= Pr(X | Y3, W1, I) = Pr(X | PA(X), I)
Software

GeNIe (Graphical Network Interface)
• Free of charge
• Powerful for building complex networks and running them with moderately
large probability tables
• Download from https://dslpitt.org/genie

HUGIN
• Commercial software
• Probably today’s most powerful software for Bayesian networks
• A demo version (less powerful than GeNIe) can be downloaded from
www.hugin.com

Others: Netica (commercial) https://www.norsys.com/download.html
Example

A burglary was committed in a shop. On the shop floor the police have secured
a shoeprint. In the home of a suspect a shoe is found with a sole pattern that
matches that of the shoeprint.
In a compiled database of shoeprints the particular pattern is found in 3 out
of 657 prints.
Hypotheses (usually called propositions in forensic literature):
Hp : “The shoeprint was made by the found shoe”
Hd : “The shoeprint was made by some other shoe”
“p” in Hp stands for “Prosecutor” (incriminating proposition)
“d” in Hd stands for “Defence” (alternative to the incriminating)
Evidence:
E : “There is a match in pattern between shoeprint and the found shoe”
[Screenshots: the network in GeNIe with default settings; the probability
table for E is automatically set from the table of H]
Setting the probability table for node E

If proposition Hp (“Shoeprint was made by the found shoe”) is true:
Pr(Match | Hp, I) = Pr(E | Hp, I) ≈ 1
⇒ Pr(No match | Hp, I) = Pr(Ē | Hp, I) ≈ 0

If proposition Hd (“Shoeprint was made by another shoe”) is true:
Pr(Match | Hd, I) = Pr(E | Hd, I) = γ
where γ is the proportion of shoes in the (relevant) population of shoes
having the observed pattern
⇒ Pr(No match | Hd, I) = Pr(Ē | Hd, I) = 1 – γ

The proportion γ is unknown, but an estimate from the database can be used:
γ̂ = 3/657 ≈ 0.0046
Run the network: instantiate the match (double-click on the bar or on the
state label).

B = LR = [Pr(Hp | E, I) / Pr(Hd | E, I)] / [Pr(Hp | I) / Pr(Hd | I)]
       ≈ (0.995421 / 0.004579) / (0.5 / 0.5) ≈ 217

On the other hand we could directly have computed
LR ≈ 1 / (3/657) = 219
which is more accurate.
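The same calculation takes a few lines of plain Python and reproduces the
exact value 219 (the 217 above comes from GeNIe’s rounded posterior display):

```python
prior_Hp, prior_Hd = 0.5, 0.5
pr_E_Hp = 1.0          # Pr(match | Hp, I): certain if the found shoe made the print
pr_E_Hd = 3 / 657      # Pr(match | Hd, I): estimated pattern proportion, gamma-hat

post_Hp = pr_E_Hp * prior_Hp        # unnormalized posteriors
post_Hd = pr_E_Hd * prior_Hd
total = post_Hp + post_Hd
print(post_Hp / total, post_Hd / total)   # ~0.995455, ~0.004545

LR = (post_Hp / post_Hd) / (prior_Hp / prior_Hd)
print(LR)                                 # 219.0 (= 657 / 3)
```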
Alternative network

Propositions (as before):
Hp : “The shoeprint was made by the found shoe”
Hd : “The shoeprint was made by some other shoe”

Evidence:
X : Sole pattern of the found shoe; states: q (the observed pattern), non-q
Y : Pattern of the shoeprint; states: q, non-q
LR = Pr(X = q, Y = q | Hp, I) / Pr(X = q, Y = q | Hd, I)

   = [Pr(Y = q | X = q, Hp, I) · Pr(X = q | Hp, I)] /
     [Pr(Y = q | X = q, Hd, I) · Pr(X = q | Hd, I)]

   = [since Pr(X = q | Hp, I) = Pr(X = q | Hd, I)]

   = Pr(Y = q | X = q, Hp, I) / Pr(Y = q | X = q, Hd, I)

   = [under Hd the print pattern is independent of the found shoe’s pattern]

   = Pr(Y = q | X = q, Hp, I) / Pr(Y = q | Hd, I)

   ≈ 1 / Pr(Y = q | Hd, I) ≈ 1 / (3/657)
In a network…

[figure: the network H → Y ← X]

Probability table for X:

X       Probability
q       δ
non-q   1 – δ

Probability table for Y:

          H:  Hp              Hd
          X:  q      non-q    q         non-q
Y: q          1      0        3/657     3/657
   non-q      0      1        654/657   654/657
Note!
We need to give a probability table for X to make the software work.
However, we do not know δ, but it does not matter what value we set here.

Run the network: instantiate nodes X and Y both to q.
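A quick sanity check of the claim that δ does not matter: δ enters both the
numerator and the denominator of the LR and cancels. A sketch with two
arbitrary δ values:

```python
def lr(delta):
    gamma_hat = 3 / 657
    num = 1.0 * delta        # Pr(Y=q | X=q, Hp, I) * Pr(X=q | Hp, I)
    den = gamma_hat * delta  # Pr(Y=q | X=q, Hd, I) * Pr(X=q | Hd, I)
    return num / den

print(lr(0.1), lr(0.9))      # 219.0 both times: delta cancels
```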
Example (more complex): In the head of the experienced examiner

Assume there is a question whether an individual has a specific disease A or
another disease B.
What is observed is:
• The individual has an increased level of substance S.
• The individual has recurrent fever attacks.
• The individual has light recurrent pain in the stomach.
The experience of the examining physician says
1. If disease A is present it is quite common to have an increased level of
substance S
2. If disease B is present it is less common to have an increased level of
substance S
3. If disease A is present it is not generally common to have recurrent fever
attacks, but if there is also an increased level of substance S such events are
very common
4. Recurrent fever attacks are quite common when disease B is present
regardless of the level of substance S
5. Recurrent pain in the stomach is generally more common when disease B is
present than when disease A is present, regardless of the level of substance S
and whether fever attacks are present or not
6. If a patient has disease A, increased levels of substance S and recurrent
fever attacks, he/she would almost certainly have recurrent pain in the
stomach. Otherwise, if disease A is present, recurrent pain in the stomach is
equally common
Can we put this up in a network?
Let the “disease node” be H, with states A and B
Let the “evidence” nodes be
X with states
x1 : “The individual has an increased level of substance S”
x2 : “The individual has a normal level of substance S”
Y with states
y1 : “The individual has recurrent fever attacks”
y2 : “The individual has no fever attacks”
Z with states
z1 : “The individual has light recurrent pain in the stomach”
z2 : “The individual has no pain in the stomach”
Probability table for X:

        H:  A      B
X: x1       δ      γ
   x2       1 – δ  1 – γ
Probability table for Y:

        H:  A              B
        X:  x1     x2      x1     x2
Y: y1       ω      ψ       λ      λ
   y2       1 – ω  1 – ψ   1 – λ  1 – λ
Probability table for Z:

        H:  A                          B
        X:  x1         x2              x1         x2
        Y:  y1    y2   y1    y2        y1    y2   y1    y2
Z: z1       1     β    β     β         α     α    α     α
   z2       0     1–β  1–β   1–β       1–α   1–α  1–α   1–α

The probabilities set out in the tables take into account some of the
experience listed (e.g. that some probabilities are equal).
Estimating numbers for δ and γ

1. If disease A is present it is quite common to have an increased level of
substance S
2. If disease B is present it is less common to have an increased level of
substance S

Experience 1 & 2 ⇒ δ >> γ. Assume δ ≈ 0.8 and γ ≈ 0.2.
Estimating λ, ω and ψ

3. If disease A is present it is not generally common to have recurrent fever
attacks, but if there is also an increased level of substance S such events
are very common
4. Recurrent fever attacks are quite common when disease B is present
regardless of the level of substance S

Experience 3 ⇒ ω high and ψ < 0.5. Assume ω ≈ 0.9 and ψ ≈ 0.3.
Experience 4 ⇒ Assume λ ≈ 0.8.
Estimating α and β

5. Recurrent pain in the stomach is generally more common when disease B is
present than when disease A is present, regardless of the level of substance S
and whether fever attacks are present or not
6. If a patient has disease A, increased levels of substance S and recurrent
fever attacks, he/she would almost certainly have recurrent pain in the
stomach. Otherwise, if disease A is present, recurrent pain in the stomach is
equally common

Experience 5 & 6 ⇒ α > β. Assume α ≈ 0.6 and β ≈ 0.4.
Run the network: instantiate the nodes X, Y and Z.

The likelihood ratio of the evidence becomes
LR ≈ 88.235 / 11.765 ≈ 7.5

Thus the three observations combined are 7.5 times more probable if disease A
is present than if disease B is present.
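The LR can be reproduced by hand from the three tables, since all three
evidence nodes are instantiated (x1, y1, z1): the joint probability of the
evidence under each disease is just a product along the chain of tables.

```python
delta, gamma = 0.8, 0.2          # Pr(x1 | A), Pr(x1 | B)
omega, lam = 0.9, 0.8            # Pr(y1 | A, x1), Pr(y1 | B, x1)
alpha = 0.6                      # Pr(z1 | B, x1, y1)
# Pr(z1 | A, x1, y1) = 1 according to the table for Z.

pr_evidence_A = delta * omega * 1.0     # 0.8 * 0.9 * 1   = 0.72
pr_evidence_B = gamma * lam * alpha     # 0.2 * 0.8 * 0.6 = 0.096
print(pr_evidence_A / pr_evidence_B)    # 7.5
```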