Data Intensive Research: The Complexity Dimension in Data Analysis Chris Williams

advertisement
Data Intensive Research: The Complexity
Dimension in Data Analysis
Chris Williams
School of Informatics, University of Edinburgh
March 2010
Outline
I
Probabilistic graphical models
I
Reconstructing gene regulatory networks with Bayesian
networks using multiple sources of prior knowledge (Werhli
and Husmeier, 2007)
I
Premature baby monitoring (Quinn, Williams and
McIntosh, IEEE PAMI 2008)
Probabilistic graphical models
I
Probabilistic graphical models are a tool for modelling
complex networks of relationships with are
non-deterministic
I
In a directed graphical model G (aka Bayesian network)
the joint probability is defined as
Y
p(X1 , . . . , Xm |G) =
p(Xi |parentsi , G)
i
I
It is the missing edges that describe conditional
independences
I
Bayesian networks represent conditional (in)dependence
relations: not necessarily causal interactions
Example: Does my car start?
P(f=empty) = 0.05
P(b=bad) = 0.02
Battery
Fuel
Gauge
Turn Over
P(g=empty|b=good, f=not empty) = 0.04
P(g=empty| b=good, f=empty) = 0.97
P(g=empty| b=bad, f=not empty) = 0.10
P(g=empty|b=bad, f=empty) = 0.99
P(t=no|b=good) = 0.03
P(t=no|b=bad) = 0.98
Heckerman (1995)
Start
P(s=no|t=yes, f=not empty) = 0.01
P(s=no|t=yes, f=empty) = 0.92
P(s=no| t = no, f=not empty) = 1.0
P(s=no| t = no, f = empty) = 1.0
Reconstructing gene regulatory networks
I
Reconstructing Gene Regulatory Networks with Bayesian
Networks by Combining Expression Data with Multiple
Sources of Prior Knowledge.
Adriano V. Werhli, Dirk Husmeier, Statistical Applications in
Genetics and Molecular Biology 6(1), 2007.
I
Goal: learn about the underlying graphical structure based
on data and (uncertain) prior knowledge
p(G|D, B) ∝ p(D|G)p(G|B)
I
I
I
I
I
G is the graph structure
D is the data
B is the background knowledge
In this case p(D|G) can be computed exactly (integrating
out parameters analytically)
Sample p(G|D, B) using MCMC methods
Data
I
Raf is a critical signalling protein involved in regulating
cellular proliferation in human immune system cells
I
Deregulation of the Raf pathway can lead to
carcinogenesis
I
Data from flow cytometry experiments (Sachs et al,
Science 2005)
Currently accepted Raf signalling network, taken from Sachs et al.
(2005).
Prior knowledge
I
Prior knowledge from KEGG pathways database
I
Gij = 1 means that the edge i → j is present; Gij = 0
means that this edge is absent
I
I
Let Bij ∈ [0, 1] reflect the strength of belief that Gij = 1. Set
Bij = 0.5 if there is no information
P
E(G) = i,j |Bij − Gij |
I
Overall:
p(G|B, β) =
1
exp(−βE(G))
Z (β)
Results
I
I
Evaluation by (left) area under ROC curve, (right) number
of predicted true positive edges for a fixed number (5)
spurious edges
Error bars from sampling 5 data sets of size 100
Premature Baby Monitoring
Quinn, Williams and McIntosh (2008)
I
I
I
Why model this data?
Artifact corruption, leading to false alarms
Our aim is to determine the baby’s state of health despite
these problems
Overview
I
Factors
I
Factorial switching Kalman filter
I
Inference
I
Parameter estimation
I
Results
I
Modelling novel regimes
Probes
1. ECG, 2. arterial line, 3. pulse oximeter 4. core temperature,
5. peripheral temperature, 6. transcutaneous probe.
Factors affecting measurements
I
60
Sys. BP (mmHg)
I
The physiological observations are affected by different
factors.
Factors can be artifactual or physiological.
An arterial blood sample (artifact):
50
40
30
60
Dia. BP (mmHg)
I
40
20
0
0
200
400
600
Time (s)
800
1000
Common factor examples
Transcutaneous probe recalibration (artifact)
30
TcPO2 (kPa)
TcPO2 (kPa)
30
20
10
5
0
0
20
10
0
15
TcPCO2 (kPa)
0
10
TcPCO2 (kPa)
I
1000
2000 3000
Time (s)
4000
5000
10
5
0
0
2000
4000
Time (s)
6000
Common factor examples
Bradycardia (physiological)
180
180
160
160
140
140
HR (bpm)
HR (bpm)
I
120
100
120
100
80
80
60
60
40
0
20
40
60
80
Time (s)
100
40
0
50
100
Time (s)
150
Factorial Switching Kalman Filter
Artifactual factors
Physiological factors
Physiological state
Artifactual state
Observations
FSKF notation
I
st is the switch variable, which indexes factor settings, e.g.
‘blood sample occurring and first stage of TCP
recalibration’.
I
xt is the hidden continuous state at time t. This contains
information on the true physiology of the baby, and on the
levels of artifactual processes.
I
y1:t are the observations.
Factor interactions
Factor interactions
Factor interactions
Factor interactions
Factor interactions
Partial Ordering of Factors Overwriting Channels
Sys BP
Dia BP
TcPO2
TcPCO2
SpO2
Core temp.
Periph. temp.
Dropouts
Blood sample
Temp. disconnection
Incubator open
TCP recalibration
Bradycardia
Abnormal (other)
X/Normal
Heart rate
Factors higher up the list overwrite lower factors
Inference
I
I
For this application, we are interested in filtering, inferring
p(st , xt |y1:t ).
Exact inference is intractable, we use two approximations
I
I
I
Gaussian Sum (Alspach and Sorenson, 1972), analytical
approximation
Rao-Blackwellised particle filtering.
Inferring artifactual factors carries out data cleaning as part
of the inferential process
Parameter estimation
I
We need to estimate a dynamical model for each
continuous state variable for each setting of the factors
I
We use AR/ARMA/ARIMA modelling, e.g. an AR(p)
process
p
X
xi (t) =
αij xi (t − j) + t
j=1
I
Fortunately, annotated training data is available
180
160
140
Heart rate120
100
80
60
Bradycardia
0
20
40
60
80
100
120
I
The hidden continuous state in this application is
interpretable, and domain knowledge can be used to help
parameterize the dynamical models for each factor.
Parameter estimation example
37
Core temp.
Incubator
air temp.
36
35
34
33
0
100
200
300
400
500
I
For example, we know that the falling temperature
measurements caused by a probe disconnection will follow
an exponential decay
I
Therefore we can model these dynamics as an AR(1)
process, and set parameters by solving the Yule-Walker
equations.
Inference results
Inference of bradycardia and incubator open factors. Note
that heart rate variation while incubator is open is
attributed to handling of the baby (BR factor suppressed)
HR (bpm)
200
150
100
50
85
Humidity (%)
I
80
75
70
65
BR
IO
0
500
1000
Time (s)
1500
Inference results
55
38
50
37.5
45
Core temp. (°C)
Sys. BP (mmHg)
Can examine variance variance of estimates of true
physiology x̂t , e.g. for blood sample (left) and temperature
probe disconnection (right):
40
35
Dia. BP (mmHg)
I
37
36.5
50
40
30
36
35.5
20
35
TD
BS
0
50
100
150 200
Time (s)
250
0
500
Time (s)
1000
Quantitative Evaluation
I
3-fold cross validation on 360 hours of monitoring data
from 15 babies.
I
FHMM has the same factor structure as the FSKF, with no
hidden continuous state.
Inference type
GS
AUC
RBPF
AUC
FHMM
AUC
EER
EER
EER
Incu. open
0.87
0.17
0.77
0.23
0.78
0.25
Core temp.
0.77
0.34
0.74
0.32
0.74
0.32
Blood sample
0.96
0.14
0.86
0.15
0.82
0.20
Bradycardia
0.88
0.25
0.77
0.28
0.66
0.37
(b) Core temp. disconnect
1
0.8
0.8
sensitivity
sensitivity
(a) Incubator open
1
0.6
0.4
GS
RBPF
FHMM
0.2
0
0
0.5
1 − specificity
0.6
0.4
0.2
0
1
0
1
0.8
0.8
0.6
0.4
0.2
0
1
(d) Bradycardia
1
sensitivity
sensitivity
(c) Blood sample
0.5
1 − specificity
0.6
0.4
0.2
0
0.5
1 − specificity
1
0
0
0.5
1 − specificity
1
Novel dynamics
I
There are many other factors influencing the data: drugs,
sepsis, neurological problems...
200
?
150
Heart rate
100
50
70
60
Dia. BP
50
40
100
SpO2
50
0
0
200
400
600
800
1000
1200
Known Unknowns
Add a factor to represent abnormal dynamics (the X-factor)
I
In practice we build this as a “broadened” version of
normal dynamics
p(y|s)
I
y
X-factor inference results
Classification of periods of clinically significant
cardiovascular disturbance:
55
Sys BP (mmHg)
60
Sys. BP (mmHg)
I
50
40
45
40
35
100
SpO2 (%)
SpO2 (%)
30
100
50
90
80
70
95
90
85
X
X
True X
True X
0
500
1000
1500
Time (s)
2000
2500
0
2000
4000
Time (s)
6000
Baby monitoring: Conclusions
I
FSKF successfully applied to complex physiological
monitoring data
I
Interpretable structure
I
Knowledge engineering used to parameterize dynamic
models
I
Allows monitoring of known and novel dynamics
(supervised and unsupervised learning)
For more information see
http://homepages.inf.ed.ac.uk/ckiw/neonatal/
Overall conclusions
I
Structured probabilistic graphical models are one method
for incorporating domain knowledge into the analysis, while
maintaining the advantages of handling noise/uncertainty
I
For circuit models, the focus may be on determining
network structure(s) compatible with the data and
background knowledge. Such circuit models should (with
some caveats) be useful wrt answering causal/intervention
questions
I
For time series problems, models like the FSKF can be
applied to many different condition monitoring problems
Download