A Formal Model for Dataflows, Runs of
Dataflows, and Provenance within Runs
Natalia Kwasnikowska
and
Jan Van den Bussche
Theoretical Computer Science Group
Hasselt University, Belgium
May 21, 2008
Scientific Data Analysis
Complex experimental data: how to analyze all this? ☹
Scientific Data Analysis
[Figure: external resources, integration layer, dataflow design, dataflow execution, annotations, storage]
Where Do We Fit in the Picture?
[Figure: external resources, integration layer, dataflow design, dataflow execution, annotations, dataflow repository]
Why a Dataflow Repository
● Effective management of
  – experimental data
  – dataflows and dataflow runs
● Data management of different runs
● Verification of dataflow results
● Tracking the provenance of values occurring in a dataflow result
● Querying of all stored information
Our Contribution
● Quest for formal requirements
● Formal conceptual model of a dataflow repository
  – focus on what entities should be stored
  – precise formalisation of those entities
    ● a run of a dataflow
    ● binding of building blocks of a dataflow to services
  – formal integrity constraints on the repository
  – formalisation of provenance of dataflow results
Example
( Reference-Image, {Anatomy-1, Anatomy-2, Anatomy-3, Anatomy-4} )
↓
align_warp
↓
{Warp-Params1, Warp-Params2, Warp-Params3, Warp-Params4}
↓
reslice
↓
{Resliced-Image1, Resliced-Image2, Resliced-Image3, Resliced-Image4}
↓
softmean
↓
Atlas-Image
↓
slicer
↓
{Atlas-X-Slice, Atlas-Y-Slice, Atlas-Z-Slice}
↓
convert
↓
{Atlas-X-Graphic, Atlas-Y-Graphic, Atlas-Z-Graphic}
Dataflows – Nested Relational Calculus
dataflow brainAtlas
(r: Image, p: {Image}, a: APs, m: MPs, s: {SPs}) : {SlicedImage}
is
slice(aggregate(align(r, p, a), m), s)
dataflow alignImages (r: Image, p: {Image}, a: APs) : {Image} is
for i in p return
performWarp(analyzeWarp(r, i, p))
dataflow sliceImage (i: Image, s: {SPs}) : {SlicedImage} is
for parameter in s return
⟨ slice: parameter.slice,
image : changeFormat(cutSlice(i, parameter)) ⟩
Base and Complex Types
Complex types are built from base types with records ⟨…⟩ and sets {…}. In the brainAtlas dataflows:
Image = ⟨image : Image3D, header : HDR⟩
{SlicedImage} = { ⟨slice : Slice, image : Image2D⟩ }
Service Names
Service names called in the brainAtlas dataflows:
brainAtlas: align, aggregate, slice
alignImages: analyzeWarp, performWarp
sliceImage: cutSlice, changeFormat
Service Names - Signatures
Each service name has a signature, for example:
slice : Image × {SPs} → {SlicedImage}
cutSlice : Image × SPs → PGM
Subdataflows
A service name can be implemented by another dataflow: in brainAtlas, align is implemented by alignImages and slice by sliceImage.
Binding Tree – External Services
brainAtlas
  align ➸ alignImages
    analyzeWarp ➸ align_warp
    performWarp ➸ reslice
  slice ➸ sliceImage
    cutSlice ➸ slicer
    changeFormat ➸ convert
  aggregate ➸ softmean
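A binding tree of this shape can be held as a nested mapping. The sketch below is one possible encoding, not the representation used by the authors: the `Binding` class and the `external_services` helper are hypothetical names, and the subdataflow is called alignImages as in its definition.

```python
from dataclasses import dataclass, field

@dataclass
class Binding:
    """What a service name is bound to (hypothetical encoding)."""
    target: str                                   # external service or subdataflow name
    children: dict = field(default_factory=dict)  # bindings inside a subdataflow

    @property
    def is_subdataflow(self) -> bool:
        # Simplification: a subdataflow is recognised by having its own bindings.
        return bool(self.children)

# Binding tree of brainAtlas, following the slide.
brain_atlas = {
    "align": Binding("alignImages", {
        "analyzeWarp": Binding("align_warp"),
        "performWarp": Binding("reslice"),
    }),
    "slice": Binding("sliceImage", {
        "cutSlice": Binding("slicer"),
        "changeFormat": Binding("convert"),
    }),
    "aggregate": Binding("softmean"),
}

def external_services(tree: dict) -> list:
    """Collect every external service reachable from the tree, depth first."""
    found = []
    for binding in tree.values():
        if binding.is_subdataflow:
            found.extend(external_services(binding.children))
        else:
            found.append(binding.target)
    return found

print(external_services(brain_atlas))
```

Listing the leaves this way shows which external programs a run ultimately depends on, which is the information needed when asking whether a result can be reproduced.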
Result of brainAtlas
slice    image
X        atlas-x.gif
Y        atlas-y.gif
Z        atlas-z.gif
● Complex object
● Is it enough for
  – verification? yes
  – reproducibility? more or less
Verification and Reproducibility
● Input data
  – in house, archived (hopefully)
  – public repositories
● External services
  – access to used programs
  – changes in implementation
    ● improved algorithms
    ● bug fixes
Another Example
Given two organisms A and B, extract all messenger RNA sequences from GenBank belonging to A.
Then, for each sequence found, search for similar sequences belonging to B.
Dataflows – Nested Relational Calculus
dataflow findSimilar
(A : Organism, B : Organism) : MatchedSeqs is
∪ for s in entrez(A, genbank) return
if s.moltype = mRNA
then { ⟨ a : s, b : filter(blast(s, 1e-4), 300, B) ⟩ }
else ∅
dataflow filterBlastRep
(rep : BlastRep, min : Int, org : Organism) : Seqs is
∪ for a in accDB(rep, min) return
let seq ≔ getSeq(a.accessionnr, a.database) in
if seq.organism = org
then { seq }
else ∅
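To make the NRC semantics of findSimilar concrete, here is a minimal Python sketch. Everything about the services is an assumption: `entrez`, `blast`, and `filter_report` are stubs returning made-up data, where a real binding would call the NCBI services. Lists stand in for NRC sets because Python dictionaries are not hashable.

```python
# Stub services with made-up data (hypothetical, for illustration only).
def entrez(organism, database):
    return [
        {"organism": organism, "moltype": "mRNA", "content": "seq-1"},
        {"organism": organism, "moltype": "DNA", "content": "seq-2"},
        {"organism": organism, "moltype": "mRNA", "content": "seq-3"},
    ]

def blast(seq, evalue):
    # Hypothetical report: it only remembers the query and the threshold.
    return {"query": seq["content"], "evalue": evalue}

def filter_report(report, min_score, organism):
    # Hypothetical filter: fabricates one hit per query for the target organism.
    return [{"organism": organism, "moltype": "mRNA",
             "content": report["query"] + "-hit"}]

def find_similar(a, b):
    """Union over s in entrez(A, genbank) of: if mRNA then {<a, b>} else empty."""
    result = []
    for s in entrez(a, "genbank"):
        if s["moltype"] == "mRNA":
            part = [{"a": s, "b": filter_report(blast(s, 1e-4), 300, b)}]
        else:
            part = []
        result.extend(part)  # the big union flattens the per-iteration sets
    return result

pairs = find_similar("cat", "mouse")
print([p["a"]["content"] for p in pairs])  # → ['seq-1', 'seq-3']
```

The `for` loop with `extend` mirrors the big union over a for-expression: each iteration yields a singleton or empty collection, and the union flattens them into one result.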
Binding Tree – External Services
findSimilar
  entrez ➸ NCBI-entrez
  filter ➸ filterBlastRep
    accDB ➸ extractPairs
    getSeq ➸ getSequence
  blast ➸ NCBI-blast
Result of findSimilar(cat, mouse)
a        b
cat-1    {mouse-1-1, …, mouse-1-m1}
cat-2    {mouse-2-1, …, mouse-2-m2}
⋮        ⋮
cat-n    {mouse-n-1, …, mouse-n-mn}
● Complex object!
● Each sequence is itself a complex object, e.g. cat-1 = ⟨organism : cat, moltype : mRNA, content : NM_001079655⟩
Result of findSimilar(cat, mouse)
● Is the result alone enough for
  – verification? no
  – reproducibility? no
● Why not: external services (!), the NCBI database
Run of a Dataflow
● entire run includes all intermediate results
● run formalized as set of triples
  – triple = (subexpression, input, output)
  – subexpressions induce tree-structure
  – complex due to for-loops
● only service-call results need to be stored
● entire run formally defined using inference rules
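The point that only service-call results need to be stored can be illustrated with a small recorder. This is a sketch under simplifying assumptions, not the formal definition from the talk: the triple's input is just the argument tuple rather than a full value assignment, and `recorded`, `double`, and `total` are hypothetical names.

```python
triples = []  # service-call triples: (service name, input values, output)

def recorded(name, service):
    """Wrap a service so that every call is stored as a triple."""
    def call(*args):
        output = service(*args)
        triples.append((name, args, output))
        return output
    return call

# Hypothetical stand-in services.
double = recorded("double", lambda x: 2 * x)
total = recorded("total", lambda xs: sum(xs))

def dataflow(values):
    # NRC-style: total(for v in values return double(v))
    return total(tuple(double(v) for v in values))

result = dataflow((1, 2, 3))
for triple in triples:
    print(triple)
```

The for-loop produces one triple per iteration, which is why runs become large; the remaining triples of the run (for the loop itself and for the enclosing expression) could be recomputed from these stored service-call results.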
Service-call Triples of findSimilar(cat, mouse)
entrez
  input: A ↦ cat, B ↦ mouse, org ↦ cat, db ↦ genbank
  output: {cat-1, cat-2, …, cat-n}
blast
  input: A ↦ cat, B ↦ mouse, s ↦ cat-2, seq ↦ cat-2, evalue ↦ 1e-4
  output: blastrep-2
filter
  input: A ↦ cat, B ↦ mouse, s ↦ cat-2, rep ↦ blastrep-2, score ↦ 300, org ↦ mouse
  output: {mouse-2-1, …, mouse-2-m2}
Run Inference Rules
Run Triples of findSimilar(cat, mouse)
if
  input: A ↦ cat, B ↦ mouse, s ↦ cat-2
  output: a = cat-2, b = {mouse-2-1, …, mouse-2-m2}
if
  input: A ↦ cat, B ↦ mouse, s ↦ cat-9
  output: a = cat-9, b = {mouse-9-1, …, mouse-9-m9}
Provenance of Subvalues
● The result of findSimilar(cat, mouse) is a complex object!
● Provenance of a subvalue of the result?
● Again a set of triples
● Only what contributes to the subvalue
● Inference rules
Provenance of mouse-2-1
Tracing mouse-2-1 through the expression of findSimilar, from the complete result inward:
Step 1
  value: the complete result (rows cat-1 … cat-n)
  value assignment: A = cat, B = mouse
Step 2
  value: a = cat-2, b = {mouse-2-1, …, mouse-2-m2}
  value assignment: A = cat, B = mouse, s = cat-2
Step 3
  value: {mouse-2-1, …, mouse-2-m2}
  value assignment: A = cat, B = mouse, s = cat-2, rep = blastrep-2, score = 300, org = mouse
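The idea of keeping only what contributes to a subvalue can be approximated by a backward trace over the stored service-call triples. The sketch below is a deliberate simplification and not the inference rules of the talk: it matches values by name, so equal values from different calls would be conflated. The data follows the slides, with n and m2 fixed to 2 for illustration, and `produces` and `provenance` are hypothetical helpers.

```python
# Service-call triples of the run, as (service, input assignment, output).
triples = [
    ("entrez", {"org": "cat", "db": "genbank"}, ("cat-1", "cat-2")),
    ("blast", {"seq": "cat-1", "evalue": 1e-4}, "blastrep-1"),
    ("filter", {"rep": "blastrep-1", "score": 300, "org": "mouse"},
     ("mouse-1-1",)),
    ("blast", {"seq": "cat-2", "evalue": 1e-4}, "blastrep-2"),
    ("filter", {"rep": "blastrep-2", "score": 300, "org": "mouse"},
     ("mouse-2-1", "mouse-2-2")),
]

def produces(output, value):
    """True if `value` is the output itself or a member of it."""
    return output == value or (isinstance(output, tuple) and value in output)

def provenance(target):
    """Trace backwards from `target` to the calls that contributed to it."""
    wanted, seen, selected = [target], set(), []
    while wanted:
        value = wanted.pop()
        if value in seen:
            continue
        seen.add(value)
        for triple in triples:
            _service, inputs, output = triple
            if produces(output, value) and triple not in selected:
                selected.append(triple)
                wanted.extend(inputs.values())  # inputs become new targets
    return selected

print([service for service, _, _ in provenance("mouse-2-1")])
# → ['filter', 'blast', 'entrez']
```

The calls made for cat-1 are left out of the provenance of mouse-2-1, which matches the slide's narrowing from the complete result to the row for cat-2.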
What Have We Got so Far?
● A formal model for dataflows
● Formalisation of a run of a dataflow
● Formalisation of provenance within runs
● Is it enough for
  – verification? not quite
  – reproducibility? not quite
Binding Tree – External Services
findSimilar
  entrez ➸ NCBI-entrez
  filter ➸ filterBlastRep
    accDB ➸ extractPairs
    getSeq ➸ getSequence
  blast ➸ NCBI-blast (alternative binding: BLAST++)
Dataflow Repository:
Formal Conceptual Model
dataflow design
  D: dataflow identifiers
  expr: D → NRCexpr
  inputtypes: D → Types
  servicesigs: D → Sigs
dataflow execution
  R: run identifiers
  dataflow: R → D
  run: R → Runs
  inputvals: R → Inputs
  binding: R → BindingTree
  internalcall: R × Triples → R (maps a service-call triple to the subdataflow run)
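The conceptual model reads naturally as a handful of mappings. The following in-memory rendering is only a sketch: the `Repository` class and its `dangling_runs` check are hypothetical, and the check (every run refers to a known dataflow) is one plausible referential constraint rather than one stated on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Repository:
    """Hypothetical in-memory rendering of the conceptual model."""
    # dataflow design
    expr: dict = field(default_factory=dict)          # D -> NRC expression (as text)
    inputtypes: dict = field(default_factory=dict)    # D -> input types
    servicesigs: dict = field(default_factory=dict)   # D -> service signatures
    # dataflow execution
    dataflow: dict = field(default_factory=dict)      # R -> D
    inputvals: dict = field(default_factory=dict)     # R -> input values
    binding: dict = field(default_factory=dict)       # R -> binding tree
    run: dict = field(default_factory=dict)           # R -> set of triples
    internalcall: dict = field(default_factory=dict)  # (R, triple) -> R

    def dangling_runs(self):
        """Runs that refer to a dataflow identifier missing from the design part."""
        return sorted(r for r, d in self.dataflow.items() if d not in self.expr)

repo = Repository()
repo.expr["findSimilar"] = "body of findSimilar"
repo.inputtypes["findSimilar"] = {"A": "Organism", "B": "Organism"}
repo.dataflow["run-1"] = "findSimilar"
repo.inputvals["run-1"] = {"A": "cat", "B": "mouse"}
repo.dataflow["run-2"] = "unknownFlow"  # deliberately inconsistent

print(repo.dangling_runs())  # → ['run-2']
```

Keeping design and execution in separate mappings means several runs can share one dataflow while each run keeps its own inputs and binding tree.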
Closure
filter
  input: A ↦ cat, B ↦ mouse, s ↦ cat-2, rep ↦ blastrep-2, score ↦ 300, org ↦ mouse
  output: {mouse-2-1, …, mouse-2-m2}
findSimilar
  entrez ➸ NCBI-entrez
  filter ➸ filterBlastRep
    accDB ➸ extractPairs
    getSeq ➸ getSequence
  blast ➸ NCBI-blast
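As far as the slides indicate, closure means that a service-call triple whose service is bound to a subdataflow (here filter ➸ filterBlastRep) must be linked through internalcall to a run of that subdataflow stored in the same repository. That reading is an assumption, and the sketch below checks it with hypothetical run and triple identifiers (`r1`, `r2`, `t1` to `t3`).

```python
# Binding used by the run of findSimilar: service name -> (kind, target).
binding = {
    "entrez": ("external", "NCBI-entrez"),
    "blast": ("external", "NCBI-blast"),
    "filter": ("subdataflow", "filterBlastRep"),
}

# Service-call triples of run "r1", keyed by a hypothetical triple id.
run_triples = {
    "t1": ("entrez", {"org": "cat", "db": "genbank"}),
    "t2": ("blast", {"seq": "cat-2", "evalue": 1e-4}),
    "t3": ("filter", {"rep": "blastrep-2", "score": 300, "org": "mouse"}),
}

def closure_violations(run_id, triples, binding, internalcall, stored_runs):
    """Triples bound to a subdataflow whose own run is not in the repository."""
    missing = []
    for triple_id, (service, _inputs) in triples.items():
        kind, _target = binding[service]
        if kind != "subdataflow":
            continue  # external services have no internal run to store
        sub_run = internalcall.get((run_id, triple_id))
        if sub_run is None or sub_run not in stored_runs:
            missing.append(triple_id)
    return missing

# Without a stored run of filterBlastRep the repository is not closed.
print(closure_violations("r1", run_triples, binding, {}, {"r1"}))
# → ['t3']

# It is closed once internalcall links the triple to a stored run.
print(closure_violations("r1", run_triples, binding,
                         {("r1", "t3"): "r2"}, {"r1", "r2"}))
# → []
```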
Conclusions
● Formal requirements for a dataflow repository
  – focus on what entities should be stored
● We have formally defined:
  – all entities and all constraints on the repository
  – provenance tracking of dataflow results
● NRC is the right “glue” for the building blocks
● Binding of the building blocks before running
  – binding tree: building blocks become external services or subdataflows
  – integrity constraint on the repository: closure
Current and Future Work
● Design of a dataflow repository system
  – focus on how entities should be stored
  – prototype system in development
● Querying of a dataflow repository
  – a formal repository model is a prerequisite
  – SQL and XQuery
  – clustering of dataflows and runs