A Formal Model for Dataflows, Runs of Dataflows, and Provenance within Runs
Natalia Kwasnikowska and Jan Van den Bussche
Theoretical Computer Science Group, Hasselt University, Belgium
May 21, 2008

Scientific Data Analysis
[Figure: complex experimental data ("How to analyze all this?") ☹; with external resources, an integration layer, dataflow design, dataflow execution, annotations, and storage ☺]

Where Do We Fit in the Picture
[Figure: the same architecture, with "storage" replaced by "dataflow repository"]

Why a Dataflow Repository
● Effective management of
  – experimental data
  – dataflows and dataflow runs
● Data management of different runs
● Verification of dataflow results
● Tracking the provenance of values occurring in a dataflow result
● Querying of all stored information

Our Contribution
● Quest for formal requirements
● Formal conceptual model of a dataflow repository
  – focus on what entities should be stored
  – precise formalisation of those entities
    ● a run of a dataflow
    ● binding of building blocks of a dataflow to services
  – formal integrity constraints on the repository
  – formalisation of provenance of dataflow results

Example
    ( Reference-Image, {Anatomy-1, Anatomy-2, Anatomy-3, Anatomy-4} )
      ↓ align_warp ↓
    {Warp-Params1, Warp-Params2, Warp-Params3, Warp-Params4}
      ↓ reslice ↓
    {Resliced-Image1, Resliced-Image2, Resliced-Image3, Resliced-Image4}
      ↓ softmean ↓
    Atlas-Image
      ↓ slicer ↓
    {Atlas-X-Slice, Atlas-Y-Slice, Atlas-Z-Slice}
      ↓ convert ↓
    {Atlas-X-Graphic, Atlas-Y-Graphic, Atlas-Z-Graphic}

Dataflows – Nested Relational Calculus
    dataflow brainAtlas (r: Image, p: {Image}, a: APs, m: MPs, s: {SPs}) : {SlicedImage} is
      slice(aggregate(align(r, p, a), m), s)

    dataflow alignImages (r: Image, p: {Image}, a: APs) : {Image} is
      for i in p return performWarp(analyzeWarp(r, i, p))

    dataflow sliceImage (i: Image, s: {SPs}) : {SlicedImage} is
      for parameter in s return
        〈 slice : parameter.slice, image : changeFormat(cutSlice(i, parameter)) 〉

Base and Complex Types
● Image = 〈image : Image3D, header : HDR〉
● {SlicedImage} = { 〈slice : Slice, image : Image2D〉 }

Service Names
● align, aggregate, slice, analyzeWarp, performWarp, cutSlice, changeFormat

Service Names - Signatures
● slice : Image × {SPs} → {SlicedImage}
● cutSlice : Image × SPs → PGM

Subdataflows
● alignImages and sliceImage are dataflows that are used as building blocks of brainAtlas

Binding Tree – External Services
    brainAtlas
      align ➸ alignImages
        analyzeWarp ➸ align_warp
        performWarp ➸ reslice
      slice ➸ sliceImage
        cutSlice ➸ slicer
        changeFormat ➸ convert
      aggregate ➸ softmean

Result of brainAtlas
    slice   image
    X       atlas-x.gif
    Y       atlas-y.gif
    Z       atlas-z.gif
● Complex object
● Is it enough for
  – verification? yes
  – reproducibility? more or less

Verification and Reproducibility
● Input data
  – in house, archived (hopefully)
  – public repositories
● External services
  – access to used programs
  – changes in implementation
    ● improved algorithms
    ● bug fixes

Another Example
Given two organisms A and B, extract all messenger RNA sequences from GenBank belonging to A. Then, for each sequence found, search for similar sequences belonging to B. ☺

Dataflows – Nested Relational Calculus
    dataflow findSimilar (A : Organism, B : Organism) : MatchedSeqs is
      ∪ for s in entrez(A, genbank) return
        if s.moltype = mRNA
        then { 〈 a : s, b : filter(blast(s, 1e-4), 300, B) 〉 }
        else ∅

    dataflow filterBlastRep (rep : BlastRep, min : Int, org : Organism) : Seqs is
      ∪ for a in accDB(rep, min) return
        let seq ≔ getSeq(a.accessionnr, a.database) in
        if seq.organism = org then { seq } else ∅

Binding Tree – External Services
    findSimilar
      entrez ➸ NCBI-entrez
      filter ➸ filterBlastRep
        accDB ➸ extractPairs
        getSeq ➸ getSequence
      blast ➸ NCBI-blast

Result of findSimilar(cat, mouse)
    a       b
    cat-1   mouse-1-1, …, mouse-1-m1
    cat-2   mouse-2-1, …, mouse-2-m2
    ⋮       ⋮
    cat-n   mouse-n-1, …, mouse-n-mn
● Complex object!
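The findSimilar dataflow and its binding tree can be sketched in Python as below. This is only an illustration under loud assumptions: the toy sequences, the accession numbers, the scores, and the stub functions standing in for NCBI-entrez, NCBI-blast, extractPairs and getSequence are all hypothetical, and Python lists stand in for NRC sets (so duplicates are not eliminated). It is meant to show how a dataflow body refers only to service names, and how a binding decides whether a name resolves to an external service or to a subdataflow.

```python
from typing import NamedTuple

class Seq(NamedTuple):
    organism: str
    moltype: str
    content: str

# Hypothetical toy data (not real GenBank entries).
CAT_1 = Seq("cat", "mRNA", "cat-1")
CAT_2 = Seq("cat", "DNA", "cat-2")
MOUSE_1 = Seq("mouse", "mRNA", "mouse-1-1")
DOG_1 = Seq("dog", "mRNA", "dog-1-1")
SEQ_DB = {("acc-m1", "genbank"): MOUSE_1, ("acc-d1", "genbank"): DOG_1}

# Stubs standing in for the external services of the binding tree.
def ncbi_entrez(org, db):
    return [s for s in (CAT_1, CAT_2) if s.organism == org]

def ncbi_blast(seq, evalue):
    # A "report" is modelled here as (accessionnr, database, score) rows.
    return [("acc-m1", "genbank", 350), ("acc-d1", "genbank", 400)]

def extract_pairs(rep, min_score):
    return [{"accessionnr": acc, "database": db}
            for acc, db, score in rep if score >= min_score]

def get_sequence(accessionnr, database):
    return SEQ_DB[(accessionnr, database)]

# Subdataflow filterBlastRep: refers to service names accDB and getSeq only.
def filter_blast_rep(rep, min_score, org, bind):
    out = []
    for a in bind["accDB"](rep, min_score):
        seq = bind["getSeq"](a["accessionnr"], a["database"])
        if seq.organism == org:
            out.append(seq)
    return out

# Dataflow findSimilar: refers to service names entrez, blast and filter only.
def find_similar(A, B, bind):
    out = []
    for s in bind["entrez"](A, "genbank"):
        if s.moltype == "mRNA":
            out.append({"a": s, "b": bind["filter"](bind["blast"](s, 1e-4), 300, B)})
    return out

# Binding tree: filter is bound to a subdataflow with its own bindings,
# the other names are bound directly to (stubbed) external services.
SUB_BINDING = {"accDB": extract_pairs, "getSeq": get_sequence}
BINDING = {
    "entrez": ncbi_entrez,
    "blast": ncbi_blast,
    "filter": lambda rep, m, org: filter_blast_rep(rep, m, org, SUB_BINDING),
}

result = find_similar("cat", "mouse", BINDING)
```

With these stubs the result is a nested value: one record whose `b` field is itself a collection of sequences, which is the sense in which the slides call the result a complex object.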
Result of findSimilar(cat, mouse)
● Complex object! Each entry is itself a record, for example
    cat-1 = 〈organism : cat, moltype : mRNA, content : NM_001079655〉
● Is it enough for
  – verification? no
  – reproducibility? no
● External services! (NCBI database)

Run of a Dataflow
● entire run includes all intermediate results
● run formalized as set of triples
  – triple = (subexpression, input, output)
  – subexpressions induce tree-structure
  – complex due to for-loops
● only service-call results need to be stored
● entire run formally defined using inference rules

Service-call Triples of findSimilar(cat, mouse)
    dataflow findSimilar (A : Organism, B : Organism) : MatchedSeqs is
      ∪ for s in entrez(A, genbank) return
        if s.moltype = mRNA
        then { 〈 a : s, b : filter(blast(s, 1e-4), 300, B) 〉 }
        else ∅

    service   value assignment              service input                          output
    entrez    A = cat, B = mouse            org = cat, db = genbank                cat-1, cat-2, …, cat-n
    blast     A = cat, B = mouse, s = cat-2 seq = cat-2, evalue = 1e-4             blastrep-2
    filter    A = cat, B = mouse, s = cat-2 rep = blastrep-2, score = 300,         mouse-2-1, …, mouse-2-m2
                                            org = mouse

Run Inference Rules
[Figure: inference rules defining the run triples]

Run Triples of findSimilar(cat, mouse)
    subexpression   value assignment                output
    if              A = cat, B = mouse, s = cat-2   { 〈 a : cat-2, b : {mouse-2-1, …, mouse-2-m2} 〉 }
    if              A = cat, B = mouse, s = cat-9   { 〈 a : cat-9, b : {mouse-9-1, …, mouse-9-m9} 〉 }

Provenance of Subvalues
● Complex object!
● Provenance of a subvalue of result?
● Again set of triples
● Only what contributes to subvalue
● Inference rules

Provenance of mouse-2-1
● Start from the whole result of findSimilar, with value assignment A = cat, B = mouse
● Narrow down to the for-loop iteration that produced the subvalue: value assignment A = cat, B = mouse, s = cat-2, with output 〈 a : cat-2, b : {mouse-2-1, …, mouse-2-m2} 〉
● Narrow down to the filter call: value assignment A = cat, B = mouse, s = cat-2, rep = blastrep-2, score = 300, org = mouse, with output mouse-2-1, …, mouse-2-m2

What Have We Got so Far?
● A formal model for dataflows
● Formalisation of a run of a dataflow
● Formalisation of provenance within runs
● Is it enough for
  – verification? not quite
  – reproducibility? not quite

Binding Tree – External Services
    findSimilar
      entrez ➸ NCBI-entrez
      filter ➸ filterBlastRep
        accDB ➸ extractPairs
        getSeq ➸ getSequence
      blast ➸ NCBI-blast
[Slide annotation: BLAST++]

Dataflow Repository: Formal Conceptual Model
    dataflow design                     dataflow execution
    D: dataflow identifiers             R: run identifiers
    expr: D → NRCexpr                   dataflow: R → D
    inputtypes: D → Types               run: R → Runs
    servicesigs: D → Sigs               inputvals: R → Inputs
                                        binding: R → BindingTree
                                        internalcall: R × Triples → R
● internalcall maps a run and a service-call triple to a subdataflow run

Closure
    Service-call triple for filter:
      value assignment: A = cat, B = mouse, s = cat-2
      service input:    rep = blastrep-2, score = 300, org = mouse
      output:           mouse-2-1, …, mouse-2-m2

    findSimilar
      entrez ➸ NCBI-entrez
      filter ➸ filterBlastRep
        accDB ➸ extractPairs
        getSeq ➸ getSequence
      blast ➸ NCBI-blast

Conclusions
● Formal requirements for a dataflow repository
  – focus on what entities should be stored
● We have formally defined:
  – all entities and all constraints on the repository
  – provenance tracking of dataflow results
● NRC is the right "glue" for the building blocks
● Binding of the building blocks before running
  – binding tree: building blocks become external services or subdataflows
  – integrity constraint on the repository: closure

Current and Future Work
● Design of a dataflow repository system
  – focus on how entities should be stored
  – prototype system in development
● Querying of a dataflow repository
  – a formal repository model is a prerequisite
  – SQL and XQuery
  – clustering of dataflows and runs
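As an appendix-style illustration of the run and provenance slides, the sketch below stores a run as a list of service-call triples and then selects the triples that contribute to one subvalue. It is a deliberately simplified assumption, not the inference rules of the formal model: the `Triple` record, the `produces` helper and the backward trace in `provenance` are hypothetical names, and the toy run has only two cat sequences with made-up outputs.

```python
from typing import Any, NamedTuple, Tuple

class Triple(NamedTuple):
    service: str                             # service name of the call
    context: Tuple[Tuple[str, Any], ...]     # value assignment at the call
    inputs: Tuple[Tuple[str, Any], ...]      # service input (name, value) pairs
    output: Any                              # a single value or a tuple of values

def triple(service, context, inputs, output):
    return Triple(service, tuple(context.items()), tuple(inputs.items()), output)

def produces(t, value):
    """True if the triple's output is the value or contains it as an element."""
    return value in t.output if isinstance(t.output, tuple) else t.output == value

def provenance(run, target):
    """Simplified backward trace: keep the triples whose output yields the
    target, then repeat for the inputs of every kept triple."""
    kept, frontier, seen = [], [target], set()
    while frontier:
        value = frontier.pop()
        if value in seen:
            continue
        seen.add(value)
        for t in run:
            if produces(t, value) and t not in kept:
                kept.append(t)
                frontier.extend(v for _, v in t.inputs)
    return kept

# Toy run of findSimilar(cat, mouse); all values are illustrative.
TOP = {"A": "cat", "B": "mouse"}
RUN = [
    triple("entrez", TOP, {"org": "cat", "db": "genbank"}, ("cat-1", "cat-2")),
    triple("blast", {**TOP, "s": "cat-1"}, {"seq": "cat-1", "evalue": 1e-4}, "blastrep-1"),
    triple("filter", {**TOP, "s": "cat-1"},
           {"rep": "blastrep-1", "score": 300, "org": "mouse"}, ("mouse-1-1",)),
    triple("blast", {**TOP, "s": "cat-2"}, {"seq": "cat-2", "evalue": 1e-4}, "blastrep-2"),
    triple("filter", {**TOP, "s": "cat-2"},
           {"rep": "blastrep-2", "score": 300, "org": "mouse"}, ("mouse-2-1", "mouse-2-2")),
]

prov = provenance(RUN, "mouse-2-1")
for t in prov:
    print(t.service, dict(t.context))
```

In this toy run the trace keeps the filter and blast calls made for s = cat-2 and the entrez call that produced cat-2, and drops the calls made for cat-1, which mirrors the "only what contributes to subvalue" idea in a much coarser form than the formal definition.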