CIDR11_Davidson

advertisement
Enabling Privacy in ProvenanceAware Workflow Systems
Susan B. Davidson1
Joint work with
Sanjeev Khanna, Sudeepa Roy,
Julia Stoyaonovich, Val Tannen 1
Yi Chen 2
Tova Milo3
1University
of Pennsylvania
2Arizona
State University
3Tel
Aviv University
Science has been revolutionized
• High-throughput technologies
generate floods of data, which must
be analyzed to create knowledge.
• Scientists are turning to scientific
workflow systems to help manage the
analysis.
2
TGCCGTG
CTGTGC
TGGCTAA
CTAAATG
ATG…
… GGCTAAA
TCTGTGC
TGCCGTG
…TGTCTG
ATCCGTG
TGGCGTC
… TGGCTA..
Scientific Workflow

Split Entries
Align Sequences
Format-1
Functional Data Curate Annotations
Format-2
Format-3


Graphical representation of a
sequence of actions to perform a task
(e.g., a biological experiment)
Vertex ≡ Module (program)
Edge ≡ Dataflow
Construct Trees

Run: An execution of the workflow
 Actual data appears on the edges
.
3
Need for provenance
s
Split Entries
Align Sequences
Format
Functional Data
Curate Annotations
Format
Format
Construct Trees
Need for privacy
TGCCGTGT
CCCTTTCCG
GGCTAAAT
TGCCGTGT
TGTGGCTA
GTCTGTGC
TGCCGTGT
GGCTAAAT
AATGTCTG
TGCCGTGT
GGCTAAAT
GTCTGTGC
… TGC
ATGGCCGT
GGCTAAAT
GTCTGTGC
GTGGTCTG
GTCTGTGC
… GTCTGTGC
GTCTGTGC
… … TGCCTAAC
… TAACTAA…
t
How has this tree
been generated?
…
?
Which sequences
have been used to
produce this tree?
Biologist’s workspace
4
Outline
• The need: Workflow provenance
repositories
• The challenge of privacy
• Privacy-aware search and query
5
Workflow Provenance Repositories
6
Current situation: Workflow repositories
• To enable sharing and reuse, repositories of
workflow specifications are being created
– e.g. myExperiment.org
– Keyword search is used to find specifications
of interest
• Several workflow systems are storing
provenance information
– Module executions, input parameters,
input/output data
– “Input-only”
7
The Vision
• “Workflow Provenance” repositories will store
specifications as well as executions (i.e.
provenance information)
– Searchable
– Queryable
– Privacy protecting
• Searching/querying these repositories can be
used to
–
–
–
–
Find/reuse workflows
Understand meaning of a workflow
Correct/debug erroneous specifications
Find “bad” data and what its downstream effect
might be
8
“You are better off designing in security and
privacy… from the start, rather than trying to
add them later.”
(Shapiro, CACM 53:6, June 2010)
9
Privacy Concerns in Scientific Workflows
Data, module, structural
10
Example: Module Privacy
Patient record: Gender, smoking habits,
Familial environment,
blood pressure, blood test report, …
P: (X1, X2, X3, X4, X5)
Split entries
(X1, X2, X3)
(X1, X2, X4, X5)
Check for
Cancer
Check for
Infectious disease
P has cancer?
P has an
disease?
Create Report
report
11
infectious
If X1 > 60
OR
(X2 < 800
AND
X5=1) AND
….
Module
functionality
should be kept
secret
(From patient’s
standpoint): output
should not be guessed
given input data values
(From module owner’s
standpoint): no one
should be able to
simulate the module
and use it elsewhere.
Example: Structural Privacy
Protein
M1
M1 compares the
entire protein against
already annotated
genomes
M2
Protein
+
Functional
annotation
M2 compares
domains of proteins
(more precise but
more time consuming)
Relationships between certain data/module pairs should
be kept secret
12
Privacy concerns at a glance
smoking habits, blood pressure,
blood test report, ……

P: (X1, X2, X3, X4)
Data Privacy
 Data items are
private
Split entries
(X1, X2, X3)
Check for cancer
(X1, X2, X4)
DB

Check for infectious disease
P has cancer?
Create Report
report
P has infectious

disease?
Module Privacy
 Module functionality
is private (x, f(x))
Structural Privacy
 Execution paths
between certain data
is private
13
Module Privacy (a hint)
• A module f = a function
• For every input x to f, f(x) value should not be revealed
– Enough equivalent possible f(x) values w.r.t. visible
information
• There
is a knife,to
a fork and a spoon
According
in this
figureprivacy
required
guarantee
arxiv.org/abs/1005.5543
14
Search and Query
15
Workflow model, revisited

Vertex ≡ Module
W2
W4
 Atomic or composite (subworkflow)
Workflow hierarchy
W1

Edge ≡ Dataflow
lifestyle, family
history, physical
symptoms
SNPs, ethnicity
I
Determine
Genetic
Susceptibility
M3

Evaluate
Disorder Risk
prognosis
O
M5
Expand SNP
Set
Consult
External Databases


….
query
Query
OMIM
M6
disorders
M8
W3
Generate
Database Queries
query
SNPs
M4
disorders
M2
W4
W2
W1
M1
W3
Query
PubMed
disorders
Combine
Disorder Sets
M7
Access Views
W2
W4
W1
W2
W1
W2
W4
W1
W3
• Prefixes of workflow hierarchy can be
used to define the finest level of
granularity the user can “see”
W1:
I
M1
M2
O
17
Search


Keyword search: “Database, disorder risk”
Result should be concise and informative
W1
lifestyle, family
history, physical
symptoms
SNPs, ethnicity
I
Determine
Genetic
Susceptibility
disorders
M2
W4
W4
W2
W1
M1
W2
Evaluate
Disorder Risk
prognosis
O
M3

M5
Expand SNP
Set
query
Consult
External Databases

query
Query
OMIM
M6
SNPs
M4
Generate
Database Queries
disorders
M8
Query
PubMed
disorders
Combine
Disorder Sets
“WISE: a Workflow Information Search Engine”,
(ICDE 2009)
M7
Result of “database, disorder
risk”
lifestyle, family
history, physical
symptoms
SNPs, ethnicity
M3
Expand SNP
Set
I
M5
SNPs
Generate
Database Queries
query
Query
OMIM
M6
disorders
M2
Evaluate
Disorder Risk
prognosis
O
M8
disorders
W1
query
Query
PubMed
disorders
Combine
Disorder Sets
M7
W2
W4
Result of “database, disorder
risk”
W1
W2
W2
W1
lifestyle, family
history, physical
symptoms
SNPs, ethnicity
I
M1
lifestyle, family
history, physical
symptoms
Determine
Genetic
Susceptibility
M3

M3
Expand SNP
Set
I
Expand SNP
Set
SNPs, ethnicity
SNPs
SNPs
M4
M4
Consult
External Databases
Consult
External Databases
disorders
M2
Evaluate
Disorder Risk
prognosis
O
disorders
M2
Evaluate
Disorder Risk
prognosis
O
Queries
• “Workflows in which Expand SNP is
performed before Query PubMed.”
• “Data items that were used to produce
d19.”
• …
• Shares with search the need to match
certain terms and give a view of the
result and preserve privacy.
21
Research Challenges
• More ways of enabling scientists to construct
workflows
• Formalizing privacy notions
• Provenance query language?
• Efficiently identifying data that users can
access
– users may have different privileges, yielding many
different access views.
• …
22
Acknowledgments
• Members of the PennDB group
– Susan Davidson
– Sanjeev Khanna
– Sudeepa Roy
– Julia Stoyanovich
– Val Tannen
• Friend of PennDB and Tel Aviv
faculty
– Tova Milo
• PennDB alum and ASU faculty
– Yi Chen
• Bioinformatics collaborators
– Sarah Cohen-Boulakia
This work was supported in part by NSF IIS-0803524, IIS-0629846,
IIS-0915438, CCF-0635084, and IIS-0904314;
NSF CAREER award IIS-0845647; and CRA 0937060
24
Download