Enabling Privacy in ProvenanceAware Workflow Systems Susan B. Davidson1 Joint work with Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen 1 Yi Chen 2 Tova Milo3 1University of Pennsylvania 2Arizona State University 3Tel Aviv University Science has been revolutionized • High-throughput technologies generate floods of data, which must be analyzed to create knowledge. • Scientists are turning to scientific workflow systems to help manage the analysis. 2 TGCCGTG CTGTGC TGGCTAA CTAAATG ATG… … GGCTAAA TCTGTGC TGCCGTG …TGTCTG ATCCGTG TGGCGTC … TGGCTA.. Scientific Workflow Split Entries Align Sequences Format-1 Functional Data Curate Annotations Format-2 Format-3 Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment) Vertex ≡ Module (program) Edge ≡ Dataflow Construct Trees Run: An execution of the workflow Actual data appears on the edges . 3 Need for provenance s Split Entries Align Sequences Format Functional Data Curate Annotations Format Format Construct Trees Need for privacy TGCCGTGT CCCTTTCCG GGCTAAAT TGCCGTGT TGTGGCTA GTCTGTGC TGCCGTGT GGCTAAAT AATGTCTG TGCCGTGT GGCTAAAT GTCTGTGC … TGC ATGGCCGT GGCTAAAT GTCTGTGC GTGGTCTG GTCTGTGC … GTCTGTGC GTCTGTGC … … TGCCTAAC … TAACTAA… t How has this tree been generated? … ? Which sequences have been used to produce this tree? Biologist’s workspace 4 Outline • The need: Workflow provenance repositories • The challenge of privacy • Privacy-aware search and query 5 Workflow Provenance Repositories 6 Current situation: Workflow repositories • To enable sharing and reuse, repositories of workflow specifications are being created – e.g. myExperiment.org – Keyword search is used to find specifications of interest • Several workflow systems are storing provenance information – Module executions, input parameters, input/output data – “Input-only” 7 The Vision • “Workflow Provenance” repositories will store specifications as well as executions (i.e. provenance information) – Searchable – Queryable – Privacy protecting • Searching/querying these repositories can be used to – – – – Find/reuse workflows Understand meaning of a workflow Correct/debug erroneous specifications Find “bad” data and what its downstream effect might be 8 “You are better off designing in security and privacy… from the start, rather than trying to add them later.” (Shapiro, CACM 53:6, June 2010) 9 Privacy Concerns in Scientific Workflows Data, module, structural 10 Example: Module Privacy Patient record: Gender, smoking habits, Familial environment, blood pressure, blood test report, … P: (X1, X2, X3, X4, X5) Split entries (X1, X2, X3) (X1, X2, X4, X5) Check for Cancer Check for Infectious disease P has cancer? P has an disease? Create Report report 11 infectious If X1 > 60 OR (X2 < 800 AND X5=1) AND …. Module functionality should be kept secret (From patient’s standpoint): output should not be guessed given input data values (From module owner’s standpoint): no one should be able to simulate the module and use it elsewhere. Example: Structural Privacy Protein M1 M1 compares the entire protein against already annotated genomes M2 Protein + Functional annotation M2 compares domains of proteins (more precise but more time consuming) Relationships between certain data/module pairs should be kept secret 12 Privacy concerns at a glance smoking habits, blood pressure, blood test report, …… P: (X1, X2, X3, X4) Data Privacy Data items are private Split entries (X1, X2, X3) Check for cancer (X1, X2, X4) DB Check for infectious disease P has cancer? Create Report report P has infectious disease? Module Privacy Module functionality is private (x, f(x)) Structural Privacy Execution paths between certain data is private 13 Module Privacy (a hint) • A module f = a function • For every input x to f, f(x) value should not be revealed – Enough equivalent possible f(x) values w.r.t. visible information • There is a knife,to a fork and a spoon According in this figureprivacy required guarantee arxiv.org/abs/1005.5543 14 Search and Query 15 Workflow model, revisited Vertex ≡ Module W2 W4 Atomic or composite (subworkflow) Workflow hierarchy W1 Edge ≡ Dataflow lifestyle, family history, physical symptoms SNPs, ethnicity I Determine Genetic Susceptibility M3 Evaluate Disorder Risk prognosis O M5 Expand SNP Set Consult External Databases …. query Query OMIM M6 disorders M8 W3 Generate Database Queries query SNPs M4 disorders M2 W4 W2 W1 M1 W3 Query PubMed disorders Combine Disorder Sets M7 Access Views W2 W4 W1 W2 W1 W2 W4 W1 W3 • Prefixes of workflow hierarchy can be used to define the finest level of granularity the user can “see” W1: I M1 M2 O 17 Search Keyword search: “Database, disorder risk” Result should be concise and informative W1 lifestyle, family history, physical symptoms SNPs, ethnicity I Determine Genetic Susceptibility disorders M2 W4 W4 W2 W1 M1 W2 Evaluate Disorder Risk prognosis O M3 M5 Expand SNP Set query Consult External Databases query Query OMIM M6 SNPs M4 Generate Database Queries disorders M8 Query PubMed disorders Combine Disorder Sets “WISE: a Workflow Information Search Engine”, (ICDE 2009) M7 Result of “database, disorder risk” lifestyle, family history, physical symptoms SNPs, ethnicity M3 Expand SNP Set I M5 SNPs Generate Database Queries query Query OMIM M6 disorders M2 Evaluate Disorder Risk prognosis O M8 disorders W1 query Query PubMed disorders Combine Disorder Sets M7 W2 W4 Result of “database, disorder risk” W1 W2 W2 W1 lifestyle, family history, physical symptoms SNPs, ethnicity I M1 lifestyle, family history, physical symptoms Determine Genetic Susceptibility M3 M3 Expand SNP Set I Expand SNP Set SNPs, ethnicity SNPs SNPs M4 M4 Consult External Databases Consult External Databases disorders M2 Evaluate Disorder Risk prognosis O disorders M2 Evaluate Disorder Risk prognosis O Queries • “Workflows in which Expand SNP is performed before Query PubMed.” • “Data items that were used to produce d19.” • … • Shares with search the need to match certain terms and give a view of the result and preserve privacy. 21 Research Challenges • More ways of enabling scientists to construct workflows • Formalizing privacy notions • Provenance query language? • Efficiently identifying data that users can access – users may have different privileges, yielding many different access views. • … 22 Acknowledgments • Members of the PennDB group – Susan Davidson – Sanjeev Khanna – Sudeepa Roy – Julia Stoyanovich – Val Tannen • Friend of PennDB and Tel Aviv faculty – Tova Milo • PennDB alum and ASU faculty – Yi Chen • Bioinformatics collaborators – Sarah Cohen-Boulakia This work was supported in part by NSF IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, and IIS-0904314; NSF CAREER award IIS-0845647; and CRA 0937060 24