Provenance and Scientific Workflows: Challenges and Opportunities

Susan B. Davidson
University of Pennsylvania
3330 Walnut Street
Philadelphia, PA 19104-6389
susan@cis.upenn.edu

Juliana Freire
University of Utah
50 S. Central Campus Dr, rm 3190
Salt Lake City, UT 84112
juliana@cs.utah.edu