Provenance and Scientific Workflows:
Challenges and Opportunities
Susan B. Davidson
University of Pennsylvania
3330 Walnut Street
Philadelphia, PA 19104-6389
susan@cis.upenn.edu

Juliana Freire
University of Utah
50 S. Central Campus Dr, rm 3190
Salt Lake City, UT 84112
juliana@cs.utah.edu

1. REFERENCES
[1] W. Aalst and K. Hee. Workflow Management: Models,
Methods, and Systems. MIT Press, 2002.
[2] A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientific
workflow management by database management. In
Proceedings of SSDBM, pages 190–199, 1998.
[3] I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance
collection support in the Kepler scientific workflow system.
In Proceedings of the International Provenance and
Annotation Workshop (IPAW), pages 118–132, 2006.
[4] E. Anderson, S. P. Callahan, D. A. Koop, E. Santos, C. E.
Scheidegger, H. T. Vo, J. Freire, and C. T. Silva. VisTrails:
Using provenance to streamline data exploration. In Poster
Proceedings of the International Workshop on Data
Integration in the Life Sciences (DILS), page 8, 2007.
[5] Apple’s Mac OS X Automator.
http://www.apple.com/downloads/macosx/automator.
[6] R. Barga and L. Digiampietri. Automatic generation of
workflow provenance. In Proceedings of the International
Provenance and Annotation Workshop (IPAW), pages 1–9,
2006. Invited paper.
[7] R. S. Barga and L. A. Digiampietri. Automatic capture and
efficient storage of escience experiment provenance.
Concurrency and Computation: Practice and Experience,
20(5):419–429, 2008.
[8] L. Bavoil, S. Callahan, P. Crossno, J. Freire,
C. Scheidegger, C. Silva, and H. Vo. VisTrails: Enabling
interactive multiple-view visualizations. In Proceedings of
IEEE Visualization, pages 135–142, 2005.
[9] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo. Querying
business processes. In VLDB, pages 343–354, 2006.
[10] C. Beeri, A. Eyal, T. Milo, and A. Pilberg. Monitoring
business processes with queries. In Proceedings of VLDB,
pages 603–614, 2007.
[11] O. Biton, S. C. Boulakia, and S. B. Davidson.
Zoom*UserViews: Querying relevant provenance in
workflow systems. In Proceedings of VLDB, pages
1366–1369, 2007.
[12] O. Biton, S. Cohen-Boulakia, S. Davidson, and C. Hara.
Querying and managing provenance through user views in
scientific workflows. In Proceedings of ICDE, 2008. To
appear.
[13] R. Bose, I. Foster, and L. Moreau. Report on the
International Provenance and Annotation Workshop.
SIGMOD Rec., 35(3):51–53, 2006.
[14] R. Bose and J. Frew. Lineage retrieval for scientific data
processing: a survey. ACM Computing Surveys, 37(1):1–28,
2005.
[15] S. Bowers, T. McPhillips, and B. Ludäscher. A provenance
model for collection-oriented scientific workflows.
Concurrency and Computation: Practice and Experience,
20(5):519–529, 2008.
[16] S. Bowers, T. McPhillips, B. Ludäscher, S. Cohen, and
S. B. Davidson. A Model for User-Oriented Data
Provenance in Pipelined Scientific Workflows. In
Proceedings of the International Provenance and
Annotation Workshop (IPAW), 2006.
[17] Business Process Execution Language for Web Services.
http://www.ibm.com/developerworks/library/specification/wsbpel/.
[18] U. Braun, S. Garfinkel, D. A. Holland, K.-K.
Muniswamy-Reddy, and M. I. Seltzer. Issues in Automatic
Provenance Collection. In Proceedings of the International
Provenance and Annotation Workshop (IPAW), 2006.
[19] P. Buneman and W. Tan. Provenance in databases. In
Proceedings of ACM SIGMOD, pages 1171–1173, 2007.
[20] S. Callahan, J. Freire, E. Santos, C. Scheidegger, C. Silva,
and H. Vo. Managing the evolution of dataflows with
VisTrails (Extended Abstract). In IEEE Workshop on
Workflow and Data Flow for Scientific Applications
(SciFlow), 2006.
[21] S. Callahan, J. Freire, E. Santos, C. Scheidegger, C. Silva,
and H. Vo. Using provenance to streamline data exploration
through visualization. Technical Report UUSCI-2006-016,
SCI Institute–Univ. of Utah, 2006.
[22] S. Callahan, J. Freire, E. Santos, C. Scheidegger, C. Silva,
and H. Vo. VisTrails: Visualization meets Data
Management. In Proceedings of ACM SIGMOD, pages
745–747, 2006. Demo description.
[23] S. P. Callahan, J. Freire, C. E. Scheidegger, C. Silva, and
H. T. Vo. Towards provenance-enabling ParaView. In
Proceedings of the International Provenance and
Annotation Workshop (IPAW), 2008. To appear.
[24] A. Chapman and H. V. Jagadish. Issues in building practical
provenance systems. IEEE Data Eng. Bull., 30(4):38–43,
2007.
[25] A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient
provenance storage. In Proceedings of ACM SIGMOD,
pages 993–1006, 2008.
[26] B. Clifford, I. Foster, M. Hategan, T. Stef-Praun, M. Wilde,
and Y. Zhao. Tracking provenance in a virtual data grid.
Concurrency and Computation: Practice and Experience,
20(5):565–575, 2008.
[27] S. Cohen, S. C. Boulakia, and S. B. Davidson. Towards a
model of provenance and user views in scientific
workflows. In DILS, pages 264–279, 2006.
[28] S. Cohen-Boulakia, O. Biton, S. Cohen, and S. Davidson.
Addressing the provenance challenge using zoom.
Concurrency and Computation: Practice and Experience,
20(5):497–506, 2008.
[29] S. Cohen-Boulakia, S. B. Davidson, C. Froidevaux,
Z. Lacroix, and M.-E. Vidal. Path-based systems to guide
life scientists in the maze of biological data sources.
Journal of Bioinformatics and Computational Biology,
4(5), 2006.
[30] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher,
T. M. McPhillips, S. Bowers, M. K. Anand, and J. Freire.
Provenance in scientific workflow systems. IEEE Data Eng.
Bull., 30(4):44–50, 2007.
[31] S. B. Davidson and J. Freire. Provenance and scientific
workflows: challenges and opportunities. In Proceedings of
ACM SIGMOD, pages 1345–1350, 2008.
[32] E. Deelman, S. Callaghan, E. Field, H. Francoeur,
R. Graves, N. Gupta, V. Gupta, T. H. Jordan, C. Kesselman,
P. Maechling, J. Mehringer, G. Mehta, D. Okaya, K. Vahi,
and L. Zhao. Managing large-scale workflow execution
from resource provisioning to provenance tracking: The
CyberShake example. In Proceedings of e-Science, 2006.
[33] E. Deelman and Y. Gil. NSF Workshop on Challenges of
Scientific Workflows. Technical report, NSF, 2006.
http://vtcpc.isi.edu/wiki/index.php/Main_Page.
[34] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil,
C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good,
A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: a Framework
for Mapping Complex Scientific Workflows onto
Distributed Systems. Scientific Programming Journal,
13(3):219–237, 2005.
[35] E. Eide, L. Stoller, T. Stack, J. Freire, and J. Lepreau.
Integrated scientific workflow management for the Emulab
network testbed. In USENIX, pages 363–368, 2006.
[36] T. Ellkvist, D. Koop, E. W. Anderson, J. Freire, and
C. Silva. Using provenance to support real-time
collaborative design of workflows. In Proceedings of the
International Provenance and Annotation Workshop
(IPAW), 2008. To appear.
[37] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A
virtual data system for representing, querying and
automating data derivation. In Proceedings of SSDBM,
pages 37–46, 2002.
[38] J. N. Foster, T. J. Green, and V. Tannen. Annotated XML:
queries and provenance. In Proceedings of PODS, pages
271–280, 2008.
[39] J. Freire, D. Koop, E. Santos, and C. T. Silva. Provenance
for computational tasks: A survey. Computing in Science
and Engineering, 10(3):11–21, 2008.
[40] J. Freire and C. Silva. Towards enabling social analysis of
scientific data. In CHI Social Data Analysis Workshop,
2008. To appear.
[41] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E.
Scheidegger, and H. T. Vo. Managing rapidly-evolving
scientific workflows. In International Provenance and
Annotation Workshop (IPAW), LNCS 4145, pages 10–18,
2006. Invited paper.
[42] J. Frew and R. Bose. Earth system science workbench: A
data management infrastructure for earth science products.
In Proceedings of SSDBM, pages 180–189, 2001.
[43] J. Frew, D. Metzger, and P. Slaughter. Automatic capture
and reconstruction of computational provenance.
Concurrency and Computation: Practice and Experience,
20(5):485–496, 2008.
[44] J. Futrelle and J. Myers. Tracking provenance semantics in
heterogeneous execution systems. Concurrency and
Computation: Practice and Experience, 20(5):555–564,
2008.
[45] D. Gannon et al. A Workshop on Scientific and Scholarly
Workflow Cyberinfrastructure: Improving Interoperability,
Sustainability and Platform Convergence in Scientific And
Scholarly Workflow. Technical report, NSF and Mellon
Foundation, 2007.
https://spaces.internet2.edu/display/SciSchWorkflow.
[46] Y. Gil, V. Ratnakar, and E. Deelman. Metadata Catalogs
with Semantic Representations. In Proceedings of the
International Provenance and Annotation Workshop
(IPAW), 2006.
[47] J. Golbeck and J. Hendler. A semantic web approach to
tracking provenance in scientific workflows. Concurrency
and Computation: Practice and Experience,
20(5):431–439, 2008.
[48] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance
semirings. In Proceedings of PODS, pages 31–40, 2007.
[49] M. Greenwood, C. Goble, R. Stevens, J. Zhao, M. Addis,
D. Marvin, L. Moreau, and T. Oinn. Provenance of
e-Science Experiments - experience from Bioinformatics.
In Proceedings of The UK OST e-Science second All Hands
Meeting (AHM), pages 223–226, 2003.
[50] P. Groth, S. Jiang, S. Miles, S. Munroe, V. Tan, S. Tsasakou,
and L. Moreau. An architecture for provenance systems.
Technical report, ECS, University of Southampton, 2006.
[51] T. Heinis and G. Alonso. Efficient lineage tracking for
scientific workflows. In Proceedings of ACM SIGMOD,
pages 1007–1018, 2008.
[52] J. Hidders, N. Kwasnikowska, J. Sroka, J. Tyszkiewicz, and
J. V. den Bussche. A formal model of dataflow repositories.
In DILS, pages 105–121, 2007.
[53] T. Jankun-Kelly and K. Ma. Visualization exploration and
encapsulation via a spreadsheet-like interface. IEEE
Transactions on Visualization and Computer Graphics,
7(3):275–287, 2001.
[54] T. Jankun-Kelly, K. Ma, and M. Gertz. A model for the
visualization exploration process. In Proceedings of IEEE
Visualization, 2002.
[55] The Kepler Project. http://kepler-project.org.
[56] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar.
Provenance trails in the wings/pegasus system.
Concurrency and Computation: Practice and Experience,
20(5):587–597, 2008.
[57] Kitware. Paraview. http://www.paraview.org.
[58] D. Koop, C. Scheidegger, S. Callahan, J. Freire, and
C. Silva. Viscomplete: Data-driven suggestions for
visualization systems. IEEE Transactions on Visualization
and Computer Graphics, 2008. Accepted with minor
revisions. Papers from the IEEE Visualization Conference
2008.
[59] A. Krenek, J. Sitera, L. Matyska, F. Dvorak, M. Mulac,
M. Ruda, and Z. Salvet. glite job provenance – a job-centric
view. Concurrency and Computation: Practice and
Experience, 20(5):453–462, 2008.
[60] M. Kreuseler, T. Nocke, and H. Schumann. A history
mechanism for visual data mining. In Proceedings of IEEE
Information Visualization Symposium, pages 49–56, 2004.
[61] E. A. Lee and T. M. Parks. Dataflow process networks. In
Proceedings of the IEEE, pages 773–801, 1995.
[62] L. Lins, D. Koop, E. Anderson, S. P. Callahan, E. Santos,
C. E. Scheidegger, J. Freire, and C. Silva. Examining
statistics of workflow evolution provenance: A first study.
In Proceedings of SSDBM, 2008. To appear.
[63] D. T. Liu and M. J. Franklin. The Design of GridDB: A
Data-Centric Overlay for the Scientific Grid. In
Proceedings of VLDB, pages 600–611, 2004.
[64] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins,
E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, and Y. Zhao.
Scientific Workflow Management and the Kepler System.
Concurrency and Computation: Practice & Experience,
2005.
[65] B. Ludäscher and C. Goble. Guest editors’ introduction to
the special section on scientific workflows. SIGMOD Rec.,
34(3):3–4, 2005.
[66] P. Maechling, H. Chalupsky, M. Dougherty, E. Deelman,
Y. Gil, S. Gullapalli, V. Gupta, C. Kesselman, J. Kim,
G. Mehta, B. Mendenhall, T. Russ, G. Singh, M. Spraragen,
G. Staples, and K. Vahi. Simplifying construction of
complex workflows for non-expert users of the Southern
California Earthquake Center Community Modeling
Environment. SIGMOD Rec., 34(3):24–30, 2005.
[67] Microsoft Workflow Foundation.
http://msdn2.microsoft.com/en-us/netframework/
aa663322.aspx.
[68] S. Miles, P. Groth, M. Branco, and L. Moreau. The
requirements of using provenance in e-science experiments.
Technical report, ECS, University of Southampton, 2006.
[69] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and
L. Moreau. Extracting Causal Graphs from an Open
Provenance Data Model. Concurrency and Computation:
Practice and Experience, 20(5):577–586, 2008.
[70] S. Miles, S. C. Wong, W. Feng, P. Groth, K. P. Zauner, and
L. Moreau. Provenance-based validation of e-science
experiments. Journal of Web Semantics, 5(1):28–38, 2007.
[71] L. Moreau, editor. Concurrency and Computation: Practice
and Experience, Special Issue on the First Provenance
Challenge, 2008.
[72] L. Moreau and I. Foster, editors. Provenance and
Annotation of Data - International Provenance and
Annotation Workshop, volume 4145. Springer-Verlag, 2006.
[73] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and
P. Paulson. The open provenance model, December 2007.
http://eprints.ecs.soton.ac.uk/14979.
[74] L. Moreau, B. Ludäscher, I. Altintas, R. S.
Barga, S. Bowers, S. Callahan, G. Chin Jr., B. Clifford,
S. Cohen, S. Cohen-Boulakia, S. Davidson, E. Deelman,
L. Digiampietri, I. Foster, J. Freire, J. Frew, J. Futrelle,
T. Gibson, Y. Gil, C. Goble, J. Golbeck, P. Groth, D. A.
Holland, S. Jiang, J. Kim, D. Koop, A. Krenek,
T. McPhillips, G. Mehta, S. Miles, D. Metzger, S. Munroe,
J. Myers, B. Plale, N. Podhorszki, V. Ratnakar, E. Santos,
C. Scheidegger, K. Schuchardt, M. Seltzer, Y. L. Simmhan,
C. Silva, P. Slaughter, E. Stephan, R. Stevens, D. Turi,
H. Vo, M. Wilde, J. Zhao, and Y. Zhao. The First
Provenance Challenge. Concurrency and Computation:
Practice and Experience, 20(5):409–418, 2008.
[75] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and
M. I. Seltzer. Provenance-aware storage systems. In USENIX,
pages 43–56, 2006.
[76] S. Munroe, P. Groth, S. Jiang, S. Miles, V. Tan, and
L. Moreau. Data model for process documentation.
Technical report, University of Southampton, 2006.
http://eprints.ecs.soton.ac.uk/13047.
[77] http://www.myexperiment.org/workflows.
[78] T. Oinn, M. Greenwood, M. Addis, M. N. Alpdemir,
J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull,
D. Marvin, P. Li, P. Lord, M. R. Pocock, M. Senger,
R. Stevens, A. Wipat, and C. Wroe. Taverna: lessons in
creating a workflow environment for the life sciences:
Research articles. Concurrency and Computation: Practice
& Experience, 18(10):1067–1100, 2006.
[79] B. Blaustein, L. Seligman, M. Morse, M. D. Allen, and
A. Rosenthal. PLUS: Synthesizing Privacy, Lineage and
Security. In IIMAS, 2008.
[80] N. Podhorszki, B. Ludäscher, I. Altintas, S. Bowers, and
T. McPhillips. Recording data provenance for kepler
scientific workflows. Concurrency and Computation:
Practice and Experience, 20(5):507–518, 2008.
[81] The EU Provenance Project.
http://twiki.gridprovenance.org/bin/view/Provenance.
[82] First provenance challenge.
http://twiki.ipaw.info/bin/view/Challenge/
FirstProvenanceChallenge, 2006. S. Miles and L. Moreau
(organizers).
[83] Second provenance challenge.
http://twiki.ipaw.info/bin/view/Challenge/
SecondProvenanceChallenge, 2007. J. Freire, S. Miles, and
L. Moreau (organizers).
[84] E. Santos, L. Lins, J. P. Ahrens, J. Freire, and C. Silva. A
first study on clustering collections of workflow graphs. In
Proceedings of the International Provenance and
Annotation Workshop (IPAW), 2008. To appear.
[85] C. Scheidegger, D. Koop, E. Santos, H. Vo, S. Callahan,
J. Freire, and C. Silva. Tackling the provenance challenge
one layer at a time. Concurrency and Computation:
Practice and Experience, 20(5):473–483, 2008.
[86] C. Scheidegger, D. Koop, H. Vo, J. Freire, and C. Silva.
Querying and creating visualizations by analogy. IEEE
Transactions on Visualization and Computer Graphics,
13(6):1560–1567, 2007. Papers from the IEEE
Visualization Conference 2007.
[87] J. Schopf, I. Coleman, R. Procter, and A. Voss. Report of
the user requirements and web based access for eresearch
workshop. Technical Report UKeS-2006-07, UK National
e-Science Centre, November 2006.
[88] K. Schuchardt, T. Gibson, E. Stephan, and G. Chin, Jr.
Applying content management to automated provenance
capture. Concurrency and Computation: Practice and
Experience, 20(5):541–554, 2008.
[89] IEEE Workshop on Workflow and Data Flow for Scientific
Applications (SciFlow 2006).
http://www.cc.gatech.edu/~cooperb/sciflow06.
[90] M. Seltzer, D. A. Holland, U. Braun, and K.-K.
Muniswamy-Reddy. Pass-ing the provenance challenge.
Concurrency and Computation: Practice and Experience,
20(5):531–540, 2008.
[91] SGI Scientific Workflow Solution.
http://www.sgi.com/industries/sciences.
[92] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and
applications of tree and graph searching. In Proceedings of
PODS, pages 39–52, 2002.
[93] B. Shneiderman. ACM’s computing professionals face new
challenges. Commun. ACM, 45(2):31–34, 2002.
[94] C. Silva, J. Freire, and S. P. Callahan. Provenance for
visualizations: Reproducibility and beyond. Computing in
Science & Engineering, 9(5):82–89, 2007.
[95] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data
provenance in e-science. SIGMOD Record, 34(3):31–36,
2005.
[96] Y. L. Simmhan, B. Plale, and D. Gannon. A framework for
collecting provenance in data-centric scientific workflows.
In IEEE International Conference on Web Services (ICWS),
Chicago, IL, 2006.
[97] Y. L. Simmhan, B. Plale, and D. Gannon. Karma2:
Provenance management for data driven workflows.
International Journal of Web Services Research, Idea
Group Publishing, 5:1, 2008. To appear.
[98] Y. L. Simmhan, B. Plale, and D. Gannon. Query
capabilities of the karma provenance framework.
Concurrency and Computation: Practice and Experience,
Wiley InterScience, 20(5):441–451, 2008.
[99] Y. L. Simmhan, B. Plale, and D. Gannon. Querying
capabilities of the karma provenance framework.
Concurrency and Computation: Practice and Experience,
20(5):441–451, 2008.
[100] Y. L. Simmhan, B. Plale, D. Gannon, and S. Marru.
Performance evaluation of the karma provenance
framework for scientific workflows. In L. Moreau and I. T.
Foster, editors, International Provenance and Annotation
Workshop (IPAW), Chicago, IL, volume 4145 of Lecture
Notes in Computer Science, pages 222–236. Springer, 2006.
[101] E. Stolte, C. von Praun, G. Alonso, and T. R. Gross.
Scientific data repositories: Designing for a moving target.
In Proceedings of ACM SIGMOD, pages 349–360, 2003.
[102] The Swift System. http://www.ci.uchicago.edu/swift.
[103] M. Szomszor and L. Moreau. Recording and reasoning over
data provenance in web and grid services. In International
Conference on Ontologies, Databases and Applications of
SEmantics (ODBASE’03), volume 2888 of Lecture Notes in
Computer Science, pages 603–620, Catania, Sicily, Italy,
Nov. 2003.
[104] P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer,
Z. G. Ives, F. Pereira, and S. Guha. Learning to create
data-integrating queries. In VLDB, 2008. To appear.
[105] W. C. Tan. Provenance in databases: Past, current, and
future. IEEE Data Eng. Bull., 30(4):3–12, 2007.
[106] The Taverna Project. http://taverna.sourceforge.net.
[107] The Triana Project. http://www.trianacode.org.
[108] W. W. van der Aalst and L. T. Maruster. Workflow mining:
discovering process models from event logs. IEEE TKDE,
16(9):1128–1142, 2004.
[109] J. van Wijk. The value of visualization. In Proceedings of
IEEE Visualization, 2005.
[110] VDS - The GriPhyN Virtual Data System.
http://www.ci.uchicago.edu/wiki/bin/view/VDS/
VDSWeb/WebMain.
[111] F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and
M. McKeon. Manyeyes: a site for visualization at internet
scale. IEEE Transactions on Visualization and Computer
Graphics, 13(6):1121–1128, 2007.
[112] The VisTrails Project. http://www.vistrails.org.
[113] VisTrails Packages.
http://www.vistrails.org/index.php/UsersGuideVisTrailsPackages.
[114] Yahoo! Pipes. http://pipes.yahoo.com.
[115] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent
structure-based approach. In Proceedings of ACM
SIGMOD, pages 335–346, 2004.
[116] J. Zhao, C. Goble, R. Stevens, and D. Turi. Mining
Taverna’s semantic web of provenance. Concurrency and
Computation: Practice and Experience, 20(5):463–472,
2008.
[117] Y. Zhao, M. Wilde, and I. Foster. A virtual data provenance
model. In Proceedings of the International Provenance and
Annotation Workshop (IPAW), pages 148–161, 2006.