Accelerating Time to Experiment: Sharing Methods

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2000; 00:1–6 Towards Open Science: The myExperiment approach D. De Roure*, Carole Goble†, Sergejs Aleksejevs†, Sean Bechhofer†, Jiten Bhagat†, Don Cruickshank*, Duncan Hull†, Yuwei Lin†, Danius Michaelides*, David Newman*, Rob Procter† * School of Electronics and Computer Science, University of Southampton,Southampton SO17 1BJ, UK. † School of Computer Science, The University of Manchester, Manchester M13 9PL, UK SUMMARY (placeholder from original paper) myExperiment has set out to provide the social software and services to support the scientific process, focusing on the ‘time to experiment’ phase of the scholarly knowledge cycle rather than the data deluge from new experimental techniques. In the context of open science this phase is also experiencing a deluge of scientific objects – not just data but protocols, methods and the new artefacts of digital science such as workflows, provenance records and ontologies. myExperiment has demonstrated the role of a social web site in addressing this challenge. KEY WORDS: Scientific Workflow, Virtual Research Environment, web 2.0, Open Repositories, Data Curation 1. INTRODUCTION While the computer science community has focused on accelerating processing, if we are to accelerate time to discovery then our research effort needs to look at ‘accelerating’ the human part of the discovery cycle. Scientific advance relies on a social process in which scientists share hypothesis, insights and results, and the data and methods that support these. Traditionally scholarly discourse and dissemination has focused around the publication of peer reviewed articles, mediated by the scholarly publishing process and established structures such as conferences. However, the scientific process (figure 1 -cameron’s picture) has many other opportunities for supporting acceleration.  The systematic publication of primary and secondary data sets, along with standard metadata sufficient to support their interpretation, is now becoming an accepted, or at least expected, practice, although the tying together published results with the data that upon which they are based is still poorly supported [ref].  Algorithms – published through libraries and services; nanoHUB, R libraries etc  Data and algorithms commonly accessible using the Web as the dissemination platform through web services or web servers [ref]. Scientists need to share (and find) not just the digital materials of scientific research but also the digital methods and processes: the protocols, plans, and standard operating procedures of bench science and the scripts, workflows and provenance records of e-Science. Methods are scientific commodities in their own right, with associated intellectual property, metadata, and life cycles; and as with data and articles, subject to their own forms of authorship, credit and reuse criteria. By pooling and sharing methods we have the potential to accelerate science by pooling and sharing know-how and best practice, avoiding reinvention and hence reducing time-to-experiment [ref] amongst decoupled communities of workflow scientists who are not organized into predetermined Virtual Organisations. Scientific workflows, for example, are often complex and challenging to build, and can require specialist expertise that is hard-won and may be outside the skill-set of those who need [ref CCPE reuse paper]. The workflows in [trps] took a bioinformatics expert six months and over 40 versions to develop; however, once developed they were immediately reusable by other, perhaps less experienced, e-Scientists accelerating their research [ref]. By combining methods with results we can accelerate discovery by enabling transparent, comparable and reproducible science [ref]. By packaging and aggregating methods with data, results tutorials, simulations, logs, tags, experts, members, groups - any form of digital research object - sharing these across applications as publication units we can work towards an open e-Laboratory that is outside any specific application. Research objects are coupled with a research objective or designed to support a research process are intended to capture a scientific investigation or research question and are expected to be reusable, repeatable, and replayable. A data and method deluge demands new techniques, especially in the context of open science. The new instrument that we bring to bear on this challenge is provided by society itself – it is the scale of community participation and the network effects that this brings. This instrument offers new ways of tackling difficult challenges; for example, the ‘decay’ over time of research objects can be addressed by community curation. Accelerating time to experiment requires social infrastructure for open science. There is great potential in providing social tools to support the scientific process and the sourcing, sharing and continued curation of scientific research objects [wikinomics article]. This is possible because increasingly the various research objects are born or available digitally. Scientists are just beginning to use blogs, wikis and social networks to facilitate more rapid and immediate sharing of research, a phenomenon sometimes characterised as Science 2.0 [ref, ref]. The Open Science movement [ref], though currently niche, vocally advocates the large scale, open distributed collaboration is enabled by making data, methods and results freely available on the Web.  By adopting social content sharing tools for research object repositories we can harness a social infrastructure that enables social networking around scientific objects and provides community support for social tagging, comments, ratings and recommendations and social network analysis and reuse mining (what is used with what, for what and by whom), and remixing of new research objects from previously deposited ones. We can take advantage of popular and familiar user interfaces of social content sharing sites such as Flickr, YouTube and Slideshare [refs]. myExperiment supports the social network of users and content, assisting researchers in discovering resources as well as in deposit, publication and curation. The “social metadata”, including tags and ratings, is shown in Figure 1. While it has parallels with websites such as FaceBook, myExperiment also provides a privacy, sharing, licensing, credit and attribution model which reflects and respects the needs of researchers and offers incentives for self-archiving and social curation. We address the challenge of creating the appropriate technical environment that understands and supports the social scientific discovery and publishing lifecycle, that protects the assets of a scientist yet also creates the incentives to reward self-less sharing to scientific strangers for potentially unexpected use. By adopting an open, extensible and participative development environment for research object repositories functionality can become readily available for reuse by others and draw on other services as much as possible. We should not oblige the scientist to come to a repository, but rather make it as easy as possible to bring the content to the scientist’s own environment. This is essential for adoption [ref], which in turn is essential to build a community and catalyse community network effects. Open science is thus the process of opening up content (sharing research objects in controlled and appropriate ways) and opening up applications (sharing research objects and the functionality of their repositories with applications). myExperiment [ref] is a socially-sourced content repository that supports the sharing and curating of research objects used by scientists, specifically focused on scientific workflows and experiment plans. For researchers it provides a social infrastructure that encourages sharing and a platform for conducting research, through familiar user interfaces. For developers it provides an open, extensible and participative environment. The production team is made up of software developers, scientific informaticians from a range of disciplines and social scientists. Launched in November 2007 and built using Web 2.0-style social networking approaches according to Web 2.0 design principles [ref] myExperiment: Facilitates the management and sharing of research workflows. The public repository (myexperiment.org) has established a significant collection of scientific workflows, spanning multiple disciplines (biology, chemistry, social science, music, astronomy) and multiple workflow systems, which has been accessed by over 16,000 users worldwide. At the time of writing 1 the public site has over 600 different workflows (200+ 1 25th March 2009 versions), drawn from XX workflow management systems including Taverna, Kepler, Triana, Trident, etc etc). There are 1600 registered users. How many plans? Include the notion of Objects here, and Packs here as the start of objects. In section 2 we introduce myExperiment, briefly present our development methodology, highlight the characteristics of workflows and other forms of methods such as plans and standard operating protocols that influence us and compare our work with other method repositories. Supports a social model for content curation tailored to the scientist and community. Producers of research objects should have incentives to make them available and consumers need to be able to discover and reuse them; all will benefit from self- and community-curation. myExperiment is a fruitful environment in which to study its use by social scientists which has revealed shortcomings and issues to address and steered development. In section 3 we describe the social model that myExperiment implements and discuss it in practice as identified by a user study that has shadowed and steered the development of the repository. In particular we show that the content is roughly split into a market and a toolbox; and that sharing is desirable and possible but anonymous reuse is difficult. We compare our social content approach to other content services in science and outside science. Supports open science by exposing its content and functionality into the users’ tools and applications and absorbing other interfaces. myExperiment provides an open, extensible environment to permit ease of integration with other software, tools and services, and benefit from participative contribution of software. In contrast to social web sites like Facebook and mySpace, developers can download, reuse and repurpose myExperiment itself, and the codebase is evolving as it is used across multiple projects. In section 4 we show how, by exposing the myExperiment functionality, new interfaces have been built and existing interfaces have incorporated myExperiment functionality, including plug-ins to the Taverna workbench, Facebook applications, an iGoogle gadget-based research dashboard and a Silverlight interface, and Chemistry Electronic Lab Notebooks and ‘blogging the lab’ [ref]. In section 5 we discuss how myExperiment is a first step towards a general notion of Research Object, which captures aggregations of objects and also encompasses the other forms of data in myExperiment, and is the part of a greater vision of interoperable e-Laboratories. We describe how myExperiment makes Research Objects accessible and actionable beyond the core repository. Our Research Objects are represented in RDF and we have developed a myExperiment ontology which uses Dublin Core metadata for research objects and FOAF for the social network information. To interwork with other repositories we adopted the Object Reuse and Exchange representation from the Open Archives initiative, which is based on named RDF graphs. We envisage that the scholarly publishing process will evolve to support this more general notion of scientific research object, which will facilitate reusable and reproducible research. What is the appropriate related work? myExperiment is an important component in the revolution in creating, sharing and publishing scientific results, and has already established itself as a valuable and unique repository with a growing international presence. It demonstrates the success, and exposes the challenges, of blending modern social curation methods with the demands of researchers sharing hard-won intellectual assets and research works within a scholarly communication lifecycle 2. MYEXPERIMENT – A WORKFLOW REPOSITORY Duncan – can you provide material on other repositories like pipeline pilot please? introduce myExperiment, briefly present our development methodology, highlight the characteristics of workflows and other forms of methods such as plans and standard operating protocols that influence us and compare our work with other method repositories. Researchers in every discipline now have access to a vast array of digital resources, ranging from libraries of articles and data collections to analytical tools and visualisation applications, many publicly available. These digital resources are combined and their data aggregated and analysed in the day-to-day conduct of research. With these new techniques come many new kinds of digital assets which have yet to become adopted within the scholarly knowledge lifecycle, yet they are an essential part of it. One of the most important is the scientific workflow. A scientific workflow is the description of a process that specifies the co-ordinated execution of multiple tasks so that, for example, data analysis and simulations can be repeated and accurately reported. Alongside experiment plans, Standard Operating Procedures and laboratory protocols, automated workflows are one of the most recent forms of scientific digital methods, and one that has gained popularity and adoption in a short time [1]. They represent the methods component of modern research and are valuable and important scholarly assets in their own right. 2.1 The myexperiment repository The myExperiment repository was motivated by observing a clear need to share workflows, motivated by a frustration with existing systems which missed the social dimension, merely making things available rather than encouraging and controlling sharing and presented complex user interfaces out of line with the popular web sites that people are using on an everyday basis, thereby demanding further skill. The motivation and rationale for the myExperiment project is discussed in detail in [4-5] and the design principles in [6]. Crucially we focused on the user experience, providing an attractive and immediately understandable web interface that uses the metaphors and behaviours of popular tools used in everyday life. It is immediately familiar to a new generation of students and researchers. A myExperiment support the deposition and curation of research objects in different domains. The current research objects are:  Workflows – compound objects which contain services, the workflow graph, workflow-specific metadata and additional information dependent on the workflow system. Workflows are versioned, and each workflow has usage statistics such as the number of viewings and the number of downloads.  Packs – collections of research objects to form aggregate entities. In addition to objects on the current server, packs can also contain links to objects on other servers.  Files – binary objects that are uploaded to myExperiment and are opaque to the system.  Groups – collections of registered users. The person who creates a group controls its membership through invitations and requests. The data model supports relationships between groups. 2.2 Architecture The architecture of one instance of myExperiment is shown in Figure 1. For ease of use, all the interfaces to myExperiment functionality are accessed via the HTTP protocol. For end users we provide an HTML based web interface. External applications can also access the other interfaces, in particular the managed RESTful API (see next section). In line with our open environment capability, the database server, search server and external workflow enactors are all separate systems to which the main application connects. The interfaces are accessed via a web server that handles load balancing over a cluster of mongrel application servers. Ultimately scalability will also be achieved by federating multiple instances of myExperiment. myExperiment is built in the Ruby on Rails web application framework and follows the Model View Controller abstractions set out in Rails. In particular, the models follow the active record pattern as provided by the ActiveRecord library. By keeping with the architectural design of Rails we were able to leverage many of its capabilities to build features for users rapidly. Various mechanisms for authentication are provided based on the interfaces used. For end users, authentication can be via external OpenID services or the internal username/password mechanism. In response to 24x7 demand, the myExperiment.org servers are hosted in a commercial collocation company with service availability that exceeds university targets. The service is hosted on two servers: a web frontend and a database backend. The frontend consists of the Apache webserver and a cluster of Ruby on Rails processes, running on separate ports using the Mongrel Cluster software. Static content such as CSS stylesheets, Javascript files and images are served directly by Apache, whereas for dynamic content (HTML and XML), Apache makes connections to the Ruby on Rails processes using the load balancing and proxy Apache modules. The database, which is a major component of the Ruby on Rails system, is hosted on the second server in the form of MySQL. This second server also runs the Solr search server, which is a Java implementation of the Lucene search library running as a Java servlet in Tomcat. To ensure service reliability, CPU load, memory and disk usage is monitored using the Nagios monitoring tools, which also check for correct and timely response of the entire service by making web requests as if it were an external user. The agile ‘perpetual beta’ development process [8] requires frequent updates to be rolled out to the main myExperiment.org service. This is aided by maintaining a separate server for final testing of code, which allows preview and test of new features and checking for performance regressions with automated tools. A test server containing a recent snapshot of the public data from the live site is also provided to developers writing applications that make use of the myExperiment API. myExperiment, which is released under the BSD licence, is built in the Ruby on Rails web application framework and follows the Model View Controller abstractions set out in Rails. By keeping with the architectural design of Rails we were able to leverage many of its capabilities to build features for users rapidly. The database server, search server and external workflow enactors are all separate systems with simple interfaces to which the main application connects. Various mechanisms for authentication are provided based on the interfaces used; for example, for end users, authentication can be via external OpenID services. 2.3 Related Work Workflow management systems already make workflows available for sharing, through repository stores for workflows developed as part of projects or communities. Unlike myExperiment, they are tied to a particular type of workflow and do not offer programmatic access to the workflows. For example, the Kepler Hydrant (www.hpc.jcu.edu.au/hydrant) is a site (under development) for sharing Kepler workflows. It supports workflow execution and allows users to assign permissions to other users. Inforsense’s commercial Customer Hub (www.chub.inforsense.com) has the ambition of enabling Inforsense workflow users to share best practices and leverage community knowledge. However, it does not rely on the social model. Galaxy (galaxy.psu.edu) provides a public site where biologists can run analyses and for developers it provides an open-source framework for tool and data integration. It does not provide social infrastructure to support sharing of workflows. Social networking sites such as Facebook (www.facebook.com) do not support research objects, and the handling of attribution and licensing may not be adequate for scientists. Facebook supports the development of plug-in applications. Similarly, science-specific social networks, like Epernicus (www.epernicus.com) only support the social part of the VRE function. The research objects of OpenWetWare (openwetware.org) are protocols used in biology labs and, through use of a wiki, OpenWetWare supports the social model and open environment. However, it does not itself intend to be a platform for conducting computationally-intensive research. Finally, Nanohub (www.nanohub.org) is a good example of a VRE that takes a portal approach. It focuses on the nanotechnology domain and provides web-based resources for research, education and collaboration. It also provides simulation tools that can be accessed from the browser. In terms of social infrastructure it provides workspaces, online meetings and user groups. In contrast, myExperiment deliberately set out to build a Web 2.0 site which would be familiar to users, choosing a Web application framework (Ruby on Rails) rather than a portal framework. It offers a rich API and remote execution. myExperiment is designed to provide services to a portal and also to be used as a Web 2.0 ‘skin’ over existing portal services. 3. SUPPORTING SOCIAL CONTENT AND COLLABORATIVE CURATION Dan’s stats are in this section Rob and Yuwei’s words are in this section In section 3 we describe the social model that myExperiment implements and discuss it in practice as identified by a user study that has shadowed and steered the development of the repository. In particular we show that the content is roughly split into a market and a toolbox; and that sharing is desirable and possible but anonymous reuse is difficult. We compare our social content approach to other content services in science and outside science. 3.1 The social dimension At the outset there was considerable scepticism as to whether scientists would be prepared to use a social web site and doubt that critical mass could be achieved in order to enjoy the network effects of Web 2.0. We now have an evidence base of several months of usage data which provides evidence that, for our communities at least, scientists do share and that their social networks develop [3]. Our analysis also shows that an enthusiastic core is willing to share quality workflows but expects credit for doing so, acting as provider to the wider community [4]. We are working with our social science colleagues to analyse the sharing behaviours in different communities. myExperiment is distinctive because it majors on the social dimension, and it can itself be seen as an experiment to explore whether scientific communities share sufficiently in order to benefit from the network effects of a social web site. To support producers of research objects in contributing to myExperiment we provide members of the site with support for credit and attribution, and fine control over the visibility and sharing of research objects. Early user feedback revealed this to be the most critical factor in making a social web site acceptable for use by scientists. Other members of the site ‘consuming’ the research objects can view, download, tag, review and ‘favourite’ them, which aids their discovery and enhances reputation of the producers. Additionally, content exposed publicly is discoverable through search engines. Unless they are maintained, workflows and other research objects can cease to be reusable over time –they effectively ‘decay’, though in fact it is their context that is changing. For example, a recent change in gene identifiers by one service provider led to a myExperiment announcement for users of the affected workflows. Useful workflows will be curated by the community that uses them, and the original authors are also encouraged to curate because they are getting credit for use of their work. Workflow decay is a difficult problem and myExperiment provides a new approach through community curation. The social model is supported through a number of entities. Foremost is the member – these are users who have registered with myExperiment and can find or contribute research objects, create tags, comments and reviews etc. They can also form friendships with other members. A great many users of myExperiment are not members – they are simply people browsing the site for publicly available content or following links from search engines such as Google. The ‘social metadata’ entities are: 1. Attributions – So that members can show what a research object is based on. 2. Creditations – So that members can give credit to others. By default, the uploader is given credit. 3. Favourites – To support reputation and provide incentive, members can identify their favourite research objects. The list of favourites is visible to other members. 4. Ratings – A simple 5 star rating system to assist with recommendation. 5. Reviews – Explanations to augment ratings. 6. Citations – Publication information associated with research objects. 7. Comments – To enable members to comment on research objects. 8. Tags – To annotate research objects for ease of discovery. The owner of the “tagging” (the association of the tag to the object) is recorded. 9. Policies – see below for Ownership, Sharing and Permissions. Broadly the user interface reflects the same set of entities and is designed to make it as easy as possible for consumers to find research objects (by search or navigation) and for producers to contribute. Workflows, users, groups and packs have their own pages which become the root for pivoting and browsing. Mechanisms for tagging and commenting etc are consistent across these pages. Only content that is authorised to be shown is visible to the current user. As a result, many parts of the user interface are very dynamic, with different content, features and actions shown/enabled based on the current user context. Sometimes the user interface does not directly reflect the underlying data model. In particular, the underlying sharing and permissions model is highly object oriented, with User, Group, Contribution, Policy and Permission objects all working together. However, users are presented with ‘canned’ options that allow quick selection of the most appropriate sharing option, for example: ‘anyone can view and download’; ‘anyone can view, but only my Friends are allowed to download’. 3.2 Analysis of Usage Analysis of myExperiment.org usage statistics over the period January-July 2008 demonstrates: (i) a rapidly growing community, (ii) extensive use of contributed research objects and (iii) the development of social groups. (i) Community size. At the time of writing, myExperiment.org has 1051 activated accounts. There has been a steady growth in the user base during 2008, with about 10-20 new users registering a week. Spikes in registrations are due to Taverna workshops that use myExperiment to host their tutorial materials and conferences. 38% of the registered users are regular visitors. In a seven month period the site received approximately 60000 page views in 13500 visits by 8581 unique visitors. The figures are collected using Google Analytics and do not include accesses made via the API. It is interesting to note that the number of unique visitors is much larger than the number of registered users This suggests that the publicly visible content on the site is of value to a wider audience. (ii) Use of research objects. myExperiment.org hosts three types of research objects: workflows, files and packs. There are XX workflows and a further XXXworkflows that are revised versions. Workflows were downloaded a total of 50934 times, with three workflows commanding over a 1000 downloads each. Figure 4 shows a general overview of workflow popularity based on downloads. Over time we might expect a larger number of workflows appearing with a smaller number of downloads. The present figure is explained by the strong differences found in documentation of workflows – the less documentation, the fewer downloads. In terms of permissions, 280 (85%) of the workflows are publicly visible whereas 252 (76%) are publically downloadable. 40% of the workflows with restricted access are entirely private to the user and for the remaining the user has elected to share with individual users and groups. 36 workflows (over 10%) have been shared with the owner granting edit permissions to specific users and groups. In addition there are 53 instances where users have noted that a workflow is based on another workflow on the site. This indicates that the site is supporting collaboration amongst its users and that they are willing to contribute derived works. We have also investigated how users discover workflows using the site, finding that an enthusiastic core is willing to share quality workflows but expects credit for doing so, acting as provider to the wider community [10]. Plain files are used much less with only 109 uploads. However, 70% were added after the introduction of packs where users are making use of packs to associate documentation, example inputs and outputs and other files with workflows. Analysing the use of packs would be premature given their recent introduction on the site. To date some 20 packs have been created. (iii) Social group development. The Groups mechanism has been used to form teams of workflow builders, to help organise events, to collaborate between projects and to locate peers with similar interests. myExperiment currently counts over 100 groups, ranging from two to 20 members. 3.3 Research Methods Rob’s words start here It is important to investigate how researchers actually make use of Web 2.0 tools, to explore participatory patterns and cultures, how these shape the ways in which scientific knowledge is crafted and communicated and to use the insights gained to help Web 2.0 tools and services adapt to their users’ needs. As a contribution, we are undertaking a series of in depth studies of researchers’ use of Web 2.0 tools. Here, we present evidence from one of these studies, an ongoing investigation of myExperiment users’ workflow sharing and re-use practices, their motivations, concerns and potential barriers. We have been conducting a series of interviews with myExperiment users. The study has been designed not only to gather data about expectations and motivations as researchers join myExperiment but also to provide a longitudinal perspective over a period of 24 months which enables us to track myExperiment users’ attitudes and behaviour as they become more experienced. Interviewees are selected on the basis of myExperiment activity profiles, including workflows uploaded/downloaded; number of friends; group membership; group moderation and discipline. In total, we expect to conduct between 40-50 interviews with 25-30 respondents in 4-5 waves. In each wave, existing respondents have been re-interviewed and new respondents recruited. To date, we have conducted 34 interviews with 27 users through phone, instant messaging, via email and face-to-face. One user ahs been interviewed three times; five users interviewed twice, and the rest have been interviewed once. A summary is given in Table 1. Interviewees are recruited either via myExperiment or by snowball sampling (i.e., users suggested by interviewees). 3.4 Findings: Motivations for Being a myExperiment User All interviewees report using myExperiment for publishing and disseminating workflows: “The ability to be able to publish the workflows […] It’s much more convenient and more organised than just publishing the XML file on my research group’s site […] People who have the same interests or use the same components […] they’ll more likely to find it on myExperiment. So, basically, for things like dissemination it’s fantastic, makes it easy to point people out which I think it’s useful.” In contrast, some interviewees express a more collaborative ethos: “When my solutions work, it would be great to share with others.” “[…] I have the feeling that there’s a better opportunity for sharing scholarly work in that way too.” Other interviewees are aware of the importance of ‘network effects’: “[…] we started to realise very quickly that it’s going to be the next kind of big area for different kinds of research […] many people say to us ‘hey, we’re using myExperiment. Can we have access to various different pieces?’” 3.5 Findings: Barriers and Enablers to Sharing Workflows Research domains have emerged as an important barrier to reuse of workflows. The responses of our sample of users suggest that workflows do not easily migrate across domains, reflecting, perhaps, distinctive ‘patterns’ to research processes in different domains. Many interviewees also commented that their research is relatively advanced or is too specialised for many workflows to be directly helpful to them. This may reflect that the myExperiment community is still evolving and, as yet, is populated by early adopters, such that effects normally attributable to social networks have yet to make themselves felt. Good documentation is accepted as a key requisite for facilitating sharing and re-use. From our interviewees’ comments, they are still learning what constitutes good documentation for workflow discovery and sharing: “It’s a compromise how much time a contributor would like to invest to make sharing workflow feasible. If I have to be honest, my own workflows are not documented properly either. My workflow is used by a small group of people. Once in a while when I present my work, that’s how they know how my workflow works.” “[…] it depends on how well people label various components. We’ve seen quite a few models with no descriptions of what input and output are, or what kind of format and what assumptions they made.” “The functionality is there, but […] people are being lazy or people don’t understand different ways they can put information in to make that available to other users […] people make too many assumptions over what anyone else knows about the system. It happens everywhere.” “[…] I found browsing for things to be one of the functionality areas I think need help because I find it difficult to find something […] some features that make it easier to find workflows would be useful.” “[…] you can start asking questions to collaborators […] there’s flexibility there but it’s something that can be improved […] it’s users rather than myExperiment’s fault.” Tagging is one example of how myExperiment users are grappling with establishing good documentation practice: “One of the issues we found is everyone kinds of puts things down as a text mining tag which is not very specific in order to get the different bits and pieces that we need. So users need to put in or people who actually put things there need to start considering how to label things in order to have them being found.” “I always think of tags as supplements for navigation because you can’t really rely on people conforming to tagging standard ways that other people doing it. So if it’s a low frequency tag you are not necessarily going to get everything even related to it.” “If they can make it easier to discover workflows, like better using the metadata, promoting users to use metadata, like a paper repository and then being able to search other things, that would be very helpful to me.” 3.6 Findings: Diverse Requirements for Sharing We can discern from our data two distinct myExperiment communities when it comes to workflow re-use. The community which we will refer to as workflow ‘consumers’ prefer larger workflows ready to be down loaded and enacted; the community of workflow ‘builders’ prefer smaller, modularised workflows which can be assembled and customised: “I upload the complete thing and I also upload its parts. The idea I have is that workflow creators will more look for components […]. As the end user you probably look for the larger workflows.” “A larger workflow might be too specific for my needs, it might be more worthwhile to look for the smaller parts, to adjust them to my needs. Things in bioinformatics are often so specific, it can be difficult to find the right thing, smaller things are easier to evaluate.” We might conclude that workflow consumers see myExperiment as a workflow ‘supermarket’; workflow builders see it as a ‘toolbox’. 4. OPEN SCIENTIFIC PLATFORMS David Newman’s words are in this section Jits’ Silverlight words are in this section In section 4 we show how, by exposing the myExperiment functionality, new interfaces have been built and existing interfaces have incorporated myExperiment functionality, including plug-ins to the Taverna workbench, Facebook applications, an iGoogle gadget-based research dashboard and a Silverlight interface, and Chemistry Electronic Lab Notebooks and ‘blogging the lab’ [ref]. 4.1 API myExperiment always prefers reuse to reinvention and can easily access other services. It is designed to be part of the scholarly knowledge cycle and is compatible with Open Archives Initiative protocols. While it provides a workflow repository function, much of the associated information – such as data and publications – may be held in other repositories, so myExperiment makes it easy to refer to external content. In addition to the API, the myExperiment codebase is open source and can be used by anyone to set up their own myExperiment instance. As well as bringing this capability to the user through the myExperiment interface, the API is designed so that developers are easily able to build ‘functionality mashups’ over myExperiment for rapid prototyping of tools to support researchers. These may be prescriptive interfaces for specific tasks, such as running preconfigured workflows. To support the open and extensible environment we provide data access using basic REST principles, and in line with the community we are increasingly adopting Atom as a means of delivering content and synchronising with peer services. These interfaces have wide adoption in the developer community. Though Ruby on Rails provides a mechanism for automatically providing REST access, we decided to manage the API separately so that we could respond to the requirements of API users, while also being independent of codebase evolution. Hence the REST API is driven by an XML specification that can be loaded and edited within Microsoft Excel. This allows us to create an independent API specification with the added benefit that it is in one place instead of spread across many model files. It also assists in generating documentation and tests. Elements of the myExperiment data model have been revealed via the REST API on a case by case basis. Currently, the exposed entities include: workflows, files, users, groups, tags, messages, citations, reviews, comments, ratings and packs. Given that control of visibility is crucial to myExperiment, we need a means of authenticated API access. This is achieved by using the OAuth protocol, whose purpose is not just to authenticate that a user has given a service consumer access to a service provider; it is a specific key that may have certain privileges assigned to it. With OAuth, a user can create several keys which could be used with one service, and each of those keys may have a different set of privileges. A developer community is growing up around the API, with projects developing Google gadgets, Facebook Apps, a plugin for the Taverna workflow system, mashups over myExperiment services and a Silverlight interface. It is also being used to incorporate myExperiment functionality in systems such as Wikis. The developer community uses the myExperiment developer wiki to collaborate, following our own principle of supporting the social model. Jits’s Silverlight words... Users do not come equal. As such, a major factor in the development of an open RESTful API was to facilitate the development of new and novel interfaces and applications. Our close work with users on improving the usability of the myExperiment website has informed many complex requirements, one being the need for richer discovery tools. We explored the use of a Rich Internet Application (RIA) technology – Microsoft Silverlight – in developing a simple search mashup tool. Silverlight, in a similar vein to Adobe Flash, is an extension to the browser in which rich content and functionality can be provided to users. This goes beyond existing web standards with the disadvantage that users need to download additional software. Our search mashup presents a clean interface that allows a user to focus on discovery without being distracted by the other features of myExperiment. We have used the keyword search and tag cloud functionality (via the API) to allow discovery of all public content from the myExperiment.org repository. These include workflows, packs, files, groups and users. Users of the mashup can also preview each individual search result (either inline or in a popup). If interested they can then click through, which opens up the item in myExperiment (in a new browser window). Because it runs within a browser, the mashup can be hosted in one place and then be accessed just like any other web application. Our discussions with users suggest that rich mashups provide easy entry points to repositories like myExperiment and, by taking a task-oriented approach, allows users to focus on the task at hand. More complex and personalised workspace oriented applications could also provide useful interfaces in which scientists can carry out their work. For ease of use, all the interfaces to myExperiment functionality are accessed via the HTTP protocol. For end users we provide the HTML based Web interface, while external applications can also access the other interfaces, in particular the managed RESTful API. The HTML interface makes minimal assumptions about the capabilities of the browser, optionally using JavaScript and AJAX to improve the interactive experience, while the API enables the construction of Rich Web Applications and mashups. Developing new Interfaces. We have two exercises in building entirely new user interfaces to myExperiment’s functionality. Firstly we are using Silverlight to build a rich similarity search and socially-driven workspace mashup that uses the myExperiment API together with other common data sources like Google Search, Google Scholar, CiteULike, Connotea, PubMed and so on. Figure 1 shows how a user can initiate a search from the “Top Tags” cloud built from live myExperiment data. Secondly we have built Google Gadgets for myExperiment. Two of these are shown in Figure 2: the Tag Cloud gadget, and the Workflows Feed Reader gadget which shows the latest myExperiment workflows. Bringing myExperiment to existing interfaces. We have integrated with the Taverna workflow workbench by building a Taverna plugin for myExperiment, so that Taverna users can access the myExperiment capabilities from within the Taverna environment (figure 3). We are currently integrating with Microsoft’s Trident Scientific Workflow Workbench [6], and for this we have developed preliminary support in myExperiment for sharing Windows Workflow Foundation (WWF) workflows. Finally we are working in conjunction with our open science colleagues in chemistry to bring myExperiment together with work on Electronic Lab Notebooks and ‘blogging the lab’ [7]. Scholarly publishing. myExperiment can refer to external repository content and we are developing mechanisms to import metadata directly from the repositories, focusing first on EPrints integration. myExperiment supports a general notion of research object, which captures aggregations of objects and also encompasses the other forms of data in myExperiment – for example members, groups, tags and the social network. We call these objects EMOs (Encapsulated myExperiment Objects). EMOs are represented in RDF and we have developed a myExperiment ontology which uses Dublin Core metadata for research objects and FOAF for the social network information. To interwork with repositories we have adopted the Object Reuse and Exchange representation from the Open Archives initiative, which is based on named RDF graphs. We envisage that the scholarly publishing process will evolve to support this more general notion of scientific research object, which will facilitate reusable and reproducible research. 4.2 RDF myExperiment's ability to share information is one of its key advantages when it comes to closing the experimental lifecycle loop. However it is important to consider the mechanisms for how this information is shared. The myExperiment RESTful API has already demonstrated how a machine-oriented sharing mechanism can allow the development of new interfaces in the form of "mashups" and "gadgets". The RESTful API although extensible is quite rigid and requires any linking-up to be performed client-side, RDF provides a framework so this can be executed server-side. myExperiment publishes all its public data as RDF at http://rdf.myexperiment.org/. RDF has a very simple subject-predicate-object (triple) structure that facilitates linking-up but these relationships can be formalized using a meta-structure provided by a schema or ontology. By formalizing, additional information can be inferred rather than having to define it explicitly. myExperiment uses a modularized ontology set to provide its formalization (http://rdf.myexperiment.org/ontologies/). The myExperiment ontology is modularized to promote reuse with each module designed for a specific sub-domain, e.g. types of annotation/contribution, attribution and creditation, packs, experiments, etc. As well as promoting reuse, the myExperiment ontology reuses parts of other ontologies/schemas:  FOAF and SIOC for representing the social network  Creative Commons for contribution licenses.  Dublin Core for common metadata properties  OAI-ORE for representing packs 4.3 Ontology Modules Architecture Through reuse it is possible to make some sense of myExperiment data outside its domain, allowing data from different sources to be collated. By making the myExperiment ontology reusable its saves reinvention and allows similar projects to map their data in the same way. Significant effort is being given to representing experiments and the data they produce in such a way that their insights can be shared across multiple fields. The Scientific Discourse subgroup of the W3C's Health Care and Life Sciences (http://esw.w3.org/topic/HCLSIG/SWANSIOC) has been considering how to reconcile a number of ontologies that treat experiments as first class objects. A SPARQL endpoint is a way of providing this server-side linking-up. By collating all myExperiment's RDF data in a single data structure, known as a triplestore, it is possible to provide a very flexible querying interface using the Semantic Web querying language SPARQL. myExperiment's SPARQL endpoint (http://rdf.myexperiment.org/sparql) allows the execution of simple queries that could be performed by a RESTful API call with the appropriate parameters, such as returning all the workflows uploaded by a specific user. However, if you wanted to further restrict that to those you had also commented on, the RESTful API would not be able to do this without additional functionality being added. SPARQL queries are essentially trying to map networks where one or more of the nodes or links are unknown. myExperiment's RDF provides a listing of components (sources, sinks processors and links) for Taverna workflows; SPARQL provides a facitlity for searching for workflows where these components link up in a specific user-defined way. This makes it possible to find a workflow that is much more tailored to a searcher's requirements, which becomes ever more important as the number of workflows grow. Returning SPARQL results is an appropriate format if this data is to be used within a Semantic Web application, however this is often not the case. Results may need to be ported in a more generic way or understood by a real person. The SPARQL endpoint allows results to be exported as comma-separated values (CSV) or visualized as an HTML table. Other more specific use cases have also arisen, in particular a capability to represent SPARQL queries that return mappings (e.g. between users who are friends) as a matrix that can be exported as CSV. It is important to understand that although RDF and SPARQL are Semantic Web technologies this should not prevent them from working seamlessly with applications that are not. 5. TOWARDS RESEARCH OBJECTS AND E-LABORATORIES Sean’s words are in this section, also Jits’ words on biocatalogue In section 5 we discuss how myExperiment is a first step towards a general notion of Research Object, which captures aggregations of objects and also encompasses the other forms of data in myExperiment, and is the part of a greater vision of interoperable e-Laboratories. Our Research Objects are represented in RDF and we have developed a myExperiment ontology which uses Dublin Core metadata for research objects and FOAF for the social network information. To interwork with other repositories we adopted the Object Reuse and Exchange representation from the Open Archives initiative, which is based on named RDF graphs. We envisage that the scholarly publishing process will evolve to support this more general notion of scientific research object, which will facilitate reusable and reproducible research. What is the appropriate related work? 5.1 The e-Laboratory The e-laboratory or e-lab, is a set of integrated components that, used together, form a distributed and collaborative space for e-Science, facilitating the planning and execution of in silico experiments. An e-lab brings together people, materials and methods in order to support scientific investigation. Central to the idea of an e-lab is the Research Object (RO). Research Objects are conceptual objects that bring together resources used in scientific investigation. An RO is an aggregation of resources (data sets, analysis methods, workflows, results, people) that tells a particular story about an investigation, experiment or process. The Research Object captures key information about the lifecycle of the investigation (for example provenance information about analyses), facilitating reuse of results and repeatability of experiments. ROs are the work objects that are built, transformed and published in the course of scientific experiments. ROs form a common currency for e-lab infrastructure, with a particular e-lab being built from components and services that consume and produce ROs. ROs play a role both in driving the components and capabilities within an e-lab. For example, the flavour (see below) of an RO can be used within a workbench to determine appropriate visualisation methods for the contents of the RO. ROs are not simple internal to a particular e-lab platform, and they will also play a role in sharing/communicating not just between services and components within an e-lab, but also with other e-labs (or e-labs services). The layered approach (see below) is key to achieving this, with services able to remain agnostic in the face of information they do not understand. 5.2 Repeat, Reuse, Replay An RO tells a story about an investigation. It provides not just an aggregation of resources, but additional information about the purpose, reason or rationale for the aggregation. ROs thus capture the investigation process. As an example, consider researchers working with survey data (Obesity e-Lab). Analyses on a complex survey may be encapsulated in statistical scripts, but those scripts are not generally associated with the data extraction process. As a result, the steps required to, cut down and extract information from a data set before running the analysis can be forgotten. An RO that captures the entire process including hypothesis; data extraction; workflows used to analyse and transform data; and ultimately publication of results, supports the notion of /repeatability/, providing sufficient information for a third party to validate the investigation that has been performed. Another key characteristic of ROs is /reusability/. The myExperiment experience shows the benefits of publishing and sharing scientific workflows, where scientists can reuse and repurpose existing workflows to reduce the time to experiment (EXAMPLE). The use of ROs extends this further, bundling not just workflow but also data sets, intermediate results and hypotheses. Finally, ROs provide a facility for /replay/, allowing examination of the process undergone. Although related to repeatability, replay meets a different need. For example, replay of an RO can offer a coarser grained level of detail, showing the major steps undertaken during an experiment. (EXAMPLE of Shared Genomic's use of Coverflow as an example of "replay" functionality driven by RO content). In [Barend Mons], the problem of "knowledge burying" is highlighted, where knowledge about investigations or experiments is published in paper form, and text mining techniques are required to extract this knowledge, leading to inefficient transfer of information (See Figure burying). A view of "Research Object as publication", packaging and associating data, results and methods as part of the publication process helps to overcome some of these issues by ensuring that information and knowledge is not lost during that publication process. 5.3 myExperiment Packs Nascent Research Objects are already present in myExperiment in the form of /packs/ (or Encapsulated myExperiment Objects (EMOs) as referred to in earlier work). Packs allow the aggregation of heterogeneous content, and have been used to provide bundles of related workflows (along with additional documentation) for training purposes. Although packs support the collection of workflows with, for example, input data and service log invocations, the detailed relationships between the items, which forms an essential core of the Research Object notion is not explicitly represented. 5.4 A Layered Architecture The Open Archives Initiative's Object Reuse and Exchange (OAI-ORE) provides a standard for the description and exchange of aggregations of Web resources. Research Objects provide an extension of ORE, with RO specific schemas or ontologies providing vocabulary to be used within instantiations of ORE. An upper level RO Model describes the common characteristics across ROs (FOR EXAMPLE), while specific RO Domain Schemas describe extensions to that vocabulary which are relevant to particular domains (FOR EXAMPLE). This layered approach (See Figure RO-layers) allows components to offer general RO functionality while remaining agnostic as to the domain specific content. Use of emerging standards such as ORE also facilitates the use of third party components for management of objects (IF SUCH THINGS ACTUALLY EXIST). 5.5 Biocatalogue Bioinformaticians utilise a vast array of resources and tools through Web services. These provide well defined interfaces that can be accessed programmatically, regardless of platform or programming language, over the internet or a local network. In an effort to automate tasks, these services are often chained together in complex, networked flows that can be stored and reused. Web Services are seen to be “critical to the effective linking of biological data and computing resources to enable complex querying, computing, and analysis across disparate biological data and data sources” (Tools and Resources Strategy Panel feedback, 14th March 2007). Although the services available are well defined from a protocol and structural point of view, they are often poorly described in their usage and with very little documentation available on the internet. Much of the knowledge and information on the use of services is captured within the minds of the bioinformaticians and in the scientific workflows that utilise the services. There is a severe need to capture this knowledge and allow professional and community curation on it. BioCatalogue provides an actively curated and centralised registry of Web services. Scientists can use and rely on BioCatalogue as a starting point in discovering what services they can utilise in their work. BioCatalogue aims to support users in the whole lifecycle from service discovery to service usage. Whilst myExperiment is a repository of scientific research objects BioCatalogue is a registry, where rich and accurate descriptions; loose and semantic tagging; and in depth examples add the most value. Monitoring and recording service availability is another key factor in assessing the usability of web services and adds significant value to the discovery process. Social curation is a major challenge in an environment where many different types of stakeholders exist - from service providers, service developers, end users and professional curators. The design of the technical functionality requires careful understanding of the different tensions between and priorities of these stakeholders. We ensure that the BioCatalogue does not endorse any negative community feedback but that we facilitate quality and expert curation from users and professional curators. This requires not only highly usable technical features and interfaces but policies need to be in place to allow transparency in how we handle situations of libel and such. Much like myExperiment, BioCatalogue is also a Web service itself, allowing other applications and clients (such as the Taverna workflow workbench) to bring the functionality to their users. By integrating BioCatalogue with myExperiment and other tools that scientists use, we bring BioCatalogue intrinsically into the user's work patterns and daily tasks, at the point of usage. This allows BioCatalogue to really add value to the user. The development of the BioCatalogue service follows many of the same principles and methodologies of the myExperiment project and there is significant overlap between the two projects - from code sharing, collaborative design and regular joint meetings. The release cycle (driven by a perpetual beta process) and development environment match with myExperiment and we take the same user-driven design and development approach, with a dedicated "friends" community and a community liasion officer who manages this. 6. DISCUSSION, FUTURE WORK AND CONCLUSION The results of our study myExperiment users reflect a community in formation, whose members’ attitudes, intentions and expectations are diverse and evolving. As such, significant changes in attitudes, intentions and expectations are likely to occur as the activities of community members filter through the networks of relationships, reinforcing successful innovations and defining good practice. As yet, we have not found evidence of a major shift in research cultures and practices such that, for example, researchers are turning to unprompted and spontaneous collaboration through the sharing and re-use of artefacts. It is likely that, as we note above, it is simply too early for such behaviours to emerge. Rather, at present, a major motivation for researchers using myExperiment is to build ‘social capital’, among their peers. This, of course, does not mean that sharing and re-use will not emerge as significant features of myExperiment users’ behaviour. Indeed, there is some evidence of workflow re-use, and rather more evidence of a desire to re-use. A key inhibiting factor seems to be a lack of adequate documentation and, crucially, lack of understanding as to what constitutes good practice. Our findings suggest that researchers will take time to adapt and possibly to redefine current practices for producing and documenting scientific knowledge such that these new practices are compatible with the sharing and re-use of new kinds of knowledge artefacts such as workflows. They also suggest that the myExperiment community is prepared to grapple with these challenges. Of course, such changes are likely to become feasible only when individual and collaborative interests align. In the meantime, what is important is that myExperiment is successful in bootstrapping network effects and, in this respect, the signs are promising. ACKNOWLEDGEMENTS Please add people to this alphabetical list The design of myExperiment and Research Objects has been a collaborative exercise involving a large group of people including Mark Borkum, Les Carr, Simon Coles, Phil Couch, Catherine De Roure, Tom Everleigh, Paul Fisher, Jeremy Frey, Antoon Goderis, Matt Lee, Cameron Neylon, Savas Parastatidis, Meik Poschen, Marco Roos, Robert Stevens, Shoaib Sufi, Franck Tanoh, David Withers, Katy Wolstencroft. myExperiment is funded by the JISC Virtual Research Environments programme, Microsoft Technical Computing Initiative and EPSRC. REFERENCES Will be managed in endnote – just drop text inline or here for now so I know what they are! [Barend Mons] B. Mons. Which gene did you mean? BMC Bioinformatics 6(), p.142 2005 DOI: 10.1186/1471-2105-6-142 [1] Waldrop, M. Mitchell, “Science 2.0: Great New Tool, or Great Risk?”, Scientific American, Published online January 9, 2008 on http://www.sciam.com/article.cfm?id=science-2-point-0-great-new-tool-or-great-risk [2] Borda, Ann, et al. Report of the Working Group on Virtual Research Communities for the OST e-Infrastructure Steering Group. London, UK, Office of Science and Technology, 46pp. 2006. [3] Gil, Y., Deelman, E., Ellisman, M. et al. “Examining the Challenges of Scientific Workflows”. IEEE Computer 40(12): 24-32. 2007. [4] De Roure, D., Goble, C. and Stevens, R., “Designing the myExperiment Virtual Research Environment for the Social Sharing of Workflows,” IEEE International Conference on e-Science and Grid Computing, pp.603-610, 10-13 Dec. 2007 [5] De Roure, D., Goble, C. and Stevens, R.. “The Design and Realisation of the myExperiment Virtual Research Environment for Social Sharing of Workflows”, Future Generation Computer Systems, published online July 2008. [6] De Roure, D. and Goble, C. Six Principles of Software Design to Empower Scientists. IEEE Software. In Press, 2009. [7] Oinn, T., Greenwood, M., Addis, M.et al. “Taverna: lessons in creating a workflow environment for the life sciences,” Concurrency and Computation: Practice and Experience 18, 10 Aug. 2006, 1067-1100. [8] Lin, Y., Poschen, M., Procter, R. et al. “Agile Management: Strategies for Developing a Social Networking Site for Scientists,” in 4th International Conference on e-Social Science, 18-20 June 2008, Manchester, UK. [9] Neylon, C. Openwetware blog. See http://blog.openwetware.org/scienceintheopen/ [10] Goderis, A., De Roure, D., Goble, C., Bhagat, J., Cruickshank, D., Fisher, P., Michaelides, D. and Tanoh, F. “Discovering Scientific Workflows: The myExperiment Benchmarks,” IEEE Transactions on Automation Science and Engineering . (Submitted 2008) [11] O’Reilly, T. What is Web 2.0? http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

Accelerating Time to Experiment: Sharing Methods

Related documents

Products

Support

Accelerating Time to Experiment: Sharing Methods

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib