A portal interface to myGrid workflow technology
Stefan Rennick Egglestone a, M. Nedim Alpdemir b, Chris Greenhalgh a, Arijit Mukherjee c and Ian Roberts d
a. School of Computer Science and IT, University of Nottingham
b. School of Computer Science, University of Manchester
c. School of Computing Science, University of Newcastle upon Tyne
d. Department of Computer Science, University of Sheffield
Abstract
Workflow technology previously developed by the myGrid project has been used to automate a number of complex bioinformatics analyses, allowing much larger volumes of
data to be produced than would be possible if these analyses were to be performed manually. myGrid workflows can be constructed using the Taverna workbench, and the analysis
specified by such a workflow can be performed through the use of a workflow enactment
engine.
This paper describes the design and implementation of a simple user-interface which aims
to support the use of workflow technology by providing web-based access to workflow and
results management facilities. This interface has been constructed using the Gridsphere
portal framework, and makes use of storage facilities provided by the myGrid Information
Repository.
1 Introduction
A rich selection of computational resources is
available to scientists working with biological data.
It is common for these scientists to wish to perform composite analyses involving multiple resources, and the myGrid project has developed
workflow-based middleware which allows the performance of such analyses to be automated. Development of this middleware has focussed on
providing support for professional bioinformaticians, who are likely to be expert computer users,
and who may wish to automate complex analyses
involving the use of large numbers of resources.
However, workflow techniques may also be useful for experimental biologists, who are often less
expert in computing, and who may wish to automate simpler analyses.
Members of this second group of users are unlikely to wish to construct their own workflows,
but may wish to use workflows which have been
constructed by other, more expert users. To support this scenario, the myGrid project has developed the myGrid Portal Interface (MPI), which
is a simplified, web-based interface to myGrid
workflow enactment and storage technology. The
MPI provides functionality which allows groups
of users to share and enact a number of workflows,
and to browse data which has been produced by previous workflow enactments.
This paper describes the design and implementation of the MPI, and is structured as follows: Section 2 describes the use of workflow technology
to automate biological analyses, and motivates
the development of the MPI. Section 3 describes
general details of the design and implementation
of the MPI, and Section 4 describes a number of
interesting features of this design in more detail.
Section 5 then discusses improvements that could
be made in the design of the MPI, and in particular those that might be made to support the requirements of large groups of users who wish to
publish many workflows.
2 Motivation for the development of the MPI
2.1 A categorization of users of bioinformatics resources
A wide variety of resources are available which
support the storage, retrieval and processing of
biological data. These resources are often publicly- and freely-available, and have become essential
tools for scientists who work with this type of
data.
Previous work [19] has identified a group of users
of these resources who can be given the label of
professional bioinformatician. Members of this
group of users rarely gather new data experimentally. Instead, they mine storage resources for
data which has been submitted by other biologists, from which they generate new knowledge
by the application of complex combinations of
data processing operations. The myGrid project
has focussed on providing support for members
of this group of users, by providing workflow-based middleware that allows these types of complex analyses to be automated.
However, dialog with the user community has revealed another group of users of distributed resources. Members of this group are primarily biologists who gather new data by the application
of experimental techniques. They are likely to
submit this experimental data for archiving in a
storage resource, and are also likely to perform
simple analyses of this data using a small number
of distributed processing facilities. Members of
this group of users are in general not expert in the
use of computational techniques, and are unlikely
to construct workflows to automate their analyses. They may, however, be willing to use workflows constructed by other, more expert users.
The rest of this section introduces existing myGrid
workflow technology, explains why it is less likely
to be used by members of this second group, and
motivates the provision of a simplified interface
to this technology which may be useful to members of this group.
2.2 myGrid workflow technology
Existing myGrid technology allows a complex analysis involving multiple distributed resources to be expressed as a workflow. A workflow which has been constructed to represent an analysis consists of a description of the inputs required by the analysis, the resources used by the analysis, and the results that performing the analysis would generate. Such an analysis can be performed by the use of a workflow enactment engine. To support the use of workflow technology, the myGrid project has developed the Taverna workflow workbench [18, 9], which is a graphical user interface that provides functionality to construct and enact workflows, and to store both workflows and data produced by enactments to the local file system. myGrid workflow technology, including Taverna, has been successfully used in a number of research projects, including investigations into Williams-Beuren Syndrome [17] and Graves' Disease [15].
Although workflow technology has been proven to be useful to these projects, it has been found that the construction and maintenance of workflows is a difficult process, which requires extensive expertise in computing and bioinformatics. As such, members of the second group of users introduced in subsection 2.1 above, who are not expert users of computer technology, are unlikely to wish to construct workflows to automate their analyses. However, since many of the analyses that these users perform are simple and standardized, it may be that the enactment of a standard set of workflows which have been constructed by other, more expert users may provide a useful automation mechanism for them.
Since members of this group of users are less expert in computing, it may be unreasonable to expect them to download and install the Taverna workflow workbench and to learn the interface that it provides. Although this interface is well-designed, it does aim to support both workflow construction and enactment, and is hence more complex than an interface which solely provided workflow enactment facilities could be. As an alternative to Taverna, the myGrid project has developed a simplified, portal-based interface to myGrid workflow enactment technology called the myGrid Portal Interface (MPI), the design of which has been targeted at this group of users. This interface is used through a standard web-browser, access to which is ubiquitous amongst the user community, and provides facilities for the storage and retrieval of both workflows and enactment data. The rest of this paper describes and evaluates the implementation of the MPI.
3 Basic design details
3.1 Access control and data sharing
Discussions with potential users of the MPI have
revealed that many work in small groups, which have
a relatively fixed membership and which are localized to one organization. These groups often
consist of a number of scientists with similar research interests. It is common for users in these
groups to perform similar analyses on similar sets
of input data, and to wish to share any results
which are produced by the performance of these
analyses. They may, however, wish to restrict access to this data to members of the group. Currently, they normally use local or networked file
systems for data storage, and share data by the
use of shared network areas or email.
The design of the initial prototype of the MPI has
focussed on support for small, static groups such
as these, rather than for larger, cross-organizational
and more dynamic groupings. This has had the
advantage of simplifying security and storage requirements, meaning that rapid development of
the initial prototype has been possible. It has
been assumed that one MPI installation will be
made per group of users, and that all members of
this group will be allocated an account in this installation, which will be protected by a username
and password. The design of the MPI supports a
coarse level of sharing of data, in which all members of a group share access to all workflows and
enactment data which have been published in a
particular installation.
3.2 Selected details of the interface design
The MPI provides user account administration, workflow management, workflow enactment and results management functionality. This paper, due to considerations of space, does not attempt to describe all of these in detail, but instead introduces a small number of features of the MPI user interface design.
3.2.1 Workflow management functionality
Users can upload workflows into an MPI installation by providing details of a file on their local file system into which the workflow has been saved. Once a workflow has been uploaded, it can be grouped with other workflows into a collection. Figure 1 below shows how the list of available collections is presented to an MPI user.
Figure 1: A list of workflow collections
Hyperlinks labelled view can be used to view a list of the workflows in a given collection. Workflows can be added to this list, deleted from this list, and the details of workflows in the list can be viewed.
Hyperlinks labelled delete can be used to delete a given collection, thereby deleting all workflows which have been added to that collection. The button labelled add new can be used to add a new, named collection, and the buttons labelled load new and reload all are used to synchronize this page if other MPI users have modified the list of collections.
3.2.2 Workflow enactment functionality
Users can choose to enact a workflow in a collection. A user who chooses to enact a workflow is presented with a form which gathers the input parameters necessary for this enactment. An example of such a form is shown in figure 2 below, which has been generated for a workflow which has two inputs labelled in0 and in1.
Figure 2: An input form for a workflow with two textual inputs
Once a user has started an enactment, they can monitor its progress. After the enactment has completed, an enactment summary page is generated. An example of such a summary page is shown in figure 3 below.
Figure 3: An enactment summary page for a workflow enactment
This summary page contains hyperlinks which allow a user to browse both the input parameters which were provided to start an enactment, and the output parameters which were produced by this enactment. These parameters can consist of items of text, HTML or images. Figure 4 below shows an example of an image which has been produced by an enactment, and which is being rendered by the MPI.
Figure 4: An image produced by enacting a workflow
4 Selected implementation issues
4.1 Technology choices for implementation
Necessary features of portal interfaces, such as
login systems, are difficult to implement securely
and reliably. However, a number of portal interfaces exist which provide such generic features.
These portal interfaces are intended to be frameworks into which custom applications can be deployed.
Figure 5: Basic architecture of the MPI
Gridsphere [20, 2] is an example of an existing
portal framework, which has been used during
development of the MPI. Gridsphere has been used
in a number of eScience projects, including GeneGrid [16] and HPC-Europa [3], and seems to be a
lightweight and reliable choice. The MPI inherits its login system and administration tools from
Gridsphere, and takes advantage of layout features that it provides. The MPI application itself
has been developed to conform to the JSR-168
portlet standard [5] which attempts to define a
standard mechanism for specifying portal-based
web-applications. This standard is supported by
a number of other portal frameworks, including
uPortal [10] and Jetspeed-2 [4], meaning that it
should be possible to deploy the MPI into these
portal frameworks with minor modifications to a
number of configuration files.
The MPI takes advantage of storage facilities provided by the myGrid Information Repository (MIR),
which has previously been developed as part of
the myGrid project. This is a web-service [12]
which provides specific support for the storage
and retrieval of workflows and enactment data,
and which uses a relational database as a backend. The MIR web-service interface is defined
using the Web Service Description Language (WSDL)
[11] and is invoked using the Simple Object Access Protocol (SOAP) [7]. The MIR does not
currently provide any form of role-based access
control to data, but can be protected using HTTP
basic authentication [6] which requires the provision of a valid username and password with every
service invocation.
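As a brief illustration of what HTTP basic authentication [6] requires of a client, the following minimal sketch builds the Authorization header that must accompany each service invocation. The function name is hypothetical and is not part of the MPI or the MIR; only the header format itself is taken from RFC 2617.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP basic authentication.

    The credentials are joined with a colon and base64-encoded, as
    described in RFC 2617. The function name is an illustrative assumption.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Example credentials taken from RFC 2617 itself.
headers = basic_auth_header("Aladdin", "open sesame")
```

Because base64 is an encoding rather than encryption, the credentials are only as well protected as the transport over which they are sent.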
4.2 Basic architecture
Figure 5 above shows the basic architecture supporting the MPI.
A user accessing the MPI through their web-browser
must first log in through Gridsphere, providing a
username and password. These login details are
authenticated against the Gridsphere security database. Once a user has logged in, any requests that
they make for MPI web content are forwarded
by Gridsphere to the MPI web-application, which
generates relevant web-pages. This may require
the retrieval of data from the MIR, in which case
information retrieval facilities provided by the MIR
web-service interface are used. Alternatively, generated web-pages may define a form which will
be used to gather information to be added to the
MIR.
If an MIR protected by basic HTTP authentication is being used, the MPI installation must be
statically configured with a single username and
password to be used during MIR access.
4.3 Caching of data from the MIR
Sets of results generated by enacting workflows
can be large. For example, the enactment of one
particular workflow used as a test case for the
MPI regularly generates 40 MB of data. If the MPI
and MIR installations are hosted on different machines, the cost of transferring this data on demand over a network can be excessive.
To attempt to reduce the demand on network resources, the MPI constructs a write-through cache
per logged-in portal user. This cache is constructed
when a user logs in to the portal, and is destroyed
either when they log out or when their portal session times out. All communication with the MIR
is made through this cache. An item of data is
downloaded into the cache from the MIR when
first requested, from where future requests for
this entity may be served directly. Any entities
added to the MIR by the MPI are added through
the cache, and are then available without ever
having to be downloaded.
It could be dangerous, however, to cache all items
of data indefinitely. This might cause the memory requirements of the cache to expand excessively until a memory overrun was caused. Instead, a cache discard policy must be supplied to
selectively drop entities from the cache to restrict
its memory usage. Designing this discard policy to reflect common MIR data access patterns
is important to maintain a useful level of cache
performance.
Currently, a simple discard policy has been implemented. This is based on the observation that
the only items of data which can consume a significant amount of memory are those which relate to the enactment of a workflow. The amount
of memory consumed by other items is relatively
small. As such, the discard policy specifies a
maximum number of sets of enactment data that
can be cached. If a request is made for a set of
enactment data which is not in the cache, and if
the maximum number of sets of data are already
cached, then the set of data which was least recently accessed is dropped from the cache to free
up space, into which the newly requested set of
data can be downloaded.
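The caching behaviour described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the MPI implementation: the class names, the get/put interface of the store and the default limit of three sets of enactment data are all hypothetical, and only sets of enactment data are modelled, since other items are described as relatively small.

```python
from collections import OrderedDict


class InMemoryStore:
    """Hypothetical stand-in for a client of the MIR web-service."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.fetches = 0  # counts downloads, to show when the cache is bypassed

    def get(self, key):
        self.fetches += 1
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value


class EnactmentCache:
    """Write-through cache that holds a bounded number of enactment data sets."""

    def __init__(self, store, max_enactments=3):
        self._store = store
        self._max = max_enactments
        self._items = OrderedDict()  # ordered from least to most recently accessed

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently accessed
            return self._items[key]
        value = self._store.get(key)  # first request: download from the store
        self._insert(key, value)
        return value

    def put(self, key, value):
        self._store.put(key, value)  # write through to the backing store
        self._insert(key, value)  # then keep a copy, so no later download is needed

    def cached_keys(self):
        return list(self._items)

    def _insert(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # drop the least recently accessed set


store = InMemoryStore({"e1": "a", "e2": "b", "e3": "c"})
cache = EnactmentCache(store, max_enactments=2)
for key in ("e1", "e2", "e1", "e3"):
    cache.get(key)
```

In this usage example the third request is served from the cache, and the fourth request causes the set e2 to be dropped, because e1 was accessed more recently.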
This discard policy is simple, and may not be
good enough in some situations. However, it has
so far proven to be effective in a typical server
environment. The simple cache policy is aided
by the use of a per-user cache that is destroyed
when a user logs out, thereby freeing up space in
memory.
4.4 Rendering of enactment data
Workflows that can be enacted in the MPI are
limited to those that accept plain text values as
inputs, and which produce plain text, images or
HTML as results. Users can view inputs and results of a particular enactment by using the hyperlinks on a summary page for an enactment, as
shown in figure 3 above, to navigate to a page
capable of rendering the chosen item of data.
Different types of data are rendered using different HTML elements. Plain text is rendered inside
an HTML <TEXTAREA> element. Images are rendered
using an HTML <IMG> element, whose src attribute points to a servlet capable of rendering the
image.
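The choice of HTML element for each type of data can be sketched as a simple dispatch on the type of the item. The sketch below is illustrative only: the use of MIME-style type labels, the function name and the example servlet URL are assumptions, and HTML results are passed through without the hyperlink handling that they may also require.

```python
import html


def render_item(content_type: str, text: str = "", image_url: str = "") -> str:
    """Return an HTML fragment for one item of enactment data.

    content_type uses MIME-style labels, which is an assumption made for
    this sketch; image_url stands for the address of a hypothetical servlet
    that serves the image bytes.
    """
    if content_type == "text/plain":
        # Escape the text so that it cannot be interpreted as markup.
        return f"<TEXTAREA>{html.escape(text)}</TEXTAREA>"
    if content_type.startswith("image/"):
        return f'<IMG src="{html.escape(image_url, quote=True)}">'
    if content_type == "text/html":
        return text
    raise ValueError(f"unsupported content type: {content_type}")
```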
HTML produced by workflows can be slightly
more difficult to render. One feature of the current myGrid system is that the enactment of a
workflow can generate HTML containing hyperlinks to other items of data produced during the
same workflow enactment. Figure 6 below shows
an example of such a hyperlink.
<A HREF="urn:lsid:lsdocument:A3434355">link text</A>
Figure 6: Example of a hyperlink to an item of
enactment data
The string urn:lsid:lsdocument:A3434355 is an internal identifier which has been
assigned to an item of data produced during the
same workflow enactment. The MPI renders such
HTML by replacing this identifier with the URL
of an MPI web-page capable of rendering this
item of data. This mechanism is useful, because
it means that workflows can be modified to produce not just items of data but also HTML visualizations which refer to these items of data. As
an example, the image shown in figure 4 above
is involved in an HTML visualization. As part of
the same enactment, HTML was produced containing
a <MAP> element which defines which
regions of this image should be clickable. In this
case, the clickable regions are the small vertical bars which represent interesting features of a
DNA sequence. Users can click on one of these
vertical bars to reveal further textual information
about this feature.
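The replacement of internal identifiers with portal URLs can be sketched with a regular expression. This is a minimal illustration under stated assumptions: the viewer path /mpi/view and its id parameter are hypothetical, and the pattern only handles double-quoted HREF attributes whose value begins with urn:lsid:.

```python
import re
from urllib.parse import quote

# Matches a double-quoted href whose value is an LSID-style identifier.
LSID_HREF = re.compile(r'(href\s*=\s*")(urn:lsid:[^"]+)(")', re.IGNORECASE)


def rewrite_lsid_links(html_text: str, viewer_url: str = "/mpi/view") -> str:
    """Replace LSID hyperlinks with links to a (hypothetical) viewer page."""

    def replace(match):
        # Percent-encode the identifier so that it is safe in a query string.
        lsid = quote(match.group(2), safe="")
        return f"{match.group(1)}{viewer_url}?id={lsid}{match.group(3)}"

    return LSID_HREF.sub(replace, html_text)


rewritten = rewrite_lsid_links('<A HREF="urn:lsid:lsdocument:A3434355">link text</A>')
```

Hyperlinks which do not refer to enactment data, such as ordinary http links, are left unchanged by this sketch.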
5 Discussion
The MPI which has been described in this paper
is the first working prototype of a portal which allows the publishing and enactment of pre-written
workflows. Implementation has been simplified
by focusing the design on support for users who
work in small groups, and who wish to publish
only small numbers of workflows. It is hoped
that even with these constraints on design, the
MPI will still be useful to much of the potential
user community. However, it may be the case
that some users do work in larger, more dynamic
groupings, and this section attempts to describe
how this design might be improved to support
these types of user.
5.1 Improving mechanisms for locating workflows
As the number of users of the MPI increases, it
is likely that the number of workflows that have
been uploaded into it will also increase. Of course,
even a small group of users could upload a substantial number of workflows. Presenting available workflows in one large list can make it difficult for a user to locate a particular workflow
that they are looking for. The MPI attempts to
provide a limited hierarchical structure to make
the management of workflows easier by allowing
named workflows to be grouped into named collections. However, for larger numbers of workflows this one-level hierarchy may be too rigid.
One simple solution to this problem would be to
add extra levels of nesting to the hierarchy, or
to allow an arbitrary number of hierarchical levels, much like directories in a file system. This
approach is already supported by the MIR, so
would just involve changes to the interface design of the MPI. However, making these changes
would necessarily increase the complexity of the
interface.
An alternative approach would be to allow the
visibility of workflows to be limited so that users
only see workflows that they are interested in.
This might be achieved by simply marking a workflow as only being visible to the user that uploaded it. Sharing of workflows could still be
supported by allowing the status of a workflow to
be changed so that it becomes globally visible. A
more advanced design might provide finer-grained
role-based access control (RBAC), which might
allow users to limit the visibility of workflows to
members of dynamic subsets of users of a particular MPI installation.
Implementing these features in an interface would
require a backend storage device that supports
RBAC. The current MIR implementation does not
include any such support, but future projects intend to add this. An alternative might be to use
the Storage Resource Broker (SRB) [14, 8], a remote file store which provides the organization
of users into dynamic groupings, and which
provides a number of flexible access control methods.
The provision of some form of RBAC would also
be dependent upon the existence of a security architecture that supports the authentication of users.
Ideally, this security architecture would allow a
user to authenticate themselves through the Gridsphere login system. They might then be granted
a temporary token which would allow them to authenticate using a security mechanism provided
by the chosen storage solution. A candidate technology for this security architecture is the Grid
Security Infrastructure (GSI) [13]. Gridsphere
provides support for GSI through the GridPortlets
application [1] and SRB allows users to authenticate themselves via GSI. Exactly how to use
these facilities to construct a security architecture
is an issue for further research.
An alternative mechanism which might allow workflows to be located more easily is the provision of features which allow users to construct simple queries for workflows. Users might
be allowed to search for workflows, for example,
which they had uploaded in the last week, which
had inputs or outputs with a given name, or which
used a particular service. Again, the provision of
a querying mechanism would be dependent upon
support for querying in the backend data store.
The current MIR uses a database for storage, and
queries over this database could be constructed in
SQL. If SRB were to be used as a replacement,
then this also provides facilities for the construction of queries over metadata attached to individual files and directories.
5.2 Caching
The present portal prototype constructs a cache
per user, which has a simple discard policy and
which is destroyed when the user logs out.
As more users are added to the portal interface,
this approach becomes less efficient, as it becomes
more likely that multiple users will cache the same
item of data. This means that with a large number
of users, even a well-designed per-user discard policy might not be sufficient to avoid memory overruns. Destroying a cache that has been
created for a user when this user logs out is also
inefficient, as regularly accessed data will have
to be quickly reloaded into a new cache when the
user logs in again.
It seems that a better caching strategy would maintain one cache for all portal users, and would allow this
cache to persist even when users log out. Design of a discard policy for this cache might be
difficult, and should aim to avoid penalizing individual users of the portal interface by dropping
their data too regularly whilst ensuring that excessive memory usage is avoided.
6 Conclusion
This paper has described the design and implementation of an initial prototype of a simple, portal-based interface to myGrid workflow enactment
and storage technology. This interface has been
designed to allow small groups of users to share a
set of workflows, which they can enact, producing data which is also shared. It is hoped that this
interface will be a useful addition to the existing
selection of myGrid software.
At the time of writing, this interface has not yet
been trialled by a significant number of members
of the user community. It is hoped that such trialling will take place, and that this will lead to
improvements in the interface design.
7 Acknowledgements
The authors would like to acknowledge the assistance of the whole myGrid consortium. This
work is supported by the UK e-Science programme
EPSRC grant GR/R67743.
References
[1] GridPortlets application home-page (Verified 28/06/2005). http://www.opengridportals.org/space/Toolkits/GridPortlets.
[2] Gridsphere portal home-page (Verified 28/06/2005). http://www.gridsphere.org/gridsphere/gridsphere.
[3] HPC-Europa home-page (Verified 28/06/2005). http://www.hpc-europa.org/.
[4] Jetspeed-2 home-page (Verified 28/06/2005). http://portals.apache.org/jetspeed-2/.
[5] JSR-168 portlet interface specification (Verified 28/06/2005). http://www.jcp.org/en/jsr/detail?id=168.
[6] RFC 2617 - HTTP Authentication (Verified 28/06/2005). http://www.faqs.org/rfcs/rfc2617.html.
[7] Simple Object Access Protocol specification (Verified 28/06/2005). http://www.w3.org/TR/soap/.
[8] Storage Resource Broker home-page. http://www.sdsc.edu/srb/.
[9] Taverna workflow workbench home-page (Verified 28/06/2005). http://taverna.sourceforge.net/.
[10] uPortal home-page (Verified 28/06/2005). http://www.uportal.org/.
[11] Web Services Description Language specification (Verified 28/06/2005). http://www.w3.org/TR/wsdl.
[12] Web-Services specification (Verified 28/06/2005). http://www.w3.org/2002/ws/.
[13] W. Allcock, A. Chervenak, I. Foster, L. Pearlman, V. Welch, and M. Wilde. Globus toolkit support for distributed data-intensive science. Computing in High Energy and Nuclear Physics, 2001.
[14] Arcot Rajasekar et al. Storage Resource Broker - managing distributed data in a grid. Computer Society of India Journal, Special Issue on SAN, 33(4):42–54, 2003.
[15] Matthew Addis et al. Experiences with e-science workflow specification and enactment in bioinformatics. pages 459–467. UK e-Science All Hands, 2003. ISBN 1-904425-11-9.
[16] P. Donachy et al. Genegrid: Grid based virtual bioinformatics laboratory. UK e-Science All Hands, 2003.
[17] R. Stevens et al. Exploring Williams-Beuren Syndrome using myGrid. pages i303–i310. 12th International Conference on Intelligent Systems in Molecular Biology, 2004. Published in Bioinformatics Vol. 20 Suppl. 1.
[18] Tom Oinn et al. Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
[19] Carole Goble, Stephen Pettifer, Robert Stevens, and Chris Greenhalgh. Knowledge integration: In Silico experiments in bioinformatics. The Grid: Blueprint for a New Computing Infrastructure, pages 121–134, 2003.
[20] Jason Novotny, Michael Russell, and Oliver Wehrens. Gridsphere: An advanced portal framework. EUROMICRO Special Session on Advances in Web Computing, 2004.