
CBR based Workflow Composition Assistant
Eran Chinthaka, Jaliya Ekanayake
Computer Science Department, Indiana University
Bloomington, Indiana.
Workflows are becoming an integral part of businesses. But composing a workflow can take a considerable amount of time, even when the author knows what the workflow needs to achieve. Composition is even harder when there are multiple services to choose from and the user has specific requirements for inputs and outputs. When workflows that are the same as or similar to the current requirements have already been composed, either by the same user or by someone else, most workflow environments offer no way to reuse them. Users might also be interested in adding explanations or semantics describing the functionality of workflows and the services within them, but most of the specifications for adding semantics to workflows are hard for ordinary users to understand. We propose a CBR (Case-Based Reasoning) approach that lets users exploit existing services and workflows, and we present an easier and extensible way for users to express their requirements to the system.
1 Introduction
Users compose workflows to accomplish a task by aggregating existing services. Most of the time a novice user has a set of inputs and an expected output in mind and tries to get from those inputs to the expected output using the services available. Even though this seems trivial, carrying out the composition can be difficult for several reasons.
There can be many services to choose from, and the user might need to search all or most of them.
Understanding the semantics of each service well enough to build the best workflow requires some level of knowledge.
Services might have been written using service description standards that the user is not familiar with. For example, WSDL (Web Services Description Language), which is used to define a Web service, might not be familiar to the user.
The user might have trouble converting what he or she really expects into the workflow language required by the workflow composer.
There are a couple of ways the user can tackle these problems. One good option is to first look at existing workflows and see whether some components can be taken from them. These existing components will either accomplish the whole task or help to achieve sub-parts of it.
This is analogous to what programmers do when developing software. Given a problem, they might first try to find existing software or code segments that carry out the same task, or they use existing components as sub-components within their own program.
Applying the same concept to the workflow domain, workflow authors might have already composed workflows that can be used as sub-components of the new workflow. There can also be workflows written by other developers and hosted in public places like myexperiment.org.
Another common problem with workflow composition is that workflow authors might have difficulty expressing their requirements to the system. Even though various semantic languages [1][2][3] have been proposed for these situations, some users might not be conversant with them. The best option for these users might be to let them express their requirements using simple keywords. For advanced users, however, the system should also support expressing the requirements in a given semantic language.
We propose a case-based reasoning approach to assist users during workflow composition. Our system helps users find existing workflows that fit their requirements, or components that they can fit into a bigger workflow. It also enables users to express their requirements using a keyword-based approach, and it can easily be extended to work with a known semantic description language.
The next section discusses related work in this area. Section 3 explains the proposed solution to the problems outlined above. Section 4 discusses the key features of our system. Finally, Section 5 outlines future work that can improve the proposed system.
2 Related Work
Various efforts have been made to help users compose workflows. Ambite and Kapoor [4] proposed a system that automatically composes workflows from a given set of requirements. Patrick et al. [5] discussed a constrained object model for workflow composition, based on a metamodel for workflows and ontologies for processes and data flows. These approaches use various search strategies to produce a workflow for the user automatically. There are a couple of problems with them. First, it might take a considerable amount of time to find a solution, as this is a brute-force search with some heuristics. The time taken can also grow with the size of the search space, so these approaches might not perform or scale well as the number of services increases. There should be better ways of handling automatic composition when the search space is large. In addition, these systems are not flexible enough to handle different ways of expressing the user’s inputs.
Kim et al. [6] describe an approach that helps users during workflow composition by analyzing the composed workflows. Their CAT system notifies users of various errors and suggests ways to overcome them. They used planning techniques for better construction of workflows.
There are also efforts [7] to help users compose workflows by introducing domain-specific ontologies. These approaches suggest services after each component so that users can wire their existing services into them.
But these systems can suggest at most one component at a time and are not generic enough to be used across different domains.
Figure 1. Inputs to the system are a set of keyword lists specifying the inputs and the output of the expected workflow. [single output requirement]
3 Proposed Solution
We developed a solution to the above problem using a CBR approach. Our case base contains existing workflow descriptions in which each input and output of the services inside the workflow is annotated with keywords. The user specifies, using keywords, the inputs and outputs of the expected workflow or of the part of a workflow that he or she wants to find. The system retrieves the possible matches by considering both the keyword information and the structural information of the workflows stored in the case base. Once the user specifies the search keywords, as shown in Figure 1, the system can match them to
a service,
a part of a workflow or
a complete workflow.
To elaborate on this, consider the workflow shown in Figure 2. Each box drawn with dashed lines could be a possible match for the user’s inputs and outputs.
X - User inputs and outputs are matched to an entire workflow
Y - User inputs and outputs are matched to a single service
Z - User inputs and outputs are matched to a part of the workflow
Figure 2. Matching user’s search criteria to parts of workflows.
As mentioned earlier, the system uses the structural information of the workflow when calculating the similarity to the user’s search criteria. We therefore consider only acyclic workflows, and the search is based on the following lemma.
“All acyclic workflows can be modeled as a graph”
This enables us to reduce the structural search to a graph search and solve the problem at hand.
Figure 3. System operations and the decision points. [Flowchart steps: input keyword lists and output keywords; keyword similarity check (Lesk); check if paths exist from input nodes to output nodes; check the self-containedness of the selected nodes; try to adapt using existing workflows.]
Depending on the user’s requests and the cases stored in our case base, it is not always possible to find a direct match for the user’s input and output requirements. Therefore, we have incorporated a simple case adaptation step, which enhances the results of the search. If we cannot find the required set in a single workflow, we try to merge the closest workflow with the services in our knowledge base to get the perfect match. For example, the user might want the results as an HTML page, but the workflow we have might output an XML file. In this case we can connect our workflow to an XML-to-HTML converter service and fulfill what the user expects. The overall operations performed by our system are shown in Figure 3.
Keyword based Similarity Check
We use keyword similarity to match the user’s input and output descriptions to the cases stored in our knowledge base. Before a workflow is stored in the case base, the user is required to annotate the inputs and outputs of each service in the workflow with relevant keywords. Currently we place no restrictions on the selection of keywords or on the number of keywords used to annotate each input. However, we believe that a domain-specific ontology will increase the effectiveness of the search.
The keywords used to annotate the inputs and outputs should describe the “type” of the input or output the service is expecting, as well as the key features of that input or output. For example, consider service D in Figure 2. Assume that service D’s task is to identify a particular set of proteins for a given set of species from a set of relevant documents and a description or a model of the species. A possible set of keywords for the two inputs and the output of service D looks like
<input>proteins, description, document type</input>
<input>genomics model, species, type of the input</input>
<output>proteins, entities, discovered, output type</output>
The user specifies the number of inputs and outputs of the portion of the workflow that he or she needs to find, along with a set of keywords for each input and output. The user-specified inputs are matched against the inputs of all the services in a given workflow. A match can be made using one or more services as the input services if all the inputs of the selected services match the input descriptions given by the user. Figure 4 elaborates this scheme of matching the user’s inputs to the inputs of the services.
The services in a workflow can produce multiple outputs. However, for a search, the user should specify only one output using keywords. This simplification is only an implementation optimization; if the user expects multiple outputs, multiple searches can be used to find the expected result. The user-specified output is likewise matched against all the outputs of the services in a given workflow. The inputs and the output can be matched to different services inside the workflow, and hence the system should be able to determine whether the selected services are connected. We also check the self-containedness of the selected services, for which we use the structural information.
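The input-matching rule (a group of services qualifies only if all of its inputs match the user’s input descriptions, with nothing left over) can be sketched in Python. This is a minimal illustration under stated assumptions, not the system’s implementation: the `Service` class, the function names, and the `overlap` score (a plain shared-keyword count standing in for the actual similarity measure) are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations, permutations


@dataclass
class Service:
    name: str
    inputs: list  # one set of annotation keywords per input


def overlap(a, b):
    # Hypothetical similarity: number of shared keywords.
    return len(set(a) & set(b))


def inputs_match(service_inputs, user_inputs, threshold=1):
    # Every service input must pair with a distinct user input;
    # a leftover input on either side rules the candidate out.
    if len(service_inputs) != len(user_inputs):
        return False
    return any(
        all(overlap(s, u) >= threshold for s, u in zip(service_inputs, order))
        for order in permutations(user_inputs)
    )


def candidate_input_services(services, user_inputs, threshold=1):
    # Try every group of services and keep those whose combined inputs match.
    found = []
    for size in range(1, len(services) + 1):
        for group in combinations(services, size):
            combined = [kw for svc in group for kw in svc.inputs]
            if inputs_match(combined, user_inputs, threshold):
                found.append([svc.name for svc in group])
    return found


user_inputs = [{"proteins", "description"}, {"species"}]
ok = [Service("Service 1", [{"proteins", "document"}]),
      Service("Service 2", [{"species", "model"}])]
extra = [Service("Service 1", [{"proteins", "document"}]),
         Service("Service 2", [{"species", "model"}, {"format"}])]
print(candidate_input_services(ok, user_inputs))
print(candidate_input_services(extra, user_inputs))
```

The first call reports Service 1 and Service 2 together as a candidate group; the second reports nothing, because Service 2 has an additional input that the user did not describe. The exhaustive enumeration is exponential in the number of services and is only meant to make the rule concrete.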
Figure 4. Matching user’s input descriptions to the inputs of services. [Two panels. In the first, Service 1 and Service 2 are possible candidates for the inputs specified by the user. In the second, Service 2 has an additional input, and hence these services cannot be considered as a possible match.]
Lesk Algorithm
We use a measure based on the Lesk algorithm [12] to check the similarity of the user’s keywords to the keywords inside our services. We first read all the keywords found within all the services in our system to find the frequency of each keyword. When we are given a set of keywords to be matched, we first find the overlap between the input keywords and the inputs of each service. We then weight the words in the overlapping set by their inverse document frequency. For example, if an overlapping word is found within 20 of the 50 services we have within workflows, we assign log(50/20) as the weight of this word. This weighting reflects the significance and uniqueness of the given word for the whole system.
We combine the weights of all the words in the overlapping set to come up with a total score. This Lesk-based measure is then compared against a threshold value to decide the significance of the similarity of this match.
Even though we have used the Lesk algorithm as the measure for keyword similarity, our system is flexible enough to integrate any simple or complex similarity measure to decide the similarity of cases. We used the Lesk algorithm as it was one of the easiest natural language processing techniques to find similarity.
Structural Similarity Check
Once a set of input services and an output service have been matched to the user’s input and output descriptions for a given workflow, the next task the system performs is to check the structure of the graph.
Check whether there exists at least one path from all the input services to the desired output service.
Check whether all the services encountered within these paths form a self-contained set of nodes in the workflow (graph).
The first check ensures that the selected inputs and outputs are actually part of a connected graph and not a disconnected set of nodes. The second check ensures that the collected set of nodes can be used without any additional inputs to achieve the user’s needs. In our system we store all the workflows as graphs, and hence performing the above two checks simply reduces to a graph search problem.
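The two structural checks can be sketched as a plain graph search. The code below is an illustration under assumptions, not the authors’ implementation: the workflow is represented as a dictionary mapping each service to its successors, the function names are hypothetical, and self-containedness is interpreted as "no selected service other than an input service receives data from outside the selected set".

```python
def reachable(graph, start):
    # All nodes reachable from `start` (including itself), depth-first.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def nodes_on_paths(graph, sources, target):
    # Check 1: every source must reach the target.
    # Returns the nodes lying on those paths, or None if some source fails.
    reverse = {}
    for node, successors in graph.items():
        for nxt in successors:
            reverse.setdefault(nxt, []).append(node)
    can_reach_target = reachable(reverse, target)
    on_paths = set()
    for src in sources:
        if src not in can_reach_target:
            return None
        on_paths |= reachable(graph, src) & can_reach_target
    return on_paths


def is_self_contained(graph, sources, selected):
    # Check 2: no selected non-source node has a predecessor outside `selected`.
    for node, successors in graph.items():
        for nxt in successors:
            if nxt in selected and nxt not in sources and node not in selected:
                return False
    return True


# Hypothetical workflow: A and B feed C, while C and E feed D.
workflow = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["D"]}
part = nodes_on_paths(workflow, ["A", "B"], "C")
print(part, is_self_contained(workflow, ["A", "B"], part))
whole = nodes_on_paths(workflow, ["A", "B"], "D")
print(whole, is_self_contained(workflow, ["A", "B"], whole))
```

In this example the sub-workflow ending at C passes both checks, while the one ending at D passes the path check but fails the second check because D also needs an input from E, which the user did not ask for.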
Case Adaptation
In our system, the keyword-based similarity check allows the user to specify a threshold for the keyword matching. This allows the user to relax or tighten the search criteria and hence control the recall. However, sometimes it might not be possible to find exact matches for the user’s request from the stored cases alone. As a solution, we incorporated a case adaptation step that is based on the following assumptions.
1. The case base stores descriptions of services that take one input and produce one output.
2. The adaptation considers only these services when constructing extensions to existing workflows or parts of workflows.
The case adaptation mechanism uses backtracking to adapt a workflow to the user’s request and has the following steps.
1. For each workflow, identify input services by running the keyword similarity check using the input descriptions specified by the user.
2. Identify services (from the services stored in the case base) whose output keywords are similar to the output description specified by the user.
3. For each such service (say service S), find a service in the selected workflow whose output description is similar to the input description of service S.
4. If there is a match, execute the direct search algorithm to see whether the identified input services have connecting paths to service S and whether the services in those paths form a self-contained set.
5. If such a path is found, add service S to the end of the identified workflow (or part of the workflow) and present the result to the user.
The above algorithm is not computationally efficient, as it requires searching all the stored cases against the selected workflows. Therefore we perform only a few iterations of it for adaptation.
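A simplified sketch of this adaptation procedure follows. It is illustrative only and rests on several assumptions: the input services are taken as already identified, `similar` is a hypothetical shared-keyword test standing in for the real similarity measure, the data layout (dictionaries for the graph, the outputs and the single-input, single-output services) is invented for the example, and only the path part of the direct search is performed, with the self-containedness check omitted for brevity.

```python
def similar(a, b, threshold=1):
    # Hypothetical keyword similarity: shared-keyword count against a threshold.
    return len(set(a) & set(b)) >= threshold


def reachable(graph, start):
    # All nodes reachable from `start` (including itself), depth-first.
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def adapt(graph, outputs, input_services, adapters, wanted_output):
    # graph: service -> list of successor services in the stored workflow
    # outputs: service -> keywords annotating that service's output
    # adapters: name -> (input keywords, output keywords), one input and one output
    # Returns (workflow service, adapter name) pairs: append the adapter after the service.
    results = []
    for name, (adapter_in, adapter_out) in adapters.items():
        if not similar(adapter_out, wanted_output):
            continue  # the adapter must produce what the user asked for
        for service, out_keywords in outputs.items():
            if not similar(out_keywords, adapter_in):
                continue  # the workflow service must be able to feed the adapter
            # Path check only: every input service must reach `service`.
            if all(service in reachable(graph, src) for src in input_services):
                results.append((service, name))
    return results


# Hypothetical stored workflow that ends in an HTML table.
graph = {"fetch": ["select"], "select": ["to_html"]}
outputs = {
    "fetch": {"protein", "records"},
    "select": {"protein", "structures"},
    "to_html": {"html", "table"},
}
adapters = {
    "html_to_xml": ({"html"}, {"xml", "file"}),
    "csv_to_pdf": ({"csv"}, {"pdf"}),
}
print(adapt(graph, outputs, ["fetch"], adapters, {"xml"}))
```

For a user who wants XML output, the sketch suggests appending the hypothetical `html_to_xml` service after `to_html`. The nested loops over services and workflow nodes make the cost noted above visible.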
The final output of a workflow is highly dependent on the specific problem the user is trying to solve. For example, a workflow created by one user to retrieve protein structures from an online database may use a service to convert the selected results into an HTML table for the final presentation. Another user may need the same workflow but want the output in some other format, such as an XML file, rather than HTML. In such situations, although the case base contains a near-ideal match, without adaptation the second user will not be able to reuse the workflow composed by the first. With our system, if the system can find a service that performs the HTML-to-XML conversion, it can adapt the stored workflow to the user’s requirement and suggest it to the user.
4 Key Features
There are a couple of novel features in our system.
We expect our system to be much faster than either manual composition or an automatic composition assistant. Manual techniques alone require a lot of time to figure out which services to use at each step. Having an ontology might reduce the total time and effort, but we do not expect it to beat our knowledge-based approach. Automatic workflow composition tools perform an extensive search over the set of services. Even if those systems are pre-indexed, adding and removing services poses challenges for them.
Our system does not rely on any particular workflow description language [8] [9] [10] [11]. We consider only acyclic workflows, which allows us to reduce a workflow in any description language to a graph. Our main algorithms work at the graph level, and this makes our system independent of any workflow description language.
Our system enables users to provide their input and output descriptions using keywords. For this we used a basic similarity measure based on the Lesk algorithm [12], as it was the easiest to use among the techniques for keyword matching and disambiguation. Similarity metrics are decoupled from the main system, making it easier to plug in a different context-matching algorithm. If the semantics are expressed using a known standard, plugging in similarity components that adhere to that standard will likewise be easy.
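As a rough illustration of a decoupled similarity component, the sketch below implements an inverse-document-frequency weighted overlap score in the spirit of the Lesk-based measure and passes it to the retrieval step as a replaceable function. The function names and data layout are hypothetical, the weights are summed to form the total score, and the natural logarithm is an assumption, since the base of the logarithm is not specified.

```python
import math
from collections import Counter


def document_frequencies(services):
    # For each keyword, count how many services mention it.
    counts = Counter()
    for keywords in services.values():
        counts.update(set(keywords))
    return counts


def lesk_score(query, keywords, frequencies, total):
    # Sum of log(total / frequency) over the keywords shared by query and service.
    shared = set(query) & set(keywords)
    return sum(math.log(total / frequencies[word]) for word in shared)


def retrieve(query, services, threshold, score=lesk_score):
    # `score` is the pluggable part: any function with the same signature works.
    frequencies = document_frequencies(services)
    total = len(services)
    return sorted(
        name
        for name, keywords in services.items()
        if score(query, keywords, frequencies, total) >= threshold
    )


# Hypothetical case base: 50 services, 20 of which are annotated with "protein".
services = {
    f"s{i:02d}": {"protein"} if i < 20 else {f"other{i}"}
    for i in range(50)
}
weight = lesk_score({"protein"}, {"protein"}, document_frequencies(services), 50)
print(round(weight, 3))
print(len(retrieve({"protein"}, services, threshold=0.5)))
```

With 20 of 50 services sharing the word, its weight is log(50/20), matching the worked example in Section 3. Swapping in another measure only requires passing a different `score` function to `retrieve`.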
The cases retrieved by our system might not be exact matches for the user’s requirements, but they serve as intelligent suggestions that make the user’s task easier. The user can set a threshold in the system to keep only the better results. This enables users to control the quality and the number of results the system provides.
5 Future Work
Our current system works on graphs, under the assumption that every acyclic workflow can be reduced to a graph. We want to verify this assumption and to natively support a few workflow description languages.
The current system uses a keyword-based similarity model for expressing and matching the semantics of services and workflows. We would like to support a few semantic languages within the system. The best approach will be to take an existing implementation of those semantic languages and integrate it with our similarity detection and requirement specification components.
There are existing workflow composition frameworks available today within various domains. XBaya [13] is one such tool, developed within the Extreme Lab at Indiana University. Integrating our system into XBaya or any other such system will enhance the functionality and usage of those composition tools.
We would like to evaluate the performance of our system against various knowledge-heavy systems and also against manual systems. Comparing our system with automatic composition frameworks in terms of memory, speed and response time might provide better insight into its usability. It is worth noting that our system is intended as an improvement to both manual and automatic composition systems and will not completely replace them.
