CBR based Workflow Composition Assistant
Eran Chinthaka, Jaliya Ekanayake
Computer Science Department, Indiana University, Bloomington, Indiana.

Abstract
Workflows are becoming an integral part of businesses. But composing a workflow can take a considerable amount of time, even when the author knows what the workflow needs to achieve. Composition becomes even harder when there are multiple services to choose from and the user has specific requirements for inputs and outputs. When workflows that are the same as or similar to the current requirements have already been composed, either by the same user or by someone else, most workflow environments offer no way to reuse them. Users might also be interested in adding explanations or semantics describing the functionality of workflows and the services within them. But most of the specifications for adding semantics to workflows are hard for ordinary users to understand. We propose a CBR (Case-Based Reasoning) approach that lets users exploit existing services and workflows, and we present an easy and extensible way for users to express their requirements to the system.

1 Introduction
Users compose workflows to achieve a certain task by aggregating existing services. Most of the time a novice user has a set of inputs and an expected output in mind and tries to get from those inputs to the expected output using the services available to him. Even though this seems trivial, carrying out this composition can be difficult for several reasons.
1. There can be many services available for him to choose from. He might need to search all or most of them.
2. It requires some level of knowledge to understand the semantics of each and every service in order to get the best workflow.
3. Services might have been written using various service description standards that the user is not familiar with. For example, WSDL (Web Services Description Language), which is used to define a Web service, might not be familiar to the user.
4. He might have trouble converting what he really expects into the workflow language required by the workflow composer.

There are a couple of ways the user can tackle these problems. One good option is to first look at existing workflows and see whether he can reuse some components from them. These existing components will either help the user accomplish the whole task or help achieve sub-parts of it. This is quite analogous to what programmers do when developing software. When given a problem, they might first try to find existing software or code segments that carry out the same task, or they will use existing components as sub-components within their own program. Applying the same concept to the workflow domain, workflow authors might have already composed workflows that can be used as sub-components of the new workflow. Or there can be workflows written by other developers and hosted in public places like myexperiment.org.

Another common problem with workflow composition is that workflow authors might encounter difficulties expressing their requirements to the system. Even though various semantic languages [1][2][3] have been proposed for these situations, some users might not be conversant with those languages. The best option for these users might be to let them express their requirements using simple keywords. But if the user is advanced enough, the system should also support expressing the requirements using a given semantic language.

We propose a case-based reasoning approach to assist users during workflow composition. Our system will help users find existing workflows that fit their requirements, or find components that they might be able to fit into a bigger workflow.
Our system will also enable users to express their requirements using a keyword-based approach, and it can easily be extended to work with a known semantic description language.

The rest of this paper is organized as follows. Section 2 discusses related work in this area. Section 3 explains the proposed solution to the problems outlined above. Section 4 describes the key features of our system. Finally, Section 5 outlines the future work that can be carried out to improve the proposed system.

2 Related Work
Various efforts have previously been made to help users compose workflows. Ambite and Kapoor [4] have proposed a system that automatically composes workflows from a given set of requirements. Albert et al. [5] have described a constrained object model for workflow composition, based upon a metamodel for workflows and ontologies for processes and data flows. All these approaches use various search strategies to produce a workflow for the user automatically. There are a couple of problems associated with these approaches. First, it might take a considerable amount of time to come up with a solution, as this will be a brute-force search with some heuristics. Also, the time taken can be a function of the size of the search space. If there are more services to be searched, these approaches might not perform or scale well. There should be better ways of handling automatic composition when the search space is large. In addition, these systems are not flexible enough to handle different ways of expressing the user's inputs.

Kim et al. [6] describe an approach that helps users during workflow composition by analyzing the composed workflows. Their CAT system notifies users of various errors and suggests ways to overcome them. They use planning techniques to construct better workflows. There are also efforts [7] to help users compose workflows by introducing domain-specific ontologies.
These approaches suggest services after each component so that users can wire their existing services into them. But these systems can suggest at most one component at a time and are not generic enough to be used across different domains.

Copyright © 2008, Indiana University, Bloomington, Indiana. All rights reserved.

3 Proposed Solution
We developed a solution to the above problem using a CBR approach. Our case base contains existing workflow descriptions in which each input and output of the services inside the workflow is annotated with keywords. The user specifies, using keywords, the inputs and outputs of the expected workflow or of the part of a workflow that he or she wants to find. The system retrieves the possible matches by considering the keyword information and the structural information of the workflows stored in the case base. Once the user specifies the search keywords, as shown in Figure 1, the system can match them to (i) a service, (ii) a part of a workflow or (iii) a complete workflow.

[Figure 1 labels: List of keywords; single output requirement; Expected Workflow]
Figure 1. Inputs to the system are a set of keywords lists, specifying the inputs and outputs of the expected workflow.

To elaborate on the above issue, consider the workflow shown in Figure 2. Each box shown by the dashed lines could be a possible match for the user's inputs and outputs.

[Figure 2 labels: X: user inputs and outputs are matched to an entire workflow; Y: user inputs and outputs are matched to a single service; Z: user inputs and outputs are matched to a part of the workflow]
Figure 2. Matching user's search criteria to parts of workflows.

As mentioned earlier, the system takes the structural information of the workflow into account when calculating the similarity to the user's search criteria. Therefore, we consider only acyclic workflows, and the search is based on the following lemma.
“All acyclic workflows can be modeled as a graph”
This enables us to reduce the structural search to a graph search and solve the problem at hand.

Depending on the user's requests and the cases stored in our case base, it is not always possible to find a direct match for the user's input and output requirements. Therefore, we have incorporated a simple case adaptation step, which enhances the results of the search. If we cannot find the required set from a single workflow in our system, we try to merge the closest workflow with the services in our knowledge base to get the perfect match. For example, the user might want to get the results as an HTML page, but the workflow we have might produce its output as an XML file. In this case we can connect our workflow to an XML to HTML converter service and fulfill what the user expects. The overall operations performed by our system are shown in Figure 3.

[Figure 3 labels: Input keywords list; Output keywords; Keyword similarity check (Lesk Algorithm); If Found (YES/NO); Check if paths exist from input nodes to output nodes; Check the self-containedness of the selected nodes in the workflow; If Successful (YES/NO); Try to adapt using existing workflows and services]
Figure 3. System operations and the decision points

Keyword based Similarity Check
We use keyword similarity to match the user's input and output descriptions to the cases stored in our knowledge base. Before a workflow is stored in the case base, the user is required to annotate the inputs and outputs of each service in the workflow with relevant keywords. Currently we do not place any restrictions on the selection of the keywords or on the number of keywords used to annotate each input. However, we believe that a domain-specific ontology will increase the effectiveness of the search.

The keywords used to annotate the inputs and outputs should be able to describe the "type" of the input or output the service is expecting, as well as the key features of that input or output. For example, consider the service D in Figure 2. Assume that service D's task is to identify a particular set of proteins for a given set of species from a set of relevant documents and a description or a model of the species. A possible set of keywords for the two inputs and the output of service D would look like the following.

<input>proteins, description, document type</input>
<input>genomics model, species, type of the input</input>
<output>proteins, entities, discovered, output type</output>

The user specifies the number of inputs and outputs of the portion of the workflow that he or she needs to find, along with a set of keywords for each input and output. The user-specified inputs are matched against the inputs of all the services in a given workflow. A match can be made using one or more services as the input services if all the inputs of the selected services match the input descriptions given by the user. Figure 4 elaborates this scheme of matching the user's inputs to the inputs of the services.

The services in a workflow can produce multiple outputs. However, for a search, the user should specify only one output using keywords. This simplification is only an implementation optimization; if the user expects multiple outputs, multiple searches can be used to find the expected result. The user-specified output is also matched against all the outputs of the services in a given workflow. The inputs and outputs can be matched to different services inside the workflow, and hence the system should be able to determine whether the selected services are connected. We also check the self-containedness of the selected services, for which we use the structural information.
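The input-matching rule described in this section can be illustrated with a minimal Python sketch. It accepts a service as an input service only when every one of its inputs matches one of the user's input descriptions, so a service with an additional, undescribed input is rejected. The function names, the shared-keyword count used as a stand-in similarity measure, and the sample keyword lists are all hypothetical and do not reflect the actual implementation.

```python
def overlap(a, b):
    """Stand-in similarity: the number of keywords two descriptions share."""
    return len(set(a) & set(b))


def match_input_services(user_inputs, services, threshold=1):
    """Map each accepted service to the user inputs that its inputs matched.

    user_inputs: list of keyword lists, one per input the user described.
    services: dict mapping a service name to a list of keyword lists,
              one per input of that service.
    """
    matches = {}
    for name, inputs in services.items():
        matched = []
        for service_input in inputs:
            hits = [i for i, described in enumerate(user_inputs)
                    if overlap(described, service_input) >= threshold]
            if not hits:
                matched = None  # an additional input the user did not describe
                break
            matched.append(hits[0])
        if matched:
            matches[name] = matched
    return matches


def covers_all(matches, user_inputs):
    """True if the accepted services together consume every user input."""
    used = {i for indexes in matches.values() for i in indexes}
    return used == set(range(len(user_inputs)))


# Hypothetical data: three user inputs and three annotated services.
user_inputs = [["proteins", "document"], ["species", "model"], ["format"]]
services = {
    "service1": [["proteins", "description", "document type"],
                 ["genomics model", "species", "type of the input"]],
    "service2": [["format", "output type"]],
    "service3": [["format"], ["credentials"]],  # second input is unmatched
}
accepted = match_input_services(user_inputs, services)
print(accepted)  # {'service1': [0, 1], 'service2': [2]}
print(covers_all(accepted, user_inputs))  # True
```

Any keyword similarity measure could be substituted for `overlap`; the threshold plays the same role as the user-adjustable matching threshold.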
[Figure 4 labels: Service 1 and Service 2, with their inputs matched to the first, second and third user inputs. In one case, Service 1 and Service 2 are possible candidates for the inputs specified by the user. In the other case, Service 2 has an additional input and hence these services cannot be considered as a possible match.]
Figure 4. Matching user's input descriptions to the inputs of services.

Lesk Algorithm
We use a measure based on the Lesk algorithm [12] to check the similarity of the user's keywords to the keywords inside our services. We first read all the keywords found within all the services in our system to find the frequency of each keyword. When we are given a set of keywords to be matched, we first find the overlap between the input keywords and the inputs of each and every service. We then weight the words in the overlapping set by their inverse document frequency. For example, if an overlapping word is found within 20 services out of the 50 services we have within workflows, we assign log(50/20) as the weight of this word. This weight reflects the significance and uniqueness of the given word across the whole system. We add up the weights of all the words in the overlapping set to come up with a total score. This Lesk-based measure is then compared against a threshold value to decide whether the match is significant.

Structural Similarity Check
Once a set of input services and an output service has been matched to the user's input and output descriptions for a given workflow, the next task the system performs is to check the structure of the graph.
1. Check if there exists at least one path from each input service to the desired output service.
2. Check whether all the services encountered within these paths form a self-contained set of nodes in the workflow (graph).
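The Lesk-based score and the two structural checks above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the actual implementation: the function names are hypothetical, the natural logarithm is assumed because the text does not fix the base, a workflow is assumed to be stored as a dictionary mapping each service to the services it feeds, and "self-contained" is interpreted as "no selected service other than the input services depends on a service outside the selected set".

```python
import math
from collections import defaultdict


def idf_weights(service_keywords):
    """Weight each keyword by log(N / n), where n of the N services use it."""
    total = len(service_keywords)
    counts = defaultdict(int)
    for keywords in service_keywords.values():
        for word in set(keywords):
            counts[word] += 1
    # Natural logarithm is an assumption; the text does not fix the base.
    return {word: math.log(total / n) for word, n in counts.items()}


def lesk_score(query, candidate, weights):
    """Add up the weights of the keywords shared by query and candidate."""
    return sum(weights.get(word, 0.0) for word in set(query) & set(candidate))


def reachable(graph, start):
    """Return every node reachable from start (inclusive) along the edges."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def structural_match(graph, input_services, output_service):
    """Return the matched set of services, or None if either check fails."""
    reverse = defaultdict(set)
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].add(src)
    upstream = reachable(reverse, output_service)  # nodes that reach the output
    # Check 1: every input service must have a path to the output service.
    if not all(service in upstream for service in input_services):
        return None
    # Services lying on some path from an input service to the output service.
    downstream = set().union(*(reachable(graph, s) for s in input_services))
    selected = downstream & upstream
    # Check 2: apart from the input services, no selected service may depend
    # on a service outside the selected set (no additional inputs needed).
    for node in selected - set(input_services):
        if not reverse[node] <= selected:
            return None
    return selected


# Hypothetical workflow: A and B feed C, while C and E feed D.
workflow = {"A": ["C"], "B": ["C"], "C": ["D"], "E": ["D"]}
print(sorted(structural_match(workflow, ["A", "B"], "C")))  # ['A', 'B', 'C']
print(structural_match(workflow, ["A", "B"], "D"))  # None, since D also needs E
```

With 50 services of which 20 use a given keyword, `idf_weights` assigns that keyword log(50/20), which is about 0.92 under the natural-logarithm assumption.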
Of the two structural checks, the first ensures that the selected inputs and outputs are actually part of a connected graph and not a disconnected set of nodes. The second check ensures that the collected set of nodes can be used without any additional inputs to achieve the user's needs. In our system, we store all the workflows as graphs, and hence performing these two checks reduces to a graph search problem.

Even though we have used the Lesk algorithm as the measure for keyword similarity, our system is flexible enough to integrate any simple or complex similarity measure to decide the similarity of cases. We used the Lesk algorithm as it was one of the easiest natural language processing techniques to find similarity.

Case Adaptation
In our system, the keyword-based similarity check allows the user to specify a threshold for the keyword matching. This allows the user to relax or tighten the search criteria and hence control the recall. However, sometimes it might not be possible to find exact matches for the user's request from the stored cases alone. As a solution, we incorporated a case adaptation step that is based on the following assumptions.
1. The case base stores descriptions of services that take one input and produce one output.
2. The adaptation considers only these services for constructing extensions to the existing workflows or parts of workflows.

The case adaptation mechanism uses backtracking to adapt a workflow to the user's request and has the following steps.
1. For each workflow, identify input services by running the keyword similarity check using the input descriptions specified by the user.
2. Identify services (from the services stored in the case base) whose output keywords are similar to the output description specified by the user.
3. For each such service (say service S), find a service in the selected workflow whose output description is similar to the input description of service S.
4. If there is a match, execute the direct search algorithm to see whether the identified input services have connecting paths to service S and whether the services on those paths form a self-contained set.
5. If such a path is found, add service S to the end of the identified workflow (or part of the workflow) and present the result to the user.

The above algorithm is not computationally efficient, as it requires searching all the stored cases against the selected workflows. Therefore we perform only a few iterations of it for adaptation.

The final output of a workflow is highly dependent on the specific problem the user is trying to solve. For example, a workflow created by a user to retrieve protein structures from an online database may use a service to convert the selected results into an HTML table for the final presentation. Another user may need the very same workflow but want the output in some other format, such as an XML file, rather than in HTML. In such situations, although the case base contains a near-ideal match, without adaptation the second user will not be able to reuse the workflow composed by the first. With our system, if the system can find a service that performs the HTML to XML conversion, it will be able to adapt the stored workflow to the user's requirement and hence suggest it to the user.

4 Key Features
There are a couple of novel features introduced in our system.

We expect our system to be much faster than either a manual or an automatic composition assistant. Manual techniques alone require a lot of time to figure out which services to use at each step. Having an ontology might reduce the total time and effort, but we do not expect it to do better than our knowledge-based approach. Automatic workflow composition tools perform an extensive search over the set of services. Even if those systems use a pre-built index, adding and removing services will pose challenges for them.
Our system does not rely on any particular workflow description language [8] [9] [10] [11]. We consider only acyclic workflows in our approach, which allows us to reduce a workflow written in any workflow description language to a graph. Our main algorithms work at the graph level, and this makes our system independent of any workflow description language.

Our system enables users to provide their input and output descriptions using keywords. For this we used a basic similarity measure based on the Lesk algorithm [12]. We used the Lesk algorithm as it was the easiest to use among the techniques for keyword matching and disambiguation. Similarity metrics in our system are decoupled from the main system, making it easier to plug in a different context matching algorithm. If the semantics are expressed using a known standard, plugging in similarity components that adhere to that standard will also be easy.

The cases retrieved by our system might not be exact matches to the user's requirements. But those cases will be intelligent suggestions that make the user's task easier. The user can set a threshold in the system to keep only the better results. This enables the user to control the quality and the number of results our system provides.

5 Future Works
Our current system works on graphs, under the assumption that every acyclic workflow can be reduced to a graph. We want to verify this assumption and to natively support a few workflow description languages.

The current system uses a keyword-based similarity model for expressing and matching the semantics of services and workflows. We would like to support a few semantic languages within the system. The best approach will be to use an existing implementation of those semantic languages and integrate it with our similarity detection and requirement specification components.

There are existing workflow composition frameworks available today within various domains.
XBaya [13] is one such tool, developed within the Extreme Lab, Indiana University. Integrating our system into XBaya or any other such system will help extend the functionality and the usage of those composition tools.

We would like to do a performance evaluation of our system against various knowledge-heavy systems and also against manual systems. Comparing our system's performance in terms of memory, speed and response time with automatic composition frameworks might provide better insight into the usability of our system.

It is worth noting that our system is intended as an improvement to both manual and automatic composition systems and will not completely replace them.

References
[1] Stefano Beco, Barbara Cantalupo, Nikolaos Matskanis, Mike Surridge. Putting Semantics in Grid Workflow Management: the OWL-WS approach.
[2] David Martin et al. OWL-S: Semantic Markup for Web Services. http://www.w3.org/Submission/OWL-S/
[3] Resource Description Framework (RDF). http://www.w3.org/TR/rdf-syntax-grammar/
[4] J.-L. Ambite and D. Kapoor. Automatically Composing Data Workflows with Relational Descriptions and Shim Services. Proceedings of ISWC 2007.
[5] Patrick Albert, Laurent Henocque, Mathias Kleiner. Configuration-Based Workflow Composition. IEEE International Conference on Web Services (ICWS'05), pp. 285-292, 2005.
[6] Jihie Kim, Yolanda Gil, Marc Spraragen. A Knowledge-Based Approach to Interactive Workflow Composition.
[7] Rahul Ramachandran. Using Ontologies for Data Mining Workflow Composition.
[8] Business Process Execution Language for Web Services. ftp://www6.software.ibm.com/software/developer/library/ws-bpel11.pdf
[9] The Business Process Modeling Language (BPML). http://www.bpmi.org/bpml-spec.htm
[10] Tom Oinn et al. Delivering Web service coordination capability to users.
[11] Web Services Flow Language. http://dps.uibk.ac.at/uploads/100/WSFL.pdf
[12] Jonas Ekedahl and Koraljka Golub. Word sense disambiguation using WordNet and the Lesk algorithm.
[13] S. Shirasuna, D. Gannon. XBaya: A graphical workflow composer for the web services architecture. Technical Report 004, LEAD, 2006.