WebSlogger: A Web Services Search Engine
www.wbslogger.com

CS8803: Advanced Internet Application Development
Spring 2008 Class Project

Team Members:
Roland Krystian Alberciak (krystian@gatech.edu)
Piotr Kozikowski (piotr.kozikowski@gatech.edu)
Sudnya Padalikar (sudnya.padalikar@gatech.edu)
Tushar Sugandhi (tusharsugandhi@gatech.edu)

1. Abstract:
Our objective for the "WebSlogger" project is to improve the accessibility and usability of web services. We worked on improving techniques for searching, clustering, classifying, and ranking the web services available over the Internet. We investigated the shortcomings of current solutions for locating and ranking web services, and we suggest steps to improve the existing solutions for the search, classification, and ranking of web services over the Internet.

2. Introduction:
Web services are a very popular way for businesses to communicate with each other, as well as with clients. Web services are meant to endorse the principles of modularity and code reusability. Quite a few web services perform tasks such as looking up product details, currency conversion, etc. The standard way to define a web service is the Web Service Definition Language (WSDL). A WSDL file defines the web service in terms of the logical port types through which it exposes its functionality, its web methods, and its data types (inputs and outputs). The web service may refer to external XSD schemas or to other web services for the data types it uses.
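The structure just described can be made concrete with a small example. The following is a minimal, hypothetical WSDL 1.1 fragment (the service, operation, and parameter names are invented for illustration, and most namespace declarations are omitted); it shows one port type with one operation and the input and output messages that define the operation's data types:

    <definitions name="CurrencyConversion"
        xmlns="http://schemas.xmlsoap.org/wsdl/"> <!-- other xmlns omitted -->
      <message name="ConvertRequest">
        <part name="amount" type="xsd:double"/>
        <part name="fromCurrency" type="xsd:string"/>
        <part name="toCurrency" type="xsd:string"/>
      </message>
      <message name="ConvertResponse">
        <part name="convertedAmount" type="xsd:double"/>
      </message>
      <portType name="CurrencyPortType">
        <operation name="Convert">
          <input message="tns:ConvertRequest"/>
          <output message="tns:ConvertResponse"/>
        </operation>
      </portType>
      <service name="CurrencyConversionService">
        <documentation>Converts an amount between two currencies.</documentation>
      </service>
    </definitions>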
Varied web services such as these are consolidated via a platform-independent registry system called Universal Description, Discovery and Integration (UDDI), which acts as a local index of web services. Online directories such as Woogle, XMethods, and ProgrammableWeb categorize web services and provide a search facility for them, but the number of web services registered on these sites is very limited and does not reflect the number of web services actually available on the Internet.

In this project we built an automated system that discovers web services and clusters and categorizes them according to our clustering algorithms. The main components of the project are: discovery of WSDL files on the Internet using a search engine API (the Yahoo Search API), automated clustering and categorization of the discovered web services, and ranking of those web services to improve the quality of search results. We use clustering and ranking algorithms that differ from the conventional algorithms used for web page ranking.

The remainder of the paper is organized as follows. The next section describes previous work in this field. We then discuss the design of our website, www.wbslogger.com: how we crawled the Web to fetch web services, the clustering algorithm we used to categorize them, and our proposed ranking techniques designed specifically for ranking web services. Finally, we present experimental results and concluding remarks.

3. Previous Work:
We familiarized ourselves with prior studies on this topic. Two projects [5, 6] were of particular interest to us, since their work is very much in line with what we are trying to achieve. Woogle [5] is an existing search engine with provision for finding similar web service operations; it claims to cover over 1500 web service operations. Although semantics do not play a significant role in general web search, they are essential for checking the similarity of operations. The crux of Woogle is an algorithm that clusters parameter names into semantically meaningful concepts, which Woogle then uses to find similar web services.

The Yellow Pages project [6] focuses more on web service discovery and categorization. It used the existing Google APIs for WSDL discovery (although this limited the search to about 1000 requests per day and a limited number of results per query). Web service classification was done by manually labeling a portion of the web services and testing the classification algorithms on this labeled data set, with considerable accuracy; there is scope for increasing the accuracy of classification by adding more labels. An interesting finding of this study was that many web services do not define WSDLs: a lot of ad-hoc APIs are available on the Internet whose authors have not codified them in WSDL format. The lack of a WSDL makes it much harder to discover and probe those web services automatically and accurately.

4. WebSlogger System Design:
Figure 1: System Architecture

Figure 1 shows the design overview of our system. Its main components are:
- Yahoo Search API based web services retrieval engine
- Web services storage database (MySQL based)
- Web services clustering engine
  o Glossary metadata used for automated clustering
- Web services ranking engine
  o User rating system
  o Automated referential ranking system
- Web interface / GUI

4.1 Web Services Retrieval Engine:
During our background research we found that the Yellow Pages project [6] had used Google's SOAP API and retrieved results from three main domains: .com, .net, and .org. There is also an existing, publicly available web crawl known as the Alexa Web Search Platform, a service that makes the whole Alexa web crawl data set available for a reasonable cost. We estimate that it should be possible to retrieve a list of all WSDLs from this service for a nominal cost (probably $50 to $200).

We also studied the work behind Woogle [5]. That study proposed a set of similarity search primitives for web service operations and described algorithms for implementing these searches effectively. Woogle's work was mainly based on exploiting the structure of web services and employing a novel clustering mechanism to group parameter names into meaningful concepts, but it did not implement automatic web service invocation.

We had several options for retrieving web services from the Web. We considered the search APIs made available by several search engines, chiefly Google and Yahoo. After some background research we found that we could not use Google's SOAP API, since it has been deprecated since December 2006. Google has since released its Ajax-based API, but the main problem with that API is that it returns restricted results, which would limit us to around 1000 results per day. In contrast, the Yahoo Search API returns 100 results per query and allows 5000 queries per day, which effectively removed the constraints on fast retrieval of web services from the WWW. In addition, the Yahoo API provides multilingual support, which enabled us to search in domains of 38 different languages. A minimal sketch of the retrieval loop we built on top of this API follows.
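The sketch below, in Python, shows the kind of paging loop we ran. The endpoint, response namespace, and parameter names are our recollection of the (since retired) Yahoo Web Search V1 REST interface and should be treated as assumptions, as should the placeholder application id:

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed (since retired) Yahoo Web Search V1 endpoint and namespace.
    ENDPOINT = "http://search.yahooapis.com/WebSearchService/V1/webSearch"
    NS = "{urn:yahoo:srch}"
    APP_ID = "YOUR_APP_ID"  # hypothetical application id

    def fetch_result_urls(query, start=1, results=100):
        """Return the result URLs for one page of a Yahoo V1 web search."""
        params = urllib.parse.urlencode({"appid": APP_ID, "query": query,
                                         "results": results, "start": start})
        with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
            root = ET.fromstring(response.read())
        return [r.findtext(NS + "Url") for r in root.findall(NS + "Result")]

    # Example: retrieve the first page of WSDL hits.
    # urls = fetch_result_urls("originalurlextension:wsdl")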
We therefore settled on Yahoo's Search API and ran searches for .wsdl and .asmx files; specifically, we retrieved files matching the extensions ".wsdl" and ".asmx?WSDL". In this way we successfully downloaded 16000 web services from the Internet. We understand that these queries do not completely cover the web services domain; many more web services can be found with the query "inurl:wsdl". We did not retrieve those for lack of the time and resources needed to process them through the clustering and ranking engines. We also found that quite a few of the retrieved web services were duplicates, so we worked on separating unique files from the highly repetitive data we had downloaded. Table 1 shows the number of web services retrieved per language:

    Language     # Web Services | Language     # Web Services
    Latvian        60           | Lithuanian     70
    Greek         100           | Indonesian     10
    Catalan        20           | Korean        110
    Norwegian     430           | Arabic         50
    Croatian      120           | Swedish       400
    Portuguese    910           | French       1620
    Dutch         460           | Russian      1100
    Finnish       300           | Slovenian     190
    Spanish       602           | Estonian      190
    Persian        90           | German       1201
    Turkish       460           | Thai           80
    Czech         470           | Hungarian     310
    Icelandic     120           | Italian      1280
    Romanian       90           | Polish        280
    English      1987           | Danish        840
    Japanese     1450           |

Table 1: Web services categorized by language

4.2 Clustering and Classification Engine:
While analyzing approaches to classification and categorization, we found that the structure of a web service description (WSDL) file provides a lot of detail about the meaning and usefulness of the file. We used hierarchical clustering to exploit this structure. We came up with glossaries covering 27 categories and about 2800 keywords, modeled after the expertise of one of our project mates who has experience with clustering algorithms and ontologies. The distribution of keywords among the categories is shown in Table 3.

Glossary of words for clustering: we compiled a glossary of terms popular in each category into which we classify web services. The web service components used for classification are listed below in the order in which they appear in a WSDL file; the priority of each section for the classification algorithm is given in Table 2 (priority 1 is the highest).

    Section in Web Service Description File        Priority
    Documentation                                   2
    Message -> Name                                 5
    Message -> Part -> Name                         7
    PortType -> Name                                3
    PortType -> Operation -> Name                   4
    PortType -> Operation -> Input -> Message       6
    PortType -> Operation -> Output -> Message      6
    Service -> Name                                 1

Table 2: Web service section priority for the clustering algorithm

    Cluster Name                 # Keywords
    Entertainment                  130
    Travel                          15
    Legal and Financial             49
    Health and Medicine            380
    Airlines                        84
    Retail                          55
    Social Services                121
    Recreation and sport           804
    Transport and Relocation        50
    Finance                        107
    Builders                        61
    Automotive Insurance           106
    Education and Instruction       91
    Cruises                        223
    Food                           217
    Hospital                       140
    Computers and Internet          70
    Library                         17
    Automotive                      64
    Credit and Debt Services       148
    Real Estate                    149
    Hotels                          74
    Legal                           72
    Property Manager                12

Table 3: Distribution of glossary terms across clusters

Our rationale for this clustering strategy is to partition web services by importance: some sections of a web service file are more important than others. For example, the service name or operation name is more important than a message type name, since it sits at a higher semantic level and provides more hints about the overall WSDL file. We use a data structure called an affinity vector, a vector whose size equals the number of clusters in the system; a web service may belong to one or many clusters. Affinity is determined by correlating the words in the various sections of a web service (weighted by section priority) with the words in the glossary of each cluster, so the affinity vector of a web service indicates which clusters it is most closely related to. Currently we assign a web service to the single cluster with the highest weight in its affinity vector, but we intend to relax this constraint in the future. A simplified sketch of this computation follows.
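The Python sketch below illustrates the affinity-vector computation. The mapping from Table 2 priorities to numeric weights (weight = 8 - priority, so that priority 1 scores highest) and the helper names are illustrative assumptions, not the exact scheme of our engine:

    from collections import Counter

    # Section weights derived from Table 2: weight = 8 - priority,
    # so Service -> Name (priority 1) counts most. Illustrative only.
    SECTION_WEIGHTS = {
        "service_name": 7, "documentation": 6, "porttype_name": 5,
        "operation_name": 4, "message_name": 3,
        "operation_input": 2, "operation_output": 2, "part_name": 1,
    }

    def affinity_vector(sections, glossaries):
        """sections: {section name: list of words found in that section};
        glossaries: {cluster name: set of glossary keywords}.
        Returns a Counter mapping each cluster to a weighted match score."""
        vector = Counter()
        for section, words in sections.items():
            weight = SECTION_WEIGHTS.get(section, 1)
            for word in words:
                for cluster, glossary in glossaries.items():
                    if word.lower() in glossary:
                        vector[cluster] += weight
        return vector

    def classify(sections, glossaries, default="Others"):
        """Assign the single best cluster (the constraint we plan to relax)."""
        vector = affinity_vector(sections, glossaries)
        return vector.most_common(1)[0][0] if vector else default

    # Example with hypothetical glossaries:
    # glossaries = {"Travel": {"flight", "hotel"}, "Finance": {"rate", "loan"}}
    # sections = {"service_name": ["Flight", "Booking"]}
    # classify(sections, glossaries)  # -> "Travel"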
Our clustering also demonstrates language support, since the WSDL files are extracted from the Web and cataloged by the language domain they come from. Because programming languages are (usually) written in English, we find that many programmers write their WSDL documentation in English regardless of locale; this simple observation helps us understand and interpret foreign WSDL files, and it can be confirmed by looking at the results in Table 1. Our improved clustering and classification is not a simple dictionary attack, and we observed fewer false positives and false negatives than in the original work by Danesh Irani et al. [6] on the Yellow Pages project.

4.3 Ranking Engine:
A fundamental aim of our project is to provide up-to-date and relevant ranking of web services. Our ranking plans are twofold and reflect the process by which we expect the project to grow. Since our crawler methodology prevents us from ranking WSDL files by popularity, our ranking scheme has to be based largely on the users who use our service and contribute opinions about the web services they locate through it. Ignoring user contributions would force some kind of automatic classification ranking scheme, which we unfortunately do not have the resources to provide without more statistics and data. However, once user ratings accumulate over time through our interface, we could introduce our automatic ranking algorithms and then compare the effectiveness of (a) pure collaborative ranking versus (b) an automatic ranking algorithm with no user input. This comparison is subjective and difficult to test, but we can manually select relevant web services that should appear in the top results for certain queries and test the ranking systems against them.

The collaborative ranking we have implemented accomplishes the following key objectives:
- User comments: users can leave comments, permitting us to build a library of (hopefully constructive and usage-detailed) praise and complaints for each web service.
- Likert scale rating: users can apply their judgment and assign a rating to each web service that holistically takes into account their experiences with that service and with other web services they have used.
- Uptime/downtime: can you rely on the web service to be there when you need it? We want to give programmers this information so they can set their expectations of the service.

Figure 2: Screenshot of the User Rating System

We have also implemented some "quality of format" rankings, which attempt to screen out incomplete or erroneous web service files. As each file is parsed for inclusion in our search results, we analyze it for errors; if errors are detected, the file is rejected and excluded from our search results. We do this to meet a base level of quality in our results: our WSDL files should be viable candidates for use, and we should never knowingly provide faulty or defective results. A minimal sketch of this gate is shown below.
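As a minimal illustration of the gate, the Python sketch below rejects any file that is not well-formed XML or whose root element is not a WSDL 1.1 definitions element; our actual checks go further, so treat this as the skeleton of the idea rather than the full implementation:

    import xml.etree.ElementTree as ET

    WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

    def is_viable_wsdl(raw_bytes):
        """Accept only well-formed XML whose root is wsdl:definitions."""
        try:
            root = ET.fromstring(raw_bytes)
        except ET.ParseError:
            return False  # not well-formed XML: reject the file
        return root.tag == "{%s}definitions" % WSDL_NS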
Ideally we can expand this quality-of-format approach to answer deeper questions: do good formatting and thoroughness correlate with a good WSDL file? Heuristics over model web service files could help us answer such questions. A related debate in the web community concerns adherence to standards, for example whether browsers should silently correct bad HTML. For WSDL files, we believe the question "Do you care if your WSDL is W3C compliant?" should be answered yes: standards matter. WSDL files are written by developers for developers, so format matters and cannot simply be auto-corrected.

This leaves the following ranking avenues that we wish to explore:
- Generate referral chains from WSDLs: as explained earlier in this paper, web services can (and in many cases do) use web methods and data types from other web services, and thus refer to other web services for their operation. This scenario is analogous to a web page linking to other web pages, so similar algorithms can be applied to web services involved in referential chains: if a web service publishes data types or web methods that are used by many web services, or by popular web services, then its rating should be higher. Since we do not have access to crawler data, we wish to emulate a citation network in order to determine valuable web services. A "pseudo" PageRank approach could analyze the dependencies among web service files: web services often use methods and objects from other web services, and we should use this linking to rank them (a minimal sketch appears at the end of this section).
- Rank good users and bad users in the community: expert vs. novice user levels.
- Usage statistic ranking: how long a user views a WSDL; whether a user comes back to look at it again (since a WSDL is like an API); asking users which WSDL files they used to achieve a goal.
- Aging: we must presume that web services go "stale", as newer web services may provide the same functionality using more state-of-the-art languages. Introducing an aging component to rankings, so that stale results get pushed down, would be quite valuable.
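The Python sketch below illustrates the pseudo-PageRank idea on a WSDL reference graph. It assumes the dependency graph has already been extracted (for example, from wsdl:import statements and cross-service data type references); the damping factor and iteration count are conventional choices, not tuned values:

    def pseudo_pagerank(graph, damping=0.85, iterations=50):
        """graph: {service url: list of service urls it refers to}.
        Returns {service url: rank}; services referenced by many (or by
        highly ranked) services end up with higher rank. Dangling nodes
        simply leak rank in this toy version."""
        nodes = set(graph)
        for targets in graph.values():
            nodes.update(targets)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            incoming = {n: 0.0 for n in nodes}
            for node, targets in graph.items():
                if targets:
                    share = rank[node] / len(targets)
                    for target in targets:
                        incoming[target] += share
            rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                    for n in nodes}
        return rank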
4.4 Graphical User Interface:
An early goal of our project was to develop an application that people would choose to use not only because it fulfilled their objectives and delivered high-quality content, but also because the interface was easy and enjoyable to use. We took cues from current, successful web search engines to model an interface that supports many different kinds of interactions, particularly "recognition" and "recall" user behaviors. We felt early on that empowering a user to either search or browse was a necessary feature of our system, and our development of categories and indexing of search results lent itself well to accomplishing these interaction aims. We kept the GUI simple by reducing on-screen clutter, using white space, contrasting colors for semantic boundaries, and CSS style sheets to support a pleasant experience.

Figure 3: WebSlogger GUI Overview

In developing our GUI, we surveyed our options by reviewing programming languages that are well established for web development and the facilities they offer for rapid and extensible development. We wanted a choice that could support future development should this project be continued, while also providing an opportunity to learn new skills. Ultimately we leaned toward languages backed by MVC frameworks, since we feel there should be a level of maturity in our work. We chose between PHP and Ruby on Rails, since we had (or were compelled to acquire) some experience with these languages and because of their popularity in web application development. On the PHP front, we had some experience with the CakePHP framework, which provides MVC support and rapid application development. Reviews of Ruby on Rails suggested that it affords a faster development process, since it standardizes some development practices. Although users of Ruby on Rails have reported cases where RoR sites collapse under high load (http://it.slashdot.org/article.pl?sid=07/09/09/215230&from=rss), the promise of learning something new, of gaining fresh perspectives on viewing, abstracting, and solving problems, and of eventually reducing maintenance cost over the lifetime of the code was the deciding factor in selecting Ruby on Rails.

5. Experimental Results:
Most of the experimental results we want to showcase are mentioned throughout this document; here we reiterate some key findings. Queries against the search engines' web pages suggest that the number of WSDL files available on the Web is huge:
o Google, filetype:asmx returns 32,000 results
o Google, filetype:wsdl returns 14,800 results
o Yahoo, originalurlextension:wsdl returns 75,400 results
o Yahoo, originalurlextension:asmx returns 168,000 results
But when we issued the same queries through the Yahoo Search API, the results were:
o Yahoo, originalurlextension:wsdl returns 8,000 results
o Yahoo, originalurlextension:asmx returns 7,000 results
Thus, either Yahoo does not provide all search results to its API users, or there is a discrepancy between the results provided by the search APIs and by the search web page. The Yellow Pages project [6] concludes that the same behavior holds for Google. This was quite astonishing behavior.

Further, even though we retrieved 16000 web services from the Web, only 3500 of them were unique; this behavior is also observed with Google, as pointed out by the Yellow Pages project [6]. We were quite surprised by the degree of duplication on the Web (a sketch of how unique files can be separated appears at the end of this section).

As part of our experimental research, we conducted a feature analysis of different web service search and catalogue websites. It is striking that even though Yahoo's and Google's figures suggest that the number of unique web services on the Web is on the order of 5000, hardly any website has all of them registered. A feature comparison of WebSlogger with other websites is shown in Figure 4. Further, it was surprising to learn that organizations like Microsoft, IBM, and SAP have decided to stop supporting the most popular web service registration and discovery system, UDDI.

Figure 4: WebSlogger: Feature Comparison
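This report does not record the exact normalization we used when separating unique files, but content hashing along the following lines is one straightforward way to do it (the whitespace-normalization step and helper name are assumptions for illustration):

    import hashlib

    def unique_services(files):
        """files: iterable of (url, wsdl text) pairs.
        Yields one pair per distinct (whitespace-normalized) content."""
        seen = set()
        for url, text in files:
            normalized = " ".join(text.split())
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield url, text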
5.1 Results of Clustering Engine:
We used the following data as input to our clustering engine:
o 16000 web services in 38 different languages
o 2800 keywords as glossary terms
o 27 hierarchical clusters
The distribution of web services over the clusters is shown below (sub-clusters are indented):

    Automotive (0)
      - Automotive Insurance (0)
    Community (0)
      - Library (0)
      - Social Services (0)
    Computers and Internet (0)
    Education and Instruction (0)
    Entertainment (220)
    Food (0)
      - Restaurants (0)
    Health and Medicine (12)
      - Health Insurance (0)
      - Hospital (0)
    Legal and Financial (884)
      - Credit and Debt Services (11)
      - Legal (0)
      - Finance (0)
    Others (9906)
    Real Estate (63)
      - Builders (0)
      - Property Manager (0)
    Recreation and sport (455)
    Retail (0)
    Transport and Relocation (25)
      - Airlines (2)
      - Cruises (21)
      - Hotels (2)
      - Travel (0)

As mentioned earlier, our clustering algorithm is immature and our glossaries are not perfect, so a large number of web services land in the "Others" cluster.

6. Future Work:
In our work we met our stated objectives and established a proof-of-concept web services search engine, WebSlogger. With more resources and time, however, we would like to mature the following aspects of our deployment.

6.1 Crawling the WWW vs. extraction mining:
In our original objectives we sought to reliably obtain about 1000 unique results. Although we surpassed that goal, we resorted to pseudo-crawling via extraction queries sent to a search engine. Ideally, should the resources and cooperation be available, we would like to do authentic crawling of the Web for WSDLs, or at worst do extraction from multiple different search engines. This would let us assert the "popularity" or "credibility" of WSDL sources and of the hubs that identify them. With extraction alone, we have no way to know how many pages link to a particular WSDL file unless the file itself reports these chains by including them in its namespace section. Crawling would allow us to identify what other people on the Web consider important, particularly when it comes to web services.

6.2 Improving clustering:
As mentioned in the clustering section, we plan to introduce an automatic feedback categorization scheme. At the moment the proposed scheme requires some training, but in the future it should be capable of automatic classification (at least mostly automatic; we do expect some tweaking to be required). Further, the glossaries we use for classification are immature and incomplete, and there is a lot of scope for improving their quality. The cluster labels are also hard-coded; an automated algorithm could add new clusters and remove unwanted ones, along with detecting false positive and false negative classifications based on user feedback.

6.3 Improving ranking:
Recall that in the ranking section we discussed our implementation of collaborative ranking metrics, and we speculated that future rankings would attempt to converge to a presumed PageRank-like metric of hubs and sinks.

6.4 Location based clustering:
The original spirit of the Yellow Pages project that we sought to extend was to provide a mechanism whereby web services online can help generate business for local businesses. Web searches often return results from the entire Internet, so providing a means to identify local talent is a new commercial venture that many municipalities intend to pursue in order to promote a sustainable business and collaboration atmosphere.

6.5 Executable web services [http://xml.nig.ac.jp/wsdl/index.jsp]:
To help programmers and WSDL authors develop and share solid code, we wish to provide an avenue to visualize web service parameters. We would like to automatically parse a web service and indicate its inputs and outputs to non-technical users, along with commentary about the web service generated by automatic method invocation. Archiving input/output pairs for a web service's methods would let users look over sample uses of a WSDL file and verify usage by example rather than by documentation. We find that many web services are sparsely documented yet contain valuable methods that could help any programmer; if we could sample these methods, we could generate documentation about the WSDL file. Better yet, if users could test web services via an interface on our website, their testing of the WSDL file could be saved and archived for future users' benefit.
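As a first step toward such visualization, the Python sketch below extracts operation signatures (operation name plus input and output message references) from a WSDL 1.1 file using only the standard library; automatic invocation and archiving of input/output pairs would be built on top of this, and are not shown:

    import xml.etree.ElementTree as ET

    WSDL = "{http://schemas.xmlsoap.org/wsdl/}"

    def operation_signatures(wsdl_text):
        """Yield (operation name, input message, output message) triples."""
        root = ET.fromstring(wsdl_text)
        for port_type in root.findall(WSDL + "portType"):
            for op in port_type.findall(WSDL + "operation"):
                inp = op.find(WSDL + "input")
                out = op.find(WSDL + "output")
                yield (op.get("name"),
                       inp.get("message") if inp is not None else None,
                       out.get("message") if out is not None else None)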
7. Conclusions:
The most interesting contributions of WebSlogger to the global community are: the ability for programmers to locate and use web services; an easy, web-based means of doing so that can be accessed from any OS with minimal time and effort; and the ability for users to collaboratively review the content we provide.

The learning curve involved in implementing this project was quite long. We had to learn various technologies to implement this website, including:
- Ruby on Rails
- Registering and using a domain and web host
- The Web Services Description Language specification
- The process of picking up a previous project and working with the previous team to extend and improve the idea
- Programming languages such as Python, Java, and C#

Various concepts and techniques that we learned in class could be applied to extend WebSlogger to the next level:
- Better support for location-aware queries
- Better scaling of our system architecture
- Tracking the web services we locate for changes, and introducing those changes into our database incrementally

Overall, after investigating how current web services registration, search, and ranking systems perform, we feel that these systems are in a very preliminary state and that a lot of work remains to be done in this domain. As Web 2.0 gains popularity, with advancements in domains such as media search (video/audio), content-based advertising, and the web on mobile devices, web services will play a major role, and a well-designed, sophisticated web services search engine will only help achieve the desired progress in those fields. We sincerely believe WebSlogger can turn out to be an important stepping stone in this field, but a lot of hard work is needed to bring it to the level we envision.

8. References:
[1] Web Services Description Language. http://www.w3.org/TR/wsdl
[2] UDDI. http://en.wikipedia.org/wiki/Universal_Description_Discovery_and_Integration
[3] WSIL. http://www.ibm.com/developerworks/library/specification/ws-wsilspec/
[4] Alexa Web Search Platform on Amazon. http://www.amazon.com/gp/browse.html?node=269962011 and http://developer.amazonwebservices.com/connect/entry.jspa?externalID=801&categoryID=120
[5] Xin Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang. Similarity Search for Web Services.
[6] Andrew Cantino, Danesh Irani. Yellowpages for Web Services.
[7] ProgrammableWeb. http://www.programmableweb.com/
[8] Alexa Search. http://www.alexa.com/
[9] XMethods. http://www.xmethods.net/ve2/index.po