A platform for improving Web Services accessibility

Roland Krystian Alberciak, Piotr Kozikowski, Sudnya Padalikar, Tushar Sugandhi
{krystian, piotr.kozikowski, sudnya.padalikar, tusharsugandhi}@gatech.edu

Project Proposal

Abstract
We concern ourselves with the problem of improving web service accessibility. We aim to find better solutions to the problems of searching for, classifying, and accessing web services over the Internet. We propose to extend existing and novel work done in this domain, to investigate the shortcomings of current solutions in locating, ranking, and packaging web services for their users, and to provide the next steps toward improving upon those issues.

Motivation and Objectives
We propose to produce a web service directory that can list substantially more web services (in a substantially better format) than is possible with current-day WS directories. Danesh Irani et al. found that current directories account for only a small fraction of existing web services: in their comparison of WS directory listing services, no existing directory was able to account for more than roughly 1,000 services. When the authors did their own web searching for string patterns representative of WSs, however, they found up to 36,000 listings for websites that were possible WS candidates. This suggests that the current techniques for building and maintaining directories of WSs are inadequate at identifying the full market share of these services. We therefore see commercial prospects in our work: realizing techniques that give better exposure to owners of web services, and developing a revenue model for this knowledge discovery.

We outline the following objectives for our work:

Objectives:
[1] Design an architecture and develop a system which can crawl and index web services over the Internet.
[2] Be agile enough to adjust to and identify these services despite the different standards by which they are defined.
[3] Provide clustering and ranking for better assessment of web services, and make them accessible to more people.
[4] Build a usable web interface through which people can query for web services and use the results conveniently for development and personal use.

Related work
We familiarized ourselves with studies done on this topic. Two projects [5, 6] were of particular interest to us, since their work is closely in line with what we are trying to achieve.

Woogle [5] is a search engine that can find similar web service operations; its algorithm reportedly covers over 1,500 web service operations. Although semantics do not play a significant role in general web search, they are essential when checking the similarity of operations. The crux of Woogle is an algorithm that clusters parameter names into semantically meaningful concepts; Woogle then uses these concepts to find similar web services.

Yellowpages [6] focuses on web service discovery and categorization. The authors used the existing Google APIs for WSDL discovery (although this limited each search to 1,000 results per query and a limited number of queries per day). Web service classification was done by manually labeling a portion of the WSs and testing the classification algorithms on this labeled data set, with considerable accuracy; there is scope for increasing the accuracy of classification by adding more labels. An interesting finding of this study was that many web services do not define WSDLs: there are many ad-hoc APIs available on the Internet whose authors have not codified them in WSDL format. The lack of a WSDL makes it hard to discover and probe those web services automatically and accurately.
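The parameter-name clustering idea at the heart of Woogle can be illustrated with a small sketch. The code below is not Woogle's actual algorithm; it simply groups parameter names that frequently co-occur in the same operation signatures into candidate "concepts" using a union-find merge. The threshold, helper names, and example operation signatures are all our own assumptions, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cluster_parameters(operations, threshold=0.5):
    """Group parameter names into candidate concepts by co-occurrence.

    `operations` is a list of parameter-name lists, one per operation.
    Names whose co-occurrence count is high relative to their individual
    frequency are merged into the same cluster.
    """
    freq = defaultdict(int)   # how often each name appears overall
    cooc = defaultdict(int)   # how often each pair appears together
    for params in operations:
        names = set(params)
        for n in names:
            freq[n] += 1
        for a, b in combinations(sorted(names), 2):
            cooc[(a, b)] += 1

    # Union-find: merge names whose co-occurrence ratio exceeds the threshold.
    parent = {n: n for n in freq}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n
    for (a, b), count in cooc.items():
        if count / min(freq[a], freq[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in freq:
        clusters[find(n)].add(n)
    return [sorted(c) for c in clusters.values()]

# Hypothetical operation signatures: two address-lookup style operations
# and one book-lookup style operation.
ops = [
    ["city", "state", "zipcode"],
    ["city", "zipcode"],
    ["isbn", "title"],
]
print(cluster_parameters(ops))
# → [['city', 'state', 'zipcode'], ['isbn', 'title']]
```

A real system would additionally need to tokenize compound names (e.g. `zipCode` vs. `zip_code`) and weight by corpus-wide statistics, but the sketch shows how co-occurrence alone already separates address-related names from book-related ones.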
Proposed work
We plan to modularize this project into the following tasks for better tracking and manageability.

Tasks:

1. Crawler improvements
We have several options for obtaining a crawler to find web services on the Internet. The first option is to write our own crawler, building on the number of open source web crawlers available. The second option is to use a search engine's API (e.g., the Google APIs, where Google does the crawling). However, Google no longer supports its SOAP APIs in the public domain, and has additionally restricted use of this API to a limited number of queries per day and only 1,000 results per query. Another option is to buy crawling services from providers like Amazon [4], Alexa [8], xMethods [9], ProgrammableWeb [7], etc. We will evaluate these options and choose the best one given our constraints: resources, availability of funds, and time/expectations.

2. Clustering/classification algorithm improvements
Danesh Irani et al. [6] have done a good amount of work in extracting web services from the Internet using the Google Search APIs. However, their scope was limited to web service search and classification. We propose to extend their work by enhancing the clustering/classification algorithms they developed. Our focus here is on improving the semantics of classification, so that semantically related web services are linked together. We discuss database-level indexing and caching for efficient storage and retrieval in the following sections.

3. Caching and indexing
We also propose indexing and caching the web services returned by the crawler. We plan to crawl the Internet periodically and cache and index the resulting list of web services so that we can cluster and classify them efficiently. This will also enable end users to retrieve search results efficiently.

4. Ranking
We also propose a novel technique for ranking web services based on several parameters:
- Usage: if a web service is used by a large number of web applications, it is more valuable and thus ranks higher in our system.
- Reference: much like Google's PageRank for web pages, if a web service has a large in-degree from other web pages (not web services), its rank is higher.
- Embedded: the number of web pages that embed a web service also plays an important role in evaluating its importance.
- User rating: users who have used or visited a web service should be able to rate it, and we can use these ratings when calculating its rank.
We will consider these parameters while designing a ranking algorithm for our system.

5. Status indication of web service availability
This is another innovative feature we propose to add in our work. We want to provide information about availability, i.e., whether a web service is up and fully functional, working partially, or not working at all. Users will see this information when they search for a web service. To our knowledge, no other web service search engine provides this kind of information.

6. Web interface and accessibility of our data
We propose to build a GUI that offers the following functionality:
- Search for web services based on keywords/classification.
- Display the status of a web service (available / partially running / down).
- User review system: users should be able to post reviews and comments for a web service. They should also be able to rate a particular web service, and we can use these ratings while ranking web services.

7.
Choosing Architecture
We will evaluate both peer-to-peer and centralized server architectures for their suitability for the system we propose to build. A centralized system is less complex and will be easier to build, test, and deploy. Since our crawler will not run very frequently, it does not need to be extremely fast, so we expect a single server to suffice to run the crawler and host the GUI component. This solution, however, will not be appropriate if the number of users of our system grows significantly. A peer-to-peer system, on the other hand, would take more effort to build and test, but offers much better scalability, as computing capacity would grow proportionally with the size of the user base without adding any dedicated resources. We will explore both approaches, but the most feasible option seems to be the centralized approach; if scalability becomes an issue, we will leave that as an opportunity for future development of this project.

8. Background Research
Last but not least, we plan to do an extensive background study before implementing the tasks described above. To enhance our understanding of the web service indexing options available online, and to prioritize the tasks mentioned above, we intend to study the search engine companies that already do WS indexing. We want to understand the shortcomings of their approaches, avoid those pitfalls, and thus be more effective in our work. We may even contact these companies to seek more information about their work: whether they are still working on web service search engines, and what new features they intend to add. For instance, some resources like wsgoogle were academic publications and may not have been worked on after the conference paper was published. We also wish to contact key people in the field of web service search in order to understand the problem domain a little better.
Our team discussed the possibility of contacting the W3C to ask about web services, particularly why so few websites are identified as web services. We have already explored the UDDI and WSIL standards and would like to explore these concepts more thoroughly.

Plan of action

Schedule
The following is a timeline of the project's main activities and milestones (milestones in bold):

February 15: Project proposal submission
February 16 - 28: Establish primary and secondary feature lists and the specific technologies to be used; research related work such as Woogle, wsoogle, Alexa, etc.
February 29: Final project definition
March 1 - 15:
o Analyze code from Danesh Irani et al. [6] and integrate it with our crawler
o Design GUI component
o Design indexing and ranking system
March 16 - 30:
o Refine clustering and classification mechanisms
o Build GUI prototype
o Implement indexing and ranking system
April 1 - 15: Integrate all parts and test
April 15: Final prototype implementation deadline
April 16 - 27: Continue testing
April 15 - 24: Project presentation
April 21 - 15: Project demo
April 27: Final project submission

Evaluation and Testing Method
The project will consist of several components: crawling, indexing, clustering and classification, ranking, and GUI. Each of them will be thoroughly tested as an independent module with the following acceptance criteria:

Crawling: Reliably obtain at least 1,000 unique web service results by improving the crawler developed by Danesh Irani et al. [6]. This figure is quite conservative; it will be updated after the background study.

Indexing and search: The system should efficiently and correctly retrieve web services in response to predefined test queries. The index should contain all items found by the crawler. We plan to design the test queries to cover as many scenarios as possible.
Ranking: Ranking is subjective and difficult to test, but we can manually select relevant web services that should appear among the top results for certain queries and test the ranking system against them. We also plan to include the numeric rank values in the result set for testing purposes.

Clustering and classification: Our improved clustering and classification must have at least 50% fewer false positives and false negatives than the original work by Danesh Irani et al. [6].

GUI: This component should be up and running on a server and pass a series of tests ensuring that its interface with the other components works as designed.

Once all modules pass the acceptance tests, the system as a whole will be fully integrated and tested. We expect to have a working prototype without any major bugs before the project presentation to the class, and will use the remaining time before the final submission to correct minor bugs, if any.

References:
[1] Web Services Description Language, http://www.w3.org/TR/wsdl
[2] UDDI, http://en.wikipedia.org/wiki/Universal_Description_Discovery_and_Integration
[3] WSIL, http://www.ibm.com/developerworks/library/specification/ws-wsilspec/
[4] Alexa Web Search Platform on Amazon, http://www.amazon.com/gp/browse.html?node=269962011 and http://developer.amazonwebservices.com/connect/entry.jspa?externalID=801&categoryID=120
[5] Similarity Search for Web Services - Xin Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang
[6] Yellowpages for Web Services - Andrew Cantino & Danesh Irani
[7] ProgrammableWeb, http://www.programmableweb.com/
[8] Alexa Search, http://www.alexa.com/
[9] xMethods, http://www.xmethods.net/ve2/index.po