A Web Services Search Engine
CS 8803 [AIA] - Spring 2008

Roland Krystian Alberciak
Piotr Kozikowski
Sudnya Padalikar
Tushar Sugandhi

Outline
• Project Overview
• Searching Web-services
  o Tools / APIs
  o How to figure out what information to show
• Results: Working prototype
  o Locate, classify, rank, and present web-services
• System Integration
  o Diversity! Languages (no joke): Python, Ruby on Rails, PHP, C#, Java, Perl. Databases: MySQL, MSSQL

Project Overview
Step 1 - There are web-services available on the web
Step 2 - (Challenges) Web services are harder to find than web pages because:
  • Effort to register
  • Directories disconnected
  • No clustering available
  • No ranking available
Step 3 - Profit
  • Should be beneficial for web developers
  • Should be beneficial for us

What is out there?
• Swoogle - "10,000 ontologies" (more concerned with the "semantic web" and "metadata" than with web services)
• Programmableweb - 726 (only APIs)
• "Yellow pages" - 5000 web-services
• XMethods - 500 web-services
• UDDI - Discontinued, but was useful to many web services for advertising themselves.

Survey of the Market - We found solutions for Step 2!
Step 1. Have web-services available on the web
Step 2. (Solutions) Crawler, database, web application, a bunch of clustering algorithms, and lots of "glue"
Step 3. Our proposed solution - Web Slogger!
  - for us: content based advertising
  - for users: easy way to search for web-services

System Architecture

Crawling
Yahoo!
Why not Google? Restricted extraction: could not extract many results
What about Alexa? Couldn't afford it! :-)
What did we crawl for? .wsdl and .asmx files

How is Webslogger different from the Yellow Pages project (last year's class project)?
• Multiple language support

Categorization and Clustering
Glossaries
• Hierarchical categorization (27 categories)
• List of keywords for each category (2800 keywords)

Web Service Partitioning By Importance
Some sections of a web service are more important than others, e.g.
Service Name / Operation Name is more important than message type name.

Affinity Vector
• Weight assigned to each term in Webservice based on its mapping with Glossary
• Determines which web service belongs to which category

Ranking Insight
Fundamental Difference: Web page ranking is based on inlinks and outlinks. Web service ranking should be based on objects and web methods.
Recall: Our results are extracts from search engines.
Therefore:
• We don't know how many pages link to a particular wsdl file.
• Search engine algorithms [i.e. PageRank] have this data and can assert 'popularity', 'credibility' of hubs which locate sources.
Resolution: We must find alternate ways to rank content

Ranking Options
1. Community Level: Collaborative Ranking:
  • users can leave comments,
  • Likert scale ranking
  • rank good users / bad users in the community: experts
2. User Level: Usage statistic ranking:
  • how long you view a wsdl
  • do you go back to look at it again [since it is like an API...]
  • inquire about what wsdl files they used to achieve a goal

Ranking Options (contd.)
3. Use Page Ranking provided by Google / Yahoo
4. File Level: Quality of file:
  • "Do You Care if Your WSDL is W3C Compliant?"
    o Good format, thoroughness. Heuristics on model files.
5. Generate referral chain from WSDL
  o Understand citation network in order to determine valuable web services
  o Web services often use methods / objects from other web services. Use this linking to rank web services.

<?xml version="1.0"?>
<definitions name="StockQuote"
    targetNamespace="http://example.com/stockquote.wsdl"
    xmlns:tns="http://example.com/stockquote.wsdl"
    xmlns:xsd1="http://example.com/stockquote.xsd"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns="http://schemas.xmlsoap.org/wsdl/">
  <message name="SubscribeToQuotes">
    ... element="xsd1:SubscriptionHeader"/>
  </message>
  <portType name="StockQuotePortType">
    <operation name="SubscribeToQuotes">
      ...
    </operation>
  </portType>

www.wbslogger.com

Future work
• Develop our own crawler
• Further improve clustering (there is always room for that!)
• Figure out an innovative (&& effective) way for ranking
• Location based clustering

Questions?
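
Backup: the Affinity Vector and Partitioning By Importance slides can be made concrete with a minimal Python sketch. It is only an illustration under stated assumptions: the section weights, the two-category glossary, and the function names affinity_vector and classify are hypothetical, and the slides do not give the actual weights or the exact scoring rule used by the prototype. Here each glossary match simply adds the weight of the section the term came from, and the category with the highest total wins.

```python
# Hypothetical section weights: the slides only say that service and
# operation names matter more than message type names.
SECTION_WEIGHTS = {"service": 3.0, "operation": 2.0, "message": 1.0}

# Hypothetical toy glossary; the project used 27 categories and
# about 2800 keywords.
GLOSSARY = {
    "finance": {"stock", "quote", "currency"},
    "weather": {"forecast", "temperature"},
}


def affinity_vector(terms_by_section, glossary=GLOSSARY, weights=SECTION_WEIGHTS):
    """Return {category: score}, summing section weights of glossary matches."""
    scores = {category: 0.0 for category in glossary}
    for section, terms in terms_by_section.items():
        weight = weights.get(section, 1.0)  # unknown sections get weight 1.0
        for term in terms:
            for category, keywords in glossary.items():
                if term.lower() in keywords:
                    scores[category] += weight
    return scores


def classify(terms_by_section):
    """Pick the highest-affinity category, or None if no term matched."""
    scores = affinity_vector(terms_by_section)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Terms loosely taken from the StockQuote WSDL fragment, already tokenized.
service = {
    "service": ["stock", "quote"],
    "operation": ["subscribe", "quotes"],
    "message": ["subscription", "header"],
}

print(affinity_vector(service))
print(classify(service))
```

A real implementation would also need tokenization and stemming (for example so that "quotes" matches "quote"), which this sketch deliberately leaves out.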