WebSlogger: A Web Services Search Engine
www.wbslogger.com
CS8803: Advanced Internet Application Development
Spring 2008
Class Project
Team Members:
Roland Krystian Alberciak (krystian@gatech.edu)
Piotr Kozikowski (piotr.kozikowski@gatech.edu)
Sudnya Padalikar (sudnya.padalikar@gatech.edu)
Tushar Sugandhi (tusharsugandhi@gatech.edu)
1. Abstract:
The objective of the “WebSlogger” project is to improve the accessibility and
usability of web services. We worked on improving techniques for searching, clustering,
classifying and ranking the web services available on the Internet. We investigated the
shortcomings of current solutions for locating and ranking web services, and we suggest
several steps to improve the existing solutions and thereby facilitate the search,
classification and ranking of web services on the Internet.
2. Introduction:
Web services are a very popular way for businesses to communicate with each
other, as well as with clients. Web services are meant to promote the principles of
modularity and code reusability. Many web services perform tasks such as looking up
product details, converting currencies, etc. The standard way to define a web service is
with the Web Services Description Language (WSDL): the web service is defined in
terms of its logical port types, through which it exposes its functionality, its web methods,
and its data types (inputs and outputs). A web service may refer to other XSD schemas or
other web services for the data types it uses. These varied web services are consolidated
via a platform-independent web service registry system called Universal Description,
Discovery and Integration (UDDI), which acts as a local index of web services.
Some online directories like Woogle, XMethods, ProgrammableWeb, etc. categorize web
services and provide a search facility for them, but the number of web services registered
on these sites is very limited and does not reflect the number of web services actually
available on the Internet. In this project we built an automated system that discovers web
services and clusters and categorizes them according to our clustering algorithms. The
main components of our project are the discovery of WSDL files on the Internet using a
search engine API (the Yahoo Search API), automated clustering and categorization of
the web services thus discovered, and ranking of those web services to improve the
quality of search results. We also use clustering and ranking algorithms that differ from
the conventional algorithms used for web page ranking.
The remainder of the paper is organized as follows. The next section describes
previous work in this field. We then discuss the design of our website,
www.wbslogger.com: the way we crawled the Web to fetch web services, the clustering
algorithm we used to categorize them, and our proposed ranking techniques designed
specifically for web services. Finally, we present experimental results and concluding
remarks.
3. Previous Work:
We familiarized ourselves with the studies that have been done on this topic. Of
those, two projects [5, 6] were of particular interest to us, since their work was very much
in line with what we are trying to achieve. There is currently a search engine called
Woogle [5] that provides a facility for finding similar web service operations; Woogle
claims to have an algorithm that covers over 1,500 web service operations. Although
semantics do not play a significant role in general web search, they are essential for
checking the similarity of operations. The crux of Woogle is an algorithm that clusters
parameter names into semantically meaningful concepts, which Woogle then uses to find
similar web services. Yellowpages [6] focuses more on web service discovery and
categorization. They used the existing Google APIs for WSDL discovery (although this
limited the search to 1,000 results and restricted the number of queries per day). Web
service classification was done by manually labeling a portion of the web services and
testing the classification algorithms on this labeled data set, with considerable accuracy;
there is still scope for increasing classification accuracy by adding more labels. An
interesting finding of this study was that many web services do not define WSDLs, which
means there are many ad-hoc APIs on the Internet whose authors have not codified them
in WSDL format. The lack of a WSDL description makes it hard to discover and probe
those web services automatically and accurately.
4. WebSlogger System Design:
Figure 1: System Architecture
Figure 1 shows the design overview of our system. The main components of this system
are –
• Yahoo Search API based web services retrieval engine
• Web Services Storage Database (MySQL based)
• Web Services Clustering Engine
  o Glossary metadata used for automated clustering
• Web Services Ranking Engine
  o User Rating System
  o Automated Referential Ranking System
• Web Interface / GUI
4.1 Web Services Retrieval Engine:
During our background research we found that the Yellow Pages project [6]
had used Google's SOAP API and retrieved results from three main top-level domains:
.com, .net and .org. There is also an existing, publicly available web crawl known as the
Alexa Web Search Platform, a service that makes the entire Alexa web crawl data set
available for a reasonable cost. We estimate that it should be possible to retrieve a list of
all WSDLs from this service for a nominal cost (probably $50 to $200).
We also studied the related work that built Woogle [5]. This study proposed a set
of similarity search primitives for web service operations and described algorithms for
implementing these searches efficiently. Woogle's work was mainly based on exploiting
the structure of web services and employing a novel clustering mechanism to group
parameter names into meaningful concepts, but the project had not implemented
automatic web service invocation.
We had several options for retrieving web services from the Web. We looked at
the search APIs offered by several search engines, mainly Google and Yahoo. After some
background research we found that we could not use Google's SOAP API, since it has
been deprecated since December 2006. Google has since released an Ajax-based API, but
its main problem was that it returned restricted results, limiting us to roughly 1,000 results
per day. In contrast, the Yahoo Search API returned 100 results per query and allowed
5,000 queries per day, which effectively removed the constraints on fast retrieval of web
services from the Web. In addition, the Yahoo API provides multilingual support, which
enabled us to search domains in 38 different languages. We therefore decided to use
Yahoo's Search API and ran searches for .wsdl and .asmx files.
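To make the retrieval process concrete, the sketch below shows how such a query loop might look in Python using the requests library. The endpoint URL, parameter names, application ID and result field names reflect our recollection of the 2008-era Yahoo Web Search REST API and are assumptions for illustration only, not the project's actual code.

    # Illustrative sketch only: the endpoint, parameter and field names below
    # are assumed from the 2008-era Yahoo Web Search REST API; the service has
    # since been retired, so treat this as pseudocode.
    import requests

    SEARCH_URL = "http://search.yahooapis.com/WebSearchService/V1/webSearch"  # assumed
    APP_ID = "YOUR_APP_ID"  # hypothetical application id

    def fetch_wsdl_urls(query, pages=5, per_page=100):
        """Collect candidate WSDL URLs for one query, 100 results per request."""
        urls = []
        for page in range(pages):
            params = {
                "appid": APP_ID,
                "query": query,              # e.g. "originalurlextension:wsdl"
                "results": per_page,
                "start": page * per_page + 1,
                "output": "json",
            }
            resp = requests.get(SEARCH_URL, params=params, timeout=30)
            resp.raise_for_status()
            results = resp.json().get("ResultSet", {}).get("Result", [])
            urls.extend(r["Url"] for r in results)
        return urls

    # The two query forms used for retrieval:
    # fetch_wsdl_urls("originalurlextension:wsdl")
    # fetch_wsdl_urls("originalurlextension:asmx")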
We successfully downloaded 16,000 web services from the Internet, retrieving
files with the extensions “.wsdl” and “.asmx?WSDL”. We understand that these queries
do not completely cover the web services domain; many more web services can be found
with the query “inurl:wsdl”. We did not retrieve those web services because we lacked
the time and resources to process them in the clustering and ranking engines. We also
realized that quite a few of the retrieved web services were duplicates, so we worked on
separating unique files from the highly repetitive data we had downloaded.
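Because duplicate WSDL files typically differ only in where they were downloaded from, a simple content hash is enough to keep one copy of each. The following is a minimal sketch of that idea; the directory layout and file naming are assumptions.

    # Minimal deduplication sketch: keep one copy of each distinct WSDL body.
    import hashlib
    from pathlib import Path

    def unique_wsdl_files(download_dir):
        """Return one representative path per distinct WSDL content hash."""
        seen = {}
        for path in Path(download_dir).glob("*.wsdl"):  # assumed naming scheme
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            seen.setdefault(digest, path)   # first occurrence wins
        return list(seen.values())

    # e.g. unique_wsdl_files("downloads/") keeps the unique subset of the crawl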
The following table shows the number of web services retrieved against each
language:
Language        # Web Services      Language        # Web Services
Latvian         60                  Lithuanian      70
Greek           100                 Indonesian      10
Catalan         20                  Korean          110
Norwegian       430                 Arabic          50
Croatian        120                 Swedish         400
Portuguese      910                 French          1620
Dutch           460                 Russian         1100
Finnish         300                 Slovenian       190
Spanish         602                 Estonian        190
Persian         90                  German          1201
Turkish         460                 Thai            80
Czech           470                 Hungarian       310
Icelandic       120                 Italian         1280
Romanian        90                  Polish          280
English         1987                Danish          840
Japanese        1450

Table 1: Web services categorized by language
4.2 Clustering and Classification Engine:
While analyzing the best approach for classification and categorization, we found
that the structure of a web service description (WSDL) file provides many hints about the
meaning and usefulness of that file. We used hierarchical clustering to exploit the
structure of the web service files. We came up with glossaries for 27 categories containing
2,800 keywords in total, modeled after the expertise of one of our team members who has
experience with clustering algorithms and ontologies. The distribution of keywords
among the categories is shown in Table 3.
Glossary of words for clustering: We compiled a glossary of terms popular in
each category into which we classify web services. The web service components used for
classification are listed in the order in which they appear in the WSDL file; the priority of
each section for the classification algorithm is given in Table 2.
Section in Web Service Description File         Priority
Documentation                                   2
Message -> Name                                 5
Message -> Part -> Name                         7
PortType -> Name                                3
PortType -> Operation -> Name                   4
PortType -> Operation -> Input -> Message       6
PortType -> Operation -> Output -> Message      6
Service -> Name                                 1

Table 2: Web Service Section Priority for Clustering Algorithm
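The prioritized sections of Table 2 can be pulled out of a WSDL file with a standard XML parser. The following is a minimal sketch using Python's xml.etree module; it simplifies namespace handling and is not the project's actual parser.

    # Sketch of extracting the prioritized WSDL sections from Table 2 with
    # Python's standard library. Namespace handling is simplified; real WSDL
    # files mix several namespaces.
    import xml.etree.ElementTree as ET

    WSDL_NS = "{http://schemas.xmlsoap.org/wsdl/}"

    def extract_sections(wsdl_path):
        """Return (section_label, text) pairs used by the classifier."""
        root = ET.parse(wsdl_path).getroot()
        sections = []
        for doc in root.iter(WSDL_NS + "documentation"):
            sections.append(("documentation", (doc.text or "").strip()))
        for msg in root.iter(WSDL_NS + "message"):
            sections.append(("message_name", msg.get("name", "")))
            for part in msg.findall(WSDL_NS + "part"):
                sections.append(("message_part_name", part.get("name", "")))
        for port in root.iter(WSDL_NS + "portType"):
            sections.append(("porttype_name", port.get("name", "")))
            for op in port.findall(WSDL_NS + "operation"):
                sections.append(("operation_name", op.get("name", "")))
        for svc in root.iter(WSDL_NS + "service"):
            sections.append(("service_name", svc.get("name", "")))
        return sections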
Cluster Name                    # Keywords
Entertainment                   130
Travel                          15
Legal and Financial             49
Health and Medicine             380
Airlines                        84
Retail                          55
Social Services                 121
Recreation and sport            804
Transport and Relocation        50
Finance                         107
Builders                        61
Automotive Insurance            106
Education and Instruction       91
Cruises                         223
Food                            217
Hospital                        140
Computers and Internet          70
Library                         17
Automotive                      64
Credit and Debt Services        148
Real Estate                     149
Hotels                          74
Legal                           72
Property Manager                12

Table 3: Distribution of glossary terms in clusters
Our rationale for this clustering strategy is web service partitioning by
importance: some sections of a web service file are more important than others. For
example, the service name and operation names are more important than message type
names, since they sit higher on the semantic level and provide more hints about the
overall WSDL file.
We use a data structure called the affinity vector, a vector whose size equals the
number of clusters in the system. A web service may belong to one or many clusters; this
is determined by correlating the words in the various sections of the web service
(weighted by section priority) with the words in each cluster's glossary. The affinity
vector of a web service therefore indicates which clusters the web service is most closely
related to. Currently we assign the web service to the cluster with the highest weight in its
affinity vector, but we intend to relax this constraint in the future.
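The following is a minimal sketch of how an affinity vector might be computed from the extracted sections and the category glossaries. The 1/priority weighting and the tokenization are illustrative assumptions; WebSlogger's exact weighting scheme is not reproduced here.

    # Sketch of the affinity vector idea: weight glossary hits by the priority
    # of the WSDL section they occur in (Table 2; priority 1 is the most
    # important, so 1/priority is used as an assumed weight).
    import re

    SECTION_PRIORITY = {          # from Table 2
        "service_name": 1, "documentation": 2, "porttype_name": 3,
        "operation_name": 4, "message_name": 5,
        "operation_io_message": 6, "message_part_name": 7,
    }

    def affinity_vector(sections, glossaries):
        """sections: (label, text) pairs; glossaries: {cluster: set(keywords)}."""
        scores = {cluster: 0.0 for cluster in glossaries}
        for label, text in sections:
            weight = 1.0 / SECTION_PRIORITY.get(label, 7)
            # split CamelCase identifiers and punctuation into lowercase tokens
            tokens = [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+", text)]
            for cluster, keywords in glossaries.items():
                scores[cluster] += weight * sum(t in keywords for t in tokens)
        return scores

    def assign_cluster(sections, glossaries, default="Others"):
        """Assign the single best cluster, falling back to 'Others' on no hits."""
        scores = affinity_vector(sections, glossaries)
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else default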
We also demonstrate language support in our clustering, since the WSDL files are
extracted from the Web and cataloged by the language domain they come from. Because
programming identifiers are usually in English, we find that many programmers also
write their WSDL documentation in English, which lets us understand and interpret a
foreign WSDL file by exploiting this simple observation. The observation is supported by
the results in Table 1.
Our improved clustering and classification is not a simple dictionary attack, and we
observed that it produced fewer false positives and false negatives than the original work
by Danesh Irani et al. [6] on the Yellow Pages project.
4.3 Ranking Engine:
The fundamental aim of our project is to provide up-to-date and relevant ranking of
web services. Our ranking plans are two-fold and reflect the way we expect the project to
grow.
Because our crawling methodology limits our ability to rank WSDL files by
popularity, we realized that our ranking scheme would have to be based, at least in part,
on the users who use our service and contribute opinions about the web services they
locate through it. Ignoring user contributions would force us toward a purely automatic
ranking scheme, which we unfortunately do not have the resources to provide without
more statistics and data. However, after user ratings accumulate over time through our
interface, introducing our automatic ranking algorithms would give us a way to compare
the effectiveness of (a) pure collaborative ranking versus (b) an automatic ranking
algorithm with no user input. This comparison is subjective and difficult to test, but we
can manually select relevant web services that should appear in the top results for certain
queries and test the ranking systems against them.
The collaborative ranking we have implemented accomplishes the following key
objectives:
• User comments - users can leave comments, which lets us build a library of
  (hopefully constructive and usage-detailed) praise and complaints for each web
  service.
• Likert scale rating - users can apply their own judgment and give each web service
  a rating that holistically takes into account all of their experiences with that and
  other web services they have used.
• Uptime/downtime - can you rely on the web service to be there when you need it?
  We want to give programmers this information so they can set their expectations
  of the service.
Figure 2: Screenshot of User Rating System
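To illustrate how these signals might be folded into a single score, the sketch below averages the Likert ratings for a web service and blends in an uptime fraction. The 0.7/0.3 weights and the function signature are purely illustrative assumptions, not the formula used on the site.

    # Illustrative blend of user Likert ratings and measured uptime into one
    # collaborative score. Weights and field names are assumptions for the
    # sketch, not WebSlogger's actual formula.
    def collaborative_score(likert_ratings, up_checks, total_checks):
        """likert_ratings: list of 1-5 user ratings; uptime from periodic probes."""
        if not likert_ratings or total_checks == 0:
            return 0.0
        avg_rating = sum(likert_ratings) / len(likert_ratings)    # 1..5
        uptime = up_checks / total_checks                         # 0..1
        return 0.7 * (avg_rating / 5.0) + 0.3 * uptime            # 0..1

    # e.g. collaborative_score([5, 4, 4, 3], up_checks=97, total_checks=100) ~= 0.85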
We have also implemented some 'quality of format' checks, which attempt to
weed out incomplete or erroneous web service files. As a file is parsed for inclusion into
our search results, we analyze it for errors; if errors are detected, the file is rejected and is
not included in our search results. We do this to meet a base level of quality in our results:
our WSDL files should be viable candidates for use, and we knowingly do not serve
faulty or defective results. Ideally we can expand this approach to answer more profound
questions, such as whether good formatting and thoroughness correlate with a good
WSDL file; heuristics over model web service files could help us answer such questions.
A deeper question raised by standards debates in the web community is adherence to
standards, and whether browsers should correct bad HTML code, etc. With WSDL files,
we believe the debate on "Do you care if your WSDL is W3C compliant?" should
conclude that standards matter: WSDL files are written by developers for developers, so
format does matter and cannot simply be 'auto-corrected'.
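The following is a minimal sketch of such a 'quality of format' gate: it rejects files that are not well-formed XML or that define no service element. Full W3C/WSDL schema validation would be stricter; this is an illustrative simplification.

    # Minimal sketch of the 'quality of format' gate: reject files that are
    # not well-formed XML or define no <service> element.
    import xml.etree.ElementTree as ET

    WSDL_NS = "{http://schemas.xmlsoap.org/wsdl/}"

    def is_viable_wsdl(path):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            return False                       # malformed XML: reject
        return root.find(WSDL_NS + "service") is not None

    # Only files passing the gate are indexed, e.g.:
    # candidates = [p for p in unique_wsdl_files("downloads/") if is_viable_wsdl(p)]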
The following ranking avenues remain that we wish to explore:
• Generate a referral chain from WSDL: As explained earlier in this paper, web
  services can (and in many cases do) use web methods and data types from other
  web services; that is, they refer to other web services for their operation. This
  scenario is analogous to a web page having links to other web pages, so similar
  algorithms can be applied to the web services involved in referential chains. That
  is, if a web service publishes data types or web methods that are used by many
  web services, or by popular web services, then the rating of that web service
  should be higher.
• Since we do not have access to crawler data, we wish to emulate a citation
  network in order to determine valuable web services. A 'pseudo' PageRank
  approach could analyze the dependencies between web service files; web services
  often use methods and objects from other web services, and we should use this
  linking to rank them. A sketch of this idea appears after this list.
• Other approaches for ranking are as follows:
  o Rank good and bad users in the community: experts.
  o User-level usage statistics:
    - How long a user views a WSDL.
    - Whether a user comes back to look at it again (since a WSDL is like an API).
    - Asking users which WSDL files they used to achieve a goal.
• Another key idea is aging. We must presume that web services go 'stale', as newer
  web services that provide the same functionality may do so with more
  state-of-the-art languages. Introducing an aging component to the rankings so that
  'stale' results get excluded would be quite valuable.
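The sketch below shows one way the 'pseudo' PageRank over WSDL dependencies could look, using the standard power-iteration form of PageRank over a graph of WSDL reference edges. The damping factor and the way edges are obtained are conventional assumptions, not WebSlogger's implementation.

    # Power-iteration PageRank over a web service dependency graph. Edges are
    # assumed to come from wsdl:import / xsd:import references; damping 0.85
    # is the conventional choice, not a WebSlogger-specific value.
    def pagerank(links, damping=0.85, iterations=50):
        """links: {service_url: [referenced_service_urls]} -> {url: score}."""
        nodes = set(links) | {v for targets in links.values() for v in targets}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
            for src, targets in links.items():
                if not targets:
                    continue   # dangling nodes simply leak rank in this sketch
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            rank = new_rank
        return rank

    # e.g. pagerank({"a.wsdl": ["types.xsd"], "b.wsdl": ["types.xsd"], "types.xsd": []})
    # ranks the shared schema highest, mirroring the referral-chain intuition.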
4.4 Graphical User Interface:
An early goal of our project was to develop an application that people would
choose to use not only because it fulfilled their objectives and delivered high-quality
content, but also because the interface was easy and enjoyable to use. We took cues from
current, successful web search engines in order to model an interface that supports many
different kinds of interaction, particularly "recognition" and "recall" user behaviors. We
felt early on that letting a user either search or browse was a necessary feature of our
system, and our primary goals of developing categories and indexing search results lent
themselves well to accomplishing these interaction aims. We kept the GUI simple by
reducing the clutter of on-screen content through white space, contrasting colors for
semantic boundaries, and CSS style sheets, in order to support a pleasant experience.
Figure 3: WebSlogger GUI Overview
In developing our GUI, we surveyed the options for accomplishing this task by
reviewing programming languages that are well established for web development and the
facilities they offer for rapid and extensible development. We wanted to make a choice
that could support future development should this project be continued, while also
providing an opportunity to learn new skills. Ultimately, we leaned toward languages that
harness the power of MVC frameworks, since we felt there should be a level of maturity
in our work.
We considered PHP and Ruby on Rails, since we had (or were compelled to gain)
some experience with these languages and because of their popularity in web application
development. On the PHP front, we have had some experience working with the
CakePHP framework, which provides MVC support and rapid application development.
Reviews of Ruby on Rails suggested that it affords a faster development process, since it
'standardizes' some development practices. Although users of Ruby on Rails have
reported issues where RoR sites 'collapse' under high load
(http://it.slashdot.org/article.pl?sid=07/09/09/215230&from=rss), we found that the
promise of learning something new, of gaining new perspectives for viewing, abstracting
and solving problems, and of eventually reducing development cost over the code's
maintenance lifetime was the deciding factor in selecting this language.
5. Experimental Results:
Most of the experimental results we want to showcase in this section have already
been mentioned and referenced throughout this document. We would like to reiterate
some of the key findings of our experiments:
• Search engines like Google and Yahoo suggest that the number of WSDL files
  available on the Web is huge:
  o Google -> filetype:asmx returns 32,000 results
  o Google -> filetype:wsdl returns 14,800 results
  o Yahoo -> originalurlextension:wsdl returns 75,400 results
  o Yahoo -> originalurlextension:asmx returns 168,000 results
  But when we used the Yahoo Search APIs, the results we got were:
  o Yahoo -> originalurlextension:wsdl returns 8,000 results
  o Yahoo -> originalurlextension:asmx returns 7,000 results
  Thus, either Yahoo does not provide all of its search results to API users, or there
  is a discrepancy between the results provided by the search APIs and by the
  search web page. The conclusions of the Yellow Pages project [6] suggest that this
  behavior holds for Google as well. This was quite astonishing behavior.
• Further, even though we retrieved 16,000 web services from the Web, only 3,500
  of them were unique. This behavior is also observed with Google, as pointed out
  by the Yellow Pages project [6]. We were quite surprised by the degree of
  duplication on the Web.
• As part of our experimental research, we conducted a feature analysis of different
  web service search and catalogue websites. It was quite striking that even though
  Yahoo and Google report that the number of unique web services on the Web is
  on the order of 5,000, there is hardly any website that has all of these web services
  registered on it. A feature comparison of WebSlogger with other websites is
  shown in Figure 4.
• Further, it was surprising to learn that organizations like Microsoft, IBM, SAP,
  etc. have decided to stop supporting the most popular web service registration and
  discovery system, UDDI.
Figure 4: WebSlogger: Feature Comparison
5.1 Results of Clustering Engine
We used the following data as input to our clustering engine:
  o 16,000 web services in 38 different languages
  o 2,800 keywords as glossary terms
  o 27 hierarchical clusters
The distribution of web services over the clusters is shown below.
• Automotive (0)
  - Automotive Insurance (0)
• Community (0)
  - Library (0)
  - Social Services (0)
• Computers and Internet (0)
• Education and Instruction (0)
• Entertainment (220)
• Food (0)
  - Restaurants (0)
• Health and Medicine (12)
  - Health Insurance (0)
  - Hospital (0)
• Legal and Financial (884)
  - Credit and Debt Services (11)
  - Legal (0)
  - Finance (0)
• Others (9906)
• Real Estate (63)
  - Builders (0)
  - Property Manager (0)
• Recreation and sport (455)
• Retail (0)
• Transport and Relocation (25)
  - Airlines (2)
  - Cruises (21)
  - Hotels (2)
  - Travel (0)
As mentioned earlier, because our clustering algorithm is immature and our
glossaries are not perfect, a large number of web services end up in the “Others” cluster.
6. Future Work:
In our work we met our stated objectives and established a proof-of-concept web
service search engine called "WebSlogger". However, given more resources and time, we
would like to mature the following aspects of our deployment:
6.1 Crawl the WWW vs. extraction mining
In our original objectives we sought to reliably obtain about 1,000 unique results.
Though we surpassed that goal, we resorted to pseudo-crawling by sending extraction
queries to a search engine. Ideally, should the resources and cooperation be available, we
would like to perform an authentic crawl of the Web for WSDL files, or, in the worst
case, perform extraction from multiple different search engines.
This would accomplish the objective of asserting the 'popularity' or 'credibility' of
WSDL sources and of the hubs that identify them. Via extraction alone, we have no way
to know how many pages link to a particular WSDL file unless the WSDL file itself
reports these chains by including them in its namespace section. Crawling would allow us
to identify what other people on the Web consider important, particularly when it comes
to web services.
6.2 Improving Clustering
As mentioned in the clustering section, we plan to introduce an automatic
feedback-based categorization scheme. At the moment the proposed scheme requires
some training, but in the future it should be capable of automatic classification (at least
mostly automatic; we do expect some tweaking to be required). Furthermore, the
glossaries we use for classification are immature and incomplete, and there is a lot of
scope for improving their quality. In addition, the cluster labels are hard-coded; an
automated algorithm could add new clusters, remove unwanted ones, and detect false
positive and false negative classifications based on user feedback.
6.3 Improving Ranking
Recall that in the ranking section we discussed our implementation of collaborative
ranking metrics and speculated that future rankings would attempt to converge toward the
presumed PageRank-style metric of hubs and sinks.
6.4 Location Based Clustering
The original spirit of the Yellow Pages project, which we sought to advance, was to
provide a mechanism whereby web services online can help generate business for local
businesses. Web searches often return results from the entire Internet, so providing a
means to identify local talent is a new commercial venture that many municipalities
intend to offer in order to promote a sustainable atmosphere of business and collaboration.
6.5 Executable Web Services [http://xml.nig.ac.jp/wsdl/index.jsp]
To help programmers and WSDL authors develop and share solid code, we wish to
provide a way to visualize web service parameters. We would like to automatically parse
a web service and present its inputs and outputs to non-technical users, along with
commentary about the web service generated from automatic method invocation.
Archiving input/output pairs for a web service's methods would let users look over
sample uses of a WSDL file and verify usage by example rather than by documentation.
We find that many web services are sparsely documented yet contain valuable
methods that could help any programmer. If we could sample these methods, we could
generate documentation about a WSDL file. Better yet, if users could test web services
through an interface on our website, their 'testing' of the WSDL file could be saved and
archived for the benefit of future users.
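As an illustration of what automatic invocation and input/output archiving might look like, here is a sketch using the zeep SOAP client, a modern library chosen purely for illustration (it was not part of this project). The sample arguments would have to be generated automatically or supplied by users.

    # Illustration of automatic method invocation and input/output archiving,
    # using the zeep SOAP client (chosen for this sketch only; not part of the
    # original WebSlogger implementation).
    import json
    from zeep import Client

    def sample_operations(wsdl_url, sample_args, archive_path="samples.jsonl"):
        """Invoke each (operation, kwargs) pair and archive the results."""
        client = Client(wsdl_url)
        with open(archive_path, "a") as archive:
            for operation, kwargs in sample_args.items():
                try:
                    result = getattr(client.service, operation)(**kwargs)
                    record = {"wsdl": wsdl_url, "op": operation,
                              "input": kwargs, "output": str(result)}
                except Exception as exc:      # archive failures too
                    record = {"wsdl": wsdl_url, "op": operation,
                              "input": kwargs, "error": str(exc)}
                archive.write(json.dumps(record) + "\n")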
7. Conclusions:
The most interesting contributions of WebSlogger to the global community are that
it lets programmers locate and use web services, provides an easy means to do so via a
web-based application that can be accessed from any OS with minimal time and effort,
and permits users to collaboratively review the content we provide.
The learning curve involved in implementing this project was quite steep. We had
to learn various technologies to implement this website, such as:
• Ruby on Rails
• Registering and using a domain and web host on the Web
• The Web Services Description Language specification
• The process of picking up from a previous project and working with the previous
  team to extend and improve the idea
• Programming languages like Python, Java and C#
The various concepts and techniques that we learned in class could be used to
extend WebSlogger to the next level:
• Better support for location-aware queries
• Better scaling of our system architecture
• Tracking the web services we locate in order to detect changes and introduce those
  changes into our database incrementally
Overall, after investigating how current web service registration, search and
ranking systems perform, we feel that these systems are at a very preliminary stage and
that much work remains to be done in this domain. As Web 2.0 gains popularity, with
advances in areas like media search (video/audio), content-based advertising, and the
Web on mobile devices, the importance of web services cannot be denied; they will play
a major role in this advancement. A well-designed and sophisticated web service search
engine will only help achieve the desired progress in the fields mentioned above. We
sincerely believe that WebSlogger can turn out to be an important stepping stone in this
field, but a lot of hard work is needed to bring it to the level we envision.
8. References:
[1] Web Services Description Language (WSDL). http://www.w3.org/TR/wsdl
[2] Universal Description, Discovery and Integration (UDDI).
http://en.wikipedia.org/wiki/Universal_Description_Discovery_and_Integration
[3] Web Services Inspection Language (WSIL).
http://www.ibm.com/developerworks/library/specification/ws-wsilspec/
[4] Alexa Web Search Platform on Amazon.
http://www.amazon.com/gp/browse.html?node=269962011 and
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=801&categoryID=120
[5] Xin Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang. Similarity Search
for Web Services.
[6] Andrew Cantino and Danesh Irani. Yellowpages for Web Services.
[7] ProgrammableWeb. http://www.programmableweb.com/
[8] Alexa Search. http://www.alexa.com/
[9] XMethods. http://www.xmethods.net/ve2/index.po