A platform for improving Web Services accessibility
Roland Krystian Alberciak, Piotr Kozikowski, Sudnya Padalikar, Tushar Sugandhi
{krystian, piotr.kozikowski, sudnya.padalikar, tusharsugandhi}@gatech.edu
Project Proposal

Abstract
We concern ourselves with the problem of improving web service accessibility.
We aim to find better solutions to the problems of searching, classifying, and
accessing web services over the Internet. We propose to extend existing and novel
work in this domain, investigate the shortcomings of current solutions in locating,
ranking, and packaging web services for their users, and provide the next steps in
improving upon those issues.
Motivation and Objectives
In our work, we propose to produce a web service directory that lists
substantially more web services, in a substantially better format, than current-day
WS directories. Danesh Irani et al. [6] found that current-day directories account
for only a small fraction of existing web services. Comparing WS directory listing
services, they found that no existing directory listed more than about 1,000
services. However, when the authors searched the web for string patterns
representative of WSs, they found up to 36,000 listings for websites that were
possible WS candidates.
This suggests that the current techniques for building and maintaining directories
of WSs are inadequate at identifying the full market share of these services. We
therefore see commercial prospects in our work: developing techniques that give
better exposure to owners of web services, and building a revenue model around this
knowledge discovery.
We therefore outline the following objectives for our work:
[1] Design an architecture and develop a system which can crawl and index web services
over the Internet.
[2] Make the system flexible enough to identify these services despite the
different standards by which they are defined.
[3] Provide clustering and ranking for better assessment of web services and make them
accessible to more people.
[4] Make a usable web interface through which people can query for web services, and
use the results more conveniently for development and personal use.
Related work
We familiarized ourselves with prior studies on this topic. Two projects [5, 6]
were of particular interest to us, since their work is closely in line with what we
are trying to achieve. Woogle [5] is a search engine that finds similar web service
operations; its algorithm reportedly covers over 1,500 web service operations.
Although semantics do not play a significant role in general web search, they are
essential for checking the similarity of operations. The crux of Woogle is an
algorithm that clusters parameter names into semantically meaningful concepts,
which Woogle then uses to find similar web services.
Yellowpages [6] focuses on web service discovery and categorization. The authors
used the existing Google APIs for WSDL discovery (although this limited the search
to 1,000 results per query and a limited number of queries per day). Web service
classification was done by manually labeling a portion of the WSs and testing the
classification algorithms on this labeled data set, with considerable accuracy.
There is scope for increasing classification accuracy by adding more labels. An
interesting finding of this study was that many web services do not define WSDLs,
meaning there are many ad-hoc APIs available on the Internet whose authors have not
codified them in WSDL format. The lack of a WSDL makes it hard to discover and
probe those web services automatically and accurately.
Proposed work
We plan to modularize this project into the following tasks for better tracking
and manageability:
1. Crawler improvements
We have several options for the crawler we will use to crawl the Internet for
web services. The first option is to write our own crawler, building on one of the
many open source web crawlers available. The second option is to use a search
engine's API (e.g., the Google APIs, where Google does the crawling). However,
Google no longer supports its SOAP APIs publicly, and it restricts use of this API
to a limited number of queries per day and only 1,000 results per query. Another
option is to buy crawling services from providers like Amazon [4], Alexa [8],
xMethods [9], programmableweb [7], etc.
We will evaluate these options and choose the best one given our constraints:
resources, availability of funds, and time.
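Whichever crawler we choose, it will need a step that decides whether a fetched document is a web service description. The following is a minimal sketch of that filter, using simple heuristics of our own; the function names and thresholds are illustrative assumptions, not part of any existing crawler.

```python
# Heuristic filter a crawler could apply to fetched documents and to its URL
# frontier. Function names and heuristics are our own illustrative assumptions.

def looks_like_wsdl(document):
    """Heuristically decide whether a fetched document is a WSDL description.

    A WSDL file has a <definitions> root element and references the WSDL
    namespace (http://schemas.xmlsoap.org/wsdl/).
    """
    doc = document.lower()
    return "<definitions" in doc and "schemas.xmlsoap.org/wsdl" in doc

def candidate_urls(urls):
    """Keep only URLs that commonly serve WSDL documents (?wsdl or .wsdl)."""
    return [u for u in urls
            if u.lower().endswith("?wsdl") or u.lower().endswith(".wsdl")]
```

For example, `candidate_urls(["http://example.com/svc?wsdl", "http://example.com/index.html"])` keeps only the first URL, and the fetched body is then confirmed with `looks_like_wsdl` before indexing.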
2. Clustering/Classification Algorithms improvements
Danesh Irani et al. [6] have done a good amount of work in extracting web
services on the Internet using the Google Search APIs. However, their scope was
limited to web service search and classification. We propose to extend their work
by enhancing the clustering/classification algorithms they have developed. Here,
our focus is on improving the semantics of classification, so that semantically
related web services are linked together. We discuss database-level indexing and
caching for efficient storage and retrieval in the following sections.
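To illustrate the kind of semantic grouping we have in mind, here is a minimal sketch, our own construction rather than the published algorithm from [5] or [6], that clusters operations by the token overlap of their parameter names:

```python
import re

def tokens(name):
    """Split a camelCase or snake_case parameter name into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return {p.lower() for p in parts}

def cluster_operations(ops, threshold=0.5):
    """Greedy clustering: an operation joins the first cluster whose token set
    has Jaccard similarity >= threshold with the operation's own tokens.

    ops is a list of (operation_name, [parameter names]) pairs; the threshold
    is an illustrative assumption.
    """
    clusters = []  # list of (representative token set, member operation names)
    for op_name, params in ops:
        toks = set().union(*(tokens(p) for p in params)) if params else set()
        for rep, members in clusters:
            jaccard = len(toks & rep) / max(len(toks | rep), 1)
            if jaccard >= threshold:
                members.append(op_name)
                rep.update(toks)
                break
        else:
            clusters.append((toks, [op_name]))
    return [members for _, members in clusters]
```

With this sketch, operations whose parameters share tokens (e.g. `cityName` and `city_name`) fall into the same cluster even though the raw strings differ.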
3. Caching and Indexing
We also propose to index and cache the web services returned by the crawler. We
plan to crawl the Internet periodically and cache and index the resulting web
services so that we can cluster and classify them efficiently. This will also
enable end users to retrieve search results efficiently.
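As a sketch of this indexing and caching layer, assuming a simple in-memory inverted index (the class and method names are our own illustrations, not a committed design):

```python
from collections import defaultdict

class ServiceIndex:
    """Cache of crawled service descriptions plus a keyword inverted index."""

    def __init__(self):
        self.cache = {}                # service id -> cached description text
        self.index = defaultdict(set)  # keyword -> set of service ids

    def add(self, service_id, description):
        """Cache a crawled service and index it by the words in its description."""
        self.cache[service_id] = description
        for word in description.lower().split():
            self.index[word].add(service_id)

    def search(self, query):
        """Return ids of services whose description contains every query word."""
        words = query.lower().split()
        if not words:
            return set()
        results = self.index[words[0]].copy()
        for word in words[1:]:
            results &= self.index[word]
        return results
```

A production version would persist the cache and index in a database rather than in memory, but the retrieval path is the same: queries hit the index, never the live services.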
4. Ranking
We also propose a novel technique for ranking web services based on parameters
such as:
- Usage: If a web service is used by a large number of web applications, it is
more valuable and thus ranks higher in our system.
- Reference: As in Google's PageRank system for web pages, if a web service has a
large indegree from other web pages (not web services), its rank is higher.
- Embedded: The number of web pages that embed a web service also plays an
important role in evaluating its importance.
- User rating: Users who have used or visited a web service should be able to rate
it; we can use these ratings when calculating its rank.
We will consider these parameters while designing a ranking algorithm for our
system.
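As a rough illustration, the four signals above could be folded into a single score. This is a minimal sketch with assumed weights and log-scaling, not our finalized ranking algorithm:

```python
import math

def rank_score(usage, indegree, embed_count, avg_rating,
               weights=(0.3, 0.3, 0.2, 0.2)):
    """Weighted combination of the four ranking parameters described above.

    usage, indegree, embed_count: non-negative counts, log-scaled so that a
    few very popular services do not dominate the score; avg_rating: mean
    user rating on a 0-5 scale. The weights are illustrative assumptions.
    """
    w_use, w_ref, w_emb, w_rate = weights
    return (w_use * math.log1p(usage)
            + w_ref * math.log1p(indegree)
            + w_emb * math.log1p(embed_count)
            + w_rate * (avg_rating / 5.0))
```

The design choice here is that counts contribute logarithmically while user ratings contribute on a bounded linear scale; the real algorithm would tune the weights against manually ranked test queries.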
5. Status indication of web service availability
This is another innovative feature we propose. We want to provide availability
information, i.e., whether a web service is up and fully functional, working
partially, or not working at all. Users will see this information when they search
for a web service. To our knowledge, no other web service search engine provides
this kind of information.
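The status shown to users could be derived from periodic probes of a service's operations. A minimal sketch, assuming the probing mechanism (not shown) reports one boolean per operation:

```python
def availability_status(probe_results):
    """Map per-operation probe outcomes (True = responded correctly) to one of
    the three statuses shown to users; the status names are illustrative."""
    if not probe_results:
        return "unknown"
    if all(probe_results):
        return "available"
    if any(probe_results):
        return "partially running"
    return "down"
```

For example, a service where two of three probed operations respond correctly would be reported as "partially running".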
6. Web Interface and Accessibility to our data
We propose to build a GUI that offers the following functionality:
- Search for web services based on keywords / classification
- Display the status of a web service {available / partially running / down}
- User review system: users should be able to post reviews / comments for a web
service, and to rate it; we can use these ratings when ranking web services.
7. Choosing Architecture
We will evaluate both peer-to-peer and centralized server architectures for their
suitability to the system we propose to build. A centralized system is less complex
and will be easier to build, test, and deploy. Since our crawler will not run very
frequently, it does not need to be extremely fast, so we expect a single server
should suffice to run the crawler and host the GUI component. This solution,
however, will not be appropriate if the number of users of our system grows
significantly. A peer-to-peer system, on the other hand, would take more effort to
build and test but offers much better scalability, as computing capacity would grow
proportionally to the user base without adding dedicated resources. We will explore
both approaches, but the most feasible option seems to be the centralized approach;
if scalability becomes an issue, we will leave that as an opportunity for future
development of this project.
8. Background Research
Last but not least, we plan to do an extensive background study before
implementing the tasks described above. To enhance our understanding of the web
service indexing options available online and to prioritize the tasks mentioned
above, we intend to study the search engine companies that already do WS indexing.
We want to understand the shortcomings of their approaches, avoid those pitfalls,
and thus be more effective in our work. We may also contact these companies for
more information: whether they are still working on web service search engines, and
what new features they intend to add. For instance, some resources like wsoogle
were academic publications and may not have been worked on after the conference
paper was published.
We also wish to contact key people in the field of web services search in order
to understand the problem domain a little better. Our team discussed the
possibility of contacting the W3C to inquire about web services, particularly why
so few websites are identified as web services. We have already explored the UDDI
and WSIL standards and would like to explore these concepts more thoroughly.
Plan of action
Schedule
The following is a timeline of the project's main activities and milestones:
- February 15: Project proposal submission
- February 16 - 28: Establish primary and secondary list of features and specific
technologies to be used, and research related work like Woogle, wsoogle, Alexa, etc.
- February 29: Final project definition
- March 1 - 15:
  o Analyze code from Danesh Irani et al. [6] and integrate it with our crawler
  o Design GUI component
  o Design indexing and ranking system
- March 16 - 30:
  o Refine clustering and classification mechanisms
  o Build GUI prototype
  o Implement indexing and ranking system
- April 1 - 15: Integrate all parts and test
- April 15: Final prototype implementation deadline
- April 16 - 27: Continue testing
- April 15 - 24: Project presentation
- April 21 - 25: Project demo
- April 27: Final project submission
Evaluation and Testing Method
The project will consist of several components: crawling, indexing, clustering
and classification, ranking, and GUI. Each will be thoroughly tested as an
independent module with the following acceptance criteria:
- Crawling: Reliably obtain at least 1,000 unique web service results by improving
the crawler developed by Danesh Irani et al. [6]. This figure is conservative and
will be updated after our background study.
- Indexing and search: The system should efficiently and correctly retrieve web
services in response to predefined test queries. The index should contain all items
found by the crawler. We plan to design the test queries to cover as many scenarios
as possible.
- Ranking: This is subjective and difficult to test, but we can manually select
relevant web services that should be in the top results for certain queries and
test the ranking system against them. We also plan to include the numeric rank
values in the result set for testing purposes.
- Clustering and classification: Our improved clustering and classification must
produce at least 50% fewer false positives and false negatives than the original
work by Danesh Irani et al. [6].
- GUI: This component should be up and running on a server and pass a series of
tests ensuring that its interface with the other components works as designed.
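The clustering/classification criterion can be checked mechanically against a manually labeled set. A minimal sketch (the function names and example labels are illustrative):

```python
def fp_fn(predicted, actual):
    """Count false positives and false negatives for binary labels
    (1 = belongs to the category, 0 = does not)."""
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return fp, fn

def meets_criterion(ours, baseline, actual):
    """True if our classifier produces at most half as many false positives
    and false negatives as the baseline on the same labeled data."""
    fp_new, fn_new = fp_fn(ours, actual)
    fp_old, fn_old = fp_fn(baseline, actual)
    return fp_new <= 0.5 * fp_old and fn_new <= 0.5 * fn_old
```

This is evaluated per category on the manually labeled portion of the WSs, mirroring the test setup used in [6].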
Once all modules pass the acceptance tests, the system as a whole will be fully
integrated and tested. We expect to have a working prototype without any major bugs
before the project presentation to the class, and will use the remaining time
before the final submission to correct minor bugs, if any.
References:
[1] Web Services Description Language
http://www.w3.org/TR/wsdl
[2] UDDI
http://en.wikipedia.org/wiki/Universal_Description_Discovery_and_Integration
[3] WSIL
http://www.ibm.com/developerworks/library/specification/ws-wsilspec/
[4] Alexa Web Search Platform on Amazon
http://www.amazon.com/gp/browse.html?node=269962011
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=801&categoryID=120
[5] Similarity Search for Web Services - Xin Dong, Alon Halevy, Jayant Madhavan, Ema
Nemes, Jun Zhang
[6] Yellowpages for Web Services - Andrew Cantino & Danesh Irani
[7] Programmable Web
http://www.programmableweb.com/
[8] Alexa Search
http://www.alexa.com/
[9] xMethods
http://www.xmethods.net/ve2/index.po