
Open Search
David Wolber, Pooja Garg
University of San Francisco
Abstract
Open Search is an architecture for facilitating grassroots development of both digital
libraries and metasearch clients. Based on the Open Search Protocol (OSP) and Registry
(OSR), the architecture allows the creators of digital libraries to make their data instantly
available to any OSP-conforming metasearch client. Conversely, the architecture allows
metasearch clients to expose a dynamic list of digital libraries to their users.
Introduction
The web is huge. Search engines that index the entire web do not always provide results
that are relevant to the user. Domain-specific digital libraries can provide better searches
by reducing the size of the information space. Examples include the International Movie
Database for movies, the ACM digital library for computer science, and blogging
libraries such as Technorati. The results provided by such libraries can be more
meaningful and more timely, as their relatively small size means that the data can be
updated in minutes as opposed to the weeks required for crawling the entire web.
Such domain-specific digital libraries are being introduced every day. Technology has
placed the creation of libraries within the grasp of the ordinary computer user. Personal
crawling software allows even an end-user to initiate a crawl by providing a list of seed
pages and topic keywords. Ordinary computers are powerful enough to perform such a
crawl efficiently and periodically, and big enough to store the resulting digital library.
Just as HTML facilitated the rapid creation of web pages, these factors are leading to an
explosion of searchable subsets of the web.
Information seekers can try to manually keep up with newly published digital libraries,
but finding the pertinent sources on a particular topic is becoming more and more
difficult. A key goal of metasearch applications is to help users with this process. They
filter, unify, personalize and rank the results from various sources for human
consumption. They also help users discover digital libraries either explicitly or by
automatically choosing libraries appropriate for a particular query.
Unfortunately, today’s metasearch applications are based on fixed lists of information
sources. For instance, A9 provides access to a set of sources including Google Images,
Amazon’s Search Within a Book, and the International Movie Database. Metasearch.com
provides access to Google, Yahoo, Kanoodle, and others. Such applications are
implemented with custom scrapers or web service consumers for each individual
information source.
Now consider the process by which the global state of metasearch evolves. A new digital
library comes on-line with a web page interface and perhaps web service API access. It
gains some popularity and is discovered by some metasearch applications. If an API is
provided, the metasearch developer writes a consumer for it. If no API is provided, but
the “robots.txt” permissions allow access, a scraper-consumer can be developed. In any
case, days if not weeks are needed to discover a new library and to write code which
extends the metasearch client.
This process is also slow from the perspective of the digital library creator, especially one
who wishes to openly provide access to their data in a timely manner. Such creators
include non-commercial entities such as researchers creating topic-specific libraries, as
well as business entities who want their products disseminated as quickly and widely as
possible. The best these entities can do today is provide a web service API and then ask
particular metasearch clients to add access to it in their next release. There is currently no
mechanism for quickly and easily disseminating the library.
There is a need for an open architecture that allows the immediate dissemination of
digital libraries and the dynamic discovery of digital libraries by metasearch clients. Such
an architecture needs two key elements: a common search API and a registry for sources
to identify themselves. These elements render the development of digital libraries and
metasearch clients independent of each other. By conforming to the API and registering,
digital libraries are instantly available to all metasearch clients. Client software
dynamically accesses the registry to build a list of the currently available sources, and
invokes searches on any source in the list using the operations of the API. In this way, the
global state of metasearch can grow in a grass-roots manner.
In this paper we introduce Open Search, an architecture that addresses this need. It
consists of a search API called the Open Search Protocol (OSP) and a UDDI-based
registry called the Open Search Registry (OSR).
To bootstrap the system, we have developed a number of OSP-conforming information
access services. These include wrappers for the existing APIs of Google, Amazon,
Technorati, Feedster, and the Internet Archive. We have also developed a desktop
application, PublishMe, that allows ordinary users to publish parts of their desktops as
OSP-conforming services. With PublishMe, the system becomes not only an architecture
for enhancing the metasearch information space but also one for peer-to-peer knowledge
sharing.
Besides the services, we have also implemented three metasearch clients based on the
architecture. The clients allow users to send queries to traditional search engines as well
as personal search engines created by PublishMe. For example, a user interested in
metasearch might select Google and “David Wolber” as the search sources for their
queries.
System Architecture
Figure 1. The Open Search architecture. Wrapper services (GoogleWrapper,
AmazonWrapper, TechnoratiWrapper) expose the custom APIs of Google, Amazon, and
Technorati as OSP services, while PublishMe services (PW: Wolber, PW: Brooks) run OSP
services directly on personal machines; all register with the OSP Registry, from which an
OS Client obtains its source list.
Figure 1 illustrates the Open Search architecture. An organization can itself publish a web
service conforming to OSP that runs on its servers, or a third party can publish a web
service that wraps the organization's custom API calls within OSP. We used the latter
method to develop the first OSP services such as the ones from Google and Amazon
shown in Figure 1. With this scheme, an Open Search metasearch client sends calls to the
OSP wrapper service. The wrapper service translates the call to the custom form and
sends it to the server where the data resides. Upon receiving results, the wrapper
translates them into OSP result form and sends them to the client. Though not ideal, in
that twice as many network calls are needed, the scheme does provide a way for OSP
clients to access any digital library with a public API.
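A wrapper's translate-forward-translate cycle can be sketched as follows; the function names and field names are hypothetical stand-ins, not the actual Google API or OSP signatures.

```python
# Sketch of an OSP wrapper service (names are hypothetical). The wrapper
# accepts an OSP-style keyword-search call, translates it into the
# source's custom request form, and translates the native results back
# into OSP result form.

def custom_google_search(query_string):
    """Stand-in for the source's custom API; returns results in the
    source's native shape."""
    return [{"URL": "http://example.org/a", "Title": "A", "Snippet": "..."}]

def osp_keyword_search(keywords, count=10):
    """OSP-facing operation: translate in, forward, translate out."""
    # 1. Translate the OSP call into the custom form.
    query_string = " ".join(keywords)
    # 2. Forward to the server where the data resides.
    native_results = custom_google_search(query_string)
    # 3. Translate native results into OSP result form.
    osp_results = [
        {"url": r["URL"], "title": r["Title"], "summary": r["Snippet"]}
        for r in native_results[:count]
    ]
    return {"total": len(native_results), "results": osp_results}
```

The extra hop is what accounts for the doubled network calls noted above: one call from client to wrapper, one from wrapper to the source.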
Individuals publish their desktops by downloading and executing the PublishMe
software. No wrapper is necessary: PublishMe deploys a server and an OSP-conforming
web service directly on the individual's PC, and metasearch clients communicate directly
with that server. PublishMe also registers the new service with the registry on the user's
behalf.
OSP
Distributed computing and remote procedure call mechanisms have been around for a
long time; DCOM, CORBA, and RMI are three of the most common. Recently, standards have
emerged based on HTTP and XML: WSDL for publishing the interfaces to remote
procedures, SOAP to actually make the remote calls, and UDDI for registering services.
One benefit of this emergence is that most development environments now provide
support so that programmers can code objects and functions in their preferred language,
with the environment handling the plumbing, i.e., the generation of a WSDL specification
file and the conversion of function calls to distributed SOAP calls.
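To make the plumbing concrete, here is a hand-built sketch of the kind of SOAP envelope such a call puts on the wire; the service namespace and operation name are illustrative assumptions, not taken from an actual OSP WSDL.

```python
# Build a minimal SOAP request envelope for a hypothetical search
# operation. A development environment's generated proxy does this
# automatically; the envelope here shows what that plumbing produces.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
OSP_NS = "urn:osp-example"  # hypothetical service namespace

def build_search_envelope(keywords):
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element is named after the WSDL operation.
    op = ET.SubElement(body, f"{{{OSP_NS}}}KeywordSearch")
    kw = ET.SubElement(op, f"{{{OSP_NS}}}keywords")
    kw.text = " ".join(keywords)
    return ET.tostring(envelope, encoding="unicode")
```

In practice the programmer never sees this XML; the environment converts an ordinary function call into the HTTP POST carrying this envelope.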
WSDL and SOAP give businesses the mechanisms necessary to agree on and implement
protocols within a domain. Given an agreed-upon WSDL file, businesses can develop
services on any platform, using any development language and environment. Client
applications can then use UDDI registries to find particular services within the domain,
and access them using the standard defined in the agreed-upon WSDL file. When new
business services are implemented and registered, clients can access them immediately,
without any modification to the client program. This open process has been key to the
proliferation of B2B applications and, more generally, to automating much of the world's
communication processes.
Most web service standards have come from particular business domains. For instance,
Microsoft has published a WSDL interface to which securities information providers can
conform.
OSP, on the other hand, provides a cross-domain protocol, and in particular a protocol for
search-related services: search-related not in the restrictive sense of keyword search, but
in a more general sense that includes various associative operations.
[Figure: domain-specific schemas and the cross-domain OSP schema built on WSDL and
SOAP; specialized registries built on UDDI]
OSP defines a small set of search-related methods; we give an overview of the key
methods here. The protocol can be compared with STARTS and SDARTS. The methods
must accommodate a wide range of sources, including sources that send the documents
themselves over (e.g., personal sources) and sources that return images. The approach can
also be contrasted with browser search plug-ins such as Firefox's, in which one must
submit a description of the source and its result format to an administrator.
Keyword search
In parameters:
keywords -- either as a single string or as a list of words/phrases.
restrictions -- date ranges and the other constraints typically found on an advanced-search
window. A restriction might also name a sub-library (e.g., for Google, News or Groups);
the alternative is to implement sub-libraries as separate services. The current API may call
this a "category".
count -- the maximum number of results to return.
Out parameters:
the total number of results.
the results. Ideally each result would carry a standardized text-matching rank as well as a
popularity measurement, or even lower-level data such as the number of hits and number
of fancy hits, so that a client could rank results however it wants; there might also be
some way for the client to specify how to rank.
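The in and out parameters of keyword search can be sketched as plain data structures; the field names here are assumptions based on the description above, not the protocol's actual WSDL types.

```python
# Sketch of the keyword-search request and response shapes.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KeywordSearchRequest:
    keywords: List[str]                  # or a single query string
    category: Optional[str] = None       # sub-library, e.g. "News"
    date_restriction: Optional[str] = None
    count: int = 10                      # maximum results wanted

@dataclass
class Result:
    url: str
    title: str
    text_match_score: float = 0.0        # standardized text-match rank
    popularity: float = 0.0              # e.g. link-based measure

@dataclass
class KeywordSearchResponse:
    total: int                           # total number of matches
    results: List[Result] = field(default_factory=list)
```

Carrying both a text-match score and a popularity measure per result is what lets a client re-rank however it wants.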
GetCitations (inward links)
In parameter:
metadata -- metadata about the item whose inward links are sought. The metadata object
has fields such as title, url, and perhaps a source-specific id; the source then resolves it
however it can. The alternative to such a scheme is to make the client query the registry
to see what a source does provide (e.g., does it provide getCitations(url)?). Note that
with a RESTful binding, the client could send tagged parameters, e.g., url=xxx or
title=yyy.
Out parameters:
the total number of results.
the results, which here are ranked only on popularity.
Get Outward Links
This method is subtler, since for some items the client can compute the outward links
itself: if the client wants the out-links of a url, it can simply fetch and parse the page.
However, outward links might also be links other than hrefs. For instance, a law
document contains references to cases, e.g., Wolber vs. US; a law service can parse such
references and return links to the cited cases.
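For the common case of href out-links computed on the client, a minimal extraction sketch using only the standard library:

```python
# Extract href out-links from a page's HTML, as a client would do
# when computing outward links itself rather than querying a source.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outward_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

A domain service like the law example would replace the `<a href>` rule with its own reference-recognition logic.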
REGISTRY
UDDI has emerged as an XML standard for web service registries. Because UDDI is a
protocol that allows all types of services to register, we developed a layer on top of UDDI
that provides specific support to OSP metasearch clients. In particular, the WebTop
registry provides metadata about each source, including vocabulary information as was
done with the SDARTS initiative, and it compiles data used to measure a source’s
reputation.
The key interface to the registry is the getSources method. It returns a list of all registered
sources, with the following data for each:
its endpoint url
which of the API methods it provides
a reputation measure
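A client might consume the getSources reply as follows; the record field names are assumptions that mirror the list above, not the registry's actual schema.

```python
# Filter the registry's getSources reply for sources that support a
# needed API method, optionally thresholding on reputation.
def sources_supporting(source_records, method_name, min_reputation=0.0):
    """source_records: list of dicts with 'endpoint_url',
    'methods' (the API methods the source provides), 'reputation'."""
    return [
        s["endpoint_url"]
        for s in source_records
        if method_name in s["methods"] and s["reputation"] >= min_reputation
    ]
```

This is how a metasearch client decides, per query, which registered sources can answer a GetCitations request versus a plain keyword search.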
PUBLISHME
The OSP and registry provide programmers with the ability to create search areas that are
immediately accessible to WebTop clients. We also provide an application, PublishMe,
that allows ordinary computer users to create and publish parts of their desktops as search
areas.
PublishMe is similar to Google Desktop in that it builds and continually updates a search
area from a user’s desktop (documents, email, etc.). Google has been careful, due to
people’s privacy concerns, to implement and characterize the desktop search area as one
which is accessed only by the user herself. PublishMe, on the other hand, provides
facilities so that a user can publish her desktop, or parts of it, as an OSP-conforming web
service running directly on the user's personal computer. PublishMe registers this search
area and service with the Open Search registry so that the user's desktop is immediately
available to all Open Search clients.
The motivation behind PublishMe is that many of us create knowledge every day, yet that
knowledge rarely leaves our desktops; PublishMe gives such everyday experts a way to
share it.
PublishMe consists of a dialog for specifying the parts of the desktop that are "open", a
file system crawler that builds the search area, an OSP-conforming web service, and a tiny
Cassini server that, when deployed, responds to OSP queries from the outside world.
Currently, access specification is rudimentary: the user can specify folders from their file
system which serve as top-level roots of the search area. Given that privacy is an
incredibly important issue, we plan to add sophistication to the access specification,
including the ability to specify individual and group access. See X for a discussion of
privacy.
The file system crawler begins at the top-level roots and builds two data structures: an
inverse index for keyword search, and a link base describing the relationships between
documents (including documents on the web pointed to from local documents, e.g.,
bookmarks). Note that the crawler considers a directory as a list of links, so that directory-contains-file is treated as an outward link just like a hyperlink found within a file.
Bookmarks are considered as well—in fact the bookmarks directory is by default selected
as a top-level root. The linkbase is bi-directional so that the outside world can query a
desktop to see if it has documents that link to a particular url (inward link).
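The two structures can be sketched as follows, with directory-contains-file links fed in alongside hyperlinks; the function and field names are illustrative, not PublishMe's actual code.

```python
# Build the crawler's two structures: an inverse index (term -> set of
# documents) and a bi-directional link base (outward and inward maps).
from collections import defaultdict

def build_structures(documents, links):
    """documents: {path: text}; links: (source, target) pairs, which
    include directory->file containment links as well as hyperlinks."""
    inverse_index = defaultdict(set)
    for path, text in documents.items():
        for term in text.lower().split():
            inverse_index[term].add(path)
    outward = defaultdict(set)
    inward = defaultdict(set)
    for src, dst in links:
        outward[src].add(dst)
        inward[dst].add(src)  # lets outsiders ask "who links to this url?"
    return inverse_index, outward, inward
```

Storing the inward map explicitly is what makes the link base bi-directional, so the desktop can answer inward-link queries from the outside world.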
The crawler runs as a background process that is invoked periodically to keep the inverse
index and linkbase consistent with the file system. We are also experimenting with
handling file system events to help with this process.
The Cassini server deploys a single web service that conforms to OSP and uses the data
compiled from the file system crawl to respond to queries. Upon user login, the server is
deployed and an on-line message is sent to the WebTop registry; when the user logs off,
the registry is notified again.
Metasearch Clients
We have developed three client applications based on the Open Search architecture.
These clients serve as proof-of-concept for the architecture, but are also interesting in
their own right.
The first client, shown in Figure X, provides a file-manager-like tree view enhanced with
search capabilities. The user can browse beginning with root folders, or perform a
keyword search. Results from the local file system and those retrieved from external
sources are displayed together within a tree-view. Whereas, in a traditional file manager,
the user can only expand folders, in this client the user can expand both folders and
documents. Expansion of any node results in information queries being sent to selected
information sources, and the results being displayed at the next level in the tree view. The
user specifies which queries are invoked on node expansion by selecting the active
sources and active associations. Associations include out-links, in-links, and similar-content links.
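The expansion logic amounts to one query per (active source, active association) pair; a sketch, with hypothetical names:

```python
# Expanding a tree node dispatches an information query for every
# combination of active source and active association, and the merged
# results become the node's children.
def expand_node(node_url, active_sources, active_associations, query_fn):
    """query_fn(source, association, url) -> list of child results."""
    children = []
    for source in active_sources:
        for assoc in active_associations:
            children.extend(query_fn(source, assoc, node_url))
    return children
```

With Google and Feedster active and the keyword and inward associations selected, a single expansion therefore issues four queries.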
The snapshot above shows the WebTop web client. Four “preferred” sources are shown,
including one, David Wolber, that is a personal search area. The user can access the
entire list of WebTop-registered sources by clicking on the "More" button.
The user has selected Google and Feedster as the active information sources, and
performed a traditional search with the keyword “metasearch”. The system has responded
by listing three results from both Google and Feedster.
Next, the user clicks on the + next to the third result, expanding “Mamma Metasearch”:
Because the associations “Keyword” and “Inward” are selected, the system sent both a
keyword search query and an inward link query to the active sources. For “Keyword”
expansions, the system performs TFIDF on the document to come up with a set of
characterizing words. In this case qtypeselected, arial, and qtypes were extracted from the
Mamma metasearch page and sent to both of the search engines. Neither search engine
returned results for that combination of words. Note that the system lists the
automatically identified query words in the right-top corner instead of hiding this
automation from the user.
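A minimal version of the TF-IDF extraction might look like this; the real client's corpus and tokenization are certainly more elaborate.

```python
# Pick the k highest TF-IDF terms of a document relative to a small
# background corpus; these become the characterizing query words.
import math
from collections import Counter

def characterizing_words(doc, corpus, k=3):
    """doc: the expanded page's text; corpus: other documents (strings)
    used as the background for document frequency."""
    doc_terms = doc.lower().split()
    tf = Counter(doc_terms)
    n_docs = len(corpus) + 1
    scores = {}
    for term, count in tf.items():
        # +1 document frequency for the document itself.
        df = 1 + sum(term in other.lower().split() for other in corpus)
        scores[term] = (count / len(doc_terms)) * math.log(n_docs / df)
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

Terms common across the corpus score near zero, which is why boilerplate words like arial can nonetheless surface when the background corpus is small.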
The “Inward” link query did provide the three results from Google, each of which
displays an inward arrow to the left. Each of these documents, e.g., the one titled “PUC
Library” contains a link to Mamma Metasearch (or at least Google thinks it does). As no
results were returned from Feedster, none of the documents in its database point to
Mamma Metasearch.
Note that the “Outward” association is not selected as an association type. If it were, the
expansion would have also displayed hyperlinks found within Mamma Metasearch.
Outlinks, in the current system, do not result in queries to information sources, but are
handled by the client parsing the document itself.
We call this search-enhanced tree an IOC Tree, as the user can expand along inward,
outward, and content-related associations. The IOC Tree gives a user a bird's-eye view of
a topic. For instance, if inward and outward associations are selected, the view can show
the history of a research area from the original seminal paper to its current derivations.
The user can interact with this view at her leisure, expanding nodes to view their
derivations or predecessors.
Content associations, on the other hand, allow for the discovery of similar works that are
not explicitly linked. In the context of browsing one's own files, one can discover files
that are similar to the ones in the current directory but for whatever reason have been
filed in a different bin.
Whereas the first client integrates distributed information queries within a file manager
tree view, our second client integrates them within a browser. With this second client,
queries are invoked each time a new page is loaded into the central frame. The results
from these queries are displayed in sidebars that provide context for the current page:
[Figure: the browser client, with context panels surrounding the current page]
As with the first client, the user can specify the associations and search areas for the
queries—in this case, they are specified for the left, right, and bottom context panels. The
default configuration places the inward links of the current page in the left context panel,
the outward links in the right panel, and the content links in the bottom panel, with
Google as the source for inward and content, and the outward links computed from the
client parsing the page. The user can easily change the selected source and association
type for each panel.
When the user clicks on a link within one of the panels or the open page or enters a new
url, a new page is loaded in the center frame, and the context panels update based on their
specification.
Automatically invoking information queries within applications has been called both
just-in-time information access and zero-input information access. Another interesting
application, besides the file management and browser examples here, is to include
information queries during the creative process, i.e., while the user works with a word
processor. Watson[], Margin Notes[] and Powerscout are early examples of such systems.
What zero-input access provides is impromptu information discovery. Certainly, a user
could open a search engine separately from the other desktop application being used, and
when the need arises explicitly invoke information queries. Automated queries, with
results displayed on the periphery, allow users to discover contextual information even
when they are not explicitly looking for it.
The flip-side, of course, is that displaying context can be an annoyance to the user.
Within the client code, the registry's getSources call yields the list of sources, and
invoking a search on a particular source is simply a matter of re-pointing the web service
proxy at that source's endpoint url: a form of web service polymorphism.
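This polymorphism can be sketched as follows: the same keyword-search call works against any registered source, with only the endpoint url changing. All names here are hypothetical, not the client's actual code.

```python
# Sketch of web-service polymorphism in an Open Search client: one
# proxy type, many endpoints, the same search operation on each.
class OSPProxy:
    """Stand-in for a generated web-service proxy whose endpoint can be
    re-pointed at any OSP-conforming service."""
    def __init__(self, endpoint_url, transport):
        self.endpoint_url = endpoint_url
        self.transport = transport  # callable(url, keywords) -> results

    def keyword_search(self, keywords):
        return self.transport(self.endpoint_url, keywords)

def metasearch(sources, keywords, transport):
    """sources: endpoint urls obtained from the registry's getSources."""
    merged = {}
    for url in sources:
        merged[url] = OSPProxy(url, transport).keyword_search(keywords)
    return merged
```

Because every source speaks OSP, the client needs no per-source scraper or consumer: a newly registered source is searchable the moment it appears in the getSources list.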