Open Search
David Wolber, Pooja Garg
University of San Francisco

Abstract

Open Search is an architecture for facilitating grassroots development of both digital libraries and metasearch clients. Based on the Open Search Protocol (OSP) and the Open Search Registry (OSR), the architecture allows the creators of digital libraries to make their data instantly available to any OSP-conforming metasearch client. Conversely, it allows metasearch clients to expose a dynamic list of digital libraries to their users.

Introduction

The web is huge, and search engines that index the entire web do not always provide results that are relevant to the user. Domain-specific digital libraries can provide better searches by reducing the size of the information space. Examples include the Internet Movie Database for movies, the ACM Digital Library for computer science, and blogging libraries such as Technorati. The results provided by such libraries can be more meaningful and more timely, as their relatively small size means that the data can be updated in minutes, as opposed to the weeks required for crawling the entire web.

Such domain-specific digital libraries are being introduced every day. Technology has placed the creation of libraries within the grasp of the ordinary computer user: personal crawling software allows even an end-user to initiate a crawl by providing a list of seed pages and topic keywords, and ordinary computers are powerful enough to perform such a crawl efficiently and periodically, and big enough to store the resulting digital library. Just as HTML facilitated the rapid creation of web pages, these factors are leading to an explosion of searchable subsets of the web.

Information seekers can try to manually keep up with newly published digital libraries, but finding the pertinent sources on a particular topic is becoming more and more difficult. A key goal of metasearch applications is to help users with this process. They filter, unify, personalize, and rank the results from various sources for human consumption. They also help users discover digital libraries, either explicitly or by automatically choosing libraries appropriate for a particular query.

Unfortunately, today's metasearch applications are based on fixed lists of information sources. For instance, A9 provides access to a set of sources including Google Images, Amazon's Search Within a Book, and the Internet Movie Database. Metasearch.com provides access to Google, Yahoo, Kanoodle, and others. Such applications are implemented with custom scrapers or web service consumers for each individual information source.

Now consider the process by which the global state of metasearch evolves. A new digital library comes on-line with a web page interface and perhaps web service API access. It gains some popularity and is discovered by some metasearch applications. If an API is provided, the metasearch developer writes a consumer for it. If no API is provided, but the "robots.txt" permissions allow access, a scraper-consumer can be developed. In any case, days if not weeks are needed to discover a new library and to write the code that extends the metasearch client.

This process is also slow from the perspective of the digital library creator, especially one who wishes to openly provide access to their data in a timely manner. Such creators include non-commercial entities, such as researchers creating topic-specific libraries, as well as businesses that want their products disseminated as quickly and widely as possible.
The best these entities can do today is provide a web service API and then ask particular metasearch clients to add access to it in their next release. There is no mechanism for quickly and easily disseminating the library.

What is needed is an open architecture that allows the immediate dissemination of digital libraries and the dynamic discovery of digital libraries by metasearch clients. Such an architecture needs two key elements: a common search API and a registry in which sources identify themselves. These elements render the development of digital libraries and metasearch clients independent of each other. By conforming to the API and registering, digital libraries are instantly available to all metasearch clients. Client software dynamically accesses the registry to build a list of the currently available sources, and invokes searches on any source in the list using the operations of the API. In this way, the global state of metasearch can grow in a grassroots manner.

In this paper we introduce Open Search, an architecture that addresses this need. It consists of a search API called the Open Search Protocol (OSP) and a UDDI-based registry called the Open Search Registry (OSR). To bootstrap the system, we have developed a number of OSP-conforming information access services, including wrappers for the existing APIs of Google, Amazon, Technorati, Feedster, and the Internet Archive. We have also developed a desktop application, PublishMe, that allows ordinary users to publish parts of their desktops as OSP-conforming services. With PublishMe, the system becomes not only an architecture for enhancing the metasearch information space but also one for peer-to-peer knowledge sharing. Besides the services, we have implemented three metasearch clients based on the architecture. The clients allow users to send queries to traditional search engines as well as to personal search engines created by PublishMe. For example, a user interested in metasearch might select Google and "David Wolber" as the search sources for their queries.

System Architecture

[Figure 1: The Open Search architecture. An OS Client obtains a source list from the OSP Registry and sends queries to OSP services: wrapper services for the Google, Amazon, and Technorati APIs, and PublishMe (PW) services, such as Wolber and Brooks, running on personal computers.]

Figure 1 illustrates the Open Search architecture. An organization can itself publish a web service conforming to OSP that runs on its servers, or a third party can publish a web service that wraps the organization's custom API calls within OSP. We used the latter method to develop the first OSP services, such as the Google and Amazon services shown in Figure 1. With this scheme, an Open Search metasearch client sends calls to the OSP wrapper service. The wrapper service translates each call to the custom form and sends it to the server where the data resides. Upon receiving results, the wrapper translates them into OSP result form and sends them to the client. Though not ideal, in that twice the number of network calls are necessary, the scheme does provide a way for OSP clients to access any digital library with a public API. Individuals publish their desktops by downloading and executing the PublishMe software. No wrapper is necessary: PublishMe deploys a server and an OSP-conforming web service directly on the individual's PC, and metasearch clients communicate directly with that server.
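To make the wrapper scheme concrete, the following is a minimal sketch of the translation steps for a keyword search against a Google-like source. Python is used only for illustration, and all names and message shapes here are our own assumptions; the deployed wrappers are SOAP web services defined by the OSP WSDL file, and the native search call is shown only as a stub.

    from dataclasses import dataclass
    from typing import List

    # Hypothetical OSP message shapes; field names are illustrative,
    # not taken from the actual OSP WSDL.
    @dataclass
    class OSPQuery:
        keywords: List[str]
        category: str = ""        # optional sublibrary, e.g. "News"
        max_results: int = 10

    @dataclass
    class OSPResult:
        title: str
        url: str
        snippet: str = ""

    def google_native_search(query_string: str, num: int):
        """Stub for the source's native call (e.g., Google's own search
        API); a real wrapper would issue that request here."""
        raise NotImplementedError

    def osp_keyword_search(query: OSPQuery) -> List[OSPResult]:
        # 1. Translate the OSP call into the source's native form.
        native_query = " ".join(query.keywords)
        # 2. Forward it to the server where the data resides.
        native_results = google_native_search(native_query, query.max_results)
        # 3. Translate the native results back into OSP result form.
        return [OSPResult(title=r["title"], url=r["url"],
                          snippet=r.get("snippet", ""))
                for r in native_results]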
Each OSP service, whether a third-party wrapper or a PublishMe service running on a personal computer, registers itself with the OSR (as shown in Figure 1) so that clients can discover it dynamically.

OSP

Distributed computing and remote procedure call mechanisms have been around a long time; DCOM, CORBA, and RMI are three of the most common. Recently, standards have emerged based on HTTP and XML: WSDL for publishing the interfaces to remote procedures, SOAP for actually making the remote calls, and UDDI for registering services. One benefit of this emergence is that most development environments now provide support so that programmers can code objects and functions in their preferred language, with the environment handling the plumbing, i.e., the generation of a WSDL specification file and the conversion of function calls to distributed SOAP calls.

WSDL and SOAP give businesses the mechanisms necessary to agree on and implement protocols within a domain. Given an agreed-upon WSDL file, businesses can develop services on any platform, using any development language and environment. Client applications can then use UDDI registries to find particular services within the domain, and access the services using the standard defined in the agreed-upon WSDL file. When new business services are implemented and registered, clients can access them immediately, without the client program being modified; we refer to this property as client and web service polymorphism. This open process has been a key to the proliferation of B2B applications and, more generally, to automating much of the world's communication processes.

Most web service standards have come from particular business domains. For instance, Microsoft has published a WSDL interface to which securities information providers can conform. OSP, on the other hand, provides a cross-domain protocol, in particular a protocol for search-related services, where "search-related" is meant not in the restrictive sense of keyword search but in a more general sense that includes various associative operations.

[Figure: domain-specific schemas alongside the cross-domain OSP schema, both built on WSDL and SOAP, with specialized registries alongside UDDI.]

The OSP API

What follows is an overview of the key methods the protocol provides; a detailed comparison with STARTS and SDARTS is left for future work. Open issues include sources that send whole documents over (e.g., personal sources), some form of interface inheritance, image results, and a comparison with Firefox's search-plugin scheme, in which one must submit a source to an administrator along with a description of how its results are formatted. A sketch of the protocol as an interface follows the KeywordSearch description below.

KeywordSearch

In parameters:
- keywords: either a single string or a list of words/phrases.
- restrictions: date and other constraints of the kind found in an advanced-search window.
- category: possibly a sublibrary, e.g., News or Groups for Google. The alternative is to implement sublibraries as separate services; the current API includes a "category" parameter for this purpose.

Out parameters:
- count: the total number of results.
- results: the matching documents. Ideally, results would carry a standardized text-matching rank as well as a popularity measurement, or even lower-level data such as the number of hits and the number of fancy hits; a client could then rank results however it wished, and there might even be a way for the client to specify the ranking scheme.
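To summarize the protocol before detailing the remaining methods, here is a sketch of the operations as an interface. Python is used only for illustration, and every name and type here is our own assumption; the normative definition of OSP is its WSDL file. GetCitations and GetOutwardLinks are described next.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Metadata:
        # Any subset of these fields may be supplied; the source uses
        # whichever it can handle.
        title: Optional[str] = None
        url: Optional[str] = None
        source_id: Optional[str] = None  # a source-specific identifier

    @dataclass
    class SearchResponse:
        count: int                       # total number of matching results
        results: List[dict] = field(default_factory=list)

    class OSPSource(ABC):
        """Illustrative sketch of the three core OSP operations."""

        @abstractmethod
        def keyword_search(self, keywords: List[str],
                           restrictions: Optional[dict] = None,
                           category: Optional[str] = None) -> SearchResponse:
            """Keyword search with optional advanced-search restrictions
            and an optional sublibrary (category)."""

        @abstractmethod
        def get_citations(self, metadata: Metadata) -> SearchResponse:
            """Inward links: documents that link to the described item."""

        @abstractmethod
        def get_outward_links(self, metadata: Metadata) -> SearchResponse:
            """Outward links, including references other than hrefs."""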
GetCitations (inward links)

In parameter:
- metadata: metadata about the item whose inward links are sought. The metadata object has fields such as title, url, and perhaps a source-specific id; the source then handles whichever fields it can. The alternative to such a scheme is to make the client query the registry to see what a source provides, e.g., whether it offers getCitations(url). Note that with a RESTful binding the client could simply send tagged parameters, e.g., url=xxx or title=yyy.

Out parameters:
- count: the total number of results.
- results: here, the results are ranked on popularity only.

GetOutwardLinks

This method is somewhat subtle, as for some things the client can compute the outward links itself: if the client wants the outlinks of a url, it can simply parse the page. However, outward links might also be links other than hrefs. For instance, a law document contains references to cases, e.g., Wolber vs. US; a law service can parse such references and return links to the referenced cases.

REGISTRY

UDDI has emerged as an XML standard for web service registries. Because UDDI is a protocol that allows all types of services to register, we developed a layer on top of UDDI that provides specific support to OSP metasearch clients. In particular, the OSR provides metadata about each source, including vocabulary information as was done in the SDARTS initiative, and it compiles data used to measure a source's reputation. The key interface to the registry is the getSources method. It returns a list of all registered sources, including for each:
- its endpoint url,
- which of the API methods it provides, and
- a reputation measure.

PUBLISHME

The OSP and registry give programmers the ability to create search areas that are immediately accessible to Open Search clients. We also provide an application, PublishMe, that allows ordinary computer users to create and publish parts of their desktops as search areas. PublishMe is similar to Google Desktop in that it builds and continually updates a search area from a user's desktop (documents, email, etc.). Google, due to people's privacy concerns, has been careful to implement and characterize the desktop search area as one which is accessed only by the user herself. PublishMe, on the other hand, provides facilities so that a user can publish her desktop, or parts of it, as an OSP-conforming web service running directly on her personal computer. PublishMe registers this search area and service with the OSR so that the user's desktop is immediately available to all Open Search clients. The motivation behind PublishMe is that many of us create knowledge every day, but that knowledge typically remains locked on our desktops; publishing it turns each desktop into a potential expert source.

PublishMe consists of a dialog for specifying the parts of the desktop that are "open", a file system crawler that builds the search area, an OSP-conforming web service, and a tiny Cassini server that, when deployed, responds to OSP queries from the outside world. Currently, access specification is rudimentary: the user specifies folders from the file system that serve as top-level roots of the search area. Given that privacy is an incredibly important issue, we plan to add sophistication to the access specification, including the ability to grant access to individuals and groups. See X for a discussion of privacy.

The file system crawler begins at the top-level roots and builds two data structures: an inverse index for keyword search, and a link base describing the relationships between documents (including documents on the web pointed to from local documents, e.g., bookmarks). Note that the crawler treats a directory as a list of links, so that directory-contains-file is an outward link just like a hyperlink found within a file. Bookmarks are considered as well; in fact, the bookmarks directory is selected as a top-level root by default. The link base is bi-directional, so that the outside world can query a desktop to see whether it has documents that link to a particular url (inward link).
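The following is a minimal sketch of these two structures, with illustrative Python in place of the actual PublishMe implementation (index construction details such as tokenizing are elided). It shows how directory containment enters the link base exactly as a hyperlink would:

    import os
    from collections import defaultdict

    inverse_index = defaultdict(set)  # term -> set of local file paths
    out_links = defaultdict(set)      # document -> items it links to
    in_links = defaultdict(set)       # item -> local documents linking to it

    def add_link(src: str, dst: str) -> None:
        # Record the link in both directions so that inward-link
        # (GetCitations) queries are as cheap as outward ones.
        out_links[src].add(dst)
        in_links[dst].add(src)

    def index_file(path: str) -> None:
        try:
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
        except OSError:
            return
        for term in text.lower().split():
            inverse_index[term].add(path)
        # Hyperlinks and bookmarks found inside the file would also be
        # recorded with add_link; link extraction is elided here.

    def crawl(root: str) -> None:
        for name in os.listdir(root):
            child = os.path.join(root, name)
            # directory-contains-file is treated as an outward link,
            # just like a hyperlink found within a file.
            add_link(root, child)
            if os.path.isdir(child):
                crawl(child)
            else:
                index_file(child)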
The crawler runs as a background process that is invoked periodically to keep the inverse index and link base consistent with the file system. We are also experimenting with handling file system events to help with this process. The Cassini server deploys a single web service that conforms to OSP and uses the data compiled from the file system crawl to respond to queries. Upon user login, the server is deployed and an on-line message is sent to the registry; when the user logs off, the registry is notified as well.

Metasearch Clients

We have developed three client applications based on the Open Search architecture. These clients serve as proofs-of-concept for the architecture, but are also interesting in their own right.

The first client, shown in Figure X, provides a file-manager-like tree view enhanced with search capabilities. The user can browse beginning with root folders, or perform a keyword search. Results from the local file system and those retrieved from external sources are displayed together within a tree view. Whereas in a traditional file manager the user can only expand folders, in this client the user can expand both folders and documents. Expanding any node sends information queries to selected information sources and displays the results at the next level of the tree view. The user specifies which queries are invoked on node expansion by selecting the active sources and active associations; associations include out-links, in-links, and similar-content links.

The snapshot above shows the WebTop web client. Four "preferred" sources are shown, including one, David Wolber, that is a personal search area. The user can access the entire list of registered sources by clicking the "More" button. The user has selected Google and Feedster as the active information sources, and performed a traditional search with the keyword "metasearch". The system has responded by listing three results from both Google and Feedster. Next, the user clicks on the + next to the third result, expanding "Mamma Metasearch".

Because the associations "Keyword" and "Inward" are selected, the system sends both a keyword-search query and an inward-link query to the active sources. For "Keyword" expansions, the system performs TFIDF on the document to derive a set of characterizing words. In this case, qtypeselected, arial, and qtypes were extracted from the Mamma metasearch page and sent to both of the search engines; neither returned results for that combination of words. Note that the system lists the automatically identified query words in the top-right corner instead of hiding this automation from the user. The "Inward" link query did return three results from Google, each of which displays an inward arrow to the left. Each of these documents, e.g., the one titled "PUC Library", contains a link to Mamma Metasearch (or at least Google thinks it does). As no results were returned from Feedster, none of the documents in its database point to Mamma Metasearch.

Note that "Outward" is not selected as an association type. If it were, the expansion would also have displayed the hyperlinks found within Mamma Metasearch. Outlinks, in the current system, do not result in queries to information sources; they are computed by the client parsing the document itself.
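As a sketch of this client-side outlink computation (illustrative Python, not the clients' actual code), the hrefs of a page can be collected with a standard HTML parser:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class OutlinkParser(HTMLParser):
        """Collects the outward hyperlinks of a page locally, so no
        query to an information source is needed."""

        def __init__(self, base_url: str):
            super().__init__()
            self.base_url = base_url
            self.outlinks = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # resolve relative hrefs against the page url
                        self.outlinks.append(urljoin(self.base_url, value))

    def outward_links(url: str, html: str):
        parser = OutlinkParser(url)
        parser.feed(html)
        return parser.outlinks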
We call this search-enhanced tree an IOC Tree, as the user can expand along inward, outward, and content-related associations. The IOC Tree gives a user a bird's-eye view of a topic. For instance, if inward and outward associations are selected, the view can show the history of a research area, from the original seminal paper to its current derivations. The user can interact with this view at her leisure, expanding nodes to view their derivations or predecessors. Content associations, on the other hand, allow for the discovery of similar works that are not explicitly linked. In the context of browsing one's own files, one can discover files that are similar to those in the current directory but, for whatever reason, have been filed in a different bin.

Whereas the first client integrates distributed information queries within a file manager tree view, our second client integrates them within a browser. With this second client, queries are invoked each time a new page is loaded into the central frame, and the results are displayed in sidebars that provide context for the current page.

[Figure: the browser client, with context panels to the left, right, and bottom of the central frame.]

As with the first client, the user can specify the associations and search areas for the queries; in this case, they are specified for the left, right, and bottom context panels. The default configuration places the inward links of the current page in the left context panel, the outward links in the right panel, and the content links in the bottom panel, with Google as the source for inward and content links and the outward links computed by the client parsing the page. The user can easily change the selected source and association type for each panel. When the user clicks on a link within one of the panels or the open page, or enters a new url, a new page is loaded in the center frame and the context panels update based on their specifications.

Automatically invoking information queries within applications has been called both just-in-time information access and zero-input information access. Another interesting application, besides the file manager and browser described here, is to include information queries during the creative process, i.e., while the user works with a word processor; Watson [], Margin Notes [], and PowerScout are early examples of such systems. What zero-input access provides is impromptu information discovery. Certainly, a user could open a search engine separately from the other desktop application being used, and explicitly invoke information queries when the need arises. Automated queries, with results displayed on the periphery, allow users to discover contextual information even when they are not explicitly looking for it. The flip side, of course, is that the displayed context can be an annoyance to the user.

Within each client, the code is source-independent: the client calls the registry's getSources method to obtain the current list of sources, and then invokes the same OSP operations on any source in the list simply by directing the call to that source's endpoint url. This is web service polymorphism at work; new sources require no client modification.
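The following sketch, reusing the illustrative shapes from earlier and with the network calls stubbed out, shows this polymorphism: the loop body is identical for every source, and a newly registered source is picked up on the next getSources call.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class SourceInfo:
        name: str
        endpoint_url: str
        methods: List[str]   # which OSP methods the source provides
        reputation: float    # registry-compiled reputation measure

    def get_sources(registry_url: str) -> List[SourceInfo]:
        """Stub for the registry's getSources call."""
        raise NotImplementedError

    def call_keyword_search(endpoint_url: str, keywords: List[str]) -> List[dict]:
        """Stub for an OSP KeywordSearch call against one endpoint; in
        the deployed system this is a SOAP request defined by the WSDL."""
        raise NotImplementedError

    def metasearch(registry_url: str, keywords: List[str]) -> Dict[str, List[dict]]:
        # The same code path serves every source: only the endpoint url
        # changes, and newly registered sources appear automatically,
        # with no modification to the client.
        results = {}
        for source in get_sources(registry_url):
            if "KeywordSearch" in source.methods:
                results[source.name] = call_keyword_search(
                    source.endpoint_url, keywords)
        return results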