Analysis Environments for Community Repositories:

Technology Trends for Information Retrieval in the Net
Bruce R. Schatz
Information Infrastructure evolves as better technology becomes available to support basic
needs. For technology to be mature enough to be incorporated into standard infrastructure, it
must be sufficiently generic. That is, the technology must be robust and readily adaptable to
many different applications and purposes.
For Information Infrastructure to support Semantic Retrieval in a fundamental way, several
new technologies must be incorporated into the standard support to facilitate Concept
Navigation. In particular, the rise of four technologies is critical: document protocols for information retrieval, extraction parsers for noun phrases, statistical indexers for context computations, and communications protocols for peer-to-peer computations. Together, these generic
technologies support semantic indexing of community repositories.
A document can be stored in a standard representation. Concepts can be extracted from a
document with some level of semantics. These concepts can be utilized to transform a document
collection into a searchable repository, by indexing the documents with some level of semantics.
Finally, the resultant indexing can be utilized to semantically federate the knowledge of a
community, by concept navigation across distributed repositories that comprise relevant sources.
The Rise of Web Document Protocols has made it possible to store documents in a standard
representation. Prior to the worldwide adoption of a single format to represent documents,
collections were limited to those that could be administered by a single central organization.
Prime examples were Dialog, for bibliographic databases consisting of journal abstracts, and
Lexis/Nexis, for full-text databases consisting of magazine articles.
The widespread adoption of WWW (World-Wide Web) Protocols enabled global information
retrieval, which in turn increased the volume to the point that semantic indexing has become
necessary to enable effective retrieval. In particular, the current situation was caused by the
universal distribution of servers that store documents in HTML (HyperText Markup Language)
and retrieve documents using HTTP (HyperText Transfer Protocol). Many more
organizations could now maintain their own collections, since the information retrieval
technology was now standard enough to enable information providers to directly store their own
collections, rather than transferring them to a central repository archive.
Standard protocols implied that a single program could retrieve documents from multiple
sources. Thus the WWW protocols enabled the implementation of Web browsers. In particular,
NCSA Mosaic proved to be the right combination of streamlined standards and flexible
interfaces to attract millions of users to information retrieval for the first time [1].
As the number of documents increased, identifying initial documents from which to begin hypertext browsing became a major problem. Web searchers then began to overtake web browsers as the primary interface to the global information space. Searches across so many documents with such variance exposed the weakness of syntactic search, such as the word matching used within Dialog, and increased the demand for semantic indexing embedded within the infrastructure.
The response of the WWW designers has been to extend the markup languages from formatting to typing. HTML has tags (markups) to indicate how to display phrases, such as
centering or boldfacing. HTML has now evolved into XML (eXtensible Markup Language),
which enables syntactic specification of many types of units, including custom applications.
XML is the web version of markup languages used in the publishing industry, such as SGML
(Standard Generalized Markup Language), which are used to tag the structure of documents,
such as sections and figures. SGML has also been used for many years for scholarly documents
to tag the types of the phrases, e.g. recording the semantics of names in the humanities literature
as “this is a person, place, book, painting”.
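As a concrete illustration of typed markup, the minimal sketch below (in Python) parses a small XML fragment in which phrases are wrapped in tags that record their semantic types; the fragment and its tag names (date, person, place, work) are hypothetical examples, not part of any standard schema.

    import xml.etree.ElementTree as ET

    # An illustrative fragment with typed phrase markup; the tag names are
    # hypothetical and do not follow any particular schema.
    fragment = """
    <passage>
      In <date>1831</date>, <person>Charles Darwin</person> sailed from
      <place>Plymouth</place> aboard the <work>Beagle</work>.
    </passage>
    """

    root = ET.fromstring(fragment)

    # Each typed phrase carries its semantic type in the element tag.
    for element in root.iter():
        if element is not root:
            print(element.tag, ":", element.text)
    # date : 1831
    # person : Charles Darwin
    # place : Plymouth
    # work : Beagle

A search engine reading such markup can index “Charles Darwin” as a person rather than as two unrelated words.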
The major activities towards the Semantic Web are developing infrastructure to type phrases within documents [2], for use in search engines and other software. Languages are being defined to enable authors to provide metadata describing document semantics within the text. A number of such languages are under development [3], supporting the definition of ontologies, which are formal specifications of the important concepts in a given domain. Such languages will be used by the authors of documents, and are thus subject to the limitations of author reliability and accuracy. These limitations have proven significant with earlier languages, such as SGML, encouraging the development of automatic tagging techniques to augment manual tagging whenever possible.
Document standards eliminate the need for format converters for each collection. Extracting
words becomes universally possible with a syntactic parser. But extracting concepts requires a semantic parser, which extracts the appropriate units from documents of any subject domain. Many years of research into information retrieval have shown that the most discriminating units for retrieval in text documents are multi-word noun phrases. Thus, the best concepts in document collections are noun phrases.
The Rise of Generic Parsing has made it possible to automatically extract concepts from
arbitrary documents. The key to context-based semantic indexing is identifying the “right size”
unit to extract from the objects in the collections. These units represent the “concepts” in the
collection. The document collection is then processed statistically to compute the co-occurrence
frequency of the units within each document.
Over the years, the feasible technology for concept extraction has become increasingly more
precise. Initially, there were heuristic rules that used stop words and verb phrases to
approximate noun phrase extraction. Then, there were simple noun phrase grammars for
particular subject domains. Finally, statistical parsing technology became good enough that extraction was computable without explicit grammars [4]. These statistical parsers can extract noun phrases quite accurately for general texts, after being trained on sample collections.
This technology trend approximates meaning by statistical versions of context. The same trend toward statistical pattern recognition has appeared globally in recent years across many areas. Computers have now become powerful enough that rules can be practically replaced by statistics in many cases. Global statistics on local context have replaced deterministic parsing.
For example, in computational linguistics, the best noun phrase extractors no longer have an explicit underlying grammar, but instead rely on neural nets trained on typical cases. The initial
phases of the DARPA TIPSTER program, a $100M effort to extract facts from newspaper
articles for intelligence purposes, were based upon grammars, but the final phases were based
upon statistical parsers. Once the neural nets are trained on a range of collections, they can
parse arbitrary texts with high accuracy. It is even possible to determine automatically the type
of the noun phrases, such as person or place, with high precision [5].
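The sketch below illustrates, in simplified form, the chunking step that follows statistical tagging: it assumes the part-of-speech tags have already been assigned by a tagger trained on sample collections, and simply groups maximal runs of adjectives and nouns, ending in a noun, into candidate phrases. The tagged sentence and the tag names are illustrative assumptions, not the output of any particular parser.

    # (word, tag) pairs for one sentence; in practice the tags would come from
    # a statistical tagger trained on sample collections.
    tagged = [
        ("Statistical", "ADJ"), ("parsers", "NOUN"), ("extract", "VERB"),
        ("multi-word", "ADJ"), ("noun", "NOUN"), ("phrases", "NOUN"),
        ("from", "ADP"), ("scientific", "ADJ"), ("documents", "NOUN"), (".", "PUNCT"),
    ]

    def noun_phrases(tokens):
        """Group maximal runs of adjectives and nouns, ending in a noun, into phrases."""
        phrases, run = [], []
        for word, tag in tokens + [("", "END")]:    # sentinel flushes the final run
            if tag in ("ADJ", "NOUN"):
                run.append((word, tag))
                continue
            while run and run[-1][1] != "NOUN":     # trim any trailing adjectives
                run.pop()
            if run:
                phrases.append(" ".join(w for w, _ in run))
            run = []
        return phrases

    print(noun_phrases(tagged))
    # ['Statistical parsers', 'multi-word noun phrases', 'scientific documents']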
Once the units, such as noun phrases, are extracted, they can be used to approximate meaning.
This is done by computing the frequency with which the units occur within each document
across the collection. In the same sense that the noun phrases represent concepts, the contextual
frequencies represent meanings.
These frequencies for each phrase form a space for the collection, where each concept is
related to each other concept by co-occurrence in context. The concept space is used to generate
related concepts for a given concept, which can be used to retrieve documents containing the related concepts. The space consists of the interrelationships between the concepts in the collection.
Concept navigation is enabled by a concept space computed from a document collection. The
technology operates generically, independent of subject domain. The goal is to enable users to
navigate spaces of concepts, instead of documents of words. Interactive navigation of the
concept space is useful for locating related terms relevant to a particular search strategy.
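A minimal sketch of such a concept space follows, assuming the noun phrases have already been extracted from each document. The documents, phrases, and the raw co-occurrence counts used as the relationship measure are all illustrative; a production indexer would use a weighted co-occurrence measure rather than raw counts.

    from collections import defaultdict
    from itertools import combinations

    # Noun phrases already extracted from each document (illustrative data).
    documents = [
        {"concept space", "noun phrase", "semantic indexing"},
        {"concept space", "semantic indexing", "community repository"},
        {"noun phrase", "statistical parser"},
        {"semantic indexing", "community repository", "statistical parser"},
    ]

    # Count how often each pair of phrases co-occurs within a document.
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for phrases in documents:
        for a, b in combinations(sorted(phrases), 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1

    def related(concept, k=3):
        """Navigate the space: the k concepts most strongly related to `concept`."""
        neighbors = cooccurrence[concept]
        return sorted(neighbors, key=neighbors.get, reverse=True)[:k]

    print(related("semantic indexing"))
    # ['concept space', 'community repository', 'noun phrase']

Interactive navigation then amounts to repeatedly following such related links from one concept to its neighbors until a suitable search term is found.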
The Rise of Statistical Indexing has made it possible to compute relationships between
concepts within a collection. Algorithms for computing statistical co-occurrence have been
studied within information retrieval since the 1960s [6]. But only in the last few years have the statistical computations required for effective retrieval become feasible for real
collections. These concept space computations combine artificial intelligence for the concept
extraction, via noun phrase parsing, with information retrieval for the concept relationship, via
statistical co-occurrence.
The technology curves of computer power are making statistical indexing feasible. The coming decade is the period in which scalable semantics will become a practical reality. For the 40-year period from the dawn of modern information retrieval in 1960 to the present worldwide Internet search of 2000, statistical indexing has been an academic curiosity. Techniques such as
co-occurrence frequency were well-known, but computable only on collections of a few hundred
documents. Practical information retrieval on real-world collections of millions of documents relied instead on exact matching of text phrases, as embodied in full-text search.
The speed of machines is changing all this rapidly. The next 10 years, 2000-2010, will see
the fall of indexing barriers for all real-world collections. For many years, the largest computer
could not semantically index the smallest collection. After the coming decade, even the smallest
computer will be able to semantically index the largest collection. Hero experiments in the late
1990s performed semantic indexing on the complete literature of scientific disciplines [7].
Experiments of this scale will be routinely carried out by ordinary people on their watches
(palmtop computers) less than 10 years later, in the late 2000s.
The TREC (Text REtrieval Conference) competition [8] is organized by the National Institute of Standards and Technology (NIST). It grew out of the DARPA TIPSTER evaluation
program, starting in 1992, and is now a public indexing competition entered annually by
international teams. Each team generates semantic indexes for gigabyte document collections
using their statistical software.
Currently, semantic indexing can be computed by the appropriate community machine, but in
batch mode. For example, a concept space for 1K documents is appropriate for a laboratory of
10 people and takes an hour to compute on a small laboratory server. Similarly, a community
space of 10K documents for 100 people takes an hour on a large departmental server. Each
community repository can be processed on the appropriate-scale server for that community. As
the speed of machines increases, the time of indexing will decrease from batch to interactive, and
semantic indexing will become feasible on dynamically specified collections.
When the technology for semantic indexing becomes routinely available, it will be possible to
incorporate this indexing directly into the infrastructure. At present, the WWW protocols make
it easy to develop a collection to access as a set of documents. Typically, the collection is
available for browsing but not for searching, except by being incorporated into web portals,
which gather documents via crawlers for central indexing. Software is not commonly available
for groups to maintain and index their own collection for web-wide search.
The Rise of Peer-Peer Protocols is making it possible to support distributed repositories for
small communities. This trend is following the same pattern in the Internet in the 2000s as did
email in the ARPAnet in the 1970s, where person-person communications became the dominant
service in infrastructure designed for station-station computations. Today, there are many
personal web sites, even though traffic is dominated by central archives, such as home shopping
and scientific databases, which drive the market.
The Net at present is fundamentally a client-server model, with few large servers and many
small clients. The clients are typically user workstations, which prepare queries to be processed
at archival servers. As the number of servers increases and the size of collections decreases, the
infrastructure will evolve into a peer-peer model, where user machines exchange data directly.
In this model, each machine is both a client and a server at different times.
There are already significant peer-peer services, where simple protocols enable users to directly share
their datasets. These are driven by the desires of specialized communities to directly share with
each other, without the intervention of central authorities. The most famous example is Napster
for music sharing, where files on a personal machine in a specified format can be made
accessible to other peer machines, via a local program that supports the sharing protocol. Such
file sharing services have become so popular that the technology is breaking down, due to the lack of searching capability that can filter out copyrighted songs.
There are many examples in more scientific situations of successful peer-peer protocols [9].
Typically, these programs implement a simple service on an individual user’s machine, which
performs some small computation on small data that can be combined across many machines
into a large computation on large data.
For example, the SETI@home software is running as a screensaver on several million
machines across the world, each computing the results of a radio telescope survey from a
different sky region. Computed results are sent to a central repository for a database supporting the Search for ExtraTerrestrial Intelligence (SETI) across the entire universe. Similar net-wide distributed
computation, with volunteer downloads of software onto personal machines, has computed large
primes and broken encryption schemes.
For-profit corporations have used peer-to-peer computing for public-service medical
computations [10]. Nearly a million PCs have already been volunteered for the United Devices
Cancer Research Program, using a commercial version of the SETI@home software.
Generalized software to handle documents or databases exists at only a primitive level for peer-peer protocols. The local PCs are currently used only as processors for computations on a small segment of the database from the central site. True data-centered peer-peer is not yet available,
where each PC computes on its own locally administered database and the central site only
merges the database computation from each local site. Data-centered peer-peer is necessary for
global search of local repositories, and will become feasible as technology matures.
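The sketch below illustrates the data-centered arrangement in simplified form: each peer indexes its own locally administered repository, and the merge site only aggregates the compact per-peer indexes, recording which peers can answer queries for each phrase. The peer names, repositories, and phrase-frequency indexes are illustrative assumptions, not an existing protocol.

    from collections import Counter

    def local_index(repository):
        """Each peer computes phrase frequencies over its own local repository."""
        index = Counter()
        for phrases in repository:
            index.update(phrases)
        return index

    # Illustrative repositories held by three independently administered peers.
    peer_repositories = {
        "peer-a": [["concept space", "semantic indexing"], ["noun phrase"]],
        "peer-b": [["semantic indexing", "community repository"]],
        "peer-c": [["concept space"], ["concept space", "noun phrase"]],
    }

    # Indexing happens locally; only the compact indexes cross the network.
    local_indexes = {peer: local_index(docs) for peer, docs in peer_repositories.items()}

    # The merge site (or any peer) aggregates the local indexes into a global
    # index recording which peers can answer queries for each phrase.
    global_index = {}
    for peer, index in local_indexes.items():
        for phrase, count in index.items():
            entry = global_index.setdefault(phrase, {"count": 0, "peers": []})
            entry["count"] += count
            entry["peers"].append(peer)

    print(global_index["concept space"])
    # {'count': 3, 'peers': ['peer-a', 'peer-c']}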
The Net will evolve from the Internet to the Interspace, as information infrastructure supports
semantic indexing for community repositories. The Interspace supports effective navigation
across information spaces, just as the Internet supports reliable transmission across data
networks. Internet infrastructure, such as the Open Directory project [11], enables distributed
subject curators to index web sites within assigned categories, with the entries being entire
collections. In contrast, Interspace infrastructure, such as automatic subject indexing [12], will
enable distributed community curators to index the documents themselves within the collections.
The increasing scale of community databases will force the evolution of peer-peer protocols.
Semantic indexing will mature and become infrastructure at whatever level technology will
support generically. Community repositories will be automatically indexed, then aggregated to
provide global indexes. Concept navigation will be a standard function of global infrastructure
in 2010, much as document browsing has become in 2000. Then the Internet will have evolved
into the Interspace.
References
1. B. Schatz and J. Hardin, “NCSA Mosaic and the World-Wide Web: Global Hypermedia Protocols for the Internet”, Science, Vol. 265, 12 Aug. 1994, pp. 895-901.
2. T. Berners-Lee, et al., “The Semantic Web”, Scientific American, Vol. 284, May 2001, pp. 35-43.
3. D. Fensel, et al., “The Semantic Web and Its Languages”, IEEE Intelligent Systems, November/December
2000, pp. 67-73.
4. T. Strzalkowski, “Natural Language Information Retrieval”, Information Processing & Management, Vol.
31, 1996, pp. 397-417.
5. D. Bikel, et al., “NYMBLE: A High-Performance Learning Name Finder”, Proc. 5th Conf. Applied
Natural Language Processing, Mar. 1998, pp. 194-201.
6. P. Kantor, “Information Retrieval Techniques”, Annual Review Information Science & Technology, Vol.
29, 1994, pp. 53-90.
7. B. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net”, Science, Vol. 275, 17
Jan. 1997, pp. 327-334.
8. D. Harman (ed), Text REtrieval Conferences (TREC), National Institute of Standards and Technology (NIST),
http://trec.nist.gov
9. B. Hayes, “Collective Wisdom”, American Scientist, Vol. 86, Mar-Apr 1998, pp. 118-122.
10. Intel Philanthropic Peer-to-Peer Program, www.intel.com/cure
11. Open Directory Project, www.dmoz.org
12. Y. Chung, et al., “Automatic Subject Indexing Using an Associative Neural Network”, Proc. 3rd Int’l ACM
Conference Digital Libraries, Jun. 1998, Pittsburgh, pp. 59-68.