Search Engines for Digital Libraries


Jaime Carbonell jgc@cs.cmu.edu

Language Technologies Institute

Carnegie Mellon University

Pittsburgh PA, USA

November 18, 2006

OUTLINE OF PRESENTATION

Search Engine Primer

– Architecture, function and brief history

– Components: indexing, matching, …

Recent R&D Results for Search Engines

– Clustering + summarization

– Beyond relevance (popularity, novelty, …)

– The “invisible web” and distributed IR

Search Engines for eLibraries

– Coping with OCR imperfections

– Knowledge maps (text, metadata, and more)


Search Engines in a Nutshell

[Diagram: a spider crawls the Web, a library, etc., and builds an inverted index; the search engine matches the user's queries against that index.]

Search Engine Evolution

• …in the 1980s (pre-web):

– Single collection with < 10^6 documents (archive)

• …mid-1990s to mid-2000s (web):

– Single collection with > 10^9 documents (web)

• …beyond 2006:

– Multiple collections, > 10^12 documents (invisible web)

– "Find what I mean" queries & profiles with clustering, summarization and personalization

– Beyond monolingual text: OCR, audio, video, crosslingual search with translation, …


INVERTED INDEXING: Multiple Access Methods

Task: Enable accurate and efficient document retrieval

Solution: Database of inverted lists, different access methods

Database of inverted lists (terms from "apple" to "zebra"), reachable by two access methods:

– Hash Table Access: exact match, O(1) access

– B-Tree-style Access: exact match, range match, O(log n) access
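As an illustration (not from the slides), here is a minimal Python sketch of such a database of inverted lists: a dict gives hash-table-style O(1) exact-term lookup, while a sorted key list stands in for B-tree-style O(log n) range lookup. The toy corpus and function names are hypothetical.

```python
import bisect
from collections import defaultdict

docs = {1: "apple zebra", 2: "apple orange", 3: "zebra"}   # toy corpus (hypothetical)

inverted = defaultdict(list)                               # term -> postings list of doc ids
for doc_id, text in docs.items():
    for term in set(text.split()):
        inverted[term].append(doc_id)

sorted_terms = sorted(inverted)                            # supports B-tree-style range access

def exact_lookup(term):
    """Hash-table access: exact match, O(1)."""
    return inverted.get(term, [])

def range_lookup(lo, hi):
    """B-tree-style access: exact or range match, O(log n) via binary search."""
    i = bisect.bisect_left(sorted_terms, lo)
    j = bisect.bisect_right(sorted_terms, hi)
    return {t: inverted[t] for t in sorted_terms[i:j]}

print(exact_lookup("apple"))            # -> [1, 2]
print(range_lookup("apple", "orange"))  # -> {'apple': [1, 2], 'orange': [2]}
```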

QUERY-DOCUMENT SIMILARITY (Simplified)

Traditional “Cosine Similarity”

\mathrm{Sim}(\vec{q}, \vec{d}) \;=\; \frac{\vec{q} \cdot \vec{d}}{\lVert\vec{q}\rVert \, \lVert\vec{d}\rVert}, \qquad \text{where } \vec{d} = (d_1, \dots, d_n),\ \ \lVert\vec{d}\rVert = \sqrt{\textstyle\sum_{i=1}^{n} d_i^2}

Each element in the query and document vectors is a word weight.

Rare words count more, e.g.: d_i = \log_2\big(D_{\mathrm{all}} / D_{\mathrm{freq}}(\mathrm{word}_i)\big)

Getting the top-k documents (or web pages) is done by:

\mathrm{Retrieve}(\vec{q}, k) \;=\; \operatorname*{Argmax}^{[k]}_{\vec{d} \in D}\big[\mathrm{Sim}(\vec{d}, \vec{q})\big] \quad \text{(the } k \text{ highest-scoring documents in } D\text{)}
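A minimal Python sketch of these formulas over a toy collection (the documents and tokenization are hypothetical; a real engine evaluates this through the inverted index rather than by scoring every document):

```python
import math
from collections import Counter

docs = {                                   # toy collection (hypothetical)
    "d1": "heart attack symptoms",
    "d2": "heart disease treatment",
    "d3": "stock market attack",
}

def idf(term):
    """Rare words count more: log2(D_all / D_freq(term))."""
    d_freq = sum(term in text.split() for text in docs.values())
    return math.log2(len(docs) / d_freq) if d_freq else 0.0

def vectorize(text):
    """Sparse vector of tf * idf word weights."""
    tf = Counter(text.split())
    return {t: f * idf(t) for t, f in tf.items()}

def cosine(q, d):
    """Sim(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(sum(w * w for w in d.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k):
    """Return the k documents most similar to the query (the Argmax^[k] above)."""
    q = vectorize(query)
    scores = {name: cosine(q, vectorize(text)) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(retrieve("heart attack", k=2))       # 'd1' ranks first: it shares both query terms
```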


REFINEMENTS TO IMPROVE SEARCH ENGINES

Well-known methods

– Stop-word removal (e.g., “it”, “the”, “in”, …)

– Phrasing (e.g., “heart attack”, “to be or not to be”)

– Morphology (e.g., “countries” => “country”)

More recent methods

– Query expansion (e.g., "cheap" => "inexpensive", "discount", "economic", "affordable"…)

– Pure relevance => popularity + relevance

» Google’s page-rank by in-link density

» Collaborative filtering (e.g. Amazon)
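A hedged sketch of how a few of these refinements could be chained on a raw query before matching; the stop-word list, suffix rule, and expansion dictionary below are toy placeholders, not any engine's actual resources:

```python
STOP_WORDS = {"it", "the", "in", "a", "of", "to"}                   # toy stop-word list
EXPANSIONS = {"cheap": ["inexpensive", "discount", "affordable"]}   # toy expansion dictionary

def stem(word):
    """Crude morphology: 'countries' -> 'country' style suffix stripping (toy rule only)."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    return word[:-1] if word.endswith("s") else word

def preprocess_query(query):
    terms = [w.lower() for w in query.split() if w.lower() not in STOP_WORDS]
    terms = [stem(w) for w in terms]
    expanded = []
    for t in terms:                          # query expansion with related terms
        expanded.append(t)
        expanded.extend(EXPANSIONS.get(t, []))
    return expanded

print(preprocess_query("cheap flights to the countries"))
# -> ['cheap', 'inexpensive', 'discount', 'affordable', 'flight', 'country']
```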


Coping with OCR Errors in Documents

Problem: Search engines require text (not images)

– OCR (images → text) is imperfect

– Errors are language, font, and OCR-engine dependent

Solution: Multifaceted

– Character n-gram-level confusion matrix (e.g.: 'rn' → 'm')

– Word-level confusion matrix

– Augmented noisy channel model: w^* = \operatorname*{argmax}_w P(w \mid v, f, l) = \operatorname*{argmax}_w \big[P(v \mid w, f, l)\,P(w)\big]

– P(v|w,f,l) is a function of OCR edit distance, actual frequencies in labeled data, etc.

– Automated Thesaurus (for redundancy)
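A minimal sketch of the noisy-channel correction, assuming a hypothetical confusion table and unigram priors; a real system would also condition on font and language (f, l) and score with OCR edit distance as noted above:

```python
import math

# Hypothetical word priors P(w) and OCR confusion weights, e.g. 'rn' frequently misread as 'm'
WORD_FREQ = {"modern": 0.9, "modem": 0.1}
CONFUSIONS = {("m", "rn"): 0.4}            # (observed substring, true substring) -> weight

def channel_prob(observed, true_word):
    """Rough P(v | w): high if identical, the confusion weight if one known substitution explains v."""
    if observed == true_word:
        return 0.9
    for (seen, real), p in CONFUSIONS.items():
        if true_word.replace(real, seen, 1) == observed:
            return p
    return 1e-6

def correct(observed, candidates):
    """w* = argmax_w P(v | w) * P(w)."""
    return max(candidates, key=lambda w: math.log(channel_prob(observed, w)) + math.log(WORD_FREQ[w]))

print(correct("modem", WORD_FREQ))
# -> 'modern': the observed 'modem' is best explained by the 'rn' -> 'm' confusion plus the higher prior
```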


BEYOND SEARCHING

Automated Summarization

– Multi-document summaries

– User-controllable (length, type, etc.)

Document Clustering

– Group search results by content similarity

– Then, summarize and label each cluster

Personal Profiling

– User models (of interests, level of knowledge)

– Task models (progression of types of info needed)

Information Push (beyond automated clipping)


NEXT-GENERATION SEARCH ENGINES

Search Criteria Beyond Query-Relevance

– Popularity of web-page (link density, clicks, …)

– Information novelty (content differential, recency)

– Trustworthiness of source

– Appropriateness to user (difficulty level, …)

• “Find What I Mean” Principle

– Search on semantically related terms

– Induce user profile from past history, etc.

– Disambiguate terms (e.g. “Jordan”, or “club”)

– From generic search to helpful E-Librarians


Clustering Search vs Standard Search (e.g. clusty.com)

[Diagram: the same query and document set, retrieved by standard IR on one side and grouped into clusters with cluster summaries on the other.]
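A toy Python sketch of grouping result snippets by content similarity and labeling each cluster, in the spirit of the diagram above; the greedy single-pass method and the snippets are illustrative assumptions, not clusty.com's actual algorithm:

```python
from collections import Counter

def tokens(text):
    return Counter(text.lower().split())

def cos(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (sum(x * x for x in a.values()) * sum(x * x for x in b.values())) ** 0.5
    return dot / norm if norm else 0.0

def cluster_results(snippets, threshold=0.3):
    """Greedy single-pass clustering: join the closest existing cluster or start a new one."""
    clusters = []                              # each cluster: {'centroid': Counter, 'members': [snippet]}
    for s in snippets:
        v = tokens(s)
        best = max(clusters, key=lambda c: cos(v, c["centroid"]), default=None)
        if best and cos(v, best["centroid"]) >= threshold:
            best["members"].append(s)
            best["centroid"].update(v)
        else:
            clusters.append({"centroid": Counter(v), "members": [s]})
    for c in clusters:                         # label each cluster by its most frequent content words
        c["label"] = " ".join(t for t, _ in c["centroid"].most_common(2))
    return clusters

results = [                                    # hypothetical search-result snippets for "jaguar"
    "jaguar the big cat of south america",
    "jaguar cat habitat and diet",
    "jaguar cars new sports model",
]
for c in cluster_results(results):
    print(c["label"], "->", len(c["members"]), "results")
```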


NEXT-GENERATION SEARCH: Maximal Marginal Relevance Principle

In general, we want to retrieve the k maximally useful documents, where utility = F(relevance, novelty, popularity, clarity, …).

So far, we can do relevance & popularity. Novelty is next, by defining "marginal relevance" to be "relevant + new":

\mathrm{MMR}(\vec{q}, D, k) \;=\; \operatorname*{Argmax}^{[k]}_{\vec{d}_i \in D}\Big[\lambda\,\mathrm{Sim}(\vec{d}_i, \vec{q}) \;-\; (1-\lambda)\max_{\vec{d}_j \neq \vec{d}_i}\mathrm{Sim}(\vec{d}_i, \vec{d}_j)\Big]

MMR is used for ranking search results, or for selecting optimal passages in summary generation.
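A hedged Python sketch of greedy MMR re-ranking under this formula; the word-overlap similarity and toy documents are stand-ins (a real system would use the cosine measure defined earlier), and λ = 1 recovers the standard relevance-only ranking:

```python
def mmr_rank(query, docs, k, lam, sim):
    """Greedy MMR: repeatedly pick the doc maximizing
    lam * Sim(d, q) - (1 - lam) * max over already-selected d_j of Sim(d, d_j)."""
    selected, candidates = [], list(docs)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(d, query) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

def jaccard(a, b):
    """Toy similarity: word overlap (stand-in for the cosine measure above)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

docs = [
    "heart attack warning signs",
    "warning signs of a heart attack",     # near-duplicate of the first result
    "heart disease prevention and diet",
]
print(mmr_rank("heart attack", docs, k=2, lam=0.7, sim=jaccard))
# With lam < 1 the near-duplicate is demoted in favor of the novel third document.
```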


MMR Ranking vs Standard IR (Future CONDOR Release)

[Diagram: MMR ranking vs. standard IR ranking of the same documents around a query; λ controls the spiral curl.]

NEXT-GENERATION SEARCH: Seeking the "Invisible" Web

• Invisible Web = DB’s Accessible via Web Pages

– Dynamically-generated web-pages from DB’s

– Information (dynamic pages) served via Java apps

– 10 to 100 times larger than static HTML web

– Growing faster than static “visible” web

Need Distributed-IR Model to Access (Callan)

– Either unify content or model each DB

– User’s query => appropriate DB(s) => secondary search => unify results


The Web-Search Model: Make it a single database problem

[Diagram: content from distributed sources (U.S. Sales in New York, European Sales in Zurich, R&D in San Jose, Administration in Pittsburgh, a Competitor in Dallas) is gathered into one central index and searched as a single collection.]

Automatic Resource Selection: Federated (Distributed) Search

[Diagram: the user's query is routed to the best databases among many candidates (Library of Congress, NY Times, West, Microsoft, Google, …); automatic database selection and dynamic ranking produce a single merged list of search results.]

• Find out what each database (or library) contains

• Decide where to search for this query (one or multiple sites)

• Search one or more databases

• Merge results returned by different searches (reliability, relevance, …)
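A minimal sketch of that loop in Python, assuming hypothetical per-collection term statistics (e.g., gathered by sampling each collection) and placeholder per-site search calls; a real merger would also normalize scores across the different engines:

```python
# Hypothetical term-frequency profiles for each collection (learned, e.g., by query sampling)
COLLECTION_STATS = {
    "Library of Congress": {"twain": 120, "sawyer": 80, "copyright": 300},
    "NY Times":            {"election": 500, "twain": 5},
    "Medline":             {"cardiology": 400, "heart": 900},
}

def select_databases(query, n=2):
    """Decide where to search: rank collections by query-term frequency in their profiles."""
    terms = query.lower().split()
    scores = {name: sum(stats.get(t, 0) for t in terms)
              for name, stats in COLLECTION_STATS.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] > 0][:n]

def search_collection(name, query):
    """Placeholder for the per-collection search call (each site runs its own engine)."""
    return [(f"{name}: result for '{query}'", 1.0)]          # (title, local score)

def federated_search(query):
    results = []
    for name in select_databases(query):                     # search one or more databases
        for title, score in search_collection(name, query):
            results.append((score, title))                   # merge by (unnormalized) local score
    return [title for _, title in sorted(results, reverse=True)]

print(federated_search("twain sawyer"))
```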


KNOWLEDGE MAPS: First Steps Towards Useful eLibrarians

Query: “Tom Sawyer”

RESULTS:

Tom Sawyer home page

The Adventures of Tom Sawyer

Tom Sawyer software (graph search)

Disneyland – Tom Sawyer Island

DERIVATIVE & SECONDARY WORKS:

CliffsNotes: The Adventures of Tom…

Tom Sawyer & Huck Finn comic book

“Tom Sawyer” filmed in 1980

A literary analysis of Tom Sawyer


WHERE TO GET IT:

Universal Library: free online text & images

Bibliomania – free online literature

Amazon.com: The Adventures of Tom…

RELATED INFORMATION :

Mark Twain: life and works

Wikipedia: “Tom Sawyer”

Literature chat room: Tom Sawyer

On merchandising Huck Finn and Tom Sawyer


CONCLUDING REMARKS

Search Engine Technology is Evolving Rapidly

– Relevance => relevance + novelty + reliability

– Federated (distributed) search

E-library search ≠ web search

– OCR issues for e-libraries

– Metadata search

– In-depth organization

Presentation and Organization are Key

– Clusters, MMR, …

– Variable-depth summaries, knowledge maps, …


Language Technologies

SLOGAN → TECHNOLOGY (e.g.):

• "…right information" → IR (search engines)

• "…right people" → Routing, personalization

• "…right time" → Anticipatory analysis

• "…right medium" → Info extraction, speech

• "…right language" → Machine translation

• "…right level of detail" → Summarization, expansion

