Search Engines for Digital Libraries


Jaime Carbonell jgc@cs.cmu.edu

Language Technologies Institute

Carnegie Mellon University

Pittsburgh PA, USA

November 18, 2006

OUTLINE OF PRESENTATION

• Search Engine Primer

– Architecture, function and brief history

– Components: indexing, matching, …

• Recent R&D Results for Search Engines

– Clustering + summarization

– Beyond relevance (popularity, novelty, …)

– The “invisible web” and distributed IR

• Search Engines for eLibraries

– Coping with OCR imperfections

– Knowledge maps (text, metadata, and more)

18 November 2006 2 Alexandria Universal Library Conference

Search Engines in a Nutshell

[Figure: search engine architecture. Labels: User; Search Engine; Inverted Index; Spider; The Web, Library, etc.]

Search Engine Evolution

• …in the 1980s (pre-web)

– Single collection with < 10^6 documents (archive)

• …mid-1990s to mid-2000s (web):

– Single collection with > 10^9 documents (web)

• …beyond 2006:

– Multiple collections > 10^12 docs (invisible web)

– “Find what I mean” queries & profiles with clustering, summarization and personalization.

– Beyond monolingual text: OCR, audio, video, crosslingual search with translation, …


INVERTED INDEXING:

Multiple Access Methods

Task: Enable accurate and efficient document retrieval

Solution: Database of inverted lists, different access methods

Hash Table Access (terms “apple” … “zebra”):

• Exact match

• O(1) access

B-Tree-style Access (terms “apple” … “zebra”):

• Exact match

• Range match

• O(log(n)) access

Both access methods lead into the Database of Inverted Lists.
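
The two access methods can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and a tiny in-memory collection; a Python dict stands in for the hash table, and a sorted term list searched with `bisect` stands in for a B-tree (it is not a real B-tree). The names `build_inverted_index`, `exact_match`, and `range_match` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def exact_match(index, term):
    """Hash-table-style access: a dict lookup, O(1) on average."""
    return index.get(term, [])

def range_match(index, sorted_terms, low, high):
    """Sorted-key access: binary search for the first term >= low,
    then scan forward until a term exceeds high."""
    start = bisect_left(sorted_terms, low)
    hits = set()
    for term in sorted_terms[start:]:
        if term > high:
            break
        hits.update(index[term])
    return sorted(hits)

# Toy collection (illustrative data only).
docs = {1: "apple pie recipe", 2: "zebra and apple", 3: "apricot jam"}
index = build_inverted_index(docs)
sorted_terms = sorted(index)

print(exact_match(index, "zebra"))
print(range_match(index, sorted_terms, "ap", "aq"))
```

The range query covers every term from "ap" up to "aq", which is the kind of lookup a hash table cannot answer without scanning all keys.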


QUERY-DOCUMENT SIMILARITY

(Simplified)

Traditional “Cosine Similarity”



$$\mathrm{Sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|}, \quad \text{where: } |\vec{d}| = \sqrt{\sum_{i=1,\ldots,n} d_i^2}$$

Each element of the query and document vectors is a word weight.

Rare words count more, e.g.: $d_i = \log_2\left(D_{all} / D_{freq}(word_i)\right)$

Getting the top-k documents (or web pages) is done by:

$$\mathrm{Retrieve}(\vec{q}, k) = \operatorname*{Arg\,max}_{\vec{d} \in D}\,[k,\ \mathrm{Sim}(\vec{d}, \vec{q})]$$
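
A minimal Python sketch of the simplified scheme above: word weights from the log2(D_all / D_freq) formula, cosine similarity over sparse vectors, and top-k selection. The function names and the toy collection are assumptions for illustration, and each distinct word is weighted once (no term-frequency factor), which is a simplification.

```python
import heapq
import math
from collections import Counter

def idf_weights(docs):
    """Word weight log2(D_all / D_freq(word)) for each word in the collection."""
    doc_freq = Counter(word for text in docs.values() for word in set(text.split()))
    return {word: math.log2(len(docs) / df) for word, df in doc_freq.items()}

def to_vector(text, idf):
    """Sparse vector: each distinct known word gets its rarity weight."""
    return {word: idf[word] for word in set(text.split()) if word in idf}

def cosine(q, d):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(weight * d.get(word, 0.0) for word, weight in q.items())
    norm_q = math.sqrt(sum(x * x for x in q.values()))
    norm_d = math.sqrt(sum(x * x for x in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def retrieve(query, docs, k):
    """Return the ids of the k documents most similar to the query."""
    idf = idf_weights(docs)
    q = to_vector(query, idf)
    scored = ((cosine(q, to_vector(text, idf)), doc_id) for doc_id, text in docs.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Toy collection (illustrative data only).
docs = {
    "d1": "heart attack symptoms",
    "d2": "heart of the city",
    "d3": "city traffic report",
}
print(retrieve("heart attack", docs, 2))
```

A production engine would compute these scores from the inverted index rather than scanning every document; the full scan here only keeps the sketch short.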


REFINEMENTS TO IMPROVE

SEARCH ENGINES

• Well-known methods

– Stop-word removal (e.g., “it”, “the”, “in”, …)

– Phrasing (e.g., “heart attack”, “to be or not to be”)

– Morphology (e.g., “countries” => “country”)

• More recent methods

– Query expansion (e.g., “cheap” => “inexpensive”, “discount”, “economic”, “affordable”, …)

– Pure relevance => popularity + relevance

» Google’s PageRank (by in-link density)

» Collaborative filtering (e.g. Amazon)
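
Three of the refinements listed above (stop-word removal, morphology, query expansion) can be sketched as a small query-preprocessing step. Everything here is a toy assumption: the stop-word list, the two suffix rules, and the hand-built expansion table are illustrative, not a real stemmer or thesaurus, and phrase handling is left out.

```python
# Hypothetical stop-word list and expansion table, for illustration only.
STOP_WORDS = {"it", "the", "in", "a", "of"}
EXPANSIONS = {"cheap": ["inexpensive", "discount", "economic", "affordable"]}

def stem(word):
    """Toy morphology: '...ies' becomes '...y'; otherwise drop a trailing plural 's'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(query):
    """Remove stop words, stem the rest, then append expansions after each term."""
    terms = [stem(w) for w in query.lower().split() if w not in STOP_WORDS]
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(EXPANSIONS.get(term, []))
    return expanded

print(preprocess("cheap flights in the countries"))
```

Removing stop words before phrase detection would break a phrase query such as “to be or not to be”, which is one reason phrasing is listed as its own refinement.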


Coping with OCR Errors in

Documents

• Problem: Search engines require text (not images)

– OCR: images → text is imperfect

– Errors are language, font, and OCR-engine dependent

• Solution: Multifaceted

– Character n-gram-level confusion matrix (e.g.: ‘rn’ → ‘m’)

– Word-level confusion matrix

– Augmented noisy channel model:

$$w^* = \arg\max_w\,[P(w \mid v, f, l)] = \arg\max_w\,[P(v \mid w, f, l)\,P(w)]$$

– P(v|w,f,l) is a function of OCR edit distance, actual frequencies in labeled data, etc.

– Automated Thesaurus (for redundancy)
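
A minimal sketch of noisy-channel correction in Python, under several stated assumptions: v is taken to be the observed OCR string and w a candidate intended word; the conditioning on f and l is ignored; the vocabulary counts and the confusion probabilities are invented for illustration; and the channel model allows at most one known character n-gram confusion rather than a full OCR edit distance.

```python
# Hypothetical unigram counts standing in for P(w).
WORD_COUNTS = {"modern": 50, "modem": 5, "corner": 30, "comer": 1}

# Hypothetical confusion table: (true n-gram, OCR output) -> probability.
CONFUSIONS = {("rn", "m"): 0.2, ("m", "rn"): 0.05, ("l", "1"): 0.1}

def channel_prob(observed, word):
    """Crude P(v | w): high if unchanged, else the best single known confusion, else 0."""
    if observed == word:
        return 0.9
    best = 0.0
    for (true, seen), prob in CONFUSIONS.items():
        start = word.find(true)
        while start != -1:
            # Apply the confusion at this position and compare with the OCR output.
            candidate = word[:start] + seen + word[start + len(true):]
            if candidate == observed:
                best = max(best, prob)
            start = word.find(true, start + 1)
    return best

def correct(observed):
    """Pick the word maximizing P(v | w) * P(w); keep the input if nothing explains it."""
    total = sum(WORD_COUNTS.values())
    scores = {w: channel_prob(observed, w) * count / total for w, count in WORD_COUNTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else observed

print(correct("modem"))
print(correct("comer"))
```

With these invented numbers the prior outweighs the exact match: "modem" is itself in the vocabulary, but "modern" read through the ‘rn’ → ‘m’ confusion is the more probable explanation.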


BEYOND SEARCHING

• Automated Summarization

– Multi-document summaries

– User-controllable (length, type, etc.)

• Document Clustering

– Group search results by content similarity

– Then, summarize and label each cluster

• Personal Profiling

– User models (of interests, level of knowledge)

– Task models (progression of types of info needed)

• Information Push (beyond automated clipping)


NEXT-GENERATION

SEARCH ENGINES

• Search Criteria Beyond Query-Relevance

– Popularity of web-page (link density, clicks, …)

– Information novelty (content differential, recency)

– Trustworthiness of source

– Appropriateness to user (difficulty level, …)

• “Find What I Mean” Principle

– Search on semantically related terms

– Induce user profile from past history, etc.

– Disambiguate terms (e.g. “Jordan”, or “club”)

– From generic search to helpful E-Librarians


Clustering Search vs Standard Search

(e.g. clusty.com)

[Figure: clustering search vs. standard search. Labels: query; documents; IR; cluster summaries.]
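
Grouping search results by content similarity and labeling each group can be sketched as follows. This is a toy single-pass clustering over result titles using Jaccard overlap of word sets; the threshold, the stop-word list, and the most-frequent-term label are all assumptions made for illustration, not the method used by any particular clustering engine.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap of two term sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_results(results, threshold=0.3):
    """Single pass: join the first cluster whose seed is similar enough, else start a new one."""
    clusters = []  # each cluster: {"seed": set of terms, "docs": list of titles}
    for title in results:
        terms = set(title.lower().split())
        for cluster in clusters:
            if jaccard(terms, cluster["seed"]) >= threshold:
                cluster["docs"].append(title)
                break
        else:
            clusters.append({"seed": terms, "docs": [title]})
    return clusters

def label(cluster, stop_words=frozenset({"the", "of", "a", "in"})):
    """Label a cluster with one of its most frequent non-stop-word terms."""
    counts = Counter(
        word for doc in cluster["docs"] for word in doc.lower().split()
        if word not in stop_words
    )
    return counts.most_common(1)[0][0]

# Toy result list (illustrative data only).
results = ["jaguar car review", "jaguar car price",
           "jaguar cat habitat", "big cat habitat facts"]
for cluster in cluster_results(results):
    print(label(cluster), cluster["docs"])
```

A fuller system would summarize each cluster instead of picking a single term; the one-word label only marks where that step would go.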


NEXT-GENERATION SEARCH:

Maximal Marginal Relevance Principle

In general, we want to retrieve the k maximally useful documents, where utility = F(relevance, novelty, popularity, clarity, …).

So far, we can do relevance and popularity. Novelty is next, by defining “marginal relevance” to be “relevant + new”:

$$\mathrm{MMR}(\vec{q}, D, k) = \operatorname*{Arg\,max}_{\vec{d}_i \in D}\,[k,\ \lambda\,\mathrm{Sim}(\vec{d}_i, \vec{q}) - (1 - \lambda)\max_{\vec{d}_j \neq \vec{d}_i} \mathrm{Sim}(\vec{d}_i, \vec{d}_j)]$$

MMR is used for ranking search results, or for selecting optimal passages in summary generation.
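
A minimal greedy sketch of MMR ranking in Python. It assumes the common incremental formulation, in which each candidate is penalized by its maximum similarity to the documents already selected; the function name `mmr`, its parameters, and the similarity values are hypothetical.

```python
def mmr(query_sim, doc_sim, k, lam=0.7):
    """Greedy MMR selection.

    query_sim: dict mapping doc id -> Sim(d_i, q)
    doc_sim:   function (doc id, doc id) -> Sim(d_i, d_j)
    lam:       lambda; 1.0 is pure relevance, lower values favor novelty
    """
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(doc):
            # Redundancy: highest similarity to anything already chosen.
            redundancy = max((doc_sim(doc, s) for s in selected), default=0.0)
            return lam * query_sim[doc] - (1 - lam) * redundancy
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

# Illustrative similarities: "a" and "b" are near-duplicates, "c" is different.
QUERY_SIM = {"a": 0.9, "b": 0.85, "c": 0.5}
PAIR_SIM = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}

def pair_sim(x, y):
    return PAIR_SIM[frozenset((x, y))]

print(mmr(QUERY_SIM, pair_sim, k=2, lam=1.0))
print(mmr(QUERY_SIM, pair_sim, k=2, lam=0.5))
```

With lam=1.0 the near-duplicate "b" is ranked second; with lam=0.5 the less relevant but novel "c" takes its place.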


MMR Ranking vs Standard IR

(Future CONDOR Release)

[Figure: MMR ranking vs. standard IR. Labels: query; documents; IR; MMR. λ controls spiral curl.]


NEXT-GENERATION SEARCH:

Seeking the “Invisible” Web

• Invisible Web = DB’s Accessible via Web Pages

– Dynamically-generated web-pages from DB’s

– Information (dynamic pages) served via Java apps

– 10 to 100 times larger than static HTML web

– Growing faster than static “visible” web

• Need Distributed-IR Model to Access (Callan)

– Either unify content or model each DB

– User’s query => appropriate DB(s) => secondary search => unify results


The Web-Search Model

Make it a single database problem

[Figure: distributed sites. Labels: U.S. Sales (New York); European Sales (Zurich); R & D (San Jose); Administration (Pittsburgh); Competitor (Dallas).]

Automatic Resource Selection:

Federated (Distributed) Search

[Figure: automatic database selection and dynamic ranking. Labels: Library of Congress; NY Times; West; Microsoft; Google; Best DBs?; Search Results.]

• Find out what each database (or library) contains

• Decide where to search for this query (one or multiple sites)

• Search one or more databases

• Merge results returned by different searches (reliability, relevance, …)
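
The four steps above can be sketched end to end in Python. The collections, the term-count resource descriptions, and the shared scoring function are all assumptions for illustration; in practice each site uses its own ranking, so merging would need score normalization or reliability weighting.

```python
from collections import Counter

# Hypothetical collections: name -> {doc id: text}.
DATABASES = {
    "news": {"n1": "election results today", "n2": "storm warning issued"},
    "law": {"l1": "court ruling on election law", "l2": "contract law basics"},
    "recipes": {"r1": "apple pie", "r2": "bread basics"},
}

def describe(db):
    """Step 1: resource description, here just term counts over the whole database."""
    return Counter(word for text in db.values() for word in text.split())

def select_databases(query, descriptions, max_dbs=2):
    """Step 2: rank databases by how often they mention the query terms."""
    scores = {name: sum(desc[t] for t in query.split()) for name, desc in descriptions.items()}
    ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: (-scores[n], n))
    return ranked[:max_dbs]

def search(db, query):
    """Step 3: local search; score is the fraction of query terms present in a document."""
    terms = set(query.split())
    hits = []
    for doc_id, text in db.items():
        score = len(terms & set(text.split())) / len(terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits

def federated_search(query, databases=DATABASES):
    """Steps 1-4: describe, select, search, then merge into one ranked list."""
    descriptions = {name: describe(db) for name, db in databases.items()}
    merged = []
    for name in select_databases(query, descriptions):
        merged.extend((score, name, doc_id) for score, doc_id in search(databases[name], query))
    # Step 4: scores are comparable only because every site uses the same scorer here.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1], hit[2]))

print(federated_search("election law"))
```

The "recipes" collection is never searched for this query, which is the point of selecting databases before searching them.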


KNOWLEDGE MAPS:

First Steps Towards Useful eLibrarians

Query: “Tom Sawyer”

RESULTS:

Tom Sawyer home page

The Adventures of Tom Sawyer

Tom Sawyer software (graph search)

Disneyland – Tom Sawyer Island

DERIVATIVE & SECONDARY WORKS:

CliffsNotes: The Adventures of Tom…

Tom Sawyer & Huck Finn comic book

“Tom Sawyer” filmed in 1980

A literary analysis of Tom Sawyer


WHERE TO GET IT:

Universal Library: free online text & images

Bibliomania – free online literature

Amazon.com: The Adventures of Tom…

RELATED INFORMATION:

Mark Twain: life and works

Wikipedia: “Tom Sawyer”

Literature chat room: Tom Sawyer

On merchandising Huck Finn and Tom Sawyer


CONCLUDING REMARKS

• Search Engine Technology is Evolving Rapidly

– Relevance => relevance + novelty + reliability

– Federated (distributed) search

• E-library search ≠ web search

– OCR issues for e-libraries

– Metadata search

– In-depth organization

• Presentation and Organization are Key

– Clusters, MMR, …

– Variable-depth summaries, knowledge maps, …


Language Technologies

SLOGAN => TECHNOLOGY (e.g.)

• “…right information” => IR (search engines)

• “…right people” => Routing, personalization

• “…right time” => Anticipatory analysis

• “…right medium” => Info extraction, speech

• “…right language” => Machine translation

• “…right level of detail” => Summarization, expansion

