Jaime Carbonell jgc@cs.cmu.edu
Language Technologies Institute
Carnegie Mellon University
Pittsburgh PA, USA
November 18, 2006
• Search Engine Primer
– Architecture, function and brief history
– Components: indexing, matching, …
• Recent R&D Results for Search Engines
– Clustering + summarization
– Beyond relevance (popularity, novelty, …)
– The “invisible web” and distributed IR
• Search Engines for eLibraries
– Coping with OCR imperfections
– Knowledge maps (text, metadata, and more)
18 November 2006 2 Alexandria Universal Library Conference
[Figure: search engine architecture. Labels: User, Search Engine, Inverted Index, Spider, The Web, Library, etc.]
• …in the 1980s (pre-web)
– Single collection with < 10^6 documents (archive)
• …mid 90s-mid 00’s (web):
– Single collection with > 10^9 documents (web)
• …beyond 2006:
– Multiple collections > 10^12 docs (invisible web)
– “Find what I mean” queries & profiles with clustering, summarization and personalization.
– Beyond monolingual text: OCR, audio, video, crosslingual search with translation, …
Task: Enable accurate and efficient document retrieval
Solution: Database of inverted lists, different access methods
Hash Table Access (terms: apple … zebra)
• Exact match
• O(1) access
B-Tree-style Access (terms: apple … zebra)
• Exact match
• Range match
• O(log(n)) access
[Figure: Database of Inverted Lists]
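The two access methods can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names and the three toy documents are hypothetical, a dict stands in for the hash table, and a sorted term list searched with `bisect` stands in for a B-tree.

```python
import bisect
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def exact_lookup(index, term):
    """Hash-table access: a single dictionary probe, O(1) on average."""
    return index.get(term, [])

def range_lookup(index, sorted_terms, low, high):
    """Sorted-order access: binary search locates the range bounds in O(log n)."""
    start = bisect.bisect_left(sorted_terms, low)
    end = bisect.bisect_right(sorted_terms, high)
    return {term: index[term] for term in sorted_terms[start:end]}

# Toy collection (hypothetical data).
docs = {1: "apple pie recipe", 2: "zebra and apple", 3: "apricot jam"}
index = build_inverted_index(docs)
sorted_terms = sorted(index)

print(exact_lookup(index, "apple"))                   # [1, 2]
print(range_lookup(index, sorted_terms, "ap", "aq"))  # {'apple': [1, 2], 'apricot': [3]}
```

The dict answers only exact-match queries, while the sorted term list also answers range queries (here, all terms starting with "ap"), which mirrors the trade-off listed above.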
Traditional “Cosine Similarity”
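\[ \mathrm{Sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert} \]
where: \( \lVert \vec{d} \rVert = \sqrt{\sum_{i=1}^{n} d_i^2} \)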
Each element in the query and document vectors is a word weight
Rare words count more, e.g.: \( d_i = \log_2 \left( D_{all} / D_{freq}(word_i) \right) \)
Getting the top-k documents (or web pages) is done by:
\[ \mathrm{Retrieve}(\vec{q}, k) = \operatorname*{Arg\,max}_{\vec{d} \in D} \left[ k, \mathrm{Sim}(\vec{d}, \vec{q}) \right] \]
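As a rough illustration of cosine similarity with rare-word weighting and top-k retrieval, here is a small Python sketch. The helper names and the three toy documents are assumptions for illustration. Each distinct word gets only its log2(D_all / D_freq) weight (no term-frequency factor), which is one simple reading of the weighting above.

```python
import heapq
import math
from collections import Counter

def idf_weights(docs):
    """Weight each word by log2(D_all / D_freq(word)), so rare words count more."""
    d_all = len(docs)
    d_freq = Counter(word for text in docs.values() for word in set(text.split()))
    return {word: math.log2(d_all / n) for word, n in d_freq.items()}

def to_vector(text, idf):
    """Sparse vector with one weight per distinct word known to the collection."""
    return {word: idf[word] for word in set(text.split()) if word in idf}

def cosine(q, d):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(w * d[word] for word, w in q.items() if word in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def retrieve(query, docs, k):
    """Return the k highest-scoring (similarity, doc_id) pairs, best first."""
    idf = idf_weights(docs)
    q = to_vector(query, idf)
    scored = ((cosine(q, to_vector(text, idf)), doc_id) for doc_id, text in docs.items())
    return heapq.nlargest(k, scored)

# Toy collection (hypothetical data).
docs = {
    "a": "heart attack symptoms",
    "b": "heart surgery",
    "c": "stock market news",
}
top = retrieve("heart attack", docs, 2)
print([doc_id for _, doc_id in top])  # ['a', 'b']
```

A real engine would score only the documents found through the inverted lists of the query terms rather than scanning the whole collection as this sketch does.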
• Well-known methods
– Stop-word removal (e.g., “it”, “the”, “in”, …)
– Phrasing (e.g., “heart attack”, “to be or not to be”)
– Morphology (e.g., “countries” => “country”)
• More recent methods
– Query expansion (e.g., “cheap” => “inexpensive”, “discount”, “economic”, “affordable”…)
– Pure relevance => popularity + relevance
» Google’s PageRank by in-link density
» Collaborative filtering (e.g. Amazon)
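Stop-word removal, morphology and query expansion can be sketched as a small query-preprocessing step. The stop-word list, the suffix rules and the one-entry thesaurus below are toy assumptions (the expansion terms are the ones listed above). Phrasing is left out, and a real system would use a proper stemmer and thesaurus.

```python
STOP_WORDS = {"it", "the", "in"}  # illustrative subset only
EXPANSIONS = {  # toy thesaurus, hypothetical
    "cheap": ["inexpensive", "discount", "economic", "affordable"],
}

def stem(word):
    """Very rough morphology: 'countries' -> 'country', 'flights' -> 'flight'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(query):
    """Drop stop-words, stem the remaining words, and append expansion terms."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        terms.append(stem(word))
        terms.extend(EXPANSIONS.get(word, []))
    return terms

print(preprocess("cheap flights in the countries"))
# ['cheap', 'inexpensive', 'discount', 'economic', 'affordable', 'flight', 'country']
```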
• Problem: Search engines require text (not images)
– OCR: images → text is imperfect
– Errors are language, font, and OCR-engine dependent
• Solution: Multifaceted
– Character n-gram-level confusion matrix (e.g.: ‘rn’ → ‘m’)
– Word-level confusion matrix
– Augmented noisy channel model:
\[ w^* = \arg\max_w \left[ P(w \mid v, f, l) \right] = \arg\max_w \left[ P(v \mid w, f, l) \, P(w) \right] \]
– P(v|w,f,l) is a function of OCR edit distance, actual frequencies in labeled data, etc.
– Automated Thesaurus (for redundancy)
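A toy Python sketch of the noisy channel idea follows. It assumes that v is the string produced by OCR and w a candidate intended word, and it drops the conditioning on f and l entirely. The confusion probabilities, the three-word vocabulary and the prior are invented for illustration, not estimated from labeled data.

```python
# Hypothetical character n-gram confusions: (true text, OCR output) -> probability.
CONFUSIONS = {("m", "rn"): 0.05, ("rn", "m"): 0.02}
# Hypothetical prior P(w) over a tiny vocabulary.
PRIOR = {"modern": 0.6, "modem": 0.3, "morn": 0.1}

def channel_prob(observed, word):
    """Crude stand-in for P(v | w): high if the strings are identical, the
    confusion probability if one listed substitution turns `word` into
    `observed`, and a small floor otherwise."""
    if observed == word:
        return 0.9
    for (true_seq, ocr_seq), p in CONFUSIONS.items():
        start = word.find(true_seq)
        while start != -1:
            candidate = word[:start] + ocr_seq + word[start + len(true_seq):]
            if candidate == observed:
                return p
            start = word.find(true_seq, start + 1)
    return 1e-6

def correct(observed):
    """Pick w* = argmax over w of P(v | w) * P(w)."""
    return max(PRIOR, key=lambda w: channel_prob(observed, w) * PRIOR[w])

print(correct("rnodern"))  # modern
print(correct("modem"))    # modem
```

The second call shows why the prior alone is not enough: "modem" is a valid word, so the channel term keeps it from being "corrected" to the more frequent "modern".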
• Automated Summarization
– Multi-document summaries
– User-controllable (length, type, etc.)
• Document Clustering
– Group search results by content similarity
– Then, summarize and label each cluster
• Personal Profiling
– User models (of interests, level of knowledge)
– Task models (progression of types of info needed)
• Information Push (beyond automated clipping)
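Grouping search results by content similarity and then labeling each cluster could look like the sketch below. The single-pass strategy, the Jaccard word-overlap measure, the 0.3 threshold and the sample titles are all assumptions chosen for brevity. The talk does not say which clustering method is used.

```python
def jaccard(a, b):
    """Word-set overlap: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_results(results, threshold=0.3):
    """Single-pass clustering: a result joins the first cluster whose seed
    is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"seed": set of words, "docs": [titles]}
    for title in results:
        words = set(title.lower().split())
        for cluster in clusters:
            if jaccard(words, cluster["seed"]) >= threshold:
                cluster["docs"].append(title)
                break
        else:
            clusters.append({"seed": words, "docs": [title]})
    return clusters

def label(cluster):
    """Label a cluster with the words shared by all of its members."""
    shared = set.intersection(*(set(t.lower().split()) for t in cluster["docs"]))
    return " ".join(sorted(shared))

# Hypothetical result titles for an ambiguous query.
results = ["jaguar car dealer", "jaguar car review",
           "jaguar cat habitat", "jaguar cat diet"]
for cluster in cluster_results(results):
    print(label(cluster), "->", cluster["docs"])
# car jaguar -> ['jaguar car dealer', 'jaguar car review']
# cat jaguar -> ['jaguar cat habitat', 'jaguar cat diet']
```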
• Search Criteria Beyond Query-Relevance
– Popularity of web-page (link density, clicks, …)
– Information novelty (content differential, recency)
– Trustworthiness of source
– Appropriateness to user (difficulty level, …)
• “Find What I Mean” Principle
– Search on semantically related terms
– Induce user profile from past history, etc.
– Disambiguate terms (e.g. “Jordan”, or “club”)
– From generic search to helpful E-Librarians
[Figure: clustering of retrieved documents. Labels: query, documents, IR, cluster summaries]
In general, we want to retrieve the k maximally-useful docs,
Where utility = F(relevance, novelty, popularity, clarity, …)
So far, we can do relevance & popularity. Novelty is next, by defining “marginal relevance” to be “relevant + new”:
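\[ \mathrm{MMR}(\vec{q}, D, k) = \operatorname*{Arg\,max}_{\vec{d}_i \in D} \left[ k,\; \lambda \, \mathrm{Sim}(\vec{d}_i, \vec{q}) - (1 - \lambda) \max_{\vec{d}_i \neq \vec{d}_j} \mathrm{Sim}(\vec{d}_i, \vec{d}_j) \right] \]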
MMR is used for ranking search results, or for selecting optimal passages in summary generation.
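A greedy Python sketch of MMR ranking is given below. It assumes the similarities are precomputed, and it applies the max over d_j to the documents already selected, which is the usual greedy reading of the formula. The function signature and the similarity values are hypothetical.

```python
def mmr(query_sim, doc_sim, k, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate d_i that maximizes
    lam * Sim(d_i, q) - (1 - lam) * max_j Sim(d_i, d_j),
    where d_j ranges over the documents selected so far."""
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim[d][s] for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical similarities: "a" and "b" are near-duplicates, "c" is different.
query_sim = {"a": 0.9, "b": 0.85, "c": 0.5}
doc_sim = {
    "a": {"b": 0.95, "c": 0.1},
    "b": {"a": 0.95, "c": 0.1},
    "c": {"a": 0.1, "b": 0.1},
}
print(mmr(query_sim, doc_sim, k=2, lam=1.0))  # ['a', 'b']  (pure relevance)
print(mmr(query_sim, doc_sim, k=2, lam=0.5))  # ['a', 'c']  (relevance + novelty)
```

With lam = 1 the ranking reduces to plain relevance. Lowering lam penalizes the near-duplicate "b" so that the less relevant but novel "c" is chosen second.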
[Figure: MMR ranking. Labels: query, documents, IR; λ controls spiral curl]
• Invisible Web = DBs Accessible via Web Pages
– Dynamically-generated web-pages from DBs
– Information (dynamic pages) served via Java apps
– 10 to 100 times larger than static HTML web
– Growing faster than static “visible” web
• Need Distributed-IR Model to Access (Callan)
– Either unify content or model each DB
– User’s query => appropriate DB(s) => secondary search => unify results
Make it a single database problem
[Figure: distributed databases. Labels: U.S. Sales (New York), European Sales (Zurich), R & D (San Jose), Administration (Pittsburgh), Competitor (Dallas)]
Automatic Resource Selection:
Federated (Distributed) Search
[Figure: federated search. Labels: Library of Congress, NY Times, West, Microsoft, “Best DBs?”, Search Results, Dynamic ranking, Automatic database selection]
• Find out what each database (or library) contains
• Decide where to search for this query (one or multiple sites)
• Search one or more databases
• Merge results returned by different searches (reliability, relevance, …)
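The selection and merging steps above can be sketched in Python as follows. The resource descriptions (term counts per database), the count-based selection score and the min-max score normalization are simplifying assumptions. They are not the specific algorithms referred to in the talk, and all names and numbers are hypothetical.

```python
def select_databases(query_terms, descriptions, n):
    """Rank databases by how often the query terms occur in their description
    and keep the best n that match at all."""
    def score(name):
        return sum(descriptions[name].get(t, 0) for t in query_terms)
    ranked = sorted(descriptions, key=score, reverse=True)
    return [name for name in ranked[:n] if score(name) > 0]

def merge_results(result_lists):
    """Min-max normalize each database's scores to [0, 1], then merge, because
    raw scores from different engines are not directly comparable."""
    merged = []
    for db, results in result_lists.items():
        if not results:
            continue
        scores = [s for _, s in results]
        low, high = min(scores), max(scores)
        span = (high - low) or 1.0
        merged.extend((doc, (s - low) / span, db) for doc, s in results)
    return sorted(merged, key=lambda item: item[1], reverse=True)

# Hypothetical resource descriptions: term -> number of occurrences.
descriptions = {
    "news": {"election": 40, "market": 10},
    "law": {"court": 30, "election": 5},
    "software": {"compiler": 20},
}
print(select_databases(["election"], descriptions, 2))  # ['news', 'law']

# Hypothetical per-database results as (document, raw score) pairs.
merged = merge_results({
    "news": [("n1", 12.0), ("n2", 8.0), ("n3", 4.0)],
    "law": [("l1", 0.9), ("l2", 0.3)],
})
print([doc for doc, _, _ in merged])  # ['n1', 'l1', 'n2', 'n3', 'l2']
```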
KNOWLEDGE MAPS:
First Steps Towards Useful eLibrarians
Query: “Tom Sawyer”
RESULTS:
Tom Sawyer home page
The Adventures of Tom Sawyer
Tom Sawyer software (graph search)
Disneyland – Tom Sawyer Island
DERIVATIVE & SECONDARY WORKS:
CliffsNotes: The Adventures of Tom…
Tom Sawyer & Huck Finn comicbook
“Tom Sawyer” filmed in 1980
A literary analysis of Tom Sawyer
WHERE TO GET IT:
Universal Library: free online text & images
Bibliomania – free online literature
Amazon.com: The Adventures of Tom…
RELATED INFORMATION :
Mark Twain: life and works
Wikipedia: “Tom Sawyer”
Literature chat room: Tom Sawyer
On merchandising Huck Finn and Tom Sawyer
CONCLUDING REMARKS
• Search Engine Technology is Evolving Rapidly
– Relevance => relevance + novelty + reliability
– Federated (distributed) search
• E-library search ≠ web search
– OCR issues for e-libraries
– Metadata search
– In-depth organization
• Presentation and Organization are Key
– Clusters, MMR, …
– Variable-depth summaries, knowledge maps, …
SLOGAN                         TECHNOLOGY (e.g.)
• “…right information”         • IR (search engines)
• “…right people”              • Routing, personalization
• “…right time”                • Anticipatory analysis
• “…right medium”              • Info extraction, speech
• “…right language”            • Machine translation
• “…right level of detail”     • Summarization