Distributed Information Retrieval
(c) Maria Indrawan 2004

Challenges in Managing Distributed Information
• No topology of the data organisation.
• Dynamic data.
• The size of the collection.
• No control over quality of the data.
• Multimedia data.

Challenges-Human Factor
• Diversity of users
– Expert to novice
• Ill-formed queries.
• Specific behaviour
– Favour precision over recall (85% of users only look at the first screen – Lan Huang, A survey on Web Information Technology)

Types of Distributed IR
• Directory
– Yahoo
• Search Engine
– Google, AskJeeves, Yahoo, Teoma
• Meta Search
– Metacrawler, Dogpile
• Distributed Broker
– Harvest

Directory Listing
• Manually created
– Yahoo, Google, MSN
– Open Directory Project
• www.dmoz.org
• Automatic classification
• TERENA.
– http://www.terena.nl/tech/projects/portal/isir/reisnews9908seac.html
• Scorpion
– http://orc.rsch.oclc.org:6109/

Search Engine Architecture
• Crawler (robots)
– Collects pages from the Web.
• Indexer
– Indexes the pages collected by the crawler and represents them in an efficient data structure.
• Query Server
– Accepts and processes the user's query and returns the results.

Crawler – Design Considerations
• Crawling algorithm
– Breadth-first vs depth-first
• How do we handle URL aliases?
• How do we reduce server load?
• How do we detect a duplicate page or a mirror site?
• How often do we need to revisit a site?
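
A minimal sketch of some of the crawler design questions, assuming a toy in-memory "web" instead of real HTTP fetches. The names TOY_WEB, normalise and crawl_breadth_first are hypothetical and not taken from any real crawler. It shows breadth-first traversal, a simple answer to URL aliases (normalising the URL) and duplicate or mirror detection by hashing page content. Reducing server load, for example with a per-host delay, is left out.

```python
from collections import deque
from hashlib import sha256
from urllib.parse import urlsplit, urlunsplit

# Hypothetical toy "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://example.org/": ("home", ["http://EXAMPLE.org/a", "http://example.org/b#top"]),
    "http://example.org/a": ("page a", ["http://example.org/b", "http://mirror.example.net/a"]),
    "http://example.org/b": ("page b", []),
    "http://mirror.example.net/a": ("page a", []),  # mirror: same content, different URL
}

def normalise(url):
    """Collapse simple URL aliases: lower-case the host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, ""))

def crawl_breadth_first(seed, fetch, max_pages=100):
    """Visit pages in breadth-first order; return (crawled URLs, URLs skipped as duplicates)."""
    start = normalise(seed)
    frontier = deque([start])      # FIFO queue gives breadth-first order
    seen_urls = {start}
    seen_digests = set()           # content hashes of pages already crawled
    crawled, duplicates = [], []
    while frontier and len(crawled) + len(duplicates) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:           # unknown URL in the toy web
            continue
        text, links = page
        digest = sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_digests: # same content seen under another URL
            duplicates.append(url)
            continue
        seen_digests.add(digest)
        crawled.append(url)
        for link in links:
            link = normalise(link)
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return crawled, duplicates

crawled, duplicates = crawl_breadth_first("http://example.org/", TOY_WEB.get)
print(crawled)
print(duplicates)
```

Replacing popleft() with pop() turns the queue into a stack and the traversal into depth-first, which is the trade-off named in the first bullet.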

Update Rate
www.searchengineshowdown.com (May 2003)

Search Engine   Newest Page Found   Rough Average   Oldest Page Found
Google          2 days              1 month         165 days
MSN (Ink)       1 day               4 weeks         51 days
HotBot (Ink)    1 day               4 weeks         51 days
AlltheWeb       1 day               1 month         599 days
Gigablast       45 days             7 months        381 days
Teoma           41 days             2.5 months      81 days
WiseNut         133 days            6 months        183 days

Indexer - Design Considerations
• How do we handle typing mistakes?
• Do we use a stop list and a stemming algorithm?
• How much of a given web page do we want to index?
– Google indexes only the first 101 KB of a web page and 120 KB of a PDF file.
• How big do we want the indexed database to be?
– response time vs coverage
• Do we want to index PDF and PS files?
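
A minimal sketch of an indexer that makes three of these choices explicit, assuming a tiny illustrative stop list, a deliberately crude suffix stripper standing in for a real stemming algorithm, and a per-page size limit set to the 101 KB figure quoted above. All names (STOP_WORDS, crude_stem, build_index) are hypothetical.

```python
from collections import defaultdict
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}  # illustrative stop list
MAX_BYTES = 101 * 1024  # index at most this many bytes of each page

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer, not a real algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenise(text):
    """Truncate to MAX_BYTES, lower-case, drop stop words, then stem."""
    truncated = text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
    words = re.findall(r"[a-z0-9]+", truncated.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def build_index(pages):
    """Build an inverted index: term -> set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenise(text):
            index[term].add(page_id)
    return index

pages = {
    "p1": "Crawling the web",
    "p2": "Indexing pages of the web",
}
index = build_index(pages)
print(sorted(index["web"]))
```

Lowering MAX_BYTES or enlarging the stop list shrinks the index, which is the response time vs coverage trade-off in the list above.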

Estimated Size
[Figure: chart "Estimated Database Total Size" (millions, axis 0 to 3500), showing Estimated and Claim values for Google, AlltheWeb, AltaVista, WiseNut, Hotbot, MSN, Teoma, NLResearch and Gigablast. Source: www.searchengineshowdown.com, Dec 31, 2002]

[Figure: Size Growth]

Query Server - Design Considerations
• Retrieval model.
• Complexity of the query syntax.
• HCI – human computer interface.
• Output display.

Retrieval Model
• Traditional approach:
– Keyword matching returns too many low-quality matches – low precision.
• Search engines need VERY high precision output – even at the expense of RECALL.
• How can we achieve this?
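
A small worked example of the precision and recall trade-off, using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The document ids and result lists are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}

# Plain keyword matching: many results, most of them low quality.
keyword_match = ["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"]
# A short, selective first screen of results.
first_screen = ["d1", "d2"]

print(precision_recall(keyword_match, relevant))
print(precision_recall(first_screen, relevant))
```

Both lists find the same two relevant documents, so recall is equal, but the short list has much higher precision, which is what a user who only reads the first screen experiences.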

Google Retrieval Model
• Utilise the popularity of a page
– If many other pages point to a page, that page must be very important. We can assign a high weight to this page during search.
– If a page is pointed to by a popular page, it can be considered important because it is referred to by a reputable source (a popular page).
– PageRank function.

PageRank Example
[Figure: PageRank example graph; values shown: 100, 50, 53, 50, 3, 9, 50, 3, 3]
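
The slides name the PageRank function without giving it. A minimal sketch follows, assuming the commonly published formulation in which each page gets a base share (1 - d) / N plus d times the rank of every page linking to it, divided by that page's number of out-links. The damping factor d = 0.85, the iteration count and the toy link graph are assumptions for illustration, not values from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it points to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is pointed to by A, B and D.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

C ends up with the highest rank because many pages point to it, and A ranks well because it is pointed to by the popular page C, matching the two bullets above.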

Google Retrieval Model
• Utilise the anchor text.
– Anchors often provide more accurate descriptions of web pages than the pages themselves.
– Anchors may exist for documents which cannot be indexed by a text-based search engine.
• Utilise the appearance of the text.
– Text in a larger or bolder font is weighted higher than other words.

[Figure: Results Overlap]

Metasearch
• Meta searches do not build their own index.
• They use the indexes of existing search engines.
• When a user posts a query to a meta search, the meta search sends the query to a number of search engines and collates the results.
• A list of metacrawlers:
– http://www.searchenginewatch.com/links/article.php/2156241

Meta Search
• metacrawler, www.metacrawler.com
– uses Google, Yahoo, AskJeeves, About, Looksmart, Teoma, Overture, FindWhat.
• dogpile, www.dogpile.com
– uses Google, Yahoo, AskJeeves, About, Looksmart, Teoma, Overture, FindWhat.
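
A minimal sketch of the "send the query to several engines and collate the results" step. The two engines are hypothetical stand-in functions returning fixed ranked lists, and the merge heuristic (summing reciprocal ranks, so a URL returned by several engines moves up) is an assumption chosen for illustration, not the method of any real metasearch service.

```python
from collections import defaultdict

def engine_a(query):
    """Hypothetical engine: returns a ranked list of URLs."""
    return ["http://a.example/1", "http://shared.example/x", "http://a.example/2"]

def engine_b(query):
    """Hypothetical engine: returns a ranked list of URLs."""
    return ["http://shared.example/x", "http://b.example/1"]

def metasearch(query, engines):
    """Query every engine, detect overlapping URLs, and merge into one ranked list."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for name, search in engines.items():
        for position, url in enumerate(search(query), start=1):
            scores[url] += 1.0 / position   # higher positions contribute more
            sources[url].append(name)       # overlap: the same URL from several engines
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sources[url]) for url in ranked]

results = metasearch("distributed ir", {"A": engine_a, "B": engine_b})
for url, found_by in results:
    print(url, found_by)
```

Only rank positions are used here because the underlying engines may not return scores, or may return scores that are not comparable with each other.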

Metasearch Design Issue
• Potential problems:
– Translating the user query into the different query syntax of each search engine.
– Query time is bounded by the least powerful (slowest) underlying system.
– Combining results into a single ranked list is difficult. Effectiveness depends on heuristics and on the information passed back from the underlying search engines.
• detecting overlap in the query results
• different scoring schemes (some do not use one)

Distributed Broker
• Information is indexed locally by geographical location or institutional boundary.
– Suitable for supporting a community that wants to have a common search database.
• Local indexes are combined to provide wider coverage.
• Document scoring is performed locally by each index server.

Distributed Broker
• Example: Harvest
– http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html

Distributed Broker
[Figure: hierarchy of brokers; a Monash broker at the top, FIT and F. Business brokers below it, and CSSE, SIMS, ACC and MGM brokers at the lowest level]

General architecture
• Hierarchical vs Flat
• Hierarchical: underlying index servers are connected through a hierarchy of brokers.
– broker hierarchy provides efficient and global coverage.
– brokers can be geographical, institutional or subject based.

Flat Graph Model
[Figure: flat graph model with nodes labelled query, broker and index server]
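
A minimal sketch of the hierarchical arrangement, assuming two hypothetical classes: an IndexServer that scores its own documents locally (here simply by counting query-term occurrences) and a Broker that forwards a query to its children, which may be index servers or other brokers, and merges what comes back. The broker names reuse labels from the slides; the documents and the scoring rule are made up.

```python
class IndexServer:
    """Holds a local index and scores documents locally."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # doc id -> text

    def search(self, query):
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.documents.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)  # local scoring
            if score > 0:
                hits.append((score, self.name, doc_id))
        return hits


class Broker:
    """Forwards a query to child brokers or index servers and merges their results."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def search(self, query):
        merged = []
        for child in self.children:
            merged.extend(child.search(query))
        # Highest local score first; ties broken by server name and doc id.
        return sorted(merged, key=lambda hit: (-hit[0], hit[1], hit[2]))


csse = IndexServer("CSSE", {"c1": "distributed retrieval systems", "c2": "compilers"})
sims = IndexServer("SIMS", {"s1": "information retrieval and retrieval models"})
fit = Broker("FIT", [csse, sims])
monash = Broker("Monash", [fit])

hits = monash.search("retrieval")
print(hits)
```

Merging raw local scores only works when every index server scores in a comparable way; with heterogeneous servers the broker would need a normalisation step, which this sketch omits.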

Useful site
• www.searchenginewatch.com
– Provides links to most of the information discovery tools.

Summary
• Types of Distributed Information Discovery
– Directory Listing
• Yahoo
– Search Engines
• Google, AskJeeves, Teoma
– Metasearch
• metacrawler, dogpile
– Distributed Broker
• Harvest