搜尋器算法 - 現在與未來

advertisement
Search Engine Algorithms
Now and the Future
Vincent Ng cstyng@comp.polyu.edu.hk
Department of Computing
Hong Kong Polytechnic University
What are search engines?
Search engines are huge databases of web page files
that have been assembled automatically by machine.
A program that searches documents for specified
keywords and returns a list of the documents where
the keywords were found.
Some History
Yahoo
1997 2,000,000 pages
Search
Engine
Watch
WWWW
1994 110,000 pages
2000 2 billons pages
Types of Search Engines
•
Individual
–
•
Individual search engines compile their own
searchable databases on the web.
Meta
–
Meta-searchers do not compile databases.
Instead, they search the databases of multiple
sets of individual engines simultaneously
Meta-search Engines
•
•
•
Do not crawl the web compiling their own
searchable databases.
They search the databases of multiple sets of
individual search engines simultaneously, from a
single site and using the same interface.
Meta-searchers provide a quick way of finding
out which engines are retrieving the best results
for you in your search.
Subject Directories
•
•
•
Unlike search engines, are created and
maintained by human editors, not electronic
spiders or robots.
The editors review and select sites for inclusion
in their directories on the basis of previously
determined selection criteria.
Directories tend to be smaller than search engine
databases, typically indexing only the home
page or top level pages of a site.
Search Logic
AltaVista
Excite
Google
Content
250M pg
250M pg +
media obj
1.25 billon
sites
Default
word
Boolean
Op
OR
OR
AND
AND, AND
NOT, OR
AND, AND
NOT
Limit
including
and
excluding
words
Search Logic
AltaVista
Yes
Case
sensitive
Truncation No, use *
Special
Date,
language
Excite
Google
No
No
No
Automatic
Concept
Search any
searching by language
suggested
terms
Developing a search engine
1. No database, real time search
2. Use a database (e.g. MSSQL, Oracle)
1. Build indices of key words
2. Simple matching
3. Use a database in a server or multiple servers
(server farms)
1. Develop search indices based on key words or metainformation
2. Develop a search structure
spider
indexer
alg
Searching on the Web
client
Query
screen
Search
engine
DB
web
Indexer
spider
Three Algorithms
• A document is represented by
– Occurrences of a keyword
– Hyperlink structures
• Different ranking algorithms
– Boolean spread Activation
– Most-cited
– TFxIDF
Boolean Spread Activation
• Based on the occurrence or absence of
keywords in a document
– R i,q =

M
j=1
( C i,j )
• A better approach
– R i,q =

Pi
C i,j
M
j=1
( I i,j )
Link factor
Most-Cited
• Takes advantage of information about
hyperlinks between web pages
– R i,q =
No link

M
k=1,k<>i
( Li I,k

M
j=1
C k,j )
Li I,k
TFxIDF
• Based on the vector space model
– R i,q =
 term in query (0.5 +
0.5 (term freq of Qj in Pi)/
max term freq of a keyword in Pi))
– R i,q = R i,q / normalized factor
– IDFj = log (N /

N
I=1
C I,j )
An Excellent Search Engine - Google
Result of Search
More about Google
• Much more accurate than most other search
engines
• But Run the same search on Yahoo (look for web
pages) and surprise! - You will get the same
results
• Because – Yahoo is powered by Google!
Google Internal
Google Internal
•
Makes use of the link structure of the Web to
calculate a quality ranking of each web page
– PageRank
•
Utilizes link to improve search results
PageRank
•
It can be thought of as a model of user behaviour
– The probability that a random surfer visits a page
•
PR(A) = (1-d) + d(PR(T1)/C(T1) ….
PR(Tn)/C(Tn))
Other Search Engines
•
Personalization/ Context based
– Individual, web-filtering
– www.searchorbit.com
•
Multimedia search
– Image search
– www.altavista.com (not really)
– http://disney.ctr.columbia.edu/webseek/
Internal Search Engine
•
When under 100 web pages
– One can do it real time
– www.comp.polyu.edu.hk/~cstyng/hci.99/labs/search.h
tm
•
For a small web site
– Direct matching is sufficient
•
Other web sites
– Indexers are needed
Finding a search engine
A Check List
1. What platforms does the search engine and spider
run on? Is it portable?
2. What programming languages is it written with? Is it
internet/web enabled? Don't give me Fortran!
3. Can the vendor customize the system at a
reasonable cost and turn-around time?
4. What about local technical support?
5. Is it designed to search the internet, intranet, and
your local disks?
6. Can it handle different file formats, such as ASCII,
HTML, WORD, PPT, etc. etc.?
A Check List
1. Does it support BIG5, GB, and UNICODE? And do
so efficiently?
2. Is it designed and optimized for the Web? Don't give
me a relational engine!
3. Is it an English search engine retrofit with Chinese
search?
4. Can you control what the spider indexes and how
frequent it indexes?
5. Does it support full Boolean queries and relevance
ranking?
6. Can you search by dates or by categories?
A Check List
1. Can you search files that are on a specific host or of
a certain file type?
2. Can you specify partial words (e.g., econom* and
*port)?
3. Can you expand and translate a query?
4. What about speed? Don't forget to ask for insertion
speed!
5. What about scalability? Can it exploit multiple
servers and CPUs?
A Check List
1. Is the spider/crawler fault tolerant? Can it endure
link or host failures?
2. Can it be optimized according to user behaviors?
3. Can it search across secured servers?
Some search engines in HK
1.
2.
3.
4.
5.
Chinese.yahoo.com
www.goyoyo.com.hk
www.hksrch.com.hk
www.hkonly.com
www.gowhere.com.hk
Your Input
Download