Search Engine Algorithms Now and the Future Vincent Ng cstyng@comp.polyu.edu.hk Department of Computing Hong Kong Polytechnic University What are search engines? Search engines are huge databases of web page files that have been assembled automatically by machine. A program that searches documents for specified keywords and returns a list of the documents where the keywords were found. Some History Yahoo 1997 2,000,000 pages Search Engine Watch WWWW 1994 110,000 pages 2000 2 billons pages Types of Search Engines • Individual – • Individual search engines compile their own searchable databases on the web. Meta – Meta-searchers do not compile databases. Instead, they search the databases of multiple sets of individual engines simultaneously Meta-search Engines • • • Do not crawl the web compiling their own searchable databases. They search the databases of multiple sets of individual search engines simultaneously, from a single site and using the same interface. Meta-searchers provide a quick way of finding out which engines are retrieving the best results for you in your search. Subject Directories • • • Unlike search engines, are created and maintained by human editors, not electronic spiders or robots. The editors review and select sites for inclusion in their directories on the basis of previously determined selection criteria. Directories tend to be smaller than search engine databases, typically indexing only the home page or top level pages of a site. Search Logic AltaVista Excite Google Content 250M pg 250M pg + media obj 1.25 billon sites Default word Boolean Op OR OR AND AND, AND NOT, OR AND, AND NOT Limit including and excluding words Search Logic AltaVista Yes Case sensitive Truncation No, use * Special Date, language Excite Google No No No Automatic Concept Search any searching by language suggested terms Developing a search engine 1. No database, real time search 2. Use a database (e.g. MSSQL, Oracle) 1. Build indices of key words 2. Simple matching 3. Use a database in a server or multiple servers (server farms) 1. Develop search indices based on key words or metainformation 2. Develop a search structure spider indexer alg Searching on the Web client Query screen Search engine DB web Indexer spider Three Algorithms • A document is represented by – Occurrences of a keyword – Hyperlink structures • Different ranking algorithms – Boolean spread Activation – Most-cited – TFxIDF Boolean Spread Activation • Based on the occurrence or absence of keywords in a document – R i,q = M j=1 ( C i,j ) • A better approach – R i,q = Pi C i,j M j=1 ( I i,j ) Link factor Most-Cited • Takes advantage of information about hyperlinks between web pages – R i,q = No link M k=1,k<>i ( Li I,k M j=1 C k,j ) Li I,k TFxIDF • Based on the vector space model – R i,q = term in query (0.5 + 0.5 (term freq of Qj in Pi)/ max term freq of a keyword in Pi)) – R i,q = R i,q / normalized factor – IDFj = log (N / N I=1 C I,j ) An Excellent Search Engine - Google Result of Search More about Google • Much more accurate than most other search engines • But Run the same search on Yahoo (look for web pages) and surprise! - You will get the same results • Because – Yahoo is powered by Google! Google Internal Google Internal • Makes use of the link structure of the Web to calculate a quality ranking of each web page – PageRank • Utilizes link to improve search results PageRank • It can be thought of as a model of user behaviour – The probability that a random surfer visits a page • PR(A) = (1-d) + d(PR(T1)/C(T1) …. PR(Tn)/C(Tn)) Other Search Engines • Personalization/ Context based – Individual, web-filtering – www.searchorbit.com • Multimedia search – Image search – www.altavista.com (not really) – http://disney.ctr.columbia.edu/webseek/ Internal Search Engine • When under 100 web pages – One can do it real time – www.comp.polyu.edu.hk/~cstyng/hci.99/labs/search.h tm • For a small web site – Direct matching is sufficient • Other web sites – Indexers are needed Finding a search engine A Check List 1. What platforms does the search engine and spider run on? Is it portable? 2. What programming languages is it written with? Is it internet/web enabled? Don't give me Fortran! 3. Can the vendor customize the system at a reasonable cost and turn-around time? 4. What about local technical support? 5. Is it designed to search the internet, intranet, and your local disks? 6. Can it handle different file formats, such as ASCII, HTML, WORD, PPT, etc. etc.? A Check List 1. Does it support BIG5, GB, and UNICODE? And do so efficiently? 2. Is it designed and optimized for the Web? Don't give me a relational engine! 3. Is it an English search engine retrofit with Chinese search? 4. Can you control what the spider indexes and how frequent it indexes? 5. Does it support full Boolean queries and relevance ranking? 6. Can you search by dates or by categories? A Check List 1. Can you search files that are on a specific host or of a certain file type? 2. Can you specify partial words (e.g., econom* and *port)? 3. Can you expand and translate a query? 4. What about speed? Don't forget to ask for insertion speed! 5. What about scalability? Can it exploit multiple servers and CPUs? A Check List 1. Is the spider/crawler fault tolerant? Can it endure link or host failures? 2. Can it be optimized according to user behaviors? 3. Can it search across secured servers? Some search engines in HK 1. 2. 3. 4. 5. Chinese.yahoo.com www.goyoyo.com.hk www.hksrch.com.hk www.hkonly.com www.gowhere.com.hk Your Input