Information Retrieval Assignment 1: Commercial Systems
Sylvia King, 9901516

Table of Contents

Introduction
Crawler-Based Search Engine – Google
  Location & Frequency
  Advantages of crawler-based search engines
  Disadvantages of crawler-based search engines
Search Engine Spamming
  Types of Spamming Techniques
  Penalties for Search Engine Spamming
Google
  Forecasted Growth of Google
  Technology Used by Google
    PigeonRank
Human-Powered Directories – AskJeeves / Yahoo
  AskJeeves & Yahoo
    Advantages of Directories
    Disadvantages of Directories
  AskJeeves
    Technology Used by AskJeeves
    AskJeeves & Natural Language Processing
  Yahoo!
    Technology Used in Yahoo
Hybrid Search Engines or Mixed-Results Engines
Meta Search Engines
  Illustration of MetaCrawler
Conclusion
References

Introduction

In general, users are familiar with only a handful of search engines, such as Google, Yahoo, AltaVista, Lycos, AskJeeves, Excite, and HotBot; however, there are many more. The aim of this project is to study the most popularly used search engines, namely Google, Yahoo, and AskJeeves. The project examines the technologies used in these information retrieval systems and explores why Google has recently been ranked number one. I will first discuss the origins of these systems and review each one independently, discussing the differences in their basic retrieval techniques. Search engines can be divided into three categories: crawler-based, human-powered, and hybrid, which is a mixture of the first two. These three types will be studied for differences and common similarities.
Google does what is known in the industry as crawling or spidering the web; a user then searches through what it finds. This report will study the technologies the company uses, namely the PigeonRank system. The project will also discuss human-powered directories, such as AskJeeves and Yahoo, which depend on humans for their listings: in this case you submit a brief description of your entire site to the directory, and a search looks for matches only in the descriptions submitted. These, and the engines mentioned previously, will be analysed for effectiveness by looking at the advantages and disadvantages of each type of search engine.

Crawler-Based Search Engine – Google

Crawler-based search engines, also called indexers, use an automated method of information collection: they are said to crawl through the Internet, cataloguing and indexing websites. These engines have three main components. The first is the spider, also called the crawler, which visits a web page, reads it as it goes along, and then follows links to other pages within that site. The spider returns to the site every month or two and checks for changes; all amendments are detected and transferred into the index. The index, sometimes known as the catalogue, contains a copy of every single web page the spider finds, so when a web page is amended the index is updated with the new changes. It may take a while for new pages or changes found by the spider to be added to the index, so a web page may have been spidered but not yet indexed [1]; until it is added to the index it will not be available to those searching with the crawler-based search engine.

Location & Frequency

Crawler-based search engines are designed to determine the relevancy of the documents requested; to do this, they follow set rules called algorithms.
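The spider-then-index pipeline described in the previous section can be sketched in miniature. The sketch below crawls a tiny in-memory "web" instead of making real HTTP requests, and all page names and contents are made up for illustration:

```python
from collections import deque

# A toy "web": page URL -> (page text, outgoing links). In a real spider
# these would come from HTTP fetches and HTML link extraction.
WEB = {
    "a.html": ("cheap travel deals", ["b.html"]),
    "b.html": ("travel guides and maps", ["a.html", "c.html"]),
    "c.html": ("cooking recipes", []),
}

def crawl_and_index(start_url):
    """Visit pages breadth-first, following links, and build an
    inverted index: word -> set of pages containing that word."""
    index = {}
    seen = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        text, links = WEB[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        queue.extend(links)  # follow links to other pages on the site
    return index

index = crawl_and_index("a.html")
print(sorted(index["travel"]))  # pages mentioning "travel"
```

A page not yet visited by the loop is exactly a page that has been "spidered but not yet indexed": it is unreachable to searchers until its words land in the index.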
They concentrate on the location and frequency of keywords on a web page. Consider a librarian who wants to find a customer books about "travel": she would probably first look at books with "travel" in the title, and search engines operate in much the same manner. The engine checks whether the search keywords appear near the top of a web page, in the headline or in the first few paragraphs of text, a factor known as location; the assumption is that any page relevant to the topic will mention the words near the beginning of the document. Frequency is the other main way search engines determine how relevant a document is: the engine analyses how often the search keywords appear in relation to the other words on the page, and pages with a higher frequency are viewed as more relevant. All the main search engines follow the location-and-frequency process to some degree, but each one adds a little extra technology of its own. No search engine works exactly like another, which explains why entering the same search topic produces different results on different engines. Some search engines also index more web pages than others, and the frequency with which pages are re-indexed differs; an article on search engine popularity shows the difference in the volume of indexed pages. My findings show that no two search engines have exactly the same collection of web pages to search through. Ref [1]

Advantages of crawler-based search engines

- They offer larger searchable databases of web sites.
- The full text of individual web pages is often searchable.
- They are good for searching obscure terms or phrases.

Disadvantages of crawler-based search engines

- There is no human quality control to weed out duplicates and junk.
- The size of the database can produce unmanageably high numbers of search results.
- The search command languages can be complicated and confusing.

Other examples of crawler-based search engines include AltaVista, Excite, HotBot, and Magellan.

Search Engine Spamming

Search engines are also able to penalise or exclude pages from the index if they detect a technique known as spamming. Spamming is when a word is repeated many times on one page, increasing the frequency of that word in order to push the page higher in the result lists. Search engines monitor spamming methods with various tools, and most also follow up customer complaints.

Types of Spamming Techniques

- Keyword stuffing: the repeated use of words to increase their frequency. Search engines analyse pages to determine whether the frequency is above the normal level.
- Invisible text: webmasters have been known to insert keywords toward the bottom of the page; unscrupulous ones then set the text colour to match the page background. This type of spamming is also detectable by the engines.
- Tiny text: the same process as invisible text, but with tiny, unreadable text.
- Page redirects: some engines, especially Infoseek, dislike pages that take the user to another page without his or her intervention, e.g. using META refresh tags, CGI scripts, Java, JavaScript, or server-side techniques.
- Meta tag stuffing: do not repeat your keywords in the meta tags more than once, and do not use keywords that are unrelated to your site's content. Ref [1]

Penalties for Search Engine Spamming

Search engines penalise pages to different degrees: some engines will refuse to index pages that contain spam, while others still index them but rank them lower. In extreme cases the search engine can choose to ban the whole website.
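The frequency analysis described above, used both for relevance ranking and for detecting keyword stuffing, can be sketched as a simple density check. The threshold value and sample texts below are made up for illustration; real engines do not publish their limits:

```python
def keyword_density(text, keyword):
    """Fraction of the words on a page that are the given keyword."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

# Illustrative cutoff: a density far above normal prose suggests
# keyword stuffing. Actual engine thresholds are not public.
SPAM_THRESHOLD = 0.25

def looks_stuffed(text, keyword):
    """Flag a page whose keyword frequency is above the normal level."""
    return keyword_density(text, keyword) > SPAM_THRESHOLD

normal = "our travel agency offers guided travel packages worldwide"
stuffed = "travel travel travel travel cheap travel travel deals travel"

print(looks_stuffed(normal, "travel"))   # False
print(looks_stuffed(stuffed, "travel"))  # True
```

The same density figure, kept below the spam cutoff, is what the location-and-frequency ranking rewards: more mentions of the query term make a page look more relevant, up to the point where the engine decides it is being gamed.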
The common aim of all search engines is to provide the most accurate and up-to-date pages for their users; the activity of spamming clutters their indexes with irrelevant or even misleading information. Spamming is monitored by directories such as Yahoo and AskJeeves, and also by search engines such as Google, which incidentally has a page solely committed to reporting sites that use spam, alongside enginespam.com, a neighbourhood-watch programme dedicated to catching search engine spammers.

Google

Google is currently number one on the list of search engines; it indexes over four billion web pages. Google can handle both simple and advanced queries, and its user interface is easy to navigate. When searching, the user enters a few words about their topic into the text box and hits the search button, and Google returns a list of relevant web pages: those containing all the words the user entered. Refining a search is accomplished simply by adding more words to the search topic; the new query returns a smaller subset of the pages Google found for the original search. Google also has an advanced search page that allows the user to search with result filters, choosing options such as "with all the words", "with the exact phrase", "with at least one of the words", and "without the words" to narrow the search according to their choice. The page also includes filters for date, file format, numeric range, and language, and lets the user expand the list of results from 10 to 100 hits. Ref [2]

Forecasted Growth of Google

Internal Google documents include advertising forecasts that have not been publicly disclosed. According to these documents, Google predicted that the number of advertiser accounts would rise from 280,000 this year to 378,000 in 2005.
From 2004 to 2008 the number of accounts is expected to more than double, to 652,050. Google expects its advertiser accounts to grow 35 percent between 2004 and 2005, but estimates that the growth rate will decline to 15 percent between 2007 and 2008. Ref [3]

Technology Used by Google

Google's hardware consists of more than 10,000 servers, which index more than 4 billion web documents while handling thousands of queries per second with sub-second response times. This section of the report looks at how Google finds results for queries so quickly. Google uses a search technology known as PigeonRank, a system for ranking web pages designed by the founders of Google, Larry Page and Sergey Brin of Stanford University. Ref [4] (It should be noted that "PigeonRank" is the name Google gave, in an April Fools' Day article, to a tongue-in-cheek description of its actual ranking technology, PageRank.) Google processes its search queries at a speed much greater than traditional search engines; it accomplishes this, the article claims, by collecting pigeons in dense clusters.

PigeonRank

The PigeonRank system works as follows. First, a user submits a query to Google; the query is then routed to what is known as a data coop, which flashes result pages past the pigeons at incredible speed. When one of the pigeons in the cluster spots a relevant result, it strikes a rubber-coated steel bar, and this motion assigns the page a PigeonRank value of one; for each further peck, the PigeonRank value is increased. The pages that receive the most pecks are prioritised and shown at the top of the user's results page, and the remaining results are displayed in the order of this pecking system. The PigeonRank method makes it difficult to manipulate results: it is known in the industry that some websites have attempted to boost their rankings by including images on their pages, but Google's PigeonRank technology is not fooled by such techniques.
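PigeonRank was published by Google as an April Fools' parody of its real ranking algorithm, PageRank, which repeatedly redistributes each page's score across the pages it links to. A minimal power-iteration sketch of PageRank follows; the three-page link graph is made up for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with uniform scores
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:                         # dangling page: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:                                # share rank among out-links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Made-up link graph: "a" is linked to by both other pages.
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
best = max(ranks, key=ranks.get)
print(best)  # "a", since it receives the most link weight
```

The pecks in the parody correspond to incoming links here: pages accumulating the most weight from other pages rise to the top of the results, and because that weight comes from the link structure of the whole web, it is hard for a single site to manipulate.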
Graphs accompanying Ref [5] illustrate the claimed efficiencies of the pigeon cluster.

Human-Powered Directories – AskJeeves / Yahoo

AskJeeves & Yahoo

A human-powered directory, as the name suggests, depends on humans for its listings; Yahoo and AskJeeves are examples of directories that depend on humans to collect their data. Normally an editor compiles all the listings that a directory holds. Getting listed with the web's major directories is extremely important, since so many people see the listings. You submit a short description of your site to the directory, and a search then looks for matches only within the submitted descriptions. It is common knowledge in the industry that most services, such as Google, MSN Search, AOL Search, and Teoma, offer both search engine and directory information, although they will generally feature one over the other.

Advantages of Directories

- Good for browsing, when the user is not entirely sure what they are looking for or which keywords to use to find information.
- Because directories use human editors, the general standard is higher than that found in search engines.
- Good for finding commercial sites (this can also be viewed as a disadvantage, as it indicates that non-commercial sites are not as common in directories as they are in engines).
- Keyword searches can be used within any category, improving efficiency.

Disadvantages of Directories

- It can take the user longer to locate a suitable website.
- Directories tend to be smaller than search engine databases, and tend to index only the top-level pages of a site.
- Because directories are maintained by people rather than by spiders, and because they point to sites rather than compiling databases of pages, the content of a site or page can change without the directory being updated.
- Dead links, which do not go to the pages intended but instead produce an error message, are a problem, since it is up to the human editors to maintain the content of the directory. Ref [6]

AskJeeves

The AskJeeves search engine was founded in 1996 by David Warthen, a well-known software developer, and Garrett Gruener, the founder of Virtual Microsystems. AskJeeves has a sister company operating in the UK and Ireland, Ask.co.uk, which is now among the ten most popular search engines in the UK. Ref [7]

Technology Used by AskJeeves

AskJeeves is a human-powered directory search engine known for its ability to interpret natural language queries, and it has now acquired the privately held Teoma Technologies. AskJeeves assists the user with questions that help narrow the search, and it is also known to simultaneously search up to six other search sites for relevant web pages. Teoma, the backbone technology, was started by scientists at Rutgers University and has been described as "the next big thing in search engines" [2]. Teoma places a strong emphasis on site popularity in its ranking algorithms, and the engine ranks a site based on two measures:

- Subject-Specific Popularity: the number of web pages about the subject that reference the page.
- General Popularity: the number of all web pages that reference the page.

Teoma also presents what are called "communities" of expert sites, relevant knowledge hubs that can guide the user through their search. AskJeeves, via Teoma, is said to index over 1.5 billion web pages. Searching AskJeeves is accomplished using the simple or advanced search page.

AskJeeves & Natural Language Processing

AskJeeves is noted for its use of natural language processing.
This technique avoids forcing searchers into Boolean or other query languages: AskJeeves allows the user to type in a question and uses that question for the search. AskJeeves' natural language processing search algorithm goes through the question and finds the most relevant words. Other search engines have also taken up the natural language processing field.

Yahoo!

Yahoo is the oldest search engine website, in operation since 1994. Yahoo has been concentrating on developing a new search engine technology for a few years and now has its own search engine database. Its inventors are David Filo and Jerry Yang, who both studied at Stanford University. During 1994 they customised the Yahoo database to serve the needs of its growing user base.

Technology Used in Yahoo

Yahoo has recently acquired its own brand of search engine, with its own indexing and ranking methods, a move said to create competition in the industry and start a new race for first place. Yahoo was surrounded by speculation regarding the Inktomi index: would Yahoo replace it with the Google-powered search technology it was originally using? Journal reports indicated that Yahoo had built newly developed search technology of its own; an article published in February 2004 stated that Yahoo had dropped Google and introduced its new algorithmic search technology, with a crawler called Yahoo Slurp used for indexing its web pages. Ref [6]

Yahoo searches are run from the main page: typing a description starts a simple search, clicking the search shortcut links can quickly narrow the search, and there are links to yellow pages, weather, news, and various products. The advanced search facility gives the user drop-down filters to guide the search.
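The kind of keyword extraction described above for AskJeeves, turning a plain-English question into search terms, can be sketched as a simple stop-word filter. This is not Ask's actual algorithm, and the stop-word list and sample question are made up for illustration:

```python
# A tiny illustrative stop-word list; a real NLP pipeline would use a
# far larger list plus stemming and phrase detection.
STOP_WORDS = {"what", "is", "the", "a", "an", "of", "in", "how", "do",
              "i", "where", "can", "find", "who", "was", "to", "on"}

def question_to_keywords(question):
    """Strip punctuation and stop words from a natural language question,
    keeping the content-bearing terms to forward to the search index."""
    words = question.lower().replace("?", "").split()
    return [w for w in words if w not in STOP_WORDS]

print(question_to_keywords("What is the capital of France?"))
# ['capital', 'france']
```

The point of the technique is exactly what the text describes: the user never sees Boolean operators; the system quietly reduces the question to the most relevant words before searching.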
Hybrid Search Engines or Mixed-Results Engines

In the early days of the web, a search engine would show either crawler-based results or human-powered listings; in today's environment, however, it is extremely common to find a merger of both. Usually a hybrid search engine will favour one type of listing over the other: Yahoo, for example, is more likely to present human-powered listings, and Google its crawled listings.

Meta Search Engines

A meta search engine acts as an agent between the user and the search engines. These engines use what is known as a metacrawler, which searches the databases of various search engines; meta search engines do not build or maintain web indexes of their own, they use the indexes built by others. It is sometimes very difficult to retrieve results from search engines, and in the quest to find a vital piece of information people often search several engines. This exercise is time-consuming, and the problem lies in sifting through all the duplicated documents.

Illustration of MetaCrawler

Ref [7] provides an illustration of how MetaCrawler queries multiple search engines in parallel on the World Wide Web. The main feature of a meta search engine is its ability to save time: it searches various engines simultaneously and also removes duplicated documents. The user's query is sent to multiple search engines, and the meta search engine generally presents the first 10 to 30 results from each results page. The advantage is that it can single-handedly search several databases for the required topic; the disadvantage associated with this type of search engine is that it may return a limited number of hits. Examples include WebCrawler and Query Server.

Conclusion

"Search engine" is a term used to describe both true search engines, such as Google and AltaVista, and web directories, such as Yahoo and AskJeeves.
This assignment highlighted the fact that there is a distinct difference between the two. Google and AltaVista are search engines that spider, or crawl, through websites compiling data, so hits are found based on the information held in their databases. The PigeonRank technology is an effective tool used by Google, and I feel that for this reason it has been granted number-one status among search engines. Yahoo and AskJeeves are actually directories that rely on humans for their listings: the creators of websites submit a short descriptive report in an attempt to have their site included, the directory editors write descriptions for the sites reviewed, and a search on a directory finds hits based on matches in the submitted descriptions. The major difference between the search engines is their varied use of technology. AskJeeves was developed with a view to incorporating natural language processing: searches can be accomplished by submitting questions, and the use of Boolean operators becomes redundant. The Teoma technology it uses places a strong emphasis on site popularity in its ranking algorithms; this technology appears to be the backbone of the corporation and its secret tool for propelling AskJeeves up the ranking lists.

References

[1] Danny Sullivan, "How Search Engines Rank Web Pages", SearchEngineWatch, July 31, 2003. searchenginewatch.com/webmasters/article.php/2167961 (Intro to Search Engine Optimization).
[2] "A Review of Search Engines", The Search Engines Guide, Kansas Public Library – Online, March 31, 2004.
[3] "Google Forecasts Growth", SearchEngineWatch blog, searchenginewatch.com/blog/041020-111337, November 20, 2004.
[4] Chris Sherman, "The Technology Behind Google", SearchEngineWatch, August 12, 2002.
[5] Chris Sherman, "The Technology Behind Google", SearchEngineWatch Journal, August 12, 2002.
[6] Chris Sherman, SearchEngineWatch Journal, February 18, 2004.
[7] MetaCrawler / Husky Search group, "An illustration of how MetaCrawler works". washington.edu/research/projects/ai/metacrawler/