Thai, Minerva
February 13, 2008
CompSci 49S
Professor Babu

Hyper-Searching the Web

Hundreds of millions of web pages are created every day on the World Wide Web, increasing many-fold the amount of data available online to internet users. In order to sort through all of this information and navigate the large collection, users must have two things: an intention in mind and a mode of discovering their intent. The rapid growth of the internet led to the creation of the search engine as a mode of finding a user's intent. After just a few years, the search engine has branched into various methods of search, and so categories have sprouted.

Different Types of Search Engines

There are many different search engines now available for the user's convenience. The most basic are also the most common; they maintain their own indexes of keywords. There are also cluster search engines, which may keep such indexes but also separate the results of keywords into "clusters," or themes of ideas. Then there are meta-search engines, which do not have indexes of their own. Rather, these engines use the indexes of other search engines and compile the results of all of them: they maintain databases of partner search engines and run search queries through their partners to create a results list. Additionally, there are "smarter" meta-search engines, which are combinations of cluster and meta-search engines.

The Basic Search Engine (examples: AltaVista, Infoseek, HotBot, Lycos, Excite, Google)

The basic search engine was the first type to be developed. Its programming follows a process of crawling, indexing, and returning results. In crawling, the engine sends out smaller programs called "crawlers," which visit many web pages rapidly and store all the information associated with them, such as text, images, and hyperlinks. These web pages are then separated into indexes keyed by keywords.
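The crawl-then-index step can be illustrated with a minimal sketch. The page texts and URLs below are made up for illustration; a real engine would also store images, hyperlinks, and much more per page.

```python
# Minimal sketch of keyword indexing: map each word to the set of
# pages that contain it. The pages dict is a hypothetical stand-in
# for content a crawler would have fetched.

def build_index(pages):
    """pages: dict of url -> page text. Returns word -> set of urls."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index

pages = {
    "a.example": "apple pie recipe",
    "b.example": "apple computer history",
}
index = build_index(pages)
print(sorted(index["apple"]))  # both pages mention "apple"
```

A query for a keyword then reduces to one dictionary lookup, which is what makes this categorization fast for the searcher.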
This categorization makes it easier for a searcher to find pages relevant to his search query. In order to determine the significance of the crawled pages, the basic search engine must have a form of ranking to reference. Most use heuristics, which are, in short, the simplest answers to problems. The most basic heuristic, which some basic search engines may still use (and which was certainly used when they were first created), is to count the number of times keywords appear on a page. This poses a problem because a repeated word may have no relevance to the page. An example of this is Tom Wolfe's The Kandy-Kolored Tangerine-Flake Streamline Baby, which features the word "hernia" repeatedly on its first page. Though the book has no relevance to hernias, it would show up in a search for "hernia" if this simple heuristic were used. Worse, ill-intentioned web authors may exploit this method by coding irrelevant keywords in invisible font, misleading users. Measures have been taken over the years to try to prevent such results from reaching naïve users. Some search engines have employed other means of ranking, such as Google with PageRank, which was discussed in previous presentations by Jason and Bryan.

Some problems that stem from using a basic search engine are faced by other search engines as well. Programs have no recognition of searcher intent and so cannot procure the "best" result for an individual user. There are also problems with synonymy (different words with the same meaning) and polysemy (one word with more than one meaning). An example of synonymy is "car" and "automobile," which are easily recognized as related by users but not by search engines. Polysemy occurs with the term "jaguar," which can mean anything from a live animal to a fast car. One proposed solution is for employees to enter semantic relations into search engines' systems, but this can only address synonymy; in fact, it may actually aggravate polysemy.
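The keyword-count heuristic described above, and its flaw, can be sketched in a few lines. The page texts are invented to mirror the "hernia" example: mere repetition wins regardless of relevance.

```python
# Sketch of the naive keyword-count heuristic: rank pages purely by
# how often the query term appears. The page texts are made up to
# expose the flaw -- a repetitive but irrelevant page ranks first.

def count_rank(pages, term):
    """pages: dict of url -> text. Returns urls, best-ranked first."""
    term = term.lower()
    scores = {url: text.lower().split().count(term)
              for url, text in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)

pages = {
    "medical.example": "hernia symptoms and treatment",
    "novel.example": "hernia hernia hernia hernia hernia",  # repetition, no substance
}
print(count_rank(pages, "hernia"))  # the repetitive page ranks first
```

The same mechanism is what invisible-font keyword stuffing exploits: inflating the count inflates the rank.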
Additionally, author intent may not be captured through crawling, so a company like IBM, which does not advertise the word "computer" explicitly on its web page, may not be associated with its product.

Cluster Search Engine (example: Clusty)

Cluster search engines do just as they are described: they cluster. When a searcher poses a query to a cluster search engine, the results are presented back with theme separations on the side. When using Clusty and searching the term "apple" (as offered by the class), 232 theme results were found and displayed in a left-bar panel. These covered a wide range of topics, such as Apple the company, iPods, and even the fruit (though last on the top-ten list). When a student asked about the number of actual results Clusty produced (85+ million), a comparison was made to Google's search results (525+ million). Though the numbers are only a fraction of Google's, the benefit of a cluster search engine is that it can surface even obscure, lesser-searched-for senses of a query because of the themes it separates the results into. This gives results that other search engines would rank low a chance to be found by searchers.

Meta-Search Engine (examples: Dogpile, SurfWax, Copernic)

The meta-search engine is unique among search engines because it does not rely on its own crawling or indexing. In fact, all it has to maintain is a database of search engines to reference. Once a searcher enters a query, the meta-search engine sends the query out to its references and compiles the results they return. Some critics claim that this type is no better than the engines it references, and also point out that most referenced search engines are small, free, and/or commercial. Thus, the results are argued not to be quite as "good" as those of other types of search engines. One exception, however, would have to be Dogpile.
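The query fan-out that a meta-search engine performs can be sketched as follows. The partner "engines" here are stub functions returning canned URL lists; a real meta-search engine would issue live queries to its partner services instead.

```python
# Sketch of meta-search: forward one query to every partner engine
# and merge their result lists, de-duplicating across partners.
# engine_a and engine_b are hypothetical stubs standing in for
# real partner search engines.

def engine_a(query):
    return ["a1.example", "shared.example"]

def engine_b(query):
    return ["shared.example", "b1.example"]

def meta_search(query, partners):
    merged, seen = [], set()
    for partner in partners:            # run the query through each partner
        for url in partner(query):
            if url not in seen:         # keep only the first occurrence
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("chicken", [engine_a, engine_b]))
```

Note that nothing here crawls or indexes; the engine's quality depends entirely on the partner list it maintains, which is the critics' point.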
When observing Dogpile, the class noticed that its sources were not small search engines but rather consisted of Google, Yahoo, Live Search, and Ask.com, all big search companies. The term "chicken" was searched, and as the results showed, the referenced search engines were noted. Interestingly enough, the #2 result was a link to an ad on Google; this may be indicative of the type of partnership Dogpile has with Google, which may be stronger than with the others in order for an advertisement to show up. The search engines SurfWax (which had no summary text per result) and Copernic referenced lesser-known search engines. One interesting aspect of meta-search engines is that any user can easily create one through Google's Custom Search Engine program, which can store up to 5,000 URLs in a database.

"Smarter" Meta-Search Engine (example: Clever Project)

The "smarter" meta-search engine is an extension of a regular meta-search engine, except that it also clusters results and offers linguistic analysis of keywords. Unfortunately, the only example, the Clever Project, failed in its making.

The Clever Project

"A respected authority is a page that is referred to by many good hubs; a useful hub is a location that points to many valuable authorities."

The proposed Clever Project was a "smarter" meta-search engine based on using hyperlinks to find and rank hubs and authorities. Hubs are web pages that point to various sources of information of a particular sort, while authorities are web pages that provide the information itself. The Clever Project determines these in its process of obtaining web pages. First it retrieves web pages from a standard index it is given, then follows hyperlinks from these pages to other pages, increasing its own database.
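The mutually reinforcing hub and authority scores just described can be sketched with a small iterative computation (in the spirit of the HITS-style analysis underlying Clever). The link graph below is made up for illustration; `links` maps each page to the pages it points to.

```python
# Sketch of iterative hub/authority scoring: a page's authority is
# the sum of the hub scores of pages pointing at it, and a page's
# hub score is the sum of the authority scores of pages it points to.
# The tiny link graph is hypothetical.

def hits(links, iterations=20):
    pages = set(links) | {t for ts in links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        for scores in (auth, hub):      # normalize so values stay bounded
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"hub1": ["site_a", "site_b"], "hub2": ["site_a"]}
hub, auth = hits(links)
print(auth["site_a"] > auth["site_b"])  # more hubs point to site_a
```

Starting from equal guesses and recalculating repeatedly, the scores settle: pages cited by many good hubs end up with high authority, matching the quoted definition above.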
(One complication is that competition discourages linking: competitors will not link to their rivals' sites, just as IBM's web page does not point a user to Dell or Compaq.) The resulting stored collection is referred to as the "root set," and after it is stored, individual web pages are given numerical hub and authority scores. These scores are determined in a fashion similar to the way Google determines PageRank: Clever traces the links connected to a web page and figures out whether it is more of a hub or an authority. Through initial guesses and repeated calculation, it is able to score each web page, and a useful byproduct of this method is clusters of sites separated into themes.

Clever vs. Google

Though the Clever Project failed, it still had many points of comparison with Google, which survived. While Google's PageRank gave individual pages initial rankings, Clever found and ranked pages only per keyword, as it determined the related root set. In effect, it prioritized web pages according to the context of queries, whereas Google already kept an index of keywords and pages regardless of any user search. Unfortunately for Clever, Google was much faster at returning results through this method of prior crawling and indexing. Clever proved better than Google at gathering web pages, however, because it did not follow hyperlinks in the forward direction alone. Rather, Clever searched backwards as well, a useful tactic for its method of determining hub and authority scores. Because of this, though, it would often return results that were too broad: if a user searched "Fallingwater," for instance, results relating to the architecture of the period would appear. Though it had its good points, Clever never made it into common usage among searchers.

Conclusion

With every huge phenomenon that turns up, various branches are created. With the search engine, many types popped up, and these four are commonly used.
Though I am not sure whether I will switch from Google, a "basic" search engine, to a cluster or meta-search engine, I now know of the choices I have. In some cases, the other types would be more useful. Either way, these search engines serve the purpose of making the World Wide Web more accessible for all internet users.