Hyper-Searching the Web

Thai, Minerva
February 13, 2008
CompSci 49S
Professor Babu
Hundreds of millions of web pages are created on the World Wide Web every day, increasing manyfold the amount of data available to internet users. In order to sort through all of this information and navigate the growing collection, users need two things: an intention in mind and a means of pursuing that intention. The rapid growth of the internet led to the creation of the search engine as just such a means. Within a few years, search engines branched into various methods of search, and distinct categories sprouted.
Different Types of Search Engines
There are many different search engines now available to users. The most basic, which are also the most common, maintain their own indexes of keywords. Cluster search engines may keep such indexes as well, but they also separate keyword results into “clusters,” or themes of ideas. Then there are meta-search engines, which keep no indexes of their own. Instead, these engines use the indexes of other search engines and compile the results of all of them: a meta-search engine maintains a database of partner search engines and runs each query through those partners to create a results list. Finally, there are “smarter” meta-search engines, which combine features of cluster and meta-search engines.
The Basic Search Engine
(examples: AltaVista, Infoseek, HotBot, Lycos, Excite, Google)
The basic search engine was the first type to be developed. Its programming follows a process of crawling, indexing, and returning results. In crawling, the engine sends out smaller programs called “crawlers,” which visit many web pages rapidly and store the information associated with them, such as text, images, and hyperlinks. These web pages are then sorted into indexes keyed by keywords. This categorization makes it easier for a searcher to find pages relevant to a query.
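The crawl-then-index step described above can be sketched in a few lines of Python. This is a simplified illustration of an inverted keyword index, not any real engine’s code; the URLs and page texts are invented, and a real index would store far more than bare words.

```python
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text.
    Returns an inverted index: keyword -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical crawled pages.
pages = {
    "site-a": "apple pie recipes",
    "site-b": "apple computer news",
}
index = build_index(pages)
# index["apple"] now points at both pages; index["pie"] only at site-a.
```

A searcher’s query can then be answered by looking up the query term in the index rather than scanning every page.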
In order to determine the significance of crawled pages, the basic search engine must have a form of ranking to reference. Most use heuristics, which are, in short, simple rules of thumb for solving problems. The most basic heuristic, which some basic search engines may still use (and certainly used when they were first created), is to count the number of times keywords appear on a page. This poses a problem because a repeated word may have no real relevance to the page. An example is Tom Wolfe’s The Kandy-Kolored Tangerine-Flake Streamline Baby, which features the word “hernia” repeatedly on its first page. Though the book has nothing to do with hernias, it would rank highly in a search for “hernia” if this simple heuristic were used.
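The keyword-counting heuristic, and the problem the Wolfe example illustrates, can be sketched as follows. The page texts are invented, with “novel-excerpt” standing in for the repetitive passage:

```python
def rank_by_frequency(query, pages):
    """Naive heuristic: rank pages by how often the query term appears.
    pages: dict mapping URL -> page text."""
    counts = {url: text.lower().split().count(query)
              for url, text in pages.items()}
    return sorted(counts, key=counts.get, reverse=True)

pages = {
    "medical-site": "hernia symptoms and treatment",
    "novel-excerpt": "hernia hernia hernia hernia hernia",  # repetition, no relevance
}
print(rank_by_frequency("hernia", pages))  # the irrelevant excerpt ranks first
```

The genuinely relevant page loses simply because it mentions the term only once, which is exactly the weakness spammers exploit.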
Unfortunately, ill-intentioned web users may exploit this method by coding irrelevant keywords in invisible font, misleading users. Measures have been taken over the years to try to prevent such results from appearing to naïve users. Some search engines have employed other means of ranking, such as Google with PageRank, which was discussed in previous presentations by Jason and Bryan.
Some problems that stem from using a basic search engine are faced by other types of search engines as well. Programs have no recognition of searcher intent and so cannot procure the “best” result for an individual user. There are also problems with synonymy (different words with the same meaning) and polysemy (one word with more than one meaning). An example of synonymous words is car and automobile, which users easily recognize as related but search engines do not. Polysemy occurs with a term like “jaguar,” which can mean anything from a live animal to a fast car. One proposed solution is for employees to enter semantic relations into search engines’ systems by hand, but this can only address synonymy; in fact, it may actually aggravate polysemy. Additionally, author intent may not be captured through crawling, so that a company like IBM, which does not display the word “computer” explicitly on its web page, may not be associated with its own product.
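The proposed fix of hand-entered semantic relations might look like the sketch below. The synonym table is hypothetical. Note how expanding “car” to include “automobile” helps with synonymy, while expanding an ambiguous term like “jaguar” toward its automotive sense would only worsen polysemy for a user who meant the animal:

```python
# Hypothetical hand-entered semantic relations.
SYNONYMS = {
    "car": ["automobile"],      # helps: pages saying "automobile" now match "car"
    "jaguar": ["car", "auto"],  # hurts: drags an ambiguous query toward one sense
}

def expand(query):
    """Return the query term plus any hand-entered synonyms for it."""
    return [query] + SYNONYMS.get(query, [])
```

Running the expanded term list against an index would retrieve more synonym matches, but for “jaguar” it would bury the animal-related results even deeper.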
Cluster Search Engine
(example: Clusty)
Cluster search engines do just as their name describes: they cluster. When a searcher poses a query to a cluster search engine, the results are presented back with theme separations on the side. When Clusty was used to search the term “apple” (as offered by the class), 232 theme results were found and displayed in a left-bar panel. These covered a wide range of topics, such as Apple the company, iPods, and even the fruit (though it was last on the top-ten list). When a student asked about the number of actual results Clusty produced (85+ million), a comparison was made to Google’s count for the same search (525+ million). Though this is only a fraction of Google’s total, the benefit of a cluster search engine is that its theme separations surface even obscure, less-searched-for results. This gives results that other search engines would rank low a chance to be found by searchers.
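The theme separation itself can be sketched as grouping result summaries under theme labels. The results and the fixed theme table below are invented; a real cluster engine derives its themes automatically from the results rather than from a hand-built table:

```python
def cluster(results, themes):
    """results: list of (title, summary) pairs.
    themes: dict mapping theme label -> keywords that signal the theme.
    Returns theme label -> list of matching result titles."""
    clusters = {label: [] for label in themes}
    for title, summary in results:
        for label, keywords in themes.items():
            if any(k in summary.lower() for k in keywords):
                clusters[label].append(title)
    return clusters

# Hypothetical results for the query "apple".
results = [
    ("Apple Inc.", "apple computer iphone company"),
    ("iPod review", "apple ipod music player"),
    ("Orchard tips", "growing apple fruit trees"),
]
themes = {
    "Company": ["computer", "company"],
    "iPod": ["ipod"],
    "Fruit": ["fruit", "trees"],
}
grouped = cluster(results, themes)
```

Even though the fruit-related page might rank last in a flat list, it still appears prominently under its own theme label.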
Meta-Search Engine
(examples: Dogpile, SurfWax, Copernic)
The meta-search engine is unique among search engines because it does not rely on its own crawling or indexing. In fact, all it has to maintain is a database of search engines to reference. Once a searcher enters a query, the meta-search engine sends the query out to its references and returns results from them. Some critics claim that this type is no better than the engines it references, and they also point out that most referenced search engines are small, free, and/or commercial; thus, they argue, the results are not quite as “good” as those from other types of search engines. One exception, however, would have to be Dogpile.
When observing Dogpile, the class noticed that its sources were not small search engines but rather Google, Yahoo, Live Search, and Ask.com, all big search companies. When the term “chicken” was searched, each result was labeled with the search engine it came from. Interestingly enough, the #2 result was a link to an ad on Google; this may indicate that Dogpile’s partnership with Google is stronger than its others, if an advertisement can appear that high in the list. The search engines SurfWax (which showed no summary text per result) and Copernic referenced lesser-known search engines.
One interesting aspect of meta-search engines is that any user can easily create one through
Google’s Custom Search Engine program, which can store up to 5,000 URLs in a database.
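The merging step at the heart of a meta-search engine can be sketched as below. The partner “engines” here are stand-in functions, not real APIs; interleaving results round-robin and removing duplicates is just one simple merging policy among many:

```python
def meta_search(query, engines):
    """engines: dict mapping engine name -> function(query) -> ranked URL list.
    Returns a merged, de-duplicated list of (url, source_engine) pairs."""
    result_lists = {name: fn(query) for name, fn in engines.items()}
    merged, seen = [], set()
    # Interleave round-robin so no single partner dominates the top.
    for i in range(max(len(r) for r in result_lists.values())):
        for name, results in result_lists.items():
            if i < len(results) and results[i] not in seen:
                seen.add(results[i])
                merged.append((results[i], name))
    return merged

# Hypothetical partner engines returning canned ranked lists.
engines = {
    "engine-a": lambda q: ["x.com", "y.com"],
    "engine-b": lambda q: ["y.com", "z.com"],
}
merged = meta_search("chicken", engines)
```

Note that, as with Dogpile, each merged result keeps a label saying which partner engine supplied it.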
“Smarter” Meta-Search Engine
(example: Clever Project)
The “smarter” meta-search engine is an extension of the regular meta-search engine, except that it also clusters results and offers linguistic analysis of keywords. Unfortunately, the only example, the Clever Project, never succeeded as a working product.
The Clever Project
“a respected authority is a page that is referred to by many good hubs; a useful hub is a location
that points to many valuable authorities”
The proposed Clever Project was a “smarter” meta-search engine based on using hyperlinks to find and rank hubs and authorities. Hubs are web pages that point to various sources of information on a particular subject, while authorities are web pages that provide the information itself. The Clever Project identifies both in its process of gathering web pages. First it retrieves web pages from a standard index it is given, then follows hyperlinks from those pages to other pages, growing its own database. (This step can be hindered by competition, since competitors will not link to their rivals’ sites; IBM’s web page, for example, does not link a user to Dell or Compaq.) The resulting stored collection is referred to as the “root set,” and once it is stored, individual web pages are given numerical hub and authority scores. These scores are determined in a fashion similar to how Google determines PageRank: Clever traces the links connected to a web page and figures out whether it is more of a hub or an authority. Through initial guesses and repeated recalculations it scores each page, and a useful byproduct of this method is that sites fall into clusters separated by theme.
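The guess-and-recalculate scoring described above can be sketched as the following iteration. This shows the general hub/authority idea, not the Clever Project’s actual code, and the tiny link graph is invented:

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dicts."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}   # initial guess: every page equal
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # A page's authority score sums the hub scores of pages pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        # A page's hub score sums the authority scores of pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded between iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical root set: a portal page linking to two content pages.
graph = {"portal": ["siteA", "siteB"], "siteA": ["siteB"]}
hub, auth = hits(graph)
```

After a few iterations, the portal emerges as the strongest hub and the most-linked-to page as the strongest authority, mirroring the mutually reinforcing definitions quoted above.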
Clever vs. Google
Though the Clever Project failed while Google survived, the two had many points of comparison. Google’s PageRank gives individual pages rankings in advance, whereas Clever found and ranked pages only in response to each query, as it determined the related root set. In other words, Clever prioritized web pages according to the context of a query, while Google already kept an index of ranked pages regardless of any particular user’s search. Unfortunately for Clever, this prior crawling and indexing made Google much faster at returning results. Clever did handle the gathering of web pages better, however, because it did not follow hyperlinks in the forward direction alone; it searched backwards as well, a useful tactic for its method of determining hub and authority scores. Because of this breadth, though, it would often return results that were too broad: if a user searched “Fallingwater,” results relating to the architecture of the period in general would appear. Despite its strengths, Clever never reached common usage among searchers.
Conclusion
With every huge phenomenon that turns up, various branches are created. With the search engine, many types have popped up, and these four are the most commonly used. Though I am not sure whether I will switch from Google, a “basic” search engine, to a cluster or meta-search engine, I now know the choices I have, and in some cases the other types would be more useful. Either way, all of these search engines serve the purpose of making the World Wide Web more accessible to internet users.