Information Retrieval
Assignment 1
Commercial Systems
Sylvia King 9901516
Table of Contents
Introduction
Crawler-Based Search Engine – Google
Location & Frequency
Advantages of crawler-based search engines
Disadvantages of crawler-based search engines
Search Engine Spamming
Types of Spamming Techniques
Penalties for Search Engine Spamming
Google
Forecasted growth of Google
Technology used for Google
   PigeonRank
Human-Powered Directories – AskJeeves / Yahoo
AskJeeves & Yahoo
Advantages of Directories
Disadvantages of Directories
AskJeeves
Technology used for AskJeeves
AskJeeves & Natural Language Processing
Yahoo!
Technology used in Yahoo
Hybrid Search Engines or Mixed-Results Engines
Meta Search Engines
Illustration of MetaCrawler
Conclusion
References
Introduction
In general, users are familiar with only a handful of search engines, such as Google, Yahoo, AltaVista, Lycos, AskJeeves, Excite, and HotBot; however, there are many more available. The aim of this project is to study the most popularly used search engines, namely Google, Yahoo, and AskJeeves. The project will examine the technologies used in these information retrieval systems and explore why Google has recently been named number one on the list. Initially, I will discuss the origins of each and review them independently, discussing the differences in their basic retrieval techniques.
Search engines can be divided into three categories: crawler-based, human-powered, and hybrid, the last being a mixture of the first two. These three will be studied for their differences and similarities. Google does what is known in the industry as crawling or spidering the web, and a user then searches through what it finds; this report will study the technologies the company uses, namely the PigeonRank system.
My project will also discuss human-powered directories, such as AskJeeves and Yahoo, which depend on humans for their listings; in this case you submit a short description of your entire site to the directory, and a search then looks for matches only in the descriptions submitted. These, together with the engines mentioned previously, will be analysed for effectiveness, looking at the advantages and disadvantages of each type of search engine.
Crawler-Based Search Engine – Google
Indexers, or crawler-based search engines, collect information automatically; they are said to crawl through the Internet, cataloguing and indexing websites.
These search engines have three main components. The first is the spider, also called the crawler, which visits a web page, reads it, and then follows links to other pages within that site. The spider returns to the site every month or two to check for changes; any amendments it detects are transferred into the index.
The index, sometimes known as the catalogue, contains a copy of every web page that the spider finds, so when a web page is amended, the index or catalogue is updated with the new changes. It may take a while for new pages or changes found by the spider to be added to the index, so a web page may have been spidered but not yet indexed [1]; until it is added to the index, it will not be available to those searching with the crawler-based search engine.
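To make the crawl-and-index cycle concrete, the sketch below shows a minimal, single-threaded crawler in Python. It is only an illustrative toy under my own assumptions (a seed URL, a small page limit, and a one-second politeness delay); it is not how any commercial spider is actually implemented.

```python
# Minimal, hypothetical sketch of the spider/index cycle described above.
# It fetches a page, records its text in a toy "index", then follows the
# links it finds -- a single-threaded stand-in for a real crawler.
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10, delay=1.0):
    index = {}             # toy catalogue: url -> raw HTML
    frontier = [seed_url]  # pages still to visit
    seen = set()
    while frontier and len(index) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue               # skip unreachable or non-HTML pages
        index[url] = html          # the "spidered" page goes into the index
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)
        time.sleep(delay)          # be polite to the server
    return index

# pages = crawl("https://example.com")
```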
Location & Frequency
Crawler-based search engines are designed to determine the relevancy of the documents requested; to do this they follow sets of rules called algorithms, which concentrate on the location and frequency of keywords on a web page. As an analogy, a librarian who wishes to find a customer books relating to "travel" would probably look first at books with travel in the title; search engines operate in much the same manner.
The search engine checks whether the search keywords appear near the top of a web page, a factor known as location; it looks in the headline and in the first few paragraphs of text. The assumption is that pages relevant to the topic will mention the keywords near the beginning of the document.
Frequency is the other factor search engines use to determine how relevant a document is: how often the search keywords appear in relation to the other words on the page. Pages in which the keywords appear with higher frequency are viewed as more relevant. All of the main search engines follow the location and frequency approach to some degree, but each adds its own extra technology, so no two search engines rank pages in exactly the same way. This explains why entering the same search topic into different search engines produces different results.
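As an illustration of the location and frequency idea, the toy scoring function below adds one point per keyword occurrence and a fixed bonus if the keyword appears near the top of the page. The weights, the 500-character "top of page" cut-off, and the sample pages are arbitrary assumptions, not values used by any real engine.

```python
# Toy relevance score combining the two signals described above:
# frequency (how often a keyword occurs) and location (whether it
# appears near the top of the page). Weights are arbitrary.
def relevance_score(page_text, keywords, top_chars=500, location_bonus=5):
    words = page_text.lower().split()
    top_section = page_text.lower()[:top_chars]
    score = 0
    for kw in keywords:
        kw = kw.lower()
        score += words.count(kw)        # frequency component
        if kw in top_section:
            score += location_bonus     # location component
    return score

docs = {
    "page1": "Travel guides and travel tips for budget travel in Europe.",
    "page2": "A blog about cooking, with one mention of travel at the end.",
}
ranked = sorted(docs, key=lambda d: relevance_score(docs[d], ["travel"]), reverse=True)
# page1 ranks above page2: "travel" is both frequent and near the top.
```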
Some search engines index more web pages than others; an article on search engine popularity shows the differences in the volume of indexed pages, and the frequency with which pages are indexed also makes a difference. My findings show that no two search engines have exactly the same collection of web pages to search through. Ref [1]
Advantages of crawler-based search engines
• Offers larger searchable databases of web sites.
• The full text of individual web pages is often searchable.
• Good for searching obscure terms or phrases.
Disadvantages of crawler-based search engines
• No human quality control to weed out duplicates and junk.
• The size of the database can produce unmanageably high numbers of search results.
• The search command languages can be complicated and confusing.
Other examples of crawler-based search engines include AltaVista, Excite, HotBot, and Magellan.
Search Engine Spamming
Search engines are also able to penalise or exclude pages from the index if they detect a technique known as spamming. Spamming is when a word is repeated many times on one page, increasing the frequency level of that word in order to push the page higher up the result lists. Search engines watch for spamming methods with various tools, and most also follow up customer complaints.
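A crude version of such a frequency check could flag a page when any single word accounts for an unusually large share of its text, as in the sketch below; the 10% threshold is an arbitrary assumption for illustration only.

```python
# Crude keyword-stuffing check: flag a page if any single word makes up
# more than `threshold` of all words on the page. The 10% cut-off is an
# arbitrary illustrative value, not a figure used by any real engine.
from collections import Counter

def looks_stuffed(page_text, threshold=0.10):
    words = [w.lower() for w in page_text.split() if w.isalpha()]
    if not words:
        return False
    top_word, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > threshold

print(looks_stuffed("cheap flights cheap flights cheap flights book now"))  # True
```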
Types of Spamming Techniques
Keyword stuffing.
This is the repeated use of words to increase their frequency. Search engines analyse pages to determine whether the frequency is above the normal level.
Invisible text.
Webmasters have been known to insert keywords toward the bottom of a page; unscrupulous individuals then set the text colour the same as the page background. This type of spamming is also detectable by the engines.
Tiny text.
This is the same technique as invisible text, but using tiny, unreadable text.
Page redirects.
Some engines, especially Infoseek, do not like pages that take the user to another page without his or her intervention, e.g. using META refresh tags, CGI scripts, Java, JavaScript, or server-side techniques.
Meta tag stuffing.
Do not repeat your keywords in the meta tags more than once, and do not use keywords that are unrelated to your site's content. Ref [1]
Penalties for Search Engine Spamming
Search engines penalise pages to differing degrees: some engines refuse to index pages that contain spam, while others will still index them but rank the pages lower. In extreme cases the search engine can choose to ban the whole website. The common aim of all search engines is to provide the most accurate and up-to-date pages for their users; the activity of spamming clutters their indexes with irrelevant or even misleading information.
Spamming is monitored by directories such as Yahoo and AskJeeves and also by search engines such as Google, which incidentally has incorporated a page solely dedicated to reporting sites that use spam, alongside enginespam.com, a neighbourhood-watch programme dedicated to catching search engine spammers.
Google
Google is currently number one on the list of search engines; it indexes over four billion web pages. Google handles both simple and advanced queries, and its user interface is easy to navigate. When searching in Google, the user enters a few words about their topic into the text box and hits the search button, and Google returns a list of relevant web pages, each of which contains all the words the user entered. Refining a search is simply accomplished by adding more words to the search topic; the new query returns a smaller subset of the pages that Google found for the original search.
Google also has an advanced search page that allows the user to search using result filters. The user chooses options such as "with all the words", "with the exact phrase", "with at least one of the words", and "without the words", and the search is narrowed according to their choice. The page also includes filters for date, file format, numeric range, and language, and lets the user expand the list of results from 10 to 100 hits. Ref [2]
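This "all the words must appear, and extra words narrow the results" behaviour can be modelled as set intersection over a small inverted index, as in the hedged toy sketch below; the sample documents are invented, and this is not Google's actual query pipeline.

```python
# Toy inverted index: adding a query word intersects another posting
# list, so the result set can only shrink -- the refinement behaviour
# described above.
documents = {
    1: "cheap flights to paris",
    2: "paris travel guide",
    3: "cheap hotels in rome",
}

inverted_index = {}
for doc_id, text in documents.items():
    for word in set(text.split()):
        inverted_index.setdefault(word, set()).add(doc_id)

def search(query):
    result = None
    for word in query.lower().split():
        postings = inverted_index.get(word, set())
        result = postings if result is None else result & postings
    return result or set()

print(search("paris"))        # {1, 2}
print(search("cheap paris"))  # {1} -- the extra word narrows the results
```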
Forecasted growth of Google
The internal Google documents include advertising forecasts that have not been publicly disclosed. According to the documents, Google predicted that the number of advertiser accounts would rise from 280,000 this year to 378,000 in 2005, and that from 2004 to 2008 the number of accounts would more than double, to 652,050.
Google expects its advertiser accounts to grow 35 percent between 2004 and 2005; however, it estimates that the growth rate will decline to 15 percent between 2007 and 2008.
Ref [3]
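As a rough consistency check of these figures, the snippet below compounds the quoted 2004 base of 280,000 accounts through a declining growth schedule. Only the 35% and 15% rates and the 280,000, 378,000 and 652,050 totals come from the cited documents; the intermediate 25% and 20% rates are my own assumption, chosen only to show that the quoted figures are mutually consistent.

```python
# Compound-growth check of the quoted forecast. The 35% (2004-05) and
# 15% (2007-08) rates and the account totals come from the cited
# documents; the intermediate 25% and 20% rates are assumed here.
accounts = 280_000
for rate in (0.35, 0.25, 0.20, 0.15):
    accounts *= 1 + rate
print(round(accounts))    # 652050, matching the forecast for 2008
print(652_050 / 280_000)  # ~2.33, i.e. "more than double" over 2004-2008
```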
Technology used for Google
Google's hardware consists of more than 10,000 servers, which index more than 4 billion web documents while handling thousands of queries per second with sub-second response times. This section of the report looks at how Google finds the results for queries so quickly.
Google uses a search technology known as PigeonRank, a system for ranking web pages designed by the founders of Google, Larry Page and Sergey Brin of Stanford University. Ref [4] Google processes its search queries at a speed much greater than traditional search engines; it accomplishes this by collecting pigeons in dense clusters.
PigeonRank
The PigeonRank system works as follows:
• First, a user submits a query to Google.
• The query is then routed to what is known as a data coop.
• The data coop monitors flashing result pages at incredible speed.
• When a relevant result is located by one of the pigeons in the cluster, it strikes a rubber-coated steel bar; this motion assigns the page a PigeonRank value of one.
• For each further peck, the PigeonRank value is increased.
• The pages that receive the most pecks are prioritised and shown at the top of the user's results page.
• The remaining results are displayed in order of this pecking system.
The PigeonRank method used by Google makes it difficult to manipulate results; it is known in the industry that some websites have attempted to boost their rankings by including images on their pages, but Google's PigeonRank technology is not fooled by such techniques. Graphs showing the efficiency of the pigeon cluster can be found in Ref [5].
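Taken at face value, the ordering step in the list above is simply a count-and-sort: each candidate page accumulates pecks and the results are shown highest first. The toy sketch below illustrates only that ordering; the page names and peck counts are invented for illustration.

```python
# Toy sketch of the ordering described above: each candidate page
# accumulates "pecks" and results are shown highest-pecked first.
# The counts are invented purely for illustration.
peck_counts = {
    "example.com/travel-guide": 42,
    "example.com/cheap-flights": 17,
    "example.com/blog": 3,
}
results_page = sorted(peck_counts, key=peck_counts.get, reverse=True)
print(results_page)  # pages with the most pecks appear first
```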
Human-Powered Directories – AskJeeves / Yahoo
AskJeeves & Yahoo
A human-powered directory, as the name suggests, depends on humans for its listings; examples are Yahoo and AskJeeves, which are actually directories that depend on humans to collect their data.
Normally an editor compiles all the listings that a directory holds. Getting listed with the web's major directories is extremely important, since so many people see the listings. You submit a short description of your site to the directory, and a search then looks for matches only in the descriptions submitted.
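Because a directory search matches only against these submitted descriptions, and never the full text of the sites, it can be modelled as a simple keyword match over short description strings. The sketch below is a toy illustration; the site names and descriptions are invented.

```python
# Toy directory search: only the short descriptions submitted by site
# owners are searched, never the full text of the sites themselves.
directory = {
    "example-travel.com": "Budget travel guides for Europe and Asia.",
    "example-recipes.com": "Home cooking recipes and weekly meal plans.",
}

def directory_search(query):
    terms = query.lower().split()
    return [site for site, desc in directory.items()
            if all(t in desc.lower() for t in terms)]

print(directory_search("travel europe"))  # ['example-travel.com']
```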
It is common knowledge in the industry that most services, such as Google, MSN Search, AOL Search, and Teoma, offer both search engine and directory information, although they generally feature the directories rather than the crawled websites.
Advantages of Directories
• Good for browsing, when the user is not entirely sure what they are looking for.
• Useful if the user is unsure which keywords to use in order to find information.
• Because these directories use human editors, the general standard is higher than what is found in search engines.
• Good for finding commercial sites. This can also be viewed as a disadvantage, as it indicates that non-commercial sites are not as common in directories as they are in engines.
• Keyword searches can be used within any category, thus improving efficiency.
Disadvantages of Directories
• It can take the user longer to locate a suitable website.
• Directories tend to be smaller than search engine databases, and tend to index only the top-level pages of a site.
• Because directories are maintained by people rather than by spiders, and because they point to sites rather than compiling databases of pages, the content of a site or page can change without the directory being updated.
• Dead links, which are links that do not go to the pages they are intended to but instead produce an error message, are a problem, as it is up to the human editors to maintain the content of the directory.
Ref [6]
AskJeeves
The AskJeeves search engine was founded in 1996 by David Warthen, a well-known software developer, and Garrett Gruener, the founder of Virtual Microsystems. AskJeeves has a sister company operating in the UK and Ireland, namely Ask.co.uk, which is now among the top ten most popular search engines in the UK.
Ref [7]
Technology used for AskJeeves
AskJeeves is a human-powered directory search engine known for its ability to interpret natural language queries, and it has now acquired the privately held Teoma Technologies. AskJeeves assists the user through questions that help narrow the search, and it is also known to simultaneously search up to six other search sites for relevant web pages. Teoma is the backbone technology, started by scientists at Rutgers University; it is said that this technology is "the next big thing in search engines". [2]
Teoma technology places strong emphasis on site popularity in its ranking algorithms, and the search engine ranks a site based on what are known as:
• Subject-Specific Popularity: the number of web pages about the subject that reference the page.
• General Popularity: the number of all web pages that reference the page.
Teoma also presents what are called "communities" of expert sites; these are relevant knowledge hubs that can guide the user through their search. AskJeeves, via Teoma, is said to index over 1.5 billion web pages. Searching AskJeeves is accomplished using either the simple or the advanced search page.
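The two popularity signals attributed to Teoma above can be sketched as counts over a link graph: general popularity counts every page linking to a target, while subject-specific popularity counts only linking pages that are themselves about the query subject. The link graph, topic labels, and page names below are invented for illustration and do not reflect Teoma's real algorithm.

```python
# Toy link graph: page -> (topic of the page, pages it links to).
# "General popularity" counts every inlink; "subject-specific
# popularity" counts only inlinks from pages about the same subject.
link_graph = {
    "A": ("travel", ["D"]),
    "B": ("travel", ["D"]),
    "C": ("cooking", ["D"]),
}

def general_popularity(target):
    return sum(target in links for _, links in link_graph.values())

def subject_popularity(target, subject):
    return sum(target in links
               for topic, links in link_graph.values()
               if topic == subject)

print(general_popularity("D"))            # 3 inlinks in total
print(subject_popularity("D", "travel"))  # 2 inlinks from travel pages
```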
AskJeeves & Natural Language Processing
AskJeeves is noted for its use of natural language processing. This technique avoids forcing searchers to use Boolean or other query languages: AskJeeves allows the user to type in a question and uses that question for the search.
AskJeeves uses the Ask natural language processing search algorithm, which goes through the question and finds the most relevant words. Other search engines have moved into the natural language processing field as well.
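The behaviour described here, taking a plain-English question and keeping only its most relevant words so the user never needs Boolean operators, can be approximated by stripping question words and common stopwords. The sketch below is only a rough approximation under that assumption; it is not Ask's actual algorithm, and the stopword list is my own.

```python
# Toy natural-language query handling: strip question words and common
# stopwords, and keep the remaining terms as the search keywords.
STOPWORDS = {"what", "where", "when", "who", "how", "is", "the", "a",
             "an", "in", "of", "to", "do", "i", "can", "find"}

def question_to_keywords(question):
    words = question.lower().strip("?!. ").split()
    return [w for w in words if w not in STOPWORDS]

print(question_to_keywords("Where can I find cheap flights to Paris?"))
# ['cheap', 'flights', 'paris']
```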
Yahoo!
Yahoo is the oldest search engine website, in operation since 1994. Yahoo has been concentrating on developing a new search engine technology for a few years and now has its own search engine database. Its founders are David Filo and Jerry Yang, who both studied at Stanford University. During 1994, they customised the Yahoo database to serve the needs of its growing user base.
Technology used in Yahoo
Yahoo has recently acquired its own brand of search engine, with its own indexing and ranking methods; this move is said to create competition in the industry, starting a new race for first place.
Yahoo was surrounded by speculation regarding the Inktomi index: would Yahoo replace it with the Google-powered search technology it had originally been using? Journal reports indicated that Yahoo had built a newly developed search technology of its own; an article published in February 2004 stated that Yahoo had dropped Google and introduced its new algorithmic search technology, with its Yahoo Slurp crawler used for indexing its web pages.
Ref [6]
Yahoo searches are accessed from the main page: typing a description starts a simple search, clicking the search shortcut links can quickly narrow the search, and there are links to yellow pages, weather, news, and various products. The advanced search facility allows the user to apply dropdown filters to guide their search.
Hybrid Search Engines or Mixed-Results Engines
In the early days of the web, a search engine would present either crawler-based results or human-powered listings; however, in today's environment it is extremely common to find a merger of both. Usually, a hybrid search engine will favour one type of listing over the other: for example, Yahoo is more likely to present human-powered listings, while Google favours its crawled listings.
Meta Search Engines
A meta search engine works as an agent between the user and the search engines. These systems use what is known as a metacrawler, which searches the databases of various search engines; meta search engines do not build or maintain their own web indexes, they use the indexes built by others. It is sometimes difficult to retrieve results from a single search engine, and in the quest to find that vital piece of information people often search several engines. This exercise is time-consuming, and the problem lies in sifting through all the duplicated documents.
Illustration of MetaCrawler
An illustration of how MetaCrawler queries multiple search engines in parallel on the World Wide Web. Ref [7]
The main feature of a meta search engine is the ability to save time: it searches various engines simultaneously and also removes duplicated documents. The user's query is sent to multiple search engines, and meta search engines generally present the first 10 to 30 results from each results page. The advantage here is that a single meta search can cover several databases for the required topic; the disadvantage associated with this type of search engine is that it may return only a limited number of hits.
Examples: WebCrawler and Query Server.
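The process just described, sending the query to several engines in parallel, taking the top results from each, and dropping duplicates, can be sketched as follows. The two engine functions are hypothetical placeholders standing in for real back-end calls, and the per-engine limit of 10 results is an assumption.

```python
# Toy meta-search: fan the query out to several "engines" in parallel,
# take the top results from each, and remove duplicate URLs. The engine
# functions are placeholders standing in for real back-end calls.
from concurrent.futures import ThreadPoolExecutor

def engine_a(query):
    return ["http://site1.example", "http://site2.example"]

def engine_b(query):
    return ["http://site2.example", "http://site3.example"]

def meta_search(query, engines=(engine_a, engine_b), per_engine=10):
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda e: e(query)[:per_engine], engines)
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:     # duplicate removal
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("information retrieval"))
# ['http://site1.example', 'http://site2.example', 'http://site3.example']
```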
Conclusion
The term "search engine" is used to describe both true search engines, such as Google and AltaVista, and web directories, such as Yahoo and AskJeeves. This assignment has highlighted the fact that there is a distinct difference between the two.
Google and AltaVista are search engines that spider, or crawl, through websites compiling data, so hits are found based on the information held in their databases. PigeonRank is an effective tool used by Google, and I feel that for this reason it has been granted number one status among search engines.
Yahoo and AskJeeves are actually directories that rely on humans for their listings. The creators of websites can submit a short descriptive report in an attempt to have their site included. The directory editors write descriptions for the sites being reviewed, so a search of a directory finds hits based on matches in the submitted descriptions.
The major difference between the search engines is their varied use of technology. AskJeeves was developed with a view to incorporating natural language processing: searches can be accomplished by submitting questions, so Boolean operators become redundant. The Teoma technology used places a strong emphasis on site popularity in its ranking algorithms; this technology seems to be the backbone of the corporation and its secret tool for propelling AskJeeves up the ranking lists.
References
[1] Danny Sullivan, "How Search Engines Rank Web Pages", SearchEngineWatch, 31 July 2003. searchenginewatch.com/webmasters/article.php/2167961 (Intro to Search Engine Optimization).
[2] "A Review of Search Engines", The Search Engines Guide, Kansas Public Library (online), 31 March 2004.
[3] "Google Forecasts Growth", SearchEngineWatch blog, searchenginewatch.com/blog/041020-111337, 20 November 2004.
[4] Chris Sherman (Associate Editor), "The Technology Behind Google", SearchEngineWatch, 12 August 2002.
[5] Chris Sherman (Associate Editor), "The Technology Behind Google", Search Engine Watch Journal, 12 August 2002.
[6] Chris Sherman, Search Engine Watch Journal, 18 February 2004.
[7] MetaCrawler / Husky Search group, illustration of how MetaCrawler queries multiple search engines, washington.edu/research/projects/ai/metacrawler/