brinBlog - USF Computer Science Department

Comments: Brin, Sergey and Page, Larry, The Anatomy of a Large-Scale Hypertextual Web Search Engine
A general comment on Sergey and Larry's style: it seems to be written with a subtle sales-pitch
tone. Is it possible this paper was used to attract prospective investors to their
company?
It has a lot of that late-90's tone about how the web is gonna blow up and rule the
world... Some of the text sounds as if it could have been lifted from one of the many "B2B"
sites now long gone:
"Creating a search engine which scales even to today's web presents many challenges.
Fast crawling technology is needed to gather the web documents and keep them up to
date. Storage space must be used efficiently to store indices and, optionally, the
documents themselves. "
They obviously had the right idea in addressing scalability issues with search engines.
Posted by Andrew at January 31, 2005 07:30 PM
I agree with Andrew, and wonder if they have any more recent updates to this paper,
because I'm sure their methods and architecture have changed since it was written.
In the paper, it is also mentioned that they would eventually like to scale up to more than
100 million pages, and that in order for their system to be able to support that many
pages, they would have to greatly increase the complexity of the system. So, what's been
done?!
Also, a large number of URLs within the paper are not active anymore, especially
http://google.stanford.edu!
Posted by Marc at January 31, 2005 08:02 PM
Scratch that last bit, I just had a typo when I tried to go there. The Google site at Stanford
still works, but a number of other links in their citations don't...
Posted by Marc at January 31, 2005 08:07 PM
That's another difficulty with this format: there's no ability to modify or version a comment
previously made by the same person, to update or fix it...
Posted by Marc at January 31, 2005 08:08 PM
About Robots and exclusion...
pg 13 mentions the use of the Robot Exclusion Protocol. I was curious about it and found
some info: http://www.robotstxt.org/wc/exclusion.html
Basically, a web site owner has the right to exclude their web content from being
searched. This ability to be excluded (from even Google!) brings up an interesting
dilemma:
What happens when more and more "for pay" information sites appear on the web, such
as Safari Books (http://safari.oreilly.com/) and jGuru (http://www.jguru.com)? It is in the best
interests of these sites to keep much of their content private, so that they may charge for
the use of the information. Won't this riddle the web with unsearchable "dark information spaces",
like Swiss cheese, thus reducing the effectiveness of search engines such
as Google that rely on web crawling?
One can picture that eventually free searches may only return one level of information to
users, and that "subsearches" within specific sites such as Safari Books will be necessary
to chase after quality sources of information.
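Going back to the Robots Exclusion Protocol mentioned above, here is a minimal sketch (in Python, using the standard library's robotparser module; the crawler name and URLs are placeholders of my own) of how a crawler can honor a site's robots.txt before fetching a page:

    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleCrawler"   # hypothetical crawler name, for illustration only

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # fetch the site's exclusion rules
    rp.read()

    url = "http://www.example.com/private/report.html"
    if rp.can_fetch(USER_AGENT, url):
        print("Allowed to crawl:", url)
    else:
        print("Excluded by robots.txt:", url)   # a polite crawler skips this page

Any page a site owner lists as disallowed simply never enters the index, which is exactly the "dark information space" effect described above.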
Posted by Andrew at January 31, 2005 08:29 PM
To summarize the paper, it plays off of Brin and Page’s opinion of the time, which states
that “The biggest problem facing users of web search engines today is the quality of the
results they get back”. (One could argue that the same issue still faces us today).
They then go into the methods they use to make their results more relevant than the
standard text-only engines of the time: link structure, anchor text, relative font
information, and word proximity.
Even though it is not a main theme of the paper, it is mentioned a few times, and comes
off as a strong bias, that the drive behind their engine is academic rather than
commercial or advertising-driven.
Posted by Marc at January 31, 2005 08:35 PM
In response to Marc's question about what Google currently does to handle web indexing, which at last count on their site was:
Searching 8,058,044,651 web pages
That is over 8000 times the size they said they would like Google to be able to handle. Does the
design described in the paper still work for this, or do they have a "secret sauce" being
used?
On the rumors side, Google's infrastructure has puzzled many outside observers; apparently
the size and capability of their hardware infrastructure greatly outweighs the
effort currently needed to perform their searches:
"Google is thought to be a shrewd judge of computing value, having built its widely
admired infrastructure on the back of low-budget server clusters. At the same time,
curious geeks have long pondered the apparent mismatch between its service demands
and the reputed scale of its computing resources."
http://news.zdnet.com/2100-1035_225537392.html
Perhaps there is more going on with Google's technology than is currently known (or
shared without an NDA).
Posted by Andrew at January 31, 2005 08:42 PM
Hard disks and Cost of Search:
One major benefit of Google's search design is that a complete index
could be stored in roughly 55-108 gigabytes of disk (pg 18), and that in 1998:
"At current disk prices this makes the repository a relatively cheap source of useful data."
I wanted to find out how much 100 gigs of HD space cost in 1998, as well as predict what
Google's current hard disk space requirement is (based on the architecture described in
the paper).
1998 (based on the 26 million pages indexed, pg 18): 100 gigabytes of disk storage, at an
average cost of $0.06 per megabyte in 1998 = $6,000
2005 (based on 8,058,044,651 web pages indexed, approx. 310x growth): 31,000 gigabytes
of storage, at an average cost of $1.00 per gigabyte = $31,000
Based on this, the operating cost to maintain the type of search described in the paper has
increased approx. 5x, but overall it still appears to be a scalable design.
HD price history information: http://www.littletechshoppe.com/ns1625/winchest.html
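For what it's worth, the back-of-the-envelope numbers above can be reproduced in a few lines of Python; the prices and page counts are just the estimates quoted in this post, not measured figures:

    # 1998 estimate: ~100 GB of repository at roughly $0.06 per megabyte
    cost_1998 = 100 * 1000 * 0.06            # = $6,000

    # 2005 estimate: ~310x more indexed pages, at roughly $1.00 per gigabyte
    growth = 8_058_044_651 / 26_000_000      # ~310x growth in pages
    cost_2005 = 100 * growth * 1.00          # ~= $31,000 for ~31,000 GB

    print(round(growth), cost_1998, round(cost_2005), round(cost_2005 / cost_1998, 1))
    # prints roughly: 310 6000.0 30993 5.2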
Posted by Andrew at January 31, 2005 09:08 PM
Oops... I left a few too many zeros in my post above:
" Searching 8,058,044,651 web pages...That is over 8000 times "
Actually it's more like 80 times (based on their 100 million page goal). Still quite a large
growth.
Posted by Andrew at January 31, 2005 09:12 PM
Overall I found the Google paper to be quite inspiring. The story of a few bright young kids
coming up with a creative idea that changes the world has developed into an archetypal
high-tech success story here on the west coast. Brin and Page join the list of incredibly
famous and successful local technology pioneers who made their mark on history:
Steve Jobs, Steve Wozniak (Apple Computer)
Patrick Naughton, Mike Sheridan and James Gosling (Java programming language)
Bill Hewlett, Dave Packard (Hewlett Packard)
Bill Gates, Paul Allen (Microsoft)
and many more- who else can add to this list?
Also, I found the paper fun and interesting to read, much more so than the Kleinberg
paper, which seemed a bit too academic and dry.
Posted by Andrew at January 31, 2005 09:27 PM
In Appendix B, section 9.2, while discussing the scalability of centralized indexing
architectures, the Harvest system is referenced as an efficient and elegant distributed
solution to the indexing problem. The reference in the paper’s References section,
http://harvest.transarc.com, is defunct, but I found a paper on the Harvest system at:
http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
Harvest uses distributed brokers (indexing servers) for searches in targeted areas. For
example, they have one indexer dedicated only to Computer Science Technical Papers,
and another one which searches PC Software, and one for World Wide Web Home Pages.
Posted by Marc at January 31, 2005 10:11 PM
Since all they really talk about is their algorithm, one of the techniques I found interesting was
the idea of weighting differently marked-up text in HTML documents differently. They mention
weighting bolded or larger text more than the other words on the page, but I am kind of
dubious as to whether that works. Personally, depending on the font I'm using, I may bold
all the words just to make the page more visible/appealing. The other idea, though, using anchor
text, is more applicable. Again, they have to take measures to deal with links like
"here" or "this", but for the most part anchor text should provide a good indication
of the type of page it points to, so using it is a good idea. Since older search engines
didn't even take links into account, they couldn't have applied this logic, but I wonder whether
even the bolded/italicized/larger-text weighting scheme would improve things.
Posted by Deniz at January 31, 2005 10:53 PM
Good points, Deniz, but with regard to bold text and font size, they store each word's prominence only
relative to the rest of the text around it, so if the entire document is bold, it won't be
ranked any more significantly.
Also, they mention that they may use the words surrounding the link text to improve
upon the problem you pointed out of anchor text only containing words such as _this_ or
_here_.
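To make the relative-font point concrete, here is a rough sketch of my own (an illustration, not Google's code) of scoring word prominence relative to the document's own baseline, so an all-bold or all-large page gains nothing:

    from collections import Counter

    def relative_font_weights(words):
        """words: list of (word, font_size) pairs extracted from one document."""
        sizes = Counter(size for _, size in words)
        base_size = sizes.most_common(1)[0][0]    # the document's dominant font size
        weights = {}
        for word, size in words:
            # Only text larger than the page's own baseline gets a boost; if
            # everything is big, the baseline rises and nothing stands out.
            weights[word] = max(weights.get(word, 1.0), 1.0 + 0.5 * (size > base_size))
        return weights

    doc = [("cheap", 12), ("flights", 12), ("to", 12), ("paris", 18), ("deals", 18)]
    print(relative_font_weights(doc))   # 'paris' and 'deals' stand out; the rest do not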
Posted by Marc at January 31, 2005 10:57 PM
This paper describes in detail the design and implementation of the Google web search
engine. At the time this paper was written, Google had yet to gain its current worldwide
popularity, hence the interesting perspective of introducing a new product, even though it
is now a common idea. The design obstacles and goals are explained in relation to the
construction and features of the system. The details of the feature set, such as the
PageRank algorithm, are explained in terms that are actually understandable, much better
than the HITS description in the Kleinberg paper. The exact architecture of the system is
described in conjunction with hardware resources and requirements. The requirements
are quite reasonable, considering the hardware available at the time. As it is, even today
the requirements are realistic, though they have definitely increased to some degree.
Operation and performance are described in light of the resources available at the time for
testing and the future goals of the system. One of the more interesting sections is the
appendix on the influence of advertising on search results, a potential problem that still
exists in modern web searching and development.
Posted by Rudd at January 31, 2005 11:17 PM
A section I found interesting is the description of the obstacles encountered when
crawling the web and collecting the pages to index. Having built a crawler myself, I
related to the difficulty involved with creating a robust system that smoothly crawls the
web. As the authors point out, there are many quirks and unforeseen problems that can’t
be imagined in the design process, but must be discovered along the way. I think the
crawling is almost as important as the PageRank and query handling, due to
its status as the base of the system. The process of running multiple crawlers to reliably
and quickly collect pages to index becomes complicated when the true state of the web
reveals itself, with all the inconsistencies of non-standard web content. There would be no
search index without pages to index; hence collecting the web documents is a crucial
step. Not only is the initial set of documents important; repeatedly updating the document
set also requires a careful implementation, for ease of use and so that site owners across the
web are not alienated as pages are collected. The unintentional ability to swamp a web server
with requests is something that must be carefully avoided (a simple safeguard is sketched
below). It’s good that Brin and Page spent the time to explain the effort involved in
crawling the web; researchers often disregard the support areas needed to create a system.
This can leave others to find out the hard way that there is more to the process than just
the system itself.
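On the point about unintentionally swamping a web server, a common safeguard is a per-host politeness delay. A minimal sketch of my own (made-up URLs, arbitrary 2-second interval; a real crawler would also honor robots.txt and any crawl-delay hints):

    import time
    from urllib.parse import urlparse

    last_hit = {}        # host -> time of the most recent request to that host
    MIN_DELAY = 2.0      # arbitrary politeness interval, in seconds

    def polite_fetch(url, fetch):
        """Wait until MIN_DELAY has passed since the last request to this host."""
        host = urlparse(url).netloc
        wait = last_hit.get(host, 0.0) + MIN_DELAY - time.time()
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return fetch(url)          # `fetch` is whatever download routine the crawler uses

    # Example with a dummy fetch function:
    for p in ["http://example.com/a", "http://example.com/b", "http://example.org/c"]:
        polite_fetch(p, fetch=lambda u: print("fetched", u))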
Posted by Rudd at January 31, 2005 11:18 PM
Inward link paper...
“Mercator: A Scalable, Extensible Web Crawler”, Allan Heydon and Marc Najork;
Compaq Systems Research Center [http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf].
This paper describes the process involved in designing and implementing a large-scale
web crawler. A direct comparison to the Google crawlers and the Internet Archive’s crawler
(not Heritrix, I presume, as the paper was written in 1999) is given in light of the authors’
creation of a Java-based, extensible web crawler. The construction and functioning of the
crawler is detailed, with consistent comparison to Google’s and the Internet Archive’s
crawlers, showing the different approaches possible in the design of a web crawler. In
the final performance comparison to the other crawlers, the authors state that it required a
significant amount of time to optimize their crawler, with particular focus on the Java core
libraries. This is interesting, as the Brin and Page “Anatomy” paper states that they used
Python in their crawler, while the Internet Archive’s Heritrix crawler is Java-based. It makes one
wonder whether the much-harped-about “slowness” of the Java libraries plays a significant
role in the development of certain applications. In fact, Heydon and Najork wrote another
paper on the subject “Performance Limitations of the Java Core Libraries”
[http://www.geocities.com/caheydon/papers/Mercator-CPE.pdf]. A final interesting note
is that the Mercator system was scheduled for use in the AltaVista corporate product used
for indexing large intranets.
Posted by Rudd at January 31, 2005 11:22 PM
The obvious cited paper is Kleinberg’s
[http://www-2.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/papers/kleinberg.pdf],
though this is not unusual, as the content is closely linked: both PageRank and the HITS
algorithm make use of link structure.
“Efficient Crawling Through URL Ordering” Cho, Garcia-Molina, Page
[http://www.csd.uch.gr/~hy558/papers/cho-order.pdf] This paper details three methods
for efficient crawling and collecting of web pages. This is obviously important for
maintaining an updated index of pages for a search engine while covering the broadest
expanse of the web in the shortest amount of time. One thing I liked about this paper
(just a side note...) was that its examples of pseudo-code were the closest I’ve seen to
real code that made sense. The three models of URL ordering are described in terms of
performance and implementation, as well as the results obtained from crawling a
controlled test set (the stanford.edu domain). The conclusion was that PageRank gave
the best performance for deciding on the order of crawling pages, which makes sense
considering its use in Google today. There are other factors that influence crawler
performance as well, such as starting at or near a “hot” page, pages with more anchor text,
and pages containing some query words. As is stated in the paper, the testing was on a
limited set of test pages. While the test set was a limiting factor at the time, the use of
PageRank in Google has since shown its worth on the scale of the entire web.
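The core of the URL-ordering idea can be sketched as a crawl frontier kept in a priority queue, where some importance estimate (a PageRank-like score or a backlink count) decides what to fetch next. This is only my illustration of the scheme, not the authors' code; `importance` and `out_links` stand in for the metric and the fetch/parse step:

    import heapq

    def crawl_in_order(seed_urls, importance, out_links, limit=100):
        """Visit pages in decreasing order of an importance estimate."""
        frontier = [(-importance(u), u) for u in seed_urls]   # max-heap via negation
        heapq.heapify(frontier)
        seen, order = set(seed_urls), []
        while frontier and len(order) < limit:
            _, url = heapq.heappop(frontier)      # most promising URL first
            order.append(url)
            for link in out_links(url):           # fetch the page and extract its links
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-importance(link), link))
        return order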
Posted by Rudd at January 31, 2005 11:22 PM
An "idea"
A search engine “sampler”. This might require significant licensing deals, but pretty
much a search engine that, instead of creating its own index, performs a given query on
several established web search engines and returns the top results for each engine.
Perhaps display the results sectioned per search engine. So, the top five results from
Google, and the top five results from Yahoo! and the top five results from AltaVista, etc.
This would be interesting to do just for research as well, to see the differences in search
engines and the algorithms each uses. An extension would be to query not just search
engines but other information sources as well.
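As a rough sketch of how such a sampler could be structured (the per-engine fetchers here are placeholders of my own; real ones would need each engine's licensed API or result-page parsing, as noted above):

    def sample_engines(query, fetchers, n=5):
        """fetchers: dict of engine name -> function(query, n) returning top-n results."""
        sections = {}
        for engine, fetch in fetchers.items():
            try:
                sections[engine] = fetch(query, n)       # top-n results from this engine
            except Exception as err:                     # one engine failing shouldn't kill the page
                sections[engine] = ["(error: %s)" % err]
        return sections

    # Dummy fetchers standing in for real Google / Yahoo! / AltaVista queries.
    def fake(name):
        return lambda q, n: ["%s result %d for '%s'" % (name, i + 1, q) for i in range(n)]

    results = sample_engines("web crawlers", {"Google": fake("Google"), "Yahoo!": fake("Yahoo!")})
    for engine, hits in results.items():
        print(engine, hits)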
Posted by Rudd at January 31, 2005 11:24 PM
Some background on the mentioned authors...
Sergey Brin and Larry Page are the co-founders of Google Inc. and Presidents of
Technology and Products at Google, respectively. They have written numerous papers on
the subject of web search and data mining, both together and separately. Some include
“Extracting Patterns and Relations from the World Wide Web”, “Dynamic Data Mining:
A New Architecture for Data with High Dimensionality”, and “Scalable Techniques for
Mining Causal Structures”.
Junghoo Cho is an assistant professor in computer science at UCLA. His research
interests are focused on the areas of databases and Web technologies. Other papers
include "Beyond Document Similarity: Understanding Value-Based Search and
Browsing Technologies." Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriquez-Mula,
Junghoo Cho (2000), and "Searching the Web." Arvind Arasu, Junghoo Cho,
Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan (2001).
Hector Garcia-Molina is a professor in the Departments of Computer Science and
Electrical Engineering at Stanford University. From August 1994 to December
1997 he was the Director of the Computer Systems Laboratory at
Stanford. From 1979 to 1991 he was on the faculty of the Computer
Science Department at Princeton University. His research interests include distributed
computing systems, digital libraries, and database systems. He is on the
Technical Advisory Board of DoCoMo Labs USA, Kintera, Metreo Markets,
TimesTen, Verity, and Yahoo Search & Marketplace, and is a member of the Board of
Directors of Oracle and Kintera.
Marc Najork currently works for Microsoft Research in Silicon Valley. He is working
on Boxwood, a distributed B-Tree system, and PageTurner, a large-scale study of the
evolution of web pages. He formerly worked at HP's Systems Research Center where he
developed Mercator, a high-performance distributed web crawler, JCAT, a web-based
algorithm animation system, and Obliq-3D, a scripting system for 3D animations.
Posted by Rudd at January 31, 2005 11:26 PM
This paper provides:
1. An in-depth description of a large-scale web search engine.
2. The technical challenges involved in using the additional information present in hypertext
to produce better search results.
3. The challenges involved in working with uncontrolled hypertext collections,
given the nature of the web, where anyone can publish anything they want, and given its
rapid advancement and proliferation.
The World Wide Web Worm in 1994 was getting barely 1,500 queries a day, whereas
AltaVista in November 1997 reported handling about 20 million queries per day. It was therefore
more than obvious that the goal of the system would be to address many of the problems,
both in quality and scalability, introduced by scaling search engine technology to such
extraordinary numbers.
The key difference in how the two algorithms function is this:
Kleinberg's HITS algorithm relies largely on counting citations or backlinks to a
given page, which gives some approximation of a page's importance or quality.
PageRank extends this idea by not counting links from all pages equally, but normalizing
by the number of links on a page.
PageRank thereby forms a probability distribution over web pages, so the PageRanks of all
pages sum to one.
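A tiny sketch of that idea (my own toy illustration of the published formula, not Google's code): each page spreads its rank evenly over its out-links, and the damping factor d plays the random-surfer role discussed below:

    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}                 # start from a uniform distribution
        for _ in range(iterations):
            new_rank = {p: (1.0 - d) / n for p in pages}   # the random-jump term
            for page, outlinks in links.items():
                if not outlinks:
                    continue                               # dangling pages handled separately
                share = d * rank[page] / len(outlinks)     # normalize by out-degree
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # Toy example: three pages in a loop plus one page linking into it.
    print(pagerank({"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}))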
The paper then introduces us to concepts like the Random Surfer model and dangling links,
and how they were dealt with.
It then goes on to explain the PageRank calculation, design issues, hardware,
implementation, and so on.
What I liked about this technique is the use of anchor text, the use of font
variation, keeping pages in a repository, the compressed repository, avoiding disk seeks,
the high-speed crawler (about 100 pages/sec), indexing into barrels, etc.
Further on, the paper explains the problems its crawler faces with copyright and with
users not wanting their pages to be indexed. Frankly speaking, I was shocked: why
wouldn't people want their pages to be indexed by a search engine?
Overall, it is a paper clearly written with a commercial mindset and the well-defined
objective of being the best search engine.
Posted by uddhav at February 1, 2005 11:57 AM
The hardware demands don't sound unreasonable, even given the time when they were being
considered.
A more detailed description of the cluster system used by Google can be found here:
http://www.computer.org/micro/mi2003/m2022.pdf
Posted by uddhav at February 1, 2005 11:59 AM
A more graphical explanation of the PageRank system:
http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=199966&format=pdf&compression=&name=1999-66.pdf
Posted by uddhav at February 1, 2005 12:13 PM
In section 4.5.2, Feedback, they discuss how their ranking function includes many
parameters, and that they also take user feedback into account. It sounds like they allow a
set of trusted users to rate/rank the search engine results. I know this sounds a little
creepy, but what about teaming up with a company like Nielsen Media Research
(http://www.nielsenmedia.com/) or more specifically, their NetRatings division
(http://www.nielsen-netratings.com/)? They could track user clicks and come up with
more click-through analysis, since that is exactly what such companies specialize in. Just a
thought…
Posted by Marc at February 1, 2005 12:18 PM
Larry Page- Lawrence "Larry" E. Page
http://www.google.com/corporate/execs.html
Born March 26, 1973, in Ann Arbor, Michigan, he is one of the founders of the Google internet
search engine. He is a graduate of East Lansing High School. Page has a Bachelor of
Science in engineering, with a concentration in computer engineering, from the
University of Michigan and a Master's degree from Stanford University.
Sergey Brin
http://www.google.com/corporate/execs.html
http://www-db.stanford.edu/~sergey/
Born in Moscow, Russia, Sergey Brin received his Bachelor of Science in computer
science and mathematics from the University of Maryland, College Park, where his father
Michael Brin is a mathematics professor. He received his Master's degree from Stanford
University, but has not completed his Ph.D.
Colleagues:
Hector Garcia-Molina
Professor, Departments of Computer Science and Electrical Engineering.
Stanford University
http://www-db.stanford.edu/people/hector.html
other papers:
Evaluation of Delivery Techniques for Dynamic Web Content - Mor Naaman, Hector
Garcia-Molina, Andreas Paepcke, 2003
Representing Web Graphs - Sriram Raghavan and Hector Garcia-Molina, 2002
Junghoo "John" Cho
Assistant Professor
Department of Computer Science
University of California, Los Angeles
He received his Ph.D. in Computer Science from Stanford University; his thesis was on web
crawlers.
http://oak.cs.ucla.edu/~cho/
other papers :
Junghoo Cho, Hector Garcia-Molina "Effective page refresh policies for Web crawlers."
ACM Transactions on Database Systems, 28(4): December 2003.
Junghoo Cho, Hector Garcia-Molina "Estimating frequency of change." ACM
Transactions on Internet Technology, 3(3): August 2003.
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan
"Searching the Web." ACM Transactions on Internet Technology, 1(1): August 2001.
Sougata Mukherjea, Junghoo Cho "Automatically Determining Semantics for World
Wide Web Multimedia Information Retrieval." Journal of Visual Languages and
Computing (JVLC), 10(6): December 1999, 585-606.
Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriquez-Mula, Junghoo Cho "Beyond
Document Similarity: Understanding Value-Based Search and Browsing Technologies."
SIGMOD Records, 29(1): March 2000.
Junghoo Cho, Hector Garcia-Molina, Lawrence Page "Efficient Crawling Through URL
Ordering." Computer Networks and ISDN Systems, 30(1-7):161-172, 1998
Posted by uddhav at February 1, 2005 12:39 PM
Off the topic, but interesting:
What is behind the success of Google?
http://www.google.com/corporate/tenthings.html
Attitudes and some facts about Larry Page and Sergey Brin:
http://www.j-bradford-delong.net/movable_type/2003_archives/000032.html
Posted by uddhav at February 1, 2005 12:41 PM
With the large number of sophisticated search engines currently available, each with a
personality of its own, it would be really interesting if we could get a tree view of the top N
search results from different search engines simultaneously.
Search String
-- GOOGLE
-- Search results
-- Yahoo
-- Search results
-- Some other sources
-- Search results
We could even include search engines specific to blogs, wikis, forums, etc.
This is something worth researching.
Posted by uddhav at February 1, 2005 12:57 PM
Something that I really liked about Google's approach to link analysis is the Random Surfer
model.
With Kleinberg's HITS algorithm I was somewhat worried about what would happen if there
were a loop; the Random Surfer model addresses this.
Random Surfer model:
The analogy is a person randomly clicking on links. How does this help? In certain
cases, due to the forward and backward links, a link loop is formed
that is termed a rank sink. Now, it is human tendency that, over a period of time, a
person is not going to stay stuck in the loop. Eventually he will get out of
the loop by jumping to some random page.
The Random Surfer model uses this concept to get out of a loop: based on the distribution E,
the random surfer gets out of the loop.
See page 6 of:
http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=199966&format=pdf&compression=&name=1999-66.pdf
Apart from the Random Surfer model, a few other things:
Sometimes links are not there to point to a genuinely useful destination at all; Google terms
these links "comment spam". Such links are simply ignored during crawling and not
considered for PageRank.
http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Sometimes links point to a page that doesn't point anywhere else. These links
are termed dangling links. Google found a nice way to deal with
these dangling links: simply remove them temporarily from the system until PageRanks
are calculated, and then put them back in.
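A minimal sketch of that set-aside step (an illustration of my own, not Google's implementation): links into pages that have no out-links are removed, possibly over several passes since removals can create new dangling pages, and are merged back after the ranks converge:

    def remove_dangling(links):
        """links: page -> list of out-links. Returns the links that were set aside."""
        removed = []
        while True:
            dangling = {p for p, out in links.items() if not out}   # pages with no out-links
            if not dangling:
                break
            for page in dangling:
                del links[page]                   # drop the dangling page itself
            for page, outlinks in links.items():
                for target in list(outlinks):
                    if target in dangling:
                        outlinks.remove(target)   # set this link aside for now
                        removed.append((page, target))
        return removed

    graph = {"A": ["B", "C"], "B": ["A"], "C": []}    # C is a dangling page
    set_aside = remove_dangling(graph)
    # ... compute PageRank on the reduced graph here ...
    for page, target in set_aside:                    # then put the links back in
        graph.setdefault(page, []).append(target)
        graph.setdefault(target, [])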
Posted by uddhav at February 1, 2005 01:00 PM
One paper which cites the Google paper, is “Focused Crawling: A New Approach to
Topic-Specific Resource Discovery” by Soumen Chakrabarti, Martin van den Berg, and
Byron Dom. (http://www.cse.iitb.ac.in/~soumen/doc/www1999f/pdf/prelim.pdf)
They discuss targeting a web search to within a topic set of pages, which are identified
with exemplary documents instead of keywords. They compare their methods with those
of Kleinberg, in HITS, and also with Brin and Page’s PageRank algorithm.
They differentiate their focused crawling methods in the following ways: they have “no
apriori radius cut-off for exploration”, unlike HITS, which uses links only one or two
generations removed from the seed pages. They also state that “the selection of relevant,
high quality pages happens directly as a data acquisition step, not as post-processing or
response to a query”, which they claim makes their focused crawl results faster
and more relevant.
Posted by Marc at February 1, 2005 01:06 PM
Search tips: http://find.stanford.edu/user_help.html
Tips that can enhance your query-framing skills, thereby helping the search engine
provide you with better results.
Posted by uddhav at February 1, 2005 01:08 PM
An obvious citation is the HITS algorithm by Jon Kleinberg.
Though both Kleinberg and the Google authors carried out link analysis, their approaches
are still different:
Google's algorithm extends Kleinberg's HITS, but modifies it to not count links
from all pages equally, instead normalizing by the number of links on the page.
Another interesting paper cited is "Efficient Crawling Through URL Ordering" - Junghoo
Cho, Hector Garcia-Molina, Lawrence Page
http://www.csd.uch.gr/~hy558/papers/cho-order.pdf
This paper details three methods for efficient crawling and collection of web pages, which
is important for maintaining an updated index of pages for a search engine while covering
the large web space in the shortest duration.
Illustrative examples and pseudo-code make it easy to understand.
Posted by uddhav at February 1, 2005 01:12 PM
Sergey Brin’s home page can be found at: http://www-db.stanford.edu/~sergey/. His
research interests include search engines, information extraction from unstructured
sources, and data mining of large text collections and scientific data.
Larry Page’s home page, according to a Google search, can be found at http://www-db.stanford.edu/~page/, although it’s in a sad state of disrepair.
Aside from Google as a search engine, they have developed some other noteworthy
software. At the top of my list right now is their free photo-organizing tool, Picasa
(http://www.picasa.com/index.php). I just downloaded it, and quite simply, it is amazing.
It is a very simple, easy-to-use, and intuitive photo-organizing program with which even
computer novices can easily manage and enhance their photos.
Honestly, they have formed a great company and put out quality products, as can be
gleaned from their top-ten list of philosophy, which Uddhav pointed out
(http://www.google.com/corporate/tenthings.html).
Posted by Marc at February 1, 2005 01:38 PM
I hate how this blogging tool tries to automatically create links from text, and does it
incorrectly! For example, it takes links that are inside parentheses and includes the
trailing ")" symbol and any period or comma, so that clicking on the links doesn't work...
Posted by GRIPE! at February 1, 2005 01:41 PM
The interesting idea is the notion that the wealth of information being crawled on the web
is all text files written in HTML. This data mining concept is being applied to the web
instead of a traditional database. In turn, the crawler is creating its own data warehouse of
the web.
Posted by Michael Ong at February 1, 2005 01:42 PM
They state that they want to move search engine research out of the commercial realm and
back towards academia. This is of course ironic, because after they moved the research
towards academia, they moved towards the commercial side of things. Not only did they
want to do academic research on search engines, they also wanted to provide a platform
for others to do research at the scale of the entire web. It seems that this attitude still
shapes Google: they have a very active R&D department, they hire a lot of Ph.D.'s, and
they allow their employees to work on their own projects for 20% of their time.
Posted by ryan king at February 1, 2005 01:43 PM
They go into a lot of reasons why previous search engines are insufficient for the web
because they don't respect the nature of the web as a loosely structured collection of
heterogeneous information. They also claim that previous search engines failed to deal
with people who tried to pollute the search results for their own gain. Those people were able to
manipulate the results because those search engines dealt only with the text of the
documents and not the surrounding information (i.e., link structure).
Posted by ryan king at February 1, 2005 01:43 PM
Brin and Page started working on Google as a research project at Stanford and left the
Ph.D. program to work on Google full time. They've both published on numerous topics
related to web searching, including papers on data mining and pattern extraction.
Posted by ryan king at February 1, 2005 01:44 PM
In their conclusion, they state that the search can be personalized simply by artificially
increasing the PageRank of a user's homepage or bookmarks. That seems like quite a
simple solution to our problem of searching personal webs. I wonder if they've done any
more work on that.
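A small sketch of what that could look like (my own illustration of the idea, not anything Google has published beyond the sentence above): instead of jumping to a random page uniformly, the random surfer jumps back to the user's bookmarks, which boosts everything near them:

    def personalized_pagerank(links, bookmarks, d=0.85, iterations=50):
        """links: page -> list of out-links; bookmarks: pages the user cares about."""
        pages = list(links)
        teleport = {p: (1.0 / len(bookmarks) if p in bookmarks else 0.0) for p in pages}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - d) * teleport[p] for p in pages}   # biased random jump
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                for target in outlinks:
                    new_rank[target] += d * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    graph = {"home": ["a"], "a": ["b"], "b": ["home"], "c": ["a"]}
    print(personalized_pagerank(graph, bookmarks={"home"}))   # pages near "home" score higher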
Posted by ryan king at February 1, 2005 01:44 PM
It seems nearly every search-related paper written since this one has made reference to it
in some way, which is obviously due to the tremendous success of Google. This
one [http://www.cindoc.csic.es/cybermetrics/pdf/47.pdf], in particular, is interesting,
because it extends Google's page similarity algorithm in several ways.
Posted by ryan king at February 1, 2005 01:49 PM
One of the cited papers is “Finding What People Want: Experiences with the
WebCrawler” by Brian Pinkerton.
http://www.thinkpink.com/bp/WebCrawler/WWW94.html
This paper defines what the WebCrawler is, and describes its architecture and some issues
with its design. The author also discusses his personal experience working with the
WebCrawler index.
Posted by Dora at February 1, 2005 01:51 PM
This [http://drakkar.imag.fr/IMG/pdf/ht.pdf] is an interesting paper. They describe a
browsable, hierarchical search engine that is built by clustering documents together. The
clustering is done by both link analysis and text analysis.
Posted by ryan king at February 1, 2005 02:00 PM
One paper that cites Brin and Page’s paper is “Information Retrieval on the Web” by Mei
Kobayashi and Koichi Takeda:
http://delivery.acm.org/10.1145/360000/358934/p144-kobayashi.pdf?key1=358934&key2=8933127011&coll=GUIDE&dl=ACM&CFID=37158585&CFTOKEN=66341147
This paper reviews some of the problems with information search and retrieval on the
Web, such as slow retrieval speed, communication delays, and poor quality of retrieved
results. It then discusses techniques for resolving these problems.
Posted by dora at February 1, 2005 02:04 PM
This paper specifies the need for scalability and quality (relevance) in a search engine.
The basics include crawling, indexing, and searching to build a central store of
information. It takes Kleinberg’s HITS algorithm further: Google also makes use of the
link structure for PageRank, but it does not treat all links equally, normalizing instead by
the number of links on each page. It takes advantage of anchor text, proximity, and visual
presentation (like boldface, font size, etc.), and stores the raw HTML of crawled pages in a
repository.
At the time the paper was written (in 1998), information retrieval was an area where further
research was required, as was a way to maintain updates by determining whether old pages
need to be re-crawled and which new ones need to be crawled. For link text, more testing
was planned to include the surrounding text.
The scalability goes down to the system level. This includes everything from the CPU,
memory usage, and disk storage to a robust system design that can support a large-scale
application like Google.
Posted by Michael Ong at February 1, 2005 02:08 PM
It was interesting to see that Google keeps track of, and takes into consideration, the visual
presentation of text, such as a word's font size and style, in its ranking system.
Posted by dora at February 1, 2005 02:12 PM