brinBlog - USF Computer Science Department

Comments: Brin, Sergey and Page, Larry, The Anatomy of a Large-Scale Hypertextual Web Search Engine
A general comment on Sergey and Larry's style: it seems to be written with a subtle sales-pitch
tone. Is it possible this paper was used to attract prospective investors to their
company?
It has a lot of that late-90's tone about how the web is gonna blow up and rule the
world... Some of the text sounds as if it could have been lifted from one of the many "B2B"
sites now long gone:
"Creating a search engine which scales even to today's web presents many challenges.
Fast crawling technology is needed to gather the web documents and keep them up to
date. Storage space must be used efficiently to store indices and, optionally, the
documents themselves. "
They obviously had the right idea in addressing scalability issues with search engines.
Posted by Andrew at January 31, 2005 07:30 PM
I agree with Andrew, and wonder if they have any more recent updates to this paper,
because I'm sure their methods and architecture have changed since it was written.
In the paper, it is also mentioned that they would eventually like to scale up to more than
100 million pages, and that in order for their system to be able to support that many
pages, they would have to greatly increase the complexity of the system. So, what's been
done?!
Also, a large number of URLs within the paper are not active anymore, especially
http://google.stanford.edu!
Posted by Marc at January 31, 2005 08:02 PM
Scratch that last bit, I just had a typo when I tried to go there. The Google site at Stanford
still works, but a number of other links in their citations don't...
Posted by Marc at January 31, 2005 08:07 PM
That's another difficulty with this format: there's no ability to modify or version a comment
previously made by the same person, to update or fix it...
Posted by Marc at January 31, 2005 08:08 PM
About Robots and exclusion...
pg 13 mentions the use of the Robot Exclusion Protocol. I was curious about it and found
some info: http://www.robotstxt.org/wc/exclusion.html
Basically, a web site owner has the right to exclude their web content from being
searched. This ability to be excluded (from even Google!) brings up an interesting
dilemma:
What happens when more and more "for pay" information sites appear on the web, such
as Safari Books (http://safari.oreilly.com/) and jGuru (http://www.jguru.com)? It is in the best
interests of these sites to keep much of their content private, so that they may charge for
the use of the information. Won't this riddle the web with unsearchable "dark information spaces",
like Swiss cheese, thus reducing the effectiveness of search engines such
as Google that rely on web crawling?
One can picture that eventually free searches may only return one level of information to
users, and that "subsearches" within specific sites such as Safari Books will be necessary
to chase after quality sources of information.
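Going back to the Robots Exclusion Protocol mentioned above, here is a minimal sketch (in Python, using the standard library's robotparser module; the crawler name and URLs are placeholders of my own) of how a crawler can honor a site's robots.txt before fetching a page:

    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleCrawler"   # hypothetical crawler name, for illustration only

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # fetch the site's exclusion rules
    rp.read()

    url = "http://www.example.com/private/report.html"
    if rp.can_fetch(USER_AGENT, url):
        print("Allowed to crawl:", url)
    else:
        print("Excluded by robots.txt:", url)   # a polite crawler skips this page

Any page a site owner lists as disallowed simply never enters the index, which is exactly the "dark information space" effect described above.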
Posted by Andrew at January 31, 2005 08:29 PM
To summarize the paper, it plays off of Brin and Page’s opinion of the time, which states
that “The biggest problem facing users of web search engines today is the quality of the
results they get back”. (One could argue that the same issue still faces us today).
They then go into the methods they use to make their results more relevant than the
standard text-only engines of the time: link structure, anchor text, relative font
information, and word proximity.
Even though it is not a main theme of the paper, it is mentioned a few times, and comes
off as a strong bias, that the drive behind their engine is academic rather than
commercial or advertising-driven.
Posted by Marc at January 31, 2005 08:35 PM
In response to Marc's question about what Google currently does to handle web indexing, which at last count on their site was:
Searching 8,058,044,651 web pages
That is over 8000 times the size they said they would like Google to be able to handle. Does the
design described in the paper still work for this, or do they have a "secret sauce" being
used?
On the rumors side, Google's infrastructure has puzzled many outside observers; apparently
the size and capability of their hardware infrastructure greatly outweighs the
effort currently needed to perform their searches:
"Google is thought to be a shrewd judge of computing value, having built its widely
admired infrastructure on the back of low-budget server clusters. At the same time,
curious geeks have long pondered the apparent mismatch between its service demands
and the reputed scale of its computing resources."
http://news.zdnet.com/2100-1035_225537392.html
Perhaps there is more going on with Google's technology than is currently known (or
shared without an NDA).
Posted by Andrew at January 31, 2005 08:42 PM
Hard disks and Cost of Search:
One major benefit of Google's search design is that a complete index
could be stored in roughly 55-108 gigabytes of disk (pg 18), and that in 1998:
"At current disk prices this makes the repository a relatively cheap source of useful data."
I wanted to find out how much 100 gigs of HD space cost in 1998, as well as predict what
Google's current hard disk space requirement is (based on the architecture described in
the paper).
1998 (based on the 26 million pages indexed, pg 18): 100 gigabytes of disk storage, at an
average cost of $0.06 per megabyte in 1998 = $6,000
2005 (based on 8,058,044,651 web pages indexed, approx. 310x growth): 31,000 gigabytes
of storage, at an average cost of $1.00 per gigabyte = $31,000
Based on this, the operating cost to maintain the type of search described in the paper has
increased approx. 5x, but overall it still appears to be a scalable design.
HD price history information: http://www.littletechshoppe.com/ns1625/winchest.html
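For what it's worth, the back-of-the-envelope numbers above can be reproduced in a few lines of Python; the prices and page counts are just the estimates quoted in this post, not measured figures:

    # 1998 estimate: ~100 GB of repository at roughly $0.06 per megabyte
    cost_1998 = 100 * 1000 * 0.06            # = $6,000

    # 2005 estimate: ~310x more indexed pages, at roughly $1.00 per gigabyte
    growth = 8_058_044_651 / 26_000_000      # ~310x growth in pages
    cost_2005 = 100 * growth * 1.00          # ~= $31,000 for ~31,000 GB

    print(round(growth), cost_1998, round(cost_2005), round(cost_2005 / cost_1998, 1))
    # prints roughly: 310 6000.0 30993 5.2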
Posted by Andrew at January 31, 2005 09:08 PM
Oops... I left a few too many zeros in my post above:
" Searching 8,058,044,651 web pages...That is over 8000 times "
Actually it's more like 80 times (based on their 100 million page goal). Still quite a large
growth.
Posted by Andrew at January 31, 2005 09:12 PM
Overall I found the Google paper to be quite inspiring. The story of a few bright young kids
coming up with a creative idea that changes the world has developed into an archetypal
high-tech success story here on the west coast. Brin and Page join the list of incredibly
famous and successful local technology pioneers who made their mark on history:
Steve Jobs, Steve Wozniak (Apple Computer)
Patrick Naughton, Mike Sheridan and James Gosling (Java programming language)
Bill Hewlett, Dave Packard (Hewlett Packard)
Bill Gates, Paul Allen (Microsoft)
and many more- who else can add to this list?
Also, I found the paper fun and interesting to read, much more so than the Kleinberg
paper, which seemed a bit too academic and dry.
Posted by Andrew at January 31, 2005 09:27 PM
In Appendix B, section 9.2, while discussing the scalability of centralized indexing
architectures, the Harvest system is referenced as an efficient and elegant distributed
solution to the indexing problem. The reference in the paper’s References section,
http://harvest.transarc.com, is defunct, but I found a paper on the Harvest system at:
http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/schwartz.harvest/schwartz.harvest.html
Harvest uses distributed brokers (indexing servers) for searches in targeted areas. For
example, they have one indexer dedicated only to Computer Science Technical Papers,
and another one which searches PC Software, and one for World Wide Web Home Pages.
Posted by Marc at January 31, 2005 10:11 PM
Since all they really talk about is their algorithm, one of the techniques I found interesting was
the idea of weighting differently marked-up text in HTML documents differently. They mention
weighting bolded or larger text more than the other words on the page, but I am kind of
dubious as to whether that works. Personally, depending on the font I'm using, I may bold
all the words just to make the page more visible/appealing. The other idea, though, using anchor
text, is more applicable. Again, they have to take measures to deal with links like
"here" or "this", but for the most part anchor text should provide a good indication
of the type of page it points to, so using it is a good idea. Since older search engines
didn't even take links into account, they couldn't have applied this logic, but I wonder whether
even the bolded/italicized/larger-text weighting scheme would improve things.
Posted by Deniz at January 31, 2005 10:53 PM
Good points, Deniz, but with regard to bold text and font size, they store each word's prominence only
relative to the rest of the text around it, so if the entire document is bold, it won't be
ranked any more significantly.
Also, they mention that they may use the words surrounding the link text to improve
upon the problem you pointed out of anchor text only containing words such as _this_ or
_here_.
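To make the relative-font point concrete, here is a rough sketch of my own (an illustration, not Google's code) of scoring word prominence relative to the document's own baseline, so an all-bold or all-large page gains nothing:

    from collections import Counter

    def relative_font_weights(words):
        """words: list of (word, font_size) pairs extracted from one document."""
        sizes = Counter(size for _, size in words)
        base_size = sizes.most_common(1)[0][0]    # the document's dominant font size
        weights = {}
        for word, size in words:
            # Only text larger than the page's own baseline gets a boost; if
            # everything is big, the baseline rises and nothing stands out.
            weights[word] = max(weights.get(word, 1.0), 1.0 + 0.5 * (size > base_size))
        return weights

    doc = [("cheap", 12), ("flights", 12), ("to", 12), ("paris", 18), ("deals", 18)]
    print(relative_font_weights(doc))   # 'paris' and 'deals' stand out; the rest do not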
Posted by Marc at January 31, 2005 10:57 PM
This paper describes in detail the design and implementation of the Google web search
engine. At the time this paper was written, Google had yet to gain its current worldwide
popularity, hence the interesting perspective of introducing a new product, even though it
is now a common idea. The design obstacles and goals are explained in relation to the
construction and features of the system. The details of the feature set, such as the
PageRank algorithm, are explained in terms that are actually understandable, much better
than the HITS description in the Kleinberg paper. The exact architecture of the system is
described in conjunction with hardware resources and requirements. The requirements
are quite reasonable, considering the hardware available at the time. As it is, even today
the requirements are realistic, though they have definitely increased to some degree.
Operation and performance are described in light of the resources available at the time for
testing and the future goals of the system. One of the more interesting sections is the
appendix on the influence of advertising on search results, a potential problem that still
exists in modern web searching and development.
Posted by Rudd at January 31, 2005 11:17 PM
A section I found interesting is the description of the obstacles encountered when
crawling the web and collecting the pages to index. Having built a crawler myself, I
related to the difficulty involved with creating a robust system that smoothly crawls the
web. As the authors point out, there are many quirks and unforeseen problems that can’t
be imagined in the design process, but must be discovered along the way. I think the
crawling is almost as important as the PageRank and query handling, due to
its status as the base of the system. The process of running multiple crawlers to reliably
and quickly collect pages to index becomes complicated when the true state of the web
reveals itself, with all the inconsistencies of non-standard web content. There would be no
search index without pages to index; hence collecting the web documents is a crucial
step. Not only is the initial set of documents important; repeatedly updating the document
set also requires a careful implementation, for ease of use and so that site owners across the
web are not alienated as pages are collected. The unintentional ability to swamp a web server
with requests is something that must be carefully avoided (a simple safeguard is sketched
below). It’s good that Brin and Page spent the time to explain the effort involved in
crawling the web; researchers often disregard the support areas needed to create a system.
This can leave others to find out the hard way that there is more to the process than just
the system itself.
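On the point about unintentionally swamping a web server, a common safeguard is a per-host politeness delay. A minimal sketch of my own (made-up URLs, arbitrary 2-second interval; a real crawler would also honor robots.txt and any crawl-delay hints):

    import time
    from urllib.parse import urlparse

    last_hit = {}        # host -> time of the most recent request to that host
    MIN_DELAY = 2.0      # arbitrary politeness interval, in seconds

    def polite_fetch(url, fetch):
        """Wait until MIN_DELAY has passed since the last request to this host."""
        host = urlparse(url).netloc
        wait = last_hit.get(host, 0.0) + MIN_DELAY - time.time()
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return fetch(url)          # `fetch` is whatever download routine the crawler uses

    # Example with a dummy fetch function:
    for p in ["http://example.com/a", "http://example.com/b", "http://example.org/c"]:
        polite_fetch(p, fetch=lambda u: print("fetched", u))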
Posted by Rudd at January 31, 2005 11:18 PM
Inward link paper...
“Mercator: A Scalable, Extensible Web Crawler”, Allan Heydon and Marc Najork;
Compaq Systems Research Center [http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf].
This paper describes the process involved in designing and implementing a large-scale
web crawler. A direct comparison to the Google crawlers and the Internet Archive’s crawler
(not Heritrix, I presume, as the paper was written in 1999) is given in light of the authors’
creation of a Java-based, extensible web crawler. The construction and functioning of the
crawler is detailed, with consistent comparison to Google’s and the Internet Archive’s
crawlers, showing the different approaches possible in the design of a web crawler. In
the final performance comparison to the other crawlers, the authors state that it required a
significant amount of time to optimize their crawler, with particular focus on the Java core
libraries. This is interesting, as the Brin and Page “Anatomy” paper states that they used
Python in their crawler, while the Internet Archive’s Heritrix crawler is Java-based. It makes one
wonder whether the much-harped-about “slowness” of the Java libraries plays a significant
role in the development of certain applications. In fact, Heydon and Najork wrote another
paper on the subject “Performance Limitations of the Java Core Libraries”
[http://www.geocities.com/caheydon/papers/Mercator-CPE.pdf]. A final interesting note
is that the Mercator system was scheduled for use in the AltaVista corporate product used
for indexing large intranets.
Posted by Rudd at January 31, 2005 11:22 PM
The obvious cited paper is Kleinberg’s
[http://www-2.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/papers/kleinberg.pdf],
though this is not unusual, as the content is closely linked: both PageRank and the HITS
algorithm make use of link structure.
“Efficient Crawling Through URL Ordering” Cho, Garcia-Molina, Page
[http://www.csd.uch.gr/~hy558/papers/cho-order.pdf] This paper details three methods
for efficient crawling and collecting of web pages. This is obviously important for
maintaining an updated index of pages for a search engine while covering the broadest
expanse of the web in the shortest amount of time. One thing I liked about this paper
(just a side note...) was that its examples of pseudo-code were the closest I’ve seen to
real code that made sense. The three models of URL ordering are described in terms of
performance and implementation, as well as the results obtained from crawling a
controlled test set (the stanford.edu domain). The conclusion was that PageRank gave
the best performance for deciding on the order of crawling pages, which makes sense
considering its use in Google today. There are other factors that influence crawler
performance as well, such as starting at or near a “hot” page, pages with more anchor text,
and pages containing some query words. As is stated in the paper, the testing was on a
limited set of test pages. While the test set was a limiting factor at the time, the use of
PageRank in Google has since shown its worth on the scale of the entire web.
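The core of the URL-ordering idea can be sketched as a crawl frontier kept in a priority queue, where some importance estimate (a PageRank-like score or a backlink count) decides what to fetch next. This is only my illustration of the scheme, not the authors' code; `importance` and `out_links` stand in for the metric and the fetch/parse step:

    import heapq

    def crawl_in_order(seed_urls, importance, out_links, limit=100):
        """Visit pages in decreasing order of an importance estimate."""
        frontier = [(-importance(u), u) for u in seed_urls]   # max-heap via negation
        heapq.heapify(frontier)
        seen, order = set(seed_urls), []
        while frontier and len(order) < limit:
            _, url = heapq.heappop(frontier)      # most promising URL first
            order.append(url)
            for link in out_links(url):           # fetch the page and extract its links
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-importance(link), link))
        return order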
Posted by Rudd at January 31, 2005 11:22 PM
An "idea"
A search engine “sampler”. This might require significant licensing deals, but pretty
much a search engine that, instead of creating its own index, performs a given query on
several established web search engines and returns the top results for each engine.
Perhaps display the results sectioned per search engine. So, the top five results from
Google, and the top five results from Yahoo! and the top five results from AltaVista, etc.
This would be interesting to do just for research as well, to see the differences in search
engines and the algorithms each uses. An extension would be to query not just search
engines but other information sources as well.
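As a rough sketch of how such a sampler could be structured (the per-engine fetchers here are placeholders of my own; real ones would need each engine's licensed API or result-page parsing, as noted above):

    def sample_engines(query, fetchers, n=5):
        """fetchers: dict of engine name -> function(query, n) returning top-n results."""
        sections = {}
        for engine, fetch in fetchers.items():
            try:
                sections[engine] = fetch(query, n)       # top-n results from this engine
            except Exception as err:                     # one engine failing shouldn't kill the page
                sections[engine] = ["(error: %s)" % err]
        return sections

    # Dummy fetchers standing in for real Google / Yahoo! / AltaVista queries.
    def fake(name):
        return lambda q, n: ["%s result %d for '%s'" % (name, i + 1, q) for i in range(n)]

    results = sample_engines("web crawlers", {"Google": fake("Google"), "Yahoo!": fake("Yahoo!")})
    for engine, hits in results.items():
        print(engine, hits)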
Posted by Rudd at January 31, 2005 11:24 PM
Some background on the mentioned authors...
Sergey Brin and Larry Page are the co-founders of Google Inc. and Presidents of
Technology and Products at Google, respectively. They have written numerous papers on
the subject of web search and data mining, both together and separately. Some include
“Extracting Patterns and Relations from the World Wide Web”, “Dynamic Data Mining:
A New Architecture for Data with High Dimensionality”, and “Scalable Techniques for
Mining Causal Structures”.
Junghoo Cho is an assistant professor in computer science at UCLA. His research
interests are focused on the areas of databases and Web technologies. Other papers
include "Beyond Document Similarity: Understanding Value-Based Search and
Browsing Technologies." Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriquez-Mula,
Junghoo Cho (2000), and "Searching the Web." Arvind Arasu, Junghoo Cho,
Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan (2001).
Hector Garcia-Molina is a professor in the Departments of Computer Science and
Electrical Engineering at Stanford University. From August 1994 to December
1997 he was the Director of the Computer Systems Laboratory at
Stanford. From 1979 to 1991 he was on the faculty of the Computer
Science Department at Princeton University. His research interests include distributed
computing systems, digital libraries, and database systems. He is on the
Technical Advisory Board of DoCoMo Labs USA, Kintera, Metreo Markets,
TimesTen, Verity, and Yahoo Search & Marketplace, and is a member of the Board of
Directors of Oracle and Kintera.
Marc Najork currently works for Microsoft Research in Silicon Valley. He is working
on Boxwood, a distributed B-Tree system, and PageTurner, a large-scale study of the
evolution of web pages. He formerly worked at HP's Systems Research Center where he
developed Mercator, a high-performance distributed web crawler, JCAT, a web-based
algorithm animation system, and Obliq-3D, a scripting system for 3D animations.
Posted by Rudd at January 31, 2005 11:26 PM
This paper provides:
1. An in-depth description of a large-scale web search engine.
2. The technical challenges involved in using the additional information present in hypertext
to produce better search results.
3. The challenges involved in working with uncontrolled hypertext collections,
given the nature of the web, where anyone can publish anything they want, and given its
rapid advancement and proliferation.
The World Wide Web Worm in 1994 was getting barely 1,500 queries a day, whereas
AltaVista in November 1997 reported handling about 20 million queries per day. It was therefore
more than obvious that the goal of the system would be to address many of the problems,
both in quality and scalability, introduced by scaling search engine technology to such
extraordinary numbers.
The key difference in how the two algorithms function is this:
Kleinberg's HITS algorithm relies largely on counting citations or backlinks to a
given page, which gives some approximation of a page's importance or quality.
PageRank extends this idea by not counting links from all pages equally, but normalizing
by the number of links on a page.
PageRank thereby forms a probability distribution over web pages, so the PageRanks of all
pages sum to one.
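A tiny sketch of that idea (my own toy illustration of the published formula, not Google's code): each page spreads its rank evenly over its out-links, and the damping factor d plays the random-surfer role discussed below:

    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}                 # start from a uniform distribution
        for _ in range(iterations):
            new_rank = {p: (1.0 - d) / n for p in pages}   # the random-jump term
            for page, outlinks in links.items():
                if not outlinks:
                    continue                               # dangling pages handled separately
                share = d * rank[page] / len(outlinks)     # normalize by out-degree
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # Toy example: three pages in a loop plus one page linking into it.
    print(pagerank({"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}))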
The paper then introduces us to concepts like the Random Surfer model and dangling links,
and how they were dealt with.
It then goes on to explain the PageRank calculation, design issues, hardware,
implementation, and so on.
What I liked about this technique is the use of anchor text, the use of font
variation, keeping pages in a repository, the compressed repository, avoiding disk seeks,
the high-speed crawler (about 100 pages/sec), indexing into barrels, etc.
Further on, the paper explains the problems its crawler faces with copyright and with
users not wanting their pages to be indexed. Frankly speaking, I was shocked: why
wouldn't people want their pages to be indexed by a search engine?
Overall, it is a paper clearly written with a commercial mindset and the well-defined
objective of being the best search engine.
Posted by uddhav at February 1, 2005 11:57 AM
The hardware demands don't sound unreasonable, even given the time when they were being
considered.
A more detailed description of the cluster system used by Google can be found here:
http://www.computer.org/micro/mi2003/m2022.pdf
Posted by uddhav at February 1, 2005 11:59 AM
A more graphical explanation of the PageRank system:
http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=199966&format=pdf&compression=&name=1999-66.pdf
Posted by uddhav at February 1, 2005 12:13 PM
In section 4.5.2, Feedback, they discuss how their ranking function includes many
parameters, and that they also take user feedback into account. It sounds like they allow a
set of trusted users to rate/rank the search engine results. I know this sounds a little
creepy, but what about teaming up with a company like Nielsen Media Research
(http://www.nielsenmedia.com/) or more specifically, their NetRatings division
(http://www.nielsen-netratings.com/)? They could track user clicks and come up with
more click-through analysis, since that is exactly what such companies specialize in. Just a
thought…
Posted by Marc at February 1, 2005 12:18 PM
Larry Page- Lawrence "Larry" E. Page
http://www.google.com/corporate/execs.html
Born March 26, 1973, in Ann Arbor, Michigan, he is one of the founders of the Google internet
search engine. He is a graduate of East Lansing High School. Page has a Bachelor of
Science in engineering, with a concentration in computer engineering, from the
University of Michigan and a Master's degree from Stanford University.
Sergey Brin
http://www.google.com/corporate/execs.html
http://www-db.stanford.edu/~sergey/
Born in Moscow, Russia, Sergey Brin received his Bachelor of Science in computer
science and mathematics from the University of Maryland, College Park, where his father
Michael Brin is a mathematics professor. He received his Master's degree from Stanford
University, but has not completed his Ph.D.
Colleagues:
Hector Garcia-Molina
Professor, Departments of Computer Science and Electrical Engineering.
Stanford University
http://www-db.stanford.edu/people/hector.html
other papers:
Evaluation of Delivery Techniques for Dynamic Web Content - Mor Naaman, Hector
Garcia-Molina, Andreas Paepcke, 2003
Representing Web Graphs - Sriram Raghavan and Hector Garcia-Molina, 2002
Junghoo "John" Cho
Assistant Professor
Department of Computer Science
University of California, Los Angeles
He received his Ph.D. in Computer Science from Stanford University; his thesis was on web
crawlers.
http://oak.cs.ucla.edu/~cho/
other papers :
Junghoo Cho, Hector Garcia-Molina "Effective page refresh policies for Web crawlers."
ACM Transactions on Database Systems, 28(4): December 2003.
Junghoo Cho, Hector Garcia-Molina "Estimating frequency of change." ACM
Transactions on Internet Technology, 3(3): August 2003.
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan
"Searching the Web." ACM Transactions on Internet Technology, 1(1): August 2001.
Sougata Mukherjea, Junghoo Cho "Automatically Determining Semantics for World
Wide Web Multimedia Information Retrieval." Journal of Visual Languages and
Computing (JVLC), 10(6): December 1999, 585-606.
Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriquez-Mula, Junghoo Cho "Beyond
Document Similarity: Understanding Value-Based Search and Browsing Technologies."
SIGMOD Records, 29(1): March 2000.
Junghoo Cho, Hector Garcia-Molina, Lawrence Page "Efficient Crawling Through URL
Ordering." Computer Networks and ISDN Systems, 30(1-7):161-172, 1998
Posted by uddhav at February 1, 2005 12:39 PM
Off the topic, but interesting:
What is behind the success of Google?
http://www.google.com/corporate/tenthings.html
Attitudes and some facts about Larry Page and Sergey Brin:
http://www.j-bradford-delong.net/movable_type/2003_archives/000032.html
Posted by uddhav at February 1, 2005 12:41 PM
With the large number of sophisticated search engines currently available, each with a
personality of its own, it would be really interesting if we could get a tree view of the top N
search results from different search engines simultaneously.
Search String
-- GOOGLE
-- Search results
-- Yahoo
-- Search results
-- Some other sources
-- Search results
We could even include search engines specific to blogs, wikis, forums, etc.
This is something worth researching.
Posted by uddhav at February 1, 2005 12:57 PM
Something that I really liked about Google's approach to link analysis is the Random Surfer
model.
With Kleinberg's HITS algorithm I was somewhat worried about what would happen if there
were a loop; the Random Surfer model addresses this.
Random Surfer model:
The analogy is a person randomly clicking on links. How does this help? In certain
cases, due to the forward and backward links, a link loop is formed
that is termed a rank sink. Now, it is human tendency that, over a period of time, a
person is not going to stay stuck in the loop. Eventually he will get out of
the loop by jumping to some random page.
The Random Surfer model uses this concept to get out of a loop: based on the distribution E,
the random surfer gets out of the loop.
See page 6 of:
http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=199966&format=pdf&compression=&name=1999-66.pdf
Apart from the Random Surfer model, a few other things:
Sometimes links are not there to point to a genuinely useful destination at all; Google terms
these links "comment spam". Such links are simply ignored during crawling and not
considered for PageRank.
http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Sometimes links point to a page that doesn't point anywhere else. These links
are termed dangling links. Google found a nice way to deal with
these dangling links: simply remove them temporarily from the system until PageRanks
are calculated, and then put them back in.
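A minimal sketch of that set-aside step (an illustration of my own, not Google's implementation): links into pages that have no out-links are removed, possibly over several passes since removals can create new dangling pages, and are merged back after the ranks converge:

    def remove_dangling(links):
        """links: page -> list of out-links. Returns the links that were set aside."""
        removed = []
        while True:
            dangling = {p for p, out in links.items() if not out}   # pages with no out-links
            if not dangling:
                break
            for page in dangling:
                del links[page]                   # drop the dangling page itself
            for page, outlinks in links.items():
                for target in list(outlinks):
                    if target in dangling:
                        outlinks.remove(target)   # set this link aside for now
                        removed.append((page, target))
        return removed

    graph = {"A": ["B", "C"], "B": ["A"], "C": []}    # C is a dangling page
    set_aside = remove_dangling(graph)
    # ... compute PageRank on the reduced graph here ...
    for page, target in set_aside:                    # then put the links back in
        graph.setdefault(page, []).append(target)
        graph.setdefault(target, [])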
Posted by uddhav at February 1, 2005 01:00 PM
One paper which cites the Google paper, is “Focused Crawling: A New Approach to
Topic-Specific Resource Discovery” by Soumen Chakrabarti, Martin van den Berg, and
Byron Dom. (http://www.cse.iitb.ac.in/~soumen/doc/www1999f/pdf/prelim.pdf)
They discuss targeting a web search to within a topic set of pages, which are identified
with exemplary documents instead of keywords. They compare their methods with those
of Kleinberg, in HITS, and also with Brin and Page’s PageRank algorithm.
They differentiate their focused crawling methods in the following ways: they have “no
apriori radius cut-off for exploration”, unlike HITS, which uses links only one or two
generations removed from the seed pages. They also state that “the selection of relevant,
high quality pages happens directly as a data acquisition step, not as post-processing or
response to a query”, which they claim makes their focused crawl results faster
and more relevant.
Posted by Marc at February 1, 2005 01:06 PM
Search tips: http://find.stanford.edu/user_help.html
Tips that can enhance your query-framing skills, thereby helping the search engine
provide you with better results.
Posted by uddhav at February 1, 2005 01:08 PM
An obvious citation is the HITS algorithm by Jon Kleinberg.
Though both Kleinberg and the Google authors carried out link analysis, their approaches
are still different:
Google's algorithm extends Kleinberg's HITS, but modifies it to not count links
from all pages equally, instead normalizing by the number of links on the page.
Another interesting paper cited is "Efficient Crawling Through URL Ordering" - Junghoo
Cho, Hector Garcia-Molina, Lawrence Page
http://www.csd.uch.gr/~hy558/papers/cho-order.pdf
This paper details three methods for efficient crawling and collection of web pages, which
is important for maintaining an updated index of pages for a search engine while covering
the large web space in the shortest duration.
Illustrative examples and pseudo-code make it easy to understand.
Posted by uddhav at February 1, 2005 01:12 PM
Sergey Brin’s home page can be found at: http://www-db.stanford.edu/~sergey/. His
research interests include search engines, information extraction from unstructured
sources, and data mining of large text collections and scientific data.
Larry Page’s home page, according to a Google search, can be found at http://www-db.stanford.edu/~page/, although it’s in a sad state of disrepair.
Aside from Google as a search engine, they have developed some other noteworthy
software. At the top of my list right now is their free photo-organizing tool, Picasa
(http://www.picasa.com/index.php). I just downloaded it, and quite simply, it is amazing.
It is a very simple, easy-to-use, and intuitive photo-organizing program with which even
computer novices can easily manage and enhance their photos.
Honestly, they have formed a great company and put out quality products, as can be
gleaned from their top-ten list of philosophy, which Uddhav pointed out
(http://www.google.com/corporate/tenthings.html).
Posted by Marc at February 1, 2005 01:38 PM
I hate how this blogging tool tries to automatically create links from text, and does it
incorrectly! For example, it takes links that are inside parentheses and includes the
trailing ")" symbol and any period or comma, so that clicking on the links doesn't work...
Posted by GRIPE! at February 1, 2005 01:41 PM
The interesting idea is the notion that the wealth of information being crawled on the web
is all text files written in HTML. This data mining concept is being applied to the web
instead of a traditional database. In turn, the crawler is creating its own data warehouse of
the web.
Posted by Michael Ong at February 1, 2005 01:42 PM
They state that they want to move search engine research out of the commercial realm and
back towards academia. This is of course ironic, because after they moved the research
towards academia, they moved towards the commercial side of things. Not only did they
want to do academic research on search engines, they also wanted to provide a platform
for others to do research at the scale of the entire web. It seems that this attitude still
shapes Google: they have a very active R&D department, they hire a lot of Ph.D.'s, and
they allow their employees to work on their own projects for 20% of their time.
Posted by ryan king at February 1, 2005 01:43 PM
They go into a lot of reasons why previous search engines are insufficient for the web
because they don't respect the nature of the web as a loosely structured collection of
heterogeneous information. They also claim that previous search engines failed to deal
with people who tried to pollute the search results for their own gain. Those people were able to
manipulate the results because those search engines dealt only with the text of the
documents and not the surrounding information (i.e., link structure).
Posted by ryan king at February 1, 2005 01:43 PM
Brin and Page started working on Google as a research project at Stanford and left the
Ph.D. program to work on Google full time. They've both published on numerous topics
related to web searching, including papers on data mining and pattern extraction.
Posted by ryan king at February 1, 2005 01:44 PM
In their conclusion, they state that the search can be personalized simply by artificially
increasing the PageRank of a user's homepage or bookmarks. That seems like quite a
simple solution to our problem of searching personal webs. I wonder if they've done any
more work on that.
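A small sketch of what that could look like (my own illustration of the idea, not anything Google has published beyond the sentence above): instead of jumping to a random page uniformly, the random surfer jumps back to the user's bookmarks, which boosts everything near them:

    def personalized_pagerank(links, bookmarks, d=0.85, iterations=50):
        """links: page -> list of out-links; bookmarks: pages the user cares about."""
        pages = list(links)
        teleport = {p: (1.0 / len(bookmarks) if p in bookmarks else 0.0) for p in pages}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - d) * teleport[p] for p in pages}   # biased random jump
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                for target in outlinks:
                    new_rank[target] += d * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    graph = {"home": ["a"], "a": ["b"], "b": ["home"], "c": ["a"]}
    print(personalized_pagerank(graph, bookmarks={"home"}))   # pages near "home" score higher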
Posted by ryan king at February 1, 2005 01:44 PM
It seems nearly every search-related paper written since this one has made reference to it
in some way, which is obviously due to the tremendous success of Google. This
one [http://www.cindoc.csic.es/cybermetrics/pdf/47.pdf], in particular, is interesting,
because it extends Google's page similarity algorithm in several ways.
Posted by ryan king at February 1, 2005 01:49 PM
One of the cited papers is “Finding What People Want: Experiences with the
WebCrawler” by Brian Pinkerton.
http://www.thinkpink.com/bp/WebCrawler/WWW94.html
This paper defines what the WebCrawler is, and describes its architecture and some issues
with its design. The author also discusses his personal experience working with the
WebCrawler index.
Posted by Dora at February 1, 2005 01:51 PM
This [http://drakkar.imag.fr/IMG/pdf/ht.pdf] is an interesting paper. They describe a
browsable, hierarchical search engine that is built by clustering documents together. The
clustering is done by both link analysis and text analysis.
Posted by ryan king at February 1, 2005 02:00 PM
One paper that cites Brin and Page’s paper is “Information Retrieval on the Web” by Mei
Kobayashi and Koichi Takeda:
http://delivery.acm.org/10.1145/360000/358934/p144-kobayashi.pdf?key1=358934&key2=8933127011&coll=GUIDE&dl=ACM&CFID=37158585&CFTOKEN=66341147
This paper reviews some of the problems with information search and retrieval on the
Web, such as slow retrieval speed, communication delays, and poor quality of retrieved
results. It then discusses techniques for resolving these problems.
Posted by dora at February 1, 2005 02:04 PM
This paper specifies the need for scalability and quality (relevance) in a search engine.
The basics include crawling, indexing, and searching to build a central store of
information. It takes Kleinberg’s HITS algorithm further: Google also makes use of the
link structure for PageRank, but it does not treat all links equally, normalizing instead by
the number of links on each page. It takes advantage of anchor text, proximity, and visual
presentation (like boldface, font size, etc.), and stores the raw HTML of crawled pages in a
repository.
At the time the paper was written (in 1998), information retrieval was an area where further
research was required, as was a way to maintain updates by determining whether old pages
need to be re-crawled and which new ones need to be crawled. For link text, more testing
was planned to include the surrounding text.
The scalability goes down to the system level. This includes everything from the CPU,
memory usage, and disk storage to a robust system design that can support a large-scale
application like Google.
Posted by Michael Ong at February 1, 2005 02:08 PM
It was interesting to see that Google keeps track of, and takes into consideration, the visual
presentation of text, such as a word's font size and style, in its ranking system.
Posted by dora at February 1, 2005 02:12 PM