Anushree Venkatesh, Sagar Mehta, Sushma Rao

Motivation
What is Map-Reduce?
Why Map-Reduce?
The HADOOP Framework
Map Reduce in SILOs
SILOs Architecture
Modules
Experiments

Motivation
  Life span of a web page – 44 to 75 days
  Limitations of centralized/distributed crawling
  Exploring map reduce
  Analysis of web [subset]
  Web graph
  Search response quality
  Tweaked page rank
  Inverted Index

What is Map-Reduce? Why Map-Reduce?
  Divide and conquer
  Functional programming counterparts -> distributed data processing
  Plumbing behind the scenes -> Focus on the problem
  Map – Division of key space
  Reduce – Combine results
  Pipelining functionality

The HADOOP Framework
  Open source implementation of Map reduce in Java
  HDFS – Hadoop specific file system
  Takes care of fault tolerance and dependencies between nodes

Map Reduce in SILOs
  Setup through VM instance - Problems
  Currently HDFS Single Node cluster Setup
  Incorporation of Berkeley DB

SILOs Architecture
  [Architecture diagram]
  Components: Seed List, Distributed Crawler, URL Extractor, Graph Builder, Back Links Mapper, Key Word Extractor
  Tables: URL Table, Page Content Table, Adjacency List Table, Back Links Table, Inverted Index Table
  Map (M) and Reduce (R) labels, in diagram order:
    M: URL, value                R: URL, page content
    M: Parse for URL             R: URL, Parent (Remove Duplicates)
    M: Parent, URL               R: URL, 1
    M: Parse for key word        R: KeyWord, URL
  Other labels: <URL, parent URL>, Diff Compression

Modules: Distributed Crawler
  Map
    Input: <url, 1>
    if (!duplicate(url)) {
        page_content = http_get(url);
        insert into url_table <hash(url), url, hash(page_content), time_stamp>
        output intermediate pair <url, page_content>
    } else if (duplicate(url) && (current_time - time_stamp(url)) > threshold) {
        page_content = http_get(url);
        update url_table(hash(url), current_time);
        output intermediate pair <url, page_content>
    } else {
        update url_table(hash(url), current_time);
    }
  Reduce
    Input: <url, page_content>
    if (!exists(hash(url)) in page_content_table) {
        insert into page_content_table <hash(page_content), compress(page_content)>
    } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
        insert into page_content_table <hash(page_content), compress(diff_with_latest(page_content))>
    }
  Currently: transfer of files to HDFS is manual, outside of Map-Reduce
  Currently: Depth First Search, will be modified for Breadth First Search

Modules: Key Word Extractor
  Map
    Input: <url, page_content>
    List<keywords> = parse(page_content);
    For each keyword, emit intermediate pair <keyword, url>
  Reduce
    Combine all <keyword, url> pairs with the same keyword to emit <keyword, List<urls>>
    Insert into inverted index table <keyword, List<urls>>

Experiments: Top Words Along with their Frequency
  CMU                    Cornell                Gatech
  Carnegie      2456     Cornell       742      Tech          2704
  Mellon        2107     University    378      Georgia       1882
  University    1157     College       158      Alumni        1115
  Alumni         786     Admissions    128      Services       885
  Center         466     Research       99      Association    646
  News           395     Student        94      Career         493
  Library        393     School         89      Baseball       416
  PA             373     Information    77      Engineering    408
  Research       357     York           74      Tennis         222
  Pittsburgh     352     Alumni         71      Information    219
  Information    313     Academics      62      students       198
  School         309     Ithaca         59      Institute      173
                                                Atlanta        164

Experiments: Top 6 URL domains that get traversed
  CMU
    alumni.cmu.edu                 92
    hr.web.cmu.edu                 13
    www.alumniconnections.com      16
    www.carnegiemellontoday.com    10
    www.cmu.edu                   170
    www.library.cmu.edu            69
  Cornell
    www.cornell.edu                43
    www.cuinfo.cornell.edu          2
    www.gradschool.cornell.edu      2
    www.news.cornell.edu            7
    www.sce.cornell.edu             8
    www.vet.cornell.edu             1
  Gatech
    centennial.gtalumni.org         4
    cyberbuzz.gatech.edu            7
    georgiatech.searchease.com      9
    gtalumni.org                  236
    ramblinwreck.cstv.com          56
    www.gatech.edu                 14

Experiments: Avg URL Depth
  CMU
    cmu.edu                        2.73
    alumni.cmu.edu                 2.18
    www.library.cmu.edu            2.23
    www.alumniconnections.com      4.81
  Cornell
    cornell.edu                    1.34
    www.gradschool.cornell.edu     1
    www.news.cornell.edu           2.57
    www.sce.cornell.edu            1
  Gatech
    gatech.edu                     1
    gtalumni.org                   3
    ramblinwreck.cstv.com          2.57
    cyberbuzz.gatech.edu           2

Questions, Comments, Criticisms

HTML Parser
Hadoop Framework (Apache)
Peer Crawl
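
The Key Word Extractor map and reduce steps from the slides can be sketched in plain Python as below. This is a minimal single-process simulation, not Hadoop code and not the authors' implementation. The function names (map_keywords, shuffle, reduce_keyword, build_inverted_index), the regex tokenizer standing in for parse(page_content), the de-duplication of URLs per keyword, and the example URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def map_keywords(url, page_content):
    """Map step: emit one <keyword, url> pair per distinct keyword on the page."""
    # Assumption: a "keyword" is any lowercase run of letters or digits.
    keywords = set(re.findall(r"[a-z0-9]+", page_content.lower()))
    return [(keyword, url) for keyword in sorted(keywords)]


def shuffle(pairs):
    """Group intermediate pairs by key, the job a Map-Reduce framework
    performs between the map and reduce phases."""
    grouped = defaultdict(list)
    for keyword, url in pairs:
        grouped[keyword].append(url)
    return grouped


def reduce_keyword(keyword, urls):
    """Reduce step: combine all URLs seen for one keyword into a single list."""
    # Assumption: the list is de-duplicated and sorted for stable output.
    return keyword, sorted(set(urls))


def build_inverted_index(pages):
    """Run map, shuffle, and reduce over a dict of {url: page_content}."""
    intermediate = []
    for url, content in pages.items():
        intermediate.extend(map_keywords(url, content))
    grouped = shuffle(intermediate)
    return dict(reduce_keyword(k, v) for k, v in grouped.items())


# Hypothetical input pages; in the slides these come from the crawler output.
pages = {
    "http://a.example/": "Alumni news and research",
    "http://b.example/": "Research library",
}
index = build_inverted_index(pages)
```

Here index maps each keyword to the list of URLs containing it, which corresponds to the <keyword, List<urls>> rows inserted into the inverted index table. In a real Hadoop job the shuffle is done by the framework, so only the map and reduce functions would be written by hand.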