Data-intensive Programming
Lecture #3
Timo Aaltonen
Department of Pervasive Computing

Guest Lectures
• I'll try to organize two guest lectures
• Oct 14, Tapio Rautonen, Gofore Ltd: Making sense out of your big data
• Oct 7, ???

Outline
• Course Work
• Apache Sqoop
• SQL Recap
• MapReduce Examples
– Inverted Index
– Finding Friends
– Computing PageRank
• (Hadoop)
– Combiner
– Other programming languages

Course Work
• MySportShop is a sports gear retailer. All the sales happen online in their web store. Examples of their products are different game jerseys and sport watches.
• The web store has an Apache web server for the incoming HTTP requests. The web server logs all traffic to a log file.
– Using these logs, one can study the browsing behavior of the users.
• The sales data of MySportShop is in PostgreSQL, which is a relational database. Among other things, the database has a table order_items containing data of all sales events of the shop.

Course Work: Questions
• Based on the data, answer the following questions
1. What are the top-10 best-selling products in terms of total sales?
2. What are the top-10 browsed products?
3. What anomaly is there between these two?
4. What are the most popular browsing hours?

Course Work
• Since the managers of the company don't use Hadoop but an RDBMS, all the data must be transferred to PostgreSQL
• In order to do that
– Transfer the Apache logs (with Apache Flume) to the HDFS
– Compute the viewing frequencies of the different products using MapReduce (Question 2)
– Compute the viewing hour data with MapReduce (Q4)
– Transfer the results (with Apache Sqoop) to PostgreSQL
– Find the answers to the questions in PostgreSQL using SQL (Q1–4)

Environment: three options
1. You can use your own computer by installing VirtualBox 5.x
– We offer you a virtual machine on which all required software and data have been installed
– In the next weekly exercises the assistants solve VirtualBox-related problems, if you encounter any
2. We offer you a virtual machine from the TUT cloud
– All required software and data is installed
– No graphical user interface
– Guidance available in the weekly exercises
3.
Own installation / cloud service can be used
– No help from the course personnel

Course Work
• The work is done in groups of three
– Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954
– opens today at 10 o'clock
• Deadline is Oct 14th
• Instructions for returning will be published later
– IntelliJ IDEA project

Course Work
• Material
– https://flume.apache.org/FlumeUserGuide.html
– https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
– http://hadoop.apache.org/docs/r2.7.3/
– https://www.postgresql.org/docs/9.5/static/index.html

MapReduce
• Simple programming model
• Map is stateless
– allows running map functions in parallel
• Also Reduce can be executed in parallel
• The canonical example is the word count

Inverted Index
• Collating
– Problem: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
– Solution: The mapper computes the given function for each item and emits the value of the function as the key and the item itself as the value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the ID of the document where the term was found.

Simple Inverted Index
• Reduced output: word, list of docIDs

Doc #1: This doc contains text
Doc #2: My doc contains my text

Mapper output:
this,1  doc,1  contains,1  text,1
my,2  doc,2  contains,2  my,2  text,2

Reduced output:
this: 1
doc: 1, 2
contains: 1, 2
text: 1, 2
my: 2

(Normal) Inverted Index
• Reduced output: word, list of (docID, frequency)

Mapper output:
this,(1,1)  doc,(1,1)  contains,(1,1)  text,(1,1)
my,(2,1)  doc,(2,1)  contains,(2,1)  my,(2,1)  text,(2,1)

Reduced output:
this: (1,1)
doc: (1,1), (2,1)
contains: (1,1), (2,1)
text: (1,1), (2,1)
my: (2,2)

Using Inverted Index: Searching
• Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
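The (docID, frequency) index construction described above can be sketched as plain Python functions. This is a single-process imitation of map, shuffle and reduce (not Hadoop code), and all names (`map_doc`, `reduce_word`, `inverted_index`) are illustrative:

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Mapper: emit one (word, (doc_id, frequency)) pair per distinct word."""
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word] += 1
    return [(word, (doc_id, n)) for word, n in counts.items()]

def reduce_word(word, postings):
    """Reducer: collect the (doc_id, frequency) postings, sorted by doc_id."""
    return word, sorted(postings)

def inverted_index(docs):
    """Simulate the shuffle: group mapper output by word, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_doc(doc_id, text):
            grouped[word].append(posting)
    return dict(reduce_word(w, p) for w, p in grouped.items())

docs = {1: "This doc contains text", 2: "My doc contains my text"}
index = inverted_index(docs)
print(index["doc"])  # [(1, 1), (2, 1)]
print(index["my"])   # [(2, 2)]
```

The output reproduces the reduced output of the (Normal) Inverted Index slide: each word maps to its posting list, sorted by document ID.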
• Index
– he: (1,2), (2,1), (3,1), (4,1), (5,1)
– ink: (3,1), (4,1), (5,1)
– pink: (4,1), (5,1)
– thing: (3,1)
– wink: (1,1), (5,1)

Using Inverted Index
• Indexing makes search engines fast
• Data is sparse, since most words appear only in one document
– (id, val) tuples
– sorted by id
– compact
– very fast
• Linear merge

Index:
he: (1,2), (2,1), (3,1), (4,1), (5,1)
ink: (3,1), (4,1), (5,1)
pink: (4,1), (5,1)
thing: (3,1)
wink: (1,1), (5,1)

Linear Merge
• Find documents matching the query {ink, wink}
– Load the inverted lists for all query words
– Linear merge: O(n)
• n is the total number of items in the two lists
• f() is a scoring function: how well a doc matches the query

ink  --> (3,1) (4,1) (5,1)
wink --> (1,1) (5,1)

Matching set:  1: f(0,1)   3: f(1,0)   4: f(1,0)   5: f(1,1)

Scoring Function
• Specifies which docs are matched
– in: counts of the query words in a doc
– out: ranking score
• how well the doc matches the query
• 0 if the document does not match
– Example, Boolean AND:
f(Q, D) = ∏_{q∈Q} (1 if n_q > 0; 0 if n_q = 0)
– 1 iff all query words are present

Phrases and Proximity
• Query "pink ink" as a phrase
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
• Using the regular index:
– match #and(pink, ink) -> both D4 and D5
– scan the matching documents for the query string (slow)
• Idea: index all bi-grams as words
– can approximate "drink pink ink"
– fast, but the index size explodes
– inflexible: can't query #5(pink, ink)
• Construct a proximity index
– pink_ink -> (5,1)
– drink_pink -> (5,1)

Proximity Index
• Embed position information into the inverted lists
– called a positional/proximity index (prox-list)
– handles arbitrary phrases, windows
– key to "rich" indexing: structure, fields, tags, ...

Proximity Index
• Reduced output: word, list of (docID, location)

Doc #1: This doc contains text
Doc #2: My doc contains my text

Mapper output:
this,(1,1)  doc,(1,2)  contains,(1,3)  text,(1,4)
my,(2,1)  doc,(2,2)  contains,(2,3)  my,(2,4)  text,(2,5)

Reduced output:
this: (1,1)
doc: (1,2), (2,2)
contains: (1,3), (2,3)
text: (1,4), (2,5)
my: (2,1), (2,4)

Proximity Index
• Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
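The positional linear merge used for phrase queries can be sketched in Python. This is an illustrative single-query sketch: the postings are hand-copied from the slides' example, with positions generalized to per-document lists, and `phrase_match` is a hypothetical name:

```python
def phrase_match(first_postings, second_postings):
    """Linear merge of two positional posting lists.

    Each list holds (doc_id, [positions...]) pairs sorted by doc_id.
    A document matches the phrase when the second word appears
    exactly one position after the first word.
    """
    i, j, matches = 0, 0, []
    while i < len(first_postings) and j < len(second_postings):
        doc1, pos1 = first_postings[i]
        doc2, pos2 = second_postings[j]
        if doc1 == doc2:
            # adjacency check: pos(second) - pos(first) == 1
            if any(p + 1 in pos2 for p in pos1):
                matches.append(doc1)
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return matches

# Positional postings for the query "pink ink" over documents D1-D5:
pink = [(4, [8]), (5, [7])]
ink = [(3, [8]), (4, [2]), (5, [8])]
print(phrase_match(pink, ink))  # [5] -- only D5 contains "pink ink" as a phrase
```

A near operator would relax the adjacency check to `0 < pos2 - pos1 <= window`; the merge itself stays the same.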
• Index
– he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1)
– ink: (3,8), (4,2), (5,8)
– pink: (4,8), (5,7)
– thing: (3,2)
– wink: (1,4), (5,5)

Using Proximity Index
• Query: "pink ink"
• Linear Merge
– compare the docIDs under the pointers
– if they match, check pos(ink) − pos(pink) = 1
– the near operator works the same way

ink  --> (3,8) (4,2) (5,8)
pink --> (4,8) (5,7)

Structure and Tags
• Documents are not always flat
– meta-data: title, author, date
– structure: part, chapter, section, paragraph
– tags: named entity, link, translation
• Options for dealing with structure
– create a separate index for each field (like in SQL)
– push structure into the index values
– construct an extent index

Extent Index
• Special "term" for each element, field or tag
– spans a region of text
• words in the span belong to the field
– allows multiple overlapping spans
– similar to stand-off annotation formats

Extent Index
• Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
• Index
– he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1)
– ink: (3,8), (4,2), (5,8)
– pink: (4,8), (5,7)
– thing: (3,2)
– wink: (1,4), (5,5)
– link: (3,1:2), (4,1:2), (5,7:8)

Using Extent Index
• Query: find an ink-related hyperlink
• Same approach as with proximity
– only now "tag" and "word" must have distance = 0
– Linear Merge, match when the positions fall into the extent
– amenable to all the same optimizations

ink  --> (3,8)   (4,2)   (5,8)
link --> (3,1:2) (4,1:2) (5,7:8)

Overview on Inverted Indices
• Normal
• Positional
– phrases, near operator
• Extent
– metadata, structure

MR Example: Finding Friends
• http://stevekrenzel.com/finding-friends-withmapreduce
• Facebook could use MapReduce in the following way

MR Example: Finding Friends
• Facebook has a list of friends
– the relation is bidirectional
• FB has lots of disk space and serves millions of requests per day
• Certain results are pre-computed to reduce the processing time of requests
– E.g. "You and Joe have 230 mutual friends"
• The list of common friends is quite stable, so recalculating it on every request would be wasteful

MR Example: Finding Friends
• Idea: MapReduce is used to calculate the common friends daily and store the results
– later only a quick lookup is needed
• Assume the friends are stored as Person ⟶ [List of friends]
– A ⟶ [B, C, D]
– B ⟶ [A, C, D, E]
– C ⟶ [A, B, D, E]
– D ⟶ [A, B, C, E]
– E ⟶ [B, C, D]

MR Example: Finding Friends
• Each line is input for a mapper
• For every friend in the list of friends, the mapper will emit a (key, value) pair, where
– the key is
• (person, friend), if person < friend
• (friend, person), otherwise
– the value is the list of the person's friends

MR Example: Finding Friends
map(A, [B, C, D]):
(A, B), [B, C, D]
(A, C), [B, C, D]
(A, D), [B, C, D]

map(B, [A, C, D, E]):
(A, B), [A, C, D, E]
(B, C), [A, C, D, E]
(B, D), [A, C, D, E]
(B, E), [A, C, D, E]

map(C, [A, B, D, E]):
(A, C), [A, B, D, E]
(B, C), [A, B, D, E]
(C, D), [A, B, D, E]
(C, E), [A, B, D, E]

map(D, [A, B, C, E]):
(A, D), [A, B, C, E]
(B, D), [A, B, C, E]
(C, D), [A, B, C, E]
(D, E), [A, B, C, E]

map(E, [B, C, D]):
(B, E), [B, C, D]
(C, E), [B, C, D]
(D, E), [B, C, D]

MR Example: Finding Friends
• After shuffling, the inputs to the reducers are:
(A, B), [[B, C, D], [A, C, D, E]]
(A, C), [[B, C, D], [A, B, D, E]]
(A, D), [[B, C, D], [A, B, C, E]]
(B, C), [[A, C, D, E], [A, B, D, E]]
(B, D), [[A, C, D, E], [A, B, C, E]]
(B, E), [[A, C, D, E], [B, C, D]]
(C, D), [[A, B, D, E], [A, B, C, E]]
(C, E), [[A, B, D, E], [B, C, D]]
(D, E), [[A, B, C, E], [B, C, D]]

MR Example: Finding Friends
• Each line is given to a reducer
• The reducer computes the intersection of the sets
– and removes the persons of the key pair
• For example, (A, B), [[B, C, D], [A, C, D, E]] is reduced to (A, B), [C, D]

(A, B), [C, D]
(A, C), [B, D]
(A, D), [B, C]
(B, C), [A, D, E]
(B, D), [A, C, E]
(B, E), [C, D]
(C, D), [A, B, E]
(C, E), [B, D]
(D, E), [B, C]

• Now, when D visits B, the common friends [A, C, E] are found fast
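The whole friend-pair pipeline above can be imitated in a few lines of Python, using the five-person data from the slides. The shuffle phase is simulated with a dictionary; this is a sketch, not Hadoop code, and the function names are illustrative:

```python
from collections import defaultdict

friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

def map_person(person, friend_list):
    """Mapper: emit the sorted (person, friend) pair as key, the list as value."""
    for friend in friend_list:
        key = (person, friend) if person < friend else (friend, person)
        yield key, friend_list

# Shuffle: each pair key collects exactly two friend lists,
# because the friendship relation is bidirectional.
grouped = defaultdict(list)
for person, friend_list in friends.items():
    for key, value in map_person(person, friend_list):
        grouped[key].append(value)

def reduce_pair(pair, friend_lists):
    """Reducer: intersect the two lists to get the mutual friends.

    The pair members drop out automatically, since nobody appears
    in their own friend list."""
    first, second = friend_lists
    return pair, sorted(set(first) & set(second))

common = dict(reduce_pair(k, v) for k, v in grouped.items())
print(common[("B", "D")])  # ['A', 'C', 'E'] -- the lookup when D visits B
```

The result matches the reduced output on the slides: nine pairs, each mapped to its mutual-friend list.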
MR: PageRank
• Google's description
– relies on the "uniquely democratic" nature of the web
– interprets a link from page A to page B as "a vote"
• A → B means A thinks B is worth something
– many links mean that B must be good
– a content-independent measure
• Use as a ranking feature, combined with content
– not all pages linking to B are equally important
– a single link from Slashdot or CNN may be worth thousands
• Google PageRank
– how many "good" pages link to B

PageRank: Random Surfer
• Analogy
– the user starts browsing from a random page
– picks a random outgoing link
• repeat
– example: F → E → F → E → D → ...
– with probability 1−λ, jumps to a random page
• PageRank of page x
– the probability of being on page x at a random moment
– formally:

PageRank
• Initialize PR(x) = 1/N
• For every page: PR(x) = (1−λ)/N + λ · Σ_{y→x} PR(y)/out(y)
– y → x contributes part of its PR to x
– a page spreads its PR equally among its out-links
– the PR scores should sum to 100%
• use two arrays, PR_t → PR_{t+1}
• Iteration #1:
– PR(B) = 0.18·9.1 + 0.82·[PR(C) + 1/3·PR(E) + 1/2·PR(F) + 1/2·PR(G) + 1/2·PR(I)] = 31
– PR(C) = 0.18·9.1 + 0.82·9.1 = 9.1
• Iteration #2:
– ...
– PR(C) = 0.18·9.1 + 0.82·PR(B) = 26

PageRank
• The algorithm converges
• Observations:
– pages with no in-links: PR = (1−λ)·1/N ≈ 1.6 %
– same in-links ⟹ same PR
– one in-link from a high-PR page >> many from low-PR pages

PageRank with MapReduce

map(y, {x1, x2, ..., xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))

reduce(x, {PR(y1)/out(y1), ..., PR(yn)/out(yn)}):
  PR(x) = (1−λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))

• The result goes recursively to another reducer
• Still, sink nodes should be considered

PageRank with MapReduce

map(y, {x1, x2, ..., xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))
  emit(y, {x1, ..., xn})

reduce(x, {PR(y1)/out(y1), ..., PR(yn)/out(yn)}, {x1, ..., xn}):
  PR(x) = (1−λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
  emit(x, {x1, ..., xn})

• The result goes recursively to another reducer
• Still, sink nodes should be considered

Combiners

Map node 1 emits:  (A,1) (A,1) (A,1) (B,1)
Combiner output:   (A,3) (B,1)
The reduce node for key A receives the single combined pair (A,3) instead of three (A,1) pairs

Combiners
• A Combiner can "compress" the data on a mapper node before sending it forward
• The Combiner input/output types must equal the mapper output types
• In Hadoop Java, Combiners use the Reducer interface:
job.setCombinerClass(MyReducer.class);
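The word-count combiner idea above can be sketched in plain Python, imitating what happens on one map node before the shuffle. The `combine` function is illustrative, not Hadoop code:

```python
from collections import defaultdict

def combine(pairs):
    """Combiner: sum the counts per word locally, on one map node.

    Word-count summation is commutative and associative, so the
    reducer logic can be reused unchanged as the combiner.
    """
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

# The (word, 1) pairs one map node emitted, as in the figure above:
map_output = [("A", 1), ("A", 1), ("A", 1), ("B", 1)]
print(combine(map_output))  # [('A', 3), ('B', 1)] -- fewer pairs cross the network
```

Note that combining already-combined pairs gives the same result, which is exactly the property Hadoop requires, since it may run a combiner zero, one or several times.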
Reducer as a Combiner
• A Reducer can be used as a Combiner if it is commutative and associative
– E.g. max is:
• max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3))
• true for any order of function applications
– E.g. avg is not:
• avg(1, 2, avg(3, 4, 5)) = 2.3333... ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
• Note: if the Reducer is not commutative and associative, Combiners can still be used
– The Combiner just has to be different from the Reducer and designed for the specific case

Adding a Combiner to WordCount

Map output:       walk,1  run,1  walk,1
Combiner output:  run,1  walk,2
(the combined pairs then go to the shuffle)

Hadoop Streaming
• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
• Input is read from standard input
• Output is written to standard output
• Input/output items are lines of the form key\tvalue
– \t is the tabulator character
• Reducer input lines are grouped by key
– One reducer instance may receive multiple keys

Run Hadoop Streaming
• Debug using Unix pipes:
cat sample.txt | ./mapper.py | sort | ./reducer.py
• On Hadoop:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input sample.txt \
  -output output \
  -mapper ./mapper.py \
  -reducer ./reducer.py
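A word-count mapper/reducer pair following the streaming contract above might look like this. It is a sketch, with both roles combined into one file for brevity; the course's actual mapper.py and reducer.py would be two separate scripts:

```python
#!/usr/bin/env python3
# Hadoop Streaming word count: both roles in one file, selected by argument.
import sys

def mapper(lines):
    """Emit one 'word\\t1' line per word occurrence, as streaming expects."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum the counts per key; input lines arrive sorted (grouped) by key."""
    current, total = None, 0
    for line in lines:
        key, value = line.strip().split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    if sys.argv[1:] == ["map"]:
        out_lines = mapper(sys.stdin)
    elif sys.argv[1:] == ["reduce"]:
        out_lines = reducer(sys.stdin)
    else:
        # No role given: run a tiny self-demo instead of reading stdin.
        out_lines = reducer(sorted(mapper(["walk run walk"])))
    for out in out_lines:
        print(out)
```

The debug pipeline from the slide maps onto this directly: `cat sample.txt | ./wordcount.py map | sort | ./wordcount.py reduce`, where the `sort` plays the role of the shuffle.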