Data-intensive Programming
Lecture #3
Timo Aaltonen
Department of Pervasive Computing
Guest Lectures
• I'll try to organize two guest lectures
• Oct 14, Tapio Rautonen, Gofore Ltd, Making sense out of your big data
• Oct 7, ???
Outline
• Course Work
• Apache Sqoop
• SQL Recap
• MapReduce Examples
– Inverted Index
– Finding Friends
– Computing PageRank
• (Hadoop)
– Combiner
– Other programming languages
Course Work
• MySportShop is a sports gear retailer. All the sales happen online in their web store. Examples of their products are different game jerseys and sport watches.
• The web store has an Apache web server for the incoming HTTP requests. The web server logs all traffic to a log file.
– Using these logs, one can study the browsing behavior of the users.
• The sales data of MySportShop is in PostgreSQL, which is a relational database. Among other things, the database has a table order_items containing data of all sales events of the shop.
Course Work: Questions
• Based on the data, answer the following questions
1. What are the top-10 best selling products in terms of total sales?
2. What are the top-10 browsed products?
3. What anomaly is there between these two?
4. What are the most popular browsing hours?
Course Work
• Since the managers of the company don't use Hadoop but an RDBMS, all the data must be transferred to PostgreSQL
• In order to do that
– Transfer the Apache logs (with Apache Flume) to the HDFS
– Compute the viewing frequencies of the different products using MapReduce (Question 2)
– Compute the viewing hour data with MapReduce (Q4)
– Transfer the results (with Apache Sqoop) to PostgreSQL
– Find the answers to the questions in PostgreSQL using SQL (Q1-4)
Environment: three options
1. You can use your own computer by installing VirtualBox 5.x
– We offer you a virtual machine on which all required software and data have been installed
– In the next weekly exercises, assistants solve VirtualBox-related problems, if you encounter any
2. We offer you a virtual machine from the TUT cloud
– All required software and data is installed
– No graphical user interface
– Guidance available in the weekly exercises
3. Your own installation / a cloud service can be used
– No help from the course personnel
Course Work
• The work is done in groups of three
– Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954
– opens today at 10 o'clock
• Deadline is Oct 14th
• Instructions for returning will be published later
– IntelliJ IDEA project
Course Work
• Material
– https://flume.apache.org/FlumeUserGuide.html
– https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
– http://hadoop.apache.org/docs/r2.7.3/
– https://www.postgresql.org/docs/9.5/static/index.html
MapReduce
• Simple programming model
• Map is stateless
– allows running map functions in parallel
• Also Reduce can be executed in parallel
• The canonical example is the word count
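The word count can be sketched in plain Python that simulates the three MapReduce phases (map, shuffle, reduce). This is an illustrative sketch, not actual Hadoop code; the function names and example input are made up here.

```python
from collections import defaultdict

def map_fn(doc):
    # map: emit (word, 1) for every word in one input line
    for word in doc.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce: sum all counts collected for one word
    return (word, sum(counts))

def run_word_count(docs):
    # shuffle: group the mapper output by key
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_word_count(["this doc contains text", "my doc contains my text"]))
```

Because `map_fn` is stateless, a real framework can run it on every input split in parallel; the shuffle is what brings all counts for one word to the same reducer.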
Inverted Index
• Collating
– Problem: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building inverted indexes.
– Solution: The mapper computes the given function for each item and emits the value of the function as the key and the item itself as the value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the document ID where the term was found.
Simple Inverted Index
• Reduced output: word, list of docIDs
• Doc #1: "This doc contains text"
• Doc #2: "My doc contains my text"
• Mapper output:
this,1  doc,1  contains,1  text,1
my,2  doc,2  contains,2  my,2  text,2
• Reduced output:
this: 1
doc: 1, 2
contains: 1, 2
text: 1, 2
my: 2
(Normal) Inverted Index
• Reduced output: word, list of (docID, frequency)
• Doc #1: "This doc contains text"
• Doc #2: "My doc contains my text"
• Mapper output:
this,(1,1)  doc,(1,1)  contains,(1,1)  text,(1,1)
my,(2,1)  doc,(2,1)  contains,(2,1)  my,(2,1)  text,(2,1)
• Reduced output:
this: (1,1)
doc: (1,1),(2,1)
contains: (1,1),(2,1)
text: (1,1),(2,1)
my: (2,2)
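A minimal sketch of this index builder in plain Python: the mapper emits (word, docID), and the reducer collapses each word's docID list into (docID, frequency) pairs. The helper names and the two-document input are illustrative, not from any Hadoop API.

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    # the "function of an item" here is the document ID where the term occurs
    for word in text.lower().split():
        yield (word, doc_id)

def build_index(docs):
    # shuffle: collect all docIDs emitted for each word
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_fn(doc_id, text):
            groups[word].append(d)
    # reduce: collapse the docID list into sorted (docID, frequency) pairs
    return {w: sorted(Counter(ids).items()) for w, ids in groups.items()}

index = build_index({1: "This doc contains text", 2: "My doc contains my text"})
print(index["my"])   # "my" appears twice in doc 2
print(index["doc"])  # "doc" appears once in each document
```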
Using Inverted Index: Searching
• Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
• Index
– he: (1,2),(2,1),(3,1),(4,1),(5,1)
– ink: (3,1),(4,1),(5,1)
– pink: (4,1),(5,1)
– thing: (3,1)
– wink: (1,1),(5,1)
Using Inverted Index
• Indexing makes search engines fast
• Data is sparse since most words appear only in one document
– (id, val) tuples
– sorted by id
– compact
– very fast
• Linear merge
• Index
– he: (1,2),(2,1),(3,1),(4,1),(5,1)
– ink: (3,1),(4,1),(5,1)
– pink: (4,1),(5,1)
– thing: (3,1)
– wink: (1,1),(5,1)
Linear Merge
• Find documents matching the query {ink, wink}
– Load the inverted lists for all query words
– Linear merge O(n)
• n is the total number of items in the two lists
• f() is a scoring function: how well a doc matches the query
ink --> (3,1) (4,1) (5,1)
wink --> (1,1) (5,1)
Matching set:
1: f(0,1)
3: f(1,0)
4: f(1,0)
5: f(1,1)
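The linear merge above can be sketched as a two-pointer walk over the sorted posting lists, applying a caller-supplied scoring function to the per-document counts. This is an illustrative sketch; the function names are made up, and the Boolean AND lambda anticipates the scoring-function slide.

```python
def linear_merge(list_a, list_b, score):
    # both posting lists are (docID, count) pairs sorted by docID
    i = j = 0
    results = {}
    while i < len(list_a) or j < len(list_b):
        doc_a = list_a[i][0] if i < len(list_a) else float("inf")
        doc_b = list_b[j][0] if j < len(list_b) else float("inf")
        if doc_a < doc_b:            # doc only in the first list
            results[doc_a] = score(list_a[i][1], 0); i += 1
        elif doc_b < doc_a:          # doc only in the second list
            results[doc_b] = score(0, list_b[j][1]); j += 1
        else:                        # doc in both lists
            results[doc_a] = score(list_a[i][1], list_b[j][1]); i += 1; j += 1
    return results

ink  = [(3, 1), (4, 1), (5, 1)]
wink = [(1, 1), (5, 1)]
# Boolean AND scoring: 1 iff both query words are present
matches = linear_merge(ink, wink, lambda a, b: 1 if a > 0 and b > 0 else 0)
print(matches)  # only D5 contains both "ink" and "wink"
```

Each posting is visited once, so the cost is O(n) in the total length of the two lists.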
Scoring Function
• Specify which docs are matched
– in: counts of the query words in a doc
– out: ranking score
• how well the doc matches the query
• 0 if the document does not match
– Example:
• Boolean AND: f(Q, D) = ∏_{q∈Q} (1 if n_q > 0; 0 if n_q = 0)
– 1 iff all query words are present
Phrases and Proximity
• Query "pink ink" as a phrase
– D4: The ink he likes to drink is pink.
• Using a regular index:
– match #and(pink, ink) -> D5: He likes to wink and drink pink ink.
– scan the matched documents for the query string (slow)
• Idea: index all bi-grams as words
– can approximate "drink pink ink"
– fast, but the index size explodes
– inflexible: can't query #5(pink, ink)
• Construct a proximity index
– pink_ink -> (5,1)
– drink_pink -> (5,1)
Proximity Index
• Embed position information into the inverted lists
– called a positional/proximity index (prox-list)
– handles arbitrary phrases, windows
– key to "rich" indexing: structure, fields, tags, …
Proximity Index
• Reduced output: word, list of (docID, location)
• Doc #1: "This doc contains text"
• Doc #2: "My doc contains my text"
• Mapper output:
this,(1,1)  doc,(1,2)  contains,(1,3)  text,(1,4)
my,(2,1)  doc,(2,2)  contains,(2,3)  my,(2,4)  text,(2,5)
• Reduced output:
this: (1,1)
doc: (1,2),(2,2)
contains: (1,3),(2,3)
text: (1,4),(2,5)
my: (2,1),(2,4)
Proximity Index
• Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
• Index
– he: (1,1),(1,5),(2,1),(3,3),(4,3),(5,1)
– ink: (3,8),(4,2),(5,8)
– pink: (4,8),(5,7)
– thing: (3,2)
– wink: (1,4),(5,5)
Using Proximity Index
• Query: "pink ink"
• Linear Merge
– compare the docIDs under the pointers
– if they match, check pos(ink) - pos(pink) = 1
– near operator
ink --> (3,8) (4,2) (5,8)
pink --> (4,8) (5,7)
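The phrase check can be sketched in Python. This simplified sketch groups positions per document instead of doing a strict two-pointer merge, but the test is the same as on the slide: the phrase matches when pos(ink) - pos(pink) = 1. The function name and `distance` parameter are illustrative.

```python
from collections import defaultdict

def phrase_match(first_list, second_list, distance=1):
    # positional postings: (docID, position) pairs, sorted by docID
    first = defaultdict(set)
    second = defaultdict(set)
    for d, p in first_list:
        first[d].add(p)
    for d, p in second_list:
        second[d].add(p)
    hits = []
    for doc in sorted(first.keys() & second.keys()):
        # match when some position of the second word is exactly
        # `distance` after a position of the first word
        if any(p + distance in second[doc] for p in first[doc]):
            hits.append(doc)
    return hits

pink = [(4, 8), (5, 7)]
ink  = [(3, 8), (4, 2), (5, 8)]
print(phrase_match(pink, ink))  # "pink ink" occurs as a phrase only in D5
```

D4 contains both words but at positions 8 and 2, so it is correctly rejected; a near operator is the same check with a larger allowed distance.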
Structure and Tags
• Documents are not always flat
– meta-data: title, author, date
– structure: part, chapter, section, paragraph
– tags: named entity, link, translation
• Options for dealing with structure
– create a separate index for each field (like in SQL)
– push the structure into the index values
– construct an extent index
Extent Index
• Special "term" for each element, field or tag
– spans a region of text
• words in the span belong to the field
– allows multiple overlapping spans
– similar to stand-off annotation formats
Extent Index
• Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
• Index
– he: (1,1),(1,5),(2,1),(3,3),(4,3),(5,1)
– ink: (3,8),(4,2),(5,8)
– pink: (4,8),(5,7)
– thing: (3,2)
– wink: (1,4),(5,5)
– link: (3,1:2),(4,1:2),(5,7:8)
Using Extent Index
• Query: find an ink-related hyper-link
• Same approach as with proximity
– only now "tag" and "word" must have distance = 0
– Linear Merge, match when the word's position falls into the extent
– amenable to all optimizations
ink --> (3,8) (4,2) (5,8)
link --> (3,1:2) (4,1:2) (5,7:8)
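The extent check can be sketched the same way as the proximity check, but instead of testing a distance between positions, a document matches when the word's position falls inside the tag's (start, end) span. The function name and data layout are illustrative assumptions.

```python
def extent_match(word_list, extent_list):
    # word postings: (docID, position); extent postings: (docID, (start, end))
    extents = {}
    for d, span in extent_list:
        extents.setdefault(d, []).append(span)
    hits = []
    for d, pos in word_list:
        # match when the word's position lies inside some extent of the doc
        if any(start <= pos <= end for start, end in extents.get(d, [])):
            if d not in hits:
                hits.append(d)
    return hits

ink  = [(3, 8), (4, 2), (5, 8)]
link = [(3, (1, 2)), (4, (1, 2)), (5, (7, 8))]
print(extent_match(ink, link))  # "ink" inside a link only in D4 and D5
```

In D3 the word "ink" is at position 8 but the link spans positions 1:2, so it is rejected, matching the slide's example.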
Overview on Inverted Indices
• Normal
• Positional
– phrases, near operator
• Extent
– metadata, structure
MR Example: Finding Friends
• http://stevekrenzel.com/finding-friends-with-mapreduce
• Facebook could use MapReduce in the following way
MR Example: Finding Friends
• Facebook has a list of friends
– the relation is bidirectional
• FB has lots of disk space and serves millions of requests per day
• Certain results are pre-computed to reduce the processing time of requests
– E.g. "You and Joe have 230 mutual friends"
• The list of common friends is quite stable, so recalculating it on every request would be wasteful
MR Example: Finding Friends
• Idea: MapReduce is used to calculate the common friends daily and store the results
– later only a quick lookup is needed
• Assume the friends are stored as Person ⟶ [List of friends]
– A ⟶ [B, C, D]
– B ⟶ [A, C, D, E]
– C ⟶ [A, B, D, E]
– D ⟶ [A, B, C, E]
– E ⟶ [B, C, D]
MR Example: Finding Friends
• Each line is input for the mapper
• For every friend in the list of friends, the mapper will emit a (key, value) pair, where
– key is
• (person, friend), if person < friend
• (friend, person), otherwise
– value is the list of the person's friends
MR Example: Finding Friends
map(A , [B, C, D]):
(A, B), [B, C, D]
(A, C), [B, C, D]
(A, D), [B, C, D]
map(B, [A, C, D, E]):
(A, B), [A, C, D, E]
(B, C), [A, C, D, E]
(B, D), [A, C, D, E]
(B, E), [A, C, D, E]
map(C, [A, B, D, E]):
(A, C), [A, B, D, E]
(B, C), [A, B, D, E]
(C, D), [A, B, D, E]
(C, E), [A, B, D, E]
map(D, [A, B, C, E]):
(A, D), [A, B, C, E]
(B, D), [A, B, C, E]
(C, D), [A, B, C, E]
(D, E), [A, B, C, E]
map(E, [B, C, D]):
(B, E), [B, C, D]
(C, E), [B, C, D]
(D, E), [B, C, D]
MR Example: Finding Friends
• After shuffling, the inputs to the reducers are:
(A, B), [[B, C, D], [A, C, D, E]]
(A, C), [[B, C, D], [A, B, D, E]]
(A, D), [[B, C, D], [A, B, C, E]]
(B, C), [[A, C, D, E], [A, B, D, E]]
(B, D), [[A, C, D, E], [A, B, C, E]]
(B, E), [[A, C, D, E], [B, C, D]]
(C, D), [[A, B, D, E], [A, B, C, E]]
(C, E), [[A, B, D, E], [B, C, D]]
(D, E), [[A, B, C, E], [B, C, D]]
MR Example: Finding Friends
• Each line is given to a reducer
• The reducer computes the intersection of the sets
– and removes the persons of the key pair
• For example (A, B), [[B, C, D], [A, C, D, E]] is reduced to (A, B), [C, D]
(A, C), [B, D]
(A, D), [B, C]
(B, C), [A, D, E]
(B, D), [A, C, E]
(B, E), [C, D]
(C, D), [A, B, E]
(C, E), [B, D]
(D, E), [B, C]
• Now, when D visits B, the common friends [A, C, E] are found fast
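The whole pipeline can be sketched in plain Python: emit the full friend list under a canonically ordered pair key, shuffle by key, then intersect the two lists in the reducer. Note that the pair members drop out of the intersection automatically, since no one appears in their own friend list. This is a simulation sketch, not Hadoop code.

```python
from collections import defaultdict

friends = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}

def map_fn(person, flist):
    # emit the person's full friend list under an ordered pair key
    for friend in flist:
        key = (person, friend) if person < friend else (friend, person)
        yield key, flist

# shuffle: each pair key collects exactly two lists (one per endpoint)
groups = defaultdict(list)
for person, flist in friends.items():
    for key, value in map_fn(person, flist):
        groups[key].append(value)

# reduce: intersect the two lists to get the mutual friends
mutual = {pair: sorted(set(lists[0]) & set(lists[1]))
          for pair, lists in groups.items()}

print(mutual[("B", "D")])  # → ['A', 'C', 'E']
print(mutual[("A", "B")])  # → ['C', 'D']
```

At serving time, "mutual friends of D and B" is a single lookup under the key (B, D).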
MR: PageRank
• Google's description
– relies on the "uniquely democratic" nature of the web
– interprets a link from page A to page B as "a vote"
• A → B means A thinks B is worth something
– many links mean that B must be good
– a content-independent measure
• Use as a ranking feature, combined with content
– not all pages linking to B are equally important
– a single link from Slashdot or CNN may be worth thousands
• Google PageRank
– how many "good" pages link to B
PageRank: Random Surfer
• Analogy
– the user starts browsing from a random page
– picks a random out-going link
• repeat
– example: F → E → F → E → D → …
– with probability 1 − λ, jumps to a random page
• PageRank of page x
– the probability of being on page x at a random moment
– formally:
PageRank
• Initialize PR(x) = 1/N
• For every page: PR(x) = (1 − λ)/N + λ · Σ_{y→x} PR(y)/out(y)
– y → x contributes part of its PR to x
– spreads PR equally among the out-links
– the PR scores should sum to 100%
• use two arrays PR_t → PR_{t+1}
• Iteration #1:
– PR(B) = 0.18 · 9.1 + 0.82 · [PR(C) + 1/3 · PR(E) + 1/2 · PR(F) + 1/2 · PR(G) + 1/2 · PR(I)] = 31
– PR(C) = 0.18 · 9.1 + 0.82 · 9.1 = 9.1
• Iteration #2:
– ...
– PR(C) = 0.18 · 9.1 + 0.82 · PR(B) = 26
PageRank
• The algorithm converges
• Observations:
– pages with no inlinks: PR = (1 − λ) · 1/N = 0.16
– same inlinks ⇒ same PR
– one inlink from a high-PR page >> many from low-PR pages
PageRank with MapReduce
map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))

reduce(x, {PR(y1)/out(y1), …, PR(yn)/out(yn)}):
  PR(x) = (1 − λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
• The result goes recursively to another reducer
• Sink nodes should still be considered
PageRank with MapReduce
map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))
  emit(y, {x1, …, xn})

reduce(x, {PR(y1)/out(y1), …, PR(yn)/out(yn)}, {x1, …, xn}):
  PR(x) = (1 − λ)/N + λ · Σ_{y→x} PR(y)/out(y)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
  emit(x, {x1, …, xn})
• The result goes recursively to another reducer
• Sink nodes should still be considered
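One iteration of this scheme can be sketched in plain Python: the "map" loop sends each page's PR(y)/out(y) share to its out-links, and the "reduce" step applies PR(x) = (1 − λ)/N + λ · Σ contributions. The three-page graph and the iteration count are made-up illustrations; λ = 0.82 follows the slides' example numbers, and the sketch assumes no sink nodes.

```python
def pagerank(graph, lam=0.82, iterations=50):
    # graph: page -> list of out-links; assumes every page has out-links
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}   # initialize PR(x) = 1/N
    for _ in range(iterations):
        # "map" phase: each page y sends PR(y)/out(y) to every out-link
        contrib = {page: 0.0 for page in graph}
        for y, outlinks in graph.items():
            share = pr[y] / len(outlinks)
            for x in outlinks:
                contrib[x] += share
        # "reduce" phase: PR(x) = (1 - lam)/N + lam * sum of contributions
        pr = {x: (1 - lam) / n + lam * contrib[x] for x in graph}
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)
```

Because there are no sink nodes, the total PR mass stays at 1 on every iteration; the real Hadoop job re-emits the adjacency lists each round, as in the pseudocode above, since the reducer otherwise would not know the out-links of x.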
Combiners
[Figure: on map node 1 the mapper emits (A,1), (A,1), (A,1), (B,1); the combiner compresses these to (A,3) and (B,1) before they are sent to the reduce node for key A]
Combiners
• A Combiner can "compress" data on a mapper node before sending it forward
• The Combiner input/output types must equal the mapper output types
• In Hadoop Java, Combiners use the Reducer interface:
job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
• A Reducer can be used as a Combiner if it is commutative and associative
– E.g. max is:
• max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3))
• true for any order of function applications
– E.g. avg is not:
• avg(1, 2, avg(3, 4, 5)) = 2.33333 ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
• Note: if the Reducer is not commutative and associative, Combiners can still be used
– The Combiner just has to be different from the Reducer and designed for the specific case
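The classic fix for the avg case is a Combiner that emits partial (sum, count) pairs instead of partial averages; the Reducer then divides the total sum by the total count. The following plain-Python sketch illustrates the idea; the function names and data layout are made up here.

```python
def combiner(values):
    # combiner differs from the reducer: it keeps (sum, count) pairs
    # so the final average can still be computed correctly downstream
    total = sum(s for s, c in values)
    count = sum(c for s, c in values)
    return [(total, count)]

def reducer(values):
    # reducer: divide the total sum by the total count
    total = sum(s for s, c in values)
    count = sum(c for s, c in values)
    return total / count

# raw values 1..5, each wrapped as (value, 1) by the mapper
node1 = [(1, 1), (2, 1)]          # values seen on map node 1
node2 = [(3, 1), (4, 1), (5, 1)]  # values seen on map node 2
combined = combiner(node1) + combiner(node2)
print(reducer(combined))  # → 3.0, same as avg(1, 2, 3, 4, 5)
```

Running the Reducer itself as a Combiner here would average the per-node averages (1.5 and 4.0) and get 2.75 instead of 3.0; the (sum, count) encoding avoids that.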
Adding a Combiner to WordCount
[Figure: the mapper emits (walk,1), (run,1), (walk,1); the combiner merges these to (run,1) and (walk,2) before the shuffle]
Hadoop Streaming
• Map and Reduce functions can be implemented in any language with the Hadoop Streaming API
• Input is read from standard input
• Output is written to standard output
• Input/output items are lines of the form key\tvalue
– \t is the tabulator character
• Reducer input lines are grouped by key
– One reducer instance may receive multiple keys
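A streaming word count can be sketched as two small Python functions that consume and produce tab-separated lines, mirroring what mapper.py and reducer.py would do; in the real scripts the lines would come from sys.stdin and go to print(). The reducer exploits the streaming guarantee that its input is sorted by key, so equal keys arrive on consecutive lines.

```python
def mapper(lines):
    # mapper.py: emit "word\t1" for every word on every input line
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    # reducer.py: input is sorted by key, so count each run of equal
    # keys and emit the total whenever the key changes
    current, count = None, 0
    for line in lines:
        word, value = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

# `sorted` plays the role of the shuffle (or `sort` in a Unix pipe)
mapped = sorted(mapper(["walk run walk", "run run"]))
print(list(reducer(mapped)))  # → ['run\t3', 'walk\t2']
```

This is exactly why the Unix-pipe debug line on the next slide inserts `sort` between the mapper and the reducer.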
Run Hadoop Streaming
• Debug using Unixpipes:
cat sample.txt | ./mapper.py | sort | ./reducer.py
• OnHadoop:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input sample.txt \
-output output \
-mapper ./mapper.py \
-reducer ./reducer.py