MapReduce I MapReduce-I I duce-

advertisement
MapReduce-I
MapReduce I
MapReduce-I
September 2013
Alberto Abelló & Oscar Romero
1
MapReduce-I
Knowledge objectives
1.
2.
3.
4.
Enumerate several use cases of
MapReduce
Describe what the MapReduce
environment is
Explain 6 benefits of using MapReduce
Explain four problems of MapReduce
September 2013
Alberto Abelló & Oscar Romero
2
MapReduce-I
Understanding Objectives
1.
Describe the different steps of a
MapReduce execution
September 2013
Alberto Abelló & Oscar Romero
3
MapReduce-I
Application Objectives
1.
Identify the usefulness of MapReduce in a
given use case
September 2013
Alberto Abelló & Oscar Romero
4
MapReduce-I
Typical uses











Find which source pages link to a target page
Count the number of accesses to each Web page
Count the number of accesses to each domain
Create an index structure that maps search terms to
doc ment IDs
document
Retrieve introductory paragraph of all Web pages so
that “x”
Find all pairs of users accessing the same URL
Find the average age of users accessing a given URL
Find all friends of a given user
Find all friends of friends of a given user
Find all women friends of men friends of a given user
G
Grouping
i
diff
different manifestations
if
i
off the
h same reall
world object
September 2013
Alberto Abelló & Oscar Romero
5
MapReduce-I
Not so typical uses
Mobile Commerce
 Electricity
 Agricultural Planning
 Fuel Conservation
 National
N ti
l IIntelligence
t lli
 Drug Development and Personalization
 Financial Service Security

September 2013
Alberto Abelló & Oscar Romero
6
MapReduce-I
Opinions
TDWI Best Practices Report, 2013
September 2013
Alberto Abelló & Oscar Romero
7
MapReduce-I
Busting 10 Myths about Hadoop










Hadoop consists in multiple products
Hadoop is open source but available from vendors,
vendors
too
Hadoop is an ecosystem, not a single product
HDFS is a file system,
s stem not a DBMS
Hive resembles SQL but is not standard SQL
Hadoop
p and MapReduce
p
are related but don’t require
q
each other
MapReduce provides control for analytics, not
analytics
y
per
p se
Hadoop is about data diversity, not just data volume
Hadoop complements a DW; it’s rarely a replacement
Hadoop enables many types of analytics
analytics, not just Web
analytics
Philip Russom
September 2013
Alberto Abelló & Oscar Romero
8
MapReduce-I
Google ecosystem

High-performance is mainly achieved by
means off parallelism
ll li


Divide-and-conquer principle
M R d
MapReduce

It is a query language that provides parallelism
in a transparent manner
Query Language
MapReduce
Hadoop File System
Database
BigTable
…
September 2013
Alberto Abelló & Oscar Romero
9
MapReduce-I
Hadoop ecosystem
By HortonWorks
September 2013
Alberto Abelló & Oscar Romero
10
MapReduce-I
Sequential access
Ben Stopford
p
Progscon & JAX Finance 2015
September 2013
Alberto Abelló & Oscar Romero
11
MapReduce-I
MapReduce Basics
Map
p

MergeSort
Reduce
Simple model to express relatively sophisticated
distributed programs


Processes pairs [key, value]
Signature:
September 2013
Alberto Abelló & Oscar Romero
12
[“Th
he”,[1,1,1,1,…]]
MapReduce-I
WordCount Execution Example
<#line, text>
Map
The
Project
Gutemberg
Ebook
of
The
Outline
Of
Science,
Vol.
1
(of
4),
by
September 2013
Alberto Abelló & Oscar Romero
Merge-Short
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Reduce
The
57631
13
MapReduce-I
WordCount Code Example
public void map(LongWritable
key, Text
value) {
Value
Key
String line = value.toString();
Blackbox
StringTokenizer tokenizer
= new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
write(new Text(tokenizer.nextToken()),
Text(tokenizer
nextToken()) new IntWritable(1));
Value
Key
}
}
public void reduce(Text
key, Iterable<IntWritable>
values) {
Key
Values
int sum = 0;
for (IntWritable val : values) {
Blackbox
sum += val.get();
}
write(key,
new IntWritable(sum));
Key
Value
}
September 2013
Alberto Abelló & Oscar Romero
14
MapReduce-I
Benefits

Programming (functional) model simple yet
expressive

Without joins
Able to process structured or unstructured
 Elastically scalable


Transparently distributes data




Exploits data locality
Hides parallelization
Balances workload
Provides fine grained fault tolerance
September 2013
Alberto Abelló & Oscar Romero
15
MapReduce-I
Problems

Writes intermediate results to disk


Reduce tasks pull intermediate data
Defines the execution p
plan on the fly
y

Schedules one block at a time
Does not provide transactions
 Does not benefit from compression

September 2013
Alberto Abelló & Oscar Romero
16
MapReduce-I
Market tools
“… But to really unlock the power of
Hadoop, you must be able to efficiently
extract data stored across multiple (often
tens or hundreds) of nodes with a userfriendly ETL (extract, transform and load)
tool that will then allow you to move your
Hadoop data into a relational data mart or
warehouse where you can use BI tools for
analysis. “
Ian Fyfe
Pentaho
September 2013
Alberto Abelló & Oscar Romero
17
MapReduce-I
Reference Architecture
DM
DM
DM
DW
ETL (Extraction, Transformation and Load)
WWW
September 2013
Alberto Abelló & Oscar Romero
18
MapReduce-I
Bigg elephant
p
or little elephant?
p
T d
Trademark
k
• Expensive
• Many functionalities
• Mature
Open source
• Free
• Simple functionalities
• Young
September 2013
Alberto Abelló & Oscar Romero
19
MapReduce-I
Friends or foes
MapRed ce
MapReduce
Ja a
Java
O acle
Oracle
Oracle Grid Engine
Hadoop Distributed
Dist ib ted File System
S stem (HDFS)
September 2013
Alberto Abelló & Oscar Romero
20
MapReduce-I
Activity
Objective: Understand the usefulness of
M R d
MapReduce
 Tasks:

1. (6’)
Read two use cases
2. (12’) Explain the use cases to the others
3. (5’) Find
Fi d th
the main
i characteristics
h
t i ti needed
d d to
t use
Hadoop
4 Hand in a prioritized list of characteristics
4.

Roles for the team-mates during task 2:
a) Explains
his/her material
b) Asks for clarification of blur concepts
c) Mediates and controls time
September 2013
Alberto Abelló & Oscar Romero
21
MapReduce-I
Algorithm: Data Load
The input data is partitioned into blocks
1.
o
2.
It can be done by using HDFS or any other storage
(e.g., HadoopDB, MongoDB, Cassandra, CouchDB, etc.)
Replicate them in different nodes
Input file
Block
Replicate
…
September 2013
Alberto Abelló & Oscar Romero
22
MapReduce-I
Algorithm: Map Phase (I)
3. Each
map subplan reads a subset of blocks (i.e., split)
4 Divides it into records
4.
5. Executes the map for each record and leaves them in
memory divided into spills
Split
Itemize
Map
…
Nodemapper
September 2013
Alberto Abelló & Oscar Romero
23
MapReduce-I
Algorithm: Map Phase (II)
Each spill is then partitioned per reducers
6.
o
Each partition is sorted independently
7.
o
8.
Using
g a hash function f over the new key
y
If a combine is defined, it is executed locally after sorting
Store the spills
p
into disk ((massive writing)
g)
Partition
spills
Store
Combine
Nodemapper
September 2013
Alberto Abelló & Oscar Romero
24
MapReduce-I
Algorithm: Map phase (III)
9.
Spill partitions are merged
o
Each merge is sorted independently
10. Store
the result into disk
Merge partitions
per reducer
Store
Combine
Nodemapper
September 2013
Alberto Abelló & Oscar Romero
25
MapReduce-I
Algorithm: Shuffle and Reduce
11.
12
12.
13.
14.
Reducers fetch data from mappers (massive data transfer)
Mappers output is merged and sorted
Reduce function is executed per key
Store the result into disk
Fetch
Reduce
Store
…
Nodereducer
September 2013
Alberto Abelló & Oscar Romero
26
MapReduce-I
Local-Global Aggregation: Combine

Combine is executed locally



Only possible when the reducer function is:



Assumes uniform random distribution of input
Reduces the number of tuples sent to reducers
Commutative
Associative
Only makes sense if |I|/|O|>>#CPU
September 2013
Alberto Abelló & Oscar Romero
27
MapReduce-I
Summary
MapReduce
 MapReduce
 MapReduce
 MapReduce

September 2013
usefulness
benefits
problems
algorithm
Alberto Abelló & Oscar Romero
28
MapReduce-I
Bibliography
J. Dean et al. MapReduce: Simplified
Data Processing on Large Clusters.
OSDI’04
 P. Sadagale and M. Fowler. NoSQL
distilled. Addison-Wesley,
y, 2013
 S. Abiteboul et al. Web data management.
Cambridge University Press, 2011
 M. Stonebraker et al. MapReduce and
parallel DBMSs: friends or foes?
Communication of ACM 53(1), 2010

September 2013
Alberto Abelló & Oscar Romero
29
MapReduce-I
Resources
http://hadoop.apache.org
 http://www.cloudera.com
 http://flink.incubator.apache.org



Former http://stratosphere.eu
https://spark apache org
https://spark.apache.org
September 2013
Alberto Abelló & Oscar Romero
30
Download