A Hadoop MapReduce
Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Background
• Hadoop MapReduce
[Figure: a MapReduce job. Input data stored in HDFS is split; each split is processed by a Map task that emits (key, value) pairs; the map output is partitioned (Partition 1, Partition 2, ...) and aggregated by Reduce tasks.]
Background
• Hadoop
• Many steps within the Map stage and the Reduce stage
• Different steps may consume different types of resources
[Figure: inside a Map task, each record flows through Read, Map, Sort, Merge, and Output steps.]
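To make the record flow concrete, here is a minimal WordCount pair in Hadoop's Java API (our illustration, not code from the talk): the Map step turns each input line into (word, 1) pairs, and the Reduce step sums the values that arrive for each key.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: one (offset, line) input record -> one (word, 1) pair per token.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) ctx.write(new Text(word), ONE);
        }
    }
}

// Reduce: all (word, 1) pairs for the same word arrive together and are summed.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```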
Motivation
• Problems
– Scheduling: schedulers take no account of a job's execution time or of the different types of resources it consumes (e.g., two CPU-intensive jobs may end up on the same node)
– Parameter tuning: Hadoop exposes numerous parameters, and their default values are not optimal for a given job
Motivation
• Solution
– Scheduling: no consideration of execution time or of the types of resources consumed
– Parameter tuning: numerous parameters whose default values are not optimal
→ Both problems motivate the same capability: predict the performance of Hadoop jobs
Related Work
• Existing Prediction Method 1: Black Box Based
– Job features are fed to statistical/learning models that output the predicted execution time
– Weaknesses: the job features are hard to choose, and the approach lacks any analysis of Hadoop itself
Related Work
• Existing Prediction Method 2: Cost Model Based
– Hadoop processing is decomposed into steps, and each phase is modeled as a cost function over its steps:
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
– Weaknesses: the many concurrent processes make the stages hard to divide, so prediction accuracy is difficult to ensure
Related Work
• A Brief Summary about Existing Prediction Method
Black Box
Cost Model
Advantage
Simple and Effective
High accuracy
High isomorphism
Detailed analysis about Hadoop
processing
Division is flexible (stage, resource)
Multiple prediction
Short
Coming
Lack of job feature extraction
Lack of analysis
Hard to divide each step and
resource
Lack of job feature extraction
A lot of concurrent, hard to model
Better for theoretical analysis, not
suitable for prediction
o Simple prediction,
o Lack of jobs (jar package + data) analysis
8
Goal
• Design a Hadoop MapReduce performance prediction system to:
– Predict the job's consumption of various types of resources (CPU, disk I/O, network)
– Predict the execution time of the Map phase and the Reduce phase
[Diagram: a Job enters the Prediction System, which outputs Map execution time, Reduce execution time, and CPU / disk / network occupation times.]
Design - 1
• Cost Model
[Diagram: a Job enters the Cost Model, which outputs Map execution time, Reduce execution time, and CPU / disk / network occupation times.]
Cost Model [1]
• Analysis of the Map phase
– Model the consumption of each resource (CPU, disk, network)
– Each stage involves only one type of resource
[Diagram: Map-task stages laid out on three resource rows (CPU, disk, network): initiation, read data, network transfer, create object, sort in memory, map function, merge sort, serialization, read/write disk, write disk.]
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, “A practical performance model for Hadoop MapReduce,” in CLUSTER Workshops, 2012, pp. 231–239.
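To make the decomposition concrete, here is a sketch of how the per-resource Map costs could be written, one term per stage. The notation is ours, not the paper's exact model: N is the number of input records, C_map the relative complexity of the map function, σ the sorting coefficient, D_* data volumes, and B_* bandwidths.

```latex
T^{map}_{cpu}  = t_{init} + N\,t_{create} + C_{map}\,N\,t_{rec} + \sigma\,N\log N
T^{map}_{disk} = (D_{read} + D_{spill} + D_{write}) / B_{disk}
T^{map}_{net}  = D_{transfer} / B_{net}
```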
Cost Model [1]
• Cost Function Parameter Analysis
– Type One: constants
• Hadoop system overhead, initialization overhead
– Type Two: job-related parameters
• Computational complexity of the map function, number of map input records
– Type Three: parameters defined by the cost model
• Sorting coefficient, complexity factor
Parameters Collection
• Type One and Type Three
– Type One: run empty map tasks and compute the system overhead from the logs
– Type Three: extract the sort code from the Hadoop source and time it sorting a fixed number of records (sketched below)
• Type Two
– Option 1: run a new job and analyze its logs → high latency, large overhead
– Option 2: sample the input data and analyze only the behavior of the map and reduce functions → almost no latency, very low extra overhead
→ Option 2 is what the Job Analyzer does
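For instance, the sorting coefficient (Type Three) can be estimated offline by timing a stand-alone sort, as in this hedged Java sketch; the N·log2(N) fit mirrors the cost model, but the code is ours, not the paper's.

```java
import java.util.Arrays;
import java.util.Random;

// Estimate the sorting coefficient sigma from t_sort ≈ sigma * n * log2(n).
public class SortCoefficient {
    public static void main(String[] args) {
        int n = 1_000_000;
        long[] records = new Random(42).longs(n).toArray();

        long start = System.nanoTime();
        Arrays.sort(records);               // stand-in for Hadoop's sort step
        long elapsedNs = System.nanoTime() - start;

        double sigma = elapsedNs / (n * (Math.log(n) / Math.log(2)));
        System.out.printf("sigma ~ %.3f ns per comparison unit%n", sigma);
    }
}
```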
Job Analyzer - Implementation
– Hadoop virtual execution environment
• Accepts the job's jar file & input data
– Sampling Module
• Samples the input data by a fixed percentage (less than 5%)
– MR Module
• Instantiates the user job's classes using Java reflection (see the sketch below)
– Analyze Module
• Input data (amount & number of records)
• Relative computational complexity
• Data conversion rate (output/input)
[Diagram: jar file + input data enter the Hadoop virtual execution environment, where the Sampling Module feeds the MR Module; the Analyze Module then emits the job features.]
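The reflection step might look like the following sketch (illustrative only; the paper's MR Module is not published as code). It loads the Mapper class named in the job configuration so that sampled records can be pushed through the user's real map() logic.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MRModule {
    // Instantiate the user's Mapper from the submitted jar via reflection.
    static Mapper<?, ?, ?, ?> loadUserMapper(Job job) throws Exception {
        Class<?> mapperClass = job.getMapperClass(); // declared in the job conf
        return (Mapper<?, ?, ?, ?>) mapperClass.getDeclaredConstructor().newInstance();
    }
}
```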
Job Analyzer - Feasibility
– Data similarity: log records have a uniform format
– Execution similarity: every record is processed by the same map & reduce functions repeatedly
[Figure: the input data is split, and each split flows through identical Map and Reduce functions.]
Design - 2
• Parameters Collection
[Diagram: the Job Analyzer collects the Type Two parameters, and the Static Parameters Collection Module collects the Type One & Type Three parameters; both feed the Cost Model, which outputs Map execution time, Reduce execution time, and CPU / disk / network occupation times.]
Prediction Model
• Problem Analysis
– Many steps run concurrently, so the total time cannot be obtained by simply adding up the time of each part (see the bounds sketched below)
[Diagram: the Map-task stage timeline again (initiation, read data, network transfer, create object, sort in memory, map function, merge sort, serialization, read/write disk, write disk), with stages overlapping across the CPU, disk, and network rows.]
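One way to state the difficulty (our formulation, not the slides'): with overlapping stages, the wall-clock Map time is bounded below by the busiest single resource and above by full serialization of the stages,

```latex
\max\!\left(T^{map}_{cpu},\, T^{map}_{disk},\, T^{map}_{net}\right)
\;\le\; T_{map} \;\le\;
T^{map}_{cpu} + T^{map}_{disk} + T^{map}_{net}
```

and neither bound is tight in general, which is why the following slides turn to a regression over the main factors instead of summing stage costs.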
Prediction Model
• Main Factors (according to the performance model)
– Map stage:
• The amount of input data (MapInput)
• The number of input records (N), plus the N·log(N) term from sorting
• The complexity of the map function
• The conversion rate of the map data (output/input)

Tmap = α0 + α1*MapInput + α2*N + α3*N*log(N)
     + α4*(complexity of the map function)
     + α5*(conversion rate of the map data)
Prediction Model
• Experimental Analysis
– Test 4 kinds of jobs (0–10,000 records)
– Extract the features for linear regression
– Calculate the coefficient of determination (R²)

Jobs        R²
Dedup       0.9982
WordCount   0.9992
Project     0.9991
Grep        0.9949
Total       0.6157
Prediction Model
[Plot: execution time of Map vs. number of records (0–9,000) for Dedup, Grep, Project, and WordCount.]
– Very good linear relationship within the same kind of job
– But no linear relationship across different kinds of jobs
Find the nearest jobs!
• Instance-Based Linear Regression
– Find the samples nearest to the job to be predicted in the history logs
– “Nearest” → similar jobs (Top-K nearest, with K = 10%–15%)
– Run a linear regression over the samples found
– Compute the predicted value
• Nearest:
– The weighted distance over the job features (weights w); see the sketch after this list
– High contribution to job classification: map/reduce complexity, map/reduce data conversion rate
– Low contribution to job classification: data amount, number of records
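A hedged Java sketch of the neighbor search (feature names and weights are illustrative, not the paper's actual values): compute the weighted distance, keep the Top-K history samples, then fit the Tmap regression above on just those samples.

```java
import java.util.Comparator;
import java.util.List;

public class FindNeighbor {
    // One history log entry: job features plus the observed map time.
    record JobSample(double complexity, double conversionRate,
                     double dataAmount, double numRecords, double mapTime) {}

    // Weighted distance: classification-relevant features (complexity,
    // conversion rate) get high weights, data-size features low weights.
    static double distance(JobSample a, JobSample b, double[] w) {
        double d = w[0] * Math.pow(a.complexity() - b.complexity(), 2)
                 + w[1] * Math.pow(a.conversionRate() - b.conversionRate(), 2)
                 + w[2] * Math.pow(a.dataAmount() - b.dataAmount(), 2)
                 + w[3] * Math.pow(a.numRecords() - b.numRecords(), 2);
        return Math.sqrt(d);
    }

    // Keep the K nearest samples (K ~ 10-15% of the history log);
    // the linear regression is then fitted on this subset only.
    static List<JobSample> nearest(List<JobSample> history, JobSample query,
                                   double[] w, int k) {
        return history.stream()
                .sorted(Comparator.comparingDouble(s -> distance(s, query, w)))
                .limit(k)
                .toList();
    }
}
```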
Prediction Module
• Procedure
[Diagram: the job features and the cost model's main factors drive a search for the nearest samples; a regression over those samples instantiates the prediction function (the Tmap equation above), which produces the prediction results. The steps are numbered 1–7 in the original figure.]
Prediction Module
• Procedure
[Diagram: the Cost Model and the Training Set feed the Find-Neighbor Module, which yields the Prediction Function and finally the Prediction Results.]
Design - 3
• Parameters Collection
[Diagram: the Job Analyzer collects the Type Two parameters, and the Static Parameters Collection Module collects the Type One & Type Three parameters; both feed the Cost Model, whose output drives the Prediction Module, which produces Map execution time, Reduce execution time, and CPU / disk / network occupation times.]
Experiments
• Task Execution Time (Error Rate)
– K = 12%, with a different weight w for each feature
– K = 12%, with the same weight w for each feature
– K = 25%, with a different weight w for each feature
– 4 kinds of jobs, input sizes from 64 MB to 8 GB
[Plots: error rate (%) per job ID (1–39) for Map tasks and Reduce tasks, comparing k = 12%, k = 25%, and k = 12% with w = 1.]
Conclusion
• Job Analyzer:
– Analyzes the job jar + input file
– Collects parameters
• Prediction Module:
– Finds the main factors
– Proposes a linear equation
– Classifies jobs
– Performs multiple predictions
Thank you!
Questions?
Cost Model [1]
• Analysis of the Reduce phase
– Model the consumption of each resource (CPU, disk, network)
– Each stage involves only one type of resource
[Diagram: Reduce-task stages laid out on three resource rows (CPU, disk, network): initiation, read data, network transfer, merge sort, read/write disk, serialization/deserialization, create object, reduce function, write disk.]
Prediction Model
• Main Factors (according to the performance model)
– Reduce stage:
• The amount of input data
• The number of input records (N), plus the N·log(N) term from sorting
• The complexity of the reduce function
• The conversion rate of the map data
• The conversion rate of the reduce data

Treduce = β0 + β1*MapInput + β2*N + β3*N*log(N)
        + β4*(complexity of the reduce function)
        + β5*(conversion rate of the map data)
        + β6*(conversion rate of the reduce data)