Maximizing Network and Storage Performance for
Big Data Analytics
Xiaodong Zhang
Ohio State University
Collaborators
Rubao Lee, Yin Huai, Tian Luo, Yuan Yuan (Ohio State University)
Yongqiang He and the Data Infrastructure Team, Facebook
Fusheng Wang, Emory University
Zhiwei Xu, Institute of Computing Technology, Chinese Academy of Sciences
Digital Data Explosion in Human Society
[Chart] The global storage capacity (analog vs. digital) and the amount of digital information created and replicated in a year: in 1986, analog media held 2.62 billion GB against 0.02 billion GB of digital storage; by 2007, digital storage had grown to 276.12 billion GB (PC hard disks alone: 123 billion GB, 44.5%) against 18.86 billion GB of analog.
Source: "Exabytes: Documenting the 'digital age' and huge growth in computing capacity," The Washington Post
Challenge of Big Data Management and Analytics (1)
 Existing DB technology is not prepared for the huge volume
• Until 2007, Facebook had a 15 TB data warehouse from a major DBMS vendor
• Now, ~70 TB of compressed data is added to the Facebook data warehouse every day (4x the total capacity of its 2007 warehouse)
• Commercial parallel DBs rarely run on 100+ nodes
• Yahoo!'s Hadoop cluster has 4,000+ nodes; Facebook's data warehouse has 2,750+ nodes; Google sorted 10 PB of data on a cluster of 8,000 nodes (2011)
• Typical science and medical research examples:
– The Large Hadron Collider at CERN generates over 15 PB of data per year
– The Pathology Analytical Imaging Standards database at Emory has reached 7 TB and is heading toward the PB scale
Challenge of Big Data Management and Analytics (2)
 Big data is about all kinds of data
• Online services (social networks, retailers …) focus on deep analytics of online and offline click-stream data
• Medical image analytics is crucial to both biomedical research and clinical diagnosis
 Complex analytics is needed to gain deep insights from big data
• Data mining
• Pattern recognition
• Data fusion and integration
• Time series analysis
• Goal: gain deep insights and new knowledge
Challenge of Big Data Management and Analytics (3)
 The conventional database business model is not affordable
• Expensive software licenses (e.g. Oracle DB, $47,500 per processor, 2010)
• High maintenance fees even for open source DBs
• Storing and managing data in such a system costs at least $10,000/TB*
• In contrast, Hadoop-like systems cost only about $1,500/TB**
 Increasingly, non-profit organizations work on big data
 Hospitals, bio-research institutions
 Social networks, on-line services …
 Low-cost software infrastructure is the key
*: http://www.dbms2.com/2010/10/15/pricing-of-data-warehouse-appliances/
**: http://www.slideshare.net/jseidman/data-analysis-with-hadoop-and-hive-chicagodb-2212011
Challenge of Big Data Management and Analytics (4)
 The conventional parallel processing model is "scale-up" based
• BSP model (CACM, 1990): optimizations in both hardware and software
• Hardware: low computation-to-communication cost ratio, fast locks, large caches and memory
• Software: overlapping computation with communication, exploiting locality, co-scheduling …
 The big data processing model is "scale-out" based
• DOT model (SOCC'11): hardware-independent software design
• Scalability: maintain sustained throughput growth by continuously adding low-cost computing and storage nodes to distributed systems
• Constraints on computing patterns: communication- and data-sharing-free
The MapReduce programming model has become an effective data processing engine for big data analytics.
Why MapReduce?
 A simple but effective programming model designed to process huge volumes of data concurrently
 Two unique properties
• Minimal dependency among tasks (almost nothing is shared)
• Simple task operations on each node (low-cost machines are sufficient)
 Two strong merits for big data analytics
• Scalability (Amdahl's Law): increase throughput by increasing the number of nodes
• Fault tolerance: quick and low-cost recovery from task failures
 Hadoop is the most widely used implementation of MapReduce
• Used by hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, eBay, Facebook, IBM, NY Times, Yahoo! …
An Example of a MapReduce Job on Hadoop
 Calculate the average salary of each of the 2 organizations in a huge file: {name: (org., salary)} → {org.: avg. salary}
Original key/value pairs: all the person names, each associated with an org name and a salary.
Result key/value pairs: two entries showing each org name and its average salary.
Input (Name → (org., salary)):
• Alice → (Org-1, 3000)
• Bob → (Org-2, 3500)
• …
Result (org. → avg. salary):
• Org-1 → …
• Org-2 → …
An Example of a MapReduce Job on Hadoop
 Calculate the average salary of every organization: {name: (org., salary)} → {org.: avg. salary}
[Figure] The input file is stored as HDFS blocks in the Hadoop Distributed File System (HDFS).
An example of a MapReduce job on Hadoop
 Calculate the average salary of every organization: {name: (org., salary)} → {org.: avg. salary}
[Figure] 3 Map tasks concurrently process the input data from HDFS. Each map task takes 4 HDFS blocks as its input and extracts {org.: salary} as new key/value pairs, e.g. {Alice: (org-1, 3000)} becomes {org-1: 3000}. The map output consists of records of "org-1" and records of "org-2".
An example of a MapReduce job on Hadoop
 Calculate the average salary of every organization: {name: (org., salary)} → {org.: avg. salary}
[Figure] The map outputs (records of "org-1" and records of "org-2") are shuffled to reduce tasks using org. as the Partition Key (PK).
An example of a MapReduce job on Hadoop
 Calculate the average salary of every organization: {name: (org., salary)} → {org.: avg. salary}
[Figure] Two Reduce (Avg.) tasks calculate the average salary for "org-1" and for "org-2", and write the results back to HDFS.
Key/Value Pairs in MapReduce
 A simple but effective programming model designed to process huge volumes of data concurrently on a cluster
 Map: (k1, v1) → (k2, v2), e.g. (name, org & salary) → (org, salary)
 Reduce: (k2, v2) → (k3, v3), e.g. (org, salary) → (org, avg. salary)
 Shuffle: Partition Key (it may or may not be the same as k2)
• The Partition Key determines how a key/value pair in the map output is transferred to a reduce task
• e.g. the org. name is used to partition the map output accordingly
(A code sketch of the average-salary job follows.)
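For readers who want to see the job above end to end, here is a minimal sketch of how it could be written against Hadoop's Java MapReduce API. This is my own illustration, not code from the talk: the class names and the assumed "name|org|salary" input layout are illustrative choices.

```java
// Minimal sketch of the average-salary job; illustration only.
package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvgSalaryJob {

    // Map: (name, (org, salary)) -> (org, salary)
    public static class SalaryMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text org = new Text();
        private final DoubleWritable salary = new DoubleWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] tokens = line.toString().split("\\|");  // assumed layout: name|org|salary
            org.set(tokens[1]);
            salary.set(Double.parseDouble(tokens[2]));
            context.write(org, salary);                      // shuffled by org (the partition key)
        }
    }

    // Reduce: (org, [salaries]) -> (org, avg. salary)
    public static class AvgReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text org, Iterable<DoubleWritable> salaries, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            long count = 0;
            for (DoubleWritable s : salaries) {
                sum += s.get();
                count++;
            }
            context.write(org, new DoubleWritable(sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "average salary per organization");
        job.setJarByClass(AvgSalaryJob.class);
        job.setMapperClass(SalaryMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the reducer computes an average, it cannot double as a combiner without also carrying partial counts; the map output key (the org name) is the partition key that drives the shuffle described above.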
MR(Hadoop) Job Execution Patterns
[Figure] An MR program (job) consists of Map tasks and Reduce tasks; the data is stored in a distributed file system (e.g. the Hadoop Distributed File System). The execution of an MR job involves 6 steps.
Step 1: Job submission to the master node
Step 2: The master node assigns tasks to worker nodes
The master node does control-level work, e.g. scheduling and task assignment; the worker nodes do the data processing work specified by the Map or Reduce function.
MR(Hadoop) Job Execution Patterns
[Figure, continued]
Step 3: Map phase: concurrent map tasks run on the worker nodes
Step 4: Shuffle phase: the map output is shuffled to different reduce tasks based on Partition Keys (PKs), which are usually the map output keys
MR(Hadoop) Job Execution Patterns
[Figure, continued]
Step 5: Reduce phase: concurrent reduce tasks run on the worker nodes
Step 6: The reduce output is stored back to the distributed file system
MR(Hadoop) Job Execution Patterns
[Figure, continued] A MapReduce (MR) job is resource-consuming:
1: Scanning the input data in the Map phase => local or remote I/Os
2: Storing intermediate results of the map output => local I/Os
3: Transferring data across nodes in the Shuffle phase => network costs
4: Storing the final results of the MR job => local I/Os + network costs (to replicate data)
Two Critical Challenges in Production Systems
 Background: conventional database workloads have been moved to the MapReduce environment, e.g. Hive (Facebook) and Pig (Yahoo!)
 Challenge 1: how to initially store the data in distributed systems
• subject to minimizing network and storage costs
 Challenge 2: how to automatically convert relational database queries into MapReduce jobs
• subject to minimizing network and storage costs
 By addressing these two challenges, we maximize
• the performance of big data analytics
• the productivity of big data analytics
Challenge 1: Four Requirements of Data Placement
 Data loading (L)
• the overhead of writing data to the distributed file system and local disks
 Query processing (P)
• the local storage bandwidth used by query processing
• the amount of network transfers
 Storage space utilization (S)
• data compression ratio
• the convenience of applying efficient compression algorithms
 Adaptivity to dynamic workload patterns (W)
• additional overhead imposed on certain queries
Objective: to design and implement a data placement structure meeting these requirements in MapReduce-based data warehouses
Initial Stores of Big Data in Distributed Environment
[Figure] The NameNode (a part of the master node) maps HDFS blocks to DataNodes: Block 1 → DataNode 1, Block 2 → DataNode 2, Block 3 → DataNode 3.
 HDFS (Hadoop Distributed File System) blocks are distributed across DataNodes
 Users have only a limited ability to specify customized data placement policies
• e.g. to specify which blocks should be co-located
 Goal: minimizing I/O costs on local disks and intra-network communication
MR programming is not that “simple”!
package tpch;

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Q18Job1 extends Configured implements Tool {

    public static class Map extends Mapper<Object, Text, IntWritable, Text> {
        private final static Text value = new Text();
        private IntWritable word = new IntWritable();
        private String inputFile;
        private boolean isLineitem = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
            if (inputFile.compareTo("lineitem.tbl") == 0) {
                isLineitem = true;
            }
            System.out.println("isLineitem:" + isLineitem + " inputFile:" + inputFile);
        }

        public void map(Object key, Text line, Context context) throws IOException, InterruptedException {
            String[] tokens = (line.toString()).split("\\|");
            if (isLineitem) {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[4] + "|l");
                context.write(word, value);
            } else {
                word.set(Integer.valueOf(tokens[0]));
                value.set(tokens[1] + "|" + tokens[4] + "|" + tokens[3] + "|o");
                context.write(word, value);
            }
        }
    }

    public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
        private Text result = new Text();

        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sumQuantity = 0.0;
            IntWritable newKey = new IntWritable();
            boolean isDiscard = true;
            String thisValue = new String();
            int thisKey = 0;
            for (Text val : values) {
                String[] tokens = val.toString().split("\\|");
                if (tokens[tokens.length - 1].compareTo("l") == 0) {
                    sumQuantity += Double.parseDouble(tokens[0]);
                } else if (tokens[tokens.length - 1].compareTo("o") == 0) {
                    thisKey = Integer.valueOf(tokens[0]);
                    thisValue = key.toString() + "|" + tokens[1] + "|" + tokens[2];
                } else
                    continue;
            }
            if (sumQuantity > 314) {
                isDiscard = false;
            }
            if (!isDiscard) {
                thisValue = thisValue + "|" + sumQuantity;
                newKey.set(thisKey);
                result.set(thisValue);
                context.write(newKey, result);
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: Q18Job1 <orders> <lineitem> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "TPC-H Q18 Job1");
        job.setJarByClass(Q18Job1.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Q18Job1(), args);
        System.exit(res);
    }
}

This complex code is for a simple MR job. Low productivity! We all want to simply write:
"SELECT * FROM Book WHERE price > 100.00"?
Challenge 2: High Quality MapReduce in Automation
[Figure] A job description in an SQL-like declarative language goes to an SQL-to-MapReduce translator, which writes the MR programs (jobs): the translator is an interface between users and MR programs (jobs). The generated MR programs (jobs) run on workers over the Hadoop Distributed File System (HDFS).
Challenge 2: High Quality MapReduce in Automation
[Figure] A job description in an SQL-like declarative language goes to an SQL-to-MapReduce translator (an interface between users and MR programs), which writes the MR programs (jobs) that run over the Hadoop Distributed File System (HDFS). Hive is such a translator inside a data warehousing system (Facebook); Pig is a high-level programming environment (Yahoo!).
These translators improve productivity over hand-coding MapReduce programs:
• 95%+ of Hadoop jobs in Facebook are generated by Hive
• 75%+ of Hadoop jobs in Yahoo! are invoked by Pig*
* http://hadooplondon.eventbrite.com/
Outline
 RCFile: a fast and space-efficient data placement structure
• Re-examination of existing structures
• A mathematical model as the analytical basis for RCFile
• Experiment results
 YSmart: a highly efficient query-to-MapReduce translator
• Correlation-awareness is the key
• Fundamental rules in the translation process
• Experiment results
 Impact of RCFile and YSmart in production systems
 Conclusion
Row-Store: Merits/Limits with MapReduce
[Figure] A table with columns A, B, C, D (rows 101/201/301/401 through 105/205/305/405, …) is stored row by row into HDFS blocks (Store Block 1, 2, 3, …).
Merits: data loading is fast (no additional processing); all columns of a data row are located in the same HDFS block.
Limits:
 Not all columns are used (unnecessary storage bandwidth)
 Compressing columns of different data types together may add additional overhead
Distributed Row-Store Data among Nodes
[Figure] The NameNode distributes the row-store HDFS blocks to DataNode 1, DataNode 2, and DataNode 3; each block holds complete rows (all of columns A, B, C, D).
Column-Store: Merits/Limits with MapReduce
[Figure] The same table (columns A, B, C, D; rows 101-105, 201-205, 301-305, 401-405, …) before being placed column-wise into HDFS blocks (Store Block 1, 2, 3, …).
Column-Store: Merits/Limits with MapReduce
[Figure] The table is stored as column groups: column group 1 holds column A, column group 2 holds column B, and column group 3 holds columns C and D; each group goes into its own HDFS blocks (Store Block 1, 2, 3).
Merit: unnecessary I/O costs can be avoided: only the needed columns are loaded, and compression is easy.
Limit: additional network transfers are required for column grouping (row construction).
Distributed Column-Store Data among Nodes
[Figure] The NameNode distributes the column-store HDFS blocks to the DataNodes: the block holding column A goes to DataNode 1, and the blocks holding columns B, C, and D go to DataNodes 2 and 3.
Optimization of Data Placement Structure
 Consider the four requirements comprehensively
 The optimization problem becomes:
• In an environment with a dynamic workload (W) and with a suitable data compression algorithm (S) to improve the utilization of data storage, find a data placement structure (DPS) that minimizes the processing time of a basic operation (OP) on a table (T) with n columns
 Two basic operations
• Write: the essential operation of data loading (L)
• Read: the essential operation of query processing (P)
Write Operations
[Figure] A write operation loads the table (columns A, B, C, D; rows 101 201 301 401, …) into HDFS blocks according to the chosen data placement structure; the HDFS blocks are then distributed to DataNodes 1, 2, and 3.
Read Operation in Row-store
[Figure] Read operation in a row-store, projecting columns A and C: each DataNode reads its local rows concurrently (all of A, B, C, D, e.g. 101/201/301/401, 102/202/302/402, 103/203/303/403) and discards the unneeded columns B and D, keeping only the A and C values (101/301, 102/302, 103/303).
Read Operations in Column-store
[Figure] Read operations in a column-store with column groups (group 1: A; group 2: B; group 3: C and D) spread over DataNodes 1-3: a projection of A & C (or of C & D) must read column data from different DataNodes and transfer it to a common place via the network for row construction.
An Expected Value based Statistical Model
 In probability theory, the expected value of a random variable is
• the weighted average of all possible values the random variable can take on
 Estimated big data access time consists of "read" time T(r) and "write" time T(w), each occurring with a probability p(r) and p(w), where p(r) + p(w) = 1. The estimated access time is therefore the weighted average:
• E[T] = p(r) * T(r) + p(w) * T(w)
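A quick numeric illustration of this weighted average (the numbers below are hypothetical, chosen only to show the arithmetic; they are not measurements from the talk):

```latex
% Hypothetical values, for illustration only:
%   p(r) = 0.9,  T(r) = 100 s,  p(w) = 0.1,  T(w) = 200 s
\[
  E[T] = p(r)\,T(r) + p(w)\,T(w)
       = 0.9 \times 100\,\mathrm{s} + 0.1 \times 200\,\mathrm{s}
       = 110\,\mathrm{s}.
\]
```

In a read-dominated warehouse the read term dominates this average, which is the observation the next slide builds on.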
Overview of the Optimization Model
The optimization objective, where E(write | DPS) and E(read | DPS) are the processing times of a write and a read operation, and ω_w, ω_r are the probabilities (weights) of write and read operations:
\min_{DPS} E(op \mid DPS) = \omega_w \, E(\mathrm{write} \mid DPS) + \omega_r \, E(\mathrm{read} \mid DPS), \qquad \omega_r + \omega_w = 1
The majority of operations are reads, so ω_r >> ω_w: at Facebook, ~70 TB of compressed data is added per day, while PB-level compressed data is scanned per day (90%+ of operations are reads). We therefore focus on minimizing the processing time of read operations.
Modeling Processing Time of a Read Operation
 The number of column combinations a query on a table with n columns may need is up to
\binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n - 1
Processing time of a read operation:
E(\mathrm{read} \mid DPS) = \sum_{i=1}^{n} \sum_{j=1}^{\binom{n}{i}} f(i,j,n) \left( \frac{S}{B_{local}} \cdot \frac{1}{\rho} \cdot \alpha(DPS) + \lambda(DPS,i,j,n) \cdot \frac{S}{B_{network}} \cdot \frac{i}{n} \right)
Expected Time of a Read Operation
E(\mathrm{read} \mid DPS) = \sum_{i=1}^{n} \sum_{j=1}^{\binom{n}{i}} f(i,j,n) \left( \frac{S}{B_{local}} \cdot \frac{1}{\rho} \cdot \alpha(DPS) + \lambda(DPS,i,j,n) \cdot \frac{S}{B_{network}} \cdot \frac{i}{n} \right)
Parameters of the model:
• n: the number of columns of table T
• i: i columns are needed by the query
• j: the jth column combination among all combinations with i columns
• f(i, j, n): the frequency of occurrence of the jth column combination among all combinations with i columns. We fix this probability as a constant value to represent an environment with a highly dynamic workload, so that \sum_{i=1}^{n} \sum_{j=1}^{\binom{n}{i}} f(i,j,n) = 1
The first term is the time used to read the needed columns from local disks in parallel:
• S: the size of the compressed table. We assume that, with efficient data compression algorithms and configurations, different DPSs achieve comparable compression ratios
• B_local: the bandwidth of local disks
• ρ: the degree of parallelism, i.e. the total number of concurrent nodes
• α(DPS): read efficiency, the fraction of columns read from local disks:
– α(DPS) = i/n if DPS = Column-store (only the necessary columns are read)
– α(DPS) = 1 if DPS = Row-store (all columns are read, including unnecessary ones)
The second term is the extra time spent on network transfers for row construction:
• λ(DPS, i, j, n): communication overhead, i.e. the additional network transfers for row construction:
– λ(DPS, i, j, n) = 0 if all the needed columns are in one node (no communication)
– λ(DPS, i, j, n) = β if at least two needed columns are in different nodes (data is transferred via the network)
• β: the percentage of data transferred via the network (DPS- and workload-dependent), 0% ≤ β ≤ 100%
• B_network: the bandwidth of the network
Finding Optimal Data Placement Structure
E(read | DPS), as modeled above, depends on the read efficiency α(DPS) and the communication overhead λ(DPS, i, j, n):
• Row-Store: read efficiency = 1; communication overhead = 0 (optimal)
• Column-store: read efficiency = i/n (optimal); communication overhead = β (0% ≤ β ≤ 100%)
• Ideal: read efficiency = i/n (optimal); communication overhead = 0 (optimal)
Can we find a data placement structure with both optimal read efficiency and optimal communication overhead? (A small numerical sketch of this comparison follows.)
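To make the comparison concrete, here is a small sketch (my own illustration, not from the talk) that evaluates the expected read-cost model above for row-store, column-store, and an ideal placement. It assumes a uniform f(i,j,n), a fixed β for the column-store case, and made-up values for S, the bandwidths, and ρ.

```java
// Illustrative evaluation of the read-cost model; all parameter values are assumptions.
public class ReadCostModel {

    // E(read | DPS) with uniform f(i,j,n) = 1 / (2^n - 1).
    // alphaIsFraction: true  -> alpha = i/n (column-wise placement)
    //                  false -> alpha = 1   (row-store reads every column)
    // lambda: 0 when all needed columns sit on one node, beta otherwise.
    static double expectedReadTime(int n, boolean alphaIsFraction, double lambda,
                                   double S, double bLocal, double bNetwork, double rho) {
        double total = 0.0;
        double combos = Math.pow(2, n) - 1;                  // number of column combinations
        for (int i = 1; i <= n; i++) {
            double alpha = alphaIsFraction ? (double) i / n : 1.0;
            double localTime = (S / bLocal) * (1.0 / rho) * alpha;
            double networkTime = lambda * (S / bNetwork) * ((double) i / n);
            // There are C(n, i) combinations that need exactly i columns.
            total += binomial(n, i) * (1.0 / combos) * (localTime + networkTime);
        }
        return total;
    }

    static double binomial(int n, int k) {
        double r = 1.0;
        for (int i = 1; i <= k; i++) r = r * (n - i + 1) / i;
        return r;
    }

    public static void main(String[] args) {
        int n = 8;                 // columns in the table (assumed)
        double S = 120e9;          // compressed table size in bytes (assumed)
        double bLocal = 100e6;     // local disk bandwidth, bytes/s (assumed)
        double bNetwork = 50e6;    // network bandwidth, bytes/s (assumed)
        double rho = 40;           // degree of parallelism (assumed)
        double beta = 0.5;         // fraction of data crossing the network (assumed)

        System.out.printf("Row-store:    %.1f s%n",
                expectedReadTime(n, false, 0.0, S, bLocal, bNetwork, rho));
        System.out.printf("Column-store: %.1f s%n",
                expectedReadTime(n, true, beta, S, bLocal, bNetwork, rho));
        System.out.printf("Ideal:        %.1f s%n",
                expectedReadTime(n, true, 0.0, S, bLocal, bNetwork, rho));
    }
}
```

With parameters like these, the ideal placement (column-wise reads, no row-construction traffic) is cheaper than both alternatives, which is exactly the gap RCFile is designed to close.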
Goals of RCFile
 Eliminate unnecessary I/O costs, like Column-store
• Only read the needed columns from disks
 Eliminate network costs in row construction, like Row-store
 Keep the fast data loading speed of Row-store
 Allow efficient data compression algorithms to be applied conveniently, like Column-store
 In short: eliminate the limits of both Row-store and Column-store
RCFile: Partitioning a Table into Row Groups
[Figure] An HDFS block consists of one or multiple row groups. The table (columns A, B, C, D) is partitioned horizontally into row groups (e.g. rows 101-102, rows 103-105, …), and the row groups are stored into HDFS blocks (Store Block 1, 2, 3, 4, …).
RCFile: Distributed Row-Group Data among Nodes
For example, if each HDFS block holds three row groups:
[Figure] The NameNode places Row Groups 1-3 (Store Block 1) on DataNode 1, Row Groups 4-6 (Store Block 2) on DataNode 2, and Row Groups 7-9 (Store Block 3) on DataNode 3.
Inside a Row Group
[Figure] Inside a row group (here, one stored in Store Block 1): a metadata section comes first, followed by the row group's values for columns A (101-105), B (201-205), C (301-305), and D (401-405).
RCFile: Inside each Row Group
[Figure] Inside each row group the data is reorganized and compressed column by column: compressed metadata, then compressed column A (101 102 103 104 105), compressed column B (201 202 203 204 205), compressed column C (301 302 303 304 305), and compressed column D (401 402 403 404 405).
Benefits of RCFile
 Eliminates unnecessary I/O costs
• Within a row group, the table is partitioned by columns
• Only the needed columns are read from disks
 Eliminates network costs in row construction
• All columns of a row are located in the same HDFS block
 Comparable data loading speed to Row-store
• Only adds a vertical-partitioning operation to the data loading procedure of Row-store
 Efficient data compression algorithms can be applied conveniently
• The compression schemes used in Column-store can be reused
(A toy sketch of the row-group layout idea follows.)
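The following toy sketch (my own, not RCFile's actual on-disk format or API) illustrates the layout idea behind these benefits: rows are partitioned into row groups, and within each group the values are regrouped and compressed one column at a time, so a reader can skip whole columns without touching other nodes.

```java
// Toy illustration of the RCFile layout idea (not the real RCFile format/API).
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class RowGroupLayoutSketch {

    static byte[] gzip(String columnValues) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(columnValues.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // rows: each String[] is one row; rowGroupSize: rows per group
    // (in RCFile the row group is bounded by the HDFS block size).
    static List<List<byte[]>> layout(List<String[]> rows, int rowGroupSize) throws IOException {
        List<List<byte[]>> rowGroups = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += rowGroupSize) {
            int end = Math.min(start + rowGroupSize, rows.size());
            int numColumns = rows.get(start).length;
            List<byte[]> compressedColumns = new ArrayList<>();
            for (int col = 0; col < numColumns; col++) {
                StringBuilder column = new StringBuilder();     // gather one column of the group
                for (int r = start; r < end; r++) {
                    column.append(rows.get(r)[col]).append('\n');
                }
                compressedColumns.add(gzip(column.toString())); // compress column by column
            }
            rowGroups.add(compressedColumns);                   // one row group, stored column-wise
        }
        return rowGroups;
    }
}
```

Because every column of a row stays inside the same row group (and hence the same HDFS block), reading a projection needs no cross-node row construction, while skipping a column is just skipping one compressed segment.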
Expected Time of a Read Operation
Revisiting E(read | DPS) as modeled above, RCFile achieves both optima:
• Row-Store: read efficiency = 1; communication overhead = 0 (optimal)
• Column-store: read efficiency = i/n (optimal); communication overhead = β (0% ≤ β ≤ 100%)
• RCFile: read efficiency = i/n (optimal); communication overhead = 0 (optimal)
Row-Group Size
 A performance-sensitive and workload-dependent parameter
 In RCFile, the row group size is bounded by the size of an HDFS block
• HDFS block size: 64 MB or 128 MB
 What about row groups larger than an HDFS block? (VLDB'11)
– One column of a row group per HDFS block
– Co-locate the columns of a row group
– Can provide flexibility for changing schemas
• A concern on fault tolerance:
– If a node fails, the recovery time is longer than with row-group sizes under the current RCFile restriction
– Needs additional maintenance
[Figure] Illustration: HDFS blocks 1 and 2, each containing several row groups (RG1, RG2, …, RG5).
Discussion on Choosing the Right Row-Group Size
 RCFile restriction: row group size ≤ HDFS block size
 Larger average row sizes and more columns => use a larger row-group size
• Avoids prefetching unnecessary columns caused by short columns
– An unsuitable row-group size may cause ~10x more data to be read from disks, of which ~90% is unnecessary
• Relatively short columns may also compress inefficiently
 Higher query selectivity => use a smaller row-group size
• High-selectivity queries (a high % of rows to read) with a large row-group size would not take advantage of lazy decompression, but low-selectivity queries do
– Lazy decompression: a column is only decompressed when it is really needed for query execution
Evaluation Environment
 Cluster:
• 40 nodes (a cluster in Facebook)
 Node:
• Intel Xeon CPU with 8 cores
• 32 GB main memory
• 12 x 1 TB disks
• Linux, kernel version 2.6
• Hadoop 0.20.1
 The Zebra library* is used for Column-store and Column-group
• Column-store: an HDFS block contains the data of only a single column
• Column-group: some columns are grouped together and stored in the same HDFS block
 The benchmark is from [SIGMOD09]**
* http://wiki.apache.org/pig/zebra
** Pavlo et al., "A comparison of approaches to large-scale data analysis," in SIGMOD Conference, 2009, pp. 165-178.
Data Loading Evaluation
[Chart] Data loading time (s) for loading table USERVISITS (~120 GB) under Row-store, Column-store, Column-group, and RCFile.
 Row-store has the fastest data loading time
 RCFile's data loading time is comparable to Row-store (7% slower)
 Zebra (column-store and column-group): each record is written to multiple HDFS blocks => higher network overhead
Storage Utilization Evaluation
[Chart] Compressed size (GB) of table USERVISITS (~120 GB raw) under Row-store, Column-store, Column-group, and RCFile.
 The Gzip algorithm is used for each structure
 RCFile has the best data compression ratio (2.24)
 Zebra compresses metadata and data together, which is less efficient
A Case of Query Execution Time Evaluation
Query: SELECT pagerank, pageurl FROM RANKING WHERE pagerank < 400
[Chart] Query execution time (s) for Row-store, Column-store, Column-group, and RCFile. For this query the two referenced columns are stored in different HDFS blocks under Column-store, but in the same column group (i.e. the same HDFS block) under Column-group.
 Column-group has the fastest query execution time
 RCFile's execution time is comparable to Column-group (3% slower)
Facebook Data Analytics Workloads Managed By RCFile
 Reporting
• e.g. daily/weekly aggregations of impression/click counts
 Ad hoc analysis
• e.g. geographical distributions and activities of users around the world
 Machine learning
• e.g. online advertising optimization and effectiveness studies
 Many other data analysis tasks on user behavior and patterns
 User workloads and the related analysis cannot be published
 An RCFile evaluation with publicly available workloads shows excellent performance (ICDE'11)
RCFile in Facebook
[Figure] Web servers (the interface to 800+ million users) generate a large amount of log data; data loaders add ~70 TB of compressed data per day into the warehouse, which stores RCFile data. Warehouse capacity: 21 PB in May 2010, 30+ PB today.
Picture source: Visualizing Friendships, http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
Summary of RCFile
 A data placement structure lays the foundation for MapReduce-based big data analytics
 Our optimization model shows that RCFile meets all the basic requirements
 RCFile is an operational system for daily big data analytics tasks
• A part of Hive, a data warehouse infrastructure on top of Hadoop
• The default option for the Facebook data warehouse
• Integrated into Apache Pig since version 0.7.0 (for expressing data analytics tasks and producing MapReduce programs)
• Customized RCFile systems exist for special applications
 We are refining RCFile and the optimization model, aiming to make RCFile a standard data placement structure for big data analytics
Outline
 RCFile: a fast and space-efficient data placement structure
• Re-examination of existing structures
• A mathematical model as the analytical basis for RCFile
• Experiment results
 YSmart: a highly efficient query-to-MapReduce translator
• Correlation-awareness is the key
• Fundamental rules in the translation process
• Experiment results
 Impact of RCFile and YSmart in production systems
 Conclusion
Translating SQL-like Queries to MapReduce
Jobs: Existing Approach
 “Sentence by sentence” translation
• [C. Olston et al., SIGMOD 2008], [A. Gates et al., VLDB 2009] and [A. Thusoo et al., ICDE 2010]
• Implementations: Hive and Pig
 Three steps
• Identify the major sentences with operations that shuffle the data, such as Join, Group by and Order by
• For every data-shuffling operation in a major sentence, generate a corresponding MR job, e.g. a join op. => a join MR job
• Add the other operations, such as selection and projection, into the corresponding MR jobs
Existing SQL-to-MapReduce translators give unacceptable performance.
An Example: TPC-H Q21
 One of the most complex and time-consuming queries in the TPC-H benchmark for data warehousing performance
[Chart] Execution time (min) of optimized MR jobs vs. Hive on a Facebook production cluster: Hive is 3.7x slower than the optimized MR jobs.
What's wrong?
The Execution Plan of TPC-H Q21
[Figure] The execution plan of TPC-H Q21: a SORT on top of AGG3 and Join4 (a left-outer join), over a tree of joins (Join1, Join2, Join3) and aggregations (AGG1, AGG2) on the lineitem, orders, supplier, and nation tables.
The only difference: Hive handles the sub-tree under the left-outer join in a different way from the optimized MR jobs, and that sub-tree dominates the execution time (~90%).
[Figure] The sub-tree is executed as a set of MR jobs: J1 is a JOIN MR job over lineitem and orders, J2 and J4 are AGG MR jobs over lineitem, and J3 and J5 are composite MR jobs; every job uses the key l_orderkey.
However, inter-job correlations exist. Let's look at the partition key: J1 to J5 all use the same partition key 'l_orderkey', and J1, J2 and J4 all need the input table 'lineitem'.
What's wrong with existing SQL-to-MR translators? They are correlation-unaware:
1. They ignore common data input
2. They ignore common data transitions
Approaches of Big Data Analytics in MR: The Landscape
[Figure] The approaches plotted as performance vs. productivity:
• Hand-coding MR jobs. Pro: high-performance MR programs. Cons: 1) lots of coding even for a simple job; 2) redundant coding is inevitable; 3) hard to debug [J. Tan et al., ICDCS 2010]
• Existing SQL-to-MR translators. Pro: easy programming, high productivity. Con: poor performance on complex queries (and complex queries are usual in daily operations)
• A correlation-aware SQL-to-MR translator aims for both high productivity and high performance
Our Approaches and Critical Challenges
[Figure] SQL-like queries enter the correlation-aware SQL-to-MR translator, which generates primitive MR jobs, identifies correlations, and merges correlated MR jobs into the MR jobs with the best performance.
Critical challenges:
1: Correlation possibilities and their detection
2: Rules for automatically exploiting correlations
3: Implementing high-performance and low-overhead MR jobs
Input Correlation (IC)
 Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint
[Figure] J1 (over lineitem and orders) and J2 (over lineitem) have a shared input relation set: the map functions of MR job 1 and MR job 2 both read lineitem.
Transit Correlation (TC)
 Multiple MR jobs have transit correlation (TC) if
• they have input correlation (IC), and
• they have the same Partition Key
[Figure] J1 (over lineitem and orders) and J2 (over lineitem) both use the partition key l_orderkey: the two MR jobs first have IC (a shared input relation set), and their map outputs carry the same partition key plus other data.
Job Flow Correlation (JFC)
 An MR job has job flow correlation (JFC) with one of its child MR jobs if it has the same partition key as that child job
[Figure] J2 reads lineitem and orders; its output (partition key plus other data) is consumed by the map function of J1. Because the two jobs share the same partition key, J1 has JFC with its child job J2.
Query Optimization Rules for
Automatically Exploiting Correlations
 Exploiting both input correlation and transit correlation
 Exploiting the job flow correlation associated with aggregation jobs
 Exploiting the job flow correlation associated with JOIN jobs and their transit-correlated parent jobs
 Exploiting the job flow correlation associated with JOIN jobs
(A sketch of how a merged job exploits these correlations follows.)
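To illustrate the idea behind these rules, here is a sketch of my own (not YSmart's generated code): two correlated "jobs" that share the input table and the partition key are merged into one MR job. The map function scans the shared input once and emits values tagged with the originating sub-query, and a single reduce pass on the common key computes both an aggregation and a join-side result. The field positions and the "A:"/"J:" tags are illustrative assumptions.

```java
// Sketch: merging two correlated sub-queries (an aggregation and a join input)
// that share the input table and the partition key into one MR job.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MergedCorrelatedJobSketch {

    // One scan of the shared input (input correlation): every record is emitted
    // once per sub-query it feeds, keyed by the common partition key (transit correlation).
    public static class TaggedMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\|");
            Text key = new Text(fields[0]);                  // shared partition key, e.g. l_orderkey
            context.write(key, new Text("A:" + fields[4]));  // feeds the aggregation sub-query
            context.write(key, new Text("J:" + fields[1]));  // feeds the join sub-query
        }
    }

    // One reduce pass answers both sub-queries for the same key (job flow correlation):
    // no second job and no second shuffle of the same data.
    public static class MergedReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            StringBuilder joinSide = new StringBuilder();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A:")) {
                    sum += Double.parseDouble(s.substring(2));   // aggregation branch
                } else {
                    joinSide.append(s.substring(2)).append(','); // join branch
                }
            }
            context.write(key, new Text("agg=" + sum + " join=" + joinSide));
        }
    }
}
```

In practice a translator would emit each record once with a compound value rather than twice; the sketch keeps the two branches explicit to show how one shuffle serves both correlated jobs.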
Exp1: Four Cases of TPC-H Q21
1: Sentence-to-sentence translation: 5 MR jobs
2: Input correlation + transit correlation: 3 MR jobs
3: Input correlation + transit correlation + job flow correlation: 1 MR job
4: Hand-coding (similar to case 3): in the reduce function, the code is additionally optimized according to the query semantics
[Figure] The corresponding plans over lineitem and orders (Join1, Join2, AGG1, AGG2, left-outer join) collapse step by step from five MR jobs into a single one.
Breakdowns of Execution Time (sec)
[Chart] Breakdown of execution time (sec) by job and phase (Job1-Job5, Map and Reduce) for four cases: No Correlation; Input Correlation + Transit Correlation; Input Correlation + Transit Correlation + JobFlow Correlation; and Hand-Coding.
Chart annotations: from a total of 888 sec down to 510 sec; from a total of 768 sec down to 567 sec; only a 17% difference.
Exp2: Clickstream Analysis
A typical query in production clickstream analysis: “what is the average number of pages a user visits between a page in category ‘X’ and a page in category ‘Y’?”
In YSmart, JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job.
[Chart] Execution time (min): YSmart outperforms Hive by 4.8x and Pig by 8.4x on this query.
YSmart in the Hadoop Ecosystem
[Figure] YSmart is integrated with Hive (Hive + YSmart) on top of the Hadoop Distributed File System (HDFS). See patch HIVE-2206 at apache.org.
Summary of YSmart
 YSmart is a correlation-aware SQL-to-MapReduce translator
 YSmart can outperform Hive by 4.8x and Pig by 8.4x
 YSmart is being integrated into Hive
 The standalone version of YSmart was released in January 2012
[Figure] In production, web servers for 800M Facebook users feed a Hadoop-powered data warehousing system (Hive) that stores RCFile data; YSmart translates SQL-like queries into MapReduce jobs.
Conclusion
 We have contributed two important system components, RCFile and YSmart, in the critical path of the big data analytics ecosystem
 The ecosystem of Hadoop-based big data analytics has been created around Hive and Pig, which will soon merge into a unified system
 RCFile and YSmart provide two basic standards in this new ecosystem
Thank You!