part 2

ReStore Implementation as an Extension to Pig
Vijay S
Outline
Overview of the Pig Query Compiler
Implementation of ReStore
Experiments
www.nordridesign.com
Overview of the Pig Query Compiler
(1) The parser checks the syntax of the input query and transforms it into a logical plan, which is a directed acyclic graph (DAG) of logical operators.
(2) The logical optimizer applies optimization rules to this logical plan.
(3) The MapReduce compiler transforms the logical plan into a physical plan and then compiles it into a series of MapReduce jobs, which together form a workflow.

Overview of the Pig Query Compiler (continued)
(4) The MapReduce optimizer applies rules to reduce the number of MapReduce jobs in the workflow.
(5) The Hadoop job manager submits the jobs in the workflow to Hadoop for execution, taking into account the dependencies between them.
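Step (5) amounts to running the jobs of a workflow in an order that respects their dependencies. A minimal Python sketch of that idea follows, assuming a hypothetical workflow of three jobs. The job names and the `submit_order` helper are illustrative only and are not part of Pig or Hadoop.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def submit_order(dependencies):
    """Return one valid submission order for a workflow.

    `dependencies` maps each job name to the set of jobs whose output it
    reads. This mapping is a hypothetical structure used for illustration.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: job2 reads job1's output, job3 reads both.
workflow = {
    "job1": set(),
    "job2": {"job1"},
    "job3": {"job1", "job2"},
}

order = submit_order(workflow)
print(order)
```

A real job manager could also submit independent jobs concurrently; the sketch only shows the ordering constraint.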

Overview of the Pig Query Compiler (continued)
JobControlCompiler
• A component of Pig's Hadoop job manager.
• Its input is a workflow of MapReduce jobs.
• After all the MapReduce jobs in the workflow have finished executing, the intermediate outputs they produced are deleted.

Implementation of ReStore
• The input of ReStore is a workflow of MapReduce jobs.
• The physical plan of each of these jobs passes through two stages: (1) matching with plans in the repository, and (2) generating candidate sub-jobs.
• The repository is implemented as a table in which every record contains: (1) a physical plan of a MapReduce job, (2) the filename of the output of this job in HDFS, and (3)
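The matching stage can be pictured with a small Python sketch. It rests on strong simplifying assumptions that are not from the slides: a physical plan is reduced to a linear chain of operator strings (real plans are DAGs), a stored plan matches when it is a prefix of the incoming job's chain, and each repository record holds only the plan and the output filename. The names `RepositoryEntry` and `rewrite_with_repository`, the operators, and the paths are all hypothetical. The second stage, generating candidate sub-jobs, is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryEntry:
    plan: tuple        # physical plan, simplified here to a chain of operator strings
    output_file: str   # HDFS filename holding the stored output of that plan


def rewrite_with_repository(job_plan, repository):
    """Replace the longest stored plan that is a prefix of job_plan with a load."""
    best = None
    for entry in repository:
        n = len(entry.plan)
        if job_plan[:n] == entry.plan and (best is None or n > len(best.plan)):
            best = entry
    if best is None:
        return job_plan  # no match: the job runs unchanged
    # Read the stored output instead of recomputing the matched operators.
    return (f"load {best.output_file}",) + job_plan[len(best.plan):]


# Hypothetical repository contents and incoming job.
repository = [
    RepositoryEntry(("load page_views", "project user"), "/restore/out_1"),
    RepositoryEntry(("load page_views", "project user", "group user"), "/restore/out_2"),
]
job = ("load page_views", "project user", "group user", "count")

rewritten = rewrite_with_repository(job, repository)
print(rewritten)
```

Preferring the longest match is a design choice of this sketch: it skips as much recomputation as possible when several stored plans apply.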

Experiments
• Reusing the Output of Whole Jobs (7.1)
• Reusing the Output of Sub-Jobs (7.2)
• Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)
• Reusing Sub-Jobs vs. Whole Jobs (7.4)
• Effect of Data Reduction (7.5)

Reusing the Output of Whole Jobs (7.1)

Reusing the output of whole jobs reduces query execution time substantially compared to running with no data reuse.
Example:
• L2-L8 and L11 (Join, Group, CoGroup, Filter, Distinct, and Union)
• PigMix queries L3 and L11

Reusing the Output of Sub-Jobs (7.2)

Reusing the stored output of sub-jobs further reduces query execution time, compared both with no data reuse and with the run that generates the sub-jobs.
Example:
• L2-L8 and L11 (Join, Group, CoGroup, Filter, Distinct, and Union)
• PigMix queries L3 and L11


Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)
Total size of input data loaded by different queries

Q     I/P (GB)   HC (GB)   NH (GB)   HA (GB)   O/P
L2    150.6      3.1       3.1       6.7       1.1 MB
L3    150.7      3.2       8.2       22.1      62.9 MB
L4    150.6      2.8       10.8                34.2 MB
L5    150.7      1.8       4.6       7.4       2 B
L6    150.6      3.7       10.1      24.3      92.7 MB
L7    150.6      2.2       5.4       5.4       1.5 MB
L8    150.6      3.3       3.3       11.4      27 B
L11   173.6      2.6       2.7       2.8       1.6 GB
Reusing Sub-Jobs vs. Whole Jobs (7.4)

Field name   Cardinality   % Selected Data
field6       200           0.5%
field7       100           1%
field8       20            5%
field9       10            10%
field10      5             20%
field11      2             50%
field12      1.6           60%
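The percentages in the table above are close to what an equality filter would select if each field's values were uniformly distributed. That uniformity is an assumption made here for illustration and is not stated in the slides. A short Python check of the relationship:

```python
# Cardinality and reported selectivity (%) per field, copied from the table above.
table = {
    "field6": (200, 0.5),
    "field7": (100, 1),
    "field8": (20, 5),
    "field9": (10, 10),
    "field10": (5, 20),
    "field11": (2, 50),
    "field12": (1.6, 60),
}


def expected_selectivity_percent(cardinality):
    """Share of rows matching `field == value`, assuming uniformly distributed values."""
    return 100.0 / cardinality


for name, (cardinality, reported) in table.items():
    estimate = expected_selectivity_percent(cardinality)
    print(f"{name}: estimated {estimate:.1f}%, reported {reported}%")
```

Under this assumption the estimate matches the reported value for field6 through field11; for field12 it gives 62.5%, slightly above the reported 60%.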
Reusing Sub-Jobs vs. Whole Jobs (7.4)
Figure: overhead and speedup of different jobs (the dark line is the speedup)
Effect of Data Reduction (7.5)
Figure: overhead and speedup of different jobs with filter operators
Effect of Data Reduction (7.5) (continued)
Query Template QP
    A = load '$synth_data' as (field1, ..., field12);
    B = foreach A generate field1, ...;
    C = group B by (field1, ...);
    D = foreach C generate COUNT($1);
    store D into '$out';

Effect of Data Reduction (7.5) (continued)
Query Template QF
    A = load '$synth_data' as (field1, ..., field12);
    B = filter A by $fieldi == $val;
    C = group B by field1;
    D = foreach C generate COUNT($1);
    store D into '$out';
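To show how one concrete query could be produced from template QF, here is a minimal Python sketch that fills in the template parameters with `string.Template`. This is a stand-in for illustration, not how the experiments were necessarily driven. The `instantiate_qf` helper, the paths, and the filter value are hypothetical, the filter is written with Pig's `==` equality operator, and the elided field list (`...`) is left exactly as in the template.

```python
from string import Template

# Query template QF. `$$1` escapes the literal `$1` positional reference,
# so that only the named parameters are substituted.
QF = Template(
    "A = load '$synth_data' as (field1, ..., field12);\n"
    "B = filter A by $fieldi == $val;\n"
    "C = group B by field1;\n"
    "D = foreach C generate COUNT($$1);\n"
    "store D into '$out';\n"
)


def instantiate_qf(field, value, synth_data, out):
    """Fill in the four template parameters of QF (hypothetical helper)."""
    return QF.substitute(fieldi=field, val=value, synth_data=synth_data, out=out)


# Hypothetical parameter values.
script = instantiate_qf("field6", 1, "/data/synth", "/out/qf_field6")
print(script)
```

Choosing a different field in the filter changes how much data survives the filter, which is the quantity this experiment varies.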

Related Work
• The paper addresses challenges posed by MapReduce, such as massive data sizes and the procedural nature of the query language.
• Other work: materialized views and MRShare.
