ReStore Implementation as an Extension to Pig
Vijay S

Outline
  - Overview of the Pig Query Compiler
  - Implementation of ReStore
  - Experiments

Overview of the Pig Query Compiler
  (1) A parser syntactically checks the input query and transforms it into a logical plan, which is a directed acyclic graph (DAG) of logical operators.
  (2) A logical optimizer applies optimization rules to this logical plan.
  (3) A MapReduce compiler transforms the logical plan into a physical plan and then compiles it into a series of MapReduce jobs, which form a workflow.

www.nordridesign.com

Overview of the Pig Query Compiler (continued)
  (4) A MapReduce optimizer applies rules to reduce the number of MapReduce jobs in the workflow.
  (5) The Hadoop job manager submits the jobs in a workflow to Hadoop for execution, taking into account the dependencies between them.

Overview of the Pig Query Compiler (continued)
  - JobControlCompiler is a component of the Hadoop job manager of Pig.
  - Its input is a workflow of MapReduce jobs.
  - After all the MapReduce jobs in the workflow have finished executing, the intermediate outputs of these jobs are deleted.

Implementation of ReStore
  - The input of ReStore is a workflow of MapReduce jobs.
  - Every physical plan of these jobs passes through two stages: (1) matching with plans in the repository, and (2) generating candidate sub-jobs.
  - The repository is implemented as a table that contains in every record: (1) a physical plan of a MapReduce job, (2) the filename of the output of this job in HDFS, and (3)

Experiments
  - Reusing the Output of Whole Jobs (7.1)
  - Reusing the Output of Sub-Jobs (7.2)
  - Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)
  - Reusing Sub-Jobs vs. Whole Jobs (7.4)
  - Effect of Data Reduction (7.5)

Reusing the Output of Whole Jobs (7.1)
  - Job execution time for queries is much lower when the outputs of whole jobs are reused than with no data reuse (L3, L11 - PigMix).
  - Example: L2-L8 and L11 (Join, Group, CoGroup, Filter, Distinct and Union)

Reusing the Output of Sub-Jobs (7.2)
  - Job execution time for queries is further reduced by reusing the outputs of sub-jobs, compared both to no data reuse and to the run that generates the sub-jobs (L3, L11 - PigMix).
  - Example: L2-L8 and L11 (Join, Group, CoGroup, Filter, Distinct and Union)

Comparing the Heuristics for Generating Candidate Sub-Jobs (7.3)
Total size of the input data loaded by the different queries:

  Q     I/P (GB)   HC (GB)   NH (GB)   HA (GB)   O/P
  L2    150.6      3.1       3.1       6.7       1.1 MB
  L3    150.7      3.2       8.2       22.1      62.9 MB
  L4    150.6      2.8       10.8      34.2 MB
  L5    150.7      1.8       4.6       7.4       2 B
  L6    150.6      3.7       10.1      24.3      92.7 MB
  L7    150.6      2.2       5.4       5.4       1.5 MB
  L8    150.6      3.3       3.3       11.4      27 B
  L11   173.6      2.6       2.7       2.8       1.6 GB

Reusing Sub-Jobs vs. Whole Jobs (7.4)

  Field name   Cardinality   % Selected Data
  field6       200           0.5%
  field7       100           1%
  field8       20            5%
  field9       10            10%
  field10      5             20%
  field11      2             50%
  field12      1.6           60%

Reusing Sub-Jobs vs. Whole Jobs (7.4)
Overhead and speedup of different jobs. The dark line is the speedup.

Effect of Data Reduction (7.5)
Overhead and speedup of different jobs with filter operators.

Effect of Data Reduction (7.5) (continued)
Query template QP:

    A = load '$synth_data' as (field1, ..., field12);
    B = foreach A generate field1, ...;
    C = group B by (field1, ...);
    D = foreach C generate COUNT($1);
    store D into '$out';

Effect of Data Reduction (7.5) (continued)
Query template QF:

    A = load '$synth_data' as (field1, ..., field12);
    B = filter A by $fieldi == $val;
    C = group B by field1;
    D = foreach C generate COUNT($1);
    store D into '$out';

Related Work
  - The paper addresses challenges posed by MapReduce, such as massive data sizes and the procedural nature of the query language.
  - Other work: materialized views and MRShare.
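Sketch: repository and plan matching

The repository and the matching stage described under Implementation of ReStore can be sketched in a few lines of Python. This is an illustration only, not ReStore's actual code, and it rests on several assumptions: a physical plan is modelled as a linear tuple of operator strings (real Pig physical plans are DAGs), matching means finding the stored plan that is the longest prefix of the incoming plan, and the names Repository, RepositoryEntry, match and rewrite, as well as the HDFS paths, are hypothetical. Only the two record fields that the slide states in full are modelled, and candidate sub-job generation is not covered.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: a physical plan is a linear tuple of operator signatures.
# Real Pig physical plans are DAGs; this keeps the sketch short.
Plan = Tuple[str, ...]


@dataclass(frozen=True)
class RepositoryEntry:
    plan: Plan         # (1) physical plan of a stored MapReduce job
    output_path: str   # (2) filename of that job's output in HDFS


class Repository:
    """Hypothetical in-memory stand-in for the repository table."""

    def __init__(self) -> None:
        self._entries: List[RepositoryEntry] = []

    def add(self, plan: Plan, output_path: str) -> None:
        self._entries.append(RepositoryEntry(plan, output_path))

    def match(self, plan: Plan) -> Optional[RepositoryEntry]:
        """Return the entry whose plan is the longest prefix of `plan`."""
        best: Optional[RepositoryEntry] = None
        for entry in self._entries:
            n = len(entry.plan)
            if n <= len(plan) and plan[:n] == entry.plan:
                if best is None or n > len(best.plan):
                    best = entry
        return best


def rewrite(plan: Plan, repo: Repository) -> Plan:
    """Replace the matched prefix of `plan` with a load of the stored output."""
    entry = repo.match(plan)
    if entry is None:
        return plan  # nothing reusable: run the plan unchanged
    return (f"load:{entry.output_path}",) + plan[len(entry.plan):]


# Usage: a stored filter job is reused by a later query shaped like QF.
repo = Repository()
repo.add(("load:synth_data", "filter:field8=1"), "/restore/out_1")
query = ("load:synth_data", "filter:field8=1", "group:field1", "count")
print(rewrite(query, repo))
# prints ('load:/restore/out_1', 'group:field1', 'count')
```

Choosing the longest matching prefix is a simplification made for this sketch; it means the rewritten plan skips as many operators as the stored outputs allow.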