4.1-Pig

Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement : Modified the slides from University of Waterloo Motivation  You‟re a procedural programmer  You have huge data  You want to analyze it 2 Motivation  As  a procedural programmer… May find writing queries in SQL unnatural and too restrictive More comfortable with writing code; a series of statements as opposed to a long query. (Ex: MapReduce is so successful).  3 Motivation  Data analysis goals  Quick   Exploit parallel processing power of a distributed system Easy Be able to write a program or query without a huge learning curve  Have some common analysis tasks predefined   Flexible  Transform a data set(s) into a workable structure without much overhead  Perform customized processing  Transparent  5 Have a say in how the data processing is executed on the system Motivation  Relational Distributed Databases Parallel database products expensive  Rigid schemas  Processing requires declarative SQL query construction   Map-Reduce Relies on custom code for even common operations  Need to do workarounds for tasks that have different data flows other than the expected MapCombineReduce  6 Motivation  Relational Distributed Databases Sweet Spot: Take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming…Pig Latin!   Map-Reduce 7 Pig Latin Example Table urls: (url,category, pagerank) Find for each suffciently large category, the average pagerank of high-pagerank urls in that category SQL: SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6 Pig Latin: good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls)>10^6; output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank); Outline System Overview  Pig Latin (The Language)  Data Structures  Commands   Pig (The Compiler) Logical & Physical Plans  Optimization  Efficiency  Pig Pen (The Debugger)  Conclusion  8 Big Picture Pig Latin Script UserDefined Functions Pig Map-Reduce Statements Compile Optimize Write Results 10 Read Data Data Model  Atom  Tuple Bag  Map  11 - simple atomic value (ie: number or string) Data Model  Atom  Tuple Bag  Map  12 - sequence of fields; each field any type Data Model  Atom  Tuple  Bag - collection of tuples Duplicates possible  Tuples in a bag can have different field lengths and field types   Map 13 Data Model  Atom  Tuple Bag  Map - collection of key-value pairs   14 Key is an atom; value can be any type Data Model  Control over dataflow Ex 1 (less efficient) spam_urls = FILTER urls BY isSpam(url); culprit_urls = FILTER spam_urls BY pagerank > 0.8; Ex 2 (most efficient) highpgr_urls = FILTER urls BY pagerank > 0.8; spam_urls = FILTER highpgr_urls BY isSpam(url);  Fully nested More natural for procedural programmers (target user) than normalization  Data is often stored on disk in a nested fashion  Facilitates ease of writing user-defined functions   No schema required 15 Data Model  User-Defined Functions (UDFs)  Ex: spam_urls = FILTER urls BY isSpam(url); Can be used in many Pig Latin statements  Useful for custom processing tasks  Can use non-atomic values for input and output  Currently must be written in Java  16 Speaking Pig Latin  LOAD  Input is assumed to be a bag (sequence of tuples)  Can specify a deserializer with “USING‟  Can provide a schema with “AS‟ newBag = LOAD ‘filename’ <USING functionName() > <AS (fieldName1, fieldName2,…)>; Queries = LOAD ‘query_log.txt’ USING myLoad() AS (userID,queryString, timeStamp) 17 Speaking Pig Latin  FOREACH  Apply some processing to each tuple in a bag  Each field can be: A fieldname of the bag  A constant A simple expression (ie: f1+f2)  A predefined function (ie: SUM, AVG, COUNT, FLATTEN)  A UDF (ie: sumTaxes(gst, pst) ) newBag = FOREACH bagName GENERATE field1, field2, …; 18 Speaking Pig Latin  FILTER  Select a subset of the tuples in a bag newBag = FILTER bagName BY expression; Expression uses simple comparison operators (==, !=, <, >, …) and Logical connectors (AND, NOT, OR) some_apples = FILTER apples BY colour != ‘red’;   19 Can use UDFs some_apples = FILTER apples BY NOT isRed(colour); Speaking Pig Latin  COGROUP Group two datasets together by a common attribute  Groups data into nested bags  grouped_data = COGROUP results BY queryString, revenue BY queryString; 20 Speaking Pig Latin  Why COGROUP and not JOIN? url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRev(results, revenue)); 21 Speaking Pig Latin  Why COGROUP and not JOIN? May want to process nested bags of tuples before taking the cross product.  Keeps to the goal of a single high-level data transformation per pig-latin statement.  However, JOIN keyword is still available:  JOIN results BY queryString, revenue BY queryString; Equivalent temp = COGROUP results BY queryString, revenue BY queryString; join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue); 22 Speaking Pig Latin  STORE (& DUMP)  Output data to a file (or screen) STORE bagName INTO ‘filename’ <USING deserializer ()>;  Other Commands (incomplete) UNION - return the union of two or more bags  CROSS - take the cross product of two or more bags  ORDER - order tuples by a specified field(s)  DISTINCT - eliminate duplicate tuples in a bag  LIMIT - Limit results to a subset  23 Compilation  Pig system does two tasks:  Builds a Logical Plan from a Pig Latin script Supports execution platform independence  No processing of data performed at this stage   Compiles the Logical Plan to a Physical Plan and Executes Convert the Logical Plan into a series of Map-Reduce statements to be executed (in this case) by Hadoop Map-Reduce  24 Compilation  Building a Logical Plan  Verify  25 input files and bags referred to are valid Create a logical plan for each bag(variable) defined Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; 26 Load(user.dat) Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; 27 Load(user.dat) Group Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Load(user.dat) Group Foreach 28 Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Load(user.dat) Group Foreach Filter 29 Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Load(user.dat) Filter Group Foreach 30 Compilation  Building a Physical Plan A = LOAD ‘user.dat’ AS (name, age, city); B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Load(user.dat) Filter Group Only happens when output is specified by STORE or DUMP Foreach 32 Compilation  Building a Physical Plan  Step 1: Create a map-reduce job for each COGROUP Load(user.dat) Filter Map Group Reduce Foreach 33 Compilation  Building a Physical Plan Step 1: Create a map-reduce job for each COGROUP  Step 2: Push other commands into the map and reduce functions where Map possible   34 May be the case certain commands require their own map-reduce Reduce job (ie: ORDER needs separate mapreduce jobs) Load(user.dat) Filter Group Foreach Compilation  Efficiency in Execution  35 Parallelism  Loading data - Files are loaded from HDFS  Statements are compiled into map-reduce jobs Compilation  Efficiency with Nested Bags In many cases, the nested bags created in each tuple of a COGROUP statement never need to physically materialize   Generally perform aggregation after a COGROUP and the statements for said aggregation are pushed into the reduce function  Applies 36 to algebraic functions (ie: COUNT, MAX, MIN, SUM, AVG) Compilation  Efficiency with Nested Bags Load(user.dat) Map Filter Group Foreach 37 Compilation  Efficiency with Nested Bags Load(user.dat) Filter Group Combine Foreach 38 Compilation  Efficiency with Nested Bags Load(user.dat) Filter Group Reduce 39 Foreach Compilation  Efficiency with Nested Bags  Why  this works: COUNT is an algebraic function; it can be structured as a tree of subfunctions with each leaf working on a subset of the data Reduce Combine 40 SUM COUNT COUNT Compilation  Efficiency with Nested Bags Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well.   Inefficiencies Non-algebraic aggregate functions (ie: MEDIAN) need entire bag to materialize; may cause a very large bag to spill to disk if it doesn‟t fit in memory  Every map-reduce job requires data be written and replicated to the HDFS (although this is offset by parallelism achieved)  41 Debugging  How to verify the semantics of an analysis program    42 Run the program against whole data set. Might take hours! Generate sample dataset  Empty result set may occur on few operations like join, filter  Generally, testing with sample dataset is difficult Pig-Pen  Samples data from large dataset for Pig statements  Apply individual Pig-Latin commands against the dataset  In case of empty result, pig system resamples  Remove redundant samples Debugging  Pig-Pen 42 Debugging  Pig-Latin command window and command generator 43 Debugging  Sand Box Dataset (generated automatically!) 44 Debugging  Pig-Pen  Provides sample data that is: Real - taken from actual data  Concise - as small as possible   Complete - collectively illustrate the key semantics of each command  Helps with schema definition  Facilitates incremental program writing 45 Conclusion Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.  Pig-Latin offers high-level data manipulation in a procedural style.  Pig-Pen is a debugging environment for Pig-Latin commands that generates samples from real data.  47 More Info  Pig, http://hadoop.apache.org/pig/  Hadoop, http://hadoop.apache.org AnksThay! 48

4.1-Pig

Related documents

Products

Support

4.1-Pig

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib