Pig Latin
CS 6800
Utah State University

Writing MapReduce Jobs
• Higher-order functions
• Map applies a function to each element of a list
    Example list: [1, 2, 3, 4]
    Want to square each number in the list
    Write function f(x) = x*x
    Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
    map function signature: (a -> b) -> [a] -> [b]
    Haskell specification
        map f [] = []
        map f (x:xs) = (f x) : (map f xs)
    Call the function
        map (\x -> x * x) [1, 2, 3, 4]

Reduce
• Reduce converts a list into a scalar
    Example list: [1, 2, 3, 4]
    Want to sum the numbers in the list
    Write function g(x,y) = x+y
    Compute g(1,g(2,g(3,g(4,0)))) = 10
    reduce signature: (a -> b -> b) -> b -> [a] -> b
    Haskell specification
        reduce g c [] = c
        reduce g c (x:xs) = g x (reduce g c xs)
    Call the function
        reduce (\x y -> x + y) 0 [1, 2, 3, 4]

Use in Cloud Computing
• Map can be used to clean data and "group" it
• Suppose a list of words
    words = ["Bat", "Volcano", "bat", "vulcano"]
• Map to lower case
    lcase = map lowercase words
• Map to correct spelling
    s = map spellFix lcase
• Count each word
    groups = map (\x -> (x, 1)) s
    groups is [(bat, 1), (volcano, 1), (bat, 1) …

Use in Cloud Computing (continued)
• Shuffle collects tuples with the same "group" value
• Reduce combines the counts within each group
    result = reduce (+) 0 groups
• Problem - MapReduce jobs are written in a PL (e.g., Java)
    Complicated
    Not reusable
    Database-like operations are common

CouchDB - Count People per Gender

Pig Latin
• At Yahoo, 40% of Hadoop jobs run using Pig
• Platform for analyzing massive data sets
• Runs on Hadoop (Map/Reduce)
• Version 0.12

What is Pig Latin?
• Dataflow language
• Non-1NF data model
    Tuples
    Sets
    Bags
• Uses relational algebra-like operations to manipulate data
    Joins
    Filter - selection
    Generate - projection
• Compiles to MapReduce jobs on a Hadoop cluster

Pig Latin Features
• A dataflow (NoSQL) language
    SQL is declarative, most PLs are not
    SQL is poor at expressing workflow
• Non-1NF data model
    Bags, sets, tuples, maps
    Data resides in read-only files
    Schema-less

Example
• Count subscribers in each city
    A = LOAD 'subscribers.txt' AS (name: chararray, city: chararray, amount: int);
    B = GROUP A BY city;
    C = FOREACH B GENERATE group AS city, COUNT(A);
    DUMP C;
• Dataflow
    LOAD … → A → GROUP A … → B → FOREACH B … → C

Compilation
[Figure: a Pig Latin program is passed to the Pig Latin compiler, which produces Map/Reduce jobs; the jobs run on Hadoop, reading and writing HDFS, and produce the result.]

Data Transformations
• Relational algebra-like
    JOIN (inner and outer joins)
    FILTER (selection)
    FOREACH (projection)
    CROSS (product)
    UNION
• Non-traditional and SQL-like
    DISTINCT, LIMIT, ORDER BY, GROUP, COGROUP, MAPREDUCE, FLATTEN, RANK, STREAM, SAMPLE, SPLIT

Magazine Subscriber Data
Subscribers
    (Name, City, Amt, Id)
    (Maya, Logan, $20, 1)
    (Jose, Logan, $15, 2)
    (Knut, Ogden, $20, 3)
    ...
Personal Information
    (Name, Email, Id)
    (Maya, maya@gmail.com, 5)
    (Jose, jose@gmail.com, 6)
    (Knut, knut@hotmail.com, 7)
    ...

FILTER
• A filter restricts the result
    /* Restrict to Logan subscribers */
    X = FILTER R BY city == 'Logan';
• FILTER example
Subscribers
    (Name, City, Amt, Id)
    (Maya, Logan, $20, 1)
    (Jose, Logan, $15, 2)
    (Knut, Ogden, $20, 3)
    ...

Magazine Subscriber Data
    B = JOIN Subscribers BY name, PerInfo BY name;
Subscribers
    (Name, City, Amt, Id)
    (Maya, Logan, $20, 1)
    (Jose, Logan, $15, 2)
    (Knut, Ogden, $20, 3)
    ...
Personal Information
    (Name, Email, Id)
    (Maya, maya@gmail.com, 5)
    (Jose, jose@gmail.com, 6)
    (Knut, knut@hotmail.com, 7)
    ...
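
The JOIN statement on this slide can be imitated in a few lines of Python. This is a minimal sketch of an inner equi-join on name, not a description of how Pig executes a join. The helper name `join_by_name` and the list-of-tuples representation are assumptions made for illustration; only the three sample rows from the slide are used, and the dollar amounts are stored as plain integers.

```python
from collections import defaultdict

# Sample rows copied from the slide; the "..." rows are not reproduced.
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]
per_info = [
    ("Maya", "maya@gmail.com", 5),
    ("Jose", "jose@gmail.com", 6),
    ("Knut", "knut@hotmail.com", 7),
]


def join_by_name(left, right):
    """Inner equi-join on the first field (name) of each tuple.

    Hypothetical helper: builds a hash index on `right`, then emits the
    concatenation of every matching pair, keeping all fields of both sides.
    """
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row)
    return [l + r for l in left for r in index.get(l[0], [])]


b = join_by_name(subscribers, per_info)
for row in b:
    print(row)
```

Each output tuple carries the fields of both inputs, so the name and id columns appear twice, once from each relation.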
Magazine Subscriber Data
    B = JOIN Subscribers BY name, PerInfo BY name;
B
    (Name, City, Amt, Id, Name, Email, Id)
    (Maya, Logan, $20, 1, Maya, maya@gmail.com, 5)
    (Jose, Logan, $15, 2, Jose, jose@gmail.com, 6)
    (Knut, Ogden, $20, 3, Knut, knut@hotmail.com, 7)
    ...

Optimization
[Figure: a dataflow of FILTER …, JOIN …, FILTER …, and CROSS … steps over relations A through E, with the steps grouped into Map/Reduce jobs.]
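
One way to read the optimization figure is that consecutive operators, such as a FILTER followed by a JOIN, can share a single Map/Reduce job instead of each needing its own. The Python sketch below illustrates that reading under stated assumptions: the filter runs inside the map step, the shuffle groups tagged records by name, and the reduce step pairs them. The function names (`map_phase`, `shuffle`, `reduce_phase`), the Logan predicate, and the in-memory sample data are hypothetical; this is not Pig's actual execution plan.

```python
from collections import defaultdict

# Sample rows from the Magazine Subscriber Data slides (amounts as integers).
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]
per_info = [
    ("Maya", "maya@gmail.com", 5),
    ("Jose", "jose@gmail.com", 6),
    ("Knut", "knut@hotmail.com", 7),
]


def map_phase(subs, info):
    """Emit (name, (tag, row)) pairs; the FILTER is applied here, before the shuffle."""
    for row in subs:
        if row[1] == "Logan":  # plays the role of FILTER ... BY city == 'Logan'
            yield row[0], ("S", row)
    for row in info:
        yield row[0], ("P", row)


def shuffle(pairs):
    """Collect all values that share the same key, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Pair subscriber rows with personal-info rows that share a name."""
    out = []
    for key in sorted(groups):
        left = [row for tag, row in groups[key] if tag == "S"]
        right = [row for tag, row in groups[key] if tag == "P"]
        out.extend(l + r for l in left for r in right)
    return out


result = reduce_phase(shuffle(map_phase(subscribers, per_info)))
for row in result:
    print(row)
```

In this sketch the Ogden subscriber row is discarded in the map step, so it never reaches the shuffle, which is the kind of saving that grouping a filter with a join into one job could provide.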