Pig Latin lecture notes

Pig Latin
CS 6800
Utah State University
Writing MapReduce Jobs
• Higher order functions
• Map applies a function to a list
    Example list [1, 2, 3, 4]
    Want to square each number in the list
    Write function f(x) = x*x
    Compute [f(1), f(2), f(3), f(4)] = [1, 4, 9, 16]
map function signature: (a -> b) -> [a] -> [b]
Haskell specification
map f [] = []
map f (x:xs) = (f x) : (map f xs)
Call the function
map (\x -> x * x) [1, 2, 3, 4]
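The same specification can be mirrored in Python as a minimal sketch. The name my_map is an assumption introduced here (it is not in the notes) to avoid shadowing Python's built-in map.

```python
def my_map(f, xs):
    """Recursive map; my_map is a hypothetical name used only for this sketch."""
    if not xs:
        # Base case: mapping over the empty list gives the empty list.
        return []
    # Recursive case: apply f to the head, then map over the tail.
    return [f(xs[0])] + my_map(f, xs[1:])


squares = my_map(lambda x: x * x, [1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
```

In everyday Python the built-in map or a list comprehension would be used instead; the recursion is kept only to match the two-case specification.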
Reduce
• Reduce converts a list into a scalar
    Example list [1, 2, 3, 4]
    Want to sum the numbers in the list
    Write function g(x,y) = x+y
    Compute g(1,g(2,g(3,g(4,0)))) = 10
reduce signature: (a -> b -> b) -> b -> [a] -> b
Haskell specification
reduce g c [] = c
reduce g c (x:xs) = g x (reduce g c xs)
Call the function
reduce (\x y -> x + y) 0 [1, 2, 3, 4]
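A Python sketch of the same right-to-left reduction follows. The name my_reduce is an assumption introduced here; it nests the calls as g(1, g(2, g(3, g(4, 0)))), matching the computation in the notes.

```python
def my_reduce(g, c, xs):
    """Right fold; my_reduce is a hypothetical name used only for this sketch."""
    if not xs:
        # Base case: the empty list reduces to the initial value c.
        return c
    # Recursive case: combine the head with the reduction of the tail.
    return g(xs[0], my_reduce(g, c, xs[1:]))


total = my_reduce(lambda x, y: x + y, 0, [1, 2, 3, 4])
print(total)  # 10
```

Python's own functools.reduce folds from the left instead; for an associative operation such as addition both orders give the same result.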
Use in Cloud Computing
• Map can be used to clean data and "group" it
• Suppose a list of words
words = ["Bat", "Volcano", "bat", "vulcano"]
• Map to lower case
lcase = map lowercase words
• Map to correct spelling
s = map spellFix lcase
• Count each word
groups = map (\x -> (x, 1)) s
groups is [(bat, 1), (volcano, 1), (bat, 1) …
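The cleaning steps above can be sketched in Python. The notes do not say how lowercase or spellFix work, so spell_fix and its CORRECTIONS table are assumptions made purely for illustration.

```python
# Hypothetical correction table; the notes do not specify how spellFix works.
CORRECTIONS = {"vulcano": "volcano"}


def spell_fix(word):
    """Return the corrected spelling if one is known, else the word unchanged."""
    return CORRECTIONS.get(word, word)


words = ["Bat", "Volcano", "bat", "vulcano"]

lcase = list(map(str.lower, words))      # map to lower case
s = list(map(spell_fix, lcase))          # map to correct spelling
groups = list(map(lambda w: (w, 1), s))  # pair each word with a count of 1

print(groups)
# [('bat', 1), ('volcano', 1), ('bat', 1), ('volcano', 1)]
```

Each step is an independent map over the list, which is what lets a MapReduce framework run it on separate chunks of the input in parallel.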
Use in Cloud Computing (continued)
• Shuffle collects tuples with the same "group" value
• Reduce combines the counts within each group
result = reduce (+) 0 counts    (counts = the 1s shuffled into one group)
• Problem - MapReduce jobs written in a programming language (PL), e.g., Java
    Complicated
    Not reusable
    Database-like operations common
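The shuffle and reduce steps of the word count can be sketched in Python as follows. The input list is an assumption: it is what the map stage would emit for the four sample words once the spelling is corrected. The helper name shuffle is also hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Assumed map output for the four sample words (not given in full in the notes).
groups = [("bat", 1), ("volcano", 1), ("bat", 1), ("volcano", 1)]


def shuffle(pairs):
    """Collect the values of all pairs that share the same key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return dict(buckets)


shuffled = shuffle(groups)  # {'bat': [1, 1], 'volcano': [1, 1]}

# Reduce each group's list of counts to a single total.
result = {
    word: reduce(lambda x, y: x + y, counts, 0)
    for word, counts in shuffled.items()
}

print(result)  # {'bat': 2, 'volcano': 2}
```

Even this toy version needs explicit bookkeeping for what is conceptually a GROUP BY with a COUNT, which is the kind of database-like operation the notes point to.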
CouchDB - Count People per Gender
Pig Latin
• Yahoo
40% of Hadoop jobs run using Pig
• Platform for analyzing massive data sets
• Runs on Hadoop (Map/Reduce)
• Version 0.12
What is Pig Latin?
• Dataflow language
• Non-1NF data model
    Tuples
    Sets
    Bags
• Use relational algebra-like operations to manipulate data
    Joins
    Filter - selection
    Generate - projection
• Compiles to MapReduce jobs on a Hadoop cluster
Pig Latin Features
• A dataflow (NoSQL) language
    SQL is declarative, most PLs are not
    SQL poor at expressing workflow
• Non-1NF data model
    Bags, sets, tuples, maps
    Data resides in read-only files
    Schema-less
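To make the non-1NF data model concrete, here is a rough Python analogy. It is only a sketch: Python tuples, lists, and dicts stand in for Pig tuples, bags, and maps, and the duplicated row is invented for illustration.

```python
# A tuple: an ordered sequence of fields.
subscriber = ("Maya", "Logan", 20)

# A bag: a collection of tuples that may contain duplicates (modeled as a list).
bag = [("Maya", "Logan", 20), ("Jose", "Logan", 15), ("Maya", "Logan", 20)]

# A map: keys associated with values (modeled as a dict).
info = {"email": "maya@gmail.com"}

# Non-1NF: a field of a tuple may itself be a bag or a map,
# so values are not restricted to atomic types.
nested = ("Logan", bag, info)

print(len(nested[1]))       # 3 tuples in the bag, duplicates kept
print(len(set(nested[1])))  # 2 distinct tuples if treated as a set
```

A first-normal-form relation would forbid the second and third fields of nested, since each holds a collection rather than a single atomic value.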
Example
• Count subscribers in each city
A = LOAD 'subscribers.txt' AS
    (name: chararray, city: chararray, amount: int);
B = GROUP A BY city;
-- each tuple of B is (group, A): the city and the bag of A tuples for it
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;
• Dataflow
LOAD … → A → GROUP A … → B → FOREACH B … → C
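What the grouping-and-counting script computes can be sketched in plain Python. The rows below are taken from the sample subscriber data in these notes and stand in for the contents of subscribers.txt, which the notes do not show; the variable bags is a hypothetical helper.

```python
from collections import defaultdict

# Stand-in for subscribers.txt: (name, city, amount)
A = [("Maya", "Logan", 20), ("Jose", "Logan", 15), ("Knut", "Ogden", 20)]

# GROUP A BY city: each result entry pairs a city with the bag of its rows.
bags = defaultdict(list)
for row in A:
    bags[row[1]].append(row)
B = list(bags.items())

# FOREACH B GENERATE <city>, <number of rows in the bag>
C = [(group, len(bag)) for group, bag in B]

print(C)  # [('Logan', 2), ('Ogden', 1)]
```

The intermediate B is nested (a key plus a bag of whole tuples), which is why the counting step operates on the bag rather than on a flat column.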
Compilation
[Figure: Pig Latin Program → Pig Latin Compiler → MapReduce jobs (Map, Reduce, HDFS) on Hadoop → Result]
Data Transformations
• Relational algebra-like
    JOIN (inner and outer joins)
    FILTER (selection)
    FOREACH (projection)
    CROSS (product)
    UNION
• Non-traditional
    COGROUP
    MAPREDUCE
    FLATTEN
    RANK
    STREAM
    SAMPLE
    SPLIT
• SQL-like
    DISTINCT
    LIMIT
    ORDER BY
    GROUP
Magazine Subscriber Data
Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...
Personal Information
(Name, Email, Id)
(Maya, maya@gmail.com, 5)
(Jose, jose@gmail.com, 6)
(Knut, knut@hotmail.com, 7)
...
FILTER
• A filter restricts the result
/* Restrict to Logan subscribers */
X = FILTER R BY city == 'Logan';
• FILTER example
Subscribers
(Name, City, Amt, Id)
(Maya, Logan, $20, 1)
(Jose, Logan, $15, 2)
(Knut, Ogden, $20, 3)
...
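The effect of the filter on the three sample rows can be sketched in Python; this models only the semantics (keep the tuples whose city is Logan), not how Pig evaluates it. Amounts are written as plain integers here as a simplification.

```python
# Sample rows from the notes: (name, city, amount, id)
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]

# Keep only the tuples whose city field equals "Logan".
X = [row for row in subscribers if row[1] == "Logan"]

for row in X:
    print(row)
# ('Maya', 'Logan', 20, 1)
# ('Jose', 'Logan', 15, 2)
```

The schema of X is the same as that of the input; a filter drops tuples but never changes their fields.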
Magazine Subscriber Data
B = JOIN Subscribers BY name, PerInfo BY name;
B
(Name, City, Amt, Id, Name, Email, Id)
(Maya, Logan, $20, 1, Maya, maya@gmail.com, 5)
(Jose, Logan, $15, 2, Jose, jose@gmail.com, 6)
(Knut, Ogden, $20, 3, Knut, knut@hotmail.com, 7)
...
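The shape of the join result can be reproduced with a small in-memory hash join in Python. This is a sketch of the semantics only, not of how Pig executes a join on Hadoop; the names per_info and index are assumptions, and amounts are written as plain integers.

```python
from collections import defaultdict

# Sample rows from the notes.
subscribers = [
    ("Maya", "Logan", 20, 1),
    ("Jose", "Logan", 15, 2),
    ("Knut", "Ogden", 20, 3),
]
per_info = [
    ("Maya", "maya@gmail.com", 5),
    ("Jose", "jose@gmail.com", 6),
    ("Knut", "knut@hotmail.com", 7),
]

# Build an index from name to the matching personal-information tuples.
index = defaultdict(list)
for p in per_info:
    index[p[0]].append(p)

# Inner join: concatenate every pair of tuples that agree on name.
B = [s + p for s in subscribers for p in index.get(s[0], [])]

print(B[0])  # ('Maya', 'Logan', 20, 1, 'Maya', 'maya@gmail.com', 5)
```

As in the result table above, the fields of both inputs are kept side by side, so the name and id columns each appear twice.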
Optimization
[Figure: dataflow over relations A to E using FILTER …, JOIN …, FILTER …, and CROSS …, annotated with two Map/Reduce jobs]