Pig (Latin) Demo
Presented By: Imranul Hoque
1
Topics
• Last Seminar:
– Hadoop Installation
– Running MapReduce Jobs
– MapReduce Code
– Status Monitoring
• Today:
– Complexity of writing MapReduce programs
– Pig Latin and Pig
– Pig Installation
– Running Pig
2
Example Problem
URL                Category         Pagerank
www.google.com     Search Engine    0.9
www.cnn.com        News             0.8
www.facebook.com   Social Network   0.85
www.foxnews.com    News             0.78
www.foo.com        Blah             0.1
www.bar.com        Blah             0.5
• Goal: for each sufficiently large category find
the average pagerank of high-pagerank urls in
that category
3
Example Problem (cont’d)
• SQL:
SELECT category, AVG(pagerank)
FROM url-table WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6
• MapReduce: ?
• Procedural (MapReduce) vs. Declarative (SQL)
• Pig Latin: Sweet spot between declarative and
procedural
[Stack diagram: Pig Latin programs run on the Pig system, which compiles them down to MapReduce jobs on Hadoop]
4
Pig Latin Solution
urls = LOAD 'url-table' AS (url, category, pagerank);
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

For each sufficiently large category, find the average pagerank of high-pagerank urls in that category.
5
Features
• Dataflow language
– Find the set of urls that are classified as spam but have a high pagerank score
– spam_urls = FILTER urls BY isSpam(url);
– culprit_urls = FILTER spam_urls BY pagerank > 0.8;
• User defined function (UDF)
• Debugging environment
• Nested data model (see the sketch below)
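A rough sketch of how the nested data model shows up in practice (relation and field names reuse the urls example from slide 5; this is illustrative, not part of the demo):

-- GROUP produces one record per category: (group, {bag of url tuples})
grouped = GROUP urls BY category;
-- expressions can reach into the nested bag; COUNT and AVG aggregate over it
stats = FOREACH grouped GENERATE group, COUNT(urls), AVG(urls.pagerank);

A UDF such as isSpam(url) above is an arbitrary function the user supplies, so the same dataflow style extends to custom processing.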
Pig Latin Commands
load            Read data from file system.
store           Write data to file system.
foreach         Apply expression to each record and output one or more records.
filter          Apply predicate and remove records that do not return true.
group/cogroup   Collect records with the same key from one or more inputs.
join            Join two or more inputs based on a key.
order           Sort records based on a key.
distinct        Remove duplicate records.
union           Merge two data sets.
dump            Write output to stdout.
limit           Limit the number of records.
7
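A minimal sketch tying several of these commands together (the file name and schema are illustrative only):

raw  = LOAD 'urls.txt' AS (url, category, pagerank);  -- read from the file system
good = FILTER raw BY pagerank > 0.5;                  -- drop records failing the predicate
grp  = GROUP good BY category;                        -- collect records with the same key
cnt  = FOREACH grp GENERATE group, COUNT(good);       -- one output record per group
srt  = ORDER cnt BY $1;                               -- sort by the count field
top  = LIMIT srt 10;                                  -- keep at most 10 records
DUMP top;                                             -- write to stdout
STORE cnt INTO 'category_counts';                     -- write to the file system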
Pig System
[Architecture diagram: the user submits a Pig Latin program; the Parser produces a parsed program; the Pig Compiler, with a cross-job optimizer, builds an execution plan of operators (e.g. filter, join); the MapReduce Compiler turns the plan into map-reduce jobs that run on the Hadoop cluster, and the output flows back to the user]
9
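One way to peek at what the compiler produces, assuming your Pig version supports the EXPLAIN statement in the grunt shell:

grunt> urls = LOAD 'url-table' AS (url, category, pagerank);
grunt> good_urls = FILTER urls BY pagerank > 0.2;
grunt> EXPLAIN good_urls;   -- prints the plans Pig builds for this alias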
Pig Pen
• Find users who tend to visit “good” pages
Dataflow sketched in Pig Pen (the equivalent Pig Latin script is sketched below):
Load Visits(user, url, time)
Load Pages(url, pagerank)
Transform to (user, Canonicalize(url), time)
Join url = url
Group by user
Transform to (user, Average(pagerank) as avgPR)
Filter avgPR > 0.5
10
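Roughly the Pig Latin this dataflow corresponds to (a sketch only; Canonicalize is assumed to be a UDF, and the built-in AVG stands in for the diagram's Average):

visits = LOAD 'visits' AS (user, url, time);
pages  = LOAD 'pages' AS (url, pagerank);
canon  = FOREACH visits GENERATE user, Canonicalize(url) AS url, time;  -- normalize urls
joined = JOIN canon BY url, pages BY url;
grps   = GROUP joined BY user;
result = FOREACH grps GENERATE group, AVG(joined.pagerank) AS avgPR;
answer = FILTER result BY avgPR > 0.5;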
Example data flowing through the plan (Pig Pen shows sandbox tuples at each step):

Load Visits(user, url, time):
(Amy, cnn.com, 8am)
(Amy, http://www.snails.com, 9am)
(Fred, www.snails.com/index.html, 11am)

Load Pages(url, pagerank):
(www.cnn.com, 0.9)
(www.snails.com, 0.4)

Transform to (user, Canonicalize(url), time):
(Amy, www.cnn.com, 8am)
(Amy, www.snails.com, 9am)
(Fred, www.snails.com, 11am)

Join url = url:
(Amy, www.cnn.com, 8am, 0.9)
(Amy, www.snails.com, 9am, 0.4)
(Fred, www.snails.com, 11am, 0.4)

Group by user:
(Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
(Fred, { (Fred, www.snails.com, 11am, 0.4) })

Transform to (user, Average(pagerank) as avgPR):
(Amy, 0.65)
(Fred, 0.4)

Filter avgPR > 0.5:
(Amy, 0.65)

Challenges?
11
Installation
• Extract
• Build (ant)
– In pig-0.1.1 and in the tutorial directory
• Environment variables
– PIGDIR=~/pig-0.1.1
– HADOOPSITEPATH=~/hadoop-0.18.3/conf
12
Running Pig
• Two modes:
– Local mode
– Hadoop mode
• Ways to execute:
– Shell (grunt)
– Script
– API (currently Java)
– GUI (future work)
13
Running Pig (2)
• Save data into HDFS
– bin/hadoop fs -copyFromLocal excite-small.log excite-small.log
• Launch the shell / run a script
– java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name> (omit the script name to get the grunt shell)
• Our script:
– script1-hadoop.pig
14
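Not the actual contents of script1-hadoop.pig, just a minimal sketch of the kind of statements you could run against excite-small.log, assuming the tutorial's tab-separated (user, time, query) layout:

raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
grp = GROUP raw BY user;                       -- one group per user
cnt = FOREACH grp GENERATE group, COUNT(raw);  -- queries issued per user
STORE cnt INTO 'query_counts';                 -- results end up in HDFS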
Conclusion
• For more details:
– http://hadoop.apache.org/core/
– http://wiki.apache.org/hadoop/
– http://hadoop.apache.org/pig/
– http://wiki.apache.org/pig/
15