Pig (Latin) Demo
Presented by: Imranul Hoque

Topics
• Last seminar:
– Hadoop Installation
– Running MapReduce Jobs
– MapReduce Code
– Status Monitoring
• Today:
– Complexity of writing MapReduce programs
– Pig Latin and Pig
– Pig Installation
– Running Pig

Example Problem

URL                Category         Pagerank
www.google.com     Search Engine    0.9
www.cnn.com        News             0.8
www.facebook.com   Social Network   0.85
www.foxnews.com    News             0.78
www.foo.com        Blah             0.1
www.bar.com        Blah             0.5

• Goal: for each sufficiently large category, find the average pagerank of the high-pagerank urls in that category

Example Problem (cont'd)
• SQL:
  SELECT category, AVG(pagerank)
  FROM url-table
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 10^6
• MapReduce: ?
• Procedural (MapReduce) vs. Declarative (SQL)
• Pig Latin: sweet spot between declarative and procedural
[Diagram: Pig Latin on top of the Pig system, which compiles to MapReduce jobs running on Hadoop]

Pig Latin Solution
  urls = LOAD 'url-table' AS (url, category, pagerank);
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
For each sufficiently large category, find the average pagerank of the high-pagerank urls in that category.

Features
• Dataflow language
– Find the set of urls that are classified as spam but have a high pagerank score:
– spam_urls = FILTER urls BY isSpam(url);
– culprit_urls = FILTER spam_urls BY pagerank > 0.8;
• User-defined functions (UDFs)
• Debugging environment
• Nested data model

Pig Latin Commands
load            Read data from the file system.
store           Write data to the file system.
foreach         Apply an expression to each record and output one or more records.
filter          Apply a predicate and remove records that do not return true.
group/cogroup   Collect records with the same key from one or more inputs.
join            Join two or more inputs based on a key.
order           Sort records based on a key.
distinct        Remove duplicate records.
union           Merge two data sets.
dump            Write output to stdout.
limit           Limit the number of records.

Pig System
[Diagram: the user submits a Pig Latin program to the parser; the parsed program goes to the Pig compiler with a cross-job optimizer, which produces an execution plan of operators (e.g. filter, join, UDFs); the MapReduce compiler turns the plan into map-reduce jobs that run on the Hadoop cluster and produce the output]

Pig Pen
• Find users who tend to visit "good" pages
  Load Visits(user, url, time)
  Load Pages(url, pagerank)
  Transform to (user, Canonicalize(url), time)
  Join url = url
  Group by user
  Transform to (user, Average(pagerank) as avgPR)
  Filter avgPR > 0.5

The same dataflow on example data:
Load Visits(user, url, time):
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)
Load Pages(url, pagerank):
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)
Transform to (user, Canonicalize(url), time):
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)
Join url = url:
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)
Group by user:
  (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })
Transform to (user, Average(pagerank) as avgPR):
  (Amy, 0.65)
  (Fred, 0.4)
Filter avgPR > 0.5:
  (Amy, 0.65)
Challenges?
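The Pig Pen dataflow above can also be written directly as a Pig Latin script. The following is only a minimal sketch, not part of the original slides: the input paths 'visits' and 'pages' are placeholders, Canonicalize stands for a user-defined function that would have to be registered, and field references after the JOIN may need disambiguation (e.g. pages::pagerank) depending on the Pig version.

  -- load the two inputs (paths are placeholders)
  visits = LOAD 'visits' AS (user, url, time);
  pages  = LOAD 'pages'  AS (url, pagerank);
  -- canonicalize urls with the (user-defined) Canonicalize function before joining
  canon  = FOREACH visits GENERATE user, Canonicalize(url) AS url, time;
  joined = JOIN canon BY url, pages BY url;
  -- average pagerank per user, then keep users who mostly visit "good" pages
  grouped    = GROUP joined BY user;
  avg_pr     = FOREACH grouped GENERATE group AS user, AVG(joined.pagerank) AS avgPR;
  good_users = FILTER avg_pr BY avgPR > 0.5;
  DUMP good_users;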
Installation
• Extract
• Build (ant)
– In pig-0.1.1 and in the tutorial dir
• Environment variables:
– PIGDIR=~/pig-0.1.1
– HADOOPSITEPATH=~/hadoop-0.18.3/conf

Running Pig
• Two modes:
– Local mode
– Hadoop mode
• Three ways to execute (plus a GUI as future work):
– Shell (grunt)
– Script
– API (currently Java)

Running Pig (2)
• Save data into HDFS:
– bin/hadoop dfs -copyFromLocal excite-small.log excite-small.log
• Launch the shell / run a script:
– java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name>
• Our script: script1-hadoop.pig (a minimal grunt session is sketched after the links below)

Conclusion
• For more details:
– http://hadoop.apache.org/core/
– http://wiki.apache.org/hadoop/
– http://hadoop.apache.org/pig/
– http://wiki.apache.org/pig/
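As a small follow-on to the "Running Pig" steps, here is what a first grunt session over excite-small.log might look like. This is only a sketch and not the contents of script1-hadoop.pig; the tab-delimited (user, time, query) layout is an assumption about the log format.

  -- after launching grunt with the java command shown above
  grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
  grunt> grouped = GROUP raw BY user;
  -- number of queries issued by each user
  grunt> counts = FOREACH grouped GENERATE group, COUNT(raw);
  grunt> DUMP counts;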