Introduction to Hadoop Programming
Bryon Gill, Pittsburgh Supercomputing Center

Hadoop Overview
• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications

Map/Reduce
• Apply a Function to all the Data
• Harvest, Sort, and Process the Output

Map/Reduce
• [Diagram: Big Data is divided into Split 1 … Split n; Map F(x) turns each
  split into Output 1 … Output n; Reduce F(x) combines the outputs into a
  Result.]

HDFS
• Distributed FS Layer
• WORM (write once, read many) fs
  – Optimized for Streaming Throughput
• Exports
• Replication
• Process data in place

HDFS Invocations: Getting Data In and Out
• hadoop dfs -ls
• hadoop dfs -put
• hadoop dfs -get
• hadoop dfs -rm
• hadoop dfs -mkdir
• hadoop dfs -rmr

Writing Hadoop Programs
• Wordcount Example: WordCount.java
  – Map Class
  – Reduce Class

Compiling
• javac -cp $HADOOP_HOME/hadoop-core*.jar \
      -d WordCount/ WordCount.java

Packaging
• jar -cvf WordCount.jar -C WordCount/ .

Submitting your Job
• hadoop jar WordCount.jar \
      org.myorg.WordCount \
      /datasets/compleat.txt \
      $MYOUTPUT \
      -D mapred.reduce.tasks=2

Configuring your Job Submission
• Mappers and Reducers
• Java options
• Other parameters

Monitoring
• Important Ports:
  – Hearth-00.psc.edu:50030 – Jobtracker (MapReduce Jobs)
  – Hearth-00.psc.edu:50070 – HDFS (Namenode)
  – Hearth-03.psc.edu:50060 – Tasktracker (Worker Node)
  – Hearth-03.psc.edu:50075 – Datanode

Hadoop Streaming
• Write Map/Reduce Jobs in any language
• Excellent for Fast Prototyping

Hadoop Streaming: Bash Example
• Bash wc and cat
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
      -input /datasets/plays/ \
      -output mynewoutputdir \
      -mapper '/bin/cat' \
      -reducer '/usr/bin/wc -l'

Hadoop Streaming: Python Example
• Wordcount in Python (mapper.py and reducer.py are sketched below)
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
      -file mapper.py \
      -mapper mapper.py \
      -file reducer.py \
      -reducer reducer.py \
      -input /datasets/plays/ \
      -output pyout

Applications in the Hadoop Ecosystem
• HBase (NoSQL database)
• Hive (Data warehouse with SQL-like language)
• Pig (SQL-style MapReduce)
• Mahout (Machine learning via MapReduce)
• Spark (Caching computation framework)

Spark
• Alternate programming framework using HDFS
• Optimized for in-memory computation
• Well supported in Java, Python, Scala

Spark Resilient Distributed Dataset (RDD)
• "RDD" for short
• Persistence-enabled data collections
• Transformations
• Actions
• Flexible implementation: memory vs. hybrid vs. disk

Spark Example
• lettercount.py (sketched below)

Spark Machine Learning Library
• Clustering (K-Means)
• Many others, list at http://spark.apache.org/docs/1.0.1/mllib-guide.html

K-Means Clustering
• Randomly seed the cluster starting points (centroids)
• Assign each point to its nearest centroid, then recompute each cluster's mean
• If the centroids moved, do it again
• If the centroids stay the same, they've converged and we're done
• Awesome visualization: http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

K-Means Examples
• spark-submit \
      $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
      hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3
• spark-submit \
      $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
      hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2

Questions?
• Thanks!
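Sketch: mapper.py and reducer.py
The streaming slides invoke mapper.py and reducer.py without listing them, so
here is a minimal sketch of what such a word-count pair might look like. The
file names come from the slide; the rest is an assumption, apart from the
Streaming contract itself (the mapper reads raw lines on stdin and emits
tab-separated key/value pairs on stdout; the reducer receives the mapper
output sorted by key).

    #!/usr/bin/env python
    # mapper.py -- hypothetical sketch of the streaming mapper (assumed content).
    # Hadoop Streaming pipes each input line to stdin; emit "word<TAB>1" per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py -- hypothetical sketch of the streaming reducer (assumed content).
    # Streaming sorts mapper output by key, so identical words arrive as
    # consecutive lines; sum each run and emit "word<TAB>count".
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Both scripts must be executable (chmod +x mapper.py reducer.py); the -file
options on the slide ship them to the worker nodes. A handy local smoke test
that mimics the framework: cat sometext | ./mapper.py | sort | ./reducer.py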
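Sketch: lettercount.py
The Spark example slide names lettercount.py but doesn't show it; a minimal
letter count with the RDD API might look like the sketch below. The input path
and app name are assumptions; the flatMap/map/reduceByKey transformations,
persist(), and the collect() action are the standard RDD calls from the Spark
documentation, tying back to the RDD slide above.

    #!/usr/bin/env python
    # lettercount.py -- hypothetical sketch of the lettercount example.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="LetterCount")

    # Transformations are lazy: nothing runs until an action is called.
    lines = sc.textFile("hdfs:///datasets/compleat.txt")   # path is an assumption
    letters = (lines.flatMap(lambda line: list(line))      # split lines into characters
                    .filter(lambda c: c.isalpha())         # keep letters only
                    .map(lambda c: (c.lower(), 1))         # (letter, 1) pairs
                    .reduceByKey(lambda a, b: a + b))      # sum counts per letter

    # Persistence: "memory vs. hybrid vs. disk" is chosen via StorageLevel.
    letters.persist(StorageLevel.MEMORY_AND_DISK)

    # collect() is an action: it triggers the computation and returns results.
    for letter, count in sorted(letters.collect()):
        print("%s: %d" % (letter, count))

    sc.stop()

Submit it the same way as the K-Means examples: spark-submit lettercount.py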
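Sketch: K-Means with MLlib
The kmeans.py used on the examples slide ships with Spark itself; the core
MLlib call it wraps looks roughly like the sketch below. The input format (one
space-separated vector per line) follows the kmeans_data.txt convention in the
Spark docs; the path, k=3, and the iteration cap are assumptions.

    #!/usr/bin/env python
    # Minimal MLlib K-Means sketch (assumed input path and parameters).
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="KMeansSketch")

    # Each input line is a space-separated vector, e.g. "0.0 0.0 0.0".
    data = sc.textFile("hdfs:///datasets/kmeans_data.txt")
    points = data.map(lambda line: array([float(x) for x in line.split()]))

    # Train: Lloyd's algorithm as described on the K-Means slide --
    # seed centroids, assign points, recompute means, repeat until converged.
    model = KMeans.train(points, 3, maxIterations=10)

    for i, center in enumerate(model.clusterCenters):
        print("cluster %d center: %s" % (i, center))

    sc.stop()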
References and Useful Links
• HDFS shell commands: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
• Writing and running your first program:
  http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197
  https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
• Hadoop Streaming:
  http://hadoop.apache.org/docs/stable1/streaming.html
  https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming
  http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hadoop Stable API: http://hadoop.apache.org/docs/r1.2.1/api/
• Hadoop Official Releases: https://hadoop.apache.org/releases.html
• Spark Documentation: http://spark.apache.org/docs/latest/