Big Data Programming with Hadoop and Spark - PSC

Introduction to Hadoop Programming
Bryon Gill, Pittsburgh Supercomputing Center
Hadoop Overview
• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications
Map/Reduce
• Apply a Function to all the Data
• Harvest, Sort, and Process the Output
Map/Reduce
• [Diagram: Big Data is divided into Split 1 … Split n; Map F(x) is applied to each split, producing Output 1 … Output n; Reduce F(x) combines the outputs into a single Result]
© 2014 Pittsburgh Supercomputing Center
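The flow in the diagram can be sketched in plain Python on a single machine. This is only an illustration of the map, sort, and reduce steps, not how Hadoop itself is implemented; the function names and the two tiny input splits are made up for the example.

```python
from collections import defaultdict


def map_phase(splits, mapper):
    """Apply the mapper to every record of every split, collecting (key, value) pairs."""
    pairs = []
    for split in splits:
        for record in split:
            pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group the values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return [reducer(key, values) for key, values in groups]


def word_mapper(line):
    """Map function: emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    """Reduce function: sum the counts emitted for one word."""
    return (word, sum(counts))


# Two hypothetical splits, each holding one line of text.
splits = [["to be or not"], ["to be"]]
result = reduce_phase(shuffle(map_phase(splits, word_mapper)), count_reducer)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

On a cluster, each split would be mapped on a different node and the sorted groups would be divided among the reducers; the single list used here stands in for all of that.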
HDFS
• Distributed FS Layer
• WORM fs
– Optimized for Streaming Throughput
• Exports
• Replication
• Process data in place
HDFS Invocations: Getting Data In and Out
• hadoop dfs -ls
• hadoop dfs -put
• hadoop dfs -get
• hadoop dfs -rm
• hadoop dfs -mkdir
• hadoop dfs -rmdir
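A typical round trip with these commands might look like the sketch below, which only builds and prints the command lines so it can be read without a cluster. The helper names and the file names (local.txt, input, copy.txt) are hypothetical; the point is the argument order, with -put going from local disk to HDFS and -get going the other way.

```python
import subprocess


def dfs_command(action, *paths):
    """Build the argument list for `hadoop dfs -<action> <paths...>`."""
    return ["hadoop", "dfs", "-" + action, *paths]


def run_dfs(action, *paths):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(dfs_command(action, *paths), check=True)


# Hypothetical session: create a directory, upload a file, list it, download it.
steps = [
    ("mkdir", "input"),
    ("put", "local.txt", "input/"),          # local file -> HDFS
    ("ls", "input"),
    ("get", "input/local.txt", "copy.txt"),  # HDFS -> local file
]
for step in steps:
    print(" ".join(dfs_command(*step)))
```

Replacing the print with run_dfs(*step) would execute each command, assuming the hadoop launcher is on the PATH.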
Writing Hadoop Programs
• WordCount Example: WordCount.java
– Map Class
– Reduce Class
Compiling
• javac -cp $HADOOP_HOME/hadoop-core*.jar \
-d WordCount/ WordCount.java
Packaging
• jar -cvf WordCount.jar -C WordCount/ .
Submitting your Job
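• hadoop \
jar WordCount.jar \
org.myorg.WordCount \
-D mapred.reduce.tasks=2 \
/datasets/compleat.txt \
$MYOUTPUT
– Generic options such as -D must come before the program arguments, and take effect only if the job parses them (e.g. via ToolRunner)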
Configuring your Job Submission
• Mappers and Reducers
• Java options
• Other parameters
Monitoring
• Important Ports:
– hearth-00.psc.edu:50030 – Jobtracker (MapReduce Jobs)
– hearth-00.psc.edu:50070 – HDFS (Namenode)
– hearth-03.psc.edu:50060 – Tasktracker (Worker Node)
– hearth-03.psc.edu:50075 – Datanode
Hadoop Streaming
• Write Map/Reduce Jobs in any language
• Excellent for Fast Prototyping
Hadoop Streaming: Bash Example
• Bash wc and cat
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
-input /datasets/plays/ \
-output mynewoutputdir \
-mapper '/bin/cat' \
-reducer '/usr/bin/wc -l'
Hadoop Streaming Python Example
• Wordcount in python
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
-file mapper.py \
-mapper mapper.py \
-file reducer.py \
-reducer reducer.py \
-input /datasets/plays/ \
-output pyout
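The mapper.py and reducer.py used in the talk are not reproduced here, so the following is only a sketch of what such a pair could look like. It relies on the Hadoop Streaming conventions that records are lines of text, that key and value are separated by a tab, and that the reducer receives its input sorted by key. Both steps are shown in one listing for brevity; in practice each would live in its own script reading standard input.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Yield one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as `script.py map` or `script.py reduce` to filter stdin to stdout.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

A quick local check without Hadoop is to pipe a text file through the map step, then sort, then the reduce step, which mimics what the framework does between the two phases.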
Applications in the Hadoop Ecosystem
• HBase (NoSQL database)
• Hive (Data warehouse with SQL-like language)
• Pig (SQL-style MapReduce)
• Mahout (Machine learning via MapReduce)
• Spark (Caching computation framework)
Spark
• Alternate programming framework using HDFS
• Optimized for in-memory computation
• Well supported in Java, Python, Scala
Spark Resilient Distributed Dataset (RDD)
• RDD for short
• Persistence-enabled data collections
• Transformations
• Actions
• Flexible implementation: memory vs. hybrid vs. disk
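The difference between transformations, actions, and persistence can be illustrated with a toy class in plain Python. This is not Spark's implementation (real RDDs are partitioned across a cluster and offer several storage levels); the class name ToyRDD and its evaluations counter are invented for the example, and only the method names map, filter, persist, collect, and count mirror the real RDD API.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions evaluate."""

    def __init__(self, source, steps=()):
        self._source = source    # the input collection
        self._steps = steps      # recorded transformations (the "lineage")
        self._persist = False
        self._cached = None
        self.evaluations = 0     # how many times the lineage was computed

    # Transformations: record the step and return a new ToyRDD; no work is done.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (lambda it: map(fn, it),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (lambda it: filter(pred, it),))

    def persist(self):
        """Ask for the computed result to be kept after the first action."""
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        data = iter(self._source)
        for step in self._steps:
            data = step(data)
        result = list(data)
        if self._persist:
            self._cached = result
        return result

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())


even_squares = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).persist()
print(even_squares.evaluations)  # 0: nothing has run yet
print(even_squares.count())      # 5
print(even_squares.collect())    # [0, 4, 16, 36, 64]
print(even_squares.evaluations)  # 1: the second action reused the persisted result
```

Without the persist() call, each action would recompute the whole chain from the source, which is the cost that persistence is meant to avoid.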
Spark example
• lettercount.py
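The lettercount.py shown in the talk is not included in these notes. As a rough idea of what a letter count could look like with the PySpark RDD API, here is a hypothetical sketch; the helper names are invented, and the Spark part only runs when the script is given an input path.

```python
import sys


def letters(line):
    """Return the alphabetic characters of a line, lower-cased."""
    return [ch.lower() for ch in line if ch.isalpha()]


def letter_counts(sc, path):
    """Count letter frequencies in the text at `path` using RDD operations."""
    lines = sc.textFile(path)                                # RDD of lines
    pairs = lines.flatMap(letters).map(lambda ch: (ch, 1))   # lazy transformations
    counts = pairs.reduceByKey(lambda a, b: a + b)           # still lazy
    return sorted(counts.collect())                          # action: runs the job


if __name__ == "__main__" and len(sys.argv) == 2:
    # Imported here so the helper functions can be loaded without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="LetterCount")
    for letter, count in letter_counts(sc, sys.argv[1]):
        print(letter, count)
    sc.stop()
```

It would be launched with spark-submit, passing the script and an input path such as the /datasets/plays/ directory used in the streaming examples.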
Spark Machine Learning Library
• Clustering (K-Means)
• Many others, list at
http://spark.apache.org/docs/1.0.1/mllib-guide.html
K-Means Clustering
• Randomly seed k cluster starting points (centroids)
• Assign each point to its nearest centroid
• Recompute each centroid as the mean of the points assigned to it
• If the centroids change, do it again
• If the centroids stay the same, they've converged and we're done.
• Awesome visualization:
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
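The loop described above can be sketched in plain Python. This is a small single-machine illustration of the standard algorithm (often called Lloyd's algorithm), not the distributed MLlib implementation; the function names, the optional initial argument, and the sample points are assumptions made for the example.

```python
import math
import random


def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Return (centroids, assignments) after the centroids stop changing."""
    points = [tuple(p) for p in points]
    # Seed the centroids: use the caller's, or pick k distinct input points.
    if initial is not None:
        centroids = [tuple(c) for c in initial]
    else:
        centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its points.
        updated = []
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            updated.append(mean(members) if members else centroids[i])
        if updated == centroids:  # unchanged centroids: converged
            break
        centroids = updated
    return centroids, assignments


# Two well-separated hypothetical clusters of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, 2, initial=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

A cluster that ends up with no points keeps its previous centroid here; other implementations re-seed it instead, and either choice is a design decision rather than part of the algorithm's definition.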
K-Means Examples
• spark-submit \
$SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3
• spark-submit \
$SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2
Questions?
• Thanks!
References and Useful Links
• HDFS shell commands:
http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
• Writing and running your first program:
http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197
• https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
• Hadoop Streaming:
http://hadoop.apache.org/docs/stable1/streaming.html
https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hadoop Stable API:
http://hadoop.apache.org/docs/r1.2.1/api/
• Hadoop Official Releases:
https://hadoop.apache.org/releases.html
• Spark Documentation:
http://spark.apache.org/docs/latest/