Raspberry HadooPi: A Low-Cost, Hands-On Laboratory in Big Data and Analytics

Kenneth Fox, William M. Mongan, Jeffrey L. Popyack
Computer Science Department, College of Computing and Informatics, Drexel University, Philadelphia, PA
Introduction
The analysis and processing of large datasets is rapidly growing in
importance across a broad spectrum of disciplines such as pharmaceutical
research, economics, security, materials development, and simulations.
The volume of data we generate has outstripped our capacity to use it
meaningfully. Like networking, databases, concurrency, and security,
a basic understanding of distributed data processing is now an essential
programming skill.
MapReduce Unplugged – REThink
By teaching the MapReduce paradigm without a
computer, the Unplugged exercise gives students a
simple but concrete and relatable demonstration of
the MapReduce algorithm before they implement the
word count problem in Hadoop.
HadooPi Goes Mobile
The HadooPi MicroCluster is the result of an independent study project in the fall of 2013.
We built the MicroCluster to streamline the
setup and teardown process. During the
initial weeks of the independent study, all of
the components had to be individually
unpacked, set up, connected, and booted
before beginning any work or demonstrations,
sometimes consuming 15 minutes or more,
followed by a similar teardown. Now we just
plug it in and turn it on.
Executing massively scaled-out parallel operations on real super-scale
systems is prohibitively expensive, but teaching the concepts and skills
need not be. We have built a portable miniature classroom MicroCluster
using very low cost Raspberry Pi “System on a Board” computers and the
Apache Hadoop open source distributed processing framework to educate
students in parallel distributed processing.
The RPi in the upper right provides network
infrastructure functions -- it runs a basic
Raspbian build with DNS, email, and other
services, leaving the Hadoop instances on the
cluster nodes full access to their limited
processing power. Additionally, all of the RPis in
our standard build are configured to use
this system, simplifying the “getting started”
steps.
We tested the educational value of the approach with some of our STAR
CS students during the winter and spring 2014 quarters, and have
since integrated some of the material into our CS program.
Pedagogical Approach
Background
MapReduce, Hadoop, & Raspberry Pi
MapReduce Processing Model
• Scale-out vs. scale-up -- reading/writing data is the choke point:
even four parallel 100 MB/sec channels need nearly 3 hours to read a 4 TB drive
• Abstracts process scaling (number of nodes)
• Distributes the data and tasks among nodes (moves code to data)
• Mappers filter and transform unstructured data into input for reducers
• Partitioner/shuffler directs the data to the correct reducer nodes
• Reducers aggregate over the data to produce the final results (see the sketch below)
• Offline processing of data, not OLTP
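To make the model concrete, here is a minimal in-memory sketch of the word count flow in Python. It is illustrative only -- the function names, the toy input, and the hand-rolled shuffle are our own, not Hadoop APIs -- but it mirrors the mapper, partitioner/shuffler, and reducer stages listed above.

from collections import defaultdict

def mapper(line):
    # Filter/transform raw text into (key, value) pairs.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, emulating the partitioner/shuffler.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Aggregate all values for one key into a final result.
    return (key, sum(values))

lines = ["I am Sam", "Sam I am"]  # toy input
mapped = [pair for line in lines for pair in mapper(line)]
counts = [reducer(k, v) for k, v in shuffle(mapped).items()]
print(sorted(counts))  # [('am', 2), ('i', 2), ('sam', 2)]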
____________________________________________________________________
Hadoop
Hadoop is a Java-based distributed data processing
framework supported by the Apache Software Foundation.
It provides:
• a distributed file system that automatically spreads the data among all of the
nodes in an active cluster
• algorithms that schedule map tasks on the nodes closest to the data they need
• fault tolerance for node failures
• processing flexibility through support for Java, Python, and other languages
• facilities for navigating the file system and controlling the system
• web-based dashboards
• a large body of examples
• a healthy developer and support community
____________________________________________________________________
Raspberry Pi
The Raspberry Pi is a very low cost ($35 USD),
credit card-sized computer using a low power
ARM11 Von Neumann architecture processor
that promotes hands-on experimentation with
computer hardware and software.
Developed by the Raspberry Pi Foundation (www.raspberrypi.org), motivated by the
marked reduction, compared to the 1980s and 1990s, in students entering university
with actual hands-on hardware and software experience.
• Rapid growth - over 5 million sold as of 2/18/15 (2 million since 5/11/2014!)
• Continuing evolution and extensive support
• Added Model B+ with more USB ports
• Just added the Raspberry Pi 2 with a faster 900 MHz quad-core ARM Cortex-A7
CPU (~6x performance) and 1 GB RAM (http://www.raspberrypi.org/raspberry-pi2-on-sale/)
A Larger Example Using All 3 Nodes
Offline Lesson on how MapReduce works using a deck of playing cards
Admin screen at left shows status
for all jobs in the cluster.
Manual steps emulate the process MapReduce follows (simulated in Python below):
 Pass out cards -- map
 Each person tallies their hand by suit, writing one slip per suit
 Pass each suit's slip to the person counting that suit -- shuffle
 Sum the slips up -- reduce
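A short Python simulation of the card exercise (our own toy code, not part of the lesson materials) makes the correspondence explicit: dealing hands distributes the input, tallying is the map step, passing slips by suit is the shuffle, and summing the slips is the reduce.

import random
from collections import defaultdict

suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]
random.shuffle(deck)
hands = [deck[i::4] for i in range(4)]  # deal: 4 students, 13 cards each

# Map: each student tallies their own hand, writing one slip per suit.
slips = []
for hand in hands:
    tally = defaultdict(int)
    for rank, suit in hand:
        tally[suit] += 1
    slips.extend(tally.items())

# Shuffle: each slip goes to the person counting that suit.
by_suit = defaultdict(list)
for suit, count in slips:
    by_suit[suit].append(count)

# Reduce: each counter sums the slips for their suit.
totals = {suit: sum(counts) for suit, counts in by_suit.items()}
print(totals)  # every suit totals 13 with a full deck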
Some of the administrative tools
that Hadoop supplies
Below: all of the nodes in the cluster.
Access individual node status through the link.
Hadoop on RPis
 Each student was given a Raspberry Pi, a 32 GB SD card, and setup
instructions for Hadoop (https://CS498.blogspot.com)
 Each student built and ran the word count example on their own local standalone
cluster
 Next, the students added their nodes to the MicroCluster to scale out
the word count example (a Python streaming variant is sketched below)
 They observed the relative processing speed improvement with more nodes.
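The students ran Hadoop's bundled Java word count, but the same job can be written for Hadoop Streaming in Python, which Hadoop also supports. A minimal sketch (the file names are our own): the mapper emits one word<TAB>1 line per word, and the reducer sums consecutive counts, relying on the framework to sort mapper output by key.

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word.lower())

#!/usr/bin/env python
# reducer.py -- sum counts per word; input arrives sorted by key
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))

On a Hadoop 1.x cluster such a job is submitted through the streaming jar, e.g. bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar -mapper mapper.py -reducer reducer.py -input /fs/hduser/books -output /fs/hduser/books-output (the jar's exact path varies by installation).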
Big Enough Data or Little Big Data
Samples of data that are big enough to demonstrate the use of the tools, models, and
algorithms but don’t take hours or days to run.
Smaller samples (often called reference datasets in industry) return results quickly
enough to keep students engaged and allow for faster debugging of typical problems
such as parsing errors.
 Word counts using large volumes of text
 US ZIP code database
 US Patent Database (see the mapper sketch below)
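For instance, a hedged sketch of a mapper for the patent citation data (assuming the two-column CITING,CITED comma-separated format used in Hadoop in Action; the file name and format are our assumption) shows how little changes between problems -- the word count reducer above is reused unchanged to total citations per patent:

#!/usr/bin/env python
# citation_mapper.py -- emit "cited_patent<TAB>1" per citation pair
import sys
for line in sys.stdin:
    parts = line.strip().split(",")
    if len(parts) != 2 or not parts[1].isdigit():
        continue  # skip the header and malformed lines -- the sort of
                  # parsing error these small datasets surface quickly
    print("%s\t1" % parts[1])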
Example run using the text of Dr. Seuss's Green Eggs and Ham as the input text
Output from Hadoop-Master:
hduser@node00:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /fs/hduser/books
/fs/hduser/books-output11
15/03/02 06:00:58 INFO input.FileInputFormat: Total input paths to process : 11
15/03/02 06:00:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
15/03/02 06:00:59 WARN snappy.LoadSnappy: Snappy native library not loaded
15/03/02 06:01:07 INFO mapred.JobClient: Running job: job_201503020011_0002
15/03/02 06:01:08 INFO mapred.JobClient: map 0% reduce 0%
15/03/02 06:04:05 INFO mapred.JobClient: map 1% reduce 0%
… Notice that reducing starts before mapping completes
15/03/02 06:12:08 INFO mapred.JobClient: map 42% reduce 0%
15/03/02 06:12:09 INFO mapred.JobClient: map 42% reduce 6%
…
15/03/02 06:41:03 INFO mapred.JobClient: map 100% reduce 100%
15/03/02 06:41:45 INFO mapred.JobClient: Job complete: job_201503020011_0002
15/03/02 06:41:47 INFO mapred.JobClient: Counters: 30
15/03/02 06:41:47 INFO mapred.JobClient: Job Counters
15/03/02 06:41:47 INFO mapred.JobClient: Launched reduce tasks=1
15/03/02 06:41:47 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=7804889
15/03/02 06:41:47 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving
slots (ms)=0
15/03/02 06:41:47 INFO mapred.JobClient: Total time spent by all maps waiting after reserving
slots (ms)=0
15/03/02 06:41:47 INFO mapred.JobClient: Rack-local map tasks=5
15/03/02 06:41:47 INFO mapred.JobClient: Launched map tasks=13
15/03/02 06:41:47 INFO mapred.JobClient: Data-local map tasks=8
15/03/02 06:41:47 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1875031
15/03/02 06:41:47 INFO mapred.JobClient: File Output Format Counters
15/03/02 06:41:47 INFO mapred.JobClient: Bytes Written=1371372
15/03/02 06:41:47 INFO mapred.JobClient: FileSystemCounters
15/03/02 06:41:47 INFO mapred.JobClient: FILE_BYTES_READ=5083231
15/03/02 06:41:47 INFO mapred.JobClient: HDFS_BYTES_READ=10599597
15/03/02 06:41:47 INFO mapred.JobClient: FILE_BYTES_WRITTEN=8807248
15/03/02 06:41:47 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1371372
15/03/02 06:41:47 INFO mapred.JobClient: File Input Format Counters
15/03/02 06:41:47 INFO mapred.JobClient: Bytes Read=10598017
15/03/02 06:41:47 INFO mapred.JobClient: Map-Reduce Framework
15/03/02 06:41:47 INFO mapred.JobClient: Map output materialized bytes=3067689
15/03/02 06:41:47 INFO mapred.JobClient: Map input records=217458
15/03/02 06:41:47 INFO mapred.JobClient: Reduce shuffle bytes=3067689
15/03/02 06:41:47 INFO mapred.JobClient: Spilled Records=553775
15/03/02 06:41:47 INFO mapred.JobClient: Map output bytes=17945639
15/03/02 06:41:47 INFO mapred.JobClient: Total committed heap usage (bytes)=2241003520
15/03/02 06:41:47 INFO mapred.JobClient: CPU time spent (ms)=3115860
15/03/02 06:41:47 INFO mapred.JobClient: Combine input records=1998651
15/03/02 06:41:47 INFO mapred.JobClient: SPLIT_RAW_BYTES=1580
15/03/02 06:41:47 INFO mapred.JobClient: Reduce input records=208343
15/03/02 06:41:47 INFO mapred.JobClient: Reduce input groups=119283
15/03/02 06:41:47 INFO mapred.JobClient: Combine output records=345432
15/03/02 06:41:47 INFO mapred.JobClient: Physical memory (bytes) snapshot=2014486528
15/03/02 06:41:47 INFO mapred.JobClient: Reduce output records=119283
15/03/02 06:41:47 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4282273792
15/03/02 06:41:47 INFO mapred.JobClient: Map output records=1861562
hduser@node00:/usr/local/hadoop$
Status of a single node:
Hardware Challenges
Raspberry Pis
• The limited processing power of the earlier RPi Model Bs we used
presented the following difficulties:
 Slow run times limit the size of the data set - larger data sets are
typically where the effects of a distributed workload can be seen
 Slow native kernel compiles necessitate a separate full-size
system capable of compiling the kernel
• At the time, there was limited hardware support for accessory devices like
Wi-Fi dongles
• Recompiling the kernel can be beyond a student's ability, resulting in
frustration
• Not all 32 GB SD cards are created equal - non-identical SD cards don't
clone reliably (if at all).
Future Work
• Prepare additional canned lessons
• Eliminate the power strip by substituting a 12 V gel cell and a small charger,
providing a built-in UPS and true field portability. (The network switch and display
are natively 12 VDC, and the RPis can use an inexpensive DC-DC converter.)
• Dynamic addition and deletion of nodes to a running Hadoop cluster
• Implement Backup/Secondary NameNode/hot failover
• Modify cluster to demonstrate hot failover (switches)
• Larger cluster size
• Local PC multi-node virtualization in a container:
1) Run virtualization software (a hypervisor) such as VMware, KVM/QEMU, or
Hyper-V on a single machine
2) Run multiple Linux VMs inside the hypervisor, each performing the function of
an individual node
3) Implement master, backup, slaves, etc.
4) Purpose: understanding of HDFS/DFSs, failover, scaling of nodes, and testing
and debugging of programs -- not for demonstrating performance improvements
• Prepare non-trivial problems with deeper programming elements -- a specified
purpose and process, well-defined inputs, known outputs, and a reference program.
Node01 Was the Busiest
Technical Challenges
We faced a number of technical challenges building a stable, consistent,
reliable environment.
Hadoop version 1.2.1
• Becoming familiar with the architecture, its commands, and the
environment.
• Understanding Hadoop’s error messages
• Hypersensitivity to configuration changes made the startup process
fragile.
• Every configuration change such as adding or deleting a node, mode
changes, etc. required a shutdown and restart of Hadoop.
• Requires static naming and addressing of nodes and preloading of
node names on every machine (sketched below).
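For example, Hadoop 1.x discovers its workers from flat files, so every node must be preloaded with consistent names. A minimal sketch of the pieces involved, assuming a three-node cluster (node00 appears in our logs; the addresses and port are typical tutorial values, not taken from our build):

# /etc/hosts -- identical on every RPi
192.168.1.100  node00   # master: NameNode and JobTracker
192.168.1.101  node01
192.168.1.102  node02

# conf/slaves on the master -- one worker hostname per line
node01
node02

# conf/core-site.xml on every node -- all point at the same NameNode
<property>
  <name>fs.default.name</name>
  <value>hdfs://node00:54310</value>
</property>

Because Hadoop 1.x reads these files only at startup, any change to them is exactly the kind of edit that forced the shutdown-and-restart cycle described above.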
Acknowledgements
RET in Engineering and Computer Science Site for Machine Learning, Big Data and
CS Principles, National Science Foundation, DUE-0837665, July 2013-June 2016.
This material is based upon work supported by the National Science Foundation
under Grant No. CNS-1301171. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation.
Supported in part by IBM Big Data Faculty Awards 2013.
Toby Meyer, Starting Small: Set Up a Hadoop Compute Cluster Using Raspberry Pis,
http://blog.ittoby.com/2013/08/starting-small-set-up-hadoop-compute.html.
Max Mattes, Drexel University - Hadoop MapReduce Algorithm Graphic. (Not shown)
Zachary Parmelee, Video Game Art and Design Student, Art Institute of Philadelphia HadooPi MicroCluster logo.
References
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large
Clusters. In OSDI '04: 6th Symposium on Operating Systems Design and
Implementation, sponsored by USENIX in cooperation with ACM SIGOPS, pages
137-150, 2004.
C. Lam. Hadoop in Action. Manning Publications, Greenwich, CT, 2011.