Teaching HDFS/MapReduce Systems Concepts to Undergraduates
Linh B. Ngo*, Edward B. Duffy**, Amy W. Apon*
* School of Computing, Clemson University
** Clemson Computing and Information Technology,
Clemson University
Contents
• Introduction and Learning Objectives
• Challenges
• Hadoop Computing Platform: Options and Solution
• Module Content: Lectures, Assignments, Data
• Student Feedback
• Module Content: Project
• Ongoing and Future Work
Introduction and Learning Objectives
- Hadoop/MapReduce is an important current
technology in the area of data-intensive computing
- Learning objectives:
- Understand the challenges of data-intensive computing
- Become familiar with the Hadoop Distributed File System
(HDFS), the underlying driver of MapReduce
- Understand the MapReduce (MR) programming model
- Understand the scalability and performance of MR
programs on HDFS
Challenges
- Provide students with a high performance, stable,
and robust Hadoop computing platform
- Balance lecture and hands-on lab hours
- Demonstrate the technical relationship between
MapReduce and HDFS
Computing Platform Options
- MapReduce parallel programming interface
- WebMapReduce is an example
- Enables study of MR programming model at beginning level
- Does not enable the study of HDFS for advanced students
- Dedicated shared Hadoop cluster with individual
accounts
- Multiple student programs compete for resources
- Individual errors affect other students
- Dedicated cluster that supports multiple virtual Hadoop
clusters
- Not supported by Clemson’s supercomputer configuration
Computing Platform Solution
- Modification of SDSC’s myHadoop
- Individual Hadoop platform deployment for each
student in the class
- First setup:
- A moderate amount of script editing needed for setup
- Numerous student errors due to typos or failed configuration
- Second setup:
- Minimal amount of editing needed (one line)
- Only a few students encountered errors due to typos
Lecture and Hands-on Labs
- Fall 2012: 5 class hours
- 1 MR lecture, 1 lab, 1 HDFS lecture, 1 lab, 1 advanced MR optimization lecture
- Lab time not sufficient due to problems with Hadoop
computing platforms
- Spring 2013: 5 class hours
- Lab time still not sufficient, due to errors in modifying
myHadoop scripts
- Fall 2013: 7 class hours
- 1 MR lecture, 2 labs, 1 HDFS lecture, 2 labs, 1 HBase/Hive
lecture
Module Content: Lectures
- Reused available online material with additional clarification
- Slides from UMD, Jimmy Lin
- Strong emphasis on the following points:
- The MR programming paradigm is a programming model that
handles data parallelization
- The HDFS infrastructure provides a robust and resilient way to
distribute big data evenly across a cluster
- The MR library takes advantage of HDFS and the MR programming paradigm to let programmers write applications that handle big data conveniently and transparently (see the WordCount sketch below)
- Data locality is the big theme in working with big data
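To make the programming model concrete in lecture, the canonical WordCount example can be walked through. The sketch below follows the standard Apache Hadoop WordCount and is not necessarily the exact code used in the module; it uses the Hadoop 1.x mapreduce API to match the JobTracker/TaskTracker architecture on the next slide.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups values by word; sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner illustrates the data-locality theme: counts are pre-summed on the node that holds each block before anything crosses the network.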
Module Content: Lectures
[Figure: HDFS/MapReduce architecture. The NameNode and JobTracker (possibly on the same machine) keep block metadata and detailed job progress in memory; the JobTracker provides the NameNode with file/directory paths and receives block-level information. On each worker node, an HDFS DataNode daemon controls block location and reports block information to the NameNode, while a MapReduce TaskTracker daemon executes tasks on blocks. HDFS presents the abstraction of directories/files; the physical view at the Linux FS is blk_xxx block files on each node's local disk.]
- TaskTrackers report progress to the JobTracker
- The JobTracker assigns work and facilitates map/reduce tasks on TaskTrackers based on block location information from the NameNode (the snippet below queries exactly this information)
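One way to make the NameNode's role tangible in lab is to query it directly through the HDFS client API. The snippet below is a hypothetical lab exercise, not material from the original module: it asks the NameNode which DataNodes hold each block of a file that the student has already loaded into HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster's HDFS settings (core-site.xml, hdfs-site.xml).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]); // e.g. a file already copied into HDFS
    FileStatus status = fs.getFileStatus(file);

    // The NameNode answers from its in-memory block metadata:
    // which DataNodes hold each block of this file?
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + java.util.Arrays.toString(b.getHosts()));
    }
    fs.close();
  }
}
```

Running this against a multi-gigabyte file shows blocks spread evenly across DataNodes, which is the same information the JobTracker uses to schedule map tasks near their data.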
Module Content: Assignments and Data
- Assignments
- One MR programming assignment, based on existing code, that familiarizes students with the MR API and programming flow
- One MR/HDFS programming assignment that requires students to write an MR program and deploy it on a Hadoop computing platform (see the hypothetical sketch after the data list below)
- Data
- Strive to be realistic
- Big enough, but not too big
- Airline Traffic Data (12 GB), Google Trace (200 GB), Yahoo Music Rating (10 GB), Movie Rating (250 MB)
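For flavor, a job of the kind students might write over the airline data could compute the average arrival delay per carrier. The sketch below is hypothetical, not the actual assignment, and it assumes a comma-separated record layout with the carrier code at field index 8 and the arrival delay in minutes at field index 14, as in the widely used ASA airline on-time dataset.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CarrierDelay {

  // Map: emit (carrier, arrival delay) for each valid flight record.
  public static class DelayMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      // Assumed layout: carrier code at index 8, arrival delay at index 14.
      if (fields.length > 14 && !fields[14].equals("NA")) {
        try {
          double delay = Double.parseDouble(fields[14]);
          context.write(new Text(fields[8]), new DoubleWritable(delay));
        } catch (NumberFormatException e) {
          // skip the header line and malformed records
        }
      }
    }
  }

  // Reduce: average the delays for each carrier.
  public static class AvgReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable v : values) {
        sum += v.get();
        count++;
      }
      context.write(key, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "carrier delay");
    job.setJarByClass(CarrierDelay.class);
    job.setMapperClass(DelayMapper.class);
    job.setReducerClass(AvgReducer.class); // no combiner: partial averages don't compose
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Unlike WordCount, the reducer cannot double as a combiner here, which is a useful discussion point: averaging is not associative unless counts are carried along with partial sums.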
Student Feedback
- In-class voluntary surveys encourage all students to participate (compared to out-of-class online surveys)
- IRB approval for survey
- Questions addressing:
- Improvements in technical skills
- Improvements in understanding of Hadoop/MR
- Time taken to complete Hadoop/MR assignments
- Time taken to set up Hadoop on Palmetto
- Usefulness of guides/lectures/labs
- Relevancy of Hadoop/MR topics
- Appropriate level to begin teaching Hadoop/MR
Student Feedback

Improvements in Technical Skills (0-10)    Fall 2013 (Start / End)   Spring 2014 (Start / End)
Java                                       6.62 / 7.28               6.06 / 7.32
Linux                                      5.86 / 7.1                6.67 / 7.73
Networking Concepts                        4.38 / 6.29               5.26 / 7.54
Hadoop MapReduce Concepts                  0.03 / 4.53               0.3 / 6.67

Time taken to complete Hadoop/MR assignments (hours)
(Fall 2013 responses invalid due to incorrect phrasing of the question)
Assignment 1                               -                         2.5
Assignment 2                               -                         2.1
Time taken to set up Hadoop (hours)        7.2                       7.9

Helpfulness of materials (1-4, not useful to very useful)
Lectures                                   3                         3.6
MapReduce Lab                              3.6                       3.3
Hadoop-on-Palmetto guide and lab           2.9                       3.13

Relevancy of Hadoop/MR to Distributed
and Cluster Computing                      7.97                      8.77
Appropriate undergraduate level to
begin teaching Hadoop/MR                   2.91                      2.79
Student Feedback
Primary student requests:
• Fall 2013
- More labs
- More details in HDFS guide
• Spring 2014
- FAQ to address common configuration
errors/interpretation of MR compilation errors
- More time for projects
- Reduced dependency between the two Hadoop/MR assignments
Module Content: Project
- Was added to the course in Spring 2014
- Project in place of assignments
- Three categories:
- Data Analytics
- Big data set
- Interesting analytic problem relating to data
- Performance Comparison
- Big data set
- Comparison between Hadoop MapReduce and MPI
- System Implementation
- Augmenting myHadoop with additional software modules: Spark,
HBase, or Hadoop 2.0
- Required IEEE two-column conference format for reports
Module Content: Project
- Data Sets:
- Airline Traffic Data (12 GB)
- NOAA Global Daily Weather Data (15-20 GB)
- Amazon Food Reviews (354 MB; hundreds of thousands of entries)
- Amazon Movie Reviews (8.7 GB; millions of entries)
- Meme Tracker (53 GB; text)
- Million Song Dataset (196 GB; compressed HDF5)
- Google Trace Data (~171 GB)
Module Content: Project
- Comparing performance between Hadoop and MPI-MR
(Sandia) using Amazon Movie Reviews
- Configuration and installation of Hadoop 2.0 on myHadoop
- Amazon Crawler using iterative implementation of Hadoop
MR
- Performance comparison between Hadoop/MPI/MPI-IO on
NOAA data
- Performance comparison between Hadoop/MPI/MPI-IO on
Google Trace data
Module Content: Project
- Positive Evaluation
- Appropriateness of scope: 8.17/10
- Appropriateness of difficulty: 7.74/10
- Applicability of Hadoop/MR: 8.94/10
- Student Feedback
- An integral element of the module/course
- More time is needed
- Start the project earlier in the semester
- Fewer assignments, more project work
Ongoing Work
- Transition to Hadoop 2.0
- Inclusion of other current distributed and data-intensive technologies:
- Spark/Shark for in-memory computing
- Cascade/Tez for workflow computing
- Swift?
- Inclusion of additional real world data and problems
in student projects
Questions?
Fall 2012:
https://sites.google.com/a/g.clemson.edu/cp-cs-362/
Spring 2013:
https://sites.google.com/a/g.clemson.edu/cpsc362-sp2013/
Fall 2013:
https://sites.google.com/a/g.clemson.edu/cpsc362-fa2013/
Spring 2014:
https://sites.google.com/a/g.clemson.edu/cpsc3620-sp2014/