Teaching HDFS/MapReduce Systems Concepts to Undergraduates

Linh B. Ngo*, Edward B. Duffy**, Amy W. Apon*
* School of Computing, Clemson University
** Clemson Computing and Information Technology, Clemson University

Contents
• Introduction and Learning Objectives
• Challenges
• Hadoop Computing Platform: Options and Solution
• Module Content: Lectures, Assignments, Data
• Student Feedback
• Module Content: Project
• Ongoing and Future Work

Introduction and Learning Objectives
- Hadoop/MapReduce is an important current technology in the area of data-intensive computing
- Learning objectives:
  - Understand the challenges of data-intensive computing
  - Become familiar with the Hadoop Distributed File System (HDFS), the underlying driver of MapReduce
  - Understand the MapReduce (MR) programming model
  - Understand the scalability and performance of MR programs on HDFS

Challenges
- Provide students with a high-performance, stable, and robust Hadoop computing platform
- Balance lecture and hands-on lab hours
- Demonstrate the technical relationship between MapReduce and HDFS

Computing Platform Options
- MapReduce parallel programming interface
  - WebMapReduce is an example
  - Enables study of the MR programming model at the beginning level
  - Does not enable the study of HDFS for advanced students
- Dedicated shared Hadoop cluster with individual accounts
  - Multiple student programs compete for resources
  - Individual errors affect other students
- Dedicated cluster that supports multiple virtual Hadoop clusters
  - Not supported by Clemson's supercomputer configuration

Computing Platform Solution
- A modification of SDSC's myHadoop: an individual Hadoop platform deployment for each student in the class
- First setup:
  - A moderate amount of script editing required
  - Numerous errors due to typos or inability to configure
- Second setup:
  - Minimal editing required (one line)
  - Only a few students encountered errors, all due to typos

Lecture and Hands-on Labs
- Fall 2012: 5 class hours
  - 1 MR lecture, 1 lab, 1 HDFS lecture, 1 lab, 1 advanced MR optimization lecture
  - Lab time was not sufficient due to problems with the Hadoop computing platform
- Spring 2013: 5 class hours
  - Lab time was still not sufficient, due to errors in modifying the myHadoop scripts
- Fall 2013: 7 class hours
  - 1 MR lecture, 2 labs, 1 HDFS lecture, 2 labs, 1 HBase/Hive lecture

Module Content: Lectures
- Reused available online material (Jimmy Lin's slides from the University of Maryland) with additional clarification
- Strong emphasis on the following points:
  - The MR programming paradigm is a programming model that handles data parallelization
  - The HDFS infrastructure provides a robust and resilient way to distribute big data evenly across a cluster
  - The MR library takes advantage of HDFS and the MR programming paradigm so that programmers can write applications that handle big data conveniently and transparently
  - Data locality is the big theme in working with big data

Module Content: Lectures
[Figure: HDFS/MapReduce logical view. The NameNode keeps the block metadata for all HDFS files (File 01, File 02, File 03, ...) in memory; DataNodes report their block information to the NameNode; the JobTracker, which may run on the same machine as the NameNode, provides it with file/directory paths and receives block-level location information in return.]
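To make the figure concrete in lab, students can query the NameNode's block metadata directly through the standard Hadoop FileSystem client API. The sketch below is illustrative rather than part of the original course material: the class name and default input path are hypothetical, and it assumes the program runs with configuration files pointing at the student's own HDFS deployment.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Lists each block of one HDFS file and the DataNodes holding it. */
    public class BlockReport {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical default path; pass any HDFS file as the first argument instead.
        Path file = new Path(args.length > 0 ? args[0] : "/user/student/airline.csv");
        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this query from the block metadata held in its memory.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              b.getOffset(), b.getLength(), Arrays.toString(b.getHosts()));
        }
        fs.close();
      }
    }

Run against a multi-block file such as the 12 GB airline data, the output shows each block replicated on several DataNodes; this is exactly the block-level information the JobTracker receives when scheduling tasks, as the physical view on the next slide shows.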
Module Content: Lectures
[Figure: HDFS/MapReduce physical view. At the Linux file system level, the HDFS abstractions (directories/files) are stored as ordinary block files (blk_xxx) on each DataNode's local disk. Every worker node runs a DataNode daemon, which controls block location for HDFS, alongside a TaskTracker daemon, which executes map/reduce tasks on those blocks. TaskTrackers report progress to the JobTracker, which keeps detailed job progress in memory and assigns and facilitates map/reduce work on the TaskTrackers based on block location information from the NameNode.]

Module Content: Assignments and Data
- Assignments:
  - One MR programming assignment based on existing code, to familiarize students with the MR API and programming flow (a minimal sketch of this kind of starter code appears after the project slides below)
  - One MR/HDFS programming assignment that requires students to write an MR program and deploy it on a Hadoop computing platform
- Data:
  - Strive to be realistic: big enough, but not too big
  - Airline Traffic Data (12 GB), Google Trace (200 GB), Yahoo Music Ratings (10 GB), Movie Ratings (250 MB)

Student Feedback
- In-class voluntary surveys help to encourage all students to participate (as compared to an out-of-class online survey)
- IRB approval obtained for the survey
- Questions addressing:
  - Improvements in technical skills
  - Improvements in understanding of Hadoop/MR
  - Time taken to complete the Hadoop/MR assignments
  - Time taken to set up Hadoop on Palmetto
  - Usefulness of guides/lectures/labs
  - Relevancy of Hadoop/MR topics
  - Appropriate level at which to begin teaching Hadoop/MR

Student Feedback

Improvements in technical skills (self-rated, 0-10):
                                 Fall 2013         Spring 2014
                                 Start    End      Start    End
  Java                           6.62     7.28     6.06     7.32
  Linux                          5.86     7.10     6.67     7.73
  Networking concepts            4.38     6.29     5.26     7.54
  Hadoop/MapReduce concepts      0.03     4.53     0.30     6.67

Time taken (hours):
                                 Fall 2013         Spring 2014
  Assignment 1                   invalid*          7.2
  Assignment 2                   invalid*          2.5
  Hadoop setup on Palmetto       7.9               2.1
  * Question was phrased incorrectly in Fall 2013.

Helpfulness of materials (1-4, not useful to very useful):
                                 Fall 2013         Spring 2014
  Lectures                       3.0               3.6
  MapReduce lab                  3.6               3.3
  Hadoop-on-Palmetto guide/lab   2.9               3.13

Relevancy of Hadoop/MR to Distributed and Cluster Computing (0-10): 7.97 (Fall 2013), 8.77 (Spring 2014)
Appropriate undergraduate level to begin teaching Hadoop/MR: 2.91 (Fall 2013), 2.79 (Spring 2014)

Student Feedback
- Primary student requests:
  - Fall 2013:
    - More labs
    - More details in the HDFS guide
  - Spring 2014:
    - An FAQ addressing common configuration errors and the interpretation of MR compilation errors
    - More time for projects
    - Reduced dependency between the two Hadoop/MR assignments

Module Content: Project
- Added to the course in Spring 2014, in place of the assignments
- Three categories:
  - Data analytics: a big data set and an interesting analytic problem relating to the data
  - Performance comparison: a big data set and a comparison between Hadoop MapReduce and MPI
  - System implementation: augmenting myHadoop with additional software modules (Spark, HBase, or Hadoop 2.0)
- Reports required in IEEE two-column conference format

Module Content: Project
- Data sets:
  - Airline Traffic Data (12 GB)
  - NOAA Global Daily Weather Data (15-20 GB)
  - Amazon Food Reviews (354 MB; hundreds of thousands of entries)
  - Amazon Movie Reviews (8.7 GB; millions of entries)
  - MemeTracker (53 GB; text)
  - Million Song Dataset (196 GB; compressed HDF5)
  - Google Trace Data (~171 GB)

Module Content: Project
- Sample projects:
  - Comparing performance between Hadoop and MPI-MR (Sandia) using the Amazon Movie Reviews
  - Configuration and installation of Hadoop 2.0 on myHadoop
  - An Amazon crawler using an iterative implementation of Hadoop MR
  - Performance comparison between Hadoop, MPI, and MPI-IO on the NOAA data
  - Performance comparison between Hadoop, MPI, and MPI-IO on the Google trace data
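As referenced in the assignments slide above, the existing code that students start from is, in spirit, the canonical WordCount example that ships with the Hadoop distribution. A minimal sketch against the org.apache.hadoop.mapreduce API follows (Hadoop 2.x style; the course originally used Hadoop 1.x, where job construction differs slightly):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: one input line -> (word, 1) pairs. The framework runs one map task
      // per HDFS input split, preferably on a node that stores the split locally.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: the shuffle delivers all counts for one word together; sum them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The lecture's emphasis points are visible directly in the code: the student writes only per-record map logic and per-key reduce logic, while the MR library and HDFS handle data distribution, locality-aware scheduling, and the shuffle.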
Module Content: Project
- Positive evaluation:
  - Appropriateness of scope: 8.17/10
  - Appropriateness of difficulty: 7.74/10
  - Applicability of Hadoop/MR: 8.94/10
- Student feedback:
  - The project is an integral element of the module/course
  - More time is needed
  - Start the project earlier in the semester
  - Fewer assignments, more project work

Ongoing Work
- Transition to Hadoop 2.0
- Inclusion of other current distributed and data-intensive technologies:
  - Spark/Shark for in-memory computing
  - Cascading/Tez for workflow computing
  - Possibly Swift
- Inclusion of additional real-world data and problems in student projects

Questions?
Course materials:
Fall 2012: https://sites.google.com/a/g.clemson.edu/cp-cs-362/
Spring 2013: https://sites.google.com/a/g.clemson.edu/cpsc362-sp2013/
Fall 2013: https://sites.google.com/a/g.clemson.edu/cpsc362-fa2013/
Spring 2014: https://sites.google.com/a/g.clemson.edu/cpsc3620-sp2014/