Hadoop Hands On: Teaching MapReduce to Business Students through Analogy
BDA EdCon 2015 – Puerto Rico, August 12, 2015

Colin Conrad¹, Hossam Ali-Hassan², and Michael Bliemel²
¹ Dalhousie University, Faculty of Computer Science/Faculty of Management, Halifax, Canada
² Dalhousie University, Faculty of Management, Rowe School of Business, Halifax, Canada

Context: courses teaching BI&A
• Undergraduate course (commerce and management)
  – COMM4512: Business Intelligence
• Graduate course (MBA and MEC)
  – BUSI 6513: Business Analytics
• Non-technical business students
• Focus on end-user analytics
• Conceptual (e.g. Big Data) and practical (e.g. IBM Cognos Insight, SAP Predictive Analysis…)
• Need to simplify complex or very technical concepts

Introduction
• MapReduce paradigm
  – Processing very large datasets (“big data”)
  – Uses computer clusters
  – Google (early 2000s)
• Apache Hadoop
  – Popular open-source implementation of MapReduce
  – “The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing” (https://hadoop.apache.org/)
  – Increasing popularity and adoption
• Need to teach Hadoop/MapReduce to a broad audience

Problem: complexity…and attention span!
“When a user calls the MapReduce function, the user program triggers a multi-step process invoking the nodes of the cluster. The program begins by splitting the input files into manageable sizes, which are then assigned to various “worker” machines by a special “master” node. The master then assigns map and reduce tasks to the workers. Workers assigned with map tasks proceed to identify data. As the map workers make progress, the master notifies reduce workers of the location and nature of the processed data.
The reduce workers iterate over the sorted data and eventually pass the results of the reduce function to the master node, completing the MapReduce call.” (Conrad et al., 2015)

Problem: still complex…
Source: http://opensource.com/life/14/8/intro-apache-hadoop-big-data

Solution
• Analogy
  – Used to describe a complex or abstract subject by drawing on students’ prior knowledge of a different subject matter
  – Significant role in the teaching and learning of science (Treagust & Duit, 2015)
• Engagement
  – Absorption: heightened attention (Tellegen & Atkinson, 1974)
  – Flow: losing track of time (Csikszentmihalyi, 2014)
• Games and simulations: cognitive absorption and active learning (Agarwal & Karahanna, 2000)

Using playing cards to explain MapReduce
• Students are computers or nodes
• Groups of students/nodes make up clusters
• Cards are the data
• Stickers assign the “Task Tracker” and “Job Tracker” roles
• Multiple decks of cards, depending on class size and exercise (we used 6 in class)

Data Meaning
The multiple decks of cards represent raw data, only some of which is useful. For example, the cards could represent product reviews on webpages, where the number is the star rating, the suit is the product, and the other cards (A, J, Q, K, Jokers) are text on the page that is not useful.

Exercise 1: Which product has the best reviews?
• Randomly remove 10-20 cards from the pile
• What product has the best reviews? Hearts, Spades, Clubs or Diamonds?
• You need to sum all the points of all the cards of each suit

The Hadoop Distributed File System
• One student is the “Job Tracker”
• Each student at the end of a row is a “Task Tracker”
• The rest of the class are “Worker Nodes” that will do the data processing

Job Tracker
• The Job Tracker fairly distributes the work (cards) to each Task Tracker (the student at the end of the row). Each row in the class represents a cluster of nodes.
• Each Task Tracker then distributes the data to the Worker Nodes in its row so that the workload is evenly balanced. (The Task Tracker can also do work in this instance, but some Hadoop implementations have trackers only tracking.)

Map Process
• Each Worker Node now maps the data by Product and Review: sort the cards into piles by suit and discard the non-number cards
[Figure: Map Process, showing the Job Tracker, Task Trackers, and Worker Nodes]

Reduce Process
• The Task Tracker now moves cards between nodes to recombine them into suits, so that each Worker Node has one or two suits, and all cards of that suit
• Worker Nodes now sum up the total points and report them to the Task Tracker
• The Job Tracker asks each Task Tracker for its totals and then determines which product (suit) ranks highest
[Figure: Reduce Process, showing Worker Nodes, Task Trackers, and the Job Tracker, with example totals (60) (54) (96) (228) (72) (90) (198) (84) (120) (300) (84) (78) (60) (216) (84) (60)]

Exercise 2 (Timed Challenge): what is the missing card?
• Reshuffle all the cards
• Pick one card; do not reveal it to the class
• Give the data (cards) to the Job Tracker
• The Hadoop Cluster now has the job of finding the missing card
• The Job Tracker delegates the work to Task Trackers
• The Job Tracker balances the work between Task Tracker nodes
• The job is clear and specific, i.e. sort these by suit, then by number
• Data can move across Task Nodes

Conclusion
• Learning Outcomes
  – MapReduce and HDFS
  – Distributed computing
  – Open source software and adaptability
  – Under-performing nodes and reassignment of tasks
• Importance of debrief at the end of any game or simulation
• Student feedback
  – “engaging”, “exciting”, “challenging”, “interactive”, “immersive” and “memorable”
• Pedagogical value of analogy and games

Thank You
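
Appendix: the card exercises as code (illustrative sketch)
For readers who want to see the analogy in executable form, the two classroom exercises can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not Hadoop code and not part of the original exercise: the function names (`deal`, `map_hand`, `shuffle_by_suit`, `reduce_totals`, `best_product`, `find_missing`) are hypothetical, a deck is assumed to hold 52 cards plus 2 Jokers, number cards are assumed to be 2 through 10, and the Job Tracker/Task Tracker hierarchy is flattened into a single round-robin deal to `n_workers` hands.

```python
import random
from collections import Counter, defaultdict

SUITS = ["Hearts", "Spades", "Clubs", "Diamonds"]    # suit = product
NUMBER_RANKS = [str(n) for n in range(2, 11)]        # assumed: 2..10 are star ratings
OTHER_RANKS = ["A", "J", "Q", "K"]                   # "text on the page", discarded


def build_deck():
    """One deck as (rank, suit) tuples; assumes 2 Jokers, which have no suit."""
    deck = [(rank, suit) for suit in SUITS for rank in NUMBER_RANKS + OTHER_RANKS]
    return deck + [("Joker", None)] * 2


def deal(cards, n_workers):
    """Tracker step (simplified): deal the pile round-robin to the workers."""
    return [cards[i::n_workers] for i in range(n_workers)]


def map_hand(hand):
    """Map: discard non-number cards and emit (suit, points) pairs."""
    return [(suit, int(rank)) for rank, suit in hand if rank in NUMBER_RANKS]


def shuffle_by_suit(mapped_hands):
    """Shuffle: move all pairs for a given suit to one place."""
    grouped = defaultdict(list)
    for pairs in mapped_hands:
        for suit, points in pairs:
            grouped[suit].append(points)
    return grouped


def reduce_totals(grouped):
    """Reduce: sum the points collected for each suit."""
    return {suit: sum(points) for suit, points in grouped.items()}


def best_product(cards, n_workers=4):
    """Exercise 1: return (winning suit, totals per suit)."""
    hands = deal(cards, n_workers)
    totals = reduce_totals(shuffle_by_suit(map_hand(h) for h in hands))
    return max(totals, key=totals.get), totals


def find_missing(cards, n_decks, n_workers=4):
    """Exercise 2: each worker counts its hand; merged counts are compared
    with what n_decks complete decks should contain."""
    observed = sum((Counter(hand) for hand in deal(cards, n_workers)), Counter())
    expected = Counter(build_deck() * n_decks)
    return list((expected - observed).elements())


if __name__ == "__main__":
    pile = build_deck() * 6             # six decks, as used in class
    random.shuffle(pile)
    hidden = pile.pop()                 # Exercise 2: one card held back
    print(find_missing(pile, n_decks=6) == [hidden])
    print(best_product(pile[15:]))      # Exercise 1: 15 cards removed from the shuffled pile
```

The sketch mirrors the classroom roles loosely: `deal` stands in for the trackers handing out cards, `map_hand` for each Worker Node sorting and discarding, `shuffle_by_suit` for the Task Tracker recombining suits, and `reduce_totals` for the final summing. In Exercise 2 the per-worker `Counter` plays the part of sorting by suit and number, since a sorted hand makes a gap visible in the same way a count does.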