Hadoop & MapReduce Zhangxi Lin CAABI, Texas Tech University FIFE, Southwestern University of Finance & Economics Cellphone:18610660375, QQ/WeChat: 155970 http://zlin.ba.ttu.edu Zhangxi.lin@ttu.edu 2015-06-16 1 6/16/2015 Zhangxi Lin 1 CAABI, Texas Tech University ◦ Center for Advanced Analytics and Business Intelligence initially started in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business. Sichuan Key Lab of Financial Intelligence and Financial Engineering (FIFE), SWUFE ◦ One of two key labs in finance founded in 2008 and sponsored by Sichuan Provincial Government ◦ Underpinned by two areas in SWUFE: Information and Finance 2 ISQS 6339, Data Mgmt & BI 6/16/2015 Zhangxi Lin 2 Know Big Data One More Step When we talk about big data we must know what Hadoop is When we planning about data warehousing we must know what HDFS and NoSQL are. When we say data mining we must know what Mahout and H2O are. Do you know Hadoop data warehousing does not need dimensional modeling? Do you know how Hadoop stores heterogeneous data? Do you know what are Hadoop’s “Archeries heal”? Do you know you can install a Hadoop system in your Laptop? Do you know Alibaba has retired its last mini-computer in 2014? So, let’s talk about Hadoop 6/16/2015 Zhangxi Lin 3 After this lecture you will Understand what challenges are in big data management Understand how Hadoop and MapReduce works Get familiar to the Hadoop ecology Be able to install a Hadoop in your laptop Be able to install a handy big data tool in your laptop to visualize and mine data 6/16/2015 Zhangxi Lin 4 Outlines Apache Hadoop Hadoop Data Warehousing Hadoop ETL Hadoop Data Mining Data Visualization with Hadoop MapReduce Algorithm Setting up Your Hadoop Appendixes ◦ The Hadoop Ecological System ◦ Matrix calculation with MapReduce 6/16/2015 Zhangxi Lin 5 A Traditional Business Intelligence System MS SQL Server SSMS SSIS SSAS BIDS SSRS SAS EG SAS EM 6/16/2015 Zhangxi Lin 6 Hadoop ecosystem 6/16/2015 Zhangxi Lin 7 - 6/16/2015 Zhangxi Lin 8 What is Hadoop? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Hadoop is not a replacement for traditional RDMS but is a supplement to handle and process large datasets . It achieves two tasks: 1. Massive data storage. 2. Faster processing. Using Hadoop is cheaper, faster and better. 6/16/2015 Zhangxi Lin 9 Hadoop 2: Big data's big leap forward The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. The biggest constraint on scale has been Hadoop’s job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck. Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node. 6/16/2015 Zhangxi Lin 10 Hadoop 1.0 VS Hadoop 2.0 Hadoop 1.0 Hadoop 2.0 Features of Hadoop 2.0 over Hadoop 1.0: • Horizontal scalability of Namenode. • Namenode is no longer a single point of failure. • Ability to process Terabytes and Petabytes of data available in HDFS using Non-MapReduce applications such as MPI, GIRAPH. • The two major functionalities of overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons. 6/16/2015 Zhangxi Lin 11 Apache Spark Apache Spark is an open source cluster computing framework originally developed in the AMPlab at UC Berkley. Spark in-memory provides performance up to 100 times faster for certain applications. Spark is well suited for machine learning algorithms. Spark requires a cluster manager and a distributed storage system. Spark supports Hadoop YARN. 6/16/2015 Zhangxi Lin 12 MapReduce MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. 6/16/2015 Zhangxi Lin 13 MapReduce 2.0 – YARN (Yet Another Resource Negotiator) 6/16/2015 Zhangxi Lin 14 How Hadoop Operates 6/16/2015 Zhangxi Lin 15 Hadoop Ecosystem 6/16/2015 Zhangxi Lin 16 Hadoop Topics No: Topic 1 Data warehousing Components 2 Publicly available big data services HDFS, HBase, HIVE, KylinNoSQL/NewSQL, Solr Hortonworks, CloudEra, HaaS, EC2 3 MapReduce & Data mining Mahout, H2O, R, Python 4 Big data ETL Kettle, Flume, Sqoop, Impala,Chakwa. Dremel, Pig Oozie, ZooKeeper, Ambari, Loom, Ganglia Tomcat, Neo4J, Pig, Hue 5 Big data platform management 6 Application development platform 7 Tools & Visualizations 8 Streaming data processing Pentaho, Tableau Saiku, Mondrian, Gephi, Spark, Storm, Kafka, Avro 6/16/2015 Zhangxi Lin 17 HADOOP DATA WAREHOUSING 6/16/2015 Zhangxi Lin 18 Comparing the RDBMS and Hadoop data warehousing stack Layer Storage Metadata Query Hadoop Advantages of Hadoop over conventional RDBMS HDFS file system HDFS is purposebuilt for extreme IO speeds System tables HCatalog All clients can use HCatalog to read files. SQL query engine Multiple engines (SQL and nonSQL) Multiple query engines like Hive or Impala are available. Conventional RDBMS Database tables 6/16/2015 Zhangxi Lin 19 HDFS ( Hadoop Distributed File System) Hadoop ecosystem consists of many components and libraries for varied tasks. The storage part of Hadoop is HDFS and the processing part is MapReduce. HDFS is the a java based distributed file-system that stores data on commodity machines without prior organization, providing very high aggregate bandwidth across the clusters. 6/16/2015 Zhangxi Lin 20 HDFS Architecture & Design HDFS has a master/slave architecture. HDFS consists of a single NameNode and several number of DataNodes in a cluster. In HDFS files are split in one or more ’blocks’ and are stored in a set of DataNodes. HDFS exposes a file system namespace and allows user data to be stored in files. DataNodes serves read, write requests, performs block creation, deletion, and replication upon instruction from Namenode. 6/16/2015 Zhangxi Lin 21 6/16/2015 Zhangxi Lin 22 What is NoSQL? Stands for Not Only SQL NoSQL is a non-relational database management system. NoSQL is different from traditional relational database management systems in some significant ways. NoSQL is designed for distributed data stores where very large scale of data storing is needed (for example Google or Facebook which collects terabits of data every day for their users). These types of data storing may not require fixed schema, avoid join operations and typically scale horizontally. 6/16/2015 Zhangxi Lin 23 NoSQL 6/16/2015 Zhangxi Lin 24 - Praveen Ashokan 6/16/2015 Zhangxi Lin 25 What is NewSQL? • • • • • A modern RDBMS that seek to provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional database system. SQL as the primary interface Non- Locking Concurrency control High per-node performance H-Store parallel database system is the first known NewSQL system 6/16/2015 Zhangxi Lin 26 Classification of NoSQL and NewSQL 6/16/2015 Zhangxi Lin 27 Taxonomy of Big Data Stores 6/16/2015 Zhangxi Lin 28 Features of OldSQL vs NoSQL vs NewSQL 6/16/2015 Zhangxi Lin 29 6/16/2015 Zhangxi Lin 30 HBase • • • • HBase is a non-relational,distributed database It is a column-oriented DBMS It is an implementation of Google’s Big Table HBase is built on top of Hadoop File Distributed System(HDFS) 6/16/2015 Zhangxi Lin 31 Differences between HBase and Relational Database • • • • • • HBase is a column-oriented database while a Relational database is a row-oriented database HBase is highly scalable while RDBMS is hard to scale. Hbase has flexible schema while RDBMS has fixed schema HBase holds denormalized data while data in a Relational database is normalized The performance of HBase is good for large volumes of unstructured data while the performance is poor for a Relational database HBase does not use any query language while a Relational Database uses SQL to retrieve data 6/16/2015 Zhangxi Lin 32 HBase Data Model Column Family Row key TimeStamp value 6/16/2015 Zhangxi Lin 33 HBase: Keys and Column Families Each record is divided into Column Families 6/16/2015 Zhangxi Lin 34 What is Apache Hive? • The Apache Hive is data warehouse software facilitates querying and managing large datasets residing in distributed storage • It built on top of Apache Hadoop it provides tools to easy data extract/transform/load • It supports analysis of large datasets stored in Hadoop’s HDFS • It supports SQL-like language called HQL as well as big data analytics with the help of Map-Reduce 6/16/2015 Zhangxi Lin 35 What is HQL? • HQL : Hive Query Language • Doesn’t conform any ANSI standard • Very close to MySQL dialect, but with some differences • SQL to HQL cheat sheet http://hortonworks.com/wpcontent/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLt oHive.pdf • HQL doesn’t support transactions, so don’t compare with RDBMS 6/16/2015 Zhangxi Lin 36 HADOOP ETL 6/16/2015 Zhangxi Lin 37 List of Tools Sqoop Flume Impala Chukwa Kettle 6/16/2015 Zhangxi Lin 38 E T L 6/16/2015 Zhangxi Lin 39 Sqoop Is a short form of SQL to Hadoop Used to move back data back and forth between RDBMS and HDFS for performing analysis using BI Tools. Is a simple command line tool(Sqoop 2 is bringing web interface as well) 6/16/2015 Zhangxi Lin 40 How Sqoop Works Dataset Slice 1 Slice 1 Slice 2 Mapper 1 Mapper 1 Mapper 1 6/16/2015 Zhangxi Lin 41 Sqoop 1 & Sqoop 2 Feature Sqoop 1 Connectors for all major RDBMS Supported. Sqoop 2 Not supported. Workaround: Use the generic JDBC Connector which has been tested on the following databases: Microsoft SQL Server, PostgreSQL, MySQL and Oracle. This connector should work on any other JDBC compliant database. However, performance might not be comparable to that of specialized connectors in Sqoop. Encryption of Stored Passwords Not supported. No workaround. Supported using Derby's on-disk encryption.Disclaimer: Although expected to work in the current version of Sqoop 2, this configuration has not been verified. Data transfer from RDBMS to Hive or HBase Supported. Not supported. 1.Workaround: Follow this two-step approach.Import data from RDBMS into HDFS 2.Load data into Hive or HBase manually using appropriate tools and commands such as the LOAD DATA statement in Hive Data transfer from Hive or HBase to RDBMS 1.Not supported.Workaround: Follow this two-step approach.Extract data from Hive or HBase into HDFS (either as a text or Avro file) 2.Use Sqoop to export output of previous step to RDBMS Not supported. Follow the same workaround as for Sqoop 1. 6/16/2015 Zhangxi Lin 42 Sqoop 1 & Sqoop 2 Architecture For more on Differences https://www.youtube.com/watch?v=xzU3HL4ZYI0 6/16/2015 Zhangxi Lin 43 What is Flume ? Flume – It is a distributed, reliable service used for gathering, aggregating and transporting large amounts of streaming event data for analysis. Event data – streaming log data (website/application logs – to analyse user’s activity) or streaming data (e.g. social media – analyse an event, stock prices- to analyse a stock’s performance) 6/16/2015 Zhangxi Lin 44 Architecture and Working 6/16/2015 Zhangxi Lin 45 Impala –An open source SQL query engine Developed by Cloudera and fully open source, hosted on github. Released as beta in 10/2012 1.0 version available in 05/2013 6/16/2015 Zhangxi Lin 46 About Impala 6/16/2015 Zhangxi Lin 47 What is Chukwa Chukwa is an open source data collection system for monitoring large distributed systems. Used for log collection and analysis. Built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework Not a streaming database Not a real time system 6/16/2015 Zhangxi Lin 48 Why do we need Chukwa? Data monitoring and analysis. ◦ To collect system matrices and log files. ◦ To store data in Hadoop clusters Uses MapReduce to analyze data. ◦ Robust ◦ Scalable ◦ Rapid Data Processing 6/16/2015 Zhangxi Lin 49 How it Works? 6/16/2015 Zhangxi Lin 50 Data Analysis 6/16/2015 Zhangxi Lin 51 ETL Tools Sqoop Flume Features • • • • Bulk import Direct input Data interaction Data export • • • • Fan out Fan in Processors Auto-batching of events Multiplexing channels for data mining • • • Kettle • • • Migrating data between applications or databases Exporting data from databases to flat files Loading data massively into databases Data cleansing Integrating applications Advantage • • Parallel data transfer Efficient data Analysis • Reliable, Scalable, Manageable, Customizable, High Performance Feature Rich and Fully Extensible Contextual Routing • • Higher level than code • Well tested full suite of components • Data analysis tools • Free 6/16/2015 Disadvantage • Not east to manage installations and configurations • Have to weaken some delivery guarantees • • Not running fast Take some time to install Zhangxi Lin 52 Building a Datawarehouse in Hadoop using ETL Tools Copy data into HDFS with ETL tool (e.g. Informatica), Sqoop or Flume into standard HDFS files (write once). This registers the metadata with HCatalog. Declare the query schema in Hive or Impala, which doesn’t require data copying or re-loading, due to the schema-on-read advantage of Hadoop compared with schema-on-write constraint in RDBMS. Explore with SQL queries and launching BI tools e.g. Tableau, BusinessObjects for exploratory analytics. 6/16/2015 Zhangxi Lin 53 HADOOP DATA MINING 6/16/2015 Zhangxi Lin 54 What is Mahout? Meaning: A person who keep and drives an elephant – an Indian term Mahout is a scalable open source machine learning library hosted by Apache. Mahout core algorithms are implemented on top of Apache Hadoop using the Map/Reduce paradigm. 6/16/2015 Zhangxi Lin 55 Mahout’s position 6/16/2015 Zhangxi Lin 56 6/16/2015 Zhangxi Lin 57 Mapreduce flow in mahout 6/16/2015 Zhangxi Lin 58 What is H2O? H2O scales statistics, machine learning and math over BigData. H2O is extensible and users can build blocks using simple math legos in the core. H2O keeps familiar interfaces like R, Excel & JSON so that BigData enthusiasts & experts can explore, merge, model and score datasets using a range of simple to advanced algorithms. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. H2O has a vision of online scoring and modeling in a single platform 6/16/2015 Zhangxi Lin 59 H2O How is H2O Different form Mahout ? H2O Mahout Can use any of R, REST/JSON, GUI (browser), Java or Scala. Can use Java H2O is GUI product with less algorithms More number of Algorithms that need knowledge od Java Algorithms are typically 100x faster than current Map/Reduce-based Mahout Algorithms are typically slower compared to H2O. Knowledge of Java is NOT required to develop prediction model Knowledge of Java required to develop prediction model Real Time Not Real Time 6/16/2015 Zhangxi Lin 60 H2O Predictive Modeling Factories – Better Marketing with H2O • Advertising Technology – Better Conversions with H2O • Risk & Fraud Analysis – Better detection with H2O • Customer Intelligence – Better Sales with H2O Users of H2O • 6/16/2015 Zhangxi Lin 61 MAP/REDUCE ALGORITHM 6/16/2015 Zhangxi Lin 62 How to write a MapReduce program Parallelization is the key Algorithm is different from a single server application ◦ Map function ◦ Reduce function Considerations ◦ Load balance ◦ Efficiency ◦ Memory management 6/16/2015 Zhangxi Lin 63 MapReduce Executes 6/16/2015 Zhangxi Lin 64 Schematic of a map-reduce computation 6/16/2015 Zhangxi Lin 65 Example: counting the number of occurrences for each word in a collection of documents The input file is a repository of documents, and each document is an element. The Map function for this example uses keys that are of type String (the words) and values that are integers. The Map task reads a document and breaks it into its sequence of words w1,w2, . . . ,wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of keyvalue pairs: (w1, 1), (w2, 1), . . . , (wn, 1) 6/16/2015 Zhangxi Lin 66 Map Task A single Map task will typically process many documents. Thus, its output will be more than the sequence for the one document suggested above. If a word w appears m times among all the documents assigned to that process, then there will be m keyvalue pairs (w, 1) among its output. After all the Map tasks have completed successfully, the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-value pairs. That is, for each key k, the input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vn]), where (k, v1), (k, v2), . . . , (k, vn) are all the key-value pairs with key k coming from all the Map tasks. 6/16/2015 Zhangxi Lin 67 Reduce Task The output of the Reduce function is a sequence of zero or more key-value pairs. The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w,m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents. The application of the Reduce function to a single key and its associated list of values is referred to as a reducer. 6/16/2015 Zhangxi Lin 68 Big Data Visualization and Tools 6/16/2015 Zhangxi Lin 69 Big Data Visualization and Tools Tools : Tableau Pentaho Modrian Saiku Spotfire Gephi 6/16/2015 Zhangxi Lin 70 Tableau What is Tableau? Tableau is a visual analysis solution that allows people to explore and analyze data with simple drag and drop operations. 6/16/2015 Zhangxi Lin 71 Tableau Tableau Alliance Partners 6/16/2015 Zhangxi Lin 72 Tableau 6/16/2015 Zhangxi Lin 73 What is Pentaho? Pentaho is a commercial open source software for Business Intelligence (BI). Pentaho has been developed since 2004 in Orlando, Florida. Pentaho provides comprehensive reporting, OLAP analysis, dashboards, data integration, data mining and a BI platform. It is built under Java platform. Runs well under various platforms (Windows, Linux, Macintosh, Solaris, Unix, etc.) Has a complete package from reporting, ETL for warehousing data management, OLAP server data mining also dashboard. BI Platform supports Pentaho end to end business intelligence capabilities and provide central access to your business information, with back end security, integration, scheduling, auditing and more. Designed to meet the needs of any size organization. 6/16/2015 Zhangxi Lin 74 A few facts 6/16/2015 Zhangxi Lin 75 6/16/2015 Zhangxi Lin 76 Analyzer 6/16/2015 Zhangxi Lin 77 Reports 6/16/2015 Zhangxi Lin 78 Overall Features 6/16/2015 Zhangxi Lin 79 HADOOP IN YOUR LAPTOP 6/16/2015 Zhangxi Lin 80 Hortonworks Background Hortonworks is a Business computer software company based in Palo Alto,California Hortonworks supports & develops Apache Hadoop framework, that allows distributed processing of large data sets across clusters of computers They are the sponsors of Apache Software Foundation Founded in June 2011 by Yahoo and Benchmark capital as an independent company. It went public on December 2014 Below are the list of company collaborated with Hortonworks Microsoft on October 2011 to develop Azure & Window server Infomatica on November 2011 to develop HParser Teradata on February 2012 to develop Aster data system SAP AG on September 2012 announced it would resell Hortonworks distribution 6/16/2015 Zhangxi Lin 81 They do Hadoop using HDP 100% Network Data Usage 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Category 1 Before HDP With HDP 6/16/2015 Zhangxi Lin 82 Hortonworks Data Platform Hortonworks' product named Hortonworks Data Platform (HDP) includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. It includes Apache Projects like HDFS, MapReduce, Pig, Hive, Hbase, Zookeeepr and other components Why was it developed? It was develop with one aim to make Apache Hadoop ready for enterprise. What does it do? It takes Big Data component of Apache Hadoop and make them ready for prime time use in Enterprise Environment. 6/16/2015 Zhangxi Lin 83 HDP Functional Areas 6/16/2015 Zhangxi Lin 84 Certified Technology Program One of the most important aspects of the Technology Partner Program is the certification of partner technologies with HDP Hortonworks Certified Technology Program simplifies big data planning by providing pre-built and validated integrations between leading enterprise technologies and Hortonworks Data Platform (HDP) YARN Ready Certification, Security Ready, Operations Ready Governance Ready More Details: http://hortonworks.com/partners/certified/ 6/16/2015 Zhangxi Lin 85 How to get HDP? HDP is architected, developed, and built completely in the open. Anyone can download it from website http://hortonworks.com/hdp/downloads/ for free It comes with different version which can used as per need. HDP 2.2 on Sandbox – runs on VirtualBox or VMWare Automated (Amabri) – RHEL/Ubuntu/CentOS/SLES Manual – RHEL/Ubuntu/CentOS/SLES Windows – Windows Server 2008 & 2012 6/16/2015 Zhangxi Lin 86 Installing HDP IP address to login on the browser 6/16/2015 Zhangxi Lin 87 DEMO-HDP Below are the step we will be preforming in HDP Starting HDP Upload a source file Load in file in HCatalog Pig Basic Tutorial 6/16/2015 Zhangxi Lin 88 6/16/2015 Zhangxi Lin 89 About Cloudera Cloudera is “The commercial Hadoop company” Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo Provides consulting and training services for Hadoop users Staff includes several committers to Hadoop projects 6/16/2015 Zhangxi Lin 90 Who uses Cloudera? 6/16/2015 Zhangxi Lin 91 Cloudera Software (All Open-Source) Cloudera’s Distribution including Apache Hadoop (CDH) – A single, easy-to-install package from the Apache Hadoop core repository – Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version Components – Apache Hadoop – Apache Hive – Apache Pig – Apache HBase – Apache Zookeeper – Flume, Hue, Oozie, and Sqoop 6/16/2015 Zhangxi Lin 92 CDH and Enterprise Ecosystem 6/16/2015 Zhangxi Lin 93 Beyond Hadoop Hadoop is incapable of handling OLTP tasks because of its latency. Alibaba has deelop its own distributed system instead of using Hadoop. Currently, it takes Alipay’s system 20 ms to process a payment transaction, but 200 ms for fraud detection ◦ “2014年双十一交易额是多少?当大家还正在酣睡之时,双 十一的疯狂正在开始。11日凌晨,逆天的天猫双十一购物狂 欢节开场,今年每分钟支付成功的峰值为79万笔/分,对比 去年20万笔/分,较去年增长4倍,” 12306.cn has replaced its old system with VMware vFabric TM GemFire ® in-memory database system. This makes its services stable and robustic. 6/16/2015 Zhangxi Lin 94 HaaS(Hadoop as a Service) HaaS example Amazon Web Services(AWS) -Amazon Elastic MapReduce (EMR) providing Hadoop based platform for data analysis with S3 as the storage system and EC2 as the compute system Microsoft HDInsight, Cloudera CDH3, IBM Infoshpere BigInsights, EMC GreenPlum HD and Windows Azure HDInsight Service are the primary HaaS services by global IT giants APPENDIX 1: HADOOP ECOLOGICAL SYSTEM 6/16/2015 Zhangxi Lin 97 Choosing a right Hadoop architecture Application dependent Too many solution providers Too many choices 6/16/2015 Zhangxi Lin 98 Teradata Big Data Platform 6/16/2015 Zhangxi Lin 99 Dell’s Hadoop ecosystem 6/16/2015 Zhangxi Lin 100 Nokia’s Big Data Architechture 6/16/2015 Zhangxi Lin 101 Cloudera’s Hadoop System 6/16/2015 Zhangxi Lin 102 6/16/2015 Zhangxi Lin 103 6/16/2015 Zhangxi Lin 104 Intel 6/16/2015 Zhangxi Lin 105 Comparison of Two Generations of Hadoop 6/16/2015 Zhangxi Lin 106 6/16/2015 Zhangxi Lin 107 6/16/2015 Zhangxi Lin 108 Different Components of Hadoop 6/16/2015 Zhangxi Lin 109 6/16/2015 Zhangxi Lin 110 APPENDIX 2: MATRIX CALCULATION 6/16/2015 Zhangxi Lin 111 Map/Reduce Matrix Multiplication 6/16/2015 Zhangxi Lin 112 Map/Reduce – Scheme 1, Step 1 6/16/2015 Zhangxi Lin 113 Map/Reduce – Scheme 1, Step 2 6/16/2015 Zhangxi Lin 114 Map/Reduce – Scheme 2, Oneshot 6/16/2015 Zhangxi Lin 115 Communication Cost The sum of the communication cost of all the tasks implementing that algorithm. In addition to the amount of time to execute a task it also includes the time for moving data into the memory. ◦ The algorithm executed by each task tends to be very simple, often linear in the size of its input ◦ The typical interconnect speed for a computing cluster is one gigabit per second. ◦ The time taken to move the data from a chunk into the main memory may exceed the time needed to operate on the data. 6/16/2015 Zhangxi Lin 116 Reducer size The upper bound on the number of values that are allowed to appear in the list associated with a single key. Reducer size can be selected with at least two goals. ◦ By making the reducer size small, we can force there to be many reducers, according to which the problem input is divided by the Map tasks. ◦ We can choose a reducer size sufficiently small that we are certain the computation associated with a single reducer can be executed entirely in the main memory of the compute node where its Reduce task is located. The running time will be greatly reduced if we can avoid having to move data repeatedly between main memory and disk. 6/16/2015 Zhangxi Lin 117 Replication rate The number of key-value pairs produced by all the Map tasks on all the inputs, divided by the number of inputs. That is, the average communication from Map tasks to Reduce tasks (measured by counting key-value pairs) per input. 6/16/2015 Zhangxi Lin 118 Segmenting Matrix to Reduce the Cost 6/16/2015 Zhangxi Lin 119 Map/Reduce – Scheme 3 6/16/2015 Zhangxi Lin 120 Map/Reduce – Scheme 4, Step 1 6/16/2015 Zhangxi Lin 121 Map/Reduce – Scheme 4, Step 2 6/16/2015 Zhangxi Lin 122