Hadoop & Cheetah

Key words
• Cluster: a data center – lots of machines (thousands)
• Node: a server in a data center – commodity hardware that fails easily
• Slot: a fraction of a server – allows a server to be shared among many jobs
• Job: an application that the user wants to run – many of these run in the data center at once
• Task/Worker: a fraction of a job – allows you to achieve parallelism

Hadoop
• Exploits a large amount of resources easily
• Manages and automates:
– Failure recovery
– Scaling across a warehouse-scale computer (WSC)
– Dealing with hardware idiosyncrasies
– Resource management/sharing

Map-Reduce paradigm
• Master
– Runs a scheduling algorithm to place tasks on slots
• Scheduling allows for sharing
– Runs failure detection and resolution algorithms
– Monitors job progress and speeds up the job
– Partitions user data into chunks
– Determines the number of tasks (Maps and Reduces)
• Speed: more workers is faster; ideally more tasks than machines for parallelism
• State: master tracks O(M*R) state and makes O(M+R) scheduling decisions
• Worker: runs a map or reduce task

Map
• Input: a <key, value> pair
• Output: a list of <key2, value2> pairs
• Takes a key-value pair, does some preprocessing, and emits new key-value pairs
• The output is partitioned into N different files
– One for each reducer
– The output is stored on local disk in temporary storage

Reduce
• Input: a pair <key, list(values)>
• Output: a pair <key, value>
• Aggregates information from multiple mappers
• 3 stages:
– Shuffle: transfer data from all maps to the reduces
– Sort: sort the data that was transferred
– Reduce: aggregate the data
• Output is stored in persistent storage

Combine
• Essentially a local reduce
• EXCEPT:
– Runs on the same node as the map
– Runs only on the data a single mapper creates
– Does preliminary aggregation to reduce the amount of data transferred

Example Task
• "hello world. Goodbye world. Hello hadoop goodbye hadoop"

Failure Recovery
• Worker failure
– Detection: keep-alive pings
– Resolution: restart all tasks currently running on the failed worker
• For completed tasks:
– Map: restart (output lived on the failed node's local disk)
– Reduce: do nothing (output is stored in external storage)
• Master failure
– Periodically checkpoint the master's data structures
– If the master fails, roll back to the last stored checkpoint

Dealing with hardware: performance issues
• Locality
– Place a task on the node where its data is stored
– Otherwise, try to place the task close to the data
• Straggler detection and mitigation
– If a task is running very slowly, restart it elsewhere
– Jobs run 44% worse without this

Resource Management/Sharing
• A cluster scheduler shares the resources
– Decides which job should run when and where
– Sharing algorithms:
• FIFO: no real sharing
• Fair scheduler: each user is given a number of tokens
– The user's jobs must get at least that many slots
• Capacity scheduler
– Each job has a queue; tasks are serviced in FIFO order
– Determines the number of cluster slots to allocate to each queue

Problems
• How do you determine an effective partition function? Will hashing always work?
• How do you determine the optimal number of reducers?
• What is optimal scheduling? Resource-sharing algorithms.

Cheetah
• Relational data warehouses
– Highly optimized for storing and querying relational data
– Hard to scale to 100s or 1000s of nodes
• MapReduce
– Handles failures and scales to 1000s of nodes
– Lacks a declarative query interface
• Users have to write code to access the data
• May result in redundant code
• Requires a lot of effort and technical skill
• How do you get the best of both worlds?
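The map/combine/shuffle/reduce pipeline above can be sketched in plain Python for the word-count example. This is a single-process illustration, not Hadoop's actual API; the function names and the hash partitioner are illustrative.

```python
from collections import defaultdict

def map_fn(_key, value):
    # Map: emit <word, 1> for every word in the input line
    for word in value.lower().replace(".", "").split():
        yield word, 1

def combine_fn(word, counts):
    # Combine: pre-aggregate on the mapper's node to cut shuffle traffic
    yield word, sum(counts)

def reduce_fn(word, counts):
    # Reduce: final aggregation across all mappers
    yield word, sum(counts)

def run_job(lines, n_reducers=2):
    # Shuffle: hash-partition each combined map output across N reducers
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for line in lines:
        local = defaultdict(list)
        for k, v in map_fn(None, line):
            local[k].append(v)
        for k, vs in local.items():
            for k2, v2 in combine_fn(k, vs):
                partitions[hash(k2) % n_reducers][k2].append(v2)
    # Sort, then reduce, within each partition
    result = {}
    for part in partitions:
        for k in sorted(part):
            for k2, v2 in reduce_fn(k, part[k]):
                result[k2] = v2
    return result

print(run_job(["hello world. Goodbye world.", "Hello hadoop goodbye hadoop"]))
# {'hello': 2, 'world': 2, 'goodbye': 2, 'hadoop': 2} (in some order)
```

Note how the combiner already collapses the two "world"s from the first line before the shuffle, which is exactly the data-transfer saving the Combine slide describes.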
Main Challenges
• With Hadoop, it is hard to:
– Perform SQL-like joins
• Developers need to track:
– The location of tables on disk (HDFS)
• It is hard to get good performance out of vanilla Hadoop
– Crafty coding is needed to get good performance

Architecture
• Simple yet efficient
• Open: also provides a simple, non-SQL interface
• A query is sent to the node that runs the Query Driver
• The Query Driver translates the query into a MapReduce job
• Each node in the Hadoop cluster provides a data access primitive (DAP) interface

Performance Optimizations
• Data storage & compression
• MapReduce job configuration
• Multi-query optimization
• Exploiting materialized views
• Low-latency query optimization

Storage Format
• Text (in CSV format)
– Simplest storage format; commonly used for web access logs
• Serialized Java objects
• Row-based binary array
– Commonly used in row-oriented database systems
• Columnar binary array
• The storage format has a huge impact on both compression ratio and query performance
• Cheetah stores data in columnar format whenever possible

Columnar Compression
• The compression type for each column set is dynamically determined based on the data in each cell
• The best compression method is chosen during the ETL phase
• After a cell is created, it is further compressed using GZIP

MapReduce Job Configuration
• Number of map tasks: based on the number of input files and the number of blocks per file
• Number of reduce tasks: supplied by the job itself; has a big impact on performance
• Query output size matters:
– Small: the map phase dominates the total cost
– Large: a sufficient number of reducers is mandatory to partition the work
• Heuristics:
– The number of reducers is proportional to the number of GROUP BY columns in the query
– If the GROUP BY columns include a column with very large cardinality, the number of reducers is increased as well
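The reducer-count heuristics above could be sketched as follows. The proportionality constant, the cardinality threshold, and the 4x bump are illustrative assumptions, not Cheetah's actual values.

```python
def pick_num_reducers(group_by_cols, cardinalities,
                      base=4, high_cardinality=1_000_000):
    # Heuristic sketch: reducers proportional to the number of GROUP BY
    # columns (the constant `base` is an assumption)...
    n = base * max(1, len(group_by_cols))
    # ...and increased further if any GROUP BY column has very large
    # cardinality, since its partitions carry many more groups.
    if any(cardinalities.get(c, 0) > high_cardinality for c in group_by_cols):
        n *= 4
    return n

# Grouping by a low-cardinality column needs few reducers; adding a
# user-id-like column bumps the count up.
print(pick_num_reducers(["country"], {"country": 200}))            # 4
print(pick_num_reducers(["country", "user_id"],
                        {"country": 200, "user_id": 50_000_000}))  # 32
```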
MultiQuery Optimization
• Cheetah allows users to submit multiple queries simultaneously and execute them in a single batch, as long as the queries have the same FROM and DATES clauses
• Map phase
– A shared scanner shares the scan of the fact table and the joins to the dimension tables
– The scanner attaches a query ID to each output row
– Output from the different aggregation operators is merged into a single output stream
• Reduce phase
– Split the input rows based on their query IDs
– Send them to the corresponding query operators

Exploiting Materialized Views (1)
• Definition of materialized views
– Each materialized view only includes columns from the fact table, i.e., it excludes those on the dimension tables
– It is partitioned by date
• Both columns referred to in the query reside on the fact table (Impressions)
• The resulting virtual view has two types of columns: group-by columns and aggregate columns

Exploiting Materialized Views (2)
• View matching and query rewriting
– To make use of a materialized view, the query must:
• Refer to the virtual view that corresponds to the same fact table the materialized view is defined upon
• Have its non-aggregate columns in the SELECT and WHERE clauses be a subset of the materialized view's group-by columns
• Have aggregate columns that are computable from the materialized view's aggregate columns
– If these conditions hold, replace the virtual view in the query with the matching materialized view

Low-Latency Query Optimization
• The current Hadoop implementation has some nontrivial overhead itself
– E.g., job start time, JVM start time
• Problem: for small queries, this becomes significant extra overhead
• Fix: in the query translation phase, if the size of the input file is small, Cheetah may choose to read the file directly from HDFS and process the query locally

Questions
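The three view-matching conditions can be sketched as a simple predicate. The dict-based query/view representation and the `DERIVABLE` table are illustrative assumptions, not Cheetah's internals.

```python
# Which aggregate functions are computable from which stored ones:
# SUM/COUNT/MIN/MAX roll up from themselves; AVG needs SUM and COUNT.
DERIVABLE = {
    "SUM": [{"SUM"}], "COUNT": [{"COUNT"}],
    "MIN": [{"MIN"}], "MAX": [{"MAX"}],
    "AVG": [{"AVG"}, {"SUM", "COUNT"}],
}

def matches(query, view):
    # 1. The query must refer to the virtual view over the same fact table
    #    the materialized view is defined upon.
    if query["fact_table"] != view["fact_table"]:
        return False
    # 2. Non-aggregate columns in SELECT/WHERE must be a subset of the
    #    materialized view's group-by columns.
    if not set(query["non_agg_cols"]) <= set(view["group_by"]):
        return False
    # 3. Every query aggregate must be computable from the view's stored
    #    aggregates on the same column.
    for col, func in query["aggs"].items():
        stored = view["aggs"].get(col, set())
        if not any(need <= stored for need in DERIVABLE.get(func, [])):
            return False
    return True

view = {"fact_table": "Impressions",
        "group_by": {"date", "advertiser_id"},
        "aggs": {"clicks": {"SUM", "COUNT"}}}
ok = {"fact_table": "Impressions", "non_agg_cols": {"date"},
      "aggs": {"clicks": "AVG"}}      # AVG derivable from SUM + COUNT
bad = {"fact_table": "Impressions", "non_agg_cols": {"user_id"},
       "aggs": {"clicks": "SUM"}}     # user_id not in view's group-by set
print(matches(ok, view), matches(bad, view))  # True False
```

If `matches` returns True, the rewrite step simply swaps the virtual view in the query's FROM clause for the materialized view.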