Hadoop & Cheetah

Key words
• Cluster → a data center
– Lots of machines (thousands)
• Node → a server in a data center
– A commodity device that fails very easily
• Slot → a fraction of a server
– Allows you to share a server among many jobs
• Job → an application that the user wants to run
– There are multiple of these in the data center
• Task/Worker → a fraction of a job
– Allows you to achieve parallelism
Hadoop
• Exploit large amount of resources easily
• Manages and automates
– Failure recovery
– Scale of WSC
– Dealing with hardware idiosyncrasies
– Resource management/sharing
Map-Reduce paradigm
• Master
– Runs a scheduling algo to place tasks on a slot
• Scheduling allows for sharing
– Runs failure detection and resolution algorithms
– Monitors job progress and speeds up the job
– Partitions user data into chunks
– Determines the number of tasks (Maps and Reduces)
• Speed: more workers → faster; ideally more workers than machines for parallelism
• State: master tracks O(M*R) state and makes O(M+R) scheduling decisions
• Worker: runs reduce or map task
Map
• Input: <key, value>
• Output: a list of <key2, value2>
• Takes a key-value pair, does some preprocessing, and creates another key-value pair (see the sketch below)
• The output is partitioned into R different files
– One for each reducer
– The output is stored on local disk in temporary storage
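A minimal sketch (not from the slides) of such a map function, written against Hadoop's org.apache.hadoop.mapreduce Java API for the word-count task; class and variable names are illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits <word, 1> for every token in an input line.
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);  // the framework hashes the key into R partitions
        }
      }
    }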
Reduce
• Input: a pair <key, list(values)>
• Output: a pair <key, value>
• Aggregates information from multiple mappers
• 3 stages:
– Shuffle: transfer data from all the maps to the reducers
– Sort: sort the data that was transferred
– Reduce: aggregate the data
• Output is stored:
– In persistent storage.
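A matching reduce sketch in the same API; it sums the counts for each word and writes the totals through the job's output format to persistent storage (names are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Aggregates all counts emitted for a word and writes <word, total>.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable total = new IntWritable();

      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
          sum += c.get();
        }
        total.set(sum);
        context.write(word, total);  // ends up in persistent storage (the job's output directory)
      }
    }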
Combine
• Ideally similar to reduce
• EXCEPT
– Run on the same node as the map
– Run on only the data a mapper creates
– Preliminary aggregation to reduce the amount of data transferred
Example Task
• “hello world. Goodbye world. Hello hadoop goodbye hadoop”
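A minimal, assumed driver that wires the two sketches above together and reuses the reducer as a combiner; with case and punctuation normalized (which the simple tokenizer above does not do), each of the four words would total 2:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // combiner = reducer run on each map's local output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input text
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }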
Failure Recovery
• Worker failure
– Detecting: keep-alive pings
– Resolution: restart all tasks currently running on the failed node
• For completed tasks:
– If Map: restart (the output was on the failed node’s local disk)
– If Reduce: do nothing (the output is stored in external, persistent storage)
• Master failure
– Periodically store master data structures
– If the master fails, roll back to the last stored structure
Dealing with hardware performance issues
• Locality
– Place task on node where data is stored
– Else try to place task close to data
• Straggler detection and mitigation
– If a task is running really slowly, launch a backup copy of the task (speculative execution; see the sketch below)
– Jobs can take ~44% longer without this
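Hadoop's built-in form of this mitigation is speculative execution (backup copies of slow tasks). A sketch of enabling it explicitly, assuming MRv2-style property names (it is normally on by default):

    import org.apache.hadoop.conf.Configuration;

    // Speculative execution: the framework launches backup copies of slow tasks and
    // takes whichever copy finishes first. Shown only as an illustrative sketch.
    public class SpeculativeExecutionConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        System.out.println("map speculation: " + conf.get("mapreduce.map.speculative"));
      }
    }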
Resource Management/Sharing
• Cluster scheduler shares the resources
– Decides which job should run when and where
– Sharing algorithms
• FIFO: no real sharing
• Fair scheduler: each user is given a number of tokens
– The user’s jobs must get at least that many slots
• Capacity scheduler
– Each job is assigned to a queue; tasks within a queue are serviced in FIFO manner
– Determine number of cluster slots to allocate to each queue
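A sketch of how a job ends up in a particular queue under the Capacity Scheduler, assuming the MRv2 property name; the queue name "analytics" is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch: jobs are submitted to named queues and the scheduler decides
    // how many cluster slots each queue receives.
    public class QueueSubmission {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.queuename", "analytics");  // hypothetical queue name
        Job job = Job.getInstance(conf, "job submitted to the analytics queue");
        System.out.println("queue: " + job.getConfiguration().get("mapreduce.job.queuename"));
      }
    }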
Problems
• How do you determine an effective partitioning algorithm? Will hashing always work?
• How do you determine the optimal # of reducers?
• What is optimal scheduling? Resource-sharing algorithms.
Cheetah
• Relational data warehouses
– Highly optimized for storing and querying relational data
– Hard to scale to 100s or 1000s of nodes
• MapReduce
– Handles failures & scales to 1000s of nodes
– Lacks a declarative query interface
• Users have to write code to access the data
• May result in redundant code
• Requires a lot of effort & technical skills.
• How do you get the best of both worlds?
Main Challenges
• With Hadoop, it is hard to:
– Perform SQL like joins
• Developers need to track:
– Location of tables on disks (HDFS)
• Hard to get good performance out of vanilla Hadoop
– Need to go through crafty coding to get good performance
Architecture
• Simple yet efficient
• Open: also provides a simple, non-SQL interface
Query → MapReduce Job
• The query is sent to the node that runs the Query Driver
• Query Driver
– Translates the query into a MapReduce job (Query → MapReduce job)
• Each node in the Hadoop cluster provides a data
access primitive (DAP) interface
Performance Optimizations
• Data Storage & Compression
• MapReduce Job Configuration
• Multi-Query Optimization
• Exploiting Materialized Views
• Low-Latency Query Optimization
Storage Format
• Text (in CSV format)
– Simplest storage format & commonly used in web access logs
• Serialized java object
• Row-based binary array
– Commonly used in row-oriented database systems
• Columnar binary array
• Storage format has a huge impact on both compression ratio and query performance
• In Cheetah, we store data in columnar format whenever possible
Columnar Compression
• Compression type for each column set is
dynamically determined based on data in each cell
• In the ETL phase, the best compression method is chosen
• After one cell is created, it is further compressed
using GZIP.
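A sketch of the second compression step only, assuming the per-column encoding chosen at ETL time has already produced one cell's bytes; GZIP comes from java.util.zip and the class/method names are illustrative:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // A "cell" holds already-encoded columnar data; after the per-column encoding,
    // the whole cell is compressed once more with GZIP.
    public class CellCompressor {
      static byte[] gzipCell(byte[] encodedColumns) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
          gzip.write(encodedColumns);
        }
        return buffer.toByteArray();
      }
    }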
MapReduce Job Configuration
• # of map tasks - based on the # of input files & number of
blocks per file.
• # of reduce tasks - supplied by the job itself & has a big impact on performance
• Query output
– Small: the map phase dominates the total cost
– Large: it is mandatory to have a sufficient number of reducers to partition the work
• Heuristics (see the sketch below)
– # of reducers is proportional to the number of group-by columns in the query
– If the group-by columns include some column with very large cardinality, increase the # of reducers as well
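A sketch of that heuristic; the multiplier, the cardinality threshold, and the query-metadata parameters are assumptions, not Cheetah's actual values:

    import org.apache.hadoop.mapreduce.Job;

    // Picks the number of reduce tasks from the query's group-by structure.
    public class ReducerCountHeuristic {
      static void configure(Job job, int numGroupByColumns, long maxGroupByCardinality) {
        int reducers = Math.max(1, numGroupByColumns * 4);  // proportional to # of group-by columns
        if (maxGroupByCardinality > 1_000_000L) {           // some column with very large cardinality
          reducers *= 2;
        }
        job.setNumReduceTasks(reducers);
      }
    }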
Multi-Query Optimization
• Cheetah allows users to simultaneously submit multiple queries & execute them in a single batch, as long as these queries have the same FROM and DATES clauses
Map Phase
• Shared scanner: shares the scan of the fact tables & the joins to the dimension tables
• Scanner will attach a query ID to each output row
• Output from different aggregation operators will
be merged into a single output stream.
Reduce Phase
• Split the input rows based on their query IDs
• Send them to the corresponding query operators.
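A sketch of the routing idea across the two phases, using hypothetical Row and AggregateOperator types: the scanner tags each row with a query ID on the map side, and the reduce side splits rows by that ID and feeds the matching operator:

    import java.util.List;
    import java.util.Map;

    public class MultiQueryRouter {
      interface Row { int queryId(); }                       // hypothetical: a row tagged by the scanner
      interface AggregateOperator { void aggregate(Row row); }  // hypothetical: one query's aggregation

      // Reduce side: split incoming rows by query ID and send each to its query's operator.
      static void route(List<Row> rows, Map<Integer, AggregateOperator> operators) {
        for (Row row : rows) {
          AggregateOperator op = operators.get(row.queryId());
          if (op != null) {
            op.aggregate(row);
          }
        }
      }
    }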
Exploiting Materialized Views(1)
• Definition of Materialized Views
– Each materialized view only includes the columns in the fact table, i.e., excludes those on the dimension tables
– It is partitioned by date
• Both columns referred to in the query reside on the fact table, Impressions
• The resulting virtual view has two types of columns: group-by columns & aggregate columns
Exploiting Materialized Views(2)
• View Matching and Query Rewriting
– To make use of materialized view
• The query must refer to the virtual view that corresponds to the same fact table that the materialized view is defined upon
• Non-aggregate columns referred to in the SELECT and WHERE clauses of the query must be a subset of the materialized view’s group-by columns
• Aggregate columns must be computable from the materialized view’s aggregate columns (see the sketch below)
• Replace the virtual view in the query with the
matching materialized view
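A sketch of the two matching conditions as plain column-name set checks; the parameter names and the Set representation are assumptions:

    import java.util.Set;

    public class ViewMatcher {
      // Returns true if the materialized view can answer the query.
      static boolean matches(Set<String> queryNonAggColumns,
                             Set<String> queryAggInputs,
                             Set<String> viewGroupByColumns,
                             Set<String> viewAggColumns) {
        // 1. SELECT/WHERE non-aggregate columns must be a subset of the view's group-by columns
        // 2. the query's aggregates must be computable from the view's aggregate columns
        return viewGroupByColumns.containsAll(queryNonAggColumns)
            && viewAggColumns.containsAll(queryAggInputs);
      }
    }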
Low-Latency Query Optimization
• The current Hadoop implementation has some nontrivial overhead itself
– E.g., job start time, JVM start time
• Problem: for small queries, this becomes a significant extra overhead
– In the query translation phase: if the size of the input file is small, Cheetah may choose to directly read the file from HDFS and process the query locally (see the sketch below)
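A sketch of just the decision, using the HDFS FileSystem API; the size threshold and the class name are assumptions, not Cheetah's actual logic:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // If the input is small, skip launching a MapReduce job and let the Query Driver
    // read the file from HDFS and evaluate the query locally.
    public class LocalExecutionCheck {
      static final long SMALL_INPUT_BYTES = 64L * 1024 * 1024;  // assumed threshold, roughly one block

      static boolean shouldRunLocally(Configuration conf, Path input) throws java.io.IOException {
        FileSystem fs = FileSystem.get(conf);
        long inputBytes = fs.getContentSummary(input).getLength();  // total bytes under the path
        return inputBytes < SMALL_INPUT_BYTES;
      }
    }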
Questions