HADOOPDB: AN ARCHITECTURAL HYBRID OF MAPREDUCE AND DBMS TECHNOLOGIES FOR ANALYTICAL WORKLOADS
By: Muhammad Mudassar, MS-IT-8

WHAT IS GOING ON
- Data analysis techniques are changing
  - Enterprises are moving to cheaper commodity hardware
  - MPP (Massively Parallel Processing) architectures inside "Clouds"
- Analytical data is exploding
- What technology for data analysis?
  - Parallel databases
  - MapReduce-based systems

THE TWO TECHNOLOGIES
- Parallel databases
  - High performance and efficiency
  - Score poorly on fault tolerance and on running in heterogeneous environments
  - Few known deployments over 100 nodes
- MapReduce-based systems
  - Designed to scale to thousands of nodes
  - Fault tolerant and able to run in heterogeneous environments
  - Biggest issue with MapReduce is performance

HADOOPDB
- A hybrid system to handle the demands of data-intensive applications
- Advantages
  - Scalability of MapReduce
  - Performance and efficiency of parallel databases
- Built completely on open-source, free-to-use components
  - PostgreSQL as the database layer
  - Hadoop MapReduce as the processing framework
  - Amazon's EC2 cloud as the deployment environment

DESIRED PROPERTIES
- Performance
  - A primary characteristic that commercial database systems use to distinguish themselves
- Fault tolerance
  - Measured differently for analytical and transactional DBMSs
  - For an analytical DBMS, query restarts are to be avoided
- Ability to run in a heterogeneous environment
  - Nearly impossible to get homogeneous performance from 100 or 1000 nodes
- Flexible query interface
  - Allow users to write user-defined functions (UDFs) and queries that are parallelized automatically
ARCHITECTURE OF HADOOPDB

THE HADOOP FRAMEWORK
- Hadoop consists of 2 layers
  - Data storage layer: the Hadoop Distributed File System (HDFS)
  - Data processing layer: the MapReduce framework
- HDFS
  - Block-structured file system managed by the NameNode
  - Data is stored on DataNodes
- MapReduce framework
  - Master-slave architecture based on the JobTracker and TaskTrackers
  - JobTracker manages jobs: task assignment, job tracking, and load balancing
  - TaskTrackers perform the Map or Reduce tasks assigned to them

THE HADOOPDB'S COMPONENTS
HadoopDB extends the Hadoop framework with four components
1. Database connector
  - Interface between the DBMS and the TaskTracker
  - Databases are treated as data sources, similar to data blocks in HDFS
2. Catalog
  - Maintains information about the databases
  - Database location, driver class
  - Metadata such as replica locations and partitioning properties
3. Data Loader
  - Globally partitions the data on a given key
  - Breaks single-node data into chunks
  - Loads the chunks into the database

THE HADOOPDB'S COMPONENTS
4. SQL to MapReduce to SQL (SMS) Planner
  - HadoopDB provides a front end to process SQL queries
  - The SMS planner extends Hive
  - Parser transforms the query into an abstract syntax tree
  - Table schema information is fetched from the catalog
  - Logical plan generator creates the query plan
  - Optimizer breaks the plan into Map or Reduce phases
  - Executable plan is generated for one or more MapReduce jobs
  - SMS tries to push as much work as possible into the database layer

EVALUATING HADOOPDB
- Compare HadoopDB to
  - Hadoop
  - Parallel databases (Vertica, DBMS-X)
- Features
  - Performance: HadoopDB is expected to approach the performance of parallel databases
  - Scalability: HadoopDB is expected to be scalable

DATA LOAD

QUERIES RESULTS

SCALABILITY
- HadoopDB and Hadoop take advantage of runtime scheduling by splitting the data into chunks
- Parallel databases restart the entire query on node failure, or wait for the slowest node

CONCLUSION
- HadoopDB
  - Is a hybrid system
  - Scales better than parallel databases
  - Is fault tolerant
  - Approaches the performance of parallel databases
  - Is free and open source
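
SKETCH: DATA LOADER PARTITIONING
The Data Loader steps listed under the HadoopDB components (partition globally on a key, break each node's data into chunks, load the chunks) can be illustrated with a minimal sketch. This is not HadoopDB's actual loader code: the function names, the MD5-based hash, and the fixed-size chunking are assumptions chosen for illustration, and the final load into PostgreSQL is left out.

```python
import hashlib
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Map a partitioning key to a node index (hypothetical helper).

    A stable hash is used so the same key always lands on the same node.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def global_partition(rows, key_field, num_nodes):
    """Step 1: globally partition rows across nodes on the given key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for_key(row[key_field], num_nodes)].append(row)
    return dict(partitions)


def local_chunks(node_rows, chunk_size):
    """Step 2: break one node's rows into chunks of at most chunk_size rows.

    Assumption: plain sequential slicing; a real loader may chunk differently.
    """
    return [node_rows[i:i + chunk_size]
            for i in range(0, len(node_rows), chunk_size)]


def load_plan(rows, key_field, num_nodes, chunk_size):
    """Combine both steps: node index -> list of chunks for that node.

    Step 3 (bulk-loading each chunk into the node's database) is omitted.
    """
    return {
        node: local_chunks(node_rows, chunk_size)
        for node, node_rows in global_partition(rows, key_field, num_nodes).items()
    }


if __name__ == "__main__":
    sample = [{"user_id": i % 7, "visits": i} for i in range(50)]
    plan = load_plan(sample, "user_id", num_nodes=3, chunk_size=4)
    for node, chunks in sorted(plan.items()):
        print(node, [len(chunk) for chunk in chunks])
```

Because rows with the same key always end up on the same node, a query that groups or joins on that key can be answered inside each node's database, which is the kind of work the SMS planner tries to push down.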