HadoopDB Inneke Ponet Introduction Technologies for data analysis HadoopDB Desired properties Layers of HadoopDB HadoopDB Components Introduction More and more data needs to be stored and processed. People want to do more and more complex calculations on their collected data. Analytical databases on high-end machines are moving towards cheaper lower-end machines. The analytical database market is 27% of the database software market and is growing at a rate of 10,3% annually. Technologies for data analysis Parallel databases: good performance, good efficiency. MapReduce-based systems: superior scalability, good fault tolerance, good flexibility to handle unstructered data. Parallel databases Support for standard relational tables and SQL. Implements techniques for a better performance: Indexing, compression, materialized views, result caching, I/O sharing. Data is partitioned (shared-nothing architecture) transparent to the end-user. Shared-nothing architecture The DBMS of the most analytical databases are deployed on a shared-noting architecture: A collection of machines that are independent, are possible virtual, have their own local disk and local main memory, are connected by a high-speed network. Scalability of machines. Analysis tasks are easy to parallellize. MapReduce A technology from Google: processes (un)structured data that is distributed on many nodes in a shared-nothing cluster; works at enormous scale. Map and Reduce: parallel without communicating; Map-repartition-Reduce cycles. MapReduce: advantages No detailed query execution plan in advance at runtime: adjust to node failures and slow nodes (re)assigning tasks to faster nodes. Checkpoints the output to local disk minimizing of the work in case of a failure. HadoopDB Hybrid database: a combination of: traditional DBMS, MapReduce-technology. Developed by Yale University students: Azza Abouzeid and Kamil BajDa-Pawlikowski It is free and open source. Desired properties A. Performance B. Fault tolerance C. Heterogeneous environment D. Flexible query interface E. Scalability A. Performance Primary characteristic to distinguish. MapReduce: first modeling and loading data before processing slower performance than parallel databases. Cost saving: faster software product cheaper than a hardware upgrade or buying additional hardware. Parallel databases MapReduce B. Fault tolerance Succesfully commit transactions. Make progress on a workload. Heterogeneity and scalibility more faults BUT MapReduce good fault tolerance: reassigning tasks; sub-tasks minimize the effect of faults. Parallel databases: assumption failures are rare more testing => slower performance. Parallel databases MapReduce C. Heterogeneous environment Nodes don’t always run on identical hardware, an identical virtual machine. Different performance. Parallel databases: not tested on more than 100 nodes. Parallel databases MapReduce D. Flexible query interface Easy to make queries: SQL and non-SQL interface languages, Use of tools. Robust mechanisme for writing UDFs. Parallel databases: SQL, ODBC and UDFs. MapReduce-based systems: it is possible (Hive), but not always (Hadoop). Parallel databases / MapReduce E. Scalability Traditional DBMS: only scalable to 100 nodes. Reasons Assumption Failures Failures are rare. Hetrogeneity Homogeneous array of machines. Not tested There are no applications with more than a few dozen nodes. MapReduce-based systems: designed to scale to thousands of nodes in a sharednothing architecture. Parallel databases MapReduce Desired properties Parallel databases MapReduce Performance Fault tolerance Heterogeneous environment Flexible query interface / Scalability Layers of HadoopDB Communication: Hadoop Database: PostgreSQL Translation: Hive Hadoop Communication layer of HadoopDB. Hadoop framework two layers: Hadoop Distributed File System (HDFS), MapReduce framework. Cost: free/open source MapReduce. PostgreSQL Relational DBMS. (Possible) database layer of HadoopDB. Cost: free/open source. Hive Translation layer. Processing of a SQL query: Query Abstract Syntax Tree. MetaStore: schema of the table(s). Logical query plan: DAG of relational operators. Optimized plan. Physical executable plan: MapReduce job(s). XML plan: DAG serialized. Hive Driver executes a Hadoop job. HadoopDB components Database Connector: Interface between independent database systems; Extends the InputFormat class (of Hadoop); Connect to any JDBC-compliant database. Catalog: Meta-information about the databases: connection parameters, metadata. XML file in HDFS accessed by: Master node, Worker/Slave nodes. HadoopDB Components (2) Data loader: Global hasher: Custom MapReduce job files in HDFS; Repartioning data upon loading. Local hasher: Copies partition from HDFS to local file system; Partitions the file in smaller sized chunks. HadoopDB Components (3) SQL to MapReduce: Parallel database front-end to process SQL queries. HiveQL ↓ Transform MapReduce jobs: Connect to tables stored in HDFS; Consists of DAGs of relational operators that operate as iterators. Assumption no collection of tables: Operations on multiple tables Reduce function. NOT in HadoopDB: a join operation can be pushed to the databse layer. HadoopDB Components (4) SQL/SMS planner: Modifies Hive: Updates the MetaStore Two passes over the physical plan: 1. 2. Determine the partition keys for the Reduce Sink Operators. Operators are: converted in SQL querie(s); pushed into the database layer. Only filter, select and aggregation operators. HadoopDB Components (5) Questions?