Hive and HBase in BigInsights 2.1

Richard Ding
BigInsights Development
sding@us.ibm.com

Agenda
SQL and NoSQL for Hadoop
Hive Overview
HBase Overview
HBase Backup/Restore (Jing Chen He)

© 2013 IBM Corporation

SQL and NoSQL for Hadoop
Hadoop is designed to store and stream extremely large datasets in batch. It is highly scalable and highly available.
But MapReduce is difficult to use
– The Java API is tedious and requires programming expertise
– There are many different file formats, storage mechanisms, configuration options, etc.
MapReduce is for batch operations
– It is not intended for real-time querying
– It does not support random access
– It does not handle billions of small files well
Hive and HBase are two popular open source projects that address the above issues
– BigInsights 2.1: Hive 0.9.0+ and HBase 0.94.3+

SQL-on-Hadoop
Query the data where it resides, in HDFS or HBase
Standards-compliant SQL interface
Big SQL: the SQL-on-Hadoop solution from BigInsights

What is Hive?
Hive provides a SQL interface for data stored in Hadoop
Supports a wide variety of Hadoop data:
– Many different file formats and data sources (e.g. HBase)
– Many different data representations (encodings)
– Provides an API to define your own
The Hive catalog ("metastore") maps file structure to a tabular form
Hive DDL populates the catalog
– Existing data can be described
– Empty "tables" can be defined and populated via DML
Hive DML statements bulk load tables
Provides a subset of SQL SELECT for querying
– SQL is translated into one or more MapReduce jobs for execution

But, Hive
Is not a real-time processing system
– Batch jobs for both loads and queries
– Responses do not arrive in seconds or sub-seconds
Has only limited SQL support
– Not SQL92 compliant
– No random updates and inserts
Query optimization is still a work in progress
– Compared to a traditional RDBMS

Hive Components
[Diagram: Hive components. Labels: BigInsights Interfaces, Hive Eclipse Plugin, Web Browser, Hive Web Interface, Client Interfaces (remote), Thrift Client, Hive Application, JDBC, ODBC, Hive Server, CLI (hive>), Query Execution, Driver, JobConf, Config, Metadata, Metastore, Hive Metastore]

Data Model
Tables
– Typed columns (int, float, string, boolean, binary, timestamp)
– Complex types (struct, map, array)
Partitions
– A partition column is a virtual column whose values determine where the data is stored in DFS.
– Each table can have one or more partition columns (and therefore one or more levels of partitioning)
Buckets
– Within each partition, data can be divided into buckets based on the hash value of a table column (useful for sampling and join optimization)

Physical Layout
Warehouse directory in DFS
– Specified by “hive.metastore.warehouse.dir” in hive-site.xml
– /biginsights/hive/warehouse is the default location for BigInsights
One can think of tables, partitions and buckets as directories, subdirectories and files respectively

Hive Entity         Sample               Sample location in DFS
database            testdb               $WH/testdb
table               T                    $WH[/testdb]/T
partition           date='01012013'      $WH/T/date=01012013
bucketing column    userid               $WH/T/date=01012013/000000_0
                                         ……
                                         $WH/T/date=01012013/000032_0

File Format
Actual data is stored in flat files on DFS
– Control-character-delimited text file (default)
– Hadoop SequenceFile
– RCFile (Record Columnar File)
A custom InputFormat/OutputFormat or a custom SerDe (Serializer/Deserializer) is also supported, so any format can be used

Create Table
  CREATE TABLE view_page (
      view_time INT,
      user_id BIGINT,
      page_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) SORTED BY (page_url) INTO 32 BUCKETS
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS RCFILE;

Create Table Command
The PARTITIONED BY clause defines the virtual columns, which are different from the data columns and are not actually stored with the data
The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create
The ROW FORMAT DELIMITED clause specifies how the rows are formatted, i.e. which character is the field delimiter
STORED AS RCFILE indicates that the data is stored in a binary format (RCFile) on DFS
COMMENT clauses can be attached at both the table level and the column level
These are all optional.
If a clause is not specified, its default value is used

HBase Overview
An open source, distributed, scalable, NoSQL datastore
Based on Google’s Bigtable paper [2006]
Implemented as a sparse, consistent, distributed, multi-dimensional, persistent, sorted map
Fault-tolerant, horizontally scalable, high performance

HBase Advantage
Highly Scalable
– Automatic partitioning
– Scales linearly and automatically with new nodes
Low Latency
– Supports random read/write and small range scans
Highly Available
Strong Consistency
Very good for “sparse data” (no fixed columns)

But HBase is not an RDBMS
No secondary indexes (row-key only)
No multi-row transactions (single row only)
No SQL interface (get/put/scan/etc.)
No query optimizer
Can take lots of disk space
– It can be very verbose
– There is no schema
– 3x replication on DFS

When to use HBase?
Large amounts of data (100s of GBs up to petabytes)
Need efficient random access inside large datasets
Need to scale gracefully
Do not need full RDBMS capabilities
Relatively simple and fixed access pattern

Data Model
“...a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes” – Google Bigtable paper
(row key, column key, timestamp) → value
Table schema only defines column families
– Can have a large, variable number of columns per row
Rows are stored in sorted order by row key
– Row keys are byte arrays. Rows are lexicographically sorted by row key
Each cell value has a version
– Timestamp
A {row, column, version} tuple exactly specifies a cell

Column Family
“Column keys are grouped into sets called column families, which form the basic unit of access control.” – Google Bigtable paper
Basic storage unit.
Columns in the same family should have similar properties and access patterns
Configurable by column family
– Compression (none, Gzip, LZO, Snappy)
– Version retention policies
A column is named using the following syntax: family:qualifier

Data Model, Cont
Column family as storage unit
Cells are sorted first by row key, then by column key, and finally by timestamp
Good for “sparse data”, since a non-existent column is simply not stored
But a simple translation from an RDBMS table to an HBase table can take a lot more storage space
Updating a column just adds a new version
Deleting a column/row just adds a new version with a “delete” marker (called a tombstone)

HBase Shell
The HBase shell is JRuby IRB (the JRuby implementation of the Interactive Ruby Shell) with some HBase-specific commands added
Running the shell:
  $ cd /opt/ibm/biginsights/hbase
  $ bin/hbase shell
  HBase Shell; enter 'help<RETURN>' for list of supported commands.
  Type "exit<RETURN>" to leave the HBase Shell
  Version 0.94.3, r479dfe9d8f840afa063e7a61c6d073f0a57ba423, Wed Apr 24 03:11:01 PDT 2013

  hbase(main):001:0>

Shell: Create Table
  hbase(main):002:0> create 'usertable', 'family'
  0 row(s) in 1.0220 seconds

  hbase(main):003:0> describe 'usertable'
  DESCRIPTION                                                       ENABLED
  {NAME => 'usertable', FAMILIES => [{NAME => 'family',             true
  REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false',
  COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true',
  BLOCKCACHE => 'true', MIN_VERSIONS => '0',
  DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false',
  BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '3',
  BLOCKSIZE => '65536'}]}
  1 row(s) in 0.0170 seconds

  hbase(main):004:0> disable 'usertable'
  0 row(s) in 5.0300 seconds

  hbase(main):005:0> drop 'usertable'
  0 row(s) in 1.2030 seconds

HBase Architecture
[Diagram: HBase architecture. Labels: Client, ZooKeeper Quorum (ZK Peer …… ZK Peer), Master Server (two shown), Region Server (several shown), DFS]

High Level Overview
ZooKeeper provides the coordination service
Client finds the region server via ZK
Client writes/reads directly to/from the region server
Master assigns regions and balances load
Region servers send heartbeats to ZK
Master monitors ZK for failed region servers

HBase in ZooKeeper
  $ bin/hbase zkcli
  [zk: <hostname>:2181(CONNECTED) 0] ls /hbase
  [root-region-server, rs, master, hbaseid, shutdown, backup-masters, unassigned, draining, splitlog, table]

znode                 Description
root-region-server    location of the server hosting the root region
rs                    ephemeral nodes of the region servers
draining              ephemeral nodes of the draining region servers
master                the currently active master
shutdown*             the current cluster state
unassigned*           used for region transitioning and assignment
splitlog              used for log splitting work assignment
table                 used for table disabling/enabling

Client
The HBase client (HTable) first connects to ZK using the configuration parameters in the hbase-site.xml file:
  <name>hbase.zookeeper.quorum</name>
  <value>comma-delimited host names</value>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
The client then finds the region server that serves the specific region by querying the .META. and -ROOT- catalog tables (located via root-region-server in ZK):
  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "myTable");
The region information is cached in the client, so subsequent requests skip the lookup until the cached entry becomes stale
The client writes/reads directly to/from the region server

Important Client Configurations
Configuration and description:
Auto Flush
– When performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance (default is true)
Deferred Log Flush
– If deferred log flush is used, WAL edits are kept in memory until the flush period.
– Deferred log flush can be configured on tables via HTableDescriptor (or a shell command)
hbase.client.write.buffer
– Default: 2 MB
hbase.client.scanner.caching
– Number of rows that will be fetched when calling next on a scanner
– Default: 1, recommended: 100
hbase.rpc.timeout
– RPC timeout value. Default: 60 secs
Turn off WAL on Puts
– Call setWriteToWAL(false) on the Put to increase throughput (may cause data loss during a RS failure)

Master
Monitors all region server instances in the cluster
Initiates region server failover
Performs all metadata changes (e.g. create table)
Manages region assignment
Background services:
– LoadBalancer (moves regions to balance the cluster load)
– CatalogJanitor (checks and cleans up the .META. table)
– LogCleaner (clears the HLogs in the old logs directory)

Backup Masters
During BigInsights installation, you can configure two or more masters
When a master starts up, it races with other masters to write its address into ZooKeeper. If it succeeds, it is the primary/active master.
If it does not succeed, another master is already active and it becomes a backup master
A backup master waits until the active master dies, then tries to become the next active master

Row Key Design
Row key design is the most important factor
Keep good data locality
Know your access pattern
Use a key structure that yields good locality for your access pattern
Keep the key compact
Avoid “hotspot” regions

Region Server
[Diagram: region server internals. Labels: Write-Ahead-Log, Region, Store, MemStore, StoreFile, HFile]

Region
A region is a horizontal partition of a table with a start row and an end row
Regions are the basic element of availability and distribution for tables
A region is automatically split by the hosting region server when it reaches a specified size
Periodically, a load balancer moves regions within the cluster to balance the load
When a region server fails, its regions are reassigned to other region servers

Region Server Components
Region server makes a set of regions available to clients. It checks in with the Master
WAL stores all the edits to the Store. There is one WAL per region server. All edits for all regions carried by a particular region server are entered first in the WAL
Region stores data for a certain region of a table. There are multiple stores in a single region
A Store holds a column family in a region. It has a MemStore and a set of zero or more HFiles

API - Filters
Predicate pushdown: all filters are applied on the server side
Examples (filter and description):
PrefixFilter
– Returns only rows whose row key starts with the given prefix
KeyOnlyFilter
– Returns only the key component of each KV (the value is rewritten as empty)
FirstKeyOnlyFilter
– Returns only the first KV from each row
FuzzyRowFilter
– Filters data based on a fuzzy row key, e.g. the fuzzy info says the matching mask is "????_99_????_01", where each ? can be any value
RowFilter
– Filters based on the key. It takes an operator (equal, greater, not equal, etc.) and a byte[] comparator for the row and column qualifier portions of a key
TimestampsFilter
– Returns only cells whose timestamp (version) is in the specified list of timestamps (versions)

Coprocessor
Inspired by Google’s Bigtable coprocessors
A framework that provides a library and runtime environment for executing user code within the HBase region server and master processes
Observer coprocessor (“trigger”)
– MasterObserver (preCreateTable, postCreateTable, …)
– RegionObserver (preGet, postGet, prePut, postPut, …)
– WALObserver (preWALWrite, postWALWrite)
Load coprocessors from configuration
– hbase.coprocessor.master.classes
– hbase.coprocessor.region.classes
– hbase.coprocessor.user.region.classes
– hbase.coprocessor.wal.classes

Coprocessor, cont
Usage: Access Control
  <property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
  <property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.hadoop.hbase.security.token.TokenProvider, org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
Usage: Secondary Index
– No built-in implementation yet
– IBM Watson Lab developed a secondary index coprocessor
  hbase(main):005:0> alter 'test', METHOD => 'table_att', 'coprocessor' => 'hdfs://myserver.ibm.com:9010/index-coprocessor0.6.0.jar|org.apache.hadoop.hbase.coprocessor.index.SyncSecondaryIndexObserver|1001|arg1=1,arg2=2'

Coprocessor, cont
Endpoint coprocessor (“stored procedure”)
– Implementation is installed on the server side
– Invoked from the client side using a dynamic proxy

Questions
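
Supplementary sketch for the HBase Data Model and Row Key Design slides: the block below is a minimal toy model, in plain Java with no HBase dependency, of the (row key, column key, timestamp) → value sorted map. It is not the HBase client API. The class SortedMapSketch, its methods, and the sample row keys are all hypothetical names chosen for illustration, and two simplifications are assumed: row keys are ASCII strings rather than byte arrays, and a tombstone is represented by a null value. It tries to show three points from the slides: an update adds a new version, a delete adds a tombstone, and lexicographically sorted row keys make a prefix lookup a short range scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy in-memory model of the sorted map described in the Data Model slides.
// NOT the HBase API: every name here is hypothetical.
class SortedMapSketch {
    // row key -> column key ("family:qualifier") -> timestamp (newest first) -> value.
    // A null value stands in for a "delete" marker (tombstone).
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    // An update never overwrites in place; it adds a new version of the cell.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // A delete adds a new version that carries a tombstone.
    void delete(String row, String column, long timestamp) {
        put(row, column, timestamp, null);
    }

    // Returns the newest version, or null if the cell is absent
    // or its newest version is a tombstone.
    String get(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        TreeMap<Long, String> versions = columns.get(column);
        if (versions == null) {
            return null;
        }
        return versions.firstEntry().getValue();
    }

    // Because row keys are sorted, all rows sharing a prefix are adjacent:
    // seek to the prefix, then read until the prefix no longer matches.
    List<String> scanRowPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        for (String row : rows.tailMap(prefix, true).keySet()) {
            if (!row.startsWith(prefix)) {
                break;
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        SortedMapSketch table = new SortedMapSketch();
        table.put("user1_20130101", "family:url", 1L, "/home");
        table.put("user1_20130101", "family:url", 2L, "/cart"); // update = new version
        table.put("user1_20130102", "family:url", 1L, "/help");
        table.put("user2_20130101", "family:url", 1L, "/home");
        table.delete("user2_20130101", "family:url", 2L);       // tombstone hides version 1

        System.out.println(table.get("user1_20130101", "family:url")); // prints /cart
        System.out.println(table.get("user2_20130101", "family:url")); // prints null
        System.out.println(table.scanRowPrefix("user1_"));
        // prints [user1_20130101, user1_20130102]
    }
}
```

The composite keys in the example (user id followed by date) are one way to read the advice to use a key structure that yields good locality: all rows for one user sit next to each other, so the prefix scan touches only that user's rows. Whether such a layout also avoids a hotspot region depends on the write pattern, which this sketch does not model.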