Accessing HBase using Big SQL
Deepa Remesh
BigInsights Development
For questions about this presentation contact Deepa Remesh, dremesh@us.ibm.com

Agenda
  Introduction to HBase
  Big SQL HBase Storage Handler
  – Column mapping
  – Data encoding
  – Data load
  Secondary Indexes
  Querying
  Recommendations and limitations
  Logs and Troubleshooting
  Highlights and HBase use cases

© 2013 IBM Corporation

HBase Basics
  Client/server database
  – Master and a set of region servers
  – Fetching data requires an additional network hop through the region server
  Key-value store
  – Key and value are byte arrays
  – Efficient access using row key
  Rich set of Java APIs and extensible frameworks
  – Supports a wide variety of filters
  – Allows application logic to run in the region server via coprocessors
  Different from relational databases
  – No types: all data is stored as bytes
  – No schema: rows can have different sets of columns

HBase Cluster Architecture
  [Diagram: clients, a ZooKeeper quorum (ZooKeeper peers), the HBase master, and region servers, each hosting coprocessors and regions backed by HFiles on HDFS / GPFS]
  Client finds region server addresses in ZooKeeper
  ZooKeeper is used for coordination / monitoring
  Client reads and writes rows by accessing the region server
  HBase master assigns regions and handles load balancing

HBase Data Model
  Table
  – Contains column families
  Column family
  – Logical and physical grouping of columns
  Column
  – Exists only when inserted
  – Can have multiple versions
  – Each row can have a different set of columns
  – Each column identified by its key
  Row key
  – Implicit primary key
  – Used for storing ordered rows
  – Efficient queries using row key

  Logical view of table HBTABLE:
    Row key   Value
    11111     cf_data: {'cq_name': 'name1', 'cq_val': 1111}
              cf_info: {'cq_desc': 'desc11111'}
    22222     cf_data: {'cq_name': 'name2', 'cq_val': 2013 @ ts = 2013, 'cq_val': 2012 @ ts = 2012}

  Physical view:
    HFile
      11111 cf_data cq_name name1 @ ts1
      11111 cf_data cq_val 1111 @ ts1
      22222 cf_data cq_name name2 @ ts1
      22222 cf_data cq_val 2013 @ ts1
      22222 cf_data cq_val 2012 @ ts2
    HFile
      11111 cf_info cq_desc desc11111 @ ts1

Big SQL HBase Storage Handler
  Mapping of SQL to HBase data: column mapping
  Handles serialization/deserialization of data (SerDe)
  Efficiently handles SQL queries by pushing down predicates
  [Diagram: a JDBC application sends SQL queries to Big SQL and receives query results; input data comes from delimited files or the warehouse on DFS; the HBase Storage Handler connects Big SQL to HBase]
  Components shown in the diagram:
  – SerDe
  – Query Optimizer (compile time): processes hints
  – Query Analyzer (runtime): HBase scan limits, filters, index usage

Column Mapping
  Mapping HBase row key/columns to SQL columns
  – Supports one to one and one to many mappings
  One to one mapping
  – Single HBase entity mapped to a single SQL column

    HBase               Value        SQL column
    key                 11111        id
    cf_data:cq_name     name1        name
    cf_data:cq_val      1111         value
    cf_info:cq_desc     desc11111    desc

Create Table: One to One Mapping
    CREATE HBASE TABLE HBTABLE (
      id INT,
      name VARCHAR(10),
      value INT,
      desc VARCHAR(20)
    )
    COLUMN MAPPING (
      key mapped by (id),
      cf_data:cq_name mapped by (name),
      cf_data:cq_val mapped by (value),
      cf_info:cq_desc mapped by (desc)
    );
  In each mapping, the HBase entity is on the left and the SQL column on the right
  Slide annotations: "Required"; HBase column identified by family:qualifier

One to Many Column Mapping
  Single HBase entity mapped to multiple SQL columns
  Composite key
  – HBase row key mapped to multiple SQL columns
  Dense column
  – One HBase column mapped to multiple SQL columns

    HBase               Value            SQL columns
    key                 11111_ac11       userid, acc_no
    cf_data:cq_names    fname1_lname1    first_name, last_name
    cf_data:cq_acct     11111#11#0.25    balance, min_bal, interest

Create Table: One to Many Mapping
    CREATE HBASE TABLE DENSE_TABLE (
      userid INT,
      acc_no VARCHAR(10),
      first_name VARCHAR(10),
      last_name VARCHAR(10),
      balance double,
      min_bal double,
      interest double
    )
    COLUMN MAPPING (
      key mapped by (userid, acc_no),
      cf_data:cq_names mapped by (first_name, last_name),
      cf_data:cq_acct mapped by (balance, min_bal, interest)
    );
  – Composite key: key mapped by (userid, acc_no)
  – Dense columns: cf_data:cq_names and cf_data:cq_acct
  – Each mapping takes a list of SQL columns

Why use One to Many mapping?
  HBase is very verbose
  – Stores a lot of information for each value:
    <row> <columnfamily> <columnqualifier> <timestamp> <value>
  – Primarily intended for sparse data
  Save storage space
  – Sample table with 9 columns, 1.5 million rows
  – One to one mapping: 522 MB
  – One to many mapping: 276 MB
  Improve query response time
  – Query results also return the entire key for each value
  – select * query on sample table
    • One to one mapping: 1m 31s
    • One to many mapping: 1m 2s

Sample Data
  TPCH orders table with 1.5 million rows

    drop table if exists orders_one_to_one;
    CREATE HBASE TABLE ORDERS_ONE_TO_ONE (
      O_ORDERKEY BIGINT,
      O_CUSTKEY INTEGER,
      O_ORDERSTATUS VARCHAR(1),
      O_TOTALPRICE FLOAT,
      O_ORDERDATE TIMESTAMP,
      O_ORDERPRIORITY VARCHAR(15),
      O_CLERK VARCHAR(15),
      O_SHIPPRIORITY INTEGER,
      O_COMMENT VARCHAR(79)
    )
    column mapping (
      key mapped by (O_ORDERKEY),
      f:a mapped by (O_CUSTKEY),
      f:b mapped by (O_ORDERSTATUS),
      f:c mapped by (O_TOTALPRICE),
      f:d mapped by (O_ORDERPRIORITY),
      f:e mapped by (O_CLERK),
      f:f mapped by (O_SHIPPRIORITY),
      f:g mapped by (O_COMMENT),
      f:h mapped by (O_ORDERDATE)
    )
    default encoding binary;

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS_ONE_TO_ONE;

    drop table if exists orders;
    CREATE HBASE TABLE ORDERS (
      O_ORDERKEY BIGINT,
      O_CUSTKEY INTEGER,
      O_ORDERSTATUS VARCHAR(1),
      O_TOTALPRICE FLOAT,
      O_ORDERDATE TIMESTAMP,
      O_ORDERPRIORITY VARCHAR(15),
      O_CLERK VARCHAR(15),
      O_SHIPPRIORITY INTEGER,
      O_COMMENT VARCHAR(79)
    )
    column mapping (
      key mapped by (O_CUSTKEY, O_ORDERKEY),
      cf:d mapped by (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT),
      cf:od mapped by (O_ORDERDATE)
    )
    default encoding binary;

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS;

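The storage argument under "Why use One to Many mapping?" can be made concrete with a small estimate. The following Python sketch is not Big SQL or HBase code; it uses a deliberately simplified cell model (row key + family + qualifier + 8-byte timestamp + value, ignoring HBase's length and type fields), and the function names and sample qualifiers are hypothetical.

```python
# Simplified model of per-cell storage: every stored value repeats its
# row key, column family, qualifier and an 8-byte timestamp.
# Assumption: HBase's fixed-size length/type fields are ignored here.
TIMESTAMP_BYTES = 8


def cell_bytes(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate stored size of one cell under the simplified model."""
    return len(row_key) + len(family) + len(qualifier) + TIMESTAMP_BYTES + len(value)


def one_to_one_bytes(row_key: bytes, family: bytes, columns: dict) -> int:
    """Each SQL column is stored as its own cell, so the key is repeated per column."""
    return sum(cell_bytes(row_key, family, q, v) for q, v in columns.items())


def one_to_many_bytes(row_key: bytes, family: bytes, qualifier: bytes,
                      columns: dict, separator: bytes = b"\x00") -> int:
    """All SQL columns are packed into one dense cell, so the key is stored once."""
    packed = separator.join(columns.values())
    return cell_bytes(row_key, family, qualifier, packed)


row = b"11111"
cols = {b"a": b"name1", b"b": b"1111", b"c": b"desc11111"}
print(one_to_one_bytes(row, b"f", cols))         # three cells
print(one_to_many_bytes(row, b"f", b"d", cols))  # one dense cell
```

Even for three short columns the dense layout stores noticeably fewer bytes, because the row key, family, qualifier, and timestamp are written once instead of once per column. Actual savings depend on key length, value sizes, and compression.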
Data encoding
  HBase stores all data as an array of bytes
  – Application decides how to encode/decode the bytes
  Big SQL uses the Hive SerDe interface for serialization/deserialization
  Supports two types of data encodings: String, Binary
  Encoding can be specified at HBase row key/column level

    HBase               Value            SQL columns                    Encoding
    key                 11111_ac11       userid, acc_no                 String
    cf_data:cq_names    fname1_lname1    first_name, last_name          String
    cf_data:cq_acct     0x000001 …       balance, min_bal, interest     Binary

String encoding
  Default encoding
  Value is converted to string and stored as UTF-8 bytes
  Separator to identify parts in one to many mapping
  – Default separator for string encoding is the null byte (\u0000)
  – Can specify a different separator for each column and row key

    CREATE HBASE TABLE DENSE_TABLE_STR (
      userid INT,
      acc_no VARCHAR(10),
      first_name VARCHAR(10),
      last_name VARCHAR(10),
      balance double,
      min_bal double,
      interest double
    )
    COLUMN MAPPING (
      key mapped by (userid, acc_no) separator '_',
      cf_data:cq_names mapped by (first_name, last_name) separator '_',
      cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
    );

String Encoding: Pros and Cons
  Readable format and easier to port across applications
  Useful to map existing data

    Existing HBase table                  External Big SQL table
    key                 11111_ac11        userid, acc_no
    cf_data:cq_names    fname1_lname1     first_name, last_name
    cf_data:cq_acct     10000#10#0.25     balance, min_bal, interest

  Numeric data not collated correctly
  – HBase stores data as bytes
  – Lexicographic ordering: 1, 10, 2, 9 (so 2 > 10 and 9 > 10)
  Slow
  – Parsing strings is expensive

External Tables
  Useful to map tables that already exist in HBase
  – Data in external tables is not pre-validated
  Can create multiple views of the same table
  Example: use a subset of data from dense_table

    create external hbase table externalhbase_table (
      user INT,
      acc string,
      balance double,
      min_bal double,
      interest double
    )
    column mapping (
      key mapped by (user, acc),
      cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
    )
    hbase table name 'dense_table';

  HBase tables created using the Hive HBase storage handler cannot be read by Big SQL directly
  – Need to create external tables for this
  Things to note:
  – Dropping an external table only drops the metadata
  – Cannot create a secondary index on external tables

Binary Encoding
  Data encoded using a sortable binary representation
  Separators handled internally
  – Escaped to avoid the issue of the separator existing within data
  If encoding is not specified, string is used as default

    CREATE HBASE TABLE MIXED_ENCODING (
      C1 INT,
      C2 INT,
      C3 INT,
      C4 VARCHAR(10),
      C5 DECIMAL(5,2),
      C6 SMALLINT
    )
    COLUMN MAPPING (
      KEY MAPPED BY (C1, C2, C3) ENCODING BINARY,
      CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
      CF2:COL1 MAPPED BY (C6) ENCODING BINARY
    );

  Stored values:
    key         0x000000000000000100000000000000020000000000000003
    cf1:col1    foo|97.31
    cf2:col1    0x0000DEAF

Binary Encoding: Pros and Cons
  Faster
  Numeric types collated correctly, including negative numbers
  Limited portability

    CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity DOUBLE)
    COLUMN MAPPING (
      key mapped by (temp, date),
      cf:cq mapped by (humidity)
    )
    default encoding binary;

  Input rows:
    100,2012-06-10 17:00:00:000,40.25
    -17,2012-12-12 17:00:00:000,30.25
    95,2012-06-05 17:00:00:000,50.25

  Stored rows, in row key order:
    temp   row key                                                 cf:cq
    -17    \x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00     \x01\xC0>@\x00\x00\x00\x00\x00
    95     \x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00        \x01\xC0I \x00\x00\x00\x00\x00
    100    \x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00        \x01\xC0D \x00\x00\x00\x00\x00

Custom Encoding
  Any custom encoding data structure can be supported by decoding using user defined functions

    create hbase table obj_table (
      key_col int,
      obj_col binary(1024)
    ) ...;

    select get_obj_str_field(obj_col, 'fname') as fname,
           get_obj_str_field(obj_col, 'lname') as lname
    from obj_table
    where key_col = 314145;

  Key lookup and column filters are possible using UDFs

    select key_col from obj_table where obj_col = create_obj('bob', 'johnson');

  Big SQL provides a json_get_object() function for extracting values from textual JSON

Load Data
  Load HBase
  – Loads data from delimited files
  – File can be on DFS or local to the Big SQL server
  – Column list can be specified; it is optional, and if not specified, the column ordering in the table definition is used

    load hbase data inpath 'file:///input.dat'
      delimited fields terminated by '|'
      into table hbtable (name, value, desc, id);

  Load FROM
  – Loads data from a source outside of a BigInsights cluster
  – Covered in the Import/Load/Export presentation
  Insert command available as an undocumented feature

    insert into hbtable (name, value, desc, id) values('name5', 5555, 'desc55555', 55555);

Load Data: Upsert
  HBase ensures uniqueness of row key

  Input rows (first field is the key):
    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name2, 2222, desc22222
    …
  After load:
    11111, name1, 1111, desc11111 @ts0
    11111, name9, 9999, desc99999 @ts1
    22222, name2, 2222, desc22222 @ts1
    …
  Upsert can be confusing: no errors, but fewer rows!
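The row-count surprise on this slide follows directly from row key uniqueness. As a rough illustration only (a Python dict standing in for an HBase table, with a hypothetical `load` helper that is not part of Big SQL), loading rows that share a key reports every row as handled while the table ends up with fewer rows; adding a second column to the key keeps them apart.

```python
def load(table: dict, rows, key_columns=(0,)) -> int:
    """Write rows into a dict keyed by row key; return the number of rows handled.

    The dict models only the latest visible version per row key. Key parts are
    joined with the null byte, the default separator for string encoding.
    """
    handled = 0
    for row in rows:
        key = "\x00".join(str(row[i]) for i in key_columns)
        table[key] = row  # same key: the newer row replaces the visible one
        handled += 1
    return handled


rows = [
    (11111, "name1", 1111, "desc11111"),
    (11111, "name9", 9999, "desc99999"),
    (22222, "name2", 2222, "desc22222"),
]

single_key = {}
handled = load(single_key, rows)               # key mapped by (id)

composite_key = {}
load(composite_key, rows, key_columns=(0, 1))  # key mapped by (id, name)

print(handled, len(single_key), len(composite_key))
```

With the single-column key, three rows are handled but only two are visible afterwards; with the composite key all three survive.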
  Delimited file: 10 rows
  Load: 10 rows affected
  select count(*) from hbtable: 7 rows
  Combine multiple columns to make the row key unique
  – key mapped by (id, name)

  Input rows:
    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name2, 2222, desc22222
    …
  After load:
    11111\x00name1, 1111, desc11111 @ts0
    11111\x00name9, 9999, desc99999 @ts1
    22222\x00name2, 2222, desc22222 @ts1
    …

Force Key Unique
  Use the force key unique option when creating a table

    CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE (
      id INT,
      name VARCHAR(10),
      value INT,
      desc VARCHAR(20)
    )
    COLUMN MAPPING (
      key mapped by (id) force key unique,
      cf_data:cq_name mapped by (name),
      cf_data:cq_val mapped by (value),
      cf_info:cq_desc mapped by (desc)
    );

  Load adds a UUID to the row key
  – Prevents data loss
  – Inefficient: stores more data, slower queries

  Input rows:
    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name2, 2222, desc22222
    …
  After load:
    11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
    11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
    22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
    …

Load Data: Error Handling
  Option to continue and log error rows
  – LOG ERROR ROWS IN FILE 'filename'
  Common errors
  – Separator exists within data for string encoding
  – Invalid numeric types
  Always count the number of rows after loading
  – Load always reports the total number of rows that it handled

  Example: key mapped by (id, name) separator '-', with id defined as integer
  Input rows:
    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name-2, 2222, desc22222
    3333a, name3, 3333, desc33333
    …
  Load: 4 rows affected
  HBase table (2 rows):
    11111-name1, 1111, desc11111
    11111-name9, 9999, desc99999
  Error file (2 rows):
    22222, name-2, 2222, desc22222
    3333a, name3, 3333, desc33333

Options to Speed up Load
  Disable WAL
  – Data loss can happen if a region server crashes

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS DISABLE WAL;

  Increase write buffer
  – set hbase.client.write.buffer=8388608;

Secondary Index Support
  Self-maintaining secondary indexes
  – Stored in an HBase table
  – Populated using a Map Reduce index builder
  – Kept up to date using a synchronous coprocessor
  [Diagram: "create index" in Big SQL runs the MRIndexBuilder to build the index table (index building); an index coprocessor on the data regions keeps the index regions up to date as the client loads input data (index maintenance); at query time the Query Optimizer (compile time) processes hints, the Query Analyzer (runtime) decides whether to use the index, and the HBase Storage Handler issues batched get requests to the data table]

Index Creation and Usage
    create hbase table dt (id int, c1 string, c2 string, c3 string, c4 string, c5 string)
    column mapping (
      key mapped by (id),
      f:a mapped by (c1, c2, c3),
      f:b mapped by (c4, c5)
    );

    create index ixc3 on table dt (c3) as 'hbase';

  Data table (dt):
    key    f:a (c1, c2, c3)    f:b (c4, c5)
    bt1    c11_c21_c31         c41_c51
    bt2    c12_c22_c32         c42_c52
    bt3    c13_c23_c33         c43_c53
    …
  Index table (dt_ixc3), created by create index ixc3 (c3):
    key
    c31_bt1
    c32_bt2
    c33_bt3
    …
  Query c3=c32: use index?
  – No: full table scan
  – Yes: index table range scan with start row = c32 and stop row = c32++, then data table get with row = bt2
  Automatic index usage
  – Range scan on index table to get matching row key(s) in base table
  – Batched get requests to base table with the matched row key(s)

Index Pros and Cons
  Fast key based lookups for queries that return limited data
  Not beneficial if there are too many matches
  – No statistics to make the decision in the compiler
  – useindex hint to make explicit choices
  Index adds latency to data load
  – When loading a big data set, drop the index and recreate it
  – LOAD FROM option bypasses index maintenance
    • Uses HBase bulk load, which writes to HFiles directly

Column Family Options
  Compression
  – compression(gz)
  Bloom filters
  – NONE, ROW, ROWCOL
  In memory columns
  – in memory, no in memory

    create hbase table colopt_table (key string, c1 string)
    column mapping (
      key mapped by (key),
      cf1:c1 mapped by (c1)
    )
    column family options (cf1 compression(gz) bloom filter(row) in memory);

Query Handling
  Projection pushdown
  Predicate pushdown
  – Point scan
  – Range scan
  – Automatic index usage
  – Filters
  Query hints

Sample Data
  TPCH orders table with 1.5 million rows

    drop table if exists orders;
    CREATE HBASE TABLE ORDERS (
      O_ORDERKEY BIGINT,
      O_CUSTKEY INTEGER,
      O_ORDERSTATUS VARCHAR(1),
      O_TOTALPRICE FLOAT,
      O_ORDERDATE TIMESTAMP,
      O_ORDERPRIORITY VARCHAR(15),
      O_CLERK VARCHAR(15),
      O_SHIPPRIORITY INTEGER,
      O_COMMENT VARCHAR(79)
    )
    column mapping (
      key mapped by (O_CUSTKEY, O_ORDERKEY),
      cf:d mapped by (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT),
      cf:od mapped by (O_ORDERDATE)
    )
    default encoding binary;

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS;

Projection Pushdown
  Get only the columns required by the query
  Limit data retrieved to the client

    select * from orders
    go -m discard
    1500000 rows in results(first row: 0.21s; total: 1m1.77s)
    Log: HBase scan details:{ … , families={cf=[d, od]}, …}

    select o_totalprice from orders
    go -m discard
    1500000 rows in results(first row: 0.19s; total: 21.27s)
    Log: HBase scan details:{ … , families={cf=[d]}, …}

    select o_orderdate from orders
    go -m discard
    1500000 rows in results(first row: 0.36s; total: 36.24s)
    Log: HBase scan details:{ … , families={cf=[od]}, …}

  The response time for the o_orderdate query is higher even though it retrieves less data than the query for o_totalprice, because the timestamp type is more expensive
  Projection happens at HBase column level
  – For composite key and dense columns, the entire value is retrieved to the client
  – Efficient to pack columns that are queried together

Predicate Pushdown: Point Scan
  With full row key
  Big SQL can combine predicates on row key parts

    set force local on;
    select o_orderkey,o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
    +--------------+
    | o_totalprice |
    +--------------+
    | 208660.75000 |
    +--------------+
    1 row in results(first row: 0.14s; total: 0.14s)
    Log: Found a row scan by combining all composite key parts.

  Row key layout (o_custkey#o_orderkey) for query o_custkey=1 and o_orderkey=454791:
    start row=1#454791, stop row=1#454791
    1#454791
    1#579908
    1#3868359
    1#4273923
    1#4808192
    1#5133509
    …

Predicate Pushdown: Partial row Scan
  Predicate(s) on leading part(s) of row key

    select o_orderkey,o_totalprice from orders where o_custkey=1;
    +------------+--------------+
    | o_orderkey | o_totalprice |
    +------------+--------------+
    | 454791     | 74602.81250  |
    | 579908     | 54048.26172  |
    | 3868359    | 123076.84375 |
    | 4273923    | 95911.00781  |
    | 4808192    | 65478.05078  |
    | 5133509    | 174645.93750 |
    +------------+--------------+
    6 rows in results(first row: 0.13s; total: 0.13s)
    Log: Found a row scan that uses the first 1 part(s) of composite key.
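A partial row scan on the leading key part amounts to a range scan over sorted byte keys. The Python sketch below illustrates the idea under stated assumptions: integers are encoded big-endian with the sign bit flipped and a leading \x01 byte, which reproduces byte patterns that appear in this deck's logs (for example stopRow=\x01\x80\x00\x00\x03 for o_custkey < 3), but the meaning of the \x01 byte and the exact Big SQL format are not documented here. All function names are hypothetical.

```python
import bisect


def encode_int(value: int, width: int = 4) -> bytes:
    """Sortable encoding: big-endian two's complement with the sign bit flipped.

    The leading \\x01 byte mirrors what the logs show; treat it as an assumption.
    """
    bits = width * 8
    flipped = (value & ((1 << bits) - 1)) ^ (1 << (bits - 1))
    return b"\x01" + flipped.to_bytes(width, "big")


def row_key(custkey: int, orderkey: int) -> bytes:
    """Composite key: 4-byte o_custkey followed by 8-byte o_orderkey."""
    return encode_int(custkey, 4) + encode_int(orderkey, 8)


def prefix_range(prefix: bytes):
    """Start row and exclusive stop row covering every key that begins with prefix.

    The stop row is the prefix incremented by one (the "1++" notation).
    Assumes the prefix is not made up entirely of 0xFF bytes.
    """
    stop = int.from_bytes(prefix, "big") + 1
    return prefix, stop.to_bytes(len(prefix), "big")


def scan(sorted_keys, start: bytes, stop: bytes):
    """Return the keys in [start, stop) from a sorted list, like a bounded scan."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted(row_key(c, o) for c, o in
              [(1, 454791), (1, 579908), (2, 430243), (4, 164711)])

start, stop = prefix_range(encode_int(1))  # predicate: o_custkey = 1
matches = scan(keys, start, stop)
print(len(matches))
```

Because the scan is bounded by start and stop rows, only the keys for o_custkey = 1 are visited; a predicate on o_orderkey alone gives no usable prefix, which is why it falls back to a full table scan.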
  Row key layout (o_custkey#o_orderkey) for query o_custkey=1:
    start row=1, stop row=1++
    1#454791
    1#579908
    1#3868359
    1#4273923
    1#4808192
    1#5133509
    2#430243
    …

Predicate Pushdown: Range Scan
  With range predicates

    select o_orderkey,o_totalprice from orders where o_custkey < 3;
    Log: Found a row scan that uses the first 1 part(s) of composite key.
    Log: HBase scan details:{ .. , stopRow=\x01\x80\x00\x00\x03, startRow=, … }

  Row key layout (o_custkey#o_orderkey) for query o_custkey<3:
    start row= (empty), stop row=3#
    1#454791
    …
    1#5133509
    2#430243
    …
    4#164711
    …

Predicate Pushdown: Full table Scan
  An example of a case where predicates are not pushed down: the predicates are only on non-leading parts of the row key

    set force local on;
    select o_orderkey,o_totalprice from orders where o_orderkey=454791;
    +------------+--------------+
    | o_orderkey | o_totalprice |
    +------------+--------------+
    | 454791     | 74602.81250  |
    +------------+--------------+
    1 row in results(first row: 32.13s; total: 32.13s)
    Log: HBase scan details:{ .. , stopRow=, startRow=, … }

Automatic Index Usage
    select * from orders where o_clerk='Clerk#000000999'
    go -m discard
    1472 rows in results(first row: 1.63s; total: 30.32s)

    create index ix_clerk on table orders (o_clerk) as 'hbase';
    0 rows affected (total: 3m57.82s)

    select * from orders where o_clerk='Clerk#000000999'
    go -m discard
    1472 rows in results(first row: 3.60s; total: 3.65s)
    Log: Index query successful

  Index is used automatically
  For a composite index, rules similar to the composite row key apply
  – Parts will be combined where possible
  – With a partial value for a composite index, a range scan is done on the index table
  Multiple indexes on a table
  – Index to be used is randomly chosen
  – Specify the useIndex hint to make use of a specific index

Pushing down Filters into HBase
  Filters do not avoid a full table scan
  – Some filters can skip certain sections, e.g., PrefixFilter
  Limits rows returned to the client
  Limits data returned to the client
  – Key only filters

    select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P'
    go -m discard
    12819 rows in results(first row: 1.12s; total: 6.80s)
    Log: Found a row scan that uses the first 1 part(s) of composite key.
    Log: HBase filter list created using AND.
    Log: HBase scan details:{… , filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=, startRow=\x01\x80\x01\x86\xA1, …}

  – o_custkey>100000: row scan
  – o_orderstatus='P': column filter, as there is a predicate on the leading part of the dense column

Key Only Tables
  Big SQL allows creation of tables without specifying any HBase column

    create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
    column mapping (key mapped by (k1, k2, k3));

    select * from KEY_ONLY_TABLE;
    Log: Only row key or parts of row key requested. Applying filters.
    Log: … HBase scan details:{… families={}, filter=FilterList AND (2/2): [FirstKeyOnlyFilter, KeyOnlyFilter], …}

Predicate Precedence
  When a query contains multiple predicates, the following precedence applies:
  – Row scan
  – Index
  – Filters
    • Row filters
    • Column filters
  Filters will be applied along with row scans
  Filters cannot be combined with index lookups
  Multiple predicates: use of row and column filter

    select o_orderkey, o_custkey, o_orderdate from orders
    where o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;
    Log: HBase filter list created using OR.
    Log: HBase scan details:{… , filter=FilterList OR (2/2): [SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00), PrefixFilter \x01\x80\x00\x00\x02], cacheBlocks=false, stopRow=, startRow=, … }

  The OR condition prevents usage of a row scan; a row filter (PrefixFilter) is used along with a column filter

Accessmode Hint
  Will run the query locally in the Big SQL server
  – Useful to avoid map reduce overhead
  Very important for HBase point queries
  – This is not detected currently by the compiler
  – Specify the accessmode='local' hint when getting a limited set of data from HBase
  Specify at query level

    select o_orderkey from orders /*+ accessmode='local' +*/
    where o_custkey=1 and o_orderkey=454791;

  Specify at session level
  – set force local on
  – set force commands override query level hints

HBase Hints
  rowcachesize (default=2000)
  – Used as scan cache setting
  – Also used to determine the number of get requests to batch in index lookups
  colbatchsize (default=100)
  useindex ('false' to avoid index usage)

    select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000
    go -m discard
    1450136 rows in results(first row: 22.67s; total: 27.46s)
    Log: HBase scan details:{... , caching=10000, ...}

  rowcachesize can also be set using the set command:
  – set hbase.client.scanner.caching=10000;

Recommendations
  Row key design is the most important factor
  – Try to combine predicates that are most commonly used into row key columns
  – Do not make the row key too long
  Use short names for HBase column families and column qualifiers
  – f:q instead of mycolumnfamily:mycolumnqualifier
  Check if key only tables can be used
  Pack columns that are queried together into dense columns
  – Use the column that is used as query predicate as prefix
  – Create indexes for columns that do not have repeating values and are queried often
  Separate columns that are rarely or never queried into a different column family
  Set hbase.client.scanner.caching to an optimum value
  Ensure even data distribution

Limitations
  No diagnostic info about HBase pushdown
  – How the HBase storage handler pushes down a query is decided only at runtime
  – Predicate handling details are logged at INFO level
  – Many examples of log messages are covered in previous slides
  No auto detection of local vs MR mode
  – Currently depends on user specified hints
  Statistics not available
  – Big SQL does not have a framework to collect statistics
  – Query optimizations can be improved with availability of useful statistics
  Map type not supported
  – Big SQL does not support the map data type
  – Hive HBase handler supports the map data type and many to one mapping
    • Mapping an entire HBase column family to a map data type

Logs and Troubleshooting
  Big SQL logs
  – Look for the rewritten query
  – More information in Big SQL logs if the query is run in local mode
  Map Reduce logs
  – Predicate handling information is in the map task log when run in MR mode
  HBase web GUI
  – http://servername:60010/master-status

Big SQL HBase Handler Highlights
  Support for composite key/dense columns
  Pushdown for efficient execution of queries
  Support for secondary indexes
  Collatable binary encoding
  Key only tables
  Support for hints to make query optimization decisions

Scenarios that can leverage HBase features
  Point queries
  – Queries that return a single row of result
  – Row can be determined using row key or secondary index
    • Not all queries using a secondary index are point queries
  Queries with projections
  – If a query requires only a few columns
  – Projection happens at HBase column level
  Data maintenance using upserts
  – Loading different values for columns using the same row key
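
The "collatable binary encoding" highlight is easiest to appreciate next to the string-encoding problem described earlier (1, 10, 2, 9). The following Python sketch compares the two orderings. The double encoding shown is a common sortable-bytes technique (flip the sign bit for non-negative values, invert all bits for negative ones); it matches the positive humidity values in the WEATHER example, but whether Big SQL treats negative doubles exactly this way is an assumption, and the function names are hypothetical.

```python
import struct


def encode_string(value) -> bytes:
    """String encoding: the value's text form as UTF-8 bytes."""
    return str(value).encode("utf-8")


def encode_double_sortable(value: float) -> bytes:
    """Encode a double so that byte order matches numeric order.

    Non-negative values get the sign bit flipped; negative values get every
    bit inverted, so larger magnitudes sort earlier.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    if bits & (1 << 63):
        bits ^= (1 << 64) - 1  # negative: invert all 64 bits
    else:
        bits ^= 1 << 63        # non-negative: flip only the sign bit
    return struct.pack(">Q", bits)


# String-encoded integers sort lexicographically, not numerically.
by_string = sorted([1, 10, 2, 9], key=encode_string)
print(by_string)

# Sortable binary encoding keeps numeric order, including negative values.
by_binary = sorted([40.25, -3.5, 30.25, 50.25], key=encode_double_sortable)
print(by_binary)
```

The first sort places 10 before 2 and 9, which is what HBase sees when keys are stored as strings; the second preserves numeric order, which is what makes range scans on binary-encoded numeric keys reliable.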