BigSQL-HBase.ppt

Accessing HBase using Big SQL

Deepa Remesh
BigInsights Development
For questions about this presentation contact Deepa Remesh, dremesh@us.ibm.com
© 2013 IBM Corporation


Agenda

• Introduction to HBase
• Big SQL HBase Storage Handler
  – Column mapping
  – Data encoding
  – Data load
• Secondary Indexes
• Querying
• Recommendations and limitations
• Logs and Troubleshooting
• Highlights and HBase use cases


HBase Basics

• Client/server database
  – Master and a set of region servers
  – Fetching data requires an additional network hop through the region server
• Key-value store
  – Key and value are byte arrays
  – Efficient access using row key
• Rich set of Java APIs and extensible frameworks
  – Supports a wide variety of filters
  – Allows application logic to run in the region server via coprocessors
• Different from relational databases
  – No types: all data is stored as bytes
  – No schema: rows can have different sets of columns


HBase Cluster Architecture

[Diagram: HBase cluster on HDFS / GPFS. A ZooKeeper quorum (ZooKeeper peers), the HBase master, and several region servers. Each region server runs coprocessors and hosts regions; each region stores its data in HFiles.]

• Client finds region server addresses in ZooKeeper
• ZooKeeper is used for coordination / monitoring
• Client reads and writes rows by accessing the region server
• HBase master assigns regions and handles load balancing


HBase Data Model

• Table
  – Contains column families
• Column family
  – Logical and physical grouping of columns
• Column
  – Exists only when inserted
  – Can have multiple versions
  – Each row can have a different set of columns
  – Each column is identified by its key
• Row key
  – Implicit primary key
  – Used for storing ordered rows
  – Efficient queries using row key

Table HBTABLE:

    Row key   Value
    11111     cf_data: {‘cq_name’: ‘name1’, ‘cq_val’: 1111}
              cf_info: {‘cq_desc’: ‘desc11111’}
    22222     cf_data: {‘cq_name’: ‘name2’, ‘cq_val’: 2013 @ ts = 2013, ‘cq_val’: 2012 @ ts = 2012}

HFile (cf_data):

    11111 cf_data cq_name name1 @ ts1
    11111 cf_data cq_val  1111  @ ts1
    22222 cf_data cq_name name2 @ ts1
    22222 cf_data cq_val  2013  @ ts1
    22222 cf_data cq_val  2012  @ ts2

HFile (cf_info):

    11111 cf_info cq_desc desc11111 @ ts1


Big SQL HBase Storage Handler

• Mapping of SQL to HBase data: Column Mapping
• Handles serialization/deserialization of data (SerDe)
• Efficiently handles SQL queries by pushing down predicates

[Diagram: a JDBC application sends a SQL query to Big SQL and receives query results. Input data (delimited files) and the warehouse reside on DFS. The diagram shows the HBase Storage Handler with the HBase SerDe, the Query Analyzer (runtime: HBase scan limits, filters, index usage) and the Query Optimizer (compile time: process hints).]


Column Mapping

• Mapping HBase row key/columns to SQL columns
  – Supports one to one and one to many mappings
• One to one mapping
  – Single HBase entity mapped to a single SQL column

    HBase entity        HBase value   SQL column
    key                 11111         id
    cf_data:cq_name     name1         name
    cf_data:cq_val      1111          value
    cf_info:cq_desc     desc11111     desc


Create Table: One to One Mapping

    CREATE HBASE TABLE HBTABLE
    (
      id INT, name VARCHAR(10), value INT, desc VARCHAR(20)
    )
    COLUMN MAPPING
    (
      key             mapped by (id),
      cf_data:cq_name mapped by (name),
      cf_data:cq_val  mapped by (value),
      cf_info:cq_desc mapped by (desc)
    );

Slide callouts: "Required"; the left side of each mapping is the HBase entity and the right side is the SQL column; an HBase column is identified by family:qualifier.


One to Many Column Mapping

• Single HBase entity mapped to multiple SQL columns
• Composite key
  – HBase row key mapped to multiple SQL columns
• Dense column
  – One HBase column mapped to multiple SQL columns

    HBase entity        HBase value     SQL columns
    key                 11111_ac11      userid, acc_no
    cf_data:cq_names    fname1_lname1   first_name, last_name
    cf_data:cq_acct     11111#11#0.25   balance, min_bal, interest


Create Table: One to Many Mapping

    CREATE HBASE TABLE DENSE_TABLE
    (
      userid INT, acc_no VARCHAR(10),
      first_name VARCHAR(10), last_name VARCHAR(10),
      balance double, min_bal double, interest double
    )
    COLUMN MAPPING
    (
      -- composite key: list of SQL columns
      key              mapped by (userid, acc_no),
      -- dense columns
      cf_data:cq_names mapped by (first_name, last_name),
      cf_data:cq_acct  mapped by (balance, min_bal, interest)
    );


Why use One to Many mapping?

• HBase is very verbose
  – Stores a lot of information for each value:
    <row> <columnfamily> <columnqualifier> <timestamp> <value>
  – Primarily intended for sparse data
• Save storage space
  – Sample table with 9 columns and 1.5 million rows
  – One to one mapping: 522 MB
  – One to many mapping: 276 MB
• Improve query response time
  – Query results also return the entire key for each value
  – select * query on sample table
    • One to one mapping: 1m 31s
    • One to many mapping: 1m 2s


Sample Data

• TPCH orders table with 1.5 million rows

    drop table if exists orders_one_to_one;
    CREATE HBASE TABLE ORDERS_ONE_TO_ONE
    (
      O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
      O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
      O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
    )
    column mapping
    (
      key mapped by (O_ORDERKEY),
      f:a mapped by (O_CUSTKEY),
      f:b mapped by (O_ORDERSTATUS),
      f:c mapped by (O_TOTALPRICE),
      f:d mapped by (O_ORDERPRIORITY),
      f:e mapped by (O_CLERK),
      f:f mapped by (O_SHIPPRIORITY),
      f:g mapped by (O_COMMENT),
      f:h mapped by (O_ORDERDATE)
    )
    default encoding binary;

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS_ONE_TO_ONE;

    drop table if exists orders;
    CREATE HBASE TABLE ORDERS
    (
      O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
      O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
      O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
    )
    column mapping
    (
      key   mapped by (O_CUSTKEY, O_ORDERKEY),
      cf:d  mapped by (O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT),
      cf:od mapped by (O_ORDERDATE)
    )
    default encoding binary;

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS;


Data encoding

• HBase stores all data as an array of bytes
  – Application decides how to encode/decode the bytes
• Big SQL uses the Hive SerDe interface for serialization/deserialization
• Supports two types of data encodings: String, Binary
• Encoding can be specified at HBase row key/column level

    HBase entity        HBase value     SQL columns                  Encoding
    key                 11111_ac11      userid, acc_no               String
    cf_data:cq_names    fname1_lname1   first_name, last_name        String
    cf_data:cq_acct     0x000001 …      balance, min_bal, interest   Binary


String encoding

• Default encoding
• Value is converted to string and stored as UTF-8 bytes
• Separator to identify parts in one to many mapping
  – Default separator for string encoding is the null byte (\u0000)
  – A different separator can be specified for each column and for the row key

    CREATE HBASE TABLE DENSE_TABLE_STR
    (
      userid INT, acc_no VARCHAR(10),
      first_name VARCHAR(10), last_name VARCHAR(10),
      balance double, min_bal double, interest double
    )
    COLUMN MAPPING
    (
      key              mapped by (userid, acc_no) separator '_',
      cf_data:cq_names mapped by (first_name, last_name) separator '_',
      cf_data:cq_acct  mapped by (balance, min_bal, interest) separator '#'
    );


String Encoding: Pros and Cons

• Readable format and easier to port across applications
• Useful to map existing data (an existing HBase table mapped as an external Big SQL table)

    HBase entity        HBase value     SQL columns
    key                 11111_ac11      userid, acc_no
    cf_data:cq_names    fname1_lname1   first_name, last_name
    cf_data:cq_acct     10000#10#0.25   balance, min_bal, interest

• Numeric data not collated correctly
  – HBase stores data as bytes
  – Lexicographic ordering: the values 1, 10, 2, 9 sort in that order, so 2 > 10 and 9 > 10
• Slow
  – Parsing strings is expensive


External Tables

• Useful to map tables that already exist in HBase
  – Data in external tables is not pre-validated
• Can create multiple views of the same table
  – Example: use a subset of the data from dense_table

    create external hbase table externalhbase_table
      (user INT, acc string, balance double, min_bal double, interest double)
    column mapping
    (
      key             mapped by (user, acc),
      cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
    )
    hbase table name 'dense_table';

• HBase tables created using the Hive HBase storage handler cannot be read by Big SQL
  – Need to create external tables for this
• Things to note:
  – Dropping an external table only drops the metadata
  – Cannot create a secondary index on external tables


Binary Encoding

• Data encoded using a sortable binary representation
• Separators handled internally
  – Escaped to avoid the issue of the separator existing within data
• If encoding is not specified, string is used as default

    CREATE HBASE TABLE MIXED_ENCODING
    (
      C1 INT, C2 INT, C3 INT, C4 VARCHAR(10), C5 DECIMAL(5,2), C6 SMALLINT
    )
    COLUMN MAPPING
    (
      KEY      MAPPED BY (C1, C2, C3) ENCODING BINARY,
      CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
      CF2:COL1 MAPPED BY (C6) ENCODING BINARY
    );

    key                                                  cf1:col1    cf2:col1
    0x000000000000000100000000000000020000000000000003   foo|97.31   0x0000DEAF


Binary Encoding: Pros and Cons

• Faster
• Numeric types collated correctly, including negative numbers

    CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity DOUBLE)
    COLUMN MAPPING (key mapped by (temp, date), cf:cq mapped by (humidity))
    default encoding binary;

Input data:

    100,2012-06-10 17:00:00:000,40.25
    -17,2012-12-12 17:00:00:000,30.25
    95,2012-06-05 17:00:00:000,50.25

Stored rows, in key order:

    temp   key                                                   cf:cq
    -17    \x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00   \x01\xC0>@\x00\x00\x00\x00\x00
    95     \x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00      \x01\xC0I \x00\x00\x00\x00\x00
    100    \x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00      \x01\xC0D \x00\x00\x00\x00\x00

• Limited portability


Custom Encoding

• Any custom encoding data structure can be supported by decoding using user defined functions

    create hbase table obj_table ( key_col int, obj_col binary(1024) ) ...;

    select get_obj_str_field(obj_col, 'fname') as fname,
           get_obj_str_field(obj_col, 'lname') as lname
    from obj_table
    where key_col = 314145;

• Key lookups and column filters are possible using UDFs

    select key_col from obj_table where obj_col = create_obj('bob', 'johnson');

• Big SQL provides a json_get_object() function for extracting values from textual JSON


Load Data

• Load HBase
  – Loads data from delimited files
  – File can be on DFS or local to the Big SQL server
  – Column list can be specified

    load hbase data inpath 'file:///input.dat'
      delimited fields terminated by '|'
      into table hbtable (name, value, desc, id);

  – Column list is optional. If not specified, the column ordering in the table definition is used
• Load FROM
  – Loads data from a source outside of a BigInsights cluster
  – Covered in the Import/Load/Export presentation
• Insert command available as an undocumented feature

    insert into hbtable (name, value, desc, id)
    values ('name5', 5555, 'desc55555', 55555);


Load Data: Upsert

• HBase ensures uniqueness of row key

Input (first field is the key):

    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name2, 2222, desc22222
    …

After load:

    11111, name1, 1111, desc11111 @ts0
    11111, name9, 9999, desc99999 @ts1
    22222, name2, 2222, desc22222 @ts1
    …

• Upsert can be confusing. No errors, but fewer rows!

    Delimited file: 10 rows
    Load: 10 rows affected
    select count(*) from hbtable: 7 rows

• Combine multiple columns to make the row key unique

    key mapped by (id, name)

After load with the composite key:

    11111\x00name1, 1111, desc11111 @ts0
    11111\x00name9, 9999, desc99999 @ts1
    22222\x00name2, 2222, desc22222 @ts1
    …


Force Key Unique

• Use the force key unique option when creating a table

    CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE
    (
      id INT, name VARCHAR(10), value INT, desc VARCHAR(20)
    )
    COLUMN MAPPING
    (
      key             mapped by (id) force key unique,
      cf_data:cq_name mapped by (name),
      cf_data:cq_val  mapped by (value),
      cf_info:cq_desc mapped by (desc)
    );

• Load adds a UUID to the row key
• Prevents data loss
• Inefficient
  – Stores more data
  – Slower queries

Input:

    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name2, 2222, desc22222
    …

Stored rows:

    11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
    11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
    22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
    …


Load Data: Error Handling

• Option to continue and log error rows
  – LOG ERROR ROWS IN FILE 'filename'
• Common errors
  – Separator exists within data for string encoding
  – Invalid numeric types
• Always count the number of rows after loading
  – Load always reports the total number of rows that it handled

Example: key mapped by (id, name) separator '-', with id defined as integer.

Input file:

    11111, name1, 1111, desc11111
    11111, name9, 9999, desc99999
    22222, name-2, 2222, desc22222
    3333a, name3, 3333, desc33333

Load: 4 rows affected

HBase table (2 rows):

    11111-name1, 1111, desc11111
    11111-name9, 9999, desc99999

Error file (2 rows):

    22222, name-2, 2222, desc22222
    3333a, name3, 3333, desc33333


Options to Speed up Load

• Disable WAL
  – Data loss can happen if a region server crashes

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS DISABLE WAL;

• Increase write buffer
  – set hbase.client.write.buffer=8388608;


Secondary Index Support

• Self-maintaining secondary indexes
  – Stored in an HBase table
  – Populated using a Map Reduce index builder
  – Kept up to date using a synchronous coprocessor

[Diagram: the client sends input data and queries to Big SQL and receives query results. "create index" runs MRIndexBuilder (index building) to populate the index table. The index coprocessor on the data table's regions performs index maintenance. At query time the Query Analyzer (runtime) decides whether to use the index; matching rows are fetched from the data table with batched get requests. The Query Optimizer (compile time) processes hints.]


Index Creation and Usage

    create hbase table dt (id int, c1 string, c2 string, c3 string, c4 string, c5 string)
    column mapping
    (
      key mapped by (id),
      f:a mapped by (c1,c2,c3),
      f:b mapped by (c4,c5)
    );

    create index ixc3 on table dt (c3) as 'hbase';

Data table (dt):

    key    f:a (c1, c2, c3)   f:b (c4, c5)
    bt1    c11_c21_c31        c41_c51
    bt2    c12_c22_c32        c42_c52
    bt3    c13_c23_c33        c43_c53
    …

Index table (dt_ixc3), created by create index ixc3 (c3):

    key
    c31_bt1
    c32_bt2
    c33_bt3
    …

Query c3=c32: use index?
  – No: full table scan
  – Yes: index table range scan (start row = c32, stop row = c32++), then data table get row = bt2

• Automatic index usage
  – Range scan on index table to get matching row key(s) in base table
  – Batched get requests to base table with the matched row key(s)


Index Pros and Cons

• Fast key based lookups for queries that return limited data
• Not beneficial if there are too many matches
• No statistics to make the decision in the compiler
• useindex hint to make explicit choices
• Index adds latency to data load
  – When loading a big data set, drop the index and recreate it
• LOAD from option bypasses index maintenance
• Uses HBase bulk load, which writes to HFiles directly


Column Family Options

• Compression
  – compression(gz)
• Bloom filters
  – NONE, ROW, ROWCOL
• In memory columns
  – in memory, no in memory

    create hbase table colopt_table (key string, c1 string)
    column mapping (key mapped by (key), cf1:c1 mapped by (c1))
    column family options (cf1 compression(gz) bloom filter(row) in memory);


Query Handling

• Projection pushdown
• Predicate pushdown
  – Point scan
  – Range scan
  – Automatic index usage
  – Filters
• Query Hints


Sample Data

• TPCH orders table with 1.5 million rows

    drop table if exists orders;
    CREATE HBASE TABLE ORDERS
    (
      O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
      O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15),
      O_CLERK VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79)
    )
    column mapping
    (
      key   mapped by (O_CUSTKEY,O_ORDERKEY),
      cf:d  mapped by (O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT),
      cf:od mapped by (O_ORDERDATE)
    )
    default encoding binary;

    LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
      DELIMITED FIELDS TERMINATED BY '|'
      INTO TABLE ORDERS;


Projection Pushdown

• Get only the columns required by the query
• Limit data retrieved to the client

    select * from orders
    go -m discard
    1500000 rows in results(first row: 0.21s; total: 1m1.77s)
    Log: HBase scan details:{ … , families={cf=[d, od]}, …}

    select o_totalprice from orders
    go -m discard
    1500000 rows in results(first row: 0.19s; total: 21.27s)
    Log: HBase scan details:{ … , families={cf=[d]}, …}

    select o_orderdate from orders
    go -m discard
    1500000 rows in results(first row: 0.36s; total: 36.24s)
    Log: HBase scan details:{ … , families={cf=[od]}, …}

  The response time is higher for the o_orderdate query even though it retrieves less data than the query for o_totalprice. This is because the timestamp type is more expensive.

• Projection happens at HBase column level
  – For composite key and dense columns, the entire value is retrieved to the client
  – Efficient to pack columns that are queried together


Predicate Pushdown: Point Scan

• With full row key
• Big SQL can combine predicates on row key parts

    set force local on;
    select o_orderkey,o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
    +--------------+
    | o_totalprice |
    +--------------+
    | 208660.75000 |
    +--------------+
    1 row in results(first row: 0.14s; total: 0.14s)
    Log: Found a row scan by combining all composite key parts.

Query o_custkey=1 and o_orderkey=454791: start row=1#454791, stop row=1#454791

    key (o_custkey#o_orderkey)   columns
    1#454791                     …
    1#579908
    1#3868359
    1#4273923
    1#4808192
    1#5133509
    …


Predicate Pushdown: Partial row Scan

• Predicate(s) on leading part(s) of row key

    select o_orderkey,o_totalprice from orders where o_custkey=1;
    +------------+--------------+
    | o_orderkey | o_totalprice |
    +------------+--------------+
    | 454791     | 74602.81250  |
    | 579908     | 54048.26172  |
    | 3868359    | 123076.84375 |
    | 4273923    | 95911.00781  |
    | 4808192    | 65478.05078  |
    | 5133509    | 174645.93750 |
    +------------+--------------+
    6 rows in results(first row: 0.13s; total: 0.13s)
    Log: Found a row scan that uses the first 1 part(s) of composite key.

Query o_custkey=1: start row=1, stop row=1++

    key (o_custkey#o_orderkey)   columns
    1#454791                     …
    1#579908
    1#3868359
    1#4273923
    1#4808192
    1#5133509
    2#430243
    …


Predicate Pushdown: Range Scan

• With range predicates

    select o_orderkey,o_totalprice from orders where o_custkey < 3;
    Log: Found a row scan that uses the first 1 part(s) of composite key.
    Log: HBase scan details:{ .. , stopRow=\x01\x80\x00\x00\x03, startRow=, … }

Query o_custkey<3: start row= (empty), stop row=3#

    key (o_custkey#o_orderkey)   columns
    1#454791                     …
    …
    1#5133509
    2#430243
    …
    4#164711


Predicate Pushdown: Full table Scan

• This is an example of a case where predicates are not pushed down
• Happens if the predicates are on non-leading parts of the row key

    set force local on;
    select o_orderkey,o_totalprice from orders where o_orderkey=454791;
    +------------+--------------+
    | o_orderkey | o_totalprice |
    +------------+--------------+
    | 454791     | 74602.81250  |
    +------------+--------------+
    1 row in results(first row: 32.13s; total: 32.13s)
    Log: HBase scan details:{ .. , stopRow=, startRow=, … }


Automatic Index Usage

    select * from orders where o_clerk='Clerk#000000999'
    go -m discard
    1472 rows in results(first row: 1.63s; total: 30.32s)

    create index ix_clerk on table orders (o_clerk) as 'hbase';
    0 rows affected (total: 3m57.82s)

    select * from orders where o_clerk='Clerk#000000999'
    go -m discard
    1472 rows in results(first row: 3.60s; total: 3.65s)
    Log: Index query successful

• Index used automatically
• For a composite index, rules similar to composite row key apply
  – Parts will be combined where possible
  – With a partial value for a composite index, a range scan is done on the index table
• Multiple indexes on a table
  – Index to be used is randomly chosen
  – Specify the useIndex hint to make use of a specific index


Pushing down Filters into HBase

• Filters do not avoid a full table scan
  – Some filters can skip certain sections, e.g. PrefixFilter
• Limits rows returned to the client
• Limits data returned to the client
  – Key only filters

    select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P'
    go -m discard
    12819 rows in results(first row: 1.12s; total: 6.80s)
    Log: Found a row scan that uses the first 1 part(s) of composite key.
    Log: HBase filter list created using AND.
    Log: HBase scan details:{… , filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=, startRow=\x01\x80\x01\x86\xA1, …}

  – o_custkey>100000: row scan
  – o_orderstatus='P': column filter, as there is a predicate on the leading part of a dense column


Key Only Tables

• Big SQL allows creation of tables without specifying any HBase column

    create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
    column mapping (key mapped by (k1, k2, k3));

    select * from KEY_ONLY_TABLE;
    Log: Only row key or parts of row key requested. Applying filters.
    …
    Log: HBase scan details:{… families={}, filter=FilterList AND (2/2): [FirstKeyOnlyFilter, KeyOnlyFilter], …}


Predicate Precedence

• When a query contains multiple predicates, the following precedence applies:
  – Row Scan
  – Index
  – Filters
    • Row filters
    • Column filters
• Filters will be applied along with row scans
• Filters cannot be combined with index lookups
• Multiple predicates: use of row and column filter

    select o_orderkey, o_custkey, o_orderdate from orders
    where o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;
    Log: HBase filter list created using OR.
    Log: HBase scan details:{… , filter=FilterList OR (2/2): [SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00), PrefixFilter \x01\x80\x00\x00\x02], cacheBlocks=false, stopRow=, startRow=, … }

  The OR condition prevents usage of a row scan. A row filter (PrefixFilter) is used along with a column filter.


Accessmode Hint

• Will run the query locally in the Big SQL server
  – Useful to avoid map reduce overhead
• Very important for HBase point queries
  – This is not detected currently by the compiler
  – Specify the accessmode='local' hint when getting a limited set of data from HBase
• Specify at query level

    select o_orderkey from orders /*+ accessmode='local' +*/
    where o_custkey=1 and o_orderkey=454791;

• Specify at session level
  – set force local on
  – set force commands override query level hints


HBase Hints

• rowcachesize (default=2000)
  – Used as scan cache setting
  – Also used to determine the number of get requests to batch in index lookups
• colbatchsize (default=100)
• useindex ('false' to avoid index usage)

    select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000
    go -m discard
    1450136 rows in results(first row: 22.67s; total: 27.46s)
    Log: HBase scan details:{... , caching=10000, ...}

• rowcachesize can also be set using the set command:
  – set hbase.client.scanner.caching=10000;


Recommendations

• Row key design is the most important factor
  – Try to combine the predicates that are most commonly used into row key columns
  – Do not make the row key too long
• Use short names for HBase column families and column qualifiers
  – f:q instead of mycolumnfamily:mycolumnqualifier
• Check if key only tables can be used
• Pack columns that are queried together into dense columns
  – Use the column that is used as query predicate as prefix
  – Create indexes for columns that do not have repeating values and are queried often
• Separate columns that are rarely or never queried into a different column family
• Set hbase.client.scanner.caching to an optimum value
• Ensure even data distribution


Limitations

• No diagnostic info about HBase pushdown
  – How the HBase storage handler pushes down a query is decided only at runtime
  – Predicate handling details are logged at INFO level
  – Many examples of log messages covered in previous slides
• No auto detection of local vs MR mode
  – Currently depends on user specified hints
• Statistics not available
  – Big SQL does not have a framework to collect statistics
  – Query optimizations can be improved with availability of useful statistics
• Map type not supported
  – Big SQL does not support the map data type
  – Hive HBase handler supports the map data type and many to one mapping
    • Mapping an entire HBase column family to a map data type


Logs and Troubleshooting

• Big SQL logs
  – Look for the rewritten query
  – More information in Big SQL logs if the query is run in local mode
• Map Reduce logs
  – Predicate handling information is in the map task log when run in MR mode
• HBase web GUI
  – http://servername:60010/master-status


Big SQL HBase Handler Highlights

• Support for composite key/dense columns
• Pushdown for efficient execution of queries
• Support for secondary indexes
• Collatable binary encoding
• Key only tables
• Support for hints to make query optimization decisions


Scenarios that can leverage HBase features

• Point queries
  – Queries that return a single row of result
  – Row can be determined using row key or secondary index
    • Not all queries using a secondary index are point queries
• Queries with projections
  – If a query requires only a few columns
  – Projection happens at HBase column level
• Data maintenance using upserts
  – Loading different values for columns using the same row key

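The point, partial row, range and full table scans from the predicate pushdown slides can be mimicked over a sorted list of composite keys. The following Python sketch is only a model under stated assumptions: a sorted list stands in for an HBase table, all helper names are hypothetical, and key parts are encoded as sign-flipped big-endian integers without the marker bytes that appear in the deck's logs. The keys are the (o_custkey, o_orderkey) pairs shown on the slides.

```python
import bisect
import struct


def part(value: int) -> bytes:
    """Sortable 4-byte encoding of one key part (sign bit flipped)."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def row_key(custkey: int, orderkey: int) -> bytes:
    """Composite row key: leading part o_custkey, then o_orderkey."""
    return part(custkey) + part(orderkey)


def next_prefix(prefix: bytes) -> bytes:
    """Smallest byte string greater than every key starting with prefix.

    This plays the role of the "1++" stop row on the slides. An empty
    result means "no upper bound".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def range_scan(sorted_keys, start: bytes, stop: bytes):
    """Return keys with start <= key < stop (empty stop: scan to the end)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop) if stop else len(sorted_keys)
    return sorted_keys[lo:hi]


pairs = [(1, 454791), (1, 579908), (1, 3868359), (1, 4273923),
         (1, 4808192), (1, 5133509), (2, 430243), (4, 164711)]
table = sorted(row_key(c, o) for c, o in pairs)

# Point scan: predicates on every key part give one exact row key.
full_key = row_key(1, 454791)
point = range_scan(table, full_key, next_prefix(full_key))

# Partial row scan: predicate on the leading part only (o_custkey = 1).
partial = range_scan(table, part(1), next_prefix(part(1)))

# Range scan: o_custkey < 3 has an empty start row and stop row part(3).
ranged = range_scan(table, b"", part(3))

# Predicate on a non-leading part (o_orderkey = 454791) gives no usable
# start/stop row, so every key has to be examined.
full_scan = [k for k in table if k[4:] == part(454791)]

print(len(point), len(partial), len(ranged), len(full_scan))
```

The first three queries touch only a contiguous slice of the sorted keys, while the last one walks the whole list. That mirrors the response times in the deck: 0.14s for the point scan against 32.13s for the full table scan.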