BigSQL-HBase.ppt

Accessing HBase using Big SQL
Deepa Remesh
BigInsights Development
For questions about this presentation contact Deepa Remesh dremesh@us.ibm.com
Agenda
 Introduction to HBase
 Big SQL HBase Storage Handler
– Column mapping
– Data encoding
– Data load
 Secondary Indexes
 Querying
 Recommendations and limitations
 Logs and Troubleshooting
 Highlights and HBase use cases
© 2013 IBM Corporation
HBase Basics
 Client/server database
– Master and a set of region servers
– Fetching data requires additional network hop through region server
 Key-value store
– Key and value are byte arrays
– Efficient access using row key
 Rich set of Java APIs and extensible frameworks
– Supports a wide variety of filters
– Allows application logic to run in region server via coprocessors
 Different from relational databases
– No types: all data is stored as bytes
– No schema: Rows can have different set of columns
HBase Cluster Architecture
[Figure: HBase cluster architecture]
– Client finds region server addresses in ZooKeeper
– ZooKeeper quorum (a set of ZooKeeper peers) is used for coordination / monitoring
– HBase master assigns regions and handles load balancing
– Client reads and writes rows by accessing the region server
– Each region server hosts regions and coprocessors; each region stores its data in HFiles on HDFS / GPFS
HBase Data Model
 Table
– Contains column-families
 Column family
– Logical and physical grouping of columns
 Column
– Exists only when inserted
– Can have multiple versions
– Each row can have a different set of columns
– Each column is identified by its key
 Row key
– Implicit primary key
– Used for storing ordered rows
– Efficient queries using row key
Example table HBTABLE (row key, value):
11111   cf_data: {'cq_name': 'name1', 'cq_val': 1111}
        cf_info: {'cq_desc': 'desc11111'}
22222   cf_data: {'cq_name': 'name2', 'cq_val': 2013 @ ts = 2013, 'cq_val': 2012 @ ts = 2012}
Physical storage:
HFile
11111 cf_data cq_name name1 @ ts1
11111 cf_data cq_val 1111 @ ts1
22222 cf_data cq_name name2 @ ts1
22222 cf_data cq_val 2013 @ ts1
22222 cf_data cq_val 2012 @ ts2
HFile
11111 cf_info cq_desc desc11111 @ ts1
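The data model above can be pictured as a nested map: row key, then column (family and qualifier), then timestamp, then value. The following is a minimal Python sketch of that idea only. `ToyTable` and its methods are hypothetical names for illustration and are not part of the HBase API.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of an HBase table (illustration only, not the HBase API)."""

    def __init__(self):
        # row key -> (family, qualifier) -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        # A column exists for a row only once something is inserted into it.
        self.rows[row][(family, qualifier)][ts] = value

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, or None if it was never inserted."""
        versions = self.rows.get(row, {}).get((family, qualifier))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return row keys in sorted order, mirroring ordered storage by row key."""
        return sorted(self.rows)


table = ToyTable()
table.put(b"22222", "cf_data", "cq_val", 2012, ts=2012)
table.put(b"22222", "cf_data", "cq_val", 2013, ts=2013)
table.put(b"11111", "cf_data", "cq_name", "name1", ts=1)
table.put(b"11111", "cf_info", "cq_desc", "desc11111", ts=1)

print(table.get(b"22222", "cf_data", "cq_val"))   # newest version of the cell
print(table.get(b"22222", "cf_info", "cq_desc"))  # column never inserted for this row
print(table.scan())
```

Row 22222 has two versions of cq_val and no cf_info column at all, which is the "no schema" behaviour described on the slide.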
Big SQL HBase Storage Handler
 Mapping of SQL to HBase data: Column Mapping
 Handles serialization/deserialization of data (SerDe)
 Efficiently handles SQL queries by pushing down predicates
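[Figure: Big SQL HBase storage handler]
– A JDBC application sends a SQL query to Big SQL and receives the query results
– Input data comes from delimited files in the warehouse on DFS
– Query Optimizer (compile time): processes hints
– HBase Storage Handler: contains the SerDe and the Query Analyzer (runtime)
– Query Analyzer (runtime) decides: HBase scan limits, filters, index usage
– The storage handler reads from and writes to HBase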
Column Mapping
 Mapping HBase row key/columns to SQL columns
– Supports one to one and one to many mappings
 One to one mapping
– Single HBase entity mapped to a single SQL column
HBase (sample value) -> SQL column
key (11111) -> id
cf_data:cq_name (name1) -> name
cf_data:cq_val (1111) -> value
cf_info:cq_desc (desc11111) -> desc
Create Table: One to One Mapping
CREATE HBASE TABLE HBTABLE
( id INT,
  name VARCHAR(10),
  value INT,
  desc VARCHAR(20)
)
COLUMN MAPPING
(
  key             mapped by (id),
  cf_data:cq_name mapped by (name),
  cf_data:cq_val  mapped by (value),
  cf_info:cq_desc mapped by (desc)
);
– The row key mapping (key mapped by) is required
– Left side of each mapping: HBase entity; an HBase column is identified by family:qualifier
– Right side of each mapping: SQL column
One to Many Column Mapping
 Single HBase entity mapped to multiple SQL columns
 Composite key
– HBase row key mapped to multiple SQL columns
 Dense column
– One HBase column mapped to multiple SQL columns
HBase (sample value) -> SQL columns
key (11111_ac11) -> userid, acc_no
cf_data:cq_names (fname1_lname1) -> first_name, last_name
cf_data:cq_acct (11111#11#0.25) -> balance, min_bal, interest
Create Table: One to Many Mapping
CREATE HBASE TABLE DENSE_TABLE
( userid INT,
  acc_no VARCHAR(10),
  first_name VARCHAR(10),
  last_name VARCHAR(10),
  balance double,
  min_bal double,
  interest double
)
COLUMN MAPPING
(
  key              mapped by (userid, acc_no),
  cf_data:cq_names mapped by (first_name, last_name),
  cf_data:cq_acct  mapped by (balance, min_bal, interest)
);
– Composite key: the row key is mapped by a list of SQL columns (userid, acc_no)
– Dense columns: cf_data:cq_names and cf_data:cq_acct are each mapped by a list of SQL columns
Why use One to Many mapping?
 HBase is very verbose
– Stores a lot of information for each value
– Primarily intended for sparse data
<row> <columnfamily> <columnqualifier> <timestamp> <value>
 Save storage space
– Sample table with 9 columns. 1.5 million rows
– One to one mapping: 522 MB
– One to many mapping: 276 MB
 Improve query response time
– Query results also return the entire key for each value
– select * query on sample table
• One to one mapping: 1m 31 s
• One to many mapping: 1m 2s
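The storage saving can be illustrated with rough per-cell arithmetic. The sketch below assumes the classic HBase KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row, family, qualifier and value bytes). It ignores compression, block encoding and file-level overhead, and the qualifier `cq_d` is a hypothetical name, so the numbers are indicative only and do not reproduce the measurements on the slide.

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate serialized size of one HBase cell under the assumed KeyValue layout."""
    fixed = 4 + 4 + 2 + 1 + 8 + 1  # lengths, timestamp and type
    return fixed + len(row) + len(family) + len(qualifier) + len(value)


row = b"11111"

# One to one mapping: every SQL column is its own cell, so the row key,
# family, qualifier and timestamp are repeated for each value.
one_to_one = (
    keyvalue_size(row, b"cf_data", b"cq_name", b"name1")
    + keyvalue_size(row, b"cf_data", b"cq_val", b"1111")
)

# One to many mapping: both values share one cell, joined by the \x00 separator.
dense = keyvalue_size(row, b"cf_data", b"cq_d", b"name1\x001111")

print(one_to_one, dense)
```

The gap widens as more SQL columns are packed into one dense column, because each extra cell repeats the full key.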
Sample Data
 TPCH orders table with 1.5 million rows
drop table if exists orders_one_to_one;
CREATE HBASE TABLE ORDERS_ONE_TO_ONE ( O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS
VARCHAR(1), O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15), O_CLERK
VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) )
column mapping (
key mapped by (O_ORDERKEY),
f:a mapped by (O_CUSTKEY),
f:b mapped by (O_ORDERSTATUS),
f:c mapped by (O_TOTALPRICE),
f:d mapped by (O_ORDERPRIORITY),
f:e mapped by (O_CLERK),
f:f mapped by (O_SHIPPRIORITY),
f:g mapped by (O_COMMENT),
f:h mapped by (O_ORDERDATE)
)
default encoding binary;
LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE
ORDERS_ONE_TO_ONE;
drop table if exists orders;
CREATE HBASE TABLE ORDERS ( O_ORDERKEY BIGINT, O_CUSTKEY INTEGER, O_ORDERSTATUS VARCHAR(1),
O_TOTALPRICE FLOAT, O_ORDERDATE TIMESTAMP, O_ORDERPRIORITY VARCHAR(15), O_CLERK
VARCHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) )
column mapping (
key mapped by (O_CUSTKEY, O_ORDERKEY),
cf:d mapped by
(O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT),
cf:od mapped by (O_ORDERDATE)
)
default encoding binary;
LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE
ORDERS;
Data encoding
 HBase stores all data as an array of bytes
– Application decides how to encode/decode the bytes
 Big SQL uses Hive SerDe interface for serialization/deserialization
 Supports two types of data encodings: String, Binary
 Encoding can be specified at HBase row key/column level
HBase (sample value) -> SQL columns (encoding)
key (11111_ac11) -> userid, acc_no (String)
cf_data:cq_names (fname1_lname1) -> first_name, last_name (String)
cf_data:cq_acct (0x000001 …) -> balance, min_bal, interest (Binary)
String encoding
 Default encoding
 Value is converted to string and stored as UTF-8 bytes
 Separator to identify parts in one to many mapping
– Default separator: \u0000
CREATE HBASE TABLE DENSE_TABLE_STR
( userid INT,
  acc_no VARCHAR(10),
  first_name VARCHAR(10),
  last_name VARCHAR(10),
  balance double,
  min_bal double,
  interest double
)
COLUMN MAPPING
(
  key mapped by (userid, acc_no) separator '_',
  cf_data:cq_names mapped by (first_name, last_name) separator '_',
  cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
);
– A different separator can be specified for each column and for the row key
– The default separator for string encoding is the null byte (\u0000)
String Encoding: Pros and Cons
 Readable format and easier to port across applications
 Useful to map existing data
Existing HBase table (sample value) -> External Big SQL table
key (11111_ac11) -> userid, acc_no
cf_data:cq_names (fname1_lname1) -> first_name, last_name
cf_data:cq_acct (10000#10#0.25) -> balance, min_bal, interest
 Numeric data not collated correctly
– HBase stores data as bytes
– Lexicographic ordering: the values sort as 1, 10, 2, 9, so 2 > 10 and 9 > 10
 Slow
– Parsing strings is expensive
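The collation problem is easy to reproduce outside HBase. This short Python sketch sorts the same four values once as numbers and once as UTF-8 strings, which is how string-encoded values are compared byte by byte.

```python
values = [1, 10, 2, 9]

# String encoding: each value is converted to text and stored as UTF-8 bytes.
as_bytes = [str(v).encode("utf-8") for v in values]

numeric_order = sorted(values)
byte_order = sorted(as_bytes)  # bytes compare lexicographically

print(numeric_order)
print(byte_order)
```

In byte order, `b"10"` sorts before `b"2"` because the comparison stops at the first differing byte, so a range predicate on a string-encoded numeric key cannot be turned into a simple start/stop row.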
External Tables
 Useful to map tables that already exist in HBase
– Data in external tables is not pre-validated
 Can create multiple views of same table
Example: use a subset of the data from dense_table
create external hbase table externalhbase_table
  (user INT, acc string, balance double, min_bal double, interest double)
column mapping
  (key mapped by (user, acc),
   cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#')
hbase table name 'dense_table';
 HBase tables created using Hive HBase storage handler cannot be
read by Big SQL
– Need to create external tables for this
 Things to note:
– Dropping external table only drops the metadata
– Cannot create secondary index on external tables
Binary Encoding
 Data encoded using sortable binary representation
 Separators handled internally
– Escaped to avoid issue of separator existing within data
CREATE HBASE TABLE MIXED_ENCODING
(
  C1 INT, C2 INT, C3 INT,
  C4 VARCHAR(10), C5 DECIMAL(5,2),
  C6 SMALLINT
)
COLUMN MAPPING
(
  KEY      MAPPED BY (C1, C2, C3) ENCODING BINARY,
  CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
  CF2:COL1 MAPPED BY (C6) ENCODING BINARY
);
– If encoding is not specified, string is used as the default
Sample stored row:
key:      0x000000000000000100000000000000020000000000000003
cf1:col1: foo|97.31
cf2:col1: 0x0000DEAF
Binary Encoding: Pros and Cons
 Faster
 Numeric types collated correctly including negative numbers
CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity
DOUBLE)
COLUMN MAPPING (key mapped by (temp, date), cf:cq mapped by
(humidity))
default encoding binary;
Input rows (temp, date, humidity):
100,2012-06-10 17:00:00:000,40.25
-17,2012-12-12 17:00:00:000,30.25
95,2012-06-05 17:00:00:000,50.25
Stored rows, in row key order (temp | row key | cf:cq):
-17 | \x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00 | \x01\xC0>@\x00\x00\x00\x00\x00
95  | \x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00 | \x01\xC0I \x00\x00\x00\x00\x00
100 | \x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00 | \x01\xC0D \x00\x00\x00\x00\x00
 Limited portability
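The row keys in the WEATHER example are consistent with a common sortable encoding for signed integers: write the value as a 4-byte big-endian two's complement number and flip the sign bit. The Python sketch below reproduces the INT part of those keys under that assumption. The leading `\x01` byte is copied from the sample rows; its meaning is not stated in the slides, and `encode_int_sortable` is a hypothetical helper, not a Big SQL function.

```python
import struct


def encode_int_sortable(value: int) -> bytes:
    """Encode a signed 32-bit int so that byte order matches numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a signed 32-bit integer")
    flipped = (value & 0xFFFFFFFF) ^ 0x80000000  # flip the sign bit
    # The 0x01 prefix is taken from the sample data (assumed marker byte).
    return b"\x01" + struct.pack(">I", flipped)


temps = [100, -17, 95]
encoded = {t: encode_int_sortable(t) for t in temps}

# Sorting by the encoded bytes yields the numeric order, negatives included.
by_bytes = sorted(temps, key=lambda t: encoded[t])
print(by_bytes)
print(encoded[-17])
```

Flipping the sign bit moves negative values below positive ones in unsigned byte order, which is why -17 sorts before 95 and 100 in the stored rows.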
Custom Encoding
 Any custom encoding data structure can be supported by decoding using
user defined functions
create hbase table obj_table
(
key_col int,
obj_col binary(1024)
) ...;
select get_obj_str_field(obj_col, 'fname') as fname,
get_obj_str_field(obj_col, 'lname') as lname
from obj_table
where key_col = 314145;
 Key lookups and column filters are possible using UDFs
select key_col
from obj_table
where obj_col = create_obj('bob', 'johnson');
 Big SQL provides a json_get_object() function for extracting values from
textual JSON
Load Data
 Load HBase
– Loads data from delimited files
– File can be on DFS or local to the Big SQL server
– Column list can be specified
load hbase data inpath 'file:///input.dat'
delimited fields terminated by '|'
into table hbtable
(name, value, desc, id);
– The column list is optional. If not specified, the column ordering in the table definition is used
 Load FROM
– Loads data from a source outside of a BigInsights cluster
– Covered in Import/Load/Export presentation
 Insert command available as undocumented feature
insert into hbtable
(name, value, desc, id)
values('name5', 5555, 'desc55555', 55555);
Load Data: Upsert
 HBase ensures uniqueness of row key
Input file:
11111 , name1, 1111, desc11111
11111 , name9, 9999, desc99999
22222 , name2, 2222, desc22222
…
After load (row key = id):
11111 , name1, 1111, desc11111 @ts0
11111 , name9, 9999, desc99999 @ts1
22222 , name2, 2222, desc22222 @ts1
…
 Upsert can be confusing: no errors, but fewer rows!
Delimited file: 10 rows
Load: 10 rows affected
select count(*) from hbtable: 7 rows
 Combine multiple columns to make the row key unique
key mapped by (id, name)
After load (row key = id + name):
11111\x00name1, 1111, desc11111 @ts0
11111\x00name9, 9999, desc99999 @ts1
22222\x00name2, 2222, desc22222 @ts1
…
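The upsert behaviour can be mimicked with a dictionary keyed by row key. This is a toy Python sketch, not the Big SQL loader: `load` is a hypothetical function, and it ignores cell versions (older values may still be kept as earlier versions, but a query sees only the latest one).

```python
def load(rows, key_columns):
    """Simulate a load into a table whose row key is built from key_columns."""
    table = {}
    for row in rows:
        # Key parts are joined with the default \x00 separator of string encoding.
        row_key = "\x00".join(str(row[c]) for c in key_columns)
        table[row_key] = row  # same row key: the newer row replaces the older one
    return len(rows), table


rows = [
    {"id": 11111, "name": "name1", "value": 1111, "desc": "desc11111"},
    {"id": 11111, "name": "name9", "value": 9999, "desc": "desc99999"},
    {"id": 22222, "name": "name2", "value": 2222, "desc": "desc22222"},
]

affected, by_id = load(rows, ["id"])
_, by_id_name = load(rows, ["id", "name"])

print(affected, len(by_id), len(by_id_name))
```

With the key mapped by id alone, three rows are handled but only two remain visible; adding name to the key keeps all three.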
Force Key Unique
 Use force key unique option when creating a table
CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE
( id INT, name VARCHAR(10), value INT, desc VARCHAR(20) )
COLUMN MAPPING
(
  key             mapped by (id) force key unique,
  cf_data:cq_name mapped by (name),
  cf_data:cq_val  mapped by (value),
  cf_info:cq_desc mapped by (desc)
);
 Load adds UUID to the row key
 Prevents data loss
 Inefficient
 Stores more data
 Slower queries
Input rows:
11111 , name1, 1111, desc11111
11111 , name9, 9999, desc99999
22222 , name2, 2222, desc22222
…
Stored rows (UUID appended to the row key):
11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
…
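The stored keys above have the shape "original key, \x00, UUID". A minimal Python sketch of that shape follows, assuming a random (version 4) UUID; the slides do not say which UUID variant the loader uses, and `unique_row_key` is a hypothetical helper.

```python
import uuid


def unique_row_key(key: str) -> str:
    """Append a separator and a random UUID so that duplicate keys no longer collide."""
    return f"{key}\x00{uuid.uuid4()}"


keys = [unique_row_key("11111"), unique_row_key("11111")]
extra_chars = len(keys[0]) - len("11111")

print(keys[0] != keys[1], extra_chars)
```

The 37 extra characters per row key (separator plus 36-character UUID) are the cost behind the "stores more data" and "slower queries" drawbacks listed on the slide.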
Load Data: Error Handling
 Option to continue and log error rows
– LOG ERROR ROWS IN FILE 'filename'
 Common Errors
– Separator exists within data for string encoding
– Invalid numeric types
 Always count number of rows after loading
– Load always reports total number of rows that it handled
Example: key mapped by (id, name) separator '-', with id defined as integer
Input file:
11111 , name1, 1111, desc11111
11111 , name9, 9999, desc99999
22222 , name-2, 2222, desc22222
3333a, name3, 3333, desc33333
…
Load: 4 rows affected
HBase table (2 rows):
11111-name1, 1111, desc11111
11111-name9, 9999, desc99999
Error file (2 rows):
22222 , name-2, 2222, desc22222
3333a , name3, 3333, desc33333
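The two common errors can be sketched as a validation step in front of the load. The Python below is an illustration only: `load_with_error_log` is a hypothetical function that mimics the outcome shown above (rows handled, rows stored, rows sent to the error file) and does not reflect how the loader is implemented.

```python
def is_int(text: str) -> bool:
    try:
        int(text)
        return True
    except ValueError:
        return False


def load_with_error_log(lines, separator="-"):
    """Return (rows handled, stored rows keyed by row key, rejected lines)."""
    table, errors = {}, []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        id_text, name = fields[0], fields[1]
        # Reject an invalid integer id, or a key part containing the separator.
        if not is_int(id_text) or separator in name:
            errors.append(line)
            continue
        table[f"{id_text}{separator}{name}"] = fields[2:]
    return len(lines), table, errors


lines = [
    "11111 , name1, 1111, desc11111",
    "11111 , name9, 9999, desc99999",
    "22222 , name-2, 2222, desc22222",
    "3333a, name3, 3333, desc33333",
]
handled, table, errors = load_with_error_log(lines)
print(handled, sorted(table), len(errors))
```

The handled count is 4 even though only 2 rows are stored, which is why counting rows after a load is recommended.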
Options to Speed up Load
 Disable WAL
– Data loss can happen if region server crashes
LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS DISABLE WAL;
 Increase write buffer
– set hbase.client.write.buffer=8388608;
Secondary Index Support
 Self-maintaining secondary indexes
– Stored in an HBase table
– Populated using a Map Reduce index builder
– Kept up to date using a synchronous coprocessor
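[Figure: secondary index architecture]
– Index building: a create index statement in Big SQL runs MRIndexBuilder, which reads the data table and populates the index table
– Index maintenance: as the client loads input data into the data regions, the index coprocessor updates the index regions
– Querying: the Query Optimizer (compile time) processes hints; the Query Analyzer (runtime) in the HBase Storage Handler decides whether to use the index
– Rows matched through the index table are fetched from the data table using batched get requests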
Index Creation and Usage
create hbase table dt(id int,c1 string,c2 string,c3 string,c4 string,c5 string)
column mapping (key mapped by (id), f:a mapped by (c1,c2,c3), f:b mapped by
(c4,c5));
create index ixc3 on table dt (c3) as 'hbase';
Data table (dt): key, f:a (c1, c2, c3), f:b (c4, c5)
bt1 , c11_c21_c31, c41_c51
bt2 , c12_c22_c32, c42_c52
bt3 , c13_c23_c33, c43_c53
…
Index table (dt_ixc3), built by create index ixc3 (c3): key
c31_bt1
c32_bt2
c33_bt3
…
Query with c3=c32: use index?
– No: full table scan of the data table
– Yes: index table range scan (start row = c32, stop row = c32++), then data table get (row = bt2)
 Automatic index usage
– Range scan on index table to get matching row key(s) in base table
– Batched get requests to base table with the matched row key(s)
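The index lookup path can be sketched with a sorted list standing in for the index table. In this Python illustration, index row keys are "indexed value, underscore, base row key" as in the example above; the function name and the in-memory structures are hypothetical, and the real index lives in an HBase table.

```python
from bisect import bisect_left

data_table = {
    "bt1": ("c11", "c21", "c31", "c41", "c51"),
    "bt2": ("c12", "c22", "c32", "c42", "c52"),
    "bt3": ("c13", "c23", "c33", "c43", "c53"),
}

# Index on c3 (position 2): row key = indexed value + "_" + base row key, kept sorted.
index_table = sorted(f"{cols[2]}_{key}" for key, cols in data_table.items())


def lookup_by_c3(value):
    """Range scan on the index, then gets against the data table."""
    prefix = f"{value}_"
    i = bisect_left(index_table, prefix)  # start row of the range scan
    base_keys = []
    while i < len(index_table) and index_table[i].startswith(prefix):
        base_keys.append(index_table[i][len(prefix):])
        i += 1
    # In HBase these gets would be sent to the data table in batches.
    return [(key, data_table[key]) for key in base_keys]


print(lookup_by_c3("c32"))
```

Only the matching slice of the index and one data row are touched, which is why the approach pays off when few rows match and loses its advantage when most rows do.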
Index Pros and Cons
 Fast key based lookups for queries that return limited data
 Not beneficial if there are too many matches
 No statistics to make the decision in compiler
 useindex hint to make explicit choices
 Index adds latency to data load
– When loading a big data set, drop index and recreate
LOAD from option bypasses index maintenance
 Uses HBase bulk load which writes to HFiles directly
Column Family Options
 Compression
– compression(gz)
 Bloom filters
– NONE, ROW, ROWCOL
 In memory columns
– in memory, no in memory
create hbase table colopt_table (key string, c1 string)
column mapping(key mapped by (key), cf1:c1 mapped by(c1))
column family options(cf1 compression(gz) bloom filter(row) in memory);
Query Handling
 Projection pushdown
 Predicate pushdown
– Point scan
– Range scan
– Automatic index usage
– Filters
 Query Hints
Sample Data
 TPCH orders table with 1.5 million rows
drop table if exists orders;
CREATE HBASE TABLE ORDERS ( O_ORDERKEY BIGINT, O_CUSTKEY INTEGER,
O_ORDERSTATUS VARCHAR(1), O_TOTALPRICE FLOAT, O_ORDERDATE
TIMESTAMP, O_ORDERPRIORITY VARCHAR(15), O_CLERK VARCHAR(15),
O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) )
column mapping (
key mapped by (O_CUSTKEY,O_ORDERKEY),
cf:d mapped by
(O_ORDERSTATUS,O_TOTALPRICE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT),
cf:od mapped by (O_ORDERDATE)
)
default encoding binary;
LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS
TERMINATED BY '|' INTO TABLE ORDERS;
Projection Pushdown
 Get only columns required by the query
 Limit data retrieved to the client
select * from orders
go -m discard
1500000 rows in results(first row: 0.21s; total: 1m1.77s)
Log
HBase scan details:{ … , families={cf=[d, od]}, …}
select o_totalprice from orders
go -m discard
1500000 rows in results(first row: 0.19s; total: 21.27s)
Log
HBase scan details:{ … , families={cf=[d]}, …}
select o_orderdate from orders
go -m discard
1500000 rows in results(first row: 0.36s; total: 36.24s)
Log
HBase scan details:{ … , families={cf=[od]}, …}
The response time is higher for this query even though it retrieves less data than the query for o_totalprice. This is because the timestamp type is more expensive.
 Projection happens at HBase column level
– For composite key and dense columns, the entire value is retrieved to the client
– Efficient to pack columns that are queried together
Predicate Pushdown: Point Scan
 With full row key
 Big SQL can combine predicates on row key parts
set force local on;
select o_orderkey,o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
+--------------+
| o_totalprice |
+--------------+
| 208660.75000 |
+--------------+
1 row in results(first row: 0.14s; total: 0.14s)
Log
Found a row scan by combining all composite key parts.
Row key: o_custkey, o_orderkey
Query: o_custkey=1 and o_orderkey=454791
start row=1#454791
stop row=1#454791
Rows (key, columns):
1#454791 …
1#579908 …
1#3868359 …
1#4273923 …
1#4808192 …
1#5133509 …
…
Predicate Pushdown: Partial row Scan
Predicate(s) on leading part(s) of row key
select o_orderkey,o_totalprice from orders where o_custkey=1;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|     454791 |  74602.81250 |
|     579908 |  54048.26172 |
|    3868359 | 123076.84375 |
|    4273923 |  95911.00781 |
|    4808192 |  65478.05078 |
|    5133509 | 174645.93750 |
+------------+--------------+
6 rows in results(first row: 0.13s; total: 0.13s)
Log
Found a row scan that uses the first 1 part(s) of composite key.
Row key: o_custkey, o_orderkey
Query: o_custkey=1
start row=1
stop row=1++
Rows (key, columns):
1#454791 …
1#579908 …
1#3868359 …
1#4273923 …
1#4808192 …
1#5133509 …
2#430243 …
…
Predicate Pushdown: Range Scan
 With range predicates
select o_orderkey,o_totalprice from orders where o_custkey < 3;
Log
Found a row scan that uses the first 1 part(s) of composite key.
Log
HBase scan details:{ .. , stopRow=\x01\x80\x00\x00\x03, startRow=, … }
Row key: o_custkey, o_orderkey
Query: o_custkey<3
start row=
stop row=3#
Rows (key, columns):
1#454791 …
…
1#5133509 …
2#430243 …
…
4#164711 …
…
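The start and stop rows in the scan logs can be derived from a predicate on the leading key part. The Python sketch below assumes the sortable INT encoding suggested by the logged values (sign bit flipped, big-endian, `\x01` prefix) and an exclusive stop row; `encode_int_sortable` and `scan_bounds` are hypothetical helpers, not Big SQL internals.

```python
import struct


def encode_int_sortable(value: int) -> bytes:
    """Signed 32-bit int -> bytes whose order matches numeric order (assumed encoding)."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a signed 32-bit integer")
    return b"\x01" + struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def scan_bounds(op: str, value: int):
    """Return (start_row, stop_row); empty bytes mean an open bound, stop is exclusive."""
    if op == "<":
        return b"", encode_int_sortable(value)
    if op == ">":
        return encode_int_sortable(value + 1), b""
    if op == "=":
        # Every key whose first part equals value sorts below the next value's prefix.
        # Sketch limitation: value + 1 raises ValueError for the maximum INT.
        return encode_int_sortable(value), encode_int_sortable(value + 1)
    raise ValueError(f"unsupported operator: {op}")


print(scan_bounds("<", 3))
print(scan_bounds(">", 100000))
```

Under these assumptions, `o_custkey < 3` gives the logged stopRow `\x01\x80\x00\x00\x03` with an empty startRow, and `o_custkey > 100000` gives the startRow `\x01\x80\x01\x86\xA1` that appears in the filter pushdown log.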
Predicate Pushdown: Full table Scan
 This is an example of a case where predicates are not pushed down.
 If there are predicates on non-leading parts of row key
set force local on;
select o_orderkey,o_totalprice from orders where o_orderkey=454791;
+------------+--------------+
| o_orderkey | o_totalprice |
+------------+--------------+
|     454791 |  74602.81250 |
+------------+--------------+
1 row in results(first row: 32.13s; total: 32.13s)
Log
HBase scan details:{ .. , stopRow=, startRow=, … }
Automatic Index Usage
select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 1.63s; total: 30.32s)
create index ix_clerk on table orders (o_clerk) as 'hbase';
0 rows affected (total: 3m57.82s)
select * from orders where o_clerk='Clerk#000000999'
go -m discard
1472 rows in results(first row: 3.60s; total: 3.65s)
Log
Index query successful
 Index used automatically
 For composite index, rules similar to composite row key apply
– Parts will be combined where possible
– With partial value for composite index, range scan done on index table
 Multiple indexes on a table
– Index to be used is randomly chosen
– Specify useIndex hint to make use of specific index
Pushing down Filters into HBase
 Filters do not avoid full table scan
– Some filters can skip certain sections, e.g., PrefixFilter
 Limits rows returned to the client
 Limits data returned to client
– Key only filters
select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P'
go -m discard
12819 rows in results(first row: 1.12s; total: 6.80s)
– o_custkey>100000: row scan
– o_orderstatus='P': column filter, as there is a predicate on the leading part of the dense column
Log
Found a row scan that uses the first 1 part(s) of composite key.
HBase filter list created using AND.
HBase scan details:{… , filter=FilterList AND (1/1):
[SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=,
startRow=\x01\x80\x01\x86\xA1, …}
Key Only Tables
 Big SQL allows creation of tables without specifying any
HBase column
create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
column mapping (key mapped by (k1, k2, k3));
select * from KEY_ONLY_TABLE;
Log
Only row key or parts of row key requested. Applying filters.
…
HBase scan details:{… families={}, filter=FilterList AND (2/2):
[FirstKeyOnlyFilter, KeyOnlyFilter], …}
Predicate Precedence
 When a query contains multiple predicates, the following precedence
applies:
– Row Scan
– Index
– Filters
• Row filters
• Column filters
 Filters will be applied along with row scans
 Filters cannot be combined with index lookups
 Multiple predicates: Use of row and column filter
select o_orderkey, o_custkey, o_orderdate from orders where
o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;
– The OR condition prevents usage of row scan. A row filter (PrefixFilter) is used along with a column filter
Log
HBase filter list created using OR.
HBase scan details:{… , filter=FilterList OR (2/2):
[SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00),
PrefixFilter \x01\x80\x00\x00\x02], cacheBlocks=false, stopRow=,
startRow=, … }
Accessmode Hint
 Will run the query locally in Big SQL server
– Useful to avoid map reduce overhead
 Very important for HBase point queries
– This is not detected currently by compiler
– Specify accessmode='local' hint when getting a limited set of data from HBase
 Specify at query level
select o_orderkey from orders /*+ accessmode='local' +*/ where o_custkey=1
and o_orderkey=454791;
 Specify at session level
– set force local on
– set force commands override query level hints
HBase Hints
 rowcachesize (default=2000)
– Used as scan cache setting
– Also used to determine number of get requests to batch in index lookups
 colbatchsize (default=100)
 useindex (‘false’ to avoid index usage)
select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000
go -m discard
1450136 rows in results(first row: 22.67s; total: 27.46s)
Log
HBase scan details:{... , caching=10000, ...}
 rowcachesize can also be set using the set command:
– set hbase.client.scanner.caching=10000;
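The effect of rowcachesize can be estimated with simple arithmetic. The Python sketch below assumes that each scanner fetch returns at most rowcachesize rows and ignores region boundaries and other per-request effects, so it gives a rough estimate of the number of fetches, not a measurement.

```python
import math


def scanner_round_trips(total_rows: int, rowcachesize: int) -> int:
    """Rough count of scanner fetches needed to return total_rows."""
    return math.ceil(total_rows / rowcachesize)


rows_returned = 1450136  # row count from the example query above

default_trips = scanner_round_trips(rows_returned, 2000)   # default rowcachesize
tuned_trips = scanner_round_trips(rows_returned, 10000)    # rowcachesize=10000 hint

print(default_trips, tuned_trips)
```

A larger cache cuts the number of fetches roughly fivefold here, at the cost of more memory per fetch on both client and region server, which is why the recommendation is an optimum value, not the largest one.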
Recommendations
 Row key design is the most important factor
– Try to combine predicates that are most commonly used into row key columns
– Do not make the row key too long
 Use short names for HBase column families and column qualifiers
– f:q instead of mycolumnfamily:mycolumnqualifier
 Check if key only tables can be used
 Pack columns that are queried together into dense columns
– Use the column that is used as query predicate as prefix
– Create indexes for columns that do not have repeating values and are queried often
 Separate columns that are rarely or never queried into a different
column family
 Set hbase.client.scanner.caching to an optimum value
 Ensure even data distribution
Limitations
 No diagnostic info about HBase pushdown
– How HBase storage handler pushes down a query is decided only at runtime
– Predicate handling details are logged at INFO level
– Many examples of log messages covered in previous slides
 No auto detection of local vs MR mode
– Currently depends on user specified hints
 Statistics not available
– Big SQL does not have a framework to collect statistics
– Query optimizations can be improved with availability of useful statistics
 Map type not supported
– Big SQL does not support map data type
– Hive HBase handler supports map data type and many to one mapping
• Mapping an entire HBase column family to a map data type
Logs and Troubleshooting
 Big SQL logs
– Look for rewritten query
– More information in Big SQL logs if query is run in local mode
 Map Reduce logs
– Predicate handling information in map task log when run in MR mode
 HBase web GUI
– http://servername:60010/master-status
Big SQL HBase Handler Highlights
 Support for composite key/dense columns
 Pushdown for efficient execution of queries
 Support for secondary indexes
 Collatible binary encoding
 Key only tables
 Support for hints to make query optimization decisions
Scenarios that can leverage HBase features
 Point queries
– Queries that return a single row of result
– Row can be determined using row key or secondary index
• Not all queries using a secondary index are point queries
 Queries with projections
– If a query requires only a few columns
– Projection happens at HBase column level
 Data maintenance using upserts
– Loading different value for columns using same row key