BigSQL-T3-Performance-FINAL.ppt

Big SQL 3.0 Performance
Abhayan Sundararajan, Jesse Chen, John Poelman, Jo A Ramos, Ken Chen,
Mike Ahern, Mladen Kovacevic, Rama Alluri, Simarjeev S Kohli, Simon Harris
IBM Big Data Performance team
For questions about this presentation contact Simon Harris siharris@au1.ibm.com
Agenda
 BigSQL Architecture
 BigSQL Best Practices
 BigSQL Optimizer
 BigSQL Performance Problem Determination
 BigSQL Internal Benchmarks
 BigSQL on Power
 BigInsights v3.0 Performance


DB2 DPF Architecture
Overview
BigSQL Compared to DB2
BIGSQL V3.0 ARCHITECTURE
Architecture Overview
BigSQL 3.0 is essentially the DB2 DPF engine running on top of Hadoop
[Diagram: Big SQL 3.0 on the HDFS/GPFS filesystem. A DB2 DPF master (coordinator) node on a management node hosts the Big SQL scheduler, the Hive metastore, and UDF/DDL FMPs; each compute node runs a DB2 DPF worker node with native and Java I/O FMPs, UDF FMPs, temp data, an HDFS data node, an MR task tracker and other services. *FMP = fenced mode process]
DB2 DPF Architecture – Major components
DB2 DPF Data Node:
 – DB2 optimizer ensures efficient execution paths
 – DB2 caches data/indexes in its bufferpool, allowing efficient read/write access to data from memory instead of disk
 – Memory areas are assigned for sorting
 – DB2 owns the data on the disk and is responsible for reading/writing/organising the data
 – DB2 maintains indexes for efficient access to data
 – Data is hash partitioned to allow for efficient co-located joins
[Diagram: a DB2 DPF compute node – DB2 optimizer & re-write, DB2 runtime, sortspace, bufferpool, DB2 indexes, and a DB2 storage layer of regular and temp tablespaces]
BigSQL 3.0 Architecture – Major components
BigSQL 3.0 Worker Node
 – BigSQL 3.0 gets all the benefits of the DB2 optimizer and re-write
 – The bufferpool cache is used only for temporary data (within the current query)
 – Sortspace remains unchanged
 – BigSQL 3.0 does not own the data; therefore, indexes cannot be built or maintained
 – Data is scatter partitioned – there is NO co-location of data
[Diagram: a BigSQL 3.0 worker node – DB2 optimizer & re-write, DB2 runtime, sortspace, bufferpool and temp tablespaces are retained; DB2 indexes and data tablespaces are replaced by native and Java I/O reader FMPs over HDFS data]
BigSQL 3.0 Architecture – Tuning focus
 So – BigSQL loses three major tuning features of an RDBMS: data cache, indexes and co-location.
 But it gains all the other great features of DB2, including the optimizer, self-tuning memory and advanced workload management.
 Tuning of BigSQL 3.0 therefore focuses on sortspace, bufferpools, dfsreader throughput, SMP parallelism and efficient plan selection.
[Diagram: the BigSQL 3.0 worker node components relevant to tuning – sortspace, bufferpool, temp tablespaces, and the native/Java I/O reader FMPs over HDFS data]
Deployment Topologies
Physical Database Design
Readers and Storage Formats
Resource Sharing
Loading data
Data Type considerations
Statistics
Informational Constraints
BIGSQL V3.0 BEST PRACTICES
BigSQL Deployment Topologies
 Traditionally BigInsights Management Node(s) have more memory, but less disk than the
worker/data nodes
– Many management tasks process less data, but are more response time critical
 Management functions may be split between several nodes depending upon requirements
 However, the BigSQL 3.0 management node is different
– It can be thought of as a management node that also does work
– It will be used to execute sections of a query -- even though it does not own any data locally
 This has implications for the topology of the cluster…
– Chiefly, the BigSQL management node needs a hardware configuration similar to the BigSQL worker nodes
[Diagram: cluster topology. Management nodes host the Hive metastore, Big SQL master, name node and job tracker; each compute node runs a task tracker, data node and Big SQL worker, all on GPFS/HDFS.]
BigSQL Deployment Topologies
 The BigSQL management node needs to have at least:
– The same amount of memory
– The same CPU processing power
– The same number of disks and configuration of storage
as the BigSQL worker nodes
 Failure to comply with this recommendation will likely slow down the
whole BigSQL cluster:
– Consider a BigSQL management node that has the same CPU & memory as the worker nodes, but only 1/3rd of the number of disk spindles.
– Any query that executes sections on the management node will probably run about 2/3rds slower (at one-third of the disk throughput) when it needs to use disk
– Since the whole cluster is only really as fast as the slowest worker, this can
have a dramatic impact on query performance
BigSQL Deployment Topologies
 The BigSQL Master node needs to be spec’d more like a worker node
than a typical management node
– It will execute sections of a query and may write to BigSQL temporary space
– It will not usually be an hdfs data node or a MR task tracker node
[Diagram: Big SQL 3.0 deployment. Management nodes host the Big SQL master node (with HDFS name node, UDF/DDL FMPs, temp data and Hive server), the BigInsights console, database service and Big SQL v1, and the Hive metastore, MR job tracker and Big SQL scheduler. Each compute node hosts a Big SQL worker with native/Java I/O FMPs, UDF FMPs, temp data, an HDFS data node, an MR task tracker and other services.]
Big SQL Tablespace physical layout
 User data is stored in files on HDFS/GPFS. Physical layout of the
distributed file system is determined by dfs.data.dir property (if HDFS) or
list of disks per node (if GPFS) supplied at install time.
 While Big SQL accesses data on the distributed file system (e.g. HDFS), it
still creates and uses DB2 tablespaces
 The following tablespaces are created: SYSCATSPACE,
TEMPSPACE1, BIGSQLCATSPACE, BIGSQLUTILITYSPACE,
SYSTOOLSPACE, SYSTOOLSTMPSPACE
 The performance of the storage underlying these DB2 tablespaces
matters, especially for SQL queries that create temporary tables
 The installer prompts for Big SQL Data path(s) [see next slide]
 The path or paths specified become the storage containers associated with the DB2 tablespaces
 For good performance, specify multiple paths
Big SQL Tablespace physical layout (cont)
 Expand the Big SQL settings in the installer to see all input fields [screenshot]
 Specify multiple paths for the Big SQL data directory [screenshot]
BigSQL Tablespace Physical layout
 Spread the Big SQL data
directory over as many
disks as possible
 Share disks between
BigSQL, HDFS (dfs.data.dir)
or GPFS, and MapReduce
intermediate data
(mapred.local.dir)
[Diagram: all three consumers share the same set of disks]
– HDFS/GPFS: all user data is stored on the distributed file system
– MapReduce cache (mapred.local.dir): used for temporary data during execution of MR jobs
– BigSQL data directory: used for temporary data during execution of BigSQL queries
 Rule of thumb: Spread everything across all disks
C++/Java Readers Overview
 The readers read information from the BigInsights Distributed File System
(DFS) and return the information to DB2
 Scott referred to these as “I/O Engines” in his presentation
 Readers run on the compute nodes
 Two reader types: Native C++ and Java
 Big SQL chooses which reader to use based on the storage
format of the data being accessed
 The C++ Reader is based on the Open Source Impala Reader
 The Java Reader is used for storage formats not supported by the
C++ Reader (more on this later...)
C++ Readers - Architecture
 Readers run in the db2 Fenced Mode Processes (FMPs)
[Diagram: readers running inside DB2 fenced mode processes on each compute node]
C++/Java Reader Configuration Files
 The readers are configured by properties in configuration file
$BIGSQL_HOME/conf/bigsql-conf.xml
 The bigsql-conf.xml file is also used to configure the scheduler
 The C++ Reader properties generally have prefix “dfsio.”
 Value of 0 usually means “let the system decide”
 Generally, customers will not modify the Readers settings
 Reader logging is controlled by the glog_enabled=[true|false] property in the
glog-dfsio.properties file:
 Reader logging is enabled by default and can produce a lot of output
 Usually has a small performance impact though
 Consider disabling logging on production clusters
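For example, disabling reader logging in glog-dfsio.properties is a one-line change (a sketch based on the property name above; confirm the file's location and contents on your install):

glog_enabled=false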
C++ Reader: Commonly tuned properties
Important C++ Reader settings:
 – dfsio.disable_codegen – true by default, meaning that LLVM (Low Level Virtual Machine) code generation is disabled. We will explore LLVM more for the 4Q release; for now, we recommend leaving this disabled. Can be specified per query (see the example below).
 – dfsio.num_scanner_threads – caps the number of scanner threads that will be created. Default is 0 (let the system decide). Can be specified per query.
 – dfsio.mem_limit – the amount of memory used by the C++ Readers is controlled by Big SQL by default, but can be overridden by setting dfsio.mem_limit to a value > 0. Default is 0 (let Big SQL decide).
 – dfsio.num_threads_per_disk – the number of I/O threads spawned per disk. If 0, then "the system decides" and will spawn 5 I/O threads per disk. Five threads is usually reasonable for both rotational disks and SSDs. Consider increasing this value on clusters with very high performance devices.
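A per-query override presumably uses the same SET HADOOP PROPERTY mechanism shown for the Java Reader on the next slide (an assumption; verify against your release documentation):

-- hypothetical per-query override of the scanner thread cap
SET HADOOP PROPERTY 'dfsio.num_scanner_threads' = '8';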
Java Reader: Commonly tuned properties
Important Java Reader settings:
 – bigsql.java.io.tp.size – number of reader threads per scan. Default is 8. Consider increasing this value on clusters with very high performance devices. Can be set per query, i.e. SET HADOOP PROPERTY 'bigsql.java.io.tp.size' = 4
 – scheduler.force.javareader – an undocumented property that should only be used by support when troubleshooting a system. Set to true to force Big SQL to use the Java Reader instead of the C++ Reader; a workaround if the C++ Reader is failing for some reason.
C++ Reader performance metrics
 The C++ Reader logs performance metrics at the end of each scan
 The metrics are dumped to the C++ Reader log file:
$BIGINSIGHTS_VAR/bigsql/logs/bigsql-ndfsio.log.INFO
 Example: [screenshot of the per-scan metrics logged to bigsql-ndfsio.log.INFO]
C++ Reader performance metrics (cont.)
[screenshot: additional reader metrics]
C++ Reader metrics can also be dumped using db2pd
 Run db2pd on each compute node:
   db2pd -d bigsql -dfstable -file <output file>
 Example: [screenshot of db2pd -dfstable output]
C++ Reader metrics can also be dumped using db2pd (cont)
 This screenshot shows a subset of the metrics... [screenshot]
C++ Reader and HDFS
 When the BigInsights cluster is using HDFS, we recommend setting the property "dfs.datanode.hdfs-blocks-metadata.enabled" to true.
 – This enables the C++ Reader to know which local disks blocks reside on
 – With this information, the C++ Reader can pin I/O threads to disks
 If performance seems sub-optimal, use the C++ Reader log to confirm that I/O threads are pinned to disks:
 – On a compute node, check the C++ Reader log file: $BIGINSIGHTS_VAR/bigsql/logs/bigsql-ndfsio.log.INFO
 – Grep the log file for the string "Split volume id"
 – We expect the split volume ids to be 0 or greater, which indicates pinning is occurring. Example:
   I0613 19:06:48.645717 2973330 bi-dfs-reader.cc:1907] Split volume id 3
 – If the split volume ids are all -1, then we're not pinning I/O threads to disks; check the property "dfs.datanode.hdfs-blocks-metadata.enabled". Example:
   I0428 02:23:28.394435 3165237 bi-dfs-reader.cc:1677] Split volume id -1
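A quick check using the log path given above (a sketch; run on each compute node):

# count unpinned splits -- zero is what you want to see
grep -c "Split volume id -1" $BIGINSIGHTS_VAR/bigsql/logs/bigsql-ndfsio.log.INFO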
Storage Formats
 Out-of-the-box results [chart comparing storage formats]
 Plan for extensive studies
– To find best combination for each
– To test with compression
– To identify best practices
Data Type considerations
(Based on Scott's migration guide)

 Big SQL 3.0 contains an entirely new optimized SQL execution engine. This engine internally works with data in units of 32K pages and works most efficiently when the definition of a table allows a row to fit within 32K of memory. Once the calculated row size exceeds 32K, performance degradation can be seen for certain queries.
 As a result, the use of the datatype STRING is strongly discouraged, as it is mapped internally to VARCHAR(32672), which means the table's row size will almost certainly exceed 32K.
 This performance degradation can be avoided by (both remedies are sketched below):
 – Changing references to STRING to an explicit VARCHAR(n) that most appropriately fits the data size
 – Using the bigsql.string.size property (via SET HADOOP PROPERTY) to lower the default size of the VARCHAR to which STRING is mapped when creating new tables.
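Sketches of both remedies (table and column names here are illustrative, not from the deck):

-- Remedy 1: explicit VARCHAR sizes instead of STRING
CREATE HADOOP TABLE REVIEWS (
    REVIEW_ID   BIGINT NOT NULL,
    REVIEW_TEXT VARCHAR(200) NOT NULL)   -- sized to fit the actual data
  STORED AS TEXTFILE;

-- Remedy 2: shrink the default STRING-to-VARCHAR mapping before creating tables
SET HADOOP PROPERTY 'bigsql.string.size' = '128';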
Resource Sharing
 When installing BigSQL, the user specifies the percentage of cluster
resources to dedicate to BigSQL
– This is hidden away under “Advanced settings” for BigSQL
– Default is 25%
– Recommended range is 25% -> 75%
 Value specified dictates memory and CPU resources dedicated to
BigSQL – not disk
– BigSQL will do it’s best to keep within the boundary specified
Resource sharing – Balancing with MapReduce
 Installer will automatically tune the default BigSQL properties
according to the percentage specified:
– INSTANCE_MEMORY
– Sort memory (sortheap & sheapthres_shr)
– Bufferpool
– dfsreader memory
 It will also tune MapReduce properties (in mapred-site.xml) to ensure
cluster resources are not over-allocated when BigSQL queries and
MapReduce jobs are executing at the same time:
Resource Sharing - How much of the BigInsights cluster to dedicate to BigSQL?
 INSTANCE_MEMORY is set to the percentage of memory assigned to
BigSQL
– If you used 50% as the resource percentage, then INSTANCE_MEMORY=50
– If you have 64GB memory, then BigSQL has 0.5*64=32GB available to it.
 Most memory is allocated to sort space since HashJoins are the
prevalent join technique used in BigSQL
 Bufferpools are not used to cache the HDFS data, but will still be used
for intermediate storage whilst a query is executing
– Also, the bufferpool size is a key input into the optimizer. Setting this too low will cause the optimizer to favor inefficient join techniques (such as NestedLoop)
– So you need a reasonably sized bufferpool even though it is not used to cache HDFS data
 dfsreaders require memory to read data from HDFS and exchange the
data with BigSQL runtime
– By default, the readers are allocated 20% of the memory assigned to BigSQL
– In the above example, 0.2*32=6.4GB
Resource sharing - Changing resource percentage after
install
 To change the percentage of resources dedicated to BigSQL after
install:
autoconfigure using mem_percent 75
workload_type complex
is_populated no
apply db and dbm
 This will automatically update the BigSQL memory related properties
previously mentioned
– It will not update the MapReduce settings in mapred-site.xml – these will have to be updated manually
– Formula for calculating memory consumption:
(mapred.tasktracker.map.tasks.maximum * [-Xmx option of mapreduce.map.java.opts]) +
(mapred.tasktracker.reduce.tasks.maximum * [-Xmx option of mapreduce.reduce.java.opts]) +
(Physical memory * INSTANCE_MEMORY) +
sum (Other tasks running on node [datanode/task tracker/hbase etc…])
<=
Physical memory * 0.90
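A worked example of the formula with illustrative numbers (not from the deck): 128GB of physical memory, INSTANCE_MEMORY at 50%, 8 map and 4 reduce slots with 2GB heaps, and roughly 4GB for other tasks:

(8 * 2GB) + (4 * 2GB) + (128GB * 0.50) + 4GB
  = 16 + 8 + 64 + 4 = 92GB  <=  128GB * 0.90 = 115.2GB   (OK, not over-committed)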
Resource sharing - Self Tuning Memory Manager (STMM)
 BigSQL will constantly monitor the memory allocation and how
efficiently it is being used
 Memory allocation will be re-assigned between BigSQL consumers
(within the given boundaries) to ensure BigSQL is making the best
possible use of the available memory
– For example, if BigSQL detects that bufferpools are infrequently used but sort
space is regularly exhausted, then it may decide to reduce the amount of
memory available to bufferpools and give this memory to sort space.
 Top tip: If you have a relatively fixed workload:
– Run with STMM enabled for a period of time (usually several days)
– Monitor the bufferpools & sort space; when they have remained stable for several days, disable STMM:
• update db cfg for bigsql using SELF_TUNING_MEM off
– Your BigSQL cluster now has the optimal tuning for your workload
BigSQL 3.0 Table Partitioning
 BigSQL (and Hive) provide the ability to partition a table based on a data value
 – This improves query performance by eliminating those partitions that do not contain the data value of interest
 BigSQL stores different data partitions as separate files in hdfs and only scans the partitions required by a query, thereby improving runtime
 – Partition on a column commonly referenced in range delimiting or equality predicates
 – Ranges of dates are ideal for use as partition columns
Create LINEITEM table partitioned on L_SHIPDATE:
"CREATE HADOOP TABLE LINEITEM (
L_ORDERKEY BIGINT NOT NULL,
L_PARTKEY INTEGER NOT NULL,
L_SUPPKEY INTEGER NOT NULL,
L_LINENUMBER INTEGER NOT NULL,
L_QUANTITY FLOAT NOT NULL,
L_EXTENDEDPRICE FLOAT NOT NULL,
L_DISCOUNT FLOAT NOT NULL,
L_TAX FLOAT NOT NULL,
L_RETURNFLAG VARCHAR(1) NOT NULL,
L_LINESTATUS VARCHAR(1) NOT NULL,
L_COMMITDATE DATE NOT NULL,
L_RECEIPTDATE DATE NOT NULL,
L_SHIPINSTRUCT VARCHAR(25) NOT NULL,
L_SHIPMODE VARCHAR(10) NOT NULL,
L_COMMENT VARCHAR(44) NOT NULL)
PARTITIONED BY (L_SHIPDATE DATE)
STORED AS TEXTFILE
BigSQL 3.0 Table Partitioning
 A separate file for each unique L_SHIPDATE will be created when the table is populated
 – The file name will be tagged with the value:

> hadoop fs -ls /biginsights/hive/warehouse/parq_partition.db/lineitem | more
Found 2526 items
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:03 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1992-01-02
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:03 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1992-01-03
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:03 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1992-01-04
....
....
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-27
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-28
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-29
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-30
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-12-01
 Queries with predicates on the partitioning column will only read those files that qualify:

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= date ('1998-12-01') - 3 day
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;
Table Partitioning – Performance Results
 Setup:
 – 5 machine cluster with 4 data nodes
 – TPCH 500GB workload using textfile format, with the Lineitem table partitioned on L_SHIPDATE and the Orders table partitioned on O_ORDERDATE
 Results:
 – Table partitioning provides an overall benefit of 14% improvement compared to no partitioning
 – But not all queries improved; some got slower……
[Chart: TPCH performance comparison using table partitioning – per-query runtime (sec, 0–6000) with and without partitioning]
Big SQL 3.0 LOAD Best Practices - 1
 LOAD uses MapReduce job(s) to read the data from the source (local disk, hdfs/gpfs, or RDBMS) and populate the target table
 Default number of map tasks for LOAD job is just 4 – this is usually much too small
when loading large amounts of data
– tune LOAD property num.map.tasks to customize the number of map tasks
– good starting point is to set to number of BigSQL worker nodes (or a multiple
thereof)
load hadoop using file url '/tpch1000g/orders/'
with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true')
into table ORDERS
overwrite WITH LOAD PROPERTIES ('num.map.tasks'='145');
 Warning: the max java heap size for LOAD tasks is 2GB. If the default java heap size for the cluster (defined in mapreduce.map.java.opts) is <2GB, then specifying too large a value for num.map.tasks may over-commit memory on the cluster.
Big SQL 3.0 LOAD Best Practices - 2
 If the source consists of a few large files, then it is more efficient to
copy the files to hdfs first, and then load from hdfs
 If the source consists of many small files, then loading from either
local filesystem or hdfs has similar performance
Reason – the number of map tasks used to load from the local filesystem is limited to the number of files. If there are only a few large files, then only a small number of map tasks will be created to transfer and load the data. For hdfs sources, the number of map tasks is limited by the number of hdfs blocks, not the number of files.

[Chart: load from local disk vs HDFS – elapsed time (sec, 0–30000) for a large-file 94GB data set and a small-file 6GB data set]
Big SQL 3.0 LOAD Best Practices - 3
 Note:
Executing multiple concurrent LOAD HADOOP USING statements to the
same non-partitioned table is not currently supported.
Concurrent LOAD HADOOP USING into different partitions of the same
partitioned table are supported.
 Can also use HIVE and BigSQL INSERT…SELECT… statements to move data
into BigSQL tables
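For example, a minimal INSERT…SELECT to populate a BigSQL table from an existing one (table names are hypothetical):

INSERT INTO SIMON.ORDERS_PARQ
  SELECT * FROM SIMON.ORDERS_TEXT;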
BigSQL 3.0 LOAD – Checking Data Distribution
 BigSQL data is scatter partitioned across the data nodes
Output of hadoop fsck for the lineitem table:

Total size:                  2101761659596 B
Total dirs:                  1
Total files:                 2062
Total symlinks:              0
Total blocks (validated):    2062 (avg. block size 1019283055 B)
Minimally replicated blocks: 2062 (100.0 %)
Over-replicated blocks:      0 (0.0 %)
Under-replicated blocks:     0 (0.0 %)
Mis-replicated blocks:       0 (0.0 %)
Default replication factor:  3
Average block replication:   3.0
Corrupt blocks:              0
Missing replicas:            0 (0.0 %)
Number of data-nodes:        18
Number of racks:             1
FSCK ended at Fri Jun 13 04:21:51 PDT 2014 in 58 milliseconds
The filesystem under path '/biginsights/hive/warehouse/tpch10tb_parq.db/lineitem' is HEALTHY

Per-node block counts – aiming for an even distribution of dfs blocks across nodes; this should be automatically maintained by hdfs:

Block count for lineitem on node 120: 2063
Block count for lineitem on node 121: 289
Block count for lineitem on node 122: 345
Block count for lineitem on node 123: 312
Block count for lineitem on node 124: 0
Block count for lineitem on node 125: 382
Block count for lineitem on node 126: 384
Block count for lineitem on node 127: 318
Block count for lineitem on node 128: 346
Block count for lineitem on node 129: 346
Block count for lineitem on node 130: 0
………

* See speaker notes for script
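The per-node counts come from a script in the speaker notes. A rough equivalent can be assembled from hadoop fsck output (a sketch; the IP pattern is a placeholder for your data-node addresses):

hadoop fsck /biginsights/hive/warehouse/tpch10tb_parq.db/lineitem -files -blocks -locations \
  | grep -o '9\.30\.104\.[0-9]*' | sort | uniq -c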
Statistics are critical to BigSQL 3.0 Performance
 Statistics are used by the optimizer to make informed decisions about
query execution
 Accurate and up-to-date statistics can improve performance many-fold, and out-of-date or inaccurate statistics can devastate performance.
 Statistics MUST be updated whenever:
– a new table is populated, or
– an existing table’s data undergoes significant changes:
• new data added,
• old data removed,
• existing data is updated
 Use ANALYZE to update a table’s statistics:
ANALYZE TABLE SIMON.ORDERS
COMPUTE STATISTICS
FOR COLUMNS O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE,
O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT
Type of Statistics collected by BigSQL 3.0
 Table statistics:
– Cardinality (count)
– Number of Files
– Total File Size
 Column statistics (this applies to column group stats also):
 – Minimum value
 – Maximum value
 – Cardinality (non-nulls)
 – Distribution (Number of Distinct Values)
 – Number of null values
 – Average length of the column value (for string columns)
 – Histogram
 – Frequent values (MFV)
BigSQL 3.0 Optimizer – Statistics are crucial
 Usually in DB2, if statistics have never been gathered for a table, explain will show 1000 for table cardinality:

          1000
   HTABLE: TPCH10TB_PARQ
           SUPPLIER
           Q2
 But this is not the case in BigSQL because BigSQL fabricates some
basic stats from the file information
 Check STATS_TIME and CARD in SYSCAT.TABLES to see if ANALYZE
has been run:
db2 "select substr(tabname,1,20),stats_time,card from syscat.tables where tabschema='SIMON'"
1                    STATS_TIME                 CARD
-------------------- -------------------------- --------------------
NATION               -                                            -1
ORDERS               -                                            -1
CUSTOMER             -                                            -1
LINEITEM             -                                            -1
REGION               -                                            -1
SUPPLIER             -                                            -1
PART                 -                                            -1
PARTSUPP             -                                            -1
BigSQL 3.0 Optimizer – Advanced Statistics
 SYSSTAT views are available to manually manipulate statistics
– Set of views within BigSQL that allow administrators to manually update statistics
– For Advanced Users Only – You must understand what you are doing….
 Statistical Views are also supported:
– Ability to collect statistics on a view
– Useful to get accurate cardinality estimates for complex relationships
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html?lang=en
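A minimal statistical-view sketch using the DB2 mechanics the linked page describes (the view definition is hypothetical; verify the exact BigSQL 3.0 syntax against the documentation):

CREATE VIEW SIMON.SV_CUST_ORDERS AS
  SELECT O.* FROM SIMON.ORDERS O, SIMON.CUSTOMER C
  WHERE O.O_CUSTKEY = C.C_CUSTKEY;
ALTER VIEW SIMON.SV_CUST_ORDERS ENABLE QUERY OPTIMIZATION;
RUNSTATS ON TABLE SIMON.SV_CUST_ORDERS WITH DISTRIBUTION;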
BigSQL 3.0 - Use Informational Constraints
 BigSQL 3.0 supports Informational Constraints (ICs)
 ICs are much like regular Primary and Foreign Key constraints except
they are not enforced
 They do provide useful information to the optimizer
alter table SIMON.orders add primary key (O_ORDERKEY) not enforced
alter table SIMON.lineitem add primary key (L_ORDERKEY,L_LINENUMBER) not enforced
alter table SIMON.lineitem add foreign key (L_ORDERKEY) references SIMON.orders
(O_ORDERKEY) not enforced
alter table EMPLOYEE add constraint revenue check (SALARY + COMM > 25000)
not enforced
 Informational Constraints allow the optimizer to make better selectivity
estimates which improves the costing of the plan and execution
efficiency of the query
Big SQL 3.0 Best Practices - Summary
 Ensure you have a homogeneous and balanced cluster
– Utilize IBM reference architecture
– Balance resource usage between BigSQL 3.0 and MR jobs
– Ensure several disks are assigned to BigSQL data directory
 Choose an optimized file format (if possible)
– Parquet for BigSQL 3.0
 Choose appropriate data types
– Use the smallest and most precise datatype available
 Define informational constraints
– Primary key, foreign key, check constraints
 Ensure you have good statistics
– Current and comprehensive
 Use the full power of SQL available to you
– Don’t constrain yourself to Hive syntax/capability
Query Re-write
Query Pushdown
Statistics & Costing
New Access Strategies
Anatomy of explain plan
BIGSQL V3.0 OPTIMIZER
Big SQL 3.0 – Query Planning
 Query rewrites
– Exhaustive query rewrite capabilities
– Leverages additional metadata such as constraints and nullability
 Optimization
– Statistics and heuristic driven query optimization
– Query optimizer based upon decades of IBM RDBMS experience
 Tools and metrics
– Highly detailed explain plans and query diagnostic tools
– Extensive number of available performance metrics
Query transformation:

SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
  AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
  AND STORE_NUMBER = '03'
  AND CATEGORY = 72
GROUP BY ITEM_DESC
Access plan generation: dozens of query transformations are considered, and hundreds or thousands of access plan options (e.g., NLJOIN, HSJOIN and ZZJOIN trees over Period, Product, Store and Daily Sales) are costed before one access section is chosen and parallelized across threads.
[Diagram: candidate join trees and the final multi-threaded access section]
Query Rewrite
► Why is query re-write important?
– There are many ways to express the same query
– Query generators often produce suboptimal queries and don’t permit “hand
optimization”
– Complex queries often result in redundancy, especially with views
– For large data volumes, optimal access plans are more crucial as the penalty for poor planning is greater
select sum(l_extendedprice) / 7.0
avg_yearly
from tpcd.lineitem, tpcd.part
where p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container = 'MED BOX'
and l_quantity < ( select 0.2 *
avg(l_quantity) from
tpcd.lineitem
where l_partkey = p_partkey);
select sum(l_extendedprice) / 7.0 as avg_yearly
from temp (l_quantity, avgquantity, l_extendedprice) as
   (select l_quantity, avg(l_quantity) over
             (partition by l_partkey) as avgquantity,
           l_extendedprice
    from tpcd.lineitem, tpcd.part
    where p_partkey = l_partkey
      and p_brand = 'BRAND#23'
      and p_container = 'MED BOX')
where l_quantity < 0.2 * avgquantity
• Query correlation eliminated
• Lineitem table accessed only once
• Execution time reduced by half!
Query Rewrite
► BigSQL uses the DB2 re-write engine
► Most existing query rewrite rules remain unchanged
– 140+ existing query re-writes are leveraged
– Almost none are impacted by “the Hadoop world”
► There were however a few modifications that were required…
Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
  – Can produce more efficiently decorrelated subqueries and joins
  – Used to prove uniqueness of joined rows ("early-out" join)
► Very few Hadoop data sources support the concept of an index
► In the Hive metastore all columns are implicitly nullable
► Big SQL introduces advisory or informational constraints and nullability indicators
  – The user can specify whether or not constraints can be "trusted" for query rewrites

Nullability indicators and constraints:

create hadoop table users
(
  id        int          not null primary key,
  office_id int          null,
  fname     varchar(30)  not null,
  lname     varchar(30)  not null,
  salary    timestamp(3) null,
  constraint fk_ofc foreign key (office_id)
    references office (office_id)
)
row format delimited
fields terminated by '|'
stored as textfile;

> alter table SIMON.orders add primary key (O_ORDERKEY) not enforced
Statistics
► Big SQL utilizes Hive statistics collection with some extensions:
  – Additional support for column groups, histograms and frequent values
  – Automatic determination of partitions that require statistics collection vs. explicit
  – Partitioned tables: added table-level versions of NDV, Min, Max, Null count, Average column length
  – Hive catalogs as well as database engine catalogs are also populated
  – We are restructuring the relevant code for submission back to Hive
► Capability for statistic fabrication if no stats are available at compile time

Table statistics:
  • Cardinality (count)
  • Number of files
  • Total file size
Column statistics:
  • Minimum value (all types)
  • Maximum value (all types)
  • Cardinality (non-nulls)
  • Distribution (Number of Distinct Values, NDV)
  • Number of null values
  • Average length of the column value (all types)
  • Histogram – number of buckets configurable
  • Frequent values (MFV) – number configurable
Column group statistics
Costing Model
► Few extensions were required to the cost model for the optimizer to understand the SQL-over-Hadoop world
► TBSCAN operator cost model extended to evaluate the cost of reading from Hadoop
► HTABLE operator
► Degree of pushdown possible to the readers…
► Scatter partitioning used in Hadoop
► New elements taken into account: # of files, size of files, # of partitions, # of nodes
[Explain plan excerpt: an HSJOIN plan over TPCH5TB_PARQ.ORDERS and TPCH5TB_PARQ.CUSTOMER, annotated at each operator (HSJOIN, NLJOIN, BTQ, DTQ, LTQ, GRPBY, FILTER, TBSCAN, TEMP) with cardinality, cumulative cost and I/O cost.]
► Better costing in SQL over Hadoop!
New Query Pushdown
► Pushdown is important because it
reduces the volume of data flowing
from the readers into BigSQL
► Pushdown moves processing down
as close to the data as possible
– Projection pushdown – retrieve only
necessary columns
– Selection pushdown – push search criteria
► Big SQL understands the capabilities
of readers and storage formats
involved
– As much as possible is pushed down
– Residual processing done in the server
– Optimizer costs queries based upon how
much can be pushed down
 Parquet (with the C++ reader) provides
the best combination of pushdown for
BigSQL
select sum(l_extendedprice) / 7.0 as avg_yearly
from temp (l_quantity, avgquantity, l_extendedprice) as
   (select l_quantity, avg(l_quantity) over
             (partition by l_partkey) as avgquantity,
           l_extendedprice
    from tpcd.lineitem, tpcd.part
    where p_partkey = l_partkey
      and p_brand = 'BRAND#23'
      and p_container = 'MED BOX')
where l_quantity < 0.2 * avgquantity
3) External Sarg Predicate,
   Comparison Operator:      Equal (=)
   Subquery Input Required:  No
   Filter Factor:            0.04

   Predicate Text:
   --------------
   (Q1.P_BRAND = 'Brand#23')

4) External Sarg Predicate,
   Comparison Operator:      Equal (=)
   Subquery Input Required:  No
   Filter Factor:            0.025

   Predicate Text:
   --------------
   (Q1.P_CONTAINER = 'MED BOX')
New Access Plans
Data is not hash partitioned on a particular column (aka "scatter partitioned").

We can access a Hadoop table as:
► "Scatter" partitioned:
  • Only accesses data local to the node
► Replicated:
  • Accesses local and remote data
  – Optimizer could also use a broadcast table queue
  – HDFS shared file system provides replication

A new parallel join strategy is introduced.
Parallel Join Strategies
A table queue (TQ) represents communication between nodes or subagents.

Replicated vs. broadcast join:
 – All tables are "scatter" partitioned
 – Join predicate: STORE.STOREKEY = DAILY_SALES.STOREKEY
 – Replicate the smaller table to the partitions of the larger table using either a broadcast table queue or a replicated HDFS scan
[Diagram: JOIN of Daily Sales with Store, with Store delivered via a broadcast TQ or a replicated SCAN]
Parallel Join Strategies
Repartitioned join:
 – All tables are "scatter" partitioned
 – Join predicate: DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
 – Both tables are large; too expensive to broadcast or replicate either
 – Repartition both tables on the join columns
 – Use directed table queues (DTQ)
[Diagram: JOIN fed by directed TQs over SCANs of Daily Forecast and Daily Sales]
BigSQL 3.0 Optimizer – Be wary of FAT BTQs
 Be wary of broadcast table queues moving large amounts of data.

This BTQ will send 722M rows from each data node to every other data node, so each data node processes ALL the data for this join. In an 18-node cluster each node will process 722M*18 = 12,996M rows:

            2.79356e+07
              ^HSJOIN
              (  11)
            9.50434e+06
              61907.1
      /----------+----------\
7.22346e+08            2.43121e+07
    BTQ                  HSJOIN^
   (  12)                 (  21)

With directed table queues, data is partitioned on the fly by the join key and sent only to the node responsible for processing that key, so each node processes only a subset of the data for this join. In an 18-node cluster each node will process approx 3,151M rows:

            2.92531e+09
              ^HSJOIN
              (  14)
            6.97698e+06
               49690
      /----------+----------\
3.15179e+09            5.26316e+06
    DTQ                    DTQ
   (  15)                 (  18)
6.86668e+06               14923
   49575                    115

 The latter will scale much better than the former.
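When reviewing a db2exfmt output, a quick scan for the operators flagged in this section can be done with grep (the output file name is illustrative):

grep -nE 'BTQ|NLJOIN|MSJOIN' q1.exfmt.txt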
Viewing an access plan:
 Explain utility is used to view a BigSQL access plan
– Inherited from DB2
 Two versions of explain, each providing varying levels of detail:
– Visual explain (graphical)
– Explain format (textual)
 To obtain a formatted explain of a query in file q1.sql:

db2 "connect to bigsql"
db2 -tvf $HOME/sqllib/misc/EXPLAIN.DDL   <-- one-time operation to create the explain tables
db2 terminate
explain.sh -d bigsql -f q1.sql

* See speaker notes for the explain.sh script
Anatomy of an explain:
 Explain header info
 Config values impacting the optimizer
 Original SQL statement
 Re-written SQL statement
 Plan tree graph
 Operator & Object details….
* See speaker notes for explain script
Explain: Config values
 Lists the most important configuration values that impact the optimizer
 Lock-related properties are not appropriate for BigSQL
Explain: SQL Statements
 The original statement is the query as it was submitted to BigSQL
 The optimized statement is the query after it has gone through query re-write processing
 – This is the query the optimizer actually sees
Explain: Plan Tree Graph
 Provides an overview of the plan chosen by the optimizer
 Total cost is the cost of the plan in timerons
 – Timerons are a mythical value against which different plans are compared; they do not represent time
 – Execution order of the operators is from the bottom up
 Each operator in the graph is annotated with: the optimizer's estimate of the number of rows flowing out of the operator, the cumulative cost up to this point in the plan (timerons), the operator name, and the I/O cost.
Explain: Plan Tree Graph
 1. The NEW_LINEITEM and ORDERS tables are read from HDFS via TBSCAN operators (14) & (17).
 2. Local table queue (LTQ) operators (13) and (16) mean the TBSCANs occur in (SMP) parallel.
 3. Directed table queue (DTQ) operators (12) and (15) hash partition the join inputs based on the join key and send the data to the appropriate node, where the hash join (HSJOIN) will take place.
(Numbered bottom-up, matching the execution order of the plan.)
BigSQL 3.0 Optimizer – Reference
 Optimizer in the DB2 Knowledge Centre:
 – http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0054924.html?lang=en
 Good workshop on understanding the DB2 optimizer:
 – http://www.slideshare.net/terraborealis/understanding-db2-optimizer
 For additional information on visualizing the BigSQL access plan, search "explain" in the DB2 Knowledge Centre at:
 – http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.kc.doc/welcome.html?lang=en
 Optimization profiles (aka hints) are supported, same as DB2. Search "optimization profile" in the DB2 Knowledge Centre at:
 – http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.kc.doc/welcome.html?lang=en
BigSQL 3.0 Optimizer – Reference
 DB2 Knowledge Centre:
 – http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005134.html?lang=en
 How to read explain plans presentation:
 – http://www.slideshare.net/Tess98/how-to-read-query-plans-and-improve-performance
Statistics
Properties with the biggest performance impact
Node Resource Percentage
String data types
Suspicious plans
Docs to collect
BIGSQL 3.0 PERFORMANCE
PROBLEM DETERMINATION
BigSQL 3.0 Performance Problem Determination
 If you are experiencing a performance issue with a handful of queries, then it is likely a plan issue:
 – Check statistics
 – Gather and analyze explains
 If you are experiencing a general performance issue (most queries seem slow), then it is likely a configuration issue:
 – Check statistics
 – Check the BigSQL configuration
BigSQL 3.0 Performance PD – Check statistics
 ANALYZE, ANALYZE, ANALYZE………
 First port of call for a BigSQL performance problem – Make sure
ANALYZE has been run and statistics are up to date
 Usually in DB2, if statistics have never been gathered for a table, explain will show 1000 for table cardinality:

          1000
   HTABLE: TPCH10TB_PARQ
           SUPPLIER
           Q2
 But this is not the case in BigSQL because BigSQL fabricates some
basic stats from the file information
 Check STATS_TIME and CARD in SYSCAT.TABLES to see if ANALYZE
has been run:
db2 "select substr(tabname,1,20),stats_time,card from syscat.tables where tabschema='SIMON'"
1                    STATS_TIME                 CARD
-------------------- -------------------------- --------------------
NATION               -                                            -1
ORDERS               -                                            -1
CUSTOMER             -                                            -1
LINEITEM             -                                            -1
BigSQL 3.0 Properties that have the biggest performance
impact - Sorting
 Sort space – SORTHEAP and SHEAPTHRES_SHR
– Area of memory used for sorting and build table (inner) of HashJoin
– Defaults:
• SHEAPTHRES_SHR is 50% of memory assigned to BigSQL
• SORTHEAP is 1/20th of SHEAPTHRES_SHR
– Is automatically tuned by STMM
– If too low, optimizer may prefer joins other than HashJoin
– Specified in 4k pages:
db2 update db cfg for BIGSQL using
sortheap 341333 AUTOMATIC
sheapthres_shr 5120000 AUTOMATIC
– AUTOMATIC keyword indicates this value can be automatically adjusted by STMM
– Use a database snapshot to get sort-related monitor metrics:

db2 "get snapshot for database on BIGSQL" | grep -i sort
Shared Sort heap high water mark         = 3341028
Post threshold sorts (shared memory)     = 0
Total sorts                              = 10526
Total sort time (ms)                     = 65290095
Sort overflows                           = 1927
Active sorts                             = 0
BigSQL 3.0 Properties that have the biggest performance
impact - Sorting
 Prominent join technique is HashJoin
 Since the hash table is built in SORTHEAP HashJoin performance is
sensitive to sort memory – therefore tune SORTHEAP &
SHEAPTHRES_SHR
 Avoid hash join overflows (bad) and hash loops (very bad):

> get snapshot for database on bigsql
Number of hash joins                 = 12
Number of hash loops                 = 2
Number of hash join overflows        = 1
Number of small hash join overflows  = 1
 For more information on hash join performance:
– http://www.ibm.com/developerworks/data/library/techarticle/0208zubiri/0208zubiri.html
BigSQL 3.0 Properties that have the biggest performance
impact - Bufferpools
 Bufferpool – IBMDEFAULTBP
– Area of memory used to cache temporary working data for a query
– Not used to cache hdfs data
– Default is 15% of memory assigned to BigSQL
– Is automatically tuned by STMM
– If too small, the optimizer may start to select sub-optimal plans (with nested loop joins)
– Specified in 32k pages:
db2 "call syshadoop.big_sql_service_mode('on')“
db2 "alter bufferpool IBMDEFAULTBP size 327680 AUTOMATIC"
BigSQL 3.0 Properties that have the biggest performance
impact - Bufferpools
 BigSQL bufferpool is not used to cache HDFS data. It is only used to
cache temporary working data during the execution of a query
 So – why not size the bufferpool small and give more memory to sort space and filesystem cache?
 Optimizer uses the bufferpool size when planning query execution.
Setting the bufferpool too small may influence the optimizer to choose
inefficient plans.
Database Context:
-----------------
Parallelism:          Intra-Partition & Inter-Partition Parallelism
CPU Speed:            1.338309e-07
Comm Speed:           100
Buffer Pool size:     327680
Sort Heap size:       341333
Database Heap size:   9086
. . .
 Also, once the bufferpool is full, temporary data will spill to disk
(temporary tablespace). This will be much slower…..
BigSQL 3.0 Properties that have the biggest performance
impact – SMP parallelism
 SMP Parallelism – INTRA_PARALLEL
– Allows BigSQL to utilize multiple processors on SMP machines by parallelizing
query execution
– Default degree of parallelism is 4 (DFT_DEGREE)
 Enable/Disable using:
db2 update dbm cfg using INTRA_PARALLEL YES|NO
72
 DFT_DEGREE specifies the default level of parallelism, and MAX_QUERYDEGREE specifies the maximum level of parallelism:

db2 update db cfg for bigsql using DFT_DEGREE <value>
db2 update dbm cfg using MAX_QUERYDEGREE ANY

 Rule of thumb: increase parallelism in small increments in case of under-utilized CPU on large SMP systems
BigSQL 3.0 Properties that have the biggest performance
impact – Node Resource Percentage
 Node Resource Percentage –
DB2_CPU_BINDING &
INSTANCE_MEMORY
– Specifies the % (CPU&MEM) of
the cluster dedicated to BigSQL
– Both are set according to the %
specified at install time
– If you see only a fraction of
CPUs are being utilized when
executing a query it is because
BigSQL pins the CPUs
according to the
DB2_CPU_BINDING %.
– If you want to change the %
after install:
autoconfigure using mem_percent 75
workload_type complex
is_populated no
apply db and dbm
BigSQL 3.0 Properties that have the biggest performance
impact – Optimization level & CPU speed
 Optimization Level – DFT_QUERYOPT
– Defines the default optimization level used by the BigSQL optimizer
– Essentially, how much effort the optimizer should put into finding the optimal access plan
– Default is 5
– Some complex queries/workloads may benefit from increasing this to 7
db2 -v update db cfg using DFT_QUERYOPT 7
 Processor speed – CPUSPEED
– Tells BigSQL how fast the CPUs on the machine are
– Is automatically calculated at install time based on the clock speed of the
processors
– Should not have to adjust this property
Check the data types
 STRING is bad for BigSQL !
– But is prevalent in the Hadoop and Hive worlds
 This performance degradation can be avoided by:
 – Changing references from STRING to an explicit VARCHAR(n) that most appropriately fits the data size
 – Using the bigsql.string.size property (via SET HADOOP PROPERTY) to lower the default size of the VARCHAR to which STRING is mapped when creating new tables

[bigsql@BigAPerf098 bigA-TPCH.log-TPCH10TB_PARQ-061614183046.results]$ db2 "describe table SIMON.ORDERS"

Column name      Data type schema  Data type name  Column Length  Scale  Nulls
---------------  ----------------  --------------  -------------  -----  -----
O_ORDERKEY       SYSIBM            BIGINT                      8      0  No
O_CUSTKEY        SYSIBM            INTEGER                     4      0  No
O_ORDERSTATUS    SYSIBM            VARCHAR                     1      0  No
O_TOTALPRICE     SYSIBM            DOUBLE                      8      0  No
O_ORDERDATE      SYSIBM            DATE                        4      0  No
O_ORDERPRIORITY  SYSIBM            VARCHAR                    15      0  No
O_CLERK          SYSIBM            VARCHAR                    15      0  No
O_SHIPPRIORITY   SYSIBM            INTEGER                     4      0  No
O_COMMENT        SYSIBM            STRING                  32672      0  No

9 record(s) selected.
Suspicious plans !!!
 Look for sections in the plan which:
 – are NOT using hash joins as the join type
 – have fat BTQs
 – are using replicated Hadoop scans on large amounts of data
 Warning flags to look out for:
 – Nested loop joins (NLJNs) (can be v. bad)
 – Nested loop joins without a TEMP on the inner (can be v.v.v. bad)
 – Merge scan joins (MSJOIN)
 Note: these are not always signs of a bad plan, but they are indicators – especially if the data volumes are large
BigSQL 3.0 Performance PD – what docs to gather ?
 Collect db2look information for the BIGSQL database:
 – db2look -d bigsql -e -m -l -f
 – See http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0002051.html?cp=SSEPGG_10.5.0%2F3-5-2-6-80&lang=en
 Collect db2support information for the BIGSQL database (aka catsim):
 – db2support <output_directory> -d <database_name> -cl 0
 – See http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.trb.doc/doc/t0020808.html?lang=en
 Collect a formatted explain of the query using db2exfmt (example below):
 – Try to collect the explain with section actuals (which show the actual number of rows processed at each stage):
 – http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005134.html?lang=en
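A typical db2exfmt invocation after the query has been explained (the output file name is illustrative):

db2exfmt -d bigsql -1 -o q1.exfmt.txt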
BigSQL 3.0 Performance PD – what docs to gather ?
 Collect db2pd information for the query:
– db2pd -dfstable
– See “BigSQL Monitor APIs” session of this T3
 Collect dfs Reader logs:
– See “Best Practices” section of this presentation for more details
 Collect general configuration information:
 – Database manager configuration:
     db2 "attach to bigsql"
     db2 "get dbm cfg show detail"
     db2 "detach"
 – Database configuration:
     db2 "connect to bigsql"
     db2 "get db cfg for BIGSQL show detail"
 – BigSQL registry variables:
     db2set
BIGSQL V3.0 *INTERNAL*
BENCHMARKS
Performance, Benchmarking, Benchmarketing
 Performance matters to customers
 Benchmarking appeals to Engineers to drive product innovation
 Benchmarketing used to convey performance in a memorable and
appealing way
 SQL over Hadoop is in the “Wild West” of Benchmarketing
– 100x claims! Compared to what? Conforming to what rules?
 The TPC (Transaction Processing Performance Council) is the granddaddy of all multi-vendor SQL-oriented organizations
– Formed in August, 1988
– TPC-H and TPC-DS are the most relevant to SQL over Hadoop
• R/W nature of workload not suitable for HDFS
 Big Data Benchmarking Community (BDBC) formed
Power and Performance of Standard SQL
 Everyone loves performance numbers, but that's not the whole story
– How much work do you have to do to achieve those numbers?
 A portion of our internal performance numbers are based upon read-only
versions of TPC benchmarks
 BigSQL is the only SQL over Hadoop vendor capable of
executing
– All 22 TPC-H queries without modification
– All 99 TPC-DS queries without modification
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (
SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name
Original Query
SELECT s_name, count(1) AS numwait
FROM
 (SELECT s_name FROM
   (SELECT s_name, t2.l_orderkey, l_suppkey,
           count_suppkey, max_suppkey
    FROM
     (SELECT l_orderkey,
             count(distinct l_suppkey) as count_suppkey,
             max(l_suppkey) as max_suppkey
      FROM lineitem
      WHERE l_receiptdate > l_commitdate
      GROUP BY l_orderkey) t2
    RIGHT OUTER JOIN
     (SELECT s_name, l_orderkey, l_suppkey
      FROM
       (SELECT s_name, t1.l_orderkey, l_suppkey,
               count_suppkey, max_suppkey
        FROM
         (SELECT l_orderkey,
                 count(distinct l_suppkey) as count_suppkey,
                 max(l_suppkey) as max_suppkey
          FROM lineitem
          GROUP BY l_orderkey) t1
        JOIN
         (SELECT s_name, l_orderkey, l_suppkey
          FROM orders o
          JOIN
           (SELECT s_name, l_orderkey, l_suppkey
            FROM nation n
            JOIN supplier s
              ON s.s_nationkey = n.n_nationkey
             AND n.n_name = 'INDONESIA'
            JOIN lineitem l
              ON s.s_suppkey = l.l_suppkey
            WHERE l.l_receiptdate > l.l_commitdate) l1
            ON o.o_orderkey = l1.l_orderkey
           AND o.o_orderstatus = 'F') l2
          ON l2.l_orderkey = t1.l_orderkey) a
      WHERE (count_suppkey > 1) or ((count_suppkey=1)
        AND (l_suppkey <> max_suppkey))) l3
      ON l3.l_orderkey = t2.l_orderkey) b
  WHERE (count_suppkey is null)
     OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name

Re-written for Hive
Benchmark Hardware
2 fully populated racks, 20 nodes per rack:
 – 1 management node containing all management functions
 – 39 data nodes
Hardware spec (per node):
 – RHEL 6.4
 – IBM x3650 M4: Intel e5-2680 v2 @ 2.8GHz, 40 cores w/HT
 – 64GB RAM
 – 9 x HDD @ 2TB
 – NIC: 10GBit/sec
“20TB Modern BI Workload“ - aka TPC-DS 20TB single user
BigSQL 3.0 is 14x faster than Hive 0.12 [chart]
* See footnote for disclaimer
“20TB Modern BI Workload“ - aka TPCDS 20TB Throughput
run (TBD)
* See footnote for disclaimer
“1TB Modern BI Workload“ - aka TPC-DS 1TB single user
BigSQL 3.0 is 25x faster than Hive 0.12 [chart]
* See footnote for disclaimer
“1TB Modern BI Workload“ aka TPC-DS 1TB Throughput run
BigSQL 3.0 is 18x faster than Hive 0.12 [chart]
* See footnote for disclaimer
Customer workload “Catalina”
 Cluster benchmarking workload used by a BigInsights customer
 7 tables:
 – Three large "orders" tables (largest approx. 8 TB uncompressed text)
 – Four small dimension tables: "date", "time", "location", "product"
 57 queries in 5 buckets:
 – Some queries from the customer; additional queries added by IBM
Customer workload “Catalina” (cont)
 Compared Big SQL vs Hive 0.12 using this workload on a 10-node xLinux cluster (1 master node, 9 compute nodes)
 Tested both single-stream ("power") and 4 concurrent streams ("throughput")

BigSQL 3.0 is 5.5x faster than Hive 0.12 (power) [chart]
BigSQL 3.0 delivers 3.6x the throughput of Hive 0.12 [chart]
BIGINSIGHTS V3.0
PERFORMANCE
BigInsights v3.0 MapReduce performance & number of slots
 BigInsights v3.0 (Apache-MR) will generally configure fewer map and
reduce slots on the compute nodes, compared to earlier releases
 BI 3.0 accounts for the number of disks
 BI 3.0 accounts for resources allocated to Big SQL
 Recommendation: Post-install or post-upgrade, always check actual
number of slots configured on compute nodes
 PSMR configures based on number of cores and is aware of resources
allocated to Big SQL
BIGSQL 3.0 ON POWER
Transitioning to POWER
 Easy – Linux is Linux is Linux…
– Linux on Power aims to be as similar/near equivalent as Linux on Intel.
– RHEL distro supported on Power, no SLES for Big SQL.
• Need at least RHEL 6.5 to have all tools working on Power8.
– Install Big Insights (Big SQL) as you would on any Intel server
 Superior hardware performance
– New Power8 lineup, specifically designed for running Linux
• IBM Power System S822L, 2x 12-core sockets, 3.02GHz, up-to 1TB memory, 12
Small form factor (SFF) HDD/SSD bays
– Power8 has 8 hw SMT threads per core (compared to 2 SMT threads on Intel).
– GOTCHYA: RHEL 6.5 kernel can only see 4x hardware threads per core.
RHEL 7.0 can see all 8x hw threads per core, though Big SQL not yet
supported on RHEL 7.0
 Tooling
– Same performance, monitoring, diagnostic tooling all available (nmon, oprofile,
operf, perf, etc)
– IBM Advance Toolchain for PowerLinux : not required, but may be useful
Gotchyas on Power
 Most performance tunables are the same as on Intel, except
– More SMT threads on Power, can support higher degree of parallelism (ie. MAX_QUERYDEGREE 8).
– CPUSPEED : Usually untouched, but default seems to yield suboptimal plans (value is too high).
Experimentally seen benefits with:
• db2 update dbm cfg using cpuspeed 1.380000e-07
 Hardware topology differences
– Competitors can support internally 12 LARGE form factor drives (3.5”) vs 12 SMALL form factor (2.5”)
on Power. LFF come in 4TB vs 1.2TB on SFF. Storage density compromised.
– Reference architectures on Power show EXTERNAL NAS storage using DCS3700 as an example :
Easier to add storage capacity without adding servers (reasonable since CPUs rarely fully utilized in
Big SQL, single-user runs). Having NAS for Big SQL means tuning:
• $BIGSQL_HOME/conf/bigsql-conf.xml
 dfsio.num_threads_per_disk
 dfsio.num_cores
 dfsio.num_disks
 dfsio.num_scanner_threads
 Worthwhile tooling
– lpcpu (Linux performance customer profiler utility) – post process on Intel
– ppc64_cpu – cpu characteristics (compare to /proc/partitions)
Power8 Hardware Configuration
1x Power8 S824 – management node
 – 2s x 12-core @ 3.3GHz, 256GB RAM
 – Linux RHEL 6.5
Mellanox Infiniband switch (IPoIB)
8x Power8 S822L – data nodes
 – 2s x 12-core @ 3.3GHz, 256GB RAM
 – 2x LPARs each (1s x 12c, 128GB RAM)
 – Linux RHEL 6.5
8Gb Fibre Channel switch
4x DCS3700 – storage controllers
 – 60x 1TB HDD, 15 RAID0 LUNs per LPAR
10TB BI Workload (TPC-DS like) : Single-user
Power8 Big SQL v3.0 vs. HP Intel Ivy Bridge Hive 0.12
64X faster! [chart callout]
Big SQL v3.0 is 4.2X faster than Hive 0.12
10TB BI Workload (TPC-DS like) : 7-Users
Power8 Big SQL v3.0 vs. HP Intel Ivy Bridge Hive 0.12
Big SQL v3.0 is 8.9X faster than Hive 0.12 [chart]
Complete multi-user queries in 1 work-day, compared to 4 full 24-hr days…
10TB BI Workload (TPC-DS like) : 7-Users
Power8 Big SQL v3.0 vs. HP Intel Ivy Bridge Hive 0.12 – uses internal non-productized knob
Big SQL v3.0 is 11X faster than Hive 0.12 [chart]
Complete multi-user queries in 1 work-day, compared to 4 full 24-hr days…
BigSQL 3.0 Performance Summary
 BigSQL 3.0 is fast…… for a SQL-over-Hadoop solution
 – We like to think it is the fastest SQL-over-Hadoop solution on the market today
   • And our testing thus far backs this up
 – Is it as fast as an RDBMS – NO….
   • And we should not make those claims/comparisons….
   • But it has a lower TCO proposition….
 Because BigSQL 3.0 is DB2 glued on top of HDFS, it inherits the rich features/functions developed for DB2 over 20+ years
 – This gives BigSQL a huge boost and aligns it more with an enterprise-capable RDBMS than with a start-up SQL-over-Hadoop solution.
Thank you!
BIGSQL V3.0 PERFORMANCE
BACKUP CHARTS
BigSQL 3.0 Optimizer – Sizing the Bufferpool
 BigSQL bufferpool is not used to cache HDFS data. It is only used to cache temporary working data during the execution of a query
 So – why not size the bufferpool small and give more memory to sort space and filesystem cache?
 Optimizer uses the bufferpool size when planning query execution. Setting the bufferpool too small may influence the optimizer to choose inefficient plans.

Database Context:
-----------------
Parallelism:          Intra-Partition & Inter-Partition Parallelism
CPU Speed:            1.338309e-07
Comm Speed:           100
Buffer Pool size:     327680
Sort Heap size:       341333
Database Heap size:   9086
. . .

 Also, once the bufferpool is full, temporary data will spill to disk (temporary tablespace). This will be much slower…..