Big SQL 3.0 Performance
Abhayan Sundararajan, Jesse Chen, John Poelman, Jo A Ramos, Ken Chen, Mike Ahern, Mladen Kovacevic, Rama Alluri, Simarjeev S Kohli, Simon Harris
IBM Big Data Performance team
For questions about this presentation contact Simon Harris, siharris@au1.ibm.com

Agenda
BigSQL Architecture
BigSQL Best Practices
BigSQL Optimizer
BigSQL Performance Problem Determination
BigSQL Internal Benchmarks
BigSQL on Power
BigInsights v3.0 Performance
© 2013 IBM Corporation

BIGSQL V3.0 ARCHITECTURE
DB2 DPF Architecture Overview
BigSQL Compared to DB2

Architecture Overview
BigSQL 3.0 is essentially the DB2 DPF engine on top of the Hadoop HDFS/GPFS filesystem
[Diagram: Big SQL mapped onto DB2 DPF — a Master Node (DB2 DPF coordinator node) hosting the Big SQL Scheduler, Hive Metastore, DDL FMP and UDF FMP; Compute Nodes (DB2 DPF worker nodes) each running an HDFS Data Node, MR Task Tracker, other services, Native and Java I/O FMPs, temp data and HDFS data. *FMP = fenced mode process]

DB2 DPF Architecture – Major components
DB2 optimizer ensures efficient execution paths
DB2 caches data/indexes in its bufferpool, allowing efficient read/write access to data from memory instead of disk. Memory areas are assigned for sorting.
DB2 owns the data on the disk. DB2 is responsible for reading/writing/organising the data.
DB2 maintains indexes for efficient access to data.
Data is hash partitioned to allow for efficient co-located joins
[Diagram: DB2 DPF compute node — DB2 Optimizer & Re-write, DB2 Runtime, sortspace, bufferpool, DB2 temp tablespaces, DB2 tablespaces, DB2 indexes, DB2 storage layer]

BigSQL 3.0 Architecture – Major components
BigSQL 3.0 gets all the benefits of the DB2 Optimizer and Rewrite.
The bufferpool cache is used only for temporary data (within the current query).
Sortspace remains unchanged.
BigSQL 3.0 does not own the data; therefore, indexes cannot be built or maintained.
Data is scatter partitioned – there is NO co-location of data
[Diagram: BigSQL 3.0 worker node — DB2 Optimizer & Re-write, DB2 Runtime, sortspace, bufferpool, temp tablespaces, Native and Java I/O reader FMPs, HDFS data]

BigSQL 3.0 Architecture – Tuning focus
So – BigSQL loses three major tuning features of an RDBMS: data cache, indexes and co-location.
But it gains all the other great features of DB2, including the optimizer, self-tuning memory and advanced workload management
Tuning of BigSQL 3.0 focuses on sortspace, bufferpools, dfsreader throughput, SMP parallelism and efficient plan selection
[Diagram: BigSQL 3.0 worker node tuning focus — as on the previous slide]

BIGSQL V3.0 BEST PRACTICES
Deployment Topologies
Physical Database Design
Readers and Storage Formats
Resource Sharing
Loading data
Data Type considerations
Statistics
Informational Constraints

BigSQL Deployment Topologies
Traditionally, BigInsights Management Node(s) have more memory, but less disk, than the worker/data nodes
– Many management tasks process less data, but are more response-time critical
Management functions may be split between several nodes depending upon requirements
However, the BigSQL 3.0 management node is different
– It can be thought of as a management node that also does work
– It will be used to execute sections of a query – even though it does not own any data locally
This has implications on the topology of the cluster…
– Chiefly, the BigSQL management node needs to have a similar hardware configuration to the BigSQL worker nodes
[Diagram: cluster topology — management nodes hosting the Name Node, Job Tracker, Hive Metastore and a Big SQL compute node, plus worker nodes each running a Task Tracker, Data Node and Big SQL compute node over GPFS/HDFS]

BigSQL Deployment Topologies
The BigSQL management node needs to have at least:
– The same amount of memory
– The same CPU processing power
– The same number of disks and configuration of storage
as the BigSQL worker nodes.
Failure to comply with this recommendation will likely slow down the whole BigSQL cluster:
– Consider a BigSQL management node that has the same CPU & memory as the worker nodes, but only 1/3rd of the number of disk spindles.
– Any query that executes sections on the management node will likely be two-thirds slower whenever it needs to use disk
– Since the whole cluster is only really as fast as the slowest worker, this can have a dramatic impact on query performance

BigSQL Deployment Topologies
The BigSQL Master node needs to be spec'd more like a worker node than a typical management node
– It will execute sections of a query and may write to BigSQL temporary space
– It will not usually be an HDFS data node or an MR task tracker node
[Diagram: Big SQL Master node (Big SQL Scheduler, UDF and DDL FMPs, temp data) alongside management nodes (HDFS Name Node, MR Job Tracker, Hive Metastore, Hive Server, Big SQL v1, BigInsights Console, Database Service) and Big SQL worker/compute nodes (Native and Java I/O FMPs, UDF FMP, temp data, HDFS Data Node, MR Task Tracker, other services)]

Big SQL Tablespace physical layout
User data is stored in files on HDFS/GPFS. The physical layout of the distributed file system is determined by the dfs.data.dir property (if HDFS) or the list of disks per node (if GPFS) supplied at install time.
While Big SQL accesses data on the distributed file system (e.g. HDFS), it still creates and uses DB2 tablespaces
The following tablespaces are created: SYSCATSPACE, TEMPSPACE1, BIGSQLCATSPACE, BIGSQLUTILITYSPACE, SYSTOOLSPACE, SYSTOOLSTMPSPACE
The performance of the storage underlying these DB2 tablespaces matters, especially for SQL queries that create temporary tables
The installer prompts for Big SQL Data path(s) [see next slide]
The path or paths specified become the storage containers associated with the DB2 tablespaces
For good performance, specify multiple paths

Big SQL Tablespace physical layout (cont)
Expand the Big SQL settings in the installer to see all input fields
Specify multiple paths for the Big SQL data directory

BigSQL Tablespace Physical layout
Spread the Big SQL data directory over as many disks as possible
Share disks between BigSQL, HDFS (dfs.data.dir) or GPFS, and MapReduce intermediate data (mapred.local.dir)
HDFS/GPFS: all user data is stored on the distributed file system
MapRed cache: the MapReduce cache is used for temporary data during execution of MR jobs
BigSQL: the BigSQL data directory is used for temporary data during execution of BigSQL queries
Rule of thumb: spread everything across all disks

C++/Java Readers Overview
The readers read information from the BigInsights Distributed File System (DFS) and return the information to DB2
Scott referred to these as "I/O Engines" in his presentation
Readers run on the compute nodes
Two reader types: Native C++ and Java
Big SQL chooses which reader to use based on the storage format of the
data being accessed
The C++ Reader is based on the open source Impala reader
The Java Reader is used for storage formats not supported by the C++ Reader (more on this later...)

C++ Readers – Architecture
Readers run in the db2 Fenced Mode Processes (FMPs)

C++/Java Reader Configuration Files
The readers are configured by properties in the configuration file $BIGSQL_HOME/conf/bigsql-conf.xml
The bigsql-conf.xml file is also used to configure the scheduler
The C++ Reader properties generally have the prefix "dfsio."
A value of 0 usually means "let the system decide"
Generally, customers will not modify the reader settings
Reader logging is controlled by the glog_enabled=[true|false] property in the glog-dfsio.properties file:
Reader logging is enabled by default and can produce a lot of output
It usually has only a small performance impact, though
Consider disabling logging on production clusters

C++ Reader: Commonly tuned properties
Important C++ Reader settings:
dfsio.disable_codegen. True by default, meaning that LLVM (Low Level Virtual Machine) code generation is disabled. We will explore LLVM more for the 4Q release; for now, we recommend leaving this disabled. Can be specified per query.
dfsio.num_scanner_threads. Caps the number of scanner threads that will be created. Default is 0 (let the system decide). Can be specified per query.
dfsio.mem_limit. The amount of memory used by the C++ Readers is controlled by Big SQL by default, but can be overridden by setting dfsio.mem_limit to a value > 0. Default is 0 (let Big SQL decide).
dfsio.num_threads_per_disk. The number of I/O threads spawned per disk. If 0, then "the system decides" and will spawn 5 I/O threads per disk. Five threads is usually reasonable for both rotational disks and SSDs. Consider increasing this value on clusters with very high performance devices.
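The "0 means let the system decide" convention used by these dfsio settings can be illustrated with a toy resolver. This is only a sketch: the property names and the 5-threads-per-disk default come from the slides above, but the resolution logic here is an illustrative assumption, not the reader's actual code.

```python
# Toy illustration of how a "0 = let the system decide" setting might resolve.
# Property semantics from the slides; the resolver itself is an assumption.

SYSTEM_DEFAULT_THREADS_PER_DISK = 5  # slides: "will spawn 5 I/O threads per disk"

def resolve_io_threads(num_threads_per_disk: int, num_disks: int) -> int:
    """Return the total number of I/O threads a node would spawn."""
    per_disk = num_threads_per_disk or SYSTEM_DEFAULT_THREADS_PER_DISK
    return per_disk * num_disks

# With the default (0), a 10-disk node gets 5 * 10 = 50 I/O threads.
print(resolve_io_threads(0, 10))   # 50
# Overriding to 8 threads per disk (e.g. very fast devices) gives 80.
print(resolve_io_threads(8, 10))   # 80
```

The same pattern applies to dfsio.num_scanner_threads and dfsio.mem_limit: a non-zero value overrides, zero defers to the system.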
Java Reader: Commonly tuned properties
Important Java Reader settings:
"bigsql.java.io.tp.size". Number of reader threads per scan. Default is 8. Consider increasing this value on clusters with very high performance devices. Can be set per query, e.g. SET HADOOP PROPERTY 'bigsql.java.io.tp.size' = 4
"scheduler.force.javareader". This is an undocumented property and should only be used by support when troubleshooting a system. Set to true to force Big SQL to use the Java Reader instead of the C++ Reader. A workaround if the C++ Reader is failing for some reason.

C++ Reader performance metrics
The C++ Reader logs performance metrics at the end of each scan
The metrics are dumped to the C++ Reader log file: $BIGINSIGHTS_VAR/bigsql/logs/bigsql-ndfsio.log.INFO
[Screenshot: example metrics output]

C++ Reader performance metrics (cont.)
[Screenshot: example metrics output, continued]

C++ Reader metrics can also be dumped using db2pd
Run db2pd on each compute node:
db2pd -d bigsql -dfstable -file <output file>
[Screenshot: example db2pd output]

C++ Reader metrics can also be dumped using db2pd (cont)
This screenshot shows a subset of the metrics...

C++ Reader and HDFS
When the BigInsights cluster is using HDFS, we recommend setting the property "dfs.datanode.hdfs-blocks-metadata.enabled" to true.
This enables the C++ Reader to know which local disks blocks reside on
With this information, the C++ Reader can pin I/O threads to disks
If performance seems sub-optimal, then use the C++ Reader log to confirm that I/O threads are pinned to disks
On a compute node, check the C++ Reader log file: $BIGINSIGHTS_VAR/bigsql/logs/bigsql-ndfsio.log.INFO
Grep the log file for the string "Split volume id"
We expect the split volume ids to be 0 or greater (indicates pinning is occurring)
Example: I0613 19:06:48.645717 2973330 bi-dfs-reader.cc:1907] Split volume id 3
If the split volume ids are all -1, then we're not pinning I/O threads to disks.
Check property “dfs.datanode.hdfs-blocks-metadata.enabled”. 24 Example: Example: I0428 02:23:28.394435 3165237 bi-dfs-reader.cc:1677] Split volume id -1 © 2013 IBM Corporation Storage Formats Out Of The Box results Plan for extensive studies – To find best combination for each – To test with compression – To identify best practices 25 © 2013 IBM Corporation Data Type considerations (Based on Scott's migration guide) Big SQL 3.0 contains an entirely new optimized SQL execution engine. This engine internally works with data in units of 32K pages and works most efficiently when the definition of table allows a row to fit within 32k of memory. Once the calculated row size exceeds 32k, performance degradation can be seen for certain queries. As a result, the use of the datatype STRING is strongly discouraged, as this is mapped internally to VARCHAR(32,672) which means that the table will almost certainly exceed 32k in size. This performance degradation can be avoided by: 26 Change references to STRING to explicit VARCHAR(n) that most appropriately fit the data size Use the bigsql.string.size property (via SET HADOOP PROPERTY) to lower the default size of the VARCHAR to which the STRING is mapped when creating new tables. 
Resource Sharing
When installing BigSQL, the user specifies the percentage of cluster resources to dedicate to BigSQL
– This is hidden away under "Advanced settings" for BigSQL
– Default is 25%
– Recommended range is 25% to 75%
The value specified dictates the memory and CPU resources dedicated to BigSQL – not disk
– BigSQL will do its best to keep within the boundary specified

Resource sharing – Balancing with MapReduce
The installer will automatically tune the default BigSQL properties according to the percentage specified:
– INSTANCE_MEMORY
– Sort memory (sortheap & sheapthres_shr)
– Bufferpool
– dfsreader memory
It will also tune MapReduce properties (in mapred-site.xml) to ensure cluster resources are not over-allocated when BigSQL queries and MapReduce jobs are executing at the same time.

Resource Sharing – How much of the BigInsights cluster to dedicate to BigSQL?
INSTANCE_MEMORY is set to the percentage of memory assigned to BigSQL
– If you used 50% as the resource percentage, then INSTANCE_MEMORY=50
– If you have 64GB memory, then BigSQL has 0.5*64=32GB available to it.
Most memory is allocated to sort space, since hash joins are the prevalent join technique used in BigSQL
Bufferpools are not used to cache the HDFS data, but will still be used for intermediate storage whilst a query is executing
– Also, the bufferpool size is a key input into the optimizer.
Setting this too low will cause the optimizer to favor inefficient join techniques (such as nested-loop joins)
– So you need a reasonably sized bufferpool even though it is not used to cache HDFS data
dfsreaders require memory to read data from HDFS and exchange the data with the BigSQL runtime
– By default, the readers are allocated 20% of the memory assigned to BigSQL
– In the above example, 0.2*32=6.4GB

Resource sharing – Changing resource percentage after install
To change the percentage of resources dedicated to BigSQL after install:
autoconfigure using mem_percent 75 workload_type complex is_populated no apply db and dbm
This will automatically update the BigSQL memory-related properties previously mentioned
– It will not update the MapReduce settings in mapred-site.xml – this will have to be done manually
– Formula for calculating memory consumption:
(mapred.tasktracker.map.tasks.maximum * [-Xmx option of mapreduce.map.java.opts]) + (mapred.tasktracker.reduce.tasks.maximum * [-Xmx option of mapreduce.reduce.java.opts]) + (Physical memory * INSTANCE_MEMORY) + sum (Other tasks running on node [datanode/task tracker/hbase etc…]) <= Physical memory * 0.90

Resource sharing – Self Tuning Memory Manager (STMM)
BigSQL will constantly monitor the memory allocation and how efficiently it is being used
Memory allocation will be re-assigned between BigSQL consumers (within the given boundaries) to ensure BigSQL is making the best possible use of the available memory
– For example, if BigSQL detects that bufferpools are infrequently used but sort space is regularly exhausted, then it may decide to reduce the amount of memory available to bufferpools and give this memory to sort space.
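The memory-consumption formula above can be sanity-checked with a small calculation. The formula itself is from the slides; all concrete node sizes and task counts below are made-up example numbers.

```python
# Sanity-check the cluster memory budget using the formula from the slides.
# The 0.90 factor and formula terms are from the deck; numbers are examples.

GB = 1024 ** 3
physical_memory = 64 * GB

map_slots, map_xmx = 8, 2 * GB        # mapred.tasktracker.map.tasks.maximum, -Xmx
reduce_slots, reduce_xmx = 4, 2 * GB  # mapred.tasktracker.reduce.tasks.maximum, -Xmx
instance_memory_pct = 0.25            # INSTANCE_MEMORY expressed as a fraction
other_services = 4 * GB               # datanode / task tracker / hbase etc. (assumed)

used = (map_slots * map_xmx
        + reduce_slots * reduce_xmx
        + physical_memory * instance_memory_pct
        + other_services)

budget = physical_memory * 0.90
print(round(used / GB, 1))   # 44.0 -- GB committed
print(used <= budget)        # True -- fits within the 57.6GB (90%) budget
```

If the check fails, either the MapReduce slot counts/heaps or the BigSQL resource percentage must come down, which is exactly the manual rebalancing the slide describes.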
Top tip: if you have a relatively fixed workload:
– Run with STMM enabled for a period of time (usually several days)
– Monitor the bufferpools & sort space; when they have remained stable for several days, disable STMM:
• update db cfg for bigsql using SELF_TUNING_MEM off
– Your BigSQL cluster now has the optimal tuning for your workload

BigSQL 3.0 Table Partitioning
BigSQL (and Hive) provide the ability to partition a table based on a data value
This improves query performance by eliminating those partitions that do not contain the data value of interest
BigSQL stores different data partitions as separate files in HDFS and only scans the partitions required by a query, thereby improving runtime
Partition on a column commonly referenced in range-delimiting or equality predicates. Ranges of dates are ideal for use as partition columns
Create the LINEITEM table partitioned on L_SHIPDATE:
"CREATE HADOOP TABLE LINEITEM ( L_ORDERKEY BIGINT NOT NULL, L_PARTKEY INTEGER NOT NULL, L_SUPPKEY INTEGER NOT NULL, L_LINENUMBER INTEGER NOT NULL, L_QUANTITY FLOAT NOT NULL, L_EXTENDEDPRICE FLOAT NOT NULL, L_DISCOUNT FLOAT NOT NULL, L_TAX FLOAT NOT NULL, L_RETURNFLAG VARCHAR(1) NOT NULL, L_LINESTATUS VARCHAR(1) NOT NULL, L_COMMITDATE DATE NOT NULL, L_RECEIPTDATE DATE NOT NULL, L_SHIPINSTRUCT VARCHAR(25) NOT NULL, L_SHIPMODE VARCHAR(10) NOT NULL, L_COMMENT VARCHAR(44) NOT NULL) PARTITIONED BY (L_SHIPDATE DATE) STORED AS TEXTFILE"

BigSQL 3.0 Table Partitioning
A separate file for each unique L_SHIPDATE will be created when the table is populated
The file name will be tagged with the value:
> hadoop fs -ls /biginsights/hive/warehouse/parq_partition.db/lineitem | more
Found 2526 items
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:03 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1992-01-02
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:03 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1992-01-03
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:03 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1992-01-04
....
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-27
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-28
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-29
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-11-30
drwxr-xr-x - bigsql biadmin 1268920 2014-06-20 10:09 /biginsights/hive/warehouse/parq_partition.db/lineitem/l_shipdate=1998-12-01
Queries with predicates on the partitioning column will only read those files that qualify:
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date ('1998-12-01') - 3 day group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;

Table Partitioning – Performance Results
Setup
5-machine cluster with 4 data nodes
TPCH 500GB workload using textfile format, with the Lineitem table partitioned on L_SHIPDATE and the Orders table partitioned on O_ORDERDATE
Results
Table partitioning provides an overall benefit of 14% improvement as compared to no partitioning
But not all queries improved; some got slower…
[Chart: TPCH performance comparison using table partitioning — runtime (sec), No Partition vs Partition]

Big SQL 3.0 LOAD Best Practices - 1
LOAD uses MapReduce job(s) to read the data from the source (local disk, hdfs/gpfs, or an RDBMS) and populate the target table
The default number of map tasks for a LOAD job is just 4 – this is usually much too small when loading large amounts of data
– tune the LOAD property num.map.tasks to customize the number of map tasks
– a good starting point is to set it to the number of BigSQL worker nodes (or a multiple thereof)
load hadoop using file url '/tpch1000g/orders/' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table ORDERS overwrite WITH LOAD PROPERTIES ('num.map.tasks'='145');
Warning: the max Java heap size for LOAD tasks is 2GB. If the default Java heap size for the cluster (defined in mapreduce.map.java.opts) is <2GB, then specifying too large a value for num.map.tasks may over-commit memory on the cluster.

Big SQL 3.0 LOAD Best Practices - 2
If the source consists of a few large files, then it is more efficient to copy the files to HDFS first, and then load from HDFS
If the source consists of many small files, then loading from either the local filesystem or HDFS has similar performance
Reason – the number of map tasks used to load from the local filesystem is limited to the number of files. If there are only a few large files, then only a small number of map tasks will be created to transfer and load the data. For HDFS sources, the number of map tasks is limited by the number of HDFS blocks, not the number of files.
[Chart: load from local disk vs HDFS — elapsed time (sec) for a large-file (94GB) and a small-file (6GB) data set]

Big SQL 3.0 LOAD Best Practices - 3
Note: executing multiple concurrent LOAD HADOOP USING statements into the same non-partitioned table is not currently supported. Concurrent LOAD HADOOP USING into different partitions of the same partitioned table is supported.
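The map-task limits just described (file count caps parallelism for local sources; HDFS block count caps it for HDFS sources) can be sketched as follows. The 128MB block size is an illustrative assumption, not a value stated in the deck.

```python
# Sketch of why local-file LOAD parallelism is capped by file count while
# HDFS LOAD parallelism is capped by block count. Block size is assumed.

def max_map_tasks_local(num_files: int) -> int:
    # Local filesystem source: at most one map task per source file.
    return num_files

def max_map_tasks_hdfs(total_bytes: int, block_bytes: int = 128 * 1024 ** 2) -> int:
    # HDFS source: at most one map task per HDFS block (128MB assumed).
    return -(-total_bytes // block_bytes)   # ceiling division

# Two 47GB local files can drive only 2 map tasks...
print(max_map_tasks_local(2))               # 2
# ...but the same 94GB copied to HDFS can drive 752 map tasks.
print(max_map_tasks_hdfs(94 * 1024 ** 3))   # 752
```

This is why copying a few large files to HDFS before loading is the recommended practice, while many small files load equally well from either source.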
Can also use Hive and BigSQL INSERT…SELECT… statements to move data into BigSQL tables

BigSQL 3.0 LOAD – Checking Data Distribution
BigSQL data is scatter partitioned across the data nodes
Total size: 2101761659596 B
Total dirs: 1
Total files: 2062
Total symlinks: 0
Total blocks (validated): 2062 (avg. block size 1019283055 B)
Minimally replicated blocks: 2062 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 18
Number of racks: 1
FSCK ended at Fri Jun 13 04:21:51 PDT 2014 in 58 milliseconds
The filesystem under path '/biginsights/hive/warehouse/tpch10tb_parq.db/lineitem' is HEALTHY
Block count for lineitem on node 120: 2063
Block count for lineitem on node 121: 289
Block count for lineitem on node 122: 345
Block count for lineitem on node 123: 312
Block count for lineitem on node 124: 0
Block count for lineitem on node 125: 382
Block count for lineitem on node 126: 384
Block count for lineitem on node 127: 318
Block count for lineitem on node 128: 346
Block count for lineitem on node 129: 346
Block count for lineitem on node 130: 0
………
Aim for an even distribution of dfs blocks across nodes. This should be automatically maintained by hdfs.
* See speaker notes for script

Statistics are critical to BigSQL 3.0 Performance
Statistics are used by the optimizer to make informed decisions about query execution
Accurate and up-to-date statistics can improve performance many-fold, while out-of-date or inaccurate statistics can devastate performance.
Statistics MUST be updated whenever:
– a new table is populated, or
– an existing table's data undergoes significant changes:
• new data added,
• old data removed,
• existing data is updated
Use ANALYZE to update a table's statistics:
ANALYZE TABLE SIMON.ORDERS COMPUTE STATISTICS FOR COLUMNS O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT

Types of Statistics collected by BigSQL 3.0
Table statistics:
– Cardinality (count)
– Number of Files
– Total File Size
Column statistics (this applies to column group stats also):
– Minimum value
– Maximum value
– Cardinality (non-nulls)
– Distribution (Number of Distinct Values)
– Number of null values
– Average length of the column value (for string columns)
– Histogram
– Frequent Values (MFV)

BigSQL 3.0 Optimizer – Statistics are crucial
Usually in db2, if statistics have never been gathered for a table, explain will show 1000 for the table cardinality:
| 1000 HTABLE: TPCH10TB_PARQ SUPPLIER Q2
But this is not the case in BigSQL, because BigSQL fabricates some basic stats from the file information
Check STATS_TIME and CARD in SYSCAT.TABLES to see if ANALYZE has been run:
db2 "select substr(tabname,1,20),stats_time,card from syscat.tables where tabschema='SIMON'"
TABNAME               STATS_TIME                 CARD
--------------------  -------------------------  ----
NATION                -                          -1
ORDERS                -                          -1
CUSTOMER              -                          -1
LINEITEM              -                          -1
REGION                -                          -1
SUPPLIER              -                          -1
PART                  -                          -1
PARTSUPP              -                          -1

BigSQL 3.0 Optimizer – Advanced Statistics
SYSSTAT views are available to manually manipulate statistics
– A set of views within BigSQL that allow administrators to manually update statistics
– For advanced users only – you must understand what you are doing…
Statistical Views are also supported:
– Ability to collect statistics on a view
– Useful to get accurate cardinality estimates for complex relationships
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html?lang=en

BigSQL 3.0 – Use Informational Constraints
BigSQL 3.0 supports Informational Constraints (ICs)
ICs are much like regular primary and foreign key constraints, except they are not enforced
They do provide useful information to the optimizer:
alter table SIMON.orders add primary key (O_ORDERKEY) not enforced
alter table SIMON.lineitem add primary key (L_ORDERKEY,L_LINENUMBER) not enforced
alter table SIMON.lineitem add foreign key (L_ORDERKEY) references SIMON.orders (O_ORDERKEY) not enforced
alter table EMPLOYEE add constraint revenue check (SALARY + COMM > 25000) not enforced
Informational constraints allow the optimizer to make better selectivity estimates, which improves the costing of the plan and the execution efficiency of the query

Big SQL 3.0 Best Practices – Summary
Ensure you have a homogeneous and balanced cluster
– Utilize the IBM reference architecture
– Balance resource usage between BigSQL 3.0 and MR jobs
– Ensure several disks are assigned to the BigSQL data directory
Choose an optimized file format (if possible)
– Parquet for BigSQL 3.0
Choose appropriate data types
– Use the smallest and most precise datatype available
Define informational constraints
– Primary key, foreign key, check constraints
Ensure you have good statistics
– Current and comprehensive
Use the full power of SQL available to you
– Don't constrain yourself to Hive syntax/capability

BIGSQL V3.0 OPTIMIZER
Query Re-write
Query Pushdown
Statistics & Costing
New Access Strategies
Anatomy of explain plan

Big SQL 3.0 – Query Planning
Query rewrites
– Exhaustive query rewrite capabilities
– Leverages additional metadata such as constraints and nullability
Optimization
– Statistics- and heuristic-driven query optimization
– Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
– Highly detailed explain plans and query diagnostic tools
– Extensive number of available performance metrics
Query transformation example:
SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) FROM PERIOD, DAILY_SALES, PRODUCT, STORE WHERE PERIOD.PERKEY=DAILY_SALES.PERKEY AND PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND STORE.STOREKEY=DAILY_SALES.STOREKEY AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND STORE_NUMBER='03' AND CATEGORY=72 GROUP BY ITEM_DESC
[Diagram: access plan generation — dozens of query transformations produce hundreds or thousands of access plan options (NLJOIN/HSJOIN/ZZJOIN trees over Period, Product, Store and Daily Sales), from which one access section with parallel threads (table queues, partial and complete aggregation) is selected]

Query Rewrite
► Why is query re-write important?
– There are many ways to express the same query
– Query generators often produce suboptimal queries and don't permit "hand optimization"
– Complex queries often result in redundancy, especially with views
– For large data volumes, optimal access plans are more crucial, as the penalty for poor planning is greater
Before rewrite:
select sum(l_extendedprice) / 7.0 avg_yearly from tpcd.lineitem, tpcd.part where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX' and l_quantity < ( select 0.2 * avg(l_quantity) from tpcd.lineitem where l_partkey = p_partkey);
After rewrite:
select sum(l_extendedprice) / 7.0 as avg_yearly from temp (l_quantity, avgquantity, l_extendedprice) as (select l_quantity, avg(l_quantity) over (partition by l_partkey) as avgquantity, l_extendedprice from tpcd.lineitem, tpcd.part where p_partkey = l_partkey and p_brand = 'BRAND#23' and p_container = 'MED BOX') where l_quantity < 0.2 * avgquantity
• Query correlation eliminated
• Lineitem table accessed only once
• Execution time reduced in half!
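The effect of this decorrelation can be mimicked in miniature: the correlated form re-scans lineitem for every row, while the rewritten form computes each per-partkey average once. This is a toy Python analogue of the idea, not the engine's implementation, and the rows are made-up sample data.

```python
# Toy analogue of the rewrite above: replace a correlated per-row subquery
# with a single grouped pass (like AVG(...) OVER (PARTITION BY l_partkey)).
from collections import defaultdict

lineitem = [  # (l_partkey, l_quantity, l_extendedprice) -- made-up rows
    (1, 10.0, 100.0), (1, 0.5, 90.0), (2, 5.0, 50.0), (2, 0.5, 40.0),
]

def correlated():
    # Correlated style: re-scan lineitem for every row (O(n^2) scans).
    total = 0.0
    for pk, qty, price in lineitem:
        qtys = [q for p, q, _ in lineitem if p == pk]
        if qty < 0.2 * (sum(qtys) / len(qtys)):
            total += price
    return total

def decorrelated():
    # Rewritten style: one pass builds per-partkey averages, one pass filters.
    sums = defaultdict(lambda: [0.0, 0])
    for pk, qty, _ in lineitem:
        sums[pk][0] += qty
        sums[pk][1] += 1
    avg = {pk: s / n for pk, (s, n) in sums.items()}
    return sum(price for pk, qty, price in lineitem if qty < 0.2 * avg[pk])

print(correlated() == decorrelated())   # True: same answer, one scan not many
```

Same result either way; the rewritten form simply touches the data once, which is where the "execution time reduced in half" on the slide comes from.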
Query Rewrite
► BigSQL uses the DB2 re-write engine
► Most existing query rewrite rules remain unchanged
– 140+ existing query re-writes are leveraged
– Almost none are impacted by "the Hadoop world"
► There were, however, a few modifications that were required…

Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
– Can produce more efficiently decorrelated subqueries and joins
– Used to prove uniqueness of joined rows ("early-out" join)
► Very few Hadoop data sources support the concept of an index
► In the Hive metastore all columns are implicitly nullable
► Big SQL introduces advisory or informational constraints and nullability indicators
– The user can specify whether or not constraints can be "trusted" for query rewrites
Nullability indicators and constraints example:
create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null, salary timestamp(3) null, constraint fk_ofc foreign key (office_id) references office (office_id) ) row format delimited fields terminated by '|' stored as textfile;
> alter table SIMON.orders add primary key (O_ORDERKEY) not enforced

Statistics
► Big SQL utilizes Hive statistics collection with some extensions:
– Additional support for column groups, histograms and frequent values
– Automatic determination of partitions that require statistics collection vs. explicit
– Partitioned tables: added table-level versions of NDV, Min, Max, Null count, Average column length
– Hive catalogs as well as database engine catalogs are also populated
– We are restructuring the relevant code for submission back to Hive
Table statistics
• Cardinality (count)
• Number of Files
• Total File Size
Column statistics
• Minimum value (all types)
• Maximum value (all types)
• Cardinality (non-nulls)
• Distribution (Number of Distinct Values, NDV)
• Number of null values
• Average length of the column value (all types)
• Histogram – number of buckets configurable
• Frequent Values (MFV) – number configurable
Column group statistics
► Capability for statistics fabrication if no stats are available at compile time

Costing Model
► Few extensions were required to the cost model for the optimizer to understand the SQL-over-Hadoop world.
► The TBSCAN operator cost model was extended to evaluate the cost of reading from Hadoop
► HTABLE operator
► Degree of pushdown possible to the readers…
► Scatter partitioning used in Hadoop
► New elements taken into account: # of files, size of files, # of partitions, # of nodes
[Explain plan excerpt: HSJOIN/NLJOIN/GRPBY operators with table queues (BTQ, LTQ, DTQ) over TPCH5TB_PARQ ORDERS and CUSTOMER, annotated with cardinality and cost estimates]
► Better costing in SQL over Hadoop!
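As a toy illustration of these new cost inputs (a sketch, not the real DB2 cost model): a scan cost that charges per-file overhead on top of bytes read, divided across the nodes sharing the scatter-partitioned work, shows why many small files cost more than a few large ones. All constants are invented.

```python
# Toy scan-cost function over the inputs the slides list: number of files,
# size of files, number of nodes. Constants and formula are illustrative only.

PER_FILE_OVERHEAD = 10.0   # cost units to open/schedule one file (assumed)
COST_PER_GB = 1.0          # cost units to read 1GB of data (assumed)

def toy_scan_cost(num_files: int, total_gb: float, num_nodes: int) -> float:
    # Scatter partitioning: nodes share the bytes; overhead is paid per file.
    return (num_files * PER_FILE_OVERHEAD + total_gb * COST_PER_GB) / num_nodes

# The same 1000GB on 18 nodes: 100 big files vs 100,000 tiny files.
few_large = toy_scan_cost(100, 1000.0, 18)
many_small = toy_scan_cost(100_000, 1000.0, 18)
print(few_large < many_small)   # True: file count drives cost up
```

The real TBSCAN extension is of course far more detailed, but the direction is the same: file count and layout now matter to the optimizer, not just total bytes.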
51 © 2013 IBM Corporation

New Query Pushdown
► Pushdown is important because it reduces the volume of data flowing from the readers into Big SQL
► Pushdown moves processing down as close to the data as possible
– Projection pushdown: retrieve only the necessary columns
– Selection pushdown: push search criteria down
► Big SQL understands the capabilities of the readers and storage formats involved
– As much as possible is pushed down
– Residual processing is done in the server
– The optimizer costs queries based upon how much can be pushed down
► Parquet (with the C++ reader) provides the best combination of pushdown for Big SQL

select sum(l_extendedprice) / 7.0 as avg_yearly
from temp (l_quantity, avgquantity, l_extendedprice) as
  (select l_quantity,
          avg(l_quantity) over (partition by l_partkey) as avgquantity,
          l_extendedprice
   from tpcd.lineitem, tpcd.part
   where p_partkey = l_partkey
     and p_brand = 'Brand#23'
     and p_container = 'MED BOX')
where l_quantity < 0.2 * avgquantity

3) External Sarg Predicate
   Comparison Operator:     Equal (=)
   Subquery Input Required: No
   Filter Factor:           0.04
   Predicate Text:          (Q1.P_BRAND = 'Brand#23')

4) External Sarg Predicate
   Comparison Operator:     Equal (=)
   Subquery Input Required: No
   Filter Factor:           0.025
   Predicate Text:          (Q1.P_CONTAINER = 'MED BOX')

52 © 2013 IBM Corporation

New Access Plans
► Data is not hash partitioned on a particular column (aka "scatter partitioned")
► We can access a Hadoop table as:
– "Scatter" partitioned: only accesses data local to the node
– Replicated: accesses local and remote data
  • The optimizer could also use a broadcast table queue
  • The HDFS shared filesystem provides replication
► New parallel join strategy introduced
53 © 2013 IBM Corporation

Parallel Join Strategies
A table queue represents communication between nodes or subagents. Replicated vs.
Broadcast join
► All tables are "scatter" partitioned
► Join predicate: STORE.STOREKEY = DAILY_SALES.STOREKEY
► Replicate the smaller table to the partitions of the larger table using:
• a broadcast table queue, or
• a replicated HDFS scan

[Diagram: JOIN over a scan of Daily Sales, with the Store table either sent via a broadcast TQ or read with a replicated scan on every node]

54 © 2013 IBM Corporation

Parallel Join Strategies
Repartitioned join
► All tables are "scatter" partitioned
► Join predicate: DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
► Both tables are large: too expensive to broadcast or replicate either
► Repartition both tables on the join columns
► Use a directed table queue (DTQ)

[Diagram: JOIN fed by two directed TQs over scans of Daily Forecast and Daily Sales]

55 © 2013 IBM Corporation

BigSQL 3.0 Optimizer – Be wary of FAT BTQs
► Be wary of broadcast table queues moving large amounts of data
► This BTQ will send 722M rows from each data node to every other data node… so each data node processes ALL the data for this join. In an 18-node cluster each node will process 722M*18=12,996M rows

[Plan fragment: ^HSJOIN(11), cost 9.50434e+06, fed by a BTQ(12) carrying 7.22346e+08 rows and an HSJOIN^(21) of 2.43121e+07 rows]

► With directed table queues, data is partitioned on the fly by the join key and only sent to the node responsible for processing that key. So each node only processes a subset of the data for this join.
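The broadcast fan-in arithmetic from this slide can be reproduced directly. A sketch (the row counts are the plan's estimates, not measurements):

```shell
# Each data node sends 722M rows through the broadcast TQ to every other
# node, so with 18 nodes every node ends up processing 722M * 18 rows.
rows_per_sender_m=722    # millions of rows from each sender (plan estimate)
nodes=18
rows_per_node_m=$(( rows_per_sender_m * nodes ))
echo "${rows_per_node_m}M rows processed per node"   # 12996M rows per node
```

With a directed TQ, by contrast, each node receives only the hash range it is responsible for, which is why the deck's second plan scales so much better.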
In an 18-node cluster each node will process approx 3,151M rows

[Plan fragment: ^HSJOIN(14), cost 6.97698e+06, fed by two directed TQs – DTQ(15) carrying 3.15179e+09 rows and DTQ(18) carrying 5.26316e+06 rows]

► The latter will scale much better than the former
56 © 2013 IBM Corporation

Viewing an access plan
► The Explain utility is used to view a Big SQL access plan – inherited from DB2
► Two versions of explain, each providing varying levels of detail:
– Visual explain (graphical)
– Explain format (textual)
► To obtain a formatted explain of a query in file q1.sql:

db2 "connect to bigsql"
db2 -tvf $HOME/sqllib/misc/EXPLAIN.DDL    <- one-time operation to create the explain tables
db2 terminate
explain.sh -d bigsql -f q1.sql

* See speaker notes for the explain.sh script
57 © 2013 IBM Corporation

Anatomy of an explain
► Explain header info
► Config values impacting the optimizer
► Original SQL statement
► Re-written SQL statement
► Plan tree graph
► Operator & object details…
* See speaker notes for explain script
58 © 2013 IBM Corporation

Explain: Config values
► Lists the most important configuration values that impact the optimizer
► Lock-related properties are not appropriate for Big SQL
59 © 2013 IBM Corporation

Explain: SQL Statements
► The original statement is the query as it was submitted to Big SQL
► The optimized statement is the query after it has gone through query rewrite processing
– This is the query the optimizer actually sees
60 © 2013 IBM Corporation

Explain: Plan Tree Graph
► Provides an overview of the plan chosen by the optimizer
► Total cost is the cost of the plan in timerons
– Timerons are a mythical value against which different plans are compared – they do not represent time
– Execution order of the operators is from the bottom up
► Each node of the graph shows: the optimizer's estimate of the number of rows flowing out of the operator, the cost up to this point in the plan (timerons), the operator name, and the I/O cost
61 © 2013 IBM Corporation

Explain: Plan Tree Graph
3.
Directed Table Queue (DTQ) operators in (12) and (15) hash partition the join inputs based on the join key and send the data to the appropriate node, where the hash join (HSJOIN) will take place.
2. Local Table Queue (LTQ) operators in (13) and (16) mean the TBSCANs occur in (SMP) parallel.
1. The NEW_LINEITEM and ORDERS tables are read from HDFS via TBSCAN operators (14) & (17).
62 © 2013 IBM Corporation

BigSQL 3.0 Optimizer – Reference
► Optimizer in the DB2 Knowledge Centre:
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0054924.html?lang=en
► Good workshop on understanding the DB2 optimizer:
– http://www.slideshare.net/terraborealis/understanding-db2-optimizer
► For additional information on visualizing the Big SQL access plan, search "explain" in the DB2 Knowledge Centre at:
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.kc.doc/welcome.html?lang=en
► Optimization profiles (aka hints) are supported, same as DB2. Search "optimization profile" in the DB2 Knowledge Centre at:
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.kc.doc/welcome.html?lang=en
63 © 2013 IBM Corporation

BigSQL 3.0 Optimizer – Reference
► DB2 Knowledge Centre:
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005134.html?lang=en
► How to read explain plans presentation:
– http://www.slideshare.net/Tess98/how-to-read-query-plans-and-improve-performance
64 © 2013 IBM Corporation

► Statistics
► Properties with the biggest performance impact
► Node resource percentage
► String data types
► Suspicious plans
► Docs to collect
BIGSQL 3.0 PERFORMANCE PROBLEM DETERMINATION
65 © 2013 IBM Corporation

BigSQL 3.0 Performance Problem Determination
► If you are experiencing a performance issue with a handful of queries, it is likely a plan issue:
– Check statistics
– Gather and analyze explains
► If you are experiencing a general performance issue (most queries seem slow), it is likely a configuration issue:
– Check statistics
– Check the Big SQL configuration
66 © 2013 IBM Corporation

BigSQL 3.0 Performance PD – Check statistics
► ANALYZE, ANALYZE, ANALYZE…
► First port of call for a Big SQL performance problem
– Make sure ANALYZE has been run and statistics are up to date
► Usually in DB2, if statistics have never been gathered for a table, explain will show 1000 for the table cardinality:

  1000
  HTABLE: TPCH10TB_PARQ
  SUPPLIER
  Q2

► But this is not the case in Big SQL, because Big SQL fabricates some basic stats from the file information
► Check STATS_TIME and CARD in SYSCAT.TABLES to see if ANALYZE has been run:

db2 "select substr(tabname,1,20), stats_time, card from syscat.tables where tabschema='SIMON'"

1                    STATS_TIME                  CARD
-------------------- --------------------------- ---------------------
NATION               -                           -1
ORDERS               -                           -1
CUSTOMER             -                           -1
LINEITEM             -                           -1

67 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – Sorting
► Sort space – SORTHEAP and SHEAPTHRES_SHR
– Area of memory used for sorting and for the build (inner) table of a hash join
– Defaults:
  • SHEAPTHRES_SHR is 50% of the memory assigned to Big SQL
  • SORTHEAP is 1/20th of SHEAPTHRES_SHR
– Is automatically tuned by STMM
– If too low, the optimizer may prefer joins other than hash join
– Specified in 4k pages:

db2 update db cfg for BIGSQL using sortheap 341333 AUTOMATIC
sheapthres_shr 5120000 AUTOMATIC

– The AUTOMATIC keyword indicates this value can be automatically adjusted by STMM
– Use a database snapshot to get sort-related monitor metrics:

db2 "get snapshot for database on BIGSQL" | grep -i sort
Shared Sort heap high water mark     = 3341028
Post threshold sorts (shared memory) = 0
Total sorts                          = 10526
Total sort time (ms)                 = 65290095
Sort overflows                       = 1927
Active sorts                         = 0

68 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – Sorting
► The prominent join technique is hash join
► Since the hash table is built in SORTHEAP, hash join performance is sensitive to sort memory – therefore tune SORTHEAP & SHEAPTHRES_SHR
► Avoid hash join overflows (bad) and hash join loops (very bad):

> get snapshot for database on bigsql
Number of hash joins                = 12
Number of hash loops                = 2
Number of hash join overflows       = 1
Number of small hash join overflows = 1

► For more information on hash join performance:
– http://www.ibm.com/developerworks/data/library/techarticle/0208zubiri/0208zubiri.html
69 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – Bufferpools
► Bufferpool – IBMDEFAULTBP
– Area of memory used to cache temporary working data for a query
– Not used to cache HDFS data
– Default is 15% of the memory assigned to Big SQL
– Is automatically tuned by STMM
– If too small, the optimizer may start to select sub-optimal plans (with nested-loop joins)
– Specified in 32k pages:

db2 "call syshadoop.big_sql_service_mode('on')"
db2 "alter bufferpool IBMDEFAULTBP size 327680 AUTOMATIC"

70 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – Bufferpools
► The Big SQL bufferpool is not used to cache HDFS data; it is only used to cache temporary working data during the execution of a query
► So why not size the bufferpool small and give more memory to sort space and the filesystem cache?
► The optimizer uses the bufferpool size when planning query execution.
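Because IBMDEFAULTBP is specified in 32 KB pages, the 327680 figure used in the ALTER BUFFERPOOL example corresponds to a 10 GB pool. A sketch of the conversion (the 10 GB target is illustrative, not a recommendation):

```shell
# Convert a bufferpool size in GB to 32 KB pages (IBMDEFAULTBP's page size).
target_gb=10
bp_pages=$(( target_gb * 1024 * 1024 / 32 ))
# Emits the deck's example command with the computed page count (327680).
echo "db2 \"alter bufferpool IBMDEFAULTBP size $bp_pages AUTOMATIC\""
```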
Setting the bufferpool too small may influence the optimizer to choose inefficient plans.

Database Context:
----------------
Parallelism:        Intra-Partition & Inter-Partition Parallelism
CPU Speed:          1.338309e-07
Comm Speed:         100
Buffer Pool size:   327680
Sort Heap size:     341333
Database Heap size: 9086
. . .

► Also, once the bufferpool is full, temporary data will spill to disk (temporary tablespace). This will be much slower…
71 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – SMP parallelism
► SMP parallelism – INTRA_PARALLEL
– Allows Big SQL to utilize multiple processors on SMP machines by parallelizing query execution
– Default degree of parallelism is 4 (DFT_DEGREE)
► Enable/disable using:

db2 update dbm cfg using INTRA_PARALLEL YES|NO

► DFT_DEGREE specifies the level of parallelism, and MAX_QUERYDEGREE specifies the maximum level of parallelism:

db2 update db cfg for bigsql using DFT_DEGREE <value>
db2 update dbm cfg using MAX_QUERYDEGREE ANY

► Rule of thumb: increase parallelism in small increments if CPU is under-utilized on large SMP systems
72 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – Node Resource Percentage
► Node resource percentage – DB2_CPU_BINDING & INSTANCE_MEMORY
– Specifies the percentage of the cluster (CPU & memory) dedicated to Big SQL
– Both are set according to the percentage specified at install time
– If you see only a fraction of the CPUs being utilized when executing a query, it is because Big SQL pins the CPUs according to the DB2_CPU_BINDING %.
– If you want to change the percentage after install:

autoconfigure using mem_percent 75 workload_type complex is_populated no apply db and dbm

73 © 2013 IBM Corporation

BigSQL 3.0 Properties that have the biggest performance impact – Optimization level & CPU speed
► Optimization level – DFT_QUERYOPT
– Defines the default optimization level used by the Big SQL optimizer
– Essentially, how much effort the optimizer should put into finding the optimal access plan
– Default is 5
– Some complex queries/workloads may benefit from increasing this to 7:

db2 -v update db cfg using DFT_QUERYOPT 7

► Processor speed – CPUSPEED
– Tells Big SQL how fast the CPUs on the machine are
– Is automatically calculated at install time based on the clock speed of the processors
– You should not have to adjust this property
74 © 2013 IBM Corporation

Check the data types
► STRING is bad for Big SQL!
– But it is prevalent in the Hadoop and Hive worlds

db2 "describe table SIMON.ORDERS"

Column name      Data type schema  Data type name  Length  Scale  Nulls
---------------  ----------------  --------------  ------  -----  -----
O_ORDERKEY       SYSIBM            BIGINT          8       0      No
O_CUSTKEY        SYSIBM            INTEGER         4       0      No
O_ORDERSTATUS    SYSIBM            VARCHAR         1       0      No
O_TOTALPRICE     SYSIBM            DOUBLE          8       0      No
O_ORDERDATE      SYSIBM            DATE            4       0      No
O_ORDERPRIORITY  SYSIBM            VARCHAR         15      0      No
O_CLERK          SYSIBM            VARCHAR         15      0      No
O_SHIPPRIORITY   SYSIBM            INTEGER         4       0      No
O_COMMENT        SYSIBM            VARCHAR         32672   0      No

9 record(s) selected.

► This performance degradation can be avoided by:
– Changing references from STRING to an explicit VARCHAR(n) that most appropriately fits the data size
– Using the bigsql.string.size property (via SET HADOOP PROPERTY) to lower the default size of the VARCHAR to which the STRING is mapped when creating new tables
75 © 2013 IBM Corporation

Suspicious plans !!!
Look for sections in the plan which:
– are NOT using hash joins as the join type
– have fat BTQs
– are using replicated Hadoop scans on large amounts of data
► Note: these are not always signs of a bad plan, but they are indicators – especially if the data volumes are large
► Warning flags to look out for:
– Nested Loop Joins (NLJNs) (can be v.bad)
– Nested Loop Joins without a TEMP on the inner (can be v.v.v.bad)
– Merge Scan Joins (MSJOIN)
76 © 2013 IBM Corporation

BigSQL 3.0 Performance PD – what docs to gather?
► Collect db2look information for the BIGSQL database:
– db2look -d bigsql -e -m -l -f
– See http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.cmd.doc/doc/r0002051.html?cp=SSEPGG_10.5.0%2F3-5-2-6-80&lang=en
► Collect db2support information for the BIGSQL database (aka catsim):
– db2support <output_directory> -d <database_name> -cl 0
– See http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.trb.doc/doc/t0020808.html?lang=en
► Collect a formatted explain of the query using db2exfmt:
– Try to collect the explain with section actuals (which show the actual number of rows processed at each stage)
– http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0005134.html?lang=en
77 © 2013 IBM Corporation

BigSQL 3.0 Performance PD – what docs to gather?
► Collect db2pd information for the query:
– db2pd -dfstable
– See the "BigSQL Monitor APIs" session of this T3
► Collect dfs Reader logs:
– See the "Best Practices" section of this presentation for more details
► Collect general configuration information:
– Database Manager Configuration:
  db2 "attach to bigsql"
  db2 "get dbm cfg show detail"
  db2 "detach"
– Database Configuration:
  db2 "connect to bigsql"
  db2 "get db cfg for BIGSQL show detail"
– Big SQL registry variables:
  db2set
78 © 2013 IBM Corporation

BIGSQL V3.0 *INTERNAL* BENCHMARKS
79 © 2013 IBM Corporation

Performance, Benchmarking, Benchmarketing
► Performance matters to customers
► Benchmarking appeals to engineers to drive product innovation
► Benchmarketing is used to convey performance in a memorable and appealing way
► SQL over Hadoop is in the "Wild West" of benchmarketing – 100x claims! Compared to what? Conforming to what rules?
► The TPC (Transaction Processing Performance Council) is the granddaddy of all multi-vendor SQL-oriented organizations
– Formed in August 1988
– TPC-H and TPC-DS are the most relevant to SQL over Hadoop
  • The R/W nature of the workload is not suitable for HDFS
► Big Data Benchmarking Community (BDBC) formed
80 © 2013 IBM Corporation

Power and Performance of Standard SQL
► Everyone loves performance numbers, but that's not the whole story
– How much work do you have to do to achieve those numbers?
A portion of our internal performance numbers are based upon read-only versions of TPC benchmarks.
► Big SQL is the only SQL-over-Hadoop vendor capable of executing:
– All 22 TPC-H queries without modification
– All 99 TPC-DS queries without modification

Original query (TPC-H Q21):

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT * FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT * FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait DESC, s_name

[Re-written for Hive: a multi-level rewrite that replaces the EXISTS/NOT EXISTS subqueries with derived tables computing count(distinct l_suppkey) and max(l_suppkey) per l_orderkey over lineitem, joined and right-outer-joined back to the nation/supplier/lineitem/orders join, with the existence tests re-expressed as filters on count_suppkey and max_suppkey]

81 © 2013 IBM Corporation

Benchmark Hardware
► 2 fully populated racks, 20 nodes per rack
► 1 management node containing all management functions
► 39 data nodes
► Hardware spec (per node):
– RHEL 6.4
– IBM x3650 M4: Intel e5-2680 v2 @ 2.8GHz, 40 cores w/HT
– 64GB RAM
– 9 x 2TB HDD
– NIC: 10GBit/sec
82 © 2013 IBM Corporation

"20TB Modern BI Workload" – aka TPC-DS 20TB single user
► Big SQL 3.0 is 14x faster than Hive 0.12
* See footnote for disclaimer
83 © 2013 IBM Corporation

"20TB Modern BI Workload" – aka TPC-DS 20TB throughput run
► (TBD)
* See footnote for disclaimer
84 © 2013 IBM Corporation

"1TB Modern BI Workload" – aka TPC-DS 1TB single user
► Big SQL 3.0 is 25x faster than Hive 0.12
* See footnote for disclaimer
85 © 2013 IBM Corporation

"1TB Modern BI Workload" – aka TPC-DS 1TB throughput run
► Big SQL 3.0 is 18x faster than Hive 0.12
* See footnote for disclaimer
86 © 2013 IBM Corporation

Customer workload "Catalina"
► Cluster benchmarking workload used by a BigInsights customer
► 7 tables
– Three large "orders" tables (largest approx.
8 TB uncompressed text)
– Four small dimension tables: "date", "time", "location", "product"
► 57 queries in 5 buckets
– Some queries from the customer; additional queries added by IBM
87 © 2013 IBM Corporation

Customer workload "Catalina" (cont)
► Compared Big SQL vs Hive 0.12 using this workload on a 10-node xLinux cluster (1 master node, 9 compute nodes)
► Tested both single-stream ("power") and 4 concurrent streams ("throughput")
► Big SQL 3.0 is 5.5x faster than Hive 0.12 (power)
► Big SQL 3.0 delivers 3.6x the throughput of Hive 0.12
88 © 2013 IBM Corporation

BIGINSIGHTS V3.0 PERFORMANCE
89 © 2013 IBM Corporation

BigInsights v3.0 MapReduce performance & number of slots
► BigInsights v3.0 (Apache MR) will generally configure fewer map and reduce slots on the compute nodes, compared to earlier releases
– BI 3.0 accounts for the number of disks
– BI 3.0 accounts for resources allocated to Big SQL
► PSMR configures based on the number of cores and is aware of resources allocated to Big SQL
► Recommendation: post-install or post-upgrade, always check the actual number of slots configured on compute nodes
90 © 2013 IBM Corporation

BIGSQL 3.0 ON POWER
91 © 2013 IBM Corporation

Transitioning to POWER
► Easy – Linux is Linux is Linux…
– Linux on Power aims to be as near equivalent as possible to Linux on Intel
– RHEL distro supported on Power; no SLES for Big SQL
  • Need at least RHEL 6.5 to have all tools working on Power8
– Install BigInsights (Big SQL) as you would on any Intel server
► Superior hardware performance
– New Power8 lineup, specifically designed for running Linux
  • IBM Power System S822L: 2x 12-core sockets, 3.02GHz, up to 1TB memory, 12 small form factor (SFF) HDD/SSD bays
– Power8 has 8 hardware SMT threads per core (compared to 2 SMT threads on Intel)
– GOTCHYA: the RHEL 6.5 kernel can only see 4 hardware threads per core.
RHEL 7.0 can see all 8 hardware threads per core, though Big SQL is not yet supported on RHEL 7.0
► Tooling
– The same performance, monitoring and diagnostic tooling is available (nmon, oprofile, operf, perf, etc)
– IBM Advance Toolchain for PowerLinux: not required, but may be useful
92 © 2013 IBM Corporation

Gotchyas on Power
► Most performance tunables are the same as on Intel, except:
– More SMT threads on Power can support a higher degree of parallelism (i.e. MAX_QUERYDEGREE 8)
– CPUSPEED: usually untouched, but the default seems to yield suboptimal plans (the value is too high). Experimentally seen benefits with:
  • db2 update dbm cfg using cpuspeed 1.380000e-07
► Hardware topology differences
– Competitors can support 12 large form factor (3.5") drives internally vs 12 small form factor (2.5") on Power. LFF come in 4TB vs 1.2TB on SFF, so storage density is compromised.
– Reference architectures on Power show EXTERNAL NAS storage using DCS3700 as an example: easier to add storage capacity without adding servers (reasonable, since CPUs are rarely fully utilized in Big SQL single-user runs). Having NAS for Big SQL means tuning, in $BIGSQL_HOME/conf/bigsql-conf.xml:
  • dfsio.num_threads_per_disk
  • dfsio.num_cores
  • dfsio.num_disks
  • dfsio.num_scanner_threads
► Worthwhile tooling
– lpcpu (Linux performance customer profiler utility) – post-process on Intel
– ppc64_cpu – CPU characteristics (compare to /proc/partitions)
93 © 2013 IBM Corporation

Power8 Hardware Configuration
► 1x Power8 S824 – management node
– 2s x 12-core @ 3.3GHz, 256GB RAM
– Linux RHEL 6.5
► Mellanox Infiniband switch (IPoIB)
► 8x Power8 S822L – data nodes
– 2s x 12-core @ 3.3GHz, 256GB RAM
– 2x LPARs each (1s x 12c, 128GB RAM)
– Linux RHEL 6.5
– 8x data servers
► 8Gb Fibre Channel switch
► 4x DCS3700 storage controllers
– 60x 1TB HDD, 15 RAID0 LUNs per LPAR
94 © 2013 IBM Corporation

10TB BI Workload (TPC-DS like): Single-user
Power8 Big SQL v3.0 vs. HP Intel Ivy Bridge Hive 0.12
64X faster!
95 © 2013 IBM Corporation
► Big SQL v3.0 is 4.2X faster than Hive 0.12

10TB BI Workload (TPC-DS like): 7 Users
Power8 Big SQL v3.0 vs. HP Intel Ivy Bridge Hive 0.12
► Big SQL v3.0 is 8.9X faster than Hive 0.12
► Complete the multi-user queries in 1 work day, compared to 4 full 24-hour days…
96 © 2013 IBM Corporation

10TB BI Workload (TPC-DS like): 7 Users
Power8 Big SQL v3.0 vs. HP Intel Ivy Bridge Hive 0.12 – uses an internal, non-productized knob
► Big SQL v3.0 is 11X faster than Hive 0.12
► Complete the multi-user queries in 1 work day, compared to 4 full 24-hour days…
97 © 2013 IBM Corporation

BigSQL 3.0 Performance Summary
► Big SQL 3.0 is fast… for a SQL-over-Hadoop solution
– We like to think it is the fastest SQL-over-Hadoop solution on the market today
  • And our testing thus far backs this up
– Is it as fast as an RDBMS? No.
  • And we should not make those claims/comparisons…
  • But it has a lower TCO proposition…
► Because Big SQL 3.0 is DB2 glued on top of HDFS, it inherits the rich features/functions developed for DB2 over 20+ years
– This gives Big SQL a huge boost and aligns it more with an enterprise-capable RDBMS than with a start-up SQL-over-Hadoop solution
98 © 2013 IBM Corporation

Thank you!
99 © 2013 IBM Corporation

BIGSQL V3.0 PERFORMANCE BACKUP CHARTS
100 © 2013 IBM Corporation

BigSQL 3.0 Optimizer – Sizing the Bufferpool
► The Big SQL bufferpool is not used to cache HDFS data; it is only used to cache temporary working data during the execution of a query
► So why not size the bufferpool small and give more memory to sort space and the filesystem cache?
► The optimizer uses the bufferpool size when planning query execution. Setting the bufferpool too small may influence the optimizer to choose inefficient plans.

Database Context:
----------------
Parallelism:        Intra-Partition & Inter-Partition Parallelism
CPU Speed:          1.338309e-07
Comm Speed:         100
Buffer Pool size:   327680
Sort Heap size:     341333
Database Heap size: 9086
. . .

► Also, once the bufferpool is full, temporary data will spill to disk (temporary tablespace). This will be much slower…
101 © 2013 IBM Corporation
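As a companion to the sizing discussion above: SORTHEAP and SHEAPTHRES_SHR are expressed in 4 KB pages, so sort memory can be sized with the same page arithmetic used for the bufferpool. A sketch, where the 20 GB shared sort target and the 1/20th rule of thumb are illustrative assumptions, not recommendations:

```shell
# Convert a shared sort memory target in GB to 4 KB pages (SHEAPTHRES_SHR's unit).
target_gb=20
sheapthres_shr_pages=$(( target_gb * 1024 * 1024 / 4 ))
# Deck's rule of thumb: SORTHEAP defaults to 1/20th of SHEAPTHRES_SHR.
sortheap_pages=$(( sheapthres_shr_pages / 20 ))
# Emits the tuning command in the same form the deck uses.
echo "db2 update db cfg for BIGSQL using sortheap $sortheap_pages AUTOMATIC sheapthres_shr $sheapthres_shr_pages AUTOMATIC"
```

The AUTOMATIC keyword keeps both values under STMM control, so these computed numbers act as starting points rather than hard limits.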