Some More Database Performance Knobs
North American PUG Challenge
Richard Banville, Software Fellow, OpenEdge Development
© 2012 Progress Software Corporation. All rights reserved.

Agenda
1 LRU (again)
2 Networking: Message Capacity
3 Networking: Resource Usage
4 Index Rebuild
5 Summary

LRU (again)
[Diagram: LRU chain, least recent to most recent: RM Block T1, IX Block I1, RM Block T3, IX Block I3]
Replacement policy of database buffer pool
• Maintains working set of data buffers
• Just a linked list, a shared data structure
• Changes made orderly by LRU latch
Replace buffer at LRU end with newly read block from disk

LRU (again)
Pros: proficient block usage predictor
• Maintains high buffer pool hit ratio
Cons: housekeeping costs
• Single-threads access to buffer pool (even if for an instant)
• High activity, relatively high nap rate
Managing LRU:
• Private read only buffers: -Bp -BpMax (not w/-lruskips until 10.2b07)
• Alternate buffer pool: -B2
• New: -lruskips -lru2skips

LRU (again)
[Diagram: LRU chain, least recent to most recent: RM Block T1, IX Block I1, RM Block T3, IX Block I3]
Find first T1.
[Diagram: successive states of the LRU chain while the statement runs; ending order, least to most recent: RM Block T3, IX Block I3, IX Block I1, RM Block T1]

LRU (again)
[Diagram: LRU chain, least recent to most recent: RM Block T3, IX Block I3, IX Block I1, RM Block T1]
Find first T1. (again)
[Diagram: successive chain states: RM Block T3, IX Block I3, RM Block T1, IX Block I1; then RM Block T3, IX Block I3, IX Block I1, RM Block T1]
What about …
For each T1: end.
For each w/many tables.
For each w/many tables, many users.

Location, location, location
[Diagram: LRU chain from Least Recent to Most Recent]
With -B 1,000,000
• What does it take to evict from the buffer pool?
• What does it take to go from MRU to LRU?
Do we need MRU on EACH access then?
• I think not.

Improving Concurrency
-lruskips <n>
• LRU and LFU combined
• Small numbers make a BIG difference
• Monitor OS read I/Os and LRU latch contention
• Adjust online via the _Startup._Startup-LRU-Skips VST field
• Adjust online via promon
  – R&D -> 4. Administrative Functions ... -> 4. Adjust LRU force skips

Performance: 10.2b06 & -lruskips
[Chart: Readprobe Data Access Results. Records Read vs. # Users for lruskips 0, 10, 100 and 1000; annotated difference ~39%]

Performance: 10.2b06 & -lruskips (250 users)
Readprobe Latch Waits (per sec*)
Latch   lruskips 0   lruskips 10   lruskips 100   lruskips 1000
BHT            184           730            887             859
BF1            279         1,066          1,167           1,178
BF2              6            16             16              13
BF3             66           174            250             164
BF4              9            13             12              13
LRU          1,655           148              7               0
Note change in LRU latch waits vs buffer latch waits

Performance: 10.2b06 & -lruskips (250 users, big db)
Readprobe Latch Waits (per sec*)
Latch   lruskips 0   lruskips 10   lruskips 100   lruskips 1000
BHT             14           173            217             322
BF1              0             1              1               2
BF2              1             2              1               2
BF3              1             1              1               2
BF4              3             2              1               1
LRU          1,949         1,035             19               0
Note focus now is on LRU and BHT (not buf)
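
A toy model can make the -lruskips trade-off above more concrete. The sketch below is not OpenEdge's implementation: it assumes, purely for illustration, that each buffer counts its accesses and is relinked at the most recent end only once more than `skips` accesses have accumulated. `SkipLRU`, `run` and `chain_updates` are hypothetical names, and `chain_updates` stands in for the number of times the LRU latch would be needed.

```python
from collections import OrderedDict
import random


class SkipLRU:
    """Toy buffer pool (hypothetical model, not OpenEdge code)."""

    def __init__(self, capacity, skips=0):
        self.capacity = capacity
        self.skips = skips
        # block id -> accesses since the last relink; first item is the LRU end
        self.chain = OrderedDict()
        self.chain_updates = 0  # each update would require the LRU latch
        self.hits = 0
        self.misses = 0

    def access(self, block):
        if block in self.chain:
            self.hits += 1
            pending = self.chain[block] + 1
            if pending > self.skips:
                self.chain.move_to_end(block)  # relink at the MRU end
                self.chain[block] = 0
                self.chain_updates += 1
            else:
                self.chain[block] = pending  # skip the relink, no latch needed
            return
        self.misses += 1
        if len(self.chain) >= self.capacity:
            self.chain.popitem(last=False)  # evict from the LRU end
        self.chain[block] = 0
        self.chain_updates += 1


def run(skips, accesses=20000, capacity=1000, working_set=800, seed=1):
    """Drive the pool with uniformly random block reads (an assumed workload)."""
    rng = random.Random(seed)
    pool = SkipLRU(capacity, skips)
    for _ in range(accesses):
        pool.access(rng.randrange(working_set))
    return pool


if __name__ == "__main__":
    for n in (0, 10, 100):
        pool = run(n)
        print(f"skips={n}: chain updates={pool.chain_updates}, "
              f"hits={pool.hits}, misses={pool.misses}")
```

With `skips=0` every access updates the chain, so the update count equals the access count; larger values trade some ordering accuracy for fewer updates. In a two-buffer pool, the access sequence a, b, a, c evicts b when `skips=0` but evicts a when `skips=1`, because the second read of a no longer moves it to the most recent end.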
Performance: 10.2b06 & -lruskips (big db)
[Chart: Readprobe Data Access Results. Records Read vs. # Users for lruskips 0, 10, 100 and 1000; annotated differences ~15%, ~52% and ~44%]

Conclusions
-lruskips can eliminate the LRU bottleneck
LRU isn’t the last bottleneck
Overall improvement relative to other contention
• Data access limited by buffer level contention
• Table scans over small tables have more buffer contention than large tables
  – Application changes can improve performance too!

Agenda
1 LRU (again)
2 Networking: Message Capacity
3 Networking: Resource Usage
4 Index Rebuild
5 Summary

Networking Control
Philosophy: Throughput by keeping server busy without remote client waits!
Process based control
• -Ma, -Mn, -Mi
  – Controls the order users are assigned to servers
• -PendConnTime
Resource based control
• -Mm <n>
  – Maximum size of network message
  – Client & server startup
New tuning knobs: resource based control
• Alleviate excessive system CPU usage by network layer
• Control record data stuffed in a network message
  – Applicable for “prefetch” queries

Networking: Prefetch Query
No-lock query with guaranteed forward motion or scrolling
• Multiple records stuffed into single network message
• Browsed static and preselected queries scrolling by default

  FOR EACH customer NO-LOCK:
    ….
  end.

  DO PRESELECT EACH customer NO-LOCK:
    ….
  end.

  define query cust-q for customer SCROLLING.
  open query cust-q FOR EACH customer NO-LOCK.
  repeat:
    get next cust-q.
  end.

Server Network Message Processing Loop
Start: Outstanding prefetch request?
• No: check for request with a 2 second wait, Poll(2)
  – Got new request? Yes: process waiting request
  – Got new request? No: process server events, then back to Start
• Yes: check for request with no wait, Poll(0)
  – Got new request? Yes: process waiting request
  – Got new request? No: add a record to the network msg for the outstanding (prefetch) request

What’s new: -NmsgWait (annotation on the 2 second wait, Poll(2))

Server Network Message Processing Loop
Start: Outstanding prefetch request?
• No: check for request with a 2 second wait, Poll(2)
• Yes: check for request with no wait, Poll(0)
Poll() is system CPU intensive
• 10 milliseconds to poll(0)!
• 10 microseconds to copy 1 record

Server Network Message Processing Loop
What’s new: -prefetchPriority (annotation on the no-wait check, Poll(0))
Potential side effects

Process Waiting Network Message
Process waiting request:
• Add record to message
• Prefetch request? No: send network message
• Prefetch request? Yes: 1st record request?
  – Yes: send network message
  – No: threshold met?
• Goto start
Highlighted path: non-prefetch query request
Highlighted path: 1st record of a prefetch query request
Threshold met?
• Yes: send network message
• No: remote client continues to wait
• Goto start

Process Waiting Network Message
Secondary records of a prefetch query request:
• Threshold not met
• Default threshold is 16 records
• Remote client continues to wait

Process Waiting Network Message
Secondary records of a prefetch query request:
• Client waiting
• Threshold met
• Send message

Process Waiting Network Message
What’s new: Increase network message fill rate
• Improve TCP throughput
• Improve overall server performance
Defaults have not changed
Provides control for you
Every deployment is different

Process Waiting Network Message
• Disregard 1st record request check: -prefetchDelay
• Threshold control, # recs vs % full: -prefetchNumRecs, -prefetchFactor
Potential side effects:
  – NOTE: Improved TCP/system performance
    -Mm size determines max: -Mm 4096 / 16 rec = 256 bytes
  – Choppy behavior on remote client?

Altering Network Message Behavior
Promon Support (_Startup VST too!)
• Alter online
  – R&D …
  – 4. Administrative Functions …
  – 7. Server Options …

Server Options:
  1. Server network message wait time:       2 seconds
  2. Delay first prefetch message:           Enabled
  3. Prefetch message fill percentage:       90 %
  4. Minimum records in prefetch message:    1000
  5. Suspension queue poll priority:         0
  7. Terminate a server

Performance: 10.2b06 & Networking changes
[Chart: Readprobe Data Access Results. Records Read vs. # Users for baseline, NumRecs, priority, NumRecs_priority and Num_Recs_Priority_lruskips; annotated differences ~212% and ~32%]

Agenda
1 LRU (again)
2 Networking: Message Capacity
3 Networking: Resource Usage
4 Index Rebuild
5 Summary

Assumptions for best performance
Index data is segregated from table data
• Indexes & tables are in different storage areas
You have enough disk space for sorting
You understand the impact of CPU and memory consumption
Process allowed to use available system resources

Index Rebuild Parameters - Overview
-TB               sort block size (8K to 64K, note new limit)
-datascanthreads  # threads for data scan phase
-TMB              merge block size (default -TB)
-TF               merge pool fraction of system memory (in %)
-mergethreads     # threads per concurrent sort group merging
-threadnum        # concurrent sort group merging
-TM               # merge buffers to merge each merge pass
-rusage           report system usage statistics
-silent           a bit quieter than before
Phases of Index Rebuild (“non-recoverable”)
Index Scan
• Scan index data area start to finish
• I/O bound with little CPU activity
• Eliminated with area truncate
Data Scan/Key Build
• Scan table data area start to finish (area at a time)
• Read records, build keys, insert to temp sort buffer
• Sort full temp file buffer blocks (write if > -TF)
• I/O bound with CPU activity
Sort-Merge
• Sort-merge -TF and/or temp sort file
• CPU bound with I/O activity
• I/O eliminated if -TF large enough
Index Key Insertion
• Read -TF or temp sort file
• Insert keys into index
• Formats new clusters; may raise HWM
• I/O bound with little CPU activity

Phases of Index Rebuild: Index Scan
  Area 9: Index scan (Type II) complete.
• Index area is scanned start to finish (single threaded)
• Block at a time with cluster hops
• Index blocks are put on free chain for the index
• Index object is not deleted (to fix corrupt cluster or block chains)
• Order of operation:
  – Blocks are read from disk
  – Blocks are re-formatted in memory
  – Blocks are written to disk as -B is exhausted
• Causes I/O in other phases for block re-format
• Can be eliminated with manual area truncate where possible

Phases of Index Rebuild: Data Scan/Key Build
  Processing area 8 : (11463)
  Start 4 threads for the area. (14536)
  Area 8: Multi-threaded record scan (Type II) complete.
• Table data area is scanned start to finish (multi-threaded if -datascanthreads)
• Each thread processes next block in area (with cluster hops)
• Database re-opened by each thread in R/O mode
• Ensure file handle ulimits set high enough

Data Scan/Key Build
a) Thread reads next data block in data area
b) Extract next record from data block and build index key (sort order)
c) Insert key into sort block (-TB 8K thru 64K)
d) Sort/merge full sort block into merge block (-TMB -TB thru 64K)
e) Write merge block to -TF, overflow to temp (-TMB sized I/O)
[Diagram: DB, RM Block, Record, Key, Sort Blocks, Merge Block, -TF, .srt1, .srt2]

Sort Groups: -SG 3 (note 8 is minimum)
Each index assigned a particular sort group (hashed index #)
[Diagram: a Record feeds Index 1 through Index 4, which are spread over sort groups SG 1, SG 2 and SG 3, each writing its own sort file (.srt1, .srt2, .srt3)]
Each group has its own sort file
Sort file location
• 1. Sort files in same directory (I/O contention)
• 4. Sort files in different location
Ensure enough space
  1) -T /usr1/richb/temp/
  2) <dbname>.srt
       0 /usr1/richb/temp/
  3) <dbname>.srt
       10240 /usr1/richb/temp/
       0 /usr1/richb/temp/
  4) <dbname>.srt
       0 /usr1/richb/temp/
       0 /usr2/richb/temp/
       0 /usr3/richb/temp/

Phases of Index Rebuild: Sort-Merge
  Sorting index group 3
  Spawning 4 threads for merging of group 3.
  Sorting index group 3 complete.
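
Steps b) through e) of the data scan and the sort group assignment above can be sketched in a few lines. This is a loose in-memory illustration, not the utility's actual code: `sort_group` uses a plain modulo because the slides only say "hashed index #", and `build_runs` counts keys where the real sort block is sized in bytes by -TB. Both function names are hypothetical.

```python
def sort_group(index_number, num_groups):
    """Pick a sort group for an index.

    Assumption: a simple modulo stands in for the unspecified hash.
    """
    return (index_number - 1) % num_groups + 1


def build_runs(keys, sort_block_capacity):
    """Fill a sort block key by key; when it is full, sort it and emit it
    as one sorted run (a stand-in for a merge block written to -TF or .srt)."""
    runs, block = [], []
    for key in keys:
        block.append(key)
        if len(block) == sort_block_capacity:
            runs.append(sorted(block))
            block = []
    if block:  # flush the partially filled last block
        runs.append(sorted(block))
    return runs


if __name__ == "__main__":
    for idx in (1, 2, 3, 4):
        print(f"index {idx} -> sort group {sort_group(idx, 3)}")
    print(build_runs([5, 3, 8, 1, 9, 2, 7], sort_block_capacity=3))
```

A larger sort block produces fewer, longer runs from the same keys, which is the intuition behind raising -TB and -TMB.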
Sort-Merge Phase
Sort blocks in each sort group have been sorted and merged into a linked list of individual merge blocks stored in -TF and temp files.
These merge blocks are further merged -TM at a time to form new, larger “runs” of sorted merge blocks.
-TM of these new “runs” are then merged to form even larger “runs” of sorted merge blocks.
When there is only one very large “run” left, all the key entries in the sort group are in sorted order.
Sorted!

-threadnum vs -mergethreads
-threadnum 2
[Diagram: three sort groups (-TF, .srt1 to .srt3); Thread 1: merge phase group 1 (.srt1); Thread 2: merge phase group 2 (.srt2); .srt3 not yet started]

-threadnum vs -mergethreads
-threadnum 2
B-tree insertion occurs as soon as a sort group’s merge is completed.
Thread 0 begins b-tree insertion concurrently.
[Diagram: Thread 0 on .srt1; Thread 2: merge phase group 2 (.srt2); Thread 1: merge phase group 3 (.srt3)]

-threadnum vs -mergethreads
-threadnum 2 -mergethreads 3
Merge threads merge successive “runs” of merge blocks concurrently.
[Diagram: Thread 1 with Threads 3, 4 and 5: merge phase group 1 (.srt1); Thread 2 with Threads 6, 7 and 8: merge phase group 2 (.srt2); .srt3 not yet started]
Note: 8 actively running threads

-threadnum vs -mergethreads
-threadnum 2 -mergethreads 3
[Diagram: Thread 2 with Threads 6, 7 and 8: merge phase group 2 (.srt2); Thread 1 with Threads 3, 4 and 5: merge phase group 3 (.srt3)]

-threadnum vs -mergethreads
-threadnum 2 -mergethreads 3
B-tree insertion occurs as soon as a sort group’s merge is completed.
Thread 0 begins b-tree insertion concurrently.
[Diagram: Thread 0 on .srt1; Thread 2 with Threads 6, 7 and 8: merge phase group 2 (.srt2); Thread 1 with Threads 3, 4 and 5: merge phase group 3 (.srt3)]
Note: 9 actively running threads

Phases of Index Rebuild: Index Key Insertion
• Read -TF or temp sort file
• Insert keys into index
• Formats new clusters; may raise HWM
• I/O bound with little CPU activity

Index Key Insertion Phase
  Building index 11 (cust-num) of group 3
  …
  Building of indexes in group 3 completed.
  Multi-threaded index sorting and building complete.
[Diagram: Index B-tree with Root and three Leaf blocks; "Write leaf when full" to DB]
Key entries from sorted merge blocks are inserted into b-tree
Performed sequentially, entry at a time, index at a time
Leaf level insertion optimization (avoids b-tree scan)
Leaf level written to disk as soon as full (since never revisited)

  2085 Indexes were rebuilt. (11465)
  Index rebuild complete. 0 error(s) encountered.

Index Rebuild - Tuning
Truncate index only area if possible
.srt file
Parameters
• -mergethreads: 2 or 4 and -threadnum 2 or 1
• -datascanthreads: 1.5 * # CPUs
• -B 1024
• -TF 80 (monitor physical memory paging)
• -TMB 64
• -TB 64
• -TM 32
• -T: separate disk, RAM disk if not using -TF (no change)
• -rusage & -silent
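
The -TM setting in the tuning list ties back to the sort-merge phase, where runs are merged -TM at a time until one run remains. The following sketch counts merge passes for a given fan-in under simplifying assumptions: runs are plain in-memory lists, `merge_passes` and `tm` are hypothetical names, and nothing here reflects how the utility actually lays out merge blocks.

```python
import heapq


def merge_passes(runs, tm):
    """Merge sorted runs `tm` at a time until a single run remains.

    Returns the final sorted run and the number of passes taken.
    """
    if tm < 2:
        raise ValueError("tm must be at least 2")
    passes = 0
    while len(runs) > 1:
        # One pass: every group of up to `tm` runs becomes one larger run.
        runs = [list(heapq.merge(*runs[i:i + tm]))
                for i in range(0, len(runs), tm)]
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    initial = [[n] for n in range(1000, 0, -1)]  # 1000 single-key runs
    for tm in (8, 32):
        merged, passes = merge_passes(initial, tm)
        print(f"tm={tm}: {passes} passes, sorted={merged == sorted(merged)}")
```

Starting from 1000 runs, a fan-in of 8 needs four passes (1000, 125, 16, 2, 1) while a fan-in of 32 needs two (1000, 32, 1). Each pass rereads every key, so fewer passes mean less work in this model.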
Performance Numbers
[Chart: Index Rebuild Elapsed Time, cost of each phase (in secs), for 10.2b06 best, 10.2b06 no truncate, 10.2b06 w/-TF 50 and 10.2b06 baseline]

Agenda
1 LRU (again)
2 Networking: Message Capacity
3 Networking: Resource Usage
4 Index Rebuild
5 Summary

Summary
LRU
• Potential for a big win
• Always room for improvement
  – Us and you!
Networking
• You now have more control
• With power comes responsibility
Index Rebuild
• Big improvements if
  – Your database is set up properly
  – You provide system resources to index rebuild
• Hopefully you’ll never need it

Questions?