Techniques for Analyzing Performance Problems Session: What are the key indicators of problems and poor performance that every database administrator should be tracking? Where do you find this information, and how can you distinguish the urgent from the mundane? We will learn how to approach common monitoring problems through live demonstrations and explain how to get started on the road to proactively managing your database and application. Join this session to find out the answers to all these questions and more! Techniques For Analyzing Performance Problems It’s The Right Thing To Do. Tom Bascom, White Star Software tom@wss.com A Few Words about the Speaker • Tom Bascom; Progress user & roaming DBA since 1987 • President, DBAppraise, LLC – Remote database management service for OpenEdge. – Simplifying the job of managing and monitoring the world’s best business applications. – tom@dbappraise.com • VP, White Star Software, LLC – Expert consulting services related to all aspects of Progress and OpenEdge. – tom@wss.com 3 Get the Basic Facts • What end-user owns the problem? Who confirms that we have fixed it? • Is the problem related to a specific screen, program or business process? • Is the problem repeatable? On demand? Is there a pattern (i.e. only in the afternoon or whenever X is also running…) • Is there a specific target metric that we need to satisfy? • Do we have before and after run times or other end-user metrics related to the identified problem? • Did the problem start on a particular date? Or after an identified change? Establish a Timeline • When did the problem begin? – Verify this with independent data! (Logfiles, timestamps on output etc.) • What changes are known to have occurred around this time? – Software releases – “The System”, OpenEdge, Apache, Java, OS etc. – Parameter changes (DB, 4gl client, OS, SAN, network etc.) • Has the Workload changed? – Application workload characteristics (business volume, number of users, transaction volume, CRUD statistics, etc.) – Environment workload changes: OS, disk, network… – External workload changes (Shared SAN, Network, Other VMs). When did the problem begin? • It isn’t always the same as when the user first noticed the problem. • Getting the wrong begin date may lead to identifying the wrong root cause. • Independent verification is critical – log files, timestamps of output files and careful interviews of end-users are all tools that can help establish a reliable begin date. What Changes Occurred? • A complete and reliable timeline of all changes is critical. • It is important to look “outside the box” – changes sometimes have unexpected side effects. • Make sure that changes to any and all systems involved in the business process being analyzed are tracked. Sample Timeline Normal Performance Monthly Release OE 11.2 Upgrade SAN Upgrade Record Reads Increase 300%, Latch Timeouts Spike Users Report Bad Performance Application Workload • Over time more users create more data and engage in more activity. If parts of the application are inappropriately sensitive to growth there may be problems. • DB activity should scale linearly (or less) with business volume. • Number of users, record creates, reads, updates and deletes, number of transactions etc. should all be related to business activity. • Reporting and data aggregation will take longer with more data – but the additional time should be linear and predictable. • Any metrics which are growing exponentially faster than business growth are indicators of an underlying scalability problem. Environment Workload • As the application becomes busier so should the environment that it is hosted in. • Just as with the application workload the environment workload should scale linearly with business activity. • CPU utilization, network traffic, memory utilization and disk space should all grow in step with business growth. • Disk IO operations should grow more slowly than business growth (IO ops are buffered resources that should follow an inverse square growth path). • Any metrics which are growing exponentially faster than business growth are indicators of an underlying scalability problem. External Workload • Changes outside the immediate environment can impact performance. • Network load from other applications my reduce the available bandwidth and increase latency. • Disk IO from other applications may impact SAN performance. • Demand for CPU from other applications on a shared VM may impact the scheduling and availability of shared CPU. • If the application and environment workloads have not changed but we are seeing increased latency or deeper queuing it is likely that the external workload has increased. TOOLS Tools to Narrow the Scope • Kill USR1 -- 4GL Stack Trace – Generates a “protrace” file from a running process. – Does not interfere with the running process. – Requires that you be the process owner or “root”. • Progress Profiler – Gathers detailed line by line execution timing. – Can be enabled via startup parameters or by manipulating PROFILER handle. • Client Logging – Logs specified 4GL session details to a file. – Can be very verbose. • Client Statement Cache – Shows the 4GL data access statement and stack trace – Or the SQL statement About Line Numbers • Most tools use DEBUG LIST line numbers. • This listing expands all include files and preprocessor statements. • COMPILE “program.p” DEBUG-LIST “program.dbg”. Obtaining a USR# /home/tom> ps -fu UID PID f474458 18612454 f474458 14171658 f474458 34488084 f474458 37176446 f474458 39666876 f474458 42812578 f474458 46286012 f474458 47989918 f474458 PPID C STIME TTY 35669114 0 09:30:43 46286012 0 12:58:50 pts/20 44385460 0 09:34:30 39666876 0 Dec 04 pts/12 17448590 0 Dec 04 47989918 120 12:59:02 pts/11 34488084 0 09:34:31 pts/20 18612454 0 09:30:43 pts/11 TIME 0:00 0:00 0:00 0:00 0:00 0:00 0:00 0:00 CMD boks_sshd: f474458@pts/11 /progress/dlc/bin/_progres -p mls.p boks_sshd: f474458@pts/20 bash boks_sshd: f474458@pts/12 ps -fu f474458 bash bash /home/tom> grep f474458 /pdb_prod/xlos.lg | grep 14171658 [2012/12/06@12:58:50] P-14171658 T-1 I ABL 124: (452) Login by f474458 on /dev/pts/20. [2012/12/06@12:58:52] P-14171658 T-1 I ABL 124: (708) Userid is now f474458. KILL USR1 4GL Stack Trace From The Command Line Generate a 4GL Stack Trace • Great for batch processes and “hung” users. • Use “kill -USR1” to generate a stack trace and establish what code is actually running. $ ps –fu f474458 UID PID PPID C STIME TTY f474458 33570788 37176446 62 11:21:57 pts/12 f474458 37176446 39666876 0 08:44:23 pts/12 f474458 39666876 17448590 0 08:44:23 f474458 42288226 45303030 0 11:21:30 pts/48 f474458 45302816 42288226 0 11:21:37 pts/48 $ kill –USR1 45302816 $ ls –l /los_prod/protrace.45302816 $ -rw-rw-r-- 1 f474458 los_prd 2194 Dec 4 TIME 0:00 0:00 0:00 0:00 0:00 CMD ps -fu f474458 bash boks_sshd: f474458@pts/12 bash /progress/dlc/bin/_progres -p mls.p 11:28 /los_prod/protrace.45302816 • The “protrace” output will be in the process’ current directory. Sample 4GL Stack Trace $ cat /los_prod/protrace.45302816 PROGRESS stack trace as of Tue Dec 4 11:28:00 2012 Command line arguments are /progress/dlc/bin/_progres -p mls.p -pf /los_prod/develop.pf -T /dbtmp Startup parameters: -pf /progress/dlc/startup.pf,-T /dbtmp,-clientlog /los_logs/debug/client_f474458_20121204_112720_45302816.log ... (snip) +++PARALLEL TOOLS CONSORTIUM LIGHTWEIGHT COREFILE FORMAT version 1.0 ... (snip) ** 4GL Stack Trace ** --> obj/mnu/menu (/los_prod/obj/mnu/menu.r) at line 416 obj/src/startup (/los_prod/obj/src/startup.r) at line 2042 mls.p (/los_prod/mls.r) at line 26 ** Persistent procedures/Classes ** Handle 001010 001009 000000 File Name /los_prod/obj/tools/proclib.r /los_prod/obj/tools/quick-supers.r /los_prod/config.r (STATIC) CODE PROFILER Line By Line Execution Time Profiler – The Hard Way • -profile (startup option) – Non-intrusive – Non-selective • PROFILER: (session handle) – More selective – But requires code insertion or “wrappering” – profiler/profctrl.p • Analysis Tools – http://communities.progress.com/pcom/docs/DOC-2808 – http://dbappraise.com PROFILER Attributes • • • • • • DESCRIPTION – optional text describing this session LISTINGS – whether or not to create debug listings DIRECTORY – where to create debug listings (default to –T) FILE-NAME – name of output file (default profile.out) ENABLED – yes/no; initializes listings and so forth PROFILING – turn profiling on or off Profiling an Entire Session • Create file called profiler.cfg with 3 lines: -OUTFILE /tmp/profiler.out -LISTINGS /tmp -DESCRIBE someDescription • Add –profile to session startup: mpro dbName –p start.p –profile profiler.cfg • Run normally. • Terminate cleanly & analyze the output. • Multiple gigabytes of data may be generated… Minimal Embedded Usage assign profiler:enabled = yes profiler:profiling = yes . do i = 1 to 1000000: /* do something */ end. assign profiler:enabled = no profiler:profiling = no . profiler:write-data(). <Escape>-P Profiler Tool • Dynamically capture line by line execution time at the point where issues occur. • Send output to a user-defined destination. • <Esc>P to configure and enable. (Don’t forget “Yes”.) • <Esc>P again to complete the capture and disable. Sample Profiling Output Description: XYZZY Top Total Time Lines Program Line Avg Time ------------------------------ ----- ----------xtabsms2.p 19281 93.256652 getdocprep2.p 939 63.346967 proc_create_sitm xtabsms2.p 11926 12.611345 xtcountry_x2_x3.p 359 0.000013 proc_read-database sysval.p 536 0.000117 xtcountry_x2_x3.p 360 0.000009 getdocprep2.p 741 0.067411 proc_upd_nref xtmfintb2.p 3582 0.003438 proc_upd_nref xtmfintb2.p 3298 0.003137 proc_process_tasks xttskscn.r 3091 0.012560 findClient sysval.p 328 0.000094 Date: 10/04/11 Time Calls ---------- ------186.513304 2 126.693934 2 50.445380 4 0.459658 34,967 0.336606 2,879 0.324163 34,894 0.269642 4 0.233802 68 0.213345 68 0.200956 16 0.165167 1,763 CLIENT LOGGING Detailed 4GL Session Logging Client Logging – The Hard Way • -clientlog filename – /logs/debug/USER_DATE_TIME_PID.log • -logginglevel – 0 = disable – 1 = errors only – 2 thru 4 = basic, verbose & extended -logentrytypes • -logentrytypes – 4GLTrace, 4GLTrans, DynObjects.*, FileID, QryInfo – LOG-MANAGER:LOG-ENTRY-TYPES = ”4GLTrace:2” <Escape>-T Trace Tool • Dynamically combine LOG-ENTRY-TYPEs & LOGGING-LEVELs to suit. • 4GLTrace, QryInfo, DynObjects… • 2 = Basic, 3 = Verbose, 4 = Extended • Specify a custom path to log file (/home/SID…) • Trace only the relevant portion of your session (not all or nothing). Client Logging Output [12/12/06@09:12:22.471-0500] P-41043172 T-000001 1 4GL -- Logging level set to = 1 [12/12/06@09:12:22.471-0500] P-41043172 T-000001 1 4GL -- Log entry types activated: 4GLTrace:2 [12/12/06@09:12:22.490-0500] P-41043172 T-000001 2 4GL 4GLTRACE Run obj/src/connect [Main Block - mls.p @ 25] Run obj/src/startup [Main Block - mls.p @ 26] PUBLISH Monitor [Main Block - obj/src/startup @ 270] Run obj/sec/access "startup no" [Main Block - obj/src/startup @ 366] Run obj/mnt/setcolor [Main Block - obj/src/startup @ 435] Func isDbReadOnly [Main Block - obj/src/startup @ 849] Func get-env [Main Block - obj/src/startup @ 882] Run obj/prt/genBarcode PERSIST [Main Block - obj/src/startup @ 903] Run loadCodes "0" [Main Block - obj/src/startup @ 914] Func addCode "00 00 11011001100 " [loadCodes - obj/prt/genBarcode @ 352] Func buildPCLString "11011001100 00" [addCode - obj/prt/genBarcode @ 291] … Run obj/mnu/menu "main" [Main Block - obj/src/startup @ 2042] Run obj/app/ckqaread [Main Block - obj/mnu/menu @ 136] Run obj/sec/access "main no" [Main Block - obj/mnu/menu @ 140] Run obj/lkp/fixmod [Main Block - obj/mnu/menu @ 189] Run obj/hlp/helpmess "main Retail Mortgage Processing" [Main Block - obj/mnu/menu @ 404] Run obj/tools/logout [Main Block - obj/src/startup @ 2119] Prolog removed to improve readability… STATEMENT CACHE Tracing Data Access Statements Client Statement Cache • Shows the session’s most recent data access statement. – Both 4gl and SQL clients are supported. • 4gl stack trace is optionally available for 4gl clients. • Enable via PROMON or via _CONNECT VST • Access results via PROMON or _CONNECT PROMON – Statement Cache User number User name User type Login date/time : : : : 89 f474458 SELF/ABL 12/06/12 09:30 Statement caching type Statement caching last updated : SQL Statement or Partial ABL Program Stack : 12/06/12 09:33 Statement cache information : 101 : obj/mrxv/lnstatex-val 1152 : execute obj/mrxe/50005399-2012-06-13-31926.r 153 : obj/mrxe/50005399-2012-06-13-31926.r 80 : obj/mrx/runcompmtrx 169 : obj/mrx/mtrxeval-stateless 282 : obj/sec/access 4300 : obj/adm/mortype4 575 : obj/mnu/menu 568 : obj/mnu/menu 3018 : flow-proc obj/src/loanflow 1947 : obj/src/loanflow 51 : obj/lkp/retailfl 575 : obj/mnu/menu 568 : obj/mnu/menu 568 : obj/mnu/menu 2042 : obj/src/startup 26 : mls.p SCENARIOS Scenarios • • • • Hung Session Runaway Slow Program Slow System NON-RESPONSIVE SESSION Checking a “Hung” Session • • • • Check for Scroll-Lock (Control-S) Examine log files. Is the process consuming any CPU ticks? Examine table statistics by user – is there any activity? • Check “blocked users” – is the session waiting on a record lock or other resource? • Get a 4GL stack trace – examine the active code. Could we be waiting for something? – Be especially wary of OS-* commands, file IO and any external interfaces. CPU Utilization • Use “ps” to check CPU utilization: /home/tom> ps -ef | grep [1]4171658; sleep 60; ps -ef | grep [1]4171658 f474458 14171658 46286012 120 12:58:50 pts/20 0:06 _progres -p mls.p f474458 14171658 46286012 120 12:58:50 pts/20 0:06 _progres -p mls.p • No change in CPU in 60 seconds – we might be “hung”. /home/tom> ps -ef | grep [1]4171658 f474458 14171658 46286012 0 12:58:50 pts/20 0:06 _progres -p mls.p f474458 50676836 14171658 0 13:48:56 pts/20 0:00 rm -i /tmp/xyzzy • In this case a sub-process is blocked waiting for input. Is There a Problem? • Holding record locks, preventing access etc. • No? Don’t worry about it. Move on. • Yes? It might be a hung session that needs to be killed. – Save log files and other analysis from above. – Dynamic restart: kill –INT (raises OpenEdge STOP condition and re-runs startup procedure) – proshut dbname –disconnect usr# – Terminate: kill –HUP or kill –TERM – Check the .lg files for transaction rollback completion before killing again. – Do NOT “kill -9” if there is DB activity – you may crash the db! Key Metrics • ProTop – Active TRX, Blocked Users, Table IO by User MY LITTLE RUNAWAY Runaway? • A particular OS process is consistently consuming significant CPU resources (usually something in excess of 90% of a core). • It may, or may not, be executing as designed and expected. • How can we tell? Checking a Runaway • • • • • Examine log files. Is the process consuming a lot of CPU ticks? Get a 4GL stack trace. Examine table and index statistics by user. Verify the Client Statement Cache. CPU Utilization • Check CPU utilization over time • If the utilization is a significant portion of the time slice then you might have a “runaway”: /home/tom> ps -ef | grep [1]4171658; sleep 60; ps -ef | grep [1]4171658 f474458 14171658 46286012 120 12:58:50 pts/20 0:08 _progres -p mls.p f474458 14171658 46286012 120 12:58:50 pts/20 0:46 _progres -p mls.p 0:46 – 0:08 = 34 seconds of CPU time, 57% of the available timeslice. Performing as Expected? • Yes? It may be time to refactor. • No? It might be a runaway that needs to be killed. – Save log files and other analysis from above. – Dynamic restart: kill –INT (raises OpenEdge STOP condition and re-runs startup procedure) – proshut dbname –disconnect usr# – Terminate: kill –HUP or kill –TERM – Check for transaction rollback completion before killing again. – Do NOT “kill -9” if there is DB activity – you may crash the db! Key Metrics • ProTop – Active TRX, Blocked Users, Table IO by User RAPID READERS Is it a “Rapid Reader”? • Very heavy DB read activity. • Probably close to the theoretical limit for a session (200,000 rec/sec for current HW). • Usually focused on just one table. • Often a small table held entirely in memory. • Often associated with a table scan (WHOLEINDEX search). • Poorly constructed joins may be involved. ProTop Example • Show summary, uio, tableStat data Fixing a Rapid Reader • Add an appropriate Index • Refactor JOIN logic. – Rewriting – Use a temp-table • Use a temp-table to replace real table. *** Tuesday at 2:30 -- Pick an Index, Any Index A SPECIFIC PROGRAM IS SLOW The Easiest Scenario Slow Program • A specific program is slow. • The problem is repeatable and measurable. • We have used tools to establish which code is slow. – If the profiler shows that data access statements are at fault follow “Data Access” analysis. – Otherwise follow “Other Cause”. Data Access Issues • Possible Data Access Issues: – – – – Table scans. Missing or inappropriate indexes. Poorly constructed joins. Non-linear scaling. • May be sensitive to the amount of data present. • May be sensitive to data values. • May not be reproducible outside the problem environment. Expected Data Access • First – determine what you *expect*: – – – – What business metrics can we relate to this process? How much activity should we see for X business? How long should that activity take? What tables and indexes should be used by this business process? Example: Extracting 100 loans of data should result in 100 “loanfile” and 225 “borrower” reads… Actual Data Access • Examine Table & Index Statistics By User – Are the expected tables being accessed at the expected rates? – Are the indexes being used at runtime the indexes that you expected? – Do data access operations vary linearly with the amount of useful work being performed? – Is any activity approaching the known limits of the infrastructure? ProTop – CRUD Data 09:38:33 ProTop -- Progress Database Monitor 09/19/04 Sample sports2000 [/data/s2k/sports2000] Rate Hit Ratio: 14:1 14:1 Commits: 62 65 Local: 51 Miss%: 7.239% 7.140% Latch Waits: 45 46 Remote: 0 Hit%:92.761% 92.860% Tot/Mod Bufs: 1002 370 Batch: 50 Log Reads: 22960 26486 Evict Bufs: 26602 6225 Server: 0 OS Reads: 1662 1891 Lock Table: 8192 11 Other: 1 Chkpts: 1 0 Lock Tbl HWM: 138 TRX: 1 Flushed: 0 0 Old/Curr BI: 6141 6141 Blocked: 5 Area Full: 1 100.00% After Image: DISABLED Total: 52 Table Statistics Tbl# Table Name Create Read Update Delete ---- -------------------- --------- --------- --------- --------4 OrderLine 0 5937 152 0 24 POLine 0 2641 56 0 23 PurchaseOrder 0 1699 36 0 18 Order 0 1608 37 0 21 Bin 0 286 14 0 2 Customer 0 206 16 0 12 Vacation 0 111 5 0 Theoretical Limits • Record Reads/sec = 200,000 per session, 1.2M total (in a “shared” LPAR environment, 2M if “dedicated”). • Transactions/sec = 18,000 • Disk Read/sec = 20,000 • Disk Write/sec = 5,000 Actual limits may be lower – these are the best values that we obtained in the performance lab. Other Cause • External Resources – Slow File IO. – Interfaces (especially networked services). • Excessive recursion. • Dynamic Object allocation. – May be revealed by memory growth. • Parameter passing with deep copies. – May be revealed by memory growth. • Poor “Big O” algorithm choice. Key Metrics • ProTop – Active TRX, Blocked Users, Table IO by User “IT’S SLOW” Let’s Go Fishing… Slow System • “The System”, as a whole, is slow. • The problem is repeatable or predictable. – From 2pm until 5pm. – The last Thursday of the month. – Whenever the X extract is run. Verify • Are there workload changes that correlate with the problem? – Has business volume increased? – Have there been changes in how the system is being used? – Has there been a change in the mix of products being originated? • Are any system limits being approached? – OS limits: CPU utilization? Disk IO? Network Ops? – DB limits: TRX/sec? Record Rd/sec? Database Activity • Are there behavioral changes that correlate with the problem? – Do certain tables show increased IO activity? • Have the “top 10” tables changed? – Do certain indexes become active? • Have the “top 10” indexes changed? – Do particular background processes become exceptionally busy? – Are any bottleneck metrics worrisome? • Latch timeouts, blocked users, long transactions Key Metrics • Record & Index Creates, Reads, Updates, Deletes, Locks and Waits per second. – Aggregate – Per Table – Per User • Transactions (Commits) per second • Latch Wait Timeouts • Number of Users: – Blocked – With Active Transactions – With Old Active Transactions Questions? 66 Thank You! 67