Non-linear scaling of long running batch jobs: “The twelve days of degradation” Computer Measurement Group Conference 2011 Paper # 5002 Session #514 Chris B. Papineau Software Architect Oracle Corporation Denver, CO Non-linear scaling of long running batch jobs Introduction • Forecast for today: 100% Cloud free Non-linear scaling of long running batch jobs Introduction • The Twelve Days of Christmas • A Christmas carol which scales non-linearly – Singing all twelve verses takes more than 12 times as long as verse #1 alone • Why? Each of the twelve verses takes progressively longer to sing – Each verse adds something new and includes all previous gifts … four software fixes … three hardware upgrades … two OS patches And a bug fix for your custom program! Non-linear scaling of long running batch jobs Introduction Day #1: Day #2: Day #6: Day #12: . . . Day #n: 1 gift 3 gifts 21 gifts 78 gifts n(n+1)/2 gifts Non-linear scaling of long running batch jobs Introduction TOTAL of all gifts, all 12 days: n=1 ∑ n(n+1)/2 n=12 …. = 364 total gifts …. But we digress……. Non-linear scaling of long running batch jobs Introduction • This customer case study will address ONE common cause of non-linearity in large batch jobs – It is NOT the only possible cause – just a fairly common one – Along the way, performance analysis principles used will be highlighted • The cause deals with resource leaks – …as opposed to memory leaks • The batch job in question comes from a typical Enterprise Resource Planning (ERP) package Nonlinear scaling of long running batch jobs Agenda • Problem Definition • Analysis • Results • Conclusion Nonlinear scaling of long running batch jobs Problem Definition • “If you don't know where you're going, chances are you will end up somewhere else.” - Yogi Berra Nonlinear scaling of long running batch jobs Problem Definition • The Problem – Several batch programs from a typical ERP package did not scale with increasing data input – They are all Electronic Data Interchange (EDI) related • Batch ‘A’ - creates EDI Receiving Advices – 3000 records: 55m – 6000 records: 396m – 9000 records: 722m • Batch ‘B’ - generates EDI Inbound Purchase Orders – 3000 records: 8m – 6000 records: 22m – 9000 records: 110m • Batch ‘C’ - processes EDI Inbound Purchase Orders – 3000 records: 6m – 6000 records: 22m – 9000 records: 45m Nonlinear scaling of long running batch jobs Problem Definition • The runtime rises much more dramatically than the amount of input data. • Batch ‘A’ was by far the worst and most disruptive offender 800.00 Runtime (minutes) Data showing the non-linear scaling of batch jobs 9000 records 700.00 600.00 500.00 6000 records 400.00 300.00 200.00 3000 records 100.00 0.00 Batch ‘B’ R47011 Batch ‘C’ R4311Z1I Batch ‘A’ R47071 1 9.37 7.36 55.56 2 8.12 6.19 53.40 3 8.00 6.24 54.18 4 22.36 22.46 396.05 5 110.12 45.46 722.39 Batch ‘B’ R47011 Batch ‘C’ R4311Z1I Batch ‘A’ R47071 Nonlinear scaling of long running batch jobs Problem Definition • Platform: IBM iSeries + DB2 DBMS • Background on the ERP product in question: – Business Logic in the batch programs is implemented in the C language in the form of dynamically loaded functions stored in Dynamically Linked Libraries (DLLs) • These are called “Business Functions” (BSFNs) • This is the “Application code” – End users have access to the source code of the BSFNs • Power users may modify the code, or create new BSFNs – BSFNs, in turn, make calls to proprietary libraries of functions which are part of the ERP infrastructure • “Tools APIs” provide technical back-end functionality – – – – – – Data caching Database I/O SQL statement construction Middleware Data formatting Other assorted functionalities Nonlinear scaling of long running batch jobs Analysis • Problem Definition • Analysis • Results • Conclusion Nonlinear scaling of long running batch jobs Analysis • You get a sharp pain in your eye when drinking a cup of coffee… • Do you: – Run chemical tests on the coffee, send to a lab, involve an ophthalmologist, check for food allergies OR…. – Take the spoon out of the cup? Nonlinear scaling of long running batch jobs Analysis • Identify low-hanging fruit items first – “When your mother tells you she loves you, get a second opinion” - from Primal Fear • Run conditions verified: – Was the SAME data set used for each test? • √ The input data range for all the runs (3000 / 6000 / 9000 records) was identical, and the database was reset prior to each test – so these were repeatable and reliable “apples-apples” tests. – Was nothing else running during the tests? • √ The system was quiescent before each run • Operating system tuning – IBM iSeries experts analyzed temp space, memory pools, paging, etc… • √ No hardware or OS tuning opportunities which were obvious – …although Memory pools are a very common issue with UBEs on iSeries Always validate and verify the data before pursuing more detailed analysis Nonlinear scaling of long running batch jobs Analysis • “Everybody Lies” - Gregory House, M.D. Nonlinear scaling of long running batch jobs Analysis • With the “low hanging fruit” items addressed, the next step was to turn to First Principles of Performance Analysis: – Where is the Code or System Spending its time? See: “Performance Engineering Parables” C. Papineau CMG conference 2010 – This involves collection of profiling data for each batch job. Performance First Principles: Where is the code and/or system spending its time? Nonlinear scaling of long running batch jobs Analysis • Numerous options exist for profiling tools, depending on operating System, language, etc – Quantify – Tprof – Performance Explorer (PEX) • The ERP package in question provides a customized debugging feature which produces output in plain text format – Enabled by a configuration file setting – Debugging data output to a text file • The Batch programs are instrumented to collect the following timestamped debugging data: – – – – BSFNs SQL Hierarchical BSFN Call Stack data Proprietary technical functions • Middleware APIs • Tools APIs Nonlinear scaling of long running batch jobs Analysis • A tool called Performance Workbench (PW) is provided with the ERP product which can parse this text-based data and produce HTML-based reports and call stacks as output – – The text based data can also be manipulated and analyzed using spreadsheets, Perl programs, or other text parsing tools of choice So – the time-stamped ERP debugging output is used as custom profiling data • The ERP product provided the custom instrumentation critical to this analysis. But…. • Custom instrumentation / profiling data can be added to ANY software package to which source code is available – – • Format can be specified as per needs, in whatever manner is most useful to the desired purposes Plain text can be readily parsed into human readable format by C, Perl, VB, etc…. For details on Performance Workbench – see Poster Session 598 – Thursday 5:00 PM - 6:30 PM Room: Woodrow Wilson Plain text output can be used for instrumentation and targeted custom profiling Nonlinear scaling of long running batch jobs Analysis • Collect profiling logs – 3000 record case – – Run batch jobs with ERP debugging feature enabled 3000 record case provided enough of a robust sample size but without the excess overhead of the 6000 or 9000 record cases • PW run was run against the plain text data, producing these outputs: – HTML Code profile – Hierarchical Call Stack of Business Functions • Parse profiling logs with Excel spreadsheet, extract BSFN runtimes – Columnar / fixed-width nature of the data made it a simple matter to import to Excel and leverage graphical capabilities • We will focus on Batch ‘A’ for the purposes of this study, since it was the most severely degraded and illustrates the key analysis concepts most clearly. However, similar procedures were applied to all three jobs yielding similar results. Nonlinear scaling of long running batch jobs Analysis Performance Workbench profile of batch job: BSFN call stack output BSFN: R47011DataBaseUpdateSection() Import Performance Workbench output into Excel – graph BSFN timing trends Individual listing of BSFN runtimes Graphical analysis of BSFN runtimes Parse the profiling data with Excel spreadsheet Nonlinear scaling of long running batch jobs Analysis • The two BSFNs which consumed the majority of the time in Batch ‘A’ are indicated in the PW profile below • The BSFNs are called about 3000 times each • This sample size warrants a measuring the trend of those runtimes for the duration of the run • This is where Excel is useful…. Performance Workbench HTML profiling data – “Where is the code spending its time?” Critical Business Functions in profile of Batch Job ‘A’: F4312EditLine F4312EndDoc Nonlinear scaling of long running batch jobs Analysis • • The degradation seemed to be distributed throughout the code, and not occurring in one area Here are the runtimes for the two key BSFNs for the 3000 record run, plotted as a function of elapsed batch time – They clearly rise steadily over the course of the batch job: Business function runtimes steadily increase as batch program progresses Elapsed Time Elapsed Time Nonlinear scaling of long running batch jobs Analysis • • What could be causing such an “across the board” degradation What do all the critical parts of the BSFN code have in common that would exhibit this uniform rise in runtime? • The question to ask is: – What do these batch jobs DO ???? • A look at the profiling data shows a decisive insight: – These jobs are very heavy users of data caching Nonlinear scaling of long running batch jobs Analysis The PW profiling data showed thousands of calls to data caching APIs • The ERP package in question provides a caching API set which allows the developer to create lists of indexed data in memory. – – • • These lists can be manipulated and modified in many of the same ways as a database, e.g: FETCH, DELETE, UPDATE, INSERT. There are also advanced features similar to that which one would see in a Database Management System (DBMS), such as cursors and multiple indexes. jdeCacheInit() initializes a cache object jdeCacheTerminate() releases it Nonlinear scaling of long running batch jobs Analysis • A simple feature in the ERP product’s logging capability tracks the caches whose handles have been initialized via jdeCacheInit() but not been released via jdeCacheTerminate() – Many of the ERP products applications utilize caching, and the PW tool + the debug instrumentation recognizes this fact A Lengthy list of unterminated caches in the raw ERP debug logging Specific product knowledge is critical to performance analysis of complex systems Nonlinear scaling of long running batch jobs Analysis • de·us ex ma·chi·na [dey-uh s eks mah-kuh-nuh,dee-uh s eks mak- uh-nuh] noun – Any artificial or improbable device resolving the difficulties of a plot. • If the leap to data caching seems a bit improbable: – – – While the Performance Engineer does not have to possess expert level knowledge of the code in question, he or she must have access to such a Subject Matter Expert (SME). In this case, the insight of the cache APIs dominating the profile – provided by such a SME - was a critical leap which shortened this performance analysis effort greatly. A code profile is data, but data is of limited use if it does not provide information. "I should think you Jedi would have more respect for the difference between knowledge and wisdom." - Dexter Jettster, Star Wars: Attack of the Clones Nonlinear scaling of long running batch jobs Analysis • Upon investigation of the debug logs, the crux of this issue is clear: – A very large number (tens of thousands) of uniquely named caches (thousands) are used in all three of these batch jobs – These caches are not properly terminated during the course of the job. • Note that this is a different problem from either of the following, and should not be confused with same: – Large number of records in a single cache – Large number of references to a single cache without terminating • Neither of the above would cause the sort of phenomenon we’re seeing…. Nonlinear scaling of long running batch jobs Analysis • When many caches are created but not destroyed when no longer needed, the internal list of cache names in jdeCacheInit() gets progressively longer, so the calls to jdeCacheInit() take longer and longer. • The jdeCacheInit() API is called everywhere, thus having the effect of making “all” areas of the code quite uniformly slower as the run progresses. • This is what we call a “resource leak” – The actual memory cost is not significant Resource and memory leaks can have a critical impact on Performance Nonlinear scaling of long running batch jobs Analysis • The reason for the unique cache names for thread safe code – • Two users must not be using the same cache in the same process This methodology itself is not what causes the problem – – Failing to destroy each uniquely named cache as soon as it is no longer needed is the true problem This causes the unneeded cache names to accumulate in the internal buffers of the jdeCacheInit() and jdeCacheTerminate() APIs • – – …which in turn causes progressively longer calls over the course of processing the data. The impact on the batch jobs over time is very significant. The true problem is the large cache name “footprint” – the total number of unique cache names stored in the buffers at any given time must remain small. Nonlinear scaling of long running batch jobs Analysis • jdeCache dynamic runtime cache naming: BSFN Code sample: /* Add (const JCHAR*)before szMathString*/ sprintf(szCacheName, _J(“GeneralLedgerCache-%ls"), szUniqueJObID); /* Use the new cache name */ CacheInitReslt = jdeCacheInit(hUser, hCache, szCacheName, WfIndex); • Rather than hard coding a cache name, the name for each cache is created by appending a unique number to a hardcoded string: – – – • The unique number could be job number, order number, or something else to do with the business purposes of the app. In this case, the variable Job ID (szUniqueJObID) gets sprintf()’ed to a hardcoded string, e.g. “GeneralLedgerCache-” For each record processed, a new unique cache name is dynamically created This is only a problem when the reference count does not go to zero when the program is finished with the cache, and the cache name remains in the list. – In this case – the practice leads to an extremely large number of distinct caches – which causes the slowness in the bowels of the cache APIs Nonlinear scaling of long running batch jobs Analysis • Why does the accumulation of cache names slow down the batch job by such a large factor? – – – – All these thousands of cache names get stored in static internal buffers in the tools cache APIs. They must be processed by jdeCacheInit() and jdeCacheTerminate() each time. The larger the run, the more cache names, and the larger these buffers get. This is the reason why larger runs become increasingly unscalable and run longer by non-linear factors. There are two while loops in each call to jdeCacheInit() which become longer and longer each time it is called to create or access a cache. • • jdeCacheInit() must loop through an ever increasing list of all existing cache names each time a cache is created. jdeCacheTerminate(), similarly, must loop through the same bloated list to find the proper cache to terminate “Tweleve Drummers Drumming, Eleven pipers piping, ten lords-a-leaping….” Nonlinear scaling of long running batch jobs Analysis • jdeCacheInit() – – – • If there is no existing cache of the same name: Creates the new cache and adds the name to the internal list If a cache of the indicated name already exists: Increments the reference count for existing caches In either case – the entire list of cache names is searched sequentially jdeCacheTerminate() – – – If the reference count is > 1: Only decreases the reference count If the reference count = 1: then the cache is actually destroyed and removed from the list IN either case – the entire list of cache names is searched sequentially Nonlinear scaling of long running batch jobs Analysis • jdeCacheInit() Tools API internals - code segment Loop through every cache object Compare cache name with every other cache stored in the list Nonlinear scaling of long running batch jobs Analysis • • • • • Back to the ERP debug log … There is a repeating pattern of seven caches, each of which has one instance for each EDI job number There is a static name appended to a variable (EDI job #) Each of these cache names is stored in the internal cache name list and must be scanned each time jdeCacheInit() or jdeCacheTerminate() is called …The 21,000 days of Christmas - jdeCache() style Nonlinear scaling of long running batch jobs Analysis • So where do these caches occur in the code? – • This question must be answered if the problem is to be fixed at the application level. This can be determined via a customized string search of the profiling data. – – Standard text parsing tools, such as those supporting regular expressions, can be used for this task. However, the PW tool provides this functionality as yet another built-in feature. • • The profiling data can searched for user-specified strings, and an output file listing all the BSFNs containing those strings within their scope will be listed The hardcoded portions of the seven cache names in Batch ‘A’ can be entered as search strings in Performance Workbench to locate the “unterminated cache” names indicated in the debug log. Nonlinear scaling of long running batch jobs Analysis • The BSFNs containing the seven cache strings within their scope are indicated, thus it is these BSFNs which contain the offending unterminated caches. Note that a fundamental insight was necessary to once again simplify the problem scope: • – – Namely, the patterns in the cache names which essentially reduced 21,000 names to seven. This is why software performance work is truly engineering, not rote technician’s tasks. It is NOT “linear thinking” Performance analysis involves understanding the data and deriving original insights, not just rote number crunching Nonlinear scaling of long running batch jobs Analysis • It should be noted here that the Performance Workbench tool – customized to this ERP product - has once again provided an insight vital to the timely resolution of this issue. – It has done so in a manner much more straightforward than that of tools from a third-party vendor. – The lesson here is that customized time-stamped profiling is often the decisive factor in quickly narrowing down the source of performance problems in cases where access to the source code is possible. The use of the proper profiling tools is critical to performance analysis. This may entail customized profiling and data parsing. Nonlinear scaling of long running batch jobs Analysis • For the 3,000 record case: – Caches in Batch ‘A’ which are not terminated: total of 21,070 unique caches in seven patterns in three Business Functions • BSFN: F4111ClearDetailStack (XT4111Z1) – – – – – • BSFN: CacheProcessPOHeaderCache (B4302180) – • B3003130-<number> Caches in Batch ‘B’ which are not terminated: total of 558 unique caches in one pattern in one Business Function • BSFN: F41021DeleteCache (B4200370) – – B4302180F1706<number> BSFN: B3003130 (B3003130) – – F41UI001-<number> X4111Z1GL-<number> X4111Z1LPACM-<number> X4111Z1QTY-<number> X4111Z1ACM-<number> B4200370C-<number> Caches in Batch ‘C’ which are not terminated: total of 3000 unique caches in one pattern in one Business Function • BSFN: CacheProcessConfigurationID (B3201470) – B3201470<number> Nonlinear scaling of long running batch jobs Analysis • Applications Solutions: – Fix the BSFN code to include appropriate calls to jdeCacheTerminate() • This was the case for Batch ‘A’ .. OR, if the code logic dictates: – Fix the BSFN code to eliminate the superfluous calls to jdeCacheInit() • • In other words, did we REALLY need to initialize all those caches in the first place… This was the case for Batch ‘B’ and ‘C’ Nonlinear scaling of long running batch jobs Analysis • Pseudocode showing the essence of the fix for case 1 /* if cache cursor is NULL AND there are no records in the cache, terminate the pointer */ if (GeneralLegderCacheCursor == NULL) { if (jdeCacheGetNumRecords(hGeneralLegderCache) == 0) { /* MISSING CALL to jdeCacheTerminateAll() */ jdeCacheTerminate(hUser, hGeneralLegderCache); } } So this is the answer to all the riddles…. Nonlinear scaling of long running batch jobs Analysis • This leads us to another vital factor in Performance Analysis: the ability to read and understand the code. • In order to comprehend details of the problem, one does not need to be an expert at writing code. • However, being adept at grasping the nuances of code written by others is essential. – In other words, you do not need to take off and land the plane, but you should be able to navigate in the air. The ability to read and understand source code is very important to determining the cause of Performance Problems • Nonlinear scaling of long running batch jobs Analysis Elephant in the living room: – The sequential search in the Tools code shown below – Linked list data storage, combined with efficient search algorithms such as bubble sort and shell sort, are very fundamental “cookie cutter” concepts. Why was this not addressed????? Sequential search…YIKES!!!! Loop through every cache object Compare cache name with every other cache stored in the list • Nonlinear scaling of long running batch jobs Analysis The answer was driven by business reasons: – Modifications to the Tools code layer take longer to deliver to customers that those to the application level, since the Tools code is proprietary and can only changed by the vendor. – Changes to Tools APIs impact every application which makes calls to them, and so there is a much longer regression testing cycle involved. – Changes to application code, conversely, can be delivered quickly and tested right in the customer’s environment. – When a customers’ business cycle is impacted, the fastest means of delivering a stable solution must be leveraged, so an applicationlevel solution to the problem was delivered Science means solving Academic problems Engineering means solving Business problems. Nonlinear scaling of long running batch jobs Analysis • • There actually WAS a Tools code change which resulted from this effort, which will tend to “cover up” the application programming oversight which caused this batch scalability predicament. This fix used a binary tree algorithm instead of the flawed sequential search of a “flat” array in jdeCacheInit() and jdeCacheTerminate(). – • • It was delivered at a later date, as a part of a regularly scheduled Tools “service pack”. Application-side solution will always work, even without this Tools fix With this fix in place, however, ANY batch scaling problems caused by this issue should improve dramatically Nonlinear scaling of long running batch jobs Analysis • Characterization of the problem…. • Internal testing by the Tools team using a test driver which leaves varying numbers of caches open: • Throughput drops severely as a function of the number of unique cache names left in the internal list – If n is the caches leaked per iteration – The throughput falls off as O(1/n)…at least …the more unique cache objects left open, the slower the program runs Test Program: number of cache inits per sec (Windows) Throughput (init per sec) 2200 2100 2000 1900 1800 1700 1600 1500 1400 1300 1200 1100 1000 900 800 700 600 500 400 300 200 100 0 0 5000 10000 # of open caches 15000 # of cache objects left open 20000 Nonlinear scaling of long running batch jobs Analysis • Have many customers have complained about scalability in these batch jobs? – Customers may not notice it if they are running sufficiently low volumes, even if the issue has long been present. – Others may have noticed the problem, but their runtime windows could be long enough so that it does not cause a business disruption. – Ergo – they don’t enter calls… …performance is often in the eye of the beholder Nonlinear scaling of long running batch jobs Agenda • Problem Definition • Analysis • Results • Conclusion Nonlinear scaling of long running batch jobs Results • Results: – – Following the simple application code fix, the throughput of all three batch programs was dramatically improved. Note that the runtime of the 9000 record use case for Batch ‘A’, on which this paper has focused, was reduced from 722 minutes to 30 minutes Batch ‘A’ # records 3000 6000 9000 runtime (minutes ) Before fix 55 396 722 The use case that REALLY mattered After fix % runtime reduction 14 74.5% 20 94.9% 30 95.8% Batch ‘B’ 3000 6000 9000 8 22 110 14 23 32 -75.0% -4.5% 70.9% 3000 6000 9000 6 22 45 4 7 13 33.3% 68.2% 71.1% Batch ‘C’ Anomalous results...likely due to data inconsistencies Nonlinear scaling of long running batch jobs Results • Terminating the cache objects clearly solved a significant resource leak in these batch programs. • After the fix was applied, the batch runtimes rose at a normal, linear rate to the amount of input data processed. • The disruptions to the customer’s business cycle were completely resolved. Nonlinear scaling of long running batch jobs Analysis • Problem Definition • Analysis • Results • Conclusion Nonlinear scaling of long running batch jobs Conclusions • • • • • • • • • Always validate and verify the data before pursuing more detailed analysis Performance First Principles: Where is the code and/or system spending its time? Plain text output can be used for instrumentation and targeted profiling Specific product knowledge is critical to performance analysis of complex systems Resource and memory leaks can have a critical impact on performance Performance analysis involves understanding the data and deriving original insights, not just rote number crunching The use of the proper profiling tools is critical to performance analysis. This may entail customized profiling and data parsing. The ability to read and understand source code is very important to determining the cause of Performance Problems Science means solving Academic problems Engineering means solving Business problems. Nonlinear scaling of long running batch jobs Conclusions • How many gifts would be involved in singing The 100 Days of Christmas • The answer for day 100 is……5050 gifts!!! P.S. – What the heck is a “lord-a-leaping”???? Nonlinear scaling of long running batch jobs Performance First Principles