THE PERFORMANCE INFORMATION GAP PARADOX
ECC CONFERENCE 2012
Dr. Steve Guendert
Principal Engineer/Global Solutions Architect, Brocade Mainframe Solutions
Stephen.guendert@brocade.com
@DrSteveGuendert
© 2012 Brocade Communications Systems, Inc. Company Proprietary Information

Intros: Dr. Steve Guendert
• Brocade Principal Engineer/Solutions Architect, focused on mainframe (System z)
• Academic
  • Ph.D. MIS, M.S. MIS, MBA
  • IEEE Senior Member
  • ACM Senior Member
• Computer Measurement Group (CMG)
  • International Board of Directors, 2011-present (Director of Publications)
  • Storage Subject Chair, 2007-2008
• SHARE
  • Board of Directors/EDC Program Manager, 2007-2011
  • zNextGen co-founder
• Author
  • CMG Editorial Review Board (ERB), zJournal ERB
  • Published over 40 papers for Brocade and in zJournal, CMG, NaSPA Technical Support, and Disaster Recovery Journal
• Industry experience
  • IBM, McDATA, CNT, Brocade, end user

Agenda
• Abstract
• What is the Performance Information Gap Paradox?
• How/why did we get here?
• Hope on the horizon
• Starting to solve the paradox: performance monitoring basics
• Next steps and what is still needed
• Concluding thoughts and observations

Abstract (soon-to-be-published article in zJournal)
• System z I/O technology has made significant advances over the past five years. These advances are diverse yet synergistic, ranging from the mainframe itself (processors, STIs, buses, channels) to the FICON directors, to the storage control units and devices. Speeds and feeds keep getting faster, and new technologies have emerged, including MIDAW, HyperPAV, z High Performance FICON (zHPF), 16 Gbps FICON, FICON Dynamic Channel Path Management (DCM), and the PCIe I/O drawers. Even more changes are on the horizon. Accompanying changes have also occurred in SMF/RMF.
Understanding the new technologies is crucial to your job, but so is understanding their performance through SMF/RMF.

What is the Performance Information Gap Paradox?

Paradox defined
• A paradox is a statement or group of statements that leads to a contradiction, or to a situation that (if true) defies logic or reason, similar to circular reasoning.
• Example: the classic time-travel grandfather paradox

Recent mainframe I/O and storage technology enhancements
• Mainframe itself
  • New processors
  • New internals: STIs, buses, I/O drawers
  • Channel technology
  • Dynamic Channel Path Management (DCM)
  • z Discovery and Auto Configuration (zDAC)
• Storage
  • Higher capacity, faster traditional disk and tape/virtual tape technology
  • Higher capacity, more capable switching technology
  • SSDs
  • EAV, HyperPAV

Recent mainframe I/O and storage technology enhancements (continued)
• SAN
  • Larger, faster switching devices
  • Greater distances (buffer credits), virtual fabrics
  • Better management tools
• Protocol/combination
  • z High Performance FICON (zHPF)
  • MIDAW

Performance Information Gap Paradox: what is it?
• My job/role and what I do
• Conclusion from the past five years
  • The aforementioned technology advances
  • Globally, we have a less than optimal understanding of the root functionality of this new technology
  • Case in point: zHPF
  • An accompanying lack of understanding of the performance implications and/or performance management of the new technology
• There's the paradox

How/why did we get here?
“The four horsemen”

The 2007 Global Recession BIOB
• Job cuts
• Training budgets cut
• Travel budgets cut
• Training-oriented conferences suffered
  • SHARE
  • CMG

Computer Science/MIS curriculum
• IBM Academic Initiative
• How many departments offer courses in performance management and/or capacity planning?
• “Queuing theory and statistics are too boring”
• Result: on-the-job training (OJT) and reading on your own

The evil vendor conspiracy theory
• Advances in technology and decreasing hardware prices lead to “throwing more hardware at performance problems”
• The politician-Silicon Valley industrial complex

Slow recognition of the budding problem: lack of proactivity
• Blissful ignorance by vendors and end users?
• Shortsightedness and a short-run focus on quarterly results?
• Vendor personnel suffering from the same training issues?

Hope on the horizon

Feb-March 2012 China business trip

Solving the paradox: basics/building blocks

What is a performance problem?
• A wide variety of views, with the following in common:
  • Unacceptably high resource usage
  • Unacceptably high response/service time
  • Numerically quantifiable pain
• In general:
  • Workload not getting the resources to complete in a timely manner
  • Resources available, but not obtained fast enough to meet the service level objective(s)
Solving a performance problem
• Acknowledge/identify the problem
• Root cause analysis/root cause determination
  • If you cannot determine the root cause, you will not come up with the correct long-term solution
  • Temporary “band-aid” fixes mask a symptom but don't cure the disease
  • The classic “blame the buffer credits” game
• In the mainframe world, we use SMF/RMF for this process

Mainframe I/O and storage performance monitoring basics: what's performance analysis?

Service Level Agreement/Objective (SLA/SLO)
• Matches the business needs to the subjective perceptions in IT
• Typically a contract that defines, describes, and enforces measurable specifics
• Performance SLAs typically focus on average transaction response time:
  • CPU, I/O, network, or all of the above

What's performance analysis?
• Performance analysis refers to the techniques and tools used to enforce, in your IT systems, the performance objectives defined in your SLAs/SLOs.
• The overall goal is to make the best use of your current resources to meet these objectives without excessive tuning effort.
• RMF provides an interface to an installation's System z environment that allows the end user to accomplish this.
• RMF facilitates reporting and detailed measurement of an installation's critical resources in its System z environment.
• RMF issues reports about performance problems as they occur, so that your installation can take action before the problems become critical.
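As a minimal sketch of the SLA idea above (performance SLAs focusing on average transaction response time), the Python below averages sampled response times per workload and flags any workload whose average exceeds its objective. The `Workload` class, its field names, the workload names, and the sample values are all hypothetical; real measurements would come from SMF/RMF data, whose record layouts are not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class Workload:
    # Hypothetical container: a workload name, its sampled transaction
    # response times (seconds), and the SLO for their average (seconds).
    name: str
    response_times_s: list[float]
    slo_avg_s: float


def slo_violations(workloads: list[Workload]) -> list[tuple[str, float]]:
    """Return (name, average) for each workload whose average response
    time exceeds its service level objective."""
    violations = []
    for w in workloads:
        if not w.response_times_s:
            continue  # no samples this interval, so nothing to judge
        avg = mean(w.response_times_s)
        if avg > w.slo_avg_s:
            violations.append((w.name, avg))
    return violations


if __name__ == "__main__":
    sample = [
        Workload("ONLINE", [0.3, 0.5, 0.4], slo_avg_s=0.5),
        Workload("BATCH", [12.0, 18.0], slo_avg_s=10.0),
    ]
    print(slo_violations(sample))  # [('BATCH', 15.0)]
```

Comparing the average, rather than a percentile, follows the slide's point that performance SLAs typically focus on average transaction response time.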
RMF (Resource Measurement Facility) uses
• Determine that your system is running smoothly
• Detect system bottlenecks caused by contention for resources
• Identify the workload delayed and the reason for the delay
• Monitor system failures, system stalls, and failures of selected applications
• Evaluate the service your installation provides to different groups of users

Using RMF: how do I know where to look and what to look for?
• The next slide is a diagram representing a high-level view of a single zEnterprise 196, attached via multiple (non-cascaded) FICON directors to an enterprise-class DASD array.
• The DASD array in the diagram is a somewhat simplistic, generic illustration intended to represent the control unit, the devices, and the adapters connecting the DASD array to the FICON directors.
• Listed on the diagram are the RMF reports used with the various components of this environment.
• These are the most commonly used RMF reports for identification, root cause analysis, and resolution (tuning) of I/O-related performance problems in a modern mainframe environment.

The RMF reports
• SMF 78: RMF I/O Queuing Activity (IOQ)
  • The I/O Queuing Activity report provides information on your installation's I/O configuration and activity rate, queue lengths, and the percentages of time that one or more I/O components, grouped by logical control unit (LCU), were busy.
• SMF 73: RMF Channel Path Activity (CHAN)
  • The Channel Path Activity report provides basic information about channel path use.
  • It identifies each channel path by channel path identifier (CHPID) and channel path type.
  • It also reports the total channel utilization by the entire mainframe and the channel utilization by each individual LPAR.

The RMF reports (continued)
• SMF 74-7: RMF FICON Director Activity (FCD)
  • Provides capacity planning and troubleshooting information for identifying potential bottlenecks and switch latency at the individual port level.
• SMF 74-8: RMF ESS Link Statistics (ESS)
  • Provides information and statistics on the utilization and performance of the individual adapters on the DASD array.

The RMF reports (continued)
• SMF 74-5: RMF Cache Activity (CACHE)
  • Provides cache statistics on a subsystem basis as well as on a more detailed device-level basis.
• SMF 74-1: RMF Device Activity (DEVICE)
  • Provides information for all devices in one or more device classes (tape, DASD) or for other devices you specify in the DEVICE option.

Simple methodology
• The Device Activity report is likely the report you will use first, and most often, in a performance analysis and troubleshooting situation, because it contains the response time information.
• From there, you “drill down” and narrow what you are looking at to find the root cause of the performance problem.

Next steps and what is still needed
• Add performance-focused course(s) to the curriculum
  • At a minimum, include some lectures on the subject
• Encourage performance work as a viable career option
• Get active in CMG and/or SHARE
• Use vendors to educate
• ACM and IEEE CS digital libraries
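The “simple methodology” slide (start with the Device Activity response time, then drill down) can be sketched in Python, assuming per-device averages have already been extracted from the Device Activity report. The sketch relies on the common breakdown of DASD response time into IOSQ, PEND, DISC, and CONN components. The `DeviceInterval` class, the volser values, and the mapping from dominant component to the next report are hypothetical, simplified heuristics, not a definitive tuning procedure.

```python
from __future__ import annotations

from dataclasses import dataclass

# Simplified, assumed heuristic: which RMF report from the slides to open next,
# keyed by the response-time component that dominates a device's average.
NEXT_REPORTS = {
    "IOSQ": ["Device Activity (SMF 74-1): PAV/HyperPAV alias usage"],
    "PEND": ["I/O Queuing Activity (SMF 78)", "FICON Director Activity (SMF 74-7)"],
    "DISC": ["Cache Activity (SMF 74-5)"],
    "CONN": ["Channel Path Activity (SMF 73)", "ESS Link Statistics (SMF 74-8)"],
}


@dataclass
class DeviceInterval:
    # Hypothetical per-device averages (milliseconds) for one RMF interval.
    volser: str
    iosq_ms: float
    pend_ms: float
    disc_ms: float
    conn_ms: float

    @property
    def response_ms(self) -> float:
        # Average response time = IOSQ + PEND + DISC + CONN
        return self.iosq_ms + self.pend_ms + self.disc_ms + self.conn_ms


def drill_down(devices: list[DeviceInterval], top_n: int = 3):
    """Rank devices by response time; for the worst top_n, name the
    dominant component and the reports to examine next."""
    worst = sorted(devices, key=lambda d: d.response_ms, reverse=True)[:top_n]
    plan = []
    for d in worst:
        parts = {"IOSQ": d.iosq_ms, "PEND": d.pend_ms,
                 "DISC": d.disc_ms, "CONN": d.conn_ms}
        dominant = max(parts, key=parts.get)
        plan.append((d.volser, round(d.response_ms, 2), dominant,
                     NEXT_REPORTS[dominant]))
    return plan


if __name__ == "__main__":
    sample = [
        DeviceInterval("VOL001", 0.1, 0.2, 0.1, 0.3),
        DeviceInterval("VOL002", 0.0, 0.1, 2.5, 0.4),
        DeviceInterval("VOL003", 1.5, 0.2, 0.0, 0.2),
    ]
    for step in drill_down(sample, top_n=2):
        print(step)
```

Ranking by total response time first mirrors the advice to start with the Device Activity report; the component breakdown then decides which of the other reports to drill into.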
Concluding thoughts and observations
• We need to improve I/O and storage performance education.
• We need a renewed focus on capacity planning and performance management as a valuable discipline and career field.
• We have the chance to do this with the new generation of mainframers (“zNextGen”).
• Follow-on articles in zJournal will focus on the specifics and details of the technologies and RMF records.

Thank You