Java on z/OS: A fresh look
Scott Chapman
American Electric Power

Important notes
• I don’t really like Java as a language
• I’m not a Java expert
• Results presented herein may be installation-dependent
  • There are a lot of moving parts here
• I understand there’s zAAP on zIIP
  • “zAAP” used generically here
• All trademarks of IBM, Oracle, and everybody else hereby recognized

Why Java on z/OS?
• Because programmers want to use it
  http://xkcd.com/801/

Why Java on z/OS?
• Because it enables open source projects that are cool/useful/interesting
• Key trick: run the JVM in ASCII
    -Dfile.encoding=ISO8859-1
• Many things will just run with that run-time option!

What about a GUI?
• Turns out that that just works too!
• Start Xming X server on your PC
  • Check the “No Access Control” option
• Set the DISPLAY environment variable
• Run the code
    S147774:/u/s147774: >export DISPLAY=10.97.131.15:0
    S147774:/u/s147774: >java -Xmx320m -jar ga33.jar

Debugging JavaScript code running in Helma on the mainframe, with the GUI connected to Xming on my laptop
• Works better than I expected

Why Java on z/OS?
• Because it enables more programming language choices
• JavaScript built in to Java 6
  • Rhino interpreter from Mozilla
• In theory, should be able to run any JVM-based language (I haven’t tested these)
  • Jython
  • Groovy
  • Clojure
  • Scala
  • Ruby (via JRuby)

Why Java on z/OS?
• It may perform better
  • If you are on a sub-capacity machine
• It may save you money
  • Pretty unlikely
  • Only if you can take some work away from your peaks

Which job is better?

How cheap are zAAP/zIIPs?
• $100K/SE (z196, zEC12)
• How much is $100K?
• Consider adding 1 engine to z196-710:
  a) 710 = 10,250 MIPS, 1191 MSUs
  b) 711 = 11,073 MIPS, 1286 MSUs
  c) 710+1 zIIP = 10,302+1,000 MIPS
• z/OS (base) at this level costs $62/MSU
• Scenario B, z/OS base goes up almost $6K/month (1286 − 1191 = 95 MSUs × $62 = $5,890)
• zIIP costs < 17 months of z/OS Base ($100K ÷ $5,890/month ≈ 16.98 months)
  • Not to mention features, DB2, CICS, etc.

What about accessing z/OS services?
JZOS
• Classes to easily access z/OS specific constructs
  • z/OS datasets
  • RACF
  • Respond to operator commands
  • Access JES Spool

Ways to Run Java on z/OS
• WebSphere
• CICS
• DB2 Stored Procedures
• Batch
• Started Tasks
• Unix shell

Batch / Started Task options
• BPXBATC
  • BPXBATCH (traditional alias)
  • BPXBATSL (local spawn alias)
  • Traditional approach
  • Difficulty with 100-byte JCL Parm
• JZOS
  • Ships with z/OS
  • Avoids 100-byte parm limit
  • Adds a lot of flexibility

Measuring Java zAAP vs. GCP time
• Watch the normalization factor!
  • Most SMF values not normalized
  • Tools/reports may normalize for you
• Consider IFAHONORPRIORITY=NO
  • Avoid using GCPs to help zAAPs
  • Can result in >99% of Java CPU time executed on zAAP

SDSF zAAP vs. GCP columns

  JOBNAME   CPU-Time  GCP-Time  zAAP-Time  zACP-Time  zAAP-NTime
  P3SR01BS   1514.11      9.53     772.02       2.26     1501.82
  P3SR01AS   1706.50     12.82     868.75       1.95     1690.00
  P3SR01B     788.55    197.66     281.64       1.53      547.87
  P3SR01A     763.01    192.47     272.33       1.10      529.77
  P3SR02A    2953.37    422.62    1188.79       5.39     2312.56
  P3SR02B    3051.88    437.74    1226.02       6.55     2385.00
  P3SR01AS   7281.39     62.56    3698.72      11.47     7195.17
  P3SR02BS   2805.58    123.85    1316.22      22.15     2560.45
  P3SR01BS   7783.21     63.38    3955.54      14.38     7694.77
  P3SR02AS   2591.27    118.60    1216.36      10.74     2366.21
  RTMSERVE   2661.39      3.85    1363.45       1.03     2652.34

• CPU-Time: TCB + SRB
• zAAP-Time: real
• zACP-Time: zAAP on GCP
• zAAP-NTime: normalized
• This data comes from RMF

SMF 30 Accounting
BPXBATCH vs. BPXBATSL vs.
JZOS
• Important due to spawned OMVS tasks
• Single step job results:
  • BPXBATSL: 1 step, 1 job record
  • BPXBATCH: 6 step, 4 job records
    • CPU time collected on type OMVS records
  • JZOS: 2 step, 2 job records
    • CPU time almost completely on JOB types

Some interesting calculations
• zAAPn = SMF30_TIME_ON_IFA * SMF30ZNF / 256
• percent work done on zAAP =
    zAAPn / (zAAPn + SMF30CPT + SMF30CPU)
  (“Generosity” or “offload” factor)
• percent zAAP sent to GCP =
    SMF30_TIME_IFA_ON_CP / (SMF30_TIME_ON_IFA + SMF30_TIME_IFA_ON_CP)
  (“Fallback” percentage: can be <1%, although some fallback is normal and expected)

Other SMF records
• RMF records
  • Look for breakdown of processor types for both hardware and report / service classes
• WAS 120 records
  • New subtype 9s for WAS 7+ much better!
• HIS type 113 records
  • GCP vs. zAAP vs. zIIP

Java Performance

What about performance?
• Java on the mainframe has a history of performance problems
• Java is inherently “heavy” due to the JVM
• Scott’s Law: “The easier you make it on the programmer, the harder it is on the system”
• Today’s z hardware and software are up to the task!
  • (But you probably want zAAPs!)

Heard at WAS Week 200x…
• “Our goal is to get JVM startup time down to about 1 second.”
• Seemed like a stretch at the time!
  • WAS startup took several minutes

Today: WAS Servant Startup <1 min
  15.49.15 STC14327 ---- MONDAY, 18 APR 2011 ----
  15.49.15 STC14327 $HASP373 P3SR02AS STARTED
  15.49.15 STC14327 IEFUSI BPXBATSL-P3ASRU
  15.49.15 STC14327 IEF403I P3SR02AS - STARTED - TIME=15.49.15
  15.49.16 STC14327 +BBOO0004I WEBSPHERE FOR Z/OS SERVANT PROCESS ABOVE REGION SET TO 1536MB P3CELL/P3NODEA/P3SR02/P3SR02A IS STARTING.
  15.49.16 STC14327 +BBOO0239I WEBSPHERE FOR Z/OS SERVANT PROCESS p3cell/p3nodea/p3sr02a IS STARTING.
  15.49.16 STC14327 +BBOO0308I SERVANT PROCESS P3CELL/P3NODEA/P3SR02/P3SR02A IS EXECUTING IN 64-BIT ADDRESSING MODE.
  15.49.16 STC14327 +BBOM0007I CURRENT CB SERVICE LEVEL IS build level 7.0.0.12 (cf121027.08) release WAS70.ZNATV date 07/09/10 11:02:02.
  ...
  15.49.56 STC14327 +BBOO0222I: WSVR0001I: Server SERVANT PROCESS p3sr02a open for e-business
  15.49.57 STC14327 +BBOO0020I INITIALIZATION COMPLETE FOR WEBSPHERE FOR Z/OS SERVANT PROCESS P3SR02A.
  15.49.57 STC14327 +BBOO0248I INITIALIZATION COMPLETE FOR WEBSPHERE FOR Z/OS SERVANT PROCESS P3CELL/P3NODEA/P3SR02/P3SR02A.
• Not much in that particular servant

Today: HelloWorld in <2 seconds
  10.08.55 JOB47259 IEF403I S147774B - STARTED - TIME=10.08.55
  10.08.57 JOB47259 -                    --TIMINGS (MINS.)--          ----PAGING COUNTS---
  10.08.57 JOB47259 -JOBNAME  STEPNAME PROCSTEP  RC  EXCP  CPU  SRB  CLOCK  SERV  PG  PAGE  SWAP  VIO
  10.08.57 JOB47259 -S147774B RUNOMVS            00    59  .00  .00    .02  2524   0     0     0    0
  10.08.57 JOB47259 IEF404I S147774B - ENDED - TIME=10.08.57
  10.08.57 JOB47259 -S147774B ENDED.  NAME-BPXBATCH TEST  TOTAL CPU TIME= .00  TOTAL ELAPSED TIME= .02
  10.08.57 JOB47259 $HASP395 S147774B ENDED

Output
  Hello Scott
  Java runtime: IBM Corporation 1.6.0, vm version 2.4
  Running on: s390 z/OS 01.10.00
  Running for: S147774
  Classpath: /usr/lpp/java/J6.0/lib:/usr/lpp/java/IBM/J1.3/l

JCL
  //RUNOMVS EXEC PGM=BPXBATCH,
  // PARM='SH java -Xms32M -Xmx32M HelloWorldApp Scott'
  //SYSOUT DD SYSOUT=*
  //SYSPRINT DD SYSOUT=*
  //SYSUDUMP DD SYSOUT=*
  //STDENV DD *
  //STDOUT DD SYSOUT=*
  //STDERR DD SYSOUT=*

z10 EC 504 with zAAP

Small machine
  10.51.53 JOB10901 IEF403I S147774B - STARTED - TIME=10.51.53
  10.52.04 JOB10901 -                    --TIMINGS (MINS.)--          ----PAGING COUNTS---
  10.52.04 JOB10901 -JOBNAME  STEPNAME PROCSTEP  RC  EXCP  CPU  SRB  CLOCK  SERV  PG  PAGE  SWAP  VIO
  10.52.04 JOB10901 -S147774B RUNOMVS            00    86  .00  .00    .18  2252   0     0     0    0
  10.52.04 JOB10901 IEF404I S147774B - ENDED - TIME=10.52.04
  10.52.04 JOB10901 -S147774B ENDED.  NAME-BPXBATCH TEST  TOTAL CPU TIME= .00  TOTAL ELAPSED TIME= .18
  10.52.04 JOB10901 $HASP395 S147774B ENDED

z10 BC E02 without zAAPs
• Not surprising that ~50 MIPS engines can’t keep up with 450 / 900 MIPS engines

What about doing real work?
• Days of assuming it will run faster on your PC are over
  • Have seen H2 perform better on z/OS
• Still, it is Java, it’s not CPU-free
• Performance may depend on:
  • zAAP and GCP capacity
  • System settings (USS, zFS, WLM)
  • Application code
  • Java Settings (heap size, GC policy)
  • Random luck

Application code
• Application code is always important
  • Regardless of the language!
• BufferedReader or ZFile?
  • Classic “it depends”
  • BufferedReader seems like it should be faster
  • But they provide different results: byte array vs. string
  • What you want to do with the result may impact which is best for any given situation
• Java has lots of similar but slightly different ways of doing things

Heap settings
• Heap settings always seen as an issue
• Size is the usual suggestion
  • Is bigger always better?
  • Does anybody know how much heap they really need? (no)
• Min / Max sizes same or different?
• Garbage collection policy options

Memory is an issue
• Java’s memory usage can be an issue
  • “Requirements” for 100s of MBs are not unusual
• Often “requirements” seem to be a SWAG
  • Java heap size can’t be reliably predicted from the code & expected volumetrics
• Test with reasonable numbers before assuming the requirements are real
  • Be sure to get all processing scenarios!

Garbage Collection Options (IBM Java 6)
• optthruput – default
  • Probably best for batch
• gencon – generational / concurrent
  • Maybe good for large heap, transactional workloads (WAS)
• optavgpause – reduces long pauses
• subpool – “improved” object allocation
• For important workloads, may want to test all of them at various sizes
• Lots of other heap/gc options too
  • See IBM JDK Diagnostics Guide!

Heap size impact - Workload 1
[Chart: zAAPn seconds (axis 0 to 45) for Runs 1 to 5 at heap sizes 32MB, 64MB, 128MB, 256MB, 512MB]
• For some workloads, heap size may not matter

Heap size impact - Workload 2
[Chart: zAAPn seconds (axis 0 to 350) for Runs 1 to 5 at heap sizes 32MB, 64MB, 128MB, 256MB, 512MB]
• Too small of a heap can cause CPU increase

Variable vs.
Fixed Heap size
[Chart: zAAPn seconds (axis 0 to 350) for Runs 1 to 5; series WL1 32MB, WL1 32-128MB, WL1 128MB, WL2 32MB, WL2 32-128MB, WL2 128MB]
• There might be a slight benefit to a fixed heap size

GC Policy Comparison, Workload 2
[Chart: zAAPn seconds (axis 0 to 800) for Runs 1 to 5; series optthruput 128MB, optavgpause 128MB, subpool 128MB, gencon 128MB, optthruput 32MB, optavgpause 32MB, subpool 32MB]
• Heap size most important, but GC Policy also can be significant

Runtime options
[Chart: zAAPn seconds (axis 0 to 140) for Runs 1 to 5; series Baseline, jit:count=0, quickstart]
• Don’t mess with the JIT!

Quickstart with trivial workload
[Chart: zAAPn seconds (axis 0 to 0.9) for Runs 1 to 5; series baseline, quickstart]
• Could be good for certain workloads

So what’s the random thing?
• Much more variation in CPU time measurements with today’s CPUs
  • Superscalar pipeline and cache issues
• Seems to impact my Java work more than I expected
  • Consistently ran same workload
  • Extremely lightly utilized LPAR
  • Lightly utilized zAAPs
  • Same variability over time
• So I tried some more tests…

Java Workload Variability
[Chart: CPU seconds (zAAPn + GCP), axis 0 to 200, for Workload1 32MB, Workload1 512MB, Workload1 REXX, Workload2 128MB, Workload2 512MB; CPU seconds for the trivial 32MB workload on a second axis, 0 to 2; time axis from 17MAY11:07:45 to 20MAY11:04:45; test periods labeled One zAAP, Two zAAPs, Zero zAAPs]

Why is this?
• I don’t know, but best guess is CPU cache and memory access effects
• But I thought I’d look at the 113 records to see if I could find anything interesting…

All of the following charts: data from Test period 1 (One zAAP); Proc 0 = GCP, Proc 2 = zAAP

Processor Speed
[Chart: axis 0 to 5000, by processor]

Executed Instruction Rate
[Chart: axis 0 to 400, by processor]
• Seems to confirm our SMF30 data

Level 1 Miss Percentage
[Chart: axis 0 to 5, by processor]

Percent sourced from L1.5 Cache
[Chart: axis 0 to 100, by processor]
• L1.5 improvement corresponds to dip in machine usage

Percent TLB Miss of Total CPU
[Chart: axis 0 to 50, by processor]
• Dip in GCP TLB Miss overhead due to machine less busy

Estimated Cycles Per Instruction
[Chart: axis 0 to 10; series for Proc 0 and Proc 2: Sum of ESTIMATED INSTRUCTION COMPLEXITY CPI (ESTICCPI), Sum of ESTIMATED CPI FROM FINITE CACHE/MEM (ESTFINCP)]

My Guesses…
• My test Java workloads were too cache and superscalar friendly
  • Perhaps makes it more susceptible to pipeline hazards
• But:
  • Wouldn’t the REXX workload be even more superscalar and cache friendly?
  • Why were the 113 measurements so consistent?
• Or Java is really doing variable amounts of work?
• Or… something isn’t right someplace?
• Take away: Java CPU measurements might be more variable than you expect

Most recent testing
• Repeated testing later in the year
  • z/OS 1.12 vs. 1.10
  • 1 year more recent Java 6 (Fall 2010 vs.
    Fall 2009)
• Still saw variability, but the worst of it was closer to 25-30% instead of upwards of 75%
• Saw similar variability when testing on a z9 with zAAPs
• Saw at least one instance in a production LPAR with similar variability (in 3 executions of the same job, the 1st consumed just over half as much CPU as the later runs)
• Could not readily replicate on a WSC system running under z/VM

Summary
• Java enables all sorts of cool things you might not have thought could run on the mainframe
• Mainframe’s Java performance not significantly worse than any other platform
  • (Assuming adequate zAAP capacity)
• Lots of tuning knobs for Java
• Java CPU time measurements might be more variable than you expect
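
Worked example: the SMF 30 calculations

The three formulas from the “Some interesting calculations” slide (zAAPn, the offload factor, and the fallback percentage) can be sketched in Java as below. This is only an illustration under stated assumptions: the class and method names are hypothetical, the parameters simply mirror the SMF field names as written on the slide, and the sample inputs are illustrative. The normalization factor of 498 is an assumption chosen because 772.02 * 498 / 256 lands close to the 1501.82 zAAP-NTime shown in the SDSF table; the GCP values are made up.

```java
// Sketch of the SMF 30 ratios from the "Some interesting calculations" slide.
// Class and method names are hypothetical; parameters mirror the slide's field names.
final class Smf30JavaRatios {

    // zAAPn = SMF30_TIME_ON_IFA * SMF30ZNF / 256
    static double normalizedZaapSeconds(double timeOnIfa, int smf30znf) {
        return timeOnIfa * smf30znf / 256.0;
    }

    // "Generosity" or "offload" factor: zAAPn / (zAAPn + SMF30CPT + SMF30CPU)
    static double offloadFactor(double zaapN, double smf30cpt, double smf30cpu) {
        double total = zaapN + smf30cpt + smf30cpu;
        return total == 0.0 ? 0.0 : zaapN / total; // guard against an empty record
    }

    // "Fallback": SMF30_TIME_IFA_ON_CP / (SMF30_TIME_ON_IFA + SMF30_TIME_IFA_ON_CP)
    static double fallbackFraction(double timeOnIfa, double timeIfaOnCp) {
        double total = timeOnIfa + timeIfaOnCp;
        return total == 0.0 ? 0.0 : timeIfaOnCp / total;
    }

    public static void main(String[] args) {
        // Illustrative inputs only (seconds); not taken from a real SMF 30 record.
        double timeOnIfa = 772.02;   // zAAP time, not normalized
        double timeIfaOnCp = 2.26;   // zAAP-eligible time that ran on a GCP
        int znf = 498;               // assumed normalization factor (1/256ths)
        double cpt = 9.0;            // made-up value for SMF30CPT
        double cpu = 0.5;            // made-up value for SMF30CPU

        double zaapN = normalizedZaapSeconds(timeOnIfa, znf);
        System.out.printf("zAAPn seconds:  %.2f%n", zaapN);
        System.out.printf("offload factor: %.2f%%%n", 100.0 * offloadFactor(zaapN, cpt, cpu));
        System.out.printf("fallback:       %.2f%%%n", 100.0 * fallbackFraction(timeOnIfa, timeIfaOnCp));
    }
}
```

The methods return fractions rather than percentages so the caller decides how to present them; multiply by 100 for the “percent” form used on the slide. Whether your reporting tool has already normalized the zAAP time is worth checking before applying the factor a second time.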