Exploring Multi-Threaded Java Application Performance on Multicore Hardware
Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
OOPSLA 2012 presentation, October 24th 2012

Modern Software & Hardware
- Managed languages
  - Ubiquitous, but added runtime layer
  - Many service threads interact with application
    - JIT compilation, on-stack replacement, collector
    - Stop the application, possibly critical
    - Share hardware resources
- Multicore with multiple sockets
  - How do we schedule threads with constrained resources?
  - Scale core frequency for power
  - Use caches of all sockets, or limit communication

Extensive Performance Study
- Multi-threaded Java application on multicore, multi-socket hardware
- Large space to explore
  - Number of threads
  - Thread-to-core/socket mapping
  - Pairing or isolating application and JVM threads
  - Pinning
  - Impact of frequency scaling
  - Difference between startup and steady state
- How do choices with scheduling and hardware resources affect performance?

Experimental Machine: Nehalem
[Diagram: two sockets. Socket 0 holds Nehalem cores 0-3 and Socket 1 holds Nehalem cores 4-7. Each core has a 32KB L1D cache and a 256KB L2 cache. Each socket has an 8MB L3 cache, DDR3 memory controllers, and a QuickPath interconnect.]
- Scale frequency per socket to 1.596 or 3.059 GHz

Gain Insight on Scheduling
- Application
- Java Virtual Machine
  - Garbage collector
  - Just-in-time compiler with on-stack replacement
- Cao, et al. [ISCA 2012] studied JVM amenability to heterogeneity by measuring service threads’ performance per energy
- We study end-to-end performance

Roadmap
1. Cost of Isolation
2. Frequency Scaling
3. Pairing Threads
[Diagram: each item is illustrated with Socket 0 and Socket 1.]
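
The thread-to-core/socket mapping used throughout the talk can be modeled in a few lines. This is a minimal sketch, not the authors' tooling: it assumes core IDs 0-3 sit on Socket 0 and 4-7 on Socket 1, as drawn in the machine diagram (a real OS may number cores differently). The names `socket_of`, `cores_on`, and `pin_current_process` are hypothetical. The pinning helper uses Python's `os.sched_setaffinity`, which is available on Linux only; the study itself pins threads inside the Jikes RVM, not Python processes.

```python
import os
from typing import List

# Topology taken from the "Experimental Machine: Nehalem" slide.
SOCKETS = 2
CORES_PER_SOCKET = 4
SOCKET_FREQS_GHZ = (1.596, 3.059)  # the two per-socket frequency settings


def socket_of(core: int) -> int:
    """Return the socket that hosts a core (assumes cores 0-3 -> 0, 4-7 -> 1)."""
    if not 0 <= core < SOCKETS * CORES_PER_SOCKET:
        raise ValueError(f"no such core: {core}")
    return core // CORES_PER_SOCKET


def cores_on(socket: int) -> List[int]:
    """Return the core IDs that belong to one socket."""
    if not 0 <= socket < SOCKETS:
        raise ValueError(f"no such socket: {socket}")
    first = socket * CORES_PER_SOCKET
    return list(range(first, first + CORES_PER_SOCKET))


def pin_current_process(core: int) -> bool:
    """Pin the calling process to one core; return False if unsupported here."""
    socket_of(core)  # validates the core ID against the modeled machine
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. macOS or Windows
    # May raise OSError if the host does not actually have this core.
    os.sched_setaffinity(0, {core})
    return True
```

For example, `cores_on(1)` lists the cores a set of isolated service threads could be placed on when the application runs on Socket 0.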

Experimental Methodology
- Jikes Research Virtual Machine (Dec 2011)
- Multithreaded DaCapo benchmarks 9.12-bach
  - Avrora, lusearch (with fix), pmd, sunflow, xalan
  - Also, pseudojbb2005
- Generational Immix collector
  - 1.5, 2, and 3x minimum heap sizes
- Timed 10 invocations
  - Steady state, measure 15th iteration
  - Startup, measure 1st iteration

Baseline Setup
- Pin application & collection threads
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1. Legend: application threads, JVM service threads (collection, compilation).]

Boosting Socket Frequency
- 1.596 to 3.059 GHz: 27-50% improvement in execution time
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]

Exploring The Cost of Isolation
- Collection threads
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]

Isolating Collection Threads
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series Lo-2xheap and Hi-2xheap.]
- Isolating collector does not significantly hurt performance

Exploring The Cost of Isolation
- Compiler thread
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]

Isolating Compiler Thread at Startup
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series Lo-2xheap and Hi-2xheap.]
- Isolating compiler at startup has little impact

Isolating On-Stack-Replace at Startup
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series Lo-2xheap and Hi-2xheap.]
- Isolating OSR at startup improves performance
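
The charts above plot "% Improvement in Exec Time". The slides do not spell out the formula, so the following is a sketch under the assumption that improvement is measured relative to the baseline execution time, with negative values meaning a slowdown. The function name `improvement_pct` is hypothetical, and the timings in the usage note are illustrative, not measured results.

```python
def improvement_pct(baseline_s: float, measured_s: float) -> float:
    """Percent improvement in execution time relative to a baseline.

    Assumed definition (not stated on the slides):
        100 * (baseline - measured) / baseline
    Positive means the configuration ran faster than the baseline,
    negative means it ran slower. Values below -100 are possible when
    the measured time is more than twice the baseline.
    """
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (baseline_s - measured_s) / baseline_s
```

With made-up timings, a run that drops from 10.0 s to 7.5 s gives `improvement_pct(10.0, 7.5) == 25.0`, while a run that slows to 15.0 s gives `-50.0`.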

Exploring The Cost of Isolation
- All JVM service threads
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]

Isolating All JVM Threads
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series Lo-2xheap and Hi-2xheap.]
- Isolating service threads significantly hurts only one benchmark

Exploring Frequency Scaling
- Baseline: JVM service threads isolated, all cores at highest frequency
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]

Exploring Frequency Scaling
- Lower frequency of JVM service threads
- versus
- Lower frequency of application threads
[Diagram: two copies of Nehalem cores 0-7, one per case.]

Lower Frequency: Collector vs App
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series Collector and App.]
- Lowering collector frequency affects performance 5x less than lowering application frequency

Lower Freq at Startup: Compiler vs App
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series Compiler and App.]
- Lowering compiler frequency is not detrimental compared to lowering application frequency

Lower Frequency: JVM vs App
[Chart: y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series JVM and App.]
- Lowering JVM frequency affects performance 5x less than lowering application frequency

Exploring Pairing Threads
- Pair application and collection threads
[Diagram: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]
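
The difference between isolating and pairing threads can be written down as two placement functions. This is only an illustrative sketch: the slides do not give the exact thread-to-core assignment, so it assumes that "isolated" puts application threads on Socket 0 and collection threads on Socket 1, and that "paired" puts one application thread and one collection thread on the same core across both sockets. The function names and the round-robin wrapping are hypothetical.

```python
from typing import Dict, List

CORES_PER_SOCKET = 4  # cores 0-3 on Socket 0, cores 4-7 on Socket 1


def isolated_placement(n_app: int, n_gc: int) -> Dict[str, List[int]]:
    """Application threads on Socket 0, collection threads on Socket 1."""
    # Threads wrap around round-robin if there are more threads than cores.
    app = [i % CORES_PER_SOCKET for i in range(n_app)]
    gc = [CORES_PER_SOCKET + i % CORES_PER_SOCKET for i in range(n_gc)]
    return {"app": app, "gc": gc}


def paired_placement(n_pairs: int, n_cores: int = 8) -> Dict[str, List[int]]:
    """One application thread and one collection thread share each core."""
    cores = [i % n_cores for i in range(n_pairs)]
    return {"app": cores, "gc": list(cores)}
```

For example, `isolated_placement(4, 2)` yields `{"app": [0, 1, 2, 3], "gc": [4, 5]}`, while `paired_placement(8)` assigns both thread kinds to cores 0 through 7.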

Pairing App & Collector, 2 Sockets
[Chart: "1 Socket & Isolated Collector vs Paired"; y-axis "% Improvement in Exec Time"; benchmarks avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; series OneSocket-Lo, OneSocket-Hi, Isolate-Lo, Isolate-Hi; two off-scale bars are labeled -221 and -464.]
- With all but avrora, pairing application and collector performs best

Overall Performance Comparison
[Chart: y-axis "Normalized Execution Time"; series avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, pjbb2005; configurations 1Socket-Hi, IsolateGC-Hi, IsolateComp-Hi, IsolateNonGC-Hi, IsolateJVM-Hi, LoApp-HiGC, LoApp-HiComp, LoApp-HiNonGC, LoApp-HiJVM, HiApp-LoGC, HiApp-LoComp, HiApp-LoNonGC, HiApp-LoJVM, 1Socket-Lo, IsolateGC-Lo, IsolateComp-Lo, IsolateNonGC-Lo, IsolateJVM-Lo.]
- Either use 1 socket, or isolate compiler thread

Conclusions: Scheduling Insights
- 1 socket: # application = # collection threads
- 2 sockets:
  - Isolate compilation thread
  - Pair application and collection threads
  - Set # application threads = # cores, fewer collection threads
- Increasing application frequency is more important than for JVM service threads
- Analyzed Java performance given hardware resources
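
The configuration labels in the overall comparison chart (for example `HiApp-LoGC` or `IsolateComp-Hi`) follow a regular pattern, so they can be decoded mechanically. The sketch below is one reading of those labels, not a definition from the talk: it assumes `Lo` and `Hi` refer to the 1.596 and 3.059 GHz socket settings, that `LoApp-HiX` means the application socket and the socket running service `X` are clocked differently, and that `IsolateX-F` runs both sockets at frequency `F`. The function `parse_config` and its dictionary keys are hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Assumed mapping of the Lo/Hi label parts to the two socket frequencies.
FREQ_GHZ = {"Lo": 1.596, "Hi": 3.059}

Config = Dict[str, Union[str, float, Optional[str]]]


def parse_config(label: str) -> Config:
    """Decode a chart label into layout, isolated service, and frequencies."""
    # e.g. "HiApp-LoGC": app socket at Hi, isolated collector socket at Lo.
    m = re.fullmatch(r"(Lo|Hi)App-(Lo|Hi)(GC|Comp|NonGC|JVM)", label)
    if m:
        return {"layout": "isolate", "service": m.group(3),
                "app_ghz": FREQ_GHZ[m.group(1)],
                "service_ghz": FREQ_GHZ[m.group(2)]}
    # e.g. "IsolateComp-Hi": compiler isolated, both sockets at Hi.
    m = re.fullmatch(r"Isolate(GC|Comp|NonGC|JVM)-(Lo|Hi)", label)
    if m:
        freq = FREQ_GHZ[m.group(2)]
        return {"layout": "isolate", "service": m.group(1),
                "app_ghz": freq, "service_ghz": freq}
    # e.g. "1Socket-Lo": everything on one socket at Lo.
    m = re.fullmatch(r"1Socket-(Lo|Hi)", label)
    if m:
        freq = FREQ_GHZ[m.group(1)]
        return {"layout": "1socket", "service": None,
                "app_ghz": freq, "service_ghz": freq}
    raise ValueError(f"unrecognized configuration label: {label}")
```

Decoding the labels this way makes it easy to group results, for instance to compare every configuration whose application socket runs at the high frequency.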