OOPSLA presentation

Exploring Multi-Threaded Java Application Performance on Multicore Hardware
Jennifer B. Sartor, Lieven Eeckhout
Ghent University, Belgium
OOPSLA 2012 presentation – October 24th 2012
Modern Software & Hardware
- Managed languages
  - Ubiquitous, but added runtime layer
- Many service threads interact with application
  - JIT compilation, on-stack replacement, collector
  - Stop the application, possibly critical
  - Share hardware resources
- Multicore with multiple sockets
  - Scale core frequency for power
  - Use caches of all sockets, or limit communication
- How do we schedule threads with constrained resources?
p. 2
Extensive Performance Study
- Multi-threaded Java application on multicore, multi-socket hardware
- Large space to explore
  - Number of threads
  - Thread-to-core/socket mapping
  - Pairing or isolating application and JVM threads
  - Pinning
  - Impact of frequency scaling
  - Difference between startup and steady state
- How do choices with scheduling and hardware resources affect performance?
p. 3
Experimental Machine: Nehalem
[Figure: machine diagram. Socket 0 holds Nehalem cores 0-3; Socket 1 holds Nehalem cores 4-7. Each core has a 32KB L1D cache and a 256KB L2 cache. Each socket has an 8MB L3 cache, DDR3 memory controllers, and a QuickPath Interconnect.]
Scale frequency per socket to 1.596 or 3.059 GHz
p. 4
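The core numbering on this slide can be written down as a small lookup, which is handy when describing thread-to-core/socket mappings. This is only a sketch of the layout shown in the diagram (two sockets, cores 0-3 and 4-7); the names `socket_of` and `cores_of` are hypothetical and not part of the talk or of any tool it used.

```python
from typing import List

# Layout taken from the machine diagram: 2 sockets, 4 cores each, core IDs 0-7.
SOCKETS = 2
CORES_PER_SOCKET = 4
# Per-socket frequency settings named on the slide, in GHz.
FREQUENCIES_GHZ = (1.596, 3.059)


def socket_of(core: int) -> int:
    """Return the socket that owns a core ID (hypothetical helper)."""
    if not 0 <= core < SOCKETS * CORES_PER_SOCKET:
        raise ValueError(f"core ID out of range: {core}")
    return core // CORES_PER_SOCKET


def cores_of(socket: int) -> List[int]:
    """Return the core IDs that belong to a socket (hypothetical helper)."""
    if not 0 <= socket < SOCKETS:
        raise ValueError(f"socket ID out of range: {socket}")
    first = socket * CORES_PER_SOCKET
    return list(range(first, first + CORES_PER_SOCKET))


if __name__ == "__main__":
    for s in range(SOCKETS):
        print(f"Socket {s}: cores {cores_of(s)}")
```

Because frequency is set per socket on this machine, any experiment that runs two groups of threads at different frequencies has to place them on different sockets.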
Gain Insight on Scheduling
- Application
- Java Virtual Machine
  - Garbage collector
  - Just-in-time compiler with on-stack replacement
- Cao et al. [ISCA 2012] studied JVM amenability to heterogeneity by measuring service threads’ performance per energy
- We study end-to-end performance

p. 5
Roadmap
1. Cost of Isolation
2. Frequency Scaling
3. Pairing Threads
p. 6
Experimental Methodology
- Jikes Research Virtual Machine (Dec 2011)
  - Generational Immix collector
  - 1.5, 2, and 3x minimum heap sizes
- Multithreaded DaCapo benchmarks 9.12-bach
  - Avrora, lusearch (with fix), pmd, sunflow, xalan
  - Also, pseudojbb2005
- Timed 10 invocations
  - Steady state, measure 15th iteration
  - Startup, measure 1st iteration
p. 7
Baseline Setup
Pin application & collection threads
[Figure: baseline thread placement on Nehalem cores 0-3 (Socket 0) and cores 4-7 (Socket 1). Legend: application threads; JVM service threads (collection, compilation).]
p. 8
Boosting Socket Frequency
Socket frequency raised from 1.596 GHz to 3.059 GHz
27-50% improvement in execution time
[Figure: Nehalem cores 0-3 on Socket 0 and cores 4-7 on Socket 1.]
p. 9
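For context on the 27-50% figure, a back-of-envelope estimate shows what a purely frequency-bound program would gain from the same boost. This is only a sketch under a strong assumption (execution time scales exactly with 1/frequency, with no memory or I/O stalls); it is not a calculation from the talk, and the function name is hypothetical.

```python
def frequency_bound_improvement_pct(low_ghz: float, high_ghz: float) -> float:
    """Percent reduction in execution time when moving from low_ghz to high_ghz,
    ASSUMING time is inversely proportional to clock frequency."""
    if low_ghz <= 0 or high_ghz <= 0:
        raise ValueError("frequencies must be positive")
    return (1.0 - low_ghz / high_ghz) * 100.0


# The two per-socket frequency settings used on the experimental machine.
estimate = frequency_bound_improvement_pct(1.596, 3.059)

if __name__ == "__main__":
    print(f"Frequency-only estimate: {estimate:.1f}% improvement")
```

Under that assumption the estimate is roughly 48%, so benchmarks near the top of the reported range behave as almost entirely frequency-bound, while those near 27% presumably spend more time waiting on resources that do not speed up with the core clock.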
Exploring The Cost of Isolation
Collection threads
[Figure: thread placement diagram for isolating collection threads; Nehalem cores 0-3 on Socket 0, cores 4-7 on Socket 1.]
p. 10
Isolating Collection Threads
[Figure: bar chart of % improvement in execution time (axis from -60 to 80) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: Lo-2xheap, Hi-2xheap.]
Isolating collector does not significantly hurt performance
p. 11
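The charts in this deck plot "% Improvement in Exec Time". The slides do not spell out the formula, so the sketch below assumes the usual definition relative to a baseline run; the function name and the sample timings are made up for illustration.

```python
def improvement_pct(baseline_time: float, measured_time: float) -> float:
    """Percent improvement in execution time relative to a baseline.

    ASSUMED definition: 100 * (baseline - measured) / baseline.
    Positive means faster than the baseline; negative means slower.
    """
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return 100.0 * (baseline_time - measured_time) / baseline_time


if __name__ == "__main__":
    # Illustrative timings in seconds, not measurements from the talk.
    print(round(improvement_pct(10.0, 8.0), 1))   # a run that got faster
    print(round(improvement_pct(10.0, 25.0), 1))  # a run that got slower
```

With this definition the value can never exceed 100, but it has no lower bound: a configuration that more than doubles execution time scores below -100, which is why some axes in the deck extend far into negative territory.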
Exploring The Cost of Isolation
Compiler thread
[Figure: thread placement diagram for isolating the compiler thread; Nehalem cores 0-3 on Socket 0, cores 4-7 on Socket 1.]
p. 12
Isolating Compiler Thread at Startup
[Figure: bar chart of % improvement in execution time (axis from -100 to 40) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: Lo-2xheap, Hi-2xheap.]
Isolating compiler at startup has little impact
p. 13
Isolating On-Stack-Replace at Startup
[Figure: bar chart of % improvement in execution time (axis from -100 to 40) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: Lo-2xheap, Hi-2xheap.]
Isolating OSR at startup improves performance
p. 14
Exploring The Cost of Isolation
All JVM service threads
[Figure: thread placement diagram for isolating all JVM service threads; Nehalem cores 0-3 on Socket 0, cores 4-7 on Socket 1.]
p. 15
Isolating All JVM Threads
[Figure: bar chart of % improvement in execution time (axis from -100 to 20) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: Lo-2xheap, Hi-2xheap.]
Isolating service threads only significantly hurts one benchmark
p. 16
Exploring Frequency Scaling
Baseline: JVM service threads isolated, all cores at highest frequency
[Figure: thread placement diagram; Nehalem cores 0-3 on Socket 0, cores 4-7 on Socket 1.]
p. 17
Exploring Frequency Scaling
Lower frequency of JVM service threads
versus
Lower frequency of application threads
[Figure: two placement diagrams of Nehalem cores 0-7, one per lowered-frequency configuration.]
p. 18
Lower Frequency: Collector vs App
[Figure: bar chart of % improvement in execution time (axis from -160 to 20) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: Collector, App.]
Lowering collector frequency affects performance 5x less than for application
p. 19
Lower Freq at Startup: Compiler vs App
[Figure: bar chart of % improvement in execution time (axis from -120 to 20) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: Compiler, App.]
Lowering compiler frequency is not detrimental compared to application
p. 20
Lower Frequency: JVM vs App
[Figure: bar chart of % improvement in execution time (axis from -120 to 20) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: JVM, App.]
Lowering JVM frequency affects performance 5x less than for application
p. 21
Exploring Pairing Threads
Pair application and collection threads
[Figure: thread placement diagram; Nehalem cores 0-3 on Socket 0, cores 4-7 on Socket 1.]
p. 22
Pairing App & Collector, 2 Sockets
1 Socket & Isolated Collector vs Paired
[Figure: bar chart of % improvement in execution time (axis from -200 to 40) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005. Series: OneSocket-Lo, OneSocket-Hi, Isolate-Lo, Isolate-Hi. Two off-scale bars are labeled -221 and -464.]
With all but avrora, pairing application and collector performs best
p. 23
Overall Performance Comparison
[Figure: normalized execution time (axis from 1 to 4.5) for avrora, lusearch, lusearch-fix, pmd, sunflow, xalan, and pjbb2005.
Configurations at high frequency: 1Socket-Hi, IsolateGC-Hi, IsolateComp-Hi, IsolateNonGC-Hi, IsolateJVM-Hi.
Mixed-frequency configurations: LoApp-HiGC, LoApp-HiComp, LoApp-HiNonGC, LoApp-HiJVM, HiApp-LoGC, HiApp-LoComp, HiApp-LoNonGC, HiApp-LoJVM.
Configurations at low frequency: 1Socket-Lo, IsolateGC-Lo, IsolateComp-Lo, IsolateNonGC-Lo, IsolateJVM-Lo.]
Either use 1 socket, or isolate compiler thread
p. 24
Conclusions: Scheduling Insights
- 1 socket: # application = # collection threads
- 2 sockets:
  - Isolate compilation thread
  - Pair application and collection threads
  - Set # application threads = # cores, fewer collection threads
  - Increasing application frequency is more important than for JVM service threads
- Analyzed Java performance given hardware resources

p. 25