Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment

The prominent role of the memory hierarchy as one of the major bottlenecks in achieving good program performance has motivated the search for ways of capturing the memory performance of an application/machine pair that are practical in terms of time and space, yet detailed enough to yield useful and relevant information.

The strategy that we endorse periodically samples events during program execution, producing an event trace that is both manageable and informative. Additionally, we developed a fast and flexible performance evaluation framework with which to analyze and understand the performance data contained within the sampled event traces. We have shown the potential of our performance evaluation methodology by using it to analyze a disparate set of performance issues for large, complex applications running on a multiprocessor system. For example, we have applied our methodology to characterize performance issues such as memory access performance, process migration, compulsory and conflict misses, and false sharing. To date, we have studied the memory subsystem performance of several complex applications, including the TPC-C and SPECsfs benchmarks, executing on different configurations of the IBM eServer pSeries 690.

We have also begun to investigate the effectiveness of our performance evaluation framework when studying memory subsystem performance in a virtualized environment. Virtualization allows multiple execution environments to time-share the same physical hardware in an effort to increase machine utilization. However, there is an inherent performance overhead associated with sharing a fixed set of hardware resources. The goal of our work is to identify and analyze the performance overhead associated with virtualization using our performance evaluation framework. To date, we have studied the memory subsystem performance of TRADE3, an on-line stock brokerage application, executing on different configurations of the IBM eServer p5 570, a commercial server designed to support virtualization.

Diana Villa, Ph.D. Candidate
Bret Olszewski
Mitesh Meswani, Ph.D. Candidate
Mala Anand
Dr. Patricia Teller, Professor
Carole Gottlieb
Department of Computer Science
Austin, TX
Virtualization
• Virtualize resources to facilitate time-sharing of the hardware by different execution environments
• Emergence of virtualization technology in new environments (e.g., newer architectures, open source)
• POWER Hypervisor facilitates resource sharing and supports as many as 254 active partitions

Data Collection

Environment
IBM eServer p5 570 (p570) architecture
• 1.65 GHz POWER5 processor
• 4-processor configuration

Workload
TRADE3
• On-line stock brokerage application
• Three-tier configuration
[Figure labels: Websphere, DB2, Application Code; Data]
• Collected via event-based sampling (records periodic occurrences of the monitored event)
• Organized as sampled event traces (one per CPU)

Event Record
PID     TID     Timestamp    Effective Instruction Address  Effective Data Address
372872  184469  0.328104637  000000000000A8C4               0000000000218880

[Figure: partitions APP1/OS1 through APPN/OSN running on the POWER Hypervisor]
[Figure: DCM 0 and DCM 1, each with processors (P), L2, and L3]
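The event record layout above can be read mechanically. The following is a minimal sketch, assuming that fields are whitespace-separated, that PID and TID are decimal, and that both effective addresses are hexadecimal; the `EventRecord` and `parse_record` names are hypothetical and not part of the authors' tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    """One sampled event (field types are assumptions, see lead-in)."""
    pid: int
    tid: int
    timestamp: float
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def parse_record(line: str) -> EventRecord:
    """Parse 'PID TID Timestamp InstrAddr DataAddr' into an EventRecord."""
    pid, tid, timestamp, instr_addr, data_addr = line.split()
    return EventRecord(
        pid=int(pid),
        tid=int(tid),
        timestamp=float(timestamp),
        instr_addr=int(instr_addr, 16),  # addresses assumed to be hex
        data_addr=int(data_addr, 16),
    )


record = parse_record(
    "372872 184469 0.328104637 000000000000A8C4 0000000000218880"
)
print(record)
```

Keeping addresses as integers rather than strings makes later grouping by cache line, page, or segment a matter of simple shifts and masks.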
Events Profiled
• L2-cache data load misses, which require the CPU to access off-chip memory to be resolved
• Classified according to the level at which they are resolved and the state of the requested block
Data Analysis and Results

Load Latencies of 4-processor Configuration
L2-Cache Access Resolution Site   Load Latency
L2 cache                          14 cycles
L3 cache                          91 cycles
L2.75 cache                       121 cycles
L3.75 cache                       205 cycles
LMEM                              281 cycles
RMEM                              307 cycles
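One way to put the latency table to use is to weight sampled loads by where they were resolved. The sketch below assumes per-site sample counts are available; the counts in `example_counts` are hypothetical and do not come from the poster, and `average_load_latency` is an illustrative helper name.

```python
# Load latencies in cycles, taken from the table above.
LOAD_LATENCY_CYCLES = {
    "L2": 14,
    "L3": 91,
    "L2.75": 121,
    "L3.75": 205,
    "LMEM": 281,
    "RMEM": 307,
}


def average_load_latency(counts):
    """Return the sample-weighted mean load latency in cycles.

    counts maps a resolution site name to the number of sampled loads
    resolved at that site.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sampled loads")
    weighted = sum(LOAD_LATENCY_CYCLES[site] * n for site, n in counts.items())
    return weighted / total


# Hypothetical sample counts, for illustration only.
example_counts = {"L3": 50, "LMEM": 30, "RMEM": 20}
print(average_load_latency(example_counts))
```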



• Performance overhead associated with virtualization due to sharing a fixed set of hardware resources
• Goal: Observe differences in data-load behavior that could represent the performance overhead
• Compared executions of TRADE3 in non-virtualized (1P) and virtualized (5P) environments
• Observed an increased locality of reference for 5P data loads in memory
• This indicates a possible increase in capacity/conflict misses in the 5P case due to contention for hardware resources
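Locality of reference in a sampled trace can be summarized by comparing the number of sampled loads with the number of distinct cache lines they touch. The sketch below is one such summary, assuming a 128-byte cache line; the function name and the sample addresses are hypothetical, and this is not necessarily the metric the authors computed.

```python
CACHE_LINE_BYTES = 128  # assumed line size


def locality_summary(data_addrs, line_bytes=CACHE_LINE_BYTES):
    """Count sampled loads and the distinct cache lines they touch."""
    addrs = list(data_addrs)
    unique_lines = {addr // line_bytes for addr in addrs}
    loads = len(addrs)
    lines = len(unique_lines)
    return {
        "loads": loads,
        "unique_cache_lines": lines,
        # A higher ratio means the sampled loads revisit fewer lines.
        "loads_per_line": loads / lines if lines else 0.0,
    }


# Hypothetical effective data addresses for two runs.
spread_out = [0x1000, 0x2000, 0x3000, 0x4000]
clustered = [0x1000, 0x1008, 0x1040, 0x2000]
print(locality_summary(spread_out))
print(locality_summary(clustered))
```

Under these assumptions the second list yields a higher loads-per-line ratio, which is the kind of difference the 1P versus 5P comparison looks for.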
[Figure: TRADE3 - Websphere Group, MEM Data Load Hits by Address Region. Series: 1P DLH, 1P UCL, 5P DLH, 5P UCL. Address regions shown include Other, SharedLibraryCode, Data, and WorkingStorage.]
[Figure: 4-processor configuration of the p570. Labels: L3, LMEM, L2.75 (different DCM), L3.75 (different DCM), LMEM (different DCM).]
Performance Framework
• MySQL databases catalog and store sampled event traces
• Java tools interface with the databases to load sampled event traces and run queries
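The load-then-query workflow can be sketched in a few lines. This example swaps in Python's built-in `sqlite3` for the MySQL databases and Java tools named above, purely so that it is self-contained; the `events` table, its columns, and both function names are hypothetical, and only the first trace line comes from the poster.

```python
import sqlite3


def load_trace(conn, cpu, lines):
    """Insert one CPU's sampled event trace into a hypothetical events table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "cpu INTEGER, pid INTEGER, tid INTEGER, ts REAL, "
        "instr_addr INTEGER, data_addr INTEGER)"
    )
    rows = []
    for line in lines:
        pid, tid, ts, instr_addr, data_addr = line.split()
        # SQLite integers are signed 64-bit, so addresses with the top
        # bit set would need a different column encoding.
        rows.append((cpu, int(pid), int(tid), float(ts),
                     int(instr_addr, 16), int(data_addr, 16)))
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def loads_per_pid(conn, line_bytes=128):
    """Per PID: sampled loads and distinct cache lines (line size assumed)."""
    cur = conn.execute(
        "SELECT pid, COUNT(*), COUNT(DISTINCT data_addr / ?) "
        "FROM events GROUP BY pid ORDER BY pid",
        (line_bytes,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
trace = [
    "372872 184469 0.328104637 000000000000A8C4 0000000000218880",
    "372872 184469 0.328204637 000000000000A8C8 0000000000218888",  # hypothetical
    "400000 200000 0.328304637 000000000000B000 0000000000300000",  # hypothetical
]
load_trace(conn, cpu=0, lines=trace)
print(loads_per_pid(conn))
```

Storing one row per sampled event keeps the report queries declarative: grouping by process, address region, or time window becomes a change to the `SELECT` rather than to the loader.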
[Figure axes: Fraction of data loads (0 to 1) for the series 1P DLH, 1P UCL, 5P DLH, and 5P UCL; category labels include segments 000000003 through 000000006 and Kernel.]
[Figure: Performance Framework data flow. Data collection environment (p570, TRADE3) -> sampled event traces (records of PID, TID, Timestamp, InstrAddr, DataAddr) -> Load DB Java Tool -> Database -> Report Generator Java Tool -> Reports and Graphs.]
[Figure: Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment. x-axis: Page [0-65536]; y-axis: Hit/Cache line count (0 to 400).]

Distribution of L3 Data Load Hits
Address region     Total loads   Unique cache line
5 BufferPool       56893         29384
6 Data,BSS,Heap    8799          4855
1 Kernel           23485         9840
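A per-page distribution like the one in the buffer pool figure can be derived from the effective data addresses in a trace. The sketch below assumes 256 MB segments, 4 KB pages (which gives the 65536 pages per segment shown on the axis), and 128-byte cache lines; all three sizes, the function name, and the sample addresses are assumptions for illustration.

```python
from collections import defaultdict

SEGMENT_BITS = 28  # assumed 256 MB segments
PAGE_BITS = 12     # assumed 4 KB pages
LINE_BITS = 7      # assumed 128-byte cache lines


def page_distribution(data_addrs, segment):
    """Map page index -> (hit count, unique cache line count) for one segment."""
    hits = defaultdict(int)
    lines = defaultdict(set)
    offset_mask = (1 << SEGMENT_BITS) - 1
    for addr in data_addrs:
        if addr >> SEGMENT_BITS != segment:
            continue  # sample falls in a different segment
        page = (addr & offset_mask) >> PAGE_BITS
        hits[page] += 1
        lines[page].add(addr >> LINE_BITS)
    return {page: (hits[page], len(lines[page])) for page in sorted(hits)}


# Hypothetical addresses: four in segment 5, one in segment 6.
sample = [0x50001000, 0x50001080, 0x50001080, 0x50002000, 0x60000000]
print(page_distribution(sample, segment=5))
```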

[Figure: TRADE3 - Websphere Group, MEM Data Load Hits by Segment for Data Region. y-axis: Segment (labels include 000000007); x-axis: Fraction of data loads.]
[Figure labels: address regions Kernel, Text, Data,BSS,Heap, BufferPool, SharedData, SharedLibraryData, Stack, U-BlockandKernelStack, and KERN_HEAP; series Hit % and Unique cache line; x-axis: Fraction of data loads (0 to 0.5).]
Publications

2005
• Villa, D., Meswani, M., Teller, P.J., and Olszewski, B., "Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment", to appear in the Proceedings of the 1st International Workshop on Operating System Interference in High Performance Applications, September 2004, St. Louis, MO.
• Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 6th Annual Austin Center for Advanced Studies (CAS) Conference, February 2005, Austin, TX.

2004
• Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Los Alamos Computer Science Institute Symposium (LACSI), October 2004, Santa Fe, NM.
• Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 12th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, Volendam, The Netherlands.
• Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "A Framework for Profiling Multiprocessor Memory Performance", Proceedings of the 10th International Conference on Parallel and Distributed Systems (ICPADS), July 2004, Long Beach, CA.
• Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Austin Center for Advanced Studies (CAS) Conference, February 2004, Austin, TX.

2003
• Villa, D. (2003). Using Sampled Performance Monitor Event Traces to Characterize Application Behavior. Unpublished master's thesis, The University of Texas at El Paso, El Paso, TX.
• Morgan, T., Villa, D., Teller, P.J., Olszewski, B., and Acosta, J., "L2 Miss Profiling on the p690 for a Large-scale Database Application", Proceedings of the 4th Annual Austin Center for Advanced Studies (CAS) Conference, February 2003, Austin, TX.