Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment

Department of Computer Science
Austin, TX

The memory hierarchy is one of the major bottlenecks in achieving good program performance. This has motivated the search for ways of capturing the memory performance of an application/machine pair that are practical in terms of time and space, yet detailed enough to yield useful and relevant information.

The strategy that we endorse periodically samples events during program execution, producing an event trace that is both manageable and informative. We also developed a fast and flexible performance evaluation framework with which to analyze and understand the performance data contained in the sampled event traces. We have shown the potential of this methodology by using it to analyze a disparate set of performance issues for large, complex applications running on a multiprocessor system, including memory access performance, process migration, compulsory and conflict misses, and false sharing. To date, we have studied the memory subsystem performance of several complex applications, including the TPC-C and SPECsfs benchmarks, executing on different configurations of the IBM eServer pSeries 690.

We have also begun to investigate the effectiveness of our performance evaluation framework when studying memory subsystem performance in a virtualized environment. Virtualization allows multiple execution environments to time-share the same physical hardware in an effort to increase machine utilization. However, there is an inherent performance overhead associated with sharing a fixed set of hardware resources. The goal of our work is to identify and analyze this overhead using our performance evaluation framework.
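The periodic sampling strategy described above can be illustrated with a minimal sketch. It assumes a software stream of events and a fixed sampling period, whereas the actual traces come from hardware performance monitoring on the target machine. The record fields follow the event record named in the poster's data collection section (PID, TID, Timestamp, effective instruction address, effective data address). The names `EventRecord`, `sample_events`, and `synthetic_events`, the period of 1000, and all generated values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class EventRecord:
    """One sampled event; field names mirror the poster's event record."""
    pid: int
    tid: int
    timestamp: float
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def sample_events(events: Iterable[EventRecord], period: int) -> Iterator[EventRecord]:
    """Yield every `period`-th occurrence of the monitored event."""
    if period < 1:
        raise ValueError("period must be >= 1")
    for count, event in enumerate(events, start=1):
        if count % period == 0:
            yield event


def synthetic_events(n: int) -> Iterator[EventRecord]:
    """Hypothetical stand-in for the monitored events seen by one CPU."""
    for i in range(n):
        yield EventRecord(
            pid=100 + i % 3,
            tid=200 + i % 5,
            timestamp=i * 1e-6,
            instr_addr=0xA8C4 + 4 * (i % 16),
            data_addr=0x218880 + 128 * i,
        )


# Assumed period: keep one record per 1000 occurrences of the event.
trace = list(sample_events(synthetic_events(10_000), period=1000))
print(len(trace), "records kept out of 10000 events")
```

A larger period gives a smaller trace at the cost of detail, which is the time/space versus information trade-off the abstract refers to.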
To date, we have studied the memory subsystem performance of TRADE3, an on-line stock brokerage application, executing on different configurations of the IBM eServer p5 570, a commercial server designed to support virtualization.

Diana Villa, Ph.D. Candidate
Mitesh Meswani, Ph.D. Candidate
Dr. Patricia Teller, Professor
Bret Olszewski
Mala Anand
Carole Gottlieb

Virtualization
  Virtualize resources to facilitate time-sharing of the hardware by different execution environments
  Emergence of virtualization technology in new environments (e.g., newer architectures, open source)
  POWER Hypervisor facilitates resource sharing and supports as many as 254 active partitions
  [Figure: partitions APP1/OS1 through APPN/OSN running on the POWER Hypervisor]

Data Collection Environment
  Architecture: IBM eServer p5 570 (p570)
    1.65 GHz POWER5 processor
    4-processor configuration
  Workload: TRADE3
    On-line stock brokerage application
    Three-tier configuration
    Components shown: Websphere, DB2, Application Code, Data
  Data
    Collected via event-based sampling (record periodic occurrence of a monitored event)
    Organized as sampled event traces (one per CPU)
  Event record fields: PID, TID, Timestamp, Effective Instruction Address, Effective Data Address
  Example record: 372872 184469 0.328104637 000000000000A8C4 0000000000218880

Events Profiled
  L2-cache data load misses: require the CPU to access off-chip memory to be resolved
  Classified according to the level at which they are resolved and the state of the requested block
  [Figure: 4-processor configuration of the p570, showing DCM 0 and DCM 1 with processors (P), L2, L3, and MEM. Resolution sites labeled: L3, LMEM, L2.75 (different DCM), L3.75 (different DCM), LMEM (different DCM)]

  Load Latencies of 4-processor Configuration

  L2-cache access resolution site   Load latency
  L2 cache                          14 cycles
  L3 cache                          91 cycles
  L2.75 cache                       121 cycles
  L3.75 cache                       205 cycles
  LMEM                              281 cycles
  RMEM                              307 cycles

Performance Framework
  MySQL databases catalog/store sampled event traces
  Java tools interface with databases to load sampled event traces and run queries
  [Figure: data flow from the data collection environment (p570, TRADE3) to sampled event traces (PID, TID, Timestamp, InstrAddr, DataAddr), through the Load DB Java tool into the database, and through the Report Generator Java tool to reports and graphs]

Data Analysis and Results
  Performance overhead associated with virtualization is due to sharing a fixed set of hardware resources
  Goal: observe differences in data-load behavior that could represent the performance overhead
  Compared executions of TRADE3 in non-virtualized (1P) and virtualized (5P) environments
  Observed an increased locality of reference for 5P data loads in memory
  Indicates a possible increase in capacity/conflict misses in the 5P case due to contention for hardware resources

  [Figure: TRADE3 - Websphere Group MEM Data Load Hits by Address Region. Series: 1P DLH, 1P UCL, 5P DLH, 5P UCL. Address regions include Kernel, 000000003 through 000000007, SharedLibraryCode, SharedLibraryData, Data, WorkingStorage, and Other. x-axis: Fraction of data loads (0 to 1)]
  [Figure: TRADE3 - Websphere Group MEM Data Load Hits by Segment for Data Region. Segments: U-BlockandKernelStack, Stack, SharedData, BufferPool, Data,BSS,Heap, Text, Kernel. Series: Hit %, Unique cache line. x-axis: Fraction of data loads (0 to 0.5)]
  [Figure: Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment. x-axis: Page [0-65536]. y-axis: Hit/Cache line count (0 to 400)]

  Distribution of L3 Data Load Hits

  Address region   Segment         Total loads   Unique cache line
  5                BufferPool      56893         29384
  6                Data,BSS,Heap   8799          4855
  1                Kernel          23485         9840

Publications

2005
Villa, D., Meswani, M., Teller, P.J., and Olszewski, B., "Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment", To appear in the Proceedings of the 1st International Workshop on Operating System Interference in High Performance Applications, September 2004, St. Louis, MO.
Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 6th Annual Austin Center for Advanced Studies (CAS) Conference, February 2005, Austin, TX.

2004
Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Los Alamos Computer Science Institute Symposium (LACSI), October 2004, Santa Fe, NM.
Portillo, R., Villa, D., Teller, P.J., and Olszewski, B., "Mining Performance Data from Sampled Event Traces", Proceedings of the 12th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2004, Volendam, The Netherlands.
Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "A Framework for Profiling Multiprocessor Memory Performance", Proceedings of the 10th International Conference on Parallel and Distributed Systems (ICPADS), July 2004, Long Beach, CA.
Villa, D., Acosta, J., Teller, P.J., Olszewski, B., and Morgan, T., "Memory Performance Profiling via Sampled Performance Monitor Event Traces", Proceedings of the 5th Annual Austin Center for Advanced Studies (CAS) Conference, February 2004, Austin, TX.

2003
Villa, D., "Using Sampled Performance Monitor Event Traces to Characterize Application Behavior", Unpublished master's thesis, The University of Texas at El Paso, El Paso, TX, 2003.
Morgan, T., Villa, D., Teller, P.J., Olszewski, B., and Acosta, J., "L2 Miss Profiling on the p690 for a Large-scale Database Application", Proceedings of the 4th Annual Austin Center for Advanced Studies (CAS) Conference, February 2003, Austin, TX.