Advances in the TAU Performance System

Allen D. Malony, Sameer Shende
{malony,shende}@cs.uoregon.edu
Department of Computer and Information Science
Computational Science Institute
University of Oregon
Dagstuhl, August 2002

Outline
  Complexity and performance technology
  What is TAU?
  Problems currently being investigated
    Instrumentation control: selective instrumentation
    Performance mapping: callpath profiling
    Performance data interaction and steering: online performance analysis and visualization
    Performance analysis for component software
  Concluding remarks

Complexity in Parallel and Distributed Systems
  Complexity in computing system architecture
    Diverse parallel and distributed system architectures
      shared / distributed memory, cluster, hybrid, NOW, Grid, …
    Sophisticated processor / memory / network architectures
  Complexity in parallel software environment
    Diverse parallel programming paradigms
    Optimizing compilers and sophisticated runtime systems
    Advanced numerical libraries and application frameworks
    Hierarchical, multi-level software architectures
    Multi-component, coupled simulation models

Complexity Determines Performance Requirements
  Performance observability requirements
    Multiple levels of software and hardware
    Different types and detail of performance data
    Alternative performance problem solving methods
    Multiple targets of software and system application
  Performance technology requirements
    Broad scope of performance observation
    Flexible and configurable mechanisms
    Technology integration and extension
    Cross-platform portability
    Open, layered, and modular framework architecture

Complexity Challenges for Performance Tools
  Computing system environment complexity
    Observation integration and optimization
    Access, accuracy, and granularity constraints
    Diverse/specialized observation
      capabilities/technology
    Restricted modes limit performance problem solving
  Sophisticated software development environments
    Programming paradigms and performance models
    Performance data mapping to software abstractions
    Uniformity of performance abstraction across platforms
    Rich observation capabilities and flexible configuration
    Common performance problem solving methods

General Problems (Performance Technology)
  How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
  How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems?

TAU Performance System Framework
  Tuning and Analysis Utilities (aka Tools Are Us)
  Performance system framework for scalable parallel and distributed high-performance computing
  Targets a general complex system computation model
    nodes / contexts / threads
    Multi-level: system / software / parallelism
    Measurement and analysis abstraction
  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
    Portable performance profiling/tracing facility
    Open software approach

TAU Performance System Architecture
  (Figure: TAU architecture diagram; surviving labels: Paraver, EPILOG)

Instrumentation Control
  Selection of which performance events to observe
    Could depend on scope, type, level of interest
    Could depend on instrumentation overhead
  How is selection supported in instrumentation system?
    No choice
    Include / exclude lists (TAU)
    Environment variables
    Static vs.
      dynamic
  Problem: Controlling instrumentation of small routines
    High relative measurement overhead
    Significant intrusion and possible perturbation

Rule-Based Overhead Analysis (N. Trebon, UO)
  Analyze the performance data to determine events with high (relative) overhead performance measurements
  Create a select list for excluding those events
  Rule grammar (used in TAUreduce tool)
    [GroupName:] Field Operator Number
      GroupName indicates rule applies to events in group
      Field is an event metric attribute (from profile statistics)
        numcalls, numsubs, percent, usec, cumusec, totalcount, stdev, usecs/call, counts/call
      Operator is one of >, <, or =
      Number is any number
    Compound rules possible using & between simple rules

TAUreduce Example
  tau_reduce implements overhead reduction in TAU
  Consider klargest example
    Find kth largest element among N elements
    Compare two methods: quicksort, select_kth_largest
    i = 2324, N = 1000000 (uninstrumented)
      quicksort: (wall clock) = 0.188511 secs
      select_kth_largest: (wall clock) = 0.149594 secs
      Total: (P3/1.2GHz time) = 0.340u 0.020s 0:00.37
  Execution with all routines instrumented
  Execution with rule-based selective instrumentation
    usec>1000 & numcalls>400000 & usecs/call<30 & percent>25

Simple sorting example on one processor
  Before selective instrumentation reduction

NODE 0;CONTEXT 0;THREAD 0:
--------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive        #Call       #Subrs  Inclusive Name
              msec         msec                            usec/call
--------------------------------------------------------------------------------------
100.0           13        4,982            1            4    4982030 int main
 93.5        3,223        4,659  4.20241E+06  1.40268E+07          1 void quicksort
 62.9      0.00481        3,134            5            5     626839 int kth_largest_qs
 36.4          137        1,813           28       450057      64769 int select_kth_largest
 33.6          150        1,675       449978
                                                   449978          4 void sort_5elements
 28.8        1,435        1,435  1.02744E+07            0          0 void interchange
  0.4           20           20            1            0      20668 void setup
  0.0       0.0118       0.0118           49            0          0 int ceil

  After selective instrumentation reduction

NODE 0;CONTEXT 0;THREAD 0:
--------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive        #Call       #Subrs  Inclusive Name
              msec   total msec                            usec/call
--------------------------------------------------------------------------------------
100.0           14          383            1            4     383333 int main
 50.9          195          195            5            0      39017 int kth_largest_qs
 40.0          153          153           28           79       5478 int select_kth_largest
  5.4           20           20            1            0      20611 void setup
  0.0         0.02         0.02           49            0          0 int ceil

Performance Mapping
  Associate performance with “significant” entities (events)
    Source code points are important
      Functions, regions, control flow events, user events
    Execution process and thread entities are important
    Some entities are more abstract, harder to measure
  Consider callgraph (callpath) profiling
    Measure time (metric) along an edge (path) of callgraph
      Incident edge gives parent / child view
      Edge sequence (path) gives parent / descendant view
  Problem: Callpath profiling when callgraph is unknown
    Determine callgraph dynamically at runtime
    Map performance measurement to dynamic call path state

1-Level Callpath Implementation in TAU
  TAU maintains a performance event (routine) callstack
  Profiled routine (child) looks in callstack for parent
    Previous profiled performance event is the parent
    A callpath profile structure is created the first time the parent calls the child
    TAU records parent in a callgraph map for child
    String representing 1-level callpath used as its key
      “a( )=>b( )” : name for time spent in “b” when called by “a”
  Map returns pointer to callpath profile structure
    1-level callpath is profiled using this profiling data
  Build upon TAU’s performance mapping technology
    Measurement is independent of
      instrumentation

Performance Monitoring and Steering
  Desirable to monitor performance during execution
    Long-running applications
    Steering computations for improved performance
  Large-scale parallel applications complicate solutions
    More parallel threads of execution producing data
    Large amount of performance data (relative) to access
    Analysis and visualization more difficult
  Problem: Online performance data access and analysis
    Incremental profile sampling (based on files)
    Integration in computational steering system
    Dynamic performance measurement and access

Online Performance Analysis (K. Li, UO)
  (Figure: data-flow diagram. Labels: Application; TAU Performance System; performance data output; file system; accumulated samples; Performance Data Reader; Performance Data Integrator; Performance Analyzer; Performance Visualizer; performance data streams; Performance Steering; SCIRun (Univ. of Utah). Annotations: sample sequencing, reader synchronization.)

2D Field Performance Visualization in SCIRun
  (Figure: SCIRun program)

Uintah Computational Framework (UCF)
  University of Utah
  UCF analysis
    Scheduling
    MPI library
    Components
  500 processes
  Use for online and offline visualization
  Apply SCIRun steering

Performance Analysis of Component Software
  Complexity in scientific problem solving addressed by
    advances in software development environments
    rich layered software middleware and libraries
  Increases complexity in performance problem solving
  Integration barriers for performance technology
    Incompatible with advanced software technology
    Inconsistent with software engineering process
  Problem: Performance engineering for component systems
    Respect software development methodology
    Leverage software implementation technology
    Look for opportunities for synergy and optimization

Focus on Component Technology and CCA
  Emerging component technology for HPC and Grid
    Component: software object embedding functionality
    Component architecture (CA): how components connect
    Component framework: implements a CA
  Common Component Architecture (CCA)
    Standard foundation for scientific component architecture
    Component descriptions
      Scientific Interface Description Language (SIDL)
    CCA ports for component interactions (provides and uses)
    CCA services: directory, registry, connection, event
  High-performance components and interactions

Extended Component Design
  (Figure: generic component)
  POC and PKC are compliant with component architecture
  Component composition performance engineering
  Utilize technology and services of component framework

Architecture of a Performance Component
  Each component advertises its services
  Performance component: Ports
    Timer (start/stop)
    Event (trigger)
    Query (timers…)
    Knowledge (component performance model)
  (Figure: Performance Component with ports Timer, Event, Query, Knowledge)
  Prototype implementation of timer
    CCAFFEINE reference framework
      http://www.cca-forum.org/café.html
    SIDL
    Instantiate with TAU functionality

TimerPort Interface Declaration in CCAFFEINE
  Create Timer port abstraction

  namespace performance {
  namespace ccaports {
    /**
     * This abstract class declares the Timer interface.
     * Inherit from this class to provide functionality.
     */
    class Timer :                      /* implementation of port */
      public virtual gov::cca::Port {  /* inherits from port spec */
    public:
      virtual ~Timer() { }
      /**
       * Start the Timer. Implement this function in
       * a derived class to provide required functionality.
       */
      virtual void start(void) = 0;    /* pure virtual methods: derived */
      virtual void stop(void) = 0;     /* classes supply implementations */
      ...
    };
  }  // namespace ccaports
  }  // namespace performance

Using Performance Component Timer
  Component uses framework services to get TimerPort
  Use of this TimerPort interface is independent of TAU

  // Get Timer port from CCA framework services from CCAFFEINE
  port = frameworkServices->getPort("TimerPort");
  if (port)
    timer_m = dynamic_cast<performance::ccaports::Timer *>(port);
  if (timer_m == 0) {
    cerr << "Connected to something, not a Timer port" << endl;
    return -1;
  }
  string s = "IntegrateTimer";   // give name for timer
  timer_m->setName(s);           // assign name to timer
  timer_m->start();              // start timer (independent of tool)
  for (int i = 0; i < count; i++) {
    double x = random_m->getRandomNumber();
    sum = sum + function_m->evaluate(x);
  }
  timer_m->stop();               // stop timer

Using TAU Component in CCAFFEINE

  repository get TauTimer               /* get TAU component from repository */
  repository get Driver                 /* get application components */
  repository get MidpointIntegrator
  repository get MonteCarloIntegrator
  repository get RandomGenerator
  repository get LinearFunction
  repository get NonlinearFunction
  repository get PiFunction
  create LinearFunction lin_func        /* create component instances */
  create NonlinearFunction nonlin_func
  create PiFunction pi_func
  create MonteCarloIntegrator mc_integrator
  create RandomGenerator rand
  create TauTimer tau                   /* create TAU component instance */
  /* connecting components and running */
  connect mc_integrator RandomGeneratorPort rand RandomGeneratorPort
  connect mc_integrator FunctionPort nonlin_func FunctionPort
  connect mc_integrator TimerPort tau TimerPort
  create Driver driver
  connect driver IntegratorPort mc_integrator IntegratorPort
  go driver Go
  quit

Concluding Remarks
  Complex software and parallel computing
    systems pose challenging performance analysis problems that require robust methodologies and tools
  To build more sophisticated performance tools, existing proven performance technology must be utilized
  Performance tools must be integrated with software and systems models and technology
    Performance engineered software
    Function consistently and coherently in software and system environments
  TAU performance system offers robust performance technology that can be broadly integrated
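
Sketch: Evaluating TAUreduce-Style Rules
  The rule grammar on the Rule-Based Overhead Analysis slide ([GroupName:] Field Operator Number, with & joining simple rules) can be illustrated with a small C++ evaluator. This is a minimal sketch under stated assumptions, not the tau_reduce implementation. The names SimpleRule, EventStats, parseRule, and matches are hypothetical. A rule with a GroupName is assumed to match only events in that group, and usec is assumed to mean exclusive time in microseconds.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One simple rule of the form: [GroupName:] Field Operator Number
struct SimpleRule {
    std::string group;  // empty when no GroupName was given
    std::string field;  // e.g. "numcalls", "usecs/call", "percent"
    char op;            // '>', '<', or '='
    double number;
};

// Hypothetical per-event record: group name plus metric values keyed by field.
struct EventStats {
    std::string group;
    std::map<std::string, double> metrics;
};

// Strip leading and trailing blanks and tabs.
static std::string trim(const std::string& s) {
    const auto b = s.find_first_not_of(" \t");
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Parse a single simple rule; throws std::invalid_argument on malformed input.
SimpleRule parseSimpleRule(const std::string& text) {
    const auto opPos = text.find_first_of("<>=");
    if (opPos == std::string::npos)
        throw std::invalid_argument("missing operator in rule: " + text);
    SimpleRule r{};
    r.op = text[opPos];
    std::string lhs = trim(text.substr(0, opPos));
    const auto colon = lhs.find(':');
    if (colon != std::string::npos) {  // optional "GroupName:" prefix
        r.group = trim(lhs.substr(0, colon));
        lhs = trim(lhs.substr(colon + 1));
    }
    if (lhs.empty())
        throw std::invalid_argument("missing field in rule: " + text);
    r.field = lhs;
    r.number = std::stod(text.substr(opPos + 1));  // throws if not a number
    return r;
}

// Parse a compound rule: simple rules separated by '&'.
std::vector<SimpleRule> parseRule(const std::string& text) {
    std::vector<SimpleRule> rules;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, '&'))
        rules.push_back(parseSimpleRule(part));
    if (rules.empty())
        throw std::invalid_argument("empty rule");
    return rules;
}

// True when the event satisfies every simple rule (logical AND).
bool matches(const std::vector<SimpleRule>& rules, const EventStats& ev) {
    for (const auto& r : rules) {
        // Assumption: a group-qualified rule never matches other groups.
        if (!r.group.empty() && r.group != ev.group) return false;
        const auto it = ev.metrics.find(r.field);
        if (it == ev.metrics.end()) return false;  // unknown field: no match
        const double v = it->second;
        const bool ok = (r.op == '>')   ? (v > r.number)
                        : (r.op == '<') ? (v < r.number)
                                        : (v == r.number);  // exact equality
        if (!ok) return false;
    }
    return true;
}
```

  Under these assumptions, the rule usec>1000 & numcalls>400000 & usecs/call<30 & percent>25 matches sort_5elements in the Before profile (150 msec exclusive, 449978 calls, 4 usec/call, 33.6%) and does not match main (1 call). That agrees with sort_5elements being absent from the After profile.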