Outline
• Framework: Dynamically Tunable Clustered Multithreaded Architecture
• Motivation: workload characterization
• Architectural support for adaptation
• Role of program analysis
• Resource-aware operating system support

An Integrated Hardware/Software Approach to On-Line Power-Performance Optimization
Sandhya Dwarkadas, University of Rochester
Collaborators at UR: David Albonesi, Chen Ding, Eby Friedman, Michael L. Scott, UR Systems and Architecture groups
Collaborators at IBM: Pradip Bose, Alper Buyuktosunoglu, Calin Cascaval, Evelyn Duesterwald, Hubertus Franke, Zhigang Hu, Bonnie Ray, Viji Srinivasan

Emerging Trends
• Wire delays and faster clocks will necessitate aggressive use of clustering
• Larger transistor budgets and low cluster design costs will enable the incremental addition of more clusters
• There is a trend toward multithreading to exploit the transistor budget for improved throughput by combining ILP and TLP
Combine clustering and multithreading?

Conventional Processor Design
(Figure: monolithic pipeline with branch predictor, I-cache, rename & dispatch, register file, issue queue, and functional units. Large structures, slower clock speed.)

A Clustered Processor
(Figure: branch predict, fetch queue, rename map, and L1 I-cache feeding several clusters, each with its own register file, issue queue, and functional units. Small structures, faster clock speed; but high latency for some instructions.)

A Clustered Multithreaded (CMT) Architecture
(Figure: per-cluster integer and floating-point issue queues, register files, and functional units; a shared front end with branch predict, fetch queue, rename map, and ROB; LSQ, L1 D-cache, and an L2 shared unified cache.)

Components of IPC Degradation
(Chart: IPC for the ilp4, com4, and mix4 workloads under Monolithic, Clustered, Centralized FUs, Centralized FUs + No Comm Penalty, and Centralized FUs + No Comm Penalty + Centralized RegFile configurations.)

Overall Energy Impact
(Chart: energy breakdown (clock, IQ RAM, rename, ROB, bpred, LSQ, regfiles, I-cache, D-cache, ALU, result bus, communication) for the SMT, TD:8/TD:4/TD:2/TD:1 with Link 2, and Dynamic configurations.)
Single Thread Execution
(Figure: a single thread's instructions mapped onto the clustered datapath.)

Problems
• Tradeoff in communication vs. parallelism for a single thread
• Increased communication delays and contention when employing multiple threads
  – Reduced performance
  – Increased energy consumption
Goal: intelligent mapping of applications to resources for improved throughput and resource utilization as well as reduced energy

Communication vs. Parallelism
(Figure: instruction windows on 4 clusters (100 active instructions) versus 8 clusters (200 active instructions); ready instructions are highlighted, and more clusters expose distant parallelism at the cost of inter-cluster communication.)
Distant parallelism: distant instructions that are ready to execute

Single-Thread Adaptation [ISCA'03]
• Dynamic interval-based exploration can adapt to available instruction-level parallelism in a single thread
  – Determine when communication can no longer be tolerated in exploiting additional clusters
• Allow remaining clusters to be turned off to reduce power consumption or to be used by a different thread/application

Results with Interval-Based Scheme (ISCA'03)
(Chart: instructions per cycle (IPC) for cjpeg, crafty, gzip, parser, vpr, djpeg, galgel, mgrid, swim, and the harmonic mean, with 4 clusters, 16 clusters, and the interval-based scheme. Overall improvement: 11%.)

An Integrated Approach to Dynamic Tuning of the CMT
• Architectural design and dynamic configuration for fine-grain adaptation
• Program analysis to determine application behavior
• Runtime support to match predicted application behavior and resource requirements with available resources
  – Resource-aware thread scheduling for maximum throughput and fairness
  – Runtime support for balancing ILP with TLP in parallel application environments

Multithreaded Adaptation
• Basic scheme
  – Interval-based
  – Fixed 100,000 cycles
  – Exploration-based
  – Hysteresis to avoid spurious changes

Out-of-Order Dispatch & Fetch Gating
(Figure: a dispatch queue holding instructions from threads T1-T8, with head and tail pointers and entries marked ready for dispatch or blocked from dispatch. With in-order dispatch, a blocked instruction at the head stalls dispatch; with out-of-order dispatch, ready instructions from another thread, e.g., T6, can dispatch past it. Tx denotes the thread id.)

Thread to Cluster Assignment
(Chart: IPC for the com_4, ilp_4, mix1_4, and mix2_4 workload variants under Monolithic, TD:4 + SD + FG, TD:2 + SD + FG, and Adaptive configurations.)

Thread to Cache Bank Assignment
(Chart: IPC for the ilp4, com4, mix4, ilp8, com8, and mix8 workloads, N_WAY = 8, under SMT and CMT with 1, 2, 4, or 8 cache banks.)

ILP vs. TLP
(Charts: IPC vs. number of threads (1 to 8) for LU and Jacobi under Ideal, WB Shared, WB, WT, Centralized, and SMT configurations.)

A Dynamically Tunable Clustered Multithreaded (DT-CMT) Architecture
(Figure: the clustered multithreaded datapath with its shared front end, per-cluster resources, and L2 shared unified cache.)

Current Approaches to Adaptation
Reactive: adaptive change is triggered after observed variation in program behavior.
Inspect counters → Is there a phase change?
  – Yes: explore configurations, record CPIs, pick the best configuration
  – No: remain at the present configuration
Success depends on the ability to repeat behavior across successive intervals.
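For concreteness, a minimal C sketch of this reactive, interval-based exploration loop follows. It is illustrative only: the counter hook read_cpi(), the reconfiguration hook set_num_clusters(), the candidate cluster counts, and the phase-change threshold are assumptions, not the exact scheme evaluated in the ISCA'03 work.

    /* Sketch of reactive, interval-based adaptation: at the end of each
     * fixed-length interval, check for a phase change; if one is detected,
     * explore the candidate cluster counts, record CPI for each, and settle
     * on the best.  read_cpi() and set_num_clusters() are hypothetical hooks. */
    #include <math.h>

    #define INTERVAL_CYCLES  100000      /* fixed interval, as on the slide   */
    #define PHASE_THRESHOLD  0.05        /* assumed relative-change threshold */
    #define NCAND            5
    static const int candidates[NCAND] = {1, 2, 4, 8, 16};

    double read_cpi(void);               /* hypothetical counter read          */
    void   set_num_clusters(int n);      /* hypothetical reconfiguration hook  */
    void   run_for_cycles(long cycles);  /* execute one interval               */

    void adapt_forever(void)
    {
        double last_cpi = 0.0;
        int current = candidates[NCAND - 1];   /* start with all clusters on */

        for (;;) {
            run_for_cycles(INTERVAL_CYCLES);
            double cpi = read_cpi();

            /* Is there a phase change?  The threshold provides hysteresis
             * so that small variations do not trigger re-exploration. */
            if (last_cpi > 0.0 &&
                fabs(cpi - last_cpi) / last_cpi > PHASE_THRESHOLD) {
                double best_cpi = 1e30;
                int best = current;
                /* Explore configurations, recording CPI for each. */
                for (int i = 0; i < NCAND; i++) {
                    set_num_clusters(candidates[i]);
                    run_for_cycles(INTERVAL_CYCLES);
                    double c = read_cpi();
                    if (c < best_cpi) { best_cpi = c; best = candidates[i]; }
                }
                set_num_clusters(best);        /* pick the best configuration */
                current = best;
                cpi = best_cpi;
            }
            /* Otherwise remain at the present configuration. */
            last_cpi = cpi;
        }
    }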
Interval Length
Problem:
• Unstable behavior across intervals
• Small interval lengths can result in noisy measurements
Solution:
• Start with the minimum allowed interval length
• If phase changes are too frequent, double the interval length
  – find a coarse enough granularity such that behavior is consistent
• Periodically reset the interval length to the minimum

Varied Interval Lengths
Benchmark   Instability factor for a 10K interval length   Minimum acceptable interval length and its instability factor
gzip        4%                                             10K / 4%
vpr         14%                                            320K / 5%
crafty      30%                                            320K / 4%
parser      12%                                            40M / 5%
swim        0%                                             10K / 0%
mgrid       0%                                             10K / 0%
galgel      1%                                             10K / 1%
cjpeg       9%                                             40K / 4%
djpeg       31%                                            1280K / 1%
Instability factor: percentage of intervals that flag a phase change

Characterizing Program Behavior Variability
• Whole-program instrumentation (currently SPEC2K)
• Periodic hardware performance counter sampling using Ticker
  – Dynamic Probe Class Library (DPCL) to insert a timer-based interrupt in the program
  – Performance Monitoring API (PMAPI) to read the hardware counters
  – AIX-based
• Sampling interval of 10 msec
• Examination of IPC, L1 D-cache miss rate, instruction mix, branch mispredict rate
• Statistical analysis: correlation, frequency analysis, behavior variation

Example IPC Plots (SPEC2K: bzip2)
(Plot: IPC over time for bzip2.)
• Existence of macro phase behavior
• Significant behavior variation even at coarse granularities
• Strong frequency components/periodicity across several metrics

Example IPC Plots (SPEC2K: art)
(Plot: IPC over time for art.)
• High rate of behavior variation from one measurement to the next

Similarity Across Metrics
(Plots: several metrics over time for SPEC2K bzip2.)

Comparing Frequency Spectra
(Plots: frequency spectra for bzip2 and art.)

Program Behavior Variability
• Strong low-frequency (bzip2) and high-frequency (art) components, indicating a high rate of repeatability
• Variation in behavior, while different, persists across different sampling interval sizes

Important Behavior Characteristics
• Programs exhibit high degrees of repeatability across all metrics
• The rate of behavior repeatability (periodicity) across metrics is highly similar
• Variation in behavior from one interval to the next can be high
• Variation in behavior, while different, persists across different sampling interval sizes

On-Line Program Behavior Prediction
On-line power-performance optimization needs to be predictive rather than reactive.
• Linear (statistical) predictors to exploit behavior in the immediate past
  – Last value
  – Average(N)
  – Mode(N)
• Table-based predictors to exploit periodicity (non-linear)
  – Run-length encoded
  – Fixed-size history
• Cross-metric predictors to exploit similarity across metrics
  – Use one metric to predict several potentially different metrics
  – Efficiently combine multiple predictors

Table-Based Predictors
• E.g., a table-based, asymmetric predictor: the history a(t-4) a(t-3) a(t-2) a(t-1) indexes a table entry holding the predictions a(t), b(t) and the votes a_vote, b_vote
• Default to last value during the learning period
• Use a voting mechanism to update table entries
  – The prediction (a(t) or b(t)) is updated with the mode of the actual value (vote) the last time this history was encountered, the current prediction, and the measured value at the end of the interval
• The encoding and length of the history (index) can be varied
  – Fixed size or run-length encoded
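A small sketch of the fixed-size-history variant with a 3-way vote is given below. It is a minimal illustration of the idea, not the predictor used in the experiments: the history length, quantization into buckets, table size, and the mode3() tie-breaking rule are assumptions.

    /* Illustrative fixed-size-history table predictor with voting.
     * Metric samples are assumed already quantized to small integer buckets;
     * the last HIST buckets index the table. */
    #include <stdint.h>

    #define HIST       4                 /* history length a(t-4)..a(t-1)     */
    #define BUCKETS    16                /* quantization precision            */
    #define TABLE_SIZE (BUCKETS * BUCKETS * BUCKETS * BUCKETS)

    struct entry {
        uint8_t pred;                    /* current prediction a(t)           */
        uint8_t vote;                    /* value seen last time this history hit */
        uint8_t valid;
    };

    static struct entry table[TABLE_SIZE];
    static uint8_t history[HIST];
    static uint8_t last_value;

    /* Mode of three values; ties fall back to the measured value. */
    static uint8_t mode3(uint8_t a, uint8_t b, uint8_t c)
    {
        if (a == b || a == c) return a;
        if (b == c) return b;
        return c;
    }

    static unsigned index_of(const uint8_t *h)
    {
        unsigned idx = 0;
        for (int i = 0; i < HIST; i++)
            idx = idx * BUCKETS + h[i];
        return idx;
    }

    /* Called once per interval with the measured (quantized) metric value.
     * Returns the prediction for the next interval. */
    uint8_t predict_and_update(uint8_t measured)
    {
        measured %= BUCKETS;             /* defensive clamp to bucket range   */
        struct entry *e = &table[index_of(history)];

        /* Update the entry for the history that led to this measurement:
         * new prediction = mode(previous vote, current prediction, measured). */
        e->pred  = e->valid ? mode3(e->vote, e->pred, measured) : measured;
        e->vote  = measured;
        e->valid = 1;

        /* Shift the measured value into the history. */
        for (int i = 0; i < HIST - 1; i++)
            history[i] = history[i + 1];
        history[HIST - 1] = measured;
        last_value = measured;

        /* Look up the new history; default to last value while learning. */
        struct entry *n = &table[index_of(history)];
        return n->valid ? n->pred : last_value;
    }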
Design Trade-offs
• Precision
  – Too coarse a precision implies insensitivity to fine-grained behavior
  – Too fine a precision implies sensitivity to noise
• Size of history
  – Too long a history implies a potentially long learning period
  – Too short a history implies an inability to distinguish between common histories of otherwise distinct regions
• Both precision and history have table-size implications
Trade-off between noise tolerance, learning period, and prediction accuracy

Mean IPC Prediction Error (Power3)
(Chart: mean IPC prediction error of the predictors on a Power3.)

Program Behavior Predictability
• Variations in program behavior are predictable to within a few percent
• Table-based predictors outperform all others for programs with high variability
• Cross-metric table-based predictors make it possible to predict multiple metrics using a single predictor
• Microarchitecture-independent metrics allow stable prediction even when the predicted metric changes due to dynamic optimization

Information Space for Workload Analysis
(Figure: the information space available for workload analysis.)

Problems
• High variability in program behavior
• Interval length is hard to determine
  – Too small: measurement noise
  – Too large: missed opportunities for adaptation
• Interval and actual phase boundaries do not match

Data Locality Analysis (Shen and Ding)
• Basic block trace analysis
• Program phase detection
• Phase marker insertion
• Objective: find basic blocks marking unique phase boundaries
(Figure: TOMCATV reuse-distance (RD) signature with phase boundaries.)

Similarity of Locality Phases: TOMCATV (5250 phases)
Similarity of Locality Phases: COMPRESS (52 phases)
Similarity of Phases with BBV: TOMCATV (2493 intervals)
(Figures: phase-similarity plots for the locality-based and BBV-based phase detection.)

Bringing It All Together
• Locality analysis for phase detection and marking of macro phases
• Linear or non-linear (table-based) prediction within each phase for improved learning

Resource-Aware Thread Scheduling
Multiple levels of thread scheduling:
• Application threads → application-level scheduling
• O/S threads → O/S scheduling
• H/W threads → H/W scheduling
• Processor pipeline

Resource-Aware O/S Scheduler
(Figure: processes A and B feed hardware performance counters; a counter-based resource model and resource usage prediction form a resource-aware extension to the kernel scheduler(s), which choose the next task(s) using resource counters/sensors, e.g., the current temperature, inside the O/S kernel.)
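As a rough illustration of how a counter-based resource model could feed the scheduler's pick-next decision, consider the C sketch below. Every structure, threshold, and penalty here is hypothetical and chosen only to show the shape of the idea; it is not the actual scheduler extension.

    /* Hypothetical sketch of a resource-aware pick-next-task step.  Each task
     * carries per-resource usage predictions derived from hardware counters;
     * the scheduler penalizes tasks whose predicted demand clashes with the
     * current sensor readings or with already-running co-runners. */
    #include <stddef.h>

    struct resource_prediction {
        double l2_accesses_per_cycle;   /* predicted from performance counters */
        double expected_heat;           /* predicted temperature contribution  */
    };

    struct task {
        int priority;                   /* ordinary scheduler priority          */
        struct resource_prediction pred;
    };

    struct system_state {
        double current_temp;            /* from an on-chip sensor               */
        double temp_limit;
        double l2_load;                 /* predicted demand of running threads  */
        double l2_capacity;             /* on the shared L2                     */
    };

    /* Score = base priority minus penalties for predicted resource conflicts. */
    static double score(const struct task *t, const struct system_state *s)
    {
        double v = (double)t->priority;
        if (s->current_temp + t->pred.expected_heat > s->temp_limit)
            v -= 100.0;                 /* avoid creating a temperature hotspot */
        if (s->l2_load + t->pred.l2_accesses_per_cycle > s->l2_capacity)
            v -= 50.0;                  /* avoid over-subscribing L2 bandwidth  */
        return v;
    }

    struct task *pick_next(struct task *ready[], size_t n,
                           const struct system_state *s)
    {
        struct task *best = NULL;
        double best_score = -1e30;
        for (size_t i = 0; i < n; i++) {
            double v = score(ready[i], s);
            if (v > best_score) { best_score = v; best = ready[i]; }
        }
        return best;                    /* NULL if no task is ready             */
    }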
Fair Cooperative Scheduling [PPoPP 2001]
• Each process is allocated a piggy-bank of time (set to 1 quantum) from which it can borrow and to which it can add
• The piggy-bank is used to boost a process's priority (with the original purpose of responding to a communication request when notified by a wakeup signal)
• A process can add to the piggy-bank whenever it relinquishes the processor
Adapt the above so that the piggy-bank is used to schedule a process earlier than its priority would dictate and is replenished, for example, when reconfiguring at a phase marker.
Coordinate among schedulers for multiple hardware contexts.

Resource-Aware Thread Scheduling: Other Applications
Power/thermal management
• Temperature-aware process/thread scheduling to avoid temperature hotspots
  – Characterize threads based on expected temperature contribution
  – Schedule based on a thread's predicted heat contribution and the current temperature
Performance
• Improving L2 bandwidth utilization on a multiprocessor (e.g., the two cores of a Power4)
  – Characterize threads based on expected L2 cache accesses
  – Avoid scheduling different threads with high L2 access concurrently

Summary: An Integrated Hardware/Software Approach to DT-CMT
• Aggressive clustering and multithreading require a whole-system integrated view in order to maximize resource efficiency
  – Architectural configuration support (while carefully considering circuit-level issues)
  – Program analysis
  – Runtime/OS support

Application-Level Scheduling
• Provide a framework for
  – Trading ILP for TLP based on application characteristics and available resources
  – Specifying cache and cluster sharing configurations at appropriate points
• At the JVM level
  – Target server workloads
• At the level of an API such as OpenMP
  – Target scientific/parallel applications

Resource-Aware Thread Scheduling (cont'd)
Performance and power
• Resource (memory, FU, and temperature) aware thread scheduling for simultaneous multithreaded processors (e.g., the Power5 and the hyper-threads of the Pentium IV) or our proposed clustered multithreaded architecture

On-Going Projects at UofR
• CAP: dynamically reconfigurable general-purpose processor design
• MCD: Multiple Clock Domain processors
• DT-CMT: Dynamically Tunable Clustered Multithreaded architectures
• InterWeave: 3-level versioned shared state (predecessors: InterAct and Cashmere)
• ARCH: Architecture, Runtime, and Compiler Integration for High-Performance Computing
See http://www.cs.rochester.edu/research and http://www.cs.rochester.edu/~sandhya