Variation Aware Application Scheduling for Chip Multi-Processors

Lavanya Subramanian, Aman Kumar
Carnegie Mellon University
{lsubrama, amank}@andrew.cmu.edu

Abstract
Variations in chip multi-processors are fast becoming a major concern with nanometer scaling. The within-die variation, in particular, is gaining significance in sub-65 nanometer technologies. Techniques are being explored to make use of variability information to achieve better performance and energy efficiency. We propose a unified approach for application scheduling that attacks performance and energy efficiency simultaneously, using information on variability.

1. Introduction
Variations in chip multi-processors are a major concern. There are two components to this: the die-to-die component and the within-die component. The die-to-die component of variation is being addressed by speed binning, and there has been a good deal of work on these techniques and methodologies. The within-die component, however, has been gaining attention lately. At the transistor/device level, these are variations in Leff and Vth. These variations in Leff and Vth translate into frequency and leakage current variations at the micro-architecture level. The view of a chip multiprocessor as consisting of several homogeneous cores is no longer valid. A CMP has to be treated instead as a collection of heterogeneous cores with different frequencies and power profiles. These variations can be profiled or modelled in terms of per-core leakage and frequency parameters. This information, coupled with characteristics of the applications/workloads that run on the CMP, could be exploited to get better energy efficiency and performance out of it.

2. Related Work
There has been some work in this direction. [1] presents a set of algorithms, intended either towards power or performance. The basic power-inclined algorithm (VarP) tries to map applications onto the least leaky cores. The enhanced version of this (VarP+AppP) tries to map the highest dynamic power consuming applications onto the least leaky cores. Similarly, the performance-centric algorithms map applications onto the fastest cores.

3. Motivation
[1] presents power and performance optimized algorithms. However, these are oriented solely towards power reduction or performance enhancement. We aim at looking at these in a unified fashion, motivated by the following observation: for cores that can operate at a specific maximum frequency, there is a wide variation in the leakage profiles. Similarly, for cores that have a certain leakage power, there is a wide spread in the maximum frequency characteristics [3]. It is on the basis of this observation that we propose to enhance the schemes presented by [1].

4. Proposed Scheme
As mentioned earlier, the previous work has focussed either on power or performance. One possible heuristic for the unified scheme is as follows:
1. Rank the cores in the order of the maximum frequencies that they can run up to.
2. Obtain the static leakage power number for each core (profiled statically at a nominal temperature).
3. Rank the applications in the order of dynamic power (obtained by static profiling on a core).
4. For each application, starting from the highest dynamic power one, map the application onto the core with the highest frequency, with the least leakage. This could be achieved by sorting the cores in frequency and leakage levels/bins.
We plan to analyze the power/performance gains from using this heuristic and possibly tweak it based on the results we obtain.
The variability model in [2] will be used to model the frequency and leakage variability information.

5. Technical Description
The infrastructure needed to run and analyze our heuristic against other algorithms requires the following steps to build.

5.1 Static Profiling
This is the first step in the power macro modelling in the BLESS (CMP) simulator. We use a single core simulator, Sim-GALS, to obtain the power numbers. This simulator is intended for a locally synchronous and globally asynchronous system. We make all the local frequencies the same; the main purpose of using this simulator is the reasonably accurate leakage modelling present as part of this tool. The technology models we use are 45nm. We simulate SPEC 2000 benchmarks on this simulator and obtain per instruction dynamic power numbers for memory and non-memory instructions and per cycle leakage numbers. The rationale behind obtaining these numbers is that BLESS does not model the core in great detail. It just distinguishes between memory and non-memory instructions. The power numbers from the static profiling are presented in the Preliminary Results section. We use the average of these numbers in BLESS.

5.2 Variation map generation
The next step is to generate variation maps to characterize the variation of the leakage power and frequencies of the different cores. We obtain these maps at the per core granularity. We use the Varimap tool developed by Sebastian to generate variation maps for Leff (gate length). We model the leakage power's variation with Leff as follows: we simulate an inverter in HSPICE, varying the gate length, and plot the variation of the inverter's leakage power with gate length. We fit this data using MATLAB and obtain the following relationship for the leakage power.

LeakageVar = exp(0.051 ∆Leff^2 - 0.6 ∆Leff - 0.062) × Leakage

where
∆Leff is the gate length variation from the nominal,
Leakage is the nominal leakage power,
LeakageVar is the leakage power with variation accounted for.

The frequency variation is modelled as the delay variation being directly proportional to the gate length variation.

We use these models to come up with a 4 x 4 variation map. This states the leakage power and frequencies for each core in a 4 x 4 CMP.

5.3 Power/Variation Macro modelling in BLESS
The next step is to take the variation-adjusted power/frequency models/numbers into BLESS, the CMP simulator. We read in the frequency and leakage maps generated by the variation modelling. We use the per instruction dynamic power numbers for the memory and non-memory instructions, scaled by the frequency of operation of the corresponding core, for the dynamic power numbers. We use the per cycle leakage numbers for each core (from the variation map) for the leakage power computation. We put all of these together and finally report the power and performance (MIPS) for the different cores. We look at the variation of the power and performance across the different processors, to get a rough feel of the variation behaviour. We present this in the Preliminary Results section.
We now have the basic infrastructure: a CMP simulator with power and variability models. The next step is to build a mock scheduler, to perform the application migration between the different cores at scheduling intervals. Then, we'll be all set to compare the different algorithms proposed in [1] and our heuristic.

6. Preliminary Results

6.1 Static Profiling results from Sim-GALS
These are our static profiling results from Sim-GALS for SPEC 2000 benchmarks. They list the memory and non-memory instruction dynamic powers and the core per cycle leakage powers. We obtain average numbers from all benchmarks to use in BLESS.

Application   NMIDP* (Watt)   MIDP* (Watt)   ACLP/cycle* (Watt)
ammp          4.856           3.6018         0.1272
gzip          2.514           1.3364         0.0897
vpr           4.0125          2.9914         0.1569
mesa          2.6177          1.5051         0.1261
art           3.7089          2.8037         0.1719
mcf           3.3925          2.5841         0.1716
parser        2.6258          1.7255         0.1529
vortex        3.8746          2.8734         0.1536
bzip2         2.4704          1.3382         0.0854
Average       3.341377778     2.306622222    0.137255556
Table 1: Sim-GALS results (45nm)
*NMIDP: Non-memory Instruction Dynamic Power
*MIDP: Memory Instruction Dynamic Power
*ACLP/cycle: Avg. Core Leakage Power per Cycle

6.2 Results after power/variation macro-modelling in BLESS
We picked two applications: perlbench, a compute intensive application, and mcf, a memory intensive application. We mapped a copy of perlbench onto all cores and studied the MIPS and power with and without variation. We repeated the same thing for mcf. The results are interesting.

Application   Variation   Mean   Standard Deviation
Perlbench     Without     6273   20
Perlbench     With        6077   133
Mcf           Without     1970   99
Mcf           With        1909   103
Table 2: MIPS comparison

Application   Variation   Mean (Watt)   Standard Deviation
Perlbench     Without     5.9           0.0179
Perlbench     With        5.7           0.1720
Mcf           Without     1.9462        0.0906
Mcf           With        1.8669        0.1258
Table 3: Avg Power per Cycle (Watt) comparison

The sigma of the MIPS for perlbench, the compute intensive application, is much bigger when variation is accounted for. However, the sigma increase of the MIPS for mcf is small, as it is memory intensive and variations in processor frequency do not affect its performance as much. However, the sigma for average power per cycle for both perlbench and mcf is quite large (though perlbench's sigma variation is larger than mcf's), as compared to the no variation case. This can be explained by the fact that non-memory instructions also consume 2.3 Watt per cycle in the core and hence this component is affected by variation too.

7. Original Plan
Milestone 1: Static profiling of applications to obtain dynamic powers, on SimpleScalar with Wattch. Build variability information into the BLESS CMP simulator.
Milestone 2: Build a scheduler into or on top of the CMP simulator.
Milestone 3: Implement and analyze the proposed scheme against the baseline algorithms.

8. Progress
We have stuck to our original plan/schedule so far. We have achieved Milestone 1, as promised in the proposal. Building a scheduler and analyzing the different algorithms is what is left to be done.
Lavanya worked on the static profiling and its incorporation in BLESS. Aman worked on the variation modelling/map generation aspect.

9. Conclusion
We observe that there is definitely a difference in the power/performance of different cores, even when they run the same applications. A large part of this difference is a result of process variability. Hence, we believe that our original plan of building scheduling algorithms that use this process variability is further bolstered by these observations.

10. Project Website
http://www.cs.cmu.edu/~amank/

11. References
[1] R. Teodorescu and J. Torrellas. Variation-aware application scheduling and power management for chip multiprocessors. In ISCA '08: Proceedings of the 35th Annual International Symposium on Computer Architecture, 2008.
[2] Y. Abulafia and A. Kornfeld. Estimation of FMAX and ISB in microprocessors. IEEE Transactions on VLSI Systems, 13(10), Oct 2006.
[3] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. In Proceedings of the 40th Annual Design Automation Conference (Anaheim, CA, USA, June 02-06, 2003). DAC '03. ACM, New York, NY, 338-342.

Fig: Variability per core modelled
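
The heuristic listed under "4. Proposed Scheme" can be illustrated with a minimal Python sketch. This is only one possible reading of the four steps, not the implementation used with BLESS: the `Core` and `App` types, the function name `map_apps_to_cores`, and the 0.1 GHz frequency bin width are all assumptions introduced for illustration. Cores are grouped into frequency bins, and within a bin the least leaky core is preferred.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    core_id: int
    max_freq_ghz: float  # maximum frequency the core can run at
    leakage_w: float     # static leakage, profiled at a nominal temperature


@dataclass(frozen=True)
class App:
    name: str
    dynamic_power_w: float  # dynamic power from static profiling


def map_apps_to_cores(apps, cores, freq_bin_ghz=0.1):
    """Map the most power-hungry applications onto fast, low-leakage cores.

    freq_bin_ghz is a hypothetical bin width: cores whose maximum
    frequencies fall into the same bin are treated as equally fast and
    are then ordered by leakage.
    """
    def core_key(core):
        # Small epsilon guards against floating-point division artifacts.
        freq_bin = math.floor(core.max_freq_ghz / freq_bin_ghz + 1e-9)
        # Highest frequency bin first, then least leakage within the bin.
        return (-freq_bin, core.leakage_w)

    ranked_cores = sorted(cores, key=core_key)
    ranked_apps = sorted(apps, key=lambda a: a.dynamic_power_w, reverse=True)
    # Pair the i-th most power-hungry application with the i-th ranked core.
    return {app.name: core.core_id
            for app, core in zip(ranked_apps, ranked_cores)}


example_cores = [
    Core(0, 3.00, 0.20),
    Core(1, 3.02, 0.10),
    Core(2, 2.80, 0.05),
    Core(3, 2.50, 0.15),
]
example_apps = [App("a", 5.0), App("b", 2.0), App("c", 4.0)]
example_mapping = map_apps_to_cores(example_apps, example_cores)
```

In this made-up example, cores 0 and 1 share the top frequency bin, so the application with the highest dynamic power goes to core 1, the less leaky of the two. A wider bin gives leakage more weight in the ranking; a narrower one favours raw frequency.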
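
The variation models of Section 5.2 can be sketched in a few lines of Python. The leakage function uses the fitted coefficients quoted in that section. The frequency function is an assumption: the text only says that delay variation is directly proportional to gate length variation, so the proportionality constant `delay_sensitivity` (fractional delay change per unit of ∆Leff) is a hypothetical parameter, as are the function names and the units of ∆Leff.

```python
import math


def leakage_with_variation(nominal_leakage_w, delta_leff):
    """Leakage power after applying the fitted model from Section 5.2."""
    factor = math.exp(0.051 * delta_leff ** 2 - 0.6 * delta_leff - 0.062)
    return factor * nominal_leakage_w


def freq_with_variation(nominal_freq_ghz, delta_leff, delay_sensitivity):
    """Frequency when delay grows linearly with gate length variation.

    delay_sensitivity is a hypothetical constant, not a value from the
    text: relative delay change = delay_sensitivity * delta_leff.
    """
    return nominal_freq_ghz / (1.0 + delay_sensitivity * delta_leff)


def build_variation_map(delta_leff_grid, nominal_leakage_w,
                        nominal_freq_ghz, delay_sensitivity):
    """Turn a grid of per-core delta-Leff values into (leakage, freq) pairs."""
    return [
        [(leakage_with_variation(nominal_leakage_w, d),
          freq_with_variation(nominal_freq_ghz, d, delay_sensitivity))
         for d in row]
        for row in delta_leff_grid
    ]
```

Note that the fitted expression does not evaluate to exactly 1 at ∆Leff = 0: the constant term leaves a factor of exp(-0.062), roughly 0.94, which is presumably an artifact of the curve fit. A shorter gate (negative ∆Leff) raises both leakage and frequency under these models.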
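
The per-core accounting described in Section 5.3 might look roughly like the following Python sketch. The exact bookkeeping inside BLESS is not given in the text, so the way the terms are combined here is an assumption: dynamic power is taken as the instruction-weighted sum of the per instruction numbers, scaled by the ratio of core frequency to nominal frequency and averaged over cycles, with the per cycle leakage from the variation map added on top. All function and parameter names are hypothetical.

```python
def average_power_per_cycle(mem_instrs, nonmem_instrs, cycles,
                            midp_w, nmidp_w, leakage_per_cycle_w,
                            core_freq_ghz, nominal_freq_ghz):
    """Average power per cycle for one core (assumed accounting).

    midp_w / nmidp_w are the per instruction dynamic powers for memory
    and non-memory instructions; leakage_per_cycle_w comes from the
    variation map entry for this core.
    """
    freq_scale = core_freq_ghz / nominal_freq_ghz
    dynamic = freq_scale * (mem_instrs * midp_w
                            + nonmem_instrs * nmidp_w) / cycles
    return dynamic + leakage_per_cycle_w


def mips(instructions, cycles, core_freq_ghz):
    """Millions of instructions per second: IPC times frequency in MHz."""
    return instructions / cycles * core_freq_ghz * 1000.0
```

Under this accounting, a core that runs faster than nominal shows both higher MIPS and higher dynamic power, which is one way the per-core spread in frequency and leakage would surface in the reported means and standard deviations.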