Dynamically Trading Frequency for Complexity in a GALS Microprocessor
Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott
University of Rochester

The gist of the paper…
• Radical idea: trade off frequency and hardware complexity dynamically at runtime rather than statically at design time
• The new twist: a Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile

Application phase behavior
• Applications exhibit varying behavior over time
• This can be exploited to save power
[Figure: gcc behavior per interval with an adaptive issue queue — L2 misses, L1I misses, L1D misses, branch mispredictions, IPC, and energy per interval]
[Sherwood, Sair, Calder, ISCA 2003] [Buyuktosunoglu, et al., GLSVLSI 2001]

What about performance?

  RAM delay:  entries         32     24     16     8
              relative delay  1.00   0.77   0.52   0.31

  CAM delay:  entries         32     24     16     8
              relative delay  1.00   0.77   0.55   0.34

• Lower power and faster access time! [Buyuktosunoglu, GLSVLSI 2001]

What about performance?
• How do we exploit the faster speed? Variable latency:
  – Increase frequency when downsizing
  – Decrease frequency when upsizing

What about performance?
[Figure: fully synchronous pipeline — Fetch Unit (L1 I-Cache, Br Pred), Dispatch/Rename/ROB, integer and FP issue queues with ALUs & RF, Ld/St Unit (L1 D-Cache), L2 Cache, and Main Memory, all driven by a single clock] [Albonesi, ISCA 1998]

What about performance?
[Figure: Avg TPI (ns), per benchmark and on average, across SPEC95 and several other applications] [Albonesi, ISCA 1998]
[Figure legend: Best Conventional vs. Process-level Adaptive]

Enter GALS…
[Figure: multiple clock domain (MCD) microarchitecture — Front-end domain (L1 I-Cache, Br Pred, Fetch Unit, Dispatch/Rename/ROB), Integer domain (Issue Queue, ALUs & RF), Floating-point domain (Issue Queue, ALUs & RF), Memory domain (Ld/St Unit, L1 D-Cache, L2 Cache), and External domain (Main Memory)]
[Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]

Outline
• Motivation and background
• Adaptive GALS microarchitecture
• Control mechanisms
• Evaluation methodology
• Results
• Conclusions and future work

Adaptive GALS microarchitecture
[Figure: the MCD microarchitecture with its resizable structures highlighted — the I-cache and branch predictor in the front-end domain, the issue queues in the integer and floating-point domains, and the D-cache and L2 cache in the memory domain]

Adaptive GALS operation
[Figure: the same microarchitecture in operation; each domain runs at the frequency its current structure sizes permit]

Resizable cache organization
• Access the A partition first, then the B partition on an A miss
• Swap the A and B blocks on an A miss / B hit
• Select the A/B split according to application phase behavior

Resizable cache control
Example: each access that hits at MRU position i increments counter MRU[i]

  MRU state (MRU → LRU)
  A B C D    access B: MRU[1]++
  B A C D    access C: MRU[2]++
  C B A D    access C: MRU[0]++
  C B A D    access D: MRU[3]++

The counters give the hit counts for every possible configuration:
• Config A1/B3: hitsA = MRU[0]; hitsB = MRU[1] + MRU[2] + MRU[3]
• Config A2/B2: hitsA = MRU[0] + MRU[1]; hitsB = MRU[2] + MRU[3]
• Config A3/B1: hitsA = MRU[0] + MRU[1] + MRU[2]; hitsB = MRU[3]
• Config A4/B0: hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hitsB = 0
Calculate the cost for each possible configuration:
• A access cost = (hitsA + hitsB + misses) × CostA
• B access cost = (hitsB + misses) × CostB
• Miss cost = misses × CostMiss
• Total
access cost = A + B + Miss (normalized to frequency)

Resizable issue queue control
• Measures the exploitable ILP for each queue size
• A timestamp counter is reset at the start of each interval and incremented every cycle
• During rename, a destination register is given a timestamp based on the timestamp plus execution latency of its slowest source operand
• The maximum timestamp, MAX_N, is maintained for each of the four possible queue sizes over N fetched instructions (N = 16, 32, 48, 64)
• ILP is estimated as N / MAX_N
• The queue size with the highest ILP (normalized to frequency) is selected

Resizable hardware – some details
Front-end domain
• I-cache "A": 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
• Branch predictor sized with the I-cache
  – gshare PHT: 16KB–64KB
  – Local BHT: 2KB–8KB
  – Local PHT: 1024 entries
  – Meta: 16KB–64KB
Load/store domain
• D-cache "A": 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
• L2 cache "A" sized with the D-cache
  – 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
Integer and floating-point domains
• Issue queue: 16, 32, 48, or 64 entries

Evaluation methodology
• SimpleScalar and Cacti
• 40 benchmarks from SPEC, MediaBench, and Olden
• Baseline: the best overall performing fully synchronous 21264-like design out of 1,024 simulated options
• Adaptive MCD costs imposed:
  – Additional branch penalty of 2 integer-domain cycles and 1 front-end-domain cycle (overpipelined)
  – Frequency penalty of as much as 31%
  – Mean PLL locking time of 15 µs
• Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program
• Phase-Adaptive: use the online cache and issue queue control mechanisms

Performance improvement
[Figure: performance improvement (−10% to 50%) of Program-Adaptive and Phase-Adaptive over the baseline for each benchmark — SPEC (wupwise, vpr, vortex, twolf, parser, mesa, gzip, gcc, galgel, equake, eon, crafty, bzip2, art, apsi), Olden (tsp, treeadd, power, perimeter, mst, health, em3d, bisort, bh), and MediaBench (mpeg2 decode/encode, mesa texgen/osodemo/mipmap, ghostscript, gsm decode/encode, g721 decode/encode, jpeg decompress/compress, epic decode/encode, adpcm decode/encode)]

Phase behavior – art
[Figure: issue queue entries (16, 32, 48, 64) selected over a 100-million-instruction window]

Phase behavior – apsi
[Figure: D-cache "A" size (32KB, 64KB, 128KB, 256KB) selected over a 100-million-instruction window]

Performance summary
• Program-Adaptive: 17% performance improvement
• Phase-Adaptive: 20% performance improvement
  – Automatic
  – Never degrades performance for the 40 applications
  – Few phases in the chosen application windows – could perhaps do better
• Distribution of chosen configurations for Program-Adaptive:

  Integer IQ:  16 entries: 85%    32: 5%            48: 5%        64: 5%
  FP IQ:       16 entries: 73%    32: 15%           48: 8%        64: 5%
  D/L2 cache:  32KB/256KB: 50%    64KB/512KB: 18%   128KB/1MB: 23%   256KB/2MB: 10%
  I-cache:     16KB: 55%          32KB: 18%         48KB: 8%      64KB: 20%

Domain frequency versus IQ size
[Figure: relative domain frequency (0.4–1.8) versus issue queue size (16, 32, 48, 64)]

Conclusions
• Application phase behavior can be exploited to improve performance in addition to saving power
• The GALS approach is key to localizing the impact of slowing the clock
• The cache and queue control mechanisms can evaluate all possible configurations within a single interval
• The phase-adaptive approach improves performance by as much as 48% and by an average of 20%

Future work
• Explore multiple adaptive structures in each domain
• Better account for the branch predictor
• Resize the instruction cache by sets rather than ways
• Explore better issue queue design alternatives
• Build circuits
• Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores
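Appendix sketch: the cost model on the "Resizable cache control" slides can be made concrete in a few lines. The Python sketch below is illustrative only — the MRU counters, the A/B/miss cost formulas, and the normalization to frequency follow the slides, but the function name and the specific cost and relative-frequency numbers are my own assumptions, not values from the talk.

```python
# Sketch of the resizable-cache configuration choice: every access probes
# the A partition; A misses also probe B; true misses pay the miss cost.
# Costs are divided by the relative frequency the configuration permits,
# since a smaller A partition lets the domain clock faster.

def pick_cache_config(mru, misses, cost_a, cost_b, cost_miss, rel_freq):
    """mru: hits at MRU positions 0..3 during the interval.
    cost_a[i], cost_b[i]: per-access cost of the A and B partitions when
    A holds i+1 ways (configs A1/B3 .. A4/B0).
    rel_freq[i]: relative domain frequency each configuration allows.
    Returns the index (0..3) of the cheapest configuration."""
    best, best_cost = None, float("inf")
    for i in range(4):
        hits_a = sum(mru[: i + 1])        # MRU positions 0..i would hit in A
        hits_b = sum(mru[i + 1 :])        # remaining positions would hit in B
        cost = (hits_a + hits_b + misses) * cost_a[i]  # every access probes A
        cost += (hits_b + misses) * cost_b[i]          # A misses also probe B
        cost += misses * cost_miss                     # true misses
        cost /= rel_freq[i]               # normalize to achievable frequency
        if cost < best_cost:
            best, best_cost = i, cost
    return best

# Illustrative numbers (assumed): hits concentrated at MRU position 0 favor
# the smallest, fastest A partition.
config = pick_cache_config(
    [900, 60, 25, 15], misses=40,
    cost_a=[1.0, 1.3, 1.6, 2.0], cost_b=[3.0, 2.5, 2.0, 0.0],
    cost_miss=20.0, rel_freq=[1.31, 1.15, 1.05, 1.0])
print(config)  # → 0 (keep the smallest A partition)
```

Because the MRU counters record which position each hit would have landed in, one interval of counting suffices to evaluate all four splits at once — the property the conclusions slide highlights.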
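Appendix sketch: the timestamp scheme on the "Resizable issue queue control" slide can likewise be sketched. In this hedged Python version (the instruction representation, function names, and relative-frequency values are my own assumptions), each renamed instruction's completion timestamp is the maximum of its source operands' timestamps plus its latency, and the estimated ILP over a window of N instructions is N divided by the largest timestamp seen.

```python
# Sketch of the timestamp-based ILP estimator: dest timestamp =
# max(source timestamps) + latency; ILP over N instructions ≈ N / MAX_N.

def estimate_ilp(instructions):
    """instructions: (dest, sources, latency) tuples in fetch order.
    Returns {N: estimated ILP} for the four candidate queue sizes."""
    ts = {}                                 # register -> completion timestamp
    max_ts, ilp = 0, {}
    for i, (dest, srcs, lat) in enumerate(instructions, start=1):
        start = max((ts.get(s, 0) for s in srcs), default=0)
        ts[dest] = start + lat
        max_ts = max(max_ts, ts[dest])
        if i in (16, 32, 48, 64):           # the four candidate queue sizes
            ilp[i] = i / max_ts             # MAX_N at this window size
    return ilp

def pick_queue_size(ilp, rel_freq):
    """Pick the queue size with the highest ILP normalized to the relative
    domain frequency that size allows (rel_freq values are assumed)."""
    return max(ilp, key=lambda n: ilp[n] * rel_freq[n])

# A pure dependence chain has ILP ≈ 1 at every window size, so the smallest
# (and fastest-clocked) queue wins.
chain = [(f"r{i}", [f"r{i-1}"] if i > 0 else [], 1) for i in range(64)]
print(pick_queue_size(estimate_ilp(chain), {16: 1.6, 32: 1.3, 48: 1.1, 64: 1.0}))  # → 16
```

This mirrors the talk's trade-off: a program with little exploitable ILP gains nothing from a larger queue, so the controller shrinks it and banks the frequency gain instead.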