A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao and Bevan Baas VLSI Computation Lab University of California, Davis Outline • Background and the First Generation AsAP • The Second Generation AsAP – Processors and Shared Memories – On-chip Communication – DVFS • Analysis and Summary Project Motivation • Fully programmable and reconfig. architecture • High energy efficiency and performance • Exploit task-level parallelism in: – Digital Signal Processing – Multimedia CFO Estimation from ADC Frame Detection Timing Synch CORDIC Rotation 802.11a Baseband Receiver Guard Removing Energy Computing De-interleaver 2 FFT Subcarrier Reordering De-interleaver 1 Constellation Demapping Channel Equalizer De-puncturing Viterbi Decoder De-scrambling Channel Estimation Pad Removing to MAC layer Asynchronous Array of Simple Processors (AsAP) • Key Ideas: – Programmable, small, and simple fine-grained cores – Small local memories sufficient for DSP kernels – Globally Asynchronous and Locally Synchronous (GALS) clocking • Independent clock frequencies on every core • Local oscillator halts when processor is idle Osc IMem Datapath DMem Asynchronous Array of Simple Processors (AsAP) • Key Ideas, con’t: – 2D mesh, circuit-switched network architecture • Nearest-neighbor communication only • Low area overhead • Easily scalable array – Increased tolerance to process variations • 36-processor fully-functional chip, 0.18 µm, 610 MHz @ 2.0 V, 0.66 mm2 per processor, 802.11a/g tx consumes 407 mW @ 300 MHz [ISSCC 06, HotChips 06, IEEE Micro 07, TVLSI 07, JSSC 08,…] Outline • Background and the First Generation AsAP • The Second Generation AsAP – Processors and Shared Memories – On-chip Communication – DVFS • Analysis and Summary New Challenges Addressed • Reduction in the power dissipation of – Lightly-loaded processors (lowering Vdd) – Unused processors (leakage) • Achieving very high efficiencies and speed on common demanding tasks such as FFTs, video motion estimation, and Viterbi decoding • Larger, area efficient on-chip memories • Efficient, low overhead communication between distant processors 167-processor Computational Platform • Key features – 164 Enhanced programmable processors – 3 Dedicated-purpose processors – 3 Shared memories – Long-distance circuit-switched communication network – Dynamic Voltage and Frequency Scaling (DVFS) Config. and Test Osc Core DVFS Comm Tile FFT Motion Estimation Viterbi 16 KB Shared Decoder Memories Homogenous Processors • 164 in-order, single-issue, 6-stage processors – – – – 16-bit datapath with RISC-like instructions 128x16-bit data memory (DMEM) 128x35-bit instruction memory (IMEM) Two 64x16-bit FIFOs for inter-processor communication IF PC MemRd ID FIFO 1 FIFO 1 Wrt Rd FIFO 0 FIFO 0 Wrt Rd EXE 1 EXE 2 FIFO Wrt Bypass DMem Wrt ALU IMem Instr Decode AdrGen DMem Rd Multiplier DC Mem Rd WB Acc Block Float Point DC Mem Wrt Homogenous Processors • Over 60 basic instructions – Add, Sub, Logic, Multiply, MAC, Branch, … • New instructions and features – Min/Max, Byte-Sub/Add, Absolute value, Fixed-to-Float conversion assist – Jump/Return (function support) – Zero Overhead Looping (block repeat) – Conditional Execution (predicating) – Block Floating Point • Floating point CORDIC square root requires 2.9x fewer cycles compared to first generation AsAP • Preliminary results from one chip: 1.2 GHz, 59 mW, 1.3 V, 100% active MAC/ALU operations Fast Fourier Transform (FFT) MEM MEM MEM O F MEM MEM MEM MEM 1.23 mm • Continuous flow architecture with a single radix-4,2 butterfly • Runtime configurable from 16-pt to 4096-pt transforms, FFT and IFFT • 760,000 1024-pt complex FFTs/sec @ 989 MHz, 1.3 V • 1.01 mm2 • Preliminary measurements functional at 866 MHz, 34.97 mW @ 1.3 V 0.82 mm MEM Viterbi Decoder • 8 Add-Compare-Select (ACS) units • Highly configurable • 72 Mbps @ 789 MHz, 1.3 V for rate = 1/2 • 0.17 mm2 • Preliminary measurements functional at 894 MHz, 17.55 mW @ 1.3 V MEM 0.41 mm – Up to 32 different rates, including 1/2 and 3/4 – Decode codes up to constraint length 10 0.41 mm O F Motion Estimation for Video Encoding 0.82 mm O F MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM 0.82 mm • Supports a number of fixed and programmable search patterns • Supports all H.264 specified block sizes within a 48x48 search range • 14 billion SADs/sec @ 880 MHz, 1.3 V; supports 1080p HDTV @ 30fps • 0.67 mm2 • Preliminary measurements functional at 938 MHz, 196.17 mW @ 1.3 V Shared Memories • Ports for up to four processors (two connected in this chip) to directly connect to the block, which provides • Uses a 16 KByte single-ported SRAM • 1.28 GHz operation, 1.3 V – One read or write per cycle – 20.5 Gbps peak throughput • 0.34 mm2 • Preliminary measurements functional at 1.3 GHz, 4.55 mW @ 1.3 V SRAM 0.82 mm – Port priority – Port request arbitration – Programmable address generation supporting multiple addressing modes 0.41 mm F F F F F F F F O Inter-Processor Communication • Circuit-switched source-synchronous communication – Each link has a clk, 16-bit data bus, valid, and request – Core can • Write to any combination of the 8 outputs under software control • Read from any 2 of the 8 inputs using statically configured FIFOs Core FIFO 0 FIFO 1 clk data valid request Comm Tile Long-Distance Communication • Allows communication across tiles without disturbing cores – Long-distance links may be pipelined or not clk data Switch • Depending on: source clock frequency, distance, and latency Source Destination Osc Osc F I F O F I F O Datapath Comm Osc F I F O Datapath Comm Datapath Comm Per-Processor DVFS • Each processor tile contains a core that operates at: – A fully-independent clock frequency • Any frequency below maximum • Halts, restarts, and changes arbitrarily – Dynamically-changeable supply voltage • VddHigh or VddLow • Disconnected for leakage reduction • VddAlwaysOn powers DVFS and inter-processor communication VddHigh VddLow VddOsc VddAlwaysOn control_high DVFS control_low Controller control_freq VddCore Osc Core GndOsc GndCom Tile Comm DVFS Controller • Voltage and frequency is set by: – Static configuration – Software DVFS Controller – Hardware (controller) • FIFO “fullness” • Processor “stalling frequency” VddAlwaysOn Voltage Switching Circuit FIR/IIR Filter VddLow FIFO_utilization PMOS_VddHigh PMOS_VddLow config_volt Freq. & Voltage Selector Stall Counter VddHigh VddCore osc_frequency DVFS_config Osc Config DVFS_software stall F I F O Proc Core VddOsc 2% GndCom 19% GndOsc 2% VddHigh 26% Vdd AlwaysOn 6% VddCore 19% Core Tile 340 μm 410 μm VddLow 26% Power Gates FIFO 1 DMem FIFO 0 • Vdd/Gnd metal 6 and 7 usage: Osc IMem – 5 Vdds: 79% utilization – 2 Gnds: 21% utilization 410 μm 360 μm • 48 power gates surround core • Metal 6 and 7 are devoted to power distribution— global and local Power Gates and Decaps Tile Layout and Power Grids Supply Voltage Switching • The switching speed and profile “shapes” supply currents while switching to tradeoff switching time versus power grid noise • Processor cores normally halt during a switch 1.3V switch done fastest medium slowest 0V 1.3V fastest VddHigh medium 1.22V 50ns clock slowest 51ns 52ns 53ns 54ns VddHigh noise when VddCore switches from VddLow to VddHigh Supply Voltage Switching • Slow switching results in negligible power grid noise • Early VddCore disconnect from VddLow results in momentary core voltage drop (circled below) 1.3V 0.9V 2 ns Outline • Background and the First Generation AsAP • The Second Generation AsAP – Processors and Shared Memories – On-chip Communication – DVFS • Analysis and Summary Tile and Core Area Breakdowns • Communication area approximately 7% • DVFS area approximately 8% • Routing complexity results in 27% for gaps and fillers I/O & Route Empty 5% Spaces & Fillers 11% DVFS 8% Core 73% Tile Area Clk Tree & Buffers 2% Test & Config 1% Decaps, Gaps & Fillers 21% IM em 11% SRAMs 34% DM em 13% Osc 3% FIFOs 10% Logic 29% DFFs 13% Core Area Die Micrograph and Key Data 55 million transistors, 39.4 mm2 • 65 nm STMicroelectronics low-leakage CMOS 410 μm Single Tile 0.17 mm2 Max. frequency 1.19 GHz @ 1.3 V Power (100% active) 59 mW @ 1.19 GHz, 1.3 V Power (JPEG) 5.939 mm Area 410 μm Transistors 325,000 608 μW @ 66 MHz, 0.675 V 6.3 mW @ 1.095 GHz, 1.3V Mot. Est. Mem MemMem Vit 5.516 mm FFT Complete 802.11a Baseband Receiver • 22 processors – 32 processors using only nearest-neighbor connections (46% increase) Vit. FFT • 54 Mbps throughput 75 mW @ 610 MHz, 1.3 V • 6x faster than TI C62x, 2x faster than SODA, 4x faster than LART 802.11a with long-distance (all scaled to 65 nm technology, 1.3 V and 610 MHz) 802.11a without long-distance FFT Vit. Summary • 65 nm low power ST Microelectronics process • Maintains the basic GALS architecture of AsAP • 164 homogenous processors – 1.2 GHz, 59 mW, 100% active @ 1.3 V • Three 16 KB shared memories • Three dedicated-purpose processors • Long-distance circuit-switched communication increases mapping efficiency without overhead • DVFS nets a 48% reduction in energy for JPEG with only 8% performance loss Acknowledgements • • • • • • • • • • STMicroelectronics Intel UC Micro NSF Grant 430090 and CAREER award 546907 SRC GRC Intellasys SEM J.-P. Schoellkopf K. Torki and S. Dumont R. Krishnamurthy and M. Anders