ECE260B – CSE241A Winter 2005
Design Styles
Multi-Vdd/Vth Designs
Website: http://vlsicad.ucsd.edu/courses/ece260b-w05
ECE 260B – CSE 241A Design Styles, http://vlsicad.ucsd.edu

The Design Problem
• A growing gap between design complexity and design productivity
Source: sematech97

Design Methodology
• The design process iterates between three abstractions: behavior, structure, and geometry
• Each of these steps is increasingly automated

Behavioral Description of Accumulator

    entity accumulator is
      port (
        DI  : in integer;
        DO  : inout integer := 0;
        CLK : in bit
      );
    end accumulator;

    architecture behavior of accumulator is
    begin
      process (CLK)
        variable X : integer := 0;  -- intermediate variable
      begin
        if CLK = '1' then
          X := DO + DI;  -- variable assignment, takes effect immediately
          DO <= X;
        end if;
      end process;
    end behavior;

• Design described as a set of input-output relations, regardless of the chosen implementation
• Data described at a higher abstraction level (“integer”)

Structural Description of Accumulator

    entity accumulator is
      port (  -- definition of input and output terminals
        DI  : in bit_vector(15 downto 0);     -- a vector 16 bits wide
        DO  : inout bit_vector(15 downto 0);
        CLK : in bit
      );
    end accumulator;

    architecture structure of accumulator is
      component reg  -- definition of register ports
        port (
          DI  : in bit_vector(15 downto 0);
          DO  : out bit_vector(15 downto 0);
          CLK : in bit
        );
      end component;
      component add  -- definition of adder ports
        port (
          IN0  : in bit_vector(15 downto 0);
          IN1  : in bit_vector(15 downto 0);
          OUT0 : out bit_vector(15 downto 0)
        );
      end component;
      -- definition of accumulator structure
      signal X : bit_vector(15 downto 0);
    begin
      add1 : add port map (DI, DO, X);    -- defines port connectivity
      reg1 : reg port map (X, DO, CLK);
    end structure;

• Design defined as a composition of register and full-adder cells
  (“netlist”)
• Data represented as {0,1,Z}
• Time discretized and progresses with unit steps
• Description language: VHDL
• Other options: schematics, Verilog

Implementation Methodologies
Digital Circuit Implementation Approaches
  Custom
  Semi-custom
    Cell-Based
      Standard Cells
      Compiled Cells
      Macro Cells
    Array-Based
      Pre-diffused (Gate Arrays)
      Pre-wired (FPGA)

Full Custom
• Hand drawn geometry
• All layers customized
• Digital and analog
• Simulation at transistor level
• High density
• High performance
• Long design time
[Figure: Magic Layout Editor (UC Berkeley)]

Symbolic Layout
• Dimensionless layout entities
• Only topology is important
• Final layout generated by “compaction” program
[Figure: stick diagram of inverter, with VDD, GND, In, and Out labeled]

Standard Cells
• Organized in rows
• All layers customized
• Cells made as full custom by vendor (not user)
• Digital with possible special analog cells
• Simulation at gate level (digital)
• Medium-high density
• Medium-high performance
• Reasonable design time
• Routing channel requirements are reduced by the presence of more interconnect layers
[Figure labels: Rows of Cells, Feedthrough Cell, Logic Cell, Routing Channel, Functional Module (RAM, multiplier, …)]

Standard Cell: Example
[Figure: standard-cell layout example, from Brodersen92]

Standard Cell: Example
• 3-input NAND cell (from Mississippi State Library), characterized for fanout of 4 and for three different technologies

Automatic Cell Generation
• Random-logic layout generated by the CLEO cell compiler (Digital)

Module Generators: Compiled Datapath
[Figure labels: buffer, adder, reg1, reg0, mux, bus0, bus1, bus2, routing area, feed-through, bit-slice]
Advantages: One-dimensional
placement/routing problem

Macrocell-Based Design
• Predefined macro blocks (uP, RAM, etc.)
• Macro blocks made as full custom by vendor (IP blocks)
• All layers customized
• Digital and some analog
• Simulation at behavior or gate level
• High density
• High performance
• Short design time
• Use standard on-chip busses
• “System on a chip” (SOC)
[Figure labels: Macrocell, Interconnect Bus, Routing Channel]

Macrocell Design Methodology
• Floorplan: defines the overall topology of the design, the relative placement of modules, and the global routes of busses, supplies, and clocks
[Figure: video-encoder chip, from Brodersen92, with SRAM blocks, data paths, and standard cells labeled]

Gate Array
• Predefined transistors connected via metal
• Two types: channel based, sea of gates
• Only metal layers customized
• Fixed array sizes
• Digital cells in library
• Simulation at gate level (digital)
• Medium density
• Medium performance
• Reasonable design time
[Figure labels: rows of uncommitted cells, routing channel]

Gate Array: Primitive Cells
[Figure: uncommitted cell and committed cell (4-input NOR); labels: polysilicon, metal, possible contact, VDD, GND, In1, In2, In3, In4, Out]

Sea-of-gates
[Figure: LSI Logic LEA300K (0.6 µm CMOS), with Random Logic and Memory Subsystem regions labeled]

Prewired Arrays
• Programmable logic blocks
• Programmable connections between logic blocks
• No layers customized (standard devices)
• Digital only
• Low-medium performance
• Low-medium density
• Programmable: SRAM, EPROM, Flash, Anti-fuse, etc.
• Easy and quick design changes
• Cheap design tools
• Low development cost
• High device cost
• NOT a real ASIC
Courtesy Altera Corp.
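A programmable logic block in an SRAM-programmed array is commonly built as a lookup table (LUT): the stored configuration bits are simply the truth table of the desired function. The sketch below only illustrates that idea. The 4-input width and the names `program_from_function` and `make_lut` are assumptions for this example, not taken from the slides or from any vendor tool.

```python
from itertools import product

def program_from_function(func, n_inputs=4):
    """Return the configuration bits (truth table) of func over n_inputs bits."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]

def make_lut(config_bits, n_inputs=4):
    """Build a LUT: the inputs form an address into the stored configuration bits."""
    if len(config_bits) != 2 ** n_inputs:
        raise ValueError("need 2**n_inputs configuration bits")

    def lut(*inputs):
        address = 0
        for bit in inputs:  # the first input is the most significant address bit
            address = (address << 1) | (bit & 1)
        return config_bits[address]

    return lut

# The same "hardware" is programmed first as a 4-input NOR, then as 4-input parity.
nor4 = make_lut(program_from_function(lambda a, b, c, d: not (a or b or c or d)))
xor4 = make_lut(program_from_function(lambda a, b, c, d: a ^ b ^ c ^ d))
```

Changing the function means rewriting 16 stored bits rather than any mask layer, which is why design changes are quick and no layers are customized.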
Programmable Logic Devices
• PLA
• PROM
• PAL

Field-Programmable Gate Arrays: Fuse-based
• Standard-cell like floorplan
[Figure labels: I/O Buffers (on all four sides), Program/Test/Diagnostics, Vertical routes, Rows of logic modules, Routing channels]

Interconnect
• Programming interconnect using anti-fuses
[Figure labels: Programmed interconnection, Input/output pin, Cell, Antifuse, Horizontal tracks, Vertical tracks]

Field-Programmable Gate Arrays: RAM-based
[Figure labels: CLB, switching matrix, Horizontal routing channel, Vertical routing channel, Interconnect point]

RAM-based FPGA: Basic Cell (CLB)
• Combinational logic: two function generators (F and G), each implementing any function of up to 4 variables
• Storage elements: two flip-flops (outputs Q1 and Q2) with clock and clock enable (CE)
[Figure: CLB schematic with inputs A, B/Q1/Q2, C/Q1/Q2, D, Din, R, and Clock. Courtesy of Xilinx]

RAM-based FPGA
[Figure: Xilinx XC4025]

High Performance Devices
• Mixture of full custom, standard cells, and macros
• Full custom for special blocks: adder (data path), etc.
• Macros for standard blocks: RAM, ROM, etc.
• Standard cells for non-critical digital blocks

Global Signaling and Layout
• Global signaling and layout optimization
• Multi-Vdd
• Static power analysis
• Multi-Vth + Vdd + sizing
Source: D. Sylvester, DAC-2001

Global Signaling
• Current global signaling paradigm: insert large static CMOS repeaters to reduce wire RC delay
• Impending problems:
  - Too many repeaters
    - 180nm processors: 22K repeaters (Itanium), 70K (Power4)
    - Projected: 1-1.5M repeaters at 45-65nm technologies
  - Too much power
    - Many large repeaters = significant static and dynamic power
  - Too much noise
    - Repeater clustering complicates power distribution
    - Inductive coupling across wide bus structures

Cell Layout Optimization
• Advanced layout techniques must allow:
  - Continuous individual device sizing
  - Variable p/n ratios
  - Tapered FET stacking sizes
  - Arbitrary Vth assignments within gates
• First cut: Cadabra
  - 15-22% power reduction using the first two approaches under a fixed footprint constraint
  - Optimize specific instances of standard gates
Ref: Hurat, Cadabra
[Figure labels: GDSII Import, Compact fixed width]

Multi-Vdd
• Global signaling and layout optimization
• Multi-Vdd
• Static power analysis
• Multi-Vth + Vdd + sizing

Multi-Vdd Status
• Idea: incorporate two Vdd's to reduce dynamic power
• Limited to a few recent Japanese multimedia processors
  - Example: 0.3 µm, 75MHz, 3.3V media processor (Toshiba)
  - Total power savings of 47% in logic, 69% in clock
• Dynamic voltage scaling of mobile processors
  - Transmeta Crusoe, Intel Speedstep, etc.
  - Not considered in this talk
• Very powerful technique, currently applied only in low-performance designs
• Mentality: today's high performance parts aren't “limited” by power
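The global signaling paradigm described earlier (repeaters inserted to reduce wire RC delay) can be illustrated with a first-order Elmore-style delay model. This is a sketch under stated assumptions: the 0.69 and 0.38 coefficients are the usual lumped and distributed RC constants, every repeater is the same size, repeater intrinsic delay is ignored, and all wire and device values in the usage note are made up for illustration.

```python
def unrepeated_delay(r_wire, c_wire, r_drv, c_load):
    """Approximate 50% delay (s) of one driver, a distributed RC wire, and a load."""
    return (0.69 * r_drv * (c_wire + c_load)   # driver charging wire and load
            + 0.38 * r_wire * c_wire           # distributed wire RC (grows as length squared)
            + 0.69 * r_wire * c_load)          # wire resistance charging the load

def repeated_delay(r_wire, c_wire, r_drv, c_load, k):
    """Delay when the wire is cut into k equal segments, each with its own repeater."""
    return k * unrepeated_delay(r_wire / k, c_wire / k, r_drv, c_load)

def best_repeater_count(r_wire, c_wire, r_drv, c_load, k_max=200):
    """Brute-force search for the segment count with the smallest delay."""
    return min(range(1, k_max + 1),
               key=lambda k: repeated_delay(r_wire, c_wire, r_drv, c_load, k))

# Assumed example: 10 mm wire (1 kOhm, 2 pF total), 200 Ohm repeater, 50 fF gate load.
WIRE = dict(r_wire=1000.0, c_wire=2e-12, r_drv=200.0, c_load=50e-15)
k_opt = best_repeater_count(**WIRE)
speedup = unrepeated_delay(**WIRE) / repeated_delay(k=k_opt, **WIRE)
```

Segmenting turns the quadratic wire term into a sum of k short quadratic terms, at the cost of k extra drivers. That cost is where the repeater count, power, and noise concerns come from.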
Lower Power Via Rich Replacement
• Media processors and other low speed designs have many non-critical paths
  - 60-70% of paths have delay below half the clock period
• After replacement, most paths become near critical
• What about high-speed microprocessors?
[Figure: % of total paths versus path delay (normalized to clock period)]

Similar Story For High-Performance
• IBM 480 MHz PowerPC shows over 50% of paths have delay less than half the clock period
• Implies that high-performance designs can benefit from multi-Vdd
Ref: Akrout, JSSC98

Resizing Is Not The Right Answer
• Post-synthesis optimizations resize gates to recover power on non-critical paths
• Looks similar to the pre- and post-replacement figures in the media processor…
• This is the wrong approach for nanometer design!
[Figure panels: before post-synthesis resizing, after post-synthesis resizing]
Ref: Sirichotiyakul, DAC99

Multi-Vdd Instead of Sizing
• Power ~ C Vdd^2 f, where f is fixed
• Key: reducing gate width impacts power sub-linearly
  - Interconnect capacitance is not affected
• Reducing supply voltage cuts power quadratically
  - All capacitive loads have lower voltage swing
• How can we minimize the delay penalty at low Vdd?

Challenges For Multi-Vdd
• Area overhead
  - Toshiba reported a 7% rise in area due to placement restrictions, level converters, and additional power grid routing
• EDA tool support for the above issues (placement, dual power routing)
• Noise analysis
  - Additional shielding required between Vdd,low and Vdd,high signals?
  - Including the clock network
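A small worked example of the "Power ~ C Vdd^2 f" argument: downsizing only shrinks the gate part of the switched capacitance, while a lower supply scales the whole load quadratically. The capacitance split, the supply values, and the helper names below are illustrative assumptions, not data from the slides.

```python
def dynamic_power(c_gate, c_wire, vdd, freq):
    """Dynamic power ~ C * Vdd^2 * f, with C split into gate and wire parts."""
    return (c_gate + c_wire) * vdd ** 2 * freq

def saving(reference, candidate):
    """Fractional power saving of candidate relative to reference."""
    return 1.0 - candidate / reference

FREQ = 1e9  # fixed clock frequency (assumed)

# Assumed load: half gate capacitance, half interconnect capacitance.
base = dynamic_power(c_gate=10e-15, c_wire=10e-15, vdd=1.2, freq=FREQ)
# Halving the gate width halves only the gate capacitance.
downsized = dynamic_power(c_gate=5e-15, c_wire=10e-15, vdd=1.2, freq=FREQ)
# Moving the same gate to a supply of 0.7 x Vdd scales every capacitor's swing.
low_vdd = dynamic_power(c_gate=10e-15, c_wire=10e-15, vdd=0.84, freq=FREQ)

print(f"downsizing saves {saving(base, downsized):.0%}")
print(f"low Vdd saves    {saving(base, low_vdd):.0%}")
```

With these assumed numbers, halving the width saves 25% because the wire capacitance is untouched, while a 30% supply reduction saves about 51%.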
Static Power
• Global signaling and layout optimization
• Multi-Vdd
• Static power
• Multi-Vth + Vdd + sizing

Static Power
• Why do we care about static power in non-portable devices?
  - Standby power is “wasted”: it leaves fewer watts for computation
  - Worsens reliability by raising die temperatures
• Leakage current is a function of Vth and subthreshold swing (Ss) (x10 at operating vs. room temp!)
  $I_{off} \approx 10\,\mu\mathrm{A}/\mu\mathrm{m} \times 10^{-V_{th}/S_s}$
• Ss expected to remain at 80-85 mV/dec (room temp)
  - Device technology may cut this by ~20%
• Vth reductions are mandated by scaling Vdd
  - Vth has been around Vdd/5

Leakage Suppression Approaches
• Dual-Vth (most common)
  - Low-Vth on critical paths, high-Vth off critical paths
  - Only cost is additional masks
• MTCMOS
  - Series inserted high-Vth device cuts leakage current when off (sleep mode)
  - Delay and area penalties; control device sizing is critical
• Other techniques
  - Substrate biasing to control Vth
  - Dual-Vth domino: use low-Vth devices only in evaluate paths
[Figure: MTCMOS gate; labels: Vdd, Pull Up, Vout, Pull Down, Parasitic Node, High Vth Device, Vcontrol]

Can gate-length biasing help leakage reduction?
• Reduce leakage?
[Figure: variation of leakage and delay (each normalized to 1) versus gate length, 130 nm to 140 nm, for an NMOS device in an industrial 130nm technology]
• Reduce leakage variability?
[Figure labels: Leakage, Gate-length, Leakage Variability, Biasing]

Gate-length Biasing
• First proposed by Sirisantana et al.
  - Comparative study of the effect of doping, tox, and gate-length
  - Large bias used, significant slow down
• Small bias
  - Little reduction in leakage beyond 10% bias, while delay degrades linearly
  - Preserves pin compatibility
  - Technique applicable as a post-RET step
• Salient features
  - Design cycle is not interfered with
  - Zero cost (no additional masks)

Granularity
• Technology-level
  - All devices in all cells have one biased gate-length
• Cell-level
  - All devices in a cell have one biased gate-length
• Device-level
  - All devices have independent biased gate-lengths
  - Simplification: in each cell, NMOS devices have one gate-length and PMOS devices have another

Device-Level Leakage Reduction
• Leakage saving with a delay penalty of up to 10% (simplified device-level biasing)
[Figure: bar chart of leakage saving (axis 0 to 40) for INVX4, NANDX4, BUFX4, and ANDX6, with series Low Vt, Nom Vt, and High Vt]

Circuit level
• Bias gate-length for non-critical cells
• Library extended with each cell having a biased version
• Benefits analyzed in conjunction with multi-VT assignment and in isolation
  - Configurations: SVT-SGL, DVT-SGL, SVT-DGL, DVT-DGL

Results: Leakage Reduction
[Figure: normalized leakage (0 to 1) for SVT-SGL, SVT-DGL, DVT-SGL, and DVT-DGL on c5315, c6288, c7552, and alu128]
• With less than 2.5% delay penalty
• Design Compiler used for VT assignment and gate-length biasing
• Better results expected with Duet (academic sizer from Michigan)

Multi-Vth + Vdd + Sizing
• Global signaling and layout optimization
• Multi-Vdd
• Static power analysis
• Multi-Vth + Vdd + sizing
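Cell-level biasing of non-critical cells can be sketched as a greedy swap on a single timing path: replace a cell with its longer-gate-length variant whenever the path still meets the clock period. This is a hypothetical illustration, not the flow used for the reported results (those used Design Compiler). The `Cell` record, `bias_path`, and all delay and leakage numbers are made up, and a real flow would re-time the whole netlist rather than one path.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    name: str
    delay: float           # delay at nominal gate length (ps)
    leakage: float         # leakage at nominal gate length (nW)
    biased_delay: float    # delay of the biased (longer gate length) variant (ps)
    biased_leakage: float  # leakage of the biased variant (nW)

def bias_path(cells, clock_period):
    """Greedily swap cells on one path to biased variants while timing is met."""
    path_delay = sum(c.delay for c in cells)

    def benefit(c):
        # Leakage saved per picosecond of added delay (guard against zero penalty).
        return (c.leakage - c.biased_leakage) / max(c.biased_delay - c.delay, 1e-12)

    biased = set()
    for c in sorted(cells, key=benefit, reverse=True):
        extra = c.biased_delay - c.delay
        if path_delay + extra <= clock_period:
            path_delay += extra
            biased.add(c.name)
    return biased, path_delay

# Made-up numbers for three cells on one path with a 79 ps budget.
path = [
    Cell("INVX4", 20.0, 10.0, 21.0, 7.0),
    Cell("NANDX4", 30.0, 20.0, 32.0, 13.0),
    Cell("BUFX4", 25.0, 15.0, 27.0, 12.0),
]
chosen, final_delay = bias_path(path, clock_period=79.0)
```

With this budget the two swaps with the best leakage-per-delay ratio are taken, and the third is rejected because it would exceed the clock period.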
Multi-Everything
• Need an approach that selects between speed, static power, and dynamic power
• Should be scalable to nanometer design
  - Rules out dual-Vth domino or other dynamic logic families (low supplies kill performance advantages)
• Techniques mentioned so far
  - Flexible, optimized cell layouts
  - Multi-Vdd
  - Dual-Vth
• Put them all together

Multi-Vdd Can Leverage Vth's
• Existing designs using multi-Vdd do not alter Vth in low-Vdd cells
  - Highly sub-optimal, delay is fully penalized
  - Limits cell replacement, which limits power savings
• Much better solution: reduce Vth in low-Vdd cells to carefully balance delay, static power, and dynamic power
• Enforce technology scaling within a chip: whenever we reduce Vdd, we also reduce Vth to maintain speed

Multi-Vdd + Vth Negates Delay Penalty
• Delay ~ C Vdd / Ion
• Scenarios
  - Constant Vth (current paradigm)
  - Scale Vth to maintain constant static power
  - Scale Vth to reduce static power linearly with Vdd
• Delay penalty is substantially offset
  - Ion is very sensitive to Vth at Vdd < 1V
• Pstatic reduces with Vdd due to the linear term and smaller Ioff (Ion and DIBL)
[Figure: normalized delay versus Vdd (V) at 35 nm, nominal Vdd = 0.6V; curves: Constant Vth (0.11V); Scaled Vth, Constant Pstatic; Conservatively Scaled Vth]
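The trade-off between the constant-Vth and scaled-Vth scenarios can be explored with a simple model. This is a sketch under stated assumptions: Ion is taken to follow an alpha-power law, Ion ~ (Vdd - Vth)^alpha with alpha = 1.3, which is not given in the slides. Ioff follows the 10^(-Vth/Ss) dependence with Ss = 85 mV/dec, DIBL is ignored, and the nominal point (0.6 V, 0.11 V) is taken from the plot labels.

```python
import math

ALPHA = 1.3              # assumed alpha-power-law exponent (not from the slides)
SS = 0.085               # subthreshold swing in V/decade (slides quote 80-85 mV/dec)
VDD0, VTH0 = 0.6, 0.11   # nominal operating point from the plot labels

def delay(vdd, vth):
    """Delay ~ C*Vdd/Ion, normalized to the nominal point; Ion ~ (Vdd - Vth)^ALPHA."""
    return (vdd / VDD0) * ((VDD0 - VTH0) / (vdd - vth)) ** ALPHA

def static_power(vdd, vth):
    """Pstatic ~ Vdd*Ioff with Ioff ~ 10^(-Vth/Ss), normalized to the nominal point."""
    return (vdd / VDD0) * 10 ** (-(vth - VTH0) / SS)

def vth_for_constant_pstatic(vdd):
    """Vth that keeps Vdd*Ioff equal to its nominal value as Vdd is lowered."""
    return VTH0 + SS * math.log10(vdd / VDD0)

low_vdd = 0.4
scaled_vth = vth_for_constant_pstatic(low_vdd)
penalty_constant_vth = delay(low_vdd, VTH0)
penalty_scaled_vth = delay(low_vdd, scaled_vth)
```

In this model, lowering Vth along the constant-Pstatic curve recovers part of the delay lost at the low supply without raising static power above its nominal value. The size of the benefit depends on the assumed alpha and Ss.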
Now Add Sizing
• Multi-Vdd + multi-Vth + sizing/cell layout optimization attacks power from many angles (multi-dimensional)
• Depending on criticality and switching activities, non-critical gates can be:
  - Assigned Vdd,low
  - Assigned Vdd,low + lower Vth
  - Assigned Vth,high
  - Downsized (at the individual transistor level if advantageous)
  - Assigned Vdd,low and upsized
    - For gates that cannot tolerate the Vdd,low delay, this can be power efficient
  - And others

Summary
• Power density must saturate to maintain affordable packaging options
  - 50 W/cm^2 means 200-250W for future large MPUs
  - Dynamic thermal management saves 25% on packaging power budget
• Multi-Vdd will leverage multiple Vth's to offset the delay penalty at low Vdd
  - More widespread re-assignment to Vdd,low
  - Use Vdd first instead of re-sizing to take advantage of large path slacks
  - Anticipated power savings of 50-80%
• Static power also addressed through multi-Vth + Vdd + sizing
  - Vth difficult to control in ultra-short channels
  - Intra-cell Vth assignment + MTCMOS/variants + sleep modes

Next Week: Project Meetings