CMOS Circuit Design
Prof. MacDonald

MOS Transistor
[Figure: MOS transistor cross-section with gate, source, and drain terminals, field oxide, and silicon substrate.]
- The substrate terminal is typically tied to ground for P-wells and to Vdd for N-wells.

MOS Transistor
[Figure: NFET cross-section with N+ source and drain regions in a P-type substrate.]
- The device is symmetrical. For an NFET, the drain is defined as the node at the higher voltage.
- With zero bias on the gate, the channel is P-type, so source and drain are separated by two back-to-back diodes.
- There is no conduction between source and drain.

MOS Transistor
[Figure: NFET cross-section showing the depletion region and the inversion layer under the gate.]
- If the gate voltage is raised to the threshold voltage Vt, an N-type channel forms below the gate.
- This effectively shorts out the back-to-back diodes and allows conduction.

MOS Transistor - Off
[Figure: NFET cross-section with gate at Vg, Vs = 0, drain at Vd, Vb = 0.]
- If Vgs < Vt, no inversion layer exists and the back-to-back diodes prevent conduction between drain and source regardless of Vds.

MOS Transistor - Linear Mode
[Figure: NFET cross-section with Vg > Vt, Vs = 0, Vd small, Vb = 0.]
- If Vgs > Vt and Vds remains small, the inversion layer beneath the gate is almost uniform and complete from source to drain.
- The channel acts as a resistor and Ids increases linearly with Vds.

MOS Transistor - Almost Saturated
[Figure: NFET cross-section with Vg > Vt, Vs = 0, Vd = Vgs - Vt, Vb = 0.]
- If Vds = Vgs - Vt, the inversion layer begins to disappear at the drain end of the channel.
- This is the transition point from linear mode to saturation mode.
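The three operating regions described above can be sketched numerically with the standard long-channel square-law model. This is only a rough illustration: the function name `nfet_ids` and the default values for `vt`, `k` (standing in for the product of mobility and Cox), and `w_over_l` are assumptions chosen for the example, not values from the slides, and the body effect is ignored.

```python
def nfet_ids(vgs, vds, vt=0.5, k=100e-6, w_over_l=2.0):
    """Return (region, Ids in amps) for an NFET using the square-law model.

    vt, k (mu*Cox in A/V^2) and w_over_l are illustrative assumptions.
    """
    vov = vgs - vt  # overdrive voltage, Vgs - Vt
    if vov <= 0:
        # No inversion layer: back-to-back diodes block conduction.
        return "off", 0.0
    if vds < vov:
        # Inversion layer complete from source to drain.
        return "linear", k * w_over_l * (vov * vds - vds ** 2 / 2)
    # Inversion layer pinched off at the drain end.
    return "saturation", 0.5 * k * w_over_l * vov ** 2


if __name__ == "__main__":
    for vgs, vds in [(0.3, 1.0), (1.5, 0.1), (1.5, 1.0), (1.5, 1.8)]:
        region, ids = nfet_ids(vgs, vds)
        print(f"Vgs={vgs} V, Vds={vds} V -> {region}, Ids={ids * 1e6:.1f} uA")
```

Note that the linear and saturation expressions give the same current at Vds = Vgs - Vt, which matches the "almost saturated" transition point.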
MOS Transistor - Saturated
[Figure: NFET cross-section with Vg > Vt, Vs = 0, Vd > Vgs - Vt, Vb = 0.]
- If Vds > Vgs - Vt, the inversion layer disappears near the drain.
- The channel voltage at the end of the inversion layer is Vdssat, and electrons that reach the end are swept to the drain.
- Further increases in Vds have little effect on Ids.

MOSFET Current Equation
Vgs | Vds | Ids
Vgs < Vt | any | $I_{ds} \approx 0$
Vgs > Vt | Vds < Vgs - Vt | $I_{ds} = \mu C_{ox} (W/L) \left( (V_{gs} - V_t) V_{ds} - V_{ds}^2 / 2 \right)$
Vgs > Vt | Vds >= Vgs - Vt | $I_{ds} = \tfrac{1}{2} \mu C_{ox} (W/L) (V_{gs} - V_t)^2$
- The body effect is another consideration not described here. If Vbs is modified, Vt changes and consequently so does Ids.

Strained Silicon
- Strained silicon is a new process to enhance carrier mobilities.
- Add a dopant that is mechanically close to, but slightly different from, silicon to cause strain.
- IBM and Intel reportedly started production in the 90nm process.
[Figure from IBM webpage.]

MOSFET Scaling
- As Moore's law predicts:
  - dimensions decrease by a factor S
  - area decreases by S^2
- Two forms of scaling:
  - constant voltage: up to the early 90's, Vdd steady at 5V
  - constant electric field: after the early 90's, Vdd drops
- Why scale CMOS:
  - faster if smaller (drive current is proportional to 1/Leff)
  - cheaper if transistors take less area
  - more functionality if the same area is used with more transistors

MOSFET Scaling
- Leff min is proportional to (interview question):
  - Xj: requires precise implantation and anneal
  - N: affects the mobilities of carriers if too high; cap if too low
  - Tox: below 20 angstroms, tunneling leakage
- Gate oxide electric field: intrinsic breakdown at 10 MV/cm. This sets the max Vdd for a given technology.
- Thresholds need to be scaled, but for every 80 mV of reduction, leakage increases 10X.

Rough timeline
Node | Year | Comment
20u | 60's | no CMOS, just NMOS (+/- 5V)
2u | early 80's | CMOS, but some NMOS
1u | late 80's | all CMOS, +5V
0.8u | 92 | 5V, start of my career
0.65u | 94 | 5V
0.5u | 96 | 5V, start of E scaling
0.35u | 98 | 3.5V, 5V tolerant
0.25u | 00 | 2.5V, 3.3V tolerant
0.18u | 01 | 1.8V, 3.3V tolerant
0.12u | 02 | 1.5V
90nm | 03 | 1.2V
65nm | 05 | ?
45nm | 08 | ?
35nm | 10 | ?
? | ? | Quantum Dot Computers? Not!
Tox values, oldest to newest node: 1u, 250A, 200A, 150A, 105A, 50A, 37A, 27A, ?, ?, ?, new

MOSFET Scaling - Constant Voltage
Quantity | Before | After | Effect
Channel length | L | L' = L/S |
Channel width | W | W' = W/S |
Gate oxide thickness | tox | t'ox = tox/S |
Junction depth | Xj | X'j = Xj/S |
Vdd | Vdd | Vdd |
Threshold voltage | Vt | Vt |
Doping | Na, Nd | Na*S, Nd*S |
Gate capacitance (per area) | Cox | C'ox = Cox*S |
Total gate cap | Cg | C'g = Cg/S | faster and less power
Drive current | Ids | Ids*S | faster
Power (for same function) | P | P*S | same circuit scaled consumes more power
Power density | P/A | (P*S)/(A/S^2) = (P/A)*S^3 | so more power per function
Device delay | D = CV/I | D' = D/S^2 | much faster than before
Wire delay | RC | (R*S)*(C/S) = RC | really gets bigger (relative to device delay)

MOSFET Scaling - Constant Field
Quantity | Before | After | Effect
Channel length | L | L' = L/S |
Channel width | W | W' = W/S |
Gate oxide thickness | tox | t'ox = tox/S |
Junction depth | Xj | X'j = Xj/S |
Vdd | Vdd | Vdd/S | for reliability reasons
Threshold voltage | Vt | Vt/S | not done in practice (leakage)
Doping | Na, Nd | Na*S, Nd*S |
Gate capacitance (per area) | Cox | C'ox = Cox*S |
Total gate cap | Cg | C'g = Cg/S | faster and less power
Drive current | Ids | Ids/S |
Power (for same function) | P | P/S^2 | same circuit scaled consumes less power
Power density | P/A | (P/S^2)/(A/S^2) = P/A | so same power for more function
Device delay | D = CV/I | D' = D/S | faster, but not as fast as constant voltage
Wire delay | RC | (R*S)*(C/S) = RC | really gets bigger (relative to device delay)

MOSFET Scaling - Current Issues
- Static power is a major problem:
  - no static power was the original motivation for CMOS
  - gate oxides are 17 angstroms: tunneling current of 1 A/cm^2
  - sub-threshold leakage has increased due to scaled Vts
  - need a new oxide that acts electrically thin but is physically thick
  - silicon is used because of its nice native oxide with a good interface
  - dual threshold processes are used, but this adds expense
- Wire delay:
  - need low-K material for the inter-layer dielectric
  - current materials are having mechanical reliability issues: thermal cycle the chips and get opens/shorts
  - need low-resistance conductors: migrating from aluminum to copper (Intel last to go, IBM first)
  - but copper is difficult to etch, hence the dual damascene process

Other Observations
[Figure: an NFET passing Vdd reaches only Vdd - Vt; a PFET passing ground reaches only Vt.]
- NFETs can't drive high voltages well.
- PFETs can't drive low voltages well.
- This will affect many of the circuits that we explore in this class, and it is a major source of interview questions (and exam questions).

MOS Inverters
- Most fundamental circuit in the MOS family
- Represents the basic operation of all static gates
- One input and one output: Output = ~Input
- Inverter threshold voltage, Vth:
  - input voltage where output equals input
  - not the same as the transistor threshold Vt

Voltage Transfer Characteristic (VTC)
[Figure: Vout vs. Vin, showing the Vout = Vin line, the gain = -1 points at Vil and Vih, the inverter threshold Vth, and Vdd = Voh.]

Noise Margin - Low Gain Region
[Figure: VTC with the low gain regions and the gain = -1 points marked.]

Noise Margin - High Gain Region
[Figure: VTC with the high gain region and the gain = -1 points marked.]
- Good design minimizes the high gain region, also known as the transition region.

CMOS Inverter
[Figure: VTC divided into regions A through E by the lines Vout = Vin - Vtp, Vout = Vtn, Vout = Vdd + Vtp, and Vout = Vin - Vtn.]

CMOS Inverter
[Figure: Vout and Ids vs. Vin, from 0 to Vdd.]

Layout of Inverter - Top View
[Figure: n-well and transistor widths W.]

Layout of Inverter - Top View
[Figure: layout with n-well, gate, vdd, gnd, source, drain, input, and output labeled, with I1 and I2 marked on the layout and the schematic.]

MOS Inverters - Dynamic
- Performance is inversely proportional to delay.
- Delay is the time to raise (lower) the voltage at nodes:
  - node voltage is changed by charging (discharging) the load cap
  - more current means more charge transported over time
- $Q = I \cdot t = C \cdot V$
- $t_{delay} = Q / I = C \cdot V / I$

MOS Inverters - Dynamic
[Figure: inverter driving a load, showing junction cap, gate cap, and wire cap.]
- Wire cap is particularly bad when driving a load far away.

MOS Inverters - Dynamic
- Lumped cap: $C_L = C_{gdn} + C_{gdp} + C_{dbn} + C_{dbp} + C_w + C_g$

MOS Inverters - Discharge Delay
[Figure: input and output waveforms vs. time while the lumped cap CL discharges.]

MOS Inverters - Charge Delay
[Figure: output waveform vs. time while the lumped cap CL charges, input at 0V.]

Propagation Delay
[Figure: input and output waveforms with Tplh marked.]
- Defined twice: once for a falling output and once for a rising output.
- The propagation delay is the delay from the input crossing 50% of Vdd to the resulting output crossing 50% of Vdd.
- Tplh = rising propagation delay
- Tphl = falling propagation delay

Rise and Fall Times
[Figure: waveform with Trise and Tfall marked.]
- The rise time is the time for the signal to cross from 10% to 90% of Vdd.
- The fall time is the time for the signal to cross from 90% to 10% of Vdd.
- If an inverter is driven by a signal with a really slow rise or fall time, the delay through the inverter gets worse. Since the inverter spends longer in the transition region, a lot of short-circuit current can also be generated.

Rise and Fall Times
[Figure: waveforms comparing rise times.]
- If excessive rise or fall times exist, fix them by increasing the drive strength or decreasing the load.
- Increasing drive strength usually means widening transistors.
- Decreasing the load usually means splitting up the load with buffers.

Calculating Delay Times
[Figure: waveform with Tplh marked.]
- The simplest approach is to use average current and average capacitance models to calculate propagation delays for both edges.
- $\tau_{phl} = C_{load} \cdot \Delta V_{hl} / I_{avg,hl}$
- $\tau_{plh} = C_{load} \cdot \Delta V_{lh} / I_{avg,lh}$

MOS Inverters - Fall Delay
[Figure: output discharging through Reqn vs. time.]
- $V_{out}(t) = V_{dd} \cdot e^{-t / (R_n C_l)}$

MOS Inverters - Rise Delay
[Figure: output charging through Reqp vs. time.]
- $V_{out}(t) = V_{dd} \cdot \left(1 - e^{-t / (R_p C_l)}\right)$

Combating Delays
- Reduce capacitive load:
  - drive fewer gates (buffer tree)
  - drive smaller gates (less gate capacitance) in the subsequent stage
  - drive closer gates (less distance means less interconnect load)
- Increase drive current:
  - reduce Vt: not really an option for circuit designers
  - reduce L's: most transistors are already minimum sized for area
  - increase Vdd: can't because of gate oxide integrity
  - increase Weff: main weapon of the circuit designer
- Reduce wire lengths for long wires (more later…)

Coupling Analysis
[Figure: aggressor switching by Vaggressor, coupled through Ccoupling to a victim that has Cgood to ground and is held through Reqn.]
- $V_{victim} = \dfrac{C_{coupling}}{C_{coupling} + C_{good}} \cdot V_{dd}$

Minimizing Coupling Capacitance
- Wire spreaders are tools that search through a routed design and find places where signals can be spread.
- Noise-sensitive signals (e.g. the clock) can be shielded by running fixed signals (e.g. gnd, vdd) between the clock and other signals.
- Technologies are being developed that lower the permittivity of the inter-layer dielectric:
  - problems persist with these new materials
  - thermal cycling the material causes ruptures due to differences in the thermal expansion coefficient

Wire Spreading Example
[Figure: routed wires before and after spreading.]

Shielding Signals
[Figure: metal 1 cross-section over the substrate (ground plane) with a gnd wire between the victim signal and the aggressor signal.]
- Coupling capacitance goes down with a 1/T relationship.
- Good cap goes up because of shielding.

Long Lines and RC Delays
- A buffer can cut down on L and decrease interconnect delay quadratically. Device delay is inserted, of course, but many times the overall delay goes down.
[Figure: driver 100ps, wire of length L 400ps, receiver 100ps: 600 ps total.]

Long Lines and RC Delays
- If distance L has 400ps of RC delay, then a distance of L/2 will have 100ps of delay: (1/2)^2, or 1/4, of the delay.
[Figure: driver 100ps, wire L/2 100ps, buffer 150ps, wire L/2 100ps, receiver 100ps: 550 ps total.]

Long Lines and RC Delays
- If distance L has 400ps of RC delay, then a distance of L/3 will have 45ps of delay: (1/3)^2, or 1/9, of the delay.
[Figure: four devices at 100ps each and three L/3 wires at 45ps each: 535 ps total.]

Note on RC Delays and Vdd
- RC values are not affected by Vdd to the first order.
- Device delay, however, scales with the square of the voltage.
[Figure: at Vdd = 1.8V each device takes 100ps and each L/3 wire 45ps; at Vdd = 0.9V each device takes 400ps and each L/3 wire still takes 45ps.]

Inverter Sizing and Fanout
- To drive a huge load with a small inverter, we need a string of inverters to "ramp up" the capacitive gain.
- If an inverter is too small, it will have a difficult time charging the next stage.
- If an inverter is too large, it will overload the previous inverter.
[Figure: inverter chain with Wp/Wn of 4/2, 12/6, 36/18, 108/36.]
- Case of a huge load (e.g. IO driving off-chip loads, or a clock tree driving 1000s of flip-flops).

Parallel Transistor Configurations
- Two same-type transistors in parallel have their transconductances added if on at the same time.
- If both transistors are on simultaneously and the L values are the same for both, we can add the widths to get an effective single-transistor equivalent.
- When both are on, (W/L)eq is the sum of all ratios.
[Figure: two 8/1 transistors in parallel are equivalent to one 16/1 transistor.]

Parallel Transistor Configurations
- (W/L)eq = (W/L)a + (W/L)b
- or, if the Ls are equal, simply add the Ws
[Figure: two parallel transistors labeled W/L and the single equivalent.]

Series Transistor Configurations
- Two same-type transistors in series have their resistances added if on at the same time.
- If both transistors are on simultaneously and the W values are the same for both, we can add the lengths to get an effective single-transistor equivalent.
[Figure: two 8/1 transistors in series are equivalent to one 8/2 = 4/1 transistor.]

Series Transistor Configurations
- (W/L)eq = 1 / (sum of reciprocals)
- or, if the Ws are equal, simply add the Ls
[Figure: two series transistors labeled W/L.]

2 Input NOR - Depletion NFET Load
[Figure: depletion NFET load to Vdd, output Out, pull-down NFETs A and B.]
- If both A and B are high: NFET-heavy inverter.

CMOS NANDs and NORs
- Consider transistor sizings for balanced circuits…
[Figure: NAND and NOR schematics with inputs A and B and output out.]

NAND Layout
[Figure: NAND layout with vdd, gnd, inputs A and B, and output OUT. Legend: active area, n-well, metal 1, poly.]

CMOS NOR Transistor Sizing
- Consider transistor sizings for balanced circuits…
[Figure: reference inverter with PFET W*2 and NFET W; 2-input NOR with series PFETs W*4 each and parallel NFETs W each.]

CMOS NAND Transistor Sizing
- Consider transistor sizings for balanced circuits…
[Figure: reference inverter with PFET 2*W and NFET W; 2-input NAND with parallel PFETs 2*W each and series NFETs 2*W each.]

CMOS NAND Transistor Sizing
- Consider transistor sizings for balanced circuits…
[Figure: reference inverter with PFET 2*W and NFET W; 3-input NAND (inputs A, B, C) with series NFETs 3*W each.]

Fanin (Number of Inputs)
- There is a limit to the number of inputs that can be used.
[Figure: five-input gate with inputs A, B, C, D, E.]

Complex CMOS Logic
- Can make single-stage gates that implement:
  - AND-OR-Inverter (AOI)
  - OR-AND-Inverter (OAI)
- Given a function F = ((A*B)+C)'
- Invert the function to get the N network: F' = (A*B)+C
- Take the dual of the N network equation to get the PFET network: F'd = (A'+B')*C'
- Remember that PFETs invert inputs naturally.

Complex CMOS Logic
[Figure: schematic of F = ((A*B)+C)' with the PFET network (A parallel with B, in series with C) above out and the NFET network (A in series with B, parallel with C) below.]

Complex CMOS Logic - Euler
[Figure: graphs of the NFET network and PFET network with edges A, B, C.]
- Find a common Euler path which does not traverse any branch more than once.
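The N-network and dual P-network recipe above can be sanity-checked by brute force. The sketch below is only an illustration for the slide's example F = ((A*B)+C)'; the function names `pull_down` and `pull_up` are hypothetical. It confirms that for every input combination exactly one of the two networks conducts, so the output is always driven and never shorted.

```python
from itertools import product


def pull_down(a, b, c):
    """NFET network conducts (output low) when F' = (A*B) + C is true."""
    return (a and b) or c


def pull_up(a, b, c):
    """PFET network conducts (output high) when (A' + B') * C' is true.

    PFETs turn on for low inputs, which supplies the input inversions.
    """
    return ((not a) or (not b)) and (not c)


def check_complementary():
    """Return the truth table rows; raise if the networks ever overlap or float."""
    rows = []
    for a, b, c in product([False, True], repeat=3):
        up, down = pull_up(a, b, c), pull_down(a, b, c)
        assert up != down, "networks must be complementary"
        rows.append((a, b, c, up))  # output is high exactly when pull-up conducts
    return rows


if __name__ == "__main__":
    for a, b, c, f in check_complementary():
        print(int(a), int(b), int(c), "->", int(f))
```

The same check works for any AOI or OAI function by swapping in the two network expressions.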
Complex CMOS Logic
- Given a function F = ((A*B)+C)', what is the best layout to share diffusions when possible?
- One solution, but not the best:
[Figure: layout with Vdd, active area in the n-well, out, active area in the p-well, Gnd, and poly inputs A, B, C.]

Complex CMOS Logic
- Given a function F = ((A*B)+C)', what is the best layout to share diffusions when possible?
- Switch the source and drain of C for a better layout.
[Figure: improved layout with Vdd, out, Gnd, and inputs A, B, C.]

Pass Gates
- In most static CMOS, a PFET network pulls high and a dual NFET network pulls low.
- In a pass gate configuration, the transistors tie inputs to outputs.
- Pass gates can either be "ON" and pass a value or be "OFF" and tri-state an output.
- One NFET can do this, but it passes high values poorly.
- One PFET can do this too, but it passes low values poorly.
[Figure: NFET pass gate with in, out, and enable.]

Pass Gates
- There are a couple of problems. Not only will it not drive a full logic high, but the effective R also skyrockets toward infinity as the output approaches Vdd - Vt.
- This means that it slows down and also provides no drive strength when statically high, so the output is susceptible to coupling noise.
[Figure: NFET pass gate with in, out, and enable.]

Charge Sharing (and Pass Gates)
- Common interview question… Basis for DRAM operation.
- At t = 0, the gate is low, C1 (50 fF) is charged to 2 volts, and C2 (25 fF) is charged to 3 volts.
- Later, the gate is turned on. What is the voltage of C1 and C2?
- Simple Eng101, but most grads can't do it.
[Figure: pass transistor controlled by gate, connecting node v1 (C1) to node v2 (C2).]

Complementary Transmission Gates
- Use a PFET and an NFET in parallel; passes both ones and zeros.
- Never used by logic designers; circuit designers hide them.
- TGs act as switches, providing either a resistive short or an open circuit.
- They do not provide drive, so they attenuate the signal.
- Susceptible to "above Vdd" or "below Gnd" noise at the input.

Effective Resistance of TGs
- For passing low values, the NFET is fully on.
- For passing high values, the PFET is fully on.
- The effective resistance stays relatively constant regardless of the input voltage (as opposed to how pass gates respond).
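The charge-sharing question above comes down to conservation of charge. The sketch below is a minimal worked version, assuming an ideal switch, no leakage, and a gate voltage high enough that the NFET does not clamp the final value at Vgate - Vt; the function name `shared_voltage` is hypothetical.

```python
def shared_voltage(c1, v1, c2, v2):
    """Final common voltage after two charged caps are connected.

    Total charge Q = C1*V1 + C2*V2 is conserved and redistributes
    over the total capacitance C1 + C2.
    """
    return (c1 * v1 + c2 * v2) / (c1 + c2)


if __name__ == "__main__":
    # Values from the slide: C1 = 50 fF at 2 V, C2 = 25 fF at 3 V.
    v_final = shared_voltage(50e-15, 2.0, 25e-15, 3.0)
    print(f"Both caps settle at about {v_final:.2f} V")
```

With the slide's numbers, 175 fC spread over 75 fF gives roughly 2.33 V on both nodes; the larger capacitor dominates the result.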
[Figure: resistance vs. Vin from 0 to Vdd, showing the NFET R, the PFET R, and the combined Reff, with Vt and Vdd - Vt marked.]

Real Transmission Gate Mux
[Figure: transmission gate mux with inputs d1 and d0, select s, and output out.]
- Needs input inverters for noise and an output inverter to cancel the inversion and provide drive strength.

TG Logic
[Figure: transmission gate network with inputs a, b, b', c, c' and output f.]
- Implements F = A*B + C

CPL Logic
[Figure: pass transistor network with inputs a, a', b, b' and output f.]
- Implements F = A^B
- A couple of major problems though:
  1) it really needs 4 transistors to get both complements,
  2) if F is high, you'll have a Vt drop (slow and consumes power),
  3) inputs are unbuffered to source (noise).

CPL Logic
[Figure: pass transistor networks with inputs a, a', b, b' producing both f and f'.]
- Implements F = A^B

Logical Effort Method (LEM)
- LE is a method to estimate delay in CMOS circuits.
- Helps identify the best circuit style and choose widths.
- Based on a basic delay unit, Tau:
  - isolates technology differences
  - delay through an inverter, ignoring parasitics, with a fanout of 1
- Two components of delay through any gate:
  - parasitic delay (no-load delay or self-loading delay): p
  - effort delay or stage effort: f
- Stage effort (f) is the product of:
  - electrical effort: h
  - logical effort: g
- d = p + h*g

Logical Effort - Parasitic Delay - P
- Parasitic delay is calculated from the diffusion cap at the output.
- Bigger transistor: more current, but more P also.
  - diminishing returns on increasing widths
[Figure: gate schematic with inputs A and B and output out.]
Gate type | P
Inv | Pinv
N-input NAND | n*Pinv
N-input NOR | n*Pinv
N-input MUX | 2n*Pinv
XOR | 4*Pinv

Electrical Effort - H
- Ratio of output to input capacitance (Cout / Cin)
- Captures the effects of fanout
- d = p + h*g

Logical Effort - G
- Captures the complexity of the gate:
  - topology and ability to drive current
  - considers fan-in relative to an inverter
[Figure: inverter, NAND, and NOR schematics with relative transistor widths.]
Gate type | 1 input | 2 inputs | 3 inputs | 4 inputs | 5 inputs
Inv | 1 | | | |
NAND | | 4/3 | 5/3 | 6/3 | 7/3
NOR | | 5/3 | 7/3 | 9/3 | 11/3
Mux | | 2 | 2 | 2 | 2
XOR | | 4 | 12 | 32 |
- d = p + h*g

Logical Effort - Delay
[Figure: normalized delay vs. electrical effort h, showing the effort delay and parasitic delay components.]

Logical Effort - Example 1
- Tau = 50 ps for the given technology
- Determine the delay through a 4-input NOR driving 10 identical circuits.
- Solution: d = g*h + p = (9/3)*10 + 4 = 34 delay units = 1.7 ns
- Comment: large loads minimize the impact of parasitics.
- A large load will increase rise/fall times, and this estimation ignores that effect.

Logical Effort - Multi-stage Networks
- Principles generalize from gates to paths.
- Path logical effort: G = Πgi
- Path parasitic delay: P = Σpi
- Path electrical effort H is still the ratio Cout / Cin.
- Introduce branching effort: b = Ctotal / Cuseful
- Introduce path branching effort B: product of all b's
- Note B*H = (Cout / Cin) * Πbi = Πhi
- Path effort delay: Df = Σ(gi*hi)
- Path delay: D = Σdi = Df + P

Logical Effort - Multi-stage Networks
- The path is optimized when each stage bears the same effort.
- Dmin = N*F^(1/N) + P
- To obtain the balanced stage effort, each stage effort should be fi = gi*hi = F^(1/N)
- Equivalently, each hi should be hi = (F^(1/N)) / gi
- To determine sizings, start from the end and work backward: Cini = (gi * Couti) / fmin

Logical Effort - Example 2
- Tau = 50 ps for the given technology (0.6u CMOS)
- Size transistors to minimize path delay. No branching.
- Solution: F = G*H*B = (4/3 * 4/3 * 4/3) = 2.37, where H = 1, B = 1
- Dmin = 3*(2.37)^(1/3) + (2+2+2) = 10 units = 500 ps
[Figure: three-stage path with input capacitance C and load C.]

Logical Effort - Example 2
- Solution: fmin = F^(1/N) = 2.37^(1/3) = 4/3
- Cini = (gi * Couti) / fmin = ((4/3) * Couti) / (4/3) = Couti = C
[Figure: NAND stage with transistor gate loads of C/2 each; every stage has input capacitance C.]

Logical Effort - Example 3
- Same as example 2 but driving an 8C output cap
- Solution: F = G*H*B = (4/3 * 4/3 * 4/3) * 8 = 18.96, where H = 8, B = 1
- Dmin = 3*(18.96)^(1/3) + (2+2+2) = 14 units = 700 ps
[Figure: three-stage path with input capacitance C and load 8C.]

Logical Effort - Example 3
- Solution: fmin = F^(1/N) = 18.96^(1/3) = 8/3
- Cini = (gi * Couti) / fmin = ((4/3) * Couti) / (8/3) = Couti / 2
[Figure: stage input capacitances C, 2C, 4C driving the 8C load.]

Logical Effort - Example 4
- Same gates as example 2 but with branching and 4.5C output loads
- Solution: G = (4/3)^3, B = 2*3 = 6, H = 4.5/1 = 4.5, F = 64
- Dmin = 3*(64)^(1/3) + (2+2+2) = 18 units = 900 ps
[Figure: three-stage path with input capacitance C, a 2-way branch, a 3-way branch, and 4.5C loads.]

Logical Effort - Example 4
- Solution: fmin = F^(1/N) = 64^(1/3) = 4
- Cin2 = (g2 * Cout2) / fmin = ((4/3) * 4.5C) / 4 = 1.5C
- Cin1 = (g1 * Cout1) / fmin = ((4/3) * (1.5C * 3)) / 4 = 1.5C
- Cin0 = (g0 * Cout0) / fmin = ((4/3) * (1.5C * 2)) / 4 = C (correct!)
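The path-sizing procedure used in examples 2 through 4 can be written as a short routine. This is a sketch under stated assumptions, not a reference implementation: the name `size_path`, its argument layout, and the convention that `b[i]` is the branching factor at the output of stage `i` are all choices made for this illustration. Capacitances are in units of C and delays in units of Tau.

```python
import math


def size_path(g, p, b, c_in, c_load, tau_ps=50.0):
    """Balance stage efforts along a path and size each stage.

    g, p, b: per-stage logical effort, parasitic delay, and branching
    factor at the stage output (b[i] = 1 means no branching).
    Returns (stage effort, delay in units of tau, delay in ps,
    list of stage input capacitances).
    """
    n = len(g)
    path_effort = math.prod(g) * math.prod(b) * (c_load / c_in)  # F = G*B*H
    f = path_effort ** (1.0 / n)           # balanced effort per stage
    d_min = n * f + sum(p)                 # Dmin = N*F^(1/N) + P
    caps = [0.0] * n
    c_out = c_load * b[-1]                 # total load seen by the last stage
    for i in reversed(range(n)):           # work backward from the load
        caps[i] = g[i] * c_out / f         # Cin_i = g_i * Cout_i / f
        if i > 0:
            c_out = caps[i] * b[i - 1]     # previous stage drives b copies
    return f, d_min, d_min * tau_ps, caps


if __name__ == "__main__":
    # Example 4: three NANDs, 2-way then 3-way branching, 4.5C loads.
    f, d, t_ps, caps = size_path([4 / 3] * 3, [2, 2, 2], [2, 3, 1], 1.0, 4.5)
    print(f"f = {f:.2f}, D = {d:.1f} tau = {t_ps:.0f} ps, Cin = {caps}")
```

A useful self-check is that the first entry of the returned capacitance list comes back equal to the given input capacitance, which is the "(correct!)" step in the worked examples.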
[Figure: sized path with input capacitance C, stage capacitances 1.5C and 1.5C, and 4.5C loads.]

Logical Effort - Example 5
- Solution: F = G*B*H = (1 * 5/3 * 4/3 * 1) * 1 * 2 = 40/9
- fmin = (40/9)^(1/4) = 1.45
- Cin3 = (g3 * Cout3) / fmin = (1 * 20C) / 1.45 = 14C
- Cin2 = (g2 * Cout2) / fmin = ((4/3) * 14C) / 1.45 = 13C
- Cin1 = (g1 * Cout1) / fmin = ((5/3) * 13C) / 1.45 = 15C
- Cin0 = (g0 * Cout0) / fmin = (1 * 15C) / 1.45 = 10C (correct!)
[Figure: four-stage path with input capacitance 10C and load 20C.]

Sequential Element Review
- Sequential elements provide memory for circuits:
  - heart of a state machine: saving the current state
  - used to hold or pipe data: data registers, shift registers
- Two varieties:
  - level-sensitive transparent latch: less common
  - edge-sensitive master-slave flip-flop: everywhere

D Latch Schematic - Better
[Figure: latch schematic with d, gate, and q.]

CMOS Tri-state Inverter
[Figure: tri-state inverter with input, output, en, and ~en.]

D Latch Operation
- Gate low: Q holds its value and ignores D.
- Gate high: transparent, Q follows D after a delay.
[Figure: G, D, and Q waveforms vs. time.]

D Latch Common Uses
- Most common: basic building block of the flip-flop
- Other uses: to condition the enable for clock gating
[Figure: G, D, and Q waveforms vs. time.]

Standard CMOS D Flip-Flop Diagram
[Figure: two D latches in series, with D feeding the first latch, its Q feeding the second latch, and both gates driven from CLK.]
- Two doors: never simultaneously open or closed.
- Q is never directly influenced by D.

D Flip-Flop Schematic
[Figure: flip-flop schematic with D, CLK, and Q.]

D Flip-Flop Operation
- Sample the input at the edge and launch to the output.
- The input must be good for the "setup time" before and the "hold time" after the edge to sample correctly: the sampling window.
[Figure: CLK, D, and Q waveforms vs. time with Tsu, Th, and Tlaunch marked.]

D Flip-Flop Set-up Times
- The time before the rising edge during which the data must be stable to be sampled correctly.
- Logic designers can slow the clock (bigger period) to alleviate setup problems, but at the cost of performance.
- A good flip-flop design will have a very short set-up time.

D Flip-Flop Set-up Times
- Adding logic in the data path increases set-up times and therefore decreases performance.
- Muxes are often added for either functional or test purposes.
[Figure: flip-flop with a mux on its data input: d0, d1, sel, net A, clk, Q.]

D Flip-Flop Hold Times
- Hold times describe how long the data must be stable after the rising edge.
- Often the hold time is zero.
- Although hold times do not affect frequency, if you fail to meet them your chip will not work, and slowing it down will not help.

D Flip-Flop Launch Times
- Time required for data at the input to be sampled and launched to the output.
- This adds to the time required to start through the cone of logic and get to the next flip-flop, and therefore increases the clock period and slows performance.

Alternative Designs
- Pulsed latch: faster but less robust.
- Latches are not good sequential elements due to the race condition when transparent.
- Generate a glitch enable at the rising edge and you can get away with a latch.
[Figure: pulse generator built from CLK and a delay, driving a latch with input D.]

Alternative Designs
[Figure: CLK, D, and Q waveforms vs. time for the pulsed latch.]

Meta-stability
- Common logic or circuit design interview question.
- If the input is asynchronous, it is possible for the input to change exactly at the rising edge and latch a middle value.
- This is called the meta-stable point and results in indeterminate operation and high static power consumption.
[Figure: CLK and D waveforms vs. time.]

Meta-stability
[Figure: transfer curves of two cross-coupled inverters (VoutA = VinB, VinA = VoutB) crossing at the meta-stable point.]
- Meta-stable point: theoretically possible to be stable here.

CMOS Dynamic Latch
[Figure: pass gate controlled by G between D and a soft (undriven) node feeding Q, with G, D, and Q waveforms vs. time.]

Leakage
[Figure: pass gate with enable at 0v, in at 0v, and the stored node at 2.0v, showing subthreshold leakage.]
- A high value will leak to zero given enough time.
- If leakage is 1 nA, cap is 50 fF, and V = 2.0V:
  - Trefresh = (V*C)/I = (2 * 50e-15) / 1e-9 = 100 us to drop to 0V
  - Trefresh = (V*C)/I = (0.2 * 50e-15) / 1e-9 = 10 us to drop to 1.8V

Capacitive Feedthrough
[Figure: enable node V2 coupled through C2 to the stored node V1, which has C1 to ground, with V2 and V1 waveforms vs. time.]
- Q1 = Q2
- C1*V1 = C2*(V2 - V1)
- V1 = (C2 / (C1 + C2)) * V2
- dV1 = (C2 / (C1 + C2)) * dV2

Domino Circuits
- Typical applications:
  - arithmetic functions: Manchester carry chains
  - wordline decoders/drivers in on-chip SRAM arrays
  - timing-critical paths
- Performance advantages:
  - lower input capacitance
  - no NFET / PFET network contention
  - critical path through NFET devices with higher mobilities and less area

Domino Gate Operation
- Pre-charge phase:
  - clock input low
  - dynamic node high
- Evaluate phase:
  - clock input high
  - dynamic node conditionally discharged based on the NFET network evaluation
[Figure: domino gate with CLK, inputs A and B, PFETs P1 and P2, NFETs N1 to N4, the dynamic node, and OUT.]

Domino Circuits
[Figure: CLK, A, B, and OUT waveforms vs. time over alternating evaluate (E) and pre-charge (PC) phases.]

Leakage Sensitivity
- The dynamic node can be inadvertently discharged.
- Sources of leakage through the NFET network:
  - subthreshold current
  - radiation
  - noise at inputs
  - charge sharing
  - resistive defects
[Figure: domino gate with CLK, inputs A and B, PFETs P1 and P2, NFETs N1 to N4, and OUT.]

Sensitivity Improvement via Keeper
- The keeper transistor replenishes lost charge.
- Reduces performance:
  - increases diffusion capacitance on the dynamic node
  - causes momentary contention during the evaluate phase
- Difficult to test keeper functionality.
[Figure: domino gate with keeper P3 added, plus CLK, inputs A and B, PFETs P1 and P2, NFETs N1 to N4, and OUT.]

Domino Circuits - Wide OR
- No series PFET network, like a NOR.
- 10 transistors: 7 NFETs and 3 PFETs.
- The NOR has 6 NFETs and 6 PFETs (the PFETs are equivalent to ~12 NFETs).
- In NFET-equivalent terms: 13 NFETs vs. 18 NFETs.
- Inputs are connected to one NFET vs. an NFET/PFET pair.
[Figure: wide domino OR with CLK, inputs A, B, C, D, E, and OUT.]

Types of Memories
- Volatile memories: require a power supply to retain information
  - dynamic memories use charge to store information and require refreshing
  - static memories use feedback (a latch) to store information; no refresh required
- Non-volatile memories:
  - ROM (mask)
  - EEPROM
  - FLASH: NAND or NOR
  - MRAM

Memory Hierarchy
Level | Access time | Size
RF | 100 ps | 100's of bytes
L1 SRAM | 1 ns | 10's of Kbytes
L2 SRAM | 10 ns | 100's of Kbytes
L3 DRAM | 100 ns | 10's of Gbytes
Disks / Flash | 1 us | Tbytes

Register Files
- Fastest and most robust memory array
- Largest bit cell size
- Basically an array of large latches
- No sense amps: bits provide full-rail data out
- Often multi-ported (e.g. 8 read ports, 2 write ports)
- Often used with ALUs in the CPU as source/destination
- Typically less than 10,000 bits:
  - 32 32-bit fixed point registers
  - 32 60-bit floating point registers

SRAM
- Same process as logic, so often combined on one die
- Smaller bit cell than a register file: more dense but slower
- Uses a sense amp to detect the small bit cell output
- Fastest for reads and writes after the register file
- Large per-bit area costs: six transistors (single port), eight transistors (dual port)
- L1 and L2 cache on the CPU is always SRAM
- On-chip buffers (Ethernet buffer, LCD buffer)
- Typical sizes: 16k by 32

Static Memory Cell
[Figure: six-transistor cell (T1 to T6) with the wordline, true bit line, and complement bit line.]

Motherboard Architecture
[Figure only.]

Dynamic RAM
- Most dense RAM (1 Gbit chips available)
- Historically a different semiconductor process, so built on a separate die
- L3 cache (old days) and computer main memory
- Requires refresh of data due to leakage
- New push to combine DRAM and logic: embedded DRAM (eDRAM)
  - business case hard to close: yields drop

DRAM Bit Cells (1T)
- DRAM used since the early 70s
- Destructive read
- Highest density
[Figure: one-transistor cell with bitline, wordline, bit line capacitance Cbl, and bit capacitance Cb.]

DRAM Cross Section
[Figure only.]

Flash Cross Section
[Figure only.]

FLASH
- NOR Flash:
  - less dense (256 Mbit) but provides fast random read access
  - erase FN / program HEI
  - 100,000 write cycles
  - slow erase, fast program and read
  - SRAM-like interface: give an address, get a byte of data
  - great for code memory (BIOS, boot-up, cell phone, etc.)
- NAND Flash:
  - more than 2X denser: up to 2 Gbit
  - erase FN / program FN
  - fast erase, slow program and read
  - 1,000,000 write cycles
  - IO-like interface: not as simple as NOR
  - good for data storage: memory cards, iPods, USB key drives

Flash Cross Section
[Figure only.]

NOR FLASH
[Figure only.]

NAND Flash Reading
[Figure only.]

Tunneling vs Injection
[Figure only.]

Charge Pumps
- Flash and EEPROM architectures need an otherwise unavailable higher voltage for programming (+10v).
- Charge pumps can pump a cap to get a high voltage.
- DC to DC (higher) converter, without inductors.
- Need to consider Vmax across any gate oxide.
- Generally cannot provide much power (I*V).
- Charge pumps are used for a lot of other things, like overdrive voltages and PLLs.

Staged Diode Charge Pump
[Figure only.]

Dickson Charge Pump
[Figure: chain of diode-connected transistors Md1 to Md15 from Vin through nodes V1 to V4 to Vout, with pump capacitors C driven by the two clock phases and output capacitor Cout.]

Clock Booster
[Figure: cross-coupled booster with N1, N1b, N2, N2b, P1, P2, capacitors C1 and C1b, and outputs Out, Outb, Vo, Vob.]

SRAM Organization
[Figure: array split into blocks with unity aspect ratio, with rows, columns, and IO.]

Static Memory Cell
[Figure: six-transistor cell (T1 to T6) with the wordline, true bit line, and complement bit line.]

SRAM Read Cross-Section
[Figure: cell on the true and complement bit lines (TBL, CBL) with the wordline, bit line precharge circuit, bit line isolation circuit, and the sense amp with set and outputs TSA and CSA.]

SRAM Isolation & Pre-charge Circuits
[Figure: cells, bit line pre-charge circuit, bit line isolation, bit switch circuit, and sense amp.]

SRAM Sense Amplifier Circuit
[Figure: sense amp with set, outputs TSA and CSA, and the bit switch.]

SRAM Internal Memory Waveforms
[Figure: clock, word line, isolation, set sense amp, sense amp output, and data waveforms.]

SRAM Write Head Circuit
[Figure: write head driving the true and complement bit lines from write enable and data.]
SRAM Cell with Center GND Contact
[Figure: cell layout with Vdd, PFET diffusion, ground, NFET diffusion, word line (polysilicon), and bit line contacts.]

SRAM Cell with Shared Vdd Contact
[Figure: cell layout with PFET diffusion, Vdd, ground, NFET diffusion, bit line contacts, and word line (polysilicon).]

Split Word Line SRAM Cell
[Figure: cell layout with bit line contacts, PFET diffusion, word line (polysilicon), NFET diffusion, ground, and Vdd.]

Bit Cell Analysis - Read Disturb
[Figure: six-transistor cell (T1 to T6) with the wordline, true bit line, and complement bit line; the bit line is precharged to 1.8v.]
- The low data node starts at 0v but will jump up during a read. If it jumps too high, it can flip the bit.
- T6 is often not minimum L, to keep the jump low.

Bit Cell Analysis - Read Disturb
- If low, the right data node (Vrd) cannot exceed the threshold of T2 or the bit may flip.
- $\dfrac{K_{n6}}{2} (V_{dd} - V_{rd} - V_{tn})^2 = \dfrac{K_{n5}}{2} \left( 2 (V_{dd} - V_{tn}) V_{rd} - V_{rd}^2 \right)$
- $\dfrac{K_{n6}}{K_{n5}} < \dfrac{2 (V_{dd} - 1.5 V_{tn}) V_{tn}}{(V_{dd} - 2 V_{tn})^2}$

Bit Cell Analysis - Write
- Must ensure that the write head circuit can overpower the cell by the end of the write cycle.
- The side of the bit cell with a 0 dominates the write transaction, as the pass transistor is an NFET.
- When the word line asserts, the write head circuit drives a zero on one of the two sides.
- The bit data in the cell must be brought below the threshold of the cross-coupled inverter to flip the bit.

Bit Cell Analysis - Soft Error
- Radiation (particularly in space, but occasionally on Earth) causes the generation of charge in circuits.
- SOI technology helps, as it shields transistors from charge in the bulk silicon.
- The bit cell node has a capacitance, and introduced charge will change the voltage at the node.
- If the voltage swing exceeds the threshold of the cross-coupled inverter, the bit will flip (i.e. a soft error).
- Qcrit is the charge required to flip the bit.
- The data is bad, but the bit cell still works (thus soft error).
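The read-disturb constraint derived earlier in this section follows from setting Vrd = Vtn in the current balance between the access transistor (T6, saturated) and the pull-down (T5, linear). The sketch below evaluates that limit numerically and checks it against the current balance. The function names and the value Vtn = 0.5 V are assumptions for illustration; only Vdd = 1.8 V comes from the slides.

```python
def max_access_ratio(vdd, vtn):
    """Upper bound on Kn6/Kn5 so the low node stays below Vtn during a read."""
    return 2.0 * (vdd - 1.5 * vtn) * vtn / (vdd - 2.0 * vtn) ** 2


def current_balance_error(ratio, vdd, vtn, vrd):
    """Difference between T6 (saturation) and T5 (linear) currents, with Kn5 = 1."""
    i_access = (ratio / 2.0) * (vdd - vrd - vtn) ** 2
    i_pulldown = 0.5 * (2.0 * (vdd - vtn) * vrd - vrd ** 2)
    return i_access - i_pulldown


if __name__ == "__main__":
    vdd, vtn = 1.8, 0.5  # vtn is an assumed value for illustration
    limit = max_access_ratio(vdd, vtn)
    print(f"Kn6/Kn5 must stay below about {limit:.2f}")
    # At the limit, the two currents balance exactly at Vrd = Vtn.
    print(f"balance error at the limit: {current_balance_error(limit, vdd, vtn, vtn):.2e}")
```

A smaller ratio (a weaker access transistor or a stronger pull-down) keeps the disturbed node further below the threshold, which is consistent with T6 often not being minimum L.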
Bit Cell Analysis - Soft Error
[Figure: six-transistor cell (T1 to T6) with the wordline, true bit line, and complement bit line, and a constant current source on a data node turned on for time t.]
- Qcrit = I * t

Hamming Code (ECC)
- Simple parity (a 9th bit) can detect one failure.
- Hamming code is a kind of two-dimensional parity that will not only detect one failure but correct it as well.
- It will detect two, but any more failures can be missed.
- Current DRAM memories (DDR and SDRAM) have 64-bit buses with 8 additional ECC bits.

8 bit SEC-ECC example
- 8 data bits require 4 parity checking bits: log2(8) + 1 = 4
- Build the encoded word by indexing bits from the left.
- All bits at power-of-2 locations are parity.
- Data bits take the remaining spaces in order.

8 bit SEC-ECC example
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 = encoded word
R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8
- Each redundant bit is the parity bit for all bits in the word whose index contains the corresponding power of 2.
- R1 = E3 ^ E5 ^ E7 ^ E9 ^ E11 = D1 ^ D2 ^ D4 ^ D5 ^ D7
- R2 = E3 ^ E6 ^ E7 ^ E10 ^ E11 = D1 ^ D3 ^ D4 ^ D6 ^ D7
- R3 = E5 ^ E6 ^ E7 ^ E12 = D2 ^ D3 ^ D4 ^ D8
- R4 = E9 ^ E10 ^ E11 ^ E12 = D5 ^ D6 ^ D7 ^ D8
Index (binary): 1 = 0001, 2 = 0010, 3 = 0011, 4 = 0100, 5 = 0101, 6 = 0110, 7 = 0111, 8 = 1000, 9 = 1001, 10 = 1010, 11 = 1011, 12 = 1100

8 bit SEC-ECC example
- Encode the 8-bit number CE, or 1100_1110.
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12
R1 R2 1 R3 1 0 0 R4 1 1 1 0
- R1 = E3 ^ E5 ^ E7 ^ E9 ^ E11 = D1 ^ D2 ^ D4 ^ D5 ^ D7 = 1 ^ 1 ^ 0 ^ 1 ^ 1 = 0
- R2 = E3 ^ E6 ^ E7 ^ E10 ^ E11 = D1 ^ D3 ^ D4 ^ D6 ^ D7 = 1 ^ 0 ^ 0 ^ 1 ^ 1 = 1
- R3 = E5 ^ E6 ^ E7 ^ E12 = D2 ^ D3 ^ D4 ^ D8 = 1 ^ 0 ^ 0 ^ 0 = 1
- R4 = E9 ^ E10 ^ E11 ^ E12 = D5 ^ D6 ^ D7 ^ D8 = 1 ^ 1 ^ 1 ^ 0 = 1
- Encoded word = 0111_1001_1110 = 79E, and check bits (R4 R3 R2 R1) = 1110

8 bit SEC-ECC example
- Now flip any one bit (including parity).
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12
R1 R2 1 R3 1 1 0 R4 1 1 1 0
- R1 = E3 ^ E5 ^ E7 ^ E9 ^ E11 = D1 ^ D2 ^ D4 ^ D5 ^ D7 = 1 ^ 1 ^ 0 ^ 1 ^ 1 = 0
- R2 = E3 ^ E6 ^ E7 ^ E10 ^ E11 = D1 ^ D3 ^ D4 ^ D6 ^ D7 = 1 ^ 1 ^ 0 ^ 1 ^ 1 = 0 (was 1)
- R3 = E5 ^ E6 ^ E7 ^ E12 = D2 ^ D3 ^ D4 ^ D8 = 1 ^ 1 ^ 0 ^ 0 = 0 (was 1)
- R4 = E9 ^ E10 ^ E11 ^ E12 = D5 ^ D6 ^ D7 ^ D8 = 1 ^ 1 ^ 1 ^ 0 = 1
- New check bits (R4 R3 R2 R1) = 1000

8 bit SEC-ECC example
- Since the old check bits do not match the new check bits (1110 != 1000), we know there is a failure.
- Bitwise XOR the old and new check bits to create an index to the failure:
  - 1000: new check bits
  - 1110: original check bits
  - 0110 = 6: these are the syndrome bits, so the culprit is E6
- Flip the culprit bit back and the data is correct.
- Put an R0 in the E0 position as global parity. If the global parity is correct but the check bits do not match, you have a double error. You can't fix it, but at least you can detect it.
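The encode, corrupt, and correct steps of the 8-bit SEC example can be sketched in a few lines. This is a minimal illustration, not a production ECC: the function names are hypothetical, the global parity bit R0 is left out, and the syndrome is computed as the XOR of the indices of all set bits, which is equivalent to XORing the stored and recomputed check bits.

```python
def encode(data_bits):
    """Place D1..D8 in non-power-of-2 positions of E1..E12 and fill parity."""
    n = 12
    e = [0] * (n + 1)                      # e[0] unused; positions are 1-based
    data = iter(data_bits)
    for i in range(1, n + 1):
        if i & (i - 1):                    # not a power of 2: data position
            e[i] = next(data)
    for r in (1, 2, 4, 8):                 # R1, R2, R3, R4
        parity = 0
        for i in range(1, n + 1):
            if i & r and i != r:           # index contains this power of 2
                parity ^= e[i]
        e[r] = parity
    return e[1:]


def syndrome(word):
    """XOR of the 1-based indices of all set bits; 0 means no single-bit error."""
    s = 0
    for i, bit in enumerate(word, start=1):
        if bit:
            s ^= i
    return s


def correct(word):
    """Return a copy of word with the single-bit error (if any) flipped back."""
    fixed = list(word)
    s = syndrome(fixed)
    if 1 <= s <= len(fixed):
        fixed[s - 1] ^= 1
    return fixed


if __name__ == "__main__":
    good = encode([1, 1, 0, 0, 1, 1, 1, 0])    # data byte 1100_1110
    bad = list(good)
    bad[5] ^= 1                                 # flip E6, as in the example
    print("encoded :", good)
    print("syndrome:", syndrome(bad))
    print("repaired:", correct(bad) == good)
```

Flipping any one of the twelve positions, including a parity bit, produces a syndrome equal to that position's index, which is why the parity bits sit at the power-of-2 locations.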