ESSDERC 2010 ITRS Workshop on Emerging Spin and Carbon-based Nanoelectronic Logic Devices, Barcelo Renacimiento Hotel, Seville, Spain, Sep. 17, 2010

Magnetic FPGAs: Challenge of Nonvolatile Logic-in-Memory Architecture Using MOSFETs and Magnetic Tunnel Junctions

Takahiro Hanyu
Laboratory for Brainware Systems, Research Institute of Electrical Communication (RIEC), Tohoku University, Japan

Acknowledgements: This work was supported by the Japan Society for the Promotion of Science (JSPS) through its "Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program; Prof. Hideo Ohno)." This work was also supported by the Laboratory for Nanoelectronics and Spintronics, Tohoku University, Japan.

Outline
• Impact of Nonvolatile (NV) Logic-in-Memory (LIM) Architecture
• Design of an MTJ-Based NV LIM Circuit
• Application 1: NV-FPGA
• Application 2: NV-TCAM
• Conclusions & Future Prospects

Background: Increasing delay & power
[Figure: active and leakage power (W) vs. year, 1960-2010 (W. M. Elgharbawy et al., IEEE CAS Magazine, 2005).]
• Logic and memory modules are separated, so many interconnections run between modules: wire delay dominates chip performance, and global wires require large drivers. → Delay: long; power: large.
• On-chip memory modules are volatile, so power must be supplied to them continuously. → Static power: large.

Nonvolatile logic-in-memory architecture
Logic-in-Memory Architecture (proposed in 1969): storage elements are distributed over a logic-circuit plane.
• Magnetic Tunnel Junction (MTJ) device: no volatility, unlimited endurance, fast writability, scalability, CMOS compatibility, and 3-D stack capability.
[Figure: MTJ layer stacked over the CMOS layer.]
● Storage is nonvolatile (leakage current is cut off) → static power is cut off.
● MTJ devices are placed on top of the CMOS layer → chip area is reduced.
● Storage and logic are merged (global-wire count is reduced) → wire delay and dynamic power are reduced.

Outline: Design of an MTJ-Based NV LIM Circuit

Model of a MOS/MTJ-hybrid circuit
Data inputs enter a logic-circuit plane (CMOS), which produces the outputs. Storage (MTJ devices) holds the configuration inputs, either
◆ a circuit configuration, or
◆ pattern data.
Typical applications:
◆ Circuit-configuration type: Field-Programmable Gate Array (FPGA)
◆ Pattern-data type: Content-Addressable Memory (CAM)

Design example
Data inputs x1 and x2 produce x1・x2 and x1+x2; a MUX selects one of them as the output under control of the configuration input y, held in a 1-bit configuration memory. How should this logic circuit be designed?

CMOS implementation
[Figure: NAND and NOR gates feed a MUX whose control (CTRL) comes from an SRAM cell (y, y').]
Logic and storage parts are separated from each other. Transistor count: 20 + α (α: nonvolatile devices). Is this small?

Principle of MOS/MTJ-hybrid circuitry
Each variable is represented by a resistance:
• MOS transistor with input x: RMOS = RH (high resistance) if x = 0; RL (low resistance) if x = 1.
• MTJ storing y: RMTJ = RAP (high resistance) if y = 0; RP (low resistance) if y = 1.
The current I through the (Ry, Rx1, Rx2) network is compared with the current I' through the complementary network:

y x1 x2 | (Ry, Rx1, Rx2) | Comparison | NAND (out1) | NOR (out2)
0 0  0  | (RAP, RH, RH)  | I < I'     | 1           | -
0 0  1  | (RAP, RH, RL)  | I < I'     | 1           | -
0 1  0  | (RAP, RL, RH)  | I < I'     | 1           | -
0 1  1  | (RAP, RL, RL)  | I > I'     | 0           | -
1 0  0  | (RP, RH, RH)   | I < I'     | -           | 1
1 0  1  | (RP, RH, RL)   | I > I'     | -           | 0
1 1  0  | (RP, RL, RH)   | I > I'     | -           | 0
1 1  1  | (RP, RL, RL)   | I > I'     | -           | 0

The logic function is configured by the data stored in the MTJ: y = 0 yields NAND, and y = 1 yields NOR.
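The comparison column can be reproduced with a small behavioral model. The sketch below is an assumption rather than the circuit's actual topology: each input contributes a parallel conductance to its branch, each branch current is taken as proportional to its total conductance, the complementary branch is driven by y', x1', x2', and the output is 0 when I > I'. The resistance values R_L, R_H, R_P, and R_AP are hypothetical.

```python
# Behavioral model of the configurable MOS/MTJ-hybrid gate.
# Assumptions (not from the slides): each input adds a parallel conductance
# to its branch, branch current is proportional to total conductance, and
# all resistance values below are hypothetical.
R_L, R_H = 1e3, 100e3   # MOS on (x = 1) / off (x = 0) resistance, ohms
R_P, R_AP = 2e3, 4e3    # MTJ parallel (y = 1) / antiparallel (y = 0), ohms


def r_mos(x: int) -> float:
    return R_L if x else R_H


def r_mtj(y: int) -> float:
    return R_P if y else R_AP


def hybrid_gate(y: int, x1: int, x2: int) -> int:
    """Compare I (network driven by y, x1, x2) with I' (driven by y', x1', x2')."""
    i_true = 1 / r_mtj(y) + 1 / r_mos(x1) + 1 / r_mos(x2)
    i_comp = 1 / r_mtj(1 - y) + 1 / r_mos(1 - x1) + 1 / r_mos(1 - x2)
    return 0 if i_true > i_comp else 1  # out = 0 if I > I', else 1


if __name__ == "__main__":
    for y, name in ((0, "NAND"), (1, "NOR")):
        outs = [hybrid_gate(y, x1, x2) for x1 in (0, 1) for x2 in (0, 1)]
        print(f"y={y} ({name}): {outs}")
```

Read this way, the table is a threshold function of three inputs, one of them stored: out = NOT(majority(y, x1, x2)), which reduces to NAND for y = 0 and NOR for y = 1.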
MOS/MTJ-hybrid circuit implementation
[Figure: load resistors Rload from VDD feed an output generator (out, out'); below them, the logic & storage network (Rx1, Rx2, Ry driven by x1, x2, y) and the complementary network (Rx1', Rx2', Ry' driven by x1', x2', y') draw currents I and I' into a current comparator clocked by CLK; CTRL lines connect to the MTJs.]
Current comparator output: out = 0 if I > I'; out = 1 if I' > I.
Transistor count: 11. Merging logic and storage makes the circuit compact.

Outline: Application 1: NV-FPGA

Typical Application 1: Nonvolatile FPGA
NV devices are distributed across the FPGA, for example inside each NV LUT (lookup table), so leakage current can be eliminated and latency kept short. How should it be designed? With a MOS/MTJ-hybrid circuit, a separate NVM is not required.

Conventional nonvolatile FPGA
[Figure: each MTJ is read by a sense amplifier (SA) whose output feeds CMOS combinational logic.]
CMOS logic requires a high-voltage input swing, so the low-voltage MTJ signals must first be amplified by sense amplifiers. How can logic operations be performed directly on the low-swing signals from the MTJ devices?

MOS/MTJ-hybrid circuitry (proposed)
Current-mode logic (CML): logic operations are performed even at a low voltage swing by exploiting the small difference between current values. The MTJs feed current-mode combinational logic directly, and a single SA converts the low-voltage result into a high-voltage output. The device count is reduced to 28% with little performance degradation.

MOS/MTJ-hybrid structure
[Figure: a selection transistor tree driven by A and B connects one of four MTJs (R00, R01, R10, R11) to the IF branch; a reference resistor RREF sets IREF.]
Truth table of a 2-input LUT:
A B | Z
0 0 | Z00
0 1 | Z01
1 0 | Z10
1 1 | Z11
Storage: ZAB = 0 → RAB = RAP; ZAB = 1 → RAB = RP; RAP > RREF > RP.
A 2-input LUT function is realized with 10 NMOS transistors and 4 MTJs (plus 1 reference resistor).

Operation example (XOR)
The sense amplifier compares IF with IREF: a selected MTJ in RP ('1') gives IF > IREF and Z = 1, while one in RAP ('0') gives IF < IREF and Z = 0. For XOR, R00 and R11 are written to RAP, and R01 and R10 to RP:
A B | Z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
Logic operations at a low voltage swing are performed by the MOS/MTJ-hybrid network.
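The LUT write and read-out can be sketched the same way. In this idealized model (an assumption: the selection-tree transistors are ideal switches and the same read voltage drives both branches), writing a truth table stores each ZAB as RP or RAP, and reading compares the selected MTJ's current with the reference current. Only the ordering RAP > RREF > RP comes from the slides; V_READ and all resistance values are hypothetical.

```python
# Idealized 2-input NV-LUT: four MTJs selected by (A, B), sensed against RREF.
# Assumptions: ideal selection transistors, equal read voltage on both
# branches, hypothetical values satisfying R_AP > R_REF > R_P.
R_P, R_REF, R_AP = 2e3, 3e3, 4e3  # ohms
V_READ = 0.1                      # volts (hypothetical)

Inputs = tuple[int, int]


def program_lut(truth: dict[Inputs, int]) -> dict[Inputs, float]:
    """Abstracted spin-injection write: Z_AB = 1 -> R_P, Z_AB = 0 -> R_AP."""
    return {ab: (R_P if z else R_AP) for ab, z in truth.items()}


def read_lut(mtjs: dict[Inputs, float], a: int, b: int) -> int:
    """The selection tree routes one MTJ (R_AB) into the IF branch."""
    i_f = V_READ / mtjs[(a, b)]
    i_ref = V_READ / R_REF
    return 1 if i_f > i_ref else 0  # Z = 1 when IF > IREF


if __name__ == "__main__":
    xor = {(a, b): a ^ b for a in (0, 1) for b in (0, 1)}
    mtjs = program_lut(xor)
    print([read_lut(mtjs, a, b) for a in (0, 1) for b in (0, 1)])
```

Reconfiguring the LUT only rewrites the four MTJ states; the read path stays the same.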
Precharge-Evaluate Logic
[Figure: a sense amplifier (outputs Z and Z', internal nodes C1 and C2, load CL) above the MOS/MTJ-hybrid network (LUT operation, current IF) and the reference branch (current IREF), both clocked by CLK; waveforms of VCLK, VC1, VC2, VZ, and VZ'.]
• Precharge (CLK = 0), then evaluate (CLK = 1):
  IF < IREF → (Z, Z') = (0, 1)
  IF > IREF → (Z, Z') = (1, 0)
• Dynamic current-mode logic (DyCML)-based circuit: dynamic power dissipation is reduced.

Spin-Injection Write Operation
[Figure: the selection transistor tree and reference resistor with word lines WL0-WL3 and bit lines BL, BL'; RTMR vs. ITMR hysteresis between RAP and RP with switching currents ICP and ICAP.]
The MTJ selected by WL0-WL3 is written by spin injection: the bit-line polarity sets the direction of ITMR and thus whether the MTJ switches to RP or RAP.

Test chip features
Fabricated 2-input LUT: 4 MTJ devices are stacked over the selection transistor tree in the MOS layer.
• Process: 0.14 μm MTJ/MOS, 1-Poly, 3-Metal
• Area: 287 μm²
• MTJ size: 50 nm × 150 nm
• TMR ratio: 100%
• Write current: 150 μA
• Write time: 10 ns
• Standby current: 0 A

Measured waveforms (basic operations)
[Figure: inputs A and B and outputs Z and Z' over alternating precharge (P) and evaluate (E) phases, 0.78 V/div, 100 ms/div, with the LUT configured as NAND, NOR, XOR, and XNOR.]

Immediate wakeup behavior
[Figure: active, standby (VDD = 0), and active periods; A and B cycle through 00, 01, 10, 11 in each active period; outputs Z and Z' at 0.78 V/div, 50 ms/div.]
Immediate wakeup behavior has also been measured successfully.

Comparison of performances
                          | SRAM/MRAM            | Nonvolatile SRAM [3] | Proposed
Device counts             | 46 MOSs + 1 MRAM *1) | 102 MOSs + 8 MTJs    | 29 MOSs + 4 MTJs
Area *2)                  | 455 μm² *1)          | 702 μm²              | 287 μm²
Active: delay *3)         | 100 ps               | 140 ps               | 185 ps
Active: power *3)         | 22.5 mW              | 26.7 mW              | 17.5 mW
Standby: power            | 0 mW                 | 0 mW                 | 0 mW
Standby to active: delay  | 42 ns/bit            | 0 ns/bit             | 0 ns/bit
Standby to active: energy | 19 pJ/bit            | 0 pJ/bit             | 0 pJ/bit
*1) Four SRAM cells (24 MOSs), three 2-input multiplexers (18 MOSs), and two output buffers (4 MOSs). The MRAM and its peripheral circuits are not included in this evaluation.
*2) Estimated for a 0.14 μm process.
*3) HSPICE simulation based on a 0.14 μm MOS/MTJ-hybrid process.

MRR vs. Operation Margin in NV-LUT
□ MR ratio (MRR) in a 6-input LUT.
Shmoo plot: pass region vs. reference resistance RREF (2.00-14.0 kΩ in 0.5 kΩ steps) and MR ratio; all points outside the listed ranges fail.
MR ratio | Pass range of RREF
100%     | 4.00 kΩ only
200%     | 4.00-6.00 kΩ
300%     | 4.00-8.00 kΩ
400%     | 4.00-10.0 kΩ
500%     | 4.00-12.0 kΩ
600%     | 4.00-14.0 kΩ
Large MRR → sufficient operation margin.

Outline: Application 2: NV-TCAM

Application 2: Ternary Content-Addressable Memory (TCAM)
[Figure: a search-line/word-line driver applies the input key (e.g., 0 1 0 0 ... 1 1 0) to all stored words at once; each stored bit is 0, 1, or X (don't care) on bit-line pairs BL1/BL1' ... BLn/BLn' driven by a bit-line driver; the output driver reports OUT1 = 0 (mismatch), OUT2 = 1 (match), ..., OUTn = 0 (mismatch).]
• Fully parallel masked equality search: both the search and the comparison are performed fully in parallel.
• TCAM is a "functional memory": a powerful data-search engine for applications such as database machines and virus checkers in network routers.
• TCAM must be implemented more compactly and with lower power dissipation.
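The masked equality search itself can be stated in a few lines. This is a behavioral sketch only, and the function names are illustrative: a stored X matches either key bit, and a word matches when all of its bits do. The hardware evaluates every word and every bit in parallel, whereas this model loops over them.

```python
# Behavioral model of a TCAM's masked equality search (illustrative names).
# Stored words use '0', '1', and 'X' (don't care); the key uses '0'/'1'.


def word_matches(stored: str, key: str) -> bool:
    """A word matches when every non-X bit equals the corresponding key bit."""
    if len(stored) != len(key):
        raise ValueError("stored word and key must have the same length")
    return all(s == "X" or s == k for s, k in zip(stored, key))


def tcam_search(words: list[str], key: str) -> list[int]:
    """Return one match flag per stored word: 1 = match, 0 = mismatch."""
    return [int(word_matches(w, key)) for w in words]


if __name__ == "__main__":
    words = ["01X0", "X100", "0110"]
    print(tcam_search(words, "0100"))  # -> [1, 1, 0]
```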
NV-TCAM Cell Circuit
[Figure: two MTJs storing (b1, b2), selected by S'/WL1 and S/WL2, wired-OR onto the match line ML; the resulting current IZ is compared with IZ'.]
Stored data B is encoded as (b1, b2): 0 → (0, 1), 1 → (1, 0), X (don't care) → (0, 0). The wired-OR evaluates b1・S' + b2・S, which is 1 only on a mismatch, so the current comparison gives the match result ML = NOT(b1・S' + b2・S):

B | (b1, b2) | Search input S | Current comparison | Match result ML
0 | (0, 1)   | 0              | IZ < IZ'           | 1 (match)
0 | (0, 1)   | 1              | IZ > IZ'           | 0 (mismatch)
1 | (1, 0)   | 0              | IZ > IZ'           | 0 (mismatch)
1 | (1, 0)   | 1              | IZ < IZ'           | 1 (match)
X | (0, 0)   | 0              | IZ < IZ'           | 1 (match)
X | (0, 0)   | 1              | IZ < IZ'           | 1 (match)

CMOS-based TCAM cell circuit
[Figure: two 1-bit storage cells and an equality-detection (ED) circuit on ML, WL, BL1, BL2, SL, SL', VDD, and VSS, with leakage-current paths.]
• Transistor count: 12 (ED: 4T; 2-bit storage: 8T)
• Input/output wires: 8 (BL: 2, WL: 1, VDD & VSS: 2, SL: 2, ML: 1)
• Power must always be supplied, leaving many leakage-current paths.
How can the cell be made compact and its leakage current cut off?

MOS/MTJ-hybrid TCAM cell circuit
S. Matsunaga, K. Hiyama, A. Matsumoto, S. Ikeda, H. Hasegawa, K. Miura, J. Hayakawa, T. Endoh, H. Ohno, and T. Hanyu, "Standby-Power-Free Compact Ternary Content-Addressable Memory Cell Chip Using Magnetic Tunnel Junction Devices," Applied Physics Express (APEX), vol. 2, no. 2, pp. 023004-1-023004-3, 2009.
[Figure: 2-bit storage (MTJs) merged with the logic (MTJs & MOSs) between ML/BL and the shared lines SL'/WL1 and SL/WL2.]
• Storage merged into the logic circuit: compact (2T-2MTJ).
• Shared wires: 4 (ML/BL, SL/WL; no VDD line).
• 3-D stack structure: great reduction of circuit area.
Compact and nonvolatile TCAM cell with MTJ devices.

Power-Gating Scheme of Bit-Serial NV-TCAM
[Figure: a 3-bit search word (1 1 1) is compared against the stored ternary words one bit at a time (1st-, 2nd-, and 3rd-bit search); each word has its own sense amplifier (SA) and accumulator (ACC), which record match or mismatch.]
Only the TCAM cells in the bit column being searched are active; all other cells stay in standby, where static power is suppressed, and sense amplifiers can likewise be held in standby. The longer the TCAM word, the more effective the standby-power reduction.

TCAM cell circuit test chip
[Figure: micrograph, 3.0 μm × 9.8 μm, showing the TCAM cell, the reference cell, and the dynamic current comparator and output generator of the MLSA.]
Chip features
• Process: 0.14 μm CMOS/MTJ, 1-Poly, 3-Metal
• Total area: 29.4 μm²
• TCAM cell size: 3.15 μm² (2.1 μm × 1.5 μm) a)
• Cell structure: 2 MOSs - 2 MTJs
• MTJ size: 50 nm × 200 nm
• TMR ratio: 167%
• Average write current: 274 μA (tp = 10 ms) b)
• Standby current: 0 A (power off)
a) A CMOS-based TCAM cell with 12 transistors and a cell size of 17.54 μm² in a 0.18 μm CMOS process has been reported [8]. Scaled down to a 0.14 μm CMOS process, the conventional cell size is estimated at 10.61 μm². Thus, the fabricated TCAM cell is reduced to 30% of the size of the conventional one.
Moreover, the minimum size of the proposed TCAM cell can be estimated at 1/6 of the conventional one.
b) Faster writes are possible with a larger write current; for example, an average current of 327 μA achieves a 10 ns write.

Waveforms of equality-search operations
[Figure: CLK, search data S, and output OUT (780 mV, 10 ms scale) over alternating precharge (P) and evaluate (E) phases.]
• Stored data B = 1: S = 0 → mismatch; S = 1 → match
• Stored data B = 0: S = 0 → match; S = 1 → mismatch
• Stored data B = X: S = 0 → match; S = 1 → match
Bit-level equality search is successfully demonstrated.

Waveforms of sleep/wake-up operations
[Figure: VDD, CLK, S, and OUT (780 mV, 10 ms scale) through active, standby (power-off), and active periods.]
With stored data B = 0, S = 0 gives OUT = 1 (match) and S = 1 gives OUT = 0 (mismatch), both before and after power-off.
Instant sleep/wake-up behavior is successfully demonstrated.

Comparison of 144-bit × 256-word Bit-Serial TCAM
HSPICE simulation, 90 nm CMOS/MTJ technology, 125 MHz, RP: 2 kΩ, TMR ratio: 100%
                   | CMOS-only | MOS/MTJ-hybrid | Hybrid/CMOS
Active power
  Cell array       | 109.6 mW  | 107.3 mW       |
  SA               | 30.8 mW   | 9.6 mW         |
  ACC              | 3.7 mW    | 3.7 mW         |
  CLK              | 32.7 mW   | 62.0 mW        |
  Total            |           |                | 103%
Standby power
  Cell array       | 340.9 mW  | 1.8 mW         |
  SA               | 1.2 mW    | 2.3 mW         |
  Total            |           |                | 1.2%
Delay              | 1.39 ns   | 0.60 ns        | 43%
Ultra-low-power, high-performance bit-serial TCAM is achieved by the MOS/MTJ-hybrid circuit with fine-grain power gating.

Outline: Conclusions & Future Prospects

Conclusions
• Proposed a MOS/MTJ-hybrid circuit style (nonvolatile logic-in-memory circuits using MTJ devices).
• Two typical applications of the logic-in-memory architecture, an NV-LUT circuit and an NV-TCAM: both compact, with no static power dissipation.
• Basic behavior confirmed with test chips fabricated in an MTJ/CMOS process.
• This could open up an ultra-low-power logic-circuit paradigm.
Future Prospects and Issues:
1. Establish the fabrication line
2. Establish the CAD tools
3. Explore appropriate application fields (impact on "reliability enhancement")

Reliability Enhancement Using MTJs
Adjust VGS with MTJ devices connected to transistors.
[Figure: transistor with an MTJ (RMTJ) at its source node VS.]
• MTJ device: programmable resistance value (RTMR: Rmax or Rmin); nonvolatile storage element.
• VGS can be adjusted by controlling RMTJ, so Vth variation can be compensated after fabrication.
• Small overhead: MTJ devices can be placed above the CMOS layer.
• Nonvolatility: the compensation state is held without a power supply.
Vth-variation compensation is realized with small overhead by using MTJ devices.

Evaluation in comparator
[Figure: outputs Vo and Vo' (0.8-1.2 V) vs. VIN - VT (-0.2 to 0.2 V) for the conventional and proposed comparators.]
Under Vth variation, the shift range of the output cross-point is 60 mV for the conventional comparator and 23 mV (38%) for the proposed one, showing the proposed comparator's robustness against Vth variation.
Fabricated chip: 0.18 μm CMOS/MTJ process (measurement now ongoing).
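The compensation idea can be illustrated with a first-order model. This sketch is an assumption, not the authors' design: it places n_mtj MTJs in series at each source of a differential pair, treats the bias current as fixed so that a source resistance R raises the effective threshold by I_BIAS × R, and searches all Rmin/Rmax states for the setting with the smallest residual offset. R_MIN, R_MAX, I_BIAS, n_mtj, and the example Vth values are hypothetical.

```python
from itertools import product

# Hypothetical parameters (not taken from the slides).
R_MIN, R_MAX = 2e3, 4e3  # MTJ low / high resistance, ohms
I_BIAS = 10e-6           # fixed bias current per transistor, amperes


def effective_vth(vth: float, r_source: float) -> float:
    """First-order model: the source rises by I*R, so the gate needs I*R more."""
    return vth + I_BIAS * r_source


def best_setting(vth1: float, vth2: float, n_mtj: int = 2):
    """Try every Rmin/Rmax state for n_mtj series MTJs on each side and
    return (residual_offset, states_side1, states_side2) with minimum |offset|."""
    best = None
    for s1 in product((R_MIN, R_MAX), repeat=n_mtj):
        for s2 in product((R_MIN, R_MAX), repeat=n_mtj):
            offset = effective_vth(vth1, sum(s1)) - effective_vth(vth2, sum(s2))
            if best is None or abs(offset) < abs(best[0]):
                best = (offset, s1, s2)
    return best


if __name__ == "__main__":
    # A 25 mV threshold mismatch between the two input transistors.
    offset, s1, s2 = best_setting(0.450, 0.425)
    print(f"residual offset: {offset * 1e3:.1f} mV, side 1: {s1}, side 2: {s2}")
```

In this model the correction moves in steps of I_BIAS × (R_MAX - R_MIN), so within the covered range the residual offset is at most half a step; that granularity is a property of the sketch, not a measured result.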