Resistance-Based Memory Circuit Design
0251808 周德玉

Abstract
Decreasing read cell current (ICELL) has become a major trend in nonvolatile memory (NVM). However, a reduced ICELL leaves the operation of sense amplifiers (SAs) vulnerable to bitline (BL) level offset and SA input offset. Thus, small-ICELL NVMs suffer from slow read speed or low read yield. In this study, we propose a new current-sampling-based SA (CSB-SA) to suppress the offset due to device mismatch, while maintaining tolerance for insufficient precharge time. These features enable CSB-SA to achieve a read speed 6.3×-8.1× faster than previous SAs when sensing a 100 nA ICELL on a 2K-cell bitline.

Introduction
The ultimate nonvolatile data memory (NVM) should display characteristics such as high density and low cost, fast write and read access, low-energy operation, and high performance with respect to endurance (write cyclability) and retention [41]. Today, Si-based Flash memory devices represent the most prominent NVM because of their high density and low fabrication costs. However, Flash suffers from low endurance, low write speed, and the high voltages required for write operations. In addition, further scaling, i.e., a continued increase in the density of Flash, is expected to run into physical limits in the near future. Ferroelectric random access memory (FeRAM) and magnetoresistive random access memory (MRAM) cover niche markets for special applications. One reason among several others is that FeRAM as well as conventional MRAM exhibit technological and inherent problems in scalability, i.e., in achieving the same density as Flash today. To overcome the problems of current NVM concepts, a variety of alternative memory concepts are being explored. Most notably, NVMs based on electrically switchable resistance have attracted considerable attention, often summarized under the umbrella term resistance (switching) random access memory, or RRAM for short.
This review covers particularly interesting classes of RRAM in which redox reactions and nano-ionic transport processes play the key role. It should be noted, however, that despite the quite consistent picture painted in this review, many details and many variants are still completely unknown, and our current understanding has more the character of working hypotheses than of well-founded physical models.

General Requirements of RRAM
Memory cells in a RAM are organized in a matrix. The rows and columns of the matrix are called word lines and bit lines, respectively. They connect to the electronic amplifiers in the periphery of the matrix, which conduct the write and read operations. In the simplest case, resistively switching memory cells may be organized in a passive cross-bar matrix, just connecting the word and bit lines at each node (Fig. 1a). In order to avoid the so-called parasitic-path problem, i.e., signal bypasses through cells in their low-resistance state, serial elements with a particular non-linearity must be added at each node. Depending on the switching scheme of the memory cell, these can be diodes or varistor-type elements with a specific degree of non-linearity [42]. Alternatively, a RAM is organized in an active matrix comprising a select transistor at each node, which decouples the memory cell if it is not addressed (Fig. 1b). This concept significantly reduces crosstalk and disturb signals in the matrix, at the expense of some additional area required for the footprint of the transistor contacts.

A resistive switching memory cell in an RRAM is generally built as a capacitor-like MIM structure, composed of an insulating or resistive material ‘I’ sandwiched between two (possibly different) electron conductors ‘M’. In the framework of this review, the ‘I’ materials are oxides or higher chalcogenides, which typically show some ion conductivity.
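The parasitic-path problem of the passive matrix described above can be made concrete with a small numerical sketch. It assumes an idealized 2×2 crossbar with no series non-linear element and with the unselected word and bit lines left floating, so that the three unselected cells form one series path in parallel with the selected cell. The function names and the resistance values (chosen only to give ROFF/RON = 10) are illustrative assumptions, not values from the text.

```python
def parallel(r_a, r_b):
    """Equivalent resistance of two resistors in parallel."""
    return r_a * r_b / (r_a + r_b)


def apparent_resistance_2x2(r_selected, r_unselected):
    """Resistance seen between the selected word line and bit line.

    Assumes an ideal 2x2 passive crossbar with floating unselected lines:
    the three unselected cells are in series and shunt the selected cell.
    """
    sneak_path = sum(r_unselected)
    return parallel(r_selected, sneak_path)


# Illustrative values only (assumed): ROFF / RON = 10.
R_ON = 10e3
R_OFF = 100e3

# Worst case for both reads: all three neighbours are in the ON state.
seen_off = apparent_resistance_2x2(R_OFF, [R_ON, R_ON, R_ON])
seen_on = apparent_resistance_2x2(R_ON, [R_ON, R_ON, R_ON])

print(f"selected OFF cell appears as {seen_off / 1e3:.2f} kOhm")
print(f"selected ON cell appears as  {seen_on / 1e3:.2f} kOhm")
print(f"apparent OFF/ON ratio: {seen_off / seen_on:.2f}")
```

In this idealized case the apparent OFF/ON ratio at the array terminals is far below the cell-level ratio of 10, which is why a non-linear series element or a select transistor is needed at each node.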
These MIM cells can be electrically switched between at least two different resistance states after an initial electroforming cycle, which is usually required to activate the switching property. By applying appropriate programming or write voltage pulses Vwr, a cell in its high-resistance (OFF) state can be SET to a low-resistance (ON) state or RESET back into the OFF state. In the literature, the RESET is sometimes called an ‘erase’ operation. In a number of cases multilevel switching has been demonstrated, i.e., more than two resistance states have been established in order to realize, for example, multiple bits per cell. The state of the RRAM cell is detected by applying a read voltage Vrd.

Based on the circuit requirements of today's high-density NVM such as Flash, and taking predictions about technology scaling over the next 15 years into account [43], one can collect a number of requirements for RRAM cells:

Write operation: Write voltages Vwr should range from a few hundred mV (to be compatible with scaled CMOS) to a few V (to give an advantage over Flash, which suffers from high programming voltages). The length of the write voltage pulses twr should be <100 ns in order to compete with DRAM specifications and to outperform Flash, which has a programming speed of some 10 μs, or even <10 ns to approach high-performance SRAM.

Read operation: Read voltages Vrd need to be significantly smaller than write voltages Vwr in order to prevent a change of the resistance during the read operation. Because of circuit design constraints, Vrd cannot be less than approximately one tenth of Vwr. An additional requirement originates from the minimum read current Ird. In the ON state, Ird should not be less than approximately 1 μA to allow for fast detection of the state by reasonably small sense amplifiers. The read time trd must be of the order of twr or preferably shorter.
Resistance ratio: Although an ROFF/RON ratio of only 1.2 to 1.3 can be utilized by dedicated circuit design, as shown in MRAM, ROFF/RON ratios >10 are required to allow for small and highly efficient sense amplifiers and, hence, RRAM devices that are cost-competitive with Flash.

Endurance: Contemporary Flash shows a maximum number of write cycles between 10^3 and 10^7, depending on the type. RRAM should provide at least the same endurance, preferably better.

Retention: A data retention time tret of >10 years is required for universal NVM. This retention time must be maintained under thermal stress up to 85 °C and under small electrical stress such as a constant stream of Vrd pulses.

The combination of requirements on the write operation, the read operation, and the retention sets a voltage-time dilemma which is not addressed in most of the papers published on resistive switching so far. A Vwr/Vrd ratio of ten at most must produce an acceleration of the switching kinetics by a factor of tret/twr, i.e., approximately 10^16 (10 years is about 3 × 10^8 s, set against a twr of 10 ns to 100 ns). There are only a few physical mechanisms which show such a huge non-linearity.

Fig. 1. Circuit diagram of a storage node in the matrix of a resistance random access memory (RRAM), where RS denotes the resistive switching cell. a) Passive matrix, in which NLE is a serial element with a specific non-linearity. b) Active matrix with the select transistor T.

Fig. 2. The two basic operation schemes of resistance switching memory cells. I–V curves recorded for a triangular-shaped voltage signal. cc denotes the compliance current. Dashed lines indicate that the real voltage at the system will differ from the control voltage because of the cc in action. a) Unipolar switching. The SET voltage is always higher than the RESET voltage, and the RESET current is always higher than the cc during the SET operation. b) Bipolar switching. The SET operation occurs on one polarity of the voltage or current; the RESET operation requires the opposite polarity. In some systems, no cc is used. Please note that the I–V curves of real systems may deviate considerably from these sketches, for both operation schemes.

Nonvolatile memory (NVM) suffers from a decrease in read cell current (ICELL) as device size and VDD shrink while the threshold voltage stays the same. This vulnerability to decreased ICELL is exacerbated in the following situations: 1) multiple-level-cell (MLC) [1]–[8] or cross-point structures [9] that attempt to achieve a smaller area per bit; 2) lowered VDD [10]–[14] to reduce power consumption; 3) logic-process-compatible one-time/multiple-time programming memories (OTP/MTP) [15], [16] for embedding into mobile chips.

The sensing margin required by NVM is dominated by the sense amplifier (SA) offset and bitline (BL) level offsets. SA offset is caused by device mismatch resulting from process variations. The BL offset is the result of noise, bias, and load (CBL) mismatches between BLs. With continued efforts to reduce the cell size and BL pitch, overcoming these issues has become a major challenge in the read operation of NVMs with a smaller ICELL. Due to these offsets, small-ICELL NVMs suffer from slow read speed or high read-fail probability. Thus, developing an SA with greater offset tolerance is a prerequisite to achieving high-yield small-ICELL NVMs with faster read operations.

Many small-ICELL memories (NVMs and low-voltage SRAMs) employ a voltage-mode SA (VSA) [2]–[5], [17]–[22] with a long BL developing time to provide tolerance for BL and SA offset; however, this is accomplished at the cost of reduced read speed. Current-mode SA (CSA) achieves faster read speeds than VSA [1], [13]. Cascode-current-load (or resistive-divider-like) CSAs (CCL-CSAs) [9], [23]–[27] require long BL settling times and have a small 1st-stage voltage difference when reading a small ICELL. Current-mirror CSA (CM-CSA) [28]–[32] has fast read speeds but cannot sense small ICELL, due to mismatch in the mirror-stage devices.
Global-clamping-local-discharge CSA (GCLD-CSA) [1], [33], [34] achieves sub-100 nA sensing, but requires long BL precharge and settling times to prevent false reads. Threshold-voltage (VTH) nulling inverter-offset-compensated SA (IOC-SA) [35] reduces SA offset; however, BL offset and settling time still limit its speed advantage over VSA/CSA. Thus, there remains a need for a new SA capable of sensing small ICELL while providing fast read speed.

In this study, we propose a new offset-tolerant current-sampling-based SA (CSB-SA) [36] capable of detecting sub-100 nA ICELL against BL offsets and process variations, while providing a read speed faster than that of other SAs. A 90 nm prototype demonstrates sub-100 nA sensing with CSB-SA, as well as a 512 Kb logic-process OTP macro with sub-200 nA ICELL capable of a 26 ns random access time. Fig. 3 illustrates the superior access-time versus ICELL performance compared to previous NVMs.

Fig. 3. Read performance of recently reported NVMs.

Fig. 4. (a) Concept, (b) circuits, and (c) waveforms of conventional voltage-mode sense amplifier (VSA) and charge-transfer (CT) based VSA (CT-VSA).

Challenges of Small-Cell-Current Sensing
A. Voltage-Mode SA (VSA)
1) Conventional VSA: Fig. 4 outlines the concept, simplified circuit, and waveform of VSAs. In conventional VSAs, selected BLs are precharged to a target voltage (VPRE) in the precharge phase. When the WL is on, the BL is discharged by the ICELL (ICELL-0) of a 0-cell, which has a low VTH or is in the low-resistance state (LRS). In read-1 (R1) operations, involving a high-VTH or high-resistance-state (HRS) cell, the BL is maintained at VPRE when the current of the “1” cell (ICELL-1) is smaller than the BL-load current of the BL-keeper. The voltage comparator then compares the dataline (DL) voltage (VDL) or the sense-node voltage (VS1) with a reference voltage (Vpre), before outputting a digital result.
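Because the BL in this scheme is discharged by ICELL alone, a rough first-order estimate of the time needed to develop a given BL swing is T_BL ≈ C_BL · ΔV_BL / ICELL. The Python sketch below evaluates this under simplifying assumptions: a constant cell current, a purely capacitive BL, and no keeper or leakage current. The function name and the numerical values (a 1 pF BL load and a 100 mV required swing) are hypothetical and are not taken from the paper.

```python
def bl_developing_time(c_bl, v_swing, i_cell):
    """Time for a constant cell current to move a capacitive BL by v_swing.

    First-order model: T_BL = C_BL * dV / I_CELL. Keeper current, leakage
    and the voltage dependence of I_CELL are ignored.
    """
    if i_cell <= 0:
        raise ValueError("i_cell must be positive")
    return c_bl * v_swing / i_cell


# Hypothetical illustration values, not taken from the paper.
C_BL = 1e-12     # 1 pF bitline load (assumed)
V_SWING = 0.1    # 100 mV swing needed to cover BL and SA offsets (assumed)

t_large = bl_developing_time(C_BL, V_SWING, 1e-6)    # 1 uA cell current
t_small = bl_developing_time(C_BL, V_SWING, 100e-9)  # 100 nA cell current

print(f"T_BL at 1 uA:   {t_large * 1e9:.0f} ns")
print(f"T_BL at 100 nA: {t_small * 1e9:.0f} ns")
```

In this model the developing time scales inversely with ICELL and linearly with the BL load, so a tenfold reduction in cell current costs a tenfold longer developing time for the same swing.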
The I-V conversion behavior of VSAs suffers from BL offset, due to BL-load mismatch and BL noise, such as crosstalk and WL-to-BL coupling [1], [12]. Moreover, SAs and comparators have input offsets due to mismatches between transistors in VTH, width, length, and oxide thickness. Therefore, a large BL voltage swing is required to provide tolerance against both BL and SA offsets. Employing a longer BL developing time (TBL) increases the BL voltage swing (VBLS). However, if ICELL is small and the BL load is large, an excessively long TBL is required, which inevitably results in slow read speeds for high-yield sensing operations.

2) Charge-Transfer (CT)-Based VSA (CT-VSA): Many small-ICELL NVMs use charge-transfer (CT)-based VSA (CT-VSA) [2]–[5] to achieve read speeds faster than what is possible with conventional VSA. This is made possible by a larger local signal swing on the sensing node (VS1), while requiring less VBLS/TBL, as shown in Fig. 4. Moreover, CT-VSA uses the same CT transistor (MCT) to perform the BL precharge/clamping and charge-transfer sensing operations in order to cancel the fluctuations in sensing margin caused by VTH variation of the MCT. The large signal swing of VS1 also enables CT-VSA to use a single transistor as the 1st-stage comparator to drive the page buffer, rather than a latch-type comparator as in conventional VSA. This helps to save area and suppress peak current for wide-IO/column applications. However, like conventional VSA, CT-VSA is still vulnerable to BL offset and suffers from a long BL developing time when reading small ICELL.

3) Capacitor-Based Inverter-Offset-Compensated SA (IOC-SA): Fig. 5 presents the concept and circuits of previous capacitor-based inverter-offset-compensated SAs (IOC-SAs). IOC-SA uses cross-coupled capacitors to store and perform NMOS-VTH-nulling operations. This scheme only reduces the SA offset associated with VTH mismatch.
However, two MOS transistors with the same VTH may have different drain currents if their lengths or widths differ. IOC-SA does not cancel the SA offset resulting from other variations, including mismatches in transistor width, length, and TOX.

Fig. 5. Traditional NMOS-VTH-nulling inverter-offset-compensated SA (IOC-SA).

B. Current-Mode SA (CSA)
1) Current-Mirror-Based CSA (CM-CSA): Differential-input current-mirror-type CSA (CM-CSA) is commonly used in NOR-type NVMs to achieve fast read speeds when there is a large BL load and small cell current, as shown in Fig. 6. The CM-CSA scheme uses a fixed bias voltage on the BL (VPRE) to induce ICELL for reading. When the WL is on, the 0-cell generates a larger ICELL than a 1-cell. The current comparator then compares the to-be-sensed ICELL (ICELL-0 or ICELL-1) with a reference current (IREF) to determine the sensing result. However, transistor mismatch in the mirror-stage circuit (the M1-M3 and M2-M4 pairs) results in an offset between the source currents (ICELL/IREF) and the mirrored currents (I3/I4). This I3/I4 current offset leads to a voltage offset between the sense nodes S1 (VS1) and S2 (VS2). If ICELL and IREF are small, these offsets may result in low sensing yield. Fig. 7 plots the simulated I3 and I4 distributions of two 90 nm CM-CSAs, using 1× and 10× transistor sizes for M1-M4. The 1×-sized CM-CSA suffers low read yield when sensing ICELL = 1 μA and IREF = 500 nA, due to significant overlap between I3 and I4 resulting from VTH variations in M1-M4. The 10×-sized CM-CSA has a narrower distribution, which enables it to sense a smaller ICELL than the 1×-sized CM-CSA. However, it still suffers from I3-I4 overlap when sensing ICELL = 500 nA and IREF = 250 nA. CM-CSA is therefore incapable of providing high yield when reading sub-500 nA ICELL.

Fig. 6. (a) Concept, (b) circuit, and (c) waveform of a conventional current-mirror-based CSA (CM-CSA).

Fig. 7. Simulated current (I3 and I4) distribution of CM-CSA: (a) 1× transistor size for ICELL = 1 μA and IREF = 500 nA; (b) 10× transistor size for ICELL = 500 nA and IREF = 250 nA.

2) Cascode-Current-Load CSA (CCL-CSA): Fig. 8 shows the concept, structure, and waveform of the cascode-current-load CSA (CCL-CSA). CCL-CSA comprises a load (M3), a clamp transistor (M1), and a voltage comparator. The bias current generated by M3 (ILOAD) is set midway between ICELL-0 and ICELL-1. In R1 operations, ILOAD exceeds ICELL-1, so the voltage at node “S1” (VS1) is kept high during the sensing phase. In read-0 (R0) operations, ICELL-0 is larger than ILOAD; VS1 is pulled down, and the output datum (DOUT) is “0”. When the difference in current between ICELL-0 and ICELL-1 is small, the CCL-CSA requires a long BL settling time and develops only a small voltage difference between reading a 1-cell and a 0-cell. Moreover, the mismatch (or variation) between the ICELL and ILOAD paths significantly influences the minimum sensible ICELL and the read speed. Thus, CCL-CSA is unsuitable for reading ultra-small ICELL in long-BL applications.

Fig. 8. (a) Concept, (b) circuit, and (c) waveform of a cascode-current-load CSA (CCL-CSA).

3) Global-Clamping-Local-Discharging CSA (GCLD-CSA): Fig. 9 shows the concept, simplified circuit, and waveform of the global-clamping-local-discharging CSA (GCLD-CSA). In the precharge phase, the local sense nodes S0/S1 of GCLD-CSA are precharged to voltages (VS0/VS1) of (VBLC - VTH3)/VDD, which are higher than the BL precharge voltage (VPRE = VBLC1 - VTH1). In the sensing phase, the local sensing nodes are floating and remain at their precharged voltages when reading a 1-cell (ICELL-1 ≈ 0). To read a 0-cell, ICELL-0 discharges the local sensing nodes to generate a sufficient input voltage swing (VDD - (VBLC2 - VTH1)) on S1 for high-yield voltage comparison. At the same time, the S0 node is clamped at (VBLC2 - VTH2 > VPRE) to clamp the global BL at VPRE, in order to prevent BL swing and the generation of coupling noise to neighboring BLs, as in nominal current-sensing schemes. GCLD-CSA achieves sub-100 nA sensing, but requires that nearly no residual IPRE exists, to avoid a false drop in voltage at the floating sense nodes (S0 and S1). If the precharge time is insufficient, residual IPRE flows through the precharge PMOS, M3, and M1. Thus, VS1 falls between VBL and (VBLC3 - VTH3) at the end of the precharge phase. This insufficiency results in M3 being mistakenly turned on, which leads to a false drop in voltage at S0 during R1 operation. However, when IPRE is reduced and VBL nears the target VPRE, the precharge/clamping operation is in the near/sub-threshold range, which results in a long BL settling time. Thus, GCLD-CSA suffers slow read speed when reading small ICELL.

Fig. 9. (a) Concept, (b) circuit, and (c) waveform of a global-clamping-local-discharging CSA (GCLD-CSA).

Proposed Current-Sampling-Based Sense Amplifier
To achieve small-ICELL sensing capability while maintaining fast read speeds, this work proposes a current-sampling-based sense amplifier (CSB-SA) capable of overcoming process variations, while remaining tolerant of residual IPRE to reduce the required BL precharge time.

Concept of CSB-SA
Fig. 10(a) shows the concept of the proposed CSB-SA. This approach uses the same MOS device (M1/M2) for current sampling and current-ratio amplification across operating phases. Thus, VTH-independent current sampling can be implemented for different ICELL and IREF inputs. This approach differs significantly from that of CM-CSA, which uses different MOS devices for current mirroring or I-V conversion, resulting in increased vulnerability to mismatch between devices. Conventional VSA or CCL-CSA must develop the 1st-stage voltage on the heavily loaded BL using continuous ICELL driving.
Thus, the read access time of VSA and CCL-CSA is sensitive to BL load and mismatch. CSB-SA uses sampled current to rapidly generate the 1st-stage voltage difference at small-load internal nodes, thanks to its BL-decoupled behavior. In addition, the sampled currents are insensitive to transistor and CBL mismatch, and the current sampling operation is embedded in the BL precharge operation. Thanks to its tolerance for process variation, CSB-SA can tolerate residual IPRE, which reduces the time required for BL precharge. Compared to other SAs, CSB-SA achieves faster read speeds, while providing tolerance for transistor/BL offset when sensing small ICELL on a heavy-load BL, as shown in Fig. 10(c).

Operation of CSB-SA
The CSB-SA operation is divided into three phases. In phase-1 (BL precharge and current sampling), S1-S4 are turned on to connect the BL (at the ICELL side) and the dummy-BL (DBL, at the IREF side) to the diode-connected M1 and M2, respectively. M1 and M2 provide a large precharge current IPRE (IPRE1/IPRE2) to the BL/DBL at the beginning of phase-1. After a sufficient precharge time (TPRE), the voltages at nodes SA1 (VSA1) and SA2 (VSA2) are high, while IPRE is low. The drain currents of M1 and M2 are IM1 = ICELL + IPRE1 and IM2 = IREF + IPRE2, respectively. At the end of phase-1, the gate voltages of M1 (VG1) and M2 (VG2) are stored in C2 and C1, respectively. When IPRE is low enough to be disregarded, IM1/IM2 is equal to ICELL/IREF, regardless of the VTH mismatch between M1 and M2. Fig. 11 shows the equivalent circuit and waveform of phase-1.

Fig. 12 shows the circuit and waveform of phase-2. In phase-2 (current-ratio amplification), S1/S2 are switched off to disconnect SA1/SA2 from the heavy-load BL/DBL. At the beginning of phase-2, M1/M2 charge SA1/SA2 with the currents (IM1/IM2) sampled in phase-1. Over a given period in phase-2 (TP2), IM1/IM2 increase VSA1/VSA2 by ΔVSA1/ΔVSA2.
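As a quick numerical check of the phase-1 relations above (IM1 = ICELL + IPRE1 and IM2 = IREF + IPRE2), the following Python sketch evaluates the sampled currents for a few residual precharge currents. It assumes equal residual currents on both branches (IPRE1 = IPRE2), which is an idealization; the function name and the chosen residual values are illustrative only.

```python
def sampled_currents(i_cell, i_ref, i_pre1=0.0, i_pre2=0.0):
    """Drain currents sampled by M1 and M2 at the end of phase-1.

    Uses the relations given in the text:
        IM1 = ICELL + IPRE1,  IM2 = IREF + IPRE2
    """
    return i_cell + i_pre1, i_ref + i_pre2


I_CELL = 100e-9  # 100 nA cell current, as in the text's sensing example
I_REF = 50e-9    # 50 nA reference current

# Assumed: the same residual precharge current on both branches.
for i_pre in (0.0, 50e-9, 100e-9):
    i_m1, i_m2 = sampled_currents(I_CELL, I_REF, i_pre, i_pre)
    print(
        f"IPRE = {i_pre * 1e9:5.1f} nA: "
        f"IM1 - IM2 = {(i_m1 - i_m2) * 1e9:5.1f} nA, "
        f"IM1 / IM2 = {i_m1 / i_m2:.3f}"
    )
```

Under this assumption the difference IM1 - IM2 stays at ICELL - IREF, while the ratio IM1/IM2 moves toward 1 as the common-mode residual IPRE grows. This is consistent with the text's statement that IM1/IM2 equals ICELL/IREF only when IPRE is negligible.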
Due to the AC-coupling behavior of C1/C2, ΔVSA1 increases the gate voltage (VG2) and reduces the gate-source voltage difference (VGS2) of M2 by ΔVG2, and ΔVSA2 reduces VGS1 by ΔVG1. The reduction in VGS1 and VGS2 causes IM1 and IM2 to decrease differentially by ΔIM1 and ΔIM2, respectively. A longer TP2 increases the difference in current (ΔIM1-M2 = IM1 - IM2) between M1 and M2, when using the same devices (M1/M2) as in phase-1. This results in an amplification of the current ratio (CR = IM2/IM1), as shown in Fig. 13. CR amplification accelerates the development of the difference (ΔVSA) between VSA1 and VSA2.

Fig. 10. (a) Concept and (b) circuit of the proposed CSB-SA. (c) Conceptual read-speed comparison of CSB-SA with previous SAs.

Fig. 11. Operating circuit and waveform in phase-1 (current sampling).

Fig. 12. Operating circuit and waveform in phase-2.

Fig. 13. Current ratio vs. TP2.

In phase-3 (2nd-stage amplification), the EN signal turns on the NMOS latch and pulls down SA2, while M1 continues to charge SA1 to VDD for read-1 operations. Finally, the digital output is generated at nodes SA1 and SA2, as shown in Fig. 14.

Fig. 14. Equivalent circuit and waveform of phase-3 (current-amplification).

Analysis And Comparison
CSB-SA vs. Process Variation
In CSB-SA, the sampled IM1/IM2 acts as a DC element and ΔIM1-M2 acts as an AC element. Variations in ΔVSA occur at the end of phase-1 (ΔVSA-P1), due to the VTH mismatch between M1 and M2 and differences in the magnitude of the ICELL being sensed. The phase-2 AC-coupling behavior using DC elements enables VSA1-VSA2 “cross-over”, even if a significant proportion of the initial VSA1-VSA2 offset (ΔVSA-P1) remains. This is because ΔVSA-P1 can be overcome by ΔIM2-M1 if TP2 is sufficient, unlike the unrecoverable sensing error due to VTH mismatch in CM-CSA. Hence, delaying the SE signal, which activates the NMOS latch, until VSA1-VSA2 cross-over enables offset cancellation for CSB-SA against process variation.

Fig. 15 shows two simulated VSA1-VSA2 cross-over behaviors for sensing ICELL = 100 nA and IREF = 50 nA: with and without a 150 mV VTH mismatch between transistors M1 and M2. With VTH mismatch, the VSA1-VSA2 cross-over point occurs later than without mismatch. However, even with a VTH mismatch of 150 mV, CSB-SA still samples the same current and shows current-ratio amplification behavior similar to the case without mismatch. The functionality of CSB-SA is thus insensitive to device mismatch. Although ΔIM2-M1 is small when sensing small ICELL, the TP2 required to tolerate large mismatches between devices is insignificant and remains independent of CBL, thanks to the small parasitic load at nodes SA1 and SA2 (which are disconnected from the BL) during phase-2.

Fig. 16 shows the TP2 that CSB-SA requires to maintain functionality versus the M1-M2 VTH mismatch. When a larger M1-M2 mismatch occurs, the CSB-SA requires a longer TP2. Fortunately, the TP2 penalty needed to compensate for a 150 mV VTH mismatch is only 0.4% of the access time of a macro with 2048 cells per BL. Unlike CM-CSA, which suffers from low yield due to variations in VTH, the CSB-SA can achieve 100% yield when sensing sub-100 nA against device variations, as long as sufficient TP2 is provided.

The variation in capacitance of C1/C2 does not affect the behavior of current sampling. This is because the sampled current in phase-1 depends on the voltages (VG1/VG2) stored on C1/C2, which are insensitive to the capacitance of C1/C2. In phase-2, ΔVSA1/ΔVSA2 depends on the capacitance of node SA1/SA2, which includes that of C1/C2 as well as the parasitic capacitance of S1/S2 and S3/S4. ΔVG1/ΔVG2 depends on the capacitance of node G1/G2, which includes that of C1/C2 as well as the parasitic capacitance of M1/M2 and S3/S4.
Because the parasitic capacitance of (S1+S3)/(S2+S4) or (M1+S3)/(M2+S4) far exceeds that of C1/C2, the variation in capacitance of C1/C2 does not have a significant influence on the behavior of phase-2.

The CSB-SA scheme uses an inactive sub-array to provide the DBLs with IREF, as shown in Fig. 17. This improves the common-mode BL precharge behavior of the ICELL and IREF branches. Fig. 18 shows the switch-point analysis under two different IM2 conditions: IM2 = 50 nA and IM2 = 150 nA. This analysis includes variations in all transistors in the CSB-SA, using a 10000-point Monte-Carlo simulation with the foundry's statistical SPICE model. It should be noted that IM2 is the sum of IREF and IPRE2. The CSB-SA achieves 100% yield if ΔIM1-M2 exceeds 3.8 nA when IM2 is only 50 nA (IREF = 50 nA and IPRE = 0 nA). With an increased residual IPRE (i.e., 100 nA), while IREF is maintained at 50 nA, the CSB-SA still has a small dead-zone (ΔIM1-M2 = 7.8 nA). This indicates that CSB-SA is not overly sensitive to near-common-mode residual IPRE. The small dead-zone implies that the CSB-SA is capable of tolerating a larger residual IPRE and employing a shorter BL precharge time to achieve faster read speeds without a significant influence on its yield, particularly in long-BL applications.

Fig. 15. SA1-SA2 cross-over behavior: (a) No mismatch between M1 and M2; (b) 150 mV VTH mismatch between M1 and M2.

Fig. 16. TP2 versus VTH mismatch between M1 and M2 transistors.

Fig. 17. Macro-level sensing structure for CSB-SA.

Fig. 18. Switch-point analysis of CSB-SA: (a) IM2 = 50 nA; (b) IM2 = 150 nA.

BL Precharge Time vs. Speed
Fig. 19 shows the TPRE of two BL structures (256-cell-per-BL and 2048-cell-per-BL) required to read ICELL = 100 nA and IREF = 50 nA across various levels of residual IPRE. The ability to tolerate a residual IPRE of 50 nA enables a reduction in the TPRE of the 256-cell-per-BL and 2048-cell-per-BL structures of 76% and 82%, respectively, compared to the case without residual IPRE. Therefore, using a shorter TPRE is an effective approach for improving the macro read access time, provided the SA functions correctly. However, using a shorter (insufficient) TPRE increases the residual IPRE (IPRE1 and IPRE2) of most CSAs. Fortunately, as discussed in Section IV.B, our CSB-SA can tolerate near-common-mode residual IPRE, at the expense of increased delay time (TP2) in phase-2. Fig. 20 plots TP2 and macro access time (TAC) versus IPRE in two cases: 1) IREF = 50 nA and ICELL = 100 nA; 2) IREF = 100 nA and ICELL = 200 nA, for a 512 Kb macro with 256 cells per BL. As expected, larger IREF and ICELL reduce the TP2 penalty for a given level of IPRE to be tolerated. Although TP2 increases slightly with IPRE in order to generate the target ΔVSA-P2 for high-yield phase-3 operation, the overall read access time (TAC) of CSB-SA is significantly reduced by using a shorter TPRE with a higher IPRE.

Fig. 19. BL precharge time (TPRE) vs. residual IPRE.

Fig. 20. Simulated TP2 and macro access time (BL length = 256) vs. residual IPRE for two cases: 1) IREF = 50 nA and ICELL = 100 nA; 2) IREF = 100 nA and ICELL = 200 nA.

Comparison With Other SAs
1) Comparison With Conventional SAs: Fig. 21 compares yield versus ICELL (ICELL-0) for the higher-speed SAs, the proposed CSB-SA and CM-CSA, using a 10000-point Monte-Carlo simulation. As mentioned in Section II, the CM-CSA suffers from low yield for sub-1 μA ICELL. Conversely, CSB-SA achieves high yield even when ICELL is below 100 nA, particularly when using a longer TP2 phase (i.e., 50 ns). Fig. 22 compares the random read access speed of various SAs for an NVM macro with 2048 cells per BL. The CSB-SA without residual IPRE achieves 1.1×-1.4× faster macro read speeds when reading ICELL = 100 nA, compared to VSA, CT-VSA, CCL-CSA, and GCLD-CSA.
Tolerance for 50 nA IPRE enables the CSB-SA to achieve macro read speeds (for reading a 100 nA ICELL) 6.3×-8.1× faster than those of other SAs.

2) Comparison With Capacitor-Based Inverter-Offset-Compensated SA (IOC-SA): CSB-SA and IOC-SA have similar circuit structures, using cross-coupled capacitors to suppress the input offset of the SA. However, the circuit behavior and the offset suppression performance differ between CSB-SA and IOC-SA from the following two perspectives: 1) usage of capacitors for offset suppression; 2) area and design complexity.

a) Usage of Capacitors for Offset Suppression: Unlike IOC-SA, the CSB-SA uses capacitors to store the gate voltage (VG) for input current sampling. This sampled current accounts for the VTH variation as well as the transistor width, length, and TOX mismatches between the M1/M2 transistors. Unlike the voltage-mode amplification in IOC-SA, our proposed CSB-SA utilizes current-mode operations (current sampling and current-ratio amplification) for 1st-stage input-signal amplification. In addition, delaying SE until VSA1-VSA2 cross-over in phase-2 enables offset cancellation, even if the initial VSA1-VSA2 offset (ΔVSA-P1) remains. This enables CSB-SA to employ relaxed internal timing to improve its tolerance for variation without the need to increase the difference between input signals on the BLs.

b) Area and Design Complexity: The VTH or auto-zero point storing operation in IOC-SAs requires a complex multi-step offset nulling process and numerous switches. The use of many switches degrades offset-cancelling performance, due to the charge-injection effect. In contrast, the CSB-SA requires only four switches sharing the same control signal, PRE. The IOC-SA requires twice as many switches and four times as many control signals as CSB-SA, as shown in Table I. The VTH-nulling operation increases the control complexity and area overhead of IOC-SA beyond what is required for the proposed CSB-SA.

Fig. 21. Yields of CSB-SA and CM-CSA.

Fig. 22. Comparison of speed between various SAs.

Conclusion
This paper proposes a current-sampling-based (CSB) SA to achieve robust and fast read operations for NVM with small cell current. This CSB-SA is insensitive to device mismatch and BL offset, while achieving 6.3×-8.1× faster read speed than other SAs when reading a 2048-cell BL with a 100 nA cell current [40].

References
[1] R.-A. Cernea et al., “A 34 MB/s MLC write throughput 16 Gb NAND with all bit line architecture on 56 nm technology,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 186–194, Jan. 2009.
[2] K. Takeuchi et al., “A 56-nm CMOS 99-mm² 8-Gb multi-level NAND flash memory with 10 MB/s program throughput,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2006, pp. 507–516.
[3] D.-S. Byeon et al., “An 8 Gb multi-level NAND flash memory with 63 nm STI CMOS process technology,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2005, pp. 46–47.
[4] S.-H. Chang et al., “A 48 nm 32 Gb 8-level NAND flash memory with 5.5 MB/s program throughput,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2009, pp. 240–241.
[5] G. G. Marotta et al., “A 3 bit/cell 32 Gb NAND flash memory at 34 nm with 6 MB/s program throughput and with dynamic 2 b/cell blocks configuration mode for a program throughput increase up to 13 MB/s,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2010, pp. 444–445.
[6] K. Fukuda et al., “A 151 mm² 64 Gb MLC NAND flash memory in 24 nm CMOS technology,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2011, pp. 198–199.
[7] D. Lee et al., “A 64 Gb 533 Mb/s DDR interface MLC NAND flash in sub-20 nm technology,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2012, pp. 430–431.
[8] Y. Li et al., “128 Gb 3b/cell NAND flash memory in 19 nm technology with 18 MB/s write rate and 400 Mb/s toggle mode,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2012, pp. 436–437.
[9] C. J. Chevallier et al., “A 0.13 μm 64 Mb multi-layered conductive metal-oxide memory,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2010, pp. 260–261.
[10] A. Wang and A. Chandrakasan, “A 180-mV FFT processor using subthreshold circuit techniques,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2004, pp. 292–529.
[11] M. Seok, S. Hanson, J.-S. Seo, D. Sylvester, and D. Blaauw, “Robust ultra-low voltage ROM design,” in Proc. IEEE 2008 Custom Integrated Circuits Conf. (CICC), 2008, pp. 423–426.
[12] M.-F. Chang, S.-M. Yang, C.-W. Liang, C.-C. Chiang, P.-F. Chiu, and K.-F. Lin, “Noise-immune embedded NAND-ROM using a dynamic split source-line scheme for VDDmin and speed improvements,” IEEE J. Solid-State Circuits, vol. 45, no. 10, pp. 2142–2155, Oct. 2010.
[13] M.-F. Chang et al., “A 0.5 V 4 Mb logic-process compatible embedded resistive RAM (ReRAM) in 65 nm CMOS using low voltage current-mode sensing scheme with 45 ns random read time,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2012, pp. 434–435.
[14] M.-F. Chang et al., “A 0.29 V embedded NAND-ROM in 90 nm CMOS for ultra-low-voltage applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2010, pp. 266–267.
[15] H.-C. Lai, K.-Y. Cheng, Y. C. King, and C. J. Lin, “A 0.26-μm² U-shaped nitride-based programming cell on pure 90-nm CMOS technology,” IEEE Electron Device Lett., vol. 28, no. 9, pp. 837–839, Sep. 2007.
[16] C. E. Huang et al., “A new self-aligned nitride MTP cell with 45 nm CMOS fully compatible process,” in IEDM Dig., Dec. 2007, pp. 91–94.
[17] B. H. Calhoun and A. P. Chandrakasan, “A 256 kb subthreshold SRAM in 65 nm CMOS,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2006, pp. 2592–2601.
[18] T. H. Kim, J. Liu, and C. H. Kim, “A voltage scalable 0.26 V, 64 Kb 8T SRAM with Vmin lowering techniques and deep sleep mode,” IEEE J. Solid-State Circuits, vol. 44, no. 6, pp. 1785–1795, Jun. 2009.
[19] B. Zhai, D. Blaauw, D. Sylvester, and S. Hanson, “A variation-tolerant sub-200 mV 6 T subthreshold SRAM,” IEEE J. Solid-State Circuits, vol. 43, no. 10, pp. 2338–2348, Feb. 2008.
[20] I. J. Chang, J. J. Kim, S. P. Park, and K. Roy, “A 32 kb 10 T subthreshold SRAM array with bit-interleaving and differential read scheme in 90 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 650–658, Feb. 2009.
[21] Y. Morita et al., “An area-conscious low-voltage-oriented 8T-SRAM design under DVS environment,” in Symp. VLSI Circuits Dig. Papers, 2007, pp. 256–257.
[22] J. Chen, L. T. Clark, and T.-H. Chen, “An ultra-low-power memory with a subthreshold power supply voltage,” IEEE J. Solid-State Circuits, vol. 41, no. 10, pp. 2344–2353, Oct. 2006.
[23] J.-K. Kim et al., “A 120-mm² 64-Mb NAND flash memory achieving 180 ns/byte effective program speed,” IEEE J. Solid-State Circuits, vol. 32, no. 5, pp. 670–680, May 1997.
[24] T. Tanzawa, Y. Takano, T. Taura, and S. Atsumi, “Design of a sense circuit for low-voltage flash memories,” IEEE J. Solid-State Circuits, vol. 35, no. 10, pp. 1415–1421, Oct. 2000.
[25] K. J. Lee and B.-H. Cho et al., “A 90 nm 1.8 V 512 Mb diode-switch PRAM with 266 MB/s read throughput,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2007, pp. 472–616.
[26] F. Bedeschi et al., “A bipolar-selected phase change memory featuring multi-level cell storage,” IEEE J. Solid-State Circuits, vol. 44, pp. 217–227, 2009.
[27] S. Chung, J.-T. Huang, P. Chen, and F.-L. Hsueh, “A 512×8 electrical fuse memory with 15 μm² cells using 8-sq asymmetric fuse and core devices in 90 nm CMOS,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, 2007, pp. 74–75.
[28] M.-K. Seo et al., “A 130-nm 0.9 V 66-MHz 8-Mb (256 K × 32) local SONOS embedded flash EEPROM,” IEEE J. Solid-State Circuits, vol. 40, no. 4, pp. 877–883, Apr. 2005.
[29] T. Ogura et al., “A 1.8-V 256-Mb multilevel cell NOR flash memory with BGO function,” IEEE J. Solid-State Circuits, vol.
41, no. 11, pp. 2589–2600, Nov. 2006. [30] M.-F. Chang et al., “A process variation tolerant embedded split-gate flash memory using pre-stable current sensing scheme,” IEEE J. Solid-State Circuits, vol. 44, no. 3, pp. 987–994, Mar. 2009. [31] A. Conte, “A high-performance very low-voltage current sense amplifier for nonvolatile memories,” IEEE J. Solid-State Circuits, vol. 40, no. 2, pp. 507–514, Feb. 2005. [32] R. Micheloni, “The flash memory read path: Building blocks and critical aspects,” Proc. IEEE, vol. 91, no. 4, Apr. 2003. [33] T. Futatsuyama et al., “A 113 mm 32 Gb 3b/cell NAND flash memory,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2009, pp. 242–243. [34] Y. Li et al., “A 16 Gb 3-bit per cell (X3) NAND flash memory on 56 nm technology with 8 MB/s write rate,” IEEE J. Solid-State Circuits, pp. 195–207, Feb. 2009. [35] Javanifard et al., “A 45 nm self-aligned-contact process 1 Gb NOR flash with 5 MB/s program speed,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2008, pp. 424–425. [36] M.-F. Chang et al., “An offset tolerant current-sampling-based sense amplifier for sub-100 nA-cell-current nonvolatile memory,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2011, pp. 206–207. [37] S. H. Kulkarni et al., “High-density 3-D metal-fuse PROM featuring 1.37 m 1T1R bit cell in 32 nm high-k metal-gate CMOS technology,” in Symp. VLSI Circuits Dig. Tech. Papers, 2009, pp. 28–29. [38] G. Uhlmann et al., “A commercial field-programmable dense eFUSE array memory with 99.999% sense yield for 45 nm SOI CMOS,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2008, pp. 406–407. [39] S. Chung et al., “A 1.25 um cell 32 Kb electrical fuse memory in 32 nmCMOS with 700mV Vddmin and parallel/serial interface,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, 2009, pp. 30–31. [40] M.-F. 
Chang et al., “An Offset-Tolerant Fast-Random-Read Current-Sampling-Based Sense Amplifier for Small-Cell-Current Nonvolatile Memory,” IEEE J. Solid-State Circuits, vol. 48, no. 3, pp. 864–877, Mar. 2013. [41] Nanotechnology, Vol. 3 (Ed: R. Waser), Wiley-VCH, Weinheim 2008. [42] A. Flocke, T. G. Noll, C. Kugeler, C. Nauenheim, R. Waser, in Proc. IEEE Non-Volatile Memory Technology Symposium, 2008, p. 319. [43] The International Technology Roadmap for Semiconductors – ITRS 2007 Edition (http://www.itrs.net), 2007.