Low Power ROM Generation

by

Paul L. Chou

B.S. in Electrical Engineering, University of California at Berkeley (1994)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

June 1996

© 1996 Paul L. Chou. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, June 1996

Certified by: Anantha P. Chandrakasan, Assistant Professor of Electrical Engineering

Certified by: Chairman

Abstract

Recently, the reduction of power consumption in memory systems has been an active area of research, due largely to the interest in portable computing devices such as notebook and palmtop personal computers. This work is concerned with the implementation of a low power read-only memory (ROM) generator with a simple interface and the ability to optimize the ROM given power and process parameters. A low power ROM generator is converted to the Cadence Design Environment. The application of low power techniques to the generator is investigated, including charge redistribution sense amplifiers, reduced-element word decoders, and other methods borrowed from low-power I/O. These methods are examined in an attempt to minimize power dissipation while maintaining performance.
Techniques found applicable to low power memories are implemented, and a generator, ROMGEN, is developed to create a ROM from parameters supplied by the user. In addition to these methods for low power dissipation, two tools are developed in an effort to reduce design time. A ROM modelling program, ROMODEL, is developed to model the power dissipation characteristics of a ROM. An optimization program, ROMOPT, is also developed; it applies the modelling program to find the optimal configuration for the ROM. The results gathered from these tools can then be used to generate a ROM configured for low power.

Thesis Supervisor: Anantha P. Chandrakasan
Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

I would like to thank a number of people who have contributed to the completion of this project:

Anantha Chandrakasan for his insight, encouragement, and enthusiasm. Working with him was an honor and a privilege.

Thucydides Xanthopoulos for the countless times that he has helped me. He patiently answered my questions, gave me advice and encouragement, and proofread many extremely rough drafts; none of this work would be possible without him.

Andrew Burstein and Tom Burd for help with the Berkeley low power cell library. Without their help, I would still be trying to create my own.

All my friends in Boston, especially Michelle Madriaga, James Kao, Francis Lau, Sherry Whitley, and Lapoe Lynn for making sure I had some fun while at MIT.

My family for their invaluable support during these years far from home. My parents, Peter and Mary, my brother Han, and my sister Christine are responsible for my accomplishments. They have given me more patience, love, and support than I deserve.

Table of Contents

Ch. 1 Introduction
    1.1 Low Power Fundamentals
    1.2 Contributions of this work
    1.3 Background
        1.3.1 ROM Architecture
        1.3.2 Memory Cell
        1.3.3 Peripheral Circuitry
Ch. 2 Magic to Cadence Cell Library Conversion
    2.1 Low Power ROM Library
    2.2 Cell Library Conversion
        2.2.1 m2s Technology File
        2.2.2 Cell Conversion and Generation
        2.2.3 Design Rule Check
        2.2.4 Tiling
        2.2.5 Verification
    2.3 Summary
Ch. 3 ROM Operation
    3.1 ROM Specification
    3.2 The ROM Block
        3.2.1 Terminals
    3.3 Basic Operation
    3.4 ROM Low Power Features
        3.4.1 Reduced Capacitance
        3.4.2 Reduced Action
        3.4.3 Reduced Swing
    3.5 ROM Block Architecture
        3.5.1 Control Signals
        3.5.2 Address Latch
        3.5.3 Row Decoder and Wordline Driver
        3.5.4 ROM Core
        3.5.5 Column Selection
        3.5.6 Sense Amplification
        3.5.7 XOR
        3.5.8 Self Timing
        3.5.9 Output Driver
    3.6 ROM Bus
        3.6.1 Bus Architecture
        3.6.2 Block Decode Logic
    3.7 Simulation
    3.8 ROM Performance
Ch. 4 ROM Generation
    4.1 Usage
        4.1.1 Parameter file
        4.1.2 Generation
    4.2 Cells
    4.3 ROMGEN
        4.3.1 Tiling Procedures
        4.3.2 ROM Block Tiling
        4.3.3 ROM Bus Tiling
        4.3.4 Multiblock ROM Tiling
Ch. 5 ROM Modelling
    5.1 Brief Description of Operation
    5.2 ROMODEL Limitations
    5.3 Overview
        5.3.1 ROMODEL Technology File
        5.3.2 Node Model
        5.3.3 Modelling Wiring Capacitances
        5.3.4 Modelling Gate Capacitances
        5.3.5 Modelling Junction Capacitances
    5.4 Modelling the ROM
        5.4.1 Control Signals
        5.4.2 Address Latches and Row Decoding
        5.4.3 ROM Core
        5.4.4 Sense Amplifiers and XOR Decoders
        5.4.5 Output Latch and Driver
    5.5 ROMODEL Usage
        5.5.1 ROMAVG: Analyzing ROM Core Data
        5.5.2 ROMODEL: Modelling the ROM
    5.6 Results: ROMODEL vs. HSPICE
        5.6.1 Inverter Chain
        5.6.2 ROM
    5.7 Optimization Procedure
    5.8 ROMOPT Usage
    5.9 Interpreting the ROMOPT Report
Ch. 6 Conclusion
Appendix A
Appendix B

List of Figures

FIGURE 1. Architectures for memory arrays
FIGURE 2. 4 x 4 NAND ROM [Rabaey95]
FIGURE 3. 4 x 4 NOR ROM [Rabaey95]
FIGURE 4. Part of a Magic layout of the column select cell
FIGURE 5. Part of a Cadence layout of the column select cell
FIGURE 6. The ROM diagram
FIGURE 7. A multiple block ROM
FIGURE 8. A ROM block
FIGURE 9. Basic timing diagram for a ROM block
FIGURE 10. Schematic of ROM control logic
FIGURE 11. A timing diagram of the ROM control signals
FIGURE 12. A single ROM address latch
FIGURE 13. ROM row decoder and wordline driver
FIGURE 14. A "0" cell and a "1" cell
FIGURE 15. Example showing the original ROM data vs. coded ROM core
FIGURE 16. Column selection, sense amplification, and XOR circuits
FIGURE 17. The effects of charge sharing of the bitline with the output node
FIGURE 18. Simplified view of a charge redistribution amplifier
FIGURE 19. XOR circuit used for decoding the ROM data
FIGURE 20. Self-timing circuits for Ready signal generation
FIGURE 21. Schematic of the output driver
FIGURE 22. A single ROM bus instance
FIGURE 23. A 1-bit block decoder
FIGURE 24. A 4-bit block decoder
FIGURE 25. HSPICE results from a ROM simulation
FIGURE 26. HSPICE results from a ROM simulation
FIGURE 27. ROM Core and Column Select Mirroring
FIGURE 28. Block 0 layout of a 2 block ROM
FIGURE 29. A typical ROM bus layout
FIGURE 30. The layout of a 256 word by 8 bit, 4 block ROM
FIGURE 31. Typical gate capacitance vs. gate voltage plot from HSPICE
FIGURE 32. Actual and model C-V characteristics of a NMOS device
FIGURE 33. Actual and model C-V characteristics of a PMOS device
FIGURE 34. Equivalent Gate to GND Capacitance
FIGURE 35. Calculation of total area under CV curve from HSPICE data
FIGURE 36. Supply current charging the drain capacitance
FIGURE 37. An inverter chain for HSPICE vs. ROMODEL comparisons

List of Tables

TABLE 1. Layer Equivalents
TABLE 2. Contact Equivalents
TABLE 3. Design Rules for Magic to Cadence Conversion
TABLE 4. HSPICE results for a 16 word x 4 bit ROM
TABLE 5. HSPICE results for a 256 word x 8 bit ROM
TABLE 6. HSPICE comparison of original ROM versus modified ROM
TABLE 7. Summary of ROMGEN parameters
TABLE 8. ROMGEN block cells
TABLE 9. ROMGEN bus cells
TABLE 10. Technology File Parameters
TABLE 11. Node Variables and Descriptions
TABLE 12. HSPICE simulation vs. ROMODEL for 5-inverter chain
TABLE 13. HSPICE simulation vs. ROMODEL for a 16 word x 4 bit ROM
TABLE 14. HSPICE simulation vs. ROMODEL for a 256 word x 8 bit ROM

Chapter 1 Introduction

As chip densities and system sizes continue to increase, providing adequate cooling for the system becomes a problem, creating a need for low power design strategies. The popularity of laptop computers and other portable computing devices is an additional reason for interest in low power strategies, since low power designs extend the battery life of these systems. Because control systems for these applications typically require a substantial amount of non-volatile memory, the development of low power memory design strategies is important. This work is dedicated to the investigation and implementation of low power read-only memories.

1.1 Low Power Fundamentals

Low power techniques generally involve reducing or eliminating the components that contribute to power consumption. These can be classified into dynamic consumption, consumption due to short circuit currents, and static consumption:

$P_{total} = P_{dynamic} + P_{sc} + P_{static}$    (EQ 1)

One property of CMOS designs that has made them so popular in low power applications is that static power consumption is generally insignificant. The main contributors to static consumption are: 1) circuits with static current, which are avoided when possible, 2) leakage due to reverse biased diodes at the junctions, which is typically very small and negligible, and 3) subthreshold currents, the effect of which is minimized by having a larger threshold voltage (VT), typically about 0.75V in current technologies. Subthreshold currents can be reduced by employing dynamic optimization techniques such as MTCMOS in a low VT system [Mutoh95].

Short circuit currents are due to direct current paths between VDD and GND. This can occur, for example, when both NMOS and PMOS devices of a logic gate are on during an input transition.
To minimize $P_{sc}$, the rise and fall times of the input and output are kept approximately the same.

The following is the expression for dynamic power for linear capacitors:

$P_{dynamic} = C_L V_{DD} V_{swing} f$    (EQ 2)

where $C_L$ is the total capacitance of the node, $V_{swing}$ is the voltage change seen at the node, and $f$ is the frequency of the transition. Since dynamic power dissipation constitutes the most significant component of power dissipation, this work is focused on reducing and modelling dynamic power consumption. Nonlinear capacitors such as parasitic junction capacitances and gate capacitances are modelled so that the power dissipated can be calculated (Chapter 5).

Examining the expression for dynamic power consumption, one can see that reducing power consumption requires reducing the load capacitance, reducing the supply voltage as much as possible (while meeting performance specifications), reducing the voltage swing, and reducing the frequency at which the capacitances are charged and discharged. These are the basic guidelines for low power CMOS design.

1.2 Contributions of this work

The focus of this work is to provide a low power ROM generator, ROMGEN, with a simple interface, applying new techniques in low power design and modelling. This thesis includes the software to generate the U.C. Berkeley low power ROM in the Cadence Design Environment. Ideas borrowed from low power I/O coding are implemented to reduce the power of the ROM further. A ROM modeler, ROMODEL, is created to quickly and accurately model the energy dissipated in the ROM, and a number of models are presented and implemented for this calculation. Finally, ROMOPT is a tool that uses the results of the modelling tool to find the optimal partitioning scheme prior to generation.

1.3 Background

Since the architecture of a ROM is generally well defined, identifying where power is dissipated is straightforward.
Architectural decisions, memory cell design, and peripheral circuitry all affect power consumption. Much of the recent work on memories has involved exploring methods to reduce power consumption in each of these areas.

1.3.1 ROM Architecture

Memory arrays are typically organized so that the dimensions are of the same order of magnitude. The reasons for this can be understood by examining the basic array structure depicted in Figure 1a. For a memory with M bits of address where each word is N bits wide, the height (2^M) is many orders of magnitude greater than the width (N) for typical memory sizes. Not only does this result in extremely long bitlines and poor performance, but the design is also not implementable due to the extremely large aspect ratio [Rabaey95].

A partial solution to the problem is depicted in Figure 1b. By placing multiple words on a single wordline and using a column address to multiplex the data, the aspect ratio is reduced. This reduces the amount of row decoding circuitry necessary to select the wordline and decreases the bitline length and capacitance. Consequently, power is reduced and performance is increased. However, too much column decoding is undesirable for low power memories, because the column decoding operation discards all but one word of the data obtained during a read. This is discussed in more detail in Section 3.5.5.

Partitioning is another technique used to make the dimensions of the memory array manageable, as illustrated in Figure 1c. By partitioning the memory into a number of blocks, decoding logic and sense amplifiers in blocks which are not accessed can be disabled, resulting in substantial power savings, especially with large memories. Also, because the local row address is decoded locally within each block, the smaller row decoding circuitry yields faster operation and lower power because less capacitance is
This comes at a cost of area, however, since each ROM block must have identical row and column decoding circuitry. Dividing the memory into several smaller blocks means that overall performance of the memory is approximately that of a single block, which can be a substantial improvement for large memories partitioned into several blocks. Word 0 Word 1 Word 2 Row Address Row Addi Word N- -1 . . Data Data (a) (b) " FIGURE 1. Architectures for memory arrays. (a) A basic array structure. (b) Reducing the aspect ratio by dividing the address into a row address and column address. (c) Further reducing the aspect ratio by dividing the memory array into multiple blocks and using a block address to enable a single block. However, partitioning increases overhead which could affect performance and power dissipation in a negative way. Partitioning means that capacitance on the address Low Power ROM Generation and data lines is increased, since ROM blocks share the same buses. Also, dividing the ROM into blocks requires additional block selection logic, which could possibly increase overall power dissipation and increases the setup time before a read cycle can begin. These trade-offs are a main reason for the development of the ROM modeler and optimizer which will be used to find the optimal partitioning scheme. 1.3.2 Memory Cell Memory cell designs typically do not change much and are heavily dependent on process technology, however, there are some choices that can be made. The choice between NAND and NOR arrays is typically simple; NAND arrays are typically too slow to be practical. r-4 A _ • l- A • L • *BL[O] [ • L • • L Pull-un devices .... •m•-- • ...... BL[1] -oBL[2] -* BL[3] WL[O] WL[1] uVDD " .L_ IIr- I 1· fi WL[2] WL[31 .1L I BL[3] -I FIGURE 2. 4 x 4 NAND ROM [Rabaey95]. Figure 2 shows an example NAND ROM array. All the wordlines are initially set to "1", and the selected wordline is set to "0". 
If no transistor is present where the selected wordline crosses the bitline, the bitline remains discharged through the series chain of NMOS devices, which is equivalent to reading a "1". When a transistor is present on the selected wordline, setting the row to "0" turns off the transistor, which eliminates the path to ground for the bitline. The PMOS pull-up raises the bitline voltage, which is equivalent to reading a "0". This configuration produces a very dense ROM core; however, due to the long chain of NMOS devices connected in series, the NAND structure yields unacceptably slow read times for medium and large sized ROMs.

An example NOR ROM is depicted in Figure 3. During a read cycle, the selected wordline is set to "1". If a transistor is present, it is turned on and discharges the bitline, which is equivalent to reading a "1". If a transistor is not present, the bitline remains charged, meaning the data is a "0". This results in a substantial worst case performance increase over the NAND ROM because only a single NMOS transistor lies between the bitline and GND. However, the size of a single cell doubles (as compared to the NAND structure) because of the GND wires in the ROM core. For large ROMs, this increase in size is worth the increase in performance.

FIGURE 3. 4 x 4 NOR ROM [Rabaey95].

Instead of using PMOS devices as pull-ups in a pseudo-NMOS configuration (as shown in Figure 3), NMOS devices will be used to precharge the bitline dynamically to one VT below VDD. By eliminating the static PMOS pull-ups, the direct path from VDD to GND is eliminated, saving power. By using an NMOS device to charge the bitline, the worst case bitline swing is (VDD-VT), reducing the power dissipation for discharging a bitline.
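As a rough numerical illustration of the reduced swing, the sketch below applies the form of EQ 2 (without the frequency term) to a single bitline. The function name, the 1 pF bitline capacitance, and the 3.3 V supply are assumptions chosen only for illustration; the 0.75 V threshold is the typical value quoted in Section 1.1.

```python
def bitline_energy(c_bitline, vdd, v_swing):
    """Energy drawn from the supply to recharge a bitline through v_swing.

    Follows the form of EQ 2 with the frequency term dropped:
    E = C * VDD * Vswing (linear capacitance assumed).
    """
    return c_bitline * vdd * v_swing


# Illustrative values (assumptions, not measurements from this work).
C_BL = 1e-12   # bitline capacitance in farads (assumed)
VDD = 3.3      # supply voltage in volts (assumed)
VT = 0.75      # threshold voltage in volts (typical value from Section 1.1)

full = bitline_energy(C_BL, VDD, VDD)          # bitline precharged to VDD
reduced = bitline_energy(C_BL, VDD, VDD - VT)  # NMOS precharge to VDD - VT
saving = 1 - reduced / full                    # fractional energy saving

print(f"full swing:    {full:.3e} J")
print(f"reduced swing: {reduced:.3e} J")
print(f"saving:        {saving:.1%}")
```

Under these assumptions the fractional saving equals VT/VDD and does not depend on the bitline capacitance.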
Also, because the source of the NMOS precharge device is not at GND, the threshold voltage VT is higher than VT0 due to the body effect, further decreasing the worst case bitline swing.

In the interest of low power, reducing the number of transistors in the ROM core is important for two reasons: 1) fewer transistors in the ROM means less capacitance on the wordlines and bitlines, yielding faster switching time and lower energy consumption, and 2) reading a "1" (transistor present) requires that the bitline be discharged, which consumes power, whereas reading a "0" (no transistor) consumes no energy. The importance of the latter is multiplied by the fact that, due to column selection, the majority of the information read from the bitlines is discarded. Thus, to reduce the transistor count in the ROM core, methods for low-power I/O [Stan89] [Tabor90] are adapted and applied.

1.3.3 Peripheral Circuitry

Low power techniques in the peripheral circuitry also play an important role in low power design. A cascode charge redistribution sense amplifier is utilized for low power and quick sensing of bitline data. Glitches waste power, and thus the peripheral circuitry is designed to reduce the number of glitches that occur. Eliminating glitches in the output drivers is especially important, since the capacitances on the output devices are typically large, and glitches on the outputs would result in a considerable waste of energy.

Chapter 2 Magic to Cadence Cell Library Conversion

The ROM generator is a tool for quickly and easily creating a ROM applying the methods of low power design found to be effective in our investigation. Part of the structure of the ROM generator is based on work done by A. Burstein [Burstein96] using the Berkeley low power cell library; however, a number of modifications were necessary to make the ROM generator operational in Cadence.
This chapter describes the conversion of the cells into Cadence.

2.1 Low Power ROM Library

The Berkeley low power ROM library serves as a starting point for the ROM generator. Porting the cell library to the Cadence design environment involves a number of steps, including the conversion of Magic cell layouts to Cadence, verification of design rules (due to the different technologies being used), updating the layouts with circuit schematics, and finally verification of layouts versus schematics (LVS). HSPICE is used to verify correct operation of the cells. Finally, sample ROMs were generated and simulated with HSPICE to verify the conversion of the Berkeley ROM generator into Cadence.

2.2 Cell Library Conversion

The layouts were converted using a tool created by Thucydides Xanthopoulos called m2s (magic to skill). This program translates Magic layout files (.mag) to Cadence Virtuoso format. The following sections describe the steps used to convert the cells.

2.2.1 m2s Technology File

A technology file is needed to translate Magic layers to Cadence layers. The m2s tool extracts the data from the Magic file and creates a generator for the Cadence cell, using the technology file to translate the different layer names. Thus, the technology file must be edited to match the Cadence and Magic setup. For this cell conversion, the technology file provided was sufficient (see Appendix B). The layers relevant to this process (HP26) are listed in Table 1.

Magic Layer        Cadence Layer(s)
pwell              none
nwell              nwell
polysilicon        poly
ndiffusion         ndiff
pdiffusion         pdiff
metal1             metal1
metal2             metal2
metal3             metal3
ntransistor        ndiff poly
ptransistor        pdiff poly
psubstratepdiff    psub
nsubstratendiff    nsub
glass              overgla

TABLE 1. Layer Equivalents.

In Magic, contacts have their own layer, whereas in Cadence, the contact as well as the layers being connected must be specified. For example, in Magic, a minimum area polycontact is (4λ)².
In Cadence, however, the same contact consists of a (4λ)² poly layer overlapping a (4λ)² metal1 layer with a (2λ)² contact. This conversion from contacts to multiple layers is handled in the technology file and listed in Table 2.

Magic Contact Name    Cadence Layers
polycontact           poly metal1 cont
ndcontact             ndiff metal1 cont_aa
pdcontact             pdiff metal1 cont_aa
m2contact             metal1 metal2 via
psubstratepcontact    psub metal1 cont_aa
nsubstratencontact    nsub metal1 cont_aa

TABLE 2. Contact Equivalents.

In addition to specifying the layers, design rules need to be specified to indicate contact size, spacing, and overlap during the conversion of contacts from Magic to Cadence. These are also contained in the technology file and shown in Table 3. Two layers that needed to be added for the new process were the n-select and p-select layers. The effects of this are described in more detail in Section 2.2.3.

Design Rule        Size or Length (λ)
CONTACT_SIZE       2
CONTACT_SPACING    2
CONTACT_OVERLAP    1
VIA_SIZE           2
VIA_SPACING        2
VIA_OVERLAP        1
SELECT_OVERLAP     3

TABLE 3. Design Rules for Magic to Cadence Conversion.

2.2.2 Cell Conversion and Generation

Once the technology file is specified, the m2s tool is executed with the Magic cell and target Cadence library specified. This creates a SKILL file with the same name as the cell and a ".il" extension, which generates the layout in Cadence. To generate the layout, one must first create a cell in the target Cadence library with the cell name. The SKILL file is then loaded and the layout is generated by typing "lg". All the cells were correctly generated; however, a few errors occurred when performing the design rule checks (DRC) described in the next section.

2.2.3 Design Rule Check

A number of design rule errors were found in the layouts generated using m2s.
The two most common problems were due to input/output terminals (pins) and the addition of the select layers. Fixing these problems while maintaining ROM density was a primary concern.

Pins/Terminals. m2s creates the pins described in the Magic layout and conveniently adds a text label as well. However, terminals are assumed to have a size of (4λ)². Thus, when a terminal is created, a layer of material is created overlapping the terminal location. These (4λ)² terminals were often too large and consequently resulted in a number of design rule errors. These errors were fixed by adjusting or replacing the terminals.

FIGURE 4. Part of a Magic layout of the column select cell. An NMOS transistor with the source tied to GND is depicted. The three contacts pictured are two diffusion contacts (right of poly and far right) and one well contact (middle). This process allows abutting diffusion and well contacts.

Select Layers. The addition of the select layers to the cells introduced a large number of design rule errors that were difficult to fix. Because a ROM is designed to maximize density, a majority of the cells had little or no space for the necessary n-select and p-select layers. These layers overlap diffusion regions and substrate contacts and degrade the density of the layouts because diffusions of opposite type cannot abut, though the original cells allowed this. Permitting opposite diffusion types to abut is potentially dangerous because contacts placed next to each other may become shorted. Adding the n-select and p-select layers avoids this problem and allows this ROM library to be used with other processes.

Maintaining a high density was important; however, increasing some cell sizes was unavoidable. This degradation of the density is most pronounced in the size of the ground wire in the ROM core.
Due to the large sizes of the well contacts, the introduction of the p-select and n-select layers, and the density of the column select cell, the density of the ROM core was decreased slightly to accommodate the extra area needed for the n-select and p-select areas. This problem is exacerbated because the increase in the column width (+4λ) is multiplied by the number of bits in the ROM. Other cells were increased in size; however, these had only a minor effect on the overall ROM area. Figure 4 and Figure 5 illustrate the changes made in the ROM cells.

FIGURE 5. Part of a Cadence layout of the column select cell. The same NMOS transistor with the source tied to GND is depicted. For tiling reasons, one of the diffusion contacts was deleted. The two contacts pictured are a diffusion contact (right of poly) and a well contact (far right). The minimum spacing between well and diffusion contacts is larger and increases cell size.

2.2.4 Tiling

A number of tiling functions were also written for ROM generation (Chapter 3). These tiling routines require that each cell have a special layer used to align it relative to other cells. This layer, "prBoundary", was added to each cell for tiling purposes. Though the original Magic cells contained tiling information, the m2s tool does not generate prBoundary information with the technology file used in this library conversion. Therefore, the prBoundary layers were added manually and adjusted as necessary.

2.2.5 Verification

A number of steps were taken to verify that the conversion of the cell library was complete. Each of the cells was edited and modified to pass DRC. After passing DRC, schematics were created for the cell library. The Berkeley low power library lacked schematics for each cell, although some general schematics detailing the ROM exist [Burstein96]. Thus, it was necessary to reverse engineer most of the cells to generate schematics for the library.
Finally, layout versus schematic (LVS) checks were executed to verify equivalence. HSPICE was used to verify the operation of a number of larger blocks, such as the address latch, control circuits, and column select cells. ROMs of various sizes (Chapter 4) were generated and simulated with HSPICE, verifying that the ROM library conversion was complete.

2.3 Summary

The steps used to convert the library are summarized below:

1. Create/obtain an m2s technology file.
2. Execute m2s on the Magic cell:
       m2s [magic filename] [target Cadence library]
   The magic filename does not include the .mag extension. This creates a generator for this cell with a ".il" extension.
3. Target Cadence library is created (if necessary).
4. Target cell name is created in the Cadence library.
5. Cell generator is loaded from the CIW:
       CIW> (load "cellname.il")
6. Cell generator is executed from the CIW:
       CIW> lg
7. Terminal sizes are fixed, and the prBoundary layer is added.
8. Cell is edited until design rule checks (DRC) are passed.
9. A schematic for the cell is created.
10. Layout versus schematic (LVS) is verified.
11. HSPICE simulations verify correct operation of major cells.

Chapter 3 ROM Operation

This chapter explains the operation of the ROM at the circuit level and describes the techniques used to reduce power consumption. After an overview of the ROM (based in part on [Burstein96]), the architectures of the ROM block and ROM bus are examined.

3.1 ROM Specification

FIGURE 6. The ROM. (Block diagram with terminals CLK, A, D, and PORB.)

To make the design easy to use, a simple interface to the memory was chosen. The inputs to the system are an address and a clock that meets the maximum frequency requirement. In order to reduce power, the clock should be gated so that it is asserted only when a read operation is taking place, thus preventing the ROM from executing a wasted read cycle.
After a rising clock edge, the ROM will exit the precharge phase, enter the active mode, and re-enter the precharge phase when the data has been sensed. This will be described in more detail below. Figure 6 is a simple block diagram of the ROM, where CLK is the clock input, A is the input address bus, D is the output data bus, and PORB is the active-low power-on reset signal.

Typically, the ROM is broken up into a number of ROM blocks to reduce power and to increase performance by reducing the aspect ratio. By having only a single block enabled during a read, the energy dissipated per cycle of the entire ROM is approximately that of a single block. A typical ROM configuration is depicted in Figure 7.

FIGURE 7. A multiple block ROM. (Top row: block0, block2, block4, ..., block14. Center: address and data buses, block decoding circuitry, and static latches. Bottom row: block1, block3, block5, ..., block15.)

Each block shares the same clock, address bus, data bus, and reset signals. Block decoding circuitry ensures that only a single block is enabled at a time, which eliminates bus contention. Finally, static latches on the data bus ensure that the data remains valid even if no reads are requested for a long time. The number of ROM blocks in a ROM can range from 1 to 16. The following sections describe the ROM blocks and ROM bus operation in further detail.

3.2 The ROM Block

The ROM block generated by ROMGEN is designed to be connected to other blocks through the ROM bus. Figure 8 depicts a block diagram of a single ROM block (VDD and GND not shown). The following sections describe the operation of a single ROM block in more detail.

FIGURE 8. A ROM block. (Terminals: CLK, ENABLE, A, D, and PORB.)

3.2.1 Terminals

Address Bus. A is the input address bus. The width of the address bus is automatically set by ROMGEN depending on the number of WORDS contained in the ROMGEN parameter file. For tiling reasons, the minimum number of address lines per block is 3.
Clock. CLK is the clock input. On the positive edge of CLK, the ROM output will become tristated. If ENABLE=1 and PORB=1 on the rising clock edge, the ROM will proceed with a read after tristating the data bus. The read data will remain latched at the output until the next positive clock edge. CLK may be held high or low until the next read operation.

ENABLE. ENABLE is used to enable the read operation. On the positive edge of CLK, if ENABLE=1, the ROM tristates the output data bus and proceeds with a read operation. If ENABLE=0 on the rising edge of CLK, the ROM only tristates its outputs. The main purpose of the enable signal is to allow multiple blocks to be tied to the same data and address buses. Typically, higher order address bits are decoded to enable a single ROM block during a read cycle.

PORB. PORB is the active-low reset signal. The purpose of this signal is to tristate the outputs upon power-up. This is important for ROMs with multiple blocks that share the same data bus.

Data Bus. D is the output data bus, which is BITS wide. For tiling reasons, the number of bits must be even. The LSB (rightmost bit) is D[0]; the MSB (leftmost bit) is D[BITS-1].

3.3 Basic Operation

The basic operation is straightforward. Once the address lines and enable lines are set to proper values, CLK is raised. The data will appear after a delay, t_access. The ROM will be ready for another read after the cycle time, t_cycle. Only positive edges of CLK are relevant for access and cycle times. Access time and energy per cycle are a function of WORDS, BITS, load capacitance, and power supply voltage. Figure 9 is a timing diagram that demonstrates basic operation.

FIGURE 9. Basic timing diagram for a ROM block. (Waveforms shown versus time: CLK, A, D, ENABLE, and PORB.)

The first rising clock edge represents a read operation in which the ROM is enabled. ENABLE is asserted, meaning that the block contains the data to be read. The address must be stable when CLK is raised.
The positive edge of CLK always tristates the data bus. After an access delay, the data is available at the output bus, D, and after a cycle time delay, the ROM is ready for another read.

The second rising clock edge represents a read operation in which the particular ROM block is not enabled. The only effect of the rising clock edge is the tristating of the output bus.

3.4 ROM Low Power Features

ROMGEN is designed to be a low power ROM generator. The following sections highlight the techniques used to lower power consumption in the ROM.

3.4.1 Reduced Capacitance

Low Power ROM Core. Techniques adapted from low power I/O coding can be applied to memories to decrease power consumption in the ROM. By coding the ROM data such that the number of transistors is reduced, a significant reduction in power consumption is possible. For a conventional ROM core with a worst case of N transistors per wordline, the modified ROM core has a worst case of N/2 transistors per wordline. This is accomplished by storing the complement of the word if the word is "heavy", meaning that more than 50% of the memory cells on the wordline contain transistors (data = "1"); otherwise the data is stored directly in the ROM core. An invert bit (INV) is stored in the ROM core to flag the wordlines that have been inverted, and is used by the decoding circuit to restore the original data.

Coding the ROM core in this manner reduces the number of transistors ("1"s), which reduces the average capacitance on the wordlines and the bitlines and thus reduces power. Reducing the number of transistors on the wordlines also decreases the number of bitlines that are discharged, saving power. And because the worst case delay is decreased, the data can be sensed from the bitlines in a shorter amount of time, which may result in a faster overall access time, depending on the ROM dimensions.

3.4.2 Reduced Action

Multiple Block Capability.
A ROM can be broken up into multiple ROM blocks that share address and data buses. By decoding some of the address bits to enable a single ROM block during a read, only one block will be activated during each read. Therefore, the overall power consumption and access time of the entire ROM is approximately that of a single ROM block.

Non-glitching Outputs. At the beginning of a read cycle, each ROM block tristates its outputs. Using self-timing circuitry, the output data is enabled when the data is stable. This eliminates glitches on the highly capacitive output data bus, which would waste power.

3.4.3 Reduced Swing

Low Voltage Operation. The ROM uses self-timing to maximize speed at various supply voltages. This allows operation at lower voltages, which reduces the maximum swing of each node and hence lowers power consumption.

Reduced Swing Bitlines. Each bitline is precharged to an NMOS threshold voltage below VDD. Thus, the maximum swing is reduced to (VDD - VT). The energy dissipated in charging and discharging the bitline is given by Eq. 3:

    E = C \, V_{DD} \, V_{swing}    (EQ 3)

Since the swing is reduced from VDD to (VDD - VT), the reduction in energy is:

    \text{Energy reduction factor} = \frac{V_{DD}}{V_{DD} - V_T}    (EQ 4)

For low supply voltages especially, this substantially reduces the voltage swing (and hence the energy dissipated) in the bitlines (50% for VDD=1.5V and VT=0.75V).

3.5 ROM Block Architecture

The ROM block consists of address latches, row decoding circuitry, wordline drivers, column select devices, sense amplifiers, output drivers, self-timing logic, control logic, and the ROM core. The following sections describe each part in greater detail.

3.5.1 Control Signals

Figure 10 shows the control logic for the ROM. Upon power-up, PORB is low, which sets the SR latch and drives OEN low. This tristates the bus and is independent of A, CLK, and ENABLE. This ensures that there is no bus contention upon power-up.
FIGURE 10. Schematic of ROM control logic. (Labeled signals include ENABLE, PORB, Ready, and EvalWord.)

On the rising edge of CLK, a low pulse is generated by the 2-input NAND gate connected to CLK. This pulse sets the SR latch and drives OEN low, which tristates the output bus. If ENABLE is not asserted, then this is the only action that occurs.

If ENABLE is high during the rising edge of the clock, another low pulse is generated by the 3-input NAND gate connected to CLK and ENABLE. This forces EvalAddr high, which begins the address evaluation phase. At the beginning of address evaluation, all address lines (A, ABar) are precharged high. Address evaluation is completed when the PMOS device connected to EvalAddrBar pulls the AddressReady node high. This can only occur when all the address latches have pulled down either A or ABar. Once this occurs, AddressReady has no path to ground and the PMOS device pulls AddressReady high. This signals that the address evaluation phase is complete, and the word evaluation phase begins (EvalWord=1).

FIGURE 11. A timing diagram of the control signals (CLK, EvalAddr, AddressReady, EvalWord, and Ready versus time). On the rising edge of CLK, EvalAddr is set, which begins the address decoding operation. Once the addresses have been latched and A and ABar have been generated, AddressReady is asserted. AddressReady signals the completion of the address decoding and begins the sensing operation by setting EvalWord high. EvalWord raises the wordline and allows the sense operation to complete. Once the worst case reference signal, Ready, is read, the sense operation is complete. Ready resets EvalAddr, AddressReady, and EvalWord, which returns the ROM to its initial (precharge) mode. Ready is deasserted once the Ready bitline is precharged, and the ROM is ready for another read operation.

The assertion of EvalWord turns off the precharging of the bitlines and wordline drivers, disables OEN, and allows the sense operation to proceed.
The self-timed signal, Ready, indicates that the read operation is complete and resets EvalWord and EvalAddr back to their initial precharge states. The ROM control logic is shown in Figure 10, and Figure 11 is a diagram illustrating the timing of the control signals.

3.5.2 Address Latch

Figure 12 shows the dynamic latch used in the ROM. When EvalAddr is low (initial state), the two PMOS devices are on, and thus A and ABar are precharged to a logic 1. This also means that the transmission gate is on, which passes the address input, Ain, to the pull-down devices for ABar and A. A pulse generated on the rising edge of CLK begins the read operation by setting an SR latch for EvalAddr. Once EvalAddr is asserted high, the precharge PMOS devices shut off and the NMOS pull-down device is turned on. The transmission gate is disabled, thus latching the address data. Since the pull-down device is now on, either A or ABar is pulled low. A and ABar are then passed to the row decoding circuitry.

FIGURE 12. A single ROM address latch [Burstein96]. (Labeled nodes include Ain, EvalAddrBar, VDD, ABar, and A.)

Note that only NMOS devices are used during the evaluation step. This helps speed up operation because the slower PMOS devices are used only in the precharge stage, when speed is not as important. The PMOS devices can, however, affect the cycle time, since A and ABar must be fully precharged before another read cycle can begin.

3.5.3 Row Decoder and Wordline Driver

The address latches drive the address to the row decoder. The address is decoded with an NMOS pull-down network, shown in Figure 13. When all the address lines (A0, A0Bar, etc.) have settled, EvalWord is raised. A single address is decoded by the network and pulls down the node at the end of the row decoder chain. Note again that only NMOS devices are used in the critical part of the address decoding circuit, which allows faster operation and smaller devices.
This PLA-type decoding style also simplifies the tiling, which makes generating the row decoding logic for different ROM sizes easier.

FIGURE 13. ROM row decoder and wordline driver (adapted from [Burstein96]). (Labeled signals include EvalWord, A0, A0Bar, A1, and A1Bar.)

Figure 13 also depicts the wordline driver (only one wordline driver is drawn for simplicity). EvalWord is initially low at the beginning of a read cycle. In this state, the EvalWord PMOS is on, which precharges the gate of the PMOS wordline pull-up high. The wordline is discharged by the NMOS device while EvalWord is low. When EvalWord is asserted, the address is decoded and one of the rows pulls down the gate of the PMOS device, which drives the corresponding wordline high.

3.5.4 ROM Core

The main idea in the design of the memory cell is to achieve the maximum signal strength in the minimum area, because of the massive number of cells typical of a ROM. Memory cells are dominated by technological considerations, and cell topologies have not varied much over the last decades. Recent improvements in density have been due to technology scaling and advanced manufacturing processes; however, there are still design techniques that can reduce power consumption at this level of ROM design as well. Figure 14 illustrates the sizes of the cells in the NOR-type ROM core.

FIGURE 14. A "0" cell (left) and a "1" cell (right). Note that the contact and diffusion wire (GND) dominate cell size.

Methods investigated for low-power I/O [Stan89] [Tabor90] can be applied to reduce the amount of capacitance on the wordlines and bitlines, which will speed up operation and lower the amount of power dissipated. This technique involves coding data such that the data has less "weight".
In the case of ROM core data, words that have fewer "1"s than "0"s are considered low weight, because "1"s correspond to transistors being present in the ROM core, which increases wordline and bitline capacitance. Thus, the low weight codes are applied to the data such that a reduction of transistors in the ROM core is obtained. Figure 15 is an example illustrating the process.

The ROM core data is coded such that the fraction of "1"s on a particular wordline (and hence, the number of transistors on that wordline) is never greater than 50%. This is accomplished in the following manner:

1. Assume N is the maximum number of transistors on a wordline.
2. If the number of transistors on a particular wordline is less than or equal to N/2, the wordline remains unchanged and the INV bit is marked as a "0", meaning that the data is not inverted.
3. If the number of transistors on a particular wordline is greater than N/2, the wordline is inverted and the INV bit is set to a "1", meaning that the wordline has been inverted.

The additional bit, INV, is also stored in the ROM core on the corresponding wordline.

    ROM Data                       Coded ROM Core    INV bit
    (Conventional ROM Core)
    0 0 0 0 0 0                    0 0 0 0 0 0       0
    1 1 1 0 0 1                    0 0 0 1 1 0       1
    0 0 1 1 1 0                    0 0 1 1 1 0       0
    1 0 0 0 0 0                    1 0 0 0 0 0       0
    0 1 1 1 1 1                    1 0 0 0 0 0       1
    1 1 1 1 1 1                    0 0 0 0 0 0       1

FIGURE 15. Example showing the original data (also the configuration for a traditional ROM core) versus the coded ROM core. (Figure panels: ROM Data (Conventional ROM Core), Coded ROM Core, INV bit, Decoded Data.) A "1" corresponds to an NMOS on the wordline, and a "0" means that there is no transistor present. In this case, 19 transistors in the traditional ROM were reduced to only 10 transistors in the coded ROM, an improvement of nearly 50%.

The coded core data must be XORed with the INV bit before being passed to the output driver. It can be seen by inspection that XORing the data with the INV bit will yield the uncoded data.
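The coding rule and the XOR decoding step can be sketched in a few lines of Python. This is only an illustrative model, not the ROMGEN implementation: the function names `encode_word` and `decode_word` are hypothetical, and the word data is transcribed from the Figure 15 example.

```python
from typing import List, Tuple


def encode_word(bits: List[int]) -> Tuple[List[int], int]:
    """Invert a word when more than half of its bits are 1 (transistors).

    Returns the bits stored in the core and the INV bit.
    (Hypothetical helper, for illustration only.)
    """
    if 2 * sum(bits) > len(bits):
        return [1 - b for b in bits], 1  # "heavy" word: store complement, INV = 1
    return list(bits), 0                 # store directly, INV = 0


def decode_word(coded: List[int], inv: int) -> List[int]:
    """Recover the original word by XORing each sensed bit with INV."""
    return [b ^ inv for b in coded]


# Word data transcribed from the Figure 15 example.
FIGURE_15_WORDS = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]

coded_rows = [encode_word(word) for word in FIGURE_15_WORDS]

# Transistor counts: every stored "1" is a transistor, including INV bits.
original_transistors = sum(sum(word) for word in FIGURE_15_WORDS)
coded_transistors = sum(sum(coded) + inv for coded, inv in coded_rows)

print(original_transistors, coded_transistors)
```

For the Figure 15 data the two counts are 19 and 10, matching the figure caption, and decoding every coded row returns the original word.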
By using low weight codes for the data, the number of transistors stored in the ROM core is reduced. Fewer transistors in the ROM core means less capacitance on the wordlines and bitlines, saving power and increasing performance. Also, a smaller number of transistors on the wordline means that fewer transistors are turned on, so fewer bitlines will be discharged. This is especially important because bitlines tend to be highly capacitive, and reducing the number of bitlines that switch can result in substantial power savings.

3.5.5 Column Selection

Column decoding is necessary to reduce the aspect ratio; however, this decoding results in a waste of power. For an M bit column decoder, M-1 bits are discarded after the read. Of those M-1 bits, the bitlines that switched (logic "1") contribute to wasted power. Because of column selection, reducing the number of transistors in the ROM core becomes even more important, since power dissipation in the bitlines becomes a larger factor. Thus, column decoding should be kept to a minimum. However, eliminating column decoding entirely is not possible because it would make the aspect ratio very large and also make the pitch matching of the sense amplifiers impossible [Burstein96].

Figure 16 shows the column selection and sense amplification circuitry. The two low order address bits of the ROM are used to select which column to enable. One of the select lines (Sel[3:0]) is enabled, which allows the bitline data to pass through to the sense amplifier. The sensing operation is described in more detail in the next section.

FIGURE 16. Column selection, sense amplification, and XOR circuits.

3.5.6 Sense Amplification

The sense amplifier used in the ROM is also depicted in Figure 16. A simplified schematic of the charge redistribution sense amplifier is illustrated in Figure 18.
Typically, a cascode device with its gate connected to a fixed reference voltage Vref is inserted between the load and driver transistor. In the ROM circuit, the column select device acts as the cascode device with Vref equal to the column select gate voltage (VDD).

FIGURE 17. The effects of charge sharing of the bitline with the output node. At time t=0, the two nodes are effectively shorted (top). The equivalent circuit at t=0+ is shown at bottom. (Figure annotations: C_bitline >> C_out; C_total = C_out + C_bitline.)

Figure 17 illustrates the concept of charge sharing. At the beginning of the sensing operation (EvalWord=1), the drain of the NMOS column select device is precharged to VDD, and the source (bitline) is precharged to VDD-VTN by an NMOS device. Once the wordline is raised, the bitline starts discharging if the data is a logic "1" (transistor present). This change in bitline voltage turns the column select device on and redistributes the charge between the highly capacitive node (bitline) and the small capacitive node at Vout. Before the column select device turns on, the charge on each of the capacitors is:

    Q_{bitline} = C_{bitline} V_{bitline}    (EQ 5)

    Q_{out} = C_{out} V_{out}    (EQ 6)

At time t=0+, the column select device is turned on and the nodes are shorted. By charge conservation, the total charge is equal to the sum of the charge on both capacitors:

    Q_{total} = Q_{bitline} + Q_{out} = C_{bitline} V_{bitline} + C_{out} V_{out}    (EQ 7)

Since C_bitline and C_out are in parallel, the value of C_total is known:

    C_{total} = C_{bitline} + C_{out}    (EQ 8)

Using this expression, the total charge can also be expressed as:

    Q_{total} = C_{total} V_{final} = (C_{bitline} + C_{out}) V_{final}    (EQ 9)

By equating Eq. 7 and Eq. 9, we obtain an expression for the final voltage, V_final:

    V_{final} = \frac{C_{bitline} V_{bitline} + C_{out} V_{out}}{C_{bitline} + C_{out}}    (EQ 10)

Since C_bitline >> C_out, Eq.
10 simplifies to:

    V_{final} \approx V_{bitline}    (EQ 11)

This shows that as soon as charge redistribution begins (when the bitline starts discharging), the column select device quickly equalizes the output voltage to the bitline voltage.

FIGURE 18. Simplified view of a charge redistribution amplifier [Rabaey95].

For low supply voltages (1.5V), the precharged bitline voltage, VDD-VTN, is near the switching threshold, VM, of the inverter (because VM is designed to be approximately VDD/2), which reduces the amount of swing on the bitlines necessary to read the data. Therefore, once the bitline starts discharging, the voltage at the input of the inverter quickly drops to VDD-VTN. As the bitline continues to discharge and reaches a voltage less than VM, the inverter output will switch and the read of the "1" is complete. Although the NMOS device in the ROM core is fighting against the weak PMOS pull-up at the drain of the column select device, the NMOS is much stronger due to larger sizing, greater mobility, and a lower threshold than the PMOS device [Burstein96]. This introduces some static power dissipation once the bitline discharges, which is another reason why reducing the number of "1"s in the ROM core is important.

In the case when there is no transistor present on the bitline, the source of the column select device remains at an NMOS threshold below VDD. The column select NMOS remains off, leaving the drain at VDD. The inverter completes the read of the stored "0".

By precharging the column to an NMOS threshold voltage below VDD, the maximum swing on the bitline is reduced to VDD-VTN (for a logic "1") while the best case swing is still approximately 0 (for a logic "0"). Since the bitline capacitances can be large, this results in substantial power savings, especially for low supply voltages.
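As a numerical illustration of the charge-sharing result (Eq. 10) and the reduced-swing energy expressions (Eq. 3 and Eq. 4), the short Python sketch below evaluates both. The supply and threshold values (VDD=1.5V, VT=0.75V) are the ones quoted in the text; the capacitance values are hypothetical and chosen only so that C_bitline >> C_out.

```python
def shared_voltage(c_bitline: float, v_bitline: float,
                   c_out: float, v_out: float) -> float:
    """Final voltage after two capacitors are shorted together (Eq. 10)."""
    return (c_bitline * v_bitline + c_out * v_out) / (c_bitline + c_out)


def bitline_energy(c_bitline: float, vdd: float, v_swing: float) -> float:
    """Energy to charge and discharge the bitline (Eq. 3): E = C * VDD * Vswing."""
    return c_bitline * vdd * v_swing


VDD = 1.5            # supply voltage quoted in the text (V)
VTN = 0.75           # NMOS threshold quoted in the text (V)
C_BITLINE = 1e-12    # hypothetical bitline capacitance (F)
C_OUT = 10e-15       # hypothetical sense-node capacitance (F)

# Bitline precharged to VDD - VTN, output node precharged to VDD.
v_final = shared_voltage(C_BITLINE, VDD - VTN, C_OUT, VDD)

# Full-swing versus reduced-swing bitline energy (Eq. 4 is their ratio).
e_full = bitline_energy(C_BITLINE, VDD, VDD)
e_reduced = bitline_energy(C_BITLINE, VDD, VDD - VTN)
reduction_factor = e_full / e_reduced

print(f"V_final = {v_final:.3f} V, energy reduction factor = {reduction_factor:.2f}")
```

With these assumed capacitances the output node settles within about 10 mV of the precharged bitline voltage, and the reduction factor of 2 corresponds to the 50% energy saving quoted for VDD=1.5V and VT=0.75V.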
Another advantage of low supply voltages is that the switching threshold of the inverter is very close to VDD-VTN, which means that only a small swing on the bitline is necessary to switch the inverter and complete the read operation.

3.5.7 XOR

An XOR gate is needed to restore the data if the particular wordline was inverted in the ROM core. By XORing the INV bit stored in the ROM with the sensed data, the original data is obtained. Figure 19 is the circuit schematic of the XOR shown in Figure 16. When A=0, the transmission gate is on, and B is passed through to F. However, when A=1, the source of the PMOS device is a logic 1, and the source of the NMOS device is a logic 0. In this configuration, the rightmost NMOS and PMOS devices can be viewed as an inverter, with B as the input. Thus, if B=0, the PMOS device is turned on and F=1. If B=1, the NMOS device is turned on and F=0.

FIGURE 19. XOR circuit used for decoding the ROM data. (Inputs A and B; output F.)

3.5.8 Self Timing

The ROM core contains an extra reference wordline and bitline used for self-timing. Figure 20 shows the circuits used to generate the self-timed signal, Ready. The Ready signal is generated using circuits identical to those used by the ROM to sense and decode data, which can be seen by comparing Figure 20 with Figure 16.

Ready signals the completion of the data sensing operation. Therefore, the reference wordline and bitline that generate the Ready signal must have the worst case delay to guarantee that the sense operation is complete and that the data is stable. For a ROM with M wordlines and N bitlines, the reference wordline has the worst case number of NMOS gates (represented by the bottom row of transistors in Figure 20): N/2 for the ROM core, and an additional transistor to pull down the bitline and generate the self-timing signal, Ready. The bitline has a worst case number of NMOS drains connected to it, M.
The column selection, sense amplification, and decoding circuits are identical, and thus include the delays of a typical sense operation. Inverters were added to buffer Ready and to generate ReadyBar. (Four inverters were necessary because ReadyBar must be stable before the Ready signal can be asserted.)

FIGURE 20. Self-timing circuits for Ready signal generation.

By using identical circuits to read the data in the worst case, the assertion of Ready guarantees that all the read operations are complete, and the data may be passed to the output devices. Speed is maximized at different voltages because data can be sent to the output devices and the address and word evaluation circuits can return immediately to the precharge state to prepare for the next read cycle. The Ready signal also guarantees that the data is stable, which ensures glitch-free operation of the output driver.

3.5.9 Output Driver

The output driver must be non-glitching because glitches on the large capacitance of the data bus result in wasted power. Glitches can occur at the output of the XOR gate, depending on the time required for the bitline data and the INV bit to be sensed. Additional pass gates placed before the output driver shown in Figure 21 ensure that the XOR outputs are stable before being sent to the data latches.

FIGURE 21. Schematic of the output driver. (Labeled signals include Ready, EvalWord, DataIn, EvalWordBar, and ReadyBar.)

At the beginning of a read cycle, OEN goes low, which sets the static latches to the initial state, tristating the data outputs. Once the evaluation of the data begins (EvalWord becomes asserted), OEN is asserted high so that it does not interfere with the sensing operation. The data is taken from the XOR device only when EvalWord and Ready have both been asserted, which is the reason for the pass transistors before the static latches.
This is to ensure that the XOR device has received stable INV and data inputs before allowing the static latches to latch the data.

Consider the case when the data is a logic "0". EvalWord and Ready are asserted, so the pass gates are enabled. The input of the latch leading to the NMOS output device is brought low, which means that the output device is turned on, bringing the data output low. For the PMOS output device, this case is the same as its initial precharged state. The static latch does not switch and the PMOS device remains off.

In the case when the data is a logic "1", the latch leading to the PMOS output device is raised, meaning the gate of the PMOS output device is low. This turns on the PMOS device, which drives the data bus high. For the path to the NMOS device, this is the same as its initial state, so the static latch does not switch and the NMOS output device remains off.

3.6 ROM Bus

The bus provides a path for CLK, PORB, and power supplies to the ROM blocks and allows multi-block operation. The bus connects the address and data buses of the ROM blocks together and provides the necessary block decode logic. This section is an overview of the ROM bus.

3.6.1 Bus Architecture

A typical multi-block ROM is divided up into two rows of blocks, separated by the bus, as shown in Figure 22. A single ROM bus instance connects together two ROM blocks (known locally as bk0 and bk1) in a particular column and provides connectivity to other (adjacent) bus instances. This connection of two ROM blocks with a bus is referred to as a "block column".

The bus connects all relevant control signals and supply lines to each ROM block. In addition, the block decode logic is generated to enable bk0, bk1, or neither. Weak static latches are provided (only for the first block column) to ensure that the data remains valid when the data bus is tristated.
FIGURE 22. A single ROM bus instance provides VDD, GND, CLK, and PORB (not shown) to the ROM blocks, connects the address and data buses, and provides block enable information and static latches for the data bus (only for one block column). Block columns are then tiled together and connected to other buses to complete the multi-block ROM. (Panels: a block column; a typical ROM configuration.)

3.6.2 Block Decode Logic

The block decoding circuitry is necessary to ensure that only one block is enabled during a read. The high order address bits are used to decode the block to be enabled; this group of address bits is called the block address. For an N-bit block address, N-1 bits are used for global block decoding (selecting which block column is to be enabled) and 1 bit is reserved for local block decoding (either bk0 or bk1). The current ROM bus has a limit of 4 block address bits, yielding a maximum of 16 blocks per ROM.

FIGURE 23. A 1-bit block decoder. (2 blocks)

For a ROM with two ROM blocks, an inverter provides the necessary block enable logic (Figure 23), since no global block decoding is necessary. However, for ROMs with more than two ROM blocks, the circuits for block selection must be designed to be easily programmed, since each block column has different global decode logic.

FIGURE 24. A 4-bit block decoder. (9-16 blocks)

Figure 24 shows the circuit used for 4-bit block decoding. The block address is A[3:0], and SEL0 and SEL1 are the block enable signals for bk0 and bk1, respectively. The high order address bits (A[3:1]) are used for global block decoding, and A0 is used for local block decoding. As illustrated in the schematic, the block address lines used for global block selection (A[3:1]) are left unconnected. It is left up to ROMGEN to tile the wires so that the appropriate global block decode logic is generated for each block column.
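The split of the block address described above can be sketched in a few lines of Python. This is an illustration only, not the ROM bus circuit: the function name `decode_block` and its arguments are hypothetical, and the sketch assumes that the lowest block address bit selects bk0 or bk1 (SEL1 follows the bit, SEL0 is its inverse, as with the inverter of Figure 23) while the remaining N-1 bits select the block column.

```python
from typing import Tuple


def decode_block(block_address: int, n_bits: int) -> Tuple[int, int, int]:
    """Split an N-bit block address into (block_column, sel0, sel1).

    ASSUMPTION: the lowest bit is the local decode bit and the upper
    N-1 bits form the global (block column) select.
    """
    if not 1 <= n_bits <= 4:
        raise ValueError("the ROM bus supports 1 to 4 block address bits")
    if not 0 <= block_address < (1 << n_bits):
        raise ValueError("block address out of range")

    local_bit = block_address & 1      # local decode: bk0 or bk1
    block_column = block_address >> 1  # global decode: which block column
    sel0 = 1 - local_bit               # enable for bk0
    sel1 = local_bit                   # enable for bk1
    return block_column, sel0, sel1
```

For example, with a 4-bit block address of binary 1011 the sketch selects block column 5 and enables bk1.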
3.7 Simulation

Simulations using HSPICE verified correct operation of a 16 word by 4 bit ROM with two ROM blocks and VDD = 1.5V. Figure 25 shows the timing of the control signals EvalAddr and EvalWord for bk1 of the ROM. Since there are two blocks, A3 acts as the block enable signal. Thus, EvalAddr and EvalWord are not asserted on the rising edge of CLK when the block is not selected (A3=0). However, when the block is selected (A3=1), EvalAddr is asserted, and EvalWord is asserted after the address decode operation is complete. Both EvalAddr and EvalWord are reset by the self-timed signal, Ready (not shown), once the sensing operation is complete.

FIGURE 25. HSPICE results from a ROM 16 word x 4 bit simulation of node voltages. (from top to bottom: CLK, A3, EvalAddr (block 1), and EvalWord (block 1))

Figure 26 shows the operation of the ROM during the time illustrated in Figure 25. CLK (top) and EvalWord (second from top) are included for reference. The rise of the selected wordline (third) and the reading of a "1" on a bitline (fourth) are shown. As expected, the wordline is raised after EvalWord is asserted. This turns the transistors on, and the corresponding bitline discharges. The fifth waveform is the Ready signal, which signals that the read operation is complete. The Ready signal resets EvalWord and EvalAddr (shown in Figure 25) and allows the data to be passed to the output driver. Once EvalWord and EvalAddr are reset, the bitlines return to their precharging state and Ready is reset. Finally, D (last) depicts the output of the data "1".
Note that there was a change of data on the previous rising edge of the clock, though EvalAddr and EvalWord were not asserted. This is because the other ROM block (bk0) was the block enabled for the particular address specified.

FIGURE 26. HSPICE results from a ROM 16 word x 4 bit simulation of node voltages. (from top to bottom: CLK, EvalWord, wordline, bitline, Ready, and D)

3.8 ROM Performance

The performance and energy dissipated per cycle for various supply voltages and ROM sizes are summarized below in Table 4 and Table 5.

VDD (V)  Words  Bits  Blocks  Access Time (ns)  Cycle Time (ns)  Energy / Cycle (pJ)
1.5      16     4     1       25.7              33.9             10.91
1.5      16     4     2       25.8              32.0             12.58
2.2      16     4     1       10.9              14.8             23.61
2.2      16     4     2       11.0              13.9             26.27
3.3      16     4     1       6.1               8.4              55.88
3.3      16     4     2       5.9               7.9              60.50
5.0      16     4     1       4.0               5.7              141.39
5.0      16     4     2       3.9               5.4              150.43

TABLE 4. HSPICE results for a 16 word x 4 bit ROM

VDD (V)  Words  Bits  Blocks  Access Time (ns)  Cycle Time (ns)  Energy / Cycle (pJ)
1.5      256    8     1       47.4              58.3             28.8
1.5      256    8     2       39.4              48.9             30.6
1.5      256    8     4       33.9              43.2             35.9
2.2      256    8     1       18.1              23.9             59.9
2.2      256    8     2       15.4              20.2             61.0
2.2      256    8     4       13.7              18.4             69.3
3.3      256    8     1       9.6               11.1             137.9
3.3      256    8     2       8.1               11.0             135.3
3.3      256    8     4       7.5               10.3             148.9
5.0      256    8     1       6.2               8.7              348.3
5.0      256    8     2       5.4               7.4              330.9
5.0      256    8     4       5.0               6.9              357.2

TABLE 5. HSPICE results for a 256 word x 8 bit ROM

Note that increased partitioning generally decreases access and cycle times, but does not necessarily decrease the energy dissipated per read cycle. Partitioning small ROMs into more blocks does not reduce energy consumption (in Table 4 the energy per cycle rises slightly with two blocks), due to the extra bus wiring and capacitance on the output data bus.
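The trade-off can be read directly from Table 5. As an illustration, the short Python sketch below copies the energy column of Table 5 and reports, for each supply voltage, the block count with the lowest energy per cycle. The variable and function names are introduced here only for the example.

```python
# Energy per cycle (pJ) copied from Table 5 (256 word x 8 bit ROM),
# keyed by (VDD in volts, number of blocks).
TABLE5_ENERGY_PJ = {
    (1.5, 1): 28.8, (1.5, 2): 30.6, (1.5, 4): 35.9,
    (2.2, 1): 59.9, (2.2, 2): 61.0, (2.2, 4): 69.3,
    (3.3, 1): 137.9, (3.3, 2): 135.3, (3.3, 4): 148.9,
    (5.0, 1): 348.3, (5.0, 2): 330.9, (5.0, 4): 357.2,
}


def lowest_energy_blocks(table):
    """Return {vdd: blocks} for the minimum-energy entry at each VDD."""
    best = {}
    for (vdd, blocks), energy in table.items():
        if vdd not in best or energy < table[(vdd, best[vdd])]:
            best[vdd] = blocks
    return best


if __name__ == "__main__":
    for vdd, blocks in sorted(lowest_energy_blocks(TABLE5_ENERGY_PJ).items()):
        print(f"VDD = {vdd} V: lowest energy with {blocks} block(s)")
```

For the values in Table 5, the minimum falls at one block for 1.5 V and 2.2 V and at two blocks for 3.3 V and 5.0 V, so the best partitioning is not the same at every supply voltage.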
For medium and large sized ROMs, however, the partitioning scheme that dissipates the least power requires some investigation.

The original Berkeley ROM is compared to the improved ROM for various supply voltages. The difference in performance between the two ROMs is heavily dependent on the data. The results in Table 6 compare the two ROMs with the average data word having 75% "1"s, which favors the new ROM.

                         Energy / Cycle (pJ)
ROM                      VDD = 1.5V   VDD = 2.2V   VDD = 3.3V   VDD = 5.0V
Original Berkeley ROM    37.36        77.52        176.0        436.0
New ROM                  28.54        60.00        139.5        354.1

TABLE 6. HSPICE comparison of original ROM versus modified ROM. ROM core is 256 word x 8 bit ROM with heavy data. Average data weight = 24 and average coded data weight = 9.

The reduction in power consumption is due to the low power coding of the improved ROM core. In the worst case, the new ROM has no coded ROM data and therefore has slightly more power consumption than the Berkeley ROM, because of the extra logic necessary for data decoding.

Chapter 4
ROM Generation

ROMGEN is the tool used to generate the ROMs described in the previous chapter, "ROM Operation". ROMGEN is written in SKILL and is designed for use in the Cadence Design Environment. The first section of this chapter (Section 4.1) provides an overview of the generator and its usage. The sections that follow describe the generator in more detail, including a summary of the cells in the ROMGEN library and specifics about the SKILL code.

ROMGEN is a simple tool for generating low power ROMs in Cadence. The generator is executed on a parameter file containing the details of the ROM. The ROM blocks and ROM bus are created and tiled automatically to complete the generation of the ROM.

4.1 Usage

This section describes ROMGEN usage and is intended to be a user guide for ROM generation.
Parameter   Description                          Value
libName     target library name                  any valid library name (in quotes)
romName     target ROM layout name               any valid ROM name (in quotes)
words       number of words specified            minimum 8 words (3 address lines) per block
bits        number of bits per word (wordsize)   positive integer (an even number of bits is always created, for tiling reasons)
blocks      number of blocks                     1 = single block mode, 2-16 = multiblock
dataBits    ROM data                             list of binary strings (in quotes) representing ROM data; the first string corresponds to ADDRESS = 0

TABLE 7. Summary of ROMGEN parameters.

4.1.1 Parameter file

A parameter file detailing the name, size, partitioning, contents, and library location must be created prior to generation. The syntax of the file is:

    ((parameter1 value1)
     (parameter2 value2)
     (parameter3 value3)
     (parametern valuen))

The parameters are summarized in Table 7. (Note: parameter names are case sensitive.)

libName and romName specify the target Cadence library and name for the ROM. The values must be enclosed in double quotes.

words specifies the number of words in the entire ROM. For tiling reasons, the minimum number of address lines per block is 3; therefore, the minimum number of words per block is 8. Legal values for words include any integer greater than or equal to 8. Because the number of addresses per block is a power of two, words may not fill the last ROM block exactly; in that case ROMGEN automatically pads the ROM with enough "0"s to fill up the rest of the last ROM block. A possible improvement to ROMGEN is to eliminate the unnecessary wordlines containing empty data on the last block, as long as the minimum number of wordlines is met.

The parameter bits specifies the number of bits per word of the ROM. For tiling reasons, only ROMs with an even number of bits can be created. If bits is odd, a D[-1] bit is created that the user may leave unconnected.
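The size rules above can be sketched as follows. This is not the ROMGEN SKILL code; it is a hypothetical Python helper (`rom_geometry`) that assumes the words are spread evenly over the blocks and that each block is rounded up to the next power of two, with a minimum of 8 words per block. How ROMGEN actually distributes words across blocks is not stated in this section.

```python
def rom_geometry(words: int, bits: int, blocks: int = 1) -> dict:
    """Derive padded ROM dimensions from the ROMGEN size parameters."""
    if words < 8:
        raise ValueError("words must be at least 8")
    if bits < 1:
        raise ValueError("bits must be a positive integer")
    if not 1 <= blocks <= 16:
        raise ValueError("blocks must be between 1 and 16")

    # ASSUMPTION: words are spread evenly over the blocks.
    per_block = -(-words // blocks)  # ceiling division
    # At least 3 address lines per block, and enough to cover per_block words.
    addr_lines = max(3, (per_block - 1).bit_length())
    words_per_block = 1 << addr_lines
    capacity = words_per_block * blocks

    odd = bits % 2 == 1
    return {
        "addr_lines_per_block": addr_lines,
        "words_per_block": words_per_block,
        "padded_words": capacity - words,          # all-zero words added
        "bits_created": bits + 1 if odd else bits,  # always even
        "data_bus": (bits - 1, -1 if odd else 0),   # D[high:low]
    }
```

Under these assumptions, a 20 word by 5 bit ROM in two blocks would use 4 address lines per block, pad 12 words, and expose a data bus D[4:-1].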
The data bus is D[bits-1:0] if bits is even, and D[bits-1:-1] if bits is odd.

If blocks is 1, then ROMGEN enters single block mode and generates only a single ROM block, without a ROM bus, static latches, or block selection circuitry. If blocks is greater than 1, the ROM blocks and bus are generated for multiblock operation. An odd number of blocks is a legal value.

dataBits specifies the ROM data with a list containing at least words strings of "1"s and "0"s representing the ROM data in binary. Each string must be at least bits long; any extra bits or words defined are ignored. The first string in the list corresponds to ROM ADDRESS=0, the second string corresponds to ADDRESS=1, and so on.

The following is an example parameter file demonstrating proper parameter specification:

    ((libName "rom")       ; output library
     (romName "rom4x4")    ; output romname
     (words 16)            ; # of words in ROM
     (bits 4)              ; # of bits per word
     (blocks 2)            ; total number of blocks (1 to 16)
     (dataBits "0000" "1111" "1010" "1100"
               "0011" "0110" "0001" "0010"
               "0100" "1110" "1101" "1011"
               "0101" "1001" "1000" "0111"))

4.1.2 Generation

Application of ROMGEN involves these steps:

1. Create the ROMGEN parameter file.
2. Start Cadence.
3. In the Cadence Interface Window (CIW), load ROMGEN:
       CIW> (load "romgen.il")
4. Once loaded, execute ROM generation on the parameter file:
       CIW> (romgen "romgen.param")
   where romgen.param is the name of the parameter file.

A cell for the ROM, the bus, and each block is created in the library specified in the parameter file. The bus cell name is marked with a "_bus" extension after the romName, and each block has a "_bkn" extension after the romName, where n is the block number from 0 to blocks-1. This completes the ROM generation.

4.2 Cells

This section lists cells in the ROMGEN library and briefly describes their function.
Table 8 lists the cells used to generate the ROM blocks and Table 9 catalogs the cells used in ROM bus generation.

Cell Name            Description
romgen_GNDcell       GND cell used to create GND lines in the ROM core
romgen_aReady        2 NMOS devices per address bit used to generate AddressReady (see Figure 10)
romgen_al            address latch (A input)
romgen_alEnd         end cell for address latch cells, romgen_al; connects EvalWord to address decoding circuitry
romgen_botCell0      end wordline cell "0" for self-timing signal, Ready (no transistor)
romgen_botCell1      end wordline cell "1" for self-timing signal, Ready (transistor present)
romgen_botEndCell    end cell for reference wordline, Ready bit
romgen_cell0         ROM core "0" cell (no transistor)
romgen_cell1         ROM core "1" cell (transistor present)
romgen_cs            column select, sense amplifier, XOR, and output driver (D output)
romgen_csEnd         dummy column select, sense amplifier, XOR for Ready bit; also generates OEN logic (PORB input)
romgen_csInv         sense amplifier for INV bit; also provides buffering for Ready and ReadyBar
romgen_ctrl          control logic (VDD, CLK, GND, and ENABLE input)
romgen_endCell0      bitline for Ready, and INV bit information (INV = "0")
romgen_endCell1      bitline for Ready, and INV bit information (INV = "1")
romgen_nand0         address decode "0" cell
romgen_nand1         address decode "1" cell
romgen_nandBot       address decode for dummy wordline
romgen_nandGND       GND cell for address decoder
romgen_rs            row select and wordline driver
romgen_rsBot         reference row select and wordline driver

TABLE 8. ROMGEN block cells.

Cell Name            Description
romgen_aTop          tile aligned with romgen_al
romgen_aTopNC        tile aligned with romgen_al (no contact)
romgen_aspacer       spacer to align bus with address latches
romgen_bkSel1b       1-bit block enable logic
romgen_bkSel2b       2-bit block enable logic
romgen_bkSel3b       3-bit block enable logic
romgen_bkSel4b       4-bit block enable logic
romgen_blockEnp0     poly wire used to set a bit of global decode logic to "0"
romgen_blockEnp1     poly wire used to set a bit of global decode logic to "1"
romgen_EnpSpacer     spacer to position blockEnp at correct location to enable global decode logic
romgen_csEndTop      tile aligned with romgen_csEnd
romgen_csEndTopNC    tile aligned with romgen_csEnd (no contact)
romgen_csTop         tile aligned with romgen_cs
romgen_csTopNC       tile aligned with romgen_cs (no contact)
romgen_csTopStat     static latches for data bus
romgen_ctlTop        tile aligned with romgen_ctrl
romgen_ctlTopNC      tile aligned with romgen_ctrl (no contact)
romgen_ctlTopNCNE    tile aligned with romgen_ctrl (no contact, no enable); block enable logic overlaps this cell
romgen_invTop        tile aligned with romgen_csInv (no contact)

TABLE 9. ROMGEN bus cells.

4.3 ROMGEN

ROMGEN is a ROM generator written in SKILL that implements a modified version of the ROM tiling procedures designed for the LagerIV silicon assembler system [Brodersen93]. LagerIV calls the TimLager tiling program and uses the "lprom2" function to create the layout for the memory. Because TimLager utilizes a number of advanced tiling routines which were unavailable in Cadence, a tiling routine was created. This section describes the tiling routine and some details of its application in ROMGEN for ROM block and bus generation.

4.3.1 Tiling Procedures

Because Cadence lacks a tiling routine like TimLager, a tiling routine and a number of helper functions were created to supply at least the minimum amount of functionality required to perform ROM generation.

The tiling function, romgen_tiler, places a simple mosaic (a homogeneous array of cells, aligned using the prBoundary information) in a cell layout, given the mosaic size, point of origin, and rotation.
Regardless of rotation, the mosaic is always placed in the upper right quadrant relative to the point. This is a very simple tiling procedure; however, it is adequate for ROM generation. The tiling program makes no effort to align terminals; only origin and prBoundary information is used when placing cells. Therefore, the cells must be carefully designed and updated with alignment information (prBoundary) in order to ensure correct operation.

FIGURE 27. ROM Core and Column Select Mirroring. (Figure labels: N = # of bits per word; D[N-1], D[N-2], ..., D[0], INV; column positions 0 1 2 3, GND, 3 2 1 0; dummy NMOS tied to GND; typical wordline; mirrored column select devices; to sense amplifiers.)

The placement of subsequent mosaics requires manipulating variables that store the point of origin of the last mosaic placed. romgen_tiler returns a list containing the height and width information for the mosaic (tile_size), which can be used to update the origin for the next mosaic to be placed.

4.3.2 ROM Block Tiling

This section describes the tiling for a ROM block. The following sections are presented in the order in which they are generated. The actual code for "romgen_blk.il" is located in "Appendix A".

Address/Row Decode. The NMOS NAND pull-down structure of the address decode is the first section to be generated. The GND cells are first placed using a single mosaic. Then the NAND cells are individually placed to decode each bit of the address (A or ABar), so that successive rows decode binary 0, 1, 2, and so forth. The address for the self-timing signal, Ready, is placed last.

Row Drivers. Each row driver drives two ROM wordlines; therefore, a single mosaic is created for placement of the row drivers. A special reference row driver is added for the reference wordline used to generate Ready.

ROM Core. The ROM core contains the data, INV bit information, and reference wordlines and bitlines for self timing.
However, the tiling of the cells requires special manipulation due to the mirroring of the column select devices and the 4:1 column selection. Because of column selection, there are 4 words per wordline. Figure 27 demonstrates the tiling of the ROM core. The figure shows how the mirroring of the column select devices affects the placement of the ROM cells. A typical wordline is shown, with bits corresponding to the selected column (determined by A[1:0]) marked accordingly. The INV bit is placed at the end of the wordline, with a "1" (transistor present) marking wordlines that need to be inverted to decode the data. The calculations for mirroring the data bits and generating the INV bits are done by ROMODEL prior to block generation.

A dummy row is created to simulate the worst case delay. This row is tiled with alternating "1" and "0" data in the ROM core, an INV bit, and a transistor used to pull down the bitline for Ready. The bitline for Ready is generated by connecting the bitline, at each wordline, to the drain of a separate transistor with its gate and source tied to GND, to simulate the worst case bitline.

Address Latches. A mosaic for the address latches is created and aligned with the address decoding circuitry. The two least significant bits are contained within the control block for column selection.

Control Block. The control block is placed adjacent to the address latches and aligned with the row drivers. The control block has row drivers to drive the column select lines.

Column Selection. The column select devices contain column select, sense amplification, XOR decoding, and output driver circuitry. These devices are mirrored to reduce the number of ground lines running through the ROM core. After bits column select devices are placed, the column select devices for INV and Ready are placed.

This completes the generation of the ROM block.
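The ROM core coding described above (four words per wordline, mirrored column order, and an INV bit) can be sketched in Python. This is a sketch under stated assumptions, not the ROMODEL or ROMGEN code: the function name `code_wordline` is hypothetical, the cell order (columns 0 1 2 3 for one bit, then 3 2 1 0 for the next) is read from the labels of Figure 27, and the rule "invert the wordline when more than half of its cells would hold a transistor" is an assumed criterion for setting INV.

```python
def code_wordline(words, bits):
    """Lay out four data words on one wordline and choose the INV bit.

    words[c] is the word selected by column address c (A[1:0] = c).
    Returns (cells, inv); a cell value of 1 means "transistor present".
    """
    if len(words) != 4:
        raise ValueError("4:1 column selection needs exactly four words")

    cells = []
    for i, bit in enumerate(range(bits - 1, -1, -1)):  # D[N-1] ... D[0]
        # ASSUMPTION: adjacent bits are mirrored about a shared GND line.
        columns = (0, 1, 2, 3) if i % 2 == 0 else (3, 2, 1, 0)
        cells.extend((words[c] >> bit) & 1 for c in columns)

    # ASSUMPTION: invert when more than half of the cells are "1".
    inv = 1 if 2 * sum(cells) > len(cells) else 0
    if inv:
        cells = [1 - cell for cell in cells]
    return cells, inv


def read_bit(cell, inv):
    """Recover the data bit with the XOR of the sensed cell and INV."""
    return cell ^ inv
```

Under this assumed rule, a 32-cell wordline holding 24 ones would be stored as 8 ones plus the INV bit, which is consistent with the data weights of 24 and 9 quoted in Table 6.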
If the ROM is a multiblock ROM, however, block decoding logic and static latches are also placed during block generation. These cells are designed to overlap into the ROM bus area. In the case of a single block ROM, these steps are skipped.

Block Decode Logic. If the current block being generated is the first ROM block in a column (locally, block 0), then the block decoding logic is placed. The block decoder used is determined by ROMGEN and depends on the number of blocks in the ROM. Once the decoder is placed, poly wires are placed to generate the correct block decode logic. This block decode logic generates the SEL0 and SEL1 outputs and is placed to align with the corresponding ENABLE inputs of block 0 and block 1, respectively.

Static Latches. The static latches are placed on the data bus for the first block of the entire ROM only. These latches overlap with the ROM bus wiring area. Figure 28 depicts a typical first block of a small multiblock ROM.

FIGURE 28. Block 0 of a 2 block ROM. Note the decoding logic (an inverter and some bus wiring) at the lower left and the 4 static latches attached to the D outputs (lower right).

4.3.3 ROM Bus Tiling

All the inputs and outputs of the ROM are located on a single side of the ROM. Thus the bus is designed to align with these signals and to overlap the block decoding logic and static latches placed during ROM block generation. Each input or output cell of the ROM block has two corresponding ROM bus cells that are sized to be aligned with it. The bus cells either connect the input or output to the bus (cell contains a contact) or provide wiring only (no contact). These cells are tiled to create the ROM bus. A typical ROM bus is shown in Figure 29.

FIGURE 29. A typical ROM bus layout. Bus wires (horizontal) are metal2, connection wires (vertical) are metal1.

FIGURE 30. The layout of a 256 word by 8 bit, 4 block ROM.
The bus wires are (listed from bottom to top): GND, VDD, CLK, A[N-1:0], D[M-1:0], PORB, where N is the total number of address bits and M is the total number of data bits.

4.3.4 Multiblock ROM Tiling

Once bus and block generation are complete, the two ROM blocks are connected to the bus using a single bus instance, creating a ROM column. Subsequent ROM columns are created and tiled to the right, connected by the bus. The last ROM column may have a single block connected to a bus, in the case of an odd number of blocks specified in the parameter file. A sample ROM layout is depicted in Figure 30.

Chapter 5
ROM Modelling

ROMODEL is a tool used to estimate the power dissipated in the low power ROM described in the previous chapters. This tool estimates the energy dissipated per cycle by modelling each node in the ROM and accumulating the energy dissipated. Many of these techniques are borrowed from PYTHIA, a power estimation tool for Verilog [Xanthopoulos96]. Since the architecture of the ROM is well defined, models for a large number of identical nodes can be condensed into one and scaled accordingly, depending on the size of the ROM.

The dynamic energy dissipated is on the order of $C V_{DD}^2$ for each transition ($C V_{DD}(V_{DD} - V_T)$ for the bitlines). Thus, the energy dissipated per ROM access cycle is calculated for each node, scaled according to the size of the ROM, and accumulated with all the other nodes. The result is reported as the total energy dissipated per access.

ROMOPT is a tool that determines the valid block configurations for a ROM and the optimal ROM partitioning scheme that minimizes energy dissipated per read cycle. The optimization process applies the ROMODEL tool to model the power and generates an output report detailing information about the different ROM configurations.

5.1 Brief Description of Operation

ROMODEL is a simple program that models the nodes in the low power ROM.
The basic operation of the modeler is very simple and requires little knowledge of the ROM itself. First, parameter values are read in from a technology file that must be specified by the user. After the values are read, nodes are created and annotated with a number of capacitance values. The capacitances that are dealt with in the ROM are:

* the wiring capacitance, which includes the diffusion, poly, metal1, and metal2 layers. The layers are lumped to facilitate the accounting of wiring capacitances. These capacitances are considered constant throughout the node swing.
* the total gate capacitance contributed by gates seen by the particular node. A piecewise constant model is used to model the nonlinear behavior of the gates, with some correction factors for the Miller effect and gate-to-drain overlap capacitance.
* the total junction capacitance due to drain to substrate or drain to well junctions. Drain junction capacitances are assumed to vary with voltage in the following manner:

$$C_J(V_R) = \frac{C_{J0}}{\left(1 + \dfrac{V_R}{\phi_0}\right)^{M_J}}$$   (EQ 12)

where $V_R$ is the reverse bias across the junction, $\phi_0$ is the built in potential, $C_{J0}$ is the capacitance at zero bias, and $M_J$ is the grading coefficient of the junction.

The following sections describe these models and approximations in more detail. Using these models, the energy dissipated per cycle in each node is calculated as a function of ROM size. Finally, the energy dissipated in the ROM is calculated by scaling the appropriate nodes and accumulating the result. The node information is sent to an output file and the total energy per cycle is reported to the user.

5.2 ROMODEL Limitations

Before the power estimates can be used with confidence, a few remarks about simplifications and assumptions need to be noted:

1. The energy/cycle calculated by ROMODEL only accounts for dynamic power. Short circuit power and power due to leakage currents are not taken into account.
Sources of static power dissipation exist in this ROM (AddressReady and bitlines); however, this contribution of static power is assumed to be negligible and is ignored in this model.
2. All calculations of gate and junction capacitances employ simplifying assumptions. The models are described in more detail in the following sections. The accuracy of these models is validated using HSPICE.
3. The modeler makes a number of assumptions to provide average energy/cycle estimates; however, the user should be aware that the energy dissipated per cycle is dependent on the data and the frequency of each address access.

5.3 Overview

The user must provide a number of parameters: number of address lines per block, wordsize, number of blocks, and supply voltage, VDD. A technology file containing information about the process technology must also be specified. Using this information, ROMODEL applies the models described later in this section to calculate the energy dissipated.

5.3.1 ROMODEL Technology File

The technology file contains information about the process technology. Information about the parameters is listed in Table 10. Typically, these values are obtained from the SPICE model and parameter files for the particular process. A sample technology file and SPICE model are included in Appendix B.

Parameter          Description                                                          Units
lambda             technology parameter                                                 m
phin               n-drain to p-substrate built in potential                            V
phip               p-drain to n-substrate built in potential                            V
vtn                threshold voltage for NMOS device                                    V
vtp                threshold voltage for PMOS device (sign ignored)                     V
mjn                absolute value of MJ in Eq. 12 (NMOS area)                           -
mjp                absolute value of MJ in Eq. 12 (PMOS area)                           -
mjswn              absolute value of MJ in Eq. 12 (NMOS perimeter)                      -
mjswp              absolute value of MJ in Eq. 12 (PMOS perimeter)                      -
cj0n               NMOS zero bias area junction capacitance                             F/m^2
cj0p               PMOS zero bias area junction capacitance                             F/m^2
cjsw0n             NMOS zero bias perimeter junction capacitance                        F/m
cjsw0p             PMOS zero bias perimeter junction capacitance                        F/m
cgd0n              NMOS gate to drain overlap capacitance                               F/m
cgd0p              PMOS gate to drain overlap capacitance                               F/m
n_cv_bp            breakpoint for piecewise constant model of NMOS gate cap.            V
p_cv_bp            breakpoint for piecewise constant model of PMOS gate cap.            V
n_gate_cap_init    NMOS gate capacitance from 0 to n_cv_bp (0 for our model)            F/m^2
p_gate_cap_init    PMOS gate capacitance from 0 to p_cv_bp (Cox for our model)          F/m^2
n_gate_cap_final   NMOS gate capacitance from n_cv_bp to VDD (Cox for our model)        F/m^2
p_gate_cap_final   PMOS gate capacitance from (VDD-p_cv_bp) to VDD (0 for our model)    F/m^2
ndiff_cap          NDIFF to substrate wiring capacitance per unit area                  F/m^2
pdiff_cap          PDIFF to substrate wiring capacitance per unit area                  F/m^2
poly_cap           Poly to substrate wiring capacitance per unit area                   F/m^2
metal1_cap         Metal1 to substrate wiring capacitance per unit area                 F/m^2
metal2_cap         Metal2 to substrate wiring capacitance per unit area                 F/m^2
ndiff_fringe       NDIFF to substrate fringe capacitance                                F/m
pdiff_fringe       PDIFF to substrate fringe capacitance                                F/m
poly_fringe        Poly to substrate fringe capacitance                                 F/m
metal1_fringe      Metal1 to substrate fringe capacitance                               F/m
metal2_fringe      Metal2 to substrate fringe capacitance                               F/m

TABLE 10. Technology File Parameters.

All values must be specified in the technology file before ROMODEL will proceed with the ROM energy modelling. Each line of the technology file corresponds to a [parameter] [value] pair, or a comment marked by a pound sign:

    # This is a partial technology file
    lambda 0.5e-6
    vtn 0.75

All the parameters are straightforward with the exception of the CV breakpoints for the NMOS and PMOS piecewise constant models.
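The file format just described ([parameter] [value] pairs, with comments marked by a pound sign) is simple to read. The following Python sketch is an illustration only; ROMODEL's own parser is not shown in this section, and the function name `parse_tech_text`, the handling of trailing comments, and the error messages are assumptions made for the example.

```python
def parse_tech_text(text: str) -> dict:
    """Parse '[parameter] [value]' lines into a {name: float} dictionary."""
    params = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        # ASSUMPTION: everything after a pound sign is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected '[parameter] [value]'")
        name, value = fields
        params[name] = float(value)
    return params


SAMPLE = """\
# This is a partial technology file
lambda 0.5e-6
vtn 0.75
"""

if __name__ == "__main__":
    print(parse_tech_text(SAMPLE))
```

A complete reader would also check that every parameter of Table 10 is present before modelling begins, since all values must be specified.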
The CV breakpoints can be obtained by using the CV_MODEL tool, described in further detail in Section 5.3.4. Once the technology file is successfully read, the modelling of the nodes of the ROM begins.

5.3.2 Node Model

A node structure is created for every node in the ROM. Each node contains information regarding voltage swing, gate and drain sizes, and wiring capacitance. It also contains variables for equivalent capacitance and accumulated energy calculations to facilitate the identification of energy-consuming nodes. Variable descriptions are listed in Table 11.

Variable             Description
name                 array containing the name of the node
v_init               voltage at the beginning of the clock cycle
v_final              biggest change in voltage before being reset
area_nmos_gates      total NMOS gate area (W x L) seen by node
width_nmos_gates     total of all NMOS gate widths (W) seen by node
area_nmos_junct      total area of NMOS drain junctions
perim_nmos_junct     total perimeter of NMOS drain junctions
area_pmos_gates      total PMOS gate area (W x L) seen by node
width_pmos_gates     total of all PMOS gate widths (W) seen by node
area_pmos_junct      total area of PMOS drain junctions
perim_pmos_junct     total perimeter of PMOS drain junctions
area_ndiff           total area of NDIFF wiring
area_pdiff           total area of PDIFF wiring
area_poly            total area of POLY wiring
area_metal1          total area of METAL1 wiring
area_metal2          total area of METAL2 wiring
perim_ndiff          total perimeter of NDIFF wiring
perim_pdiff          total perimeter of PDIFF wiring
perim_poly           total perimeter of POLY wiring
perim_metal1         total perimeter of METAL1 wiring
perim_metal2         total perimeter of METAL2 wiring
wiring_cap           total wiring capacitance
wiring_energy        energy dissipated from wiring capacitance
total_cggnd          total gate to ground capacitance
total_cggnd_energy   energy dissipated from gate to ground capacitance
total_cj             total area junction capacitances
total_cj_energy      energy dissipated in area junction capacitances
total_cjsw           total sidewall capacitance
total_cjsw_energy    energy dissipated in sidewall junction capacitances
total_node_cap       total equivalent node capacitance
total_node_energy    total node energy

TABLE 11. Node Variables and Descriptions.

Each node is updated with data obtained from the ROM cell layouts. Wiring areas and perimeters, transistor areas, transistor widths, junction areas, and junction perimeters were determined for each node as a function of ROM size. An overview of the process is given later, in Section 5.4.

5.3.3 Modelling Wiring Capacitances

ROMODEL includes wiring capacitances in the power estimate. For each node, the amount of wiring area for each layer was determined. Because the ROM is tiled, the expressions for wiring capacitance are a function of the number of bits, number of address lines per block, and number of blocks. Thus, depending on the size of the ROM, the wiring area and length are scaled accordingly before the total capacitance is calculated. The node wiring capacitance for each layer is:

$$C_{wiring} = AREA_{layer} \cdot C_{area} + PERIMETER_{layer} \cdot C_{fringe}$$   (EQ 13)

where $C_{area}$ is the capacitance per unit area of the layer and $C_{fringe}$ is the fringe capacitance per unit length. The areas and perimeters of each wire are determined as a function of ROM size. The layers that are considered are: ndiff, pdiff, poly, metal1, and metal2.

FIGURE 31. Typical gate capacitance vs. gate voltage plot from HSPICE.
Both NMOS (solid) and PMOS (dotted) C-V characteristics are shown above.

The wiring capacitance for each of the layers is assumed to be constant throughout the voltage swing. Only layer-to-substrate capacitances are considered; other sources of wiring capacitance, such as poly-to-metal1 capacitances, are assumed to be negligible. Thus, the energy dissipated per power consuming transition at the node is described by the following equation:

$$E_{wiring} = C_{wiring} V_{DD} V_{swing}$$   (EQ 14)

The voltage swing, Vswing, for all nodes is VDD, except for the bitlines, which swing from (VDD-VT) to GND.

[Figure 32: capacitance C versus gate voltage V, showing the actual C-V characteristic and the piecewise constant model, which steps up to Cox at Vbp.]

FIGURE 32. Actual and model C-V characteristics of a NMOS device.

5.3.4 Modelling Gate Capacitances

To model gate capacitances, ROMODEL makes the simplifying assumption that the transistor is mostly in the linear region. This is a good estimate when VDD is large; ROMODEL also uses it as a conservative estimate for low supply voltages. Using this assumption, the total gate-to-channel capacitance can be assumed to be split equally between the source and drain. The total gate capacitance is calculated in the following manner:

$$C_{ox} = \frac{\varepsilon_{ox}}{t_{ox}}$$   (EQ 15)

Cox is the oxide capacitance of the gate. However, the gate capacitance is a nonlinear function of voltage, as demonstrated in Figure 31. Therefore, an equivalent value, Coxequiv, is calculated, representing the average value of Cox over the voltage swing.

Because finding the actual average of Cox over the voltage swing is complex, a piecewise constant model is developed to approximate the change in gate capacitance over the voltage swing of the node, shown in Figure 32 and Figure 33.
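The averaging idea can be illustrated numerically. The sketch below is not ROMODEL code; the function names are hypothetical. It evaluates Eq. 15 using the oxide permittivity quoted in the sample technology file (Appendix B.1) and the oxide thickness from the sample SPICE model (Appendix B.2), then averages a piecewise constant C-V curve over a swing from 0 to v_high. It assumes that an NMOS gate contributes zero capacitance below the breakpoint and Cox above it, and the reverse for a PMOS gate, which is the shape suggested by the initial and final gate capacitance entries in the sample technology file. For the PMOS function, v_bp is the actual breakpoint voltage, not the offset from VDD stored in the technology file.

```python
# Sketch only: Eq. 15 and the average of a piecewise constant C-V model.
# All function names are illustrative, not taken from ROMODEL.

EPS_OX = 34.515e-12  # F/m, oxide permittivity ("EOX" reminder in Appendix B.1)


def cox_per_area(tox_m: float) -> float:
    """Eq. 15: oxide capacitance per unit area, Cox = eps_ox / tox (F/m^2)."""
    return EPS_OX / tox_m


def coxequiv_nmos(cox: float, v_high: float, v_bp: float) -> float:
    """Average over 0..v_high of a curve that is 0 below v_bp and cox above."""
    if v_high <= v_bp:
        return 0.0
    return cox * (v_high - v_bp) / v_high


def coxequiv_pmos(cox: float, v_high: float, v_bp: float) -> float:
    """Average over 0..v_high of a curve that is cox below v_bp and 0 above."""
    if v_high <= v_bp:
        return cox
    return cox * v_bp / v_high


if __name__ == "__main__":
    cox = cox_per_area(1.78e-8)  # TOX from the sample SPICE model (Appendix B.2)
    print(f"Cox = {cox:.6e} F/m^2")
    nmos_avg = coxequiv_nmos(cox, 1.5, 0.1525147)
    print(f"NMOS Coxequiv for a 1.5 V swing = {nmos_avg:.6e} F/m^2")
```

The computed Cox agrees with the gate capacitance per unit area listed in the sample technology file.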
For NMOS gates, Coxequiv is calculated in the following manner:

$$C_{oxequiv,NMOS} = \begin{cases} 0 & 0 \le V_{high} \le V_{bp} \\ \dfrac{C_{ox}(V_{high} - V_{bp})}{V_{high}} & V_{high} > V_{bp} \end{cases}$$   (EQ 16)

where Vbp is the breakpoint voltage and Vhigh is the maximum node voltage (the minimum node voltage is assumed to be 0).

[Figure 33: capacitance C versus gate voltage V from 0 to VDD, showing the actual C-V characteristic and the piecewise constant model, which steps down from Cox at Vbp.]

FIGURE 33. Actual and model C-V characteristics of a PMOS device.

Similarly, Coxequiv for PMOS gates is:

$$C_{oxequiv,PMOS} = \begin{cases} C_{ox} & 0 \le V_{high} \le V_{bp} \\ \dfrac{C_{ox} V_{bp}}{V_{high}} & V_{high} > V_{bp} \end{cases}$$   (EQ 17)

These values of Coxequiv for NMOS and PMOS devices are computed before proceeding with the total gate-to-ground capacitance calculation:

$$C_{gs} = \tfrac{1}{2}(AREA_{gates})\, C_{oxequiv} + W_{gates} C_{gso}$$   (EQ 18)

$$C_{gd} = \tfrac{1}{2}(AREA_{gates})\, C_{oxequiv} + W_{gates} C_{gdo}$$   (EQ 19)

$$C_{g\text{-}gnd} = C_{gs} + 2 C_{gd}$$   (EQ 20)

Cgs and Cgd represent the total channel and overlap capacitance between the two nodes. The overlap capacitances, Cgso and Cgdo, are assumed to be equal, which is why only Cgdo is specified in the technology file. Coxequiv is split evenly between Cgs and Cgd due to the assumption that the transistor is always operating in the linear region. Cg-gnd, the total gate to ground capacitance, is equal to the sum of Cgs and Cgd, after Cgd is multiplied by a factor of 2 to account for the Miller effect. Figure 34 depicts the result of the calculations of Eq. 18, Eq. 19, and Eq. 20.

[Figure 34: Cgd between the gate node and the drain node and Cgs between the gate node and the source node, replaced by Cg-gnd = Cgs + 2Cgd from the gate node to ground.]

FIGURE 34. Equivalent Gate to GND Capacitance.

Determining the CV breakpoint.

The piecewise constant model is accurate only if the breakpoint voltages can be determined accurately. A tool called CV_MODEL was developed that uses HSPICE output to determine the breakpoint. First, a file containing the CV plot data must be created. A sample HSPICE deck that produces the necessary output is included in Appendix A. The following is a portion of a sample CV plotfile(1):

0.            5.779e-16
1.00000e-3    5.777e-16
2.00000e-3    5.775e-16
3.00000e-3    5.774e-16
4.00000e-3    5.772e-16
5.00000e-3    5.771e-16

1. All alphanumeric representation exponents must be converted to exponential form before being sent to the CV modeler (i.e., m = e-3, u = e-6, f = e-15, etc.).

The first column represents the gate voltage and the second column is the capacitance of the gate. CV_MODEL then calculates the area under the curve of the CV characteristic provided in the CV plotfile and returns the breakpoint value. The syntax for CV_MODEL is:

cv_model plotfile nmos/pmos

where plotfile is the filename of the HSPICE generated CV plot and nmos or pmos specifies what type of transistor is being modelled. The following demonstrates correct usage:

% cv_model plotfile1.cv nmos
total area: 3.439291E-12
Vdd: 5.000000E+00
delta_v: 1.000000E-03
Breakpoint: 1.525147E-01

Total area is the estimated total area under the CV curve. CV_MODEL assumes that the maximum sweep voltage is Vdd and includes this value in the output. Delta_v is the stepsize of the input gate voltage, and Breakpoint is the calculated breakpoint voltage.

[Figure 35: HSPICE CV plot from 0 to VDD divided into rectangles of width ΔV, with the modeled CV breakpoint marked.]

FIGURE 35. Calculation of total area under CV curve from HSPICE data, used to find breakpoint voltage in piecewise constant gate capacitance model.

The CV modeler calculates the area by accumulating the areas of all the rectangles with width ΔV, the simulation stepsize, and height Ci, the capacitance value at each point. Figure 35 graphically demonstrates this procedure. CV_MODEL assumes that the stepsize is small enough (about 0.001 V) that the error is negligible. The sum of all the rectangles is the estimate for the area. The accuracy of the area under the curve (and consequently, of the breakpoint) increases as ΔV is decreased.

$$CVArea_{HSPICE} = \sum_i C_i \, \Delta V$$   (EQ 21)

The area under the CV curve of the piecewise constant model is:

$$CVArea_{model} = (V_{DD} - V_{bp})\, C_{ox}$$   (EQ 22)

Equating the areas from Eq. 21 and Eq. 22, we obtain the expression for the breakpoint voltage:

$$V_{bp} = V_{DD} - \frac{\Delta V \sum_i C_i}{C_{ox}}$$   (EQ 23)

Since the breakpoint for PMOS devices depends on VDD, the PMOS breakpoint value in the technology file (and the value reported by CV_MODEL) is specified as the difference between VDD and the actual breakpoint. The application of CV_MODEL below demonstrates correct usage for a PMOS device:

% cv_model plotfile2.cv pmos
total area: 3.282547E-12
Vdd: 5.000000E+00
delta_v: 1.000000E-03
Breakpoint: 3.359656E-01

The breakpoint reported by CV_MODEL is 0.336V, which is the value that should be used in the ROMODEL technology file. In the case when VDD = 1.5V, the actual modeled breakpoint (1.5 - 0.336 = 1.164V) is used in all capacitance and energy calculations.

Once the breakpoints for the capacitance models are calculated, the energy dissipated by the gate to ground capacitances is computed:

$$E_{gate} = C_{g\text{-}gnd} V_{swing} V_{DD}$$   (EQ 24)

5.3.5 Modelling Junction Capacitances

Each node contains information about drain capacitances: NMOS drain area and perimeter, and PMOS drain area and perimeter. The NMOS drain capacitances lie between the node and GND. However, the PMOS junction and sidewall capacitances lie between the node and VDD. Since the PMOS drain capacitances are in series with a large capacitor (Csupply) to GND, an assumption may be applied to simplify the analysis. The PMOS drain capacitance is assumed to be much smaller than the supply capacitance, and therefore, the PMOS drain capacitance appears to be between the node and GND as well [Xanthopoulos96]. The NMOS and PMOS capacitances can therefore be added in parallel, as long as it is remembered that the voltages across the n-drain and p-drain capacitances vary differently. Figure 36 illustrates the charging of the drain capacitance.
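The series-capacitance assumption for the PMOS drain can be written out explicitly. In the sketch below, C_{p,drain} is a label introduced here for the PMOS drain (junction plus sidewall) capacitance; it is not a symbol used elsewhere in this chapter.

```latex
C_{eq} = \frac{C_{p,drain}\, C_{supply}}{C_{p,drain} + C_{supply}}
       = \frac{C_{p,drain}}{1 + C_{p,drain}/C_{supply}}
       \approx C_{p,drain}
       \qquad \text{for } C_{supply} \gg C_{p,drain}
```

Under this approximation the supply capacitor drops out of the series combination, which is why the PMOS drain capacitance can be treated as if it were connected from the node to GND.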
The power delivered by the power supply is:

$$P_{drain} = V_{DD}\, i_{supply}$$   (EQ 25)

The energy drawn from the supply is therefore:

$$E_{drain} = \int (V_{DD}\, i_{supply})\, dt$$   (EQ 26)

$$= V_{DD} \int i_{supply}\, dt$$   (EQ 27)

$$= V_{DD} \int dq$$   (EQ 28)

Since dq/dV = C,

$$= V_{DD} \int C(V)\, dV$$   (EQ 29)

Substituting Eq. 12 for C(V), the expression for Edrain is obtained:

$$E_{drain} = V_{DD} \left[ \frac{CJ\,\phi}{1 - m_j} \left( \left(1 + \frac{V_{Rfinal}}{\phi}\right)^{1 - m_j} - \left(1 + \frac{V_{Rinit}}{\phi}\right)^{1 - m_j} \right) + \frac{CJSW\,\phi}{1 - m_{jsw}} \left( \left(1 + \frac{V_{Rfinal}}{\phi}\right)^{1 - m_{jsw}} - \left(1 + \frac{V_{Rinit}}{\phi}\right)^{1 - m_{jsw}} \right) \right]$$   (EQ 30)

where φ is the built-in potential, and VRfinal and VRinit are the reverse biases at the final and initial node voltages, respectively. CJ and CJSW must be scaled by the appropriate drain areas and perimeters, respectively.

[Figure 36: inverter with input Vin and output Vout, with the supply VDD charging the drain capacitance at the output.]

FIGURE 36. Supply current charging the drain capacitance.

This assumes that all the current from the supply is used to charge the junction capacitance. Ideally this is true; however, if VDD > (|VTP| + VTN), some direct path current is lost through the NMOS device during the non-ideal input transition. This contribution is assumed to be small and is neglected by ROMODEL.

5.4 Modelling the ROM

Each node of the ROM is modelled using the model described in Section 5.3.2 with some simplifying assumptions. Since the capacitance on a particular node depends on the size of the ROM, gate and junction sizes are written as a function of bits, address lines, number of blocks, and data. This ensures that ROMODEL is accurate for different ROM sizes.

Because of the density of a ROM, the number of nodes to be modelled quickly becomes too large, and it is not convenient to create a node model for every node in the ROM. The regular structure of the ROM allows two simplifying assumptions to be made in modelling the nodes:

1. Identical nodes can be modelled with a single node.
2. The power dissipated in the nodes equals the modelled node power scaled by the number of nodes that switch and the number of power consuming transitions per read access.
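The two assumptions above amount to a weighted sum over groups of identical nodes. The following is a minimal sketch of that bookkeeping, not ROMODEL's actual data structures; the NodeClass type, its field names, and the numeric values in the example are hypothetical.

```python
# Sketch only: total access energy from a few representative node models.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NodeClass:
    """One modelled node standing in for a group of identical nodes."""
    name: str
    energy_per_transition: float   # J, from the single modelled node
    switching_nodes: float         # expected number of such nodes that switch
    transitions_per_access: float  # power consuming transitions per read


def rom_energy_per_access(classes: Iterable[NodeClass]) -> float:
    """Scale each modelled node energy by how many nodes switch and how often."""
    return sum(
        c.energy_per_transition * c.switching_nodes * c.transitions_per_access
        for c in classes
    )


if __name__ == "__main__":
    # Illustrative values only; they are not taken from a ROMODEL run.
    example = [
        NodeClass("wordline", 2.0e-13, 2, 1),
        NodeClass("bitline", 1.0e-13, 10.3, 1),
    ]
    print(f"{rom_energy_per_access(example):.6e} J")
```

Fractional values of switching_nodes are allowed so that expected counts derived from average word weights can be used directly.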
Using these assumptions, modelling each node is unnecessary, and ROM energy can be calculated without a large number of redundant calculations.

Most of the nodes in the ROM, such as the control signals, word lines, and bitlines, undergo either no power consuming transitions or a single power consuming transition. Eventually, all the nodes return to their initial (precharge) state, with the exception of the address bits and the data outputs.

5.4.1 Control Signals

The control signals, mainly Ready, EvalAddr, EvalWord, and OEN, always undergo a single energy consuming transition in the selected ROM block and return to their initial state (except for OEN, which is asserted in all ROM blocks). Therefore, the energy consumed in these nodes is simply accumulated.

5.4.2 Address Latches and Row Decoding

ROMODEL makes the assumption that approximately 1/2 of the address bits are high. Due to the symmetry of the address latches and row decoding circuitry (see Figure 12 and Figure 13), ROMODEL makes the assumption that power consumption is approximately the same regardless of the input address. The energy consumed in changing the address inputs, A[alines-1:0], is not considered as part of ROM power dissipation.

5.4.3 ROM Core

Using the input parameters, the capacitances on the wordlines and bitlines can be estimated, as well as the probability of undergoing a power consuming transition. The two wordlines enabled during a ROM read cycle are the data wordline and the reference wordline. The reference wordline and bitline have a constant (worst case) number of transistors and always contribute to energy dissipation during a read cycle. For the data wordline, however, ROMODEL uses the word_weight parameter to estimate the number of transistor gates on the enabled wordline, the number of junction capacitances on the bitlines, and the number of bitlines that will be discharged during a ROM read access.
To estimate the capacitance of the INV bitline, inv_wordlines is used to obtain the average number of drains per block on the INV bitline as well as the probability that INV is asserted, assuming that all address locations are equally likely.

5.4.4 Sense Amplifiers and XOR Decoders

The number of sense amplifier outputs that switch is estimated to be the ratio of the coded word weight to the total number of columns, scaled by the number of bits. Sense amplifier outputs that do not switch are simply "logic 0" and do not contribute to dissipated energy. The expected number of XOR decoder outputs that switch is similarly determined by the uncoded word weight (data_weight). Glitches are assumed to be negligible and are ignored.

5.4.5 Output Latch and Driver

The number of data outputs that are "1" is approximated with the data_weight parameter, from which the average weight of uncoded data can be calculated. ROMODEL uses this information to calculate the probability that each data bit undergoes an energy consuming transition, and assumes that half of the energy is consumed per change in output bit.

5.5 ROMODEL Usage

Since the energy of the ROM is dependent on the data, the core data must be analyzed before modelling can begin. ROMAVG is a tool that simplifies analysis of ROM data. Using the data obtained from ROMAVG, the minimum, maximum, and average access energies can then be modelled by ROMODEL.

5.5.1 ROMAVG: Analyzing ROM Core Data

ROMAVG is a tool that analyzes the ROM contents and calculates the minimum, maximum, and average number of transistors on a wordline. It is important to note that ROMAVG assumes that there are four words per wordline and that the core data is stored using the low weight coding techniques described in Chapter 3. The syntax for ROMAVG is:

romavg param_file

where param_file is the ROMGEN parameter file (see Section 4.1.1) containing the ROM data to be analyzed.
The parameters that ROMAVG uses are words, bits, and dataBits; the other parameters in the parameter file are ignored. ROMAVG analyzes the core data and returns values for ROMODEL and ROMOPT. The following demonstrates ROMAVG usage:

% romavg romgen_param1
bits: 8
words: 256
Total number of 1's in ROM data (uncoded): 739
Total number of 1's in ROM core after coding (data+Inv): 695
Wordlines inverted: 2
Average data weight (uncoded): 11.546875
Average coded wordline weight: 10.859375
Minimum coded wordline weight: 2
Maximum coded wordline weight: 16

Analysis of the ROM data yields the minimum, maximum, and average weights of the coded wordlines. The average weight of the uncoded data and the number of wordlines inverted are also calculated. These values can then be used with ROMODEL and ROMOPT to determine best case, worst case, and average case energy per access.

5.5.2 ROMODEL: Modelling the ROM

The syntax for ROMODEL is:

romodel n_alines wordsize blocks vdd data_weight word_weight inv_wordlines [techfile]

where n_alines is the total number of address bits, wordsize is the number of bits per word (must be even), blocks is the number of ROM blocks in the ROM, vdd is the power supply voltage (in V), data_weight is the average number of transistors per wordline without coding, word_weight is the number (minimum, maximum, or average) of transistors on a coded wordline, inv_wordlines is the number of wordlines that will be inverted in the ROM (also obtained from ROMAVG), and techfile is the technology file to be used, which defaults to "romodel.tf". Data_weight, word_weight, and inv_wordlines are typically determined by ROMAVG (Section 5.5.1). If the technology file is valid, ROMODEL proceeds with the ROM modelling and returns the energy dissipated per access:

% romodel 4 4 1 1.5 8 6.25 1 romodel.tf
Energy per cycle: 1.468764e-11J

A report including information about each node is generated after the modelling has completed and is located in "romodel.out".
The following is a sample report on a typical node:

***** RowSelBar *****
v_init: 0.000000E+00
v_final: 1.500000E+00
area Ngates: 16.000000
width Ngates: 8.000000
area Njunct: 20.000000
perim Njunct: 14.000000
area Pgates: 76.000000
width Pgates: 38.000000
area Pjunct: 42.000000
perim Pjunct: 12.000000
area ndiff: 0.000000
area pdiff: 0.000000
area poly: 82.000000
area m1: 144.000000
area m2: 125.000000
wiring_cap: 3.350250E-15
wiring_energy: 7.538063E-15
total_cggnd: 5.333653E-14
total_cggnd_energy: 1.200072E-13
total_cj: 4.773610E-15
total_cj_energy: 1.350579E-14
total_cjsw: 3.223089E-15
total_cjsw_energy: 8.370037E-15
total_node_cap: 6.468347E-14
total_node_energy: 1.494211E-13

The units for these values are given in Table 11.

[Figure 37: chain of five inverters with nodes numbered 1 to 5 and a load capacitor CL = 0.5pF on the last output; lambda = 0.5e-6.]

FIGURE 37. An inverter chain for HSPICE vs. ROMODEL comparisons.

5.6 Results: ROMODEL vs. HSPICE

A number of simulations using HSPICE were performed to verify the accuracy of ROMODEL. The nodes of an inverter chain with increasingly larger inverters were simulated and compared. HSPICE simulations were then run on ROMs of different sizes and compared to ROMODEL results. The files used for simulation are included in Appendix A.

5.6.1 Inverter Chain

The circuit depicted in Figure 37 was simulated in HSPICE for one cycle using 0.5 ns risetimes and a 0.5pF load capacitor on the last inverter output. Each inverter had a separate power supply, and thus, the power consumed in the charging and discharging of each node (1, 2, 3, 4, and 5) was the power consumed by each of the respective power supplies. The same circuit was simulated in ROMODEL and the results were compared for each node. These simulations were run with four different supply voltages (1.5V, 2.2V, 3.3V, and 5.0V). These results are summarized in Table 12.
All energies are in pJ per cycle.

VDD (V)   H1      R1      H2      R2      H3      R3      H4      R4      H5      R5
1.5       .0798   .0745   .1520   .1436   .2936   .2818   .5829   .5580   1.527   1.209
2.2       .1788   .1683   .3438   .3256   .6672   .6403   1.324   1.270   3.282   2.583
3.3       .4230   .3890   .810    .7571   1.578   1.492   3.141   2.960   7.377   5.768
5.0       1.095   .9096   2.00    1.772   3.855   3.495   7.650   6.943   17.28   13.14

TABLE 12. HSPICE simulation vs. ROMODEL results for power estimation for a 5-inverter chain. Energies for nodes 1, 2, 3, 4, and 5 are listed above. HSPICE results are prefixed with an "H" before the node number. ROMODEL results are prefixed with an "R".

The values predicted by ROMODEL are slightly below HSPICE estimates because direct path currents during input and output transitions are neglected by ROMODEL. Note that ROMODEL and HSPICE differ on the energy calculations for node 5. This is because HSPICE assumes that signals that change quickly dissipate more energy, and since node 5 has no gate capacitance loading the output of the inverter, the node voltage changes abruptly, resulting in different energy estimates.

5.6.2 ROM

VDD (V)   Words   Bits   Blocks   HSPICE Energy / Cycle (pJ)   ROMODEL Energy / Cycle (pJ)
1.5       16      4      2        12.58                        11.08
2.2       16      4      2        26.27                        25.11
3.3       16      4      2        60.50                        58.25
5.0       16      4      2        150.43                       135.92

TABLE 13. HSPICE simulation vs. ROMODEL results for power estimation for a 16 word x 4 bit ROM.

VDD (V)   Words   Bits   Blocks   HSPICE Energy / Cycle (pJ)   ROMODEL Energy / Cycle (pJ)
1.5       256     8      1        28.80                        25.65
2.2       256     8      1        59.85                        58.62
3.3       256     8      1        137.91                       136.34
5.0       256     8      1        348.28                       317.22

TABLE 14. HSPICE simulation vs. ROMODEL results for power estimation for a 256 word x 8 bit ROM.

Power dissipation for different supply voltages and different ROM sizes was simulated using HSPICE and ROMODEL. Fringe capacitance was ignored for these simulations because the netlist extractor in Cadence does not include fringe capacitance parasitics.
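To put the gap between the two estimators in perspective, the relative shortfall of ROMODEL with respect to HSPICE can be computed directly from the energies in Table 13 and Table 14. The short script below is only an arithmetic aid; the function and variable names are illustrative.

```python
# Sketch only: relative difference between HSPICE and ROMODEL energies.
# Rows are (VDD in V, HSPICE pJ, ROMODEL pJ), copied from Tables 13 and 14.
TABLE_13 = [(1.5, 12.58, 11.08), (2.2, 26.27, 25.11),
            (3.3, 60.50, 58.25), (5.0, 150.43, 135.92)]
TABLE_14 = [(1.5, 28.80, 25.65), (2.2, 59.85, 58.62),
            (3.3, 137.91, 136.34), (5.0, 348.28, 317.22)]


def shortfall(hspice_pj: float, romodel_pj: float) -> float:
    """Fraction by which the ROMODEL estimate falls below the HSPICE value."""
    return (hspice_pj - romodel_pj) / hspice_pj


if __name__ == "__main__":
    for label, table in (("16 x 4", TABLE_13), ("256 x 8", TABLE_14)):
        for vdd, hspice_pj, romodel_pj in table:
            pct = 100.0 * shortfall(hspice_pj, romodel_pj)
            print(f"{label} ROM, VDD = {vdd} V: ROMODEL is {pct:.1f}% lower")
```

For every row of both tables the ROMODEL estimate is lower than the HSPICE value, by less than 12 percent.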
The energy per cycle was obtained from HSPICE results by accessing a number of addresses containing random data. The energy dissipated in those cycles was averaged and compared to ROMODEL results. A summary of these simulations is listed in Table 13 and Table 14. ROMODEL gives lower values than HSPICE because static currents and short circuit currents are ignored. Also, HSPICE results are averaged over eight successive ROM reads, so the average depends heavily on which address locations are accessed.

5.7 Optimization Procedure

The process for determining the optimal ROM block organization begins with some calculations to determine valid numbers of block address bits and block decoding bits. ROMOPT then applies ROMODEL and iterates through the partitioning schemes to find an optimal solution.

First, the minimum number of address lines required for the number of ROM data words specified is calculated. Since the minimum number of address lines per block is 3, there must be at least 8 words in the ROM. Also, because the ROM is limited to 16 blocks with each block containing at least 8 words, the maximum number of bits that can be used for block decoding can be determined.

For each ROM configuration, the number of address lines and words per block is computed. This is used to calculate the minimum number of blocks required to store the number of words in the ROM, which ensures that no unneeded blocks containing only empty ROM data are created. ROMOPT then applies ROMODEL, varying the number of blocks until all valid partitioning schemes have been tried and the optimal solution has been found.
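The enumeration described above can be sketched as follows. This is not the ROMOPT source; the Scheme type and function names are hypothetical, and the energy model is passed in as a callable that stands in for a ROMODEL run. The sketch assumes that the number of blocks is rounded up to the smallest count that holds all the words, which is one reading of "minimum number of blocks required".

```python
# Sketch only: enumerate valid partitioning schemes and pick the lowest energy.
from typing import Callable, List, NamedTuple

MIN_BLOCK_ALINES = 3   # minimum address lines per block (8 words)
MAX_DECODE_BITS = 4    # at most 16 blocks


class Scheme(NamedTuple):
    blocks: int
    words_per_block: int
    decode_bits: int


def required_address_lines(words: int) -> int:
    """Smallest number of address lines that can address `words` words."""
    return max(MIN_BLOCK_ALINES, (words - 1).bit_length())


def valid_schemes(words: int) -> List[Scheme]:
    """List every allowed split of the address into block decode and block bits."""
    if words < 2 ** MIN_BLOCK_ALINES:
        raise ValueError("the ROM must contain at least 8 words")
    alines = required_address_lines(words)
    max_decode = min(MAX_DECODE_BITS, alines - MIN_BLOCK_ALINES)
    schemes = []
    for decode_bits in range(max_decode + 1):
        words_per_block = 1 << (alines - decode_bits)
        blocks = -(-words // words_per_block)  # ceiling division
        schemes.append(Scheme(blocks, words_per_block, decode_bits))
    return schemes


def optimize(words: int, energy_model: Callable[[Scheme], float]) -> Scheme:
    """Return the scheme with the lowest energy according to `energy_model`."""
    return min(valid_schemes(words), key=energy_model)


if __name__ == "__main__":
    for scheme in valid_schemes(2000):
        print(scheme)
```

Because the block count is rounded up rather than forced to a power of two, a ROM whose word count is not close to a power of two can end up with an odd number of blocks.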
5.8 ROMOPT Usage

The syntax for ROMOPT is:

romopt words wordsize vdd data_weight word_weight inv_wordlines [techfile]

where words is the total number of words in the ROM, wordsize is the number of bits per word (must be even), vdd is the power supply voltage (in V), data_weight is the average weight of an uncoded wordline, word_weight is the average weight of a coded wordline, inv_wordlines is the number of wordlines inverted, and techfile is the ROMODEL technology file to be used (described in more detail in Section 5.3.1), which defaults to "romodel.tf". Data_weight, word_weight, and inv_wordlines are usually determined by ROMAVG. If the technology file is valid, ROMOPT proceeds with the ROM modelling and optimization. The following demonstrates ROMOPT usage:

% romopt 2000 8 1.5 11.9 10.3 5 romodel.tf
Optimal number of blocks: 8
Minimum energy per cycle: 3.233820e-11J
ROMOPT output sent to "romopt.out".

The optimal number of blocks and minimum energy per cycle are calculated and displayed. An output report is also generated and contains more detailed information about the different partitioning schemes for the ROM.

5.9 Interpreting the ROMOPT Report

A report file, "romopt.out", is generated during the optimization process and includes information about the different ROM configurations. A sample report is shown below:

words: 2000
wordsize: 8
vdd: 1.500000
data weight: 11.900000
word weight: 10.300000
techfile: romodel.tf
Required number of address lines: 11
Maximum number of address bits used for block decode: 4
Valid ROM partitioning schemes listed below.

***** blocks: 1 *****
Words per block: 2048
Number of block decode bits: 0
Energy per cycle for this ROM configuration: 7.980468e-11J

***** blocks: 2 *****
Words per block: 1024
Number of block decode bits: 1
Energy per cycle for this ROM configuration: 4.717560e-11J

***** blocks: 4 *****
Words per block: 512
Number of block decode bits: 2
Energy per cycle for this ROM configuration: 3.393365e-11J

***** blocks: 8 *****
Words per block: 256
Number of block decode bits: 3
Energy per cycle for this ROM configuration: 3.233820e-11J

***** blocks: 16 *****
Words per block: 128
Number of block decode bits: 4
Energy per cycle for this ROM configuration: 4.089522e-11J

Optimal number of blocks: 8
Minimum energy per cycle: 3.233820e-11J

This example report shows that for a ROM with 2000 words, partitioning the ROM into eight blocks is recommended to minimize power consumption. This shows that the maximum partitioning does not always yield the ROM with the lowest energy per cycle. This is especially true for smaller ROMs, in which the energy dissipated in the block decoding circuitry and extra data bus capacitance outweighs the gains in reducing ROM block size. However, for small and medium sized ROMs, the difference in energy per cycle for different partitioning schemes may be slight. Thus, the user may choose to use less partitioning in exchange for a smaller ROM area.

Chapter 6 Conclusion

Low power design has become increasingly important as chip density, system sizes, and the popularity of portable applications continue to rise. Low power memory design in particular has become more important, since most designs require a large amount of non-volatile memory. Therefore, the focus of this work is on the development of a ROM generator and supporting tools to model and optimize the ROM for low power.

A ROM library from U.C.
Berkeley [Burstein96] in Magic format was converted for use in the Cadence Design Environment. Modifications appropriate for use with the HP26 process technology were made and schematics were created for each layout. Cell layouts were checked for design rule violations (DRC) and verified versus schematics (LVS). Large cells were verified using HSPICE simulations. The cells passed all these checks, indicating the completion of the library conversion.

Once the library was converted, changes were made to the ROM to further decrease power consumption. The ROM core is coded with an inverting bit, lowering the number of transistors in the ROM core and hence reducing the power dissipation in the wordlines and bitlines. This also reduces the worst case wordline delays, possibly decreasing access time. Also, the self-timing logic is altered to include the bitline delay in the self-timed signal generation. The application of a reduced element XOR gate minimizes the amount of additional logic required to decode the ROM data. This introduction of decoding logic requires additional circuitry to eliminate glitches on the output bus, which can waste power.

ROMGEN is a tool created for generating the ROMs described above. Each cell was updated with tiling information and a tiling routine was written in SKILL to facilitate the placement of the cells. Improvements were made to the generator, such as simplifying the interface and allowing the user to specify an odd number of blocks, which can save area for some ROMs.

ROMODEL is a tool that models power dissipation in the ROM. Models were developed for gate, drain, and wiring capacitance to estimate each component of power consumption. Furthermore, the repetitive structure of the ROM allows assumptions to be made which eliminate a number of redundant calculations and reduce computation time. Simulations using HSPICE were used to verify the accuracy of the results.
Finally, to help the user determine the partitioning scheme for the ROM, ROMOPT applies ROMODEL to exhaustively evaluate the valid configurations and identify the power dissipation associated with each one.

Appendix A HSPICE Decks

A.1 Spice Deck for CV plot

**This is the full input file used for hspice simulation runs.
**Project:Low Power ROM Generation
**Owner:Paul Chou
**Description:ROM address latch
**
**Include model files and netlist files here**
*.include '/amd/sick-puppy/a/jimg/tsmc/Nominal.model'
.include '/amd/sick-puppy/a/pchou/sim/models/hp26.models'
mxn gnd 2 gnd gnd nmos w=.5e-06 l=.5e-06 ad=0 as=0 pd=0 ps=0
mxp vdd 3 vdd vdd pmos w=.5e-06 l=.5e-06 ad=0 as=0 pd=0 ps=0
**Voltage supplies and Input Stimulus
Vvdd vdd gnd 5V
Vn 2 gnd 5V
Vp 3 gnd 5V
**Options and Analysis
.options post nomod dccap brief
.DC Vn 0 5 .001
.print LX18(mxn)
.DC Vp 0 5 .001
.print LX18(mxp)
.probe LX18(mxn)
.probe LX18(mxp)
.end

A.2 Spice Deck for Inverter Chain

** Project: Low Power ROM Generation
** Owner: Paul Chou
**
** Include model files and netlist files here
.include '/amd/sick-puppy/a/pchou/sim/models/hp26.models'
** Voltage supplies and Input Stimulus
Vin 2 0 PWL(0ns 1.5V 10ns 1.5V 10.5ns 0V 20ns 0V
+ 20.5ns 1.5V 30ns 1.5V)
vdd1 11 0 1.5V
vdd2 12 0 1.5V
vdd3 13 0 1.5V
vdd4 14 0 1.5V
vdd5 15 0 1.5V
mx1 3 2 0 0 nmos w=2e-06 l=1e-06 ad=5e-12 pd=7e-6 as=5e-12 ps=7e-6
mx2 3 2 11 11 pmos w=4e-06 l=1e-06 ad=10e-12 pd=9e-6 as=10e-12 ps=9e-6
mx3 4 3 0 0 nmos w=4e-06 l=1e-06 ad=10e-12 pd=9e-6 as=10e-12 ps=9e-6
mx4 4 3 12 12 pmos w=8e-06 l=1e-06 ad=20e-12 pd=13e-6 as=20e-12 ps=13e-6
mx5 5 4 0 0 nmos w=8e-06 l=1e-06 ad=20e-12 pd=13e-6 as=20e-12 ps=13e-6
mx6 5 4 13 13 pmos w=16e-06 l=1e-06 ad=40e-12 pd=21e-6 as=40e-12 ps=21e-6
mx7 6 5 0 0 nmos w=16e-06 l=1e-06 ad=40e-12 pd=21e-6 as=40e-12 ps=21e-6
mx8 6 5 14 14 pmos w=32e-06 l=1e-06 ad=80e-12 pd=37e-6 as=80e-12 ps=37e-6
mx9 7 6 0 0 nmos w=32e-06 l=1e-06
+ ad=80e-12 pd=37e-6 as=80e-12 ps=37e-6
mx10 7 6 15 15 pmos w=64e-06 l=1e-06 ad=160e-12 pd=69e-6 as=160e-12
+ ps=69e-6
cl 7 0 .5e-12
** Options and Analysis
.options post nomod method=gear
.measure tran avgpowervin avg p(vin) from=0ns to=30ns
.measure tran avgpower1 avg p(vdd1) from=0ns to=30ns
.measure tran avgpower2 avg p(vdd2) from=0ns to=30ns
.measure tran avgpower3 avg p(vdd3) from=0ns to=30ns
.measure tran avgpower4 avg p(vdd4) from=0ns to=30ns
.measure tran avgpower5 avg p(vdd5) from=0ns to=30ns
.tran .01n 30ns
.end

A.3 Spice Deck for ROM Simulations

** This is the full input file used for hspice simulation runs.
** Project: Low Power ROM Generation
** Owner: Paul Chou
**
** Include model files and netlist files here
**
.include '/amd/sick-puppy/a/pchou/sim/rom.run3/netlist'
.include '/amd/sick-puppy/a/pchou/sim/models/hp26.models'
** Voltage supplies and Input Stimulus
Vdd vdd gnd 1.5V
Vporb porb gnd PWL( 0n 0V 5n 0V 5.5n 1.5V 1600n 1.5V)
Venable enable gnd 1.5V
Va3 a3 gnd PWL( 0n 0V 800n 0V 800.5n 1.5V
+ 1600n 1.5V 1600.5n 0V)
Va2 a2 gnd PWL( 0n 0V 400n 0V 400.5n 1.5V
+ 800n 1.5V 800.5n 0V
+ 1200n 0V 1200.5n 1.5V
+ 1600n 1.5V 1600.5n 0V)
Va1 a1 gnd PWL( 0n 1.5V 200n 1.5V 200.5n 0V
+ 400n 0V 400.5n 1.5V
+ 600n 1.5V 600.5n 0V
+ 800n 0V 800.5n 1.5V
+ 1000n 1.5V 1000.5n 0V
+ 1200n 0V 1200.5n 1.5V
+ 1400n 1.5V 1400.5n 0V
+ 1600n 0V 1600.5n 1.5V)
*Va1 a1 gnd PWL( 0n 0V 100n 0V 100.5n 1.5V
*+ 200n 1.5V 200.5n 0V
*+ 300n 0V 300.5n 1.5V
*+ 400n 1.5V 400.5n 0V
*+ 500n 0V 500.5n 1.5V)
*+ 600n 1.5V 600.5n 0V
*+ 700n 0V 700.5n 1.5V)
*+ 800n 1.5V 800.5n 0V)
Va0 a0 gnd 1.5V
Vclk clk gnd PWL(0n 0V 0.5n 0V
+ 20n 0V 20.5n 1.5V
+ 120n 1.5V 120.5n 0V
+ 220n 0V 220.5n 1.5V
+ 320n 1.5V 320.5n 0V
+ 420n 0V 420.5n 1.5V
+ 520n 1.5V 520.5n 0V
+ 620n 0V 620.5n 1.5V
+ 720n 1.5V 720.5n 0V
+ 820n 0V 820.5n 1.5V
+ 920n 1.5V 920.5n 0V
+ 1020n 0V 1020.5n 1.5V
+ 1120n 1.5V 1120.5n 0V
+ 1220n 0V 1220.5n 1.5V
+ 1320n 1.5V 1320.5n 0V
+ 1420n 0V 1420.5n 1.5V
+ 1520n 1.5V 1520.5n 0V
+ 1620n 0V 1620.5n 1.5V)
** Options and Analysis
.probe
.options post nomod
.measure tran avgcurrent avg i(vdd) from=0ns to=1600ns
.measure tran avgpower avg p(vdd) from=0ns to=1600ns
.tran .1n 1620ns
.end

Appendix B Sample Parameter files

B.1 Sample technology file for ROMODEL, romodel.tf

lambda .5e-6
# Wiring capacitance (F/m^2)
ndiff_cap 286e-6
pdiff_cap 545e-6
poly_cap 73e-6
metal1_cap 35e-6
metal2_cap 19e-6
# Fringe capacitance (F/m) (Unavailable)
ndiff_fringe 0
pdiff_fringe 0
poly_fringe 0
metal1_fringe 0
metal2_fringe 0
# built in potential
phin 0.6
phip 0.6
# nmos threshold voltage
vtn .7623
# nmos junction cap (F/m^2)
cj0n 2.6473e-04
mjn 0.9561
# nmos sidewall cap (F/m)
cjsw0n 4.0556E-10
mjswn 0.270227
# pmos threshold voltage (abs value)
vtp 0.8814
# pmos junction cap (F/m^2)
cj0p 5.5813e-04
mjp 0.4968
# pmos sidewall cap (F/m)
cjsw0p 2.0919e-10
mjswp 0.463227
# nmos and pmos overlap cap (F/m)
cgd0n 3.4599e-10
cgd0p 1.3214e-10
# piecewise constant parameters for nmos
# CV-plot breakpoint
n_cv_bp 0.1525147
#n_cv_bp 0
# Initial gate capacitance per (m^2) (0 for NMOS)
n_gate_cap_init 0
# Final gate capacitance per (m^2) (Cox for NMOS)
n_gate_cap_final 1.939045e-3
# piecewise constant parameters for pmos
# CV-plot breakpoint
p_cv_bp 0.3359656
# p_cv_bp 0
# Initial gate capacitance per (m^2) (Cox for PMOS)
p_gate_cap_init 1.939045e-3
# Final gate capacitance per (m^2) (0 for PMOS)
p_gate_cap_final 0
# All comments must begin the line with a '#'
# Reminder: EOX == 34.515e-12 F/m

B.2 Sample SPICE Model

.MODEL NMOS NMOS LEVEL=3 PHI=0.600000 TOX=1.7800E-08 XJ=0.200000U TPG=1
+ VTO=0.7623 DELTA=7.6940E-01 LD=1.1890E-07 KP=1.2379E-04
+ UO=638.1 THETA=1.2160E-01 RSH=6.5980E+00 GAMMA=0.5942
+ NSUB=4.0030E+16 NFS=7.0730E+12 VMAX=1.9160E+05 ETA=4.3410E-02
+ KAPPA=1.0510E-01 CGDO=3.4599E-10 CGSO=3.4599E-10
+ CGBO=4.1520E-10 CJ=2.6473E-04 MJ=0.9561
CJSW=4.0556E-10 + MJSW=0.270227 PB=0.800000 * Weff = Wdrawn - Delta_W * The suggested Delta_W is 2.8000E-07 .MODEL PMOS PMOS LEVEL=3 PHI=0.600000 TOX=1.7800E-08 XJ=0.200000U TPG=-l + VTO=-0.8814 DELTA=1.2220E+00 LD=4.5410E-08 KP=3.6685E-05 + UO=189.1 THETA=1.7250E-01 RSH=5.5000E-01 GAMMA=0.4652 + NSUB=2.4540E+16 NFS=7.7440E+12 VMAX=3.7770E+05 ETA=8.1730E-02 + KAPPA=9.9830E+00 CGDO=1.3214E-10 CGSO=1.3214E-10 + CGBO=4.2612E-10 CJ=5.5813E-04 MJ=0.4968 CJSW=2.0919E-10 + MJSW=0.463227 PB=0.850000 * Weff = Wdrawn - Delta_W * The suggested Delta_W is 3.0400E-07 Low Power ROM Generation B.3 Sample technology file for Magic to Skill, m2s.tech # magic2cadence "Technology File" # D. Xanthopoulos 1996 # this must be set to 1 if the select mask must be included DERIVESELECT # Necessary Design Rules CONTACT_SIZE 2 CONTACT_SPACING 2 CONTACT_OVERLAP 1 VIA_SIZE 2 VIA_SPACING 2 VIA_OVERLAP 1 SELECT_OVERLAP 3 #Local Name Magic layer Cadence layer(s) none PW pwell NW nwell nwell POLY polysilicon poly NDIFF ndiffusion ndiff pdiffusion pdiff PDIFF M1 metall metall M2 metal2 metal2 M3 metal3 metal3 NT ntransistor ndiff poly ptransistor PT pdiff poly PSUB psubstratepdiff psub NSUB nsubstratendiff nsub GLASS glass overgla # VERY IMPORTANT!!!! # Contacts must be specified as layerl layer2 contact_cut PC NDC PDC M2C PSC NSC end Low Power ROM Generation polycontact ndcontact pdcontact m2contact psubstratepcontact nsubstratencontact poly metall cont ndiff metall cont_aa pdiff metall cont_aa metall metal2 via psub metall cont_aa nsub metall cont_aa Bibliography [Brodersen93] R.W. Brodersen, "Anatomy of a Silicon Compiler," Kluwer, Boston, 1993. [Burstein96] A. Burstein, "Speech Recognition for Portable Multimedia Terminals," Ph.D. thesis, University of California, Berkeley, pp. 69-92, February 1996. [Hirose90] T. Hirose et al., "A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture," IEEE J. Solid State Circuits, Vol. 25, pp. 10681074, Oct. 1990. [Hoff89] D. 
Hoff et al., "A 23-ns 256K EPROM with Double-Layer Metal and Address Transition Detection," IEEE J. Solid State Circuits, Vol. 24, pp. 1250-1259, October 1989.

[Knecht83] M. Knecht et al., "A High-Speed Ultra-Low Power 64K CMOS EPROM with On-Chip Test Functions," IEEE J. Solid State Circuits, Vol. SC-18, pp. 554-561, October 1983.

[Kuriyama90] M. Kuriyama et al., "A 16-ns 1-Mb CMOS EPROM," IEEE J. Solid State Circuits, Vol. 25, pp. 1141-1146, October 1990.

[Murakami90] S. Murakami et al., "A 21-mW 4-Mb CMOS SRAM for Battery Operation," IEEE J. Solid State Circuits, Vol. 26, pp. 1563-1570, October 1990.

[Mutoh95] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, J. Yamada, "1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS," IEEE J. Solid State Circuits, Vol. 30, No. 8, pp. 847-854, August 1995.

[Ohtsuka87] N. Ohtsuka et al., "A 4-Mbit CMOS EPROM," IEEE J. Solid State Circuits, Vol. SC-22, pp. 669-675, October 1987.

[Rabaey95] J. Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice Hall, 1995.

[Sakurai84] T. Sakurai et al., "A Low Power 46 ns 256 kbit CMOS Static RAM with Dynamic Double Word Line," IEEE J. Solid State Circuits, Vol. SC-19, pp. 578-585, October 1984.

[Sasaki90] K. Sasaki et al., "A 23-ns 4Mb CMOS SRAM with 0.2-uA Standby Current," IEEE J. Solid State Circuits, Vol. 25, pp. 1075-1081, October 1990.

[Stan89] M. Stan and W. Burleson, "Limited-weight Codes for Low-power I/O," 1994 International Workshop on Low-power Design, pp. 209-214, April 1993.

[Tabor90] J. Tabor, "Noise Reduction Using Low Weight and Constant Weight Coding Techniques," Master's thesis, 75 pages, June 1990.

[Weste93] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design," Sec. Ed., Addison-Wesley, 1993.

[Xanthopoulos96] T. Xanthopoulos, "PYTHIA: A Power Estimator for Structural Verilog," 25 pages, April 1996.

[Yoshimoto83] M. Yoshimoto et al., "A Divided Word-Line Structure in the Static RAM and Its Application to a 64K Full CMOS RAM," IEEE J. Solid State Circuits, Vol. SC-18, pp. 479-485, October 1983.