1352 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 10, OCTOBER 2002

A 120-mW 3-D Rendering Engine With 6-Mb Embedded DRAM and 3.2-GB/s Runtime Reconfigurable Bus for PDA Chip

Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, and Hoi-Jun Yoo

Abstract—A low-power three-dimensional (3-D) rendering engine is implemented as part of a mobile personal digital assistant (PDA) chip. Six-megabit embedded DRAM macros attached to 8-pixel-parallel rendering logic are logically localized with a 3.2-GB/s runtime reconfigurable bus, reducing the area by 25% compared with conventional local frame-buffer architectures. The low power consumption is achieved by polygon-dependent access to the embedded DRAM macros with line-block mapping providing read–modify–write data transaction. The 3-D rendering engine with 2.22-Mpolygons/s drawing speed was fabricated using 0.18-µm CMOS embedded memory logic technology. Its area is 24 mm² and its power consumption is 120 mW.

Index Terms—3-D graphics rendering, embedded DRAM, low power, mobile application, PDA, portable.

I. INTRODUCTION

As the mobile electronics market grows, more processing power is required for intelligent handheld terminals because the market is moving quickly from text-based information management to multimedia-rich applications, such as real-time audio and video playback. To meet these market demands, much research, including the design of the circuits [1]–[4] and the analysis of the system [5], has tried to integrate more multimedia functionalities into handheld devices. The realization of real-time three-dimensional (3-D) computer graphics has been a critical issue in workstations and desktop PCs in the past ten years, because it requires many parallel calculations and a huge amount of memory bandwidth.
To provide the necessary memory bandwidth for 3-D graphics rendering, embedded memory logic (EML) is a promising technology, because it can break the bandwidth barrier by integrating the DRAM and the logic in a single chip [6]–[9]. The other advantage of EML technology comes from its low power consumption, because the capacitive off-chip I/O loading necessary for accessing the DRAM can be eliminated [1]–[4]. Therefore, EML technology will play a key role in 3-D graphics rendering for handheld devices.

Low power consumption is the most critical factor for the personal digital assistant (PDA), in which the power has to be supplied by batteries. Since the lithium battery in the PDA can supply about 2000 mWh, the system must consume less than 800 mW to guarantee 2–3 h of system operation [5]. Based on the allowed amount of system power consumption, the power budget allocated to the 3-D rendering engine is confined to less than 200 mW. The rendering engine must be small in size, because it is integrated as a part of the PDA chip [2]. Parallel calculation is also required, because per-pixel operations need R, G, B color and X, Y, Z coordinates together. Also, large memory bandwidth is essential because the data transaction is read–modify–write at every clock cycle [7].

Manuscript received October 1, 2001; revised May 3, 2002. The authors are with the Semiconductor System Laboratory, Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Taejon 305-701, Korea (e-mail: ramchan@ssl.kaist.ac.kr; hjyoo@ee.kaist.ac.kr). Publisher Item Identifier 10.1109/JSSC.2002.803051.

II. ARCHITECTURE AND DESIGN

A. Overall Architecture

Fig. 1 shows the block diagram of the proposed 3-D rendering engine. One edge processor [3] calculates pixel data along the edges of the 8 × 8 clipped polygon and broadcasts them to eight pixel processors (PPs) [3], which can fill eight horizontal pixels simultaneously inside the polygons at every clock cycle.
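As a rough behavioral illustration of this edge-processor/pixel-processor split, the Python sketch below walks an 8 × 8 clipped polygon one scanline per clock. It is only a software analogy under stated assumptions: the names `edge_spans` and `fill_span`, the span tuple layout, and the rule that PP i covers column i of the clip window are hypothetical and are not taken from the design.

```python
NUM_PP = 8  # eight pixel processors, as stated in the text


def edge_spans(left_x, right_x):
    """Edge-processor stand-in: one (y, x_left, x_right) span per scanline.

    left_x and right_x are hypothetical per-scanline edge positions
    (inclusive) inside the 8 x 8 clip window.
    """
    return [(y, l, r) for y, (l, r) in enumerate(zip(left_x, right_x))]


def fill_span(span):
    """Pixel-processor stand-in (assumption: PP i covers column i).

    Returns the (x, y) pixels drawn in a single clock; PPs whose column
    lies outside the span stay idle and contribute nothing.
    """
    y, x_left, x_right = span
    return [(x, y) for x in range(NUM_PP) if x_left <= x <= x_right]


# A right triangle clipped to 8 x 8: scanline y covers columns 0..y.
spans = edge_spans(left_x=[0] * 8, right_x=list(range(8)))
clocks = [fill_span(s) for s in spans]
print(len(clocks), "clocks,", sum(len(c) for c in clocks), "pixels")
```

Because a clipped polygon is at most eight pixels wide, every scanline fits in one 8-pixel-parallel step in this model, which is why the sketch spends one clock per scanline regardless of how many PPs are active.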
The 6-Mb frame buffer, which consists of twelve eDRAM macros, is coupled to the rendering logic through the runtime reconfigurable bus (R2bus). The R2bus changes the configuration of the data transaction at every clock cycle. The data in the frame buffer are supplied to the PPs continuously on every cycle of the 20-MHz clock, and 8 × 8 clipped polygons can be drawn at 2.22 Mpolygons/s at a 256 × 256 screen resolution.

B. Logically Local Frame Buffer and Line-Block Memory Mapping

Local frame-buffer architectures [3], [6], shown in Fig. 2(a), in which each local memory is tightly coupled with the corresponding PP, provide high memory bandwidth because parallel pixel data can be obtained by local routes. Also, the power consumption is low because only the necessary memories are activated. In these architectures, however, each eDRAM is independent, so each one requires its own controller. Each eDRAM contains only a small number of pixels because the small screen resolution of the targeted PDA is distributed over many eDRAMs [3]. Therefore, these architectures cannot be fully implemented in the PDA chip because the total area of the eDRAM is too large.

A global frame-buffer architecture [7], shown in Fig. 2(b), is a good candidate in terms of cell efficiency because the required pixel data can be obtained from the core by bank interleaving. However, it has limited memory bandwidth at low clock frequencies because the required pixel data has to be accessed sequentially. Moreover, power is wasted when rectangular memory mapping is adopted [7], because unnecessary pixel data are also transferred through the capacitive I/O bus of the eDRAM.

Since these conventional architectures are not suitable for the embedded 3-D rendering engine, which requires both low power consumption and small area, we propose a logically local frame-buffer architecture. As shown in Fig. 2(c), eDRAMs are placed together to form a “physically global” frame buffer, which enhances its cell efficiency.
With this architecture, the area is as small as that of the global frame buffer. This frame buffer, however, is “logically localized” by the R2bus. Each PP renders as if it has its own local memory, because the R2bus changes the connection of the eDRAM macros. Therefore, the advantages of the local frame buffer are still maintained in this architecture: parallel access, low power, and high memory bandwidth.

To incorporate the logically local data transactions in the R2bus, line-block memory mapping is proposed. Every 8 × 1 group of screen pixels composes a line block, and adjacent line blocks are mapped onto subwordlines of different eDRAM macro units. Therefore, four macro units, each of which contains one Z buffer and double color buffers, are necessary to cover the screen area, and each line block accesses 8-pixel data in parallel. Alternating memory mapping in the vertical direction provides a simultaneous and continuous read–modify–write data transaction. When the PP writes the rendered pixels back to the macros, it reads the new pixel data of the next line block from the other macros.

0018-9200/02$17.00 © 2002 IEEE

Fig. 1. Block diagram of 3-D rendering engine.

Fig. 2. Architecture diagram of embedded frame buffer. (a) Local frame buffer. (b) Global frame buffer. (c) Logically local frame buffer.

C. Runtime Reconfigurable Bus

To implement the R2bus, which supports the proposed line-block memory mapping, cascaded 2 × 2 multiplexers (MUXs) and 16 × 8 bus shifters are used. The MUXs arbitrarily connect the bidirectional memory bus to the unidirectional PP bus, as shown in Fig. 3(a). They assign the appropriate macro to the PPs for data read and write. The bus shifter [Fig. 3(b)], which is a modified barrel shifter, assigns the pixel data to the corresponding PP by changing the bus attachment. The unidirectional PP bus feeds data to eight PPs in parallel. Each PP has a 40-bit read bus and a 40-bit write bus for 16-bit depth and 24-bit color. This R2bus provides 3.2-GB/s bandwidth by accessing 1280 bits of data simultaneously with the read–modify–write pattern at 20 MHz.

Fig. 3. (a) Circuit diagram of R2bus. 1280-b bus reconfiguration at 20 MHz = 3.2 GB/s. (b) 16 × 8 bus shifter.

D. Power Reduction in Frame Buffer

Since the screen is split into multiple 8 × 1 line blocks, 8 × 8 clipped polygons either straddle both of the macros or fall into just one of them. Lines in a straddling polygon do not always require the data from both macros. Therefore, the power can be reduced if only the required macros are selectively activated.

Partial I/O activation is used to reduce the power of the I/O bus in the selected line block. All eight pixels of a line block are not always necessary; the number of pixels to be fetched depends on the shape of the polygon. In this work, only the necessary data bitline sense amplifiers (DBSAs) are activated by I/O masking. This eliminates unnecessary I/O bus transactions and reduces power consumption further, because charging and discharging the long capacitive wires between the eDRAM and the rendering logic may consume a lot of power.

III. IMPLEMENTATION RESULTS

Fig. 4 summarizes the area reduction in the rendering engine. Even though the R2bus adds extra area, the logically local frame buffer shows 25% smaller area than the conventional local frame buffers, because the memory controllers, which were split over the local memories, are merged together in the proposed architecture. Even though the frame buffer is physically global, polygon-shape-dependent operations eliminate unnecessary power as much as possible, as in local frame-buffer architectures. The power is reduced with the help of selective macro activation and partial I/O activation.
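The selective macro activation and partial I/O activation summarized above can be illustrated with a small Python sketch. It is a minimal model under assumptions, not the implemented control logic: the text states only that adjacent line blocks map to different macro units, that the mapping alternates vertically, and that four macro units cover the screen, so the exact index formula in `macro_unit`, the function names, and the bit-mask encoding are hypothetical.

```python
BLOCK_W = 8  # pixels per 8 x 1 line block, as stated in the text


def macro_unit(x, y):
    """Hypothetical line-block mapping onto four macro units (0..3).

    Assumption: horizontally adjacent line blocks alternate between two
    units, and vertically adjacent lines alternate between two pairs.
    """
    return ((x // BLOCK_W) % 2) + 2 * (y % 2)


def access_plan(y, x_left, x_right):
    """Return {macro_unit: io_mask} for the span [x_left, x_right] on line y.

    Bit i of io_mask enables pixel i of the line block (partial I/O
    activation). Macro units missing from the result stay inactive
    (selective macro activation).
    """
    plan = {}
    for x in range(x_left, x_right + 1):
        unit = macro_unit(x, y)
        plan[unit] = plan.get(unit, 0) | (1 << (x % BLOCK_W))
    return plan


# A span that straddles two line blocks activates two units, each with a
# partial mask; a span inside one block activates a single unit.
print(access_plan(0, 6, 9))
print(access_plan(1, 2, 4))
```

In this model a span of at most eight pixels touches at most two neighboring line blocks, so at most two of the four units are ever enabled for a scanline, and only the masked pixel lanes would toggle the I/O bus.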
Simplified shading units, which include an approximated divider in the edge processor and simplified alpha-blending units in the PP, further reduce the power of the rendering logic. The rendering engine with 6-Mb eDRAM and R2bus consumes 120 mW when it continuously renders polygons.

The 3-D graphics accelerators in PC platforms draw at a high rendering rate and perform many advanced rendering functions, but they consume a great deal of power, more than several tens of watts. The proposed graphics rendering engine for the PDA chip, however, shows a slower rendering rate and performs restricted rendering functions while consuming only 120 mW. Therefore, we propose new performance indices to compare the performance of embedded 3-D graphics rendering engines with the power consumption taken into account:

\[
\text{3-D Performance of PDA} = \frac{\text{3-D Rendering Speed}}{\text{Power}} \tag{7}
\]

The 3-D rendering speed can be expressed in pixels/s (PXPS) or triangles/s (TRPS). Therefore, PXPS/mW stands for the pixel fill rate per 1 mW of power, and TRPS/mW describes the triangle drawing rate per 1 mW. The proposed rendering engine shows 580 kPXPS/mW and 18 kTRPS/mW, respectively. The local frame-buffer architecture [3] shows 2.3 kTRPS/mW, and a commodity PC graphics system shows about 40 kPXPS/mW and 1.25 kTRPS/mW.

The proposed rendering engine is integrated as a principal part of a PDA chip. Table I summarizes its features. The chip was fabricated using 0.18-µm CMOS EML technology with three-poly six-metal layers. Fig. 5 shows the microphotograph of the rendering engine and measured waveforms.

Fig. 4. Area reduction.

Fig. 5. Microphotograph of 3-D rendering engine and measured waveforms.

TABLE I. FEATURES OF 3-D RENDERING ENGINE

IV. CONCLUSION

A 3-D rendering engine is designed and integrated into a mobile PDA chip with 6-Mb eDRAM macros and a 3.2-GB/s R2bus. The R2bus supports the logically local frame-buffer architecture, which reduces the area by 25% compared with the conventional local frame buffers due to its high cell efficiency. Polygon-dependent access to eDRAM macros with line-block mapping eliminates the unnecessary power consumption. The rendering engine, which draws 2.22 Mpolygons/s with a 256 × 256 screen resolution, is fabricated using 0.18-µm CMOS EML technology with 24-mm² area and 120-mW power consumption.

REFERENCES

[1] R. Woo, C.-W. Yoon, J. Kook, S.-J. Lee, K. Lee, Y.-H. Park, and H.-J. Yoo, “A 120-mW embedded 3-D graphics rendering engine with 6-Mb logically local frame buffer and 3.2-GByte/s run-time reconfigurable bus for PDA chip,” in Symp. VLSI Circuit Dig. Tech. Papers, June 2001, pp. 95–98.
[2] C.-W. Yoon, R. Woo, J. Kook, S.-J. Lee, L. Lee, Y.-D. Bae, I.-C. Park, and H.-J. Yoo, “A 80/20-MHz 160-mW multimedia processor integrated with embedded DRAM MPEG-4 accelerator and 3-D rendering engine for mobile applications,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2001, pp. 142–143, 441.
[3] Y.-H. Park, S.-H. Han, J.-H. Lee, and H.-J. Yoo, “A 7.1-GB/s low-power rendering engine in 2-D array-embedded memory logic CMOS for portable multimedia system,” IEEE J. Solid-State Circuits, vol. 36, pp. 944–955, June 2001.
[4] T. Nishikawa, M. Takahashi, M. Hamada, T. Takayanagi, H. Arakida, N. Machida, H. Yamamoto, T. Fujiyoshi, Y. Maisumoto, O. Yamagishi, T. Samata, A. Asano, T. Terazawa, K. Ohmori, J. Shirakura, Y. Watanabe, H. Nakamura, S. Minami, T. Kuroda, and T. Furuyama, “A 60-MHz 240-mW MPEG-4 video-phone LSI with 16-Mb embedded DRAM,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp. 230–231.
[5] W. R. Hamburgen, D. A. Wallach, M. A. Viredaz, L. S. Brakmo, C. A. Waldspurger, J. F. Bartlett, T. Mann, and K. I. Farkas, “Itsy: Stretching the bounds of mobile computing,” Computer, pp. 28–36, Apr. 2001.
[6] Y. Aimoto, T. Kimura, Y. Yabe, H. Heiuchi, Y. Nakazawa, M. Motomura, T. Koga, Y. Fujita, M. Hamada, T. Tanigawa, H.
Nobusawa, and K. Koyama, “A 7.68-GIPS 3.84-GB/s 1-W parallel image-processing RAM integrating a 16-Mb DRAM and 128 processors,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1996, pp. 372–373.
[7] K. Inoue, H. Nakamura, and H. Kawai, “A 10-Mb frame buffer memory with Z-compare and A-blend units,” IEEE J. Solid-State Circuits, vol. 30, pp. 1563–1568, Dec. 1995.
[8] S. Miyano, K. Numata, K. Sato, T. Yabe, M. Wada, R. Haga, M. Enkaku, M. Shiochi, Y. Kawashima, M. Iwase, M. Ohgata, J. Kumagai, T. Yoshida, M. Sakurai, S. Kaki, N. Yanagiya, H. Shinya, T. Furuyama, P. Hansen, M. Hannah, M. Nagy, A. Nagarajan, and M. Rungsea, “A 1.6-Gbyte/s data transfer rate 8-Mb embedded DRAM,” IEEE J. Solid-State Circuits, vol. 30, pp. 1281–1285, Nov. 1995.
[9] T. Shimizu, J. Korematu, M. Satou, H. Kondo, S. Iwata, K. Sawai, N. Okumura, K. Ishimi, Y. Nakamoto, M. Kumanoya, K. Dosaka, A. Yamazaki, Y. Ajioka, H. Tsubota, Y. Nunomura, T. Urabe, J. Hinata, and K. Saitoh, “A multimedia 32-b RISC microprocessor with 16-Mb DRAM,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1996, pp. 216–217.