A 120-mw 3-d rendering engine with 6

advertisement
1352
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 10, OCTOBER 2002
A 120-mW 3-D Rendering Engine With 6-Mb Embedded DRAM
and 3.2-GB/s Runtime Reconfigurable Bus for PDA Chip
Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, and Hoi-Jun Yoo
Abstract—A low-power three-dimensional (3-D) rendering
engine is implemented as part of a mobile personal digital assistant
(PDA) chip. Six-megabit embedded DRAM macros attached to
8-pixel-parallel rendering logic are logically localized with a
3.2-GB/s runtime reconfigurable bus, reducing the area by 25%
compared with conventional local frame-buffer architectures.
The low power consumption is achieved by polygon-dependent
access to the embedded DRAM macros with line-block mapping
providing read–modify–write data transaction. The 3-D rendering
engine with 2.22-Mpolygons/s drawing speed was fabricated using
0.18- m CMOS embedded memory logic technology. Its area is
24 mm2 and its power consumption is 120 mW.
Index Terms—3-D graphics rendering, embedded DRAM, low
power, mobile application, PDA, portable.
I. INTRODUCTION
A
S THE mobile electronics market grows, more processing
power is required for the intelligent handheld terminals
because the market is moving quickly from text-based information management to multimedia-rich applications, such as
real-time audio and video playback. To meet these market
demands, much research, including the design of the circuits
[1]–[4] and the analysis of the system [5], has tried to integrate
more multimedia functionalities into the handheld devices.
The realization of real-time three-dimensional (3-D) computer graphics has been a critical issue in workstations and
desktop PCs in the past ten years, because it requires many
parallel calculations and a huge amount of memory bandwidth.
To provide the necessary memory bandwidth for 3-D graphics
rendering, embedded memory logic (EML) is a promising
technology, because it can break the bandwidth barrier by
integrating the DRAM and the logic in a single chip [6]–[9].
The other advantage of EML technology comes from its
low power consumption, because the capacitive off-chip I/O
loading necessary for accessing the DRAM can be eliminated
[1]–[4]. Therefore, EML technology will play key roles in the
3-D graphics rendering for the handheld devices.
Low power consumption is the most critical factor for the
personal digital assistant (PDA) in which the power has to be
supplied by batteries. Since the lithium battery in the PDA can
supply about 2000 mW/h, the system must consume less than
800 mW to guarantee 2 3-h system operation [5]. Based on
the allowed amount of system power consumption, the power
budget allocated to the 3-D rendering engine is confined to less
than 200 mW. The rendering engine must be small in size, because it is integrated as a part of the PDA chip [2]. A parallel
Manuscript received October 1, 2001; revised May 3, 2002.
The authors are with the Semiconductor System Laboratory, Department of Electrical Engineering, Korea Advanced Institute of Science and
Technology, Taejon 305-701, Korea (e-mail: ramchan@ssl.kaist.ac.kr;
hjyoo@ee.kaist.ac.kr).
Publisher Item Identifier 10.1109/JSSC.2002.803051.
calculation is also required, because per-pixel operations need
R, G, B color and X, Y, Z coordinates together. Also, the large
memory bandwidth is essential because the data transaction is
read–modify–write at every clock cycle [7].
II. ARCHITECTURE AND DESIGN
A. Overall Architecture
Fig. 1 shows the block diagram of the proposed 3-D rendering engine. One edge processor [3] calculates pixel data
along the edges of the 8 8 clipped polygon and broadcasts
them to eight pixel processors (PPs) [3] which can fill eight
horizontal pixels simultaneously inside the polygons at every
clock cycle. The 6-Mb frame buffer, which consists of twelve
eDRAM macros, is coupled to the rendering logic through the
runtime reconfigurable bus (R2bus). The R2bus changes the
configuration of the data transaction at every clock cycle. The
data in the frame buffer are supplied to the PP continuously at
every 20 MHz clock cycle, and 8 8 clipped polygons can be
drawn at 2.22 Mpolygons/s at a 256 256 screen resolution.
B. Logically Local Frame Buffer and Line-Block Memory
Mapping
Local frame-buffer architectures [3], [6], shown in Fig. 2(a),
in which each local memory is tightly coupled with the corresponding PP, provide high memory bandwidth because parallel
pixel data can be obtained by local routes. Also, the power consumption is low because only the necessary memories can be
activated. In these architectures, however, each eDRAM is independent so that each one requires its own controller. Each
eDRAM contains only a small number of pixels because the
small screen resolution of the targeted PDA is distributed to
many eDRAMs [3]. Therefore, these architectures cannot be
fully implemented into the PDA chip because the total area of
the eDRAM is too large.
A global frame-buffer architecture [7], shown in Fig. 2(b), is
a good candidate in terms of cell efficiency because the required
pixel data can be obtained from the core by bank interleaving.
However, it has limited memory bandwidth at low clock cycles
because the required pixel data has to be accessed sequentially.
Moreover, power is wasted when rectangular memory mapping
is adopted [7], because unnecessary pixel data can be transferred
as well through the capacitive I/O bus of the eDRAM.
Since these conventional architectures are not suitable for the
embedded 3-D rendering engine which requires both low power
consumption and small area, we propose a logically local framebuffer architecture. As shown in Fig. 2(c), eDRAMs are placed
together to form a “physically global” frame buffer, which enhances its cell efficiency. With this architecture, the area is as
small as that of the global frame buffer. This frame buffer, however, is “logically localized” by the R2bus. Each PP renders as
0018-9200/02$17.00 © 2002 IEEE
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 10, OCTOBER 2002
1353
Fig. 1. Block diagram of 3-D rendering engine.
C. Runtime Reconfigurable Bus
To implement the R2bus, which supports the proposed lineblock memory mapping, a cascaded 2 2 multiplexer (MUX)
and 16 8 bus shifters are used. The MUX arbitrarily connects
the bidirectional memory bus to the unidirectional PP bus, as
shown in Fig. 3(a). They assign the appropriate macro to the
PPs for data read and write. The bus shifter [Fig. 3(b)], which
is a modified barrel shifter, assigns the pixel data to the corresponding PP by changing the bus attachment. The unidirectional PP bus feeds data to eight PPs in parallel. Each PP has a
40-bit read bus and 40-bit write bus for 16-bit depth and 24-bit
color. This R2bus provides 3.2-GB/s bandwidth by accessing
1280 bits of data simultaneously with the read–modify–write
pattern at 20 MHz.
D. Power Reduction in Frame Buffer
Fig. 2. Architecture diagram of embedded frame buffer. (a) Local frame buffer.
(b) Global frame buffer. (c) Logically local frame buffer.
if it has its own local memory, because the R2bus changes the
connection of the eDRAM macros. Therefore, the advantages
of the local frame buffer are still maintained in this architecture:
parallel access, low power, and high memory bandwidth.
To incorporate the logically local data transactions in the
R2bus, line-block memory mapping is proposed. Each 8 1
screen pixels compose a line block, and adjacent line blocks
are mapped onto subwordlines of different eDRAM macro
units. Therefore, four macro units, each of which contains
one Z buffer and double color buffers, are necessary to cover
the screen area, and each line block accesses 8-pixel data in
parallel. Alternative memory mapping in the vertical direction
provides a simultaneous and continuous read–modify–write
data transaction. When the PP writes the rendered pixels back
to the macros, it reads the new pixel data of the next line block
from the other macros.
Since the screen is split into multiple 8 1 line blocks, 8
8 clipped polygons straddle both of the macros or fall into just
one of the macros. Lines in the straddled polygon do not always
require the data from both of the macros. Therefore, the power
can be reduced if only the required macros are selectively activated. Partial I/O activation is used to reduce the power of the
I/O bus in the selected line block. Eight-pixel data in a line block
is not always necessary. The number of pixels to be fetched is
dependent on the shape of the polygon. In this work, the only
necessary data bitline sense amplifier (DBSA) can be activated
by I/O masking. This eliminates unnecessary I/O bus transactions and reduces power consumption further, because charging
and discharging long capacitive wires between the eDRAM and
the rendering logic may consume a lot of power.
III. IMPLEMENTATION RESULTS
Fig. 4 summarizes the area reduction in the rendering engine.
Even if extra area is added by the R2bus, the logically local
frame buffer shows 25% smaller area than the conventional
local frame buffers because the memory controllers, which
were split over the local memories, are merged together to
save area in the proposed architecture. Even though the frame
1354
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 10, OCTOBER 2002
(a)
(b)
Fig. 3.
(a) Circuit diagram of R2bus. 1280-b bus reconfiguration at 20 MHz
= 3.2 GB/s. (b) 16 2 18 bus shifter.
buffer is physically global, polygon-shape-dependent operations
eliminate unnecessary power as much as possible, like local
frame-buffer architectures. The power is reduced with the help of
selective macro activation and partial I/O activation. Simplified
shading units, which include an approximated divider in the
edge processor and simplified alpha-blending units in the PP,
further reduce the power of the rendering logic. The rendering
engine with 6-Mb eDRAM and R2bus consumes 120 mW
when it continuously renders polygons.
Drawing at a high rendering rate, the 3-D graphics accelerators in the PC platforms perform many advanced rendering functions. But they consume a great deal of power, more than several
tens of watts. The proposed graphics rendering engine for the
PDA chip, however, shows a slower rendering rate and performs
restricted rendering functions while consuming only 120 mW.
Therefore, we propose new performance indices to compare the
performance of the embedded 3-D graphics rendering engine
considering the power consumption.
3-D Performance of PDA
3-D Rendering Speed
Power
(7)
Fig. 4.
Area reduction.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 37, NO. 10, OCTOBER 2002
Fig. 5.
1355
Microphotograph of 3-D rendering engine and measured waveforms.
TABLE I
FEATURES OF 3-D RENDERING ENGINE
rendering engine, which draws 2.22 Mpolygons/s with a 256
256 screen resolution, is fabricated using 0.18- m CMOS
EML technology with 24-mm area and 120-mW power
consumption.
REFERENCES
The 3-D rendering speed can be illustrated in pixels/s (PXPS)
or triangles/s (TRPS). Therefore, PXPS/mW stands for the
pixel fill rate per 1-mW power, and TRPS/mW describes the
triangle drawing rate per 1 mW. The proposed rendering engine
shows 580 kPXPS/mW and 18 kTRPS/mW, respectively.
Local frame-buffer architecture [3] shows 2.3 kTRPS/mW, and
commodity PC graphics system shows about 40 kPXPS/mW
and 1.25 kTRPS/mW.
The proposed rendering engine is integrated as a principal
part of a PDA chip. Table I summarizes its features. The chip
was fabricated using 0.18- m CMOS EML technology with
three-poly six-metal layers. Fig. 5 shows the microphotograph
of the rendering engine and measured waveforms.
IV. CONCLUSION
A 3-D rendering engine is designed and integrated into a
mobile PDA chip with 6-Mb eDRAM macros and 3.2-GB/s
R2bus. The R2bus supports logically local frame-buffer architecture which reduces its area by 25% compared with the
conventional local frame buffers due to its high cell efficiency.
Polygon-dependent access to eDRAM macros with line-block
mapping eliminates the unnecessary power consumption. The
[1] R. Woo, C.-W. Yoon, J. Kook, S.-J. Lee, K. Lee, Y.-H. Park, and H.-J.
Yoo, “A 120-mW embedded 3-D graphics rendering engine with 6-Mb
logically local frame buffer and 3.2-GByte/s run-time reconfigurable bus
for PDA chip,” in Symp. VLSI Circuit Dig. Tech. Papers, June 2001, pp.
95–98.
[2] C.-W. Yoon, R. Woo, J. Kook, S.-J. Lee, L. Lee, Y.-D. Bae, I.-C. Park,
and H.-J. Yoo, “A 80/20-MHz 160-mW multimedia processor integrated
with embedded DRAM MPEG-4 accelerator and 3-D rendering engine
for mobile applications,” in IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, Feb. 2001, pp. 142–143, 441.
[3] Y.-H. Park, S.-H. Han, J.-H. Lee, and H.-J. Yoo, “A 7.1-GB/s low-power
rendering engine in 2-D array-embedded memory logic CMOS for
portable multimedia system,” IEEE J. Solid-State Circuits, vol. 36, pp.
944–955, June 2001.
[4] T. Nishikawa, M. Takahashi, M. Hamada, T. Takayanagi, H. Arakida, N.
Machida, H. Yamamoto, T. Fujiyoshi, Y. Maisumoto, O. Yamagishi, T.
Samata, A. Asano, T. Terazawa, K. Ohmori, J. Shirakura, Y. Watanabe,
H. Nakamura, S. Minami, T. Kuroda, and T. Furuyama, “A 60-MHz
240-mW MPEG-4 video-phone LSI with 16-Mb embedded DRAM,”
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2000, pp.
230–231.
[5] W. R. Hamburgen, D. A. Wallach, M. A. Viredaz, L. S. Brakmo, C. A.
Waldspurger, J. F. Bartlett, T. Mann, and K. I. Farkas, “Itsy: Stretching
the bounds of mobile computing,” Computer, pp. 28–36, Apr. 2001.
[6] Y. Aimoto, T. Kimura, Y. Yabe, H. Heiuchi, Y. Nakazawa, M. Motomura, T. Koga, Y. Fujita, M. Hamada, T. Tanigawa, H. Nobusawa, and
K. Koyama, “A 7.68-GIPS 3.84-GB/s 1-W parallel image-processing
RAM integrating a 16-Mb DRAM and 128 processors,” in IEEE Int.
Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1996, pp. 372–373.
[7] K. Inoue, H. Nakamura, and H. Kawai, “A 10-Mb frame buffer memory
with Z-compare and A-blend units,” IEEE J. Solid-State Circuits, vol.
30, pp. 1563–1568, Dec. 1995.
[8] S. Miyano, K. Numata, K. Sato, T. Yabe, M. Wada, R. Haga, M.
Enkaku, M. Shiochi, Y. Kawashima, M. Iwase, M. Ohgata, J. Kumagai,
T. Yoshida, M. Sakurai, S. Kaki, N. Yanagiya, H. Shinya, T. Furuyama,
P. Hansen, M. Hannah, M. Nagy, A. Nagarajan, and M. Rungsea,
“A 1.6-Gbyte/s data transfer rate 8-Mb embedded DRAM,” IEEE J.
Solid-State Circuits, vol. 30, pp. 1281–1285, Nov. 1995.
[9] T. Shimizu, J. Korematu, M. Satou, H. Kondo, S. Iwata, K. Sawai, N.
Okumura, K. Ishimi, Y. Nakamoto, M. Kumanoya, K. Dosaka, A. Yamazaki, Y. Ajioka, H. Tsubota, Y. Nunomura, T. Urabe, J. Hinata, and K.
Saitoh, “A multimedia 32-b RISC microprocessor with 16-Mb DRAM,”
in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1996, pp.
216–217.
Download