Heterogeneous architecture models for

advertisement
660
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000
Heterogeneous Architecture Models for
Interconnect-Motivated System Design
Sek Meng Chai, Member, IEEE, Tarek M. Taha, D. Scott Wills, Senior Member, IEEE, and
James D. Meindl, Life Fellow, IEEE
Abstract—On-chip interconnect demand is becoming the
dominant factor in modern processor performance and must be
estimated early in the design process. This paper presents a set
of heterogeneous architectural models that combines architecture
description and Rent’s Rule-based wiring models. These architecture models allow flexible heterogeneous system specifications,
enabling investigations of prospective designs in different technology scenarios. Comparisons against actual data demonstrate
the models’ effectiveness for architecture explorations with highly
accurate estimations of local and global wiring demand, as well
as chip area and cycle time. Simulation of two candidate system
designs reveal trends in interconnect delay with increasing architectural complexity, and confirm the need for high computational
locality and short global wires for future architectures.
Index Terms—Architecture modeling, interconnect prediction,
VLSI modeling, wire-demand prediction.
I. INTRODUCTION
T
HE study of interconnect modeling, scaling, and prediction has gained new attention in anticipation of gigascale
integration (GSI) operating at multi-gigahertz clock frequencies. As feature size decreases, interconnect channels for signaling, clock, and power distribution impose a greater cost in
chip area, signal latency, and power dissipation, than transistors. Wire characteristics do not scale as well as transistors [1],
increasing the impact of interconnect delay on cycle time, and
reducing the chip area reachable in a single (shorter) clock [2].
While many studies have focused on better material and fabrication processes to improve wire performance, these techniques
only slow the progress toward the fundamental electromagnetic
limit of wires [3]; they do not reduce wire demand. A broader
understanding of processor architectures and their use of interconnects is needed to find system solutions that scale well with
poor wires.
Estimates of local and global wire demand are needed to predict key performance parameters such as chip area, cycle time,
and power dissipation. The wire demand models presented in
this paper are more accurate than current ideal interconnect
Manuscript received February 15, 2000; revised February 24, 2000.
This work was supported by Microelectronics Advanced Research Corporation/Interconnect Focus Center, Defense Advanced Research Project
Agency/Advanced Microelectronics Program, and the Semiconductor Research
Corporation.
The authors are with the Electrical and Computer Engineering,
Georgia Institute of Technology, Atlanta, GA 30332-0250 USA (e-mail:
scott.wills@ece.gatech.edu).
Publisher Item Identifier S 1063-8210(00)09730-4.
scaling models, improving the prediction accuracy of future microprocessor performance. These models facilitate the development of architectures that reduce the demand for long-distance,
low-latency interconnect.
This paper introduces a set of heterogeneous architecture
models designed to combine architecture description with
interconnect parameters. The architecture models provide
important parameters to Rent’s Rule wiring models, such as
number of logic gates, Rent’s exponent and coefficient, number
of signal terminals, and fan-out distribution, for various functional units in a modern processor. Empirical descriptions from
cell structure, placement, and routing, are extracted to provide
realistic wire demands [4]. These architecture models correlate
estimated wire demands with the appropriate architecture
rather than a simple random logic block. The architecture
models build upon the Generic System Simulator (GENESYS),
a hierarchical tool for exploring future processor architecture
and technology [5], [6].
Verification of the models process using extracted data from
chip layout shows highly accurate wire-demand estimations
for local wires, thereby enabling realistic area and cycle time
predictions. In addition, accurate global wire demand estimation for intra-unit interconnects support better predictions for
cross-chip clock frequencies and total chip area. Parameter
correlation among different design styles and implementation
technologies suggests these models can provide generic predictions of future architectures and technologies as approaching
limits are explored.
Two candidate architectures, a superscalar microprocessor
and a parallel Single Instruction Multiple Data (SIMD) array,
are modeled and simulated to explore the relationship between
architecture and wire demands. Results suggest that cross-chip
propagation delays dominate over gate delays in current
superscalar processor, requiring architectural techniques to
alleviate poor wire performance. The parallel SIMD array
increases computational locality, reducing global wire demand
and bounding global wire lengths. Processing parallelism and
increasing clock frequencies, due to fewer long wires, offers
arrays performance from 0.3 to 1.5 Tops/s using projected 2012
technology for applications with ample data parallelism.
The organization of this paper is as follows. Section II
presents a summary of related work. Section III describes
GENESYS and the modeling hierarchy. Section IV introduces
the architecture model and compares simulation data against
actual processor layout as verification of the models. Section V
presents an analysis of superscalar processor designs and
1063–8210/00$10.00 © 2000 IEEE
CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS
parallel arrays. Section VI concludes with directions for future
research.
II. RELATED WORK
Many wire length distribution models are based on Rent’s
Rule [7]–[10], an empirical relationship between the number of
signal connections between elements of system and the complexity of each element. The most familiar Rent’s relationship is
between the number of signal terminals of an element and the
number of logic gates contained in the element. This correlawhere the parameters and are
tion is given as
empirical values [10]. A similar power-law relationship exists
for the external I/O of gate array chips, random logic circuits,
memory, and microprocessors [7]. A second region, characterized by a large number of signal terminals, is not well captured
by Rent’s Rule.
By separating logic circuits into hierarchical divisions, average wire length can be estimated via Rent’s Rule by calculating the number of connections and distance between partitions [9]. An accurate wire length distribution has been derived for a uniform homogeneous logic block [11]. For a heterogeneous system, global wire length can be derived from stochastic models for net-list, placement, and routing information
[12]–[14]. More recent research studies on wiring estimation include [15]–[17].
Recent architectural research has begun to explore architecture-technology relationships. Processor complexity and cycle
time relationships are reported in [18] and [19]. Reference [20]
considers single-chip multiprocessors in 0.25 m technology,
while [21] proposes optimal processor configurations in 0.35
m technology. This paper extends the collection of research by
investigating wire demands and technology capabilities using
the proposed architecture models and GENESYS.
GENESYS extends the capabilities of previous system-level
performance simulators [7], [22], [23], allowing a more thorough exploration of the entire hierarchy of physical and practical
limits. While other simulators focus exclusively on interconnect
technology [24], GENESYS describes the complete chip design for a more realistic investigation. Furthermore, GENESYS
incorporates superior Rent’s Rule wiring models to accurately
predict wire demands of various computer architecture designs.
Other simulators [25] that consider only random logic blocks
with the theoretical upper bound on wiring [9] may not produce
accurate results for different system designs.
III. GENESYS ORGANIZATION AND MODELS
GENESYS embodies five model levels to represent the key
limits of the following hierarchy: 1) fundamental; 2) material;
3) device; 4) circuit; and 5) system. The first three model levels
capture physical effects of material properties and switching
device behaviors. Technology parameters such as interconnect
metal, dielectric materials, current drain, and submicron effects,
serve as inputs to the simulator. Circuit models incorporate gate
level characteristics such as signal propagation delay and power
dissipation. The system module consists of architecture, interconnection, and packaging models to complete the description
661
of a single ASIC chip. More information on the GENESYS organization is available in [5] and [6]. As this paper is focused on
interconnect issue, the remainder of this section will summarize
the interconnection models that describes the complete on-chip
wiring demand.
A Rent’s Rule-based stochastic wirelength distribution model
[11] is used to estimate interconnection requirements for a homogeneous logic circuit. This wirelength distribution model is
used to estimate wire demand in the so-called first region in the
Rent’s Rule diagram. Since signal propagation delay at the gate
level is composed of device switching delay and interconnect
delay (RC delay, time-of-flight, and rise-time delay of input signals), the estimate wirelengths accurately represent physical circuit characteristics and enable optimal partitioning of multilevel
interconnect levels. Effects from interlevel blockage, routing efficiency, wiring structure, peak crosstalk noise, and clock skew
are included in the interconnection models. Long-distance interconnect driving schemes [7] and clock distribution networks
are included to determine cycle time, chip area, and power consumption.
Global wire lengths for a heterogeneous system can be estimated from stochastic net-list distributions, random placement
of terminals, and routing information [14]. The net-list information is estimated using the Heterogeneous Rent’s Rule [4]
to define the connectivity between cells. Heterogeneous Rent’s
Rule is a newly derived relationship that extends Rent’s Rule
by establishing an equivempirical correlation
, and for heterogeneous systems. This heteroalent
geneous relationship is useful to describe “system on a chip”
(SOC) architectures where logic blocks are not uniform. Placement information sets the bounding area (chip area under which
the wiring model considers connectivity between signal terminals) for estimating global wire routing among heterogeneous
blocks. This wire distribution model is used to estimate wire
demand in the so-called second region of the Rent’s Rule diagram, where the traditional empirical correlation is less accurate
[7], [10]. GENESYS incorporates both wiring models (homogeneous and heterogeneous) to accurately predict wire demands of
different system architectures.
IV. HETEROGENEOUS ARCHITECTURE MODELS
As described earlier, the heterogeneous architecture models
enable a futuristic probe to uncover limiting factors in a system
design. This section describes the modeling efforts to connect
architecture description with wiring demands. Comparisons
with actual data are also provided to demonstrates model
accuracy confirms that important architectural characteristics
are captured. Local and global wire demands are predicted to
forecast key performance metrics such as chip area and cycle
time.
The architecture models consider a chip design as a collection
of cells, as shown in Fig. 1. For a processor array, the cells consist of single processors joined together with an interconnection
network. A single processor consists of a collection of random
logic cells (functional units) or local memory (cache). An internal data bus and control-signal wires connect the functional
units and cache together in the single processor model.
662
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000
Fig. 1. Architecture models incorporate a hierarchical collection of cells. For
a processor array, parallel processors are tiled with an interconnection network.
For a single processor, random logic or cache cells are connected with a data
bus and global control signals.
The number of logic gates
, Rent’s exponent
and co, number of signal pins
, and number of gates
efficient
in the critical path (Ncp) are modeled for different functional
units in a modern processor. The fan-out distributions (FO), describing the data bus or interconnection network between cells,
are also provided. These values are used by GENESYS wiring
models, and they are extracted from actual designs or modeled
as functions of architectural parameters shown in Table I. These
parameters include generic architectural information such as
data path width (Ws), register file access ports (Rp and Wp), and
bits generated by a decoder to represent instructions (Nb).
Equivalent Rent’s parameters are calculated for the single
processor using the Heterogeneous Rent’s Rule [4] for
GENESYS to partition wire demands into multi-tier interconnection layers and to find the longest global wire. Clock
frequency and chip area for a single processor are then estimated following wire partitioning. For a processor array, wiring
demand for the communication network is also included.
The heterogeneous architecture models extend GENESYS’s
single homogeneous cell model with a generic specification for
a heterogeneous system-on-a-chip, providing better architectural detail and more accurate prediction of performance. In addition, the architecture models allow designers to vary contents
and connectivity of cells for more flexibility and accuracy in
system description.
In summary, architectural information is used to estimate
and Ncp. These estimated values ( and ) are then used
to predict wiring requirements. The predicted wiring-demand
sets the average load for a random logic chain. Together with
Ncp, cycle time for the architecture is then predicted. This
process allows a designer to investigate architectural issues
prior to actual implementation. The use of architecture and
wiring models represents the unique contribution presented in
this paper.
A. Model Derivations
Elements in the heterogeneous architecture models are extracted from gate level schematics [26]–[28] and chip layout
[29] to capture details on cell organization, circuit placement,
and internal routing within a functional unit. The number of
and the number of pins
are tabulated in
logic gates
closed-form equations (Table II), as functions of architectural
parameters previously shown in Table I. The number of gates in
the critical path (Ncp) is also determined from the gate level
schematics to find functional unit latency. Rent’s coefficient
is calculated as the average I/O pins per gate in the gate
is extracted in a process
level schematic. Rent’s exponent
similar to [9], by partitioning the logic circuits into hierarchical
divisions and finding the value that matches the power-law relationship between pins and gates. These models are reported
in Table II for a set of functional units in modern superscalar
processors and parallel SIMD processors [26], [29]. To reduce
gate delays, execution units are pipelined [26], [28] by adding
the number of gates in pipeline latches to total gate count
and by reducing Ncp accordingly. While other functional units
exist, each with different design styles and structures, this modeling approach captures the important architectural characteristics and enables detailed system description with the selection
of functional units. Details on equation derivations are available
in [30].
Fig. 2 offers an example derivation for the register file equations. The internal organization of a register file unit is first characterized. A register file has a decoder block that decodes the
addresses and selects a word in the register. The register array
block contains the actual storage element including circuitry to
select the read/write ports (R/W). An output buffer block is included to drive the register file into the data bus. The architecture
model for the register file considers the number of line drivers
and select units based on address width (Ws). The model also
determines the number of cells in the register array and buffer
block based on data width size (Ws) and number of registers
and Ncp for the register file, shown in
(Nr). Equations for
Table II, are then determined.
B. Global Wire Demand Models
The connectivity between the functional units in Table II determines the global wire demand for the system. Fig. 3 illustrates organization of the architecture models in terms of wire
demand. Each block in the figure represents a separate set of
equations for a specific unit. Functional unit models exist at
the local wire demand level because their wire demands are localized within the unit. At the intermediate wire demand level,
two separate models are used to characterize different system
architectures. The Single Node block characterizes the connectivity of functional units in a SIMD parallel array. The Bypass
Path block characterizes the connectivity of functional units in a
pipeline structure of superscalar processors. At the global wire
demand level, the Tiled Array block characterizes the connectivity of Single Node units with an interconnection network for
a parallel array, while the Superscalar Processor block predicts
wire demands between caches and the random logic block in a
uniprocessor architecture. All specifications are collected in the
System Model. Along with architectural parameters provided by
the designer, system and interconnection parameters are provided to the GENESYS to project chip area, power, and total
wire demand. These hierarchical levels allow a generic heterogeneous organization to describe different system architectures.
Global wire demands are modeled in terms of wire fan-out.
For example, the data bus and control signal wires in Single
Node of a SIMD parallel array [29] can be described as follows:
CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS
663
TABLE I
PARAMETERS FOR HETEROGENEOUS ARCHITECTURE MODELS
TABLE II
HETEROGENEOUS ARCHITECTURE MODELS OF SINGLE PROCESSOR FUNCTIONAL UNITS
Fig. 2. Characterizing a functional unit such as the register file requires an
analysis of the internal structure to obtain Rent’s parameters, T ; N; and Ncp.
FO
Cw Nb . Since control signal connectivity is
point-to-point between the instruction decoder and functional
units, only fan-out distribution of two is tabulated. The data bus
connects all functional units (excluding the instruction decoder)
Ws.
with the register file and is modeled as follows: FO
The number of wires for the data bus is three times the data
path width because there are two source channels and one result
channel in the processor implementation. Placement and routing
information for the SIMD processor are utilized in GENESYS
to convert the net-list distribution to wire lengths.
Wiring demand for the communication network in a
processor array is also included. For a mesh network, inter-processor signals exist only between near-neighbors and is
Cw Nb . This demand is scaled to
modeled as FO
the size of a single processor and then added to the internal processor wiring to obtain the total on-chip interconnect demand.
664
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000
Fig. 3. Architecture models are organized in a hierarchical structure to describe heterogeneous system architectures. Different wiring demand models describe
the connectivity of units in the lower level.
Different interconnection networks and internal processor
wiring can be modeled using different net-list distributions.
Further details on the architecture and wire demand models
are available in [30].
C. Validation
The rest of this section describes the validation process for
the heterogeneous architecture models. Comparison against
actual data [29] shows the high prediction accuracy using
GENESYS wiring demand models indicating important architectural characteristics are being captured. Fig. 4 shows the
area and gate-delay comparisons for various functional units.
Actual chip areas are extracted from layout, while gate delays
are simulated using a switch level simulator with RC loads.
Functional units are modeled in GENESYS as homogeneous
units to find silicon area and gate delay without global wiring
effects. Estimated average gate-delay is multiplied with the
number of gates in the critical path (Ncp) to obtain functional
unit latency. Minor functional units such as the activity/sleep
unit, communication unit, serial shifter, and instruction decoder,
are modeled with a gate-pitch proportional to the barrel shifter
because these units are pitch-matched to the data bus width and
are less compact in the chip layout.
Fig. 5 presents the wire demand comparisons for internal and
global wiring. To find actual wiring demand, node capacitance is
extracted from chip layout after removing nonwiring layers such
as diffusions and wells. An average unit-capacitance per length
is then determined to calculate total wiring length. Each functional unit was extracted separately to find individual wiring demand. Global wiring demand is found from extracting the chip
layout with functional units removed. Internal wiring demand
predicted by GENESYS models correlates well with extracted
data.
The architecture models are verified against other design
styles and structures. Fig. 6 shows a comparison of register
file area for several designs available in the literature [29],
[31]–[33]. In addition to the different implementation technologies, the register files have varying architectural structure
(physical registers, word size, and read/write ports). Predicted
silicon areas compare well to actual data, with an average
deviation of only 17%. For the Hwang register file, a memory
is used because the design
cell Rent’s exponent
style is similar to a memory cell. For each register file simulation, the proper fabrication technology parameters, such
as interconnect geometry and device parameters, are used to
accurately estimate chip area. These semiconductor fabrication
data are not used to curve-fit the results to match Rent’s Rule,
but rather they are used to place the architecture models in the
proper technology to accurately predict wire demand, chip area,
and cycle time. These verification results in chip area, wiring
demands, and latency for different functional units, varying
in technologies and design styles, show the effectiveness of
the architecture and wiring models as a predictive tool for
architecture explorations.
V. SYSTEM DESIGN ANALYSIS
With the architecture models derived and validated in the
previous section, they can be used to investigate different
computer architectures performance in future GSI technologies. This section describes the exploration of two candidate
architectures using the architecture models in GENESYS.
Superscalar processor designs are investigated to find processor
complexity versus cycle time relationships. In addition, parallel
SIMD processor arrays are modeled to find optimum data path
width and system size in future technologies.
A. A Superscalar Processor Design
Current industry trend has been to design microprocessors
with multiple instruction issue width to extract instruction-level-
CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS
Fig. 4.
665
Functional unit area and latency comparisons against extracted data.
parallelism. These designs are pipelined, employing hardware
bypassing to eliminate data hazards, to avoid interlock stalls
from dependent instructions, and to provide results to execution units in the following cycle [26]. While logically simple,
hardware bypassing can be costly with deeply pipelined, wide
instruction issue processors [34]. Bypass delays have been estimated to dominate over other pipeline stages by 0.18 m technologies [18].
The data bypass hardware is among the most wire demanding
IW
bypass paths [34], where IW is the
units, requiring
instruction issue width, and is the number of pipeline stages
after the first result-producing stage. The formula assumes a
2-input functional unit, and is scaled with data path width (Ws)
to obtain total wire demand. Because the number of bypass paths
grows quadratically with issue width (IW), a study is required
to uncover limits in processor with wide degree of instruction
issue.
Architecture and wiring models are used in GENESYS to
estimate clock frequencies for processors with varying issue
width. Wire demand for hardware bypass is modeled as bus
channels, connecting the functional units shown in Table III.
The number of functional units is increased according to issue
width, following existing simulation parameters in [18]. The
Load/Store unit consists of an adder for computing memory
addresses, while other functional units have been described
previously. Register file parameters are extracted from previous
666
Fig. 5.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000
Wire demand comparisons against extracted data from chip layout.
Fig. 6. Area comparisons against actual data [29], [31]–[33]. Register files are chosen for their different structures and implementation technologies to show the
effectiveness of the architecture models.
studies in optimal configurations [35]. Technology parameters are obtained or extracted from the National Technology
Roadmap for Semiconductors (NTRS) [36].
Fig. 7 presents simulation results for various superscalar processor designs. The number of critical path gates (Ncp) is extrapolated from [37] to represent pipeline delay from logic circuits. The family of solid-line curves represents local clock frequency for processors with different issue widths, while the
dashed line represents cycle time for across-chip global wires.
A processor design with a single synchronous clock must operate at a frequency that is equal to the smaller of the two clock
frequencies (local and across-chip). In Fig. 7, these two values
are shown separately to illustrate the divergence of these two
operating frequencies. The simulated values were estimated by
GENESYS with optimized partitioning of the multi-tier interconnection network.
NTRS projected on-chip local and across-chip clock frequencies form two operating regions: Region-I represents a
region of operation below the NTRS projected across-chip
clock frequencies; Region-II represents a region of operation
above across-chip clock frequencies but below local clock
frequencies. Processor designs with issue widths smaller
CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS
667
TABLE III
SIMULATION PARAMETERS FOR WIDE ISSUE SUPERSCALAR PROCESSORS
Fig. 7. Clock frequency projections for superscalar processor designs with varying instruction window (IW) sizes. NTRS projected on-chip local and across-chip
clock frequencies form the two operating regions.
than four are able to track projected local clock frequencies,
operating in Region II by the year 2003. Wider issue processors
are more complex, requiring larger chip area and longer
wires. Simulation results show that by 2006, across-chip clock
frequencies (dashed line) may not be able to meet NTRS
projections, as wire lengths and demands for large, dense
chips exceed any performance gains from wire technology.
In 250 nm technology (1997), a four-issue processor design
is not yet limited by across-chip clock frequencies. However,
across-chip delays begin to dominate local gate delays by
today’s technologies (180 nm, 1999) and by 2006, global wire
performance can no longer track a four-issue processor. By the
year 2012, synchronous processor designs with issue widths of
four and six will be dominated by long global wire delays.
These results clearly illustrate the need for architectural techniques to reduce poor wire performance. In order to maintain aggressive operating clock frequencies, future processor designs
must be locally synchronized with multi-cycle access to regions
across the chip. Clustered organizations with smaller computation regions, similar to schemes used in the Alpha 21 264 [38],
are required to maintain short wire lengths. Other techniques
such as incomplete bypass structures [34] and single-chip mul-
tiprocessor [20] alleviate the need for long-distance, fast global
wires.
B. A Parallel Array Design
This section describes an architectural exploration of a SIMD
parallel array to find optimal data path width (Ws) in different
technologies. Larger data path width (Ws) improves computation precision but requires larger chip area and reduces the
number of processing elements (PEs) that fits on a single-chip.
SIMD arrays are well suited for multimedia applications with
large amounts of explicit parallelism because the same operation is performed over the entire data set [39]. Programming
is thereby simplified and the array becomes easily scalable for
large data sets. As future processor workloads become increasingly dominated by multimedia applications, understanding performance characteristics of parallel SIMD arrays becomes vital
for successful deployment of such systems.
The SIMD architecture consists of a tiled array of PEs with
a mesh interconnection network, following the design in [29].
Each PE is composed of the functional units listed in Table II. In
order to determine the effects of long wires, the functional units
are pipelined to shift from performance reliance on gate delay to
668
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000
Fig. 8.
Projected wire and gate delays for a pipelined SIMD processor with varying data width size (8–128 bits).
Fig. 9.
Optimal data width size determined by the threshold between wire and gate delay.
inter-processor wire delay. Wiring demand has been described
previously in Section IV-A. Cycle times with varying data path
widths are estimated using architecture and wiring models in
GENESYS.
Fig. 8 presents simulation results for the parallel SIMD array.
The number of gates in the critical path is set to eight to account
for unit selection, execution, and latch delays. The family of
solid-line curves represents the estimated gate latencies for different data path widths, while dashed lines represent inter-processor wire delays. The results clearly show the large difference between transistor and wire improvements for the projected technologies. By 2012, gate delay reductions are two to
three times more than long wire performance improvements for
128-bit SIMD processors. Because denser transistors allow for
smaller, closer units, total gate delays can reduce even for the
same data path width and critical path gates. Wire performance
for inter-processor communication, however, does not scale as
well.
Fig. 9 shows the optimal data width size for the parallel SIMD
array. The optimal value is determined as the data width size
where total gate delay of a PE is equal to the inter-processor
wire delay. Since SIMD arrays have a synchronous global clock,
designs with data widths larger than the optimal value will have
cycle time dominated by long wire delay. A peak value of 55 is
found as wire performance after 2003 reduces optimal data path
width.
While general applications often require precision of the
order of 53 bits [26], multimedia applications often only use 8
CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS
or 16 bits of precision [39]. This fact, along with simulation
results, motivates two design configurations for parallel SIMD
arrays: 16 bits for multimedia processing, and 64 bits for
high-performance scientific computation. From GENESYS
predicted area for a 16-bit PE, a tiled array of 14 600 SIMD
PEs fits in a single chip by 2012. With a NTRS projected power
budget of 3.2 W for portable devices, a 16-bit multimedia
SIMD array provides 372 Gops/s at 50 MHz clock frequency.
Today’s DSP performance is in the range of 0.5–1 Gops/s.
For high-performance scientific applications, 2500 tiled 64-bit
SIMD processors offer 1.5 Tops/s at 1.2 GHz clock frequency.
A realistic multimedia application suite is used to determine
the fraction of vectorized instructions and array utilization [40].
Clock frequency is set to operate below power density limits
in 2012 [29]. The number of PEs is the maximum allowed as
pad areas and other interface units are not considered in the
architecture model.
High computational performance of a SIMD array comes
from the large number of PEs, shifting the reliance from clock
frequency to explicit data parallelism. Silicon area taken by
complex bypass structures, branch prediction logic, and large
on-chip caches is better utilized by increasing the number of
PEs per chip, creating a dense computational system. With
small local memory and specialized data paths for multimedia
applications, each PE forms a localized computational region
that limits the need for fast, long global wires.
While single-chip multiprocessors, such as a parallel SIMD
array, offer a clear advantage to better scale with poor wire-technology, they are traditionally delegated to applications that are
relatively easy to parallelize. These applications include many
image and signal processing algorithms where there are abundant amounts of data parallelism. A superscalar uniprocessor,
on the other hand, are suitable for a large set of applications
due to the ease in programming. Research in parallelizing compilers must be extended to bring hard to parallelize problems to
single-chip multiprocessors.
VI. CONCLUSION
A set of heterogeneous architecture models is presented
for use with Rent’s Rule wiring models. The architecture
models enable realistic estimations of interconnect demand
by providing a flexible specification for a heterogeneous
system-on-a-chip. Comparisons against actual data show
the high accuracy from using the models for architectural
explorations.
System performance metrics for superscalar processors and
parallel SIMD arrays are estimated using GENESYS to find architecture-technology relationships. Results from both systems
show the need for interconnect-motivated system designs that
scale well with poor wire performance. Insights for architectural techniques to maintain computational locality and to limit
the need for fast, long-distance communications are provided.
ACKNOWLEDGMENT
The authors extend thanks to the GSI and PICA research
groups for their research contributions, especially to Dr. J. C.
669
Eble, III, for his earlier work on GENESYS, Dr. J. Davis for
his homogeneous wirelength models, and P. Zarkesh-Ha for his
research in heterogeneous wiring models. The authors further
extend acknowledgment to the anonymous reviewers who have
improved the presentation of this paper.
REFERENCES
[1] M. Bohr, “Interconnect scaling—The real limiter to high performance
ULSI,” in Proc. IEDM. New York, 1995, pp. 241–244.
[2] D. Matzke, “Will physical scalability sabotage performance gains?,”
IEEE Computer, vol. 30, pp. 37–39, Sept. 1997.
[3] J. D. Meindl, “Low power microelectronics: Retrospect and prospect,”
Proc. IEEE, vol. 83, no. 4, pp. 619–635, 1995.
[4] P. Zarkesh-Ha, J. A. Davis, W. Loh, and J. D. Meindl, “On a pin versus
gate relationship for heterogeneous systems: Heterogeneous Rent’s
Rule,” in Proc. IEEE Custom Integrated Circuits Conf., Santa Clara,
CA, May 1998, pp. 93–96.
[5] J. C. Eble, V. K. De, D. S. Wills, and J. D. Meindl, “A generic system
simulator (GENESYS) for ASIC technology and architecture beyond
2001,” in Proc. 9th Annu. IEEE Int. ASIC Conf., Rochester, NY, Sept.
1996, pp. 193–196.
[6] J. C. Eble III, “A Generic System Simulator with Novel On-Chip Cache
and Throughput Models for Gigascale Integration,” Ph.D. dissertation,
Georgia Institute of Technology, Atlanta, 1998.
[7] H. B. Bakoglu, Circuits, Interconnections, and Packaging for
VLSI. Reading, MA: Addison-Wesley, 1990.
[8] P. Christie, “A fractal analysis of interconnection complexity,” Proc.
IEEE, vol. 81, pp. 1492–1499, Oct. 1993.
[9] W. E. Donath, “Placement and average interconnect lengths of computer
logic,” IEEE Trans. Circuits Syst., vol. CAS-26, pp. 272–277, Apr. 1979.
[10] B. S. Landman and R. L. Russo, “On a pin versus block relationship
for partitions of logic paths,” IEEE Trans. Comput., vol. C-20, pp.
1469–1479, Dec. 1971.
[11] J. A. Davis, V. K. De, and J. D. Meindl, “A stochastic wire-length distribution for gigascale integration (GSI): Part I: Derivation and validation,”
IEEE Trans. Electron Devices, vol. 45, pp. 580–589, Mar. 1998.
[12] R. M. Karp, F. T. Leighton, R. L. Rivest, C. D. Thompson, U. Vazirani,
and V. Vazirani, “Global wire routing in two-dimensional arrays,” in
24th Annu. Symp. Foundations of Computer Science, Tucson, AZ, Nov.
1983, pp. 453–459.
[13] G. B. Sorkin, “Asymptotically global routing: A stochastic analysis,”
IEEE Trans. Computer-Aided Design , vol. CAD-6, pp. 820–827, Sept.
1987.
[14] P. Zarkesh-Ha and J. D. Meindl, “Stochastic net length distribution for
global interconnects in a heterogeneous system-on-a-chip,” in IEEE
Symp. VLSI Technology Digest of Technical Papers, Honolulu, HI, June
1998, pp. 44–45.
[15] A. Rahman, A. Fan, and R. Reif, “Wire-length distribution of three-dimensional integrated circuits,” presented at the Workshop on SystemLevel Interconnect Prediction SLIP99, Monterey, CA, Apr. 10–11, 1999.
[16] A. Hess and G. Zimmermann, “Interconnection length estimation during
hierarchical VLSI design,” presented at the Workshop on System-Level
Interconnect Prediction SLIP99, Monterey, CA, Apr. 10–11, 1999.
[17] J. D. Cho, “Wiring space estimation in two-dimensional arrays,”
presented at the Workshop on System-Level Interconnect Prediction
SLIP99, Monterey, CA, Apr. 10–11, 1999.
[18] S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-effective superscalar processors,” in Computer Architecture News, 24th Annu. Int.
Symp. Computer Architecture, vol. 25, May 1997, pp. 206–218.
[19] T. Hara, H. Ando, C. Nakanishi, and M. Nakaya, “Performance comparison of ILP machines with cycle time evaluations,” in Computer Architecture News, 23rd Annu. Int. Conf. Computer Architecture, vol. 24,
1996, pp. 213–224.
[20] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chung,
“The case for a single-chip multiprocessor,” in SIGPLAN Notices, 7th
Int. Conf. Architectural Support for Programming Languages and Operating Systems, vol. 31, Cambridge, MA, Oct. 1996, pp. 2–11.
[21] S. Fu and M. Flynn, “Performance and Area Analysis of Processor
Configurations with Scaling of Technology,”, Stanford Tech. Rep.
CSL-TR-94-605, Mar. 94.
[22] A. Masaki, “Possibilities of deep-submicrometer CMOS for very high
speed computer logic,” Proc. IEEE, vol. 81, pp. 1311–1324, Sept. 1993.
[23] G. A. Sai-Halasz, “Performance trends in high-end processors,” Proc.
IEEE, vol. 83, pp. 20–36, Jan. 1995.
670
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000
[24] R. Mangaser and K. Rose, “Estimating interconnect performance for a
new national technology roadmap for semiconductors,” in Proc. IEEE
Int. Interconnect Technology Conf., San Francisco, CA, June 1998, pp.
253–255.
[25] D. Sylvester and K. Keutzer, “Getting to the bottom of deep submicron,”
Proc. Int. Conf. Computer-Aided Design, pp. 203–211, Nov. 1998.
[26] J. L. Hennesy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.
[27] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI System
Design: A System Perspective. Reading, MA: Addison-Wesley, 1993.
[28] I. H. Unwala and E. E. Swartzlander Jr., “Superpipelined adder designs,”
IEEE Int. Symp. Circuits and Systems, vol. 3, pp. 1841–1844, May 1993.
[29] S. M. Chai, A. Gentile, and D. S. Wills, “Impact of power density limitation in gigascale integration for the SIMD pixel processor,” in 20th Anniversary Conf. Advanced Research in VLSI. Atlanta, GA, Mar. 21–24,
1999, pp. 57–71.
[30] S. M. Chai, “Real Time Image Processing on Parallel Arrays for Gigascale Integration,” Ph.D. dissertation, Georgia Institute of Technology,
Atlanta, GA, 1999.
[31] R. L. Franch, J. Ji, and C. L. Chen, “A 640-ps, 0.25-m CMOS, 16
64-b three-port register file,” IEEE J. Solid-State Circuits, vol. 32, pp.
1288–1292, Aug. 1997.
[32] W. Hwang, R. V. Joshi, and W. H. Henkels, “A 500 MHz, 32-word
64-bit, eight-port self-resetting CMOS register file,” IEEE J. Solid-State
Circuits, vol. 34, pp. 56–67, Jan. 1999.
[33] J. M. Mulder, N. T. Quach, and M. J. Flynn, “An Area-Utility Model
for On-Chip Memories and Its Applications,”, Stanford Tech. Rep.
CSL-TR-90-413, Feb. 1990.
[34] P. Ahuja, D. Clark, and A. Rogers, “Performance impact of incomplete
bypassing in processor pipeline,” in IEEE Proc. 28th Annu. Int. Symp.
Microarchitecture, Ann Arbor, MI, Nov. 1995, pp. 36–45.
[35] K. I. Farkas, N. P. Jouppi, and P. Chow, “Register file design considerations in dynamically scheduled processors,” in IEEE Symp. High-Performance Computer Architecture, San Jose, CA, Feb. 1996, pp. 40–51.
[36] The National Technology Roadmap for Semiconductors. San Jose,
CA: Semiconductor Industry Association, SEMATECH, 1997.
[37] W. J. Dally and S. Lacy, “VLSI architecture: Past, present, and future,” in
20th Anniversary Conf. Advanced Research in VLSI. Atlanta, GA, Mar.
21–24, 1999, pp. 232–241.
[38] R. E. Kessler, “The alpha 21 264 microprocessor,” IEEE Micro, vol. 19,
pp. 24–36, Mar. 1999.
[39] K. Diefendorff and R. Dubey, “How multimedia workloads will change
processor design,” IEEE Computer, vol. 30, pp. 43–45, Sept. 1997.
[40] A. Gentile, J. L. Cruz-Rivera, D. S. Wills, L. Bustelo, J. J. Figueroa, J. E.
Fonseca-Camacho, W. E. Lugo-Beauchamp, R. Olivieri, M. QuiñonesCerpa, A. H. Rivera-Ríos, I. Vargas-Gonzáles, and M. Viera-Vera et
al., “Real-time image processing on a focal plane SIMD array,” in Parallel and Distributed Processing—Lecture Notes in Computer Science,
J. Rolim et al., Eds. New York: Springer-Verlag, 1999, vol. 1586, pp.
400–405.
2
2
Sek Meng Chai (M’99) received the B.S. degree
in electrical engineering in 1994, the M.S. degree in
electrical engineering in 1996, and the Ph.D. degree
in electrical and computer engineering in 1999 from
Georgia Institute of Technology, Atlanta.
Currently, he is a Staff Electrical Engineer at
the DigitalDNA System Architecture Laboratory at
Motorola, Schaumburg, IL. His research interests
include parallel architecture, VLSI, billion transistor
architectures, and high-performance interconnection
networks. His doctoral research addresses high
throughput, efficient architectures such as systolic arrays.
Dr. Chai is a Member of ACM.
Tarek M. Taha received the B.E.E. degree in
electrical engineering from Georgia Institute of
Technology, Atlanta, and the B.A. degree from
DePauw University, Greencastle, IN, in 1996. He is
currently pursuing the Ph.D. degree in the Portable
Image Computation Architectures (PICA) Group
at Georgia Institute of Technology. His research
interests include high-performance architectures,
multicomputer interconnection networks, and VLSI.
D. Scott Wills (S’80–M’83–SM’97) received the
B.S. degree in physics from Georgia Institute of
Technology, Atlanta, in 1983, and the S.M., E.E.,
and Sc.D. degree in electrical engineering and
computer science from Massachusetts Institute of
Technology (MIT), Cambridge, in 1985, 1987, and
1990, respectively.
He is currently an Associate Professor of Electrical and Computer Engineering at Georgia Institute
of Technology. His research interests include short
wire VLSI architectures, high throughput portable
processing systems, architectural modeling for gigascale (GSI) technology,
and high-efficiency image processors.
Dr. Wills is a Senior Member of the Computer Society and an Associate Editor
of IEEE TRANSACTIONS ON COMPUTERS.
James D. Meindl (M’56–SM’66–F’68–LF’97) received the B.S., M.S., and Ph.D. degrees in electrical
engineering from Carnegie Institute of Technology
(Carnegie Mellon University), Pittsburgh, PA.
Currently, he is the Director of the Joseph
M. Pettit Microelectronics Research Center, and
has been the Joseph M. Pettit Chair Professor of
Microelectronics at Georgia Institute of Technology,
Atlanta, since 1993. He was Senior Vice President
for academic affairs and provost of Rensselaer
Polytechnic Institute, Troy, NY, from 1986 to 1993.
He was with Stanford University from 1967 to 1986 as the John M. Fluke
Professor of electrical engineering, Associate Dean for Research in the School
of Engineering, Director of the Center for Integrated Systems, Director of the
Electronics Laboratories, and Founding Director of the Integrated Circuits
Laboratory.
Dr. Meindl is a Fellow of the American Association for the Advancement
of Science, and a member of the American Academy of Arts and Sciences and
the National Academy of Engineering and its academic advisory board. He
received a Benjamin Garver Lamme Medal from ASEE, an IEEE Education
Medal, an IEEE Solid-State Circuits Medal, and an IEEE Beatrice K. Winner
Award. He has also been awarded the IEEE Electron Devices Society’s J.
J. Ebers Award, the Hamerschlag Distinguished Alumnus Award, Carnegie
Mellon University, the 1999 SIA University Research Award, and the IEEE
Third Millennium Medal.
Download