Heterogeneous architecture models for

660 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000 Heterogeneous Architecture Models for Interconnect-Motivated System Design Sek Meng Chai, Member, IEEE, Tarek M. Taha, D. Scott Wills, Senior Member, IEEE, and James D. Meindl, Life Fellow, IEEE Abstract—On-chip interconnect demand is becoming the dominant factor in modern processor performance and must be estimated early in the design process. This paper presents a set of heterogeneous architectural models that combines architecture description and Rent’s Rule-based wiring models. These architecture models allow flexible heterogeneous system specifications, enabling investigations of prospective designs in different technology scenarios. Comparisons against actual data demonstrate the models’ effectiveness for architecture explorations with highly accurate estimations of local and global wiring demand, as well as chip area and cycle time. Simulation of two candidate system designs reveal trends in interconnect delay with increasing architectural complexity, and confirm the need for high computational locality and short global wires for future architectures. Index Terms—Architecture modeling, interconnect prediction, VLSI modeling, wire-demand prediction. I. INTRODUCTION T HE study of interconnect modeling, scaling, and prediction has gained new attention in anticipation of gigascale integration (GSI) operating at multi-gigahertz clock frequencies. As feature size decreases, interconnect channels for signaling, clock, and power distribution impose a greater cost in chip area, signal latency, and power dissipation, than transistors. Wire characteristics do not scale as well as transistors [1], increasing the impact of interconnect delay on cycle time, and reducing the chip area reachable in a single (shorter) clock [2]. While many studies have focused on better material and fabrication processes to improve wire performance, these techniques only slow the progress toward the fundamental electromagnetic limit of wires [3]; they do not reduce wire demand. A broader understanding of processor architectures and their use of interconnects is needed to find system solutions that scale well with poor wires. Estimates of local and global wire demand are needed to predict key performance parameters such as chip area, cycle time, and power dissipation. The wire demand models presented in this paper are more accurate than current ideal interconnect Manuscript received February 15, 2000; revised February 24, 2000. This work was supported by Microelectronics Advanced Research Corporation/Interconnect Focus Center, Defense Advanced Research Project Agency/Advanced Microelectronics Program, and the Semiconductor Research Corporation. The authors are with the Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250 USA (e-mail: scott.wills@ece.gatech.edu). Publisher Item Identifier S 1063-8210(00)09730-4. scaling models, improving the prediction accuracy of future microprocessor performance. These models facilitate the development of architectures that reduce the demand for long-distance, low-latency interconnect. This paper introduces a set of heterogeneous architecture models designed to combine architecture description with interconnect parameters. The architecture models provide important parameters to Rent’s Rule wiring models, such as number of logic gates, Rent’s exponent and coefficient, number of signal terminals, and fan-out distribution, for various functional units in a modern processor. Empirical descriptions from cell structure, placement, and routing, are extracted to provide realistic wire demands [4]. These architecture models correlate estimated wire demands with the appropriate architecture rather than a simple random logic block. The architecture models build upon the Generic System Simulator (GENESYS), a hierarchical tool for exploring future processor architecture and technology [5], [6]. Verification of the models process using extracted data from chip layout shows highly accurate wire-demand estimations for local wires, thereby enabling realistic area and cycle time predictions. In addition, accurate global wire demand estimation for intra-unit interconnects support better predictions for cross-chip clock frequencies and total chip area. Parameter correlation among different design styles and implementation technologies suggests these models can provide generic predictions of future architectures and technologies as approaching limits are explored. Two candidate architectures, a superscalar microprocessor and a parallel Single Instruction Multiple Data (SIMD) array, are modeled and simulated to explore the relationship between architecture and wire demands. Results suggest that cross-chip propagation delays dominate over gate delays in current superscalar processor, requiring architectural techniques to alleviate poor wire performance. The parallel SIMD array increases computational locality, reducing global wire demand and bounding global wire lengths. Processing parallelism and increasing clock frequencies, due to fewer long wires, offers arrays performance from 0.3 to 1.5 Tops/s using projected 2012 technology for applications with ample data parallelism. The organization of this paper is as follows. Section II presents a summary of related work. Section III describes GENESYS and the modeling hierarchy. Section IV introduces the architecture model and compares simulation data against actual processor layout as verification of the models. Section V presents an analysis of superscalar processor designs and 1063–8210/00$10.00 © 2000 IEEE CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS parallel arrays. Section VI concludes with directions for future research. II. RELATED WORK Many wire length distribution models are based on Rent’s Rule [7]–[10], an empirical relationship between the number of signal connections between elements of system and the complexity of each element. The most familiar Rent’s relationship is between the number of signal terminals of an element and the number of logic gates contained in the element. This correlawhere the parameters and are tion is given as empirical values [10]. A similar power-law relationship exists for the external I/O of gate array chips, random logic circuits, memory, and microprocessors [7]. A second region, characterized by a large number of signal terminals, is not well captured by Rent’s Rule. By separating logic circuits into hierarchical divisions, average wire length can be estimated via Rent’s Rule by calculating the number of connections and distance between partitions [9]. An accurate wire length distribution has been derived for a uniform homogeneous logic block [11]. For a heterogeneous system, global wire length can be derived from stochastic models for net-list, placement, and routing information [12]–[14]. More recent research studies on wiring estimation include [15]–[17]. Recent architectural research has begun to explore architecture-technology relationships. Processor complexity and cycle time relationships are reported in [18] and [19]. Reference [20] considers single-chip multiprocessors in 0.25 m technology, while [21] proposes optimal processor configurations in 0.35 m technology. This paper extends the collection of research by investigating wire demands and technology capabilities using the proposed architecture models and GENESYS. GENESYS extends the capabilities of previous system-level performance simulators [7], [22], [23], allowing a more thorough exploration of the entire hierarchy of physical and practical limits. While other simulators focus exclusively on interconnect technology [24], GENESYS describes the complete chip design for a more realistic investigation. Furthermore, GENESYS incorporates superior Rent’s Rule wiring models to accurately predict wire demands of various computer architecture designs. Other simulators [25] that consider only random logic blocks with the theoretical upper bound on wiring [9] may not produce accurate results for different system designs. III. GENESYS ORGANIZATION AND MODELS GENESYS embodies five model levels to represent the key limits of the following hierarchy: 1) fundamental; 2) material; 3) device; 4) circuit; and 5) system. The first three model levels capture physical effects of material properties and switching device behaviors. Technology parameters such as interconnect metal, dielectric materials, current drain, and submicron effects, serve as inputs to the simulator. Circuit models incorporate gate level characteristics such as signal propagation delay and power dissipation. The system module consists of architecture, interconnection, and packaging models to complete the description 661 of a single ASIC chip. More information on the GENESYS organization is available in [5] and [6]. As this paper is focused on interconnect issue, the remainder of this section will summarize the interconnection models that describes the complete on-chip wiring demand. A Rent’s Rule-based stochastic wirelength distribution model [11] is used to estimate interconnection requirements for a homogeneous logic circuit. This wirelength distribution model is used to estimate wire demand in the so-called first region in the Rent’s Rule diagram. Since signal propagation delay at the gate level is composed of device switching delay and interconnect delay (RC delay, time-of-flight, and rise-time delay of input signals), the estimate wirelengths accurately represent physical circuit characteristics and enable optimal partitioning of multilevel interconnect levels. Effects from interlevel blockage, routing efficiency, wiring structure, peak crosstalk noise, and clock skew are included in the interconnection models. Long-distance interconnect driving schemes [7] and clock distribution networks are included to determine cycle time, chip area, and power consumption. Global wire lengths for a heterogeneous system can be estimated from stochastic net-list distributions, random placement of terminals, and routing information [14]. The net-list information is estimated using the Heterogeneous Rent’s Rule [4] to define the connectivity between cells. Heterogeneous Rent’s Rule is a newly derived relationship that extends Rent’s Rule by establishing an equivempirical correlation , and for heterogeneous systems. This heteroalent geneous relationship is useful to describe “system on a chip” (SOC) architectures where logic blocks are not uniform. Placement information sets the bounding area (chip area under which the wiring model considers connectivity between signal terminals) for estimating global wire routing among heterogeneous blocks. This wire distribution model is used to estimate wire demand in the so-called second region of the Rent’s Rule diagram, where the traditional empirical correlation is less accurate [7], [10]. GENESYS incorporates both wiring models (homogeneous and heterogeneous) to accurately predict wire demands of different system architectures. IV. HETEROGENEOUS ARCHITECTURE MODELS As described earlier, the heterogeneous architecture models enable a futuristic probe to uncover limiting factors in a system design. This section describes the modeling efforts to connect architecture description with wiring demands. Comparisons with actual data are also provided to demonstrates model accuracy confirms that important architectural characteristics are captured. Local and global wire demands are predicted to forecast key performance metrics such as chip area and cycle time. The architecture models consider a chip design as a collection of cells, as shown in Fig. 1. For a processor array, the cells consist of single processors joined together with an interconnection network. A single processor consists of a collection of random logic cells (functional units) or local memory (cache). An internal data bus and control-signal wires connect the functional units and cache together in the single processor model. 662 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000 Fig. 1. Architecture models incorporate a hierarchical collection of cells. For a processor array, parallel processors are tiled with an interconnection network. For a single processor, random logic or cache cells are connected with a data bus and global control signals. The number of logic gates , Rent’s exponent and co, number of signal pins , and number of gates efficient in the critical path (Ncp) are modeled for different functional units in a modern processor. The fan-out distributions (FO), describing the data bus or interconnection network between cells, are also provided. These values are used by GENESYS wiring models, and they are extracted from actual designs or modeled as functions of architectural parameters shown in Table I. These parameters include generic architectural information such as data path width (Ws), register file access ports (Rp and Wp), and bits generated by a decoder to represent instructions (Nb). Equivalent Rent’s parameters are calculated for the single processor using the Heterogeneous Rent’s Rule [4] for GENESYS to partition wire demands into multi-tier interconnection layers and to find the longest global wire. Clock frequency and chip area for a single processor are then estimated following wire partitioning. For a processor array, wiring demand for the communication network is also included. The heterogeneous architecture models extend GENESYS’s single homogeneous cell model with a generic specification for a heterogeneous system-on-a-chip, providing better architectural detail and more accurate prediction of performance. In addition, the architecture models allow designers to vary contents and connectivity of cells for more flexibility and accuracy in system description. In summary, architectural information is used to estimate and Ncp. These estimated values ( and ) are then used to predict wiring requirements. The predicted wiring-demand sets the average load for a random logic chain. Together with Ncp, cycle time for the architecture is then predicted. This process allows a designer to investigate architectural issues prior to actual implementation. The use of architecture and wiring models represents the unique contribution presented in this paper. A. Model Derivations Elements in the heterogeneous architecture models are extracted from gate level schematics [26]–[28] and chip layout [29] to capture details on cell organization, circuit placement, and internal routing within a functional unit. The number of and the number of pins are tabulated in logic gates closed-form equations (Table II), as functions of architectural parameters previously shown in Table I. The number of gates in the critical path (Ncp) is also determined from the gate level schematics to find functional unit latency. Rent’s coefficient is calculated as the average I/O pins per gate in the gate is extracted in a process level schematic. Rent’s exponent similar to [9], by partitioning the logic circuits into hierarchical divisions and finding the value that matches the power-law relationship between pins and gates. These models are reported in Table II for a set of functional units in modern superscalar processors and parallel SIMD processors [26], [29]. To reduce gate delays, execution units are pipelined [26], [28] by adding the number of gates in pipeline latches to total gate count and by reducing Ncp accordingly. While other functional units exist, each with different design styles and structures, this modeling approach captures the important architectural characteristics and enables detailed system description with the selection of functional units. Details on equation derivations are available in [30]. Fig. 2 offers an example derivation for the register file equations. The internal organization of a register file unit is first characterized. A register file has a decoder block that decodes the addresses and selects a word in the register. The register array block contains the actual storage element including circuitry to select the read/write ports (R/W). An output buffer block is included to drive the register file into the data bus. The architecture model for the register file considers the number of line drivers and select units based on address width (Ws). The model also determines the number of cells in the register array and buffer block based on data width size (Ws) and number of registers and Ncp for the register file, shown in (Nr). Equations for Table II, are then determined. B. Global Wire Demand Models The connectivity between the functional units in Table II determines the global wire demand for the system. Fig. 3 illustrates organization of the architecture models in terms of wire demand. Each block in the figure represents a separate set of equations for a specific unit. Functional unit models exist at the local wire demand level because their wire demands are localized within the unit. At the intermediate wire demand level, two separate models are used to characterize different system architectures. The Single Node block characterizes the connectivity of functional units in a SIMD parallel array. The Bypass Path block characterizes the connectivity of functional units in a pipeline structure of superscalar processors. At the global wire demand level, the Tiled Array block characterizes the connectivity of Single Node units with an interconnection network for a parallel array, while the Superscalar Processor block predicts wire demands between caches and the random logic block in a uniprocessor architecture. All specifications are collected in the System Model. Along with architectural parameters provided by the designer, system and interconnection parameters are provided to the GENESYS to project chip area, power, and total wire demand. These hierarchical levels allow a generic heterogeneous organization to describe different system architectures. Global wire demands are modeled in terms of wire fan-out. For example, the data bus and control signal wires in Single Node of a SIMD parallel array [29] can be described as follows: CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS 663 TABLE I PARAMETERS FOR HETEROGENEOUS ARCHITECTURE MODELS TABLE II HETEROGENEOUS ARCHITECTURE MODELS OF SINGLE PROCESSOR FUNCTIONAL UNITS Fig. 2. Characterizing a functional unit such as the register file requires an analysis of the internal structure to obtain Rent’s parameters, T ; N; and Ncp. FO Cw Nb . Since control signal connectivity is point-to-point between the instruction decoder and functional units, only fan-out distribution of two is tabulated. The data bus connects all functional units (excluding the instruction decoder) Ws. with the register file and is modeled as follows: FO The number of wires for the data bus is three times the data path width because there are two source channels and one result channel in the processor implementation. Placement and routing information for the SIMD processor are utilized in GENESYS to convert the net-list distribution to wire lengths. Wiring demand for the communication network in a processor array is also included. For a mesh network, inter-processor signals exist only between near-neighbors and is Cw Nb . This demand is scaled to modeled as FO the size of a single processor and then added to the internal processor wiring to obtain the total on-chip interconnect demand. 664 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000 Fig. 3. Architecture models are organized in a hierarchical structure to describe heterogeneous system architectures. Different wiring demand models describe the connectivity of units in the lower level. Different interconnection networks and internal processor wiring can be modeled using different net-list distributions. Further details on the architecture and wire demand models are available in [30]. C. Validation The rest of this section describes the validation process for the heterogeneous architecture models. Comparison against actual data [29] shows the high prediction accuracy using GENESYS wiring demand models indicating important architectural characteristics are being captured. Fig. 4 shows the area and gate-delay comparisons for various functional units. Actual chip areas are extracted from layout, while gate delays are simulated using a switch level simulator with RC loads. Functional units are modeled in GENESYS as homogeneous units to find silicon area and gate delay without global wiring effects. Estimated average gate-delay is multiplied with the number of gates in the critical path (Ncp) to obtain functional unit latency. Minor functional units such as the activity/sleep unit, communication unit, serial shifter, and instruction decoder, are modeled with a gate-pitch proportional to the barrel shifter because these units are pitch-matched to the data bus width and are less compact in the chip layout. Fig. 5 presents the wire demand comparisons for internal and global wiring. To find actual wiring demand, node capacitance is extracted from chip layout after removing nonwiring layers such as diffusions and wells. An average unit-capacitance per length is then determined to calculate total wiring length. Each functional unit was extracted separately to find individual wiring demand. Global wiring demand is found from extracting the chip layout with functional units removed. Internal wiring demand predicted by GENESYS models correlates well with extracted data. The architecture models are verified against other design styles and structures. Fig. 6 shows a comparison of register file area for several designs available in the literature [29], [31]–[33]. In addition to the different implementation technologies, the register files have varying architectural structure (physical registers, word size, and read/write ports). Predicted silicon areas compare well to actual data, with an average deviation of only 17%. For the Hwang register file, a memory is used because the design cell Rent’s exponent style is similar to a memory cell. For each register file simulation, the proper fabrication technology parameters, such as interconnect geometry and device parameters, are used to accurately estimate chip area. These semiconductor fabrication data are not used to curve-fit the results to match Rent’s Rule, but rather they are used to place the architecture models in the proper technology to accurately predict wire demand, chip area, and cycle time. These verification results in chip area, wiring demands, and latency for different functional units, varying in technologies and design styles, show the effectiveness of the architecture and wiring models as a predictive tool for architecture explorations. V. SYSTEM DESIGN ANALYSIS With the architecture models derived and validated in the previous section, they can be used to investigate different computer architectures performance in future GSI technologies. This section describes the exploration of two candidate architectures using the architecture models in GENESYS. Superscalar processor designs are investigated to find processor complexity versus cycle time relationships. In addition, parallel SIMD processor arrays are modeled to find optimum data path width and system size in future technologies. A. A Superscalar Processor Design Current industry trend has been to design microprocessors with multiple instruction issue width to extract instruction-level- CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS Fig. 4. 665 Functional unit area and latency comparisons against extracted data. parallelism. These designs are pipelined, employing hardware bypassing to eliminate data hazards, to avoid interlock stalls from dependent instructions, and to provide results to execution units in the following cycle [26]. While logically simple, hardware bypassing can be costly with deeply pipelined, wide instruction issue processors [34]. Bypass delays have been estimated to dominate over other pipeline stages by 0.18 m technologies [18]. The data bypass hardware is among the most wire demanding IW bypass paths [34], where IW is the units, requiring instruction issue width, and is the number of pipeline stages after the first result-producing stage. The formula assumes a 2-input functional unit, and is scaled with data path width (Ws) to obtain total wire demand. Because the number of bypass paths grows quadratically with issue width (IW), a study is required to uncover limits in processor with wide degree of instruction issue. Architecture and wiring models are used in GENESYS to estimate clock frequencies for processors with varying issue width. Wire demand for hardware bypass is modeled as bus channels, connecting the functional units shown in Table III. The number of functional units is increased according to issue width, following existing simulation parameters in [18]. The Load/Store unit consists of an adder for computing memory addresses, while other functional units have been described previously. Register file parameters are extracted from previous 666 Fig. 5. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000 Wire demand comparisons against extracted data from chip layout. Fig. 6. Area comparisons against actual data [29], [31]–[33]. Register files are chosen for their different structures and implementation technologies to show the effectiveness of the architecture models. studies in optimal configurations [35]. Technology parameters are obtained or extracted from the National Technology Roadmap for Semiconductors (NTRS) [36]. Fig. 7 presents simulation results for various superscalar processor designs. The number of critical path gates (Ncp) is extrapolated from [37] to represent pipeline delay from logic circuits. The family of solid-line curves represents local clock frequency for processors with different issue widths, while the dashed line represents cycle time for across-chip global wires. A processor design with a single synchronous clock must operate at a frequency that is equal to the smaller of the two clock frequencies (local and across-chip). In Fig. 7, these two values are shown separately to illustrate the divergence of these two operating frequencies. The simulated values were estimated by GENESYS with optimized partitioning of the multi-tier interconnection network. NTRS projected on-chip local and across-chip clock frequencies form two operating regions: Region-I represents a region of operation below the NTRS projected across-chip clock frequencies; Region-II represents a region of operation above across-chip clock frequencies but below local clock frequencies. Processor designs with issue widths smaller CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS 667 TABLE III SIMULATION PARAMETERS FOR WIDE ISSUE SUPERSCALAR PROCESSORS Fig. 7. Clock frequency projections for superscalar processor designs with varying instruction window (IW) sizes. NTRS projected on-chip local and across-chip clock frequencies form the two operating regions. than four are able to track projected local clock frequencies, operating in Region II by the year 2003. Wider issue processors are more complex, requiring larger chip area and longer wires. Simulation results show that by 2006, across-chip clock frequencies (dashed line) may not be able to meet NTRS projections, as wire lengths and demands for large, dense chips exceed any performance gains from wire technology. In 250 nm technology (1997), a four-issue processor design is not yet limited by across-chip clock frequencies. However, across-chip delays begin to dominate local gate delays by today’s technologies (180 nm, 1999) and by 2006, global wire performance can no longer track a four-issue processor. By the year 2012, synchronous processor designs with issue widths of four and six will be dominated by long global wire delays. These results clearly illustrate the need for architectural techniques to reduce poor wire performance. In order to maintain aggressive operating clock frequencies, future processor designs must be locally synchronized with multi-cycle access to regions across the chip. Clustered organizations with smaller computation regions, similar to schemes used in the Alpha 21 264 [38], are required to maintain short wire lengths. Other techniques such as incomplete bypass structures [34] and single-chip multiprocessor [20] alleviate the need for long-distance, fast global wires. B. A Parallel Array Design This section describes an architectural exploration of a SIMD parallel array to find optimal data path width (Ws) in different technologies. Larger data path width (Ws) improves computation precision but requires larger chip area and reduces the number of processing elements (PEs) that fits on a single-chip. SIMD arrays are well suited for multimedia applications with large amounts of explicit parallelism because the same operation is performed over the entire data set [39]. Programming is thereby simplified and the array becomes easily scalable for large data sets. As future processor workloads become increasingly dominated by multimedia applications, understanding performance characteristics of parallel SIMD arrays becomes vital for successful deployment of such systems. The SIMD architecture consists of a tiled array of PEs with a mesh interconnection network, following the design in [29]. Each PE is composed of the functional units listed in Table II. In order to determine the effects of long wires, the functional units are pipelined to shift from performance reliance on gate delay to 668 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000 Fig. 8. Projected wire and gate delays for a pipelined SIMD processor with varying data width size (8–128 bits). Fig. 9. Optimal data width size determined by the threshold between wire and gate delay. inter-processor wire delay. Wiring demand has been described previously in Section IV-A. Cycle times with varying data path widths are estimated using architecture and wiring models in GENESYS. Fig. 8 presents simulation results for the parallel SIMD array. The number of gates in the critical path is set to eight to account for unit selection, execution, and latch delays. The family of solid-line curves represents the estimated gate latencies for different data path widths, while dashed lines represent inter-processor wire delays. The results clearly show the large difference between transistor and wire improvements for the projected technologies. By 2012, gate delay reductions are two to three times more than long wire performance improvements for 128-bit SIMD processors. Because denser transistors allow for smaller, closer units, total gate delays can reduce even for the same data path width and critical path gates. Wire performance for inter-processor communication, however, does not scale as well. Fig. 9 shows the optimal data width size for the parallel SIMD array. The optimal value is determined as the data width size where total gate delay of a PE is equal to the inter-processor wire delay. Since SIMD arrays have a synchronous global clock, designs with data widths larger than the optimal value will have cycle time dominated by long wire delay. A peak value of 55 is found as wire performance after 2003 reduces optimal data path width. While general applications often require precision of the order of 53 bits [26], multimedia applications often only use 8 CHAI et al.: HETEROGENEOUS ARCHITECTURE MODELS or 16 bits of precision [39]. This fact, along with simulation results, motivates two design configurations for parallel SIMD arrays: 16 bits for multimedia processing, and 64 bits for high-performance scientific computation. From GENESYS predicted area for a 16-bit PE, a tiled array of 14 600 SIMD PEs fits in a single chip by 2012. With a NTRS projected power budget of 3.2 W for portable devices, a 16-bit multimedia SIMD array provides 372 Gops/s at 50 MHz clock frequency. Today’s DSP performance is in the range of 0.5–1 Gops/s. For high-performance scientific applications, 2500 tiled 64-bit SIMD processors offer 1.5 Tops/s at 1.2 GHz clock frequency. A realistic multimedia application suite is used to determine the fraction of vectorized instructions and array utilization [40]. Clock frequency is set to operate below power density limits in 2012 [29]. The number of PEs is the maximum allowed as pad areas and other interface units are not considered in the architecture model. High computational performance of a SIMD array comes from the large number of PEs, shifting the reliance from clock frequency to explicit data parallelism. Silicon area taken by complex bypass structures, branch prediction logic, and large on-chip caches is better utilized by increasing the number of PEs per chip, creating a dense computational system. With small local memory and specialized data paths for multimedia applications, each PE forms a localized computational region that limits the need for fast, long global wires. While single-chip multiprocessors, such as a parallel SIMD array, offer a clear advantage to better scale with poor wire-technology, they are traditionally delegated to applications that are relatively easy to parallelize. These applications include many image and signal processing algorithms where there are abundant amounts of data parallelism. A superscalar uniprocessor, on the other hand, are suitable for a large set of applications due to the ease in programming. Research in parallelizing compilers must be extended to bring hard to parallelize problems to single-chip multiprocessors. VI. CONCLUSION A set of heterogeneous architecture models is presented for use with Rent’s Rule wiring models. The architecture models enable realistic estimations of interconnect demand by providing a flexible specification for a heterogeneous system-on-a-chip. Comparisons against actual data show the high accuracy from using the models for architectural explorations. System performance metrics for superscalar processors and parallel SIMD arrays are estimated using GENESYS to find architecture-technology relationships. Results from both systems show the need for interconnect-motivated system designs that scale well with poor wire performance. Insights for architectural techniques to maintain computational locality and to limit the need for fast, long-distance communications are provided. ACKNOWLEDGMENT The authors extend thanks to the GSI and PICA research groups for their research contributions, especially to Dr. J. C. 669 Eble, III, for his earlier work on GENESYS, Dr. J. Davis for his homogeneous wirelength models, and P. Zarkesh-Ha for his research in heterogeneous wiring models. The authors further extend acknowledgment to the anonymous reviewers who have improved the presentation of this paper. REFERENCES [1] M. Bohr, “Interconnect scaling—The real limiter to high performance ULSI,” in Proc. IEDM. New York, 1995, pp. 241–244. [2] D. Matzke, “Will physical scalability sabotage performance gains?,” IEEE Computer, vol. 30, pp. 37–39, Sept. 1997. [3] J. D. Meindl, “Low power microelectronics: Retrospect and prospect,” Proc. IEEE, vol. 83, no. 4, pp. 619–635, 1995. [4] P. Zarkesh-Ha, J. A. Davis, W. Loh, and J. D. Meindl, “On a pin versus gate relationship for heterogeneous systems: Heterogeneous Rent’s Rule,” in Proc. IEEE Custom Integrated Circuits Conf., Santa Clara, CA, May 1998, pp. 93–96. [5] J. C. Eble, V. K. De, D. S. Wills, and J. D. Meindl, “A generic system simulator (GENESYS) for ASIC technology and architecture beyond 2001,” in Proc. 9th Annu. IEEE Int. ASIC Conf., Rochester, NY, Sept. 1996, pp. 193–196. [6] J. C. Eble III, “A Generic System Simulator with Novel On-Chip Cache and Throughput Models for Gigascale Integration,” Ph.D. dissertation, Georgia Institute of Technology, Atlanta, 1998. [7] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990. [8] P. Christie, “A fractal analysis of interconnection complexity,” Proc. IEEE, vol. 81, pp. 1492–1499, Oct. 1993. [9] W. E. Donath, “Placement and average interconnect lengths of computer logic,” IEEE Trans. Circuits Syst., vol. CAS-26, pp. 272–277, Apr. 1979. [10] B. S. Landman and R. L. Russo, “On a pin versus block relationship for partitions of logic paths,” IEEE Trans. Comput., vol. C-20, pp. 1469–1479, Dec. 1971. [11] J. A. Davis, V. K. De, and J. D. Meindl, “A stochastic wire-length distribution for gigascale integration (GSI): Part I: Derivation and validation,” IEEE Trans. Electron Devices, vol. 45, pp. 580–589, Mar. 1998. [12] R. M. Karp, F. T. Leighton, R. L. Rivest, C. D. Thompson, U. Vazirani, and V. Vazirani, “Global wire routing in two-dimensional arrays,” in 24th Annu. Symp. Foundations of Computer Science, Tucson, AZ, Nov. 1983, pp. 453–459. [13] G. B. Sorkin, “Asymptotically global routing: A stochastic analysis,” IEEE Trans. Computer-Aided Design , vol. CAD-6, pp. 820–827, Sept. 1987. [14] P. Zarkesh-Ha and J. D. Meindl, “Stochastic net length distribution for global interconnects in a heterogeneous system-on-a-chip,” in IEEE Symp. VLSI Technology Digest of Technical Papers, Honolulu, HI, June 1998, pp. 44–45. [15] A. Rahman, A. Fan, and R. Reif, “Wire-length distribution of three-dimensional integrated circuits,” presented at the Workshop on SystemLevel Interconnect Prediction SLIP99, Monterey, CA, Apr. 10–11, 1999. [16] A. Hess and G. Zimmermann, “Interconnection length estimation during hierarchical VLSI design,” presented at the Workshop on System-Level Interconnect Prediction SLIP99, Monterey, CA, Apr. 10–11, 1999. [17] J. D. Cho, “Wiring space estimation in two-dimensional arrays,” presented at the Workshop on System-Level Interconnect Prediction SLIP99, Monterey, CA, Apr. 10–11, 1999. [18] S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-effective superscalar processors,” in Computer Architecture News, 24th Annu. Int. Symp. Computer Architecture, vol. 25, May 1997, pp. 206–218. [19] T. Hara, H. Ando, C. Nakanishi, and M. Nakaya, “Performance comparison of ILP machines with cycle time evaluations,” in Computer Architecture News, 23rd Annu. Int. Conf. Computer Architecture, vol. 24, 1996, pp. 213–224. [20] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chung, “The case for a single-chip multiprocessor,” in SIGPLAN Notices, 7th Int. Conf. Architectural Support for Programming Languages and Operating Systems, vol. 31, Cambridge, MA, Oct. 1996, pp. 2–11. [21] S. Fu and M. Flynn, “Performance and Area Analysis of Processor Configurations with Scaling of Technology,”, Stanford Tech. Rep. CSL-TR-94-605, Mar. 94. [22] A. Masaki, “Possibilities of deep-submicrometer CMOS for very high speed computer logic,” Proc. IEEE, vol. 81, pp. 1311–1324, Sept. 1993. [23] G. A. Sai-Halasz, “Performance trends in high-end processors,” Proc. IEEE, vol. 83, pp. 20–36, Jan. 1995. 670 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 6, DECEMBER 2000 [24] R. Mangaser and K. Rose, “Estimating interconnect performance for a new national technology roadmap for semiconductors,” in Proc. IEEE Int. Interconnect Technology Conf., San Francisco, CA, June 1998, pp. 253–255. [25] D. Sylvester and K. Keutzer, “Getting to the bottom of deep submicron,” Proc. Int. Conf. Computer-Aided Design, pp. 203–211, Nov. 1998. [26] J. L. Hennesy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990. [27] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI System Design: A System Perspective. Reading, MA: Addison-Wesley, 1993. [28] I. H. Unwala and E. E. Swartzlander Jr., “Superpipelined adder designs,” IEEE Int. Symp. Circuits and Systems, vol. 3, pp. 1841–1844, May 1993. [29] S. M. Chai, A. Gentile, and D. S. Wills, “Impact of power density limitation in gigascale integration for the SIMD pixel processor,” in 20th Anniversary Conf. Advanced Research in VLSI. Atlanta, GA, Mar. 21–24, 1999, pp. 57–71. [30] S. M. Chai, “Real Time Image Processing on Parallel Arrays for Gigascale Integration,” Ph.D. dissertation, Georgia Institute of Technology, Atlanta, GA, 1999. [31] R. L. Franch, J. Ji, and C. L. Chen, “A 640-ps, 0.25-m CMOS, 16 64-b three-port register file,” IEEE J. Solid-State Circuits, vol. 32, pp. 1288–1292, Aug. 1997. [32] W. Hwang, R. V. Joshi, and W. H. Henkels, “A 500 MHz, 32-word 64-bit, eight-port self-resetting CMOS register file,” IEEE J. Solid-State Circuits, vol. 34, pp. 56–67, Jan. 1999. [33] J. M. Mulder, N. T. Quach, and M. J. Flynn, “An Area-Utility Model for On-Chip Memories and Its Applications,”, Stanford Tech. Rep. CSL-TR-90-413, Feb. 1990. [34] P. Ahuja, D. Clark, and A. Rogers, “Performance impact of incomplete bypassing in processor pipeline,” in IEEE Proc. 28th Annu. Int. Symp. Microarchitecture, Ann Arbor, MI, Nov. 1995, pp. 36–45. [35] K. I. Farkas, N. P. Jouppi, and P. Chow, “Register file design considerations in dynamically scheduled processors,” in IEEE Symp. High-Performance Computer Architecture, San Jose, CA, Feb. 1996, pp. 40–51. [36] The National Technology Roadmap for Semiconductors. San Jose, CA: Semiconductor Industry Association, SEMATECH, 1997. [37] W. J. Dally and S. Lacy, “VLSI architecture: Past, present, and future,” in 20th Anniversary Conf. Advanced Research in VLSI. Atlanta, GA, Mar. 21–24, 1999, pp. 232–241. [38] R. E. Kessler, “The alpha 21 264 microprocessor,” IEEE Micro, vol. 19, pp. 24–36, Mar. 1999. [39] K. Diefendorff and R. Dubey, “How multimedia workloads will change processor design,” IEEE Computer, vol. 30, pp. 43–45, Sept. 1997. [40] A. Gentile, J. L. Cruz-Rivera, D. S. Wills, L. Bustelo, J. J. Figueroa, J. E. Fonseca-Camacho, W. E. Lugo-Beauchamp, R. Olivieri, M. QuiñonesCerpa, A. H. Rivera-Ríos, I. Vargas-Gonzáles, and M. Viera-Vera et al., “Real-time image processing on a focal plane SIMD array,” in Parallel and Distributed Processing—Lecture Notes in Computer Science, J. Rolim et al., Eds. New York: Springer-Verlag, 1999, vol. 1586, pp. 400–405. 2 2 Sek Meng Chai (M’99) received the B.S. degree in electrical engineering in 1994, the M.S. degree in electrical engineering in 1996, and the Ph.D. degree in electrical and computer engineering in 1999 from Georgia Institute of Technology, Atlanta. Currently, he is a Staff Electrical Engineer at the DigitalDNA System Architecture Laboratory at Motorola, Schaumburg, IL. His research interests include parallel architecture, VLSI, billion transistor architectures, and high-performance interconnection networks. His doctoral research addresses high throughput, efficient architectures such as systolic arrays. Dr. Chai is a Member of ACM. Tarek M. Taha received the B.E.E. degree in electrical engineering from Georgia Institute of Technology, Atlanta, and the B.A. degree from DePauw University, Greencastle, IN, in 1996. He is currently pursuing the Ph.D. degree in the Portable Image Computation Architectures (PICA) Group at Georgia Institute of Technology. His research interests include high-performance architectures, multicomputer interconnection networks, and VLSI. D. Scott Wills (S’80–M’83–SM’97) received the B.S. degree in physics from Georgia Institute of Technology, Atlanta, in 1983, and the S.M., E.E., and Sc.D. degree in electrical engineering and computer science from Massachusetts Institute of Technology (MIT), Cambridge, in 1985, 1987, and 1990, respectively. He is currently an Associate Professor of Electrical and Computer Engineering at Georgia Institute of Technology. His research interests include short wire VLSI architectures, high throughput portable processing systems, architectural modeling for gigascale (GSI) technology, and high-efficiency image processors. Dr. Wills is a Senior Member of the Computer Society and an Associate Editor of IEEE TRANSACTIONS ON COMPUTERS. James D. Meindl (M’56–SM’66–F’68–LF’97) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Carnegie Institute of Technology (Carnegie Mellon University), Pittsburgh, PA. Currently, he is the Director of the Joseph M. Pettit Microelectronics Research Center, and has been the Joseph M. Pettit Chair Professor of Microelectronics at Georgia Institute of Technology, Atlanta, since 1993. He was Senior Vice President for academic affairs and provost of Rensselaer Polytechnic Institute, Troy, NY, from 1986 to 1993. He was with Stanford University from 1967 to 1986 as the John M. Fluke Professor of electrical engineering, Associate Dean for Research in the School of Engineering, Director of the Center for Integrated Systems, Director of the Electronics Laboratories, and Founding Director of the Integrated Circuits Laboratory. Dr. Meindl is a Fellow of the American Association for the Advancement of Science, and a member of the American Academy of Arts and Sciences and the National Academy of Engineering and its academic advisory board. He received a Benjamin Garver Lamme Medal from ASEE, an IEEE Education Medal, an IEEE Solid-State Circuits Medal, and an IEEE Beatrice K. Winner Award. He has also been awarded the IEEE Electron Devices Society’s J. J. Ebers Award, the Hamerschlag Distinguished Alumnus Award, Carnegie Mellon University, the 1999 SIA University Research Award, and the IEEE Third Millennium Medal.

Heterogeneous architecture models for

Related documents

Products

Support

Heterogeneous architecture models for

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib