The Bus Architecture in Embedded Systems

ESE 566 Report 1
LeTian Gu
September 2002

Abstract: In embedded system design, interconnect is classified as one of the five architectural components. IBM's open bus standard, the CoreConnect Bus Architecture, is used as an example to illustrate the current state of this area. Two commonly encountered design issues are then examined to expose the functional problems of interconnect: power consumption can be reduced by decreasing the voltage swing on the transmission line, and clock signals transferred over bus wires require careful design to eliminate edge skew and phase delay between different channels.

1. Introduction

An embedded system is a special-purpose computer combined into a physical system to control its operating functions. The physical system may be as small as a digital camera or cellular phone, as common as an automobile, or as big as a space vehicle or a communication system. An embedded system typically performs these functions:

• Monitor variables of the physical system, such as light level, frequency, and speed
• Process this information, making use of one or more mathematical models of the physical system
• Output signals that influence the behavior of the physical system, to control its function and optimize its performance

The architecture of an embedded system can be classified into five components [1]:

• instruction set
• function units and data path
• memory
• interconnect
• control

For a simple system such as the 8051 microcontroller, the low data-transport requirements mean the interconnect is only a set of 8-bit buses. Recent advances in silicon density, however, allow the integration of numerous functions onto a single silicon chip. With this increased density, peripherals formerly attached to the processor at the card level are integrated onto the same die as the processor. As a result, chip designers must now address issues traditionally handled by the system designer. In particular, the on-chip buses used in such system-on-a-chip (SOC) designs must be sufficiently flexible and robust to support a wide variety of embedded system needs.

In 1999, IBM opened its CoreConnect bus architecture [2]. It is used in the following sections as an example to depict the interconnect component. Many factors influence the important properties (power dissipation, speed) of an interconnect; the last section of this paper describes two particular cases so that we can understand the design considerations behind interconnect problems.

2. The CoreConnect Bus Architecture

Typically, a SOC contains numerous functional blocks representing a very large number of logic gates. Designs such as these are best realized through a macro-based approach. Macro-based design provides numerous benefits during logic entry and verification, but the ability to reuse intellectual property is often the most significant. From generic serial ports to complex memory controllers and processor cores, each SOC generally requires the use of common macros.
Many single-chip solutions used in applications today are designed as custom chips, each with its own internal architecture. Logical units within such a chip are often difficult to extract and reuse in different applications; as a result, the same function is often redesigned from one application to another. Reuse is promoted by ensuring macro interconnectivity through common buses for inter-macro communication. Many bus architectures have been invented to ease the integration and reuse of processor, system, and peripheral cores within standard-product and custom SOC designs. The IBM CoreConnect bus [2] is one of them.

Fig. 1 The CoreConnect bus architecture in a SOC

The on-chip bus structure includes three parts: the processor local bus (PLB), the on-chip peripheral bus (OPB), and the device control register (DCR) bus, which are discussed separately below.

(1) Processor Local Bus

The PLB is designed to interface between processor cores and integrated bus controllers, so that a library of processor cores and bus controllers can be developed for use in Core+ASIC and system-on-a-chip (SOC) designs [3]. The PLB addresses high-performance, low-latency, and design-flexibility issues, providing a high-bandwidth data path. The performance features include:

• Decoupled address, read-data, and write-data buses with split-transaction capability
• Ability to overlap the bus request/grant protocol with an ongoing transfer
• Concurrent read and write transfers, yielding a maximum bus utilization of two data transfers per clock
• Address pipelining that reduces bus latency by overlapping a new write request with an ongoing write transfer, and up to three read requests with an ongoing read transfer

The PLB offers designers flexibility through the following features:

• Support for both multiple masters and slaves
• Four priority levels for master requests, allowing PLB implementations with various arbitration schemes
• Deadlock avoidance through slave-forced PLB rearbitration
• Master-driven atomic operations through a bus-arbitration locking mechanism
• Byte-enable capability, supporting unaligned transfers
• A sequential burst protocol allowing byte, half-word, word, and double-word burst transfers
• Support for 16-, 32- and 64-byte line data transfers
• Read word address capability, allowing slaves to return line data either sequentially or target-word first
• DMA support for buffered, fly-by, peripheral-to-memory, memory-to-peripheral, and memory-to-memory transfers
• Guarded and unguarded memory transfers, allowing slaves to individually enable or disable prefetching of instructions or data
• Slave error reporting
• Architecture extendable to 256-bit data buses
• Fully synchronous operation

The connection of multiple masters and slaves through the PLB is illustrated in Fig. 2. Each PLB master is attached to the PLB via separate address, read-data and write-data buses and a number of transfer qualifier signals. PLB slaves are attached to the PLB macro via shared, but decoupled, address, read-data and write-data buses, along with transfer control and status signals for each data bus.

Fig. 2 Example of PLB interconnection

The PLB architecture supports up to 16 master devices; specific PLB implementations may support fewer. The PLB architecture also supports any number of slave devices. The number of masters and slaves attached to a PLB directly affects the maximum attainable PLB bus clock rate, because larger systems tend to have increased bus wire load and a longer delay in arbitrating among multiple masters and slaves. A minimal sketch of such a priority-based arbiter is given below.
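To make the arbitration features above concrete, the following is a minimal sketch of a fixed-priority arbiter in the spirit of the PLB's four priority levels and 16 masters. It is illustrative only: the names, the priority encoding, and the lowest-master-id tie-break are assumptions, since the CoreConnect specification deliberately leaves the arbitration scheme to the implementation.

```python
# Sketch of a four-priority-level bus arbiter (illustrative, not from the
# CoreConnect specification). Assumed encoding: priority 3 is highest.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Request:
    master_id: int   # 0..15: the PLB architecture supports up to 16 masters
    priority: int    # 0 (lowest) .. 3 (highest): four priority levels

def arbitrate(requests: List[Request]) -> Optional[Request]:
    """Grant the highest-priority request; break ties by lowest master id.

    A real PLB arbiter could use another tie-break (e.g. round robin); the
    four priority levels are what allow such different arbitration schemes.
    """
    if not requests:
        return None
    return min(requests, key=lambda r: (-r.priority, r.master_id))

# Example: master 2 (priority 3) wins over masters 0 (priority 1) and 5.
granted = arbitrate([Request(0, 1), Request(2, 3), Request(5, 3)])
assert granted is not None and granted.master_id == 2
```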
The PLB consists of a bus arbitration control unit and the control logic required to manage the address and data flow through the PLB. The separate address and data buses from the masters allow simultaneous transfer requests. The PLB arbitrates among these requests and directs the address, data, and control signals from the granted master to the slave bus. The slave response is then routed from the slave bus back to the appropriate master.

(2) PLB Bus Transactions

PLB transactions consist of multiphase address and data tenures. Depending on the level of bus activity and the capabilities of the PLB slaves, these tenures may be one or more PLB bus cycles in duration. In addition, address pipelining and separate read and write data buses yield increased bus throughput by way of concurrent tenures.

Address tenures have three phases: request, transfer, and address acknowledge. A PLB transaction begins when a master drives its address and transfer qualifier signals and requests ownership of the bus during the request phase of the address tenure. Once the PLB arbiter grants bus ownership, the master's address and transfer qualifiers are presented to the slave devices during the transfer phase. The address cycle terminates when a slave latches the master's address and transfer qualifiers during the address acknowledge phase.

Fig. 3 PLB transfer protocol example

Figure 3 illustrates two-deep read and write address pipelining along with concurrent read and write data tenures. Master A and Master B represent the state of each master's address and transfer qualifiers. The PLB arbitrates between these requests and passes the selected master's request to the PLB slave address bus. The trace labeled Address Phase shows the state of the PLB slave address bus during each PLB clock. As shown in Figure 3, the PLB specification supports implementations where these three phases require only a single PLB clock cycle. This occurs when the requesting master is immediately granted access to the slave bus and the slave acknowledges the address in the same cycle. If a master issues a request that cannot be immediately forwarded to the slave bus, the request phase lasts one or more cycles.

Each data beat in the data tenure has two phases: transfer and acknowledge. During the transfer phase the master drives the write data bus for a write transfer, or samples the read data bus for a read transfer. As shown in Figure 3, the first (or only) data beat of a write transfer coincides with the address transfer phase. Data acknowledge cycles are required during the data acknowledge phase for each data beat in a data cycle. In the case of a single-beat transfer, the data acknowledge signals also indicate the end of the data transfer. For line or burst transfers, the data acknowledge signals apply to each individual beat and indicate the end of the data cycle only after the final beat. The highest data throughput occurs when data is transferred between master and slave in a single PLB clock cycle; in this case the data transfer and data acknowledge phases are coincident. During multi-cycle accesses there is a wait state either before or between the data transfer and data acknowledge phases. The cycle counting implied by these rules is sketched below.
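As a rough illustration of the phase rules above, this sketch counts PLB clocks for one address tenure. Each modeled delay adds exactly one cycle, although the text allows "one or more"; the function name and the one-cycle penalties are assumptions for illustration, not part of the PLB specification.

```python
# Address-tenure cycle counting per the rules described above: the three
# phases (request, transfer, address acknowledge) can all complete in a
# single PLB clock in the best case.

def address_tenure_cycles(granted_immediately: bool, acked_same_cycle: bool) -> int:
    """Minimum PLB clocks for one address tenure under the stated rules."""
    cycles = 1                 # best case: request, transfer, ack in one clock
    if not granted_immediately:
        cycles += 1            # request phase stretches while the arbiter is busy
    if not acked_same_cycle:
        cycles += 1            # slave delays the address acknowledge
    return cycles

print(address_tenure_cycles(True, True))    # 1: single-cycle address tenure
print(address_tenure_cycles(False, False))  # 3: delayed grant and delayed ack
```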
The PLB address, read-data, and write-data buses are decoupled from one another, allowing address cycles to be overlapped with read or write data cycles, and read data cycles to be overlapped with write data cycles. The PLB split-transaction capability allows the address and data buses to have different masters at the same time. Additionally, a second master may request ownership of the PLB, via address pipelining, in parallel with the data cycle of another master's bus transfer, as shown in Figure 3. Overlapped read and write data transfers and split-bus transactions allow the PLB to operate at a very high bandwidth by fully utilizing the read and write data buses.

Allowing PLB devices to move data using long burst transfers can further enhance bus throughput. However, to control the maximum latency in a particular application, master latency timers are required. All masters able to issue burst operations must contain a latency timer that increments at the PLB clock rate, and a latency count register. The latency count register is an example of a configuration register that is accessed via the DCR bus. During a burst operation, the latency timer begins counting after an address acknowledge is received from a slave. When the latency timer exceeds the value programmed into the latency count register, the master can either immediately terminate its burst, continue until another master requests the bus, or continue until another master requests the bus with a higher priority.

(3) On-Chip Peripheral Bus (OPB)

The OPB is a secondary bus architected to alleviate system performance bottlenecks by reducing capacitive loading on the PLB. Peripherals suitable for attachment to the OPB include serial ports, parallel ports, UARTs, GPIO, timers, and other low-bandwidth devices. This common design point accelerates design cycle time by allowing system designers to easily integrate complex peripherals into an ASIC.

The OPB provides the following features:

• A fully synchronous protocol with separate 32-bit address and data buses
• Dynamic bus sizing to support byte, half-word and word transfers
• Byte and half-word duplication for byte and half-word transfers
• A sequential address (burst) protocol
• Support for multiple OPB bus masters
• Bus parking for reduced-latency transfers

(4) The Device Control Register (DCR) Bus

The DCR bus is designed to transfer data between the CPU's general-purpose registers and the device control registers of the DCR slave logic [4]. The DCR bus removes configuration registers from the memory address map, reduces loading, and improves the bandwidth of the processor local bus. The DCR bus is fully synchronous; it is therefore assumed that, in a Core+ASIC environment where the CPU and the DCR slave logic run at different clock speeds, the slower clock's rising edge always corresponds to a rising edge of the faster clock. The DCR bus is typically implemented as a distributed multiplexer across the chip, such that each sub-unit has not only a path to place its own DCRs on the CPU's DCR read path, but also a path that bypasses its DCRs and places another unit's DCRs on the CPU's DCR read path (see the sketch after the feature list below). Features of the DCR bus include:

• 10-bit address bus and 32-bit data bus
• 2-cycle minimum read or write transfers, extendable by slave or master
• Handshake support for clocked asynchronous transfers
• Slaves may be clocked either faster or slower than the master
• A single device control register bus master
• Distributed multiplexer architecture
• A simple but flexible interface
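The following sketch models the distributed-multiplexer read path just described: each sub-unit either places its own DCR data on the CPU's read path or passes the upstream unit's data through. The class, function, and register names are illustrative assumptions; the real DCR bus is a synchronous hardware chain, not software.

```python
# Sketch of the DCR distributed-mux read path (illustrative names only).

from typing import Dict, List

class DcrUnit:
    def __init__(self, registers: Dict[int, int]):
        self.registers = registers              # this unit's device control registers

    def read_path(self, addr: int, upstream: int) -> int:
        # Mux: drive our own DCR if we decode the address, else bypass upstream data.
        return self.registers.get(addr, upstream)

def dcr_read(chain: List[DcrUnit], addr: int, default: int = 0) -> int:
    """Walk the distributed mux chain back to the CPU's DCR read path."""
    data = default
    for unit in chain:
        data = unit.read_path(addr, data)
    return data

# Two sub-units, each owning part of the 10-bit DCR address space.
uart = DcrUnit({0x100: 0xAB})
timer = DcrUnit({0x140: 0xCD})
assert dcr_read([uart, timer], 0x100) == 0xAB   # uart's DCR, bypassed through timer
```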
3. Two Issues Relevant to the Embedded System Bus

(1) A clocking scheme for frequencies up to 4 GHz

At higher frequencies, increased power-supply fluctuation and cross coupling cause clock skew and jitter to become a larger percentage of the cycle time. The larger die area needed to accommodate higher levels of integration increases global clock-distribution latency and loading, worsening clock skew and jitter. In addition, diminishing device geometries allow less manufacturing control over device size and hence cause more skew. Skew-optimization and jitter-reduction schemes are necessary to limit clock inaccuracies to a small fraction of the clock period. A design for jitter filtering and skew optimization [5] is described below.

The clock generation and distribution scheme shown in Fig. 4 is designed for the Pentium 4. The high-speed scheduling and execution core operates at 4 GHz (a fast clock derived from the 2-GHz medium clock), while timing-noncritical blocks such as the bus interface logic operate at 1 GHz. The microprocessor also has three I/O buses: the data bus operates at 400 MHz, the address bus at 200 MHz, and the common bus at 100 MHz.

Fig. 4 High-level clock system architecture (LCD stands for local clock driver)

There are 47 independent domain clocks. Jitter-reduction and skew-optimization schemes minimize clock inaccuracies to within 10% of the clock period. Jitter reduction consists of filtering the power supply of the clock-tree drivers and extensively shielding the clock wires from signal coupling; the dominant remaining jitter source is supply noise from logic switching. A simple low-pass RC filter, shown in Fig. 5, is designed to reduce the core supply noise seen by the clock buffers. Reducing supply noise reduces cycle-to-cycle jitter. Simulation of the filter circuit model (Fig. 5) with typical supply-noise waveforms shows up to a 5x reduction in noise amplitude on the filtered supply (Fig. 6).

Fig. 5 RC-filtered power supply for clock drivers
Fig. 6 Jitter with and without the RC filter

The clock distribution network has 47 independent LCDs. Each LCD includes skew-optimization capability, both to correct systematic skew and to provide intentional skewing of clocks. Sources of systematic skew include within-die variations of device channel length, threshold, width, and interlayer dielectric thickness. Although symmetrical layout methods are followed, layout and logic constraints prevent ideal placement; as a result, design, modeling, and extraction errors, together with layout mismatches, contribute to the systematic skew.

The main components of the skew-optimizer circuit are 47 adjustable-delay domain buffers (DB) and a phase-detector (PD) network of 46 phase detectors. The latency of the programmable buffer is negligible. The delay-adjustment controls for the domain buffers and the outputs of the phase detectors are accessible from a test access port (TAP), as depicted in Fig. 7. One domain buffer at the center of the die is chosen as the primary reference; the remaining buffers are categorized as secondary, tertiary, and final buffers. Fig. 7 depicts the phase-detector network which, coupled with the TAP, aligns the domain buffers to the primary reference input.

Fig. 7 Logical diagram of skew optimization circuit

The phase-detector output is read out into a scan chain controlled by the TAP, and based on the outcome the clock domain buffers are adjusted. This is repeated until all the secondary reference clocks are deskewed. Then, after the secondary reference delays have been adjusted, a second set of phase detectors adjusts the delay of the tertiary references. Similarly, the final-stage buffers are adjusted to the tertiary references. With this scheme, the skew is adjusted to within an accumulated error of about 8 ps, limited mainly by the resolution of the adjustable delay elements; in this particular design, the pre-adjusted skew is about 64 ps. The sketch below illustrates the iterative measure-and-adjust loop.
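This is a minimal sketch of the iterative deskew procedure described above: a phase detector compares each domain buffer against its reference, the result is scanned out through the TAP, and the adjustable delay is stepped until the clocks align. The names, the uniform 8-ps step, and the convergence rule are assumptions for illustration; the real scheme repeats this loop level by level (secondary, then tertiary, then final buffers).

```python
# Iterative deskew loop (illustrative model of the TAP-driven adjustment).

STEP_PS = 8.0   # assumed resolution of the adjustable delay elements (per text)

def deskew(skews_ps, max_iters=100):
    """Drive each buffer's skew (relative to its reference) toward zero."""
    skews = list(skews_ps)
    for _ in range(max_iters):
        done = True
        for i, s in enumerate(skews):
            if abs(s) > STEP_PS / 2:                # phase detector: early or late?
                skews[i] -= STEP_PS if s > 0 else -STEP_PS
                done = False
        if done:
            break
    return skews

# Pre-adjusted skews of roughly 64 ps converge to within the step resolution.
print(deskew([64.0, -40.0, 12.0]))   # -> [0.0, 0.0, 4.0], all within ~8 ps
```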
(2) An example of power saving in the interconnect

One of the important specifications of a bus architecture is power consumption, and interconnect often dominates it. This is true for on-chip as well as off-chip interconnects. On chip, an interconnect comprises a driver, a wire with total capacitance Cw, and a receiver with capacitive load CL. Off chip, a high-speed interconnect comprises a driver, an interconnect that is normally a 50-Ω transmission line, and a receiver with a termination resistor and an amplifier. In the system-on-a-chip case, the connections are all on chip. A model of a typical interconnect used for analyzing this question can be simplified as in Fig. 8.

Fig. 8 Interconnect model (values in parentheses are for the off-chip case)

One way to reduce the power consumption related to interconnect is to reduce the voltage swing used. When reducing the swing voltage on an interconnect, we normally need an amplifier at the receiver side to restore the swing to its normal value, and this amplifier has a power consumption related to its gain. An effort has been made to find an optimum swing at which the power used to drive the wire balances the power consumption of the receiver [6], so that the total power consumption is minimized. The analysis for the on-chip case is shown in Fig. 9.

Fig. 9 Total power versus input voltage swing for the on-chip case. Solid line: case 1; dashed line: case 2. Upper curves: α = 0.25; lower curves: α = 0.05.

The analysis assumes a 0.18-um CMOS process with CMOS logic swings, fc = 1 GHz, Vdd = 1.3 V, CL = 10 pF, and Cw = 1 pF; a Cw of 1 pF corresponds to a relatively long on-chip wire of 5-10 mm. The power consumption of the wire itself at full swing is 85 uW and 0.42 mW for activity α = 0.05 and α = 0.25, respectively. In case 1, the supply voltage is externally generated (for example, using a dc-dc converter with high efficiency); in case 2, the supply voltage is assumed to be generated internally on chip, for example by using series regulators.

For case 1 with α = 0.25, minimum power consumption occurs at a swing of V = 0.12 V (using a two-stage amplifier), with a total power saving of 17X. A single-stage amplifier may also be used, at a larger optimum swing of 0.26 V. At the lower activity, the optimum swing is higher, 0.35 V, and the saving considerably lower, 8X. For the more realistic case 2 (that is, using the ordinary supply voltage) with α = 0.25, the optimum voltage swing is 60 mV (using a two-stage amplifier) and the power saving is 8.5X; here the use of a single-stage amplifier limits the power saving to 4X. Again, at the lower activity, the optimum swing is larger, 130 mV, and the saving smaller, 3.4X.

These results show that an optimum voltage swing exists in a wide range of situations, depending on operating frequency, data activity, and the mechanism used to generate the reduced voltage. For the example process used, the optimum swings are of the order of 100 to 400 mV, leading to power savings of the order of 10X when a separate power supply is used for the driver. The wire-power arithmetic behind these figures is checked in the sketch below.
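As a back-of-the-envelope check of the numbers quoted above, the sketch below computes only the wire term P = α·f·Cw·V² and the quadratic benefit of swing reduction. The receiver-amplifier power that sets the true optimum swing requires the amplifier models of [6] and is not reproduced here; the constant names are illustrative.

```python
# Wire dynamic power versus voltage swing, using the stated parameters.

F_CLK = 1e9      # fc = 1 GHz
VDD   = 1.3      # full logic swing, V
C_W   = 1e-12    # wire capacitance Cw = 1 pF (a 5-10 mm on-chip wire)

def wire_power(activity: float, swing: float) -> float:
    """Dynamic power of the wire itself: alpha * f * Cw * Vswing^2 (watts)."""
    return activity * F_CLK * C_W * swing ** 2

# Full-swing figures match the text: ~85 uW (alpha = 0.05), ~0.42 mW (alpha = 0.25).
print(wire_power(0.05, VDD))   # ~8.45e-05 W
print(wire_power(0.25, VDD))   # ~4.23e-04 W

# At the quoted case-1 optimum swing of 0.12 V, the wire term alone drops by
# (1.3/0.12)^2 ~ 117x; the receiver amplifier consumes part of that margin,
# leaving the ~17x total saving reported in the text.
print((VDD / 0.12) ** 2)
```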
For the more realistic case of using the normal power supply, the savings are more limited, with optimum voltage swings of 60 to 120 mV. When repeaters are needed to alleviate RC wire delays, the optimization is much more constrained: the optimum voltage swing is then about 0.4-0.6 V (at a 1.3-V supply voltage) and the power saving about 2X.

4. Conclusions

1. Conclusions of this paper

It may seem that interconnect is the simplest component in an embedded system, but a closer look shows that it is complex and increasingly important to the whole system. More and more devices become involved in the interconnect; the clocking scheme is one example, involving PLLs, phase detectors, feedback control, and so on. Interconnect bus traces often dominate the power consumption of an integrated system. As operating frequencies rise, transmission-line theory must be applied to the analysis of interconnect traces. A robust interconnect architecture efficiently realizes complex system-on-a-chip designs and eases the application of high-level component reuse.

2. Experience

To write a technical paper on a given title, the first and most important step is to search for and collect the relevant scientific information. Previously (before the 1980s), I did this in the library, using the index catalogs of science journals and magazines to find interesting titles and then locating the bound journals on the shelves. I had to take notes or copy important paragraphs into my notebook (not a computer); a paper usually cost me several months. It is much easier to prepare this paper with a fairly detailed list of reference papers. The IEEE web site, however, cannot be accessed by non-members; I downloaded the forms to register for a student membership (it is just $30 but takes some days to be approved). Finally, a hint I got in class helped me use our university library website. For searching scientific magazines and journals, these are good resources:

Electronic Journals: http://www.sunysb.edu/library/jnlsI.html
Computer science: http://ws.cc.stonybrook.edu/library/csi.html

Of course, only Stony Brook University students can use these resources.

5. References

1. A. Jerraya and J. Mermet, System-Level Synthesis, Kluwer Academic Publishers.
2. "The IBM CoreConnect Bus Architecture."
3. "Processor Local Bus (64-bit)."
4. "Device Control Register Bus (32-bit)."
   (References 2-4 are available from http://www3.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture)
5. P. Restle et al., "A clock distribution network for microprocessors," IEEE J. Solid-State Circuits, vol. 36, pp. 792-799, May 2001.
6. C. Svensson, "Optimal voltage swing on on-chip and off-chip interconnect," IEEE J. Solid-State Circuits, vol. 36, pp. 1108-1112, July 2001.