Design Feature: March 30, 1995 http://www.ednmag.com/ednmag/reg/1995/033095/07df4.htm

DESIGNING PCI-COMPLIANT MASTER/SLAVE INTERFACES FOR ADD-ON CARDS

Bernie Rosenthal and Ron Sartore, Applied Micro Circuits Corp

As the PCI bus becomes the interface of choice for most desktop systems, it's clear that the benefits of high bandwidth and plug-and-play operation do not come easily. Unlike its ISA and EISA predecessors, the PCI bus presents a number of electrical, physical, and functional issues you need to understand. System designers familiar with the earlier buses now need to expend significant effort to build a fully compliant PCI adapter card or add-in board.

The PCI bus is a synchronous, processor-independent 32- or 64-bit local bus that operates at 5V, 3.3V, or a combination of both. The bus is forward- and backward-compatible with multiple 32- and 64-bit PCI components and add-in boards, and it currently operates at clock speeds up to 33 MHz with a compatible migration path to 66 MHz.

The PCI bus provides substantial performance gains over the ISA or EISA buses (see Fig 1). With a 33-MHz clock rate, PCI data-transfer rates are as high as 132 Mbytes/sec or, in the case of a 64-bit bus, 264 Mbytes/sec. Data transfers may occur in "bursts" that ensure the bus is always filled with data. The PCI spec allows both burst reads and burst writes. Both are important in applications such as high-performance graphics accelerators, where the majority of data transfers are writes from the CPU to the frame buffer.

Systems that use the PCI bus usually do not have main memory on the PCI side, so transactions between the CPU and memory do not degrade PCI performance. This arrangement permits concurrent operations on the CPU's memory bus and the PCI bus. For example, the CPU can work on applications in memory while bus transfers simultaneously occur between an image frame-buffer card and a compression coprocessor.

Perhaps more important than raw bandwidth or concurrency, the PCI bus allows the host system to configure adapter cards automatically. Dubbed "plug and play," this feature eliminates the need to set jumpers and switches for the adapter card to function properly. With a series of programmable and examinable address-decoder, interrupt, and configuration registers, the system can treat all PCI plug-and-play add-on peripherals similarly.

The PCI bus is radically different from the 8/16-MHz ISA bus. The ISA bus became popular because it was relatively easy for any system or board manufacturer to develop ISA products. That situation isn't true for the PCI local bus. Potential pitfalls involving electrical, physical, and functional compliance lurk in the PCI spec. In each category, you'll need to contend with a number of highly detailed issues.

For example, output-buffer drivers for PCI bus signals are specified with both minimum and maximum ac switching currents. This method departs from the traditional dc output description used for various TTL and CMOS logic families, FPGAs, and PLDs. For these logic devices, IOL and IOH (output low and output high currents, respectively) are dc parameters and are usually specified on the data sheet at an exact value (6 mA and -2 mA, respectively). The PCI spec, in contrast, defines IOH and IOL over the ac switching range, across output voltages of 0 to 1.4V and 1.4 to 2.4V, respectively. In other words, the PCI spec defines switching currents in the transition regions, whereas conventional logic specifies only dc output currents at a given logic state.
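The peak figures above follow directly from the clock rate and the bus width. The short C sketch below works through that arithmetic; it is illustrative only and assumes a full bus width of data moves on every clock, which only a sustained burst achieves.

#include <stdio.h>

/* Peak PCI bandwidth = clock rate x bytes transferred per clock.
 * Illustrative arithmetic only: a sustained burst moves one full
 * bus width of data on every clock edge. */
int main(void)
{
    const double clock_hz = 33.0e6;   /* 33-MHz PCI clock        */
    const double bytes_32 = 4.0;      /* 32-bit bus: 4 bytes/clk */
    const double bytes_64 = 8.0;      /* 64-bit bus: 8 bytes/clk */

    printf("32-bit peak: %.0f Mbytes/sec\n", clock_hz * bytes_32 / 1.0e6); /* 132 */
    printf("64-bit peak: %.0f Mbytes/sec\n", clock_hz * bytes_64 / 1.0e6); /* 264 */
    return 0;
}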
Although unlikely, some conventionally specified output buffers may be acceptable for driving the PCI bus. However, it is unclear which ones comply with the specification until manufacturers properly model, simulate, and characterize them. In effect, you take a big risk when using conventional logic to drive the PCI bus, because a poor choice of interface buffers can render your PCI-interface design noncompliant.

Because the PCI bus is synchronous, virtually all the logic involved in any high-performance data and control path requires a copy of the PCI clock. This situation is made even more difficult by the spec requirement that allows only one input load per bus slot. A phase-locked loop (PLL) could help here by operating as a "zero-delay" buffer. However, PLLs cannot work properly in this situation, because the PCI spec allows the bus clock, sourced by the motherboard, to operate anywhere from dc to 33 MHz. To further preclude a PLL solution, the PCI spec allows instantaneous changes in bus-clock speed as long as the minimum clock-pulse durations are not less than those of a 33-MHz clock. Implementing certain complex functions on the PCI bus, especially 32-bit burst transfers, ultimately requires a great many clock loads. This clock-fan-out problem is exacerbated by another PCI requirement: a tight timing budget of 11 nsec from clock to data out.

A third issue deals with the sheer number of gates necessary to build the basic functions the PCI spec mandates. You can easily approach 10,000 gates when building the 36-bit parity generator/checker, programmable address decoders, command/status registers, and various other required configuration registers. You need those functional elements just to achieve basic compliance with the PCI spec, without implementing any extra frills such as FIFO buffers and user-specific registers.

Today's data-intensive applications require you to fully understand the expected performance of your system bus. For example, you need to understand the transfer characteristics of the bus to ensure that a full-motion video system isn't going to be jerky or that a data-transmission broadcast can be received (stored) in its entirety. In short, to determine whether a given configuration can accomplish a specific performance goal, you must know how the bus will behave.

Many designers perceive that the PCI bus specification, with its 132-Mbyte/sec transfer rate, alleviates bus-performance concerns. Although the PCI bus offers superior transfer rates over the ISA and EISA buses, determining the expected performance of a PCI-based system is not a simple task. System performance depends not only on the performance of the PCI bus-interface devices but also on the environment in which they operate. The selection of processor, bus-arbitration scheme, memory bus, and motherboard chip set can influence the attainable bandwidth of an adapter function. PCI bus traffic between devices other than the CPU can reduce the expected bandwidth even further.

Key implementation issues

The most important PCI bus specifications deal with the configuration-space header, read and write behaviors, and ac electrical and timing parameters. The PCI local bus specification emphatically states that "all PCI devices must implement configuration space."
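To give a feel for one of the mandated functions, the sketch below models the 36-bit parity generation mentioned above. On the PCI bus, PAR establishes even parity across AD[31::00] and C/BE[3::0]#: the 36 signal bits plus PAR always carry an even number of 1s. The real function is, of course, combinational hardware; this C model only illustrates the logic.

#include <stdint.h>

/* Model of PCI parity generation: PAR makes the 36 bits of
 * AD[31::00] plus C/BE[3::0]#, together with PAR itself, even parity.
 * Returns the value a device would drive (or check) on PAR. */
static unsigned pci_par(uint32_t ad, uint8_t cbe /* low 4 bits used */)
{
    uint64_t bits = ((uint64_t)(cbe & 0xF) << 32) | ad;  /* 36 bits total */
    unsigned par = 0;
    while (bits) {            /* XOR-reduce: parity of the 36 bits */
        par ^= (unsigned)(bits & 1);
        bits >>= 1;
    }
    return par;               /* drive this on PAR; compare against received PAR to check */
}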
PCI's configuration-register space provides an appropriate set of configuration hooks, satisfying the needs of current and anticipated system-configuration mechanisms without actually specifying those mechanisms or otherwise constraining their use. The PCI spec divides configuration space into a predefined header region and a device-dependent region. Devices can implement only the necessary and relevant registers in each region. A device's configuration space must be accessible at all times, not just during system boot initialization.

The predefined header region consists of fields that uniquely identify the device and allow it to be generically controlled. The predefined header portion of PCI's configuration space divides into two parts. The first 16 bytes are the same for all device types; the remaining bytes can have different layouts depending on the base function the device supports. The Header Type field (located at offset 0Eh in the configuration-space header) defines which layout is provided. Currently, there are two defined header types: type 01h, defined for PCI-to-PCI bridges and documented in the PCI-to-PCI bridge architecture specification, and type 00h (Fig 2), which all other types of PCI devices currently use.

System software may need to scan the PCI bus to determine what devices are actually present. To do this, configuration software must read the vendor ID field in each possible PCI slot. The host-bus-to-PCI bridge must unambiguously report attempts to read the vendor ID of nonexistent devices. Because 0FFFFh is an invalid vendor ID, it is adequate for the host-bus-to-PCI bridge to return a value of all "1s" on read accesses to configuration-space registers of nonexistent devices. These accesses ultimately terminate with a master abort.

All PCI devices must treat configuration-space write operations to reserved registers as no-ops; that is, the accesses complete normally on the bus, and the data is discarded. Read accesses to reserved-but-unimplemented registers must complete normally and return a data value of 0.

Fig 2 shows the layout of the Type 00h predefined header portion within the 256-byte configuration space. PCI devices must place any device-specific registers after the predefined header in configuration space. All multibyte numeric fields follow little-endian ordering; that is, lower addresses contain the least significant parts of the field.

Software must work correctly with bit-encoded fields that have some bits reserved for future use. On reads, software must use appropriate masks to extract the defined bits and may not rely on reserved bits having any particular value. On writes, software must ensure that the values of reserved bit positions are preserved: software must first read the register, merge the new values for the defined bit positions with the existing reserved-bit values, and then write the data back. All PCI-compliant devices must support the vendor ID, device ID, command, status, revision ID, class-code, and header-type fields in the header. Implementation of the other registers in a Type 00h header is optional.

From a compatibility standpoint, the configuration-space header is important because the host system's BIOS (basic I/O system) manipulates it. If you fail to implement the configuration space correctly, it's likely that your add-in feature won't operate. Worse yet, your design may operate properly on certain platforms and behave mysteriously on others.
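As a concrete reference for the Type 00h layout and the software rules just described, here is a minimal C sketch. The register offsets match the predefined header of Fig 2; the cfg_read16, cfg_read32, and cfg_write32 accessors are hypothetical stand-ins for whatever configuration-access mechanism the host chip set provides.

#include <stdint.h>

/* Selected Type 00h predefined-header offsets (the first 16 bytes are
 * common to all header types). */
#define PCI_VENDOR_ID    0x00  /* 16 bits */
#define PCI_DEVICE_ID    0x02  /* 16 bits */
#define PCI_COMMAND      0x04  /* 16 bits */
#define PCI_STATUS       0x06  /* 16 bits */
#define PCI_REVISION_ID  0x08  /*  8 bits */
#define PCI_CLASS_CODE   0x09  /* 24 bits, 09h-0Bh */
#define PCI_HEADER_TYPE  0x0E  /*  8 bits; 00h = ordinary device, 01h = PCI-to-PCI bridge */
#define PCI_BAR0         0x10  /* first base-address register */

/* Hypothetical configuration-space accessors supplied by the host bridge. */
uint16_t cfg_read16(int bus, int dev, int fn, int off);
uint32_t cfg_read32(int bus, int dev, int fn, int off);
void     cfg_write32(int bus, int dev, int fn, int off, uint32_t val);

/* Bus scan: a device is present only if its vendor ID is not all 1s,
 * the value the host bridge returns for nonexistent devices. */
int pci_device_present(int bus, int dev)
{
    return cfg_read16(bus, dev, 0, PCI_VENDOR_ID) != 0xFFFF;
}

/* Reserved-bit rule: read, merge only the defined bits, write back. */
void pci_update_defined_bits(int bus, int dev, int fn, int off,
                             uint32_t defined_mask, uint32_t new_bits)
{
    uint32_t val = cfg_read32(bus, dev, fn, off);   /* preserve reserved bits */
    val = (val & ~defined_mask) | (new_bits & defined_mask);
    cfg_write32(bus, dev, fn, off, val);
}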
When we first designed our general-purpose PCI-interface chip, we failed to treat unused registers in the configuration space properly. Rev 2.0 of the PCI specification did not define the use of an unused base-address register. The resulting ambiguity affected both our device and various BIOS codes and their treatment of "all 1s" or "all 0s" for the address-register content. This example points to areas where interpretation of the specification by two independent parties can unknowingly lead to interoperability problems.

PCI READ AND WRITE TRANSACTIONS

Read and write transactions take place between a bus master and a target. In its simplest form, a read transaction starts with an address phase, which occurs when FRAME# asserts for the first time on clock 2 (Fig 3A). During the address phase, AD[31::00] (the 32 address/data signals) contain a valid address, and C/BE[3::0]# (the command/byte-enable signals) contain a valid bus command.

The first clock of the first data phase is clock 3. During the data phase, the C/BE# signals indicate which byte lanes are involved in the current data phase. A data phase consists of some number of wait cycles and a data-transfer cycle. The C/BE# output buffers must remain enabled for both reads and writes from the first clock of the data phase through the end of the transaction. This requirement ensures that the C/BE# signals do not float for long intervals. The C/BE# lines contain valid byte-enable information during the entire data phase, independent of the state of IRDY# (initiator ready). The C/BE# lines contain the byte-enable information for data phase N+1 on the clock following the completion of data phase N. Fig 3A doesn't show this sequence because a burst-read transaction typically asserts all byte enables. However, Fig 3B shows this type of transaction. Notice that on clock 5, the bus master inserts a wait state by negating IRDY#. However, the byte enables for data phase 3 are valid on clock 5 and remain valid until the data phase completes on clock 8.

BUS TURNAROUNDS

The first data phase on a read transaction requires a turnaround cycle, which the bus target enforces via TRDY# (target ready). During the read cycle, the address is valid on clock 2, and then the bus master stops driving the AD (address/data) lines. The earliest the bus target can provide valid data is clock 4. The target must drive the AD lines following the turnaround cycle once DEVSEL# (device select) asserts. Once enabled, the target's output buffers must stay enabled to the end of the transaction.

A data phase can complete when data transfers, that is, when both IRDY# and TRDY# assert on the same rising clock edge. However, the target cannot assert TRDY# until DEVSEL# asserts. When either IRDY# or TRDY# is negated, a wait cycle occurs and no data is transferred. As Fig 3A shows, data successfully transfers on clocks 4, 6, and 8, and wait cycles occur on clocks 3, 5, and 7. The first data phase shown in Fig 3A completes in the minimum time for a read transaction. The second data phase is extended on clock 5 because TRDY# is negated. The last data phase is extended because IRDY# is negated on clock 7. The bus master knows, at clock 7, that the next data phase is the last. However, because the master is not ready to complete the last transfer, it negates IRDY# on clock 7, and FRAME# remains asserted. Only when IRDY# asserts can FRAME# be negated, which occurs on clock 8. Fig 3B shows a write transaction.
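The handshake rule in the preceding paragraphs (a data phase completes only on a rising clock edge where both IRDY# and TRDY# are asserted, and TRDY# cannot assert before DEVSEL#) is easy to capture in a small model. The C sketch below is a behavioral illustration of that rule, not a description of any particular controller; the flags are modeled active-high for readability even though the real signals are active-low.

#include <stdbool.h>

/* Sampled state of the relevant PCI signals on one rising clock edge.
 * Each flag is true when the corresponding active-low signal
 * (FRAME#, IRDY#, TRDY#, DEVSEL#) is asserted on the bus. */
struct pci_edge {
    bool frame;   /* FRAME# asserted: more data phases follow this one */
    bool irdy;    /* initiator ready */
    bool trdy;    /* target ready */
    bool devsel;  /* target has claimed the transaction */
};

/* A data phase completes on an edge where both agents are ready.
 * Any other combination is a wait cycle. A compliant target never
 * asserts TRDY# before DEVSEL#, so trdy without devsel is treated
 * as no transfer. */
static bool data_transfers(const struct pci_edge *e)
{
    return e->devsel && e->irdy && e->trdy;
}

/* The last data phase is the one in which the master has negated
 * FRAME# while keeping IRDY# asserted. */
static bool last_data_phase(const struct pci_edge *e)
{
    return e->irdy && !e->frame;
}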
The transaction starts when FRAME# asserts for the first time, which occurs on clock 2. A write transaction is similar to a read transaction except that no turnaround cycle is required following the address phase, because the master provides both address and data. Data phases work the same for both read and write transactions. The first and second data phases complete with zero wait cycles (Fig 3B). However, in this example, the target inserts three wait cycles in the third data phase. Note that both the master and the target insert a wait cycle on clock 5. To indicate the last data phase, IRDY# must be asserted when FRAME# is negated. The master delays the data transfer on clock 5 by negating IRDY#. The master signals the last data phase on clock 6, but the phase does not complete until clock 8.

Although implementing the nominal conditions associated with PCI bus transactions is relatively straightforward, addressing the eventual (but less frequent) exceptional cases can prove difficult. For example, a burst transaction can burst beyond the allocated region for a given target. In this case, the active target must disconnect, and the bus master must reissue the address phase to select another target. Naturally, this sequence must occur without losing or mistransferring data. Both the master and the target must therefore keep track of the current address during burst transfers, which increases the complexity of the logic involved.

Other cases further complicate master- and target-control-logic design. A master must accommodate situations such as a target's request to disconnect with (or without) data, removal of GRANT by the bus arbiter, detection of error conditions (such as parity errors), and target abort. And, of course, all control-logic designs must handle transfer latencies caused by wait states that masters or targets introduce.

A PCI INTERFACE CHIP FOR ADD-ON CARDS

The AMCC S593X, or PCI Matchmaker, family interfaces to virtually any major embedded µP, such as Intel's i960, Motorola's 68000, or Texas Instruments' TMS320, as well as to many discrete-logic configurations. At the lowest level, the S593X serves as a PCI bus target with modest data-transfer abilities. At the highest level, PCI Matchmaker can act as a bus master with peak transfer capabilities of 132 Mbytes/sec on a 32-bit PCI bus.

Address decoding, address sourcing, burst transfers, and all elements necessary to perform efficient and timely data transfers reside within the device. Also included is a bidirectional, 32-bit-wide FIFO buffer for system-to-system synchronization and data transfers between the PCI local bus and the add-on product. One of the S593X's key features is the built-in circuitry that automatically converts the "big-endian" data structures typically used in Motorola-based systems into the "little-endian" format common in Intel-based systems. Because the PCI bus standard allows both types of endian assignment, hardware conversion provides the highest-performance method of exchanging data between the two formats.

The S593X incorporates three physical bus interfaces: one to the PCI bus, another to the add-on interface bus, and the third to an optional external nonvolatile memory. PCI Matchmaker also provides designers with a connection to an inexpensive serial EPROM that can act as a BIOS ROM for code generation and storage.
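The big-endian-to-little-endian conversion the S593X performs in hardware corresponds to a simple byte swap on each 32-bit word. The C sketch below shows the equivalent software operation purely as an illustration; it is not the chip's implementation, and the function name is ours.

#include <stdint.h>

/* Swap the four bytes of a 32-bit word: the software equivalent of
 * converting a big-endian word to little-endian byte order (and back,
 * since the operation is its own inverse). */
static uint32_t swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |
           ((w & 0x0000FF00u) <<  8) |
           ((w & 0x00FF0000u) >>  8) |
           ((w & 0xFF000000u) >> 24);
}

/* Example: 0x12345678 becomes 0x78563412. */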
You can connect a custom BIOS EPROM to perform any preboot initialization required of the add-on function; you can connect external ROM, EPROM, or NVRAM through either byte-wide or serial interfaces. The external nonvolatile memory may serve as expansion BIOS. Data can move between the PCI bus and the add-on bus or between the PCI bus and nonvolatile memory. Transfers between the PCI and add-on buses execute through mailbox registers, FIFO buffers, or a pass-through data path. FIFO-buffer transfers through the PCI bus interface can occur under software control or through hardware using the S593X as bus master.

PCI AC PARAMETRICS

Table 1 and Table 2 show ac electrical and timing specifications for the PCI bus. As expected, the performance the PCI bus delivers imposes timing constraints that are not trivial. In particular, an input setup time of 7 nsec for an address decoder requires high-speed logic and an optimized decoding structure. Another challenging requirement is the clock-to-data-valid path of 11 nsec. Again, high-speed logic is necessary, as are careful layout and ground-bounce management, because of the relatively high number of simultaneously switching outputs.

Compliance with the PCI bus specification involves many facets of product design. Consider not only the electrical driver parameters already mentioned but also how close the PCI logic is physically located to an add-in card's edge connector. (All bus signals must be within 1.5 in. of the connector, and the clock must be within 2.5 in. (±0.1 in.).)

As obvious as mechanical- and electrical-compliance issues are, issues involving the functional interface with a platform's BIOS are not. Most of the difficulties a PCI device encounters occur during host initialization. Many of the problems relate to the interactive process of establishing the address assignment for PCI devices. This process involves both the host's BIOS and every other PCI device's configuration space. For example, should a PCI device request a contiguous I/O space of greater than 256 bytes, an ISA/PCI system may have a problem granting the request without causing an address-assignment conflict.

Other problems may arise when a PCI add-in device powers up "enabled." Contrary to your initial instinct, the PCI specification requires that all devices, even the boot source and display adapter, power up disabled. This scheme allows the host to complete address assignments as part of its initialization sequence, and it makes the add-in device reliant on the host system to assign its address and enable it. Otherwise, two identical add-in cards installed in a system could contend for the same address location on power-up.

Most high-performance applications have a desired minimum bus bandwidth. This goal often translates into the minimum onboard storage required for an add-in card as well as the overall performance needed for proper application execution. Bus-bandwidth limitations are often so important that a product's success or failure may hinge on enough bandwidth being available to satisfy application needs. If you can't ship the application data across the bus, you have to store it locally on the add-in card and then relay it to the final destination. Local storage increases both the board's cost and the latency of the data transfer. If the increase in cost and decrease in performance (caused by added latency) become large enough, the product becomes uncompetitive or simply not feasible. The PCI bus does not entirely alleviate bandwidth worries.
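To make the initialization sequence described above concrete, here is a hedged C sketch of how a host's configuration code conventionally sizes a base-address register, assigns an address, and only then enables the device through its command register. The BAR-sizing handshake (write all 1s, read back) is not described in this article and is included here as the customary mechanism; the cfg_read32 and cfg_write32 accessors are the same hypothetical stand-ins used in the earlier listing.

#include <stdint.h>

#define PCI_COMMAND          0x04
#define PCI_COMMAND_MEMORY   0x0002   /* respond to memory-space accesses */
#define PCI_BAR0             0x10

uint32_t cfg_read32(int bus, int dev, int fn, int off);            /* hypothetical */
void     cfg_write32(int bus, int dev, int fn, int off, uint32_t val);

/* Size a memory BAR: write all 1s, read back, mask the flag bits; the
 * two's complement of what remains is the size the device requests. */
static uint32_t bar_size(int bus, int dev, int fn, int bar_off)
{
    uint32_t saved = cfg_read32(bus, dev, fn, bar_off);
    cfg_write32(bus, dev, fn, bar_off, 0xFFFFFFFFu);
    uint32_t probed = cfg_read32(bus, dev, fn, bar_off) & ~0xFu;    /* drop flag bits */
    cfg_write32(bus, dev, fn, bar_off, saved);                      /* restore */
    return ~probed + 1u;
}

/* Assign an address and only then enable the device, which the spec
 * requires to power up disabled. (The upper half of this dword is the
 * status register; writing zeros to it leaves it unchanged.) */
static void assign_and_enable(int bus, int dev, int fn, uint32_t base)
{
    cfg_write32(bus, dev, fn, PCI_BAR0, base);
    uint32_t cmd = cfg_read32(bus, dev, fn, PCI_COMMAND) & 0xFFFFu;
    cfg_write32(bus, dev, fn, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
}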
At present, many system implementations cannot support PCI's often-touted 132-Mbyte/sec bandwidth when transfers occur between an add-in card and main system memory. Actual measurements on today's popular PCs reveal that read-access delays are commonly between eight and eleven PCI bus clocks from the time the address is initially provided to the return of the first 32-bit data word. Burst transfers may then follow the first data word, but bursts typically suspend once the eighth data transfer is performed. Because these access-delay cycles correspond to delays of 240 to 300 nsec, it is apparent that future PCI systems need improvement over today's state of the art. Until then, for today's machines, you can optimistically expect the PCI bus bandwidth into main memory (including bus acquisition) to be just a bit over 40 Mbytes/sec. Naturally, you should expect increased throughput to main memory as the PCI bus matures.

Early PCI bus-interface designs used discrete ICs. However, that approach is costly and doesn't fully comply with the specification. Programmable logic, such as an FPGA, is a more promising approach in certain cases. Assuming its I/O drivers are PCI-compliant, an FPGA is acceptable whether the designer implements a bus master or a bus slave. Altera claims that its MAX7000-family devices (up to the 7128) are PCI-compatible; the Xilinx XC73XX family, ranging from the 18-macrocell 7318 to the 108-macrocell 73108, reportedly complies with all points in the PCI checklist.

ASICs offer another alternative. Vendors are providing the necessary ASIC cores for structuring a PCI bus interface. However, ASICs are expensive and continue to pose a time-to-market risk. In some cases, core implementations do not guarantee designers PCI compliance. Although the I/O drivers and transfer state machines for these ASIC cores may comply with the PCI specification, there are other compliance issues to consider, too. You need to know whether the ASIC core handles PCI subtleties, such as a delayed transaction, and whether the core implements all of the PCI configuration space.

The ASIC alternative

Even when an ASIC core is used to create a PCI bus interface, you must still possess a modest understanding of the PCI bus spec. For example, you are required to obtain a vendor and device ID number assignment from the PCI SIG (or borrow one), construct a configuration-space region that properly describes your function, and ensure that your custom function meets the timing and behavioral requirements of the PCI spec.

Specialized single-chip PCI bus controllers, such as AMCC's PCI Matchmaker IC (Fig 4), do provide PCI compliance. Some PCI-interface devices require a processor to drive the application side of the device interface. Others, such as Matchmaker, do not require a processor on the application side, a trait that can be advantageous in a design requiring a master and a slave on the same board.

Once you've prototyped a PCI add-in product, you must subject it to various tests to ensure that it works properly with other PCI products. You have several ways to accomplish this. Because it is impractical and expensive to acquire access to every PCI system built, you need alternative approaches to compliance verification. The PCI SIG (Portland, OR, (503) 797-4297) has orchestrated one of the better methods, termed "PCI Compliance Workshops." The PCI SIG coordinates a gathering of interested members to verify their products (often prototypes and preproduction designs) with each other.
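The 40-Mbyte/sec figure can be sanity-checked with a simple model. The C sketch below assumes, per the measurements above, an 8-to-11-clock delay to the first data word and bursts that suspend after eight transfers; the extra clocks of re-arbitration and turnaround overhead between bursts are our assumption, included only to show how quickly the effective rate falls from the 132-Mbyte/sec peak.

#include <stdio.h>

/* Rough model of sustained PCI read bandwidth into main memory.
 * Each burst moves eight 32-bit words (32 bytes) and costs:
 *   access_delay clocks to the first data word (measured: 8 to 11)
 * + 8 clocks of data transfer
 * + overhead clocks for re-arbitration/turnaround (assumed value).
 * One PCI clock at 33 MHz is roughly 30 nsec. */
int main(void)
{
    const double clk_ns = 30.0;
    const double bytes_per_burst = 32.0;
    const int overhead = 3;                     /* assumption, not measured */

    for (int access_delay = 8; access_delay <= 11; access_delay++) {
        double clocks = access_delay + 8 + overhead;
        double mbytes_per_sec = bytes_per_burst / (clocks * clk_ns * 1e-9) / 1e6;
        printf("delay %2d clocks -> ~%.0f Mbytes/sec\n",
               access_delay, mbytes_per_sec);   /* roughly 48 to 56 */
    }
    return 0;
}

With a few additional clocks of bus-acquisition time per burst, the result drops toward the roughly 40 Mbytes/sec quoted above.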
These compliance workshops are held quarterly, usually in the San Francisco Bay area, and involve motherboard makers, system vendors, chip-set manufacturers, BIOS companies, and various add-in-product manufacturers. The PCI SIG also attends and provides a battery of compliance tests of its own. Test results remain confidential to participants. On successful completion of a compliance-workshop session, coupled with the submission of the PCI-compliance checklist (which runs to more than 100 pages), your product can then appear on the PCI-member "Confidential Integrators List." The PCI SIG maintains the integrators list so that SIG members can reduce the risk of noncompliance by using only components, BIOS programs, or entire motherboards known to be compliant.

Another way to make the list is to use an independent testing organization such as National Software Testing Laboratories (NSTL, Conshohocken, PA, (610) 941-9600) or VeriTest (Santa Monica, CA, (310) 450-0062). For a modest fee, NSTL will test your product in a number of environments. On successful completion of the tests and submission of the compliance checklist, your product will appear on the integrators list.

Authors' biographies

Bernie Rosenthal, director of the Computer Products Business Unit at AMCC, holds a BSEE and an MS in industrial and systems engineering as well as an MBA from the University of Southern California. He has held a number of marketing and sales positions at both AMCC and TRW Inc.

Ron Sartore is the chief architect for PCI interface products at AMCC. He has implemented several award-winning designs, among them the Cheetah Gold 486 featured in EDN's All-Star PC series (March 1990). Sartore has a BSEE from Purdue University.