Fast floating-point performance smooths the drawing of 3D meshes and animation effects and adds depth complexity to the scene. The next step is to add lifelike realism and depth. To do this, the PC must render the 3D images by adding textures, alpha-blended transparencies, texture-mapped lighting, and other effects. AGP technology accelerates graphics performance by providing a dedicated high-speed port for the movement of large blocks of 3D texture data between the PC's graphics controller and system memory.

Scaling to Even Higher Bandwidth

The AGP interface, positioned between the PC's chipset and graphics controller, significantly increases the bandwidth available to a graphics accelerator (current peak bandwidth is 528 MB/s). AGP lays a scalable foundation for high-performance graphics in future systems, with support for a peak bandwidth of over 1 GB/s.

Today's 3D applications have a huge appetite for memory bandwidth. By providing a high-bandwidth "fast lane" for graphics data, AGP enables the hardware-accelerated graphics controller to execute texture maps directly from system memory, instead of caching them in the relatively limited local video memory. It also helps speed the flow of decoded video from the CPU to the graphics controller. 3D applications will also run faster when the need to pre-fetch and cache textures in local video memory is eliminated. By minimizing the need for video memory, AGP helps developers control the costs of new designs. Removing video traffic from the PCI bus also delivers better stability.

Boost Your AGP Learning Curve

Take a few moments to explore AGP technology in our AGP Tutorial. It outlines what you need to know about PCs and software applications optimized for AGP. Hardware developers who need to drill down deeper can download the AGP Interface Specification version 2.0 and the latest engineering revisions. Further detailed information on electromechanical implementation issues and thermal design guidelines is available in the newly updated AGP Platform Design Guide revision 1.1.

Introduction

This tutorial provides a technical introduction to the AGP interface. We first review the manner in which 3D graphics are currently processed on the PC, along with some of the more problematic issues. We then discuss how AGP deals with these issues and enhances the ability of mainstream PCs to handle sophisticated 3D graphics applications. Finally, we explore the implications of AGP for software developers.

Users of this tutorial are assumed to understand the basic architecture of the PC and the functions performed in the 3D graphics pipeline. Background information on the PC architecture can be found in any good text on PC computing technology, such as Peter Norton's Inside the PC, Sixth Edition, from SAMS Publishing. Intel's Graphics Web site provides an in-depth primer on 3D graphics technology.

Table of Contents

Chapter 1: 3D Graphics on Current Generation PCs
Chapter 2: 3D Graphics on Next Generation PCs
Chapter 3: A Closer Look at AGP Data Transfers
Chapter 4: AGP Memory Mapping
Chapter 5: A Summary of AGP's Benefits
Chapter 6: What This Means for Software Developers

Chapter 1: 3D Graphics on Current Generation PCs

AGP is a new interface on the PC platform that dramatically improves the processing of 3D graphics and full-motion video. To fully understand the impact of AGP technology, it is necessary to first review how 3D graphics are currently supported on the PC platform without AGP.
Lifelike, animated 3D graphics require a continuous series of processor-intensive geometry calculations that define the position of objects in 3D space. Typically, geometry calculations are performed by the PC's processor because it is well suited to the floating-point operations required. At the same time, the graphics controller must process texture data in order to create lifelike surfaces and shadows within the 3D image.

The most critical aspect of 3D graphics is the processing of texture maps, the bitmaps which describe in detail the surfaces of three-dimensional objects. Texture map processing consists of fetching one, two, four, or eight texels (texture elements) from a bitmap, averaging them together based on some mathematical approximation of the location in the bitmap (or multiple bitmaps) needed on the final image, and then writing the resulting pixel to the frame buffer (a minimal sketch of this averaging appears at the end of this chapter). The texel coordinates are non-trivial functions of the 3D viewpoint and the geometry of the object onto which the bitmap is being projected.

Figure 1 shows how the processing of texture maps is currently supported on the PC. As shown, there are five basic steps involved in processing textures.

1. Prior to their usage, texture maps are read from the hard drive and loaded into system memory. The data travels via the IDE bus and chipset before being loaded into memory.
2. When a texture map must be used for a scene, it is read from system memory into the processor. The processor performs point-of-view transformations upon the texture map and then caches the results.
3. Lighting and viewpoint transforms are then applied to the cached data. The results of this operation are subsequently written back to system memory.
4. The graphics controller then reads the transformed textures from system memory and writes them into its local video memory (also called graphics controller memory, the frame buffer, or off-screen RAM). In present-day systems, this data must travel to the graphics controller over the PCI bus.
5. The graphics controller next reads the textures plus 2D color information from its frame buffer. This data is used to render a frame which can be displayed on the 2D monitor screen. The result is written back into the frame buffer. The system's digital-to-analog converter then reads the frame and converts it to an analog signal that drives the display.

The reader may notice a number of problems with the way texture maps are currently handled. First, textures must be stored in both system memory and the frame buffer; redundant copies are an inefficient use of memory resources. Second, storing the textures in the frame buffer, even temporarily, places a ceiling on the size of the textures. There is a demand for textures with greater and greater detail, pressuring hardware manufacturers to put more frame buffer memory in their systems. However, this type of memory is quite expensive, so this is not an optimal solution. Finally, the 132 MBytes/s bandwidth of the PCI bus limits the rate at which texture maps can be transferred to the graphics subsystem. Furthermore, in typical systems several I/O devices on the PCI bus must share the available bandwidth, and the introduction of other high-speed devices, such as Ultra DMA disk drives and 100 MBytes/s LAN cards, makes the congestion even worse. It is easy to see how congestion on the PCI bus can limit 3D graphics performance on a PC.
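To make the texel-averaging step described at the start of this chapter more concrete, here is a minimal sketch of a bilinear filter in C++. The Texel and Texture types, the clamp addressing, and the 8-bit channels are illustrative assumptions; they are not taken from any particular graphics controller or driver.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical 8-bit-per-channel texel; real hardware supports many formats.
    struct Texel { uint8_t r, g, b, a; };

    // Hypothetical texture map: a width x height grid of texels.
    struct Texture {
        int width, height;
        std::vector<Texel> data;
        const Texel& at(int x, int y) const {
            // Clamp addressing; real hardware may also wrap or mirror.
            x = std::max(0, std::min(x, width - 1));
            y = std::max(0, std::min(y, height - 1));
            return data[static_cast<std::size_t>(y) * width + x];
        }
    };

    // Bilinear filtering: fetch the four texels surrounding (u, v) in texel
    // space and blend them by the fractional distance to each. This is the
    // "fetch several texels and average them" step described in the text.
    Texel sampleBilinear(const Texture& tex, float u, float v) {
        int x0 = static_cast<int>(std::floor(u));
        int y0 = static_cast<int>(std::floor(v));
        float fx = u - x0;    // horizontal blend weight
        float fy = v - y0;    // vertical blend weight

        const Texel& t00 = tex.at(x0,     y0);
        const Texel& t10 = tex.at(x0 + 1, y0);
        const Texel& t01 = tex.at(x0,     y0 + 1);
        const Texel& t11 = tex.at(x0 + 1, y0 + 1);

        auto lerp = [](float a, float b, float t) { return a + (b - a) * t; };
        auto blend = [&](uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
            float top    = lerp(a, b, fx);
            float bottom = lerp(c, d, fx);
            return static_cast<uint8_t>(lerp(top, bottom, fy) + 0.5f);
        };

        return Texel{ blend(t00.r, t10.r, t01.r, t11.r),
                      blend(t00.g, t10.g, t01.g, t11.g),
                      blend(t00.b, t10.b, t01.b, t11.b),
                      blend(t00.a, t10.a, t01.a, t11.a) };
    }

Trilinear filtering simply repeats this lookup on two adjacent MIP levels and blends the results, which is where the figure of eight texels per pixel comes from.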
Currently, applications employ several strategies to compensate for the limitations inherent in present-day PCs.

Applications use a caching or "swapping" algorithm to decide which textures should be stored in local frame buffer memory versus system memory. Typically, applications dedicate a portion of off-screen local memory as frame-to-frame texture swapping space, while the remaining off-screen memory contains commonly used "fixed" textures (fixed texture memory), for example, clouds and sea in a flight simulator. If the hardware can only texture from local video memory, the algorithm usually attempts to pre-fetch the textures needed for each frame or scene into local video memory. Without pre-fetching, users will see a noticeable pause in the scene as the software stops drawing while the needed texture is swapped into local video memory, or worse, from disk to system memory to local video memory. Often even more delay occurs during initial texture loading because textures must be reformatted into a hardware-specific compressed format. Depending on the number of textures per frame, the algorithm may vary the proportion of memory allocated to texture swapping versus fixed texture memory. Scenes which contain a large number of textures tend to have less texture reuse; these benefit from a larger texture swapping space.

Chapter 2: 3D Graphics on Next Generation PCs

3D graphics are certain to benefit from several enhancements to the PC platform. First and foremost is the transition to the Pentium® III processor at the heart of the system. The Pentium III processor is better able to handle the geometry stage of the 3D pipeline (i.e., it delivers higher triangles-per-second throughput). The Pentium III processor consists of a core packaged with integrated level 2 cache memory. It also features a Dual Independent Bus (DIB) architecture, in which two independent buses connect the core to the L2 cache and to the system bus of the PC. Because both buses can operate at the same time, the processor can simultaneously execute instructions out of the L2 cache and communicate with external devices, which greatly enhances performance.

The addition of AGP is, of course, the other key enhancement to the PC platform that benefits 3D graphics. AGP relieves the graphics bottleneck by adding a new dedicated high-speed bus directly between the chipset and the graphics controller. This removes bandwidth-intensive 3D and video traffic from the constraints of the PCI bus. In addition, AGP allows textures to be accessed directly from system memory during rendering rather than being pre-fetched to local graphics memory. Segments of system memory can be dynamically reserved by the OS for use by the graphics controller; this memory is termed AGP memory or non-local video memory. The net result is that the graphics controller needs to keep fewer texture maps in local memory. Smaller local memory requirements mean lower overall system cost. This innovation also eliminates the size constraint that local graphics memory places on texture maps, enabling applications to use much larger texture maps and further improving realism and image quality. As a final point, it should be noted that off-loading graphics and video data from the PCI bus makes more room available for bandwidth-hungry high-speed devices.
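To make the contrast concrete, here is a minimal sketch of the frame-to-frame texture swapping described at the end of Chapter 1, which AGP texturing largely makes unnecessary. The fixed local-memory budget, the integer texture identifiers, and the least-recently-used eviction policy are illustrative assumptions, not a description of any particular application or driver.

    #include <cstddef>
    #include <list>
    #include <unordered_map>

    // Sketch of a pre-AGP texture-swapping heuristic: keep a fixed budget of
    // local (frame-buffer) memory for textures, pre-fetch the textures the
    // next frame needs, and evict the least-recently-used ones when the
    // budget is exceeded.
    class TextureSwapper {
    public:
        explicit TextureSwapper(std::size_t localBudgetBytes)
            : budget_(localBudgetBytes) {}

        // Call once per frame for every texture the frame will reference.
        void prefetch(int textureId, std::size_t sizeBytes) {
            auto it = resident_.find(textureId);
            if (it != resident_.end()) {           // already in local memory
                lru_.splice(lru_.begin(), lru_, it->second.lruPos);
                return;
            }
            while (used_ + sizeBytes > budget_ && !lru_.empty())
                evictOldest();                     // make room in local memory
            uploadToLocalMemory(textureId);        // the copy over the PCI bus
            lru_.push_front(textureId);
            resident_[textureId] = { sizeBytes, lru_.begin() };
            used_ += sizeBytes;
        }

    private:
        struct Entry { std::size_t size; std::list<int>::iterator lruPos; };

        void evictOldest() {
            int victim = lru_.back();
            used_ -= resident_[victim].size;
            resident_.erase(victim);
            lru_.pop_back();
        }

        void uploadToLocalMemory(int /*textureId*/) {
            // Placeholder: in a real application this is the expensive copy
            // from system memory into the frame buffer that causes visible
            // pauses when it happens in the middle of a scene.
        }

        std::size_t budget_ = 0;
        std::size_t used_ = 0;
        std::list<int> lru_;                       // front = most recently used
        std::unordered_map<int, Entry> resident_;
    };

On hardware that can texture directly from AGP memory, most of this bookkeeping, and the PCI copy it triggers, simply disappears: the controller reads each texture where it already sits in system memory.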
AGP is implemented with a connector similar to that used for PCI, with 32 lines for multiplexed address and data. There are an additional 8 lines for sideband addressing, which is described in the next chapter.

Local video memory is usually more expensive than general system memory, and the OS cannot reassign it to other purposes when the running applications do not need it for graphics. The graphics controller also needs fast access to local video memory for screen refresh, Z-buffers, and pixels (the front and back buffers). For these reasons, programmers can always expect to have more texture memory available via AGP system memory. Keeping textures out of the frame buffer allows a larger screen resolution, or permits Z-buffering for a given large screen size. Most applications could use 2-16 MB for texture storage; by using AGP, they can get it.

Chapter 3: A Closer Look at AGP Data Transfers

While the PCI bus supports a maximum of 132 MBytes/s, AGP at 66 MHz runs at 533 MBytes/s peak. It gets this speed increase by transferring data on both the rising and falling edges of the 66 MHz clock (roughly 66 MHz x 4 bytes per transfer x 2 transfers per clock, versus 33 MHz x 4 bytes for PCI) and through the use of more efficient data transfer modes. (Actual throughput varies among systems and applications, but sustainable real-world transfers usually reach about 50-80% of the peak values.)

AGP provides two modes for the graphics controller to directly access texture maps in system memory: pipelining and sideband addressing. In pipelining, AGP overlaps the memory or bus access time for a request ("n") with the issuing of subsequent requests ("n+1", "n+2", and so on). On the PCI bus, request "n+1" does not begin until the data transfer of request "n" finishes. While both AGP and PCI can "burst" (transfer multiple data items continuously in response to a single request), such bursting only partly alleviates the non-pipelined nature of PCI. The depth of AGP pipelining depends on the implementation and remains transparent to application software. With sideband addressing, AGP uses 8 extra "sideband" address lines which allow the graphics controller to issue new addresses and requests while data from previous requests continues to move across the main 32 address/data wires.

Chapter 4: AGP Memory Mapping

So-called AGP memory is simply a set of dynamically allocated areas of system memory which the graphics controller can access quickly. The access speed comes from built-in hardware in the 440BX chipset which translates addresses, allowing the graphics controller and its software to see a contiguous space in main memory when in fact the underlying pages are scattered. Thus the graphics controller can access large data structures like texture bitmaps (typically 1 KByte to 128 KByte) as a single entity. The built-in chipset hardware is called the GART (Graphics Address Remapping Table), and it is similar in function to the paging hardware in the CPU.

The processor's "linear" virtual addresses are translated by its paging hardware into physical addresses. These physical addresses are used to access system memory, the local frame buffer, and AGP memory. CPU accesses to the local frame buffer and AGP memory use the same addresses as the graphics controller does; the operating system therefore sets up the CPU paging hardware for a straight 1:1 non-translation of virtual to physical addresses for these regions. For accesses to AGP memory, the graphics controller and CPU use a contiguous aperture of several megabytes, but the GART translates these accesses to various, possibly scattered, 4 KByte page addresses in system memory. PCI devices that access the AGP memory aperture (for example, for live video capture) also go through the GART.
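The remapping the GART performs can be pictured as a simple page-table lookup, as in the minimal sketch below. The 4 KByte page size comes from the text above; the table layout, class name, and 32-bit address type are illustrative assumptions and do not correspond to the actual 440BX register programming.

    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Sketch of GART-style address remapping: the graphics controller sees a
    // contiguous aperture of AGP memory, but each 4 KByte page of that
    // aperture may live anywhere in physical system memory.
    class GartSketch {
    public:
        static constexpr uint32_t kPageSize = 4 * 1024;   // 4 KByte pages

        GartSketch(uint32_t apertureBase, uint32_t apertureSizeBytes)
            : apertureBase_(apertureBase),
              pageTable_(apertureSizeBytes / kPageSize, 0) {}

        // The OS fills in one entry per aperture page with the physical base
        // address of whatever (possibly non-contiguous) system-memory page it
        // allocated on behalf of the graphics driver.
        void mapPage(uint32_t apertureOffset, uint32_t physicalPageBase) {
            pageTable_.at(apertureOffset / kPageSize) = physicalPageBase;
        }

        // Translate an address inside the aperture (as issued by the graphics
        // controller, the CPU, or a PCI device) into a physical system-memory
        // address.
        uint32_t translate(uint32_t apertureAddress) const {
            uint32_t offset = apertureAddress - apertureBase_;
            uint32_t page   = offset / kPageSize;
            uint32_t inPage = offset % kPageSize;
            if (page >= pageTable_.size())
                throw std::out_of_range("address outside the AGP aperture");
            return pageTable_[page] + inPage;
        }

    private:
        uint32_t apertureBase_;
        std::vector<uint32_t> pageTable_;   // aperture page -> physical page base
    };

Because the translation happens in the chipset, the controller can walk a large texture as if it were one contiguous block, even though the OS assembled it from scattered 4 KByte pages.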
Chapter 5: A Summary of AGP's Benefits

Before moving on, let's take a moment to summarize the key benefits of AGP.

- Higher peak bandwidth. Peak bandwidth is four times that of the PCI bus, thanks to pipelining, sideband addressing, and data transfers that occur on both the rising and falling edges of the clock.
- Direct execution of texture maps from system memory. AGP gives the graphics controller high-speed direct access to system memory, rather than forcing it to pre-load texture data into local video memory.
- Less PCI bus congestion. The PCI bus attaches a wide variety of I/O devices, such as disk controllers, LAN chips, and video capture systems. AGP operates concurrently with, and independently of, most transactions on PCI. Further, CPU accesses to system memory can proceed concurrently with AGP memory reads by the graphics controller.
- Improved system concurrency for balanced PC performance. The Pentium III processor can perform other activities while the graphics chip is accessing texture data in system memory.

Chapter 6: What This Means for Software Developers

So what should an application software developer do about AGP? There are two possibilities: 1) do nothing, or 2) optimize for AGP. In both cases, the big benefit of AGP is more and larger textures for 3D graphics realism without loss of real-time performance. Today's applications usually must limit themselves to less than 2 MBytes of textures at any one time with current graphics controllers. AGP will change that, provided the application is written to scale its texture content up on higher-end systems. Furthermore, existing applications, as well as new applications written without special effort for AGP, will run faster on AGP systems. True AGP-compliant hardware can actually make applications simpler. But PC hardware with AGP will come in three flavors, and software will probably want to support all three (a texture-placement sketch follows this list):

Type 1: This hardware has an AGP interface, but does not exploit AGP's texturing features. It simply transfers data faster than a PCI device could. It probably does not exploit the pipelining capability or sideband addressing.

Type 2: This hardware renders textures from AGP memory, so the application does not need to swap textures into local memory. The hardware may or may not also be able to texture from local memory. It may actually perform faster when not texturing from local memory, because of conflicts for access to local memory among pixel writes, screen refresh, texel reads, and Z-values.

Type 3: This hardware runs best when concurrently exploiting both local memory and AGP memory for texturing. Frequently used or smaller textures are best kept in local memory, while larger, less frequently used textures should reside in system memory. This minimizes the bandwidth drain on main memory, reducing conflicts between the CPU and graphics controller.
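One way to picture how an application might steer textures across these three hardware types is the following minimal sketch. The capability flags, the 256 KByte size threshold, and the reuse count are invented for illustration; in practice the driver and DirectDraw report the capabilities and handle placement, as described under Default DirectDraw Memory Allocations below.

    #include <cstddef>

    // Where a texture should live, in the vocabulary of this tutorial.
    enum class TexturePool { LocalVideoMemory, AgpMemory, SystemMemory };

    // Illustrative capability flags for the three hardware "flavors" above.
    struct HardwareCaps {
        bool texturesFromLocal;   // can render textures out of local video memory
        bool texturesFromAgp;     // can render textures out of AGP (non-local) memory
    };

    // Small, frequently reused textures go to local memory when the hardware
    // supports it (Type 3); everything else goes to AGP memory when possible
    // (Type 2/3); otherwise the texture stays in plain system memory and must
    // be swapped into local memory by the application (Type 1).
    TexturePool choosePool(const HardwareCaps& caps,
                           std::size_t textureBytes,
                           int usesPerFrame) {
        const bool smallAndHot = textureBytes <= 256 * 1024 && usesPerFrame > 4;

        if (caps.texturesFromLocal && caps.texturesFromAgp)   // "Type 3"
            return smallAndHot ? TexturePool::LocalVideoMemory
                               : TexturePool::AgpMemory;
        if (caps.texturesFromAgp)                             // "Type 2"
            return TexturePool::AgpMemory;
        return TexturePool::SystemMemory;                     // "Type 1"
    }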
DOS Applications

Of course, direct memory execution of textures requires the GART because of the virtual addressing scheme used in today's operating systems. For applications running under yesterday's operating systems (e.g., DOS) without virtual addressing, the GART serves no purpose. Old applications running under DOS will see the benefit of faster AGP transfers, but will require some driver work to turn on the graphics controller's ability to directly access textures in system memory.

Windows* Applications

Unmodified Windows* applications can benefit from AGP, because the OS and DirectDraw* have changed slightly to support it by default. For details, see the Microsoft Web site on AGP.

For current hardware implementations, the OS will make AGP memory (like other video memory) non-cacheable, so that there is no coherency problem between the CPU caches and the data that the graphics controller uses. Otherwise, graphics controller accesses to AGP memory would require "snooping" the CPU caches, which would cause delays in execution in some cases. CPU reads from uncached memory are slow, so algorithms should avoid CPU reads from AGP main memory as well as from graphics local memory (a sketch of a write-combining-friendly update pattern appears at the end of this chapter).

Note that in Pentium III processor-based systems, this non-cached graphics memory will be marked by the OS as "Write Combining" (WC), which gets significantly faster CPU write access than straight "Uncacheable" (UC) memory. WC memory areas let the CPU combine multiple discrete writes into a burst write on the memory bus when the bus is available, using dedicated write buffers built into the chip. Except for the faster speed, WC should remain transparent to applications. While CPU read-access speed is no faster for WC than for UC, the use of UC memory causes Pentium III processors to serialize execution, which will probably slow execution significantly. The fact that multiple writes can be combined before leaving the CPU can affect hardware device drivers, which may depend on multiple sequential writes to the same location and on "strong ordering" of memory writes.

Default DirectDraw Memory Allocations

Unless the application specifically requests otherwise, Microsoft DirectDraw will by default allocate memory for textures in the following order:

1. Local graphics controller memory.
2. AGP main memory.
3. System memory.

What if the graphics controller cannot texture from AGP memory? In this situation, DirectDraw can be prevented from allocating any non-local video memory for texturing. The graphics controller driver reports its capabilities to the OS and DirectDraw, and if the graphics controller cannot directly access system memory, then DirectDraw will allocate only local video memory and system memory to the application. Similarly, if the graphics chip cannot texture from local video memory, DirectDraw will not allocate any textures locally.

If the application cannot fit all of its textures into the AGP memory that DirectDraw agreed to allocate, it must eventually copy additional textures from disk into AGP memory. Very realistic flight simulators and other applications that use large amounts of textures may need to stream textures from disk or network into AGP memory, no matter how much memory DirectDraw gave them.

The application may also benefit from using MIP-maps with AGP, because MIP-maps (pre-filtered, multi-resolution texture maps) tend to increase the "locality" of memory access during texturing. That is, the lower-resolution version fits into a small area of system memory, and as the graphics chip applies the texture to an object far from the viewpoint, it accesses that sub-sampled version of the texture entirely within a small memory region. Without MIP-mapping, the chip must skip over many bytes of the single-resolution, larger texture to find the right texel for each pixel, so memory addresses jump in large increments and the effective memory bandwidth is lowered.
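The usual way to exploit this locality is to pick, per pixel, the MIP level whose texels are roughly one screen pixel in size, so that neighboring pixels read neighboring texels. The footprint calculation below is the standard textbook approximation, not a description of any specific graphics chip.

    #include <algorithm>
    #include <cmath>

    // Given how many texels of the base (level 0) map one screen pixel covers
    // in u and v, choose the MIP level whose texels are about one pixel in
    // size. Each level halves the resolution, so the level index is the log2
    // of the footprint.
    int chooseMipLevel(float texelsPerPixelU, float texelsPerPixelV, int levelCount) {
        float footprint = std::max(texelsPerPixelU, texelsPerPixelV);
        if (footprint <= 1.0f)
            return 0;                                  // object is close: use full detail
        int level = static_cast<int>(std::floor(std::log2(footprint)));
        return std::min(level, levelCount - 1);        // distant object: small, dense map
    }

For a distant object the footprint is large, a high level index is chosen, and the whole sub-sampled map fits in a small, contiguous region of AGP memory, which is exactly the locality benefit described above.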
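Finally, returning to the write-combining discussion earlier in this chapter: the practical rule is to compute in ordinary cacheable memory and then stream the result into WC-mapped AGP (or local video) memory with sequential writes, never reading it back. In the sketch below, the destination pointer stands in for a surface the driver has mapped for the application; no real DirectDraw or driver API is implied.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // WC-friendly update pattern: do all read-modify-write work in a cacheable
    // system-memory scratch buffer, then copy it into the write-combined
    // surface in a single sequential pass.
    void updateSurface(uint32_t* wcSurface, std::size_t pixelCount) {
        std::vector<uint32_t> scratch(pixelCount);

        // 1. Build the image in cacheable memory, where CPU reads are cheap.
        for (std::size_t i = 0; i < pixelCount; ++i)
            scratch[i] = static_cast<uint32_t>(i) * 0x01010101u;   // placeholder pixel values

        // 2. Stream the result out with ascending, sequential writes so the
        //    CPU's write-combining buffers can merge them into bursts.
        //    Nothing is ever read back from wcSurface.
        for (std::size_t i = 0; i < pixelCount; ++i)
            wcSurface[i] = scratch[i];
    }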