Fast floating-point performance smooths the drawing of 3D meshes

AGP Bus: Information from Intel
Fast floating-point performance smooths the drawing of 3D meshes and
animation effects and adds depth complexity to the scene. The next step is to add
lifelike realism and depth. To do this, the PC must render the 3D images by adding
textures, alpha-blended transparencies, texture-mapped lighting, and other effects. AGP
technology accelerates graphics performance by providing a dedicated high-speed port
for the movement of large blocks of 3D texture data between the PC's graphics
controller and system memory.
Scaling to Even Higher Bandwidth
The AGP interface, positioned between the PC's chipset and graphics controller,
significantly increases the bandwidth available to a graphics accelerator (current peak
bandwidth is 528 MB/s). AGP lays a scalable foundation for high-performance graphics
in future systems, with support for a peak bandwidth over 1 GB/s.
Today's 3D applications have a huge appetite for memory bandwidth. By providing a
high memory bandwidth "fast lane" for graphics data, AGP enables the
hardware-accelerated graphics controller to execute texture maps directly from system
memory, instead of caching them in the relatively limited local video memory. It also
helps speed the flow of decoded video from the CPU to the graphics controller.
3D applications will also run faster when the need to pre-fetch and cache textures in
local video memory is eliminated.
By minimizing the need for video memory, AGP helps developers control the costs of
new designs. Removing video traffic from the PCI bus also delivers better stability.
Boost Your AGP Learning Curve
Take a few moments to explore AGP technology in our AGP Tutorial. It outlines what
you need to know about PCs and software applications optimized for AGP.
Hardware developers who need to drill down deeper can download the AGP Interface
Specification version 2.0 and the latest engineering revisions. Further detailed
information on electromechanical implementation issues and thermal design guidelines
is available in the newly updated AGP Platform Design Guide revision 1.1.
Introduction
This tutorial provides a technical introduction to the AGP interface. In this tutorial we
shall review the manner in which 3D graphics are currently processed on the PC as well
as some of the more problematic issues. We then discuss how AGP deals with these
issues and enhances the ability of mainstream PCs to handle sophisticated 3D graphics
applications. We also explore the implications of AGP for software developers.
Users of this tutorial are assumed to understand the basic architecture of the PC and the
functions performed in the 3D graphics pipeline. Background information on the PC
architecture can be found in any good text on PC computing technology, such as Peter
Norton's Inside the PC, Sixth Edition, from SAMS Publishing. Intel's Graphics Web site
provides an in-depth primer on 3D graphics technology.
Table of Contents
Chapter 1: 3D Graphics on Current Generation PCs
Chapter 2: 3D Graphics on Next Generation PCs
Chapter 3: A Closer Look at AGP Data Transfers
Chapter 4: AGP Memory Mapping
Chapter 5: A Summary of AGP's Benefits
Chapter 6: What This Means for Software Developers
Chapter 1: 3D Graphics on Current Generation PCs
AGP is a new interface on the PC platform that dramatically improves the processing of
3D graphics and full-motion video. In order to fully understand the impact of AGP
technology, it's necessary to first review how 3D graphics are currently supported on the
PC platform without AGP.
Lifelike, animated 3D graphics require a continuous series of
processor-intensive geometry calculations that define the position of objects in 3D
space. Typically, geometry calculations are performed by the PC's processor because it
is well-suited to handling the floating point operations required. At the same time, the
graphics controller must process texture data in order to create lifelike surfaces and
shadows within the 3D image. The most critical aspect of 3D graphics is the processing
of texture maps, the bitmaps which describe in detail the surfaces of three-dimensional
objects. Texture map processing consists of fetching one, two, four, or eight texels
(texture elements) from a bitmap, averaging them together based on some mathematical
approximation of the location in the bitmap (or multiple bitmaps) needed on the final
image, and then writing the resulting pixel to the frame buffer. The texel coordinates are
non-trivial functions of the 3D viewpoint and the geometry of the object onto which the
bitmap is being projected.
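The four-texel case described above corresponds to what is commonly called bilinear filtering. The following is a minimal sketch of that idea, not the method of any particular graphics controller. The function name, the grayscale list-of-rows texture layout, and the clamping at the texture edge are assumptions made for illustration.

```python
def bilinear_sample(texture, u, v):
    """Blend the four texels nearest to texel coordinates (u, v).

    texture is a list of rows of grayscale values (an assumed layout).
    """
    height = len(texture)
    width = len(texture[0])

    # Clamp the coordinates so all four fetched texels lie inside the bitmap.
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)

    x0, y0 = int(u), int(v)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0  # fractional position between texels

    # Weighted average: first along each row, then between the two rows.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 100.0],
           [100.0, 200.0]]
pixel = bilinear_sample(texture, 0.5, 0.5)  # midway between all four texels
print(pixel)  # 100.0
```

Each rendered pixel costs four texel reads in this scheme, which is one reason texture fetches dominate the memory traffic of a 3D scene.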
Figure 1 shows how the processing of texture maps is currently supported on the PC. As
shown, there are five basic steps involved in processing textures.
1. Prior to their usage, texture maps are read from the hard drive and loaded into system
memory. The data travels via the IDE bus and chipset before being loaded into memory.
2. When a texture map must be used for a scene, it is read from system memory into the
processor. The processor performs point-of-view transformations upon the texture map
and then caches the results.
3. Lighting and viewpoint transforms are then applied to the cached data. The results of
this operation are subsequently written back to system memory.
4. The graphics controller then reads the transformed textures from system memory and
writes them in its local video memory (also called graphics controller memory, the
frame buffer, or off-screen RAM). In present-day systems, this data must travel to the
graphics controller over the PCI bus.
5. The graphics controller next reads the textures plus 2D color information from its
frame buffer. This data is used to render a frame which can be displayed on the 2D
monitor screen. The result is written back into the frame buffer. The system's
digital-to-analog converter will read the frame and convert it to an analog signal that
drives the display.
The reader may notice a number of problems with the way texture maps are currently
handled. First, textures must be stored in both system memory and the frame buffer;
redundant copies are an inefficient use of memory resources. Second, storing the
textures in the frame buffer, even temporarily, places a ceiling on the size of the
textures. There is a demand for textures with greater and greater detail, pressuring
hardware manufacturers to put more frame buffer in their systems. However, this type
of memory is quite expensive, so this is not an optimal solution. Finally, the
132 MByte/s bandwidth of the PCI bus limits the rate at which texture maps can be
transferred to the graphics subsystem. Furthermore, in typical systems several I/O
devices on the PCI bus must share the available bandwidth. The introduction of other
high-speed devices, such as Ultra DMA disk drives and 100 Mbit/s LAN cards, makes
the congestion even worse. It is easy to see how congestion on the PCI bus can limit 3D
graphics performance on a PC.
Currently, applications employ several strategies to compensate for the limitations
inherent in present-day PCs. Applications use a caching or "swapping" algorithm to
decide which textures should be stored in local frame buffer memory versus system
memory. Typically, applications dedicate a portion of off-screen local memory as
frame-to-frame texture swapping space, while the remaining off-screen memory
contains commonly used textures (fixed texture memory), for example, clouds and sea
in a flight simulator.
If the hardware can only texture from local video memory, the algorithm usually
attempts to pre-fetch the needed textures for each frame or scene into local video
memory. Without pre-fetching, users will see a noticeable pause in the scene as the
software stops drawing while the needed texture is swapped into local video memory, or
even worse, from disk to system memory to local video memory. Initial texture loading
is often delayed further because textures must be reformatted into a
hardware-specific compressed format.
Applications may reserve part of the local memory for swapping, and leave part of it
permanently loaded with "fixed" commonly used textures. Depending on the number of
textures per frame, the algorithm may vary the proportion of memory allocated for
texture swapping and fixed texture memory. Scenes which contain a large number of
textures tend to have less texture reuse; these benefit from larger texture swapping
space.
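The split between fixed texture memory and swapping space can be sketched as a small cache model. This is only an illustration under stated assumptions: the class name, the least-recently-used eviction policy, and the sizes are hypothetical, and real applications may choose a different swapping algorithm.

```python
from collections import OrderedDict


class LocalTextureCache:
    """Toy model of local video memory: a fixed region plus a swap region."""

    def __init__(self, swap_capacity_kb, fixed_textures):
        self.fixed = set(fixed_textures)   # permanently loaded textures
        self.swap_capacity_kb = swap_capacity_kb
        self.swap = OrderedDict()          # name -> size in KB, oldest first
        self.uploads = 0                   # copies into local video memory

    def use(self, name, size_kb):
        """Make a texture resident. Return True if it was already local."""
        if name in self.fixed:
            return True
        if name in self.swap:
            self.swap.move_to_end(name)    # mark as most recently used
            return True
        if size_kb > self.swap_capacity_kb:
            raise ValueError("texture larger than the swapping space")
        # Evict least recently used textures until the new one fits.
        while sum(self.swap.values()) + size_kb > self.swap_capacity_kb:
            self.swap.popitem(last=False)
        self.swap[name] = size_kb
        self.uploads += 1                  # this is the costly transfer
        return False


cache = LocalTextureCache(swap_capacity_kb=256, fixed_textures=["clouds", "sea"])
frame = [("clouds", 128), ("runway", 128), ("hangar", 128),
         ("runway", 128), ("tower", 128)]
hits = [cache.use(name, size) for name, size in frame]
print(hits)           # [True, False, False, True, False]
print(cache.uploads)  # 3
```

Every miss in this model stands for a transfer into local video memory, which is the pause the pre-fetching logic tries to hide from the user.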
Chapter 2: 3D Graphics on Next Generation PCs
3D graphics are certain to benefit from several enhancements to the PC platform. First
and foremost is the transition to the Pentium® III processor at the heart of the system.
The Pentium III processor is able to better handle the geometry stage of the 3D pipeline
(i.e., more triangles per second throughput). The Pentium III processor consists of a core
packaged with integrated level 2 cache memory. The Pentium III processor also features
a Dual Independent Bus (DIB) architecture, in which two independent buses connect the
core to the L2 cache and to the system bus of the PC. The fact that both buses can
operate at the same time greatly enhances the performance of the processor, because the
processor can simultaneously execute instructions out of the L2 cache and communicate
with external devices.
The addition of AGP is, of course, the other key enhancement to the PC platform that
benefits 3D graphics. AGP relieves the graphics bottleneck by adding a new dedicated
high-speed bus directly between the chipset and the graphics controller. This removes
bandwidth-intensive 3D and video traffic from the constraints of the PCI bus. In
addition, AGP allows textures to be accessed directly from system memory during
rendering rather than being pre-fetched to local graphics memory. Segments of system
memory can be dynamically reserved by the OS for use by the graphics controller; this
memory is termed AGP memory or non-local video memory. The net result is that the
graphics controller is required to keep fewer texture maps in local memory. Smaller
local memory requirements mean lower overall system cost. This innovation also
eliminates the size constraint that local graphics memory places on texture maps, thus
enabling applications to use much larger texture maps and further improving realism
and image quality. As a final point, it should be noted that off-loading graphics and
video data from the PCI bus makes more room available for bandwidth-hungry
high-speed devices.
AGP is implemented with a connector similar to that used for PCI, with 32 lines for
multiplexed address and data. There are an additional 8 lines for sideband addressing,
which is described in the next chapter.
Local video memory is usually more expensive than general-purpose system memory, and the
OS cannot use it for other purposes when the running applications' graphics do not need
it. The graphics controller needs fast access to local video memory
for screen refresh, Z-buffers, and pixels (front and back-buffers). For these reasons,
programmers can always expect to have more texture memory available via AGP
system memory. Keeping textures out of the frame buffer allows larger screen
resolution, or permits Z-buffering for a given large screen size. Most applications could
use 2-16 MB for texture storage. By using AGP, they can get it.
Chapter 3: A Closer Look at AGP Data Transfers
While the PCI bus supports a maximum of 132 MBytes/s, AGP at 66 MHz runs at 533
MBytes/s peak. It gets this speed increase by transferring data on both the rising and
falling edges of the 66 MHz clock and through the use of data transfer modes that are
more efficient. (Actual throughput varies among systems and applications, but
sustained real-world transfers usually reach about 50-80% of the peak
value.)
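The peak figures can be checked with simple arithmetic. The sketch below assumes the usual parameters for each bus: a 32-bit PCI bus clocked at 33 MHz with one transfer per clock, and a 32-bit AGP bus clocked at 66 MHz with two transfers per clock. The PCI parameters and the 66.67 MHz exact clock are assumptions not stated in this tutorial, and the function name is hypothetical.

```python
def peak_bandwidth_mbytes(clock_mhz, bus_width_bits, transfers_per_clock):
    """Peak bandwidth in MBytes/s for a simple parallel bus."""
    bytes_per_transfer = bus_width_bits // 8
    return clock_mhz * bytes_per_transfer * transfers_per_clock


pci_peak = peak_bandwidth_mbytes(33, 32, 1)             # 132
agp_peak = peak_bandwidth_mbytes(66, 32, 2)             # 528
agp_peak_exact = peak_bandwidth_mbytes(200 / 3, 32, 2)  # clock taken as 66.67 MHz

# Sustained range quoted in the text: roughly 50-80% of peak.
sustained_low = 0.5 * agp_peak
sustained_high = 0.8 * agp_peak

print(pci_peak, agp_peak, round(agp_peak_exact))
```

Rounding the clock to 66 MHz gives 528 MBytes/s, while the unrounded clock gives about 533 MBytes/s, which would explain why both figures are quoted for the same interface.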
AGP provides two modes for the graphics controller to directly access texture maps in
system memory: pipelining and sideband addressing. In pipelining, AGP overlaps the
memory or bus access times for a request ("n") with the issuing of following requests
("n+1"..."n+2"... etc.). In the PCI bus, request "n+1" does not begin until the data
transfer of request "n" finishes. While both AGP and PCI can "burst" (transfer multiple
data items continuously in response to a single request), such bursting only partly
alleviates the non-pipelined nature of PCI. The depth of AGP pipelining depends on the
implementation, and remains transparent to application software.
With sideband addressing, AGP utilizes 8 extra "sideband" address lines which allow
the graphics controller to issue new addresses and requests simultaneously while data
continues to move from previous requests on the main 32 data/address wires.
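The benefit of pipelining can be illustrated with an idealized timing model. This is a sketch under strong assumptions: every request has the same access latency and transfer time, the pipeline is deep enough to keep the data lines busy, and the cycle counts are invented for the example. The function names are hypothetical.

```python
def non_pipelined_cycles(requests, latency, transfer):
    """Request n+1 starts only after the data for request n has arrived."""
    return requests * (latency + transfer)


def pipelined_cycles(requests, latency, transfer):
    """Access latency of later requests overlaps with earlier data transfers.

    Idealized: only the first request exposes its latency.
    """
    return latency + requests * transfer


# Eight texture reads, 10 cycles of access latency, 4 cycles of data each.
serial = non_pipelined_cycles(8, 10, 4)
overlapped = pipelined_cycles(8, 10, 4)
print(serial, overlapped)  # 112 42
```

In this model the saving grows with the number of outstanding requests, because the latency is paid once instead of once per request.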
Chapter 4: AGP Memory Mapping
So-called AGP memory is simply dynamically allocated areas of system memory that
the graphics controller can access quickly. The access speed comes from built-in
hardware in the 440BX chipset which translates addresses, allowing the graphics
controller and its software to see a contiguous space in main memory, when in fact the
pages are disjointed. Thus the graphics controller can access large data structures like
texture bitmaps (typically 1 KByte to 128 KByte) as a single entity. The built-in chipset
hardware is called the GART (Graphics Address Remapping Table), similar in function
to the paging hardware in the CPU.
The processor "linear" virtual addresses are translated by its paging hardware into
physical addresses. These physical addresses are used to access system memory, the
local frame buffer, and AGP memory. The CPU accesses to the local frame buffer and
AGP memory use the same addresses as the graphics controller does. The operating
system therefore sets up the CPU paging hardware to a straight 1:1 non-translation of
virtual to physical address.
For accesses to AGP memory, the graphics controller and CPU use a contiguous
aperture of several megabytes. But the GART translates these to various, possibly
disjointed, 4 KByte page addresses in system memory. PCI devices that access the
AGP memory aperture (for example, for live video capture) also go through the GART.
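The remapping performed by the GART can be modeled as a table lookup. The sketch below is a software illustration only; the real translation is done by chipset hardware. The class name, the aperture base address, and the physical page addresses are hypothetical values chosen for the example.

```python
PAGE_SIZE = 4096  # 4 KByte pages, as described in the text


class Gart:
    """Toy model of a Graphics Address Remapping Table."""

    def __init__(self, aperture_base, physical_pages):
        self.aperture_base = aperture_base
        # Entry i holds the physical base address of aperture page i.
        self.table = list(physical_pages)

    def translate(self, address):
        """Map an address inside the contiguous aperture to system memory."""
        offset = address - self.aperture_base
        if not 0 <= offset < len(self.table) * PAGE_SIZE:
            raise ValueError("address outside the AGP aperture")
        page_index, page_offset = divmod(offset, PAGE_SIZE)
        return self.table[page_index] + page_offset


# Three scattered physical pages that appear contiguous to the controller.
gart = Gart(aperture_base=0xE0000000,
            physical_pages=[0x01A3C000, 0x00457000, 0x02F10000])

physical = gart.translate(0xE0000000 + 0x1010)  # second page, offset 0x10
print(hex(physical))  # 0x457010
```

A texture that spans several pages is addressed by the graphics controller as one block, while the table supplies a different physical page for each 4 KByte step.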
Chapter 5: A Summary of AGP's Benefits
Before moving on, let's take a moment to summarize the key benefits of AGP.
- Peak bandwidth is four times higher than the PCI bus thanks to pipelining,
  sideband addressing, and data transfers that occur on both rising and falling
  edges of the clock.
- Direct execution of texture maps from system memory. AGP enables high-speed
  direct access to system memory by the graphics controller, rather than forcing it
  to pre-load the texture data into local video memory.
- Less PCI bus congestion. The PCI bus attaches a wide variety of I/O devices,
  such as disk controllers, LAN chips, and video capture systems. AGP operates
  concurrently with, and independent from, most transactions on PCI. Further,
  CPU accesses to system memory can proceed concurrently with AGP memory
  reads by the graphics controller.
- Improved system concurrency for balanced PC performance. The Pentium II
  processor can perform other activities while the graphics chip is accessing
  texture data in system memory.
Chapter 6: What This Means for Software Developers
So what should an application software developer do about AGP? There are two
possibilities: 1) Do nothing, or 2) Optimize for AGP. For both cases, the big benefit of
AGP is more and larger textures for 3D graphics realism without loss of real-time
performance. Today's applications usually must limit themselves to less than 2 MBytes
of textures at any time with graphics hardware controllers. AGP will change that,
provided the application's texture content can scale up to take advantage of it.
Furthermore, any existing applications as well as new applications written without
special efforts for AGP will run faster on AGP systems.
True AGP-compliant hardware can actually make applications simpler. But PC
hardware with AGP will come in three flavors, and software will probably want to
support all three:
- Type 1: This hardware has an AGP interface, but does not exploit its AGP
  texturing features. It just transfers data faster than a PCI device could. It
  probably does not exploit the pipelining capability or sideband addressing.
- Type 2: This hardware renders textures from AGP memory, thus the application
  does not need to swap textures into local memory. The hardware may or may not
  be able to texture from local memory also. It may perform faster when not
  texturing from local memory, due to conflicts for access to local memory for
  pixel writes, screen refresh, texel reads, and Z-values.
- Type 3: This hardware runs best when concurrently exploiting both local
  memory and AGP memory for texturing. Frequently-used textures or smaller
  textures would best reside in local memory, while larger less-frequently used
  textures should reside in system memory. Thus the bandwidth drain on main
  memory is minimized, reducing conflicts between the CPU and graphics
  controller.
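One way an application might support all three hardware types is a single placement routine that changes its policy per type. The sketch below is a hypothetical illustration: the function name, the tuple layout, the ranking by use count and then size, and the example textures are all assumptions, not part of any driver or API.

```python
def place_textures(textures, hardware_type, local_budget_kb):
    """Decide where each texture should live.

    textures is a list of (name, size_kb, uses_per_frame) tuples.
    Returns a dict mapping each name to "local" or "agp".
    """
    if hardware_type == 1:
        # No AGP texturing: everything must be swapped into local memory.
        return {name: "local" for name, _, _ in textures}
    if hardware_type == 2:
        # Textures are rendered straight from AGP memory.
        return {name: "agp" for name, _, _ in textures}
    if hardware_type != 3:
        raise ValueError("hardware_type must be 1, 2, or 3")

    # Type 3: frequently used and smaller textures go to local memory first.
    placement = {}
    remaining_kb = local_budget_kb
    ranked = sorted(textures, key=lambda t: (-t[2], t[1]))
    for name, size_kb, _ in ranked:
        if size_kb <= remaining_kb:
            placement[name] = "local"
            remaining_kb -= size_kb
        else:
            placement[name] = "agp"
    return placement


textures = [("sky", 512, 1), ("cockpit", 64, 30),
            ("terrain", 1024, 2), ("hud", 32, 30)]
print(place_textures(textures, hardware_type=3, local_budget_kb=128))
```

With a 128 KByte local budget, the two small, heavily used textures end up in local memory and the two large ones stay in AGP memory, which matches the Type 3 guidance above.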
DOS Applications
Of course direct memory execution of textures requires the GART, because of the
virtual addressing scheme used in today's operating systems. But for applications
running under yesterday's operating systems (e.g., DOS) without virtual addressing, the
GART serves no purpose. Old applications running under DOS will see the benefit of
faster AGP speed, but will require some driver work to turn on the graphics controller's
ability to directly access textures in system memory.
Windows* Applications
Unmodified Windows* applications can benefit from AGP, because the OS and
DirectDraw* have changed slightly to support it by default. For details see the
Microsoft Web site on AGP.
For current hardware implementations, the OS will make AGP memory (like other
video memory) non-cacheable, so that there is no coherency problem between the CPU
caches and the data that the graphics controller uses. Otherwise, graphics controller
accesses to AGP memory would require "snooping" the CPU caches, which would
cause delays in execution in some cases. CPU reads from uncached memory are slow,
so algorithms should avoid CPU reads from AGP main memory as well as from
graphics local memory.
Note that in Pentium III processor-based systems, this non-cached graphics memory will
be marked by the OS as "Write Combining" (WC), which gets significantly faster CPU
write-access than straight "Uncacheable" (UC). WC memory areas let the CPU
"combine" multiple discrete writes into a burst-write on the memory bus when the bus is
available, using dedicated write-buffers built-in to the chip. Except for the faster speed,
WC should remain transparent to applications. While the CPU read-access speed is no
faster for WC than UC, the use of UC memory will cause Pentium III processors to
serialize execution, which will probably slow the execution significantly. The fact that
multiple writes can get combined together before getting outside the CPU can have
some impact on hardware device drivers, which may depend on multiple sequential
writes to the same location, and "strong ordering" of memory writes.
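The effect of write combining on a device driver can be shown with a toy model. This sketch is a simplification and does not describe the processor's actual buffer management: the 32-byte line size, the rule that a buffer is flushed when a write falls on a different line, and the function name are assumptions made for illustration.

```python
def bus_transactions(writes, combining, line_size=32):
    """Group CPU writes into the transactions that reach the memory bus.

    writes is a list of (address, value) pairs in program order.
    Each transaction is a dict mapping address to the value written.
    """
    transactions = []
    current_line = None
    for address, value in writes:
        line = address // line_size
        if combining and transactions and line == current_line:
            # Merge into the open buffer; a repeated address is overwritten.
            transactions[-1][address] = value
        else:
            transactions.append({address: value})
            current_line = line
    return transactions


writes = [(0x100, 1), (0x104, 2), (0x100, 3), (0x200, 4)]
uncacheable = bus_transactions(writes, combining=False)
write_combined = bus_transactions(writes, combining=True)
print(len(uncacheable), len(write_combined))  # 4 2
```

In the combined case the first write to address 0x100 never reaches the bus, because it is overwritten inside the buffer. A driver that expects a device to see both values in sequence would break under this behavior.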
Default DirectDraw Memory Allocations
Unless the application specifically requests otherwise, Microsoft DirectDraw will by
default allocate memory for textures in the following order:
1. Local graphics controller memory.
2. AGP main memory.
3. System memory.
What if the graphics controller cannot texture from AGP memory? Well, in this
situation DirectDraw can be prevented from allocating any non-local video memory for
texturing. The graphics controller driver reports its capabilities to the OS and
DirectDraw, and if the graphics controller cannot directly access system memory, then
DirectDraw will allocate only local video memory and system memory to the
application. Similarly, if the graphics chip cannot texture from local video memory,
DirectDraw will not allocate any textures locally.
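The default order and the two capability checks can be summarized in a few lines. This is a sketch of the decision logic as described in the text, not DirectDraw code: the function name, parameter names, and string labels are hypothetical and do not correspond to any DirectDraw API.

```python
def texture_memory_order(can_texture_from_local, can_texture_from_agp):
    """Return the memory types tried for texture allocation, in order."""
    order = []
    if can_texture_from_local:
        order.append("local video memory")
    if can_texture_from_agp:
        order.append("AGP memory")
    # Ordinary system memory is always the last resort.
    order.append("system memory")
    return order


print(texture_memory_order(True, True))
print(texture_memory_order(True, False))   # controller cannot texture from AGP
print(texture_memory_order(False, True))   # controller cannot texture locally
```

The capabilities passed in stand for what the graphics controller driver reports; the application itself does not need to make this choice unless it requests a specific memory type.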
If the application cannot fit all of its textures into the AGP memory that DirectDraw
agreed to allocate, it must eventually copy more textures from the disk
into AGP memory. Very realistic flight simulators or other applications using large
amounts of textures may need to stream textures from disk or network into AGP
memory, no matter how much memory DirectDraw gave them.
The application may benefit from using MIP-maps with AGP, as MIP-maps
(pre-filtered multi-resolution texture maps) tend to increase the "locality" of memory access
during texturing. That is, the lower-resolution version fits into a small area of system
memory, and as the graphics chip puts the texture on an object far from the viewpoint, it
accesses that sub-sampled version of the texture all within a small memory region. Without
MIP-mapping, the chip must skip over many bytes of the single-resolution larger texture
to find the right texel for each pixel, so memory addresses jump in large increments
and the effective memory bandwidth drops.
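The locality argument can be made concrete by computing how far memory addresses move between neighboring screen pixels. The sketch below assumes a square texture stored row by row, a power-of-two minification factor, and a MIP level chosen to match that factor exactly. These simplifications and the function name are assumptions made for illustration.

```python
def texel_strides(base_width, bytes_per_texel, minification, use_mipmap):
    """Byte distance between texels fetched for adjacent screen pixels.

    Returns (horizontal_stride, vertical_stride) in bytes.
    minification is the number of base-level texels covered per pixel.
    """
    if minification < 1 or minification & (minification - 1):
        raise ValueError("minification must be a power of two")

    # With MIP-mapping, pick the level whose texels match the pixel size.
    level = minification.bit_length() - 1 if use_mipmap else 0
    level_width = base_width >> level
    texels_per_pixel = minification >> level

    horizontal = texels_per_pixel * bytes_per_texel
    vertical = texels_per_pixel * level_width * bytes_per_texel
    return horizontal, vertical


# A 1024x1024 texture with 2 bytes per texel, drawn at one-eighth size.
print(texel_strides(1024, 2, 8, use_mipmap=False))  # (16, 16384)
print(texel_strides(1024, 2, 8, use_mipmap=True))   # (2, 256)
```

In this example the MIP level actually read is 128 texels wide and occupies 32 KBytes, compared with 2 MBytes for the full-resolution texture, so successive reads stay within a much smaller region of memory.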