The architecture of a graphics accelerator and its SW application driver
Written by: Shmuel Wimer – Bar Ilan University, School of Engineering

Host's application and graphics accelerator mode of operation

The graphics accelerator (AGE) is used by software implementing a graphics application that involves the rendering and animation of some object. In the following we define the handshaking between the application and AGE, comprising a few basic computer graphics operations. These may be enhanced during the project if time permits, or left for a future project. The software runs on a host computer equipped with a graphics board. The application first initializes the system by performing the following operations:
1. Defining the object to be displayed, both its geometry and color.
2. Performing triangulation of its surface, calculating the vertices, edges and triangles that comprise the object.
3. Assigning colors to every vertex.
4. Sending the following initial data to the USB port:
a. A list of vertices comprising the triangulated object's surface. Every vertex has xw, yw, zw world coordinates and RGB colors in agreed formats. A special order is imposed on the vertex list to enable pipelined processing of triangles within any frame.
b. A list of edges, given by their two end vertices. This list is also ordered such that the list of vertices implied by traversing the list of edges is contiguous (an edge's new end vertex doesn't leave a "hole" in the vertex list).
c. A list of triangles, given by their enclosing edges in cyclic order. The list should be synchronized with the edge list, and hence with the vertex list too, such that continuous progression along the triangle list imposes continuous progression along the vertex list.
d. An outward normal vector for every triangle. The order of this list must match the order of the triangles to enable the hardware pipeline.
e. A box of the real world where the object exists, given by xw_min, xw_max, yw_min, yw_max, zw_min, zw_max.
f. The screen viewport where the object should be displayed, given by xs_min, xs_max, ys_min, ys_max.
g. An indicator of which of the projection planes XY, XZ or YZ will be displayed in the screen viewport.
h. A light unit vector L = (lx, ly, lz).
i. Background RGB colors for the frame buffer.

The application then performs an animation session where the above object, with its world box, moves and rotates in space according to some externally defined trajectory. The animation-related data is transferred to the USB port at a rate of 24 frames/sec, and it comprises a 4×4 matrix representing the object's new position; as explained below, only 12 of its entries are involved in the calculation of the object's position. AGE then performs the animation at a rate of 24 frames/sec by applying hardware operations whose mathematical definitions are described subsequently and whose hardware implementation is specified elsewhere. The result of every animation step is the contents of a frame buffer comprising (xs_max − xs_min) × (ys_max − ys_min) pixels, each associated with RGB colors. The frame buffer data is addressed directly to the graphics board of the host.

Software and hardware pipeline implications

The so-called "graphics pipeline" described herein lends itself to a pipelined hardware implementation where the processing of the triangles comprising displayed objects possesses a great deal of overlap. It is unnecessary to first transform all vertices into new world coordinates and only then start rasterization of the first triangle. Instead, if the order of vertices stored in memory is such that "surface continuity" is maintained, where addressing the next vertex in memory defines a new triangle whose two other vertices have already been addressed and transformed, one can start pipelined processing of that triangle. This requires synchronizing the addresses of vertices and triangles in their corresponding memories to maintain this continuity of addresses.
This synchronization (the order of streaming vertices, triangles and triangle normal vectors) is the responsibility of the application software. The hardware will need to maintain a pointer into vertex memory, designating up to what address vertices have already been transformed, since the processing of a triangle cannot start before the world coordinates of its three vertices are updated. The graphics pipeline should avoid divisions as much as possible, as division is the most clock-cycle-consuming operation. Therefore, pixel computations involving divisions will take place in screen coordinates rather than world coordinates. This enables the use of reciprocals of divisors stored in a permanent lookup table, which replaces division by multiplication. The word stored at an address is the reciprocal value of that address. The size of the lookup table must support all possible divisions taken in screen coordinates, as explained later.

Storing the initial (permanent) and variable data in the memories of AGE

The representation of an object implies a few memories. Some contain permanent data loaded at initialization, while the data of others changes along the animation session.
1. Variable world vertex memory is loaded at initialization. It stores the dynamically changing vertex positions resulting along the animation. Its content is the outcome of multiplying a 4×4 position matrix by the 4-tuple homogeneous coordinate representation of a vertex, as described below. Vertices are indexed according to the order in which they are fed to AGE through the USB port, and their index is the address of a memory word entry comprising the xw, yw, zw world coordinates. A vertex's world position is later transformed into a screen viewport position. Let us denote by V the number of vertices. Since every coordinate is a 32-bit integer (4 bytes), the size of this memory is V × 12 bytes.
2. Permanent vertex RGB memory is loaded at initialization. It stores the RGB of each vertex.
Vertices are indexed according to the order in which they are fed to AGE through the USB port, and their index is the address of a memory word entry comprising RGB values. A vertex requires three bytes for its color; hence the size of the RGB memory is V × 3 bytes.
3. Permanent edge memory contains the list of all edges comprising the triangles. A word of this memory stores the two indices of the edge's end vertices. Denote by E the number of edges. Assuming three bytes per vertex index, the size of the edge memory is E × 6 bytes.
4. Permanent triangle memory, a word of which contains the indices of the edges comprising a triangle in cyclic order. Denote by T the number of triangles. Assuming 3 bytes per edge index, its implied size is T × 9 bytes. Since all triangles are embedded on the surface of the body, Euler's formula implies V − E + T = 2. Assuming 1M triangles, there is a total of 1M × 3/2 = 1.5M edges. It follows from Euler's formula that there are about 0.5M vertices. Figure 1 demonstrates the relation among the above memories.
5. Variable triangle outward normal memory. Every triangle implies a plane A·x + B·y + C·z + D = 0 whose parameters vary along the animation. This memory stores the parameters A, B, C, data used for two purposes. The first is deciding whether the triangle is a front or a back one with respect to the viewer's eye, hence whether it is potentially visible or certainly hidden. The latter case occurs for about half of the triangles, so all rasterization calculations can be skipped for those. The second purpose is deciding whether an individual pixel is potentially visible or certainly not, a decision that takes place in the Z-buffer described later. The derivation of the real-world depth of that pixel uses A, B, C together with D, whose derivation is described below. The initial values of these parameters are loaded at initialization. Every parameter occupies 4 bytes, resulting in a memory of T × 12 bytes.
6. Variable screen viewport vertex memory stores the coordinates of every vertex transformed to screen coordinates (described below). Every word of this memory stores the xs, ys coordinates in the projection plane. Assuming that a pixel coordinate is stored in two bytes, the size of this memory is V × 4 bytes.
7. Variable edge slope memory stores the slopes of triangle edges. An edge slope is required for the computations involved in the rasterization of triangles, and it is used repeatedly. Using a memory, the slopes are calculated only once per animation step and can then be used repeatedly, saving division operations. Assuming a four-byte slope representation, the memory size is E × 4 bytes.
8. Permanent screen coordinate reciprocal memory is a lookup table aimed at saving division operations made in screen coordinates. In particular, the interpolation of the RGB values of a pixel involves divisions by screen coordinate ranges, which are integral numbers that do not exceed the screen size. Therefore, these reciprocals can be calculated by hardware during initialization, once for the entire session, and then used in multiplications rather than divisions. Denoting by xs_min, xs_max, ys_min, ys_max the minimum and maximum screen coordinates, and using 4 bytes to represent a fixed-point reciprocal value, the size of this memory is max(xs_max − xs_min, ys_max − ys_min) × 4 bytes.
9. Variable Z-buffer (depth buffer) memory is used for hidden surface removal. Its word contains two data items. One is the smallest real-world depth seen so far at a pixel of the screen viewport (the nearest to the viewer's eye), and the other is the index of the triangle which dictates this Z-value. Assuming a depth needs four bytes and a triangle index three bytes, and denoting the size of the frame buffer by D, the memory size is D × 7 bytes.
10. Variable display (frame buffer) memory stores the final image to be flushed via the USB port for display on the host's screen. It stores the RGB of every pixel, requiring D × 3 bytes.
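The reciprocal table of item 8 can be sketched as follows. The 16.16 fixed-point format and the 1024-entry range are assumptions, chosen to be consistent with the 4-byte words and the screen sizes mentioned later:

```python
# Sketch of the permanent screen-coordinate reciprocal lookup table.
# Assumption: 16.16 fixed point stored in a 4-byte word, screen range <= 1024.
FRAC_BITS = 16
SCREEN_RANGE = 1024

# recip_lut[d] holds round(2^16 / d); entry 0 is unused (guarded by the caller).
recip_lut = [0] + [round((1 << FRAC_BITS) / d) for d in range(1, SCREEN_RANGE + 1)]

def div_by_lut(numerator, denominator):
    """Replace numerator/denominator by a multiply with the stored reciprocal."""
    return (numerator * recip_lut[denominator]) >> FRAC_BITS
```

A division such as 300/7 then becomes a single multiplication by recip_lut[7] followed by a shift, matching the integer quotient within the precision of the table.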
Total memory size requirements

Let us estimate the total area required for the above memories. Collecting all the above specified sizes yields the following expression: (12 + 3 + 4)V + (6 + 4)E + (9 + 12)T + (7 + 3)D = 19V + 10E + 21T + 10D, where V, E, T and D are the numbers of vertices, edges, triangles and pixels, respectively. Using the relations between vertices, edges and triangles discussed before, and assuming 1M triangles and 1M pixels, yields 55.5M bytes of memory. Cutting the number of triangles to 0.1M implies 14.55M bytes; reducing the number of displayed pixels to 0.1M (e.g., 400×250) yields 5.55M bytes; a further reduction of the number of triangles to 10K results in 1.455M bytes.

The rendering pipeline and its mathematical calculations

The following elaborates on the computations involved in the rendering pipeline. The rendering pipeline is divided into two major parts. In the first, all world data is transformed according to the position matrix of the new animation step. This includes vertex positions, outward normal vectors, edge slopes, and the world-to-screen coordinate conversion of vertices. The rest of the operations are carried out per triangle, one after the other. Since some operations involve divisions, special attention is required to avoid division by zero.
1. Multiplication of real-world vertices by the transformation matrix. This operation takes place in real-world coordinates. A vertex stored in the variable world vertex memory is first converted into the homogeneous representation (xw, yw, zw, 1) and then multiplied by the 4×4 position matrix to yield its new position in the world. The result is then stored back in the variable world vertex memory, overriding the previous position. The operation involves 9 multiplications and 9 additions of 4-byte operands, as explained below. The use of a single memory for vertex coordinates implies that the hosting application will send incremental position matrices describing the position change since the last animation step.
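Step 1 can be sketched as follows; the row layout of the 12 transmitted matrix entries is an assumption. The multiplications by the implicit fourth vertex component 1 cost nothing, leaving 9 multiplications and 9 additions per vertex:

```python
# Sketch of pipeline step 1: a vertex in homogeneous form (xw, yw, zw, 1)
# multiplied by the first three rows of the 4x4 position matrix.
def transform_vertex(t, v):
    """t: the 12 transmitted entries as 3 rows of 4; v: (xw, yw, zw).
    Each row costs 3 multiplications and 3 additions, 9 and 9 in total."""
    xw, yw, zw = v
    return tuple(row[0] * xw + row[1] * yw + row[2] * zw + row[3] for row in t)

# Example: identity rotation plus a translation by (10, 20, 30).
T = [[1, 0, 0, 10],
     [0, 1, 0, 20],
     [0, 0, 1, 30]]
```

Applying T to the vertex (1, 2, 3) yields (11, 22, 33), i.e. the pure translation expected from the example matrix.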
2. Multiplication of the triangle outward normal vector by the transformation matrix. As the object changes position, the outward normal vectors of its triangles change correspondingly. This change is obtained by first converting the vector stored in the variable triangle outward normal memory into the homogeneous representation (A, B, C, 0) and then multiplying it by the 4×4 position matrix to yield the new normal. The result is then stored back in the variable triangle outward normal memory, overriding the previous normal. This operation involves 9 multiplications and 6 additions of 4-byte operands, as explained below. The value of D in the plane representation A·x + B·y + C·z + D = 0 is required later for Z-depth calculations. It is obtained by D = −(A·x + B·y + C·z), where the point (x, y, z) is taken as one of the triangle's vertices. The vector (A, B, C) is initially set to unit length by the application software. Its length then remains unit, since the transformation matrix preserves vector length (we assume that in this implementation the animation excludes scaling of the world and perspective projections).
3. Calculate the slope of every edge. As described below, the rasterization of triangles for obtaining their pixel RGB values requires knowledge of the triangle's edge slopes. These slopes are used for every scan-line raster, so it is efficient to pre-calculate the slopes of the edges and store them in memory for later use. For the sake of precision, slopes are derived from world coordinates rather than screen coordinates, since the latter are obtained after rounding to the nearest integer, as explained below. The slope of an edge defined by vertices (xw, yw, zw) and (xw′, yw′, zw′) is (xw′ − xw)/(yw′ − yw). The case of a zero denominator, an edge whose two end points share the same y coordinate, needs special treatment: the slope is assigned the largest or smallest 32-bit integer in 2's complement representation, 2^31 − 1 or −2^31, respectively. As mentioned above, this division can be performed in screen rather than world coordinates, at the expense of precision.
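Step 3 can be sketched as follows. The 16.16 fixed-point encoding of the 4-byte slope word is an assumption; the saturation values for a zero denominator are those prescribed above:

```python
# Sketch of pipeline step 3: slope dx/dy of an edge, used later to step the
# left/right scan-line boundaries by one y unit at a time.
INT32_MAX, INT32_MIN = 2**31 - 1, -(2**31)
FRAC_BITS = 16  # assumed 16.16 fixed-point format for the 4-byte slope word

def edge_slope(x1, y1, x2, y2):
    """Return dx/dy in fixed point; a zero denominator saturates to the
    extreme 32-bit values, as the text prescribes."""
    dy = y2 - y1
    if dy == 0:
        return INT32_MAX if x2 >= x1 else INT32_MIN
    m = ((x2 - x1) << FRAC_BITS) // dy
    return max(INT32_MIN, min(INT32_MAX, m))
```

For example, an edge from (0, 0) to (10, 5) has slope 2, i.e. 2·2^16 in fixed point.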
The advantage of division in screen coordinates is the possibility of implementing division as multiplication, avoiding the need for a hardware divider. Since the range of screen coordinates is limited to 1024 or 2048 at most, all denominator reciprocals can be pre-calculated and stored in the appropriate memory prior to starting the animation. In that case the slope calculation stage is skipped.
4. Projecting a triangle onto the viewing plane. Since the 3D object is projected onto a 2D screen plane, the depth coordinate, which is perpendicular to the projection plane, is dropped. Assume without loss of generality that this is zw.
5. Convert every vertex to pixel coordinates. The projected world coordinate (xw, yw) is converted into a screen coordinate (xs, ys) by the transformation xs = round_int(xs_min + ((xs_max − xs_min)/(xw_max − xw_min)) × (xw − xw_min)), where the rounding is to the nearest integral number falling within the range of screen coordinates. An analogous transformation applies for y. The screen coordinates thus obtained are stored in the variable screen viewport vertex memory. The scaling factor of the above transformation is vertex and triangle independent, and can therefore be calculated once per frame computation and stored in a local register.
6. Deciding on hidden triangles is done by observing the outward normal vector (A, B, C) stored in the variable triangle outward normal memory. A triangle is certainly hidden from the observer's eye if C ≤ 0, a case where rasterization of the triangle is ruled out, thus saving a lot of computation.
7. Scanning a triangle for rasterization. This is the most time-consuming step of the pipeline. Figure 2 illustrates a triangle whose vertex screen coordinates and edge slopes have already been calculated and exist in the appropriate memories. The first step finds the vertex with the smallest ys coordinate. Two edges emanate from this vertex, one left-upward and one right-upward, with known slopes which have been calculated before, denoted by m_left and m_right, respectively.
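The vertex-to-pixel conversion of step 5 can be sketched as follows; the clamping into the viewport is an assumption implementing "falling within the range of screen coordinates":

```python
# Sketch of pipeline step 5: affine map from the world box range to the
# screen viewport range, rounded to the nearest integral pixel coordinate.
def world_to_screen(xw, xw_min, xw_max, xs_min, xs_max):
    scale = (xs_max - xs_min) / (xw_max - xw_min)  # frame-invariant, kept in a register
    xs = round(xs_min + scale * (xw - xw_min))
    return min(max(xs, xs_min), xs_max)            # keep the result inside the viewport
```

The analogous mapping for y uses yw_min, yw_max and ys_min, ys_max; only the two scale factors change.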
These slopes were calculated previously in step 3. If it is decided to give up divisions, the slopes are derived by looking into the permanent screen coordinate reciprocal memory, where the address is the appropriate difference of screen coordinates. This is elaborated in the pseudo-code below. Every horizontal scan line ys_scan is obtained from the previous one by ys_scan = ys_scan_old + 1. The new ys_scan is checked against the opposite ends of the upward-left and upward-right edges to see whether it exceeds either of them, a case where one of the edges terminates and the triangle's third edge is invoked, or the scan terminates. Every scan line extends from a leftmost to a rightmost pixel, obtained as follows. Let xs_left_old and xs_right_old denote the leftmost and rightmost pixels of the scan line ys_scan_old, respectively. Initially they are equal to each other, as obtained from the lowest vertex of the triangle. Let m_left_old and m_right_old be the slopes of the corresponding edges. Then xs_left = round_int(xs_left_old + m_left_old), where round_int is a rounding operation to the nearest integer, and similarly xs_right = round_int(xs_right_old + m_right_old). Rasterization pseudo-code is attached below.
8. Decide on pixel visibility. This is accomplished with the aid of the Z-buffer described before. Initially, the content of the Z-buffer is reset to store the largest integer (2^31 − 1 in 2's complement representation of 32-bit fixed-point numbers). Then every pixel in its turn is examined for the real-world zw coordinate corresponding to that pixel. If its value is smaller than the value found in the Z-buffer (hence it is closer to the viewer's eye), the color calculation of that pixel progresses, and the depth value of that pixel gets updated. Otherwise, the pixel is ignored and the next pixel is considered.
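The depth test of step 8 can be sketched as follows. The flat Python lists stand in for the 7-byte Z-buffer words (a 4-byte depth plus a 3-byte triangle index); the row-major addressing is an assumption:

```python
# Sketch of pipeline step 8: per-pixel nearest depth and the index of the
# triangle dictating it, reset to the largest depth at every frame.
INT32_MAX = 2**31 - 1

class ZBuffer:
    def __init__(self, width, height):
        self.depth = [INT32_MAX] * (width * height)  # reset at every animation step
        self.tri = [-1] * (width * height)
        self.width = width

    def test_and_set(self, xs, ys, zw, tri_index):
        """Return True when the pixel is the nearest seen so far at (xs, ys)."""
        i = ys * self.width + xs
        if zw < self.depth[i]:       # smaller zw: closer to the viewer's eye
            self.depth[i] = zw
            self.tri[i] = tri_index
            return True
        return False                 # certainly hidden at this pixel; skip it
```

After all triangles are processed, tri[] names the nearest triangle per pixel, which is exactly what the later color-setting step reads back.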
The calculation of the Z-value is made by first translating the given pixel (xs_pixel, ys_pixel) into (xw_pixel, yw_pixel) by the transformation xw_pixel = xw_min + ((xw_max − xw_min)/(xs_max − xs_min)) × (xs_pixel − xs_min), which is the inverse of the transformation used formerly to convert vertices from world to screen coordinates. yw_pixel is obtained analogously. Notice that the scale factor is invariant along the entire computation of a frame buffer, hence it can be calculated once and stored in a register. Once (xw_pixel, yw_pixel) is known, its depth in the real world, zw_pixel, is obtained from the plane equation A·xw_pixel + B·yw_pixel + C·zw_pixel + D = 0, yielding zw_pixel = −(1/C)(A·xw_pixel + B·yw_pixel + D). The coefficients are stored in the variable triangle outward normal memory. Notice that the coefficients are invariant for the entire triangle rasterization. Therefore, in order to avoid unnecessary memory accesses, the coefficients should be stored in registers. The result of this depth test is either an update of both the nearest zw coordinate and the triangle which implied it, or ignoring the update if the zw found is deeper than the nearest so far.
9. Setting a pixel's color. This operation is executed only once per pixel, according to the nearest triangle covering that pixel, whose index is found in the Z-buffer memory. Notice that this mode of operation excludes the setting of pixel colors from the hardware pipeline, since all triangles must be processed first in order to know which of them is the nearest to the viewer's eye at that pixel. This operation could be added to the pipeline, but the number of pixel color calculations would then be more than doubled, with more than half of them unnecessary. A pixel is assigned nominal RGB values derived from those existing at the vertices of the triangle it belongs to, by interpolation over its three vertices.
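The depth recovery described above can be sketched as follows, in floating point for clarity (the hardware would use the frame-invariant scale factors held in registers and the reciprocal memory instead of the divisions):

```python
# Sketch of the Z-value recovery in step 8: map the pixel back to world
# coordinates with the inverse viewport transform, then solve the triangle's
# plane equation A*x + B*y + C*z + D = 0 for z. C > 0 holds for every
# triangle that survived the back-face test of step 6.
def pixel_depth(xs, ys, plane, world_box, viewport):
    A, B, C, D = plane
    xw_min, xw_max, yw_min, yw_max = world_box
    xs_min, xs_max, ys_min, ys_max = viewport
    xw = xw_min + (xw_max - xw_min) / (xs_max - xs_min) * (xs - xs_min)
    yw = yw_min + (yw_max - yw_min) / (ys_max - ys_min) * (ys - ys_min)
    return -(A * xw + B * yw + D) / C
```

For a triangle lying in the plane z = 5 (A = B = 0, C = 1, D = −5), every covered pixel correctly recovers depth 5.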
Once the RGB values have been set, a further account of the object's surface curvature takes place by multiplying the RGB values by the factor (L·N)/|N|, where N = (A, B, C) is the triangle's outward normal vector and L = (lx, ly, lz) is a unit light vector pointing to the viewer (perpendicular to the screen). Notice that this dot product is fixed for the entire triangle, but because the pixel color setting is excluded from the hardware pipeline, it is recalculated for every pixel. This is an overhead in case T ≪ D, but it gets smaller as objects get more and more complex. The overhead could be avoided by calculating the above lighting coefficient as part of the hardware pipeline and storing it in a dedicated memory. A detailed description of the RGB color interpolation is attached below.
10. Writing a pixel into the frame buffer. The RGB values obtained for the above pixel are written into a frame buffer that is eventually sent to the USB port for display on the host's screen. This takes place at the rate of 24 frames/sec. At every animation step the frame is first filled with a background color, as defined by the host application. It is then filled pixel by pixel as a result of the above color calculation. Once filled, the frame buffer is flushed out to the USB port.

3D transformations

A point P: (xw, yw, zw) in the 3D world is transformed into a new point P′: (xw′, yw′, zw′) by applying a series of transformations such as translations, rotations, scaling, perspective views, and a few others. Though these transformations are not necessarily linear, their computation can be made linear by converting points into a homogeneous coordinate representation, where a point is represented by (xw, yw, zw, 1), and the sequence of transformations can be captured in the following 4×4 matrix:

      | t11 t12 t13 t14 |
  T = | t21 t22 t23 t24 |
      | t31 t32 t33 t34 |
      |  0   0   0   1  |

It is the responsibility of the software application to generate such matrices to perform the right object drawing and animation.
A new point position P′ is then obtained by P′ = TP, which takes the following explicit equations: xw′ = t11·xw + t12·yw + t13·zw + t14, yw′ = t21·xw + t22·yw + t23·zw + t24, and zw′ = t31·xw + t32·yw + t33·zw + t34. This transformation requires 9 multiplications and 9 additions. Obviously, neither the fourth entry of a point, which equals 1, nor the fourth row of the transformation matrix needs to be explicitly represented. Hence, at every frame the software application will send the hardware the 12 entries of the first 3 rows, which are involved in the computation of the new coordinates.

Triangle rasterization pseudo-code (all are screen coordinates)

1. Get the 3 vertices of the triangle: vA = (xsA, ysA), vB = (xsB, ysB) and vC = (xsC, ysC).
2. Find the vertex with the smallest y; assume it is vA = (xsA, ysA).
3. Get the slopes of edges (A, B) and (A, C). In case of no division, a slope is obtained by multiplying the coordinate difference (xsB − xsA) by the reciprocal of |ysB − ysA| fetched from the permanent screen coordinate reciprocal memory, with the appropriate sign; similarly for (A, C) with |ysC − ysA|.
4. Find the edge with the smaller slope, serving as m_left, and the edge with the larger slope, serving as m_right. Assume these are (A, B) and (A, C), respectively. ys_stop = min(ysB, ysC). Assume it is ysB.
5. xs_left = xs_right = xsA.
6. ys_pixel = ysA.
7. while (ys_pixel ≤ ys_stop) {
   a. for (xs_pixel = xs_left; xs_pixel ≤ xs_right; xs_pixel++) { Decide on pixel (xs_pixel, ys_pixel) visibility; }
   b. ys_pixel++;
   c. xs_left = round_int(xsA + m_left × (ys_pixel − ysA));
   d. xs_right = round_int(xsA + m_right × (ys_pixel − ysA));
8. }
9. ys_stop = max(ysB, ysC). Assume it is ysC.
10. Set m_left to be the slope of edge (B, C).
11. while (ys_pixel ≤ ys_stop) {
    a. for (xs_pixel = xs_left; xs_pixel ≤ xs_right; xs_pixel++) { Decide on pixel (xs_pixel, ys_pixel) visibility; }
    b. ys_pixel++;
    c. xs_left = round_int(xsB + m_left × (ys_pixel − ysB));
    d. xs_right = round_int(xsA + m_right × (ys_pixel − ysA));
12. }

RGB interpolation

This calculation involves a few divisions per pixel.
It therefore takes place in screen coordinates, since all divisions there can use the permanent screen coordinate reciprocal memory. Let (xs1, ys1), (xs2, ys2) and (xs3, ys3) be the screen coordinates of the vertices of the triangle whose internal pixel (xs_pixel, ys_pixel) is the target of the color setting. The line passing through vertex (xs1, ys1) and the internal point (xs_pixel, ys_pixel) intersects the opposite edge [(xs2, ys2), (xs3, ys3)] at (xs_mid, ys_mid), given by

(1) xs_mid = [(xs1·ys_pixel − xs_pixel·ys1)(xs2 − xs3) − (xs1 − xs_pixel)(xs2·ys3 − xs3·ys2)] / [(xs1 − xs_pixel)(ys2 − ys3) − (xs2 − xs3)(ys1 − ys_pixel)].

Calculation of (1) requires one division and eight multiplications. The numerator of (1) is cubic in screen coordinates, so it must be checked whether a 32-bit integer can represent it. Moreover, the divisor is an integral number which is quadratic in screen coordinates, while the permanent screen coordinate reciprocal memory stores reciprocals of screen coordinates only. Therefore, the division will be calculated indirectly, by multiplying by the reciprocal of (xs1 − xs_pixel) together with the reciprocal of (ys2 − ys3), and by the reciprocal of (xs2 − xs3) together with the reciprocal of (ys1 − ys_pixel). Notice also that the divisor in (1) may become zero if the pixel involved is a vertex, a case that should be detected up front; if found true, the color setting is taken directly from the vertex. A more delicate situation occurs when the line through the vertex and the pixel degenerates into a vertical straight line. This also needs pre-detection, and if found true, the interpolation is made with the y coordinates as follows:

(2) ys_mid = [(ys1·xs_pixel − ys_pixel·xs1)(ys2 − ys3) − (ys1 − ys_pixel)(ys2·xs3 − ys3·xs2)] / [(ys1 − ys_pixel)(xs2 − xs3) − (ys2 − ys3)(xs1 − xs_pixel)].

Assuming that the midpoint interpolation was done by (1), a color at (xs_mid, ys_mid) is obtained by interpolating between (xs2, ys2) and (xs3, ys3) as follows:

(3) R_mid = R2 + (R3 − R2) × (xs_mid − xs2)/(xs3 − xs2).

The divisor in (3) is obtained from the permanent screen coordinate reciprocal memory. Then, the value at the internal pixel is found as follows.
(4) R_pixel = R1 + (R_mid − R1) × (xs_pixel − xs1)/(xs_mid − xs1).

The divisor in (4) is also obtained from the permanent screen coordinate reciprocal memory. The G and B values are obtained similarly. If the interpolation of the midpoint is done according to (2), the calculations of (3) and (4) use y instead of x.
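The midpoint interpolation scheme of equations (1), (3) and (4) can be sketched as follows, in floating point for clarity; the hardware would replace each division by a lookup in the reciprocal memory, and the degenerate cases (pixel at a vertex, vertical line) are assumed to be filtered out before the call:

```python
# Sketch of the RGB interpolation: the line from vertex v1 through pixel p
# meets the opposite edge (v2, v3) at a midpoint -- equation (1) -- then two
# one-dimensional interpolations, equations (3) and (4), give the pixel value.
def interpolate_channel(p, v1, v2, v3, c1, c2, c3):
    """Interpolate one color channel at pixel p inside triangle (v1, v2, v3)
    with channel values c1, c2, c3 at the respective vertices."""
    (xp, yp), (x1, y1), (x2, y2), (x3, y3) = p, v1, v2, v3
    den = (x1 - xp) * (y2 - y3) - (x2 - x3) * (y1 - yp)   # zero only when p is v1
    x_mid = ((x1 * yp - xp * y1) * (x2 - x3)
             - (x1 - xp) * (x2 * y3 - x3 * y2)) / den      # equation (1)
    c_mid = c2 + (c3 - c2) * (x_mid - x2) / (x3 - x2)      # equation (3)
    return c1 + (c_mid - c1) * (xp - x1) / (x_mid - x1)    # equation (4)
```

As a sanity check, for a channel that varies linearly over the triangle the scheme reproduces the exact linear value at any interior pixel.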