WireGL: A Scalable Graphics System for Clusters
Greg Humphreys, Matthew Eldridge, Ian Buck, Gordon Stoll, Matthew Everett, and Pat Hanrahan
Presented by Bruce Johnson

Motivation for WireGL
- Data sets for scientific computing applications are enormous.
- Visualizing these data sets on a single workstation is difficult or impossible.
- Therefore, we need a scalable, parallel graphics rendering system.

What is WireGL?
- Provides a parallel interface to a cluster-based virtual graphics system.
- Extends the OpenGL API.
- Allows flexible assignment of tiles to graphics accelerators.
- Can perform final image reassembly in software using a general-purpose cluster interconnect.
- Can bring the rendering power of a cluster to displays ranging from a single monitor to a multi-projector, wall-sized display.

WireGL Illustrated

Parallel Graphics Architecture Classification
- Architectures are classified by the point in the graphics pipeline at which data are redistributed.
- Redistribution, or "sorting", is the transition from object parallelism to image parallelism.
- The sort location has tremendous implications for the architecture's communication needs.

Advantage of WireGL's Communication Infrastructure
- WireGL uses commodity parts, as opposed to the highly specialized components found in SGI's InfiniteReality.
- Therefore, the hardware or the network may be upgraded at any time without redesigning the system.

Points of Communication in the Graphics Pipeline
- Using commodity parts restricts the choice of communication points, because individual graphics accelerators cannot be modified.
- Therefore, there are only two points in the graphics pipeline at which to introduce communication:
  - immediately after the application stage, or
  - immediately before the final display stage.
- If communication is introduced after the application stage, the result is the traditional sort-first graphics architecture.

WireGL is a Sort-First Renderer

WireGL's Implementation (From a High Level)
- WireGL consists of one or more clients submitting OpenGL commands simultaneously to one or more graphics servers known as pipeservers.
- The pipeservers are organized as a sort-first parallel graphics pipeline and together render a single output image (a small sketch of the sort-first tile test appears below).
- Each pipeserver has its own graphics accelerator and a high-speed network connecting it to all of its clients.

Compute, Graphics, Interface, and Resolution Limited
- Compute limited: the simulation generates data more slowly than the graphics system can accept it.
- Graphics limited (geometry limited): a single client occupies multiple servers, keeping each server it occupies busy due to its long rendering time.
- Interface limited: the application is limited by the rate at which it can issue geometry to the graphics system.
- Resolution limited (field limited): the visualization of the data is hampered by a lack of display resolution.

How Does WireGL Deal with These Limitations?
- WireGL has no inherent restriction on the number of clients and servers it can accommodate.
- For compute-limited applications, one needs more clients than servers.
- For graphics-limited applications, one needs more servers than clients.
- For interface-limited applications, one needs an equal number of clients and servers.
- For resolution-limited applications, WireGL affords the capacity to use larger display devices.
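Before moving on to the client implementation, here is a rough illustration of the sort-first distribution described above: the client computes a screen-space bounding box for a block of geometry and marks the pipeservers whose tiles the box overlaps, so that the geometry is sent only to those servers. The tile layout, sizes, and ownership rule below are assumptions made for this example, not WireGL's actual data structures.

    #include <stdio.h>

    /* Hypothetical tile layout: the output image is divided into fixed-size
     * tiles, each owned by one pipeserver.  These sizes and the ownership
     * rule are illustrative assumptions only. */
    #define TILE_W      256
    #define TILE_H      256
    #define TILES_X     4          /* 4 x 4 tiles -> 1024 x 1024 output image */
    #define TILES_Y     4
    #define NUM_SERVERS 4

    /* Assume a simple static assignment of tiles to servers. */
    static int tile_owner(int tx, int ty)
    {
        return (tx + ty * TILES_X) % NUM_SERVERS;
    }

    /* Given the screen-space bounding box of a bucket of geometry, mark every
     * server whose tiles the box overlaps.  The geometry buffer would then be
     * sent only to the marked servers. */
    static void sort_first_classify(int xmin, int ymin, int xmax, int ymax,
                                    int overlaps[NUM_SERVERS])
    {
        for (int tx = 0; tx < TILES_X; tx++) {
            for (int ty = 0; ty < TILES_Y; ty++) {
                int tx0 = tx * TILE_W, ty0 = ty * TILE_H;
                int tx1 = tx0 + TILE_W, ty1 = ty0 + TILE_H;
                if (xmax >= tx0 && xmin < tx1 && ymax >= ty0 && ymin < ty1)
                    overlaps[tile_owner(tx, ty)] = 1;
            }
        }
    }

    int main(void)
    {
        int overlaps[NUM_SERVERS] = {0};
        sort_first_classify(100, 100, 600, 300, overlaps);   /* example bbox */
        for (int s = 0; s < NUM_SERVERS; s++)
            if (overlaps[s])
                printf("send geometry to server %d\n", s);
        return 0;
    }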
Client Implementation
- WireGL replaces the OpenGL library on Windows, Linux, and IRIX machines.
- As the program makes calls to the OpenGL API, WireGL classifies them into three categories: geometry, state, and special commands.

Geometry Commands
- Geometry commands are those that appear between glBegin and glEnd.
- These commands are packed into a global "geometry buffer"; the buffer contains an opcode and a copy of the arguments for each function.
- These opcodes and data are handed to the networking library as a single function call.
- Commands like glNormal3f do not create fragments themselves; their state effects are simply recorded in the buffer.

State Commands
- State commands directly affect the graphics state; examples are glRotatef, glBlendFunc, and glTexImage2D.
- Each state element has n associated bits that indicate whether that element is out of sync with each of the client's n servers.
- When a state command is executed, the bits are all set to 1, indicating that each server might need a new copy of that element (a small sketch of this dirty-bit scheme appears below).

Geometry Buffer Transmission
- Two circumstances can trigger transmission of the geometry buffer:
  - the buffer fills up and must be flushed to make room for subsequent commands, or
  - a state command is called while the geometry buffer is not empty, since OpenGL has strict ordering semantics.
- The geometry buffer cannot be sent to the overlapped servers immediately, since they may not have the correct OpenGL state; the application's current state must be sent prior to any transmission of geometry.
- WireGL currently has no automatic mechanism for determining the best time to partition geometry.

Parallel Graphics Considerations
- When running a parallel application, each client node performs a sort-first distribution of geometry and state to all pipeservers.
- When multiple OpenGL graphics contexts wish to render a single image, ordering must be expressed with barriers and semaphores, so synchronization functions are added to WireGL (a usage sketch appears below):
  - glBarrierExec(name) causes a graphics context to enter a barrier;
  - glSemaphoreP(name) waits for a signal;
  - glSemaphoreV(name) issues a signal.
- These ordering commands are broadcast, because the same ordering restrictions must be observed by all servers.

Special Commands
- Examples of special commands are SwapBuffers, glFinish, and glClear.
- glClear is followed immediately by a barrier to ensure that the framebuffer is clear before any drawing takes place.
- SwapBuffers has consequences for synchronization because only one client may execute it per frame; it marks the end of a frame and causes a buffer swap to be executed by all servers.

Pipeserver Implementation
- A pipeserver maintains a queue of pending commands for each client; as new commands arrive over the network, they are placed at the end of that client's queue.
- These queues are stored in a circular "run queue" of contexts.
- A pipeserver continues executing a client's commands until it runs out of work or the context "blocks" on a barrier or a semaphore operation.
- Blocked contexts are placed on wait queues associated with the semaphore or barrier they are waiting on.

Portrait of a Pipeserver

Context Switching
- Since each client has an associated graphics context, a context switch must be performed each time a client's stream blocks.
- Hardware context switches are expensive; they are slow enough to limit the amount of intra-frame parallelism achievable with WireGL.

Overcoming Context Switching Limitations
- Each pipeserver uses the same state tracking library as the client to maintain the state of each client in software.
- Context switching on the server is performed by a context differencing operation.
- Parallel applications collaborate to produce a single image and will typically have similar graphics states, so context switching among collaborating nodes has a cost proportional to the contexts' disparity.
- Hence a hierarchy arises in which contexts are classified according to their differences.
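To make the lazy state updating concrete, here is a minimal sketch of per-server dirty bits. The state element names, the 4-server cluster, and the send_state_update() helper are assumptions for this example, not WireGL's real interfaces.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SERVERS 4

    enum { STATE_BLEND_FUNC, STATE_MODELVIEW, STATE_TEXTURE, NUM_STATE_ELEMENTS };

    typedef struct {
        uint32_t value;     /* stand-in for the real state data               */
        uint32_t dirty;     /* bit s set => server s's copy is out of date    */
    } StateElement;

    static StateElement state[NUM_STATE_ELEMENTS];

    /* Placeholder for packing a state update into server s's outgoing stream. */
    static void send_state_update(int s, int elem, uint32_t value)
    {
        printf("server %d: update element %d -> %u\n", s, elem, (unsigned)value);
    }

    /* Called when the application executes a state command (e.g. glBlendFunc):
     * record the new value and mark it dirty for every server. */
    static void set_state(int elem, uint32_t value)
    {
        state[elem].value = value;
        state[elem].dirty = (1u << NUM_SERVERS) - 1;
    }

    /* Called just before geometry is flushed to server s: bring that server's
     * copy of the state up to date, touching only the out-of-sync elements. */
    static void sync_server(int s)
    {
        for (int e = 0; e < NUM_STATE_ELEMENTS; e++) {
            if (state[e].dirty & (1u << s)) {
                send_state_update(s, e, state[e].value);
                state[e].dirty &= ~(1u << s);
            }
        }
    }

    int main(void)
    {
        set_state(STATE_BLEND_FUNC, 7);
        sync_server(0);     /* server 0 receives the update              */
        sync_server(0);     /* nothing to send: server 0 is now in sync  */
        sync_server(2);     /* server 2 still needs it                   */
        return 0;
    }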
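The glBarrierExec/glSemaphoreP/glSemaphoreV extensions described under Parallel Graphics Considerations might be used per frame roughly as sketched below. Only the extension names come from the presentation; the prototype declaration, the barrier name, and the helper routines are assumptions for illustration, so consult the actual WireGL headers for the real declarations.

    #include <GL/gl.h>

    /* Assumed declaration of the barrier extension named above. */
    extern void glBarrierExec(GLuint name);

    /* Application-supplied routines (assumed for this sketch). */
    extern void draw_my_portion(int rank, int num_clients);
    extern void wiregl_swap_buffers(void);   /* stand-in for the platform swap call */

    #define FRAME_BARRIER 1

    void render_frame(int rank, int num_clients)
    {
        /* Only one client clears; the barrier after glClear ensures the
         * framebuffer is clear on every server before anyone draws. */
        if (rank == 0)
            glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glBarrierExec(FRAME_BARRIER);

        draw_my_portion(rank, num_clients);   /* this client's share of the geometry */

        /* Wait until every client has submitted its geometry, then let a
         * single client end the frame; the swap is executed by all servers. */
        glBarrierExec(FRAME_BARRIER);
        if (rank == 0)
            wiregl_swap_buffers();
    }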
Scheduling Amongst Different Contexts
- When a context blocks, the server has a choice of which context to run next.
- One must therefore weigh the cost of performing the context switch against the amount of work that can be done before the next switch (a sketch of such a scheduler appears below).
- A simple round-robin scheduler was used.

Round-Robin Scheduling
- Why does round-robin scheduling work?
- First, clients participating in the visualization of large data sets are likely to have similar contexts, making the expense of context switching low and uniform.
- Second, since we cannot know when a stream will block, we can only estimate the time to the next context switch from the amount of work queued for a particular context.
- Any large disparity in the amount of work queued for a particular context is likely the result of application-level load imbalance.

Description of the Network
- WireGL uses a connection-based network abstraction in order to support multiple network types; each server/client pair is joined by a connection.
- A credit-based flow control mechanism prevents servers from exhausting their memory when they cannot keep up with the clients (a flow-control sketch appears below).
- Sends are zero-copy, since buffer allocation is the responsibility of the network layer.
- Receives are zero-copy as well, because the network layer allocates the receive buffers.

Symmetric Connection
- The connection between the client and the server is completely symmetric, which means that servers can return data to the clients.
- WireGL supports glFinish to tell applications when a command has been executed.
- This allows applications to synchronize their output with some external input, ensuring that the graphics system's internal buffering does not cause the output to lag behind the input.
- The user may optionally enable an implicit glFinish-like synchronization on each call to SwapBuffers; this ensures that no client gets more than one frame ahead of the servers.

Display Management
- To form a seamless output image, tiles must be extracted from the framebuffers of the pipeservers and reassembled to drive the display device.
- There are two ways to perform this display reassembly: in hardware or in software.

Display Reassembly in Hardware
- The experiments used Lightning-2 boards, which accept 4 inputs and emit 8 outputs.
- More inputs are accommodated by connecting multiple Lightning-2 boards into a "pixel bus"; more outputs are accommodated by repeating the inputs.
- Hence an arbitrary number of accelerators and displays can be connected in a 2-D mesh.

Display Reassembly in Hardware (2)
- Each input to a Lightning-2 usually contributes to multiple output displays, so Lightning-2 must observe a full output frame from each input before it may swap; this introduces one frame of latency.
- Lightning-2 provides a per-host back channel using the host's serial port, and WireGL waits for this notification before executing a client's SwapBuffers command.
- Having synchronized outputs allows Lightning-2 to drive tiled display devices like IBM's Bertha or a multi-projector display wall without tearing artifacts.
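Returning to the pipeserver's run queue and the round-robin policy described above, a scheduler of that shape could look roughly like the sketch below. The structure layout, the stub bodies, and the byte counts are assumptions for illustration, not the pipeserver's actual code.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct Context {
        struct Context *next;        /* circular run-queue link            */
        int             id;
        int             blocked;     /* waiting on a barrier or semaphore  */
        size_t          queued;      /* bytes of pending commands          */
    } Context;

    /* Stub: the real server would decode and issue OpenGL commands here,
     * stopping when the context blocks or its queue drains. */
    static void execute_until_blocked(Context *c)
    {
        printf("executing %zu bytes for context %d\n", c->queued, c->id);
        c->queued = 0;
    }

    /* Stub for the context-differencing switch: only the state elements
     * that differ between the two contexts would actually be updated. */
    static void switch_context(Context *from, Context *to)
    {
        printf("switch context %d -> %d\n", from->id, to->id);
    }

    /* Pick the next context in round-robin order that can make progress. */
    static Context *next_runnable(Context *cur)
    {
        for (Context *c = cur->next; c != cur; c = c->next)
            if (!c->blocked && c->queued > 0)
                return c;
        return NULL;
    }

    static void serve(Context *cur)
    {
        while (cur) {
            execute_until_blocked(cur);
            Context *next = next_runnable(cur);
            if (!next) break;                 /* all blocked or idle */
            switch_context(cur, next);
            cur = next;
        }
    }

    int main(void)
    {
        Context a = { .id = 0, .queued = 4096 };
        Context b = { .id = 1, .queued = 1024 };
        Context c = { .id = 2, .blocked = 1, .queued = 512 };
        a.next = &b; b.next = &c; c.next = &a;    /* circular run queue */
        serve(&a);
        return 0;
    }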
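The credit-based flow control mentioned under Description of the Network can likewise be sketched on a single connection. The credit count, buffer size, and function names are assumptions, and both endpoints are collapsed into one toy program purely to show the bookkeeping.

    #include <stdio.h>

    #define INITIAL_CREDITS 8               /* buffers the server agrees to hold */
    #define BUFFER_BYTES    (128 * 1024)    /* size of one command buffer        */

    typedef struct {
        int credits;                        /* buffers the client may still send */
    } Connection;

    /* Client side: sending a buffer consumes one credit.  With no credits
     * left, the client must block until the server frees space. */
    static int client_send(Connection *c, int len)
    {
        if (c->credits == 0)
            return 0;                       /* would block / retry later */
        c->credits--;
        printf("sent %d-byte buffer, %d credits left\n", len, c->credits);
        return 1;
    }

    /* Server side: once a buffer has been decoded and its memory released,
     * a credit is returned to the client over the (symmetric) connection. */
    static void server_release_buffer(Connection *c)
    {
        c->credits++;
    }

    int main(void)
    {
        Connection conn = { INITIAL_CREDITS };

        for (int i = 0; i < 10; i++) {
            while (!client_send(&conn, BUFFER_BYTES)) {
                printf("out of credits, waiting for the server\n");
                server_release_buffer(&conn);    /* credit message arrives */
            }
        }
        return 0;
    }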
Display Reassembly in Software
- Without special hardware to support image reassembly, the final rendered image must be read out of each local framebuffer and redistributed over a network (a readback sketch appears below).
- The drawback of pure software reassembly is diminished performance: pixel data must be read out of the local framebuffer, transferred over the cluster's internal network, and written back to a framebuffer for display.
- Software reassembly has demonstrated an inability to sustain high frame rates.

Visualization Server
- A separate, dedicated rendering cluster used in this way is called a "visualization server".
- With the visualization server, all pipeservers read the color contents of their managed tiles at the end of each frame.
- Those images are sent over the cluster's interconnect to a separate compositing server for reassembly.

Applications Used
- March: a parallel implementation of the marching cubes algorithm. March extracts and renders 385,492 lit triangles per frame.
- Nurbs: a parallel patch evaluator that uses multiple processors to subdivide a curved surface and tessellate it. Nurbs tessellates and renders 413,000 lit, stripped triangles per frame.
- Hundy: a parallel application that renders a set of unorganized triangle strips. Hundy renders 4 million triangles per frame at a rate of 7.45 million triangles per second.

Parallel Rendering Speedups

Parallel Interface
- To scale any interface-limited application, it is necessary to allow parallel submission of graphics primitives; this effect was illustrated with Hundy.
- Some of Hundy's performance measurements show a super-linear speedup because Hundy generates a large amount of network traffic per second.
- This shows that Hundy's performance is very sensitive to the behavior of the network under high load.

Hardware vs. Software Image Reassembly
- As the size of the output image grows, software image reassembly can quickly compromise the performance of the application.
- A single application was written to measure the overhead of software versus hardware reassembly.
- It demonstrated that hardware-supported reassembly is necessary to maintain high frame rates.

Load Balancing
- There are two kinds of load balancing to consider.
- The first is application-level load balancing, that is, balancing the amount of computation performed by each client node.
- It is the responsibility of the programmer to distribute the work efficiently across the nodes.
- This aspect was tested for each of the applications, and each application showed adequate application-level load balance.

Load Balancing (2)
- The other kind is the graphics work done by the servers, which must be distributed across multiple servers.
- However, the rendering work required to generate an output image is typically not uniformly distributed in screen space.
- Thus the tiling of the output image introduces a potential load imbalance, and this may create a load imbalance in the network as well.

Scalability Limits
- Experiments indicate that WireGL should scale from 16 pipeservers and 16 clients to 32 pipeservers and 32 clients if the network better supported all-to-all communication.
- The limit on scalability is the amount of screen-space parallelism available for a given output size.
- For a huge cluster of, say, 128 nodes, the tiles would be so small that it would be difficult to achieve a good load balance for any non-trivial application without a prohibitively high overlap factor.
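Returning to software reassembly, each pipeserver's per-frame readback might look roughly like the sketch below. The Tile structure, the send_tile() helper, the RGBA/unsigned-byte format, and the assumption that a tile sits at the same offset in the local framebuffer as in the output image are all illustrative choices, not WireGL's actual code.

    #include <stdlib.h>
    #include <GL/gl.h>

    typedef struct {
        int x, y;                    /* tile origin                 */
        int width, height;           /* tile extent in pixels       */
    } Tile;

    /* Stand-in for shipping the pixels to the compositing server over the
     * cluster interconnect. */
    extern void send_tile(const Tile *t, const unsigned char *pixels);

    void read_back_tiles(const Tile *tiles, int num_tiles)
    {
        glPixelStorei(GL_PACK_ALIGNMENT, 1);

        for (int i = 0; i < num_tiles; i++) {
            const Tile *t = &tiles[i];
            unsigned char *pixels =
                malloc((size_t)t->width * (size_t)t->height * 4);

            /* Read this tile's color contents out of the local framebuffer. */
            glReadPixels(t->x, t->y, t->width, t->height,
                         GL_RGBA, GL_UNSIGNED_BYTE, pixels);

            /* The compositing server would draw it back into place, e.g.
             * with glDrawPixels, to form the seamless output image. */
            send_tile(t, pixels);
            free(pixels);
        }
    }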
Texture Management
- WireGL's client treats texture data as a component of the graphics state and lazily updates the servers as needed (a sketch of this lazy distribution appears below).
- In the worst case, this results in each texture being replicated on every server node in the system.
- This is a consequence of using commodity graphics accelerators in the cluster: it is not possible to introduce a stage of communication to access texture memory remotely.
- New texture management strategies, such as parallel texture caching, are currently being investigated.

Latency
- There are two sources of latency: the display reassembly stage and the buffering of commands on the client.
- Display reassembly via the Lightning-2 boards introduces one frame of latency; display reassembly in software introduces 50-100 ms of latency.
- Latency due to command buffering depends on the size of the network buffers and on the fact that a pipeserver cannot process a buffer until it has been completely received.

Future Work
- The main direction for future development is to add flexibility to accommodate a broader range of parallel rendering applications.
- The next version will allow a user to describe an arbitrary directed graph of graphics stream processing units.
- This will involve developing new parallel applications that use the new system.
- This system shows promise for CAVEs.
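Returning to the lazy texture updates described under Texture Management, the sketch below records, per texture, which servers already hold a copy and downloads the texels only on a server's first use; in the worst case every bit ends up set, i.e. the texture is fully replicated. The structure and helper names are assumptions for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SERVERS 16

    typedef struct {
        unsigned    id;              /* OpenGL texture object name           */
        const void *data;            /* client-side copy of the texel data   */
        int         w, h;
        uint32_t    resident;        /* bit s set => server s has the texels */
    } TrackedTexture;

    /* Stand-in for packing a glTexImage2D call into server s's stream. */
    static void pack_teximage(int s, const TrackedTexture *t)
    {
        printf("download texture %u (%dx%d) to server %d\n", t->id, t->w, t->h, s);
    }

    /* Called just before geometry that uses 't' is flushed to server s:
     * ship the texels only if that server has never seen them. */
    static void ensure_texture_resident(TrackedTexture *t, int s)
    {
        if (!(t->resident & (1u << s))) {
            pack_teximage(s, t);
            t->resident |= 1u << s;
        }
    }

    int main(void)
    {
        static const unsigned char texels[4] = { 255, 255, 255, 255 };
        TrackedTexture tex = { 1, texels, 1, 1, 0 };

        ensure_texture_resident(&tex, 3);   /* first use on server 3: download */
        ensure_texture_resident(&tex, 3);   /* already resident: nothing sent  */
        ensure_texture_resident(&tex, 7);   /* another server: download again  */
        return 0;
    }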