WireGL: A Scalable Graphics System for Clusters
Greg Humphreys, Matthew Eldridge, Ian Buck, Gordon Stoll, Matthew Everett, and Pat Hanrahan
Presented by Bruce Johnson
Motivation for WireGL
Data sets for scientific computing applications are enormous.
Visualization of these datasets on single workstations is difficult or impossible.
Therefore, we need a scalable, parallel graphics rendering system.
What is WireGL?
Provides a parallel interface to a cluster-based virtual graphics system.
Extends the OpenGL API.
Allows flexible assignment of tiles to graphics accelerators.
Can perform final image reassembly in software using a general-purpose cluster interconnect.
Can bring the rendering power of a cluster to displays ranging from a single monitor to a multi-projector, wall-sized display.
WireGL Illustrated
Parallel Graphics Architecture Classification
Classify by the point in the graphics pipeline at which data are redistributed.
Redistribution, or “sorting”, is the transition from object parallelism to image parallelism.
Sort location has tremendous implications for the architecture’s communication needs.
Advantage of WireGL’s Communication Infrastructure
WireGL uses commodity parts, as opposed to the highly specialized components found in SGI’s InfiniteReality.
Therefore, the hardware or the network may be upgraded at any time without redesigning the system.
Points of Communication in the Graphics Pipeline
Using commodity parts restricts the choices of communication, because individual graphics accelerators cannot be modified.
Therefore, there are only two points in the graphics pipeline at which communication can be introduced:
Immediately after the application stage.
Immediately before the final display stage.
If communication is introduced after the application stage, the result is the traditional sort-first graphics architecture.
WireGL is a Sort-first Renderer
WireGL’s Implementation (From a High Level)
WireGL consists of one or more clients submitting OpenGL commands simultaneously to one or more graphics servers known as pipeservers.
Pipeservers are organized as a sort-first parallel graphics pipeline and together serve to render a single output image.
Each pipeserver has its own graphics accelerator and a high-speed network connecting it to all of its clients.
Compute, Graphics, Interface, and Resolution Limited
Compute limited means that the simulation generates data more slowly than the graphics system can accept it.
Graphics limited (geometry limited) means that rendering time dominates: a single client can occupy multiple servers and keep each of them busy.
Interface limited means that an application is limited by the rate at which it can issue geometry to the graphics system.
Resolution limited (field limited) means that visualization of the data is hampered by a lack of display resolution.
How does WireGL Deal With These Limitations?
WireGL has no inherent restriction on the number of clients and servers it can accommodate.
For compute-limited applications, one needs more clients than servers.
For graphics-limited applications, one needs more servers than clients.
For interface-limited applications, one needs an equal number of clients and servers.
For resolution-limited applications, WireGL lets one use larger display devices.
Client Implementation
WireGL replaces the system’s OpenGL library on Windows, Linux, and IRIX machines.
As the program makes calls to the OpenGL API, WireGL classifies them into three categories:
Geometry
State
Special
Geometry Commands
Geometry commands are those that appear between glBegin and glEnd.
These commands are packed into a global “geometry buffer”.
The buffer contains a copy of the arguments to the function and an opcode.
These opcodes and data are sent directly to the networking library as a single function call (a sketch of this packing appears below).
Commands like glNormal3f do not themselves create fragments; their state effects are simply recorded in the buffer.
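A minimal sketch of how a single geometry call might be packed, assuming a hypothetical buffer layout and hypothetical names (GeomBuffer, pack_vertex3f, OP_VERTEX3F); WireGL’s actual packing code and opcodes differ.

    /* Hypothetical sketch of packing a geometry command into a buffer.
       All identifiers here are illustrative, not WireGL's. */
    #include <string.h>

    #define GEOM_BUFFER_SIZE (128 * 1024)
    #define OP_VERTEX3F 0x01

    typedef struct {
        unsigned char data[GEOM_BUFFER_SIZE]; /* packed arguments */
        unsigned char ops[GEOM_BUFFER_SIZE];  /* one opcode per packed command */
        size_t data_len, num_ops;
    } GeomBuffer;

    /* Returns 0 if the buffer is full and must be flushed first. */
    static int pack_vertex3f(GeomBuffer *b, float x, float y, float z)
    {
        float args[3] = { x, y, z };
        if (b->data_len + sizeof(args) > GEOM_BUFFER_SIZE || b->num_ops >= GEOM_BUFFER_SIZE)
            return 0;                        /* caller flushes, then retries */
        memcpy(b->data + b->data_len, args, sizeof(args));
        b->data_len += sizeof(args);
        b->ops[b->num_ops++] = OP_VERTEX3F;  /* opcode recorded alongside the data */
        return 1;
    }

When the buffer is handed to the networking library, both arrays go out in a single send, matching the “single function call” point above.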
State Commands
State commands directly affect the graphics state, for example glRotatef, glBlendFunc, or glTexImage2D.
Each state element has n bits associated with it, indicating whether that element is out of sync with each of its n servers (see the sketch below).
When a state command is executed, the bits are all set to 1, indicating that each server might need a new copy of that element.
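A hedged sketch of the per-element dirty bits, one bit per server; the struct and function names (StateElement, state_changed, needs_update) are illustrative and assume at most 32 servers.

    #include <stdint.h>

    typedef struct {
        float value[16];   /* e.g. a matrix or blend-function parameters */
        uint32_t dirty;    /* bit i set => server i may hold a stale copy */
    } StateElement;

    /* Called when the application issues a state command (e.g. glRotatef). */
    static void state_changed(StateElement *e)
    {
        e->dirty = 0xFFFFFFFFu;            /* every server might need an update */
    }

    /* Checked just before geometry is sent to server s: transmit the
       element only if that server's bit is still set. */
    static int needs_update(const StateElement *e, int s)
    {
        return (e->dirty >> s) & 1u;
    }

    static void mark_clean(StateElement *e, int s)
    {
        e->dirty &= ~(1u << s);            /* server s is now in sync */
    }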
Geometry Buffer Transmission
Two circumstances can trigger the transmission of the geometry buffer (sketched below):
If the buffer fills up, it must be flushed to make room for subsequent commands.
If a state command is called while the geometry buffer is not empty, the buffer must be flushed first, because OpenGL has strict ordering semantics.
The geometry buffer cannot be sent to overlapped servers immediately, since they may not have the correct OpenGL state.
The application’s current state must be sent prior to any transmission of geometry.
WireGL currently has no automatic mechanism for determining the best time to partition geometry.
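A sketch of the two flush triggers just listed, continuing the hypothetical GeomBuffer and pack_vertex3f from the earlier sketch; flush_geometry stands in for the real path that sends any pending state updates followed by the buffer.

    void flush_geometry(GeomBuffer *b);   /* hypothetical: update server state, then send the buffer */

    /* Trigger 1: a state command arrives while geometry is buffered.
       OpenGL's ordering semantics require the buffered geometry to go first. */
    static void before_state_command(GeomBuffer *b)
    {
        if (b->num_ops > 0)
            flush_geometry(b);
    }

    /* Trigger 2: the buffer fills up; flush and retry the packing call. */
    static void submit_vertex(GeomBuffer *b, float x, float y, float z)
    {
        if (!pack_vertex3f(b, x, y, z)) {
            flush_geometry(b);
            pack_vertex3f(b, x, y, z);
        }
    }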
Parallel Graphics Considerations
When running a parallel application, each client node performs a sort-first distribution of geometry and state to all pipeservers.
When multiple OpenGL graphics contexts wish to render a single image, ordering must be expressed explicitly with barriers and semaphores.
WireGL therefore adds synchronization functions to the API:
glBarrierExec(name) causes a graphics context to enter a barrier.
glSemaphoreP(name) waits for a signal.
glSemaphoreV(name) issues a signal.
These ordering commands are broadcast, because the same ordering restrictions must be observed by all servers. A small usage sketch follows.
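A hedged sketch of ordering two clients’ work with the extensions above: client A draws first and signals, client B waits before drawing. The drawing helpers and the extension prototypes shown here are assumptions for illustration; the semaphore is assumed to have been created and initialized to 0 elsewhere.

    /* Assumed prototypes for the WireGL extensions (signatures illustrative). */
    extern void glSemaphoreP(unsigned int name);
    extern void glSemaphoreV(unsigned int name);

    void draw_background(void);   /* hypothetical application drawing code */
    void draw_overlay(void);      /* hypothetical application drawing code */

    /* Client A: submit the background, then signal that it is done. */
    void client_a_frame(unsigned int sema)
    {
        draw_background();
        glSemaphoreV(sema);       /* signal: background has been submitted */
    }

    /* Client B: wait for A's signal, so every pipeserver observes the
       background before the overlay. */
    void client_b_frame(unsigned int sema)
    {
        glSemaphoreP(sema);       /* block this context until A signals */
        draw_overlay();
    }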
Special Commands
Examples of special commands are SwapBuffers, glFinish, and glClear.
glClear has a barrier immediately after its call to ensure that the framebuffer is clear before any drawing may take place.
SwapBuffers has consequences for synchronization because only one client may execute it per frame.
SwapBuffers marks the end of a frame and causes a buffer swap to be executed by all servers. A per-frame sketch follows.
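A hedged sketch of a per-frame loop built from the two rules above: a barrier immediately after glClear, and only one client issuing the swap. frame_barrier, client_id, draw_my_portion, and swap_buffers are illustrative names (the last stands in for the platform’s SwapBuffers call); the barrier before the swap is a typical way to ensure all clients have finished submitting, not something the slide spells out.

    #include <GL/gl.h>

    extern void glBarrierExec(unsigned int name);   /* WireGL extension, signature assumed */
    void draw_my_portion(void);                     /* hypothetical application code */
    void swap_buffers(void);                        /* hypothetical platform SwapBuffers wrapper */

    void render_frame(unsigned int frame_barrier, int client_id)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glBarrierExec(frame_barrier);   /* framebuffer is clear before any drawing */

        draw_my_portion();              /* each client submits its share of the frame */

        glBarrierExec(frame_barrier);   /* assumed: all drawing submitted before the swap */
        if (client_id == 0)
            swap_buffers();             /* only one client ends the frame */
    }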
Pipeserver Implementation
A pipeserver maintains a queue of pending commands for each client.
As new commands arrive over the network, they are placed at the end of that client’s queue.
These queues are stored in a circular “run queue” of contexts.
A pipeserver continues executing a client’s commands until it either runs out of work or the context “blocks” on a barrier or a semaphore operation.
Blocked contexts are placed on wait queues associated with the semaphore or barrier they are waiting on. A sketch of this loop appears below.
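A hedged sketch of the server-side execution loop just described. The types and helpers (Context, execute_until_block, move_to_wait_queue) are illustrative; removal from the circular list, wakeup on signals, and the empty-queue case are omitted.

    typedef enum { RAN_OUT_OF_WORK, BLOCKED } RunResult;

    typedef struct Context {
        struct Context *next;     /* circular run-queue linkage */
        /* ... per-client command queue and tracked OpenGL state ... */
    } Context;

    RunResult execute_until_block(Context *c);   /* hypothetical: drain c's pending commands */
    void move_to_wait_queue(Context *c);         /* hypothetical: park c on its barrier/semaphore */

    void pipeserver_loop(Context *run_queue)
    {
        Context *cur = run_queue;
        for (;;) {
            RunResult r = execute_until_block(cur);
            Context *next = cur->next;           /* simple round-robin choice of successor */
            if (r == BLOCKED)
                move_to_wait_queue(cur);         /* it rejoins the run queue when signalled */
            cur = next;
        }
    }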
Portrait of a Pipeserver
Context Switching
Since each client has an associated graphics context, a context switch must be performed each time a client’s stream blocks.
If performed in hardware, the context switch time is dictated by the graphics accelerator.
That time is slow enough to limit the amount of intra-frame parallelism achievable with WireGL.
Overcoming Context Switching Limitations
Each pipeserver uses the same state tracking library as the client to maintain the state of each client in software.
Context switching on the server is instead performed by a context differencing operation.
Parallel applications collaborate to produce a single image and will typically have similar graphics states.
Context switching amongst collaborating nodes therefore has a cost proportional to the disparity between the contexts (sketched below).
Hence a hierarchy arises in which different contexts are classified according to their difference.
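A hedged sketch of a context-differencing switch: when the server changes from one tracked context to another, only the state elements that differ are issued to the hardware. The fixed-size element layout and names here are illustrative.

    #include <string.h>

    #define NUM_STATE_ELEMENTS 64
    #define ELEMENT_SIZE       64

    typedef struct {
        unsigned char element[NUM_STATE_ELEMENTS][ELEMENT_SIZE];
    } TrackedContext;

    void issue_element_to_hardware(int index, const unsigned char *data);   /* hypothetical */

    void switch_context(const TrackedContext *from, const TrackedContext *to)
    {
        for (int i = 0; i < NUM_STATE_ELEMENTS; i++) {
            /* Collaborating clients have similar states, so few elements
               differ and the switch is cheap. */
            if (memcmp(from->element[i], to->element[i], ELEMENT_SIZE) != 0)
                issue_element_to_hardware(i, to->element[i]);
        }
    }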
Scheduling Amongst Different Contexts
When a context blocks, the servers have a choice as to which context they will run next.
Therefore, one must consider
the cost of performing the context switch
and the amount of work that can be done before performing the next context switch.
A simple round-robin scheduler was used.
Round-Robin Scheduling
Why does round-robin scheduling work?
First, clients participating in the visualization of large data sets are likely to have similar contexts, making the expense of context switching low and uniform.
Since we can’t know when a stream will block, we can only estimate the time to the next context switch by using the amount of work queued for a particular context.
Any large disparity in the amount of work queued for a particular context is likely the result of application-level load imbalance.
Description of the Network
WireGL uses a connection-based network abstraction in order to support multiple network types.
It uses a credit-based flow control mechanism to prevent servers from exhausting memory resources when they cannot keep up with the clients (sketched below).
Each client/server pair is joined by a connection.
Sends are zero-copy, since buffer allocation is the responsibility of the network layer.
Receives are zero-copy as well, because the network layer allocates the receive buffers.
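A hedged sketch of credit-based flow control on one connection: the client may send only while it holds credits, and the server returns credits as it consumes buffers, so a slow server stalls its client instead of being flooded. The names and the byte-counting scheme are illustrative.

    typedef struct {
        int credits;     /* bytes the client may still send on this connection */
    } Connection;

    void transmit(Connection *c, const void *buf, int len);   /* hypothetical network send */
    int  receive_credit_message(Connection *c);               /* hypothetical: blocks, returns bytes freed */

    void send_with_credits(Connection *c, const void *buf, int len)
    {
        while (c->credits < len)
            c->credits += receive_credit_message(c);   /* wait until the server frees buffer space */
        transmit(c, buf, len);
        c->credits -= len;                             /* returned after the server processes the data */
    }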
Symmetric Connection
The connection between the client and the server is completely symmetric, which means that the servers can return data to the clients.
WireGL supports glFinish to tell applications when a command has been executed.
This allows applications to synchronize their output with some external input, ensuring that the graphics system’s internal buffering is not causing the output to lag behind the input.
The user may optionally enable an implicit glFinish-like synchronization upon each call to SwapBuffers (sketched below).
This ensures that no client gets more than one frame ahead of the servers.
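A hedged sketch of that optional per-frame throttle from the client’s side: after issuing SwapBuffers it waits for a reply from every server over the symmetric connection before starting the next frame. All names here are illustrative.

    void send_swapbuffers_to_servers(void);   /* hypothetical */
    void wait_for_server_reply(int server);   /* hypothetical: reads from the symmetric connection */

    void throttled_swap(int num_servers, int throttle_enabled)
    {
        send_swapbuffers_to_servers();
        if (throttle_enabled) {
            for (int s = 0; s < num_servers; s++)
                wait_for_server_reply(s);     /* client never gets more than one frame ahead */
        }
    }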
Display Management
To form a seamless output image, tiles must be extracted from the framebuffers of the pipeservers and reassembled to drive the display device.
There are two ways to perform this display reassembly:
Hardware
Software
Display Reassembly in Hardware
Experiments used Lightning-2 boards, which accepted 4 inputs and emitted 8 outputs.
More inputs are accommodated by connecting multiple Lightning-2 boards into a “pixel bus”.
Multiple outputs can be accommodated by repeating the inputs.
Hence, an arbitrary number of accelerators and displays can be connected in a 2-D mesh.
Display Reassembly in Hardware (2)
Each input to a Lightning-2 usually contributes to multiple output displays, so Lightning-2 must observe a full output frame from each input before it may swap.
This introduces one frame of latency.
Lightning-2 provides a per-host back channel over the host’s serial port, notifying the host when its pixels have been accepted.
WireGL waits for this notification before executing a client’s SwapBuffers command.
Having synchronized outputs allows Lightning-2 to drive tiled display devices like IBM’s Bertha or a multi-projector display wall without tearing artifacts.
Display Reassembly in Software
Without special hardware to support image reassembly, the final rendered image must be read out of each local framebuffer and redistributed over a network (sketched below).
The drawback to pure software reassembly is that it may diminish performance.
Pixel data must be read out of the local framebuffer, transferred over the internal network of the cluster, and written back to a framebuffer for display.
Software reassembly has demonstrated an inability to sustain high frame rates.
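A hedged sketch of the software path for a single tile: the pipeserver reads its tile back with glReadPixels and ships it across the cluster network; the display node writes it into the output framebuffer with glDrawPixels. The network calls are hypothetical, and glWindowPos2i (a later OpenGL addition, originally the ARB_window_pos extension) is used only to keep the positioning code short.

    #include <GL/gl.h>
    #include <stdlib.h>

    void send_tile(int x, int y, int w, int h, const void *pixels, int bytes);   /* hypothetical */

    /* On each pipeserver, after the frame has been rendered: */
    void read_and_send_tile(int x, int y, int w, int h)
    {
        unsigned char *pixels = malloc((size_t)w * h * 3);
        glReadPixels(0, 0, w, h, GL_RGB, GL_UNSIGNED_BYTE, pixels);   /* read the local tile */
        send_tile(x, y, w, h, pixels, w * h * 3);                     /* ship it over the cluster interconnect */
        free(pixels);
    }

    /* On the display node, for each tile received: */
    void draw_received_tile(int x, int y, int w, int h, const void *pixels)
    {
        glWindowPos2i(x, y);                                   /* place the tile in window coordinates */
        glDrawPixels(w, h, GL_RGB, GL_UNSIGNED_BYTE, pixels);  /* write it into the output framebuffer */
    }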
Visualization Server
Without reassembly hardware, the cluster can instead be used as a “visualization server”.
In this mode, all pipeservers read the color contents of their managed tiles at the end of each frame.
Those images are sent over the cluster’s interconnect to a separate, dedicated compositing server for reassembly.
Applications Used
March: a parallel implementation of the marching cubes algorithm. March extracts and renders 385,492 lit triangles/frame.
Nurbs: a parallel patch evaluator that uses multiple processors to subdivide a curved surface and tessellate it. Nurbs tessellates and renders 413,000 lit, stripped triangles/frame.
Hundy: a parallel application that renders a set of unorganized triangle strips. Hundy renders 4 million triangles/frame at a rate of 7.45 million triangles/sec.
Parallel Rendering Speedups
Parallel Interface
To scale any interface-limited application, it is necessary to allow parallel submission of graphics primitives.
This effect was illustrated with Hundy.
Some of Hundy’s performance measurements show a super-linear speedup because Hundy generates a large amount of network traffic per second.
This shows that Hundy’s performance is very sensitive to the behavior of the network under a high load.
Hardware vs. Software Image Reassembly
As the size of the output image grows, software image reassembly can quickly compromise the performance of the application.
A single application was written to measure the overhead of software versus hardware reassembly.
It demonstrated that hardware-supported reassembly is necessary to maintain high frame rates.
Load Balancing
There are two kinds of load balancing to consider.
The first is application-level load balancing (that is, balancing the amount of computation performed by each client node).
It is the responsibility of the programmer to efficiently distribute the work to the various nodes.
This aspect of load balancing was tested on each of the applications, and each application was shown to possess adequate application-level load balancing.
Load Balancing (2)
The other kind is balancing the graphics work done by the servers.
The rendering work must be distributed across multiple servers.
However, the rendering work required to generate an output image is typically not uniformly distributed in screen space.
Thus, the tiling of the output image introduces a potential load imbalance; this may create a load imbalance in the network as well.
Scalability Limits
Experiments indicate that WireGL should be able to scale from 16 pipeservers and 16 clients to 32 pipeservers and 32 clients if the network were better able to support all-to-all communication.
The ultimate limit on scalability is the amount of screen-space parallelism available for a given output size.
For a huge cluster of, say, 128 nodes, the tile size would be so small that it would be difficult to provide a good load balance for any non-trivial application without a prohibitively high overlap factor.
Texture Management
WireGL’s client treats texture data as a component of the graphics state and lazily updates the servers as needed (sketched below).
In the worst case, this results in each texture being replicated on every server node in the system.
This is a consequence of using commodity graphics accelerators in the cluster: it is not possible to introduce a stage of communication to access texture memory remotely.
New texture management strategies, such as parallel texture caching, are currently being investigated.
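A hedged sketch of the lazy update, treating each texture like any other tracked state element: it is downloaded to a server only the first time geometry that uses it is sent there. Names and the 32-server bitmask are illustrative.

    #include <stdint.h>

    typedef struct {
        uint32_t downloaded;   /* bit i set => server i already holds this texture */
        const void *texels;
        int width, height;
    } TrackedTexture;

    void send_texture_to_server(int server, const TrackedTexture *t);   /* hypothetical */

    /* Called when buffered geometry that uses texture t is about to be
       sent to a server (because the geometry overlaps that server's tiles). */
    void ensure_texture_on_server(TrackedTexture *t, int server)
    {
        if (!((t->downloaded >> server) & 1u)) {
            send_texture_to_server(server, t);   /* worst case: every server ends up with a copy */
            t->downloaded |= (1u << server);
        }
    }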
Latency
There are two sources of latency:
the display reassembly stage, and
the buffering of commands on the client.
Display reassembly via the Lightning-2 cards introduces one frame of latency.
Display reassembly via software introduces 50-100 ms of latency.
Latency due to command buffering depends on the size of the network buffers and on the fact that the pipeserver cannot process a buffer until it has been completely received.
Future Work
The main direction for future development is to add flexibility to accommodate a broader range of parallel rendering applications.
The next version will allow a user to describe an arbitrary directed graph of graphics stream processing units.
This will involve developing new parallel applications to use the new system.
The system also shows promise for CAVEs.