Graphic Engine Resource Management
Mikhail Bautin
Ashok Dwarakinath
Tzi-cker Chiueh
Computer Science Department
Stony Brook University
{mbautin, ashok, chiueh}@cs.sunysb.edu
ABSTRACT
Modern consumer-grade 3D graphic cards boast a computation/memory resource that can easily rival or even
exceed that of standard desktop PCs. Although these cards are mainly designed for 3D gaming applications, their
enormous computational power has attracted developers to port an increasing number of scientific computation
programs to these cards, including matrix computation, collision detection, cryptography, database sorting, etc.
As more and more applications run on 3D graphic cards, there is a need to allocate the computation/memory
resource on these cards among the sharing applications more fairly and efficiently. In this paper, we describe the
design, implementation and evaluation of a Graphic Processing Unit (GPU) scheduler based on Deficit Round
Robin scheduling that successfully allocates to every process an equal share of the GPU time regardless of their
demand. This scheduler, called GERM, estimates the execution time of each GPU command group based on
dynamically collected statistics, and controls each process’s GPU command production rate through its CPU
scheduling priority. Measurements on the first GERM prototype show that this approach can keep the maximal
GPU time consumption difference among concurrent GPU processes consistently below 5% for a variety of
application mixes.
Keywords: GPU scheduling, Resource Management, GPU fairness, DRI, DRM kernel module, GERM
1. INTRODUCTION
Modern consumer-grade graphic cards boast a computation/memory resource that can easily rival or even exceed
that of standard desktop PCs. For example, a GeForce 8800GTX-based card contains 768MB of memory with
a total memory bandwidth of 86.4GB/sec, and a G80 GPU that has 128 shader processors each operating at
1.35GHz, where 16 pixels are processed simultaneously in each clock cycle. Although these cards are originally
designed for 3D gaming applications, their enormous computational power has attracted an increasing number of
researchers to port their applications to them. Non-graphic applications1 that have been successfully ported to
GPUs include matrix computation, video/image processing, collision detection, cryptography, database sorting,
etc. As more and more GPU-exploiting applications start to emerge, a GPU’s computation/memory resource
may need to be time-shared among multiple such applications that are running concurrently. As a result, the GPU
needs an operating system to manage its computation/memory resource, just as the CPU does. In addition, the GPU’s
resource manager should coordinate its resource allocation decisions with those made by the main OS running
on the CPU.
Existing operating system support for GPU2, 3 allows multiple graphic applications to run on a single GPU as
if each of them has exclusive access to the GPU’s resource, but does not prevent applications from monopolizing
the GPU resource. Typically, a graphic application is built on top of a user-level graphic library, e.g. OpenGL or
Direct3D, which converts high-level commands into low-level GPU commands through a user-level GPU-specific
driver. A GPU interacts with its driver via a GPU command ring buffer on the graphic card, to which the
GPU driver DMAs commands from the host memory. Although the GPU interrupts the CPU when DMA transactions
complete, it does not generate an interrupt when a GPU command exits the GPU’s graphic pipeline.
Modern GPUs are stateful in the sense that a graphic application’s GPU commands must be executed with
respect to a specific GPU state that it set up beforehand. A GPU state typically includes transformation
matrices, view-port specifications, lighting parameters, etc. Because different applications require different GPU
states, it is essential to restore the GPU state of a graphic application before scheduling its GPU commands.
Restoring a graphic application’s state requires issuing a sequence of GPU commands to properly set up the
GPU. This step is similar to context switching associated with multiplexing processes on a CPU, except one key
difference: Context switching is performed by the OS as part of CPU scheduling and is typically transparent to
the applications, whereas GPU state switching is done by the user-level GPU driver, and is decoupled from the
CPU scheduler.
Because the CPU scheduler is not responsible for establishing a graphic application’s state before scheduling
it on the CPU, there is no guarantee that its GPU commands are executed with respect to their corresponding
state. To address this problem, the user-level GPU driver of a graphic application must acquire a lock before
submitting commands to the GPU to guarantee that it has exclusive access to the GPU. We call the set of
commands an application issues to the GPU between a lock acquisition and a lock release a command batch,
which is different from a command group, a minimal sequence of commands that must be DMAed into the GPU
atomically.
As a result of the above design, the computation and memory resources on the GPU are allocated on a first
come first serve basis. More concretely, the GPU processes commands in the GPU command ring buffer in the
order in which they are inserted without regards to any notion of fairness or priority. Because GPU commands
may request allocation of physical memory on the graphic card, these allocation requests are also serviced on a
first come first serve order.
The prevailing GPU command scheduling model has several serious weaknesses. First, it cannot limit the
amount of GPU resource consumed by individual processes, as is done by CPU scheduler using time quanta,
because each graphic application is allowed to issue as many GPU commands as it wants once it acquires the
lock protecting the command ring buffer. Second, if a process acquires the command ring buffer lock and fails
to release it, whether maliciously or by mistake, no other process can issue commands to the GPU and the GPU’s
resource may lie wasted during that time. Finally, the resource allocation decisions of the GPU scheduler are
not coordinated with the CPU scheduler’s. This could hurt a graphic application’s performance at times of GPU
command bursts because the CPU and GPU resources it receives do not match.
We propose a new GPU resource manager called GERM (Graphics Engine Resource Manager), which is compatible with the programming model of existing GPU drivers, but is more flexible and intelligent. In particular,
GERM supports the following unique features:
• GERM supports GPU state switching without relying on the user-level GPU driver, and therefore can
schedule GPU commands at the granularity of command groups rather than command batches.
• GERM does not require locking on the command ring buffer, and therefore allows a GPU application to
issue GPU commands whenever it is scheduled.
• GERM schedules GPU commands in such a way that matches the resource allocation decisions of the CPU
scheduler. More specifically, the amount of GPU resource that GERM allocates to a GPU application is
proportional to the CPU resource given to it.
• GERM also monitors video memory usage of each GPU application, and allocates the physical memory
resource among applications in a way that minimizes GPU idle time and is compatible with GPU computation resource allocation.
The main reason that GERM can support more flexible and intelligent GPU resource allocation is that
it is capable of exploiting the semantics of GPU commands. Although this approach ties GERM to individual
GPUs, it is possible to eliminate hardware dependency by devising a general API for a GPU-specific driver to
convey its computation and memory resource requirement for each GPU command submitted. We will briefly
describe such an API in Section 4.3.
2. RELATED WORK
Windows Display Driver Model (WDDM)3, 4 introduced in Windows Vista also tries to solve the problem of fairly
allocating a GPU’s computation and video memory resource among multiple applications. More specifically,
WDDM addresses the following problems of Windows XP’s GPU resource allocation:
• Windows XP GPU scheduler is first-come-first-serve, which results in starvation of some applications. As
a result, when multiple 3D graphic applications are executed, they appear to render in ‘bursts,’ which
manifest themselves in the form of large variance in the frame rate.
• Video memory allocation in Windows XP is also first-come-first-serve. When the video memory supply is
exhausted, applications fail to start. There is no way to reclaim the video memory used by a
particular application other than terminating it. The problem of an application requesting more
video memory than is available is resolved in Windows XP, as well as in DRI (explained later), by keeping
a backup copy of all textures in the main memory, which can lead to a significant waste.
WDDM comprises two different driver models - a basic model for existing GPUs, which are not designed to
support concurrent execution of multiple 3D applications, and an advanced model for future GPUs, which are
designed to support concurrent execution of multiple 3D applications. In the basic model, the GPU scheduler
is responsible for dispatching GPU command groups from application processes to the GPU, but there is no
interruptibility and GPU command group execution time is unbounded. The GPU scheduler serves as a throttle
for all the graphic processes. The video memory manager virtualizes the video memory on a 3D graphic card to
application processes. It manages both the on-card physical memory and the AGP aperture. Eviction is done by
mapping the destination region in main memory through AGP and moving the evicted data from video memory
to the destination region.
The advanced model assumes the GPU supports multiple GPU contexts for running multiple GPU applications concurrently. Every GPU context contains a ring buffer for GPU commands and a page table for video
memory. The scheduler computes the order in which GPU applications are to run and sends a list of their contexts to the GPU. The GPU runs the processes in the specified order, executes command groups in each GPU
context, and issues an interrupt to the CPU after completing the commands associated with a GPU context.
A GPU context switch may be triggered by a video memory page fault or requested by the GPU scheduler.
The GPU references video memory through the page table in the same way as the CPU does. This virtual memory-like
support enables fine-grained memory management in that it allows part of a GPU application’s working set to
be evicted and paged in later on a page fault.
SGI’s multi-rendering5, 6 is a modified X server architecture that is designed to support concurrent graphics
applications running in a multi-threading or multi-processing environment. However, its GPU resource scheduling
model is FIFO-based.
Existing GPU programming frameworks such as NVIDIA’s Compute Unified Device Architecture (CUDA)7
or APIs such as ATI/AMD’s Close to Metal (CTM) technology8 are mainly designed to simplify graphics and
non-graphics application development on the GPU and to enable applications to fully exploit the power of the
GPU. However, they do not provide any support for dynamic GPU resource allocation, let alone GPU command
scheduling.
Compared with WDDM, DRI (described in the next section) and SGI’s multi-rendering system, GERM
implements finer-grained GPU command scheduling, in which the basic unit of scheduling is an atomic GPU
command group, and as a result is able to support more flexible QoS policies on GPU resource allocation.
3. DIRECT RENDERING INFRASTRUCTURE
3.1 Overview
The Direct Rendering Infrastructure,2 also known as DRI, allows multiple graphic applications to efficiently share
a 3D graphic card under the X Window system. It works with the Mesa open-source OpenGL implementation
and a wide range of GPU-based cards, each of which requires a device-specific user-level driver. The programming
model and high-level design of DRI is similar to the Windows Display Driver Model (WDDM).3, 4 Therefore,
this section will focus on DRI’s design and implementation. Other related work includes SGI’s multi-rendering
system,5, 6 which requires modification to the X Server to support multi-threading and multi-processing graphic
applications.
DRI supports both GPU command scheduling and video memory management. A GPU application is
linked with libGL.so, the Mesa OpenGL library, which in turn loads a user-level GPU-specific graphic card
driver (one instance per client process), whose internal structure is shown in Figure 1.

Figure 1. The internal structure of a user-level driver for a 3D graphic card

This driver interacts with the GPU hardware through the Direct
Rendering Manager (DRM) kernel module,9 which physically moves GPU commands to the GPU and manages
the hardware lock. The GPU driver receives high-level graphic commands from the Mesa library and generates
the corresponding sequence of GPU commands, and periodically dispatches them to the GPU by making a system
call to the DRM kernel module, which in turn DMAs them into a command ring buffer accessible to the GPU.
Before issuing system calls to dispatch GPU commands, a GPU application’s user-level driver must obtain a lock
on the command ring buffer first to ensure exclusive access. We will refer to this lock as ‘hardware lock’ from
now on. If the user-level driver detects that another process has issued commands to the GPU command ring
buffer since the last time it dispatched commands, it will first issue a set of special GPU commands to restore
the GPU to the desired rendering state.
3.2 GPU Driver State
The user-level GPU driver, whose structure is shown in Figure 1, converts high-level OpenGL commands to
GPU commands. To effectively interface the OpenGL library (in this case Mesa) and a GPU-based 3D graphic
card, the user-level GPU driver needs to maintain a set of data structures which are described in the following
(the description here is based on the open-source driver for the ATI Radeon 9250 card, r200_dri.so):
• GPU command buffer. This small (typically 4-8 KByte) buffer accumulates GPU commands before
they are dispatched to the kernel. Its size is sufficiently large to ensure atomicity of every possible command
group that needs to be sent to the command ring buffer in one shot, such as GPU state restoration or
vertex array element drawing commands corresponding to glArrayElement() or glDrawElements().
• GPU state. This structure consists of approximately 64 GPU command sequences, called atoms, that
are responsible for setting values of GPU registers which correspond to different parts of the OpenGL
state such as transformation matrices, view-port specifications, lighting parameters etc. The GPU driver
updates this structure in response to changes in the state of the OpenGL library, and uses it to restore
the GPU state if necessary. Whenever the GPU driver flushes the command buffer, it sends the saved
GPU state to the kernel if it detects that a different process has accessed the GPU since this process’s
last command buffer flush. It is therefore the responsibility of the GPU driver to implement GPU context
switching.
• Texture memory map. These data structures are required for video memory allocation and described
in Section 3.3.
Before the GPU driver can flush its command buffer, it needs to acquire a hardware lock first. If the GPU
driver cannot successfully acquire the hardware lock, the entire application is blocked. If it successfully acquires
the hardware lock, it then checks if there is another process that held the lock since it last released the lock.
The GPU driver can dispatch as many GPU commands into the kernel as it wants once it acquires the hardware
lock, and releases the lock after it issues all its GPU commands.
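For concreteness, the following C sketch outlines the flush path just described. All identifiers (struct driver_context, drm_lock(), emit_state_atoms(), and so on) are illustrative placeholders rather than the actual r200_dri.so or DRM symbols; the real driver follows the same shape but differs in detail.

#include <stddef.h>

struct hw_lock { int last_owner; };
struct driver_context {
    int fd, my_id;
    struct hw_lock *hw_lock;
    unsigned char *cmd_buf;
    size_t cmd_buf_used;
};

/* Stubs standing in for the real DRM ioctls and state emission. */
void drm_lock(int fd, struct hw_lock *l);
void drm_unlock(int fd, struct hw_lock *l);
void drm_dispatch(int fd, const unsigned char *buf, size_t len);
void emit_state_atoms(struct driver_context *ctx);

/* Illustrative flush path of a DRI user-level driver. */
void flush_command_buffer(struct driver_context *ctx)
{
    drm_lock(ctx->fd, ctx->hw_lock);            /* may block the whole process */

    /* Another process used the GPU since our last flush, so the hardware
       state no longer matches ours: re-emit the saved state atoms first.  */
    if (ctx->hw_lock->last_owner != ctx->my_id) {
        emit_state_atoms(ctx);
        ctx->hw_lock->last_owner = ctx->my_id;
    }

    /* Hand the accumulated command groups to the DRM kernel module, which
       DMAs them into the GPU command ring buffer.                         */
    drm_dispatch(ctx->fd, ctx->cmd_buf, ctx->cmd_buf_used);
    ctx->cmd_buf_used = 0;

    drm_unlock(ctx->fd, ctx->hw_lock);
}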
3.3 Video Memory Management
In DRI, a graphic card’s video memory is divided into a fixed number of equal-sized chunks called regions. Every
region at any instant is allocated to at most one GPU application. To efficiently allocate video memory, DRI
maintains a global list of regions and their last access times sorted in the least recently used (LRU) order, a
per-process linked list of texture objects sorted in the LRU order, and a per-process list of memory blocks allocated
to a process, where memory blocks are contiguous variable-size ranges of the video memory that are linked in
the order of increasing offset and marked free or used.
When a GPU driver needs to allocate physical memory for a texture object, it first checks if there is a free
memory block of sufficient size in the local list of memory blocks. If there is not, it removes texture objects from
its per-process texture object list, starting from the least recently used ones until it finds enough space to hold
the new texture. In this search, regions owned by other processes are represented as dummy texture objects and
included in the per-process texture objects list according to how recently they have been used by their owner.
Whenever a process creates or accesses a texture object, the object’s recency attribute and its position in the
texture object list are modified accordingly.
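The eviction loop at the heart of this allocation procedure can be sketched in C as follows. The structures and names are a simplified illustration rather than DRI's actual data layout; the block allocator and the dummy-object bookkeeping for other processes' regions are reduced to stubs.

#include <stddef.h>

struct tex_obj {
    struct tex_obj *lru_prev, *lru_next;   /* per-process LRU list            */
    size_t          size;                  /* bytes of video memory occupied  */
};

struct tex_heap {
    struct tex_obj *lru_head;              /* most recently used              */
    struct tex_obj *lru_tail;              /* least recently used             */
    size_t          free_bytes;            /* currently free video memory     */
};

/* Stub: unbinds the texture object from its owner and frees the object. */
void tex_release(struct tex_heap *h, struct tex_obj *t);

/* Evict least-recently-used texture objects (including dummy objects that
 * stand in for other processes' regions) until `needed` bytes are free.
 * Returns 0 on success, -1 if even a full sweep cannot satisfy the request. */
int evict_until_fits(struct tex_heap *h, size_t needed)
{
    while (h->free_bytes < needed && h->lru_tail != NULL) {
        struct tex_obj *victim = h->lru_tail;      /* least recently used */
        h->lru_tail = victim->lru_prev;
        if (h->lru_tail)
            h->lru_tail->lru_next = NULL;
        else
            h->lru_head = NULL;

        h->free_bytes += victim->size;             /* its blocks become free */
        tex_release(h, victim);
    }
    return h->free_bytes >= needed ? 0 : -1;
}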
4. GRAPHICS ENGINE RESOURCE MANAGEMENT
4.1 Fine-Grained GPU Command Scheduler
GERM provides a command queue in the kernel into which each process inserts its GPU commands. GERM’s command
scheduler then schedules GPU commands from these in-kernel command queues to the command ring buffer on
the GPU card, as shown in Figure 2. Because each process has its own command queue, it is no longer necessary
to acquire a hardware lock before injecting GPU commands into the kernel.
The main design goal of GERM’s command scheduler is to ensure that its resource allocation decisions are compatible with the CPU scheduler’s. That is, the amount of GPU resource allocated to a process is proportional to
the CPU resource that it is given. One way to achieve this goal is to distribute the GPU time among processes
according to their CPU usage. When a process is given a higher priority in CPU scheduling, it runs more
frequently, is more likely to dispatch more GPU commands, and therefore deserves more GPU time. Another
approach is to distribute the GPU time among processes according to the loads they each present to the GPU.
The GPU load offered by a process is in turn determined by its inherent GPU computation requirement and
the CPU scheduling priority. However, this approach may not be able to protect innocent applications from
mis-behaving applications that abuse the GPU resource.
The minimal scheduling granularity of GERM is a GPU command group, which is also the minimal unit
of work that most applications use when injecting load to the GPU. However, different GPU command groups
may consume different amounts of GPU computation resource. For example, a large triangle with multi-pass
rendering consumes much more GPU computational resource than a small triangle with wire-frame rendering.
Therefore, to accurately estimate the GPU load offered by a process, it is essential that GERM go beyond
counting GPU commands, and actually take into account their execution semantics.
The most accurate way to estimate a GPU command group’s computational resource requirement is to
interpret the command group’s execution in the GPU. However, this approach is time-consuming and incurs
too much run-time overhead. Therefore, GERM uses a measurement approach instead, which estimates a GPU
command group’s computational resource requirement based on the number of bytes or the number of vertices
in the command group. More specifically, GERM measures the elapsed time of each GPU command group
Tc
C, Tc , and computes two empirical constants: Coef V tx, which is equal to V txCount
, and Coef CmdByte,
c
Tc
which is equal to ByteCountc , where V txCountc and ByteCountc are the number of vertices and bytes in the
command group C. Then it maintains a running average of the Coef V tx and Coef CmdByte values of all the
Task 1 command queue
Task 1 statistics
Task 1
...
...
Task N command queue
Task N statistics
Task N
Scheduling
thread
Execution time estimations
Pending command group info queue (PendingGrpQueue)
Timing
thread
Finished
command
groups
GPU command ring buffer (accessed directly by GPU)
GPU
Figure 2. The software architecture of GERM’s command scheduler, which supports performance isolation using finegrained GPU computation resource allocation.
GPU command groups executed in each process, and uses the resulting Coef V tx and Coef CmdByte averages
to predict the GPU time requirement of the j-th GPU command group in the i-th process as follows:
GP U T imeij
= Coef V txi × V txCountij if V txCountij > 0
= Coef CmdBytei × ByteCountij if V txCountij = 0
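The per-process statistics behind these two coefficients can be sketched as follows. The structure, field, and function names are our own illustration rather than the prototype's symbols, and the exponential smoothing factor is an assumed detail; the paper only specifies that a running average is kept.

struct germ_task_stats {
    double coef_vtx;       /* running average of T_C / VtxCount_C  */
    double coef_cmd_byte;  /* running average of T_C / ByteCount_C */
};

/* Running average used for both coefficients (smoothing factor assumed). */
static void update_avg(double *avg, double sample)
{
    const double alpha = 0.125;
    *avg = (*avg == 0.0) ? sample : (1.0 - alpha) * *avg + alpha * sample;
}

/* Called once the elapsed time of a dispatched command group is known. */
void germ_record_group_time(struct germ_task_stats *s,
                            unsigned vtx_count, unsigned byte_count,
                            double elapsed)
{
    if (vtx_count > 0)
        update_avg(&s->coef_vtx, elapsed / vtx_count);
    else if (byte_count > 0)
        update_avg(&s->coef_cmd_byte, elapsed / byte_count);
}

/* Called when a new command group is enqueued, to predict its GPU time. */
double germ_estimate_group_time(const struct germ_task_stats *s,
                                unsigned vtx_count, unsigned byte_count)
{
    return vtx_count > 0 ? s->coef_vtx * vtx_count
                         : s->coef_cmd_byte * byte_count;
}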
When a process inserts a new GPU command group into its per-process command queue in the kernel, GERM
estimates the group’s GPU time requirement, and updates the process’s offered GPU load, which is expressed
in terms of GPU time requirement per second. Then GERM schedules GPU command groups from these per-process command queues using a weighted round robin scheduling policy.10 The user can explicitly specify the
weight associated with each application. Alternatively, a process’s weight could be set to its offered load.
In the current GERM prototype, we assign each process the same weight for implementation simplicity. When
scheduling GPU command groups among competing processes, GERM relies on the command groups’ estimated
GPU time requirements to ensure that each process’s command queue is consumed at a rate proportional to its
weight. Instead of scheduling individual commands, GERM’s weighted round robin scheduler uses a cycle time
that is roughly equal to one frame time, so that it could dispatch one or multiple command groups each time
it visits a process’s command queue. A larger cycle time significantly reduces the GPU context switching overhead,
because it amortizes the cost of each GPU context switch over multiple GPU commands.
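The following C sketch illustrates one cycle of such a pass in the spirit of deficit round-robin scheduling.10 All names are hypothetical; estimate_group_time() stands for the CoefVtx/CoefCmdByte estimator described above, and the number of tasks is fixed only for brevity.

#define NTASKS 8

struct cmd_group;                       /* opaque GPU command group */

struct task_queue {
    double weight;                      /* relative share of GPU time         */
    double deficit_us;                  /* unused budget carried between cycles */
    int  (*peek)(struct task_queue *, struct cmd_group **);  /* 1 if non-empty */
    void (*pop)(struct task_queue *);
};

double estimate_group_time(const struct cmd_group *g);   /* microseconds     */
void   switch_gpu_context(int task);                     /* restore GPU state */
void   emit_to_ring(const struct cmd_group *g);

void schedule_one_cycle(struct task_queue q[NTASKS], double cycle_us)
{
    double weight_sum = 0.0;
    for (int i = 0; i < NTASKS; i++)
        weight_sum += q[i].weight;

    for (int i = 0; i < NTASKS; i++) {
        struct cmd_group *g;

        if (!q[i].peek(&q[i], &g)) {    /* empty queue: drop any carried deficit */
            q[i].deficit_us = 0.0;
            continue;
        }

        /* Each backlogged task earns GPU time proportional to its weight. */
        q[i].deficit_us += cycle_us * q[i].weight / weight_sum;
        switch_gpu_context(i);

        /* Dispatch whole command groups while the earned budget lasts. */
        while (q[i].peek(&q[i], &g) &&
               estimate_group_time(g) <= q[i].deficit_us) {
            q[i].deficit_us -= estimate_group_time(g);
            emit_to_ring(g);
            q[i].pop(&q[i]);
        }
    }
}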
To derive an accurate estimate for every GPU command group without interpreting it is challenging for two
reasons. First, it is unlikely that a linear model based on the number of command bytes or vertices could capture
a command group’s computation time on a complex GPU. Second, it is not possible to precisely measure the
time taken by each GPU command group because existing GPU hardware does not interrupt the CPU when it
completes each GPU command or command group. To solve the second problem, GERM assigns each command
group it dispatches to the GPU a unique ID, which is incremented for each dispatched command group, and then
inserts after each command group a GPU command that increments a particular register on the GPU, which we
call the timing register. Periodically, say every X seconds, GERM reads the content of the timing register and
compares it with the result of the previous read, and the difference indicates the number of command groups
that have been completed during each period. GERM approximates the computational time requirement of each
command group in the i-th period as X/N_i, where N_i is the number of command groups completed in the i-th
period. If X is sufficiently small, say 1 msec, this approximation is reasonably accurate; otherwise, GERM needs
a more sophisticated scheme that takes into account the complexity of each command group completed in a
period when distributing time among them.
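The accounting performed by ProbeTimingReg() can be sketched as follows, again with illustrative names; the MMIO register read, the pending-group FIFO, and the per-task statistics update are reduced to stubs.

#include <stdint.h>

struct pending_group {
    uint32_t id;                  /* sequence number assigned at dispatch time */
    int      task;                /* owning process */
};

uint32_t read_timing_register(void);                                /* MMIO read   */
uint64_t now_usec(void);                                            /* e.g. RDTSC  */
int      pending_pop_upto(uint32_t id, struct pending_group *out);  /* FIFO pop    */
void     account_gpu_time(int task, double usec);                   /* feeds stats */

static uint32_t last_seen_id;
static uint64_t last_probe_us;

void probe_timing_reg(void)
{
    uint32_t done_id = read_timing_register();
    uint64_t now     = now_usec();
    uint32_t n_done  = done_id - last_seen_id;    /* groups finished since last probe */

    if (n_done == 0) {
        last_probe_us = now;
        return;
    }

    /* Approximate each finished group's execution time as X / N_i, i.e. the
       probe interval divided evenly among the groups completed within it. */
    double per_group_us = (double)(now - last_probe_us) / n_done;

    struct pending_group g;
    while (pending_pop_upto(done_id, &g))
        account_gpu_time(g.task, per_group_us);

    last_seen_id  = done_id;
    last_probe_us = now;
}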
The current GERM prototype consists of two threads. The first thread is the scheduling thread, which is
woken up every time when the GPU is about to become idle. GERM uses the computation time estimates of
dispatched command groups and the content of the timing register to determine when the GPU is about to
complete all the commands in its queue. This thread visits per-process command queues in a cyclic fashion,
decides how many command groups to dispatch from each process, performs GPU context switch if necessary
and emits the chosen command groups to the GPU’s command ring buffer. After dispatching a command group,
it emits a GPU command that increments the timing register.
The second thread is the timing thread, which is woken once every millisecond. It spins in a loop of calling
the ProbeTimingReg() function and sleeping for one millisecond. The ProbeTimingReg() function detects the
set of command groups that have completed since the last probe, and then calculates the computation time of
each command group in the set. In addition to the timing thread, this function is also called when the user-level
GPU driver makes a system call to inject GPU commands and when the scheduler thread is invoked, to ensure
that the measurement of the timing register is sufficiently fine-grained to produce accurate computation time
estimates for command groups. GERM also feeds these computation time measurements to adjust the resource
allocation of each competing process in subsequent scheduling cycles.
Weighted round robin scheduling guarantees fairness among competing processes only when their command
queues are back-logged. However, vanilla CPU schedulers, such as the CPU scheduler in the Linux 2.6 kernel
with a time quantum of 100 msec, tend to create a scenario in which only one process’s command queue contains
commands, because the time quantum is usually large enough for one process to generate multiple frames worth
of GPU commands while emptying others’ command queues. The coarse granularity of CPU scheduling leads to
visible ‘burstiness’ of frame rendering. To solve this problem, GERM starts dispatching a new process’s GPU
commands only when its in-kernel command queue accumulates a sufficient number of commands. In addition,
GERM actively blocks a process when the process tries to inject new commands to its queue and the queue is
full.
To prevent misbehaving applications from abusing the GPU’s resource, GERM needs to perform fine-grained
context switching so that it can schedule command groups in any way it sees fit. However, GPU context switching
is by definition GPU specific. To perform GPU context switching transparently, GERM needs to recognize on
the fly those commands from a process that modify the GPU state, and keep a copy of them so as to establish
that process’s GPU state later on. As an optimization, GERM could compare the GPU states of consecutive
processes, and issue only the commands that differ between the two states, to further cut down the performance
overhead associated with a GPU context switch.
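A minimal sketch of such differential state restoration is shown below. The atom count, layout, and sizes are hypothetical, not the r200 atom format; the point is only that GERM shadows each process's state atoms and, on a switch, emits just the atoms that differ from what the GPU currently holds.

#include <stdint.h>
#include <string.h>

#define N_ATOMS     64
#define ATOM_WORDS  32

struct gpu_state {
    uint32_t atom[N_ATOMS][ATOM_WORDS];   /* raw register-write command words */
    int      atom_len[N_ATOMS];           /* valid words per atom (0 = unset) */
};

void emit_to_ring_raw(const uint32_t *words, int n);

/* `current` is GERM's view of what the GPU registers hold right now;
   `next` is the saved state of the process being switched in.           */
void restore_state_differential(struct gpu_state *current,
                                const struct gpu_state *next)
{
    for (int i = 0; i < N_ATOMS; i++) {
        if (next->atom_len[i] == 0)
            continue;                                   /* process never set it */
        if (next->atom_len[i] == current->atom_len[i] &&
            memcmp(next->atom[i], current->atom[i],
                   next->atom_len[i] * sizeof(uint32_t)) == 0)
            continue;                                   /* already up to date   */

        emit_to_ring_raw(next->atom[i], next->atom_len[i]);
        memcpy(current->atom[i], next->atom[i],
               next->atom_len[i] * sizeof(uint32_t));
        current->atom_len[i] = next->atom_len[i];
    }
}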
Figure 3. The frame rate of three concurrent instances of Quake running under DRI, shown against the total amount of
texture memory available in the system. When the available amount of texture memory is low, the frame rate decreases
because of swapping of textures required by different instances.
4.2 Coordinated Video Memory Management
The goal of video memory management is to minimize the performance penalty associated with texture uploads.
There are two ways to achieve this goal. One is to decrease the amount of texture upload traffic, and the other
is to maximize the overlap between texture upload and GPU computation. As explained before, DRI supports a
LRU-like replacement algorithm for multiple GPU applications to efficiently share the video memory. However,
under this cooperative sharing model a GPU application can grab as much video memory as it wants, thus
evicting other applications’ texture objects without notifying them along the way. This subsection describes
several optimizations that we are currently implementing in GERM.
The texture memory management subsystem in DRI keeps different processes’ textures completely separate,
even though it is preferable to share textures in many cases. Figure 3 shows that the frame rate suffers a 60%
degradation because textures need to be swapped in and out of the video memory, even though the three Quake 3
processes use the same set of textures. If the video memory management subsystem can identify the commonality
of textures used in concurrent processes, it could avoid some of the texture swapping overhead.
The first optimization is to compute a hash value for each immutable texture object, so that before issuing
a texture upload command, GERM could check if the same texture object has already been uploaded into the
video memory by other processes, and if so skips the upload. Unlike the CPU, the GPU does not provide any hardware
support for virtual memory. Therefore, the residence check requires a texture object table that maps a texture’s
hash value to its starting location in the video memory if it is currently resident on the video memory, and to
zero if it is not.
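A minimal sketch of such a texture object table is shown below, using a fixed-size open-addressing hash table purely for illustration; the actual table organization in GERM may differ.

#include <stdint.h>

#define TEX_TABLE_SIZE 1024

struct tex_entry {
    uint64_t hash;         /* 0 means empty slot   */
    uint32_t vram_offset;  /* 0 means not resident */
};

static struct tex_entry tex_table[TEX_TABLE_SIZE];

/* Returns the texture's video memory offset, or 0 if it must be uploaded. */
uint32_t tex_lookup(uint64_t hash)
{
    for (unsigned i = 0; i < TEX_TABLE_SIZE; i++) {
        struct tex_entry *e = &tex_table[(hash + i) % TEX_TABLE_SIZE];
        if (e->hash == hash)
            return e->vram_offset;      /* resident: reuse, skip the upload */
        if (e->hash == 0)
            return 0;                   /* not seen before: must upload     */
    }
    return 0;
}

void tex_insert(uint64_t hash, uint32_t vram_offset)
{
    for (unsigned i = 0; i < TEX_TABLE_SIZE; i++) {
        struct tex_entry *e = &tex_table[(hash + i) % TEX_TABLE_SIZE];
        if (e->hash == 0 || e->hash == hash) {
            e->hash = hash;
            e->vram_offset = vram_offset;
            return;
        }
    }
    /* Table full: a real implementation would evict an entry here. */
}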
The second optimization is dynamic relocation of texture objects.11 Because a GPU application must upload
a texture to the graphic card’s video memory before it accesses the texture, GERM needs to consult with the
texture object table to ensure a process’s textures be resident on the video memory before dispatching commands
from that process. If a texture object is not resident, GERM needs to issue a texture upload command to explicitly
bring it into video memory. Being able to relocate a texture object when it is re-loaded significantly improves
the video memory’s utilization efficiency. Traditionally, virtual memory allows an application’s data structure to
move around the physical memory without disrupting the application. To support the same level of relocation
transparency, the command scheduler needs to rewrite GPU commands on the fly to adapt those commands that
access relocated texture objects.
The third optimization is to overlap texture uploads and GPU computation as much as possible. The
video memory manager goes through the GPU commands in the per-process command queues before they are
scheduled, computes the set of texture objects that should be kept resident on the video memory over time,
and schedules uploads of those texture objects that need to be brought in as far ahead of time as possible. In
addition, the video memory manager could break each such texture upload into multiple upload commands to
minimize the performance impact of texture pre-loading.
The final optimization is to allocate the video memory among competing GPU applications according to
their working set size. This makes it impossible for a GPU application to hog the entire video memory, and thus
enables performance isolation among these applications. The working set of a process is defined as the set of
texture objects that the process accesses during a particular time window.
4.3 Resource Requirement-Carrying API
Although the proposed GERM system can effectively manage the computation and memory resource on a graphic
card, it is GPU-dependent because it needs to parse GPU commands to derive their resource requirements. To
hide GPU-specific details from GERM, future GPU drivers could explicitly provide GPU commands’ resource
requirements when they inject the commands. This information includes command attributes used to estimate
a GPU command’s computation time, texture objects accessed, GPU state updates, relocation information for
on-the-fly command patching, etc. One open issue is the security implication of such an API: How does GERM
reduce the impact of an application that lies about the resource requirements of its GPU commands?
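One possible, purely hypothetical shape of such a resource requirement-carrying submission interface is sketched below; the field set simply mirrors the information listed above (command attributes, textures accessed, state updates, and relocation information), and none of these names come from the current prototype.

#include <stdint.h>

/* Descriptor the GPU-specific user-level driver would fill in for every
   command group it injects, so that GERM never has to parse GPU commands. */
struct germ_group_desc {
    uint32_t        byte_count;       /* size of the command group in bytes       */
    uint32_t        vertex_count;     /* vertices drawn (0 for non-draw groups)   */
    uint32_t        state_atom_mask;  /* which GPU state atoms this group changes */
    uint16_t        n_textures;       /* textures referenced by the group         */
    const uint64_t *tex_hashes;       /* content hashes of those textures         */
    uint16_t        n_relocs;
    const uint32_t *reloc_offsets;    /* byte offsets of texture-address fields
                                         that GERM may patch after relocation     */
};

/* Hypothetical ioctl-style entry point into the GERM kernel module. */
int germ_submit_group(int drm_fd,
                      const void *commands, uint32_t byte_count,
                      const struct germ_group_desc *desc);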
Our Graphics Engine Resource Manager (GERM) system is designed with the following goal in mind: it
should be able to prevent applications from uncontrolled consumption of system resources. It should completely
satisfy the demand of applications with low resource requirements, and serve the demands of applications with
high resource requirements equally.
Application Mix                               Fairness (U_time)
Gears, Train                                  0.12%
Quake 3, Train                                0.02%
Gloss, Gears, Train                           0.15%
3 x String Comparison                         0.97%
2 x MatrixMul 128×128                         0.55%
MatrixMul 128×128, String Comparison          5.40%
MatrixMul 128×128, MatrixMul 256×256          19%
Table 1. Inter-process fairness of different application mixes when they run under GERM’s GPU command scheduler. Each
application in an application mix runs at the same CPU priority. Gears, Train and Gloss are demo programs provided
with Mesa to demonstrate 3D graphics capability. Quake 3 is a 3D gaming application. Table 3 contains information on
the load presented by these applications in terms of number of triangles per frame and number of textures used. The string
comparison and matrix multiplication programs were implemented using the ATI fragment shader OpenGL extension.
5. IMPLEMENTATION
We have implemented the GPU scheduler based on the DRI code on a Linux machine running the 2.6.12 Linux
kernel. The machine is equipped with a 3D graphics card powered by an ATI Radeon r200 GPU.12 We chose
this card over more powerful cards available in the market because stable open-source drivers are available
for this card. Our implementation involves modifications both to the r200 user-level driver and to the radeon
kernel module loaded by DRM. Note that even though we have implemented our prototype for a specific GPU
and present results for the same, the key ideas in GERM can be applied to all GPUs. The effectiveness of
GERM depends on correctly estimating the load of each command group and scheduling the command groups in
a weighted fashion according to a configurable policy. Only the estimation part is GPU-dependent and requires
modification of the GPU-specific driver.
The GPU command group queue is implemented as a linked list of 96-KByte buffers. When a buffer is
filled, a new one is allocated and added to the linked list. When a buffer is exhausted by the GPU scheduler,
it is discarded. A command group may cross two consecutive buffers. Each buffer can be accessed by a GPU
application and the GPU scheduler without a lock. An atomic counter variable regulates the number of enqueued
command groups that the GPU scheduler is allowed to read.
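The counter protocol between a GPU application and the scheduler can be sketched as follows. C11 atomics stand in for the kernel's atomic primitives, and the 96-KByte buffer chaining is omitted; only the single-producer/single-consumer handoff is shown, with illustrative names.

#include <stdatomic.h>

struct cmd_queue {
    atomic_uint groups_ready;     /* command groups the scheduler may read */
    /* ... linked list of 96-KByte command buffers lives here ...          */
};

/* Producer side: called after a complete command group has been copied into
   the current buffer.  Release ordering makes the copied data visible
   before the counter increment. */
void queue_publish_group(struct cmd_queue *q)
{
    atomic_fetch_add_explicit(&q->groups_ready, 1, memory_order_release);
}

/* Consumer side: the scheduler thread claims up to `want` groups and returns
   how many it may safely read (single consumer, so the decrement is safe). */
unsigned queue_claim_groups(struct cmd_queue *q, unsigned want)
{
    unsigned avail = atomic_load_explicit(&q->groups_ready, memory_order_acquire);
    unsigned take  = avail < want ? avail : want;
    atomic_fetch_sub_explicit(&q->groups_ready, take, memory_order_relaxed);
    return take;
}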
The GPU scheduler adds the following information to the GPU command transfer protocol between the
user-level driver and the DRM:
• GPU state tagging: Whenever the GPU scheduler switches to a process, it needs to restore its GPU state.
To keep track of each process’s current GPU state, GERM tags each state atom in the GPU command
stream with the corresponding process’s ID.
• Command buffer discard: In DRI, GPU command buffers are allocated by DRM and reused when the
commands in them have finished. An ‘age’ register in the GPU is used to signal the completion of every
GPU command group. The value of this register has to be assigned at command scheduling time to ensure that it
is monotonically increasing, which is why command buffer discard commands also have to be tagged.
We perform time measurements using the RDTSC instruction in the IA-32 architecture, which returns the
value of a counter that is incremented every CPU clock cycle. In the kernel we use 64-bit fixed point arithmetic
with a precision of 10 bits after the binary point.
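These two timing primitives can be sketched as follows; the 3 GHz clock rate used in the conversion example is an assumed figure, not the test machine's actual frequency, and the user-space inline assembly stands in for the kernel's own rdtsc helper.

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define FIX_SHIFT 10                        /* 10 bits after the binary point */
typedef int64_t fix_t;

static inline fix_t   fix_from_int(int64_t x) { return x << FIX_SHIFT; }
static inline int64_t fix_to_int(fix_t x)     { return x >> FIX_SHIFT; }
static inline fix_t   fix_mul(fix_t a, fix_t b) { return (a * b) >> FIX_SHIFT; }
static inline fix_t   fix_div(fix_t a, fix_t b) { return (a << FIX_SHIFT) / b; }

/* Example: convert an elapsed cycle count into microseconds as a fixed-point
   value, assuming a (hypothetical) 3 GHz clock, i.e. 3000 cycles per usec. */
static inline fix_t cycles_to_usec(uint64_t cycles)
{
    return fix_div(fix_from_int((int64_t)cycles), fix_from_int(3000));
}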
6. PERFORMANCE EVALUATION
6.1 Fairness
There are two possible metrics that can be used to measure the degree of inter-process fairness when a set of
processes share a GPU. Intuitively, these metrics quantify the maximum difference between fractions of the GPU
time consumed by any two processes. Smaller fairness metric values correspond to fairer GPU time allocation,
with 0 being perfectly fair allocation.

Number of Context Switches per Frame    FPS    Frame Time (1/FPS), sec
0                                       518    0.0019
500                                     466    0.0021
1000                                    269    0.0037
2000                                    143    0.0069
3000                                    92     0.0108
4000                                    74     0.0135

Table 2. The impact of artificially introduced GPU context switches on the frame rate. The test application draws 100
triangles per frame. From these measurements we calculated that the GPU context switch time is very small, about
3 × 10−6 seconds.

If a process runs at a frame rate of F_i frames per second (FPS) when
running alone and at a frame rate of f_i FPS when running concurrently with other processes, then the fraction
of the GPU time used by that process can be estimated as s_i = f_i / F_i. One way to measure the degree of fairness
for N concurrent processes is

U_fps = max_{i=1,...,N} (f_i / F_i) − min_{i=1,...,N} (f_i / F_i)

Alternatively, if t_1, ..., t_N are the GPU times consumed by the N processes during the measurement period, then
the same degree of fairness can be expressed as

U_time = max_{i=1,...,N} (t_i / Σ_{j=1}^{N} t_j) − min_{i=1,...,N} (t_i / Σ_{j=1}^{N} t_j)
Because U_time can be applied to both graphics and non-graphics applications, we use only
U_time in this study. Table 1 presents the fairness metric U_time for a variety of application mixes. GERM achieves
nearly-perfect fairness for all application mixes, except the last two. This demonstrates that GERM’s GPU
command scheduler is able to accurately estimate the loads of the GPU command groups and schedule them
fairly, at least for graphics applications.
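Computing U_time from per-process GPU time measurements is straightforward, as the following sketch shows.

/* U_time: the gap between the largest and smallest fraction of total GPU
   time consumed by any process during the measurement period. */
double compute_u_time(const double t[], int n)
{
    double total = 0.0, max_share = 0.0, min_share = 1.0;

    for (int i = 0; i < n; i++)
        total += t[i];
    if (n == 0 || total == 0.0)
        return 0.0;

    for (int i = 0; i < n; i++) {
        double share = t[i] / total;
        if (share > max_share) max_share = share;
        if (share < min_share) min_share = share;
    }
    return max_share - min_share;
}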
To explain why GERM is not as effective in the last two application mixes, we need to first present how the
matrix multiplication program is implemented on the ATI Radeon r200 GPU using the ATI fragment shader
OpenGL extension.13 Each of the two input matrices is stored in multiple textures. The matrix multiplication
proceeds in multiple stages, each of which operates on a subset of these input textures. The output of each stage
is another texture which is provided as an input to the next stage. As a result, the matrix multiplication program
frequently uploads textures during its execution. This is different from other application programs used in this
experiment, which upload all required textures in the beginning of their execution. Because texture uploads take
place asynchronously through DMA operations, the current GERM prototype does not take texture uploads into
consideration when estimating the GPU load of a process. How to overcome this limitation is currently under
investigation.
When an application mix consists of two matrix multiplication instances whose input matrices are of different
dimension (128x128 vs. 256x256), the GPU loads associated with their texture uploads are different from each
other. The fact that the current GERM prototype completely ignores texture uploads causes the GPU time
allocation between these two applications to be substantially unfair (19%). The same explanation applies to the
unfair GPU time allocation (5.4%) between a matrix multiplication program, which performs frequent texture
uploads, and a string comparison program, which does not. When two matrix multiplication programs whose
input matrices are of identical dimensionality share a GPU, GERM fairly allocates GPU time between them. In
this case even though it ignores texture uploads, because the texture uploads of these two applications impose
the same load on the GPU, they effectively cancel each other out.
Figure 4 shows how the U_time metric varies with the number of triangles drawn by two identical instances
of a synthetic application that draws a certain number of random triangles and includes a CPU computation
component which is proportional to the number of triangles drawn. When the number of triangles drawn grows
to 900 and more, the CPU becomes the bottleneck, the command queue is empty more frequently, and it is
less likely for the deficit round-robin algorithm to maintain fairness. On the other hand, when the number of
triangles is smaller than 300, GERM becomes less fair as the number of triangles decreases, because its GPU
execution time estimation mechanism becomes less accurate when the command group execution time becomes
smaller.

Figure 4. Inter-process fairness of two instances of the same application that draws an increasing number of random
triangles and includes a CPU computation component that is proportional to the number of triangles drawn.

Application              Triangles per frame    Textures
Underwater               228                    232
Gloss (Mesa demo)        566                    17
GearTrain (Mesa demo)    2800                   0
Quake 3                  6000                   2226

                                       Frames per second
                                   1x        2x        3x        4x
Underwater        DRI              280       140       95        71
                  GERM             256       131       89        68
                  Overhead         8.6%      6.4%      6.3%      4.2%
Gloss             DRI              196       98        66        49
                  GERM             175       87        57        44
                  Overhead         10.7%     11.2%     13.6%     10.2%
GearTrain         DRI              77.8      39        -         -
                  GERM             66.7      32.7      -         -
                  Overhead         14.3%     16.2%     -         -
Quake 3           DRI              63.1      30.7      20        -
                  GERM             49.8      24.9      8.7       -
                  Overhead         21.1%     18.9%     56.5%     -

Table 3. The performance overhead associated with GERM’s GPU scheduler when 1 to 4 instances of a set of graphics
applications are executed.
6.2 Scheduling Overhead
Although GERM’s fine-grained GPU command scheduler does a reasonable job of ensuring inter-process fairness
among GPU applications, the scheduler potentially could introduce additional performance overheads and thus
lower the GPU’s effective capacity. In particular, GERM introduces more GPU context switches and incurs
additional CPU resource to estimate GPU load of each GPU command group. To measure the GPU context
switching overhead, we ran a test application that draws 100 triangles per frame and performs N GPU context
switches within every frame and measured the frame rate for different N values. Table 2 shows the impact of
N on the frame rate of this application. We found that the average GPU context switch overhead is very small,
about 3 × 10−6 sec. Assuming that there are 10 applications and GERM uses a scheduling cycle of 10 msec, there
will be 1000 GPU context switches per second (10 switches in each of the 100 cycles per second), and the total
GPU context switch overhead is 3 msec per second, or 0.3%.
Tables 3 and 4 show the end-to-end performance overhead of GERM’s GPU command scheduler compared with
DRI, when the workload consists of a varying number of instances of graphics applications and non-graphics
applications, respectively.

Application               Size of Instance                Elapsed time (CPU Mcycles)
                                                      1x        2x        3x
String Comparison         12.2 KB         DRI         3.54      7.39      11.61
                                          GERM        3.57      8.04      12.04
                                          Overhead    0.85%     8.79%     3.70%
Matrix Multiplication     128x128         DRI         11.96     24.43     35.47
                                          GERM        12.36     25.5      37.63
                                          Overhead    3.34%     4.38%     6.09%

Table 4. The performance overhead associated with GERM’s GPU scheduler when 1 to 3 instances of a set of general
purpose non-graphics applications are executed.

Because the GPU context switch overhead is negligible, the overheads
shown in these tables mainly come from GPU command parsing and GPU load estimation. Moreover, the current GERM prototype introduces an additional copying step for each GPU command, which could have been
optimized away. As a consequence, the performance overhead of GERM for an application increases with the
number of triangles that the application draws per frame. For the same reason, GERM tends to introduce
smaller performance overhead for non-graphics applications because they tend to use less complex geometry and
thus issue a smaller number of GPU commands.
7. CONCLUSION
The goal of the GERM project is to develop a full-scale GPU resource manager that could efficiently utilize a
GPU’s computation/memory resource and effectively provide performance isolation among applications sharing
the same GPU. This paper focuses on the development of a GPU scheduler that successfully demonstrates that
it is possible to schedule GPU applications on a fine-grained basis on commodity GPUs. In particular, GERM’s
GPU scheduler is capable of switching GPU applications sharing the same GPU at arbitrary points in their
GPU command stream by correctly saving and restoring their GPU state. In addition, GERM features an
accurate GPU command execution time estimation mechanism that is largely independent of the underlying GPU, and uses
this mechanism to allocate the GPU resource among competing GPU applications. Measurements of the first
GERM prototype show that GERM can keep the GPU time consumption difference among competing GPU
processes consistently below 5% in a variety of application mixes.
We are currently extending the GERM prototype with more intelligent video memory management and more
informative API. Then we will optimize GERM’s performance through such techniques as zero-copy command
and texture buffering and streamlined context switching. As the next step, we will explore other systems software
support for GPU application development, such as GPU debugging.
REFERENCES
1. J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, “A survey of general-purpose computation
on graphics hardware,” Computer Graphics Forum 26, pp. 80–113, March 2007.
2. K. E. Martin, R. E. Faith, J. Owen, and A. Akin, Direct Rendering Infrastructure, Low-Level Design Document. Precision Insight,
Inc., May 1999.
3. B. Langley, “Windows ‘Longhorn’ Display Driver Model - Details And Requirements,” in Windows Hardware Engineering Conference
(WinHEC), (Seattle, WA, USA), May 2004.
4. T. K. Steve Pronovost, Henry Moreton, “WDDM v2 and Beyond,” in Windows Hardware Engineering Conference (WinHEC), (Seattle,
WA, USA), May 2006.
5. M. J. Kilgard, S. Hui, A. A. Leinwand, and D. Spalding, “X server multi-rendering for OpenGL and PEX,” The X Resource 9(1),
pp. 73–88, 1994.
6. M. J. Kilgard, D. Blythe, and D. Hohn, “System support for openGL direct rendering,” in Graphics Interface ’95, W. A. Davis and
P. Prusinkiewicz, eds., pp. 116–127, Canadian Human-Computer Communications Society, (Quebec City, Quebec), May 1995.
7. “Compute Unified Device Architecture.” http://developer.nvidia.com/object/cuda.html.
8. “ATI CTM Guide.” http://ati.amd.com/companyinfo/researcher/documents/ATI_CTM_Guide.pdf, 2006.
9. R. E. Faith, The Direct Rendering Manager: Kernel Support for the Direct Rendering Infrastructure. Precision Insight, Inc., May
1999.
10. M. Shreedhar and G. Varghese, “Efficient fair queueing using deficit round-robin,” IEEE/ACM Trans. Netw. 4(3), pp. 375–385, 1996.
11. K. Whitwell and T. Hellstrom, “New DRI memory manager and i915 driver update,” Proceedings of 2006 Xorg Developer’s Conference,
(Santa Clara, CA, USA), February 2006.
12. “Radeon R200.” http://en.wikipedia.org/wiki/Radeon_R200.
13. “ATI fragment shader.” http://oss.sgi.com/projects/ogl-sample/registry/ATI/fragment_shader.txt, August 2002.