Graphic Engine Resource Management

Mikhail Bautin, Ashok Dwarakinath, Tzi-cker Chiueh
Computer Science Department, Stony Brook University
{mbautin, ashok, chiueh}@cs.sunysb.edu

ABSTRACT

Modern consumer-grade 3D graphic cards boast computation and memory resources that can easily rival or even exceed those of standard desktop PCs. Although these cards are mainly designed for 3D gaming applications, their enormous computational power has attracted developers to port an increasing number of scientific computation programs to these cards, including matrix computation, collision detection, cryptography, database sorting, etc. As more and more applications run on 3D graphic cards, there is a need to allocate the computation/memory resources on these cards among the sharing applications fairly and efficiently. In this paper, we describe the design, implementation and evaluation of a Graphic Processing Unit (GPU) scheduler based on Deficit Round Robin scheduling that successfully allocates to every process an equal share of the GPU time regardless of its demand. This scheduler, called GERM, estimates the execution time of each GPU command group based on dynamically collected statistics, and controls each process's GPU command production rate through its CPU scheduling priority. Measurements on the first GERM prototype show that this approach can keep the maximal GPU time consumption difference among concurrent GPU processes consistently below 5% for a variety of application mixes.

Keywords: GPU scheduling, Resource Management, GPU fairness, DRI, DRM kernel module, GERM

1. INTRODUCTION

Modern consumer-grade graphic cards boast computation and memory resources that can easily rival or even exceed those of standard desktop PCs. For example, a GeForce 8800GTX-based card contains 768MB of memory with a total memory bandwidth of 86.4GB/sec, and a G80 GPU that has 128 shader processors each operating at 1.35GHz, where 16 pixels are processed simultaneously in each clock cycle. Although these cards were originally designed for 3D gaming applications, their enormous computational power has attracted an increasing number of researchers to port their applications to them. Non-graphic applications1 that have been successfully ported to GPUs include matrix computation, video/image processing, collision detection, cryptography, database sorting, etc. As more and more GPU-exploiting applications start to emerge, a GPU's computation/memory resources may need to be time-shared among multiple such applications running concurrently. As a result, the GPU needs an operating system to manage its computation/memory resources, just as the CPU does. In addition, the GPU's resource manager should coordinate its resource allocation decisions with those made by the main OS running on the CPU. Existing operating system support for GPUs2,3 allows multiple graphic applications to run on a single GPU as if each of them had exclusive access to the GPU's resources, but does not prevent applications from monopolizing the GPU. Typically, a graphic application is built on top of a user-level graphic library, e.g. OpenGL or Direct3D, which converts high-level commands into low-level GPU commands through a user-level GPU-specific driver. A GPU interacts with its driver via a GPU command ring buffer on the graphic card, to which the GPU driver DMAs commands from the host memory. Although the GPU interrupts the CPU when DMA transactions complete, it does not generate an interrupt when a GPU command exits the GPU's graphic pipeline.
Modern GPUs are stateful in the sense that a graphic application's GPU commands must be executed with respect to a specific GPU state that the application sets up beforehand. A GPU state typically includes transformation matrices, view-port specifications, lighting parameters, etc. Because different applications require different GPU states, it is essential to restore the GPU state of a graphic application before scheduling its GPU commands. Restoring a graphic application's state requires issuing a sequence of GPU commands to properly set up the GPU. This step is similar to the context switching associated with multiplexing processes on a CPU, with one key difference: context switching is performed by the OS as part of CPU scheduling and is typically transparent to the applications, whereas GPU state switching is done by the user-level GPU driver and is decoupled from the CPU scheduler. Because the CPU scheduler is not responsible for establishing a graphic application's GPU state before scheduling it on the CPU, there is no guarantee that its GPU commands are executed with respect to their corresponding state. To address this problem, the user-level GPU driver of a graphic application must acquire a lock before submitting commands to the GPU to guarantee that it has exclusive access to the GPU. We refer to the set of commands an application issues to the GPU between a lock acquisition and a lock release as a command batch, which is different from a command group, a minimal sequence of commands that must be DMAed into the GPU atomically.

As a result of the above design, the computation and memory resources on the GPU are allocated on a first-come-first-serve basis. More concretely, the GPU processes commands in the GPU command ring buffer in the order in which they are inserted, without regard to any notion of fairness or priority. Because GPU commands may request allocation of physical memory on the graphic card, these allocation requests are also serviced in first-come-first-serve order. The prevailing GPU command scheduling model has several serious weaknesses. First, it cannot limit the amount of GPU resource consumed by individual processes, as is done by the CPU scheduler using time quanta, because each graphic application is allowed to issue as many GPU commands as it wants once it acquires the lock protecting the command ring buffer. Second, if a process acquires the command ring buffer lock and, maliciously or by mistake, does not release it, no other process can issue commands to the GPU and the GPU's resources may lie wasted during that time. Finally, the resource allocation decisions of the GPU scheduler are not coordinated with the CPU scheduler's. This can hurt a graphic application's performance during GPU command bursts because the CPU and GPU resources it receives do not match.

We propose a new GPU resource manager called GERM (Graphics Engine Resource Manager), which is compatible with the programming model of existing GPU drivers, but is more flexible and intelligent. In particular, GERM supports the following unique features:
• GERM supports GPU state switching without relying on the user-level GPU driver, and therefore can schedule GPU commands at the granularity of command groups rather than command batches.
• GERM does not require locking on the command ring buffer, and therefore allows a GPU application to issue GPU commands whenever it is scheduled.
• GERM schedules GPU commands in a way that matches the resource allocation decisions of the CPU scheduler.
More specifically, the amount of GPU resource that GERM allocates to a GPU application is proportional to the CPU resource given to it.
• GERM also monitors the video memory usage of each GPU application, and allocates the physical memory resource among applications in a way that minimizes GPU idle time and is compatible with GPU computation resource allocation.

The main reason that GERM can support more flexible and intelligent GPU resource allocation is that it is capable of exploiting the semantics of GPU commands. Although this approach ties GERM to individual GPUs, it is possible to eliminate the hardware dependency by devising a general API through which a GPU-specific driver conveys the computation and memory resource requirement of each GPU command it submits. We briefly describe such an API in Section 4.3.

2. RELATED WORK

The Windows Display Driver Model (WDDM)3,4 introduced in Windows Vista also tries to solve the problem of fairly allocating a GPU's computation and video memory resources among multiple applications. More specifically, WDDM addresses the following problems of Windows XP's GPU resource allocation:
• The Windows XP GPU scheduler is first-come-first-serve, which results in starvation of some applications. As a result, when multiple 3D graphic applications are executed, they appear to render in "bursts," which manifest themselves in the form of large variance in the frame rate.
• Video memory allocation in Windows XP is also first-come-first-serve. When the video memory supply is exhausted, applications cannot start. There is no way to reclaim the video memory used by a particular application other than terminating the application. The problem of an application requesting more video memory than is available is resolved in Windows XP, as well as in DRI (explained later), by keeping a backup copy of all textures in main memory, which can lead to significant waste.

WDDM comprises two different driver models: a basic model for existing GPUs, which are not designed to support concurrent execution of multiple 3D applications, and an advanced model for future GPUs, which are. In the basic model, the GPU scheduler is responsible for dispatching GPU command groups from application processes to the GPU, but there is no interruptibility and GPU command group execution time is unbounded. The GPU scheduler serves as a throttle for all the graphic processes. The video memory manager virtualizes the video memory on a 3D graphic card to application processes. It manages both the on-card physical memory and the AGP aperture. Eviction is done by mapping the destination region in main memory through AGP and moving the evicted data from video memory to the destination region. The advanced model assumes the GPU supports multiple GPU contexts for running multiple GPU applications concurrently. Every GPU context contains a ring buffer for GPU commands and a page table for video memory. The scheduler computes the order in which GPU applications are to run and sends a list of their contexts to the GPU. The GPU runs the processes in the specified order, executes the command groups in each GPU context, and issues an interrupt to the CPU after completing the commands associated with a GPU context. A GPU context switch may be triggered by a video memory page fault or requested by the GPU scheduler. The GPU references video memory through the page table in the same way as the CPU does.
This virtual memory-like support enables fine-grained memory management in that it allows part of a GPU application's working set to be evicted and paged in later on a page fault. SGI's multi-rendering5,6 is a modified X server architecture that is designed to support concurrent graphics applications running in a multi-threading or multi-processing environment. However, its GPU resource scheduling model is FIFO-based. Existing GPU programming frameworks such as NVIDIA's Compute Unified Device Architecture (CUDA)7 and APIs such as ATI/AMD's Close to Metal (CTM) technology8 are mainly designed to simplify graphics and non-graphics application development on the GPU and to enable applications to fully exploit the power of the GPU. However, they do not provide any support for dynamic GPU resource allocation, let alone GPU command scheduling. Compared with WDDM, DRI (described in the next section) and SGI's multi-rendering system, GERM implements a more granular GPU command scheduling scheme in which the basic unit of scheduling is an atomic GPU command group, and as a result it is able to support more flexible QoS policies on GPU resource allocation.

3. DIRECT RENDERING INFRASTRUCTURE

3.1 Overview

The Direct Rendering Infrastructure,2 also known as DRI, allows multiple graphic applications to efficiently share a 3D graphic card under the X Window system. It works with the Mesa open-source OpenGL implementation and a wide range of GPU-based cards, each of which requires a device-specific user-level driver. The programming model and high-level design of DRI are similar to those of the Windows Display Driver Model (WDDM).3,4 Therefore, this section focuses on DRI's design and implementation. Other related work includes SGI's multi-rendering system,5,6 which requires modification to the X Server to support multi-threading and multi-processing graphic applications. DRI supports both GPU command scheduling and video memory management. A GPU application is linked with libGL.so, the Mesa OpenGL library, which in turn loads a user-level GPU-specific graphic card driver (one instance per client process).

Figure 1. The internal structure of a user-level driver for a 3D graphic card.

This driver interacts with the GPU hardware through the Direct Rendering Manager (DRM) kernel module,9 which physically moves GPU commands to the GPU and manages the hardware lock. The GPU driver receives high-level graphic commands from the Mesa library, generates the corresponding sequence of GPU commands, and periodically dispatches them to the GPU by making a system call to the DRM kernel module, which in turn DMAs them into a command ring buffer accessible to the GPU. Before issuing system calls to dispatch GPU commands, a GPU application's user-level driver must first obtain a lock on the command ring buffer to ensure exclusive access. We refer to this lock as the 'hardware lock' from now on. If the user-level driver detects that another process has issued commands to the GPU command ring buffer since the last time it dispatched commands, it first issues a set of special GPU commands to restore the GPU to the desired rendering state.
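To make this control flow concrete, the following is a minimal C sketch of the flush path just described. All identifiers are hypothetical placeholders (stubs), not the actual r200_dri.so or DRM entry points, and the stamp-based lost-context check is only one possible way to model the "has another process touched the GPU" test.

#include <stddef.h>

#define CMD_BUF_SIZE 8192          /* small user-level command buffer (4-8 KB) */

struct dri_ctx {
    unsigned char cmd_buf[CMD_BUF_SIZE];
    size_t        cmd_len;          /* bytes accumulated since the last flush     */
    int           my_stamp;         /* stamp this process writes into the lock    */
    int           last_stamp_seen;  /* lock stamp observed at the previous flush  */
};

/* Placeholder stubs standing in for the hardware lock and DRM system calls. */
static int  hw_lock_acquire(void)                 { return 0; /* current stamp */ }
static void hw_lock_release(int stamp)            { (void)stamp; }
static void emit_state_atoms(struct dri_ctx *c)   { (void)c;  /* re-emit GPU state  */ }
static void drm_dispatch(const void *p, size_t n) { (void)p; (void)n; /* DRM ioctl */ }

/* Flush the accumulated command group(s) to the kernel. */
void dri_flush(struct dri_ctx *c)
{
    int stamp;

    if (c->cmd_len == 0)
        return;

    stamp = hw_lock_acquire();          /* blocks while another process holds it */

    /* Another process touched the GPU since our last flush: restore our state. */
    if (stamp != c->last_stamp_seen)
        emit_state_atoms(c);

    drm_dispatch(c->cmd_buf, c->cmd_len);   /* DRM DMAs this into the ring buffer */
    c->cmd_len = 0;
    c->last_stamp_seen = c->my_stamp;

    hw_lock_release(c->my_stamp);
}

Note how the lock is held across both the optional state restoration and the dispatch itself; this is exactly the coarse, batch-level exclusivity that GERM later removes.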
3.2 GPU Driver State

The user-level GPU driver, whose structure is shown in Figure 1, converts high-level OpenGL commands to GPU commands. To effectively interface the OpenGL library (in this case Mesa) with a GPU-based 3D graphic card, the user-level GPU driver needs to maintain a set of data structures, described below (the description here is based on the open-source driver for the ATI Radeon 9250 card, r200_dri.so):

• GPU command buffer. This small (typically 4-8 KByte) buffer accumulates GPU commands before they are dispatched to the kernel. Its size is sufficiently large to ensure atomicity of every possible command group that needs to be sent to the command ring buffer in one shot, such as GPU state restoration or the vertex array element drawing commands corresponding to glArrayElement() or glDrawElements().

• GPU state. This structure consists of approximately 64 GPU command sequences, called atoms, that are responsible for setting the values of GPU registers which correspond to different parts of the OpenGL state, such as transformation matrices, view-port specifications, lighting parameters, etc. The GPU driver updates this structure in response to changes in the state of the OpenGL library, and uses it to restore the GPU state if necessary. Whenever the GPU driver flushes the command buffer, it sends the saved GPU state to the kernel if it detects that a different process has accessed the GPU since this process's last command buffer flush. It is therefore the responsibility of the GPU driver to implement GPU context switching.

• Texture memory map. These data structures are required for video memory allocation and are described in Section 3.3.

Before the GPU driver can flush its command buffer, it needs to acquire the hardware lock. If the GPU driver cannot acquire the hardware lock, the entire application is blocked. If it successfully acquires the hardware lock, it then checks whether another process has held the lock since it last released it. Once it holds the hardware lock, the GPU driver can dispatch as many GPU commands into the kernel as it wants, and it releases the lock after it has issued all its GPU commands.

3.3 Video Memory Management

In DRI, a graphic card's video memory is divided into a fixed number of equal-sized chunks called regions. Every region at any instant is allocated to at most one GPU application. To efficiently allocate video memory, DRI maintains a global list of regions and their last access times sorted in least recently used (LRU) order, a per-process linked list of texture objects sorted in LRU order, and a per-process list of memory blocks allocated to a process, where memory blocks are contiguous variable-size ranges of the video memory that are linked in order of increasing offset and marked free or used. When a GPU driver needs to allocate physical memory for a texture object, it first checks if there is a free memory block of sufficient size in its local list of memory blocks. If there is not, it removes texture objects from its per-process texture object list, starting from the least recently used ones, until it finds enough space to hold the new texture. In this search, regions owned by other processes are represented as dummy texture objects and included in the per-process texture object list according to how recently they have been used by their owner. Whenever a process creates or accesses a texture object, the object's recency attribute and its position in the texture object list are modified accordingly.
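The allocation-with-eviction path of Section 3.3 can be summarized by the following simplified C sketch. The structure layouts and the two helper functions are assumptions introduced for illustration, not the actual DRI data structures.

#include <stdlib.h>
#include <stdbool.h>

struct tex_obj {                 /* per-process texture object, kept in LRU order */
    struct tex_obj *next;
    size_t          size;
    size_t          offset;      /* location in video memory                      */
    bool            is_dummy;    /* stands for a region owned by another process  */
};

struct mem_block {               /* contiguous variable-size range of video memory */
    struct mem_block *next;
    size_t            offset, size;
    bool              used;
};

/* Assumed helpers: find a free block of at least `size` bytes, and free the
 * video memory behind one texture object (a dummy object frees a region that
 * currently belongs to another process). */
extern struct mem_block *find_free_block(struct mem_block *blocks, size_t size);
extern void              evict_texture(struct tex_obj *t);

/* Allocate `size` bytes for a new texture, evicting least recently used
 * texture objects until a large enough free block appears. */
struct mem_block *alloc_texture_memory(struct mem_block *blocks,
                                       struct tex_obj **lru_head, size_t size)
{
    struct mem_block *blk;

    while ((blk = find_free_block(blocks, size)) == NULL) {
        struct tex_obj *victim = *lru_head;      /* least recently used first */
        if (victim == NULL)
            return NULL;                         /* nothing left to evict     */
        *lru_head = victim->next;
        evict_texture(victim);                   /* marks its block(s) free   */
    }
    blk->used = true;
    return blk;
}

Because victims may be dummy objects standing in for other processes' regions, one process's allocation can silently evict another process's textures, which is exactly the behavior GERM's coordinated video memory management (Section 4.2) is meant to curb.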
4. GRAPHICS ENGINE RESOURCE MANAGEMENT

4.1 Fine-Grained GPU Command Scheduler

GERM provides a command queue in the kernel into which each process inserts its GPU commands. GERM's command scheduler then schedules GPU commands from these in-kernel command queues to the command ring buffer on the GPU card, as shown in Figure 2. Because each process has its own command queue, it is no longer necessary to acquire a hardware lock before injecting GPU commands into the kernel.

The main design goal of GERM's command scheduler is to ensure that its resource allocation decisions are compatible with the CPU scheduler's. That is, the amount of GPU resource allocated to a process is proportional to the CPU resource that it is given. One way to achieve this goal is to distribute the GPU time among processes according to their CPU usage. When a process is given a higher priority in CPU scheduling, it runs more frequently, is more likely to dispatch more GPU commands, and therefore deserves more GPU time. Another approach is to distribute the GPU time among processes according to the loads they each present to the GPU. The GPU load offered by a process is in turn determined by its inherent GPU computation requirement and its CPU scheduling priority. However, this approach may not be able to protect innocent applications from misbehaving applications that abuse the GPU resource.

The minimal scheduling granularity of GERM is a GPU command group, which is also the minimal unit of work that most applications use when injecting load into the GPU. However, different GPU command groups may consume different amounts of GPU computation resource. For example, a large triangle with multi-pass rendering consumes much more GPU computational resource than a small triangle with wire-frame rendering. Therefore, to accurately estimate the GPU load offered by a process, it is essential that GERM go beyond counting GPU commands and actually take into account their execution semantics. The most accurate way to estimate a GPU command group's computational resource requirement is to interpret the command group's execution in the GPU. However, this approach is time-consuming and incurs too much run-time overhead. Therefore, GERM uses a measurement approach instead, which estimates a GPU command group's computational resource requirement based on the number of bytes or the number of vertices in the command group. More specifically, GERM measures the elapsed time T_c of each GPU command group c and computes two empirical constants: CoefVtx, which is equal to T_c / VtxCount_c, and CoefCmdByte, which is equal to T_c / ByteCount_c, where VtxCount_c and ByteCount_c are the number of vertices and bytes in command group c, respectively.

Figure 2. The software architecture of GERM's command scheduler, which supports performance isolation using fine-grained GPU computation resource allocation.

It then maintains a running average of the CoefVtx and CoefCmdByte values of all the
GPU command groups executed in each process, and uses the resulting CoefVtx and CoefCmdByte averages to predict the GPU time requirement of the j-th GPU command group of the i-th process as follows:

GPUTime_ij = CoefVtx_i × VtxCount_ij,    if VtxCount_ij > 0
GPUTime_ij = CoefCmdByte_i × ByteCount_ij,    if VtxCount_ij = 0

When a process inserts a new GPU command group into its per-process command queue in the kernel, GERM estimates the group's GPU time requirement and updates the process's offered GPU load, which is expressed in terms of GPU time requirement per second. GERM then schedules GPU command groups from these per-process command queues using a weighted round robin scheduling policy.10 The user can explicitly specify the weight associated with each application. Alternatively, a process's weight could be set according to its offered load. In the current GERM prototype, we assign each process the same weight for implementation simplicity. When scheduling GPU command groups among competing processes, GERM relies on the command groups' estimated GPU time requirements to ensure that each process's command queue is consumed at a rate proportional to its weight. Instead of scheduling individual commands, GERM's weighted round robin scheduler uses a cycle time that is roughly equal to one frame time, so that it can dispatch one or multiple command groups each time it visits a process's command queue. A larger cycle time significantly reduces the GPU context switching overhead, because it amortizes the cost of each GPU context switch over multiple GPU commands.

Deriving an accurate estimate for every GPU command group without interpreting it is challenging for two reasons. First, it is unlikely that a linear model based on the number of command bytes or vertices can fully capture a command group's computation time on a complex GPU. Second, it is not possible to precisely measure the time taken by each GPU command group, because existing GPU hardware does not interrupt the CPU when it completes a GPU command or command group. To solve the second problem, GERM assigns each command group it dispatches to the GPU a unique ID, which is incremented for each dispatched command group, and then inserts after each command group a GPU command that increments a particular register on the GPU, which we call the timing register. Periodically, say every X seconds, GERM reads the content of the timing register and compares it with the result of the previous read; the difference indicates the number of command groups that have completed during that period. GERM approximates the computational time requirement of each command group completed in the i-th period as X/N_i, where N_i is the number of command groups completed in the i-th period. If X is sufficiently small, say 1 msec, this approximation is reasonably accurate; otherwise, GERM needs a more sophisticated scheme that takes into account the complexity of each command group completed in a period when distributing time among them.
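A minimal C sketch of this estimation scheme is shown below. It is an illustration rather than code from the GERM prototype; in particular, the exponential form of the running average and the sample weight are assumptions, since the text only states that a running average of CoefVtx and CoefCmdByte is kept per process.

struct proc_stats {
    double coef_vtx;       /* average measured GPU time per vertex        */
    double coef_cmd_byte;  /* average measured GPU time per command byte  */
};

#define AVG_WEIGHT 0.125   /* weight of each new sample (assumed value) */

/* Fold the measured execution time of one completed command group into
 * the process's running averages. */
void update_estimates(struct proc_stats *s, double elapsed_sec,
                      unsigned vtx_count, unsigned byte_count)
{
    if (vtx_count > 0)
        s->coef_vtx = (1.0 - AVG_WEIGHT) * s->coef_vtx
                    + AVG_WEIGHT * (elapsed_sec / vtx_count);
    else if (byte_count > 0)
        s->coef_cmd_byte = (1.0 - AVG_WEIGHT) * s->coef_cmd_byte
                         + AVG_WEIGHT * (elapsed_sec / byte_count);
}

/* Predict the GPU time of a newly enqueued command group. */
double estimate_gpu_time(const struct proc_stats *s,
                         unsigned vtx_count, unsigned byte_count)
{
    if (vtx_count > 0)
        return s->coef_vtx * vtx_count;
    return s->coef_cmd_byte * byte_count;
}

The scheduler would call estimate_gpu_time() when a command group is enqueued, and update_estimates() once the timing mechanism described above attributes a measured completion time to that group.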
The current GERM prototype consists of two threads. The first is the scheduling thread, which is woken up every time the GPU is about to become idle. GERM uses the computation time estimates of dispatched command groups and the content of the timing register to determine when the GPU is about to complete all the commands in its queue. This thread visits the per-process command queues in a cyclic fashion, decides how many command groups to dispatch from each process, performs a GPU context switch if necessary, and emits the chosen command groups to the GPU's command ring buffer. After dispatching a command group, it emits a GPU command that increments the timing register. The second is the timing thread, which is woken once every millisecond. It spins in a loop of calling the ProbeTimingReg() function and sleeping for one millisecond. The ProbeTimingReg() function detects the set of command groups that have completed since the last probe, and then calculates the computation time of each command group in the set. In addition to the timing thread, this function is also called when the user-level GPU driver makes a system call to inject GPU commands and when the scheduler thread is invoked, to ensure that the timing register is sampled at a sufficiently fine grain to produce accurate computation time estimates for command groups. GERM feeds these computation time measurements back to adjust the resource allocation of each competing process in subsequent scheduling cycles.

Weighted round robin scheduling guarantees fairness among competing processes only when their command queues are backlogged. However, vanilla CPU schedulers, such as the CPU scheduler in the Linux 2.6 kernel with a time quantum of 100 msec, tend to create a scenario in which only one process's command queue contains commands, because the time quantum is usually large enough for one process to generate multiple frames' worth of GPU commands while the other processes' queues are drained empty. The coarse granularity of CPU scheduling leads to visible 'burstiness' of frame rendering. To solve this problem, GERM starts dispatching a new process's GPU commands only when its in-kernel command queue has accumulated a sufficient number of commands. In addition, GERM actively blocks a process when the process tries to inject new commands into its queue and the queue is full.

To prevent misbehaving applications from abusing the GPU's resources, GERM needs to perform fine-grained context switching so that it can schedule command groups in any way it sees fit. However, GPU context switching is by definition GPU-specific. To perform GPU context switching transparently, GERM needs to recognize on the fly those commands from a process that modify the GPU state, and keep a copy of them so that it can re-establish that process's GPU state later on. As an optimization, GERM could compare the GPU states of consecutive processes and only issue the commands in the differential, to further cut down the performance overhead associated with a GPU context switch.
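The sketch below pulls the pieces of Section 4.1 together into a single scheduling pass. It is a simplified, user-space illustration of a deficit-style weighted round robin over per-process queues; the structure layouts and the helper functions (gpu_context_switch, emit_to_ring, emit_timing_reg_increment) are placeholders, not the prototype's actual in-kernel code.

struct cmd_group {
    struct cmd_group *next;
    double est_gpu_time;              /* from estimate_gpu_time()             */
};

struct proc_queue {
    struct cmd_group *head;           /* in-kernel per-process command queue  */
    double weight;                    /* relative share (equal in prototype)  */
    double deficit;                   /* unused GPU time carried over         */
    int    ctx_id;                    /* identifies the saved GPU state       */
};

extern void gpu_context_switch(int ctx_id);      /* re-emit saved state atoms */
extern void emit_to_ring(struct cmd_group *g);   /* copy group to ring buffer */
extern void emit_timing_reg_increment(void);     /* completion marker command */

/* One scheduling cycle: each backlogged queue may consume GPU time
 * proportional to its weight, carrying any unused allowance forward. */
void schedule_cycle(struct proc_queue *q, int nproc, double cycle_time, int *cur_ctx)
{
    double total_w = 0.0;
    for (int i = 0; i < nproc; i++)
        if (q[i].head)                        /* only backlogged queues count */
            total_w += q[i].weight;
    if (total_w == 0.0)
        return;

    for (int i = 0; i < nproc; i++) {
        if (!q[i].head)
            continue;
        q[i].deficit += cycle_time * q[i].weight / total_w;

        if (q[i].ctx_id != *cur_ctx) {
            gpu_context_switch(q[i].ctx_id);  /* restore this process's GPU state */
            *cur_ctx = q[i].ctx_id;
        }
        while (q[i].head && q[i].head->est_gpu_time <= q[i].deficit) {
            struct cmd_group *g = q[i].head;
            q[i].head = g->next;
            q[i].deficit -= g->est_gpu_time;
            emit_to_ring(g);
            emit_timing_reg_increment();      /* lets the timing thread attribute time */
        }
        if (!q[i].head)
            q[i].deficit = 0.0;               /* an empty queue keeps no credit */
    }
}

Carrying the unused allowance in the deficit counter is what allows a queue whose next command group was too expensive in one cycle to catch up in later cycles, the property that Deficit Round Robin10 provides.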
Figure 3. The frame rate of three concurrent instances of Quake running under DRI, shown against the total amount of texture memory available in the system. When the available amount of texture memory is low, the frame rate decreases because of swapping of the textures required by different instances.

4.2 Coordinated Video Memory Management

The goal of video memory management is to minimize the performance penalty associated with texture uploads. There are two ways to achieve this goal. One is to decrease the amount of texture upload traffic, and the other is to maximize the overlap between texture uploads and GPU computation. As explained before, DRI supports an LRU-like replacement algorithm that lets multiple GPU applications efficiently share the video memory. However, under this cooperative sharing model a GPU application can grab as much video memory as it wants, evicting other applications' texture objects without notifying them along the way. This subsection describes several optimizations that we are currently implementing in GERM.

The texture memory management subsystem in DRI keeps different processes' textures completely separate, even though it is preferable to share textures in many cases. Figure 3 shows that the frame rate suffers a 60% degradation because textures need to be swapped in and out of the video memory, even though the three Quake 3 processes use the same set of textures. If the video memory management subsystem could identify the textures that concurrent processes have in common, it could eliminate some of this texture swapping overhead. The first optimization is to compute a hash value for each immutable texture object, so that before issuing a texture upload command, GERM can check whether the same texture object has already been uploaded into the video memory by another process, and if so skip the upload. Unlike the CPU, the GPU does not provide any hardware support for virtual memory. Therefore, the residence check requires a texture object table that maps a texture's hash value to its starting location in the video memory if it is currently resident, and to zero if it is not.

The second optimization is dynamic relocation of texture objects.11 Because a GPU application must upload a texture to the graphic card's video memory before it accesses the texture, GERM needs to consult the texture object table to ensure that a process's textures are resident in the video memory before dispatching commands from that process. If a texture object is not resident, GERM needs to issue a texture upload command to explicitly bring it into video memory. Being able to relocate a texture object when it is re-loaded significantly improves the video memory's utilization efficiency. Traditionally, virtual memory allows an application's data structures to move around physical memory without disrupting the application. To support the same level of relocation transparency, the command scheduler needs to rewrite GPU commands on the fly to adapt those commands that access relocated texture objects.

The third optimization is to overlap texture uploads and GPU computation as much as possible. The video memory manager goes through the GPU commands in the per-process command queues before they are scheduled, computes the set of texture objects that should be kept resident in the video memory over time, and schedules uploads of those texture objects that need to be brought in as far ahead of time as possible. In addition, the video memory manager could break each such texture upload into multiple upload commands to minimize the performance impact of texture pre-loading.

The final optimization is to allocate the video memory among competing GPU applications according to their working set sizes. This makes it impossible for a GPU application to hog the entire video memory, and thus enables performance isolation among these applications. The working set of a process is defined as the set of texture objects that the process accesses during a particular time window.
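As an illustration of the texture object table behind the first optimization above, the following self-contained C sketch maps a texture's content hash to its video memory offset, with zero meaning "not resident." The fixed-size open-addressing table and its capacity are assumptions made for brevity; a real implementation would also handle collisions more carefully and reclaim entries when textures are evicted.

#include <stdint.h>

#define TEX_TABLE_SIZE 1024

struct tex_entry {
    uint64_t hash;       /* content hash of an immutable texture          */
    uint32_t vram_off;   /* starting location in video memory, 0 = absent */
};

static struct tex_entry tex_table[TEX_TABLE_SIZE];

/* Return the video-memory offset of a texture with this content hash,
 * or 0 if no process has uploaded it yet. */
uint32_t texture_lookup(uint64_t hash)
{
    for (unsigned i = 0; i < TEX_TABLE_SIZE; i++) {
        struct tex_entry *e = &tex_table[(hash + i) % TEX_TABLE_SIZE];
        if (e->vram_off == 0)
            return 0;                 /* empty slot: not resident          */
        if (e->hash == hash)
            return e->vram_off;       /* already uploaded by some process  */
    }
    return 0;
}

/* Record a texture that has just been uploaded at `vram_off` (non-zero). */
void texture_insert(uint64_t hash, uint32_t vram_off)
{
    for (unsigned i = 0; i < TEX_TABLE_SIZE; i++) {
        struct tex_entry *e = &tex_table[(hash + i) % TEX_TABLE_SIZE];
        if (e->vram_off == 0 || e->hash == hash) {
            e->hash = hash;
            e->vram_off = vram_off;
            return;
        }
    }
    /* Table full: in a real system an entry for an evicted texture
     * would be reclaimed here. */
}

Before emitting a texture upload on behalf of a process, GERM would call texture_lookup(); a non-zero result means the upload can be skipped and the commands that reference the texture can be patched to use the returned offset.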
4.3 Resource Requirement-Carrying API

Although the proposed GERM system can effectively manage the computation and memory resources on a graphic card, it is GPU-dependent because it needs to parse GPU commands to derive their resource requirements. To hide GPU-specific details from GERM, future GPU drivers could explicitly provide the resource requirements of GPU commands when they inject those commands. This information includes the command attributes used to estimate a GPU command's computation time, the texture objects accessed, GPU state updates, relocation information for on-the-fly command patching, etc. One open issue is the security implication of such an API: how does GERM reduce the impact of an application that lies about the resource requirements of its GPU commands?

Our Graphics Engine Resource Manager (GERM) system is designed with the following goal in mind: it should be able to prevent applications from uncontrolled consumption of system resources. It should satisfy the demand of applications with low resource requirements completely, and satisfy the demands of applications with high resource requirements equally.

Application Mix                              Fairness (U_time)
Gears, Train                                 0.12%
Quake 3, Train                               0.02%
Gloss, Gears, Train                          0.15%
3x String Comparison                         0.97%
2x MatrixMul 128x128                         0.55%
MatrixMul 128x128, String Comparison         5.40%
MatrixMul 128x128, MatrixMul 256x256         19%

Table 1. Inter-process fairness of different application mixes when they run under GERM's GPU command scheduler. Each application in an application mix runs at the same CPU priority. Gears, Train and Gloss are demo programs provided with Mesa to demonstrate 3D graphics capability. Quake 3 is a 3D gaming application. Table 3 contains information on the load presented by these applications in terms of the number of triangles per frame and the number of textures used. The string comparison and matrix multiplication programs were implemented using the ATI fragment shader OpenGL extension.

5. IMPLEMENTATION

We have implemented the GPU scheduler based on the DRI code on a Linux machine running the 2.6.12 Linux kernel. The machine is equipped with a 3D graphics card powered by an ATI Radeon r200 GPU.12 We chose this card over more powerful cards available in the market because stable open-source drivers are available for it. Our implementation involves modifications both to the r200 user-level driver and to the radeon kernel module loaded by DRM. Note that even though we have implemented our prototype for a specific GPU and present results for the same, the key ideas in GERM can be applied to all GPUs. The effectiveness of GERM depends on correctly estimating the load of each command group and scheduling the command groups in a weighted fashion according to a configurable policy. Only the estimation part is GPU-dependent and requires modification of the GPU-specific driver.

The GPU command group queue is implemented as a linked list of 96-KByte buffers. When a buffer is filled, a new one is allocated and added to the linked list. When a buffer is exhausted by the GPU scheduler, it is discarded. A command group may cross two consecutive buffers. Each buffer can be accessed by a GPU application and the GPU scheduler without a lock. An atomic counter variable regulates the number of enqueued command groups that the GPU scheduler is allowed to read.
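The following user-space C11 sketch illustrates the lock-free handoff just described. It is an analogy to the in-kernel mechanism rather than the prototype's code: the kernel implementation uses kernel atomics, and the linked list of 96-KByte buffers is omitted here.

#include <stdatomic.h>

struct germ_queue {
    /* Linked list of 96-KByte command buffers omitted for brevity;
     * `groups_ready` is the only shared synchronization variable. */
    atomic_uint groups_ready;    /* command groups the scheduler may read */
    unsigned    groups_consumed; /* private to the scheduler thread       */
};

/* Producer side: called after a complete command group has been copied
 * into the current buffer.  The release increment makes the group's
 * bytes visible before the new counter value is observed. */
void queue_publish_group(struct germ_queue *q)
{
    atomic_fetch_add_explicit(&q->groups_ready, 1, memory_order_release);
}

/* Consumer side: how many complete groups the scheduler may safely parse. */
unsigned queue_groups_available(struct germ_queue *q)
{
    unsigned published =
        atomic_load_explicit(&q->groups_ready, memory_order_acquire);
    return published - q->groups_consumed;
}

The release/acquire pairing ensures the scheduler never parses a command group before all of its bytes have landed in the buffer, which is what lets the application and the scheduler share the buffers without a lock.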
The GPU scheduler adds the following information to the GPU command transfer protocol between the user-level driver and the DRM:

• GPU state tagging: Whenever the GPU scheduler switches to a process, it needs to restore that process's GPU state. To keep track of each process's current GPU state, GERM tags each state atom in the GPU command stream with the corresponding process's ID.

• Command buffer discard: In DRI, GPU command buffers are allocated by DRM and reused when the commands in them have finished. An 'age' register in the GPU is used to signal the completion of every GPU command group. The value written to this register has to be assigned at command scheduling time to ensure that it is monotonically increasing. That is why command buffer discard commands also have to be tagged.

We perform time measurements using the RDTSC instruction of the IA-32 architecture, which returns the value of a counter that is incremented every CPU clock cycle. In the kernel we use 64-bit fixed point arithmetic with a precision of 10 bits after the binary point.

6. PERFORMANCE EVALUATION

6.1 Fairness

There are two possible metrics that can be used to measure the degree of inter-process fairness when a set of processes shares a GPU. Intuitively, these metrics quantify the maximum difference between the fractions of the GPU time consumed by any two processes. Smaller fairness metric values correspond to fairer GPU time allocation, with 0 being a perfectly fair allocation.

Number of Context Switches/Frame    FPS    Frame Time (1/FPS, sec)
0                                   518    0.0019
500                                 466    0.0021
1000                                269    0.0037
2000                                143    0.0069
3000                                 92    0.0108
4000                                 74    0.0135

Table 2. The impact of artificially introduced GPU context switches on the frame rate. The test application draws 100 triangles per frame. From these measurements we calculated that the GPU context switch time is very small, about 3 × 10^-6 seconds.

If a process runs at a frame rate of F_i frames per second (FPS) when running alone and at a frame rate of f_i FPS when running concurrently with other processes, then the fraction of the GPU time used by that process can be estimated as s_i = f_i / F_i. One way to measure the degree of fairness for N concurrent processes is

U_fps = max_{i=1,...,N} (f_i / F_i) − min_{i=1,...,N} (f_i / F_i)

Alternatively, if t_1, ..., t_N are the GPU times consumed by the N processes during the measurement period, then the same degree of fairness can be expressed as

U_time = max_{i=1,...,N} (t_i / Σ_{j=1}^{N} t_j) − min_{i=1,...,N} (t_i / Σ_{j=1}^{N} t_j)

Because U_time can be applied to both graphics applications and non-graphics applications, we use only U_time in this study.
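For concreteness, U_time can be computed directly from the measured per-process GPU times. The small helper below simply mirrors the definition; it is an illustration, not code from the GERM prototype.

/* Compute the U_time fairness metric from the GPU time t[i] consumed by
 * each of n processes during the measurement period. */
double u_time(const double t[], int n)
{
    double total = 0.0, max_share = 0.0, min_share = 1.0;

    for (int i = 0; i < n; i++)
        total += t[i];
    if (n == 0 || total == 0.0)
        return 0.0;

    for (int i = 0; i < n; i++) {
        double share = t[i] / total;
        if (share > max_share) max_share = share;
        if (share < min_share) min_share = share;
    }
    return max_share - min_share;   /* 0 means perfectly fair allocation */
}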
Table 1 presents the fairness metric U_time for a variety of application mixes. GERM achieves nearly perfect fairness for all application mixes except the last two. This demonstrates that GERM's GPU command scheduler is able to accurately estimate the loads of GPU command groups and schedule them fairly, at least for graphics applications. To explain why GERM is not as effective for the last two application mixes, we first need to describe how the matrix multiplication program is implemented on the ATI Radeon r200 GPU using the ATI fragment shader OpenGL extension.13 Each of the two input matrices is stored in multiple textures. The matrix multiplication proceeds in multiple stages, each of which operates on a subset of these input textures. The output of each stage is another texture, which is provided as an input to the next stage. As a result, the matrix multiplication program frequently uploads textures during its execution. This is different from the other application programs used in this experiment, which upload all required textures at the beginning of their execution. Because texture uploads take place asynchronously through DMA operations, the current GERM prototype does not take texture uploads into consideration when estimating the GPU load of a process. How to overcome this limitation is currently under investigation.

When an application mix consists of two matrix multiplication instances whose input matrices are of different dimensions (128x128 vs. 256x256), the GPU loads associated with their texture uploads differ from each other. The fact that the current GERM prototype completely ignores texture uploads causes the GPU time allocation between these two applications to be substantially unfair (19%). The same explanation applies to the unfair GPU time allocation (5.4%) between a matrix multiplication program, which performs frequent texture uploads, and a string comparison program, which does not. When two matrix multiplication programs whose input matrices have identical dimensions share a GPU, GERM allocates GPU time fairly between them: even though it ignores texture uploads, the texture uploads of the two applications impose the same load on the GPU and therefore cancel each other out.

Figure 4 shows how the U_time metric varies with the number of triangles drawn by two identical instances of a synthetic application that draws a certain number of random triangles and includes a CPU computation component proportional to the number of triangles drawn. When the number of triangles drawn grows to 900 and beyond, the CPU becomes the bottleneck, the command queues are empty more frequently, and it is less likely that the deficit round-robin algorithm can maintain fairness. On the other hand, when the number of triangles is smaller than 300, GERM becomes less fair as the number of triangles decreases, because its GPU execution time estimation mechanism becomes less accurate as command group execution times become smaller.

Figure 4. Inter-process fairness of two instances of the same application that draws an increasing number of random triangles and includes a CPU computation component that is proportional to the number of triangles drawn.

Application (triangles per frame / textures)            Frames per second
                                                     1x      2x      3x      4x
Underwater (228 / 232)              DRI             280     140     95      71
                                    GERM            256     131     89      68
                                    Overhead        8.6%    6.4%    6.3%    4.2%
Gloss (Mesa demo) (566 / 17)        DRI             196     98      66      49
                                    GERM            175     87      57      44
                                    Overhead        10.7%   11.2%   13.6%   10.2%
GearTrain (Mesa demo) (2800 / 0)    DRI             77.8    39      -       -
                                    GERM            66.7    32.7    -       -
                                    Overhead        14.3%   16.2%   -       -
Quake 3 (6000 / 2226)               DRI             63.1    30.7    20      -
                                    GERM            49.8    24.9    8.7     -
                                    Overhead        21.1%   18.9%   56.5%   -

Table 3. The performance overhead associated with GERM's GPU scheduler when 1 to 4 instances of a set of graphics applications are executed.

6.2 Scheduling Overhead

Although GERM's fine-grained GPU command scheduler does a reasonable job of ensuring inter-process fairness among GPU applications, the scheduler could potentially introduce additional performance overheads and thus lower the GPU's effective capacity. In particular, GERM introduces more GPU context switches and consumes additional CPU resources to estimate the GPU load of each GPU command group. To measure the GPU context switching overhead, we ran a test application that draws 100 triangles per frame and performs N GPU context switches within every frame, and measured the frame rate for different values of N. Table 2 shows the impact of N on the frame rate of this application. We found that the average GPU context switch overhead is very small, about 3 × 10^-6 sec. Assuming there are 10 applications and GERM uses a scheduling quantum of 10 msec, there will be 1000 GPU context switches per second, and the total GPU context switch overhead is 3 msec per second, or 0.3%.
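As a sanity check on the 3 × 10^-6 second figure, one can relate the extra frame time to the number of artificially added context switches in Table 2; the snippet below does this with the two endpoint rows. This is our reading of how such a figure can be derived, not a calculation taken verbatim from the measurement procedure.

#include <stdio.h>

int main(void)
{
    const double frame_time_0    = 1.0 / 518;   /* 0 context switches per frame    */
    const double frame_time_4000 = 1.0 / 74;    /* 4000 context switches per frame */

    double per_switch = (frame_time_4000 - frame_time_0) / 4000.0;
    printf("estimated GPU context switch time: %.2e s\n", per_switch);
    /* prints roughly 2.9e-06 s, consistent with the ~3e-6 s figure above */
    return 0;
}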
Tables 3 and 4 show the end-to-end performance overhead of GERM's GPU command scheduler relative to DRI when the workload consists of a varying number of instances of graphics applications and non-graphics applications, respectively.

Application (size of instance)                       Elapsed time (CPU Mcycles)
                                                     1x       2x       3x
String Comparison (12.2KB)          DRI             3.54     7.39    11.61
                                    GERM            3.57     8.04    12.04
                                    Overhead        0.85%    8.79%   3.70%
Matrix Multiplication (128x128)     DRI            11.96    24.43    35.47
                                    GERM           12.36    25.5     37.63
                                    Overhead        3.34%    4.38%    6.09%

Table 4. The performance overhead associated with GERM's GPU scheduler when 1 to 3 instances of a set of general-purpose non-graphics applications are executed.

Because the GPU context switch overhead is negligible, the overheads shown in these tables mainly come from GPU command parsing and GPU load estimation. Moreover, the current GERM prototype introduces an additional copying step for each GPU command, which could have been optimized away. As a consequence, the performance overhead of GERM for an application increases with the number of triangles that the application draws per frame. For the same reason, GERM tends to introduce a smaller performance overhead for non-graphics applications, because they tend to use less complex geometry and thus issue fewer GPU commands.

7. CONCLUSION

The goal of the GERM project is to develop a full-scale GPU resource manager that can efficiently utilize a GPU's computation/memory resources and effectively provide performance isolation among applications sharing the same GPU. This paper focuses on the development of a GPU scheduler and demonstrates that it is possible to schedule GPU applications on a fine-grained basis on commodity GPUs. In particular, GERM's GPU scheduler is capable of switching among GPU applications sharing the same GPU at arbitrary points in their GPU command streams by correctly saving and restoring their GPU state. In addition, GERM features an accurate GPU command execution time estimation mechanism that is largely GPU-independent, and uses this mechanism to allocate the GPU resource among competing GPU applications. Measurements of the first GERM prototype show that GERM can keep the GPU time consumption difference among competing GPU processes consistently below 5% for a variety of application mixes. We are currently extending the GERM prototype with more intelligent video memory management and a more informative API. We will then optimize GERM's performance through such techniques as zero-copy command and texture buffering and streamlined context switching. As a next step, we will explore other systems software support for GPU application development, such as GPU debugging.

REFERENCES

1. J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum 26, pp. 80-113, March 2007.
2. K. E. Martin, R. E. Faith, J. Owen, and A. Akin, Direct Rendering Infrastructure, Low-Level Design Document. Precision Insight, Inc., May 1999.
3. B. Langley, "Windows 'Longhorn' Display Driver Model - Details and Requirements," in Windows Hardware Engineering Conference (WinHEC), (Seattle, WA, USA), May 2004.
4. T. K. Steve Pronovost, Henry Moreton, "WDDM v2 and Beyond," in Windows Hardware Engineering Conference (WinHEC), (Seattle, WA, USA), May 2006.
5. M. J. Kilgard, S. Hui, A. A. Leinwand, and D. Spalding, "X server multi-rendering for OpenGL and PEX," The X Resource 9(1), pp. 73-88, 1994.
6. M. J. Kilgard, D. Blythe, and D. Hohn, "System support for OpenGL direct rendering," in Graphics Interface '95, W. A. Davis and P. Prusinkiewicz, eds., pp. 116-127, Canadian Human-Computer Communications Society, (Quebec City, Quebec), May 1995.
7. "Compute Unified Device Architecture." http://developer.nvidia.com/object/cuda.html.
8. "ATI CTM Guide." http://ati.amd.com/companyinfo/researcher/documents/ATI_CTM_Guide.pdf, 2006.
9. R. E. Faith, The Direct Rendering Manager: Kernel Support for the Direct Rendering Infrastructure. Precision Insight, Inc., May 1999.
10. M. Shreedhar and G. Varghese, "Efficient fair queueing using deficit round-robin," IEEE/ACM Trans. Netw. 4(3), pp. 375-385, 1996.
11. K. Whitwell and T. Hellstrom, "New DRI memory manager and i915 driver update," in Proceedings of the 2006 Xorg Developer's Conference, (Santa Clara, CA, USA), February 2006.
12. "Radeon R200." http://en.wikipedia.org/wiki/Radeon_R200.
13. "ATI fragment shader." http://oss.sgi.com/projects/ogl-sample/registry/ATI/fragment_shader.txt, August 2002.