Chris Rossbach, Microsoft Research
Emmett Witchel, University of Texas at Austin
September 23, 2010


• GPU application domains are limited
• CUDA:
◦ Rich APIs/abstractions
◦ Language integration → a familiar environment
• Composition/integration into systems:
◦ Programmer-visible abstractions are expressive
◦ Realizations/implementations are unsatisfactory
• Better OS-level abstractions are required
◦ (IP/business concerns aside)
How do I program just a CPU and a disk?

int main(int argc, char* argv[]) {
    FILE* fp = fopen("quack", "w");
    if (fp == NULL)
        fprintf(stderr, "failure\n");
    …
    return 0;
}
[Diagram: for the CPU and disk, the programmer-visible interface sits on rich OS-level abstractions over the hardware interface; for the GPU, the programmer-visible interface maps to just one OS-level abstraction!]

The programmer gets to work with great abstractions…
Why is this a problem?
Doing fine without OS support:
◦ Gaming/Graphics
  • Shader languages
  • DirectX, OpenGL
◦ "GPU Computing", née "GPGPU"
  • User-mode/batch
  • Scientific algorithms
  • Latency-tolerant
  • CUDA
The application ecosystem is more diverse:
◦ No OS abstractions → no traction
  • Gestural interface
  • Brain-computer interface
  • Spatial audio
  • Image recognition
Processing user input:
• needs low latency and concurrency
• must be multiplexed by the OS
• needs protection/isolation
[Pipeline diagram: Raw images → Image Filtering → Geometric Transform → Point cloud → Gesture Recognition → "Hand" events → OS HID Input. Properties: high data rates, noisy input ("Ack! Noise!"), data-parallel algorithms.]
#> catusb | xform | detect | hidinput &

• catusb: captures image data from USB (inherently sequential)
• xform: data-parallel
◦ Noise filtering
◦ Geometric transformation
• detect: extract gestures from the point cloud
• hidinput: send mouse events (or whatever)

Could parallelize on a CMP, but…
#> catusb | xform | detect | hidinput &

• Run catusb on the CPU
• Run xform on the GPU
• Run detect on the GPU
• Run hidinput on the CPU

Use CUDA to write xform and detect!
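A minimal sketch of what the data-parallel xform stage might look like as a CUDA kernel; the kernel name, image layout, and pass-through body are hypothetical stand-ins for the real filtering/transform logic:

// Hypothetical data-parallel stage: one thread per pixel.
// The real xform would apply noise filtering and a geometric
// transformation here; this placeholder copies the pixel through.
__global__ void xform_kernel(const float* in, float* out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];
}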
• GPUs cannot run an OS: different ISA
• Disjoint memory space, no coherence*
• Host CPU must manage execution
◦ Program inputs are explicitly bound at runtime
[Diagram: user-mode apps must implement the data path themselves: copy inputs from main memory to GPU memory, send commands from the CPU to the GPU, and copy outputs back.]
• 12 kernel crossings
• 6 copy_to_user
• 6 copy_from_user
• Performance tradeoffs for runtime/abstractions
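To make those crossings concrete, here is a hedged sketch of the host-side pattern each GPU stage forces on a user-mode app today: allocate device memory, copy inputs, send commands (launch), copy outputs back. It reuses the hypothetical xform_kernel sketched earlier; error handling is omitted:

#include <cuda_runtime.h>

// One pipeline stage, as a user-mode app must implement it today.
void run_xform_stage(const float* h_in, float* h_out,
                     int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  bytes);
    cudaMalloc((void**)&d_out, bytes);

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // copy inputs
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    xform_kernel<<<grid, block>>>(d_in, d_out, width, height); // send commands
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // copy outputs

    cudaFree(d_in);
    cudaFree(d_out);
}

Every frame pays this pattern in user/kernel crossings and copies, which is where the counts above come from.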
[Stack diagram: catusb, xform, detect, and hidinput run in user mode on top of the CUDA Runtime and User Mode Drivers (DXVA); in kernel mode, the OS Executive, Kernel Mode Drivers, and HAL drive the USB, CPU, and GPU hardware.]
[Stack diagram: alternative design that runs the GPU kernels from kernel mode: catusb, xform, detect, and hidinput sit directly on the OS Executive, Kernel Mode Drivers, and HAL over the USB, CPU, and GPU.]
• No CUDA, no high-level abstractions
• If you're MS and/or nVidia, this might be tenable…
• The solution is specialized, and there is still a data migration problem…
We'd prefer:
• catusb: USB bus → GPU memory
• xform, detect: no transfers
• hidinput: a single GPU → main memory transfer
• if GPUs become coherent with main memory…

[Hardware diagram: CPU on the FSB, Northbridge with DDR2/3 DIMMs and the PCI-e GPU, DMI to the Southbridge with USB 2.0. The current path for catusb and xform causes cache pollution, wasted bandwidth, and wasted power.]

The machine can do this; where are the interfaces?
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
◦ CUDA Streams
◦ Asynchrony
◦ GPUDirect™
◦ OS guarantees
• New OS abstractions for GPUs
• Related Work
• Conclusion
[Hardware diagram (CPU, FSB, Northbridge, DDR2/3 DIMMs, PCI-e GPU, DMI, Southbridge, USB 2.0), annotated with the CUDA features that address data movement:]
• CUDA streams, async: overlap capture/transfer/execution
• Write-combining memory (uncacheable)
• Page-locked host memory (faster DMA)
• Portable memory (share page-locked memory)
• Mapped memory (map host memory into GPU space)
• GPUDirect™ (transparent transfer → app-level upcalls)
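As a reference point, a hedged sketch of how the page-locked and mapped-memory features are exercised through the CUDA runtime; the function name and usage are illustrative only:

#include <cuda_runtime.h>

// Pinned (page-locked) host memory enables faster DMA; the Mapped
// flag also exposes the buffer in GPU address space, so kernels can
// access it without an explicit cudaMemcpy.
void demo_mapped_memory(size_t bytes)
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* h_buf;
    cudaHostAlloc((void**)&h_buf, bytes,
                  cudaHostAllocMapped | cudaHostAllocPortable);

    float* d_alias;  // device-side alias of the same physical pages
    cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);

    // ... launch kernels that read/write d_alias directly ...

    cudaFreeHost(h_buf);
}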
Overlap Communication with Computation

Stream X: Copy X0 → Kernel Xa → Kernel Xb → Copy X1
Stream Y: Copy Y0 → Kernel Y → Copy Y1

cudaMemcpyAsync(X0…);
KernelXa<<<…>>>();
KernelXb<<<…>>>();
cudaMemcpyAsync(X1…);
cudaMemcpyAsync(Y0…);
KernelY<<<…>>>();
cudaMemcpyAsync(Y1…);

[Timeline: with this issue order the Copy Engine queues Copy X0, Copy X1, Copy Y0, Copy Y1 and the Compute Engine queues Kernel Xa, Kernel Xb, Kernel Y; Copy X1 must wait for Kernel Xb and blocks Copy Y0 behind it, so the streams serialize.]

Each stream proceeds serially; different streams overlap.
Naïve programming eliminates potential concurrency.
Issuing Copy Y0 before Copy X1 restores the overlap:

cudaMemcpyAsync(X0…);
KernelXa<<<…>>>();
KernelXb<<<…>>>();
cudaMemcpyAsync(Y0…);
KernelY<<<…>>>();
cudaMemcpyAsync(X1…);
cudaMemcpyAsync(Y1…);

[Timeline: Copy Engine: Copy X0, Copy Y0, Copy X1, Copy Y1; Compute Engine: Kernel Xa, Kernel Xb, Kernel Y — copies and kernels from different streams overlap.]

Our design can't use this anyway!
• … xform | detect …
◦ CUDA streams in xform, detect
◦ different processes, different address spaces
◦ require additional IPC coordination
• Order sensitive
◦ Applications must statically determine issue order
• Couldn't a scheduler with a global view do a better job dynamically?
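A minimal, self-contained version of the two-stream pattern above, assuming placeholder kernels and pre-allocated buffers; for the copies to truly overlap, the host buffers must be page-locked (cudaHostAlloc):

#include <cuda_runtime.h>

__global__ void KernelXa(float* d) { /* placeholder work */ }
__global__ void KernelXb(float* d) { /* placeholder work */ }
__global__ void KernelY (float* d) { /* placeholder work */ }

// hX/hY: pinned host buffers; dX/dY: device buffers; n: bytes.
void overlap_streams(float* hX, float* hY, float* dX, float* dY, size_t n)
{
    cudaStream_t sX, sY;
    cudaStreamCreate(&sX);
    cudaStreamCreate(&sY);

    // Issue order matters: queue Copy Y0 before Copy X1 so the copy
    // engine is not blocked behind a copy that waits on Kernel Xb.
    cudaMemcpyAsync(dX, hX, n, cudaMemcpyHostToDevice, sX);  // Copy X0
    KernelXa<<<1, 256, 0, sX>>>(dX);
    KernelXb<<<1, 256, 0, sX>>>(dX);
    cudaMemcpyAsync(dY, hY, n, cudaMemcpyHostToDevice, sY);  // Copy Y0
    KernelY<<<1, 256, 0, sY>>>(dY);
    cudaMemcpyAsync(hX, dX, n, cudaMemcpyDeviceToHost, sX);  // Copy X1
    cudaMemcpyAsync(hY, dY, n, cudaMemcpyDeviceToHost, sY);  // Copy Y1

    cudaStreamSynchronize(sX);
    cudaStreamSynchronize(sY);
    cudaStreamDestroy(sX);
    cudaStreamDestroy(sY);
}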
[Graph: xform performance (higher is better) for H→D, D→H, and H↔D workloads, comparing an OS-supported ptask-analogue, CUDA+streams, CUDA+async (ping-pong), and plain CUDA.
H→D: Host-to-Device only; D→H: Device-to-Host only; H↔D: duplex communication.
Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, nVidia GeForce GT230.]
GPUDirect™: "Allows 3rd party devices to access CUDA memory" (eliminates a data copy)
Great! But:
• requires per-driver support
• not just CUDA support!
• no programmer-visible interface
• the OS can generalize this
Traditional OS guarantees:
• Fairness
• Isolation
No user-space runtime can provide these!
◦ it can support them…
◦ …but it cannot guarantee them
[Graph: Impact of CPU Saturation on throughput (higher is better; flatter lines are better), normal load vs. loaded, for H→D, D→H, and H↔D. Takeaway: the CPU scheduler and GPU scheduler are not integrated!
H→D: Host-to-Device only; D→H: Device-to-Host only; H↔D: duplex communication.
Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, nVidia GeForce GT230.]
• Process API analogues
• IPC API analogues
• Scheduler hint analogues
• Must integrate with existing interfaces
◦ CUDA/DXGI/DirectX
◦ DRI/DRM/OpenGL
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related Work
• Conclusion
• ptask
◦ Like a process or thread; can exist without a host user process
◦ An OS abstraction…not a full CPU process
◦ Has a list of mappable input/output resources

• endpoint
◦ Globally named kernel object
◦ Can be mapped to ptask input/output resources
◦ A data source or sink (e.g. a buffer in GPU memory)

• channel
◦ Similar to a pipe
◦ Connects arbitrary endpoints
◦ 1:1, 1:M, M:1, N:M
◦ Generalization of the GPUDirect™ mechanism

Expand the system call interface (see the sketch below):
• process API analogues
• IPC API analogues
• scheduler hints
• 1:1 correspondence between programmer and OS abstractions
• existing APIs can be built on top of the new OS abstractions
[Graph diagram of the gesture pipeline: process catusb → endpoint usbsrc → channel → endpoint rawimg → ptask xform (with g_input) → endpoint cloud → ptask detect → endpoint hands → endpoint hid_in → process hidinput. Legend: process, ptask, endpoint, channel.]

Computation expressed as a graph:
• Synthesis [Massalin 89] (streams, pumps)
• Dryad [Isard 07]
• StreamIt [Thies 02]
• Offcodes [Weinsberg 08]
• others…
[Same graph diagram, annotated with where the data lives: the usbsrc → rawimg channel crosses from USB to GPU memory, while the channels between xform and detect stay entirely in GPU memory.]
• Eliminate unnecessary communication…
[Same graph diagram: new data arriving at usbsrc triggers new computation through xform and detect down to hidinput.]
• Eliminates unnecessary communication
• Eliminates u/k crossings, computation
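Continuing the hypothetical sketch above, the gesture pipeline might be wired up like this; again, every call and name is invented for illustration:

/* Hypothetical wiring of the gesture pipeline as a ptask graph.
 * In the steady state, new data on usbsrc flows through xform and
 * detect to hid_in with no user/kernel crossings. */
void build_gesture_graph(void)
{
    ptask_t xform  = sys_create_ptask("xform_gpu_program");
    ptask_t detect = sys_create_ptask("detect_gpu_program");

    endpoint_t usbsrc = sys_open_endpoint("usbsrc");  /* written by catusb */
    endpoint_t rawimg = sys_open_endpoint("rawimg");
    endpoint_t cloud  = sys_open_endpoint("cloud");
    endpoint_t hands  = sys_open_endpoint("hands");
    endpoint_t hid_in = sys_open_endpoint("hid_in");  /* read by hidinput  */

    sys_bind_endpoint(xform,  rawimg, 0);   /* input slot  */
    sys_bind_endpoint(xform,  cloud,  1);   /* output slot */
    sys_bind_endpoint(detect, cloud,  0);
    sys_bind_endpoint(detect, hands,  1);

    sys_create_channel(usbsrc, rawimg);     /* USB → GPU memory  */
    sys_create_channel(hands,  hid_in);     /* GPU → main memory */
}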
[Graph: xform performance, Segmentation + Geometry (higher is better), ptask-analogue vs. naïve-CUDA for H→D, D→H, and H↔D, with 10x and 3.9x speedups marked.
H→D: Host-to-Device only; D→H: Device-to-Host only; H↔D: duplex communication.
Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, nVidia GeForce GT230.]
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related Work
• Conclusion
• OS support for heterogeneous architectures:
◦ Helios [Nightingale 09]
◦ Barrelfish [Baumann 09]
◦ Offcodes [Weinsberg 08]
• Graph-based programming models:
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP Offload [Currid 04]
• GPU computing:
◦ CUDA, OpenCL
• CUDA: the programming interface is right…
◦ …but the OS must get involved
◦ Current interfaces waste data movement
◦ Current interfaces inhibit modularity/reuse
◦ Current interfaces cannot guarantee fairness and isolation
• OS-level abstractions are required

Questions?