Chris Rossbach, Jon Currey, Microsoft Research
Mark Silberstein, Technion
Baishakhi Ray, Emmett Witchel, UT Austin
SOSP October 25, 2011

PTask SOSP 2011

There are lots of GPUs
◦ 3 of top 5 supercomputers use GPUs
◦ In all new PCs, smart phones, tablets
◦ Great for gaming and HPC/batch
◦ Unusable in other application domains
GPU programming challenges
◦ GPU+main memory disjoint
◦ Treated as I/O device by OS
These two things are related: we need OS abstractions





The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion
programmer-visible interface
OS-level abstractions
Hardware interface
1:1 correspondence between OS-level and user-level abstractions
programmer-visible interface
◦ GPGPU APIs, Shaders/Kernels, Language Integration
◦ DirectX/CUDA/OpenCL Runtime
1 OS-level abstraction!
1. No kernel-facing API
2. No OS resource-management
3. Poor composability
Figure: GPU benchmark throughput, no CPU load vs. high CPU load (higher is better)
CPU scheduler and GPU scheduler not integrated!
• Image-convolution in CUDA
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230
OS cannot prioritize cursor updates
• WDDM + DWM + CUDA == dysfunction
Figure: flatter lines are better
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230
Pipeline: capture → xform → filter → detect
◦ capture: capture camera images (raw images)
◦ xform: geometric transformation (noisy point cloud)
◦ filter: noise filtering
◦ detect: detect gestures (“hand” events)
High data rates
Data-parallel algorithms
… good fit for GPU
NOT Kinect: this is a harder problem!
#> capture | xform | filter | detect &
capture: CPU, xform: GPU, filter: GPU, detect: CPU
Modular design
◦ flexibility, reuse
Utilize heterogeneous hardware
◦ Data-parallel components → GPU
◦ Sequential components → CPU
Using OS provided tools
◦ processes, pipes



GPUs cannot run OS: different ISA
Disjoint memory space, no coherence
Host CPU must manage GPU execution
◦ Program inputs explicitly transferred/bound at runtime
◦ Device buffers pre-allocated
User-mode apps must implement:
◦ Copy inputs (main memory → GPU memory)
◦ Send commands (CPU → GPU)
◦ Copy outputs (GPU memory → main memory)
#> capture | xform | filter | detect &
Figure: data movement for the pipeline across user processes, the OS executive, and drivers
◦ capture: read() from camdrv, write()
◦ xform: read(), copy to GPU, Run!, copy from GPU, write()
◦ filter: read(), copy to GPU, Run!, copy from GPU, write()
◦ detect: read(), IRP to HIDdrv
◦ every copy to or from the GPU is a PCI-xfer through the GPU driver

GPU Analogues for:
◦ Process API
◦ IPC API
◦ Scheduler hints

Abstractions that enable:
◦ Fairness/isolation
◦ OS use of GPU
◦ Composition/data movement optimization





The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion

ptask (parallel task)
◦ Has priority for fairness
◦ Analogous to a process for GPU execution
◦ List of input/output resources (e.g. stdin, stdout…)

ports
◦ Can be mapped to ptask input/outputs
◦ A data source or sink

channels
◦ Similar to pipes, connect arbitrary ports
◦ Specialize to eliminate double-buffering

graph
◦ DAG: connected ptasks, ports, channels

datablocks
◦ Memory-space transparent buffers

• OS objects → OS resource management possible
• data: specify where, not how
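
The abstractions on this slide can be pictured with a small single-process model. What follows is only a sketch in Python, not the PTask API: the class names `Channel`, `Port`, and `PTask`, the helpers `connect` and `run_graph`, and the two toy kernels are all hypothetical, and the "GPU kernels" are ordinary Python functions. It shows ports as data sources and sinks, channels connecting ports, and a ptask becoming runnable once every input port has data.

```python
from collections import deque


class Channel:
    """Pipe-like queue that connects two ports or exposes a port to the application."""

    def __init__(self):
        self.queue = deque()

    def push(self, block):
        self.queue.append(block)

    def pull(self):
        return self.queue.popleft()

    def ready(self):
        return len(self.queue) > 0


class Port:
    """A data source or sink, bound to one channel in this sketch."""

    def __init__(self, name):
        self.name = name
        self.channel = None


class PTask:
    """Stand-in for a GPU task: a kernel, input/output ports, and a priority."""

    def __init__(self, name, kernel, inputs, outputs, priority=5):
        self.name = name
        self.kernel = kernel
        self.inputs = [Port(n) for n in inputs]
        self.outputs = [Port(n) for n in outputs]
        self.priority = priority

    def ready(self):
        return all(p.channel is not None and p.channel.ready() for p in self.inputs)

    def run(self):
        args = [p.channel.pull() for p in self.inputs]
        results = self.kernel(*args)  # kernel returns one value per output port
        for port, value in zip(self.outputs, results):
            port.channel.push(value)


def connect(src_port, dst_port):
    """Join an output port to an input port with a fresh channel."""
    ch = Channel()
    src_port.channel = ch
    dst_port.channel = ch
    return ch


def run_graph(ptasks):
    """Data arrival triggers computation: run any ptask whose inputs are ready."""
    progress = True
    while progress:
        progress = False
        for t in ptasks:
            if t.ready():
                t.run()
                progress = True


# Hypothetical kernels standing in for xform and filter
xform = PTask("xform", lambda img: ([v * 2 for v in img],), ["rawimg"], ["cloud"])
filt = PTask("filter", lambda pts: ([v for v in pts if v < 10],), ["f-in"], ["f-out"])

source = Channel()  # application -> rawimg
sink = Channel()    # f-out -> application
xform.inputs[0].channel = source
connect(xform.outputs[0], filt.inputs[0])
filt.outputs[0].channel = sink

source.push([1, 4, 7])
run_graph([xform, filt])
result = sink.pull()
```

The application only pushes into `source` and pulls from `sink`; the intermediate block between `cloud` and `f-in` never returns to application code, which is the point of specifying where data goes rather than how it moves.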
#> capture | xform | filter | detect &
Figure: ptask graph for the pipeline
◦ process (CPU): capture, detect
◦ ptask (GPU): xform, filter
◦ port: rawimg, cloud, f-in, f-out
◦ channel: connects ports, carries datablocks
◦ memory: mapped mem, GPU mem
Optimized data movement
Data arrival triggers computation

Graphs scheduled dynamically
◦ ptasks queue for dispatch when inputs ready

Queue: dynamic priority order
◦ ptask priority user-settable
◦ ptask prio normalized to OS prio

Transparently support multiple GPUs
◦ Schedule ptasks for input locality
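
One way to picture "dynamic priority order" is an aged priority queue, in which a waiting ptask's effective priority grows over time so that low-priority work is not starved. The sketch below is an illustration under stated assumptions, not the PTask scheduler: the linear aging formula, the `aging_rate` weight, and the names `ReadyQueue` and `effective_priority` are hypothetical.

```python
class ReadyQueue:
    """Ready queue ordered by static priority plus an aging bonus (hypothetical policy)."""

    def __init__(self, aging_rate=1.0):
        self.aging_rate = aging_rate  # priority gained per unit of waiting time
        self.entries = []             # list of (name, priority, enqueue_time)

    def enqueue(self, name, priority, now):
        self.entries.append((name, priority, now))

    def effective_priority(self, entry, now):
        _, priority, enqueued = entry
        return priority + self.aging_rate * (now - enqueued)

    def dispatch(self, now):
        """Remove and return the name of the ptask with the highest effective priority."""
        best = max(self.entries, key=lambda e: self.effective_priority(e, now))
        self.entries.remove(best)
        return best[0]


q = ReadyQueue(aging_rate=1.0)
q.enqueue("low", priority=2, now=0)
q.enqueue("high", priority=8, now=9)

first = q.dispatch(now=10)   # "low" has waited 10 units: 2 + 10 = 12 beats 8 + 1 = 9
second = q.dispatch(now=10)
```

With `aging_rate=0.0` the same queue degenerates to strict priority order and "high" is dispatched first.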
Datablock
space   V   M   RW   data
main    1   1   11   Main Memory
gpu0    0   1   10   GPU 0 Memory
gpu1    1   1   10   GPU 1 Memory
…
Logical buffer
◦ backed by multiple physical buffers
◦ buffers created/updated lazily
◦ mem-mapping used to share across process boundaries

Track buffer validity per memory space
◦ writes invalidate other views

Flags for access control/data placement
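
The validity tracking described above can be sketched as follows. This is a simplified model, not the PTask implementation: the `Datablock` class here, its `read`/`write` methods, and the `copies` counter are hypothetical, list copies stand in for transfers between memory spaces, and the sketch assumes a memory space named "main" that holds the initial data.

```python
class Datablock:
    """Logical buffer with one lazily created view per memory space (sketch)."""

    def __init__(self, spaces, data):
        self.views = {s: None for s in spaces}   # physical buffers, created lazily
        self.valid = {s: False for s in spaces}  # the V bit per memory space
        self.copies = 0                          # simulated transfers between spaces
        self.views["main"] = list(data)          # assumption: data starts in "main"
        self.valid["main"] = True

    def _materialize(self, space):
        """Make the view in `space` valid, copying from a valid view only if needed."""
        if not self.valid[space]:
            src = next(s for s, ok in self.valid.items() if ok)
            self.views[space] = list(self.views[src])  # stands in for a PCI transfer
            self.valid[space] = True
            self.copies += 1

    def read(self, space):
        self._materialize(space)
        return self.views[space]

    def write(self, space, data):
        """Writes invalidate every other view."""
        self.views[space] = list(data)
        for s in self.valid:
            self.valid[s] = (s == space)


block = Datablock(["main", "gpu0", "gpu1"], [1, 2, 3])
block.read("gpu0")               # first read on gpu0 copies main -> gpu0
block.read("gpu0")               # second read reuses the valid view, no copy
block.write("gpu0", [4, 5, 6])   # main and gpu1 views become stale
host_view = block.read("main")   # lazily copies gpu0 -> main
```

Because copies happen only on a read of an invalid view, a block that is produced and consumed on the same GPU never crosses the bus in this model.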
#> capture | xform | filter …
Figure: a datablock moving through the ptask graph (capture → rawimg → xform → cloud → f-in → filter). The datablock's V, M, and RW flags for the main and gpu memory spaces change as the data moves between Main Memory and GPU Memory.
Legend: process, ptask, port, channel, datablock
• 1-1 correspondence between programmer and OS abstractions
• GPU APIs can be built on top of new OS abstractions





The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion

Windows 7
◦ Full PTask API implementation
◦ Stacked UMDF/KMDF driver
  • Kernel component: mem-mapping, signaling
  • User component: wraps DirectX, CUDA, OpenCL
◦ syscalls → DeviceIoControl() calls

Linux 2.6.33.2
◦ Changed OS scheduling to manage GPU
  • GPU accounting added to task_struct


Windows 7, Core2-Quad, GTX580 (EVGA)
Implementations
◦ pipes: capture | xform | filter | detect
◦ modular: capture+xform+filter+detect, 1 process
◦ handcode: data movement optimized, 1 process
◦ ptask: ptask graph
Configurations
◦ real-time: driven by cameras
◦ unconstrained: driven by in-memory playback
Figure: runtime, user, and sys relative to handcode for handcode, modular, pipes, and ptask (lower is better)
compared to hand-code
• 11.6% higher throughput
• lower CPU util: no driver program
compared to pipes
• ~2.7x less CPU usage
• 16x higher throughput
• ~45% less memory usage
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)
Figure: PTask invocations/second vs. PTask priority (2, 4, 6, 8) for fifo, priority, and ptask scheduling (higher is better)
PTask provides throughput proportional to priority
• FIFO: queue invocations in arrival order
• ptask: aged priority queue with OS priority
• graphs: 6x6 matrix multiply
• priority same for every PTask node
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)
Figure: speedup over 1 GPU for priority and data-aware scheduling (higher is better)
• Synthetic graphs: varying depths
• Data-aware == priority + locality
• Graph depth > 1 required for any benefit
Data-aware provides best throughput, preserves priority
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• 2 x GTX580 (EVGA)
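
"Data-aware == priority + locality" can be read as a two-step decision: priority chooses which ptask runs next, and locality chooses which GPU runs it. The sketch below illustrates that reading only; the functions `pick_gpu` and `schedule`, the tuple layout of `ready`, and the tie-breaking rule are hypothetical and are not taken from the PTask scheduler.

```python
def pick_gpu(input_locations, idle_gpus):
    """Return the idle GPU holding the most up-to-date inputs (ties: first listed).

    input_locations: one set per input, naming memory spaces with a valid copy.
    idle_gpus: GPU names currently free, in preference order.
    """
    if not idle_gpus:
        return None

    def local_inputs(gpu):
        return sum(1 for spaces in input_locations if gpu in spaces)

    return max(idle_gpus, key=local_inputs)


def schedule(ready, idle_gpus):
    """Priority picks the ptask; locality picks the GPU it runs on.

    ready: list of (name, priority, input_locations) tuples.
    """
    name, _, locations = max(ready, key=lambda t: t[1])
    return name, pick_gpu(locations, idle_gpus)


ready = [
    ("A", 3, [{"gpu0"}, {"main"}]),
    ("B", 7, [{"gpu1"}, {"gpu1", "main"}]),
]
choice = schedule(ready, ["gpu0", "gpu1"])  # B has priority; both its inputs are on gpu1
```

With a graph of depth 1 every input starts in main memory, so `pick_gpu` has nothing to distinguish the GPUs, which is consistent with the slide's note that deeper graphs are needed for any benefit.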
Figure: software stack
◦ user-prgs: R/W bnc, cuda-1, cuda-2
◦ user-libs: EncFS, FUSE, libc
◦ OS: PTask, Linux 2.6.33
◦ HW: SSD1, SSD2, GPU, …
Simple GPU usage accounting
• Restores performance

        GPU/CPU   cuda-1 Linux   cuda-2 Linux   cuda-1 PTask   cuda-2 PTask
Read    1.17x     -10.3x         -30.8x         1.16x          1.16x
Write   1.28x     -4.6x          -10.3x         1.21x          1.20x

• EncFS: nice -20
• cuda-*: nice +19
• AES: XTS chaining
• SATA SSD, RAID
• seq. R/W 200 MB





The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion

OS support for heterogeneous platforms:
◦ Helios [Nightingale 09], BarrelFish [Baumann 09], Offcodes [Weinsberg 08]
GPU Scheduling
◦ TimeGraph [Kato 11], Pegasus [Gupta 11]
Graph-based programming models
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP Offload [Currid 04]
Tasking
◦ Tessellation, Apple GCD, …

OS abstractions for GPUs are critical
◦ Enable fairness & priority
◦ OS can use the GPU

Dataflow: a good fit abstraction
◦ system manages data movement
◦ performance benefits significant
Thank you. Questions?