General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks
George C. Caragea
Fuat Keceli
Alexandros Tzannes
Uzi Vishkin
XMT: An Easy-to-Program Many-Core
XMT: Motivation and Background
• Many-cores are coming. But 40 years of parallel computing:
• Never a successful general-purpose parallel computer (easy to program, good speedups, up & down scalable)
• IF you could program it → great speedups
• XMT: Fix the IF
• XMT: Designed from the ground up to address that for on-chip parallelism
• Tested HW & SW prototypes
• Builds on PRAM algorithmics, the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base
• At each step, provide all instructions that can execute concurrently (not dependent on each other)
• PRAM/XMT abstraction: all such instructions execute immediately (“uniform cost”)
XMT Programming Model
• PRAM-like programming: using reduced synchrony
• Main construct: spawn-join block. Can start any number of virtual threads at once
• Virtual threads advance at their own speed, not in lockstep
• Prefix-sum (ps): similar to atomic fetch-and-add
Paraleap: XMT PRAM-on-chip silicon
• Built FPGA prototype
• Announced in SPAA’07
• Built using 3 FPGA chips: 2 Virtex-4 LX200, 1 Virtex-4 FX100
Clock rate: 75 MHz
DRAM size: 1 GB
DRAM channels: 1
Mem. data rate: 0.6 GB/s
No. of cores (TCUs): 64
Clusters: 8
Cache modules: 8
Shared cache: 256 KB
Ease of programming
• Necessary condition for success of a general-purpose platform
• In von Neumann’s 1947 specs
• Indications that XMT is easy to program:
1. XMT is based on rich algorithmic theory (PRAM)
2. Ease-of-teaching as a benchmark:
a. Successfully taught parallel programming to middle-school, high-school, and up
b. Evaluated by education experts (SIGCSE 2010)
c. XMT superior to MPI, OpenMP and CUDA
3. Programmer’s workflow for deriving efficient programs from PRAM algorithms
4. DARPA HPCS productivity study: XMT development time half of MPI
XMTC Programming Language
C with simple SPMD extensions
• spawn: start any number of virtual threads
• $: unique thread ID
• ps/psm: atomic prefix sum. Efficient hardware implementation
XMTC Example: Array Compaction

int A[N], B[N];
int base = 0;
spawn(0, N-1) {          /* start N virtual threads, one per element */
    int inc = 1;
    if (A[$] != 0) {     /* $ is the unique thread ID */
        ps(inc, base);   /* atomic prefix-sum: inc gets old base; base += inc */
        B[inc] = A[$];   /* copy non-zero element to a unique slot of B */
    }
}
• Non-zero elements of A copied into B
• Order is not necessarily preserved
• After atomically executing ps(inc,base)
• base = base + inc
• inc gets original value of base
• Elements copied into unique locations in B
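For contrast with the GPU architecture compared next, the following is a minimal CUDA sketch of the same compaction. It is not part of the original poster; the kernel name compact_nonzero is hypothetical, and it stands in for XMTC's ps with atomicAdd, CUDA's fetch-and-add:

// Hypothetical CUDA analogue of the XMTC compaction above (illustration only).
// atomicAdd returns the previous value of *counter, mirroring ps(inc, base).
__global__ void compact_nonzero(const int *A, int *B, int *counter, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index, like $
    if (i < n && A[i] != 0) {
        int slot = atomicAdd(counter, 1);            // fetch-and-add on one global counter
        B[slot] = A[i];                              // order is not necessarily preserved
    }
}
// Possible launch: compact_nonzero<<<(n + 255) / 256, 256>>>(d_A, d_B, d_counter, n);

All participating GPU threads update a single global counter atomically, which can become a serialization point, whereas ps is backed by XMT's constant-time hardware prefix-sum (per the comparison below).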
TESLA vs. XMT: Architecture Comparison

Memory Latency Hiding and Reduction
  TESLA:
  • Heavy multithreading (requires large register files and a state-aware scheduler)
  • Limited local shared scratchpad memory
  • No coherent private caches at SM or SP
  XMT:
  • Large globally shared cache
  • No coherent private TCU or cluster caches
  • Software prefetching

Memory and Cache Bandwidth
  TESLA:
  • Memory access patterns need to be coordinated by the user for efficiency (request coalescing)
  • Scratchpad memories prone to bank conflicts
  XMT:
  • Relaxed need for user-coordinated DRAM access due to caches
  • Address hashing for avoiding memory module hotspots
  • High-bandwidth mesh-of-trees interconnect between clusters and caches

Functional Unit (FU) Allocation
  TESLA:
  • Dedicated FUs for SPs and SFUs
  • Less arbitration logic required
  • Higher theoretical peak performance
  XMT:
  • Heavy FUs (FPU and MDU) are shared through arbitrators
  • Lightweight FUs (ALU, branch) are allocated per TCU
  • ALUs do not include multiply-divide functionality

Control Flow and Synchronization
  TESLA:
  • Single instruction cache and issue per SM. Warps execute in lock-step, which penalizes diverging branches (see the CUDA sketch below)
  • Efficient local synchronization and communication within blocks. Global communication is expensive
  • Switching between serial and parallel modes (i.e. passing control from CPU to GPU) requires off-chip communication
  XMT:
  • One instruction cache and program counter per TCU enables independent progress of threads
  • Coordination of threads performed via constant-time prefix-sum. Other communication through the shared cache
  • Dynamic hardware support for fast switching between serial and parallel modes and for load balancing of virtual threads
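To make the lock-step and coalescing points above concrete, here is a hedged CUDA sketch of one BFS level over a CSR graph, written in the spirit of the Harish and Narayanan BFS benchmarked later. It is not from the poster, and the kernel and array names (bfs_level, row_ptr, col_idx, dist) are illustrative assumptions:

// Illustrative sketch only: one BFS level on a CSR graph.
// Threads whose vertex is off the current frontier idle while warp-mates work
// (divergence), and neighbor loads go through an index array (no coalescing).
__global__ void bfs_level(const int *row_ptr, const int *col_idx,
                          int *dist, int cur, int n)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n && dist[v] == cur) {                           // data-dependent branch: warp diverges
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {  // trip count differs per thread
            int u = col_idx[e];                              // irregular, data-dependent address
            if (dist[u] > cur + 1)
                dist[u] = cur + 1;                           // benign race: every writer stores cur+1
        }
    }
}

On XMT, the same irregular loop maps onto independently progressing virtual threads served by the shared cache, which is the architectural contrast the comparison above draws.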
Tested Configurations: GTX280 vs. XMT-1024
• Need configurations with equivalent area constraints (576 mm² in 65nm)
• Cannot simply set the number of functional units and memory to the same values
• Area estimation of the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90nm IBM technology)
• The more area-intensive side is emphasized in each category.
Benchmarks
(For each benchmark: description, CUDA source, lines of code, dataset, number of parallel sections, and threads per parallel section; counts are listed for CUDA and XMT.)

Bfs: Breadth-First Search
  CUDA source: Harish and Narayanan, Rodinia
  Lines of code: CUDA 290, XMT 86
  Dataset: 1M nodes, 6M edges
  Parallel sections: CUDA 25, XMT 12
  Threads/section: CUDA 1M, XMT 87.4K

Bprop: Back Propagation
  CUDA source: Rodinia
  Lines of code: CUDA 960, XMT 522
  Dataset: 64K nodes
  Parallel sections: CUDA 2, XMT 65
  Threads/section: CUDA 1.04M, XMT 19.4K

Conv: Image Convolution
  CUDA source: NVIDIA CUDA SDK
  Lines of code: CUDA 283, XMT 87
  Dataset: 1024x512
  Parallel sections: CUDA 2, XMT 2
  Threads/section: CUDA 131K, XMT 512K

Msort: Merge-Sort
  CUDA source: Thrust library
  Lines of code: CUDA 966, XMT 283
  Dataset: 1M keys
  Parallel sections: CUDA 82, XMT 140
  Threads/section: CUDA 32K, XMT 10.7K

NW: Needleman-Wunsch
  CUDA source: Rodinia
  Lines of code: CUDA 430, XMT 129
  Dataset: 2x2048 sequences
  Parallel sections: CUDA 255, XMT 4192
  Threads/section: CUDA 1.1K, XMT 1.1K

Reduct: Parallel Reduction
  CUDA source: NVIDIA CUDA SDK
  Lines of code: CUDA 481, XMT 59
  Dataset: 16M elts.
  Parallel sections: CUDA 3, XMT 3
  Threads/section: CUDA 5.5K, XMT 44K

Spmv: Sparse matrix-vector multiply
  CUDA source: Bell and Garland
  Lines of code: CUDA 91, XMT 34
  Dataset: 36K x 36K, 4M non-zero
  Parallel sections: CUDA 1, XMT 1
  Threads/section: CUDA 30.7K, XMT 36K
Principal Computational Resources and On-Chip Memory
                          GTX280              XMT-1024
  Principal Computational Resources
    Cores                 240 SP, 60 SFU      1024 TCU
    Integer Units         240 ALU+MDU         1024 ALU, 64 MDU
    Floating Point Units  240 FPU, 60 SFU     64 FPU
  On-Chip Memory
    Registers             1920KB              128KB
    Prefetch Buffers      --                  32KB
    Regular Caches        --                  4104KB
    Constant Cache        240KB               128KB
    Texture Cache         480KB               --

Experimental Platform
• XMTSim: The cycle-accurate XMT simulator
• Timing modeled after the 64-TCU FPGA prototype
• Highly configurable to simulate any configuration
• Modular design, enables architectural exploration
• Part of XMT Software Release: http://www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html

Performance Comparison
• When using 1024-TCU XMT configuration:
• 6.05x average speedup on irregular applications
• 2.07x average slowdown on regular applications
• When using 512-TCU XMT configuration:
• 4.57x average speedup on irregular applications
• 3.06x average slowdown on regular applications
• Case study: BFS on a low-parallelism dataset
• Speedup of 73.4x over Rodinia implementation
• Speedup of 6.89x over UIUC implementation
• Speedup of 110.6x when using only 64 TCUs (lower latencies for the smaller design)
• SPAA’09: 10X over Intel Core Duo with same silicon area
• Current work:
• XMT outperforms GPU on all irregular workloads
• XMT does not fall behind significantly on regular workloads
• No need to pay high performance penalty for ease-of-programming
• Promising candidate for a pervasive platform of the future: a highly parallel general-purpose CPU coupled with a parallel GPU
• Future work:
• Power/energy comparison of XMT and GPU