Performance Potential of an Easy-to-Program PRAM-On-Chip Prototype Versus
State-of-the-Art Processor
George C. Caragea – University of Maryland
A. Beliz Saybasili – LCB Branch, NHLBI, NIH
Xingzhi Wen – NVIDIA Corporation
Uzi Vishkin – University of Maryland
www.umiacs.umd.edu/users/vishkin/XMT
Hardware prototypes of PRAM-On-Chip
• 64-core, 75 MHz FPGA prototype [SPAA'07, Computing Frontiers'08]
  – Original explicit multi-threaded (XMT) architecture [SPAA'98]
    (Cray started to use "XMT" ~7 years later)
• Interconnection network for 128 cores: 9mm × 5mm, IBM 90nm process,
  400 MHz prototype [HotInterconnects'07]
• Same design as the 64-core FPGA: 10mm × 10mm, IBM 90nm process,
  150 MHz prototype
• The design scales to 1000+ cores on-chip
Objective of current paper: meaningful comparison of
1. Our FPGA design, with
2. A state-of-the-art (Intel) processor
XMT: A PRAM-On-Chip Vision
• Manycores are coming. But 40 years of parallel computing:
• Never a successful general-purpose parallel computer (easy to
  program, good speedups, scalable up & down). IF you could
  program it → great speedups. XMT: fix the IF.
• XMT: designed from the ground up to address that for on-chip
  parallelism
  – Unlike matching current HW (some other SPAA papers)
• Tested HW & SW prototypes
• Builds on PRAM algorithmics, the only really successful parallel
  algorithmic theory. A latent, though not widespread, knowledge base.
• This paper: ~10X relative to Intel Core 2 Duo
• If there is time: really serious about ease of programming
Objective for programmer’s model
• Emerging models: uncertain; but the analysis should be work-depth.
  Then why not design for your analysis? (as in serial)
[Diagram: serial vs. natural (parallel) paradigm, plotting #ops against
time. Serial paradigm: one op per time step, so Time = Work. Natural
(parallel) paradigm: "What could I do in parallel at each step assuming
unlimited hardware?" Work = total #ops, Time << Work.]
• XMT: Design for work-depth. Unique among manycores.
- 1 operation now. Any #ops next time unit.
- Competitive on nesting. (To be published.)
- No need to program for locality.
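A concrete illustration of the two measures (an example added here, not
from the talk): summing n numbers by a balanced binary tree performs
W(n) = n − 1 additions in D(n) = ⌈log2 n⌉ time steps; for n = 1024 that
is Work = 1023 ops but Time = 10 steps, i.e., Time << Work.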
Programmer’s Model: Engineering Workflow
• Arbitrary CRCW Work-depth algorithm. Reason about correctness &
complexity in synchronous model
• SPMD reduced synchrony
– Threads advance at own speed, not lockstep
– Main construct: spawn-join block. Note: can start any number of
processes at once
– Prefix-sum (ps). Independence of order semantics (IOS). (See the
XMTC sketch at the end of this slide.)
[Diagram: execution alternates serial and parallel segments:
spawn ... join, spawn ... join.]
– Establish correctness & complexity by relating to WD analyses.
– Circumvents “The problem with threads”, e.g., [Lee].
• Tune (compiler or expert programmer): (i) length of the sequence of
  round trips to memory, (ii) QRQW, (iii) WD [VCL07]
• Contrast with trial & error elsewhere: similar start, then
  while (insufficient inter-thread bandwidth)
      do { rethink algorithm to take better advantage of cache }
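A minimal XMTC-style sketch of the spawn-join block and prefix-sum named
above (a sketch, with syntax approximated from the XMT tutorial: `$` is
the spawned thread's ID, `ps` is the hardware prefix-sum, `psBaseReg`
declares its base; the compaction task itself is my example):

    // Array compaction: copy the nonzero elements of A[0..N-1] into B.
    // IOS: any order of the ps calls yields a correct (if different) B.
    int A[N], B[N];
    psBaseReg base;            // base register for the prefix-sum unit
    base = 0;
    spawn(0, N - 1) {          // start N virtual threads at once
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);     // atomically: inc <- old base; base += 1
            B[inc] = A[$];     // inc is this thread's unique output slot
        }
    }                          // implicit join: serial code resumes here

Each spawn-join block is one parallel segment between serial segments;
correctness and complexity are established by relating the block to the
work-depth analysis, as above.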
XMT Architecture Overview
• One serial core – master thread control unit (MTCU)
• Parallel cores (TCUs) grouped in clusters
• Global memory space evenly partitioned into cache banks using hashing
• No local caches at the TCUs
  – Avoids expensive cache-coherence hardware
[Block diagram: the MTCU and a hardware scheduler/prefix-sum unit feed
clusters 1..C of TCUs; a parallel interconnection network connects the
clusters to the shared memory (L1 cache), organized as memory banks
1..M and backed by DRAM channels 1..D.]
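To make the hashed partitioning concrete, a hedged C sketch of mapping
an address to a cache bank (the line size, the mixing step, and bank_of
itself are illustrative assumptions, not XMT's actual hash function):

    #include <stdint.h>

    #define NUM_BANKS 8   /* illustrative; Paraleap has 8 cache modules */

    /* Map a memory address to a cache bank.  Mixing high bits into the
     * low bits before taking the modulus spreads regular access strides
     * evenly across banks -- the point of hashing the partition. */
    static unsigned bank_of(uint64_t addr)
    {
        uint64_t line = addr >> 6;     /* assume 64-byte cache lines */
        line ^= line >> 17;            /* cheap bit mixing */
        return (unsigned)(line % NUM_BANKS);
    }

Even partitioning keeps any single bank from becoming a hot spot, which
is what lets one shared cache serve many TCUs without local caches.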
Paraleap: XMT PRAM-on-chip silicon
• Built FPGA prototype
• Announced in SPAA’07
• Built using 3 FPGA chips
– 2 Virtex-4 LX200
– 1 Virtex-4 FX100
Clock rate       75 MHz
No. TCUs         64
DRAM size        1GB
Clusters         8
DRAM channels    1
Cache modules    8
Mem. data rate   0.6GB/s
Shared cache     256KB
With no prior design experience, X. Wen completed the synthesizable
Verilog description AND the new FPGA-based XMT computer in slightly
more than two years. X. Wen is one person → basic simplicity of the
XMT architecture → faster time to market, lower implementation cost.
Benchmarks
• Sparse Matrix – Vector Multiplication (SpMV)
  – Matrix stored in Compressed Sparse Row (CSR) format
  – Serial version: iterate through rows
  – Parallel version: one thread per row (sketched after this list)
• 1-D FFT
– Fixed-point arithmetic implementation
– Serial version: Radix-2 Cooley-Tukey Algorithm
– Parallel version: Parallelized each stage of serial algorithm
• Quicksort
– Serial version: standard textbook implementation
– Parallel version: two phases
  • Phase 1: for large sub-arrays, parallelize the partitioning
    operation using atomic prefix-sum (sketched after this list)
  • Phase 2: process all partitions in parallel, each using the
    serial partitioning algorithm
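To make the SpMV benchmark concrete, a hedged sketch of both versions
under the CSR layout (names and types are my own; the XMTC syntax is
approximated as on the earlier workflow slide):

    // Serial CSR SpMV, y = A*x: row i's nonzeros are val[row_ptr[i] ..
    // row_ptr[i+1]-1], with matching column indices in col[].
    void spmv_serial(int n, const int *row_ptr, const int *col,
                     const float *val, const float *x, float *y)
    {
        for (int i = 0; i < n; i++) {          // iterate through rows
            float sum = 0.0f;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }

    // Parallel version (XMTC-style), one thread per row.  Rows are
    // independent, so the threads need no synchronization:
    // spawn(0, n - 1) {
    //     float sum = 0.0f;
    //     for (int k = row_ptr[$]; k < row_ptr[$ + 1]; k++)
    //         sum += val[k] * x[col[k]];
    //     y[$] = sum;
    // }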
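And a hedged sketch of quicksort's phase 1, partitioning with an atomic
prefix-sum (the technique the slide names, not the paper's exact code;
the two-base-register trick and all names are my own, and the pivot's
final placement is omitted for brevity):

    // Partition A[0..n-1] around pivot p into B: elements < p fill B
    // from the front, the rest fill it from the back.
    psBaseReg lo, hi;
    lo = 0; hi = 0;
    spawn(0, n - 1) {
        int s = 1;
        if (A[$] < p) {
            ps(s, lo);            // s <- old lo; lo += 1
            B[s] = A[$];
        } else {
            ps(s, hi);            // s <- old hi; hi += 1
            B[n - 1 - s] = A[$];
        }
    }
    // After the join: B[0..lo-1] < p <= B[lo..n-1]; recurse on each side.

IOS makes this correct: the order in which threads grab slots is
arbitrary, and quicksort only needs the two sides separated, not sorted.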
Experimental Platforms
                 XMT Paraleap FPGA          Intel Core 2 Duo
Processor        1 MTCU, 64 TCUs            Dual core, E6300
Clock            75 MHz                     1.86 GHz
Cache            256KB shared L1 cache      2x64KB L1, 2x2MB L2
DRAM             1GB DDR2                   2GB DDR2
Data rate        0.6GB/s                    6.4GB/s
Compiler         XMTCC (GCC 4.0.2-based)    GCC 4.1.0; Intel C++
                                            Professional Compiler (ICC) v11
Compiler optim.  -O3, data prefetch,        -O3, SSE3 SIMD, data prefetching,
                 read-only buffers          auto-parallelization
For a meaningful comparison: compare cycle counts
Input Datasets
             small             large
Program      N      Footprint  N      Footprint
SpMV         22K    200KB      4M     33MB
FFT          8K     192KB      4M     96MB
Quicksort    100K   781KB      20M    153MB
• Large dataset represents realistic input sizes
– Recommended by Intel engineer for comparison
– Gives Intel Core 2 advantage because of larger cache
• Small dataset
– Fits in both Paraleap and Intel Core 2 cache
– Provides most fair comparison for current XMT generation
Clock-Cycle Speedup
• Computed as:
  speedup = (#clock cycles on Core 2) / (#clock cycles on Paraleap)
              Core 2 – ICC      Core 2 – GCC
Program       small    large    small    large
SpMV          6.7      3.3      6.3      3.26
FFT           9.51     2.51     8.76     2.71
Quicksort(*)  13.07    7.75     13.89    8.18
• Paraleap outperforms Intel Core 2 on all benchmarks
• Lower speedups for the large dataset because of Paraleap's smaller cache
  – Will not be an issue for future implementations of XMT
• Silicon area of the 64-TCU XMT is roughly the same as one core of the
  Intel Core 2 Duo
• No reason for the clock frequency of XMT to fall behind
Conclusion
• XMT provides a viable answer to the field's biggest challenges:
  – Ease of programming
  – Scalability (up & down)
• Preliminary evaluation shows good results for the XMT architecture
  versus a state-of-the-art Intel Core 2 platform
• An ICPP'08 paper compares XMT with GPUs.
Software release
Lets you use your own computer to program in an XMT environment and
experiment with it, including:
(i) A cycle-accurate simulator of the XMT machine
(ii) A compiler from XMTC to that machine
Also provided: extensive material for teaching or self-studying
parallelism, including
(i) Tutorial + manual for XMTC (150 pages)
(ii) Class notes on parallel algorithms (100 pages)
(iii) Video recording of the 9/15/07 high-school tutorial (300 minutes)
(iv) Video recordings of graduate Parallel Algorithms lectures (30+ hours)
www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html
Next Major Objective
Industry-grade chip. Requires 10X in funding.
Ease of Programming
• Benchmark: can any CS major program your manycore?
  – Cannot really avoid it.
• Teachability demonstrated so far:
  – To a freshman class with 11 non-CS students. Some programming
    assignments: merge-sort, integer-sort & sample-sort.
• Other teachers:
  – Magnet HS teacher. Downloaded the simulator, assignments, and class
    notes from the XMT page. Self-taught. Recommends: teach XMT first.
    Easiest to set up (simulator), program, and analyze: ability to
    anticipate performance (as in serial). Possible not just for
    embarrassingly parallel problems. Also teaches OpenMP, MPI, CUDA.
    Look up the keynote at CS4HS'09@CMU + an interview with the teacher.
  – High-school & middle-school students (some 10-year-olds) from
    underrepresented groups, taught by a HS math teacher.