IMPACT
For thousand-core microprocessors
Wen-mei Hwu with
Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi,
Kidd, Barghsorkhi, Mahesri, Tsao, Stratton, Navarro,
Lumetta, Frank, Patel
University of Illinois, Urbana-Champaign
1
• Academic compiler research infrastructure is a tough business
– IMPACT, Trimaran, and ORC for VLIW and Itanium processors
– Polaris and SUIF for multiprocessors
– LLVM for portability and safety
• In 2001, the IMPACT team moved into many-core compilation with MARCO FCRC funding
– A new implicitly parallel programming model that balances the burden on programmers and the compiler in parallel programming
– Infrastructure work has slowed the ground-breaking work
• Timely visit by the Phoenix team in January 2007
– Rapid progress has since been taking place
– Future IMPACT research will be built on Phoenix
2
Big picture
• Today, multi-cores make more effective use of area and power than large ILP CPUs
– Scaling from 4-core to 1000-core chips could happen in the next 15 years
• All semiconductor market domains converging to concurrent system platforms
– PCs, game consoles, mobile handsets, servers, supercomputers, networking, etc.
We need to make these systems effectively execute valuable, demanding apps.
3
“Compilers and tools must extend the human’s ability to manage parallelism by doing the heavy lifting.”
• To meet this challenge, the compiler must
– Allow simple, effective control by programmers
– Discover and verify parallelism
– Eliminate tedious efforts in performance tuning
– Reduce testing and support cost of parallel programs
4
• A quiet revolution and potential build-up
– Calculation: 450 GFLOPS vs. 32 GFLOPS
– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
[Chart: GPU floating-point performance over successive generations — G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
– GPU in every PC and workstation – massive volume and potential impact
5
16 highly threaded SMs, >128 FPUs, 450 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Block diagram of the G80: a Host feeds an Input Assembler and Thread Execution Manager; an array of SMs, each with a Parallel Data Cache and Texture unit; Load/Store units connect the SMs to Global Memory]
6
App       Architectural bottleneck                          Simult. threads   Kernel speedup (X)   App speedup (X)
H.264     Registers, global memory latency                  3,936             20.2                 1.5
LBM       Shared memory capacity                            3,200             12.5                 12.3
RC5-72    Registers                                         3,072             17.1                 11.0
FEM       Global memory bandwidth                           4,096             11.0                 10.1
RPES      Instruction issue rate                            4,096             210.0                79.4
PNS       Global memory capacity                            2,048             24.0                 23.7
LINPACK   Global memory bandwidth, CPU-GPU data transfer    12,288            19.4                 11.8
TRACF     Shared memory capacity                            4,096             60.2                 21.6
FDTD      Global memory bandwidth                           1,365             10.5                 1.2
MRI-Q     Instruction issue rate                            8,192             457.0                431.0
[HKR HotChips-2007]
7
[Bar chart: results for successive code versions of the same application — V1 (cpu, dp): 1164.1; V2 (cpu, dp, sse2): 1156.5; V3 (cpu, dp, sse2, fm): 953.9; V4 (cpu, sp): 923.7; V5 (cpu, sp, sse2): 400.1; V6 (cpu, sp, sse2, fm): 267.6; V7 (gpu, sp): 3.3; V8 (gpu, sp, fm): 0.6; annotated speedup: 446x]
CPU (V6): 230 MFLOPS; GPU (V8): 96 GFLOPS
8
• Parallelism extraction requires global understanding
– Most programmers only understand parts of an application
• Algorithms need to be re-designed
– Programmers benefit from a clear view of the algorithmic effects on parallelism
• Real but rare dependences often need to be ignored
– Error-checking code, etc.; the parallel code is often not equivalent to the sequential code (see the sketch below)
• Getting more than a small speedup over sequential code is very tricky
– ~20 versions are typically tried for each application to move away from architecture bottlenecks
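As a minimal, hypothetical C++ sketch (not taken from the applications above) of such a rare-but-real dependence: the error-handling path updates shared state, which makes the loop formally sequential even though the common-case iterations are independent.

    #include <cstdio>

    int  error_count = 0;           // shared state touched only on the rare path
    char error_log[256];

    void scale_samples(float *out, const float *in, int n) {
        for (int i = 0; i < n; ++i) {          // candidate parallel loop
            if (in[i] < 0.0f) {                // error check, rarely taken
                // This branch creates a real cross-iteration dependence;
                // parallelizing the loop anyway is not strictly equivalent
                // to the sequential code (log contents, count update order).
                std::snprintf(error_log, sizeof error_log, "bad sample at %d", i);
                ++error_count;
                continue;
            }
            out[i] = 2.0f * in[i];             // common case: independent work
        }
    }

Hardware support for speculation with undo (see the key ideas on a later slide) lets the compiler parallelize the common path and recover when the rare branch actually fires.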
9
[Flow diagram of the implicitly parallel programming model:
Human → Stylized C/C++ or DSL w/ assertions
→ Concurrency discovery (deep analysis w/ feedback assistance)
→ Visualizable concurrent form (for increased composability)
→ Code-gen space exploration (systematic search for best/correct code gen)
→ Visualizable sequential assembly code with parallel annotations (for increased scalability: parallel execution w/ sequential semantics)
→ Parallel HW w/ sequential state gen, with a Debugger attached (for increased supportability)]
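As a rough illustration of the "stylized C/C++ with assertions" entry point (the macro names below are hypothetical, not an actual IMPACT or Phoenix interface): the program keeps its sequential semantics, and the programmer's assertions are claims that the concurrency-discovery pass either verifies or speculates on.

    // Hypothetical assertion macros; they expand to nothing, so the code
    // still builds and runs sequentially with any C++ compiler.
    #define ASSERT_NO_ALIAS(a, b)      /* claim: (a) and (b) never overlap */
    #define ASSERT_INDEPENDENT_ITERS() /* claim: iterations may run in any order */

    void blend(float *dst, const float *src_a, const float *src_b, int n) {
        ASSERT_NO_ALIAS(dst, src_a);
        ASSERT_NO_ALIAS(dst, src_b);
        ASSERT_INDEPENDENT_ITERS();
        for (int i = 0; i < n; ++i)
            dst[i] = 0.5f * (src_a[i] + src_b[i]);
    }

The compiler either proves these claims during concurrency discovery or generates speculative parallel code that is checked at run time.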
10
• Deep program analyses that extend programmer and DSE knowledge for parallelism discovery
– Key to reduced programmer parallelization efforts
• Exclusion of infrequent but real dependences using HW STU (Speculative Threading with Undo) support
– Key to successful parallelization of many real applications
• Rich program information maintained in the IR for access by tools and HW
– Key to integrating multiple programming models and tools
• Intuitive, visual presentation to programmers
– Key to good programmer understanding of algorithm effects
• Managed parallel execution arrangement search space
– Key to reduced programmer performance tuning efforts
11
(H.263 motion estimation example)
[Figure (a): guess vectors are obtained from the previous macroblock (prev_frame, cur_frame)]
[Figure (b): guess vectors are obtained from the corresponding macroblock in the previous frame (prev_frame, cur_frame)]
12
[Figure (a): call graph of the encoder, including MotionEstimation, Interpolation, MotionEstimatePicture, FullPelMotionEstMB, MBMotionEstimation, SAD_Macroblock, FindSubPel (x5), GetMotionImages, MotionCompensation, LuminanceComp, ChrominanceComp, FrameSubtraction, VopShapeMotText, CodeMB, BlockDCT, BlockQuant, BlockDequant, BlockIDCT, MBBlockRebuild, BlockRebuildUV, and BitstreamEncode]
[Table: candidate loops (granularities: pixel, pixel, row, component, block, macroblock) vs. the analysis combinations below; an X marks each loop that a combination can parallelize]
(b): Original + interprocedural array disambiguation
(c): Combination #1 + non-affine expression array disambiguation + context- & heap-sensitive pointer analysis
(d): Combination #2 + field-sensitive pointer analysis
(e): Final — adds value constraint and relationship inference analyses
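A small hypothetical C++ fragment of the kind of code these increasingly precise analyses target: the non-affine subscript needs expression-level array disambiguation (plus value-range reasoning), and the two field accesses become provably independent only with field-sensitive pointer analysis.

    struct MacroBlock { short lum[256]; short chrom[128]; };

    // Non-affine subscript: classic affine dependence tests give up on
    // blk * stride + offs[i]; expression-level disambiguation and value
    // constraints are needed to reason about whether iterations overlap.
    void quantize_block(short *coef, const int *offs, int blk, int stride) {
        for (int i = 0; i < 64; ++i)
            coef[blk * stride + offs[i]] /= 8;
    }

    // Field-sensitive pointer analysis distinguishes mb->lum from mb->chrom,
    // so the two loops can be shown independent and run concurrently.
    void rebuild(MacroBlock *mb) {
        for (int i = 0; i < 256; ++i) mb->lum[i]   += 128;
        for (int i = 0; i < 128; ++i) mb->chrom[i] += 128;
    }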
13
[Figure: execution timeline of macroblock operations under (a) Loop Partitioning and (b) Loop Fusion + Memory Privatization. Legend (operations performed on 16x16 macroblocks): Motion Estimation; Motion Compensation, Frame Subtraction; DCT & Quantization; Dequantization, IDCT, Frame Addition; Main Memory Access]
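A hedged sketch of the two organizations in the figure (the stage functions and signatures below are made up for illustration): under loop partitioning each stage sweeps all macroblocks and a whole-frame intermediate array round-trips through main memory; after loop fusion and memory privatization the stages run back-to-back on one macroblock with a private 16x16 buffer, making the iterations independent.

    // Hypothetical encoder stages; bodies omitted, signatures made up.
    void motion_estimate(const short *cur, const short *prev, int *mvs, int mb);
    void subtract_compensate(const short *cur, const short *prev,
                             const int *mvs, short *blk, int mb);
    void dct_quantize(short *blk);
    void dequant_idct_add(const short *blk, const short *prev, short *recon, int mb);

    // (a) Loop partitioning: one sweep per stage; 'diff' is a whole-frame
    // intermediate array that lives in main memory between sweeps.
    void encode_partitioned(const short *cur, const short *prev, short *recon,
                            int *mvs, short *diff, int num_mb) {
        for (int mb = 0; mb < num_mb; ++mb) motion_estimate(cur, prev, mvs, mb);
        for (int mb = 0; mb < num_mb; ++mb) subtract_compensate(cur, prev, mvs, &diff[256 * mb], mb);
        for (int mb = 0; mb < num_mb; ++mb) dct_quantize(&diff[256 * mb]);
        for (int mb = 0; mb < num_mb; ++mb) dequant_idct_add(&diff[256 * mb], prev, recon, mb);
    }

    // (b) Loop fusion + memory privatization: the stages are fused per
    // macroblock and the 16x16 intermediate is privatized, so it can stay
    // in on-chip storage and the iterations become independent.
    void encode_fused(const short *cur, const short *prev, short *recon,
                      int *mvs, int num_mb) {
        for (int mb = 0; mb < num_mb; ++mb) {
            short block[16 * 16];            // privatized intermediate data
            motion_estimate(cur, prev, mvs, mb);
            subtract_compensate(cur, prev, mvs, block, mb);
            dct_quantize(block);
            dequant_idct_add(block, prev, recon, mb);
        }
    }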
14
[Figure: comparison of a unification-based pointer analysis with Fulcra]
15
• Meetings with the Phoenix team in January 2007
– Determined the set of Phoenix API routines necessary to support IMPACT analyses and transformations
• Received custom build of Phoenix that supports full type information
16
• Four-step process (a rough sketch of steps 2-3 follows this list):
1. Convert IMPACT's data structures to Phoenix's equivalents, and from C to C++/CLI.
2. Create the initial constraint graph using Phoenix's IR instead of IMPACT's IR.
3. Convert the solver (the pointer analysis itself).
• Consists of porting from C to C++/CLI and dealing with any changes to the ported Fulcra data structures.
4. Annotate the points-to information back into Phoenix's alias representation.
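The Phoenix API calls themselves are not shown here. As a rough, self-contained C++ sketch of what steps 2 and 3 amount to for an inclusion-based (Andersen-style) points-to analysis, with made-up names standing in for IR objects: step 2 walks the IR and records address-of and copy constraints, and step 3 propagates points-to sets to a fixed point; step 4 would then write the resulting sets back into the alias representation.

    #include <map>
    #include <set>
    #include <string>

    using Var = std::string;                       // stand-in for an IR symbol

    struct Constraints {                           // built in step 2 from the IR
        std::map<Var, std::set<Var>> addr_of;      // p = &x : x is in pts(p)
        std::map<Var, std::set<Var>> copy;         // p = q  : pts(q) is a subset of pts(p)
        // loads (p = *q) and stores (*p = q) are handled similarly; omitted here.
    };

    // Step 3: iterate the inclusion constraints to a fixed point.
    std::map<Var, std::set<Var>> solve(const Constraints &c) {
        std::map<Var, std::set<Var>> pts = c.addr_of;
        bool changed = true;
        while (changed) {
            changed = false;
            for (const auto &e : c.copy) {         // e.first = dst, e.second = srcs
                for (const Var &src : e.second)
                    for (const Var &obj : pts[src])
                        changed |= pts[e.first].insert(obj).second;
            }
        }
        return pts;
    }

Fulcra itself is considerably more precise (context- and heap-sensitive), so this only conveys the overall shape of the port, not its algorithm. For example, the constraints p = &a; q = p; leave pts(q) containing a after solving.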
17
• Access to code across file boundaries
– LTCG
• Access to multiple files within a pass
• Full (source-code-level) type information
• Feed results from Fulcra back to Phoenix
– Need more information on Phoenix alias representation
• In the long run, we need a highly extensible IR and API for Phoenix
18
• Compiler research for many-cores will require a very high quality infrastructure with strong engineering support
– New language extensions, new user models, new functionalities, new analyses, new transformations
• We chose Phoenix based on its robustness, features, and engineering support
– Our current industry partners are also moving to Phoenix
– We also plan to share our advanced extensions with other academic Phoenix users
19