An IMplicitly PArallel Compiler Technology Based on Phoenix

For thousand-core microprocessors

Wen-mei Hwu with

Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi,

Kidd, Baghsorkhi, Mahesri, Tsao, Stratton, Navarro,

Lumetta, Frank, Patel

University of Illinois, Urbana-Champaign

Background

• Academic compiler research infrastructure is a tough business

– IMPACT, Trimaran, and ORC for VLIW and Itanium processors

– Polaris and SUIF for multiprocessors

– LLVM for portability and safety

• In 2001, the IMPACT team moved into many-core compilation with MARCO FCRC funding

– A new implicitly parallel programming model that balances the burden on programmers and the compiler in parallel programming

– Infrastructure work has slowed down ground-breaking work

• Timely visit by the Phoenix team in January 2007

– Rapid progress has since been taking place

– Future IMPACT research will be built on Phoenix


The Next Software Challenge

Big picture

• Today, multi-core chips make more effective use of area and power than large ILP CPUs

– Scaling from 4-core to 1000-core chips could happen in the next 15 years

• All semiconductor market domains converging to concurrent system platforms

– PCs, game consoles, mobile handsets, servers, supercomputers, networking, etc.

We need to make these systems effectively execute valuable, demanding apps.


The Compiler Challenge

“Compilers and tools must extend the human’s ability to manage parallelism by doing the heavy lifting.”

• To meet this challenge, the compiler must

– Allow simple, effective control by programmers

– Discover and verify parallelism

– Eliminate tedious efforts in performance tuning

– Reduce testing and support cost of parallel programs


An Initial Experimental Platform

• A quiet revolution and potential build-up

– Calculation: 450 GFLOPS vs. 32 GFLOPS

– Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s

– Until last year, programmed through graphics API

[Chart legend: G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800]

– GPU in every PC and workstation – massive volume and potential impact


GeForce 8800

16 highly threaded SMs, >128 FPUs, 450 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

[Block diagram: the Host feeds an Input Assembler and Thread Execution Manager; the SMs are grouped around parallel data caches and texture units; load/store units connect to Global Memory.]


Some Hand-coded Results [HKR HotChips-2007]

App.     Architectural Bottleneck                         Simult. Threads   Kernel X   App X
H.264    Registers, global memory latency                 3,936             20.2       1.5
LBM      Shared memory capacity                           3,200             12.5       12.3
RC5-72   Registers                                        3,072             17.1       11.0
FEM      Global memory bandwidth                          4,096             11.0       10.1
RPES     Instruction issue rate                           4,096             210.0      79.4
PNS      Global memory capacity                           2,048             24.0       23.7
LINPACK  Global memory bandwidth, CPU-GPU data transfer   12,288            19.4       11.8
TRACF    Shared memory capacity                           4,096             60.2       21.6
FDTD     Global memory bandwidth                          1,365             10.5       1.2
MRI-Q    Instruction issue rate                           8,192             457.0      431.0

(Kernel X and App X are kernel-only and whole-application speedups over the CPU version.)


Computing Q: Performance

[Bar chart of runtimes across eight versions, lower is better:
V1 (cpu, dp): 1164.1
V2 (cpu, dp, sse2): 1156.5
V3 (cpu, dp, sse2, fm): 953.9
V4 (cpu, sp): 923.7
V5 (cpu, sp, sse2): 400.1
V6 (cpu, sp, sse2, fm): 267.6
V7 (gpu, sp): 3.3
V8 (gpu, sp, fm): 0.6
The GPU version V8 is 446x faster than the best CPU version V6.]

CPU (V6): 230 MFLOPS GPU (V8): 96 GFLOPS


Lessons Learned

• Parallelism extraction requires global understanding

– Most programmers only understand parts of an application

• Algorithms need to be re-designed

– Programmers benefit from clear view of the algorithmic effect on parallelism

• Real but rare dependences often need to be ignored

– Because of error-checking code and the like, parallel code is often not strictly equivalent to sequential code

• Getting more than a small speedup over sequential code is very tricky

– Roughly 20 versions were typically tried per application to move away from architecture bottlenecks


Implicitly Parallel Programming Flow

Human writes stylized C/C++ or a DSL with assertions

→ Concurrency discovery: deep analysis with feedback assistance (for increased composability)

→ Visualizable concurrent form: parallel execution with sequential semantics (for increased scalability)

→ Code-gen space exploration: systematic search for best/correct code generation

→ Visualizable sequential assembly code with parallel annotations (for increased supportability; the debugger attaches here)

→ Parallel HW with sequential state generation


Key Ideas

• Deep program analyses that extend programmer and DSE knowledge for parallelism discovery

– Key to reduced programmer parallelization efforts

• Exclusion of infrequent but real dependences using HW STU (Speculative Threading with Undo) support

– Key to successful parallelization of many real applications

• Rich program information maintained in IR for access by tools and HW

– Key to integrating multiple programming models and tools

• Intuitive, visual presentation to programmers

– Key to good programmer understanding of algorithm effects

• Managed parallel execution arrangement search space

– Key to reduced programmer performance tuning efforts
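The STU idea, speculate past a rare dependence and undo if it actually fires, can be illustrated in plain software terms. This is only a sketch of the checkpoint/undo pattern with made-up names; the slides propose hardware support, not this software emulation:

```c
#include <string.h>

enum { N = 8 };

/* Minimal software sketch of "speculate, then undo". */
typedef struct {
    int data[N];
    int undo[N];     /* undo log: checkpoint taken before speculation */
} SpecRegion;

static void spec_begin(SpecRegion *r) {
    memcpy(r->undo, r->data, sizeof r->data);    /* take checkpoint */
}

static void spec_abort(SpecRegion *r) {
    memcpy(r->data, r->undo, sizeof r->data);    /* roll back */
}

/* Run the loop body assuming the rare dependence (here: an error-checking
 * path) never fires; if it did fire, undo all speculative updates.
 * Returns 1 if the speculative version committed, 0 if it was undone. */
int run_speculatively(SpecRegion *r, const int *in) {
    spec_begin(r);
    int rare_path_taken = 0;
    for (int i = 0; i < N; i++) {
        r->data[i] = in[i] * 2;              /* the "parallelizable" work */
        if (in[i] < 0) rare_path_taken = 1;  /* rare, real dependence */
    }
    if (rare_path_taken) {
        spec_abort(r);
        return 0;   /* caller re-executes sequentially */
    }
    return 1;       /* speculation succeeded; results are committed */
}
```

In hardware, the checkpoint and rollback are implicit in the speculative state; the point of the sketch is only that results become visible when the rare path provably did not execute.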


Parallelism in Algorithms

(H.263 motion estimation example)

[Figure panels showing prev_frame and cur_frame:]

(a) Guess vectors are obtained from the previous macroblock.

(b) Guess vectors are obtained from the corresponding macroblock in the previous frame.
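The inner kernel that scores each candidate guess vector is a sum of absolute differences over a macroblock (the encoder's call tree includes a SAD_Macroblock routine); a minimal illustrative version, not the encoder's actual code:

```c
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 macroblock of the current
 * frame and a candidate block of the previous frame; stride is the frame
 * width in bytes. Smaller SAD means a better motion-vector candidate. */
int sad_16x16(const unsigned char *cur, const unsigned char *prev, int stride)
{
    int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs((int)cur[y * stride + x] - (int)prev[y * stride + x]);
    return sad;
}
```

Because each candidate's SAD is independent of the others, this kernel is a natural unit of data parallelism in motion estimation.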


MPEG-4 / H.263 Encoder: Parallelism Rediscovery

[Call graph of the encoder, with loop granularities of pixel, pixel row, component, block, and macroblock: MotionEstimation (Interpolation, MotionEstimatePicture, FullPelMotionEstMB, MBMotionEstimation, SAD_Macroblock, FindSubPel x5), GetMotionImages, MotionCompensation (LuminanceComp, ChrominanceComp), FrameSubtraction, VopShapeMotText (CodeMB: BlockDCT, BlockQuant, BlockDequant, BlockIDCT; MBBlockRebuild, BlockRebuildUV), BitstreamEncode.]

Each successive analysis combination marks additional loops as parallel (the X marks in the original figure):

(b) Original + interprocedural array analysis

(c) Combination #1 + non-affine expression array disambiguation + context- and heap-sensitive pointer analysis

(d) Combination #2 + field-sensitive pointer analysis

(e) Final: value constraint and relationship inference analyses


Code Gen Space Exploration

[Schedules over time: (a) Loop Partitioning runs each stage over all macroblocks before starting the next stage; (b) Loop Fusion + Memory Privatization carries each macroblock through all stages back-to-back.]

Operations performed on 16x16 macroblocks:

• Motion Estimation

• Motion Compensation, Frame Subtraction

• DCT & Quantization

• Dequantization, IDCT, Frame Addition

• Main Memory Access
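The difference between the two schedules can be sketched in miniature (hypothetical stage functions standing in for the encoder's pipeline): partitioned loops carry results between stages through a shared temporary array, while fusing the loops lets that temporary be privatized to a scalar, making each macroblock's work independent.

```c
enum { NUM_MB = 4 };

/* (a) Loop partitioning: stage 1 writes a shared temporary for every
 * macroblock, then stage 2 reads it back; the temporary array lives
 * across the loops and traffics through memory. */
void pipeline_partitioned(const int *in, int *out)
{
    int tmp[NUM_MB];                     /* shared across the two loops */
    for (int mb = 0; mb < NUM_MB; mb++)
        tmp[mb] = in[mb] * 3;            /* stage 1 (stand-in for DCT) */
    for (int mb = 0; mb < NUM_MB; mb++)
        out[mb] = tmp[mb] + 1;           /* stage 2 (stand-in for quant) */
}

/* (b) Loop fusion + memory privatization: the temporary becomes a
 * per-iteration scalar, each macroblock is processed end-to-end, and
 * iterations are independent, ready for parallel execution. */
void pipeline_fused(const int *in, int *out)
{
    for (int mb = 0; mb < NUM_MB; mb++) {
        int tmp = in[mb] * 3;            /* privatized temporary */
        out[mb] = tmp + 1;
    }
}
```

Both versions compute the same outputs; the fused form simply removes the cross-loop dependence on the temporary array, which is what the code-gen space exploration searches for.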


Moving an Accurate Interprocedural Analysis into Phoenix

[Figure: points-to graph produced by a unification-based analysis vs. the one produced by Fulcra.]


Getting Started with Phoenix

• Meetings with Phoenix team in January 2007

– Determined the set of Phoenix API routines necessary to support IMPACT analyses and transformations

• Received custom build of Phoenix that supports full type information


Fulcra to Phoenix – Action!

• Four-step process:

1. Convert IMPACT's data structures to Phoenix's equivalents, and from C to C++/CLI.

2. Create the initial constraint graph using Phoenix's IR instead of IMPACT's IR.

3. Convert the solver (pointer analysis).

• Consists of porting from C to C++/CLI and dealing with any changes to the ported Fulcra data structures.

4. Annotate the points-to information back into Phoenix's alias representation.


Phoenix Support Wish List

• Access to code across file boundaries (LTCG)

• Access to multiple files within a pass

• Full (Source code level) type information

• Feed results from Fulcra back to Phoenix

– Need more information on Phoenix alias representation

• In the long run, we need a highly extensible IR and API for Phoenix

April 16, 2007

Conclusion

• Compiler research for many-core processors will require a very high-quality infrastructure with strong engineering support

– New language extensions, new user models, new functionalities, new analyses, new transformations

• We chose Phoenix based on its robustness, features and engineering support

– Our current industry partners are also moving into Phoenix

– We also plan to share our advanced extensions with other academic Phoenix users

