GaneshUPMARC2011 - Utah Formal Verification Group

(1) Formal Verification for Message Passing and GPU Computing

(2) XUM: An Experimental Multicore supporting MCAPI

Ganesh Gopalakrishnan

School of Computing, University of Utah and

Center for Parallel Computing (CPU) http://www.cs.utah.edu/fv

General Theme

• Take FM where it hasn’t gone before 

– A handful working in these domains which are crucially important

• Explore the space of concurrency based on message passing APIs

– A little bit of a mid-life crisis for an FV person learning which areas in SW design need help…

Ideas I hope to present

• L1: How to test message passing programs used in HPC

– Recognize the ubiquity of certain APIs in critical areas (e.g.

MPI, in HPC)

– With proper semantic characterization, we can formally understand / teach, and formally test

• No need to outright dismiss these APIs as “too hairy”

• (to the contrary) be able to realize fundamental issues that will be faced in any attempt along the same lines

– Long-lived APIs create a tool ecosystem around them that becomes harder to justify replacing

• What it takes to print a line in a “real setting”

– Need to build stack-walker to peel back profiling layers, and locate the actual source line

– Expensive – and pointless – to roll your own stack walker


• L2: How to test at scale

– The only practical way to detect communication non-determinism (in this domain)

– Can form the backbone of future large-scale replay-based debugging tools


• Realize that the multicore landscape is rapidly changing

– Accelerators (e.g. GPUs) are growing in use

– Multicore CPUs and GPUs will be integrated more tightly

– Energy is a first-rate currency

– Lessons learned from the embedded systems world are very relevant


• L3: Creating dedicated verification tools for

GPU kernels

– How symbolic verification methods can be effectively used to analyze GPU kernel functions

– Status of tool and future directions


• L4: Designing an experimental message-passing multicore

– Implements an emerging message passing standard called MCAPI in silicon

– How the design of special instructions can help with fast messaging

– How features in the Network on Chips (NoC) can help support the semantics of MCAPI

• Community involvement in the creation of such tangible artifacts can be healthy

– Read “The future of Microprocessors” in a recent CACM, by

Shekhar Borkar and Andrew Chien

Organization

• Today

– MPI and dyn. FV

• Tomorrow

– GPU computing and FV

– XUM

Context/Motivation: Demand for cycles!

• Terascale

• Petascale

• Exascale

• Zettascale

More compute power enables new discoveries, solves new problems

– Molecular dynamics simulations

• Better drug design facilitated

– Sanbonmatsu et al, FSE 2010 keynote

• 290 days of simulation to simulate 2 million atom interactions over 2 nano seconds

– Better “oil caps” can be designed if we have the right compute infrastructure

• Gropp, SC 2010 panel

Commonality among different scales

Also “HPC” will increasingly go embedded

MPI

CUDA /

OpenCL

OpenMP Pthreads

Multicore

Association

APIs

High End

Machines for HPC /

Cloud

Desktop

Servers and

Compute

Servers

Embedded

Systems and

Devices

11

Difficult Road Ahead wrt Debugging

• Concurrent software debugging is hard

• Gets harder as the degree of parallelism in applications increases

– Node level: Message Passing Interface (MPI)

– Core level: Threads, OpenMPI, CUDA

• Hybrid programming will be the future

– MPI + Threads

HPC Apps

– MPI + OpenMP

– MPI + CUDA

HPC Correctness Tools

• Yet tools are lagging behind!

– Many tools cannot operate at scale and give measurable coverage

High-end Debugging Methods are often Expensive, Inconclusive

• Expensive machines, resources

– $3M electricity a year (megawatt)

– $1B to install hardware

– Months of planning to get runtime on cluster

• Debugging tools/methods are primitive

– Extreme-Scale goal unrealistic w/o better approaches

• Inadequate attention from “CS”

– Little/no Formal Software Engineering methods

– Almost zero critical mass

Importance of Message Passing in HPC

(MPI)

• Born ~1994

• The world’s fastest CPU ran at 68 MHz

• The Internet had 600 sites then!

• Java was still not around

• Still dominant in 2011

– Large investments in applications, tooling support

• Credible FV research in HPC must include MPI

– Use of message passing is growing

• Erlang, actor languages, MCAPI, .NET async … (not yet for HPC)

• Streams in CUDA, Queues in OpenCL,…

Trend: Hybrid Concurrency

Problem Solving

Environment based

User Applications

Problem-Solving

Environments e.g. Uintah, Charm++,

ADLB

Monolith

Large-scale

MPI-based

User

Applications

High Performance

MPI Libraries

Concurrent

Data Structures

Infiniband style interconnect

Sandybridge (courtesy anandtech.com) AMD Fusion APU

Geoforce GTX 480 (Nvidia) 15

MPI Verification approach depends on type of determinism

• Execution Deterministic

– Basically one computation per input data

• Value Deterministic

– Multiple computations, but yield same “final answer”

• Nondeterministic

– Basically reactive programs built around message passing, possibly also using threads

Examples to follow

An example of parallelizing matrix multiplication using message passing

X

MPI_Send

MPI_Recv


X

MPI_Bcast

MPI_Send

MPI_Recv


X

MPI_Bcast

MPI_Send

19

MPI_Send

MPI_Recv


MPI_Recv

X =

MPI_Bcast

MPI_Send

20

Unoptimized Initial Version : Execution Deterministic

MPI_Recv (from: P0, P1, P2, P3…) ;

Send Next Row to

First Slave which

By now must be free

MPI_Send

21

Later Optimized Version : Value Deterministic

Opportunistically Send Work to Processor Who Finishes first

MPI_Recv ( from : * )

OR

Send Next Row to

First Worker that returns the answer!

MPI_Send

22

Still More Optimized Value-Deterministic versions:

Communications are made Non-blocking, and Software Pipelined

(still expected to remain value-deterministic )

MPI_Recv ( from : * )

OR

Send Next Row to

First Worker that returns the answer!

MPI_Send

23

Typical MPI Programs

• Value-Nondeterministic MPI programs do exist

– Adaptive Dynamic Load Balancing Libraries

• But most are value deterministic or execution deterministic

– Of course, one does not really know w/o analysis!

• Detect replay non-determinism over schedule space

– Races can creep into MPI programs

• Forgetting to Wait for MPI non-blocking calls to finish

– Floating point can make things non-deterministic

Gist of bug-hunting story

• MPI programs “die by the bite of a thousand mosquitoes”

– No major vulnerabilities one can focus on

• E.g. in Thread Programming, focusing on races

• With MPI, we need comprehensive “Bug Monitors”

• Building MPI bug monitors requires collaboration

– Lucky to have collaborations with DOE labs

– The lack of FV critical mass hurts

A real-world bug

P0

---

Send( (rank+1)%N );

P1

---

Send( (rank+1)%N );

P2

---

Send( (rank+1)%N );

Recv( (rank-1)%N ); Recv( (rank-1)%N );

• Expected “circular” msg passing

• Found that P0’s Recv entirely vanished !!

• REASON : ??

Recv( (rank-1)%N );

– In C, -1 % N is not N-1 but rather -1 itself

– In MPI, “-1” Is MPI_PROC_NULL

– Recv posted on MPI_PROC_NULL is ignored !

A real-world bug

P0

---

Send( (rank+1)%N );

P1

---

Send( (rank+1)%N );

P2

---

Send( (rank+1)%N );

Recv( (rank-1)%N ); Recv( (rank-1)%N );

• Expected “circular” msg passing

• Found that P0’s Recv entirely vanished !!

• REASON : ??

Recv( (rank-1)%N );

– In C, -1 % N is not N-1, but -1

– In MPI, “-1” Is MPI_PROC_NULL

– Recv posted on MPI_PROC_NULL is ignored !

MPI Bugs – more anecdotal evidence

• Bug encountered at large scale w.r.t. famous MPI library (Vo)

– Bug was absent at a smaller scale

– It was a concurrency bug

• Attempt to implement collective communication (Thakur)

– Bug exists for ranges of size parameters

• Wrong assumption: that MPI barrier was irrelevant (Siegel)

– It was not – a communication race was created

• Other common bugs (we see it a lot; potentially concurrency dep.)

– Forgetting to wait for non-blocking receive to finish

– Forgetting to free up communicators and type objects

• Some codes may be considered buggy if non-determinism arises!

– Use of MPI_Recv(*) often does not result in non-deterministic execution

– Need something more than “superficial inspection”

Real bug stories in the MPI-land

• Typing a[i][i] = init instead of a[i][j] = init

• Communication races

– Unintended send matches “wildcard receive”

• Bugs that show up when ported

– Runtime buffering changes; deadlocks erupt

– Sometimes, bugs show up when buffering added!

• Misunderstood “Collective” semantics

– Broadcast does not have “barrier” semantics

• MPI + threads

– Royal troubles await the newbies

Our Research Agenda in HPC

• Solve FV of Pure MPI Applications “well”

– Progress in non-determinism coverage for fixed test harness

– MUST integrate with good error monitors

• (Preliminary) Work on hybrid MPI + Something

– Something = Pthreads and CUDA so far

– Evaluated heuristics for deterministic replay of Pthreads + MPI

• Work on CUDA/OpenCL Analysis

– Good progress on Symbolic Static Analyzer for CUDA Kernels

– (Prelim.) progress on Symbolic Test Generator for CUDA Pgms

• (Future) Symbolic Test Generation to “crash” hybrid pgms

– Finding lurking crashes may be a communicable value proposition

• (Future) Intelligent schedule-space exploration

– Focus on non-monolithic MPI programs

Motivation for Coverage of

Communication Nondeterminism

Eliminating wasted search in message passing verif.

P0

---

MPI_Send(to P1…);

MPI_Send(to P1, data=22);

P1

---

MPI_Recv(from P0…);

MPI_Recv(from P2…);

MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x);

P2

---

MPI_Send(to P1…);

MPI_Send(to P1, data=33);

32

A frequently followed approach: “boil the whole schedule space” – often very wasteful

@InProceedings{PADTAD2006:JitterBug, author = {Richard Vuduc and Martin Schulz and Dan Quinlan and Bronis de

Supinski and Andreas S{\ae}bj{\"o}rnsen}, title = {Improving distributed memory applications testing by message perturbation}, booktitle = {Proc.~4th Parallel and Distributed Testing and Debugging (PADTAD)

Workshop, at the International Symposium on Software Testing and Analysis}, address = {Portland, ME, USA}, month = {July},

} year = {2006}

33

Eliminating wasted work in message passing verif.

P0

---

No need to play with schedules of deterministic actions

P1

---

P2

---

MPI_Send(to P1…); MPI_Recv(from P0…); MPI_Send(to P1…);

MPI_Send(to P1, data=22); MPI_Recv(from P2…); MPI_Send(to P1, data=33);

MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x);

But consider these two cases…

34

Need to detect

Resource Dependent Bugs

Example of Resource Dependent Bug

P0

Send(to:1);

Recv(from:1);

P1

Send(to:0);

Recv(from:0);

We know that this program with lesser Send buffering may deadlock

36


P0

Send(to:1);

Recv(from:1);

P1

Send(to:0);

Recv(from:0);

… and this program with more Send buffering may avoid a deadlock

37


P0

Send(to:1);

Send(to:2);

P1

Send(to:2);

Recv(from:0);

P2

Recv(from:*);

Recv(from:0);

… But this program deadlocks if Send(to:1) has more buffering !

38


P0

Send(to:1);

Send(to:2);

P1

Send(to:2);

Recv(from:0);

P2

Recv(from:*);

Recv(from:0);


39


P0

Send(to:1);

Send(to:2);

P1

Send(to:2);

Recv(from:0);

P2

Recv(from:*);

Recv(from:0);


40


P0

Send(to:1);

Send(to:2);

P1

Send(to:2);

Recv(from:0);

P2

Recv(from:*);

Recv(from:0);


41


P0

Send(to:1);

Send(to:2);

P1

Send(to:2);

Recv(from:0);

P2

Mismatched – hence a deadlock

Recv(from:*);

Recv(from:0);


42

Widely Publicized Misunderstandings

“ ”Your program is deadlock free if you have successfully tested it under zero buffering”

43

MPI at fault?

• Perhaps partly

– Over 17 years of MPI, things have changed

– Inevitable use of shared memory cores, GPUs, …

– Yet, many of the issues seem fundamental to

• Need for wide adoption across problems, languages, machines

• Need to give programmer better handle on resource usage

• How to evolve out of MPI?

– Whom do we trust to reset the world?

– Will they get it any better?

– What about the train-wreck meanwhile?

• Must one completely evolve out of MPI?

44

Our Impact So Far

Problem Solving

Environment based

User Applications

Problem-Solving

Environments e.g. Uintah, Charm++,

ADLB

Monolith

Large-scale

MPI-based

User

Applications

High Performance

MPI Libraries

Concurrent

Data Structures

Infiniband style interconnect

ISP and

DAMPI

Useful formalizations to help test these

PUG and

GKLEE

45 Sandybridge (courtesy anandtech.com) AMD Fusion APU

Geoforce GTX 480 (Nvidia)

Outline for L1

• Dynamic formal verification of MPI

– It is basically testing which discovers all alternate schedules

• Coverage of communication non-determinism

– Also gives us a “predictive theory” of MPI behavior

– Centralized approach : ISP

– GEM: Tool Integration within Eclipse Parallel Tools

Platform

– Demo of GEM

46

A Simple MPI Example

Process P0 Process P1 Process P2

Isend(1, req) ; Irecv(*, req) ; Barrier ;

Barrier ; Barrier ; Isend(1, req) ;

Wait(req) ; Wait(req) ; Recv(2) ;

Wait(req) ;

47





Wait(req) ; Recv(2) ; Wait(req) ;

Wait(req) ;

• Non-blocking Send – send lasts from Isend to Wait

• Send buffer can be reclaimed only after Wait clears

• Forgetting to issue Wait  MPI “request object” leak

48





Wait(req) ; Wait(req) ; Recv(2) ;

Wait(req) ;

49






Wait(req) ;

• Non-blocking Receive – lasts from Irecv to Wait

• Recv buffer can be examined only after Wait clears

• Forgetting to issue Wait  MPI “request object” leak

50






Wait(req) ;

• Blocking receive in the middle

• Equivalent to Irecv followed by its Wait

• The data fetched by Recv(2) is available before that of Irecv

51






Wait(req) ;

• Since P0’s Isend and Irecv can be “in flight”, the barrier can be crossed

• This allows P2’s Isend to race with P0’s Isend, and match Irecv(*)

52

Testing Challenges

Process P0 Process P2



Wait(req) ;

Process P1

Recv(2) ; Wait(req) ;

Wait(req) ;

• Traditional testing methods may reveal only P0->P1 or P2->P1

• P2->P1 may happen after the code is ported

• Our tools ISP and DAMPI automatically discover and run both tests, regardless of the execution platform

53

MPI

Program

Interposition

Layer

Run

Flow of ISP

Executable

Proc

1

Proc

2

……

Proc n

Scheduler

MPI Runtime

 Scheduler intercepts MPI calls

 Reorders and/or rewrites the actual calls going into the MPI Runtime

 Discovers maximal non-determinism ; plays through all choices

54

ISP Scheduler Actions (animation)

P0 P1 P2 sendNext

Isend(1, req) Irecv(*, req) Barrier

Barrier Barrier Isend(1, req)

Wait(req) Recv(2)

Wait(req)

Wait(req)

MPI Runtime

Scheduler

Isend(1)

Barrier

55


P0 P1 P2

Isend(1, req) Irecv(*, req) sendNext

Barrier

Barrier Barrier Isend(1, req)

Wait(req) Recv(2)

Wait(req)

Wait(req)

MPI Runtime

Scheduler

Isend(1)

Barrier

Irecv(*)

Barrier

56


P0 P1 P2

Isend(1, req)

Barrier

Irecv(*, req)

Barrier

Barrier

Barrier

Barrier

Barrier

Isend(1, req)

Scheduler

Isend(1)

Barrier

Irecv(*)

Barrier

Wait(req) Wait(req) Recv(2)

Wait(req)

Barrier

MPI Runtime

57


P0 P1 P2

Isend(1, req)

Barrier

Wait(req)

Irecv(*, req)

Barrier

Recv(2)

Irecv(2)

Barrier

Isend

No

Match-Set

Isend(1, req)

Wait(req)

SendNext

Scheduler

Isend(1)

Barrier

Wait (req)

Irecv(*)

Barrier

Recv(2)

Wait(req)

Barrier

Isend(1)

Wait (req)

MPI Runtime

58

Formalization of MPI Behavior to build Formal Dynamic Verifier

(verification scheduler)

Formal Modeling of MPI Calls

• MPI calls are modeled in terms of four salient events

– Call issued

• All calls are issued in program order

– Call returns

• The code after the call can now be executed

– Call matches

• An event that marks the call committing

– Call completes

• All resources associated with the call are freed

60

The Matches-Before Relation of MPI

1.

Isend(to: Proc k, …);

…

Isend(to: Proc k, …)

2.

Irecv(from: Proc k, …);

…

Irecv(from: Proc k, …)

3.

Irecv(from: Proc *, …);

…

Irecv(from: Proc k, …)

4.

Irecv(from: Proc j, …);

…

Irecv(from: Proc *, …)

Conditional

Matches before

5.

Isend( &h );

…

Wait( h );

6.

Wait(..);

…

AnyMPIOp;

7.

Barrier(..);

…

AnyMPIOp;

61

The ISP Scheduler

• Pick a process Pi and its instrn Op at PC n

• If Op does not have an unmatched ancestor according to

MB, then collect Op into Scheduler’s Reorder Buffer

– Stay with Pi, Increment n

• Else Switch to Pi+1 until all Pi are “fenced”

– “fenced” means all current Ops have unmatched ancestors

• Form Match Sets according to priority order

– If Match Set is {s1, s2, .. sK} + R(*), cover all cases using stateless replay

• Issue an eligible set of Match Set Ops into the MPI runtime

• Repeat until all processes are finalized or error encountered

Theorem: This Scheduling Method achieves ND-Coverage in MPI programs!

62

How MB helps predict outcomes

Will this single-process example called “Auto-send” deadlock ?

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

63



64

The MB



65


Scheduler’s

Reorder

Buffer

Collect R(from:0, h1)

R(from:0, h1)


The MPI

Runtime

66


Scheduler’s

Reorder

Buffer

Collect B

R(from:0, h1)

B


The MPI

Runtime

67


Scheduler’s

Reorder

Buffer

P0 is Fenced; Form Match Set { B } and

Send it to the MPI Runtime !

R(from:0, h1)

B


The MPI

Runtime

68


Scheduler’s

Reorder

Buffer

Collect S(to:0, h2)


R(from:0, h1)

B

S(to:0, h2)

The MPI

Runtime

69


Scheduler’s

Reorder

Buffer

P0 is fenced. So form the {R,S} match-set. Fire!


R(from:0, h1)

B

S(to:0, h2)

The MPI

Runtime

70


Scheduler’s

Reorder

Buffer

Collect W(h1)


R(from:0, h1)

B

S(to:0, h2)

W(h1)

The MPI

Runtime

71


Scheduler’s

Reorder

Buffer

P0 is Fenced. So form {W} match set. Fire!


R(from:0, h1)

B

S(to:0, h2)

W(h1)

The MPI

Runtime

72


Scheduler’s

Reorder

Buffer

Collect W(h2)

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2) ;

R(from:0, h1)

B

S(to:0, h2)

W(h1)

W(h2)

The MPI

Runtime

73


Scheduler’s

Reorder

Buffer

Fire W(h2)! Program finishes w/o deadlocks!


R(from:0, h1)

B

S(to:0, h2)

W(h1)

W(h2)

The MPI

Runtime

74

Outline for L2

• Dynamic formal verification of MPI

– It is basically testing which discovers all alternate schedules

• Coverage of communication non-determinism

– Also gives us a “predictive theory” of MPI behavior

– Centralized approach : ISP

– GEM: Tool Integration within Eclipse Parallel Tools Platform

• DEMO OF GEM

– Distributed approach : DAMPI

75

Dynamic Verification at Scale

Why do we need a Distributed

Analyzer for MPI Programs (DAMPI)?

• The ROB and the MB graph can get large

– Limit of ISP: “32 Parmetis (15 K LOC) processes”

• We need to dynamically verify real apps

– Large data sets, 1000s of processes

– High-end bugs often masked when downscaling

– Downscaling is often IMPOSSIBLE in practice

– ISP is too sequential! Employ parallelism of a large Cluster!

• What do we give up?

– We can’t do “what if” reasoning

• What if a PARTICULAR Isend has infinite buffering?

– We may have to give up precision for scalability

77

MPI Program

DAMPI - PnMPI modules

DAMPI Framework

Alternate

Matches

Executable

Proc

1

Proc

2

……

Proc n

Rerun

Epoch

Decisions

Schedule

Generator

DAMPI – Distributed Analyzer for MPI

MPI runtime

Distributed Causality Tracking in DAMPI

• Alternate matches are

– co-enabled and concurrent actions

– that can match according to MPI’s matching rules

• DAMPI performs an initial run of the MPI program

• Discovers alternative matches

• We have developed an MPI-specific Sparse Logical

Clock Tracking

– Vector Clocks (no omissions, less scalable)

– Lamport Clocks (no omissions in practice, more scalable)

• We have gone up to 1000 processes (10K seems possib)

P0

DAMPI uses Lamport clock to build

Happens-Before relationships

0 S(P

1

)

0 1

P1

R

1

(*)

0

P2

S(P

1

)

• Use Lamport clock to track Happens-Before

– Sparse Lamport Clocks – only “count” non-det events

– MPI MB relation is “baked” into clock-update rules

– Increases it after completion of MPI_Recv (ANY_SOURCE)

– Nested blocking / non-blocking operations handled

– Compare incoming clock to detect potential matches

How we use Happens-Before relationship to detect alternative matches

0 S(P

1

)

P0

0 1

P1

R

1

(*)

0

P2

S(P

1

)

• Question: could P

2

’s send have matched P

1

’s recv?

• Answer: Yes!

• Earliest Message with lower clock value than the current process clock is an eligible match

DAMPI Algorithm Review:

(1) Discover Alternative matches during initial run

P0

P1

P2

0 S(P

1

)

0

0

R

1

(*)

S(P

1

)

1

DAMPI Algorithm Review:

(2) Force alternative matches during replay

P0

P1

P2

0 S(P

1

)

0

0

R

1

(2)

S(P

1

)

X 1

DAMPI maintains very good scalability vs ISP

ParMETIS-3.1 (no wildcard)

900

800

700

600

500

400

300

200

100

0

4 8 16 32

ISP

DAMPI

Number of tasks

DAMPI is also faster at processing interleavings

Matrix Multiplication with Wildcard Receives

8000

7000

6000

5000

4000

3000

2000

1000

0

250 500

Number of Interleavings

750 1000

ISP

DAMPI

0

1,5

1

0,5

2

2,5

Results on large applications:

SpecMPI2007 and NAS-PB

Slowdown is for one interleaving

No replay was necessary

Base 1024 processes

DAMPI 1024 processes

The Devil in the Detail of DAMPI

• Textbook Lamport/Vector Clocks Inapplicable

– “Clocking” all events won’t scale

• So “clock” only ND-events

– When do non-blocking calls finish ?

• Can’t “peek” inside MPI runtime

• Don’t want to impede with execution

• So have to infer when they finish

– Later blocking operations can force Matches Before

• So need to adjust clocks when blocking operations happen

• Our contributions: Sparse VC and Sparse LC algos

• Handing real corner cases in MPI

– Synchronous sends “learning” about non-blocking Recv start!

Questions from Monday’s Lecture

• What bugs are caught by ISP / DAMPI

– Deadlocks

– MPI Resource Leaks

• Forgotten dealloc of MPI communicators, MPI type objects

• Forgotten Wait after Isend or Irecv (request objects leaked)

• C assert statements (safety checks)

• What have we done wrt the correctness of our MPI verifier ISP’s algorithms?

– Testing + paper/pencil proof using standard ample-set based proof methods

– Brief look at the formal transition system of MPI, and ISP’s transition system

• Why have HPC folks not built a happens-before model for APIs such as MPI?

– HPC folks are grappling with many challenges and actually doing a great job

– There is a lack of “computational thinking” in HPC that must be addressed

• Non-CS background often

• See study by Jan Westerholm in EuroMPI 2010

– CS folks must step forward to help

• Why this does not naturally happen: “Numerical Methods” not popular in core CS

• There isn’t a clearly discernible “HPC industry”

– Wider use of GPUs, Physics based gaming, … can help push CS toward HPC

– Mobile devices will use CPUs + GPUs and do “CS problems” and “HPC problems” in a unified setting (e.g. face recognition, …)

• Our next two topics (PUG and XUM) touch on our attempts in this area

Outline for L3

• Formal analysis methods for accelerators/GPUs

– Same thing

– In future, there may be a large-scale mish-mash of CPUs and

GPUs

– CPUs also will have mini GPUs, SIMD units, …

• Again, revisit the Borkar / Chien article

• Regardless..

– It looks a lot unlike traditional Pthreads/Java/C# threading

– We would like to explore how to debug these kinds of codes efficiently, and help designers explore their design space while root-causing bugs quickly

89

Why develop FV methods for CUDA?

• GPUs are key enablers of HPC

– Many of the top 10 machines are GPU based

– I found the presentation by Paul Lindberg eye-opening http://www.youtube.com/watch?v=vj6A8AKVIuI

• Interesting debugging challenges

– Races

– Barrier mismatches

– Bank conflicts

– Asynchronous MPI / CUDA interactions

90

What are GPUs (aka Accelerators)?

AMD Fusion APU

OpenCL Compute (courtesy Khronos group)

Geoforce

GTX 480

(Nvidia)

Sandybridge (courtesy anandtech.com)

• Three of world’s Top-5 Supercomputers built using GPUs

• World’s greenest supercomputers also

GPU based

• CPU/GPU integration is a clear trend

• Hand-held devices (iPhone) will also use them

91

Contrast between CPUs and GPUs

Example: Increment Array Elements

CPU program CUDA program

} void inc_cpu(float* a, float b, int N) { for (int idx = 0; idx<N; idx++) a[idx] = a[idx] + b;

}

__global__ void inc_gpu(float* A, float b, int N) { int tid = blockIdx.x* blockDim.x+ threadIdx.x; if ( tid < N)

A[t id ] = A[t id ] + b;

} void main() {

.....

increment_cpu(a, b, N); voidmain() {

…..

dim3dimBlock (blocksize); dim3dimGrid( ceil( N / (float)blocksize) );

} increment_gpu<<<dimGrid, dimBlock>>>(a, b,

N);

Contrast between CPUs and GPUs

Example: Increment Array Elements

CPU program CUDA program

Fine-grained threads scheduled to run like this:

Tid = 0, 1, 2, 3, …

} void inc_cpu(float* a, float b, int N) { for (int idx = 0; idx<N; idx++) a[idx] = a[idx] + b;

}

__global__ void inc_gpu(float* A, float b, int N) { int tid = blockIdx.x* blockDim.x+ threadIdx.x; if ( tid < N)

A[t id ] = A[t id ] + b;

} void main() {

.....

increment_cpu(a, b, N); voidmain() {

…..

dim3dimBlock (blocksize); dim3dimGrid( ceil( N / (float)blocksize) );

} increment_gpu<<<dimGrid, dimBlock>>>(a, b,

N);

Why heed Many-core / GPU Compute?

EAGAN, Minn., May 3, 1991 (John Markoff, NY Times)

The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros"

-- compact, extremely fast work stations that sell for less than

$100,000.

Take-away : Clayton Christensen’s “Disruptive Innovation”

GPU is a disruptive technology!

94

Other Reasons to study GPU Compute

GPUs offer an eminent setting in which to study heterogeneous CPU organizations and memory hierarchies

95

What bugs caught? What value?

• GPU hardware still stabilizing

– Characterize GPU Hardware Formally

• Currently, program behavior may change with platform

– Micro benchmarking sheds further light

• www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

• The software is the real issue!

96

GPU Software Bugs / Difficulties

• SIMD model, unfamiliar

– Synchronization, races, deadlocks, … all different!

• Machine constants, program assumptions about problem parameters affect correctness / performance

– GPU “kernel functions” may fail or perform poorly

• Formal reasoning can help identify performance pitfalls

• Are still young, so emulators and debuggers may not match hardware behavior

• Multiple memory sub-systems are involved, further complicating semantics / performance

97

Approaches/Tools for GPU SW Verif.

• Instrumented Dynamic Verification on GPU Emulators

– Boyer et al, STMCS 2008

• Grace : Combined Static / Dynamic Analysis

– Zheng et al, PPoPP 2011

• PUG : An SMT based Static Analyzer

– Lee and Gopalakrishnan, FSE 2010

• GKLEE : An SMT based test generator

– Lee, Rajan, Ghosh, and Gopalakrishnan (under submission)

98

Planned Overall Workflow of PUG

Realized

99

Workflow and Results from PUG

100

Examples of a Data Race

Increment N-element vector A by scalar b tid 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

A

A[0]+b

}

__global__ void inc_gpu( float*A, float b, int N) { int tid = blockIdx.x * blockDim.x + threadIdx.x; if ( tid < N)

A [tid ] = A[ tid – 1 ] + b;

...

A[15]+b



A

A[0]+b ...

A[15]+b

}

__global__ void inc_gpu( float*A, float b, int N) { int tid = blockIdx.x * blockDim.x +

(and (and (/= t1.x t2.x)

……

(or (and (bv-lt idx1@t1 N0)

(bv-lt idx1@t2 N0) threadIdx.x; if ( tid < N)

Encoding for

Read-Write Race

(= (bv-sub idx1@t1 0b0000000001) idx1@t2))

(and (bv-lt idx1@t1 N0)

A [tid ] = A [tid – 1 ] + b;

(bv-lt idx1@t2 N0)

(= idx1@t1 idx1@t2))))

Encoding for Write-Write Race



A

A[0]+b ...

A[15]+b

}

__global__ void inc_gpu( float*A, float b, int N) { int tid = blockIdx.x * blockDim.x +

(and (and (/= t1.x t2.x)

……

(or (and (bv-lt idx1@t1 N0)

(bv-lt idx1@t2 N0) threadIdx.x; if ( tid < N)

Encoding for

Read-Write Race

(= (bv-sub idx1@t1 0b0000000001) idx1@t2))

(and (bv-lt idx1@t1 N0)

A [tid ] = A [tid – 1 ] + b;

(bv-lt idx1@t2 N0)

(= idx1@t1 idx1@t2))))

Encoding for Write-Write Race

Real Example with Race

__global__ void computeKernel(int *d_in,int *d_out, int *d_sum) { d_out[threadIdx.x] = 0; for (int i=0; i<SIZE/BLOCKSIZE; i++) { d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE+threadIdx.x],6); }

__syncthreads(); if(threadIdx.x%2==0) { for(int i=0; i<SIZE/BLOCKSIZE; i++) { d_out [threadIdx.x

+SIZE/BLOCKSIZE*i]+=d_out[threadIdx.x+SIZE/BLOCKSIZE*i+1];

...

/* The counter example given by PUG is : t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0

Threads t1 and t2 are in iterations 1 and 0 respectively t1 generates access d_out[ 2 + 8 * 1 ] += d_out[ 10 + 8 * 0 + 1] i.e. d[10] += d[11] t2 generates this: d_out[ 10 + 8 * 0 ] += d_out [ 10 + 8 * 0 + 1] i.e. d[10] += d[11]

*/

This causes a real race on writes to d[10] because of the += done by the threads

Real Example with Race

__global__ void computeKernel(int *d_in,int *d_out, int *d_sum) { d_out[threadIdx.x] = 0; for (int i=0; i<SIZE/BLOCKSIZE; i++) { d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE+threadIdx.x],6); }

__syncthreads(); if(threadIdx.x%2==0) { for(int i=0; i<SIZE/BLOCKSIZE; i++) { d_out[ threadIdx.x

+SIZE/BLOCKSIZE*i]+=d_out[threadIdx.x+SIZE/BLOCKSIZE*i+1];

...

/* The counter example given by PUG is : t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0

Threads t1 and t2 are in iterations 1 and 0 respectively t1 generates access d_out[ 2 + 8 * 1 ] += d_out[ 10 + 8 * 0 + 1] i.e. d[10] += d[11] t2 generates this: d_out[ 10 + 8 * 0 ] += d_out [ 10 + 8 * 0 + 1] i.e. d[10] += d[11]

*/

This causes a real race on writes to d[10] because of the += done by the threads

PUG Divides and Conquers Each Barrier Interval t

1 t

2 a1

BI1 a2 a1 a2 p a6 a3

 p a4

BI3 p a5 a6 a3

 p a4 a5

BI1 is conflict free iff:

 t

1

,t

2

: a

1 t1 , a

2 t1 , a

1 t2 and a

2 t2 do not conflict with each other

BI 3 is conflict free iff the following accesses do not conflict for all t1,t2 : a a t1

3

3 t2

,

 p : a

,

 p : a t1

4

4 t2

,

 p : a

,

 p : a t1,

5

5 t2

PUG Results (FSE 2010)

Kernels (in

CUDA SDK )

Bitonic Sort

MatrixMult

Histogram64

Sobel

Reduction

Scan

Scan Large

Nbody

Bisect Large

Radix Sort

Eigenvalues loc +O +C +R B.C.

Time

(pass)

65

102 *

136

130 *

315

*

HIGH 2.2s

HIGH <1s

LOW 2.9s

HIGH 5.6s

HIGH 3.4s

255 *

237 *

206 *

1400 *

1150 *

2300 *

*

*

*

*

*

*

*

*

LOW

LOW

3.5s

5.7s

HIGH 7.4s

HIGH 44s

LOW 39s

HIGH 68s

+ O: require no bit-vector overflowing

+C: require constraints on the input values

+R: require manual refinement

B.C.: measure how serious the bank conflicts are

Time: SMT solving time

“GPU class” bugs

Tested over 57 assignment submissions

Defects

13 (23%)

Barrier Error or Race benign

3 fatal

2

Refinement over #kernel

17.5% over #loop

10.5%

Defects: Indicates how many kernels are not well parameterized, i.e. work only in certain configurations

Refinement: Measures how many loops need automatic refinement

(by PUG).

108

PUG’s Strengths and Weaknesses

• Strengths:

– SMT-based Incisive SA avoids interleaving explosion

– Still obtains coverage guarantees

– Good for GPU Library FV

• Weaknesses:

– Engineering effort : C++, Templates, Breaks, ..

– SMT “explosion” for value correctness

– Does not help test the code on the actual hardware

– False alarms requires manual intervention

• Handling loops, Kernel calling contexts

109

Thorough internal documentation of PUG is available

• http://www.cs.utah.edu/fv/mediawiki/index.php/PUG

• One recent extension (ask me for our EC2 2011 paper)

– Symbolic analysis to detect occurrences of noncoalesced memory accesses

110

Two directions in progress

• Parameterized Verification

– See Li’s dissertation for initial results

• Formal Testing

– Brand new code-base called GKLEE

• Joint work with Fujitsu Research

111

GKLEE: Formal Testing of GPU Kernels

(joint work with Li, Rajan, Ghosh of Fujitsu)

112

Evaluation of GKLEE vs. PUG

Execution time of PUG and GKLEE on kernels from the CUDA SDK for functional correctness. n is the number of threads. T.O. means > 5 min.

Kernels

Simple

Reduction

Matrix

Transpose

Bitonic Sort

Scan Large

PUG

2.8

n = 4

GKLEE

< 0.1 (< 0.1)

PUG n = 16

GKLEE n = 64

GKLEE n = 256

GKLEE n = 1024

GKLEE

T.O.

< 0.1 (< 0.1) < 0.1 (< 0.1) 0.2 (0.3) 2.3 (2.9)

1.9

3.7

▬

< 0.1 (< 0.1)

0.9 (1)

< 0.1 (< 0.1)

T.O.

T.O.

▬

< 0.1 (0.3)

T.O.

< 0.1 (< 0.1)

< 0.1 (3.2) < 0.1 (63) 0.9 (T.O.)

T.O.

0.1 (0.2)

T.O.

1.6 (3)

T.O.

22 (51)

113

Evaluation of GKLEE’s Test Coverage

Test generation results. Traditional metrics sc cov.

and bc cov.

give source code and bytecode coverage respectively. Refined metrics avg.cov

t max.cov

t and measure the average and maximum coverage of all threads.

All coverages were far better than achieved through random testing.

Kernels exec. time

1s Bitonic Sort

Merge Sort

Word Search

Radix Sort

Suffix Tree

Match sc cov.

100% /

100%

100% /

100%

100% /

92%

100% /

91%

100% /

90% bc cov.

min #test Avg. Cov t max. Cov t

51% /

44%

70% /

50%

34% /

25%

54% /

35%

38% /

35%

5

6

2

6

8

78% /

76%

88% /

80%

100% /

81%

91% /

68%

90% /

70%

100% /

94%

100% /

95%

100% /

85%

100% /

75%

100% /

82%

0.5s

0.1s

5s

12s

114

Outline for L4

• A brief introduction to MCAPI, MRAPI, MTAPI

• XUM : an eXtensible Utah Multicore system

• What are we able to learn through building a hardware realization of a custom multicore

– How can we push this direction forward?

– Any collaborations?

• Clearly there are others who do this full-time

• This has been a side-project (by two very good students, albeit) in our group

• So we would like to build on others’ creations also…

115

Multicore Association APIs

• http://www.multicore-association.org

• Reason for our interest in the MCA APIs

– Our project through the Semiconductor Research Association

– Collaborator in dynamic verification of MCAPI Applications

• Prof. Eric Mercer (BYU, Utah)

– The BYU team also has developed a formal spec for MCAPI and

MRAPI, and built golden executable models from these specs

• XUM

– A Utah project involving two students

• MS project of Ben Meakin

• BS project of Grant Ayers

– An attempt to support MCAPI functions in HW+SW

– (Later) hoping to support MRAPI

116

(Picture courtesy of Multicore Association)

MCAPI

• A facility to interconnect heterogeneous embedded multicore systems/chips

• These systems could be very minimalistic

• No OS, different OS, could be DSP, CPU, …

• Standardization (and revision) finished around 2009

• No widely used and portable communication APIs in this space

• Currently two commercial implementations

• Mentor’s Open MCAPI

• Polycore’s messenger

• XUM is the only hardware-assisted implementation of the communication API

117

MCAPI Calls, Use Cases, Expectations

• Lead figures in MCAPI standardization (so far as our interactions go)

– Jim Holt (Freescale), Sven Brehmer (Polycore), Markus Levy (Multicore Association)

• “End-points” are connected

– Each end-point could be a thread, a process, ..

– Blocking and non-blocking communication support

• MCAPI_Send, MCAPI_Send_I, …

• Waits, Tests

• No barriers (in the API); one could implement

• Create end-points (a collective call)

• Use cases

– Present use-cases are in C/Pthreads, with each thread performing MCAPI calls to communicate

– Very reminiscent of monolithic-style MPI programs (with all its drawbacks)

• General Expectation

– That MCAPI will be used as a standard transport layer with respect to which one may implement higher abstractions

– One project: Chapman (Univ Houston) work on realizing OpenMP

– Other suggested: Task-graph (or other) higher level abstraction to specify computations, with

“smart runtime” employing MCAPI

118

MCAPI Tool Support

• Currently no ‘formal’ debugging tools

– Not enough case studies yet (projects underway in Prof. Alper Sen’s group)

• MCC: An MCAPI Correctness Checker

– Subodh Sharma (PhD student)

– Borrows from the dynamic verification tool design of ISP (MPI checker) and

Inspect (our Pthreads checker)

– Dynamic verification against existing MCAPI libraries

– MCC incurs new headaches

• Hybrid Pthreads/MCAPI behaviors

• Deterministic replay of schedules often difficult

– Our present conclusion:

• Don’t go there!

• We know that dynamic verification of hybrid concurrent programs is a royal pain!

– Waiting for higher abstractions / better practices to emerge in the area

• BYU projects on model checking using an MCAPI Golden Executable Model

– Main difference is that they rely on an MCAPI operational semantics whereas we capitalize on behavior (of MCAPI library)

119

MRAPI

• Portable Resource API

– Portable Mutexes, Mallocs

– Portable varieties of software managed shared memory, DMA

– Pthreads and Unix facilities won’t do

• Not well-matched with requirements of heterogeneous multicores with disparate set of features / resources

• MTAPI standardization: yet to begin

• One possible usage of MCAPI + MRAPI:

– MCAPI Send call allocates buffer using MRAPI calls

– MCAPI Send happens

• say using XUM’s network, or MRAPI’s software DMA

– MRAPI calls free-up buffer

120

XUM is currently being prototyped on the popular XUP board

Two XUPV5-LX110T boards obtained courtesy of Xilinx Inc. !

XUM architecture

• 32-bit MIPS ISA compliant cores

• Request network is in-order dimensional order routed

• worm-hole flow control

• Each router unit arbitrates round-robin

• Datapath in the reqest n/w is 16-bits wide

• Ack n/w has broadcast and pt-to-pt transfers

• We can plug in I/O devices as if they are tiles

• All these exist as VHDL+Verilog mapped onto FPGAs

XUM architecture

• Current memory architecture :

• all tiles have memory ports leading to a FCFS arbiter that is backed by a pipelined DDR2 controller to an SDRAM

• All tiles can have their own clocks.

• About 4 physical clock sources. PLL primitives available (Xilinx tools).

• Additional clocks can be synthesized

• Currently 500 MHz for a flip-flop. 100 MHz realizable. Look at OpenSparc.

Recent XUM Achievements

• Built Bootloader (FPGA) and Protocol for downloading code images over RS-232

• XUM Memory Controller

– Built a fully functional DDR2-SDRAM memory controller that provides usable amounts of memory on-board

• Support for pipelined transfers

– Built a simple FCFS arbitrated memory controller (shared by all cores)

• All cores share same address space – no protection, but handy!

• Ported XUM MIPS cores to 32-bits, and debugged the CPUs

– Debugged the CPU some (more needed)

• Found errors in delay slot and forwarding logic…

– Would be a good test vehicle for pipelined CPU FV methods

Other features / status

More details

• 5-7 stage in-order pipeline.

• No branch predictor (will add). No speculation.

• Web documentation status: VHDL/Verilog code available

Software story:

• GCC would be usable

– Must add some new instructions such as load-immediate-upper

– In-line assembly for XUM instructions

• FPU is TBD – new student

RTOS story:

• FreeRTOS (e.g.) – compile using GCC -> download.

Further software story…

MCAPI and MRAPI realization

• BYU collaborator, Prof. Mercer, has MCAPI and MRAPI golden executable model (as formal state transition rules)

• Will compile these into detailed C implementations

Programming approach

• Not recommending straight coding using MCAPI /

MRAPI, as the code becomes a “rats nest” soon

• Will investigate compiling tasking primitives into a runtime that is supported by MCAPI and MRAPI

Other ideas?

XUM Details

• See Ben Meakin’s MS thesis

– Available from http://www.cs.utah.edu/fv

• The thesis provides:

– Code for send/receive

– Correctness properties of interest wrt XUM

• Good test vehicle for HW FV projects

– Memory footprint data

• Very parsimonious support for MCAPI possible

– Latency/throughput measurements on XUM

• Also, comparison with a Pthread-based baseline

XUM Communications Datapath

XUM ISA Extensions

• Send Header

• Send word

• Send tail

• Send ack

• Broadcast

• Receive ack

MCAPI Message Send

• Disable interrupts

• Asm(“sndhd.s …”)

• While (I < bufsize)

– Asm(“sndw …”)

• Asm(“sendtl ..”)

• Asm(“recack”..”)

• Enable interrupts

• Support for connectionless and connected MCAPI protocols

– The latter achieved by not issuing a tail-flit till connection needs to be closed

Further ideas / thoughts

• The embedded multicore space is likely very influential

– Enables development of hardware assist for new APIs and runtime mechanisms

– Even the HPC space may be influenced by design ideas percolating from below

– Dynamic formal verification tools may employ “hooks” into the hardware

• Avoids the “dirty tricks” we had to use in ISP to get control over the MPI runtime very indirectly

– Faking “Wait” operations, pre-issuing Waits to poke the MPI progress engine etc.

• In the end, we can sell what we can debug

• Time to market may be minimized through better FV / dynamic verification support provided by HW

• Great teaching tool

– If the FPGA design tool-chain is a bit kinder/gentler

– Projects such as Lava, Kiwi, .. (MSR) provide rays of hope…

Details of ISP and DAMPI

Work out crooked-barrier example on the board, assisted by a formal transition system

• for MPI,

• Then a transition system for the ISP centralized scheduler (as interposition layer)

• Then a transition system for DAMPI’s distributed scheduler (sparse Lamport-clock based)

• The formal transition systems clearly show how the native semantics of MPI has been “tamed” by specific scheduler implementations!

Concluding Remarks

Summary of the explorations of a group (esp. its advisor) in “mid-life crisis”, wanting to be relevant and wanting to be formal

(also wanting to be liked  )

In the end it was worth it

Must now skate to where the puck will be!