(1) Formal Verification for Message Passing and GPU Computing
(2) XUM: An Experimental Multicore supporting MCAPI
Ganesh Gopalakrishnan
School of Computing, University of Utah and
Center for Parallel Computing (CPU) http://www.cs.utah.edu/fv
• Take FM where it hasn’t gone before
– Only a handful of FV people are working in these crucially important domains
• Explore the space of concurrency based on message passing APIs
– A little bit of a mid-life crisis for an FV person learning which areas in SW design need help…
• L1: How to test message passing programs used in HPC
– Recognize the ubiquity of certain APIs in critical areas (e.g. MPI in HPC)
– With proper semantic characterization, we can formally understand / teach, and formally test
• No need to outright dismiss these APIs as “too hairy”
• (to the contrary) be able to realize fundamental issues that will be faced in any attempt along the same lines
– Long-lived APIs create a tool ecosystem around them that becomes harder to justify replacing
• What it takes to print a line in a “real setting”
– Need to build stack-walker to peel back profiling layers, and locate the actual source line
– Expensive – and pointless – to roll your own stack walker
• L2: How to test at scale
– The only practical way to detect communication non-determinism (in this domain)
– Can form the backbone of future large-scale replay-based debugging tools
• Realize that the multicore landscape is rapidly changing
– Accelerators (e.g. GPUs) are growing in use
– Multicore CPUs and GPUs will be integrated more tightly
– Energy is a first-rate currency
– Lessons learned from the embedded systems world are very relevant
• L3: Creating dedicated verification tools for GPU kernels
– How symbolic verification methods can be effectively used to analyze GPU kernel functions
– Status of tool and future directions
• L4: Designing an experimental message-passing multicore
– Implements an emerging message passing standard called MCAPI in silicon
– How the design of special instructions can help with fast messaging
– How features in the Network on Chips (NoC) can help support the semantics of MCAPI
• Community involvement in the creation of such tangible artifacts can be healthy
– Read “The Future of Microprocessors” in a recent CACM, by Shekhar Borkar and Andrew Chien
• Today
– MPI and dyn. FV
• Tomorrow
– GPU computing and FV
– XUM
• Terascale
• Petascale
• Exascale
– Molecular dynamics simulations
• Better drug design facilitated
– Sanbonmatsu et al, FSE 2010 keynote
• 290 days of simulation to simulate 2 million atom interactions over 2 nanoseconds
– Better “oil caps” can be designed if we have the right compute infrastructure
• Gropp, SC 2010 panel
Commonality among the different scales (figure): MPI, CUDA/OpenCL, OpenMP/Pthreads, and the Multicore Association APIs span high-end machines for HPC/cloud, desktop servers and compute servers, and embedded systems and devices. Also, “HPC” will increasingly go embedded.
11
• Concurrent software debugging is hard
• Gets harder as the degree of parallelism in applications increases
– Node level: Message Passing Interface (MPI)
– Core level: Threads, OpenMP, CUDA
• Hybrid programming will be the future
– MPI + Threads
– MPI + OpenMP
– MPI + CUDA
(Figure: the widening gap between HPC Apps and HPC Correctness Tools)
• Yet tools are lagging behind!
– Many tools cannot operate at scale and give measurable coverage
• Expensive machines, resources
– $3M of electricity a year (megawatt-scale power draw)
– $1B to install hardware
– Months of planning to get runtime on cluster
• Debugging tools/methods are primitive
– Extreme-Scale goal unrealistic w/o better approaches
• Inadequate attention from “CS”
– Little/no Formal Software Engineering methods
– Almost zero critical mass
• MPI: born ~1994
• The world’s fastest CPU ran at 68 MHz
• The Internet had 600 sites then!
• Java was still not around
• Still dominant in 2011
– Large investments in applications, tooling support
• Credible FV research in HPC must include MPI
– Use of message passing is growing
• Erlang, actor languages, MCAPI, .NET async … (not yet for HPC)
• Streams in CUDA, Queues in OpenCL,…
(Figure: the HPC software stack – problem-solving-environment based user applications (Problem-Solving Environments, e.g. Uintah, Charm++, ADLB) and monolithic large-scale MPI-based user applications, layered over high-performance MPI libraries and concurrent data structures on an InfiniBand-style interconnect. Hardware examples: Sandy Bridge (courtesy anandtech.com), AMD Fusion APU, GeForce GTX 480 (NVIDIA).)
• Execution Deterministic
– Basically one computation per input data
• Value Deterministic
– Multiple computations, but yield same “final answer”
• Nondeterministic
– Basically reactive programs built around message passing, possibly also using threads
Examples to follow
An example of parallelizing matrix multiplication using message passing (animation): the master broadcasts one matrix with MPI_Bcast, hands out rows of the other with MPI_Send, and the workers send their result rows back with MPI_Send, which the master collects with MPI_Recv.
Unoptimized initial version (execution deterministic): the master does MPI_Recv(from: P0, P1, P2, P3, …) in a fixed order, then MPI_Sends the next row to the first slave, which by now must be free.
21
Later optimized version (value deterministic): opportunistically send work to whichever processor finishes first – MPI_Recv(from: *), then MPI_Send the next row to the first worker that returns an answer.
22
Still more optimized value-deterministic versions: the communications are made non-blocking and software-pipelined (still expected to remain value deterministic) – MPI_Recv(from: *), then MPI_Send the next row to the first worker that returns an answer. (A sketch of this master loop follows.)
23
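A minimal sketch (in C with MPI) of the opportunistic, value-deterministic master loop just described. The helpers next_row and store_result and the constant ROW_LEN are hypothetical stand-ins for application code:

#include <mpi.h>

#define ROW_LEN 64                                   /* hypothetical row length */
void next_row(int i, double *row);                   /* hypothetical: produce row i of the input  */
void store_result(int i, const double *res);         /* hypothetical: record result row i         */

void master(int nrows, int nworkers) {
    MPI_Status st;
    double row[ROW_LEN], result[ROW_LEN];
    int sent = 0, received = 0;

    /* Prime each worker with one row. */
    for (int w = 1; w <= nworkers && sent < nrows; ++w, ++sent) {
        next_row(sent, row);
        MPI_Send(row, ROW_LEN, MPI_DOUBLE, w, sent, MPI_COMM_WORLD);
    }
    /* Opportunistically hand the next row to whichever worker answers first. */
    while (received < nrows) {
        MPI_Recv(result, ROW_LEN, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);               /* the wildcard receive        */
        store_result(st.MPI_TAG, result);            /* the tag identifies the row  */
        ++received;
        if (sent < nrows) {
            next_row(sent, row);
            MPI_Send(row, ROW_LEN, MPI_DOUBLE, st.MPI_SOURCE, sent, MPI_COMM_WORLD);
            ++sent;
        }
    }
}

The wildcard MPI_Recv(MPI_ANY_SOURCE) is exactly what makes this version value-deterministic rather than execution-deterministic: the final answer is the same, but the match order differs from run to run.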
• Value-Nondeterministic MPI programs do exist
– Adaptive Dynamic Load Balancing Libraries
• But most are value deterministic or execution deterministic
– Of course, one does not really know w/o analysis!
• Detect replay non-determinism over schedule space
– Races can creep into MPI programs
• Forgetting to Wait for MPI non-blocking calls to finish
– Floating point can make things non-deterministic
• MPI programs “die by the bite of a thousand mosquitoes”
– No major vulnerabilities one can focus on
• E.g. in Thread Programming, focusing on races
• With MPI, we need comprehensive “Bug Monitors”
• Building MPI bug monitors requires collaboration
– Lucky to have collaborations with DOE labs
– The lack of FV critical mass hurts
P0: Send( (rank+1)%N ); Recv( (rank-1)%N );
P1: Send( (rank+1)%N ); Recv( (rank-1)%N );
P2: Send( (rank+1)%N ); Recv( (rank-1)%N );
• Expected “circular” msg passing
• Found that P0’s Recv entirely vanished !!
• REASON:
– In C, -1 % N is not N-1 but rather -1 itself
– In MPI, “-1” is MPI_PROC_NULL
– A Recv posted on MPI_PROC_NULL is ignored!
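A compilable sketch (C with MPI) of this ring, showing the pitfall and the usual fix (small single-integer messages assumed, so the sends are effectively buffered):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, N, buf = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &N);

    int right = (rank + 1) % N;
    /* Pitfall: for rank 0, (rank - 1) % N is -1 in C, which the MPI library
       treated as MPI_PROC_NULL -- so rank 0's Recv silently vanishes.        */
    int left = (rank - 1 + N) % N;      /* fix: keep the index non-negative   */

    MPI_Send(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    MPI_Recv(&buf, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d received from rank %d\n", rank, buf);

    MPI_Finalize();
    return 0;
}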
• Bug encountered at large scale w.r.t. famous MPI library (Vo)
– Bug was absent at a smaller scale
– It was a concurrency bug
• Attempt to implement collective communication (Thakur)
– Bug exists for ranges of size parameters
• Wrong assumption: that MPI barrier was irrelevant (Siegel)
– It was not – a communication race was created
• Other common bugs (we see them a lot; potentially concurrency-dependent)
– Forgetting to wait for non-blocking receive to finish
– Forgetting to free up communicators and type objects
• Some codes may be considered buggy if non-determinism arises!
– Use of MPI_Recv(*) often does not result in non-deterministic execution
– Need something more than “superficial inspection”
• Typing a[i][i] = init instead of a[i][j] = init
• Communication races
– Unintended send matches “wildcard receive”
• Bugs that show up when ported
– Runtime buffering changes; deadlocks erupt
– Sometimes, bugs show up when buffering added!
• Misunderstood “Collective” semantics
– Broadcast does not have “barrier” semantics
• MPI + threads
– Royal troubles await the newbies
• Solve FV of Pure MPI Applications “well”
– Progress in non-determinism coverage for fixed test harness
– MUST integrate with good error monitors
• (Preliminary) Work on hybrid MPI + Something
– Something = Pthreads and CUDA so far
– Evaluated heuristics for deterministic replay of Pthreads + MPI
• Work on CUDA/OpenCL Analysis
– Good progress on Symbolic Static Analyzer for CUDA Kernels
– (Prelim.) progress on Symbolic Test Generator for CUDA Pgms
• (Future) Symbolic Test Generation to “crash” hybrid pgms
– Finding lurking crashes may be a communicable value proposition
• (Future) Intelligent schedule-space exploration
– Focus on non-monolithic MPI programs
Eliminating wasted search in message passing verif.
P0: MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1: MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x);
P2: MPI_Send(to P1…); MPI_Send(to P1, data=33);
32
A frequently followed approach: “boil the whole schedule space” – often very wasteful
Richard Vuduc, Martin Schulz, Dan Quinlan, Bronis de Supinski, and Andreas Sæbjørnsen. Improving distributed memory applications testing by message perturbation. In Proc. 4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis, Portland, ME, USA, July 2006.
33
Eliminating wasted work in message passing verif.
P0: MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1: MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x);
P2: MPI_Send(to P1…); MPI_Send(to P1, data=33);
No need to play with schedules of deterministic actions.
But consider these two cases…
34
P0: Send(to:1); Recv(from:1);
P1: Send(to:0); Recv(from:0);
We know that this program may deadlock when there is less Send buffering…
36
P0: Send(to:1); Recv(from:1);
P1: Send(to:0); Recv(from:0);
… and the same program may avoid the deadlock when there is more Send buffering. (A compilable sketch of this head-to-head exchange follows.)
37
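A compilable sketch (C with MPI, assuming exactly two processes) of this head-to-head exchange; whether it completes depends entirely on how much buffering the MPI_Send calls get:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, other, in = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                       /* assumes exactly 2 processes */

    /* With little or no send buffering, both of these sends can block here forever. */
    MPI_Send(&rank, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d got %d\n", rank, in);
    MPI_Finalize();
    return 0;
}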
P0: Send(to:1); Send(to:2);
P1: Send(to:2); Recv(from:0);
P2: Recv(from:*); Recv(from:0);
… But this program deadlocks if Send(to:1) has more buffering!
38
P0: Send(to:1); Send(to:2);
P1: Send(to:2); Recv(from:0);
P2: Recv(from:*); Recv(from:0);
… But this program deadlocks if Send(to:1) has more buffering! With that buffering, P0 runs ahead to Send(to:2); if P2’s Recv(from:*) matches it, then P1’s Send(to:2) and P2’s Recv(from:0) are left mismatched – hence a deadlock.
42
“Your program is deadlock free if you have successfully tested it under zero buffering” – not true in general, as the previous example shows!
43
• Perhaps partly
– Over 17 years of MPI, things have changed
– Inevitable use of shared memory cores, GPUs, …
– Yet, many of the issues seem fundamental:
• Need for wide adoption across problems, languages, machines
• Need to give programmer better handle on resource usage
• How to evolve out of MPI?
– Whom do we trust to reset the world?
– Will they get it any better?
– What about the train-wreck meanwhile?
• Must one completely evolve out of MPI?
44
(Figure, revisited: the same software stack – PSE-based user applications (e.g. Uintah, Charm++, ADLB) and monolithic large-scale MPI-based user applications over high-performance MPI libraries, concurrent data structures, and an InfiniBand-style interconnect – now annotated with our tools: ISP and DAMPI provide useful formalizations to help test the MPI layers, while PUG and GKLEE target GPU code. Hardware examples: Sandy Bridge (courtesy anandtech.com), AMD Fusion APU, GeForce GTX 480 (NVIDIA).)
• Dynamic formal verification of MPI
– It is basically testing that discovers all alternate schedules
• Coverage of communication non-determinism
– Also gives us a “predictive theory” of MPI behavior
– Centralized approach: ISP
– GEM: tool integration within the Eclipse Parallel Tools Platform
– Demo of GEM
46
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
47
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
• Non-blocking Send – the send lasts from the Isend to its Wait
• The send buffer can be reclaimed only after the Wait clears
• Forgetting to issue the Wait ⇒ an MPI “request object” leak
48
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
• Non-blocking Receive – lasts from the Irecv to its Wait
• The receive buffer can be examined only after the Wait clears
• Forgetting to issue the Wait ⇒ an MPI “request object” leak
50
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
• A blocking receive in the middle (P1’s Recv(2))
• Equivalent to an Irecv followed immediately by its Wait
• The data fetched by Recv(2) is available before that of the Irecv
51
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
• Since P0’s Isend and P1’s Irecv can be “in flight”, the barrier can be crossed
• This allows P2’s Isend to race with P0’s Isend to match the Irecv(*)
52
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
• Traditional testing methods may reveal only the P0→P1 match or only the P2→P1 match
• The P2→P1 match may show up only after the code is ported
• Our tools ISP and DAMPI automatically discover and run both tests, regardless of the execution platform (a compilable sketch of this example follows)
53
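A compilable rendering (C with MPI, three processes assumed) of the example above; the payload is the sender’s rank, so the output shows which send the wildcard actually matched:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = -1, y = -1;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);            /* Isend(1, req)  */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);  /* Irecv(*, req)  */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Recv(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* Recv(2)        */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("Irecv(*) matched the send from rank %d\n", x);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Isend(&rank, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);            /* Isend(1, req)  */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

Note the lurking danger: if the wildcard Irecv matches rank 2’s message, then Recv(2) has nothing left to match and the run hangs – precisely the schedule that a tool like ISP or DAMPI will force and expose.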
(Figure: ISP workflow – the MPI program is built against an interposition layer; the resulting executable runs as Proc 1, Proc 2, …, Proc n under a centralized Scheduler that sits between the processes and the MPI runtime.)
• The Scheduler intercepts MPI calls
• It reorders and/or rewrites the actual calls going into the MPI runtime
• It discovers maximal non-determinism and plays through all choices
(A sketch of PMPI-style interposition follows.)
54
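One way such an interposition layer can be built (a sketch only, not ISP’s actual code) is via the standard PMPI profiling interface: redefine each MPI call so that it consults the scheduler and then forwards to the real implementation. The two helper functions here are hypothetical stand-ins for the scheduler protocol.

#include <mpi.h>

/* Hypothetical scheduler-protocol helpers (e.g. over a local socket). */
static void report_to_scheduler(const char *op, int peer) { (void)op; (void)peer; }
static void await_go_ahead(void) { }

int MPI_Send(const void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm) {
    report_to_scheduler("Send", dest);   /* tell the scheduler what this process wants to do           */
    await_go_ahead();                    /* block until the scheduler permits (and possibly reorders) it */
    return PMPI_Send(buf, count, dt, dest, tag, comm);   /* forward to the real MPI runtime            */
}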
ISP Scheduler Actions (animation), step 1 – program: P0: Isend(1, req); Barrier; Wait(req) | P1: Irecv(*, req); Barrier; Recv(2); Wait(req) | P2: Barrier; Isend(1, req); Wait(req). The Scheduler has collected P0’s Isend(1) and Barrier into its buffer (exchanging sendNext with the process); nothing has yet been issued to the MPI runtime.
55
Step 2 – the Scheduler has now also collected P1’s Irecv(*) and Barrier; its buffer holds Isend(1), Barrier (from P0) and Irecv(*), Barrier (from P1).
56
Step 3 – all three Barriers have been collected; the Scheduler’s buffer holds Isend(1), Barrier (P0), Irecv(*), Barrier (P1), and Barrier (P2), and the Barrier match set is issued to the MPI runtime.
57
Step 4 – past the barrier, the Scheduler collects P0’s Wait(req), P1’s Recv(2) and Wait(req), and P2’s Isend(1, req) and Wait(req). To force a specific match for the wildcard, the Irecv(*) is dynamically rewritten into a specific-source receive (here Irecv(2)) before the corresponding match set is issued to the MPI runtime; “No Match-Set” in the animation marks a point where no match set is formable yet.
58
• MPI calls are modeled in terms of four salient events
– Call issued
• All calls are issued in program order
– Call returns
• The code after the call can now be executed
– Call matches
• An event that marks the call committing
– Call completes
• All resources associated with the call are freed
60
The Matches-Before (MB) ordering rules:
1. Isend(to: Proc k, …); … Isend(to: Proc k, …)
2. Irecv(from: Proc k, …); … Irecv(from: Proc k, …)
3. Irecv(from: Proc *, …); … Irecv(from: Proc k, …)
4. Irecv(from: Proc j, …); … Irecv(from: Proc *, …)   (conditional matches-before)
5. Isend( &h ); … Wait( h )
6. Wait(..); … AnyMPIOp
7. Barrier(..); … AnyMPIOp
61
• Pick a process Pi and its instrn Op at PC n
• If Op does not have an unmatched ancestor according to
MB, then collect Op into Scheduler’s Reorder Buffer
– Stay with Pi, Increment n
• Else Switch to Pi+1 until all Pi are “fenced”
– “fenced” means all current Ops have unmatched ancestors
• Form Match Sets according to priority order
– If Match Set is {s1, s2, .. sK} + R(*), cover all cases using stateless replay
• Issue an eligible set of Match Set Ops into the MPI runtime
• Repeat until all processes are finalized or error encountered
Theorem: This Scheduling Method achieves ND-Coverage in MPI programs!
62
How MB helps predict outcomes
Will this single-process example, called “Auto-send”, deadlock?
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
(Figure: the MB edges over this program.)
How MB helps predict outcomes – the Scheduler’s steps on Auto-send:
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
1. Collect R(from:0, h1) into the Scheduler’s Reorder Buffer.
2. Collect B. P0 is now fenced; form the match set { B } and send it to the MPI runtime.
3. Collect S(to:0, h2). P0 is fenced again; form the { R, S } match set and fire it.
4. Collect W(h1). P0 is fenced; form the { W } match set and fire it.
5. Collect W(h2). Fire W(h2): the program finishes without deadlocks.
(A compilable rendering of Auto-send follows.)
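A compilable rendering (C with MPI, run with a single process): because the receive and the send are both non-blocking, the self-message matches and the program completes, just as the MB-guided schedule above predicts.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int out = 42, in = 0;
    MPI_Request h1, h2;
    MPI_Init(&argc, &argv);

    MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &h1);   /* R(from:0, h1) */
    MPI_Barrier(MPI_COMM_WORLD);                             /* B             */
    MPI_Isend(&out, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &h2);  /* S(to:0, h2)   */
    MPI_Wait(&h1, MPI_STATUS_IGNORE);                        /* W(h1)         */
    MPI_Wait(&h2, MPI_STATUS_IGNORE);                        /* W(h2)         */

    printf("received %d from self -- no deadlock\n", in);
    MPI_Finalize();
    return 0;
}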
• Dynamic formal verification of MPI
– It is basically testing which discovers all alternate schedules
• Coverage of communication non-determinism
– Also gives us a “predictive theory” of MPI behavior
– Centralized approach : ISP
– GEM: Tool Integration within Eclipse Parallel Tools Platform
• DEMO OF GEM
– Distributed approach : DAMPI
75
• The ROB and the MB graph can get large
– Limit of ISP: 32 ParMETIS (15K LOC) processes
• We need to dynamically verify real apps
– Large data sets, 1000s of processes
– High-end bugs often masked when downscaling
– Downscaling is often IMPOSSIBLE in practice
– ISP is too sequential! Employ parallelism of a large Cluster!
• What do we give up?
– We can’t do “what if” reasoning
• What if a PARTICULAR Isend has infinite buffering?
– We may have to give up precision for scalability
77
(Figure: DAMPI – Distributed Analyzer for MPI – workflow: the MPI program is linked with DAMPI’s PnMPI modules and run over the MPI runtime as Proc 1, Proc 2, …, Proc n; the discovered alternate matches and epoch decisions feed a schedule generator, which drives reruns.)
Distributed Causality Tracking in DAMPI
• Alternate matches are
– co-enabled and concurrent actions
– that can match according to MPI’s matching rules
• DAMPI performs an initial run of the MPI program
• It discovers the alternative matches
• We have developed MPI-specific sparse logical clock tracking
– Vector clocks (no omissions, less scalable)
– Lamport clocks (no omissions in practice, more scalable)
• We have gone up to 1000 processes (10K seems possible)
DAMPI uses Lamport clocks to build happens-before relationships.
(Figure: P0 performs S(P1) at clock 0; P1 performs R1(*), its clock going 0 → 1; P2 performs S(P1) at clock 0.)
• Use the Lamport clock to track happens-before
– Sparse Lamport clocks – only “count” non-deterministic events
– The MPI MB relation is “baked” into the clock-update rules
– The clock is incremented after completion of an MPI_Recv(ANY_SOURCE)
– Nested blocking / non-blocking operations are handled
– Compare the incoming clock to detect potential matches
How we use Happens-Before relationship to detect alternative matches
0 S(P
1
)
P0
0 1
P1
R
1
(*)
0
P2
S(P
1
)
• Question: could P
2
’s send have matched P
1
’s recv?
• Answer: Yes!
• Earliest Message with lower clock value than the current process clock is an eligible match
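A toy sketch in C of this idea (not DAMPI’s implementation): the clock is piggybacked on every message, advanced only at non-deterministic (wildcard) receives, and any arriving message whose clock is lower than the receiver’s current clock is recorded as a potential alternative match for an earlier wildcard receive.

#include <stdio.h>

typedef struct { int payload; int clock; } msg_t;   /* the clock is piggybacked on each message */

static int my_clock = 0;      /* sparse Lamport clock: counts only non-deterministic events */

/* Piggyback the current clock when sending. */
msg_t make_msg(int payload) { msg_t m = { payload, my_clock }; return m; }

/* Called whenever a receive completes with message m from rank src. */
void on_recv_complete(msg_t m, int src, int was_wildcard) {
    if (m.clock < my_clock)   /* concurrent with an earlier wildcard receive */
        printf("send from %d is a potential alternative match\n", src);
    if (was_wildcard)
        my_clock++;           /* "count" only the non-deterministic event */
}

int main(void) {
    msg_t a = make_msg(1);    /* P0's send, carrying clock 0                                  */
    on_recv_complete(a, 0, 1);/* wildcard receive completes: clock becomes 1                  */
    msg_t b = { 2, 0 };       /* P2's send, also carrying clock 0                             */
    on_recv_complete(b, 2, 0);/* arrives with clock 0 < 1: flagged as an alternative match    */
    return 0;
}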
DAMPI Algorithm Review, step 1: discover alternative matches during the initial run.
(Figure: same scenario – during the initial run R1(*) matches one send, and P2’s S(P1) is discovered as an alternative match.)
DAMPI Algorithm Review, step 2: force the alternative match during replay.
(Figure: on replay, the wildcard receive is rewritten as R1(2), so P2’s S(P1) is the send that matches; the previous match is excluded.)
(Chart: ParMETIS-3.1, no wildcards – ISP vs. DAMPI over 4, 8, 16, and 32 tasks.)
(Chart: matrix multiplication with wildcard receives – ISP vs. DAMPI, number of interleavings from 250 to 1000.)
(Chart: slowdown for one interleaving at 1024 processes, base vs. DAMPI, roughly in the 0–2.5x range; no replay was necessary.)
The Devil in the Detail of DAMPI
• Textbook Lamport/Vector Clocks Inapplicable
– “Clocking” all events won’t scale
• So “clock” only ND-events
– When do non-blocking calls finish ?
• Can’t “peek” inside MPI runtime
• Don’t want to impede the execution
• So have to infer when they finish
– Later blocking operations can force Matches Before
• So need to adjust clocks when blocking operations happen
• Our contributions: Sparse VC and Sparse LC algos
• Handling real corner cases in MPI
– Synchronous sends “learning” about non-blocking Recv start!
• What bugs are caught by ISP / DAMPI
– Deadlocks
– MPI Resource Leaks
• Forgotten dealloc of MPI communicators, MPI type objects
• Forgotten Wait after Isend or Irecv (request objects leaked)
• C assert statements (safety checks)
• What have we done wrt the correctness of our MPI verifier ISP’s algorithms?
– Testing + paper/pencil proof using standard ample-set based proof methods
– Brief look at the formal transition system of MPI, and ISP’s transition system
• Why have HPC folks not built a happens-before model for APIs such as MPI?
– HPC folks are grappling with many challenges and actually doing a great job
– There is a lack of “computational thinking” in HPC that must be addressed
• Non-CS background often
• See study by Jan Westerholm in EuroMPI 2010
– CS folks must step forward to help
• Why this does not naturally happen: “Numerical Methods” not popular in core CS
• There isn’t a clearly discernible “HPC industry”
– Wider use of GPUs, Physics based gaming, … can help push CS toward HPC
– Mobile devices will use CPUs + GPUs and do “CS problems” and “HPC problems” in a unified setting (e.g. face recognition, …)
• Our next two topics (PUG and XUM) touch on our attempts in this area
• Formal analysis methods for accelerators/GPUs
– The same motivations apply here
– In future, there may be a large-scale mish-mash of CPUs and GPUs
– CPUs will also have mini GPUs, SIMD units, …
• Again, revisit the Borkar / Chien article
• Regardless…
– GPU code looks quite unlike traditional Pthreads/Java/C# threading
– We would like to explore how to debug these kinds of codes efficiently, and help designers explore their design space while root-causing bugs quickly
89
• GPUs are key enablers of HPC
– Many of the top 10 machines are GPU based
– I found the presentation by Paul Lindberg eye-opening http://www.youtube.com/watch?v=vj6A8AKVIuI
• Interesting debugging challenges
– Races
– Barrier mismatches
– Bank conflicts
– Asynchronous MPI / CUDA interactions
90
(Images: AMD Fusion APU; OpenCL compute model, courtesy Khronos Group; GeForce GTX 480, NVIDIA; Sandy Bridge, courtesy anandtech.com)
• Three of the world’s Top-5 supercomputers are built using GPUs
• The world’s greenest supercomputers are also GPU based
• CPU/GPU integration is a clear trend
• Hand-held devices (e.g. the iPhone) will also use them
91
Example: Increment Array Elements

CPU program:
void inc_cpu(float* a, float b, int N) {
  for (int idx = 0; idx < N; idx++)
    a[idx] = a[idx] + b;
}
void main() {
  ...
  inc_cpu(a, b, N);
}

CUDA program:
__global__ void inc_gpu(float* A, float b, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    A[tid] = A[tid] + b;
}
void main() {
  ...
  dim3 dimBlock(blocksize);
  dim3 dimGrid(ceil(N / (float)blocksize));
  inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
Example: Increment Array Elements (continued) – the fine-grained GPU threads are scheduled to run with tid = 0, 1, 2, 3, …, each incrementing one element of the array. (A self-contained host program for this kernel follows.)
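A self-contained CUDA host program (a sketch, not from the slides) that launches the inc_gpu kernel above on a 16-element array and checks the result:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void inc_gpu(float *A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid] + b;
}

int main() {
    const int N = 16, blocksize = 8;
    float h[N], *d;
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize);   /* same as ceil(N / (float)blocksize) */
    inc_gpu<<<dimGrid, dimBlock>>>(d, 1.0f, N);

    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0]=%g  h[15]=%g\n", h[0], h[15]);      /* expect 1 and 16 */
    cudaFree(d);
    return 0;
}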
EAGAN, Minn., May 3, 1991 (John Markoff, NY Times)
The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros"
-- compact, extremely fast work stations that sell for less than
$100,000.
Take-away : Clayton Christensen’s “Disruptive Innovation”
GPU is a disruptive technology!
94
GPUs offer an excellent setting in which to study heterogeneous compute organizations and memory hierarchies
95
• GPU hardware still stabilizing
– Characterize GPU Hardware Formally
• Currently, program behavior may change with platform
– Micro benchmarking sheds further light
• www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
• The software is the real issue!
96
• SIMD model, unfamiliar
– Synchronization, races, deadlocks, … all different!
• Machine constants and program assumptions about problem parameters affect correctness / performance
– GPU “kernel functions” may fail or perform poorly when those assumptions are violated (a sketch follows this list)
• Formal reasoning can help identify performance pitfalls
• Tool chains are still young, so emulators and debuggers may not match hardware behavior
• Multiple memory sub-systems are involved, further complicating semantics / performance
97
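As a small illustration of the point about program assumptions, here is a CUDA sketch (not from the slides) of a per-block reduction kernel that is correct only when blockDim.x is a power of two; launch it with any other block size and some elements are silently dropped:

__global__ void block_sum(float *A) {
    extern __shared__ float s[];                 /* shared-memory scratch, one float per thread */
    int t = threadIdx.x;
    s[t] = A[blockIdx.x * blockDim.x + t];
    __syncthreads();

    /* Tree reduction: assumes blockDim.x is a power of two. */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];               /* for a non-power-of-two blockDim.x, some elements are never folded in */
        __syncthreads();
    }
    if (t == 0)
        A[blockIdx.x] = s[0];                    /* one partial sum per block */
}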
• Instrumented Dynamic Verification on GPU Emulators
– Boyer et al, STMCS 2008
• Grace : Combined Static / Dynamic Analysis
– Zheng et al, PPoPP 2011
• PUG : An SMT based Static Analyzer
– Lee and Gopalakrishnan, FSE 2010
• GKLEE : An SMT based test generator
– Lee, Rajan, Ghosh, and Gopalakrishnan (under submission)
98
99
100
Increment an N-element vector A by a scalar b (tid = 0 … 15; element A[i] becomes A[i]+b):

__global__ void inc_gpu( float* A, float b, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    A[tid] = A[tid - 1] + b;   // note: reads the neighboring element
}
The same inc_gpu kernel, with the SMT race encodings generated for it (fragment):

(and (and (/= t1.x t2.x)
     ……
     (or ;; encoding for the Read-Write race
         (and (bv-lt idx1@t1 N0)
              (bv-lt idx1@t2 N0)
              (= (bv-sub idx1@t1 0b0000000001) idx1@t2))
         ;; encoding for the Write-Write race
         (and (bv-lt idx1@t1 N0)
              (bv-lt idx1@t2 N0)
              (= idx1@t1 idx1@t2))))
__global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
  d_out[threadIdx.x] = 0;
  for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
    d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
  }
  __syncthreads();
  if (threadIdx.x % 2 == 0) {
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
    }
  ...

/* The counterexample given by PUG is: t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0.
   Threads t1 and t2 are in iterations 1 and 0 respectively.
   t1 generates the access d_out[2 + 8*1] += d_out[2 + 8*1 + 1], i.e. d[10] += d[11];
   t2 generates the access d_out[10 + 8*0] += d_out[10 + 8*0 + 1], i.e. d[10] += d[11]. */

This causes a real race on the writes to d[10], because of the += done by both threads.
PUG divides and conquers each barrier interval (BI).
(Figure: two threads t1 and t2; accesses a1 and a2 fall in BI1; accesses a3, a4, a5 (some guarded by a predicate p) and a6 fall in BI3.)
BI1 is conflict free iff, for all t1, t2: a1@t1, a2@t1, a1@t2 and a2@t2 do not conflict with each other.
BI3 is conflict free iff, for all t1, t2, the following accesses do not conflict: a3@t1, a3@t2, p: a4@t1, p: a4@t2, p: a5@t1, p: a5@t2.
PUG Results (FSE 2010)

Kernels (in CUDA SDK) | loc  | B.C. | Time (pass)
Bitonic Sort          |   65 | HIGH | 2.2s
MatrixMult            |  102 | HIGH | <1s
Histogram64           |  136 | LOW  | 2.9s
Sobel                 |  130 | HIGH | 5.6s
Reduction             |  315 | HIGH | 3.4s
Scan                  |  255 | LOW  | 3.5s
Scan Large            |  237 | LOW  | 5.7s
Nbody                 |  206 | HIGH | 7.4s
Bisect Large          | 1400 | HIGH | 44s
Radix Sort            | 1150 | LOW  | 39s
Eigenvalues           | 2300 | HIGH | 68s

+O: require no bit-vector overflowing
+C: require constraints on the input values
+R: require manual refinement
B.C.: measures how serious the bank conflicts are
Time: SMT solving time
Tested over 57 assignment submissions:
– Defects: 13 (23%)
– Barrier error or race: 3 benign, 2 fatal
– Refinement: 17.5% over #kernels, 10.5% over #loops

Defects: indicates how many kernels are not well parameterized, i.e. work only in certain configurations.
Refinement: measures how many loops need automatic refinement (by PUG).
108
• Strengths:
– SMT-based, incisive static analysis avoids interleaving explosion
– Still obtains coverage guarantees
– Good for GPU library FV
• Weaknesses:
– Engineering effort: C++, templates, breaks, …
– SMT “explosion” for value correctness
– Does not help test the code on the actual hardware
– False alarms require manual intervention
• Handling loops, kernel calling contexts
109
• http://www.cs.utah.edu/fv/mediawiki/index.php/PUG
• One recent extension (ask me for our EC2 2011 paper)
– Symbolic analysis to detect occurrences of non-coalesced memory accesses
110
• Parameterized Verification
– See Li’s dissertation for initial results
• Formal Testing
– Brand new code-base called GKLEE
• Joint work with Fujitsu Research
111
112
(Table: execution time of PUG and GKLEE on kernels from the CUDA SDK for functional correctness; n is the number of threads, and T.O. means > 5 min. Rows: Simple Reduction, Matrix Transpose, Bitonic Sort, Scan Large. Columns: PUG at n = 4 and n = 16; GKLEE at n = 4, 64, 256, and 1024.)
113
Test generation results. Traditional metrics sc cov. and bc cov. give source-code and bytecode coverage respectively. Refined metrics avg. cov_t and max. cov_t measure the average and maximum coverage over all threads. All coverages were far better than those achieved through random testing.

Kernels           | exec. time | sc cov.     | bc cov.   | min #tests | avg. cov_t | max. cov_t
Bitonic Sort      | 1s         | 100% / 100% | 51% / 44% | 5          | 78% / 76%  | 100% / 94%
Merge Sort        | 0.5s       | 100% / 100% | 70% / 50% | 6          | 88% / 80%  | 100% / 95%
Word Search       | 0.1s       | 100% / 92%  | 34% / 25% | 2          | 100% / 81% | 100% / 85%
Radix Sort        | 5s         | 100% / 91%  | 54% / 35% | 6          | 91% / 68%  | 100% / 75%
Suffix Tree Match | 12s        | 100% / 90%  | 38% / 35% | 8          | 90% / 70%  | 100% / 82%
114
• A brief introduction to MCAPI, MRAPI, MTAPI
• XUM : an eXtensible Utah Multicore system
• What are we able to learn through building a hardware realization of a custom multicore
– How can we push this direction forward?
– Any collaborations?
• Clearly there are others who do this kind of work full-time
• This has been a side project in our group (albeit one driven by two very good students)
• So we would like to build on others’ creations also…
115
• http://www.multicore-association.org
• Reason for our interest in the MCA APIs
– Our project through the Semiconductor Research Corporation
– Collaborator in dynamic verification of MCAPI applications
• Prof. Eric Mercer (BYU, Utah)
– The BYU team has also developed a formal spec for MCAPI and MRAPI, and built golden executable models from these specs
• XUM
– A Utah project involving two students
• MS project of Ben Meakin
• BS project of Grant Ayers
– An attempt to support MCAPI functions in HW+SW
– (Later) hoping to support MRAPI
116
(Picture courtesy of Multicore Association)
• A facility to interconnect heterogeneous embedded multicore systems/chips
• These systems could be very minimalistic
• No OS, different OS, could be DSP, CPU, …
• Standardization (and revision) finished around 2009
• No widely used and portable communication APIs in this space
• Currently two commercial implementations
• Mentor’s Open MCAPI
• Polycore’s messenger
• XUM is the only hardware-assisted implementation of the communication API
117
• Lead figures in MCAPI standardization (so far as our interactions go)
– Jim Holt (Freescale), Sven Brehmer (Polycore), Markus Levy (Multicore Association)
• “End-points” are connected
– Each end-point could be a thread, a process, ..
– Blocking and non-blocking communication support
• MCAPI_Send, MCAPI_Send_I, …
• Waits, Tests
• No barriers (in the API); one could implement them
• Create end-points (a collective call)
• Use cases
– Present use-cases are in C/Pthreads, with each thread performing MCAPI calls to communicate
– Very reminiscent of monolithic-style MPI programs (with all their drawbacks)
• General Expectation
– That MCAPI will be used as a standard transport layer with respect to which one may implement higher abstractions
– One project: Chapman (Univ Houston) work on realizing OpenMP
– Other suggestion: a task-graph (or other) higher-level abstraction to specify computations, with a “smart runtime” employing MCAPI
118
• Currently no ‘formal’ debugging tools
– Not enough case studies yet (projects underway in Prof. Alper Sen’s group)
• MCC: An MCAPI Correctness Checker
– Subodh Sharma (PhD student)
– Borrows from the dynamic verification tool design of ISP (MPI checker) and Inspect (our Pthreads checker)
– Dynamic verification against existing MCAPI libraries
– MCC incurs new headaches
• Hybrid Pthreads/MCAPI behaviors
• Deterministic replay of schedules often difficult
– Our present conclusion:
• Don’t go there!
• We know that dynamic verification of hybrid concurrent programs is a royal pain!
– Waiting for higher abstractions / better practices to emerge in the area
• BYU projects on model checking using an MCAPI Golden Executable Model
– Main difference is that they rely on an MCAPI operational semantics whereas we capitalize on behavior (of MCAPI library)
119
• Portable Resource API
– Portable Mutexes, Mallocs
– Portable varieties of software managed shared memory, DMA
– Pthreads and Unix facilities won’t do
• Not well-matched with requirements of heterogeneous multicores with disparate set of features / resources
• MTAPI standardization: yet to begin
• One possible usage of MCAPI + MRAPI:
– MCAPI Send call allocates buffer using MRAPI calls
– MCAPI Send happens
• say using XUM’s network, or MRAPI’s software DMA
– MRAPI calls free up the buffer
120
Two XUPV5-LX110T boards obtained courtesy of Xilinx Inc.!
• 32-bit MIPS ISA-compliant cores
• The request network is in-order, dimension-order routed
• Wormhole flow control
• Each router unit arbitrates round-robin
• The datapath in the request n/w is 16 bits wide
• The ack n/w has broadcast and pt-to-pt transfers
• We can plug in I/O devices as if they were tiles
• All of this exists as VHDL+Verilog mapped onto FPGAs
• Current memory architecture:
– All tiles have memory ports leading to an FCFS arbiter that is backed by a pipelined DDR2 controller to an SDRAM
• All tiles can have their own clocks
– About 4 physical clock sources; PLL primitives available (Xilinx tools)
– Additional clocks can be synthesized
– Currently 500 MHz for a flip-flop; 100 MHz realizable. Look at OpenSPARC.
• Built Bootloader (FPGA) and Protocol for downloading code images over RS-232
• XUM Memory Controller
– Built a fully functional DDR2-SDRAM memory controller that provides usable amounts of memory on-board
• Support for pipelined transfers
– Built a simple FCFS arbitrated memory controller (shared by all cores)
• All cores share same address space – no protection, but handy!
• Ported the XUM MIPS cores to 32 bits, and debugged the CPUs
– Debugged the CPU some (more needed)
• Found errors in delay slot and forwarding logic…
– Would be a good test vehicle for pipelined CPU FV methods
More details
• 5-7 stage in-order pipeline.
• No branch predictor (will add). No speculation.
• Web documentation status: VHDL/Verilog code available
Software story:
• GCC would be usable
– Must add some new instructions such as load-immediate-upper
– In-line assembly for XUM instructions
• FPU is TBD – new student
RTOS story:
• FreeRTOS (e.g.) – compile using GCC -> download.
MCAPI and MRAPI realization
• BYU collaborator, Prof. Mercer, has MCAPI and MRAPI golden executable model (as formal state transition rules)
• Will compile these into detailed C implementations
Programming approach
• Not recommending straight coding using MCAPI / MRAPI, as the code soon becomes a “rat’s nest”
• Will investigate compiling tasking primitives into a runtime that is supported by MCAPI and MRAPI
Other ideas?
• See Ben Meakin’s MS thesis
– Available from http://www.cs.utah.edu/fv
• The thesis provides:
– Code for send/receive
– Correctness properties of interest wrt XUM
• Good test vehicle for HW FV projects
– Memory footprint data
• Very parsimonious support for MCAPI possible
– Latency/throughput measurements on XUM
• Also, comparison with a Pthread-based baseline
• Send Header
• Send word
• Send tail
• Send ack
• Broadcast
• Receive ack
• Disable interrupts
• asm(“sndhd.s …”)   // send header
• while (i < bufsize)
– asm(“sndw …”)      // send word
• asm(“sendtl …”)    // send tail
• asm(“recack …”)    // receive ack
• Enable interrupts
• Support for connectionless and connected MCAPI protocols
– The latter achieved by not issuing a tail-flit till connection needs to be closed
• The embedded multicore space is likely very influential
– Enables development of hardware assist for new APIs and runtime mechanisms
– Even the HPC space may be influenced by design ideas percolating from below
– Dynamic formal verification tools may employ “hooks” into the hardware
• Avoids the “dirty tricks” we had to use in ISP to get control over the MPI runtime very indirectly
– Faking “Wait” operations, pre-issuing Waits to poke the MPI progress engine etc.
• In the end, we can sell what we can debug
• Time to market may be minimized through better FV / dynamic verification support provided by HW
• Great teaching tool
– If the FPGA design tool-chain is a bit kinder/gentler
– Projects such as Lava, Kiwi, .. (MSR) provide rays of hope…
Work out the crooked-barrier example on the board, assisted by a formal transition system
• for MPI,
• then a transition system for the ISP centralized scheduler (as an interposition layer),
• then a transition system for DAMPI’s distributed scheduler (sparse Lamport-clock based).
• The formal transition systems clearly show how the native semantics of MPI has been “tamed” by specific scheduler implementations!
Summary of the explorations of a group (especially its advisor) in “mid-life crisis”, wanting to be relevant and wanting to be formal (also wanting to be liked).
In the end it was worth it
Must now skate to where the puck will be!