The Berkeley View - Computer Science Division

[Par Lab triangle: Parallel Applications, Parallel Software, Parallel Hardware]
The Parallel Computing Laboratory: A
Research Agenda based on the Berkeley View
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny,
Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan,
George Necula, Dave Patterson, Koushik Sen,
John Wawrzynek, David Wessel, and Kathy Yelick
April 28, 2008
1
Outline
• Overview of Par Lab
  - Motivation & Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
• Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required Functionality
  - Ease of use
• Summary of meeting goals, other talks
  - Identify opportunities for collaboration
2
A Parallel Revolution, Ready or Not
• Old Moore's Law is over
  - No more doubling the speed of sequential code every 18 months
• New Moore's Law is here
  - 2X processors ("cores") per chip every technology generation, but same clock rate
• Sea change for HW & SW industries, since it changes the model of programming and debugging
4
"Motif" Popularity
How do compelling apps relate to 13 motifs?
[Heat map (Red = Hot, Blue = Cool): popularity of the 13 motifs across six benchmark suites (Embed, SPEC, DB, Games, ML, HPC) and the five driving apps (Health, Image, Speech, Music, Browser)]
1. Finite State Mach.
2. Combinational
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Prog
9. N-Body
10. MapReduce
11. Backtrack/B&B
12. Graphical Models
13. Unstructured Grid
5
Choosing Driving Applications
• "Who needs 100 cores to run M/S Word?"
  - Need compelling apps that use 100s of cores
• How did we pick applications?
  1. Enthusiastic expert application partner, leader in field, promise to help design, use, evaluate our technology
  2. Compelling in terms of likely market or social impact, with short term feasibility and longer term potential
  3. Requires significant speed-up, or a smaller, more efficient platform to work as intended
  4. As a whole, applications cover the most important
     - Platforms (handheld, laptop, games)
     - Markets (consumer, business, health)
7
Compelling Client Applications
[Diagram of the five driving apps: Music/Hearing, Robust Speech Input, Image Query by example (against an image database of 1000's of images), Parallel Browser, Personal Health]
8
Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Stack diagram, top to bottom:
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs
- Productivity layer: Composition & Coordination Language (C&CL); C&CL Compiler/Interpreter; Parallel Libraries; Parallel Frameworks
- Efficiency layer: Efficiency Languages; Sketching; Autotuners; Legacy Code; Schedulers; Communication & Synch. Primitives; Efficiency Language Compilers
- Correctness (spanning layers): Static Verification; Type Systems; Directed Testing; Dynamic Checking; Debugging with Replay
- OS: Legacy OS; OS Libraries & Services; Hypervisor
- Hardware: Multicore/GPGPU; RAMP Manycore; Diagnosing Power/Performance]
10
Developing Parallel Software
• 2 types of programmers → 2 layers
• Efficiency Layer (10% of today's programmers)
  - Expert programmers build Frameworks & Libraries, Hypervisors, …
  - "Bare metal" efficiency possible at Efficiency Layer
• Productivity Layer (90% of today's programmers)
  - Domain experts / naïve programmers productively build parallel apps using frameworks & libraries
  - Frameworks & libraries composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged
  → Create language for Composition and Coordination (C&C)
• Talk by Kathy Yelick
12
Outline
• Overview of Par Lab
  - Motivation & Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
• Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required Functionality
  - Ease of use
• Summary of meeting goals, other talks
  - Identify opportunities for collaboration
14
Success Metric - Impact
• LAPACK and ScaLAPACK are widely used
  - Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, …
  - >86M web hits @ Netlib (incl. CLAPACK, LAPACK95); 35K hits/day
[Image: Cosmic Microwave Background Analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000) — ScaLAPACK]
[Image: Xiaoye Li: Sparse LU]
High Performance (Speed and Accuracy)
• Matching Algorithms to Architectures (8 talks)
  - Autotuning: generate a fast algorithm automatically, depending on architecture and problem
  - Communication-Avoiding Linear Algebra: avoiding latency and bandwidth costs
• Faster Algorithms (2 talks)
  - Symmetric eigenproblem (O(n²) instead of O(n³))
  - Sparse LU factorization
• More accurate algorithms (2 talks)
  - Either at "usual" speed, or at any cost
• Structure-exploiting algorithms
  - Roots(p) (O(n²) instead of O(n³))
Automatic Performance Tuning
• Writing high performance software is hard
• Ideal: get a high fraction of peak performance from one algorithm
• Reality: best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, …
  - Best choice can depend on knowing a lot of applied mathematics and computer science
  - Changes with each new hardware, compiler release
• Goal: Automation
  - Generate and search a space of algorithms
  - Past successes: PHiPAC, ATLAS, FFTW, Spiral
  - Many conferences, DOE projects, …
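The generate-and-search loop itself can be tiny; the hard part is producing good variants. A minimal sketch in the spirit of PHiPAC/ATLAS (all names and the toy blocked-sum kernel are invented for illustration): build candidate variants, time each on representative data, keep the fastest.

```python
import time

def make_blocked_sum(block):
    """One point in the search space: a reduction with a given block size."""
    def variant(xs):
        total = 0.0
        for i in range(0, len(xs), block):
            total += sum(xs[i:i + block])  # blocked inner kernel
        return total
    return variant

def autotune(variants, sample, trials=5):
    """Time every candidate on representative data; return the fastest."""
    def timed(fn):
        start = time.perf_counter()
        fn(sample)
        return time.perf_counter() - start
    return min(variants, key=lambda v: min(timed(v) for _ in range(trials)))

data = [1.0] * 10000
candidates = [make_blocked_sum(b) for b in (8, 64, 512)]
fastest = autotune(candidates, data)
```

A real autotuner searches a far richer space (data structures, loop orders, prefetch distances) and caches the winner per architecture; the structure, though, is exactly this loop.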
The Difficulty of Tuning SpMV

// y <- y + A*x
for all nonzero A(i,j):
    y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k = row[i] to row[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t
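The CSR loop above, made runnable. A pure-Python sketch for illustration only; a tuned implementation would be low-level code with the optimizations discussed on the next slides:

```python
def spmv_csr(val, col_idx, row_ptr, x):
    """y = A*x with A in compressed sparse row form:
    val[k] is the k-th nonzero, col_idx[k] its column, and
    row i owns nonzeros row_ptr[i] .. row_ptr[i+1]-1."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        t = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            t += val[k] * x[col_idx[k]]
        y[i] = t
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
val = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(val, col_idx, row_ptr, [1.0, 1.0, 1.0])  # [3.0, 3.0, 9.0]
```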
• Exploit 8x8 dense blocks
[Chart: SpMV performance on matrix raefsky3, "Speedups on Itanium 2: The Need for Search" — reference implementation: Mflop/s at 7.6% of peak; best register blocking (4x2): Mflop/s at 31.1% of peak]
More surprises tuning SpMV
• More complex example
• Example: 3x3 blocking
  - Logical grid of 3x3 cells
Extra Work Can Improve Efficiency
• More complex example
• Example: 3x3 blocking
  - Logical grid of 3x3 cells
  - Pad with zeros
  - "Fill ratio" = 1.5, i.e. 1.5x as many flops
• On Pentium III: 1.5x speedup! (2/3 the time)
  - 1.5² = 2.25x flop rate
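The fill ratio in this trade-off is easy to compute for any blocking. A small sketch (the helper name is invented; input is a set of nonzero coordinates):

```python
def fill_ratio(nonzeros, r, c):
    """Stored entries / true nonzeros when every r x c block that
    contains a nonzero is padded out with explicit zeros."""
    touched_blocks = {(i // r, j // c) for (i, j) in nonzeros}
    return len(touched_blocks) * r * c / len(nonzeros)

# 6x6 diagonal matrix: 3x3 blocking touches 2 blocks -> 18 stored / 6 true
diag_ratio = fill_ratio({(i, i) for i in range(6)}, 3, 3)   # 3.0
dense_ratio = fill_ratio({(i, j) for i in range(3) for j in range(3)}, 3, 3)  # 1.0
```

An autotuner weighs this extra work against the faster blocked kernel, exactly as in the Pentium III example above.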
Autotuned Performance of SpMV
[Bar charts, GFlop/s (0.0–7.0) per test matrix (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Harbor, QCD, FEM-Ship, Epidem, Econ, FEM-Accel, Circuit, Webbase, LP, Median) on three platforms: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron). Stacked optimizations: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron), +FW fix, array padding (N2), etc.]
25
Autotuning SpMV
• Large search space of possible optimizations
  - Large speedups possible
  - Parallelism adds more!
• Later talks
  - Sam Williams on tuning SpMV for a variety of multicore, other platforms
  - Ankit Jain on an easy-to-use system for incorporating autotuning into applications
  - Kaushik Datta on tuning the special case of stencils
  - Rajesh Nishtala on tuning collective communication
• But don't you still have to write difficult code to generate the search space?
26
Program Synthesis
• Best implementation/data structure hard to write, identify
  - Don't do this by hand
• Sketching: code generation using 2QBF
  - Spec: simple implementation (3-loop 3D stencil)
  - Sketch: optimized skeleton (5 loops, missing some index/bounds)
  - → Optimized code (tiled, prefetched, time skewed)
• Talk by Armando Solar-Lezama / Ras Bodik on program synthesis by sketching, applied to stencils
Communication-Avoiding Linear Algebra (CALU)
• Exponentially growing gaps between
  - Floating point time << 1/Network BW << Network Latency
    (improving 59%/year vs 26%/year vs 15%/year)
  - Floating point time << 1/Memory BW << Memory Latency
    (improving 59%/year vs 23%/year vs 5.5%/year)
• Goal: reorganize linear algebra to avoid communication
  - Not just hiding communication (speedup ≤ 2x)
  - Arbitrary speedups possible
  - Possible for Dense and Sparse Linear Algebra
CALU Summary (1/4)
• QR or LU decomposition of m x n matrix, m >> n
• Parallel implementation
  - Conventional: O(n log p) messages
  - New: O(log p) messages – optimal
  - Performance: QR 5x faster on cluster, LU 7x faster on cluster
• Serial implementation with fast memory of size W
  - Conventional: O(mn/W) moves of data from slow to fast memory
    (mn/W = how many times larger the matrix is than fast memory)
  - New: O(1) moves of data
  - Performance: out-of-core QR only 2x slower than having ∞ DRAM
• Expect gains with Multicore as well
• Price:
  - Some redundant computation (but flops are cheap!)
  - Different representation of answer for QR (tree structured)
  - LU stable in practice so far, but not GEPP
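The tree-structured QR can be sketched in a few lines. The version below is a toy: modified Gram-Schmidt instead of the Householder kernels used in practice, and a single flat combine instead of an O(log p) reduction tree. But it shows why the blocking works: stacking the per-block R factors and factoring again reproduces the R of the whole matrix.

```python
def qr(a):
    """Modified Gram-Schmidt QR of a (list of rows); returns (Q, R)."""
    m, n = len(a), len(a[0])
    q = [row[:] for row in a]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            r[i][j] = sum(q[k][i] * q[k][j] for k in range(m))
            for k in range(m):
                q[k][j] -= r[i][j] * q[k][i]
        r[j][j] = sum(q[k][j] ** 2 for k in range(m)) ** 0.5
        for k in range(m):
            q[k][j] /= r[j][j]
    return q, r

def tsqr_r(a, p):
    """R of a tall-skinny matrix via p independent block QRs + one combine.
    Each block QR could run on its own processor; only the small R
    factors (n x n each) are communicated."""
    rows_per = len(a) // p
    local_rs = [qr(a[b * rows_per:(b + 1) * rows_per])[1] for b in range(p)]
    stacked = [row for r in local_rs for row in r]   # (p*n) x n
    return qr(stacked)[1]

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0],
     [2.0, 1.0], [4.0, 3.0], [6.0, 5.0], [8.0, 7.0]]
r_tree = tsqr_r(a, 2)
r_direct = qr(a)[1]   # agrees with r_tree (both have positive diagonals)
```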
CALU Summary (2/4)
• QR or LU decomposition of n x n matrix
  - Communication lower by factor of b = block size
  - Lots of speedup possible (modeled and measured)
• Modeled speedups of new QR over ScaLAPACK
  - IBM Power 5 (512 procs): up to 9.7x
  - Petascale (8K procs): up to 22.9x
  - Grid (128 procs): up to 11x
• Measured and modeled speedups of new LU over ScaLAPACK
  - IBM Power 5 (Bassi): up to 2.3x speedup (measured)
  - Cray XT4 (Franklin): up to 1.8x speedup (measured)
  - Petascale (8K procs): up to 80x (modeled)
  - Speed limit: Cholesky? Matmul?
• Extends to sparse LU
  - Communication more dominant, so payoff may be higher
  - Speed limit: Sparse Cholesky?
  - Talk by Xiaoye Li on an alternative
CALU Summary (3/4)
• Take k steps of a Krylov subspace method
  - GMRES, CG, Lanczos, Arnoldi
  - Assume matrix "well-partitioned," with modest surface-to-volume ratio
• Parallel implementation
  - Conventional: O(k log p) messages
  - New: O(log p) messages – optimal
• Serial implementation
  - Conventional: O(k) moves of data from slow to fast memory
  - New: O(1) moves of data – optimal
• Can incorporate some preconditioners
  - Need to be able to "compress" interactions between distant i, j
  - Hierarchical, semiseparable matrices …
• Lots of speedup possible (modeled and measured)
  - Price: some redundant computation
• Talks by Marghoob Mohiyuddin, Mark Hoemmen
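The kernel being reorganized here is the "matrix powers" computation [x, Ax, A²x, …, Aᵏx]. Below is the conventional reference version: k separate SpMVs, hence k rounds of memory traffic and messages, which the communication-avoiding formulations replace with a single pass over redundantly replicated ("ghost") data. Rows are stored as {column: value} dicts; this is an illustration, not the talks' implementation.

```python
def matrix_powers(rows, x, k):
    """Return [x, A@x, ..., A^k @ x] by k conventional SpMVs.
    rows[i] is a dict {j: A[i][j]} holding the nonzeros of row i."""
    basis = [list(x)]
    for _ in range(k):
        prev = basis[-1]
        basis.append([sum(v * prev[j] for j, v in row.items())
                      for row in rows])
    return basis

# 1D Laplacian stencil on 4 points; two powers applied to e_0
rows = [{0: 2.0, 1: -1.0},
        {0: -1.0, 1: 2.0, 2: -1.0},
        {1: -1.0, 2: 2.0, 3: -1.0},
        {2: -1.0, 3: 2.0}]
basis = matrix_powers(rows, [1.0, 0.0, 0.0, 0.0], 2)
```

Note how each new vector only needs entries of the previous one within the stencil's reach; that locality is what lets the optimized kernels fetch A once and compute all k vectors.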
CALU Summary (4/4)
• Lots of related work
  - Some going back to the 1960's
  - Reports discuss this comprehensively; we will not
• Our contributions
  - Several new algorithms, improvements on old ones
  - Unifying parallel and sequential approaches to avoiding communication
  - Systematic examination of as much of linear algebra as we can
  - Time for these algorithms has come, because of growing communication costs
• Why just linear algebra?
Linear Algebra on GPUs
• Important part of the architectural space to explore
  - Talk by Vasily Volkov
• NVIDIA has licensed our BLAS (SGEMM)
• Fastest implementations of dense LU, Cholesky, QR
  - 80-90% of "peak"
• Require various optimizations special to the GPU
  - Use CPU for BLAS1 and BLAS2, GPU for BLAS3
  - In LU, replace TRSM by TRTRI + GEMM (~as stable as GEPP)
33
Faster Algorithms (Highlights)
• MRRR algorithm for the symmetric eigenproblem
  - Talk by Osni Marques / B. Parlett / I. Dhillon / C. Voemel
  - 2006 SIAM Linear Algebra Prize for Parlett, Dhillon
• Up to 10x faster HQR
  - R. Byers / R. Mathias / K. Braman
  - 2003 SIAM Linear Algebra Prize
• Extensions to QZ:
  - B. Kågström / D. Kressner / T. Adlerborn
• Faster Hessenberg, tridiagonal, bidiagonal reductions:
  - R. van de Geijn / E. Quintana-Orti
  - C. Bischof / B. Lang
  - G. Howell / C. Fulton
• Parallel Sparse LU
  - Talk by Xiaoye Li
Collaborators
• UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel, undergrads…
• U Tennessee, Knoxville: Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
• Other Academic Institutions
  - UT Austin, UC Davis, CU Denver, Florida IT, Georgia Tech, U Kansas, U Maryland, North Carolina SU, UC Santa Barbara
  - TU Berlin, ETH, U Electrocomm. (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
• Research Institutions: INRIA, LBL
• Industrial Partners (predating ParLab): Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA
More Accurate Algorithms
• Motivation
  - User requests, debugging
• Iterative refinement for Ax=b, least squares
  - "Promise" the right answer for O(n²) additional cost
  - Talk by Jason Riedy
• Arbitrary precision versions of everything
  - Using your favorite multiple precision package
  - Talk by Yozo Hida
• Jacobi-based SVD
  - Faster than QR, can be arbitrarily more accurate
  - Drmac / Veselic
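Iterative refinement fits on one screen. A toy sketch (not the algorithm from the talk): solve in working precision, form the residual in higher precision — exact rationals stand in for extended precision here — and correct. All names are invented.

```python
from fractions import Fraction

def solve(a, b):
    """Gaussian elimination with partial pivoting, working precision."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # pivot row
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def refine(a, b, steps=3):
    """x := x + A^{-1}(b - A x); the residual is formed exactly."""
    x = solve(a, b)
    for _ in range(steps):
        r = [Fraction(b[i]) - sum(Fraction(a[i][j]) * Fraction(x[j])
                                  for j in range(len(x)))
             for i in range(len(b))]
        d = solve(a, [float(v) for v in r])
        x = [x[j] + d[j] for j in range(len(x))]
    return x

# Mildly ill-conditioned test: 3x3 Hilbert matrix, b = A * [1, 1, 1]
a = [[1.0, 1/2, 1/3], [1/2, 1/3, 1/4], [1/3, 1/4, 1/5]]
b = [sum(row) for row in a]
x = refine(a, b)
```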
What could go into linear algebra libraries?
For all linear algebra problems
  For all matrix/problem structures
    For all data types
      For all architectures and networks
        For all programming interfaces
          Produce best algorithm(s) w.r.t. performance and accuracy
          (including condition estimates, etc.)
Need to prioritize, automate, enlist help!
What do users want? (1/2)
• Performance, ease of use, functionality, portability
• Composability
  - On multicore, expect to implement dense codes via DAG scheduling (Dongarra's PLASMA)
  - Talk by Krste Asanovic / Heidi Pan on threads
• Reproducibility
  - Made challenging by nonassociativity of floating point
• Ongoing collaborations on Driving Apps
  - Jointly analyzing needs
  - Talk by T. Keaveny on Medical Application
  - Other apps so far: mostly dense and sparse linear algebra, FFTs; some interesting structured needs emerging
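The reproducibility point in three lines: floating-point addition is not associative, so a parallel reduction that changes the summation order can change the bits of the result.

```python
# Same three numbers, two association orders, two answers.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c    # a and b cancel first, so the 1.0 survives -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1e16 (ulp there is 2.0) -> 0.0
```

This is why a library that reduces across a varying number of cores can return different answers run to run unless it fixes the reduction order.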
What do users want? (2/2)
• DOE/NSF User Survey
  - Small but interesting sample at www.netlib.org/lapack-dev
  - What matrix sizes do you care about?
    - 1000s: 34%
    - 10,000s: 26%
    - 100,000s or 1Ms: 26%
  - How many processors, on distributed memory?
    - >10: 34%, >100: 31%, >1000: 19%
  - Do you use more than double precision?
    - Sometimes or frequently: 16%
• New graduate program in CSE with 106 faculty from 18 departments
  - New needs may emerge
Highlights of New Dense Functionality
• Updating / downdating of factorizations: Stewart, Langou
• More generalized SVDs: Bai, Wang
• More generalized Sylvester/Lyapunov eqns: Kågström, Jonsson, Granat
• Structured eigenproblems
  - Selected matrix polynomials: Mehrmann
Organizing Linear Algebra
www.netlib.org/lapack
www.netlib.org/scalapack
gams.nist.gov
www.netlib.org/templates
www.cs.utk.edu/~dongarra/etemplates
Improved Ease of Use
• Which do you prefer?

A\B

vs.

CALL PDGESV( N, NRHS, A, IA, JA, DESCA,
             IPIV, B, IB, JB, DESCB, INFO )

CALL PDGESVX( FACT, TRANS, N, NRHS, A,
              IA, JA, DESCA, AF, IAF, JAF, DESCAF, IPIV,
              EQUED, R, C, B, IB, JB, DESCB, X, IX, JX,
              DESCX, RCOND, FERR, BERR, WORK,
              LWORK, IWORK, LIWORK, INFO )
Ease of Use: One approach
• Easy interfaces vs access to details
  - Some users want access to all details, because
    - Peak performance matters
    - Control over memory allocation
  - Other users want a "simpler" interface
    - Automatic allocation of workspace
  - No universal agreement across systems on the "easiest interface"
  - Leave decision to higher level packages
• Keep expert driver / simple driver / computational routines
• Add wrappers for other languages
  - Fortran95, Java, Matlab, Python, even C
  - Automatic allocation of workspace
• Add wrappers to convert to the "best" parallel layout
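The expert-driver / simple-driver split can be sketched in Python. Everything here is invented for illustration, loosely echoing the PDGESV/PDGESVX contrast above: the expert routine exposes every knob and caller-managed workspace; the simple wrapper fills in defaults and allocates automatically.

```python
def gesv_expert(a, b, *, equilibrate, refine_steps, workspace):
    """Hypothetical 'expert driver': all options explicit, returns (x, info).
    (The toy body ignores equilibrate/refine_steps; a real one would not.)"""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):  # Gaussian elimination with partial pivoting
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = workspace  # caller-supplied output/work array, LAPACK-style
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x, 0

def solve(a, b):
    """Hypothetical 'simple driver', the A\\b of the pair:
    defaults for every option, workspace allocated automatically."""
    x, info = gesv_expert(a, b, equilibrate=True, refine_steps=0,
                          workspace=[0.0] * len(b))
    return x

x = solve([[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0])  # [1.0, 2.0]
```

Keeping both entry points, as the slide suggests, lets naïve callers use `solve` while experts keep full control.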
Outline
• Overview of Par Lab
  - Motivation & Scope
  - Driving Applications
  - Need for Parallel Libraries and Frameworks
• Parallel Libraries
  - Success Metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required Functionality
  - Ease of use
• Summary of meeting goals, other talks
46
Some goals for the meeting
• Introduce ParLab
• Describe numerical library efforts in detail
• Exchange information
  - User needs, tools, goals
• Identify opportunities for collaboration
47
Summary of other talks (1)
• Monday, April 28 (531 Cory)
  - 12:00 - 12:45 Jim Demmel - Overview of PAR Lab / Numerical Libraries
  - 12:45 - 1:00 Avneesh Sud (Microsoft) - Introduction to library effort at Microsoft
  - 1:00 - 1:45 Sam Williams / Ankit Jain - Tuning Sparse-matrix-vector multiply / Parallel OSKI
  - 1:45 - 1:50 Break
  - 1:50 - 2:20 Marghoob Mohiyuddin - Avoiding Communication in SpMV-like kernels
  - 2:20 - 2:50 Mark Hoemmen - Avoiding communication in Krylov Subspace Methods
  - 2:50 - 3:00 Break
  - 3:00 - 3:30 Rajesh Nishtala - Tuning collective communication
  - 3:30 - 4:00 Yozo Hida - High accuracy linear algebra
  - 4:00 - 4:25 Jason Riedy - Iterative Refinement in linear algebra
  - 4:25 - 4:30 Break
  - 4:30 - 5:00 Tony Keaveny - Medical Image Analysis in PAR Lab
  - 5:00 - 5:30 Ras Bodik / Armando Solar-Lezama - Program synthesis by Sketching
  - 5:30 - 6:00 Vasily Volkov - Linear Algebra on GPUs
48
Summary of other talks (2)
• Tuesday, April 29 (Wozniak Lounge)
  - 9:00 - 10:00 Kathy Yelick - Programming Systems for PAR Lab
  - 10:00 - 10:30 Kaushik Datta - Tuning Stencils
  - 10:30 - 11:00 Xiaoye Li - Parallel sparse LU factorization
  - 11:00 - 11:30 Osni Marques - Parallel Symmetric Eigensolvers
  - 11:30 - 12:00 Krste Asanovic / Heidi Pan - Thread system
49
EXTRA SLIDES
50
P.S. Parallel Revolution May Fail
• John Hennessy, President, Stanford University, 1/07:
  "…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry."
  "A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4:10, 1/07.
• 100% failure rate of Parallel Computer Companies
  - Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, …
• What if IT goes from a growth industry to a replacement industry?
  - If SW can't effectively use 32, 64, ... cores per chip
    → SW no faster on new computer
    → Only buy if computer wears out
[Chart: Millions of PCs / year, 0–300, 1985–2015]
51
5 Themes of Par Lab
1. Applications
   • Compelling apps drive top-down research agenda
2. Identify Common Computational Patterns
   • Breaking through disciplinary boundaries
3. Developing Parallel Software with Productivity, Efficiency, and Correctness
   • 2 Layers + Coordination & Composition Language + Autotuning
4. OS and Architecture
   • Composable primitives, not packaged solutions
   • Deconstruction, Fast barrier synchronization, Partitions
5. Diagnosing Power/Performance Bottlenecks
52
Compelling Laptop/Handheld Apps (David Wessel)
• Musicians have an insatiable appetite for computation
  - More channels, instruments, more processing, more interaction!
  - Latency must be low (5 ms)
  - Must be reliable (no clicks)
• Music Enhancer
  - Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
  - Laptop/Handheld recreate 3D sound over ear buds
• Hearing Augmenter
  - Laptop/Handheld as accelerator for hearing aide
• Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for Laptop/Handheld
[Photo: Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: 10-inch-diameter icosahedron incorporating 120 tweeters.]
54
Content-Based Image Retrieval (Kurt Keutzer)
[Diagram: Query by example → Similarity Metric → Candidate Results → Relevance Feedback → Final Result, against an Image Database of 1000's of images]
• Built around key characteristics of personal databases
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects
55
Coronary Artery Disease (Tony Keaveny)
[Images: Before / After]
• Modeling to help patient compliance?
  - 450k deaths/year, 16M w. symptom, 72M BP
• Massively parallel, real-time variations
  - CFD, FE solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol
56
Compelling Laptop/Handheld Apps (Nelson Morgan)
• Meeting Diarist
  - Laptops/Handhelds at meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
• Teleconference speaker identifier, speech helper
  - L/Hs used for teleconference, identifies who is speaking, "closed caption" hint of what is being said
57
Compelling Laptop/Handheld Apps
• Health Coach
  - Since laptop/handheld always with you, record images of all meals, weigh plate before and after, analyze calories consumed so far
    - "What if I order a pizza for my next meal? A salad?"
  - Since laptop/handheld always with you, record amount of exercise so far, show how body would look if maintain this exercise and diet pattern next 3 months
    - "What would I look like if I regularly ran less? Further?"
• Face Recognizer/Name Whisperer
  - Laptop/handheld scans faces, matches image database, whispers name in ear (relies on Content-Based Image Retrieval)
58
Parallel Browser (Ras Bodik)
• Goal: Desktop quality browsing on handhelds
  - Enabled by 4G networks, better output devices
• Bottlenecks to parallelize
  - Parsing, Rendering, Scripting
• "SkipJax"
  - Parallel replacement for JavaScript/AJAX
  - Based on Brown's FlapJax
59
Theme 2. Use design patterns instead of benchmarks? (Kurt Keutzer)
• How invent parallel systems of the future when tied to old code, programming models, CPUs of the past?
• Look for common design patterns (see A Pattern Language, Christopher Alexander, 1975)
  1. Embedded Computing (42 EEMBC benchmarks)
  2. Desktop/Server Computing (28 SPEC2006)
  3. Data Base / Text Mining Software
  4. Games/Graphics/Vision
  5. Machine Learning
  6. High Performance Computing (Original "7 Dwarfs")
• Result: 13 "Motifs" (use "motif" instead of "dwarf" when going from 7 to 13)
60
4 Valuable Roles of Motifs
1. "Anti-benchmarks"
   • Motifs not tied to code or language artifacts → encourage innovation in algorithms, languages, data structures, and/or hardware
2. Universal, understandable vocabulary, at least at high level
   • To talk across disciplinary boundaries
3. Bootstrapping: Parallelize parallel research
   • Allow analysis of HW & SW design without waiting years for full apps
4. Targets for libraries (see later)
62
Themes 1 and 2 Summary
• Application-Driven Research (top down) vs. CS Solution-Driven Research (bottom up)
• Drill down on (initially) 5 app areas to guide research agenda
• Motifs to guide design of apps and implement via a programming framework per motif
• Motifs help break through traditional interfaces
  - Benchmarking, multidisciplinary conversations, parallelizing parallel research, and target for libraries
63
Theme 3: Developing Parallel SW (Kurt Keutzer and Kathy Yelick)
• Observation: Use Motifs as design patterns
• Design patterns are implemented as 2 kinds of frameworks:
  1. Traditional numerical parallel library
  2. Library where a supplied function is applied to data in parallel

Numerical Libraries: Dense matrices; Sparse matrices; Spectral; Combinational; Finite state machines
Function Application Libraries: MapReduce; Dynamic programming; Backtracking/B&B; N-Body; (Un)Structured Grid
Spanning both: Graph traversal, graphical models

• Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce library, mapping 1D FFTs and then transposing (generalized reduce)
65
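The FFT-as-map-plus-transpose remark can be made concrete with the classic "four-step" factorization: view a length n = n1·n2 signal as an n1 × n2 array, map a 1D DFT over one axis, scale by twiddle factors, map a 1D DFT over the other axis, then transpose. A pure-Python sketch, with a naive O(n²) DFT standing in for the tuned 1D FFT a real framework would supply:

```python
import cmath

def dft(xs):
    """Naive O(n^2) DFT, the 'mapped' 1D kernel."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * j * k / n)
                for j, x in enumerate(xs)) for k in range(n)]

def fft_via_map_transpose(x, n1, n2):
    """Four-step FFT of x (length n1*n2), indexing j = j1 + n1*j2."""
    n = n1 * n2
    w = cmath.exp(-2j * cmath.pi / n)
    # Step 1 (map): length-n2 DFT of each strided 'column' x[j1::n1]
    rows = [dft(x[j1::n1]) for j1 in range(n1)]
    # Step 2: twiddle factors w^(j1*k2)
    rows = [[w ** (j1 * k2) * v for k2, v in enumerate(row)]
            for j1, row in enumerate(rows)]
    # Step 3 (map): length-n1 DFT down each column k2
    cols = [dft([rows[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]
    # Step 4 (generalized transpose): X[k2 + n2*k1] = cols[k2][k1]
    return [cols[k % n2][k // n2] for k in range(n)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = fft_via_map_transpose(x, 2, 3)
ref = dft(x)   # direct DFT for comparison
```

The two "map a DFT over an axis" steps are exactly where a function-application framework would parallelize.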
C & C Language Requirements (Kathy Yelick)
Applications specify C&C language requirements:
• Constructs for creating application frameworks
• Primitive parallelism constructs:
  - Data parallelism
  - Divide-and-conquer parallelism
  - Event-driven execution
• Constructs for composing programming frameworks:
  - Frameworks require independence
  - Independence is proven at instantiation with a variety of techniques
• Needs to have low runtime overhead and the ability to measure and control performance
67
Ensuring Correctness (Koushik Sen)
• Productivity Layer
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: Remove chance for concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
• Efficiency Layer: Check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on frameworks with sequential code as specification
  - Automatic detection of races, deadlocks
68
21st Century Code Generation (Demmel, Yelick)
• Problem: generating optimal code is like searching for a needle in a haystack
  - Manycore even more diverse
• New approach: "Auto-tuners"
  - 1st generate program variations of combinations of optimizations (blocking, prefetching, …) and data structures
  - Then compile and run to heuristically search for the best code for that computer
  - Examples: PHiPAC (BLAS), ATLAS (BLAS), Spiral (DSP), FFTW (FFT)
  - Example: Sparse Matrix (SpMV) for 4 multicores — fastest SpMV; optimizations: BCOO v. BCSR data structures, NUMA, 16b vs. 32b indices, …
[Figure: Search space for block sizes (dense matrix): axes are block dimensions, temperature is speed]
69
Theme 4: OS and Architecture (Krste Asanovic, John Kubiatowicz)
• Traditional OSes brittle, insecure, memory hogs
  - Traditional monolithic OS image uses lots of precious memory, 100s–1000s of times over (e.g., AIX uses GBs of DRAM / CPU)
• How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
  - Efficiency instead of performance, to capture energy as well as performance
  - Other challenges: power limit, design and verification costs, low yield, higher error rates
• How prototype ideas fast enough to run real SW?
71
Deconstructing Operating Systems
• Resurgence of interest in virtual machines
  - Hypervisor: thin SW layer btw guest OS and HW
• Future OS: libraries where only functions needed are linked into app, on top of thin hypervisor providing protection and sharing of resources
  - Opportunity for OS innovation
• Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within a partition
72
HW Solution: Small is Beautiful
• Want Software Composable Primitives, Not Hardware Packaged Solutions
  • “You’re not going fast if you’re headed in the wrong direction”
  • Transactional Memory is usually a Packaged Solution
• Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, SIMD PEs
  • Small cores not much slower than large cores
  • Parallel is the energy-efficient path to performance: CV²F
  • Lower threshold and supply voltages lower energy per op
• Configurable Memory Hierarchy (Cell v. Clovertown)
  • Can configure on-chip memory as cache or local RAM
  • Programmable DMA to move data without occupying CPU
  • Cache coherence: mostly HW, but SW handlers for complex cases
  • Hardware logging of memory writes to allow rollback
73
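The CV²F point can be made concrete with the standard dynamic-power model (a textbook approximation, not a Par Lab result):

```latex
\begin{aligned}
P_{\mathrm{dyn}} &= \alpha\, C V^{2} f \\
\text{1 core at } (V,\, f):\quad P_1 &= \alpha\, C V^{2} f \\
\text{2 cores at } (V/2,\, f/2):\quad P_2 &= 2\,\alpha\, C \left(\tfrac{V}{2}\right)^{2} \tfrac{f}{2} = \tfrac{1}{4}\,\alpha\, C V^{2} f
\end{aligned}
```

Two slow cores deliver the same aggregate throughput (2 × f/2 = f) at roughly a quarter of the dynamic power, which is why many small cores at lower voltage and frequency beat one big fast core on energy efficiency (assuming voltage can scale down with frequency, and ignoring leakage).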
Build Academic Manycore from FPGAs
• As ~10 CPUs will fit in a Field Programmable Gate Array (FPGA), build a 1000-CPU system from ~100 FPGAs?
  • 8 32-bit simple “soft core” RISC CPUs at 100 MHz in 2004 (Virtex-II)
  • FPGA generations every 1.5 yrs; ~2X CPUs, ~1.2X clock rate
• HW research community does logic design (“gate shareware”) to create out-of-the-box Manycore
  • E.g., 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ~150 MHz/CPU in 2007
  • Ideal for heterogeneous chip architectures
• “Research Accelerator for Multiple Processors” as a vehicle to lure more researchers to the parallel challenge and decrease time to parallel solution
  • RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
74
1008 Core “RAMP Blue”
(Wawrzynek, Krasnov, … at Berkeley)
• 1008 = 12 32-bit RISC cores / FPGA, 4 FPGAs/board, 21 boards
  • Simple MicroBlaze soft cores @ 90 MHz
  • Full star-connection between modules
• NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S)
  • UPC versions (C plus shared-memory abstraction): CG, EP, IS, MG
• RAMPants creating HW & SW for manycore community using next-gen FPGAs
  • Chuck Thacker & Microsoft designing next boards
  • 3rd party to manufacture and sell boards: 1H08
  • Gateware, Software BSD open source
  • RAMP Gold for Par Lab: new CPU
75
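The core count is just the product of the three factors on the slide:

```python
cores_per_fpga = 12   # 32-bit MicroBlaze soft cores per FPGA
fpgas_per_board = 4
boards = 21

total_cores = cores_per_fpga * fpgas_per_board * boards
print(total_cores)  # 1008
```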
Par Lab Research Overview
[Stack diagram repeated from slide 70: applications over motifs, C&CL productivity layer, efficiency layer, OS Libraries & Services, Legacy OS / Hypervisor, Multicore/GPGPU and RAMP Manycore hardware, with Correctness and Diagnosing Power/Performance as crosscutting themes.]
Easy to write correct programs that run efficiently on manycore
76
Theme 5: Diagnosing Power/Performance Bottlenecks (Jim Demmel)
• Collect data on Power/Performance bottlenecks
• Aid autotuner, scheduler, OS in adapting the system
• Turn data into useful information that can help the efficiency-level programmer improve the system?
  • E.g., % peak power, % peak memory BW, % CPU, % network
  • E.g., sample traces of critical paths
• Turn data into useful information that can help the productivity-level programmer improve the app?
  • Where am I spending my time in my program?
  • If I change it like this, what is the impact on Power/Performance?
  • Rely on machine learning to find correlations in data, predict Power/Performance?
77
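A minimal software-only sketch of the “where am I spending my time?” question, using Python's standard profiler (the two kernels are stand-ins; the Par Lab tooling described here would also sample hardware power and performance counters, which cProfile does not):

```python
import cProfile
import io
import pstats

def kernel_a():
    # stand-in compute-bound kernel
    return sum(i * i for i in range(200_000))

def kernel_b():
    # stand-in sort kernel
    return sorted(range(100_000), key=lambda i: -i)

def app():
    kernel_a()
    kernel_b()

prof = cProfile.Profile()
prof.runcall(app)

# Rank functions by cumulative time: the answer to
# "where am I spending my time in my program?"
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

The same data, correlated with counter samples (% peak memory BW, % peak power), is what would let a tool predict the Power/Performance impact of a proposed change.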
Physical Par Lab - 5th Floor, Soda Hall
78
Impact of Automatic Performance Tuning
• Widely used in performance tuning of kernels
  • ATLAS (PHiPAC) - www.netlib.org/atlas
    • Dense BLAS, now in Matlab, many other releases
  • FFTW - www.fftw.org
    • Fast Fourier Transform and similar transforms; Wilkinson Software Prize
  • Spiral - www.spiral.net
    • Digital Signal Processing
  • Communication Collectives (UCB, UTK)
• Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), UHFFT (Houston), POET (UTSA), …
• More projects (PERI, TOPS2, CScADS), conferences, government reports, …
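A user-visible consequence of these projects: NumPy's matrix multiply dispatches to whatever tuned BLAS it was built against (ATLAS, OpenBLAS, MKL, ...), so autotuned kernels are reached with no user effort (this sketch assumes NumPy is installed; which BLAS backs it varies by build):

```python
import numpy as np

n = 256
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

# '@' calls the BLAS dgemm NumPy was linked against; if that BLAS is
# ATLAS or OpenBLAS, its block sizes and unrollings were chosen by
# autotuning at library build time, transparently to this code.
C = A @ B
assert np.allclose(C, A.dot(B))
print(C.shape)  # (256, 256)
```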