Dense Linear Algebra

[Soccer] is a very simple game. It’s just very hard to play it simple.

- Johan Cruyff

RvdG

PACC2011, Sept. 2011


Robert van de Geijn

Department of Computer Science

Institute for Computational Engineering and Sciences

The University of Texas at Austin


Outline

Motivation

What is FLAME?

Deriving algorithms to be correct

Representing algorithms in code

Of blocked algorithms and algorithms-by-blocks

Runtime support for multicore, GPU, and multiGPU

Extensions to distributed memory platforms

Related work

Conclusion


Moments of Inspiration

**Birth of multithreaded libflame**

- Aug. 2006: an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
- Sept. 2006: working prototype (by G. Quintana)
- Oct. 2006: grant proposal (to NSF, later funded)
- Jan. 2007: paper submitted (to SPAA07, accepted)
- April 2007: released with libflame R1.0

**Birth of multiGPU libflame**

- Fall 2007: runtime used to manage data and tasks on a single GPU (UJI, Spain)
- March 2008: NVIDIA donates a 4-GPU Tesla S870 system
- Two hours after unboxing the boards, multiple heuristics for the multiGPU runtime were implemented
- Then the power cord fried…

Birth of MultiGPU libflame

(photos: the system, shortly after unboxing)

G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP'09.

What Supports our Productivity/Performance?

Deep understanding of the domain

Foundational computer science

Derivation of algorithms

Software implementation of hardware techniques

Blocking for performance

Abstraction

Separation of concerns



What is FLAME?

A notation for expressing linear algebra algorithms

A methodology for deriving such algorithms

A set of abstractions for representing such algorithms

In LaTeX, M-script, C, etc.

A modern library (libflame)

Alternative to BLAS, LAPACK, ScaLAPACK, and related efforts

Many new contributions to theory and practice of dense linear algebra

Also banded and Krylov subspace methods

A set of tools supporting the above

Mechanical derivation

Automatic generation of code

Design-by-Transformation (DxT)


Who is FLAME?



Deriving Algorithms to be Correct

Goal: include all algorithms for a given operation, then pick the right algorithm for the given architecture.

Problem: how to find the right algorithms.

Solution: formal derivation (Hoare, Dijkstra, …): given an operation, systematically derive the family of algorithms for computing it.

Notation

The notation used to express an algorithm should reflect how the algorithm is naturally explained.


Example: The Cholesky Factorization

Lower triangular case:

    A = L L^T

Key in the solution of s.p.d. linear systems:

    A x = b
    (L L^T) x = b
    L y = b        (solve for y)
    L^T x = y      (solve for x)
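Once L is available, the two triangular solves can be sketched in plain C (an illustrative sketch, not libflame code; assumes the lower triangle of L stored column-major with leading dimension n):

```c
#include <assert.h>

/* Solve A x = b given the Cholesky factor L (A = L L^T), n x n,
   column-major, lower triangular. On entry x holds b; on exit, x. */
void chol_solve(int n, const double *L, double *x)
{
    /* Forward substitution: L y = b. */
    for (int j = 0; j < n; j++) {
        x[j] /= L[j + j * n];
        for (int i = j + 1; i < n; i++)
            x[i] -= L[i + j * n] * x[j];
    }
    /* Back substitution: L^T x = y. */
    for (int j = n - 1; j >= 0; j--) {
        for (int i = j + 1; i < n; i++)
            x[j] -= L[i + j * n] * x[i];
        x[j] /= L[j + j * n];
    }
}
```

For example, with L = [2 0 0; 1 2 0; 1 1 2] (so A = L L^T = [4 2 2; 2 5 3; 2 3 6]) and b = (8, 10, 11), the solve returns x = (1, 1, 1).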

Algorithm Loop: Repartition

    / ATL |  *  \        / A00   |  *    *  \
    |-----+-----|   ->   | a10^T | α11   *  |
    \ ABL | ABR /        \ A20   | a21  A22 /

Indexing operations only.

Algorithm Loop: Update

    α11 := sqrt(α11)
    a21 := a21 / α11
    A22 := A22 - a21 a21^T

The real computation.

Algorithm Loop: Merging

The updated blocks are merged back into the 2×2 partitioning:

    / A00   |  *    *  \        / ATL |  *  \
    | a10^T | α11   *  |   ->   |-----+-----|
    \ A20   | a21  A22 /        \ ABL | ABR /

Indexing operation only.

Worksheet for Cholesky Factorization


Mechanical Derivation of Algorithms

Mechanical development from the mathematical specification *A = L L^T*, via a mechanical procedure.

Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation. UT-Austin. August 2006.

Is Formal Derivation Practical?

libflame:

- 128+ matrix operations
- 1389+ implementations of algorithms

Test suite created in 2011:

- 126,756 tests executed
- Only 3 minor bugs found in the library… (now fixed)

Impact on (Single) GPU Computing

(performance chart vs. CUBLAS 2009; CUBLAS has been optimized since)

Impact on (Single) GPU Computing

(performance chart vs. CUBLAS 2009)

Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation. Univ. Jaume I. May 2011.

Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37. Universidad Jaume I, updated May 21, 2009.

**A Sampling of Functionality**

Operations: Level-3 BLAS, Cholesky, LU with partial pivoting, LU with incremental pivoting, QR (UT), LQ (UT), SPD/HPD inversion, triangular inversion, triangular Sylvester, triangular Lyapunov, up-and-downdate (UT), SVD, EVD.

(The slide tabulated, per operation, the support status in classic FLAME and in the SuperMatrix multithreaded/multiGPU backends, with entries y, soon, next week, N.A., or via lapack2flame.)


Representing Algorithms in Code

Code should closely resemble how an algorithm is presented so that no bugs can be introduced when translating an algorithm to code.


Representing Algorithms in Code

http://www.cs.utexas.edu/users/flame/Spark/

Spark + APIs for C, F77, Matlab, LabView, and LaTeX.

FLAME/C API

```c
FLA_Part_2x2( A,    &ATL, &ATR,
                    &ABL, &ABR,     0, 0, FLA_TL );

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

  FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00,  /**/ &a01,     &A02,
                      /* ************* */    /* ************************ */
                                             &a10t, /**/ &alpha11, &a12t,
                         ABL, /**/ ABR,      &A20,  /**/ &a21,     &A22,
                         1, 1, FLA_BR );

  /*----------------------------------------------------------------*/

  FLA_Sqrt( alpha11 );
  FLA_Inv_scal( alpha11, a21 );
  FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
           FLA_MINUS_ONE, a21, A22 );

  /*----------------------------------------------------------------*/

  FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,      A00,  a01,     /**/ A02,
                                                  a10t, alpha11, /**/ a12t,
                         /* ************** */   /* ********************** */
                            &ABL, /**/ &ABR,      A20,  a21,     /**/ A22,
                            FLA_TL );
}
```

For now, libflame employs an external BLAS: GotoBLAS, MKL, ACML, CUBLAS.


Multicore/MultiGPU: Issues

Manage computation:

- Assignment of tasks to cores and/or GPUs
- Granularity is important

Manage memory:

- Manage data transfer between "host" and the caches of the cores, or between host and GPU local memories
- Granularity is important
- Keep the data in local memory as long as possible

Where have we seen this before?

Computer architecture, late 1960s:

- Superscalar units proposed
- Unit of data: floating point number
- Unit of computation: floating point operation
- Examine dependencies
- Execute out-of-order, prefetch, cache data, etc., to keep the computational units busy
- Extract parallelism from a sequential instruction stream

R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D (1967).

Basis for the exploitation of ILP on current superscalar processors!

Of Blocks and Tasks

Dense matrix computation:

- Unit of data: block in the matrix
- Unit of computation (task): operation with blocks
- Dependency: input/output of an operation with blocks
- Instruction stream: sequential libflame code, which generates the DAG
- The runtime system schedules the tasks
- Goal: minimize data transfer and maximize utilization
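How a runtime can recover the DAG from the sequential stream can be sketched as follows (a toy model, not the SuperMatrix API; the block ids, task records, and sizes are illustrative). A new task depends on every earlier task with which it has a read-after-write, write-after-read, or write-after-write conflict on some block:

```c
#include <assert.h>

#define MAX_TASKS 64
#define MAX_ARGS  4

typedef enum { READ, WRITE } Access;
typedef struct { int block; Access mode; } Arg;
typedef struct { Arg args[MAX_ARGS]; int nargs; } Task;

static Task tasks[MAX_TASKS];
static int  ntasks = 0;
static int  dep[MAX_TASKS][MAX_TASKS];  /* dep[i][j] != 0: task j waits on task i */

static int conflicts(Access a, Access b)
{
    return a == WRITE || b == WRITE;    /* RAW, WAR, or WAW */
}

/* Record one task from the sequential stream and its dependencies
   on all earlier tasks; returns the task's id. */
int enqueue(const Arg *args, int nargs)
{
    int t = ntasks++;
    tasks[t].nargs = nargs;
    for (int k = 0; k < nargs; k++)
        tasks[t].args[k] = args[k];
    for (int i = 0; i < t; i++)         /* scan all earlier tasks */
        for (int k = 0; k < nargs; k++)
            for (int m = 0; m < tasks[i].nargs; m++)
                if (tasks[i].args[m].block == args[k].block &&
                    conflicts(tasks[i].args[m].mode, args[k].mode))
                    dep[i][t] = 1;
    return t;
}
```

Enqueuing one Cholesky iteration as Chol(A11), Trsm(A11, A21), Syrk(A21, A22) yields the edges Chol -> Trsm and Trsm -> Syrk, but no edge from Chol to Syrk, since the two touch no common block.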

Review: Blocked Algorithms

Cholesky factorization, one iteration:

    A11 := L11 = Chol(A11)
    A21 := L21 = A21 L11^{-T}
    A22 := A22 - L21 L21^T

(figure: the parts of the matrix touched in the 1st, 2nd, and 3rd iterations)

Blocked Algorithms

Cholesky factorization *A = L L^T*, expressed with the APIs + tools:

```c
FLA_Part_2x2 (…);

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

  FLA_Repart_2x2_to_3x3 (…);

  /*------------------------------------------*/

  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
  FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );

  /*------------------------------------------*/

  FLA_Cont_with_3x3_to_2x2 (…);
}
```

Simple Parallelization: Blocked Algorithms

Link with a multithreaded BLAS: the blocked FLAME code is unchanged, and the FLA_Chol, FLA_Trsm, and FLA_Syrk calls execute in parallel inside the BLAS.

Blocked Algorithms

There is more parallelism!

- Inside the same iteration
- In different iterations

(figure: overlap between the 1st and 2nd iterations)

Coding Algorithm-by-Blocks

Algorithm-by-blocks: the same updates (A11 := L11; A21 := A21 L11^{-T}; A22 := A22 - L21 L21^T), but the Trsm and Syrk become FLASH_ calls on matrices stored by blocks:

```c
FLA_Part_2x2 (…);

while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){

  FLA_Repart_2x2_to_3x3 (…);

  /*------------------------------------------*/

  FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
  FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
  FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );

  /*------------------------------------------*/

  FLA_Cont_with_3x3_to_2x2 (…);
}
```


Generating a DAG

Executing the sequential FLASH code does not perform the operations directly; instead, each call enqueues tasks, producing a DAG.

(figure: the DAG for the code above, tasks numbered 1-10)

Managing Tasks and Blocks

Separation of concerns:

- The sequential libflame routine generates the DAG
- The runtime system (SuperMatrix) manages and schedules the DAG

As one moves from one architecture to another, only the runtime system needs to be updated:

- Multicore
- Out-of-core
- Single GPU
- MultiGPU
- Distributed runtime
- …
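The scheduling half can be sketched as follows (illustrative, not SuperMatrix code): the DAG is kept as per-task counts of unmet dependencies, and a worker loop repeatedly executes a ready task and releases its successors. The hard-wired four-task DAG below (task 0 feeds tasks 1 and 2, which both feed task 3) is a hypothetical stand-in for one Cholesky iteration:

```c
#include <assert.h>

#define NT 4

int nsucc[NT]   = { 2, 1, 1, 0 };               /* successors per task */
int succ[NT][2] = { {1, 2}, {3, 0}, {3, 0}, {0, 0} };
int unmet[NT]   = { 0, 1, 1, 2 };               /* unmet dependencies  */

int order[NT];                                  /* execution order     */
int executed = 0;

/* Single-threaded stand-in for the worker loop of a runtime system. */
void run_all(void)
{
    while (executed < NT) {
        for (int t = 0; t < NT; t++) {
            if (unmet[t] == 0) {
                order[executed++] = t;          /* "execute" task t    */
                unmet[t] = -1;                  /* never pick it again */
                for (int s = 0; s < nsucc[t]; s++)
                    unmet[succ[t][s]]--;        /* release successors  */
                break;
            }
        }
    }
}
```

A real runtime runs this loop from many worker threads and picks among the ready tasks with heuristics, e.g. favoring tasks whose blocks are already in a device's memory.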

Runtime System: SuperMatrix

(figure: the DAG of tasks 1-10 is handed to SuperMatrix, which uses a heuristic to schedule them onto the multicore)

Runtime System for GPU: SuperMatrix Accelerator

(figure: SuperMatrix schedules the DAG onto CPU + GPU and manages the data transfer)

Runtime System for MultiGPU: SuperMatrix Multi-Accelerator

(figure: SuperMatrix schedules the DAG onto CPU + multiple GPUs and manages the data transfer)

MultiGPU

How do we program these?

(figure: CPU(s) connected through the PCI-e bus and an interconnect to GPUs #0 through #3)

MultiGPU: a User's View

```c
FLA_Obj A;

// Initialize conventional matrix: buffer, m, rs, cs
// Obtain storage blocksize, # of threads: b, n_threads

FLA_Init();
FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );

FLASH_Queue_set_num_threads( n_threads );
FLASH_Queue_enable_gpu();

FLASH_Chol( FLA_LOWER_TRIANGULAR, A );

FLASH_Obj_free( &A );
FLA_Finalize();
```

MultiGPU: Under the Cover

Naïve approach:

- Before execution, transfer data to the device
- Call CUBLAS operations (implementation "someone else's problem")
- Upon completion, retrieve results back to the host

Result: poor data locality.

MultiGPU: Under the Cover

How do we program these? View the system as:

- a shared-memory multiprocessor, and
- a Distributed Shared Memory (DSM) architecture

MultiGPU: Under the Cover

View the system as a shared-memory multiprocessor (like a multicore processor with hardware coherence): each GPU plays the role of a processor with its own cache.

(figure: P0+Cache0 through P3+Cache3 mapped onto GPUs #0 through #3)

MultiGPU: Under the Cover

Software Distributed-Shared Memory (DSM):

- Software: flexibility vs. efficiency
- The underlying distributed memory is hidden
- Reduce memory transfers using write-back, write-invalidate, …
- A well-known approach, not too efficient as middleware for general applications
- The regularity of dense linear algebra makes a difference!
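The write-back/write-invalidate behavior can be modeled in a few lines (a toy model, purely illustrative; the names, sizes, and fetch-on-write policy are assumptions, not the libflame implementation). Each device keeps a valid bit and a dirty bit per block; reads miss once and then hit, a write on one device invalidates the other devices' copies, and a dirty block is written back only when the host needs it:

```c
#include <assert.h>

#define NDEV    4
#define NBLOCKS 16

static int valid[NDEV][NBLOCKS];
static int dirty[NDEV][NBLOCKS];
static int transfers = 0;           /* host <-> device block copies */

/* Block b is needed as an input on device d. */
void dev_read(int d, int b)
{
    if (!valid[d][b]) { transfers++; valid[d][b] = 1; }
}

/* Block b is produced (written) on device d. */
void dev_write(int d, int b)
{
    dev_read(d, b);                 /* fetch-on-write, for simplicity */
    dirty[d][b] = 1;
    for (int o = 0; o < NDEV; o++)  /* write-invalidate other copies  */
        if (o != d) valid[o][b] = 0;
}

/* The host needs an up-to-date copy of block b. */
void host_read(int b)
{
    for (int d = 0; d < NDEV; d++)
        if (valid[d][b] && dirty[d][b]) {
            transfers++;            /* write-back the dirty copy */
            dirty[d][b] = 0;
        }
}
```

Repeated reads of a cached block cost nothing, and a device may write a block many times before a single write-back, which is exactly what makes the scheme pay off for the regular access patterns of dense linear algebra.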

MultiGPU: Under the Cover

Reduce the number of data transfers: the runtime handles device memory as a software cache:

- Operate at the block level
- Software flexibility
- Write-back
- Write-invalidate

(figure: SuperMatrix dispatching the DAG across the CPU(s) and GPUs #0 through #3 over the PCI-e bus)

MultiGPU: Under the Cover

Within each iteration of the blocked loop,

```c
FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
            FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
            FLA_ONE, A11, A21 );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
            FLA_MINUS_ONE, A21, FLA_ONE, A22 );
```

SuperMatrix proceeds as follows:

- Factor A11 on the host
- Transfer A11 from the host to the appropriate devices before it is used in subsequent computations (write-update)
- Cache A11 in the receiving device(s) in case it is needed in subsequent computations
- Send blocks to the devices and perform the Trsm on blocks of A21 (hopefully using the cached A11); keep the updated A21 in the device until it is needed by other GPU(s) (write-back)
- Send blocks to the devices and perform the Syrk/Gemm on blocks of A22 (hopefully using cached blocks of A21); keep the updated A22 in the device until it is needed by other GPU(s) (write-back)

C := C + A B^T on S1070 (Tesla × 4): performance chart

Cholesky on S1070 (Tesla × 4): performance charts

Sampling of LAPACK functionality on S2050 (Fermi × 4): performance charts


libflame for Clusters (+ Accelerators)

PLAPACK

- Distributed memory (MPI)
- Inspired FLAME
- Recently modified so that each node can have a GPU
- Keeps data in GPU memory as much as possible

Elemental (Jack Poulson)

- Distributed memory (MPI)
- Inspired by FLAME/PLAPACK
- Can use a GPU at each node/core

libflame + SuperMatrix

- Runtime schedules tasks and data transfer
- Appropriate for small clusters

PLAPACK + GPU Accelerators

Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla).

Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010.

Targeting Clusters with GPUs

SuperMatrix Distributed Runtime.

Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi).

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Elemental

Cholesky Factorization


Elemental vs. ScaLAPACK

Cholesky on 8192 cores of BlueGene/P (performance chart).

Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem).

Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS. Submitted.

Single-Chip Cloud Computer

Intel SCC research processor

48-core concept vehicle

Created for many-core software research

Custom communication library (RCCE)


SCC Results

- 48 Pentium cores
- MPI replaced by RCCE

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience. To appear.


Related Work

Data-flow parallelism, dynamic scheduling, runtimes:

- Cilk
- OpenMP (task queues)
- StarSs (SMPSs)
- StarPU
- Threading Building Blocks (TBB)
- …

What we have is specific to dense linear algebra.

Dense Linear Algebra Libraries

| Target Platform | LAPACK Project | FLAME Project |
| --- | --- | --- |
| Sequential | LAPACK | libflame |
| Sequential + multithreaded BLAS | LAPACK | libflame |
| Multicore/multithreaded | PLASMA | libflame+SuperMatrix |
| Multicore + out-of-order scheduling | PLASMA+Quark | libflame+SuperMatrix |
| CPU + single GPU | MAGMA | libflame+SuperMatrix |
| Multicore + multiGPU | DAGuE? | libflame+SuperMatrix |
| Distributed memory | ScaLAPACK | libflame+SuperMatrix, PLAPACK, Elemental |
| Distributed memory + GPU | DAGuE? ScaLAPACK? | libflame+SuperMatrix, PLAPACK, Elemental |
| Out-of-core | ? | libflame+SuperMatrix |

Comparison with Quark

(performance comparison chart)

Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.


Conclusions

Programmability is the key to harnessing parallel computation:

- One code, many target platforms

Formal derivation provides confidence in code:

- If there is a problem, it is not in the library!

Separation of concerns:

- The library developer derives algorithms and codes them
- Execution of the routines generates a DAG
- Parallelism, temporal locality, and spatial locality are captured in the DAG
- The runtime system uses appropriate heuristics to schedule

- Identify units of data and units of computation
- Write a sequential program that generates a DAG
- Hand the DAG to a runtime for scheduling

The Future

Currently: the library is an instantiation in code.

Future:

- Create a repository of algorithms, expert knowledge about algorithms, and knowledge about a target architecture
- Mechanically generate a library for a target architecture, exactly as an expert would
- Design-by-Transformation (DxT)

Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58. 2011.

Availability

Everything that has been discussed is available under the LGPL or BSD license:

- libflame + SuperMatrix: http://www.cs.utexas.edu/users/flame/
- Elemental: http://code.google.com/p/elemental/

[Soccer] is a very simple game. It’s just very hard to play it simple.

- Johan Cruyff

RvdG