PACC2011 - Department of Computer Science


[Soccer] is a very simple game. It’s just very hard to play it simple.

- Johan Cruyff

RvdG


Designing a Library to be Multi-Accelerator Ready: A Case Study

PACC2011, Sept. 2011

Robert van de Geijn

Department of Computer Science

Institute for Computational Engineering and Sciences

The University of Texas at Austin


Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Moments of Inspiration

Birth of multi-threaded libflame
 Aug. 2006 - an insight: libflame + algorithm-by-blocks + out-of-order scheduling (runtime) = multithreaded library
 Sept. 2006 - working prototype (by G. Quintana)
 Oct. 2006 - grant proposal (to NSF, later funded)
 Jan. 2007 - paper submitted (to SPAA07, accepted)
 April 2007 - released with libflame R1.0

Birth of multi-GPU libflame
 Fall 2007 - runtime used to manage data and tasks on a single GPU (UJI, Spain)
 March 2008 - NVIDIA donates a 4-GPU Tesla S870 system
 Two hours after unboxing the boards, multiple heuristics for the multi-GPU runtime were implemented
 Then the power cord fried…

Birth of MultiGPU libflame

[Photos: "After two hours" / "Shortly after"]

G. Quintana, Igual, E. Quintana, van de Geijn. "Solving Dense Linear Algebra Problems on Platforms with Multiple Hardware Accelerators." PPoPP'09.

What Supports our Productivity/Performance?

 Deep understanding of the domain

 Foundational computer science

 Derivation of algorithms

 Software implementation of hardware techniques

 Blocking for performance

 Abstraction

 Separation of concerns


Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


What is FLAME?

 A notation for expressing linear algebra algorithms

 A methodology for deriving such algorithms

 A set of abstractions for representing such algorithms

 In LaTeX, M-script, C, etc.

 A modern library (libflame)

 Alternative to BLAS, LAPACK, ScaLAPACK, and related efforts

 Many new contributions to theory and practice of dense linear algebra

 Also banded and Krylov subspace methods

 A set of tools supporting the above

 Mechanical derivation

 Automatic generation of code

 Design-by-Transformation (DxT)


Who is FLAME?


Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Deriving Algorithms to be Correct

 Include all algorithms for a given operation:

 Pick the right algorithm for the given architecture

 Problem:

 How to find the right algorithm

 Solution:

 Formal derivation (Hoare, Dijkstra, …):

Given an operation, systematically derive a family of algorithms for computing it.


Notation

 The notation used to express an algorithm should reflect how the algorithm is naturally explained.


Example: The Cholesky Factorization

 Lower triangular case:   A = L L^T

 Key in the solution of s.p.d. linear systems:

   A x = b
   (L L^T) x = b
   L y = b    ->  y
   L^T x = y  ->  x
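To make the two triangular solves concrete, here is a minimal plain-C sketch (not libflame code; the function name spd_solve and column-major storage with leading dimension ldl are assumptions for illustration). After factoring A = L L^T, it overwrites b with x by solving L y = b and then L^T x = y.

    /* Solve A x = b given A = L L^T; L is stored in the lower triangle,
       column-major with leading dimension ldl.  On entry x holds b. */
    void spd_solve( int n, const double *L, int ldl, double *x )
    {
        for ( int j = 0; j < n; j++ ) {            /* forward solve: L y = b */
            x[ j ] /= L[ j + j * ldl ];
            for ( int i = j + 1; i < n; i++ )
                x[ i ] -= L[ i + j * ldl ] * x[ j ];
        }
        for ( int j = n - 1; j >= 0; j-- ) {       /* back solve: L^T x = y  */
            for ( int i = j + 1; i < n; i++ )
                x[ j ] -= L[ i + j * ldl ] * x[ i ];
            x[ j ] /= L[ j + j * ldl ];
        }
    }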

Algorithm Loop: Repartition

  ( A_TL   *   )        ( A_00     *      *   )
  (             )  ->   ( a_10^T  α_11    *   )
  ( A_BL  A_BR  )       ( A_20    a_21   A_22 )

  Indexing operations

Algorithm Loop: Update

  α_11 := sqrt( α_11 )
  a_21 := a_21 / α_11
  A_22 := A_22 − a_21 a_21^T    (lower triangle only)

  Real computation
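For reference, the same unblocked updates in plain C (a minimal sketch, not the FLAME/C code shown later; column-major storage, lower-triangle-only updates, and the name chol_unb are assumed for illustration):

    #include <math.h>

    /* Unblocked right-looking Cholesky: for each column j,
       alpha11 := sqrt(alpha11), a21 := a21/alpha11, A22 := A22 - a21 a21^T. */
    void chol_unb( int n, double *A, int lda )
    {
        for ( int j = 0; j < n; j++ ) {
            A[ j + j * lda ] = sqrt( A[ j + j * lda ] );
            for ( int i = j + 1; i < n; i++ )
                A[ i + j * lda ] /= A[ j + j * lda ];
            for ( int k = j + 1; k < n; k++ )          /* lower triangle of A22 */
                for ( int i = k; i < n; i++ )
                    A[ i + k * lda ] -= A[ i + j * lda ] * A[ k + j * lda ];
        }
    }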

Algorithm Loop: Merging

  ( A_00     *      *   )        ( A_TL   *   )
  ( a_10^T  α_11    *   )   ->   (             )
  ( A_20    a_21   A_22 )        ( A_BL  A_BR  )

  (The just-computed α_11 and a_21 are absorbed into A_TL and A_BL.)

  Indexing operation

Worksheet for Cholesky Factorization
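A sketch of what such a worksheet is built from (standard FLAME methodology, using the notation of the preceding slides): partition A and L conformally, substitute into A = L L^T, and read off the partitioned matrix expression (PME); loop invariants are chosen from the PME, and the right-looking variant used above maintains the invariant shown last.

\[
A \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}, \quad
L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\;\Rightarrow\;
\begin{cases}
A_{TL} = L_{TL} L_{TL}^T \\
A_{BL} = L_{BL} L_{TL}^T \\
A_{BR} = L_{BL} L_{BL}^T + L_{BR} L_{BR}^T
\end{cases}
\]

Loop invariant of the right-looking variant (hats denote original contents):

\[
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\]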


Mechanical Derivation of Algorithms

 Mechanical development from the mathematical specification  A = L L^T

Mechanical procedure

Paolo Bientinesi. "Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms." Ph.D. Dissertation. UT-Austin. August 2006.

Is Formal Derivation Practical?

 libflame :

 128+ matrix operations

 1389+ implementations of algorithms

 Test suite created in 2011

 126,756 tests executed

 Only 3 minor bugs in library… (now fixed)


Impact on (Single) GPU Computing

CUBLAS 2009 (it has since been optimized)


Impact on (Single) GPU Computing

CUBLAS 2009

Fran Igual. "Matrix Computations on Graphics Processors and Clusters of GPUs." Ph.D. Dissertation. Univ. Jaume I. May 2011.

Igual, G. Quintana, and van de Geijn. "Level-3 BLAS on a GPU: Picking the Low Hanging Fruit." FLAME Working Note #37. Universidad Jaume I, updated May 21, 2009.

A Sampling of Functionality

Operation:
  Level-3 BLAS
  Cholesky
  LU with partial pivoting
  LU with incremental pivoting
  QR (UT)
  LQ (UT)
  SPD/HPD inversion
  Triangular inversion
  Triangular Sylvester
  Triangular Lyapunov
  Up-and-downdate (UT)
  SVD
  EVD

Columns: Classic FLAME | SuperMatrix (multithreaded / multiGPU) | lapack2flame

[Table: per-operation availability in each column, marked y / soon / next week / N.A.]

Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Representing Algorithms in Code

 Code should closely resemble how an algorithm is presented so that no bugs can be introduced when translating an algorithm to code.


Representing Algorithms in Code

http://www.cs.utexas.edu/users/flame/Spark/

Spark + APIs: C, F77, Matlab, LabView, LaTeX

FLAME/C API

FLA_Part_2x2 ( A, &ATL, &ATR,

&ABL, &ABR, 0, 0, FLA_TL ); while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ){ b = min( FLA_Obj_length( ABR ), nb_alg );

FLA_Repart_2x2_to_3x3 (

ATL, /**/ ATR, &A00, /**/ &a01, &A02,

/* ************* */ /* ************************** */

&a10t,/**/ &alpha11, &a12t,

ABL, /**/ ABR, &A20, /**/ &a21, &A22,

1, 1, FLA_BR );

/*--------------------------------------*/

FLA_Sqrt( alpha11 );

FLA_Inv_scal( alpha11, a21 );

FLA_Syr( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,

FLA_MINUS_ONE, a21, A22 );

/*--------------------------------------*/

FLA_Cont_with_3x3_to_2x2 (

&ATL, /**/ &ATR, A00, a01, /**/ A02, a10t, alpha11, /**/ a12t,

/* ************** */ /* ************************/

&ABL, /**/ &ABR, A20, a21, /**/ A22,

FLA_TL );

}

For now , libflame employs external BLAS: GotoBLAS, MKL, ACML, CUBLAS

PACC2011, Sept. 2011 26

Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Multicore/MultiGPU - Issues

 Manage computation
  Assignment of tasks to cores and/or GPUs
  Granularity is important

 Manage memory
  Manage data transfer between the "host" and the caches of the cores, or between the host and the GPUs' local memories
  Granularity is important
  Keep data in local memory as long as possible

Where have we seen this before?

 Computer architecture, late 1960s:
  Superscalar units proposed
  Unit of data: floating-point number
  Unit of computation: floating-point operation
  Examine dependencies
  Execute out-of-order, prefetch, cache data, etc., to keep the computational units busy
  Extract parallelism from a sequential instruction stream

R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units." IBM J. of R&D, 1967.

 Basis for exploitation of ILP on current superscalar processors!


Of Blocks and Tasks

 Dense matrix computation

 Unit of data: block in matrix

 Unit of computation (task): operation with blocks

 Dependency: input/output of operation with blocks

 Instruction stream: sequential libflame code

 Generates DAG

 Runtime system schedules tasks

 Goal: minimize data transfer and maximize utilization

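As an illustration of how the DAG falls out of the sequential order, here is a toy C sketch (not the SuperMatrix implementation; every name is made up): each block remembers the last task that wrote it, and a later task that reads or writes the block picks up a dependence on that writer.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct { const char *name; } task;
    typedef struct { task *last_writer; } block;

    /* Hypothetical helper: a real runtime would insert a DAG edge here. */
    void add_dependence( task *from, task *to )
    {
        printf( "edge: %s -> %s\n", from->name, to->name );
    }

    /* Task t reads block b: it must wait for the block's last writer. */
    void note_read( task *t, block *b )
    {
        if ( b->last_writer != NULL && b->last_writer != t )
            add_dependence( b->last_writer, t );
    }

    /* Task t writes block b: order it after the previous writer, then take ownership. */
    void note_write( task *t, block *b )
    {
        note_read( t, b );
        b->last_writer = t;
    }

    int main( void )
    {
        /* First iteration of blocked Cholesky on a 2x2 blocked matrix. */
        block A11 = { NULL }, A21 = { NULL }, A22 = { NULL };
        task  chol = { "Chol(A11)" }, trsm = { "Trsm(A11,A21)" }, syrk = { "Syrk(A21,A22)" };

        note_write( &chol, &A11 );                              /* A11 := Chol(A11)    */
        note_read ( &trsm, &A11 );  note_write( &trsm, &A21 );  /* A21 := A21 A11^-T   */
        note_read ( &syrk, &A21 );  note_write( &syrk, &A22 );  /* A22 -= A21 A21^T    */
        return 0;
    }

Running this prints the two edges Chol(A11) -> Trsm(A11,A21) and Trsm(A11,A21) -> Syrk(A21,A22), i.e. the first fragment of the DAG the runtime schedules.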

Review: Blocked Algorithms

 Cholesky factorization, one iteration:

  A_11 := L_11                    where A_11 = L_11 L_11^T
  A_21 := L_21 = A_21 L_11^{-T}
  A_22 := A_22 − L_21 L_21^T

  [Figure: parts of the matrix touched in the 1st, 2nd, and 3rd iterations]

Blocked Algorithms

 Cholesky factorization:  A = L L^T        (APIs + Tools)

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }

Simple Parallelization: Blocked Algorithms

 Link with multi-threaded BLAS

  A_11 := L_11                    where A_11 = L_11 L_11^T
  A_21 := L_21 = A_21 L_11^{-T}
  A_22 := A_22 − L_21 L_21^T

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }

Blocked Algorithms

 There is more parallelism!
  Inside the same iteration
  In different iterations

  [Figure: tasks from the 1st and 2nd iterations]

Coding Algorithm-by-Blocks

 Algorithm-by-blocks:

  A_11 := L_11                    where A_11 = L_11 L_11^T
  A_21 := L_21 = A_21 L_11^{-T}
  A_22 := A_22 − L_21 L_21^T

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }

Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Generating a DAG

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }

  [Figure: the resulting DAG of tasks 1-10]

Managing Tasks and Blocks

 Separation of concerns:

 Sequential libflame routine generates the DAG

 Runtime system (SuperMatrix) manages and schedules the DAG

 As one moves from one architecture to another, only the runtime system needs to be updated

 Multicore

 Out-of-core

 Single GPU

 MultiGPU

 Distributed Runtime

 …


Runtime system - SuperMatrix

  [Figure: DAG of tasks 1-10 -> SuperMatrix + heuristic -> multicore execution]

Runtime system for GPU - SuperMatrix + accelerator

  [Figure: DAG of tasks 1-10 -> SuperMatrix (manages data transfer) -> CPU + GPU]

Runtime system for MultiGPU - SuperMatrix + multi-accelerator

  [Figure: DAG of tasks 1-10 -> SuperMatrix (manages data transfer) -> CPU + multiple GPUs]

MultiGPU

 How do we program these?

  [Figure: CPU(s) connected via the PCI-e bus and an interconnect to GPU #0, #1, #2, and #3]

MultiGPU: a User's View

  FLA_Obj A;
  // Initialize conventional matrix: buffer, m, rs, cs
  // Obtain storage blocksize, # of threads: b, n_threads

  FLA_Init();

  FLASH_Obj_create( FLA_DOUBLE, m, m, 1, &b, &A );
  FLASH_Copy_buffer_to_hier( m, m, buffer, rs, cs, 0, 0, A );

  FLASH_Queue_set_num_threads( n_threads );
  FLASH_Queue_enable_gpu();

  FLASH_Chol( FLA_LOWER_TRIANGULAR, A );

  FLASH_Obj_free( &A );
  FLA_Finalize();

MultiGPU: Under the Cover

 Naive approach:
  Before execution, transfer data to the device
  Call CUBLAS operations (implementation is "someone else's problem")
  Upon completion, retrieve the results back to the host

  Result: poor data locality

  [Figure: CPU(s) connected via the PCI-e bus and interconnect to GPU #0-#3]
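A minimal sketch of that naive pattern for a single C := C + A B^T call (illustrative only, not libflame/SuperMatrix code; it assumes the legacy CUBLAS interface of that era, square n x n column-major matrices, a prior cublasInit(), and no error checking):

    #include <cuda_runtime.h>
    #include <cublas.h>                  /* legacy CUBLAS interface */

    /* Ship operands to the GPU, compute, ship the result back - on every call. */
    void naive_gemm_nt( int n, const double *A, const double *B, double *C )
    {
        size_t bytes = (size_t) n * n * sizeof( double );
        double *dA, *dB, *dC;
        cudaMalloc( (void **) &dA, bytes );
        cudaMalloc( (void **) &dB, bytes );
        cudaMalloc( (void **) &dC, bytes );

        cudaMemcpy( dA, A, bytes, cudaMemcpyHostToDevice );   /* transfer in  */
        cudaMemcpy( dB, B, bytes, cudaMemcpyHostToDevice );
        cudaMemcpy( dC, C, bytes, cudaMemcpyHostToDevice );

        cublasDgemm( 'N', 'T', n, n, n, 1.0, dA, n, dB, n, 1.0, dC, n );  /* C += A B^T */

        cudaMemcpy( C, dC, bytes, cudaMemcpyDeviceToHost );   /* transfer out */
        cudaFree( dA );  cudaFree( dB );  cudaFree( dC );
    }

Every block crosses the PCI-e bus on every call, which is exactly the data-locality problem the software cache described on the following slides avoids.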

MultiGPU: Under the Cover

 How do we program these? View the system as a…
  Shared-memory multiprocessor + distributed shared memory (DSM) architecture

  [Figure: CPU(s) connected via the PCI-e bus and interconnect to GPU #0-#3]

MultiGPU: Under the Cover

 View the system as a shared-memory multiprocessor (like a multi-core processor with hardware coherence)

  [Figure: each GPU #0-#3 is viewed as a processor P0-P3 with its own cache, attached to the CPU(s) via the PCI-e bus and interconnect]

MultiGPU: Under the Cover

 Software Distributed-Shared Memory (DSM)
  Software: flexibility vs. efficiency
  Underlying distributed memory is hidden
  Reduce memory transfers using write-back, write-invalidate, …
  A well-known approach, not very efficient as middleware for general applications; the regularity of dense linear algebra makes a difference!

MultiGPU: Under the Cover

 Reduce the number of data transfers: the runtime handles device memory as a software cache
  Operates at the block level
  Software -> flexibility
  Write-back
  Write-invalidate

  [Figure: SuperMatrix schedules the DAG (tasks 1-10) across the CPU(s) and GPU #0-#3 over the PCI-e bus and interconnect]
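To illustrate the policy (a toy C sketch, not the SuperMatrix data structures; all names are invented): a per-block directory records which devices hold a copy and whether that copy is newer than the host's, so a read fetches a block at most once and a write invalidates stale copies, while the result stays on the device until another device or the host needs it.

    #include <string.h>

    #define N_GPUS 4

    /* Directory entry for one matrix block. */
    typedef struct {
        int present[ N_GPUS ];   /* does GPU i hold a copy?                             */
        int dirty[ N_GPUS ];     /* is GPU i's copy newer than the host's? (write-back) */
    } block_dir;

    /* GPU 'gpu' is about to read the block; returns 1 if a host->GPU transfer is needed. */
    int acquire_for_read( block_dir *b, int gpu )
    {
        int transfer = !b->present[ gpu ];
        b->present[ gpu ] = 1;
        return transfer;
    }

    /* GPU 'gpu' is about to overwrite the block: all other copies become stale
       (write-invalidate); the new copy stays on the device, marked dirty, and is
       flushed to the host only when someone else asks for it (write-back). */
    void acquire_for_write( block_dir *b, int gpu )
    {
        memset( b->present, 0, sizeof b->present );
        memset( b->dirty,   0, sizeof b->dirty );
        b->present[ gpu ] = 1;
        b->dirty[ gpu ]   = 1;
    }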

MultiGPU: Under the Cover

  FLA_Part_2x2(…);
  while ( FLA_Obj_length(ATL) < FLA_Obj_length(A) ){
    FLA_Repart_2x2_to_3x3(…);
    /*--------------------------------------*/
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A11 );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*--------------------------------------*/
    FLA_Cont_with_3x3_to_2x2(…);
  }

  SuperMatrix:
  • Factor A11 on the host

Multi-GPU: Under the Cover

  (Same loop as on the previous slide; now the FLASH_Trsm that consumes A11.)

  SuperMatrix:
  • Transfer A11 from the host to the appropriate devices before using it in subsequent computations (write-update)

Multi-GPU: Under the Cover

  (Same loop as above.)

  SuperMatrix:
  • Cache A11 in the receiving device(s) in case it is needed in subsequent computations

Multi-GPU: Under the Cover

  (Same loop as above; now the FLASH_Trsm updates of A21.)

  SuperMatrix:
  • Send blocks to the devices
  • Perform Trsm on blocks of A21 (hopefully using the cached A11)
  • Keep the updated A21 on the device until it is needed by other GPU(s) (write-back)

Multi-GPU: Under the Cover

  (Same loop as above; now the FLASH_Syrk updates of A22.)

  SuperMatrix:
  • Send blocks to the devices
  • Perform Syrk/Gemm on blocks of A22 (hopefully using cached blocks of A21)
  • Keep the updated A22 on the device until it is needed by other GPU(s) (write-back)

C = C + A B^T on S1070 (Tesla x 4)


Cholesky on S1070 (Tesla x 4)


Cholesky on S1070 (Tesla x 4)


Sampling of LAPACK functionality on S2050 (Fermi x 4)


Sampling of LAPACK functionality on S2050 (Fermi x 4)


Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Conclusion


libflame for Clusters (+ Accelerators)

 PLAPACK
  Distributed memory (MPI)
  Inspired FLAME
  Recently modified so that each node can have a GPU
  Keeps data in GPU memory as much as possible

 Elemental (Jack Poulson)
  Distributed memory (MPI)
  Inspired by FLAME/PLAPACK
  Can use a GPU at each node/core

 libflame + SuperMatrix
  Runtime schedules tasks and data transfers
  Appropriate for small clusters

PLAPACK + GPU accelerators

Each node: Xeon Nehalem (8 cores) + 2 NVIDIA C1060 (Tesla)

Fogue, Igual, E. Quintana, van de Geijn. "Retargeting PLAPACK to Clusters with Hardware Accelerators." WEHA 2010.

Targeting Clusters with GPUs

SuperMatrix Distributed Runtime

Each node: Xeon Nehalem (8 cores) + 1 NVIDIA C2050 (Fermi)

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Elemental

Cholesky Factorization


Elemental vs. ScaLAPACK

Cholesky on 8192 cores of BlueGene/P

Elemental has full ScaLAPACK functionality (except the nonsymmetric eigenvalue problem).

Poulson, Marker, Hammond, Romero, van de Geijn. "Elemental: A New Framework for Distributed Memory Dense Matrix Computations." ACM TOMS. Submitted.

Single-Chip Cloud Computer

 Intel SCC research processor

 48-core concept vehicle

 Created for many-core software research

 Custom communication library (RCCE)


SCC Results

- 48 Pentium cores
- MPI replaced by RCCE

Igual, G. Quintana, van de Geijn. "Scheduling Algorithms-by-Blocks on Small Clusters." Concurrency and Computation: Practice and Experience. In review.

Marker, Chan, Poulson, van de Geijn, Van der Wijngaart, Mattson, Kubaska. "Programming Many-Core Architectures - A Case Study: Dense Matrix Computations on the Intel SCC Processor." Concurrency and Computation: Practice and Experience. To appear.

Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Related work

 Data-flow parallelism, dynamic scheduling, runtime systems:
  Cilk
  OpenMP (task queues)
  StarSs (SMPSs)
  StarPU
  Threading Building Blocks (TBB)
  …

 What we have is specific to dense linear algebra

Dense Linear Algebra Libraries

  Target Platform                      | LAPACK Project      | FLAME Project
  Sequential                           | LAPACK              | libflame
  Sequential + multithreaded BLAS      | LAPACK              | libflame
  Multicore/multithreaded              | PLASMA              | libflame + SuperMatrix
  Multicore + out-of-order scheduling  | PLASMA + Quark      | libflame + SuperMatrix
  CPU + single GPU                     | MAGMA               | libflame + SuperMatrix
  Multicore + multiGPU                 | DAGuE?              | libflame + SuperMatrix
  Distributed memory                   | ScaLAPACK           | libflame + SuperMatrix, PLAPACK, Elemental
  Distributed memory + GPU             | DAGuE? ScaLAPACK?   | libflame + SuperMatrix, PLAPACK, Elemental
  Out-of-Core                          | ?                   | libflame + SuperMatrix

Comparison with Quark

Agullo, Bouwmeester, Dongarra, Kurzak, Langou, Rosenberg. "Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures." VecPar, 2010.


Outline

 Motivation

 What is FLAME?

 Deriving algorithms to be correct

 Representing algorithms in code

 Of blocked algorithms and algorithms-by-blocks

 Runtime support for multicore, GPU, and multiGPU

 Extensions to distributed memory platforms

 Related work

 Conclusion


Conclusions

 Programmability is the key to harnessing parallel computation
  One code, many target platforms

 Formal derivation provides confidence in the code
  If there is a problem, it is not in the library!

 Separation of concerns
  The library developer derives algorithms and codes them
  Execution of routines generates the DAG
  Parallelism, temporal locality, and spatial locality are captured in the DAG
  The runtime system uses appropriate heuristics to schedule

What does this mean for you?

One successful approach:

 Identify units of data and units of computation

 Write a sequential program that generates a DAG

 Hand DAG to runtime for scheduling

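A self-contained toy of that recipe in C (not SuperMatrix; a real runtime would dispatch ready tasks to cores or GPUs instead of printing): tasks count their unsatisfied inputs, the sequential "program" only wires up the edges, and a scheduling loop executes whatever is ready and releases its successors.

    #include <stdio.h>

    #define MAX_SUCC 4

    typedef struct task {
        const char  *name;
        int          n_deps;              /* unsatisfied input dependencies */
        int          n_succ;
        struct task *succ[ MAX_SUCC ];    /* tasks that consume our output  */
    } task;

    void add_edge( task *from, task *to )
    {
        from->succ[ from->n_succ++ ] = to;
        to->n_deps++;
    }

    void run_dag( task **ready, int n_ready )
    {
        while ( n_ready > 0 ) {
            task *t = ready[ --n_ready ];            /* pick any ready task        */
            printf( "executing %s\n", t->name );     /* would go to a core or GPU  */
            for ( int i = 0; i < t->n_succ; i++ )    /* retire: release successors */
                if ( --t->succ[ i ]->n_deps == 0 )
                    ready[ n_ready++ ] = t->succ[ i ];
        }
    }

    int main( void )
    {
        /* First Cholesky iteration on a 2x2 blocked matrix:
           Chol(A11) -> Trsm(A21) -> Syrk(A22). */
        task chol = { "Chol A11", 0, 0, { 0 } };
        task trsm = { "Trsm A21", 0, 0, { 0 } };
        task syrk = { "Syrk A22", 0, 0, { 0 } };
        task *ready[ 8 ] = { &chol };

        add_edge( &chol, &trsm );
        add_edge( &trsm, &syrk );
        run_dag( ready, 1 );
        return 0;
    }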

The Future

 Currently: Library is an instantiation in code

 Future

 Create repository of algorithms, expert knowledge about algorithms, and knowledge about a target architecture

 Mechanically generate a library for a target architecture, exactly as an expert would

 Design-by-Transformation (DxT)

Bryan Marker, Andy Terrel, Jack Poulson, Don Batory, and Robert van de Geijn. "Mechanizing the Expert Dense Linear Algebra Developer." FLAME Working Note #58. 2011.

Availability

 Everything that has been discussed is available under LGPL license or BSD license

 libflame + SuperMatrix

 http://www.cs.utexas.edu/users/flame/

 Elemental

 http://code.google.com/p/elemental/


[Soccer] is a very simple game. It’s just very hard to play it simple.

- Johan Cruyff

RvdG
