Implementing Boolean matrix multiplication on a GPU

Alexander Okhotin
Department of Mathematics, University of Turku, Finland
Academy of Finland
DESY, Hamburg, Germany
12 April 2010

Background

- High-performance hardware is parallel.
- Most algorithms are (partially) sequential.
- Find the bottleneck and parallelize it.
- The speaker's case: syntax analysis for general context-free grammars.
  - Sequential nature.
  - Typically implemented combinatorially.
  - Can be done via Boolean matrix multiplication.
    - Valiant (1975): theoretical bound.
    - Okhotin (2010): refactored and generalized.
    - Efficiently parallelized.
- Implementing on a Graphics Processing Unit.

Part I: GPU programming

Graphics Processing Units

- Designed for 3D graphics in computer games.
  - Shading.
  - Texturing.
  - Per-pixel effects.
  - The same function for each pixel.
  - Function as a kernel (program).
  - Pixel as a work item.
- General-purpose computation on GPUs.
  - Tens of cores, each with multiple ALUs.
  - Approaching 1 teraflop.
  - Priced as a consumer toy.
- Best price-to-performance ratio.
- Special programming techniques.

GPU programming

- Proprietary interfaces: NVIDIA CUDA, ATI Stream.
- Device-independent language: OpenCL.
  - Supported by NVIDIA and ATI drivers.
  - CPU implementation.
- Kernel: program running on the GPU.
  - Dialect of C.
  - Computes one "work item".
  - Executed for a grid of work items.
- Host code running on a CPU.
  - Allocate GPU memory.
  - Load and compile a kernel.
  - Give arguments to the kernel.

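To make these host-side steps concrete, here is a minimal sketch using the standard OpenCL 1.x C API. It is illustrative only: error handling is omitted, the first GPU found is used, and the kernel shown (a trivial element-wise scaling) is a placeholder rather than anything from the talk.

    #include <CL/cl.h>

    /* Placeholder kernel source: doubles every element of a float array. */
    static const char *src =
        "__kernel void scale(__global float *a) {"
        "    size_t i = get_global_id(0);"
        "    a[i] = 2.0f * a[i];"
        "}";

    static void run(cl_context ctx, cl_command_queue queue, cl_device_id dev, size_t n)
    {
        /* Allocate GPU (global) memory. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);

        /* Load and compile the kernel. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

        /* Give arguments to the kernel and run it on a 1-d grid of n work items. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        size_t global = n;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clFinish(queue);
    }

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, NULL);
        run(ctx, queue, dev, 1024);
        return 0;
    }
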
Execution and memory model

- 2–32 multithreaded cores, each with 8–16 ALUs.
- Many threads running on a core, grouped into warps.
- Main system memory ("host memory"): accessed through the bus.
- Global memory: accessed by all GPU cores (up to 150 GB/s).
  - 64–512-bit bus.
  - Multiple threads should access adjacent words.
- Local memory: shared by all threads on a core.
  - Much faster.
  - Often used to cache data.
- Private memory, owned by a thread.
- Computation divided into work-items.
  - 1-d, 2-d or 3-d grid of work-items.
  - Block of work-items: work-group.

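A small OpenCL C sketch (not from the slides) of how a kernel sees this model, assuming a 16×16 work-group and a grid whose size is a multiple of 16: each work-item locates itself in the 2-d grid, the work-group stages a tile of global data in local memory, and a barrier synchronizes the group before the data is used.

    __kernel void tile_demo(__global const float *src, __global float *dst, int width)
    {
        /* Position of this work-item in the global 2-d grid. */
        int gx = (int)get_global_id(0);
        int gy = (int)get_global_id(1);

        /* Position within the work-group (one group runs on one core). */
        int lx = (int)get_local_id(0);
        int ly = (int)get_local_id(1);

        /* Local memory: shared by the work-group, much faster than global
           memory; the 16x16 tile size is an illustrative choice. */
        __local float tile[16][16];

        /* Adjacent work-items read adjacent words: coalesced global access. */
        tile[ly][lx] = src[gy * width + gx];

        /* Wait until the whole group has filled the tile. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Private memory: ordinary local variables, owned by one thread. */
        float v = tile[ly][lx];
        dst[gy * width + gx] = v;
    }
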
Primitive example

Example (Jacobi method):
1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for A and B.
3. Create kernel with arguments (n, n, A, B).
4. Invoke with work items {0, ..., n−3} × {0, ..., n−3}.
5. Wait for termination.

- It works... though very inefficiently:
  - Reading 4 times.
  - Memory alignment ignored.

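The kernel source itself does not survive in this text; a plausible reconstruction of such a Jacobi-style kernel is sketched below. Work item (i, j) of the (n−2)×(n−2) grid updates interior point (i+1, j+1), which matches the invocation range above; every element of A is fetched from global memory by four different work items (the "reading 4 times" inefficiency), and nothing is cached in local memory.

    /* Hypothetical Jacobi-style kernel: not the talk's exact code, only a
       sketch of the idea. Arguments follow the slide: (n, n, A, B). */
    __kernel void jacobi(int w, int h, __global const float *A, __global float *B)
    {
        /* Work item (i, j) in {0, ..., n-3} x {0, ..., n-3} updates the
           interior point (i+1, j+1). */
        int i = (int)get_global_id(0) + 1;
        int j = (int)get_global_id(1) + 1;

        /* Each neighbour is read directly from global memory, so every
           element of A is fetched by four different work items. */
        B[j * w + i] = 0.25f * (A[j * w + (i - 1)] + A[j * w + (i + 1)] +
                                A[(j - 1) * w + i] + A[(j + 1) * w + i]);
    }
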
Part II: Boolean matrix multiplication

Matrix multiplication as such

- S: a semiring.
- A ∈ S^{m×ℓ}, B ∈ S^{ℓ×n}.
- Their product, C ∈ S^{m×n}:

      C_{i,j} = \sum_{k=1}^{\ell} A_{i,k} \cdot B_{k,j}

- ℓmn multiplications, (ℓ−1)mn additions.
- In this talk:
  - S: {0, 1} = B;
  - sum: disjunction;
  - product: conjunction;
  - square matrices: m = ℓ = n.
- Θ(n³) bit operations.

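As a baseline for the later slides, here is an illustrative sketch (not code from the talk) of the Θ(n³)-operation product over (B, ∨, ∧), with each row packed into 64-bit words: whenever A_{i,k} = 1, row k of B is OR-ed into row i of C.

    #include <stdint.h>
    #include <stddef.h>

    /* Direct Boolean product C = A x B, rows packed into n/64 words.
       Assumes n is a multiple of 64 and C is zero-initialized. */
    void bool_mm_direct(size_t n, const uint64_t *A, const uint64_t *B, uint64_t *C)
    {
        size_t words = n / 64;
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                if ((A[i * words + k / 64] >> (k % 64)) & 1)
                    /* Union of rows: OR row k of B into row i of C. */
                    for (size_t w = 0; w < words; w++)
                        C[i * words + w] |= B[k * words + w];
    }
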
Fast matrix multiplication over a ring

- # of multiplications for 2 × 2 matrices? 8:

      \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
      \times
      \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
      =
      \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\
                      a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}

- Assume S is a ring.
  - ∀x ∈ S ∃(−x) ∈ S: x + (−x) = 0.
- Strassen (1969): 2 × 2 matrices using 7 multiplications.
  - First, compute 14 linear combinations.
  - Second, calculate their products.
  - Their linear combinations yield the results.
  - Larger matrices: as block matrices,
    \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}.
  - O(n^{log₂ 7}) operations for n × n matrices.
- Coppersmith and Winograd (1990): O(n^{2.376}) operations.
- (B, ∧, ∨) is not a ring.

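The seven products are not spelled out in this extracted text; for reference, the standard Strassen formulas (seven left factors and seven right factors: the 14 linear combinations) are:

    \begin{align*}
    M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_5 &= (A_{11}+A_{12})\,B_{22}\\
    M_2 &= (A_{21}+A_{22})\,B_{11}        & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12})\\
    M_3 &= A_{11}\,(B_{12}-B_{22})        & M_7 &= (A_{12}-A_{22})(B_{21}+B_{22})\\
    M_4 &= A_{22}\,(B_{21}-B_{11})        & &\\[2pt]
    C_{11} &= M_1+M_4-M_5+M_7 & C_{12} &= M_3+M_5\\
    C_{21} &= M_2+M_4         & C_{22} &= M_1-M_2+M_3+M_6
    \end{align*}

The subtractions are exactly where additive inverses are needed, which is why the construction does not apply directly to the Boolean semiring.
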
Applying fast matrix multiplication to the Boolean semiring

- n × n Boolean matrices.
- Multiplying them in Z_{n+1}:

      \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}
      \times
      \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}
      =
      \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix}}_{\text{in } \mathbb{Z}_3}
      =
      \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}}_{\text{in } \mathbb{B}}

- One bit → ⌈log(n+1)⌉ bits.

An O(n³ / log n) method for Boolean matrices

Arlazarov et al. (1970)

- Fix k ≪ n.
- Multiply 1 × k blocks of A by k × n blocks of B.
- At most 2^k different 1 × k blocks.
- Pre-compute all 2^k products with each k × n block of B (n/k blocks).
- Look up n bits for each 1 × k block of A.
- Time complexity:

      \underbrace{2^k \cdot \tfrac{n}{k} \cdot n}_{\text{making the table}} \;+\; \underbrace{\tfrac{n^3}{k}}_{\text{multiplication}}

- 2n³ / log n operations for k = log n.

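A CPU-side sketch of this method may help before the GPU version. It is an illustrative implementation under assumptions (k = 8, n a multiple of 64, rows packed into 64-bit words as before, C zero-initialized by the caller), not the talk's code.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    void bool_mm_four_russians(size_t n, const uint64_t *A,
                               const uint64_t *B, uint64_t *C)
    {
        enum { K = 8 };
        size_t words = n / 64;
        uint64_t *T = malloc(((size_t)1 << K) * words * sizeof *T);

        for (size_t blk = 0; blk < n / K; blk++) {
            /* Table for rows blk*K .. blk*K+K-1 of B:
               T[v] = disjunction of the rows selected by the bits of v,
               built from T[v with its lowest set bit cleared]. */
            memset(T, 0, words * sizeof *T);            /* T[0] = empty union */
            for (unsigned v = 1; v < (1u << K); v++) {
                unsigned bit = 0;
                while (!((v >> bit) & 1u)) bit++;
                const uint64_t *row  = &B[(blk * K + bit) * words];
                const uint64_t *prev = &T[(size_t)(v ^ (1u << bit)) * words];
                for (size_t w = 0; w < words; w++)
                    T[v * words + w] = prev[w] | row[w];
            }
            /* The K bits of this block-column of each row of A index the
               table; the looked-up line is OR-ed into the row of C. */
            for (size_t i = 0; i < n; i++) {
                unsigned v = (unsigned)(A[i * words + (blk * K) / 64]
                                        >> ((blk * K) % 64)) & ((1u << K) - 1);
                const uint64_t *t = &T[(size_t)v * words];
                for (size_t w = 0; w < words; w++)
                    C[i * words + w] |= t[w];
            }
        }
        free(T);
    }
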
Part III: Boolean matrix multiplication on a GPU
Joint work with Christian Reitwießner (Würzburg)

Main performance considerations

- Matrices A, B ∈ B^{n×n} are on the CPU:
  - either multiply them on the CPU,
  - or send them to the GPU (and use which method?).
- If n < 200, faster to multiply than to transfer.
- If n > 50000, will not fit on the GPU.
  - Processing by parts.
- Direct n³ multiplication.
  - For n > 100 already superseded.
- Arlazarov et al.: n³ / log n operations.
  - Basic operation: union of rows.
  - Works well on a GPU.
- Strassen's method: O(n^{log₂ 7}).
  - Have to multiply ints instead of bits!
  - Inductive on n, reducing to many small matrices.

The O(n³ / log n) method on a GPU
Making a table for B

- Matrix B ∈ B^{n×n} is on the GPU.
- For each block of lines i ∈ {0, ..., n/k − 1}, create a table T[i] ∈ B^{2^k × n}.
- Line (b_{k−1} ... b_1 b_0)_2 in T[i]: disjunction of all lines with b_j = 1.
- Work items: every 64 bits in each line.
  - 2^k disjunctions of longs.
  - Threads access adjacent words.
- Another dimension: T[i] for different i.

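A hedged sketch of what such a table-building kernel could look like in OpenCL C follows; the talk's actual kernel is not in this text, and the names and data layout (rows of B packed into n/64 ulongs, all tables stored back to back in T) are assumptions. Work item (w, i) fills 64-bit word w of every line of table T[i], performing the 2^k disjunctions of longs, and adjacent work items touch adjacent words of B and T, so global accesses are coalesced.

    __kernel void build_tables(int n, int k,
                               __global const ulong *B,
                               __global ulong *T)
    {
        int w = (int)get_global_id(0);     /* which 64-bit word of a line  */
        int i = (int)get_global_id(1);     /* which block of k lines of B  */
        int words = n / 64;
        __global ulong *Ti = T + (size_t)i * (1 << k) * words;

        Ti[0 * words + w] = 0;             /* line 0: empty disjunction    */
        for (int v = 1; v < (1 << k); v++) {
            /* The lowest set bit of v selects one line of the block; the
               rest of the disjunction was already computed for v with
               that bit cleared. */
            int bit = 0;
            while (!((v >> bit) & 1)) bit++;
            Ti[v * words + w] = Ti[(v ^ (1 << bit)) * words + w]
                              | B[(size_t)(i * k + bit) * words + w];
        }
    }

The host would launch this over a 2-d grid of (n/64, n/k) work items.
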
The O(n³ / log n) method on a GPU
Multiplying the matrices

- Matrix A ∈ B^{n×n} is on the GPU.
- The n/k tables T[i] ∈ B^{2^k × n} are on the GPU.
- Compute the product A × B.
- Work items: lines of A (and C).
- Step 1: cache the line of A to local memory.
- The block-column of A determines the number of the table.
- The 1 × k block of A indexes the table.
- Disjunction with the line of C.
- Second dimension: every 64 bits in each line of T and C.

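Again as a hedged sketch (not the talk's code, with the data layout, the fixed WORDS constant and the names as assumptions): work item (w, r) computes 64-bit word w of line r of C. The work group shares line r of A, so that line is staged cooperatively in local memory first; the per-table disjunctions are accumulated in a register and stored to the line of C once at the end.

    /* n = 2048 -> 32 ulongs per line; a compile-time constant here only to
       size the local array (an illustrative simplification). */
    #define WORDS 32

    __kernel void multiply(int n, int k,
                           __global const ulong *A,
                           __global const ulong *T,
                           __global ulong *C)
    {
        int w   = (int)get_global_id(0);   /* word of the line of C        */
        int r   = (int)get_global_id(1);   /* line of A and C              */
        int lid = (int)get_local_id(0);
        int lsz = (int)get_local_size(0);

        /* Step 1: cache line r of A in local memory (shared by the group). */
        __local ulong a_line[WORDS];
        for (int j = lid; j < WORDS; j += lsz)
            a_line[j] = A[(size_t)r * WORDS + j];
        barrier(CLK_LOCAL_MEM_FENCE);

        ulong c = 0;
        int tsize = (1 << k) * WORDS;      /* ulongs per table             */
        for (int blk = 0; blk < n / k; blk++) {
            /* The 1 x k block of A in block-column blk indexes table blk. */
            int idx = (int)(a_line[(blk * k) / 64] >> ((blk * k) % 64))
                      & ((1 << k) - 1);
            c |= T[(size_t)blk * tsize + (size_t)idx * WORDS + w];
        }
        C[(size_t)r * WORDS + w] = c;
    }
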
Performance

n = 2048, k = 8.

                     CPU       Nvidia G210M             Nvidia GTS250
                               (low-end laptop GPU)     (average gaming card)
    Time             234 ms    17.4 ms                  3.3 ms
    Memory access              9.4 GB/s                 51.9 GB/s

- Basically, bandwidth-limited.
- The cores could compute more!
- Optimization: cache more in local memory.
  - Local memory: usually 16 KB per core.
  - Compute the table by parts.

Future work in this project

1. Refactor to use more local memory.
2. Implement multiplication of huge matrices.
3. Do a practical comparison with Strassen.
4. For the parsing application: better performance on smaller matrices.
   - Large matrices are handled fast enough.
   - 128×128 and 256×256 matrices dominate the running time.