Implementing Boolean matrix multiplication on a GPU

Alexander Okhotin
Department of Mathematics, University of Turku, Finland
Academy of Finland

DESY, Hamburg, Germany
12 April 2010 A. D.

Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 1 / 18

Background

- High-performance hardware is parallel.
- Most algorithms are (partially) sequential.
- Find the bottleneck and parallelize it.
- The speaker's case: syntax analysis for general context-free grammars.
  - Sequential nature.
  - Typically implemented combinatorially.
  - Can be done via Boolean matrix multiplication.
    - Valiant (1975): theoretical bound.
    - Okhotin (2010): refactored and generalized.
      - Efficiently parallelized.
- Implementing on a Graphics Processing Unit.

Part I
GPU programming
Graphics Processing Units

- Designed for 3D graphics in computer games.
  - Shading.
  - Texturing.
  - Per-pixel effects.
  - The same function for each pixel.
  - Function as a kernel (program).
  - Pixel as a work item.
- General purpose computation on GPUs.
  - Tens of cores, each with multiple ALUs.
  - Approaching 1 teraflop.
  - Priced as a consumer toy.
- Best price-to-performance ratio.
- Special programming techniques.
GPU programming

- Proprietary interfaces: NVIDIA CUDA, ATI Stream.
- Device-independent language: OpenCL.
  - Supported by NVIDIA and ATI drivers.
  - CPU implementation.
- Kernel: program running on GPU.
  - Dialect of C.
  - Computes one "work item".
  - Executed for a grid of work items.
- Host code running on a CPU.
  - Allocate GPU memory.
  - Load and compile a kernel.
  - Give arguments to the kernel.
Execution and memory model

- 2–32 multithreaded cores, each with 8–16 ALUs.
- Many threads running on a core, grouped into warps.
- Main system memory ("host memory"): accessed through the bus.
- Global memory: accessed by all GPU cores (up to 150 GB/s).
  - 64–512-bit bus.
  - Multiple threads should preferably access adjacent words.
- Local memory: shared by all threads on a core.
  - Much faster.
  - Often used to cache data.
- Private memory, owned by a thread.
- Computation divided into work-items.
  - 1D, 2D or 3D grid of work-items.
  - Block of work-items: work-group.
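The division of a grid of work-items into work-groups can be illustrated on the CPU. The following is a minimal Python sketch, not OpenCL code: the function name `work_items` and the sizes used are hypothetical, and the only fact assumed from OpenCL is that, without a global offset, a work-item's global id equals its group id times the work-group size plus its local id.

```python
def work_items(global_size, local_size):
    """Enumerate a 2D grid of work-items, split into work-groups.

    Yields (global_id, group_id, local_id) triples. The names are
    illustrative only; they mimic, but are not, the OpenCL API.
    """
    gx, gy = global_size
    lx, ly = local_size
    # In this sketch the grid must split evenly into work-groups.
    assert gx % lx == 0 and gy % ly == 0
    for group_x in range(gx // lx):
        for group_y in range(gy // ly):
            for local_x in range(lx):
                for local_y in range(ly):
                    # global id = group id * work-group size + local id
                    global_id = (group_x * lx + local_x,
                                 group_y * ly + local_y)
                    yield global_id, (group_x, group_y), (local_x, local_y)


# A 4 x 6 grid of work-items in work-groups of 2 x 3.
items = list(work_items((4, 6), (2, 3)))
```

On a GPU the work-items of one work-group run on the same core and can share its local memory; the sequential loops here only show how the ids relate.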
Primitive example

Example (Jacobi method)

1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for each of A and B.
3. Create kernel with arguments (n, n, A, B).
4. Invoke with work items {0, ..., n − 3} × {0, ..., n − 3}.
5. Wait for termination.

It works, though very inefficiently:
- Each element is read 4 times.
- Memory alignment ignored.

Part II
Boolean matrix multiplication
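Looking back at the Jacobi example that closed Part I: the kernel listing itself is not part of this text, so the following Python sketch is only an assumed reconstruction. It assumes the usual 4-neighbour averaging step, row-major storage of A and B, and that boundary values are carried over unchanged; the names `jacobi_kernel` and `jacobi_step` are hypothetical. The loops stand in for the GPU invoking the kernel once per work item in {0, ..., n − 3} × {0, ..., n − 3}.

```python
def jacobi_kernel(gx, gy, width, height, A, B):
    """Body of one work item (assumed stencil: average of 4 neighbours)."""
    # Work item (gx, gy) updates the interior cell (gx + 1, gy + 1).
    x, y = gx + 1, gy + 1
    B[y * width + x] = 0.25 * (A[(y - 1) * width + x] + A[(y + 1) * width + x]
                               + A[y * width + x - 1] + A[y * width + x + 1])


def jacobi_step(n, A):
    """Run the kernel for every work item of the (n-2) x (n-2) grid."""
    B = list(A)  # assumption: boundary values are copied unchanged
    for gy in range(n - 2):
        for gx in range(n - 2):
            jacobi_kernel(gx, gy, n, n, A, B)
    return B


n = 4
A = [0.0] * (n * n)
for x in range(n):
    A[x] = 4.0          # top boundary row held at 4.0
B = jacobi_step(n, A)
```

Each interior element of A is read by its four neighbouring work items, which is the fourfold reading the slide points out.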
Matrix multiplication as such

- S: a semiring.
- $A \in S^{m \times \ell}$, $B \in S^{\ell \times n}$.
- Their product, $C \in S^{m \times n}$:
  $$C_{i,j} = \sum_{k=1}^{\ell} A_{i,k} \cdot B_{k,j}$$
- $\ell m n$ multiplications, $(\ell - 1) m n$ additions.
- In this talk:
  - S: $\{0, 1\} = \mathbb{B}$;
  - Sum: disjunction;
  - Product: conjunction;
  - Square matrices: $m = \ell = n$.
- $\Theta(n^3)$ bit operations.
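The definition above translates directly into code. This is a minimal Python sketch under the assumption that matrices are lists of rows; the function names `semiring_matmul` and `boolean_matmul` are hypothetical. The semiring is passed in as its addition, multiplication and zero element, and the Boolean case uses disjunction and conjunction.

```python
from functools import reduce


def semiring_matmul(A, B, add, mul, zero):
    """C[i][j] = sum over k of A[i][k] * B[k][j] in the given semiring."""
    inner = len(B)                      # the shared dimension
    assert all(len(row) == inner for row in A)
    cols = len(B[0])
    return [[reduce(add, (mul(A[i][k], B[k][j]) for k in range(inner)), zero)
             for j in range(cols)]
            for i in range(len(A))]


def boolean_matmul(A, B):
    """Product over ({0, 1}, or, and): sum is disjunction, product conjunction."""
    return semiring_matmul(A, B,
                           add=lambda x, y: x | y,
                           mul=lambda x, y: x & y,
                           zero=0)


A = [[1, 0], [1, 1]]
B = [[0, 1], [1, 1]]
boolean_product = boolean_matmul(A, B)
integer_product = semiring_matmul(A, B, lambda x, y: x + y,
                                  lambda x, y: x * y, 0)
```

For square n × n inputs the three nested loops perform the Θ(n³) bit operations stated above.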
Fast matrix multiplication over a ring

- Number of multiplications for 2 × 2 matrices? 8.
  $$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \times \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{pmatrix}$$
- Assume S is a ring.
  - $\forall x \in S \; \exists (-x) \in S : x + (-x) = 0$
- Strassen (1969): 2 × 2 matrices using 7 multiplications.
  - First, compute 14 linear combinations.
  - Second, calculate their products.
  - Their linear combinations yield the results.
  - Larger matrices: as block matrices, $\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$.
  - $O(n^{\log_2 7})$ operations for n × n matrices.
- Coppersmith and Winograd (1990): $O(n^{2.376})$ operations.
- $(\mathbb{B}, \wedge, \vee)$ is not a ring.
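The three steps of Strassen's method can be made concrete for scalar entries. The sketch below uses the commonly cited form of the seven products; the slides do not list the formulas, so this particular arrangement is an assumption, and the function name `strassen_2x2` is hypothetical. The 14 factors (some of them trivial, such as a plain `b11`) are the linear combinations, `m1` to `m7` are their products, and the last lines combine the products into the result.

```python
def strassen_2x2(A, B):
    """Multiply two 2 x 2 matrices over a ring using 7 multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # Seven products of linear combinations of the entries.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Linear combinations of the products give the entries of A x B.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


product = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The subtractions are where the ring assumption is used. Applying the same formulas recursively to blocks instead of scalars gives the bound for n × n matrices.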
Applying fast matrix multiplication to the Boolean semiring

- n × n Boolean matrices.
- Multiplying them in $\mathbb{Z}_{n+1}$.
  - In $\mathbb{Z}_3$: $\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix}$
  - In $\mathbb{B}$: $\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$
  - Every nonzero entry of the product in $\mathbb{Z}_{n+1}$ corresponds to a 1 in the Boolean product.
- One bit → $\lceil \log_2 (n+1) \rceil$ bits.
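The reduction to a ring can be sketched as follows. This is a minimal Python illustration with a hypothetical function name, `boolean_matmul_via_ring`; it uses the direct cubic product modulo n + 1 as a stand-in, where a fast ring algorithm such as Strassen's would be substituted in practice. Each entry of the product is a sum of at most n ones, so it never wraps around modulo n + 1, and an entry is nonzero exactly when the Boolean product has a 1 there.

```python
def boolean_matmul_via_ring(A, B):
    """Boolean product of n x n 0/1 matrices, computed in Z_{n+1}."""
    n = len(A)
    modulus = n + 1
    # Product in the ring Z_{n+1}; any ring algorithm could be used here.
    C = [[sum(A[i][k] * B[k][j] for k in range(n)) % modulus
          for j in range(n)]
         for i in range(n)]
    # Map back to the Boolean semiring: nonzero becomes 1.
    return [[1 if entry != 0 else 0 for entry in row] for row in C]


result = boolean_matmul_via_ring([[1, 0], [1, 1]], [[0, 1], [1, 1]])
```

The price is that every matrix entry now occupies several bits instead of one, as the last bullet above notes.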
An $O(n^3 / \log n)$ method for Boolean matrices

Arlazarov et al. (1970)

- Fix $k \ll n$.
- Multiplying 1 × k blocks of A by k × n blocks of B.
- At most $2^k$ different 1 × k blocks.
- Pre-compute all $2^k$ products with each block of B ($n/k$ blocks).
- Look up n bits for each 1 × k block of A.
- Time complexity:
  $$\underbrace{2^k \cdot \frac{n}{k} \cdot n}_{\text{making the table}} + \underbrace{\frac{n^3}{k}}_{\text{multiplication}}$$
  which is $\frac{2n^3}{\log n}$ operations for $k = \log n$.

Part III
Boolean matrix multiplication on a GPU
Joint work with Christian Reitwießner (Würzburg)
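Before turning to the GPU, the table-lookup method of Arlazarov et al. from Part II can be sketched on the CPU. This is a minimal Python sketch, not the implementation from the talk: the function name `four_russians_matmul` is hypothetical, and packing each row of B into a Python integer used as a bit set is an assumption made for brevity. With this packing, the product of a 1 × k block with a k × n block is a union (bitwise OR) of the selected rows.

```python
def four_russians_matmul(A, B, k):
    """Boolean product of n x n 0/1 matrices with lookup tables of 2^k rows."""
    n = len(A)
    # Pack each row of B into an integer bit set: bit j holds B[r][j].
    b_rows = [sum(bit << j for j, bit in enumerate(row)) for row in B]
    c_rows = [0] * n
    for start in range(0, n, k):
        block = b_rows[start:start + k]   # one k x n block (last may be shorter)
        # table[mask] = union of the rows of the block selected by mask.
        table = [0] * (1 << len(block))
        for mask in range(1, len(table)):
            low = mask & -mask            # lowest set bit of mask
            # Reuse the entry without that bit and add one more row.
            table[mask] = table[mask ^ low] | block[low.bit_length() - 1]
        for i in range(n):
            # The 1 x k block of row i of A, read as an index into the table.
            mask = sum(A[i][start + t] << t for t in range(len(block)))
            c_rows[i] |= table[mask]      # look up n bits and accumulate
    return [[(c_rows[i] >> j) & 1 for j in range(n)] for i in range(n)]


product = four_russians_matmul([[1, 0], [1, 1]], [[0, 1], [1, 1]], 1)
```

Building each table costs one row union per entry, which matches the "making the table" term, and the lookup loop matches the "multiplication" term.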
Main performance considerations

Matrices $A, B \in \mathbb{B}^{n \times n}$ are on the CPU:
  either multiply them on the CPU,
  or send to the GPU (and use which method?).
If n < 200, faster to multiply than to transfer.
If n > 50000, will not fit on the GPU.
  Processing by parts.
Direct $n^3$ multiplication.
  For n > 100 already superseded.
Arlazarov et al.: $n^3 / \log n$ operations.
  Basic operation: union of rows.
  Works well on GPU.
Strassen's method: $O(n^{\log_2 7})$.
  Have to multiply ints instead of bits!
  Inductive on n, reducing to many small matrices.
The $O(n^3 / \log n)$ method on a GPU
Making a table for B

Matrix $B \in \mathbb{B}^{n \times n}$ on the GPU.
For each block of lines $i \in \{0, \ldots, \frac{n}{k} - 1\}$, create table $T[i] \in \mathbb{B}^{2^k \times n}$.
Line $(b_{k-1} \ldots b_1 b_0)_2$ in $T$: disjunction of all lines with $b_j = 1$.
Work items: every 64 bits in each line.
  $2^k$ disjunctions of longs.
  Threads access adjacent words.
Another dimension: $T[i]$ for different $i$.
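The decomposition into work items can be simulated on the CPU. The sketch below is an assumption-laden model rather than the kernel from the talk: lines are stored as lists of 64-bit words, n is a multiple of 64, k divides n, and the names `table_work_item` and `make_tables` are hypothetical. Each table line is derived from an already computed line by clearing the lowest set bit of its index, so every entry costs one disjunction of two words.

```python
WORD = 64


def table_work_item(b_words, table, i, w, k):
    """One simulated work item: fill word w of every line of table T[i].

    b_words[r][w] is the w-th 64-bit word of line r of B;
    table[p][w] receives word w of line p of T[i].
    """
    table[0][w] = 0
    for p in range(1, 1 << k):
        low = p & -p                  # lowest set bit of p
        j = low.bit_length() - 1      # its position b_j
        table[p][w] = table[p ^ low][w] | b_words[i * k + j][w]


def make_tables(b_words, n, k):
    """Run the work items over both dimensions of the index space."""
    words = n // WORD
    tables = [[[0] * words for _ in range(1 << k)] for _ in range(n // k)]
    for i in range(n // k):        # another dimension: T[i] for different i
        for w in range(words):     # neighbouring work items touch adjacent words
            table_work_item(b_words, tables[i], i, w, k)
    return tables
```

On a real device the two loops in `make_tables` would be the two dimensions of the launch grid; because work items with consecutive w read and write consecutive words, their memory accesses are adjacent.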
The $O(n^3 / \log n)$ method on a GPU
Multiplying the matrices

Matrix $A \in \mathbb{B}^{n \times n}$ on the GPU.
$\frac{n}{k}$ tables $T[i] \in \mathbb{B}^{2^k \times n}$ on the GPU.
Compute the product A × B.
Work items: lines of A (and C).
Step 1: cache the line of A to local memory.
Block-column of A determines the number of the table.
1 × k block of A indexes the table.
Disjunction with the line of C.
Second dimension: every 64 bits in each line of T and C.
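The multiplication step can be simulated in the same way. This is a sketch under assumptions that are not stated on the slides: lines are lists of 64-bit words with column c stored in bit c mod 64 of word c div 64, the tables are indexed as `tables[i][p][w]`, and k divides 64 so that a 1 × k block never straddles two words. The names `multiply_work_item` and `multiply` are hypothetical.

```python
WORD = 64


def multiply_work_item(a_words, tables, c_words, r, w, n, k):
    """One simulated work item: compute word w of line r of C = A x B."""
    line = a_words[r]        # on a GPU, this line would be cached in local memory
    mask = (1 << k) - 1
    acc = 0
    for i in range(n // k):  # block-column i of A selects table T[i]
        bit = i * k
        pattern = (line[bit // WORD] >> (bit % WORD)) & mask  # 1 x k block of A
        acc |= tables[i][pattern][w]                          # disjunction into C
    c_words[r][w] = acc


def multiply(a_words, tables, n, k):
    """Run the work items over lines (first dimension) and words (second)."""
    words = n // WORD
    c_words = [[0] * words for _ in range(n)]
    for r in range(n):
        for w in range(words):
            multiply_work_item(a_words, tables, c_words, r, w, n, k)
    return c_words
```

Every work item performs n/k table reads and almost no arithmetic, which suggests why the running time depends mainly on memory throughput.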
Performance

n = 2048, k = 8.

|               | CPU    | Nvidia G210M (low-end laptop GPU) | Nvidia GTS250 (average gaming card) |
|---------------|--------|-----------------------------------|-------------------------------------|
| Time          | 234 ms | 17.4 ms                           | 3.3 ms                              |
| Memory access |        | 9.4 GB/s                          | 51.9 GB/s                           |

Basically, bandwidth-limited.
The cores could compute more!
Optimization: cache more in local memory.
  Local memory: usually 16 KB per core.
  Compute the table by parts.
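The figures above can be put in perspective with a little arithmetic. The sketch below uses only the numbers from the slide plus two assumptions of its own: lines are bit-packed, and every table lookup fetches a full line from global memory. The traffic estimate is therefore a rough model, not a measurement.

```python
n, k = 2048, 8
cpu_ms, g210m_ms, gts250_ms = 234.0, 17.4, 3.3   # timings from the slide

speedup_g210m = cpu_ms / g210m_ms     # about 13x
speedup_gts250 = cpu_ms / gts250_ms   # about 71x

line_bytes = n // 8                               # one bit-packed line of T or C
one_table_bytes = (2 ** k) * line_bytes           # a single table T[i]
all_tables_bytes = (n // k) * one_table_bytes     # all n/k tables together
lines_in_local = (16 * 1024) // line_bytes        # table lines fitting in 16 KB

# Assumed traffic model: n lines of A, n/k lookups each, one full line per lookup.
lookup_bytes = n * (n // k) * line_bytes
```

Under these assumptions a single table takes 64 KiB, four times the 16 KB of local memory, which is consistent with the suggestion to compute the table by parts.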
Future work in this project

1. Refactor to use more local memory.
2. Implement multiplication of huge matrices.
3. Do a practical comparison with Strassen.
4. For the parsing application: better performance on smaller matrices.
   Large matrices are handled fast enough.
   128×128 and 256×256 matrices dominate the running time.