
CUDA Linear Algebra Library and Next Generation
Yukai Hung
a0934147@gmail.com
Department of Mathematics
National Taiwan University
Sparse Matrix-Vector Multiplication
Dense approach is wasteful
- unclear how to map work to parallel processors
- irregular element accesses to global memory
Sparse matrix formats
structured
- DIA: diagonal format
- ELL: ellpack format
unstructured
- CSR: compressed sparse row format
- HYB: hybrid format
- COO: coordinate format
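As a reference point for the formats listed above, here is a minimal host-only sketch of a matrix-vector product in CSR format. The struct and function names (CsrMatrix, spmv_csr) are illustrative assumptions, not part of any library mentioned in these slides. A GPU "scalar" CSR kernel would hand each iteration of the outer loop to one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container (field names are illustrative).
struct CsrMatrix {
    std::size_t num_rows;
    std::size_t num_cols;
    std::vector<std::size_t> row_offsets;     // size num_rows + 1
    std::vector<std::size_t> column_indices;  // one entry per nonzero
    std::vector<float> values;                // one entry per nonzero
};

// Computes y = A * x, walking the nonzeros of each row in turn.
std::vector<float> spmv_csr(const CsrMatrix& A, const std::vector<float>& x) {
    assert(x.size() == A.num_cols);
    assert(A.row_offsets.size() == A.num_rows + 1);
    std::vector<float> y(A.num_rows, 0.0f);
    for (std::size_t row = 0; row < A.num_rows; ++row) {
        float sum = 0.0f;
        for (std::size_t k = A.row_offsets[row]; k < A.row_offsets[row + 1]; ++k) {
            sum += A.values[k] * x[A.column_indices[k]];
        }
        y[row] = sum;
    }
    return y;
}
```

Rows of different lengths make the inner loop trip count vary per row, which is why neighbouring GPU threads would read memory at unrelated addresses.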
Diagonal format
- the stored diagonals should be mostly populated
- high parallelism: map one thread to one row
- good parallel efficiency and good memory behavior (global memory coalescing)
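The diagonal layout can be sketched on the host as follows. This is an illustration under assumptions: the names DiaMatrix and spmv_dia are hypothetical, and the data array is assumed to be stored diagonal by diagonal so that consecutive rows sit next to each other, which is the access pattern that coalesces when one GPU thread handles one row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical DIA container (field names are illustrative).
struct DiaMatrix {
    std::size_t num_rows;
    std::size_t num_cols;
    std::vector<int> offsets;  // 0 = main diagonal, +1 = first super-diagonal, -1 = first sub-diagonal
    std::vector<float> data;   // data[d * num_rows + row], padded with zeros outside the matrix
};

// Computes y = A * x; each outer iteration corresponds to one row (one GPU thread).
std::vector<float> spmv_dia(const DiaMatrix& A, const std::vector<float>& x) {
    assert(x.size() == A.num_cols);
    assert(A.data.size() == A.offsets.size() * A.num_rows);
    std::vector<float> y(A.num_rows, 0.0f);
    for (std::size_t row = 0; row < A.num_rows; ++row) {
        float sum = 0.0f;
        for (std::size_t d = 0; d < A.offsets.size(); ++d) {
            const long long col = static_cast<long long>(row) + A.offsets[d];
            if (col >= 0 && col < static_cast<long long>(A.num_cols)) {
                sum += A.data[d * A.num_rows + row] * x[static_cast<std::size_t>(col)];
            }
        }
        y[row] = sum;
    }
    return y;
}
```

No column indices are stored at all; the column follows from the row and the diagonal offset, which is why the format only pays off when the diagonals are mostly populated.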
Ellpack format
- again assign one thread to compute one row
- but load imbalance between rows hurts parallel efficiency
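A host-only sketch of the ellpack layout, assuming a fixed number of stored entries per row with -1 as the padding marker. The names EllMatrix and spmv_ell and the padding convention are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ELL container (field names are illustrative).
struct EllMatrix {
    std::size_t num_rows;
    std::size_t num_cols;
    std::size_t width;                // entries stored per row, including padding
    std::vector<int> column_indices;  // [k * num_rows + row], -1 marks a padding slot
    std::vector<float> values;        // same layout as column_indices
};

// Computes y = A * x; every row performs exactly `width` steps.
std::vector<float> spmv_ell(const EllMatrix& A, const std::vector<float>& x) {
    assert(x.size() == A.num_cols);
    assert(A.values.size() == A.width * A.num_rows);
    assert(A.column_indices.size() == A.values.size());
    std::vector<float> y(A.num_rows, 0.0f);
    for (std::size_t row = 0; row < A.num_rows; ++row) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < A.width; ++k) {
            const int col = A.column_indices[k * A.num_rows + row];
            if (col >= 0) {
                sum += A.values[k * A.num_rows + row] * x[static_cast<std::size_t>(col)];
            }
        }
        y[row] = sum;
    }
    return y;
}
```

The width is set by the longest row, so a single long row forces padding onto every other row; that is the imbalance cost noted above.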
Coordinate format
- insensitive to the sparsity pattern, but slower than ellpack
- assign one thread to each element, then combine the results from all elements in a row to get the output element
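The coordinate format can be sketched in the same style. The names CooMatrix and spmv_coo are hypothetical. On the host the per-row combination is a plain accumulation; a GPU version with one thread per element would need a segmented reduction or atomic updates for that step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical COO container (field names are illustrative).
struct CooMatrix {
    std::size_t num_rows;
    std::size_t num_cols;
    std::vector<std::size_t> row_indices;     // one entry per nonzero
    std::vector<std::size_t> column_indices;  // one entry per nonzero
    std::vector<float> values;                // one entry per nonzero
};

// Computes y = A * x with one independent product per stored element.
std::vector<float> spmv_coo(const CooMatrix& A, const std::vector<float>& x) {
    assert(x.size() == A.num_cols);
    assert(A.row_indices.size() == A.values.size());
    assert(A.column_indices.size() == A.values.size());
    std::vector<float> y(A.num_rows, 0.0f);
    for (std::size_t n = 0; n < A.values.size(); ++n) {
        // Combine the products that belong to the same row.
        y[A.row_indices[n]] += A.values[n] * x[A.column_indices[n]];
    }
    return y;
}
```

The work per element is identical regardless of how the nonzeros are distributed over the rows, which is what makes the format insensitive to the sparsity pattern.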
Hybrid format
- combine the regular ellpack format with the flexible coo format
- typical entries are stored in the ellpack part, exceptional entries in the coo part
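A sketch of how the hybrid split might look on the host: the first ell_width entries of each row go into an ellpack block, and anything beyond that spills into a coordinate list. All names here (HybMatrix, Row, make_hyb, spmv_hyb) are hypothetical, and the fixed ell_width is chosen by the caller; this is not how any particular library picks the split.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical HYB container: an ELL block of fixed width plus a COO overflow list.
struct HybMatrix {
    std::size_t num_rows = 0, num_cols = 0, ell_width = 0;
    std::vector<int> ell_columns;    // [k * num_rows + row], -1 marks padding
    std::vector<float> ell_values;   // same layout as ell_columns
    std::vector<std::size_t> coo_rows, coo_columns;
    std::vector<float> coo_values;
};

using Row = std::vector<std::pair<std::size_t, float>>;  // (column, value) pairs of one row

// Splits each row: the first ell_width entries go to ELL, the rest to COO.
HybMatrix make_hyb(const std::vector<Row>& rows, std::size_t num_cols, std::size_t ell_width) {
    HybMatrix A;
    A.num_rows = rows.size();
    A.num_cols = num_cols;
    A.ell_width = ell_width;
    A.ell_columns.assign(ell_width * A.num_rows, -1);
    A.ell_values.assign(ell_width * A.num_rows, 0.0f);
    for (std::size_t row = 0; row < rows.size(); ++row) {
        for (std::size_t k = 0; k < rows[row].size(); ++k) {
            const std::size_t col = rows[row][k].first;
            const float val = rows[row][k].second;
            assert(col < num_cols);
            if (k < ell_width) {
                A.ell_columns[k * A.num_rows + row] = static_cast<int>(col);
                A.ell_values[k * A.num_rows + row] = val;
            } else {
                A.coo_rows.push_back(row);
                A.coo_columns.push_back(col);
                A.coo_values.push_back(val);
            }
        }
    }
    return A;
}

// Computes y = A * x: regular ELL pass first, then the COO overflow.
std::vector<float> spmv_hyb(const HybMatrix& A, const std::vector<float>& x) {
    assert(x.size() == A.num_cols);
    std::vector<float> y(A.num_rows, 0.0f);
    for (std::size_t row = 0; row < A.num_rows; ++row) {
        for (std::size_t k = 0; k < A.ell_width; ++k) {
            const int col = A.ell_columns[k * A.num_rows + row];
            if (col >= 0) {
                y[row] += A.ell_values[k * A.num_rows + row] * x[static_cast<std::size_t>(col)];
            }
        }
    }
    for (std::size_t n = 0; n < A.coo_values.size(); ++n) {
        y[A.coo_rows[n]] += A.coo_values[n] * x[A.coo_columns[n]];
    }
    return y;
}
```

With a well-chosen width, most rows fit entirely in the regular block and only the few long rows pay the coordinate-format cost.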
Property comparison
fixed number of nonzeros and variable matrix size

Matrix Format   Granularity      Coalescing
DIA             thread/row       full
ELL             thread/row       full
CSR(scalar)     thread/row       rare
CSR(vector)     warp/row         partial
COO             thread/nonzero   full
HYB             thread/row       full
Sparse matrices for parallel efficiency: ellpack format
- one thread per row gives efficient memory access
Sparse matrices for load imbalance: coordinate format
- one thread per element is insensitive to matrix structure
Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if the common case is regularized
Performance comparison (charts not reproduced)
Linear Algebra Library
CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available for a single device; not implemented for multiple devices
CUFFT: CUDA Fast Fourier Transform Library
- uses a divide-and-conquer algorithm for the discrete Fourier transform
- supports real and complex data, in-place or out-of-place
- supports stream operation for simultaneous execution
- use complex-to-complex to replace real-to-complex
- problem sizes that are powers of two give the best performance
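To illustrate the divide-and-conquer idea behind a fast Fourier transform, here is a textbook recursive radix-2 sketch on the host. It is not the CUFFT implementation or its API; the function name fft and the Complex alias are assumptions for this example, and the input length must be a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Recursive radix-2 Cooley-Tukey transform (forward direction, unnormalized).
std::vector<Complex> fft(const std::vector<Complex>& input) {
    const std::size_t n = input.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two
    if (n == 1) {
        return input;
    }
    // Divide: split into even-indexed and odd-indexed samples.
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = input[2 * i];
        odd[i] = input[2 * i + 1];
    }
    // Conquer: transform both halves independently.
    const std::vector<Complex> even_fft = fft(even);
    const std::vector<Complex> odd_fft = fft(odd);
    // Combine: one butterfly per output pair.
    const double pi = std::acos(-1.0);
    std::vector<Complex> output(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const Complex twiddled = std::polar(1.0, angle) * odd_fft[k];
        output[k] = even_fft[k] + twiddled;
        output[k + n / 2] = even_fft[k] - twiddled;
    }
    return output;
}
```

Each level halves the problem, so a length that is a power of two splits evenly all the way down, which is consistent with the note above that such sizes perform best.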
CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix-sum, sorting, and data reduction
- stream compaction and random number generation
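Two of the primitives named above are closely related: stream compaction is usually built on top of a prefix sum. The following sequential sketch shows that relationship. It does not use the CUDPP API, and the names exclusive_prefix_sum and compact_positive, as well as the "keep positive values" predicate, are assumptions for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<std::size_t> exclusive_prefix_sum(const std::vector<std::size_t>& in) {
    std::vector<std::size_t> out(in.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Stream compaction: keep the elements greater than zero, preserving their order.
std::vector<int> compact_positive(const std::vector<int>& in) {
    // Step 1: flag the elements to keep.
    std::vector<std::size_t> flags(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        flags[i] = in[i] > 0 ? 1 : 0;
    }
    // Step 2: the scan of the flags gives each kept element its output slot.
    const std::vector<std::size_t> position = exclusive_prefix_sum(flags);
    const std::size_t kept = in.empty() ? 0 : position.back() + flags.back();
    // Step 3: scatter. Every write goes to a distinct slot, so this step is parallel-friendly.
    std::vector<int> out(kept);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i] != 0) {
            assert(position[i] < kept);
            out[position[i]] = in[i];
        }
    }
    return out;
}
```

In a data-parallel library, steps 1 and 3 map to one thread per element and step 2 is the parallel scan primitive.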
CUDPP comparison with multicore CPU (chart not reproduced)
CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for several languages
  - linear system solvers
  - least squares solvers
  - orthogonal factorizations
  - symmetric eigenproblems
  - non-symmetric eigenproblems
  - singular value decompositions
CULA benchmark charts (not reproduced): double precision QR-factorization, LU-factorization, symmetric eigenvalue problem, and singular value decomposition
MAGMA: Matrix Algebra on GPU and Multicore Architectures
- open source project to develop a dense linear algebra library similar to LAPACK, but for heterogeneous and hybrid architectures that combine multicore CPUs and GPUs
MAGMA benchmark charts (not reproduced): double precision matrix-matrix multiplication, single precision QR-factorization, single precision Cholesky-factorization, and solving Ax=b by using LU-factorization
Thrust
- thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL); it provides a flexible high-level interface that greatly enhances productivity
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
int main(int argc,char** argv)
{
//allocate memory space on the host
thrust::host_vector<float> hvec(1024);
//generate random numbers on the host
thrust::generate(hvec.begin(),hvec.end(),rand);
//allocate and transfer data to device
thrust::device_vector<float> dvec=hvec;
//manipulate device values from the host
dvec[0]=(float)rand()/(float)RAND_MAX;
dvec[1]=(float)rand()/(float)RAND_MAX;
//sum all data on device by parallel reduction
float sum=thrust::reduce(dvec.begin(),dvec.end());
//sort all data on device by radix sort
thrust::sort(dvec.begin(),dvec.end());
//transfer final data back to host
thrust::copy(dvec.begin(),dvec.end(),hvec.begin());
return 0;
}
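For comparison, here is a host-only sketch of the same generate, reduce, and sort pipeline written against the C++ standard library. It is meant to show how closely the Thrust calls mirror their STL counterparts; the function name generate_reduce_sort is an assumption for this example, and nothing here runs on a GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <numeric>
#include <vector>

// STL counterpart of the Thrust pipeline: fill with random values, sum, then sort in place.
float generate_reduce_sort(std::vector<float>& vec) {
    // std::generate plays the role of thrust::generate.
    std::generate(vec.begin(), vec.end(), [] {
        return static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);
    });
    // std::accumulate plays the role of thrust::reduce.
    const float sum = std::accumulate(vec.begin(), vec.end(), 0.0f);
    // std::sort plays the role of thrust::sort.
    std::sort(vec.begin(), vec.end());
    assert(std::is_sorted(vec.begin(), vec.end()));
    return sum;
}
```

The iterator-range calling convention is the same in both libraries; the main visible difference in the Thrust version is the container type, which decides where the data lives.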
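#include <list>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
//example kernel with a placeholder body: double every element in place
__global__ void kernel(int* data,int size)
{
int index=blockIdx.x*blockDim.x+threadIdx.x;
if(index<size) data[index]*=2;
}
int main(int argc,char** argv)
{
//create list container on the host
std::list<int> hlist;
hlist.push_back(13);
hlist.push_back(27);
//copy host data from list into device vector
thrust::device_vector<int> dvec(hlist.size());
thrust::copy(hlist.begin(),hlist.end(),dvec.begin());
//alternative method to convert from host to device
thrust::device_vector<int> dvec2(hlist.begin(),hlist.end());
//obtain raw pointer from device memory
int* dpointer=thrust::raw_pointer_cast(dvec.data());
//launch device kernel function
int blocksize=256;
int blocknum=((int)dvec.size()+blocksize-1)/blocksize;
kernel<<<blocknum,blocksize>>>(dpointer,(int)dvec.size());
//no cudaFree here: dvec owns the memory and releases it in its destructor
return 0;
}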
CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- cusp provides a high-level and flexible interface for manipulating sparse matrices and solving sparse linear systems by iterative methods
- cusp is implemented on top of the thrust template interface
"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, in Supercomputing '09, 2009
Matrix format
- cusp natively supports several sparse matrix formats
- cusp makes it easy to transfer sparse matrix data between host and device and to convert between sparse matrix formats
//allocate storage space for a CSR matrix on the host
//with 5 rows, 8 columns and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);
//allocate and transfer from host to device memory
cusp::csr_matrix<int,float,cusp::device_memory> B=A;
//convert the CSR matrix format to HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C=A;
Algorithm and iterative solver
- matrix-vector multiplication and transpose
- conjugate gradient and biconjugate gradient stabilized (BiCGSTAB)
//matrix-vector multiplication
cusp::multiply(A,x,y);
//sparse matrix transpose
cusp::transpose(A,At);
//conjugate gradient
cusp::krylov::cg(A,x,b);
//biconjugate gradient stabilized
cusp::krylov::bicgstab(A,x,b);
#include <cusp/hyb_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
#include <cusp/precond/diagonal.h>
typedef float ValueType;
typedef cusp::device_memory MemorySpace;
int main(int argc,char** argv)
{
//create an empty HYB sparse matrix structure
cusp::hyb_matrix<int,ValueType,MemorySpace> A;
//load a matrix stored in the matrix market format
cusp::io::read_matrix_market_file(A,"5pt_10x10.mtx");
//allocate storage for solution x and right-hand side b
cusp::array1d<ValueType,MemorySpace> x(A.num_rows,0);
cusp::array1d<ValueType,MemorySpace> b(A.num_rows,1);
//set the iteration limit (100) and relative residual tolerance (1e-6)
cusp::verbose_monitor<ValueType> monitor(b,100,1e-6);
//setup the diagonal matrix preconditioner
cusp::precond::diagonal<ValueType,MemorySpace> M(A);
//solve the linear system with the conjugate gradient method
cusp::krylov::cg(A,x,b,monitor,M);
return 0;
}
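To make the solver call above more concrete, here is a host-only sketch of the conjugate gradient method with a diagonal (Jacobi) preconditioner. It is a textbook formulation, not the CUSP implementation; the names CsrMatrix, Vec, multiply, dot, and cg_jacobi are assumptions for this example, and the matrix is assumed to be square, symmetric positive definite, with a nonzero diagonal.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical square CSR matrix (field names are illustrative).
struct CsrMatrix {
    std::size_t num_rows;
    std::vector<std::size_t> row_offsets;     // size num_rows + 1
    std::vector<std::size_t> column_indices;
    std::vector<double> values;
};

static Vec multiply(const CsrMatrix& A, const Vec& x) {
    Vec y(A.num_rows, 0.0);
    for (std::size_t row = 0; row < A.num_rows; ++row)
        for (std::size_t k = A.row_offsets[row]; k < A.row_offsets[row + 1]; ++k)
            y[row] += A.values[k] * x[A.column_indices[k]];
    return y;
}

static double dot(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Solves A x = b. x holds the initial guess on entry and the solution on exit.
// Stops when ||r|| <= relative_tolerance * ||b|| or after max_iterations steps.
// Returns the number of iterations performed.
std::size_t cg_jacobi(const CsrMatrix& A, Vec& x, const Vec& b,
                      std::size_t max_iterations, double relative_tolerance) {
    const std::size_t n = A.num_rows;
    assert(x.size() == n && b.size() == n);
    // Diagonal preconditioner: applying M^-1 is a division by the diagonal of A.
    Vec inverse_diagonal(n, 1.0);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t k = A.row_offsets[row]; k < A.row_offsets[row + 1]; ++k)
            if (A.column_indices[k] == row) inverse_diagonal[row] = 1.0 / A.values[k];

    Vec r = multiply(A, x);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];
    Vec z(n), p(n);
    for (std::size_t i = 0; i < n; ++i) p[i] = z[i] = inverse_diagonal[i] * r[i];
    double rz = dot(r, z);
    const double b_norm = std::sqrt(dot(b, b));

    std::size_t iteration = 0;
    while (iteration < max_iterations && std::sqrt(dot(r, r)) > relative_tolerance * b_norm) {
        const Vec Ap = multiply(A, p);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            z[i] = inverse_diagonal[i] * r[i];
        }
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rz_new;
        ++iteration;
    }
    return iteration;
}
```

Each iteration costs one sparse matrix-vector product plus a few vector operations, which is why the choice of sparse format discussed earlier dominates the solver's performance.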
OpenNL: Open Numerical Library
- efficient sparse matrix data structure
- sparse direct linear solver through SuperLU
- Jacobi and SSOR matrix preconditioners
- iterative builder for sparse least-squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES
ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU-factorization
- sparse conjugate gradient and biconjugate gradient
- optimal incomplete LU preconditioner with threshold
GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level 3 BLAS kernels, based on OpenCL
Next Generation Architecture
The next GPU generation architecture is called Fermi
Third generation Streaming Multiprocessor
- dual thread/warp scheduler
- 32 processors for each SM
- double precision performance is 50% of single precision (8X faster than GT200)
- 4 special function units
- 64 KB of RAM for shared memory and configurable L1 cache
Second generation Parallel Thread Execution
- IEEE 754-2008 floating point standard, surpassing even the most advanced CPU
- fused multiply-add (FMA) instruction for both single and double precision
- newly designed 32-bit integer ALU and extended precision operations
Improved Memory System
- first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- L1 cache for each multiprocessor: improves bandwidth and reduces latency
- unified L2 cache (768 KB): coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, which is almost 2X faster than GDDR3
GigaThread Hardware Scheduler
- hierarchically manages thousands of simultaneously active threads
- 10X faster application context switching to support concurrent kernel execution
- dual DMA engines for simultaneous data transfer, to fully overlap with CPU and GPU processing time
Third generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit and floating-point unit
- floating-point arithmetic improved from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit precision to 32-bit precision
What is NEW in floating-point operations?
- support fused multiply-add instructions for both single and double precision
- original multiply-add: A x B = product (truncate extra digits), then product + C = result
- fused multiply-add: A x B = product (retain all digits), then product + C = result
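The difference between the two multiply-add variants can be observed on the host with std::fma. This is a sketch under assumptions: the function names are hypothetical, and the unfused version rounds the intermediate product to nearest (the host default) rather than truncating it as in the original multiply-add described above. The effect is the same in kind: digits of the product are lost before the addition.

```cpp
#include <cassert>
#include <cmath>

// Unfused multiply-add: the product is rounded to double before the addition.
// The volatile store keeps the compiler from contracting the two steps into one FMA.
double multiply_add_unfused(double a, double b, double c) {
    assert(std::isfinite(a) && std::isfinite(b) && std::isfinite(c));
    volatile double product = a * b;
    return product + c;
}

// Fused multiply-add: a * b + c is evaluated with the full product and rounded once.
double multiply_add_fused(double a, double b, double c) {
    assert(std::isfinite(a) && std::isfinite(b) && std::isfinite(c));
    return std::fma(a, b, c);
}
```

For example, with a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1 in double precision. The unfused version therefore returns 0, while the fused version returns -2^-54.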
- support subnormal numbers for both single and double precision; these are small numbers that lie between zero and the smallest normalized number of a given floating-point number system
- prior generations flush subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles, but Fermi handles subnormal calculations in hardware with no additional performance penalty
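Subnormal numbers and the flush-to-zero behaviour can be explored on the host with the standard classification functions. The helper names below (is_subnormal, flush_to_zero, halve_repeatedly) are assumptions for this sketch, and flush_to_zero only emulates in software what the bullet above describes for prior GPU generations.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True if the value lies strictly between zero and the smallest normalized float.
bool is_subnormal(float value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}

// Software emulation of flush-to-zero: subnormal values become a signed zero.
float flush_to_zero(float value) {
    return is_subnormal(value) ? std::copysign(0.0f, value) : value;
}

// Divides by two `times` times, rounding to float after every step.
float halve_repeatedly(float value, int times) {
    assert(times >= 0);
    volatile float result = value;  // volatile keeps each intermediate in float precision
    for (int i = 0; i < times; ++i) {
        result = result / 2.0f;
    }
    return result;
}
```

Starting from the smallest normalized float, std::numeric_limits<float>::min(), one halving already gives a subnormal value. With subnormal support the result keeps shrinking gradually over further halvings instead of dropping straight to zero, whereas flush_to_zero discards it immediately.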
Third generation Streaming Multiprocessor
- 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per cycle
- 32 single precision FMA units
- 16 double precision FMA units
Double precision application performance (chart not reproduced)
- two warp schedulers and instruction dispatch units
- the dual warp scheduler allows two warps to be issued and executed concurrently on the 32 cores
64KB configurable shared memory and L1 cache
- 48KB shared memory and 16KB L1 cache
- 16KB shared memory and 48KB L1 cache
- example: radix sort using shared memory (chart not reproduced)
Unified memory address space
- combines the three separate address spaces for load and store into one
- this feature enables Fermi to support C++-specific features: virtual functions, function pointers, new and delete, try and catch
Summary table (not reproduced)
Scheduler bottleneck (architecture diagram not reproduced; blocks shown: Host, Input Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Pixel Thread Issue, Work Distribution, Thread Processor, SP, TF, L1, L2, FB; annotated with the old bottleneck and the new bottleneck)
Reference
- Mark Harris http://www.markmark.net/
- Wei-Chao Chen http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu http://impact.crhc.illinois.edu/people/current/hwu.php