Sparse Matrix-Vector Product

CUBLAS and CUSPARSE MVM Timing

Gavin Harrison

SMVM Algorithm
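As a point of reference, a minimal sketch of SMVM over a CSR-format matrix (one thread per row; row_ptr, col_idx, and val are illustrative names for the standard CSR arrays, not taken from the slides):

__global__ void csr_spmv_scalar(int n_rows,
                                const int   *row_ptr,   // row i's nonzeros are val[row_ptr[i] .. row_ptr[i+1]-1]
                                const int   *col_idx,   // column index of each nonzero
                                const float *val,       // value of each nonzero
                                const float *x,
                                float       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        float sum = 0.0f;
        // Each thread accumulates the dot product of one sparse row with x.
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}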

NVIDIA Memory Hierarchy

• Global memory: large, but high latency.

• Shared memory: a small on-chip memory shared by the processors of each SM, usable as a software-managed cache.

• Constant/texture memory: read-only regions of global memory backed by an on-chip cache (see the declaration sketch below).

– Constant memory is faster, but has only one port, so reads of different addresses within a half warp serialize.

– Texture memory doesn't suffer greatly from irregular access, and it also benefits from 2D spatial locality.
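As a rough illustration of how these memory spaces appear in CUDA code of that generation (all names here are hypothetical; the host would still bind the texture with cudaBindTexture and fill the constant array with cudaMemcpyToSymbol):

__constant__ float coeffs[64];                        // constant memory: read-only, on-chip cached, one read port
texture<float, 1, cudaReadModeElementType> x_tex;     // texture reference: read-only, cached, tolerant of irregular access

__global__ void memory_spaces_demo(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];                       // shared memory: per-block on-chip scratch space
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // g_in/g_out live in global memory: large, high latency
    if (i < n) {
        tile[threadIdx.x] = g_in[i];
        __syncthreads();
        g_out[i] = tile[threadIdx.x] * coeffs[0] + tex1Dfetch(x_tex, i);
    }
}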

Tuning SMVM for the GPU (GTX 280)

• Use multiple threads per row; use __syncthreads() and combine the partial results (see the combined kernel sketch after this list).

• Access memory at a regular stride so that reads coalesce.

– The threads of a half warp access sequential addresses.

– This allows fewer memory reads from global memory.

• Align rows.

– Also helps decrease memory reads from global memory.

• Use texture memory for the input vector.

– The input vector is reused across rows.

– Texture reads are cached and benefit from spatial locality.
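A minimal sketch that combines these points, assuming a CSR layout with illustrative array names (row_ptr, col_idx, val) and the texture-reference API of that CUDA generation; this is one possible realization, not necessarily the kernel timed in these slides:

#define THREADS_PER_ROW 32    // one warp cooperates on each row (illustrative choice)

texture<float, 1, cudaReadModeElementType> x_tex;     // input vector bound to a texture: reused and cached

__global__ void csr_spmv_vector(int n_rows,
                                const int   *row_ptr,
                                const int   *col_idx,
                                const float *val,
                                float       *y)
{
    extern __shared__ float partial[];                // blockDim.x partial sums

    int lane = threadIdx.x % THREADS_PER_ROW;         // position within this row's thread group
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / THREADS_PER_ROW;

    float sum = 0.0f;
    if (row < n_rows) {
        // Consecutive threads read consecutive nonzeros, so a half warp touches
        // sequential addresses and the global loads coalesce into few transactions.
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += THREADS_PER_ROW)
            sum += val[j] * tex1Dfetch(x_tex, col_idx[j]);
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Combine partial results within each row's group (tree reduction in shared memory).
    for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2) {
        if (lane < offset)
            partial[threadIdx.x] += partial[threadIdx.x + offset];
        __syncthreads();
    }

    if (row < n_rows && lane == 0)
        y[row] = partial[threadIdx.x];
}

Launch with a block size that is a multiple of THREADS_PER_ROW and blockDim.x * sizeof(float) bytes of dynamic shared memory; padding each row to start at an aligned offset (the "align rows" point above) further reduces the number of global transactions.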

Improvements in Fermi (GTX 580)

• A general L1/L2 cache hierarchy.

– L1 cache and shared memory are configurable as 48 KB or 16 KB each (64 KB of on-chip memory split between them); a configuration sketch follows this list.

– L2 is 768 KB.

• Improved support for double-precision floating point.

• Added support for 32-bit integer multiplication.

• 32 SPs per SM.
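The CUDA runtime exposes that L1/shared-memory split through cache-configuration hints; a small sketch (the kernel name is hypothetical):

#include <cuda_runtime.h>

__global__ void spmv_kernel(void) { /* ... placeholder for the actual SpMV kernel ... */ }

int main(void)
{
    // Fermi's 64 KB of per-SM on-chip memory can be split 48 KB L1 / 16 KB shared
    // or 16 KB L1 / 48 KB shared.  Prefer a larger L1 for this kernel:
    cudaFuncSetCacheConfig(spmv_kernel, cudaFuncCachePreferL1);

    // Or set a device-wide preference for larger shared memory instead:
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    return 0;
}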

CUSPARSE SMVM Performance

CUSPARSE SMVM Speedup over OSKI (single precision)
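The speedup charts themselves are not reproduced in this text. For reference, the CUSPARSE CSR SpMV of that era was invoked roughly as below, using the legacy cusparseScsrmv interface (newer toolkits replace it with cusparseSpMV); all pointer names are illustrative and the device arrays are assumed to be allocated and filled already:

#include <cusparse.h>

// y = alpha * A * x + beta * y for an m x n CSR matrix with nnz nonzeros.
void cusparse_smvm(int m, int n, int nnz,
                   const float *d_val, const int *d_row_ptr, const int *d_col_idx,
                   const float *d_x, float *d_y)
{
    cusparseHandle_t   handle;
    cusparseMatDescr_t descr;
    const float alpha = 1.0f, beta = 0.0f;

    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_row_ptr, d_col_idx, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
}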

CUBLAS MVM Performance

CUBLAS MVM Speedup over ATLAS
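The dense counterpart goes through CUBLAS; a minimal sketch of the gemv call, assuming a column-major m x n matrix already resident on the device (names are illustrative):

#include <cublas_v2.h>

// y = alpha * A * x + beta * y for a dense, column-major m x n matrix d_A.
void cublas_mvm(int m, int n, const float *d_A, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    const float alpha = 1.0f, beta = 0.0f;

    cublasCreate(&handle);
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m,        // lda = m for a column-major m x n matrix
                d_x, 1, &beta, d_y, 1);
    cublasDestroy(handle);
}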
