CUBLAS and CUSPARSE MVM Timing
Gavin Harrison
SMVM Algorithm
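For reference, a minimal sketch of the SMVM computation y = A*x, assuming the matrix is stored in CSR format with one CUDA thread per row (the array names row_ptr, col_idx, and vals are illustrative, not taken from the slides):

// Minimal CSR SMVM sketch: one thread per row.
// Assumes all arrays already reside in device (global) memory.
__global__ void csr_spmv_scalar(int num_rows,
                                const int   *row_ptr,
                                const int   *col_idx,
                                const float *vals,
                                const float *x,
                                float       *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += vals[j] * x[col_idx[j]];
        y[row] = dot;
    }
}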
NVIDIA Memory Hierarchy
• Global memory: large capacity, high latency.
• Shared memory: fast on-chip memory shared by the threads running on each streaming multiprocessor (SM).
• Constant/texture memory: read-only regions of global memory backed by on-chip caches (see the declaration sketch after this list).
– Constant memory is faster, but its cache has only one port, so it serves one address at a time.
– Texture memory does not suffer greatly from irregular access and also benefits from 2D spatial locality.
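As an illustration of where these memory spaces appear in CUDA code, the sketch below declares one of each; the names, sizes, and arithmetic are placeholders, and the texture reference API shown is the one available on this generation of hardware:

// Constant memory: read-only, cached on chip, fastest when all threads read the same address.
__constant__ float coeffs[16];
// Texture memory: read-only, cached on chip, tolerant of irregular access patterns.
texture<float, 1, cudaReadModeElementType> xTex;

__global__ void memory_spaces_demo(const float *in, float *out, int n)
{
    // Shared memory: on-chip, visible to all threads of the block.
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];   // "in" and "out" live in global memory.
        out[i] = coeffs[0] * tile[threadIdx.x] + tex1Dfetch(xTex, i);
    }
}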
Tuning SMVM for GPU (GTX 280)
• Use multiple threads per row; synchronize with __syncthreads() and combine the partial results (see the kernel sketch after this list).
• Access memory at a stride of the half-warp size.
– Half warps then access sequential addresses.
– Allows reads from global memory to coalesce into fewer transactions.
• Align rows.
– Also helps decrease the number of reads from global memory.
• Use texture memory for the input vector.
– The input vector is reused across rows.
– Texture reads are cached and benefit from spatial locality.
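A sketch of a kernel that combines these tuning points, in the spirit of a "vector" CSR kernel: a half warp of 16 threads cooperates on each row so that reads of the matrix values and column indices are sequential (and therefore coalesce), partial sums are combined through shared memory with __syncthreads(), and the input vector is read through the texture cache. Kernel and array names are illustrative, and xTex is assumed to have been bound to the input vector beforehand:

texture<float, 1, cudaReadModeElementType> xTex;   // bound to the input vector x

__global__ void csr_spmv_vector(int num_rows,
                                const int   *row_ptr,
                                const int   *col_idx,
                                const float *vals,
                                float       *y)
{
    __shared__ float partial[256];        // one slot per thread; block size of 256 assumed
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 15;          // position within the half warp
    int row  = tid / 16;                  // one half warp per row

    float sum = 0.0f;
    if (row < num_rows) {
        // Consecutive lanes read consecutive entries of vals/col_idx: coalesced access.
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 16)
            sum += vals[j] * tex1Dfetch(xTex, col_idx[j]);   // x read through the texture cache
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Combine the 16 partial results of each half warp.
    for (int offset = 8; offset > 0; offset >>= 1) {
        if (lane < offset)
            partial[threadIdx.x] += partial[threadIdx.x + offset];
        __syncthreads();
    }
    if (lane == 0 && row < num_rows)
        y[row] = partial[threadIdx.x];
}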
Improvements in Fermi (GTX 580)
• General-purpose L1/L2 cache hierarchy.
– L1 cache and shared memory split 64 KB per SM, configurable as 48 KB/16 KB or 16 KB/48 KB (see the configuration sketch after this list).
– L2 is 768 KB.
• Improved support for double precision floating point numbers.
• Added support for 32-bit integer multiplication.
• 32 SPs per SM.
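A minimal sketch of choosing the Fermi L1/shared-memory split for a kernel, using the runtime call provided for this purpose; the kernel name refers to the earlier vector CSR sketch and is only illustrative:

#include <cuda_runtime.h>

// Kernel defined elsewhere (see the earlier sketch).
__global__ void csr_spmv_vector(int num_rows, const int *row_ptr,
                                const int *col_idx, const float *vals, float *y);

void prefer_l1_for_spmv()
{
    // Request 48 KB of L1 and 16 KB of shared memory for this kernel;
    // cudaFuncCachePreferShared would request the opposite split.
    cudaFuncSetCacheConfig(csr_spmv_vector, cudaFuncCachePreferL1);
}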
CUSPARSE SMVM Performance
CUSPARSE SMVM Speedup over OSKI (single precision)
CUBLAS MVM Performance
CUBLAS MVM Speedup over ATLAS
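For context on what a timing run for the CUBLAS comparison might look like, here is a hedged sketch using CUDA events and the CUBLAS v2 interface; the matrix size, leading dimension, and single-call structure are assumptions for illustration, not details taken from the slides:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Times one dense MVM y = A * x for an n x n column-major matrix A on the device.
void time_cublas_mvm(const float *d_A, const float *d_x, float *d_y, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const float alpha = 1.0f, beta = 0.0f;

    cudaEventRecord(start, 0);
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cublasSgemv: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
}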