MS Thesis Defense
“IMPROVING GPU PERFORMANCE BY REGROUPING
CPU-MEMORY DATA”
by
Deepthi Gummadi
CoE EECS Department
April 21, 2014
About Me
Deepthi Gummadi
MS in Computer Networking with Thesis
LaTeX programmer at CAPPLab since Fall 2013
Publications
“New CPU-to-GPU Memory Mapping Technique,” in IEEE
SoutheastCon 2014.
“The Impact of Thread Synchronization and Data Parallelism on
Multicore Game Programming,” accepted in IEEE ICIEV-2014.
“Feasibility Study of Spider-Web Multicore/Manycore Network
Architectures,” in preparation.
“Investigating Impact of Data Parallelism on Computer Game
Engine,” under review, IJCVSP Journal, 2014.
Gummadi
2
Committee Members
Dr. Abu Asaduzzaman, EECS Dept.
Dr. Ramazan Asmatulu, ME Dept.
Dr. Zheng Chen, EECS Dept.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Outline
QUESTIONS?
Any time, please.
Introduction
Motivation
Problem Statement
Proposal
Evaluation
Experimental Results
Conclusions
Future Work
Introduction
Central Processing Unit (CPU) Technology
Interprets and executes program instructions.
What is new about the CPU?
Early processors executed instructions sequentially.
Around the millennium, single-core speed gains flattened and
processors moved toward parallelism.
Currently, we have multi-core on-chip CPUs.
CPU Speed Chart
Cache Memory Organization
Why do we use cache memory?
To bridge the speed gap between the fast cores and slow main memory.
Several cache levels:
Lower-level caches – smaller and faster; closest to the cores
performing computations.
Higher-level caches – larger and slower; mainly used for storage.
Intel 4-core processor
NVIDIA Graphic Processing Unit
Parallel Processing
Architecture
Components
Streaming Multiprocessors
Warp Schedulers
Execution pipelines
Registers
Memory Organization
Shared memory
Global memory
GPU Memory Organization
CPU and GPU
CPU
Low Latency
Cache Memory
Optimized MIMD
GPU
High Throughput, Moderate Latency
Shared Memory
Optimized SIMD
CPU and GPU work together to be more efficient.
CPU-GPU Computing Workflow
Step 1: CPU allocates the
memory and copies the
data.
cudaMalloc()
cudaMemcpy()
CPU-GPU Computing Workflow
Step 2: CPU sends
function parameters and
instructions to GPU.
CPU-GPU Computing Workflow
Step 3: GPU executes
the instructions based
on received commands.
CPU-GPU Computing Workflow
Step 4: After execution,
the results are copied
back from GPU DRAM
to CPU memory.
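Taken together, the four steps can be sketched in CUDA/C. This is a minimal illustrative sketch, not the thesis code: the kernel `scale`, the vector size, and the launch configuration are all assumptions.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element.
__global__ void scale(float *d, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

int main(void) {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // Step 1: allocate GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  //         and copy the data in

    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);                  // Steps 2-3: send parameters;
                                                                  // GPU executes the kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // Step 4: retrieve the results
    cudaFree(d);

    printf("h[1] = %.1f\n", h[1]);
    return 0;
}
```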
Motivation
■ Data-level parallelism
Spatial data partitioning
Temporal data
partitioning
Spatial instruction
partitioning
Temporal instruction
partitioning
Two Parallelization Strategies
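As a small illustration of spatial data partitioning (the helper names are hypothetical, not from the thesis), thread t of T threads can be assigned one contiguous chunk of an n-item data set:

```c
/* Spatial data partitioning sketch: thread t of nthreads owns
   items [chunk_begin, chunk_end) of an n-item array. Chunks are
   contiguous and differ in size by at most one element. */
int chunk_begin(int n, int nthreads, int t) {
    return (int)((long long)n * t / nthreads);
}

int chunk_end(int n, int nthreads, int t) {
    return (int)((long long)n * (t + 1) / nthreads);
}
```

Temporal partitioning, by contrast, would hand each thread the same data at different time steps; the spatial split above is the simpler case.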
Motivation
■ Parallelism and optimization techniques simplify CUDA
programming.
■ From the developer’s view, the memory is unified.
Problem Statement
The traditional CPU-to-GPU global memory mapping
technique does not make effective use of GPU shared memory.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Outline
QUESTIONS?
Any time, please.
Introduction
Motivation
Problem Statement
Proposal
Evaluation
Experimental Results
Conclusions
Future Work
Proposal
We propose a new CPU-to-GPU memory mapping to
improve GPU shared memory performance.
Proposed Technique
Major Steps:
Step 1: Start
Step 2: Analyze problems; determine input parameters.
Step 3: Analyze GPU card parameters/characteristics.
Step 4: Analyze CPU and GPU memory organizations.
Step 5: Determine the number of computations and the
number of threads.
Step 6: Identify/Partition the data-blocks for each thread.
Step 7: Copy/Regroup CPU data-blocks to GPU global memory.
Step 8: Stop
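Step 7 (regrouping) can be sketched as follows. This is a hedged illustration, not the thesis implementation: it assumes an n×n row-major matrix split into b×b tiles, and rewrites the data so that each tile (the block a thread group would load into shared memory) becomes contiguous:

```c
/* Regrouping sketch: copy an n x n row-major matrix into dst so
   that each b x b tile is stored contiguously, tile after tile.
   Assumes n is a multiple of b. A contiguous tile can then be
   copied to GPU shared memory with consecutive accesses. */
void regroup_tiles(const float *src, float *dst, int n, int b) {
    int k = 0;
    for (int ti = 0; ti < n; ti += b)              /* tile row     */
        for (int tj = 0; tj < n; tj += b)          /* tile column  */
            for (int i = ti; i < ti + b; i++)      /* rows in tile */
                for (int j = tj; j < tj + b; j++)  /* cols in tile */
                    dst[k++] = src[i * n + j];
}
```

In row-major order a b×b tile is scattered across b separate rows; after regrouping its b*b elements sit side by side in global memory.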
Proposed Technique
Traditional Mapping
■ Data is copied directly from CPU to GPU global memory.
■ Data is retrieved from scattered global memory blocks.
■ It is difficult to store the data into GPU shared memory.
Proposed Mapping
■ Data is regrouped and then copied from CPU to GPU global memory.
■ Data is retrieved from consecutive global memory blocks.
■ It is easy to store the data into GPU shared memory.
Evaluation
System Parameters:
CPU: dual processor,
speed 2.13 GHz.
Fermi card: 14 SMs, 32
CUDA cores per SM (448 total).
Kepler card: 13 SMs, 192
CUDA cores per SM (2,496 total).
Evaluation
Memory sizes of the CPU
and GPU cards.
Input parameters are the
numbers of rows and
columns; the output
parameter is execution
time.
Evaluation
Electric charge distribution by Laplace’s equation for a 2D
problem (finite difference approximation):
ϵx(i,j)(Φi+1,j - Φi,j)/dx - ϵx(i-1,j)(Φi,j - Φi-1,j)/dx +
ϵy(i,j)(Φi,j+1 - Φi,j)/dy - ϵy(i,j-1)(Φi,j - Φi,j-1)/dy = 0
Φ = electric potential
ϵ = medium permittivity
dx, dy = spatial grid sizes
Φi,j = electric potential defined at lattice point (i, j)
ϵx(i,j), ϵy(i,j) = effective x- and y-direction permittivities defined at the
edges of the element cell (i, j).
Evaluation
For a uniform material the permittivity can be treated as the same
everywhere, and the equation becomes
(Φi+1,j - Φi,j)/dx - (Φi,j - Φi-1,j)/dx +
(Φi,j+1 - Φi,j)/dy - (Φi,j - Φi,j-1)/dy = 0
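With dx = dy, the uniform equation says each interior point equals the average of its four neighbors, which suggests a Jacobi-style sweep. A minimal CPU/C sketch (the function name and fixed-boundary handling are illustrative assumptions, not necessarily the thesis code):

```c
/* One Jacobi sweep for the uniform Laplace equation (dx == dy):
   each interior point becomes the average of its four neighbors;
   boundary points are held fixed. phi and out are n x n, row-major. */
void jacobi_sweep(const float *phi, float *out, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == 0 || j == 0 || i == n - 1 || j == n - 1) {
                out[i * n + j] = phi[i * n + j];   /* fixed boundary */
            } else {
                out[i * n + j] = 0.25f * (phi[(i - 1) * n + j] +
                                          phi[(i + 1) * n + j] +
                                          phi[i * n + j - 1] +
                                          phi[i * n + j + 1]);
            }
        }
}
```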
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Outline
QUESTIONS?
Any time, please.
Introduction
Motivation
Problem Statement
Proposal
Evaluation
Experimental Results
Conclusions
Future Work
Experimental Results
■ Conducted a study on electric charge distribution by
Laplace’s equation.
■ Implemented three versions:
CPU only.
GPU with shared memory.
GPU without shared memory.
■ Inputs / Outputs
Problem size (N for an N×N matrix)
Execution time
Experimental Results
Validation of our CUDA/C code:
N(n,m) = (1/5)(N(n,m-1) + N(n,m+1) + N(n,m) + N(n-1,m) + N(n+1,m)),
where 1 ≤ n ≤ 8 and 1 ≤ m ≤ 8.
Both the CPU/C and CUDA/C programs produce the same values.
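The validation stencil above (note that, as written on the slide, it averages the point itself together with its four neighbors) can be expressed directly in C. The function name and halo layout are illustrative assumptions:

```c
/* Five-point validation stencil from the slide: the new value at
   (n, m) is one fifth of the sum of the point itself and its four
   neighbors. g is a row-major grid of the given width, and (n, m)
   is assumed to be an interior point (a halo row/column exists). */
float stencil_5pt(const float *g, int width, int n, int m) {
    return (g[n * width + (m - 1)] + g[n * width + (m + 1)] +
            g[n * width + m] +
            g[(n - 1) * width + m] + g[(n + 1) * width + m]) / 5.0f;
}
```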
Experimental Results
Impact of GPU shared memory
As the number of threads
increases, the processing
time decreases (up to
8×8 threads).
Beyond 8×8 threads, the
GPU with shared memory
shows better performance.
Experimental Results
Impact of the Number of Threads
For a fixed amount of
shared memory, GPU
processing time decreases
as the number of threads
increases (up to 16×16).
Beyond 16×16 threads, the
Kepler card shows better
performance.
Experimental Results
Impact of the amount of shared memory
As the size of GPU shared
memory increases, the
processing time decreases.
Experimental Results
Impact of the proposed data regrouping technique
With data regrouping and
shared memory, the
processing time decreases
as the number of threads
increases.
Comparing the GPU with
and without shared memory,
the shared-memory version
gives better performance
for larger numbers of
threads.
Conclusions
For fast, effective analysis of complex systems, high-performance
computation is necessary.
NVIDIA CUDA CPU/GPU computing proves its potential for highly
parallel computations.
Traditional memory mapping follows the locality principle, so the
data does not fit well in GPU shared memory.
It is beneficial to keep data in GPU shared memory rather than in
GPU global memory.
Conclusions
To overcome this problem, we proposed a new memory
mapping between the CPU and GPU to improve
performance.
We implemented three different versions.
Results indicate that the proposed CPU-to-GPU memory
mapping technique decreases the overall execution time
by more than 75%.
Future Extensions
■ Modeling and simulation of nanocomposites:
Nanocomposites require a large number of
computations at high speed.
■ Aircraft applications:
High-performance computations are required to study
mixtures of composite materials.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Questions?
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Thank you
Contact:
Deepthi Gummadi
E-mail: dxgummadi@wichita.edu