MS Thesis Defense
“IMPROVING GPU PERFORMANCE BY REGROUPING
CPU-MEMORY DATA”
by
Deepthi Gummadi
CoE EECS Department
April 21, 2014
About Me
Deepthi Gummadi
 MS in Computer Networking with Thesis
 LaTeX programmer at CAPPLab since Fall 2013
 Publications
 “New CPU-to-GPU Memory Mapping Technique,” in IEEE
SouthEast Conference 2014.
 “The Impact of Thread Synchronization and Data Parallelism on
Multicore Game Programming,” accepted in IEEE ICIEV-2014.
 “Feasibility Study of Spider-Web Multicore/Manycore Network
Architectures,” in preparation.
 “Investigating Impact of Data Parallelism on Computer Game
Engine,” under review, IJCVSP Journal, 2014.
Committee Members
 Dr. Abu Asaduzzaman, EECS Dept.
 Dr. Ramazan Asmatulu, ME Dept.
 Dr. Zheng Chen, EECS Dept.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Outline
Questions? Any time, please.
 Introduction
 Motivation
 Problem Statement
 Proposal
 Evaluation
 Experimental Results
 Conclusions
 Future Work
Introduction
Central Processing Unit (CPU) Technology
 Interprets and executes program instructions.
What is new about CPUs?
 Initially, processors evolved with a sequential structure.
 Around the millennium, processor designs shifted from raw clock speed toward parallelism.
 Currently, we have multicore on-chip CPUs.
[Figure: CPU speed chart]
Cache Memory Organization
 Why do we use cache memory?
 There are several memory layers:
 Lower-level caches: faster, used for active computation.
 Higher-level caches: slower, used mainly for storage capacity.
[Figure: Intel 4-core processor]
NVIDIA Graphics Processing Unit
 Parallel Processing
Architecture
 Components
 Streaming Multiprocessors
 Warp Schedulers
 Execution pipelines
 Registers
 Memory Organization
 Shared memory
 Global memory
[Figure: GPU memory organization]
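These properties can be inspected at runtime. The following is a minimal sketch (illustrative, not from the thesis) that queries them through the standard CUDA runtime API:

```c
/* Minimal sketch: query the GPU properties listed above
 * via the standard CUDA runtime API. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* device 0 */
    printf("Device: %s\n", prop.name);
    printf("Streaming Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Registers per block: %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```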
CPU and GPU
CPU: low latency; cache memory; optimized for MIMD.
GPU: high throughput with moderate latency; shared memory; optimized for SIMD.
CPU and GPU work together to be more efficient.
CPU-GPU Computing Workflow
Step 1: CPU allocates the
memory and copies the
data.
cudaMalloc()
cudaMemcpy()
CPU-GPU Computing Workflow
Step 2: CPU sends
function parameters and
instructions to GPU.
CPU-GPU Computing Workflow
Step 3: GPU executes
the instructions based
on received commands.
CPU-GPU Computing Workflow
Step 4: After execution, the results are retrieved from GPU DRAM to CPU memory.
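Taken together, Steps 1–4 follow the standard CUDA host-side pattern. A minimal sketch, assuming a hypothetical kernel myKernel() that doubles each element of an array (the kernel body is a placeholder, not the thesis code):

```cuda
#include <cuda_runtime.h>

/* Hypothetical placeholder kernel: each thread doubles one element. */
__global__ void myKernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

void run(float *h_data, int n) {
    float *d_data;
    size_t bytes = n * sizeof(float);

    /* Step 1: CPU allocates GPU memory and copies the data. */
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* Steps 2-3: CPU sends parameters/instructions; GPU executes them. */
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    /* Step 4: results are retrieved from GPU DRAM back to CPU memory. */
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```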
Motivation
■ Data-level parallelism
 Spatial data partitioning
 Temporal data partitioning
 Spatial instruction partitioning
 Temporal instruction partitioning
[Figure: Two parallelization strategies]
Motivation
■ Parallelism and optimization techniques simplify CUDA programming.
■ From the developer's view, the memory is unified.
Problem Statement
The traditional CPU-to-GPU global memory mapping technique is not well suited to GPU shared memory.
Proposal
Proposed CPU-to-GPU memory mapping to improve GPU shared-memory performance.
Proposed Technique
Major Steps:
Step 1: Start
Step 2: Analyze problems; determine input parameters.
Step 3: Analyze GPU card parameters/characteristics.
Step 4: Analyze CPU and GPU memory organizations.
Step 5: Determine the number of computations and the
number of threads.
Step 6: Identify/Partition the data-blocks for each thread.
Step 7: Copy/Regroup CPU data-blocks to GPU global memory (see the sketch after this list).
Step 8: Stop
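As a hedged illustration of Step 7, the sketch below regroups a row-major n×n matrix on the host so that each TILE×TILE block becomes contiguous before a single copy to GPU global memory. The tile size and layout are illustrative assumptions, not the thesis's exact scheme (n is assumed to be a multiple of TILE):

```c
#include <stdlib.h>
#include <cuda_runtime.h>

#define TILE 16  /* assumed tile size; one tile per thread block */

/* Illustrative Step 7: reorder a row-major n x n matrix into contiguous
 * TILE x TILE blocks, then copy to GPU global memory in one transfer,
 * so each thread block later reads consecutive global-memory addresses. */
void regroup_and_copy(const float *h_src, float *d_dst, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *h_tmp = (float *)malloc(bytes);
    int k = 0;
    for (int bi = 0; bi < n; bi += TILE)           /* tile row */
        for (int bj = 0; bj < n; bj += TILE)       /* tile column */
            for (int i = bi; i < bi + TILE; i++)   /* rows within tile */
                for (int j = bj; j < bj + TILE; j++)
                    h_tmp[k++] = h_src[i * n + j];
    cudaMemcpy(d_dst, h_tmp, bytes, cudaMemcpyHostToDevice);
    free(h_tmp);
}
```

The host-side reorder costs one extra pass over the data, but it lets each thread block load its tile into shared memory with coalesced reads.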
Proposed Technique
Traditional Mapping:
■ Data is directly copied from CPU to GPU global memory.
■ Each thread's data is retrieved from different (scattered) global memory blocks.
■ It is difficult to store the data into GPU shared memory.
Proposed Mapping:
■ Data is regrouped and then copied from CPU to GPU global memory.
■ Each thread's data is retrieved from consecutive global memory blocks.
■ It is easy to store the data into GPU shared memory.
Evaluation
System Parameters:
 CPU: dual processor, 2.13 GHz.
 Fermi card: 14 SMs, 32 CUDA cores per SM.
 Kepler card: 13 SMs, 192 CUDA cores per SM.
Evaluation
 Memory sizes of the CPU and GPU cards.
 Input parameters are the numbers of rows and columns; the output parameter is execution time.
Evaluation
Electric charge distribution is modeled by Laplace's equation; the finite difference approximation for the 2D problem is

ϵx(i,j)·(Φ(i+1,j) − Φ(i,j))/dx + ϵy(i,j)·(Φ(i,j+1) − Φ(i,j))/dy +
ϵx(i−1,j)·(Φ(i−1,j) − Φ(i,j))/dx + ϵy(i,j−1)·(Φ(i,j−1) − Φ(i,j))/dy = 0

where
Φ = electric potential
ϵ = medium permittivity
dx, dy = spatial grid sizes
Φ(i,j) = electric potential defined at lattice point (i, j)
ϵx(i,j), ϵy(i,j) = effective x- and y-direction permittivities defined at the edges of the element cell (i, j).
Evaluation
For a uniform material the permittivity can be treated as constant, and the equation becomes

(Φ(i+1,j) − Φ(i,j))/dx + (Φ(i,j+1) − Φ(i,j))/dy +
(Φ(i−1,j) − Φ(i,j))/dx + (Φ(i,j−1) − Φ(i,j))/dy = 0
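A minimal CUDA kernel sketch of this uniform-material update follows. Assuming dx = dy, the equation rearranges to each interior point being the average of its four neighbors; the kernel performs one Jacobi-style iteration (grid layout and boundary handling are illustrative assumptions, not the thesis code):

```cuda
/* Illustrative kernel: one Jacobi iteration of the uniform-material
 * Laplace update. With dx = dy, the equation above reduces to
 * Phi(i,j) = (Phi(i+1,j) + Phi(i-1,j) + Phi(i,j+1) + Phi(i,j-1)) / 4. */
__global__ void laplaceStep(const float *phi, float *phiNew, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
    if (i > 0 && i < n - 1 && j > 0 && j < n - 1) {
        phiNew[i * n + j] = 0.25f * (phi[(i + 1) * n + j] +
                                     phi[(i - 1) * n + j] +
                                     phi[i * n + (j + 1)] +
                                     phi[i * n + (j - 1)]);
    }
}
```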
Experimental Results
■ We conducted a study of high electric charge distribution using Laplace's equation.
■ Three versions were implemented:
 CPU only.
 GPU with shared memory.
 GPU without shared memory.
■ Inputs / Outputs:
 Problem size (N, for an N×N matrix).
 Execution time.
Experimental Results
Validation of our CUDA/C code:
N(n,m) = (1/5) · (N(n,m−1) + N(n,m+1) + N(n,m) + N(n−1,m) + N(n+1,m)),
where 1 ≤ n ≤ 8 and 1 ≤ m ≤ 8.
Both the CPU/C and CUDA/C programs produce the same values.
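A sketch of the CPU/C reference loop implied by this formula (array size and boundary handling are illustrative assumptions; the 8×8 interior points are surrounded by a fixed boundary ring):

```c
/* Illustrative CPU/C reference for the validation formula: each interior
 * value becomes the average of itself and its four neighbors (divide by 5). */
void validate_step(float N[10][10]) {
    float out[10][10];
    for (int n = 1; n <= 8; n++)
        for (int m = 1; m <= 8; m++)
            out[n][m] = (N[n][m - 1] + N[n][m + 1] + N[n][m] +
                         N[n - 1][m] + N[n + 1][m]) / 5.0f;
    for (int n = 1; n <= 8; n++)      /* write back interior points */
        for (int m = 1; m <= 8; m++)
            N[n][m] = out[n][m];
}
```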
Experimental Results
Impact of GPU shared memory
 As the number of threads increases, the processing time decreases (up to 8×8 threads).
 Beyond 8×8 threads, the GPU with shared memory shows better performance.
Experimental Results
Impact of the Number of Threads
 For a fixed amount of shared memory, GPU processing time decreases as the number of threads increases (up to 16×16 threads).
 Beyond 16×16 threads, the Kepler card shows better performance.
Experimental Results
Impact of amount of shared memory
 As the size of GPU shared memory increases, the processing time decreases.
Experimental Results
Impact of the proposed data regrouping technique
 With data regrouping and shared memory, the processing time decreases as the number of threads increases.
 Between the GPU versions with and without shared memory, the shared-memory version performs better at higher thread counts.
Conclusions
 Fast, effective analysis of complex systems requires high-performance computation.
 The NVIDIA CUDA CPU/GPU platform proves its potential for such computations.
 Traditional memory mapping follows the locality principle, so each thread's data does not fit contiguously in GPU shared memory.
 It is beneficial to keep data in GPU shared memory rather than in GPU global memory.
Conclusions
 To overcome this problem, we proposed a new CPU-to-GPU memory mapping to improve performance.
 We implemented three different versions.
 Results indicate that the proposed CPU-to-GPU memory mapping technique decreases overall execution time by more than 75%.
Future Extensions
■ Modeling and simulation of nanocomposites: nanocomposites require a large number of computations at high speed.
■ Aircraft applications: high-performance computation is required to study mixtures of composite materials.
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Questions?
“IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA”
Thank you
Contact:
Deepthi Gummadi
E-mail: dxgummadi@wichita.edu