Team 3 – GPU Performance

CSCI 5593 – Advanced Computer Architecture
Experiments on GPU Performance Using
K-Means Clustering
Under the Guidance of
Dr. Gita Alaghband
By
Team 3
Anuradha Ramakrishnan
Manh Huynh
Sai Latha Suresh
Introduction:
The Graphics Processing Unit (GPU) is traditionally dedicated hardware for graphics computing, used to accelerate the rendering of images for display. More recently, GPUs have been used to accelerate many computation-intensive workloads and to enable general-purpose computing. NVIDIA describes the GPU this way: “The CPU (central processing unit) has often been called the brains of the PC. But increasingly, that brain is being enhanced by another part of the PC – the GPU, which is its soul” [1].
The aim of this project is to study GPU performance using the K-Means algorithm. The implementation includes measuring the algorithm’s execution time on Hydra, both on a single CPU core and on Hydra’s GPU, and examining how execution improves as the number of cores and processors on the GPU varies. The other part of our implementation tests the memory types available on Hydra’s GPU (Tesla C2050) for the K-Means algorithm. We intend to show how memory utilization and data transfer affect the K-Means algorithm’s execution.
SIMD
In contrast to task-parallel computation, data-parallel computation distributes data among several computing nodes. In the “single instruction, multiple data” (SIMD) model, multiple processing elements execute the same instruction on different pieces of data, which achieves data-parallel computation. GPUs use this model because it lets the processors share flow-control logic, so more hardware can be devoted to instruction execution. SIMD parallelization therefore complements GPU computation.
GPU
GPUs were originally designed for graphics computing and for accelerating the rendering of display images. Today they are in high demand because they provide many parallel cores and intensive computing capability, which makes data-parallel computation possible. The significant difference between a CPU and a GPU is that the GPU devotes far more transistors to ALUs and only a few to caches and flow control. The high computing power, memory bandwidth, and data-parallel execution of GPUs are the reasons behind general-purpose computing on GPUs.
Figure 1: Key difference of CPU and GPU based on number of cores
CUDA:
CUDA stands for Compute Unified Device Architecture. CUDA is NVIDIA’s parallel computing architecture, which enables dramatic increases in computing performance by harnessing the power of the GPU (Graphics Processing Unit). Computing is evolving from “central processing” on the CPU alone to “co-processing” on the CPU and GPU together. To enable this computing paradigm, NVIDIA introduced the CUDA parallel computing architecture, which now ships in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers. One indicator of CUDA adoption is the ramp of the Tesla GPU line for GPU computing. With CUDA, the GPU is no longer only a graphics processor but also a general-purpose parallel processor accessible to any application. CUDA is accessible to software developers through industry-standard programming languages: it exposes the instruction set and memory of the GPU’s parallel computation elements and lets developers use high-level languages such as C to build applications that take advantage of the performance and scalability the GPU architecture offers. GPUs allow a very large number of concurrently executing threads to be created at very low system-resource cost.
Processing Flow of CUDA:
- Copy data from main (host) memory to GPU memory.
- The CPU issues the processing instructions (kernel launch) to the GPU.
- The GPU executes the kernel in parallel on each core.
- Copy the results from GPU memory back to main memory.
A minimal sketch of this flow is given below.
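The following sketch illustrates the four steps above; the kernel name scale, the array size, and the 256-thread launch configuration are illustrative choices, not taken from the report.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* per-thread global index */
        if (i < n) data[i] *= 2.0f;                      /* each thread handles one element */
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);
        float *h = (float *)malloc(bytes);               /* host (main) memory */
        for (int i = 0; i < N; i++) h[i] = 1.0f;
        float *d;
        cudaMalloc((void **)&d, bytes);                  /* device (GPU) memory */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); /* step 1: host -> device */
        scale<<<(N + 255) / 256, 256>>>(d, N);           /* steps 2-3: CPU launches, GPU runs in parallel */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* step 4: result back to main memory */
        cudaFree(d);
        free(h);
        return 0;
    }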
Memory Hierarchy:
CUDA devices provide several different memory spaces that programmers can use to achieve high execution speed in their kernels. The memory types are global memory, local memory, texture memory, constant memory, shared memory, and register memory. Figure 2 below illustrates this memory organization. Each block has a set of local registers per thread and a shared memory that is common to all threads in the block. Registers and shared memory are located on-chip, so variables placed in them can be accessed at high speed, whereas the other memory types are located off-chip.
To gain a deeper understanding of the differences between these memory types, we go into more detail about how each of them is allocated and used on modern processors.
Figure 2: Memory Hierarchy for thread, block and grid
Global Memory:
The size of global memory varies from card to card. This memory is the main means of communicating read/write data between the host and the device, and its contents are visible to all threads. A variable declared with the __device__ qualifier resides in global memory. Global memory is allocated with cudaMalloc(), which creates objects in device global memory and takes two parameters: the address of a pointer and the size of the memory to allocate. cudaFree() releases objects from device global memory.
Local Memory
Local memory resides in device memory and is slower than registers or shared memory. It is used for arrays or other variables that access contiguous memory locations. Local memory is private to each thread and has the lifetime of that thread. The compiler automatically places a variable in local memory when the registers available to the thread are insufficient.
Constant Memory
Constant memory is cached, which makes it fast to read. It is declared with the __constant__ qualifier, which restricts its usage to read-only access from kernels. Reading from constant memory can conserve memory bandwidth compared with reading the same data from global memory.
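A small sketch of declaring and filling constant memory follows; the coefficient array and the kernel are illustrative and assume the constants are written once by the host with cudaMemcpyToSymbol().

    #include <cuda_runtime.h>

    __constant__ float c_coeff[2];                        /* read-only from kernels, cached on chip */

    __global__ void apply(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = c_coeff[0] * x[i] + c_coeff[1]; /* every thread reads the same constants */
    }

    int main(void) {
        float h_coeff[2] = {2.0f, 1.0f};
        cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff)); /* host writes the constants once */
        float *d_x;
        cudaMalloc((void **)&d_x, 256 * sizeof(float));
        apply<<<1, 256>>>(d_x, 256);
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }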
Texture memory:
Texture memory can also be used for general-purpose computing. It is read-only and allocated off-chip, but it is accessed through dedicated hardware and still provides higher effective bandwidth in some situations. Specifically, texture caches are designed for graphics applications whose memory access patterns exhibit a great deal of spatial locality, and texture memory has special features such as 2D prefetching.
Shared Memory:
Each block has its own shared memory, which enables fast communication between the threads in that block and holds data that will be read and written by multiple threads. Variables and arrays to be created in the shared memory space of a multiprocessor should be prefaced with the __device__ __shared__ qualifiers when they are declared. The size of shared memory depends on the compute capability. Data stored in shared memory is visible to all threads within that block and exists for the duration of the block. In many situations, shared memory is faster than all other memory types.
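As an illustration of shared memory and __syncthreads() (not code from the report), the kernel below computes one partial sum per block; it assumes it is launched with 256 threads per block, e.g. blockSum<<<blocks, 256>>>(d_in, d_out, n).

    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float buf[256];                 /* one copy per block, visible to its threads */
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        buf[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                           /* all writes finish before any thread reads */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];    /* one partial sum per block */
    }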
Register Memory:
Registers are likely the fastest memory; data stored in a register is visible only to the thread that wrote it and exists only for the lifetime of that thread. Automatic (scalar) variables declared in a CUDA kernel are placed into registers. The occupancy rate depends strongly on the amount of register memory each thread requires.
The table below lists the different memory types and their accessibility with respect to threads, blocks, and grids.
Memory     | Declaration                            | Scope  | Lifetime
Registers  | Automatic variables other than arrays  | Thread | Kernel
Local      | Automatic array variables              | Thread | Kernel
Shared     | __shared__                             | Block  | Kernel
Global     | __device__                             | Grid   | Application
Constant   | __constant__                           | Grid   | Application

Table 1: Types of memory with their scope and lifetime.
Threads and Blocks:
Parallelism on the GPU is achieved through threads. CUDA can run a given kernel across many threads and many cores. A thread is an execution of a kernel with a given index; each thread uses its index to access elements of an array. A block is a group of threads. Threads within a block can be coordinated with the __syncthreads() function, which makes a thread pause at a certain point in the kernel until all the other threads in its block reach the same point. A grid is a group of blocks. On this hardware, a block can contain up to 1024 threads and a grid can contain up to 65535 blocks per dimension. Spread across multi-card systems, or even multiple systems connected together, it is easy to see how this parallelism allows much faster processing. CUDA devices are also capable of running many small programs simultaneously. To manage thread scheduling there is a notable feature called the warp handler, where a warp is a group of 32 threads. When programming in CUDA, the user works with blocks, so it is up to the warp handler to determine how to partition the instructions among warps.
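The sketch below shows how a launch is dimensioned in blocks and threads and how each thread derives its own index; the matrix-fill kernel, the 16x16 block shape, and the host-side names (d_a, rows, cols, assumed allocated and defined earlier) are illustrative.

    __global__ void fill(float *a, int rows, int cols) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   /* per-thread column index */
        int row = blockIdx.y * blockDim.y + threadIdx.y;   /* per-thread row index */
        if (row < rows && col < cols)
            a[row * cols + col] = (float)(row + col);
    }

    /* host side: 16 x 16 = 256 threads per block; the grid is sized to cover the matrix */
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_a, rows, cols);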
Figure 3: Single Thread with its register
Figure 4: Eight Threads working in parallel for N blocks.
How to Optimize Data Transfers in CUDA
To transfer data efficiently between the host and the device, we use pageable and pinned memory. The peak bandwidth between device memory and the GPU is 144 GB/s on the NVIDIA Tesla C2050, while the peak bandwidth between host memory and device memory is about 8 GB/s over PCIe x16 Gen2. It is therefore good practice to minimize data transfers between the host and the GPU when possible. Moreover, grouping many small transfers into one larger transfer performs much better because it eliminates most of the per-transfer overhead. Beyond this, using pinned (page-locked) host memory is another way to optimize data transfer: pinned memory gives the best PCIe performance for host-to-device transfers. Pinned memory should not be overused, however, because it cannot be paged out by the OS.
Pinned memory is thus the best way to optimize data transfers. One way to use pinned memory is to allocate host memory with the cudaHostAlloc() function; such memory can be read from and written to directly by the device. Transfers that go through pinned memory achieve the maximum bandwidth between the CPU and the device. During execution, a block that requires host data only needs to wait for a small portion of the data to be transferred when operating through pinned memory, whereas typical host-to-device copies make all blocks wait until all of the data associated with the copy operation has been transferred. However, allocating too much pinned memory can degrade overall system performance by reducing the amount of memory available to the system for paging, so it is important to have an idea of how much memory should be pinned.
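A minimal sketch of using pinned memory is given below; the buffer names are hypothetical, and n is assumed to be defined earlier. cudaHostAlloc() allocates the page-locked host buffer and cudaFreeHost() releases it.

    float *h_pinned = NULL, *d_buf = NULL;
    size_t bytes = n * sizeof(float);
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault); /* page-locked host memory */
    cudaMalloc((void **)&d_buf, bytes);
    /* ... fill h_pinned on the host ... */
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);     /* transfer at full PCIe bandwidth */
    /* ... launch kernels on d_buf ... */
    cudaMemcpy(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFreeHost(h_pinned);                                         /* pinned memory must use cudaFreeHost */
    cudaFree(d_buf);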
Figure 5: Pageable and Pinned data transfer
One of the features added in CUDA 4.0 is Unified Virtual Addressing (UVA), with which pinned memory allocated with cudaMallocHost() is also mapped into the device’s address space. Through this feature, CUDA kernels can read or write such memory directly over the PCIe bus.
Measuring Data Transfer Times with nvprof
To quantify the time spent in each data transfer, we can record a CUDA event before and after each transfer and then use cudaEventElapsedTime() to obtain the elapsed time between the two events.
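The following sketch shows this timing pattern; the buffer names and sizes are hypothetical, and the bandwidth estimate simply divides the transfer size by the measured time.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                               /* event before the transfer */
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);                                /* event after the transfer */
    cudaEventSynchronize(stop);                              /* wait until the transfer has finished */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);                  /* elapsed time in milliseconds */
    printf("H2D transfer: %f ms (%.2f GB/s)\n", ms, (double)bytes / ms / 1.0e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);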
K-MEANS CLUSTERING:
K-means is a clustering algorithm used in many applications of data mining, machine learning, and science.
Scientific applications generate large data sets that need to be classified into subsets. Data clustering is the process of grouping similar objects into relatively homogeneous sets called clusters. The K-means algorithm became important as the computational demands of data clustering grew rapidly, because processing large data sets on a single CPU is very time consuming.
K-means is one of the most popular clustering algorithms. It stores k centroids that define the clusters; a point is considered to be in a particular cluster if it is closer to that cluster’s centroid than to any other centroid.
K-means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (the centers of the clusters) based on the current assignment of data points to clusters. K-means clustering can be implemented in both sequential and parallel versions to test GPU performance.
Figure 6 shows a sample K-means result with cluster count 5 (k = 5) on data with two different features. Features of the data can be characteristics such as its size, shape, etc.
Figure 6: K-means clustering with K = 5 (five clusters) in two dimensions (Feature 1 vs. Feature 2)
SEQUENTIAL:
Choose a user-defined number of clusters k and initial centroids, then calculate the distance from each data point to every centroid. A data point belongs to the cluster whose centroid is at minimum distance from it. The mean of each cluster is then updated. The final step checks for convergence: the algorithm has converged when $\|\mu_{new} - \mu_{old}\| < \text{threshold}$, with the threshold set to 0.001.
[Flow chart: start; select the number of clusters K; initialize the centroids (selected randomly); compute the distance from the data points to the centroids (the Euclidean distance is often used); assign each data point to one class based on minimum distance; update the means; if not converged, repeat the distance and assignment steps; otherwise finish.]
Figure 7: Flow chart of Sequential K-Means
PARALLEL VERSIONS:
There are two parallel versions of the K-means algorithm.
Version 1 (Hybrid Model)
1. Initialize the cluster centroids randomly.
2. Assign each data point to the nearest cluster center (labeling stage):
   $c \leftarrow \arg\min_{i \in \{1,2,\dots,k\}} \|x - \mu_i\|^2$  (parallelized)
3. Re-compute the cluster centroids (update stage):
   $\mu_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$  (serial sum)
4. If any cluster assignment changed, go to step 2.
This version parallelizes only the labeling stage. Its per-iteration computation complexity is $\mathcal{O}(nk/p) + \mathcal{O}(n + k)$ and its space complexity is $\mathcal{O}((n + k)d)$.
Version 2: (Everything on GPU)
1. Initialize the cluster centroids randomly.
2. Assign each data point to the nearest cluster center (labeling stage):
   $c \leftarrow \arg\min_{i \in \{1,2,\dots,k\}} \|x - \mu_i\|^2$  (parallelized)
3. Re-compute the cluster centroids (update stage):
   $\mu_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$  (parallelized)
4. If any cluster assignment changed, go to step 2.
This version parallelizes both the labeling and the update stages. Its per-iteration computation complexity is $\mathcal{O}(nk/p) + k\,\mathcal{O}(N/P + k \log_2(N/k))$.
EXPERIMENTAL SETUP:
Hydra:
The system on which we tested the single-core processor and the GPU is Hydra. Its system specification is listed below.
System Specification
Host Name                            hydra.ucdenver.pvt
Number of Nodes                      17 (1 master node and 17 compute nodes)
Total CPU Cores                      268 (256 compute cores)
Number of GPUs                       9 Tesla Fermi GPUs
Total GPU CUDA Cores                 4032 CUDA cores (448 cores per GPU at 1.15 GHz per core)
Total Max GFLOPS of CPUs             510 (2.5 GFLOPS per core)
Total Max GFLOPS of GPUs             9515 GFLOPS
Processors per Node                  2 x 6-core processors (Nodes 0-15); 4 x 16-core processors (Node 16)
Cores per Node                       12 cores (Nodes 0-15); 16 cores (Node 16)
Processor Type                       AMD Opteron 2427 (Nodes 0-15); AMD Opteron 6274 (Node 16)
Processor Speed                      2.2 GHz
L1 Instruction Cache per Processor   6 x 64 KB (AMD Opteron 2427)
L1 Data Cache per Processor          6 x 64 KB (AMD Opteron 2427)
L2 Cache per Processor               6 x 512 KB (AMD Opteron 2427)
L3 Cache per Processor               6 MB (AMD Opteron 2427)
L1 Instruction Cache per Processor   8 x 64 KB (AMD Opteron 6274)
L1 Data Cache per Processor          16 x 16 KB (AMD Opteron 6274)
L2 Cache per Processor               8 x 2 MB (AMD Opteron 6274)
L3 Cache per Processor               2 x 8 MB (AMD Opteron 6274)

Table 2: Hydra Specification [2]
Tesla:
Tesla GPUs are designed to deal with the large amounts of data found in scientific computing. Hydra has 9 NVIDIA Tesla S2050 GPUs, which occupy nodes 12-16: nodes 12-15 have 2 GPUs each, and node 16 has 1 GPU. The Tesla S2050 has 448 CUDA cores delivering up to 515 gigaflops in double-precision calculations; single-precision peak performance is over a teraflop per GPU. The Tesla S2050 comes standard with 3 GB of GDDR5 memory (144 GB/s), of which 2.625 GB is available to the user. Customers opt for Tesla rather than GeForce for specific tasks because GeForce cards are primarily intended for games, whereas the Tesla line was created for scientific computation and is suited to handling large amounts of scientific data. The Tesla S2050 also comes with ECC memory, which protects data in memory to enhance data integrity and reliability for applications.
deviceQuery:
This sample enumerates the properties of the CUDA devices present in the
system.
Figure 8: Device Specification for device 0
Figure 9: Device Specification for device 1
bandwidthTest - Bandwidth Test
This is a basic test program that measures the memory-copy bandwidth of the GPU and the memcpy bandwidth across PCIe. The test application can measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.
Figure 10: Bandwidth Test for device 0 in Tesla.
CPU:
On the CPU side, Hydra has a total of 268 CPU cores, of which 256 are compute cores; the total maximum GFLOPS of the CPUs is 510, at 2.5 GFLOPS per core.
Profiling:
Profiling reveals the interaction between software and the underlying machine architecture and indicates areas for improvement. It is a detailed form of program analysis that measures, for example, the complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to guide program optimization. The most common profiling tools are Nsight, nvvp, and nvprof. These tools contain a number of changes and new features as part of the CUDA Toolkit 7.0 release.
The Visual Profiler (nvvp) has been updated with several enhancements: its performance when loading large data files is improved, and the timeline view is improved for multi-GPU MPS profile data. Unified-memory profiling is also enhanced, providing fine-grained data transfers to and from the GPU together with more accurate timestamps for each transfer.
nvprof has been updated so that all events and metrics for devices with compute capability 3.x and 5.0 can now be collected accurately in the presence of multiple contexts on the GPU.
NVIDIA Nsight is a development platform for heterogeneous computing. It provides powerful debugging and profiling tools that help fully optimize the performance of the CPU and GPU. Beyond optimizing performance, these feature-rich tools help developers gain a better understanding of their code by identifying and analyzing bottlenecks and observing the behavior of all system activities.
nvprof – GPU summary
By running our application with nvprof ./Filename, we see a summary of all the kernels and memory copies that it used. The summary groups all calls to the same kernel together, presenting the total time and the percentage of total application time for each kernel.
nvprof – GPU Trace
By running an application with nvprof --print-gpu-trace , we can see on which
GPU each kernel ran, as well as the grid dimensions used for each launch. This
is very useful when we want to verify that a multi-GPU application is running as
we expect.
nvprof for Remote Profiling
When the system we deploy on is not our desktop and we only have terminal access to the machine, nvprof is a great tool to use. We simply connect to the remote machine (using ssh, for example) and run our application under nvprof.
--analysis-metrics
nvprof provides a handy option (--analysis-metrics) to capture all of the GPU metrics that the Visual Profiler needs for its “guided analysis” mode.
OUTPUT:
CSV
For every profiling mode, the --csv option can be used to generate output in comma-separated values (CSV) format. The result can then be imported directly into spreadsheet software such as Excel.
--output-profile
With the --output-profile command-line option, we can write a data file for later import into either nvprof or the NVIDIA Visual Profiler. Using this, we can capture a profile on a remote machine and then visualize and analyze the results on our desktop in the Visual Profiler.
IMPLEMENTATION:
DATASET:
The data set used for our implementation of the K-Means algorithm is the Covertype Data Set [3], in which the forest cover type is predicted from cartographic variables.
– Number of data points: 581,012
– Features: 54
– Feature characteristic: real numbers
K-Means Sequential:
The following snippet shows how the data set is read, how 2-D memory is allocated for it, and how the centroids are initialized.
[Code snippet: read dataset, allocate 2-D memory, initialize centroids]
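Since the original snippet is reproduced as an image, the following is a hedged sketch of those steps; the file layout (one whitespace-separated value per feature), the array names, and taking the first k points as initial centroids are assumptions, not the report’s actual code.

    /* n points with d features stored row-major in one contiguous block */
    float *objects   = (float *)malloc(n * d * sizeof(float));   /* objects[i*d + j] */
    float *centroids = (float *)malloc(k * d * sizeof(float));

    FILE *fp = fopen(filename, "r");
    for (int i = 0; i < n; i++)
        for (int j = 0; j < d; j++)
            fscanf(fp, "%f", &objects[i * d + j]);               /* read the data set */
    fclose(fp);

    for (int c = 0; c < k; c++)                                  /* initial centroids from data points */
        for (int j = 0; j < d; j++)
            centroids[c * d + j] = objects[c * d + j];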
The next snippet labels each data point with its nearest centroid and accumulates the values needed for the new centroids.
[Code snippet: labeling and calculating the new centroids]
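A sketch of that labeling step in the sequential version is shown below, under the same assumed array layout; membership, clusterSize, and newCentroids are hypothetical names, and FLT_MAX comes from <float.h>.

    for (int i = 0; i < n; i++) {
        int best = 0;
        float bestDist = FLT_MAX;
        for (int c = 0; c < k; c++) {                            /* distance to every centroid */
            float dist = 0.0f;
            for (int j = 0; j < d; j++) {
                float diff = objects[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        membership[i] = best;                                    /* label: nearest centroid */
        clusterSize[best]++;
        for (int j = 0; j < d; j++)                              /* accumulate sums for the new centroids */
            newCentroids[best * d + j] += objects[i * d + j];
    }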
The final part of K-means checks for convergence and updates the centroid positions.
[Code snippet: convergence check and centroid update]
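A sketch of that step, reusing the hypothetical arrays above and the 0.001 threshold from the report (sqrtf() is from <math.h>):

    float delta = 0.0f;
    for (int c = 0; c < k; c++) {
        for (int j = 0; j < d; j++) {
            float mean = (clusterSize[c] > 0) ? newCentroids[c * d + j] / clusterSize[c]
                                              : centroids[c * d + j];
            float diff = mean - centroids[c * d + j];
            delta += diff * diff;
            centroids[c * d + j] = mean;                         /* move the centroid to the cluster mean */
        }
    }
    int converged = (sqrtf(delta) < 0.001f);                     /* ||mu_new - mu_old|| < threshold */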
RESULTS:
Figure 11: K-means Sequential output with the computation and I/O time for 15000
objects and 11000 clusters.
K-Means Parallel:
Version 1(Hybrid Model):
The snippet below indicates how the threads are initialized and shows the kernel call that computes the distance from each data point to the centroids and updates the membership. The final summation for the new centroids is calculated on the CPU, not on the GPU.
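The original snippet is an image, so the following is a hedged sketch of what such a labeling kernel and its launch could look like; the array names, the 256-thread blocks, and the host-side copy of the membership array are assumptions.

    __global__ void labelPoints(const float *objects, const float *centroids,
                                int *membership, int n, int k, int d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;           /* one thread per data point */
        if (i >= n) return;

        int best = 0;
        float bestDist = FLT_MAX;
        for (int c = 0; c < k; c++) {                            /* distance to every centroid */
            float dist = 0.0f;
            for (int j = 0; j < d; j++) {
                float diff = objects[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        membership[i] = best;                                    /* nearest centroid index */
    }

    /* host side: labels on the GPU, centroid sums on the CPU (hybrid model) */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    labelPoints<<<blocks, threads>>>(d_objects, d_centroids, d_membership, n, k, d);
    cudaMemcpy(h_membership, d_membership, n * sizeof(int), cudaMemcpyDeviceToHost);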
Version 2:
Two kernel calls are used to perform all the K-means operations on the GPU. Version 2 is similar to Version 1 in that it has the labeling kernel, but in Version 2 the final summation for the centroids is also performed on the GPU by calling an update kernel.
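As a hedged sketch of what the update kernel could look like (one possible strategy; the report’s actual kernel may differ), the version below accumulates the per-cluster sums on the GPU with atomic adds, which are supported on the Tesla C2050’s compute capability 2.0.

    __global__ void updateCentroids(const float *objects, const int *membership,
                                    float *centroidSums, int *clusterSizes, int n, int d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;           /* one thread per data point */
        if (i >= n) return;
        int c = membership[i];
        atomicAdd(&clusterSizes[c], 1);                          /* count points per cluster */
        for (int j = 0; j < d; j++)
            atomicAdd(&centroidSums[c * d + j], objects[i * d + j]);  /* accumulate the sums */
    }
    /* A second small launch (k*d threads) then divides each sum by the cluster size
       to produce the new centroids, so the whole update stage stays on the GPU. */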
Memory Types Implementation:
Global Memory:
The snippet below shows how global memory is allocated for the kernel call. In CUDA, global memory is accessed through the pointers passed to the kernel call.
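Since the allocation snippet is an image, the following hedged sketch shows what the global-memory allocations and the pointer hand-off to a labeling kernel (like the one sketched earlier) could look like; names and sizes are illustrative.

    float *d_objects, *d_centroids;
    int   *d_membership;
    cudaMalloc((void **)&d_objects,    n * d * sizeof(float));   /* data points in global memory */
    cudaMalloc((void **)&d_centroids,  k * d * sizeof(float));
    cudaMalloc((void **)&d_membership, n * sizeof(int));

    cudaMemcpy(d_objects,   h_objects,   n * d * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_centroids, h_centroids, k * d * sizeof(float), cudaMemcpyHostToDevice);

    /* the kernel receives plain pointers into global memory */
    labelPoints<<<blocks, threads>>>(d_objects, d_centroids, d_membership, n, k, d);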
Shared Memory:
The snippets below show the shared-memory variable declaration and how __syncthreads() is used to synchronize the threads’ accesses to shared memory.
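A hedged sketch of such a kernel follows; it stages the centroids in dynamically sized shared memory before the distance loop and assumes that k*d floats fit within the per-block shared-memory limit, which holds only for small k and d.

    __global__ void labelPointsShared(const float *objects, const float *centroids,
                                      int *membership, int n, int k, int d)
    {
        extern __shared__ float s_centroids[];                   /* sized at launch time */

        /* cooperative copy: the threads of the block load the centroids together */
        for (int idx = threadIdx.x; idx < k * d; idx += blockDim.x)
            s_centroids[idx] = centroids[idx];
        __syncthreads();                                         /* wait until every load has finished */

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int best = 0;
        float bestDist = FLT_MAX;
        for (int c = 0; c < k; c++) {
            float dist = 0.0f;
            for (int j = 0; j < d; j++) {
                float diff = objects[i * d + j] - s_centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        membership[i] = best;
    }
    /* launched with the shared-memory size as the third launch parameter:
       labelPointsShared<<<blocks, threads, k * d * sizeof(float)>>>(...)      */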
Constant Memory:
Constant memory is declared for the device membership array, since it is always required when updating the centroids. The amount of constant memory allocated is shown below.
Figure 12: Constant memory allocation
Register Memory:
Registers are used to hold the current and minimum distances to the centroids before the membership is updated. The number of registers used, as a function of the number of threads, is shown below.
Local Memory:
Local memory usage was exercised by limiting the register count (setting it to zero at compile time), which forces register spilling into local memory.
Results:
Sequential Version:
Number of data points: n = 15,000. We increase the number of clusters K through 1000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, and 13000; note that the number of clusters must always be less than the number of data points. In Figure 13 the x-axis is the number of clusters and the y-axis is the execution time in seconds.
Figure 13: Execution time of Sequential K-means for Various K clusters
Observation:
As the number of clusters increases, the computation time also increases. The sequential version leads to very long computation times for larger data sets; for instance, with 500,000 data points it ran for more than 5 hours with no result in the end. One of the main reasons the sequential version runs so long is that it must take one centroid at a time and calculate its distance to every point. For this reason we optimized our sequential code by implementing the K-means algorithm in CUDA.
Steps:
- Run the unoptimized original code (the sequential version) and save the results.
- Run the optimized code against the same data and compare the results with the original sequential results.
Parallel Version:
Number of data points: n = 15,000. The x-axis is the number of clusters; we again increase K through 1000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, and 13000 (the number of clusters must always be less than the number of data points). The y-axis is the execution time in seconds. For n = 15,000 and k = 13,000, the execution time was 126.67 seconds.
Figure 14: Execution time of Parallel K-means for Various K clusters
Observations:
From Table 3 below, we can see that the parallel version is much better than the sequential version for a constant value of n and increasing values of k.
Table 3: Comparison of sequential and parallel K-means for the same clusters and data points, but different cores and clock speeds.
From Table 4 below, we can see that the sequential version failed for the larger data set. In contrast, the parallel version of the K-means algorithm took 168.98872 seconds of computation time for the larger data set with n = 500,000.

                 Sequential                         Parallel
n                500,000                            500,000
k                13,000                             13,000
Time (seconds)   Failed (> 5 hrs with no result)    168.98872 seconds

Table 4: Comparison of sequential and parallel K-means for the same clusters and data points, where the sequential version failed to execute.
Memory:
The nvprof summary for the memory being used is shown below. The figure shows the computation time, the I/O time, and the amount of time spent copying data to and from the host and device.
Figure 15: nvprof summary of the K-means algorithm, used to analyze the transfer time
The figure below shows the number of registers used by K-means, the numbers of threads, blocks, and grids used, and the throughput achieved while transferring data between the device and the host.
Figure 16: nvprof GPU trace, used to analyze the throughput and memory occupancy
The final comparison of the different memory-access types in CUDA for the K-means algorithm is shown below. The graph gives the computation time of running the K-means algorithm, with the clock started when the input is received, and captures the time taken for both transfer and computation.
Figure 17: Computation time taken to execute the K-Means algorithm for the various memory types
Observation:
From the graph, shared memory has the least computation time because multiple threads can access it simultaneously and it is on-chip. Register memory, in contrast to shared memory, shows a higher computation time for related reasons: multiple threads cannot access the same registers. Constant memory is essentially read-only memory placed off-chip; hence modifying the code to utilize constant memory instead of shared memory improved performance. Also, there were not enough registers available in a single kernel call, which resulted in register spilling and data being moved to local memory; the computation time shown above for registers is therefore a combination of register and local memory. Global memory, being placed off-chip and made accessible to all threads from all blocks and to the device, increased the computation time. So it is better to maximize the utilization of shared memory over the other memory types for our K-Means algorithm.
Optimizing Data Transfer:
Because the K-Means algorithm works on a large number of data points, it is necessary to optimize the data transfer between host and device.
[Chart: comparison of pinned and pageable memory for host-to-device (H2D) and device-to-host (D2H) transfers of 1 GB of data; y-axis in seconds. The measured values were 439.64, 399.03, 356.85, and 339.53 seconds, with the pinned transfers faster than the pageable ones in both directions.]
Observation:
The above graph shows that it is always more efficient to use pinned memory when transferring data between the host and the device. The graph was plotted for n = 500,000 and k = 13,000.
Conclusion and Future Work:
For the implemented K-means algorithm, it is most efficient to use parallel Version 2 with the shared-memory structure to improve the throughput and occupancy of the system. It is also more efficient to use pinned-memory transfers to reduce the transfer time.
Future work would include better K-means parallelization through sum reduction, using efficient parallel reductions that exchange data between threads within the same thread block.
References:
[1] NVidia - http://www.nvidia.com/object/what-is-gpu-computing.html
[2] Hydra - http://pds.ucdenver.edu/wiki/index.php?title=Hardware_on_Hydra
[3] Dataset - http://archive.ics.uci.edu/ml/datasets/Covertype
[4] Farivar, Reza, Daniel Rebolledo, Ellick Chan, and Roy H. Campbell. "A Parallel
Implementation of K-Means Clustering on GPUs." In PDPTA, vol. 13, no. 2, pp. 212-312. 2008.
[5] Zechner, Mario, and Michael Granitzer. "K-Means on the Graphics Processor: Design And
Experimental Analysis." International Journal On Advances in Systems and Measurements 2, no.
2 and 3 (2009): 224-235.
[6] Ivan Tanasic , Lluís Vilanova, Marc Jordà , Javier Cabezas , Isaac Gelado, Nacho Navarro and
Wen-mei Hwu, “Comparison Based Sorting for Systems with Multiple GPUs”, in Proceeding
GPGPU-6 Proceedings of the 6th Workshop on General Purpose Processor Using Graphics
Processing Units, Pages 1-11 , 2013-03-16.
[7] Ren Wu, Bin Zhang, and Meichun Hsu. 2009. ”Clustering billions of data points using GPUs”.
In Proceedings of the combined workshops on UnConventional high performance computing
workshop plus memory access workshop (UCHPC-MAW '09). ACM, New York, NY, USA, 1-6.
[8] Zechner, M.; Granitzer, M., "Accelerating K-Means on the Graphics Processor via
CUDA," Intensive Applications and Services, 2009. INTENSIVE '09. First International
Conference on, vol., no., pp.7,15, 20-25 April 2009