CSCI 5593 – Advanced Computer Architecture
Experiments on GPU Performance with K-Means Clustering
Under the guidance of Dr. Gita Alaghband
Team 3: Anuradha Ramakrishnan, Manh Huynh, Sai Latha Suresh

Introduction:
Graphics Processing Units (GPUs) are traditionally dedicated hardware for graphics computing, used to accelerate the rendering of images for display. More recently, GPUs have been used to accelerate many computationally intensive workloads, enabling general-purpose computing. NVIDIA characterizes the relationship this way: "The CPU (central processing unit) has often been called the brains of the PC. But increasingly, that brain is being enhanced by another part of the PC – the GPU, which is its soul" [1].

The aim of this project is to study GPU performance using the K-means algorithm. The implementation measures the algorithm's execution time on Hydra, both on a single CPU core and on Hydra's GPUs, and examines how execution improves with the GPU's much larger number of cores. The other part of our implementation tests the memory types available on Hydra's GPUs (Tesla 2050) for the K-means algorithm. We intend to show how memory utilization and data transfer affect the execution time of the K-means algorithm.

SIMD
In contrast to task-parallel computation, data-parallel computation distributes data among several computing nodes. "Single instruction, multiple data" (SIMD) has multiple processors executing the same instructions on different pieces of data, which accomplishes data-parallel computation. GPUs use SIMD because it lets processors share flow-control hardware; as a result, more hardware can be devoted to instruction execution. SIMD parallelization therefore complements GPU computation.

GPU
GPUs were originally built for graphics computing and for accelerating display image rendering. They are now in high demand because they contain many parallel cores, offer intensive computing capability, and support data-parallel computation. The significant difference between a CPU and a GPU is that the GPU devotes most of its transistors to ALUs, with only small caches and simple flow control. The high computing power, memory bandwidth, and data-parallel model of GPUs are the reasons behind general-purpose computing on GPUs.

Figure 1: Key difference between CPU and GPU based on the number of cores.

CUDA:
CUDA is the acronym for Compute Unified Device Architecture. CUDA is NVIDIA's parallel computing architecture, which enables dramatic increases in computing performance by harnessing the power of the GPU. Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this computing paradigm, NVIDIA created the CUDA parallel computing architecture, now shipping in GeForce, ION, Quadro, and Tesla GPUs, which represents a significant installed base for application developers. One indicator of CUDA adoption is the ramp of the Tesla GPU line for GPU computing. In this environment the GPU is not only the graphics processor but also a general-purpose parallel processor accessible to any application. CUDA is accessible to software developers through industry-standard programming languages and gives them access to the instruction set and memory of the parallel computation elements in the GPU.
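As a concrete illustration of this programming model, the following is a minimal CUDA C vector-addition example. It is a generic sketch, not part of this project's code; the array size and variable names are illustrative. It follows the copy-in, launch, copy-out flow described in the next section.

```cuda
// Hedged sketch: a minimal CUDA C program following the copy-in, launch,
// copy-out flow. The array size and names are illustrative, not from the report.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];                         // each thread handles one element
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 1. Copy data from main memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 2-3. The CPU launches the kernel; the GPU executes it in parallel
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 4. Copy the result from GPU memory back to main memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```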
CUDA provides the ability to use high-level languages such as C to develop applications that can take advantage of the high performance and scalability that GPU architectures offer. GPUs allow the creation of a very large number of concurrently executing threads at very low system resource cost.

Processing Flow of CUDA:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes the work in parallel on each core.
4. Copy the result from GPU memory back to main memory.

Memory Hierarchy:
CUDA devices have several different memories that developers can use to achieve high execution speed in their kernels. The types include global memory, local memory, texture memory, constant memory, shared memory, and register memory. Figure 2 below illustrates this memory organization. Each block has a set of local registers per thread and a shared memory common to all threads in the block. Registers and shared memory are on-chip, so variables placed in them can be accessed at high speed, whereas the other memory types are off-chip. To understand the differences between these memory types more deeply, the following paragraphs describe how each is allocated and used on modern GPUs.

Figure 2: Memory hierarchy for thread, block, and grid.

Global Memory:
The size of global memory varies from card to card. This memory is the main means of communicating read/write data between host and device, and its contents are visible to all threads. The __device__ qualifier declares that a variable resides in global memory. Global memory is allocated with cudaMalloc(), which allocates objects in device global memory and takes two parameters: the address of a pointer and the size of the memory to allocate. cudaFree() frees objects from device global memory.

Local Memory:
Local memory resides in device memory and is slower than registers or shared memory. It is used for arrays and other per-thread data occupying contiguous memory locations. Local memory is private to each thread and has the lifetime of the thread. The compiler automatically spills a variable to local memory when the registers available to the thread are not sufficient.

Constant Memory:
Constant memory is cached, which makes it fast. It is declared with the __constant__ qualifier, which restricts its usage to read-only access from kernels. Reading from constant memory can conserve memory bandwidth compared with reading the same data from global memory.

Texture Memory:
Texture memory can also be used for general-purpose computing. It is read-only and resides off-chip, but it is accessed through dedicated hardware and still provides higher effective bandwidth in some situations. Specifically, texture caches are designed for graphics applications whose memory access patterns exhibit a great deal of spatial locality, and texture memory has special features such as 2D prefetching.
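To make the read-only qualifiers above concrete, here is a minimal sketch of how constant memory is typically declared, filled, and read. The symbol name, size, and kernel are illustrative assumptions and are not taken from the project code.

```cuda
// Hedged sketch: declaring, filling, and reading constant memory.
// MAX_K, the symbol name, and the kernel are illustrative, not from the report.
#define MAX_K 128

__constant__ float c_centroids[MAX_K];   // read-only on the device, cached

__global__ void useConstants(float *out, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < k)
        out[i] = 2.0f * c_centroids[i];  // every thread reads the cached copy
}

// Host side: copy data into the constant symbol before launching the kernel, e.g.
//   cudaMemcpyToSymbol(c_centroids, h_centroids, k * sizeof(float));
```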
Shared Memory:
Each block has its own shared memory, which enables fast communication between threads in a block and holds data that will be read and written by multiple threads. Variables and arrays to be placed in the shared memory space of the multiprocessor are declared with the __shared__ (or __device__ __shared__) qualifier. The size of shared memory depends on the compute capability. Data stored in shared memory is visible to all threads within the block and exists for the lifetime of the block. In many situations, shared memory is faster than any of the off-chip memory types.

Register Memory:
Registers are likely the fastest memory; data stored in a register is visible only to the thread that wrote it and exists only for the lifetime of that thread. Automatic variables declared in a CUDA kernel are placed into registers. The occupancy rate depends strongly on the amount of register memory each thread requires.

The table below lists the different memory types and their accessibility with respect to threads, blocks, and grids.

Memory      Declaration                             Scope    Lifetime
Registers   Automatic variables other than arrays   Thread   Kernel
Local       Automatic array variables               Thread   Kernel
Shared      __shared__                              Block    Kernel
Global      __device__                              Grid     Application
Constant    __constant__                            Grid     Application

Table 1: Types of memory with their scope and lifetime.

Threads and Blocks:
Parallelism on the GPU is achieved through threads. CUDA can run a kernel across multiple threads and multiple cores. A thread is one execution of the kernel with a given index; each thread uses its index to access elements of an array. A block is a group of threads. Threads in a block can be coordinated with the __syncthreads() function, which makes a thread pause at a certain point in the kernel until all other threads in its block reach the same point. A grid is a group of blocks. On this hardware a block can contain up to 1,024 threads and a grid can contain up to 65,535 blocks per dimension. Spread across multi-card systems, or even multiple systems connected together, this degree of parallelism allows for much faster processing. CUDA devices are also capable of running many small programs simultaneously. Thread scheduling is handled by the warp scheduler, where a warp is a group of 32 threads; when programming in CUDA, users work with blocks, so it is up to the warp scheduler to decide how the work is partitioned into warps.

Figure 3: A single thread with its registers.
Figure 4: Eight threads working in parallel over N blocks.

How to Optimize Data Transfers in CUDA:
To transfer data efficiently between the host and the device, we compare pageable and pinned memory. The peak bandwidth between device memory and the GPU is 144 GB/s on the NVIDIA Tesla C2050, while the peak bandwidth between host memory and device memory is 8 GB/s over PCIe x16 Gen2. It is therefore good practice to minimize host-to-GPU transfers when possible. Moreover, grouping many small transfers into larger ones performs much better, since it eliminates most of the per-transfer overhead. In addition, using pinned (page-locked) host memory is a way to optimize data transfer: pinned memory gives the best PCIe performance for host-to-device transfers. Pinned memory should not be overused, however, because it cannot be paged out by the OS. Used carefully, pinned memory is the best way to optimize data transfers.
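As a concrete, hedged illustration of the pageable-versus-pinned comparison, the sketch below allocates both kinds of host buffer and times a host-to-device copy with CUDA events. The 64 MB buffer size and the variable names are illustrative; cudaMallocHost (one way to obtain pinned memory) and event-based timing are described in the following paragraphs.

```cuda
// Hedged sketch: timing a host-to-device copy from pageable vs. pinned memory
// with CUDA events. The 64 MB size and variable names are illustrative.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;
    float *pageable = (float *)malloc(bytes);     // ordinary (pageable) host memory
    float *pinned   = NULL;
    cudaMallocHost((void **)&pinned, bytes);      // page-locked (pinned) host memory

    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msPageable = 0.0f, msPinned = 0.0f;

    cudaEventRecord(start);
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msPageable, start, stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msPinned, start, stop);

    printf("pageable H2D: %.2f ms, pinned H2D: %.2f ms\n", msPageable, msPinned);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```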
One way to use pinned memory is to allocate host memory with the cudaHostAlloc() function; such memory can be read from and written to directly by the device. Transfers that go through pinned memory achieve the maximum bandwidth between the CPU and the device. During execution, a block that requires host data only needs to wait for a small portion of the data to be transferred when operating through pinned memory, whereas typical host-to-device copies make all blocks wait until all of the data associated with the copy operation has been transferred. However, too much pinned memory can degrade overall system performance by reducing the amount of memory available to the OS for paging, so it is important to have an idea of how much memory should be pinned.

Figure 5: Pageable and pinned data transfer.

One of the features added in CUDA 4.0 is Unified Virtual Addressing, through which pinned memory allocated with cudaMallocHost() is also mapped into the device's address space. With this feature, CUDA kernels can read or write such memory directly across the PCIe bus.

Measuring Data Transfer Times with nvprof:
To quantify the time spent in each data transfer, we can record a CUDA event before and after each transfer and use cudaEventElapsedTime(), or profile the application with nvprof.

K-MEANS CLUSTERING:
K-means is a clustering algorithm used in many applications of data mining, machine learning, and scientific computing. Scientific applications create large data sets that need to be classified into subsets; data clustering is the process of grouping similar objects into relatively homogeneous sets called clusters. Interest in accelerating K-means grew as the computational demands of data clustering increased rapidly, since it is very time-consuming for a single CPU to process large data sets.

K-means is one of the most popular clustering algorithms. It stores k centroids that define the clusters; a point belongs to a particular cluster if it is closer to that cluster's centroid than to any other centroid. K-means finds good centroids by alternating between (1) assigning data points to clusters based on the current centroids, and (2) recomputing the centroids (the centers of the clusters) based on the current assignment of data points to clusters. K-means clustering can be implemented in both sequential and parallel versions to test GPU performance.

Figure 6 shows a sample K-means result with cluster count k = 5 on data with two features (Feature 1 on the x-axis, Feature 2 on the y-axis). Features of the data can characterize its size, shape, and so on.

Figure 6: K-means with k = 5, i.e., 5 clusters in 2 dimensions.

SEQUENTIAL:
Choose a user-defined number of clusters k and initial centroids, then calculate the distance from each data point to every centroid. A data point belongs to the cluster whose centroid is at minimum distance from the point. The mean of each cluster is then updated. In the final step, check for convergence: the algorithm has converged when $\|\mu_{\text{new}} - \mu_{\text{old}}\| < \text{threshold}$ (0.001).

Figure 7: Flow chart of the sequential K-means algorithm: select the number of clusters K, initialize the centroids randomly, compute the distance from each data point to the centroids (Euclidean distance is often used), assign each point to the cluster at minimum distance, update the means, and repeat until the centroids converge.
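The flow just described can be written as a short sequential routine. The following is a minimal sketch in C; the flat array layout, function name, and iteration cap are our illustrative assumptions, since the report's actual code was shown only as figure snippets.

```c
/* Hedged sketch of the sequential assignment/update loop described above.
 * Data layout (flat arrays), names, and the iteration cap are illustrative. */
#include <float.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

void kmeans_seq(const float *data, int n, int d, int k,
                float *centroids, int *label, int max_iter)
{
    float *sums   = malloc(k * d * sizeof(float));
    int   *counts = malloc(k * sizeof(int));

    for (int iter = 0; iter < max_iter; ++iter) {
        memset(sums, 0, k * d * sizeof(float));
        memset(counts, 0, k * sizeof(int));

        /* Labeling stage: assign each point to the nearest centroid. */
        for (int i = 0; i < n; ++i) {
            float best = FLT_MAX; int bestC = 0;
            for (int c = 0; c < k; ++c) {
                float dist = 0.0f;
                for (int j = 0; j < d; ++j) {
                    float diff = data[i*d + j] - centroids[c*d + j];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; bestC = c; }
            }
            label[i] = bestC;
            counts[bestC]++;
            for (int j = 0; j < d; ++j) sums[bestC*d + j] += data[i*d + j];
        }

        /* Update stage: recompute each centroid as the mean of its members,
         * and stop when no centroid moves by more than the threshold. */
        float maxShift = 0.0f;
        for (int c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;
            for (int j = 0; j < d; ++j) {
                float newVal = sums[c*d + j] / counts[c];
                float shift  = fabsf(newVal - centroids[c*d + j]);
                if (shift > maxShift) maxShift = shift;
                centroids[c*d + j] = newVal;
            }
        }
        if (maxShift < 0.001f) break;   /* ||mu_new - mu_old|| < threshold */
    }
    free(sums);
    free(counts);
}
```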
PARALLEL VERSIONS:
There are two parallel versions of the K-means algorithm.

Version 1 (Hybrid Model)
1. Initialize the cluster centroids randomly.
2. Assign each data point to the nearest cluster center (labeling stage), computed in parallel:
   $c_j \leftarrow \arg\min_{i \in \{1,\dots,k\}} \|x_j - \mu_i\|^2$
3. Recompute the cluster centroids (update stage) as a serial sum on the CPU:
   $\mu_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$
4. If any cluster assignment changed, go to step 2.

This version parallelizes only the labeling stage. The per-iteration computation complexity is $\mathcal{O}(nk/p) + \mathcal{O}(n + k)$ and the space complexity is $\mathcal{O}((n + k)d)$.

Version 2 (Everything on the GPU)
1. Initialize the cluster centroids randomly.
2. Assign each data point to the nearest cluster center (labeling stage), in parallel:
   $c_j \leftarrow \arg\min_{i \in \{1,\dots,k\}} \|x_j - \mu_i\|^2$
3. Recompute the cluster centroids (update stage), also in parallel:
   $\mu_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$
4. If any cluster assignment changed, go to step 2.

This version parallelizes both the labeling and the update stages; the per-iteration computation complexity is $\mathcal{O}(nk/p) + k\,\mathcal{O}(N/p + k\log_2(N/k))$.

EXPERIMENTAL SETUP:

Hydra:
The system on which we test the single-core CPU version and the GPU versions is Hydra. Its specification is listed below.

Host Name: hydra.ucdenver.pvt
Number of Nodes: 17 (1 master node and 17 compute nodes)
Total CPU Cores: 268 (256 compute cores)
Number of GPUs: 9 Tesla Fermi GPUs
Total GPU CUDA Cores: 4032 (448 cores per GPU at 1.15 GHz per core)
Total Max GFLOPS of CPUs: 510 (2.5 GFLOPS per core)
Total Max GFLOPS of GPUs: 9515 GFLOPS
Processors per Node: 2 x 6-core processors (Nodes 0-15); 4 x 16-core processors (Node 16)
Cores per Node: 12 cores (Nodes 0-15); 16 cores (Node 16)
Processor Type: AMD Opteron 2427 (Nodes 0-15); AMD Opteron 6274 (Node 16)
Processor Speed: 2.2 GHz
L1 Instruction Cache per Processor: 6 x 64 KB (Opteron 2427); 8 x 64 KB (Opteron 6274)
L1 Data Cache per Processor: 6 x 64 KB (Opteron 2427); 16 x 16 KB (Opteron 6274)
L2 Cache per Processor: 6 x 512 KB (Opteron 2427); 8 x 2 MB (Opteron 6274)
L3 Cache per Processor: 6 MB (Opteron 2427); 2 x 8 MB (Opteron 6274)

Table 2: Hydra specification [2]

Tesla:
The Tesla line is designed to deal with the large amounts of data found in scientific computing. Hydra has 9 NVIDIA Tesla S2050 GPUs, which occupy nodes 12-16: nodes 12-15 have two GPUs each and node 16 has one GPU. The Tesla S2050 has 448 CUDA cores per GPU and delivers up to 515 gigaflops in double-precision calculations, with single-precision peak performance over a teraflop per GPU. It comes standard with 3 GB of GDDR5 memory (144 GB/s), of which 2.625 GB is available to the user. The reason customers opt for Tesla rather than GeForce for such tasks is that GeForce cards are intended primarily for games, whereas the Tesla line was created for scientific computation and is suited to handling large amounts of data. The Tesla S2050 also comes with ECC memory, which protects data in memory and enhances data integrity and reliability for applications.
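These device characteristics can also be queried at run time. The deviceQuery sample described next reports them; a minimal equivalent using cudaGetDeviceProperties is sketched below, with an illustrative selection of fields (not the sample's actual output format).

```cuda
// Hedged sketch: querying basic device properties at run time, roughly what
// the deviceQuery sample reports. The choice of printed fields is illustrative.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
        printf("  Shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
        printf("  Registers per block: %d\n", prop.regsPerBlock);
        printf("  ECC enabled: %d\n", prop.ECCEnabled);
    }
    return 0;
}
```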
deviceQuery:
This SDK sample enumerates the properties of the CUDA devices present in the system.

Figure 8: Device specification for device 0.

Figure 9: Device specification for device 1.

bandwidthTest – Bandwidth Test:
This is a basic test program that gauges the memory-copy bandwidth of the GPU and the memcpy bandwidth across PCIe. The application can measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

Figure 10: Bandwidth test for device 0 on Tesla.

CPU:
On the CPU side, Hydra has a total of 268 CPU cores (256 compute cores), with a total maximum of 510 GFLOPS at 2.5 GFLOPS per core.

Profiling:
Profiling reveals the interaction between software and the underlying machine architecture and indicates areas for improvement. It is a detailed form of program analysis that measures, for example, the complexity of a program, the behavior of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to guide program optimization. The most common CUDA profiling tools are Nsight, nvvp, and nvprof. These tools received a number of changes and new features as part of the CUDA Toolkit 7.0 release.

The Visual Profiler (nvvp) has been updated with several enhancements: performance is improved when loading large data files, and the timeline view has been improved for multi-GPU MPS profile data. Unified memory profiling is also enhanced, providing fine-grained data transfers to and from the GPU together with more accurate timestamps for each transfer. nvprof has been updated so that all events and metrics for devices with compute capability 3.x and 5.0 can now be collected accurately even when multiple contexts are present on the GPU. NVIDIA Nsight is a development platform for heterogeneous computing with debugging and profiling tools that help optimize the performance of both the CPU and the GPU; beyond optimization, these tools help you gain a better understanding of your code by identifying and analyzing bottlenecks and observing the behavior of all system activities.

nvprof – GPU summary:
Running an application with "nvprof ./Filename" prints a summary of all the kernels and memory copies it used. The summary groups all calls to the same kernel together, presenting the total time and the percentage of total application time for each kernel.

nvprof – GPU trace:
Running an application with "nvprof --print-gpu-trace" shows which GPU each kernel ran on, as well as the grid dimensions used for each launch. This is very useful for verifying that a multi-GPU application is running as expected.

nvprof for remote profiling:
When the target system is not our desktop and we only have terminal access to the machine, nvprof is a convenient tool: simply connect to the remote machine (using ssh, for example) and run the application under nvprof.
nvprof also provides a handy option, --analysis-metrics, to capture all of the GPU metrics that the Visual Profiler needs for its "guided analysis" mode; we can run the application remotely under "nvprof --analysis-metrics" and analyze the results later.

OUTPUT:

CSV:
For every profiling mode, the --csv option can be used to generate output in comma-separated values (CSV) format. The result can be imported directly into spreadsheet software such as Excel.

--output-profile:
With the --output-profile command-line option, we can write a data file for later import into either nvprof or the NVIDIA Visual Profiler. This makes it possible to capture a profile on a remote machine and then visualize and analyze the results on a desktop in the Visual Profiler.

IMPLEMENTATION:

DATASET:
The data set used for our implementation of the K-means algorithm is the Covertype data set [3], which predicts forest cover type from cartographic variables.
- Number of data points: 581,012
- Features: 54
- Feature characteristics: real numbers

K-Means Sequential:
The sequential snippets (shown as figures in the original report) read the data set, allocate 2D memory for it, and initialize the centroids; label each data point with the nearest centroid by computing its distances; and finally check for convergence and update the centroid positions.

RESULTS:

Figure 11: Sequential K-means output with the computation and I/O time for 15,000 objects and 11,000 clusters.

K-Means Parallel:

Version 1 (Hybrid Model):
The corresponding snippet initializes the number of threads and makes the kernel call that computes the distance from each point to the centroids and updates the labels. The final summation of distances for the new centroids is performed on the CPU, not on the GPU.

Version 2:
Two kernel calls are made so that all of the K-means work is performed on the GPU. Version 2 is similar to Version 1 in that it has the labeling kernel, but in Version 2 the final summation for the centroids is also performed on the GPU by calling an update kernel.

Memory Types Implementation:

Global Memory:
For the kernel call, global memory is allocated on the device (via cudaMalloc) and is accessed inside the kernel through pointers passed as kernel arguments.

Shared Memory:
The shared-memory variables are declared with the __shared__ qualifier, and __syncthreads() is used so that all threads in the block have finished writing shared memory before any thread reads it.

Constant Memory:
Constant memory is declared for the device membership data, since it is always required when updating the centroids. The amount of constant memory allocated is shown below.

Figure 12: Constant memory allocation.

Register Memory:
Registers are used to hold the current and minimum distances to the centroids before updating them. The number of registers used per thread is reported by the profiler.

Local Memory:
Local memory use was forced by limiting the register count to zero at compile time, so that automatic variables spill to local memory.
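Because the implementation snippets above appeared only as figures in the report, the sketch below illustrates what a labeling (assignment) kernel can look like with the centroids staged in shared memory. The kernel signature, names, and the assumption that all k*d centroid values fit in shared memory are ours, not the report's.

```cuda
// Hedged sketch of the labeling (assignment) kernel: one thread per data point,
// centroids staged in shared memory. Assumes k*d floats fit in shared memory;
// names and layout are illustrative, not the report's actual code.
extern __shared__ float s_centroids[];   // size k*d, supplied at launch time

__global__ void labelPoints(const float *data, int n, int d, int k,
                            const float *centroids, int *label)
{
    // Cooperatively load the centroids into shared memory.
    for (int idx = threadIdx.x; idx < k * d; idx += blockDim.x)
        s_centroids[idx] = centroids[idx];
    __syncthreads();                      // all centroids loaded before use

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float best = 3.4e38f;                 // effectively "infinity" for float
    int bestC = 0;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = data[i * d + j] - s_centroids[c * d + j];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; bestC = c; }
    }
    label[i] = bestC;
}

// Illustrative launch:
//   labelPoints<<<blocks, threads, k * d * sizeof(float)>>>(d_data, n, d, k, d_centroids, d_label);
```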
Results:

Sequential Version:
Number of data points n = 15,000. The x-axis is the number of clusters K, which we increase through 1000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, and 13000 (the number of clusters must always be less than the number of data points). The y-axis is the execution time in seconds.

Figure 13: Execution time of sequential K-means for various numbers of clusters K.

Observation: As the number of clusters increases, the computation time also increases. The sequential version leads to very long computation times for larger data sets; for instance, with 500,000 points it ran for more than 5 hours without producing a result. One of the main reasons it runs so long is that it must compute the distance from every data point to every centroid one at a time. For this reason we optimized the sequential code by implementing the K-means algorithm in CUDA.

Steps: Run the unoptimized original code (the sequential version) and save the results; then run the optimized code against the same data and compare its results with the original sequential results.

Parallel Version:
Number of data points n = 15,000. The x-axis is the number of clusters K, increased through the same values 1000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, and 13000 (again, the number of clusters must always be less than the number of data points). The y-axis is the execution time in seconds. For n = 15,000 and k = 13,000 the parallel time is 126.67 seconds.

Figure 14: Execution time of parallel K-means for various numbers of clusters K.

Observations: From Table 3 we can see that the parallel version is much better than the sequential version for a constant value of n and increasing values of k.

Table 3: Comparison of sequential and parallel K-means for the same clusters and data points, but different cores and clock speeds.

From Table 4 we can see that the sequential version failed for the larger data set, whereas the parallel version of the K-means algorithm took 168.98872 seconds of computation time for the larger data set with n = 500,000.

                   Sequential                          Parallel
n                  500,000                             500,000
k                  13,000                              13,000
Time (seconds)     Failed (> 5 hrs with no result)     168.98872

Table 4: Comparison of sequential and parallel K-means for the same clusters and data points, but different cores and clock speeds, where the sequential version failed to execute.

Memory:
The nvprof summary for the memory types used is shown below. The figure shows the computation time, the I/O time, and the amount of time spent copying data between host and device.

Figure 15: nvprof summary of the K-means algorithm used to analyze the transfer time.

The number of registers used by the K-means kernels, the numbers of threads, blocks, and grids used, and the throughput achieved when transferring data between host and device are shown below.

Figure 16: nvprof GPU trace used to analyze throughput and memory occupancy.

The final comparison of the different types of memory access in CUDA for the K-means algorithm is shown below. The graph gives the computation time of the K-means algorithm, with the clock started when the input is received, so that it captures both the transfer and the computation time.

Figure 17: Computation time of the K-means algorithm for the various memory types.
Observation: From the graph, shared memory has the lowest computation time because it is on-chip and multiple threads can access it simultaneously. Register memory, in contrast, shows a higher computation time, since registers cannot be shared between threads. Constant memory is essentially read-only memory placed off-chip but cached, and modifying the code to use it where the data allowed improved performance. In addition, there were not enough registers available in a single kernel call, which resulted in register spilling that moved data into local memory; the register computation time shown above is therefore a combination of register and local memory. Global memory is off-chip and accessible to all threads in all blocks, which made its computation time the largest. It is therefore best to maximize the use of shared memory over the other memory types for our K-means algorithm.

Optimizing Data Transfer:
Because the K-means algorithm requires a large number of data points to be transferred and computed, it is necessary to optimize the data transfer between host and device.

Chart: Comparison of pinned and pageable memory on data transfers (seconds, transferring 1 GB of data, host-to-device and device-to-host): pageable 439.64 and 399.03; pinned 356.85 and 339.53.

Observation: The chart above shows that it is always more efficient to use pinned memory when transferring data between host and device. The chart was plotted for n = 500,000 and k = 13,000.

Conclusion and Future Work:
For the implemented K-means algorithm, it is most efficient to use parallel Version 2 with a shared-memory data structure to improve the throughput and occupancy of the system, and to use pinned-memory transfers to reduce the transfer time. Future work includes better parallelization of K-means through sum reduction, using efficient parallel reductions that exchange data between threads within the same thread block.

References:
[1] NVIDIA, "What is GPU Computing?" http://www.nvidia.com/object/what-is-gpu-computing.html
[2] Hydra hardware, http://pds.ucdenver.edu/wiki/index.php?title=Hardware_on_Hydra
[3] Covertype Data Set, http://archive.ics.uci.edu/ml/datasets/Covertype
[4] Farivar, Reza, Daniel Rebolledo, Ellick Chan, and Roy H. Campbell. "A Parallel Implementation of K-Means Clustering on GPUs." In PDPTA, vol. 13, no. 2, pp. 212-312, 2008.
[5] Zechner, Mario, and Michael Granitzer. "K-Means on the Graphics Processor: Design and Experimental Analysis." International Journal on Advances in Systems and Measurements 2, no. 2-3 (2009): 224-235.
[6] Ivan Tanasic, Lluís Vilanova, Marc Jordà, Javier Cabezas, Isaac Gelado, Nacho Navarro, and Wen-mei Hwu. "Comparison Based Sorting for Systems with Multiple GPUs." In Proceedings of the 6th Workshop on General Purpose Processing Using Graphics Processing Units (GPGPU-6), pp. 1-11, 2013.
[7] Ren Wu, Bin Zhang, and Meichun Hsu. "Clustering Billions of Data Points Using GPUs." In Proceedings of the Combined Workshops on UnConventional High Performance Computing Workshop plus Memory Access Workshop (UCHPC-MAW '09), ACM, New York, NY, USA, pp. 1-6, 2009.
[8] Zechner, M., and Granitzer, M. "Accelerating K-Means on the Graphics Processor via CUDA." In Intensive Applications and Services, 2009 (INTENSIVE '09), First International Conference on, pp. 7-15, 20-25 April 2009.