Parallel & Distributed Computing
Assignment 3

Submitted by: Zumna Usman, 20i-1873, CY-M
Submitted to: Dr. Qaiser Shafi
November 10, 2023

Table of Contents
Question 1: Matrix Multiplication using OpenCL
    Introduction
    Graph
    Working
    Discussion
    Results/Output
Question 2: Merge Sort using OpenCL
    Introduction
    Graph
    Working
    Discussion
    Results/Output

Question 1: Matrix Multiplication using OpenCL

Introduction

This code performs parallel matrix multiplication using OpenCL.
A matrix of size 512x512 is generated and multiplied using both parallel and serial multiplication so that the results can be compared. Both the CPU and the GPU can be used, since the user is given a choice among the available OpenCL platforms.

Graph

Just by looking at the graph, we can see the difference between serial and parallel matrix multiplication. The serial version takes approximately 1427.241 ms, the NVIDIA device takes 38.512 ms, and the Intel device takes 20.712 ms. The reason is that my graphics card is an NVIDIA GeForce MX330, which takes more time on this workload than the Intel(R) Iris(R) GPU.

Working

This code demonstrates parallel matrix multiplication using OpenCL, a framework for parallel computing. It begins by generating two random square matrices of size 512x512, denoted A and B. The program then offers the user a choice between the available OpenCL platforms, specifically those from Intel and NVIDIA. This selection is crucial because it determines which hardware resources will be used for the computation. Once a platform is chosen, the code initializes an OpenCL context, creates buffers for the matrices, and compiles and executes an OpenCL kernel for matrix multiplication. The kernel divides the computation into smaller tasks that are executed concurrently by multiple work-items. After the parallel computation, the program measures the execution times of both the parallel and serial matrix multiplications, and the resulting matrices from the two methods are compared to verify correctness.

Discussion

The code demonstrates a practical implementation of OpenCL and highlights its significance in parallel computing. OpenCL is a powerful framework for leveraging the computational capabilities of various devices, such as CPUs and GPUs, and this code exemplifies how OpenCL allows developers to write programs that can execute seamlessly across different hardware architectures.
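The report describes compiling and launching a matrix-multiplication kernel but does not reproduce it. A minimal sketch of such a kernel is shown below, assuming row-major float matrices and a 2-D NDRange with one work-item per output element; all names (matmul, A, B, C, N) are illustrative, not taken from the actual source, and the code requires an OpenCL runtime to build and run.

```c
// Hypothetical OpenCL kernel: one work-item computes one element of C = A * B.
// A, B, C are row-major N x N float matrices.
__kernel void matmul(__global const float *A,
                     __global const float *B,
                     __global float *C,
                     const int N)
{
    int row = get_global_id(0);   // output row this work-item handles
    int col = get_global_id(1);   // output column this work-item handles
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)          // dot product of row of A and column of B
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```

The host would enqueue this kernel with a global work size of 512x512, so that all 262,144 output elements are computed by concurrently scheduled work-items.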
The core focus of this code is matrix multiplication, emphasizing the contrast between the parallel and serial approaches. The parallel approach takes advantage of concurrent processing, breaking the computation into smaller tasks that can execute simultaneously; this yields substantial performance gains, particularly for larger matrices. The serial approach, on the other hand, follows the conventional nested-loop method, where each element of the resulting matrix is calculated sequentially. This method is straightforward but becomes inefficient for larger matrix sizes.

The code offers the user a choice between the available OpenCL platforms, specifically those provided by Intel and NVIDIA. This choice is pivotal, since it determines which hardware resources are used for the computation. The Intel platform here exposes CPU and integrated-GPU devices, while the NVIDIA platform provides a discrete GPU. The platform selection can significantly affect the performance of the parallel computation, underscoring the importance of considering hardware characteristics when using OpenCL.

The code goes a step further by measuring the execution times of both the parallel and serial matrix multiplications. This empirical comparison provides valuable insight into the practical impact of parallel processing: by comparing the execution times, developers can quantitatively assess the performance benefit gained from parallel computation.

Results/Output

These execution times clearly demonstrate the significant performance advantage of the parallel approach over the serial method. The parallel multiplication completed in a fraction of the time taken by the serial method, highlighting the effectiveness of parallel processing through OpenCL. This outcome underscores the potential for substantial gains in computational efficiency when dealing with large datasets or complex computations.
Question 2: Merge Sort using OpenCL

Introduction

This task requires creating a random array of size 1000 and sorting it with merge sort in parallel using OpenCL kernels. The sorting time is then compared against a standard sequential CPU implementation of merge sort.

Graph

Working

The code initializes OpenCL, loads kernel code from an external file, kernel.cl, and implements both parallel and sequential merge sort for a randomly generated float array of size 1000. The program starts by setting up the OpenCL environment, including platform and device selection, context creation, and command-queue setup. The OpenCL kernel code defining the merge-sort kernels is loaded from the "kernel.cl" file. Random float values are generated to populate the initial array, and OpenCL buffers are allocated for the input data and temporary storage. The program then sets the kernel arguments, executes the parallel merge sort using OpenCL, reads back the result, and measures the execution time. Subsequently, the code performs a sequential merge sort on a copy of the original data and measures its execution time. Finally, the sorted results and the execution times of both implementations are printed for analysis.

Discussion

The code effectively showcases the contrasting performance of the parallel and sequential merge-sort algorithms. The parallel implementation leverages OpenCL and GPU parallelism to achieve a significant reduction in sorting time compared to the traditional sequential approach. Random data generation ensures diverse testing scenarios, allowing a robust evaluation of both algorithms. The code is well structured and manages OpenCL resources appropriately to prevent memory leaks. The output provides valuable insight into the advantages of parallel processing for sorting large datasets, highlighting the potential of OpenCL for optimizing compute-intensive tasks.
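The contents of kernel.cl are not reproduced in the report. One common way to express merge sort on a GPU is as repeated bottom-up merge passes, where the host enqueues the kernel once per pass, doubling the run width and swapping the input and output buffers each time. A hypothetical kernel in that style (all names are assumptions, and it requires an OpenCL runtime) might look like:

```c
// Hypothetical bottom-up merge pass. Each work-item merges one pair of
// adjacent sorted runs of length `width` from `in` into `out`. The host
// launches this kernel log2(n) times with width = 1, 2, 4, ..., swapping
// the `in` and `out` buffers between passes.
__kernel void merge_pass(__global const float *in,
                         __global float *out,
                         const int width,
                         const int n)
{
    int start = get_global_id(0) * 2 * width;  // left edge of this pair of runs
    if (start >= n) return;
    int mid = min(start + width, n);           // boundary between the two runs
    int end = min(start + 2 * width, n);       // right edge (clamped at n)
    int i = start, j = mid, k = start;
    while (i < mid && j < end)                 // standard two-way merge
        out[k++] = (in[i] <= in[j]) ? in[i++] : in[j++];
    while (i < mid) out[k++] = in[i++];
    while (j < end) out[k++] = in[j++];
}
```

Early passes offer the most parallelism (many small merges run concurrently), which is where the GPU's advantage over the sequential version comes from.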
Overall, the code successfully demonstrates the working principles and benefits of parallel merge sort using OpenCL.

Results/Output

The serial implementation of the merge sort algorithm took approximately 0.1484 seconds to sort the dataset. In a sequential execution, elements of the array are compared and merged one at a time, a straightforward but time-consuming process. This time serves as the baseline for comparison with the parallelized version.

In contrast, the parallel implementation completed the sorting process in approximately 0.0029 seconds. The parallel version leverages the computational power of a GPU through OpenCL, allowing multiple elements of the array to be processed concurrently, which results in a notable reduction in overall execution time compared to the serial version. The resulting speedup highlights the effectiveness of parallelization in optimizing the sorting algorithm for this dataset.