
20i-1873 Report

Parallel & Distributed Computing
Assignment 3
Submitted By
Zumna Usman
20i-1873
CY-M
Submitted to
Dr. Qaiser Shafi
November 10, 2023
Table of Contents
Question 1: Matrix Multiplication using OpenCL
    Introduction
    Graph
    Working
    Discussion
    Results/Output
Question 2: Merge Sort using OpenCL
    Introduction
    Graph
    Working
    Discussion
    Results/Output
Question 1: Matrix Multiplication using OpenCL
Introduction
This code performs parallel matrix multiplication using OpenCL. Two matrices of size 512x512 are multiplied with both a parallel and a serial implementation so the results and timings can be compared. Either the CPU or the GPU can be used, since the user is given the choice of selecting one of the available platforms.
Graph
Just by looking at the graph, we can see the difference between serial and parallel matrix multiplication. The serial version takes approximately 1427.241 ms, the NVIDIA device takes 38.512 ms, and the Intel device takes 20.712 ms. My graphics card is an NVIDIA GeForce MX330, which takes more time on this workload than the Intel(R) Iris(R) GPU.
Working
This code demonstrates parallel matrix multiplication using OpenCL, a powerful framework for
parallel computing. It begins by generating two random square matrices of size 512x512, denoted
as matrices A and B. The program then provides the user with a choice between available OpenCL
platforms, specifically from Intel and Nvidia. This selection is crucial as it determines which
hardware resources will be utilized for the computations. Once the platform is chosen, the code
initializes an OpenCL context, creates buffers for the matrices, and compiles and executes an
OpenCL kernel for matrix multiplication. The kernel divides the computation into smaller tasks,
which are executed concurrently by multiple work-items. After the parallel computation, the
program measures the execution times for both the parallel and serial matrix multiplications. The
resulting matrices from both methods are compared to ensure correctness.
Discussion
The code demonstrates a practical implementation of OpenCL, highlighting its significance in
parallel computing. OpenCL serves as a powerful framework for leveraging the computational
capabilities of various devices, such as CPUs and GPUs. This code exemplifies how OpenCL
allows developers to write programs that can seamlessly execute across different hardware
architectures.
The core focus of this code is on matrix multiplication, emphasizing the contrast between parallel
and serial approaches. The parallel approach takes advantage of concurrent processing, breaking
down the computation into smaller tasks that can be executed simultaneously. This leads to
substantial performance gains, particularly for larger matrices. On the other hand, the serial
approach adheres to the conventional nested loop method, where each element of the resulting
matrix is calculated sequentially. This method is straightforward but may become inefficient for
larger matrix sizes.
The code offers the user the choice of selecting between available OpenCL platforms, specifically
those provided by Intel and Nvidia. This choice is pivotal, as it determines which hardware
resources will be utilized for the computations. Intel platforms typically represent CPU-based
solutions, while Nvidia platforms are renowned for their GPU-based capabilities. The selection of
platform can significantly impact the performance of the parallel computation. This aspect
underscores the importance of considering hardware characteristics when employing OpenCL.
The code goes a step further by measuring the execution times for both parallel and serial matrix
multiplications. This empirical analysis provides valuable insights into the practical implications
of utilizing parallel processing. By comparing the execution times, developers can quantitatively
assess the performance benefits gained from parallel computation.
Results/Output
These execution times clearly demonstrate the significant performance advantage of the parallel
approach over the serial method. The parallel multiplication completed in a fraction of the time
taken by the serial method, highlighting the effectiveness of leveraging parallel processing through
OpenCL. This outcome underscores the potential for substantial gains in computational efficiency
when dealing with large datasets or complex computations.
Question 2: Merge Sort using OpenCL
Introduction
This program creates a random array of size 1000 and sorts it with merge sort in parallel using OpenCL kernels. The sorting time is then compared against a standard CPU-based sequential merge sort.
Graph
Working
The code initializes OpenCL, loads kernel code from an external file kernel.cl, and implements
both parallel and sequential merge sort algorithms for a randomly generated float array of size
1000. The program starts by setting up the OpenCL environment, including platform and device
selection, context creation, and command queue setup. The OpenCL kernel code is loaded from
the "kernel.cl" file, which defines the merge sort kernels. Random float values are generated and used
to populate the initial array. OpenCL buffers are then allocated for input data and temporary
storage. The program proceeds to set kernel arguments, execute the parallel merge sort using
OpenCL, read back the result, and measure the execution time. Subsequently, the code performs a
sequential merge sort on a copy of the original data, measuring its execution time. Finally, the
sorted results and execution times for both parallel and sequential implementations are printed for
analysis.
Discussion
The code effectively showcases the contrasting performance between parallel and sequential
merge sort algorithms. The parallel implementation leverages OpenCL and GPU parallelism to
achieve a significant reduction in sorting time compared to the traditional sequential approach.
The random data generation ensures diverse testing scenarios, allowing for a robust evaluation of
both algorithms. The code is well-structured, appropriately managing OpenCL resources to
prevent memory leaks. The output provides valuable insights into the advantages of parallel
processing in sorting large datasets, highlighting the potential of OpenCL for optimizing compute-intensive tasks. Overall, the code successfully demonstrates the working principles and benefits of
parallel merge sort using OpenCL in the context of sorting algorithms.
Results/Output
The serial implementation of the merge sort algorithm took approximately 0.1484 seconds to sort
the dataset. In a sequential execution, each element of the array is individually compared and
merged, resulting in a relatively straightforward, yet time-consuming process. This time serves as
a baseline for comparison with the parallelized version.
In contrast, the parallel implementation of the merge sort algorithm exhibited significantly
improved performance, completing the sorting process in approximately 0.0029 seconds. The
parallel version leverages the computational power of a GPU through OpenCL, allowing multiple
elements of the array to be processed concurrently. This parallelization results in a notable
reduction in the overall execution time compared to the serial version.
The parallel implementation demonstrates a substantial speedup compared to the serial
implementation, highlighting the effectiveness of parallelization in optimizing the sorting
algorithm for the given dataset. The parallel execution takes advantage of parallel processing
capabilities, reducing the time required to complete the sorting task.