Performance Analysis

For the performance analysis, we will use events provided by the CUDA Runtime API to measure the device runtime, and struct timespec, defined in time.h, to measure the host runtime.

CUDA Runtime

For the device performance analysis, we will use CUDA events to record the begin and end times of the kernel execution. If the stream passed to cudaEventRecord() is non-zero, the event is recorded after all preceding operations in that stream have completed; otherwise, it is recorded after all preceding operations in the CUDA context have completed. Since this operation is asynchronous, cudaEventSynchronize() is used to wait until the event has actually been recorded. We will use the algorithm below:

Create CUDA events to store the begin and end times of the kernel execution:

    cudaEvent_t begin;
    cudaEvent_t end;
    cudaEventCreate( &begin );
    cudaEventCreate( &end );

Record the begin and end times of the kernel execution, with the kernel call between the two:

    cudaEventRecord( begin, 0 );
    /* perform Jacobi Relaxation */
    cudaEventRecord( end, 0 );

Wait for kernel completion, and then compute the kernel runtime from the time elapsed between the recorded begin and end times (the elapsed time, in milliseconds, is stored in elapsed_time):

    cudaEventSynchronize( end );
    float elapsed_time;
    cudaEventElapsedTime( &elapsed_time, begin, end );
    /* display kernel runtime */

Host Runtime

For the host performance analysis, we will use struct timespec, a structure that represents an elapsed time. It is declared in time.h and has the following members:

time_t tv_sec: the number of whole seconds of elapsed time.

long int tv_nsec: the rest of the elapsed time (a fraction of a second), expressed as a number of nanoseconds. It is always less than one billion.

Here we will use the CLOCK_MONOTONIC clock, which represents the absolute elapsed wall-clock time since some arbitrary, fixed point in the past. It isn't affected by changes in the system time-of-day clock.
It is the most appropriate clock for our analysis; the CLOCK_REALTIME clock, which represents the machine's best guess at the current wall-clock, time-of-day time, will not be used. To ensure that precision is not lost when timing very short intervals, we will use the algorithm below:

Create timespec objects to store the begin and end times, and a variable to store the runtime:

    struct timespec begin, end;
    double elapsed_time;

Record the begin and end times of the host execution, with the execution function call between the two:

    clock_gettime(CLOCK_MONOTONIC, &begin);
    /* perform Jacobi Relaxation */
    clock_gettime(CLOCK_MONOTONIC, &end);

Compute the host runtime from the time elapsed between the recorded begin and end times (the elapsed time, in seconds, is stored in elapsed_time):

    elapsed_time = (end.tv_sec - begin.tv_sec);
    elapsed_time += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
    /* display host runtime */

Note that here we added the additional computation:

    elapsed_time += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;

This adds the sub-second (nanosecond) part of the interval to the whole-second difference. Dividing by the floating-point constant 1000000000.0 keeps the fractional part, which ensures that precision is not lost when timing very short intervals.
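
The host timing steps above can be wrapped in a small reusable helper. The following is a minimal sketch in C, not the code used for the measurements described here. The names elapsed_seconds, time_host_run, placeholder_workload and NANOS_PER_SECOND are hypothetical, and placeholder_workload merely stands in for the Jacobi relaxation call. It also shows that the nanosecond difference may be negative (when the end sample has a smaller tv_nsec than the begin sample) without affecting the result, because the signed fraction is added to the whole-second difference.

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() under strict -std= modes */
#include <assert.h>
#include <time.h>

#define NANOS_PER_SECOND 1000000000L  /* hypothetical name for the 1e9 constant */

/* Hypothetical helper: seconds elapsed between two timespec samples. */
static double elapsed_seconds(struct timespec begin, struct timespec end)
{
    /* tv_nsec is expected to be non-negative and less than one billion. */
    assert(begin.tv_nsec >= 0 && begin.tv_nsec < NANOS_PER_SECOND);
    assert(end.tv_nsec >= 0 && end.tv_nsec < NANOS_PER_SECOND);

    /* Whole-second part of the interval. */
    double elapsed = (double)(end.tv_sec - begin.tv_sec);

    /* Sub-second part; it may be negative, and the signed sum is still correct. */
    elapsed += (double)(end.tv_nsec - begin.tv_nsec) / (double)NANOS_PER_SECOND;
    return elapsed;
}

/* Hypothetical stand-in for the host Jacobi relaxation call. */
static void placeholder_workload(void)
{
    volatile double sum = 0.0;  /* volatile so the loop is not optimized away */
    for (int i = 0; i < 100000; ++i)
        sum += (double)i * 0.5;
}

/* Time one call of work(); returns seconds, or -1.0 if the clock cannot be read. */
static double time_host_run(void (*work)(void))
{
    struct timespec begin, end;

    if (clock_gettime(CLOCK_MONOTONIC, &begin) != 0)
        return -1.0;
    work();
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return elapsed_seconds(begin, end);
}
```

A caller could write double t = time_host_run(placeholder_workload); and print t with a format such as "%.9f" to see the nanosecond digits. Returning -1.0 on a clock failure is a design choice of this sketch; a negative value cannot be confused with a valid CLOCK_MONOTONIC interval.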