Performance Analysis

For the performance analysis, we will use events provided by the CUDA Runtime API
for the calculation of the device runtime and struct timespec defined in time.h for the
calculation of host runtime.
CUDA Runtime
For the device performance analysis, we will use CUDA events to handle the begin and
end times of the kernel execution. If the event stream is non-zero, the event is recorded
after all preceding operations in the stream have been completed; otherwise, it is recorded
after all preceding operations in the CUDA context have been completed. Since this
operation is asynchronous, cudaEventSynchronize() is used to determine when the
event has actually been recorded.
We will use the algorithm below:
Create CUDA events to store the begin and end times of the kernel execution


cudaEvent_t begin; cudaEvent_t end;
cudaEventCreate( &begin ); cudaEventCreate( &end );
Record the begin and end times of the kernel execution with the kernel call between the two



cudaEventRecord( begin, 0 );
/* launch the Jacobi relaxation kernel here */
cudaEventRecord( end, 0 );
Wait for kernel completion, then compute the kernel runtime from the time elapsed between the
recorded begin and end times (the elapsed time, in milliseconds, is stored in elapsed_time)



float elapsed_time;
cudaEventSynchronize( end );
cudaEventElapsedTime( &elapsed_time, begin, end );
printf( "Kernel runtime: %f ms\n", elapsed_time ); /* display kernel runtime; needs stdio.h */
Host Runtime
For the host performance analysis, we will use struct timespec, a structure that represents
an elapsed time. It is declared in time.h and has the following members:
time_t tv_sec: represents the number of whole seconds of elapsed time.
long int tv_nsec: represents the rest of the elapsed time (a fraction of a second),
represented as the number of nanoseconds. It is always less than one billion.
Here we will use the CLOCK_MONOTONIC clock, which represents the absolute
elapsed wall-clock time since some arbitrary, fixed point in the past. It isn't affected by
changes in the system time-of-day clock, which makes it the most appropriate clock for our
analysis. The CLOCK_REALTIME clock, which represents the machine's best guess at the
current wall-clock, time-of-day time, will not be used.
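As a quick check of the clock chosen above, the following minimal sketch queries the resolution of CLOCK_MONOTONIC and verifies that successive readings never go backwards. The helper names monotonic_resolution_ns and monotonic_never_decreases are hypothetical and only serve this illustration; a POSIX system is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical helper: resolution of CLOCK_MONOTONIC in nanoseconds,
   or -1 if the query fails. */
long long monotonic_resolution_ns(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Hypothetical helper: returns 1 if `samples` successive readings of
   CLOCK_MONOTONIC never go backwards, 0 otherwise. */
int monotonic_never_decreases(int samples)
{
    struct timespec prev, cur;
    assert(samples > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &prev) != 0)
        return 0;
    for (int i = 0; i < samples; ++i) {
        if (clock_gettime(CLOCK_MONOTONIC, &cur) != 0)
            return 0;
        /* Compare seconds first, then nanoseconds within the same second. */
        if (cur.tv_sec < prev.tv_sec ||
            (cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec))
            return 0;
        prev = cur;
    }
    return 1;
}
```

The same loop run against CLOCK_REALTIME could fail if the time-of-day clock were adjusted while it runs, which is the reason that clock is avoided for interval timing.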
To ensure that precision is not lost when timing very short intervals, we will use the
algorithm below:
Create timespec objects to store the begin and end times, and a variable to store the runtime


struct timespec begin, end;
double elapsed_time;
Record the begin and end times of the host execution with the execution function call between the two



clock_gettime(CLOCK_MONOTONIC, &begin);
/* perform Jacobi relaxation on the host here */
clock_gettime(CLOCK_MONOTONIC, &end);
Compute the host runtime from the time elapsed between the recorded begin and end times (the
elapsed time, in seconds, is stored in elapsed_time)



elapsed_time = (end.tv_sec - begin.tv_sec);
elapsed_time += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
printf("Host runtime: %f s\n", elapsed_time); /* display host runtime; needs stdio.h */
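The three steps above can be put together as in the sketch below. The workload is an assumption: jacobi_sweep is a hypothetical one-dimensional stand-in for the Jacobi relaxation being timed, and time_jacobi is a hypothetical wrapper name. Only the timing calls around the loop follow the steps described here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical workload: one Jacobi sweep over a 1D grid. Each interior
   point becomes the average of its two neighbours; boundaries are copied. */
void jacobi_sweep(const double *in, double *out, size_t n)
{
    assert(n >= 2);
    out[0] = in[0];
    out[n - 1] = in[n - 1];
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
}

/* Runs `iterations` sweeps, ping-ponging between buffers a and b,
   and returns the host runtime in seconds. */
double time_jacobi(double *a, double *b, size_t n, int iterations)
{
    struct timespec begin, end;
    double elapsed_time;

    clock_gettime(CLOCK_MONOTONIC, &begin);
    for (int it = 0; it < iterations; ++it) {
        jacobi_sweep(a, b, n);
        double *tmp = a;  /* swap input and output for the next sweep */
        a = b;
        b = tmp;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Whole seconds first, then the nanosecond remainder as a fraction. */
    elapsed_time = (double)(end.tv_sec - begin.tv_sec);
    elapsed_time += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
    return elapsed_time;
}
```

A caller would allocate two buffers of n doubles, initialise the first one, and print the value returned by time_jacobi. Keeping both clock_gettime calls immediately around the loop means that allocation and initialisation are not counted in the measured runtime.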
Note that here we have added the additional computation:
elapsed_time += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
This ensures that precision is not lost when timing very short intervals.
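To illustrate the point, the sketch below compares a whole-seconds-only difference with the two-step computation on fixed timestamps. Both function names are hypothetical, and the timestamps in the usage note are made-up values chosen for illustration, not measurements.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose struct timespec under strict -std modes */
#endif
#include <assert.h>
#include <time.h>

/* Seconds-only difference: any sub-second part of the interval is lost. */
double elapsed_whole_seconds(const struct timespec *begin,
                             const struct timespec *end)
{
    return (double)(end->tv_sec - begin->tv_sec);
}

/* Two-step difference as in the text: whole seconds plus the nanosecond
   difference scaled to seconds. The nanosecond difference may be negative
   when the interval crosses a second boundary; the sum is still correct. */
double elapsed_seconds(const struct timespec *begin,
                       const struct timespec *end)
{
    /* Assumes normalised timestamps, i.e. 0 <= tv_nsec < one billion. */
    assert(begin->tv_nsec >= 0 && begin->tv_nsec < 1000000000L);
    assert(end->tv_nsec >= 0 && end->tv_nsec < 1000000000L);

    double elapsed_time = (double)(end->tv_sec - begin->tv_sec);
    elapsed_time += (end->tv_nsec - begin->tv_nsec) / 1000000000.0;
    return elapsed_time;
}
```

For a begin time of 100 s + 500000000 ns and an end time of 100 s + 500250000 ns, the seconds-only version returns 0 while the two-step version returns about 0.00025 s. For an interval that crosses a second boundary, such as 100 s + 999999000 ns to 101 s + 1000 ns, the seconds-only version reports a full second although only about 2 microseconds have elapsed.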