Computation of Pi using CUDA
Dan Padgett
University at Buffalo
12/17/2009

Background
● Want to find a way of utilizing CUDA to help improve times for computing digits of pi
● First attempt used numerical integration
  – Proved to be unhelpful
  – Rate of convergence

Obstacles
● Original series converged too slowly
● Nothing beyond double precision is supported under CUDA compute capability 1.3

Overcoming our Obstacles
● Found new series with fast convergence

  $$\pi = \sum_{i=0}^{\infty} \frac{1}{16^{i}} \left( \frac{4}{8i+1} - \frac{2}{8i+4} - \frac{1}{8i+5} - \frac{1}{8i+6} \right)$$

Next Steps of Action
● Implemented new series
  – Sum converged to full precision in 8 iterations
  – Looked for higher precision library
● Why has no one written one for CUDA?
● We will soon find out...

Implementing Higher Precision
● Started with sequential C
● Modeled after IEEE 754 floating point specs
● Left precision as a #define variable
● Was able to compute precisions up to 2600 integers per number on a worker node

Stop... CUDA Time!
● Compiled vanilla C source with the nvcc CUDA compiler
● Several issues
  – Incompatible low-level memory hacks
  – CUDA functions using structs are inlined
  – Limited CUDA memory, registers

CUDA Difficulties
● Replaced memory hacks with new memory hacks (maximum memset, extracting bits)
● Other issues not satisfyingly resolvable
  – Inlining → 10 minute compile time
  – Executable size neared 1 MB
  – Limited shared memory → limited precision

Other Difficulties
● Using higher precisions caused the compiler to simply crash
● Found precision = 12 uses the maximum number of CUDA registers
● Nowhere near the capability of the sequential code

Results Cont.
● After the usual 6-8 second CUDA initialization time, code ran far faster than the sequential equivalent (up to the number of parallel processors)
● Asymptotic behavior was as desired, even though the approximation wasn't as good as desired.
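As a point of reference, the series on the "Overcoming our Obstacles" slide (the Bailey-Borwein-Plouffe formula) can be sketched in ordinary double precision. This is only an illustration of how quickly the sum settles, not the higher-precision code from the talk; the function name `bbp_pi` is an assumption made for this example.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): sums the first `terms`
 * terms of the series
 *   pi = sum_{i>=0} 16^-i * (4/(8i+1) - 2/(8i+4) - 1/(8i+5) - 1/(8i+6))
 * using plain double arithmetic. */
double bbp_pi(int terms)
{
    assert(terms >= 0);

    double sum = 0.0;
    double scale = 1.0;               /* holds 1 / 16^i */

    for (int i = 0; i < terms; ++i) {
        double k = 8.0 * i;
        sum += scale * (4.0 / (k + 1.0)
                      - 2.0 / (k + 4.0)
                      - 1.0 / (k + 5.0)
                      - 1.0 / (k + 6.0));
        scale /= 16.0;                /* each term shrinks by roughly 16x */
    }
    return sum;
}
```

Because each term is about 16 times smaller than the previous one, a handful of terms exhausts what a double can hold; more digits require the wider number format discussed in the slides above.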
[Figure: Summation Terms vs. Runtime. x-axis: # Summation Terms (0 to 140000); y-axis: Time (Seconds) (0 to 160); series: CUDA, Sequential]

[Figure: Accuracy of Approximation. x-axis: # Summation Terms (5 to 35); y-axis: Value of Approximation (3.14159265358840 to 3.14159265359000); series: pi approx]

[Figure: CUDA Runtime vs Number of Sum Terms. x-axis: # Summation Terms (1000 to 1000000, log scale); y-axis: Time (Seconds) (0 to 20)]

Conclusions
● CUDA is not well-suited to problems which require a moderate amount of memory
● For pure computation, CUDA offers enormous speedups through parallelism
● π ≈ 3.14
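The parallel speedup noted in the conclusions depends on the terms of the sum being independent of one another. The slides do not say how terms were assigned to CUDA threads; the sketch below assumes one common pattern, a strided split, and simulates it sequentially in plain C with doubles. All function names (`pow16_inv`, `bbp_term`, `partial_sum`, `strided_pi`) are hypothetical and exist only for this example.

```c
#include <assert.h>

/* Returns 1 / 16^n by repeated division (underflows to 0 for large n). */
double pow16_inv(int n)
{
    double scale = 1.0;
    for (int j = 0; j < n; ++j)
        scale /= 16.0;
    return scale;
}

/* Term i of the series from the slides, in plain double precision. */
double bbp_term(int i)
{
    double k = 8.0 * i;
    return pow16_inv(i) * (4.0 / (k + 1.0)
                         - 2.0 / (k + 4.0)
                         - 1.0 / (k + 5.0)
                         - 1.0 / (k + 6.0));
}

/* Work done by one worker: terms id, id + workers, id + 2*workers, ...
 * On a GPU, `id` would play the role of a thread index (assumption). */
double partial_sum(int id, int workers, int terms)
{
    double sum = 0.0;
    for (int i = id; i < terms; i += workers)
        sum += bbp_term(i);
    return sum;
}

/* Sequential stand-in for the final reduction over all workers. */
double strided_pi(int workers, int terms)
{
    assert(workers > 0 && terms >= 0);

    double total = 0.0;
    for (int id = 0; id < workers; ++id)
        total += partial_sum(id, workers, terms);
    return total;
}
```

Each call to `partial_sum` touches no shared state, so the workers could run concurrently and only the final reduction needs coordination; that independence is what lets runtime scale with the number of processors.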