
Computation of Pi using CUDA
Dan Padgett
University at Buffalo
Dan Padgett
12/17/2009
Background
● Goal: find a way to use CUDA to speed up computing digits of pi
● First attempt used numerical integration
  – Proved to be unhelpful
  – Rate of convergence was too slow
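To make the convergence problem concrete, here is a minimal sketch of numerical integration for pi in C. The slides do not say which integrand or rule was used, so both the integrand 4/(1+x^2) on [0,1] and the midpoint rule are assumptions, and the names `pi_midpoint` and `pi_error` are hypothetical.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1].
   Assumption: the slides do not name the integrand or the rule used. */
double pi_midpoint(long n)
{
    assert(n > 0);
    double h = 1.0 / (double)n;          /* width of each sub-interval */
    double sum = 0.0;
    for (long i = 0; i < n; ++i) {
        double x = ((double)i + 0.5) * h; /* midpoint of sub-interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * h;
}

/* Absolute error against a reference value of pi. */
double pi_error(double approx)
{
    double d = approx - 3.14159265358979323846;
    return d < 0.0 ? -d : d;
}
```

The midpoint rule's error shrinks only quadratically in the number of sample points, so ten times more work buys about two more correct digits. Calling `pi_error(pi_midpoint(n))` for growing `n` shows the trend.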
Obstacles
● Original series converged too slowly
● Double precision is the highest precision CUDA supports (compute capability 1.3)
Overcoming our Obstacles
● Found new series with fast convergence

$$\pi = \sum_{i=0}^{\infty} \frac{1}{16^{i}} \left( \frac{4}{8i+1} - \frac{2}{8i+4} - \frac{1}{8i+5} - \frac{1}{8i+6} \right)$$

Next Steps of Action
● Implemented new series
  – Sum converged to full precision in 8 iterations
  – Looked for higher precision library
    ● Why has no one written one for CUDA?
    ● We will soon find out...
Implementing Higher Precision
● Started with sequential C
● Modeled after IEEE 754 floating point specs
● Left precision as a #define constant
● Reached precisions of up to 2600 integers per number on a worker node
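A number type along these lines might look like the sketch below. The slides do not show the actual data layout, so the struct fields, the 32-bit word size, and the names `PRECISION`, `bigfloat`, `bigfloat_clear`, `mantissa_add`, and `carry_demo` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PRECISION 8   /* hypothetical: 32-bit mantissa words per number */

/* Hypothetical layout loosely following IEEE 754: sign, exponent, mantissa. */
typedef struct {
    int      sign;                  /* 0 = positive, 1 = negative */
    int32_t  exponent;              /* power-of-two exponent */
    uint32_t mantissa[PRECISION];   /* mantissa[0] is the least significant word */
} bigfloat;

/* Set every field of x to zero. */
void bigfloat_clear(bigfloat *x)
{
    assert(x != NULL);
    memset(x, 0, sizeof *x);
}

/* out = a + b on the mantissa words only, propagating carries upward.
   Returns the carry out of the top word. The caller is assumed to have
   aligned the exponents already. */
uint32_t mantissa_add(uint32_t *out, const uint32_t *a, const uint32_t *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < PRECISION; ++i) {
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        out[i] = (uint32_t)t;       /* low 32 bits stay in this word */
        carry = t >> 32;            /* high bit moves to the next word */
    }
    return (uint32_t)carry;
}

/* Usage example: 0xFFFFFFFF + 1 overflows word 0 and carries into word 1.
   Returns word 1 of the result. */
uint32_t carry_demo(void)
{
    bigfloat a, b, sum;
    bigfloat_clear(&a);
    bigfloat_clear(&b);
    bigfloat_clear(&sum);
    a.mantissa[0] = 0xFFFFFFFFu;
    b.mantissa[0] = 1u;
    mantissa_add(sum.mantissa, a.mantissa, b.mantissa);
    return sum.mantissa[1];
}
```

Keeping the word count in a single `#define` means the precision can be raised by recompiling, at the cost of a fixed-size array in every number.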
Stop... CUDA Time!
● Compiled the vanilla C source with nvcc, the CUDA compiler
● Several issues
  – Incompatible low-level memory hacks
  – CUDA functions using structs are inlined
  – Limited CUDA memory and registers
CUDA Difficulties
● Replaced memory hacks with new memory hacks (maximum memset, extracting bits)
● Other issues could not be resolved satisfactorily
  – Inlining → 10 minute compile time
  – Executable size neared 1 MB
  – Limited shared memory → limited precision
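The slides do not show what the memory hacks looked like. As one guess at what "extracting bits" could involve, the sketch below pulls the IEEE 754 fields out of a `double` using `memcpy` instead of a pointer cast. The names `double_fields` and `split_double` are hypothetical, and the code assumes `double` is a 64-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three IEEE 754 fields of a 64-bit double. */
typedef struct {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 11 bits, biased by 1023 */
    uint64_t fraction;  /* 52 bits, without the implicit leading 1 */
} double_fields;

/* Copy the bytes of x into an integer and mask out each field.
   memcpy avoids the aliasing problems of casting a double* to uint64_t*. */
double_fields split_double(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);

    double_fields f;
    f.sign     = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.fraction = bits & (((uint64_t)1 << 52) - 1);
    return f;
}
```

For example, `split_double(1.0)` yields sign 0, biased exponent 1023, and fraction 0.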
Other Difficulties
● Using higher precisions caused the compiler to simply crash
● Found that precision = 12 uses the maximum number of CUDA registers
● Nowhere near the capability of the sequential code
Results Cont.
● After the usual 6-8 second CUDA initialization time, the code ran far faster than the sequential equivalent (speedup up to the number of parallel processors)
● Asymptotic behavior was as desired, even though the approximation wasn't as accurate as hoped
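The terms of the series are independent of one another, which is what lets the sum be split across processors. The slides do not describe how the work was divided, so the strided split below is an assumption, written in plain C with a loop standing in for the parallel threads. `WORKERS`, `series_term`, `partial_sum`, `pi_partitioned`, and `abs_diff` are hypothetical names.

```c
#include <assert.h>

#define WORKERS 4   /* hypothetical stand-in for the number of CUDA threads */

/* Term i of the series: 16^-i * (4/(8i+1) - 2/(8i+4) - 1/(8i+5) - 1/(8i+6)). */
double series_term(int i)
{
    double scale = 1.0;
    for (int j = 0; j < i; ++j)
        scale /= 16.0;                  /* scale = 16^-i */
    double k = 8.0 * (double)i;
    return scale * (4.0 / (k + 1.0) - 2.0 / (k + 4.0)
                    - 1.0 / (k + 5.0) - 1.0 / (k + 6.0));
}

/* Partial sum owned by one worker: terms worker, worker+WORKERS, ... */
double partial_sum(int worker, int terms)
{
    assert(worker >= 0 && worker < WORKERS);
    double sum = 0.0;
    for (int i = worker; i < terms; i += WORKERS)
        sum += series_term(i);
    return sum;
}

/* Combine the partial sums. On a GPU each partial_sum call would run in its
   own thread; here they run one after another. */
double pi_partitioned(int terms)
{
    double total = 0.0;
    for (int w = 0; w < WORKERS; ++w)
        total += partial_sum(w, terms);
    return total;
}

/* Absolute difference of two doubles. */
double abs_diff(double a, double b)
{
    double d = a - b;
    return d < 0.0 ? -d : d;
}
```

Because no worker needs another worker's result until the final combine step, the ideal speedup is bounded by the number of workers, matching the "up to number of parallel processors" observation.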
Summation Terms vs. Runtime
[Figure: runtime in seconds (0 to 160) plotted against number of summation terms (0 to 140,000), with one series for CUDA and one for Sequential.]
Accuracy of Approximation
[Figure: value of the pi approximation (axis from 3.1415926535884 to 3.1415926535900) plotted against number of summation terms (5 to 35).]
CUDA Runtime vs Number of Sum Terms
[Figure: CUDA runtime in seconds (0 to 20) plotted against number of summation terms (1,000 to 1,000,000, log scale).]
Conclusions
● CUDA is not well-suited to problems that require even a moderate amount of memory
● For pure computation, CUDA offers enormous speedups through parallelism
● π ≈ 3.14