Floating-Point Correctness Analysis at the Binary Level

Floating Point Analysis Using Dyninst Mike Lam University of Maryland, College Park Jeff Hollingsworth, Advisor Background • Floating point represents real numbers as (± sgnf × 2exp) o Sign bit o Exponent o Significand (“mantissa” or “fraction”) 32 16 8 4 0 IEEE Single Exponent (8 bits) 64 32 Significand (23 bits) 16 8 4 0 IEEE Double Exponent (11 bits) Significand (52 bits) • Finite precision o Single-precision: 24 bits (~7 decimal digits) o Double-precision: 53 bits (~16 decimal digits) 2 Motivation • Finite precision causes round-off error o Compromises certain calculations o Hard to detect and diagnose • Increasingly important as HPC scales o Computation on streaming processors is faster in single precision o Data movement in double precision is a bottleneck o Need to balance speed (singles) and accuracy (doubles) 3 Our Goal Automated analysis techniques to inform developers about floating point behavior and make recommendations regarding the use of floating point arithmetic. 4 Framework CRAFT: Configurable Runtime Analysis for Floating-point Tuning • Static binary instrumentation o Read configuration settings o Replace floating-point instructions with new code o Rewrite modified binary • Dynamic analysis o Run modified program on representative data set o Produce results and recommendations 5 Previous Work • Cancellation detection o Reports loss of precision due to subtraction o Paper appeared in WHIST‘11 • Range tracking o Reports min/max values • Replacement o Implements mixed-precision configurations o Paper to appear in ICS’13 6 Mixed Precision • Use double precision where necessary • Use single precision everywhere else • Can be difficult to implement 1: LU ← PA 2: solve Ly = Pb 3: solve Ux0 = y 4: for k = 1, 2, ... do 5: rk ← b – Axk-1 6: solve Ly = Prk 7: solve Uzk = y 8: xk ← xk-1 + zk 9: check for convergence 10: end for Mixed-precision linear solver algorithm Red text indicates steps performed in double-precision (all other steps are single-precision) 7 Configuration 8 Implementation • In-place replacement o Narrowed focus: doubles  singles o In-place downcast conversion o Flag in the high bits to indicate replacement 64 32 16 8 4 0 Double downcast conversion Replaced Double 64 7 F F 4 D E A Non-signalling NaN 32 16 8 4 0 32 16 8 4 0 D Single 9 Example gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1 movsd 0x601e38(%rax, %rbx, 8)  %xmm0 2 mulsd -0x78(%rsp)  %xmm0 3 addsd -0x4f02(%rip)  %xmm0 4 movsd %xmm0  0x601e38(%rax, %rbx, 8) 10 Example gvec[i,j] = gvec[i,j] * lvec[3] + gvar 3 movsd 0x601e38(%rax, %rbx, 8)  %xmm0 check/replace -0x78(%rsp) and %xmm0 mulss -0x78(%rsp)  %xmm0 check/replace -0x4f02(%rip) and %xmm0 addss -0x20dd43(%rip)  %xmm0 4 movsd %xmm0  0x601e38(%rax, %rbx, 8) 1 2 11 Block Editing (PatchAPI) original instruction in block block splits double  single conversion initialization cleanup check/replace 12 Automated Search • Manual mixed-precision analysis o Hard to use without intuition regarding potential replacements • Automatic mixed-precision analysis o Try lots of configurations (empirical auto-tuning) o Test with user-defined verification routine and data set o Exploit program control structure: replace larger structures (modules, functions) first o If coarse-grained replacements fail, try finer-grained subcomponent replacements 13 System Overview 14 NAS Results Benchmark (name.CLASS) Candidates Configurations Tested Instructions Replaced % Static % Dynamic bt.W 6,647 3,854 76.2 85.7 bt.A 6,682 3,832 75.9 81.6 cg.W 940 270 93.7 6.4 cg.A 934 229 94.7 5.3 ep.W 397 112 93.7 30.7 ep.A 397 113 93.1 23.9 ft.W 422 72 84.4 0.3 ft.A 422 73 93.6 0.2 lu.W 5,957 3,769 73.7 65.5 lu.A 5,929 2,814 80.4 69.4 mg.W 1,351 458 84.4 28.0 mg.A 1,351 456 84.1 24.4 sp.W 4,772 5,729 36.9 45.8 sp.A 4,821 5,044 51.9 43.0 15 AMGmk Results • Algebraic MultiGrid microkernel • Multigrid method is highly adaptive • Good candidate for replacement • Automatic search • Complete conversion (100% replacement) • Manually-rewritten version • Speedup: 175 sec to 95 sec (1.8X) • Conventional x86_64 hardware 16 SuperLU Results • Package for LU decomposition and linear solves • Reports final error residual • Both single- and double-precision versions • Verified manual conversion via automatic search • Used error from provided single-precision version as threshold • Final config matched single-precision profile (99.9% replacement) Threshold Instructions Replaced % Static Final Error % Dynamic 1.0e-03 99.1 99.9 1.59e-04 1.0e-04 94.1 87.3 4.42e-05 7.5e-05 91.3 52.5 4.40e-05 5.0e-05 87.9 45.2 3.00e-05 2.5e-05 80.3 26.6 1.69e-05 1.0e-05 75.4 1.6 7.15e-07 1.0e-06 72.6 1.6 4.7e7-07 17 Retrospective • Twofold original motivation o Faster computation (raw FLOPs) o Decreased storage footprint and memory bandwidth o Domains vary in sensitivity to these parameters • Computation-centric analysis o Less insight for memory-constrained domains o Sometimes difficult to translate instruction-level recommendations to source code-level transformations • Data-centric analysis o Focus on data motion, which is closer to source code-level structures 18 Current Project • Memory-based replacement o Perform all computation in double precision o Save storage space by storing single-precision values in some cases • Implementation o Register-based computation remains double-precision o Replace movement instructions (movsd) o Memory to register: check and upcast o Register to memory: downcast if configured o Searching for replaceable writes instead of computes 19 Preliminary Results Benchmark (name.CLASS) Candidates Writes Replaced % Static % Dynamic cg.W 284 95.4 77.5 ep.W 226 96.0 28.4 ft.W 452 94.2 45.0 lu.W 1,782 68.3 81.3 558 96.2 86.4 1,607 80.7 84.7 mg.W sp.W All benchmarks were single core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48GB of RAM running 64-bit Linux. 20 21 22 23 24 25 26 27 Future Work • Case studies • Search convergence study 28 Conclusion Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating point code, and memory-based replacement provides actionable results. 29 Thank you! sf.net/p/crafthpc 30

Floating-Point Correctness Analysis at the Binary Level

Related documents

Products

Support

Floating-Point Correctness Analysis at the Binary Level

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib