Automatic Verification of Floating Point Units
Udo Krautz, Viresh Paruthi, Anand Arunagiri, Sujeet Kumar
IBM™ Corporation

Authors
1. Udo Krautz, IBM Deutschland, Boeblingen, Germany, krautz@de.ibm.com, +49-7031-16-2347
2. Viresh Paruthi, IBM Corporation, Austin, TX, USA, vparuthi@us.ibm.com, +1-512-286-7922
3. Anand B Arunagiri, IBM Corporation, Bangalore, India, aarunagi@in.ibm.com, +91-80-41777187
4. Sujeet Kumar, IBM Corporation, Bangalore, India, sujkumak@in.ibm.com, +91-80-41777283

Abstract
Verification of floating point units (FPUs) is one of the most successful applications of formal verification methods. The large and complex data paths and intricate control structures of FPUs make verification with coverage-driven simulation incomplete and error prone. Formal verification (FV) has been successfully leveraged to achieve the high level of quality desired of this critical logic. Typically, FV-based approaches to verifying FPUs rely on introducing higher-level abstractions to allow reasoning. This, however, has to be done manually, and quickly becomes tedious for the highly optimized bit-level implementations found in high-performance microprocessors. Automated formal methods that work directly on the bit level and provide a full end-to-end check for FPUs exist, but they are limited to single instructions (issued in an empty pipeline). They therefore do not check the control aspects of the logic that relate to inter-instruction interactions or pipeline control. In this talk we present an approach based on equivalence checking that overcomes the single-instruction limitation for automated bit-level proofs in the formal verification of FPUs. The sequential execution of instructions is modeled by two instances of the design-under-test. One of these instances acts as a reference model for the other. This allows a large number of internal equivalences to be leveraged by equivalence checking techniques.
We show that this method is capable of proving instruction sequences for highly optimized industrial FPU designs. Together with a proof of correctness of individual instructions by model checking, it guarantees correctness of the FPU design as a whole. In our experience, no other approach provides the level of automation and ease of use of the proposed method.

Motivation
• Floating-Point Units (FPU) inherently difficult to verify:
  • Data path challenges
    – Complex floating-point algorithms and hardware
      E.g. alignment shifter, leading zero anticipator (LZA), rounding, …
    – Intricate corner cases
      E.g. denormal inputs/outputs, cancellation, sticky bits, …
  • Control complexity
    – Pipelined out-of-order speculative execution, microcode ops, ...
• Various verification techniques deployed to verify FPUs
  • Incomplete methods to find bugs
    – Random/manual/targeted testcase generation, coverage analysis, …
    – Bugs may slip into silicon (e.g. Pentium FP bug!)
  • Complete methods (formal) to establish correctness
    – Model checking (automatic) techniques
      • Restricted to a single instruction issued in an empty pipeline (datapath verif)
    – Higher level reasoning
      • Manual, requiring creation of dedicated models (end-to-end verif)

Contribution
• We propose to enhance automated methods to enable verification of control aspects in addition to the data path
• Automated end-to-end verification of bit level FPUs
  • Inclusive of control and data path
    – Data path verified with model checking (existing state-of-the-art)
      • Submit a single instruction in an empty pipeline
      • Checks for “numerical correctness” of different ops
    – Control related aspects verified with sequential equivalence checking
      • The design serves as its own reference
      • Instruction sequence submitted to allow inter-instruction interactions
      • Allows leveraging internal equivalence points to alleviate capacity issues
• Results bear out effectiveness of the approach

Data path Verification
• Checks numerical correctness of FPU data path
  • IEEE 754 standard
  • Implementation constraints (timing, area, power, performance)
  • Fused-multiply-add (FMA) instruction: A*B + C
• Example bugs:
  – If two nearly equal numbers are subtracted (causing cancellation), the wrong exponent is returned
  – If the result is near underflow, the wrong guard bit is chosen
• Restricted to a single instruction issued in an empty FPU
  • Influence of other instructions not considered
• Provides complete datapath coverage; remaining verification resources may focus on other aspects (e.g., inter-instruction)

Datapath Verification Testbench
• A “driver” issues an instruction into both the real and the reference FPU
• A “checker” compares the results of the two FPUs for equality
[Figure: the operands feed both the reference model and the real FPU; the two results are compared for equality (=)]
• FP operations may be bounded by longest-latency operation
• Verification problem is thus a bounded model check

Control Verification
• Verifies pipeline control, complex micro-architectural features
  • Speculative execution, functional clock-gating, blocking, …
• Example bugs:
  – If a speculatively executed instruction stream should not be executed (e.g. due to branch not taken), does a ‘kill’ generate any side effects?
  – Does the issue of overlapping instructions cause resource conflicts?
  – Does forwarding of data to a subsequent instruction yield a wrong result?
• Requires submission of a continuous stream of instructions
  • Activates inter-instruction interactions/dependencies
  • Irrespective of previously executed instructions, or initial state

Control Verification Testbench
• The design serves as its own “reference”
• A “driver” issues a single instruction into the “reference” FPU and an additional sequence of instructions into the real FPU
• A “checker” compares the result of the “followed” instruction for correctness
[Figure: the instruction sequence feeds the (real) FPU and the single instruction feeds the (reference) FPU; the two results are compared for equality (=)]
• Verification problem is a sequential equivalence check
• Internal equivalences can be effectively leveraged

Conditional Equivalence
• A single instruction of the sequence is executed in both FPUs
• Restricted to conditional equivalence (not general SEC)
• Pipeline stages in which the “followed” instruction is active should be equivalent in a specific cycle
[Figure: the pipelines of both FPUs over time; legend: other instruction, inactive stage, active pipeline stage of the followed instruction; stages holding the followed instruction are compared for equality (=)]
• Final check only on the result of the “followed” instruction
• Bounded checking allows the pipeline to be unfolded: only equivalent pipeline stages should be in the result property's COI

Sequential Equivalence Tenets
• Several degrees of equivalence/correctness:
• Identical result of “followed” instruction regardless of initial state
  – Possible with model checking if legal initial states are known
  – Manual computation of initial states tedious for complex pipelines
• “Followed” instruction not influenced by “residual states”
  – Both FPUs should be equivalent for the “followed” instruction irrespective of a previously executed instruction
• All timing windows between instructions need to be considered
  – Requires an infinite sequence of instructions
  – Infinite sequence made finite to allow bounded checking

Verification Technology
• SAT-based Bounded Model Check
  • Performs a satisfiability check on a k-step unfolded netlist
  • Hybrid SAT engine
    – Integrates structural netlist transformations, BDDs, simulation, CNF clauses and SAT procedure in one framework
• Conditional equivalence checking
  • Automatic checkers for pipeline stages getting activated
    – Added for every stage; each is either proven or disproven
  • Leveraged as “lighthouses” to enable end-to-end SAT check
• Encapsulated as engines in IBM’s semi-formal tool SixthSense
  • Uses a Transformation Based Verification (TBV) paradigm that maximally exploits synergy between algorithms

Verification Results – Setup
• Single instruction checks
  • FPU vs high level reference model
  • 45 instructions require case-splits
  • 24 instructions covered by semi-formal
  • 410 instructions fully covered
  • Model: 10k variables / 100k latches / 3352k ANDs
• Instruction sequence checks
  • FPU (sequence) vs FPU (with single followed op)
  • Different types of instruction:
    • Pipelined
    • Fixed latency multicycle
    • Variable latency multicycle
  • 9 scenarios of sequence types defined
  • Two models:
    • B2B issue only
    • Infinite sequences
  • Model: 7.6k variables / 254k latches / 1398k ANDs

Results – Single Instruction
Instruction                                            Runtime     Memory
64bit Binary-FP ADD overlap-case (369)                 3min:50s    1.5GB
64bit Binary-FP ADD cancellation-case (168)            7min:51s    1.5GB
128bit Decimal-FP ADD overlap-case (26388)             4min:28s    1.5GB
128bit Decimal-FP shift single test                    17min:04s   1.5GB
128bit Hex-FP convert to 64bit Integer single test     18min:15s   1.3GB
64bit Binary-FP divide semi formal only                >24h
Running on Linux™ 2.6 64bit, Xeon™ E5-2680 2.7GHz

Results – Sequences
Followed instruction                 Irritator instruction                                  Runtime        Memory
Pipelined (extract exponent)         Pipelined (convert decimal integer to decimal fp)      1min:07s       1GB
                                     Fixed latency (128bit decimal fp add)                  1min:14s       0.94GB
                                     Variable latency (convert binary fp to decimal fp)     21min:17s      1.1GB
                                     Pipelined (convert decimal integer to decimal fp)      1min:52s       1.1GB
                                     Fixed latency (128bit decimal fp add)                  1min:22s       1GB
                                     Variable latency (convert binary fp to decimal fp)     1h:13min:22s   3.6GB
                                     Pipelined (convert decimal integer to decimal fp)      13min:29s      1.3GB
                                     Fixed latency (128bit decimal fp add)                  24min:37s      1.8GB
                                     Variable latency (convert binary fp to decimal fp)     6h:6min:17s    7GB
Fixed latency (compare decimal fp)

Conclusions and Future Work
• Presented an end-to-end automated approach to verify FPUs
  • Inclusive of dataflow and control
  • Dataflow verified instruction-by-instruction against reference
  • Control verified via a sequential equivalence check
• Future Work
  – Extend B2B sequences to random sequences, to cover all possible sequences
    • Random sequences with pipelined instructions solvable
    • Random sequences with multicycle instructions unsolved in 24h
  – Include forwarding of operands
    • Internal equivalences do not hold due to latency differences

Related Work
• Intel™ uses a combination of automatic methods and STE
  • Published in CAV 2009 and FMCAD 2012
  • Results attribute most defects to STE
  • Likely requires manual, implementation-specific effort
  • Full details for reproducibility not disclosed
• Most other works focus on data path verification
  • Focus on specific instructions and design artifacts
    • E.g. FMA instruction together with multiplier
  • Largely manual, as they rely on methods such as theorem proving
    • Tedious proofs which are implementation specific
  • If automatic, they use special-purpose data structures
    • E.g. Chen’98 uses PHDDs vs SAT/BDDs
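
The control verification testbench from the slides (the design as its own reference, a “followed” instruction, and “irritator” instructions around it) can be illustrated with a toy sketch. Everything here is hypothetical: `ToyPipe` is a made-up 3-stage pipeline on 4-bit integers, not an FPU, and brute-force enumeration stands in for the SAT-based bounded check used in the talk. The injected bug only fires when another instruction is in flight, so a single-instruction check in an empty pipeline cannot see it.

```python
from itertools import product

class ToyPipe:
    """Hypothetical 3-stage pipeline on 4-bit integers (a stand-in for an FPU)."""
    DEPTH = 3

    def __init__(self, buggy=False):
        self.stages = [None] * self.DEPTH  # each entry: (tag, op, value) or None
        self.buggy = buggy

    def step(self, instr):
        """Advance one cycle; instr is None or (tag, op, a, b).
        Returns the retiring (tag, op, value) entry, or None."""
        retired = self.stages[-1]
        entering = None
        if instr is not None:
            tag, op, a, b = instr
            value = (a + b) & 0xF if op == "add" else (a * b) & 0xF
            # Injected control bug (assumption for illustration): an add issued
            # directly behind a mul gets its low result bit flipped.
            prev = self.stages[0]
            if self.buggy and op == "add" and prev is not None and prev[1] == "mul":
                value ^= 1
            entering = (tag, op, value)
        self.stages = [entering] + self.stages[:-1]
        return retired

def run(pipe, stream):
    """Feed the stream, drain the pipeline, return the followed instruction's result."""
    result = None
    for instr in list(stream) + [None] * ToyPipe.DEPTH:
        out = pipe.step(instr)
        if out is not None and out[0] == "followed":
            result = out[2]
    return result

def check_sequences(buggy):
    """Compare the design against itself: single followed instruction in an empty
    pipeline (reference) vs. the same instruction with an irritator nearby.
    Returns a counterexample (stream, reference, observed) or None."""
    ops, vals = ("add", "mul"), range(4)
    for f_op, a, b, i_op, c, d in product(ops, vals, vals, ops, vals, vals):
        followed = ("followed", f_op, a, b)
        irritator = ("irritator", i_op, c, d)
        ref = run(ToyPipe(buggy), [followed])
        # A few timing windows between irritator and followed instruction.
        for stream in ([irritator, followed],
                       [followed, irritator],
                       [irritator, None, followed]):
            got = run(ToyPipe(buggy), stream)
            if got != ref:
                return stream, ref, got
    return None

if __name__ == "__main__":
    print("clean design:", check_sequences(buggy=False))
    print("buggy design:", check_sequences(buggy=True))
```

Note that both instances carry the same design, including the bug, as in the talk's setup. The check only establishes that neighbouring instructions do not disturb the followed instruction; numerical correctness of the single instruction still has to come from the separate datapath check against a reference model.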