Modeling Data in Formal Verification Bits, Bit Vectors, or Words Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Overview Issue How should data be modeled in formal analysis? Verification, test generation, security analysis, … Approaches Bits: Every bit is represented individually Basis for most CAD, model checking Words: View each word as arbitrary value E.g., unbounded integers Historic program verification work Bit Vectors: Finite precision words Captures true semantics of hardware and software –2– More opportunities for abstraction than with bits Bit-Level Modeling Control Logic Com. Log. 1 Com. Log. 2 Data Path Represent Every Bit of State Individually Behavior expressed as Boolean next-state over current state Historic method for most CAD, testing, and verification tools E.g., model checkers –3– Bit-Level Modeling in Practice Strengths Allows precise modeling of system Well developed technology BDDs & SAT for Boolean reasoning Limitations Every state bit introduces two Boolean variables Current state & next state Overly detailed modeling of system functions Don’t want to capture full details of FPU Making It Work –4– Use extensive abstraction to reduce bit count Hard to abstract functionality Word-Level Abstraction #1: Bits → Integers x0 x1 x2 xn-1 View Data as Symbolic Words Arbitrary integers No assumptions about size or encoding Classic model for reasoning about software –5– Can store in memories & registers x Abstracting Data Bits Control Logic Com. Log. ? 1 Com. Log. ? 2 1 Data Path What do we do about logic functions? –6– Word-Level Abstraction #2: Uninterpreted Functions A Lf U For any Block that Transforms or Evaluates Data: Replace with generic, unspecified function Only assumed property is functional consistency: a = x b = y f (a, b) = f (x, y) –7– Abstracting Functions Control Logic Com. Log. F1 1 Com. Log. F2 1 Data Path For Any Block that Transforms Data: –8– Replace by uninterpreted function Ignore detailed functionality Conservative approximation of actual system Word-Level Modeling: History Historic Used by theorem provers More Recently Burch & Dill, CAV ’94 Verify that pipelined processor has same behavior as unpipelined reference model Use word-level abstractions of data paths and memories Use decision procedure to determine equivalence Bryant, Lahiri, Seshia, CAV ’02 UCLID verifier Tool for describing & verifying systems at word level –9– Pipeline Verification Example IF/ID PC Op ID/EX Control EX/WB Control Rd Ra Instr Mem Pipelined Processor = Adat Reg. File A L U Imm +4 = Rb Op Control Rd Ra Instr Mem Reference Model Adat Reg. File Imm +4 Rb – 10 – Bdat A L U Abstracted Pipeline Verification IF/ID PC Op ID/EX Control EX/WB Control Rd Ra Instr F3 Mem Pipelined Processor = Adat Reg. File A FL2 U Imm F1 +4 = Rb Op PC Control Rd Ra Instr F3 Mem Reference Model Adat Reg. File Imm +4 F1 Rb – 11 – Bdat A FL2 U Experience with Word-Level Modeling Powerful Abstraction Tool Allows focus on control of large-scale system Can model systems with very large memories Hard to Generate Abstract Model Hand-generated: how to validate? Automatic abstraction: limited success Andraus & Sakallah, DAC 2004 Realistic Features Break Abstraction E.g., Set ALU function to A+0 to pass operand to output Desire – 12 – Should be able to mix detailed bit-level representation with abstracted word-level representation Bit Vectors: Motivating Example #1 int abs(int x) { int mask = x>>31; return (x ^ mask) + ~mask + 1; } int test_abs(int x) { return (x < 0) ? -x : x; } Do these functions produce identical results? Strategy – 13 – Represent and reason about bit-level program behavior Specific to machine word size, integer representations, and operations Motivating Example #2 void fun() { char fmt[16]; fgets(fmt, 16, stdin); fmt[15] = '\0'; printf(fmt); } Is there an input string that causes value 234 to be written to address a4a3a2a1? Answer Yes: "a1a2a3a4%230g%n" Depends on details of compilation – 14 – But no exploit for buffer size less than 8 [Ganapathy, Seshia, Jha, Reps, Bryant, ICSE ’05] Motivating Example #3 bit[W] popSpec(bit[W] x) { int cnt = 0; for (int i=0; i<W; i++) { if (x[i]) cnt++; } return cnt; } bit[W] popSketch(bit[W] x) { loop (??) { x = (x&??) + ((x>>??)&??); } return x; } Is there a way to expand the program sketch to make it match the spec? Answer – 15 – W=16: [Solar-Lezama, et al., ASPLOS ‘06] x x x x = = = = (x&0x5555) (x&0x3333) (x&0x0077) (x&0x000f) + + + + ((x>>1)&0x5555); ((x>>2)&0x3333); ((x>>8)&0x0077); ((x>>4)&0x000f); Motivating Example #4 Sequential Reference Model Pipelined Microprocessor IF/ID PC Op ID/EX Control EX/WB Op Control Rd Ra Instr Mem Control Rd Ra = Adat Reg. File Instr Mem A L U Imm Adat Reg. File Imm Bdat A L U +4 Rb = +4 Rb Is pipelined microprocessor identical to sequential reference model? Strategy Represent machine instructions, data, and state as bit vectors Compatible with hardware description language representation – 16 – Verifier finds abstractions automatically Bit Vector Formulas Fixed width data words Arithmetic operations Add/subtract/multiply/divide, … Two’s complement, unsigned, … Bit-wise logical operations Bitwise and/or/xor, shift/extract, concatenate Predicates ==, <= Task – 17 – 50000 * 50000 = -1794967296 Is formula satisfiable? E.g., a > 0 && a*a < 0 (on 32-bit machine) Decision Procedures Core technology for formal reasoning Satisfying solution Formula Decision Procedure Boolean SAT Pure Boolean formula SAT Modulo Theories (SMT) Support additional logic fragments Example theories Linear arithmetic over reals or integers Functions with equality Bit vectors Combinations of theories – 18 – Unsatisfiable (+ proof) – 19 – ra sp (2 00 2) 118 81 20 07 ) 147 Rs at ( (2 00 304 Si ) eg e (2 Sa 00 tE 4) l it eG TI (2 00 5) zC ha ff in (2 00 1) (2 00 0) 1,000 Be rk M zC ha ff G Run-time (sec.) Recent Progress in SAT Solving 3600 3,000 2,000 766 46 19 0 BV Decision Procedures: Some History B.C. (Before Chaff) String operations (concatenate, field extraction) Linear arithmetic with bounds checking Modular arithmetic Limitations – 20 – Cannot handle full range of bit-vector operations BV Decision Procedures: Using SAT SAT-Based “Bit Blasting” Generate Boolean circuit based on bit-level behavior of operations Convert to Conjunctive Normal Form (CNF) and check with best available SAT checker Handles arbitrary operations Effective in Many Applications – 21 – CBMC [Clarke, Kroening, Lerda, TACAS ’04] Microsoft Cogent + SLAM [Cook, Kroening, Sharygina, CAV ’05] CVC-Lite [Dill, Barrett, Ganesh], Yices [deMoura, et al] Bit-Vector Challenge Is there a better way than bit blasting? Requirements Provide same functionality as with bit blasting Find abstractions based on word-level structure Improve on performance of bit blasting Observation Must have bit blasting at core Only approach that covers full functionality Want to exploit special cases Formula satisfied by small values Simple algebraic properties imply unsatisfiability Small unsatisfiable core Solvable by modular arithmetic … – 22 – Some Recent Ideas Iterative Approximation UCLID: Bryant, Kroening, Ouaknine, Seshia, Strichman, Brady, TACAS ’07 Use bit blasting as core technique Apply to simplified versions of formula Successive approximations until solve or show unsatisfiable Using Modular Arithmetic STP: Ganesh & Dill, CAV ’07 Algebraic techniques to solve special case forms Layered Approach – 23 – MathSat: Bruttomesso, Cimatti, Franzen, Griggio, Hanna, Nadel, Palti, Sebastiani, CAV ’07 Use successively more detailed solvers Iterative Approach Background: Approximating Formula Overapproximation Original Formula Underapproximation + + More solutions: If unsatisfiable, then so is − Fewer solutions: Satisfying solution also satisfies − Example Approximation Techniques Underapproximating Restrict word-level variables to smaller ranges of values Overapproximating Replace subformula with Boolean variable – 24 – Starting Iterations 1− Initial Underapproximation – 25 – (Greatly) restrict ranges of word-level variables Intuition: Satisfiable formula often has small-domain solution First Half of Iteration 1+ 1− UNSAT proof: generate overapproximation If SAT, then done SAT Result for 1− Satisfiable Then have found solution for Unsatisfiable Use UNSAT proof to generate overapproximation 1+ – 26 – (Described later) Second Half of Iteration 1+ SAT: Use solution to generate refined underapproximation If UNSAT, then done 2− 1− SAT Result for 1+ Unsatisfiable Then have shown unsatisfiable Satisfiable Solution indicates variable ranges that must be expanded – 27 – Generate refined underapproximation Iterative Behavior 2+ + 1 k+ Underapproximations k− Overapproximations 2− 1− – 28 – Successively more precise abstractions of Allow wider variable ranges No predictable relation UNSAT proof not unique Overall Effect 2+ + 1 k+ Soundness UNSAT k− Completeness SAT 2− 1− Only terminate with solution on underapproximation Only terminate as UNSAT on overapproximation Successive underapproximations approach Finite variable ranges guarantee termination In worst case, get k− – 29 – Generating Overapproximation Given Underapproximation 1− Bit-blasted translation of 1− into Boolean formula Proof that Boolean formula unsatisfiable Generate Overapproximation 1+ If 1+ satisfiable, must lead to refined underapproximation Generate 2− such that 1− 2− – 30 – 1+ 2− 1− UNSAT proof: generate overapproximation Bit-Vector Formula Structure DAG representation to allow shared subformulas x=y Ç x+2z1 Æ : a w & 0xFFFF = x Æ Ç Ç x % 26 = v – 31 – Structure of Underapproximation w x y z x=y Range Constraints Ç x+2z1 Æ : a w & 0xFFFF = x Æ Æ − Ç Ç x % 26 = v Linear complexity translation to CNF Each word-level variable encoded as set of Boolean variables Additional Boolean variables represent subformula values – 32 – Encoding Range Constraints Explicit View as additional predicates in formula w Range Constraints x 0 w 8 −4 x 4 Implicit – 33 – Reduce number of variables in encoding Constraint Encoding 0w8 0 0 0 ··· 0 w2w1w0 −4 x 4 xsxsxs··· xsxsx1x0 Yields smaller SAT encodings UNSAT Proof Subset of clauses that is unsatisfiable Clause variables define portion of DAG Subgraph that cannot be satisfied with given range constraints w x y z Range Constraints x=y Ç x+2z1 Æ : a w & 0xFFFF = x Æ Ç Ç x % 26 = v – 34 – Æ Extracting Circuit from UNSAT Proof Subgraph that cannot be satisfied with given range constraints Even when replace rest of graph with unconstrained variables w x y z Range Constraints x=y Ç x+2z1 Æ UNSAT : a Æ Ç b1 – 35 – b2 Æ Generated Overapproximation Remove range constraints on word-level variables Creates overapproximation Ignores correlations between values of subformulas x=y Ç x+2z1 Æ : a Æ Ç b1 b2 – 36 – 1+ Refinement Property Claim 1+ has no solutions that satisfy 1−’s range constraints Because 1+ contains portion of 1− that was shown to be unsatisfiable under range constraints w x y z Range Constraints x=y Ç x+2z1 Æ UNSAT : a Æ Ç b1 – 37 – b2 1+ Æ Refinement Property (Cont.) Consequence Solving 1+ will expand range of some variables Leading to more exact underapproximation 2− x=y Ç x+2z1 Æ : a Æ Ç b1 b2 – 38 – 1+ Effect of Iteration 1+ SAT: Use solution to generate refined underapproximation UNSAT proof: generate overapproximation 2− 1− Each Complete Iteration – 39 – Expands ranges of some word-level variables Creates refined underapproximation Approximation Methods So Far Range constraints Underapproximate by constraining values of word-level variables Subformula elimination Overapproximate by assuming subformula value arbitrary General Requirements Systematic under- and over-approximations Way to connect from one to another Goal: Devise Additional Approximation Strategies – 40 – Function Approximation Example x x 0 1 else 0 0 0 0 1 0 1 x else 0 y § * y y Motivation Multiplication (and division) are difficult cases for SAT §: Prohibit Via Additional Range Constraints Gives underapproximation Restricts values of (possibly intermediate) terms §: Abstract as f (x,y) – 41 – Overapproximate as uninterpreted function f Value constrained only by functional consistency Results: UCLID BV vs. Bit-blasting [results on 2.8 GHz Xeon, 2 GB RAM] – 42 – UCLID always better than bit blasting Generally better than other available procedures SAT time is the dominating factor Challenges with Iterative Approximation Formulating Overall Strategy Which abstractions to apply, when and where How quickly to relax constraints in iterations Which variables to expand and by how much? Too conservative: Each call to SAT solver incurs cost Too lenient: Devolves to complete bit blasting. Predicting SAT Solver Performance Hard to predict time required by call to SAT solver Will particular abstraction simplify or complicate SAT? Combination Especially Difficult – 43 – Multiple iterations with unpredictable inner loop STP: Linear Equation Solving Ganesh & Dill, CAV ’07 Solve linear equations over integers mod 2w Capture range of solutions with Boolean Variables Example Problem Variables: 3-bit unsigned integers x: [x2 x1 x0] y: [y2 y1 y0] Linear equations: conjunction of linear constraints 3x + 4y 2x + 2y 2x + 4y + 2z +2 + 2z General Form A x = b mod 2w – 44 – z: [z2 z1 z0] =0 mod 8 =0 mod 8 =0 mod 8 Solution Method Equations 3x + 4y 2x + 2y 2x + 4y + 2z +2 + 2z =0 mod 8 =0 mod 8 =0 mod 8 Some Number Theory Odd number has multiplicative inverse mod 2w Mod 8: 3−1 = 3 Additive inverse mod 2w: −x = 2w − x Mod 8: – 45 – −4 = 4 −2 = 6 Solve first equation for x: 3·3x = 3·4y + 3·6z mod 8 x = 4y + 2z mod 8 Solution Method (cont.) Substitutions 2x + 2y +2 2x + 4y + 2z x = 4y mod 8 =0 mod 8 + 2z 2(4y+2z) + 2y mod 8 +2 2(4y+2z) + 4y – 46 – =0 + 2z 2y + 4z 4y + 6z +2 =0 mod 8 =0 mod 8 =0 mod 8 =0 mod 8 What if All Coefficients Even? Result of Substitutions 2y + 4z 4y + 6z +2 =0 mod 8 =0 mod 8 Even numbers do not have multiplicative inverses mod 8 Observation – 47 – Can divide through and reduce modulus y + 2z +1 2y + 3z y = 2z z = 2 mod 4 y = 3 mod 4 +3 =0 mod 4 =0 mod 4 mod 4 General Solutions Original variables: 3-bit unsigned integers x: [x2 x1 x0] y: [y2 y1 y0] z: [z2 z1 z0] Solutions y = 3 mod 4 z = 2 mod 4 Constrained variables y: [y2 1 1] z: [z2 1 0] Back Substitution x = 4y x = 0 Constrained variables x: [0 0 0] – 48 – + 6z mod 8 mod 8 Linear Equation Solutions Equations 3x + 4y 2x + 2y 2x + 4y + 2z +2 + 2z =0 mod 8 =0 mod 8 =0 mod 8 Encoding All Possible Solutions x: [0 0 0] y: [y2 1 1] z: [z2 1 0] y2, z2 arbitrary Boolean variables 4 possible solutions (out of original 512) General Principle – 49 – Form of LU decomposition Polynomial time algorithm Boolean variables in solution to express set of solutions Only works when have conjunction of linear constraints Layered Solver Bruttomesso, et al, CAV ’07 Part of MathSAT project DPLL(T) Framework – 50 – SAT solver coupled with solver for mathematical theory T BV theory solver works with conjunctions of constraints DPLL(T): Formula Structure Atoms x=y Boolean Structure Ç x+2z1 Æ : x << 1 > y w & 0xFFFF = x Æ Ç Ç x % 26 = v Atoms – 51 – Predicates applied to bit-vector expressions Boolean Variables DPLL(T): Operation 1 Ç Actions DPLL engine satisfies Boolean portion Theory solver determines whether resulting conjunction of atoms satifiable Æ 0 : 1 Æ Ç 1 Ç 0 x=y x+2z>1 x << 1 > y w & 0xFFFF ≠ x x % 26 = v Solver provides information to DPLL engine to aid search Nonchronological backtracking Conflict clause generation – 52 – Successful approach for other decision procedures MathSAT Layers Uses increasingly detailed solver layers Only progress if can’t find conflict using more abstract rules Layers 1. Equality with uninterpreted functions Treats all bit-level functions and operators as uninterpreted 2. 3. – 53 – Simple handling of concatenations, extractions, and transitivity Full solver using linear arithmetic + SAT Summary: Modeling Levels Bits Limited ability to scale Hard to apply functional abstractions Words Allows abstracting data while precisely representing control Overlooks finite word-size effects Bit Vectors Realistic semantic model for hardware & software Captures all details of actual operation Detects errors related to overflow and other artifacts of finite representation – 54 – Can apply abstractions found at word-level Areas of Agreement SAT-Based Framework Is Only Logical Choice SAT solvers are good & getting better Want to Automatically Exploit Abstractions Function structure Arithmetic properties E.g., associativity, commutativity Arithmetic reductions E.g., LU decomposition Base Level Should Be SAT – 55 – Only semantically complete approach Choices Optimize for Special Formula Classes E.g., STP optimized for conjunctions of constraints Common in software verification & testing Iterative Abstraction Natural framework for attempting different abstractions Having SAT solver in inner loop makes performance tuning difficult DPLL(T) Framework Theory solver only deals with conjunctions May need to invoke SAT solver in inner loop Hard to coordinate outer and inner search procedures Others? – 56 – Observations Bit-Vector Modeling Gaining in Popularity Recognition of importance Benchmarks and competitions Just Now Improving on Bit Blasting + SAT Lots More Work to be Done – 57 –