Decision Procedures for String Constraints
Pieter Hooimeijer

http://en.wikipedia.org/wiki/Osborne_1

<img src='untrusted input'/>

What could possibly go wrong?

<img src='untrusted input'/>
Attacker: im.png' onload='javascript:...
<img src='im.png' onload='j

www.cs.virginia.edu/~ph4u/

Talk Outline
• Background
• Building
• Tuning
• Conclusion

[Timeline figure, 2007 to 2013, with the work covered here marked "This Talk". Entries: ASE Bug Reports; Sensys MacroLab; Sensys MacroLab 2; ISSTA Hampi; SocialNets Proxied Content; PLDI DPRLE; ASE StrSolve; Sesena MacroLab 3; USENIX Sec BEK; POPL BEK2; TOSEM Hampi 2; VMCAI Data structures; J. ASE StrSolve 2]

Decision Procedures
• Program analysis work frequently uses one of these:
• They solve mathematical constraints
• There is a standard input format

Example
𝑥² = 25
𝑥 > 0

    (declare-fun x () Int)
    (assert (= (* x x) 25))
    (assert (> x 0))
    (check-sat)
    (get-model)

✔ [𝑥 ↦ 5]

Motivation
Reasoning about strings is difficult:
– for programmers
– for automated tools

String Constraint Solvers
• Hampi
• Kaluza
• Rex

Hampi, Kaluza, Rex

    String a;
    //...
    R = Regex("^ab$");
    R.IsMatch(a) = true;

    String a;
    //...
    R = Regex("^ab$");
    assert(R.Match(a));

✔ [𝑎 ↦ 'ab']
constraints: R = Regex("^ab$"); assert(R.Match(a));
solvers: Hampi, Kaluza, Rex
solution(s): ✔ [𝑎 ↦ 'ab']

What should we model?

Example
How hard is regex matching in Perl?
A: Just as hard as 3-SAT…

    $istr = '^' . ('(x?)' x $V) . ".*;\n"
    $ireg = '^' . ('(x?)' x $V) . ".*;\n" .
        join('', map {
            '(?:' .
            join('|', map { $_ < 0 ? ('\\' . -$_ . 'x') : ('\\' . $_) } @$_) .
            "),\n"
        } @Clauses);

http://perl.plover.com/NPC/NPC-3SAT.html

Where do constraints come from?

Code

    String a;
    // ...
    R = Regex("^ab$");
    if (R.IsMatch(a)) {
        // ...
    }

Constraint Generation → Constraint Solving

Chapter 2: Defining String Constraints
Contributions:
1. The definition of the regular matching assignments problem
2. An algorithm, its implementation, and correctness proof
3. An evaluation, applying (2) to a static analysis problem

demo (internet permitting)

Evaluation
The Task: generate string inputs that exercise 17 known vulnerabilities in 30,000 lines of PHP
Metric: running time

Results
• Our constraint definition is sufficiently expressive to capture the constraints of interest
• Wall-clock running time is between 0.01 seconds and 10 minutes

Chapter 3: Evaluating Data Structures
Contribution 4:
An apples-to-apples performance comparison of data structures and algorithms for automata-based string constraint solving

Motivation
• Existing work provided tool-to-tool performance comparisons
• Confounds: Performance gains may be due to external factors

The Framework
• Based on Rex
• Fixes external factors:
  – front-end parser
  – regex-to-automaton conversion
  – implementation language
  – search tree

Study Design
Tasks:
– automaton intersection
– automaton subtraction
Metric:
– running time

Character Sets
BDD    binary decision diagrams
Pred   symbolic bitvector ranges in DNF
Range  concrete set of character ranges
Hash   concrete set of individual characters

Task 1 (55x): 𝑤 ∈ 𝐿(𝑎) ∩ 𝐿(𝑏)
Task 2 (100x): 𝑤 ∈ 𝐿(𝑎) ∖ 𝐿(𝑏)
Each task: Eager and Lazy, each over Unicode and ASCII

Results
[Figure: log-scale plots (axis 0.1 to 1000) comparing BDD, Pred, Range, and Hash; panel labels: ASCII, Unicode, Lazy, Eager]

Chapter 4: Solving String Constraints Lazily
Contributions:
5. A novel (lazy) algorithm for solving multivariate string constraints
6. A comprehensive performance evaluation

Motivation
• More scalable algorithms are more likely to see real use

Approach
Step 1:
Eagerly construct a high-level representation of the search space
Step 2:
Explore the search space lazily, adding restrictions for one variable at a time

Evaluation
• Difference
• Hampi
• Long Strings
• CFG Intersection

Hampi: Background
[Timeline figure, 2007 to 2013. Entries: ISSTA Hampi; SocialNets Proxied Content; PLDI DPRLE; ASE StrSolve; USENIX Sec BEK; POPL BEK2; TOSEM Hampi 2; VMCAI Data structures; J. ASE StrSolve 2]

Hampi: Architecture
Hampi → STP (bv) → MiniSAT
Phases: encoding, solving

Experiment
Task: regex difference (same dataset as before)
Metric: proportion of wall-clock time spent solving

Results
[Figure: proportion of running time (0% to 100%) split into Encoding and Solving, for length bounds 1, 5, 10, and 15; a companion plot gives absolute time in seconds (0 to 10)]

Experiment
Task: intersect two regexes parameterized on n: [a-c]*a[a-c]{n+1} and [a-c]*b[a-c]{n}
Metric: running time

Participating Tools
• Hampi
• Rex
• StrSolve

Results
[Figure: running time in seconds (log scale, 0.001 to 10000) against n (0 to 1000) for Hampi, Rex, and StrSolve]

Conclusion
• Introduced string constraint solving in the context of program analysis
• Two algorithms: one eager (DPRLE), one lazy (StrSolve)
• Presented experiments
  – data structure selection
  – solving multivariate constraints
• Our lazy prototype outperforms other approaches on indicative workloads

Thanks for stopping by!
www.cs.virginia.edu/~ph4u/
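
Supplementary sketch for the regex intersection experiment ([a-c]*a[a-c]{n+1} and [a-c]*b[a-c]{n}). It illustrates, under simplifying assumptions, what solving 𝑤 ∈ 𝐿(𝑎) ∩ 𝐿(𝑏) lazily can look like: product states of the two automata are created only when the search reaches them, and the search stops at the first accepting pair. The NFA encoding and the names `suffix_nfa` and `lazy_intersection_witness` are hypothetical, chosen for this illustration; this is not how Hampi, Rex, or StrSolve are implemented.

```python
from collections import deque
import re

ALPHABET = "abc"  # the character class [a-c] used in the experiment


def suffix_nfa(marker, tail_len):
    """Hypothetical NFA for [a-c]*<marker>[a-c]{tail_len}.

    State 0 loops on every character and may guess that `marker` starts
    the fixed-length suffix; state tail_len + 1 is the only accepting state.
    Returns (start_state, accepting_state, step).
    """
    accept = tail_len + 1

    def step(state, ch):
        if state == 0:
            yield 0                  # stay in the [a-c]* loop
            if ch == marker:
                yield 1              # guess: this character is the marker
        elif state < accept:
            yield state + 1          # consume one character of the tail
        # the accepting state has no outgoing transitions

    return 0, accept, step


def lazy_intersection_witness(nfa_a, nfa_b):
    """Breadth-first search over product states that are built on demand.

    Returns a shortest string accepted by both NFAs, or None if the
    intersection is empty.
    """
    start_a, accept_a, step_a = nfa_a
    start_b, accept_b, step_b = nfa_b
    start = (start_a, start_b)
    parent = {start: None}           # product state -> (previous state, char)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == (accept_a, accept_b):
            chars = []
            while parent[state] is not None:
                state, ch = parent[state]
                chars.append(ch)
            return "".join(reversed(chars))
        for ch in ALPHABET:
            for next_a in step_a(state[0], ch):
                for next_b in step_b(state[1], ch):
                    nxt = (next_a, next_b)
                    if nxt not in parent:
                        parent[nxt] = (state, ch)
                        queue.append(nxt)
    return None


# Usage: check the witness against Python's own regex engine.
n = 3
witness = lazy_intersection_witness(suffix_nfa("a", n + 1), suffix_nfa("b", n))
assert re.fullmatch(r"[a-c]*a[a-c]{%d}" % (n + 1), witness)
assert re.fullmatch(r"[a-c]*b[a-c]{%d}" % n, witness)
```

An eager variant would build the full product automaton before searching it; here only the product states reachable from the start pair are ever stored in `parent`.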