PROGRAMMING USING AUTOMATA AND TRANSDUCERS
Loris D’Antoni, Margus Veanes

[Figure: all features of a general-purpose language vs. the features needed (replace, match, char…)]

FOR EACH DOMAIN-SPECIFIC TASK
Design a language that
• has only the features required by the task
• is simple to use
• enables automatic reasoning about what the programs do
• compiles into efficient code

OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?

AUTOMATA, TRANSDUCERS, AND PROGRAMS

Finite alphabet

    type base = A | T | C | G

Languages of strings

    let rec all_TG (l: base list) : bool =
      match l with
      | [] -> true
      | h :: t -> (h = T || h = G) && (all_TG t)

    let rec all_AC (l: base list) : bool =
      match l with
      | [] -> true
      | h :: t -> (h = A || h = C) && (all_AC t)

[Diagram: all_TG is a single state q0 with self-loops on T and G; all_AC is a single state q0 with self-loops on A and C.]

Transformations from strings to strings

    let rec map_base (l: base list) : base list =
      match l with
      | [] -> []
      | A :: t -> T :: (map_base t)
      | T :: t -> A :: (map_base t)
      | G :: t -> C :: (map_base t)
      | C :: t -> G :: (map_base t)

    let rec filter_AC (l: base list) : base list =
      match l with
      | [] -> []
      | A :: t -> A :: (filter_AC t)
      | T :: t -> filter_AC t
      | G :: t -> filter_AC t
      | C :: t -> C :: (filter_AC t)

[Diagram: map_base is a single-state transducer with transitions A/T, T/A, G/C, C/G; filter_AC is a single-state transducer with transitions A/A, C/C, T/ε, G/ε.]

FINITE AUTOMATA
[Diagram: automaton with transitions labeled a and b.]
    abab   Yes
    aba    No
    bb     Yes
    a      No

FINITE STATE TRANSDUCERS
[Diagram: transducer with transitions labeled a/aa and b/bb, and an output zz.]
    ab     aabbzz
    b      bbzz
    aba    UNDEFINED
    a      UNDEFINED

BENEFITS OF AUTOMATA AND TRANSDUCERS

Closure and decidability for automata:
• Intersection, union, complement
• Decidable emptiness
• Decidable equivalence
• Can be minimized

Transducer composition

    let m_f_DNA l : base list = filter_AC (map_base l)

[Diagram: composing map_base (A/T, T/A, G/C, C/G) with filter_AC (A/A, C/C, T/ε, G/ε) gives the single-state transducer m_f_DNA with transitions A/ε, T/A, G/C, C/ε.]

Type-checking
Property to check: input in all_TG, then map_base, gives output in all_AC
• map_base o (¬ all_AC): map_base, only defined if the output is in (¬ all_AC)
• dom(map_base o (¬ all_AC)): the inputs for which map_base does not output in all_AC
• Check: dom(map_base o (¬ all_AC)) ∩ all_TG = ∅

Transducer equivalence

    let m_f_DNA l : base list = filter_AC (map_base l)
    let f_m_DNA l : base list = map_base (filter_AC l)

Is m_f_DNA equivalent to f_m_DNA?

OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?

BEK: ANALYSIS OF STRING SANITIZERS [USENIX11, POPL12]
P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, M. Veanes

    <img src='some untrusted input'/>

QUESTION: What could possibly go wrong?

    Attacker: gollum.png' onload='javascript:...
    Result:   <img src='gollum.png' onload='javascript:…

[Image caption: "I found my PRECIOUSS"]

FIRST LINE OF DEFENSE: SANITIZERS
• Sanitizer: a string transformation function.
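To make the definition concrete, here is a minimal sketch of a sanitizer as a one-state transducer: every input character is mapped to a fixed output string. The function name `sanitize` and the entity table are assumptions made for illustration; the slides themselves only show the single-quote case (' becomes &#39;), and this is not the code of any library discussed in the talk.

```python
# Hypothetical entity table; only the single-quote entry comes from the slides.
ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
}


def sanitize(untrusted: str) -> str:
    """One-state transducer: each character maps to a fixed output string."""
    return "".join(ENTITIES.get(ch, ch) for ch in untrusted)


# The attacker string from the slides, embedded in the same <img> template.
attack = "gollum.png' onload='javascript:..."
page = "<img src='" + sanitize(attack) + "'/>"
print(page)
```

With the quotes encoded, the attacker's text stays inside the src attribute: the only raw single quotes left in `page` are the two that delimit the attribute value.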
Untrusted data → Sanitized data: “im.png' …” → “im.png&#x27; …”
[Slide footer: Dec 8, 2011, PLDI'12 submission presentations]

COMPARING SANITIZERS
' (single quote) → &#39; (HTML entity)

Two libraries applied to some untrusted input:

                     Library A                            Library B
    Name:            HtmlEncode                           HtmlEncode
    Around for:      Years                                Years
    Availability:    Readily available to C# developers   Readily available to C# developers
    Output for ':    ✔ &#39;                              ✘ '

MS AntiXSS (excerpt, cut off on the slide):

    private static string HtmlEncode(string input, bool useNamedEntities,
                                     MethodSpecificEncoder encoderTweak)
    {
        if (string.IsNullOrEmpty(input)) { return input; }
        if (characterValues == null) { InitialiseSafeList(); }
        if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); }

        // Setup a new character array for output.
        char[] inputAsArray = input.ToCharArray();
        int outputLength = 0;
        int inputLength = inputAsArray.Length;
        char[] encodedInput = new char[inputLength * 10];

        SyncLock.EnterReadLock();
        try
        {
            for (int i = 0; i < inputLength; i++)
            {
                char currentCharacter = inputAsArray[i];
                int currentCodePoint = inputAsArray[i];
                char[] tweekedValue;

                // Check for invalid values
                if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF)
                {
                    throw new InvalidUnicodeValueException(currentCodePoint);
                }
                else if (char.IsHighSurrogate(currentCharacter))
                {
                    if (i + 1 == inputLength)
                    {
                        throw new InvalidSurrogatePairException(currentCharacter, '\0');
                    }

                    // Now peak ahead and check if the following character is a low surrogate.
                    char nextCharacter = inputAsArray[i + 1];
                    char nextCodePoint = inputAsArray[i + 1];

.NET WebUtility (excerpt, cut off on the slide):

    public static string HtmlEncode(string s)
    {
        if (s == null) return null;
        int num = IndexOfHtmlEncodingChars(s, 0);
        if (num == -1) return s;
        StringBuilder builder = new StringBuilder(s.Length + 5);
        int length = s.Length;
        int startIndex = 0;
    Label_002A:
        if (num > startIndex)
        {
            builder.Append(s, startIndex, num - startIndex);
        }
        char ch = s[num];
        if (ch > '>')
        {
            builder.Append("&#");
            builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
            builder.Append(';');
        }
        else
        {
            char ch2 = ch;
            if (ch2 != '"')
            {
                switch (ch2)
                {
                    case '<': builder.Append("&lt;"); goto Label_00D5;
                    case '=': goto Label_00D5;
                    case '>': builder.Append("&gt;"); goto Label_00D5;
                    case '&': builder.Append("&amp;"); goto Label_00D5;
                }
            }
            else
            {
                builder.Append("&quot;");
            }
        }
    Label_00D5:
        startIndex = num + 1;
        if (startIndex < length)
        {
            num = IndexOfHtmlEncodingChars(s, startIndex);
            if (num != -1)
            {
                goto Label_002A;

PHP TRUNK: CHANGES TO html.c, 1999-2011
    R7,841     April 1999       135 loc
    R32,564    September 2000   ENT_QUOTES introduced
    R242,949   September 2007   $double_encode=true
    R309,482   March 2011       1693 loc

MOTIVATION
• Writing string sanitizers correctly is difficult
• There is no cheap way to identify problems with sanitizers
• ‘Correctness’ is a moving target
• What if we could say more about sanitizer behavior?

CONTRIBUTIONS
BEK
• Frontend: a small language for string manipulation; similar to how sanitizers are written today
• Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
• Converted sanitizers from a variety of sources
• Checked properties like reversibility, idempotence, equivalence, and commutativity

BEK ARCHITECTURE
Bek program:

    s := iter(c in t)[b := false;] {
      case (!b && c in "[\"\\]"): b := false; yield('\\', c);
      case (c == '\\'): b := !b; yield(c);
      case (true): b := false; yield(c);
    };

• Transformation: the Bek program becomes a symbolic finite transducer (Microsoft.Automata, Z3)
• Analysis: does it do the right thing? Counterexample: “\' vs. \\'”
• Code Gen: C#, JavaScript, C

A BEK PROGRAM: ESCAPE QUOTES

    escape := iter(c in s)[b := false;] {
      case (!b && c in "['\"]"): b := false; yield('\\', c);
      case (c == '\\'): b := !b; yield(c);
      case (true): b := false; yield(c);
    };

• iter(c in s): iterate over the characters in string s
• [b := false;]: while updating one boolean variable b
• Simple dedicated syntax

FINITE STATE TRANSDUCERS
Transitions: a/A, b/B, …, z/Z, …, &/&
Problem: the alphabet has 2^16 characters. TOO MANY TRANSITIONS

SYMBOLIC FINITE TRANSDUCERS
    x in [a-z] / x-32
    x not in [a-z] / x
Only two transitions!!

SYMBOLIC FINITE TRANSDUCERS
Each transition is labeled with a predicate and a sequence of functions:
    true / 5
    true / x-4
    x>5 / x+1, x
    x%2=1 / x-1, x, x+4
Alphabet theory has to be DECIDABLE. We’ll use Z3 to check predicate satisfiability.

BEK ARCHITECTURE
[Architecture diagram repeated.] Now what?

EQUIVALENCE CHECKING IS DECIDABLE!
SFT algorithms. Alphabet theory has to be DECIDABLE. We’ll use Z3 to check predicate satisfiability.

EQUIVALENCE CHECKING (SFT algorithms)
AntiXSS.HtmlEncode = WebUtility.HtmlEncode

CLOSED UNDER COMPOSITION
[Diagram: SFT A (in → out) followed by SFT B (in → out) is again a single SFT, A composed with B.]

COMPOSITION (SFT algorithms)
JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w))

PRE-IMAGE COMPUTATION
[Diagram: regular language I on the input side of SFT A, regular language O on the output side.]
Given a vulnerability signature on the output side of SFT A, the pre-image is the set of MALICIOUS INPUTS.

QUESTIONS?
• Can BEK model existing sanitizers?
• Can we use it to check interesting properties on real sanitizers?

WHAT FEATURES ARE NEEDED?
Language features. Data (inspected to obtain feature counts):
• 1x OWASP HTMLencode
• 13x Google AutoEscape
• 21x IE 8 XSS Filter
• 7x Synthetic
Results:
• Majority (76%) of sanitizers can be ported without extending the language
• With multi-character lookahead: 90%

CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Data
• 4x MS internal HtmlEncode
• 3x ‘for hire’ HtmlEncode based on English-language specification (C#)
Questions: Commutative? Equivalent?
• Short answer: Yes!
• EQ results take less than a minute to obtain:

         1  2  3  4  5  6  7
    1    ✔  ✔  ✔  ✘  ✘  ✔  ✘
    2       ✔  ✔  ✘  ✘  ✔  ✘
    3          ✔  ✘  ✘  ✔  ✘
    4             ✔  ✘  ✘  ✘
    5                ✔  ✘  ✘
    6                   ✔  ✘
    7                      ✔

DOES IT SCALE?
[Charts: Commutativity; Self-Equivalence]

BEK IN A NUTSHELL
Conclusion
• BEK is a domain-specific language for writing string sanitizers
• BEK can model programs without approximation using symbolic finite transducers, enabling, e.g., equivalence checks
• BEK was evaluated using real-world sanitizers from a variety of different sources

OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?

BEX: ANALYSIS OF STRING ENCODERS [VMCAI13, CAV13]
Loris D’Antoni, Margus Veanes

[Diagram: "Hi, I’m plain text! Nice to meet you!" → Encoder → "SGkgSSdtIHBsYWluIHRleHQsIG5pY2UgdG8gbWVldCB5b3Uh" → Decoder]

NOT SO EASY TO GET RIGHT

WHEN ARE THEY CORRECT?
[Diagram: Encoder followed by Decoder, and Decoder followed by Encoder, over T and T’.]

CAN WE USE TRANSDUCERS?
Encoder o Decoder = Identity

BEK: WHAT FEATURES WERE NEEDED?
Language features
• Majority (76%) of sanitizers can be ported without extending Bek
• With multi-character lookahead: 90%

BASE64 ENCODER
    Text content:           M          a          n
    Bytes:                  77         97         110
    Bit pattern (bytes):    01001101   01100001   01101110
    Bit pattern (6 bits):   010011  010110  000101  101110
    Index:                  19      22      5       46
    Base64 encoded:         T       W       F       u
3 bytes become 4 Base64 characters.

HOW DO WE EXTEND BEK?

BEK ARCHITECTURE
[Architecture diagram repeated.] Symbolic finite transducers don’t have registers.

TRANSDUCERS WITH REGISTERS
    state 0 → 1:  x / [ x>>2 ],             r := (x&3)<<4
    state 1 → 2:  x / [ r|(x>>4) ],         r := (x&0xF)<<2
    state 2 → 0:  x / [ r|(x>>6), x&0x3F ], r := 0
• Transducers with registers are closed under composition
• Equivalent to Turing Machines

BASE64 IN BEX
DEMO

BEX ARCHITECTURE
[Same pipeline as BEK: Bex program → Transformation → ? (Microsoft.Automata, Z3) → Analysis and Code Gen (C#, JavaScript, C). The transducer model is still an open question.]

EXTENDED SYMBOLIC FINITE TRANSDUCERS
A transition p → q that reads 3 characters at once:
    x1≤FF ∧ x2≤FF ∧ x3≤FF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
Example: in state p, reading x1 x2 x3 = M a n outputs T W F u and moves to state q.

MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS
    state 0 → 1:  x1>x2 / [x1+x2]
Do they still have nice properties?

WHAT DO WE NEED?
To check Encoder o Decoder = Identity we need:
• Composition
• Equivalence

NEGATIVE RESULTS
• ESFAs
  – equivalence is undecidable
  – are not closed under intersection
  – are not closed under complement
• ESFTs
  – equivalence is undecidable
  – are not closed under composition

A FRIENDLIER RESTRICTION

CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS
Negative results use binary predicates, and encoders do not use this feature.
    Binary predicate (disallowed):   p → q:  x1<x2+1
Only allow conjunctions of unary predicates:
    p → q:  x1>5 ∧ x2=1 / [x1+x2, x1]

CARTESIAN ESFA = SFA
Cartesian ESFAs are now equivalent to SFAs.
    ESFA:  0 → 1 on x1>5 ∧ x2=1
    SFA:   0 → (0,1) on x>5, then (0,1) → 1 on x=1

STILL MORE EXPRESSIVE THAN SFTS
Cartesian ESFTs are strictly more expressive than SFTs!!
    0 → 1:  x1>5 ∧ x2=1 / [x1+x2]   has no SFT counterpart (?)

WHAT DO WE NEED?
To check Encoder o Decoder = Identity we need:
• Composition
• Equivalence

RESULTS
• Cartesian ESFTs
  – equivalence is decidable
  – are not closed under composition

COMPOSITION IN PRACTICE

BEK WITH REGISTERS?

TRANSDUCERS WITH REGISTERS
[Base64 register transducer diagram repeated.]
• Transducers with registers are closed under composition
• Equivalent to Turing Machines

COMPOSING CARTESIAN ESFTS
[Diagram: Cartesian ESFTs A and B are turned into transducers with registers A’ and B’; these are composed into A’ o B’; the open step (?) is turning A’ o B’ back into a Cartesian ESFT for A o B.]

REGISTER ELIMINATION
Transducer with a register:
    state 0 → 1:  x / [ x+4 ],      r := (x-2)
    state 1 → 2:  x / [ r+x, x+1 ], r := 0
ESFT:
    state 0 → 2:  [x1,x2] / [ x1+4, x1-2+x2, x2+1 ], r := 0

DOES IT WORK?

UNICODE
• UTF8 to UTF16 encoder (E) and decoder (D)

    Test                 Running time
    Dom(E) = UTF16       47 ms
    Dom(EoD) = UTF16     109 ms
    Dom(D) = UTF8        156 ms
    Dom(DoE) = UTF8      320 ms
    EoD = Identity       16 ms
    DoE = Identity       24 ms

Complete analysis in about a second.

BASE64
• Base64 encoder (E) and decoder (D)

    Test                 Running time
    Dom(E) = bytes       13 ms
    Dom(EoD) = bytes     55 ms
    Dom(D) = 6bits+      76 ms
    Dom(DoE) = 6bits+    56 ms
    EoD = Identity       53 ms
    DoE = Identity       19 ms

BEX ARCHITECTURE
[Same pipeline: Bex program → Transformation → Cartesian Extended Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (does it do the right thing? EoD = I) and Code Gen (C#, JavaScript, C).]

BEX IN A NUTSHELL
Conclusion
• BEX is a domain-specific language for writing string encoders
• BEX can model programs without approximation using Cartesian extended symbolic finite transducers
• BEX was evaluated using real-world string encoders

OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?
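As a recap of the BEX section, the three-character Base64 transition shown for extended symbolic finite transducers can be written out directly and checked on the "Man" example. This is only a sketch in Python, not BEX code: `encode_step` copies the output functions from the slide, while `decode_step`, `encode`, and `decode` are hypothetical helpers added for illustration, and inputs whose length is not a multiple of 3 (the padding case) are deliberately left out because the slides do not show it.

```python
import base64

# Standard Base64 index table (RFC 4648).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_step(x1, x2, x3):
    """The slide's rule: read three bytes, emit four 6-bit indices."""
    assert x1 <= 0xFF and x2 <= 0xFF and x3 <= 0xFF  # the transition guard
    return [
        x1 >> 2,
        ((x1 & 3) << 4) | (x2 >> 4),
        ((x2 & 0xF) << 2) | (x3 >> 6),
        x3 & 0x3F,
    ]


def decode_step(y1, y2, y3, y4):
    """Hypothetical inverse rule: read four 6-bit indices, emit three bytes."""
    assert max(y1, y2, y3, y4) <= 0x3F
    return [
        (y1 << 2) | (y2 >> 4),
        ((y2 & 0xF) << 4) | (y3 >> 2),
        ((y3 & 3) << 6) | y4,
    ]


def encode(data: bytes) -> str:
    if len(data) % 3 != 0:
        raise ValueError("this sketch skips padding: length must be a multiple of 3")
    out = []
    for i in range(0, len(data), 3):
        out.extend(encode_step(*data[i:i + 3]))
    return "".join(B64[k] for k in out)


def decode(text: str) -> bytes:
    if len(text) % 4 != 0:
        raise ValueError("this sketch skips padding: length must be a multiple of 4")
    idx = [B64.index(c) for c in text]
    out = []
    for i in range(0, len(idx), 4):
        out.extend(decode_step(*idx[i:i + 4]))
    return bytes(out)


print(encode_step(77, 97, 110))  # indices for "Man": [19, 22, 5, 46]
print(encode(b"Man"))            # TWFu
```

Running `decode(encode(x)) == x` on sample inputs only tests the Encoder o Decoder = Identity property on those samples; the point of the transducer model is that the equivalence check covers all inputs. The standard library's `base64.b64encode` is imported so it can serve as a reference for the unpadded case.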
FAST: ANALYSIS OF PROGRAMS MANIPULATING TREES [PLDI14]
Loris D’Antoni, Margus Veanes, Ben Livshits, David Molnar

SOLUTION: USE AN HTML SANITIZER
Remove malicious active code from HTML documents.

Before:

    <body>
      <script> alert("This is Sparta!"); </script>
      <div> <p> I swear this HTML is safe! </p> </div>
    </body>

After SANITIZE:

    <body>
      <div> <p> I swear this HTML is safe! </p> </div>
    </body>

TYPICAL TRANSFORMATIONS
Typical transformations
• Remove scripts
• Remove malicious URLs
• Replace deprecated tags
Interesting questions. Given a sanitizer S:
• Does S always produce a safe and well-formed output?
• Is S defined on every possible HTML file?
• Does executing S twice produce the same output as executing S once?
• Can we execute S fast?

HOW DO WE WRITE ONE?
DEMO: http://rise4fun.com/Fast/2 1

KEY IDEA: HTML CODE IS A TREE
[Diagram: the tree body(script(malicious code), div(p(I swear this HTML is safe!))) is sanitized to body(div(p(I swear this HTML is safe!))).]

MOTIVATION
Trees are common input/output data structures
– XML query, type-checking, etc…
– Compilers/optimizers (from parse tree to parse tree)
– Tree manipulating programs: data structures, algorithms, ontologies, etc…

FAST ARCHITECTURE
[Same pipeline as BEK and BEX: Fast program → Transformation → ? (Microsoft.Automata, Z3) → Analysis and Code Gen (C#, JavaScript, C). The transducer model is still an open question.]

CHOOSING THE RIGHT FORMALISM

SEMANTICS AS TRANSDUCERS
Goal: find a class of tree transducers that can express the previous examples and is closed under composition.

TOP DOWN TREE TRANSDUCERS [ENGELFRIET75]
Rule: q(a(x1,x2)) → b(c, q1(x1))
• Decidable properties: type-checking, etc…
• Domain expressiveness: only finite alphabets

SYMBOLIC TREE TRANSDUCERS [PSI11]
Rule: q(λa.a>3,(x1,x2)) → λa.a+1,(λa.a-2, q1(x1))
Example: a node labeled 5 with children x1 and x2 matches because 5>3 is true; the output node is labeled 5+1, with a child labeled 5-2 and the child q1(x1).
• Decidable properties: type-checking, etc…
• Domain expressiveness: infinite alphabets using predicates and functions
• Structural expressiveness: can’t delete a node without reading it first
Alphabet theory has to be DECIDABLE. We’ll use Z3 to check predicate satisfiability.

IMPROVING STRUCTURAL EXPRESSIVENESS
Transformation: delete the left child if it contains a script.
If we delete the node, we can’t check that the left child contained a script.
Solution: Regular Look-Ahead (RLA)

REGULAR LOOK AHEAD
Transformation: delete the left child if it contains a script. The transformation is now safe.
Rules can ask whether the children are in particular languages
– p1: the language of trees that contain a script node
– p2: the language of all trees
• Decidable properties: type-checking, etc…
• Domain expressiveness: infinite alphabets
• Structural expressiveness: good enough to express our examples

                                                                           Decidability  Complexity  Structural expressiveness  Infinite alphabets
    Top Down Tree Transducers [Engelfriet75]                               V             V           X                          X
    Top Down Tree Transducers with Regular Look-ahead [Engelfriet76]       V             V           ~                          X
    Streaming Tree Transducers [AlurDantoni12]                             V             X           V                          X
    Data Automata [Bojanczyk98]                                            ~             X           X                          V
    Symbolic Tree Transducers [VeanesBjoerner11]                           V             V           X                          V
    Symbolic Tree Transducers RLA                                          V             V           ~                          V

COMPOSITION OF STTR
[Diagram: T1 followed by T2 as a single transducer T1 o T2.]
This is not always possible!! Find the biggest class for which it is possible.

WHEN CAN WE COMPOSE?
Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if
– T1 is deterministic
Alphabet theory has to be DECIDABLE. We’ll use Z3 to check predicate satisfiability.
All our examples fall in this category.

FAST ARCHITECTURE
[Same pipeline: Fast program → Transformation → Symbolic Tree Transducers with RLA (Microsoft.Automata, Z3) → Analysis and Code Gen (C#, JavaScript, C).]

CASE STUDIES AND EXPERIMENTS
• Program optimization: deforestation of functional programs
• Verification: HTML sanitization, analysis of functional programs, augmented reality app store
• Infinite alphabets: integer data types

DEFORESTATION
Removing intermediate data structures from programs

    alphabet IList [i : int] { nil(0), cons(1) }

    trans mapC: IList -> IList {
        nil() to nil [0]
      | cons(x) to cons [(i+5)%26] (mapC x)
    }

    def mapC2: IList -> IList := compose mapC mapC

ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition.

STAGES BY EXAMPLE
[Diagram: transducers mapC and mapC2.]

DEFORESTATION: SPEEDUP
[Chart: running time in milliseconds (0 to 5,000) against the number of composed map functions (0 to 500), for two series: Fast, i.e. (f;f;f;…;f)(x), and No Fast, i.e. f(f(f(…f(x)...). Labeled data points: 4,686 and 1,313.]

ANALYSIS OF FUNCTIONAL PROGRAMS

AR INTERFERENCE ANALYSIS
Recognizers output data that can be seen as a tree structure.
[Diagram: skeleton tree with nodes Spine, Neck, Hip, Knee, Head, Ankle, Foot, ….]

APPS AS TREE TRANSFORMATIONS
Applications that use recognizers can be modeled as FAST programs.

    trans addHat: STree -> STree
        Spine(x,y) to Spine(addHat(x), y)
      | Neck(h,l,r) to Neck(addHat(h), l, r)
      | Head(a) to Head(Hat(a))

COMPOSITION OF PROGRAMS
Two FAST programs can be composed into a single FAST program.
[Diagram: p1 followed by p2 as a single program p1;p2.]

ANOTHER RECOGNIZER
[Diagram: room tree with nodes Room, Wall, Floor, Chair, Table, ….]

INTERFERENCE ANALYSIS
Apps can be malicious: they may try to overwrite the outputs of other apps.
Apps interfere when they annotate the same node of a recognizer’s output.
Interfering apps:
• Add cat ears / Add hat
• Add pin to a city / Blur a city
• Amazon Buy Now button / Malicious Buy Now button
We can compose them and check if they interfere statically!!
– Put the checker in the AppStore and analyze apps before approval

INTERFERENCE ANALYSIS IN PRACTICE
• 100 generated FAST programs, up to 85 functions each
• Check statically if they conflict pairwise for ANY possible input
• Checked 99% of program pairs in less than 0.5 sec!
• For an App store these are perfectly fine
TWO PENDING PATENTS

FAST IN A NUTSHELL
Conclusion
• FAST is a domain-specific language for writing tree manipulating programs
• FAST can model programs without approximation using symbolic tree transducers with regular look-ahead
• FAST was evaluated using real-world programs

OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?

WHAT’S NEXT

FOR EACH DOMAIN-SPECIFIC TASK
Design a language that
• has only the features required by the task
• is simple to use
• enables automatic reasoning about what the programs do
• compiles into efficient code

Here at POPL15!
DREX: EFFICIENT STRING MANIPULATION
Rajeev Alur, Loris D’Antoni, Mukund Raghothaman
DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B)
[Diagram: automaton for (a|b)*b; transducer with transitions a/a and b/b.]
    iterate(choice(a->a, b->b))
Execute this code in a linear-time, left-to-right pass on the input string!!

Here at POPL15!!
BEX 2.0: PARALLEL EXECUTION OF STRING ENCODERS
Margus Veanes, Todd Mytkowicz, Ben Livshits, David Molnar
FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B)
    state 0 → 1:  x / [ x+4 ],      r := (x-2)
    state 1 → 2:  x / [ r+x, x+1 ], r := 0
Efficient data-parallel code

Here at POPL15!!
PROGRAM BOOSTING, OR CROWD-SOURCING FOR CORRECTNESS
Robert Cochran, Loris D’Antoni, Benjamin Livshits, David Molnar, Margus Veanes
CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B)
[Diagram: Specification]

YOU CAN HELP TOO!

INTERESTING DIRECTIONS
• A transducer-based language for
  – Web scrapers
  – Spreadsheet transformations
  – Compiler optimizations
  – XML processing
  – HTML rendering

SUMMARIZING…

OUR RECIPE FOR EACH TASK
• DSL (for example, the Bek program shown earlier)
• Transformation: the DSL program becomes a transducer model (Microsoft.Automata, Z3)
• Analysis: does it do the right thing? Answer the analysis question.
• Code Gen: C#, JavaScript, C

REFERENCES
BEK
• Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes. USENIX11
• Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjorner. POPL12
BEX
• Static analysis of string encoders and decoders. D’Antoni, Veanes. VMCAI13
• Equivalence of extended symbolic finite transducers. D’Antoni, Veanes. CAV13
• Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits. POPL15
FAST
• Fast: a transducer based language for tree manipulation. D’Antoni, Veanes, Livshits, Molnar. PLDI14