PROGRAMMING USING
AUTOMATA AND TRANSDUCERS
Loris D’Antoni
Margus Veanes
2
3
4
5
All features of general purpose language
Features needed
replace, match,
char…
6
FOR EACH DOMAIN SPECIFIC TASK
Design a language that
• has only the features required by the task
• is simple to use
• enables automated reasoning about what the programs do
• compiles into efficient code
7
OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?
8
AUTOMATA, TRANSDUCERS, AND
PROGRAMS
9
FOR EACH DOMAIN SPECIFIC TASK
Design a language that
• has only the features required by the task
• is simple to use
• enables automated reasoning about what the programs do
• compiles into efficient code
10
type base = A | T | C | G
Finite alphabet
Languages of strings
let rec all_TG (l: base list) : bool = match l with
  [] -> true
  | h :: t -> (h = T || h = G) && (all_TG t)
let rec all_AC (l: base list) : bool = match l with
  [] -> true
  | h :: t -> (h = A || h = C) && (all_AC t)
[Automata: all_TG is a single state q0 with self-loops on T and G; all_AC is a single state q0 with self-loops on A and C]
Transformations from strings to strings
let rec map_base (l: base list) : base list = match l with
  [] -> []
  | A :: t -> T :: (map_base t)
  | T :: t -> A :: (map_base t)
  | G :: t -> C :: (map_base t)
  | C :: t -> G :: (map_base t)
let rec filter_AC (l: base list) : base list = match l with
  [] -> []
  | A :: t -> A :: (filter_AC t)
  | T :: t -> filter_AC t
  | G :: t -> filter_AC t
  | C :: t -> C :: (filter_AC t)
[Transducers: map_base is a single state with self-loops A/T, T/A, G/C, C/G; filter_AC is a single state with self-loops A/A, C/C, T/ε, G/ε]
11
FINITE AUTOMATA
[Diagram: automaton with transitions labeled a and b]
Input: accepted?
abab: Yes
aba: No
bb: Yes
a: No
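A minimal sketch of how such an automaton is run. The slide's diagram did not survive extraction, so the two-state machine below is an assumption: it accepts exactly the even-length strings over {a, b}, which is consistent with the four examples above. The state names `even` and `odd` are hypothetical.

```python
# Assumed automaton (not taken from the slide's diagram): accepts strings
# over {a, b} of even length, which matches the four examples on the slide.
DELTA = {
    ("even", "a"): "odd",
    ("even", "b"): "odd",
    ("odd", "a"): "even",
    ("odd", "b"): "even",
}
START = "even"
ACCEPTING = {"even"}


def accepts(word: str) -> bool:
    """Run the automaton on `word`; accept if it stops in an accepting state."""
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING


for w in ["abab", "aba", "bb", "a"]:
    print(w, "Yes" if accepts(w) else "No")
```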
12
FINITE STATE TRANSDUCERS
[Diagram: transducer with transitions a/aa and b/bb and final output zz]
Input: output
ab: aabbzz
b: bbzz
aba: UNDEFINED
a: UNDEFINED
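A sketch of a transducer that reproduces the four input/output pairs above. The exact state layout is an assumption, since the diagram was lost: each symbol is doubled, and the run is defined only when it ends in state `q1` (last symbol was b), where the final output `zz` is appended. Treating the empty input as undefined is also an assumption.

```python
from typing import Optional

# (state, input symbol) -> (next state, output string); assumed layout.
DELTA = {
    ("q0", "a"): ("q0", "aa"),
    ("q0", "b"): ("q1", "bb"),
    ("q1", "a"): ("q0", "aa"),
    ("q1", "b"): ("q1", "bb"),
}
# Final states and the string each one emits at the end of the input.
FINAL_OUTPUT = {"q1": "zz"}


def transduce(word: str) -> Optional[str]:
    """Return the output for `word`, or None when the transducer is undefined."""
    state, pieces = "q0", []
    for symbol in word:
        state, piece = DELTA[(state, symbol)]
        pieces.append(piece)
    if state not in FINAL_OUTPUT:
        return None
    return "".join(pieces) + FINAL_OUTPUT[state]


for w in ["ab", "b", "aba", "a"]:
    print(w, "->", transduce(w) or "UNDEFINED")
```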
13
BENEFITS OF AUTOMATA AND
TRANSDUCERS
Closure and decidability for automata:
• Intersection, union, complement
• Decidable emptiness
• Decidable equivalence
• Can be minimized
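To make the closure and emptiness bullets concrete, here is a small sketch of the product construction for intersection followed by a breadth-first emptiness check. The tuple encoding of automata and the helper names `intersect` and `shortest_witness` are assumptions made for illustration; `AT_LEAST_ONE_T` is a hypothetical extra automaton.

```python
from collections import deque

# An automaton is (start, accepting states, partial transition dict).
# A missing transition means the word is rejected.
ALL_TG = ("q0", {"q0"}, {("q0", "T"): "q0", ("q0", "G"): "q0"})
ALL_AC = ("q0", {"q0"}, {("q0", "A"): "q0", ("q0", "C"): "q0"})
AT_LEAST_ONE_T = ("s", {"t"}, {("s", "T"): "t", ("t", "T"): "t"})


def intersect(d1, d2):
    """Product automaton accepting the words accepted by both d1 and d2."""
    (s1, f1, t1), (s2, f2, t2) = d1, d2
    trans = {}
    for (p, a), p_next in t1.items():
        for (q, b), q_next in t2.items():
            if a == b:
                trans[((p, q), a)] = (p_next, q_next)
    accepting = {(p, q) for p in f1 for q in f2}
    return ((s1, s2), accepting, trans)


def shortest_witness(dfa):
    """Return a shortest accepted word, or None if the language is empty."""
    start, accepting, trans = dfa
    seen, queue = {start}, deque([(start, "")])
    while queue:
        state, word = queue.popleft()
        if state in accepting:
            return word
        for (p, a), q in trans.items():
            if p == state and q not in seen:
                seen.add(q)
                queue.append((q, word + a))
    return None


print(shortest_witness(intersect(ALL_TG, AT_LEAST_ONE_T)))
print(shortest_witness(intersect(ALL_AC, AT_LEAST_ONE_T)))
```

The first query finds a word, the second returns `None`, meaning no word of all_AC contains a T.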
14
BENEFITS OF AUTOMATA AND
TRANSDUCERS
Transducer composition
let m_f_DNA l : base list = filter_AC (map_base l)
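[Diagram: map_base is a single state q0 with self-loops A/T, T/A, G/C, C/G; filter_AC is a single state q0 with self-loops A/A, C/C, T/ε, G/ε; their composition m_f_DNA is a single state q0 with self-loops A/ε, T/A, G/C, C/ε]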
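The composition on this slide can be sketched for the special case of single-state transducers, where a machine is just a table from input symbol to output string. The dictionary encoding and the helper names `run` and `compose` are assumptions; general transducer composition also has to pair up states.

```python
# Single-state transducers as tables: input symbol -> output string.
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def run(transducer, word):
    """Apply a single-state transducer to a word."""
    return "".join(transducer[symbol] for symbol in word)


def compose(first, second):
    """Table for 'second after first': feed each output of first through second."""
    return {symbol: run(second, out) for symbol, out in first.items()}


M_F_DNA = compose(MAP_BASE, FILTER_AC)
print(M_F_DNA)
print(run(M_F_DNA, "ATGC"), run(FILTER_AC, run(MAP_BASE, "ATGC")))
```

The composed table reads the input once and never builds the intermediate list.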
15
BENEFITS OF AUTOMATA AND
TRANSDUCERS
Type-checking
input in
all_TG
map_base
output in
all_AC
map_base o (¬ all_AC)
map_base only defined if output in (¬ all_AC)
16
BENEFITS OF AUTOMATA AND
TRANSDUCERS
Type-checking
input in
all_TG
map_base
output in
all_AC
dom(map_base o (¬ all_AC))
Inputs for which map_base does not output in all_AC
17
BENEFITS OF AUTOMATA AND
TRANSDUCERS
Type-checking
input in
all_TG
map_base
output in
all_AC
dom(map_base o (¬ all_AC)) ∩ all_TG = ∅
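The slides decide this check with automata operations, for inputs of every length. As a rough illustration only, the sketch below swaps in a bounded exhaustive search: it looks for an input in all_TG whose image under map_base is not in all_AC, up to an assumed length bound `max_len`. Finding none within the bound is evidence, not a proof.

```python
from itertools import product

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}


def all_TG(word):
    return all(c in "TG" for c in word)


def all_AC(word):
    return all(c in "AC" for c in word)


def map_base(word):
    return "".join(MAP_BASE[c] for c in word)


def find_type_error(max_len):
    """Search words up to max_len for an all_TG input mapped outside all_AC."""
    for n in range(max_len + 1):
        for letters in product("ATCG", repeat=n):
            word = "".join(letters)
            if all_TG(word) and not all_AC(map_base(word)):
                return word
    return None


print(find_type_error(6))
```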
18
BENEFITS OF AUTOMATA AND
TRANSDUCERS
Transducer equivalence
let m_f_DNA l : base list = filter_AC (map_base l)
let f_m_DNA l : base list = map_base (filter_AC l)
Is m_f_DNA equivalent to f_m_DNA ?
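For these two programs the answer is no, and a shortest counterexample is easy to find by brute force. The sketch below is a bounded search written for illustration, not the transducer equivalence algorithm the slides refer to; the helper `find_counterexample` and its length bound are assumptions.

```python
from itertools import product

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}


def map_base(word):
    return "".join(MAP_BASE[c] for c in word)


def filter_AC(word):
    return "".join(c for c in word if c in "AC")


def m_f_DNA(word):
    return filter_AC(map_base(word))


def f_m_DNA(word):
    return map_base(filter_AC(word))


def find_counterexample(max_len):
    """Return a shortest word on which the two programs disagree, if any."""
    for n in range(max_len + 1):
        for letters in product("ATCG", repeat=n):
            word = "".join(letters)
            if m_f_DNA(word) != f_m_DNA(word):
                return word
    return None


w = find_counterexample(3)
print(repr(w), repr(m_f_DNA(w)), repr(f_m_DNA(w)))
```

On the single base A, m_f_DNA maps A to T and then filters it out, while f_m_DNA keeps A and then maps it to T.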
19
FOR EACH DOMAIN SPECIFIC TASK
Design a language that
• has only the features required by the task
• is simple to use
• enables automated reasoning about what the programs do
• compiles into efficient code
20
OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?
21
[USENIX11, POPL12]
BEK
analysis of string sanitizers
P. Hooimeijer
B. Livshits
D. Molnar
P. Saxena
M. Veanes
23
<img src='some untrusted input'/>
24
<img src='some untrusted input'/>
QUESTION:
What could possibly go wrong?
25
<img src='some untrusted input'/>
Attacker:
gollum.png' onload='javascript:...
26
<img src='some untrusted input'/>
Attacker:
gollum.png' onload='javascript:...
Result:
<img src='gollum.png' onload='javascript:…
27
<img src='some untrusted input'/>
Attacker:
im.png' onload='javascript:...
I found my PRECIOUSSS.
Result:
<img src='im.png' onload='javascri
28
29
FIRST LINE OF DEFENSE: SANITIZERS
• Sanitizer: a string transformation function.
Untrusted data
Sanitized data
“im.png' …”
Dec 8, 2011
“img.png' …”
PLDI'12 submission presentations
30
COMPARING SANITIZERS
31
' single quote
&#39; html entity
32
some untrusted input
33
some untrusted input
Library A
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
34
some untrusted input
Library A
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
Library B
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
35
Library A
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
Library B
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
Library A: ' becomes &#39; ✔
Library B: ' stays ' ✘
36
MS AntiXSS
private static string HtmlEncode(string input, bool useNamedEntities,
MethodSpecificEncoder encoderTweak)
{
if (string.IsNullOrEmpty(input))
{
return input;
}
if (characterValues == null)
{
InitialiseSafeList();
}
if (useNamedEntities && namedEntities == null)
{
InitialiseNamedEntityList();
}
// Setup a new character array for output.
char[] inputAsArray = input.ToCharArray();
int outputLength = 0;
int inputLength = inputAsArray.Length;
char[] encodedInput = new char[inputLength * 10];
SyncLock.EnterReadLock();
try
{
for (int i = 0; i < inputLength; i++)
{
char currentCharacter = inputAsArray[i];
int currentCodePoint = inputAsArray[i];
char[] tweekedValue;
// Check for invalid values
if (currentCodePoint == 0xFFFE ||
currentCodePoint == 0xFFFF)
{
throw new InvalidUnicodeValueException(currentCodePoint);
}
else if (char.IsHighSurrogate(currentCharacter))
{
if (i + 1 == inputLength)
{
throw new InvalidSurrogatePairException(currentCharacter, '\0');
}
// Now peak ahead and check if the following character is a low surrogate.
char nextCharacter = inputAsArray[i + 1];
char nextCodePoint = inputAsArray[i + 1];
.NET WebUtility
public static string HtmlEncode(string s)
{
if (s == null)
return null;
int num = IndexOfHtmlEncodingChars(s, 0);
if (num == -1)
return s;
StringBuilder builder=new StringBuilder(s.Length+5);
int length = s.Length;
int startIndex = 0;
Label_002A:
if (num > startIndex) {
builder.Append(s, startIndex, num-startIndex);
}
char ch = s[num];
if (ch > '>') {
builder.Append("&#");
builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
builder.Append(';');
}
else {
char ch2 = ch;
if (ch2 != '"') {
switch (ch2)
{
case '<':
builder.Append("&lt;");
goto Label_00D5;
case '=':
goto Label_00D5;
case '>':
builder.Append("&gt;");
goto Label_00D5;
case '&':
builder.Append("&amp;");
goto Label_00D5;
}
}
else {
builder.Append("&quot;");
}
}
Label_00D5:
startIndex = num + 1;
if (startIndex < length) {
num = IndexOfHtmlEncodingChars(s, startIndex);
if (num != -1) {
goto Label_002A;
38
PHP Trunk Changes to html.c, 1999--2011
39
PHP Trunk Changes to html.c, 1999—2011
R7,841, April 1999: 135 loc
R309,482, March 2011: 1693 loc
40
PHP Trunk Changes to html.c, 1999-2011
R7,841, April 1999: 135 loc
R32,564, September 2000: ENT_QUOTES introduced
R309,482, March 2011: 1693 loc
41
PHP Trunk Changes to html.c, 1999-2011
R7,841, April 1999: 135 loc
R32,564, September 2000: ENT_QUOTES introduced
R242,949, September 2007: $double_encode=true
R309,482, March 2011: 1693 loc
42
PHP Trunk Changes to html.c, 1999—2011
43
MOTIVATION
• Writing string sanitizers correctly is
difficult
• There is no cheap way to identify
problems with sanitizers
• ‘Correctness’ is a moving target
• What if we could say more about
sanitizer behavior?
44
CONTRIBUTIONS
BEK
• Frontend: a small language for string manipulation; similar to how sanitizers are written today
• Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
45
CONTRIBUTIONS
BEK
• Frontend: a small language for string manipulation; similar to how sanitizers are written today
• Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
• Converted sanitizers from a variety of sources
• Checked properties like reversibility, idempotence, equivalence, and commutativity
46
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Bek Program
47
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Bek Program
Transformation
Symbolic Finite
Transducers
Microsoft.Automata
Z3
48
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Bek Program
Transformation
Analysis
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
49
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bek Program
Code
Gen
Analysis
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
51
A BEK PROGRAM: ESCAPE QUOTES
escape := iter(c in s)[b := false;] {
case (!b && c in "['\"]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
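To make the semantics of the Bek program easier to follow, here is a line-by-line Python transliteration. It is an illustration only, not output of the Bek code generator; it assumes the character class `"['\"]"` means the two characters `'` and `"`.

```python
def escape(s: str) -> str:
    """Backslash-escape quotes unless they are already escaped."""
    b = False  # True when the previous character was an unescaped backslash
    out = []
    for c in s:
        if not b and c in "'\"":
            # Unescaped quote: emit a backslash in front of it.
            b = False
            out.append("\\" + c)
        elif c == "\\":
            # Backslash: toggle the 'escaping' flag and copy it through.
            b = not b
            out.append(c)
        else:
            # Any other character, or a quote that is already escaped.
            b = False
            out.append(c)
    return "".join(out)


print(escape("a'b"))
print(escape("a\\'b"))
```

On these inputs, applying `escape` twice gives the same result as applying it once, which is the kind of idempotence property the analysis checks for all inputs.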
52
A BEK PROGRAM: ESCAPE QUOTES
iterate over the characters in string s
escape := iter(c in s)[b := false;] {
case (!b && c in "['\"]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
53
A BEK PROGRAM: ESCAPE QUOTES
iterate over the characters in string s while updating one boolean variable b
escape := iter(c in s)[b := false;] {
case (!b && c in "['\"]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Simple dedicated syntax
54
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bek Program
Code
Gen
Analysis
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
55
FINITE STATE TRANSDUCERS
a/A
b/B
…
z/Z
…
&/&
Problem: alphabet has 2^16 characters
TOO MANY TRANSITIONS
56
SYMBOLIC FINITE TRANSDUCERS
x in [a-z] / x-32
x not in [a-z] / x
Only two transitions!!
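A sketch of the two-transition symbolic transducer above: each transition pairs a predicate over character codes with a function that computes the output. Representing guards and outputs as Python lambdas is an assumption made for illustration; the real tool represents them as formulas that a solver can reason about.

```python
# Each transition is (guard predicate, output function) over character codes.
TRANSITIONS = [
    (lambda x: ord("a") <= x <= ord("z"), lambda x: [x - 32]),     # x in [a-z] / x-32
    (lambda x: not (ord("a") <= x <= ord("z")), lambda x: [x]),    # x not in [a-z] / x
]


def run_sft(text: str) -> str:
    """Apply the single-state symbolic transducer to every character."""
    out = []
    for ch in text:
        x = ord(ch)
        for guard, output in TRANSITIONS:
            if guard(x):
                out.extend(output(x))
                break
    return "".join(chr(code) for code in out)


print(run_sft("bek & z3!"))
```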
57
SYMBOLIC FINITE TRANSDUCERS
Transitions: predicate / sequence of functions
true / 5
true / x-4
x>5 / x+1, x
x%2=1 / x-1, x, x+4
Alphabet theory has to
be DECIDABLE
We’ll use Z3 to check
predicate satisfiability
58
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bek Program
Code
Gen
Analysis
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
59
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bek Program
Code
Gen
Analysis
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Counterexample
“\' vs. \\'”
Now what?
Code
Gen
C#
Does it do the
right thing?
JavaScript
C
60
EQUIVALENCE CHECKING
IS DECIDABLE!
SFT Algorithms
Alphabet theory has to
be DECIDABLE
We’ll use Z3 to check
predicate satisfiability
61
EQUIVALENCE CHECKING
SFT Algorithms
AntiXSS.HtmlEncode
=
WebUtility.HtmlEncode
62
CLOSED UNDER COMPOSITION
SFT A o B
in
SFT A
out
in
SFT B
out
63
COMPOSITION
SFT Algorithms
SFT A o B
in
SFT A
out
in
SFT B
out
JavaScriptEncode(HtmlEncode(w))
=
HtmlEncode(JavaScriptEncode(w))
64
PRE-IMAGE COMPUTATION
Regular
Language
I
in
SFT A
out
Regular
Language
O
65
PRE-IMAGE COMPUTATION
MALICIOUS
INPUTS
Vulnerability
signature
in
SFT A
out
66
CONTRIBUTIONS
BEK
• Frontend: a small language for string manipulation; similar to how sanitizers are written today
• Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
• Converted sanitizers from a variety of sources
• Checked properties like reversibility, idempotence, equivalence, and commutativity
67
QUESTIONS?
• Can BEK model existing
sanitizers?
• Can we use it to check interesting
properties on real sanitizers?
68
WHAT FEATURES ARE NEEDED?
Language Features
Data:
• 1x OWASP HTMLencode
• 13x Google AutoEscape
• 21x IE 8 XSS Filter
• 7x Synthetic
Data → inspect → feature counts
69
WHAT FEATURES ARE NEEDED?
Language Features
• Majority (76%) of sanitizers can be
ported without extending the language
• With multi-character lookahead: 90%
70
CAN WE CHECK INTERESTING
PROPERTIES ON REAL SANITIZERS?
Data
• 4x MS internal
HtmlEncode
• 3x ‘for hire’ HtmlEncode based on English-language specification (C#)
Commutative?
Equivalent?
71
CAN WE CHECK INTERESTING
PROPERTIES ON REAL SANITIZERS?
Short answer: Yes!
72
CAN WE CHECK INTERESTING
PROPERTIES ON REAL SANITIZERS?
• Short answer: Yes!
• EQ results take less than a minute to obtain:
     1  2  3  4  5  6  7
1    ✔  ✔  ✔  ✘  ✘  ✔  ✘
2       ✔  ✔  ✘  ✘  ✔  ✘
3          ✔  ✘  ✘  ✔  ✘
4             ✔  ✘  ✘  ✘
5                ✔  ✘  ✘
6                   ✔  ✘
7                      ✔
73
DOES IT SCALE?
Commutativity
Self-Equivalence
74
BEK IN A NUTSHELL
Conclusion
• BEK is a domain-specific language for
writing string sanitizers
• BEK can model programs without
approximation using symbolic finite
transducers, enabling e.g., equivalence
checks
• BEK was evaluated using real-world
sanitizers from a variety of different sources
76
OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?
77
[VMCAI13, CAV13]
BEX
ANALYSIS OF STRING ENCODERS
Loris D’Antoni
Margus Veanes
Encoder
Hi, I’m plain text!
Nice to meet you!
SGkgSSdtIHBsYWluI
HRleHQsIG5pY2Ugd
G8gbWVldCB5b3Uh
Decoder
79
NOT SO EASY TO GET RIGHT
80
WHEN ARE THEY CORRECT?
Encoder
T
Decoder
T’
Decoder
T’
T
Encoder
T
T’
81
CAN WE USE TRANSDUCERS?
Encoder
T
Decoder
T’
T
Encoder o Decoder = Identity
82
BEK: WHAT FEATURES WERE NEEDED?
Language Features
• Majority (76%) of sanitizers can be
ported without extending Bek
• With multi-character lookahead:
90%
83
BASE64 encoder
Text content:   M, a, n
Bytes:          77, 97, 110
Bit pattern:    010011 010110 000101 101110
Index:          19, 22, 5, 46
Base64 encoded: T, W, F, u
3 bytes → 4 Base64 characters
84
HOW DO WE EXTEND BEK?
85
BEK ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bek Program
Code
Gen
Analysis
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
Symbolic finite transducers don’t have registers
86
TRANSDUCERS WITH REGISTERS
State 0 → 1: x / [ x>>2 ], r := (x&3)<<4
State 1 → 2: x / [ r|(x>>4) ], r := (x&0xF)<<2
State 2 → 0: x / [ r|(x>>6), x&0x3F ], r := 0
• Transducers with registers are closed under composition
• Equivalent to Turing Machines ☹
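A sketch of the three-state register transducer as a Base64 index encoder. The assignment of the three transitions to the cycle 0 → 1 → 2 → 0 is inferred from how Base64 splits bytes into 6-bit groups, so treat it as an assumption; padding for inputs whose length is not a multiple of 3 is not handled. The names `step`, `encode_indices` and `ALPHABET` are hypothetical.

```python
import base64
import string

# Standard Base64 alphabet, used to turn 6-bit indices into characters.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def step(state, r, x):
    """One transition: returns (next state, new register value, emitted indices)."""
    if state == 0:
        return 1, (x & 3) << 4, [x >> 2]
    if state == 1:
        return 2, (x & 0xF) << 2, [r | (x >> 4)]
    return 0, 0, [r | (x >> 6), x & 0x3F]


def encode_indices(data: bytes):
    """Run the register transducer over the bytes, one byte per step."""
    state, r, out = 0, 0, []
    for x in data:
        state, r, emitted = step(state, r, x)
        out.extend(emitted)
    return out


indices = encode_indices(b"Man")
print(indices, "".join(ALPHABET[i] for i in indices))
print(base64.b64encode(b"Man").decode())
```

For `Man` the indices are 19, 22, 5, 46, which spell `TWFu`, the same string the standard library produces.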
87
BASE64 IN BEX
DEMO
89
90
BEX ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bex Program
Code
Gen
Analysis
?
Microsoft.Automata
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
92
EXTENDED SYMBOLIC FINITE
TRANSDUCERS
x1≤FF ∧ x2≤FF ∧ x3≤FF /
[ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
[Diagram: a single transition from state p to state q reads 3 input symbols x1, x2, x3, here M, a, n]
93
EXTENDED SYMBOLIC FINITE
TRANSDUCERS
x1≤FF ∧ x2≤FF ∧ x3≤FF /
[ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
[Diagram: the transition from state p to state q reads 3 input symbols x1, x2, x3, here M, a, n, and outputs T, W, F, u]
94
MORE EXPRESSIVE THAN SYMBOLIC
FINITE TRANSDUCERS
x1>x2 / [x1+x2]
0
1
Do they still have
nice properties?
95
WHAT DO WE NEED?
Encoder
T
Decoder
T’
T
Encoder o Decoder = Identity
Composition
Equivalence
96
NEGATIVE RESULTS
• ESFAs:
– equivalence is undecidable
– are not closed under intersection
– are not closed under complement
• ESFTs
– equivalence is undecidable
– are not closed under composition
97
A FRIENDLIER RESTRICTION
98
CARTESIAN EXTENDED SYMBOLIC
FINITE TRANSDUCERS
Negative results use binary predicates and
encoders do not use this feature
p
x1<x2+1
q
Only allow conjunctions of unary predicates
p
x1>5 ∧ x2=1 / [x1+x2, x1]
q
99
CARTESIAN ESFA = SFA
Cartesian ESFAs are now equivalent to SFAs
x1>5 ∧ x2=1
0
0
1
x>5
x=1
0,1
1
100
STILL MORE EXPRESSIVE THAN SFTS
Cartesian ESFTs are strictly more expressive than SFTs!!
0
x1>5 ∧ x2=1 /
[x1+x2]
1
?
101
WHAT DO WE NEED?
Encoder
T
Decoder
T’
T
Encoder o Decoder = Identity
Composition
Equivalence
102
RESULTS
• Cartesian ESFTs
– equivalence is decidable
– are not closed under composition
103
COMPOSITION IN PRACTICE
104
BEK WITH REGISTERS?
105
TRANSDUCERS WITH REGISTERS
State 0 → 1: x / [ x>>2 ], r := (x&3)<<4
State 1 → 2: x / [ r|(x>>4) ], r := (x&0xF)<<2
State 2 → 0: x / [ r|(x>>6), x&0x3F ], r := 0
• Transducers with registers are closed under composition
• Equivalent to Turing Machines ☹
106
COMPOSING CARTESIAN ESFTS
Cartesian ESFTs
A
B
A’
B’
Transducers with
registers
A’ o B’
Cartesian ESFT
?
AoB
107
REGISTER ELIMINATION
Transducer with registers:
State 0 → 1: x / [ x+4 ], r := (x-2)
State 1 → 2: x / [ r+x, x+1 ], r := 0
ESFT:
State 0 → 2: [x1,x2] / [ x1+4, x1-2+x2, x2+1 ], r := 0
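The register elimination step can be checked on concrete values. In this sketch the two-step register machine and the single lookahead transition are written as plain Python functions (the names `with_register` and `with_lookahead` are hypothetical), and their outputs are compared on a range of integer inputs.

```python
def with_register(x1, x2):
    """Two transitions that communicate through the register r."""
    out = []
    out.append(x1 + 4)          # state 0 -> 1: emit x+4
    r = x1 - 2                  #               r := x-2
    out.extend([r + x2, x2 + 1])  # state 1 -> 2: emit r+x, x+1
    r = 0                       #               r := 0
    return out


def with_lookahead(x1, x2):
    """One transition reading two symbols, with r substituted away."""
    return [x1 + 4, x1 - 2 + x2, x2 + 1]


print(with_register(3, 10), with_lookahead(3, 10))
```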
108
DOES IT WORK?
109
UNICODE
• UTF8 to UTF16 encoder (E) and decoder (D)
Test              | Running Time
Dom(E) = UTF16    | 47 ms
Dom(EoD) = UTF16  | 109 ms
Dom(D) = UTF8     | 156 ms
Dom(DoE) = UTF8   | 320 ms
EoD=Identity      | 16 ms
DoE=Identity      | 24 ms
Complete analysis in about a second
110
BASE64
• Base64 encoder (E) and decoder (D)
Test               | Running Time
Dom(E) = bytes     | 13 ms
Dom(EoD) = bytes   | 55 ms
Dom(D) = 6bits+    | 76 ms
Dom(DoE) = 6bits+  | 56 ms
EoD=Identity       | 53 ms
DoE=Identity       | 19 ms
111
BEX ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Bex Program
Code
Gen
Cartesian Extended
Symbolic Finite
Transducers
Microsoft.Automata
Z3
Analysis
Does it do the
right thing?
EoD=I
Code
Gen
C#
JavaScript
C
112
BEX IN A NUTSHELL
Conclusion
• BEX is a domain-specific language for
writing string encoders
• BEX can model programs without
approximation using Cartesian extended
symbolic finite transducers
• BEX was evaluated using real-world string
encoders
113
OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?
114
[PLDI14]
FAST
ANALYSIS OF PROGRAMS
MANIPULATING TREES
Loris D’Antoni
Margus Veanes
Ben Livshits
David Molnar
116
SOLUTION: USE AN HTML SANITIZER
Remove malicious active code from HTML documents
Input:
<body>
<script>
alert("This is Sparta!");
</script>
<div>
<p>
I swear this HTML
is safe!
</p>
</div>
</body>
SANITIZE
Output:
<body>
<div>
<p>
I swear this HTML
is safe!
</p>
</div>
</body>
117
TYPICAL TRANSFORMATIONS
Typical transformations
Interesting questions
• Remove scripts
• Remove malicious URLs
• Replace deprecated tags
Given a sanitizer S:
• Does S always produce a
safe and well-formed
output?
• Is S defined on every
possible HTML file?
• Does executing S twice
produce the same output
as executing S once?
• Can we execute S fast?
118
HOW DO WE WRITE ONE?
DEMO:
http://rise4fun.com/Fast/2
1
119
120
121
122
123
124
KEY IDEA: HTML CODE IS A TREE
[Diagram: the input tree body(script(malicious code), div(p(I swear this HTML is safe!))) is mapped by SANITIZE to the output tree body(div(p(I swear this HTML is safe!)))]
125
MOTIVATION
Trees are common input/output data structures
– XML query, type-checking, etc…
– Compilers/optimizers (from parse tree to parse tree)
– Tree manipulating programs: data structures
algorithms, ontologies, etc…
126
FAST ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Fast
Program
Analysis
?
Microsoft.Automata
Code
Gen
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
127
CHOOSING THE RIGHT FORMALISM
128
SEMANTICS AS TRANSDUCERS
Goal:
find a class of tree transducers
that can express the previous examples
and is closed under composition
129
TOP DOWN TREE TRANSDUCERS
[ENGELFRIET75]
q(a(x1,x2)) → b(c,q1(x1))
[Diagram: state q at a node a with children x1, x2 is rewritten to a node b with children c and q1 applied to x1]
Decidable properties: type-checking, etc…
Domain expressiveness: only finite alphabets
130
SYMBOLIC TREE TRANSDUCERS
[PSI11]
q(λa.a>3,(x1,x2)) → λa.a+1,(λa.a-2,q1(x1))
[Diagram: state q at a node labeled 5 with children x1, x2, such that 5>3 is true, is rewritten to a node labeled 5+1 with children 5-2 and q1 applied to x1]
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets using predicates and functions
Structural expressiveness: can’t delete a node without reading it first
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
131
IMPROVING STRUCTURAL
EXPRESSIVENESS
Transformation: delete the left child if it contains a script
div
q div
??
q
If we delete the node we can’t check that the left child
contained a script
Regular Look-Ahead (RLA)
132
REGULAR LOOK AHEAD
Transformation: delete the left child if it contains a script
q div
p1
div
p2
q
Transformation
now is safe
Rules can ask whether the children are in particular languages
– p1: the language of trees that contain a script node
– p2: the language of all trees
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets
Structural expressiveness: good enough to express our examples
133
Model                                                          | Decidability | Complexity | Structural Expressiveness | Infinite alphabets
Top Down Tree Transducers [Engelfriet75]                       | V            | V          | X                         | X
Top Down Tree Transducers with Regular Look-ahead [Engelfriet76] | V          | V          | ~                         | X
Streaming Tree Transducers [AlurDantoni12]                     | V            | X          | V                         | X
Data Automata [Bojanczyk98]                                    | ~            | X          | X                         | V
Symbolic Tree Transducers [VeanesBjoerner11]                   | V            | V          | X                         | V
Symbolic Tree Transducers RLA                                  | V            | V          | ~                         | V
134
COMPOSITION OF STTR
T1
T1 o T2
T2
This is not always possible!!
Find the biggest class for which it is possible
135
WHEN CAN WE COMPOSE?
Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if
– T1 is deterministic
All our examples fall in this category
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
136
FAST ARCHITECTURE
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
Fast
Program
Analysis
Symbolic Tree
Transducers with RLA
Microsoft.Automata
Code
Gen
Z3
Does it do the
right thing?
Counterexample
“\' vs. \\'”
Code
Gen
C#
JavaScript
C
137
CASE STUDIES AND EXPERIMENTS
138
CASE STUDIES AND EXPERIMENTS
Program Optimization:
Deforestation of functional programs
Verification:
HTML sanitization
Analysis of functional programs
Augmented reality app store
Infinite
Alphabets:
Integer
Data types
139
DEFORESTATION
Removing intermediate data structures from programs
alphabet IList [i : int] { nil(0), cons(1) }
trans mapC: IList -> IList {
nil()
to nil [0]
| cons(x)
to cons [(i+5)%26] (mapC x)
}
def mapC2: IList -> IList := compose mapC mapC
ADVANTAGE: the composed program is a single transducer that reads the input list only once, thanks to transducer composition
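A rough Python analogue of what the composition buys. Here `map_c` mirrors mapC on plain lists, and `map_c2_fused` is a hand-written single-pass version of the kind `compose mapC mapC` would derive automatically; both function names are hypothetical, and Fast itself works on its own tree representation rather than Python lists.

```python
def map_c(xs):
    """One pass over the list, applying (i+5) % 26 to each element."""
    return [(i + 5) % 26 for i in xs]


def map_c2_fused(xs):
    """Single pass that applies the per-element function twice, with no intermediate list."""
    return [((i + 5) % 26 + 5) % 26 for i in xs]


data = list(range(-30, 60))
print(map_c(map_c(data)) == map_c2_fused(data))
```

The unfused call builds and traverses an intermediate list; the fused version visits each element once.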
140
STAGES BY EXAMPLE
Transducers
mapC
mapC2
141
DEFORESTATION: SPEEDUP
[Plot: running time in milliseconds against the number of composed map functions; series: Fast and No Fast; annotations: f(f(f(…f(x)...) and (f;f;f;…;f)(x); labeled values: 4,686 and 1,313]
142
ANALYSIS OF FUNCTIONAL
PROGRAMS
143
AR INTERFERENCE ANALYSIS
Recognizers output data that can be seen as a
tree structure
Spine
Neck
Hip
….
Knee Head
….
Ankle
Foot
144
APPS AS TREE TRANSFORMATIONS
Applications that use recognizers can be
modeled as FAST programs
trans addHat: STree -> STree
Spine(x,y) to Spine(addHat(x), y)
| Neck(h,l,r) to Neck(addHat(h), l, r)
| Head(a) to Head(Hat(a))
145
COMPOSITION OF PROGRAMS
Two FAST programs can be composed into a
single FAST program
p1
p1;p2
p2
146
ANOTHER RECOGNIZER
Room
Wall
Floor
….
Chair
Table
….
….
Chair
147
INTERFERENCE ANALYSIS
Apps can be malicious: try to overwrite outputs of other apps
Apps interfere when they annotate the same node of a
recognizer’s output
Interfering apps
Add cat ears
Add hat
Add pin to a city
Blur a city
Amazon Buy Now button
Malicious Buy Now button
We can compose them and check if they interfere statically!!
– Put checker in the AppStore and analyze Apps before approval
148
INTERFERENCE ANALYSIS IN PRACTICE
100 generated FAST programs, up to 85 functions each
Check statically if they conflict pairwise for ANY possible input
Checked 99% of program pairs in less than 0.5 sec!
For an App store these running times are perfectly fine
TWO PENDING PATENTS
150
FAST IN A NUTSHELL
Conclusion
• FAST is a domain-specific language for
writing tree manipulating programs
• FAST can model programs without
approximation using Symbolic tree
transducers with regular lookahead
• FAST was evaluated using real-world
programs
151
OUTLINE
• Automata, transducers, and programs
• BEK and string sanitizers
• BEX and string encoders
• FAST and tree manipulating programs
• What’s next?
152
WHAT’S NEXT
153
FOR EACH DOMAIN SPECIFIC TASK
Design a language that
• has only the features required by the task
• is simple to use
• enables automated reasoning about what the programs do
• compiles into efficient code
154
Here at POPL15!
DREX
EFFICIENT STRING MANIPULATION
Rajeev Alur
Loris D’Antoni
Mukund Raghothaman
DECLARATIVE LANGUAGE FOR STRING
SCRIPTS (15/1, 2PM, SEC. 2B)
a
b
a
(a|b)*b
b
b/b
iterate(choice(a->a, b->b))
a/a
Execute this code in a linear-time left-to-right pass over the input string!!
156
Here at POPL 15!!
BEX 2.0
PARALLEL EXECUTION OF STRING
ENCODERS
Margus Veanes
Todd Mytkowicz
Ben Livshits
David Molnar
FROM TRANSDUCERS TO PARALLEL
EXECUTIONS (15/1, 2PM, SEC. 2B)
x / [ x+4 ],
r := (x-2)
0
x / [ r+x , x+1],
r := 0
1
2
Efficient data-parallel code
158
Here at POPL 15!!
PROGRAM BOOSTING OR
CROWD-SOURCING FOR
CORRECTNESS
Robert Cochran
Loris D’Antoni
Benjamin Livshits
David Molnar
Margus Veanes
CROWD-SOURCING PROGRAMS WITH
AUTOMATA (17/1, 4PM, SEC. 9B)
Specification
160
YOU CAN HELP TOO!
161
INTERESTING DIRECTIONS
• A transducer-based language for
– Web scrapers
– Spreadsheet transformations
– Compiler optimizations
– XML processing
– HTML rendering
162
SUMMARIZING…
163
OUR RECIPE FOR EACH TASK
s := iter(c in t)[b := false;] {
case (!b && c in "[\"\\]"):
b := false;
yield('\\', c);
case (c == '\\'):
b := !b;
yield(c);
case (true):
b := false;
yield(c);
};
Transformation
DSL
Transducer
Model
Microsoft.Automata
Code
Gen
Analysis
Z3
Does it do the
right thing?
Analysis
question
Code
Gen
C#
JavaScript
C
164
BEK
• Fast and precise sanitizer analysis with BEK
Hooimeijer, Livshits, Molnar, Saxena, Veanes, USENIX11
• Symbolic finite state transducers: algorithms and applications
Veanes, Hooimeijer, Livshits, Molnar, Bjorner, POPL12
BEX
• Static analysis of string encoders and decoders
D’Antoni, Veanes, VMCAI13
• Equivalence of extended symbolic finite transducers
D’Antoni, Veanes, CAV13
• Data parallel string manipulating programs
Veanes, Mytkowicz, Molnar, Livshits, POPL15
FAST
• Fast: a transducer based language for tree manipulation
D’Antoni, Veanes, Livshits, Molnar, PLDI14
165