CS 290C: Formal Models for Web Software
Lecture 17: Analyzing Input Validation and
Sanitization in Web Applications
Instructor: Tevfik Bultan
Vulnerabilities in Web Applications
• There are many well-known security vulnerabilities that
exist in many web applications. Here are some examples:
– Malicious file execution: where a malicious user
causes the server to execute malicious code
– SQL injection: where a malicious user executes SQL
commands on the back-end database by providing
specially formatted input
– Cross site scripting (XSS): where an attacker causes a
malicious script to execute in a user’s browser
• These vulnerabilities are typically due to
– errors in user input validation or
– lack of user input validation
String Related Vulnerabilities
[Figure: String related web application vulnerabilities as a percentage of all vulnerabilities (reported by CVE), 2001 to 2009. Series: File Inclusion, XSS, SQL Injection. Vertical axis: 0% to 50%.]
OWASP Top 10 in 2007:
1. Cross Site Scripting
2. Injection Flaws
OWASP Top 10 in 2010:
1. Injection Flaws
2. Cross Site Scripting
Why Is Input Validation Error-prone?
• Extensive string manipulation:
– Web applications use extensive string manipulation
• To construct HTML pages, to construct database
queries in SQL, etc.
– The user input comes in string form and must be
validated and sanitized before it can be used
• This requires the use of complex string manipulation
functions such as string-replace
– String manipulation is error prone
String Related Vulnerabilities


• String related web application vulnerabilities occur when:
– a sensitive function is passed a malicious string
input from the user
– this input contains an attack
– it is not properly sanitized before it reaches the
sensitive function
• String analysis: Discover these vulnerabilities
automatically
XSS Vulnerability

• A PHP example (the attacker supplies <script ... as the value of www):
1:<?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5:?>
• The echo statement in line 4 is a sensitive function
• It contains a Cross Site Scripting (XSS) vulnerability
Is It Vulnerable?

• A simple taint analysis can report this segment vulnerable
using taint propagation
1:<?php
2: $www = $_GET["www"];      // tainted
3: $l_otherinfo = "URL";
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5:?>
• The argument of echo is tainted → the script is reported vulnerable
How to Fix it?


• To fix the vulnerability we added a sanitization routine at
line s
• Taint analysis will assume that $www is untainted after line s and
report that the segment is NOT vulnerable
1:<?php
2: $www = $_GET["www"];      // tainted
3: $l_otherinfo = "URL";
s: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);      // untainted
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5:?>
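The taint propagation described above can be sketched as follows. This is a minimal illustration in Python, not the analysis used in the lecture: the statement encoding and the names SOURCES, SANITIZERS, SINKS, and tainted_sinks are assumptions made for this example. It shows why an analysis that trusts sanitizers reports the patched segment as safe without ever inspecting what the sanitizer does.

```python
# Hypothetical mini-IR: each statement is (target, operation, arguments).
SOURCES = {"$_GET"}             # reading from these yields tainted data
SANITIZERS = {"ereg_replace"}   # results of these are assumed untainted
SINKS = {"echo"}                # sensitive functions


def tainted_sinks(program):
    """Return the indices of sink statements that receive tainted data."""
    tainted = set()
    hits = []
    for index, (target, op, args) in enumerate(program):
        arg_tainted = any(a in tainted or a in SOURCES for a in args)
        if op in SINKS:
            if arg_tainted:
                hits.append(index)
        elif op in SANITIZERS:
            tainted.discard(target)   # the sanitizer is trusted blindly
        elif arg_tainted:
            tainted.add(target)
        else:
            tainted.discard(target)
    return hits


original = [
    ("$www", "assign", ["$_GET"]),
    ("$l_otherinfo", "assign", ['"URL"']),
    (None, "echo", ["$l_otherinfo", "$www"]),
]
patched = original[:2] + [
    ("$www", "ereg_replace", ["$www"]),
    (None, "echo", ["$l_otherinfo", "$www"]),
]

print(tainted_sinks(original))  # the echo statement is flagged
print(tainted_sinks(patched))   # nothing is flagged, sanitizer correct or not
```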
Is It Really Sanitized?
1:<?php
2: $www = $_GET["www"];      // input: <script …>
3: $l_otherinfo = "URL";
s: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);      // after sanitization: still <script …>
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5:?>
Sanitization Routines can be Erroneous

• The sanitization statement is not correct!
ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
– Removes all characters that are not in { A-Za-z0-9 .-@:/ }
– .-@ denotes all characters between “.” and “@”
(including “<” and “>”)
– “.-@” should be “.\-@”
• This example is from a buggy sanitization routine used in
MyEasyMarket-4.1 (line 218 in file trans.php)
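The effect of the unescaped hyphen can be reproduced in a few lines. The sketch below uses Python's re module as a stand-in for PHP's ereg_replace, which is an assumption: the two engines differ in other respects, but both read .-@ inside a bracket expression as a character range. The names BUGGY, FIXED, and sanitize are introduced only for this example.

```python
import re

BUGGY = r"[^A-Za-z0-9 .-@:/]"    # ".-@" is a range that includes "<" and ">"
FIXED = r"[^A-Za-z0-9 .\-@:/]"   # "\-" makes the hyphen a literal character


def sanitize(pattern, value):
    """Delete every character matched by the negated character class."""
    return re.sub(pattern, "", value)


attack = "<script>alert(1)</script>"
print(sanitize(BUGGY, attack))   # angle brackets survive
print(sanitize(FIXED, attack))   # angle brackets are removed
```

With the buggy class only the parentheses are deleted, so the script tag reaches the sink intact.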
String Analysis



• String analysis determines all possible values that a string
expression can take during any program execution
• Using string analysis we can identify all possible input
values of the sensitive functions
– Then we can check if inputs of sensitive functions can
contain attack strings
• How can we characterize attack strings?
– Use regular expressions to specify the attack patterns
• Attack pattern for XSS: Σ∗<scriptΣ∗
Vulnerabilities Can Be Tricky
• Input <!sc+rip!t ...> does not match the attack pattern
– but it matches the vulnerability signature and it can
cause an attack
1:<?php
2: $www = $_GET["www"];      // input: <!sc+rip!t …>
3: $l_otherinfo = "URL";
s: $www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);      // after sanitization: <script= …>
4: echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
5:?>
String Analysis


• If string analysis determines that the intersection of the
attack pattern and possible inputs of the sensitive function
is empty
– then we can conclude that the program is secure with respect to the attack pattern
• If the intersection is not empty, then we can again use
string analysis to generate a vulnerability signature
– the signature characterizes all malicious inputs
• Given Σ∗<scriptΣ∗ as an attack pattern:
– The vulnerability signature for $_GET["www"] is
Σ∗<α∗sα∗cα∗rα∗iα∗pα∗tΣ∗
where α ∉ { A-Za-z0-9 .-@:/ }
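To make the difference between the attack pattern and the vulnerability signature concrete, the following sketch encodes both as Python regular expressions. It is only an illustration under stated assumptions: Python's re stands in for the PHP regex engine, α is taken to be the set of characters the sanitizer deletes, and the payload string is made up for this example.

```python
import re

DELETED = r"[^A-Za-z0-9 .-@:/]"   # α: characters the sanitizer removes
ATTACK = re.compile(r".*<script.*", re.DOTALL)
# Σ*<α*sα*cα*rα*iα*pα*tΣ*, built letter by letter
SIGNATURE = re.compile(
    ".*<" + "".join(f"{DELETED}*{ch}" for ch in "script") + ".*",
    re.DOTALL,
)


def sanitize(value):
    """The buggy sanitization step from the running example."""
    return re.sub(DELETED, "", value)


payload = "<!sc+rip!t src=x>"     # hypothetical input
print(bool(ATTACK.fullmatch(payload)))             # not an attack string yet
print(bool(SIGNATURE.fullmatch(payload)))          # but it matches the signature
print(bool(ATTACK.fullmatch(sanitize(payload))))   # and becomes one at the sink
```

The sanitizer itself assembles the attack string by deleting the "!" and "+" characters, which is why the signature has to be stated over the raw input.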
Automata-based String Analysis
• Finite State Automata can be used to characterize sets of
string values
• We use automata-based string analysis
– Associate each string expression in the program with an
automaton
– The automaton accepts an over-approximation of all
possible values that the string expression can take
during program execution
• Using this automata representation, symbolically execute
the program, paying attention only to string manipulation
operations
Input Validation Verification Stages
[Figure: analysis pipeline. Application/Scripts → Parser/Taint Analysis → (Tainted) Dependency Graphs → Vulnerability Analysis (with Attack Patterns as a second input) → Reachable Attack Strings → Signature Generation → Vulnerability Signature → Patch Synthesis → Sanitization Statements]
Combining Forward & Backward Analyses




• Convert PHP programs to dependency graphs
• Combine symbolic forward and backward
reachability analyses
• Forward analysis
– Assume that the user input can be any string
– Propagate this information on the dependency graph
– When a sensitive function is reached, intersect with
attack pattern
• Backward analysis
– If the intersection is not empty, propagate the result
backwards to identify which inputs can cause an attack
Dependency Graphs
• Given a PHP program, first construct the dependency graph:
1:<?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: $www = ereg_replace(
"[^A-Za-z0-9 .-@://]","",$www
);
5: echo $l_otherinfo .
": " .$www;
6:?>
[Figure: dependency graph. Each node is labeled with an expression and its line number:
$_GET[www], 2 → $www, 2
"URL", 3 → $l_otherinfo, 3
[^A-Za-z0-9 .-@://], 4 and "", 4 and $www, 2 → preg_replace, 4 → $www, 4
$l_otherinfo, 3 and ": ", 5 → str_concat, 5
str_concat, 5 and $www, 4 → str_concat, 5 → echo, 5]
Forward Analysis
• Using the dependency graph we conduct vulnerability
analysis
• Automata-based forward symbolic analysis that identifies
the possible values of each node
– Each node in the dependency graph is associated with a
DFA
• The DFA accepts an over-approximation of the string
values that the string expression represented by that
node can take at runtime
• The DFAs for the input nodes accept Σ∗
– Intersecting the DFA for the sink nodes with the DFA for
the attack pattern identifies the vulnerabilities
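The final step, intersecting the sink DFA with the attack-pattern DFA, can be sketched with explicit automata. This is a simplified illustration rather than the lecture's implementation, which uses a symbolic representation: the printable ASCII range stands in for Σ, the attack pattern is Σ*<Σ*, and all function names are assumptions made for this example.

```python
from collections import deque

SIGMA = [chr(c) for c in range(32, 127)]   # printable ASCII as a stand-in for Σ


def allowed_buggy(ch):
    # Characters kept by [^A-Za-z0-9 .-@://]: note the range "." to "@".
    return ch.isalnum() or ch == " " or "." <= ch <= "@"


def allowed_fixed(ch):
    # Characters kept when the hyphen is literal: [^A-Za-z0-9 .\-@:/]
    return ch.isalnum() or ch in " .-@:/"


def sink_dfa(allowed):
    """DFA for 'URL: ' followed by any number of allowed characters."""
    prefix = "URL: "

    def step(state, ch):
        if state < len(prefix):
            return state + 1 if ch == prefix[state] else None
        return state if allowed(ch) else None

    return 0, {len(prefix)}, step


def attack_dfa():
    """DFA for Σ*<Σ*: state 1 is reached once a '<' has been read."""
    def step(state, ch):
        return 1 if state == 1 or ch == "<" else 0

    return 0, {1}, step


def intersection_witness(a, b):
    """Explore the product automaton breadth-first and return a shortest
    string accepted by both DFAs, or None if the intersection is empty."""
    (a0, a_acc, a_step), (b0, b_acc, b_step) = a, b
    seen = {(a0, b0)}
    queue = deque([((a0, b0), "")])
    while queue:
        (p, q), word = queue.popleft()
        if p in a_acc and q in b_acc:
            return word
        for ch in SIGMA:
            p2 = a_step(p, ch)
            if p2 is None:          # dead state in the first DFA
                continue
            nxt = (p2, b_step(q, ch))
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + ch))
    return None


print(intersection_witness(sink_dfa(allowed_buggy), attack_dfa()))
print(intersection_witness(sink_dfa(allowed_fixed), attack_dfa()))
```

A non-empty intersection yields a concrete attack string at the sink; with the corrected character class the product automaton has no accepting state.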
Forward Analysis
• Forward analysis uses post-image computations of string
operations:
– postConcat(M1, M2)
returns M, where M=M1.M2
– postReplace(M1, M2, M3)
returns M, where M=replace(M1, M2, M3)
Forward Analysis
Attack Pattern = Σ*<Σ*
[Figure: dependency graph annotated with the forward analysis result of each node]
– $_GET[www], 2: Forward = Σ*
– $www, 2: Forward = Σ*
– "URL", 3: Forward = URL
– $l_otherinfo, 3: Forward = URL
– "", 4: Forward = ε
– [^A-Za-z0-9 .-@://], 4: Forward = [^A-Za-z0-9 .-@/]
– preg_replace, 4: Forward = [A-Za-z0-9 .-@/]*
– $www, 4: Forward = [A-Za-z0-9 .-@/]*
– ": ", 5: Forward = :
– str_concat, 5 (first): Forward = URL:
– str_concat, 5 (second): Forward = URL: [A-Za-z0-9 .-@/]*
– echo, 5: Forward = URL: [A-Za-z0-9 .-@/]*
At the sink:
L(Σ*<Σ*) ∩ L(URL: [A-Za-z0-9 .-@/]*)
= L(URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*) ≠ Ø
Result Automaton
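[Figure: result automaton. After reading U, R, L, :, and a space, it loops on [A-Za-z0-9 .-;=-@/], moves on < to the accepting state, and loops there on [A-Za-z0-9 .-@/]]
URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*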
Symbolic Automata Representation
• Compact Representation:
– Canonical form and
– Shared BDD nodes
• Efficient MBDD Manipulations:
– Union, Intersection, and Emptiness Checking
– Projection and Minimization
Symbolic Automata Representation
[Figure: an explicit DFA representation next to the corresponding symbolic DFA representation]
Widening
• String verification problem is undecidable
• The forward fixpoint computation is not guaranteed to
converge in the presence of loops and recursion
• We want to compute a sound approximation
– During the fixpoint computation we compute an over-approximation of the
least fixpoint that corresponds to the reachable states
• We use an automata-based widening operation to over-approximate the fixpoint
– The widening operation over-approximates the union
operations and accelerates the convergence of the
fixpoint computation
Widening
• Given a loop such as
1:<?php
2: $var = "head";
3: while (. . .){
4:   $var = $var . "tail";
5: }
6: echo $var;
7:?>
• Our forward analysis with widening would compute that the
value of the variable $var in line 6 is (head)(tail)*
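The sketch below does not implement the widening operator. It only checks, for the first few loop iterations, that the reported result (head)(tail)* is a sound over-approximation, and it shows why iterating without widening would never stabilize: every extra iteration produces a new value. The names CLAIMED, value_after, and observed are assumptions made for this example.

```python
import re

CLAIMED = re.compile(r"head(tail)*")   # result reported for line 6


def value_after(iterations):
    """Concrete value of $var after the loop body ran `iterations` times."""
    var = "head"
    for _ in range(iterations):
        var = var + "tail"
    return var


observed = [value_after(n) for n in range(5)]
print(observed)
print(all(CLAIMED.fullmatch(v) for v in observed))   # soundness on these runs
print(len(set(observed)) == len(observed))           # the chain keeps growing
```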
Backward Analysis
• A vulnerability signature is a characterization of all
malicious inputs that can be used to generate attack strings
• We identify vulnerability signatures using an automata-based
backward symbolic analysis starting from the sink
node
• Pre-image computations on string operations:
– preConcatPrefix(M, M2)
returns M1, where M = M1.M2
– preConcatSuffix(M, M1)
returns M2, where M = M1.M2
– preReplace(M, M2, M3)
returns M1, where M=replace(M1, M2, M3)
Backward Analysis
[Figure: dependency graph annotated with the forward and backward analysis results of each node]
– $_GET[www], 2: Forward = Σ*, Backward = [^<]*<Σ*
– $www, 2: Forward = Σ*, Backward = [^<]*<Σ*
– "URL", 3: Forward = URL, Backward = Do not care
– $l_otherinfo, 3: Forward = URL, Backward = Do not care
– "", 4: Forward = ε, Backward = Do not care
– [^A-Za-z0-9 .-@://], 4: Forward = [^A-Za-z0-9 .-@/], Backward = Do not care
– preg_replace, 4: Forward = [A-Za-z0-9 .-@/]*, Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
– $www, 4: Forward = [A-Za-z0-9 .-@/]*, Backward = [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
– ": ", 5: Forward = :, Backward = Do not care
– str_concat, 5 (first): Forward = URL:, Backward = Do not care
– str_concat, 5 (second): Forward = URL: [A-Za-z0-9 .-@/]*, Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
– echo, 5: Forward = URL: [A-Za-z0-9 .-@/]*, Backward = URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*
Vulnerability Signature = [^<]*<Σ*
Vulnerability Signature Automaton
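[Figure: vulnerability signature automaton. The initial state loops on [^<] (including non-ASCII characters) and moves on < to the accepting state, which loops on Σ]
[^<]*<Σ*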
Vulnerability Signatures
• The vulnerability signature is the result of the backward analysis at the input node,
which includes all possible malicious inputs
• An input that does not match this signature cannot exploit
the vulnerability
• After generating the vulnerability signature
– Can we generate a patch based on the vulnerability
signature?
[Figure: The vulnerability signature automaton for the running example, [^<]*<Σ*]
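The claim that [^<]*<Σ* characterizes exactly the malicious inputs of the running example can be checked by brute force on short strings. This is a sanity check under stated assumptions, not a proof: Python's re stands in for the PHP regex engine, Σ*<Σ* is used as the attack pattern, and a five-character sample alphabet replaces Σ.

```python
import re
from itertools import product

DELETED = r"[^A-Za-z0-9 .-@:/]"                  # the buggy character class
SIGNATURE = re.compile(r"[^<]*<.*", re.DOTALL)   # [^<]*<Σ*
SAMPLE_ALPHABET = "a<!>-"                        # small stand-in for Σ


def reaches_attack(user_input):
    """Run the sanitizer and test the sink value against Σ*<Σ*."""
    sink_value = "URL: " + re.sub(DELETED, "", user_input)
    return "<" in sink_value


def all_inputs(max_len):
    """Every string over the sample alphabet up to the given length."""
    for n in range(max_len + 1):
        for chars in product(SAMPLE_ALPHABET, repeat=n):
            yield "".join(chars)


mismatches = [s for s in all_inputs(4)
              if reaches_attack(s) != bool(SIGNATURE.fullmatch(s))]
print(mismatches)   # an empty list means signature and behavior agree
```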
Patches from Vulnerability Signatures
• Main idea:
– Given a vulnerability signature automaton, find a cut that
separates initial and accepting states
– Remove the characters in the cut from the user input to
sanitize
[Figure: the vulnerability signature automaton, with the cut crossing the edge labeled <]
min-cut is {<}
• This means that if we just delete “<” from the user input,
then the vulnerability can be removed
Patches from Vulnerability Signatures
• Ideally, we want to modify the input (as little as possible) so
that it does not match the vulnerability signature
• Given a DFA, an alphabet cut is
– a set of characters such that, after removing the edges
labeled with the characters in the set, the
modified DFA does not accept any non-empty string
• Finding a minimum alphabet cut of a DFA is an NP-hard
problem (one can reduce the vertex cover problem to this
problem)
– We can use a min-cut algorithm instead
– The set of characters that are associated with the edges
of the min cut is an alphabet cut
• but not necessarily the minimum alphabet cut
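The min-cut heuristic can be sketched on the signature automaton of the running example. The code below is an illustration under assumptions that the slides do not state: every DFA edge gets unit capacity, all accepting states are tied to one artificial sink, and a plain augmenting-path max-flow is used. The name alphabet_cut and the edge encoding are made up for this example.

```python
from collections import defaultdict, deque


def alphabet_cut(edges, source, accepting):
    """edges: (from_state, to_state, set_of_characters) triples of a DFA.
    Returns the characters on a minimum edge cut between the source and
    the accepting states, using unit capacities (an assumption)."""
    SINK = object()                      # artificial super-sink
    cap = defaultdict(int)
    adj = defaultdict(set)

    def add_edge(u, v, c):
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)                    # residual direction

    for u, v, _chars in edges:
        if u != v:                       # self-loops never cross a cut
            add_edge(u, v, 1)
    for state in accepting:
        add_edge(state, SINK, len(edges) + 1)   # effectively uncuttable

    def residual_tree():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    parent = residual_tree()
    while SINK in parent:                # augment one unit at a time
        v = SINK
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        parent = residual_tree()

    reachable = set(parent)              # source side of the cut
    cut = set()
    for u, v, chars in edges:
        if u in reachable and v not in reachable:
            cut |= set(chars)
    return cut


SIGMA = {chr(c) for c in range(32, 127)}
signature_edges = [
    (0, 0, SIGMA - {"<"}),   # initial state loops on [^<]
    (0, 1, {"<"}),           # "<" leads to the accepting state
    (1, 1, SIGMA),           # accepting state loops on Σ
]
cut = alphabet_cut(signature_edges, source=0, accepting={1})
print(cut)
```

As the slide notes, the characters on a min-cut form an alphabet cut but need not form the smallest one; on this automaton the two coincide.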
Automatically Generated Patch
• The automatically generated patch will make sure that no string
that matches the attack pattern reaches the sensitive
function
<?php
if (preg_match('/[^<]*<.*/', $_GET["www"]))
  $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
$www = $_GET["www"];
$l_otherinfo = "URL";
$www = ereg_replace("[^A-Za-z0-9 .-@://]","",$www);
echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
?>
Experiments
• Application of this approach to five vulnerable input
sanitization routines from three open source web
applications:
(1) MyEasyMarket-4.1: A shopping cart program
(2) BloggIT-1.0: A blog engine
(3) proManager-0.72: A project management system
• We used the following XSS attack pattern:
Σ∗<scriptΣ∗
Forward Analysis Results
• The dependency graphs of these benchmarks are
simplified based on the sinks
– Unrelated parts are removed using slicing
Input                              Results
#nodes  #edges  #sinks  #inputs    Time(s)  Mem (kb)  #states/#bdds
21      20      1       1          0.08     2599      23/219
29      29      1       1          0.53     13633     48/495
25      25      1       2          0.12     1955      125/1200
23      22      1       1          0.12     4022      133/1222
25      25      1       1          0.12     3387      125/1200
Backward Analysis Results
• We use the backward analysis to generate the vulnerability
signatures
– Backward analysis starts from the vulnerable sinks
identified during forward analysis
Input                              Results
#nodes  #edges  #sinks  #inputs    Time(s)  Mem (kb)  #states/#bdds
21      20      1       1          0.46     2963      9/199
29      29      1       1          41.03    1859767   811/8389
25      25      1       2          2.35     5673      20/302, 20/302
23      22      1       1          2.33     32035     91/1127
25      25      1       1          5.02     14958     20/302
Alphabet Cuts
• We generate alphabet cuts from the vulnerability signatures
using a min-cut algorithm
Input                              Results
#nodes  #edges  #sinks  #inputs    Alphabet Cut
21      20      1       1          {<}
29      29      1       1          {S,',"}
25      25      1       2          Σ,Σ  (vulnerability signature depends on two inputs)
23      22      1       1          {<,',"}
25      25      1       1          {<,',"}
• Problem: When there are two user inputs, the patch will
block and delete everything
– It overlooks the relations among input variables (e.g., the
concatenation of two inputs contains <SCRIPT)
Relational String Analysis
• Instead of using multiple single-track DFAs use one multi-track DFA
– Each track represents the values of one string variable
• Using multi-track DFAs:
– Identifies the relations among string variables
– Generates relational vulnerability signatures for multiple
user inputs of a vulnerable application
– Improves the precision of the path-sensitive analysis
– Proves properties that depend on relations among string
variables, e.g., $file = $usr.txt
Multi-track Automata
• Let X (the first track), Y (the second track), be two string
variables
• λ is a padding symbol
• A multi-track automaton that encodes X = Y.txt
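[Figure: automaton with a self-loop on (a,a), (b,b), … at the initial state, followed by a chain of transitions on (t,λ), (x,λ), (t,λ) to the accepting state]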
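The two-track encoding of X = Y.txt (Y followed by the letters t, x, t) can be simulated directly. The sketch below is an assumption-laden illustration: it pads the shorter track on the right with λ, as in an aligned multi-track automaton, and assumes that λ itself never occurs in the strings. The function names are made up for this example.

```python
LAMBDA = "λ"   # padding symbol


def align(x, y):
    """Pair up the two tracks, padding the shorter one on the right with λ."""
    width = max(len(x), len(y))
    return list(zip(x.ljust(width, LAMBDA), y.ljust(width, LAMBDA)))


def accepts_x_equals_y_txt(x, y):
    """Run the two-track automaton for X = Y.txt on the aligned input."""
    suffix = "txt"
    state = 0                      # 0: copying Y; 1..3: after t, x, t
    for a, b in align(x, y):
        if state == 0 and a == b and a != LAMBDA:
            continue               # self-loop on (a,a), (b,b), ...
        if state < len(suffix) and b == LAMBDA and a == suffix[state]:
            state += 1             # (t,λ), then (x,λ), then (t,λ)
            continue
        return False               # no transition: reject
    return state == len(suffix)


print(accepts_x_equals_y_txt("abtxt", "ab"))
print(accepts_x_equals_y_txt("abtx", "ab"))
```

Because both tracks are read in lockstep, the automaton relates the two variables instead of describing each one separately.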
Relational Vulnerability Signature
• We perform forward analysis using multi-track automata to
generate relational vulnerability signatures
• Each track represents one user input
– An auxiliary track represents the values of the current
node
– We intersect the auxiliary track with the attack pattern
upon termination
Relational Vulnerability Signature
• Consider a simple example having multiple user inputs
<?php
$www = $_GET["www"];
$url = $_GET["url"];
echo $url . $www;
?>
• Let the attack pattern be Σ∗ < Σ∗
Relational Vulnerability Signature
• A multi-track automaton: ($url, $www, aux)
• Identifies the fact that the concatenation of two inputs
contains <
[Figure: multi-track automaton over ($url, $www, aux). Its states loop on symbols such as (a,λ,a), (b,λ,b), … for $url and (λ,a,a), (λ,b,b), … for $www; transitions on (<,λ,<) and (λ,<,<) lead to the accepting states]
Relational Vulnerability Signature
• Project away the auxiliary variable
• Find the min-cut
• This min-cut identifies the alphabet cuts {<} for the first
track ($url) and {<} for the second track ($www)
[Figure: the automaton after projecting away the auxiliary track. Its states loop on (a,λ), (b,λ), … and (λ,a), (λ,b), …; the edges labeled (<,λ) and (λ,<) form the min-cut]
min-cut is {<},{<}
Patch for Multiple Inputs
• Patch: If the inputs match the signature, delete the characters in the
alphabet cut from each input
<?php
if (preg_match('/[^<]*<.*/', $_GET["url"] . $_GET["www"]))
{
  $_GET["url"] = preg_replace('/</', "", $_GET["url"]);
  $_GET["www"] = preg_replace('/</', "", $_GET["www"]);
}
$www = $_GET["www"];
$url = $_GET["url"];
echo $url . $www;
?>
Technical Issues
• To conduct relational string analysis, we need to compute the
“intersection” of multi-track automata
– Aligned multi-track automata are closed under
intersection
• λs are right justified in all tracks, e.g., abλλ instead of
aλbλ
– However, there exist unaligned multi-track automata that
are not describable by aligned ones
– We propose an alignment algorithm that constructs
aligned automata which over- or under-approximate
unaligned ones
Other Technical Issues
• Modeling Word Equations:
– Intractability of X = cZ:
• The number of states of the corresponding aligned
multi-track DFA is exponential in the length of c.
– Irregularity of X = YZ:
• X = YZ is not describable by an aligned multi-track
automaton
• Use a conservative analysis
– Construct multi-track automata that over- or under-approximate the word equations
Composite Analysis
• What I have talked about so far focuses only on string
contents
– It does not handle constraints on string lengths
– It cannot handle comparisons among integer variables
and string lengths
• String analysis techniques can be extended to analyze
systems that have unbounded string and integer variables
• Need to use a composite static analysis approach that
combines string analysis and size analysis
Size Analysis
• Size Analysis: The goal of size analysis is to provide
properties about string lengths
– It can be used to discover buffer overflow vulnerabilities
• Integer Analysis: At each program point, statically
compute the possible states of the values of all integer
variables.
– These infinite states are symbolically over-approximated
as linear arithmetic constraints that can be represented
as an arithmetic automaton
• Integer analysis can be used to perform size analysis by
representing lengths of string variables as integer variables.
An Example
• Consider the following segment:
1: <?php
2: $www = $_GET["www"];
3: $l_otherinfo = "URL";
4: $www = ereg_replace("[^A-Za-z0-9 ./-@://]","",$www);
5: if(strlen($www) < $limit)
6:   echo "<td>" . $l_otherinfo . ": " . $www . "</td>";
7:?>
• If we perform only size analysis, then after line 4 we do not
know the length of $www
• If we perform only string analysis, then at line 5 we cannot
check/enforce the branch condition.
Composite Analysis
• We need a composite analysis that combines string
analysis with size analysis.
– Challenge: How to transfer information between string
automata and arithmetic automata?
• A string automaton is a single-track DFA that accepts a
regular language; the lengths of its strings form a semi-linear set
– For example: {4, 6} ∪ {2 + 3k | k ≥ 0}
• The unary encoding of a semi-linear set is uniquely
identified by a unary automaton
• The unary automaton can be constructed by replacing the
alphabet of a string automaton with a unary alphabet
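Replacing the alphabet with a unary one amounts to ignoring the edge labels and keeping only the step count. The sketch below computes the accepted lengths of a small automaton up to a bound in this way. It is an illustration only: the edge-list encoding, the function name accepted_lengths, and the choice of the head(tail)* automaton as input are assumptions made for this example.

```python
def accepted_lengths(start, accepting, edges, bound):
    """Lengths n <= bound such that the automaton accepts some string of
    length n. Dropping the edge labels is the unary-alphabet view."""
    successors = {}
    for src, _label, dst in edges:
        successors.setdefault(src, set()).add(dst)
    lengths = []
    current = {start}                  # states reachable in exactly n steps
    for n in range(bound + 1):
        if current & accepting:
            lengths.append(n)
        current = {d for s in current for d in successors.get(s, ())}
    return lengths


# DFA for head(tail)*: states 0..4 spell "head",
# then 4 -> 5 -> 6 -> 7 -> 4 spells "tail".
edges = [(i, ch, i + 1) for i, ch in enumerate("head")]
edges += [(4, "t", 5), (5, "a", 6), (6, "i", 7), (7, "l", 4)]

lengths = accepted_lengths(0, {4}, edges, 16)
print(lengths)   # a prefix of the semi-linear set {4 + 4k | k >= 0}
```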
Arithmetic Automata
• An arithmetic automaton is a multi-track DFA, where each
track represents the value of one variable over a binary
alphabet
• If the language of an arithmetic automaton satisfies a
Presburger formula, the value of each variable forms a
semi-linear set
• The semi-linear set is accepted by the binary automaton
that projects away all other tracks from the arithmetic
automaton
Connecting the Dots
• There are algorithms to convert unary automata to binary
automata and vice versa
[Figure: String Automata ↔ Unary Length Automata ↔ Binary Length Automata ↔ Arithmetic Automata]
• Using these conversion algorithms we can conduct a
composite analysis that subsumes size analysis and string
analysis
Case Study


• Schoolmate 1.5.4
– Number of PHP files: 63
– Lines of code: 8181
Forward Analysis results
Time        Memory  Number of XSS sensitive sinks  Number of XSS Vulnerabilities
22 minutes  281 MB  898                            153

Actual Vulnerabilities  False Positives
105                     48
Case Study – False Positives

• Why false positives?
– Path insensitivity: 39
• Path to vulnerable program point is not feasible
– Un-modeled built-in PHP functions: 6
– Unfound user-written functions: 3
– PHP programs have more than one execution entry
point
• We can remove all these false positives by extending the
analysis to a path-sensitive analysis and modeling more
PHP functions
Case Study - Sanitization


• After patching all actual vulnerabilities by adding
automated sanitization routines we can run string analysis
again
• When string analysis is used on the automatically
generated patches, it shows that the patches are correct
with respect to the attack pattern
String Analysis
• String analysis based on context free grammars:
[Christensen et al., SAS’03] [Minamide, WWW’05]
• String analysis based on symbolic/concolic execution:
[Bjorner et al., TACAS’09], [Saxena et al., S&P’10]
• Bounded string analysis : [Kiezun et al., ISSTA’09]
• Automata based string analysis: [Xiang et al.,
COMPSAC’07] [Shannon et al., MUTATION’07], [Balzarotti
et al., S&P’08], [Yu et al., SPIN’08, CIAA’10], [Hooimeijer et
al., Usenix’11]
• Application of string analysis to web applications:
[Wassermann and Su, PLDI’07, ICSE’08] [Halfond and
Orso, ASE’05, ICSE’06]
• String analysis for JavaScript [Saxena et al. S&P’10],
[Alkhalaf et al., ICSE’12, ISSTA’12]
String Analysis
• Size Analysis
– Size analysis: [Hughes et al., POPL’96] [Chin et al.,
ICSE’05] [Yu et al., FSE’07] [Yang et al., CAV’08]
– Composite analysis: [Bultan et al., TOSEM’00] [Xu et al.,
ISSTA’08] [Gulwani et al., POPL’08] [Halbwachs et al.,
PLDI’08], [Yu et al. TACAS’09]
• Vulnerability Signature Generation
– Test input/Attack generation: [Wassermann et al.,
ISSTA’08] [Kiezun et al., ICSE’09]
– Vulnerability signature generation: [Brumley et al.,
S&P’06] [Brumley et al., CSF’07] [Costa et al., SOSP’07]
– Vulnerability signature generation and patch generation
[Yu et al., ASE’09, ICSE’11]