Program Analysis via Graph Reachability

Thomas Reps
University of Wisconsin
http://www.cs.wisc.edu/~reps/
PLDI 00 Tutorial, Vancouver, B.C., June 18, 2000
PLDI 00 Registration Form
• PLDI 00: …………………….. $ ____
• Tutorial (morning): …………… $ ____
• Tutorial (afternoon): ………….. $ ____
• Tutorial (evening): ……………. $ – 0 –
Applications
• Program optimization
• Program-understanding and
software-reengineering
• Security
– information flow
• Verification
– model checking
– security of crypto-based protocols for
distributed systems
[Figure: timeline, 1987 to 1998, of the lines of work covered: Slicing & Applications, Dataflow Analysis, Demand Algorithms, CFL Reachability, Structure-Transmitted Dependences, Set Constraints]
. . . As Well As . . .
• Flow-insensitive points-to analysis
• Complexity results
– Linear . . . cubic . . . undecidable variants
– PTIME-completeness
• Model checking of recursive
hierarchical finite-state machines
– “infinite”-state systems
– linear-time and cubic-time algorithms
. . . And Also
• Analysis of attribute grammars
• Security of crypto-based protocols for
distributed systems [Dolev, Even, & Karp 83]
• Formal-language problems
– CFL-recognition (given G and ω, is ω ∈ L(G)?)
– 2DPDA- and 2NPDA-simulation
• Given M and ω, is ω ∈ L(M)?
• String-matching problems
Unifying Conceptual Model
for Dataflow-Analysis Literature
• Linear-time gen-kill [Hecht 76], [Kou 77]
• Path-constrained DFA [Holley & Rosen 81]
• Linear-time GMOD [Cooper & Kennedy 88]
• Flow-sensitive MOD [Callahan 88]
• Linear-time interprocedural gen-kill [Knoop & Steffen 93]
• Linear-time bidirectional gen-kill [Dhamdhere 94]
• Relationship to interprocedural DFA [Sharir & Pnueli 81], [Knoop & Steffen 92]
Collaborators
• Susan Horwitz
• Mooly Sagiv
• Genevieve Rosay
• David Melski
• David Binkley
• Michael Benedikt
• Patrice Godefroid
Themes
• Harnessing CFL-reachability
• Relationship to other analysis paradigms
• Exhaustive alg. vs. demand alg.
• Understanding complexity
– Linear . . . cubic . . . undecidable
• Beyond CFL-reachability
Backward Slice
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Backward slice with respect to “printf(“%d\n”,i)”
Slice Extraction
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n", i);
}
Backward slice with respect to “printf(“%d\n”,i)”
Forward Slice
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Forward slice with respect to “sum = 0”
What Are Slices Useful For?
• Understanding Programs
– What is affected by what?
• Restructuring Programs
– Isolation of separate “computational threads”
• Program Specialization and Reuse
– Slices = specialized programs
– Only reuse needed slices
• Program Differencing
– Compare slices to identify changes
• Testing
– What new test cases would improve coverage?
– What regression tests must be rerun after a change?
Line-Character-Count Program
void line_char_count(FILE *f) {
    int lines = 0;
    int chars;
    BOOL eof_flag = FALSE;
    int n;
    extern void scan_line(FILE *f, BOOL *bptr, int *iptr);
    scan_line(f, &eof_flag, &n);
    chars = n;
    while (eof_flag == FALSE) {
        lines = lines + 1;
        scan_line(f, &eof_flag, &n);
        chars = chars + n;
    }
    printf("lines = %d\n", lines);
    printf("chars = %d\n", chars);
}
Character-Count Program
void char_count(FILE *f) {
    int lines = 0;
    int chars;
    BOOL eof_flag = FALSE;
    int n;
    extern void scan_line(FILE *f, BOOL *bptr, int *iptr);
    scan_line(f, &eof_flag, &n);
    chars = n;
    while (eof_flag == FALSE) {
        lines = lines + 1;
        scan_line(f, &eof_flag, &n);
        chars = chars + n;
    }
    printf("lines = %d\n", lines);
    printf("chars = %d\n", chars);
}
Line-Count Program
void line_count(FILE *f) {
    int lines = 0;
    int chars;
    BOOL eof_flag = FALSE;
    int n;
    extern void scan_line2(FILE *f, BOOL *bptr, int *iptr);
    scan_line2(f, &eof_flag, &n);
    chars = n;
    while (eof_flag == FALSE) {
        lines = lines + 1;
        scan_line2(f, &eof_flag, &n);
        chars = chars + n;
    }
    printf("lines = %d\n", lines);
    printf("chars = %d\n", chars);
}
Specialization Via Slicing
wc -lc
wc -c
wc -l
Not partial evaluation!
void line_count(FILE *f);
Control Flow Graph
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
[Figure: control-flow graph. Enter → sum = 0 → i = 1 → while(i < 11); on T: sum = sum + i → i = i + 1 → back to while(i < 11); on F: printf(sum) → printf(i)]
Flow Dependence Graph
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Flow dependence p → q: the value of a variable assigned at p may be used at q.
[Figure: flow dependence graph over the nodes Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)]
Control Dependence Graph
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Control dependence p →T q: q is reached from p if condition p is true (T), not otherwise. Similar for false (F).
[Figure: control dependence graph over the nodes Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i); all edges are labeled T]
Program Dependence Graph (PDG)
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
[Figure: program dependence graph over the nodes Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i), combining control dependence edges (labeled T) and flow dependence edges]
Program Dependence Graph (PDG)
int main() {
    int i = 1;
    int sum = 0;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Opposite order, same PDG.
[Figure: the same program dependence graph as for the original statement order]
Backward Slice
int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = sum + i;
        i = i + 1;
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
[Figure: program dependence graph over the nodes Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i)]
[Figures: Backward Slice (2) to (4) repeat the program and PDG above, one frame per step of the backward traversal from printf(i)]
Slice Extraction
int main() {
    int i = 1;
    while (i < 11) {
        i = i + 1;
    }
    printf("%d\n", i);
}
[Figure: PDG restricted to the slice: Enter, i = 1, while(i < 11), i = i + 1, printf(i)]
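The slicing slides above treat a backward slice as plain backward reachability in the PDG. A minimal Python sketch of that idea follows. The edge list `PDG_EDGES` is an assumption: it was derived by hand from the sum program, not copied from the slides, and the function name `backward_slice` is hypothetical.

```python
from collections import deque

# Hand-derived PDG of the sum program (assumption, not taken from the figure).
# Each pair (p, q) means "q depends on p" (control or flow dependence).
PDG_EDGES = [
    # control dependences
    ("Enter", "sum = 0"), ("Enter", "i = 1"), ("Enter", "while(i < 11)"),
    ("Enter", "printf(sum)"), ("Enter", "printf(i)"),
    ("while(i < 11)", "sum = sum + i"), ("while(i < 11)", "i = i + 1"),
    # flow dependences
    ("sum = 0", "sum = sum + i"), ("sum = 0", "printf(sum)"),
    ("i = 1", "while(i < 11)"), ("i = 1", "sum = sum + i"),
    ("i = 1", "i = i + 1"), ("i = 1", "printf(i)"),
    ("sum = sum + i", "sum = sum + i"), ("sum = sum + i", "printf(sum)"),
    ("i = i + 1", "while(i < 11)"), ("i = i + 1", "sum = sum + i"),
    ("i = i + 1", "i = i + 1"), ("i = i + 1", "printf(i)"),
]

def backward_slice(edges, criterion):
    """Return every node from which `criterion` is reachable in the PDG."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
    seen = {criterion}
    work = deque([criterion])
    while work:
        node = work.popleft()
        for p in preds.get(node, ()):
            if p not in seen:
                seen.add(p)
                work.append(p)
    return seen

if __name__ == "__main__":
    for stmt in sorted(backward_slice(PDG_EDGES, "printf(i)")):
        print(stmt)
```

With these edges, the slice with respect to printf(i) contains exactly the statements kept on the Slice Extraction slide; a forward slice would follow the same edges in the other direction.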
CodeSurfer
Browsing a Dependence Graph
Pretend this is your favorite browser
What does clicking on a link do?
Or you move to an internal tag
You get
a new page
Interprocedural Slice
int add(int x, int y) {
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Backward slice with respect to “printf(“%d\n”,i)”
Interprocedural Slice
int add(int x, int y) {
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Superfluous components included by Weiser’s slicing algorithm [TSE 84]
Left out by algorithm of Horwitz, Reps, & Binkley [PLDI 88; TOPLAS 90]
System Dependence Graph (SDG)
[Figure: schematic SDG with Enter main, two Call p sites, and Enter p]
SDG for the Sum Program
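[Figure: SDG for the sum program. main: Enter main, sum = 0, i = 1, while(i < 11), printf(sum), printf(i), and two Call add sites. The first call site has actual-in nodes xin = sum, yin = i and actual-out node sum = xout; the second has xin = i, yin = 1 and i = xout. add: Enter add, x = xin, y = yin, x = x + y, xout = x]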
Interprocedural Backward Slice
[Figures (1) to (7): a schematic SDG with Enter main, two Call p sites, and Enter p, shown once per step of the backward traversal. The edges into and out of p at one call site are labeled ( and ); those at the other call site are labeled [ and ]]
Matched-Parenthesis Path
A path whose call and return labels pair up, such as ( ... ), is a matched-parenthesis path. A path such as [ ... ) enters p through one call site and leaves through the other, so it is not.
Slice Extraction
[Figure: schematic SDG restricted to the slice: Enter main, one Call p site, Enter p]
Slice of the Sum Program
[Figure: slice of the SDG for the sum program. main: Enter main, i = 1, while(i < 11), printf(i), one Call add site with xin = i, yin = 1, and i = xout. add: Enter add, x = xin, y = yin, x = x + y, xout = x]
CFL-Reachability
[Yannakakis 90]
• G: Graph (N nodes, E edges)
• L: A context-free language
• L-path from s to t iff s →* t along a path whose edge labels spell a word α ∈ L
• Running time: O(N^3)
Interprocedural Slicing
via CFL-Reachability
• Graph: System dependence graph
• L: L(matched)
[roughly]
• Node m is in the slice w.r.t. n iff there
is an L(matched)-path from m to n
Asymptotic Running Time
[Reps, Horwitz, Sagiv, & Rosay 94]
• CFL-reachability
– System dependence graph: N nodes, E edges
– Running time: O(N^3)
• System dependence graph
– Special structure
– Running time: O(E + CallSites × MaxParams^3)
matched → ε
        | e
        | [ matched ]
        | ( matched )
        | matched matched
Example word in L(matched): ( e [ e ] e [ e [ e ] ] e )
[Figure: ordinary graph reachability versus CFL-reachability; a path from s to t whose edges are labeled e, (, ), [, and ]]
CFL-Reachability via Dynamic
Programming
Grammar: A → B C
Graph: whenever there is a B-edge from x to y and a C-edge from y to z, add an A-edge from x to z
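The dynamic-programming step above (combine a B-edge and a C-edge into an A-edge) can be run to a fixpoint with a worklist. The following Python sketch assumes the grammar has been normalized so that every production is A → ε, A → B, or A → B C. The function name `cfl_reach`, the helper nonterminals `open_paren` and `open_brack`, and the tiny example graph are all hypothetical, and the sketch makes no attempt to meet the cubic bound efficiently.

```python
from collections import defaultdict

def cfl_reach(nodes, edges, eps, unary, binary):
    """Saturate a labeled graph under a normalized grammar.

    edges:  iterable of (u, label, v)
    eps:    nonterminals A with A -> epsilon
    unary:  pairs (A, B) for A -> B
    binary: triples (A, B, C) for A -> B C
    Returns the set of all derived (u, label, v) facts.
    """
    facts, work = set(), []

    def add(u, label, v):
        if (u, label, v) not in facts:
            facts.add((u, label, v))
            work.append((u, label, v))

    for u, label, v in edges:
        add(u, label, v)
    for a in eps:
        for n in nodes:
            add(n, a, n)

    out_edges = defaultdict(set)   # u -> {(label, v)}
    in_edges = defaultdict(set)    # v -> {(label, u)}
    while work:
        u, b, v = work.pop()
        out_edges[u].add((b, v))
        in_edges[v].add((b, u))
        for a, x in unary:
            if x == b:
                add(u, a, v)
        for a, x, y in binary:
            if x == b:             # popped edge is the left part
                for c, w in list(out_edges[v]):
                    if c == y:
                        add(u, a, w)
            if y == b:             # popped edge is the right part
                for c, t in list(in_edges[u]):
                    if c == x:
                        add(t, a, v)
    return facts

# The matched-parenthesis grammar of the slides, split into binary rules.
EPS = ["matched"]
UNARY = [("matched", "e")]
BINARY = [
    ("matched", "matched", "matched"),
    ("open_paren", "(", "matched"), ("matched", "open_paren", ")"),
    ("open_brack", "[", "matched"), ("matched", "open_brack", "]"),
]
NODES = ["s", "a", "b", "t", "u"]
EDGES = [("s", "(", "a"), ("a", "e", "b"), ("b", ")", "t"), ("b", "]", "u")]

if __name__ == "__main__":
    facts = cfl_reach(NODES, EDGES, EPS, UNARY, BINARY)
    print(("s", "matched", "t") in facts)   # path ( e ) is matched
    print(("s", "matched", "u") in facts)   # path ( e ] is not
```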
Degenerate Case: CFL-Recognition
exp → id | exp + exp | exp * exp | ( exp )
“(a + b) * c” ∈ L(exp) ?
[Figure: chain graph from s to t whose edges are labeled ( a + b ) * c]
Degenerate Case: CFL-Recognition
exp → id | exp + exp | exp * exp | ( exp )
“a + b) * c +” ∈ L(exp) ?
[Figure: chain graph from s to t whose edges are labeled a + b ) * c +]
CYK: Context-Free Recognition
M → M M
  | ( M )
  | [ M ]
  | ( )
  | [ ]
ω = “( [ ] ) [ ]”
Is ω ∈ L(M)?
CYK: Context-Free Recognition
M → M M | ( M ) | [ M ] | ( ) | [ ]
Rewritten so that each right-hand side has two symbols:
M → M M | LPM ) | LBM ] | ( ) | [ ]
LPM → ( M
LBM → [ M
Is “( [ ] ) [ ]” ∈ L(M)?

                 start position
length     1      2      3      4      5      6
           (      [      ]      )      [      ]
  1       {(}    {[}    {]}    {)}    {[}    {]}
  2        ∅     {M}     ∅      ∅     {M}
  3      {LPM}    ∅      ∅      ∅
  4       {M}     ∅      ∅
  5        ∅      ∅
  6       {M}

Rules used: M → [ ] (length 2), LPM → ( M (length 3), M → LPM ) (length 4), M → M M (length 6).
CYK: Graphs vs. Tables
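Is “( [ ] ) [ ]” ∈ L(M)?
[Figure: chain graph from s to t with edges labeled ( [ ] ) [ ], plus derived edges labeled M, LPM, M, M, M that span the substrings they derive]
M → M M | LPM ) | LBM ] | ( ) | [ ]
LPM → ( M
LBM → [ M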
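The table-filling view of CYK for this grammar can be sketched in a few lines of Python. This is an illustrative sketch, not code from the tutorial: the names `RULES` and `cyk` are hypothetical, and, as in the table on the slides, the length-1 cells simply hold the terminal itself.

```python
# Binary rules of the normalized grammar, indexed by their right-hand side.
RULES = {
    ("M", "M"): {"M"},
    ("LPM", ")"): {"M"},
    ("LBM", "]"): {"M"},
    ("(", ")"): {"M"},
    ("[", "]"): {"M"},
    ("(", "M"): {"LPM"},
    ("[", "M"): {"LBM"},
}

def cyk(tokens):
    """Return a dict mapping (start, length) to the set of symbols
    that derive tokens[start:start + length]."""
    n = len(tokens)
    table = {(i, 1): {tok} for i, tok in enumerate(tokens)}
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            cell = set()
            for split in range(1, length):
                left = table[(start, split)]
                right = table[(start + split, length - split)]
                for x in left:
                    for y in right:
                        cell |= RULES.get((x, y), set())
            table[(start, length)] = cell
    return table

if __name__ == "__main__":
    tokens = list("([])[]")
    table = cyk(tokens)
    print("M" in table[(0, len(tokens))])
```

Reading each cell (start, length) as an edge from position start to position start + length gives the graph view of the same computation.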
CFL-Reachability via Dynamic
Programming
Grammar: A → B C
Graph: whenever there is a B-edge from x to y and a C-edge from y to z, add an A-edge from x to z
Dynamic Transitive Closure ?!
• Aiken et al.
– Set-constraint solvers
– Points-to analysis
• Henglein et al.
– type inference
• But a CFL captures a non-transitive
reachability relation [Valiant 75]
Program Chopping
Given source S and target T, what program
points transmit effects from S to T?
S
T
Intersect forward slice from S with
backward slice from T, right?
Non-Transitivity and Slicing
int add(int x, int y) {
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Forward slice with respect to “sum = 0”
Non-Transitivity and Slicing
int add(int x, int y) {
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Backward slice with respect to “printf(“%d\n”,i)”
Non-Transitivity and Slicing
int add(int x, int y) {
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Forward slice with respect to “sum = 0”
∩
Backward slice with respect to “printf(“%d\n”,i)”
Non-Transitivity and Slicing
int add(int x, int y) {
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}

Chop with respect to “sum = 0” and “printf(“%d\n”,i)”
Non-Transitivity and Slicing
[Figure: SDG for the sum program with a path from sum = 0 to printf(i) marked. The path enters add through the first call site (edge labeled "(") and leaves through the second (edge labeled "]"), so it is not a matched path]
Program Chopping
Given source S and target T, what program
points transmit effects from S to T?
S
T
“Precise interprocedural chopping”
[Reps & Rosay FSE 95]
CF-Recognition vs. CFL-Reachability
• CF-Recognition
– Chain graphs
– General grammar: sub-cubic time [Valiant75]
– LL(1), LR(1): linear time
• CFL-Reachability
– General graphs: O(N^3)
– LL(1): O(N^3)
– LR(1): O(N^3)
– Certain kinds of graphs: O(N+E) (Gen/kill IDFA)
– Regular languages: O(N+E) (GMOD IDFA)
Regular-Language Reachability
[Yannakakis 90]
• G: Graph (N nodes, E edges)
• L: A regular language
• L-path from s to t iff s →* t along a path whose edge labels spell a word α ∈ L
• Running time: O(N+E) vs. O(N^3)
• Ordinary reachability (= transitive closure)
– Label each edge with e
– L is e*
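One standard way to see why the regular case is cheap is to search the product of the graph with a finite automaton for L. The Python sketch below assumes L is given as a DFA (a transition dictionary, a start state, and a set of accepting states); the function name `regular_reach` and the example graph are hypothetical. For a fixed automaton the search visits each (node, state) pair at most once.

```python
from collections import deque

def regular_reach(edges, delta, start_state, accepting, source):
    """Nodes t such that some path source ->* t spells a word accepted
    by the DFA (delta maps (state, label) to the next state)."""
    succ = {}
    for u, label, v in edges:
        succ.setdefault(u, []).append((label, v))
    start = (source, start_state)
    seen = {start}
    work = deque([start])
    while work:
        node, state = work.popleft()
        for label, nxt in succ.get(node, ()):
            nxt_state = delta.get((state, label))
            if nxt_state is not None and (nxt, nxt_state) not in seen:
                seen.add((nxt, nxt_state))
                work.append((nxt, nxt_state))
    return {node for node, state in seen if state in accepting}

# Ordinary reachability: every edge of interest is labeled e and L is e*.
E_STAR = {(0, "e"): 0}
GRAPH = [("s", "e", "a"), ("a", "e", "t"), ("s", "x", "u")]

if __name__ == "__main__":
    print(sorted(regular_reach(GRAPH, E_STAR, 0, {0}, "s")))
```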
Security of Crypto-Based
Protocols for Distributed Systems
• “Ping-pong” protocols
(1) X → Y: Encrypt_Y(M X)
(2) Y → X: Encrypt_X(M)
• [Dolev & Yao 83]
– O(N^8) algorithm
• [Dolev, Even, & Karp 83]
– Less well known than [Dolev & Yao 83]
– O(N^3) algorithm
[Dolev, Even, & Karp 83]
Id → Encrypt_X Id Decrypt_X
Id → Decrypt_X Id Encrypt_X
Id → . . .
[Figure: graph from Message to Saboteur with edges labeled E_Y, A_X, E_Y, A_Z; the question is whether some path spells a word in L(Id)]
Themes
• Harnessing CFL-reachability
• Relationship to other analysis paradigms
• Exhaustive alg. vs. demand alg.
• Understanding complexity
– Linear . . . cubic . . . undecidable
• Beyond CFL-reachability
Relationship to
Other Analysis Paradigms
• Dataflow analysis
–reachability versus equation solving
• Deduction
• Set constraints
Dataflow Analysis
• Goal: For each point in the program,
determine a superset of the “facts” that
could possibly hold during execution
• Examples
– Constant propagation
– Reaching definitions
– Live variables
– Possibly uninitialized variables
Useful For . . .
• Optimizing compilers
• Parallelizing compilers
• Tools that detect possible logical errors
• Tools that show the effects of a proposed modification
Possibly Uninitialized Variables
[Figure: control-flow graph annotated with the set of possibly uninitialized variables at each point and with one transfer function per node]
Start: λV.{w, x, y}
x = 3: λV.V − {x}
if . . . : λV.V
y = x: λV. if x ∈ V then V ∪ {y} else V − {y}
y = w: λV. if w ∈ V then V ∪ {y} else V − {y}
w = 8: λV.V − {w}
printf(y): λV.V
Precise Intraprocedural Analysis
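[Figure: a path p from start to n, with initial value C at start and edge functions f_1, f_2, ..., f_{k-1}, f_k]
$pf_p = f_k \circ f_{k-1} \circ \cdots \circ f_2 \circ f_1$
$\mathrm{MOP}[n] = \bigsqcap_{p \in \mathrm{PathsTo}[n]} pf_p(C)$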
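[Figure: supergraph of a two-procedure example. main: start main, x = 3, call p(x,y), return from p, printf(y), exit main. p: start p(a,b), if . . ., b = a, call p(a,b), return from p, printf(b), exit p. Call and return edges carry parenthesis labels: ( and ) for the call in main, [ and ] for the call in p]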
Precise Interprocedural Analysis
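[Figure: a path from start to n that calls procedure q (edge f_call, labeled "(") and returns from it (edge f_ret, labeled ")"), passing through start_q and exit_q; initial value C, edge functions f_1, ..., f_k]
$\mathrm{MOMP}[n] = \bigsqcap_{p \in \mathrm{MatchedPathsTo}[n]} pf_p(C)$
[Sharir & Pnueli 81]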
Representing Dataflow Functions
Identity Function
f = λV.V
f({a, b}) = {a, b}
Constant Function
f = λV.{b}
f({a, b}) = {b}
[Figures: each function drawn as a graph from nodes Λ, a, b, c to nodes Λ, a, b, c]
Representing Dataflow Functions
“Gen/Kill” Function
f = λV.(V − {b}) ∪ {c}
f({a, b}) = {a, c}
Non-“Gen/Kill” Function
f = λV. if a ∈ V then V ∪ {b} else V − {b}
f({a, b}) = {a, b}
[Figures: each function drawn as a graph from nodes Λ, a, b, c to nodes Λ, a, b, c]
[Figure: exploded supergraph of the two-procedure example, with nodes Λ, x, y at each point of main and Λ, a, b at each point of p]
Composing Dataflow Functions
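f_1 = λV. if a ∈ V then V ∪ {b} else V − {b}
f_2 = λV. if b ∈ V then {c} else ∅
f_2 ∘ f_1({a, c}) = {c}
[Figure: the graphs of f_1 and f_2 over Λ, a, b, c placed end to end; composition corresponds to following paths through both graphs]
[Figure: exploded supergraph of the two-procedure example, with nodes Λ, x, y in main and Λ, a, b in p, and call and return edges labeled ( ) and [ ]. Might y be uninitialized at printf(y)? YES! Might b be uninitialized at printf(b)? NO!]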
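The graph encoding of dataflow functions, and the claim that composing functions amounts to composing graphs, can be checked on the f_1 and f_2 of the Composing Dataflow Functions slide. The Python sketch below uses the usual encoding for distributive functions: an edge Λ → Λ, an edge Λ → y for each y in f(∅), and an edge x → y for each y in f({x}) that is not already in f(∅). The helper names `represent`, `compose`, and `apply_rel` are hypothetical.

```python
LAMBDA = "Λ"   # the extra node that stands for "no fact"

def represent(f, domain):
    """Encode a distributive function on subsets of `domain` as a set of edges."""
    base = set(f(frozenset()))
    edges = {(LAMBDA, LAMBDA)} | {(LAMBDA, y) for y in base}
    for x in domain:
        for y in set(f(frozenset({x}))) - base:
            edges.add((x, y))
    return edges

def compose(first, second):
    """Relational composition: follow an edge of `first`, then one of `second`."""
    return {(x, z) for x, y in first for y2, z in second if y == y2}

def apply_rel(rel, facts):
    """Interpret an edge set as a function on sets of facts."""
    sources = set(facts) | {LAMBDA}
    return {y for x, y in rel if x in sources and y != LAMBDA}

DOMAIN = {"a", "b", "c"}

def f1(v):
    return v | {"b"} if "a" in v else v - {"b"}

def f2(v):
    return {"c"} if "b" in v else set()

if __name__ == "__main__":
    r1, r2 = represent(f1, DOMAIN), represent(f2, DOMAIN)
    print(apply_rel(compose(r1, r2), {"a", "c"}))   # agrees with f2(f1({a, c}))
```

The encoding is only faithful for distributive functions, which is the class the exploded-graph construction is stated for.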
matched → matched matched
        | (_i matched )_i     for 1 ≤ i ≤ CallSites
        | edge
        | ε
[Figure: a path shown next to its call stack; labels "stack" and "Off Limits!"; example label sequence ( ) ( ) ( ) ( ( ( ) ) )]
unbalLeft → matched unbalLeft
          | (_i unbalLeft     for 1 ≤ i ≤ CallSites
          | ε
[Figure: a path shown next to its call stack; labels "stack" and "Off Limits!"; example label sequence ( ) ( ) ( ( ( ) ) ( ( )]
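Membership of a single path's label sequence in L(matched) or L(unbalLeft) can be decided with an explicit call stack, which is the intuition the two grammars above capture. The Python sketch below is only an illustration of the languages, not of the reachability algorithm; the label encoding (`("call", i)`, `("ret", i)`, `("edge", None)`) and the function name `classify` are assumptions.

```python
def classify(labels):
    """Return (in_unbal_left, in_matched) for a sequence of edge labels.

    labels: list of ("call", site), ("ret", site), or ("edge", None).
    A return must match the most recent unreturned call; calls may be
    left pending in unbalLeft but not in matched.
    """
    stack = []
    for kind, site in labels:
        if kind == "call":
            stack.append(site)
        elif kind == "ret":
            if not stack or stack.pop() != site:
                return (False, False)   # mismatched or unbalanced-right
    return (True, not stack)

if __name__ == "__main__":
    print(classify([("call", 1), ("edge", None), ("ret", 1)]))
    print(classify([("call", 1), ("call", 2), ("ret", 2)]))
    print(classify([("call", 1), ("ret", 2)]))
```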
Interprocedural Dataflow Analysis
via CFL-Reachability
• Graph: Exploded control-flow graph
• L: L(unbalLeft)
• Fact d holds at n iff there is an L(unbalLeft)-path
from ⟨start_main, Λ⟩ to ⟨n, d⟩
Asymptotic Running Time
[Reps, Horwitz, & Sagiv 95]
• CFL-reachability
– Exploded control-flow graph: ND nodes
– Running time: O(N^3 D^3)
• Exploded control-flow graph
– Special structure
– Running time: O(ED^3)
– Typically: E ≈ N, hence O(ED^3) ≈ O(ND^3)
– “Gen/kill” problems: O(ED)
Why Bother?
“We’re only interested in million-line programs”
• Know thy enemy!
– “Any” algorithm must do these operations
– Avoid pitfalls (e.g., claiming O(N^2) algorithm)
• The essence of “context sensitivity”
• Special cases
– “Gen/kill” problems: O(ED)
• Compression techniques
– Basic blocks
– SSA form, sparse evaluation graphs
• Demand algorithms
Relationship to
Other Analysis Paradigms
• Dataflow analysis
–reachability versus equation solving
• Deduction
• Set constraints
The Need for Pointer Analysis
int add(int x, int y)
{
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    int *p = &sum;
    int *q = &i;
    int (*f)(int,int) = add;
    while (*q < 11) {
        *p = (*f)(*p, *q);
        *q = (*f)(*q, 1);
    }
    printf("%d\n", *p);
    printf("%d\n", *q);
}
The Need for Pointer Analysis
int add(int x, int y)
{
    return x + y;
}

int main() {
    int sum = 0;
    int i = 1;
    int *p = &sum;
    int *q = &i;
    int (*f)(int,int) = add;
    while (i < 11) {
        sum = add(sum, i);
        i = add(i, 1);
    }
    printf("%d\n", sum);
    printf("%d\n", i);
}
Flow-Sensitive Points-To Analysis
[Figure: before-and-after points-to graphs for the four statement kinds p = &q; p = q; p = *q; *p = q; drawn over nodes p, q, r1, r2, s1, s2, s3]
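Flow-Sensitive → Flow-Insensitive
[Figure: a control-flow graph from start main to exit main over statements 1 to 5, next to a version in which statements 1 to 5 may execute in any order]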
Flow-Insensitive Points-To Analysis
[Andersen 94, Shapiro & Horwitz 97]
[Figure: points-to edges added by each of the four statement kinds p = &q; p = q; p = *q; *p = q; drawn over nodes p, q, r1, r2, s1, s2, s3]
Flow-Insensitive Points-To Analysis
a = &e;
b = a;
c = &f;
*b = c;
d = *a;
[Figure: points-to graph over the variables a, b, c, d, e, f]
Flow-Insensitive Points-To Analysis
• Andersen [Thesis 94]
– Formulated using set constraints
– Cubic-time algorithm
• Shapiro & Horwitz (1995; [POPL 97])
– Re-formulated as a graph-grammar problem
• Reps (1995; [unpublished])
– Re-formulated as a Horn-clause program
• Melski (1996; see [Reps, IST98])
– Re-formulated via CFL-reachability
CFL-Reachability via Dynamic
Programming
Grammar: A → B C
Graph: whenever there is a B-edge from x to y and a C-edge from y to z, add an A-edge from x to z
CFL-Reachability = Chain Programs
Grammar: A → B C
Graph: a B-edge from x to y and a C-edge from y to z yield an A-edge from x to z
Chain rule: a(X,Z) :- b(X,Y), c(Y,Z).
Base Facts for Points-To Analysis
p = &q;
assignAddr(p,q).
p = q;
assign(p,q).
p = *q;
assignStar(p,q).
*p = q;
starAssign(p,q).
Rules for Points-To Analysis (I)
p = &q;
pointsTo(P,Q) :- assignAddr(P,Q).
[Figure: p gains a points-to edge to q]
p = q;
pointsTo(P,R) :- assign(P,Q), pointsTo(Q,R).
[Figure: p gains points-to edges to r1 and r2, the targets of q]
Rules for Points-To Analysis (II)
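p = *q;
pointsTo(P,S) :- assignStar(P,Q),pointsTo(Q,R),pointsTo(R,S).
[Figure: q points to r1, r2, which point to s1, s2, s3; p gains points-to edges to s1, s2, s3]
*p = q;
pointsTo(R,S) :- starAssign(P,Q),pointsTo(P,R),pointsTo(Q,S).
[Figure: p points to r1, r2 and q points to s1, s2; r1 and r2 gain points-to edges to s1 and s2]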
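The four rules above can be run directly as a naive fixpoint. The Python sketch below does that for the five-statement example from the earlier Flow-Insensitive Points-To Analysis slide (a = &e; b = a; c = &f; *b = c; d = *a). It is an illustration of the rules only: the function name `points_to` is hypothetical, and the loop re-evaluates every rule each round, so it does not reflect the cubic-time algorithms cited in the slides.

```python
def points_to(assign_addr, assign, assign_star, star_assign):
    """Least set of (p, q) pairs closed under the four Horn-clause rules.

    Each argument is a list of (lhs, rhs) base facts:
      assign_addr: p = &q     assign:      p = q
      assign_star: p = *q     star_assign: *p = q
    """
    pts = set(assign_addr)                       # pointsTo(P,Q) :- assignAddr(P,Q).
    changed = True
    while changed:
        new = set()
        for p, q in assign:                      # p = q
            new |= {(p, r) for q2, r in pts if q2 == q}
        for p, q in assign_star:                 # p = *q
            for q2, r in pts:
                if q2 == q:
                    new |= {(p, s) for r2, s in pts if r2 == r}
        for p, q in star_assign:                 # *p = q
            for p2, r in pts:
                if p2 == p:
                    new |= {(r, s) for q2, s in pts if q2 == q}
        changed = not new <= pts
        pts |= new
    return pts

if __name__ == "__main__":
    result = points_to(
        assign_addr=[("a", "e"), ("c", "f")],    # a = &e;  c = &f;
        assign=[("b", "a")],                     # b = a;
        assign_star=[("d", "a")],                # d = *a;
        star_assign=[("b", "c")],                # *b = c;
    )
    for pair in sorted(result):
        print(pair)
```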
Creating a Chain Program
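*p = q;
[Figure: p points to r1, r2 and q points to s1, s2; r1 and r2 gain points-to edges to s1 and s2]
pointsTo(R,S) :- starAssign(P,Q),pointsTo(P,R),pointsTo(Q,S).
pointsTo(R,S) :- pointsTo(P,R),starAssign(P,Q),pointsTo(Q,S).
pointsTo(R,S) :- pointsToInv(R,P),starAssign(P,Q),pointsTo(Q,S).
pointsToInv(R,P) :- pointsTo(P,R).
(The suffix Inv denotes the reversed relation.)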
Base Facts for Points-To Analysis
p = &q;
assignAddr(p,q).
assignAddrInv(q,p).
p = q;
assign(p,q).
assignInv(q,p).
p = *q;
assignStar(p,q).
assignStarInv(q,p).
*p = q;
starAssign(p,q).
starAssignInv(q,p).
Creating a Chain Program
pointsTo(P,Q) :- assignAddr(P,Q).
pointsToInv(Q,P) :- assignAddrInv(Q,P).
pointsTo(P,R) :- assign(P,Q), pointsTo(Q,R).
pointsToInv(R,P) :- pointsToInv(R,Q), assignInv(Q,P).
pointsTo(P,S) :- assignStar(P,Q),pointsTo(Q,R),pointsTo(R,S).
pointsToInv(S,P) :- pointsToInv(S,R),pointsToInv(R,Q),assignStarInv(Q,P).
pointsTo(R,S) :- pointsToInv(R,P),starAssign(P,Q),pointsTo(Q,S).
pointsToInv(S,R) :- pointsToInv(S,Q),starAssignInv(Q,P),pointsTo(P,R).
. . . and now to CFL-Reachability
pointsTo → assignAddr
pointsToInv → assignAddrInv
pointsTo → assign pointsTo
pointsToInv → pointsToInv assignInv
pointsTo → assignStar pointsTo pointsTo
pointsToInv → pointsToInv pointsToInv assignStarInv
pointsTo → pointsToInv starAssign pointsTo
pointsToInv → pointsToInv starAssignInv pointsTo
Relationship to
Other Analysis Paradigms
• Dataflow analysis
–reachability versus equation solving
• Deduction
• Set constraints
[Figure: timeline, 1987 to 1998, of the lines of work covered: Slicing & Applications, Dataflow Analysis, Demand Algorithms, CFL Reachability, Structure-Transmitted Dependences, Set Constraints]
Structure-Transmitted Dependences
[Reps1995]
McCarthy’s equations: car(cons(x,y)) = x
cdr(cons(x,y)) = y
w = cons(x,y);
v = car(w);
dep → ε
dep → hd dep hd⁻¹
[Figure: edges x →hd w, y →tl w, and w →hd⁻¹ v, with dep edges derived from them]
Set Constraints
w = cons(x,y);
v = car(w);
$W \supseteq \mathit{cons}(X, Y)$
$V \supseteq \mathit{cons}_1^{-1}(W)$
McCarthy’s Equations Revisited
$\mathit{cons}_1^{-1}(\mathit{cons}(X, Y)) = X$, provided $I(Y) \neq \emptyset$
Semantics of Set Constraints
$I(\mathit{cons}(V_1, V_2)) = \{\mathit{cons}(v_1, v_2) \mid v_1 \in I(V_1) \text{ and } v_2 \in I(V_2)\}$
$I(\mathit{cons}_1^{-1}(V)) = \{v_1 \mid \mathit{cons}(v_1, v_2) \in I(V)\}$
CFL-Reachability
versus
Set Constraints
• Lazy languages: CFL-reachability is more natural
– car(cons(X,Y)) = X
• Strict languages: Set constraints are more natural
– car(cons(X,Y)) = X, provided I(Y) ≠ ∅
• But . . . SC and CFL-reachability are equivalent!
– [Melski & Reps 97]
Solving Set Constraints
W ⊇ a  ⟹  W is “inhabited”
X is “inhabited”, Y is “inhabited”, W ⊇ cons(X, Y)  ⟹  W is “inhabited”
Y is “inhabited”, W ⊇ cons(X, Y), V ⊇ cons_1⁻¹(W)  ⟹  V ⊇ X
X is “inhabited”, W ⊇ cons(X, Y), U ⊇ cons_2⁻¹(W)  ⟹  U ⊇ Y
Simulating “Inhabited”
inhab
a
W a
dep
W
dep
inhab
inhab  dep inhab dep
Simulating “Inhabited”
inhab
inhab
X
W  cons( X , Y )
hd
Y
hd tl
tl
W
inhab
inhab  hd inhab hd tl inhab tl
Simulating “Provided I(Y) ≠ ∅”
inhab
W  cons( X , Y )
1
V  cons1 (W )
X
hd
dep
dep  hd tl inhab tl hd
Y
hd tl
W
hd
-1
provided I(Y) ≠ ∅
tl
V
-1
Themes
• Harnessing CFL-reachability
• Relationship to other analysis paradigms
• Exhaustive alg. vs. demand alg.
• Understanding complexity
– Linear . . . cubic . . . undecidable
• Beyond CFL-reachability
Exhaustive Versus Demand Analysis
• Exhaustive analysis: All facts at all points
• Optimization: Concentrate on inner loops
• Program-understanding tools: Only some facts
are of interest
Exhaustive Versus Demand Analysis
• Demand analysis:
– Does a given fact hold at a given point?
– Which facts hold at a given point?
– At which points does a given fact hold?
• Demand analysis via CFL-reachability
– single-source/single-target CFL-reachability
– single-source/multi-target CFL-reachability
– multi-source/single-target CFL-reachability
 x y
[Figure: exploded supergraph of the two-procedure example, with call and return edges labeled ( and ). “Semi-exhaustive”: all “appropriate” demands. Might b be uninitialized at printf(b)? NO! Might y be uninitialized at printf(y)? YES!]
Experimental Results
[Horwitz, Reps, & Sagiv 1995]
• 53 C programs (200-6,700 lines)
• For a single fact of interest:
– demand always better than exhaustive
• All “appropriate” demands beats exhaustive
when percentage of “yes” answers is high
– Live variables
– Truly live variables
– Constant predicates
–...
A Related Result
[Sagiv, Reps, & Horwitz 1996]
• [Uses a generalized analysis technique]
• 38 C programs (300-6,000 lines)
– copy-constant propagation
– linear-constant propagation
• All “appropriate” demands always beats
exhaustive
– factor of 1.14 to about 6
Exhaustive Versus Demand Analysis
• Demand algorithms for
– Interprocedural dataflow analysis
– Set constraints
– Points-to analysis
Demand Analysis and LP Queries (I)
• Flow-insensitive points-to analysis
– Does variable p point to q?
• Issue query: ?- pointsTo(p, q).
• Solve single-source/single-target L(pointsTo)-reachability problem
– What does variable p point to?
• Issue query: ?- pointsTo(p, Q).
• Solve single-source L(pointsTo)-reachability problem
– What variables point to q?
• Issue query: ?- pointsTo(P, q).
• Solve single-target L(pointsTo)-reachability problem
Demand Analysis and LP Queries (II)
• Flow-sensitive analysis
– Does a given fact f hold at a given point p?
?- dfFact(p, f).
– Which facts hold at a given point p?
?- dfFact(p, F).
– At which points does a given fact f hold?
?- dfFact(P, f).
• E.g., flow-sensitive points-to analysis
?- dfFact(p, pointsTo(x, Y)).
?- dfFact(P, pointsTo(x, y)).
etc.
Themes
• Harnessing CFL-reachability
• Relationship to other analysis paradigms
• Exhaustive alg. vs. demand alg.
• Understanding complexity
– Linear . . . cubic . . . undecidable
• Beyond CFL-reachability
Interprocedural Backward Slice
[Figure: schematic SDG with Enter main, two Call p sites, and Enter p; call and return edges are labeled ( ) at one call site and [ ] at the other]
[Figure: exploded supergraph of the two-procedure example with the same parenthesis labels on its call and return edges; y may be uninitialized at printf(y)]
Structure-Transmitted Dependences
[Reps1995]
McCarthy’s equations: car(cons(x,y)) = x
cdr(cons(x,y)) = y
w = cons(x,y);
v = car(w);
[Figure: edges x →hd w, y →tl w, and w →hd⁻¹ v]
Dependences + Matched Paths?
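[Figure: SDG-style graph carrying both kinds of labels. Enter main; w = cons(x,y) with edges x →hd w and y →tl w; two Call p sites passing w, with call and return edges labeled ( ) and [ ]; Enter p; v = car(w) with an edge labeled hd⁻¹]
Undecidable!
[Reps, TOPLAS 00]
Interleaved Parentheses! Example: hd ( hd⁻¹ )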
Themes
• Harnessing CFL-reachability
• Relationship to other analysis paradigms
• Exhaustive alg. vs. demand alg.
• Understanding complexity
– Linear . . . cubic . . . undecidable
• Beyond CFL-reachability
CFL-Reachability via Dynamic
Programming
Grammar: A → B C
Graph: whenever there is a B-edge from x to y and a C-edge from y to z, add an A-edge from x to z
Beyond CFL-Reachability:
Composition of Linear Functions
[Figure: an edge labeled λx.3x+5 followed by an edge labeled λx.2x+1, summarized by a single edge labeled λx.6x+11]
(λx.2x+1) ∘ (λx.3x+5) = λx.6x+11
Beyond CFL-Reachability:
Composition of Linear Functions
• Interprocedural constant propagation
– [Sagiv, Reps, & Horwitz TCS 96]
• Interprocedural path profiling
– The number of path fragments contributed
by a procedure is a function
– [Melski & Reps CC 99]
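The composition shown two slides back, (λx.2x+1) ∘ (λx.3x+5) = λx.6x+11, works because a linear function is determined by its two coefficients, so an edge label can be stored as a pair and composed in constant time. A minimal Python sketch follows; the class name `Linear` and the method name `after` are hypothetical.

```python
from typing import NamedTuple

class Linear(NamedTuple):
    """The function x -> a*x + b, stored as its two coefficients."""
    a: int
    b: int

    def __call__(self, x):
        return self.a * x + self.b

    def after(self, g):
        """Return self ∘ g, i.e. x -> self(g(x)), again as coefficients."""
        return Linear(self.a * g.a, self.a * g.b + self.b)

if __name__ == "__main__":
    f = Linear(2, 1)          # λx.2x+1
    g = Linear(3, 5)          # λx.3x+5
    print(f.after(g))         # coefficients of λx.6x+11
    print(f.after(g)(4) == f(g(4)))
```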
Model-Checking of Recursive HFSMs
[Benedikt, Godefroid, & Reps (in prep.)]
• Non-recursive HFSMs [Alur & Yannakakis 98]
• Ordinary FSMs
– T-reachability/circularity queries
• Recursive HFSMs
– Matched-parenthesis T-reachability/circularity
• Key observation: Linear-time algorithms for
matched-parenthesis T-reachability/cyclicity
– Single-entry/multi-exit [or multi-entry/single-exit]
– Deterministic, multi-entry/multi-exit
T-Cyclicity in
Hierarchical Kripke Structures
SN/SX
MN/SX
non-rec: O(|k|) non-rec: O(|k|)
?
rec: O(|k|^3)
rec: ?
SN/SX
O(|k|)
SN/MX
SN/MX
O(|k|)
MN/SX
O(|k|)
MN/MX
?
MN/MX
O(|k|^3)
O(|k||t|) [lin rec]
O(|k|) [det]
Recursive HFSMs: Data Complexity
SN/SX
SN/MX
MN/SX
LTL non-rec: O(|k|) non-rec: O(|k|) ?
rec: P-time
rec: ?
MN/MX
?
CTL O(|k|)
bad
CTL* O(|k|2) [L2] bad
bad
bad
?
?
Recursive HFSMs: Data Complexity
SN/SX
LTL O(|k|)
SN/MX
O(|k|)
MN/SX
O(|k|)
CTL O(|k|)
CTL* O(|k|)
bad
bad
O(|k|)
O(|k|)
Not Dual Problems!
MN/MX
O(|k|^3)
O(|k||t|) [lin rec]
O(|k|) [det]
bad
bad
CFL-Reachability: Scope of Applicability
• Static analysis
– Slicing, DFA, structure-transmitted dep.,
points-to analysis
• Verification
– Security of crypto-based protocols for
distributed systems [Dolev, Even, & Karp 83]
– Model-checking recursive HFSMs
• Formal-language theory
– CF-, 2DPDA-, 2NPDA-recognition
– Attribute-grammar analysis
CFL-Reachability: Benefits
• Algorithms
– Exhaustive & demand
• Complexity
– Linear-time and cubic-time algorithms
– PTIME-completeness
– Variants that are undecidable
• Complementary to
– Equations
– Set constraints
– Types
–...
Most Significant Contributions: 1987-2000
• Asymptotically fastest algorithms
– Interprocedural slicing
– Interprocedural dataflow analysis
• Demand algorithms
– Interprocedural dataflow analysis [CC94,FSE95]
– All “appropriate” demands beats exhaustive
• Tool for slicing and browsing ANSI C
– Slices programs as large as 75,000 lines
– University research distribution
– Commercial product: CodeSurfer
(GrammaTech, Inc.)
Most Significant Contributions: 1987-2000
• Unifying conceptual model
– [Kou 77], [Holley&Rosen 81], [Cooper&Kennedy 88],
[Callahan 88], [Horwitz,Reps,&Binkley 88], . . .
• Identifies fundamental bottlenecks
– Cubic-time “barrier”
– Litmus test: quadratic-time algorithm?!
– PTIME-complete ⇒ limits to parallelizability
• Existence proofs for new algorithms
– Demand algorithm for set constraints
– Demand algorithm for points-to analysis
References
• Papers by Reps and collaborators:
– http://www.cs.wisc.edu/~reps/
• CFL-reachability
– Yannakakis, M., Graph-theoretic methods in
database theory, PODS 90.
– Reps, T., Program analysis via graph
reachability, Inf. and Softw. Tech. 98.
References
• Slicing, chopping, etc.
– Horwitz, Reps, & Binkley, TOPLAS 90
– Reps, Horwitz, Sagiv, & Rosay, FSE 94
– Reps & Rosay, FSE 95
• Dataflow analysis
– Reps, Horwitz, & Sagiv, POPL 95
– Horwitz, Reps, & Sagiv, FSE 95, TR-1283
• Structure dependences; set constraints
– Reps, PEPM 95
– Melski & Reps, Theor. Comp. Sci. 00
References
• Complexity
– Undecidability: Reps, TOPLAS 00?
– PTIME-completeness: Reps, Acta Inf. 96.
• Verification
– Dolev, Even, & Karp, Inf & Control 82.
– Benedikt, Godefroid, & Reps, In prep.
• Beyond CFL-reachability
– Sagiv, Reps, Horwitz, Theor. Comp. Sci 96
– Melski & Reps, CC 99, TR-1382