Program Analysis via Graph Reachability
Thomas Reps
University of Wisconsin
http://www.cs.wisc.edu/~reps/
PLDI 00 Tutorial, Vancouver, B.C., June 18, 2000

PLDI 00 Registration Form
• PLDI 00: $ ____
• Tutorial (morning): $ ____
• Tutorial (afternoon): $ ____
• Tutorial (evening): $ – 0 –

Applications
• Program optimization
• Program-understanding and software-reengineering
• Security
  – information flow
• Verification
  – model checking
  – security of crypto-based protocols for distributed systems

[Timeline figure, 1987-1998: Slicing & Applications; Dataflow Analysis; Demand Algorithms; CFL-Reachability; Structure-Transmitted Dependences; Set Constraints]

. . . As Well As . . .
• Flow-insensitive points-to analysis
• Complexity results
  – Linear . . . cubic . . . undecidable variants
  – PTIME-completeness
• Model checking of recursive hierarchical finite-state machines
  – “infinite”-state systems
  – linear-time and cubic-time algorithms

. . . And Also
• Analysis of attribute grammars
• Security of crypto-based protocols for distributed systems [Dolev, Even, & Karp 83]
• Formal-language problems
  – CFL-recognition (given G and w, is w ∈ L(G)?)
  – 2DPDA- and 2NPDA-simulation
    • Given M and w, is w ∈ L(M)?
    • String-matching problems

Unifying Conceptual Model for Dataflow-Analysis Literature
• Linear-time gen-kill [Hecht 76], [Kou 77]
• Path-constrained DFA [Holley & Rosen 81]
• Linear-time GMOD [Cooper & Kennedy 88]
• Flow-sensitive MOD [Callahan 88]
• Linear-time interprocedural gen-kill [Knoop & Steffen 93]
• Linear-time bidirectional gen-kill [Dhamdhere 94]
• Relationship to interprocedural DFA [Sharir & Pnueli 81], [Knoop & Steffen 92]

Collaborators
• Susan Horwitz
• Mooly Sagiv
• Genevieve Rosay
• David Melski
• David Binkley
• Michael Benedikt
• Patrice Godefroid

Themes
• Harnessing CFL-reachability
• Relationship to other analysis paradigms
• Exhaustive alg. vs. demand alg.
• Understanding complexity
  – Linear . . . cubic . . . undecidable
• Beyond CFL-reachability

Backward Slice
    int main() {
        int sum = 0;
        int i = 1;
        while (i < 11) {
            sum = sum + i;
            i = i + 1;
        }
        printf("%d\n",sum);
        printf("%d\n",i);
    }
Backward slice with respect to “printf("%d\n",i)”
[The slide highlights the statements in the slice; the highlighting is not preserved in this text.]

Slice Extraction
    int main() {
        int i = 1;
        while (i < 11) {
            i = i + 1;
        }
        printf("%d\n",i);
    }
Backward slice with respect to “printf("%d\n",i)”

Forward Slice
(same program as under Backward Slice)
Forward slice with respect to “sum = 0”

What Are Slices Useful For?
• Understanding Programs
  – What is affected by what?
• Restructuring Programs
  – Isolation of separate “computational threads”
• Program Specialization and Reuse
  – Slices = specialized programs
  – Only reuse needed slices
• Program Differencing
  – Compare slices to identify changes
• Testing
  – What new test cases would improve coverage?
  – What regression tests must be rerun after a change?
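The backward and forward slices of the sum program can be sketched as plain graph reachability over dependence edges. This is only a minimal illustration: the edge list below was written by hand for this example (it is an assumption, not the output of a real dependence analyzer), and the names `EDGES`, `reachable`, `backward_slice`, and `forward_slice` are hypothetical.

```python
from collections import defaultdict

# Hand-written dependence edges for the sum program (an assumption made for
# illustration). An edge (p, q) means "q depends on p".
EDGES = [
    # control dependences
    ("Enter", "sum = 0"), ("Enter", "i = 1"), ("Enter", "while (i < 11)"),
    ("Enter", "printf(sum)"), ("Enter", "printf(i)"),
    ("while (i < 11)", "sum = sum + i"), ("while (i < 11)", "i = i + 1"),
    # flow dependences
    ("sum = 0", "sum = sum + i"), ("sum = 0", "printf(sum)"),
    ("i = 1", "while (i < 11)"), ("i = 1", "sum = sum + i"),
    ("i = 1", "i = i + 1"), ("i = 1", "printf(i)"),
    ("sum = sum + i", "sum = sum + i"), ("sum = sum + i", "printf(sum)"),
    ("i = i + 1", "while (i < 11)"), ("i = i + 1", "sum = sum + i"),
    ("i = i + 1", "i = i + 1"), ("i = i + 1", "printf(i)"),
]

SUCC = defaultdict(list)  # p -> statements that depend on p
PRED = defaultdict(list)  # q -> statements that q depends on
for p, q in EDGES:
    SUCC[p].append(q)
    PRED[q].append(p)


def reachable(start, neighbors):
    """Return every node reachable from start by following neighbors."""
    seen = {start}
    work = [start]
    while work:
        n = work.pop()
        for m in neighbors[n]:
            if m not in seen:
                seen.add(m)
                work.append(m)
    return seen


def backward_slice(stmt):
    """Statements that may affect stmt (follow dependence edges backward)."""
    return reachable(stmt, PRED)


def forward_slice(stmt):
    """Statements that stmt may affect (follow dependence edges forward)."""
    return reachable(stmt, SUCC)
```

With these edges, `backward_slice("printf(i)")` contains exactly the statements kept on the Slice Extraction slide (plus the Enter node), and `forward_slice("sum = 0")` stays within the statements that mention `sum`.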
Line-Character-Count Program
    void line_char_count(FILE *f) {
        int lines = 0;
        int chars;
        BOOL eof_flag = FALSE;
        int n;
        extern void scan_line(FILE *f, BOOL *bptr, int *iptr);
        scan_line(f, &eof_flag, &n);
        chars = n;
        while (eof_flag == FALSE) {
            lines = lines + 1;
            scan_line(f, &eof_flag, &n);
            chars = chars + n;
        }
        printf("lines = %d\n", lines);
        printf("chars = %d\n", chars);
    }

Character-Count Program
(same listing, with the function named char_count; the slide highlights the slice, and the highlighting is not preserved in this text)

Line-Count Program
(same listing, with the function named line_count and scan_line replaced by scan_line2; highlighting not preserved)

Specialization Via Slicing
• wc -lc is specialized to wc -c and wc -l
• Not partial evaluation!
void line_count(FILE *f);

Control Flow Graph
    int main() {
        int sum = 0;
        int i = 1;
        while (i < 11) {
            sum = sum + i;
            i = i + 1;
        }
        printf("%d\n",sum);
        printf("%d\n",i);
    }
[Figure: CFG with nodes Enter, sum = 0, i = 1, while(i < 11), sum = sum + i, i = i + 1, printf(sum), printf(i); the while node has T and F out-edges]

Flow Dependence Graph
(same program)
• Flow dependence p → q: the value of a variable assigned at p may be used at q.
[Figure: flow-dependence edges among the nodes listed above]

Control Dependence Graph
(same program)
• Control dependence p →T q: q is reached from p if condition p is true (T), not otherwise.
• Similar for false (F).
[Figure: control-dependence edges, each labeled T, from Enter and from while(i < 11)]

Program Dependence Graph (PDG)
(same program)
• Control dependence edges together with flow dependence edges
[Figure: PDG for the sum program]

Program Dependence Graph (PDG)
    int main() {
        int i = 1;
        int sum = 0;
        while (i < 11) {
            sum = sum + i;
            i = i + 1;
        }
        printf("%d\n",sum);
        printf("%d\n",i);
    }
• Opposite order, same PDG

Backward Slice (1) to (4)
(same program)
[Figure, four build steps: the backward slice is traced over the PDG, starting at printf(i)]

Slice Extraction
    int main() {
        int i = 1;
        while (i < 11) {
            i = i + 1;
        }
        printf("%d\n",i);
    }
[Figure: remaining PDG nodes Enter, i = 1, while(i < 11), i = i + 1, printf(i)]

CodeSurfer
[Screenshot]

Browsing a Dependence Graph
• Pretend this is your favorite browser
• What does clicking on a link do?
  – You get a new page
  – Or you move to an internal tag

Interprocedural Slice
    int add(int x, int y) {
        return x + y;
    }

    int main() {
        int sum = 0;
        int i = 1;
        while (i < 11) {
            sum = add(sum,i);
            i = add(i,1);
        }
        printf("%d\n",sum);
        printf("%d\n",i);
    }
Backward slice with respect to “printf("%d\n",i)”
• Superfluous components included by Weiser’s slicing algorithm [TSE 84]
• Left out by algorithm of Horwitz, Reps, & Binkley [PLDI 88; TOPLAS 90]

System Dependence Graph (SDG)
[Figure: Enter main, two Call p nodes, Enter p]

SDG for the Sum Program
[Figure: nodes Enter main, sum = 0, i = 1, while(i < 11), printf(sum), printf(i); first Call add with xin = sum, yin = i, sum = xout; second Call add with xin = i, yin = 1, i = xout; Enter add with x = xin, y = yin, x = x + y, xout = x]

Interprocedural Backward Slice (1) to (3)
[Figure, three build steps over the SDG skeleton: Enter main, two Call p nodes, Enter p]
Interprocedural Backward Slice (4) to (7)
[Figure, further build steps over the SDG skeleton: Enter main, two Call p nodes, Enter p. The edges into and out of p are labeled ( and ) for one call site and [ and ] for the other.]
• Matched-Parenthesis Path: a path whose ( ) and [ ] labels balance. A path that enters through ( and leaves through ] is not matched.

Slice Extraction
[Figure: Enter main, Call p, Enter p]

Slice of the Sum Program
[Figure: nodes Enter main, i = 1, while(i < 11), printf(i); Call add with xin = i, yin = 1, i = xout; Enter add with x = xin, y = yin, x = x + y, xout = x]

CFL-Reachability [Yannakakis 90]
• G: Graph (N nodes, E edges)
• L: A context-free language
• L-path from s to t iff there is a path s →* t whose edge labels spell a word α ∈ L
• Running time: O(N^3)

Interprocedural Slicing via CFL-Reachability
• Graph: System dependence graph
• L: L(matched) [roughly]
• Node m is in the slice w.r.t. n iff there is an L(matched)-path from m to n

Asymptotic Running Time [Reps, Horwitz, Sagiv, & Rosay 94]
• CFL-reachability
  – System dependence graph: N nodes, E edges
  – Running time: O(N^3)
• System dependence graph: special structure
  – Running time: O(E + CallSites × MaxParams^3)

Ordinary Graph Reachability vs. CFL-Reachability
    matched → matched matched
            | ( matched )
            | [ matched ]
            | e
            | ε
[Figure: example graph from s to t with edges labeled e, (, ), [ and ], contrasting ordinary graph reachability with CFL-reachability]

CFL-Reachability via Dynamic Programming
• Grammar: A → B C
• Graph: a B-edge followed by a C-edge induces an A-edge that spans both

Degenerate Case: CFL-Recognition
    exp → id | exp + exp | exp * exp | ( exp )
• “(a + b) * c” ∈ L(exp)?
[Figure: chain graph from s to t whose edges are labeled ( a + b ) * c]

Degenerate Case: CFL-Recognition
    exp → id | exp + exp | exp * exp | ( exp )
• “a + b) * c +” ∈ L(exp)?
[Figure: chain graph from s to t whose edges are labeled a + b ) * c +]

CYK: Context-Free Recognition
    M → M M | ( M ) | [ M ] | ( ) | [ ]
• w = “( [ ] ) [ ]”
• Is w ∈ L(M)?

CYK: Context-Free Recognition
    M → M M | LPM ) | LBM ] | ( ) | [ ]
    LPM → ( M
    LBM → [ M
• Is “( [ ] ) [ ]” ∈ L(M)?
CYK table (rows indexed by substring length, columns by start position 1 to 6):
    length 1: {(} {[} {]} {)} {[} {]}
    length 2: {M} at start 2 and {M} at start 5    (M → [ ])
    length 3: {LPM} at start 1                     (LPM → ( M)
    length 4: {M} at start 1                       (M → LPM ))
    length 6: {M} at start 1                       (M → M M)
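The dynamic-programming view (a B-edge followed by a C-edge induces an A-edge) can be sketched as a small worklist procedure. This is a minimal sketch under stated assumptions, not the algorithm as presented in the tutorial: the grammar must already be normalized so that every right-hand side has one or two symbols, empty productions are not handled, and the function names `cfl_reach` and `recognize` are hypothetical. It is run here on the chain graph for “( [ ] ) [ ]” with the normalized grammar from the CYK slide.

```python
from collections import defaultdict


def cfl_reach(edges, unary, binary):
    """Close a labeled graph under the productions of a normalized grammar.

    edges:  iterable of (u, label, v) triples
    unary:  list of (A, B) for productions A -> B
    binary: list of (A, B, C) for productions A -> B C
    Returns the set of all derivable (u, label, v) triples.
    """
    facts = set()
    work = []
    out_by = defaultdict(set)  # (u, label) -> targets v
    in_by = defaultdict(set)   # (v, label) -> sources u

    def add(u, lab, v):
        if (u, lab, v) not in facts:
            facts.add((u, lab, v))
            out_by[(u, lab)].add(v)
            in_by[(v, lab)].add(u)
            work.append((u, lab, v))

    for u, lab, v in edges:
        add(u, lab, v)

    while work:
        u, lab, v = work.pop()
        for a, b in unary:
            if b == lab:
                add(u, a, v)
        for a, b, c in binary:
            if b == lab:  # this edge is the left part: look for C-edges out of v
                for w in list(out_by[(v, c)]):
                    add(u, a, w)
            if c == lab:  # this edge is the right part: look for B-edges into u
                for w in list(in_by[(u, b)]):
                    add(w, a, v)
    return facts


# Normalized grammar from the CYK slide.
BINARY = [
    ("M", "M", "M"), ("M", "LPM", ")"), ("M", "LBM", "]"),
    ("M", "(", ")"), ("M", "[", "]"),
    ("LPM", "(", "M"), ("LBM", "[", "M"),
]


def recognize(s):
    """CFL-recognition as reachability on the chain graph 0 -> 1 -> ... -> len(s)."""
    chain = [(i, ch, i + 1) for i, ch in enumerate(s)]
    return (0, "M", len(s)) in cfl_reach(chain, [], BINARY)
```

Calling `recognize("([])[]")` derives the same M and LPM edges that the CYK table records, now as edges between positions of the chain. The same `cfl_reach` call works unchanged on a graph that is not a chain, which is the step from recognition to reachability.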
CYK: Graphs vs. Tables
• Is “( [ ] ) [ ]” ∈ L(M)?
    M → M M | LPM ) | LBM ] | ( ) | [ ]
    LPM → ( M
    LBM → [ M
[Figure: chain graph from s to t with edges labeled ( [ ] ) [ ], plus the induced M and LPM edges]

Dynamic Transitive Closure ?!
• Aiken et al.
  – Set-constraint solvers
  – Points-to analysis
• Henglein et al.
  – type inference
• But a CFL captures a non-transitive reachability relation [Valiant 75]

Program Chopping
• Given source S and target T, what program points transmit effects from S to T?
• Intersect forward slice from S with backward slice from T, right?

Non-Transitivity and Slicing
    int add(int x, int y) {
        return x + y;
    }

    int main() {
        int sum = 0;
        int i = 1;
        while (i < 11) {
            sum = add(sum,i);
            i = add(i,1);
        }
        printf("%d\n",sum);
        printf("%d\n",i);
    }
[The slide is repeated with different highlighting, not preserved in this text:]
• Forward slice with respect to “sum = 0”
• Backward slice with respect to “printf("%d\n",i)”
• Forward slice with respect to “sum = 0” together with backward slice with respect to “printf("%d\n",i)”
• Chop with respect to “sum = 0” and “printf("%d\n",i)”

Non-Transitivity and Slicing
[Figure: SDG for the sum program, with a ( edge into add from one call site and a ] edge out of add to the other call site marked]

Program Chopping
• Given source S and target T, what program points transmit effects from S to T?
• “Precise interprocedural chopping” [Reps & Rosay FSE 95]

CF-Recognition vs. CFL-Reachability
• CF-Recognition
  – Chain graphs
  – General grammar: sub-cubic time [Valiant 75]
  – LL(1), LR(1): linear time
• CFL-Reachability
  – General graphs: O(N^3)
  – LL(1): O(N^3)
  – LR(1): O(N^3)
  – Certain kinds of graphs: O(N+E) (Gen/kill IDFA)
  – Regular languages: O(N+E) (GMOD IDFA)

Regular-Language Reachability [Yannakakis 90]
• G: Graph (N nodes, E edges)
• L: A regular language
• L-path from s to t iff there is a path s →* t whose edge labels spell a word α ∈ L
• Running time: O(N+E) vs. O(N^3)
• Ordinary reachability (= transitive closure)
  – Label each edge with e
  – L is e*

Security of Crypto-Based Protocols for Distributed Systems
• “Ping-pong” protocols
  (1) X --Encrypt_Y(M X)--> Y
  (2) Y --Encrypt_X(M)--> X
• [Dolev & Yao 83]
  – O(N^8) algorithm
• [Dolev, Even, & Karp 83]
  – Less well known than [Dolev & Yao 83]
  – O(N^3) algorithm

[Dolev, Even, & Karp 83]
[Figure: graph with edges labeled Id, Encrypt_X, Decrypt_X, E_Y, A_X, A_Z; annotations “Message”, “Saboteur”, and “?”]

Relationship to Other Analysis Paradigms
• Dataflow analysis
  – reachability versus equation solving
• Deduction
• Set constraints

Dataflow Analysis
• Goal: For each point in the program, determine a superset of the “facts” that could possibly hold during execution
• Examples
  – Constant propagation
  – Reaching definitions
  – Live variables
  – Possibly uninitialized variables

Useful For . . .
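• Optimizing compilers
• Parallelizing compilers
• Tools that detect possible logical errors
• Tools that show the effects of a proposed modification

Possibly Uninitialized Variables
[Figure: control-flow graph with nodes Start, x = 3, if . . . , y = x, y = w, w = 8, printf(y). Each node carries a transfer function, and program points are annotated with sets such as {}, {w,x,y}, {w,y}, {w}.]
• Start: λV.{w, x, y}
• x = 3: λV.V − {x}
• y = x: λV. if x ∈ V then V ∪ {y} else V − {y}
• y = w: λV. if w ∈ V then V ∪ {y} else V − {y}
• w = 8: λV.V − {w}
• Other nodes: λV.V

Precise Intraprocedural Analysis
• Path p from start to n with edge functions f_1, f_2, . . . , f_(k-1), f_k; C is the value at start
• pf_p = f_k ∘ f_(k-1) ∘ . . . ∘ f_2 ∘ f_1
• MOP[n] = ⊔ { pf_p(C) | p ∈ PathsTo[n] }, where ⊔ combines the values over all paths (set union for possibly uninitialized variables)

[Figure: supergraph. main: start main, x = 3, p(x,y), return from p, printf(y), exit main. p: start p(a,b), if . . . , b = a, p(a,b), return from p, printf(b), exit p. Call and return edges are labeled ( ) for the call in main and [ ] for the recursive call in p.]

Precise Interprocedural Analysis
[Figure: path from start through call q (labeled “(”), start q, exit q, and ret (labeled “)”) to n, with edge functions f_1, . . . , f_k]
• MOMP[n] = ⊔ { pf_p(C) | p ∈ MatchedPathsTo[n] }
• [Sharir & Pnueli 81]

Representing Dataflow Functions
• Identity function: f = λV.V, so f({a, b}) = {a, b}
• Constant function: f = λV.{b}, so f({a, b}) = {b}
[Figure: each function drawn as a graph from one row of nodes a b c to a second row a b c]

Representing Dataflow Functions
• “Gen/Kill” function: f = λV.(V − {b}) ∪ {c}, so f({a, b}) = {a, c}
• Non-“Gen/Kill” function: f = λV. if a ∈ V then V ∪ {b} else V − {b}, so f({a, b}) = {a, b}
[Figure: each function drawn as a graph from a b c to a b c]

[Figure: exploded supergraph for the program above, with columns for x, y in main and a, b in p]

Composing Dataflow Functions
• f_1 = λV. if a ∈ V then V ∪ {b} else V − {b}
• f_2 = λV. if b ∈ V then {c} else ∅
• f_2 ∘ f_1({a, c}) = {c}
[Figure: the graphs of f_1 and f_2 placed end to end over a b c]

[Figure: exploded supergraph with call/return edges labeled ( ) and [ ]]
• Might y be uninitialized here? (at printf(y)) NO!
• Might b be uninitialized here? (at printf(b)) YES!

    matched → matched matched
            | (_i matched )_i      1 ≤ i ≤ CallSites
            | edge
            | ε
[Figure: call-stack illustration annotated “Off Limits!”; example string ( ) ( ) ( ) ( ( ( ) ) )]

    unbalLeft → matched unbalLeft
              | (_i unbalLeft      1 ≤ i ≤ CallSites
              | ε
[Figure: call-stack illustration annotated “Off Limits!”]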
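Membership of a path's label string in L(matched) or L(unbalLeft) can be checked with an explicit stack of pending call sites. The sketch below is only an illustration of the two grammars above: the label encoding (the string "e" for an ordinary edge, a pair such as ("(", 1) for the call edge of call site 1) and the function names are assumptions made for this example.

```python
def pending_calls(labels):
    """Return the stack of unreturned call sites, or None if some return
    edge does not match the most recent pending call.

    labels: sequence of "e" (ordinary edge), ("(", i) (call edge of call
    site i), or (")", i) (return edge of call site i).
    """
    stack = []
    for lab in labels:
        if lab == "e":
            continue
        kind, site = lab
        if kind == "(":
            stack.append(site)
        elif kind == ")":
            if not stack or stack[-1] != site:
                return None  # return edge with no matching pending call
            stack.pop()
        else:
            raise ValueError(f"unknown label: {lab!r}")
    return stack


def is_unbal_left(labels):
    """True iff the label string is in L(unbalLeft): every return matches,
    but some calls may still be pending at the end of the path."""
    return pending_calls(labels) is not None


def is_matched(labels):
    """True iff the label string is in L(matched): every return matches
    and no call is left pending."""
    return pending_calls(labels) == []
```

For example, a path that enters a procedure from call site 1 and returns to call site 2 has labels `[("(", 1), "e", (")", 2)]` and is rejected by both checks, which is exactly the kind of path the grammars rule out.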
Example of an unbalanced-left string (from the figure): ( ) ( ) ( ( ( ) ) ( ( )

Interprocedural Dataflow Analysis via CFL-Reachability
• Graph: Exploded control-flow graph
• L: L(unbalLeft)
• Fact d holds at n iff there is an L(unbalLeft)-path from ⟨start main, Λ⟩ to ⟨n, d⟩

Asymptotic Running Time [Reps, Horwitz, & Sagiv 95]
• CFL-reachability
  – Exploded control-flow graph: ND nodes
  – Running time: O(N^3 D^3)
• Exploded control-flow graph: special structure
  – Running time: O(E D^3)
  – Typically E ≈ N, hence O(E D^3) ≈ O(N D^3)
  – “Gen/kill” problems: O(ED)

Why Bother? “We’re only interested in million-line programs”
• Know thy enemy!
  – “Any” algorithm must do these operations
  – Avoid pitfalls (e.g., claiming O(N^2) algorithm)
• The essence of “context sensitivity”
• Special cases
  – “Gen/kill” problems: O(ED)
• Compression techniques
  – Basic blocks
  – SSA form, sparse evaluation graphs
• Demand algorithms

The Need for Pointer Analysis
    int add(int x, int y)
    {
        return x + y;
    }

    int main() {
        int sum = 0;
        int i = 1;
        int *p = &sum;
        int *q = &i;
        int (*f)(int,int) = add;
        while (*q < 11) {
            *p = (*f)(*p,*q);
            *q = (*f)(*q,1);
        }
        printf("%d\n",*p);
        printf("%d\n",*q);
    }

The Need for Pointer Analysis
(same declarations; the pointer dereferences and the indirect call are resolved)
        while (i < 11) {
            sum = add(sum,i);
            i = add(i,1);
        }
        printf("%d\n",sum);
        printf("%d\n",i);

Flow-Sensitive Points-To Analysis
[Figure: points-to graphs before and after each of the four statement forms p = &q; p = q; p = *q; *p = q; over nodes p, q, r1, r2, s1, s2, s3]

Flow-Sensitive vs. Flow-Insensitive
[Figure: flow-sensitive: statements 1 to 5 arranged in a control-flow graph between start main and exit main; flow-insensitive: statements 1 to 5 without control-flow order]

Flow-Insensitive Points-To Analysis
[Andersen 94, Shapiro & Horwitz 97]
[Figure: the points-to edges added by each of the four statement forms p = &q; p = q; p = *q; *p = q; over nodes p, q, r1, r2, s1, s2, s3]

Flow-Insensitive Points-To Analysis
    a = &e;
    b = a;
    c = &f;
    *b = c;
    d = *a;
[Figure: points-to graph over nodes a, b, c, d, e, f]

Flow-Insensitive Points-To Analysis
• Andersen [Thesis 94]
  – Formulated using set constraints
  – Cubic-time algorithm
• Shapiro & Horwitz (1995; [POPL 97])
  – Re-formulated as a graph-grammar problem
• Reps (1995; [unpublished])
  – Re-formulated as a Horn-clause program
• Melski (1996; see [Reps, IST98])
  – Re-formulated via CFL-reachability

CFL-Reachability = Chain Programs
• Grammar: A → B C
• Graph: a B-edge from x to y followed by a C-edge from y to z induces an A-edge from x to z
• Chain rule: a(X,Z) :- b(X,Y), c(Y,Z).

Base Facts for Points-To Analysis
    p = &q;     assignAddr(p,q).
    p = q;      assign(p,q).
    p = *q;     assignStar(p,q).
    *p = q;     starAssign(p,q).

Rules for Points-To Analysis (I)
    p = &q;     pointsTo(P,Q) :- assignAddr(P,Q).
    p = q;      pointsTo(P,R) :- assign(P,Q), pointsTo(Q,R).

Rules for Points-To Analysis (II)
    p = *q;     pointsTo(P,S) :- assignStar(P,Q),pointsTo(Q,R),pointsTo(R,S).
    *p = q;     pointsTo(R,S) :- starAssign(P,Q),pointsTo(P,R),pointsTo(Q,S).

Creating a Chain Program
(Reversed relations, which carry an overbar on the slide, are written here with a ^-1 suffix.)
    *p = q;
    pointsTo(R,S) :- starAssign(P,Q),pointsTo(P,R),pointsTo(Q,S).
    pointsTo(R,S) :- pointsTo(P,R),starAssign(P,Q),pointsTo(Q,S).
    pointsTo(R,S) :- pointsTo^-1(R,P),starAssign(P,Q),pointsTo(Q,S).
    pointsTo^-1(R,P) :- pointsTo(P,R).

Base Facts for Points-To Analysis
    p = &q;     assignAddr(p,q).    assignAddr^-1(q,p).
    p = q;      assign(p,q).        assign^-1(q,p).
    p = *q;     assignStar(p,q).    assignStar^-1(q,p).
    *p = q;     starAssign(p,q).    starAssign^-1(q,p).

Creating a Chain Program
    pointsTo(P,Q) :- assignAddr(P,Q).
    pointsTo^-1(Q,P) :- assignAddr^-1(Q,P).
    pointsTo(P,R) :- assign(P,Q), pointsTo(Q,R).
    pointsTo^-1(R,P) :- pointsTo^-1(R,Q), assign^-1(Q,P).
    pointsTo(P,S) :- assignStar(P,Q),pointsTo(Q,R),pointsTo(R,S).
    pointsTo^-1(S,P) :- pointsTo^-1(S,R),pointsTo^-1(R,Q),assignStar^-1(Q,P).
    pointsTo(R,S) :- pointsTo^-1(R,P),starAssign(P,Q),pointsTo(Q,S).
    pointsTo^-1(S,R) :- pointsTo^-1(S,Q),starAssign^-1(Q,P),pointsTo(P,R).
(Reversed relations, which carry an overbar on the slide, are written with a ^-1 suffix.)

. . . and now to CFL-Reachability
    pointsTo    → assignAddr
    pointsTo^-1 → assignAddr^-1
    pointsTo    → assign pointsTo
    pointsTo^-1 → pointsTo^-1 assign^-1
    pointsTo    → assignStar pointsTo pointsTo
    pointsTo^-1 → pointsTo^-1 pointsTo^-1 assignStar^-1
    pointsTo    → pointsTo^-1 starAssign pointsTo
    pointsTo^-1 → pointsTo^-1 starAssign^-1 pointsTo

Structure-Transmitted Dependences [Reps 1995]
• McCarthy’s equations:
    car(cons(x,y)) = x
    cdr(cons(x,y)) = y
• Example:
    w = cons(x,y);
    v = car(w);
[Figure: dependence graph with an hd edge from x to w, a tl edge from y to w, and an hd^-1 edge from w to v, together with the induced dep edges]

Set Constraints
    w = cons(x,y);      W ⊇ cons(X, Y)
    v = car(w);         V ⊇ cons_1^-1(W)

McCarthy’s Equations Revisited
    cons_1^-1(cons(X, Y)) = X, provided I(Y) ≠ ∅

Semantics of Set Constraints
    I(cons(V1, V2)) = {cons(v1, v2) | v1 ∈ I(V1) and v2 ∈ I(V2)}
    I(cons_1^-1(V)) = {v1 | cons(v1, v2) ∈ I(V)}

CFL-Reachability versus Set Constraints
• Lazy languages: CFL-reachability is more natural
  – car(cons(X,Y)) = X
• Strict languages: Set constraints are more natural
  – car(cons(X,Y)) = X, provided I(Y) ≠ ∅
• But . . . SC and CFL-reachability are equivalent!
  – [Melski & Reps 97]

Solving Set Constraints
• If W ⊇ a, then W is “inhabited”
• If X is “inhabited”, Y is “inhabited”, and W ⊇ cons(X, Y), then W is “inhabited”
• If Y is “inhabited”, W ⊇ cons(X, Y), and V ⊇ cons_1^-1(W), then V ⊇ X
• If X is “inhabited”, W ⊇ cons(X, Y), and U ⊇ cons_2^-1(W), then U ⊇ Y

Simulating “Inhabited”
[Figure: W ⊇ a encoded with edges labeled a, dep, and inhab]

Simulating “Inhabited”
[Figure: W ⊇ cons(X, Y) encoded with hd and tl edges from X and Y to W, plus inhab edges]

Simulating “Provided I(Y) ≠ ∅”
[Figure: W ⊇ cons(X, Y) and V ⊇ cons_1^-1(W) encoded with hd, tl, hd^-1, dep, and inhab edges; the path from X to V passes through the inhab edge at Y]

Exhaustive Versus Demand Analysis
• Exhaustive analysis: All facts at all points
• Optimization: Concentrate on inner loops
• Program-understanding tools: Only some facts are of interest

Exhaustive Versus Demand Analysis
• Demand analysis:
  – Does a given fact hold at a given point?
  – Which facts hold at a given point?
  – At which points does a given fact hold?
• Demand analysis via CFL-reachability
  – single-source/single-target CFL-reachability
  – single-source/multi-target CFL-reachability
  – multi-source/single-target CFL-reachability

[Figure: exploded supergraph for main and p, as before]
• Might y be uninitialized here? (at printf(y)) NO!
• Might b be uninitialized here? (at printf(b)) YES!
• “Semi-exhaustive”: All “appropriate” demands

Experimental Results [Horwitz, Reps, & Sagiv 1995]
• 53 C programs (200-6,700 lines)
• For a single fact of interest:
  – demand always better than exhaustive
• All “appropriate” demands beats exhaustive when percentage of “yes” answers is high
  – Live variables
  – Truly live variables
  – Constant predicates
  – . . .
A Related Result [Sagiv, Reps, & Horwitz 1996]
• [Uses a generalized analysis technique]
• 38 C programs (300-6,000 lines)
  – copy-constant propagation
  – linear-constant propagation
• All “appropriate” demands always beats exhaustive
  – factor of 1.14 to about 6

Exhaustive Versus Demand Analysis
• Demand algorithms for
  – Interprocedural dataflow analysis
  – Set constraints
  – Points-to analysis

Demand Analysis and LP Queries (I)
• Flow-insensitive points-to analysis
  – Does variable p point to q?
    • Issue query: ?- pointsTo(p, q).
    • Solve single-source/single-target L(pointsTo)-reachability problem
  – What does variable p point to?
    • Issue query: ?- pointsTo(p, Q).
    • Solve single-source L(pointsTo)-reachability problem
  – What variables point to q?
    • Issue query: ?- pointsTo(P, q).
    • Solve single-target L(pointsTo)-reachability problem

Demand Analysis and LP Queries (II)
• Flow-sensitive analysis
  – Does a given fact f hold at a given point p?
    ?- dfFact(p, f).
  – Which facts hold at a given point p?
    ?- dfFact(p, F).
  – At which points does a given fact f hold?
    ?- dfFact(P, f).
• E.g., flow-sensitive points-to analysis
    ?- dfFact(p, pointsTo(x, Y)).
    ?- dfFact(P, pointsTo(x, y)).
    etc.

Interprocedural Backward Slice
[Figure: SDG skeleton with Enter main, two Call p nodes, Enter p; edges labeled ( ) and [ ]]

[Figure: exploded supergraph for main and p with call/return edges labeled ( ) and [ ]; annotation at printf(y): “y may be uninitialized here”]

Structure-Transmitted Dependences [Reps 1995]
• McCarthy’s equations:
    car(cons(x,y)) = x
    cdr(cons(x,y)) = y
• Example:
    w = cons(x,y);
    v = car(w);
[Figure: hd edge from x to w, tl edge from y to w, hd^-1 edge from w to v]

Dependences + Matched Paths?
[Figure: Enter main with w = cons(x,y) (hd edge from x, tl edge from y); two Call p nodes passing w, with edges labeled ( ) and [ ]; Enter p with v = car(w) (hd^-1 edge)]
• Undecidable! [Reps, TOPLAS 00]
• Interleaved Parentheses! (for example: hd ( hd^-1 ))
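The three query forms listed under Demand Analysis and LP Queries (I) can be tried out on the four points-to rules with a small fixpoint computation. To be clear about what this sketch is: it evaluates the rules exhaustively and then filters the result, so it is not a demand algorithm, and the function name `points_to` and the tuple encoding of facts are assumptions made for illustration. The input is the five-statement example a = &e; b = a; c = &f; *b = c; d = *a.

```python
def points_to(assign_addr, assign, assign_star, star_assign):
    """Naive fixpoint over the four flow-insensitive points-to rules.

    Each argument is a list of (p, q) pairs for statements of the form
    p = &q, p = q, p = *q, and *p = q respectively.
    Returns the set of (x, y) pairs such that x may point to y.
    """
    pts = set(assign_addr)  # pointsTo(P,Q) :- assignAddr(P,Q).
    changed = True
    while changed:
        new = set()
        # pointsTo(P,R) :- assign(P,Q), pointsTo(Q,R).
        for p, q in assign:
            new |= {(p, r) for (q2, r) in pts if q2 == q}
        # pointsTo(P,S) :- assignStar(P,Q), pointsTo(Q,R), pointsTo(R,S).
        for p, q in assign_star:
            for (q2, r) in pts:
                if q2 == q:
                    new |= {(p, s) for (r2, s) in pts if r2 == r}
        # pointsTo(R,S) :- starAssign(P,Q), pointsTo(P,R), pointsTo(Q,S).
        for p, q in star_assign:
            for (p2, r) in pts:
                if p2 == p:
                    new |= {(r, s) for (q2, s) in pts if q2 == q}
        changed = not new <= pts
        pts |= new
    return pts


# a = &e;  b = a;  c = &f;  *b = c;  d = *a;
PTS = points_to(
    assign_addr=[("a", "e"), ("c", "f")],
    assign=[("b", "a")],
    assign_star=[("d", "a")],
    star_assign=[("b", "c")],
)

does_a_point_to_e = ("a", "e") in PTS                      # ?- pointsTo(a, e).
targets_of_d = {y for (x, y) in PTS if x == "d"}           # ?- pointsTo(d, Q).
pointers_to_f = {x for (x, y) in PTS if y == "f"}          # ?- pointsTo(P, f).
```

A demand algorithm would answer each query by exploring only the part of the graph relevant to it (single-source, single-target, or both) rather than computing all of `PTS` first.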
Beyond CFL-Reachability: Composition of Linear Functions
[Figure: an edge labeled λx.3x+5 followed by an edge labeled λx.2x+1 induces an edge labeled λx.6x+11]
    (λx.2x+1) ∘ (λx.3x+5) = λx.6x+11

Beyond CFL-Reachability: Composition of Linear Functions
• Interprocedural constant propagation
  – [Sagiv, Reps, & Horwitz TCS 96]
• Interprocedural path profiling
  – The number of path fragments contributed by a procedure is a function
  – [Melski & Reps CC 99]

Model-Checking of Recursive HFSMs [Benedikt, Godefroid, & Reps (in prep.)]
• Non-recursive HFSMs [Alur & Yannakakis 98]
• Ordinary FSMs
  – T-reachability/circularity queries
• Recursive HFSMs
  – Matched-parenthesis T-reachability/circularity
• Key observation: Linear-time algorithms for matched-parenthesis T-reachability/cyclicity
  – Single-entry/multi-exit [or multi-entry/single-exit]
  – Deterministic, multi-entry/multi-exit

T-Cyclicity in Hierarchical Kripke Structures
[Table over the cases SN/SX, SN/MX, MN/SX, MN/MX. Two builds of the table are overlaid in the extraction and the cell positions are not recoverable. Cells as extracted:]
    SN/SX MN/SX | non-rec: O(|k|) | non-rec: O(|k|) | ? | rec: O(|k|^3) | rec: ? | SN/SX O(|k|) | SN/MX | SN/MX O(|k|) | MN/SX O(|k|) | MN/MX ? | MN/MX O(|k|^3), O(|k||t|) [lin rec], O(|k|) [det]

Recursive HFSMs: Data Complexity
[Table with columns SN/SX, SN/MX, MN/SX, MN/MX and rows LTL, CTL, CTL*. Cell positions are not recoverable. Cells as extracted:]
    LTL | non-rec: O(|k|) | non-rec: O(|k|) | ? | rec: P-time | rec: ? | MN/MX ? | CTL O(|k|) | bad | CTL* O(|k|^2) [L2] | bad | bad | bad | ? | ?

Recursive HFSMs: Data Complexity
[Later build of the same table. Cell positions are not recoverable. Cells as extracted:]
    SN/SX | LTL O(|k|) | SN/MX O(|k|) | MN/SX O(|k|) | CTL O(|k|) | CTL* O(|k|) | bad | bad | O(|k|) | O(|k|) | MN/MX O(|k|^3), O(|k||t|) [lin rec], O(|k|) [det] | bad | bad
• Not Dual Problems!

CFL-Reachability: Scope of Applicability
• Static analysis
  – Slicing, DFA, structure-transmitted dep., points-to analysis
• Verification
  – Security of crypto-based protocols for distributed systems [Dolev, Even, & Karp 83]
  – Model-checking recursive HFSMs
• Formal-language theory
  – CF-, 2DPDA-, 2NPDA-recognition
  – Attribute-grammar analysis

CFL-Reachability: Benefits
• Algorithms
  – Exhaustive & demand
• Complexity
  – Linear-time and cubic-time algorithms
  – PTIME-completeness
  – Variants that are undecidable
• Complementary to
  – Equations
  – Set constraints
  – Types
  – . . .

Most Significant Contributions: 1987-2000
• Asymptotically fastest algorithms
  – Interprocedural slicing
  – Interprocedural dataflow analysis
• Demand algorithms
  – Interprocedural dataflow analysis [CC94, FSE95]
  – All “appropriate” demands beats exhaustive
• Tool for slicing and browsing ANSI C
  – Slices programs as large as 75,000 lines
  – University research distribution
  – Commercial product: CodeSurfer (GrammaTech, Inc.)

Most Significant Contributions: 1987-2000
• Unifying conceptual model
  – [Kou 77], [Holley & Rosen 81], [Cooper & Kennedy 88], [Callahan 88], [Horwitz, Reps, & Binkley 88], . . .
• Identifies fundamental bottlenecks
  – Cubic-time “barrier”
  – Litmus test: quadratic-time algorithm?!
  – PTIME-complete: limits to parallelizability
• Existence proofs for new algorithms
  – Demand algorithm for set constraints
  – Demand algorithm for points-to analysis

References
• Papers by Reps and collaborators:
  – http://www.cs.wisc.edu/~reps/
• CFL-reachability
  – Yannakakis, M., Graph-theoretic methods in database theory, PODS 90.
  – Reps, T., Program analysis via graph reachability, Inf. and Softw. Tech. 98.

References
• Slicing, chopping, etc.
  – Horwitz, Reps, & Binkley, TOPLAS 90
  – Reps, Horwitz, Sagiv, & Rosay, FSE 94
  – Reps & Rosay, FSE 95
• Dataflow analysis
  – Reps, Horwitz, & Sagiv, POPL 95
  – Horwitz, Reps, & Sagiv, FSE 95, TR-1283
• Structure dependences; set constraints
  – Reps, PEPM 95
  – Melski & Reps, Theor. Comp. Sci. 00

References
• Complexity
  – Undecidability: Reps, TOPLAS 00?
  – PTIME-completeness: Reps, Acta Inf. 96.
• Verification
  – Dolev, Even, & Karp, Inf & Control 82.
  – Benedikt, Godefroid, & Reps, In prep.
• Beyond CFL-reachability
  – Sagiv, Reps, Horwitz, Theor. Comp. Sci 96
  – Melski & Reps, CC 99, TR-1382