ANALYSIS OF PROG. LANG.
PROGRAM ANALYSIS
Instructors: Crista Lopes
Copyright © Instructors.

Motivation(s)
  Where do you see program analysis (PA) in your everyday life?
  How does PA "work"?
  What is PA anyway?

Auto-completion

Pre-compilation error detection
  Ex: missing parenthesis

How do you know ...
    int a;
    increment_a() { a++; }
    while (true) {
        String a = "hello";
        increment_a();
    }
  This "a" is not that "a"

How do you remember ...
    int a;                      "a" is of type int (FYI...)
    increment_a() { a++; }
    while (true) {
        String a = "hello";
        increment_a();
    }
  Wait, what's the type of "a" again?

Outline
  Introduction/motivations
  Program representation
    AST
    3-address code
  Control flow analysis
  Data flow

Intermediate Representation (IR)
  Initial Point
  Abstract Syntax Tree
    Abstract vs Concrete Syntax
    Parse Tree vs Abstract Syntax Tree
  Three-address Codes

IR-1 Starting Point
  Source code -> (Parsing, Lexical Analysis) -> Intermediate representation -> (Code Generation, Optimization) -> Target code -> Code Execution
  Analyze IR: perform analysis on the results
  Use this information for applications

IR-2. Abstract Syntax Tree (AST)
  Concrete vs Abstract Syntax
    Concrete syntax shows structure and is language-specific
    Abstract syntax shows structure
  Representations
    Parse Tree represents Concrete Syntax
    Abstract Syntax Tree represents Abstract Syntax

IR-2. Example: Grammar
  Example
    1. a := b+c     (Language 1)
    2. a = b+c;     (Language 2)
  Grammar for 1
    stmtlist -> stmt | stmt stmtlist
    stmt     -> assign | if-then | ...
    assign   -> ident ":=" ident binop ident
    binop    -> "+" | "-" | ...
  Grammar for 2
    stmtlist -> stmt ";" | stmt ";" stmtlist
    stmt     -> assign | if-then | ...
    assign   -> ident "=" ident binop ident
    binop    -> "+" | "-" | ...

IR-2. Example: Parse Tree
  Parse tree for a:=b+c
    stmtlist
      stmt
        assign
          ident  a
          ":="
          ident  b
          binop  "+"
          ident  c
  Parse tree for a=b+c;
    stmtlist
      stmt
        assign
          ident  a
          "="
          ident  b
          binop  "+"
          ident  c
      ";"

IR-2. Example: Abstract Syntax Tree
  Abstract syntax tree for both 1 (a:=b+c) and 2 (a=b+c;)
    assign
      a
      add
        b
        c

IR-3. Three Address Code
  General form: x = y op z
  More generally: (operator, operand1, operand2, result)
    (at most 3 spots besides the operator)
  May include temporary variables
  Examples
    Assignment
      Binary          x := y op z            (op, y, z, x)
      Unary           x := op y              (op, y, _, x)
      Copy            x := y                 (_, y, _, x)
    Jumps
      Unconditional   goto L                 (goto, L, _, _)
      Conditional     if x relop y goto L    (relop, x, y, L)
    ...

IR-3. Example: Three Address Code
  if a>10 then x=y+z else x=y-z
    1. if a>10 goto 4
    2. x = y-z
    3. goto 5
    4. x = y+z
    5. ...

Analysis Levels
  Local: within a single basic block or statement
  Intraprocedural: within a single procedure, function, or method
  Interprocedural: across procedure boundaries, procedure call, shared globals, etc.
  Intraclass: within a single class
  Interclass: across class boundaries
  ...

Outline
  Introduction/motivations
  Program representation
  Control flow analysis
    Computing Control Flow (analysis and representation)
    Search and Traversals
    Applications
  Data flow

Computing Control flow (example)
  Procedure AVG
    S1    count = 0;
    S2    fread(fptr, n)
    S3    while (not EOF) do
    S4      if (n < 0)
    S5        return (error)
            else
    S6        nums[count] = n
    S7        count++
            endif
    S8      fread(fptr, n);
          endwhile
    S9    avg = mean(nums, count)
    S10   return (avg)
  [Figure: control flow graph with nodes entry, S1 to S10, and EXIT]

CF1: Control Flow (Basic Blocks)
  A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end, without halting or any possibility of branching except at the end
  A basic block may or may not be maximal
  For compiler optimizations, maximal blocks are desirable
  For software engineering tasks, basic blocks that represent one source code statement are often used

CF1: Computing Control Flow
  Input: a list of program statements in some form
  Output: a list of CFG nodes and edges
  Procedure:
    Construct basic blocks
    Create entry and exit nodes; create edge (entry, B1); create edge (Bk, exit) for each Bk that represents an exit from the program
    Add a CFG edge from Bi to Bj if Bj can immediately follow Bi in some execution, i.e.,
      there is a conditional or unconditional goto from the last statement of Bi to the first statement of Bj, or
      Bj immediately follows Bi in the order of the program and Bi does not end in an unconditional goto statement
    Label edges that represent conditional transfers of control

CF2: Search and Ordering
  Many ways to visit the nodes in the graph
    Depth First Search: visits the descendants of a node before visiting any of its siblings
    Breadth First Search: all of a node's immediate descendants are processed before any of their unprocessed children
    Preorder Traversal: a node is processed before its descendants
    Postorder Traversal: a node is processed after its descendants

CF2: Search and Ordering (cont'd) (DFS)
  [Figure: control flow graph for the DFS example]
  One DFS of the CFG: 1, 3, 4, 6, 7, 8, 10, back to 8, 9, back to 8, 7, 6, 4, 5, back to 4, 3, 1, 2, back to 1
  The number assigned to a node during DFS is its depth first number
  Depth first ordering of nodes is the reverse of the order in which nodes are last visited in DFS
  For the DFS above, the depth first ordering is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

CF: Types of Edges
  Depth first representation is the depth first spanning tree along with the other edges that are not part of the tree (tree edges, other edges)
  Three kinds of edges
    Advancing (forward) edges: go from a node to one of its proper descendants in the tree; these include tree edges
    Back edges: go from a node to one of its ancestors in the tree
    Cross edges: connect nodes such that neither is an ancestor of the other

Applications of Control Flow
  [Figure: control flow graph with nodes 1 to 8]
  Complexity: pointers to refactoring
  Testing: Branch, Path, Basis Path
    Branch: must test 1-2, 1-3, 4-5, 4-8, 5-6, 5-7
    Path: infinite, due to loop
    Basis Path: set of paths which covers all the edges at least once, e.g. 1,2,4,8; 1,3,4,5,6,7,4,8
  Program Understanding
    Recover program structure
  Impact analysis
  ...

Outline
  Introduction/motivations
  Program representation
  Control flow
  Data flow
    Introduction
    Reaching definitions

Data flow - Introduction
  Flow of various data throughout the program
  Obtained from AST or CFG
  Used in software engineering tasks
  Exact solutions to most data flow problems are undecidable
    May depend on input
    May depend on the outcome of a conditional statement
    May depend on termination of a loop
  Thus we compute approximations of the exact solution

Data flow - Introduction (cont'd)
  Some approximations "overestimate" the solution
    They contain the actual information plus some spurious information, but do not omit any actual information
    Conservative and safe approach
  Some approximations "underestimate" the solution
    They may not contain all the information of the actual solution
    Unsafe
  Research challenge: providing safe but precise information in an efficient way
  Uses of data flow:
    Compiler optimization requires conservative analysis
    Software engineering tasks may only need unsafe info

Data flow - Compiler Optimization
  Common subexpression elimination
    [Figure: blocks c=a+b, d=a+b, and e=a+b]
    [Figure: the same blocks after elimination, with t=a+b; c=t, then t=a+b; d=t, then e=t]
  Need to know available expressions: which expressions have already been computed at the point before this statement

Data Flow - Compiler Optimization
  Register (de)allocation
    When assigning memory locations to registers, if a value in a register (i.e. a memory location) is not used again, there is no need to keep it in a register
        R1 = R2 + 10
    Is R2 needed after this statement?
    Need to know "live variables": which variables are still used after the current line

Data Flow - Compiler Optimization
  Suppose every assignment that reaches this statement assigns 5 to c
        a = c+10     // need 3 registers
  then the value assigned to 'a' can be replaced by the constant 15
        a = 15       // need 2 registers
  But: need to know reaching definitions: which definition(s) of variable c reach this statement

Data Flow - Sw Eng Tasks
  Data-flow testing
    Suppose that a statement assigns a value but the use of that value is never executed under test
        a = c+10     a never used on this path
        d = a+y
    Need to know definition-use pairs: the link between definition(s) and use(s) of a variable (or a memory location)

Data Flow - Sw Eng Tasks
  Debugging
    Suppose, e.g., that 'a' has an incorrect value in the statement (int overflow)
        a = c+y
        d = a+y
    Need data dependence information: some statements produce erroneous values, others are affected by those values

Data flow - Example
  Compute the flow of data throughout the program
    B1    1. i=2
          2. k=i+1
    B2    3. i=1
    B3    4. k=k+1
    B4    5. k=k-4
  Where does the assignment to i in statement 1 reach?
  Where does the expression computed in statement 2 reach?
  Which uses of a variable are reachable from the end of block B1?
  Is the value of variable i live after statement 2?

Reaching definitions analysis
  Definition = statement where a variable is assigned a value (e.g. input statement, assignment statement)
  A definition of 'a' reaches a point 'p' if there exists a control flow path in the CFG from the definition to 'p' with no other definitions of 'a' on the path
  Such a path may exist in the graph but may not be possible in any execution (infeasible path)

Reaching definitions analysis
  What are the definitions in the program?
    Of variable i: 1, 3
    Of variable k: 2, 4, 5
  Which basic blocks (before block) do these definitions reach?
    Def 1 reaches: B2
    Def 2 reaches: B1, B2, B3
    Def 3 reaches: B1, B3, B4
    Def 4 reaches: B4
    Def 5 reaches: exit

Reaching definitions analysis
  Method
    Compute two kinds of basic information (within the block)
      Gen[B]: set of definitions generated within B
      Kill[B]: set of definitions that, if they reach the point before B, won't reach the end of B
    Compute two other sets by propagation
      IN[B]: set of definitions that reach the beginning of B
      OUT[B]: set of definitions that reach the end of B

Reaching definitions analysis
    Block   Init GEN   Init KILL   Init IN   Init OUT   IN     OUT
    B1      1,2        3,4,5       --        1,2        2,3    1,2
    B2      3          1           --        3          1,2    2,3
    B3      4          2,5         --        4          2,3    3,4
    B4      5          2,4         --        5          3,4    3,5

Iterative Data-Flow analysis algorithm
  Algorithm for Reaching Definitions
    Input: CFG with GEN[B], KILL[B] for all B
    Output: IN[B], OUT[B] for all B
    Begin RD
      IN[B] = empty, OUT[B] = GEN[B] for all B
      change = true
      While change do begin
        change = false
        For each B do begin
          IN[B] = union of OUT[P] over all predecessors P of B
          OLDOUT = OUT[B]
          OUT[B] = GEN[B] union (IN[B] - KILL[B])
          if (OUT[B] != OLDOUT) then change = true
        End for
      End while
    End RD

Tools
  Eclipse JDT/AST (APIs to construct, traverse and manipulate AST)
    http://www.vogella.de/articles/EclipseJDT/article.html
  Sourcerer
    http://sourcerer.ics.uci.edu/index.html
  Crystal (Data Analysis Framework, mostly for academic purposes)
    http://code.google.com/p/crystalsaf/wiki/Installation

Mandatory Reading List
  Representation and Analysis of Software (RepAnalysis.pdf)
  Crystal Notes (CrystalTutorialNotes.pdf, CrystalTutorial.ppt)
  Eclipse JDT - AST
    http://www.vogella.de/articles/EclipseJDT/article.html

More (optional) Reading List
  Principles of Program Analysis, Nielson and Hankin
  Invariant Detection using Daikon (daikon.pdf)
  More optional readings available in the Program Analysis course material at CMU
    http://www.cs.cmu.edu/~aldrich/courses/15-819M/
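
Worked sketch: iterative reaching definitions
  The iterative algorithm from the reaching definitions slides above can be tried out on the four-block example (B1 to B4). What follows is a minimal Python sketch, not the course's reference implementation. The function name reaching_definitions and the dictionaries PREDS, GEN and KILL are illustrative names chosen here. The predecessor edges (B1 -> B2, B2 -> B1, B2 -> B3, B3 -> B4) are an assumption: the CFG figure is not available in text form, so they were inferred from the IN and OUT columns of the table.

```python
def reaching_definitions(blocks, preds, gen, kill):
    """Iterate IN/OUT to a fixed point, following the RD pseudocode."""
    in_sets = {b: set() for b in blocks}          # IN[B] = empty
    out_sets = {b: set(gen[b]) for b in blocks}   # OUT[B] = GEN[B]
    change = True
    while change:
        change = False
        for b in blocks:
            # IN[B] = union of OUT[P] over all predecessors P of B
            in_sets[b] = set().union(*(out_sets[p] for p in preds[b]))
            old_out = out_sets[b]
            # OUT[B] = GEN[B] union (IN[B] - KILL[B])
            out_sets[b] = gen[b] | (in_sets[b] - kill[b])
            if out_sets[b] != old_out:
                change = True
    return in_sets, out_sets


# Example from the slides; definitions are numbered 1..5.
BLOCKS = ["B1", "B2", "B3", "B4"]
# Assumed edges, inferred from the IN/OUT table (figure not available).
PREDS = {"B1": ["B2"], "B2": ["B1"], "B3": ["B2"], "B4": ["B3"]}
GEN = {"B1": {1, 2}, "B2": {3}, "B3": {4}, "B4": {5}}
KILL = {"B1": {3, 4, 5}, "B2": {1}, "B3": {2, 5}, "B4": {2, 4}}

IN_SETS, OUT_SETS = reaching_definitions(BLOCKS, PREDS, GEN, KILL)

for block in BLOCKS:
    print(block, "IN =", sorted(IN_SETS[block]), "OUT =", sorted(OUT_SETS[block]))
```

  Under the assumed edges, the computed sets agree with the IN and OUT columns of the table. Visiting the blocks in a different order may change the number of passes, but not the fixed point that is reached.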