ANALYSIS OF PROG. LANG.
PROGRAM ANALYSIS
Instructors: Crista Lopes
Copyright © Instructors.
Motivation(s)
Where do you see PA in your everyday life?
How does PA “work”?
What is PA anyway?
Auto-completion
Pre-compilation error detection
Ex: missing parenthesis
How do you know ...
int a;
increment_a() {
    a++;
}
while(true) {
    String a = "hello";
    increment_a();
}
This “a” is not that “a”
How do you remember ...
int a;    // “a” is of type int (FYI...)
increment_a() {
    a++;
}
while(true) {
    String a = "hello";
    increment_a();
}
Wait, what’s the type of “a” again?
Outline
Introduction/motivations
Program representation
  AST
  3-address code
Control flow analysis
Data flow
Intermediate Representation (IR)
Initial Point
Abstract Syntax Tree
  Abstract vs Concrete Syntax
  Parse Tree vs Abstract Syntax Tree
Three-address Codes
IR-1 Starting Point
Source code → Parsing, Lexical Analysis → Intermediate representation → Code Generation, Optimization → Target code
Analyze IR – Perform analysis on the results
Use this information for applications
Code Execution
IR-2. Abstract Syntax Tree (AST)
Concrete vs Abstract Syntax
  Concrete syntax shows structure and is language-specific
  Abstract syntax shows structure
Representations
  Parse Tree represents Concrete Syntax
  Abstract Syntax Tree represents Abstract Syntax
IR-2. Example : Grammar
Example
  a := b+c    (Language 1)
  a = b+c;    (Language 2)
Grammar for 1
  stmtlist → stmt | stmt stmtlist
  stmt     → assign | if-then | …
  assign   → ident ":=" ident binop ident
  binop    → "+" | "-" | …
Grammar for 2
  stmtlist → stmt ";" | stmt ";" stmtlist
  stmt     → assign | if-then | …
  assign   → ident "=" ident binop ident
  binop    → "+" | "-" | …
IR-2. Example: Parse Tree
Parse Tree for a:=b+c
  stmtlist
    stmt
      assign
        ident a
        ":="
        ident b
        binop "+"
        ident c
Parse Tree for a=b+c;
  stmtlist
    stmt
      assign
        ident a
        "="
        ident b
        binop "+"
        ident c
    ";"
IR-2 Example: Abstract Syntax Tree
Example: Abstract Syntax Tree for 1 and 2
  1. a:=b+c
  2. a=b+c;
Both statements have the same AST:
  assign
    a
    add
      b
      c
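As a small illustration outside the two toy languages of the slide, Python's standard `ast` module shows the same effect: the concrete punctuation disappears and only the assign/add structure remains. This is a sketch using Python's own syntax, not a parser for Language 1 or Language 2.

```python
import ast

# Parse a single assignment written in Python's concrete syntax.
tree = ast.parse("a = b + c")
assign = tree.body[0]                      # an ast.Assign node

target = assign.targets[0].id              # name on the left-hand side
op = type(assign.value.op).__name__        # operator node class name
left = assign.value.left.id
right = assign.value.right.id

# Render the abstract structure in the slide's assign/add notation.
print(f"assign({target}, {op.lower()}({left}, {right}))")
```

The printed structure carries no trace of `=`, `:=`, or `;`, which is the sense in which the AST is language-independent.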
IR-3. Three Address Code
General form: x = y op z
More generally: (operator, operand1, operand2, result)
  (at most 3 spots besides the operator)
May include temporary variables
Examples
  Assignment
    Binary          x := y op z            (op, y, z, x)
    Unary           x := op y              (op, y, _, x)
    Copy            x := y                 (_, y, _, x)
  Jumps
    Unconditional   goto L                 (goto, L, _, _)
    Conditional     if x relop y goto L    (relop, x, y, L)
  ….
IR-3. Example: Three Address Code
if a>10
then x=y+z
else
x=y-z
1. if a>10 goto 4
2. x = y-z
3. goto 5
4. x = y + z
5. …..
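To make the role of temporary variables concrete, here is a minimal sketch that lowers one nested arithmetic assignment into quadruples of the form (operator, operand1, operand2, result). The function name `to_tac`, the temporary names `t1, t2, ...`, and the use of Python's `ast` module as the front end are all assumptions made for illustration; only binary arithmetic is handled.

```python
import ast
from itertools import count

def to_tac(source):
    """Lower one assignment with nested binary expressions to quadruples
    (op, operand1, operand2, result), introducing temporaries t1, t2, ..."""
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    temps = count(1)
    code = []

    def lower(node):
        # Leaves are returned as-is; every inner node gets a fresh temporary.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp):
            y, z = lower(node.left), lower(node.right)
            t = f"t{next(temps)}"
            code.append((ops[type(node.op)], y, z, t))
            return t
        raise NotImplementedError(type(node).__name__)

    stmt = ast.parse(source).body[0]
    result = lower(stmt.value)
    # Final copy into the assigned variable, written (_, y, _, x).
    code.append(("_", result, "_", stmt.targets[0].id))
    return code

quads = to_tac("x = (a + b) * (c - d)")
for q in quads:
    print(q)
```

Each emitted quadruple has at most three operand slots besides the operator, so any expression depth is flattened into a sequence of simple steps.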
Analysis Levels
Local
  within a single basic block or statement
Intraprocedural
  within a single procedure, function, or method
Interprocedural
  across procedure boundaries, procedure call, shared globals, etc
Intraclass
  within a single class
Interclass
  across class boundaries
…..
Outline
Introduction/motivations
Program representation
Control flow analysis
  Computing Control Flow (analysis and representation)
  Search and Traversals
  Applications
Data flow
Computing Control flow (example)
Procedure AVG
S1    count=0;
S2    fread(fptr, n)
S3    while(not EOF) do
S4      if(n<0)
S5        return(error)
        else
S6        nums[count]=n
S7        count++
        endif
S8      fread(fptr, n);
      endwhile
S9    avg=mean(nums, count)
S10   return(avg)
[Figure: control flow graph with nodes entry, S1 through S10, and exit]
CF1: Control Flow (Basic Blocks)
A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end, without halting or the possibility of branching except at the end
A basic block may or may not be maximal
For compiler optimizations, maximal blocks are desirable
For software engineering tasks, basic blocks that represent one source code statement are often used
CF1: Computing Control Flow
Input: A list of program statements in some form
Output: A list of CFG nodes and edges
Procedure:
  Construct basic blocks
  Create entry and exit nodes; create edge (entry, B1); create edge (Bk, exit) for each Bk that represents an exit from the program
  Add a CFG edge from Bi to Bj if Bj can immediately follow Bi in some execution, i.e.,
    There is a conditional or unconditional goto from the last statement of Bi to the first statement of Bj, or
    Bj immediately follows Bi in the order of the program and Bi does not end in an unconditional goto statement
  Label edges that represent conditional transfers of control
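The procedure can be sketched in Python as follows. The instruction encoding (`"op"`, `"goto"`, `"if"` with a 1-based target number) and the function name `build_cfg` are assumptions for illustration. The input mirrors the earlier five-statement three-address example, with a placeholder `"op"` standing in for the elided statement 5; return statements and out-of-range jump targets are not modelled.

```python
def build_cfg(instrs):
    """instrs: list of (kind, arg), kind in {"op", "goto", "if"}.
    For jumps, arg is the 1-based number of the target instruction.
    Returns (blocks, edges); edges are (source, destination, label)."""
    n = len(instrs)
    # Leaders: first instruction, jump targets, instructions after a jump.
    leaders = {1}
    for i, (kind, arg) in enumerate(instrs, start=1):
        if kind in ("goto", "if"):
            leaders.add(arg)
            if i < n:
                leaders.add(i + 1)
    starts = sorted(leaders)
    # Each block runs from one leader up to (not including) the next.
    blocks = [list(range(s, e)) for s, e in zip(starts, starts[1:] + [n + 1])]
    block_of = {b[0]: k for k, b in enumerate(blocks)}

    edges = [("entry", 0, None)]
    for k, b in enumerate(blocks):
        kind, arg = instrs[b[-1] - 1]          # last instruction of the block
        nxt = k + 1 if k + 1 < len(blocks) else "exit"
        if kind == "goto":
            edges.append((k, block_of[arg], None))
        elif kind == "if":
            # Conditional transfers get labelled edges.
            edges.append((k, block_of[arg], "true"))
            edges.append((k, nxt, "false"))
        else:
            edges.append((k, nxt, None))       # plain fall-through
    return blocks, edges

# 1. if a>10 goto 4   2. x = y-z   3. goto 5   4. x = y+z   5. (placeholder)
tac = [("if", 4), ("op", None), ("goto", 5), ("op", None), ("op", None)]
blocks, edges = build_cfg(tac)
print(blocks)
print(edges)
```

Blocks are identified by their index in `blocks`; statements 2 and 3 end up in the same block because control can only enter at statement 2 and leave after statement 3.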
CF2: Search and Ordering
Many ways to visit the nodes in the graph
  Depth First Search: Visits descendants of the node before visiting any of its siblings
  Breadth First Search: All of the node’s immediate descendants are processed before any of their unprocessed children
  Preorder Traversal: A node is processed before its descendants
  Postorder Traversal: A node is processed after its descendants
CF2: Search and Ordering (cont’d) (DFS)
[Figure: CFG whose nodes are numbered 1 to 10]
One DFS of the CFG: 1, 3, 4, 6, 7, 8, 10, back to 8, 9, back to 8, 7, 6, 4, 5, back to 4, 3, 1, 2, back to 1
The number assigned to a node during DFS is its depth-first number
Depth-first ordering of nodes is the reverse of the order in which nodes are last visited in DFS
For this DFS, nodes are visited 1,3,4,6,7,8,10,8,9,8,7,6,4,5,4,3,1,2,1
Depth-first ordering is 1,2,3,4,5,6,7,8,9,10
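The traversal above can be reproduced with a short sketch. The graph below contains only the tree edges implied by the visit sequence (1-3, 3-4, 4-6, 6-7, 7-8, 8-10, 8-9, 4-5, 1-2); the remaining edges of the figure are not recoverable from the text and are left out, so this is an assumption, not the full CFG. The function name `depth_first_order` is hypothetical.

```python
def depth_first_order(succ, root):
    """succ: dict node -> list of successors, tried in the listed order.
    Returns (visit trace including back-ups, depth-first ordering)."""
    seen, trace, postorder = set(), [], []

    def dfs(n):
        seen.add(n)
        trace.append(n)
        for m in succ.get(n, []):
            if m not in seen:
                dfs(m)
                trace.append(n)      # "back to n" after exploring m
        postorder.append(n)          # n has now been visited for the last time

    dfs(root)
    # Depth-first ordering is the reverse of the last-visit (postorder) order.
    return trace, postorder[::-1]

# Tree edges only; successor order chosen to match the slide's traversal.
tree = {1: [3, 2], 3: [4], 4: [6, 5], 6: [7], 7: [8], 8: [10, 9]}
trace, order = depth_first_order(tree, 1)
print(trace)
print(order)
```

Reversing the postorder list is what turns "order of last visit" into the depth-first ordering 1 through 10.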
CF: Types of Edges
Depth-first representation is the depth-first spanning tree along with the other edges that are not part of the tree (tree edges, other edges)
Three kinds of edges
  Advancing (forward) edges: go from a node to one of its proper descendants in the tree; these include tree edges
  Back edges: go from a node to one of its ancestors in the tree
  Cross edges: connect nodes such that neither is an ancestor of the other
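One common way to tell the three kinds apart is to record entry and exit times during DFS: a node is an ancestor of another exactly when its time interval encloses the other's. The sketch below applies that idea to a small hypothetical graph (not the one in the figure); `classify_edges` is an illustrative name, and self-loops are counted as back edges here by assumption.

```python
def classify_edges(succ, root):
    """Classify every edge reachable from root as advancing, back, or cross."""
    pre, post, clock = {}, {}, [0]

    def dfs(n):
        clock[0] += 1
        pre[n] = clock[0]            # time at which n is first entered
        for m in succ.get(n, []):
            if m not in pre:
                dfs(m)
        clock[0] += 1
        post[n] = clock[0]           # time at which n is left for good

    dfs(root)

    def encloses(a, d):
        # True when a is d itself or an ancestor of d in the spanning tree.
        return pre[a] <= pre[d] and post[d] <= post[a]

    kinds = {}
    for u in pre:
        for v in succ.get(u, []):
            if u != v and encloses(u, v):
                kinds[(u, v)] = "advancing"   # includes tree edges
            elif encloses(v, u):
                kinds[(u, v)] = "back"
            else:
                kinds[(u, v)] = "cross"
    return kinds

# Hypothetical graph: a diamond 1->2->4, 1->3->4, a shortcut 1->4, a loop 4->1.
graph = {1: [2, 3, 4], 2: [4], 3: [4], 4: [1]}
kinds = classify_edges(graph, 1)
for edge, kind in sorted(kinds.items()):
    print(edge, kind)
```

The classification depends on the order in which successors are explored: here 4 is reached through 2, so the later edge 3->4 becomes a cross edge.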
Applications of Control Flow
Complexity – Pointers to refactoring
Testing
  Branch, Path, Basis Path
  Branch: Must test 1-2, 1-3, 4-5, 4-8, 5-6, 5-7
  Path: Infinite, due to loop
  Basis Path: Set of paths which covers all the edges at least once, e.g. 1,2,4,8; 1,3,4,5,6,7,4,8
Program Understanding
  Recover program structure
  Impact analysis
  …..
[Figure: example CFG with nodes 1 to 8]
Outline
Introduction/motivations
Program representation
Control flow
Data flow
  Introduction
  Reaching definitions
Data flow - Introduction
Flow of various data throughout the program
Obtained from AST or CFG
Used in software engineering tasks
Exact solutions to most data flow problems are undecidable
  May depend on input
  May depend on the outcome of a conditional statement
  May depend on termination of loop
Thus we compute approximations of the exact solution
Data flow - Introduction
Some approximations “overestimate” the solution
  They contain the actual information plus some spurious information, but do not omit any actual information
  Conservative and safe approach
Some approximations “underestimate” the solution
  They may not contain all the information of the actual solution
  Unsafe
Research challenge: Providing safe but precise information in an efficient way
Uses of data flow:
  Compiler optimization requires conservative analysis
  Software engineering tasks may only need unsafe info
Data flow – Compiler Optimization
Common subexpression elimination
Before:
  c=a+b
  d=a+b
  e=a+b
After:
  t=a+b; c=t
  t=a+b; d=t
  e=t
Need to know available expressions: which expressions have been computed at that point before this statement
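A much-simplified sketch of the idea, restricted to a single basic block (the slide's example spans several blocks and needs the full available-expressions analysis). Unlike the slide, it reuses the variable that already holds the value instead of introducing a temporary `t`. The statement encoding and the name `local_cse` are assumptions for illustration.

```python
def local_cse(block):
    """block: list of (target, op, y, z) statements in one basic block.
    A recomputation of an available expression becomes a copy,
    written (target, "=", source, None)."""
    available = {}   # (op, y, z) -> variable currently holding that value
    out = []
    for target, op, y, z in block:
        key = (op, y, z)
        if key in available:
            out.append((target, "=", available[key], None))
        else:
            out.append((target, op, y, z))
        # Assigning target invalidates expressions that read target
        # and expressions whose value was held in target.
        available = {k: v for k, v in available.items()
                     if target not in (k[1], k[2]) and v != target}
        # The expression is now held in target, unless target is an operand.
        if key not in available and target not in (y, z):
            available[key] = target
    return out

same = local_cse([("c", "+", "a", "b"),
                  ("d", "+", "a", "b"),
                  ("e", "+", "a", "b")])
# Redefining an operand makes the expression unavailable again.
redefined = local_cse([("c", "+", "a", "b"),
                       ("a", "+", "a", "1"),
                       ("d", "+", "a", "b")])
print(same)
print(redefined)
```

In the second call `d` must be recomputed, because `a` changed between the two occurrences of `a+b`.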
Data Flow - Compiler Optimization
Register (de)allocation
When assigning memory locations to registers, if a value in a register (i.e. a memory location) is not used again, there is no need to keep it in a register
  R1=R2+10    ← Is R2 needed after this statement?
Need to know “live variables”: which variables are still used after the current line
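Within one basic block, the question can be answered by scanning backwards from the end of the block. The sketch below assumes a second statement `R3=R1*2` and a live-out set `{R3}`, neither of which appears on the slide; `live_after_each` is an illustrative name.

```python
def live_after_each(block, live_out):
    """block: list of (target, [used variables]) in execution order.
    live_out: variables still needed after the block.
    Returns, for each statement, the set of variables live right after it."""
    live = set(live_out)
    result = []
    for target, uses in reversed(block):
        result.append(set(live))     # live immediately after this statement
        live.discard(target)         # the old value of target is overwritten
        live.update(uses)            # operands must be live before it
    result.reverse()
    return result

block = [("R1", ["R2"]),             # R1 = R2 + 10
         ("R3", ["R1"])]             # R3 = R1 * 2   (hypothetical follow-up)
live = live_after_each(block, live_out={"R3"})
print(live)
```

Under these assumptions R2 is not in the set after the first statement, so its register could be reused from that point on.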
Data Flow - Compiler Optimization
Suppose every assignment that reaches this statement assigns 5 to c
  a=c+10    // need 3 registers
then ‘c+10’ can be replaced by 15
  a=15      // need 2 registers
But: Need to know reaching definitions: which definition(s) of variable c reach this statement
Data Flow - Sw Eng Tasks
Data-Flow testing
Suppose that a statement assigns a value but the use of that value is never executed under test
  a=c+10
  (a never used on this path)
  d=a+y
Need to know definition-use pairs: link between definition(s) and use(s) of a variable (or a memory location)
Data Flow - Sw Eng Tasks
Debugging
Suppose that ‘a’ has an incorrect value in the statement (e.g. int overflow)
  a=c+y
  d=a+y
Need data dependence information: some statements produce erroneous values, others are affected by those values
Data flow - Example
Compute the flow of data throughout the program
  B1:  1. i=2
       2. k=i+1
  B2:  3. i=1
  B3:  4. k=k+1
  B4:  5. k=k-4
Where does the assignment to i in statement 1 reach?
Where does the expression computed in statement 2 reach?
Which uses of variable are reachable from the end of Block 1?
Is the value of variable i live after statement 2?
Reaching definitions analysis
B1: 1. i=2; 2. k=i+1    B2: 3. i=1    B3: 4. k=k+1    B4: 5. k=k-4
Definition = statement where a variable is assigned a value (e.g. input statement, assignment statement)
A definition of ‘a’ reaches a point ‘p’ if there exists a control flow path in the CFG from the definition to ‘p’ with no other definitions of ‘a’ on the path
Such a path may exist in the graph but may not be possible (infeasible path)
Reaching definitions analysis
B1: 1. i=2; 2. k=i+1    B2: 3. i=1    B3: 4. k=k+1    B4: 5. k=k-4
What are the definitions in the program?
  Of variable i: 1,3
  Of variable k: 2,4,5
Which basic blocks (before block) do these definitions reach?
  Def 1 reaches: B2
  Def 2 reaches: B1, B2, B3
  Def 3 reaches: B1, B3, B4
  Def 4 reaches: B4
  Def 5 reaches: exit
Reaching definitions analysis
B1: 1. i=2; 2. k=i+1    B2: 3. i=1    B3: 4. k=k+1    B4: 5. k=k-4
Method
  Compute two kinds of basic information (within the block)
    Gen[B]: set of definitions generated within B
    Kill[B]: set of definitions that, if they reach the point before B, won’t reach the end of B
  Compute two other sets by propagation
    IN[B]: set of definitions that reach the beginning of B
    OUT[B]: set of definitions that reach the end of B
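The two block-local sets can be derived mechanically from the list of definitions. The sketch below encodes the five definitions of the example as `number -> (block, variable)` and assumes that definitions are numbered in program order; `gen_kill` is an illustrative name, and KILL here only lists definitions located in other blocks.

```python
# Definitions of the running example: number -> (block, variable defined).
defs = {1: ("B1", "i"), 2: ("B1", "k"), 3: ("B2", "i"),
        4: ("B3", "k"), 5: ("B4", "k")}

def gen_kill(defs):
    gen, kill = {}, {}
    for b in sorted({blk for blk, _ in defs.values()}):
        in_block = {d: v for d, (blk, v) in defs.items() if blk == b}
        # GEN: the last definition of each variable inside the block.
        last = {}
        for d in sorted(in_block):
            last[in_block[d]] = d
        gen[b] = set(last.values())
        # KILL: definitions in other blocks of variables defined in this block.
        kill[b] = {d for d, (blk, v) in defs.items()
                   if blk != b and v in last}
    return gen, kill

gen, kill = gen_kill(defs)
for b in gen:
    print(b, "GEN", sorted(gen[b]), "KILL", sorted(kill[b]))
```

For B1, which defines both `i` and `k`, every other definition in the program is killed.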
Reaching definitions analysis
B1: 1. i=2; 2. k=i+1    B2: 3. i=1    B3: 4. k=k+1    B4: 5. k=k-4

Block   Init GEN   Init KILL   Init IN   Init OUT   IN     OUT
1       1,2        3,4,5       --        1,2        2,3    1,2
2       3          1           --        3          1,2    2,3
3       4          2,5         --        4          2,3    3,4
4       5          2,4         --        5          3,4    3,5
Iterative Data-Flow analysis algorithm
Algorithm for Reaching Definitions
Input: CFG with GEN[B], KILL[B] for all B
Output: IN[B], OUT[B] for all B
Begin RD
  IN[B]=empty, OUT[B]=GEN[B] for all B; change = true
  While change do begin
    change=false
    For each B do begin
      IN[B]=union OUT[P] (P is a predecessor of B)
      OLDOUT=OUT[B]
      OUT[B]=GEN[B] union (IN[B]-KILL[B])
      if (OUT[B]!=OLDOUT) then change = true;
    End for
  End while
End RD
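A direct Python transcription of the algorithm, run on the four-block example. The GEN and KILL sets are taken from the table for that example. The predecessor map is an assumption: the figure's edges are not present in the text, and the edges used here (B1→B2, B2→B1, B2→B3, B3→B4) are the ones consistent with the IN and OUT columns of that table.

```python
def reaching_definitions(preds, gen, kill):
    """Iterate IN/OUT to a fixed point.
    preds: block -> list of predecessor blocks
    gen, kill: block -> set of definition numbers"""
    blocks = list(gen)
    in_ = {b: set() for b in blocks}
    out = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # IN[B] is the union of OUT[P] over all predecessors P of B.
            in_[b] = set().union(*(out[p] for p in preds[b]))
            new_out = gen[b] | (in_[b] - kill[b])
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return in_, out

gen = {"B1": {1, 2}, "B2": {3}, "B3": {4}, "B4": {5}}
kill = {"B1": {3, 4, 5}, "B2": {1}, "B3": {2, 5}, "B4": {2, 4}}
# Assumed CFG edges, inferred from the IN/OUT table (see lead-in).
preds = {"B1": ["B2"], "B2": ["B1"], "B3": ["B2"], "B4": ["B3"]}

in_sets, out_sets = reaching_definitions(preds, gen, kill)
for b in gen:
    print(b, "IN", sorted(in_sets[b]), "OUT", sorted(out_sets[b]))
```

With these inputs the loop stabilises on its second pass, and the resulting sets agree with the final IN and OUT columns of the table.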
Tools
Eclipse JDT/AST (APIs to construct, traverse and manipulate AST)
  http://www.vogella.de/articles/EclipseJDT/article.html
Sourcerer
  http://sourcerer.ics.uci.edu/index.html
Crystal (Data Analysis Framework, mostly for academic purposes)
  http://code.google.com/p/crystalsaf/wiki/Installation
Mandatory Reading List
Representation and Analysis of Software – RepAnalysis.pdf
Crystal Notes – CrystalTutorialNotes.pdf, CrystalTutorial.ppt
Eclipse JDT - AST http://www.vogella.de/articles/EclipseJDT/article.html
More (optional) Reading List
Principles of Program Analysis, Nielson, Nielson, and Hankin
Invariant Detection using Daikon – daikon.pdf
More optional readings available at Program Analysis course material at CMU
  http://www.cs.cmu.edu/~aldrich/courses/15-819M/