Static Analysis

Static Code Analysis
梁广泰 (Liang Guangtai)
2011-05-25

Outline
■ Motivation
■ Static program analysis (concepts + examples)
■ Program defect analysis (research work)

Motivation
Characteristics of cloud platforms:
■ Applications are deployed directly on cloud servers, which creates security risks
  • An application can directly manipulate and damage the server's file system
  • A security vulnerability in an application gives attackers an entry point
■ Resources are shared and dynamically allocated
  • A single poorly performing application takes resources away from other applications
One solution:
■ Run static code analysis on an application before deploying it:
  • Does it make prohibited calls? (illegal file access)
  • Does it contain inefficient code? (heavy String concatenation without StringBuilder)
  • Does it contain security vulnerabilities? (SQL injection, cross-site scripting, denial of service)
  • Does it contain malicious code such as viruses?
  • ……

Static Code Analysis
Definition:
■ Static program analysis, or static analysis for short, is the technique of analyzing a program without executing it.
Contrast:
■ Dynamic program analysis: requires actually executing the program
■ Program comprehension: the term "static analysis" usually refers to analysis by automated tools, whereas manual analysis is usually called program comprehension
Uses:
■ Program translation/compilation (compilers), program optimization and refactoring, software defect detection, etc.
Process:
■ In most cases the input to static analysis is source code or intermediate code (such as Java bytecode); object code is used only rarely. The analysis results are output in a specific form.

Static Code Analysis
■ Basic Blocks
■ Control Flow Graph
■ Dataflow Analysis
  • Live Variable Analysis
  • Reaching Definition Analysis
■ Lattice Theory

Basic Blocks
A basic block is a maximal sequence of consecutive three-address instructions with the following properties:
■ The flow of control can only enter the basic block through the first instruction.
■ Control will leave the block without halting or branching, except possibly at the last instruction.
Basic blocks become the nodes of a flow graph, with edges indicating the order.

Basic Block Example
 1. i = 1
 2. j = 1
 3. t1 = 10 * i
 4. t2 = t1 + j
 5. t3 = 8 * t2
 6. t4 = t3 - 88
 7. a[t4] = 0.0
 8. j = j + 1
 9. if j <= 10 goto (3)
10. i = i + 1
11. if i <= 10 goto (2)
12. i = 1
13. t5 = i - 1
14. t6 = 88 * t5
15. a[t6] = 1.0
16. i = i + 1
17. if i <= 10 goto (13)
Leaders: instructions 1, 2, 3, 10, 12, 13
Basic blocks: A = 1; B = 2; C = 3 to 9; D = 10 to 11; E = 12; F = 13 to 17

Control-Flow Graphs
Control-flow graph:
■ Node: an instruction or sequence of instructions (a basic block)
  • Two instructions i, j are in the same basic block iff execution of i guarantees execution of j
■ Directed edge: potential flow of control
■ Distinguished start node Entry and end node Exit
  • First and last instruction in the program

Control-Flow Edges
Basic blocks = nodes
Edges:
■ Add a directed edge from B1 to B2 if:
  • there is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or
  • B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto)
■ Definition of predecessor and successor
  • B1 is a predecessor of B2
  • B2 is a successor of B1

CFG Example
[Figure: control-flow graph example]

Dataflow Analysis
Compile-time reasoning about run-time values of variables or expressions at different program points
■ Which assignment statements produced the value of a variable at this point?
■ Which variables contain values that are no longer used after this program point?
■ What is the range of possible values of a variable at this program point?
■ ……

Program Points
■ One program point before each node
■ One program point after each node
■ Join point: point with multiple predecessors
■ Split point: point with multiple successors

Live Variable Analysis
A variable v is live at point p if
■ v is used along some path starting at p, and
■ there is no definition of v along that path before the use.
When is a variable v dead at point p?
■ There is no use of v on any path from p to the exit node, or
■ all paths from p redefine v before using v.

What Use is Liveness Information?
■ Register allocation.
  • If a variable is dead, its register can be reassigned
■ Dead code elimination.
  • Eliminate assignments to variables not read later.
  • But must not eliminate the last assignment to a variable (such as an instance variable) visible outside the CFG.
  • Can eliminate other dead assignments.
  • Handle by making all externally visible variables live on exit from the CFG

Conceptual Idea of Analysis
■ Start from exit and go backwards in the CFG
■ Compute liveness information from end to beginning of basic blocks

Liveness Example
■ Assume a, b, c are visible outside the method
■ So they are live on exit
■ Assume x, y, z, t are not visible
■ Represent liveness using a bit vector
  • order is abcxyzt

Basic block                        Live on entry   Live on exit
a = x+y; t = a; c = a+x; x == 0    0101110         1100111
b = t+z;                           1000111         1100100
c = y+1;                           1100100         1110000

Formalizing Analysis
Each basic block has
■ IN - set of variables live at start of block
■ OUT - set of variables live at end of block
■ USE - set of variables with upwards exposed uses in block (use prior to definition)
■ DEF - set of variables defined in block prior to use
USE[x = z; x = x+1;] = { z } (x not in USE)
DEF[x = z; x = x+1; y = 1;] = { x, y }
Compiler scans each basic block to derive USE and DEF sets

Algorithm
for all nodes n in N - { Exit }
    IN[n] = emptyset;
OUT[Exit] = emptyset;
IN[Exit] = use[Exit];
Changed = N - { Exit };
while (Changed != emptyset)
    choose a node n in Changed;
    Changed = Changed - { n };
    OUT[n] = emptyset;
    for all nodes s in successors(n)
        OUT[n] = OUT[n] U IN[s];
    IN[n] = use[n] U (OUT[n] - def[n]);
    if (IN[n] changed)
        for all nodes p in predecessors(n)
            Changed = Changed U { p };

Reaching Definitions
■ Concept of definition and use
  • a = x+y
  • is a definition of a
  • is a use of x and y
■ A definition reaches a use if the value written by the definition may be read by the use

Reaching Definitions
[Figure: control-flow graph with the following blocks]
  s = 0; a = 4; i = 0; k == 0
  b = 1;
  b = 2;
  i < n
  s = s + a*b; i = i + 1;
  return s

Reaching Definitions and Constant Propagation
■ Is a use of a variable a constant?
  • Check all reaching definitions
  • If all assign the variable the same constant
  • Then the use is in fact a constant
■ Can replace the variable with the constant

Is a a Constant in s = s + a*b?
Yes! On all reaching definitions a = 4.

Constant Propagation Transform
Since a = 4 on all reaching definitions, the loop body s = s + a*b; i = i + 1; becomes
  s = s + 4*b; i = i + 1;

Computing Reaching Definitions
■ Compute with sets of definitions
  • represent sets using bit vectors
  • each definition has a position in the bit vector
■ At each basic block, compute
  • definitions that reach the start of the block
  • definitions that reach the end of the block
■ Do the computation by simulating execution of the program until a fixed point is reached

Bit positions are 1234567, one per numbered definition.

Basic block                              Reach start   Reach end
1: s = 0; 2: a = 4; 3: i = 0; k == 0     0000000       1110000
4: b = 1;                                1110000       1111000
5: b = 2;                                1110000       1110100
i < n                                    1111111       1111111
6: s = s + a*b; 7: i = i + 1;            1111111       0101111
return s                                 1111111       1111111

On the first pass, before definitions 6 and 7 have flowed around the loop, the value at i < n, at the start of the loop body, and at return s is 1111100; the table shows the fixed point.

Formalizing Reaching Definitions
Each basic block has
■ IN - set of definitions that reach beginning of block
■ OUT - set of definitions that reach end of block
■ GEN - set of definitions generated in block
■ KILL - set of definitions killed in block
GEN[s = s + a*b; i = i + 1;] = 0000011
KILL[s = s + a*b; i = i + 1;] = 1010000
Compiler scans each basic block to derive GEN and KILL sets

Forwards vs. Backwards
■ A forwards analysis is one that for each program point computes information about the past behavior.
  • Examples of this are available expressions and reaching definitions.
  • Calculation: predecessors of CFG nodes.
■ A backwards analysis is one that for each program point computes information about the future behavior.
  • Examples of this are liveness and very busy expressions.
  • Calculation: successors of CFG nodes.

May vs. Must
■ A may analysis is one that describes information that may possibly be true and, thus, computes an upper approximation.
  • Examples of this are liveness and reaching definitions.
  • Calculation: union operator.
■ A must analysis is one that describes information that must definitely be true and, thus, computes a lower approximation.
  • Examples of this are available expressions and very busy expressions.
  • Calculation: intersection operator.

Basic Idea
■ Information about the program is represented using values from an algebraic structure called a lattice
■ Analysis produces a lattice value for each program point
■ Two flavors of analysis
  • Forward dataflow analysis
  • Backward dataflow analysis

Partial Orders
■ Set P
■ Partial order ≤ such that ∀x,y,z∈P
  • x ≤ x (reflexive)
  • x ≤ y and y ≤ x implies x = y (antisymmetric)
  • x ≤ y and y ≤ z implies x ≤ z (transitive)
■ Can use partial order to define
  • Upper and lower bounds
  • Least upper bound
  • Greatest lower bound

Upper Bounds
If S ⊆ P then
■ x∈P is an upper bound of S if ∀y∈S. y ≤ x
■ x∈P is the least upper bound of S if
  • x is an upper bound of S, and
  • x ≤ y for all upper bounds y of S
■ ∨ - join, least upper bound (lub), supremum, sup
  • ∨S is the least upper bound of S
  • x ∨ y is the least upper bound of {x,y}

Lower Bounds
If S ⊆ P then
■ x∈P is a lower bound of S if ∀y∈S. x ≤ y
■ x∈P is the greatest lower bound of S if
  • x is a lower bound of S, and
  • y ≤ x for all lower bounds y of S
■ ∧ - meet, greatest lower bound (glb), infimum, inf
  • ∧S is the greatest lower bound of S
  • x ∧ y is the greatest lower bound of {x,y}

Covering
x < y if x ≤ y and x ≠ y
x is covered by y (y covers x) if
■ x < y, and
■ x ≤ z < y implies x = z
Conceptually, y covers x if there are no elements between x and y

Example
■ P = { 000, 001, 010, 011, 100, 101, 110, 111 } (standard Boolean lattice, also called hypercube)
■ x ≤ y if (x bitwise and y) = x

            111
    011     101     110
    001     010     100
            000

Hasse Diagram
• If y covers x
  • Line from y to x
  • y above x in diagram

Lattices
■ If x ∧ y and x ∨ y exist for all x,y∈P, then P is a lattice.
■ If ∧S and ∨S exist for all S ⊆ P, then P is a complete lattice.
■ All finite lattices are complete
■ Example of a lattice that is not complete
  • Integers I
  • For any x, y∈I, x ∨ y = max(x,y), x ∧ y = min(x,y)
  • But ∨I and ∧I do not exist
  • I ∪ {+∞, −∞} is a complete lattice

Lattice Examples
[Figure: Hasse diagrams of lattices and non-lattices]

Semi-Lattice
Only one of the two binary operations (meet or join) exists
■ Meet-semilattice: x ∧ y exists for all x,y∈P
■ Join-semilattice: x ∨ y exists for all x,y∈P

Monotonic Function & Fixed Point
■ Let L be a lattice. A function f : L → L is monotonic if ∀x, y ∈ L : x ≤ y ⇒ f(x) ≤ f(y)
■ Let A be a set, f : A → A a function, a ∈ A. If f(a) = a, then a is called a fixed point of f on A

Existence of Fixed Points
• The height of a lattice is defined to be the length of the longest path from ⊥ to ⊤
• In a complete lattice L with finite height, every monotonic function f : L → L has a unique least fixed point:
    fix(f) = ∨ { f^i(⊥) | i ≥ 0 }

Knaster-Tarski Fixed Point Theorem
Suppose (L, ≤) is a complete lattice and f: L → L is a monotonic function.
Then the least fixed point m of f can be defined as
    m = ∧ { x ∈ L | f(x) ≤ x }

Calculating Fixed Point
The time complexity of computing a fixed point depends on three factors:
■ The height of the lattice, since this provides a bound for i;
■ The cost of computing f;
■ The cost of testing equality.
The computation of a fixed point can be illustrated as a walk up the lattice starting at ⊥:
    ⊥ ≤ f(⊥) ≤ f(f(⊥)) ≤ ……

Application to Dataflow Analysis
■ Dataflow information will be lattice values
  • Transfer functions operate on lattice values
  • Solution algorithm will generate an increasing sequence of values at each program point
  • Ascending chain condition will ensure termination
■ Will use ∨ to combine values at control-flow join points

Transfer Functions
■ Transfer function f: P→P for each node in the control flow graph
■ f models the effect of the node on the program information

Transfer Functions
Each dataflow analysis problem has a set F of transfer functions f: P→P
■ Identity function i∈F
■ F must be closed under composition: ∀f,g∈F. the function h = λx.f(g(x)) ∈ F
■ Each f∈F must be monotone: x ≤ y implies f(x) ≤ f(y)
■ Sometimes all f∈F are distributive: f(x ∨ y) = f(x) ∨ f(y)
■ Distributivity implies monotonicity

Course Assessment
Assignment (submitted to the course platform http://sase.seforge.org/ and demonstrated) + course report
Assignment topics:
■ Code comment extraction and documentation generation
■ Code statistics: total lines, lines of code, number of classes, number of methods, method length, etc.
■ Automatic conversion of LaTeX documents to PDF
■ Online code diff
■ Converting an executable Jar into an exe program with a specific icon
■ Detection of various code defects: memory leaks, null pointer exceptions
■ Automatic test case generation
■ Defect analysis for scripting languages: JavaScript, Python, Ruby, PHP ……
■ C# code defect analysis
■ Online compression, decompression, encryption, decryption
■ ……

Questions?
Thank you!
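
Worked sketch: live variable analysis. The worklist algorithm from the Algorithm slide can be tried out on the three blocks of the Liveness Example. The sketch below is one possible rendering in Python, not the slides' own code. The block names B1 to B3, the edges B1→B2, B1→B3, B2→B3 and the helper names `liveness` and `bits` are assumptions: the edges are inferred from the bit vectors on the slide, and the Exit node is modelled as a block whose USE set holds the externally visible variables a, b, c.

```python
from collections import deque

# Hypothetical encoding: block name -> (USE, DEF, successors).
# The edges are an assumption inferred from the slide's bit vectors.
BLOCKS = {
    "B1": ({"x", "y"}, {"a", "t", "c"}, ["B2", "B3"]),  # a = x+y; t = a; c = a+x; x == 0
    "B2": ({"t", "z"}, {"b"}, ["B3"]),                   # b = t+z
    "B3": ({"y"}, {"c"}, ["Exit"]),                      # c = y+1
    "Exit": ({"a", "b", "c"}, set(), []),                # a, b, c are visible outside
}

def liveness(blocks):
    """Backward may-analysis: IN[n] = USE[n] U (OUT[n] - DEF[n])."""
    live_in = {n: set() for n in blocks}
    live_out = {n: set() for n in blocks}
    preds = {n: [] for n in blocks}
    for n, (_, _, succs) in blocks.items():
        for s in succs:
            preds[s].append(n)
    work = deque(blocks)  # every node starts on the worklist
    while work:
        n = work.popleft()
        use, defs, succs = blocks[n]
        out = set()
        for s in succs:  # OUT[n] is the union of IN over the successors
            out |= live_in[s]
        new_in = use | (out - defs)
        live_out[n] = out
        if new_in != live_in[n]:
            live_in[n] = new_in
            for p in preds[n]:  # IN[n] changed, so predecessors must be revisited
                if p not in work:
                    work.append(p)
    return live_in, live_out

def bits(variables, order="abcxyzt"):
    """Render a variable set as a bit vector in the slide's order."""
    return "".join("1" if v in variables else "0" for v in order)

LIVE_IN, LIVE_OUT = liveness(BLOCKS)
for name in ("B1", "B2", "B3"):
    print(name, bits(LIVE_IN[name]), bits(LIVE_OUT[name]))
```

Under these assumptions the printed vectors match the ones on the Liveness Example slide, for instance 0101110 on entry to the first block.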

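Worked sketch: reaching definitions. The forward counterpart, from the Computing Reaching Definitions slide, can be sketched the same way with integers as bit vectors. This is an illustration under stated assumptions rather than the slides' implementation: the block names, the helper names `mask`, `reaching_definitions`, `bits` and `defs_of`, and the KILL sets for every block other than the loop body are assumptions (the slides give GEN and KILL only for the loop body). A round-robin iteration is used here in place of a worklist.

```python
N_DEFS = 7  # definitions 1..7 as numbered on the slide

def mask(*defs):
    """Bit vector with the given definition numbers set (definition 1 is the leftmost bit)."""
    m = 0
    for d in defs:
        m |= 1 << (N_DEFS - d)
    return m

# Hypothetical block names -> (GEN, KILL, predecessors).
# KILL holds the other definitions of the variables a block assigns.
BLOCKS = {
    "init":   (mask(1, 2, 3), mask(6, 7), []),      # 1: s = 0; 2: a = 4; 3: i = 0; k == 0
    "then":   (mask(4), mask(5), ["init"]),         # 4: b = 1
    "else":   (mask(5), mask(4), ["init"]),         # 5: b = 2
    "test":   (0, 0, ["then", "else", "body"]),     # i < n
    "body":   (mask(6, 7), mask(1, 3), ["test"]),   # 6: s = s + a*b; 7: i = i + 1
    "return": (0, 0, ["test"]),                     # return s
}

def reaching_definitions(blocks):
    """Forward may-analysis: OUT[n] = GEN[n] U (IN[n] - KILL[n]), iterated to a fixed point."""
    reach_in = {n: 0 for n in blocks}
    reach_out = {n: 0 for n in blocks}
    changed = True
    while changed:
        changed = False
        for n, (gen, kill, preds) in blocks.items():
            new_in = 0
            for p in preds:  # IN[n] is the union of OUT over the predecessors
                new_in |= reach_out[p]
            new_out = gen | (new_in & ~kill)
            if (new_in, new_out) != (reach_in[n], reach_out[n]):
                reach_in[n], reach_out[n] = new_in, new_out
                changed = True
    return reach_in, reach_out

def bits(vector):
    """Render a bit vector as a 7-character string, definition 1 first."""
    return format(vector, "07b")

# Variable assigned by each numbered definition, used for the constant check.
DEF_OF = {1: "s", 2: "a", 3: "i", 4: "b", 5: "b", 6: "s", 7: "i"}

def defs_of(var, vector):
    """Numbers of the definitions of var that are present in the bit vector."""
    return [d for d in range(1, N_DEFS + 1) if DEF_OF[d] == var and vector & mask(d)]

REACH_IN, REACH_OUT = reaching_definitions(BLOCKS)
print(bits(REACH_IN["body"]), bits(REACH_OUT["body"]))
print(defs_of("a", REACH_IN["body"]), defs_of("b", REACH_IN["body"]))
```

Only definition 2 (a = 4) of a reaches the loop body, which is why a can be replaced by 4 there, whereas b is reached by both definitions 4 and 5 and is not a constant.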