COP 5021 meeting -*- Outline -*-

* Dataflow Analysis (1.3)

** goals

   To propagate information about data through a program,
   in order to enable static analysis for uses such as
   code optimization and other program transformations.

** idea

------------------------------------------
       IDEA OF DATA FLOW ANALYSIS

What's the basic idea?

What is a data flow graph?

How is that used to model the semantics?

------------------------------------------
        ... Represent the program as a data flow graph

        ... nodes are elementary blocks
            edges describe how control passes from one elementary block
            to another (like an assembly language program)

        ... nodes transform information (differences from entry to exit)
            edges connect the exit of one node to the entry of another
            (equations)

** example

------------------------------------------
               EXAMPLE

  [y := 0]^1;
  [print y]^2;
  [read x]^3;
  while [x < 0]^4 do
     ([y := y+1]^5;
      [print y]^6;
      [read x]^7);
  [z := x]^8

What's the flow graph for this?

------------------------------------------
        ...
                   |
                   v
              [y := 0]^1
                   |
                   v
              [print y]^2
                   |
                   v
              [read x]^3
                   |
                   v
          /->[x < 0]^4 ----no-------->|
          |        |                  |
          |        | yes              |
          |        v                  |
          |  [y := y+1]^5             |
          |        |                  |
          |        v                  |
          |  [print y]^6              |
          |        |                  |
          |        v                  |
          |  [read x]^7               |
          |        |                  |
          \________/                  |
               ____________________/
              /
              v
          [z := x]^8
              |
              v

        Q: How would you handle Read? Print?
           they are elementary blocks
        Q: How would you handle if-then-else statements?
           with a split of the control flow and a join after
        Q: For loops?
           like while loops
        Q: How would you handle break?
           a flow from the break to after the loop
        Q: How would you handle try-catch and throw?
           a flow from the throw to the body of the surrounding catch,
           or out if there is none
        Q: How would you handle assert? Assume?
           a split from the statement to the next statement or an exit
        Q: Choose?
           a split
        Q: Parallel composition?
           a split again

*** the equational approach (1.3.1)

    In the in-class examples, we'll work with a taint analysis
    in the WHILE language extended with read/sanitize/print.
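    Before turning to the equations, it may help to see the flow graph of
    the example written down as data. The sketch below is in Python, which
    these notes do not otherwise use; the names flow and preds are
    hypothetical, chosen only for this illustration. Each pair (l, l')
    stands for an edge from the exit of block l to the entry of block l'.

```python
# Sketch only: the example's flow graph as a set of (from, to) label pairs.
# Labels 1..8 are the block labels of the example program.
flow = {
    (1, 2), (2, 3), (3, 4),
    (4, 5),          # test [x < 0]^4 is true: enter the loop body
    (5, 6), (6, 7),
    (7, 4),          # back edge from the end of the body to the test
    (4, 8),          # test is false: leave the loop
}

def preds(label):
    """Labels of the blocks that have an edge into the given block."""
    return sorted(src for (src, dst) in flow if dst == label)

print(preds(4))  # the loop test is entered from before the loop and from the body
```

    The test at label 4 is the only block with two incoming edges, which is
    where information from two exits has to be combined.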
------------------------------------------
   NODE AND EDGE EQUATIONS FOR TAINT ANALYSIS

Taint analysis: at each program point,
   find the set of variables that may have a value
   derived from a value previously read from the user ("tainted")

Tentry, Texit : Lab* -> Powerset(Var*)
  where Lab* = set of labels in program
        Var* = set of variables in prog

 block            Equation
=======================================
 [x:=a]^l         Texit(l) =


 [skip]^l         Texit(l) =
 [b]^l            Texit(l) =
 [read x]^l       Texit(l) =
 [sanitize x]^l   Texit(l) =
 [print x]^l      Texit(l) =

How are edges connected?

------------------------------------------
        ... [x:=a]^l:        let TE = intersect(FV(a), Tentry(l))
                             in if not(empty(TE))
                                then union(Tentry(l), {x})
                                else subtract(Tentry(l), {x})
            [skip]^l:        Tentry(l)
            [b]^l:           Tentry(l)
            [read x]^l:      union(Tentry(l), {x})
            [sanitize x]^l:  subtract(Tentry(l), {x})
            [print x]^l:     Tentry(l)

        Q: Why are these the right equations?
        ... they capture the semantics of the language and our
            understanding of what tainting means (but check that)

        Q: How does this work out for our example?
        ... the Texit of the tail of an edge is unioned into
            the Tentry of the head, i.e., Tentry(l) is the union
            of Texit(l') over all edges (l', l)

**** algorithm for solving the equations

        Q: What can we do to solve a set of simultaneous equations?

------------------------------------------
          WHAT IS A SOLUTION?

Consider the program

   [read x]^1; [sanitize x]^2

We get the following equations:

   Tentry(1) = Tentry(1)  // know nothing
   Texit(1)  = union(Tentry(1), {x})
   Tentry(2) = Texit(1)
   Texit(2)  = subtract(Tentry(2), {x})

Suppose the program has variables x and y

Then two of the possible solutions are:

  solution G:
     Tentry(1) = {x,y}
     Texit(1)  = {x,y}
     Tentry(2) = {x,y}
     Texit(2)  = {y}

  solution L:
     Tentry(1) = {}
     Texit(1)  = {x}
     Tentry(2) = {x}
     Texit(2)  = {}

Why are these solutions?
------------------------------------------
        ...
            Because if you plug in the values given,
            then each equation is true

------------------------------------------
   MATHEMATICAL TREATMENT OF SOLUTIONS

Can think of the dataflow equations:

   Tentry(1) = Tentry(1)  // know nothing
   Texit(1)  = union(Tentry(1), {x})
   Tentry(2) = Texit(1)
   Texit(2)  = subtract(Tentry(2), {x})

as a functional transformation:

   F(T1, T2, T3, T4) =
      (T1, union(T1, {x}), T2, subtract(T3,{x}))

so a solution is:

------------------------------------------
        ... a 4-tuple that satisfies each equation when we plug it in.
            Thus it is a fixed point of F (a tuple v such that F(v) = v),
            because F(v) = v says exactly that each equation is satisfied.

------------------------------------------
        WHICH SOLUTION IS BETTER?

For the transformation:

   F(T1, T2, T3, T4) =
      (T1,                 // Tentry(1)
       union(T1, {x}),     // Texit(1)
       T2,                 // Tentry(2)
       subtract(T3,{x}))   // Texit(2)

two of the solutions over {x,y} are:

  solution G: ({x,y}, {x,y}, {x,y}, {y})
  solution L: ({}, {x}, {x}, {})

Which is better?

------------------------------------------
        ... for taint checking, solution L is more precise
            (it reports fewer variables as possibly tainted),
            so it is better

------------------------------------------
          COMPARING SOLUTIONS

Ordering on sets: subset or equal
   (written \subseteq)
   defined by
      s1 \subseteq s2
      iff (\forall x \in s1 :: x \in s2)

   e.g., {} \subseteq {x}      {} \subseteq {y}
         {x} \subseteq {x,y}   {y} \subseteq {x,y}

Hasse diagram:

         {x,y}
         /   \
       {x}   {y}
         \   /
          {}

Solutions are tuples of sets, so:

   e.g.,
     ({},{}) \sqsubseteq ({x},{})        ({},{}) \sqsubseteq ({y},{})
     ({},{}) \sqsubseteq ({},{x})        ({},{}) \sqsubseteq ({},{y})
     ({x},{}) \sqsubseteq ({x,y},{})     ({x},{}) \sqsubseteq ({x},{y})
     ({y},{}) \sqsubseteq ({x,y},{})     ({y},{}) \sqsubseteq ({y},{x})
     ({},{x}) \sqsubseteq ({y},{x,y})    ({},{x}) \sqsubseteq ({},{x,y})
     ({},{y}) \sqsubseteq ({x},{x,y})    ({},{y}) \sqsubseteq ({},{x,y})
     ({x,y},{}) \sqsubseteq ({x,y},{y})  ({x},{y}) \sqsubseteq ({x,y},{y})
     ({y},{x}) \sqsubseteq ({x,y},{x})
     ...

For precision we want the

------------------------------------------
        ...
            we use a pointwise subset ordering,
            pronounced "approximates" and written \sqsubseteq,
            defined by

              (a, ..., z) \sqsubseteq (a', ..., z')
              iff (a \subseteq a', ..., and z \subseteq z')

        ... least solution in the \sqsubseteq ordering

------------------------------------------
            ALGORITHM IDEA

Goal: find least fixed point of F

fact: F is monotonic in \sqsubseteq
      because each operation used
      in defining F is monotonic

Induction idea:

  Base. Start with v0 = ({},...,{}),
        which is below every tuple,
        so v0 \sqsubseteq F(v0)

  Ind.  Since F is monotonic,
        F^n(v0) \sqsubseteq F^(n+1)(v0)
        for every n

  When F^n(v0) = F(F^n(v0))
  then F^n(v0) is a fixed point

  The first fixed point reached is the least one
  (for any fixed point w, v0 \sqsubseteq w,
   so F^n(v0) \sqsubseteq F^n(w) = w),
  so stop with that solution
------------------------------------------
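
        The algorithm idea can be tried out on the two-statement program
        [read x]^1; [sanitize x]^2 from the earlier slides. What follows
        is a minimal sketch in Python (not a language these notes use):
        F mirrors the four equations, and lfp is a hypothetical helper
        name that iterates F starting from the tuple of empty sets. It
        assumes the iteration terminates, which holds here because there
        are only finitely many tuples of subsets of {x,y}.

```python
# Sketch only: iterate F from the bottom tuple until nothing changes.
# F encodes the equations for  [read x]^1; [sanitize x]^2.

def F(v):
    t1, t2, t3, t4 = v       # Tentry(1), Texit(1), Tentry(2), Texit(2)
    return (
        t1,                  # Tentry(1) = Tentry(1)   (know nothing)
        t1 | {"x"},          # Texit(1)  = union(Tentry(1), {x})
        t2,                  # Tentry(2) = Texit(1)
        t3 - {"x"},          # Texit(2)  = subtract(Tentry(2), {x})
    )

def lfp(f, bottom):
    """Apply f repeatedly from bottom; return the first fixed point
    reached and the number of applications that changed the value."""
    v, steps = bottom, 0
    while True:
        nxt = f(v)
        if nxt == v:         # F^n(v0) = F(F^n(v0)): a fixed point
            return v, steps
        v, steps = nxt, steps + 1

bottom = (frozenset(),) * 4  # ({}, {}, {}, {})
solution, steps = lfp(F, bottom)
print(solution, steps)

# Solution G from the slides is also a fixed point, just a larger one.
G = (frozenset("xy"), frozenset("xy"), frozenset("xy"), frozenset("y"))
print(F(G) == G)
```

        Starting from the empty sets, the iteration stops at solution L,
        while G is only found if we start from a larger tuple. That is the
        sense in which iterating from the bottom reaches the least fixed
        point first.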