COP 5021 meeting -*- Outline -*-

* Abstract Interpretation (1.5)

** goals

   To specify an analysis in a way that is correct by construction.

   To calculate an analysis in a systematic way.

** idea

   Q: What's the basic idea?

------------------------------------------
   IDEA OF ABSTRACT INTERPRETATION (1.5)

------------------------------------------
   ...
   1. Write the semantics that collects (more than) the information
      required, as sets of traces

   2. Define an abstraction function (a) that extracts the information
      needed to answer the question of the analysis

   3. Calculate the analysis

   Q: What's a collecting semantics?

      A mathematical semantics of the program that records all the
      necessary information for the analysis.

   Q: How is that used to extract the analysis?

      The analysis information is an abstraction (summary) of the
      information in the collecting semantics.

      There is a Galois connection that is used to describe correctness...

** example

------------------------------------------
   EXAMPLE

   [y := 0]^1;
   [print y]^2;
   [read x]^3;
   while [x < 0]^4 do
      ([y := y+1]^5;
       [print y]^6;
       [read x]^7);
   [z := x]^8

   For taint analysis we seek sets of variables
   at each program point that may have a value
   derived from a value previously read from the user
   ("tainted")
------------------------------------------

   Q: What is a collecting semantics?

      a semantics of programs that collects information from
      different possible runs; usually this is done as sets of traces.

   Q: What would a collecting semantics look like for this example?

      collect sets of traces, where each trace gives
      where and how the variables got their value

      See page 14 for one example, but not the one above

      Trace = (Var x Lab? x Dependants)*
      Lab? = Lab \cup {?}
      Dependants = Powerset(Dependant)
      Dependant = Var* \cup {read}

      CSentry, CSexit: Lab* -> Powerset(Trace)

      CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})}
      CSentry(2) = CSexit(1)
      CSentry(3) = CSexit(2)
      CSentry(4) = CSexit(3) \cup CSexit(7)
      CSentry(5) = CSexit(4)
      CSentry(6) = CSexit(5)
      CSentry(7) = CSexit(6)
      CSentry(8) = CSexit(4)

      CSexit(1) = {tr : (y,1,{}) | tr \in CSentry(1)}
      CSexit(2) = CSentry(2)
      CSexit(3) = {tr : (x,3,{read}) | tr \in CSentry(3)}
      CSexit(4) = CSentry(4)
      CSexit(5) = {tr : (y,5,Y) | Y = depends(y,tr), tr \in CSentry(5)}
      CSexit(6) = CSentry(6)
      CSexit(7) = {tr : (x,7,{read}) | tr \in CSentry(7)}
      CSexit(8) = {tr : (z,8,X) | X = depends(x,tr), tr \in CSentry(8)}

      where depends(x,tr) = X
        iff X is the set of Dependants in the rightmost tuple of tr
            that is of the form (x,l,X)

   Q: Why is CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} ?

      Nothing has happened yet

   Q: How is this different from the reaching definitions analysis?

      it's sets of ordered variable names instead of traces

   Q: How would you handle if then else statements?

      union the sets of traces from the two branches,
      as CSentry(4) does for the loop

   Q: For loops?

   Q: How would you handle break?

*** solving the equations

------------------------------------------
   SOLVING THE EQUATIONS

   Traces = Powerset((Var x Lab? x Dependants)*)

   G: Traces^{16} -> Traces^{16}

   G is defined by:

     G(CS_1, ..., CS_{16}) = (G_1(CS_1, ..., CS_{16}),
                              G_2(CS_1, ..., CS_{16}),
                              ...,
                              G_{16}(CS_1, ..., CS_{16}))
   where
     G_1(CS_1, ..., CS_{16})   // CSentry(1)
        = {(x,?,{}),(y,?,{}),(z,?,{})}
     G_2(CS_1, ..., CS_{16})   // CSexit(1)
        = {tr : (y,1,{}) | tr \in CS_1}
     G_3(CS_1, ..., CS_{16})   // CSentry(2)
        = CS_2
     ...

   Solution

   (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8))
   is a solution if

     G(CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8))
       = (CSentry(1), CSexit(1), CSentry(2), ..., CSentry(8), CSexit(8))
------------------------------------------

   Q: What is G_1?

      the constant function that always returns the set
      {(x,?,{}),(y,?,{}),(z,?,{})}, i.e., CSentry(1)

   Q: Why does it have 16 parameters?
      Because all of the G_i do; it doesn't use any of them.

   Q: What is G_2?

      the function that computes CSexit(1)

   Q: Why is G_3 = CS_2 in this example?

      because the semantics says that CSentry(2) = CSexit(1)
      in this example

   So a solution satisfies G(\vec{CS}) = \vec{CS};
   equivalently, the system of equations is \vec{CS} = G(\vec{CS}).

   Q: What's the ordering on the solution space?

      pointwise for the tuples, \subseteq on the sets of traces.

   Q: What does it mean for G to be monotone?

      that means if it's given a bigger tuple of sets of traces,
      then it returns one that is no smaller.

   So by the theory one can find a least solution,
   the least fixed point of G, lfp(G).

*** Galois connections

------------------------------------------
   ABSTRACTION AND CONCRETIZATION

   abstraction function for Taint analysis:

     a: Traces -> Powerset(Var*)
     a(trs) = {x | read \in depends(x,tr), tr \in trs}

   concretization function for Taint analysis:

     g: Powerset(Var*) -> Traces
     g(Y) = {tr | for all x, read \in depends(x,tr) implies x \in Y}

   Adjunction, or Galois connection:

     a(X) \subseteq Y <==> X \subseteq g(Y)
------------------------------------------

   Note: a and g are \alpha and \gamma in the book

   Note: g(Y) has to contain every trace whose tainted variables
   all lie in Y; otherwise the adjunction fails (e.g., for Y = {}).

   Q: Would a and g be different for the RD analysis?

      yes

   draw a picture like figure 1.3

------------------------------------------
      set of traces               set of vars
    |---------------|          |---------------|
    |               |    g     |               |
    |     g(Y) <---------------------- Y       |
    |      U|       |          |      U|       |
    |       X ----------------------> a(X)     |
    |               |    a     |               |
    |_______________|          |_______________|
------------------------------------------

*** calculating the analysis

   Q: Why do we care about the abstraction and concretization functions?
------------------------------------------
   CALCULATING THE ANALYSIS

   Extend a and g pointwise to tuples:

     a(TR_1, ..., TR_16) = (a(TR_1), ..., a(TR_16))
     g(Y_1, ..., Y_16) = (g(Y_1), ..., g(Y_16))

   Define the analysis by the function

     a o G o g: Powerset(Var*)^16 -> Powerset(Var*)^16

   so for each i in {1..16}

     (a o G_i o g): Powerset(Var*)^16 -> Powerset(Var*)

   by

     a(G_1(g(T_1, ..., T_16)))
        = a({(x,?,{}),(y,?,{}),(z,?,{})})
        = {}
     a(G_2(g(T_1, ..., T_16)))
        = a({tr : (y,1,{}) | tr \in g(T_1)})
     ...

   So a solution (Tentry(1), ..., Texit(8)) has the property that

     (Tentry(1), ..., Texit(8)) = (a o G o g)(Tentry(1), ..., Texit(8))
------------------------------------------

   I'm abusing notation above,
   but the book uses vector notation for the extensions

   Q: What's T_3?

      a(G_3(g(T_1, ..., T_16))) = a(g(T_2))

   Here we can use the adjunction.

   Lemma: suppose a and g form an adjunction.
          Then a(g(Y)) \subseteq Y.

   Proof: By definition of adjunction, for all X and Y,

            a(X) \subseteq Y iff X \subseteq g(Y).    (adj)

          We know that g(Y) \subseteq g(Y), because \subseteq is reflexive.
          So by substituting g(Y) for X in the property (adj),
          we get a(g(Y)) \subseteq Y.
   QED

   Thus a(g(T_2)) \subseteq T_2, and so we can safely
   (because this is a may analysis) use T_2 in place of a(g(T_2)).

   Q: How does this compare to the analysis we created by hand?

   Q: What's the benefit of doing things this way?

      - we know there's a least solution, lfp(a o G o g)
      - we get correctness of the analysis,
        from correctness of the collecting semantics

          a(lfp(G)) is safely approximated by lfp(a o G o g)

        when these are equal, we say the analysis is optimal.
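
   A minimal sketch (in Python, not from the book) of how the least
   solution lfp(a o G o g) could be calculated for the example, by
   iterating from the tuple of empty sets until nothing changes.
   The 16 equations in F are a hand simplification of the a o G_i o g,
   assuming T may be used in place of a(g(T)) as argued with the lemma.
   The names assign, F, lfp, bottom, and solution are made up for
   illustration.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Point = Tuple[str, int]              # ("entry", label) or ("exit", label)
State = Dict[Point, FrozenSet[str]]  # tainted variables at each point

EMPTY: FrozenSet[str] = frozenset()


def assign(tainted: FrozenSet[str], target: str,
           sources: Iterable[str]) -> FrozenSet[str]:
    """Abstract effect of [target := e], where e mentions `sources`.

    Assumption: target is tainted afterwards exactly when some source
    was tainted before, mirroring the depends(...) lookups in CSexit.
    """
    rest = tainted - {target}
    return rest | {target} if tainted & frozenset(sources) else rest


def F(t: State) -> State:
    """One application of the hand-simplified a o G o g (16 components)."""
    en = lambda l: t[("entry", l)]
    ex = lambda l: t[("exit", l)]
    return {
        ("entry", 1): EMPTY,                     # nothing has been read yet
        ("exit", 1): assign(en(1), "y", []),     # [y := 0]^1
        ("entry", 2): ex(1),
        ("exit", 2): en(2),                      # [print y]^2
        ("entry", 3): ex(2),
        ("exit", 3): en(3) | {"x"},              # [read x]^3
        ("entry", 4): ex(3) | ex(7),             # loop head joins two paths
        ("exit", 4): en(4),                      # [x < 0]^4
        ("entry", 5): ex(4),
        ("exit", 5): assign(en(5), "y", ["y"]),  # [y := y+1]^5
        ("entry", 6): ex(5),
        ("exit", 6): en(6),                      # [print y]^6
        ("entry", 7): ex(6),
        ("exit", 7): en(7) | {"x"},              # [read x]^7
        ("entry", 8): ex(4),
        ("exit", 8): assign(en(8), "z", ["x"]),  # [z := x]^8
    }


def lfp(f: Callable[[State], State], bottom: State) -> State:
    """Kleene iteration: bottom, f(bottom), f(f(bottom)), ... until stable."""
    current = bottom
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt


bottom: State = {(kind, l): EMPTY
                 for kind in ("entry", "exit") for l in range(1, 9)}
solution = lfp(F, bottom)

for l in range(1, 9):
    print(l, sorted(solution[("entry", l)]), sorted(solution[("exit", l)]))
```

   The iteration terminates because F is monotone and there are only
   finitely many tuples of subsets of {x, y, z}. In the result, x is
   tainted from Texit(3) onward, Texit(8) is {x, z}, and y is never
   tainted, since read never appears among its dependants.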