COP 5021 meeting -*- Outline -*-

* Abstract Interpretation (1.5)

** goals

   To specify an analysis in a way that is correct by construction.
   To calculate an analysis in a systematic way.

** idea

   Q: What's the basic idea?
------------------------------------------
  IDEA OF ABSTRACT INTERPRETATION (1.5)


------------------------------------------
    ...
      1.  Write the semantics that collects (more than) the information
          required, as sets of traces
      2.  Define an abstraction function (a) that extracts the
          information needed to answer the question of the analysis
      3.  Calculate the analysis

   Q:  What's a collecting semantics?

          A mathematical semantics of the program that records all the
          necessary information for the analysis.

   Q:  How is that used to extract the analysis?

          The analysis information is an abstraction (summary) of the
          information in the collecting semantics.
          
          There is a Galois connection that is used to describe correctness...

** example
------------------------------------------
             EXAMPLE

  [y := 0]^1;
  [print y]^2;
  [read x]^3;
  while [x < 0]^4
  do ([y := y+1]^5;
      [print y]^6;
      [read x]^7);
  [z := x]^8

For taint analysis we seek 
sets of variables at each program point
that may have a value derived from a value
previously read from the user ("tainted")

------------------------------------------

     Q: What is a collecting semantics?
        a semantics of programs that collects information from
        different possible runs; usually this is done as sets of traces.
     Q: What would a collecting semantics look like for this example?
        collect sets of traces, where each trace gives where and how
        the variables got their value

  See page 14 for one example, but not the one above

Trace = (Var x Lab? x Dependants)*
Lab? = Lab \cup {?}
Dependants = Powerset(Dependant)
Dependant = Var* \cup {read}

CSentry, CSexit: Lab* -> Powerset(Trace)

CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})}
CSentry(2) = CSexit(1)
CSentry(3) = CSexit(2)
CSentry(4) = CSexit(3) \cup CSexit(7)
CSentry(5) = CSexit(4)
CSentry(6) = CSexit(5)
CSentry(7) = CSexit(6)
CSentry(8) = CSexit(4)

CSexit(1) = {tr : (y,1,{}) | tr \in CSentry(1)}
CSexit(2) = CSentry(2)
CSexit(3) = {tr : (x,3,{read}) | tr \in CSentry(3)}
CSexit(4) = CSentry(4)
CSexit(5) = {tr : (y,5,Y) | Y = depends(y,tr), tr \in CSentry(5)}
CSexit(6) = CSentry(6)
CSexit(7) = {tr : (x,7,{read}) | tr \in CSentry(7)}
CSexit(8) = {tr : (z,8,X) | X = depends(x,tr), tr \in CSentry(8)}

where depends(x,tr) = X iff X is the set of Dependants in the
                            rightmost tuple of tr that is of the form (x,l,X)

     Q: Why is CSentry(1) = {(x,?,{}),(y,?,{}),(z,?,{})} ?
     Nothing has happened yet

     Q:  How is this different from the reaching definitions analysis?
         it's sets of ordered variable names instead of traces

     Q:  How would you handle if then else statements?
             union the sets of traces from the two branches
     Q:  For loops?
     Q:  How would you handle break?

*** solving the equations

------------------------------------------
       SOLVING THE EQUATIONS

 Traces = Powerset((Var x Lab? x Dependants)*)

G: Traces^{16} -> Traces^{16}

G is defined by:

 G(CS_1, ..., CS_{16})
   = (G_1(CS_1, ..., CS_{16}),
      G_2(CS_1, ..., CS_{16}),
      ...,
      G_{16}(CS_1, ..., CS_{16}))

where
  G_1(CS_1, ..., CS_{16})  // CSentry(1)
        = {(x,?,{}),(y,?,{}),(z,?,{})}

  G_2(CS_1, ..., CS_{16})  // CSexit(1)
        = {tr : (y,1,{}) | tr \in CS_1}

  G_3(CS_1, ..., CS_{16})
        = CS_2
  ...


Solution

(CSentry(1), CSexit(1), CSentry(2),
       ..., CSentry(8), CSexit(8))

 is a solution if

  G(CSentry(1), CSexit(1), CSentry(2),
    ..., CSentry(8), CSexit(8))
 = (CSentry(1), CSexit(1), CSentry(2),
     ..., CSentry(8), CSexit(8))

------------------------------------------

   Q:  What is G_1?
        the constant function that returns CSentry(1),
        i.e., {(x,?,{}),(y,?,{}),(z,?,{})}
   Q:  Why does it have 16 parameters?
        Because all of the G_i do; G_1 just doesn't use any of them.
   Q:  What is G_2?
        the function that computes CSexit(1)
   Q:  Why is G_3 = CS_2 in this example?
        because the semantics says that CSentry(2) = CSexit(1) in this example

    so we have G(\vec{CS}) = \vec{CS},
    or we can represent the system of equations as \vec{CS} = G(\vec{CS}).

   Q:  What's the ordering on the solution space?
       pointwise for the tuples, subseteq on the sets of traces.

   Q:  What does it mean for G to be monotone?
       that means if it's given a bigger tuple of sets of traces,
       then it returns one that is no smaller.

   So by the theory one can find a least solution, the least fixed
   point of G, lfp(G).
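
   A rough sketch of what iterating G from the bottom element looks like
   for the example. It assumes the same hypothetical encoding of traces as
   tuples of (var, label, deps) triples, reads CSentry(1) as one initial
   trace, and updates all 16 components at once in each round. The names
   G, app and iterate are illustrative only.

```python
READ = "read"
E = frozenset()
INIT = (("x", "?", E), ("y", "?", E), ("z", "?", E))  # assumed initial trace

def depends(x, tr):
    """Deps of the rightmost triple of tr that mentions x."""
    for var, _label, deps in reversed(tr):
        if var == x:
            return deps
    raise KeyError(x)

def app(trs, var, lab, deps_of):
    """{tr : (var, lab, deps_of(tr)) | tr in trs}"""
    return frozenset(tr + ((var, lab, deps_of(tr)),) for tr in trs)

def G(cs):
    """One application of G; cs maps ("entry"|"exit", label) to a set of traces."""
    en = lambda l: cs[("entry", l)]
    ex = lambda l: cs[("exit", l)]
    none = lambda tr: E
    read = lambda tr: frozenset({READ})
    return {
        ("entry", 1): frozenset({INIT}),
        ("entry", 2): ex(1),
        ("entry", 3): ex(2),
        ("entry", 4): ex(3) | ex(7),
        ("entry", 5): ex(4),
        ("entry", 6): ex(5),
        ("entry", 7): ex(6),
        ("entry", 8): ex(4),
        ("exit", 1): app(en(1), "y", 1, none),
        ("exit", 2): en(2),
        ("exit", 3): app(en(3), "x", 3, read),
        ("exit", 4): en(4),
        ("exit", 5): app(en(5), "y", 5, lambda tr: depends("y", tr)),
        ("exit", 6): en(6),
        ("exit", 7): app(en(7), "x", 7, read),
        ("exit", 8): app(en(8), "z", 8, lambda tr: depends("x", tr)),
    }

def iterate(n):
    """Apply G n times, starting from the tuple of 16 empty sets."""
    cs = {(k, l): E for k in ("entry", "exit") for l in range(1, 9)}
    for _ in range(n):
        cs = G(cs)
    return cs
```

   Under this encoding each further trip around the loop contributes a
   longer trace to CSexit(8), so the iteration never stabilizes after
   finitely many rounds. The least fixed point still exists, but it is
   not something to compute directly, which is one reason to abstract.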

*** Galois connections

------------------------------------------
     ABSTRACTION AND CONCRETIZATION

abstraction function for Taint analysis:

   a: Traces -> Powerset(Var*)
   a(trs) = {x | read \in depends(x,tr),
                 tr \in trs}

concretization function for Taint analysis:

  g: Powerset(Var*) -> Traces
  g(Y) = {tr | for all x: read \in depends(x,tr) ==> x \in Y}

Adjunction, or Galois connection:

   a(X) \subseteq Y <==> X \subseteq g(Y)
------------------------------------------
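
    The adjunction can be spot-checked on a small finite sample. The
    sketch below makes several assumptions: traces are encoded as tuples
    of (var, label, deps) triples, g(Y) is read as the set of traces whose
    tainted variables all lie in Y, and g(Y) is represented only by a
    membership test (in_gamma), since the set itself is infinite. All
    names here are hypothetical.

```python
from itertools import combinations

READ = "read"
VARS = ("x", "y", "z")
E = frozenset()

def depends(x, tr):
    """Deps of the rightmost triple of tr that mentions x."""
    for var, _label, deps in reversed(tr):
        if var == x:
            return deps
    raise KeyError(x)

def alpha(trs):
    """a(trs): variables that are tainted at the end of some trace in trs."""
    return frozenset(x for tr in trs for x in VARS if READ in depends(x, tr))

def in_gamma(tr, Y):
    """Membership test for g(Y): every tainted variable of tr is in Y."""
    return all(x in Y for x in VARS if READ in depends(x, tr))

def subsets(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Three sample traces from the example program (loop body never entered).
INIT = (("x", "?", E), ("y", "?", E), ("z", "?", E))
T3 = INIT + (("y", 1, E), ("x", 3, frozenset({READ})))
T8 = T3 + (("z", 8, depends("x", T3)),)
SAMPLE = [INIT, T3, T8]

# a(X) subseteq Y  <==>  X subseteq g(Y), for every X and Y in the sample.
adjunction_holds = all(
    (alpha(X) <= Y) == all(in_gamma(tr, Y) for tr in X)
    for X in subsets(SAMPLE)
    for Y in subsets(VARS)
)
```

    This only tests the property on a handful of trace sets; it is not a
    proof.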

    Note: a and g are \alpha and \gamma
          in the book

    Q: Would a and g be different for the RD analysis?
       yes

    draw a picture like figure 1.3

------------------------------------------

   set of traces          set of vars
 |---------------|     |---------------|
 |               |  g  |               |
 |     g(Y) <---------------- Y        |
 |      U|       |     |      U|       |
 |      X  -----------------> a(X)     |
 |               |  a  |               |
 |_______________|     |_______________|

------------------------------------------


*** calculating the analysis

    Q:  Why do we care about the abstraction and concretization
    functions?

------------------------------------------
      CALCULATING THE ANALYSIS

Extend a and g pointwise to tuples:

 a(TR_1, ..., TR_16)
    = (a(TR_1), ..., a(TR_16))

 g(Y_1, ..., Y_16)
    = (g(Y_1), ..., g(Y_16))


Define the analysis by the function

  a o G o g: Powerset(Var*)^16 -> Powerset(Var*)^16

so for each i in {1..16}

 (a o G_i o g): Powerset(Var*)^16 -> Powerset(Var*)

by

 a(G_1(g(T_1, ..., T_16)))
      = a({(x,?,{}),(y,?,{}),(z,?,{})})
      = {}

 a(G_2(g(T_1, ..., T_16)))
      = a({tr : (y,1,{}) | tr \in g(T_1)})

 ...

So a solution

 (Tentry(1), ..., Texit(8))

has the property that

(Tentry(1), ..., Texit(8))
 = (a o G o g)(Tentry(1), ..., Texit(8))

------------------------------------------

    I'm abusing notation above, but the book uses vector notation for
    the extensions
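
    Unlike the collecting semantics, the abstract equations live in a
    finite lattice, so their least solution can be found by plain
    iteration. The sketch below is an assumption about what the
    calculation yields for each label (the notes only work out the first
    few components); the names F, assign and lfp are hypothetical.

```python
def assign(s, var, tainted):
    """Taint set after an assignment to var: var is tainted iff `tainted`."""
    s = s - {var}
    return s | {var} if tainted else s

def F(t):
    """Assumed abstract equations; t maps ("entry"|"exit", label) to a set of vars."""
    en = lambda l: t[("entry", l)]
    ex = lambda l: t[("exit", l)]
    return {
        ("entry", 1): frozenset(),
        ("entry", 2): ex(1),
        ("entry", 3): ex(2),
        ("entry", 4): ex(3) | ex(7),
        ("entry", 5): ex(4),
        ("entry", 6): ex(5),
        ("entry", 7): ex(6),
        ("entry", 8): ex(4),
        ("exit", 1): assign(en(1), "y", False),           # y := 0
        ("exit", 2): en(2),                                # print y
        ("exit", 3): assign(en(3), "x", True),            # read x
        ("exit", 4): en(4),                                # x < 0
        ("exit", 5): assign(en(5), "y", "y" in en(5)),    # y := y+1
        ("exit", 6): en(6),                                # print y
        ("exit", 7): assign(en(7), "x", True),            # read x
        ("exit", 8): assign(en(8), "z", "x" in en(8)),    # z := x
    }

def lfp(f, bottom):
    """Iterate f from bottom until nothing changes (finite lattice, monotone f)."""
    cur = bottom
    while True:
        nxt = f(cur)
        if nxt == cur:
            return cur
        cur = nxt

bottom = {(k, l): frozenset() for k in ("entry", "exit") for l in range(1, 9)}
sol = lfp(F, bottom)
```

    With these equations x is tainted from the exit of label 3 onward,
    z becomes tainted at the exit of label 8, and y is never tainted.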

    Q: What's T_3?
     a(G_3(g(T_1, ..., T_16)))
      = a(g(T_2))

     Here we can use the adjunction.
     Lemma: suppose a and g are an adjunction.
     Then a(g(Y)) \subseteq Y.
     Proof: By definition of adjunction, for all X and Y,
               a(X) \subseteq Y iff X \subseteq g(Y).   (adj)
            We know that g(Y) \subseteq g(Y), because \subseteq is reflexive.
            So by substituting g(Y) for X in the property adj, we get
               a(g(Y)) \subseteq Y. QED
     Thus a(g(T_2)) \subseteq T_2, and so we can safely (because this
     is a may analysis) use T_2 in place of a(g(T_2)).
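
     For readers who like to see such steps checked mechanically, the
     lemma can be stated in Lean 4 without any library. This is a sketch:
     the orderings leA and leB are passed in as arbitrary relations, and
     only reflexivity of leA is assumed, mirroring the proof above. The
     theorem name is made up for this example.

```lean
-- a(g(Y)) is below Y whenever a and g form an adjunction.
theorem abs_conc_le {A B : Type}
    (leA : A → A → Prop) (leB : B → B → Prop)
    (a : A → B) (g : B → A)
    (refl : ∀ X, leA X X)
    (adj : ∀ X Y, leB (a X) Y ↔ leA X (g Y)) :
    ∀ Y, leB (a (g Y)) Y :=
  -- instantiate (adj) with X := g Y and use reflexivity of leA
  fun Y => (adj (g Y) Y).mpr (refl (g Y))
```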


    Q:  How does this compare to the analysis we created by hand?

    Q:  What's the benefit of doing things this way?
        - we know there's a least solution,
           lfp(a o G o g)
        - we get correctness of the analysis,
          from correctness of the collecting semantics

           a(lfp(G)) is safely approximated by lfp(a o G o g)

           when these are equal, we say the analysis is optimal.