I. Program Analysis
 A. What is program analysis?
------------------------------------------
        WHAT IS PROGRAM ANALYSIS?

Def: *program analysis* is


------------------------------------------
     How does this differ from (human) code inspection?
     How is it different than testing? Runtime assertion checking?
 B. use in software security
------------------------------------------
          USES OF PROGRAM ANALYSIS
            IN SOFTWARE SECURITY

General goal/idea:


  Program ---> Static Analyzer --> Bug
                                   Reports

Examples:
  - warn about bad library functions
      (e.g., gets)


------------------------------------------
  1. vs. program verification
------------------------------------------
   PROGRAM ANALYSIS vs. VERIFICATION

program analysis:
  - no explicit specification of behavior
  - little annotation
  - targeted to a specific property
  - may lose precision to gain speed
  - no human intervention once started


program verification:
  - behavior usually specified in detail
  - lots of annotation
  - for any (safety) property
  - may take a day per function
  - often requires human intervention


------------------------------------------
------------------------------------------
            SAFETY vs. LIVENESS

Def: a *safety* property says that


Def: a *liveness* property says that


------------------------------------------
   What kind of property does static analysis work on?
   Are typical security policies safety or liveness properties?
   Are any security properties liveness properties?
------------------------------------------
       PROGRAM ANALYSIS GOALS

 - little or no input from programmers


 - correctness

 - efficient (at compile time):
    - time
    - space


------------------------------------------
 C. equivalent ideas
------------------------------------------
              EQUIVALENCE OF
   DIFFERENT KINDS OF PROGRAM ANALYSIS

An analysis can be expressed as:

 - Dataflow equations
 - Constraints
 - Abstract Interpretation
 - Type system
 - Control Flow Analysis

Differences in Emphasis:

 - Dataflow equations } - compilers
 - Abstract interpretation - correctness

Above can approximate dependence
      on dynamic values
      in imperative languages

 - Type systems
     - no flow dependence
     - uniform treatment of each statement

 - Constratint based
   = Control Flow Analysis
     - analysis of functional languages

------------------------------------------
 D. key design idea
------------------------------------------
        DESIGNING A PROGRAM ANALYSIS

Slogan: Err on the safe side!

Key question:
 - What's the bad outcome?

 - What would track the opposite of that?

 - What would be safe?


Because: what we know in a sound analysis
   is that the claimed safety property
   holds in all possible executions

             All programs
  |--------------------------------|
  |      Programs that are safe    |
  |   |------------------------|   |
  |   |  Programs claimed safe |   |
  |   |                        |   |
  |   |------------------------|   |
  |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
  |                                |
  |    Programs that can be bad    |
  |--------------------------------|


Example:
 Dead code analysis (e.g., goto fail bug):
     - What's the bad outcome?

     - What would track the opposite?

     - What would be safe?


             All programs
  |--------------------------------|
  |   Programs with no dead code   |
  |   |------------------------|   |
  |   |  Programs claimed to   |   |
  |   |   have no dead code    |   |
  |   |------------------------|   |
  |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
  |                                |
  |  Programs that have dead code  |
  |--------------------------------|

------------------------------------------
     Assuming the analysis is sound,
        will we have some false positives (claiming bad programs are safe)?
     Will we have some false nagatives (good programs not claimed safe)?
------------------------------------------
           FOR YOU TO DO

Suppose we want to stop using gets
  (from libc's stdio)
  what should we do in an analysis?


------------------------------------------
        What is the bad outcome to prevent?
        What would track the opposite?
        What would be safe?
 E. intraprocedural vs. interprocedural
------------------------------------------
   INTRA vs. INTER-PROCEDURAL ANALYSIS

Def: an *intraprocedural* analysis


Def: an *interprocedural* analysis


------------------------------------------
 F. data flow analysis
  1. may vs. must analysis
------------------------------------------
     MAY vs. MUST ANALYSES

Def: A "may analysis" tracks


     A "must analysis" tracks


------------------------------------------
  2. sources of imprecision
   a. conditional statements
------------------------------------------
          CONDITIONAL STATEMENTS

Might the following code use gets?

    if (...) {
        // 2
        gets(buffer);
    } else {
        // 4
        fgets(buffer, sizeof(buffer), fn);
    }


------------------------------------------
   b. loops
------------------------------------------
             LOOP STATEMENTS

Might the following code use gets?

    while (...) {
        // 2
        gets(buffer);
    }


What about:

    while (i < BUFSIZ) {
       // 5
       if (i >= 3) gets(buffer);
       i++;
    }

------------------------------------------
   c. handling different kinds of control structures (CFGs)
------------------------------------------
   DIFFERENT CONTROL STRUCTURES

Can we avoid special cases for:
  - switch statments,
  - if's with different numbers of arms,
  - for loops,
  - do-until loops,
  - code with exception handlers,
  - code with gotos?


------------------------------------------
------------------------------------------
        CONTROL FLOW GRAPHS (CFGs)

1. Give each atomic statment and test
    a unique label

2. Form CFG from:

- Blocks (nodes) that are
   - statements or tests, with:

     - initial label

     - set of final labels


- Flows (edges):
   - relation between labels
     (map from a statement
          to a pair of labels)


Example:

    z = 1; // 1
    while (x > 0) // 2
    {  z = z*y;   // 3
       x = x-1; }  // 4
    print("done"); // 5

  Blocks are:


  Flows are:


Picture:
       [ 1 ]
         |
         v
   /-->[ 2 ]--\
   |     |    |
   |     v    |
   |   [ 3 ]  |
   |     |    |
   |     v    |
   \---[ 4 ]  |
              |
         /----/
         v
       [ 5 ]

------------------------------------------
        What's the initial label of each statment and test?
        What's the set of final labels for the test at label 2?
  3. example: dead code elimination
   a. design
------------------------------------------
       UNREACHABLE CODE DETECTION

 - What's the bad outcome?


 - What would track the opposite?


  - What would be safe?


------------------------------------------
   b. property space
------------------------------------------
          PROPERTY SPACE

Property to track:
    Set of labels that reach this label


       \  |  /
       v  v  v
       [  l  ]


    A "may" analysis


    A "backwards" analysis,
      want property of flows into a block


------------------------------------------
      What's a source of imprecision in this analysis?
      Since it's a may analysis, how do we merge different control
      flows that come into a block?
      How can we find what flows go into a block?
    i. describing an analysis
     Why describe an analysis in terms of property values and
     transfer functions?
    ii. starting property value
------------------------------------------
   STARTING PROPERTY VALUE (BASE CASE)

The initial value of a property space
  is used for value at the initial label
  (in a forward analysis)


Isolation:
   initial label in a forward analysis
      should be *isolated*,
      so no flow to it from inside prog.


------------------------------------------
       What about for a backwards analysis?
       Does every structured program have isolated initial labels?
       And isolated final labels?
       Can we convert a program P to an equivalent program P'
           that has isolated labels?
    iii. combining information
------------------------------------------
      COMBINING PROPERTY INFORMATION
             (JOINS)

       \  |  /
       v  v  v
       [ stmt]   or   [ stmt ] <--
                         ^
                         |

Why?
   - Make the analysis uniform:
      always one value for entry and exit


Suppose property space is a set, then

For a "may" analysis, combine using:


For a "must" analysis, combine using:


------------------------------------------
    What kinds of flows could combine in these ways?
    iv. transfer functions
------------------------------------------
           TRANSFER FUNCTIONS

Def: a *transfer function* describes


------------------------------------------
    v. finding solutions
------------------------------------------
    SOLVING RECURSIVE SYSTEMS OF
       DATAFLOW EQUATIONS

Loops (and recursion) can lead to
   recursive definitions of properties

When the property space is finite,
   then a solution will exist

Chaotic iteration:
  1. initialize property sets at each label


  2.a. Pick a label of a statement
       whose transfer function
       changes some property value(s)
    b. Update the property value(s)
       using that transfer function
    c. repeat until no changes can be made

The output is the property values
   at entry and exit of each label
------------------------------------------
    Could we make this run faster?
    vi. example
------------------------------------------
      EXAMPLE: UNREACHABLE CODE

Consider C with statments:
   assignments, empty statements (skip),
   if-then-else, while,
   sequencing, labels

 Stmt ::= L: x = E; | L : ;
      | if (L: E) { Stmt1 } else { Stmt2 }
      | while (L: E) { Stmt }
      | Stmt1 Stmt2

 Atomic blocks are:
    S ::= L : x = E; | L : ;
        | L : E

So CFG of

    l1: x = 0;
    while (l2: x < 4) {
       l3: y = y+1;
       l4: x = x-1; }
    l5: y = x*x;

is

     [ l1: x = 0;]
           |
           v
  /->[ l2: x < 4; ]--\
  |        |         |
  |        v         |
  |  [ l3: y = y+1;] |
  |        |         |
  |        v         |
  \--[ l4: x = x-1;] |
                     |
           /---------/
           |
           v
     [ l5: y = x*x;]

Set of flows is:


We want a forward, may analysis with:

Property space:
  Reaches(l) = Set of labels that
               be reached from start
  initial value: {}

Combination:
  join is \union
  EntryTo(l) =
    union {ExitOf(l') | (l',l) flows}

Transfer functions:
 f(l: x = E;, EntryTo(l))
      = {l} union EntryTo(l)
 f(l: ;, EntryTo(l)) = EntryTo(l)
 f(l:E, EntryTo(l)) = {l} union EntryTo(l)


------------------------------------------
------------------------------------------
    EQUATIONS AND SOLUTION FOR EXAMPLE

EntryTo(l1) = {}
ExitOf(l1) = {l1} union EntryTo(l1)
EntryTo(l2) = union{ExitOf(l1),ExitOf(l4)}
ExitOf(l2) = {l2} union EntryTo(l2)
EntryTo(l3) = ExitOf(l2)
ExitOf(l3) = {l3} union EntryTo(l3)
EntryTo(l4) = ExitOf(l3)
ExitOf(l4) = {l4} union EntryTo(l4)
EntryTo(l5) = ExitOf(l2)
ExitOf(l5) = {l5} union EntryTo(l5)

Trace of iterations:

     Steps: 0  1
EntryTo(l1) {}
ExitOf(l1)  {}

EntryTo(l2) {}
ExitOf(l2)  {}

EntryTo(l3) {}
ExitOf(l3)  {}

EntryTo(l4) {}
ExitOf(l4)  {}

EntryTo(l5) {}
ExitOf(l5)  {}
------------------------------------------
   Is this result what happens dynamically?
   Is it safe?
   How could we make this safe?
  4. building tools with dataflow analysis
------------------------------------------
   BUILDING TOOLS FOR PROGRAM ANALYSIS

Use an existing set of tools:
   Parser generator and:
   
   - LLVM
   - JastAdd Java compiler
   - XText
------------------------------------------
 G. control flow analysis
II. Symbolic Execution
 A. What is symbolic execution?
------------------------------------------
        SYMBOLIC EXECUTION

Def: *symbolic execution* is


------------------------------------------
------------------------------------------
      OVERVIEW OF SYMBOLIC EXECUTION

Key paper:
James C. King,
"Symbolic execution and program testing"
CACM, Vol. 19, num. 7, pp. 385-394, 1976.
http://doi.acm.org/10.1145/360248.360252

General idea:


Approach:


Def: An *isomorphism class* is


Advantages:


Drawbacks:


------------------------------------------
    What kind of classes of inputs would be most helpful to have?
------------------------------------------
       SYMBOLIC EXECUTION TOOLS

(Quoted from p. 1066 of:
Cristian Cadar, et al. 2011.
"Symbolic execution for software testing
in practice: preliminary assessment".
In Proceedings of ICSE '11, pp. 1066-1071
https://doi.org/10.1145/1985793.1985995)

Tools: 
- NASA's Symbolic (Java) PathFinder,
- UIUC's CUTE and jCUTE,
- Stanford's KLEE,
- UC Berkeley's CREST and BitBlaze

Also commercial practice at:
- Microsoft (Pex, SAGE, YOGI and PREfix),
- IBM (Apollo),
- NASA and Fujitsu (Symbolic PathFinder),
- Parasoft 

------------------------------------------
 B. use in software security
------------------------------------------
       USES OF SYMBOLIC EXECUTION
         IN SOFTWARE SECURITY


------------------------------------------
 C. approach
  1. algorithm
   a. original algorithm (King 1976)
------------------------------------------
       SYMBOLIC EXECUTION ALGORITHM

King 1976:
    - Assume program works on integers
    - Assume program uses simple tests
       for control flow (<, >, <=, etc.)
    - Use CFGs as in program analysis
    - Compute with (integer)


 1. Allow symbolic inputs;
    e.g., isym, jsym for reading i, j
    
    - Replace integer ops with
       symbolic ops that make 


    e.g., i - j > 0
        produces expr:


 2. State of program represented by:
    - path condition (PC):


    - variable values:


 3. Tests (e.g. in if-statements) produce:


------------------------------------------
     What should be the initial value for PC?
     Does the PC need to be updated for the first case?
     What's the problem with the second case?
------------------------------------------
       EXAMPLE OF KING'S ALGORITHM

int sum(int a, int b, int c) {
    l0: int x, y, z;
    l1: x = a + b;
    l2: y = b + c;
    l3: z = x + y - b;
    l4: return z;
}


Trace of symbolic execution:

after x    y    z   a     b     c     PC
l0    ?    ?    ?   A     B     C     true
l1
l2
l3


------------------------------------------
     What is the state after l1? l2? l3?
------------------------------------------
          EXAMPLE WITH A LOOP

int power(int x, int y) {
    l2: int z = 1;
    l3: int j = 1;
    l4: if (y < j) {
       l45: goto l8;
        } else {
       l5: z = z * x;
       l6: j = j + 1;
       l7: goto l4;
    }
    l8: return z;


trace of symbolic execution:

after  j   x     y     z   PC
(l1)   ?   X     Y     ?   true
 l2    
 l3
 l4

------------------------------------------
      What happens on line 4 the first time?
      Is the result in that case correct?
      What about the other case?
   In the first case, what is Y?
 What will happen when we test at l4 again?
 What should be done to prevent that?
    i. limits of King's approach
    What are the limits of King's approach in practice?
------------------------------------------
         LIMITS OF KING'S APPROACH


------------------------------------------
   b. Generalized Symbolic Execution (from Khurshid, Pasareanu, Visser 2003)
------------------------------------------
    GENERALIZED SYMBOLIC EXECUTION
    
From Khurshid, Pasareanu, Visser 03:
S. Khurshid, C.S. Pasareanu, and W. Visser.
2003. "Generalized Symbolic Execution for
Model Checking and Testing." In TACAS 2003.
LNCS vol. 2619. Springer.
https://doi.org/10.1007/3-540-36577-X_40
(or https://rdcu.be/cU1iz)

Dealing with objects:
 - all fields start as uninitialized
 - assign values to fields lazily,
    when first accessed,

Lazy initialization algorithm
  for a field f:

if ( f is uninitialized ) {
  if ( f is reference field of type T ) {
      nondeterministically initialize f to
        1. null
        2. a new object of class T
          (with uninitialized field values)
        3. an object created during
           a prior initialization of
            a field of type T
            (or a subtype of T)
      if ( method precondition is violated )
         backtrack();
  }
  if ( f is primitive (or string) field )
    initialize f to a new symbolic value
       of appropriate type
}
------------------------------------------
     Why consider objects created previously?
    i. example
------------------------------------------
            EXAMPLE OF GSE

public class Node {int elem; Node next}

public Node swapNode() { // 0
  l1: if (next!=null)
  l2:   if (elem-next.elem>0) {
  l3:      Node t = next;
  l4:      next = t.next;
  l5:      t.next = this;
  l6:      return t;
       }
  l7: return this;
      }
}


after  elem   next   t    PC
call    ?      ?     ?    true
 

------------------------------------------
       What happens with the first test?
   in the case that next != null && elem-next.elem>0
       what does l4 do to the symbolic state?
   what gets returned in that case?
   Is that right?
   in the case that next != null && elem-next.elem<=0
       what happens?
   c. symbolic execution tree
------------------------------------------
           SYMBOLIC EXECUTION TREE

Tracks execution traces that can happen

l0: int x, y;

l1: if (x > y) {
     l2: x = x + y;
     l3: y = x - y;
     l4: x = x - y;
     l5: if (x - y > 0)
          l6: assert(false);
}

       (x:X, y:Y, PC:true)
   l1    /              \
        /                \
       v                  v
(x:X, y:Y PC:X>Y)    (x:X, y:Y, PC: X<=Y)
   l2   |
        v
(x:X+Y,y:Y,PC:X>Y)
   l3   |
        v
(x:X+Y,y: ,PC:X>Y)
   l4   |
        v
(x:  ,y:  ,PC:X>Y)
   l5 /          \
     /            \
    v              v
(x:  ,y:  ,PC:   ) 
------------------------------------------
    What is the value of the test at l5 in the left case?
    Can execution reach l6?
   d. use of model checker
------------------------------------------
       BACKGROUND: MODEL CHECKING

System:
   a state machine

Specification of safety properties:
   some states marked as error states

Checking:
   explores reachable states from start
Checks:
   Is an error state reachable?

Advantages:
   

Disadvantages:


------------------------------------------
      What's an example of an error state?
      What does it mean if no error state is reachable from start?
      How can model checking keep from running forever?
   e. implementations
------------------------------------------
     SYMBOLIC EXECUTION FRAMEWORKS

Java PathFinder
  (https://github.com/javapathfinder)

jCUTE
  (https://github.com/osl/jcute)
------------------------------------------
 D. background on automatic theorem proving
  1. SAT Solvers
------------------------------------------
     BACKGROUND: SAT SOLVERS

goal: decide if first-order predicate is
       always true (or not)

SAT Solvers:

   input: propositional formula in CNF
   output: satisfying assignment to vars
           (or failure indication)

   example: w1 && w2 && w3
     where: w1 == b || c
            w2 == !a || !d
            w3 == !b || d

   satisfying assignment is
      {a =   , b =   , c =   , d =   }
   
------------------------------------------
    What's the time complexity of this problem?
  2. SMT Solvers
------------------------------------------
         SMT SOLVERS

SMT = Satisfiability Modulo Theories

Combines SAT + decidable theories
 Example theories:
  - Uninterpreted functions and equality
  - Peano arithmetic (arith. without *, /)
  - Arrays, Strings
  - Bitvectors
  - Algebraic Datatypes (e.g., enums)


Uninterpreted functions and equality:

   x == x
   x == y <==> y == x
   (x == y) && (y == z) ==> (x == z)

   x == y ==> f(x) == f(y)

------------------------------------------
   a. proof process
------------------------------------------
   PROVING A FORMULA WITH SMT SOLVER

To prove P

see if not P is satisfiable


------------------------------------------
     What does it mean if not P is satisfiable?
     What does it mean if not P is unsatisfiable?
   b. example
------------------------------------------
           EXAMPLE

b+2 == c
  && f(read(write(a,b,3),c-2)) != f(c-b+1)

Which parts are arithmetic?


arrays?


uninterpreted functions?


------------------------------------------
------------------------------------------
       SMT PROCESS

b+2 == c
  && f(read(write(a,b,3),c-2)) != f(c-b+1)


------------------------------------------
III. Concolic Testing
 A. What is concolic testing?
  1. problems with symbolic execution
------------------------------------------
PRACTICAL PROBLEMS WITH SYMBOLIC EXECUTION

Loops and recursions:


Path explosion:

Heap modeling:


SMT solver limitations:


Environment modeling:


Coverage Problem


------------------------------------------
  2. solution approach
------------------------------------------
       CONCOLIC EXECUTION

Concolic = CONCrete + SymbOLIC

goal:


------------------------------------------
   a. algorithm
------------------------------------------
      CONCOLIC TESTING ALGORITHM

Start:
   with a random (concrete) input

Collect:


Explore by:


Then:


------------------------------------------
   b. example
------------------------------------------
             EXAMPLE

Consider program:

void testme (int x, int y) {
  l1: z = 2*y;
  if (l2: z != x) {
     goto last;
     } else if (l3: x <= y+10) {
        goto last;
     } else { l4: assert false; }
  } }
  last: ;
}

CFG of this is:


------------------------------------------
------------------------------------------
      EXAMPLE TRACE

after Concrete       Symbolic      PC

      x==22, y==7  x==X, y==Y      true

l1:   x==22, y==7, x==X, y==Y      true
      z==14        z==2*Y

l2:


l3:


------------------------------------------
     Where (what label) does this concrete execution end at?
     What do we do to explore more paths?
     Why not start from the other end of the PC?
     How do we find another concrete input to test?
------------------------------------------
           EXPLORING ANOTHER PATH

Modified PC from previous run:


Possible concrete input:


------------------------------------------
    What's a concrete input (for X and Y) that satisfies the modified PC?
------------------------------------------
      SECOND EXAMPLE TRACE

after Concrete       Symbolic      PC

      x==  , y==   x==X, y==Y      true

l1:   x==  , y== , x==X, y==Y      true
      z==          z==2*Y

l2:


l3:


------------------------------------------
     Where (what label) does this concrete execution end at?
     What do we do to explore more paths?
     Why not start from the other end of the PC?
     How do we find another concrete input to test?
------------------------------------------
           EXPLORING ANOTHER PATH

Modified PC from previous run:


Possible concrete input:


------------------------------------------
    What's a concrete input (for X and Y) that satisfies the modified PC?
------------------------------------------
        THIRD EXAMPLE TRACE

after Concrete       Symbolic      PC

      x==  , y==   x==X, y==Y      true

l1:   x==  , y== , x==X, y==Y      true
      z==          z==2*Y

l2:


l3:


------------------------------------------
  3. limitations
------------------------------------------
          LIMITATIONS

Path space explored:

                /\
               /  \
              /    \
             /      \
            /        \
           /          \
          /            \
         /              \
        /                \
        
------------------------------------------
        How would you describe the paths explored by concolic testing?
 B. hybrid concolic testing
------------------------------------------
  ANOTHER VIEW OF CONCOLIC EXPLORATION
          OF THE PATH SPACE


     Program paths

  |-----------------------------------|
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |-----------------------------------|


------------------------------------------
    Is there a way to explore more of the path space?
------------------------------------------
  HYBRID RANDOM + CONCOLIC TESTING

     Program paths

  |-----------------------------------|
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |                                   |
  |-----------------------------------|


------------------------------------------