I. Program Analysis A. What is program analysis? ------------------------------------------ WHAT IS PROGRAM ANALYSIS? Def: *program analysis* is ------------------------------------------ How does this differ from (human) code inspection? How is it different than testing? Runtime assertion checking? B. use in software security ------------------------------------------ USES OF PROGRAM ANALYSIS IN SOFTWARE SECURITY General goal/idea: Program ---> Static Analyzer --> Bug Reports Examples: - warn about bad library functions (e.g., gets) ------------------------------------------ 1. vs. program verification ------------------------------------------ PROGRAM ANALYSIS vs. VERIFICATION program analysis: - no explicit specification of behavior - little annotation - targeted to a specific property - may lose precision to gain speed - no human intervention once started program verification: - behavior usually specified in detail - lots of annotation - for any (safety) property - may take a day per function - often requires human intervention ------------------------------------------ ------------------------------------------ SAFETY vs. LIVENESS Def: a *safety* property says that Def: a *liveness* property says that ------------------------------------------ What kind of property does static analysis work on? Are typical security policies safety or liveness properties? Are any security properties liveness properties? ------------------------------------------ PROGRAM ANALYSIS GOALS - little or no input from programmers - correctness - efficient (at compile time): - time - space ------------------------------------------ C. equivalent ideas ------------------------------------------ EQUIVALENCE OF DIFFERENT KINDS OF PROGRAM ANALYSIS An analysis can be expressed as: - Dataflow equations - Constraints - Abstract Interpretation - Type system - Control Flow Analysis Differences in Emphasis: - Dataflow equations } - compilers - Abstract interpretation - correctness Above can approximate dependence on dynamic values in imperative languages - Type systems - no flow dependence - uniform treatment of each statement - Constratint based = Control Flow Analysis - analysis of functional languages ------------------------------------------ D. key design idea ------------------------------------------ DESIGNING A PROGRAM ANALYSIS Slogan: Err on the safe side! Key question: - What's the bad outcome? - What would track the opposite of that? - What would be safe? Because: what we know in a sound analysis is that the claimed safety property holds in all possible executions All programs |--------------------------------| | Programs that are safe | | |------------------------| | | | Programs claimed safe | | | | | | | |------------------------| | |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| | | | Programs that can be bad | |--------------------------------| Example: Dead code analysis (e.g., goto fail bug): - What's the bad outcome? - What would track the opposite? - What would be safe? All programs |--------------------------------| | Programs with no dead code | | |------------------------| | | | Programs claimed to | | | | have no dead code | | | |------------------------| | |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| | | | Programs that have dead code | |--------------------------------| ------------------------------------------ Assuming the analysis is sound, will we have some false positives (claiming bad programs are safe)? Will we have some false nagatives (good programs not claimed safe)? ------------------------------------------ FOR YOU TO DO Suppose we want to stop using gets (from libc's stdio) what should we do in an analysis? ------------------------------------------ What is the bad outcome to prevent? What would track the opposite? What would be safe? E. intraprocedural vs. interprocedural ------------------------------------------ INTRA vs. INTER-PROCEDURAL ANALYSIS Def: an *intraprocedural* analysis Def: an *interprocedural* analysis ------------------------------------------ F. data flow analysis 1. may vs. must analysis ------------------------------------------ MAY vs. MUST ANALYSES Def: A "may analysis" tracks A "must analysis" tracks ------------------------------------------ 2. sources of imprecision a. conditional statements ------------------------------------------ CONDITIONAL STATEMENTS Might the following code use gets? if (...) { // 2 gets(buffer); } else { // 4 fgets(buffer, sizeof(buffer), fn); } ------------------------------------------ b. loops ------------------------------------------ LOOP STATEMENTS Might the following code use gets? while (...) { // 2 gets(buffer); } What about: while (i < BUFSIZ) { // 5 if (i >= 3) gets(buffer); i++; } ------------------------------------------ c. handling different kinds of control structures (CFGs) ------------------------------------------ DIFFERENT CONTROL STRUCTURES Can we avoid special cases for: - switch statments, - if's with different numbers of arms, - for loops, - do-until loops, - code with exception handlers, - code with gotos? ------------------------------------------ ------------------------------------------ CONTROL FLOW GRAPHS (CFGs) 1. Give each atomic statment and test a unique label 2. Form CFG from: - Blocks (nodes) that are - statements or tests, with: - initial label - set of final labels - Flows (edges): - relation between labels (map from a statement to a pair of labels) Example: z = 1; // 1 while (x > 0) // 2 { z = z*y; // 3 x = x-1; } // 4 print("done"); // 5 Blocks are: Flows are: Picture: [ 1 ] | v /-->[ 2 ]--\ | | | | v | | [ 3 ] | | | | | v | \---[ 4 ] | | /----/ v [ 5 ] ------------------------------------------ What's the initial label of each statment and test? What's the set of final labels for the test at label 2? 3. example: dead code elimination a. design ------------------------------------------ UNREACHABLE CODE DETECTION - What's the bad outcome? - What would track the opposite? - What would be safe? ------------------------------------------ b. property space ------------------------------------------ PROPERTY SPACE Property to track: Set of labels that reach this label \ | / v v v [ l ] A "may" analysis A "backwards" analysis, want property of flows into a block ------------------------------------------ What's a source of imprecision in this analysis? Since it's a may analysis, how do we merge different control flows that come into a block? How can we find what flows go into a block? i. describing an analysis Why describe an analysis in terms of property values and transfer functions? ii. starting property value ------------------------------------------ STARTING PROPERTY VALUE (BASE CASE) The initial value of a property space is used for value at the initial label (in a forward analysis) Isolation: initial label in a forward analysis should be *isolated*, so no flow to it from inside prog. ------------------------------------------ What about for a backwards analysis? Does every structured program have isolated initial labels? And isolated final labels? Can we convert a program P to an equivalent program P' that has isolated labels? iii. combining information ------------------------------------------ COMBINING PROPERTY INFORMATION (JOINS) \ | / v v v [ stmt] or [ stmt ] <-- ^ | Why? - Make the analysis uniform: always one value for entry and exit Suppose property space is a set, then For a "may" analysis, combine using: For a "must" analysis, combine using: ------------------------------------------ What kinds of flows could combine in these ways? iv. transfer functions ------------------------------------------ TRANSFER FUNCTIONS Def: a *transfer function* describes ------------------------------------------ v. finding solutions ------------------------------------------ SOLVING RECURSIVE SYSTEMS OF DATAFLOW EQUATIONS Loops (and recursion) can lead to recursive definitions of properties When the property space is finite, then a solution will exist Chaotic iteration: 1. initialize property sets at each label 2.a. Pick a label of a statement whose transfer function changes some property value(s) b. Update the property value(s) using that transfer function c. repeat until no changes can be made The output is the property values at entry and exit of each label ------------------------------------------ Could we make this run faster? vi. example ------------------------------------------ EXAMPLE: UNREACHABLE CODE Consider C with statments: assignments, empty statements (skip), if-then-else, while, sequencing, labels Stmt ::= L: x = E; | L : ; | if (L: E) { Stmt1 } else { Stmt2 } | while (L: E) { Stmt } | Stmt1 Stmt2 Atomic blocks are: S ::= L : x = E; | L : ; | L : E So CFG of l1: x = 0; while (l2: x < 4) { l3: y = y+1; l4: x = x-1; } l5: y = x*x; is [ l1: x = 0;] | v /->[ l2: x < 4; ]--\ | | | | v | | [ l3: y = y+1;] | | | | | v | \--[ l4: x = x-1;] | | /---------/ | v [ l5: y = x*x;] Set of flows is: We want a forward, may analysis with: Property space: Reaches(l) = Set of labels that be reached from start initial value: {} Combination: join is \union EntryTo(l) = union {ExitOf(l') | (l',l) flows} Transfer functions: f(l: x = E;, EntryTo(l)) = {l} union EntryTo(l) f(l: ;, EntryTo(l)) = EntryTo(l) f(l:E, EntryTo(l)) = {l} union EntryTo(l) ------------------------------------------ ------------------------------------------ EQUATIONS AND SOLUTION FOR EXAMPLE EntryTo(l1) = {} ExitOf(l1) = {l1} union EntryTo(l1) EntryTo(l2) = union{ExitOf(l1),ExitOf(l4)} ExitOf(l2) = {l2} union EntryTo(l2) EntryTo(l3) = ExitOf(l2) ExitOf(l3) = {l3} union EntryTo(l3) EntryTo(l4) = ExitOf(l3) ExitOf(l4) = {l4} union EntryTo(l4) EntryTo(l5) = ExitOf(l2) ExitOf(l5) = {l5} union EntryTo(l5) Trace of iterations: Steps: 0 1 EntryTo(l1) {} ExitOf(l1) {} EntryTo(l2) {} ExitOf(l2) {} EntryTo(l3) {} ExitOf(l3) {} EntryTo(l4) {} ExitOf(l4) {} EntryTo(l5) {} ExitOf(l5) {} ------------------------------------------ Is this result what happens dynamically? Is it safe? How could we make this safe? 4. building tools with dataflow analysis ------------------------------------------ BUILDING TOOLS FOR PROGRAM ANALYSIS Use an existing set of tools: Parser generator and: - LLVM - JastAdd Java compiler - XText ------------------------------------------ G. control flow analysis II. Symbolic Execution A. What is symbolic execution? ------------------------------------------ SYMBOLIC EXECUTION Def: *symbolic execution* is ------------------------------------------ ------------------------------------------ OVERVIEW OF SYMBOLIC EXECUTION Key paper: James C. King, "Symbolic execution and program testing" CACM, Vol. 19, num. 7, pp. 385-394, 1976. http://doi.acm.org/10.1145/360248.360252 General idea: Approach: Def: An *isomorphism class* is Advantages: Drawbacks: ------------------------------------------ What kind of classes of inputs would be most helpful to have? ------------------------------------------ SYMBOLIC EXECUTION TOOLS (Quoted from p. 1066 of: Cristian Cadar, et al. 2011. "Symbolic execution for software testing in practice: preliminary assessment". In Proceedings of ICSE '11, pp. 1066-1071 https://doi.org/10.1145/1985793.1985995) Tools: - NASA's Symbolic (Java) PathFinder, - UIUC's CUTE and jCUTE, - Stanford's KLEE, - UC Berkeley's CREST and BitBlaze Also commercial practice at: - Microsoft (Pex, SAGE, YOGI and PREfix), - IBM (Apollo), - NASA and Fujitsu (Symbolic PathFinder), - Parasoft ------------------------------------------ B. use in software security ------------------------------------------ USES OF SYMBOLIC EXECUTION IN SOFTWARE SECURITY ------------------------------------------ C. approach 1. algorithm a. original algorithm (King 1976) ------------------------------------------ SYMBOLIC EXECUTION ALGORITHM King 1976: - Assume program works on integers - Assume program uses simple tests for control flow (<, >, <=, etc.) - Use CFGs as in program analysis - Compute with (integer) 1. Allow symbolic inputs; e.g., isym, jsym for reading i, j - Replace integer ops with symbolic ops that make e.g., i - j > 0 produces expr: 2. State of program represented by: - path condition (PC): - variable values: 3. Tests (e.g. in if-statements) produce: ------------------------------------------ What should be the initial value for PC? Does the PC need to be updated for the first case? What's the problem with the second case? ------------------------------------------ EXAMPLE OF KING'S ALGORITHM int sum(int a, int b, int c) { l0: int x, y, z; l1: x = a + b; l2: y = b + c; l3: z = x + y - b; l4: return z; } Trace of symbolic execution: after x y z a b c PC l0 ? ? ? A B C true l1 l2 l3 ------------------------------------------ What is the state after l1? l2? l3? ------------------------------------------ EXAMPLE WITH A LOOP int power(int x, int y) { l2: int z = 1; l3: int j = 1; l4: if (y < j) { l45: goto l8; } else { l5: z = z * x; l6: j = j + 1; l7: goto l4; } l8: return z; trace of symbolic execution: after j x y z PC (l1) ? X Y ? true l2 l3 l4 ------------------------------------------ What happens on line 4 the first time? Is the result in that case correct? What about the other case? In the first case, what is Y? What will happen when we test at l4 again? What should be done to prevent that? i. limits of King's approach What are the limits of King's approach in practice? ------------------------------------------ LIMITS OF KING'S APPROACH ------------------------------------------ b. Generalized Symbolic Execution (from Khurshid, Pasareanu, Visser 2003) ------------------------------------------ GENERALIZED SYMBOLIC EXECUTION From Khurshid, Pasareanu, Visser 03: S. Khurshid, C.S. Pasareanu, and W. Visser. 2003. "Generalized Symbolic Execution for Model Checking and Testing." In TACAS 2003. LNCS vol. 2619. Springer. https://doi.org/10.1007/3-540-36577-X_40 (or https://rdcu.be/cU1iz) Dealing with objects: - all fields start as uninitialized - assign values to fields lazily, when first accessed, Lazy initialization algorithm for a field f: if ( f is uninitialized ) { if ( f is reference field of type T ) { nondeterministically initialize f to 1. null 2. a new object of class T (with uninitialized field values) 3. an object created during a prior initialization of a field of type T (or a subtype of T) if ( method precondition is violated ) backtrack(); } if ( f is primitive (or string) field ) initialize f to a new symbolic value of appropriate type } ------------------------------------------ Why consider objects created previously? i. example ------------------------------------------ EXAMPLE OF GSE public class Node {int elem; Node next} public Node swapNode() { // 0 l1: if (next!=null) l2: if (elem-next.elem>0) { l3: Node t = next; l4: next = t.next; l5: t.next = this; l6: return t; } l7: return this; } } after elem next t PC call ? ? ? true ------------------------------------------ What happens with the first test? in the case that next != null && elem-next.elem>0 what does l4 do to the symbolic state? what gets returned in that case? Is that right? in the case that next != null && elem-next.elem<=0 what happens? c. symbolic execution tree ------------------------------------------ SYMBOLIC EXECUTION TREE Tracks execution traces that can happen l0: int x, y; l1: if (x > y) { l2: x = x + y; l3: y = x - y; l4: x = x - y; l5: if (x - y > 0) l6: assert(false); } (x:X, y:Y, PC:true) l1 / \ / \ v v (x:X, y:Y PC:X>Y) (x:X, y:Y, PC: X<=Y) l2 | v (x:X+Y,y:Y,PC:X>Y) l3 | v (x:X+Y,y: ,PC:X>Y) l4 | v (x: ,y: ,PC:X>Y) l5 / \ / \ v v (x: ,y: ,PC: ) ------------------------------------------ What is the value of the test at l5 in the left case? Can execution reach l6? d. use of model checker ------------------------------------------ BACKGROUND: MODEL CHECKING System: a state machine Specification of safety properties: some states marked as error states Checking: explores reachable states from start Checks: Is an error state reachable? Advantages: Disadvantages: ------------------------------------------ What's an example of an error state? What does it mean if no error state is reachable from start? How can model checking keep from running forever? e. implementations ------------------------------------------ SYMBOLIC EXECUTION FRAMEWORKS Java PathFinder (https://github.com/javapathfinder) jCUTE (https://github.com/osl/jcute) ------------------------------------------ D. background on automatic theorem proving 1. SAT Solvers ------------------------------------------ BACKGROUND: SAT SOLVERS goal: decide if first-order predicate is always true (or not) SAT Solvers: input: propositional formula in CNF output: satisfying assignment to vars (or failure indication) example: w1 && w2 && w3 where: w1 == b || c w2 == !a || !d w3 == !b || d satisfying assignment is {a = , b = , c = , d = } ------------------------------------------ What's the time complexity of this problem? 2. SMT Solvers ------------------------------------------ SMT SOLVERS SMT = Satisfiability Modulo Theories Combines SAT + decidable theories Example theories: - Uninterpreted functions and equality - Peano arithmetic (arith. without *, /) - Arrays, Strings - Bitvectors - Algebraic Datatypes (e.g., enums) Uninterpreted functions and equality: x == x x == y <==> y == x (x == y) && (y == z) ==> (x == z) x == y ==> f(x) == f(y) ------------------------------------------ a. proof process ------------------------------------------ PROVING A FORMULA WITH SMT SOLVER To prove P see if not P is satisfiable ------------------------------------------ What does it mean if not P is satisfiable? What does it mean if not P is unsatisfiable? b. example ------------------------------------------ EXAMPLE b+2 == c && f(read(write(a,b,3),c-2)) != f(c-b+1) Which parts are arithmetic? arrays? uninterpreted functions? ------------------------------------------ ------------------------------------------ SMT PROCESS b+2 == c && f(read(write(a,b,3),c-2)) != f(c-b+1) ------------------------------------------ III. Concolic Testing A. What is concolic testing? 1. problems with symbolic execution ------------------------------------------ PRACTICAL PROBLEMS WITH SYMBOLIC EXECUTION Loops and recursions: Path explosion: Heap modeling: SMT solver limitations: Environment modeling: Coverage Problem ------------------------------------------ 2. solution approach ------------------------------------------ CONCOLIC EXECUTION Concolic = CONCrete + SymbOLIC goal: ------------------------------------------ a. algorithm ------------------------------------------ CONCOLIC TESTING ALGORITHM Start: with a random (concrete) input Collect: Explore by: Then: ------------------------------------------ b. example ------------------------------------------ EXAMPLE Consider program: void testme (int x, int y) { l1: z = 2*y; if (l2: z != x) { goto last; } else if (l3: x <= y+10) { goto last; } else { l4: assert false; } } } last: ; } CFG of this is: ------------------------------------------ ------------------------------------------ EXAMPLE TRACE after Concrete Symbolic PC x==22, y==7 x==X, y==Y true l1: x==22, y==7, x==X, y==Y true z==14 z==2*Y l2: l3: ------------------------------------------ Where (what label) does this concrete execution end at? What do we do to explore more paths? Why not start from the other end of the PC? How do we find another concrete input to test? ------------------------------------------ EXPLORING ANOTHER PATH Modified PC from previous run: Possible concrete input: ------------------------------------------ What's a concrete input (for X and Y) that satisfies the modified PC? ------------------------------------------ SECOND EXAMPLE TRACE after Concrete Symbolic PC x== , y== x==X, y==Y true l1: x== , y== , x==X, y==Y true z== z==2*Y l2: l3: ------------------------------------------ Where (what label) does this concrete execution end at? What do we do to explore more paths? Why not start from the other end of the PC? How do we find another concrete input to test? ------------------------------------------ EXPLORING ANOTHER PATH Modified PC from previous run: Possible concrete input: ------------------------------------------ What's a concrete input (for X and Y) that satisfies the modified PC? ------------------------------------------ THIRD EXAMPLE TRACE after Concrete Symbolic PC x== , y== x==X, y==Y true l1: x== , y== , x==X, y==Y true z== z==2*Y l2: l3: ------------------------------------------ 3. limitations ------------------------------------------ LIMITATIONS Path space explored: /\ / \ / \ / \ / \ / \ / \ / \ / \ ------------------------------------------ How would you describe the paths explored by concolic testing? B. hybrid concolic testing ------------------------------------------ ANOTHER VIEW OF CONCOLIC EXPLORATION OF THE PATH SPACE Program paths |-----------------------------------| | | | | | | | | | | | | | | | | |-----------------------------------| ------------------------------------------ Is there a way to explore more of the path space? ------------------------------------------ HYBRID RANDOM + CONCOLIC TESTING Program paths |-----------------------------------| | | | | | | | | | | | | | | | | |-----------------------------------| ------------------------------------------