I. Fuzz Testing

A. What is Fuzz Testing

------------------------------------------
WHAT IS FUZZING

Def: *fuzz testing* (or black box fuzzing) is

Idea:
                 random input
    [ Fuzzer ] ----------------> [ Program ]
                                     |
                                     | crashes
                                     v
                         inputs that crash program
------------------------------------------

	What could cause a program crash?

	What is likely to be the largest cost in terms of time
	in fuzz testing?

------------------------------------------
FUZZING VS. REGRESSION TESTING

             Regression          Fuzzing

Goal

Inputs

------------------------------------------

	What is regression testing?

	What is the goal of regression testing?
	What is the goal of fuzz testing?

	What kind of inputs are used in regression testing?
	What kind of inputs are used in fuzz testing?

	Can we use fuzz testing during regression testing?

1. monitoring program execution

------------------------------------------
MONITORING OF PROGRAM

Fuzz tester:
- maintains list of inputs that cause crashes
------------------------------------------

	How can the fuzzer know what inputs cause crashes?

	What if the program goes into an infinite loop?

	What information can a crash tell the developer/tester?

	How could we find more bugs than just running the program?

2. advantages and disadvantages

------------------------------------------
ADVANTAGES AND DISADVANTAGES OF FUZZING

Advantages:


Disadvantages:

------------------------------------------

	How hard would this be to implement?

	How much time of a programmer is involved
	in doing fuzz testing?

	Are random inputs going to be effective at finding crashes?

	How does the fuzzer know what inputs are valid?

3. finding inputs that crash the program

------------------------------------------
FUZZ TESTING AND INPUT VALIDATION

Reality:

    [ Fuzzer ]
        |
        | random
        | bytes
        v
    [ Driver ] ----------------> [ Program ]
                                     |
                                     | crashes
                                     v
                         inputs that crash program
------------------------------------------

	What kind of programs would take bytes as inputs?
B. mutation-based fuzzing

	Could we add some human expertise to fuzz testing? How?

1. idea of mutation-based fuzzing

------------------------------------------
MUTATION-BASED FUZZING

1. get seed input(s) from human
2. track list of good inputs
3. a. pick a random good input
   b. mutate it
   c. feed mutant to the program
   d. if it causes a crash,
4. repeat

Example fuzzers: ZZUF, FileFuzz, Taof, ProxyFuzz, etc.

Example scenario: fuzzing a PDF viewer
1. collect PDF files from web
2. mutate those files
3. record which mutants crash viewer
------------------------------------------

	When does this process stop?

	What kinds of mutations of inputs would be effective?

2. advantages and disadvantages

------------------------------------------
MUTATION-BASED FUZZING

Advantages:


Disadvantages:

------------------------------------------

	How hard is mutation-based fuzzing to implement?

	Does mutation-based fuzzing require precise specifications?

	Is there any danger of missing kinds of attacks?

	What kinds of inputs would resist fuzz testing with mutation?

C. generation-based fuzzing

	How could we fuzz-test programs that parse/validate
	their inputs?

------------------------------------------
GENERATION-BASED FUZZING

Idea:
1. Use specification of format to generate inputs
2. Mutations added to syntactic spots in inputs
3. Otherwise like mutation-based
------------------------------------------

	What does it mean to be otherwise like
	mutation-based fuzzing?

	Could we use heuristics in generating inputs?

D. comparison

------------------------------------------
ADVANTAGES: MUTATION VS. GENERATION

Mutation-based fuzz testing:


Generation-based fuzz testing:

------------------------------------------

	Which is easier to implement?

	Which requires the least amount of expertise to use?

	Which can work for programs that validate input syntax?

	Which works for data with checksums?

	Which is more likely to find problems?

	How should we measure the performance of these?

	Which has better performance?
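Step 3b above ("mutate it") can be sketched concretely. The function below is a toy illustration that uses just one strategy, flipping random bits of a copied seed; real mutation-based fuzzers such as ZZUF combine many strategies (byte substitution, truncation, block swaps, and so on).

```c
#include <stdlib.h>
#include <string.h>

/* Mutate a copy of a good input: copy src (len bytes) into dst,
   then flip nflips randomly chosen bits.  One of many possible
   mutation strategies, shown here for illustration. */
void mutate(unsigned char *dst, const unsigned char *src,
            size_t len, int nflips) {
    memcpy(dst, src, len);
    for (int i = 0; i < nflips; i++) {
        size_t byte = (size_t)rand() % len;   /* which byte */
        int bit = rand() % 8;                 /* which bit  */
        dst[byte] ^= (unsigned char)(1 << bit);
    }
}
```

In the PDF-viewer scenario, `src` would be one of the collected PDF files, and each call produces one mutant to feed to the viewer.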
E. when to stop fuzzing

------------------------------------------
WHEN TO STOP FUZZING?

How do we know we have tested enough?
------------------------------------------

	The original and mutation-based fuzz testers run forever,
	so when do we stop them?

	Generation-based fuzzers may only generate a limited
	number of inputs; how do we know if that's enough?

1. code coverage

------------------------------------------
CODE COVERAGE

Def: *code coverage* is

Tools for profiling: gcov, lcov
    (see https://about.codecov.io/tool/lcov/)

Also for LLVM see:
    clang.llvm.org/docs/SourceBasedCodeCoverage.html
------------------------------------------

	Does code coverage actually tell us about a testing
	process's ability to find bugs?

a. line coverage

------------------------------------------
LINE COVERAGE

Measures: percentage of lines of code executed (or statements)

Example:

    void check(int a, int b) {
      if (a > 2) {
        x = 1;
      }
      if (b > 2) {
        y = 2;
      }
    }

How many pairs (a,b) are needed for 100% line coverage?

Variation: block coverage is
------------------------------------------

	Do we care about the lines with only a } in them?

2. branch coverage

------------------------------------------
BRANCH COVERAGE

Also called: decision coverage

Measures: percentage of branches of code executed

Example:

    void check(int a, int b) {
      if (a > 2) {
        x = 1;
      }
      if (b > 2) {
        y = 2;
      }
    }

How many pairs (a,b) are needed for 100% branch coverage?
------------------------------------------

3. path coverage

------------------------------------------
PATH COVERAGE

Measures: percentage of paths in code executed

Example:

    void check(int a, int b) {
      if (a > 2) {
        x = 1;
      }
      if (b > 2) {
        y = 2;
      }
    }

How many pairs (a,b) are needed for 100% path coverage?
------------------------------------------

	How many paths can a program have?

4. comparison (omit)

------------------------------------------
COMPARING TYPES OF COVERAGE

Which type is most desirable?

Does 100% coverage guarantee finding bugs?
------------------------------------------
5. evaluation

a. benefits of code coverage

------------------------------------------
BENEFITS OF CODE COVERAGE

------------------------------------------

	What benefits could one get from code coverage?

b. problems with code coverage

------------------------------------------
PROBLEMS OF CODE COVERAGE

Consider:

    void myStrCpy(char *dst, char *src) {
      if (dst && src) {
        strcpy(dst, src);
      }
    }
------------------------------------------

	Is there any vulnerability in that code?

	Will code coverage help us find that vulnerability?

6. how good is code coverage?

------------------------------------------
HOW GOOD A MEASURE IS CODE COVERAGE?

See: Xia Cai and Michael R. Lyu.
"The effect of code coverage on fault detection
under different testing profiles."
In A-MOST '05: Proceedings of the 1st International
Workshop on Advances in Model-Based Testing,
pp. 1-7, May 2005.
https://doi.org/10.1145/1083274.1083288

"code coverage is simply a moderate indicator for the
capability of fault detection on the whole test set."

[code coverage is] "clearly a good estimator for the
fault detection of exceptional test cases, but a poor
one for test cases in normal operations."

"the correlation between code coverage and fault
coverage is higher in functional test than in
random testing"
------------------------------------------

F. coverage-guided gray-box fuzzing

------------------------------------------
COVERAGE-GUIDED GRAY-BOX FUZZING

Idea:
1. Mutants of good inputs are good if

Algorithm:
1. Good inputs start from seed input(s)
   that don't crash or loop forever
2. Mutate random good input
3. Execute it and monitor execution and its coverage
4. If program crashes or
   then save input as a new good input

Example: AFL
------------------------------------------
a. AFL

------------------------------------------
AMERICAN FUZZY LOP (AFL)

Instruments code at compile time
    (can instrument assembly code or LLVM)

Tracks all edges in CFG (with counters)
- hashtable tracks log_2 of executions
  with 8 bits per edge

Target program runs as separate process
------------------------------------------

	What's the advantage of using log base 2 of executions?

	What's the disadvantage of using log base 2 of executions?

b. libFuzzer

	Would it help to combine the ideas of fuzzing and
	concolic testing?  How would we do that?

------------------------------------------
LIBFUZZER

libFuzzer, "a library for coverage-guided fuzz testing"
in LLVM -- llvm.org/docs/LibFuzzer.html

See also tutorial.libfuzzer.info

Uses LLVM's SanitizerCoverage for instrumentation

e.g., to build also with address sanitizer:

    clang -g -O1 -fsanitize=fuzzer,address \
        target.c

Can be used with AFL
------------------------------------------

c. go-fuzz

------------------------------------------
GO-FUZZ

For the Go programming language

Built into the Go tool chain

See https://go.dev/security/fuzz/
------------------------------------------

G. challenges in fuzz testing

------------------------------------------
CHALLENGES IN FUZZ TESTING

Seeds need to be crafted well
- must not crash program
- must not cause loops
- should cover different branches

Some branches make code hard to cover...
------------------------------------------

1. coverage challenges

------------------------------------------
DIFFICULT-TO-FUZZ BRANCHES

    void test1(int n) {
      if (n == 0x12345678) {
        abort();
      }
    }
------------------------------------------

	How hard would it be to get to the abort in test1?
------------------------------------------
EASIER TO FUZZ TEST

    void test2(int n) {
      int dummy = 0;
      char *p = (char *)&n;
      if (p[0] == 0x12) dummy++;
      if (p[1] == 0x34) dummy++;
      if (p[2] == 0x56) dummy++;
      if (p[3] == 0x78) dummy++;
      if (dummy == 4) {
        abort();
      }
    }
------------------------------------------

	How many inputs are needed in the worst case for test2?

H. rules of thumb

------------------------------------------
WISDOM FROM FUZZ TESTERS

- Knowing input format helps
- Generational testing better than random,
  but needs good specifications
- Implementations vary,
  different fuzzers ==> different bugs
- Running longer will find more bugs
  (but reaches a limit)
- Coverage guidance works best
- Use profiling to see where getting stuck,
  design better seeds to get past those...
------------------------------------------
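As a closing illustration of the coverage-guided approach in section F: the entry point below, `LLVMFuzzerTestOneInput`, is the actual interface libFuzzer expects (see llvm.org/docs/LibFuzzer.html); everything else, including the tiny `parse_record` function it exercises and its "FUZ" header format, is invented for this sketch. libFuzzer supplies `main` and the mutation/coverage loop when the file is built with `clang -g -O1 -fsanitize=fuzzer,address harness.c`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical code under test: parse a record in a made-up
   format that starts with the 3-byte header "FUZ" followed by
   a record-type byte.  Returns the record type, or -1. */
static int parse_record(const uint8_t *data, size_t size) {
    if (size < 4 || memcmp(data, "FUZ", 3) != 0)
        return -1;
    return data[3];
}

/* The libFuzzer entry point: called once per generated input.
   Returning 0 means "this input did not need special handling";
   crashes and sanitizer reports are what get recorded. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_record(data, size);
    return 0;
}
```

Note how this harness follows the rules of thumb above: the "FUZ" header is exactly the kind of input-format knowledge that a good seed corpus would encode, and coverage guidance lets the fuzzer learn to get past the header check on its own.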