
theft: Property-Based Testing for C

Recently, I discovered a bug in my heatshrink data compression library, a result of a hidden assumption — I expected that input to uncompress would be padded with ‘0’ bits (like its compressed output), but if given trailing ‘1’ bits, it could get stuck: it detected that processing was incomplete, but polling for more output made no further progress.

Under normal circumstances this wouldn’t happen, but hostile input shouldn’t leave libraries stuck in an intermediate state or let them waste their callers’ resources. Also, since flash memory is often subtractive (erasing to all ‘1’ bits and writing by clearing bits to ‘0’), compressed data read from flash a page at a time could arrive with trailing ‘1’ bits from the erased remainder of the page.

I found and fixed the root cause, but since I was also working on theft (a property-based testing library for C), I wondered if it would have found that edge case for me. I wrote a property test saying that, “for any input, getting all available output and then checking whether it was finished should indicate that it is done”:

/* For an arbitrary input buffer, it should never get stuck in a
 * state where the data has been sunk but no data can be polled. */
static theft_trial_res
prop_should_not_get_stuck(void *input) {
 
    /* Make a buffer large enough for the output: 4 KB of input with
     * every 16 bits becoming up to 16 bytes will fit in a 64 KB buffer.
     * (4 KB of input comes from `env.limit = 1 << 12;` below.) */
    uint8_t output[64 * 1024];
    heatshrink_decoder *hsd = heatshrink_decoder_alloc(4096, 12, 4);
    if (hsd == NULL) { return THEFT_TRIAL_ERROR; }
 
    rbuf *r = (rbuf *)input;
 
    // sink all input
    size_t count = 0;
    HSD_sink_res sres = heatshrink_decoder_sink(hsd, r->buf, r->size, &count);
    if (sres != HSDR_SINK_OK) {
        heatshrink_decoder_free(hsd);
        return THEFT_TRIAL_ERROR;
    }
 
    // get all output
    size_t out_sz = 0;
    HSD_poll_res pres = heatshrink_decoder_poll(hsd, output, sizeof(output), &out_sz);
    if (pres != HSDR_POLL_EMPTY) {
        heatshrink_decoder_free(hsd);
        return THEFT_TRIAL_FAIL;
    }
 
    // check if it's finished
    HSD_finish_res fres = heatshrink_decoder_finish(hsd);
    heatshrink_decoder_free(hsd);
    if (fres != HSDR_FINISH_DONE) { return THEFT_TRIAL_FAIL; }
 
    return THEFT_TRIAL_PASS;
}

Then, I fed that property test function to the test runner:

TEST decoder_fuzzing_should_not_detect_stuck_state(void) {
    /* Get a random number seed based on the time */
    theft_seed seed;
    if (!get_time_seed(&seed)) { FAIL(); }
 
    /* Pass the max buffer size for this property (4 KB) in a closure */
    test_env env = { .limit = 1 << 12 };
 
    struct theft *t = theft_init(0);
    struct theft_cfg cfg = {
        .name = __func__,
        .fun = prop_should_not_get_stuck,
        .type_info = { &rbuf_info },
        .seed = seed,
        .trials = 100000,
        .progress_cb = progress_cb,
        .env = &env,
    };
 
    theft_run_res res = theft_run(t, &cfg);
    theft_free(t);
    printf("\n");
    GREATEST_ASSERT_EQm("should_not_get_stuck", THEFT_RUN_PASS, res);
    PASS();
}

If my fix for the bug is commented out, it finds the edge case instantly (give or take 9 milliseconds):

 -- Counter-Example: decoder_fuzzing_should_not_detect_stuck_state
    Trial 0, Seed 0xe87bb1f61032a061
    Argument 0:
buf[1]:
    80

What does this mean? Apparently, the simplest way it found to break my test was one byte, with a leading high bit. How did it figure that out, though? As far as it knows, my program is a black box.

Well, the first failure it found actually looked more like:

 -- Counter-Example: decoder_fuzzing_should_not_detect_stuck_state
    Trial 0, Seed 0xe87bb1f61032a061
    Argument 0:
buf[98]:
    9fe46c8c05bd539a6fbd3bccce587dd2
    ce90684cd3a0c48b3c034fe338f4c3d4
    1094c503b5904b6f4bdf3545d8a2f5f4
    117c65d75aa4bbe451624169ea3d81bd
    0cecf8153c3e0bc6f2bc13bb56804731
    2da9dd390a3db9f5c5bfd9305a37a63b
    69c7

Yikes! Reverse engineering can be fun, but that avalanche of bits really doesn’t suggest any root cause, and we’ve got deadlines. Now that we know there’s a bug, maybe we could get more meaningful test data?

Shrinking

The next thing theft does is “shrinking” — once it finds input that leads to a failure, it explores simpler variants until eliminating more details would cause the test to pass again. As irrelevant details melt away, eventually all that remains is something that expresses the edge case discovered, and is often good example input for a regression test.

This shrinking process is inherently type-specific, and (unlike OCaml) C compilers really can’t infer much, so it needs a little guidance. theft’s interface expects a callback that gives a variety of tactics to generate a simpler copy from the input.

For a buffer of random data to (un)compress, you could simplify it by cutting it in half, dropping bytes at the beginning or end, or making the bytes in the buffer smaller:

/* How to shrink a random buffer to a simpler one. */
static void *rbuf_shrink_cb(void *instance, uint32_t tactic, void *env) {
    rbuf *cur = (rbuf *)instance;
 
    if (tactic == 0) {          /* only keep first half */
        return copy_rbuf_percent(cur, 0.5, 0);
    } else if (tactic == 1) {   /* only keep second half */
        return copy_rbuf_percent(cur, 0.5, 0.5);
    } else if (tactic <= 18) {  /* drop 1-17 bytes at start */
        const int last_tactic = 1;
        const size_t drop = tactic - last_tactic;
        if (cur->size < drop) { return THEFT_DEAD_END; }
        return copy_rbuf_subset(cur, cur->size - drop, drop);
    } else if (tactic <= 34) {  /* drop 1-16 bytes at end */
        const int last_tactic = 18;
        const size_t drop = tactic - last_tactic;
        if (cur->size < drop) { return THEFT_DEAD_END; }
        return copy_rbuf_subset(cur, cur->size - drop, 0);
    } else if (tactic == 35) {  /* divide every byte by 2 */
        /* If all bytes are 0, this is a dead end. */
        if (all_zero_bytes(cur)) { return THEFT_DEAD_END; }
 
        rbuf *cp = copy_rbuf_percent(cur, 1, 0);
        if (cp == NULL) { return THEFT_ERROR; }
        for (size_t i = 0; i < cp->size; i++) { cp->buf[i] /= 2; }
        return cp;
    } else if (tactic == 36) {  /* subtract 1 from every non-zero byte */
        /* If all bytes are 0, this is a dead end. */
        if (all_zero_bytes(cur)) { return THEFT_DEAD_END; }
 
        /* subtract 1 from every byte, saturating at 0 */
        rbuf *cp = copy_rbuf_percent(cur, 1, 0);
        if (cp == NULL) { return THEFT_ERROR; }
        for (size_t i = 0; i < cp->size; i++) {
            if (cp->buf[i] > 0) { cp->buf[i]--; }
        }
        return cp;
    } else {
        (void)env;
        return THEFT_NO_MORE_TACTICS;
    }
}

theft converges faster if the tactics that discard the most irrelevant information appear first. These are tried in order, and if deleting half the buffer doesn’t work, but dropping 7 bytes from the beginning does, then it will try halving the buffer again.

None of this is too specific to a particular data structure, though — simplifying by big steps, then trying medium-to-small steps to get the big steps unstuck, should be applicable to most data types. The property expresses how to check for failures, the tactics express how to vary the input, and theft automates the testing and searching. All in all, it works kind of like using git bisect run to find the first commit that broke a test.

In this case, theft discovered that alignment was a crucial element of the bug — the ‘1’ bit needed to appear at the right place in a multiple-byte cycle to hit the gap in the state machine. Splitting the buffer or deleting single bytes eventually found dead ends when simplifying, but deleting varying amounts of bytes quickly led to an alignment-preserving path towards a minimal failing case.

Once none of these tactics make any further progress, theft prints the minimal input to break the test, then generates very different input and tries to find more edge cases. It also notes what combinations of arguments it has already tried, so it can give up early if it isn’t going to find any new information.

Since the input types being generated are usually either part of the program’s problem domain (data to compress, sequences of write instructions, network packets) or fundamental types (numbers, text), most of this type-specific code can be reused across several property tests. As Matt Might notes in “An introduction to QuickCheck by example”, “Randomized testing exemplifies the 80/20 rule: it yields 80% of the benefit of formal verification for 20% of the effort.” Once a bit of generation and shrinking code is in place, it only takes a couple lines of test code to stress-test functions with thousands of inputs.

Property-based testing is a valuable complement to typical example-based tests, since it can effectively test for things that would otherwise be difficult to cover. It stress-tests a program’s actual interface, rather than developers’ mental models or implementation details, and finds where they are out of alignment. Since it’s unbiased about how the code “should” be used, it can find surprising bugs arising from unexpected interactions. It can also generate a wide variety of input, giving developers rapid feedback without spending time on tests that will be thrown away or become a maintenance burden.