Using Zig’s fuzzer

2024-09-07 · two minute read · fuzzing, programming-languages, zig

UPDATE 2024-09-10: It looks like Zig’s fuzzing API is about to change. If/when that change lands, this post will be wrong. As always, using APIs that are less than two months old and only available in nightly builds is a test of bravery, notwithstanding the total lack of warranty. Still: I’d do it again.

Mark a test as a fuzz test by calling std.testing.fuzzInput at any point inside it. Then start fuzzing by running:

$ zig build test --fuzz

This starts an HTTP server to show a coverage map!

In my case, I was testing out a tokenizer. The tokenizer’s entry point looks like this:

/// Consume a single token from `source`, returning its kind and length.
pub fn tokenizeOne(source: []const u8) struct { TokenKind, u32 } {
    if (source.len == 0) return .{ .eof, 0 };

    // Note: `\n` is parsed as a `newline` token, not a `space` token.
    if (source[0] == ' ' or source[0] == '\t' or source[0] == '\r') { // (1)
        var i: u32 = 1;

        while (i < source.len and source[i] == ' ' or source[i] == '\t' or source[i] == '\r') { // (2)
            i += 1;
        }

        return .{ .space, i };
    } else if (source[0] == '\n') {
        return .{ .newline, 1 };
    } else if (characterIsNameStart(source[0])) {
        // [... continued]
    } else {
        return .{ .@"error", 1 };
    }
}

Ignore the callouts; they’re not important right now. The important bit is that this API is really easy to use:

var index: usize = 0;
while (true) {
    const token, const length = lexOne(source[index..]); // (1)

    index += length;
    if (token == .eof) break;
}

(1) Yes, Zig has destructuring syntax now! This feature goes right below decl literals, which have also been added, on my quality-of-life tier list.
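If you haven’t seen the syntax yet, here’s a minimal sketch of destructuring a tuple return value. The divmod function is made up for illustration and has nothing to do with the tokenizer; it just returns the same kind of anonymous tuple struct that tokenizeOne does.

```zig
const std = @import("std");

/// Hypothetical example function: returns quotient and remainder as a tuple.
fn divmod(a: u32, b: u32) struct { u32, u32 } {
    return .{ a / b, a % b };
}

test "destructuring a tuple return value" {
    // Each element of the returned tuple is bound to its own constant.
    const quotient, const remainder = divmod(17, 5);

    try std.testing.expectEqual(@as(u32, 3), quotient);
    try std.testing.expectEqual(@as(u32, 2), remainder);
}
```

Running this with zig test should pass on a compiler recent enough to support destructuring.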

Thanks to Zig’s builtin support for unit tests, it only takes three lines of code to fuzz the tokenizer. I know lines-of-code-as-a-metric is unreliable at best, but c’mon, three lines!

test "tokenizer fuzzing" {
    const source = std.testing.fuzzInput(.{});

    var index: usize = 0;
    while (true) {
        const token, const length = lexOne(source[index..]);
        // [...]
    }
}

And, bam! After exactly one minute and 0.74 seconds, Zig comes up with a stack trace:

thread 51449 panic: index out of bounds: index 1, len 1
../src/tokenizer.zig:60:61: 0x114f818 in tokenizeOne (test)
        while (i < source.len and source[i] == ' ' or source[i] == '\t' or source[i] == '\r') {
../src/tokenizer.zig:133:43: 0x1154962 in test.tokenizer fuzzing (test)
        const token, const length = lexOne(source[index..]);

Ah, right. The unimportant bit in the first code block.

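My bug here was copy-pasting the condition marked (1). Because the and operator binds tighter than the or operator, the loop condition marked (2) is interpreted as (index-in-range AND is space) OR is tab OR is carriage return. Once i reaches source.len, the bounds check is false, but the two or clauses still evaluate source[i], which is an out-of-bounds read.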
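To see the grouping in isolation, here’s a small sketch with plain booleans standing in for the checks. The function and parameter names are hypothetical, chosen only for this example; the point is just how Zig groups an unparenthesized mix of and and or.

```zig
const std = @import("std");

/// Same shape as the buggy loop condition: parsed as `(in_range and is_space) or is_tab`.
fn buggyCondition(in_range: bool, is_space: bool, is_tab: bool) bool {
    return in_range and is_space or is_tab;
}

/// What was actually meant: the bounds check guards every character check.
fn intendedCondition(in_range: bool, is_space: bool, is_tab: bool) bool {
    return in_range and (is_space or is_tab);
}

test "and binds tighter than or" {
    // Out of range: the intended condition short-circuits to false...
    try std.testing.expect(!intendedCondition(false, false, true));
    // ...but the buggy grouping still consults the `or` operand.
    try std.testing.expect(buggyCondition(false, false, true));
}
```

In the real tokenizer the or operands are source[i] comparisons, so “still consults the operand” means indexing past the end of the slice.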

My chosen fix here was to split this into its own function:

fn isPartOfSpace(ch: u8) bool {
    return ch == ' ' or ch == '\t' or ch == '\r';
}

// [...]
if (isPartOfSpace(source[0])) {
    std.debug.assert(source[0] != '\n');

    var i: u32 = 1;

    while (i < source.len and isPartOfSpace(source[i])) {
        // [...]
    }
}
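Here’s a self-contained sketch of the fixed loop together with a regression test for the crash. Treat it as an illustration rather than the actual tokenizer code: spaceRunLength is a hypothetical helper invented so the example compiles on its own, and it starts counting at index 0 instead of 1. The one-byte input matches the “index 1, len 1” panic above.

```zig
const std = @import("std");

fn isPartOfSpace(ch: u8) bool {
    return ch == ' ' or ch == '\t' or ch == '\r';
}

/// Hypothetical helper: length of the run of space characters at the start of `source`.
fn spaceRunLength(source: []const u8) u32 {
    var i: u32 = 0;

    // The bounds check now guards the single character check that follows it.
    while (i < source.len and isPartOfSpace(source[i])) {
        i += 1;
    }

    return i;
}

test "space run at the end of input stays in bounds" {
    // A slice that ends in a space character is the shape that crashed the old loop.
    try std.testing.expectEqual(@as(u32, 1), spaceRunLength(" "));
    try std.testing.expectEqual(@as(u32, 3), spaceRunLength(" \t\rx"));
    // `\n` is a `newline` token, so it never counts as part of a space run.
    try std.testing.expectEqual(@as(u32, 0), spaceRunLength("\n"));
}
```

Keeping the crashing input around as a plain unit test means the bug stays fixed even if the fuzzing API changes.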

I guess the lesson here is to be at least a little afraid to repeat yourself? Also don’t write bugs. [1] That’d help too.

[1] I’m legally required to point out that “don’t write bugs” is not an effective advisory.