Using Zig’s fuzzer
· read · zig, fuzzing, programming-languages
UPDATE 2024-09-10: It looks like Zig’s fuzzing API is about to change. If/when that change is landed, this post will be wrong. As always, using APIs that are less than two months old and only available in nightly builds is a test of bravery, notwithstanding the total lack of warranty. Still: I’d do it again.
Mark a test as a fuzz test by simply calling std.testing.fuzzInput
at any point. Then start fuzzing by running the command:
$ zig build test --fuzz
This starts an HTTP server to show a coverage map!
In my case, I was testing out a tokenizer. The tokenizer’s entry point looks like this:
/// Consume a single token from `source`, returning its kind and length.
pub fn tokenizeOne(source: []const u8) struct { TokenKind, u32 } {
if (source.len == 0) return .{ .eof, 0 };
// Note: `\n` is parsed as a `newline` token, not a `space` token.
if (source[0] == ' ' or source[0] == '\t' or source[0] == '\r') { // (1)
var i: u32 = 1;
while (i < source.len and source[i] == ' ' or source[i] == '\t' or source[i] == '\r') { // (2)
i += 1;
}
return .{ .space, i };
} else if (source[0] == '\n') {
return .{ .newline, 1 };
} else if (characterIsNameStart(source[0])) {
// [... continued]
} else {
return .{ .@"error", 1 };
}
}
Ignore the callouts; they're not important right now. The important bit is that this API is really easy to use:
var index: usize = 0;
while (true) {
const token, const length = lexOne(source[index..]);
index += length;
if (token == .eof) break;
}
(Yes, Zig has destructuring syntax now! This feature goes right below decl literals, which have also been added, on my quality-of-life tier list.)
Thanks to Zig’s built-in support for unit tests, it only takes three lines of code to fuzz the tokenizer. I know lines-of-code-as-a-metric is unreliable at best, but c’mon, three lines!
test "tokenizer fuzzing" {
const source = std.testing.fuzzInput(.{});
var index: usize = 0;
while (true) {
const token, const length = lexOne(source[index..]);
// [...]
}
}
And, bam! After exactly one minute and 0.74 seconds, Zig comes up with a stack trace:
thread 51449 panic: index out of bounds: index 1, len 1
../src/tokenizer.zig:60:61: 0x114f818 in tokenizeOne (test)
while (i < source.len and source[i] == ' ' or source[i] == '\t' or source[i] == '\r') {
../src/tokenizer.zig:133:43: 0x1154962 in test.tokenizer fuzzing (test)
const token, const length = lexOne(source[index..]);
Ah, right. The unimportant bit in the first code block.
My bug here was copy-pasting this bit (1). Due to operator precedence (and
binding tighter than or
), this code is interpreted as (index-in-bounds AND is-space) OR is-tab OR is-carriage-return (2): the bounds check only guards the first comparison, so the tab and carriage-return comparisons can read out of bounds.
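Python happens to share this precedence rule (and binds tighter than or), so the pitfall can be sketched outside of Zig. The buggy and fixed helpers below are hypothetical names for illustration, not code from the post:

```python
def buggy(source: str, i: int) -> bool:
    # `and` binds tighter than `or`, so this parses as
    # (i < len(source) and source[i] == ' ') or source[i] == '\t' or source[i] == '\r'
    # -- the bounds check guards only the first comparison.
    return i < len(source) and source[i] == ' ' or source[i] == '\t' or source[i] == '\r'

def fixed(source: str, i: int) -> bool:
    # Parenthesizing the character tests puts all of them behind the bounds check.
    return i < len(source) and (source[i] == ' ' or source[i] == '\t' or source[i] == '\r')

# A single space is a minimal crashing input: the loop advances to i == 1,
# the bounds check fails, and the `or` branches index past the end anyway.
try:
    buggy(" ", 1)
except IndexError:
    print("out-of-bounds read")

print(fixed(" ", 1))  # False -- the loop terminates safely
```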
My chosen fix here was to split this into its own function:
fn isPartOfSpace(ch: u8) bool {
return ch == ' ' or ch == '\t' or ch == '\r';
}
// [...]
if (isPartOfSpace(source[0])) {
std.debug.assert(source[0] != '\n');
var i: u32 = 1;
while (i < source.len and isPartOfSpace(source[i])) {
// [...]
}
}
I guess the lesson here is to be at least a little afraid to repeat yourself? Also, don’t write bugs.[1] That’d help too.

[1] I’m legally required to point out that “don’t write bugs” is not effective advice.