Paul Smith

"Writing an Interpreter in Go" in Zig, part 2

This post is part of the "Writing an Interpreter in Go" in Zig series.

Continuing with the lexer, picking up with section 1.4 of the book.

Extending the lexer - one-character operator tokens

We’re extending the lexer to handle additional one-character operator tokens, like “!” and “*”, which we already know add support for in the lexer, additional keywords, which we also know how to do but has a small wrinkle in our Zig implementation, and two-character operator tokens, like “==” and “!=”, which will require looking ahead at our character input stream.

The one-character operator tokens were trivial to add, just stamping out copies of the implementation of “+” and modifying as needed. This boils down to:

  • adding a new token type (extending the enum and adding the matching case in the name() function for debug printing of tokens), and
  • adding a case branch in the main switch statement of the lexer.

Adding keywords

Adding the keywords “if”, “else”, “return” had a bit of friction. Since those are all keywords in Zig as well, we can’t simply use these directly as members of the token type enum. We could do something like prefix the identifiers with some characters to prevent them from being detected as Zig keywords, but a simpler solution is to use the @"" syntax that Zig provides for situations such as these. The @"" syntax allows you to use any name as an identifier that does not conform to the normal rules. For example,

  • @"struct"
  • @"a name with spaces"
  • @"123_starts_with_numbers"

Then anywhere you refer to this name you use this syntax. So for our new Monkey lang keywords, we have:

pub const Type = enum {
// ...
    // Keywords
    let,
    @"if",
    @"else",
    @"return",
// ...
}

And then later, when we’re referring to a member of the type enum, we can say, for example:

switch (tok.token_type) {
    // ...
    case .let => // ...
    case .@"if" => // ...
    case .@"else" => // ...
    case .@"return" => // ...
    // ...
}

See “identifiers” in the Zig language reference.

“true” and “false” as keywords

The book makes “true” and “false” keywords, which is an unusual choice. The book doesn’t remark on it, at least in what I’ve read so far, but in my experience, Boolean literals are usually provided as pre-declared identifiers, rather than as part of the syntax of the language. It would be a bit like having numeric literals like “5” and “10” as keywords. The lexer already could have handled “true” and “false” fine as named identifiers with the .ident token type. It’s not just a matter of preference - it will have consequences down the road, since we’ll have to handle “true” and “false” tokens explicitly in the parser, which complicates things. Instead of expressions just having literals and identifiers as operands and arguments, we’ll have to special-case the handling of these two values. I haven’t read ahead, so perhaps there is a rationale for it. In any case, since I intended to follow the book faithfully on major design decisions, so I could focus on Zig and the forced implementation choices that presents versus Go, I went ahead and added “true” and “false” as keywords.

Two-character operator tokens

There’s not much to remark on here, other than that the nested block to test for the lookahead character gives me the opportunity to use labeled blocks and to return a value from a block, in this case, the new next token. I need to introduce a block so that I can have multiple statements to determine which token to emit. Since I’m using switch in an expression context, where the value it evaluates to is assigned to the token constant, I need to explicitly return a value from the block, like so:

    const token = switch (self.ch) {
        '=' => blk: {
            if (self.peekChar() == '=') {
                self.readChar();
                break :blk Token{ .token_type = .eq, .literal = "==" };
            }
            break :blk newToken(.assign, "=");
        },
    // ...

Next up, section 1.5, where we’ll add a REPL.


This series