"Writing an Interpreter in Go" in Zig, part 2 🔗
Continuing with the lexer, picking up with section 1.4 of the book.
Extending the lexer - one-character operator tokens
We’re extending the lexer to handle additional one-character operator tokens, like “!” and “*”, which we already know add support for in the lexer, additional keywords, which we also know how to do but has a small wrinkle in our Zig implementation, and two-character operator tokens, like “==” and “!=”, which will require looking ahead at our character input stream.
The one-character operator tokens were trivial to add, just stamping out copies of the implementation of “+” and modifying as needed. This boils down to:
- adding a new token type (extending the enum and adding the matching case in the name() function for debug printing of tokens), and
- adding a case branch in the main switch statement of the lexer.
Adding keywords
Adding the keywords “if”, “else”, “return” had a bit of friction. Since
those are all keywords in Zig as well, we can’t simply use these directly
as members of the token type enum. We could do something like prefix the
identifiers with some characters to prevent them from being detected as Zig
keywords, but a simpler solution is to use the @""
syntax that Zig provides
for situations such as these. The @""
syntax allows you to use any name
as an identifier that does not conform to the normal rules. For example,
@"struct"
@"a name with spaces"
@"123_starts_with_numbers"
Then anywhere you refer to this name you use this syntax. So for our new Monkey lang keywords, we have:
pub const Type = enum {
// ...
// Keywords
let,
@"if",
@"else",
@"return",
// ...
}
And then later, when we’re referring to a member of the type enum, we can say, for example:
switch (tok.token_type) {
// ...
case .let => // ...
case .@"if" => // ...
case .@"else" => // ...
case .@"return" => // ...
// ...
}
See “identifiers” in the Zig language reference.
“true” and “false” as keywords
The book makes “true” and “false” keywords, which is an unusual choice. The
book doesn’t remark on it, at least in what I’ve read so far, but in my
experience, Boolean literals are usually provided as pre-declared identifiers,
rather than as part of the syntax of the language. It would be a bit like
having numeric literals like “5” and “10” as keywords. The lexer already
could have handled “true” and “false” fine as named identifiers with the
.ident
token type. It’s not just a matter of preference - it will have
consequences down the road, since we’ll have to handle “true” and “false”
tokens explicitly in the parser, which complicates things. Instead of
expressions just having literals and identifiers as operands and arguments,
we’ll have to special-case the handling of these two values. I haven’t
read ahead, so perhaps there is a rationale for it. In any case, since I
intended to follow the book faithfully on major design decisions, so I could
focus on Zig and the forced implementation choices that presents versus Go,
I went ahead and added “true” and “false” as keywords.
Two-character operator tokens
There’s not much to remark on here, other than that the nested block to test
for the lookahead character gives me the opportunity to use labeled blocks
and to return a value from a block, in this case, the new next token. I need
to introduce a block so that I can have multiple statements to determine
which token to emit. Since I’m using switch
in an expression context,
where the value it evaluates to is assigned to the token
constant, I need
to explicitly return a value from the block, like so:
const token = switch (self.ch) {
'=' => blk: {
if (self.peekChar() == '=') {
self.readChar();
break :blk Token{ .token_type = .eq, .literal = "==" };
}
break :blk newToken(.assign, "=");
},
// ...
Next up, section 1.5, where we’ll add a REPL.