Huh, so apparently something I thought was obvious about parsing and lexing was not. Perhaps that's because most of the languages I've worked on have had at least some context-specific keywords, whereas most toy languages, and languages that were designed in one go rather than by accreting features later, do not have this property.
I always build front ends the opposite way around to how lex / yacc work. In the lex / yacc model (which I think of as 'push'), the lexer drives the parser: it identifies a token, then tells the parser 'I have a token of kind X, please handle it'. This works really badly for languages with context-dependent keywords. For example, in Objective-C, the token atomic may be a keyword if it appears in the attribute list of a declared-property declaration, or an identifier if it appears anywhere else (including in some other places within that same property declaration). The lexer doesn't know which it is, so you need to either:
Have the lexer always treat atomic as an identifier and then do some re-lexing in the parser to say 'ah, you have an identifier, but it's this specific identifier, so it's actually a keyword'.
Replace everything else that uses an identifier with 'identifier or one of these things that are keywords elsewhere'.
The thing you want is to have (at least) two notions of an identifier in the lexer (any identifier, and identifier-but-not-one-of-these-context-specific-keywords), but the lexer can't do this because lexing must be unambiguous in the push model.
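To make that concrete, here's a rough sketch of the push shape in C++. Everything in it (the Lexer and Parser types, handleToken, the token kinds) is made up for illustration rather than taken from any real lexer framework; the point is simply that the classification of atomic happens in the lexer, which has no idea what the parser is in the middle of:

```cpp
// Sketch of the push model: the lexer classifies tokens on its own and then
// calls into the parser.  All names here are invented for illustration.
#include <cctype>
#include <iostream>
#include <string>

enum class TokenKind { Identifier, KeywordAtomic };

struct Token {
  TokenKind kind;
  std::string spelling;
};

struct Parser {
  // The lexer pushes tokens at the parser.  There is no way for the parser to
  // say 'in this position, give me an identifier even if it looks like a keyword'.
  void handleToken(const Token &token) {
    std::cout << (token.kind == TokenKind::KeywordAtomic ? "keyword: " : "identifier: ")
              << token.spelling << '\n';
  }
};

struct Lexer {
  std::string input;
  size_t pos = 0;

  void run(Parser &parser) {
    while (pos < input.size()) {
      if (!std::isalnum(static_cast<unsigned char>(input[pos]))) { ++pos; continue; }
      std::string word;
      while (pos < input.size() && std::isalnum(static_cast<unsigned char>(input[pos])))
        word += input[pos++];
      // The decision that causes the trouble: it is made with no context, so
      // every 'atomic' in the input gets the same classification.
      TokenKind kind = (word == "atomic") ? TokenKind::KeywordAtomic : TokenKind::Identifier;
      parser.handleToken({kind, word});
    }
  }
};

int main() {
  Parser parser;
  // The first 'atomic' could be a property attribute and the second a name,
  // but the lexer classifies both identically.
  Lexer lexer{"atomic atomic"};
  lexer.run(parser);
}
```

Whatever the lexer decides here is what the parser gets for every occurrence of atomic, which is exactly the problem the two workarounds above are trying to paper over.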
In the pull model, the parser is in charge. It asks the lexer for the next token, and may ask it for a token of a specific kind, or from a specific set of kinds. The parser knows the set of things that may happen next. If you're somewhere that allows context-specific keywords, ask the lexer for those first and, if it doesn't have one, ask it for an identifier. Now you have explicit precedence in the parser that disambiguates things for the lexer and avoids introducing complexity in the token definitions. You may also have simpler regexes in the lexer, because you can now specialise for the set of tokens that are valid at a specific point. If you know you need a comma or a close parenthesis after you've parsed a function argument, you can ask for precisely that set of valid tokens, which compiles down to under five instructions on most architectures, rather than running the full state machine that can recognise any token.
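Here's the same toy example reshaped around pulling, again with invented names rather than any real API. The precedence between 'context-specific keyword' and 'plain identifier' is written down in the parser, and the 'comma or close parenthesis' case asks for exactly those two kinds:

```cpp
// Sketch of the pull model: the parser asks the lexer for specific token kinds.
// All names here are invented for illustration.
#include <cctype>
#include <initializer_list>
#include <iostream>
#include <optional>
#include <string>

enum class TokenKind { Identifier, KeywordAtomic, Comma, CloseParen };

struct Token {
  TokenKind kind;
  std::string spelling;
};

class Lexer {
  std::string input;
  size_t pos = 0;

  void skipSpace() {
    while (pos < input.size() && std::isspace(static_cast<unsigned char>(input[pos]))) ++pos;
  }

  bool wordBoundaryAt(size_t end) const {
    return end >= input.size() || !std::isalnum(static_cast<unsigned char>(input[end]));
  }

public:
  explicit Lexer(std::string text) : input(std::move(text)) {}

  // Try to lex a token of one of the requested kinds at the current position.
  // If none of them match, consume nothing, so the parser can ask again with a
  // different set of kinds.
  std::optional<Token> pull(std::initializer_list<TokenKind> wanted) {
    skipSpace();
    for (TokenKind kind : wanted) {
      if (kind == TokenKind::Comma && pos < input.size() && input[pos] == ',') {
        ++pos; return Token{kind, ","};
      }
      if (kind == TokenKind::CloseParen && pos < input.size() && input[pos] == ')') {
        ++pos; return Token{kind, ")"};
      }
      if (kind == TokenKind::KeywordAtomic && input.compare(pos, 6, "atomic") == 0 &&
          wordBoundaryAt(pos + 6)) {
        pos += 6; return Token{kind, "atomic"};
      }
      if (kind == TokenKind::Identifier && pos < input.size() &&
          std::isalpha(static_cast<unsigned char>(input[pos]))) {
        std::string word;
        while (pos < input.size() && std::isalnum(static_cast<unsigned char>(input[pos])))
          word += input[pos++];
        return Token{kind, word};
      }
    }
    return std::nullopt;
  }
};

// Parse a property-attribute list such as "atomic, readonly)".  The parser
// encodes the precedence: try the context-specific keyword first, fall back to
// a plain identifier, then require exactly a comma or a close parenthesis.
void parseAttributes(Lexer &lexer) {
  while (true) {
    if (auto keyword = lexer.pull({TokenKind::KeywordAtomic}))
      std::cout << "attribute keyword: " << keyword->spelling << '\n';
    else if (auto identifier = lexer.pull({TokenKind::Identifier}))
      std::cout << "attribute identifier: " << identifier->spelling << '\n';
    else { std::cerr << "expected an attribute\n"; return; }
    // After an attribute, only ',' or ')' is valid: two comparisons, not the
    // whole lexer state machine.
    auto next = lexer.pull({TokenKind::Comma, TokenKind::CloseParen});
    if (!next) { std::cerr << "expected ',' or ')'\n"; return; }
    if (next->kind == TokenKind::CloseParen) return;
  }
}

int main() {
  Lexer lexer("atomic, readonly)");
  parseAttributes(lexer);
}
```

The pull call for {Comma, CloseParen} only ever has to compare against two characters, which is the kind of check the 'under five instructions' point above refers to.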
Even without any performance benefits, it's just a much nicer way of writing a parser. Yet the push model still seems to be taught and explained as if it's a sensible thing to do.