Compiler from Scratch – Part 3: Tokenization Overview

Here comes part 3 of my infinite part series on developing a compiler from scratch.  See all the posts here.

The first phase of compilation is called Tokenization.  Tokenization is the process of taking individual characters in the source code file and grouping them into more conceptual “Tokens.”  We’re not parsing the language, or checking syntax, we’re just grouping related characters together.  Why do we do this?  If we skipped this step, we’d be passing characters directly to our parser.  The parser doesn’t care that the text contains the characters ‘f’ ‘o’ ‘o’, it cares that the source file contains an Identifier called “foo”.  There’s also a separation of concerns issue.  In theory, if you wanted to allow accented Unicode characters like è, ê, ã, à, etc. (you may see blocks or ‘?’ characters instead of the real characters, if so just pretend they’re there :D) as valid characters in identifiers, you should only have to change the Tokenizer (to accept those characters and use them, along with the regular alphanumeric characters, to build identifier tokens).  The formal name for Tokenization is actually “Lexical Analysis,” and sometimes the Tokens are called “Atoms.”

So, how do we Tokenize text?  There are lots of tools out there, most of which let you define your tokens using Regular Expressions (so an identifier would be: “[_A-Za-z][_0-9A-Za-z]*”, one letter or underscore followed by zero or more letters, numbers or underscores).  These tools build state machines (aka “Finite Automata”) that basically walk through all the Regular Expressions until exactly one of them matches the current text buffer.  Tools like “lex” basically allow you to build a full-featured, efficient, Tokenizer by simply writing Regular Expressions.

However, that would be too easy for this series :).  Instead, I’m using a very simple, hand-coded, Tokenizer which walks through the characters, using “if” and “switch” statements to decide what tokens to output.  However, we can’t just simply iterate through the characters.  Consider the ”->” operator in Duh (see Part 1).  Duh supports the simple arithmetic operators, specifically “-“, and I plan to support inequality operators (i.e “<”, “>”, etc.), so in theory, the “->” operator could be outputted as a pair of tokens: MINUS and GREATERTHAN (I’ll use ALLCAPS to denote the names of tokens).  However, this would be putting a larger burden on the parser to detect the MINUS GREATERTHAN pattern.  Instead, we can take advantage of “lookahead”

The Tokenizer uses the .Net TextReader class, which lets us walk, one character at a time, through the characters in a body of text.  However, once we’ve called the Read method, the TextReader moves on to the next character and the next time we call it, we’ll get the next character.  So, in order to properly parse the “->” operator, we need a way to look at the next character, without moving the reader.  Fortunately, the TextReader also has the Peek method, which does exactly that.

So now, when we encounter a “-“ symbol, we look at the next character.  If it is a “>”, we output an ARROW token, and move to the next character.  Otherwise, we output a MINUS token and leave the reader at the “>” character.  Then, in the next iteration, we read the “>” and output a GREATERTHAN token.

Phew, that was a bit long, and more theoretical than I like, but I’ll post some code in the next part.  Next time, we’ll look at the actual code for the Tokenizer.

Comments are closed.