Compiler from Scratch – Part 2: Normalizing Newlines

Well, this is a good sign: I’m posting part 2 :).  As I said in Part 1, there are no promises here.  This series may not go anywhere; I may have to drop it because of my other commitments (school, work, etc.).

In the next few posts, I’m going to talk about the Tokenization phase of the compiler.  Before I go into too much detail on that though, I want to talk about newlines.

In Windows, a new line in a text file is indicated by a pair of characters: a Carriage Return (commonly written as “\r”, the C/C++ string escape sequence for it) followed by a Line Feed (“\n”).  However, in Unix-based operating systems, a new line is indicated by a Line Feed character alone.  To add even more confusion, the classic Mac OS (before Mac OS X) uses a Carriage Return character alone.

A compiler needs to track line numbers accurately in order to report errors, so we need to be extra careful around newlines.  We could simply use the current operating system’s default newline characters, but that makes multi-platform development difficult.  Instead, we’ll normalize the newlines so that all three types are properly understood by our compiler.
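
To see why this matters for line numbers, here is a minimal sketch of a line counter that only looks for ‘\n’.  The class and method names are hypothetical and are not part of the compiler; they exist only to illustrate the problem.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class LineCountDemo
{
    // Counts line breaks by looking for '\n' characters only.
    public static int CountLineFeeds(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (c == '\n')
            {
                count++;
            }
        }
        return count;
    }
}
```

Each of the strings "a\r\nb\r\nc" (Windows), "a\nb\nc" (Unix) and "a\rb\rc" (Carriage Return only) contains two line breaks.  CountLineFeeds returns 2 for the first two, but 0 for the third, so every error in a Carriage Return only file would be reported on line 1.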

To do this, I’ve written a “Decorator” class which inherits from the abstract TextReader class provided in the .NET Framework.  This “Newline Normalizing” decorator wraps an existing TextReader and does all the work of normalizing newline characters.  TextReader provides two methods that need to be overridden, Read and Peek.  Read returns the current character from the text and moves the reader one character forward (so that the next call to Read will return the next character).  Peek also returns the current character from the text but does not move the reader forward.  The “Newline Normalizing” reader implements these two methods using the following code:

public override int Peek() {
    // Get the next character
    int i = Adaptee.Peek();
    
    // If the character is a '\r' newline, just return '\n'.  
    // Unlike Read, we aren't going to read ahead to check for \r\n
    // because that will happen when the user calls Read()
    if (i == (int)'\r') {
        i = (int)'\n';
    }

    return i;
}

public override int Read() {
    // Get the next character
    int i = Adaptee.Read();

    // If the character is a '\r' newline, we're going to normalize it to '\n'
    // However, if the newline is '\r\n', we need to return it as one character, so
    // we check ahead for that
    if (i == (int)'\r') {
        if (Adaptee.Peek() == (int)'\n') {
            Adaptee.Read(); // Skip the '\n'
        }
        i = (int)'\n';
    }

    return i;
}

Essentially, if the character we read from the source (the “Adaptee” as I call it) is a ‘\n’, we just pass it along.  If the character is a ‘\r’, we are going to return ‘\n’, but first we check to see if it is immediately followed by a ‘\n’.  If it is, we skip the extra character.  The result is that no matter which newline sequence is used, this TextReader will return it as a single ‘\n’ character.
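
For anyone who wants to try this out, here is a rough sketch of how the two methods might sit inside a complete class.  The class name NewlineNormalizingReader, the constructor and the Dispose override are my assumptions for illustration; only Peek, Read and the name “Adaptee” come from the code above.

```csharp
using System;
using System.IO;

// Hypothetical class name; a sketch of the decorator, not a definitive version.
public class NewlineNormalizingReader : TextReader
{
    // The wrapped reader (the "Adaptee").
    private readonly TextReader Adaptee;

    public NewlineNormalizingReader(TextReader adaptee)
    {
        if (adaptee == null)
        {
            throw new ArgumentNullException("adaptee");
        }
        Adaptee = adaptee;
    }

    public override int Peek()
    {
        // Report a '\r' as '\n' without consuming anything.
        int i = Adaptee.Peek();
        return i == '\r' ? '\n' : i;
    }

    public override int Read()
    {
        int i = Adaptee.Read();
        if (i == '\r')
        {
            // Collapse "\r\n" into a single newline.
            if (Adaptee.Peek() == '\n')
            {
                Adaptee.Read();
            }
            i = '\n';
        }
        return i;
    }

    protected override void Dispose(bool disposing)
    {
        // Assumption: the decorator owns the wrapped reader.
        if (disposing)
        {
            Adaptee.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

Wrapping new StringReader("a\r\nb\rc\nd") in this class and calling Read until it returns -1 yields the characters a, ‘\n’, b, ‘\n’, c, ‘\n’, d, so all three newline styles come out the same.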

Next post, I’ll start talking about the Tokenization process.
