Oooh, pretty code
Sunday, January 25, 2009 2:47:00 PM (Pacific Standard Time, UTC-08:00) ( CMPT 376 | Compiler Project )
Well, this is a good sign: I’m posting part 2 :). As I said in Part 1, there are no promises here. This series may not go anywhere; I may have to drop it due to my other commitments (school, work, etc.).
In the next few posts, I’m going to talk about the Tokenization phase of the compiler. Before I go into too much detail on that though, I want to talk about newlines.
In Windows, a new line in a text file is indicated by a pair of characters: a Carriage Return (commonly referred to as “\r”, as that is the C/C++ string escape sequence for it) followed by a Line Feed (“\n”). In Unix-based operating systems, however, a new line is indicated by a Line Feed character alone. To add even more confusion, the classic Mac OS (before Mac OS X) uses a Carriage Return character alone.
A compiler needs to track line numbers accurately in order to report errors, so we need to be extra careful around newlines. We could simply use the current operating system’s default newline sequence, but then a source file written on one platform would not compile cleanly on another. Instead, we’ll normalize the newlines so that all three types are properly understood by our compiler.
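To make the line-number problem concrete, here is a small sketch of what goes wrong without normalization. It is only an illustration and not part of the compiler; the names NaiveLineCount and CountByLineFeed are made up for this example. It counts lines by looking for “\n” alone, which is roughly what code written with only one platform in mind might do.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the compiler.
public static class NaiveLineCount
{
    // Counts lines by looking only for '\n' and ignoring '\r' entirely.
    public static int CountByLineFeed(string text)
    {
        int lines = 1;
        foreach (char c in text)
        {
            if (c == '\n')
            {
                lines++;
            }
        }
        return lines;
    }

    public static void Main()
    {
        Console.WriteLine(CountByLineFeed("a\r\nb\r\nc")); // Windows style: prints 3
        Console.WriteLine(CountByLineFeed("a\nb\nc"));     // Unix style: prints 3
        Console.WriteLine(CountByLineFeed("a\rb\rc"));     // classic Mac OS style: prints 1
    }
}
```

The same three-line text is reported as a single line when it uses Carriage Returns alone, so any error reported for it would carry the wrong line number.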
To do this, I’ve written a “Decorator” class which inherits from the abstract TextReader class provided in the .NET Framework. This “Newline Normalizing” decorator wraps an existing TextReader and does all the work of normalizing newline characters. To do that, it needs to override two TextReader methods, Read and Peek. Read returns the current character from the text and moves the reader one character forward (so that the next call to Read will return the next character). Peek also returns the current character from the text but does not move the reader forward. The “Newline Normalizing” reader implements these two methods using the following code:
    public override int Peek() {
        // Get the next character
        int i = Adaptee.Peek();

        // If the character is a '\r' newline, just return '\n'.
        // Unlike Read, we aren't going to read ahead to check for \r\n
        // because that will happen when the user calls Read()
        if (i == (int)'\r') {
            i = (int)'\n';
        }
        return i;
    }

    public override int Read() {
        // Get the next character
        int i = Adaptee.Read();

        // If the character is a '\r' newline, we're going to normalize it to '\n'
        // However, if the newline is '\r\n', we need to return it as one character, so
        // we check ahead for that
        if (i == (int)'\r') {
            if (Adaptee.Peek() == (int)'\n') {
                Adaptee.Read(); // Skip the '\n'
            }
            i = (int)'\n';
        }
        return i;
    }
Essentially, if the character we read from the source (the “Adaptee” as I call it) is a ‘\n’, we just pass it along. If the character is a ‘\r’, we are going to return ‘\n’, but first we check to see if it is immediately followed by a ‘\n’. If it is, we skip the extra character. The result is that no matter which newline sequence is used, this TextReader will return it as a single ‘\n’ character.
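For anyone who wants to try this out, here is a minimal self-contained sketch of how the two methods might sit inside a complete class. Only Peek, Read and the name Adaptee come from the code above. The class name NewlineNormalizingReader, the constructor, and the NormalizeDemo helper are assumptions made up for this example, so treat it as one possible arrangement rather than the actual class from the project.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical class name and constructor; only Peek, Read and "Adaptee"
// are taken from the post.
public class NewlineNormalizingReader : TextReader
{
    // The wrapped reader that supplies the raw characters.
    protected TextReader Adaptee { get; private set; }

    public NewlineNormalizingReader(TextReader adaptee)
    {
        if (adaptee == null) throw new ArgumentNullException("adaptee");
        Adaptee = adaptee;
    }

    public override int Peek()
    {
        int i = Adaptee.Peek();
        // A '\r' always shows up as '\n'; Read() deals with a following '\n'.
        if (i == (int)'\r')
        {
            i = (int)'\n';
        }
        return i;
    }

    public override int Read()
    {
        int i = Adaptee.Read();
        if (i == (int)'\r')
        {
            // Treat "\r\n" as a single newline by consuming the '\n' too.
            if (Adaptee.Peek() == (int)'\n')
            {
                Adaptee.Read();
            }
            i = (int)'\n';
        }
        return i;
    }
}

// Hypothetical helper that drains the reader one character at a time.
public static class NormalizeDemo
{
    public static string Normalize(string input)
    {
        TextReader reader = new NewlineNormalizingReader(new StringReader(input));
        StringBuilder result = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            result.Append((char)c);
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("a\r\nb") == "a\nb"); // Windows: prints True
        Console.WriteLine(Normalize("a\nb") == "a\nb");   // Unix: prints True
        Console.WriteLine(Normalize("a\rb") == "a\nb");   // classic Mac OS: prints True
    }
}
```

Wrapping a StringReader keeps the example free of files. In the compiler, the wrapped reader would presumably be whatever TextReader is reading the source file.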
Next post, I’ll start talking about the Tokenization process.
Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.