<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:pingback="http://madskills.com/public/xml/rss/module/pingback/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>VibrantCode - Compiler Project</title>
    <link>http://blog.andrewnurse.net/</link>
    <description>Oooh...pretty code</description>
    <language>en-us</language>
    <copyright>Andrew Nurse</copyright>
    <lastBuildDate>Wed, 18 Feb 2009 23:07:00 GMT</lastBuildDate>
    <generator>newtelligence dasBlog 2.0.7226.0</generator>
    <managingEditor>andrew@andrewnurse.net</managingEditor>
    <webMaster>andrew@andrewnurse.net</webMaster>
    <item>
      <trackback:ping>http://blog.andrewnurse.net/Trackback.aspx?guid=8cc65c27-995e-421e-910d-2bdd36c208c7</trackback:ping>
      <pingback:server>http://blog.andrewnurse.net/pingback.aspx</pingback:server>
      <pingback:target>http://blog.andrewnurse.net/PermaLink,guid,8cc65c27-995e-421e-910d-2bdd36c208c7.aspx</pingback:target>
      <dc:creator>Andrew Nurse</dc:creator>
      <wfw:comment>http://blog.andrewnurse.net/CommentView,guid,8cc65c27-995e-421e-910d-2bdd36c208c7.aspx</wfw:comment>
      <wfw:commentRss>http://blog.andrewnurse.net/SyndicationService.asmx/GetEntryCommentsRss?guid=8cc65c27-995e-421e-910d-2bdd36c208c7</wfw:commentRss>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
Here’s part 4 of my infinite part series on developing a compiler from scratch. 
See all the posts <a href="http://blog.andrewnurse.net/ct.ashx?id=cbb29a9a-f33f-45e9-8836-47c73a0fd1a0&amp;url=http%3a%2f%2fblog.andrewnurse.net%2fCategoryView%2ccategory%2cCompiler%252BProject.aspx">here</a>.
</p>
        <p>
I’ve decided to change things up a bit on the compiler project.  Instead of writing
the tokenizer and parser from scratch, I’m going to check out a new language that’s
part of Microsoft’s <a href="http://msdn.microsoft.com/en-us/library/cc709420.aspx" target="_blank">“Oslo”</a> Project. 
Oslo includes a language called M, which is primarily designed for developing data
models.  M is actually three languages, MGraph, MSchema and MGrammar.  MGraph
is a language for defining abstract graphs, for example:
</p>
        <pre class="m" name="code">Universities {
    Canada {
        BritishColumbia {
            SFU { 
                Name { "Simon Fraser University" }, 
                Location { "Burnaby, BC" } 
            },
            UBC { 
                Name { "University of British Columbia" }, 
                Location { "Vancouver, BC" } 
            }
        }
    }
}</pre>
        <p>
MSchema is a language for defining the schema for these graphs, for example
</p>
        <pre class="m" name="code">module Universities {
    type University {
        Name : Text#100,
        Location : Text#100
    }
}</pre>
        <p>
Finally, MGrammar is a language for defining custom Domain-Specific Languages (DSLs)
which map to MGraph graphs. This is the language that is most useful to this project.
After all, a general-purpose programming language is just a DSL where the "Domain"
is very broad :). So, I've spent a few hours this week whipping up an MGrammar grammar
for Duh, and the results are promising.  I’ll be posting some details on that
in the next post, but first we need to explore a key concept in compilers, Abstract
Syntax Trees.
</p>
        <p>
The end goal of the tokenization and parsing phases of a compiler is to produce an
Abstract Syntax Tree (or AST) representing the source code.  This is probably
best illustrated by an example, so consider this expression:
</p>
        <pre>5+6</pre>
        <p>
This is an addition operation, using 5 and 6 as the arguments, which we represent
in an AST like this (using the MGraph syntax above):
</p>
        <pre>Add { 5, 6 }</pre>
        <p>
Similarly, more complex expressions can be translated to trees:
</p>
        <pre>5+6*(3+4)</pre>
        <p>
Becomes:
</p>
        <pre>Add { 5, Multiply { 6, Add { 3, 4 } } }</pre>
        <p>
Which represents the following instructions:
</p>
        <ul>
          <li>
Add 3 and 4 
</li>
          <li>
Multiply that result by 6 
</li>
          <li>
Add 5 to that result 
</li>
        </ul>
        <p>
Essentially, we are building up the expression from the bottom of the tree up. 
This makes it much easier to convert the expression into code.
</p>
        <p>
I’ll post the MGrammar file for the language when I finish it up in a day or so, then
its on to writing the code to do the parsing.
</p>
        <img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=8cc65c27-995e-421e-910d-2bdd36c208c7" />
      </body>
      <title>Compiler Project &amp;ndash; Part 4: ASTs and switching to MGrammar</title>
      <guid isPermaLink="false">http://blog.andrewnurse.net/PermaLink,guid,8cc65c27-995e-421e-910d-2bdd36c208c7.aspx</guid>
      <link>http://blog.andrewnurse.net/2009/02/18/CompilerProjectNdashPart4ASTsAndSwitchingToMGrammar.aspx</link>
      <pubDate>Wed, 18 Feb 2009 23:07:00 GMT</pubDate>
      <description>&lt;p&gt;
Here’s part 4 of my infinite part series on developing a compiler from scratch.&amp;#160;
See all the posts &lt;a href="http://blog.andrewnurse.net/ct.ashx?id=cbb29a9a-f33f-45e9-8836-47c73a0fd1a0&amp;amp;url=http%3a%2f%2fblog.andrewnurse.net%2fCategoryView%2ccategory%2cCompiler%252BProject.aspx"&gt;here&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
I’ve decided to change things up a bit on the compiler project.&amp;#160; Instead of writing
the tokenizer and parser from scratch, I’m going to check out a new language that’s
part of Microsoft’s &lt;a href="http://msdn.microsoft.com/en-us/library/cc709420.aspx" target="_blank"&gt;“Oslo”&lt;/a&gt; Project.&amp;#160;
Oslo includes a language called M, which is primarily designed for developing data
models.&amp;#160; M is actually three languages, MGraph, MSchema and MGrammar.&amp;#160; MGraph
is a language for defining abstract graphs, for example:
&lt;/p&gt;
&lt;pre class="m" name="code"&gt;Universities {
    Canada {
        BritishColumbia {
            SFU { 
                Name { &amp;quot;Simon Fraser University&amp;quot; }, 
                Location { &amp;quot;Burnaby, BC&amp;quot; } 
            },
            UBC { 
                Name { &amp;quot;University of British Columbia&amp;quot; }, 
                Location { &amp;quot;Vancouver, BC&amp;quot; } 
            }
        }
    }
}&lt;/pre&gt;
&lt;p&gt;
MSchema is a language for defining the schema for these graphs, for example
&lt;/p&gt;
&lt;pre class="m" name="code"&gt;module Universities {
    type University {
        Name : Text#100,
        Location : Text#100
    }
}&lt;/pre&gt;
&lt;p&gt;
Finally, MGrammar is a language for defining custom Domain-Specific Languages (DSLs)
which map to MGraph graphs. This is the language that is most useful to this project.
After all, a general-purpose programming language is just a DSL where the &amp;quot;Domain&amp;quot;
is very broad :). So, I've spent a few hours this week whipping up an MGrammar grammar
for Duh, and the results are promising.&amp;#160; I’ll be posting some details on that
in the next post, but first we need to explore a key concept in compilers, Abstract
Syntax Trees.
&lt;/p&gt;
&lt;p&gt;
The end goal of the tokenization and parsing phases of a compiler is to produce an
Abstract Syntax Tree (or AST) representing the source code.&amp;#160; This is probably
best illustrated by an example, so consider this expression:
&lt;/p&gt;
&lt;pre&gt;5+6&lt;/pre&gt;
&lt;p&gt;
This is an addition operation, using 5 and 6 as the arguments, which we represent
in an AST like this (using the MGraph syntax above):
&lt;/p&gt;
&lt;pre&gt;Add { 5, 6 }&lt;/pre&gt;
&lt;p&gt;
Similarly, more complex expressions can be translated to trees:
&lt;/p&gt;
&lt;pre&gt;5+6*(3+4)&lt;/pre&gt;
&lt;p&gt;
Becomes:
&lt;/p&gt;
&lt;pre&gt;Add { 5, Multiply { 6, Add { 3, 4 } } }&lt;/pre&gt;
&lt;p&gt;
Which represents the following instructions:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
Add 3 and 4 
&lt;/li&gt;
&lt;li&gt;
Multiply that result by 6 
&lt;/li&gt;
&lt;li&gt;
Add 5 to that result 
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
Essentially, we are building up the expression from the bottom of the tree up.&amp;#160;
This makes it much easier to convert the expression into code.
&lt;/p&gt;
&lt;p&gt;
I’ll post the MGrammar file for the language when I finish it up in a day or so, then
its on to writing the code to do the parsing.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=8cc65c27-995e-421e-910d-2bdd36c208c7" /&gt;</description>
      <comments>http://blog.andrewnurse.net/CommentView,guid,8cc65c27-995e-421e-910d-2bdd36c208c7.aspx</comments>
      <category>CMPT 376</category>
      <category>Compiler Project</category>
    </item>
    <item>
      <trackback:ping>http://blog.andrewnurse.net/Trackback.aspx?guid=cbb29a9a-f33f-45e9-8836-47c73a0fd1a0</trackback:ping>
      <pingback:server>http://blog.andrewnurse.net/pingback.aspx</pingback:server>
      <pingback:target>http://blog.andrewnurse.net/PermaLink,guid,cbb29a9a-f33f-45e9-8836-47c73a0fd1a0.aspx</pingback:target>
      <dc:creator>Andrew Nurse</dc:creator>
      <wfw:comment>http://blog.andrewnurse.net/CommentView,guid,cbb29a9a-f33f-45e9-8836-47c73a0fd1a0.aspx</wfw:comment>
      <wfw:commentRss>http://blog.andrewnurse.net/SyndicationService.asmx/GetEntryCommentsRss?guid=cbb29a9a-f33f-45e9-8836-47c73a0fd1a0</wfw:commentRss>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
Here comes part 3 of my infinite part series on developing a compiler from scratch. 
See all the posts <a href="http://blog.andrewnurse.net/CategoryView,category,Compiler%2BProject.aspx" target="_blank">here</a>.
</p>
        <p>
The first phase of compilation is called Tokenization.  Tokenization is the process
of taking individual characters in the source code file and grouping them into more
conceptual “Tokens.”  We’re not parsing the language, or checking syntax, we’re
just grouping related characters together.  Why do we do this?  If we skipped
this step, we’d be passing characters directly to our parser.  The parser doesn’t
care that the text contains the characters ‘f’ ‘o’ ‘o’, it cares that the source file
contains an Identifier called “foo”.  There’s also a separation of concerns issue. 
In theory, if you wanted to allow accented Unicode characters like è, ê, ã, à, etc.
(you may see blocks or ‘?’ characters instead of the real characters, if so just pretend
they’re there :D) as valid characters in identifiers, you should only have to change
the Tokenizer (to accept those characters and use them, along with the regular alphanumeric
characters, to build identifier tokens).  The formal name for Tokenization is
actually “Lexical Analysis,” and sometimes the Tokens are called “Atoms.”
</p>
        <p>
So, how do we Tokenize text?  There are lots of tools out there, most of which
let you define your tokens using Regular Expressions (so an identifier would be: “[_A-Za-z][_0-9A-Za-z]*”,
one letter or underscore followed by zero or more letters, numbers or underscores). 
These tools build state machines (aka “<a href="http://en.wikipedia.org/wiki/Finite_state_machine" target="_blank">Finite
Automata</a>”) that basically walk through all the Regular Expressions until exactly
one of them matches the current text buffer.  Tools like “<a href="http://en.wikipedia.org/wiki/Lex_programming_tool" target="_blank">lex</a>”
basically allow you to build a full-featured, efficient, Tokenizer by simply writing
Regular Expressions.
</p>
        <p>
However, that would be too easy for this series :).  Instead, I’m using a very
simple, hand-coded, Tokenizer which walks through the characters, using “if” and “switch”
statements to decide what tokens to output.  However, we can’t just simply iterate
through the characters.  Consider the ”-&gt;” operator in Duh (see <a href="http://blog.andrewnurse.net/2009/01/19/WritingACompilerFromScratchNdashPart1.aspx" target="_blank">Part
1</a>).  Duh supports the simple arithmetic operators, specifically “-“, and
I plan to support inequality operators (i.e “&lt;”, “&gt;”, etc.), so in theory, the
“-&gt;” operator could be outputted as a pair of tokens: MINUS and GREATERTHAN (I’ll
use ALLCAPS to denote the names of tokens).  However, this would be putting a
larger burden on the parser to detect the MINUS GREATERTHAN pattern.  Instead,
we can take advantage of “lookahead”
</p>
        <p>
The Tokenizer uses the .Net TextReader class, which lets us walk, one character at
a time, through the characters in a body of text.  However, once we’ve called
the <code>Read</code> method, the TextReader moves on to the next character and the
next time we call it, we’ll get the next character.  So, in order to properly
parse the “-&gt;” operator, we need a way to look at the next character, without moving
the reader.  Fortunately, the TextReader also has the <code>Peek</code> method,
which does exactly that.
</p>
        <p>
So now, when we encounter a “-“ symbol, we look at the next character.  If it
is a “&gt;”, we output an ARROW token, and move to the next character.  Otherwise,
we output a MINUS token and leave the reader at the “&gt;” character.  Then,
in the next iteration, we read the “&gt;” and output a GREATERTHAN token.
</p>
        <p>
Phew, that was a bit long, and more theoretical than I like, but I’ll post some code
in the next part.  Next time, we’ll look at the actual code for the Tokenizer.
</p>
        <img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=cbb29a9a-f33f-45e9-8836-47c73a0fd1a0" />
      </body>
      <title>Compiler from Scratch &amp;ndash; Part 3: Tokenization Overview</title>
      <guid isPermaLink="false">http://blog.andrewnurse.net/PermaLink,guid,cbb29a9a-f33f-45e9-8836-47c73a0fd1a0.aspx</guid>
      <link>http://blog.andrewnurse.net/2009/01/28/CompilerFromScratchNdashPart3TokenizationOverview.aspx</link>
      <pubDate>Wed, 28 Jan 2009 19:13:12 GMT</pubDate>
      <description>&lt;p&gt;
Here comes part 3 of my infinite part series on developing a compiler from scratch.&amp;#160;
See all the posts &lt;a href="http://blog.andrewnurse.net/CategoryView,category,Compiler%2BProject.aspx" target="_blank"&gt;here&lt;/a&gt;.
&lt;/p&gt;
&lt;p&gt;
The first phase of compilation is called Tokenization.&amp;#160; Tokenization is the process
of taking individual characters in the source code file and grouping them into more
conceptual “Tokens.”&amp;#160; We’re not parsing the language, or checking syntax, we’re
just grouping related characters together.&amp;#160; Why do we do this?&amp;#160; If we skipped
this step, we’d be passing characters directly to our parser.&amp;#160; The parser doesn’t
care that the text contains the characters ‘f’ ‘o’ ‘o’, it cares that the source file
contains an Identifier called “foo”.&amp;#160; There’s also a separation of concerns issue.&amp;#160;
In theory, if you wanted to allow accented Unicode characters like è, ê, ã, à, etc.
(you may see blocks or ‘?’ characters instead of the real characters, if so just pretend
they’re there :D) as valid characters in identifiers, you should only have to change
the Tokenizer (to accept those characters and use them, along with the regular alphanumeric
characters, to build identifier tokens).&amp;#160; The formal name for Tokenization is
actually “Lexical Analysis,” and sometimes the Tokens are called “Atoms.”
&lt;/p&gt;
&lt;p&gt;
So, how do we Tokenize text?&amp;#160; There are lots of tools out there, most of which
let you define your tokens using Regular Expressions (so an identifier would be: “[_A-Za-z][_0-9A-Za-z]*”,
one letter or underscore followed by zero or more letters, numbers or underscores).&amp;#160;
These tools build state machines (aka “&lt;a href="http://en.wikipedia.org/wiki/Finite_state_machine" target="_blank"&gt;Finite
Automata&lt;/a&gt;”) that basically walk through all the Regular Expressions until exactly
one of them matches the current text buffer.&amp;#160; Tools like “&lt;a href="http://en.wikipedia.org/wiki/Lex_programming_tool" target="_blank"&gt;lex&lt;/a&gt;”
basically allow you to build a full-featured, efficient, Tokenizer by simply writing
Regular Expressions.
&lt;/p&gt;
&lt;p&gt;
However, that would be too easy for this series :).&amp;#160; Instead, I’m using a very
simple, hand-coded, Tokenizer which walks through the characters, using “if” and “switch”
statements to decide what tokens to output.&amp;#160; However, we can’t just simply iterate
through the characters.&amp;#160; Consider the ”-&amp;gt;” operator in Duh (see &lt;a href="http://blog.andrewnurse.net/2009/01/19/WritingACompilerFromScratchNdashPart1.aspx" target="_blank"&gt;Part
1&lt;/a&gt;).&amp;#160; Duh supports the simple arithmetic operators, specifically “-“, and
I plan to support inequality operators (i.e “&amp;lt;”, “&amp;gt;”, etc.), so in theory, the
“-&amp;gt;” operator could be outputted as a pair of tokens: MINUS and GREATERTHAN (I’ll
use ALLCAPS to denote the names of tokens).&amp;#160; However, this would be putting a
larger burden on the parser to detect the MINUS GREATERTHAN pattern.&amp;#160; Instead,
we can take advantage of “lookahead”
&lt;/p&gt;
&lt;p&gt;
The Tokenizer uses the .Net TextReader class, which lets us walk, one character at
a time, through the characters in a body of text.&amp;#160; However, once we’ve called
the &lt;code&gt;Read&lt;/code&gt; method, the TextReader moves on to the next character and the
next time we call it, we’ll get the next character.&amp;#160; So, in order to properly
parse the “-&amp;gt;” operator, we need a way to look at the next character, without moving
the reader.&amp;#160; Fortunately, the TextReader also has the &lt;code&gt;Peek&lt;/code&gt; method,
which does exactly that.
&lt;/p&gt;
&lt;p&gt;
So now, when we encounter a “-“ symbol, we look at the next character.&amp;#160; If it
is a “&amp;gt;”, we output an ARROW token, and move to the next character.&amp;#160; Otherwise,
we output a MINUS token and leave the reader at the “&amp;gt;” character.&amp;#160; Then,
in the next iteration, we read the “&amp;gt;” and output a GREATERTHAN token.
&lt;/p&gt;
&lt;p&gt;
Phew, that was a bit long, and more theoretical than I like, but I’ll post some code
in the next part.&amp;#160; Next time, we’ll look at the actual code for the Tokenizer.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=cbb29a9a-f33f-45e9-8836-47c73a0fd1a0" /&gt;</description>
      <comments>http://blog.andrewnurse.net/CommentView,guid,cbb29a9a-f33f-45e9-8836-47c73a0fd1a0.aspx</comments>
      <category>CMPT 376</category>
      <category>Compiler Project</category>
    </item>
    <item>
      <trackback:ping>http://blog.andrewnurse.net/Trackback.aspx?guid=fa257811-a414-419a-8ffd-57f889a3daa1</trackback:ping>
      <pingback:server>http://blog.andrewnurse.net/pingback.aspx</pingback:server>
      <pingback:target>http://blog.andrewnurse.net/PermaLink,guid,fa257811-a414-419a-8ffd-57f889a3daa1.aspx</pingback:target>
      <dc:creator>Andrew Nurse</dc:creator>
      <wfw:comment>http://blog.andrewnurse.net/CommentView,guid,fa257811-a414-419a-8ffd-57f889a3daa1.aspx</wfw:comment>
      <wfw:commentRss>http://blog.andrewnurse.net/SyndicationService.asmx/GetEntryCommentsRss?guid=fa257811-a414-419a-8ffd-57f889a3daa1</wfw:commentRss>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
Well, this is a good sign, I’m posting part 2 :).  As I said in <a href="http://blog.andrewnurse.net/2009/01/19/WritingACompilerFromScratchNdashPart1.aspx" target="_blank">Part
1</a>, there are no promises here.  This series may not go anywhere, I may have
to drop it due to my other commitments (school, work, etc.).
</p>
        <p>
In the next few posts, I’m going to talk about the Tokenization phase of the compiler. 
Before I go into too much detail on that though, I want to talk about newlines.
</p>
        <p>
In Windows, a new line in a text file is indicated by a pair of characters: A Carriage
Return (commonly referred to as “\r'”, as that is the C/C++ string escape sequence
for it) followed by a Line Feed (”\n”).  However, in Unix-based operating systems,
a new line is often indicated by a Line Feed character alone.  To add even more
confusion, the Mac OS uses a Carriage Return character alone.  
</p>
        <p>
A compiler needs to track line numbers accurately, in order to report errors, so we
need be extra careful around newlines.  We could simply use the current operating
system’s default newline characters, but that makes it difficult for multi-platform
development.  Instead, we’ll normalize the newlines so that all three different
types are properly understood by our compiler.
</p>
        <p>
To do this, I’ve written a “Decorator” class which inherits from the abstract <code>TextReader</code> class
provided in the .Net framework.  This “Newline Normalizing” decorator wraps an
existing <code>TextReader</code> and does all the work of normalizing new line characters 
TextReader provides two methods that need to be implemented, <code>Read</code> and <code>Peek</code>.  <code>Read</code> returns
the current character from the text and moves the reader one character forwards (so
that the next call to Read will return the next character).  <code>Peek</code> also
returns the current character from the text but <strong>does not</strong> move the
reader forward.  The “Newline Normalizing” reader implements these two methods
using the following code
</p>
        <pre class="csharp" name="code">public override int Peek() {
    // Get the next character
    int i = Adaptee.Peek();
    
    // If the character is a '\r' newline, just return '\n'.  
    // Unlike Read, we aren't going to read ahead to check for \r\n
    // because that will happen when the user calls Read()
    if (i == (int)'\r') {
        i =  (int)'\n';
    }

    return i;
}

public override int Read() {
    // Get the next character
    int i = Adaptee.Read();

    // If the character is a '\r' newline, we're going to normalize it to '\n'
    // However, if the newline is '\r\n', we need to return it as one character, so
    // we check ahead for that
    if (i == (int)'\r') {
        if (Adaptee.Peek() == (int)'\n') {
            Adaptee.Read(); // Skip the '\n'
        }
        i = (int)'\n';
    }

    return i;
}</pre>
        <p>
Essentially, if the character we read from the source (the “Adaptee” as I call it)
is a ‘\n’, we just pass it along.  If the character is a ‘\r’, we are going to
return ‘\n’, but first we first check to see if it is immediately followed by a ‘\n’. 
If it is, we skip the extra character.  The result is that no matter which newline
sequence is used, this <code>TextReader</code> will return it as a single ‘\n’ character.
</p>
        <p>
Next post, I’ll start talking about the Tokenization process.
</p>
        <img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=fa257811-a414-419a-8ffd-57f889a3daa1" />
      </body>
      <title>Compiler from Scratch &amp;ndash; Part 2: Normalizing Newlines</title>
      <guid isPermaLink="false">http://blog.andrewnurse.net/PermaLink,guid,fa257811-a414-419a-8ffd-57f889a3daa1.aspx</guid>
      <link>http://blog.andrewnurse.net/2009/01/25/CompilerFromScratchNdashPart2NormalizingNewlines.aspx</link>
      <pubDate>Sun, 25 Jan 2009 22:47:00 GMT</pubDate>
      <description>&lt;p&gt;
Well, this is a good sign, I’m posting part 2 :).&amp;#160; As I said in &lt;a href="http://blog.andrewnurse.net/2009/01/19/WritingACompilerFromScratchNdashPart1.aspx" target="_blank"&gt;Part
1&lt;/a&gt;, there are no promises here.&amp;#160; This series may not go anywhere, I may have
to drop it due to my other commitments (school, work, etc.).
&lt;/p&gt;
&lt;p&gt;
In the next few posts, I’m going to talk about the Tokenization phase of the compiler.&amp;#160;
Before I go into too much detail on that though, I want to talk about newlines.
&lt;/p&gt;
&lt;p&gt;
In Windows, a new line in a text file is indicated by a pair of characters: A Carriage
Return (commonly referred to as “\r'”, as that is the C/C++ string escape sequence
for it) followed by a Line Feed (”\n”).&amp;#160; However, in Unix-based operating systems,
a new line is often indicated by a Line Feed character alone.&amp;#160; To add even more
confusion, the Mac OS uses a Carriage Return character alone.&amp;#160; 
&lt;/p&gt;
&lt;p&gt;
A compiler needs to track line numbers accurately, in order to report errors, so we
need be extra careful around newlines.&amp;#160; We could simply use the current operating
system’s default newline characters, but that makes it difficult for multi-platform
development.&amp;#160; Instead, we’ll normalize the newlines so that all three different
types are properly understood by our compiler.
&lt;/p&gt;
&lt;p&gt;
To do this, I’ve written a “Decorator” class which inherits from the abstract &lt;code&gt;TextReader&lt;/code&gt; class
provided in the .Net framework.&amp;#160; This “Newline Normalizing” decorator wraps an
existing &lt;code&gt;TextReader&lt;/code&gt; and does all the work of normalizing new line characters&amp;#160;
TextReader provides two methods that need to be implemented, &lt;code&gt;Read&lt;/code&gt; and &lt;code&gt;Peek&lt;/code&gt;.&amp;#160; &lt;code&gt;Read&lt;/code&gt; returns
the current character from the text and moves the reader one character forwards (so
that the next call to Read will return the next character).&amp;#160; &lt;code&gt;Peek&lt;/code&gt; also
returns the current character from the text but &lt;strong&gt;does not&lt;/strong&gt; move the
reader forward.&amp;#160; The “Newline Normalizing” reader implements these two methods
using the following code
&lt;/p&gt;
&lt;pre class="csharp" name="code"&gt;public override int Peek() {
    // Get the next character
    int i = Adaptee.Peek();
    
    // If the character is a '\r' newline, just return '\n'.  
    // Unlike Read, we aren't going to read ahead to check for \r\n
    // because that will happen when the user calls Read()
    if (i == (int)'\r') {
        i =  (int)'\n';
    }

    return i;
}

public override int Read() {
    // Get the next character
    int i = Adaptee.Read();

    // If the character is a '\r' newline, we're going to normalize it to '\n'
    // However, if the newline is '\r\n', we need to return it as one character, so
    // we check ahead for that
    if (i == (int)'\r') {
        if (Adaptee.Peek() == (int)'\n') {
            Adaptee.Read(); // Skip the '\n'
        }
        i = (int)'\n';
    }

    return i;
}&lt;/pre&gt;
&lt;p&gt;
Essentially, if the character we read from the source (the “Adaptee” as I call it)
is a ‘\n’, we just pass it along.&amp;#160; If the character is a ‘\r’, we are going to
return ‘\n’, but first we first check to see if it is immediately followed by a ‘\n’.&amp;#160;
If it is, we skip the extra character.&amp;#160; The result is that no matter which newline
sequence is used, this &lt;code&gt;TextReader&lt;/code&gt; will return it as a single ‘\n’ character.
&lt;/p&gt;
&lt;p&gt;
Next post, I’ll start talking about the Tokenization process.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=fa257811-a414-419a-8ffd-57f889a3daa1" /&gt;</description>
      <comments>http://blog.andrewnurse.net/CommentView,guid,fa257811-a414-419a-8ffd-57f889a3daa1.aspx</comments>
      <category>CMPT 376</category>
      <category>Compiler Project</category>
    </item>
    <item>
      <trackback:ping>http://blog.andrewnurse.net/Trackback.aspx?guid=1b470301-0cfd-4e5c-ac71-6fefa51e8d3f</trackback:ping>
      <pingback:server>http://blog.andrewnurse.net/pingback.aspx</pingback:server>
      <pingback:target>http://blog.andrewnurse.net/PermaLink,guid,1b470301-0cfd-4e5c-ac71-6fefa51e8d3f.aspx</pingback:target>
      <dc:creator>Andrew Nurse</dc:creator>
      <wfw:comment>http://blog.andrewnurse.net/CommentView,guid,1b470301-0cfd-4e5c-ac71-6fefa51e8d3f.aspx</wfw:comment>
      <wfw:commentRss>http://blog.andrewnurse.net/SyndicationService.asmx/GetEntryCommentsRss?guid=1b470301-0cfd-4e5c-ac71-6fefa51e8d3f</wfw:commentRss>
      <slash:comments>2</slash:comments>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
EDIT: Whoops, wrote the wrong course number (CMPT 376), corrected below.  I’m
taking CMPT 376, Writing for Computer Scientists, this semester, must have gotten
myself mixed up :)
</p>
        <p>
Ok, I’m going to try something here, and I don’t know if I’ll manage to finish it. 
My all-time favorite class in my school career (so far) has been CMPT 379 – Compiler
Design.  The class basically consisted of me writing a compiler in Visual C++,
with weekly lectures and code reviews from the professor (it was a small class). 
Since then, I’ve been trying to think of a reason to write a compiler (i.e. a language
that can be ported to .Net, or a useful domain-specific language, etc.).  Then,
this morning, as I walked back to the Computing Science common room from the Gym (after
my morning workout), I found a reason.  I think compilers are some of the most
interesting pieces of software, and I thought that if I wrote a series of blog posts
in which I developed a compiler, maybe I could share that passion with others.  
</p>
        <p>
So, here’s the plan:  I’ve designed an, extremely simple, language that I’m going
to walk through writing a compiler for.  The initial version will target .Net
IL code (since it’s a stack machine system, which is much easier to generate code
for).  Once I’ve implemented the simple language, I think it’d be awesome if
my readers could jump in and propose some ideas for new language features, and I’ll
try to blog about implementing them.
</p>
        <p>
To be honest, this project may well fall flat on its face.  Writing a compiler
is not a simple task, no matter how simple the language.  However, I do think
it will be fun (while it lasts), so let’s give it a try!
</p>
        <p>
So, step one is: Define your language!  The most essential part of compiler design
is having a very clear idea of the purpose of your language.  The language I’m
going to design is called “Duh”, because it’s probably about as simple as it gets
(maybe a little more complex than <a href="http://en.wikipedia.org/wiki/Brainf**k" target="_blank">Brainf**k</a> (NOTE:
link target does include a few four-letter words :D)).  Here’s a sample Duh program
</p>
        <pre class="csharp" name="code">print "Enter your age: ";
ageInput : string;
age : int;
ageInput = readline;
age = ageInput -&gt; int;
print "In 10 years you will be " + (age + 10) -&gt; string + ". Wow, that's old"</pre>
        <p>
Pretty simple, eh?  Line 1 is a simple print statement, which accepts a string
and write it to the console.  Lines 2 and 3 declare variables of string and 32-bit
integer types (decided to use a more Pascal-esque variable declaration format, just
for fun).  Lines 4 and 5 assign values to those variables.  Line 4 uses
another built-in function, readline, which reads a line of text from the console. 
Note that we don’t even support initializing variables in the declaration!  That’s
ok though, this is just a toy language, we can add initializations later.  Line
5 introduces the “-&gt;” conversion operator, which takes the value on the left and
converts it to the type on the right.  It’s not quite the same as a cast, because
it will try to convert the value, if it can.  Finally, we add 10 to the value
the user entered and convert it to a string, then we place that string into a large
message and print it to the console.
</p>
        <p>
Well, here goes nothing!  Feel free to post your initial comments.  I hope
you’ll follow along!  I’ll be posting full source code with every post
</p>
        <img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=1b470301-0cfd-4e5c-ac71-6fefa51e8d3f" />
      </body>
      <title>Writing a Compiler from Scratch &amp;ndash; Part 1</title>
      <guid isPermaLink="false">http://blog.andrewnurse.net/PermaLink,guid,1b470301-0cfd-4e5c-ac71-6fefa51e8d3f.aspx</guid>
      <link>http://blog.andrewnurse.net/2009/01/19/WritingACompilerFromScratchNdashPart1.aspx</link>
      <pubDate>Mon, 19 Jan 2009 17:02:52 GMT</pubDate>
      <description>&lt;p&gt;
EDIT: Whoops, wrote the wrong course number (CMPT 376), corrected below.&amp;#160; I’m
taking CMPT 376, Writing for Computer Scientists, this semester, must have gotten
myself mixed up :)
&lt;/p&gt;
&lt;p&gt;
Ok, I’m going to try something here, and I don’t know if I’ll manage to finish it.&amp;#160;
My all-time favorite class in my school career (so far) has been CMPT 379 – Compiler
Design.&amp;#160; The class basically consisted of me writing a compiler in Visual C++,
with weekly lectures and code reviews from the professor (it was a small class).&amp;#160;
Since then, I’ve been trying to think of a reason to write a compiler (i.e. a language
that can be ported to .Net, or a useful domain-specific language, etc.).&amp;#160; Then,
this morning, as I walked back to the Computing Science common room from the Gym (after
my morning workout), I found a reason.&amp;#160; I think compilers are some of the most
interesting pieces of software, and I thought that if I wrote a series of blog posts
in which I developed a compiler, maybe I could share that passion with others.&amp;#160; 
&lt;/p&gt;
&lt;p&gt;
So, here’s the plan:&amp;#160; I’ve designed an, extremely simple, language that I’m going
to walk through writing a compiler for.&amp;#160; The initial version will target .Net
IL code (since it’s a stack machine system, which is much easier to generate code
for).&amp;#160; Once I’ve implemented the simple language, I think it’d be awesome if
my readers could jump in and propose some ideas for new language features, and I’ll
try to blog about implementing them.
&lt;/p&gt;
&lt;p&gt;
To be honest, this project may well fall flat on its face.&amp;#160; Writing a compiler
is not a simple task, no matter how simple the language.&amp;#160; However, I do think
it will be fun (while it lasts), so let’s give it a try!
&lt;/p&gt;
&lt;p&gt;
So, step one is: Define your language!&amp;#160; The most essential part of compiler design
is having a very clear idea of the purpose of your language.&amp;#160; The language I’m
going to design is called “Duh”, because it’s probably about as simple as it gets
(maybe a little more complex than &lt;a href="http://en.wikipedia.org/wiki/Brainf**k" target="_blank"&gt;Brainf**k&lt;/a&gt; (NOTE:
link target does include a few four-letter words :D)).&amp;#160; Here’s a sample Duh program
&lt;/p&gt;
&lt;pre class="csharp" name="code"&gt;print &amp;quot;Enter your age: &amp;quot;;
ageInput : string;
age : int;
ageInput = readline;
age = ageInput -&amp;gt; int;
print &amp;quot;In 10 years you will be &amp;quot; + (age + 10) -&amp;gt; string + &amp;quot;. Wow, that's old&amp;quot;&lt;/pre&gt;
&lt;p&gt;
Pretty simple, eh?&amp;#160; Line 1 is a simple print statement, which accepts a string
and write it to the console.&amp;#160; Lines 2 and 3 declare variables of string and 32-bit
integer types (decided to use a more Pascal-esque variable declaration format, just
for fun).&amp;#160; Lines 4 and 5 assign values to those variables.&amp;#160; Line 4 uses
another built-in function, readline, which reads a line of text from the console.&amp;#160;
Note that we don’t even support initializing variables in the declaration!&amp;#160; That’s
ok though, this is just a toy language, we can add initializations later.&amp;#160; Line
5 introduces the “-&amp;gt;” conversion operator, which takes the value on the left and
converts it to the type on the right.&amp;#160; It’s not quite the same as a cast, because
it will try to convert the value, if it can.&amp;#160; Finally, we add 10 to the value
the user entered and convert it to a string, then we place that string into a large
message and print it to the console.
&lt;/p&gt;
&lt;p&gt;
Well, here goes nothing!&amp;#160; Feel free to post your initial comments.&amp;#160; I hope
you’ll follow along!&amp;#160; I’ll be posting full source code with every post
&lt;/p&gt;
&lt;img width="0" height="0" src="http://blog.andrewnurse.net/aggbug.ashx?id=1b470301-0cfd-4e5c-ac71-6fefa51e8d3f" /&gt;</description>
      <comments>http://blog.andrewnurse.net/CommentView,guid,1b470301-0cfd-4e5c-ac71-6fefa51e8d3f.aspx</comments>
      <category>CMPT 376</category>
      <category>Compiler Project</category>
    </item>
  </channel>
</rss>