Wednesday, February 18, 2009 3:07:00 PM (Pacific Standard Time, UTC-08:00)
(
CMPT 376 | Compiler Project
)
Here’s part 4 of my infinite part series on developing a compiler from scratch. See all the posts here.
I’ve decided to change things up a bit on the compiler project. Instead of writing the tokenizer and parser from scratch, I’m going to check out a new language that’s part of Microsoft’s “Oslo” Project. Oslo includes a language called M, which is primarily designed for developing data models. M is actually three languages, MGraph, MSchema and MGrammar. MGraph is a language for defining abstract graphs, for example:
Universities {
Canada {
BritishColumbia {
SFU {
Name { "Simon Fraser University" },
Location { "Burnaby, BC" }
},
UBC {
Name { "University of British Columbia" },
Location { "Vancouver, BC" }
}
}
}
}
MSchema is a language for defining the schema for these graphs, for example
module Universities {
type University {
Name : Text#100,
Location : Text#100
}
}
Finally, MGrammar is a language for defining custom Domain-Specific Languages (DSLs) which map to MGraph graphs. This is the language that is most useful to this project. After all, a general-purpose programming language is just a DSL where the "Domain" is very broad :). So, I've spent a few hours this week whipping up an MGrammar grammar for Duh, and the results are promising. I’ll be posting some details on that in the next post, but first we need to explore a key concept in compilers, Abstract Syntax Trees.
The end goal of the tokenization and parsing phases of a compiler is to produce an Abstract Syntax Tree (or AST) representing the source code. This is probably best illustrated by an example, so consider this expression:
5+6
This is an addition operation, using 5 and 6 as the arguments, which we represent in an AST like this (using the MGraph syntax above):
Add { 5, 6 }
Similarly, more complex expressions can be translated to trees:
5+6*(3+4)
Becomes:
Add { 5, Multiply { 6, Add { 3, 4 } } }
Which represents the following instructions:
- Add 3 and 4
- Multiply that result by 6
- Add 5 to that result
Essentially, we are building up the expression from the bottom of the tree up. This makes it much easier to convert the expression into code.
I’ll post the MGrammar file for the language when I finish it up in a day or so, then its on to writing the code to do the parsing.