These notes and example programs augment the introductory Compiler Construction course given by Dr. Mycroft. The course itself covers the fundamental principles, and some of the practice, of implementing compilers in a modern language.
This webpage builds on this with working examples showing how modern compiler tools (ANTLR) and modern implementation language (C#) can be used to construct a compiler for a simple high-level C-like language, and targetting the recently-introduced DotNet intermediate language from Microsoft.
A report going into further details of the programs presented here is available.
The first example presented here shows the output of three compilers and their target languages. All example source and target code is available for download.
#include <stdio.h> int main( void ) { puts( "Hello World!" ); return 0; } |
When compiled with GCC for a PowerPC processor running Mac OSX, the output is:
.data .cstring .align 2 LC0: .ascii "Hello World!\0" .text .align 2 .globl _main _main: mflr r0 stmw r30,-8(r1) stw r0,8(r1) stwu r1,-80(r1) mr r30,r1 bcl 20,31,L1$pb L1$pb: mflr r31 addis r3,r31,ha16(LC0-L1$pb) la r3,lo16(LC0-L1$pb)(r3) bl L_puts$stub li r0,0 mr r3,r0 lwz r1,0(r1) lwz r0,8(r1) mtlr r0 lmw r30,-8(r1) blr .data .picsymbol_stub L_puts$stub: .indirect_symbol _puts mflr r0 bcl 20,31,L0$_puts L0$_puts: mflr r11 addis r11,r11,ha16(L_puts$lazy_ptr-L0$_puts) mtlr r0 lwz r12,lo16(L_puts$lazy_ptr-L0$_puts)(r11) mtctr r12 addi r11,r11,lo16(L_puts$lazy_ptr-L0$_puts) bctr .data .lazy_symbol_pointer L_puts$lazy_ptr: .indirect_symbol _puts .long dyld_stub_binding_helper |
Again, here is the basic hello world program in Java:
public class Hello { public static void main( String[] args ) { System.out.print( "Hello world.\n" ); } } |
The Java compiler, javac, compiles this into JVM bytecodes, which are then interpreted by the Java runtime. For this example, note the difference in the number of instructions in the program compared to the native-compiled C example above.
Method Hello() 0 aload_0 1 invokespecial #1 <Method java.lang.Object()> 4 return Method void main(java.lang.String[]) 0 getstatic #2 <Field java.io.PrintStream out> 3 ldc #3 <String "Hello world. "> 5 invokevirtual #4 <Method void print(java.lang.String)> 8 return |
Far fewer instructions. Why is this?
Finally, here is the C# version and the corresponding DotNet Intermediate Code:
using System; public class HelloApp { public static void Main() { Console.WriteLine( "Hello World!" ); } } |
The Java compiler, javac, compiles this into JVM bytecodes, which are then interpreted by the Java runtime. For this example, note the difference in the number of instructions in the program compared to the native-compiled C example above.
.class public auto ansi beforefieldinit HelloApp extends [mscorlib]System.Object { .method public hidebysig static void Main() cil managed { .entrypoint // Code size 11 (0xb) .maxstack 1 IL_0000: ldstr "Hello World!" IL_0005: call void [mscorlib]System.Console::WriteLine(string) IL_000a: ret } // end of method HelloApp::Main .method public hidebysig specialname rtspecialname instance void .ctor() cil managed { // Code size 7 (0x7) .maxstack 1 IL_0000: ldarg.0 IL_0001: call instance void [mscorlib]System.Object::.ctor() IL_0006: ret } // end of method HelloApp::.ctor } // end of class HelloApp |
Note that this is a subset of the output of the disassembler showing just the actual executable code.
From this point on we will use C# and DotNet as the implementation language and the target code respectively. To compile and run these examples you will need some form of DotNet runtime. There are currently two main versions:
The basic requirement is a C# compiler and the DotNet runtime. There are two versions available, the commercial version, and the research-oriented ROTOR project. There is a free, cut-down, version of the commercial version, and ROTOR is free.
The Mono Project is creating an open-source implementation of the C# and DotNet runtime as defined by the ECMA Standards that describe them. It has been ported to many operating systems, notably Linux, Mac OSX and Windows.
Two versions of the basic exeercise are presented, one using the recursive-descent techniques presented in the lecture course, the other using the compiler-generator tool ANTLR.
To build the executable, you need a DotNet system installed (either Rotor or the full version). Then compile the application with:
csc 4calc.cs
and run with:
clix 4calc.exe
Enter expressions as you would expect, terminating each expressions with ';' (they can span multiple lines) and hit Return to evaluate them and display the result. To quit, enter Ctrl-D followed by Return.
This is a version of the 4-function calculator using ANTLR to generate a lexer and parser automatically from grammer rules.
You will need to install ANTLR before you can compile and run this example. How you achieve this will depend on which platform you are running on. For Rotor on MAC OSX, you cannot compile up the parts of the ANTLR runtime that use the Windows Forms libraries, which are not included in Rotor.
The user interface is exactly the same as for the hand-written parser version. However the internal operation is completely different. Whereas the earlier version computed intermediate results as soon as a rule was matched (e.g., "1+2" -> 3) the ANTLR-generated version first constructs a parse-tree of the input, and only then does the tree walker traverse the tree bottom-up computing the result.
Elisa is a simple C-like language with sufficient constructs to be able to write working programs that demonstrate the basics of a compiler. It is not meant to be a fully-developed programming language to write the next greatest operating system, but it could form the start of many student projects to build compilers or code analysers.
The Elisa language has a few simple operators, integers and arrays of integers, and functions.
The compiler itself consists of a handful of files, brought together by a Makefile:
Elisa.cs | The main compiler driver: processes command line flags, then calls the parser and code generator. |
CodeGen.cs | This is the code generator interface, which defines the functionality that a code generator must provide. |
DotNetCodeGen.cs | A specific instance of the Code Generator class which generates DotNet output. |
Symboltable.cs | Type, symbol, and symbol table classes. |
ElisaParser.g | The ANTLR lexer and grammar file for Elisa. As for the four-function calculator above, this single file encompasses the lexer, the parser, and the parse tree walker. |
The syntax of Elisa looks very much like C (or any other Algol-derived programming language). It has:
// Integer bubblesort // Takes an array of integers and the number of elements in the array // and sorts them in-place int bubblesort(int a[], int n) { int t; int i; int done; done = 0; while (!done) { i = 1; done = 1; while (i < n) { if (a[i - 1] > a[i]) { done = 0; t = a[i - 1]; a[i - 1] = a[i]; a[i] = t; } i = i + 1; } } return 0; } |
Note, there is already a simplified C compiler example project in the ROTOR package, called MyC, as well as a simplified LISP interpreter/compiler. You might wish to compare the compiler for Elisa with the MyC compiler.
LCC is an open-source ANSI-compliant C compiler described in the book A Retargetable C Compiler: Design and Implementation. The compiler is about 20,000 lines of uncommented C (about 11,000 for the compiler front-end, and then a few thousand lines for each backend target code generator), so you need the book (which actually does a good job of explaining the compiler, even though the book is written for version 3.6 of the compiler).
Version 4.2 now has a DotNet backend, which is a separate download from the MS site due to licensing issues.
This is/was a subset-C compiler developed to help teach advanced compiler technology at St. Petersburg State University IT Research Institute.
ANTLR is a tool for automagically generating recognizers, compilers and translators (the acronym stands for ANother Tool for Language Recognition). It generates tools in Java, C# or C++, and there is considerable support for this tool (i.e., lots of goodies to download). There are ready-made parsers for many languages: Gnu C, Ada, HTML, Java, Verilog, Pascal, SQL, C++, VRML, and so on.
Grammatica is another parser generator, in the same vein as ANTLR. It generates Java or C# parsers, and seems quite a nice little package. It generates LL(k) (top-down) parsers with better error message handling.