Before learning compiler, you should have basic ideas on at least one main stream programming language (C++, Java, or Python). All other knowledge required by building a compiler will be covered along with each phase of compilation, including regular expressions, context-free grammar, assembly codes, as well as other programming tools\footnote{This tutorial will stick on C++ implementation, so C-based parser generator, and compilation toolchains will also be covered}.
This tutorial is aimed at learning compilers both theoretically and practially. You will have:
- Basic concepts of compilation including parsing, intermediate representation, optimization, and code generation.
- Knowing the usage of the most compiler infrastructure LLVM.
- Significantly improved coding power after writing a compiler!
Engineering Compiler Construction Language (ECC Lang) is writtten to define a language better serve the purpose of compiler education. The syntaxes and design concepts are inspired by hybridizing C and Java. Excessive and redundant syntax sugars are removed to keep the language simple. Meanwhile, modern language features are introduced to better understand the design concept.
func main() -> int {
// This function can be invoked before declaration.
printHello();
}
func printHello() -> void {
println("Hello world!");
}
Listing above shows an example of this language. This keeps the language implementation simple, so that we can focus on learning the compiler principles. We assume:
- All the programs written in this language should NOT exceed 1MB. Otherwise, the compiler is not guaranteed/required to output a correct result.
- Only single program compilation is supported for now.
An ECC program should be composed by the following aspects:
-
Function definition.
main
function: The program starts with. This function have no arguments, and return an integeter.- For the better purpose of education, we do NOT support interface declration. (Interface declration is actually an legacy from the early stage of computer system design. Because of the limited disk/memory size, it is highly desirable to compile the whole program by scanning it only once.)
-
Class definition.
-
Global veriable declaration.
We only support //
to comment a line. No /**/
supported.
Declaring a varialbe is just as simple as let {type} {id} [= {initializer}];
,
and the initializer is optional. The id
of a variable should not start with a
number, and it can be composed by a combination of numbers, letters, and underscore.
To keep the syntax simple, we do not support declaring multiple variables separated by
commas (,
). The let
keyword is also to keep the "parser look ahead" (will be explained later)
simple.
These following types are builtin types:
-
void
: Cannot be an variable, can only be the return type of a function. -
i32
: A 32-bit signed integer. Constant integers can range from$[-2^{31},2^{31}-1]$ . -
i8
: An 8-bit char. To keep it simple, all the char surrounded by a pair of \texttt{'}s should be printable. Only 4 backslash escape characters are supported,\'
,\"
,\\
, and\n
. -
bool
: A boolean value, whose constants can be eithertrue
,orfalse
. Unlike C, there is no implicit conversion to bool for all the expressions (i32
,i8
, orclasses
). -
string
: Literals surrounded by \texttt{"} are constants of strings. Just likei8
each char of a string should either be printable or supported escape characters. Strings are immutable. The string data type with three builtin members:-
i32 length()
return the length of this string. -
i32 parseInt()
convert the string into an integer. -
i8 at(int pos)
starting with 0, return the character at the given position.
-
Note int
, i8
, and bool
are plain old data (POD), so they have instances.
string
is non-modifiable, so its behavior of being a POD or a class,
does not matter that much.
We also allow users to define their classes and classes can have their member functions/methods.
Listing below shows an example of defining a class.
The class name has the same requirement as the variable id.
Unlike conventional C, a class can only be a pointer to an instance.
This design concept is widely adopted in modern languages, like Java and Python.
Pointers can be null
when empty. As mentioned before, there is no implicit conversion
from 0
to null
.
class A {
func print() -> void {
// This is OK, even though `value` defined later than `print()`
println(toString(this.value));
}
i32 value;
};
func main() -> void {
A a = new A;
a.value = 5;
a.print(); // prints 5
A b = a;
b.a = 1;
a.print(); // prints 1
}
In addition, within the scope of class,
the variable difinition is slightly different
from it is in the global scope. In the example shown above,
it is OK to use value
before it is defined within
the scope of a class. However, it makes no sense to use a global
variable before its definition.
func foo() -> void {
// Too early to use `a`!
println(a);
}
string a;
TODO(@were): support destructor, inherence, virtual function, and interfacing.
Array allocation is very similar in what we have in Java.
Arrays have one builtin method length()
for the size of the array.
i32[] a = new int[128];
println(toString(a.length())); // outputs "128"
We support two types of nested array allocation.
i32[][] a = new int[128][128];
i32[][] b = new int[128][];
for i in 0..128 {
b[i] = new int[i + 1];
}
We also support compound arrays.
class A {
// ...
};
// Allocate 128 empty pointers of A.
A[] a = new A[128];
- Arithmetic Opertions:
+, -, *, /
, both side should have the same type, and no implicit cast supported. - Bitwise Opertions:
|,&,^,~
, both side should be i32. - Logic Operations:
&&, ||, !
, both operands should be boolean, and short circuit evaluation --- if the value of the expression is already determined, we no longer evaluate rest of it --- is supported. - Access Attributes:
.
- Pranthesis:
()
- Brackets:
[]
- Negative:
-
- Comparison:
==, >, <
, returns a boolean expression. - Cast:
{expr} as {type}
e.g.a as int
. TODO(@were): Assign a priority to this. - Assignment:
=
- For simplicity, unlike C, assignment has no return value.
Therefore, no
a=b=0
allowed. - For simplicity, no in-place update (e.g.
+=
) supported.
- For simplicity, unlike C, assignment has no return value.
Therefore, no
The priority of these operations are the same as C.
Every statement can be a declaration, an assignment, a conditional statement, for-loop, while-loop, or a compound statement.
if (bool-expr) {
// do a
} else {
// do b
}
for i in 0..n {
// do something
}
while (bool-expr) {
// do something
}
Unlike C, though compound statement does open a new scope, we cannot define variables with same id to hide instances in outer scopes, as it is shown in Listing below.
i32 a;
{
i32 a; // This is not allowed!
i32 b; // b is dedicated to this scope.
}
// We cannot use b here.
Several builtin functions are supported for I/O.
func print(string str) -> void
: Write the given string to stdout.func println(string str) -> void
: Write the given string to stdout and append a newline.func nextInt() -> i32
: Get an integer from stdin.func nextLine() -> string
: Get a line of string from stdin.func toString(i32 i) -> string
: Convert the given int to string.
Function declaration is the same C, but we allow to define new sub-functions within a function that captures all the values in that scope. If the return type is not void, the compiler should not pass the semantic check.
func {func-name}({arg-list}) -> {return-type} {
// do something.
// you can still define new functions here.
}
Similarly, the func
keyword is also to keep the syntax simple.
To sum up, these key words are reserved bool int string null void true false if for while break continue return new class this func let
.
According to the language manual, write a compiler for this language. Here are some hints:
- Be incremental! Start with a something simple and incrementally extend and test it!
- Be bold! Do not hesitate to make design decisions! Get hands dirty first!
- Be patient! If certain bad deisng decision makes it hard to extend, do not hesitate to refactor!