More mopping up done.
Warren Toomey committed Dec 4, 2019
1 parent db8a42f commit 6136a47
Showing 303 changed files with 7,924 additions and 0 deletions.
33 changes: 33 additions & 0 deletions 53_Mop_up_pt2/Makefile
@@ -0,0 +1,33 @@
# Define the location of the include directory
# and the location to install the compiler binary
INCDIR=/tmp/include
BINDIR=/tmp

HSRCS= data.h decl.h defs.h
SRCS= cg.c decl.c expr.c gen.c main.c misc.c \
	opt.c scan.c stmt.c sym.c tree.c types.c

ARMSRCS= cg.c decl.c expr.c gen.c main.c misc.c \
	opt.c scan.c stmt.c sym.c tree.c types.c

cwj: $(SRCS) $(HSRCS)
	cc -o cwj -g -Wall -DINCDIR=\"$(INCDIR)\" $(SRCS)

cwjarm: $(ARMSRCS) $(HSRCS)
	cc -o cwjarm -g -Wall $(ARMSRCS)
	cp cwjarm cwj

install: cwj
	mkdir -p $(INCDIR)
	rsync -a include/. $(INCDIR)
	cp cwj $(BINDIR)
	chmod +x $(BINDIR)/cwj

clean:
	rm -f cwj cwjarm *.o *.s out a.out

test: install tests/runtests
	(cd tests; chmod +x runtests; ./runtests)

armtest: cwjarm tests/runtests
	(cd tests; chmod +x runtests; ./runtests)
270 changes: 270 additions & 0 deletions 53_Mop_up_pt2/Readme.md
@@ -0,0 +1,270 @@
# Part 53: Mopping Up, part 2

In this part of our compiler writing journey, I fix a few annoying gaps:
C features that the compiler's own source code uses but that the compiler
doesn't yet handle.

## Consecutive String Literals

C allows a string literal to be written as several consecutive strings,
possibly split across multiple lines, e.g.

```
char *c= "hello " "there, "
"how " "are " "you?";
```

We could handle this in the lexical scanner, and I spent a lot of time
trying to do so. The problem is that the scanner code is now complicated
by having to deal with the C pre-processor, and I couldn't find a clean
way of allowing consecutive string literals there.

My solution is to do it in the parser, with a bit of help from the code
generator. In `primary()` in `expr.c`, the code that deals with string
literals now looks like this:

```
  case T_STRLIT:
    // For a STRLIT token, generate the assembly for it.
    id = genglobstr(Text, 0);   // 0 means generate a label
    // For successive STRLIT tokens, append their contents
    // to this one
    while (1) {
      scan(&Peektoken);
      if (Peektoken.token != T_STRLIT) break;
      genglobstr(Text, 1);      // 1 means don't generate a label
      scan(&Token);             // Skip it
    }
    // Now make a leaf AST node for it. id is the string's label.
    genglobstrend();
    n = mkastleaf(A_STRLIT, pointer_to(P_CHAR), NULL, NULL, id);
    break;
```

`genglobstr()` now takes a second argument which tells it if this
is the first part of the string or a successive part of the string.
Also, `genglobstrend()` now has the job of NUL terminating the string literal.
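
To make the division of labour concrete, here is a minimal sketch of how a label-once, bytes-many, NUL-at-the-end scheme might look. It is not the compiler's actual code generator: the names `sk_globstr()`, `sk_globstrend()`, `sk_emit()` and the `sk_out` buffer are hypothetical, and the output goes to a memory buffer rather than an assembly file.

```c
#include <stdio.h>

// Hypothetical stand-ins for the assembly output file and label counter
char sk_out[512];
size_t sk_len = 0;
int sk_nextlabel = 1;

// Append one formatted line to the output buffer
void sk_emit(const char *fmt, int val) {
  if (sk_len >= sizeof(sk_out)) return;
  int n = snprintf(sk_out + sk_len, sizeof(sk_out) - sk_len, fmt, val);
  if (n > 0) sk_len += (size_t) n;
}

// Emit the bytes of one piece of a string literal. Only the first
// piece (append == 0) gets a label; later pieces just add more bytes.
// Returns the label number, or 0 when appending.
int sk_globstr(const char *text, int append) {
  int label = 0;
  if (!append) {
    label = sk_nextlabel++;
    sk_emit("L%d:\n", label);
  }
  for (const char *p = text; *p != '\0'; p++)
    sk_emit("\t.byte\t%d\n", *p);
  return label;
}

// NUL terminate the whole literal once every piece has been emitted
void sk_globstrend(void) {
  sk_emit("\t.byte\t%d\n", 0);
}
```

Calling `sk_globstr("hi", 0)`, then `sk_globstr("!", 1)`, then `sk_globstrend()` produces one label followed by the bytes of `hi!` and a single terminating zero, which is why the parser only needs the label from the first call.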

## Empty Statements

C allows both empty statements and empty compound statements, e.g.

```
while ((c=getc()) != 'x') ; // ';' is an empty statement
int fred() { } // Function with empty body
```

and I use both of these in the compiler, so we need to support both of them.
In `stmt.c`, the code now does this:

```
static struct ASTnode *single_statement(void) {
  struct ASTnode *stmt;
  struct symtable *ctype;

  switch (Token.token) {
    case T_SEMI:
      // An empty statement
      semi();
      break;
    ...
  }
  ...
}

struct ASTnode *compound_statement(int inswitch) {
  struct ASTnode *left = NULL;
  struct ASTnode *tree;

  while (1) {
    // Leave if we've hit the end token. We do this first to allow
    // an empty compound statement
    if (Token.token == T_RBRACE)
      return (left);
    ...
  }
  ...
}
```

and that fixes both shortcomings.
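
As an illustration, here is a small piece of input that exercises both features. It is a hypothetical test case of my own (the names `skip_past()` and `do_nothing()` are not from the compiler's source), written in plain C and only checked against an ordinary C compiler.

```c
// Return the index just past the first occurrence of stop in s,
// or the length of s if stop does not occur.
int skip_past(char *s, int stop) {
  int i = 0;
  // All the work happens in the loop condition, so the loop body
  // is just ';', an empty statement
  while (s[i] != 0 && s[i++] != stop) ;
  return (i);
}

// A function whose body is an empty compound statement
void do_nothing(void) { }
```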

## Redeclared Symbols

C allows a global variable to be declared later as extern, and an extern
variable to be declared later as a global variable.
However, the types of both
declarations have to match. We also want to ensure that only one
version of the symbol ends up in the symbol table: we don't want both a
C_GLOBAL and a C_EXTERN entry!

In `decl.c` I've added a new function called `is_new_symbol()`. We call this
after we have parsed the name of a variable and after we have tried
to find it in the symbol table. Thus, `sym` may be NULL (no existing
variable) or not NULL (is an existing variable).

If the symbol exists, it's actually quite complicated to ensure that
it's a safe redeclaration.

```
// Given a pointer to a symbol that may already exist
// return true if this symbol doesn't exist. We use
// this function to convert externs into globals
int is_new_symbol(struct symtable *sym, int class,
                  int type, struct symtable *ctype) {

  // There is no existing symbol, thus is new
  if (sym==NULL) return(1);

  // global versus extern: if they match then it's not new
  // and we can convert the class to global
  if ((sym->class== C_GLOBAL && class== C_EXTERN)
      || (sym->class== C_EXTERN && class== C_GLOBAL)) {

    // If the types don't match, there's a problem
    if (type != sym->type)
      fatals("Type mismatch between global/extern", sym->name);

    // Struct/unions, also compare the ctype
    if (type >= P_STRUCT && ctype != sym->ctype)
      fatals("Type mismatch between global/extern", sym->name);

    // If we get to here, the types match, so mark the symbol
    // as global
    sym->class= C_GLOBAL;
    // Return that symbol is not new
    return(0);
  }

  // It must be a duplicate symbol if we get here
  fatals("Duplicate global variable declaration", sym->name);
  return(-1);   // Keep -Wall happy
}
```

The code is straightforward but not elegant. Also note that any redeclared
extern symbol is turned into a global symbol. This means we don't have to
remove the symbol from the symbol table and add in a new, global, symbol.
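
The possible outcomes can be summarised with a small, simplified model. This is only a sketch under my own assumptions: the names `sk_redeclare()`, `struct sk_sym` and the `SK_` constants are hypothetical, it returns a result code where the real function calls `fatals()`, and it ignores the extra struct/union `ctype` comparison.

```c
#include <stddef.h>

// Hypothetical storage classes and outcomes for this sketch
enum sk_class { SK_GLOBAL, SK_EXTERN, SK_LOCAL };
enum sk_result { SK_NEW, SK_MERGED, SK_TYPE_MISMATCH, SK_DUPLICATE };

// A cut-down symbol table entry
struct sk_sym {
  enum sk_class sclass;
  int type;
};

// Decide what a new declaration (sclass, type) means for an existing
// symbol table entry. sym is NULL when the name was not found.
enum sk_result sk_redeclare(struct sk_sym *sym,
                            enum sk_class sclass, int type) {
  // No existing entry: the caller should add a new symbol
  if (sym == NULL) return SK_NEW;

  // A global/extern pair in either order is a legal redeclaration,
  // provided the types agree. The single entry ends up global.
  if ((sym->sclass == SK_GLOBAL && sclass == SK_EXTERN) ||
      (sym->sclass == SK_EXTERN && sclass == SK_GLOBAL)) {
    if (type != sym->type) return SK_TYPE_MISMATCH;
    sym->sclass = SK_GLOBAL;
    return SK_MERGED;
  }

  // Any other combination is a duplicate declaration
  return SK_DUPLICATE;
}
```

Note that in the merged case the existing entry is updated in place, so the symbol table never holds two entries for the same name.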

## Operand Types to Logical Operations

The next bug I hit was something like this:

```
int *x;
int y;
if (x && y > 12) ...
```

The compiler evaluates the `&&` operation in `binexpr()`. To do this,
it ensures that the types of each side of the binary operator are
compatible. Well, if the operator above was a `+` then, definitely,
the types are incompatible. But with a logical operation, each operand
is only tested against zero, so we can safely *AND* a pointer and an
integer together.

I've fixed this by adding some more code to the top of `modify_type()`
in `types.c`. If we are doing an `&&` or an `||` operation, then
we need either integer or pointer types on each side of the operation.

```
struct ASTnode *modify_type(struct ASTnode *tree, int rtype,
                            struct symtable *rctype, int op) {
  int ltype;
  int lsize, rsize;

  ltype = tree->type;

  // For A_LOGOR and A_LOGAND, both types have to be int or pointer types
  if (op==A_LOGOR || op==A_LOGAND) {
    // Check the left-hand type
    if (!inttype(ltype) && !ptrtype(ltype))
      return(NULL);
    // Check the right-hand type
    if (!inttype(rtype) && !ptrtype(rtype))
      return(NULL);
    return (tree);
  }
  ...
}
```
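
Here is a self-contained sketch of the rule that each operand of `&&` or `||` must be an integer or a pointer. The type encoding is an assumption made for this example (low four bits hold the pointer depth, the upper bits hold the base type), and all the `sk_` names and `SK_` constants are hypothetical rather than taken from `types.c`.

```c
// Assumed type encoding for this sketch: the low four bits hold the
// pointer depth and the upper bits hold the base type
enum { SK_VOID = 16, SK_CHAR = 32, SK_INT = 48, SK_LONG = 64, SK_STRUCT = 80 };

// True if the type is a non-pointer integer type
int sk_inttype(int type) {
  return (type & 0xf) == 0 && type >= SK_CHAR && type <= SK_LONG;
}

// True if the type has at least one level of indirection
int sk_ptrtype(int type) {
  return (type & 0xf) != 0;
}

// Add one level of indirection to a type
int sk_pointer_to(int type) {
  return type + 1;
}

// Both operands of a logical operation must be testable against zero,
// so each side has to be an integer or a pointer. The two sides do
// not have to match each other.
int sk_logical_operands_ok(int ltype, int rtype) {
  if (!sk_inttype(ltype) && !sk_ptrtype(ltype)) return 0;
  if (!sk_inttype(rtype) && !sk_ptrtype(rtype)) return 0;
  return 1;
}
```

With this encoding, an `int *` on the left and an `int` on the right is accepted, while a struct on either side is rejected. It matters that the second test looks at the right-hand type: otherwise a struct on the right would slip through.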

I've also realised that I've implemented `&&` and `||` incorrectly, so
I'll have to fix that. Not now, but soon.

## Return with No Value

One other C feature that is missing is the ability to return from a
void function, i.e. just leave without returning any value. Up to now,
the parser has expected to see parentheses and an expression after
the `return` keyword.

So, in `return_statement()` in `stmt.c`, we now have:

```
// Parse a return statement and return its AST
static struct ASTnode *return_statement(void) {
  struct ASTnode *tree= NULL;

  // Ensure we have 'return'
  match(T_RETURN, "return");

  // See if we have a return value
  if (Token.token == T_LPAREN) {
    // Code to parse the parentheses and the expression
    ...
  } else {
    if (Functionid->type != P_VOID)
      fatal("Must return a value from a non-void function");
  }

  // Add on the A_RETURN node
  tree = mkastunary(A_RETURN, P_NONE, NULL, tree, NULL, 0);

  // Get the ';'
  semi();
  return (tree);
}
```

If the `return` token isn't followed by a left parenthesis, we leave the
expression `tree` set to NULL. We also check that this is a void returning
function, and print out a fatal error if not.

Now that we have parsed the `return` statement, we may create an A_RETURN
AST node with a NULL child. So now we have to deal with this in the code
generator. The top of `cgreturn()` in `cg.c` now has:

```
// Generate code to return a value from a function
void cgreturn(int reg, struct symtable *sym) {

  // Only return a value if we have a value to return
  if (reg != NOREG) {
    ...
  }

  cgjump(sym->st_endlabel);
}
```

If there was no child AST tree, then there is no register with the
expression's value. Thus, we only output the jump to the function's
end label.
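
The effect on the generated code can be sketched as follows. This is an illustration only: `sk_return()`, `SK_NOREG` and the `sk_asm` buffer are hypothetical names, the register naming and the `movq`/`jmp` output are assumptions about an x86-64 style target, and the real `cgreturn()` writes to the assembly output file rather than a buffer.

```c
#include <stdio.h>

// Hypothetical marker for "no register holds a value"
#define SK_NOREG (-1)

// Hypothetical buffer standing in for the assembly output
char sk_asm[256];

// Emit the code for a return. The copy into the return register is
// skipped when there is no value, but the jump to the function's end
// label is always emitted.
void sk_return(int reg, int endlabel) {
  int n = 0;
  sk_asm[0] = '\0';
  if (reg != SK_NOREG)
    n = snprintf(sk_asm, sizeof(sk_asm), "\tmovq\t%%r%d, %%rax\n", reg);
  snprintf(sk_asm + n, sizeof(sk_asm) - n, "\tjmp\tL%d\n", endlabel);
}
```

A bare `return;` therefore costs a single jump, while a `return` with a value first moves that value into place and then takes the same jump.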


## Conclusion and What's Next

We've fixed five minor issues in the compiler: things that need to work
before the compiler can compile itself.

I did identify a problem with `&&` and `||`. However, before we get to
that I need to solve an important, pressing, problem: we have a limited
set of CPU registers and, for large source files, we are running out of
them.

In the next part of our compiler writing journey, I will have to work
on implementing register spills. I've been delaying this, but now most of
the fatal errors from the compiler (when compiling itself) are register
issues. So now it's time to sort this out.