Name	Name	Last commit message	Last commit date
parent directory ..
docs	docs
src	src
README.md	README.md
pom.xml	pom.xml

Chapter 1: Introduction

In this chapter we aim to compile simple scripts such as:

return 1;

We implement the return statement. The return statement accepts an expression as an argument. The only expression type is an integer literal, as shown in the example above.

Here is the complete language grammar for this chapter.

To implement this simple language, we introduce a few key components and data structures.

Implementation Language

Our implementation language is Java. We chose Java as it is widely available and understood.

Assumptions

We assume that the reader is familiar with traditional linear intermediate representations, and is familiar with terms such as Basic Block, Control Flow Graph, etc. No attempt is made to explain these topics. If necessary the reader can consult a standard compiler text book.

Architecture

We construct the intermediate Sea of Nodes (SoN) representation directly as we parse the language. There is no Abstract Syntax Tree representation. The reason for this is to demonstrate a key benefit of the SoN IR: a number of pessimistic peephole optimizations can be performed while parsing a language. This aspect is more fully explored from Chapter 2 onwards.

Data Structures

Our data structures are based upon the descriptions provided in following papers:

Following the lead from above, we represent our intermediate representation using an object oriented data model. Details of the representation follow.

Intermediate Representation as a Graph of Nodes

The intermediate representation is a graph of Node objects. The Node class is the base type for objects in the IR graph. The Node class provides common capabilities that are inherited by all subtypes. Each subtype implements semantics relevant to that subtype.

Each Node represents an "instruction" as it may appear in traditional IRs.

Nodes are in a Graph

The key idea of the Sea of Nodes IR is that each Node is linked to other Nodes by def-use dependencies. As this is such an important and fundamental aspect of the IR, it is important to understand how we implement this, and depict in graph visuals.

The base Node class maintains a list of Nodes that are inputs to it. An input is an edge from a "def" to a "use". What this means is that if B is definition, and A uses B, then there is a def-use edge from B to A.

Visually we show an arrow from the "use" to the "def". Here is an example:

From an implementation point of view, our Node type also maintains a reverse link. This means that in the above scenario:

Since A is a "use" of B, then B will appear in A's list of inputs.
Conversely, B maintains a list of outputs, and A will appear in this list.
A key invariant is that these edges must be in sync at all times.

Here is how this looks in the Node class:

public abstract class Node {

    /**
     * Inputs to the node. These are use-def references to Nodes.
     * <p>
     * Generally fixed length, ordered, nulls allowed, no unused trailing space.
     * Ordering is required because e.g. "a/b" is different from "b/a".
     * The first input (offset 0) is often a Control node.
     */
    public final ArrayList<Node> _inputs;

    /**
     * Outputs reference Nodes that are not null and have this Node as an
     * input.  These nodes are users of this node, thus these are def-use
     * references to Nodes.
     * <p>
     * Outputs directly match inputs, making a directed graph that can be
     * walked in either direction.  These outputs are typically used for
     * efficient optimizations but otherwise have no semantics meaning.
     */
    public final ArrayList<Node> _outputs;
}

Types of Nodes

There are two categories of Nodes in the intermediate representation.

Control Nodes - these represent the control flow subgraph (CFG) of the compiled program
Data Nodes - these capture the data semantics

The following control and data nodes appear in this chapter.

Node Name	Type	Description	Inputs	Value
Start	Control	Start of function	None	None for now as we do not have function arguments yet
Return	Control	Represents the termination of a function	Predecessor control node, Data node value	Return value of the function
Constant	Data	Represents constants such as integer literals	None, however Start node is set as input to enable graph walking	Value of the constant

Within a traditional basic block, instructions are executed in sequence. In the Sea of Nodes model, the correct sequence of instructions is determined by a scheduling algorithm that depends only on dependencies between nodes (including control dependencies) that are explicit as edges in the graph. This enables a number of optimizations at very little cost (nearly always small constant time) because all dependencies are always available.

Unique Node ID

Each node is assigned a unique dense integer Node ID when created. This ID is useful for debugging, efficiently computing equality and e.g. as an index into a bit vector, which in turn is used to efficiently visit a (possibly cyclic) graph. We discuss Node equality in Chapter 9.

Start Node

The Start node represents the start of the function. For now, we do not have any inputs to Start because our function does not yet accept parameters. When we add parameters the value of Start will be a tuple, and will require Projections to extract the values. We discuss this in detail in Chapter 4.

Constant Node

A Constant node represents a constant value. At present, the only constants that we allow are integer literals; therefore Constants contain an integer value. As we add other types of constants, we will refactor how we represent Constants.

Constants have no semantic inputs. However, we set Start as an input to Constants to enable a forward graph walk. This edge carries no semantic meaning, and it is present solely to allow visitation.

The Constant's value is the value stored in it.

Return Node

The Return node has two inputs. The first input is a control node and the second is the data node that supplies the return value.

In this presentation, Return functions as a Stop node, since multiple return statements are not possible. The Stop node will be introduced in Chapter 5 when we implement if statements.

The Return's output is the value from the data node.

Representation

Here is visualization of the program:

return 1;

Control nodes appear as square boxes with yellow background
Control edges are in bold red
The edges from Constants to Start are shown in dotted lines as these are not true control edges
We label each edge with its position in the _inputs array, thus 0 means the edge is _inputs[0].

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chapter01

chapter01

README.md

Chapter 1: Introduction

Implementation Language

Assumptions

Architecture

Data Structures

Intermediate Representation as a Graph of Nodes

Nodes are in a Graph

Types of Nodes

Unique Node ID

Start Node

Constant Node

Return Node

Representation

Files

chapter01

Directory actions

More options

Directory actions

More options

Latest commit

History

chapter01

Folders and files

parent directory

README.md

Chapter 1: Introduction

Implementation Language

Assumptions

Architecture

Data Structures

Intermediate Representation as a Graph of Nodes

Nodes are in a Graph

Types of Nodes

Unique Node ID

Start Node

Constant Node

Return Node

Representation