In this chapter we aim to compile simple scripts such as:
return 1;
We implement the return
statement.
The return
statement accepts an expression
as an argument.
The only expression
type is an integer literal, as shown in the example above.
Here is the complete language grammar for this chapter.
To implement this simple language, we introduce a few key components and data structures.
Our implementation language is Java. We chose Java as it is widely available and understood.
We assume that the reader is familiar with traditional linear intermediate representations, and is familiar with terms such as Basic Block, Control Flow Graph, etc. No attempt is made to explain these topics. If necessary the reader can consult a standard compiler text book.
We design our parser and compiler to illustrate some key aspects of the Sea of Nodes Intermediate Representation:
- We construct the intermediate Sea of Nodes representation directly as we parse the language. There is no Abstract Syntax Tree representation.
- The compiler will perform a few basic local optimizations as we build the IR; this is a key benefit of the Sea of Nodes IR. This aspect is more fully explored from Chapter 2 onwards.
Our data structures are based upon the descriptions provided in following papers:
- From Quads to Graphs: An Intermediate Representation's Journey
- Combining Analyses, Combining Optimzations
- A Simple Graph-Based Intermediate Representation
- Global Code Motion Global Value Numbering
Following the lead from above, we represent our intermediate representation using an object oriented data model. Details of the representation follow.
The intermediate representation is a graph of Node objects. The Node
class is the base type for objects in the IR graph.
The Node
class provides common capabilities that are inherited by all subtypes.
Each subtype implements semantics relevant to that subtype.
Each Node
represents an "instruction" as it may appear in traditional IRs.
The key idea of the Sea of Nodes IR is that each Node is linked to other Nodes by def-use dependencies. As this is such an important and fundamental aspect of the IR, it is important to understand how we implement this, and depict in graph visuals.
The base Node
class maintains a list of Nodes that are inputs to it. An input is an edge from a "def" to a "use". What this means is that if B
is definition, and A
uses B
,
then there is a def-use edge from B
to A
.
Visually we show an arrow from the "use" to the "def". Here is an example:
From an implementation point of view, our Node
type also maintains a reverse link.
This means that in the above scenario:
- Since
A
is a "use" ofB
, thenB
will appear inA
's list of inputs. - Conversely,
B
maintains a list of outputs, andA
will appear in this list. - A key invariant is that these edges must be in sync at all times.
Here is how this looks in the Node
class:
public abstract class Node {
/**
* Inputs to the node. These are use-def references to Nodes.
* <p>
* Generally fixed length, ordered, nulls allowed, no unused trailing space.
* Ordering is required because e.g. "a/b" is different from "b/a".
* The first input (offset 0) is often a Control node.
*/
public final ArrayList<Node> _inputs;
/**
* Outputs reference Nodes that are not null and have this Node as an
* input. These nodes are users of this node, thus these are def-use
* references to Nodes.
* <p>
* Outputs directly match inputs, making a directed graph that can be
* walked in either direction. These outputs are typically used for
* efficient optimizations but otherwise have no semantics meaning.
*/
public final ArrayList<Node> _outputs;
}
There are two categories of Nodes in the intermediate representation.
- Control Nodes - these represent the control flow subgraph (CFG) of the compiled program
- Data Nodes - these capture the data semantics
The following control and data nodes appear in this chapter.
Node Name | Type | Description | Inputs | Value |
---|---|---|---|---|
Start | Control | Start of function | None | None for now as we do not have function arguments yet |
Return | Control | Represents the termination of a function | Predecessor control node, Data node value | Return value of the function |
Constant | Data | Represents constants such as integer literals | None, however Start node is set as input to enable graph walking | Value of the constant |
Within a traditional basic block, instructions are executed in sequence. In the Sea of Nodes model, the correct sequence of instructions is determined by a scheduling algorithm that depends only on dependencies between nodes (including control dependencies) that are explicit as edges in the graph. This enables a number of optimizations at very little cost (nearly always small constant time) because all dependencies are always available.
Each node is assigned a unique dense integer Node ID when created. This ID is useful for debugging, efficiently computing equality and e.g. as an index into a bit vector, which in turn is used to efficiently visit a (possibly cyclic) graph. We discuss Node equality in Chapter 9 Global Value Numbering.
The Start node represents the start of the function. For now, we do not have any inputs to Start because our function does not yet accept parameters. When we add parameters the value of Start will be a tuple, and will require Projections to extract the values. We discuss this in detail in Chapter 4.
A Constant node represents a constant value. At present, the only constants that we allow are integer literals; therefore Constants contain an integer value. As we add other types of constants, we will refactor how we represent Constants.
Constants have no semantic inputs. However, we set Start as an input to Constants to enable a forward graph walk. This edge carries no semantic meaning, and it is present solely to allow visitation.
The Constant's value is the value stored in it.
The Return node has two inputs. The first input is a control node and the second is the data node that supplies the return value.
In this presentation, Return functions as a Stop node, since multiple return
statements are not possible.
The Stop node will be introduced in Chapter 5 when we implement if
statements.
The Return's output is the value from the data node.
Here is visualization of the program:
return 1;
- Control nodes appear as square boxes with yellow background
- Control edges are in bold red
- The edges from Constants to Start are shown in dotted lines as these are not true control edges
- We label each edge with its position in the
_inputs
array, thus0
means the edge is_inputs[0]
.