The Good IR: Instructions and Values
This post is part of a seres:
The Good IR (BasicBlocks and control flow instructions)
The Good IR: Other Control Flow Structures
The Good IR: Instructions and Values
The Good IR: Reporting Semantic Errors via Type Checking
The Good IR: Multiple Returns
It’s taken some time, but I think I’ve arrived at a reasonable Instruction hierarchy. It should be relatively free of pitfalls, special cases, and other obstructions. I can’t say for certain, but I think it will also work reasonably well during later optimization passes, and code gen.
The IR I’m designing is a form of Static Single Assignment. This means that we focus on tracking Values, not variables. There are some very useful guides to constructing SSA, which I won’t discuss at length (but would like to include them for reference).
- Simple Generation of Static Single-Assignment Form by Aycock gives the most lucid and crystal clear explanation of SSA’s role in the field of compilation, how many of the algorithms for generating SSA were arrived at, and how SSA supports many different optimizations.
- Constructing SSA the Easy Way by Michael Bebenita, gives a wonderful walk-through of a very simple algorithm for generating SSA. Where Aycock focuses on proofs of correctness, Bebenita focuses on drawing pictures and explaining implementation.
- Single-Pass Generation of Static Single-Assignment Form for Structured Languages by Brandis and Mössenböck, gives a very easy to implement way to perform SSA during the parsing of a structured language. The procedure they describe has certainly influenced my design: Specifically, I also have each function with a single EntryBlock and ExitBlock; but I could not implement their suggestions without some modification (due to the presence of multiple returns). This is undoubtedly one of the better papers for learning the CFG layout of different control constructs (switch-case, while vs do-while, etc), even if (like me) you find your language has features not supported by their algorithm.
- Practical Improvements to the Construction and Destruction of Static Single Assignment Form, by Briggs, Cooper, Harvey, and Simpson, highlights the differences between (and give algorithms to construct) minimal, pruned and semi-pruned SSA. They also include a discussion of the ‘lost-copy’ problem, and how to find and split critical edges. The algorithms given are about efficiently creating as few Phi’s as possible and clear enough that you can plug them into your own work with little modification. However, I prefer the simpler approach of Aycock and Bebenita: generate all Phi’s possible followed by a pass which cleans up what isn’t necessary.
Given that our SSA based IR is focused on values more than instructions, it makes sense that we would build a Value hierarchy. Let’s focus on some common operations:
Value `- Instruction `- Binary |- Add |- Sub |- Mul `- Div
Let’s begin with something simple: basic math. Operations like addition can be represented using an
AddInstruction. The result of actually performing the addition is, of course, a runtime value. We model this by making
Instructions generally) inherit from
Value. The arguments supplied to the
AddInstruction are also runtime values. We model this by passing the constructor of
Values. Other mathematical operations (subtract, divide, modulus, multiplication) are modeled similarly. From this observation we make two generalizations:
- Values can refer to other Values.
- Instructions producing a runtime value must inherit from
The decision whether to place the mathematical operations into their own hierarchy below
Instruction can be left up to taste. I find that adding a few layer can be convenient in that it keeps down the branching factor of the inheritance tree.
Value `- Instruction |- Binary `- ControlFlow |- BranchGreater |- BranchLesser |- BranchGreaterEqual |- BranchLesserEqual |- BranchEqual |- BranchNotEqual |- Jump `- Return
But we know that some instructions, such as UnconditionalJump and ConditionalBranch, produce no value at runtime, nor are they referred to by other instructions. We’ve decided from looking at the math operators, that
Value, should we also allow these control flow operators to subclass
Instruction even when they do not represent a runtime value? I think, yes. Because, (a) the BasicBlock holds a List<Instruction> and it would be difficult to end each block with a control flow operator that does not inherit
Instruction, (b) an SSA interpreter operating on the IR will be iterating over
Instructions and still needs to encounter control flow operators, and (c) it does no real harm.
That last point is actually an important lesson: We shouldn’t jump at the chance to model in such a fine-grained way that we have unnecessary special cases. We should instead, try to eliminate as many special cases as possible, because the remaining special cases have a tendency to interact in a combinatorial explosion. Therefore, we should allow any imprecision in our model which causes no real harm if it eliminates a special case. Having the control flow operator inherit
Value even though they don’t produce runtime values and inherit
Instruction for the convenience of being added to the instruction stream is just such an imprecision in our modeling.
We’ve designed the math instructions so that they take other
Values as arguments. This sets up an induction, where values are produced from other values. But we still need a base case, an initial creation of values. I thought of four ways that values originate, which should each be addressed individually.
Value `- Instruction `- InputParameter
Input Parameters. Operations can take parameters supplied by a caller as a runtime value. We face somewhat the same dilemma as with the control flow operations: should we consider
Instruction or a
Value? It is more accurate to model
InputParameter as a subclass of
InputParameters into its List<Instruction>. A verifying SSA interpreter can treat
InputParameter as a nop. Its presence in the instruction stream does no harm to the register allocator and code generators later. So, in the interests of following the previous design guideline: we observe it does no harm for
InputParameter to subclass
Instruction and just ‘go along for the ride’, and it removes a special case.
Immediates. Some expression, such as
a + 5 involve a constant value. Again we are faced with two options:
Option A: We can have the
Const inherit from
Instruction, and insert the value into the BasicBlock’s List<Instruction>. The drawback is that it somewhat pollutes the Instruction stream, moreso than InputParameter which only occurs in a function’s EntryBlock. Because I have a strong hunch the SSA verifying interpreter and later passes including SSA deconstruction can ignore its presence; I don’t find significant harm including it in the Instruction stream.
Option B: We could also have the
Const inherit from
Value, prohibiting it from appearing in the Instruction stream. This means that, as an argument to another Instruction, such as
AddInstruction, it would ‘dangle off to the side’. I’m wary that this special case might make some later passes a tad more complicated than necessary, so I wouldn’t choose this option without a demonstration that eliding the
Const from the Instruction stream has some actual benefit.
Const may not be the best name for this Instruction. Perhaps one of
Embedded will be more descriptive.
Local SSA Variables
During SSA creation, we must take care to map variable names to their defining values. However, at the entry into a function local scalar variables have not yet been created. Therefore, we must introduce a new kind of
Value to represent them. In the interests of design uniformity, we should like to handle these variables much like we do InputParameters. As a result, we introduce a
Uninitialized, which subclasses
Instruction so that it may be placed in the functions EntryBlock alongside InputParameters. I’m especially motivated to introduce
Uninitialized because it enables the parser (or type system) to check ‘variable use before definition’ errors.
Value `- Instruction |- InputParameter |- EmbeddedConst |- Uninitialized `- ArrayDeclaration
Local Non-SSA Variables
Some entities, such as local arrays, are not tracked using SSA. However, we still need a mechanism for modeling these entities. Despite the fact that they do not represent scalar values, the source code may still use an element of an array in either the right or left hand side of an assignment statement. When the source code refers to an array, the parser will look up the variable name in the symbol table, which must return with something meaningful, such as the base address. To satisfy these obligations, we choose to add a
ArrayDeclaration subclass to
Instruction, for insertion in the function’s EntryBlock. This entry accomplishes two things: (a) it allows the symbol table to answer back with the a defining Instruction for the array, and (b) it serves as a placeholder that the CodeGenerator (or SSA interpreter) can use to create stack-local memory space for the array. I’m leaving the semantics about array initialization unspecified, because it’s really up to the language designer.
Value `- Instruction |- Load |- Store `- Index
The whole point of SSA is actually to avoid loads and stores, but for those items not tracked (local arrays, globals, locals accessed via lexical scope, etc) we cannot avoid the load and store. In all cases, however, it is sufficient for the
Load to return the address of the symbol mentioned in the source code. For arrays we also introduce an
Index instruction which performs the appropriate address arithmetic for each operator used. Finally, we also introduce a
Store instruction, which allows storing a Value to a given address (e.g. that retrieved from a previous
Load). Semantically, all variables for which a load/store must be used, should be considered initialized when retrieved from the symbol table.
Value `- Instruction `- Call
I feel obligated to mention that
Call Instruction should also be added to our system. It will take a List<Value> representing the arguments passed in, and return a
Value as a result. From this description, we can see that it faces the same constraints as the basic math operations. We therefore put it in a similar place in the hierarchy.
Value |- Instruction `- Phi
Phi’s are not actual Instructions. They operate according to their own semantics: parallel copies which must occur prior to any instruction within that BasicBlock. To support Phi’s, we have to extend BasicBlock’s to also hold a List<Phi> in addition to the usual List<Instruction> This keeps the Phi’s separate from the instruction stream, and amenable to separate analysis. Although Phi’s produce a value, they are considered a pseudo-instruction, so we do not inherit
Phi takes an ordered list of
Values representing its arguments, which could be other
Phis or value defining
Instructions. The order must be preserved during program optimization passes! because it’s held in direct correspondence with an ordered list of incoming CFG edges held by the Phi’s owning BasicBlock. Because, all
Instructions keep a pointer to their owning BasicBlock, I do not consider it necessary to have the
Phi hold (
Value, BasicBlock) pairs. A
Phi can simply query its arguments for this information. Also, the defining instruction can be arbitrarily early on an incoming path, it does not necessarily reside in the predecessor block corresponding to that path.
There are still a large number of considerations that I have not yet addressed. I’ve introduced the concept of addresses when handling globals and arrays. We have the ability to construct Instructions which might actually be erroneous. For example, when the programmer adds two entities of invalid type, or fails to perform the correct number of index operations on an array, or fails to pass the correct types to a function call, etc. Many of these semantic issues can be cast as type checks. So, I’ll be writing my next post on how to perform those checks in a pretty clean manner.
I should also credit a chat with Alex Thornton for the following:
Design Rule of Thumb: Avoid the special cases. Rather than reach for the tools that let you model a special case, invert the question and ask “what harm can it do to treat it normally?”.