The Good IR: Instructions and Values

This post is part of a seres:
The Good IR (BasicBlocks and control flow instructions)
The Good IR: Other Control Flow Structures
The Good IR: Instructions and Values
The Good IR: Reporting Semantic Errors via Type Checking
The Good IR: Multiple Returns

It’s taken some time, but I think I’ve arrived at a reasonable Instruction hierarchy. It should be relatively free of pitfalls, special cases, and other obstructions. I can’t say for certain, but I think it will also work reasonably well during later optimization passes, and code gen.

The IR I’m designing is a form of Static Single Assignment. This means that we focus on tracking Values, not variables. There are some very useful guides to constructing SSA, which I won’t discuss at length (but would like to include them for reference).

  • Simple Generation of Static Single-Assignment Form by Aycock gives the most lucid and crystal clear explanation of SSA’s role in the field of compilation, how many of the algorithms for generating SSA were arrived at, and how SSA supports many different optimizations.
  • Constructing SSA the Easy Way by Michael Bebenita, gives a wonderful walk-through of a very simple algorithm for generating SSA. Where Aycock focuses on proofs of correctness, Bebenita focuses on drawing pictures and explaining implementation.
  • Single-Pass Generation of Static Single-Assignment Form for Structured Languages by Brandis and Mössenböck, gives a very easy to implement way to perform SSA during the parsing of a structured language. The procedure they describe has certainly influenced my design: Specifically, I also have each function with a single EntryBlock and ExitBlock; but I could not implement their suggestions without some modification (due to the presence of multiple returns). This is undoubtedly one of the better papers for learning the CFG layout of different control constructs (switch-case, while vs do-while, etc), even if (like me) you find your language has features not supported by their algorithm.
  • Practical Improvements to the Construction and Destruction of Static Single Assignment Form, by Briggs, Cooper, Harvey, and Simpson, highlights the differences between (and give algorithms to construct) minimal, pruned and semi-pruned SSA. They also include a discussion of the ‘lost-copy’ problem, and how to find and split critical edges. The algorithms given are about efficiently creating as few Phi’s as possible and clear enough that you can plug them into your own work with little modification. However, I prefer the simpler approach of Aycock and Bebenita: generate all Phi’s possible followed by a pass which cleans up what isn’t necessary.

Given that our SSA based IR is focused on values more than instructions, it makes sense that we would build a Value hierarchy. Let’s focus on some common operations:

Basic Math

`- Instruction
   `- Binary
      |- Add
      |- Sub
      |- Mul
      `- Div

Let’s begin with something simple: basic math. Operations like addition can be represented using an AddInstruction. The result of actually performing the addition is, of course, a runtime value. We model this by making AddInstruction (and Instructions generally) inherit from Value. The arguments supplied to the AddInstruction are also runtime values. We model this by passing the constructor of AddInstruction two Values. Other mathematical operations (subtract, divide, modulus, multiplication) are modeled similarly. From this observation we make two generalizations:

  • Values can refer to other Values.
  • Instructions producing a runtime value must inherit from Value.

The decision whether to place the mathematical operations into their own hierarchy below Instruction can be left up to taste. I find that adding a few layer can be convenient in that it keeps down the branching factor of the inheritance tree.

Control Flow

`- Instruction
   |- Binary
   `- ControlFlow
      |- BranchGreater
      |- BranchLesser
      |- BranchGreaterEqual
      |- BranchLesserEqual
      |- BranchEqual
      |- BranchNotEqual
      |- Jump
      `- Return

But we know that some instructions, such as UnconditionalJump and ConditionalBranch, produce no value at runtime, nor are they referred to by other instructions. We’ve decided from looking at the math operators, that Instruction inherits Value, should we also allow these control flow operators to subclass Instruction even when they do not represent a runtime value? I think, yes. Because, (a) the BasicBlock holds a List<Instruction> and it would be difficult to end each block with a control flow operator that does not inherit Instruction, (b) an SSA interpreter operating on the IR will be iterating over Instructions and still needs to encounter control flow operators, and (c) it does no real harm.

That last point is actually an important lesson: We shouldn’t jump at the chance to model in such a fine-grained way that we have unnecessary special cases. We should instead, try to eliminate as many special cases as possible, because the remaining special cases have a tendency to interact in a combinatorial explosion. Therefore, we should allow any imprecision in our model which causes no real harm if it eliminates a special case. Having the control flow operator inherit Value even though they don’t produce runtime values and inherit Instruction for the convenience of being added to the instruction stream is just such an imprecision in our modeling.

Initial Values

We’ve designed the math instructions so that they take other Values as arguments. This sets up an induction, where values are produced from other values. But we still need a base case, an initial creation of values. I thought of four ways that values originate, which should each be addressed individually.

`- Instruction
   `- InputParameter

Input Parameters. Operations can take parameters supplied by a caller as a runtime value. We face somewhat the same dilemma as with the control flow operations: should we consider InputParameter an Instruction or a Value? It is more accurate to model InputParameter as a subclass of , because parameters aren’t actually executed by the callee. However, since each function has an EntryBasicBlock, it’s convenient to stuff any InputParameters into its List<Instruction>. A verifying SSA interpreter can treat InputParameter as a nop. Its presence in the instruction stream does no harm to the register allocator and code generators later. So, in the interests of following the previous design guideline: we observe it does no harm for InputParameter to subclass Instruction and just ‘go along for the ride’, and it removes a special case.

Immediates. Some expression, such as a + 5 involve a constant value. Again we are faced with two options:

Option A

Option A: We can have the Const inherit from Instruction, and insert the value into the BasicBlock’s List<Instruction>. The drawback is that it somewhat pollutes the Instruction stream, moreso than InputParameter which only occurs in a function’s EntryBlock. Because I have a strong hunch the SSA verifying interpreter and later passes including SSA deconstruction can ignore its presence; I don’t find significant harm including it in the Instruction stream.

Option B

Option B: We could also have the Const inherit from Value, prohibiting it from appearing in the Instruction stream. This means that, as an argument to another Instruction, such as AddInstruction, it would ‘dangle off to the side’. I’m wary that this special case might make some later passes a tad more complicated than necessary, so I wouldn’t choose this option without a demonstration that eliding the Const from the Instruction stream has some actual benefit.

Actually, Const may not be the best name for this Instruction. Perhaps one of Number, Immediate, or Embedded will be more descriptive.

Local SSA Variables
During SSA creation, we must take care to map variable names to their defining values. However, at the entry into a function local scalar variables have not yet been created. Therefore, we must introduce a new kind of Value to represent them. In the interests of design uniformity, we should like to handle these variables much like we do InputParameters. As a result, we introduce a Uninitialized, which subclasses Instruction so that it may be placed in the functions EntryBlock alongside InputParameters. I’m especially motivated to introduce Uninitialized because it enables the parser (or type system) to check ‘variable use before definition’ errors.

`- Instruction
   |- InputParameter
   |- EmbeddedConst
   |- Uninitialized
   `- ArrayDeclaration

Local Non-SSA Variables
Some entities, such as local arrays, are not tracked using SSA. However, we still need a mechanism for modeling these entities. Despite the fact that they do not represent scalar values, the source code may still use an element of an array in either the right or left hand side of an assignment statement. When the source code refers to an array, the parser will look up the variable name in the symbol table, which must return with something meaningful, such as the base address. To satisfy these obligations, we choose to add a ArrayDeclaration subclass to Instruction, for insertion in the function’s EntryBlock. This entry accomplishes two things: (a) it allows the symbol table to answer back with the a defining Instruction for the array, and (b) it serves as a placeholder that the CodeGenerator (or SSA interpreter) can use to create stack-local memory space for the array. I’m leaving the semantics about array initialization unspecified, because it’s really up to the language designer.

`- Instruction
   |- Load
   |- Store
   `- Index

Untracked Values

The whole point of SSA is actually to avoid loads and stores, but for those items not tracked (local arrays, globals, locals accessed via lexical scope, etc) we cannot avoid the load and store. In all cases, however, it is sufficient for the Load to return the address of the symbol mentioned in the source code. For arrays we also introduce an Index instruction which performs the appropriate address arithmetic for each operator[] used. Finally, we also introduce a Store instruction, which allows storing a Value to a given address (e.g. that retrieved from a previous Load). Semantically, all variables for which a load/store must be used, should be considered initialized when retrieved from the symbol table.

`- Instruction
   `- Call

Function Calls

I feel obligated to mention that Call Instruction should also be added to our system. It will take a List<Value> representing the arguments passed in, and return a Value as a result. From this description, we can see that it faces the same constraints as the basic math operations. We therefore put it in a similar place in the hierarchy.

|- Instruction
`- Phi


Phi’s are not actual Instructions. They operate according to their own semantics: parallel copies which must occur prior to any instruction within that BasicBlock. To support Phi’s, we have to extend BasicBlock’s to also hold a List<Phi> in addition to the usual List<Instruction> This keeps the Phi’s separate from the instruction stream, and amenable to separate analysis. Although Phi’s produce a value, they are considered a pseudo-instruction, so we do not inherit Instruction.

A Phi takes an ordered list of Values representing its arguments, which could be other Phis or value defining Instructions. The order must be preserved during program optimization passes! because it’s held in direct correspondence with an ordered list of incoming CFG edges held by the Phi’s owning BasicBlock. Because, all Instructions keep a pointer to their owning BasicBlock, I do not consider it necessary to have the Phi hold (Value, BasicBlock) pairs. A Phi can simply query its arguments for this information. Also, the defining instruction can be arbitrarily early on an incoming path, it does not necessarily reside in the predecessor block corresponding to that path.

Closing Remarks

There are still a large number of considerations that I have not yet addressed. I’ve introduced the concept of addresses when handling globals and arrays. We have the ability to construct Instructions which might actually be erroneous. For example, when the programmer adds two entities of invalid type, or fails to perform the correct number of index operations on an array, or fails to pass the correct types to a function call, etc. Many of these semantic issues can be cast as type checks. So, I’ll be writing my next post on how to perform those checks in a pretty clean manner.

I also peeked a bit at other SSA implementations, including V8 the optimizing JavaScript compiler behind Google’s Chrome Browser, and LLVM which is capable of representing all the frightening things which occur in C.

I should also credit a chat with Alex Thornton for the following:
Design Rule of Thumb: Avoid the special cases. Rather than reach for the tools that let you model a special case, invert the question and ask “what harm can it do to treat it normally?”.