The Good IR

This post is part of a seres:
The Good IR (BasicBlocks and control flow instructions)
The Good IR: Other Control Flow Structures
The Good IR: Instructions and Values
The Good IR: Reporting Semantic Errors via Type Checking
The Good IR: Multiple Returns
—-
After parsing, each function can be represented as a ControlFlowGraph of BasicBlocks. Each BasicBlock holds a list of Instructions. For illustrative purposes Instructions are colored Blue and BasicBlocks are colored Orange.

I’d now like to address the question: What should the ConditionalBranch instruction at the end of the If-Header point to?

Option Illustration Pro Con
Point at the following BasicBlock
  1. Convenient: the target BasicBlock is available during parse.
  1. Redundant: the If-Header block has a ControlFlowGraph edge with the same destination.
  2. Non-Uniform: it’s unsettling to have some instructions (such as Add/Sub/Mul/Div) point at other Instructions, but then deal with ConditionalBranch as a special case.
Point at the first Instruction of the following BasicBlock.
  1. Uniform: Instructions point only to other Instructions.
  1. Redundant: The If-Header block has a ControlFlowGraph edge with the same destination.
  2. Problematic: Can you guarantee that target all blocks have at least one instruction to point at?

The common drawback with both of these proposals, is that of Redundancy. At some point, during a future optimization pass, we might like to transform the ControlFlowGraph. If such a transformation requires that we update both the BasicBlock→BasicBlock edges and the target of ConditionalBranch Instructions, then we are more likely to have a bug. We must put forth extra effort in keeping both kinds of link synchronized.

We can avoid this common drawback by adding a layer of indirection/abstraction. Instead of forcing the ConditionalBranch instruction to maintain a pointer to its target, we can access the target through a function. That function can query its own BasicBlock about the outgoing ControlFlowGraph edges, and return either the target BasicBlock or the target BasicBlock’s first Instruction. By making this abstraction, not only do we avoid the extra effort of keeping redundant links synchronized, but we also promote uniformity in the design through the introduction of a new invariant: Only BasicBlock’s carry information about the ControlFlowGraph; No other edges are allowed between BasicBlocks.

Illustration Option Pro Con
ConditionalBranch queries its BasicBlock.
  1. Uniform: The only links across/between BasicBlocks are control flow edges held by BasicBlocks.
  2. Convenient: Do not have to synchronize redundant information.
  1. Performance: A function call is required when asking a ConditionalBranch for its targets.

Now that we have decided that ConditionalBranch will return information about the targets via a function call, we visit the question: Is the branch target the successor BasicBlock or is it the successor BasicBlock’s first Instruction?

To answer this question, I’m going to assume that the Internal Representation can also be interpreted. Such a design has the advantage that we can verify the structure of the IR by interpreting it after construction and between transformation/optimization passes.

Let’s assume a simple interpreter of the form:

Instruction i = function.firstInstruction()
while (!i.isEnd()) {
    i = i.evaluate();
}

This interpreter exploits polymorphic dispatch on the Instruction hierarchy. We can express the loop concisely because we follow the invariant: evaluate always returns the next instruction to be executed. This is analogous to setting the program counter (pc), but avoids passing the pc as either a function parameter (which most evaluate implementations will ignore) or as a global variable (yuck!). Importantly, this loop has no concept of BasicBlock’s; it evaluates only Instructions.

So, in the interests of keeping the interpreter loop immaculately clean, we have only one choice: ConditionalBranch must return the first Instruction of the target BasicBlock. But, this answer contains a potential pitfall: Can we guarantee that all target BasicBlocks have at least one Instruction?

Consider the following example code, and associated ControlFlowGraph.

Code ControlFlowGraph
if (cond1) {
    // do something 1
} else {
    if (cond2) {
        // do something 2
    }
}

Notice that, those blocks which do not have instructions in them have been left blank. Suppose that cond1 is false and cond2 is false so that the interpreter ends up following the chain of empty BasicBlocks. We notice that there is immediately some difficulty in having the interpreter transfer from ConditionalBranch(cond2) to the the first instruction of the (empty) target BasicBlock. One possible solution comes to mind: Have ConditionalBranch(cond2) attempt to iterate the following blocks until it finds one with a first Instruction. Although this will work, it feels rather kludgy, and we should continue our search for a clean design.

Let’s analyze the situation in more detail. Specifically, let’s look at the last instruction of the ThenBlock. It contains code for do something 1, which has been left unspecified to emphasize the fact that it can be completely arbitrary. However, (because we are gifted with knowledge of assembly) we have advanced some foresight: when the instructions are finally emitted from the CodeGenerator in a linear stream, we must insert an UnconditionalJump which bypasses the code for the ElseBranch (assuming one exists), and lands the program counter at the first Instruction of the JoinBlock.

In our IR interpreter, the last Instruction of the code inside the ThenBlock can be arbitrary, so it will have some difficultly detecting that control should be transferred to the following JoinBlock. We can alleviate this difficulty by analogy to our foresight, and introduce an UnconditionalJump in the last BasicBlock of the ThenBranch. As with the ConditionalBranch, the UnconditionalJump will rely on the ControlFlowGraph edge (we are guaranteed only 1) of its BasicBlock to determine the following instruction during interpretation.

We now have two situations of BasicBlocks which end in a kind of control transfer Instruction:

  1. If-Headers and Loop-Headers, which end in a ConditionalBranch Instruction.
  2. The Last BasicBlock of a ThenBranch, which ends in an UnconditionalJump.

The explicit control transfer in these situations allows the interpreter to easily determine the next instruction, even though it lies in a different BasicBlock. That is, in these two cases, the interpreter does not need to know about the existence of BasicBlocks. Due to this advantage, we should try to arrange one of these two situations to our current concern: the empty BasicBlocks in the ElseBranch.

Only one of the two previous instructions applies: the UnconditionalJump. It certainly doesn’t hurt the semantics of the ControlFlowGraph to insert an UnconditionalJump at every edge (even fallthrough edges). So we can safely coin the invariant: All BasicBlocks end with a control transfer Instruction (ConditionalBranch or UnconditionalJump). This invariant unifies our design, and allows the interpreter to iterate only over Instructions.

Additionally, we now also have at least one Instruction (the UnconditionalJump) in every BasicBlock! So, we can positively guarantee that a ConditionalBranch is able to return the first Instruction of its target BasicBlock. Indeed, all control transfer Instructions are able to do likewise.