The Good IR: Reporting Semantic Errors via Type Checking

This post is part of a seres:
The Good IR (BasicBlocks and control flow instructions)
The Good IR: Other Control Flow Structures
The Good IR: Instructions and Values
The Good IR: Reporting Semantic Errors via Type Checking
The Good IR: Multiple Returns
—–

It’s actually, quite fascinating: not only is type-checking relatively simple to do, but I found it can also catch more errors than I naively expected.

Let’s consider how the parser might handle the grammar rule: designator := ident { "[" expression "]" }.
Algorithm A
1. Ensure that an Identifier was seen.
2. Check that the identifier’s name is present in the symbol table, and signal a ‘use before definition’ if not.
3. Issue a LoadAddress if the identifier is an array, otherwise update an SSA binding.
4. In the event of an array, resolve the indexing
4.1 Perform the address arithmetic
4.2 Ensure at each step that the type can be actually be indexed.

When coding, you might face (as I did) an irresistibly strong temptation to put all this functionality into the designator method of your parser. If you don’t resist that urge, you’ll almost certainly end up (as I did) with a

    // designator = ident{ "[" expression "]" }.
    private Result designator(Node cur, Node join) {
        Result res = new Result();
        if (sym.isType(Lexeme.IDENT)) {
            res.syminfo = SymbolTable.getInfo(sym.getName());
            if (res.syminfo == null) {
                ErrorHandler.printSyntaxErr(sym.type.toString(), sym.getLine(), "symbol \"" + sym.getName()
                        + "\" used but not defined!");
            }
            res.name = sym.getName();
            if (cur.getCurSsa().get(res.name) != null
                    && res.syminfo.isType(SymType.var)) {
                res.instruction = cur.getCurSsa().get(sym.getName());
            } else {
                res.instruction = new UnaryInstruction(cur,UnaryInstruction.Operator.loada, null, res.name);
                res.kind = Result.Kind.address;
            }
            sym = Scanner.getSym();
        } else {
            ErrorHandler.printSyntaxErr(sym.getType().toString(),
                    sym.getLine(), "identifier expected");
        }
 
        // resolve array access
        Vector<Integer> dim = SymbolTable.getInfo(res.name).getDimensions();
        if (dim != null) {
            int multiplier = 1;
            for (int i=1; i<dim.size(); i++) {
                multiplier *= dim.get(i);
            }
            int index = 1;
            while (sym.isType(Lexeme.OPENSQR)) {
                sym = Scanner.getSym();
                Instruction instr = expression(cur, join).instruction;
 
                Instruction mul = new BinaryInstruction( cur, BinaryInstruction.Operator.mul,
                        instr,
                        new ConstantInstruction(cur, multiplier*8, "adda(" + multiplier +")"));
                res.instruction = new BinaryInstruction( cur,
                        BinaryInstruction.Operator.adda,
                        res.instruction, mul);
                multiplier /= index<dim.size() ? dim.get(index) : 1;
                index++;
                expect(Lexeme.CLOSESQR);
            }
            res.kind = Result.Kind.address;
        }
        // return the result
        return res;
    }

But all of the manually inlined functionality that leads to such spaghetti can be fixed with a few refactorings! To clean up the Parser, we’ll delegate some functionality (such as error checking) to other pieces of code. This way the Parser code itself can read as closely to English as possible.

Let me first address two obvious drawbacks in the above sample, before getting into the type-checking.

Step 1. Ensure that an Identifier was seen.
The Parser is constantly checking if some particular token is ‘current’. This check is also tied to reporting errors and/or advancing the stream one token. My parser ended up with four functions combining three functionalities:

  1. inspecting the current token
  2. error reporting on unexpected tokens
  3. advancing the token stream

You can see how some of these functionalities are combined into the traditional accept/expect pattern for recursive descent parsers. I merely extended this common approach with a subtle case: When parsing an expression and checking the current token for a PlusToken or MinusToken, I wanted the parser to advance the stream only if the current token matched my query, otherwise leave the stream alone and don’t report an error.

Introducing Types

Ok. The remaining issues are semantic in nature, and I found that a clever use of type-checking can catch them all. The code sample above felt seriously ad-hoc. It’s missing a type-checking framework, which I wasn’t knowledgeable enough to model on my own. By occasionally surfing google scholar for compiler implementation details I eventually found an awesome paper about Design Patterns for Teaching Type Checking in a Compiler Construction Course. The authors demonstrate how to use the Composite pattern to represent a set of base types together with the ability to compose those types. Any reasonable programming language has within it a mini-language so that the program author may extend the built-in type system with knowledge of custom types. Naturally, a Builder automatically constructs any programmer specified types by using the available composition operators present in the type hierarchy.

Once we can successfully model arbitrary types, we can attach them to Symbols (representing variables/functions) and Values (both Instructions and Phi’s). Every constructor of every Value shall have a line of code which calculates its type based on the given arguments. The calculation itself is best delegated to the type hierarchy:

AddInstruction::AddInstruction(ValuePtr x, ValuePtr y)
    :Instruction()
{   
    m_args.append(x);
    m_args.append(y);
    m_type = x->type()->add(y->type());
}

Should the types be incompatible, the Instruction will be of the ErrorType, which can hold a message describing the error. Now, it becomes possible for the Parser to always generate an Instruction without first checking the validity of the operation. The validity check is left up to an Instruction’s constructor, which in turn delegates the check to the type hierarchy, by calling a type composition function on its argument’s types.

But how/when do you detect the errors?

We add to the Parser another helper function: issue which will append to the List<Instruction> any Instruction which has a non-Error type. Should the Parser attempt to issue an erroneous instruction, it can report the corresponding error message and then quit (or resume).

Designator Digression

Because a designator may appear on either the right-hand side of an assignment or the left-hand side in an expression, the grammar rule for designator has some subtle complexities that require this digression. Let’s tabulate a few examples, to explore the territory.

Statement Description Result Type and Reason
a[4] =      Non-SSA on LHS Address(Integer) to receive the value stored.
     = a[4]; Non-SSA on RHS Integer which provides a value.
   a =      SSA on LHS No instruction issued. Be careful to leave updating the defining Instruction (from the RHS) in the StateVector to the Parser method for AssignmentStatement.
     = a; SSA on RHS No instruction issued, return the defining instruction found in the StateVector

Inspecting the table for some patterns, we notice that the designator shall be easier to write if it’s given knowledge of which side of an assignment it appears on. We may accomplish this by passing to designator such a flag: enum Side {LeftHandSide, RightHandSide}. Armed with this knowledge does not alleviate all issues, however.

We should also take notice that in only one of these cases (SSA on LHS) should the designator take no action. In all other cases, designator must return an Instruction. After some thought, I could find no way of avoiding some cooperation with assignmentStatement. It would not be sufficient for designator to return a NULL to signal AssignmentStatement that the SSA mapping should be updated, because AssignmentStatement needs to know which Symbol to be update in the StateVector. For this reason, I opted for creating a special pseudo-instruction, SymbolInstructon, which is never issued to the instruction stream, and serves only to communicate the situation (SSA on LHS) to AssignmentStatement through an Instruction return type (forced by the other three cases).

Back to the Types

Let’s see how this approach solves the rest of the inlining issues.

Step 2. Check that the identifier’s name is present in the symbol table, and signal a ‘use before definition’ if not.
We delegate variable lookup to a helper method tryResolve, which reports an error if a name could not be found in the SymbolTable. This guarantees that we have a valid symbol upon which to operate. It is ok for this Symbol to appear on the left hand side of an assignment, so we delegate the ‘use before definition’ check to another method, lookupValue, which inspects the StateVector for the Symbol’s defining Instruction. It signals an error if the Symbol is bound to Uninitialized.

Step 3. Issue a LoadAddress if the identifier is an array or other non-SSA variable, otherwise return the SSA binding.
Now that we have handled both SSA cases, we can safely write the rest of the designator assuming a non-SSA access. Whether we have a local array or a variable in another scope (ex. global), we begin with loading the address. Only if the designator is on the right hand side, will that address have to changed to an actual value. In this event, the LoadAddress must be inserted into the instruction stream. We delegate the insertion to a method called issue which verifies that the instruction is not of the ErrorType.

Step 4. In the event of an array, resolve the indexing
Step 4.1 Perform the address arithmetic
Step 4.2 Ensure at each step that the type can be actually be indexed.

Here’s where the use of typed instructions really comes into play. Beginning with the address, we must calculate the correct offset, which depends on the dimensions of the array. Rather than inlining the address arithmetic, we delegate it to IndexInstruction, which can perform the calculation by inspecting the types of its arguments. However, for multi-dimensional arrays, the result of IndexInstruction is not necessarily an Integer, rather it could be a subarray. The IndexInstruction simply peels off one of the ArrayType composite wrappers. For an example of how this works, consider the following erroneous access:

array a[3][4];
x = a[expr1][expr2][expr3];

and the list of issued instructions, along with their types:

i1 : LoadAddress(a) : Address(Array(4, Array(3, Integer)))
i2 : expr1          : Integer
i3 : Index(i1, i2)  : Address(Array(3, Integer))
i4 : expr2          : Integer
i5 : Index(i3, i4)  : Address(Integer)
i6 : expr3          : Integer
xx : Index(i5, i6)  : ErrorType("Could not index an Address(Integer) with Integer.")

In that last index operation, we actually produce an IndexInstruction. The error is caught and reported when the Parser tries to issue that instruction into the stream. Using all these techniques, we can simplify the previous designator spaghetti into something much more

// designator = ident { "[" expression "]" }.
InstructionPtr Parser::designator(Side side)
{
    assume(Token::IDENT);
    SymbolPtr const &sym = tryResolve(tok.name());
    eat(Token::IDENT);
    if (isSSA(sym)) {
        switch (side) {
        case LeftHandSide: return new SymbolInstruction(sym); // not issued on purpose, signals caller
        case RightHandSide: return lookupDefiningValue(sym);  // that instruction was already issued
        }
    } // else, must be an array
 
    InstructionPtr addr = issue(new LoadAddressInstruction(sym));
    while (accept(Token::OpenBracket)) {
        InstructionPtr offset = expression();
        expect(Token::CloseBracket);
        addr = issue(new IndexInstruction(addr, offset));
    }
    switch (side) {
    case LeftHandSide: return addr; // it was issued in the loop
    case RightHandSide: return issue(new LoadValueInstruction(addr));
    }
}

But That’s Not All!

We can also use this framework of helper functions and typed instructions to detect other semantic errors. You should be able to see that it can already detect type-invalid operations such as Add, Sub, Mul, Div, Index. It also directly generalizes to Call, where the constructor type-checks the supplied list of argument types against that of the function’s Symbol.

We can also gift Phi’s with a type method that verifies that all of its arguments are valid. This technique allows us to detect a rather more difficult semantic rule: that not only does a function return a value with appropriate type, but also for any functions declaring a return type all paths through that function must encounter a valid return. I haven’t yet written the post on multiple returns, but the verification is conceptually simple: Inspect the Phi of the return value in the function’s ExitBlock and verify that it does not have the ErrorType.

Design Rule of Thumb: Write the Parser code for the expected case, delegating error checking and reporting to helper functions.