LAMP: Optimisations

Optimisations

English only

Introduction

The aim of this project is to optimize as much as possible the code produced by the misc compiler.

Starting point: yet another misc compiler

As a starting point, we give you a new version of a working implementation of the misc compiler. Up to type analysis, this compiler is mostly identical to the one of the first project, but the later phases are quite different. The back-end of the compiler (what comes after type analysis) is now structured as follows:

A code generation phase (file Generator.java) comes first, and produces a control-flow graph (file CFG.java) instead of linear code as before; moreover, the code generator now assumes that infinitely many registers are available, as register allocation is performed subsequently. Function calls are compiled using two new pseudo-instructions (PCALL and PJSR) which are transformed to real DLX code later. This is necessary, since at this stage the set of live registers is not known, as register allocation hasn't been performed yet.
A linear-scan register allocator (file LinearScan.java) performs register allocation. After this stage, it is still possible that the code uses pseudo-registers (with indices greater than 31), in which case spilling is required. Notice that linear-scan register allocation needs liveness information, which is computed by the code found in Liveness.java.
Pseudo-instructions PCALL and PJSR are rewritten into real instructions (file Frame.java).
Spilling code is introduced if necessary, by a phase which walks over the control-flow graphs, looking for pseudo-registers. Locations are assigned on the stack for these pseudo-registers, and code is rewritten accordingly (file Spiller.java).
The CFG is linearized and code for predefined functions and system initialization is introduced (file Lineariser.java).

The output of the last phase is an executable RISC program, as usual. It does require a new version of the interpreter, though, as explained in the next section.

The new RISC interpreter

The new compiler is also different in that it outputs textual labels to refer to functions and locations in the code, instead of numerical offsets. This makes it easier to debug the generated code, but requires a new version of the RISC interpreter. This new version is installed in INF, but people working on their own computers need to download it.

This new RISC interpreter also includes the following new instructions, which make code generation easier:

LDA a, l: which loads the address designated by label l into register Ra.
CALL c: which calls the function whose address resides in register Rc.

Finally, this interpreter has a new option, called -profile which displays statistics about the number of instructions and "cycles" executed. These are the numbers you have to bring down. The number of cycles is computed in a very naïve way, by assigning constant costs to instructions, (very roughly) based on their cost on typical processors. But please bear in mind that we have deliberately oversimplified these numbers.

Benchmarks

In order to help you (and us) evaluate the effectiveness of the optimizations you introduce into the compiler, we provide you with the following set of benchmarks:

fib.misc: the classical Fibonacci function.
maze.misc: a random maze generator.
sort.misc: a program which expects a list of positive integers terminated by -1 on standard input (one per line), and which sorts them using three different sorting algorithms and prints the results. An example input file for this benchmark.
bignums.misc: the factorial function for big integers (can compute, say, the factorial of 1000).
matrix.misc: Matrix multiplication. An example input file for this benchmark.

How to organize yourself

For this project, we strongly encourage you to use some kind of version-control system like CVS, Subversion, Meta-CVS, darcs, Open CM or arch. This will help you work with your group partner, and ease the merging of our changes with yours (we will certainly have to fix some bugs in our code as the course progresses).

Optimization hints

Here we will suggest some optimizations and give some hints on how to implement them.

Dead Code Elimination (DCE)

Given the current framework, Dead code elimination is probably one of the optimizations that will be easiest to implement.

To implement dead code elimination you can start by looking at how Rewrite (in LinearScan) works. By writing a new version of rewrite for DeadCode you can probably quite quickly and easily get working code. (our implementation of DeadCode is 56 lines.)

What dead code elimination needs to do is to traverse the code of each basic block backwards and create new code for the basic block by only emitting those instructions that are live. (And those instructions that have a side effect such as memory writes, function calls, returns, syscall, break, and instructions manipulating the stack. We will soon provide a new version of Instruction.java and Opcode.java that provides you with the function Instruction.sideEffect():boolean.)

To get really interesting results you probably have to apply the dead code elimination iteratively. This can easily be done by having flag changed in DeadCode which is initialized to false and set to true if some instruction is eliminated (not re-emitted). Then you can apply DCE iteratively in Main:

      DeadCode dc = new DeadCode();
      while(dc.changed) dc.removeDeadCode(cfgs[i]);

So DeadCode.java could look something like:

    package miscc;
    import java.util.*;

    class DeadCode {
        private Liveness liveness;
        private CFG cfg;
        public boolean changed = true;

        public void removeDeadCode(CFG cfg) {
          this.cfg=cfg;
          liveness = new Liveness(cfg);
	  changed = false;
          cfg.visitRPO(new Rewrite());
        }

        private class Rewrite extends CFG.EmptyVisitor {
	  public void pre(CFG.Node node) {
            removeDeadInstrs(node);
	  }
        }

        private void removeDeadInstrs(CFG.Node bb) {
        // SOME INTERESTING STUFF HERE
        //  WHICH TRAVERSES bb.code BACKWARDS
        //  EMITTING ONLY NON-DEAD INSTRUCTIONS 
        //  AS DETERMINED BY liveness.liveOut(bb)
        //  AND sideEffect()
        }
    }

Better register handling in Generator

The current code generator sometimes reuses the targetReg for both a subtree and for the result of the operation of a node. This produces some false dependencies in the code. The code would be more SSA-like if each expression used its own register.

Inlining

Function calls being quite expensive, inline expansion of functions can be a very worthwhile optimisation. It is especially interesting because it opens the door to many further optimisations, like constant folding and propagation, dead code elimination, etc. To implement inlining, one has to solve basically two problems:

decide when inlining can and should be performed,
perform it.

There is no general method to decide when to perform inlining, and therefore one has to rely on heuristics. A first (necessary) condition to perform inlining is that the function being called must be known at compilation time, otherwise it is not possible to know what to inline. Then, one must take care not to make the compiler loop by trying to repeatedly inline recursive functions; we therefore suggest that you do not consider recursive calls as candidates for inlining. Finally, the size of the function called must somehow be taken into account, to avoid code size explosion. For this project, you can simply write a function to compute the "size" of a tree (for example by counting the number of nodes it contains) and only inline functions whose body is smaller than some value. We propose that you compute this size after performing inlining inside of the body of a function.

Notice that many other heuristics than the one suggested can be used. With our solution, a function will either be inlined every time it is used, or never. This is due to the fact that we base our decision entirely on the definition of the function, not on its use(s). More realistic heuristics take individual decisions for every uses of a function.

We suggest that you perform inlining on the tree directly, as this seems to be the easiest. That is, your inlining phase should be placed between type analysis and code generation. If you follow this recommendation, then the basic algorithm can be outlined as follows:

For every function call, start by checking if inlining can and should be performed, using the heuristic given above; if this is not the case, leave the call as is.
Otherwise, replace the function call by a block containing the following:
- one variable per argument of the function, each of which gets initialised by the expression given in the call,
- the code of the body of the function, where all symbols referring to the function arguments are replaced with the corresponding variable symbol.
While copying the body of the function, each time a variable declaration is encountered, its symbol must be cloned, and all uses of the variable which follow must use this new symbol. This is to preserve the invariant which says that every variable has a symbol different from all other.

Leaf functions

Currently, leaf functions (functions containing no function call) save the LNK register on entry, and restore it on exit. This is unnecessary, and can be optimised.

Constant propagation and folding

The following three optimisations are relatively easy to implement, and can provide a modest but interesting improvement in code quality.

Copy propagation: When the contents of some register R1 is copied into another register R2, replace uses of R2 by uses of R1. This can turn some "move" instructions (usually implemented using OR or ADD instructions in our architecture) into dead code, to be eliminated later.
Constant propagation: When a (small enough) constant is loaded into a register, replace uses of that register by the constant itself. Combined with dead code elimination, this can produce better code for constructions like x + 1 (two instructions instead of three as now).
Constant folding: Replace constant expressions by their value.

These three optimisations can be performed directly on the control-flow graph, but constant folding can also be performed on the tree, before code generation. All of them can make some code dead, and should therefore be run before dead code elimination.

Grouping of allocations

With the current compiler, every "cons" operation produces a system call to allocate memory for the cell. Since system calls are expensive, a better solution is to perform only one allocation for all the "cons" operations appearing in a single basic block, and then to divide the allocated memory among all of them.