LAMP: Optimisations

Optimisations

English only

Evaluation

The evaluation of your projects was done on a set of six benchmark programs, 4 of which you already know:

matrix-arrays
qsort
prim
word-freq

The other two are:

exit-maze.eins: Finds the shortest path between two given points in a maze. It can use the maze output by the maze benchmark.
hash.eins: A simple hash table implementation which is used to spell check a large text.

The whole benchmark suite, together with the input files and the expected output can be downloaded in one archive.

Evaluation was split in the following components: correctness, speedup and number of optimizations. Together, they sum up to 100 points. Correctness counted for 30% of your grade, speedup for 60% and the remaining 10% were given to almost everyone (a weight was assigned for each optimization, and no more than 10 points out of 100 were given, no matter how many one implemented). I would have liked to assign more weight to correctness, but that would have been bad, as there was a single group who delivered a perfectly working compiler (which passed all our tests). Each failed test costed you two points if it was from the benchmark, and only one if it was from our (secret) testsuite.

The speedup was graded as an average over speedups of all six benchmarks. If your compiler crashed/delivered wrong results on one of the benchmarks, the speedup was considered to be 0.0%. If your optimized code performed worse (negative speedup), I considered it again to be 0.0%. This way, I hope I did not favour incorrect implementations.

The only group (group 11) which had a correct compiler got an additional 5 points bonus.

Results

Here are the results of benchmarking your code.

Benchmark results
Group	Average speedup	`matrix-arrays`	`qsort`	`prim`	`word-freq`	`exit-maze`	`hash`
Group 1	6.51%	6.03%	17.57%	15.45%	0.00%	0.00%	0.00%
Group 2	13.22%	14.68%	21.74%	21.36%	4.33%	4.23%	13.01%
Group 3	19.49%	19.38%	22.81%	25.23%	14.43%	0.00%	35.09%
Group 4	17.37%	11.44%	14.81%	18.08%	22.03%	0.00%	37.86%
Group 5	16.30%	9.43%	30.62%	19.74%	0.00%	0.00%	38.02%
Group 6	11.59%	24.11%	30.71%	0.00%	14.70%	0.00%	0.00%
Group 7	28.40%	20.30%	42.56%	29.26%	21.33%	22.61%	34.32%
Group 8	6.23%	0.07%	8.39%	22.46%	6.45%	0.00%	0.00%
Group 9	15.71%	0.00%	16.22%	21.76%	13.74%	0.00%	42.53%
Group 10	14.11%	11.44%	21.39%	26.53%	25.27%	0.00%	0.00%
Group 11	8.89%	7.05%	13.17%	15.17%	2.51%	0.00%	15.41%
Group 12	10.27%	8.72%	14.67%	19.82%	15.82%	2.60%	0.00%
Group 13	3.81%	2.28%	3.07%	1.72%	3.71%	0.00%	12.05%
Group 14	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%

I marked with red the benchmarks which failed by either crashing or producing incorrect results. The other 0.00% mean that either the code performed worse, or exactly the same as the baseline compiler.

For grading, I did not round the grades for project/exam before computing the final grade. Here are the grades again:

Camipro nr.	Final Grade	Exam
153528	4.5	5.333
147964	4.5	5.333
147670	5.5	5.167
147548	4.5	4.500
129927	5.5	4.333
147357	5.5	5.500
128552	4.5	2.667
154499	5.5	5.000
155423	6.0	5.833
147380	5.0	4.833
147454	5.0	5.667
136385	6.0	5.500
147904	6.0	5.167
147259	4.5	4.333
147370	4.5	3.667
146897	5.0	4.000
147376	5.5	5.667
148204	6.0	6.000
153907	5.5	5.833
154674	5.0	4.667
154104	5.0	5.167
114329	5.0	5.000
168855	4.5	5.833
168859	3.0	3.333
154582	3.5	5.167

Bug fixes

The Linear scan register allocator problem is fixed by this patch. Alternatively, you can download an archive containing the two modified files. It fixes the problem by reserving three registers for spilling, instead of only two.

The framework will crash in Liveness if a control flow graph happens to contain an emtpy terminal node. The getLastInstruction method inside CFG will not check that, if no successors are found, there is at least one instruction in the block. Now it does that and returns null if such a case is encountered. The new Liveness.java will correctly test for null and no crash should appear. The patch contains three files: Code.java, CFG.java and Liveness.java. The new Code.java contains the size() method which is called by CFG but it has no influence on this bug -- it's just "nice to have". Here it's the patch.

The new Analyzer.java has fixed the problem with assignment. It should reject programs like let if (x) a else b = 10;. Assignment should be allowed only towards variables, array elements (which might reside inside objects). Additionally, let (new Int[10])[0] = 10 is also accepted, but has no visible effect, since no name is given to the newly created array.

Introduction

The aim of this project is to optimize the code produced by the eins compiler as much as possible.

Starting point: eins improved.

As a starting point, we give you a new version of a working implementation of the eins compiler. Both the language and the compiler suffered signifficant changes from the previous version.

Language changes

The eins language has been extended with mutable arrays. Any scalar type (Int, class or functional types) can be used to declare an array of that type like in the following example:

        var x: Int[][] = new Int[10][20];
        var funs: ((Int) => Int)[] = new ((Int) => Int)[2];

As you can see, the syntax is borrowed from Java. Arrays, unlike objects, can be modified after their creation, using the let construct:

          let x[0][1] = 10;
          let funs[0] = f;

Arrays are laid out in memory as follows: the first words contain the size of each dimension. Then it follows a zone of memory of d₁*d₂*..*d_n words, where d_i is the i^th dimension. Arrays are guaranteed to be initialized to 0:

          d₁, d₂,..,d_n, 0, 0,..,0

The dimensions are used by the compiler to generate the bounds-checking code, but it is transparent to the user (the data starts at index 0, as usual).

Objects can contain fields of array types, which need to be initialized along with the other fields, at object creation. However, individual elements of the array can be assigned later.

          class Matrix {
            val data: Int[][];
            val m: Int;
            val n: Int;
          }

          def newMatrix(m: Int, n: Int): Matrix = 
              new Matrix(new Int[m][n], m, n);   
          
          var m: Matrix = newMatrix(5, 5);
          let m.data[0][0] = 10;

Compiler organization

The back-end of the compiler (what comes after type analysis) is now structured as follows:

A code generation phase (file Generator.java) comes first, and produces a control-flow graph (file CFG.java) instead of linear code as before; moreover, the code generator now assumes that infinitely many registers are available, as register allocation is performed subsequently. Function calls are compiled using two new pseudo-instructions (PCALL and PJSR) which are transformed to real DLX code later. This is necessary, since at this stage the set of live registers is not known, as register allocation hasn't been performed yet.
A linear-scan register allocator (file LinearScan.java) performs register allocation. After this stage, it is still possible that the code uses pseudo-registers (with indices greater than 31), in which case spilling is required. Notice that linear-scan register allocation needs liveness information, which is computed by the code found in Liveness.java.
Pseudo-instructions PCALL and PJSR are rewritten into real instructions (file Frame.java).
Spilling code is introduced if necessary, by a phase which walks over the control-flow graphs, looking for pseudo-registers. Locations are assigned on the stack for these pseudo-registers, and code is rewritten accordingly (file Spiller.java).
The CFG is linearized and code for predefined functions and system initialization is introduced (file Lineariser.java).

The output of the last phase is an executable RISC program, as usual. It does require a new version of the interpreter, though, as explained in the next section.

You will find that some of these phases appear now in separate packages. It is a good idea to structure your code around the existing packages (for example, put your analysis and optimisations in the einsc.backend.opt package).

The entry point of the compiler is now einsc.Main. You will find there two classes (CmdLine and Option) which make it easy to add command line options to your compiler. During the development you will find that some bugs appear only when some optimizations are enabled in a particular combination, but not when taken separately. This will help you to easily turn on and off individual optimizations. Think about providing a flag like -O which turns on all optimizations.

The current options are -cfg and -ast. When used, they will generate the control flow graph of each function, or the abstract syntax tree of the program, respectively. The graphs are in the graphviz dot file format. You can use dotty graph.dot to visualize them, or dot -Txxx graph.dot to convert them to other format (gif, png, ps, svg, etc.).

Dereferencing null objects stops the program, as before. The difference is that the exit code, instead of 1, is now the source code line in the eins program where the exception occured. This will allow you to find bugs easier. Array references are bounds checked: that is, each indexing into an array is checked that the values passed as index are between 0 and the array dimension. In case of out of bounds exception, the program will exit with a status equal to the line number where the violation occured.

The new RISC interpreter

The new compiler is also different in that it outputs textual labels to refer to functions and locations in the code, instead of numerical offsets. This makes it easier to debug the generated code, but requires a new version of the RISC interpreter. This new version is installed in INF, but people working on their own computers need to download it. In the labs it is installed in the iclamp account, so you can use it like this:
risc-emu prog.risc.
To get some help, try
risc-emu -help.

This new RISC interpreter also includes the following new instructions, which make code generation easier:

LDA a, l: which loads the address designated by label l into register Ra.
CALL c: which calls the function whose address resides in register Rc.

The new interpreter simulates the execution of a simple pipeline.

Pipelines

The emulator supports severeal simple timing models, which can be used to decide when instructions are despatched to simulate e.g. pipelines.

Each instruction type has a cycle count, which can be found in the source code of the emulator, in Opcode.java. It is given by the integer in the construct call as in:

 
    public final static Opcode XORIU = 
           new Opcode(RISC.OpRaRbUc, RISC.XORIU, "xoriu", REG_A, REG_B, 1);

    public final static Opcode LDW =
        new Opcode(RISC.OpRaRbIc, RISC.LDW, "ldw", REG_A, REG_B, 4);

Here we see that xoriu costs 1 cycle, and ldw costs 4 cycles. In your optimizer you can find this cost by reading Opcode.codes[instruction].cost for a given instruction.

For the project we will use a dual out-of-order pipeline. This pipeline works as follows: an instruction (In) will be dispatched as soon as all preceding instructions (I1 -- In-1) that writes to registers that In uses have finished, and there is an empty slot in one of the two pipelines.

The emulator will ensure that the functional effect of the program will be as if all instructions where executed seqentially in-order.

The pipeline model is somewhat simplified to make it easier to write a scheduler. Branches have very little effect on the timing, all benaches are predicted correctly all the time and instructions are allways available and prefetched in time (i.e. no branch misprediction stalls, and no instruction cache stalls). Also, there will be no stalls on reads from memory that is beeing written to by preceeding instructions, even if these instruction are scheduled later.

This is not a very realistic pipeline model but it should be fairly easy to schedule for and it resembles real hardware enough to be interesting.

You can give the flag -trace to the emulator in order to see the scheduling of the code. With this flag the time (give as number of cyces since the start of the simulation) is printed before each instruction as it is scheduled. Note that the time can move backwards in this trace, indicating that an instruction has been scheduled out of order.

Profiling

Finally, this interpreter has a new option, called -profile which displays statistics about the number of instructions and "cycles" executed. These are the numbers you have to bring down. The CPI (Cycles Per Instruction) is also given, this number indicates whether your code is well scheduled or not.

Benchmarks

In order to help you evaluate the effectiveness of the optimizations you introduce into the compiler, we provide you with the following set of benchmarks. The evaluation will be carried on 4 progams from this suite, and other 2 programs which are not made public before the deadline. You can download the whole suite including input files for all programs here. You can easily run them: cat input/qsort.input | risc-emu qsort.risc.

maze.eins: a random maze generator.
matrix-arrays.eins: Array multiplication using integer arrays. The program expects two matrices on standard input and prints the result of their multiplication. An input file is provided as example.
qsort.eins: An imperative quicksort implementation.
bignums.eins: the factorial function for big integers (can compute, say, the factorial of 1000).
prim.eins: An implementation of Prim's algorithm for minimal spanning tree. An example input file for this benchmark.
word-freq.eins: This program reads a text terminated by the word END and prints each word and the number of times it appears in the text. The list of words is presented in alphabetical order

Be sure to provide enough memory when running them in the emulator, as some of them can be pretty resource-hungry. You can specify the amount of memory available with the -mem option: risc-emu -mem=1000000 prog.risc.

How to organize yourself

For this project, we strongly encourage you to use some kind of version-control system like CVS, Subversion, Meta-CVS, darcs, Open CM or arch. This will help you work with your group partner, and ease the merging of our changes with yours (we will certainly have to fix some bugs in our code as the course progresses).

You will probably have to write some simple eins programs, in order to test your optimizations. To ease the development of such programs, you can use the scala-mode of emacs, or try the nicest small editor around: SciTE. Under support directory in the einsc archive you will find a properties file which you can use to make Scite recognise eins programs. It does simple syntax highlighting, code folding, error message parsing, auto-completion, compilation and run (last two options depend on having bin/einsc and risc-emu in your path).

Optimization hints

Here we will suggest some optimizations and give some hints on how to implement them.

Dead Code Elimination (DCE)

Given the current framework, Dead code elimination is probably one of the optimizations that will be easiest to implement.

To implement dead code elimination you can start by looking at how Rewrite (in LinearScan) works. By writing a new version of rewrite for DeadCode you can probably quite quickly and easily get working code. (our implementation of DeadCode is 56 lines.)

What dead code elimination needs to do is to traverse the code of each basic block backwards and create new code for the basic block by only emitting those instructions that are live. (And those instructions that have a side effect such as memory writes, function calls, returns, syscall, break, and instructions manipulating the stack. You can use the method Instruction.sideEffect():boolean to check whether an instruction has side effects.

To get really interesting results you probably have to apply the dead code elimination iteratively. This can easily be done by having flag changed in DeadCode which is initialized to false and set to true if some instruction is eliminated (not re-emitted). Then you can apply DCE iteratively in Main:

      DeadCode dc = new DeadCode();
      while(dc.changed) dc.removeDeadCode(cfgs[i]);

So DeadCode.java could look something like:

    package einsc.backend.opt;
    import java.util.*;

    class DeadCode {
        private Liveness liveness;
        private CFG cfg;
        public boolean changed = true;

        public void removeDeadCode(CFG cfg) {
          this.cfg=cfg;
          liveness = new Liveness(cfg);
	  changed = false;
          cfg.visitRPO(new Rewrite());
        }

        private class Rewrite extends CFG.EmptyVisitor {
	  public void pre(CFG.Node node) {
            removeDeadInstrs(node);
	  }
        }

        private void removeDeadInstrs(CFG.Node bb) {
        // SOME INTERESTING STUFF HERE
        //  WHICH TRAVERSES A BASIC BLOCK BACKWARDS
        //  EMITTING ONLY NON-DEAD INSTRUCTIONS 
        //  AS DETERMINED BY liveness.liveOut(bb)
        //  AND sideEffect()
        }
    }

Inlining

Function calls being quite expensive, inline expansion of functions can be a very worthwhile optimisation. It is especially interesting because it opens the door to many further optimisations, like constant folding and propagation, dead code elimination, etc. To implement inlining, one has to solve basically two problems:

decide when inlining can and should be performed,
perform it.

There is no general method to decide when to perform inlining, and therefore one has to rely on heuristics. A first (necessary) condition to perform inlining is that the function being called must be known at compilation time, otherwise it is not possible to know what to inline. Then, one must take care not to make the compiler loop by trying to repeatedly inline recursive functions; we therefore suggest that you do not consider recursive calls as candidates for inlining. Finally, the size of the function called must somehow be taken into account, to avoid code size explosion. For this project, you can simply write a function to compute the "size" of a tree (for example by counting the number of nodes it contains) and only inline functions whose body is smaller than some value. We propose that you compute this size after performing inlining inside of the body of a function.

Notice that many other heuristics than the one suggested can be used. With our solution, a function will either be inlined every time it is used, or never. This is due to the fact that we base our decision entirely on the definition of the function, not on its use(s). More realistic heuristics take individual decisions for every uses of a function.

We suggest that you perform inlining on the tree directly, as this seems to be the easiest. That is, your inlining phase should be placed between type analysis and code generation. If you follow this recommendation, then the basic algorithm can be outlined as follows:

For every function call, start by checking if inlining can and should be performed, using the heuristic given above; if this is not the case, leave the call as is.
Otherwise, replace the function call by a block containing the following:
- one variable per argument of the function, each of which gets initialised by the expression given in the call,
- the code of the body of the function, where all symbols referring to the function arguments are replaced with the corresponding variable symbol.
While copying the body of the function, each time a variable declaration is encountered, its symbol must be cloned, and all uses of the variable which follow must use this new symbol. This is to preserve the invariant which says that every variable has a symbol different from all other.

Instruction scheduling

Since the risc emulator now simulates a pipeline, it might prove useful to perform a simple list scheduling algorithm on the final code. This way, some of the wasted cycles can be regained for useful computation.

I will describe a simplistic way to implement instruction scheduling. Your results might be better if you choose to spend some more time and improve or tune the algorithm. To implement the algorithm, here are the steps to take:

Create a dependency graph for the current basic block. The nodes will represent instructions, and edges between instructions denote a dependency between them.
Implement the list scheduling algorithm, as presented in the course. We'll use a ready queue and a pending map, in addition to the dependency graph

In order to create the dependency graph, you will need to design a simple graph class. I suggest to keep nodes as a map from Instruction objects to your Node objects, and two other maps, which map a Node object with a List of incoming, respectively outgoing Edges. Each Node will keep an Instruction and each Edge will keep an integer, denoting the weight of the edge.

The next step is to create the dependency graph. The simplest thing is to use two iterators in the block, and compare each two instructions: do they have any dependency? if yes, add an edge. If not, don't add anything. Here is a suggestion of what instructions need an edge in the graph (i1 is the instruction appearing earlier than i2 in the block):

RaW (Read after Write dependency): if i2 uses some registers that i1 defines. The weight of the edge should be the cost of the first instruction -- since the pipeline will stall until the result of i1 is ready, before continuing with i2
WaR (Write after Read dependency): if i2 defines something that i1 uses. Here the cost will be 1, as the pipeline will not stall, but we'd like to keep the instructions in this order, as switching them might change the behavior of the program
WaW (Write after Write): if what i1 and i2 define is the same register. Again, we don't want to switch them, as this might corrupt the behavior
Two memory operations should not generally be swapped. Memory operations are LDW, STW, PSH, POP. It is clear that switching PSH and POPs around will change the meaning of the program. However, two LDWs can be switched, and no dependency is necessary between them. It is a little tougher to say if to STWs can be switched: if we're sure they reference different memory locations, it's not necessary to add an edge. However, this is not easy to say, without some "may-alias" analysis (the simple idea of comparing the base register and the offset constant is not working, because either two different registers might point to the same memory address, or the same register might be changed between the two STW)
Call instructions (CALL, JSR, RET) should never be moved around, so they basically depend on everything else. The explanation is that, since only live registers are saved before a call (and restored immmediately after), one cannot move an arbitrary instruction before a call: its result register might not be saved, and its contents would be lost. (this assumes that instruction scheduling is done really late, after Frame has rewritten calls).

You may want to relax some of the dependencies described earlier, but in general you need special care/analysis in order to do that. Of course, that will generally pay off in less stalls.

A traversal of the graph is needed in order to compute the numbers attached to each node. These heuristic numbers give the maximum weight of a path from the current node to a leaf node. Thus, when selecting nodes to be emitted, we'll prefer the ones with such high numbers, because, hopefully, they are the most expensive to compute -- scheduling them early gives time for the successors to become ready. Different schemes for selection can be tried, and see which one gives the best results.

Once the dependency graph is ready, all you have to do is implement the algorithm given on the slides. This is not difficult, but it might need some attention. I suggest to use a ready queue for the nodes that can be scheduled right away, and a pending map which contains instructions that need to wait some cycles before their operands are ready (a map from Node to int). The algorithm proceeds as following:

take a node from ready (if there are more nodes, select the "best" one according to the heuristic you chose) and emits it to the new Code. If no node is ready, just skip this round, and call tick
all it's successors are moved to pending, if not already there, and their delay is set to the weight of the corresponding edge (or, if the node was already there, the maximum between what was there before, and the weight).
the node just emitted is removed from the graph
a tick method is called, which signifies that a cycle is gone, and some bookkeeping is needed: all the nodes from pending will have their delay decreased by one (if any reaches 0, it is removed from pending), and all nodes in the graph which have no predecessors and are not in pending are moved to ready

This is basically it. As a suggestion, implementing a "dot" print method in the dependency graph will help you debug. To see how to generate dot files, take a look at how it is done in CFG.

Peephole optimization

Peephole optimization is a very easy and simple idea: sometime the generated code is really stupid, and a sequence of instructions can be replaced by a single one that is more effective. For example, a POP and then a PSH of the same variable can be replaced by a LDW of the same register from the address in the stack. Such code might get generated, and studying some of your risc code might give you new ideas. Also, at such a point, you could remove useless instructions like ADDI, R0, R0, constant, or perform strength reduction by replacing multiplications with powers of 2 by left shifts.

What all these optimizations have in common is that they can be performed without much analysis. It is enough to inspect a very small sequence of code (usually 2 or 3 instructions) in order to replace is with better code.

Allocating local variables to registers

While this is not a classical optimization in the sense we've seen so far, it can provide dramatical increase in performance. Right now, all local variables inside a function are stack allocated. We could change this and allocate them in registers. Since functions are usually small, and we have many general-purpose registers available (28 of them), there is not a big danger of introducing spills with this change. This way we'll save a memory operation at each access to the local variable, which can be signifficant, since local variables are keen on appearing inside loops.

To do this change, the simplest thing is to change the Generator and modify the handling of VarDefs, and a few other nodes.

First, we need to add a field in the Symbol class, let's call it reg, which serves the same purpose as offset: it remembers the register to which the symbol was allocated during code generation. We note that offset and reg are mutually exclusive: a variable cannot be allocated both to a register and on the stack! I suggest to assert this at places where you access or modify reg, so that the eventual bugs will show up easier.
Next, we make change the method caseVarDef in the Gen class, so that instead of calling gen on the variable subtree, it saves a register number in the symbol of tree.variable.
Next, we have to update all places in the generator where local variables will be accessed, and check that correct code will be generated. It turns out that there are only two places where we need to change something: caseIdent and caseAssign.
- caseIdent will need to be changed so that when the value of a register-allocated variable symbol is needed, it is simply copied into targetReg. This is simply one extra test. Note that if loadAddress flag is true, the variable cannot be register allocated (since a register cannot be referenced). It might be a good idea to assert this as well.
- caseAssign will need to take care of assignments to a register allocated variable. This can be done by testing whether tree.var is an instance of Ident, and if its symbol is register-allocated. If so, assignment proceeds with copying the right-hand expression to the register corresponding to tree.var. Otherwise, assignment proceeds as before.

Stregth reduction

Multiplications and divisions are usually costly, but in some cases they can be made more efficient by replacing them with an equivalent fast operation. For example, multiplications by powers of 2 are equivalent to left-shifting the operand the same number of positions as the power. The same goes for divisions. Multiplications by 4 are quite freqent now in computing indexes into arrays. This might prove a worthwhile optimization.

Loop optimizations

Mutable arrays have made eins programs more likely to contain while-loops. This makes it interesting to apply code motion and try to optimize the use of induction variables. Inner loops almost always contain some invariant computations for array indexes, that would be beneficial to be placed outside the loop.

Arrays are bound checked, and each array index is going to imply a number of costly operations: loads from memory for each dimension, and two compares per index. Some of this bounds checks could be moved oustide the loop, and checked only at the entry of the loop. Such an optiimzation is described in Muchnick's book at page 452. Note that such an optimization needs a representation that keeps array indexing explicit. In the current compiler that is true for the abstract syntax tree, but not for the control flow graph. It is likely that such a transformation would need to either do it on the tree, and modify the code generator to not add bounds-checking code for the cases proved to be correct, or extend the intermediate language with another fake instruction, like PJSR and PCALL, which indexes arrays. That instruction would be later rewritten by a transformation like in Frame.java

Constant propagation and folding

The following three optimisations are relatively easy to implement, and can provide a modest but interesting improvement in code quality.

Copy propagation: When the contents of some register R1 is copied into another register R2, replace uses of R2 by uses of R1. This can turn some "move" instructions (usually implemented using OR or ADD instructions in our architecture) into dead code, to be eliminated later.
Constant propagation: When a (small enough) constant is loaded into a register, replace uses of that register by the constant itself. Combined with dead code elimination, this can produce better code for constructions like x + 1 (two instructions instead of three as now).
Constant folding: Replace constant expressions by their value.

These three optimisations can be performed directly on the control-flow graph, but constant folding can also be performed on the tree, before code generation. All of them can make some code dead, and should therefore be run before dead code elimination.

Leaf functions

Currently, leaf functions (functions containing no function call) save the LNK register on entry, and restore it on exit. This is unnecessary, and can be optimised.