Optimisations
The evaluation of your projects was done on a set of six benchmark programs, 4 of which you already know:
Evaluation was split into the following components: correctness, speedup and number of optimizations. Together, they sum to 100 points. Correctness counted for 30% of your grade, speedup for 60%, and the remaining 10% were given to almost everyone (a weight was assigned to each optimization, but no more than 10 points out of 100 were given, no matter how many you implemented). I would have liked to assign more weight to correctness, but that would have hurt most grades, as only a single group delivered a perfectly working compiler (one which passed all our tests). Each failed test cost you two points if it was from the benchmark suite, and only one if it was from our (secret) test suite. The speedup grade was the average of the speedups over all six benchmarks. If your compiler crashed or delivered wrong results on one of the benchmarks, its speedup was counted as 0.0%. If your optimized code performed worse (negative speedup), I again counted it as 0.0%. This way, I hope I did not favour incorrect implementations. The only group (group 11) which had a fully correct compiler got an additional 5-point bonus.

Here are the results of benchmarking your code.
I marked in red the benchmarks which failed by either crashing or producing incorrect results.
The linear scan register allocator problem is fixed by this patch. Alternatively, you can download an archive containing the two modified files. The patch fixes the problem by reserving three registers for spilling instead of only two.
The framework will crash in Liveness if a control flow graph happens to contain an empty terminal node.
The new Analyzer.java has fixed
the problem with assignment. It should reject programs like
The aim of this project is to optimize the code produced by the eins compiler as much as possible. As a starting point, we give you a new version of a working implementation of the eins compiler. Both the language and the compiler underwent significant changes since the previous version.

The eins language has been extended with mutable arrays. Any scalar type (Int, class or functional types) can be used to declare an array of that type, as in the following example:

    var x: Int[][] = new Int[10][20];
    var funs: ((Int) => Int)[] = new ((Int) => Int)[2];

As you can see, the syntax is borrowed from Java. Arrays, unlike objects, can be modified after their creation, using the let construct:
    let x[0][1] = 10;
    let funs[0] = f;
Arrays are laid out in memory as follows: the first words contain the size of each dimension; they are followed by the zone of memory holding the elements themselves, initialized to zero:

    d1, d2, .., dn, 0, 0, .., 0

The dimensions are used by the compiler to generate the bounds-checking code, but this is transparent to the user (the data starts at index 0, as usual). Objects can contain fields of array types, which need to be initialized along with the other fields, at object creation. However, individual elements of the array can be assigned later.

    class Matrix {
        val data: Int[][];
        val m: Int;
        val n: Int;
    }

    def newMatrix(m: Int, n: Int): Matrix =
        new Matrix(new Int[m][n], m, n);

    var m: Matrix = newMatrix(5, 5);
    let m.data[0][0] = 10;

The back-end of the compiler (what comes after type analysis) is now structured as follows:
The output of the last phase is an executable RISC program, as usual. It does require a new version of the interpreter, though, as explained in the next section.
You will find that some of these phases now appear in separate packages. It is a good idea to structure your code around the existing packages (for example, put your analyses and optimisations in the einsc.backend.opt package).
The entry point of the compiler is now
The current options are
Dereferencing
The new compiler is also different in that it outputs textual labels to refer to functions and locations in the code, instead of numerical offsets. This makes it easier to debug the generated code, but requires a new version of the RISC interpreter. This new version is installed in INF, but people working on their own computers need to download it. In the labs it is installed in the iclamp account, so you can use it like this:

This new RISC interpreter also includes the following new instructions, which make code generation easier:
The new interpreter simulates the execution of a simple pipeline. The emulator supports several simple timing models, which can be used to decide when instructions are dispatched, to simulate e.g. pipelines. Each instruction type has a cycle count, which can be found in the source code of the emulator, in Opcode.java. It is given by the integer in the constructor call, as in:

    public final static Opcode XORIU =
        new Opcode(RISC.OpRaRbUc, RISC.XORIU, "xoriu", REG_A, REG_B, 1);
    public final static Opcode LDW =
        new Opcode(RISC.OpRaRbIc, RISC.LDW, "ldw", REG_A, REG_B, 4);

Here we see that xoriu costs 1 cycle, and ldw costs 4 cycles. In your optimizer you can find this cost by reading Opcode.codes[instruction].cost for a given instruction.
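For instance, an optimizer or scheduler will often want the total cost of a straight-line sequence of instructions. A minimal sketch, assuming instructions are available as opcode indices (the actual representation in your basic blocks may differ):

    // Hedged sketch: sum the cycle counts of a sequence of instructions.
    // Assumes `ops` holds opcode indices usable with Opcode.codes; adapt
    // this to however your basic blocks actually store instructions.
    static int totalCost(int[] ops) {
        int cycles = 0;
        for (int op : ops)
            cycles += Opcode.codes[op].cost;   // per-opcode cost from Opcode.java
        return cycles;
    }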
For the project we will use a dual out-of-order pipeline. This pipeline works as follows: an instruction (

The emulator will ensure that the functional effect of the program is as if all instructions were executed sequentially, in order. The pipeline model is somewhat simplified, to make it easier to write a scheduler: branches have very little effect on the timing, all branches are predicted correctly all the time, and instructions are always available and prefetched in time (i.e. no branch misprediction stalls, and no instruction cache stalls). Also, there will be no stalls on reads from memory that is being written to by preceding instructions, even if these instructions are scheduled later. This is not a very realistic pipeline model, but it should be fairly easy to schedule for, and it resembles real hardware enough to be interesting.
You can give the flag
Finally, this interpreter has a new option, called
In order to help you evaluate the effectiveness of the optimizations you introduce into the compiler, we provide you with the following set of benchmarks. The evaluation will be carried out on 4 programs from this suite, and on 2 other programs which will not be made public before the deadline. You can download the whole suite, including input files for all programs, here. You can easily run them:
Be sure to provide enough memory when running them in the
emulator, as some of them can be pretty resource-hungry. You can
specify the amount of memory available with the
For this project, we strongly encourage you to use some kind of version-control system like CVS, Subversion, Meta-CVS, darcs, Open CM or arch. This will help you work with your group partner, and ease the merging of our changes with yours (we will certainly have to fix some bugs in our code as the course progresses).
You will probably have to write some simple eins programs in order to test your optimizations. To ease the development of such programs, you can use the scala-mode of

Here we will suggest some optimizations and give some hints on how to implement them.

Given the current framework, dead code elimination is probably one of the easiest optimizations to implement. You can start by looking at how Rewrite (in LinearScan) works; by writing a new version of Rewrite for DeadCode you can probably get working code quite quickly and easily (our implementation of DeadCode is 56 lines). What dead code elimination needs to do is to traverse the code of each basic block backwards, and create new code for the basic block by emitting only those instructions that are live, plus those instructions that have a side effect, such as memory writes, function calls, returns, syscall, break, and instructions manipulating the stack. You can use the method sideEffect() to recognize the latter (see the skeleton below).
To get really interesting results you will probably have to apply dead code elimination iteratively. This can easily be done by having a flag:

    DeadCode dc = new DeadCode();
    while (dc.changed)
        dc.removeDeadCode(cfgs[i]);

So DeadCode.java could look something like:

    package einsc.backend.opt;

    import java.util.*;

    class DeadCode {
        private Liveness liveness;
        private CFG cfg;
        public boolean changed = true;

        public void removeDeadCode(CFG cfg) {
            this.cfg = cfg;
            liveness = new Liveness(cfg);
            changed = false;
            cfg.visitRPO(new Rewrite());
        }

        private class Rewrite extends CFG.EmptyVisitor {
            public void pre(CFG.Node node) {
                removeDeadInstrs(node);
            }
        }

        private void removeDeadInstrs(CFG.Node bb) {
            // SOME INTERESTING STUFF HERE
            // WHICH TRAVERSES A BASIC BLOCK BACKWARDS
            // EMITTING ONLY NON-DEAD INSTRUCTIONS
            // AS DETERMINED BY liveness.liveOut(bb)
            // AND sideEffect()
        }
    }

Function calls being quite expensive, inline expansion of functions can be a very worthwhile optimisation. It is especially interesting because it opens the door to many further optimisations, like constant folding and propagation, dead code elimination, etc. To implement inlining, one basically has to solve two problems:
There is no general method to decide when to perform inlining, so one has to rely on heuristics. A first (necessary) condition is that the function being called must be known at compilation time; otherwise it is not possible to know what to inline. Then, one must take care not to make the compiler loop by repeatedly inlining recursive functions; we therefore suggest that you do not consider recursive calls as candidates for inlining. Finally, the size of the called function must somehow be taken into account, to avoid code-size explosion. For this project, you can simply write a function to compute the "size" of a tree (for example by counting the number of nodes it contains) and only inline functions whose body is smaller than some value (a sketch of such a size function is given below). We propose that you compute this size after performing inlining inside the body of the function itself. Notice that many heuristics other than the one suggested can be used. With our solution, a function will either be inlined every time it is used, or never, because we base our decision entirely on the definition of the function, not on its use(s). More realistic heuristics take individual decisions for every use of a function. We suggest that you perform inlining directly on the tree, as this seems to be the easiest approach; that is, your inlining phase should be placed between type analysis and code generation. If you follow this recommendation, then the basic algorithm can be outlined as follows:
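As for the "size" function mentioned above, a minimal sketch; the Tree class and its children() accessor are hypothetical stand-ins for whatever the real AST classes offer:

    // Hedged sketch of the size heuristic: count the nodes of a tree.
    // `Tree` and `children()` are hypothetical; adapt them to the real AST.
    static int size(Tree tree) {
        int nodes = 1;                         // this node
        for (Tree child : tree.children())
            nodes += size(child);              // plus everything below it
        return nodes;
    }

A function would then be a candidate for inlining only if size(body) stays below some threshold that you tune experimentally.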
Since the RISC emulator now simulates a pipeline, it might prove useful to perform a simple list scheduling algorithm on the final code. This way, some of the wasted cycles can be regained for useful computation. I will describe a simplistic way to implement instruction scheduling; your results might be better if you spend some more time improving or tuning the algorithm. To implement the algorithm, here are the steps to take:
In order to create the dependency graph, you will need to design a simple graph class. I suggest keeping nodes as a map from
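One possible shape for such a class, as a hedged sketch; keying nodes by instruction index is an assumption, not necessarily the representation the sentence above was about to suggest:

    // Hedged sketch of a minimal dependency graph for one basic block.
    // Nodes are instruction indices within the block; an edge i -> j
    // means instruction j must wait for instruction i.
    import java.util.*;

    class DepGraph {
        private final Map<Integer, Set<Integer>> succs = new HashMap<>();

        void addNode(int instr) {
            succs.putIfAbsent(instr, new HashSet<>());
        }

        void addEdge(int from, int to) {
            addNode(from);
            addNode(to);
            succs.get(from).add(to);           // `to` depends on `from`
        }

        Set<Integer> successors(int instr) {
            return succs.getOrDefault(instr, Collections.emptySet());
        }
    }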
The next step is to create the dependency graph. The simplest thing is to use two iterators in the block, and compare each pair of instructions: do they have any dependency? If yes, add an edge; if not, add nothing. Here is a suggestion of which instructions need an edge in the graph (i1 is the instruction appearing earlier than i2 in the block):
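The concrete list aside, the standard cases are read-after-write, write-after-read, write-after-write, and ordering between memory accesses. A hedged sketch of the pairwise test; def(), uses() and the memory predicates are hypothetical accessors, to be replaced by whatever your instruction class offers:

    // Hedged sketch: must i2 stay after i1?
    // def() returns the register written by an instruction, or -1 if none;
    // uses() returns the set of registers read. Both are hypothetical.
    static boolean dependsOn(Instr i1, Instr i2) {
        if (i1.def() >= 0 && i2.uses().contains(i1.def())) return true; // read after write
        if (i2.def() >= 0 && i1.uses().contains(i2.def())) return true; // write after read
        if (i1.def() >= 0 && i1.def() == i2.def()) return true;         // write after write
        // be conservative about memory: order any write against any access
        if (i1.writesMemory() && i2.touchesMemory()) return true;
        if (i2.writesMemory() && i1.touchesMemory()) return true;
        return false;
    }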
You may want to relax some of the dependencies described earlier, but in general you need special care or analysis in order to do that. Of course, it will generally pay off in fewer stalls. A traversal of the graph is needed in order to compute the numbers attached to each node. These heuristic numbers give the maximum weight of a path from the current node to a leaf node. Thus, when selecting nodes to be emitted, we prefer the ones with high numbers because, hopefully, they are the most expensive to compute: scheduling them early gives time for their successors to become ready. Different selection schemes can be tried, to see which one gives the best results.
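A hedged sketch of that traversal, reusing the DepGraph sketch above and taking an instruction's cycle count as its weight (an assumption; the slides may define the weight differently):

    // Hedged sketch: priority = maximum-weight path from a node to a leaf,
    // computed by a memoized depth-first traversal of the dependency graph.
    static int priority(DepGraph g, int instr, int[] cost,
                        Map<Integer, Integer> memo) {
        Integer known = memo.get(instr);
        if (known != null) return known;
        int heaviest = 0;
        for (int succ : g.successors(instr))
            heaviest = Math.max(heaviest, priority(g, succ, cost, memo));
        int p = cost[instr] + heaviest;        // own cost plus heaviest path below
        memo.put(instr, p);
        return p;
    }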
Once the dependency graph is ready, all you have to do is implement the algorithm given on the slides. This is not difficult, but it might need some attention. I suggest using a
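Whatever data structure that sentence was about to suggest, the core loop can be sketched as follows; here a priority queue ordered by the path weights computed above plays the role of the ready list (all names hypothetical):

    // Hedged sketch of the list-scheduling loop for one basic block.
    // preds[i] counts the not-yet-scheduled predecessors of instruction i;
    // prio holds the path weights computed by the traversal above.
    static List<Integer> schedule(DepGraph g, int[] preds,
                                  Map<Integer, Integer> prio, int n) {
        List<Integer> order = new ArrayList<>();
        PriorityQueue<Integer> ready =
            new PriorityQueue<>((a, b) -> prio.get(b) - prio.get(a));
        for (int i = 0; i < n; i++)
            if (preds[i] == 0) ready.add(i);   // instructions ready at the start
        while (!ready.isEmpty()) {
            int instr = ready.remove();        // highest-priority ready node
            order.add(instr);
            for (int succ : g.successors(instr))
                if (--preds[succ] == 0)        // its last predecessor was emitted
                    ready.add(succ);
        }
        return order;
    }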
This is basically it. As a suggestion, implementing a "dot"
print method in the dependency graph will help you debug. To see how to
generate
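The dot format itself is simple enough to sketch here; this walks the DepGraph sketched above and prints a graph that Graphviz's dot tool can render:

    // Hedged sketch: print the dependency graph in Graphviz "dot" syntax.
    // Render the output with, e.g.:  dot -Tps deps.dot -o deps.ps
    static void printDot(DepGraph g, int n, java.io.PrintWriter out) {
        out.println("digraph deps {");
        for (int i = 0; i < n; i++)
            for (int succ : g.successors(i))
                out.println("  i" + i + " -> i" + succ + ";");
        out.println("}");
    }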
Peephole optimization is a very simple idea: sometimes the generated code is really stupid, and a sequence of instructions can be replaced by a single one that is more effective. For example, a POP and then a PSH of the same variable can be replaced by a LDW of the same register from the address in the stack (see the sketch below). Such code might get generated, and studying some of your RISC code might give you new ideas. Also, at such a point, you could remove useless instructions like
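A hedged sketch of the POP-then-PSH rewrite mentioned above; the Instr accessors, the LDW constructor and the stack-pointer offset are all assumptions about the real instruction classes:

    // Hedged sketch of one peephole pattern: POP r; PSH r of the same
    // register reloads r from the stack without moving the stack pointer,
    // so the pair can be replaced by a single load.
    static List<Instr> peephole(List<Instr> code) {
        List<Instr> out = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            Instr cur = code.get(i);
            if (i + 1 < code.size()
                    && cur.isPop() && code.get(i + 1).isPsh()
                    && cur.reg() == code.get(i + 1).reg()) {
                out.add(Instr.ldw(cur.reg(), SP, 0)); // hypothetical constructor
                i++;                                  // skip the matched PSH
            } else {
                out.add(cur);
            }
        }
        return out;
    }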
What all these optimizations have in common is that they can be performed without much analysis: it is enough to inspect a very small sequence of code (usually 2 or 3 instructions) in order to replace it with better code. While this is not a classical optimization in the sense we've seen so far, it can provide a dramatic increase in performance.

Right now, all local variables inside a function are stack-allocated. We could change this and allocate them in registers. Since functions are usually small, and we have many general-purpose registers available (28 of them), there is not a big danger of introducing spills with this change. This way we save a memory operation at each access to a local variable, which can be significant, since local variables tend to appear inside loops.
To do this change, the simplest thing is to change the
Multiplications and divisions are usually costly, but in some cases they can be made more efficient by replacing them with an equivalent fast operation. For example, a multiplication by a power of 2 is equivalent to left-shifting the operand by the same number of positions as the power; the same goes for divisions. Multiplications by 4 are now quite frequent, for computing indexes into arrays, so this might prove a worthwhile optimization (see the sketch below).

Mutable arrays have made eins programs more likely to contain while-loops. This makes it interesting to apply code motion and to try to optimize the use of induction variables. Inner loops almost always contain some invariant computations for array indexes, which it would be beneficial to place outside the loop.
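For the multiplication case, a hedged sketch; the Instr accessors and the LSHI constructor are assumptions about the real classes. Division needs more care, since a plain arithmetic shift rounds differently for negative operands:

    // Hedged sketch: rewrite "MULI r, s, c", with c a power of two,
    // into "LSHI r, s, log2(c)", a left shift by the exponent.
    static Instr strengthReduce(Instr instr) {
        if (instr.isMulImmediate()) {
            int c = instr.immediate();
            if (c > 0 && (c & (c - 1)) == 0) { // c is a power of two
                int shift = Integer.numberOfTrailingZeros(c);
                return Instr.lshi(instr.def(), instr.src(), shift); // hypothetical
            }
        }
        return instr;                          // leave everything else alone
    }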
Arrays are bounds-checked, and each array access implies a number of costly operations: loads from memory for each dimension, and two compares per index. Some of these bounds checks could be moved outside the loop, and checked only at the entry of the loop. Such an optimization is described in Muchnick's book, at page 452. Note that such an optimization needs a representation that keeps array indexing explicit; in the current compiler that is true for the abstract syntax tree, but not for the control flow graph. It is likely that such a transformation would need either to be done on the tree, modifying the code generator so that it does not add bounds-checking code for the cases proved to be correct, or to extend the intermediate language with another fake instruction, like

The following three optimisations are relatively easy to implement, and can provide a modest but interesting improvement in code quality.
These three optimisations can be performed directly on the control-flow graph, but constant folding can also be performed on the tree, before code generation. All of them can make some code dead, and should therefore be run before dead code elimination.
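As an illustration of the tree variant of constant folding, a hedged sketch; BinOp, IntLit and their accessors are hypothetical stand-ins for the real AST classes:

    // Hedged sketch: fold binary operations whose operands are literals,
    // bottom-up, leaving everything else untouched.
    static Tree fold(Tree tree) {
        if (tree instanceof BinOp) {
            BinOp op = (BinOp) tree;
            Tree lhs = fold(op.lhs());         // fold the children first
            Tree rhs = fold(op.rhs());
            if (lhs instanceof IntLit && rhs instanceof IntLit) {
                int l = ((IntLit) lhs).value();
                int r = ((IntLit) rhs).value();
                switch (op.operator()) {
                    case '+': return new IntLit(l + r);
                    case '-': return new IntLit(l - r);
                    case '*': return new IntLit(l * r);
                    // '/' is left alone, to avoid folding a division by zero
                }
            }
            return new BinOp(op.operator(), lhs, rhs);
        }
        return tree;                           // literals, variables, etc.
    }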
Currently, leaf functions (functions containing no function
call) save the