LAMP: LAMP/Teaching: Compilation A

Compilation A
Summer Semester 2003

School of Computer and Communication Sciences
Programming Methods Laboratory (LAMP)

Home
Staff
Research
Publications
Events
Teaching
  CompilationA
    Project
       Lexer
       Parser
         AST
         Hints
       Analyzer
       Generator
       Interpreter
       Test Files
Jobs

Overview

During the excercices you will be able to fulfil a small project - a compiler for a simple language called "Expressive". The specification of the language is available from here: expressive.ps (Updated 21.05.03)

You may work in groups of up to 3 persons.

Submitting the interpreter

Updated 08/07/03

The deadline for the last part of the progect - the interpreter - is 18:00 on Friday, 11/07/03. The submissions should be mailed to mihaylov@lampsun1.epfl.ch and should consist of the files expressive.lex, expressive.cup, Analyzer.java and Interpreter.java. Please send only plain text files and make sure that your submission is complete, that it contains the latest/final/stable version of your files. There will be no possibility to change files that have been sent by mistake with the proper ones. Use subject line of the form:

[Compilation A] ## - your text here

where ## is the two digit number of your group, e.g. 05, 14, 23. For example:

[Compilation A] 00 - our interpreter

Please respect the required format and don't send messages with free text subject. Include the names of your colleagues in the body of your message and their emails in the Cc: field of your submission mail. Only send your submissions to the email address specified above.

Compiler evaluation results

The grades from the compiler project evaluation will be mailed to at least one representative from each group. The maximum score is 20. If you have any questions regarding your grade or project submission, please contact Nikolay Mihaylov, INR-321, tel. 36864.

Some common problems exhibited by your compilers are:

Some projects don't even compile without modifications to the source files. This is a problem of utmost importance and should be fixed first.
Incorrect parser/scanner. Some compilers cannot parse completely legal programs. Most of the problems were already discussed in the intermediate project evaluation results section below. You should try to correct these, especially if you are going to implement an interpreter too.
Inconsistent processing of the Abstract Syntax Tree (AST) - storing the operand of the '!' operation in the left field of the Tree.Op node and then processing the right; not expanding the unary minus operation -expr to 0 - expr by keeping one of the operands null and then failing to process this special case. Both result in a NullPointerException either in Analyzer.java or Generator.java.
Every tree node should have a type. The type assigned to the tree nodes of statements should be Tree.NoType. This is not so important for Expressive programs but on the other hand, the tree will not be correctly attributed.
Every Tree.Ident node should have the correct symbol, even when it's in a readint/readbool statement.

Using the following instruction sequence to implement the '!=' operation:

    code for tree.left
    code for tree.right
    SUB
    PUSH   0
    CMPE

It has exactly the opposite efect. Instead try this one:

    code for tree.left
    code for tree.right
    CMPE
    PUSH   0
    CMPE

The generated code should terminate by emitting the BREAK instruction.
Not generating code for the true and false literals.
Emitting the boolean negation seguence: code for the operand PUSH 0 CMPE without generating the code for evaluating the operand in the case of the boolean '!' operation.
Note: When you need to negate the value at the top of the stack you can use Code.emitNegate() which emits the code sequence from above.
Reversing the order of evaluation of the operands for the '<' and '<=' operations. Since evaluation order was not specified, this is not strictly an error. You can, however, use a code sequence to obtain the result, while preserving evaluation order - first the left operand, then the right one. For the '<' operation it would look like this: code for tree.left code for tree.right CMPLE PUSH 0 CMPE

Intermediate project evaluation results

Here follows a list of the most frequent problems. If your implementation exhibits any of them, please, try to correct them before moving to the analysis phase.

You should not change the names of the constants in the OpCodes interface. If you have done so, it is strongly recommended that you download the original OpCodes.java and modify your expressive.cup accordingly.
The unary minus expression -e should result in an AST eqivalent to the expression 0 - e. There is no unary minus operation in the AST.
Pay attention to the precedence of operations as defined in the expressive specification.
Don't forget to strip off the quotes from the strings. The double qoutes that enclose a string are not part of the string and should be removed either by the scanner or by the parser.
Don't forget to treat comments at least to the extent described in the specification (no nested comments).
Use Tree.NO_STATS when an empty array of statements is required.

Setup instructions

Download the package expressive.tar.gz and unpack it with the command:

tar xzf expressive.tar.gz

This will create a directory expressive containing all the relevant files.

The Makefile relies on some scripts that reside in the /home/iclamp/soft/bin directory, therefore you need to add it to your PATH variable:

setenv PATH /home/iclamp/soft/bin:$PATH

If you don't want to type this every time you log, in you can add the same command to your ~/.cshrc file.

Now you can change into the expressive directory and build the project with the command gmake.

At any point you can build the project with the gmake command and it will only compile the files that have been modified since the last successful compilation. Sometimes things may go wrong. If you exhibit strange compilation errors try the gmake clean command. This will remove all the files that have been generated by the build process. Then you can try and build the project. You can also use gmake force which will performed the two steps at once.

Setup at home/notebook computer

You will need some Unix operating system - Linux, BSD, Solaris, etc.
Download and install Java2SE. What you need is the Software Development Kit (SDK). Make sure that you have the Java compiler javac in you path or as a symbolic link in the appropriate directory.
Donwload and unpack the expressive_at_home.tar.gz archive.
Change to the expressive directory and build the project.

Note: Under Linux use tar and make. Under BSD and Solaris use gtar and gmake.

Documentation

The source files in the project framework are extensively documented with javadoc style comments. To create the documentation run (g)make doc. Then use your favourite browser to open javadoc/index.html. Use (g)make doc-clean to delete the documentation files. We strongly encourage you to write your comments in the same style, wherever that is appropriate.

Writing the Lexical Scanner

The first thing that you need to do is to identify all the tokens. Read carefully the language specification and make a list of all the tokens that you need to pass to the next stage of the compiler.

Assign an integer constant to each token. Complete the file src/expressive/Tokens.java with your tokens.

Once you do this, you may want to edit src/expressive/ScannerTest.java and add the code to the toString function to get a symbolic representation for each token.

To generate the code for the scanner we will use a tool called JLex. This tool reads a file that specifies the regular expression corresponding to each token. Edit the file src/spec/expressive.lex by adding your own regular expressions. Follow the pattern from the file.

Once you have successfuly built the scanner you can test it with:

./scan.sh filename

Note: If you have problems compiling or running the scanner try to force a new rebuild. Pay attention to the output that you get on the console. If there are problems with the JLex specification file the build process won't be aborted but this will prevent you from successfully compiling the project.

Writing JLex regular expressions

Taken from JLex user manual.

2.3.2 Regular Expressions

Regular expressions should not contain any white space, as white space is interpreted as the end of the current regular expression. There is one exception; if (non-newline) white space characters appear from within double quotes, these characters are taken to represent themselves. For instance, `` '' is interpreted as a blank space.

The alphabet for JLex is the Ascii character set, meaning character codes between 0 and 127 inclusive.

The following characters are metacharacters, with special meanings in JLex regular expressions.

? * + | ( ) ^ $ . [ ] { } " \

Otherwise, individual characters stand for themselves.

ef Consecutive regular expressions represents their concatenation.

e|f The vertical bar (|) represents an option between the regular expressions that surround it, so matches either expression e or f.

The following escape sequences are recognized and expanded:

\b Backspace

\n newline

\t Tab

\f Formfeed

\r Carriage return

\ddd The character code corresponding to the number formed by three octal digits ddd

\xdd The character code corresponding to the number formed by two hexadecimal digits dd

\udddd The Unicode character code corresponding to the number formed by four hexidecimal digits dddd.

\^C Control character

\c A backslash followed by any other character c matches itself

$ The dollar sign ($) denotes the end of a line. If the dollar sign ends a regular expression, the expression is matched only at the end of a line.

. The dot (.) matches any character except the newline, so this expression is equivalent to [^\n].

"..." Metacharacters lose their meaning within double quotes and represent themselves. The sequence \" (which represents the single character ") is the only exception.

{name} Curly braces denote a macro expansion, with name the declared name of the associated macro.

* The star (*) represents Kleene closure and matches zero or more repetitions of the preceding regular expression.

+ The plus (+) matches one or more repetitions of the preceding regular expression, so e+ is equivalent to ee*.

? The question mark (?) matches zero or one repetitions of the preceding regular expression.

(...) Parentheses are used for grouping within regular expressions.

[...] Square backets denote a class of characters and match any one character enclosed in the backets. If the first character following the left bracket ([) is the up arrow (^), the set is negated and the expression matches any character except those enclosed in the backets. Different metacharacter rules hold inside the backets, with the following expressions having special meanings:

{name}	Macro expansion
a - b	Range of character codes from a to b to be included in character set
"..."	All metacharacters within double quotes lose their special meanings. The sequence `\"` (which represents the single character `"`) is the only exception.
\	Metacharacter following backslash(\) loses its special meaning

For example, [a-z] matches any lower-case letter, [^0-9] matches anything except a digit, and [0-9a-fA-F] matches any hexadecimal digit. Inside character class brackets, a metacharacter following a backslash loses its special meaning. Therefore, [\-\\] matches a dash or a backslash. Likewise ["A-Z"] matches one of the three characters A, dash, or Z. Leading and trailing dashes in a character class also lose their special meanings, so [+-] and [-+] do what you would expect them to (ie, match only '+' and '-').

Writing the Syntactic Analyzer - Parser

Download the package expressive_parser.tar.gz and unpack it with the command:

tar xzf expressive_parser.tar.gz

This will create a directory expressive_parser containing all the relevant files.

Copy your version of expressive.lex to expressive_parser/src/spec/ and ScannerTest.java to expressive_parser/src/expressive.

Do not copy Tokens.java, we are going to use the information in a different way.

Open the file expressive_parser/src/spec/expressive.cup. This file contains the specification of the parser. First complete the list of terminal symbols, i.e. tokens, with the ones from your existing Tokens.java file. Again, DO NOT copy it in src/expressive!. JavaCUP, the tool that we use to generate the code for the parser, will create it along with Parser.java.

Note how one of the already defined terminals, namely ID, is declared as returning a value of type String. Most tokens do not have any value. Their presence is all that matters. For identifiers however, we need the name of the variable denoted by it, hence the String value. The non-terminal symbols are also declared of type String. We are going to use this to print the program that is being parsed.

At the right hand side of each production we can annotate each symbol with name like in ... expr:e1 ... Then we can use e1 as the value associated with that symbol in the semantic action part of the production - the code enclosed in {: ... :}.

In order to disambiguate the grammar you need to assign precedence to the arithmetic and logic operations. The operation defined with each precedence clause have the same level of precedence. The ones on the next line would have higher precedence.

Finally, after building the project, you can test your parser with the command ./parse.sh filename

Take a look at the hints below to get some help with the parser.

Abstract Syntax Tree (AST)

Generating the AST

At this stage you should have the parser working. Now copy this Makefile over the existing one (this fixes some problems) and Tree.java, Visitor.java, Type.java and Symbol.java in the src/expressive directory. Tree.java contains the classes that represent the different nodes in the AST. Visitor.java is the means by which the tree processors implements the visitor pattern (hence the name). Type.java and Symbol.java are empty for now. We only need them to compile Tree.java. In order to build the project with these four files, open the Makefile and uncomment their names by removing the leading '#'.
Using the AST

The mapping from the expressive grammar to the AST should be clear in most cases, especially if you have read the documentation of the AST. A few cases might need some explanation though:
- Strings - Strings should be represented as literals, even though they can not occur at all the places where a literal can occur.
- The boolean constants - True and false should be represented by the Op-node with empty subtrees and the TRUE or FALSE token as the operator.
- Statements - In the AST statements are stored in an array of Stat. While parsing you don't know how many statements there will be in a sequence of statements. The way to solve this is by accumulating the statements in a list (java.util.List) and then convert this list to an array (Tree.toArray(ss)) when inserting the statements in the body of, e.g., an If or Prog.
Pretty printing

After you complete the AST generation, you should check if the produced AST corresponds to the source program. One way to do it is to perform the so called pretty printing. Since the AST represents the program, only in a more concise way, then it should be possible to generate the source of the program from the AST. In order to do that, download the files Printer.java and PrinterTest.java in the src/expressive directory and add them in the list of source files in the Makefile. Also download the printer launch script print.sh into the main project directory. After building the project you can test it with

./print.sh filename

Note: If you take a look at Printer.java (highly recommended) you can get a sneak peek at how the Visitor pattern is used. You'll get more details on that during the lectures. It will also give you some more hints on how the AST is used.
Interpreter

Another, even more rigorous way of testing your parser is to try to interpret the program represented by the AST without generating code. Just download interpreter.jar and expr-int.sh in the main project directory. Then run the interpreter with

./expr-int.sh filename

Please, make sure that the program you try to run not only parses correctly but also complies with the context dependant rules: typing rules, using variables only after assigning value to them, etc.

Some hints for the parser.

Documentation

First of all, don't forget to generate the documentation for the given code, it is especially helpful for the AST.
You should also take a look at the JavaCUP Manual.
Tokens

The parser given in expressive.cup uses the tokens
- PERIOD = "."
- ASSIGN = ":="
- IF = "if"
- LBRACE = "{"
- RBRACE = "}"
- PLUS = "+"
- ID, identifier example: "foo"
You must make sure that your scanner and your parser uses the same tokens. Either by changing your scanner (expressive.lex) or by changing expressive.cup.

The parser generator will define the symbol EOF so you should not add this to the list of terminals.
Statements

To generate the tree when an array of statements is required you need a dynamic datastucture like list. You will need to declare the non terminal stat as having type java.util.List

    non terminal java.util.List stms;
Then you can obtain an array of Stat by calling a static method in class Tree:

    public static Stat[] toArray(java.util.List list);

If you need to denote the absence of any statements (like in an if without and else) you should use Tree.NO_STATS declared in class Tree as:

    static final public Stat[] NO_STATS = new Stat[0];
Precedence

Read up on how precedence is handled in the JavaCUP Manual or in the slides.

One tricky part is unary minus, but there is an example in the JavaCUP Manual and in Appel Grammar 3.37 (page 75) that explains how one can use a virtual token, e.g., UMINUS and contextual precedence through the %prec directive to get the right precedence for unary minus.

There is an example in the JavaCUP Manual:
```
    precedence left PLUS, MINUS;
    precedence left TIMES, DIVIDE, MOD;
    precedence left UMINUS;

       ...

    expr ::= MINUS expr:e             
             {: RESULT = new Integer(0 - e.intValue()); :} 
             %prec UMINUS
    
```
Note: Here, the production is declared as having the precedence of UMINUS. Hence, the parser can give the MINUS sign two different precedences, depending on whether it is a unary minus or a subtraction operation.
Positions

For the sake of better error reporting, the tokens generated by the scanner are augmeneted with the position in the source file at which the token started. Each tree node also includes a field for the position of the construct it represents. You need to assign the position of each tree node on the basis of the tokens it consists of. The position of the token is encoded in the right field of the java_cup.runtime.Symbol class. you can access it through the name you have assigned to the token by appending right to it. The following fragment illustrates how this is done in the case of addition expression:

expr ::= expr:e1 PLUS:tok expr:e2 {: RESULT = new Op(tokright, e1, OpCodes.PLUS, e2); :}

The name of the PLUS token is tok, hence the position of the token is referred to as tokright.
Operations encoding

In the tree nodes Op and IO there is a public field op which specifies which operation the tree node corresponds to. An easy way to encode them would be to use the value of the corresponding tokens. In our case, since the values of the tokens are generated automatically (by JavaCUP), this is not a very good idea. The OpCodes interface enumerates constants for all the possible operations. Donwload it in your src/expressive directory and change refrences to Tokens with OpCodes at the appropriate places. You also need to update your Makefile to take OpCodes.java into account during the build process. Just add the following line at the appropriate place:

SOURCES += $(EXPR)/OpCodes.java

Note: In some cases, the left, right or both subtrees of an Op node are meaningless. You can initialize them to null.

Type and Name Analysis

First you need the updated Expressive language specification (expressive.ps). The specification of the type and name analysis phase can be found in sections 3 (Scope) and 4 (Types). The lecture notes on semantic analysis will also be very helpful.

Note: You are not obliged to implement the full rules for scopes. You may only consider the existence of a global scope.

The type and name analysis is performed simultaneously by the analyzer. The skeleton of the analyzer is in Analyzer.java. Your task is to complete the visitor methods for the IO, Assign, If, Op and Literal tree nodes. You will also need updated versions of Type.java and Symbol.java. Simply overwrite the ones that you already have. Finally, in order to test the analyzer, you'll need AnalyzerTest.java and analyze.sh. Before you compile the project, update your Makefile by adding the following lines:

    SOURCES		+= $(EXPR)/Analyzer.java
    SOURCES		+= $(EXPR)/AnalyzerTest.java

You can test your analyzer using the analyzer launch script

./analyze.sh filename

Note: If the program that you're trying to analyze has no errors it should print nothing.

Code generation

Files

In this project we aim at generating code for the expressive virtual machine (EVM). It is a simple, stack based VM. That is to say, that the instructions operate on values at the top of a stack.To facilitate the code generation there is a class Code. It contains methods to emit instructions, work with code labels, etc.

To get started you will need to download the following files:

File	Destination
`Code.java`	`./src/expressive`
`Generator.java`	`./src/expressive`
`GeneratorTest.java`	`./src/expressive`
`Instructions.java`	`./src/expressive/evm`
`EVM.java`	`./src/expressive/evm`
`gen.sh`	`./`

Then you need to add the following lines to your Makefile:

    SOURCES		+= $(EXPR)/Code.java
    SOURCES		+= $(EXPR)/Generator.java
    SOURCES		+= $(EXPR)/GeneratorTest.java

    EVM			 = $(EXPR)/evm

    SOURCES		+= $(EVM)/Instructions.java
    SOURCES		+= $(EVM)/EVM.java

Important: An eaiser way to do it is to download the expressive_final.tar.gz package. Then you will need to copy your expressive.lex, expressive.cup, ScannerTest.java and Analyzer.java to the appropriate directories. The Makefile from this package does not need any modifications.

Working with the generator

You can launch the code generator using the gen.sh script:
```
    ./gen.sh [-run] [-trace] filename [outputfilename]
```
If no option is provided, it will output the assembler code corresponding to the compiled program. The assembler output is very useful for debugging your code generator. It can also be used to store compiled Expressive programs because of the lack of a binary format for the EVM. To execute the generated codethe generated code use the -run option. You can augment it the -trace option to see some debugging information from the EVM.

Note: The code generator will only be activated if there were no errors found during the previous phases of the compiler. If you get error messages, first correct the errors and then you can test your generator.
Working with the assembler

The assembly files can be processed with the EVM Assembler using the following command:
```
    ./asm.sh [-run] [-trace] filename [outputfilename]
```
Without any option, it will simply print the same program or report any errors found. With the option -run it will execute the code if there are no errors. The option -trace only takes effect if used together with -run and will force the EVM to print some debugging information during the program execution.

Note: You can find the source code of the assembler in the ./src/expressive/evm/asm/ directory.

The EVM assembler recognizes all instructions from the expressive.evm.Instructions interface. The instructions should be written in lower case letters. Some instructions (goto, jz, jnz) point to other instructions in the code area. The location can be specified by an integer number giving the position of the instruction in the code area or by a label. Each instruction can be prefixed by a label followed by colon (:).

The assembler recognizes the memsize n directive which sets the size of the data memory area. Roughly speaking, it declares the number of variables that you can use. Don't forget to set it to the appropriate value if you write programs in EVM assembler.

For a small example of an assembly program look at simple_loop.asm
Useful information

The most useful source of information about the EVM and generating code for it, are the lecture notes available from the course Web page.

An authoritative source of information is the source code of the different classes. A good place to start is Code.java, which you will have to use fot emitting instructions. evm/Istructions.java contains the instructions recognized by the EVM. If you are not interested in the details you can still refer to the documentation generated from the code. If you have problems generating the documentation you can read it from here. Once again, take a look at Code and Instructions.

Writing an interpreter for the AST

As with the other phases of the project you are first going to need additional files Interpreter.java and InterpreterTest.java. You will also need to include them in the Makefile:

    SOURCES		+= $(EXPR)/Interpreter.java
    SOURCES		+= $(EXPR)/InterpreterTest.java

Then you need the script interpret.sh. You can use it in the following way:

    ./interpret.sh filename [outputfilename]

where oputputfilename is the optional name of a file in which the results from interpreting the program will be saved. If you omit it, then you'll see the output on the display.

The implementation of the interpreter should be pretty straightforward. You have to pay attention to the following details:

Values are represented with the primitive integer type int. Booleans should be mapped to integers with 0 corresponding to false and any other number for true. For printing or reading boolean variables you have to use the "true" or "false" constants.
The method public int interpret(Tree tree) returns the result of interpreting the tree node tree. If tree is an expression this is the result of evaluating the expression; if it is a statement, then the value is undefined.
You can return a value by assigning it to the result field of the Interpreter class as in this code snippet:
```
    result = interpret(tree.left) + interpret(tree.right);
```
You can store values of variables in the value field of the corresponding symbol:
```
    tree.symbol.value = interpret(rhs);
```
You can use the public static int pow(int x, int n) method to compute x^n.
The interpreter only works if there have been no errors in the previous phases. Don't expect any output in case there were errors in the Expressive program you are trying to run. Correct the errors first and then try again.

[an error occurred while processing this directive]

Last modified: Monday, 01-Mar-2004 14:19:57 CET

Overview

Submitting the interpreter

Compiler evaluation results

Intermediate project evaluation results

Setup instructions

Setup at home/notebook computer

Documentation

Writing the Lexical Scanner

Writing JLex regular expressions

2.3.2 Regular Expressions

Writing the Syntactic Analyzer - Parser

Abstract Syntax Tree (AST)

Generating the AST

Using the AST

Pretty printing

Interpreter

Some hints for the parser.

Documentation

Tokens

Statements

Precedence

Positions

Operations encoding

Type and Name Analysis

Code generation

Files

Working with the generator

Working with the assembler

Useful information

Writing an interpreter for the AST