Overview
During the excercices you will be able to fulfil a small project -
a compiler for a simple language called "Expressive". The specification
of the language is available from here: expressive.ps
(Updated 21.05.03)
You may work in groups of up to 3 persons.
Submitting the interpreter
Updated 08/07/03
The deadline for the last part of the progect - the interpreter - is
18:00 on Friday, 11/07/03. The submissions
should be mailed to
mihaylov@lampsun1.epfl.ch and should consist of the files
expressive.lex , expressive.cup ,
Analyzer.java and Interpreter.java . Please send only
plain text files and make sure that your submission is complete, that it contains
the latest/final/stable version of your files. There will be no possibility to
change files that have been sent by mistake with the proper ones.
Use subject line of the form:
[Compilation A] ## - your text here
where ## is the two digit number of your group, e.g. 05, 14, 23.
For example:
 [Compilation A] 00 - our interpreter
Please respect the required format and don't send messages with free text
subject. Include the names of your colleagues in the body of your message
and their emails in the Cc: field of your submission mail. Only send
your submissions to the email address specified above.
Compiler evaluation results
The grades from the compiler project evaluation will be mailed to at
least one representative from each group. The maximum score is 20. If you
have any questions regarding your grade or project submission, please
contact Nikolay Mihaylov,
INR-321, tel. 36864.
Some common problems exhibited by your compilers are:
- Some projects don't even compile without modifications to the source files.
This is a problem of utmost importance and should be fixed first.
- Incorrect parser/scanner. Some compilers cannot parse completely legal
programs. Most of the problems were already discussed in the intermediate
project evaluation results section below. You should try to correct these,
especially if you are going to implement an interpreter too.
- Inconsistent processing of the Abstract Syntax Tree (AST) - storing the
operand of the '!' operation in the
left field of the
Tree.Op node and then processing the right ;
not expanding the unary minus operation -expr to
0 - expr by keeping one of the operands null and then
failing to process this special case. Both result in a NullPointerException
either in Analyzer.java or Generator.java .
- Every tree node should have a type. The type assigned to the tree nodes
of statements should be
Tree.NoType . This is not so important
for Expressive programs but on the other hand, the tree will not be correctly
attributed.
- Every
Tree.Ident node should have the correct symbol,
even when it's in a readint/readbool statement.
- Using the following instruction sequence to implement the '!=' operation:
code for tree.left
code for tree.right
SUB
PUSH 0
CMPE
It has exactly the opposite efect. Instead try this one:
code for tree.left
code for tree.right
CMPE
PUSH 0
CMPE
- The generated code should terminate by emitting the
BREAK
instruction.
- Not generating code for the true and false literals.
- Emitting the boolean negation seguence:
code for the operand
PUSH 0
CMPE
without generating the code for evaluating the operand in the case
of the boolean '!' operation.
Note: When you need to negate the value at the top of the stack you can
use Code.emitNegate() which emits the code sequence from above.
- Reversing the order of evaluation of the operands for the '<' and '<='
operations. Since evaluation order was not specified, this is not strictly an
error. You can, however, use a code sequence to obtain the result, while
preserving evaluation order - first the left operand, then the right one.
For the '<' operation it would look like this:
code for tree.left
code for tree.right
CMPLE
PUSH 0
CMPE
Intermediate project evaluation results
Here follows a list of the most frequent problems. If your implementation
exhibits any of them, please, try to correct them before moving to the
analysis phase.
- You should not change the names of the constants in the
OpCodes interface. If you have done so, it is strongly
recommended that you download the original
OpCodes.java and
modify your expressive.cup accordingly.
- The unary minus expression
-e should result in an AST
eqivalent to the expression 0 - e . There is no unary
minus operation in the AST.
- Pay attention to the precedence of operations as defined in the
expressive specification.
- Don't forget to strip off the quotes from the strings. The double qoutes
that enclose a string are not part of the string and should be removed
either by the scanner or by the parser.
- Don't forget to treat comments at least to the extent described in the
specification (no nested comments).
- Use
Tree.NO_STATS when an empty array of statements
is required.
Setup instructions
Download the package expressive.tar.gz
and unpack it with the command:
tar xzf expressive.tar.gz
This will create a directory expressive containing all
the relevant files.
The Makefile relies on some scripts that reside in the
/home/iclamp/soft/bin directory, therefore you need to add
it to your PATH variable:
setenv PATH /home/iclamp/soft/bin:$PATH
If you don't want to type this every time you log, in you can add the same
command to your ~/.cshrc file.
Now you can change into the expressive directory
and build the project with the command gmake .
At any point you can build the project with the gmake
command and it will only compile the files that have been modified since
the last successful compilation. Sometimes things may go wrong. If you
exhibit strange compilation errors try the gmake clean command.
This will remove all the files that have been generated by the build process.
Then you can try and build the project. You can also use
gmake force which will performed the two steps at once.
Setup at home/notebook computer
- You will need some Unix operating system - Linux, BSD, Solaris, etc.
- Download and
install Java2SE. What you need is the
Software Development Kit (SDK). Make sure that you have the Java compiler
javac in you path or as a symbolic link in the appropriate directory. -
- Donwload and unpack the
expressive_at_home.tar.gz archive.
- Change to the
expressive directory and build the project.
Note: Under Linux use tar and
make . Under BSD and Solaris use gtar and
gmake .
Documentation
The source files in the project framework are extensively documented with
javadoc style comments. To create the documentation
run (g)make doc . Then use your favourite browser to open
javadoc/index.html . Use (g)make doc-clean to delete
the documentation files. We strongly encourage you to write your comments
in the same style, wherever that is appropriate.
Writing the Lexical Scanner
The first thing that you need to do is to identify all the tokens.
Read carefully the language specification and make a list of all
the tokens that you need to pass to the next stage of the compiler.
Assign an integer constant to each token. Complete the file
src/expressive/Tokens.java with your tokens.
Once you do this, you may want to edit
src/expressive/ScannerTest.java and add the code
to the toString function to get a symbolic
representation for each token.
To generate the code for the scanner we will use a tool called JLex.
This tool reads a file that specifies the regular expression
corresponding to each token. Edit the file
src/spec/expressive.lex by adding your own regular
expressions. Follow the pattern from the file.
Once you have successfuly built the scanner you can test it with:
./scan.sh filename
Note: If you have problems compiling or running the scanner try to force a
new rebuild. Pay attention to the output that you get on the console.
If there are problems with the JLex specification file the build process
won't be aborted but this will prevent you from successfully
compiling the project.
Writing JLex regular expressions
Taken from JLex
user manual.
2.3.2 Regular Expressions
Regular expressions should not contain any white space,
as white space is interpreted as the end of
the current regular expression.
There is one exception; if (non-newline) white
space characters appear from within double quotes,
these characters are taken to represent themselves.
For instance, `` '' is interpreted as a blank space.
The alphabet for JLex is the Ascii character set,
meaning character codes between 0 and 127 inclusive.
The following characters are metacharacters, with
special meanings in JLex regular expressions.
? * + | ( ) ^ $ . [ ] { } " \
Otherwise, individual characters stand for themselves.
ef Consecutive regular expressions represents their concatenation.
e|f The vertical bar (|) represents an option between
the regular expressions that surround it, so
matches either expression e or f.
The following escape sequences are recognized and expanded:
\b | Backspace |
\n | newline |
\t | Tab |
\f | Formfeed |
\r | Carriage return |
\ddd | The character code corresponding to the number formed by three octal digits ddd |
\xdd | The character code corresponding to the number formed by two hexadecimal digits dd |
\udddd | The Unicode character code corresponding to the number formed by four hexidecimal digits dddd. |
\^C | Control character |
\c | A backslash followed by any other character c matches itself |
$ The dollar sign ($) denotes the end of a line.
If the dollar sign ends a regular expression, the expression
is matched only at the end of a line.
. The dot (.) matches any character except the newline,
so this expression is equivalent to [^\n] .
"..." Metacharacters lose their meaning within
double quotes and represent themselves.
The sequence \" (which represents the
single character " ) is the only exception.
{name} Curly braces denote a macro expansion,
with name the declared name of the associated macro.
* The star (*) represents Kleene closure and matches
zero or more repetitions of the preceding regular expression.
+ The plus (+) matches one or more repetitions of the
preceding regular expression, so e+ is equivalent to ee*.
? The question mark (?) matches zero or one repetitions
of the preceding regular expression.
(...) Parentheses are used for grouping
within regular expressions.
[...] Square backets denote a class of characters and match
any one character enclosed in the backets. If the first character following
the left bracket ([) is the up arrow (^), the set is negated and
the expression matches any character except those enclosed in the backets.
Different metacharacter rules hold inside the backets, with the
following expressions having special meanings:
{name} | Macro expansion |
a - b | Range of character codes
from a to b to be included in character set |
"..." | All metacharacters within double quotes lose
their special meanings. The sequence \"
(which represents the single character " )
is the only exception. |
\ | Metacharacter following backslash(\)
loses its special meaning |
For example, [a-z] matches any lower-case letter,
[^0-9] matches anything except a digit, and
[0-9a-fA-F] matches any hexadecimal digit.
Inside character class brackets, a metacharacter following a backslash
loses its special meaning. Therefore, [\-\\] matches a dash
or a backslash. Likewise ["A-Z"] matches one of the three
characters A, dash, or Z. Leading and trailing dashes in a character class
also lose their special meanings, so [+-] and [-+]
do what you would expect them to (ie, match only '+' and '-').
Writing the Syntactic Analyzer - Parser
Download the package expressive_parser.tar.gz
and unpack it with the command:
tar xzf expressive_parser.tar.gz
This will create a directory expressive_parser containing all
the relevant files.
Copy your version of expressive.lex to
expressive_parser/src/spec/ and ScannerTest.java to
expressive_parser/src/expressive . Do not copy Tokens.java ,
we are going to use the information in a different way.
Open the file expressive_parser/src/spec/expressive.cup .
This file contains the specification of the parser. First complete the list
of terminal symbols, i.e. tokens, with the ones from your existing Tokens.java
file. Again, DO NOT copy it in src/expressive !. JavaCUP, the tool that
we use to generate the code for the parser, will create it along with
Parser.java .
Note how one of the already defined terminals, namely ID, is declared as returning
a value of type String. Most tokens do not have any value. Their presence
is all that matters. For identifiers however, we need the name of the
variable denoted by it, hence the String value. The non-terminal symbols are
also declared of type String. We are going to use this to print the program that
is being parsed.
At the right hand side of each production we can annotate each symbol with
name like in ... expr:e1 ... Then we can use e1 as the value
associated with that symbol in the semantic action part of the production -
the code enclosed in {: ... :} .
In order to disambiguate the grammar you need to assign precedence to the
arithmetic and logic operations. The operation defined with each
precedence clause have the same level of precedence.
The ones on the next line would have higher precedence.
Finally, after building the project, you can test your parser with the command
./parse.sh filename
Take a look at the hints below to get some help
with the parser.
Abstract Syntax Tree (AST)
Generating the AST
At this stage you should have the parser working. Now copy this
Makefile over the existing one (this fixes
some problems) and Tree.java ,
Visitor.java ,
Type.java and
Symbol.java in the
src/expressive directory. Tree.java contains the
classes that represent the different nodes in the AST. Visitor.java
is the means by which the tree processors implements the visitor pattern
(hence the name). Type.java and Symbol.java are
empty for now. We only need them to compile Tree.java . In order
to build the project with these four files, open the Makefile
and uncomment their names by removing the leading '#' .
Using the AST
The mapping from the expressive grammar to the AST should be clear in
most cases, especially if you have read the documentation of the AST.
A few cases might need some explanation though:
- Strings - Strings should be represented as literals,
even though they can not occur at all the places where a literal can occur.
- The boolean constants - True and false should be represented by the
Op-node with empty subtrees and the TRUE or FALSE token as the operator.
- Statements - In the AST statements are stored in an array of Stat.
While parsing you don't know how many statements there will be in
a sequence of statements. The way to solve this is by accumulating
the statements in a list (java.util.List) and then convert this list to
an array (Tree.toArray(ss)) when inserting the statements in the
body of, e.g., an
If or Prog .
Pretty printing
After you complete the AST generation, you should
check if the produced AST corresponds to the source program.
One way to do it is to perform the so called pretty printing.
Since the AST represents the program, only in a more concise way, then
it should be possible to generate the source of the program from the AST.
In order to do that, download the files
Printer.java and
PrinterTest.java in the
src/expressive directory and add them in the list of source
files in the Makefile . Also download the printer launch script
print.sh into the main project
directory. After building the project you can test it with
./print.sh filename
Note: If you take a look at Printer.java
(highly recommended) you can get a sneak peek at how the
Visitor pattern is used.
You'll get more details on that during the lectures.
It will also give you some more hints on how the AST is used.
Interpreter
Another, even more rigorous way of testing your parser is to try to
interpret the program represented by the AST without generating code. Just
download interpreter.jar
and expr-int.sh in the main
project directory. Then run the interpreter with
./expr-int.sh filename
Please, make sure that the program you try to run not only parses
correctly but also complies with the context dependant rules: typing rules,
using variables only after assigning value to them, etc.
Some hints for the parser.
Documentation
First of all, don't forget to generate the documentation
for the given code, it is especially helpful for the AST.
You should also take a look at the JavaCUP Manual.
Tokens
The parser given in expressive.cup uses the tokens
- PERIOD = "."
- ASSIGN = ":="
- IF = "if"
- LBRACE = "{"
- RBRACE = "}"
- PLUS = "+"
- ID, identifier example: "foo"
You must make sure that your scanner and your parser uses the same tokens.
Either by changing your scanner (expressive.lex) or by changing expressive.cup.
The parser generator will define the symbol EOF so you should
not add this to the list of terminals.
Statements
To generate the tree when an array of statements is required you need
a dynamic datastucture like list. You will need to declare the
non terminal stat as having type java.util.List
non terminal java.util.List stms;
Then you can obtain an array of Stat by calling a static method in class Tree:
public static Stat[]
toArray(java.util.List list);
If you need to denote the absence of any statements (like in an if
without and else) you should use
Tree.NO_STATS declared in class Tree as:
static final public Stat[] NO_STATS =
new Stat[0];
Precedence
Read up on how precedence is handled in the JavaCUP Manual or in the slides.
One tricky part is unary minus, but there is an example in the
JavaCUP Manual
and in Appel Grammar 3.37 (page 75) that explains how one can use a virtual token, e.g., UMINUS and contextual precedence through the %prec directive to get the right precedence for unary minus.
There is an example in the
JavaCUP Manual:
precedence left PLUS, MINUS;
precedence left TIMES, DIVIDE, MOD;
precedence left UMINUS;
...
expr ::= MINUS expr:e
{: RESULT = new Integer(0 - e.intValue()); :}
%prec UMINUS
Note:
Here, the production is declared as having the precedence of UMINUS.
Hence, the parser can give the MINUS sign two different precedences,
depending on whether it is a unary minus or a subtraction operation.
Positions
For the sake of better error reporting, the tokens generated by the scanner
are augmeneted with the position in the source file at which the token started.
Each tree node also includes a field for the position of the construct it
represents. You need to assign the position of each tree node on the basis
of the tokens it consists of. The position of the token is encoded in the
right field of the java_cup.runtime.Symbol class.
you can access it through the name you have assigned to the token by appending
right to it. The following fragment illustrates how this is done
in the case of addition expression:
expr ::= expr:e1 PLUS:tok expr:e2
{: RESULT = new Op(tokright, e1, OpCodes.PLUS, e2); :}
The name of the PLUS token is tok ,
hence the position of the token is referred to as tokright .
Operations encoding
In the tree nodes Op and IO there is a public field
op which specifies which operation the tree node corresponds to.
An easy way to encode them would be to use the value of the corresponding
tokens. In our case, since the values of the tokens are generated
automatically (by JavaCUP), this is not a very good idea. The
OpCodes interface enumerates
constants for all the possible operations. Donwload it in your
src/expressive directory and change refrences to
Tokens with OpCodes at the appropriate places.
You also need to update your Makefile to take
OpCodes.java into account during the build process. Just add
the following line at the appropriate place:
SOURCES += $(EXPR)/OpCodes.java
Note: In some cases, the left, right or both subtrees of
an Op node are meaningless. You can initialize them to
null .
Type and Name Analysis
First you need the updated Expressive language specification
(expressive.ps). The specification of the
type and name analysis phase can be found in sections 3 (Scope) and 4 (Types).
The lecture notes on semantic analysis will also be very helpful.
Note: You are not obliged to implement the full rules for scopes.
You may only consider the existence of a global scope.
The type and name analysis is performed simultaneously by the
analyzer. The skeleton of the analyzer is in
Analyzer.java. Your task is to complete the
visitor methods for the IO, Assign, If, Op and
Literal tree nodes. You will also need updated versions of
Type.java and
Symbol.java. Simply overwrite the ones
that you already have. Finally, in order to test the analyzer, you'll need
AnalyzerTest.java and
analyze.sh. Before you compile the project,
update your Makefile by adding the following lines:
SOURCES += $(EXPR)/Analyzer.java
SOURCES += $(EXPR)/AnalyzerTest.java
You can test your analyzer using the analyzer launch script
./analyze.sh filename
Note: If the program that you're trying to analyze has no errors
it should print nothing.
Code generation
Files
In this project we aim at generating code for the expressive virtual
machine (EVM). It is a simple, stack based VM. That is to say, that the
instructions operate on values at the top of a stack.To facilitate the code
generation there is a class
Code. It contains methods to emit instructions, work with code labels, etc.
To get started you will need to download the following files:
Then you need to add the following lines to your Makefile :
SOURCES += $(EXPR)/Code.java
SOURCES += $(EXPR)/Generator.java
SOURCES += $(EXPR)/GeneratorTest.java
EVM = $(EXPR)/evm
SOURCES += $(EVM)/Instructions.java
SOURCES += $(EVM)/EVM.java
Important: An eaiser way to do it is to download the
expressive_final.tar.gz
package. Then you will need to copy your expressive.lex ,
expressive.cup , ScannerTest.java and
Analyzer.java to the appropriate directories.
The Makefile from this package does not need any
modifications.
Working with the generator
You can launch the code generator using the gen.sh script:
./gen.sh [-run] [-trace] filename [outputfilename]
If no option is provided, it will output the assembler code corresponding
to the compiled program. The assembler output is very useful for debugging
your code generator. It can also be used to store compiled Expressive programs
because of the lack of a binary format for the EVM. To execute the generated
codethe generated code use the -run option. You can augment it
the -trace option to see some debugging information from the EVM.
Note: The code generator will only be activated if there were
no errors found during the previous phases of the compiler. If you get error
messages, first correct the errors and then you can test your generator.
Working with the assembler
The assembly files can be processed with the EVM Assembler using the following
command:
./asm.sh [-run] [-trace] filename [outputfilename]
Without any option, it will simply print the same program or report any
errors found. With the option -run it will execute the code
if there are no errors. The option -trace only takes effect
if used together with -run and will force the EVM to print
some debugging information during the program execution.
Note: You can find the source code of the assembler in the
./src/expressive/evm/asm/ directory.
The EVM assembler recognizes all instructions from the
expressive.evm.Instructions interface. The instructions should be
written in lower case letters. Some instructions (goto, jz, jnz )
point to other instructions in the code area. The location can be specified
by an integer number giving the position of the instruction in the code area
or by a label. Each instruction can be prefixed by a label followed by colon
(: ).
The assembler recognizes the memsize n
directive which sets the size of the data memory area. Roughly speaking,
it declares the number of variables that you can use. Don't forget to set
it to the appropriate value if you write programs in EVM assembler.
For a small example of an assembly program look at
simple_loop.asm
Useful information
The most useful source of information about the EVM and generating code for it,
are the lecture notes available from the course Web page.
An authoritative source of information is the source code
of the different classes. A good place to start is
Code.java , which
you will have to use fot emitting instructions.
evm/Istructions.java
contains the instructions recognized by the EVM. If you are not interested
in the details you can still refer to the documentation generated from the code.
If you have problems generating the documentation you can
read it from here. Once again,
take a look at
Code and
Instructions .
Writing an interpreter for the AST
As with the other phases of the project you are first going to need
additional files Interpreter.java
and InterpreterTest.java .
You will also need to include them in the Makefile :
SOURCES += $(EXPR)/Interpreter.java
SOURCES += $(EXPR)/InterpreterTest.java
Then you need the script
interpret.sh . You can use it in
the following way:
./interpret.sh filename [outputfilename]
where oputputfilename is the optional name of a file
in which the results from interpreting the program will be saved. If you omit it,
then you'll see the output on the display.
The implementation of the interpreter should be pretty straightforward. You
have to pay attention to the following details:
|