CS 322 HM LANGUAGES & COMPILER DESIGN PSU HW 4 HomeWork 4, L1 Compiler, 200 Points, +25 for Case Statement (5/10/2005) Due Date: Subject: Monday May 16h, 2005 Design, implement, and test a compiler for language L1, L1 Compiler General Rules: Implement Homework in C or C++. Any flavor of C or C++ that runs on a PSU computer will do. Hand in [a listing of all] C/C++ source files plus include files, if any, plus all inputs, and their generated outputs. Write your name, the HW number, completion date, and the current PSU term into the header of each source file. Summary and main work: (110/200) Design, implement, test, and debug a complete compiler for the L1 language. Start with L1 Parse implemented earlier, and add semantic actions. These actions a) fill the symbol table and allocate storage (for declarations) and b) generate code for executable statements. (60/200) Make sure your generated L1 machine code is humanly readable. Test your L1 compiler by showing the sources and corresponding generated L1 assembly mnemonics for a large variety of L1 programs which you devise. A separate design document is not needed. Testing: (30/200) Write numerous L1 source programs, compile them, and show for each source program the corresponding L1 assembly output. Using small tests you write, show individually each syntax construct, e.g. While Statement, If Statement, For Statement, etc. Also write numerous larger programs that each show sequences of syntactic constructs and show nested statements. During these test runs, sometimes turn on (in 50% of all cases) the ability to trace the recursive calling sequences, so that you can visually follow each step of the parse. Commenting: (10/200) While a separate design document is not needed, al code must be suitable commented. Each data structures, algorithm, and methods of the L1 Compiler code generation phase must be explained in the C/C++ source. L1 Assembler Source: (50/200) Your L1 compiler either emits binary object, in which case you also implement a disassembler. Or the L1 compiler generates L1 assembler ASCII output, which requires that you also implement an assembler, in order to generate the final binary code. Note that no linking is necessary. The loading phase for execution by the simulator consists of just reading the object file into the L1 Simulator’s code space. For this homework, show numerous L1 source programs with their corresponding L1 assembler output. L1 Binary Code, an Early Decision: Decide whether binary L1 object code is generated by the L1 compiler or by an L1 assembler. You have 2 choices. 1. If your compiler emits L1 binary code directly, then you must implement a disassembler that shows, one line at a time per instruction, each L1 opcode plus all applicable operands in human readable format. Only the operands actually used by an instruction are shown in the disassembly. Advantage of this scheme: your compiler output is immediately executable. Disadvantage: You have to implement a disassembler. 2. If your L1 compiler emits L1 assembler mnemonics in ASCII, then you must implement an L1 assembler that reads L1 assembly programs, generated by the compiler, and converts them into 1 HW 4 CS 322 HM LANGUAGES & COMPILER DESIGN PSU HW 4 executable binary L1 Arch instructions. Advantage of this method: your compiler output is immediately readable by humans. Disadvantage: you have to implement an assembler to generate binary code. Symbol Table: Implement a symbol table scheme similar to the one discussed in CS321; even a simple array of symbol-table structures is acceptable and very effective, as it simplifies your implementation, while limiting the maximum size/complexity of L1 source programs somewhat. Make sure that you can dump the symbol table in readable form, so you are sure you implement what you think you design. Dumping symbol table data as an option of the compile step is crucial for your ability to debug your design. Show in your homework that symbol table dumping works. Memory Allocation: Decide early, whether your target system has byte- or word-addressable memory. For simplicity I recommend you just use words, each32-bits in length, which is likely your real host computer’s arithmetic power. In that case, even if you allocate a single character, it will waste 75% of the bits in a word, but this is a simple run-time environment. Storage saving is not a high priority. I recommend that memory starts at address 0. Each subsequent word has an address 1 greater than the previous. This will allow easy inclusion of reals (AKA floats) later. It also makes it easy to add a word-oriented run-time stack later. If your source program defines more storage than your target computer supports, the compiler needs to emit a suitable error message. Obviously the program will not execute/simulate in that case. Code Generation: Decide the type of object code. You can define a low-level, genuine binary machine format, or you can use a higher-level abstraction expressed as C/C++ structs. The former approach requires you to pack all opcode and operand bits in binary form into bytes or words, but enables the use of the same run-time memory space for data as well as instructions. This approach is closer to a real machine, is a bit harder to implement, and is suitable for execution on a real machine. The second approach is a bit easier to implement, is suitable for simulation, where the machine instructions do not need to follow an exact binary format. I recommend you to use the second approach. A draw-back of the second approach is that the space for data (perhaps an array of words) and the space for instructions (an array of high-level C++ structures) be separate; this is inconsistent, however, with real computer architectures. If you use the higher-level abstraction, then define your machine instructions in terms of C/C++ structs. One of the structure fields is the opcode, another is the first operand, yet another field is the second operand, and a fourth field is the result. Note that all operand and result types can be structs and unions themselves, as a source operand may be a register, a memory location, possibly even of one of several sizes, or it can be an in-line operand (a literal value encode din the instruction space. Ultimately, thus you define quadruples, which bit-by-bit generally do not match real machine instructions. Define, whether your generated code is emitted directly onto a binary file or into a buffer. The file will be read later by the simulator, and constitutes the load step at the start of execution. The buffer will have to be flushed by the end of coder generation, so the information is available to the executing hardware, or the simulator. There are cases of forward references, that dictate -for a single-pass compilation process- that the object code be buffered anyway, since addresses or other operands have to be back-patched when known at a future step. The fixed size of your code buffer will limit the 2 HW 4 LANGUAGES & COMPILER DESIGN CS 322 HM PSU HW 4 maximum amount of object code somewhat, as the distance between the first forward reference and its final resolution is now fixed by the size of the code buffer. Statements: Each kind of high-level source statement has a natural code pattern, which when used consistently, allows for correct code generation, no matter how complex the nesting structure of the source program is. Use one logical module in your compiler for each unique high-level source construct. For example, have one unique function for If Statements, another for Loop Statements, and a third one for For Statements, etc. There will be cases, in which one module generates a branch that targets another instruction which is itself a branch, resulting in non-optimal code. Do not worry about non-optimal code. Some object code patterns for High-Level source statements are listed here: Loop Statement, source: loop <stmt0> end loop; Loop Statement object code pattern: l1: <object code for stmt0> j l1 If Statement, source: if <cond> then <stmt1> else <stmt2> end if; If Statement object code pattern: <object code for cond> jf else <object code for stmt1> j endif else: <object code for stmt2> endif: While Statement, source: while <cond> loop <stmt> end loop; While Statement object code pattern: w1: <object code for cond> jf endw <object code for stmt> j w1 endw: L1 Case Statement: (25/225) If you implement Case Statements L1, you can get up to 25 points of extra credit just for this one homework. But the 25 points of extra credit are granted only after the simulator correctly executes them. 3 HW 4