Honor’s Thesis: Asynchronous Implementation of 8051 Microcontroller By: Ryan Mabry Advisor: Dr. Hao Zheng Abstract The synchronous 8051 microcontroller is a common processor found in many embedded systems. By using asynchronous design techniques, the performance of the 8051 microcontroller is increased. Through simulation and the use of existing synchronous design tools in the asynchronous design flow, a four-phase handshaking approach with a stoppable clock is simulated and then implemented. The asynchronous architecture added to the existing synchronous architecture includes an ALU wrapper, a controller wrapper, and a clocking element. The asynchronous design flow consists of functional simulation, synthesis of synchronous blocks, timing analysis, asynchronous wrapper design, and timing simulation. After implementation analysis, the asynchronous 8051 is 28.7% faster then the synchronous 8051 while only using 10% more area. 2 Acknowledgements I would like to thank Narender Hanchate for devoting a significant amount of his time in teaching me many of the tools that I used for this project. I would also like to thank Dr. Hao Zheng for providing me with a thesis opportunity that allowed me to exercise my hardware engineering design skills. 3 Table of Contents Chapter 1: Motivation 5 Chapter 2: Overview and Design 7 Chapter 3: Design Flow 12 Chapter 4: Design Implementation and Results 15 Chapter 5: Challenges and Conclusion 25 Appendix A: Divmul Instructions 27 Appendix B: Controller, RAM, and Trace.out Contents 28 Appendix C: Primetime and Buildgates Script Files 42 Appendix D: ALU and Controller VHDL Code 44 4 Chapter 1: Motivation Overview The 8051 microcontroller is a device found in many embedded systems. It can be found as a component in many analog to digital converters, pulse-width modulators, and bus interfaces. The 8051 microcontroller this project uses comes from the University of California’s Dalton Project.1 The microcontroller design found on the Dalton Project website is fully synchronous. The goal of this project is to develop an asynchronous version of the 8051 microcontroller and use synchronous design tools in the process. The divmul program from the Dalton Project website is used as a benchmark comparison between the synchronous and asynchronous controllers, and it is also used in verifying the asynchronous simulation results. The instructions executed by the divmul program are listed in Appendix A. Asynchronous Advantages There are many reasons to implement a design using asynchronous techniques. Asynchronous designs have the advantages over traditional synchronous designs of lower power consumption, no clock skew, better technology migration, and less global timing issues.2 There are numerous ways to implement an asynchronous circuit; these include fundamental mode Huffman circuits, burst-mode circuits, and Muller circuits. Asynchronous Disadvantages While asynchronous circuits may be faster and are not subject to some of the problems that synchronous circuits suffer from, they are much harder to design. While there are many tools for synchronous design out on the market, there are no complete design solution tools for asynchronous circuits. Since there is no global clock in 5 asynchronous designs, communication must be done through handshaking or other means. Since there are no tools that automatically implement handshaking protocol between communicating modules, if an asynchronous circuit makes extensive use of handshaking or other asynchronous logic, great care must be taken to ensure timing and data integrity. Asynchronous Design Flow with Synchronous Tools Even though complete asynchronous design solutions do not exist, one can make use of existing synchronous design tools in the asynchronous design flow. In this flow, each design is partitioned into blocks, and each block is controlled by a local clock or handshaking protocol. Simulation tools like Modelsim can be used to simulate the functionality of the asynchronous design. Once the design has been functionally simulated, it must now be implemented. While an entire asynchronous circuit cannot be designed in synchronous tools, individual logic blocks can. During the design of the asynchronous 8051 chip, VHDL was used to generate certain blocks of the ALU and controller wrappers. While timing analyzers like Primetime are used in the synchronous design flow to look for timing and critical path violations, they can be used in the asynchronous design flow to get delay numbers. With delay numbers from leading commercial timing analyzers, delay elements or a local clock can be implemented. 6 Chapter 2: Overview and Design Synchronous 8051 Architecture Clock rst td Op-code I8051_DEC I8051_ALU addr ip Src-1 Src-2 Src-3 Carry-in 1 & 2 I8051_ROM data I8051_CTR td wr addr Is-bit-addr data Data-bit des Carry-out 1 & 2 Overflow I8051_RAM ALU-Op-code (Rd, wr, addr, data_out, data_in) Ports A block diagram of the synchronous 8051 architecture is shown above. The controller retrieves data from the I8051_ROM module and sends this data to the I8051_DEC module, where the data is decoded into an appropriate op-code for the controller to execute. Depending on the decoded instruction, the I8051_CTR module will assert and deassert the proper control signals to the I8051_ALU and I8051_RAM module. During the execution phase of a particular instruction, the I8051_CTR module will usually read data from the I8051_RAM module, send the fetched data to the I8051_ALU module for an appropriate logical or arithmetic operation, then write the results of the ALU operation back into the I8051_RAM module. If the 8051 needs to be accessed by an external hardware device, such as a memory bus or another 7 microprocessor, the I8051_CTR and I8051_RAM modules feature ports that can interface with the device. 8 Asynchronous 8051 Architecture rst td Op-code I8051_DEC I8051_ALU addr ip Src-1 Src-2 Src-3 Carry-in 1 & 2 I8051_ROM data I8051_CTR td wr addr Is-bit-addr data Data-bit des Carry-out 1 & 2 Overflow I8051_RAM ALU-Op-code ALU Wrapper Clock req ack Clocking Element CTR Wrapper (Rd, wr, addr, data_out, data_in) Ports In the case of designing the asynchronous 8051 microcontroller, handshaking between the ALU and controller is implemented along with a stoppable clock; while the controller waits for the ALU to complete an operation, the clock is stopped. In the synchronous version of the 8051 microcontroller, the ALU is the major source of delay within the chip. RAM reads and writes and controller state transitions occur much faster than an ALU division operation. The clock period in the synchronous 8051 microcontroller must be longer than the worst-case delay, which is the time needed to execute ALU division. While most instructions take less time to execute than a division instruction, the clock period must be long enough in every cycle to ensure that no data or timing errors occur whenever a division instruction is executed. In the asynchronous version of the chip, the clock period is dependent on the executed instruction. For example, the clock period required for an add instruction is much less 9 than the required clock period for a division instruction. This functionality is implemented through handshaking and a stoppable clock. A block diagram of the asynchronous 8051 architecture is shown on the previous page. The operation of the asynchronous microcontroller is similar to that of the synchronous microcontroller, but with a few key differences. Unlike the synchronous version, the clock is stopped while the I8051_CTR module waits for the I8051_ALU module to execute a given operation; this is done through handshaking signals generated by the ALU and Controller wrappers. Second, since the clock is stopped while the controller waits for data from the ALU, excess cycles that were in the synchronous version were eliminated in the asynchronous version; excess cycles are defined as clock cycles where the controller is waiting for the ALU to complete an operation. Third, since the clock is stopped while the controller waits for a given operation from the ALU to complete, the clock must be generated onboard the chip, however, in the synchronous design, the clock is generated off the chip. The controller wrapper generates a request signal when the controller requests an operation from the ALU. Upon receiving an acknowledge signal from the ALU wrapper, the request signal is deasserted by the controller. The ALU wrapper generates an acknowledge signal to signify that the ALU has completed the operation requested by the controller. The delay between the time the ALU wrapper receives the request signal from the controller wrapper and the time the ALU wrapper acknowledges by asserting the acknowledge signal is dependent upon the operation requested. The requested operation is determined by the ALU Op-Code. 10 The clocking unit is used to generate an onboard clock signal for the asynchronous design. The behavior of the onboard clock is the same as an ordinary clock, except that the clock is stopped whenever the request line is asserted and the acknowledge line is deasserted. During this time, the controller is waiting for the ALU to complete an operation. Once the ALU operation is complete and the ALU wrapper asserts the acknowledge signal, the clock will resume normal operation. 11 Chapter 3: Design Flow Asynchronous Design Flow Functional Simulation Synthesis of Synchronous Blocks Timing Analysis Asynchronous Wrapper Design Timing Simulation One of the objectives of this project is to develop an asynchronous design flow that utilizes synchronous tools. An overview of each design flow stage and how synchronous tools are used follows. The first step in the asynchronous design flow is functional simulation. Modelsim is used to simulate the asynchronous design. Standard VHDL compilers, like Xilinx ISE, Altera Quartus II, and NC VHDL, cannot synthesize VHDL code that implements asynchronous logic. The wrappers for the ALU and controller were written in VHDL and interfaced with the reduced controller. These wrappers use wait statements to produce the handshaking logic that is necessary to stop the clock while the ALU completes a given operation. Verification of the asynchronous controller was done against the reduced synchronous controller by performing a memory dump at the end of simulation. The contents of the RAM memory and controller registers were compared and verified to be the same. 12 The second step in the asynchronous design flow is the synthesis of synchronous blocks. In this stage, the synchronous 8051 microcontroller and synchronous elements of the asynchronous 8051 microcontroller are synthesized and turned into Verilog netlists by Ambit Buildgates. These netlists are used in later stages of the design flow. The third step in the asynchronous design flow is timing analysis. In this stage, Cadence Encounter and Synopsys Primetime are used. Cadence Encounter is used for the generation of the Resistance-Capacitance model of a circuit. Synopsys Primetime is used for static timing and critical path analysis. As successive elements of the ALU are removed to determine the critical path delay for division, multiplication, addition, subtraction, and logical operations, the RC network model for each case is generated in Encounter and imported into Primetime to enhance timing accuracy. Critical path numbers are also generated for the RAM, ROM, decoder, synchronous controller, and asynchronous controller modules in this stage. The fourth step in the asynchronous design flow is the design of the asynchronous wrappers. Since the required delay numbers are obtained in stage four of the design flow process, the correct number of delay elements can now be implemented in designing the wrappers. The combinational logic elements of the wrappers are designed in synthesizable VHDL code that is importable into a schematic, while the delay elements in the ALU wrapper are laid out by hand in a schematic editor. The schematic editor used during this project is Cadence Composer. The fifth step in the asynchronous design flow is timing simulation. Unfortunately, the asynchronous implementation of the 8051 microcontroller is unable to 13 be tested since the university does not have a post-synthesis timing simulator installed on its servers. 14 Chapter 4: Design Implementation and Results Handshaking In the asynchronous implementation of the 8051 microcontroller, the synchronization between the ALU and controller is implemented through four-phase handshaking by the use of request and acknowledge signals. When the controller needs an ALU operation to be done, the request signal is asserted. The acknowledge signal from the ALU wrapper is then asserted after a given delay period, depending on the ALU operation requested by the controller. During the period that the request signal is asserted and the acknowledge signal is deasserted, the clock is stopped. The controller’s state should remain the same as the ALU completes the requested operation. After the acknowledge signal from the ALU wrapper is asserted, the request signal is deasserted by the controller wrapper, then a short time later, the acknowledge signal from the ALU wrapper is also deasserted. A figure showing how such handshaking and clocking works is given below. Req+ Stop Clock Ack+ Start Clock Ack- Req- 15 Functional Simulation The first step towards simulating an asynchronous microcontroller is to implement its design in VHDL code. The controller is reduced so that it only implements the instructions executed by the divmul program; the asynchronous controller is further reduced by removing excess cycles. For example, in the reduced synchronous controller the NOP instruction takes 7 clock cycles; in the reduced asynchronous controller, the NOP instruction takes 3 clock cycles. Wrappers for the ALU and controller are implemented in VHDL to be interfaced with the asynchronous reduced controller. The controller wrapper consists of VHDL code that generates a request signal for an appropriate ALU op-code. The ALU wrapper consists of VHDL code that delays the generation of the acknowledge signal according to the ALU operation. Simulation verification in Modelsim consists of two phases. The first phase compares the memory, registers, and trace.out of the fully synchronous 8051 model and the asynchronous 8051 model that does not have any excess cycles removed, but does have ALU and controller wrapper logic inserted. The trace.out file is a recording of each instruction executed by the controller. As the simulation of both progresses, what is written to and read from the 8051 RAM module in each clock cycle is dumped into a text file. After these are verified to be the same, the controller register contents of both designs are also confirmed to be correct. The trace.out file executed during simulation is also the same. The second phase consists of removing excess cycles from the asynchronous 8051 model that had ALU and controller wrapper logic inserted. Instead of comparing what 16 was written to and read from the 8051 RAM module in each cycle during simulation, a dump of the RAM contents at the end of simulation is performed. VHDL code is implemented to perform a dump of controller register contents at the end of simulation. As excess cycles are removed from the asynchronous controller in each instruction case, the contents of the RAM, controller registers, and trace.out are verified to be correct. Appendix B gives the final correct controller register, ALU, and trace.out contents. ALU Wrapper Design Req ALU Opcode Select Logic S2 S1 S0 Logical Operations Add Subtract Multiply Divide 0 0 2to1 1 Mux S0 1 2to1 Mux S1 0 1 2to1 Ack Mux S2 A block diagram showing the ALU wrapper design is given above. The ALU wrapper operates by selecting an appropriate delay element for a given ALU Op-code. Signal req to ack connection is delayed for the amount of time it takes for the ALU to complete the appropriate operation. The select logic takes the ALU Op-code as input and generates appropriate signals for the select lines that control the multiplexers. The delay elements for logical operations, addition and subtraction, multiply, and divide are designed in such a way that the delay of one element builds upon the delay of the previous element. For example, say logical operations have a delay of 9ns and addition and subtraction operations have a delay of 12.8ns. The wrapper is designed so that addition and subtraction delay has to pass through the 9ns delay element of logical operations; thus, only an additional 3.8ns delay needs to be built into the addition and subtraction delay element in order to meet the overall delay requirement for addition and 17 subtraction operations. This same design idea holds for the multiplication and divide elements. The select logic VHDL code for the ALU wrapper is given in Appendix D. Primetime is used to calculate the critical path through the ALU. In the ALU, a virtual clock was created to drive all the inputs; this was necessary since Primetime won’t generate critical path delay numbers without a defined clock. The RC model for each particular ALU was extracted from Encounter and applied to Primetime to enhance timing result accuracy. In calculating ALU delay, successive operations are removed. For example, in order to calculate multiplication delay, the division case from the ALU VHDL code is removed, and the critical path delay is assumed to be that of multiplication. This process continues until delay numbers for logical operations are achieved. Besides the explicit divide, multiply, add, subtract, and logical operations, there are several ALU operations that are placed into add or logical operations category based on the function they performed. For example, the PCSADD ALU op-code is used in calculating address offsets in jump instructions executed by the controller, and the PCUADD ALU op-code is used in incrementing the program counter inside the controller; since both op-codes make use of the addition function, these ALU operations are assumed to be part of the addition delay. Initial delay results from Primetime are given below. Since area is also a factor in delay, the delay must be increased; the ALU decreased in size dramatically from division to logical operations. In the entire ALU, logical operations may take significantly longer then 6046ps; the added area from the division and multiplication functionality results in an increased wire length between the input and output ports. Depending on how the 18 design is physically implemented on chip, the routing of the wires may impart more delay then the rise and fall times of the transistors used in the actual logical operations. ALU Ops Division Multiplication Subtraction Addition Logical Operations Delay(ps) 24675.8 10486.1 8579.21 6310.83 6046.16 In order to provide a safety margin to account for real world operating conditions like voltage and temperature fluctuations, 50% of the initial delay is added to the initial delay numbers. The correct numbers are given below. Since addition and subtraction account for one delay element in the ALU wrapper, they are lumped together in the table below. ALU Ops Division Multiplication Add & Subtract Logical Operations Delay(ps) 37000 15800 12800 9000 Now that the delay for each stage is set, an appropriate number of buffers had to be placed in each stage in order to implement the required delay. The buffer unit is found to have a delay of 110.28ps by itself after Primetime analysis. In order to simplify delay element design, the buffer was rounded down to a delay of 100ps. The number of buffers required for each delay element is given below; note that each successive delay element builds upon the delay of the previous element. ALU Ops Division Multiplication Add & Subtract Buffers 213 30 38 19 Logical Operations 90 After implementing the delay elements for division, multiplication, add & subtract, and logical operations, they are wired together with the select logic block and multiplexers. The wrapper is then exported into a Verilog netlist and imported into Primetime to make sure the critical path delay in the wrapper is equal to the delay required by division. Upon finishing Primetime analysis on the wrapper, it is found the critical path had a delay of 42688.85ps. The difference in the expected and actual critical path delay comes from the fact that the average delay of each buffer changed from 110.28ps to 114.28ps, and the buffer delay is simplified to 100ps during the design of the wrapper. In order to bring the critical path delay closer to required specifications, 50 buffer elements are removed from the division delay element. After removing 50 buffers from the division delay element, the critical path delay in the ALU wrapper is found to be 36975.75ps, which meets the required specifications. The final number of buffers in each delay element is given below. ALU Ops Division Multiplication Add & Subtract Logical Operations Buffers 163 30 38 90 Controller Wrapper Design ALU Op-code Ack CTR Wrapper Req The design of the controller wrapper is much simpler then the ALU wrapper design. All the controller wrapper does is assert a request signal while the controller is 20 waiting for data from an ALU operation; the request signal is deasserted once an acknowledge signal from the ALU wrapper is received. The controller wrapper is designed in VHDL code. The code for the controller wrapper can be seen in Appendix D. Controller Modifications In the case of the synchronous 8051 controller, the controller must wait an appropriate number of clock cycles for a given instruction while a given ALU operation is carried out. However, in the case of the asynchronous controller, these excess cycles can be removed since the clock is stopped while the ALU is carrying out a requested operation. For example, the ADDC_1 instruction executed by the controller takes 8 clock cycles in the synchronous version and 6 clock cycles in the asynchronous version. The VHDL code for the ADDC_1 instruction in the asynchronous 8051 controller is on the left, and the VHDL code for the ADDC_1 instruction in the synchronous 8051 controller is on the right. In the synchronous version, execute states ES_5 and ES_6 are excess cycles since the controller is doing nothing while waiting for the ALU to complete the add operation. In the asynchronous version, the excess cycles ES_5 and ES_6 can be eliminated since the clock is stopped while the ALU completes the add operation. when OPC_ADDC_1 => when OPC_ADDC_1 => case exe_state is case exe_state is when ES_0 => when ES_0 => GET_RAM_ADDR_1(v8); GET_RAM_ADDR_1(v8); START_RD_RAM(v8); START_RD_RAM(v8); exe_state <= ES_1; exe_state <= ES_1; when ES_1 => when ES_1 => exe_state <= ES_2; exe_state <= ES_2; when ES_2 => when ES_2 => alu_op_code <= ALU_OPC_ADD; alu_op_code <= ALU_OPC_ADD; alu_src_1 <= reg_acc; alu_src_1 <= reg_acc; alu_src_2 <= ram_in_data; alu_src_2 <= ram_in_data; alu_src_cy <= reg_cy; alu_src_cy <= reg_cy; exe_state <= ES_3; exe_state <= ES_3; when ES_3 => when ES_3 => 21 ram_out_data <= alu_des_1; ram_out_data <= alu_des_1; START_WR_RAM(R_ACC); START_WR_RAM(R_ACC); reg_cy <= alu_des_cy; reg_cy <= alu_des_cy; reg_ac <= alu_des_ac; reg_ac <= alu_des_ac; reg_ov <= alu_des_ov; reg_ov <= alu_des_ov; exe_state <= ES_4; exe_state <= ES_4; when ES_4 => when ES_4 => GET_PSW(v8); GET_PSW(v8); ram_out_data <= v8; ram_out_data <= v8; START_WR_RAM(R_PSW); START_WR_RAM(R_PSW); exe_state <= ES_7; exe_state <= ES_5; when ES_7 => when ES_5 => SHUT_DOWN_ALU; exe_state <= ES_6; cpu_state <= CS_1; when ES_6 => exe_state <= ES_0; exe_state <= ES_7; when others => when ES_7 => null; SHUT_DOWN_ALU; end case; cpu_state <= CS_1; exe_state <= ES_0; end case; RAM reads take 2 clock cycles and RAM writes take 1 clock cycle in both synchronous and asynchronous controller versions. Clocking Unit req ack ... Inverter Chain Clock A diagram for the clocking unit design is shown above. Unlike the synchronous 8051 chip, the asynchronous 8051 chip must have its clock generated onboard because the clock must be stopped and started by the handshaking signals from the ALU and controller wrappers. The clocking element takes the request and acknowledge lines from the two wrappers as inputs. The clock is stopped when signal req is 1 and signal ack is 0. Signal req is asserted when the controller is waiting for the ALU to complete an operation, and signal ack is asserted when the ALU has completed the requested operation. Whenever the request and acknowledge lines are not one and zero at the same 22 time, the clock operates as any synchronous clock would. The purpose of the inverter chain in the clocking unit is to modulate the clock period of the design. The inverter gate at the lower input node to the AND gate in the clocking unit acts to change the clock from zero to one and vice-versa during normal operation. The delay of this change to the output clock port is dependant on the inverter chain. The delay in the inverter chain must be longer than the longest critical path in the RAM, ROM, asynchronous controller, and decoder modules to avoid timing violations. Since the length of the inverter chain determines how often the clock will change from zero to one and vice-versa during normal operation, it is equal to half the period of the clock. After doing Primetime analysis, the module with the longest critical path in the asynchronous 8051 is the RAM with a critical path of 20.6ns. Applying a 50% safety margin to the critical path, a critical path delay of 30.9ns is obtained. Since an inverter has a delay of 49.96ps, 682 inverters are necessary to create the inverter chain. Targeted Technology The asynchronous microcontroller design is targeted towards the standard cell library developed by the Virginia Tech VLSI for Telecommunications. The VTVT library is based on the TSMC 0.25u CMOS fabrication process. Area and Speed Area numbers were generated by pks_shell and delay numbers are generated in Primetime. The area of the synchronous version is smaller then the asynchronous version since the synchronous version does not have an onboard clock generator, ALU wrapper, or Controller wrapper. The delay is the same in both versions since the RAM module forms the longest critical path. 23 The cell area of the synchronous 8051 chip is 65662; the cell area of the asynchronous chip is approximately 72400. The RAM module dominates the area of both asynchronous and synchronous chips; this can be expected since the 128 bytes of memory the 8051 has requires a large number of flip-flops. The RAM module dominates the critical path delay in the asynchronous and synchronous 8051 versions with a delay of 20.6ns. Applying a 50% safety margin to this number, one obtains a critical path delay of approximately 30.9ns. This translates to a clock period of 30.9ns and an operating frequency of 32.3 MHz for the synchronous and asynchronous chip during normal operation when the clock is not stopped. Even though the university lacks a post-synthesis timing simulator, one can gain a rough estimate of performance by implementing the delay elements in behavioral VHDL code in Modelsim. While the timing characteristics of each gate are not reflected in Modelsim, it can model the behavior of the delay elements and asynchronous handshaking logic through wait for and wait until statements. During simulation in Modelsim, the divmul program took 221,390ns to execute on the synchronous 8051 and 172,030ns to execute on the asynchronous 8051. The asynchronous 8051 is 28.7% faster than the synchronous 8051 in executing the divmul program. 24 Chapter 5: Challenges and Conclusion Challenges There were several difficulties encountered while designing the asynchronous 8051 microcontroller. The most significant difficulty came in learning the different tools that were used in the asynchronous design flow. While learning tools like Cadence Encounter and Ambit Buildgates, technical assistance from people familiar with the tools is readily available, however, while learning tools like Synopsys Primetime, the sole resources available are the Internet and user manuals. By establishing what tools are necessary for each stage in the asynchronous design flow, a great amount of time could be saved. During the timing analysis stage of the design flow, Synopsys Primetime, Design Analyzer, and Timemill are examined, and a large amount of time is spent learning each tool. Ultimately, only Primetime proves useful in the search for accurate delay numbers. The time spent learning Design Analyzer and Timemill is better utilized in the later stages of the asynchronous design flow. Conclusion Even though the idea behind handshaking is very simple, it takes a lot of work to change an existing synchronous design to an asynchronous design. The use of common synchronous design tools in the asynchronous design flow makes the design of the asynchronous 8051 microcontroller much easier. While the asynchronous version is implemented in schematic form, it is impossible to verify its correctness; no suitable post-synthesis timing simulators are installed on university servers. 25 Works Cited 1. Dalton Project. http://www.cs.ucr.edu/~dalton/8051/. University of California, Department of Computer Science, Riverside, CA 92521. 7 April 2005. 2. Scott Hauk, “Asynchronous Design Methodologies: An Overview,” in Proceedings of the IEEE, Vol. 83, No. 1, pp. 69-93, January, 1995. 26 Appendix A – Executed divmul Instructions ADD A, direct ADDC A, Rn CJNE RN, #data, rel CLR A DIV AB DNJZ Rn, rel INC Rn JNZ rel LCALL addr16 LJMP addr16 MOV A, Rn MOV A, direct MOV Rn, A MOV Rn, direct MOV Rn, #data MOV direct, Rn MOV direct, direct MOV direct, #data MOV @Ri, A MUL AB ORL A, Rn RET SJMP rel XCH A, Rn XCH A, direct XRL A, #data 27 Appendix B – Controller, RAM, and Trace.out Contents Characters after “--“ are just comments Controller register contents at end of simulation: 00000 000 01110011 00101110 11111110 11111110 00000000 0 0 0 0 0 0 0 0 -- reg_pc_15_11 -- reg_pc_10_8 -- reg_pc_7_0 -- reg_op1 -- reg_op2 -- reg_op3 -- reg_acc -- reg_cy -- reg_ac -- reg_f0 -- reg_rs1 -- reg_rs0 -- reg_ov -- reg_nu -- reg_p 28 RAM contents at end of simulation: 00000000 00000000 00000000 00000000 00000000 00001111 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001010 00000000 00000000 00001101 00000000 00001101 00000000 10000010 00000000 10000110 00000000 00001010 00000000 00000100 01001001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 -- sfr_b -- sfr_acc -- sfr_psw -- sfr_ie -- sfr_ip -- sfr_sp -- sfr_dpl -- sfr_dph -- sfr_scon -- sfr_sbuf -- sfr_tcon -- sfr_tmod -- sfr_tl0 -- sfr_th0 -- sfr_tl1 -- sfr_th1 -- RAM byte 0 -- RAM byte 1 -- RAM byte 2 -- RAM byte 3 -- RAM byte 4 -- RAM byte 5 -- RAM byte 6 -- RAM byte 7 -- RAM byte 8 -- RAM byte 9 -- RAM byte 10 -- RAM byte 11 -- RAM byte 12 -- RAM byte 13 -- RAM byte 14 -- RAM byte 15 -- RAM byte 16 -- RAM byte 17 -- RAM byte 18 -- RAM byte 19 -- RAM byte 20 -- RAM byte 21 -- RAM byte 22 -- RAM byte 23 -- RAM byte 24 -- RAM byte 25 -- RAM byte 26 29 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 -- RAM byte 27 -- RAM byte 28 -- RAM byte 29 -- RAM byte 30 -- RAM byte 31 -- RAM byte 32 -- RAM byte 33 -- RAM byte 34 -- RAM byte 35 -- RAM byte 36 -- RAM byte 37 -- RAM byte 38 -- RAM byte 39 -- RAM byte 40 -- RAM byte 41 -- RAM byte 42 -- RAM byte 43 -- RAM byte 44 -- RAM byte 45 -- RAM byte 46 -- RAM byte 47 -- RAM byte 48 -- RAM byte 49 -- RAM byte 50 -- RAM byte 51 -- RAM byte 52 -- RAM byte 53 -- RAM byte 54 -- RAM byte 55 -- RAM byte 56 -- RAM byte 57 -- RAM byte 58 -- RAM byte 59 -- RAM byte 60 -- RAM byte 61 -- RAM byte 62 -- RAM byte 63 -- RAM byte 64 -- RAM byte 65 -- RAM byte 66 -- RAM byte 67 -- RAM byte 68 -- RAM byte 69 -- RAM byte 70 -- RAM byte 71 -- RAM byte 72 30 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 -- RAM byte 73 -- RAM byte 74 -- RAM byte 75 -- RAM byte 76 -- RAM byte 77 -- RAM byte 78 -- RAM byte 79 -- RAM byte 80 -- RAM byte 81 -- RAM byte 82 -- RAM byte 83 -- RAM byte 84 -- RAM byte 85 -- RAM byte 86 -- RAM byte 87 -- RAM byte 88 -- RAM byte 89 -- RAM byte 90 -- RAM byte 91 -- RAM byte 92 -- RAM byte 93 -- RAM byte 94 -- RAM byte 95 -- RAM byte 96 -- RAM byte 97 -- RAM byte 98 -- RAM byte 99 -- RAM byte 100 -- RAM byte 101 -- RAM byte 102 -- RAM byte 103 -- RAM byte 104 -- RAM byte 105 -- RAM byte 106 -- RAM byte 107 -- RAM byte 108 -- RAM byte 109 -- RAM byte 110 -- RAM byte 111 -- RAM byte 112 -- RAM byte 113 -- RAM byte 114 -- RAM byte 115 -- RAM byte 116 -- RAM byte 117 -- RAM byte 118 31 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 -- RAM byte 119 -- RAM byte 120 -- RAM byte 121 -- RAM byte 122 -- RAM byte 123 -- RAM byte 124 -- RAM byte 125 -- RAM byte 126 -- RAM byte 127 32 Trace.out contents at end of simulation NOP LJMP MOV 7 CLR 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 33 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 34 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 35 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 36 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 37 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 13 DJNZ 1 MOV 12 LJMP MOV 12 MOV 7 CLR 1 MOV 5 INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 38 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 39 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ INC 2 CJNE 3 INC 2 CJNE 3 MOV 1 XRL 4 ORL 1 JNZ MOV 6 LCALL CJNE 3 MOV 1 MOV 9 40 DIV MOV 5 MOV 6 RET MOV 9 MOV 6 LCALL CJNE 3 MOV 1 MOV 9 DIV MOV 5 MOV 6 RET MOV 9 MOV 6 LCALL MOV 1 MOV 5 MOV 9 MUL MOV 5 MOV 1 XCH 2 XCH 1 MUL ADD 1 41 Appendix C – Sample Primetime and Buildgates script files Synchronous Controller Buildgates tcl script file: read_tlf vtvtlib25.tlf read_lef vtvtlib25.lef read_vhdl i8051_lib.vhd i8051_ctr.vhd do_build_generic set_current_module I8051_CTR do_optimize write_Verilog -hier synch_controller_netlist.v exit Synchronous Controller Primetime tcl script file: read_Verilog synch_controller_netlist.v read_db vtvtlib25.db set link_path "* vtvtlib25.db" link_design create_clock -name clk -period 2 -waveform {0 1} [get_ports clk] set nonclock_inputs [remove_from_collection [all_inputs] [get_ports clk]] set_input_delay 1 -clock clk $nonclock_inputs set_output_delay 1 -clock clk [all_outputs] set_driving_cell -lib_cell inv_1 [all_inputs] set_load -pin_load 1 [all_outputs] read_parasitics -format SPEF \ I8051_CTR.spef report_timing > timing.txt 42 ALU Division Buildgates tcl script file: read_tlf vtvtlib25.tlf read_lef vtvtlib25.lef read_vhdl i8051_lib.vhd i8051_alu.vhd do_build_generic set_current_module I8051_ALU do_optimize write_Verilog -hier i8051_ALU_division.v exit ALU Division Primetime tcl script file: read_Verilog i8051_ALU_division.v read_db vtvtlib25.db set link_path "* vtvtlib25.db" link_design create_clock -name vclk -period 2 -waveform {0 1} set_input_delay 1 -clock vclk [all_inputs] set_output_delay 1 -clock vclk [all_outputs] set_driving_cell -lib_cell inv_1 [all_inputs] set_load -pin_load 1 [all_outputs] read_parasitics -format SPEF \ I8051_ALU.spef report_timing > timing.txt 43 Appendix D – ALU and Controller VHDL Code ALU wrapper select logic VHDL code: library IEEE; use IEEE.STD_LOGIC_1164.all; use IEEE.STD_LOGIC_ARITH.all; use WORK.I8051_LIB.all; entity ALU_wrapper is port( alu_op_code : select0 : out select1 : out select2 : out end ALU_wrapper; in UNSIGNED(3 downto 0) std_logic; std_logic; std_logic; ); architecture BEHAVIORAL of ALU_wrapper is begin process(alu_op_code) begin CASE alu_op_code IS WHEN ALU_OPC_NONE => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_ADD => select0 <= '0'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_SUB => select0 <= '0'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_MUL => select0 <= '0'; select1 <= '0'; select2 <= '1'; WHEN ALU_OPC_DIV => select0 <= '0'; select1 <= '0'; select2 <= '0'; WHEN ALU_OPC_DA => select0 <= '0'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_NOT => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_AND => select0 <= '1'; select1 <= '1'; select2 <= '1'; 44 WHEN ALU_OPC_XOR => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_OR => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_RL => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_RLC => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_RR => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_RRC => select0 <= '1'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_PCSADD => select0 <= '0'; select1 <= '1'; select2 <= '1'; WHEN ALU_OPC_PCUADD => select0 <= '0'; select1 <= '1'; select2 <= '1'; WHEN others => select0 <= '1'; select1 <= '1'; select2 <= '1'; END CASE; END PROCESS; END BEHAVIORAL; 45 Controller Wrapper VHDL Code: library IEEE; use IEEE.STD_LOGIC_1164.all; use IEEE.STD_LOGIC_ARITH.all; use WORK.I8051_LIB.all; entity CTR_wrapper is port( alu_op_code : in UNSIGNED(3 downto 0); ack : in std_logic; req : out std_logic); end CTR_wrapper; architecture BEHAVIORAL of CTR_wrapper is begin process(alu_op_code, ack) begin if ack = '1' then req <= '0'; else CASE alu_op_code IS WHEN ALU_OPC_NONE => req <= '0'; WHEN ALU_OPC_ADD => req <= '1'; WHEN ALU_OPC_SUB => req <= '1'; WHEN ALU_OPC_MUL => req <= '1'; WHEN ALU_OPC_DIV => req <= '1'; WHEN ALU_OPC_DA => req <= '1'; WHEN ALU_OPC_NOT => req <= '1'; WHEN ALU_OPC_AND => req <= '1'; WHEN ALU_OPC_XOR => req <= '1'; WHEN ALU_OPC_OR => req <= '1'; WHEN ALU_OPC_RL => req <= '1'; WHEN ALU_OPC_RLC => req <= '1'; WHEN ALU_OPC_RR => req <= '1'; WHEN ALU_OPC_RRC => req <= '1'; WHEN ALU_OPC_PCSADD => req <= '1'; WHEN ALU_OPC_PCUADD => req <= '1'; WHEN others => req <= '0'; 46 END CASE; END IF; END PROCESS; END BEHAVIORAL; 47