CE110 Homework 1

1.1
a) Performance in instructions per second = clock rate / CPI
P1 = 2.5 GHz / 1.0 CPI = 2.5 * 10^9 instructions per second
P2 = 3 GHz / 1.5 CPI = 2 * 10^9 instructions per second
P3 = 4 GHz / 2.4 CPI = 1.67 * 10^9 instructions per second
So processor P1 has the best performance of the three.

b) Cycles = clock rate * 10 seconds:
P1 = 2.5 GHz * 10 = 2.5 * 10^10 cycles
P2 = 3 GHz * 10 = 3 * 10^10 cycles
P3 = 4 GHz * 10 = 4 * 10^10 cycles
Number of instructions = cycles / CPI:
P1 = 2.5 GHz * 10 / 1.0 CPI = 2.5 * 10^10 instructions
P2 = 3 GHz * 10 / 1.5 CPI = 2 * 10^10 instructions
P3 = 4 GHz * 10 / 2.4 CPI = 1.67 * 10^10 instructions

c) Execution Time = (Number of Instructions * CPI) / (Clock Rate)
The execution time drops to 0.75 of its original value while the CPI grows by a factor of 1.2, with the same number of instructions:
Execution Time * 0.75 = (Number of Instructions * CPI * 1.2) / (New Clock Rate)
New Clock Rate = Clock Rate * 1.2 / 0.75 = 1.6 * Clock Rate
The new clock rate for each processor is 1.6 times its original clock rate:
P1 = 2.5 GHz * 1.6 = 4 GHz
P2 = 3 GHz * 1.6 = 4.8 GHz
P3 = 4 GHz * 1.6 = 6.4 GHz

1.2
a) The number of data memory references is the number of load and store instructions executed: 3 per iteration x 65 iterations = 195.

b) The total number of instructions executed is 1 + (10 x 65) = 651.
The addi instruction before the loop: 1 x 1 cycle = 1 cycle
The loop body is 10 instructions (instructions 2-11), executed 65 times:
2 lw instructions = 2 x 65 iterations x 3 cycles = 390 cycles
1 sw instruction = 1 x 65 iterations x 2 cycles = 130 cycles
4 addi instructions = 4 x 65 iterations x 1 cycle = 260 cycles
1 add instruction = 1 x 65 iterations x 1 cycle = 65 cycles
1 slti instruction = 1 x 65 iterations x 1 cycle = 65 cycles
1 bne instruction = 1 x 65 iterations x 1 cycle = 65 cycles
Latency of the program = 1 + 390 + 130 + 260 + 65 + 65 + 65 = 976 cycles
CPI = (976 cycles) / (651 instructions) = 1.5 cycles per instruction (rounded)

2.1
a) The Instruction Set Architecture (ISA) describes the syntax and semantics of the interface of the computer, including the type and size of the operands, the memory model, how interrupts and exceptions are handled, the available instructions, and the meaning of each instruction.
Microarchitecture refers to the organization, or highest level of implementation, of a particular processor.

b) A compiler needs only the ISA to compile a program correctly. Knowledge of the microarchitecture is not required for correctness, although it helps the compiler optimize for performance.

c) CISC places more emphasis on hardware: it has multi-cycle instructions, memory-to-memory operations with “load” and “store” incorporated into instructions, smaller code size, and more cycles per instruction. An equivalent program implemented in CISC will be considerably shorter than the same program implemented in RISC.
RISC places more emphasis on software: it has single-cycle instructions, register-to-register operations with “load” and “store” as separate, independent instructions, larger code size, and fewer cycles per instruction. RISC CPUs can generally be clocked faster than CISC CPUs because the clock period is dictated by the slowest stage of the pipeline, and simple instructions keep every stage short.

2.2
a) Determining code size
Fixed-length ISA: 4 instructions x 4 bytes = 16 bytes
Variable-length ISA: there is 1 instruction of 4 bytes (ADD) and 3 other instructions of 3 bytes each:
1 x 4 (ADD) + 3 x 3 (OTHER) = 13 bytes
The variable-length ISA has the smaller code size, 3 bytes less than the 16 bytes of the fixed-length ISA.

b) Determining number of cycles
Fixed-length ISA: 3 x 1 (OTHER instructions) + 1 x 4 (STW instruction) = 7 cycles
Variable-length ISA: 3 x 2 (OTHER) + 1 x 6 (STW) = 12 cycles
In this case the fixed-length ISA takes 5 fewer cycles to complete than the variable-length ISA.

3.
Architecture   Bytes in Program   Bytes Fetched   Instruction Count   Program Latency (cycles)
x86            17                 50              27                  55
MIPS           28                 76              19                  39
Stack ISA      12                 48              28                  82

a) x86 ISA
Assume a = 24 and b = 5
1       xor ecx, ecx    ; 3 bytes, 1 cycle
2 Loop: add ecx, esi    ; 1 byte, 1 cycle
3       mov eax, ecx    ; 2 bytes, 1 cycle
4       xor edx, edx    ; 3 bytes, 1 cycle
5       idiv edi        ; 1 byte, 7 cycles
6       test edx, edx   ; 2 bytes, 1 cycle
7       jne Loop        ; 2 bytes, 1 cycle
8       mov eax, ecx    ; 2 bytes, 1 cycle
9       ret             ; 1 byte, 1 cycle
Instructions 2-7 form the loop, which runs for 4 iterations. Instructions 1, 8, and 9 each run once, outside the loop.
Bytes in program: 3 + 1 + 2 + 3 + 1 + 2 + 2 + 2 + 1 = 17 bytes
Bytes fetched: 4 x (1 + 2 + 3 + 1 + 2 + 2) + 3 + 2 + 1 = 50 bytes
Instruction count: 4 x 6 + 3 = 27 instructions
Program latency: 4 x (1 + 1 + 1 + 7 + 2 + 1) + 1 + 1 + 1 = 55 cycles

b) MIPS ISA
Assume a is in register a1, b is in b1, n is in s1, v0 stores the result, the return address is in register ra, and t0 is a temporary register.
1.       xor s1, s1, s1    # zero out s1, 1 cycle
2.       xor t1, t1, t1    # zero out t1, 1 cycle
3. Loop: add s1, a1, s1    # n = n + a, 1 cycle
4.       remu s1, s1, b1   # n % b, 5 cycles
5.       add t0, t0, s1    # temp + n, 1 cycle
6.       bne t0, t1, -1    # if temp != 0, loop again, 2 cycles
7.       add v0, v0, s1    # 1 cycle
There are 7 instructions in total and each MIPS instruction is 4 bytes, so the program is 4 x 7 = 28 bytes.
Bytes fetched: 4 + 4 + 4 x 4 x 4 + 4 = 76 bytes
Instruction count: 2 + 4 x 4 + 1 = 19 instructions
Program latency: 1 + 1 + 4 x (1 + 5 + 1 + 2) + 1 = 39 cycles

c) Stack ISA
1.       push 0      // 1 byte, 3 cycles
2. loop: push a      // 1 byte, 3 cycles
3.       add         // 1 byte, 2 cycles
4.       dup         // 1 byte, 2 cycles
5.       push b      // 1 byte, 3 cycles
6.       rem         // 1 byte, 7 cycles
7.       bnez loop   // 5 bytes, 1 cycle
8.       popm        // 1 byte, 3 cycles
Bytes in program: 1 + 1 + 1 + 1 + 1 + 1 + 5 + 1 = 12 bytes
Bytes fetched: 1 + 5 + 1 + 4 x (1 + 1 + 1 + 1 + 1 + 5) + 1 = 48 bytes
Instruction count: 3 + 4 x 6 + 1 = 28 instructions
Program latency: 3 + 1 + 3 + 4 x (3 + 2 + 2 + 3 + 7 + 1) + 3 = 82 cycles

d) The first ISA, x86, is the best of the three for handling large workloads, and it works well in high-performance computing servers. Note also that x86 can sustain a relatively high clock rate while maintaining comparable performance, whereas MIPS and the stack ISA have a longer clock period.

The second ISA, MIPS, has potentially the highest performance but requires more memory than the other two ISAs. MIPS is a RISC architecture and can handle lower cycle times. A good application for MIPS would be academic or research work.
Also, since MIPS is straightforward to learn, it is a good starter ISA.

The third ISA, the stack ISA, is the worst in terms of performance. However, it has a very low memory footprint. A stack ISA would be best for smaller projects. It has a relatively straightforward microarchitecture that requires a low cycle time to maintain performance.
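
The arithmetic in 1.1 and 1.2 b) can be cross-checked with a short script. This is a minimal sketch in Python and not part of the original solution. The names (PROCESSORS, LOOP_MIX, instructions_per_second, new_clock_rate, program_totals) are illustrative assumptions; the clock rates, CPI values, instruction mix, and cycle counts are copied from the answers to those problems.

```python
# Cross-check of the arithmetic in problems 1.1 and 1.2.
# All names are illustrative assumptions; the numbers come from the answers.

# Problem 1.1: (clock rate in Hz, CPI) for each processor.
PROCESSORS = {"P1": (2.5e9, 1.0), "P2": (3.0e9, 1.5), "P3": (4.0e9, 2.4)}

# Problem 1.2: instruction -> (count per loop iteration, cycles per instruction).
LOOP_MIX = {
    "lw": (2, 3),
    "sw": (1, 2),
    "addi": (4, 1),
    "add": (1, 1),
    "slti": (1, 1),
    "bne": (1, 1),
}


def instructions_per_second(clock_hz, cpi):
    """Instruction throughput = clock rate / CPI."""
    return clock_hz / cpi


def new_clock_rate(clock_hz, time_factor=0.75, cpi_factor=1.2):
    """Clock rate needed when time scales by time_factor and CPI by cpi_factor.

    Follows from time = instructions * CPI / clock_rate with the
    instruction count held fixed.
    """
    return clock_hz * cpi_factor / time_factor


def program_totals(loop_mix, iterations, prologue_cycles=(1,)):
    """Return (instructions, cycles, CPI) for a prologue followed by a loop.

    prologue_cycles holds the cycle cost of each instruction that runs
    once before the loop (here, the single 1-cycle addi).
    """
    loop_instructions = sum(count for count, _ in loop_mix.values())
    loop_cycles = sum(count * cycles for count, cycles in loop_mix.values())
    instructions = len(prologue_cycles) + iterations * loop_instructions
    cycles = sum(prologue_cycles) + iterations * loop_cycles
    return instructions, cycles, cycles / instructions


if __name__ == "__main__":
    for name, (clock_hz, cpi) in PROCESSORS.items():
        print(name, instructions_per_second(clock_hz, cpi), new_clock_rate(clock_hz))
    print(program_totals(LOOP_MIX, iterations=65))
```

With 65 iterations, program_totals reports 651 instructions and 976 cycles, matching 1.2 b). Keeping the instruction mix in a table makes it easy to rerun the check if a cycle count or the iteration count changes.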