Uploaded by kevinsun0115

CE110 HW 1

CE110 Homework 1
1.1
a) P1 = 2.5GHz/1.0CPI = 2.5 * 10^9 instructions per second
P2 = 3GHz/1.5CPI = 2 * 10^9 instructions per second
P3 = 4GHz/2.4CPI = 1.67 * 10^9 instructions per second
P1 executes the most instructions per second, so it has the best performance of the three.
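The comparison in part (a) can be checked with a short script. This is only a sketch: the dictionary layout and the function name `instructions_per_second` are chosen here for illustration, and the clock rates and CPI values are the ones quoted in the answer.

```python
# Clock rate (Hz) and CPI for each processor, as quoted in part (a).
processors = {
    "P1": {"clock_hz": 2.5e9, "cpi": 1.0},
    "P2": {"clock_hz": 3.0e9, "cpi": 1.5},
    "P3": {"clock_hz": 4.0e9, "cpi": 2.4},
}


def instructions_per_second(clock_hz, cpi):
    """Instructions per second = clock rate / cycles per instruction."""
    return clock_hz / cpi


# Throughput of each processor and the name of the fastest one.
ips = {
    name: instructions_per_second(p["clock_hz"], p["cpi"])
    for name, p in processors.items()
}
best = max(ips, key=ips.get)

for name, value in ips.items():
    print(f"{name}: {value / 1e9:.2f} * 10^9 instructions per second")
print("Best:", best)
```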
b) Cycles in 10 seconds:
P1 = 2.5GHz * 10s = 2.5 * 10^10 cycles
P2 = 3GHz * 10s = 3 * 10^10 cycles
P3 = 4GHz * 10s = 4 * 10^10 cycles
Number of instructions in 10 seconds:
P1 = 2.5GHz * 10s/1CPI = 2.5 * 10^10 instructions
P2 = 3GHz * 10s/1.5CPI = 2 * 10^10 instructions
P3 = 4GHz * 10s/2.4CPI = 1.67 * 10^10 instructions
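The same numbers fall out of two one-line formulas: cycles = clock rate x time, and instructions = cycles / CPI. A minimal sketch, assuming the 10-second run time used in the answer (the names `specs` and `RUN_TIME_S` are illustrative):

```python
RUN_TIME_S = 10  # run time assumed from the "* 10" factor in the answer

# name -> (clock rate in Hz, CPI)
specs = {"P1": (2.5e9, 1.0), "P2": (3.0e9, 1.5), "P3": (4.0e9, 2.4)}

# Cycles elapsed during the run, then instructions completed in those cycles.
cycles = {name: clock * RUN_TIME_S for name, (clock, _cpi) in specs.items()}
instructions = {name: cycles[name] / cpi for name, (_clock, cpi) in specs.items()}

for name in specs:
    print(f"{name}: {cycles[name]:.3g} cycles, {instructions[name]:.3g} instructions")
```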
c) Execution Time = (Num of Instructions * CPI)/(Clock Rate)
Execution Time * .75 = (Num of Instructions * CPI * 1.2)/(New Clock Rate)
New Clock Rate = Clock Rate * 1.2/.75 = 1.6 * Clock Rate
The new clock rate for each processor is 1.6 times its original clock rate:
P1 = 2.5GHz * 1.6 = 4GHz
P2 = 3GHz * 1.6 = 4.8GHz
P3 = 4GHz * 1.6 = 6.4GHz
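The factor of 1.6 comes from dividing one execution-time equation by the other. A short derivation, writing T for the original execution time, I for the instruction count, and f for the clock rate (these symbols are shorthand introduced here, not taken from the assignment):

```latex
\begin{aligned}
T &= \frac{I \cdot \mathrm{CPI}}{f},
\qquad
0.75\,T = \frac{I \cdot 1.2\,\mathrm{CPI}}{f_{\text{new}}} \\[4pt]
f_{\text{new}} &= \frac{1.2\, I \cdot \mathrm{CPI}}{0.75\,T}
             = \frac{1.2}{0.75} \cdot \frac{I \cdot \mathrm{CPI}}{T}
             = 1.6\, f
\end{aligned}
```

The instruction count cancels, so the same factor applies to all three processors.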
1.2
a) The number of data memory references is the number of load and store instructions executed: 3 x 65 = 195.
b) 1 + (10 x 65) = 651 is the total number of instructions executed.
addi instruction before loop is 1 x 1 cycle = 1 cycle
Loop body:
10 instructions (instruction 2-11) are executed 65 times
2 lw instructions = 2 x 65 iterations x 3 cycles = 390 cycles
1 sw instruction = 1 x 65 iterations x 2 cycles = 130 cycles
4 addi instructions = 4 x 65 iterations x 1 cycle = 260 cycles
1 add instruction = 1 x 65 iterations x 1 cycle = 65 cycles
1 slti instruction = 1 x 65 iterations x 1 cycle = 65 cycles
1 bne instruction = 1 x 65 iterations x 1 cycle = 65 cycles
Latency of the program = 1 + 390 + 130 + 260 + 65 + 65 + 65 = 976 cycles
CPI = 976 cycles / 651 instructions ≈ 1.5 cycles per instruction
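The tally above is a weighted sum over the instruction mix, which can be sketched as follows. The per-instruction cycle counts and the 65 iterations are the ones used in the answer; the variable names are illustrative.

```python
ITERATIONS = 65

# (mnemonic, occurrences per loop iteration, cycles per occurrence)
loop_mix = [
    ("lw", 2, 3),
    ("sw", 1, 2),
    ("addi", 4, 1),
    ("add", 1, 1),
    ("slti", 1, 1),
    ("bne", 1, 1),
]

# The single addi that runs once before the loop.
SETUP_INSTRUCTIONS = 1
SETUP_CYCLES = 1

instructions_per_iteration = sum(count for _, count, _ in loop_mix)
cycles_per_iteration = sum(count * cycles for _, count, cycles in loop_mix)

total_instructions = SETUP_INSTRUCTIONS + ITERATIONS * instructions_per_iteration
total_cycles = SETUP_CYCLES + ITERATIONS * cycles_per_iteration
cpi = total_cycles / total_instructions

# Data memory references are the loads and stores only.
memory_refs = ITERATIONS * sum(
    count for name, count, _ in loop_mix if name in ("lw", "sw")
)

print(total_instructions, total_cycles, round(cpi, 2), memory_refs)
```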
2.1
a) Instruction Set Architecture is used to describe the syntax and semantics of the
interface of the computer, including the type and size of the operands, the memory model, how
interrupts and exceptions are handled, the available instructions and the meaning of each
instruction.
Microarchitecture is used to refer to the organization, or highest level of
implementation, of a particular processor.
b) A compiler needs both ISA and Microarchitecture to compile a program correctly.
c) CISC places more emphasis on hardware: it has multi-clock instructions, memory-to-memory
operation with “load” and “store” incorporated into instructions, smaller code size, and higher cycles per second.
An equivalent program implemented with CISC will be a lot shorter than the same program
implemented in RISC.
RISC places more emphasis on software: it has single-clock instructions, register-to-register
operation with “load” and “store” as separate, independent instructions, larger code size, and lower cycles per
second. RISC CPUs generally run faster than CISC CPUs because the clock period is dictated
by the slowest step of the pipeline.
2.2
a) Determining code size
For Fixed Length ISA: 4 x 4 = 16 bytes
For Variable Length ISA: there is one 4-byte instruction (ADD) and three other 3-byte instructions
1 x 4(ADD) + 3 x 3(OTHER) = 13 bytes
The Variable Length ISA has the smaller code size: 13 bytes, which is 3 bytes less than the
16 bytes of the Fixed Length ISA.
b) Determining number of cycles
Fixed Length ISA: 3 x 1(OTHER instructions) + 1 x 4(STW instruction) = 7 cycles
Variable Length ISA: 3 x 2(OTHER) + 1 x 6(STW) = 12 cycles
In this case the Fixed Length ISA takes 5 fewer cycles to complete than the Variable Length
ISA.
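Both parts reduce to summing a per-instruction list. In the sketch below the sizes and cycle counts are the ones quoted in the answer. They are kept in separate lists because the answer does not say which instruction has which size, and all names are illustrative.

```python
# Per-instruction sizes in bytes for the four-instruction sequence.
fixed_sizes = [4, 4, 4, 4]       # every instruction is 4 bytes
variable_sizes = [4, 3, 3, 3]    # one 4-byte instruction, three 3-byte instructions

# Per-instruction cycle counts: three "other" instructions plus one STW.
fixed_cycles = [1, 1, 1, 4]
variable_cycles = [2, 2, 2, 6]

fixed_code_size = sum(fixed_sizes)
variable_code_size = sum(variable_sizes)
fixed_total_cycles = sum(fixed_cycles)
variable_total_cycles = sum(variable_cycles)

print("Code size (bytes):", fixed_code_size, "fixed vs", variable_code_size, "variable")
print("Cycles:", fixed_total_cycles, "fixed vs", variable_total_cycles, "variable")
```

The variable-length encoding is smaller while the fixed-length one is faster.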
3.
Architecture | Bytes in Program | Bytes Fetched | Instruction Count | Program Latency (cycles)
x86          | 17               | 50            | 28                | 55
MIPS         | 28               | 76            | 19                | 37
Stack ISA    | 12               | 48            | 28                | 82
a) x86 ISA
Assume a = 24 and b = 5
1 xor ecx, ecx; // 3 bytes, 1 cycle
2 Loop: add ecx, esi; // 1 byte, 1 cycle
3 mov eax, ecx; // 2 bytes, 1 cycle
4 xor edx, edx; // 3 bytes, 1 cycle
5 idiv edi; // 1 byte, 7 cycles
6 test edx, edx; // 2 bytes, 1 cycle
7 jne Loop; // 2 bytes, 1 cycle
8 mov eax, ecx; // 2 bytes, 1 cycle
9 ret; // 1 byte, 1 cycle
Instructions 2-7 form the loop body, which runs for 4 iterations. Instructions 1, 8, and 9
are outside the loop and each run once.
Bytes in program: 3+1+2+3+1+2+2+2+1 = 17 bytes
Bytes fetched: 4 x (1 + 2 + 3 + 1 + 2 + 2) + 3 + 2 + 1 = 50 bytes fetched
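Instruction count: 4 x 6 + 3 = 27 instructions (6 loop instructions over 4 iterations, plus instructions 1, 8, and 9; the 28 in the table counts one instruction too many)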
Program latency: 4 x (1 + 1 + 1 + 7 + 2 + 1) + 1 + 1 + 1 = 55 cycles
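The two byte counts differ only in how often each instruction is counted: once for the static program size, and once per execution for bytes fetched. A minimal sketch using the instruction sizes from the listing and the 4 iterations stated in the answer (the function names are illustrative):

```python
# (instruction text, size in bytes) for instructions 1-9 of the x86 listing.
listing = [
    ("xor ecx, ecx", 3),
    ("add ecx, esi", 1),
    ("mov eax, ecx", 2),
    ("xor edx, edx", 3),
    ("idiv edi", 1),
    ("test edx, edx", 2),
    ("jne Loop", 2),
    ("mov eax, ecx", 2),
    ("ret", 1),
]
LOOP_FIRST, LOOP_LAST = 2, 7  # 1-based instruction numbers of the loop body
ITERATIONS = 4                # iteration count stated in the answer


def bytes_in_program(listing):
    """Static code size: every instruction counted once."""
    return sum(size for _, size in listing)


def bytes_fetched(listing, loop_first, loop_last, iterations):
    """Dynamic fetch traffic: loop instructions counted once per iteration."""
    total = 0
    for number, (_, size) in enumerate(listing, start=1):
        in_loop = loop_first <= number <= loop_last
        total += size * (iterations if in_loop else 1)
    return total


print(bytes_in_program(listing))
print(bytes_fetched(listing, LOOP_FIRST, LOOP_LAST, ITERATIONS))
```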
b) MIPS ISA
Assume a is in register a1, b is in b1, and n is in s1; v0 stores the result, the return
address is in register ra, and t0 is a temporary register.
1. xor s1, s1, s1 // zero out s1, 1 cycle
2. xor t1, t1, t1 // zero out t1, 1 cycle
3. Loop: add s1, a1, s1 // n = n + a, 1 cycle
4. remu s1, s1, b1 // n % b, 5 cycles
5. add t0, t0, s1 // temp + n, 1 cycle
6. bne t0, t1, -1 // if temp != 0, loop again, 2 cycles
7. add v0, v0, s1 // 1 cycle
Bytes in program: there are 7 instructions and each MIPS instruction is 4 bytes, so 4 x 7 = 28 bytes
Bytes Fetched: 4 + 4 + 4 x 4 x 4 + 4 = 76 bytes fetched
Instruction Count: 2 + 4 x 4 + 1 = 19 instructions
Program Latency: 1 + 1 + 4(1 + 5 + 1 + 2) + 1 = 39 cycles (the expression sums to 39, not the 37 listed in the table)
c) Stack ISA
1. push 0 // 1 byte, 3 cycles
2. loop: push a // 1 byte, 3 cycles
3. add // 1 byte, 2 cycles
4. dup // 1 byte, 2 cycles
5. push b // 1 byte, 3 cycles
6. rem // 1 byte, 7 cycles
7. bnez loop // 5 bytes, 1 cycle
8. popm // 1 byte, 3 cycles
Bytes: 1 + 1 + 1 + 1 + 1 + 1 + 5 + 1 = 12 bytes
Bytes Fetched: 1 + 5 + 1 + 4(1 + 1 + 1 + 1 + 1 + 5) + 1 = 48 bytes fetched
Instruction Count: 3 + 4 x 6 + 1 = 28 instructions
Program Latency: 3 + 1 + 3 + 4(3 + 2 + 2 + 3 + 7 + 1) + 3 = 82 cycles
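To make the stack listing easier to follow, here is a small interpreter sketch for its eight instructions. The opcode semantics are assumptions, since the assignment's stack ISA definition is not reproduced here: `add` and `rem` pop two values and push one, `dup` copies the top of the stack, `bnez` pops the top and branches if it is nonzero, and `popm` pops the final result. The function name and the inputs in the usage lines are hypothetical.

```python
def run_stack_program(a, b):
    """Interpret the eight-instruction stack listing and return the popped result."""
    program = [
        ("push", 0),     # 1
        ("push", a),     # 2  loop:
        ("add", None),   # 3
        ("dup", None),   # 4
        ("push", b),     # 5
        ("rem", None),   # 6
        ("bnez", 1),     # 7  branch target is the index of instruction 2
        ("popm", None),  # 8
    ]
    stack = []
    result = None
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "push":
            stack.append(arg)
        elif op == "add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "rem":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs % rhs)
        elif op == "bnez":
            if stack.pop() != 0:
                pc = arg
        elif op == "popm":
            result = stack.pop()
    return result


# Hypothetical inputs, chosen only to exercise the loop.
print(run_stack_program(5, 4))
print(run_stack_program(6, 3))
```

Under these assumptions the program keeps adding a to a running total until the total is divisible by b, which is the same loop the x86 and MIPS versions implement.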
d) The first ISA, x86, is the best of the three for handling large workloads and works
better in high-performance computing servers. Note also that x86 can sustain a
relatively high clock rate while maintaining comparable performance, whereas
MIPS and the stack ISA have a longer clock period.
The second ISA, MIPS, has potentially the highest performance, but it requires a
larger amount of memory than the other two ISAs. MIPS is a RISC architecture and
can handle lower cycle times. A good application for MIPS would be academic or
research work. Because MIPS is straightforward to learn, it is also a good starter ISA.
The third ISA, the stack ISA, is the worst in terms of performance. However, it has a
very low memory footprint, so it would be best for smaller projects.
A stack machine is a relatively straightforward microarchitecture that requires a low cycle time
to maintain performance.