CSE240 Homework 2
Huaxia Xia, 10/11/2000

H&P2.1
Answer:
From Figure 2.26, we get:
The fraction of data-reference instructions (i.e., "load", "store", and "load imm") is:
    26% + 9% + 4% = 39%
The fraction of branch instructions (cond branch, jump, call, return) is:
    17% + 1% + 1% + 1% = 20%
The remaining 100% - 39% - 20% = 41% of instructions carry no offset.

a. For data references, 17% need 0 offset bits, so their instruction length is 16 bits. Since one bit is needed for the sign, only offsets of 1 to 7 magnitude bits fit in the 8-bit field; their share is 57% - 17% = 40%, with an instruction length of 24 bits. The remaining 100% - 57% = 43% need the 16-bit field, with an instruction length of 32 bits.
For branches, 0% need 0 bits; 93% fit in 8 bits, with an instruction length of 24 bits; 7% need 16 bits, with an instruction length of 32 bits.
Thus the average instruction length is:
    16*41% + (16*17% + 24*40% + 32*43%)*39% + (24*93% + 32*7%)*20% = 21.64 bits

b. For data references, 43% need more than 8 bits; for branches, 7% need more than 8 bits. Each of these is charged 48 bits instead of 24. The average instruction length is:
    24*(1 - 43%*39% - 7%*20%) + 48*(43%*39% + 7%*20%) = 28.36 bits

c. For an instruction set with a fixed offset length of 16 bits, all branch and data-reference instructions need 32 bits. The average instruction length is:
    16*(1 - 39% - 20%) + 32*(39% + 20%) = 25.44 bits

Summary: If we consider code length only, (a) is the shortest at 21.64 bits, followed by (c) at 25.44 bits, and then (b) at 28.36 bits.

H&P2.2
Answer:
a. Assume we need to eliminate a fraction x of the instructions to get the same performance.
    (1 - x) * IC * CPI * (1 + 10%) * ClockCycle = IC * CPI * ClockCycle
    x = 1 - 1/1.1 = 9.09%
From Figure 2.26, we know that loads are 22.8% of instructions, so the percentage of loads that must be eliminated is:
    9.09% / 22.8% = 39.9%

b. For example:
    LOAD R1, 0(Rb)
    ADD  R2, R1, R1
A register-memory instruction can take only one operand from memory, so we cannot substitute the memory operand for both uses of R1 at the same time.
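The averages in H&P2.1 and the break-even fraction in H&P2.2(a) are plain weighted sums, so they can be re-checked with a short script. The following is a minimal sketch in Python; the function and constant names are illustrative only, and the percentages are simply the ones quoted above from Figure 2.26.

```python
# Instruction-mix fractions as quoted in the H&P2.1 answer (Figure 2.26).
DATA_REF = 0.26 + 0.09 + 0.04          # load + store + load imm
BRANCH = 0.17 + 0.01 + 0.01 + 0.01     # cond branch + jump + call + return
OTHER = 1.0 - DATA_REF - BRANCH        # instructions with no offset


def avg_length_a():
    """Variable-length encoding: 16-, 24-, or 32-bit instructions."""
    data = 16 * 0.17 + 24 * 0.40 + 32 * 0.43
    branch = 24 * 0.93 + 32 * 0.07
    return 16 * OTHER + data * DATA_REF + branch * BRANCH


def avg_length_b():
    """Fixed 24-bit instructions; long offsets are charged 48 bits."""
    long_offset = 0.43 * DATA_REF + 0.07 * BRANCH
    return 24 * (1 - long_offset) + 48 * long_offset


def avg_length_c():
    """16 bits without an offset, 32 bits with a 16-bit offset."""
    return 16 * OTHER + 32 * (DATA_REF + BRANCH)


def fraction_to_eliminate(cycle_increase):
    """Solve (1 - x) * (1 + cycle_increase) = 1 for x."""
    return 1 - 1 / (1 + cycle_increase)


if __name__ == "__main__":
    print(f"(a) {avg_length_a():.2f} bits")
    print(f"(b) {avg_length_b():.2f} bits")
    print(f"(c) {avg_length_c():.2f} bits")
    print(f"x = {fraction_to_eliminate(0.10):.2%}")
```

Running it prints the three average lengths and the value of x, which should agree with the hand calculations to two decimal places.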
H&P2.6
Answer:
The program code:
          SW   R0, 2000(R0)     ; store 0 to i
loop:     LW   R1, 2000(R0)     ; load i into R1
          SLEI R10, R1, #100    ; if i <= 100, R10 <- 1; else R10 <- 0
          BEQZ R10, endloop     ; if i > 100, jump to endloop
          SLLI R2, R1, #2       ; R2 <- R1*4, the byte offset into the arrays
          LW   R3, 5000(R2)     ; load B[i]
          LW   R4, 1500(R0)     ; load C (at address 1500; 2000 holds i)
          ADD  R5, R3, R4       ; R5 <- B[i] + C
          SW   R5, 0(R2)        ; store R5 to A[i]
          ADDI R1, R1, #1       ; i <- i + 1
          SW   R1, 2000(R0)     ; store i back to memory
          J    loop             ; jump to next iteration
endloop:  …
In total, the number of instructions executed is 1 + 11*101 = 1112; the number of memory-data accesses is 1 + 5*101 = 506; the code size is 12*4 = 48 bytes. These counts cover the 101 complete iterations (i = 0 to 100); the final failing loop test at i = 101 adds 3 more instructions and 1 more load.

H&P2.8
Answer:
The program code:
Assume i, C, and the base addresses of A and B are held in R1, R2, R3, and R4, respectively.
          ADD  R1, R0, R0       ; store 0 to i
loop:     SLEI R10, R1, #100    ; if i <= 100, R10 <- 1; else R10 <- 0
          BEQZ R10, endloop     ; if i > 100, jump to endloop
          SLLI R5, R1, #2       ; R5 <- R1*4, the byte offset into the arrays
          ADD  R6, R4, R5       ; R6 <- address of B[i]
          LW   R7, 0(R6)        ; R7 <- B[i]
          ADD  R7, R7, R2       ; R7 <- B[i] + C
          ADD  R6, R3, R5       ; R6 <- address of A[i]
          SW   R7, 0(R6)        ; store R7 to A[i]
          ADDI R1, R1, #1       ; i <- i + 1
          J    loop             ; jump to next iteration
endloop:  …
In total, the number of instructions executed is 1 + 10*101 = 1011; the number of memory-data accesses is 2*101 = 202; the code size is 11*4 = 44 bytes. As above, the final failing loop test adds 2 more instructions and no memory access.

H&P2.12
Answer:
a. From Figure 2.26, we know that loads and stores make up 26% + 9% = 35% of instructions. The new addressing mode saves one instruction for 10% of them, so the instruction count decreases by 10% * 35% = 3.5%. The ratio of IC on the enhanced DLX to IC on the original DLX is:
    1 - 3.5% = 96.5%

b. CPUTime_new = IC_new * CPI * ClockCycle_new
               = 96.5% * IC_old * CPI * 105% * ClockCycle_old
               = 1.013 * CPUTime_old
So, the original machine will be faster by about 1.3%.
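
The dynamic counts in H&P2.6 and H&P2.8 and the CPU-time ratio in H&P2.12(b) follow the same simple patterns, so they can be cross-checked in a few lines. This is only a sketch in Python with hypothetical helper names. The `exit_test` argument is an assumption added here so that the last, failing loop test can be included or left out.

```python
def dynamic_count(setup, per_iteration, iterations, exit_test=0):
    """Count events (instructions or memory accesses) for a simple loop.

    setup         -- events before the loop starts
    per_iteration -- events in one complete pass through the loop
    iterations    -- number of complete passes
    exit_test     -- events in the final, failing loop test (0 to ignore it)
    """
    return setup + per_iteration * iterations + exit_test


def cpu_time_ratio(ic_ratio, clock_ratio, cpi_ratio=1.0):
    """CPUTime_new / CPUTime_old from CPU time = IC * CPI * ClockCycle."""
    return ic_ratio * cpi_ratio * clock_ratio


if __name__ == "__main__":
    n = 101  # i runs from 0 to 100 inclusive
    print("2.6 instructions:", dynamic_count(1, 11, n))
    print("2.6 memory accesses:", dynamic_count(1, 5, n))
    print("2.8 instructions:", dynamic_count(1, 10, n))
    print("2.8 memory accesses:", dynamic_count(0, 2, n))
    print("2.12 time ratio:", round(cpu_time_ratio(0.965, 1.05), 3))
```

Passing `exit_test=3` for H&P2.6 or `exit_test=2` for H&P2.8 shows how much the counts change when the final loop test is included.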