Quiz2 for Section B,C and RGIT Subject – Computer Organization and Architecture (COA) Answer all questions Full Marks : 30 [6+8+8+8] Time : 1hr Q1. Consider a direct mapped cache with 16 cache lines, indexed 0 to 15, where each cache line can contain 32 integers (block size : 128 bytes). Consider a two-dimensional, 32* 32 array of integers a . This array is laid out in memory so that a [0;0] is next to a [0;1], and so on. Assume the cache is initially empty, but that a [0;0] maps to the first word of cache line 0. Consider the following column-first traversal: int sum = 0; for (int i = 0; i < 32; i++) { for( int j=0; j < 32; j++) { sum += a[i,j]; }} and the following row-first traversal: int sum = 0; for (int i = 0; i < 32; i++) { for( int j=0; j < 32; j++) { sum += a[j,i]; }} Compare the number of cache misses produced by the two traversals, assuming the oldest cache line is evicted first. Assume that i,j, and sum are stored in registers. Assume that no part of array , a, is saved in registers. It is always stored in the cache. Ans: Number of cache misses in column first traversal = 32. Miss rate = 3.1%. Number of cache misses in row rst traversal = 32* 32 = 1024. Miss rate = 100%. Q2. Pipelined arithmetic units are found in high speed computers. They are used to implement floating point operations. Floating point operations can be divided into sub-operations which can be performed by different segments of the pipeline. For example, the floating point adder can be divided into the following four sub-operations that are performed by four segments: i. Compare the exponents; ii. Align the mantissas; iii. Add or Subtract the mantissas; iv. Normalize the result a. Construct the pipeline for computing the floating point addition of 100 pair of numbers. b. The time delay for the four segments are : t1= 50ns , t2 = 30ns , t3 = 95ns , t4 = 45ns. The interface register delay tr = 5ns. How long would it take to add the 100 pair of numbers using the pipelined floating point adder? Ans: (a) (b) The clock cycle time for the pipeline is the cycle time of the segment taking the longest time i.e. Segment 3 Therefore, Clock cycle = 95 + 5 = 100 ns (time for segment 3) For n = 100, k = 4, tp = 100 ns. Time to add 100 numbers = (k + n – 1) tp =(4 + 99) 100 = 10,300 ns = 10.3 μs Q3. (a) Draw the circuit diagram for the instruction fetch unit in the Simple RISC processor Ans : Fetch unit 32 32 1 isBranchTaken branchPC 0 4 pc 32 32 Instruction memory 1 0 inst triggered by a negative clock edge 1 - input 1 0 - input 0 Multiplexer control signal (b) Show the microcode implementation of the load or store instructions. Ans: .mbegin mloadIR mdecode madd pc, 4 mswitch /* transfer rs1 to register A*/ mmov regSrc, rs1, <read> mmov A, regval /* calculate the effective address*/ mmov B,immx, <add> /*ALU operation*/ /* perform the load */ mmov mar, alurResult, <load> /* write the loaded value to the register file */ mmov regData, ldResult mmov regSrc, rd, <write> /* jump to the beginning */ .begin (c) Write an ARM assembly program to find out if a number is prime using a recursive algorithm Ans : l1: mod r5, r3, r4 cmp r5, #0 addeq r2, r2, #1 add r4, r4, #1 cmp r4, r6 ble l1 mov pc, lr prime: mov r7, #0 mov r3, r0 mov r6, r0, lsr #1 mov r4, #1 mov r2, #0 bl l1 cmp r2, #1 moveq r7, #1 Q4. (a) Which addressing modes are preferable in a machine with very few registers? Ans : Immediate addressing , Register base addressing and PC relative addressing. (b) Write a program in Simple RISC assembly to convert an integer stored in memory from the little end-ian to the big end-ian format. Ans: Assume the integer (0X12345678) stored in little endian format is stored at location 0X1234 The register r1 contains 0X1234 which points to the location storing the integer ld r0,[r1] /* load the contents of location pointed by r1 into r0. Contents of r0 will be 0X12345678 */ lsl r2,r0,24 /* left shift r0 by 24 positions to obtain the 78 in most significant 8 bits; r2= 0X78000000 */ lsr r3,r0,24 /* right shift r0 by 24 positions to obtain 12 in the least significant position; r3=0X00000012 */ mov r6,0Xff00 mov r7,0Xff0000 and r4,r0,r6 lsl r4,r4,8 /* Extract 56 from r0 by anding the number with 0X0000ff00 /* Move the extracted 56 to left by 8 bits so that r4 = 0X00560000 and r5,r0,r7 /* Extract 34 from r0 by anding the number with 0X00ff0000 */ lsr r5,r5,8 /* Move the extracted 56 to left by 8 bits so that r4 = 0X00003400 add r3,r3,r5 /* Add the contents of r3 and r5 */ add r2,r2,r4 /* Add the contents of r2 and r4 */ add r2,r2,r3 /* Add the contents of r2 and r3 ; r2 contains 0X78563412 */ st r2,[r1] /* Store the contents of r2 in location r1 which is in big endian format */ */ */ */ (c) Differentiate between polling and interrupt driven I/O scheme of data transfer. Assume that for a single polling operation, a processor running at 1 MHz takes 200 cycles. A processor polls a printer 1000 times per minute. What percentage of time does the processor spend in polling? Ans: In polling mechanism, CPU periodically checks to see if it needs service from any I/O device. Polling takes CPU time even when no requests are pending. This overhead may be reduced at expense of response time. While in interrupt based scheme, instead of the CPU checking I/O requests periodically, the device signals the processor when it needs to send a request. Each device uses a wire (interrupt line) to signal the processor. When interrupt is signaled, the processor stops the normal flow of the instruction execution and executes a routine called an interrupt handler to deal with the interrupt. Interrupts are asynchronous with respect to the current program being executed. The “request” for the CPU to execute the interrupt handler could come from several sources: • External hardware devices. Common example is pressing on the key on the keyboard, which causes to the keyboard to send interrupt to the microcontroller to read the information of the pressed key. • The processor can send interrupts to itself as a result of executing the program, to report an error in the code. For example, division by 0 will causes an interrupt. • In the multi-processor system, the processors can send to each other interrupts as a way to communicate. Interrupt Handler / Interrupt Service Routine (ISR) For every interrupt, there is a fixed location in memory that holds the address of its ISR. The group of memory locations set aside to hold the addresses of ISRs is called the interrupt vector table. You don’t have to know exact locations of these vectors. Compiler does this for you. When timing is important for CPU to react or when it should detect signal from outside world that occurs relatively rear but lasts for very short interval than interrupt is better solution. Let’s take example if it should detect pulse lasting for 1ms and it appears once in 10s at random timing. If we use polling method we would have to check every 500us for example for this pulse so we don’t miss it. But if we use interrupt detection ISR would trigger itself and execute this only once at the moment when pulse occurs. In this case interrupt method is much more efficient. (c) Printer Polling Clocks/sec = 200 * 1000/60 = 3333.33 clocks/sec % Processor time spent for polling = (3333.33/1*106 ) *100 = 0.33% ------------------------------------------------------------------------------------------------------------------------