Spring 2011 Final - Alvin
2. OpenMP can set you free….
Consider the following OpenMP snippet:
int values[100000];
#pragma omp parallel
{
    int i = omp_get_thread_num();
    int n = omp_get_num_threads();
    for (; i < 100000; i += n) {
        values[i] = i;
        #pragma omp barrier
    }
}
#pragma omp barrier is a synchronization construct that causes a thread reaching it to continue
execution only after all other threads have reached the barrier.
Suppose each core has a single level of non-shared, write-back, write-allocate, direct-mapped, 32
KB data cache with 64-byte cache blocks. Assume data cache accesses other than those to the
‘values’ array are negligible and that all data caches are initially empty. Assume ints are 32 bits.
Assume there are as many cores as threads.
a. If the snippet is run with one thread, how many data cache misses for the values array will
there be?
b. If the snippet is run with two threads (each allocated to a separate core), what is the
maximum number of data cache misses for the values array?
c. In five words or less, name the phenomenon that the difference (or lack of difference)
between your answers to (a) and (b) illustrates.
d. Using just two threads, if we remove the barrier, could the number of data cache misses
for accesses to the values array decrease by more than a factor of 2 from your answer in
(b)? Explain briefly.
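For readers who want to run the snippet, a minimal self-contained version is sketched below (assuming a compiler with OpenMP support, e.g. gcc -fopenmp; the main wrapper and printf are additions for experimentation, not part of the exam code):

#include <stdio.h>
#include <omp.h>

int values[100000];

int main(void) {
    #pragma omp parallel
    {
        int i = omp_get_thread_num();   /* each thread starts at its own index...  */
        int n = omp_get_num_threads();  /* ...and strides by the number of threads */
        for (; i < 100000; i += n) {
            values[i] = i;
            /* Note: this barrier is only well-defined when every thread executes
               the same number of loop iterations (true for 1 or 2 threads here). */
            #pragma omp barrier
        }
    }
    printf("values[99999] = %d\n", values[99999]);
    return 0;
}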
Spring 2011 Final - Alvin
6. Revenge of the AMAT
• Suppose that for 1000 memory references, we have
o 40 misses in direct-mapped L1$ (i.e. the miss rate is 4%)
o 20 misses in 2-way set associative L1$ (i.e. the miss rate is 2%)
o 10 misses in L2$ (i.e., the global miss rate is 1%)
• Further,
o L1$ hits in 1 cycle
o L2$ hits in 10 cycles
o Miss to main memory costs 100 cycles
• Assume that we have 1.5 memory references per instruction (i.e. 50% loads and stores). In
other words, for 1000 instructions we have 1500 memory references.
• Ideal CPI is 1.0 (if we had 100% hit rate in L1$)
a. What is the local miss rate for L2$...
   i. assuming a direct-mapped L1$?
   ii. assuming a 2-way set associative L1$?
b. What is the AMAT (Average Memory Access Time)...
   i. assuming a direct-mapped L1$?
   ii. assuming a 2-way set associative L1$?
c. How much faster is the AMAT for a 2-way set associative cache? Give your answer as a
ratio.
d. What is the average number of memory stall clock cycles per reference...
   i. assuming a direct-mapped L1$?
   ii. assuming a 2-way set associative L1$?
e. What is the average number of memory stall clock cycles per instruction...
   i. assuming a direct-mapped L1$?
   ii. assuming a 2-way set associative L1$?
f. How much faster would a program run using a 2-way set associative cache?
g. Are the answers for AMAT (6c) and program execution time (6f) above the same? Explain
why or why not.
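For reference, the relations usually applied to this kind of problem (standard textbook formulas, stated generally rather than as this exam's answer key):

Local L2$ miss rate = (misses in L2$) / (misses in L1$, i.e. accesses that reach L2$)
AMAT = L1$ hit time + L1$ miss rate × (L2$ hit time + L2$ local miss rate × main-memory miss penalty)
Memory stall cycles per instruction = (memory references per instruction) × (average stall cycles per reference)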
Spring 2011 Final - Alvin
8. One, two, three….SIMD!
a. SIMDize the following code:
void count( int n, float *c ) {
    for( int i = 0; i < n; i++ )
        c[i] = i;
}
Enter your solution by filling in the spaces provided. Assume n is a multiple of 4.
(_mm_set1_ps(x) returns a __m128 with all four elements set to x.)
void countfast( int n, float *c ) {
    float m[4] = { ____, ____, ____, ____ };
    __m128 iterate = _mm_loadu_ps( m );
    for( int i = 0; i < __________; i++ ) {
        _mm_storeu_ps( ___________, iterate );
        iterate = _mm_add_ps( iterate, _mm_set1_ps( ___ ) );
    }
}
b. Horner’s rule is an efficient way to find the value of the polynomial p(x) = c[0]*x^(n-1) + c[1]*x^(n-2) + … + c[n-2]*x + c[n-1]:
float poly( int n, float *c, float x ) {
    float p = 0;
    for( int i = 0; i < n; i++ )
        p = p*x + c[i];
    return p;
}
Complete the following SIMD solution by filling in the blanks. Assume n is a multiple of 4.
float fastpoly( int n, float *c, float x ) {
    __m128 p = _mm_setzero_ps( );
    for ( int i = 0; i < n; i += 4 ) {
        p = _mm_mul_ps( p, _mm_set1_ps( __________ ) );
        p = _mm_add_ps( p, _mm_loadu_ps( _________ ) );
    }
    float m[4] = { _____, _____, _____, _____ };
    p = _mm_mul_ps( p, _mm_loadu_ps( m ) );
    _mm_storeu_ps( m, p );
    return _____________________________________;
}
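As background on the intrinsics used in this question, here is a small self-contained sketch (not the exam solution) that loads four floats, does a lane-wise add and multiply, and stores the result; it assumes an x86 compiler with SSE enabled (e.g. gcc -msse):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, ... */

int main(void) {
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float b[4];

    __m128 v = _mm_loadu_ps(a);             /* load four packed floats     */
    v = _mm_add_ps(v, _mm_set1_ps(10.0f));  /* add 10 to every lane        */
    v = _mm_mul_ps(v, _mm_set1_ps(2.0f));   /* multiply every lane by 2    */
    _mm_storeu_ps(b, v);                    /* store the four results back */

    for (int i = 0; i < 4; i++)
        printf("b[%d] = %f\n", i, b[i]);    /* expect 22, 24, 26, 28       */
    return 0;
}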
Fall 2010 Final - Alvin
10. Three’s Company
Consider the following datapath with an Arithmetic Logic Unit (ALU) and an eight-register
register file organized around a single bus. The ALU applies operations such as add and subtract
to its two input operands to generate an output result. The register file has an
asynchronous read and a synchronous write. That is, as soon as the Read Enable (RE) is asserted,
the register file selects the indicated 32-bit register and presents its value on the Data Out (DO).
On the other hand, the Write Enable (WE) is sampled only on the rising edge of the clock, and
only writes the indicated register from the Data In (DI) lines on the same edge that WE is
asserted. The ALU and Register File share the Bus via a 32-bit wide 2:1 multiplexer. When
SelALU is set to 1, the ALU path is connected to the Bus. Otherwise, the Register File path is
connected to the Bus.
The datapath must support three-address instructions of the form Rz ← Rx <op> Ry.
To make use of a single bus architecture, the ALU can be surrounded by one, two, or three 32-bit
temporary registers, labeled A, B, and C, as shown below (the temporary registers are shown as
dotted lines – the correct solution requires at least one and possibly all three of the registers):
[Datapath figure: a single bus connects the register file and the ALU through the 2:1 SelALU multiplexer; the optional temporary registers A, B, and C appear as dotted outlines around the ALU.]
Using the fewest of the A/B/C temporary registers and the fewest clock cycles, what is the minimum
number of each needed to implement the register transfer for the three-address instructions (circle one
for each):

Registers:    1   2   3
Clock Cycles: 1   2   3
For your answer, cross out on the datapath figure the registers you don’t need, and fill in the
outlines of the registers that you do. For each clock cycle that you need according to your answer
above, write in the space below the control signals that must be asserted to implement the
register transfers for the three-address instructions:
Clock Cycle 1:
Clock Cycle 2:
Clock Cycle 3:
Spring 2007 Final - Justin
F2) Tune in to 101 on your FSM dial...
We are designing a palindrome-finder circuit with a 1-bit input I(t) and a 1-bit output O(t) that will produce,
at time t, whether the sequence {I(t-2), I(t-1), I(t)} is the same backwards and forwards (e.g., 101).
We’ll assume I(t) has been 1 for all negative time (i.e., before the finder circuit starts). As an example,
the input:
I: 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 0
will produce the output:
O: 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0
a) Complete the FSM diagram below. Our states have been labeled Sxy, indicating that
the previous 2 bits {I(t-2), I(t-1)} would be {x, y}. Fill in the truth table below. The
previous state is encoded in (P1, P0), the next state is encoded in (N1, N0),
and the output is encoded as O. Make sure to indicate the value of the output on your
state transitions AND to indicate the starting state with an “incoming arrow”.

[FSM diagram to complete — states: S00, S01, S10, S11]
b) Provide a fully reduced (i.e., fewest gates to implement; you can use any n-input gates) Boolean
expression for the output O as a function of P1, P0, and I. If there is a name for the circuit, write it in
the box provided, e.g., “the always-1”, “3-input NAND”, etc. A 2-input XOR has the symbol “⊕”.
c) How many different answers could I have put in the box for “b” above?
Said another way, how many different circuits can a 3-LUT imitate?
d) We’re always concerned about testing. What is the shortest length of an I(t)
stream that can guarantee you’ve tested this particular circuit exhaustively?
e) Finally, we wish to build our circuit as we normally do for SDS systems
(shown below). Given the four standard spec times from the chip
manufacturer (τsetup, τhold, τclk-to-q, and τCL), what is the smallest clock period τ
we can drive our system with? (Write your answer as an expression involving
the spec variables.) Feel free to draw timing diagrams if you wish.
[Circuit figure for part (e): the usual SDS arrangement — a state register clocked by CLK holds PS, and combinational logic takes PS and I to produce NS.]
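As general background for part (e) (the standard constraint for any register-to-register path through combinational logic, given here as a reference rather than the graded answer): the clock period must satisfy

τ ≥ τclk-to-q + τCL + τsetup

while τhold constrains the minimum combinational delay rather than the clock period.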
Truth table for part (a):

P1 P0 I | O N1 N0
 0  0 0 |
 0  0 1 |
 0  1 0 |
 0  1 1 |
 1  0 0 |
 1  0 1 |
 1  1 0 |
 1  1 1 |

Answer box for part (b):  O = ______________________
Spring 2004 Final - Justin
Question F3: Pipelining (18 points, 24 minutes)
Given the following MIPS code snippet (note that instruction #6 could be anything):
loop:
1    addi $t0, $t0, 4
2    lw   $v0, 0($t0)
3    sw   $v0, 20($t0)
4    lw   $s0, 60($t0)
5    bne  $s0, $0, loop
6    ## <-- The following instruction could be anything!
a) Detect hazards and insert no-ops to ensure correct operation. Assume no
delayed branch, no forwarding units, and no interlocked pipeline stages. Your
answer should take the form of pair(s) of numbers, num@location, indicating that
num no-ops should be placed at location. E.g., if you wanted to place
6 no-ops between lines 2 and 3 (i.e., location=2.5) and 8 no-ops between lines 5
and 6 (i.e., location=5.5), you would write: “6@2.5, 8@5.5”. (6 points)
Scratch space
b) Now, reorder/rewrite the program to maximize performance. Assume delayed branch and
forwarding units, but no interlocked pipeline stages. For unknown reasons, the first instruction
after the loop label must be the addi. Feel free to insert no-ops where needed. You should be
able to do it using 6 instructions per loop (easier, half credit) or only 5 (hard, full credit). (12 pts)
## Extra instructions before the loop if necessary
## Extra instructions before the loop if necessary
loop:
1    addi $t0, $t0, 4
2
3
4
5
6    ## <-- The following instruction could be anything!
Fall 2006 Final - Justin
F3) “These Pipes are Clean…” (22 pts, 30 min)
Consider a processor with the following specification:
o Standard five (5) stage (F, D, E, M, W) pipeline.
o No forwarding.
o Stalls on all data and control hazards.
o Non-delayed branches.
o Branch comparison occurs during the second stage.
o Instructions are not fetched until branch comparison is done.
o Memory CAN be read/written on same clock cycle.
o The same register CAN be read & written on the same clock cycle.
o No out-of-order execution.
o “Dumb” control that does not optimize for “always-branch” conditional branches.
a) Count how many cycles will be needed to execute the code below and write out each
instruction’s progress through the pipeline by filling in the table below with pipeline stages
(F, D, E, M, W).

   add  $t1, $t2, $t3
   xor  $t1, $t4, $t5
   lw   $t3, 0($t1)
   beq  $t3, $t3, 1
   lw   $t5, 0($t3)
   xor  $t4, $t5, $t6
   add  $t5, $t5, $t4

[Fill-in grid: one row per instruction (Inst 1, Inst 2, …) and one column per cycle (Cycle 1 through Cycle 25).]
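As a hedged illustration of how a single RAW hazard plays out under these rules (no forwarding, stall in decode, same register readable and writable in one cycle) — a generic sketch, not this problem's filled-in table:

            C1  C2  C3  C4  C5  C6  C7  C8
producer:   F   D   E   M   W
dependent:      F   D   D   D   E   M   W

The dependent instruction repeats its D stage until the cycle in which the producer writes back, because with same-cycle write/read the freshly written value can be read in that same cycle.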
b) Considering the following three changes, fill in the table again:
o Our processor now forwards values
o Interlocks on load hazards
o “Intelligent” control that optimizes for “always-branch” conditional branches

[Fill-in grid: one row per instruction (Inst 1, Inst 2, …) and one column per cycle (Cycle 1 through Cycle 22).]
Fall 2006 Final - Justin
F2) Congressman Mark Foley: “It was the Page’s fault” (22 pts, 30 min)
The specs for a MIPS machine’s memory system that has one level of cache and virtual memory are:
o 1MiB of Physical Address Space
o 4GiB of Virtual Address Space
o 4KiB page size
o 16KiB 8-way set-associative write-through cache, LRU replacement
o 1KiB Cache Block Size
o 2-entry TLB, LRU replacement
The following code is run on the system, which has no other users and process switching turned off.
#define NUM_INTS 8192                              // This many ints...
int *A = (int *)malloc(NUM_INTS * sizeof(int));    // malloc returns address 0x100000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128) A[i] = i;
for (i = 0; i < NUM_INTS; i += 128) total += A[i]; // SPECIAL
a) What is the T:I:O bit breakup for the cache (assuming byte addressing)?  ____ : ____ : ____
b) What is the VPN:PO bit breakup for VM (assuming byte addressing)?  ______ : ______
For the following questions, only consider the line marked “SPECIAL”. Your answer can be a fraction.
c) Calculate the hit percentage for the cache.
d) Calculate the hit percentage for the TLB.
e) Calculate the page hit percentage for the page table.
Show all your work below...
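For reference, the generic breakup relations typically used here (general formulas, not the filled-in answers):

Offset (O) bits = log2(cache block size in bytes)
Index (I) bits  = log2(number of sets) = log2(cache size ÷ (block size × associativity))
Tag (T) bits    = (address bits seen by the cache) − I − O
Page Offset (PO) bits = log2(page size in bytes)
VPN bits              = (virtual address bits) − PO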
Fall 2010 Final - Sean
8. Bigger, Stronger, Faster:
Suppose that you are running an algorithm for various problem sizes, and have obtained
the data below. Sketch a weak scaling plot of parallel code performance that shows
speedup over the serial implementation. Be sure to label the Y-axis.
Problem Size | Gflop/s (serial) | Threads | Gflop/s (parallel)
         100 |                5 |       1 |                  5
         200 |                5 |       2 |                 10
         400 |                5 |       4 |                 19
         600 |                5 |       6 |                 25
         800 |                5 |       8 |                 35
        1000 |                5 |      10 |                 36
        1200 |                5 |      12 |                 37
        1400 |                5 |      14 |                 37
        1600 |                5 |      16 |                 38
[Plot to complete — title: “Weak Scaling of Speedup over Serial”; X-axis: “Threads” with ticks 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22; a “Linear Speedup” reference line is shown.]
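One sanity check when sketching the plot (an observation, not the grading rubric): speedup over the serial code at a given thread count is just parallel throughput divided by serial throughput, e.g. 10/5 = 2 at 2 threads and 19/5 = 3.8 at 4 threads.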
Fall 2010 Final - Sean
7. Pay It Forward
Consider the excerpt below of a 5-stage pipelined MIPS datapath.
a. Consider the following sequence of instructions
[A] srl $zero, $zero, 0
[B] addu $t0, $t1, $t2
[C] addu $t0, $t0, $t2
[D] lw $s0, 0($t3)
[E] subu $t3, $s0, $t0
During which of these instructions’ decode stages in the sequence above should
ControlRS be 1 to avoid pipeline stalls? Use the labels [A], [B], …
b. Which fields of which instructions from part a does the control logic need in order to
compute the value of ControlRS?
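The datapath excerpt is not reproduced in this compilation, so the exact definition of ControlRS is not shown here. As generic background (not necessarily this datapath's exact condition), textbook forwarding control compares the destination-register field of an older instruction still in the pipeline — one that actually writes the register file and whose destination is not $zero — against the rs field of the younger instruction, in the style of:

if (EX/MEM.RegWrite && EX/MEM.RegisterRd != 0 && EX/MEM.RegisterRd == ID/EX.RegisterRs)
    /* forward the EX/MEM result to the ALU input that would otherwise read rs */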
Spring 1999 Final - Sean
The Newsgroup Question (11 points):
Although the CS61C review lecture on variable arguments was great, Joe Computer is very puzzled. What is this
M'Piero thing? What does it have to do with Kelvin? "I don't get it!" Determined to find the answer to his confusion,
he decides to check the newsgroup by starting the program "trn."
In what order do things happen when trn is run? Part 1 lists a set of things that occur when trn is run.
Please time-order the steps from 1 to 13. The odd numbered steps are given to you. Fill in the rest with
even numbers.
Please assume:
- No part of the program has been loaded into memory yet.
- Page size is 4KB and there is only one cache.
- The page table entry loaded from the memory for page 0x00040 maps to physical page 0x14329.
- The TLB is between the CPU and the cache, as in class (the cache uses physical addresses).
- Block size for the cache is 8 words (32 bytes).
- In part 1 all of the actions occur. In part 2, some of them are incorrect and do not occur.
Part 1 (6 points):
Given steps:
__1__ Joe Computer types "trn" at the command line.
__3__ The CPU attempts to fetch the first instruction, 0x00040000 (pointed to by the pc).
__5__ The page table for this process is accessed to find the entry for address 0x00040000, which has the invalid
bit set (not loaded from disk yet).
__7__ The TLB is updated with an entry mapping virtual page 0x00040 to physical page 0x14329, with the valid
bit set.
__9__ The cache misses for the block containing 0x14329000 and attempts to load the block from memory.
_11__ The instruction at virtual address 0x00040000 is successfully loaded from the cache, completing the
instruction fetch phase.
_13__ The CPU attempts to fetch the second instruction, 0x00040004.
Unordered steps: (Assign the even step numbers 2, 4, 6, 8, 10, and 12 to the six options below)
_____ The TLB hits for virtual page number 0x00040, the physical address 0x14329000 is sent to the cache.
_____ The TLB misses while attempting to find an entry for the virtual page number 0x00040.
_____ Physical page number 0x14329 is loaded into memory from disk, and the page table is updated.
_____ The instruction at virtual address 0x00040000 is successfully fetched, and on the next clock tick will move
on to its decode stage.
_____ A page table for the process is created by the operating system. Static memory area is created, space is
allocated for the static parts (i.e. arrays) of the program, heap and stack are initialized. All the TLB entries
from the previous process are marked invalid.
_____ The block containing 0x14329000 is loaded into the cache from memory.
[Newsgroup Question continued]
Part 2 (5 points):
Now that you have ordered what happens for the first instruction, what will happen for the second instruction?
(Assume that this question starts where Part 1 left off. Remember, some of these may NOT occur. Please order
the *correct* actions [starting with the number 1], and put an “X” in front of incorrect actions):
_____ The TLB misses for the virtual page corresponding to address 0x00040004, and the previous procedures
are used to load the right page into memory and update the page table.
_____ The cache misses for the block containing 0x14329004 and attempts to load the block from memory.
_____ The TLB hits for virtual page number 0x00040, the physical address 0x14329004 is sent to the cache.
_____ (after many more instructions are executed)... The newsgroup article is read, M'Piero is displayed on the
screen, Joe Computer finally gets the joke, and posts a message praising the fact that CS61C has such a
nifty teaching staff this semester!
_____ The instruction at virtual address 0x00040004 is successfully loaded from the cache, completing its
instruction fetch phase.
_____ The block containing 0x14329000 is loaded into the cache from memory.
Spring 2008 Final - Sean
M1) Hey buddy, can you run these instructions for me? Thanks! (10 pts, 20 min)
Consider the following non-delayed branch MIPS function foo:

foo:    li   $v0, 0
        la   $t9, loop
        sw   $a1, 0($t9)
        sw   $a2, 4($t9)
        sw   $a3, 8($t9)
loop:   nop
        nop
        nop
        bne  $a0, $0, loop
        jr   $ra

a) What does the following function call (in C) return? ________
   foo(-1, 0x30880001, 0x00481020, 0x00042042);
b) You can probably see how foo could pose a security threat if misused. For the good of humanity, we must
seal its functionality forever, and render it harmless. That is, you’re going to call it once with a special set of
arguments for $a0-$a3 (list these below in human-readable form … not as numbers!) so that every future call
to foo always just returns $a0 regardless of the value of $a1-$a3. Oh, and the call to foo with the
arguments below should cause it (this time only) to return 0 to signal success that it has been “neutralized”.
$a0: __________________________
$a1: __________________________
$a2: __________________________
$a3: __________________________