Thread-level speculation

Improving Latency Automatically
[OHL Ch. 3] What advantages do CMPs have vs. SMPs if we are
interested in automatically parallelizing a program?
One way of parallelizing programs automatically is to use “helper”
threads.
A helper thread is a lobotomized thread that only performs certain
kinds of actions, e.g.,
 making branch predictions early, and
 prefetching data into on-chip caches.
Why does this help?
Why is the benefit limited?


Thread-level speculation
[OHL §3.2] Another automatic technique is to divide the program up
into several threads.
The only practical way to do this is to divide on the basis of …
 loop iterations, or
 procedure calls.
Lecture 27
Architecture of Parallel Computers
1
The idea is that (in the case of loop iterations), often subsequent
iterations will be “almost independent.” Here is an example (found at
www.crhc.uiuc.edu/ece412/lectures/lecture26.pdf).
for (i = 0; i < I_MAX; i++) {
    for (j = 0; j < J_MAX; j++) {
        a[i][j] = b[i][j] + c[i][j];
        b[j][i] = compute_b(input);
    }
}
As long as i and j are not too close, the iterations will be independent.
So, we can assign successive iterations to different threads.
Why is hardware support needed?
This hardware must handle five special situations (see Fig. 3.1, p.
66):
1. Forward data. Data must be forwarded from one thread to
another quickly.
2. Detect too-early reads. If a data value is read by a later thread
and afterwards written by an earlier thread, a violation has
occurred. Hardware must notice this and, e.g., restart the later
thread.
3. Discard speculative changes after a violation. When a change
is made to a variable by thread T, and then thread T needs to
be restarted, this change must be undone.
4. Retire speculative writes in correct order. After threads finish,
their state must be merged into the process’s state in correct
order. Writes from later threads must be merged in later.
5. Keep earlier threads from seeing later threads’ changes. A
thread must see only changes made by earlier threads. This is
complicated by the fact that a processor that was running an
earlier thread will later be running a later thread.
© 2012 Edward F. Gehringer
CSC/ECE 506 Lecture Notes, Spring 2012
2
One possibility is to use four threads to handle four consecutive
iterations of a loop.
Size of threads is an important issue. Why?
 Limited buffer size.
 True dependences.
 Restart overhead.
 Parallelization overhead.
Typically, a few thousand instructions is about the right length for a thread.
The Hydra TLS System
[OHL §3.3] In the Hydra thread-level speculative architecture, each
of the four on-chip processors has its own L1I and L1D caches.
However, all processors share a single large on-chip L2 cache.
Connecting the individual processors to the L2 raises several issues.
Why?
Hydra has both a read bus and a write bus.
 The read bus is used for values fetched from the L2 cache and
off-chip memory.
How wide do you think it is? A byte, a word, a double word, or
a cache line?
 The write bus is used to write values to the L2 cache.
How wide do you think it is? A byte, a word, a double word, or
a cache line?
The write bus supports cache coherence and memory consistency
(p. 71).
Coherence is via a two-state (valid/invalid) protocol. It is based on
invalidating other caches' copies of a line when it is written. Can
you explain?
How does this organization enforce memory consistency?
Note that all processor cores can snoop on all writes.
What additional hardware is needed to support TLS? There are two
major categories (p. 72).
 Tag bits that record whether speculative accesses have been
performed.
 Write buffers that hold each thread's speculative writes until
they can safely be committed to the L2 cache.
Tag bits to support speculation
[OHL §3.3.2.1] Four kinds of tag bits are used to support
speculation. See Fig. 3.6, p. 73.
The first two of these (modified, pre-invalidate) are one bit per line.
 Modified bit. If this bit is set, it indicates that the line includes
speculative data written by this thread or less-speculative
threads.
How might that speculative data get into the line? Exercise.
o
o
What happens if the thread needs to be restarted?
 Pre-invalidate bit. If this bit is set, it indicates that the line has
been written by a more-speculative thread.
Did this write cause the cache line to be updated?
When does a line with a set pre-invalidate bit need to be
invalidated?
The other two bits (read, written) are one bit per word.
 Read bits. A read bit is set when a processor reads a word
within the cache line unless the word’s written bit is also set.
The purpose is to catch RAW hazards. This is the case when
a write that was supposed to happen before the read (in our
thread) actually happens after it.
[Timeline figure: the read bit is set by a read; the word is written afterward.]
Would the offending write be from a less-speculative thread or
a more-speculative thread?
What has to be done in this case?
 Written bits. If a written bit is set, it indicates that our thread
has written the word. A later read by our thread simply returns
the value it wrote, so the read bit need not be set.
The purpose of written bits is to avoid unnecessary
invalidations.
The book (p. 74) says, “This bit or set of bits may be added to
allow renaming of memory addresses used by multiple threads
in different ways.” Explain this. Your knowledge of ILP may be
useful.
Secondary cache buffers
[OHL §3.3.2.2] Cache buffers are only needed during speculative
operation. What happens during nonspeculative execution?
Why do values need to be buffered during speculative execution?
When do values from these buffers get written to the L2 cache?
What does this remind you of in ILP?
Before the data is written to the L2 cache, it may be forwarded to a
(less-speculative, more-speculative) thread.
When a ________-speculative processor fetches a line, it gets the
most recent version of all bytes in the line. How is this line created?
(See Fig. 3.7.)
How Hydra meets the requirements
[OHL §3.3.2.4] How does Hydra meet the five requirements for thread-level speculation?
1. Forward data between parallel threads.
2. Detect when reads occur too early.
3. Safely discard speculative state after violations.
4. Retire speculative writes in the correct order.
5. Provide memory renaming.