Lecture 27: Architecture of Parallel Computers

Improving Latency Automatically [OHL Ch. 3]

What advantages do CMPs have vs. SMPs if we are interested in automatically parallelizing a program?

One way of parallelizing programs automatically is to use "helper" threads. A helper thread is a lobotomized thread that performs only certain kinds of actions, e.g., making branch predictions early and prefetching data into on-chip caches.

Why does this help? Why is the benefit limited?

Thread-level speculation [OHL §3.2]

Another automatic technique is to divide the program up into several threads. The only practical way to do this is to divide on the basis of … loop iterations, or procedure calls.

The idea is that (in the case of loop iterations) subsequent iterations will often be "almost independent." Here is an example (found at www.crhc.uiuc.edu/ece412/lectures/lecture26.pdf).

    for (i = 0; i < I_MAX; i++) {
        for (j = 0; j < J_MAX; j++) {
            a[i][j] = b[i][j] + c[i][j];
            b[j][i] = compute_b(input);
        }
    }

As long as i and j are not too close, the iterations will be independent. So we can assign successive iterations to different threads.

Why is hardware support needed?

This hardware must handle five special situations (see Fig. 3.1, p. 66):

1. Forward data. Data must be forwarded from one thread to another quickly.

2. Detect too-early reads. If a data value is read by a later thread and afterwards written by an earlier thread, a violation has occurred. Hardware must notice this and, e.g., restart the later thread.

3. Discard speculative changes after a violation. When a change is made to a variable by thread T, and thread T then needs to be restarted, this change must be undone.

4. Retire speculative writes in correct order. After threads finish, their state must be merged into the process's state in the correct order. Writes from later threads must be merged in later.

5. Keep earlier threads from seeing later threads' changes.
A thread must see only changes made by earlier threads. This is complicated by the fact that a processor that was running an earlier thread will later be running a later thread.

© 2012 Edward F. Gehringer, CSC/ECE 506 Lecture Notes, Spring 2012

One possibility is to use about 4 different threads to handle four consecutive iterations of a loop.

Size of threads is an important issue. Why?

o Limited buffer size.
o True dependences.
o Restart overhead.
o Parallelization overhead.

Typically, a few thousand instructions is the right length for a thread.

The Hydra TLS System [OHL §3.3]

In the Hydra thread-level speculative architecture, each of the four on-chip processors has its own L1I and L1D caches. However, all processors share a single large on-chip L2 cache.

Connecting the individual processors to the L2 raises several issues. Why?

Hydra has both a read bus and a write bus. The read bus is used for values fetched from the L2 cache and off-chip memory. How wide do you think it is? A byte, a word, a double word, or a cache line?

The write bus is used to write values to the L2 cache. How wide do you think it is? A byte, a word, a double word, or a cache line?

The write bus supports cache coherence and memory consistency (p. 71). Coherence is via a two-state protocol, based on invalidating lines written to a cache. Can you explain?

How does this organization enforce memory consistency? Note that all processor cores can snoop on all writes.

What additional hardware is needed to support TLS? There are two major categories (p. 72):

o Tag bits that record whether speculative accesses have been performed.
o Write buffers that …

Tag bits to support speculation [OHL §3.3.2.1]

Four kinds of tag bits are used to support speculation. See Fig. 3.6, p. 73. The first two of these (modified, pre-invalidate) are one bit per line.

Modified bit.
If this bit is set, it indicates that the line includes speculative data written by this thread or by less-speculative threads.

How might that speculative data get into the line? Exercise.

o

o What happens if the thread needs to be restarted?

Pre-invalidate bit. If this bit is set, it indicates that the line has been written by a more-speculative thread. Did this write cause the cache line to be updated?

When does a line with a set pre-invalidate bit need to be invalidated?

The other two bits (read, written) are one bit per word.

Read bits. A read bit is set when a processor reads a word within the cache line, unless the word's written bit is also set. The purpose is to catch RAW hazards: a write that was supposed to happen before the read (in our thread) actually happens after it.

[Timeline figure: the read bit is set, and the word is then written by another thread.]

Would the offending write be from a less-speculative thread or a more-speculative thread? What has to be done in this case?

Written bits. A written bit is set when our thread writes a word. If it is set when our thread reads that word, the read comes after our own speculative write, so we can simply use the value our thread wrote. The purpose of written bits is to avoid flagging unnecessary violations.

The book (p. 74) says, "This bit or set of bits may be added to allow renaming of memory addresses used by multiple threads in different ways." Explain this. Your knowledge of ILP may be useful.

Secondary cache buffers [OHL §3.3.2.2]

Cache buffers are only needed during speculative operation. What happens during nonspeculative execution?

Why do values need to be buffered during speculative execution?

When do values from these buffers get written to the L2 cache? What does this remind you of in ILP?

Before the data is written to the L2 cache, it may be forwarded to a (less-speculative, more-speculative) thread.
When a ____-speculative processor fetches a line, it gets the most recent version of all bytes in the line. How is this line created? (See Fig. 3.7.)

How Hydra meets the requirements [OHL §3.3.2.4]

How does Hydra meet the 5 requirements for thread-level speculation?

1. Forward data between parallel threads.

2. Detect when reads occur too early.

3. Safely discard speculative state after violations.

4. Retire speculative writes in the correct order.

5. Provide memory renaming.