Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm
Presented by Kim Ki Young @ DCSLab

Simultaneous Multithreading (SMT)
- A technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units
- Attacks the two major impediments to processor utilization:
  - long latencies
  - limited per-thread parallelism

Contributions
1. Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor
2. Show that SMT need not compromise single-thread performance
3. Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized earlier model
4. Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures

The Architecture
- A projection of current superscalar design trends 3-5 years into the future
- Changes necessary to support simultaneous multithreading:
  - multiple program counters
  - a separate return stack for each thread
  - per-thread instruction retirement, instruction-queue flush, and trap mechanisms
  - a thread ID with each branch target buffer entry
  - a larger register file

Methodology
- Simulator based on MIPSI (a MIPS-based simulator); executes unmodified Alpha object code
- Workload: the SPEC92 benchmark suite (five floating-point programs, two integer programs, and TeX)
- Compiled with the Multiflow trace-scheduling compiler

Baseline Results
- With only a single thread, throughput is less than 2% below a superscalar without SMT support
- Peak throughput is 84% higher than the superscalar
- Three remaining problems:
  - IQ size
  - fetch throughput
  - lack of parallelism

Partitioning the Fetch Unit
- Goal: improve fetch throughput without increasing the fetch bandwidth
- Naming scheme alg.num1.num2:
  - alg: fetch selection method
  - num1: number of threads that can fetch in one cycle
  - num2: maximum number of instructions fetched per thread in one cycle
- RR.1.8: the baseline
- RR.2.4, RR.4.2: require some additional hardware
- RR.2.8: additional logic is required

Fetch Policies
- BRCOUNT: favor threads that are least likely to be on a wrong path (fewest unresolved branches)
- MISSCOUNT: favor threads that have the fewest outstanding D-cache misses
- ICOUNT: favor threads with the fewest instructions in the decode, rename, and queue stages
- IQPOSN: favor threads whose instructions are farthest from the head of the IQ
(a toy sketch of ICOUNT selection appears at the end of these notes)

Unblocking the Fetch Unit
- BIGQ: increase the IQ's size without increasing the search space (double the queue, but search only the first 32 entries)
- ITAG: do the I-cache tag lookup a cycle early, so a thread that would miss is not selected for fetch

Issue Algorithms
- Two sources of issue-slot waste:
  - wrong-path instructions, which result from mispredicted branches
  - optimistically issued instructions, which are wasted on a cache miss or bank conflict
- Policies examined: OPT_LAST, SPEC_LAST, BRANCH_FIRST

Bottlenecks (1)
- Issue bandwidth: not a bottleneck
- Instruction queue size: not a bottleneck; experiments with larger queues increased throughput by less than 1%
- Fetch bandwidth: the prime candidate for bottleneck status; increasing the IQ size and the number of excess registers raised performance another 7%
- Branch prediction: SMT is less sensitive to branch-prediction quality

Bottlenecks (2)
- Speculative execution: not a bottleneck, but eliminating it entirely would itself be an issue
- Memory throughput: infinite-bandwidth caches would increase throughput by only 3%
- Register file size: no sharp drop-off point
- Fetch throughput: still a bottleneck

Conclusions
- Borrows heavily from conventional superscalar design, requiring little additional hardware support
- Minimizes the impact on single-thread performance, running only 2% slower in that scenario
- Achieves significant throughput improvements over the superscalar when many threads are running

SMT in Commercial Processors
- Intel Pentium 4, 2002: Hyper-Threading Technology (HTT), up to a 30% speed improvement
- MIPS MT
- IBM POWER5, 2004: two-thread SMT engine
- Sun UltraSPARC T1, 2005: CMT = SMT + CMP (chip-level multiprocessing)
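Appendix: a toy ICOUNT sketch
The fetch policies above are priority heuristics over threads. As a minimal illustration of the idea (my own sketch, not code from the paper; the thread counts, names, and parameters are made up), the C program below picks the two threads allowed to fetch in a given cycle under an ICOUNT.2.8-style policy, favoring the threads with the fewest instructions in the front of the pipeline:

    #include <stdio.h>

    #define NTHREADS 8
    #define NPICK    2   /* ICOUNT.2.8: up to 2 threads fetch per cycle */

    /* Per-thread count of instructions in decode, rename, and the IQs
     * (hypothetical snapshot values for one cycle). */
    static int icount[NTHREADS] = {12, 3, 7, 0, 25, 9, 4, 16};

    /* Select the NPICK threads with the fewest in-flight instructions;
     * a lower icount means higher fetch priority. */
    static void pick_icount(int picked[NPICK])
    {
        int used[NTHREADS] = {0};
        for (int p = 0; p < NPICK; p++) {
            int best = -1;
            for (int t = 0; t < NTHREADS; t++)
                if (!used[t] && (best < 0 || icount[t] < icount[best]))
                    best = t;
            used[best] = 1;
            picked[p] = best;
        }
    }

    int main(void)
    {
        int picked[NPICK];
        pick_icount(picked);
        printf("fetch this cycle: threads %d and %d\n",
               picked[0], picked[1]);
        return 0;
    }

In real hardware this selection would be a priority circuit evaluated every cycle rather than a software loop, but the heuristic is the same: threads that move quickly through the machine get fetch priority, which keeps the instruction queues from filling up with instructions from stalled threads.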