ECE462/562 Fall 2012 Pointers

You are encouraged to come up with your own topic. For example, if you have an interest in compilers, then code scheduling for instruction-level parallelism might be a good topic. If you are interested in VLSI design, a project related to pipeline clocking or low-power architecture would be good. If you are interested in databases, quantifying the architectural characteristics of database workloads and comparing them with the characteristics of other workloads (e.g., SPEC) might be good. Some simulators (e.g., SimpleScalar) and benchmark programs (e.g., SPEC2K) will be made available for carrying out simulation studies.

The following is a sampling of projects in other schools. Though descriptions alone convey little meaningful information, this should give you an idea of what you might want to pursue.

Select a paper that interests you from a recent ASPLOS or ISCA proceedings. Construct a simulator that will allow you to reproduce their main results, and validate your simulator using their workload or a similar one. Are there any major assumptions the authors didn't mention in the paper? Use your simulator to evaluate their technique under a new workload, or improve their technique and quantify your improvements.

As CPU cache miss times approach thousands of cycles, during the time that a miss gets serviced, it seems likely that the processor could execute a cache-replacement optimization program "in the background" without slowing down any unblocked dataflows of execution (Yale Patt calls this sort of optimization code "micro-threads"). This project has two parts. First, estimate an upper bound on the performance that could be gained as follows: simulate a k-way associative cache where each cache set uses random, FIFO, LRU, and OPT replacement. Current caches use k = 1 to 8 and one of the simple replacement policies, and the best your system could do would be to approximate a fully-associative cache with OPT replacement.
The gap between those two cases is a reasonable upper bound on the benefits this scheme could achieve. Also, this experiment will tell you what level of associativity and replacement policy to aim for in your design. You may want to run this experiment for L1, L2, and L3 caches to see where to focus your efforts. Second, design a cache microarchitecture that would allow for more sophisticated replacement policies. My intuition is that it will be important to make sure your design does not slow down hits and does not slow down the time it takes to issue the miss request to memory, but it can probably burn a lot of cycles thinking about which current cache line to replace when that data comes back, or moving data between different cache entries.

Along with the number of transistors, the complexity of microprocessor architectures continues to grow exponentially, with very complex out-of-order processors. It is still not readily apparent how much performance is really being delivered to applications compared to simpler in-order designs. On a spectrum of benchmarks, quantify (through simulation; SimpleScalar is suggested) the performance difference between an out-of-order processor and a simpler in-order processor, taking into account not only CPI, but also clock rate and power consumption.

DRAMs are highly optimized for accesses that exhibit locality. Examine a memory interface architecture that reorders memory accesses to better exploit the column, page, and pipeline modes of modern DRAM implementations.

Select an embedded application (such as interactive multimedia) and design and evaluate an architecture that executes it in a mobile environment. Address issues of functionality, performance (or at least providing the illusion of sufficient performance), and power consumption.

Compare alternatives of embedding processing power in a DRAM chip (i.e., reconfigurable logic vs. highly custom processor vs.
hardwired logic for a given application) on a suite of data-intensive and computationally demanding benchmarks.

Characterize the benefits and costs of value prediction vs. other predictive techniques, such as instruction reuse. In the best cases, what is the maximum performance benefit?

Compare the performance of a deep cache hierarchy (multiple levels) vs. a flatter organization (only one level) on a family of scientific and data-intensive applications. Devise strategies to get the benefits of both.

In large, out-of-order cores, loads have to be held back when an earlier store's address is unknown (because it might be the same). Dependence prediction guesses which load/store pairs are going to have dependences, and which aren't. These predictors have also been used to communicate values from stores to loads and to do prefetching. Lots of interesting stuff here!

Because of wire delays and register file bandwidth, processor designers have started looking at (and building, cf. Alpha 21264) clusters, in which groups of functional units are associated with separate register files within a core. How to schedule work on these, and their implications for future architectures, is a hot topic.

Simultaneous Multithreaded Processors (SMT) run multiple tasks in an out-of-order core at the same time, sharing the dynamic resources (physical registers, issue slots, cache pipes). Experiments on how threads conflict over the different shared resources, under different combinations of workloads, would be interesting (there is a lot of work going on in this area, so a literature search would be crucial).

When multiple threads are running in an SMT core, how many extra cache misses are caused by the intersections of the threads' working sets? Quantify this for different workload combinations.

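As a rough starting point for the working-set question above, the sketch below counts the extra misses that appear when two threads share one LRU cache instead of each running alone. Everything in it is an assumption made for illustration: the 64-byte line size, the round-robin interleaving used as a crude stand-in for SMT issue order, the synthetic looping traces, and all function names. A real study would replay traces from a simulator such as SimpleScalar.

```python
from collections import OrderedDict
from itertools import zip_longest

LINE_BYTES = 64  # assumed cache line size


class LRUCache:
    """Set-associative cache with LRU replacement; only counts misses."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // LINE_BYTES
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)        # hit: mark as most recently used
            return
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)      # evict the least recently used line
        s[line] = True


def count_misses(trace, num_sets, ways):
    cache = LRUCache(num_sets, ways)
    for addr in trace:
        cache.access(addr)
    return cache.misses


def interleave(trace_a, trace_b):
    """Round-robin merge of two traces (hypothetical SMT access order)."""
    skip = object()
    merged = []
    for a, b in zip_longest(trace_a, trace_b, fillvalue=skip):
        merged.extend(x for x in (a, b) if x is not skip)
    return merged


def loop_trace(base, num_lines, passes):
    """Synthetic trace: sweep num_lines consecutive lines, passes times."""
    return [base + i * LINE_BYTES for _ in range(passes) for i in range(num_lines)]


def extra_misses(trace_a, trace_b, num_sets, ways):
    """Misses in a shared cache minus the sum of misses when run alone."""
    alone = (count_misses(trace_a, num_sets, ways)
             + count_misses(trace_b, num_sets, ways))
    shared = count_misses(interleave(trace_a, trace_b), num_sets, ways)
    return shared - alone


# Each thread's 192-line working set fits a 64-set, 4-way cache alone,
# but together they need 6 lines per set and thrash under LRU.
a = loop_trace(base=0, num_lines=192, passes=4)
b = loop_trace(base=1 << 20, num_lines=192, passes=4)
print(extra_misses(a, b, num_sets=64, ways=4))  # 1152
```

Sweeping the working-set sizes and the cache geometry in this toy model gives a feel for where interference sets in before committing to full-benchmark runs.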
When a number of instructions waiting to execute in an out-of-order core are ready to go, but there are too many for (a) the issue width or (b) the particular functional unit types available to issue in a single cycle, the hardware must choose among them. Oldest-first is the usual strategy. Other selection algorithms may be better. It would be interesting to try a few.

One of the key problems in architectures is that it is often more difficult to improve latency than bandwidth. Prefetching is one technique that can hide latency. Here are some possible prefetching topics:

o Quantify the limits of history-based prefetching. Prediction by partial matching (originally developed for text compression) has been shown to provide optimal prediction of future values based on past values. Using PPM, what are the limits of memory or disk prefetching? What input information (e.g., last k instruction addresses, last j data addresses, distance between last k addresses, last value loaded, ...) best predicts future fetches? What is the best trade-off between state used to store history information and prefetch performance?

o Add a small amount of hardware to DRAM memory chips to exploit DRAM internal bandwidths to avoid DRAM latencies. Evaluate the performance benefits that can be gained and the costs of modifying the hardware.

Over the past 2 decades, memory sizes have increased by a factor of 1000, and page sizes by only a factor of 2-4. Should page sizes be dramatically larger, or are a few large "superpages" sufficient to offset this trend in most cases?

Extend Transparent Informed Prefetching (Patterson et al., SOSP 95), which was designed for page-level prefetching/caches, to balance cache-line hardware prefetching vs. hardware caching.

Cooperative caching uses fast networks to access remote memory in lieu of disk accesses. One drawback is that a user's data may be stored on multiple machines, potentially opening security holes (eavesdropping, modification).
Encryption and digital signatures may solve the problem, but could slow down the system. Evaluate the performance impact of adding encryption and digital signatures to cooperatively cached data, and project this performance into the future as processor speeds improve and as companies like Intel propose adding encryption functions to their processors.

As memory latencies increase, cache miss times could run to thousands of instruction-issue opportunities. This is nearly the same ratio of memory access times as was seen for early VM paging systems. As miss times become so extremely bad, is it time to give control of cache replacement to the software? Will larger degrees of associativity be appropriate for caches?

Achieve fault tolerance by running 2 copies of instructions in unused cycles in a superscalar (e.g., a 4-way machine may commit fewer than 4 instructions due to dependencies) and do instruction replication only in those cycles.

Compare Qureshi and Patt's insertion policies in ISCA 2007 to victim caches.

Re-evaluate the schemes in the high-bandwidth cache paper discussed in the lectures (e.g., line buffer).

Use old register values to predict addresses of subsequent memory accesses. This allows the pipeline to do the cache access early in the pipeline, avoiding load-use stalls.

To help circuit designers reduce the di/dt problem, out-of-order processors can monitor the commit rate and "even" out the rate without losing performance.

By looking for phases in applications where fewer physical registers may suffice, we can cut down the amount of energy consumed by the register file.

Attempt to quantify how much of the processor performance gain in the past decade has come from faster clocks and how much from ILP.

Methods to improve fetch bandwidth of trace caches, cache enhancements, including victim caches, stream buffers, and hash addressing.

Implement and compare victim caches and skewed-associative caches.

Implement and compare two recent prefetching schemes.

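For the victim-cache comparisons suggested above, a minimal sketch of the basic mechanism may help as a baseline: a direct-mapped cache backed by a small fully-associative buffer that holds recently evicted lines. The class name, the 64-byte line size, the LRU ordering inside the victim buffer, and the swap-on-victim-hit behavior are assumptions chosen for illustration, not a description of any particular paper's design.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache plus a small fully-associative victim buffer."""

    def __init__(self, num_lines, victim_entries, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes            # assumed line size
        self.victim_entries = victim_entries    # 0 disables the victim buffer
        self.main = [None] * num_lines          # line number stored per index
        self.victim = OrderedDict()             # evicted lines, oldest first
        self.main_hits = 0
        self.victim_hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        idx = line % self.num_lines
        resident = self.main[idx]
        if resident == line:
            self.main_hits += 1
            return
        if line in self.victim:
            # Victim hit: pull the line back; the resident line is
            # displaced into the victim buffer below (a swap).
            del self.victim[line]
            self.victim_hits += 1
        else:
            self.misses += 1                    # must go to the next level
        self.main[idx] = line
        if resident is not None:
            self._insert_victim(resident)

    def _insert_victim(self, line):
        if self.victim_entries == 0:
            return
        if len(self.victim) >= self.victim_entries:
            self.victim.popitem(last=False)     # drop the oldest victim
        self.victim[line] = True


# Two addresses that map to the same index of a 16-line cache.
trace = [0, 16 * 64] * 10

plain = DirectMappedWithVictim(num_lines=16, victim_entries=0)
with_victim = DirectMappedWithVictim(num_lines=16, victim_entries=4)
for addr in trace:
    plain.access(addr)
    with_victim.access(addr)

print(plain.misses)                                  # 20
print(with_victim.misses, with_victim.victim_hits)   # 2 18
```

The ping-pong trace is the best case for a victim buffer; a skewed-associative cache would need its own model (several index hash functions) to be compared on the same traces.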
architectural support of operating systems (e.g., user-level traps for lightweight threads)
prefetching methods (hardware and/or software) and their impact on performance
architectural characteristics of database workloads
cache behavior of networking (or other) applications or algorithms, with modification to exploit caches and memory hierarchies
An implementation study of register renaming logic
In-order vs out-of-order superscalar processors
A study of dynamic branch prediction schemes for superscalar processors
Performance study of two-level on-chip caches in a fixed area
An analysis of hardware prefetching techniques
Performance evaluation of caches using patchwrx instruction traces
Skewed D-way K-column set associative caches
The history and use of pipelining computer architecture
The effect of context switching on history-based branch predictors
Bounding worst-case performance for realtime applications
Branch prediction methods and performance
Performance of TLB implementations
Trace-driven simulation of cache enhancements
Timing analysis and caching for realtime systems
A Survey of VLIW Processors
Evaluating Caches with Multiple Caching Strategies
Survey/Comparison of VLIW and Superscalar processors
Comparison Study of Multimedia-Extended Microprocessors
Synchronous DRAM
Cache Performance Study of the SPEC95 Benchmark Suite
An Investigation of Instruction Fetch Behavior in Microprocessors
The Picojava I Microprocessor and Implementation Decisions
Register Renaming in Java Microprocessors
Optimizing Instruction Cache Performance with Profile-directed Code Layout
Simulation of a Victim Cache for Spec95 Integer Benchmarks
Code scheduling for ILP
Instruction/data Encoding
Cache-based enhancements (e.g., trace cache, filter cache, loop cache, victim cache, stream buffers, etc.)
Pipeline clocking
Low power architectures
Quantifying architectural characteristics of database workloads and comparing them to other workloads
Power/Energy/Performance in a Branch Predictor of a Superscalar Processor
Workload Characterization of Network Processor Benchmarks
Dynamic Phase Behavior of Programs
Cache Miss Pattern and Miss Predictability Analysis
Low Power Cache Design
Analysis of Architecture Support for Virtual Machines
A Framework for Power Efficient Instruction Encoding in Deep Sub Micro Application Specific Processors
Cache Optimization for Signal Processing Applications
Design and Evaluation of Advanced Value Prediction Methods in Multi-Issue Superscalar Pipelined Architectures
A New ISA to Efficiently Support Object-Oriented Programming (OOP) Paradigm
Benchmarking HPC Workloads
Your good idea here...

Some reference papers to get started:

A Case for MLP-Aware Cache Replacement, M. K. Qureshi et al., ISCA 2006.
Increasing the Size of Atomic Instruction Blocks using Control Flow Assertions, S. Patel, MICRO 2000.
Selective value prediction, B. Calder et al., ISCA 1999.
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth, Bracy et al., MICRO 2004.
Efficient Dynamic Scheduling Through Tag Elimination, Dan Ernst and Todd Austin, ISCA 2002.
Cache decay: exploiting generational behavior to reduce cache leakage power, S. Kaxiras et al., ISCA 2001.
Scalable Store-Load Forwarding via Store Queue Index Prediction, S. Stone et al., MICRO 2005.
NUCA: A Non-Uniform Cache Access Architecture for Wire-Delay Dominated On-Chip Caches, Changkyu Kim, Doug Burger, Stephen W. Keckler, ASPLOS 02.
Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching, Eric Rotenberg, Steve Bennett, James E. Smith, MICRO.
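
For the cache-replacement project described earlier on this page, the first part (comparing random, FIFO, LRU, and OPT replacement in a k-way cache) can be prototyped in a few lines before moving to a full simulator. The sketch below is only an illustration under stated assumptions: the trace is a list of cache-line numbers rather than byte addresses, OPT is implemented as Belady's farthest-next-use rule with no bypassing, and the function name and parameters are hypothetical.

```python
import random


def simulate(trace, num_sets, ways, policy, seed=0):
    """Count misses for a trace of line numbers under one policy.

    policy is one of "RANDOM", "FIFO", "LRU", "OPT".
    """
    rng = random.Random(seed)

    # For OPT: next_use[i] is the index of the next access to trace[i],
    # or infinity if the line is never touched again.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    # Each set maps a resident line to one number of per-policy metadata:
    # FIFO: insertion time, LRU: last access time, OPT: next use time.
    sets = [dict() for _ in range(num_sets)]
    misses = 0
    for i, line in enumerate(trace):
        s = sets[line % num_sets]
        if line in s:
            if policy == "LRU":
                s[line] = i
            elif policy == "OPT":
                s[line] = next_use[i]
            continue
        misses += 1
        if len(s) >= ways:
            if policy == "RANDOM":
                victim = rng.choice(sorted(s))
            elif policy == "OPT":
                victim = max(s, key=s.get)   # farthest next use
            else:
                victim = min(s, key=s.get)   # oldest insertion or access
            del s[victim]
        s[line] = next_use[i] if policy == "OPT" else i
    return misses


# A loop over 5 lines in a 4-entry fully-associative cache (num_sets=1):
# the classic case where LRU and FIFO miss on every access.
trace = [0, 1, 2, 3, 4] * 4
for policy in ("RANDOM", "FIFO", "LRU", "OPT"):
    print(policy, simulate(trace, num_sets=1, ways=4, policy=policy))
```

Setting num_sets to 1 and ways to the total number of lines gives the fully-associative OPT bound; the gap between that and a realistic k-way configuration is the upper bound the project description asks for.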