18-741 Project Statement (Fall 2006)
September 27, 2006

Proposal due date: Friday, October 13, 2006, 4:30pm at HH A302
Milestone 1 meeting: Friday, October 27, 2006 (meeting with TAs during the day)
Milestone 2 report: Wednesday, November 8, 2006, 4:30pm at HH A302
Milestone 2 meeting week: Starting November 6 (meeting with Prof. Falsafi or TAs)
Final report and presentation due date: Friday, December 8, 2006

1. Introduction

In this project, you will develop new designs or evaluate extensions to designs presented in class or at a recent architecture conference. You will use Flexus (www.ece.cmu.edu/~simflex) as a base simulation environment and augment it with a simulation model for the design of interest. The purpose of this project is to conduct publication-quality research and to improve the state of the art. The project report will be in the form of a research article similar to those we have been discussing in class.

The project will account for 30% of your final course grade. The project will be graded out of 100 points:
• proposal (5 points)
• milestone 1 (5 points)
• milestone 2 (5 points)
• final report (75 points)
    • problem definition and motivation (10 points)
    • survey of previous related work (15 points)
    • description of design (15 points)
    • experimentation methodology (15 points)
    • analysis of results (20 points)
• poster/presentation (10 points)

2. Proposal

You are encouraged to meet with me and/or the TAs prior to submitting the proposal if you have any questions. When in doubt, meet with us. Please make an appointment to meet with us or use the office hours.

The proposal is a written two-page document including:
1. A problem definition and motivation.
2. A brief survey of related work with at least four papers you have found and read that are directly related to the topic (ask me for pointers). Please explain in detail how this prior work relates to your proposed work.
The CALCM web page (www.ece.cmu.edu/CALCM) has links to online IEEE and ACM proceedings. Conferences that primarily focus on architecture research are ISCA, ASPLOS, MICRO, HPCA, SIGMETRICS, ISLPED, and DSN. You can search the online pages and/or contact us for pointers. You will also find the World Wide Computer Architecture web page (www.cs.wisc.edu/~arch/www) a good source of information.
3. A detailed description of your experimental setup, including modifications to the simulation environment. You should refer to J. R. Platt’s paper on Strong Inference (from the Reader) and explain how your experimental methodology follows his guidelines.
4. Milestones for the status and final report. What do you expect to see in each milestone? Where do you plan to go based on your observations? You can draw a flow chart to clarify.

3. Milestone 1

1. This milestone will ensure that you have successfully brought up the infrastructure you will need for your project. Furthermore, you must demonstrate the problem your project attacks using this experimental infrastructure. For example, you might need to bring up a simulation framework and reproduce some baseline case or prior results as a starting point for your own work.
2. You will make an appointment to meet with one of us (faculty or TA team) to present your own results (possibly replicated) motivating your project and to explain how your infrastructure is suitable for your project. The appointments are made by filling out an appointment sheet in class the week prior to the meetings.
3. NOTE: The purpose of this meeting is not for you to explain to us why you are having difficulty bringing up your infrastructure. If you encounter unexpected difficulties, please notify us early on so we can help you work through them.

4. Milestone 2

1. You will hand in a two-page write-up describing your preliminary results. These results form the basis for the final outcome of your research; that is, the results will not substantially change.
You will spend the rest of the semester polishing the results with more extensive analysis and experimentation. Based on these results, explain your plans for the final milestone. If there are any changes to your plans, you should bring them up in this report.
2. You will make an appointment to meet with one of us (faculty or TA team) to go over the project status. The appointments are made by filling out an appointment sheet in class the week prior to the meetings.

5. Final Report

Reports should be in the form of a conference submission including an abstract, an introduction, a detailed description of the design, a methodology section, a results section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Please make sure your document is spell-checked and grammatically sound. You will need all the citations relevant to your work in a reference section at the end.

6. Posters and/or Presentations

On the Friday of the last week of classes, we will hold a poster and/or presentation session, in which teams will get to present their results orally. Please stay tuned for more detail on this.

7. Best Projects

The top projects in class will be selected for submission to a computer systems conference for publication. In the past, a number of papers from 741/742 have become full-blown research projects, including the SMARTS paper on simulation sampling that appeared in ISCA 2003 and the Spatial Pattern Prediction paper that appeared in HPCA 2004 and subsequently in ISCA 2006.

8. Infrastructure

SimFlex is a full-system simulation environment and as such is slower than a user-level single-thread simulator like SimpleScalar. You should select your measurement sizes (in instruction count) so that individual simulation runs take at most a few hours.
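As a rough way to pick a measurement size, the sketch below converts a wall-clock budget into an instruction count. The function name and the simulation rates are hypothetical placeholders, not measured Flexus numbers; substitute the rate you actually observe on your own setup.

```python
def max_instructions(sim_rate_ips: float, budget_hours: float) -> int:
    """Largest instruction count that fits in the wall-clock budget.

    sim_rate_ips is the simulated instructions per wall-clock second
    (an assumption you must measure yourself); budget_hours is the
    longest run you are willing to wait for.
    """
    if sim_rate_ips <= 0 or budget_hours <= 0:
        raise ValueError("rate and budget must be positive")
    return int(sim_rate_ips * budget_hours * 3600)


if __name__ == "__main__":
    # Hypothetical rates for illustration only.
    for label, rate in [("fast functional mode", 1_000_000.0),
                        ("detailed timing mode", 10_000.0)]:
        limit = max_instructions(rate, budget_hours=2.0)
        print(f"{label}: at most {limit:,} instructions in 2 hours")
```

If the resulting count is too small to be representative, that is a sign to revisit the methodology rather than to lengthen the run.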
Refer to http://www.ece.cmu.edu/~simflex and look for Flexus for proper simulation methodology. Questions about simulator setup should be directed to one of the TAs.

9. Research Project Topics

Some of the project topics below are undisclosed ideas and are confidential. Please do not distribute this list. I will be open to any ideas you may have for a research project if you can convince me that it is worth pursuing. Otherwise, here is a list of possible projects.

1. DBmbench. Benchmarking commercial applications on a simulator is extremely slow. Shao et al. (www.ece.cmu.edu/~babak/papers/tr-cmu-cs-03.pdf) have proposed a rigorous framework for scaling down OLTP and DSS workloads. The scaled-down workloads reduce the runtime requirement of the TPC-C and TPC-H benchmarks by orders of magnitude. Unfortunately, the workloads have only been tested against one hardware platform. Bring the workloads up on the AMD Opteron servers and validate their results. Also bring the workloads up on SimFlex and show that simulation times can indeed be reduced by several orders of magnitude while preserving microarchitectural characteristics. Talk to any of the TAs about this problem.

2. Streaming Memory Systems for a Single Processor. Memory system performance is a key bottleneck for a number of important server applications. Unfortunately, conventional data prefetching techniques do not work well for these applications because the memory addresses are quite irregular. Recent research (our group at CMU, Prof. Jim Smith’s group at Wisconsin, and Dr. Chilimbi’s group at Microsoft) has shown that there is much temporal correlation between memory addresses in applications; this follows because data structures are often traversed in similar fashions, with little modification to their physical mapping in memory over time. Design a streaming engine (like TSE from Wenisch et al.
at ISCA 2005) for a single-core system that would maintain the temporal correlation in memory accesses in the form of streams, and move them on/off chip collectively to hide memory access latency. Evaluate the effectiveness of your Temporal Streaming Engine (TSE) as compared to a conventional memory system for commercial workloads. Talk to Stephen Somogyi or Thomas Wenisch (twenisch@ece.cmu.edu) about this problem.

3. Scrubbing L1. Future servers are going to be highly vulnerable to soft errors. Recent work indicates that not all components of the system are equally vulnerable to soft errors. Specifically, on-chip caches have higher architectural vulnerability factors (AVF) because data resides in them for long periods of time. One approach to reducing data vulnerability in L1 caches is to flush the data when it is no longer necessary. AMD uses this approach, whereas Intel avoids it by using write-through caches. Simplify the dead-block predictor (Lai and Falsafi, ISCA’00) to predict only last stores to avoid periodic scrubbing. Show that last-store scrubbing is superior to periodic scrubbing from a bandwidth, latency, and AVF perspective. Compare and contrast with write-through caches. Talk to Brian Gold (bgold@cmu.edu) or Jared Smolens (jsmolens@ece.cmu.edu) about this problem.

4. Hardware STEPS. OLTP instruction footprints are very large and do not fit even in "large" 64KB L1 I-caches. Despite exercising the same code paths, code for different transactions is executed serially, effectively thrashing the instruction caches. The STEPS project (see http://www.cs.cmu.edu/~StagedDB/publications.html) minimizes instruction cache misses in OLTP workloads by multiplexing concurrent transactions and exploiting common code paths. One transaction paves the cache with instructions, while close followers enjoy a nearly miss-free execution.
To work, STEPS must determine when an I-cache becomes "full" with instructions from a given code path and switch to other threads (transactions) before moving on to other parts of execution and replacing the instructions in the I-cache. Currently, STEPS uses the existing CPU performance counters to determine when I-cache miss rates change in order to select a good point in time to switch threads. Come up with and evaluate a hardware assist mechanism to dynamically determine an appropriate moment for STEPS to switch among threads. Talk to Nikos Hardavellas or Ryan Johnson about this problem.

5. Constructing reusable cache state for multi-level hierarchies. Many recent computer architecture simulation methodologies ([SimFlex] [SimPoint]) launch simulations from checkpoints of architectural and microarchitectural state. These checkpoints typically contain snapshots of the contents of the memory hierarchy. However, researchers often want to vary the size/associativity of the caches over a series of experiments. There are well-understood techniques for reconstructing the state of a smaller cache from a larger one ([Barr:ISPASS'05] [VanBiesbrouck:HiPeac'05] [Wenisch:ISPASS'06]). However, these techniques only work if the configuration of a single level of the cache hierarchy is changed. Design, implement, and validate a technique for reconstructing accurate cache hierarchy state when the size/associativity of several cache levels is changed. Nikos has already worked out the big picture on a data structure that can do this, but it needs to be refined and implemented, and it lacks proofs demonstrating its correctness. Talk to Nikos Hardavellas about this problem.

6. Multiprocessor hybrid FPGA-based full-system emulators. Hybrid FPGA/software techniques have recently been developed at CMU as a practical way to build fast, full-system, FPGA-based computer system simulators.
These techniques partition an overall system design such that some subsets of behaviors that are practical to implement in hardware (e.g., user-level instructions) are parallelized and run on the FPGA. To ensure correctness and completeness of the system, the remaining behaviors (e.g., disk simulator, rare instructions) are modeled in a software full-system simulator. The result is a hybrid FPGA-based simulator that can run and simulate unmodified application binaries (even an OS). While a proof of concept exists for single-CPU systems, no work has been done on extending the existing approaches to multiprocessor emulation. Profile and evaluate multiprocessor workloads to see if existing techniques are extensible to multiprocessor systems. Propose a multiprocessor solution for hybrid FPGA-based simulation and implement a prototype in software only or on actual FPGAs. You will have to talk to Eric Chung (echung@ece.cmu.edu) and James Hoe about this problem.

7. Accurate MLP with runahead execution. Accurate modeling of MLP (Memory-Level Parallelism) is an important requirement in memory subsystem evaluation when simulating commercial applications such as OLTP on DB2. Such workloads tend to exhibit data-dependent misses in the cache and, as a result, MLP is a major determinant of performance. Runahead execution has recently been proposed as a complexity-effective way to increase MLP by speculatively retiring stalled loads at the end of the pipeline and recovering the architectural state when the data-dependent miss returns. Show that MLP can be accurately simulated for commercial workloads using a simple in-order functional model combined with runahead execution. Demonstrate your results by comparing against Flexus’s out-of-order timing model. Talk to Eric Chung (echung@ece.cmu.edu) about this problem.

8. Protecting memory with 2D parity.
Soft and hard errors are a growing concern for memory and logic designers. Projections show that the frequency of these errors will increase as we scale to more advanced processes. Some memories are currently protected using error correcting codes (ECC) and physical interleaving of data words. However, these schemes do not protect against some multi-bit error events and catastrophic failures (e.g., loss of an entire wordline due to electromigration). We propose to use a second dimension (vertical) of error coding (parity) to increase the robustness of the memory to multi-bit error events, both hard and soft. The VLSI-level costs of this novel error coding scheme are still not well understood, especially in comparison to multi-bit-correction ECC schemes. For an approximately equivalent level of error protection, how would 2D parity compare to a horizontal multi-bit-correction ECC scheme? The 2D scheme will be capable of correcting some catastrophic failures that a purely horizontal scheme cannot (e.g., a wordline failure) without the use of dual modular redundancy. Talk to Prof. Ken Mai (kenmai@ece.cmu.edu) or Prof. Falsafi about this problem.

9. Checking for in-order multithreaded pipelines. Reunion [MICRO-39] adds a checking stage to the retirement pipeline of a speculative out-of-order microarchitecture (such as the P6) to compare redundant execution across cores. The additional check latency can be absorbed by the out-of-order core’s buffering. In-order multithreaded pipelines, such as Sun’s Niagara T1, are effective at hiding cache miss latencies by switching to ready threads on a miss. However, simply adding a check stage to in-order pipelines either requires additional bypass paths or exposes retirement stalls. Investigate adding a cost-effective check stage to a Niagara-like pipeline (e.g., without additional bypass paths) and show that multithreaded in-order pipelines can hide the check latency.
Talk to Brian Gold (bgold@cmu.edu) or Jared Smolens (jsmolens@ece.cmu.edu) about this problem.

10. Fingerprinting across on-chip memory interconnects. Soft errors in architectural state can be detected by comparing fingerprints (periodic summaries of architectural state) across redundant processor cores. Past proposals assume fixed, dedicated datapaths for comparing fingerprints. This fixes, at design time, the pairs of cores that can be compared. However, a suitably wide datapath already exists in chip multiprocessors, in the form of the on-chip memory interconnect. Propose a design for transferring fingerprints across the on-chip interconnect and evaluate the performance and cache bandwidth implications of this design choice. Talk to Prof. Falsafi about this problem.

11. Eliminating serialization bottlenecks in redundant multicore microarchitectures. The Reunion execution model allows speculative redundant execution beyond instruction retirement; however, the microarchitecture evaluated in the MICRO-39 paper only explores speculation within the reorder buffer (ROB) in order to use existing precise exception rollback for recovery. This incurs retirement stalls because of instructions that serialize execution within the ROB (e.g., traps, memory barriers, and I/O instructions). A rollback-recovery mechanism that recovers instructions past instruction retirement, such as those proposed for transactional memory systems, could eliminate these serialization stalls. Propose and evaluate mechanisms to eliminate the serialization stalls exposed by checking for trap instructions. Talk to Brian Gold (bgold@cmu.edu) or Jared Smolens (jsmolens@ece.cmu.edu) about this problem.
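
To make the fingerprint comparison described in topic 10 concrete, here is a minimal software sketch. It is not the mechanism used in any published design. The `(destination, value)` update encoding, the interval of 16 updates, the use of CRC-32 from Python's `zlib`, and all function names are assumptions chosen for illustration.

```python
import zlib
from typing import Iterable, List, Optional, Tuple

# Hypothetical encoding of one retired update: (destination id, new value).
Update = Tuple[int, int]


def fingerprints(updates: Iterable[Update], interval: int) -> List[int]:
    """Summarize each run of `interval` updates as one CRC-32 fingerprint."""
    result: List[int] = []
    crc, count = 0, 0
    for dest, value in updates:
        record = dest.to_bytes(8, "little") + value.to_bytes(8, "little")
        crc = zlib.crc32(record, crc)  # fold this update into the running hash
        count += 1
        if count == interval:
            result.append(crc)
            crc, count = 0, 0          # start a fresh fingerprint
    if count:
        result.append(crc)             # partial final interval
    return result


def first_mismatch(a: List[int], b: List[int]) -> Optional[int]:
    """Index of the first interval whose fingerprints differ, or None."""
    for i, (fa, fb) in enumerate(zip(a, b)):
        if fa != fb:
            return i
    return None


def flip_bit(updates: List[Update], index: int, bit: int) -> List[Update]:
    """Return a copy of the stream with one bit flipped in one value (a soft error)."""
    corrupted = list(updates)
    dest, value = corrupted[index]
    corrupted[index] = (dest, value ^ (1 << bit))
    return corrupted


# Two redundant cores retire the same stream; one suffers a single-bit flip.
golden = [(i % 32, (i * 2654435761) & 0xFFFFFFFFFFFFFFFF) for i in range(100)]
faulty = flip_bit(golden, index=57, bit=3)

if __name__ == "__main__":
    print(first_mismatch(fingerprints(golden, 16), fingerprints(faulty, 16)))
```

The flipped bit falls in updates 48 to 63, so the cores disagree on the fingerprint for interval 3 only. In a real design, the exchanged values would travel over a dedicated datapath or, as this topic proposes, the on-chip interconnect, and each fingerprint sent is a cost to account for in the bandwidth evaluation.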