18-741 Project Statement (Fall 2004)

Proposal due date: October 20, 2004, noon, at HH A302
Status report: November 10, 2004, noon, at HH A302 (during the week of the 15th we will meet with every team)
Final report due date: December 10, 2004, during the presentation/poster session

1. Introduction

Tom will briefly go over this document during the review on the Monday prior to the midterm. In this project, you will develop new designs or evaluate extensions to designs presented in class or at a recent architecture conference. You will use either SimpleScalar or Scaffold (www.ece.cmu.edu/~simflex) as a base simulation environment and augment it with a simulation model for the design of interest. The purpose of this project is to conduct publication-quality research and to improve the state of the art. The project report will be in the form of a research article similar to those we have been discussing in class.

The project will account for 25% of your final course grade. The project will be graded out of 100 points:
• proposal (5 points)
• status report (5 points)
• final report (80 points):
  • problem definition and motivation (15 points)
  • survey of previous related work (15 points)
  • description of design (15 points)
  • experimentation methodology (15 points)
  • analysis of results (20 points)
• poster/presentation (10 points)

2. Proposal

You are encouraged to meet with me and/or Tom prior to submitting the proposal if you have any questions. When in doubt, meet with us. Please make an appointment to meet with us or use the office hours.

The proposal is a written two-page document including:
1. A problem definition and motivation.
2. A brief survey of related work with at least four papers that you have found and read and that are directly related to the topic (ask me for pointers). The CALCM web page (www.ece.cmu.edu/CALCM) has links to online IEEE and ACM proceedings. Conferences that primarily focus on architecture research are ISCA, ASPLOS, MICRO, HPCA, SIGMETRICS, ISLPED, and DSN. You can search the online pages and/or contact me for pointers. You will also find the World Wide Computer Architecture web page (www.cs.wisc.edu/~arch/www) a good source of information.
3. A detailed description of your experimental setup, including modifications to the simulation environment. You should refer to J. R. Platt's paper on Strong Inference (Reader 0) and explain how your experimental methodology follows his guidelines.
4. Milestones for the status and final reports. What do you expect to see at each milestone? Where do you plan to go based on your observations? You can draw a flow chart to clarify.

3. Status Report

1. You will hand in a two-page write-up describing your preliminary results. These results form the basis for the final outcome of your research; that is, the results should not substantially change afterwards. You will spend the rest of the semester polishing the results with more extensive analysis and experimentation. Based on these results, explain your plans for the final milestone. If there are any changes to your plans, you should bring them up in this report.
2. You will make an appointment to meet with me to go over the project status. Appointments are made by filling out an appointment sheet in class the week prior to the meetings.
4. Final Report

Reports should be in the form of a conference submission, including an abstract, an introduction, a detailed description of the design, a methodology section, a results section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Please make sure your document is spell-checked and grammatically sound. Include all relevant citations to your work in a reference section at the end.

5. Posters and/or Presentations

On the Friday of the last week of classes, we will hold a poster and/or presentation session in which teams will get to present their results orally. Please stay tuned for more detail on this.

6. Best Projects

The top projects in class will be selected for submission to a computer systems conference for publication. In the past, a number of papers from 741 have become full-blown research projects, including the SMARTS paper on simulation sampling that appeared in ISCA 2003 and the Spatial Pattern Prediction paper that appeared in HPCA 2004.

7. Infrastructure

SimpleScalar, like other simulators, is quite slow. Simulating applications from the SPEC2K suite all the way to the end of execution may take days of simulation time on a dedicated processor. You should use SimPoint (www.cse.ucsd.edu/~calder/simpoint) to reduce simulation turnaround while maintaining accurate performance estimates. If you are using other simulators, such as SimFlex's Scaffold, consult with Tom and Jangwoo about measurement methodology for commercial workloads. You should select your measurement sizes (in instruction count) so that individual simulation runs take at most a few hours.

8. Research Project Topics

Some of the project topics below are undisclosed ideas and are confidential. Please do not distribute this list. I will be open to any ideas you may have for a research project if you can convince me that it is worth pursuing. Otherwise, here is a list of possible projects.

1. TEMPEST i-cache streaming. Recent research indicates that memory addresses are temporally correlated; that is, a given address is likely to be used in temporal proximity to other addresses (please read Trishul Chilimbi's PLDI papers on Hot Streams). Our group is proposing TEMPEST (TEMPorally Extracted STreams), a set of designs that exploit temporal correlation in memory accesses to allow cache blocks to be streamed, rather than demand fetched, to the CPU. We are currently working on designs for TEMPEST data streaming. Much like data, instructions are also temporally correlated; for example, the instructions belonging to a procedure. Design and implement a TEMPEST engine for the i-cache. Compare your design to an aggressive instruction prefetcher such as fetch-directed instruction prefetching by Calder and Austin.

2. Program-driven VM replacement. Much like caches, VM replacement today is primarily based on heuristics such as LRU. With little hardware support, VM can significantly benefit from program-extracted information (e.g., instruction control flow) such as that used in a dead-block predictor. Design a dead-page predictor to optimize VM page replacement and show that, with high accuracy, such a predictor can substantially reduce the page fault rate relative to LRU. Compare and contrast your results with the Miss-Ratio Curve replacement algorithm proposed in ASPLOS 2004 (this year).
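To make topic 2 concrete, here is a minimal C sketch of one possible shape for a trace-based dead-page predictor. The table size, signature hash, and counter width are illustrative assumptions, not a prescribed design; the dead-block predictors in the literature use more elaborate signatures.

    /* Minimal sketch of a trace-based dead-page predictor (illustration
     * only; all sizes and the hash are assumptions). Each physical page
     * keeps a running signature of the PCs that touch it. When the OS
     * evicts a page, its signature is recorded as "dead"; a page whose
     * current signature hits in the table is predicted dead and becomes
     * a preferred replacement candidate. */
    #include <stdint.h>

    #define PRED_ENTRIES 4096              /* predictor table size (assumption) */

    typedef struct {
        uint32_t signature;                /* running hash of PCs touching the page */
    } page_state_t;

    static uint8_t dead_table[PRED_ENTRIES];   /* 2-bit saturating counters */

    static uint32_t sig_index(uint32_t sig) { return sig % PRED_ENTRIES; }

    /* Called on every access that touches 'page' from instruction 'pc'. */
    void page_touch(page_state_t *page, uint32_t pc)
    {
        page->signature = (page->signature << 1) ^ pc;   /* simple trace hash */
    }

    /* Returns nonzero if the page is predicted dead (safe to replace). */
    int page_predict_dead(const page_state_t *page)
    {
        return dead_table[sig_index(page->signature)] >= 2;
    }

    /* Called when the OS actually evicts the page: train toward "dead". */
    void page_evicted(const page_state_t *page)
    {
        uint8_t *ctr = &dead_table[sig_index(page->signature)];
        if (*ctr < 3) (*ctr)++;
    }

    /* Called when a predicted-dead page is touched again (misprediction). */
    void page_mispredicted(const page_state_t *page)
    {
        uint8_t *ctr = &dead_table[sig_index(page->signature)];
        if (*ctr > 0) (*ctr)--;
    }

The same skeleton, indexed by per-block rather than per-page signatures, is one way to approach the L2 dead-block predictor in topic 3.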
3. Program-driven L2 replacement. Off-chip cache accesses are detrimental to performance. There are applications with adverse cache conflict behavior (e.g., database systems) for which even highly-associative L2 caches do not work well. Design an L2 dead-block predictor to predict candidate blocks for replacement. Show that, given high accuracy, such a predictor can substantially reduce cache miss rates relative to LRU. Compare and contrast such a predictor with a time-out predictor using either cycle or reference counts.

4. Transactional Processor. Speculation is fruitful when the benefits of optimized speculative execution offset the negative impact of misspeculation and rollback. For instance, current branch predictors improve performance because they achieve a high enough accuracy to allow speculative execution over tens of instructions. Today's superscalar processors support speculation through small reorder buffers and load/store queues. There are a number of candidate scenarios for speculative execution that have verification latencies orders of magnitude larger than branch prediction, but that also lead to order-of-magnitude increases in performance if successful. For example, a synchronization access in a multiprocessor (a read-modify-write operation on a memory address) can take thousands of cycles, while the processor could speculate that the lock is available. Propose enhancements to the cache hierarchy and load/store queue to allow for checkpointing execution windows of hundreds or thousands of instructions (checkpointing register state is not a major concern). Evaluate the opportunity for performance improvement from skipping locks using your transactional processor. You will need Scaffold for this work.

5. Optimal Cache Sharing. Multithreading has been regarded as a technique that fundamentally thrashes caches and should be applied with great care. Gibbons and Blelloch have recently shown in a SPAA 2004 paper that, with careful scheduling of threads, a multithreaded processor running a parallel program actually requires only a modest increase in cache size over a single-threaded processor performing the same work. Evaluate their results in the context of a real commercial application (e.g., a database management system prototype such as Shore or Postgres) by changing the thread scheduling policy to follow their cache sharing model, and measure the cache miss ratio as a function of cache size. Compare the results to a single-threaded processor that runs the threads sequentially.

6. Spatial Pattern Predictors. Chen, Yang, Falsafi and Moshovos recently proposed simple PC-based predictors that predict spatial patterns in cache blocks. Such predictors can be used to overcome the fixed block size limitations of caches by identifying groups of (spatially contiguous) cache blocks that will be accessed together. Evaluate such predictors for commercial applications and show their effectiveness in identifying spatial patterns. Evaluate these predictors in multiprocessors as a technique to reduce unnecessary (false) sharing of data by communicating only the necessary effective cache block size between sharing processors. Compare these results to coherence decoupling to mitigate false sharing, as proposed by Burger & Sohi in ASPLOS 2004 (the paper is online at Prof. Burger's UT Austin web site).
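As a rough illustration of the mechanism topic 6 describes, the C sketch below shows one minimal form of a PC-indexed spatial pattern table. The region size, table organization, and training policy are assumptions for illustration, not the published predictor.

    /* Minimal sketch of a PC-indexed spatial pattern predictor (illustration
     * only; sizes and indexing are assumptions). A spatial region is a group
     * of SPATIAL_BLOCKS contiguous cache blocks. While a region is live we
     * record which blocks were touched; on region eviction the bit vector is
     * stored in a table indexed by the PC that first touched the region and
     * serves as the prediction the next time that PC triggers the region. */
    #include <stdint.h>

    #define SPATIAL_BLOCKS 8               /* blocks per spatial region (assumption) */
    #define TABLE_ENTRIES  2048

    typedef struct {
        uint32_t trigger_pc;               /* PC of first access to the region */
        uint8_t  touched;                  /* one bit per block touched so far */
    } region_t;

    static uint8_t pattern_table[TABLE_ENTRIES];

    static uint32_t pc_index(uint32_t pc) { return (pc >> 2) % TABLE_ENTRIES; }

    /* First access to a region: start recording and return the predicted
     * pattern (which blocks to fetch together). */
    uint8_t region_start(region_t *r, uint32_t pc, unsigned block_in_region)
    {
        r->trigger_pc = pc;
        r->touched = (uint8_t)(1u << block_in_region);
        return pattern_table[pc_index(pc)];
    }

    /* Subsequent access within the live region: record the block. */
    void region_access(region_t *r, unsigned block_in_region)
    {
        r->touched |= (uint8_t)(1u << block_in_region);
    }

    /* Region evicted: train the table with the observed pattern. */
    void region_end(const region_t *r)
    {
        pattern_table[pc_index(r->trigger_pc)] = r->touched;
    }

In a multiprocessor study, the predicted bit vector would determine how many blocks of the region to transfer between sharers instead of a fixed block size.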
7. Dynamic Program Phase Identification. Recent research (such as the work by the SimPoint group) suggests that programs execute in phases that repeat across execution. By identifying and exploiting such repetition, architects can substantially enhance execution for future high-end reconfigurable processors. Unfortunately, it appears that program phases are microarchitecture dependent and as such must be detected dynamically. Use statistical tools for measuring the distance between distributions (e.g., the Chi-squared distance), and use them to identify unique program phases at runtime based on the measured distribution of a performance metric of interest (e.g., IPC). Show that phases are indeed microarchitecture dependent. Compare your phase detector against the one proposed by Sherwood et al. in ISCA 2003.

8. Chiller. Modern high-end processors are designed for worst-case performance demands. Unfortunately, such designs have led to high variability in maximum power density and heat on chip. This variability makes packaging costs prohibitively high. Instead, many have proposed dynamic hotspot detection and cooling: by dynamically detecting which chip areas require thermal management, microarchitectural techniques to reduce power can be applied locally to alleviate the problem with little impact on performance. Build a simple floorplan model of an out-of-order core (e.g., Skadron et al.'s model in their ISCA 2003 paper) and use accurate heat conductivity models (as proposed by Prof. Asheghi in ME) to evaluate the heat distribution across near-neighbor components every cycle. Use the models to identify and alleviate hot spots through resource scaling. A good place to start is lava.cs.virginia.edu.

9. Nasty Leakage. Leakage power increases five-fold every generation. In a 90nm process, leakage will account for more power dissipation than dynamic switching. Leakage is exponential in temperature, so the more the chip leaks, the hotter it gets, and therefore it leaks exponentially more. This feedback results in thermal runaway. This project is similar to (8) except that here you will evaluate the relationship between leakage, temperature, and time, and apply microarchitectural techniques to avoid thermal runaway.

10. Kill the Buffer Overflow Problem. The Internet was brought down in the late 1980s partly due to a bug in the "finger daemon" that could be exploited using a buffer overflow. If all programs performed bounds checking when reading data from the network, there would be no problem. Unfortunately, because we cannot always write code carefully, we need to guard against malicious attacks on our carelessness. If a program reads data from a network packet onto the stack past the end of the buffer, the packet data can overwrite the return address on the stack so that it points to code inside the packet. Propose architectural/microarchitectural techniques to guard against such an attack. Evaluate your techniques using the simulator.

11. Machine Learning & Branch Prediction. Apply machine learning to improve the accuracy and/or cost of branch prediction. Design and evaluate a "correlating feature selector" (CFS) that will accurately select which specific history bits a branch correlates with. Design and evaluate a table-based predictor using your CFS. Use architectural features such as register values as input to the feature set. You may refer to Fern et al.'s technical report on CFS predictors (www.ece.cmu.edu/~babak/pub.html).
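For topic 11, one simple way to see "which history bits a branch correlates with" is a perceptron-style predictor in the spirit of Jimenez and Lin: each weight tracks the correlation between one global-history bit and the branch outcome, so small-magnitude weights flag history bits a table-based predictor could safely ignore. The sketch below is a minimal illustration under assumed sizes; it is not Fern et al.'s CFS design and is only a starting point for selecting features.

    /* Sketch of a perceptron-style predictor used as a crude correlating
     * feature gauge (illustration only; sizes and threshold are assumptions). */
    #include <stdint.h>

    #define HIST_BITS   16                 /* global history length (assumption) */
    #define NUM_PERCEP  1024               /* perceptron table entries (assumption) */
    #define THRESHOLD   30                 /* training threshold (assumption) */

    static int8_t   w[NUM_PERCEP][HIST_BITS + 1];  /* +1 for the bias weight */
    static uint32_t ghr;                            /* global history register */

    static uint32_t pidx(uint32_t pc) { return (pc >> 2) % NUM_PERCEP; }

    /* Predict: returns 1 for taken, 0 for not taken; *out_sum returns the
     * dot product so the update routine can apply the training threshold. */
    int predict(uint32_t pc, int *out_sum)
    {
        int8_t *wt = w[pidx(pc)];
        int sum = wt[0];                            /* bias weight */
        for (int i = 0; i < HIST_BITS; i++)
            sum += ((ghr >> i) & 1) ? wt[i + 1] : -wt[i + 1];
        *out_sum = sum;
        return sum >= 0;
    }

    /* Update on a resolved branch outcome (1 = taken, 0 = not taken):
     * train on a misprediction or when the dot product is below threshold. */
    void update(uint32_t pc, int sum, int taken)
    {
        int8_t *wt = w[pidx(pc)];
        int t = taken ? 1 : -1;
        if ((sum >= 0) != taken || (sum < THRESHOLD && sum > -THRESHOLD)) {
            if (wt[0] > -127 && wt[0] < 127) wt[0] += (int8_t)t;
            for (int i = 0; i < HIST_BITS; i++) {
                int x = ((ghr >> i) & 1) ? 1 : -1;
                int d = t * x;                      /* correlation update */
                if (wt[i + 1] + d <= 127 && wt[i + 1] + d >= -127)
                    wt[i + 1] = (int8_t)(wt[i + 1] + d);
            }
        }
        ghr = (ghr << 1) | (taken ? 1u : 0u);       /* shift outcome into history */
    }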
12. Accelerating CPU Simulation with Cache-independent Checkpointing. The SMARTS paper (ISCA 2003) showed that proper application of sampling theory can reduce CPU simulation turnaround by orders of magnitude while achieving highly accurate results. Nevertheless, SMARTS simulation speed remains bottlenecked by fast-forwarding through the instructions between sampled measurements. Follow-on work (TurboSMARTS, draft paper available from Tom) shows that fast-forwarding can be eliminated by storing simulated cache, branch predictor, and architectural state in checkpoints. However, the current design ties checkpoints to a particular cache and branch predictor configuration. Develop a new simulation methodology that combines checkpointing with cache/branch predictor warmup (e.g., as in Haskins & Skadron, ISPASS 2003) to maximize simulation speed and minimize checkpoint storage cost without requiring a fixed cache/branch predictor configuration.

13. SyntSim and SMARTS. Researchers at Cornell University have developed a simulation toolset (SyntSim) that generates high-speed functional simulators, up to twenty times faster than sim-fast, by translating a target benchmark's binary into a custom functional simulator for that benchmark (paper to appear in MICRO '04). Integrate this binary-translation approach to functional simulation with sampled microarchitectural simulation as proposed in SMARTS. You must determine the best way to integrate cycle-accurate simulation into the binary-translation framework and how to perform cache and branch predictor warmup during binary-translated functional simulation (see Chen's MS thesis for ideas).
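For intuition on why the sampling in topics 12 and 13 works, the C sketch below estimates CPI from periodic detailed measurement intervals and reports a confidence interval from the sample variance. The bookkeeping is a minimal sketch, not the SMARTS infrastructure; the sample count limit and the z-value are assumptions for illustration.

    /* Sketch of systematic-sampling CPI estimation in the spirit of SMARTS
     * (illustration only). Detailed simulation runs for a short interval out
     * of every sampling period; the instructions in between are handled by
     * fast functional simulation or, with checkpointing, skipped entirely. */
    #include <math.h>
    #include <stdio.h>

    #define MAX_SAMPLES 100000             /* sample buffer size (assumption) */

    static double cpi_sample[MAX_SAMPLES];
    static int    n_samples;

    /* Called at the end of each detailed measurement interval. */
    void record_sample(unsigned long long cycles, unsigned long long insts)
    {
        if (n_samples < MAX_SAMPLES)
            cpi_sample[n_samples++] = (double)cycles / (double)insts;
    }

    /* Report the CPI estimate with a ~99.7% confidence interval (z = 3). */
    void report_estimate(void)
    {
        if (n_samples < 2) return;         /* need variance of the mean */
        double sum = 0.0, var = 0.0;
        for (int i = 0; i < n_samples; i++) sum += cpi_sample[i];
        double mean = sum / n_samples;
        for (int i = 0; i < n_samples; i++) {
            double d = cpi_sample[i] - mean;
            var += d * d;
        }
        var /= (n_samples - 1);
        double err = 3.0 * sqrt(var / n_samples);   /* z * std. error of mean */
        printf("CPI = %.4f +/- %.4f (%d samples)\n", mean, err, n_samples);
    }

The same estimator applies whether the gaps between samples are fast-forwarded, binary-translated (topic 13), or restored from checkpoints (topic 12); only the cost of reaching each measurement interval changes.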