18-741 Project Description

Proposal due date: October 11, 2002 (10:30am, Melissa's office)
First status report: October 28, 2002
Second status report: November 13, 2002
Final report due date: December 4, 2002

Introduction

In this project, you will evaluate extensions to, or re-evaluate, an architecture presented in class or at a recent architecture conference, using software simulation. You will use SimpleScalar as a base simulation environment and augment it with a simulation model for the design of interest. The purpose of this project is to conduct and generate high-quality research, to improve the state of the art, and to potentially disseminate the results through a conference publication. The final report will take the form of a conference-quality research article. The project will be graded on:

1. problem definition and motivation,
2. survey of previous related work,
3. experimentation methodology,
4. presentation and discussion of results.

Proposal

The deadline is given above. The proposal is a written two-page document including:

1. A problem definition and motivation.

2. A brief survey of related work covering at least four papers you have obtained and read. The project web page has links to online IEEE and ACM proceedings. Conferences that primarily focus on architecture research are ISCA, ASPLOS, MICRO, HPCA, SIGMETRICS, ISLPED, and DSN. You can search the online pages and/or contact me for pointers. You will also find the World Wide Computer Architecture web page (http://www.cs.wisc.edu/~arch/www) a good source of information.

3. A detailed description of your experimental setup, including modifications to the simulation environment. You should refer to J. R. Platt's paper on Strong Inference (Reader 0) and explain how your experimental methodology follows his guidelines.

4. Milestones for the first and second status reports. What do you expect to see in each milestone? Where do you plan to go based on your observations?
You can draw a flow chart to show this.

18-741 Project Description, October 2, 2002

Status Reports

For the first status report:

1. You will hand in a two-page write-up describing your preliminary results. Based on these results, explain your plans for the next two milestones (status report 2 and the final report). Have your plans changed since the proposal? How?

2. You will fill out an appointment sheet on Monday, Oct. 28th in class to stop by my office and go over your results on Thursday, Oct. 31st or Friday, Nov. 1st.

For the second status report, you will hand in a two-page write-up describing your progress toward the second milestone, with mature results and the corresponding analysis. Any further improvements you may want to evaluate beyond the original proposal should be described here.

Final Report

Reports should be in the form of a conference submission, including an abstract and an introduction, followed by a detailed description of the design, a methodology section, a results section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Please make sure your document is spell-checked and grammatically sound. You will need all the relevant citations to your work in a reference section at the end.

Best Projects

The top projects in class will be nominated for submission to SOCS (the student Symposium on Computer Systems at CMU) and subsequently to a computer architecture conference.

Infrastructure

For simulation, you will be using either SimpleScalar 3.0 or Wattch (Brooks et al., ISCA 2000). You will run jobs using Condor, a batch facility that will schedule your jobs on a cluster of 25 high-end Intel/Linux boxes. Do not run jobs interactively on any machines unless you are debugging. Run jobs only through the Condor batch facility to allow fair use of the machines.

Workload

SimpleScalar, like other simulators, is quite slow.
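By way of illustration, a Condor submit file for one SimpleScalar run might look like the following sketch. The paths, benchmark binary, and flag values below are placeholders for your own setup, and sim-outorder's -fastfwd and -max:inst options are used on the assumption that your build supports them:

```
# submit.sim -- hypothetical Condor submit file; adjust paths for your setup
universe   = vanilla
executable = /path/to/simplescalar/sim-outorder
# -fastfwd skips past initialization; -max:inst bounds the simulated window
arguments  = -fastfwd 500000000 -max:inst 500000000 benchmark.ss bench-input
output     = benchmark.out
error      = benchmark.err
log        = benchmark.log
queue
```

Submit the job with condor_submit and monitor it with condor_q; one submit file per benchmark keeps runs independent and easy to reschedule.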
Simulating applications from the SPEC2K suite all the way to the end of execution may take days of simulation time on a dedicated processor. Clearly, you will not be able to simulate all the apps all the way to completion. You should use the "skip" feature of SimpleScalar to skip the first 0.5-1.0 billion instructions, and simulate only the second 0.5-1.0 billion instructions in every application as a representative set of instructions and workload. Your individual simulation runs should not take more than 5-6 hours.

Research Project Topics

The project topics below are undisclosed ideas and are confidential. Please do not distribute this list. I will be open to any ideas you may have for a research project if you can convince me they are worth pursuing. Otherwise, here is a list of possible projects.

Ultra-Deep-Submicron Processors

Semiconductor fabrication trends are leading to CMOS technologies that present a variety of challenges for computer architects. Semiconductor processes ca. 2010 will lead to chips with tens of billions of transistors that consume prohibitive amounts of power, designs whose performance is dominated by wire latency, yields that drop to unacceptably low levels, and processors that are vulnerable to transient and design errors (due to complexity). The projects in this category address these challenges at a fundamental level.

1. The Tripod project proposes a tiled/grid processor architecture, in which the chip is made up of a heterogeneous set of tiles, where every tile is a partition of a given datapath structure (e.g., register file bank, L1 cache bank, ALU, ROB, issue queue, etc.). By partitioning the structures and providing a reconfigurable on-chip network fabric, the various tiles can be enabled/disabled dynamically to allow for power management. Tile connections can be reconfigured to allow tiles to provide redundancy in storage or computation.
Tiles with fabrication errors can be disconnected electrically through the network to allow for higher yield. Evaluate the potential for a Tripod processor by modeling a superscalar datapath with partitionable components (see me in person ASAP if you are interested in this project).

2. Evaluate the feasibility of an asynchronously scalable processor based on Ivan Sutherland's Fleet architecture. The Fleet architecture has no wire-latency bottleneck or structural hazards and is inherently scalable. The background for this task is Ivan Sutherland's distinguished lecture from last fall. The video is available online and ought to serve as a starting point for this work. Ivan's Fleet makes a convincing argument for why a collection of functional units, interconnected by a packet/token switching fabric, could make a scalable processor architecture. However, Ivan did not present any form of control structure, so the design is incomplete (see me in person ASAP if you are interested in this project).

Proactive Memory

Conventional cache hierarchies are primarily managed with a demand-fetch and LRU/random-replace strategy. With the increase in hierarchy depth, these simple management strategies are hitting the point of diminishing returns. The following ideas either help bridge the processor/memory performance gap through novel hierarchy management, or reduce the complexity and power of existing designs at isoperformance.

3. Propose and evaluate a dead-block correlating prefetcher (e.g., similar in spirit to Lai, Fide, and Falsafi's) for instruction caches.

4. Compare and contrast DBCP's effectiveness against a timer-based prefetcher as proposed by Zhang, Kaxiras, and Martonosi in ISCA 2002. Timer-based prefetchers approximate last-touch by placing expiration counters on every cache block to trigger a prefetch.

5. Evaluate DBP's and DBCP's accuracy and coverage for commercial workloads.
Current results on these predictors only evaluate uniprocessor desktop programs. Use our in-house Simics tracer (talk to Tom about this) to generate real memory traces for IBM's DB2 (a stock database management system) running TPC-C (online transaction processing benchmark) and TPC-H (decision support/data mining benchmark) queries.

6. As in trace caches, DBP's tables intuitively store many signatures that are used infrequently and may not contribute much to prefetching. Do a detailed analysis of the usage frequency of signatures in DBP tables. Propose a technique to avoid storing such signatures in the tables and reduce table storage by an order of magnitude.

7. Evaluate power savings in DRAM memory for memory pages that are not actively used. Evaluate the impact of prefetching as a technique to enhance power management in DRAM memory by hiding the "ramp-up" latency when DRAM banks are in low-power mode.

Superscalar Processors & Simulation

Modern superscalar processors rely on a large supply of independent instructions from the instruction stream to extract instruction-level parallelism and increase performance. The ideas below either improve instruction throughput through the superscalar datapath, reduce the complexity and power of current designs, or improve their reliability/robustness.

8. A fundamental bottleneck in current superscalar processors is that instructions in the issue queue are often data- or control-dependent on earlier instructions. Because the issue queue is an associative structure, its size does not scale well (compared to other resources such as the number of functional units, register file size, etc.) across generations of design. Propose an "out-of-order" dispatch superscalar that predicts when instruction results will be ready and dispatches instructions into the issue queue only when they are ready to go, to avoid clogging the issue queue. Show that, using your design, you can achieve isoperformance with much smaller issue queue sizes.

9.
Many applications exhibit a structural hazard on the load/store queue due to a large number of loads/stores in the instruction stream (e.g., > 50%). Like the issue queue, the load/store queue is an associative structure that is limited in size and whose size does not scale well (see project 8). Propose a load/store queue design that temporarily removes long-latency instructions (e.g., loads/stores that miss) from the queue to allow higher load/store queue throughput. Compare your design against Lebeck et al.'s design from ISCA 2002.

10. Fields et al. proposed a critical path predictor in ISCA 2001 (and ISCA 2002) that allows predicting which instructions are on the critical path of execution. Propose a non-vital instruction predictor (for instructions that are not on the critical path), and use the predictor to steer such instructions to a group of slow functional units to save power. Power is equal to CV^2f and is linearly proportional to frequency. Evaluate the opportunity for reducing power using such a predictor.

11. Apply machine learning to improve the accuracy and/or cost of branch prediction. Design and evaluate a "correlating feature selector (CFS)" that will accurately select the specific history bits a branch correlates with. Design and evaluate a table-based predictor using your CFS. You may refer to Fern et al.'s technical report on CFS predictors (http://www.ece.cmu.edu/~babak/pub.html).

12. There is much variation in IPC across a program's execution (Roland Wunderlich can provide you with IPC profiles of entire programs). Yet much power in modern processors is dissipated every cycle in the issue queue's search for independent instructions to increase IPC. Design and evaluate a history-based issue queue size and issue-width predictor that allows dynamically adjusting either of the two to save power.
13. Ray et al. (MICRO 2001) propose a superscalar architecture that enables transient-fault-tolerant computing. Unfortunately, the proposed architecture significantly impacts an application's performance as compared to a superscalar processor that does not offer fault tolerance. Design a redundant superscalar datapath that identifies and executes only a fraction of the original instructions to detect errors. Evaluate how your design improves performance over Ray et al.'s technique.

14. Last year's 18-547 students (Tom Wenisch and Roland Wunderlich) developed a simulation methodology that increases the simulation speed of sim-outorder to that of sim-cache (i.e., ~50x speedup) with no loss in accuracy! Propose a technique that would increase the performance of sim-cache (e.g., instruction emulation or result memoization) so that entire applications that usually take over 30 days to simulate can be simulated in less than an hour.
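To make the result-memoization idea in project 14 concrete, here is a minimal sketch, written in Python rather than the simulator's C, and using an invented toy instruction encoding (the real sim-cache decoder operates on the target ISA's actual formats). The idea is that a loop body re-executes the same few instruction words millions of times, so caching decode results lets all but the first decode of each word be a table lookup:

```python
# Toy sketch of result memoization for instruction decode.
# The 32-bit field layout below is hypothetical, for illustration only.

decode_cache = {}  # instruction word -> decoded fields

def decode(inst_word):
    """Decode an instruction word, memoizing the result."""
    cached = decode_cache.get(inst_word)
    if cached is not None:
        return cached  # hit: skip the (expensive) field extraction
    fields = (
        inst_word >> 26,           # opcode
        (inst_word >> 21) & 0x1F,  # source register
        (inst_word >> 16) & 0x1F,  # destination register
        inst_word & 0xFFFF,        # immediate
    )
    decode_cache[inst_word] = fields
    return fields
```

Since a program touches only a bounded set of static instructions, the cache stays small; the same trick can be applied to other repeated per-instruction work, such as effective-address field extraction.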