18-741 Project Statement (Fall 2006)

Proposal due date: Friday, October 13, 2006 4:30pm at HH A302
Milestone 1 meeting: Friday, October 27, 2006 (meeting with TAs during the day)
Milestone 2 report: Wednesday, November 8, 2006 4:30pm at HH A302
Milestone 2 meeting week: Starting November 6 (meeting with Prof. Falsafi or TAs)
Final report and presentation due date: Friday, December 8, 2006
1. Introduction
In this project, you will propose new designs or evaluate extensions to designs presented
in class or at a recent architecture conference. You will use Flexus
(www.ece.cmu.edu/~simflex) as a base simulation environment and augment it with a
simulation model for the design of interest. The purpose of this project is to conduct
publication-quality research and to improve the state of the art. The project report will be
in the form of a research article similar to those we have been discussing in class.
The project will account for 30% of your final course grade. The project will be graded
out of 100 points:
• proposal (5 points)
• milestone 1 (5 points)
• milestone 2 (5 points)
• final report (75 points)
    • problem definition and motivation (10 points)
    • survey of previous related work (15 points)
    • description of design (15 points)
    • experimentation methodology (15 points)
    • analysis of results (20 points)
• poster/presentation (10 points)
2. Proposal
You are encouraged to meet with me and/or the TAs prior to submitting the proposal if you
have any questions. When in doubt, meet with us. Please make an appointment to meet
with us or use the office hours.
The proposal is a written two-page document including:
1. A problem definition and motivation.
2. A brief survey of related work with at least four papers you have found and read
directly related to the topic (ask me for pointers). Please explain in detail how this
prior work relates to your proposed work. The CALCM web page
(www.ece.cmu.edu/CALCM) has links to online IEEE and ACM proceedings. Conferences that
primarily focus on architecture research are ISCA, ASPLOS, MICRO, HPCA,
SIGMETRICS, ISLPED, and DSN. You can search the online pages and/or contact us for
pointers. You will also find the World Wide Computer Architecture web page
(www.cs.wisc.edu/~arch/www) a good source of information.
3. A detailed description of your experimental setup, including modifications to the simulation environment. You should refer to J. R. Platt’s paper on Strong Inference (from
the Reader) and explain how your experimental methodology follows his guidelines.
4. Milestones for the status and final report. What do you expect to see in each milestone? Where do you plan to go based on your observations? You can draw a flow
chart to clarify.
3. Milestone 1
1. This milestone will ensure that you have successfully brought up the infrastructure you
will need for your project. Furthermore, you must demonstrate the problem your project
attacks using this experimental infrastructure. For example, you might need to bring up
a simulation framework and reproduce some baseline case or prior results as a starting point for your own work.
2. You will make an appointment to meet with one of us (faculty or TA team) to present
your own results (possibly replicated) motivating your project and explain how your
infrastructure is suitable for your project. The appointments are made by filling out an
appointment sheet in class the week prior to the meetings.
3. NOTE: The purpose of this meeting is not for you to explain to us why you are having
difficulty bringing up your infrastructure. If you encounter unexpected difficulties, please
notify us early on so we can help you work through them.
4. Milestone 2
1. You will hand in a two-page write-up describing your preliminary results. These results
form the basis for the final outcome of your research — i.e., the results will not substantially change. You will spend the rest of the semester polishing the results with
more extensive analysis and experimentation. Based on these results, explain your
plans for the final milestone. If there are any changes to plans, you should bring them
up in this report.
2. You will make an appointment to meet with one of us (faculty or TA team) to go over
the project status. The appointments are made by filling out an appointment sheet in
class the week prior to the meetings.
5. Final Report
Reports should be in the form of a conference submission including an abstract, an introduction, followed by a detailed description of the design, a methodology section, a results
section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Please make
sure your document is spell-checked and grammatically sound. Include all relevant citations in a reference section at the end.
6. Posters and/or Presentations
On the Friday of the last week of classes, we will hold a poster and/or presentation session, in which teams will get to present their results orally. Please stay tuned for more
detail on this.
7. Best Projects
The top projects in class will be selected for submission to a computer systems conference for publication. In the past, a number of papers from 741/742 have become full-blown
research projects, including the SMARTS paper on simulation sampling that appeared in
ISCA 2003 and the Spatial Pattern Prediction paper that appeared in HPCA 2004 and
subsequently in ISCA 2006.
8. Infrastructure
SimFlex is a full-system simulation environment and as such is slower than a user-level
single-thread simulator like SimpleScalar. You should select your measurement sizes (in
instruction count) so that individual simulation runs take at most a few hours. Refer to
http://www.ece.cmu.edu/~simflex and look for Flexus for proper simulation methodology.
Questions about simulator setup should be directed to one of the TAs.
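As a rough planning aid for choosing measurement sizes, the sketch below converts a wall-clock budget into an instruction count. The throughput figure in the usage note is a made-up placeholder, not a measured Flexus number; measure your own simulator configuration before relying on it.

```python
def max_measured_instructions(sim_instr_per_sec: float, budget_hours: float) -> int:
    """Largest instruction count whose simulation fits in the wall-clock budget.

    sim_instr_per_sec is an assumed simulator throughput (simulated
    instructions per wall-clock second) that you must measure yourself.
    """
    if sim_instr_per_sec <= 0 or budget_hours <= 0:
        raise ValueError("throughput and budget must be positive")
    return int(sim_instr_per_sec * budget_hours * 3600)


def estimated_hours(instructions: int, sim_instr_per_sec: float) -> float:
    """Wall-clock hours needed to simulate the given number of instructions."""
    if sim_instr_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return instructions / sim_instr_per_sec / 3600
```

For example, with a hypothetical throughput of 10,000 simulated instructions per second, `max_measured_instructions(10_000, 3)` gives the measurement size that fits in a three-hour run.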
9. Research Project Topics
Some of the project topics below are undisclosed ideas and are confidential.
Please do not distribute this list.
I will be open to any ideas you may have for a research project if you can convince me
that it is worth pursuing. Otherwise, here is a list of possible projects.
1. DBmbench. Benchmarking commercial applications on a simulator is extremely slow. Shao et
al. (www.ece.cmu.edu/~babak/papers/tr-cmu-cs-03.pdf) have proposed a rigorous scaling
framework for scaling down OLTP and DSS workloads. These workloads reduce the runtime
requirement of the TPC-C and TPC-H benchmarks by orders of magnitude. Unfortunately, the
workloads have only been tested against one hardware platform. Bring the workloads up on the
AMD Opteron servers and validate their results. Also bring the workloads up on SimFlex and
show that simulation times can indeed be reduced by several orders of magnitude while preserving microarchitectural characteristics. Talk to any of the TAs about this problem.
2. Streaming Memory Systems for a Single Processor. Memory system performance is a key bottleneck for a number of important server applications. Unfortunately, conventional data
prefetching techniques do not work well for these because the memory addresses are quite
irregular. Recent research (from our group at CMU, Prof. Jim Smith’s group at Wisconsin, and Dr.
Chilimbi’s group at Microsoft) has shown that there is much temporal correlation between
memory addresses in applications; this correlation arises because data structures are often traversed in similar fashions with little modification to their physical mapping in memory over
time. Design a streaming engine (like TSE from Wenisch et al. at ISCA 2005) for a single-core
system that would maintain the temporal correlation in memory accesses in the form of
streams, and move them on/off chip collectively to hide memory access latency. Evaluate your
Temporal Streaming Engine’s (TSE’s) effectiveness compared to a conventional memory system for commercial workloads. Talk to Stephen Somogyi or Thomas Wenisch (twenisch@ece.cmu.edu) about this problem.
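To make the idea of temporal correlation in topic 2 concrete, here is a minimal software sketch of a predictor that logs miss addresses in order and, on a repeated miss, replays the addresses that followed it last time. It is an illustration only, not the TSE design: the class name, the unbounded log, and the `depth` parameter are all assumptions made for this example.

```python
class TemporalStreamPredictor:
    """Toy temporal-correlation predictor (hypothetical, unbounded storage)."""

    def __init__(self, depth: int = 4):
        self.depth = depth                    # addresses to stream per lookup
        self.log: list[int] = []              # miss addresses in observed order
        self.last_pos: dict[int, int] = {}    # address -> latest index in log

    def on_miss(self, addr: int) -> list[int]:
        """Record a miss and return the addresses that followed it last time."""
        pos = self.last_pos.get(addr)
        stream = [] if pos is None else self.log[pos + 1 : pos + 1 + self.depth]
        self.last_pos[addr] = len(self.log)
        self.log.append(addr)
        return stream


def covered_fraction(trace: list[int], depth: int = 4) -> float:
    """Fraction of baseline misses whose address was streamed in beforehand."""
    predictor = TemporalStreamPredictor(depth)
    prefetched: set[int] = set()
    covered = 0
    for addr in trace:
        if addr in prefetched:
            covered += 1
            prefetched.discard(addr)
        # Covered misses are still logged so the recorded order stays intact.
        prefetched.update(predictor.on_miss(addr))
    return covered / len(trace) if trace else 0.0
```

A real design must bound the log, decide where it lives (on or off chip), and limit how far ahead it streams; those trade-offs are the substance of the project.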
3. Scrubbing L1. Future servers are going to be highly vulnerable to soft errors. Recent work indicates that not all components of the system are equally vulnerable to soft errors. Specifically, on-chip
caches have higher architectural vulnerability factors (AVF) because data resides in them for
long periods of time. One approach to reducing data vulnerability in L1 caches is to flush the
data when it is no longer necessary. AMD uses this approach, whereas Intel avoids
it by using write-through caches. Simplify the dead-block predictor (Lai and Falsafi,
ISCA’00) to predict only last stores to avoid periodic scrubbing. Show that last-store scrubbing
is superior to periodic scrubbing from a bandwidth, latency, and AVF perspective. Compare and
contrast with write-through caches. Talk to Brian Gold (bgold@cmu.edu) or Jared Smolens
(jsmolens@ece.cmu.edu) about this problem.
4. Hardware STEPS. OLTP instruction footprints are too large to fit even in "large" 64KB
L1 I-caches. Although different transactions exercise the same code paths, their code is executed serially, effectively thrashing the instruction caches. The STEPS project (see
http://www.cs.cmu.edu/~StagedDB/publications.html) minimizes instruction cache misses in OLTP
workloads by multiplexing concurrent transactions and exploiting common code paths. One
transaction paves the cache with instructions, while close followers enjoy a nearly miss-free
execution. To work, STEPS must determine when an I-cache becomes "full" with instructions
from a given code path and switch to other threads (transactions) before moving on to other
parts of execution and replacing the instructions in the I-cache. Currently STEPS uses the
existing CPU performance counters to determine when I-cache miss rates change in order to
select a good point in time to switch threads. Come up with and evaluate a hardware assist
mechanism to dynamically determine an appropriate moment in time for STEPS to perform the
switch amongst threads. Talk to Nikos Hardavellas or Ryan Johnson about this problem.
5. Constructing reusable cache state for multi-level hierarchies. Many recent computer architecture simulation methodologies ([SimFlex][SimPoint]) launch simulations from checkpoints of
architectural and microarchitectural state. These checkpoints typically contain snapshots of the
contents of the memory hierarchy. However, researchers often want to vary the size/associativity of the caches over a series of experiments. There are well-understood techniques for reconstructing the state of a smaller cache from a larger one ([Barr:ISPASS'05]
[VanBiesbrouck:HiPeac'05] [Wenisch:ISPASS'06]). However, these techniques only work if the
configuration of a single level of the cache hierarchy is changed. Design, implement, and validate a technique for reconstructing accurate cache hierarchy state when the size/associativity
of several cache levels is changed. Nikos has already worked out the big picture of a data
structure that can do this, but it needs to be refined and implemented, and it lacks proofs demonstrating its correctness. Talk to Nikos Hardavellas about this problem.
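For background on topic 5, the single-level case can be sketched as follows. This is a generic illustration, not the method of any of the cited papers. It assumes true LRU replacement, a checkpoint that stores each resident block's last-access time, set indexing by block address modulo the set count, and a smaller configuration whose set count divides the larger one and whose associativity is no greater. All function names are hypothetical.

```python
from collections import defaultdict


def simulate_lru(trace, num_sets, ways):
    """Run a block-address trace through an LRU cache.

    Returns {block: last_access_time} for the blocks resident at the end.
    """
    sets = defaultdict(dict)  # set index -> {block: last_access_time}
    for t, block in enumerate(trace):
        lines = sets[block % num_sets]
        if block not in lines and len(lines) == ways:
            victim = min(lines, key=lines.get)  # least recently used block
            del lines[victim]
        lines[block] = t
    return {b: t for lines in sets.values() for b, t in lines.items()}


def reconstruct_smaller(large_state, small_sets, small_ways):
    """Derive the smaller cache's contents from the larger cache's snapshot.

    Blocks are regrouped by their set index in the smaller cache, and the
    most recently used `small_ways` blocks of each group are kept.
    """
    by_set = defaultdict(list)
    for block, t in large_state.items():
        by_set[block % small_sets].append((t, block))
    small = {}
    for entries in by_set.values():
        for t, block in sorted(entries, reverse=True)[:small_ways]:
            small[block] = t
    return small
```

Under these assumptions the reconstruction matches a direct simulation of the smaller cache, which can be checked by running both on the same trace. The open problem in this topic is that no such simple rule carries over once several levels of the hierarchy change at the same time.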
6. Multiprocessor hybrid FPGA-based full-system emulators. Hybrid FPGA/software techniques
have recently been developed at CMU as a practical way to build fast, full-system, FPGA-based computer system simulators. These techniques partition an overall system design such
that the subset of behaviors that are practical to implement in hardware (e.g., user-level instructions) are parallelized
and run on the FPGA. To ensure correctness and completeness of the system, the remaining behaviors (e.g., disk simulator, rare instructions) are modeled in a software full-system simulator. The result is a hybrid FPGA-based simulator that can
run and simulate unmodified application binaries (even an OS). While the proof-of-concept
exists for single-CPU systems, no work has been done on extending the existing approaches
for multiprocessor emulation. Profile and evaluate multiprocessor workloads to see if existing
techniques are extensible for multiprocessor systems. Propose a multiprocessor solution for
hybrid FPGA-based simulation and implement a prototype in software-only or on actual
FPGAs. Talk to Eric Chung (echung@ece.cmu.edu) and James Hoe about this problem.
7. Accurate MLP with runahead execution. Accurate modeling of MLP (Memory-Level Parallelism)
is an important requirement in memory subsystem evaluation when simulating commercial
applications such as OLTP on DB2. Such workloads tend to exhibit data-dependent misses in
the cache and, as a result, MLP is a major determinant of performance. Runahead execution has
recently been proposed as a complexity-effective way to increase MLP by speculatively retiring
stalled loads at the end of the pipeline and recovering the architectural state when the data-dependent miss returns. Show that MLP can be accurately simulated for commercial workloads
using a simple in-order functional model combined with runahead execution. Demonstrate your
results by comparing against Flexus’s out-of-order timing model. Talk to Eric Chung
(echung@ece.cmu.edu) about this problem.
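When comparing two simulation models for topic 7, you need a concrete MLP metric. One commonly used definition is the average number of long-latency misses outstanding over the cycles in which at least one is outstanding. The sketch below computes that from a list of miss intervals. The input format (start and end cycle per miss, end exclusive) is an assumption for this example, not a Flexus output format.

```python
def average_mlp(misses):
    """Average outstanding misses over cycles with at least one outstanding.

    misses: iterable of (start_cycle, end_cycle) pairs, end exclusive.
    Returns 0.0 when there are no misses.
    """
    events = []
    for start, end in misses:
        if end <= start:
            raise ValueError("each miss must end after it starts")
        events.append((start, 1))    # miss issued
        events.append((end, -1))     # miss data returned
    events.sort()                    # at equal cycles, returns sort before issues

    outstanding = 0
    busy_cycles = 0      # cycles with at least one miss outstanding
    miss_cycles = 0      # sum over cycles of the number of outstanding misses
    prev = 0
    for cycle, delta in events:
        if outstanding > 0:
            busy_cycles += cycle - prev
            miss_cycles += outstanding * (cycle - prev)
        outstanding += delta
        prev = cycle
    return miss_cycles / busy_cycles if busy_cycles else 0.0
```

Two fully serialized misses give an MLP of 1.0 and two fully overlapped misses give 2.0, so the metric directly exposes how much overlap a model such as runahead execution recovers.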
8. Protecting memory with 2D parity. Soft and hard errors are a growing concern for memory and
logic designers. Projections show that the frequency of these errors will increase as we scale to
more advanced processes. Some memories are currently protected using error correcting
codes (ECC) and physical interleaving of data words. However, these schemes do not protect
against some multi-bit error events and catastrophic failures (e.g., loss of an entire wordline
due to electromigration). We propose to use a second (vertical) dimension of error coding (parity) to increase the robustness of the memory to multi-bit error events, both hard and soft. The VLSI-level costs of this novel error coding scheme are still not well
understood, especially in comparison to multi-bit-correction ECC schemes. For an approximately equivalent level of error protection, how would 2D parity compare to a horizontal multi-bit correction ECC scheme? The 2D scheme will be capable of correcting some catastrophic
failures that a purely horizontal scheme cannot (e.g., in the case of a wordline failure) without
use of dual modular redundancy. Talk to Prof. Ken Mai (kenmai@ece.cmu.edu) or Prof. Falsafi about this problem.
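The basic mechanism behind topic 8 can be illustrated with textbook two-dimensional parity: one parity bit per word (horizontal) plus one parity word across all rows (vertical). The sketch below is not the proposed VLSI scheme, whose details are not given here. The function names and the use of plain Python integers as memory words are assumptions for illustration.

```python
def parity(word: int) -> int:
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") & 1


def encode(words):
    """Return (row_parity, col_parity) for a list of memory words."""
    row_parity = [parity(w) for w in words]   # horizontal: one bit per row
    col_parity = 0
    for w in words:
        col_parity ^= w                       # vertical: one bit per column
    return row_parity, col_parity


def correct_single_bit(words, row_parity, col_parity):
    """Locate and flip one erroneous bit using both parity dimensions."""
    bad_rows = [i for i, w in enumerate(words) if parity(w) != row_parity[i]]
    syndrome = col_parity
    for w in words:
        syndrome ^= w                         # set bits mark mismatching columns
    if not bad_rows and syndrome == 0:
        return list(words)
    if len(bad_rows) == 1 and bin(syndrome).count("1") == 1:
        fixed = list(words)
        fixed[bad_rows[0]] ^= syndrome
        return fixed
    raise ValueError("error pattern is not a single-bit flip")


def rebuild_row(words, col_parity, bad_row):
    """Recover a row known to be lost (e.g., a failed wordline)."""
    value = col_parity
    for i, w in enumerate(words):
        if i != bad_row:
            value ^= w
    return value
```

The `rebuild_row` case shows why the vertical dimension helps with a wordline failure: every bit of the lost row is recoverable from the vertical parity and the surviving rows, which a purely horizontal code cannot do.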
9. Checking for in-order multithreaded pipelines. Reunion [MICRO-39] adds a checking stage to
the retirement pipeline of a speculative out-of-order microarchitecture (such as the P6) to compare redundant execution across cores. The additional check latency can be absorbed by the
out-of-order core’s buffering. In-order multithreaded pipelines, such as Sun’s Niagara T1, are
effective at hiding cache miss latencies by switching to ready threads on a miss. However, simply adding a check stage to in-order pipelines either requires additional bypass paths or
exposes retirement stalls. Investigate adding a cost-effective check stage to a Niagara-like
pipeline (e.g., without additional bypass paths) and show that multithreaded in-order pipelines
can hide the check latency. Talk to Brian Gold (bgold@cmu.edu) or Jared Smolens (jsmolens@ece.cmu.edu) about this problem.
10. Fingerprinting across on-chip memory interconnects. Soft errors in architectural state can be
detected by comparing fingerprints (periodic summaries of architectural state) across redundant processor cores. Past proposals assume fixed, dedicated datapaths for comparing fingerprints. This fixes the pairs of cores that can be compared at design time. However, a suitable
wide datapath already exists in chip multiprocessors, in the form of an on-chip memory interconnect. Propose a design for transferring fingerprints across the on-chip interconnect and
evaluate the performance and cache bandwidth implications of this design choice. Talk to Prof.
Falsafi about this problem.
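To illustrate what a fingerprint comparison in topic 10 involves, the sketch below hashes a stream of architectural state updates into one value per interval and finds the first interval where two redundant cores disagree. CRC-32 is used only because Python's standard library provides it. The hash function, the (destination, value) update format, and the interval length are assumptions for this example and are not taken from the fingerprinting proposals.

```python
import struct
import zlib


def fingerprints(updates, interval):
    """Summarize state updates as one CRC-32 per `interval` updates.

    updates: iterable of (dest, value) pairs, each an unsigned 64-bit integer.
    A final partial interval also produces a fingerprint.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    out = []
    crc = 0
    count = 0
    for dest, value in updates:
        crc = zlib.crc32(struct.pack("<QQ", dest, value), crc)
        count += 1
        if count == interval:
            out.append(crc)
            crc = 0
            count = 0
    if count:
        out.append(crc)
    return out


def first_mismatch(a, b):
    """Index of the first interval whose fingerprints differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None
```

Each interval costs only one small message between the paired cores, so the interval length sets the trade-off between interconnect bandwidth and error detection latency that this topic asks you to evaluate.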
11. Eliminating serialization bottlenecks in redundant multicore microarchitectures. The Reunion
execution model allows speculative redundant execution beyond instruction retirement; however, the microarchitecture evaluated in the MICRO-39 paper only explores speculation within
the reorder buffer (ROB) in order to use existing precise exception rollback for recovery. This
incurs retirement stalls because of instructions that serialize execution within the ROB (e.g.,
traps, memory barriers, and I/O instructions). A rollback-recovery mechanism that recovers
instructions past instruction retirement, such as those proposed for transactional memory systems, could eliminate these serialization stalls. Propose and evaluate mechanisms to eliminate
the serialization stalls exposed by checking for trap instructions. Talk to Brian Gold
(bgold@cmu.edu) or Jared Smolens (jsmolens@ece.cmu.edu) about this problem.
September 27, 2006