18-741 Project Statement (Fall 2004)
October 9, 2004
Proposal due date: October 20, 2004 noon at HH A302
Status report: November 10, 2004 noon at HH A302 (during the week of the 15th we will meet with every team)
Final report due date: December 10, 2004 during presentation/poster
1. Introduction
Tom will briefly go over this document on the Monday of the review prior to the Mid-term. In this
project, you will propose new designs or evaluate extensions to designs presented in
class or a recent architecture conference. You will use either SimpleScalar or Scaffold
(www.ece.cmu.edu/~simflex) as a base simulation environment and augment it with
a simulation model for the design of interest. The purpose of this project is to conduct and
generate publication-quality research and to improve the state of the art. The project
report will be in the form of a research article similar to those we have been discussing in
the class.
The project will account for 25% of your final course grade. The project will be graded
out of 100 points:
• proposal (5 points)
• status report (5 points)
• final report (80 points):
   • problem definition and motivation (15 points)
   • survey of previous related work (15 points)
   • description of design (15 points)
   • experimentation methodology (15 points)
   • analysis of results (20 points)
• poster/presentation (10 points)
2. Proposal
You are encouraged to meet with me and/or Tom prior to submitting the proposal if you
have any questions. When in doubt, meet with us. Please make an appointment to meet
with us or use the office hours.
The proposal is a written two-page document including:
1. A problem definition and motivation.
2. A brief survey of related work with at least four papers you have found and read
directly related to the topic (ask me for pointers). The CALCM web page
(www.ece.cmu.edu/CALCM) has links to online IEEE and ACM proceedings. Conferences that primarily focus on architecture research are ISCA, ASPLOS, MICRO, HPCA,
SIGMETRICS, ISLPED, and DSN. You can search the online pages and/or contact me for
pointers. You will also find the World Wide Computer Architecture web page
www.cs.wisc.edu/~arch/www a good source of information.
3. A detailed description of your experimental setup including modifications to the simulation environment. You should refer to J. R. Platt’s paper on Strong Inference (Reader
0) and explain how your experimental methodology follows his guidelines.
4. Milestones for the status and final report. What do you expect to see in each milestone? Where do you plan to go based on your observations? You can draw a flow
chart to clarify.
3. Status Report
1. You will hand in a two-page write-up describing your preliminary results. These results
form the basis for the final outcome of your research — i.e., the results will not substantially change. You will spend the rest of the semester polishing the results with
more extensive analysis and experimentation. Based on these results, explain your
plans for the final milestone. If there are any changes to plans, you should bring them
up in this report.
2. You will make an appointment to meet with me to go over the project status. The
appointments are made by filling out an appointment sheet in class the week prior to
the meetings.
4. Final Report
Reports should be in the form of a conference submission including an abstract, an introduction, followed by a detailed description of the design, a methodology section, a results
section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Please make
sure your document is spell-checked and is grammatically sound. You will need all the relevant citations to your work in a reference section at the end.
5. Posters and/or Presentations
On the Friday of the last week of classes, we will hold a poster and/or presentation session, in which teams will get to present their results orally. Please stay tuned for more
detail on this.
6. Best Projects
The top projects in class will be selected for submission to a computer systems conference for publication. In the past, a number of papers from 741 have become full-blown
research projects including the SMARTS paper on simulation sampling that appeared in
ISCA 2003, and the Spatial Pattern Prediction paper that appeared in HPCA 2004.
7. Infrastructure
SimpleScalar, like other simulators, is quite slow. Simulating applications from the
SPEC2K suite all the way to the end of execution may take days of simulation time on a
dedicated processor. You should use SimPoint (www.cse.ucsd.edu/~calder/simpoint) to reduce simulation turnaround while maintaining accurate performance
estimates. If using other simulators, such as SimFlex’s Scaffold, consult with Tom and Jangwoo about measurement methodology for commercial workloads. You should select
your measurement sizes (in instruction count) so that individual simulation runs take at
most a few hours.
8. Research Project Topics
Some of the project topics below are undisclosed ideas and are confidential.
Please do not distribute this list.
I will be open to any ideas you may have for a research project if you can convince me
that it is worth pursuing. Otherwise, here is a list of possible projects.
1. TEMPEST i-cache streaming. Recent research indicates that memory addresses are
temporally correlated — i.e., a given address is likely to be used in temporal proximity
to other addresses (please read Trishul Chilimbi’s PLDI papers on Hot Streams). Our
group is proposing TEMPEST (TEMPorally Extracted STreams), designs to exploit
temporal correlation in memory accesses to allow cache blocks to be streamed, rather
than demand fetched, to the CPU. We are currently working on designs for TEMPEST
data streaming. Much like data, instructions are also temporally correlated — e.g.,
instructions belonging to a procedure. Design and implement a TEMPEST engine for
the i-cache. Compare your design to an aggressive instruction prefetcher such as
Fetch-directed instruction prefetching by Calder and Austin.
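To make the mechanism concrete, here is a minimal Python sketch of a temporal stream table: record the successors of each miss address, and on a recurring miss, return the recorded stream for prefetching. The table organization, stream length, and all names are illustrative assumptions, not part of the TEMPEST design.

```python
# Hypothetical sketch of a temporal stream table for i-cache streaming.
# Assumption (not from this document): streams are fixed-length successor
# lists keyed by the miss address that starts them.

from collections import defaultdict, deque

STREAM_LEN = 4  # assumed maximum recorded stream length

class TemporalStreamTable:
    def __init__(self):
        self.table = defaultdict(list)               # head address -> successors
        self.history = deque(maxlen=STREAM_LEN + 1)  # recent miss addresses

    def record_miss(self, addr):
        """Log a miss; once enough history exists, record a stream."""
        self.history.append(addr)
        if len(self.history) == self.history.maxlen:
            head, *rest = self.history
            self.table[head] = list(rest)

    def lookup(self, addr):
        """On a new miss, return blocks that could be streamed in."""
        return list(self.table.get(addr, []))

# Toy usage: a repeating miss sequence, e.g. a procedure's i-cache blocks.
tst = TemporalStreamTable()
for a in [0x100, 0x140, 0x180, 0x1C0, 0x200]:
    tst.record_miss(a)
# The next time 0x100 misses, its recorded successors can be streamed:
print(tst.lookup(0x100))
```

A real engine would also need stream termination, confidence, and storage management; this only shows the record/replay core.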
2. Program-driven VM replacement. Much like caches, VM replacement today is primarily
based on heuristics such as LRU. With little hardware support, VM can significantly
benefit from program-extracted information (e.g., instruction control flow) such as
those used in a dead-block predictor. Design a dead-page predictor to optimize VM
page replacement and show that, with high accuracy, such a predictor can substantially
reduce the page fault rate relative to LRU. Compare and contrast your results with the
Miss-Ratio Curve replacement algorithm proposed in ASPLOS 2004 (this year).
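As a starting point for the comparison, miss-ratio curves can be computed from LRU stack distances. The sketch below is the generic textbook construction (Mattson-style), not the ASPLOS 2004 algorithm itself:

```python
# Sketch: miss-ratio curve (MRC) from LRU stack distances. A reuse at stack
# distance d hits in any LRU memory of size >= d; cold misses and reuses
# beyond max_size miss at every size considered.

def miss_ratio_curve(trace, max_size):
    """Return the miss ratio for every LRU memory size 1..max_size (pages)."""
    stack = []                        # LRU stack, most-recent at the end
    dist_hist = [0] * (max_size + 1)  # histogram of stack distances
    for page in trace:
        if page in stack:
            d = len(stack) - stack.index(page)  # stack distance (1 = MRU)
            if d <= max_size:
                dist_hist[d] += 1
        stack = [p for p in stack if p != page]
        stack.append(page)
    n = len(trace)
    curve, hits = [], 0
    for size in range(1, max_size + 1):
        hits += dist_hist[size]
        curve.append((n - hits) / n)
    return curve

# A cyclic trace over 3 pages: thrashes below size 3, fits at size 3.
print(miss_ratio_curve([1, 2, 3, 1, 2, 3, 1, 2, 3], max_size=4))
```

The dead-page predictor's benefit over LRU can then be reported against the MRC baseline at each memory size.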
3. Program-driven L2 replacement. Off-chip cache accesses are detrimental to performance. There are applications with adverse cache conflict behavior (e.g., database
systems) for which even highly-associative L2 caches do not work well. Design an L2
dead-block predictor to predict candidate blocks for replacement. Show that given high
accuracy, such a predictor can substantially reduce cache miss rates over LRU. Compare and contrast such a predictor with a time-out predictor using either cycle or reference counts.
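The reference-count variant of the time-out predictor can be sketched in a few lines: a block is declared dead once a threshold number of references to the cache occur without the block being touched. The threshold and bookkeeping below are assumed, simplified parameters:

```python
# Minimal sketch of a reference-count time-out dead-block predictor.
# Assumption: a single flat reference counter per block; a real design
# would track per-set references and use saturating hardware counters.

class TimeoutPredictor:
    def __init__(self, timeout):
        self.timeout = timeout
        self.idle = {}            # block -> references since last touch

    def access(self, block):
        """Record a reference; return the set of blocks now predicted dead."""
        for b in self.idle:
            if b != block:
                self.idle[b] += 1
        self.idle[block] = 0
        return {b for b, n in self.idle.items() if n >= self.timeout}

pred = TimeoutPredictor(timeout=3)
for blk in ["A", "B", "A", "C", "A", "C"]:
    dead = pred.access(blk)
# After three references without touching B, it is predicted dead:
print(dead)
```

A PC-based dead-block predictor would replace the counter with a signature of the instructions that touched the block, which is the comparison the project asks for.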
4. Transactional Processor. Speculation is fruitful if the benefits from optimized execution
due to speculation offset the negative impact of misspeculation/rollback. For instance,
current branch predictors improve performance because they achieve a high enough
accuracy to allow speculative execution over tens of instructions. Today’s superscalar
processors support speculation through small reorder buffers and load/store queues.
There are a number of candidate scenarios for speculative execution that have orders
of magnitude larger verification latency than branch prediction but also lead to orders
of magnitude increase in performance if successful. For example, a synchronization
access in a multiprocessor (a read-modify-write operation on a memory address) can
take thousands of cycles while the processor can speculate that the lock is available.
Propose enhancements to the cache hierarchy and load/store queue to allow for
checkpointing execution windows of hundreds or thousands of instructions (checkpointing register state is comparatively easy). Evaluate the opportunity for performance
improvement from skipping locks using your transactional processor. You will need Scaffold for this work.
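The checkpoint/rollback control flow for lock elision can be illustrated with a toy model. The dictionary register file and explicit conflict flag below are illustrative assumptions, not a description of the actual hardware design:

```python
# Toy sketch of checkpoint-and-rollback speculation for skipping a lock:
# snapshot architectural state, execute the critical section speculatively,
# and restore the snapshot if a conflicting access is detected.

class SpeculativeCore:
    def __init__(self):
        self.regs = {}
        self.checkpoint = None

    def begin_speculation(self):
        self.checkpoint = dict(self.regs)   # snapshot architectural state

    def commit(self):
        self.checkpoint = None              # speculation succeeded

    def rollback(self):
        self.regs = self.checkpoint         # misspeculation: restore snapshot
        self.checkpoint = None

core = SpeculativeCore()
core.regs["r1"] = 10
core.begin_speculation()     # speculate the lock is free, skip the acquire
core.regs["r1"] = 99         # speculative work inside the critical section
conflict_detected = True     # another processor wrote the lock/shared data
if conflict_detected:
    core.rollback()
else:
    core.commit()
print(core.regs["r1"])
```

The hard part the project targets is what this sketch hides: buffering hundreds to thousands of speculative memory operations in the cache hierarchy and load/store queue until commit.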
5. Optimal Cache Sharing. Multithreading has been regarded as a technique that fundamentally thrashes caches and should be applied with great care. Gibbons and Blelloch
have recently shown in a SPAA 2004 paper that careful scheduling of threads in a multithreaded processor running a parallel program actually requires only a modest
increase in cache size over a single-threaded application performing the same work.
Evaluate their results in the context of a real commercial application (e.g., a database
management system prototype such as Shore or Postgres) by changing the thread
scheduling policy to follow their cache sharing model and measure cache miss ratio as
a function of cache size. Compare the results to a single-threaded processor which runs
the threads sequentially.
6. Spatial Pattern Predictors. Chen, Yang, Falsafi and Moshovos recently proposed simple PC-based predictors that predict spatial patterns in cache blocks. Such predictors
can be used to overcome the fixed block size limitations in caches by identifying
groups of (spatially-contiguous) cache blocks that will be accessed together. Evaluate
such predictors for commercial applications and show their effectiveness in identifying
spatial patterns. Evaluate these predictors in multiprocessors as a technique to reduce
unnecessary (false) sharing of data by communicating only the necessary effective
cache block size between sharing processors. Compare these results to coherence
decoupling to mitigate false sharing as proposed by Burger & Sohi, ASPLOS 2004 (the
paper is online at Prof. Burger’s UT Austin web site).
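A minimal sketch of the predictor's core idea, assuming (illustratively) one pattern entry per trigger PC and an 8-word block; the real predictor's table organization and training policy differ:

```python
# Sketch of a PC-indexed spatial pattern predictor: for each generation of
# a block (miss to eviction), record a bit vector of which words were
# touched, keyed by the PC that triggered the miss; predict that pattern
# on the next miss from the same PC.

WORDS_PER_BLOCK = 8   # assumed block geometry

class SpatialPatternPredictor:
    def __init__(self):
        self.patterns = {}      # trigger PC -> last observed pattern
        self.live = {}          # block addr -> (trigger PC, pattern bits)

    def access(self, pc, block, word):
        if block not in self.live:            # a miss starts a new generation
            self.live[block] = (pc, [0] * WORDS_PER_BLOCK)
        self.live[block][1][word] = 1

    def evict(self, block):
        """On eviction, train the predictor with the observed pattern."""
        pc, bits = self.live.pop(block)
        self.patterns[pc] = bits

    def predict(self, pc):
        return self.patterns.get(pc)

spp = SpatialPatternPredictor()
spp.access(pc=0x400, block=0x1000, word=0)   # miss: PC 0x400 is the trigger
spp.access(pc=0x404, block=0x1000, word=1)
spp.access(pc=0x408, block=0x1000, word=5)
spp.evict(0x1000)
print(spp.predict(0x400))
```

For the multiprocessor study, the predicted bit vector is what would bound the effective block size communicated between sharers.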
7. Dynamic Program Phase Identification. Recent research (such as the work by the SimPoint group) suggests that programs execute in phases that repeat across execution.
By identifying and exploiting such repetition, architects can substantially enhance execution for future high-end reconfigurable processors. Unfortunately, it appears that program phases are microarchitecture dependent and as such must be detected
dynamically. Use statistical tools for measuring the distance between distributions
(e.g., Chi-squared distance), and use them to identify unique program phases at runtime based on measured distribution of a performance metric of interest (e.g., IPC).
Show that indeed, phases are microarchitecture dependent. Compare your phase
detector against that proposed by Sherwood et al. in ISCA 2003.
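The statistical core of such a detector is small. The sketch below compares two measurement windows by the chi-squared distance between their normalized IPC histograms; the bin edges, window contents, and any detection threshold are illustrative assumptions:

```python
# Sketch of runtime phase detection: histogram a metric (e.g., per-interval
# IPC) over each window and compare consecutive windows with a chi-squared
# distance. A distance above some threshold signals a phase change.

def chi_squared(p, q):
    """Chi-squared distance between two equal-length histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def histogram(samples, edges):
    counts = [0] * (len(edges) + 1)
    for s in samples:
        i = sum(s >= e for e in edges)   # index of the bin s falls into
        counts[i] += 1
    n = len(samples)
    return [c / n for c in counts]       # normalize so windows compare

edges = [0.5, 1.0, 1.5]                  # assumed IPC bin boundaries
phase_a = [0.4, 0.45, 0.42, 0.48]        # low-IPC window
phase_b = [1.6, 1.7, 1.65, 1.8]          # high-IPC window

d_same = chi_squared(histogram(phase_a, edges), histogram(phase_a, edges))
d_diff = chi_squared(histogram(phase_a, edges), histogram(phase_b, edges))
print(d_same, d_diff)   # identical windows give 0; distinct phases do not
```

Running the same comparison on two microarchitectures with different IPC distributions is one way to exhibit the microarchitecture dependence the project asks you to show.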
8. Chiller. Modern high-end processors are designed for worst-case performance
demands. Unfortunately, such designs have led to a high variability in maximum power
density and heat on chip. Such variabilities make packaging costs prohibitively high.
Instead, many have proposed dynamic hotspot detection and cooling. By dynamically
detecting what chip area requires thermal management, microarchitectural techniques
to reduce power can be applied locally to alleviate the problem with little impact on performance. Build a simple floorplan model of an out-of-order core (e.g., Skadron et al.’s
model in their ISCA 2003 paper), use accurate heat conductivity models (as proposed
by Prof. Asheghi in ME) to evaluate heat distribution every cycle across near-neighbor
components. Use the models to identify and alleviate hot spots through resource scaling. A good place to start is lava.cs.virginia.edu.
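A toy version of the per-cycle near-neighbor update, before plugging in an accurate conductivity model: each grid cell exchanges heat with its four neighbors and adds its own dissipated power. The grid, coefficient, and power map are placeholders:

```python
# Toy near-neighbor heat diffusion over a floorplan grid. The conduction
# coefficient k and the power map are illustrative; a real study would use
# calibrated thermal RC models per floorplan component.

def diffuse(temp, power, k=0.2):
    """One timestep: each cell exchanges heat with its 4 neighbors."""
    rows, cols = len(temp), len(temp[0])
    new = [row[:] for row in temp]
    for r in range(rows):
        for c in range(cols):
            neighbors = [(r-1, c), (r+1, c), (r, c-1), (r, c+1)]
            flow = sum(temp[nr][nc] - temp[r][c]
                       for nr, nc in neighbors
                       if 0 <= nr < rows and 0 <= nc < cols)
            new[r][c] = temp[r][c] + k * flow + power[r][c]
    return new

temp  = [[50.0, 50.0], [50.0, 50.0]]
power = [[5.0, 0.0], [0.0, 0.0]]      # hotspot: one unit dissipates power
for _ in range(3):
    temp = diffuse(temp, power)
print(temp)   # the hotspot cell is hottest; heat spreads to its neighbors
```

Hotspot detection then reduces to thresholding this grid, and "alleviation" to zeroing or scaling entries of the power map via resource scaling.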
9. Nasty Leakage. Leakage power increases roughly five-fold every process generation. In a 90nm process, leakage will account for more power dissipation than switching activity.
Leakage is exponential in temperature, so the more the chip leaks, the hotter it gets,
and therefore it leaks exponentially more! This feedback loop results in thermal runaway. This
project is similar to (8) except that here, you will evaluate the relationship between
leakage, temperature, and time and apply microarchitectural techniques to avoid
thermal runaway.
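The feedback loop is easy to exhibit numerically. In the sketch below, all constants are illustrative, chosen only so that the loop converges below a critical operating point and runs away above it:

```python
# Sketch of the leakage-temperature feedback loop: leakage power grows
# exponentially with temperature, and dissipated power raises temperature.
# All coefficients here are made up for illustration.

import math

def simulate(t_ambient, p_dynamic, steps=200):
    """Iterate temperature; return the final value (inf flags runaway)."""
    temp = t_ambient
    for _ in range(steps):
        p_leak = math.exp(0.05 * temp)                 # leakage: exp in temp
        temp = t_ambient + 0.5 * (p_dynamic + p_leak)  # thermal resistance
        if temp > 1000:
            return float("inf")                        # thermal runaway
    return temp

print(simulate(t_ambient=45, p_dynamic=10))   # converges to a stable point
print(simulate(t_ambient=45, p_dynamic=60))   # inf: thermal runaway
```

The project's microarchitectural techniques amount to reducing the dynamic-power term (or the effective leaking area) before the loop crosses the critical point.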
10.Kill the Buffer Overflow Problem. The internet was brought down in the late 1980s partially due to a bug in the “finger daemon” that could be exploited using the buffer overflow trick. If all programs performed bounds checking when reading data from the
network, there would be no problems. Unfortunately, because we can’t always write
code carefully, we need to guard against malicious attacks on our carelessness. So, it
turns out that if you read data from a network packet onto the stack way past the point
you should, the packet data can overwrite the return address pointer on the stack to
point to code inside the packet! Propose architectural/microarchitectural techniques to
guard against such an attack. Evaluate your techniques using the simulator.
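One candidate guard, sketched below, is a shadow return-address stack: the hardware pushes the return address on every call and checks it on every return. This is only an illustration of the attack/defense mechanics, not a claim about which technique you should propose:

```python
# Sketch of a shadow return-address stack. On a call, hardware records the
# return address in protected storage; on a return, it compares the
# program stack's copy against the shadow copy and traps on a mismatch.

class ShadowStack:
    def __init__(self):
        self.shadow = []

    def on_call(self, return_addr):
        self.shadow.append(return_addr)

    def on_return(self, return_addr_from_stack):
        """Trap if the on-stack return address was overwritten."""
        expected = self.shadow.pop()
        if return_addr_from_stack != expected:
            raise RuntimeError("return address corrupted: possible overflow")
        return return_addr_from_stack

ss = ShadowStack()
ss.on_call(0x401000)        # the call site pushes its return address
# A buffer overflow overwrites the on-stack copy to point into packet data:
corrupted = 0x7FFF0000
caught = False
try:
    ss.on_return(corrupted)
except RuntimeError:
    caught = True
print(caught)
```

An evaluation would measure the storage and pipeline cost of the check and its coverage against overflow variants (e.g., deep recursion overflowing the shadow storage).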
11.Machine Learning & Branch Prediction. Apply machine learning to improve the accuracy and/or cost of branch prediction. Design and evaluate a “correlating feature selector (CFS)” that will accurately select which specific history bits a branch correlates to.
Design and evaluate a table-based predictor using your CFS. Use architectural features such as register values as input to the feature set. You may refer to Fern et al.’s
(www.ece.cmu.edu/~babak/pub.html) technical report on CFS predictors.
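A minimal sketch of the feature-selection step: score each global-history bit by how strongly it agrees with the branch outcome, then keep the top-k bits as predictor inputs. The agreement-rate scoring rule here is an assumption; a real CFS might use mutual information or a learned metric:

```python
# Sketch of a correlating feature selector (CFS) for branch prediction:
# rank history bits by their correlation with the branch outcome.

def score_bits(samples, history_len):
    """samples: list of (history_bits, outcome). Return a score per bit."""
    scores = []
    for i in range(history_len):
        agree = sum(h[i] == outcome for h, outcome in samples)
        rate = agree / len(samples)
        scores.append(abs(rate - 0.5))   # 0.5 = uncorrelated; 0 or 1 = strong
    return scores

def select_features(samples, history_len, k):
    scores = score_bits(samples, history_len)
    return sorted(range(history_len), key=lambda i: -scores[i])[:k]

# Toy trace: the outcome always equals history bit 2; bits 0-1 are noisier.
samples = [
    ((0, 1, 1), 1), ((1, 0, 0), 0), ((1, 1, 1), 1),
    ((0, 0, 0), 0), ((1, 0, 1), 1), ((0, 1, 0), 0),
]
print(select_features(samples, history_len=3, k=1))
```

The selected bits would then index a small table-based predictor per static branch, which is the design the project asks you to evaluate.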
12.Accelerating CPU Simulation with Cache-independent Checkpointing. The SMARTS
paper (ISCA 2003) showed that proper application of sampling theory can reduce CPU
simulation turnaround by orders of magnitude while achieving highly accurate results.
Nevertheless, SMARTS simulation speed remains bottlenecked by fast-forwarding
past instructions between sampled measurements. Follow-on work (TurboSMARTS,
draft paper available from Tom) shows that fast-forwarding can be eliminated by storing simulated cache, branch predictor, and architectural state in checkpoints. However, the current design ties checkpoints to a particular cache and branch predictor
configuration. Develop a new simulation methodology which combines checkpointing
and cache/branch predictor warmup (i.e., as in Haskins & Skadron, ISPASS 2003) to
maximize simulation speed and minimize checkpoint storage cost without requiring a
fixed cache/branch predictor configuration.
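The sampling idea underlying SMARTS can be shown in miniature: estimate a mean metric from short periodic measurement windows instead of the full trace. The synthetic per-instruction CPI trace and the period/window values below are purely illustrative:

```python
# Sketch of SMARTS-style systematic sampling: measure `window` values
# every `period` values and average the samples. With enough samples the
# estimate tracks the true mean even across program phases.

def sampled_mean(trace, period, window):
    samples = []
    for start in range(0, len(trace), period):
        samples.extend(trace[start:start + window])
    return sum(samples) / len(samples)

# Synthetic trace alternating between two program phases:
trace = [1.0] * 5000 + [3.0] * 5000 + [1.0] * 5000 + [3.0] * 5000

true_mean = sum(trace) / len(trace)
estimate = sampled_mean(trace, period=100, window=10)
print(true_mean, estimate)
```

What checkpointing removes is the cost of reaching each measurement window; the project's challenge is keeping the checkpoints small and valid across cache/branch predictor configurations via warmup.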
13.SyntSim and SMARTS. Researchers at Cornell University have developed a simulation toolset (SyntSim) that generates high-speed functional simulators, up to twenty
times faster than sim-fast, by translating a target benchmark’s binary into a custom
functional simulator for that benchmark (paper to appear in MICRO ‘04). Integrate this
binary-translation approach to functional simulation with sampled microarchitectural
simulation as proposed in SMARTS. You must determine the best way to integrate
cycle-accurate simulation into the binary-translation framework and how to perform
cache and branch predictor warmup during binary-translated functional simulation (see
Chen, MS Thesis for ideas).