18-741 Project Description

Proposal due date: October 11, 2002 (10:30am Melissa’s office)
First status report: October 28, 2002
Second status report: November 13, 2002
Final report due date: December 4, 2002
Introduction
In this project, you will evaluate extensions to, or re-evaluate, an architecture presented in
class or at a recent architecture conference, using software simulation. You will use SimpleScalar as a base simulation environment and augment it with a simulation model for the
design of interest. The purpose of this project is to conduct high-quality research, to
improve the state of the art, and potentially to disseminate the results through a conference
publication. The final report will take the form of a conference-quality research article.
The project will be graded on:
1. problem definition and motivation,
2. survey of previous related work,
3. experimentation methodology,
4. presentation and discussion of results.
Proposal
The deadline is given above. The proposal is a written two-page document including:
1. A problem definition and motivation.
2. A brief survey of related work with at least four papers you have obtained and read.
The project web page has links to online IEEE and ACM proceedings. Conferences that
primarily focus on architecture research include ISCA, ASPLOS, MICRO, HPCA, SIGMETRICS,
ISLPED, and DSN. You can search the online pages and/or contact me for pointers. You
will also find the WWW Computer Architecture Page (http://www.cs.wisc.edu/~arch/www)
a good source of information.
3. A detailed description of your experimental setup including modifications to the simulation environment. You should refer to J. R. Platt’s paper on Strong Inference
(Reader 0) and explain how your experimental methodology follows his guidelines.
4. Milestones for the first and second status reports. What do you expect to see in each
milestone? Where do you plan to go based on your observations? You can draw a flow
chart to show this.
October 2, 2002
Status Reports
For the first status report:
1. You will hand in a two-page write-up describing your preliminary results. Based on
these results, explain your plans for the next two milestones (status report 2 and the
final report). Have your plans changed since the proposal? How?
2. You will fill out an appointment sheet on Monday Oct. 28th in class to stop by my
office and go over your results on Thursday Oct. 31st or Friday Nov. 1st.
For the second status report, you will hand in a two-page write-up describing your
progress toward the second milestone, including mature results and the corresponding
analysis. Any further improvements you want to evaluate beyond the original proposal
should be described here.
Final Report
Reports should be in the form of a conference submission: an abstract and an introduction,
followed by a detailed description of the design, a methodology section, a results section,
and a conclusion section. You may choose to write a related work section preceding the
conclusions, or a background section following the introduction. Please make sure your
document is spell-checked and grammatically sound. You will need all the relevant
citations to your work in a reference section at the end.
Best Projects
The top projects in class will be nominated for submission to SOCS (the student Symposium on Computer Systems at CMU) and subsequently to a computer architecture conference.
Infrastructure
For simulation, you will be using either SimpleScalar 3.0 or Wattch (Brooks et al., ISCA
2000). You will run jobs using Condor, a batch scheduling facility that will schedule your
jobs to run on a cluster of 25 high-end Intel/Linux boxes. Do not run jobs interactively on
any machines unless you are debugging. Run jobs only through the Condor batch facility
to allow fair use of the machines.
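A minimal Condor submit description might look like the sketch below. The executable name, benchmark arguments, and file names are placeholders; check the course cluster's Condor installation for the correct universe and paths.

```
# Sketch of a Condor submit file (sim.condor) -- names and paths are placeholders
universe   = vanilla
executable = sim-outorder
arguments  = -fastfwd 500000000 -max:inst 500000000 benchmark.ss
output     = run.out
error      = run.err
log        = run.log
queue
```

Submit the job with `condor_submit sim.condor` and monitor it with `condor_q`.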
Workload
SimpleScalar, like other simulators, is quite slow. Simulating applications from the
SPEC2K suite all the way to the end of execution may take days of simulation time on a
dedicated processor. Clearly you will not be able to simulate all the apps to completion.
You should use the “skip” feature of SimpleScalar to skip the first 0.5-1.0 billion
instructions, and only simulate the next 0.5-1.0 billion instructions in every application
as a representative set of instructions for the workload. Your individual simulation runs
should not take more than 5-6 hours.
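Assuming SimpleScalar 3.0's sim-outorder flags -fastfwd (skip the first N instructions using fast functional simulation) and -max:inst (stop after N simulated instructions), a run over the representative window might look like the following; the benchmark binary is a placeholder, and its own arguments follow it on the command line.

```
# Skip the first 500M instructions, then simulate the next 500M in detail
sim-outorder -fastfwd 500000000 -max:inst 500000000 benchmark.ss
```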
Research Project Topics
The project topics below are undisclosed ideas and are confidential. Please do not
distribute this list.
I will be open to any ideas you may have for a research project if you can convince me that
it is worth pursuing. Otherwise, here is a list of possible projects.
Ultra-Deep-Submicron Processors
Semiconductor fabrication trends are leading to CMOS technologies that present a variety
of challenges for computer architects. Semiconductor processes ca. 2010 will lead to chips
with tens of billions of transistors that consume prohibitive amounts of power, designs
whose performance is dominated by wire latency, fabrication defects that drive yield to
unacceptably low levels, and processors that are vulnerable to transient and design errors
(due to complexity). The projects in this category address these challenges at a
fundamental level.
1. The Tripod project proposes a tiled/grid processor architecture, in which the chip is
made up of a heterogeneous set of tiles, where every tile is a partition of a given datapath structure (e.g., register file bank, L1 cache bank, ALU, ROB, issue queue, etc.). By
partitioning the structures, and providing a reconfigurable on-chip network fabric, the
various tiles can be enabled/disabled dynamically to allow for power management. Tile
connections can be reconfigured to allow tiles to provide redundancy in storage or computation. Tiles with fabrication errors can be disconnected electrically through the network to allow for higher yield. Evaluate the potential for a Tripod processor by
modeling a superscalar datapath with partitionable components (see me in person ASAP if
you are interested in this project).
2. Evaluate the feasibility of an asynchronously scalable processor based on Ivan Sutherland’s Fleet architecture. The Fleet architecture has no wire-latency bottleneck or structural hazards and is inherently scalable. The background for this task is Ivan
Sutherland's distinguished lecture from last fall; the video is available online and ought
to serve as a starting point for this work. Fleet makes a convincing argument for why a
collection of functional units, interconnected by a packet/token switching fabric, could
make a scalable processor architecture. However, Sutherland did not present any form of
control structure, so the design is incomplete (see me in person ASAP if you are
interested in this project).
Proactive Memory
Conventional cache hierarchies are primarily managed with a demand-fetch,
LRU/random-replace strategy. As hierarchy depth increases, these simple management
strategies are hitting the point of diminishing returns. The following ideas either
help bridge the processor/memory performance gap through novel hierarchy management,
or reduce the complexity and power of existing designs at isoperformance.
3. Propose and evaluate a dead-block correlating prefetcher (e.g., similar in spirit to Lai,
Fide, and Falsafi’s) for instruction caches.
4. Compare and contrast DBCP’s effectiveness against a timer-based prefetcher as proposed by Hu, Kaxiras, and Martonosi in ISCA 2002. Timer-based prefetchers
approximate the last touch to a block by placing expiration counters on every cache block
to trigger a prefetch.
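The expiration-counter idea can be illustrated with the toy sketch below; it is an illustration of the decay-counter mechanism, not the paper's exact design. Each resident block keeps a counter that is reset on every access and incremented on every global tick; once the counter passes an assumed decay threshold, the block is predicted dead, at which point a prefetcher could evict it and fetch a predicted successor in its place.

```python
DECAY_THRESHOLD = 3  # ticks of inactivity before a block is declared dead (assumed)

class TimerCache:
    def __init__(self):
        self.blocks = {}  # block address -> ticks since last access

    def access(self, addr):
        self.blocks[addr] = 0  # reset the block's timer on every touch

    def tick(self):
        """Advance all timers; return blocks newly predicted dead."""
        dead = []
        for addr in self.blocks:
            self.blocks[addr] += 1
            if self.blocks[addr] == DECAY_THRESHOLD:
                dead.append(addr)  # counter expired: predicted dead, prefetch here
        return dead

cache = TimerCache()
cache.access(0x40)
cache.access(0x80)
for _ in range(2):
    cache.tick()
cache.access(0x80)          # 0x80 stays live; 0x40 keeps decaying
dead_blocks = cache.tick()  # 0x40 reaches the threshold on this tick
```

A real design would keep the counters per cache frame in hardware and size the threshold by profiling inter-access gaps.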
5. Evaluate DBP’s and DBCP’s accuracy and coverage for commercial workloads. Current results on these predictors only evaluate uniprocessor desktop programs. Use our
in-house Simics tracer (talk to Tom about this) to generate real memory traces for
IBM’s DB2 (a stock database management system) running TPC-C (online transaction
processing benchmark) and TPC-H (decision support/data mining benchmark) queries.
6. As in trace caches, DBP’s tables intuitively store many signatures that are used infrequently and may not contribute much to prefetching. Do a detailed analysis of the usage
frequency of signatures in DBP tables. Propose a technique to avoid storing such signatures in the tables and reduce table storage by an order of magnitude.
7. Evaluate power savings in DRAM for memory pages that are not actively
used. Evaluate the impact of prefetching as a technique to enhance power management
in DRAM by hiding the “ramp-up” latency when DRAM banks are in low-power mode.
Superscalar Processors & Simulation
Modern superscalar processors rely on a large supply of independent instructions from the
instruction stream to extract instruction-level parallelism and increase performance. The
ideas below either improve instruction throughput through the superscalar datapath,
reduce the complexity or power of current designs, or improve their reliability/robustness.
8. A fundamental bottleneck in current superscalar processors is that instructions in the
issue queue are often data- or control-dependent on earlier instructions. Because the
issue queue is an associative structure, its size does not scale well (compared to other
resources such as number of functional units, register file size, etc.) across generations
of design. Propose an “out-of-order” dispatch superscalar that predicts when instruction
results will be ready, and dispatches instructions into the issue queue only when they are
ready to go, avoiding clogging the issue queue. Show that, using your design, you can
achieve isoperformance with a much smaller issue queue.
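One way to realize "dispatch only when ready" is sketched below: a table predicts the cycle at which each register's value becomes available, and an instruction enters the issue queue only when all of its sources are predicted ready. The per-op latencies, register names, and interface are assumptions for illustration, not a proposed design.

```python
LATENCY = {"add": 1, "mul": 3, "load": 10}  # assumed per-op latencies (cycles)

class ReadyDispatch:
    def __init__(self):
        self.ready_cycle = {}  # register -> predicted cycle its value is ready

    def can_dispatch(self, srcs, now):
        # Dispatch only if every source is predicted ready by this cycle.
        return all(self.ready_cycle.get(r, 0) <= now for r in srcs)

    def dispatch(self, op, dst, srcs, now):
        if not self.can_dispatch(srcs, now):
            return False  # hold back: would sit stalled in the issue queue
        self.ready_cycle[dst] = now + LATENCY[op]
        return True

d = ReadyDispatch()
ok1 = d.dispatch("load", "r1", [], now=0)      # r1 predicted ready at cycle 10
ok2 = d.dispatch("add", "r2", ["r1"], now=1)   # r1 not ready yet: held back
ok3 = d.dispatch("add", "r2", ["r1"], now=10)  # now it dispatches
```

Instructions that are held back would wait in a simple in-order buffer, so the associative issue queue only ever holds instructions that are about to execute.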
9. Many applications exhibit a structural hazard on the load/store queue due to a large
number of loads/stores in the instruction stream (e.g., > 50%). Like the issue queue, the
load/store queue is an associative structure that is limited in size, and whose size does
not scale well (see problem 8). Propose a load/store queue design that temporarily
removes long-latency instructions (e.g., missing loads/stores) from the queue to allow
higher load/store queue throughput. Compare your design against Lebeck et al.’s design
from ISCA 2002.
10. Fields et al. proposed a critical path predictor in ISCA 2001 (and ISCA 2002) that
allows predicting which instructions are on the critical path of execution. Propose a
non-vital instruction predictor (for instructions that are not on the critical path), and use
the predictor to steer such instructions into a group of slow functional units to save
power. Dynamic power is equal to CV^2f, and is linearly proportional to frequency. Evaluate
the opportunity for reducing power using such a predictor.
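The arithmetic behind the savings is straightforward and worth keeping in mind when sizing the slow units; the capacitance, voltage, and frequency values below are arbitrary illustrative numbers.

```python
# Dynamic power follows P = C * V^2 * f.  Halving the frequency of the slow
# functional units halves their dynamic power; if the supply voltage can be
# scaled down along with frequency, the savings grow roughly cubically.

def dynamic_power(c, v, f):
    return c * v**2 * f

base     = dynamic_power(1.0, 1.5, 2e9)   # fast units
slow     = dynamic_power(1.0, 1.5, 1e9)   # same voltage, half frequency
slow_dvs = dynamic_power(1.0, 0.75, 1e9)  # half voltage and half frequency

ratio_f  = slow / base      # 0.5: linear in frequency
ratio_vf = slow_dvs / base  # 0.125: quadratic in V times linear in f
```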
11. Apply machine learning to improve the accuracy and/or cost of branch prediction.
Design and evaluate a “correlating feature selector (CFS)” that will accurately select
which specific history bits a branch correlates with. Design and evaluate a table-based
predictor using your CFS. You may refer to Fern et al.’s technical report on CFS
predictors (http://www.ece.cmu.edu/~babak/pub.html).
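The selection step can be sketched as below: score each global-history bit by how strongly it agrees (or anti-agrees) with the branch outcome over a sample of executions, then keep only the top-k bits to index a small prediction table. This is an illustration of the idea only, not Fern et al.'s actual algorithm.

```python
def select_bits(samples, k):
    """samples: list of (history_bits, outcome); returns the indices of the k
    history positions whose value most strongly correlates with the outcome."""
    n = len(samples[0][0])
    scores = []
    for i in range(n):
        agree = sum(1 for hist, out in samples if hist[i] == out)
        # Distance from 50/50 agreement measures correlation in either
        # direction (always-agree and always-disagree are both useful).
        scores.append((abs(agree - len(samples) / 2), i))
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])

# Bit 0 always equals the outcome, bit 2 is always its complement,
# and bit 1 is uncorrelated noise.
samples = [
    ((1, 0, 0), 1),
    ((0, 1, 1), 0),
    ((1, 1, 0), 1),
    ((0, 0, 1), 0),
]
chosen = select_bits(samples, 2)  # picks the two correlated bits, 0 and 2
```

Indexing the predictor table with only the chosen bits shrinks the table exponentially in the number of discarded history positions.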
12. There is much variation in IPC across a program’s execution (Roland Wunderlich can
provide you with IPC profiles of entire programs). Yet much power in modern processors
is dissipated searching the issue queue for independent instructions to increase the IPC
every cycle. Design and evaluate a history-based issue queue size and issue-width
predictor that dynamically adjusts either of the two to save power.
13. Ray et al. (MICRO 2001) propose a superscalar architecture that enables transient-tolerant computing. Unfortunately, the proposed architecture significantly impacts an application’s performance as compared to a superscalar processor that does not offer fault
tolerance. Design a redundant superscalar datapath that identifies and executes only a
fraction of the original instructions to detect errors. Evaluate how your design improves
performance over Ray et al.’s technique.
14. Last year’s 18-547 students (Tom Wenisch and Roland Wunderlich) developed a simulation methodology that increases the simulation speed of sim-outorder to that of sim-cache
(i.e., ~50x speedup) with no loss in accuracy! Propose a technique that would increase
the performance of sim-cache (e.g., instruction emulation or result memoization) so
that entire applications that usually take over 30 days to simulate can be simulated in
less than an hour.
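The memoization idea can be illustrated with the toy sketch below: cache the result of each (opcode, operands) tuple so repeated instances skip the (possibly expensive) emulation routine. The opcode set and emulator stand-in are assumptions for illustration.

```python
emulate_calls = 0

def emulate(op, a, b):
    """Stand-in for a slow instruction-emulation routine."""
    global emulate_calls
    emulate_calls += 1
    return {"add": a + b, "mul": a * b}[op]

memo = {}

def emulate_memoized(op, a, b):
    # Reuse a previously computed result when the same instruction
    # instance (opcode plus operand values) recurs.
    key = (op, a, b)
    if key not in memo:
        memo[key] = emulate(op, a, b)
    return memo[key]

r1 = emulate_memoized("add", 2, 3)  # miss: invokes the emulator
r2 = emulate_memoized("add", 2, 3)  # hit: served from the memo table
```

Whether this pays off in sim-cache depends on how often operand values actually repeat, which is itself worth measuring first.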