ECE462/562
Fall 2012
Pointers
You are encouraged to come up with your own topic. For example, if you have an interest in
compilers, then code scheduling for instruction-level parallelism might be a good topic. If you
are interested in VLSI design, a project related to pipeline clocking or low-power architecture
would be good. If you are interested in databases, quantifying the architectural characteristics of
database workloads and comparing them with the characteristics of other workloads (e.g., SPEC)
might be good. Some simulators (e.g., SimpleScalar) and benchmark programs (e.g., SPEC2K)
will be made available for carrying out simulation studies. The following is a sampling of
projects at other schools. Though the descriptions alone convey little detail, they should give
you an idea of what you might want to pursue.



 Select a paper that interests you from a recent ASPLOS or ISCA proceedings. Construct
a simulator that will allow you to reproduce the paper's main results, and validate your
simulator using the authors' workload or a similar one. Are there any major assumptions the
authors didn't mention in the paper? Use your simulator to evaluate their technique under a
new workload, or improve their technique and quantify your improvements.
 As CPU cache miss times approach thousands of cycles, during the time that a miss gets
serviced, it seems likely that the processor could execute a cache-replacement optimization
program "in the background" without slowing down any unblocked flows of execution
(Yale Patt calls this sort of optimization code "micro-threads"). This project has two parts.
First, estimate an upper bound on the performance that could be gained, as follows: simulate
a k-way associative cache where each cache set uses random, FIFO, LRU, and OPT
replacement. Current caches use k = 1 to 8 and one of the simple replacement policies, and
the best your system could do would be to approximate a fully-associative cache with OPT
replacement. The gap between those two cases is a reasonable upper bound on the benefits
this scheme could achieve. This experiment will also tell you what level of associativity and
replacement policy to aim for in your design; a minimal simulation sketch follows this item.
You may want to run this experiment for L1, L2, and L3 caches to see where to focus your
efforts. Second, design a cache microarchitecture that would allow for more sophisticated
replacement policies. My intuition is that it will be important to make sure your design does
not slow down hits or delay issuing the miss request to memory, but it can probably burn a
lot of cycles deciding which cache line to replace when that data comes back, or moving
data between different cache entries.
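
A minimal Python sketch of the first experiment. The trace, cache geometry, and address
mapping below are all hypothetical placeholders for simulator-generated data:

    import random

    def simulate(trace, num_sets, ways, policy):
        # One list of tags per set; index 0 is the oldest (FIFO) / least recent (LRU).
        sets = [[] for _ in range(num_sets)]
        misses = 0
        for i, addr in enumerate(trace):
            s, tag = addr % num_sets, addr // num_sets
            blocks = sets[s]
            if tag in blocks:
                if policy == "LRU":              # a hit refreshes recency
                    blocks.remove(tag)
                    blocks.append(tag)
                continue
            misses += 1
            if len(blocks) == ways:              # set full: choose a victim
                if policy == "random":
                    victim = random.randrange(ways)
                elif policy in ("FIFO", "LRU"):
                    victim = 0
                else:                            # OPT (Belady): evict the block
                    future = trace[i + 1:]       # reused farthest in the future
                    def next_use(t):
                        a = t * num_sets + s
                        return future.index(a) if a in future else len(future)
                    victim = max(range(ways), key=lambda j: next_use(blocks[j]))
                blocks.pop(victim)
            blocks.append(tag)
        return misses / len(trace)

    # Hypothetical trace: a loop working set plus random traffic.
    trace = [i % 64 for i in range(500)] + [random.randrange(256) for _ in range(500)]
    for policy in ("random", "FIFO", "LRU", "OPT"):
        print(policy, round(simulate(trace, num_sets=8, ways=4, policy=policy), 3))
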
 Along with the number of transistors, the complexity of microprocessor architectures
continues to grow exponentially, with very complex out-of-order processors. It is still not
readily apparent how much performance is really being delivered to applications compared
to simpler in-order designs. On a spectrum of benchmarks, quantify (through simulation;
SimpleScalar is suggested) the performance difference between an out-of-order processor
and a simpler in-order processor, taking into account not only CPI but also clock rate and
power consumption. A back-of-the-envelope version of the comparison follows.
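
A sketch of the bookkeeping involved, not of results: every number below is a purely
hypothetical placeholder; real CPI, clock, and power values must come from simulation
(e.g., SimpleScalar's sim-outorder, which also supports in-order issue) and datasheets:

    # Execution time = instructions x CPI / clock; energy = power x time.
    insts = 1e9                          # hypothetical benchmark length
    cores = {
        # name: (CPI, clock in GHz, watts) -- illustrative guesses only
        "out-of-order": (0.8, 2.0, 60.0),
        "in-order":     (1.4, 2.4, 25.0),
    }
    for name, (cpi, ghz, watts) in cores.items():
        seconds = insts * cpi / (ghz * 1e9)
        print(f"{name:>12}: time = {seconds:.3f} s, energy = {watts * seconds:.1f} J")
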
 DRAMs are highly optimized for accesses that exhibit locality. Examine a memory
interface architecture that reorders memory accesses to better exploit the column, page, and
pipeline modes of modern DRAM implementations. A toy scheduler sketch follows.
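
A minimal open-row-first scheduler sketch (in the spirit of FR-FCFS); the address layout,
queue contents, and bank geometry are all assumed for illustration:

    COL_BITS, BANK_BITS = 10, 3                  # toy address layout: row|bank|column

    def bank_of(a): return (a >> COL_BITS) & ((1 << BANK_BITS) - 1)
    def row_of(a):  return a >> (COL_BITS + BANK_BITS)

    def schedule(queue, open_rows):
        """Pick the oldest request hitting an open row, else the oldest overall."""
        for i, a in enumerate(queue):            # queue is ordered oldest-first
            if open_rows.get(bank_of(a)) == row_of(a):
                return queue.pop(i)              # row-buffer hit: no activation needed
        return queue.pop(0)                      # no hit anywhere: plain FCFS

    queue = [0x0000, 0x2400, 0x0040, 0x2440]     # hypothetical request addresses
    open_rows, hits = {}, 0
    while queue:
        a = schedule(queue, open_rows)
        hits += open_rows.get(bank_of(a)) == row_of(a)
        open_rows[bank_of(a)] = row_of(a)        # accessing a row leaves it open
    print("row-buffer hits:", hits)              # reordering turns 0 hits into 2
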
 Select an embedded application (such as interactive multimedia) and design and evaluate
an architecture that executes it in a mobile environment. Address issues of functionality,
performance (or at least providing the illusion of sufficient performance), and power
consumption.
 Compare alternatives for embedding processing power in a DRAM chip (i.e.,
reconfigurable logic vs. a highly custom processor vs. hardwired logic for a given
application) on a suite of data-intensive and computationally demanding benchmarks.
 Characterize the benefits and costs of value prediction vs. other predictive techniques,
such as instruction reuse. In the best cases, what is the maximum performance benefit? A
last-value-predictor sketch follows as a starting point.
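
A minimal last-value predictor, the simplest baseline for such a study; the (pc, value)
trace format here is a hypothetical stand-in for simulator output:

    table = {}                   # load PC -> last value that load produced
    correct = total = 0
    trace = [(0x40, 7), (0x40, 7), (0x40, 8), (0x44, 3), (0x44, 3)]  # made-up trace
    for pc, value in trace:
        if pc in table:          # predict only once the table has history
            total += 1
            correct += (table[pc] == value)
        table[pc] = value        # train on the actual outcome
    print(f"prediction accuracy: {correct}/{total}")
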
 Compare the performance of a deep cache hierarchy (multiple levels) vs. a flatter
organization (only one level) on a family of scientific and data-intensive applications.
Devise strategies to get the benefits of both. An AMAT back-of-the-envelope follows.
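
A hedged average-memory-access-time (AMAT) comparison; every latency and miss rate
below is an illustrative guess, not measured data:

    def amat(levels, mem_latency):
        """levels: list of (hit_time_cycles, local_miss_rate), innermost first."""
        t = mem_latency
        for hit, miss in reversed(levels):
            t = hit + miss * t                   # AMAT = hit + miss_rate * penalty
        return t

    deep = [(2, 0.10), (10, 0.30), (30, 0.50)]   # hypothetical L1, L2, L3
    flat = [(4, 0.05)]                           # hypothetical single large cache
    print("deep:", amat(deep, 300), "flat:", amat(flat, 300))  # cycles
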
 In large, out-of-order cores, loads have to be held back when an earlier store's address is
unknown (because it might be the same). Dependence prediction guesses which load/store
pairs are going to have dependences and which aren't. These predictors have also been used
to communicate values from stores to loads and to do prefetching. Lots of interesting stuff
here; a tiny predictor sketch follows.
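
A deliberately simplified dependence predictor in the spirit of store sets; the real schemes
track specific store/load pairs, whereas this assumed version uses a coarser wait-on-any-store
policy purely for illustration:

    violators = set()                    # load PCs that have been squashed before

    def should_wait(load_pc, unresolved_store_pcs):
        # Coarse policy: a previously violating load waits for *any* older
        # unresolved store; everything else speculates ahead.
        return load_pc in violators and bool(unresolved_store_pcs)

    def on_violation(load_pc, store_pc):
        violators.add(load_pc)           # train on a memory-order squash

    on_violation(load_pc=0x100, store_pc=0x80)   # hypothetical PCs
    print(should_wait(0x100, {0x80}))            # True: predicted dependent
    print(should_wait(0x200, {0x80}))            # False: free to speculate
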
 Because of wire delays and register file bandwidth, processor designers have started
looking at (and building, cf. the Alpha 21264) clustered designs, in which groups of
functional units are associated with separate register files within a core. How to schedule
work on these clusters, and their implications for future architectures, is a hot topic.
 Simultaneous Multithreaded (SMT) processors run multiple tasks in an out-of-order core
at the same time, sharing the dynamic resources (physical registers, issue slots, cache
pipes). Experiments on how resource usage conflicts arise in the different shared resources,
with different combinations of workloads, would be interesting (there is a lot of work going
on in this area, so a literature search would be crucial).
 When multiple threads are running in an SMT core, how many extra cache misses are
caused by the intersections of the threads' working sets? Quantify this for different
workload combinations; a measurement sketch follows.
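
One way to measure it, sketched under assumed traces and cache size (both hypothetical):
run each thread's trace alone, then interleaved, through the same LRU cache and compare
the miss totals:

    from collections import OrderedDict

    def lru_misses(trace, capacity):
        cache, misses = OrderedDict(), 0
        for block in trace:
            if block in cache:
                cache.move_to_end(block)         # refresh recency on a hit
            else:
                misses += 1
                if len(cache) == capacity:
                    cache.popitem(last=False)    # evict the least recent block
                cache[block] = True
        return misses

    t0 = [i % 48 for i in range(400)]            # hypothetical working sets that
    t1 = [1000 + (i % 48) for i in range(400)]   # together overflow the cache
    interleaved = [b for pair in zip(t0, t1) for b in pair]
    alone = lru_misses(t0, 64) + lru_misses(t1, 64)
    print("extra misses from sharing:", lru_misses(interleaved, 64) - alone)
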
 When a number of instructions waiting to execute in an out-of-order core are ready to go,
but there are too many for (a) the issue width or (b) the particular functional unit types
available to issue in a single cycle, the hardware must choose among them. Oldest-first is
the usual strategy. Other selection algorithms may be better. It would be interesting to try a
few; some alternatives are sketched below.
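
Three candidate heuristics, sketched over a made-up ready list; the (age, dependents,
latency) records are hypothetical stand-ins for scheduler state:

    ready = [
        # (age, dependents, latency): dependents = waiting instructions it wakes up
        (9, 1, 1), (7, 4, 3), (5, 0, 1), (3, 2, 12), (1, 1, 1),
    ]
    WIDTH = 2                                                 # issue width

    oldest    = sorted(ready, key=lambda r: -r[0])[:WIDTH]    # the usual policy
    most_deps = sorted(ready, key=lambda r: -r[1])[:WIDTH]    # wake up more work
    long_ops  = sorted(ready, key=lambda r: -r[2])[:WIDTH]    # start long ops early
    print(oldest, most_deps, long_ops, sep="\n")
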
 One of the key problems in architectures is that it is often more difficult to improve
latency than bandwidth. Prefetching is one technique that can hide latency. Here are some
possible prefetching topics:
o Quantify the limits of history-based prefetching (see the sketch after this list).
Prediction by partial matching (originally developed for text compression) has been
shown to provide optimal prediction of future values based on past values. Using
PPM, what are the limits of memory or disk prefetching? What input information
(e.g., last k instruction addresses, last j data addresses, distance between last k
addresses, last value
loaded, ...) best predicts future fetches? What is the best trade-off between state
used to store history information and prefetch performance?
o Add a small amount of hardware to DRAM memory chips to exploit DRAM internal
bandwidths to avoid DRAM latencies. Evaluate the performance benefits that can be
gained and the costs of modifying the hardware.
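
For the first topic above, a minimal PPM-style predictor over miss-address deltas, trying
the longest recorded context first and falling back to shorter ones; the trace, context
orders, and table sizes are all hypothetical:

    from collections import defaultdict, Counter

    MAX_ORDER = 2
    models = [defaultdict(Counter) for _ in range(MAX_ORDER + 1)]  # order -> ctx -> next

    def train_and_predict(history, delta):
        pred = None
        for k in range(MAX_ORDER, 0, -1):           # try the longest context first
            ctx = tuple(history[-k:])
            if len(ctx) == k and models[k][ctx]:
                pred = models[k][ctx].most_common(1)[0][0]
                break
        for k in range(1, MAX_ORDER + 1):           # then train every order
            if len(history) >= k:
                models[k][tuple(history[-k:])][delta] += 1
        return pred

    addrs = [0, 64, 128, 192, 0, 64, 128, 192, 0, 64]   # hypothetical miss stream
    history, correct, total = [], 0, 0
    for prev, a in zip(addrs, addrs[1:]):
        delta = a - prev
        pred = train_and_predict(history, delta)
        if pred is not None:
            total += 1
            correct += (pred == delta)
        history.append(delta)
    print(f"delta prediction accuracy: {correct}/{total}")
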
 Over the past two decades, memory sizes have increased by a factor of 1000, and page
sizes by only a factor of 2-4. Should page sizes be dramatically larger, or are a few large
"superpages" sufficient to offset this trend in most cases? The TLB-reach arithmetic below
is one way to frame the question.
 Extend Transparent Informed Prefetching (Patterson et al., SOSP 1995), which was
designed for page-level prefetching/caching, to balance cache-line hardware prefetching
vs. hardware caching.
 Cooperative caching uses fast networks to access remote memory in lieu of disk accesses.
One drawback is that a user's data may be stored on multiple machines, potentially opening
security holes (eavesdropping, modification). Encryption and digital signatures may solve
the problem, but could slow down the system. Evaluate the performance impact of adding
encryption and digital signatures to cooperatively cached data, and project this performance
into the future as processor speeds improve and as companies like Intel propose adding
encryption functions to their processors.
 As memory latencies increase, cache miss times could run to thousands of instruction-
issue opportunities. This is nearly the same ratio of memory access times as was seen for
early VM paging systems. As miss times become so extremely bad, is it time to give
control of cache replacement to the software? Will larger degrees of associativity be
appropriate for caches?
 Achieve fault tolerance by running two copies of instructions in unused cycles in a
superscalar (e.g., a 4-way machine may commit fewer than 4 instructions due to
dependencies) and do instruction replication only in those cycles.
 Compare Qureshi and Patt's insertion policies in ISCA 2007 to victim caches.
 Re-evaluate the schemes in the high-bandwidth cache paper discussed in the lectures
(e.g., the line buffer).
 Use old register values to predict the addresses of subsequent memory accesses. This
allows the pipeline to do the cache access early in the pipeline, avoiding load-use stalls.
 To help circuit designers reduce the di/dt problem, out-of-order processors can monitor
the commit rate and "even out" the rate without losing performance.
 By looking for phases in applications where fewer physical registers may suffice, we can
cut down the amount of energy consumed by the register file.
 Attempt to quantify how much of the processor performance gain in the past decade has
come from faster clocks and how much from ILP.
 Methods to improve the fetch bandwidth of trace caches.
 Cache enhancements, including victim caches, stream buffers, and hash addressing.
 Implement and compare victim caches and skewed-associative caches.
 Implement and compare two recent prefetching schemes.
 Architectural support for operating systems (e.g., user-level traps for lightweight
threads).
 Prefetching methods (hardware and/or software) and their impact on performance.
 Architectural characteristics of database workloads.
 Cache behavior of networking (or other) applications or algorithms, with modification to
exploit caches and memory hierarchies.
 An implementation study of register renaming logic
 In-order vs. out-of-order superscalar processors
 A study of dynamic branch prediction schemes for superscalar processors
 Performance study of two-level on-chip caches in a fixed area
 An analysis of hardware prefetching techniques
 Performance evaluation of caches using PatchWrx instruction traces
 Skewed D-way K-column set-associative caches
 The history and use of pipelining in computer architecture
 The effect of context switching on history-based branch predictors
 Bounding worst-case performance for real-time applications
 Branch prediction methods and performance
 Performance of TLB implementations
 Trace-driven simulation of cache enhancements
 Timing analysis and caching for real-time systems
 A survey of VLIW processors
 Evaluating caches with multiple caching strategies
 Survey/comparison of VLIW and superscalar processors
 Comparison study of multimedia-extended microprocessors
 Synchronous DRAM
 Cache performance study of the SPEC95 benchmark suite
 An investigation of instruction fetch behavior in microprocessors
 The picoJava-I microprocessor and implementation decisions
 Register renaming in Java microprocessors
 Optimizing instruction cache performance with profile-directed code layout
 Simulation of a victim cache for SPEC95 integer benchmarks
 Code scheduling for ILP
 Instruction/data encoding
 Cache-based enhancements (e.g., trace cache, filter cache, loop cache, victim cache,
stream buffers, etc.)
 Pipeline clocking
 Low-power architectures
 Quantifying architectural characteristics of database workloads and comparing them to
other workloads
 Power/energy/performance in a branch predictor of a superscalar processor
 Workload characterization of network processor benchmarks
 Dynamic phase behavior of programs
 Cache miss pattern and miss predictability analysis
 Low-power cache design
 Analysis of architecture support for virtual machines
 A framework for power-efficient instruction encoding in deep-submicron application-
specific processors
 Cache optimization for signal processing applications
 Design and evaluation of advanced value prediction methods in multi-issue superscalar
pipelined architectures
 A new ISA to efficiently support the object-oriented programming (OOP) paradigm
 Benchmarking HPC workloads

Your good idea here...
Some reference papers to get started:
 A Case for MLP-Aware Cache Replacement, M. K. Qureshi et al., ISCA 2006.
 Increasing the Size of Atomic Instruction Blocks Using Control Flow Assertions, S. Patel
et al., MICRO 2000.
 Selective Value Prediction, B. Calder et al., ISCA 1999.
 Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth, A. Bracy et al.,
MICRO 2004.
 Efficient Dynamic Scheduling Through Tag Elimination, D. Ernst and T. Austin, ISCA
2002.
 Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power, S.
Kaxiras et al., ISCA 2001.
 Scalable Store-Load Forwarding via Store Queue Index Prediction, T. Sha et al., MICRO
2005.
 NUCA: A Non-Uniform Cache Access Architecture for Wire-Delay Dominated On-Chip
Caches, C. Kim, D. Burger, and S. W. Keckler, ASPLOS 2002.
 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching, E.
Rotenberg, S. Bennett, and J. E. Smith, MICRO 1996.