2K papers on caches by Y2K:
Do we need more?
Jean-Loup Baer
Dept. of Computer Science & Engineering
University of Washington
1/11/00
HPCA-6
1
A little bit of history
• The Y0K problem
• The Y1K problem
– For the French version: who was King of France in the year 1000?
Outline
• More history
• Anthology
• Challenges
• Conclusion
More history
• Caches introduced (commercially) more than 30
years ago in the IBM 360/85
– already a processor-memory gap
• Oblivious to the ISA
– caches were organization, not architecture
• Sector caches
– to minimize tag area
• Single level; off-chip
Terminology
• One of the original designers (Gibson) had first
coined the name muffer
• When papers were submitted, the authors (Conti, Gibson, Liptay, Pitkovsky) used the term high-speed buffer
• The editor-in-chief of the IBM Systems Journal (R. L. Johnson) suggested a sexier name, namely cache, after consulting a thesaurus
Today
• Caches are ubiquitous
– On-chip, off-chip
– But also, disk caches, web caches, trace caches etc.
• Multilevel cache hierarchy
– With inclusion or exclusion
• Many different organizations
– direct-mapped, set-associative, skewed-associative,
sector, decoupled sector etc.
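As an illustration of these organizations, a minimal sketch of how an address is split for a set-associative lookup (the line size and set count are illustrative choices, not figures from the talk):

```python
# Sketch: address decomposition for a set-associative cache.
# Parameters are illustrative; real caches pick powers of two as here.
def split_address(addr, line_size=64, num_sets=128):
    """Return (tag, set_index, offset) for an address."""
    offset = addr % line_size                    # byte within the line
    set_index = (addr // line_size) % num_sets   # which set to probe
    tag = addr // (line_size * num_sets)         # compared against stored tags
    return tag, set_index, offset

# Direct-mapped is the special case of associativity 1;
# fully associative is the case num_sets == 1 (the index field disappears).
```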
Today (c’ed)
• Cache exposed to the ISA
– Prefetch, Fence, Purge etc.
• Cache exposed to the compiler
– Code and data placement
• Cache exposed to the O.S.
– Page coloring
• Many different write policies
– copy-back, write-through, fetch-on-write, write-around,
write-allocate etc.
Today (c’ed)
• Numerous cache assists, for example:
– For storage: write-buffers, victim caches,
temporal/spatial caches
– For overlap: lock-up free caches
– For latency reduction: prefetch
– For better cache utilization: bypass mechanisms,
dynamic line sizes
– etc ...
Caches and Parallelism
• Cache coherence
– Directory schemes
– Snoopy protocols
• Synchronization
– Test-and-test-and-set
– load linked -- store conditional
• Models of memory consistency
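A sketch of the test-and-test-and-set idiom mentioned above; the atomic test-and-set is emulated here with a Python lock, since real hardware supplies it as a single instruction:

```python
import threading

class TTASLock:
    """Sketch of test-and-test-and-set: spin on an ordinary read (a cache
    hit once the line is local) and attempt the expensive atomic
    test-and-set only when the lock looks free."""
    def __init__(self):
        self.held = False
        self._tas = threading.Lock()       # stands in for the atomic RMW

    def _test_and_set(self):
        with self._tas:
            old = self.held
            self.held = True
            return old

    def acquire(self):
        while True:
            while self.held:               # "test": read-only spin, no invalidations
                pass
            if not self._test_and_set():   # "test-and-set": one atomic attempt
                return

    def release(self):
        self.held = False
```

The point of the extra "test" is coherence traffic: a plain test-and-set loop writes the shared line on every iteration and ping-pongs it between caches, while the read-only spin stays in the local cache until the holder releases.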
When were the 2K papers being written?
• A few facts:
– 1980 textbook: < 10 pages on caches (2%)
– 1996 textbook: > 120 pages on caches (20%)
• Smith survey (1982)
– About 40 references on caches
• Uhlig and Mudge survey on trace-driven
simulation (1997)
– About 25 references specific to cache performance only
– Many more on tools for performance etc.
Cache research vs. time
[Bar chart: % of ISCA papers dealing principally with caches, by year, 1981–1999; y-axis 0–35%. Annotations: "1st session on caches"; "Largest number (14)".]
Outline
• More history
• Anthology
• Challenges
• Conclusion
Some key papers - Cache Organization
• Conti (Computer 1969): direct-mapped (cf. “slave
memory” and “tags” in Wilkes 1965), set-associativity
• Bell et al. (IEEE TC 1974): cache design for small
machines (advocated unified caches; pipelining nullified that)
• Hill (Computer 1988): the case for direct-mapped
caches (technology has made the case obsolete)
• Smith (Computing Surveys 1982): virtual vs.
physical addressing (first cogent discussion)
Some key papers - Qualitative Properties
• Smith (Computing Surveys 1982): Spatial and
temporal locality
• Hill (Ph.D. 1987): The three C’s
• Baer and Wang (ISCA 1988): Multi-level
inclusion
Some key papers - Cache Evaluation Methodology
• Belady (IBM Systems J. 1966): MIN and OPT
• Mattson et al. (IBM Systems J. 1970): The “stack”
property
• Trace collection:
– Hardware: Clark (ACM TOCS 1983)
– Microcode: Agarwal, Sites and Horowitz (ISCA 1986): ATUM
– Software: M. Smith (1991): Pixie
– Very long traces: Borg, Kessler and Wall (ISCA 1990)
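The Mattson "stack" property is what makes one-pass evaluation possible: a single sweep over a trace yields the miss count of every LRU cache size at once. A minimal sketch, assuming a fully associative LRU cache and unit-size blocks for brevity:

```python
def lru_stack_distances(trace):
    """One pass over a reference trace, returning each reference's LRU
    stack distance (None for first-time references). By the inclusion
    ("stack") property, a fully associative LRU cache of capacity C hits
    exactly the references whose distance is < C."""
    stack = []                      # most recently used at the front
    dists = []
    for block in trace:
        if block in stack:
            d = stack.index(block)  # depth in the LRU stack
            stack.pop(d)
            dists.append(d)
        else:
            dists.append(None)      # cold miss at every cache size
        stack.insert(0, block)
    return dists

def misses_for_size(dists, capacity):
    """Miss count for any capacity, from the same single pass."""
    return sum(1 for d in dists if d is None or d >= capacity)
```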
Some key papers - Cache Performance
• Kaplan and Winder (Computer 1973): 8 to 16K caches
with block sizes of 64 to 128 bytes and set-associativity 2 or 4 will
yield hit ratios of over 95%
• Strecker (ISCA 1976): design of the PDP 11/70 (2KB, 2-way set-associative, 4-byte / 2-word block size)
• Smith (Computing Surveys 1982): most comprehensive
study of the time: prefetching, replacement, associativity, line size etc.
• Przybylski et al. (ISCA 1988): Comprehensive study 6
years later
• Woo et al. (ISCA 1995): Splash-2
Some key papers - Cache Assists
• IBM ??: Write buffers
• Gindele (IBM TD Bull 1977): OBL prefetch (OBL
coined by Smith?)
• Kroft (ISCA 1981): Lock-up free caches
• Jouppi (ISCA 1990): Victim caches; stream
buffers
• Pettis and Hansen (PLDI 1990): Code placement
Some key papers - Cache Coherence
• Censier and Feautrier (IEEE TC 1978): Directory
scheme
• Goodman (ISCA 1983): The first snoopy protocol
• Archibald and Baer (TOCS 1986): Snoopy
terminology
• Dubois, Scheurich and Briggs (ISCA 1986):
Memory consistency
Outline
• More history
• Anthology
• Challenges
• Conclusion
Caches are great. Yes … but
• Caches are poorly utilized
– Lots of dead lines (only 20% efficiency: Burger et al. 1995)
– Squandering of memory bandwidth
• The “memory wall”
– At the limit, it will take longer to load a program on-chip than to execute it (Wulf and McKee 1995)
Solution Paradigms
• Revolution
• Evolution
• Enhancements
Revolution
Evolution
(processor in memory; application-specific)
• IRAM (Patterson et al. 1997)
– Vector processor; data stream apps; low power
• FlexRAM (Torrellas et al. 1999)
– Memory chip = Simple multiprocessor + superscalar + banks of
DRAM; memory intensive apps.
• Active Pages (Chong et al. 1998)
– Co-processor paradigm; reconfigurable logic in memory; apps
such as scatter-gather
• FBRAM (Deering et al. 1994)
– Graphics in memory
Enhancements
• Hardware and software cache assists
– Examples: “hardware tables”; most common case
resolved in hardware, less common in software
• Use real estate on-chip to provide intelligence for
managing on-chip and off-chip hierarchy
– Examples: memory controller, prefetch engines for L2
on processor chip
General Approach
• Identify a cache parameter/enhancement whose
tuning will lead to better performance
• Assess potential margin of improvement
• Propose and design an assist
• Measure efficiency of the scheme
Identify a cache parameter/enhancement
• The creative part!
• Our current projects
– Dynamic line sizes
– Modified LRU policies using detection of temporal
locality
– Prefetching in L2
Assess potential margin of improvement
• Metrics?
– Miss rate; bandwidth; average memory access time
– Weighted combination of some of the above
– Execution time
• Compare to optimal (off-line) algorithm
– “Easy” for replacement algorithms
– “OK” for some other metrics (e.g., cost of a cache miss
depending on line size; oracle for prefetching)
– Hard for execution time
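For replacement, comparing against the off-line optimum is indeed "easy". A sketch of LRU vs. Belady's MIN/OPT miss counts, assuming a fully associative cache with unit-size blocks (illustrative, not a production simulator):

```python
def lru_misses(trace, capacity):
    """Miss count under LRU for a fully associative cache."""
    stack = []                              # most recently used at the front
    misses = 0
    for b in trace:
        if b in stack:
            stack.remove(b)
        else:
            misses += 1
            if len(stack) == capacity:
                stack.pop()                 # evict least recently used (tail)
        stack.insert(0, b)
    return misses

def opt_misses(trace, capacity):
    """Belady's off-line MIN/OPT: evict the block reused furthest ahead."""
    cache = set()
    misses = 0
    for i, b in enumerate(trace):
        if b in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(x):
                try:
                    return trace.index(x, i + 1)
                except ValueError:
                    return float('inf')     # never reused: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(b)
    return misses
```

On a cyclic trace that defeats LRU (e.g. a b c a b c with capacity 2), OPT's miss count bounds how much any on-line policy could recover.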
Measure efficiency of the scheme
• Same problem: metrics?
• The further from the processor, the more “relaxed”
the metric
– For L1-L2, you need to see impact on execution speed
– For L2- DRAM, you can get away with average
memory access time
Anatomy of a Predictor
[Diagram: execution feeds event selection; events drive a prediction index and a prediction mechanism; outcomes feed back, with a recovery path ("Recovery?").]
Anatomy of a Cache Predictor
[Diagram: the same structure, without the recovery stage.]
Anatomy of a Cache Predictor
[Diagram: the prediction trigger is a load/store cache miss.]
Anatomy of a Cache Predictor
[Diagram: prediction index sources: PC; effective address (EA); global/local history.]
Anatomy of a Cache Predictor
[Diagram: prediction mechanisms: one-level table; two-level tables; associative buffers; specialized caches.]
Anatomy of a Cache Predictor
[Diagram: predictor types: counters; stride predictors; finite-context / Markov predictors.]
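A sketch of one such mechanism: a PC-indexed stride predictor with a small confidence counter (the table format and threshold are illustrative choices, not from the talk):

```python
class StridePredictor:
    """Sketch of a PC-indexed stride predictor: remember the last address
    and stride per load PC, and predict addr + stride once the same
    stride has repeated (confidence threshold of 2 is illustrative)."""
    def __init__(self):
        self.table = {}   # pc -> [last_addr, stride, confidence]

    def predict_and_update(self, pc, addr):
        """Record the access; return a predicted next address or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [addr, 0, 0]
            return None
        last, stride, conf = entry
        new_stride = addr - last
        # saturating confidence: count consecutive matching strides
        conf = min(conf + 1, 3) if new_stride == stride else 0
        self.table[pc] = [addr, new_stride, conf]
        return addr + new_stride if conf >= 2 else None
```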
Anatomy of a Cache Predictor
[Diagram: the feedback is often imprecise.]
Applying the Model
• Modified LRU policies for L2 caches
• Identify a cache parameter
– L2 cache miss rate
• Assess potential margin of improvement
– OPT vs. LRU
• Propose a design
– On-line detection of lines exhibiting temporal locality
Propose a Design
[Diagram: trigger: L1 cache miss; prediction index: EA and PC; mechanism: metadata in L2 and a Locality Table, with the LRU stack extended by a locality bit; feedback updates the table.]
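A sketch of the flavor of such a policy (an illustrative reading of the slide, not the talk's exact design): the LRU stack carries a locality bit, set on re-reference, and eviction prefers lines that never showed temporal locality:

```python
class LocalityAwareLRUSet:
    """Illustrative sketch: one L2 set whose LRU stack carries a per-line
    locality bit, set when a line is re-referenced. On a miss, the victim
    is the least recently used line WITHOUT the bit, falling back to
    plain LRU when every resident line has shown temporal locality."""
    def __init__(self, ways):
        self.ways = ways
        self.stack = []          # [tag, locality_bit], most recent first

    def access(self, tag):
        for i, entry in enumerate(self.stack):
            if entry[0] == tag:
                entry[1] = True                  # re-reference: temporal locality
                self.stack.insert(0, self.stack.pop(i))
                return True                      # hit
        if len(self.stack) == self.ways:
            # scan from the LRU end for a line lacking the locality bit
            for i in range(len(self.stack) - 1, -1, -1):
                if not self.stack[i][1]:
                    self.stack.pop(i)
                    break
            else:
                self.stack.pop()                 # plain LRU fallback
        self.stack.insert(0, [tag, False])
        return False                             # miss
```

On the trace a a b c a with 2 ways, plain LRU would evict the re-used line a when c arrives; here b (no locality bit) is evicted instead and the final access to a hits.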
Applying the Model
• Modified LRU policies for L2 caches
• Identify a cache parameter
• Assess potential margin of improvement
• Propose a design
• Measure efficiency of the scheme
– How much of the margin of improvement was reduced
(i.e., compare with OPT and LRU)
Conclusion
• Do we need more?
• “We need substantive research on the design of
memory hierarchies that reduce or hide access
latencies while they deliver the memory
bandwidths required by current and future
applications” (PITAC Report, Feb. 1999)
Possible important areas of research
• L2- DRAM interface
– Prefetching
• Better cache utilization
– Data placement
• Caches for low-power design
• Caches for real-time systems
With many thanks to
• Jim Archibald
• Wen-Hann Wang
• Sang Lyul Min
• Rick Zucker
• Tien-Fu Chen
• Craig Anderson
• Xiaohan Qin
• Dennis Lee
• Peter Vanvleet
• Wayne Wong
• Patrick Crowley
– For the French version: who was King of France in the year 1000?
– Robert II the Pious, eldest son of Hugues Capet