Reducing Garbage Collector Cache Misses
Shachar Rubinstein
Garbage Collection Seminar
The general problem
CPUs are getting faster and faster
Main memory speed lags behind
Result: the cost of accessing main memory is increasing
Solutions
Hardware and software techniques:
– Memory hierarchy
– Prefetching
– Multithreading
– Non-blocking caches
– Dynamic instruction scheduling
– Speculative execution
Great Solutions? Not exactly…
Complex hardware and compilers
Ineffective for many programs
They attack the manifestation (memory latency) and not the source (poor reference locality)
Previous work
Improving cache locality in dense matrices using loop transformations
Other profile-driven, compiler-directed approaches
The GC problem
Little temporal locality
Each live object is usually read only once during the mark phase
Most reads are likely to miss
The new cache contents are unlikely to be used more than once
The GC problem – cont.
The sweep phase, like the mark phase, also touches each object once
That is because the free-list pointers are maintained in the objects themselves
Unlike the mark phase, the sweep phase is more sequential
The GC problem – cont.
The sweep is less likely to use cache contents left by the marker
The allocator is likely to miss again when the object is allocated
The GC problem – previous work
Older work concentrated on paging performance
Growing memory sizes led to abandoning this goal
But larger memories also brought huge cache-miss penalties
The largest cache size < heap size
This problem is unavoidable
Previous work
Reducing sweep time for a nearly empty heap
Compiler-based prefetching for recursive data structures
How am I going to improve the situation?
Do some magic!
Well, no…
Use real-time information to improve the program's cache locality
The mark and sweep phases offer invaluable opportunities for improvement:
– Bring objects into the cache earlier
– Reuse freed objects for reallocation
Some numbers
Relative to a copying GC:
– Cache miss rates reduced by 21-42%
– Program performance improved by 14-37%
Relative to a page-level GC:
– Cache miss rates reduced by 20-41%
– Program performance improved by 18-31%
Road map
Cache conscious data placement using generational GC:
– Overview
– Short generational GC reminder
– Real-time data profiling
– Object affinity graph
– Combining the affinity graph with GC
– Experimental evaluation
Other methods and their experimental results
Overview
A program is instrumented to profile its access patterns
The data is used in the same execution, not the next one
The data is turned into an object affinity graph
A new copying algorithm uses the graph to lay out the data while copying
Generational GC – a reminder
The heap is divided into generations
GC activity concentrates on young objects, which typically die faster
Objects that survive one or more scavenges are moved to the next generation
Implementation notes
The authors used the UM GC toolkit
The toolkit has several steps per generation
The authors used a single step for each generation, for simplicity
Each step consists of fixed-size blocks
The blocks are not necessarily contiguous in memory
Implementation notes – steps
The steps are used to encode an object's age
An object that survives a scavenge is moved to the next step
Implementation notes – moving between generations
The scavenger collects a generation g and all its younger generations
It starts from objects that are:
– in g, and
– reachable from the roots
Moving an object means copying it into the TO space
The FROM space can then be reused
Copying algorithm – a reminder
Cheney's algorithm
TO and FROM spaces are switched
Starts from the root set
Objects are traversed breadth-first using a queue
Objects are copied to the TO space
Terminates when the queue is empty
Copying algorithm – the queue trick
The algorithm (diagram): the TO space itself serves as the queue, with the unprocessed (scan) pointer chasing the free pointer
Did you get it?
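A minimal sketch of Cheney's algorithm with the queue trick, in C. The object layout and the names (forward, nfields, fields, to_space, free_ptr, unprocessed) are illustrative assumptions, not the toolkit's actual structures; the point is that the TO space itself acts as the breadth-first queue.

#include <stddef.h>
#include <string.h>

typedef struct object {
    struct object *forward;    /* forwarding pointer; NULL while still in FROM space */
    size_t         nfields;    /* number of pointer fields                           */
    struct object *fields[];   /* the pointer fields themselves                      */
} object;

char *to_space, *free_ptr, *unprocessed;

/* Copy obj into TO space (unless already copied) and return its new address. */
object *forward(object *obj) {
    if (obj == NULL || obj->forward)
        return obj ? obj->forward : NULL;
    size_t size = sizeof(object) + obj->nfields * sizeof(object *);
    object *copy = (object *)free_ptr;
    memcpy(copy, obj, size);             /* copy->forward is still NULL here */
    free_ptr += size;                    /* bump-allocate in TO space        */
    obj->forward = copy;                 /* leave a forwarding address       */
    return copy;
}

/* Cheney's scavenge: the region between unprocessed and free_ptr *is* the queue. */
void scavenge(object **roots, size_t nroots) {
    unprocessed = free_ptr = to_space;   /* the TO/FROM flip itself is done by the caller */
    for (size_t i = 0; i < nroots; i++)
        roots[i] = forward(roots[i]);    /* copy objects reachable from the roots */
    while (unprocessed < free_ptr) {     /* queue not empty */
        object *obj = (object *)unprocessed;
        for (size_t i = 0; i < obj->nfields; i++)
            obj->fields[i] = forward(obj->fields[i]);
        unprocessed += sizeof(object) + obj->nfields * sizeof(object *);
    }
}

An object is "in the queue" exactly while it lies between the unprocessed and free pointers, so no separate queue data structure is needed; that is the queue trick.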
Real-time data profiling
A profile from an earlier program run is not good enough
Real-time data eliminates:
– a separate profiling run
– finding inputs
Great!
But the overhead must be low!
Profiling data access patterns
Tracing every load and store to the heap means huge overhead (a factor of 10!)
Reducing overhead
1. Use properties of object-oriented programs
Most objects are small, often less than 32 bytes
– No need to distinguish between fields, since cache blocks are bigger
Reducing overhead – cont.
2. Most object accesses are not lightweight
– So the profiling instrumentation will not add a large relative overhead
Don't believe it? Stay awake
Collecting profiling data
"Load"s of base object addresses are recorded
A modified compiler is used
The compiler retains object type information so that only selected loads are instrumented
Code instrumentation
Collecting profiling data – cont.
The base object address is written into an object access buffer
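A rough sketch of what the inserted instrumentation amounts to. The names and the bounds check are illustrative; in the real system the compiler emits the store inline and overflow is caught by the page trap described on the next slide.

#include <stddef.h>

#define BUFFER_ENTRIES 15000                 /* recommended size; 60KB with 4-byte addresses */

static void  *object_access_buffer[BUFFER_ENTRIES];
static void **buffer_next = object_access_buffer;

void build_affinity_graph(void **buf, size_t n);   /* sketched after the demonstration */

/* In the real system the page just past the buffer is write-protected, so the
 * store below traps on overflow and the trap handler processes the buffer;
 * here a plain bounds check stands in for that page trap. */
static inline void record_access(void *base_object_address) {
    *buffer_next++ = base_object_address;
    if (buffer_next == object_access_buffer + BUFFER_ENTRIES) {
        build_affinity_graph(object_access_buffer, BUFFER_ENTRIES);
        buffer_next = object_access_buffer;
    }
}

/* The compiler rewrites an instrumented heap load such as
 *     x = obj->field;
 * into roughly
 *     record_access(obj);  x = obj->field;                                   */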
Implementation note
A page trap is used to detect buffer overflow
The trap causes the affinity graph to be built
Recommended buffer size: 15,000 entries (60KB)
Affinity?
Main Entry: af·fin·i·ty
Pronunciation: &-'fi-n&-tE
Function: noun
Inflected Form(s): plural -ties
Etymology: Middle English affinite, from Middle French or Latin; Middle French afinité, from Latin affinitas, from affinis bordering on, related by marriage, from ad- + finis end, border
Date: 14th century
1 : relationship by marriage
2 a : sympathy marked by community of interest : KINSHIP b (1) : an attraction to or liking for something <people with an affinity to darkness -- Mark Twain> <pork and fennel have a natural affinity for each other -- Abby Mandel> (2) : an attractive force between substances or particles that causes them to enter into and remain in chemical combination c : a person especially of the opposite sex having a particular attraction for one
3 a : likeness based on relationship or causal connection <found an affinity between the teller of a tale and the craftsman -- Mary McCarthy> <this investigation, with affinities to a case history, a psychoanalysis, a detective story -- Oliver Sacks> b : a relation between biological groups involving resemblance in structural plan and indicating a common origin
The object affinity graph
Nodes – objects
Edges – temporal affinity between objects
The graph is undirected
Building the graph
Inserting an object into the queue
Incrementing edge weights
All clear?
Demonstration
(Animation frames: entries such as A, B, A, D, C are read from the object access buffer one at a time; each entry's edge weights against the objects currently in the locality queue are incremented, the entry is pushed at the queue tail, and the affinity graph's edge weights grow step by step.)
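A minimal sketch of the processing loop the demonstration walks through, assuming a hypothetical increment_edge_weight() accessor for the undirected, weighted graph: each buffer entry gets affinity edges to every object still sitting in the small FIFO locality queue, then is pushed at the queue tail.

#include <stddef.h>

#define QUEUE_SIZE 3                         /* the recommended locality queue size */

/* Hypothetical accessor for the undirected, weighted affinity graph. */
extern void increment_edge_weight(void *obj_a, void *obj_b);

static void  *locality_queue[QUEUE_SIZE];
static size_t queue_head = 0, queue_len = 0;

static void process_access(void *obj) {
    /* Temporal affinity: bump the edge between this object and every object
     * accessed recently enough to still be sitting in the queue. */
    for (size_t i = 0; i < queue_len; i++) {
        void *other = locality_queue[(queue_head + i) % QUEUE_SIZE];
        if (other != obj)
            increment_edge_weight(obj, other);
    }
    /* Push obj at the queue tail, evicting the oldest entry when full. */
    if (queue_len < QUEUE_SIZE) {
        locality_queue[(queue_head + queue_len++) % QUEUE_SIZE] = obj;
    } else {
        locality_queue[queue_head] = obj;
        queue_head = (queue_head + 1) % QUEUE_SIZE;
    }
}

void build_affinity_graph(void **buffer, size_t n) {
    for (size_t i = 0; i < n; i++)
        process_access(buffer[i]);
}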
Implementation notes
A separate affinity graph is built for each generation, except the first
This uses the fact that an object's generation is encoded in its address
This prevents placing objects from different generations in the same cache block (explanations later on)
Implementation notes – queue size
The locality queue size is important
Too small -> temporal relationships are missed
Too big -> a huge graph and long processing time
Recommended size: 3
Implementation notes
Re-create or update the graph?
Depends on the application:
– programs with distinct access phases should re-create
– programs with uniform behavior should update
In this article – re-create before each scavenge
Stop!
Our goal: produce a cache-conscious data layout, so that objects accessed together are likely to reside in the same cache block
In English: place objects with high temporal affinity next to each other
The method: use the profiling information we've collected in the copying process
GC + real-time profiling
Use the object affinity graph in the copying algorithm
Example – object affinity graph
Example – before step 1
Step 1 – using the graph
Flip roles (TO and FROM)
Initialize the free and unprocessed pointers to the beginning of the TO space
Pick a node that is:
– in the root set, and
– in the affinity graph, with the highest edge weight
Perform a greedy DFS on the graph
Step 1 – cont.
Copy each visited object to the TO space
Increment the free pointer
Store a forwarding address in the FROM space
Example – After step 1
Step 2 – continues Cheney's way
Process all objects between the unprocessed and free pointers, as in Cheney's algorithm
Example – After step 2
Step 3 – cleanup
Ensure all roots are in the TO space
If not, process them using Cheney's algorithm
Example – After step 3
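A sketch of how the three steps might fit together, continuing the earlier Cheney sketch (same object layout, forward(), and space pointers). The affinity-graph accessors are hypothetical, and "greedy DFS" is interpreted here as visiting neighbours in decreasing edge-weight order.

/* Hypothetical graph accessors; object, forward() and the to_space /
 * free_ptr / unprocessed pointers are as in the Cheney sketch above. */
extern void  *heaviest_root_in_graph(void);                /* root with max edge weight */
extern size_t neighbors_by_weight(void *obj, void **out, size_t max);

static void affinity_dfs(object *start) {
    object *stack[4096];                     /* the object access buffer can double as this stack */
    size_t  top = 0;
    stack[top++] = start;
    while (top > 0) {
        object *obj = stack[--top];
        if (obj->forward) continue;          /* already copied */
        forward(obj);                        /* lands next to its affine neighbours */
        void  *nb[16];
        size_t n = neighbors_by_weight(obj, nb, 16);
        for (size_t i = n; i > 0 && top < 4096; i--)   /* push lightest first, so the   */
            stack[top++] = (object *)nb[i - 1];        /* heaviest edge is followed next */
    }
}

static void cheney_scan(void) {              /* step 2: drain the implicit queue */
    while (unprocessed < free_ptr) {
        object *obj = (object *)unprocessed;
        for (size_t i = 0; i < obj->nfields; i++)
            obj->fields[i] = forward(obj->fields[i]);
        unprocessed += sizeof(object) + obj->nfields * sizeof(object *);
    }
}

void cache_conscious_scavenge(object **roots, size_t nroots) {
    unprocessed = free_ptr = to_space;       /* step 1: flip, then greedy DFS copy   */
    object *seed = (object *)heaviest_root_in_graph();
    if (seed && !seed->forward)
        affinity_dfs(seed);
    cheney_scan();                           /* step 2: Cheney-style processing      */
    for (size_t i = 0; i < nroots; i++)      /* step 3: roots still in FROM space    */
        roots[i] = forward(roots[i]);        /*         are handled the ordinary way */
    cheney_scan();
}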
Implementation notes
The object access buffer can be used as a stack for the DFS
Inaccurate results(?)
The object affinity graph may retain objects that are no longer reachable (= garbage)
They will be incorrectly promoted at most once
Efforts are focused on longer-lived objects and not on the youngest generation
Experimental evaluation
Methodology – if we have the time
Object-oriented programs manipulate small objects
Real-time data profiling overhead
The algorithm's impact on performance
Size of heap objects
But that's not the point!
Small objects often die fast
Surviving heap objects
Real-time data profiling overhead
Overall execution time
Overall execution time – notes
No impact on the L1 cache because its blocks are only 16B
Compared to the WLM algorithm
Comparison notes
WLM (Wilson-Lam-Moher) improves a program's virtual memory locality
It performed worse than or close to Cheney's because the 2GB of memory eliminated paging
What else?
Other methods
Two methods that can be used together with the previous one:
– Prefetch on grey
– Lazy sweeping
Assumptions
A non-moving mark-sweep collector
For simplicity, the collector segregates objects by size; each block contains objects of a single size
The collector's data structures are outside the user-visible heap
A mark bit is reserved for each word in the block
Advantages of "outside the heap" data
The mark phase does not need to examine (= bring into the cache) pointer-free objects
Sequences of small unreachable objects can be reclaimed as a group:
– a single instruction can examine their sequence of mark bits
– this is used when a heap block turns out to be empty
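A sketch of the kind of layout these assumptions describe; the structure and field names are illustrative, not the collector's actual ones. Because the metadata lives outside the heap, checking whether a block is completely empty touches only a few metadata words and no objects.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE       4096u                      /* one fixed-size heap block */
#define WORDS_PER_BLOCK  (BLOCK_SIZE / sizeof(void *))

/* Per-block metadata lives outside the user-visible heap, so examining it
 * never drags object data into the cache. */
typedef struct block_desc {
    void    *block_start;                           /* first word of the heap block            */
    uint32_t object_size;                           /* every object in the block has this size */
    bool     pointer_free;                          /* block holds pointer-free objects        */
    uint64_t mark_bits[WORDS_PER_BLOCK / 64];       /* one bit per word; meaningful only where */
                                                    /* a word begins an object                 */
} block_desc;

/* Whole-block emptiness test: a few metadata word compares, no object touched.
 * A run of small objects' mark bits can likewise be tested with one compare. */
bool block_is_empty(const block_desc *b) {
    for (size_t i = 0; i < WORDS_PER_BLOCK / 64; i++)
        if (b->mark_bits[i] != 0)
            return false;
    return true;
}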
The mark phase – a reminder
Ensure that all objects are white
Grey all objects pointed to by a root
While there is a grey object g:
– blacken g
– for each pointer p in g: if p points to a white object, grey that object
The mark phase – colors
1 mark bit:
– 0 is white
– 1 is grey/black
Stack:
– on the stack – grey
– removed from the stack – black
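A minimal sketch of this mark loop, with the single mark bit plus the stack encoding the colours; the helper functions are assumed, not the collector's real interface.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers: test/set the external mark bit, and return a pointer
 * to an object's pointer fields together with their count. */
extern bool   is_marked(void *obj);
extern void   set_mark(void *obj);
extern size_t pointer_fields(void *obj, void ***fields_out);

#define MARK_STACK_SIZE 8192
static void  *mark_stack[MARK_STACK_SIZE];
static size_t mark_top;

static void push_grey(void *obj) {
    if (obj && !is_marked(obj)) {       /* white: not yet marked              */
        set_mark(obj);                  /* grey: marked and on the stack      */
        mark_stack[mark_top++] = obj;   /* (stack overflow handling omitted)  */
    }
}

void mark_phase(void **roots, size_t nroots) {
    mark_top = 0;
    for (size_t i = 0; i < nroots; i++)
        push_grey(roots[i]);
    while (mark_top > 0) {
        void  *obj = mark_stack[--mark_top];     /* popping it blackens it        */
        void **fields;
        size_t n = pointer_fields(obj, &fields); /* the expensive first touch     */
        for (size_t i = 0; i < n; i++)
            push_grey(fields[i]);
    }
}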
The mark GC problem
A significant fraction of time is spent retrieving the first pointer p from each grey object
About a third of the marker's execution time is spent this way
This time is expected to increase on future machines
Prefetching
A modern CPU instruction
A program can prefetch data into the cache for future use
Prefetching – cont.
But the object reference must be predicted soon enough
For example, if the object is in main memory, it must be prefetched hundreds of cycles before its use
Prefetch instructions are mostly inserted by compiler optimizations
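Not from the paper, just a generic illustration: GCC and Clang expose prefetching as __builtin_prefetch, which lowers to the target's prefetch instruction. It only pays off if enough independent work happens between the prefetch and the actual use.

struct node { struct node *next; long value; };

/* The second and third arguments mean "read access, keep in all cache levels". */
long sum_list(const struct node *n) {
    long total = 0;
    while (n) {
        __builtin_prefetch(n->next, 0, 3);  /* request the next node early...      */
        total += n->value;                  /* ...while we still work on this one  */
        n = n->next;
    }
    return total;
}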
Prefetch on grey
When? Prefetch as soon as p is found likely to be a pointer
What? Prefetch the first cache line of the object
To improve performance
The last pointer to be pushed on the mark stack is prefetched first
This minimizes the cases in which a just-greyed object is examined immediately, before its prefetch has completed
And to improve more
Prefetch a few cache lines ahead when scanning an object
This helps with large objects
If the object isn't that large, it effectively prefetches the following objects instead
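A sketch of how the mark-loop sketch above changes, assuming the same helpers and 64-byte cache lines; the push-ordering refinement from the previous slide is omitted for brevity.

/* Prefetch on grey: as soon as a field is found to (likely) point to a white
 * object, the object is greyed and its first cache line is prefetched, so its
 * own pointers are more likely to be resident by the time it is popped. */
static void push_grey_prefetch(void *obj) {
    if (obj && !is_marked(obj)) {
        set_mark(obj);                               /* grey it          */
        __builtin_prefetch(obj);                     /* first cache line */
        mark_stack[mark_top++] = obj;
    }
}

/* Scanning ahead: stay a few (here 4) 64-byte lines in front of the scan
 * position, which helps inside large objects and otherwise simply prefetches
 * whatever objects follow. */
static void scan_fields(void **fields, size_t n) {
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch((const char *)&fields[i] + 4 * 64);
        push_grey_prefetch(fields[i]);
    }
}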
The sweep GC problem
If (reclaimed memory > cache size):
– objects are likely to be evicted from the cache by the allocator or mutator
Thus, the allocator will miss again when reusing the reclaimed memory
Lazy sweeping
Originally used to reduce page faults
Delay the sweeping until the allocator needs the memory
Pages will then be reused instead of evicted from the cache
A reminder
A mark bit is kept for each word in a heap block
A mark bit is used only if its word is the beginning of an object
Cache lazy sweeping – the collector
Scans each block's mark bits
If all bits are unmarked, the block is added to the pool of free blocks without touching it
If some bits are marked, the block is added to a queue of blocks waiting to be swept
There are several queues, one or more for each object size
Cache lazy sweeping – the allocator
Maps the request to the appropriate object free list
Returns the first object from the list
If the list is empty:
– it sweeps the queue of the right size for a block with some available objects
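A sketch of the collector/allocator split, continuing the block_desc sketch above (block_desc and block_is_empty are reused); the queue and free-list helpers are hypothetical.

#include <stddef.h>

/* Hypothetical helpers. */
extern void  add_to_free_block_pool(block_desc *b);
extern void  enqueue_unswept(block_desc *b);                 /* per-object-size queues */
extern block_desc *dequeue_unswept(size_t object_size);
extern void  sweep_block_to_free_list(block_desc *b, size_t object_size);
extern void *pop_free_list(size_t object_size);

/* Collector side: runs after the mark phase and touches only mark bits. */
void lazy_sweep_collect(block_desc **blocks, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++) {
        if (block_is_empty(blocks[i]))
            add_to_free_block_pool(blocks[i]);   /* reclaimed without touching the block */
        else
            enqueue_unswept(blocks[i]);          /* swept later, on demand               */
    }
}

/* Allocator side: sweep just enough to satisfy this request, so the freshly
 * swept memory is reused while it may still be in the cache. */
void *allocate(size_t object_size) {
    void *obj = pop_free_list(object_size);
    while (obj == NULL) {
        block_desc *b = dequeue_unswept(object_size);
        if (b == NULL)
            return NULL;                          /* caller would trigger a collection */
        sweep_block_to_free_list(b, object_size); /* builds the block's free list      */
        obj = pop_free_list(object_size);
    }
    return obj;
}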
Experimental results
Measured on two platforms
The second platform is there to get some calibration on architecture variation
Pentium III/500 results
HP PA-8000/180 based results
Results conclusions
Prefetch on grey eliminates a third to almost all of the cache-miss overhead in the marker
But it is dependent on the data structures used in the program
Results conclusions – cont.
Collector performance is determined by the marker
Sweep performance is architecture dependent
Conclusions
Be concerned about cache locality, or
have a method that does it for you
Conclusions – cont.
Real-time data profiling is feasible
A cache-conscious data layout can be produced using that information
It may help reduce the performance gap between high-level and low-level languages
Conclusions – cont.
Prefetch on grey and lazy sweeping are cheap to implement and should be in future garbage collectors
Bibliography
Using Generational Garbage Collection To Implement Cache-Conscious Data Placement – Trishul M. Chilimbi and James R. Larus
Reducing Garbage Collector Cache Misses – Hans-J. Boehm
Further reading
Look at the articles
Garbage Collection: Algorithms for Automatic Dynamic Memory Management – Richard Jones & Rafael Lins
Further reading – cont.
Cecil:
– Craig Chambers. "Object-oriented multi-methods in Cecil." In Proceedings ECOOP'92, LNCS 615, Springer-Verlag, pages 33–56, June 1992.
– Craig Chambers. "The Cecil language: Specification and rationale." University of Washington, Seattle, Technical Report TR-93-03-05, Mar. 1993.
Hyperion by Dan Simmons
Items
Large objects
Inter-generational object placement
Why explicitly build free lists?
Experimental methodology
Second experimental methodology
Large objects
Ungar and Jackson:
– there's an advantage in not copying large objects (>= 256 bytes) with the same age
A large object is never copied
Each step has an associated set of large objects
Large objects – cont.
A large object is linked into a doubly-linked list
If it survives a collection, it is removed from its list and inserted into the TO space list
No compaction is done on large objects
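A minimal sketch of that bookkeeping, with illustrative names: a surviving large object is unlinked from its FROM-step list and relinked into the TO-step list, and its payload never moves.

/* Illustrative layout: each step keeps a doubly-linked list of its large objects. */
typedef struct large_obj {
    struct large_obj *prev, *next;
    /* ... the payload (>= 256 bytes) follows and is never moved ... */
} large_obj;

typedef struct { large_obj *head; } large_list;

static void unlink_large(large_obj *o, large_list *from) {
    if (o->prev) o->prev->next = o->next;
    else         from->head    = o->next;
    if (o->next) o->next->prev = o->prev;
}

/* A large object that survives a scavenge is moved between lists, not copied. */
void promote_large(large_obj *o, large_list *from, large_list *to) {
    unlink_large(o, from);
    o->prev = NULL;
    o->next = to->head;
    if (to->head) to->head->prev = o;
    to->head = o;
}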
Large objects – cont.
Read more in David Ungar and Frank Jackson. "An adaptive tenuring policy for generation scavengers." ACM Transactions on Programming Languages and Systems, 14(1):1–27, January 1992
Two generations, one cache block
How important is co-location of inter-generation objects?
The way to achieve this is to demote or promote objects
Two generations, one cache block – cont.
Intra-generation pointers are not tracked
In order to demote an object safely, its original generation must be collected
Result: long collection times
Two generations, one cache block – cont.
Promotion can be done safely:
– the young generation is being collected and its pointers are updated
– pointers from old to young are tracked
The locality benefit will start only when the old generation is collected
Premature promotion
Why explicitly build free lists?
Allocation is fast
Heap scanning for unmarked objects can be fast using the mark bits
Little additional space overhead is required
Experimental methodology
Vortex compiler infrastructure
Vortex supports GGC only for Cecil
Cecil – a dynamically typed, purely object-oriented language
Used the Cecil benchmarks
Each experiment was repeated 5 times and the average reported
Cecil benchmarks
Cecil benchmarks – cont.
Compiled at the highest (o2) optimization level
The platform
Sun Ultraserver E5000
12 167MHz UltraSPARC processors
2GB memory – to prevent page faults
Solaris 2.5.1
The platform – memory
L1 – 16KB direct-mapped, 16B blocks
L2 – 1MB unified direct-mapped, 64B blocks
64-entry iTLB and 64-entry dTLB, fully associative
The platform – memory costs
L1 data cache hit – 1 cycle
L1 miss, L2 hit – 6 cycles
L2 miss – an additional 64 cycles
Second experimental methodology
Two platforms
All benchmarks except one are C programs
Pentium measurements
Dual-processor 500MHz Pentium III (but only one was used)
100MHz bus
512KB L2 cache
Physical memory > 300MB (why keep it a secret?), which prevented paging and kept the whole executable in memory
RedHat 6.1
Benchmarks compiled using gcc with -O2
RISC measurements
A single PA-8000/180MHz processor
Running HP/UX 11
Single-level I and D caches, 1MB each
Benchmarks
Execution time measurements are an average of five runs
The division between sweep and mark times is arbitrary
The Pentium III prefetcht0 instruction introduced a new overhead, so prefetchnta was used; it was less effective at eliminating cache misses, though
The end
Thank you for listening! (and staying awake…)
Lectured by: Shachar Rubinstein, shachar1@post.tau.ac.il
GC seminar: Mooly Sagiv
Audience: you
Thanks: for your patience, the PowerPoint XP effects, my parents
No animals were harmed during this production (except for annoying mosquitoes)