A non-blocking approach to GPU dynamic memory management
Joy Lee @ NVIDIA
Outline

- Introduction to the buddy memory system
- Our parallel implementation
- Performance comparison
- Discussion
Fixed-size memory (memory pool)

- The fastest and simplest memory system.
- Free list (item = address)
  - Each item of the free list records an available address to allocate.
  - The free list can be implemented with a queue, stack, list, or any other data structure.
- Allocate
  - Just take one item from the free list.
- Free
  - Just return the address to the free list.
- Performance
  - Constant time for both allocation and free.

[Diagram: a free list holding the addresses 0x0000, 0x0100, 0x0200, 0x0300, ...]
Multi-list memory

- To manage non-fixed-size memory, a natural extension of the fixed-size pool is a multi-list memory system: multiple free lists of fixed-size memory with different sizes (e.g. doubling from list to list).
- Allocate
  - Find the first free list whose size is at least the request size by an arithmetic operation, e.g. ceil(log2(size)).
  - Take one element from that free list.
- Free
  - Find the correct free list to free into.
  - Return the address to that free list.
- Performance
  - Constant time for both allocation and free, since the suitable free list can be found with an arithmetic operation instead of a linear search.
  - Drawback: wastes memory.

[Diagram: free lists of sizes 256, 512, 1024, 2048, ...]
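The arithmetic lookup above can be sketched as follows (size_class and min_size are illustrative names, not from the slides):

```cpp
#include <cassert>
#include <cstddef>

// Map a request size to the first free list whose block size is large
// enough. With power-of-two size classes starting at min_size, this is
// ceil(log2(size / min_size)); the loop runs at most log2(max/min) times,
// a constant, so lookup is O(1) rather than a linear search.
size_t size_class(size_t size, size_t min_size) {
    size_t idx = 0, block = min_size;
    while (block < size) {
        block <<= 1;
        ++idx;
    }
    return idx;
}
```

For the diagram's lists: a 300-byte request maps past the 256-byte list to the 512-byte list (index 1), which is also where the wasted memory comes from.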
Buddy memory

- To avoid the memory waste of the multi-list system, it is natural to allocate memory from the next upper layer (twice the size) when a free list is empty, instead of pre-allocating memory in every free list.
- Free lists: multiple free lists of fixed-size memory, with sizes growing in powers of 2.
- Allocate
  - Find the first free list whose size is at least the request size.
  - Take one element from that free list.
  - If the free list is empty, create pairs from the upper list.
- Free
  - Find the correct free list to free into (using records).
  - Return the address to that free list.
  - If the buddy is also in the free list, merge and free to the upper layer.
- Performance
  - Constant time for both allocation and free.

[Diagram: free lists of sizes 256, 512, 1024, 2048, 4096]
Buddy memory

- Good internal de-fragmentation.
- The buddy address can be calculated by address XOR size.
- Constant-time operation: O(h), where h = log2(max size / min size) is a constant.

[Diagram: a block ("this") and its buddy within the parent block]
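The XOR rule can be stated in two lines of code (buddy_of is an illustrative name; addresses are taken as offsets from the heap base, aligned to size):

```cpp
#include <cassert>
#include <cstdint>

// Two buddies of size `size` differ only in the bit that `size` sets,
// so XOR-ing the address with the size flips between a block and its
// buddy in either direction.
uintptr_t buddy_of(uintptr_t addr, uintptr_t size) {
    return addr ^ size;
}
```

Note the operation is its own inverse: the buddy of the buddy is the original block.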
Memory layers

- Implement just one class for a single layer; the other layers are instances with different sizes.
- Lower layer: the memory layer with 1/2 the size of the current layer.
- Current layer: the layer serving the allocation request.
- Upper layer: the memory layer with 2x the size of the current layer.

[Diagram: free lists of sizes 256 (lower layer), 512 (current layer), 1024 (upper layer), 2048, 4096, ...]
Pair creation

- If the current free list is empty, the layer allocates memory from the upper allocator.
- Since the upper layer's size is 2x, each upper block creates a pair of available blocks in the current free list.
- If N threads simultaneously allocate memory in the current layer while its free list is empty, only N/2 threads need to allocate memory from the upper layer.

[Diagram: one block of memory from the upper layer split into two blocks for the current layer]
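The split itself is just address arithmetic; a sketch (create_pair is an illustrative name, not from the slides):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Pair creation: one block from the upper layer (size 2*s) yields two
// blocks of size s for the current layer. The two halves are buddies:
// they differ exactly in the bit that `size` sets.
std::pair<uintptr_t, uintptr_t> create_pair(uintptr_t upper_addr,
                                            uintptr_t size) {
    return { upper_addr, upper_addr + size };
}
```

This is why one upper-layer allocation satisfies two current-layer requests, so only N/2 of N starving threads need to go up a layer.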
Free queue

- The free list is implemented as a queue whose head can run past its tail:
  - Head < Tail: memory available (allocate directly from this free list).
  - Head = Tail: empty free list.
  - Head > Tail: under-available (requires pair creation from the upper layer).
- These states determine which threads shall call pair_creation() on the upper layer.
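The three states reduce to a comparison of the two counters; a sketch (the enum and function names are mine):

```cpp
#include <cassert>

// Classify the free queue by comparing head and tail. Head counts issued
// allocation slots and tail counts deposited items, so head may run past
// tail when demand outstrips the available memory.
enum class QueueState { Available, Empty, UnderAvailable };

QueueState queue_state(long long head, long long tail) {
    if (head < tail)  return QueueState::Available;       // allocate directly
    if (head == tail) return QueueState::Empty;           // empty free list
    return QueueState::UnderAvailable;                    // needs pair creation
}
```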
Parallel strategy (Alloc)

- Each allocation requester creates a "socket" to listen for its address:
  - The socket is implemented on the free queue.
  - atomicAdd(&head, 1) creates a socket.
- The output address can come from the current free list or from pair creation in the upper free list.

[Diagram: requesting threads advance Head to New Head; slots beyond the available memory in the free queue need pair creation from the upper layer]
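The socket reservation is a single atomic increment; a host-side sketch with std::atomic standing in for CUDA's atomicAdd (reserve_socket is an illustrative name):

```cpp
#include <atomic>
#include <cassert>

// Each allocation request reserves a "socket" (a slot index in the free
// queue) with one atomic increment of head. The returned old value is
// this thread's private slot; the address is later delivered into it,
// either from the current free list or from upper-layer pair creation.
long long reserve_socket(std::atomic<long long>& head) {
    return head.fetch_add(1);
}
```

Because fetch_add hands out distinct old values, concurrent requesters never contend for the same slot, which is what makes the scheme non-blocking.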
Odd/even pair creation

- The under-available threads perform pair creations in an odd/even loop until the new tail >= the new head, to avoid the overhead of simultaneous pair creation.

[Diagram: requesting threads advance Head to New Head; pair creations advance Tail to New Tail until New Tail >= New Head]
Parallel strategy (Free)

- Store the freed address in the free list.
- Calculate the buddy address: XOR(addr, size).
- Check whether the buddy is already in the free list, using the handshake algorithm for fast lookup.
- If YES, mark both elements in the free list as N/A, then free the merged block into the upper layer.
Hand shake

- Hand shake:
  - The freed memory block records its index in the free list.
  - The free list records the freed memory's address.
- Fast check whether the buddy address is in the free list:
  - Calculate the buddy memory address (XOR).
  - Read the index stored at that address.
  - Check whether the address at that index in the free list equals the buddy memory address.

[Diagram: the memory block records its index in the free list; the free list records the address of the memory block]
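The handshake check can be sketched as follows (the Block struct and function name are mine; in the real scheme the index would be read out of the buddy's memory block itself):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Handshake lookup: a freed block stores its own slot index, and the
// free list stores the block's address at that slot. The buddy is in
// the free list only if both records agree, i.e. following the buddy's
// stored index back into the free list lands on the buddy's address.
struct Block {
    uintptr_t addr;
    long index;   // slot this block claims to occupy in the free list
};

bool buddy_in_free_list(const std::vector<uintptr_t>& free_list,
                        const Block& buddy) {
    return buddy.index >= 0 &&
           buddy.index < (long)free_list.size() &&
           free_list[buddy.index] == buddy.addr;
}
```

The two-way agreement is what makes a single read safe: a stale index in a reused block fails the check because the free-list entry no longer points back.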
Performance

gridDim = 512, blockDim = 512, K20

Benchmark                                              | CUDA 5.0  | This     | Speedup
-------------------------------------------------------|-----------|----------|--------
256 bytes, alloc/free a single time                    | 278.9 ms  | 10.8 ms  | 25.8x
256 bytes, alloc                                       | 7155.4 ms | 10.48 ms | 682x
256 bytes, free                                        | 5671.2 ms | 7.27 ms  | 780x
Random # of bytes, alloc/free 35 times (< lower 2 layers) | 5376.3 ms | 65.8 ms  | 81.7x
Random # of bytes, alloc/free 35 times (full range)    | 4153.8 ms | 370.5 ms | 11.2x
Discussion

- Warp-level group allocation
- Dynamically expanding free queue
Backup Slides

Slow atomicCAS() loop

// Lock-free pop: retry atomicCAS until head is swapped to head->next.
Node *ret = head, *now;
do {
    now = ret;
    ret = (Node*)atomicCAS((unsigned long long*)&head,
                           (unsigned long long)now,
                           (unsigned long long)now->next);
} while (ret != now);