List ranking and Parallel Prefix

Sathish Vadhiyar
List Ranking on GPUs
 Linked list prefix computations –
computations of prefix sum on the elements
contained in a linked list
 Linked list represented as an array
 Irregular memory accesses – the successor of
each node of a linked list can be located
anywhere in the array
 List ranking – special case of list prefix
computations in which every value is 1, so the
prefix sum of a node is its rank in the list
List ranking
 L is a singly linked list
 Each node contains two fields – a data
field, and a pointer to the successor
 Prefix sums – updating data field with
summation of values of its predecessors
and itself
 L represented by an array X with fields
X[i].prefix and X[i].succ
Sequential Algorithm
 Simple and effective
 Two passes
 Pass 1: To identify the head node
 Pass 2: Traverses starting from the head,
follow the successor nodes accumulating the
prefix sums in the traversal order
 Works well in practice
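The two passes can be sketched in plain C++ as follows. This is a minimal illustration, not code from the lecture: the names Node, find_head and sequential_list_prefix are hypothetical, and the use of -1 as the tail's successor is an assumption.

```cpp
#include <vector>

// One list node stored in an array: `prefix` is the data field and
// `succ` is the array index of the successor (-1 marks the tail; an assumption).
struct Node {
    long prefix;
    int succ;
};

// Pass 1: the head is the only index that never appears as a successor.
int find_head(const std::vector<Node>& X) {
    std::vector<bool> has_pred(X.size(), false);
    for (const Node& node : X)
        if (node.succ >= 0) has_pred[node.succ] = true;
    for (int i = 0; i < static_cast<int>(X.size()); ++i)
        if (!has_pred[i]) return i;
    return -1;  // empty array, or the input is not a proper list
}

// Pass 2: walk from the head, storing the running sum in each node.
void sequential_list_prefix(std::vector<Node>& X) {
    int cur = find_head(X);
    long running = 0;
    while (cur >= 0) {
        running += X[cur].prefix;
        X[cur].prefix = running;
        cur = X[cur].succ;
    }
}
```

With every value set to 1, the stored prefix of a node is its rank counted from the head.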
Parallel Algorithm: Prefix computations on arrays
 Array X partitioned into subarrays
 Local prefix sums of each subarray
calculated in parallel
 Prefix sums of last elements of each
subarray written to a separate array Y
 Prefix sums of elements in Y are calculated.
 Each prefix sum of Y is added to every
element of the following block of X
 Divide and conquer strategy
[Figure: the array 1 2 3 | 4 5 6 | 7 8 9 split across three processors, shown in four stages: local prefix sum; passing last elements; computing prefix sum of last elements; adding global prefix sum to local prefix sum in each processor]
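A sequential sketch of the blocked scheme may help make the four stages concrete. The function name blocked_prefix_sum and the parameter p (number of blocks) are assumptions for illustration. Each loop over blocks is independent across blocks, which is where a parallel version would assign one processor per block.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inclusive prefix sum of X computed block by block with p blocks.
std::vector<long> blocked_prefix_sum(std::vector<long> X, std::size_t p) {
    const std::size_t n = X.size();
    if (n == 0 || p == 0) return X;
    const std::size_t block = (n + p - 1) / p;               // ceil(n / p)
    const std::size_t num_blocks = (n + block - 1) / block;

    // Stage 1: local prefix sums inside every block.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t lo = b * block, hi = std::min(n, lo + block);
        for (std::size_t i = lo + 1; i < hi; ++i) X[i] += X[i - 1];
    }
    // Stage 2: copy the last element of each block into Y.
    std::vector<long> Y(num_blocks);
    for (std::size_t b = 0; b < num_blocks; ++b)
        Y[b] = X[std::min(n, (b + 1) * block) - 1];
    // Stage 3: prefix sums of Y (Y is short, so this is done serially).
    for (std::size_t b = 1; b < num_blocks; ++b) Y[b] += Y[b - 1];
    // Stage 4: add the total of all preceding blocks to block b.
    for (std::size_t b = 1; b < num_blocks; ++b) {
        const std::size_t lo = b * block, hi = std::min(n, lo + block);
        for (std::size_t i = lo; i < hi; ++i) X[i] += Y[b - 1];
    }
    return X;
}
```

For the array 1..9 with three blocks, the local sums are 1 3 6 | 4 9 15 | 7 15 24, Y becomes 6 21 45, and the final result is 1 3 6 10 15 21 28 36 45.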
Prefix computation on list
 The previous strategy cannot be applied
 Division of array X that represents list
will lead to subarrays each of which can
have many sublist fragments
 Head nodes will have to be calculated for
each of them
Parallel List Ranking (Wyllie’s Algorithm)
 Involves repeated pointer jumping
 Successor pointer of each element is
repeatedly updated so that it jumps over
its successor until it reaches the end of
the list
 As each processor traverses and updates
the successor, the ranks are updated
 A process or thread is assigned to each
element of the list
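Pointer jumping can be simulated on a CPU with synchronous rounds, as in the sketch below. The name wyllie_rank and the -1 tail marker are assumptions. Because jumping follows successor pointers, the quantity accumulated here is the distance to the tail, which is converted to a position from the head at the end. The double buffering stands in for the synchronization that a parallel version needs between rounds.

```cpp
#include <vector>

// Returns rank[i], the 1-based position of node i counted from the head.
// succ[i] is the successor index of node i (-1 for the tail; an assumption).
std::vector<int> wyllie_rank(std::vector<int> succ) {
    const int n = static_cast<int>(succ.size());
    // dist[i] counts the nodes from i up to (not including) succ[i],
    // or up to and including the tail once succ[i] is -1.
    std::vector<int> dist(n, 1);
    bool changed = true;
    while (changed) {  // about log2(n) rounds
        changed = false;
        std::vector<int> next_dist = dist, next_succ = succ;  // double buffer
        for (int i = 0; i < n; ++i) {  // one "thread" per element
            if (succ[i] >= 0) {
                next_dist[i] = dist[i] + dist[succ[i]];
                next_succ[i] = succ[succ[i]];  // jump over the successor
                changed = true;
            }
        }
        dist.swap(next_dist);
        succ.swap(next_succ);
    }
    std::vector<int> rank(n);
    for (int i = 0; i < n; ++i) rank[i] = n - dist[i] + 1;
    return rank;
}
```

Each round doubles the distance covered by every pointer, so the loop ends after roughly log2(n) rounds, but all n elements do work in every round.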
Parallel List Ranking (Wyllie’s Algorithm)
 Leads to heavy synchronization among threads
 In CUDA – many kernel invocations
Parallel List Ranking (Helman and JáJá)
 Randomly select s nodes or splitters. The head node
is also a splitter
 Form s sublists. In each sublist, start from a splitter
as the head node, and traverse until another splitter is
reached
 Form prefix sums in each sublist
 Form another list, L’, consisting of only these
splitters in the order they are traversed. The values
in each entry of this list will be the prefix sum
calculated in the respective sublists
 Calculate prefix sums for this list
 Add these sums to the values of the sublists
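The splitter-based scheme can be sketched sequentially as below. This is an illustrative reading of the steps, not the authors' implementation: the function name splitter_list_prefix, the is_splitter flag array and the -1 tail marker are all assumptions. The offset added to a sublist is the sum of the totals of all sublists that precede it in list order.

```cpp
#include <cassert>
#include <vector>

// val[i]: value at node i; succ[i]: successor index, -1 at the tail (assumption).
// is_splitter[i]: true for the chosen splitters; the head must be one of them.
// Returns inclusive prefix sums in list order, indexed by array position.
std::vector<long> splitter_list_prefix(const std::vector<long>& val,
                                       const std::vector<int>& succ,
                                       const std::vector<bool>& is_splitter,
                                       int head) {
    const int n = static_cast<int>(val.size());
    if (n == 0) return {};
    assert(is_splitter[head]);
    std::vector<long> prefix(n, 0), sub_total(n, 0), offset(n, 0);
    std::vector<int> owner(n, -1), next_splitter(n, -1);

    // Sublists: each splitter walks until it meets the next splitter.
    // The iterations are independent, so they could run in parallel.
    for (int sp = 0; sp < n; ++sp) {
        if (!is_splitter[sp]) continue;
        long running = 0;
        int cur = sp;
        do {
            running += val[cur];
            prefix[cur] = running;  // prefix sum local to this sublist
            owner[cur] = sp;
            cur = succ[cur];
        } while (cur >= 0 && !is_splitter[cur]);
        sub_total[sp] = running;
        next_splitter[sp] = cur;    // -1 if this sublist ends the list
    }

    // List L': visit splitters in list order, accumulating sublist totals.
    long running = 0;
    for (int sp = head; sp >= 0; sp = next_splitter[sp]) {
        offset[sp] = running;       // total of all earlier sublists
        running += sub_total[sp];
    }

    // Add each sublist's offset to the local prefix sums of its nodes.
    for (int i = 0; i < n; ++i) prefix[i] += offset[owner[i]];
    return prefix;
}
```

For the list 2 -> 0 -> 3 -> 1 with values 5, 1, 2, 4 in list order and splitters at nodes 2 and 3, the sublist totals are 6 and 6, and the final prefix sums in list order are 5, 6, 8, 12.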
Parallel List Ranking on GPUs:
 Step 1: Compute the location of the head of
the list
 Each of the indices between 0 and n-1,
except that of the head node, occurs exactly
once among the successors.
 Hence head node = n(n-1)/2 – SUM_SUCC
 SUM_SUCC = sum of the successor values
 Can be done on GPUs using parallel
reduction
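The formula can be checked with a few lines of C++. The sketch assumes the tail stores -1 as its successor and therefore contributes nothing to SUM_SUCC; the name find_head_by_sum is hypothetical. On a GPU the summation loop would be replaced by a parallel reduction.

```cpp
#include <cstdint>
#include <vector>

// succ[i] is the successor index of node i; the tail stores -1 (an assumption).
int find_head_by_sum(const std::vector<int>& succ) {
    const std::int64_t n = static_cast<std::int64_t>(succ.size());
    std::int64_t sum_succ = 0;
    for (int s : succ)
        if (s >= 0) sum_succ += s;  // skip the tail's marker
    // 0 + 1 + ... + (n-1) = n(n-1)/2; the only index missing from
    // the successors is the head.
    return static_cast<int>(n * (n - 1) / 2 - sum_succ);
}
```

For succ = {3, -1, 0, 1} the successor sum is 4 and n(n-1)/2 is 6, so the head is node 2.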
Parallel List Ranking on GPUs:
 Step 2: Select s random nodes to split
list into s random sublists
 For every subarray of X of size n/s,
select a random location as a splitter.
 Highly data parallel, can be done
independent of each other
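One way to sketch the selection is shown below. The function name select_splitters, the seed parameter and the choice of std::mt19937 are assumptions. Adding the head explicitly follows the earlier requirement that the head node is also a splitter. Each chunk is handled independently, which is what makes this step data parallel.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Picks one random index from each of s contiguous chunks of [0, n),
// then makes sure the head is among the splitters.
std::vector<int> select_splitters(int n, int s, int head, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<int> splitters;
    for (int k = 0; k < s; ++k) {
        const int lo = static_cast<int>(static_cast<long long>(k) * n / s);
        const int hi = static_cast<int>(static_cast<long long>(k + 1) * n / s);
        if (lo == hi) continue;  // empty chunk, only possible when s > n
        std::uniform_int_distribution<int> pick(lo, hi - 1);  // inclusive bounds
        splitters.push_back(pick(rng));
    }
    if (std::find(splitters.begin(), splitters.end(), head) == splitters.end())
        splitters.push_back(head);
    return splitters;
}
```

The result holds s or s+1 indices, depending on whether the head happened to be picked at random.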
Parallel List Ranking on GPUs:
 Step 3: Using standard sequential
algorithm, compute prefix sums of each
sublist separately
 The most computationally demanding step
 s sublists allocated equally among CUDA
blocks, and then allocated equally among
threads in a block
 Each thread computes the prefix sums of each
of its sublists, and copies the prefix value of the
last element of sublist i to Sublist[i]
Parallel List Ranking on GPUs:
 Step 4: Compute prefix sum of splitters,
where the successor of a splitter is the
next splitter encountered when
traversing the list
 This list is small
 Hence can be done on CPU
Parallel List Ranking on GPUs:
 Step 5: Update values of prefix sums
computed in step 3 using splitter prefix
sums of step 4
 This can be done using coalesced memory
access – access by threads to contiguous
memory locations
Choosing s
 Large values of s increase the chance of
threads dealing with equal numbers of
nodes
 However, too large values result in
overhead of sublist creation and
Parallel Prefix on GPUs
 Using binary tree
 An upward reduction phase (reduce phase or
up-sweep phase)
 Traversing tree from leaves to root forming
partial sums at internal nodes
 Down-sweep phase
 Traversing from root to leaves using partial sums
computed in reduction phase
Up Sweep
Down Sweep
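The two phases can be traced on the CPU with the sketch below, which walks the same implicit binary tree serially. The name tree_exclusive_scan is hypothetical, and the array length is assumed to be a power of two. The result is an exclusive scan: each element receives the sum of all elements before it.

```cpp
#include <vector>

// In-place exclusive scan; a.size() must be a power of two (assumption).
void tree_exclusive_scan(std::vector<float>& a) {
    const int n = static_cast<int>(a.size());
    if (n == 0) return;
    // Up-sweep: each internal node gets the sum of its two children.
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];
    // Down-sweep: clear the root, then push partial sums towards the leaves.
    a[n - 1] = 0;
    for (int stride = n / 2; stride >= 1; stride /= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            const float left = a[i - stride];
            a[i - stride] = a[i];  // left child takes the parent's value
            a[i] += left;          // right child adds the left subtree's sum
        }
}
```

For the input 1 2 3 4 5 6 7 8, the up-sweep leaves 1 3 3 10 5 11 7 36, and the down-sweep produces 0 1 3 6 10 15 21 28.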
Host Code
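int main(){
    // num_elements (a power of two) and h_data are assumed to be set up earlier
    const unsigned int num_threads = num_elements / 2;  // each thread handles two elements
    const unsigned int mem_size = sizeof(float) * num_elements;
    /* cudaMalloc d_idata and d_odata */
    cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);
    // the kernel indexes with threadIdx.x only, so a single block covers the array
    dim3 grid(1, 1, 1); dim3 threads(num_threads, 1, 1);
    // third launch parameter: bytes of dynamic shared memory backing temp[]
    scan_workefficient<<< grid, threads, mem_size >>>(d_odata, d_idata, num_elements);
    cudaMemcpy(h_data, d_odata, mem_size, cudaMemcpyDeviceToHost);
    /* cudaFree d_idata and d_odata */
    return 0;
}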
Device Code
// Exclusive scan of n elements by one block of n/2 threads; n must be a power of two
__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
    // Dynamically allocated shared memory for scan kernels
    extern __shared__ float temp[];
    int thid = threadIdx.x;
    int offset = 1;
    // Cache the computational window in shared memory
    temp[2*thid] = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];
    // build the sum in place up the tree
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();  // make the previous level visible to all threads
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    // scan back down the tree
    // clear the last element
    if (thid == 0)
        temp[n - 1] = 0;
    // traverse down the tree building the scan in place
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            float t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    // write results to global memory
    g_odata[2*thid] = temp[2*thid];
    g_odata[2*thid+1] = temp[2*thid+1];
}
References
 Fast and Scalable List Ranking on the GPU. ICS 2009.
 Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA. IPDPS 2010.