List Ranking on GPUs
Sathish Vadhiyar
List Ranking on GPUs
• Linked list prefix computations –
computations of prefix sum on the elements
contained in a linked list
• Irregular memory accesses – the successor of each
node of a linked list can be located anywhere in
memory
• List ranking – special case of list prefix
computations in which all the values are
identity, i.e., 1.
List ranking
• L is a singly linked list
• Each node contains two fields – a data field,
and a pointer to the successor
• Prefix sums – updating the data field of each node
with the sum of the values of its predecessors and
itself
• L represented by an array X with fields
X[i].prefix and X[i].succ
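A minimal sketch of this array representation; the field names follow the slides, while the sentinel value -1 for the tail's successor is an assumption made for illustration:

struct ListNode {
    long long prefix;   // data field; after ranking it holds the prefix sum
    int succ;           // index of the successor node in X; -1 marks the tail
};
// The whole list is then stored as an array X of such nodes: ListNode X[n];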
Sequential Algorithm
• Simple and effective
• Two passes
– Pass 1: Identify the head node
– Pass 2: Starting from the head, follow the successor
pointers, accumulating the prefix sums in traversal
order
• Works well in practice
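A host-side sketch of the two passes, assuming the ListNode layout from the previous sketch (prefix initially holds each node's value, succ == -1 marks the tail):

#include <vector>

// Pass 1: the head is the only index that never appears as a successor.
int findHead(const std::vector<ListNode>& X) {
    std::vector<char> isSucc(X.size(), 0);
    for (const ListNode& node : X)
        if (node.succ != -1) isSucc[node.succ] = 1;
    for (int i = 0; i < (int)X.size(); ++i)
        if (!isSucc[i]) return i;
    return -1;   // unreachable for a well-formed list
}

// Pass 2: walk the list from the head, accumulating prefix sums.
void sequentialListRanking(std::vector<ListNode>& X) {
    long long running = 0;
    for (int i = findHead(X); i != -1; i = X[i].succ) {
        running += X[i].prefix;   // X[i].prefix initially holds the node's value
        X[i].prefix = running;    // overwrite it with the prefix sum
    }
}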
Parallel Algorithm: Prefix
computations on arrays
• Array X partitioned into subarrays
• Local prefix sums of each subarray calculated in
parallel
• Prefix sums of last elements of each subarray
written to a separate array Y
• Prefix sums of elements in Y are calculated.
• Each prefix sum in Y is added to every element of
the following block of X
• Divide and conquer strategy
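A minimal host-side sketch of this blocked strategy; the phases are written as plain loops here, with the understanding that phases 1 and 3 are data parallel across blocks (the function name and block-size choice are assumptions for illustration):

#include <algorithm>
#include <vector>

void blockedPrefixSum(std::vector<long long>& X, int numBlocks) {
    int n = (int)X.size();
    int blockSize = (n + numBlocks - 1) / numBlocks;
    std::vector<long long> Y(numBlocks, 0);

    // Phase 1: local prefix sums inside each subarray (parallel across blocks).
    for (int b = 0; b < numBlocks; ++b) {
        long long running = 0;
        for (int i = b * blockSize; i < std::min(n, (b + 1) * blockSize); ++i)
            X[i] = (running += X[i]);
        Y[b] = running;                          // prefix sum of the block's last element
    }

    // Phase 2: prefix sums of the per-block totals in Y (small, done serially).
    for (int b = 1; b < numBlocks; ++b) Y[b] += Y[b - 1];

    // Phase 3: add the total of all preceding blocks to every later block.
    for (int b = 1; b < numBlocks; ++b)
        for (int i = b * blockSize; i < std::min(n, (b + 1) * blockSize); ++i)
            X[i] += Y[b - 1];
}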
Example
• Input array X: 1 2 3 4 5 6 7 8 9, partitioned into three subarrays: (1 2 3), (4 5 6), (7 8 9)
• Local prefix sums: (1, 3, 6), (4, 9, 15), (7, 15, 24)
• Last elements of the subarrays form Y = (6, 15, 24); prefix sums of Y: (6, 21, 45)
• Adding each prefix sum of Y to the next subarray gives the final result: 1, 3, 6, 10, 15, 21, 28, 36, 45
Prefix computation on list
• The previous strategy cannot be applied directly here
• Dividing the array X that represents the list leads
to subarrays, each of which can contain many
sublist fragments
• A head node would have to be found for each of
these fragments
Parallel List Ranking (Wyllie’s
algorithm)
• Involves repeated pointer jumping
• Successor pointer of each element is
repeatedly updated so that it jumps over its
successor until it reaches the end of the list
• As each processor traverses and updates the
successor, the ranks are updated
• A process or thread is assigned to each
element of the list
Parallel List Ranking (Wyllie’s
algorithm)
• Leads to heavy synchronization among CUDA
threads and many kernel invocations, one per
pointer-jumping round (sketched below)
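A sketch of one pointer-jumping round, assuming a structure-of-arrays layout: rank[] is initialized with the node values (all 1s for plain list ranking) and next[] holds successor indices, with -1 at the tail. This variant accumulates each node's rank toward the tail; double buffering avoids races between reads and writes within a round. All array and kernel names are assumptions.

#include <cuda_runtime.h>

__global__ void wyllieStep(const long long* rank, const int* next,
                           long long* rankOut, int* nextOut, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int s = next[i];
    if (s != -1) {
        rankOut[i] = rank[i] + rank[s];   // absorb the successor's rank ...
        nextOut[i] = next[s];             // ... and jump over it
    } else {
        rankOut[i] = rank[i];             // already points past the end of the list
        nextOut[i] = -1;
    }
}

The host launches this kernel roughly ceil(log2 n) times, swapping the input and output buffers between rounds; each launch acts as the global synchronization point, which is where the many kernel invocations mentioned above come from.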
Parallel List Ranking (Helman and JaJa)
• Randomly select s nodes or splitters. The head node is
also a splitter
• Form s sublists. In each sublist, start from a splitter as
the head node, and traverse till another splitter is
reached.
• Form prefix sums in each sublist
• Form another list, L’, consisting of only these splitters in
the order they are traversed. The values in each entry
of this list will be the prefix sum calculated in the
respective sublists
• Calculate prefix sums for this list
• Add these sums to the values of the sublists
Parallel List Ranking on GPUs: Steps
• Step 1: Compute the location of the head of
the list
• Each of the indices between 0 and n-1, except the
head node, occurs exactly once among the
successor values
• Hence head node = n(n-1)/2 – SUM_SUCC
• SUM_SUCC = sum of the successor values
• Can be done on GPUs using parallel reduction
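A sketch of this step, assuming the tail's successor is stored as the sentinel -1 and skipped in the sum, so that every index except the head contributes exactly once. A simple atomicAdd-based sum is shown for brevity; a tree-based parallel reduction, as suggested above, would normally be used.

#include <cuda_runtime.h>

__global__ void sumSuccessors(const int* succ, int n, unsigned long long* sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && succ[i] != -1)
        atomicAdd(sum, (unsigned long long)succ[i]);   // accumulates SUM_SUCC
}

// Host side, after copying *sum back from the device:
//   head = n*(n-1)/2 - SUM_SUCC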
Parallel List Ranking on GPUs: Steps
• Step 2: Select s random nodes to split list into
s random sublists
• For every subarray of X of size n/s, select a
random location as a splitter
• Highly data parallel; the selections can be made
independently of each other
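A host-side sketch of the splitter selection, assuming chunks of size n/s, rand() as the random source, and the simplification that the head node replaces the random pick of the first chunk (all assumptions for illustration):

#include <cstdlib>
#include <vector>

std::vector<int> pickSplitters(int n, int s, int head) {
    int chunk = n / s;
    std::vector<int> splitters(s);
    splitters[0] = head;                            // the head node is always a splitter
    for (int j = 1; j < s; ++j)
        splitters[j] = j * chunk + rand() % chunk;  // random node within chunk j
    return splitters;
}

On the GPU the same per-chunk picks can be made by independent threads, since no chunk depends on another.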
Parallel List Ranking on GPUs: Steps
• Step 3: Using standard sequential algorithm,
compute prefix sums of each sublist separately
• The most computationally demanding step
• s sublists allocated equally among CUDA blocks,
and then allocated equally among threads in a
block
• Each thread computes the prefix sums of each of its
sublists, and copies the prefix value of the last
element of sublist i to Sublist[i]
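A sketch of this step with one thread per sublist, assuming a structure-of-arrays layout (prefix[], succ[]), an isSplitter[] flag array from step 2, sublistHead[i] holding the i-th splitter, and sublistSum[i] receiving the last prefix value of sublist i; sublistId[] records each node's sublist for use in step 5 (all names are assumptions for illustration):

#include <cuda_runtime.h>

__global__ void rankSublists(long long* prefix, const int* succ,
                             const char* isSplitter, const int* sublistHead,
                             int* sublistId, long long* sublistSum, int s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= s) return;
    long long running = 0;
    int node = sublistHead[i];
    // Walk from this splitter until the next splitter (or the tail) is reached.
    while (node != -1) {
        sublistId[node] = i;           // remember the owning sublist for step 5
        running += prefix[node];       // prefix[] initially holds the node values
        prefix[node] = running;        // store the local (sublist-relative) prefix sum
        node = succ[node];
        if (node != -1 && isSplitter[node]) break;
    }
    sublistSum[i] = running;           // prefix value of the last element of sublist i
}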
Parallel List Ranking on GPUs: Steps
• Step 4: Compute prefix sum of splitters, where
the successor of a splitter is the next splitter
encountered when traversing the list
• This list is small
• Hence can be done on CPU
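A host-side sketch of this step, assuming step 3 also recorded, for each sublist i, the index of the sublist that follows it in list order (nextSublist[i], with -1 for the last one); this extra bookkeeping array is an assumption made for illustration. Each sublist receives the sum of all sublists preceding it in the list, which is the value added to its elements in step 5.

#include <vector>

std::vector<long long> splitterOffsets(const std::vector<long long>& sublistSum,
                                       const std::vector<int>& nextSublist,
                                       int firstSublist) {
    std::vector<long long> offset(sublistSum.size(), 0);
    long long running = 0;
    for (int i = firstSublist; i != -1; i = nextSublist[i]) {
        offset[i] = running;           // total of all sublists earlier in the list
        running += sublistSum[i];
    }
    return offset;
}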
Parallel List Ranking on GPUs: Steps
• Step 5: Update values of prefix sums
computed in step 3 using splitter prefix sums
of step 4
• This can be done using coalesced memory
access
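A sketch of the final update, assuming sublistId[] from step 3 and the per-sublist offsets from step 4 have been copied to the device. One thread per list element; consecutive threads touch consecutive entries of prefix[] and sublistId[], which is what makes the accesses coalesced.

#include <cuda_runtime.h>

__global__ void addSublistOffsets(long long* prefix, const int* sublistId,
                                  const long long* offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        prefix[i] += offset[sublistId[i]];   // global prefix = local prefix + sublist offset
}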
Choosing s
• Large values of s increase the chance that threads
deal with roughly equal numbers of nodes
• However, values that are too large add overhead
for sublist creation and aggregation
References
• M. S. Rehman, K. Kothapalli, and P. J. Narayanan.
Fast and Scalable List Ranking on the GPU. ICS 2009.
• Z. Wei and J. JaJa. Optimization of Linked List Prefix
Computations on Multithreaded GPUs Using CUDA.
IPDPS 2010.