
CPU & GPU Parallelization of Scrabble Word Searching

Jonathan Wheeler

Yifan Zhou


Scrabble Word Searching – Problem Overview

• Given the current board and the tiles in hand, find the highest-scoring word(s) at the highest-scoring position(s).

• Requires searching 173k–268k words, depending on the lexicon.

• Fundamentally a brute-force problem. Why?

• Finding adjacent words requires only one linear search.

• Finding intersecting words greatly increases the search complexity: ~26x more possibilities must be considered per intersecting word!

• Players may also have blank tiles, which are effectively "wildcards": each blank tile produces ~26x more valid results!

• (Continued on next slide…)

Scrabble Word Searching – Problem Overview (cont’d)

• We don’t know what specific words we’re looking for.

• We only know:

Letters a word must not have.

Letters a word may have.

Letters a word must have.

• Letters may appear in any order.

• Result: the search is linear and brute force in nature.

• Small performance tricks exist, but the search remains fundamentally brute force.

• No heuristics apply (unlike, say, chess). (A letter-counting sketch follows this list.)
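These three constraints reduce to simple letter counting. A minimal C++ sketch of such a rack-feasibility test, assuming lowercase words; this is an illustration, not the authors' code, and canFormWord and its parameters are hypothetical:

#include <string>

// Can `word` be built from the available tiles? counts[i] holds how many
// tiles of letter 'a'+i are in hand; `blanks` is the number of blank tiles.
bool canFormWord(const std::string& word, const int counts[26], int blanks)
{
    int need[26] = {0};
    for (char c : word)
        ++need[c - 'a'];                  // letters may be in any order

    int shortfall = 0;
    for (int i = 0; i < 26; ++i)
        if (need[i] > counts[i])
            shortfall += need[i] - counts[i];

    return shortfall <= blanks;           // blanks cover any missing letters
}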

Why this problem?

• Abundant parallelization opportunities within.

• Independent results at the lowest levels.

• Yet difficult to parallelize:

• Low-level results must be aggregated into higher-level results.

• Aggregation requires further synchronization.

• Duplicate results must be discarded. They arise from:

1. Algorithmic nuances.

2. Result pollution.

• Linear search problems are abundant in the real world.

• They are usually highly parallelizable.

• With multi-processing, the once infeasible may now be feasible.

Overview of Scrabble word searching code

• for (each of three searching algorithm sections)

• Adjacent words, intersecting words, and multiply intersecting words.

• for (each column)

• for (each row)

• find(all potential words):

• May be done by the CPU or the GPU; this is the core of the algorithm.

• … or both simultaneously! (Coming slide.)

• Critical section: aggregate the potential word list.

• Important milestone: possibilities greatly reduced.

• (Continued on next slide…)

Overview of Scrabble word searching code (cont’d)

• for (each potential word found)

• We now need to determine not just that it is valid, but that it is insertable.

for (each position of the word)

• Try inserting it.

It might be insertable at multiple positions.

Performed by the CPU.

for (each insertable word)

• Cross-check "inadvertent" perpendicular words.

• Important milestone: the word has passed all tests.

Perform the word-value computations.

• Critical section: add the word to the aggregate list of valid found words, discarding the previously mentioned duplicates. (A code sketch of this whole structure follows.)
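A condensed C++/OpenMP sketch of the loop structure above. This is structural only: Board, Rack, BOARD_SIZE, findPotentialWords, isInsertable, crossCheck, wordValue, and addUnique are hypothetical stand-ins for the real machinery.

#include <omp.h>
#include <string>
#include <vector>

struct Board; struct Rack;   // hypothetical game-state types
std::vector<std::string> findPotentialWords(const Board&, const Rack&, int row, int col);
bool isInsertable(const Board&, const std::string& w, int row, int col, int pos);
bool crossCheck(const Board&, const std::string& w, int row, int col, int pos);
int  wordValue(const Board&, const std::string& w, int row, int col, int pos);
void addUnique(std::vector<std::string>& all, const std::string& w, int score);

const int BOARD_SIZE = 15;
std::vector<std::string> validWords;              // aggregate result list

void searchSection(const Board& board, const Rack& rack)
{
    #pragma omp parallel for
    for (int col = 0; col < BOARD_SIZE; ++col) {
        for (int row = 0; row < BOARD_SIZE; ++row) {
            // Core of the algorithm: scan the lexicon for potential words.
            std::vector<std::string> potential =
                findPotentialWords(board, rack, row, col);

            for (const std::string& w : potential)           // each potential word
                for (int pos = 0; pos < (int)w.size(); ++pos) {  // each position
                    if (!isInsertable(board, w, row, col, pos)) continue;
                    if (!crossCheck(board, w, row, col, pos)) continue;
                    int score = wordValue(board, w, row, col, pos);
                    #pragma omp critical   // aggregate; discard duplicates inside
                    addUnique(validWords, w, score);
                }
        }
    }
}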

Focus of our research

• Hypotheses:

• Parallelizing at the high level:

Simple (OpenMP).

Near-linear performance improvements.

• Parallelizing at the low level:

Complex (CUDA).

Potentially far greater performance improvements.

• Question:

• Why not do both?

First step – Restructuring of code

Minimal restructuring was needed at the high level.

The code was designed to handle one run per execution.

• When an invalid word is played, that word effectively becomes part of the lexicon. Why?

• Cross-checking of potential words treats whatever is on the board as valid!

• So the word must be dynamically added to the lexicon!

• As a result, future queries are contaminated by past queries.

CPU Parallelization

i7-920 (4 cores / 8 threads @ 2.66GHz)

[Chart: CPU Parallelism - Speedup; speedup (0–8) vs. number of CPU threads (0–12); series: Ideal, 64-bit, 32-bit]

[Chart: CPU Parallelism - Execution Time; execution time vs. number of CPU threads (0–12); series: 64-bit, 32-bit]

Migration from OpenMP to CUDA

• Three core searching functions that find words containing…

• Core 1: input tiles [1%]

• Core 2: input tiles and a letter on the board [17%]

• Core 3: input tiles and multiple letters on the board [82%]

• C++ strings cannot easily be accessed by the GPU:

• typedef struct { char word[32]; } word_simple;

• Due to the SIMT architecture, branching is inefficient in a GPU kernel:

• Reduce branching in the core searching functions.

• Count the occurrences of a–z (and _) in the tiles in advance.

• Sort the words of the lexicon by ascending number of letters, so that

• all threads in a warp have similar loads.

• Locking is not easily handled in CUDA:

• Use a bit vector (0 = rejected, 1 = accepted):

• With no contention, no lock is needed. (A sketch of these decisions follows.)
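A minimal sketch of these data-layout decisions; word_simple is taken from the slide, while convertLexicon, countTiles, and the bit-vector naming are illustrative.

#include <cstring>
#include <string>
#include <vector>

typedef struct { char word[32]; } word_simple;    // GPU-friendly fixed-size word

// Convert the C++ string lexicon once, at initialization.
std::vector<word_simple> convertLexicon(const std::vector<std::string>& lexicon)
{
    std::vector<word_simple> out(lexicon.size());
    for (size_t i = 0; i < lexicon.size(); ++i) {
        std::memset(out[i].word, 0, sizeof out[i].word);
        std::strncpy(out[i].word, lexicon[i].c_str(), sizeof out[i].word - 1);
    }
    return out;
}

// Count a-z (and '_', the blank) in the tiles in advance, so the kernel
// compares counters instead of branching over the tile string.
void countTiles(const std::string& tiles, char counts[27])
{
    std::memset(counts, 0, 27);
    for (char c : tiles)
        ++counts[c == '_' ? 26 : c - 'a'];
}

// On the GPU, thread i writes only accepted[i] (0 = rejected, 1 = accepted),
// so threads never touch the same element and no lock is needed.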

CUDA Parallelization --- First Attempt

• Initialization (once per game):

• Allocate global and constant memory on GPU

• Convert C++ strings to static storage

• Copy the converted words to GPU memory

• Searching (4 stages):

• Preparation

• Copy the tile counters (and wildcard string) into GPU constant memory

• Kernel launch

• Convert CPU searching functions into GPU functions

• __global__ void findWords(…)

• __device__ bool hasTiles_GPU(…)

• __device__ bool wildcardMatch_GPU(…)

• Replace the OpenMP loop with a GPU kernel call

• findWords<<<nblocks, blocksize>>>(…);

• Copy the bit vector back

• cudaMemcpy(…, cudaMemcpyDeviceToHost);

• Generate the resultant vector. (A skeleton of the whole first attempt follows.)
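A hedged skeleton of this first attempt. The slides elide the real parameter lists ("…"), and those stay elided; the parameters, block size, and buffer names below (d_lexicon, d_accepted, h_accepted, allocated during initialization) are assumptions for illustration.

__constant__ char tileCounts_GPU[27];     // set during the preparation stage

__device__ bool hasTiles_GPU(const char* word, const char* counts);
__device__ bool wildcardMatch_GPU(const char* word, const char* pattern);

__global__ void findWords(const word_simple* lexicon, int nWords, char* accepted)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nWords) return;                      // guard the tail block

    // One thread judges one word; the verdict lands in the bit vector.
    accepted[i] = hasTiles_GPU(lexicon[i].word, tileCounts_GPU) ? 1 : 0;
}

// Host side: launch the kernel, then copy the bit vector back.
int blocksize = 256;                              // assumed block size
int nblocks = (nWords + blocksize - 1) / blocksize;
findWords<<<nblocks, blocksize>>>(d_lexicon, nWords, d_accepted);
cudaMemcpy(h_accepted, d_accepted, nWords, cudaMemcpyDeviceToHost);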

CUDA Parallelization --- Second Thoughts

• Pinned memory allocation

• Accelerates memory transfers between CPU and GPU.

• Synchronous copies become more efficient.

• Asynchronous copies become possible (they require pinned memory).

• Heterogeneous GPU/CPU parallelization

• Assign the major part of the data set to the GPU.

• Launch the GPU kernel asynchronously.

• Assign the minor part of the data set to the CPU.

• Handled by the original OpenMP parallelization.

• Reduces the cost of the GPU kernel, the memory transfer, and the result-vector generation. (Pinned-allocation sketch below.)
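A minimal sketch of the pinned allocation for the bit vector (buffer names assumed); pinned memory both speeds up ordinary copies and is a prerequisite for cudaMemcpyAsync.

char* h_accepted = 0;                     // page-locked host bit vector
cudaMallocHost((void**)&h_accepted, nWords);

char* d_accepted = 0;                     // device bit vector
cudaMalloc((void**)&d_accepted, nWords);

// The GPU judges the major share, lexicon[0 .. gpuShare); the original
// OpenMP loop judges the minor share, lexicon[gpuShare .. nWords).

// ... searches run here ...

cudaFreeHost(h_accepted);
cudaFree(d_accepted);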

Hide Latency

• Cost of the four stages (Core 2, repeated 10,000 times):

Preparation ~0.1 s

Kernel launch ~2.6 s

Memory copy back ~0.9 s

Post-generation ~0.7 s

• The latency of cudaMemcpy is comparable to the kernel itself.

• Hide the latency with asynchronous parallelization between GPU and CPU:

findWords<<<nblocks, blocksize>>>(…);   // returns immediately
cudaMemcpyAsync(…);                     // returns immediately

// CPU operations (OpenMP loop on the minor part of the data set)
…

cudaThreadSynchronize();                // wait for the GPU work to finish

• After asynchronous parallelization (~30% improvement):

Preparation ~0.1 s

Kernel launch + memory copy back ~2.4 s

Post-generation ~0.5 s

Minimize Setup cost

• Core 3 preparation cost (10,000 times): ~0.2 s

• Transfer the tile counters.

• Transfer the wildcard string.

• "cudaMemcpyToSymbol" is the major cost of preparation.

• Group the constant variables together into a struct:

typedef struct {
    int  count_GPU[7];       // = 28 chars
    char wildCards_GPU[32];
} grouped;

• Declare a single __constant__ grouped variable.

• One "cudaMemcpyToSymbol" per preparation. (Sketch below.)

• No overhead in combining the variables.

• Much faster.

• Using the grouped variable, preparation cost ~0.1 s (a ~50% improvement).
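A sketch of the single-copy preparation; the struct is from the slide, while the variable names are assumed.

__constant__ grouped params_GPU;          // one constant block per search

// Host side, once per preparation: fill a host copy, then one transfer.
grouped h_params;
// ... fill h_params.count_GPU and h_params.wildCards_GPU ...
cudaMemcpyToSymbol(params_GPU, &h_params, sizeof(grouped));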

Minimize CUDA kernel cost

• Kernel + memcpy cost (10,000 times):

Core 2 ~2.4 s

Core 3 ~5.9 s

• Word finding:

--ptxas-options=-v shows GPU register and memory utilization.

The per-thread tile counters were in local GPU memory (off-chip and not cached).

Use on-chip shared memory (__shared__) for fast access (see the sketch after this slide).

Hard-code the assignment of the counters as 7 integers instead of 28 chars.

• Wildcard matching:

Avoid nested loops and multiple conditional statements.

Use a much-simplified algorithm, specially designed for *[pattern]*.

• After optimization (~40% improvement):

Core 2 ~1.5 s

Core 3 ~3.7 s
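A sketch of the shared-memory fix inside the kernel, building on the earlier skeletons (params_GPU, word_simple, hasTiles_GPU); the copy is hard-coded as 7 ints rather than 28 chars, as described above.

__global__ void findWords(const word_simple* lexicon, int nWords, char* accepted)
{
    // Per-block copy of the tile counters into fast on-chip shared memory
    // (the profiler showed them sitting in slow, uncached local memory).
    __shared__ int counts_sh[7];
    if (threadIdx.x < 7)
        counts_sh[threadIdx.x] = params_GPU.count_GPU[threadIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nWords) return;
    accepted[i] = hasTiles_GPU(lexicon[i].word, (const char*)counts_sh) ? 1 : 0;
}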

Minimize Post-generation Cost

• Post-generation cost (10,000 times):

Core 2 ~0.5 s

Core 3 ~0.58 s

• For the bit vector returned from the GPU, use multiple CPU threads to generate the result vector.

• Locking? Or a per-thread local vector plus a critical section?

• The choice depends on the amount of contention:

For low contention (Core 3), use locking (see the sketch below).

For high contention, store each thread's results in a local vector and gather them in a critical section.

• After proper parallelization (~30% improvement):

Core 2 ~0.36 s

Core 3 ~0.38 s
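A sketch of the locking variant for the low-contention case (function and buffer names illustrative).

#include <omp.h>
#include <vector>

// Turn the GPU's bit vector into a vector of accepted word indices.
std::vector<int> generateResults(const char* accepted, int nWords)
{
    std::vector<int> results;
    omp_lock_t lock;
    omp_init_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < nWords; ++i) {
        if (!accepted[i]) continue;       // most words were rejected
        omp_set_lock(&lock);              // cheap when contention is low
        results.push_back(i);
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    return results;
}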

CPU vs GPU on Pup cluster

E6850 (2 cores @ 3GHz) / NVidia GTX 260 (216 SP @ 1.24GHz)

[Chart: CPU vs GPU Speedup (0–7) per core searching function (Core 1, Core 2, Core 3) and in total; series: CPU - serial, CPU - 2 threads, GPU + 1 thread CPU, GPU + 2 threads CPU]

GPU Parallelization on Pup cluster

E6850 (2 cores @ 3GHz) / NVidia GTX 260 (216 SP @ 1.24GHz)

[Chart: GPU + 1 Thread CPU - Speedup (0.4–1.2) vs. % GPU work load (60%–100%); series: Core 1, Core 2, Core 3]

[Chart: GPU + 2 Threads CPU - Speedup (0.6–1.4) vs. % GPU work load (60%–100%); series: Core 1, Core 2, Core 3]

Conclusion

• The characteristics of Scrabble word searching make efficient GPU parallelization hard:

Only integer (or char) operations, no floating-point operations.

A lot of branching and little coalesced memory access.

A high communication-to-computation ratio.

• Design of the CUDA parallelization:

Asynchronous GPU/CPU parallelization can hide memory-copy latency.

One large transfer is more efficient than many small transfers.

On-chip shared memory is much faster than off-chip local memory.

Choosing between locking and per-thread local vectors depends on the amount of contention.

• Future Work:

Further hide latency using multiple streams on a single GPU.

Multi-GPU parallelization.

GPU parallelization at other levels.
