Jonathan Wheeler
Yifan Zhou
• Given the current board & tiles at hand, find the highest-scoring word(s) at the highest-scoring position(s).
• Requires searching 173k–268k words, depending on the lexicon.
• Fundamentally a brute-force problem in nature. Why?
  • Finding adjacent words requires only one linear search.
  • Finding intersecting words greatly increases the search complexity.
    • ~26x more possibilities must be considered per intersecting word!
  • Players may also have blank tiles; effectively a “wildcard.”
    • Produces ~26x more valid results per blank tile!
• (Continued on next slide…)
• We don’t know what specific words we’re looking for.
• We only know:
  • Letters a word may not have.
  • Letters a word may have.
  • Letters a word must have.
  • Letters may be in any order.
• Result: linear and brute force in nature (see the sketch below).
  • Small performance tricks help, but this remains fundamentally true.
  • No heuristics apply (unlike chess).
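As an illustration, a minimal sketch (a hypothetical helper, not the authors’ actual code) of the kind of linear check involved: can a candidate word be built from the tiles at hand, with ‘_’ as a blank (wildcard) tile?

    // Hypothetical sketch: can `word` (lowercase a–z) be built from the
    // rack `tiles`, where '_' is a blank tile usable as any letter?
    bool canBuild(const char* word, const char* tiles) {
        int count[26] = {0};
        int blanks = 0;
        for (const char* t = tiles; *t; ++t) {
            if (*t == '_') ++blanks;
            else ++count[*t - 'a'];
        }
        for (const char* w = word; *w; ++w) {
            if (count[*w - 'a'] > 0) --count[*w - 'a'];
            else if (blanks > 0) --blanks;   // spend a blank tile
            else return false;               // this word cannot be formed
        }
        return true;
    }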
• Abundant parallelizing opportunities within.
  • Independent results at the lowest levels.
• Aggregation is difficult to parallelize.
  • Low-level results must be aggregated into higher-level results.
  • Requires further synchronization.
  • Duplicate results must be discarded; they arise from:
    1. Algorithmic nuances.
    2. Result pollution.
• Linear search problems are abundant in the real world.
• Usually highly parallelizable.
• The once infeasible may now be feasible with multi-processing.
• for (each of the three searching algorithm sections)
  • Adjacent words, intersecting words, and multiply intersecting words.
  • for (each column)
    • for (each row)
      • find(all potential words):
        • The core of the algorithm; may be done by the CPU or the GPU…
        • … or both simultaneously! (Coming slide.)
• Critical section: aggregate the potential word list.
• Important milestone: possibilities greatly reduced.
• (Continued on next slide…)
• for (each potential word found)
  • We now need to determine if it is not just valid, but insertable.
  • for (each position of the word)
    • Try inserting it.
    • Might be insertable in multiple positions.
    • Performed by the CPU.
• for (each insertable word)
  • Crosscheck “inadvertent” perpendicular words.
  • Important milestone: the word has passed all tests.
  • Perform word value computations.
• Critical section: add the word to the aggregate list of valid found words, but discard the previously mentioned duplicates. (A sketch of this loop structure follows below.)
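A rough OpenMP sketch of the loop structure above (hypothetical names; findPotentialWords is a stub standing in for the three core searching functions):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Hypothetical stub standing in for the three core searching functions.
    std::vector<std::string> findPotentialWords(int row, int col) { return {}; }

    std::vector<std::string> searchBoard() {
        std::vector<std::string> found;
        #pragma omp parallel for collapse(2)
        for (int row = 0; row < 15; ++row) {
            for (int col = 0; col < 15; ++col) {
                for (const std::string& w : findPotentialWords(row, col)) {
                    #pragma omp critical   // aggregate; discard duplicates
                    if (std::find(found.begin(), found.end(), w) == found.end())
                        found.push_back(w);
                }
            }
        }
        return found;
    }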
• Hypotheses:
  • Parallelizing at the high level:
    • Simple (OpenMP).
    • Near-linear performance improvements.
  • Parallelizing at the low level:
    • Complex (CUDA).
    • Potentially significantly greater performance improvements.
• Question: why not do both?
• When an invalid word is played, the word effectively becomes part of the lexicon. Why?
  • Cross-checking of potential words!
  • The word must be dynamically added to the lexicon!
  • Future queries are contaminated by past queries. (A minimal sketch follows below.)
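A minimal sketch of this dynamic update, assuming a set-based lexicon (the actual lexicon representation may differ):

    #include <set>
    #include <string>

    std::set<std::string> lexicon;   // duplicate-free word list

    // Once a word is played (valid or not), future cross-checks must
    // treat it as part of the lexicon.
    void onWordPlayed(const std::string& word) {
        lexicon.insert(word);        // later queries now “see” this word
    }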
[Chart: CPU Parallelism – Speedup. i7-920 (4 cores / 8 threads @ 2.66 GHz); x-axis: number of CPU threads (0–12); series: Ideal, 64-bit, 32-bit.]
[Chart: CPU Parallelism – Execution Time. Same hardware; x-axis: number of CPU threads (0–12); series: 64-bit, 32-bit.]
• Three core searching functions that find words containing …
  • Core 1: input tiles [1%]
  • Core 2: input tiles and a letter on the board [17%]
  • Core 3: input tiles and multiple letters on the board [82%]
• C++ strings cannot easily be accessed by the GPU:
  • typedef struct { char word[32]; } word_simple;
• Due to the SIMT architecture, branching is inefficient in a GPU kernel:
  • Reduce branching in the core searching functions.
  • Count the numbers of a to z (and _) in the tiles in advance.
  • Sort the words in the lexicon by ascending number of letters, so all threads in a warp have similar loads.
• Locking is not easily handled in CUDA:
  • Use a bit vector (0 – rejected, 1 – accepted).
  • Each thread writes only its own slot: no contention, so no lock is needed. (See the kernel sketch below.)
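A hypothetical kernel sketch of the lock-free bit-vector idea (testWord is a stand-in for the real per-word checks such as hasTiles_GPU):

    #include <cuda_runtime.h>

    typedef struct { char word[32]; } word_simple;

    // Stand-in for the real checks (hasTiles_GPU, wildcardMatch_GPU).
    __device__ bool testWord(const word_simple& w) { return w.word[0] != '\0'; }

    // One thread per lexicon word; each thread writes only its own slot
    // of `accepted` (1 – accepted, 0 – rejected), so no lock is needed.
    __global__ void markWords(const word_simple* words, int nWords,
                              unsigned char* accepted) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nWords) accepted[i] = testWord(words[i]) ? 1 : 0;
    }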
• Initialization (once per game):
  • Allocate global and constant memory on the GPU.
  • Convert the C++ strings to static storage.
  • Copy the converted words to GPU memory.
• Searching (4 stages):
  • Preparation
    • Copy the tile counters (and wildcard string) into GPU constant memory.
  • Kernel launch
    • Convert the CPU searching functions into GPU functions:
        __global__ void findWords (…)
        __device__ bool hasTiles_GPU (…)
        __device__ bool wildcardMatch_GPU (…)
    • Replace the OpenMP loop with a GPU kernel function call:
        findWords<<<nblocks, blocksize>>>(…);
  • Copy the bit vector back
    • cudaMemcpy(…, cudaMemcpyDeviceToHost);
  • Generate the resultant vector
• Pinned memory allocation:
  • Accelerates memory transfers between CPU and GPU.
  • Synchronous copy can be more efficient.
  • Asynchronous copy is supported. (See the allocation sketch below.)
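A minimal sketch of pinned (page-locked) allocation for the result buffers (hypothetical names):

    #include <cuda_runtime.h>

    unsigned char* h_accepted = nullptr;   // pinned host buffer
    unsigned char* d_accepted = nullptr;   // device buffer

    // Pinned host memory speeds up CPU<->GPU transfers and is required
    // for truly asynchronous copies (cudaMemcpyAsync).
    void allocateResultBuffers(int nWords) {
        cudaMallocHost(&h_accepted, nWords);
        cudaMalloc(&d_accepted, nWords);
    }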
• Heterogeneous GPU/CPU parallelization:
  • Assign the major part of the data set to the GPU.
    • Asynchronous GPU kernel launch.
  • Assign the minor part of the data set to the CPU.
    • Original OpenMP parallelization.
  • Reduced cost in GPU kernel, memory transfer, and result vector generation.
• Cost in four stages (Core 2, repeated 10,000 times):
  • Preparation ~ 0.1 s
  • Kernel launch ~ 2.6 s
  • Memory copy back ~ 0.9 s
  • Post generation ~ 0.7 s
• The latency of cudaMemcpy is comparable to the kernel itself.
• Hide the latency with asynchronous parallelization between GPU and CPU:

    findWords<<<nblocks, blocksize>>>(…);   // returns immediately
    cudaMemcpyAsync(…);                     // returns immediately
    …
    // CPU operations (OpenMP loop on the minor part of the data set)
    …
    cudaThreadSynchronize();                // wait for the GPU to finish
• After asynchronous parallelization (~30% improvement):
  • Preparation ~ 0.1 s
  • Kernel launch + memory copy back ~ 2.4 s
  • Post generation ~ 0.5 s
• Core 3 preparation cost (10,000 times) ~ 0.2 s:
  • Transfer the tile counters.
  • Transfer the wildcard string.
  • cudaMemcpyToSymbol is the major cost of the preparation.
• Group the constant variables together into a struct:

    typedef struct {
        int  count_GPU[7];        // = 28 chars
        char wildCards_GPU[32];
    } grouped;

• Declare a single __constant__ grouped variable.
• One cudaMemcpyToSymbol per preparation.
• No overhead in combining the variables; much faster. (See the sketch below.)
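A minimal sketch of the single-copy preparation, assuming the grouped struct above (hypothetical host-side names):

    #include <cuda_runtime.h>

    typedef struct {
        int  count_GPU[7];         // 28 tile counters packed as 7 ints
        char wildCards_GPU[32];
    } grouped;

    __constant__ grouped params;   // single constant-memory variable

    // One cudaMemcpyToSymbol transfers both the tile counters and the
    // wildcard string, instead of one call per variable.
    void prepare(const grouped& h_params) {
        cudaMemcpyToSymbol(params, &h_params, sizeof(grouped));
    }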
• Using the grouped variable, preparation cost ~ 0.1 s (~50% improvement).
• Kernel + memcpy cost (10,000 times):
  • Core 2 ~ 2.4 s
  • Core 3 ~ 5.9 s
• Word finding:
  • --ptxas-options=-v shows GPU register and memory utilization.
  • The per-thread tile counters live in local GPU memory (off-chip and not cached).
  • Use on-chip shared memory (__shared__) for fast access.
  • Hardcode the assignment of the counters as 7 integers instead of 28 chars.
• Wildcard matching:
  • Avoid nested loops and multiple conditional statements.
  • Use a much simplified algorithm, specially designed for *[pattern]*. (See the sketch below.)
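For illustration, a hypothetical matcher for *[pattern]*: because the leading and trailing * are implicit, matching reduces to a substring scan, written here as a single loop to minimize branching:

    // Returns true if `pattern` occurs anywhere in `word` (the implicit
    // leading/trailing '*' of *[pattern]* make this a substring test).
    __device__ bool wildcardMatch_GPU(const char* word, const char* pattern) {
        int i = 0, j = 0;                  // i scans word, j scans pattern
        while (word[i + j] != '\0') {
            if (word[i + j] == pattern[j]) {
                ++j;                       // extend the current match
                if (pattern[j] == '\0') return true;
            } else {
                ++i;                       // restart one position later
                j = 0;
            }
        }
        return pattern[j] == '\0';
    }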
• After optimization (~40% improvement):
  • Core 2 ~ 1.5 s
  • Core 3 ~ 3.7 s
• Post-generation cost (10,000 times):
  • Core 2 ~ 0.5 s
  • Core 3 ~ 0.58 s
• For the bit vector returned from the GPU, use multiple CPU threads to generate the result vector.
  • Locking, or a local vector plus a critical section?
  • It depends on the amount of contention:
    • For low contention (core 3), use locking.
    • For high contention, store the results of each thread in a local vector and gather them in a critical section. (See the sketch below.)
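A minimal OpenMP sketch of the local-vector strategy (hypothetical names; `accepted` is the bit vector copied back from the GPU):

    #include <string>
    #include <vector>

    std::vector<std::string> generateResults(const unsigned char* accepted,
                                             const std::vector<std::string>& lexicon) {
        std::vector<std::string> results;
        #pragma omp parallel
        {
            std::vector<std::string> local;      // per-thread hits
            #pragma omp for nowait
            for (int i = 0; i < (int)lexicon.size(); ++i)
                if (accepted[i]) local.push_back(lexicon[i]);
            #pragma omp critical                 // one gather per thread
            results.insert(results.end(), local.begin(), local.end());
        }
        return results;
    }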
• After proper parallelization (~30% improvement):
  • Core 2 ~ 0.36 s
  • Core 3 ~ 0.38 s
[Chart: CPU vs GPU Speedup. E6850 (2 cores @ 3 GHz) / NVidia GTX 260 (216 SP @ 1.24 GHz); x-axis: core searching function (Core 1, Core 2, Core 3, Total); series: CPU – serial, CPU – 2 threads, GPU + 1 thread CPU, GPU + 2 threads CPU.]
[Charts: GPU + 1 Thread CPU – Speedup and GPU + 2 Threads CPU – Speedup. Same hardware; x-axis: % GPU work load (60–100%); series: Core 1, Core 2, Core 3.]
• The characteristics of Scrabble word searching make efficient GPU parallelization hard:
  • Only integer (or char) operations; no floating-point operations.
  • A lot of branching and little coalesced memory access.
  • High communication-to-computation ratio.
• Design of the CUDA parallelization:
  • Asynchronous GPU/CPU parallelization may reduce memory copy latency.
  • A large transfer is more efficient than many small transfers.
  • On-chip shared memory is much faster than off-chip local memory.
  • Locking vs. a local variable depends on the amount of contention.
• Future work:
  • Further hide latency using multiple streams on a single GPU.
  • Multi-GPU parallelization.
  • GPU parallelization on other levels.