Knight's Tours - Intel® Developer Zone

Knight's Tour
Phillip Palk
philpalk@mentalmonsters.com
September 24, 2009
Introduction
The 2009 Intel Threading Challenge is a competition in which coders implement high-performance
multi-threaded solutions to a series of problems. The second problem of phase 2 this year is to
identify all valid incomplete ‘Knight’s Tours’ of a fixed length around a chess board.
My solution has been implemented using standard C++ with OpenMP 2.0 extensions to enable
threading. Microsoft Visual Studio 2008 SP1 was used for development; however, the code should
compile with little or no modification using any standards-compliant compiler (even if the compiler
doesn’t support OpenMP, although you won’t get multi-threading without it).
Problem Description
The problem, as stated on Intel’s website (http://software.intel.com/en-us/contests/ThreadingChallenge-2009/codecontest.php), is as follows:
The Knight's Tour problem uses a single chess knight on a chessboard and attempts
to visit each square on the board once and only once using the standard knight
move (two squares right or left, and from there one square up or down, or, two
squares up or down, and from there one square right or left). For travel-minded
knights a tour of the entire board can be enlightening. However, in these tough
economic times, a knight may not be able to afford travelling to every square. Thus,
a knight with some vacation time would like to determine how many different tours
of a specific length could be taken from a given starting square and end in the same
starting square (a closed tour).
Write a threaded code to calculate the total number of possible closed tours of a
chessboard for a chess knight starting at a given square and that visit a set number
of squares on the board. The names of the input and output files to be used will be
given on the command line. The input file will detail the size and shape of the
chessboard to be toured, the starting square of the knight, the length of the tour,
and the number of tours to be fully printed in the output file. After execution of the
application, the output file should contain the requested number of tours (listing of
chessboard squares to visit) and a summary line with the total number of tours
possible from the conditions given in the input file.
Input File Format:
The input file will contain five lines with the following format:
1. An integer specifying the number of columns (files) on the chessboard
2. An integer specifying the number of rows (ranks) on the chessboard
3. The starting square for the knight in algebraic notation (single lower case
letter specifying column and integer corresponding to the row)
4. An integer specifying the number of squares to visit in the tour
5. An integer specifying the number of full tours to print in the output file
Output File Format:
Each full tour printed will be a list of board squares in algebraic notation, one square
per line, starting and ending with the square given in the input file. If the tour length
given is 8, then a printed tour will have 9 squares listed (the start square and the 8
destination squares with the start square as the last destination) on 9 lines. After
each printed tour, a blank line or other divider line should be printed. After the
requested number of tours is printed, a summary line that gives the total number of
tours that qualify should be printed.
If a tour with the given specifications cannot be done, the output file should contain
a simple summary line that notes this fact.
Timing:
The total time for execution of the Knight's Tour application will be used for scoring.
(For most accurate timing results, submission codes would include timing code to
measure and print this time to stdout, otherwise an external stopwatch will be
used.)
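To make the input format concrete, here is a minimal sketch of reading the five input lines and converting between algebraic notation and zero-based coordinates. It is not taken from my submission: the `TourSpec` structure, its field names and the two functions are hypothetical, and it assumes a C++11 compiler (for `std::to_string`) and a board of at most 26 columns.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical container for the five input lines (names are illustrative).
struct TourSpec {
    int columns = 0;       // line 1: number of columns (files)
    int rows = 0;          // line 2: number of rows (ranks)
    int startColumn = 0;   // line 3: zero-based column, 'a' -> 0
    int startRow = 0;      // line 3: zero-based row, rank 1 -> 0
    int tourLength = 0;    // line 4: number of squares to visit
    int toursToPrint = 0;  // line 5: number of full tours to print
};

// Reads the five-line format. Returns false if the input is malformed or
// the starting square does not lie on the board.
bool readTourSpec(std::istream& in, TourSpec& spec) {
    std::string square;
    if (!(in >> spec.columns >> spec.rows >> square >>
          spec.tourLength >> spec.toursToPrint)) {
        return false;
    }
    if (square.size() < 2 || square[0] < 'a' || square[0] > 'z') {
        return false;
    }
    spec.startColumn = square[0] - 'a';
    std::istringstream rankText(square.substr(1));
    int rank = 0;
    if (!(rankText >> rank)) {
        return false;
    }
    spec.startRow = rank - 1;
    return spec.startColumn < spec.columns &&
           spec.startRow >= 0 && spec.startRow < spec.rows;
}

// Formats a zero-based square back into algebraic notation, e.g. (4, 5) -> "e6".
std::string squareName(int column, int row) {
    return std::string(1, static_cast<char>('a' + column)) + std::to_string(row + 1);
}
```

Given a `std::ifstream` opened on the input file named on the command line, `readTourSpec` fills in the specification and `squareName` can be used when printing tours.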
Brute Force Serial Solution
To gain familiarity with the problem my first attempt was to write a brute force solver that evaluated
every possible path the knight could take from its starting position. A basic recursive approach was
taken that consisted of the following steps:
1. Set the starting position and an empty list of valid tours
2. If the current recursion depth is equal to the length of the knight’s tour:
   a. If the current position is the same as the starting position, add the current tour to
      the list of valid tours
   b. Return
3. For each move a knight can make:
   a. Add the move to the current position (e.g. if the current position is d4 and the move is
      (1,2), the new position is e6)
   b. If the new position falls outside of the board:
      i. Skip this move and continue with the next one
   c. If the new position is a repeat of a previous position in the current tour (other than
      a return to the starting position on the final move):
      i. Skip this move and continue with the next one
   d. Recursively call (2) with the new position as the updated current position
4. Return
While this approach worked, as expected it also had horrible performance in both time and
memory consumption (thanks to it storing every possible move when it was found). But
now that I had a better understanding of the mechanics of the problem I was able to start
considering more optimal solutions.
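The brute force steps can be sketched roughly as follows. This is an illustration rather than my original code: the `Board` structure and the function names are hypothetical, it assumes C++11, and it only counts tours instead of storing them (which avoids the memory problem mentioned above). Note that it counts a tour and its reverse as two separate tours.

```cpp
#include <cstdint>
#include <vector>

// Illustrative board description (hypothetical name).
struct Board {
    int columns;
    int rows;
};

namespace {

// The eight knight moves as (column, row) offsets.
const int kMoves[8][2] = {{1, 2}, {2, 1}, {2, -1}, {1, -2},
                          {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}};

// Counts the ways to finish a closed tour from (col, row) using exactly
// 'movesLeft' further moves without revisiting any square.
std::uint64_t search(const Board& board, int col, int row, int startCol, int startRow,
                     int movesLeft, std::vector<char>& visited) {
    std::uint64_t count = 0;
    for (const auto& move : kMoves) {
        const int c = col + move[0];
        const int r = row + move[1];
        if (c < 0 || c >= board.columns || r < 0 || r >= board.rows) {
            continue;  // off the board: skip this move
        }
        if (movesLeft == 1) {
            if (c == startCol && r == startRow) {
                ++count;  // the final move must land on the starting square
            }
            continue;
        }
        const int index = r * board.columns + c;
        if (visited[index]) {
            continue;  // already on the current path: skip this move
        }
        visited[index] = 1;
        count += search(board, c, r, startCol, startRow, movesLeft - 1, visited);
        visited[index] = 0;  // backtrack
    }
    return count;
}

}  // namespace

// Counts closed tours of 'tourLength' moves from the given zero-based square.
std::uint64_t countClosedToursBruteForce(const Board& board, int startCol, int startRow,
                                         int tourLength) {
    if (tourLength < 2) {
        return 0;
    }
    std::vector<char> visited(board.columns * board.rows, 0);
    visited[startRow * board.columns + startCol] = 1;
    return search(board, startCol, startRow, startCol, startRow, tourLength, visited);
}
```

For example, on an 8x8 board starting in the corner a1 (column 0, row 0) there is a single 4-move loop (a1, b3, d4, c2, a1), which this sketch counts twice, once in each direction.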
Optimised Serial Solution
When considering how to improve the performance of the brute force serial solution it became clear
that a considerable amount of duplicate work was being performed. My first thought was that I
could take advantage of the fact that the set of all possible paths of length N between any 2 squares
is identical to the set of all possible paths of length N between any 2 other squares that are the same
distance apart in both rank and file (ignoring the case where paths traverse outside of the board area). I
investigated the possibility of constructing a cache of these paths that could then be used to
discover all valid Knight’s Tours, but unfortunately this proved unsuccessful.
Although unsuccessful, the attempted caching solution did lead me to a realisation about the nature
of the problem. Every possible path of length N from a single board position back to the same
position can be described entirely using 2 leaf nodes of a tree of all paths of length N/2 from the
starting node. Given the exponential increase in the number of moves as the path length increases, this
dramatically reduces the number of moves that need to be considered. This can be illustrated as
follows:
A
+-- B
|   +-- E
|   +-- F
|   +-- G
+-- C
|   +-- H
|   +-- I
|   +-- J
+-- D
    +-- K
    +-- L
    +-- M
In the above diagram, each child represents a move from the parent position rather than a board
position (for example, nodes A and K might fall on the same board position). Consider a piece
that has a total of 3 valid moves from its current position (3 is chosen purely so that it fits on
the page easily!). To obtain a tour of length 4 using a tree of depth 2 (not counting the root), start at
node A (the starting position), choose the first valid move to node B, and then choose the third valid
move to node G. This gives what I shall refer to herein as a ‘half-tour’. Given the tree of all valid
half-tours, this particular half-tour can be represented by the single leaf node G. To return to the
starting position, select another leaf node that falls on the same board position as the first node and
follow its parents back up to the root. Thus, given the tree of valid half-tours, a tour can be
completely described using just 2 leaf nodes (in the case of the above diagram, nodes G and L). The set
of all possible paths of length N from a board position back to itself is therefore the set of all
combinations of 2 leaf nodes that fall on the same board position.
This now gives every possible path of length N, but it doesn’t yet satisfy the requirement that no
board position is visited twice in a single tour (excluding the starting position). To meet this
requirement several simple rules need to be adhered to when selecting the node that represents the
second half-tour:
1. Given a node that describes a half-tour, the node selected to describe the second half-tour
   must fall on the same board position as the first node.
2. Given a node that describes a half-tour, the node selected to describe the second half-tour
   cannot share a parent other than the root node. For example, given node G for the first
   half-tour, nodes E and F cannot be selected for the second half-tour.
3. No parent of the second node may fall on a board position that is also traversed by any
   parent of the first node, excluding the root node. For example, if node C were to fall on the
   same board position as node B (which is impossible in the above graph given unique moves,
   but let’s ignore that...) then nodes H, I and J could not be selected.
We now have the means to discover all possible tours of length N. The algorithm for constructing
the half-tour tree is as follows (and is very similar to the original brute force algorithm):
1. Create the root node of the half-tour tree, setting its board position to the starting position.
2. If the current recursion depth is equal to half the length of the knight’s tour:
   a. Return (without modifying the half-tour tree)
3. For each move a knight can make:
   a. Add the move to the position of the current node (e.g. if the current position is d4 and
      the move is (1,2), the new position is e6)
   b. If the new position falls outside of the board:
      i. Skip this move and continue with the next one
   c. If the new position is a repeat of a previous position in the current half-tour:
      i. Skip this move and continue with the next one
   d. Insert a new child node into the current node after the last existing child. The new
      node’s board position is set to the position calculated at (3a)
   e. Recursively call (2) with the new child node as the current node
4. Return
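The tree construction above can be sketched as follows. This is a simplified illustration, not my original data structure: the `HalfTourNode` and `HalfTourTree` types are hypothetical, nodes are kept in a single vector and refer to their parent by index, and leaves are collected in left-to-right order as they are created. It assumes C++11.

```cpp
#include <vector>

// One node of the half-tour tree (hypothetical layout).
struct HalfTourNode {
    int square;  // board position as row * columns + column
    int parent;  // index of the parent node, -1 for the root
    int depth;   // number of moves from the root
};

struct HalfTourTree {
    std::vector<HalfTourNode> nodes;  // nodes[0] is the root
    std::vector<int> leaves;          // indices of nodes at full half-tour depth
};

namespace {

const int kMoves[8][2] = {{1, 2}, {2, 1}, {2, -1}, {1, -2},
                          {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}};

void expand(HalfTourTree& tree, int current, int columns, int rows, int halfLength,
            std::vector<char>& visited) {
    // Copy the node: push_back below may reallocate the vector.
    const HalfTourNode node = tree.nodes[current];
    if (node.depth == halfLength) {
        tree.leaves.push_back(current);
        return;
    }
    const int col = node.square % columns;
    const int row = node.square / columns;
    for (const auto& move : kMoves) {
        const int c = col + move[0];
        const int r = row + move[1];
        if (c < 0 || c >= columns || r < 0 || r >= rows) {
            continue;  // off the board
        }
        const int square = r * columns + c;
        if (visited[square]) {
            continue;  // already on the current half-tour
        }
        tree.nodes.push_back({square, current, node.depth + 1});
        const int child = static_cast<int>(tree.nodes.size()) - 1;
        visited[square] = 1;
        expand(tree, child, columns, rows, halfLength, visited);
        visited[square] = 0;  // backtrack
    }
}

}  // namespace

// Builds the tree of all half-tours of 'halfLength' moves from the start square.
HalfTourTree buildHalfTourTree(int columns, int rows, int startCol, int startRow,
                               int halfLength) {
    HalfTourTree tree;
    const int start = startRow * columns + startCol;
    tree.nodes.push_back({start, -1, 0});
    std::vector<char> visited(columns * rows, 0);
    visited[start] = 1;
    expand(tree, 0, columns, rows, halfLength, visited);
    return tree;
}
```

Following the `parent` indices from any leaf back to the root recovers the squares of that half-tour in reverse order.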
Once the half-tour tree has been constructed we can search for all combinations of leaf nodes
while adhering to the previously stated rules. The algorithm for this is as follows:
1. For each board position:
   a. For each leaf node (A) on the current board position:
      i. Find the first node at the same depth and to the right of A that doesn’t share
         a parent with A, excluding the root node (nodes to the left represent the
         reverse of already discovered tours)
      ii. For each leaf node (B) on the current board position located to the right of,
          and including, the node found in the above step:
          1. Find the shallowest parent of B that falls on the same board position
             as any parent of A
          2. If there’s no overlap, record the current pair of nodes (A,B) as a valid
             tour and continue from (ii) with the next B
          3. Otherwise, find the first node at the same depth and to the right of
             B that falls on the current board position and doesn’t share the
             parent of B that was found in (1) above, and continue from (ii)
             using this node as the next B
      iii. End of loop over B
   b. End of loop over A
2. End of loop over the board positions
3. Return all possible tours as recorded at (1.a.ii.2)
Since leaf nodes that fall on a particular board position can only be paired with other leaf nodes that
fall on the same position, a lookup table is created after constructing the half-tour tree. It is
indexed by board position and contains a vector of all leaf nodes that fall on that position. With
each leaf node a vector of all board positions traversed by its parents is also stored, and each node is
tagged with a pointer to the next leaf node for fast horizontal visiting of the nodes.
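The pairing idea can be illustrated with a deliberately simplified sketch. It is not the tree-walking search described above: instead of storing a vector of parent positions per leaf and skipping whole subtrees on overlap, it stores a 64-bit mask of the squares each half-tour passes through and tests every pair in a bucket of the lookup table for overlap. This swap limits it to boards of at most 64 squares and makes it slower than the real search, but it applies the same three rules. All names are hypothetical and C++11 is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace {

const int kMoves[8][2] = {{1, 2}, {2, 1}, {2, -1}, {1, -2},
                          {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}};

// Enumerates half-tours depth first. 'visited' holds one bit per square on the
// current path. Each finished half-tour is filed under its end square as a
// mask of the squares strictly between the start and the end.
void walk(int columns, int rows, int square, int movesLeft, std::uint64_t visited,
          std::uint64_t startBit, std::vector<std::vector<std::uint64_t>>& bySquare) {
    if (movesLeft == 0) {
        const std::uint64_t endBit = std::uint64_t(1) << square;
        bySquare[square].push_back(visited & ~startBit & ~endBit);
        return;
    }
    const int col = square % columns;
    const int row = square / columns;
    for (const auto& move : kMoves) {
        const int c = col + move[0];
        const int r = row + move[1];
        if (c < 0 || c >= columns || r < 0 || r >= rows) {
            continue;
        }
        const int next = r * columns + c;
        const std::uint64_t bit = std::uint64_t(1) << next;
        if (visited & bit) {
            continue;
        }
        walk(columns, rows, next, movesLeft - 1, visited | bit, startBit, bySquare);
    }
}

}  // namespace

// Counts closed tours, treating a tour and its reverse as the same tour.
std::uint64_t countClosedToursByPairing(int columns, int rows, int startCol, int startRow,
                                        int tourLength) {
    if (columns * rows > 64) {
        throw std::invalid_argument("this sketch needs a board of at most 64 squares");
    }
    if (tourLength < 2 || tourLength % 2 != 0) {
        return 0;  // a knight changes square colour on every move
    }
    const int start = startRow * columns + startCol;
    const std::uint64_t startBit = std::uint64_t(1) << start;
    std::vector<std::vector<std::uint64_t>> bySquare(columns * rows);
    walk(columns, rows, start, tourLength / 2, startBit, startBit, bySquare);

    std::uint64_t count = 0;
    for (const auto& bucket : bySquare) {
        for (std::size_t a = 0; a < bucket.size(); ++a) {
            for (std::size_t b = a + 1; b < bucket.size(); ++b) {  // only look "to the right"
                if ((bucket[a] & bucket[b]) == 0) {
                    ++count;  // the two half-tours share no intermediate square
                }
            }
        }
    }
    return count;
}
```

Because each pair is only considered once (B is always to the right of A), a tour and its reverse are counted as one, which is half of what a direction-sensitive brute force count would report.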
Parallel Solution
Going from the above optimised serial implementation to a parallel implementation is incredibly
straightforward. There are 3 blocks of functionality that need to be considered:
1. Construction of the half-tour tree
2. Construction of the lookup table for leaf nodes indexed by board position
3. Discovery of all valid combinations of leaf nodes on the same board position
Starting with the last block of functionality, discovery of all valid combinations of leaf nodes is
‘embarrassingly parallel’. For each leaf node on the half-tour tree a search is performed for all other
valid leaf nodes to be paired with it. These searches are entirely independent of each other and
can be performed in parallel. For example, for the relatively small problem of an 8x8 board starting
on a8 and tours of length 8, 268 searches can be performed in parallel given enough hardware
threads. For a larger tour of length 16, over 120000 searches can be performed in parallel.
Parallelising these searches was performed using OpenMP and quite literally took around 10 lines of
code. The entire algorithm is wrapped in an OpenMP parallel section so that all threads in the team
start executing the algorithm simultaneously. Each thread has a local ‘numToursLocal’ and ‘toursLocal’
for storing the total count of tours discovered on that thread and for recording tours, up to the number
requested by the input file. Each thread then iterates over all board positions (so this outer iteration
is duplicated on every thread, however the processor time this consumes is insignificant) and, in a
nested OpenMP ‘for’ section, iterates over the middle for loop (over node A) in such a way that each
iteration only occurs on a single thread (ensuring none of the real work is duplicated). Since the time
spent in each iteration of the loop can vary, dynamic scheduling was chosen (guided scheduling was
also tested but it proved to have worse performance). Finally, when each thread in the team completes
execution, the results stored in ‘numToursLocal’ and ‘toursLocal’ in each thread are combined to give
the final output.
It should be noted that there’s no guarantee about how many valid tours each thread will find. Either
the threads would need to communicate with each other to record only the requested number of tours
in total, or each thread would need to store up to the requested number of tours itself, with any
excess being trimmed when the threads’ results are combined. It turned out that the latter had the
better performance.
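The threading structure described above might look roughly like this. It is a sketch under several assumptions rather than my actual code: the lookup table is simplified to one vector of 64-bit "squares passed through" masks per board position, a recorded tour is just the position plus the two leaf indices, the inner test is a plain mask overlap check, and the `nowait` clause and the `critical` section used to merge results are my choices for this illustration. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Identifies a tour by its board position and the two leaves paired there.
struct TourRef {
    int square;
    int leafA;
    int leafB;
};

struct SearchResult {
    std::uint64_t numTours = 0;
    std::vector<TourRef> tours;  // at most 'maxToPrint' entries
};

SearchResult searchAllPairs(const std::vector<std::vector<std::uint64_t>>& bySquare,
                            std::size_t maxToPrint) {
    SearchResult total;
    const int numSquares = static_cast<int>(bySquare.size());

    #pragma omp parallel
    {
        // Per-thread results, so the hot loop needs no synchronisation.
        std::uint64_t numToursLocal = 0;
        std::vector<TourRef> toursLocal;

        // Every thread walks all board positions; only the loop over A is shared out.
        for (int square = 0; square < numSquares; ++square) {
            const std::vector<std::uint64_t>& bucket = bySquare[square];
            const int n = static_cast<int>(bucket.size());

            #pragma omp for schedule(dynamic) nowait
            for (int a = 0; a < n; ++a) {
                for (int b = a + 1; b < n; ++b) {
                    if ((bucket[a] & bucket[b]) != 0) {
                        continue;  // the half-tours overlap
                    }
                    ++numToursLocal;
                    if (toursLocal.size() < maxToPrint) {
                        toursLocal.push_back({square, a, b});
                    }
                }
            }
        }

        // Combine the per-thread results, trimming any excess recorded tours.
        #pragma omp critical
        {
            total.numTours += numToursLocal;
            for (std::size_t i = 0;
                 i < toursLocal.size() && total.tours.size() < maxToPrint; ++i) {
                total.tours.push_back(toursLocal[i]);
            }
        }
    }
    return total;
}
```

Each thread may record up to `maxToPrint` tours of its own, so no communication is needed while searching; the surplus is simply dropped during the merge.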
Unfortunately both the construction of the half-tour tree and the lookup table are poor contenders
for parallelisation. The primary reason for this is that they constantly perform random writes into
the half-tour tree and lookup table respectively, so in order to guarantee consistency these would all
need to be synchronised. Since the majority of the time spent in both of these tasks is in the writes, the
overhead of synchronisation would dwarf any gains from multi-threading. The good news though is
that these both contribute only a small amount of the time spent, and the larger the problem the less
significant these serial sections become (time spent in the search for valid half-tour pairs grows
significantly faster with problem size).
Performance Results
The following input file was used during performance testing. The tour length
(line 4) was modified during the tests and took on the values 8, 10, 12, 14, 16, 18 and 20.
8
8
a8
8
2
The test machine is an Intel Quad-Core Q6600 (over-clocked to run at 3.2GHz) with 4GB of RAM
running Windows 7 RTM 64-bit. Both 32-bit and 64-bit executables were tested and produced the
same results, however only results from the 32-bit executable are included below.
First up, here is the total number of valid tours calculated for each of the tour lengths:
Tour Length    Total Tours
8              880
10             17208
12             316124
14             5302024
16             80672058
18             1108463036
20             13694993802
The time taken to calculate each of these tour lengths was recorded for 1, 2 and 4 threads (excluding
tours of length 20, which were only calculated using 4 threads as I’m not patient enough to wait
for them to complete with any fewer threads...). Each of the three code blocks (i.e. half-tour tree
construction, lookup table construction and half-tour pair searching) was timed separately.
[Figure: time taken (axis 0 to 700) versus tour length (8 to 20) for 1, 2 and 4 threads, with separate series for tree, lookup and search]
As mentioned previously, it can be seen that as the problem size increases the time consumed
by the parallel code (in blue) dwarfs the time consumed by the serial code (in red and green). This is
also a good demonstration of how quickly the computation required grows with the tour length.
Now the same graph without the 20 length tour:
[Figure: time taken (axis 0 to 60) versus tour length (8 to 18) for 1, 2 and 4 threads, with separate series for tree, lookup and search]
Here it’s clear that there’s a significant advantage to using multiple hardware threads. Below is a
graph of just the time consumed by the serial code.
[Figure: time taken by the serial code (axis 0 to 0.7) versus tour length (8 to 18) for 1, 2 and 4 threads]
It can be seen that the number of threads has no impact on the serial code (as expected).
Finally to determine how well the implementation scales with number of hardware threads we have
a graph of speed-up versus the number of threads used (on a 4-core machine).
[Figure: speed-up (axis 0.5 to 4) versus tour length (8 to 18) for 1, 2 and 4 threads]
Here we can see the speed-up of using 2 and 4 threads vs. a single thread for each of the tour
lengths. Due to the fast execution time of tours shorter than 16, it’s clear that threading overhead
prevents full utilisation of the 4 cores. With a length of 16 and 4 threads the overhead is still
significant, but with 2 threads it can be seen that the speed-up is levelling out near the theoretical
maximum of 2 times. Similarly, with a length of 18 and 4 threads the speed-up is approaching the
theoretical maximum of 4 times. I would expect that with a length of 20 this would also level out just
under a speed-up of 4 times.
Conclusion
This problem has proven to be an excellent contender for parallelisation. Despite the section of
serial code we can approach the theoretical maximum speed-up for large problem sizes thanks to
the proportion of time spent in the serial code drastically decreasing as the problem size increases.
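This behaviour is what Amdahl's law predicts. As a rough model (the symbols here are my own notation, not measurements from the tests above), let f be the fraction of single-threaded run time spent in the parallel search and p the number of threads:

```latex
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
\qquad
\lim_{f \to 1} S(p) = p
```

For example, with a purely illustrative f = 0.99 and p = 4 this gives S = 1 / (0.01 + 0.2475) ≈ 3.88, and as the serial tree and lookup construction shrink relative to the search (f approaching 1) the speed-up approaches the thread count.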