Knights Tour

Phillip Palk
philpalk@mentalmonsters.com
September 24, 2009

Introduction

The 2009 Intel Threading Challenge is a competition in which coders implement high-performance multi-threaded solutions to a series of problems. The second problem of phase 2 of this year's challenge is to identify all valid incomplete ‘Knight’s Tours’ of a fixed length around a chess board.

My solution is implemented in standard C++ with OpenMP 2.0 extensions to enable threading. Microsoft Visual Studio 2008 SP1 was used for development; however, the code should compile with little or no modification using any standards-compliant compiler (even one that doesn’t support OpenMP, although you won’t get multi-threading without it).

Problem Description

The problem, as stated on Intel’s website (http://software.intel.com/en-us/contests/ThreadingChallenge-2009/codecontest.php), is as follows:

The Knight's Tour problem uses a single chess knight on a chessboard and attempts to visit each square on the board once and only once using the standard knight move (two squares right or left, and from there one square up or down, or, two squares up or down, and from there one square right or left). For travel-minded knights a tour of the entire board can be enlightening. However, in these tough economic times, a knight may not be able to afford travelling to every square. Thus, a knight with some vacation time would like to determine how many different tours of a specific length could be taken from a given starting square and end in the same starting square (a closed tour).

Write a threaded code to calculate the total number of possible closed tours of a chessboard for a chess knight starting at a given square and that visit a set number of squares on the board. The names of the input and output files to be used will be given on the command line.
The input file will detail the size and shape of the chessboard to be toured, the starting square of the knight, the length of the tour, and the number of tours to be fully printed in the output file. After execution of the application, the output file should contain the requested number of tours (listing of chessboard squares to visit) and a summary line with the total number of tours possible from the conditions given in the input file.

Input File Format: The input file will contain five lines with the following format:

1. An integer specifying the number of columns (files) on the chessboard
2. An integer specifying the number of rows (ranks) on the chessboard
3. The starting square for the knight in algebraic notation (single lower case letter specifying column and integer corresponding to the row)
4. An integer specifying the number of squares to visit in the tour
5. An integer specifying the number of full tours to print in the output file

Output File Format: Each full tour printed will be a list of board squares in algebraic notation, one square per line, starting and ending with the square given in the input file. If the tour length given is 8, then a printed tour will have 9 squares listed (the start square and the 8 destination squares with the start square as the last destination) on 9 lines. After each printed tour, a blank line or other divider line should be printed. After the requested number of tours is printed, a summary line that gives the total number of tours that qualify should be printed. If a tour with the given specifications cannot be done, the output file should contain a simple summary line that notes this fact.

Timing: The total time for execution of the Knight's Tour application will be used for scoring. (For most accurate timing results, submission codes would include timing code to measure and print this time to stdout, otherwise an external stopwatch will be used.)
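
The five-line input format above can be read with a few lines of C++. The following is only a sketch of one way to do it, not the code from my submission: the names `TourInput`, `readTourInput` and `toAlgebraic` are hypothetical, the coordinates are stored zero-based as an illustrative choice, and `std::to_string` assumes a C++11 compiler.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical container for the five input lines (names are illustrative).
struct TourInput {
    int columns = 0;       // line 1: number of columns (files)
    int rows = 0;          // line 2: number of rows (ranks)
    int startColumn = 0;   // line 3, stored zero-based ('a' -> 0)
    int startRow = 0;      // line 3, stored zero-based ("1" -> 0)
    int tourLength = 0;    // line 4: number of squares to visit
    int toursToPrint = 0;  // line 5: number of full tours to print
};

// Reads the five-line format. Returns false if a value is missing,
// malformed, or the starting square lies outside the board.
bool readTourInput(std::istream& in, TourInput& out) {
    std::string square;
    if (!(in >> out.columns >> out.rows >> square
             >> out.tourLength >> out.toursToPrint))
        return false;
    if (square.size() < 2 || square[0] < 'a' || square[0] > 'z')
        return false;
    out.startColumn = square[0] - 'a';
    std::istringstream rowText(square.substr(1));
    int row = 0;
    if (!(rowText >> row) || row < 1)
        return false;
    out.startRow = row - 1;
    return out.startColumn < out.columns && out.startRow < out.rows;
}

// Converts zero-based coordinates back to algebraic notation,
// for example (4, 5) becomes "e6".
std::string toAlgebraic(int column, int row) {
    assert(column >= 0 && column < 26 && row >= 0);
    return std::string(1, static_cast<char>('a' + column))
         + std::to_string(row + 1);
}
```

Reading from a `std::istream` rather than a file name keeps the sketch easy to exercise with a `std::istringstream`; a real program would open the file named on the command line and pass the stream in.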

Brute Force Serial Solution

To gain familiarity with the problem, my first attempt was a brute force solver that evaluated every possible path the knight could take from its starting position. A basic recursive approach was taken that consisted of the following steps:

1. Set the starting position and an empty list of valid tours
2. If the current recursion depth is equal to the length of the knight’s tour:
   a. If the current position is the same as the starting position, add the current tour to the list of valid tours
   b. Return
3. For each move a knight can make:
   a. Add the move to the current position (e.g. if the current position is d4 and the move is (1,2), the new position is e6)
   b. If the new position falls outside of the board:
      i. Skip this move
   c. If the new position is a repeat of a previous position in the current tour (other than a return to the starting position on the final move):
      i. Skip this move
   d. Recursively call (2) with the new position as the updated current position
4. Return

While this approach worked, as expected it also had horrible performance in both time and memory consumption (the latter thanks to storing every tour as it was found). But now that I had a better understanding of the mechanics of the problem, I was able to start considering more optimal solutions.

Optimised Serial Solution

When considering how to improve the performance of the brute force serial solution, it became clear there was a considerable amount of duplicate work being performed. My first thought was to take advantage of the fact that the set of all possible paths of length N between any 2 squares is identical to the set of all possible paths of length N between any 2 other squares that are the same distance apart in both rank and file (ignoring the case where paths traverse outside of the board area). I investigated the possibility of constructing a cache of these paths that could then be used to discover all valid Knights’ Tours, but unfortunately this proved unsuccessful.
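
For reference, the brute force recursion from the previous section can be sketched as follows. This is a minimal illustration rather than my original code: it only counts tours instead of storing them, the names `countClosedTours` and `countFrom` are hypothetical, and it counts a tour and its reverse as two separate tours, which is an assumption about the counting convention.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// The eight knight moves as (column, row) offsets.
static const int kKnightMoves[8][2] = {
    {1, 2}, {2, 1}, {2, -1}, {1, -2}, {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}};

// Counts the ways to finish the current partial tour with exactly
// 'movesLeft' further moves, ending on the starting square.
static std::uint64_t countFrom(int cols, int rows, int col, int row,
                               int startCol, int startRow, int movesLeft,
                               std::vector<char>& visited) {
    if (movesLeft == 0)
        return (col == startCol && row == startRow) ? 1 : 0;

    std::uint64_t total = 0;
    for (const auto& move : kKnightMoves) {
        const int c = col + move[0];
        const int r = row + move[1];
        if (c < 0 || c >= cols || r < 0 || r >= rows)
            continue;  // off the board: skip this move
        const bool isStart = (c == startCol && r == startRow);
        if (isStart && movesLeft != 1)
            continue;  // the start may only be re-entered on the final move
        const int index = r * cols + c;
        if (!isStart && visited[index])
            continue;  // repeat of an earlier square: skip this move
        if (!isStart) visited[index] = 1;
        total += countFrom(cols, rows, c, r, startCol, startRow,
                           movesLeft - 1, visited);
        if (!isStart) visited[index] = 0;  // backtrack
    }
    return total;
}

// Counts closed tours of 'tourLength' moves from a zero-based start square.
std::uint64_t countClosedTours(int cols, int rows, int startCol, int startRow,
                               int tourLength) {
    assert(cols > 0 && rows > 0 && tourLength > 0);
    std::vector<char> visited(static_cast<std::size_t>(cols) * rows, 0);
    return countFrom(cols, rows, startCol, startRow, startCol, startRow,
                     tourLength, visited);
}
```

Two small cases are easy to check by hand. On a 3x3 board the eight outer squares form a single cycle of knight moves, so a tour of length 8 from a corner can only go one way round or the other. On an 8x8 board the only tour of length 4 from a1 is a1, b3, d4, c2, a1 and its reverse.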

Although unsuccessful, the attempted caching solution did lead me to a realisation about the nature of the problem. Every possible path of length N from a single board position back to the same position can be described entirely using 2 leaf nodes of a tree of all paths of length N/2 from the starting node. Given the exponential increase in the number of moves as the path length increases, this dramatically reduces the number of moves that need to be considered. This can be illustrated as follows:

                     A
          /          |          \
         B           C           D
       / | \       / | \       / | \
      E  F  G     H  I  J     K  L  M

In the above diagram, each child represents a move from the parent position (not a board position; for example, nodes A and K might fall on the same board position). The example assumes a piece that has a total of 3 valid moves from its current position (3 is chosen purely so that it fits on the page easily!). To obtain a tour of length 4 using a tree of depth 2 (not counting the root), start at node A (the starting position), choose the first valid move to node B, and then choose the third valid move to node G. This gives what I shall refer to from here on as a ‘half-tour’. Given the tree of all valid half-tours, this particular half-tour can be represented by the single leaf node G.

To return to the starting position, select another leaf node that falls on the same board position as the first node and follow its parents back up to the root. Thus, given the tree of valid half-tours, a tour can be completely described using just 2 leaf nodes (in the case of the above diagram, nodes G and L). The set of all possible paths of length N from a board position back to itself is therefore the set of all combinations of any 2 leaf nodes that fall on the same board position.

This now gives every possible path of length N, but it doesn’t yet satisfy the requirement that no board position is traversed twice in a single tour (excluding the starting position).

To meet the requirement that no board position is traversed twice, several simple rules need to be adhered to when selecting the node that represents the second half-tour:

1. Given a node that describes a half-tour, the node selected to describe the second half-tour must fall on the same board position as the first node.
2. Given a node that describes a half-tour, the node selected to describe the second half-tour cannot share a parent with it other than the root node. For example, given node G for the first half-tour, nodes E and F cannot be selected for the second half-tour.
3. No parent of the second node may fall on a board position that is also traversed by any parent of the first node, excluding the root node. For example, if node C were to fall on the same board position as node B (which is impossible in the above graph given unique moves, but let’s ignore that...) then nodes H, I or J could not be selected.

We now have the means to discover all possible tours of length N. The algorithm for constructing the half-tour tree is as follows (and is very similar to the original brute force algorithm):

1. Create the root node of the half-tour tree, setting its board position to the starting position.
2. If the current recursion depth is equal to half the length of the knight’s tour:
   a. Return (without modifying the half-tour tree)
3. For each move a knight can make:
   a. Add the move to the position of the current node (e.g. if the current position is d4 and the move is (1,2), the new position is e6)
   b. If the new position falls outside of the board:
      i. Skip this move
   c. If the new position is a repeat of a previous position in the current half-tour:
      i. Skip this move
   d. Insert a new child node into the current node after the last existing child. The new node’s board position is set to the position calculated at (3a)
   e. Recursively call (2) with the new child node as the current node
4. Return

Once the half-tour tree has been constructed we can search for all combinations of leaf nodes while adhering to the previously stated rules.
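
The half-tour idea and the pairing rules above can be demonstrated compactly if the tree is swapped for something simpler. The sketch below is not my tree-based implementation: instead of tree nodes it stores, for every half-tour, a 64-bit mask of the squares strictly between the start and the end square, grouped by end square, and it tests every pair in a group with a bitwise AND. The names are hypothetical, the board is assumed to have at most 64 squares, and each pair is counted twice (once per direction of travel), which is an assumption about the counting convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bucket per board square; each entry is the interior mask of a half-tour
// that ends on that square.
typedef std::vector<std::vector<std::uint64_t> > HalfTourBuckets;

static const int kMoves[8][2] = {
    {1, 2}, {2, 1}, {2, -1}, {1, -2}, {-1, -2}, {-2, -1}, {-2, 1}, {-1, 2}};

// Enumerates every non-repeating path of 'movesLeft' further moves and files
// it under its end square.
static void collectHalfTours(int cols, int rows, int col, int row,
                             int movesLeft, std::uint64_t visited,
                             std::uint64_t startBit,
                             HalfTourBuckets& byEndSquare) {
    const int index = row * cols + col;
    if (movesLeft == 0) {
        const std::uint64_t endBit = std::uint64_t(1) << index;
        // Keep only the squares strictly between the start and the end.
        byEndSquare[index].push_back(visited & ~(startBit | endBit));
        return;
    }
    for (const auto& move : kMoves) {
        const int c = col + move[0];
        const int r = row + move[1];
        if (c < 0 || c >= cols || r < 0 || r >= rows) continue;  // off board
        const std::uint64_t bit = std::uint64_t(1) << (r * cols + c);
        if (visited & bit) continue;                             // repeat
        collectHalfTours(cols, rows, c, r, movesLeft - 1, visited | bit,
                         startBit, byEndSquare);
    }
}

// Counts closed tours by pairing half-tours that end on the same square and
// share no interior square.
std::uint64_t countClosedToursByHalves(int cols, int rows, int startCol,
                                       int startRow, int tourLength) {
    assert(cols > 0 && rows > 0 && cols * rows <= 64);
    assert(tourLength >= 4 && tourLength % 2 == 0);
    HalfTourBuckets byEndSquare(static_cast<std::size_t>(cols) * rows);
    const std::uint64_t startBit =
        std::uint64_t(1) << (startRow * cols + startCol);
    collectHalfTours(cols, rows, startCol, startRow, tourLength / 2, startBit,
                     startBit, byEndSquare);

    std::uint64_t total = 0;
    for (const auto& bucket : byEndSquare)
        for (std::size_t a = 0; a < bucket.size(); ++a)
            for (std::size_t b = a + 1; b < bucket.size(); ++b)
                if ((bucket[a] & bucket[b]) == 0)
                    total += 2;  // the pair (A,B) and its reverse (B,A)
    return total;
}
```

Testing every pair in a bucket is the naive version of the search. The tree layout described in this section exists precisely so that whole groups of sibling leaf nodes can be skipped as soon as a shared or overlapping parent is found, which this sketch does not attempt.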

The algorithm for finding all valid pairs of leaf nodes is as follows:

1. For each board position:
   a. For each leaf node (A) on the current board position:
      i. Find the first node at the same depth and to the right of A that doesn’t share a parent with A, excluding the root node (nodes to the left represent the reverse of already discovered tours)
      ii. For each leaf node (B) on the current board position located to the right of and including the node found in the above step:
          1. Find the shallowest parent of B that falls on the same board position as any parent of A
          2. If there’s no overlap, record the current pair of nodes (A,B) as a valid tour and continue from (ii) with the next B
          3. Otherwise, find the first node at the same depth and to the right of B that falls on the current board position and doesn’t share the parent of B that was found in (1) above, and continue from (ii) using this node as the next B
      iii. End of loop over B
   b. End of loop over A
2. End of loop over the board positions
3. Return all possible tours as recorded at (1.a.ii.2)

Since leaf nodes that fall on a particular board position can only be paired with other leaf nodes that fall on the same position, a lookup table is created after constructing the half-tour tree. It is indexed by board position and contains a vector of all leaf nodes that fall on that position. With each leaf node a vector of all board positions traversed by its parents is also stored, and each node is tagged with a pointer to the next leaf node for fast horizontal visiting of the nodes.

Parallel Solution

Going from the above optimised serial implementation to a parallel implementation is incredibly straightforward. There are 3 blocks of functionality that need to be considered:

1. Construction of the half-tour tree
2. Construction of the lookup table for leaf nodes indexed by board position
3. Discovery of all valid combinations of leaf nodes on the same board position

Starting with the last block of functionality, discovery of all valid combinations of leaf nodes is ‘embarrassingly parallel’. For each leaf node on the half-tour tree, a search of all other valid leaf nodes to be paired with it is performed. These searches are entirely independent of each other and can be performed in parallel. For example, for the relatively small problem of an 8x8 board starting on a8 with tours of length 8, 268 searches can be performed in parallel given enough hardware threads. For a larger tour of length 16, over 120000 searches can be performed in parallel.

Parallelising these searches was performed using OpenMP and quite literally took around 10 lines of code. The entire algorithm is wrapped in an OpenMP parallel region so N threads start executing the algorithm simultaneously. Each thread in the team has a local ‘numToursLocal’ and ‘toursLocal’ for storing, respectively, the count of tours discovered on that thread and the tours themselves, up to the number of full tours requested by the input file. Each thread then iterates over all board positions (resulting in N duplicate iterations over each board position, however the processor time this consumes is insignificant) and, in a nested OpenMP ‘for’ construct, iterates over the middle for loop (over node A) in such a way that each iteration only occurs on a single thread (ensuring none of the real work is duplicated). Since the time spent in each iteration of the loop can vary, dynamic scheduling was chosen (guided scheduling was also tested but proved to have worse performance). Finally, when each thread in the team completes execution, the results stored in ‘numToursLocal’ and ‘toursLocal’ in each thread are combined to give the final output.
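
The threading pattern just described can be sketched as follows. This is an illustration under stated assumptions, not the code from my submission: each half-tour is reduced to a bitmask of the squares it passes through, the per-position lookup table becomes a vector of such masks, and the pair test is a plain bitwise AND rather than the tree walk. Only ‘numToursLocal’ and ‘toursLocal’ are names from the text; `Bucket`, `RecordedPair` and `searchPairsParallel` are hypothetical. Each pair is counted once, and without OpenMP support the pragmas are ignored and the loop simply runs on one thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the lookup table and a recorded tour.
typedef std::vector<std::uint64_t> Bucket;  // masks of half-tours on a square
struct RecordedPair { std::size_t square, a, b; };

// Counts pairs of half-tours on the same square whose masks do not overlap,
// and records up to 'toursToRecord' of them.
std::uint64_t searchPairsParallel(const std::vector<Bucket>& byEndSquare,
                                  std::size_t toursToRecord,
                                  std::vector<RecordedPair>& recorded) {
    std::uint64_t numTours = 0;
    recorded.clear();

    #pragma omp parallel
    {
        std::uint64_t numToursLocal = 0;       // tours found by this thread
        std::vector<RecordedPair> toursLocal;  // at most toursToRecord entries

        // Every thread walks all board positions (cheap duplicated work)...
        for (std::size_t square = 0; square < byEndSquare.size(); ++square) {
            const Bucket& bucket = byEndSquare[square];
            const int count = static_cast<int>(bucket.size());

            // ...but each iteration over node A runs on exactly one thread.
            // OpenMP 2.0 requires a signed integer loop variable.
            #pragma omp for schedule(dynamic)
            for (int a = 0; a < count; ++a) {
                for (int b = a + 1; b < count; ++b) {
                    if ((bucket[a] & bucket[b]) != 0) continue;  // overlap
                    ++numToursLocal;
                    if (toursLocal.size() < toursToRecord) {
                        RecordedPair pair;
                        pair.square = square;
                        pair.a = static_cast<std::size_t>(a);
                        pair.b = static_cast<std::size_t>(b);
                        toursLocal.push_back(pair);
                    }
                }
            }
        }

        // Combine the per-thread results, trimming any excess recorded tours.
        #pragma omp critical
        {
            numTours += numToursLocal;
            for (std::size_t i = 0; i < toursLocal.size(); ++i)
                if (recorded.size() < toursToRecord)
                    recorded.push_back(toursLocal[i]);
        }
    }
    return numTours;
}
```

Because every thread executes the outer loop the same number of times, all threads reach each ‘for’ construct in the same order, which OpenMP requires. The count returned is the same for any number of threads; which particular pairs end up in the recorded list depends on how the iterations happen to be scheduled.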

It should be noted that since there’s no guarantee about how many valid tours each thread will find, either the threads would need to communicate with each other to record only N outputs in total, or each thread would need to store up to N tours, with any excess outputs being trimmed when each thread’s results are combined. It turned out that the latter had the better performance.

Unfortunately, both the construction of the half-tour tree and the construction of the lookup table are poor contenders for parallelisation. The primary reason is that they constantly perform random writes into the half-tour tree and lookup table respectively, so in order to guarantee consistency these would all need to be synchronised. Since the majority of the time spent in both of these tasks is in the writes, the overhead of synchronisation would dwarf any gains from multi-threading. The good news is that both contribute only a small amount of the total time spent, and the larger the problem, the less significant these serial sections become (time spent in the search for valid half-tour pairs grows significantly faster with problem size).

Performance Results

During performance testing the following input file was used. The tour length (line 4) was modified during the tests and took on the values 8, 10, 12, 14, 16, 18 and 20.

8
8
a8
8
2

The test machine is an Intel Quad-Core Q6600 (over-clocked to run at 3.2GHz) with 4GB of RAM running Windows 7 RTM 64-bit. Both 32-bit and 64-bit executables were tested and produced the same results, however only results from the 32-bit executable are included below.

First up, here is the total number of valid tours calculated for each of the tour lengths:

Tour Length    Total Tours
8              880
10             17208
12             316124
14             5302024
16             80672058
18             1108463036
20             13694993802

The time taken to calculate each of these tour lengths was recorded for 1, 2 and 4 threads (excluding tours of length 20, which were only calculated using 4 threads as I’m not patient enough to wait for them to complete with fewer threads...). Each of the three code blocks (i.e. half-tour tree construction, lookup table construction and half-tour pair searching) was timed separately.

[Figure: time taken versus tour length (8 to 20), with separate series for tree construction, lookup table construction and search at 1, 2 and 4 threads]

As mentioned previously, it can be seen that as the problem size increases the time consumed by the parallel code (in blue) dwarfs the time consumed by the serial code (in red and green). This is also a good demonstration of how quickly the computation required grows with the tour length. Now the same graph without the tours of length 20:

[Figure: time taken versus tour length (8 to 18), with separate series for tree construction, lookup table construction and search at 1, 2 and 4 threads]

Here it’s clear that there’s a significant advantage to using multiple hardware threads. Below is a graph of just the time consumed by the serial code.

[Figure: time consumed by the serial code versus tour length (8 to 18), for 1, 2 and 4 threads]

It can be seen that the number of threads has no impact on the serial code (as expected).

Finally, to determine how well the implementation scales with the number of hardware threads, we have a graph of speed-up versus the number of threads used (on a 4-core machine).

[Figure: speed-up versus tour length (8 to 18), with series for 1, 2 and 4 threads]

Here we can see the speed-up of using 2 and 4 threads versus a single thread for each of the tour lengths. Due to the fast execution time of tours shorter than 16, it’s clear that threading overhead prevents full utilisation of the 4 cores. With a length of 16 and 4 threads the overhead is still significant, but with 2 threads it can be seen that the speed-up is levelling out near the theoretical maximum of 2 times. Similarly, with a length of 18 and 4 threads the speed-up is approaching the theoretical maximum of 4 times. I would expect that with a length of 20 this would also level out just under a speed-up of 4 times.

Conclusion

This problem has proven to be an excellent contender for parallelisation. Despite the section of serial code, we can approach the theoretical maximum speed-up for large problem sizes, thanks to the proportion of time spent in the serial code drastically decreasing as the problem size increases.
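
The reasoning in this conclusion is the standard Amdahl's law argument, which can be written out explicitly. The symbols below are introduced here purely for illustration and do not correspond to measured values: T_s is the time spent in the serial sections (tree and lookup table construction), T_p is the single-threaded time of the parallel search, and p is the number of threads. The model is idealised and ignores threading overhead.

```latex
% T_s: time in the serial sections (tree and lookup table construction)
% T_p: single-threaded time of the parallel search; p: number of threads
S(p) = \frac{T_s + T_p}{T_s + \dfrac{T_p}{p}},
\qquad
\lim_{T_s / T_p \to 0} S(p) = p
```

As the tour length grows, T_p grows much faster than T_s, so the ratio T_s / T_p shrinks and the speed-up S(p) tends towards p, the number of threads.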