Tuning evaluation functions by maximizing concordance

D. Gomboc, M. Buro, T. A. Marsland
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
{dave, mburo, tony}@cs.ualberta.ca, http://www.cs.ualberta.ca/~games/

Abstract

Heuristic search effectiveness depends directly upon the quality of heuristic evaluations of states in a search space. Given the large amount of research effort devoted to computer chess throughout the past half-century, insufficient attention has been paid to the issue of determining whether a proposed change to an evaluation function is beneficial. We argue that the mapping of an evaluation function from chess positions to heuristic values is of ordinal, but not interval, scale. We identify a robust metric suitable for assessing the quality of an evaluation function, and present a novel method for computing this metric efficiently. Finally, we apply an empirical gradient-ascent procedure, also of our design, over this metric to optimize feature weights for the evaluation function of a computer-chess program. Our experiments demonstrate that evaluation-function weights tuned in this manner give performance equivalent to that of hand-tuned weights.

Key words: ordinal correlation, Kendall's τ (tau), heuristic evaluation function, gradient ascent, partial-sum transform, heuristic search, computer chess

1 Introduction

A half-century of research in computer chess and similar two-person, zero-sum, perfect-information games has yielded an array of heuristic search techniques, primarily dealing with how to search game trees efficiently. It is now clear to the AI community that search is an extremely effective way of harnessing computational power. The thrust of research in this area has been guided by the simple observation that as a program searches more deeply, the decisions it makes continue to improve [49]. It is nonetheless surprising how often the role of the static evaluation function is overlooked or ignored.
Heuristic search effectiveness depends directly upon the quality of heuristic evaluations of states in the search space. Ultimately, this task falls to the evaluation function to fulfil. It is only in recent years that the scientific community has begun to attack this problem with vigour. Nonetheless, it cannot be said that our understanding of evaluation functions is even nearly as thoroughly developed as that of tree searching. This work is a step towards redressing that deficit.

Inspiration for this research came while reflecting on how evaluation functions for today's computer-chess programs are usually developed. Typically, they are refined over many years, based upon careful observation of their performance. During this time, engine authors will tweak feature weights repeatedly by hand in search of a proper balance between terms. This ad hoc process is used because the principal way to measure the utility of changes to a program is to play many games against other programs and interpret the results. The process of evaluation-function development would be considerably assisted by the presence of a metric that could reliably indicate a tuning improvement.

Consideration was also given to harnessing the great deal of recorded experience of human chess for developing an evaluation function for computer chess. Researchers have tried to make their machines play designated moves from test positions, but we focus on judgements about the relative worth of positions, reasoning that if these are correct then strong moves will be selected by the search as a consequence.

The primary objective of this research is to identify a metric by which the quality of an evaluation function may be directly measured. For this to be convincingly achieved, we must validate this metric by showing that higher values correspond to superior weight vectors.
To this end, we introduce an estimated gradient-ascent procedure, and apply it to tune eleven important features of a third-party chess evaluation function.

In Section 2, we begin with a summary of prior work in machine learning in chess-like games as it pertains to evaluation-function analysis and optimization. This is followed in Section 3 by a brief, but formal, introduction to measurement theory, accompanied by an argument for the application of ordinal methods to the analysis of evaluation functions. A novel, efficient implementation of the ordinal metric known as Kendall's τ is described in Section 4; also indicated is how to apply it to measure the quality of an evaluation function. In Section 5, we present an estimated gradient-ascent algorithm by which feature weights may be optimized, and the infrastructure created to distribute its computation. Issues touching upon our specific experiments are discussed in Section 6; experimental results demonstrating the utility of ordinal correlation with respect to evaluation-function tuning are provided in Section 7. After drawing some conclusions in Section 8, we remark upon further plausible investigations.

2 Prior work on machine learning in chess-like games

Here we touch upon specific issues regarding training methods for static evaluation functions. For a fuller review of machine learning in games, the reader is referred to Fürnkranz's survey [14].

2.1 Feature definition and selection

The problem of determining which features ought to be available for use within an evaluation function is perhaps the most difficult problem in machine learning in games. Identification and selection of features for use within an evaluation function has primarily been performed by humans. Utgoff is actively involved in researching automated feature construction [13, 52, 53, 54, 56], typically using Tic-Tac-Toe as a test bed. Utgoff [55] discusses in depth the automated construction of features for game playing.
He argues for incremental learning of layers of features, the ability to train particular features in isolation, and tools for the specification and refactoring of features.

Hartmann [17, 18, 19] developed the "Dap Tap" to determine the relative influence of various evaluation feature categories, or notions, on the outcome of chess games. Using 62,965 positions from grandmaster tournament and match games, he found that "the most important notions yield a clear difference between winners and losers of the games". Unsurprisingly, the notion of material was predominant; the combination of other notions contributes roughly the same proportion to the win as material did alone. He further concluded that the threshold for one side to possess a decisive advantage is 1.5 pawns.

Kaneko, Yamaguchi, and Kawai [22] automatically generated and sieved through patterns, given only the logical specifications of Othello and a set of training positions. The resulting evaluation function was comparable in accuracy (though not in speed of execution) to that developed by Buro [8]. A common property of Tic-Tac-Toe and Othello that makes them particularly amenable to feature discovery is that conjunctions of adjacent atomic board features frequently yield useful information. This area of research requires further attention.

2.2 Feature weighting

Given a decision as to what features to include, one must then decide upon their relative priority. We begin with unsupervised learning, following the historical precedent in games, and follow with a discussion of supervised learning methods. It is worth noting that Utgoff and Clouse [51] distinguish between state-preference learning and temporal-difference learning, and show how to integrate the two into a single learning procedure. They demonstrate using the Towers of Hanoi problem that the combination is more effective than either alone.
2.2.1 Unsupervised learning

Procedures within the rubric of reinforcement learning rely on the difference between the heuristic value of states and the corresponding (possibly non-heuristic) values of actual or predicted future states to indicate progress. Assuming that any game-terminal positions encountered are properly assessed (trivial in practice), lower differences imply that the evaluation function is more self-consistent, and that the assessments made are more accurate.

The chief issue to be tackled with this approach is the credit assignment problem: what, specifically, is responsible for the differences detected? While some blame may lie with the initial heuristic evaluation, blame may also lie with the later, presumably more accurate evaluation. Where more than one ply exists between the two (a common occurrence, as frequently either the result of the game or the heuristic evaluation at the leaf of the principal variation searched is used), the opportunities for error multiply. The TD(λ) technique [45] attempts to solve this problem by distributing the credit assignment amongst states in an exponentially decaying manner. By training iteratively, responsibility ends up distributed where it is thought to belong.

The precursor of modern machine learning in games is the work done by Samuel [37, 38]. By fixing the value for a checker advantage, while letting other weights float, he iteratively tuned the weights of evaluation function features so that the assessments of predecessor positions became more similar to the assessments of successor positions. Samuel's work was pioneering not just in the realm of games but for machine learning as a whole. Two decades passed before other game programmers took up the challenge.

Tesauro [47] trained his backgammon evaluator via temporal-difference learning. After 300,000 self-play games, the program reached strong amateur level.
Subsequent versions also contained hidden units representing specialized backgammon knowledge and used expectimax, a variant of minimax search that includes chance nodes. TD-GAMMON is now a world-class backgammon player.

Beal and Smith [4] applied temporal-difference learning to determine piece values for a chess program that included material, but not positional, terms. Program versions using weights resulting from five randomized self-play learning trials each won a match versus a sixth program version that used the conventional weights given in most introductory chess texts. They have since extended their reach to include piece-square tables for chess [5] and piece values for Shogi [6].

Baxter, Tridgell, and Weaver (1998) named Beal and Smith's application of temporal-difference learning to the leaves of the principal variations TDLeaf(λ), and used it to learn feature weights for their program KNIGHTCAP. Through online play against humans, KNIGHTCAP's skill level improved from beginner to strong master. The authors credit this to: the guidance given to the learner by the varying strength of its pool of opponents, which improved as it did; the exploration of the state space forced by stronger opponents who took advantage of KNIGHTCAP's mistakes; and the initialization of material values to reasonable settings, locating KNIGHTCAP's weight vector "close in parameter space to many far superior parameter settings".

Schaeffer, Hlynka, and Jussila [40] applied temporal-difference learning to CHINOOK, the World Man-Machine Checkers Champion, showing that this algorithm was able to learn feature weights that performed as well as CHINOOK's hand-tuned weights. They also showed that, to achieve best results, the program should be trained using search effort equal to what the program will typically use during play.

2.2.2 Supervised learning

Any method of improving an evaluation function must include a way to determine whether progress is being made.
In cases where particular moves are sought, a simple enumeration of the number of 'correct' moves selected is the criterion employed. This metric can be problematic because frequently no single move is clearly best in a position. Nonetheless, variations on this method have been attempted.

Marsland [26] investigated a variety of cost functions suitable for optimizing values and rankings assigned to moves for the purposes of move selection and move ordering. He found tuning to force recommended moves into relatively high positions to be effective, but attempting to force desired moves to be preferred over all alternatives was counterproductive.

Tesauro [46] devised a neural network structure appropriate for learning an evaluation function for backgammon, and compared states resulting from moves actually played against alternative states resulting from moves not chosen. "Comparison training" was shown to outperform an earlier system that compared states before and states after a move was played. Tesauro [48] applied comparison training to a database of expert positions to demonstrate that king safety terms had been underweighted in DEEP BLUE's evaluation function in 1996. The weights were raised for the 1997 rematch between Garry Kasparov and DEEP BLUE. Subsequent analysis demonstrated the importance of this change.

A popular approach is to solve for a regression line that minimizes the squared error over the data. While this may be done by solving a set of linear equations, typically an iterative process known as gradient descent (or, for maximization, gradient ascent, also called hill climbing) is applied: the partial derivatives of the error with respect to the weights are computed, and the weights are repeatedly adjusted in the direction that reduces the error. The error function is quadratic, so there will be a single, global minimum to which all gradients will lead.
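The gradient-descent procedure just described can be illustrated with a toy regression. This is our own sketch with made-up data, not any chess program's actual tuner: the weights w and b of the line y = wx + b are stepped against the partial derivatives of the squared error.

```python
# Toy gradient descent on the squared error of a regression line y = w*x + b.
# Illustrative only; the data and learning rate are made up.
def fit_line(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw                 # step downhill along the gradient
        b -= lr * db
    return w, b

# Points generated from y = 2x + 1; the quadratic error surface has a single
# global minimum, so the iteration recovers w = 2, b = 1.
w, b = fit_line([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0])
```

Because the error surface is quadratic, any starting point converges to the same minimum; with non-linear regression that guarantee disappears.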
Solving regression curves (non-linear regressions) by gradient descent is also possible, but in these cases, converging to a global minimum may be considerably more difficult.

The DEEP THOUGHT (later DEEP BLUE) team applied least-squares fitting to the moves of the winners of 868 grandmaster games to tune their evaluation-function parameters as early as 1987 [1, 2, 20, 32]. They found that tuning to maximize agreement between their program's preferred choice of move and the grandmaster's was "not really the same thing" as playing more strongly. Amongst other interesting observations, they discovered that conducting deeper searches while tuning led to superior weight vectors being reached.

Buro [7] estimated feature weights by performing logistic regression on win/loss/draw-classified Othello positions. The underlying log-linear model is well suited for constructing evaluation functions that approximate winning probabilities. In that application, it was also shown that the evaluation function based on logistic regression could perform better than those based on linear and quadratic discriminant functions. Buro [8] used linear regression and positions labelled with the game's final disc differential to optimize the weights of thousands of binary pattern features for Othello. This approach yielded significantly better performance than his 1995 work [7]. Buro's work is an example of a supervised learning approach in which heuristic assessments are directly compared with more informed assessments. We expand the applicability of this approach to include situations where least-squares fitting would be inappropriate.

2.3 Miscellaneous

Levinson and Snyder [24] created MORPH, a chess program designed in a manner consistent with cognitive models, except that look-ahead search was strictly prohibited.
It generalizes attack and defence relationships between chess pieces and squares adjacent to the kings, applying various machine-learning techniques for pattern generation, deletion, and weight tuning. MORPH learned to value material appropriately and play reasonable moves from the opening position of chess.

Van der Meulen [27] provides algorithms for selecting a weight vector that will yield the desired move, for partitioning a set of positions into groups that share such weight vectors, and for labelling new positions as members of one of the available groups.

Evolutionary methods have also been applied to the problem. Chellapilla and Fogel [9, 10] evolved intermediate-strength checkers players by evolving weights for two neural network designs. Kendall and Whitwell [23] did likewise in chess, though they used predefined features instead. Mysliwietz [28] also optimized chess evaluation weights using a genetic algorithm. Genetic algorithms maintain a constant-sized pool of candidate evaluation functions, determine their fitness, and use the fittest elements to generate offspring by crossover and mutation operators. Usually, a gradient-descent approach will converge significantly more quickly than a genetic algorithm when the objective function is smooth. Specifics of our application suggest that genetic algorithms may nonetheless be appropriate for our problem. We discuss this issue in Subsection 8.1.

3 Measurement theory

Sarle [39] states: "Mathematical statistics is concerned with the connection between inference and data. Measurement theory is concerned with the connection between data and reality. Both statistical theory and measurement theory are necessary to make inferences about reality." We are endeavouring to identify a direct way to measure the quality of an evaluation function, so a familiarity with measurement theory is important.
3.1 Statistical scale classifications

The four standard statistical scale classifications in common use today to classify unidimensional metrics originated with Stevens [42, 43]. He defined the following, increasingly restrictive categories of relations.

A relation R(x, y) is nominal when R is a one-to-one function. Let function f be such a relation that is defined over the real numbers: y = f(x). Then, f is nominal if and only if

∀i ∀j : i = j ⇔ f(i) = f(j)   (3.1)

One common example of a nominal relation is the relationship between individuals of a sports team and their jersey numbers.

Additionally, f is ordinal when f is strictly monotonic: as the value of x increases, either y always increases or y always decreases. We will assume (without loss of generality, as we can negate f if necessary) that y always increases. Formally:

∀i ∀j : i < j ⇔ f(i) < f(j)   (3.2)

The relationship of the standard five responses "strongly disagree", "disagree", "neutral", "agree", and "strongly agree" to many survey questions is ordinal. The mapping of chess positions to the possible assessments given in Table 1 in Subsection 3.3 is also ordinal.

Additionally, f is of interval scale when f is affine:

∃c > 0 ∀i ∀j : i − j = c[f(i) − f(j)]   (3.3)

The difference between two elements is now meaningful. Hence, subtraction and addition are now well-defined operations. Time is of interval scale.

Additionally, f is of ratio scale when the zero point is fixed:

∀i ∀j, j ≠ 0, f(j) ≠ 0 : i / j = f(i) / f(j)   (3.4)

The ratio between two elements is now meaningful. Hence, division and multiplication are now well-defined operations. Temperature is an example of a ratio scale: the zero point is absolute zero, which is 0 kelvins.

These categories are by no means the only plausible ones. Stevens himself recognized a log-interval scale classification [44], to which the Richter and decibel scales belong. It has a fixed zero, and a log-linear relationship.
Every log-interval scale is ordinal, and every ratio scale is log-interval, but interval and log-interval are incompatible. The absolute scale is a specialization of the ratio scale, where not only the zero point but also the one point is fixed. This scale is used for representing probabilities and for counting.

Figure 1. Partial Ordering of Statistical Scale Classifications

Figure 1 demonstrates the inheritance relationships between the various classifications. It is to be understood that this classification system is open to extension, and that it does not encapsulate all interesting properties of functions.

3.2 Admissible transformations

Transformations are deemed admissible (Stevens used "permissible") when they conserve all relevant information. For example, converting from kelvins to degrees Celsius (°C) by applying f(x) = x − 273.16 would not be considered admissible with respect to the ratio scale, because the zero point would not be preserved. Such a transformation is not prohibited per se, but it does mean that usages involving ratios would no longer be justified. Three hundred kelvins is actually twice as hot as one hundred fifty kelvins. Clearly, then, 26.84°C cannot be twice as hot as 13.42°C.

The transformations deemed admissible for any category are those that retain the invariants specified by (3.1) through (3.4). A transformation is admissible with respect to the nominal scale if and only if it is injective. For the ordinal scale, the preservation of order is also required. For the interval scale, equivalent differences must remain equivalent, so only affine transformations are permitted (g(x) = af(x) + b, a > 0). Admissibility with respect to the ratio scale also requires the zero point to remain fixed (b = 0 in the prior equation).

Stevens argued that it is not justified to draw inferences about reality based upon non-admissible transformations of data, because otherwise the conclusions that could be drawn would vary depending upon the particulars of the transformation applied.
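The temperature example can be made concrete in a few lines of Python (our illustration, using the 273.16 offset quoted in the text): the kelvin-to-Celsius shift preserves order and differences, so ordinal- and interval-scale statements survive it, but it moves the zero point, so ratio statements do not.

```python
# Order and differences survive the shift from kelvins to degrees Celsius,
# but ratios do not, because the zero point moves.
k_hot, k_cold = 300.0, 150.0                     # kelvins
c_hot, c_cold = k_hot - 273.16, k_cold - 273.16  # offset as quoted in the text

assert k_hot / k_cold == 2.0                     # twice as hot, on the ratio scale
assert (k_hot > k_cold) == (c_hot > c_cold)      # ordinal invariant preserved
assert abs((k_hot - k_cold) - (c_hot - c_cold)) < 1e-9  # interval invariant too
assert c_hot / c_cold != 2.0                     # ratio claim breaks: 26.84 vs -123.16
```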
"In general, the more unrestricted the permissible transformations, the more restricted the statistics. Thus, nearly all statistics are applicable to measurements made on ratio scales, but only a very limited group of statistics may be applied to measurements made on nominal scales" [41].

Much discussion of this highly controversial position has taken place. Tukey opined that in science, the "ultimate standard of validity is an agreed-upon sort of logical consistency and provability" [50], and that this is incompatible with an axiomatic theory of measurement, which is purely mathematical. Velleman and Wilkinson argue that the classifications themselves are misleading [57]. It is important to note that the notion of admissibility is relative to the analysis being performed rather than purely data-centric: it is perfectly valid to convert from kelvins to degrees Celsius, so long as one does not then draw conclusions based upon the ratios of temperatures denoted in degrees Celsius. Stevens' classifications remain in widespread use, though in particular his stricture against drawing conclusions based upon statistical procedures that require interval-scale data applied to data with only ordinal-scale justification is frequently ignored.

3.3 Relevance to heuristic evaluation functions for minimax search

Utgoff and Clouse [51] comment: "Whenever one infers, or is informed correctly, that state A is preferable to state B, one has obtained information regarding the slope for part of a correct evaluation function. Any surface that has the correct sign for the slope between every pair of points is a perfect evaluation function.
An infinite number of such evaluation functions exist, under the ordinary assumption that state preference is transitive."

Throughout the search process of the minimax algorithm for game-tree search [41], and all its derivatives [25, 34], a single question is repeatedly asked: "Is position A better than position B?" Not "How much better?", but simply "Is it better?". In minimax, instead of propagating values one could propagate the positions themselves, and, as humans do, choose between them directly without using values as an intermediary. This shows that we only need pairwise comparisons that tell us whether position A is better than position B.

The essence of a heuristic evaluation function is to make educated guesses regarding the likelihood of successful outcomes from arbitrary states. This is somewhat obscured in chess by the tendency to use a pawn as the unit by which advantage is measured, a habit encouraged by the high correlation between material advantage and likelihood of success inherent in the game. The implicit mapping between a typical evaluation scoring system (range: [-32767, 32767], where the worth of a pawn is represented by 100) and the probability of success (range: [0, 1]) maintains the ordinal invariant of linear ordering, though not the interval invariant of equality of differences.

This is interesting insofar as typically terms will be added and subtracted during the assessment of a position by a computer program, operations that according to Stevens require interval-scale justification. Furthermore, the difference between scores is frequently used as a criterion for applying numerous search techniques (e.g., aspiration windows, lazy evaluation, futility pruning). How can we reconcile his theory with practice?

Firstly, a heuristic search is exactly that: heuristic. Many of these techniques encompass a trade-off of accuracy for speed, with the difference threshold set empirically.
Thus, to some extent, compensation for the underlying ordinal nature of scores returned by the evaluation function is already present. Secondly, it can be argued that the region between -200 and 200, where the majority of assessments that will influence the outcome of the game lie, is in practice nearly one of interval scale. This means that decisions made based on a difference threshold while the game is roughly balanced are relatively less likely to be faulty in important situations. Finally, one may choose to agree with Stevens' detractors, and indeed, many social scientists compute means on ordinal data frequently without much ado.

Does this mean that we should ignore the underlying ordinality of the problem? No. Everything of interval scale is of ordinal scale, while the converse is false. A metric capable of handling ordinal data is readily applicable to data that meets a more stringent qualification. In contrast, the application of a metric suited for interval-scale data to data of ordinal scale is, at least to researchers who agree with Stevens, suspect, and requires justification. Such justification might take the form of the ability to predict and validate predictions that depend on the proposed interval nature of the variable.

Table 1 depicts seven of the symbols used by contemporary chess literature to represent judgements regarding the degree to which a position is favourable to one player or the other in a language-agnostic manner.1 Also provided are their corresponding English definitions as provided by Chess Informant [36].2 We have ordered them by White's increasing preference, so the categories are monotonically increasing; however, it would be nonsense to say, for instance, that the difference between two assessments three categories apart is three times the difference between two adjacent assessments. Therefore, they are of ordinal scale.
It would be possible to label these with the numbers 1 through 7, and proceed from there as if they were of interval scale, but at least according to Stevens, doing so would require justification. Fortunately, it is unnecessary, as we will show in the next section.

Table 1. Symbols for Chess Position Assessment

Symbol  Meaning
−+      Black has a decisive advantage
∓       Black has the upper hand
⩱       Black stands slightly better
=       even
⩲       White stands slightly better
±       White has the upper hand
+−      White has a decisive advantage

4 Kendall's τ

Concordance, or agreement, occurs where items are ranked in the same order. Kendall's τ measures the degree of similarity of two orderings. After defining this metric, we will elaborate on a novel, efficient implementation of τ, provide a worked example and complexity analysis, introduce a generalization of τ, and contrast τ with alternative correlation methods.

4.1 Definition

Given a set of m items I = (i1, i2, …, im), let X = (x1, x2, …, xm) represent relative preferences for I by analyst ae, and Y = (y1, y2, …, ym) represent relative preferences for I by analyst ao. Take any two pairs, (xi, yi) and (xk, yk), and compare both the x values and the y values. Table 2 defines the relationship between those pairs.

1 Two other assessment symbols, ∞ (the position is unclear) and =/∞ (a player has positional compensation for a material deficit), are also frequently encountered. Unfortunately, the usage of these two symbols is not consistent throughout chess literature. Accordingly, we ignore positions labelled with these assessments, but see Subsection 8.2.

2 Readers who have difficulty obtaining a volume of Chess Informant may find sample pages from Vol. 59, including these definitions, in Gomboc's M.Sc. thesis [14].

Table 2.
Relationships between Ordered Pairs

relationship between    relationship between    relationship between
xi and xk               yi and yk               (xi, xk) and (yi, yk)
xi < xk                 yi < yk                 concordant
xi < xk                 yi > yk                 discordant
xi > xk                 yi < yk                 discordant
xi > xk                 yi > yk                 concordant
xi = xk                 yi ≠ yk                 extra y-pair
xi ≠ xk                 yi = yk                 extra x-pair
xi = xk                 yi = yk                 duplicate pair

n, the total number of distinct pairs of positions, and therefore also the total number of comparisons made, is straightforward to compute:

n = Σ_{i=1}^{m−1} Σ_{k=i+1}^{m} 1 = ½ m(m − 1)   (4.1)

Let S+ ("S-positive") be the number of concordant pairs:

S+ = Σ_{i=1}^{m−1} Σ_{k=i+1}^{m} { 1 if xi < xk and yi < yk; 1 if xi > xk and yi > yk; 0 otherwise }   (4.2)

Let S− ("S-negative") be the number of discordant pairs:

S− = Σ_{i=1}^{m−1} Σ_{k=i+1}^{m} { 1 if xi < xk and yi > yk; 1 if xi > xk and yi < yk; 0 otherwise }   (4.3)

Then τ, which we will also refer to as the concordance, is given by:

τ = (S+ − S−) / n   (4.4)

Possible concordance values range from +1, representing complete agreement in ordering, to -1, representing complete disagreement in ordering. Whenever extra or duplicate pairs exist, the values of +1 and -1 are not achievable. Cliff [11] provides a more detailed exposition of Kendall's τ, discussing variations thereof that optionally disregard extra and duplicate pairs. Cliff labels what we call τ as τa, and uses it most often, noting that it has the simplest interpretation of the lot.

4.2 Matrix representation of preference data

In typical data sets, pairs of preferences will occur multiple times. We can compact all such pairs together, which allows us to process them in a single step, thereby substantially reducing the work required to compute S+ and S−. Let Sx be the set of preferences {x1, x2, …, xm}, and Sy be the set of preferences {y1, y2, …, ym}. Let U = |Sx| and V = |Sy|. Let E = (e1, e2, …, eU) such that ∀i : ei ∈ Sx and ei < ei+1. Similarly, let O = (o1, o2, …, oV) such that ∀i : oi ∈ Sy and oi < oi+1.
We define matrix A of dimensions U by V such that

A_xy = Σ_{i=1}^{m} { 1 if x_i = e_x and y_i = o_y; 0 otherwise }   (4.5)

A's definition ensures that there is at least one non-zero entry in every row and column. Let A[x..z, y..w] denote the sum of the cells of matrix A within rows x through z and columns y through w:

A[x..z, y..w] = Σ_{i=x}^{z} Σ_{k=y}^{w} A_ik   (4.6)

Then, the contribution towards S+ of the single cell A_xy is

ΔS+_xy = A[x+1..U, y+1..V] · A_xy   (4.7)

and S+ may be reformulated as

S+ = Σ_{x=1}^{U} Σ_{y=1}^{V} ΔS+_xy   (4.8)

Similarly,

ΔS−_xy = A[1..x−1, y+1..V] · A_xy   (4.9)

S− = Σ_{x=1}^{U} Σ_{y=1}^{V} ΔS−_xy   (4.10)

Numerical Recipes in C [35] is a standard reference for this algorithm. However, we have reservations about the manner in which their sample implementation handles duplicate pairs.

4.3 Partial-sum transform

The naïve algorithm to compute Kendall's τ computes the number of concordant and discordant pairs by looping over the preference data directly; the straightforward algorithm loops over the cells in the matrix. However, due to the monotone ordering of both axes, the computation is replete with overlapping subproblems of optimal substructure. It is this fact that we exploit.

We introduce auxiliary matrix BR, which has the same dimensions as A.3 Each cell in BR is assigned the value corresponding to the sum of the cells in A that are either on, below, or to the right of the corresponding cell in A:

BR_xy = A[x..U, y..V] = A_xy + BR_x(y+1) + BR_(x+1)y − BR_(x+1)(y+1)   (4.11)

Similarly, auxiliary matrix BL holds in each cell the sum of the cells in A that are either on, below, or to the left of the corresponding cell in A:

BL_xy = A[1..x, y..V] = A_xy + BL_x(y+1) + BL_(x−1)y − BL_(x−1)(y+1)   (4.12)

BR and BL are both instances of partial-sum transforms of A.

3 For convenience, we define BR_(x+1)y to equal 0 when x = U, and BR_x(y+1) to equal 0 when y = V. Nonetheless, space need not be allocated for such cells. Similarly, BL_(x−1)y = 0 when x = 1, and BL_x(y+1) = 0 when y = V.
The sum of an arbitrary rectangular block of cells of the original matrix is conveniently retrieved from such a transformed version.⁴ For instance:

    BL_{xy} = BR_{1y} - BR_{(x+1)y}    (4.13)

    A_{xy} = BR_{xy} - BR_{(x+1)y} - BR_{x(y+1)} + BR_{(x+1)(y+1)}    (4.14)

(4.13) lets us compute BL from BR in constant time as necessary, so we need not construct both BR and BL. Additionally, we can construct BR in place by overwriting A, even without recourse to (4.14), reducing to zero the storage overhead of computing Kendall's τ in this manner rather than from the original matrix representation. The single-dimension case of the partial-sum transform is well known; for example, unidimensional versions of (4.11) and (4.14) are provided by the partial_sum and adjacent_difference algorithms in the C++ Standard Library. The partial-sum transform is readily applied to matrices of higher dimension also. However, each additional dimension doubles the number of terms required on the right side of (4.14) to maintain the equality.

4.4 Worked example with complexity analysis

Here we provide sample data and work through the mathematics of the previous three subsections.

4.4.1 Definition

We will use 14 states for our example, therefore m = 14. X and Y represent the preference data for the 14 states; see Table 3. We could pre-order the positions so that they are ordered by Y, but in the interest of clarity, we will leave this for a later step.

Table 3. Sample preference data

    X = (x_1, x_2, …, x_m) = (-0.1, -0.4, 0.9, -1.2, 0.0, -0.1, 0.6, 0.0, -0.1, -0.4, 2.4, 0.0, 0.6, 0.6)
    Y = (y_1, y_2, …, y_m) = the corresponding human assessments, drawn from the seven assessment symbols (the symbols do not survive this rendering)

From (4.1), we know that n, the total number of distinct pairs of positions, is 91. We can now loop through the data, accumulating S^+ and S^-. This process is depicted in Table 4. For this data set, S^+ and S^- are 51 and 25 respectively, and τ = (51 - 25)/91 ≈ 0.2857. The running time of this algorithm is Θ(m²).

⁴ This fact was brought to our attention in a personal communication by Juraj Pivovarov.
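As an aside on the partial_sum/adjacent_difference analogy noted in Subsection 4.3, the one-dimensional case is easy to exercise with Python's itertools.accumulate (a quick self-contained check of ours, assuming nothing beyond the standard library): accumulation plays the role of (4.11), adjacent differences invert it as in (4.14), and any contiguous range sum becomes two lookups, mirroring (4.13).

```python
from itertools import accumulate

# One-dimensional analogues of (4.11) and (4.14).
a = [3, 1, 4, 1, 5]
ps = list(accumulate(a))                                  # like std::partial_sum
diffs = [ps[0]] + [ps[i] - ps[i - 1] for i in range(1, len(ps))]
assert diffs == a                                         # adjacent_difference inverts it
# Any contiguous range sum is two lookups, the 1-D analogue of (4.14):
assert ps[3] - ps[0] == sum(a[1:4])
```

In two dimensions the same retrieval needs four terms, and each further dimension doubles the term count again.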
4.4.2 Matrix representation of preference data

Based on the definitions of X and Y, S_y is the set of seven distinct human-assessment symbols appearing in Y (the symbols do not survive this rendering), S_x = {-1.2, -0.4, -0.1, 0.0, 0.6, 0.9, 2.4}, and V = U = 7. Then O = (o_1, …, o_7) lists those symbols in ascending order, and E = (-1.2, -0.4, -0.1, 0.0, 0.6, 0.9, 2.4). The purpose of this exercise is to sort the preferences in ascending order, with duplicates removed, so that matrix A, as shown in Table 5, has its rows and columns in order of White's increasing preference, and has no empty rows or columns. If the axes were not sorted, then dynamic programming would not be applicable. In Table 5 and subsequent preference-data tables, cells that would contain zero have been left blank.

The sorting takes time Θ(m log m), while the duplicate-entry removal takes time Θ(m). Zeroing the memory for the matrix takes time Θ(UV), then populating it takes time Θ(m). Therefore, the time complexity of constructing A is Θ(m log m + UV). Computing (4.6) takes time Θ(UV), because each cell in A^{(z,w)}_{(x,y)} is addressed to perform the summations, so computing S^+ and S^- from A via (4.8) and (4.10) takes time Θ(U²V²). Therefore, computing S^+ and S^- from the preference data takes time Θ(m log m + U²V²).

Table 4.
Sample preference data: relationships between ordered pairs. [The table steps through the 91 pair comparisons: for each pair of items i and k, it lists the assessments (y_i, x_i) and (y_k, x_k), the ordering relation within each list, the resulting classification (concordant, discordant, extra x-pair, or extra y-pair), and the contribution to the running totals of S^+ and S^-. The human-assessment symbols do not survive this rendering, so the individual rows are omitted here; the totals are S^+ = 51 and S^- = 25.]

Table 5. Sample preference data: (machine, human) assessments. Columns are the machine assessments E; rows, from top to bottom, are the seven human-assessment symbols o_1 through o_7.

           -1.2  -0.4  -0.1   0.0   0.6   0.9   2.4
    o_1      1                        1
    o_2                  1     1
    o_3            1     1
    o_4                        2
    o_5                               2
    o_6            1                              1
    o_7                  1                  1

4.4.3 Partial-sum transform

To construct matrix BR, we start with the bottom-right hand corner, and proceed leftwards, then upwards, applying (4.11) at each cell.⁵ In the tables that follow, the cells already converted are those from the cell named in the caption rightwards in its row, together with all cells in the rows below it.

Table 6. Sample preference data: BR being constructed overtop of A, at BR57

           -1.2  -0.4  -0.1   0.0   0.6   0.9   2.4
    o_1      1                        1
    o_2                  1     1
    o_3            1     1
    o_4                        2
    o_5                               2
    o_6            1                              1
    o_7                  1            1     1

The first cell to change value is BR57 – column 5, row 7; see (4.11).

⁵ One may fill forward in memory by reversing the element order of the table axes.
We continue sliding left until the entire row is transformed, after which we return to the right edge of the matrix, and continue with the preceding row.

Table 7. Sample preference data: BR being constructed overtop of A, at BR56

           -1.2  -0.4  -0.1   0.0   0.6   0.9   2.4
    o_1      1                        1
    o_2                  1     1
    o_3            1     1
    o_4                        2
    o_5                               2
    o_6            1                  2     2     1
    o_7      2     2     2     1     1     1

For this sample data, BR56 is the first cell for which the wrong value would have been computed if the final term within (4.11) were not present. Both BR66 and BR57 include BR67, but we want to count it only once, so we must subtract it out.

Table 8. Sample preference data: BR fully constructed overtop of A

           -1.2  -0.4  -0.1   0.0   0.6   0.9   2.4
    o_1     14    13    11     8     5     2     1
    o_2     12    12    10     7     4     2     1
    o_3     10    10     8     6     4     2     1
    o_4      8     8     7     6     4     2     1
    o_5      6     6     5     4     4     2     1
    o_6      4     4     3     2     2     2     1
    o_7      2     2     2     1     1     1

BR may be constructed from A in Θ(UV) operations via (4.11). Once this has been done, (4.7) can be performed by table lookup, which in turn allows the computation of S^+ via (4.8) given A to be performed in Θ(UV) steps.

Table 9. Sample preference data: BL, which is deducible from BR

           -1.2  -0.4  -0.1   0.0   0.6   0.9   2.4
    o_1      1     3     6     9    12    13    14
    o_2            2     5     8    10    11    12
    o_3            2     4     6     8     9    10
    o_4            1     2     4     6     7     8
    o_5            1     2     2     4     5     6
    o_6            1     2     2     2     3     4
    o_7                  1     1     1     2     2

Values are retrieved from BL, which is implicitly available from BR via (4.13), in constant time, so the computation of S^- via (4.10) given A may also be performed in Θ(UV) steps. Therefore, the time complexity of computing τ via the partial-sum transform is Θ(m log m + UV).

What have we achieved? The naïve algorithm runs in time Θ(m²), so when U and V equal m, the partial-sum transform gives us no advantage in terms of asymptotic complexity. The work is useful nonetheless because we can reduce U and V at will by merging nearby preference values together. This enables approximate values for τ to be found quickly even when an exact value would be prohibitively expensive to compute. The improvement from Θ(m log m + U²V²) to Θ(m log m + UV) substantially decreases the amount of approximation required when m is large.
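The worked example can be replayed in code. The sketch below is ours, not the authors' implementation: it builds BR in place via (4.11), then forms S^+ and S^- in Θ(UV) using (4.7) for the concordant side and (4.13) to read BL values off BR for the discordant side. The matrix A is the worked example's (it can be recovered from the completed BR of Table 8 via (4.14)), with the machine assessments E as the first index.

```python
def build_br(a):
    """(4.11): after this call, a[x][y] holds the sum of the original cells
    with x' >= x and y' >= y (an in-place partial-sum transform)."""
    u, v = len(a), len(a[0])
    for x in range(u - 1, -1, -1):
        for y in range(v - 1, -1, -1):
            if x + 1 < u:
                a[x][y] += a[x + 1][y]
            if y + 1 < v:
                a[x][y] += a[x][y + 1]
            if x + 1 < u and y + 1 < v:
                a[x][y] -= a[x + 1][y + 1]    # subtract the double-counted block

def s_pos_neg(a):
    """S+ and S- from A in Theta(UV): (4.7) by BR lookup, (4.9) via (4.13)."""
    u, v = len(a), len(a[0])
    br = [row[:] for row in a]
    build_br(br)
    at = lambda x, y: br[x][y] if x < u and y < v else 0   # footnote-3 zeros
    s_pos = s_neg = 0
    for x in range(u):
        for y in range(v):
            s_pos += a[x][y] * at(x + 1, y + 1)                # (4.7)
            s_neg += a[x][y] * (at(0, y + 1) - at(x, y + 1))   # (4.9) via (4.13)
    return s_pos, s_neg

# A from Table 5: first index runs over E = (-1.2, -0.4, -0.1, 0.0, 0.6, 0.9, 2.4),
# second over the seven human-assessment symbols o_1..o_7.
A = [
    [1, 0, 0, 0, 0, 0, 0],  # -1.2
    [0, 0, 1, 0, 0, 1, 0],  # -0.4
    [0, 1, 1, 0, 0, 0, 1],  # -0.1
    [0, 1, 0, 2, 0, 0, 0],  #  0.0
    [1, 0, 0, 0, 2, 0, 0],  #  0.6
    [0, 0, 0, 0, 0, 0, 1],  #  0.9
    [0, 0, 0, 0, 0, 1, 0],  #  2.4
]
```

Running `s_pos_neg(A)` reproduces the worked example's S^+ = 51 and S^- = 25, hence τ = 26/91 ≈ 0.2857.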
That said, most of the time U and V will be much smaller than m. When either U or V is constant, applying dynamic programming is asymptotically more time-efficient than prior methods. For the experiments in §7, U is about 0.4m, and V = 7. Furthermore, τ is computed many times on similar data. In our application, Y remains constant, so, as mentioned in Subsection 4.4.1, this sort could be performed only once, ahead of time. X does change, but the differences between successive X vectors are small. It would be worthwhile to try sorting indirect references to the data, so that successive sorts would be processing data that is already almost sorted: this may yield a performance benefit.

4.5 Weighted Kendall's τ

As described, each position contributes uniformly to Kendall's τ. However, it can be desirable to give differing priority to different examples, for instance, to increase the importance of positions from underrepresented assessment categories, as described in Subsection 6.2, or to implement perturbation as described in Subsection 8.2. Weighted Kendall's τ involves, in addition to the preference lists X and Y, a third list Z = (z_1, z_2, …, z_m) that indicates the importance of each position. The degenerate case where all z_i = 1 gives Kendall's τ. Integer z_i have the straightforward interpretation of the position with preferences (x_i, y_i) being included in the data set z_i times, though the use of fractional z_i is possible.

Computing weighted Kendall's τ is similar to computing the original measure. When populating A, instead of adding 1 to the matrix cell corresponding to (x_i, y_i), we add z_i. Computation of S^+ and S^- remains unchanged. The denominator of (4.4) is adjusted to be the total weight of all pairs, rather than the total number of them. Values for weighted τ will not be limited to the range [-1, 1] if weights less than 1 are used. Because all example weights can be uniformly scaled, there is no need to use such weights.
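Populating A, with or without weights, is a small piece of code. The sketch below is our illustration of the construction of Subsection 4.4.2 together with the one-line change just described for the weighted variant (the names are ours):

```python
def build_matrix(xs, ys, zs=None):
    """Build A per (4.5): A[x][y] counts the positions with machine value e_x
    and human value o_y; with weights, each position adds z_i instead of 1."""
    e = sorted(set(xs))                      # E: distinct x values, ascending
    o = sorted(set(ys))                      # O: distinct y values, ascending
    xi = {v: i for i, v in enumerate(e)}
    yi = {v: i for i, v in enumerate(o)}
    a = [[0] * len(o) for _ in e]
    for i in range(len(xs)):
        a[xi[xs[i]]][yi[ys[i]]] += 1 if zs is None else zs[i]
    return a
```

Integer weights behave exactly like duplicated positions; fractional weights work too, subject to the denominator caveat above.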
For certain applications, such as the optimization procedure described in Section 5, the denominator of (4.4) is irrelevant, so this guideline need not be strictly adhered to. The implementations developed actually compute weighted τ. However, these implementations have not been thoroughly tested with non-uniform weights.

4.6 Application

Earlier, we defined analysts a_e and a_o and tuples E and O without explaining their names. Analyst a_o represents an oracle: ideally, every position is assessed with its game-theoretic value: win, draw, or loss. Analyst a_e represents an imperfect evaluator. Then, the concordance of the assessments E and O is a direct measurement of the quality of the estimates made by analyst a_e. We usually will not have access to an oracle, but we will have access to values that are superior to what the imperfect evaluator provides. In practice, a_o might be collected human expertise (and depending on the domain, possibly error-checked by a machine), or a second program that is known to be superior to the program being tuned. When neither is available, a_o can be constructed by conducting a look-ahead search using a_e as its evaluation function. In all cases, τ measures the degree of similarity between the two analysts.

4.7 An alternative ordinal metric

Pearson correlation is the most familiar correlation metric. It measures the linear correlation between two interval-scale variables. When the data is not already of interval scale, it is first ranked. In this case, the correlation measure is referred to as Spearman correlation, or Spearman's ρ. There is a special formula for computing ρ when the data is ranked; however, it is exact only in the absence of ties. In our application, we have seven categories of human assessment, but orders of magnitude more data points, so ties will be abundant. Therefore, it is most appropriate to apply Pearson's formula directly to the ranked data.
    \rho = \frac{n \sum x y - \sum x \sum y}
                {\sqrt{\left[ n \sum x^2 - \left( \sum x \right)^2 \right]
                       \left[ n \sum y^2 - \left( \sum y \right)^2 \right]}}    (4.15)

The value computed by this formula, being based on least squares, is affected greatly by outliers. While ranking the data may reduce this effect, it will not eliminate it. Changing a data point can change the correlation value computed, but there is no guarantee that an increase in value is meaningful. In contrast, the structure of Kendall's τ is such that no adjustment in the value of a machine assessment yields a higher concordance unless strictly more position pairs are ordered properly than before.

The asymptotic time complexity of computing Spearman's ρ is Θ(m log m). As we noted earlier, the time complexity of computing Kendall's τ is Θ(m log m + UV). Both algorithms sort their inputs, but afterwards, Spearman's ρ takes time Θ(m), while Kendall's τ takes time Θ(UV): for our intended use, these are equivalent. Not only does τ more directly measure what interests us ("for all pairs of positions (A, B), is position B better than position A?"), it is no less efficient to compute than the plausible alternative. Therefore, we use τ in our experiments.

5 Feature weight tuning via Kendall's τ

We will now attempt to tune feature weights by maximizing the value of Kendall's τ. After describing the gradient-ascent procedure by which new weights are selected, we discuss our distributed implementation of the hill-climber. The decision to implement a form of gradient ascent was made not because it was thought to be the most efficient method for learning effective weights, but because the behaviour of gradient ascent is well understood and predictable, allowing inferences about the proposed optimization metric, Kendall's τ, to be drawn from the results of hill-climbing. Once some success had been achieved, inertia led to the complete system described in this section.
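For comparison, the ranked-Pearson computation reads as follows. This sketch is ours (not the paper's code); it assigns average ranks to tied values, which is what makes applying (4.15) directly to the ranks appropriate when, as here, ties abound.

```python
def average_ranks(v):
    """Rank the data, giving tied values the mean of the ranks they occupy."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1                              # extend the run of ties
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # 1-based average rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson's product-moment formula (4.15), applied to the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sx, sy = sum(rx), sum(ry)
    sxy = sum(a * b for a, b in zip(rx, ry))
    sxx = sum(a * a for a in rx)
    syy = sum(b * b for b in ry)
    return (n * sxy - sx * sy) / (
        ((n * sxx - sx * sx) * (n * syy - sy * sy)) ** 0.5)
```

The sort dominates, giving the Θ(m log m) total noted above.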
A discussion of problems related to the practical application of gradient ascent for tuning evaluation-function weights (notably: performance) is deferred to Subsection 8.1.

5.1 Estimated gradient ascent

We wish to apply gradient ascent to find weights that maximize Kendall's τ. However, this metric is not continuous, and so not differentiable. Therefore, we must measure the gradient empirically. The procedure is:

1. determine τ for the current base vector
2. for each weight w in the current base vector:
   a. select a value δ by which to perturb the weight w⁶
   b. create two delta weight vectors by replacing the weight w from the current base vector with w+δ and w-δ, while leaving the other weights alone
   c. determine τ for these test weight vectors
   d. determine a new value for w for the base vector of the next iteration, based upon the three known values for τ.

In each iteration, the concordance of the current base vector – that is to say, the weight vector currently serving as a base from which to explore – is measured. In addition, for each weight being tuned, we generate two test weight vectors by perturbing that weight higher and lower while holding the other weights constant. We refer to these as delta weight vectors because they are only slightly different from the base weight vector. The two generated weight values are equidistant from the original value. The values of τ at these test vectors are also computed, after which a decision on how to adjust each weight is made. The cases considered are illustrated in Figure 2.

Figure 2. Distinct cases to be handled during estimated gradient ascent.

⁶ In our most recent implementation, δ begins at 1% of the current value of w, and the percentage used is gradually lowered in successive iterations. However, if a random value between 0 and 1 is larger than δ, it is used instead (for that weight and iteration only). This prevents an inability of a weight to move a meaningful distance whenever w is close to zero.
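The per-weight decision among the Figure 2 cases can be written down directly. In this sketch of ours (not the authors' code), eps is an assumed noise threshold and gain an assumed interpolation factor for the sloped case; the 3δ bound and the inverse parabolic interpolation for the concave-down case follow the text.

```python
def next_weight(w, delta, tau_lo, tau_base, tau_hi, eps=1e-4, gain=10.0):
    """One per-weight update decision.  tau_lo, tau_base, tau_hi are the
    concordances measured at w - delta, w, and w + delta respectively."""
    # roughly flat: keep the current weight (the precautionary case)
    if abs(tau_hi - tau_base) < eps and abs(tau_lo - tau_base) < eps:
        return w
    # concave up: greedily move to the sampled point with highest concordance
    if tau_hi > tau_base and tau_lo > tau_base:
        return w + delta if tau_hi >= tau_lo else w - delta
    # concave down: inverse parabolic interpolation to the apex of the fit
    if tau_hi < tau_base and tau_lo < tau_base:
        return w + delta * (tau_lo - tau_hi) / (
            2 * (tau_lo - 2 * tau_base + tau_hi))
    # clearly ascending or descending: step along the slope, bounded by 3*delta
    slope = (tau_hi - tau_lo) / (2 * delta)
    return w + max(-3 * delta, min(3 * delta, gain * slope))
```

For a symmetric concave-down triple the apex sits at the current weight, so the weight is left in place; an asymmetric triple shifts it toward the better side.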
If there is no significant change in concordance between the current (base) vector and the test vectors, then the current value of the weight is retained. This precautionary category avoids large moves of the weight where the potential reward does not justify the risk of making a misstep.

When τ either ascends or descends in the region as the weight is increased, the new weight value is interpolated from the slope defined by the sampled points. The maximum change from the previous weight is bounded to 3 times the distance of the test points, to avoid occasional large swings in parameter settings.

When the points are concave up, we adopt the greedy strategy of moving directly to the weight value known to yield the highest concordance. Of course, this must be one of the tested values.

When the points are concave down, we can do better than simply keeping the weight that yielded the highest measured concordance. Inverse parabolic interpolation is used to select the new weight value at the apex of the parabola that fits through the three points, in the hope that this will lead us to the highest τ in the region.

Once this procedure has been performed for all of the weights being tuned, it is possible to postprocess the weight changes, for instance to normalize them. However, this does not seem to be necessary. The chosen values now become the new base vector for the next iteration. As with typical gradient-ascent algorithms, the step size taken along the calculated slope when the concordances are clearly ascending or descending slowly decreases throughout the execution of the algorithm.

It remains to be said that simpler variations of this hill-climbing algorithm, which sampled only one point per weight or did not specially handle the roughly flat and concave cases, fared more poorly in practice than the one presented here.

5.2 Distributed computation

The work presented in Subsection 4.3 allows us to compute τ from two lists of preferences in negligible time.
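Putting Subsection 5.1 together, the whole hill-climb can be condensed as follows. This is our illustrative sketch, not the authors' implementation: δ follows footnote 6 (1% of the weight, with the random fallback, though without the gradual lowering), and step 2d is simplified to keeping whichever of the three sampled values scored best rather than interpolating between the cases above.

```python
import random

def tune(base, tau_of, iterations=100):
    """Condensation of steps 1-2d.  tau_of maps a weight vector to its
    concordance; each iteration samples two delta vectors per weight."""
    for _ in range(iterations):
        tau_base = tau_of(base)                        # step 1
        new = base[:]
        for j, w in enumerate(base):                   # step 2
            delta = abs(w) * 0.01                      # 2a: 1% of the weight
            r = random.random()
            if r > delta:
                delta = r                              # fallback near zero
            lo, hi = base[:], base[:]
            lo[j], hi[j] = w - delta, w + delta        # 2b: delta vectors
            tau_lo, tau_hi = tau_of(lo), tau_of(hi)    # 2c
            # 2d (simplified): keep whichever sampled value scored best
            new[j] = max((tau_base, w),
                         (tau_lo, w - delta),
                         (tau_hi, w + delta))[1]
        base = new
    return base
```

Each iteration costs 2k + 1 concordance evaluations for k tuned weights, which is what motivates distributing them.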
However, establishing the level of concordance at sampled weight vectors nonetheless dominates the running time of the gradient ascent, because computing the machine assessments for each of the training positions is expensive. Distributing the work for a single weight vector would lead to a large communication overhead, because all of the preferences must be made available to a single machine to compute τ. Instead, we compute all assessments for a weight vector on a single processor, after which the concordance is immediately computed.

A supervisor process queues work in a database table to be performed by worker processes. It then sleeps, waking up periodically to check whether all concordances have been computed. By contrast, worker processes idle until they find work in the queue that they can perform. They reserve it, perform it, store the value of τ computed, and then look for more work to do. Once all concordances have been computed, the supervisor process logs the old base vector and its concordance. Additionally, when any tested weight vector's concordance exceeds the best concordance yet present in the work history, the tested weight vector with maximum concordance is also recorded. The supervisor process then computes the new current base vector, computes the new test weight vectors, and replaces the old material in the work queue with the new.

Occasionally a worker process does not report back, for instance, when the machine it was running on has been rebooted. The supervisor process cancels the work reservation when a worker does not return a result within a reasonable length of time, so that another worker may perform it. The low coupling between the supervisor and worker processes follows the work by Pinchak et al. on placeholder scheduling [33].

6 Application-specific issues

Here we detail some points of interest of the experimental design.

6.1 Chess engine

Many chess programs, or chess engines, exist.
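The reservation protocol can be miniaturized as follows. This sketch is ours alone: it uses an in-memory dict where the paper uses a database table, and the names worker_step, requeue_stale, and the timeout value are our invention, not part of the described system.

```python
import time

def worker_step(queue, results, tau_of, now=None):
    """One pass of a worker: reserve the first unreserved job, compute its
    concordance, store the result.  Returns the job id, or None if idle."""
    now = time.time() if now is None else now
    for vid, job in queue.items():
        if vid not in results and job["reserved_at"] is None:
            job["reserved_at"] = now                 # reserve the work
            results[vid] = tau_of(job["vector"])     # perform it, store tau
            return vid
    return None

def requeue_stale(queue, results, timeout, now=None):
    """Supervisor side: cancel reservations that produced no result within
    the timeout (worker presumed dead), so another worker may take the job."""
    now = time.time() if now is None else now
    for vid, job in queue.items():
        if vid not in results and job["reserved_at"] is not None \
                and now - job["reserved_at"] > timeout:
            job["reserved_at"] = None
```

In the real system the reserve and store steps are separate transactions; collapsing them here keeps the sketch short while preserving the requeue-on-timeout behaviour.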
Some are commercially available; most are hobbyist efforts. For our work, we selected CRAFTY, by Robert Hyatt [21] of the University of Alabama. CRAFTY is the best chess-engine choice for our work for several reasons: the source was readily available to us, facilitating experimentation; it is the strongest such open-source engine today; and previous research has already been performed using CRAFTY. Most of our work was performed with version 19.1 of the program.

6.2 Training data

To assess the correlation of τ with improved play, we used 649,698 positions from Chess Informant 1 through 85 [36]. These volumes cover the important chess games played between January 1966 and September 2002. This data set was selected because it contains a variety of assessed positions from modern grandmaster play, the assessments are made by qualified individuals, it is accessible in a non-proprietary electronic form, and chess players around the world are familiar with it. The 649,698 positions are distributed amongst the human assessment categories and side to move as shown in Table 10.

Table 10. Distribution of Chess Informant positions (the rows are the seven human-assessment categories; the symbols do not survive this rendering)

    Human assessment   Black is to move   White is to move   Total positions
    (category 1)            153,182             1,720            154,902
    (category 2)            123,261             6,513            129,774
    (category 3)             65,965            15,543             81,508
    (category 4)             35,341            72,737            108,078
    (category 5)              5,205            32,775             37,980
    (category 6)              2,522            55,742             58,264
    (category 7)                889            78,303             79,192

It is evident from the distribution of positions shown that when annotating games, humans are far more likely to make an assessment in favour of one player after that player has just moved. (When the assessment is one of equality, the majority of the time it is given after each player has played an equal number of moves.) It is plausible that a machine-learning algorithm could misinterpret this habit and deduce that it is disadvantageous to be the next to play. We postpone discussion of the irregular number of positions in each category until Subsection 7.2.2.

The “random sample” is a randomly selected 32,768-position subset of the 649,698 positions.
The “stratified sample” is a stratified random sample of the 649,698 positions, including 2,341 positions of each category and side to move. In two categories, where 2,341 positions were not available, 4,194,304-node searches were performed from positions with the required assessment but the opposite side to move. The best move found by CRAFTY was played, and the resulting position was used. It is not guaranteed that the move made was not a mistake that would change the human's assessment of the position. However, it is guaranteed that no position and its computationally generated successor were both used.

There are alternate ways that the positions could have been manipulated to generate the stratified sample, none appearing to have significant benefits over the approach chosen here. However, adjusting the weight assigned to each example in inverse proportion to the frequency of its human assessment, as is possible via the weighting procedure given in Subsection 4.5, is a serious alternative. This was not done because it was deemed worthwhile to learn from more positions, and because our code to compute Kendall's τ has not been frequently exercised with non-uniform weights.

6.3 Test suites

English chess grandmaster John Nunn developed the Nunn [29] and Nunn II [31] test suites of 10 and 20 positions, respectively. They serve as starting positions for matches between computer-chess programs, where the experimenter is interested in the engine's playing skill independent of the quality of its opening book. Nunn selected positions that are approximately balanced, commonly occur in human games, and exhibit variety of play. We refer to these collectively as the “Nunn 30”. Don Dailey, known for his work on the computer-chess programs STARSOCRATES and CILKCHESS, created a collection of two hundred commonly reached positions, all of which are ten ply from the initial position. We refer to these collectively as the “Dailey 200”.
Dailey has not published these positions himself, but all of the positions in the Dailey 200, and also the Nunn 30, are available in Gomboc's M.Sc. thesis [16].

6.4 Use of floating-point computation

We modified CRAFTY so that variables holding machine assessments are declared to be of an aliased type rather than directly as integers. This allows us to choose whether to use floating-point or integer arithmetic via a compilation switch. When compiled to use floating-point values for assessments, CRAFTY is slower, but only by a factor of two to three on a typical personal computer. Experiments were performed with this modified version: the use of floating-point computation provides a learning environment where small changes in values can be rewarded. However, all test matches were performed with the original, integer-based evaluation implementation (the learned values were rounded to the nearest integer): in computer-chess competition, no author would voluntarily take a 2% performance hit, much less one of 200%.

It might strike the reader as odd that we chose to alter CRAFTY in this manner rather than scaling up all the evaluation-function weights. There are significant practical disadvantages to that approach. How would we know that everything had been scaled? It would be easy to miss some value that needed to be changed. How would we identify overflow issues? It might be necessary to switch to a larger integer type. How would we know that we had scaled up the values far enough? It would be frustrating to have to repeat the procedure. By contrast, the choice of converting to floating point is safer. Precision and overflow are no longer concerns. Also, by setting the typedef to be a non-arithmetic type, we can cause the compiler to emit errors wherever type mismatches exist. Thus, we can be more confident that our experiments rest upon a sound foundation.

6.5 Search effort quantum

Traditionally, researchers have used search depth to quantify search effort.
For our learning algorithm, doing so would not be appropriate: the amount of effort required to search to a fixed depth varies wildly between positions, and we will be comparing the assessments of these positions. However, because we did not have the dedicated use of computational resources, we could not use search time either. While it is known that chess engines tend to search more nodes per second in the endgame than in the middlegame, this difference is insignificant for our short searches because it is dwarfed by the overhead of preparing the engine to search an arbitrary position. Therefore, we chose to quantify search effort by the number of nodes visited.

For the experiments reported here, we instructed CRAFTY to search either 1024 or 16,384 nodes to assess a position. Early experiments that directly called the static evaluation or quiescence-search routines to form assessments were not successful. Results with 1024 nodes per position were historically of mixed quality, but improved as we improved the estimated gradient-ascent procedure. It is our belief that with a larger training set, calling the static evaluation function directly would perform acceptably.

There are positions in our data set from which CRAFTY does not complete a 1-ply search within 16,384 nodes, because its quiescence search explores many sequences of captures. When this occurs, no evaluation score is available to use. Instead of using either zero or the statically computed evaluation (which is not designed to operate without a quiescence search), we chose to throw away the data point for that particular computation of τ, reducing the position count (m). However, the value of τ for similar data of different population sizes is not necessarily constant. As feature weights are changed, the shape of the search tree for positions may also change. This can cause CRAFTY to not finish a 1-ply search for a position within the node limit where it was previously able to do so, or vice versa.
When many transitions in the same direction occur simultaneously, noticeable irregularities are introduced into the learning process. Ignoring the node-count limitation until the first ply of search has been completed may be a better strategy.

6.6 Performance

Experiments were first performed using idle time on various machines in our department. In the latter stages of our research, we have had (non-exclusive) access to clusters of personal-computer workstations. This is helpful because, as discussed in Subsection 5.2, the task of computing τ for distinct weight vectors within an iteration is trivially parallel. Examining 32,768 positions at 1024 nodes per position takes about two minutes per weight vector; the cost of computing τ is negligible in comparison. So, in the best case, when there are enough nodes available for the concordances of all weight vectors of an iteration to be computed simultaneously, learning proceeds at the rate of 30 iterations per hour.

7 Experimental results

After demonstrating that concordance between human judgments and machine assessments increases with increasing depth of machine search, we attempt to tune the weights of 11 important features of the chess program CRAFTY.

7.1 Concordance as machine search effort increases

In Table 11 we computed τ for depths (plies) 1 through 10 for n = 649,698 positions, performing work equivalent to 211 billion (10⁹) comparisons at each depth. S^+ and S^- are reported in billions. As search depth increases, the difference between S^+ and S^-, and therefore τ, also increases. The sum of S^+ and S^- is not constant because at different depths different amounts of extra y-pairs and duplicate pairs are encountered.

Table 11.
τ computed for various search depths, n = 649,698

    depth    S^+ / 10⁹    S^- / 10⁹       τ
      1       110.374       65.298      0.2136
      2       127.113       48.934      0.3705
      3       131.384       45.002      0.4093
      4       141.496       36.505      0.4975
      5       144.168       34.726      0.5186
      6       149.517       30.136      0.5656
      7       150.977       29.566      0.5753
      8       152.792       22.938      0.6153
      9       153.341       22.368      0.6206
     10       155.263       20.435      0.6388

It is difficult to predict how close an agreement might be reached using deeper searches. Two effects come into play: diminishing returns from additional search, and diminishing accuracy of human assessments relative to ever more deeply searched machine assessments.

Particularly interesting is the odd-even effect on the change in τ as depth increases. It has long been known that searching to the next depth of an alpha-beta search requires relatively much more effort when that next depth is even than when it is odd [22]. Notably, τ tends to increase more in precisely these cases. This result, combined with knowing that play improves as search depth increases [49], in turn justifies our attempt to use this concordance as a metric to tune selected feature weights of CRAFTY's static evaluation function. That the concordance increases monotonically with increasing depth lends credibility to our belief that τ is a direct measure of decision quality.

7.2 Tuning of CRAFTY's feature weights

CRAFTY uses centipawns (hundredths of a pawn) as its evaluation-function resolution, so experiments were performed by playing CRAFTY as distributed versus CRAFTY with the learned weights rounded to the nearest centipawn. Each program played each position both as White and as Black. The feature weights we tuned are given, along with their default values, in Table 12.

Table 12.
Tuned features, with CRAFTY's default values

    feature                                   default value
    king safety scaling factor                     100
    king safety asymmetry scaling factor           -40
    king safety tropism scaling factor             100
    blocked pawn scaling factor                    100
    passed pawn scaling factor                     100
    pawn structure scaling factor                  100
    bishop                                         300
    knight                                         300
    rook on the seventh rank                        30
    rook on an open file                            24
    rook behind a passed pawn                       40

The six selected scaling factors were chosen because they act as control knobs for many subterms. Bishop and knight were included because they participate in the most common piece imbalances. Trading a bishop for a knight is common, so it is important to include both, to show that one is not learned to be of a certain weight chiefly because of the weight of the other. We also included three of the most important positional terms involving rooks. Material values for the rook and queen are not included because trials showed that they climbed even more quickly than the bishop and knight do, yielding no new insights. This optimization problem is not linear: dependencies exist between the weights being tuned.

7.2.1 Tuning from arbitrary values

Figure 3 illustrates the learning. The 11 parameters were all initialized to 50, where 100 represents both the value of a pawn and the default value of most scaling factors. For ease of interpretation, the legends of Figures 3, 4, and 5 are ordered so that their entries coincide with the intersection of the variable being plotted with the rightmost point on the x-axis. For instance, in Figure 3, bishop is the topmost value, followed by knight, then τ, and so on. τ is measured on the left y-axis in linear scale; weights are measured on the right y-axis in logarithmic scale, for improved visibility of the weight trajectories.

Rapid improvement is made as the bishop and knight weights climb swiftly to about 285, after which τ continues to climb, albeit more slowly. We attribute most of the improvement in τ to the proper determination of weight values for the minor pieces.
All the material and positional weights are tuned to reasonable values.

Figure 3. Change in weights from 50 as τ is maximized, random sample. [The plot shows τ (left y-axis, linear scale) and the 11 weights (right y-axis, logarithmic scale) over 600 iterations. Legend, top to bottom at the final iteration: bishop (50 -> 294); knight (50 -> 287); tau (0.2692 -> 0.3909); king tropism s.f. (50 -> 135); pawn structure s.f. (50 -> 106); blocked pawns s.f. (50 -> 76); passed pawn s.f. (50 -> 52); king safety s.f. (50 -> 52); rook on open file (50 -> 42); rook on 7th rank (50 -> 35); rook behind passed pawn (50 -> 34); king safety asymmetry s.f. (50 -> 8).]

The scaling factors learned are more interesting. The king tropism and pawn structure scaling factors gradually reached, then exceeded, CRAFTY's default values of 100. The scaling factors for blocked pawns, passed pawns, and king safety are lower, but not unreasonably so. However, the king safety asymmetry scaling factor dives quickly and relentlessly. This is unsurprising: CRAFTY's default value for this term is -40.

Tables 13 and 14 contain match results of the weight vectors at specified iterations during the learning illustrated in Figure 3. Each side plays each starting position both as White and as Black, so with the Nunn 30 test, 60 games are played, and with the Dailey 200 test, 400 games are played. Games reaching move 121 were declared drawn.

Table 13. Match results: random sample; 16,384 nodes per assessment; 11 weights tuned from 50 vs. default weights; 5 minutes per game; Nunn 30 test suite.

    iteration    wins    draws    losses    percentage score
        0          3       1        56            5.83
      100          3       9        48           12.50
      200         14      21        25           40.83
      300         21      26        13           56.67
      400         19      28        13           55.00
      500         18      26        16           51.67
      600         18      23        19           49.17

Table 14. Match results: random sample; 16,384 nodes per assessment; 11 weights tuned from 50 vs. default weights; 5 minutes per game; Dailey 200 test suite.
iteration wins draws losses 27 percentage score 0 100 200 300 400 500 600 3 12 76 128 129 107 119 13 31 128 152 143 143 158 384 357 196 120 128 150 123 2.38 6.88 35.00 51.00 50.13 44.63 49.50 The play of the tuned program improves drastically as learning occurs. Of interest is the apparent gradual decline in percentage score for later iterations on the Nunn 30 test suite. The DEEP THOUGHT team [1, 2, 20, 32] found that their best parameter settings were achieved before reaching maximum agreement with GM players. Perhaps we are also experiencing this phenomenon. We used the Dailey 200 test suite to attempt to confirm that this was a real effect, and found that by this measure too, the weight vectors at iterations 300 and 400 were superior to later ones. We conclude that the learning procedure yielded weight values for the variables tuned that perform comparably to values tuned by hand over years of games versus grandmasters. This equals the performance achieved by Schaeffer et al. [40]. 7.2.2 Tuning from CRAFTY’s default values We repeated the just-discussed experiment with one change: the feature weights start at CRAFTY’s default values rather than at 50. Figure 4 depicts the learning. Note that we have negated the values of the king safety asymmetry scaling factor in the graph so that we could retain the logarithmic scale on the right y-axis, and for another reason, for which see below. While most values remain normal, the king safety scaling factor surprisingly rises to almost four times the default value. Meanwhile, the king safety asymmetry scaling factor descends even below -100. The combination indicates a complete lack of regard for the opponent’s king safety, but great regard for its own. Table 15 shows that this conservative strategy is by no means an improvement. Table 15. Match results: random sample; 16,384 nodes per assessment; 11 weights tuned from defaults vs. default weights; 5 minutes per game; Nunn 30 test suite. 
    iteration   wins   draws   losses   percentage score
    25            19      23       18              50.83
    50            16      31       13              52.50
    75            11      32       17              45.00
    100           14      28       18              46.67
    125            9      23       28              34.17
    150            8      35       17              42.50

Figure 4: Change in Weights from CRAFTY’s Defaults as τ is Maximized, Random Sample. [Legend, initial -> final values: king safety s.f. 100 -> 362; tau 0.4186 -> 0.5130; bishop 300 -> 279; knight 300 -> 274; king safety asymmetry s.f. -40 -> -132 (plotted negated); king tropism s.f. 100 -> 119; blocked pawns s.f. 100 -> 111; pawn structure s.f. 100 -> 93; passed pawn s.f. 100 -> 88; rook behind passed pawn 40 -> 36; rook on 7th rank 30 -> 33; rook on open file 24 -> 26.]

The most unusual behaviour of the king safety and king safety asymmetry scaling factors deserves specific attention. When the other nine terms are left constant, these two terms behave similarly to how they do when all eleven terms are tuned. In contrast, when these two terms are held constant, no significant performance difference is found between the learned weights and CRAFTY’s default weights. When the values of the king safety asymmetry scaling factor are negated as in Figure 4, it becomes visually clear from their trajectories that the two terms are behaving in a co-dependent manner. Having identified this anomalous behaviour, it is worth looking again at Figure 3. The match results suggest that all productive learning occurred by iteration 400 at the latest, after which a small but perceptible decline appears to occur. The undesirable co-dependency between the king safety and king safety asymmetry scaling factors also appears to be present in the later iterations of the first experiment. Table 10 in Subsection 6.2 shows that the number of positions in each category varies widely. Let us consider the effect this has upon our optimization metric.
Our procedure maximizes the difference between the number of correct decisions made and the number of incorrect decisions made. Each position is treated as equally important, but there are more positions in certain categories. Consequently, with a lopsided sample, correct decisions involving human assessments that occur less frequently will be sacrificed to achieve a greater number of correct decisions involving those that occur more often. Undesirably, weights will be tuned not in accordance with strong play over the complete range of human assessments, but instead to maximize decision quality amongst the overrepresented human assessments. Accordingly, we retried tuning weights from their default values, but using the stratified sample so that the learning would be biased appropriately. Additionally, a look-ahead limit of just 1024 nodes was used, trading away machine assessment accuracy to gain an increased number of iterations, because we desire to demonstrate playing-strength stability. We can see that the co-dependency is no longer present in Figure 5. Validating our thought experiment regarding the reason for its previous existence, the match results in Table 16 demonstrate that the learner is now performing acceptably.

Figure 5: Change in Weights from CRAFTY’s Defaults as τ is Maximized, Stratified Sample. [Legend, initial -> final values: bishop 300 -> 291; knight 300 -> 290; tau 0.4516 -> 0.4568; king tropism s.f. 100 -> 125; pawn structure s.f. 100 -> 101; king safety s.f. 100 -> 97; passed pawn s.f. 100 -> 94; blocked pawns s.f. 100 -> 67; king safety asymmetry s.f. -40 -> -42 (plotted negated); rook on open file 24 -> 40; rook on 7th rank 30 -> 38; rook behind passed pawn 40 -> 18.]

Compared against the earlier plots, the trajectory of τ has a relatively jagged appearance. Unlike in the other two graphs, here the concordance is not climbing at a significant rate. Consequently, the left y-axis has been set to provide a high level of detail. In addition, because only 1024 nodes of search are being used to form evaluations instead of 16,384, changes in weights are more likely to affect the score at the root of the tree, causing τ to fluctuate more. This effect can be countered by using a larger training set. Most weights remained relatively constant during the tuning. The king tropism term climbed quickly to the neighbourhood of 125 centipawns, then remained roughly level; this is in line with previous tuning attempts. Also, both placing a rook on an open file and posting a rook upon the seventh rank again reached the region of thirty-five to forty hundredths of a pawn.

Table 16. Match results: stratified sample; 1024 nodes per assessment; 11 weights tuned from defaults vs. default weights; 5 minutes per game; Nunn 30 test suite.

    iteration   wins   draws   losses   percentage score
    0             14      36       10              53.33
    200           17      26       17              50.00
    400            9      26       25              36.67
    600           18      21       21              47.50
    800           14      25       21              44.17
    1000          17      28       15              51.67
    1200          18      26       16              51.67
    1400          16      30       14              51.67
    1600          21      19       20              50.83
    1800           9      34       17              43.33
    2000          14      33       13              50.83
    2200          13      27       20              44.17
    2400          15      30       15              50.00
    2600          11      31       18              44.17
    2800          14      30       16              48.33
    3000          13      33       14              49.17
    3200          15      29       16              49.17

The value of placing a rook behind a passed pawn descended considerably over the learning period, which is interesting insofar as humans consider this an important feature. Probably the reason is that its application is insufficiently specific: there are many positions where it is slightly beneficial, but occasionally it can be a key advantage.
Rather than attempt to straddle the two cases, as CRAFTY’s default value of 40 for this weight does, it could be preferable to give a smaller bonus, but include a second, more specific feature that provides an extra reward when its additional criteria are satisfied. Finally, we are left with the blocked pawns scaling factor. For a long time, this weight remained near its default of 100, but at approximately iteration 2200 it began to fall, and continued falling for the next 1000 iterations, with no clear indication that it would stop anytime soon. The purpose of this term is for CRAFTY to penalize itself when many pawns are blocked, so that it avoids such positions when playing humans, who have a comparative skill advantage in such positions. It is no surprise that when tuning against human assessments this feature weight goes down, because it is not bad per se for a position to be blocked. Here we have not merely a case where the program author has a different objective from the one our metric measures, but a philosophical difference between Robert Hyatt, the program’s author, and ourselves. He has set his weights to provide the best performance in his test environment, which is playing against humans on internet chess servers. This is an eminently reasonable decision for someone who feels that their time is better spent improving their program’s parallel search than its static evaluation function. We believe that a better long-term strategy is to tackle the evaluation problem head-on: do not penalize playing into perfectly acceptable positions that the machine subsequently misplays. Instead, accept the losses, and implement new evaluation features that shore up its ability to withstand such positions against humans. In the end, the program will be stronger for it.
The match results do not indicate a change in playing strength due to this weight changing, which is unsurprising given that if such positions were reached, neither the standard nor the tuned CRAFTY would have any advantage over the other in understanding them. Both would be at a relative disadvantage when playing a program that included additional evaluation features as suggested above.

8 Conclusion

We have proposed a new procedure for optimizing static evaluation functions based upon globally ordering a multiplicity of positions in a consistent manner. This application of ordinal correlation is fundamentally different from prior evaluation-function tuning techniques, which attempt to select the best moves in certain positions and hope that this skill is generalizable, or bootstrap from zero information. If we view the playing of chess as a Markov decision process, we can say that rather than attempting to learn a policy (what move to play?) directly, we attempt to learn an appropriate value function (how good is the position?), which in turn specifies the policy to be followed when in particular states. Thus, we combine an effective idea behind temporal-difference learning with supervised learning. Prior, successful supervised learning work by Buro [7, 8] and Tesauro [46, 48] also shares this property. Alternatively, we can view the preferences over the set of pairs of positions as constraints: we want to maximize the number of inequalities that are satisfied. As shown in Section 4, our method powerfully leverages m position evaluations into m(m-1)/2 constraints. A cautionary note: we tuned feature weights in accordance with human assessments. Doing so may simply not be optimal for computer play. Nonetheless, it is worth noting that having reduced the playing ability of a grandmaster-level program to candidate-master strength by significantly altering several important feature weights, the learning algorithm was able to restore the program to grandmaster strength.
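To make the pairwise-constraint view concrete, the following sketch (our own illustration, with hypothetical variable names; the paper's partial-sum transform computes the same quantity far more efficiently than this naive O(m²) loop) counts how many of the m(m-1)/2 pairwise orderings implied by the human assessments are satisfied by the machine evaluations:

```python
def concordance(human, machine):
    """Return (concordant - discordant) / total pairs, in [-1, 1].

    `human` holds ordinal human assessments, `machine` holds engine
    scores; both are illustrative stand-ins for the paper's data.
    """
    m = len(human)
    concordant = discordant = 0
    for i in range(m):
        for j in range(i + 1, m):
            h = human[i] - human[j]
            e = machine[i] - machine[j]
            if h * e > 0:
                concordant += 1      # the two orderings agree
            elif h * e < 0:
                discordant += 1      # the two orderings disagree
            # ties (h == 0 or e == 0) count toward neither
    return (concordant - discordant) / (m * (m - 1) / 2)

# Three positions, one swapped pair: 2 concordant, 1 discordant.
print(concordance([1, 2, 3], [10, 30, 20]))  # -> 0.3333333333333333
```

Maximizing this quantity over the feature weights, rather than any per-position error, is what makes the metric ordinal: only the relative ordering of evaluations matters.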
8.1 Reflection

While some weights learned nearly identical values in our experiments, other features exhibited more variance. For cases such as the blocked pawns scaling factor, it appears that comparable performance may be achieved across a relatively wide range of values. The amount of training data used is small enough that overfitting may be a consideration. The program modification that would most readily allow this to be determined is to have the program compute τ on a second, control set of positions used only for testing. If the concordance for the training set continues to climb while the concordance for the control set declines, overfitting is detected. The time to perform our experiments was dominated by the search effort required to generate machine assessments. Therefore, there is no significant obstacle to attempting to maximize Spearman’s ρ (or perhaps even Pearson correlation, notwithstanding Stevens) instead. Future learning experiments should ultimately use more positions, because the information gathered to make tuning decisions grows quadratically with the size of the position set. We believe that doing this will reduce the search effort required per position to tune weights well. If sufficient positions are present, tuning based only upon the static evaluation can be performed, saving much time. For instance, piece-square tables for the six chess pieces imply 6 * 64 = 384 features, but for any single position, at most 32 of them can have any effect. Furthermore, CRAFTY’s evaluation is actually more dependent upon table lookup than this simple example shows. In addition, it is not necessary to use the entire training set at each iteration of the algorithm. One could select different samples for each iteration, thereby increasing the total number of positions that strong weight vectors are compared against.
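The per-iteration resampling idea can be sketched as follows; `positions` and `concordance_for` are hypothetical stand-ins for the training data and for whatever routine computes τ for the current weight vector over a set of positions:

```python
import random

def resampled_iterations(positions, concordance_for, iterations,
                         batch_size, seed=0):
    """Score each tuning iteration on a fresh random subsample rather
    than on the full training set. Over many iterations the weights
    are thereby compared against far more positions than any single
    batch contains, at a fraction of the per-iteration cost."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    history = []
    for _ in range(iterations):
        batch = rng.sample(positions, batch_size)
        history.append(concordance_for(batch))
    return history
```

Sampling without replacement within each batch (as `random.sample` does) avoids degenerate duplicate pairs, while successive batches remain independent draws.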
Alternatively, one can start with very few positions, perhaps 1024, and slowly increase the number of positions under consideration as the number of iterations increases. This would speed the adjustment of important weights whose poor values are apparent from relatively few examples, before more positions are brought in for finer tuning. Combining these ideas is also possible. Precomputing and making a copy of CRAFTY’s internal board representations for each of the test positions could yield a further significant speedup. When computing the concordance for a weight vector, it would then be sufficient to perform one large memory copy, then to search each position in turn. If the static evaluation function is called directly, it is even possible to precompute a symbolic representation of the position. A training-time speedup of orders of magnitude should be possible with these changes. However, we would like to state that it was not originally planned to maximize τ based only upon assessments at a specific level of search effort. Unfortunately, we encountered implementation difficulties, and so reverted to the approach described herein. We had intended to log the node number or time point along with the new score whenever the evaluation of a position changed. This would have provided, without the use of excessive storage, the precise score at any point throughout the search. We would then have tuned to maximize the integral of τ over the period of search effort. Implementation of this algorithm would more explicitly reward reaching better evaluations more quickly, improving the likelihood of tuning feature weights and perhaps even search-control parameters effectively. Here too, precomputation would be worthwhile. The gradient-ascent procedure is effective, but convergence acceleration is noticeably lacking.
It is an arduous climb for parameters to move from 50 to near 300, not because there was any doubt that they would get there, but simply because of the number of iterations required to move the requisite distance. Fortunately, this is less likely to be a factor in practical use, where previously tuned weights would likely be started near their existing values. Perhaps the most problematic issue with the gradient-ascent implementation occurs when weights are near zero, where it is no longer sufficient to generate a sample point by multiplying the current weight by a value slightly greater than one. Enforcing a strict minimum distance between the existing weight and its test points is not sufficient either: the weight may become trapped. We inject some randomness into the sampling procedure when near zero, but it would be premature to say that this is a complete solution. It would be worthwhile to attempt approaches other than gradient ascent. The original motivation for implementing the hill-climber was to be able to observe how our fitness metric performed. It was felt that a gradient-ascent procedure would provide superior insight into its workings relative to an evolutionary approach, due to its more predictable behaviour. As previously mentioned in Subsection 7.2, CRAFTY uses weights expressed in centipawns, so when we test weights with CRAFTY, we round them to the nearest centipawn. Sometimes much time is spent tuning a weight to make small changes that are then lost when this truncation of precision occurs. If an alternative method, for instance an evolutionary method, can succeed while operating only at the resolution of centipawns, strong feature-weight vectors could be identified more quickly. Additionally, the running time per iteration of evolutionary methods is independent of the number of weights being tuned. With such a method, we could attempt to tune many more features simultaneously. Schaeffer et al.
[40] found that when using temporal-difference learning, it is best to tune weights at the level of search effort that one will be applying later. We believe this is because the objective function is constantly updated as part of that learning algorithm. In contrast, with the procedure presented herein, the objective function is determined separately, in advance. Therefore, if one chooses to use a specific level of search effort to determine the pseudo-oracle values of states, as described in Subsection 4.6, then directly tuning the static evaluation function should yield values that perform well in practice when the program plays with a similar level of search effort. It may be argued that deeper searches would lead to different preferred weights because more positions of disparate character will be reached by the search. However, this too can be mimicked by increasing the number of positions in the training set. A pseudo-oracle is not an oracle, so when the value of τ improves significantly, it makes sense to redetermine the objective function using the new weights that have been learned, and resume the gradient-ascent procedure (without resetting the weight values, of course). Temporal-difference learning elegantly finesses this issue precisely because the objective function is augmented at each learning step.

8.2 Future directions

The use of positions labelled as “unclear” or “with compensation for the material” may be possible by treating such assessments as being in concordance with selected other categories, and in discordance with the remainder. The specific categories treated as equivalent for these two assessments would necessarily vary depending upon the source of the assessments. For example, with data from Chess Informant [36], an assessment of “unclear” likely indicates that neither side has a clear advantage, but for data from Nunn’s Chess Openings [30], an unclear assessment may be treated as if it were an assessment of equality.
In a third work, an unclear assessment may simply mean that the annotator is unable or unwilling to disclose an informative assessment. Similar issues arise when handling the “with compensation for the material” assessment. The use of other, non-assessment annotation symbols (e.g., “with the attack”, “with the initiative”) could also be explored. While our experiments used chess assessments from humans, it is possible to use assessments from deeper searches and/or from a stronger engine, or to tune a static evaluation function for a different domain. Depending on the circumstances, merging consecutively ordered fine-grained assessments into fewer, larger categories might be desirable. Doing so could even become necessary should the computation of τ dominate the time per iteration, but this is unlikely unless one uses both an enormous number of positions and negligible time to form machine assessments. Examining how concordance values change as the accuracy of machine assessments is artificially reduced would provide insight into the amount of precision that the heuristic evaluation function should be permitted to express to the search framework that calls it. Too much precision reduces the frequency of cut-offs, while insufficient precision would result in the search receiving poor guidance. Elidan et al. [12] found that perturbation of training data could assist in escaping local maxima during learning. Our implementation of τ, designed with this finding in mind, allows non-integer weights to be assigned to each cell. Perturbing the weights in an adversarial manner as local maxima are reached, so that positions are weighted slightly more important when generally discordant, and slightly less important when generally concordant, could allow the learner to continue making progress.
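One possible form of this adversarial perturbation is sketched below, under our own assumptions (not an implementation from the paper): each position carries an importance weight, and a per-position concordance score in [-1, 1] summarizes how concordant that position's pairings currently are.

```python
def perturb_position_weights(weights, per_position_concordance, step=0.05):
    """Nudge position-importance weights against the learner near a
    local maximum: mostly-discordant positions (score near -1) gain
    importance, mostly-concordant ones (score near +1) lose it.

    `step` controls the perturbation strength; the floor keeps every
    weight positive. All names here are illustrative assumptions.
    """
    return [max(1e-6, w * (1.0 - step * c))
            for w, c in zip(weights, per_position_concordance)]
```

Because the reweighting is small and targeted, the objective surface shifts just enough to dislodge the climber without discarding what has already been learned.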
It would also be worthwhile to examine positions of maximum disagreement between human and machine assessments, in the hope that studying the resulting positions will identify new features that are not currently present in CRAFTY’s evaluation. Via this process, a number of labelling errors would also be identified and corrected. However, because our metric is robust in the presence of outliers, we do not believe that this would have a large effect on the outcome of the learning process. Improvements in data accuracy could allow quicker learning and a superior resulting weight vector, though we suspect the differences would be small. Integrating the supervised learning procedure developed here with temporal-difference learning, as suggested by Utgoff and Clouse [51], would be interesting. A popular pastime amongst computer-chess hobbyists is to attempt to discover feature-weight settings that result in play mimicking their favourite human players. By tuning against appropriate training data, e.g., from opening monographs and analyses published in Chess Informant and elsewhere that are authored by the player to be mimicked, it should now be possible to train an evaluation function to assess positions similarly to how that particular player might actually do so. Furthermore, there are human chess annotators known for their diligence and accuracy, and others notorious for their lack thereof. Computing values of τ pitting an individual’s annotations against the assessments that a program makes after a reasonable amount of search would provide an objective metric of the quality of their published analysis. Producers of top computer-chess software play many games against their commercial competitors. They could use our method to model their opponent’s evaluation function, then use this model in a minimax (no longer negamax) search.
Matches then played would be more likely to reach positions where the two evaluation functions differ most, providing improved winning chances for the program whose evaluation function is more accurate, and object lessons for the subsequent improvement of the other. For this particular application, using a least-squares error function rather than Kendall’s τ could be appropriate, if the objective is to home in on the exact weights used by the opponent.7 Identifying the most realistic mapping of CRAFTY’s machine assessments to the seven human positional assessments is also of interest. This information would allow CRAFTY (or a graphical user interface connected to CRAFTY) to present scoring information in a human-friendly format alongside the machine score. We can measure how correlated a single feature is with success by setting the weight of that feature to unity, setting all other feature weights to zero, and computing Kendall’s τ. This creates the possibility of applying τ as a mechanism for feature selection. However, multiple features may interact: when testing the combined discriminating power of two or three features, it would be necessary to hold one constant and optimize the weights of the others. The many successes of temporal-difference learning have demonstrated that it can tune weights to equal the best hand-tuned weights. We believe that, modulo the current performance issues, the supervised learning approach described here is also this effective. Learning feature weights is something that the AI community knows how to do well. While there are undoubtedly advances still to be made with respect to feature-weight tuning, for the present it is time to refocus on automated feature generation and selection, the last pieces of the puzzle for fully automated evaluation-function construction.
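The single-feature screening just described can be sketched as follows; `kendall_tau` stands for any Kendall-τ implementation, and the feature names and dictionary layout are our own illustrative assumptions:

```python
def rank_features_by_tau(feature_values, targets, kendall_tau):
    """Feature selection via concordance: with one feature's weight set
    to unity and all others to zero, the evaluation reduces to that
    feature's raw value, so its tau against the target assessments
    measures its individual discriminating power.

    `feature_values` maps feature name -> per-position values;
    `targets` holds the per-position assessments (or outcomes).
    """
    scores = {name: kendall_tau(values, targets)
              for name, values in feature_values.items()}
    # Rank by |tau|: a strongly anti-correlated feature is as
    # informative as a strongly correlated one (flip the weight's sign).
    return sorted(scores.items(), key=lambda kv: -abs(kv[1]))
```

Interactions between features are invisible to this screen, which is why, as noted above, testing two or three features jointly requires holding one constant while optimizing the others.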
Acknowledgements We would like to thank Yngvi Björnsson, for the use of his automated game-playing software, and for fruitful discussions; Don Dailey, for access to his suite of 200 test positions; Robert Hyatt, for making CRAFTY available, and also answering questions about its implementation; Peter McKenzie, for providing PGN to EPD conversion software; NSERC, for partial financial support [Grant OPG 7902 (Marsland)]; all those who refereed the material presented herein. 7 More than one commercial chess software developer was somewhat perturbed by this possibility after this material was presented at the Advances in Computer Games 10 conference [15], which was held in conjunction with the 2003 World Computer Chess Championship. Newer versions of some programs will be attempting to conceal information that makes such reverse engineering possible. Ideas that were discussed included not outputting the principal variation at low search depths, truncating the principal variation early when it is displayed, declining to address hashing issues that occasionally cause misleading principal variations to be emitted, and mangling the low bits of the evaluation scores reported to users. Commercial chess software developers must walk a fine line between effectively hindering reverse engineering by their competitors and displeasing their customers. 36 References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] Anantharaman, T. S. (1990). A Statistical Study of Selective Min-Max Search in Computer Chess. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, U.S.A. University Report CMU-CS-90-173. Anantharaman, T. S. (1997). Evaluation Tuning for Computer Chess: Linear Discriminant Methods. ICCA Journal, Vol. 20, No. 4, pp. 224-242. Baxter, J., Tridgell, A., and Weaver, L. (1998). KNIGHTCAP: A Chess Program that Learns by Combining TD( ) with Game-tree Search. 
Proceedings of the Fifteenth International Conference in Machine Learning (IMCL) pp. 28-36, Madison, WI. Beal, D. F. and Smith, M. C. (1997). Learning Piece Values Using Temporal Differences. ICCA Journal, Vol. 20, No. 3, pp. 147-151. Beal, D. F. and Smith, M. C. (1999a). Learning Piece-Square Values using Temporal Differences. ICCA Journal, Vol. 22, No. 4, pp. 223-235. Beal, D. F. and Smith, M. C. (1999b). First Results from Using Temporal Difference Learning in Shogi. Computers and Games (eds. H. J. van den Herik and H. Iida), pp. 113125. Lecture Notes in Computer Science 1558, Springer-Verlag, Berlin, Germany. Buro, M. (1995). Statistical Feature Combination for the Evaluation of Game Positions. Journal of Artificial Intelligence Research 3, pp. 373-382, Morgan Kaufmann, San Francisco, CA. Buro, M. (1999). From Simple Features to Sophisticated Evaluation Functions. Computers and Games (eds. H. J. van den Herik and H. Iida), pp. 126-145. Lecture Notes in Computer Science 1558, Springer-Verlag, Berlin, Germany. Chellapilla, K., and Fogel, D. (2001). Evolving a Checkers Player without Relying on Human Experience. In IEEE Transactions on Evolutionary Computation, Vol. 5, No. 4, pp. 422-428. Chellapilla, K., and Fogel, D. (1999). Evolving Neural Networks to Play Checkers Without Relying on Expert Knowledge. In IEEE Transactions on Neural Networks, Vol. 10, No. 6, pp. 1382-1391. Cliff, N. (1996). Ordinal Methods for Behavioral Data Analysis. Lawrence Erlbaum Associates. Elidan, G., Ninio, M., Friedman, N., and Schuurmans, D. (2002). Data Perturbation for Escaping Local Maxima in Learning. Proceedings AAAI 2002 Edmonton, pp. 132-139. Fawcett, T. E. and P. E. Utgoff (1992). Automatic feature generation for problem solving systems. Proceedings of the Ninth International Conference on Machine Learning, pp. 144-153. Morgan Kaufman. Fürnkranz, J. (2001). Machine Learning in Games: A Survey. In Machines that Learn to Play Games (eds. J. Fürnkranz and M. Kubat), pp. 11-59. 
Nova Scientific Publishers. ftp://ftp.ai.univie.ac.at/papers/oefai-tr-2000-31.pdf Gomboc, D., Marsland, T. A., and Buro, M. (2003). Ordinal Correlation for Evaluation Function Tuning. Advances in Computer Games: Many Games, Many Challenges (eds. H. J. van den Herik et al.), pp. 1-18. Kluwer Academic Publishers. Gomboc, D. (2004). Tuning Evaluation Functions by Maximizing Concordance. M.Sc. Thesis, University of Alberta, Edmonton, Alberta, Canada. http://www.cs.ualberta.ca/~games/papers/theses/gomboc_msc.pdf. Hartmann, D. (1987a). How to Extract Relevant Knowledge from Grandmaster Games, Part 1: Grandmasters have Insights – the Problem is What to Incorporate into Practical Programs. ICCA Journal, Vol. 10, No. 1, pp. 14-36. Hartmann, D. (1987b). How to Extract Relevant Knowledge from Grandmaster Games, Part 2: The Notion of Mobility, and the Work of De Groot and Slater. ICCA Journal, Vol. 10, No. 2, pp. 78-90. 37 [19] Hartmann, D. (1989). Notions of Evaluation Functions tested against Grandmaster Games. In Advances in Computer Chess 5 (ed. D.F. Beal), pp. 91-141. Elsevier Science Publishers, Amsterdam, The Netherlands. [20] Hsu, F.-h., Anantharaman, T. S., Campbell, M. S., and Nowatzyk, A. (1990). DEEP THOUGHT. In Computers, Chess, and Cognition (eds. T. A. Marsland and J. Schaeffer), pp. 55-78. Springer-Verlag. [21] Hyatt, R.M. (1996). CRAFTY – Chess Program. ftp://ftp.cis.uab.edu/pub/hyatt/v19/crafty19.1.tar.gz. [22] Kaneko, T., Yamaguchi, K., and Kawai, S. (2003). Automated Identification of Patterns in Evaluation Functions. Advances in Computer Games: Many Games, Many Challenges (eds. H. J. van den Herik et al.), pp. 279-298. Kluwer Academic Publishers. [23] Kendall, G. and Whitwell, G. (2001). An Evolutionary Approach for the Tuning of a Chess Evaluation Function using Population Dynamics. Proceedings of the 2001 IEEE Congress on Evolutionary Computation, pp. 995-1002. http://www.cs.nott.ac.uk/~gxk/papers/cec2001chess.pdf. [24] Levinson, R. and Snyder, R. 
(1991). Adaptive Pattern-Oriented Chess. AAAI, pp. 601606. [25] Marsland, T. A. (1983). Relative Efficiency of Alpha-Beta Implementations. IJCAI 1983, pp. 763-766. [26] Marsland, T. A. (1985). Evaluation-Function Factors. ICCA Journal, Vol. 8, No. 2, pp. 4757. [27] van der Meulen, M. (1989). Weight Assessment in Evaluation Functions. In Advances in Computer Chess 5 (ed. D.F. Beal), pp. 81-90. Elsevier Science Publishers, Amsterdam, The Netherlands. [28] Mysliwietz, P. (1994). Konstruktion und Optimierung von Bewertungstfunktionen beim Schach. Ph.D. Thesis, Universität Paderborn, Paderborn, Germany. [29] Nunn, J., Friedel, F., Steinwender, D., and Liebert, C. (1998). Computer ohne Buch, in Computer-Schach und Spiele, Vol. 16, No. 1, pp. 7-14. [30] Nunn, J., Burgess, G., Emms, J., and Gallagher, J. (1999). Nunn’s Chess Openings. Everyman. [31] Nunn, J. (2000). Der Nunn-Test II, in Computer-Schach und Spiele, Vol. 18, No. 1, pp. 30-35. http://www.computerschach.de/test/nunn2.htm. [32] Nowatzyk, A. (2000). http://www.tim-mann.org/deepthought.html. [33] Pinchak, C., Lu, P., and Goldenberg, M. (2002). Practical Heterogeneous Placeholder Scheduling in Overlay Metacomputers: Early Experiences. 8th Workshop on Job Scheduling Strategies for Parallel Processing, Edinburgh, Scotland, U.K., pp. 85-105. Reprinted in LNCS 2537 (2003), pp. 205-228. http://www.cs.ualberta.ca/~paullu/Trellis/Papers/placeholders.jsspp.2002.ps.gz. [34] Plaat, A., Schaeffer, J., Pijls, W., and Bruin, A. de (1996). Best-First Fixed-Depth GameTree Search in Practice. Artificial Intelligence, Vol. 87, Nos. 1-2, pp. 255-293. [35] Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (1992). Numerical Recipes in C: The Art of Scientific Computing, Second Edition, pp. 644-645. Cambridge University Press. [36] Sahovski Informator (1966). Chess Informant: http://www.sahovski.com/. [37] Samuel, A. L. (1959). Some Studies in Machine Learning Using the Game of Checkers. 
IBM Journal of Research and Development, Vol. 3, No. 3, pp. 211-229.
[38] Samuel, A. L. (1967). Some Studies in Machine Learning Using the Game of Checkers. II – Recent Progress. IBM Journal of Research and Development, Vol. 11, No. 6, pp. 601-617.
[39] Sarle, W. S. (1997). Measurement Theory: Frequently Asked Questions, Version 3. ftp://ftp.sas.com/pub/neural/measurement.html. Revision of a publication in Disseminations of the International Statistical Applications Institute, Vol. 1, Ed. 4, 1995, pp. 61-66.
[40] Schaeffer, J., Hlynka, M., and Jussila, V. (2001). Temporal Difference Learning Applied to a High-Performance Game-Playing Program. Proceedings of IJCAI 2001, pp. 529-534.
[41] Shannon, C. E. (1950). Programming a Computer for Playing Chess. Philosophical Magazine, Vol. 41, pp. 256-275.
[42] Stevens, S. S. (1946). On the Theory of Scales of Measurement. Science, Vol. 103, pp. 677-680.
[43] Stevens, S. S. (1951). Mathematics, Measurement, and Psychophysics. In Handbook of Experimental Psychology (ed. S. S. Stevens), pp. 1-49. New York: Wiley.
[44] Stevens, S. S. (1959). Measurement. In Measurement: Definitions and Theories (ed. C. W. Churchman), pp. 18-36. New York: Wiley. Reprinted in Scaling: A Sourcebook for Behavioral Scientists (ed. G. M. Maranell, 1974), pp. 22-41. Chicago: Aldine.
[45] Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, Vol. 3, pp. 9-44.
[46] Tesauro, G. (1989). Connectionist Learning of Expert Preferences by Comparison Training. In Advances in Neural Information Processing Systems 1 (ed. D. Touretzky), pp. 99-106. Morgan Kaufmann.
[47] Tesauro, G. (1995). Temporal Difference Learning and TD-GAMMON. Communications of the ACM, Vol. 38, No. 3, pp. 55-68. http://www.research.ibm.com/massive/tdl.html.
[48] Tesauro, G. (2001). Comparison Training of Chess Evaluation Functions. In Machines that Learn to Play Games (eds. J. Fürnkranz and M. Kubat), pp. 117-130. Nova Scientific Publishers.
[49] Thompson, K. (1982). Computer Chess Strength. In Advances in Computer Chess 3 (ed. M. R. B. Clarke), pp. 55-56. Pergamon Press, Oxford, U.K.
[50] Tukey, J. W. (1962). The Future of Data Analysis. In The Collected Works of John W. Tukey, Vol. 3 (1986) (ed. L. V. Jones), pp. 391-484. Wadsworth, Belmont, California, U.S.A.
[51] Utgoff, P. E. and Clouse, J. A. (1991). Two Kinds of Training Information for Evaluation Function Learning. Proceedings of AAAI 1991, pp. 596-600.
[52] Utgoff, P. E. (1996). ELF: An Evaluation Function Learner that Constructs its own Features. Technical Report 96-65, Department of Computer Science, University of Massachusetts, Amherst, Massachusetts, U.S.A.
[53] Utgoff, P. E. and Precup, D. (1998). Constructive Function Approximation. In Feature Extraction, Construction, and Selection: A Data-Mining Perspective (eds. Motoda and Liu), pp. 219-235. Kluwer Academic Publishers.
[54] Utgoff, P. E. and Stracuzzi, D. J. (1999). Approximation via Value Unification. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 425-432. Morgan Kaufmann.
[55] Utgoff, P. E. (2001). Feature Construction for Game Playing. In Machines that Learn to Play Games (eds. J. Fürnkranz and M. Kubat), pp. 131-152. Nova Scientific Publishers.
[56] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-Layered Learning. Neural Computation, Vol. 14, pp. 2497-2539.
[57] Velleman, P. and Wilkinson, L. (1993). Nominal, Ordinal, Interval, and Ratio Typologies are Misleading. http://www.spss.com/research/wilkinson/Publications/Stevens.pdf. A previous version appeared in The American Statistician, Vol. 47, No. 1, pp. 65-72.