Training Conditional Random Fields via Gradient Tree Boosting

Thomas G. Dietterich, Adam Ashenfelter, Yaroslav Bulatov
tgd@cs.orst.edu, ashenfad@engr.orst.edu, bulatov@cs.orst.edu
School of Electrical Engineering and Computer Science
Oregon State University, Corvallis, OR 97331 USA

To appear, International Conference on Machine Learning, 2004

Abstract

Conditional Random Fields (CRFs; Lafferty, McCallum, & Pereira, 2001) provide a flexible and powerful model for learning to assign labels to elements of sequences in such applications as part-of-speech tagging, text-to-speech mapping, protein and DNA sequence analysis, and information extraction from web pages. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features. This paper describes a new method for training CRFs by applying Friedman's (2001) gradient tree boosting method. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees. Regression trees are learned by stage-wise optimizations similar to AdaBoost, but with the objective of maximizing the conditional likelihood $P(Y \mid X)$ of the CRF model. By growing regression trees, interactions among features are introduced only as needed, so although the parameter space is potentially immense, the search algorithm does not explicitly consider the large space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially like previous algorithms based on iterative scaling and gradient descent.

1 Introduction

Many applications of machine learning involve assigning labels to sequences of objects. For example, in part-of-speech tagging, the task is to assign a part of speech ("noun", "verb", etc.) to each word in a sentence (Ratnaparkhi, 1996).
In protein secondary structure prediction, the task is to assign a secondary structure class to each amino acid residue in the protein sequence (Qian & Sejnowski, 1988). We call this class of problems sequential supervised learning (SSL), and it can be formalized as follows:

Given: A set of training examples of the form $(X_i, Y_i)$, where each $X_i = (x_{i,1}, \ldots, x_{i,T_i})$ is a sequence of $T_i$ feature vectors and each $Y_i = (y_{i,1}, \ldots, y_{i,T_i})$ is a corresponding sequence of class labels, $y_{i,t} \in \{1, \ldots, K\}$.

Find: A classifier $H$ that, given a new sequence $X$ of feature vectors, accurately predicts the corresponding sequence of class labels $Y = H(X)$.

Perhaps the most famous SSL problem is the NETtalk task of pronouncing English words by assigning a phoneme and stress to each letter of the word (Sejnowski & Rosenberg, 1987).

Early attempts to apply machine learning to SSL problems were based on sliding windows. To predict label $y_t$, a sliding window method uses features drawn from some "window" of the $X$ sequence. For example, a 5-element window $w_t(X)$ would use the features $x_{t-2}, x_{t-1}, x_t, x_{t+1}, x_{t+2}$. Sliding windows convert the SSL problem into a standard supervised learning problem to which any ordinary machine learning algorithm can be applied.

However, in most SSL problems, there are correlations among successive class labels $y_t$. For example, in part-of-speech tagging, adjectives tend to be followed by nouns. In protein sequences, alpha helices and beta structures always involve multiple adjacent residues. These correlations can be exploited to increase classification accuracy.

Recently, many new learning methods have been developed with the goal of capturing these $y \leftrightarrow y$ correlations. See Dietterich (2002) for a review. One of the most interesting new methods is the conditional random field (CRF) proposed by Lafferty et al. (2001).
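The sliding-window conversion described above can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names `sliding_window` and `to_windowed_examples` are hypothetical, positions are 0-based (the text uses 1-based indices), and padding out-of-range positions with `None` is our own choice, since the text does not say how sequence boundaries are handled.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_window(X: Sequence[T], t: int, half_width: int = 2,
                   pad: Optional[T] = None) -> List[Optional[T]]:
    """Return w_t(X) = (x_{t-h}, ..., x_t, ..., x_{t+h}) for 0-based position t.

    Positions outside the sequence are filled with `pad` (an assumption;
    the paper does not specify boundary handling).
    """
    return [X[j] if 0 <= j < len(X) else pad
            for j in range(t - half_width, t + half_width + 1)]


def to_windowed_examples(X: Sequence[T], Y: Sequence[str],
                         half_width: int = 2) -> List[Tuple[List[Optional[T]], str]]:
    """Turn one labeled sequence into independent (window, label) examples."""
    return [(sliding_window(X, t, half_width), Y[t]) for t in range(len(X))]


# Hypothetical part-of-speech example.
words = ["the", "old", "bank", "closed", "early"]
tags = ["det", "adj", "noun", "verb", "adv"]
examples = to_windowed_examples(words, tags)
```

Each resulting pair can be fed to any ordinary classifier, which is exactly why this construction ignores the correlations among neighboring labels.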
The CRF is a probabilistic model of the conditional probability that input sequence $X$ will produce output label sequence $Y$: $P(Y \mid X)$. The CRF has the form of a Markov random field (Geman, 1998):

$$P(Y \mid X) = \frac{1}{Z(X)} \exp \sum_t \left[ \Psi_t(y_t, X) + \Psi_{t-1,t}(y_{t-1}, y_t, X) \right],$$

where $\Psi_t(y_t, X)$ and $\Psi_{t-1,t}(y_{t-1}, y_t, X)$ are potential functions that capture (respectively) the degree to which $y_t$ is compatible with $X$ and the degree to which $y_t$ is compatible with a transition from $y_{t-1}$ and with $X$. These potential functions can be arbitrary real-valued functions. The exponential function ensures that $P(Y \mid X)$ is positive, and the normalizing constant

$$Z(X) = \sum_{Y'} \exp \sum_t \left[ \Psi_t(y'_t, X) + \Psi_{t-1,t}(y'_{t-1}, y'_t, X) \right]$$

ensures that $P(Y \mid X)$ sums to 1. This representation is completely general, subject to the assumption that $P(Y \mid X) > 0$ for all $X$ and $Y$ (Besag, 1974; Hammersley & Clifford, 1971). Normally, it is assumed that the potential functions do not depend on $t$, and we will adopt that assumption in this paper.

To apply a CRF to an SSL problem, we must choose a representation for the $\Psi$ functions. Lafferty et al. studied $\Psi$ functions that are weighted combinations of binary features:

$$\Psi(y_t, X) = \sum_a \beta_a g_a(y_t, X) \tag{1}$$

$$\Psi(y_{t-1}, y_t, X) = \sum_b \lambda_b f_b(y_{t-1}, y_t, X), \tag{2}$$

where the $\beta_a$'s and $\lambda_b$'s are trainable weights, and the features $g_a$ and $f_b$ are boolean functions. For example, in part-of-speech tagging $g_{234}(y_t, X)$ might be 1 when $x_t$ is the word "bank" and $y_t$ is the class "noun" (and 0 otherwise). As with sliding window methods, it is natural to define features that depend only on a sliding window $w_t(X)$ of $X$ values. This linear parameterization can be seen as an extension of logistic regression to the sequential case.

Once a parameterization is chosen, the CRF can be trained to maximize the log likelihood of the training data, possibly with a regularization penalty to prevent overfitting. Let $\Theta = \{\beta_1, \ldots, \lambda_1, \ldots\}$ denote all of the tunable parameters in the model.
Then the objective function is to maximize

$$
\begin{aligned}
J(\Theta) &= \log \prod_i P(Y_i \mid X_i) \\
&= \sum_i \log \left[ \frac{1}{Z(X_i)} \exp \sum_t \left( \Psi_t(y_{i,t}, X_i) + \Psi_{t-1,t}(y_{i,t-1}, y_{i,t}, X_i) \right) \right] \\
&= \sum_i \left[ \sum_t \left( \Psi_t(y_{i,t}, X_i) + \Psi_{t-1,t}(y_{i,t-1}, y_{i,t}, X_i) \right) - \log Z(X_i) \right] \\
&= \sum_i \left[ \sum_t \left( \sum_a \beta_a g_a(y_{i,t}, X_i) + \sum_b \lambda_b f_b(y_{i,t-1}, y_{i,t}, X_i) \right) - \log Z(X_i) \right].
\end{aligned}
$$

Lafferty et al. introduced an iterative scaling algorithm for maximizing $J(\Theta)$, but they reported that it was exceedingly slow. Several groups have implemented gradient ascent methods, but naive implementations are also very slow, because the various $\beta$ and $\lambda$ parameters interact with each other: increasing one parameter may require compensating changes in others. McCallum's Mallet system (McCallum, 2003) employs the BFGS algorithm, which is an approximate second-order method that deals with these parameter interactions.

A drawback of this linear parameterization is that it assumes that each feature makes an independent contribution to the potential functions. Of course it is possible to define more features to capture combinations of the basic features, but this leads to a combinatorial explosion in the number of features, and hence, in the dimensionality of the optimization problem. For example, in protein secondary structure prediction, Qian and Sejnowski found that a 13-residue sliding window gave best results for neural network methods. There are $3^2 \times 13 \times 20 = 2340$ basic $f_b$ features that can be defined over this window. If we consider fourth-order conjunctions of such features, we obtain more than $10^{12}$ features. This is obviously infeasible.

McCallum's Mallet system starts with a single constant feature and introduces new feature conjunctions by taking conjunctions of the basic features with features already in the model. Candidate conjunctions are evaluated according to their incremental impact on the objective function. McCallum demonstrates significant improvements in speed and classification accuracy compared to a CRF that only includes the basic features.
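The feature-count arithmetic quoted for the protein task can be checked directly. The sketch below rests on one assumption that the text does not spell out: a "fourth-order conjunction" is counted as an unordered choice of four distinct basic features. The variable names are ours.

```python
from math import comb

num_classes = 3        # alpha helix, beta sheet, coil
window_size = 13       # residues in the sliding window
num_amino_acids = 20   # possible values of each residue

# One basic f_b feature per (y_{t-1}, y_t, window position, amino acid) combination.
basic_features = num_classes ** 2 * window_size * num_amino_acids

# Assumption: a fourth-order conjunction is an unordered set of 4 distinct basic features.
fourth_order_conjunctions = comb(basic_features, 4)

print(basic_features)
print(f"{fourth_order_conjunctions:.2e}")
```

Under this counting convention the number of conjunctions comes out slightly above $10^{12}$, in line with the figure quoted in the text; other conventions (for example, allowing repeated features) would give a different but similarly infeasible number.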
In this paper, we introduce a different approach to training the potential functions based on Friedman's (2001) gradient tree boosting algorithm. In this approach, the potential functions are represented by sums of regression trees, which are grown stage-wise in the manner of AdaBoost (Freund & Schapire, 1996). Each regression tree can be viewed as defining several new feature combinations, one corresponding to each path in the tree from the root to a leaf. The resulting potential functions still have the form of a linear combination of features, but the features can be quite complex. The advantage of the gradient boosting approach is that the algorithm is fast and straightforward to implement. In addition, there may be some tendency to avoid overfitting because of the "ensemble effect" of combining multiple regression trees.

2 Gradient Tree Boosting

Suppose we wish to solve a standard supervised learning problem, where the training examples have the form $(x_i, y_i)$, $i = 1, \ldots, N$ and $y_i \in \{1, \ldots, K\}$. We wish to fit a model of the form

$$P(y \mid x) = \frac{\exp \Psi(y, x)}{\sum_{y'} \exp \Psi(y', x)}.$$

Gradient tree boosting is based on the idea of functional gradient ascent. In ordinary gradient ascent, we would parameterize $\Psi$ in some way, for example, as a linear function,

$$\Psi(y, x) = \sum_a \beta_a g_a(y, x).$$

Let $\Theta = \{\beta_1, \ldots\}$ represent all of the tunable parameters in this function. In gradient ascent, the fitted parameter vector after iteration $m$, $\Theta_m$, is a sum of an initial parameter vector $\Theta_0$ and a series of gradient ascent steps $\delta_m$:

$$\Theta_m = \Theta_0 + \delta_1 + \cdots + \delta_m,$$

where each $\delta_m$ is computed as a step in the direction of the gradient of the log likelihood function:

$$\delta_m = \eta_m \frac{\partial}{\partial \Theta_{m-1}} \sum_i \log P(y_i \mid x_i; \Theta_{m-1})$$

and $\eta_m$ is a parameter that controls the step size.

Functional gradient ascent is a more general approach. Instead of assuming a linear parameterization for $\Psi$, it just assumes that $\Psi$ will be represented by a weighted sum of functions:

$$\Psi_m = \Psi_0 + \Delta_1 + \cdots + \Delta_m.$$
Each $\Delta_m$ is computed as a functional gradient:

$$\Delta_m = \eta_m E_{x,y}\left[ \frac{\partial}{\partial \Psi_{m-1}} \log P(y \mid x; \Psi_{m-1}) \right].$$

The functional gradient indicates how we would like the function $\Psi_{m-1}$ to change in order to increase the true log likelihood (i.e., on all possible points $(x, y)$). Unfortunately, we do not know the joint distribution $P(x, y)$, so we cannot evaluate the expectation $E_{x,y}$. We do have a set of training examples sampled from this joint distribution, so we can compute the value of the functional gradient at each of our training data points:

$$\Delta_m(y_i, x_i) = \frac{\partial}{\partial \Psi_{m-1}} \sum_i \log P(y_i \mid x_i; \Psi_{m-1}).$$

We can then use these point-wise functional gradients to define a set of functional gradient training examples, $((x_i, y_i), \Delta_m(y_i, x_i))$, and then train a function $h_m(y, x)$ so that it approximates $\Delta_m(y_i, x_i)$. Specifically, we can fit a regression tree $h_m$ to minimize

$$\sum_i \left[ h_m(y_i, x_i) - \Delta_m(y_i, x_i) \right]^2.$$

We can then take a step in the direction of this fitted function:

$$\Psi_m = \Psi_{m-1} + \eta_m h_m.$$

Although the fitted function $h_m$ is not exactly the same as the desired $\Delta_m$, it will point in the same general direction (assuming there are enough training examples). So ascent in the direction of $h_m$ will approximate true functional gradient ascent.

A key thing to note about this approach is that it replaces the difficult problem of maximizing the log likelihood of the data by the much simpler problem of minimizing squared error on a set of training examples.

Friedman suggests growing $h_m$ via a best-first version of the CART algorithm (Breiman et al., 1984) and stopping when the regression tree reaches a pre-set number of leaves $L$. Overfitting is controlled by tuning $L$ (e.g., by internal cross-validation).

3 Gradient Tree Boosting for SSL

In principle, it is straightforward to apply functional gradient ascent to SSL. All we need to do is to represent and train $\Psi(y_t, X)$ and $\Psi(y_{t-1}, y_t, X)$ as weighted sums of regression trees. For historical reasons, we took a slightly different approach.
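Before turning to the sequential case, the residual-fitting loop of Section 2 can be illustrated on a toy, non-sequential problem. The following is a minimal sketch and not the authors' implementation. It makes several simplifying assumptions: inputs are scalars, each regression tree is a two-leaf stump rather than a best-first CART tree with $L$ leaves, one stump per class is fit in every iteration, and the step size is fixed at 1. It relies on the standard fact that, for the softmax model above, the point-wise gradient with respect to $\Psi(k, x_i)$ is $I(y_i = k) - P(k \mid x_i)$. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def fit_stump(xs: Sequence[float], targets: Sequence[float]) -> Stump:
    """Fit a two-leaf regression tree on a scalar input by least squares."""
    values = sorted(set(xs))
    best_sse, best_stump = None, None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, targets) if x <= thr]
        right = [r for x, r in zip(xs, targets) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_stump = sse, (thr, lv, rv)
    if best_stump is None:  # all inputs identical: fall back to a constant
        mean = sum(targets) / len(targets)
        return (values[0], mean, mean)
    return best_stump


def stump_predict(stump: Stump, x: float) -> float:
    thr, lv, rv = stump
    return lv if x <= thr else rv


def softmax_probs(psi: List[List[Stump]], x: float, eta: float = 1.0) -> List[float]:
    """P(k | x) where Psi(k, x) is eta times the sum of class k's stumps."""
    scores = [eta * sum(stump_predict(s, x) for s in trees) for trees in psi]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def log_likelihood(psi, xs, ys, eta: float = 1.0) -> float:
    return sum(math.log(softmax_probs(psi, x, eta)[y]) for x, y in zip(xs, ys))


def boost(xs, ys, num_classes: int, iterations: int = 20, eta: float = 1.0):
    psi: List[List[Stump]] = [[] for _ in range(num_classes)]  # Psi_0 = 0
    for _ in range(iterations):
        # Residuals for every class are computed from the previous iteration's model.
        probs = [softmax_probs(psi, x, eta) for x in xs]
        for k in range(num_classes):
            residuals = [(1.0 if y == k else 0.0) - p[k] for y, p in zip(ys, probs)]
            psi[k].append(fit_stump(xs, residuals))
    return psi


# Hypothetical toy data: three classes laid out along a line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 1, 2, 2]
before = log_likelihood([[], [], []], xs, ys)
psi = boost(xs, ys, num_classes=3)
after = log_likelihood(psi, xs, ys)
```

The point of the sketch is the division of labor: the likelihood only enters through the residuals, while each tree is fit by ordinary squared-error regression.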
Let

$$F^{y_t}(y_{t-1}, X) = \Psi(y_t, X) + \Psi(y_{t-1}, y_t, X)$$

be a function that computes the "desirability" of label $y_t$ given values for label $y_{t-1}$ and the input features $X$. There are $K$ such functions $F^k$, one for each class label $k$. Then the CRF has the form

$$P(Y \mid X) = \frac{1}{Z(X)} \exp \sum_t F^{y_t}(y_{t-1}, X).$$

We now compute the functional gradient of $\log P(Y \mid X)$ with respect to $F^{y_t}(y_{t-1}, X)$. To simplify the computation, we replace $X$ by $w_t(X)$, which is a window into the sequence $X$ centered at $x_t$. We will further assume, without loss of generality, that each window is unique, so there is only one occurrence of $w_t(X)$ in each sequence $X$.

Proposition 1 The functional gradient of $\log P(Y \mid X)$ with respect to $F^v(u, w_d(X))$ is

$$\frac{\partial \log P(Y \mid X)}{\partial F^v(u, w_d(X))} = I(y_{d-1} = u, y_d = v) - P(y_{d-1} = u, y_d = v \mid w_d(X)),$$

where $I(y_{d-1} = u, y_d = v)$ is 1 if the transition $u \to v$ is observed from position $d-1$ to position $d$ in the sequence $Y$ and 0 otherwise, and where $P(y_{d-1} = u, y_d = v \mid w_d(X))$ is the predicted probability of this transition according to the current potential functions.

Table 1: Derivation of the functional gradient

$$
\begin{aligned}
\frac{\partial \log P(Y \mid X)}{\partial F^v(u, w_d(X))}
&= \frac{\partial}{\partial F^v(u, w_d(X))} \left[ \sum_t F^{y_t}(y_{t-1}, w_t(X)) - \log Z(X) \right] && (3) \\
&= I(y_{d-1} = u, y_d = v) - \frac{\partial \log Z(X)}{\partial F^v(u, w_d(X))} \\
&= I(y_{d-1} = u, y_d = v) - \frac{1}{Z(X)} \frac{\partial Z(X)}{\partial F^v(u, w_d(X))} && (4) \\
&= I(y_{d-1} = u, y_d = v) - \frac{1}{Z(X)} \frac{\partial}{\partial F^v(u, w_d(X))} \sum_k \left[ \sum_{k'} \exp F^k(k', w_d(X)) \cdot \alpha(k', d-1) \right] \beta(k, d) && (5) \\
&= I(y_{d-1} = u, y_d = v) - \frac{1}{Z(X)} \left[ \exp F^v(u, w_d(X)) \right] \alpha(u, d-1)\, \beta(v, d) && (6) \\
&= I(y_{d-1} = u, y_d = v) - P(y_{d-1} = u, y_d = v \mid X) && (7)
\end{aligned}
$$

To demonstrate this proposition, we must first introduce the forward-backward algorithm for computing $Z(X)$. We will assume that $y_t$ takes the value $\bot$ for $t < 1$. Define the forward recursion by

$$\alpha(k, 1) = \exp F^k(\bot, w_1(X))$$

$$\alpha(k, t) = \sum_{k'} \exp F^k(k', w_t(X)) \cdot \alpha(k', t-1).$$

Define the backward recursion as

$$\beta(k, T) = 1$$

$$\beta(k, t) = \sum_{k'} \exp F^{k'}(k, w_{t+1}(X)) \cdot \beta(k', t+1).$$

The variables $k$ and $k'$ iterate over the possible class labels.
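The forward and backward recursions, and the transition probability that appears in line (6) of Table 1, can be sketched in a few lines. This is an illustrative sketch rather than the authors' code. The assumptions are: positions are 0-based, the start symbol $\bot$ is represented by `None`, the potentials are supplied as a hypothetical callable `F(k, k_prev, t)` standing in for $F^k(k', w_t(X))$, and `toy_F` is an arbitrary made-up scoring function. No log-space scaling is done, so the sketch is only suitable for short sequences.

```python
import math
from itertools import product
from typing import Callable, List, Optional, Tuple

# F(k, k_prev, t) stands in for F^k(k', w_t(X)); k_prev is None for the start symbol.
Score = Callable[[int, Optional[int], int], float]


def forward_backward(F: Score, T: int, K: int) -> Tuple[List[List[float]], List[List[float]]]:
    """Return alpha[t][k] and beta[t][k] for 0-based positions t."""
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]  # beta at the last position is the base case 1
    for k in range(K):
        alpha[0][k] = math.exp(F(k, None, 0))
    for t in range(1, T):
        for k in range(K):
            alpha[t][k] = sum(math.exp(F(k, kp, t)) * alpha[t - 1][kp] for kp in range(K))
    for t in range(T - 2, -1, -1):
        for k in range(K):
            beta[t][k] = sum(math.exp(F(kn, k, t + 1)) * beta[t + 1][kn] for kn in range(K))
    return alpha, beta


def pairwise_marginal(F: Score, alpha, beta, Z: float, d: int, u: int, v: int) -> float:
    """P(y_{d-1} = u, y_d = v | X) for d >= 1."""
    return alpha[d - 1][u] * math.exp(F(v, u, d)) * beta[d][v] / Z


def functional_gradient(F: Score, Y, alpha, beta, Z: float, d: int, u: int, v: int) -> float:
    """Indicator of the observed transition minus its predicted probability."""
    observed = 1.0 if (Y[d - 1] == u and Y[d] == v) else 0.0
    return observed - pairwise_marginal(F, alpha, beta, Z, d, u, v)


def brute_force_Z(F: Score, T: int, K: int) -> float:
    """Sum exp(score) over all K**T label sequences (for checking only)."""
    total = 0.0
    for Y in product(range(K), repeat=T):
        score = F(Y[0], None, 0) + sum(F(Y[t], Y[t - 1], t) for t in range(1, T))
        total += math.exp(score)
    return total


def toy_F(k: int, k_prev: Optional[int], t: int) -> float:
    """Made-up potentials, used only to exercise the recursions."""
    prev = -1 if k_prev is None else k_prev
    return 0.3 * k - 0.2 * prev + (0.5 if k == prev else 0.0) + 0.1 * t


T, K = 4, 3
alpha, beta = forward_backward(toy_F, T, K)
Z_by_position = [sum(alpha[t][k] * beta[t][k] for k in range(K)) for t in range(T)]
Z = Z_by_position[-1]
Y_observed = [0, 0, 1, 2]
gradients = {(u, v): functional_gradient(toy_F, Y_observed, alpha, beta, Z, 2, u, v)
             for u in range(K) for v in range(K)}
```

Two sanity checks are useful when experimenting with code like this: the sum of `alpha[t][k] * beta[t][k]` over `k` should be the same at every position and match the brute-force normalizer, and the gradients over all `(u, v)` pairs at one position should sum to zero, because both the indicators and the predicted probabilities sum to one.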
The normalizer $Z(X)$ can be computed at any position $t$ as

$$Z(X) = \sum_k \alpha(k, t)\, \beta(k, t).$$

If we unroll the $\alpha$ recursion one step, we can also write this as

$$Z(X) = \sum_k \left[ \sum_{k'} \exp F^k(k', w_t(X)) \cdot \alpha(k', t-1) \right] \beta(k, t).$$

Table 1 shows the derivation of the functional gradient. In line 3, exactly one of the $F^{y_t}(y_{t-1}, w_t(X))$ terms will match $F^v(u, w_d(X))$, because $w_d(X)$ is unique. This term will have a derivative of 1, so we represent this by the indicator function $I(y_{d-1} = u, y_d = v)$. In line 5, we expand $Z(X)$ at position $d$ using the forward-backward algorithm. Again because $w_d(X)$ is unique, only the product where $k' = u$ and $k = v$ will give a non-zero derivative, so this gives us line 6. The right-hand expression in 6 is precisely the joint probability that $y_{d-1} = u$ and $y_d = v$ given $X$. Q.E.D.

If $w_d(X)$ occurs more than once in $X$, each match contributes separately to the functional gradient.

This functional gradient has a very satisfying interpretation: It is our error on a probability scale. If the transition $u \to v$ is observed in the training example, then the predicted probability $P(u, v \mid X)$ should be 1 in order to maximize the likelihood. If the transition is not observed, then the predicted probability should be 0. Functional gradient ascent simply involves fitting regression trees to these residuals.

Table 2 shows pseudo code for our tree-boosting algorithm. The potential function for each class $k$ is initialized to zero. Then $M$ iterations of boosting are executed. In each iteration, for each class $k$, a set $S(k)$ of functional gradient training examples is generated. Each example consists of a window $w_t(X_i)$ on the input sequence, a possible class label $k'$ at time $t-1$, and the target $\Delta$ value. A regression tree having at most $L$ leaves is fit to these training examples to produce the function $h_m(k)$. This function is then added to the previous potential function to produce the next function. In other words, we are setting the step size $\eta_m = 1$.
We experimented with performing a line search at this point to optimize $\eta_m$, but this is very expensive. So we rely on the "self-correcting" property of tree boosting to correct any overshoot or undershoot on the next iteration.

One way to improve upon this algorithm is to initialize the potential functions more intelligently. The pseudo-likelihood of $(X, y_t)$ is $P(y_t \mid y_{t-1}, y_{t+1}, X)$. This is the probability of the correct label at position $t$ given the correct labels for $y_{t-1}$ and $y_{t+1}$. The pseudo-likelihood can be computed without performing any forward-backward iterations:

$$P(y_t \mid y_{t-1}, y_{t+1}, X) = \frac{\exp\left[ F^{y_t}(y_{t-1}, w_t(X)) + F^{y_{t+1}}(y_t, w_{t+1}(X)) \right]}{\sum_{y'} \exp\left[ F^{y'}(y_{t-1}, w_t(X)) + F^{y_{t+1}}(y', w_{t+1}(X)) \right]}.$$

The pseudo-likelihood, because it assumes that the correct labels are known for $y_{t-1}$ and $y_{t+1}$, works well if our eventual error rate will be small. We found that it significantly sped up our training trials. The pseudo-likelihood is known to be a consistent estimator of the likelihood (Besag, 1977). We perform three iterations of gradient tree boosting using the pseudo-likelihood to compute the boosting examples $S(k)$. Then we switch to using the full functional gradient.

The sets of generated examples $S(k)$ can become very large. For example, if we have 3 classes and 100 training sequences of length 200, then the number of training examples for each class $k$ is $3 \times 100 \times 200 = 60{,}000$. Although regression tree algorithms are very fast, they still must consider all of the training examples! Friedman (2001) suggests two tricks for speeding up the computation: sampling and influence trimming. In sampling, a random sample of the training data is used for training. In influence trimming, data points with $\Delta$ values close to zero are ignored. We did not apply either of these techniques in our experiments.

Table 2: Gradient Tree Boosting for SSL

    TreeBoost(Data, L)
        // Data = {(X_i, Y_i) : i = 1, ..., N}
        for each class k, initialize F_0^k(., .) = 0
        for m = 1, ..., M do
            for class k from 1 to K do
                S(k) := GenerateExamples(k, Data, Pot_{m-1})
                    // where Pot_{m-1} = {F_{m-1}^u : u = 1, ..., K}
                h_m(k) := FitRegressionTree(S(k), L)
                F_m^k := F_{m-1}^k + h_m(k)
            end
        end
        return F_M^k for all k
    end TreeBoost

    GenerateExamples(k, Data, Pot_m)
        S := {}
        for example i from 1 to N do
            execute the forward-backward algorithm on (X_i, Y_i)
                to get alpha(k, t) and beta(k, t) for all k and t
            for t from 1 to T_i do
                for k' from 1 to K do
                    P(y_{i,t-1} = k', y_{i,t} = k | X_i) :=
                        alpha(k', t-1) exp[F_m^k(k', w_t(X_i))] beta(k, t) / Z(X_i)
                    Delta(k, k', i, t) := I(y_{i,t-1} = k', y_{i,t} = k)
                                          - P(y_{i,t-1} = k', y_{i,t} = k | X_i)
                    insert ((w_t(X_i), k'), Delta(k, k', i, t)) into S
                end
            end
        end
        return S
    end GenerateExamples

4 Making Predictions

Once a CRF model has been trained, there are (at least) two possible ways to define a classifier $Y = H(X)$ for making predictions. First, we can predict the entire sequence $Y$ that has the highest probability:

$$H(X) = \operatorname*{argmax}_Y P(Y \mid X).$$

This makes sense in applications, such as part-of-speech tagging, where the goal is to make a coherent sequential prediction. This can be computed by the Viterbi algorithm (Rabiner, 1989), which has the advantage that it does not need to compute the normalizer $Z(X)$.

The second way to make predictions is to individually predict each $y_t$ according to

$$H_t(X) = \operatorname*{argmax}_v P(y_t = v \mid X)$$

and then concatenate these individual predictions to obtain $H(X)$. This makes sense in applications where the goal is to maximize the number of individual $y_t$'s correctly predicted, even if the resulting predicted $Y$ sequence is incoherent. For example, a predicted sequence of parts of speech might not be grammatically legal, and yet it might maximize the number of individual words correctly classified. $P(y_t \mid X)$ can be computed by executing the forward-backward algorithm as

$$P(y_t \mid X) = \frac{\alpha(y_t, t)\, \beta(y_t, t)}{Z(X)}.$$

5 Experimental Studies

We implemented gradient tree boosting for CRFs and compared it to McCallum's Mallet system on four benchmark data sets. We will call our algorithm TreeCRF. We will use TreeCRF-V for the TreeCRF with Viterbi predictions and TreeCRF-FB for the TreeCRF with forward-backward predictions. Mallet implements McCallum's feature induction algorithm. Mallet makes its predictions using the Viterbi algorithm, so we will denote it by Mallet-V.

5.1 Data Sets

We tested these algorithms on four data sets: protein secondary structure prediction and three Usenet FAQs: ai-general, ai-neural, and aix.

The protein secondary structure benchmark was published by Qian & Sejnowski (1988). A protein consists of a sequence of amino acid residues. Each residue is represented by a single feature with 20 possible values (corresponding to the 20 standard amino acids). There are three classes: alpha helix, beta sheet, and coil (everything else). There is a training set of 111 sequences and a test set of 17 sequences.

Each of the FAQ data sets consists of Frequently Asked Questions files for a Usenet newsgroup (McCallum et al., 2000). The FAQs for each newsgroup are divided into separate files: ai-general has 7 files, ai-neural has 7 files, and aix has 5 files. Every line of an FAQ is labeled as either part of the header, a question, an answer, or part of the tail. Hence, each $x_t$ consists of a line in the FAQ file, and the corresponding $y_t \in \{\text{header}, \text{question}, \text{answer}, \text{tail}\}$. The measure of accuracy is the number of individual lines correctly classified. McCallum provided us with the definitions of 20 features. We made a slight correction to one of the features, so our results are not directly comparable to his. For each newsgroup, performance was measured by leave-one-out cross-validation: the CRF was trained on all but one of the files and tested on the remaining file. This was repeated with each file, and the results averaged.
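The leave-one-file-out protocol used for the FAQ data can be sketched generically. This is only an illustration: `train` and `evaluate` are hypothetical callables, the majority-label baseline stands in for a real CRF purely so that the example runs, and averaging per-file accuracies with equal weight is our assumption (the text says only that the results were averaged).

```python
from collections import Counter
from typing import Callable, List, Sequence, TypeVar

File = TypeVar("File")
Model = TypeVar("Model")


def leave_one_out(files: Sequence[File],
                  train: Callable[[List[File]], Model],
                  evaluate: Callable[[Model, File], float]) -> float:
    """Train on all files but one, test on the held-out file, and average."""
    scores = []
    for i, held_out in enumerate(files):
        training = [f for j, f in enumerate(files) if j != i]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)


# Stand-in "model": always predict the most frequent training label.
def train_majority(training_files: List[List[str]]) -> str:
    counts = Counter(label for f in training_files for label in f)
    return counts.most_common(1)[0][0]


def line_accuracy(majority_label: str, test_file: List[str]) -> float:
    return sum(1 for label in test_file if label == majority_label) / len(test_file)


# Hypothetical files, each given as the list of its per-line labels.
faq_files = [
    ["header", "answer", "answer", "answer"],
    ["question", "answer", "answer"],
    ["question", "answer", "tail"],
]
mean_accuracy = leave_one_out(faq_files, train_majority, line_accuracy)
```

With only 5 to 7 files per newsgroup, each fold's test set is a single file, so the averaged accuracy can be sensitive to one unusual file.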
Both TreeCRF and Mallet have parameters that must be set by the user. For both algorithms, the user must set (a) the window size, (b) the order of the Markov model, and (c) the number of iterations to train. For TreeCRF, the only additional parameter is $L$, the maximum number of leaves in each regression tree. For Mallet the parameters are (a) the regularization penalty for squared weights (called the variance), (b) the number of iterations between feature inductions (kept constant at 8), (c) the number of features to add per feature induction (kept constant at 500), (d) the true label probability threshold (kept constant at 0.95), (e) the training proportions (kept constant at 0.2, 0.5, and 0.8), and (f) the number of iterations to train. Except for the variance, we kept all of Mallet's parameters fixed at the values recommended by Andrew McCallum (personal communication).

To set the remaining parameters, we manually tried a handful of settings and chose the setting that gave the best test set (or cross-validation) performance. Ideally, these would be set via internal cross-validation. However, because we did not perform a very careful search of the parameter settings, we believe that the parameters are not highly tuned.

[Figure 1: Protein secondary structure prediction. Percent of residues correct versus number of iterations for Qian-Sejnowski, TreeCRF-V, and Mallet-V.]

5.2 Results

Figure 1 shows the results on the protein task. In all cases (except Qian-Sejnowski), a first-order CRF was employed. The input features consisted of an 11-residue sliding window. The TreeCRF-FB attains its peak performance of 64.7% correct after 28 iterations. The next best method is the neural network sliding window of Qian and Sejnowski (1988), which attains 64.5%. Mallet-V reaches 62.9% after 145 iterations. A McNemar's test comparing the peak performance of TreeCRF-FB and Mallet-V shows that the difference is statistically significant (p < 0.05).
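For readers unfamiliar with McNemar's test, the sketch below shows how such a comparison can be computed from per-item predictions of two classifiers on the same test set. The text does not say which variant of the test was used, so both the exact binomial form and the chi-square form with continuity correction are shown as assumptions. The function names are ours.

```python
import math
from typing import Sequence, Tuple


def discordant_counts(truth: Sequence, pred_a: Sequence, pred_b: Sequence) -> Tuple[int, int]:
    """Count items that exactly one of the two classifiers gets right."""
    only_a = sum(1 for y, a, b in zip(truth, pred_a, pred_b) if a == y and b != y)
    only_b = sum(1 for y, a, b in zip(truth, pred_a, pred_b) if a != y and b == y)
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact p-value: discordant items are Binomial(n, 1/2) under the null."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mcnemar_chi2(only_a: int, only_b: int) -> float:
    """Chi-square (1 degree of freedom) p-value with continuity correction."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    stat = (abs(only_a - only_b) - 1) ** 2 / n
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical labels and predictions for four test items.
truth = [1, 1, 0, 0]
pred_a = [1, 0, 0, 0]
pred_b = [1, 1, 1, 0]
counts = discordant_counts(truth, pred_a, pred_b)
```

Only the items on which the two classifiers disagree in correctness enter the test, which is why it is well suited to comparing two methods evaluated on identical residues or lines.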
One worrying aspect of Mallet is that the performance curve exhibits a high degree of fluctuation. This is presumably due to the effect of introducing new features. But it also suggests that it will be difficult to find the optimal stopping point for avoiding overfitting. The peak performance of 62.9% is achieved at only a single iteration. The second-highest performance is 62.6%, and a more realistic estimate of its achievable performance (i.e., by using cross-validation to determine the stopping point) would be around 61.5%.

It is difficult to compare the CPU time of the methods, because TreeCRF is written in C++ while Mallet is written in Java. Despite these differences the running times of the two programs are quite similar. The time required for TreeCRF to reach its peak performance is 1979.98 s; the time required for Mallet to reach its peak performance is 3634.37 s.

With an 11-residue window, it is not feasible to run LinearCRF (the linearly-parameterized CRF) on this problem. Table 3 compares the CPU time per iteration for smaller window sizes. We see that LinearCRF is faster for small window sizes, but that it slows down exponentially as the window size grows.

Table 3: Training iteration run time (seconds) for LinearCRF and TreeCRF

    Window size    LinearCRF    TreeCRF
    1              0.04         0.8
    3              0.66         1.11
    5              41.2         1.2
    7              1505         1.4

Figure 2 plots the percentage of lines correctly classified by the two algorithms on the ai-general FAQ. Again we see that Mallet's performance fluctuates wildly. A McNemar's test of the performance on the final iterations of the two methods concludes that TreeCRF is better (p < 0.001).

[Figure 2: FAQ ai-general. Percentage of lines correct as a function of CPU time (seconds) for TreeCRF-V and Mallet-V.]

[Figure 3: FAQ ai-neural. Percentage of lines correct as a function of CPU time (seconds) for TreeCRF-V and Mallet-V.]

[Figure 4: FAQ aix. Percentage of lines correct as a function of CPU time (seconds) for TreeCRF-V and Mallet-V.]

Figure 3 plots the results for the ai-neural FAQ. This time, despite fluctuations, Mallet converges to a better classifier than TreeCRF according to McNemar's test (p < 0.001).

Finally, Figure 4 plots the results for the aix FAQ. Although it is difficult to see from the graph, Mallet again converges to a slightly better classifier (p < 0.025). Note that on this data set, TreeCRF required about twice as much time to reach peak performance.

6 Conclusions

This paper has introduced a novel method for training conditional random fields based on gradient tree boosting. We can evaluate it along several dimensions.

Ease of implementation: TreeCRF is simpler to implement than Mallet.

Ease of tuning: TreeCRF introduces only one tunable parameter, $L$, the maximum number of leaves permitted in each regression tree. Mallet has many more parameters to consider. Mallet's performance fluctuates wildly, while TreeCRF improves smoothly.

Scaling to large numbers of features: tree boosting scales much better than the original linearly-parameterized CRF method. It appears to match Mallet, which also gives dramatic speedups when there are many potential features.

Run time: In our experiments TreeCRF required run time within a factor of two of Mallet. Both are reasonable.

Accuracy: In our experiments, TreeCRF was more accurate on two data sets and less accurate on two data sets.

Scaling to large numbers of classes: In experiments not shown, we attempted to apply TreeCRF to the NETtalk text-to-speech problem, which has 140 classes.
This is infeasible because the cost of performing the forward-backward algorithm (required by both TreeCRF and Mallet to compute gradients) scales as $T \cdot 140^{n+1}$, where $T$ is the length of the sequences and $n$ is the order of the Markov model. For NETtalk, $T$ is around 7, but previous research has suggested that $n$ should be at least 3. This means that the forward-backward computation for each training sequence requires about $7 \times 140^{4} \approx 2.7 \times 10^{9}$ operations, which makes training very slow. An important challenge for SSL research is to develop methods that can handle large numbers of classes.

Gradient tree boosting may provide another advantage over methods based on standard parametric gradient ascent: the ability to handle missing values in the inputs. There are very good methods for handling missing values when growing regression trees, including the surrogate split method of CART (Breiman et al., 1984) and the instance weighting method of C4.5 (Quinlan, 1993). In future work, we will evaluate whether these methods work well for training and evaluating CRFs.

Acknowledgements

The authors gratefully acknowledge the support of NSF grants IIS-0083292 and IIS-0307592.

References

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society B, 36, 192–236.

Besag, J. (1977). Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64, 616–618.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth International Group.

Dietterich, T. G. (2002). Machine learning for sequential data: A review. Structural, Syntactic, and Statistical Pattern Recognition (pp. 15–30). New York: Springer-Verlag.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the 13th International Conference on Machine Learning (pp. 148–156). Morgan Kaufmann.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29.

Geman, D. (1998). Random fields and inverse problems in imaging. In A. Ancona, D. Geman and N. Ikeda (Eds.), École d'Été de Probabilités de Saint-Flour XVIII, Lecture Notes in Mathematics, vol. 1427, 117–196. Berlin: Springer-Verlag.

Hammersley, J. M., & Clifford, P. (1971). Markov fields on finite graphs and lattices (Technical Report). Unpublished.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (pp. 282–289). San Francisco, CA: Morgan Kaufmann.

McCallum, A. (2003). Efficiently inducing features of conditional random fields. Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (pp. 403–410). San Francisco, CA: Morgan Kaufmann.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proceedings of the 17th International Conference on Machine Learning (pp. 591–598). San Francisco, CA: Morgan Kaufmann.

Qian, N., & Sejnowski, T. J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865–884.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 133–142). Somerset, NJ: ACL.

Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145–168.