1096 IEEE TRANSACTIONS ON COMPUTERS, NOVEMBER 1970

Use of Contextual Constraints in Recognition of Contour-Traced Handprinted Characters

ROBERT W. DONALDSON AND GODFRIED T. TOUSSAINT

Abstract—A contour-tracing technique described in an earlier paper [9] was used along with monogram, bigram, and trigram contextual constraints to recognize handprinted characters. Error probabilities decreased by factors ranging from two to four over those which resulted when contextual constraints were ignored. A simple technique for searching over only the most probable bigrams and trigrams was used to significantly reduce the computation without reducing the recognition accuracy.

Index Terms—Block decoding, character recognition, contextual constraints, contour analysis, handprinted characters, limited-data experiments, suboptimum classification.

I. INTRODUCTION

In a general pattern recognition problem, patterns are presented, one after another, to a system which extracts features which are then used, along with a priori pattern probabilities, to recognize the unknown patterns. Let C_ij denote pattern class i (i = 1, 2, ..., M), where subscript j denotes the jth position in the sequence (j = 1, 2, ..., N). Let X_j denote the corresponding feature vector. Let P(X_1, X_2, ..., X_N | C_i1, C_j2, ..., C_kN) denote the probability of the vector sequence X_1, X_2, ..., X_N conditioned on the sequence C_i1, C_j2, ..., C_kN, whose a priori probability is P(C_i1, C_j2, ..., C_kN). The probability of correctly identifying an unknown pattern sequence is maximized by maximizing any monotonic function of

$$R = P(X_1, X_2, \ldots, X_N \mid C_{i1}, C_{j2}, \ldots, C_{kN}) \cdot P(C_{i1}, C_{j2}, \ldots, C_{kN}) \tag{1a}$$

with respect to i, j, ..., k. In what follows, the length D_r of feature vector X_r is not necessarily constant, even for an individual pattern class C_j; in fact these differences in length are used as an aid in recognition. For this reason, (1a) is expanded to account for these differences in length D as follows:

$$R = P(X_1, X_2, \ldots, X_N \mid C_{i1}, C_{j2}, \ldots, C_{kN}, D_{a1}, D_{b2}, \ldots, D_{cN}) \cdot P(D_{a1}, D_{b2}, \ldots, D_{cN} \mid C_{i1}, C_{j2}, \ldots, C_{kN}) \cdot P(C_{i1}, C_{j2}, \ldots, C_{kN}) \tag{1b}$$

where D_ej denotes that vector X_j corresponding to C_ej has D_e components.

We now make the following two assumptions, which significantly reduce the amount of computation and data needed to calculate R for any i, j, ..., k:

$$P(X_1, X_2, \ldots, X_N \mid C_{i1}, C_{j2}, \ldots, C_{kN}, D_{a1}, D_{b2}, \ldots, D_{cN}) = P(X_1 \mid C_{i1}, D_{a1})\, P(X_2 \mid C_{j2}, D_{b2}) \cdots P(X_N \mid C_{kN}, D_{cN}) \tag{2a}$$

and

$$P(D_{a1}, D_{b2}, \ldots, D_{cN} \mid C_{i1}, C_{j2}, \ldots, C_{kN}) = P(D_{a1} \mid C_{i1})\, P(D_{b2} \mid C_{j2}) \cdots P(D_{cN} \mid C_{kN}). \tag{2b}$$

Although (2) is not necessarily exact, it enormously reduces the training and storage capacity required for the quantities on the left-hand side of (2a) and (2b). Even so, it is still necessary to store P(C_i1, C_j2, ..., C_kN) for all M^N pattern sequences, and this required storage capacity increases exponentially with N. What is needed, in addition to (2), are techniques which make effective use of contextual constraints without storing excessive amounts of data or using excessive computation time. When the contextual constraints are in terms of P(X_1, X_2, ..., X_N | C_i1, C_j2, ..., C_kN) and P(C_i1, C_j2, ..., C_kN), there are two basic approaches to the problem.

1) Treat a long sequence of N patterns as N/L (1 ≤ L ≤ N) separate sequences, which are then classified independently of each other using (1) or a suitable approximation. This is block decoding. The block length is L.

2) Successively examine all N − L + 1 sequences of adjacent patterns of length 1 ≤ L ≤ N. Following examination of each sequence, classify the pattern to be excluded from all future sequences. Proper use of this procedure, called sequential decoding, avoids exponential growth with L in computation time.

Both approaches have been studied extensively for decoding coded communication signals [1], [2] and have found some application in pattern recognition [3]-[8]. We note here that when the patterns are characters from a natural language, higher level linguistic constraints would also be used along with the statistical constraints referred to above.

This paper presents results of experiments conducted in order to determine the effect of using contextual constraints to recognize contour-traced handprinted characters. Block decoding was used with L = 1 (monogram constraints), L = 2 (bigram constraints), and L = 3 (trigram constraints). Equiprobable characters were also used, and this case is denoted by L = 0. The data and the feature extraction algorithm are described in Section II. The classification algorithm is described in Section III. The results, presented in Section IV, show that error probability decreases by factors between two and four over those which result when statistical contextual constraints are ignored. It is also shown that a technique for searching over only the most probable character sequences significantly reduces the amount of computation without further degradation of performance. The results and their implications are discussed in Section V.

Manuscript received March 31, 1970; revised June 15, 1970. This work was supported by the National Research Council of Canada under Grant NRC A-3308 and by the Defence Research Board of Canada under Grant DRB 2801-30. The authors are with the Department of Electrical Engineering, University of British Columbia, Vancouver, Canada.

II. DATA AND FEATURE EXTRACTION

The data consisted of 14 upper-case Roman alphabets, two from each of seven persons, as described earlier [9]. Each character was spatially quantized into a 50 × 50 array and punched on IBM cards. Subjects were required to refrain from making broken characters. Ten different graduate students were able to recognize the quantized characters with 99.7 percent accuracy.

The feature vector for any character was obtained by simulating a discrete scanner on an IBM 7044 computer; the scanner operates as follows. The scanning spot moves, point by point, from the bottom to the top of the leftmost column, and successively repeats this procedure on the column immediately to the right of the column previously scanned, until the first black point is found. Upon locating this point, the scanner enters the CONTOUR mode, in which the scanning spot moves right after encountering a white point and left after encountering a black point. The CONTOUR mode terminates when the scanning spot completes its trace around the outside of the character and returns to its starting point.

After being scanned, the character is divided into either four or six equal-sized rectangles whose size depends on the height and width of the letter (see Fig. 1). A y threshold equal to one-half each rectangle's height and an x threshold equal to one-half each rectangle's width are defined. Whenever the x coordinate of the scanning spot reaches a local extremum and then moves in the opposite direction to a point one threshold away from that extremum, the resulting point is designated as either an x_max or an x_min. After an x_max (x_min) has occurred, no additional x_max's (x_min's) are recorded until after an x_min (x_max) has occurred. Analogous comments apply to the y coordinate of the scanning spot. The starting point of the CONTOUR mode is regarded as an x_min.

The CODE word for a character consists of a 1 followed by binary digits whose order coincides with the order in which extrema occur during contour tracing; 1 denotes max's and min's in x, while 0 denotes max's and min's in y. The rectangles are designated by binary numbers, and the ordering of these numbers in accordance with the rectangles in which the extrema fall in event sequence constitutes the COORD word. The feature vector consists of the CODE word followed by the COORD word. This feature extraction scheme is our modification [9] of one originally devised by Clemens and Mason [10]-[12].

Fig. 1. CODE and COORD words for C. (a) Four-part area division. (b) Six-part area division.

III. CLASSIFICATION ALGORITHM

The D_r components of any vector X_r = (x_r1, x_r2, ..., x_rDr) are assumed statistically independent, with the result that

$$P(X_r \mid C_{ir}, D_{ar}) = P(x_{r1}, x_{r2}, \ldots, x_{rD_{ar}} \mid C_{ir}, D_{ar}) = \prod_{j=1}^{D_{ar}} P(x_{rj} \mid C_{ir}, D_{ar}). \tag{3}$$

From (1), (2), and (3) it now follows that the optimum classification of a character sequence of length L results if i, j, ..., k are chosen to maximize¹

$$T_{ij \cdots k} = \left[\sum_{u=1}^{D_{a1}} \ln P(x_{1u} \mid C_{i1}, D_{a1}) + \ln P(D_{a1} \mid C_{i1})\right] + \left[\sum_{u=1}^{D_{b2}} \ln P(x_{2u} \mid C_{j2}, D_{b2}) + \ln P(D_{b2} \mid C_{j2})\right] + \cdots + \left[\sum_{u=1}^{D_{cL}} \ln P(x_{Lu} \mid C_{kL}, D_{cL}) + \ln P(D_{cL} \mid C_{kL})\right] + \ln P(C_{i1}, C_{j2}, \ldots, C_{kL}). \tag{4}$$

To search over all M^L possible sequences causes the computation time to increase exponentially with L. For M = 26, L = 2 and L = 3 give 676 and 17 576 different sequences, respectively. To reduce the amount of searching required, the following procedure was used for L = 2 (bigrams) and L = 3 (trigrams). First, the feature vector from each character in a sequence of L unknown characters was used to calculate

$$Q = \sum_{u=1}^{D_a} \ln P(x_u \mid C_i, D_a) + \ln P(D_a \mid C_i) \tag{5}$$

for all 26 pattern classes. The pattern classes were then ranked in order corresponding to the value of Q; thus the pattern class for which Q was largest was ranked first, the pattern class for which Q was second largest was ranked second, and so on. Equation (4) was then maximized over all d^L sequences containing the pattern classes having a rank between 1 and d, inclusive.
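The ranked reduced search of Section III can be sketched as follows. This is an illustrative reconstruction of the procedure around (4) and (5), not the authors' IBM 7044 program; the function name, the dictionary-based probability tables, and the toy class labels are all hypothetical.

```python
import itertools
import math

def classify_sequence(q_scores, seq_log_prob, d):
    """Rank-and-prune search over a length-L character sequence.

    q_scores: list of L dicts mapping class -> Q value of eq. (5),
        the per-character log-likelihood plus length term.
    seq_log_prob: dict mapping a class tuple -> ln P(sequence);
        tuples absent from the dict are "illegal" L-grams.
    d: depth of search; only the d top-ranked classes per position
        are considered, giving d**L candidate sequences.
    """
    # Rank the classes at each position by decreasing Q (eq. (5)).
    ranked = [sorted(q, key=q.get, reverse=True)[:d] for q in q_scores]

    best, best_t = None, -math.inf
    for seq in itertools.product(*ranked):       # d**L candidates
        if seq not in seq_log_prob:              # illegal L-gram
            continue
        # T of eq. (4): per-character terms plus the context term.
        t = sum(q[c] for q, c in zip(q_scores, seq)) + seq_log_prob[seq]
        if t > best_t:
            best, best_t = seq, t

    if best is None:
        # All d**L candidates were illegal: classify each character
        # individually by maximizing Q, ignoring context.
        best = tuple(max(q, key=q.get) for q in q_scores)
    return best
```

The fallback branch mirrors the paper's rule that when every candidate L-gram is illegal, each character is identified individually by maximizing Q.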
In the experiments described in the next section, five-digit estimates of P(C_i) for individual characters were obtained from Pierce [13]. The 304 legal bigram and 2510 legal trigram probabilities were obtained from relative-frequency data in Pratt [14]. All other bigrams and trigrams were considered illegal. Thus, when d was so small that all d^L sequences of characters were illegal, the characters in the sequence were identified individually, without using context, by maximizing Q in (5).

Probabilities P(x_u | C_i, D_a) and P(D_a | C_i) were learned by determining the relative number of times a component x_u or a length D_a occurred, given the joint event (C_i, D_a) or the event C_i. When an experimentally determined probability P = 0, we set P = 1/(z + 2) [15], where z is the number of times (C_i, D_a) or C_i occurred during training.

¹ This algorithm is an extension of algorithm T used earlier [9].

IV. EXPERIMENTS AND RESULTS

Four experiments were conducted for each of the four values of L. In the first two experiments, the same 14 alphabets were used both for learning P(x_u | C_i, D_a) and P(D_a | C_i) and for testing. In the first experiment, the four-part area division was used (see Fig. 1); the six-part division was used in the second experiment. The last two experiments also differed from each other solely on the basis of area division. In these, ten alphabets from five individuals were used for training, while the four alphabets from the remaining two persons were used to form monograms, bigrams, and trigrams for testing. In these last two experiments, the results were averaged over seven trials; in the ith trial, data from persons i and i + 1 (i = 1, 2, ..., 6) were used for testing, while in the seventh trial data from persons 7 and 1 were used.² For L = 1, 2, or 3, 2(2L) sample L-grams were used for testing, while four samples of each character were used for L = 0. When the same data were used for both training and testing, 7(2L) sample L-grams were used for testing in each trial when L = 1, 2, or 3, and 14 samples of each character were used for L = 0. No L-gram test sample consisted of individual characters printed by different persons.

The results of the above experiments yielded maximum likelihood estimates of P(ε | C_i1, C_j2, ..., C_kL), which denotes the probability of error ε given the sequence C_i1, C_j2, ..., C_kL. The probability of error averaged over all sequences was then obtained from

$$P(\varepsilon) = \sum_{i=1}^{26} \sum_{j=1}^{26} \cdots \sum_{k=1}^{26} P(\varepsilon \mid C_{i1}, C_{j2}, \ldots, C_{kL}) \cdot P(C_{i1}, C_{j2}, \ldots, C_{kL}). \tag{6}$$

Fig. 2 shows P(ε) versus L for the four experiments. The L = 0 and L = 1 results were obtained by searching over all pattern classes which yielded, during training, samples whose feature vector length equaled that of the unknown character. The L = 2 and L = 3 results were obtained in the same way, except that d = 4. This value for d was obtained by calculating P(ε) versus d for the four-part area division (4-PAD) scheme with L = 2 and L = 3 when the training and test data were disjoint (TS case). For L = 2, d assumed all integer values from 1 to 16 inclusive, since 16 was the maximum needed with our data. For L = 3, d went from 1 to 5. The results in Fig. 3 show a minimum P(ε) at d = 4. This value of d = 4 was subsequently used to obtain all the results for L = 2 and L = 3 in Fig. 2.

Fig. 2. Error probability versus order of contextual constraints. TS denotes that training and test data were disjoint; TR signifies that training and test data were identical.

Fig. 3. Error probability versus depth of search d.

V. DISCUSSION

From Fig. 2 it follows that P(ε) decreases as L increases, but at a decreasing rate. The relative improvement resulting from use of context seems to increase as recognition accuracy for L = 0 increases. For the TR-6, TR-4, TS-6, and TS-4 cases, P(ε) for L = 3 is, respectively, 0.23, 0.28, 0.52, and 0.49 times its value at L = 0. Such behavior is reasonable, since one would expect correction of errors using context to be easiest when the surrounding text is correct.

The 6-PAD scheme is better than the 4-PAD scheme for all L ≤ 3, as noted earlier [9] for L = 0 and L = 1. The TS results are not as good as the corresponding TR results, a difference also observed by others [16]-[19].

In Fig. 3, P(ε) decreases as d increases for 1 ≤ d ≤ 4 because an increase in d allows more legal character sequences to be examined. The fact that P(ε) actually increases as d increases beyond d = 4 has to be due to the fact that an approximation to R, rather than R as given by (1), is being maximized; the approximation results from (2), from (3), and from the fact that P(x | C_i, D_a) and P(D_a | C_i) are not known exactly in the TS case.

Consider now two tradeoffs. First is the tradeoff between L and d. Not only is the performance for L = 2 and d = 4 better than for L = 3 and d = 3, but the amount of computation needed in the former case (4²/2 = eight searches per character) is less than that for the latter case (3³/3 = nine searches per character). Second, the 6-PAD L = 2 TS case yields better performance than does the 4-PAD L = 3 TS case. For the TR experiments, the 6-PAD L = 1 and L = 2 cases yield better performance than does the 4-PAD L = 3 case. For the 4-PAD and 6-PAD cases, the average length of the feature vectors was 17 and 25, respectively, and the number of different (C_i, D_a) values was 67 and 66.
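The relative-frequency learning of P(x_u | C_i, D_a) and P(D_a | C_i) described in Section III, with the zero-count replacement P = 1/(z + 2) of Good [15], can be sketched as follows; the function name is hypothetical and this is a minimal illustration, not the authors' training program.

```python
def learned_probability(count, z):
    """Relative-frequency estimate of P(x_u | C_i, D_a) or P(D_a | C_i).

    count: number of times the component value (or vector length) was
        observed together with the conditioning event during training.
    z: number of times the conditioning event (C_i, D_a) or C_i occurred.
    A probability that would otherwise be estimated as zero is replaced
    by 1/(z + 2), following Good [15], so that the ln P terms in
    eqs. (4) and (5) remain finite.
    """
    if z == 0 or count == 0:
        return 1.0 / (z + 2)
    return count / z
```

Keeping every learned probability strictly positive is what allows the log-likelihood sums in (4) and (5) to be compared across all candidate classes, even those never seen with a given feature value during training.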
It follows that with d = 4 the 6-PAD scheme for L = 1 and L = 2 requires less computation time and storage than the 4-PAD L = 3 case; the extra storage needed for P(x_u | C_i, D_a) for the 6-PAD vectors is more than offset by the saving which results from not having to store trigram probabilities. It is important to balance the amount of information from character features with that obtained using contextual constraints, as Raviv [6] has suggested.

From [9] it follows that the results in Fig. 2 would improve considerably if a few additional tests were used to differentiate between commonly confused characters, and/or if all character samples were from one individual. Higher level semantic constraints, whose use is feasible in a well structured computer language such as FORTRAN, would further reduce the error rate [5]. The results in Fig. 2 would also improve if values L > 3 were used. Unfortunately, the number of different sequences for which T_ij...k in (4) would have to be computed increases exponentially with L for fixed d. This exponential growth can be avoided by using sequential decoding [1], [6].

We note that requiring subjects to make unbroken characters did not greatly inconvenience our subjects, and that methods for handling broken characters are described elsewhere [9]. Finally, we note that our results support those of Bakis et al. [20] in that curve-following features extract significant information from handprinting, and those of Raviv [6] and Duda and Hart [5] in that use of context significantly improves recognition accuracy.

² This procedure yields a better estimate of performance on small data sets than does the often used holdout method [16].

REFERENCES

[1] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965, ch. 6.
[2] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968, ch. 6.
[3] B. Gold, "Machine recognition of hand-sent Morse code," IRE Trans. Inform. Theory, vol. IT-5, pp. 17-24, March 1959.
[4] W. W. Bledsoe and J. Browning, "Pattern recognition and reading by machine," 1959 Proc. Eastern Joint Computer Conf., vol. 16, pp. 225-232; also in L. Uhr, Pattern Recognition. New York: Wiley, 1966, pp. 301-316.
[5] R. O. Duda and P. E. Hart, "Experiments in the recognition of handprinted text: Part II, context analysis," 1968 Fall Joint Computer Conf., AFIPS Proc., vol. 33, pt. 2. Washington, D.C.: Thompson, 1968, pp. 1139-1149.
[6] J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition," IEEE Trans. Inform. Theory, vol. IT-13, pp. 536-551, October 1967.
[7] R. Alter, "Use of contextual constraints in automatic speech recognition," IEEE Trans. Audio, vol. AU-16, pp. 6-11, March 1968.
[8] K. Abend, "Compound decision procedures for pattern recognition," 1966 Proc. NEC, vol. 22, pp. 777-780.
[9] G. T. Toussaint and R. W. Donaldson, "Algorithms for recognizing contour-traced handprinted characters," IEEE Trans. Computers (Short Notes), vol. C-19, pp. 541-546, June 1970.
[10] J. K. Clemens, "Optical character recognition for reading machine applications," Ph.D. dissertation, Dept. of Elec. Engrg., Massachusetts Institute of Technology, Cambridge, Mass., August 1965.
[11] S. J. Mason and J. K. Clemens, "Character recognition in an experimental reading machine for the blind," in Recognizing Patterns, P. A. Kolers and M. Eden, Eds. Cambridge, Mass.: M.I.T. Press, 1968, pp. 156-167.
[12] S. J. Mason, F. F. Lee, and D. E. Troxel, "Reading machine for the blind," M.I.T. Electronics Research Lab., Cambridge, Mass., Quart. Progress Rept. 89, pp. 245-248, April 1968.
[13] J. R. Pierce, Symbols, Signals and Noise. New York: Harper and Row, 1961, p. 283.
[14] F. Pratt, Secret and Urgent. New York: Blue Ribbon, 1942.
[15] I. J. Good, The Estimation of Probabilities, Res. Monograph 30. Cambridge, Mass.: M.I.T. Press, 1965.
[16] L. Kanal and B. Chandrasekaran, "On dimensionality and sample size in statistical pattern classification," 1968 Proc. Natl. Electronics Conf., vol. 24, pp. 2-7.
[17] J. H. Munson, "The recognition of hand-printed text," in Pattern Recognition, L. N. Kanal, Ed. Washington, D.C.: Thompson, 1968, pp. 109-140.
[18] W. H. Highleyman, "The design and analysis of pattern recognition experiments," Bell Sys. Tech. J., vol. 41, pp. 723-744, March 1962.
[19] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inform. Theory, vol. IT-14, pp. 55-63, January 1968.
[20] R. Bakis, N. M. Herbst, and G. Nagy, "An experimental study of machine recognition of hand-printed numerals," IEEE Trans. Sys. Sci. Cybern., vol. SSC-4, pp. 119-132, July 1968.

Computer Experience on Partitioned List Algorithms

E. MORREALE AND M. MENNUCCI

Abstract—The main characteristics of some programs implementing a number of different versions of partitioned list algorithms are described, and the results of a systematic plan of experiments performed on these programs are reported. These programs concern the determination of all the prime implicants, a prime implicant covering, or an irredundant normal form of a Boolean function. The experiments performed on these programs concern mainly the computer time required, the number of prime implicants obtained, and their distribution in families. The results obtained from these tests demonstrate that relatively large Boolean functions, involving even some thousands of canonical clauses, can be very easily processed by present-day electronic computers.

Index Terms—Boolean function, computer programs, computing times, covering, experimental results, incompletely specified, irredundant normal forms, minimization, partitioned list, prime implicants.

I. INTRODUCTION

Recently a new class of algorithms, partitioned list algorithms, has been proposed [1] for determining, for any given Boolean function, either the set of all the prime implicants or a prime implicant covering of the function, from which an irredundant normal form can easily be obtained. Some theoretical results [2] concerning the computational complexity of partitioned list algorithms for prime implicant determination indicate that partitioned list algorithms compare favorably with both Quine's [3] and McCluskey's [4] nonpartitioned list algorithms. However, from these theoretical evaluations only a rough estimate can be made of the average computing time actually required for finding all the prime implicants of a Boolean function. Furthermore, in considering the actual performance of partitioned list algorithms, it is interesting both to compare different versions of algorithms having the same purpose, i.e., the determination of all the prime implicants, and also to compare similarly structured algorithms having different purposes, i.e., the determination of all the prime implicants, of only a prime implicant covering, or of an irredundant normal form.

In order to evaluate both the relative and the absolute performance of some different versions of partitioned list algorithms, a number of programs have been realized for the IBM 7090 and a systematic plan of tests has been conducted on them. With the aim of obtaining indications on the absolute performance of partitioned list algorithms on exist-

Manuscript received June 7, 1968; revised December 11, 1969. The authors are with the Istituto Elaborazione dell'Informazione, Pisa, Italy.
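For reference, the nonpartitioned baseline that partitioned list algorithms are compared against can be sketched as a brute-force Quine-McCluskey-style merge. This is an illustrative sketch of the classical method of [3], [4], not of the partitioned list algorithms of [1]; the function name and representation are hypothetical.

```python
from itertools import combinations

def prime_implicants(minterms):
    """All prime implicants of the Boolean function given by its true
    minterms, via repeated merging of adjacent implicants.

    Implicants are (value, mask) pairs: bits set in mask are
    don't-cares, and value holds the fixed bits. Two implicants with
    the same mask that differ in exactly one fixed bit merge into a
    larger implicant; implicants that never merge are prime.
    """
    current = {(m, 0) for m in minterms}   # start from the minterms
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            if a[1] != b[1]:
                continue                   # must share the same mask
            diff = a[0] ^ b[0]
            if bin(diff).count("1") == 1:  # adjacent: one differing bit
                merged.add((a[0] & b[0], a[1] | diff))
                used.update((a, b))
        primes |= current - used           # unmerged implicants are prime
        current = merged
    return primes
```

For f(x, y) with minterms {0, 1, 3} (i.e., f = x' + y over two variables), the sketch yields the two primes (0, 1) covering minterms 0 and 1, and (1, 2) covering minterms 1 and 3.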