
IEEE TRANSACTIONS ON COMPUTERS, NOVEMBER 1970
Use of Contextual Constraints in Recognition
of Contour-Traced Handprinted Characters
ROBERT W. DONALDSON AND GODFRIED T. TOUSSAINT
Abstract-A contour-tracing technique described in an earlier
paper [9] was used along with monogram, bigram, and trigram contextual constraints to recognize handprinted characters. Error probabilities decreased by factors ranging from two to four over those
which resulted when contextual constraints were ignored. A simple
technique for searching over only the most probable bigrams and
trigrams was used to significantly reduce the computation without
reducing the recognition accuracy.
Index Terms-Block decoding, character recognition, contextual
constraints, contour analysis, handprinted characters, limited-data
experiments, suboptimum classification.
I. INTRODUCTION

In a general pattern recognition problem, patterns are presented, one after another, to a system which extracts features which are then used, along with a priori pattern probabilities, to recognize the unknown patterns. Let Cij denote pattern class i (i = 1, 2, ..., M), where subscript j denotes the jth position in the sequence (j = 1, 2, ..., N). Let Xj denote the corresponding feature vector, and let P(X1, X2, ..., XN | Ci1, Cj2, ..., CkN) denote the probability of the vector sequence X1, X2, ..., XN conditioned on the sequence Ci1, Cj2, ..., CkN, whose a priori probability is P(Ci1, Cj2, ..., CkN). The probability of correctly identifying an unknown pattern sequence is maximized by maximizing any monotonic function of

    R = P(X1, X2, ..., XN | Ci1, Cj2, ..., CkN) P(Ci1, Cj2, ..., CkN)    (1a)

with respect to i, j, ..., k. In what follows, the length Dr of feature vector Xr is not necessarily constant, even for an individual pattern class Ci; in fact, these differences in length are used as an aid in recognition. For this reason, (1a) is expanded to account for these differences in length D as follows:

    R = P(X1, X2, ..., XN | Ci1, Cj2, ..., CkN, Da1, Db2, ..., DcN)
        · P(Da1, Db2, ..., DcN | Ci1, Cj2, ..., CkN)
        · P(Ci1, Cj2, ..., CkN)    (1b)

where Dcj denotes that the vector Xj corresponding to Ckj has Dc components.

We now make the following two assumptions, which significantly reduce the amount of computation and data needed to calculate R for any i, j, ..., k:

    P(X1, X2, ..., XN | Ci1, Cj2, ..., CkN, Da1, Db2, ..., DcN)
        = P(X1 | Ci1, Da1) P(X2 | Cj2, Db2) ... P(XN | CkN, DcN)    (2a)

and

    P(Da1, Db2, ..., DcN | Ci1, Cj2, ..., CkN)
        = P(Da1 | Ci1) P(Db2 | Cj2) ... P(DcN | CkN).    (2b)

Although (2) is not necessarily exact, it enormously reduces the training and storage capacity required for the quantities on the left-hand side of (2a) and (2b). Even so, it is still necessary to store P(Ci1, Cj2, ..., CkN) for all M^N pattern sequences, and this required storage capacity increases exponentially with N. What is needed, in addition to (2), are techniques which make effective use of contextual constraints without storing excessive amounts of data or using excessive computation time. When the contextual constraints are in terms of P(X1, X2, ..., XN | Ci1, Cj2, ..., CkN) and P(Ci1, Cj2, ..., CkN), there are two basic approaches to the problem.

1) Treat a long sequence of N patterns as N/L (1 ≤ L ≤ N) separate sequences, which are then classified independently of each other using (1) or a suitable approximation. This is block decoding. The block length is L.

2) Successively examine all N − L + 1 sequences of adjacent patterns of length L (1 ≤ L ≤ N). Following examination of each sequence, classify the pattern to be excluded from all future sequences. Proper use of this procedure, called sequential decoding, avoids exponential growth with L in computation time.

Both approaches have been studied extensively for decoding coded communication signals [1], [2] and have found some application in pattern recognition [3]-[8]. We note here that when the patterns are characters from a natural language, higher level linguistic constraints would also be used along with the statistical constraints referred to above.

This paper presents the results of experiments conducted to determine the effect of using contextual constraints to recognize contour-traced handprinted characters. Block decoding was used with L = 1 (monogram constraints), L = 2 (bigram constraints), and L = 3 (trigram constraints). Equiprobable characters were also used; this case is denoted by L = 0. The data and the feature extraction algorithm are described in Section II. The classification algorithm is described in Section III. The results, presented in Section IV, show that error probability decreases by factors between two and four over those which result when statistical contextual constraints are ignored. It is also shown that a technique for searching over only the most probable character sequences significantly reduces the amount of computation without further degradation of performance. The results and their implications are discussed in Section V.
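The block decoding rule of (1a), combined with the independence assumption (2a), amounts to maximizing a sum of per-character log-likelihoods plus an L-gram log prior. The toy Python sketch below illustrates this for L = 2; it is an illustration only, not the authors' implementation, and all tables and numbers in it are hypothetical.

```python
import math

# Toy sketch of block decoding with L = 2 (not the authors' code).
# log_lik[j][c] stands in for ln P(X_j | c): the log-likelihood of the
# observed features at position j under class c. log_bigram[(c1, c2)]
# stands in for ln P(c1, c2), the bigram prior. Both are hypothetical.

def classify_block(log_lik, log_bigram, classes):
    """Return the pair (c1, c2) maximizing ln R of (1a) under (2a)."""
    best, best_score = None, -math.inf
    for c1 in classes:
        for c2 in classes:
            score = (log_lik[0].get(c1, -math.inf)
                     + log_lik[1].get(c2, -math.inf)
                     + log_bigram.get((c1, c2), -math.inf))
            if score > best_score:
                best, best_score = (c1, c2), score
    return best

# Made-up numbers: the features weakly favor "QO", but the bigram
# prior strongly favors "ON", so context flips the decision.
log_lik = [{"Q": -1.0, "O": -1.2}, {"O": -1.0, "N": -1.1}]
log_bigram = {("O", "N"): -2.0, ("Q", "O"): -6.0}
print(classify_block(log_lik, log_bigram, ["Q", "O", "N"]))  # → ('O', 'N')
```

The example mirrors the mechanism the paper exploits: a character that is marginally misread on features alone can be corrected when its neighbors make the true sequence far more probable a priori.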
Manuscript received March 31, 1970; revised June 15, 1970. This work
was supported by the National Research Council of Canada under Grant
NRC A-3308 and by the Defence Research Board of Canada under Grant
DRB 2801-30.
The authors are with the Department of Electrical Engineering, University of British Columbia, Vancouver, Canada.
II. DATA AND FEATURE EXTRACTION

The data consisted of 14 upper-case Roman alphabets, two from each of seven persons, as described earlier [9]. Each character was spatially quantized into a 50 x 50 array and punched on IBM cards. Subjects were required to refrain from making broken characters. Ten different graduate students were able to recognize the quantized characters with 99.7 percent accuracy.

The feature vector for any character was obtained by simulating a discrete scanner on an IBM 7044 computer; the scanner operates as follows. The scanning spot moves, point by point, from the bottom to the top of the leftmost column, and successively repeats this procedure on the column immediately to the right of the column previously scanned, until the first black point is found. Upon locating this point, the scanner enters the CONTOUR mode, in which the scanning spot moves right after encountering a white point and left after encountering a black point. The CONTOUR mode terminates when the scanning spot completes its trace around the outside of a character and returns to its starting point.

After being scanned, the character is divided into either four or six equal-sized rectangles whose size depends on the height and width of the letter (see Fig. 1). A y threshold equal to one-half each rectangle's height and an x threshold equal to one-half each rectangle's width are defined. Whenever the x coordinate of the scanning spot reaches a local extremum and moves in the opposite direction to a point one threshold away from the resulting extremum, the resulting point is designated as either an xmax or an xmin. After an xmax (xmin) has occurred, no additional xmax's (xmin's) are recorded until after an xmin (xmax) has occurred. Analogous comments apply to the y coordinate of the scanning spot. The starting point of the CONTOUR mode is regarded as an xmin. The CODE word for a character consists of a 1 followed by binary digits whose order coincides with the order in which extrema occur during contour tracing; 1 denotes max's and min's in x, while 0 denotes max's and min's in y. The rectangles are designated by binary numbers, and the ordering of these numbers in accordance with the rectangles in which successive extrema fall constitutes the COORD word. The feature vector consists of the CODE word followed by the COORD word. This feature extraction scheme is our modification [9] of one originally devised by Clemens and Mason [10]-[12].

[Fig. 1. CODE and COORD word for C. (a) Four-part area division. (b) Six-part area division.]

III. CLASSIFICATION ALGORITHM

The Dr components of any vector Xr = (xr1, xr2, ..., xrDr) are assumed statistically independent, with the result that

    P(Xr | Cir, Dar) = P(xr1, xr2, ..., xrDar | Cir, Dar)
        = ∏_{j=1..Dar} P(xrj | Cir, Dar).    (3)

From (1), (2), and (3) it now follows that the optimum classification of a character sequence of length L results if i, j, ..., k are chosen to maximize¹

    Tij...k = [ Σ_{u=1..Da1} ln P(x1u | Ci1, Da1) + ln P(Da1 | Ci1) ]
        + [ Σ_{u=1..Db2} ln P(x2u | Cj2, Db2) + ln P(Db2 | Cj2) ] + ...
        + [ Σ_{u=1..DcL} ln P(xLu | CkL, DcL) + ln P(DcL | CkL) ]
        + ln P(Ci1, Cj2, ..., CkL).    (4)

To search over all M^L possible sequences causes the computation time to increase exponentially with L. For M = 26, L = 2, and L = 3, there are 676 and 17 576 different sequences, respectively. To reduce the amount of searching required, the following procedure was used for L = 2 (bigrams) and L = 3 (trigrams). First, the feature vector from each character in a sequence of L unknown characters was used to calculate

    Q = Σ_{u=1..Da} ln P(xu | Ci, Da) + ln P(Da | Ci)    (5)

for all 26 pattern classes. The pattern classes were then ranked in order corresponding to the value of Q; thus the pattern class for which Q was largest was ranked first, the pattern class for which Q was second largest was ranked second, and so on. Equation (4) was then maximized over all d^L sequences containing the pattern classes having a rank between 1 and d, inclusive.

In the experiments described in the next section, five-digit estimates of P(Ci) for individual characters were obtained from Pierce [13]. The 304 legal bigram and the 2510 legal trigram probabilities were obtained from relative-frequency data in Pratt [14]. All other bigrams and trigrams were considered illegal. Thus, when d was so small that all d^L sequences of characters were illegal, the characters in the sequence were identified individually, without using context, by maximizing Q in (5).

Probabilities P(xu | Ci, Da) and P(Da | Ci) were learned by determining the relative number of times a component xu or a length Da occurred, given the joint event (Ci, Da) or the event Ci. When an experimentally determined probability P = 0, we set P = 1/(z + 2) [15], where z is the number of times (Ci, Da) or Ci occurred during training.

¹ This algorithm is an extension of algorithm T used earlier [9].
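The ranked depth-of-search procedure of Section III can be sketched as follows. This Python fragment is an illustration of the method, not the original program: `q` stands for the per-character quantity Q of (5) for each position and class, `log_prior` for the log probabilities of the legal L-grams, and the fallback when every candidate sequence is illegal is the no-context classification described in the text. All names and numbers are hypothetical.

```python
import math
from itertools import product

# Hedged sketch of the depth-of-search procedure (not the authors'
# IBM 7044 code). q[r][c] plays the role of Q in (5) at position r for
# class c; log_prior maps legal L-grams to ln P(Ci1, ..., CkL).
# Sequences absent from log_prior are treated as illegal.

def ranked_search(q, log_prior, d):
    """Maximize (4) over the d**L sequences built from the d top-ranked
    classes at each position; fall back to the per-character maxima of
    (5) when every such sequence is illegal."""
    # Rank classes at each position by Q, largest first; keep the top d.
    tops = [sorted(qr, key=qr.get, reverse=True)[:d] for qr in q]
    best, best_score = None, -math.inf
    for seq in product(*tops):
        prior = log_prior.get(seq, -math.inf)
        if prior == -math.inf:
            continue  # illegal L-gram: skip
        score = sum(q[r][c] for r, c in enumerate(seq)) + prior
        if score > best_score:
            best, best_score = seq, score
    if best is None:  # all d**L candidates illegal: ignore context
        best = tuple(t[0] for t in tops)
    return best

# Made-up numbers: "T" narrowly beats "I" on features at position 0,
# but only the bigram "IN" is legal, so context flips the decision.
q = [{"T": -1.0, "I": -1.2, "L": -9.0}, {"N": -1.0, "O": -3.0, "L": -9.0}]
log_prior = {("I", "N"): -3.0}
print(ranked_search(q, log_prior, d=2))  # → ('I', 'N')
```

With d = 1 and an empty prior table, the same routine degenerates to independent per-character classification, matching the paper's fallback rule for small d.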
IV. EXPERIMENTS AND RESULTS

Four experiments were conducted for each of the four values of L. In the first two experiments, the same 14 alphabets were used for learning P(xu | Ci, Da) and P(Da | Ci), and for testing. In the first experiment, the four-part area division was used (see Fig. 1). The six-part division was used in the second experiment. The last two experiments, which also differed from each other solely on the basis of area division, used ten alphabets from five individuals for training, while the four alphabets from the remaining two persons were used to form monograms, bigrams, and trigrams for testing. In these last two experiments, the results were averaged over seven trials; in the ith trial, data from persons i and i + 1 (i = 1, 2, ..., 6) were used for testing, while in the seventh trial data from persons 7 and 1 were used.² For L = 1, 2, or 3, 2(2L) sample L-grams were used for testing, while four samples of each character were used for L = 0. When the same data were used for both training and testing, 7(2L) sample L-grams were used for testing for each trial when L = 1, 2, or 3, and 14 samples of each character were used for L = 0. No L-gram test samples consisted of individual characters printed by different persons.

[Fig. 2. Error probability versus order of contextual constraints. TS denotes that training and test data were disjoint. TR signifies that training and test data were identical. Curves shown: TS-4, TS-6, TR-4, TR-6.]

[Fig. 3. Error probability versus depth of search d.]
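The seven-trial rotation described above is easy to mis-bookkeep at the wraparound. The sketch below (an assumption about the indexing, not the authors' code) makes the hold-out explicit: in trial i the alphabets from persons i and i + 1 are held out, with trial 7 wrapping to persons 7 and 1.

```python
# Sketch of the seven-trial rotation of Section IV (hypothetical
# bookkeeping, not the authors' code): trial i holds out persons
# i and i + 1 for testing, wrapping 7 -> 1 on the last trial.

def trial_split(i, persons=(1, 2, 3, 4, 5, 6, 7)):
    """Return (training_persons, test_persons) for trial i = 1..7."""
    test = (persons[i - 1], persons[i % len(persons)])
    train = tuple(p for p in persons if p not in test)
    return train, test

for i in range(1, 8):
    print(i, trial_split(i))
```

Every person's data appear in the test set of exactly two trials, which is what makes averaging over the seven trials a less biased estimate than a single holdout split (cf. footnote 2).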
The results of the above experiments yielded maximum likelihood estimates of P(ε | Ci1, Cj2, ..., CkL), which denotes the probability of error ε given the sequence Ci1, Cj2, ..., CkL. The probability of error averaged over all sequences was then obtained from

    P(ε) = Σ_{i=1..26} Σ_{j=1..26} ... Σ_{k=1..26} P(ε | Ci1, Cj2, ..., CkL)
        · P(Ci1, Cj2, ..., CkL).    (6)

Fig. 2 shows P(ε) versus L for the four experiments. The L = 0 and L = 1 results were obtained by searching over all pattern classes whose samples yielded, during training, a feature vector length equal to that of the unknown character. The L = 2 and L = 3 results were obtained in the same way, except that d = 4. This value for d was obtained by calculating P(ε) versus d for the four-part area division (4-PAD) scheme with L = 2 and L = 3 when the training and test data were disjoint (TS case). For L = 2, d assumed all integer values from 1 to 16 inclusive, since 16 was the maximum needed with our data. For L = 3, d went from 1 to 5. The results in Fig. 3 show a minimum P(ε) at d = 4. This value of d = 4 was subsequently used to obtain all the results for L = 2 and L = 3 in Fig. 2.

² This procedure yields a better estimate of performance on small data sets than does the often used holdout method [16].

V. DISCUSSION

From Fig. 2 it follows that P(ε) decreases as L increases, but at a decreasing rate. The relative improvement resulting from use of context seems to increase as recognition accuracy for L = 0 increases. For the TR-6, TR-4, TS-6, and TS-4 cases, P(ε) for L = 3 is, respectively, 0.23, 0.28, 0.52, and 0.49 times its value at L = 0. Such behavior is reasonable, since one would expect correction of errors using context to be easiest when the surrounding text is correct.

The 6-PAD scheme is better than the 4-PAD scheme for all L ≤ 3, as noted earlier [9] for L = 0 and L = 1. The TS results are not as good as the corresponding TR results, a difference also observed by others [16]-[19].

In Fig. 3, P(ε) decreases as d increases for 1 ≤ d ≤ 4 because an increase in d allows more legal character sequences to be examined. The fact that P(ε) actually increases as d increases beyond d = 4 has to be due to the fact that an approximation to R, rather than R as given by (1), is being maximized; the approximation results from (2), from (3), and from the fact that P(xu | Ci, Da) and P(Da | Ci) are not known exactly in the TS case.

Consider now two tradeoffs. First is the tradeoff between L and d. Not only is the performance for L = 2 and d = 4 better than for L = 3 and d = 3, but the amount of computation needed in the former case (4²/2 = eight searches per character) is less than that for the latter case (3³/3 = nine searches per character). Second, the 6-PAD L = 2 TS case yields better performance than does the 4-PAD L = 3 TS case. For the TR experiments, the 6-PAD L = 1 and L = 2 cases yield better performance than does the 4-PAD L = 3 case. For the 4-PAD and 6-PAD cases, the average length of the feature vectors was 17 and 25, respectively, and the number of different (Ci, Da) values was 67 and 66. It follows that with d = 4 the 6-PAD scheme for L = 1 and L = 2 requires less computation time and storage than the 4-PAD L = 3 case; the extra storage needed for P(xu | Ci, Da) for the 6-PAD vectors is more than offset by the saving which results from not having to store trigram probabilities. It is important to balance the amount of information from character features with that obtained using contextual constraints, as Raviv [6] has suggested.

From [9] it follows that the results in Fig. 2 would improve considerably if a few additional tests were used to differentiate between commonly confused characters, and/or if all character samples were from one individual. Higher level semantic constraints, whose use is feasible in a well structured computer language such as FORTRAN, would further reduce the error rate [5]. The results in Fig. 2 would also improve if values L > 3 were used. Unfortunately, the number of different sequences for which Tij...k in (4) would have to be computed increases exponentially with L for fixed d. This exponential growth can be avoided by using sequential decoding [1], [6].

We note that requiring subjects to make unbroken characters did not greatly inconvenience our subjects, and that methods for handling broken characters are described elsewhere [9]. Finally, we note that our results support those of Bakis et al. [20] in that curve-following features extract significant information from handprinting, and those of Raviv [6] and Duda and Hart [5] in that use of context significantly improves recognition accuracy.

REFERENCES

[1] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering. New York: Wiley, 1965, ch. 6.
[2] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968, ch. 6.
[3] B. Gold, "Machine recognition of hand-sent Morse code," IRE Trans. Inform. Theory, vol. IT-5, pp. 17-24, March 1959.
[4] W. W. Bledsoe and J. Browning, "Pattern recognition and reading by machine," 1959 Proc. Eastern Joint Computer Conf., vol. 16, pp. 225-232; also in L. Uhr, Pattern Recognition. New York: Wiley, 1966, pp. 301-316.
[5] R. O. Duda and P. E. Hart, "Experiments in the recognition of handprinted text: Part II-context analysis," 1968 Fall Joint Computer Conf., AFIPS Proc., vol. 33, pt. 2. Washington, D. C.: Thompson, 1968, pp. 1139-1149.
[6] J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition," IEEE Trans. Inform. Theory, vol. IT-13, pp. 536-551, October 1967.
[7] R. Alter, "Use of contextual constraints in automatic speech recognition," IEEE Trans. Audio, vol. AU-16, pp. 6-11, March 1968.
[8] K. Abend, "Compound decision procedures for pattern recognition," 1966 Proc. NEC, vol. 22, pp. 777-780.
[9] G. T. Toussaint and R. W. Donaldson, "Algorithms for recognizing contour-traced handprinted characters," IEEE Trans. Computers (Short Notes), vol. C-19, pp. 541-546, June 1970.
[10] J. K. Clemens, "Optical character recognition for reading machine applications," Ph.D. dissertation, Dept. of Elec. Engrg., Massachusetts Institute of Technology, Cambridge, Mass., August 1965.
[11] S. J. Mason and J. K. Clemens, "Character recognition in an experimental reading machine for the blind," in Recognizing Patterns, P. A. Kolers and M. Eden, Eds. Cambridge, Mass.: M.I.T. Press, 1968, pp. 156-167.
[12] S. J. Mason, F. F. Lee, and D. E. Troxel, "Reading machine for the blind," M.I.T. Electronics Research Lab., Cambridge, Mass., Quart. Progress Rept. 89, pp. 245-248, April 1968.
[13] J. R. Pierce, Symbols, Signals and Noise. New York: Harper and Row, 1961, p. 283.
[14] F. Pratt, Secret and Urgent. New York: Blue Ribbon, 1942.
[15] I. J. Good, The Estimation of Probabilities, Res. Monograph 30. Cambridge, Mass.: M.I.T. Press, 1965.
[16] L. Kanal and B. Chandrasekaran, "On the dimensionality and sample size in statistical pattern classification," 1968 Proc. Natl. Electronics Conf., vol. 24, pp. 2-7.
[17] J. H. Munson, "The recognition of hand-printed text," in Pattern Recognition, L. N. Kanal, Ed. Washington, D. C.: Thompson, 1968, pp. 109-140.
[18] W. H. Highleyman, "The design and analysis of pattern recognition experiments," Bell Sys. Tech. J., vol. 41, pp. 723-744, March 1962.
[19] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inform. Theory, vol. IT-14, pp. 55-63, January 1968.
[20] R. Bakis, N. M. Herbst, and G. Nagy, "An experimental study of machine recognition of hand-printed numerals," IEEE Trans. Sys. Sci. Cybern., vol. SSC-4, pp. 119-132, July 1968.

Computer Experience on Partitioned List Algorithms

E. MORREALE AND M. MENNUCCI

Abstract-The main characteristics of some programs implementing a number of different versions of partitioned list algorithms are described, and the results of a systematic plan of experiments performed on these programs are reported. These programs concern the determination of all the prime implicants, a prime implicant covering, or an irredundant normal form of a Boolean function. The experiments performed on these programs concern mainly the computer time required, the number of prime implicants obtained, and their distribution in families. The results obtained from these tests demonstrate that relatively large Boolean functions, involving even some thousands of canonical clauses, can be very easily processed by present-day electronic computers.

Index Terms-Boolean function, computer programs, computing times, covering, experimental results, incompletely specified, irredundant normal forms, minimization, partitioned list, prime implicants.

I. INTRODUCTION

Recently a new class of algorithms, partitioned list algorithms, has been proposed [1] for determining, for any given Boolean function, either the set of all the prime implicants or a prime implicant covering of the function, from which an irredundant normal form can be easily obtained. Some theoretical results [2] concerning the computational complexity of partitioned list algorithms for prime implicant determination indicate that partitioned list algorithms compare favorably with both Quine's [3] and McCluskey's [4] nonpartitioned list algorithms. However, through these theoretical evaluations, only a rough estimate can be made of the average computing time actually required for finding all the prime implicants of a Boolean function. Furthermore, in considering the actual performance of partitioned list algorithms, it is interesting both to compare different versions of algorithms having the same purpose, i.e., the determination of all the prime implicants, and also to compare similarly structured algorithms having different purposes, i.e., the determination of all the prime implicants, only a prime implicant covering, or an irredundant normal form.

In order to evaluate both the relative and the absolute performance of some different versions of partitioned list algorithms, a number of programs have been realized for the IBM 7090 and a systematic plan of tests has been conducted on them. With the aim of obtaining indications on the absolute performance of partitioned list algorithms on exist-

Manuscript received June 7, 1968; revised December 11, 1969.
The authors are with the Istituto Elaborazione dell'Informazione, Pisa, Italy.