A TYPOGRAPHICAL ERROR CHECKER
As many of you are probably aware, I have a PRO-325 system running an RT11XM operating system. It was fairly obvious from the outset that the system would be used heavily for word processing. Much of the work I have done on the system has been with this aim in mind (eg the patches to KED described in an earlier newsletter).
The word processing system which I have set up consists of a PRO-325 system and an EPSON LX80 printer driven by some homebrew, some DECUS and some DEC software. I hope to outline this in greater detail in a future article. In this article I shall concentrate on TYPO, the typographical error checker.
It was realised early on that some sort of spelling checker would be a highly desirable asset. Most spelling checkers rely on large dictionaries combined with some complicated, but incomplete, semantic rules about spelling. On a home system, it is impractical to have a dictionary containing hundreds of thousands of words. The two main reasons against this are the amount of storage required and the amount of time required to scan the dictionary. A perhaps less impressive reason is that, despite the size of the dictionary, some of the words used will not appear in the dictionary.
Thus the specifications I set for the checker were that the dictionary should be small but functional, so that it would not be costly to store nor costly to scan, and that the checker should avoid difficult semantic rules (eg remove -e before adding -ing), since these would complicate implementation and increase run time.
After a morning spent hunting through the literature (that's another article), I discovered an article entitled:
Computer Detection of Typographical Errors
R. Morris and L. Cherry
IEEE Transactions on Professional Communication
Vol. PC-18, No. 1, March 1975
which described a program called TYPO written for the UNIX operating system.
TYPO is not a 'spelling checker' in the usual sense, and its true nature is best summed up by the title of the original article. It is not meant for people who cannot spell, but rather for those who can spell but cannot type. TYPO is suited to those who are reasonably competent at spelling: those who know when they encounter a word they are unsure how to spell, and who know when they have misspelt a word.
TYPO works on two levels. It requires a small dictionary so that the commonly used words may be cut out of the document. The remaining words are then sorted so that the more peculiar words appear first in the list, which is subsequently printed for the user to peruse.
The dictionary is not used to check the spelling in the same way as the more conventional spelling checkers; rather, it is used to remove words which are known to be correct from further consideration. Thus the dictionary should not contain words which are difficult to spell, but rather words which are used often.
The document is first scanned to obtain a list of all the unique words and to obtain some statistics about the document. The dictionary is scanned and used to cast out the correctly spelt words. An index of peculiarity is calculated for each of the remaining words, which are then sorted using this index into descending order and then printed. The listing should contain most of the misspelt words towards the head of the list.
Two statistics are gathered when scanning the document. Consider the word SPECTRUM. A special delimiter, eg '*', is used as a prefix and a suffix, and the word becomes *SPECTRUM*. The word is broken into digrams, groups of two letters, *S SP PE EC CT TR RU UM M*, and the number of times a particular digram occurs across the entire document is recorded. Each word is also broken into trigrams, groups of three letters, *SP SPE PEC ECT CTR TRU RUM UM*, and the number of occurrences of each trigram across the entire document is recorded. The index of peculiarity for any word may then be calculated by breaking the word up into digrams and trigrams and deriving a number from the frequency counts. This number reflects the likelihood that the word was produced from the same source as the rest of the document. It should be noted that this is completely statistical.
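
To make the breakdown concrete, here is a small Pascal sketch of the tallying and the index calculation. This is my own illustration, not the program's actual code: the precise index formula is given in the Morris and Cherry article, and the form used below (a trigram which is rare relative to its two constituent digrams makes a word look peculiar) is an assumption in that spirit. The dense arrays are used purely for brevity; as discussed later, the program itself avoids them.

program TrigramSketch;

const
  Delim = '*';

type
  Code = 0..26;                   { 0 = delimiter, 1..26 = 'a'..'z' }

var
  Digram  : array[Code, Code] of integer;        { globals start at zero }
  Trigram : array[Code, Code, Code] of integer;

function Enc(c: char): Code;
begin
  if c = Delim then
    Enc := 0
  else
    Enc := ord(c) - ord('a') + 1  { assumes lower case input }
end;

{ Count the digrams and trigrams of one word, delimiters included. }
procedure Tally(w: string);
var
  p: string;
  i: integer;
begin
  p := Delim + w + Delim;         { 'spectrum' -> '*spectrum*' }
  for i := 1 to length(p) - 1 do  { digrams: *s sp pe ec ct tr ru um m* }
    Digram[Enc(p[i]), Enc(p[i+1])] :=
      Digram[Enc(p[i]), Enc(p[i+1])] + 1;
  for i := 1 to length(p) - 2 do  { trigrams: *sp spe pec ... um* }
    Trigram[Enc(p[i]), Enc(p[i+1]), Enc(p[i+2])] :=
      Trigram[Enc(p[i]), Enc(p[i+1]), Enc(p[i+2])] + 1
end;

{ Index of peculiarity of a word: here, the largest score over its
  trigrams, where trigram xyz scores
      (ln f(xy) + ln f(yz)) / 2  -  ln f(xyz)    }
function Peculiarity(w: string): real;
var
  p: string;
  i: integer;
  s, best: real;
begin
  p := Delim + w + Delim;
  best := 0.0;
  for i := 1 to length(p) - 2 do
  begin
    s := (ln(Digram[Enc(p[i]), Enc(p[i+1])]) +
          ln(Digram[Enc(p[i+1]), Enc(p[i+2])])) / 2.0 -
         ln(Trigram[Enc(p[i]), Enc(p[i+1]), Enc(p[i+2])]);
    if s > best then best := s
  end;
  Peculiarity := best
end;

begin
  Tally('spectrum');
  Tally('spectra');
  Tally('spectre');
  writeln('spectrum: ', Peculiarity('spectrum'):6:3)
end.
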
The strategy does work. The dictionary on my system contains about 10000 words and once compressed occupies 71 blocks. It takes about 30 seconds to scan the dictionary on RX50 drives (I suspect it would take even less time if the size of the IO buffer was increased or a faster disk was used). The dictionary is optimised for technical documents, particularly those with a Computer Science flavour (I wonder why?). Some work was done recently which used a lot of chemical terms. The list of words not found in the dictionary occupied a little more than two A4 sheets of paper, with three columns of words per sheet. About 90% of the misspelt words were to be found on the first sheet of paper.
For those of you who are interested in data structures, the following may be of interest. Because the number of words in each document varies, the initial implementation used a binary tree to store the words. The digrams and trigrams were also stored in binary trees. This approach was chosen since the number of possible trigrams is in the order of 26x26x26 = 17576, and even at a byte per trigram this would require 17kb of memory. Given that not all trigrams appear in any given document (eg zzz), a multi-dimensional array solution to the problem is wasteful of resources.
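
For illustration, a sketch of one such tree of trigram counts follows; this is my reconstruction of the approach, not the program's actual code.

program TreeSketch;

type
  PNode = ^TNode;
  TNode = record
    key         : string[3];
    count       : integer;
    left, right : PNode;
  end;

var
  Root: PNode;

{ Find the trigram's node, or create it, walking down from the root.
  It was the average length of this walk which came out at about 13
  nodes per trigram. }
procedure Bump(var t: PNode; k: string);
begin
  if t = nil then
  begin
    new(t);
    t^.key := k;
    t^.count := 1;
    t^.left := nil;
    t^.right := nil
  end
  else if k < t^.key then
    Bump(t^.left, k)
  else if k > t^.key then
    Bump(t^.right, k)
  else
    t^.count := t^.count + 1
end;

begin
  Root := nil;
  Bump(Root, 'spe');
  Bump(Root, 'pec');
  Bump(Root, 'spe');
  writeln(Root^.count)            { root holds 'spe'; prints 2 }
end.
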
It was found that the binary tree solution, although workable, appeared to be very slow. With a view to gauging why the solution was so slow, a statistic gathering option was added to the program. Using this mode of operation it was found that, for a reasonable sized document, some 13 nodes had to be inspected on average for every trigram in the document. Thinking that perhaps the binary tree might be rather unbalanced owing to the infrequent occurrence of some trigrams, a modified binary tree insertion routine was used in which some rebalancing of the tree occurs. This too yielded about 13 nodes on average per trigram. The speed problem had by this stage grown to be the major stumbling block, for even with a small document in the order of 20 blocks, it would take in the order of 2 minutes to scan the document.
The solution was to abandon the flexibility of the binary tree and use a hash addressing system. Three large tables were allocated, one for the digrams, one for the trigrams and one for the words in the document, and a hash addressing system was devised using some information from some relevant books. Using the right magic numbers, the algorithm required about 1 probe on average per trigram. The speed improvement was somewhere in the order of 75%, ie about 30 seconds to scan 20 blocks. RX50 disk latency appears to be the major contributor to the speed of the program at present.
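
The following sketch shows the flavour of the hashing scheme just described, assuming open addressing with linear probing. The table size and the multiplier are illustrative magic numbers of my own choosing, not the ones the program actually uses.

program HashSketch;

const
  TableSize = 4099;               { an illustrative prime, comfortably
                                    bigger than the number of distinct
                                    trigrams in a typical document }

type
  Entry = record
    key   : string[3];            { '' marks a free slot }
    count : integer;
  end;

var
  Table: array[0..TableSize - 1] of Entry;   { globals start zeroed }

function Hash(k: string): integer;
var
  i: integer;
  h: longint;
begin
  h := 0;
  for i := 1 to length(k) do
    h := (h * 31 + ord(k[i])) mod TableSize;  { 31: illustrative only }
  Hash := h
end;

{ Probe linearly from the home address; with a lightly loaded table
  this succeeds in about one probe on average. }
function Lookup(k: string): integer;
var
  h: integer;
begin
  h := Hash(k);
  while (Table[h].key <> '') and (Table[h].key <> k) do
    h := (h + 1) mod TableSize;
  Lookup := h
end;

procedure Bump(k: string);
var
  h: integer;
begin
  h := Lookup(k);
  Table[h].key := k;
  Table[h].count := Table[h].count + 1
end;

begin
  Bump('spe'); Bump('pec'); Bump('spe');
  writeln('spe occurs ', Table[Lookup('spe')].count, ' times')
end.
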
The dictionary of the program can be built in one of two ways depending upon the size. Small dictionaries (less than about 3000 words) may be constructed using the program DICTIO.PAS. Larger dictionaries have to be constructed using the programs WBREAK.PAS, CMPRES.PAS and RTSORT.SAV (or your favourite sort utility).
WBREAK and DICTIO expect the words which comprise the dictionary to be in lower case; the words may contain apostrophes (upper case letters are converted to lower case). There may be more than one word per line, and blank lines are permitted. Words must be separated from each other by any number of blank lines and spaces. A line beginning with a semicolon is considered a comment line and is ignored. The words need not be sorted.
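
As an illustration of these conventions, here is a sketch of the word scanning they imply. This is my own reconstruction, and 'words.txt' is a hypothetical input file.

program WordFileSketch;

var
  f: text;
  line, w: string;
  i: integer;
  c: char;

procedure Flush;
begin
  if w <> '' then writeln(w);     { one word per line, as WBREAK produces }
  w := ''
end;

begin
  assign(f, 'words.txt');         { hypothetical input word file }
  reset(f);
  w := '';
  while not eof(f) do
  begin
    readln(f, line);
    if (length(line) = 0) or (line[1] <> ';') then   { ';' lines are comments }
    begin
      for i := 1 to length(line) do
      begin
        c := line[i];
        if c in ['A'..'Z'] then
          c := chr(ord(c) + 32);                     { fold to lower case }
        if (c in ['a'..'z']) or (c = '''') then
          w := w + c                                 { letters, apostrophes }
        else
          Flush                                      { anything else ends a word }
      end;
      Flush                                          { end of line ends a word }
    end
  end;
  close(f)
end.
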
DICTIO expects the following arguments:
$ - wordfile dictionary
The program will take the words from wordfile, remove any duplicate words, sort the words and produce a compressed dictionary file which may then be used by TYPO.
WBREAK expects the following arguments:
$ - wordfile strippedfile
The program will take the words from wordfile and place them in the strippedfile, one word per line. The strippedfile is not sorted; your favourite sort program must be used at this stage to sort the strippedfile so that the words appear in ASCII ascending order, one word per line.
CMPRES expects the following arguments:
$ - sortedstrippedfile dictionary
The program will take the words from the sortedstrippedfile and produce a compressed dictionary file which may then be used by TYPO.
TYPO expects the following arguments:
$ - options document listingfile dictionary (trigramfile)
The program will scan the document for all the unique words, mapping all upper case letters to lower case and concatenating hyphenated words, use the dictionary to remove the words which are known to be spelt correctly, then output the remaining words to the listingfile sorted in order of peculiarity. The option S will cause the program to keep statistics about the current scan and print them on the console after the listingfile is written. Large documents may cause the counters to overflow, giving erroneous statistics. The option I will cause the program to initialise its trigram and digram tables from the trigramfile which is produced by the program TRIGRA.PAS. Neither of these options is very useful.
TRIGRA expects the following arguments:
$ - document trigramfile
The program is a stripped down version of TYPO. It will scan the document in exactly the same manner, except that after the scan it will simply dump the trigram and digram data into the trigramfile, which is suitable for use by TYPO.