A TYPOGRAPHICAL ERROR CHECKER

As many of you are probably aware, I have a PRO-325 system running the RT-11XM operating system. It was fairly obvious from the outset that the system would be used heavily for word processing, and much of the work I have done on the system has been with this aim in mind (eg the patches to KED described in an earlier newsletter). The word processing system which I have set up consists of a PRO-325 and an EPSON LX80 printer, driven by some homebrew, some DECUS and some DEC software. I hope to outline this in greater detail in a future article. In this article I shall concentrate on TYPO, the typographical error checker.

It was realised early on that some sort of spelling checker would be a highly desirable asset. Most spelling checkers rely on large dictionaries combined with some complicated, but incomplete, semantic rules about spelling. On a home system it is impractical to have a dictionary containing hundreds of thousands of words. The two main arguments against it are the amount of storage required and the amount of time required to scan the dictionary. A less obvious objection is that, however large the dictionary, some of the words used will still not appear in it. Thus the specifications I set for the checker were that the dictionary should be small but functional, so that it would be costly neither to store nor to scan, and that the checker should avoid difficult semantic rules (eg remove -e before adding -ing), since these would complicate implementation and increase run time.

After a morning spent hunting through the literature (that's another article), I discovered an article entitled:

    Computer Detection of Typographical Errors
    R. Morris and L. Cherry
    IEEE Transactions on Professional Communication
    Vol. PC-18, No. 1, March 1975

which described a program called TYPO written for the UNIX operating system. TYPO is not a 'spelling checker' in the usual sense, and its true nature is best summed up by the title of the original article. It is not meant for people who cannot spell, but rather for those who can spell but cannot type: people who are reasonably competent at spelling, who know when they encounter a word they are unsure how to spell, and who know when they have misspelt a word.

TYPO works on two levels. It requires a small dictionary so that the commonly used words may be cut out of the document. The remaining words are then sorted so that the more peculiar words appear first in a list which is printed for the user to peruse. The dictionary is not used to check spelling in the way the more conventional spelling checkers use theirs; rather, it removes words which are known to be correct from further consideration. Thus the dictionary should contain not words which are difficult to spell, but words which are used often.

The document is first scanned to obtain a list of all the unique words and to gather some statistics about the document. The dictionary is then scanned and used to cast out the correctly spelt words. An index of peculiarity is calculated for each of the remaining words, which are sorted on this index into descending order and printed. The listing should contain most of the misspelt words towards the head of the list.

Two statistics are gathered when scanning the document. Consider the word SPECTRUM. A special delimiter, eg '*', is used as a prefix and a suffix, and the word becomes *SPECTRUM*. The word is broken into digrams, groups of two letters (*S SP PE EC CT TR RU UM M*), and the number of times each digram occurs across the entire document is recorded. Each word is also broken into trigrams, groups of three letters (*SP SPE PEC ECT CTR TRU RUM UM*), and the number of occurrences of each trigram across the entire document is likewise recorded. The index of peculiarity for any word may then be calculated by breaking the word into its digrams and trigrams and deriving a number from their frequency counts. This number reflects the likelihood that the word was produced by the same source as the rest of the document. It should be noted that the test is completely statistical.
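The short Pascal program below is a minimal sketch of this decomposition (in a modern dialect; it is not the code of TYPO itself). The TrigramIndex function shows one plausible shape for the index of peculiarity, my own illustration rather than the exact statistic from the Morris and Cherry paper: a trigram which is itself rare, but which is built from common digrams, scores highly.

    program ngrams;

    { Record one digram or trigram occurrence.  A real implementation
      would increment a counter in a table here (see the hashing
      sketch later); this sketch just prints the fragment. }
    procedure Count(s : string);
    begin
      writeln(s)
    end;

    { An illustrative index of peculiarity for one trigram xyz, given
      the occurrence counts (all at least 1) of its digrams xy and yz
      and of the trigram itself. }
    function TrigramIndex(fxy, fyz, fxyz : real) : real;
    begin
      TrigramIndex := 0.5 * (ln(fxy) + ln(fyz)) - ln(fxyz)
    end;

    var
      padded : string;
      i      : integer;
    begin
      padded := '*' + 'spectrum' + '*';
      for i := 1 to length(padded) - 1 do
        Count(copy(padded, i, 2));     { *s sp pe ec ct tr ru um m* }
      for i := 1 to length(padded) - 2 do
        Count(copy(padded, i, 3));     { *sp spe pec ... rum um* }
      writeln(TrigramIndex(50, 40, 2) : 0 : 2)
    end.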
The strategy does work. The dictionary on my system contains about 10000 words and, once compressed, occupies 71 blocks. It takes about 30 seconds to scan the dictionary on RX50 drives (I suspect it would take even less time if the size of the IO buffer were increased or a faster disk were used). The dictionary is optimised for technical documents, particularly those with a Computer Science flavour (I wonder why?). Some work was done recently which used a lot of chemical terms: the list of words not found in the dictionary occupied a little more than two A4 sheets of paper, with three columns of words per sheet, and about 90% of the misspelt words were to be found on the first sheet.

For those of you who are interested in data structures, the following may be of interest. Because the number of words in each document varies, the initial implementation used a binary tree to store the words, and the digrams and trigrams were also stored in binary trees. This approach was chosen since the number of possible trigrams is in the order of 26x26x26 = 17576, and even at a byte per trigram this would require 17kb of memory. Given that not all trigrams appear in any given document (eg ZZZ), a multi-dimensional array is a wasteful solution to the problem.

It was found that the binary tree solution, although workable, was very slow. With a view to gauging why, a statistic gathering option was added to the program. Using this mode of operation it was found that, for a reasonably sized document, some 13 nodes had to be inspected on average for every trigram in the document. Thinking that the binary tree might be rather unbalanced owing to the infrequent occurrence of some trigrams, a modified insertion routine was tried in which some rebalancing of the tree occurs. This too yielded about 13 node inspections on average per trigram. The speed problem had by this stage grown to be the major stumbling block, for even with a small document in the order of 20 blocks, it would take in the order of 2 minutes to scan the document.

The solution was to abandon the flexibility of the binary tree and use a hash addressing system. Three large tables were allocated, one for the digrams, one for the trigrams and one for the words in the document, and a hash addressing scheme was devised using some information from the relevant books. Using the right magic numbers, the algorithm required about 1 probe on average per trigram. The speed improvement was somewhere in the order of 75%, ie about 30 seconds to scan the same 20 blocks. RX50 disk latency appears to be the major constraint on the speed of the program at present.
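The sketch below shows the kind of hash addressing involved (again in a modern Pascal dialect, and again only an illustration: the table size and hashing constants are stand-ins for the real magic numbers). A fixed table is allocated, each trigram is hashed to a starting slot, and collisions are resolved by linear probing; while the table remains sparse, a lookup costs about one probe.

    program trihash;

    const
      TableSize = 4099;  { a prime comfortably above the trigram count }

    type
      Entry = record
        key   : string[3];
        count : integer;
      end;

    var
      table : array [0..TableSize - 1] of Entry;
      i     : integer;

    { A simple multiplicative hash; 31 is an arbitrary stand-in
      for the real magic numbers. }
    function Hash(t : string) : integer;
    var
      h : longint;
      j : integer;
    begin
      h := 0;
      for j := 1 to length(t) do
        h := (h * 31 + ord(t[j])) mod TableSize;
      Hash := h
    end;

    { Probe for trigram t: start at its hash slot and step linearly
      past occupied slots holding other keys.  (A real program must
      also cope with a full table.) }
    function Slot(t : string) : integer;
    var
      h : integer;
    begin
      h := Hash(t);
      while (table[h].count > 0) and (table[h].key <> t) do
        h := (h + 1) mod TableSize;
      Slot := h
    end;

    procedure Bump(t : string);
    var
      h : integer;
    begin
      h := Slot(t);
      table[h].key := t;
      table[h].count := table[h].count + 1
    end;

    begin
      for i := 0 to TableSize - 1 do
        table[i].count := 0;
      Bump('*sp'); Bump('spe'); Bump('*sp');
      writeln('*sp seen ', table[Slot('*sp')].count, ' times')  { 2 }
    end.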
The dictionary for the program can be built in one of two ways, depending upon its size. Small dictionaries (less than about 3000 words) may be constructed using the program DICTIO.PAS. Larger dictionaries have to be constructed using the programs WBREAK.PAS, CMPRES.PAS and RTSORT.SAV (or your favourite sort utility).

WBREAK and DICTIO expect the words which comprise the dictionary to be in lower case; the words may contain apostrophes, and upper case letters are converted to lower case. There may be more than one word per line, and blank lines are permitted; words must be separated from each other by any number of blank lines and spaces. A line beginning with a semicolon is considered a comment line and is ignored. The words need not be sorted.

DICTIO expects the following arguments:

    $- wordfile dictionary

The program will take the words from wordfile, remove any duplicates, sort the words and produce a compressed dictionary file which may then be used by TYPO.

WBREAK expects the following arguments:

    $- wordfile strippedfile

The program will take the words from wordfile and place them in strippedfile, one word per line. The strippedfile is not sorted; your favourite sort program must be used at this stage to sort it so that the words appear in ascending ASCII order, one word per line.

CMPRES expects the following arguments:

    $- sortedstrippedfile dictionary

The program will take the words from sortedstrippedfile and produce a compressed dictionary file which may then be used by TYPO.

TYPO expects the following arguments:

    $- options document listingfile dictionary (trigramfile)

The program will scan the document for all the unique words, mapping all upper case letters to lower case and concatenating hyphenated words; use the dictionary to remove the words which are known to be spelt correctly; then output the remaining words to the listingfile, sorted in order of peculiarity. The option S will cause the program to keep statistics about the current scan and print them on the console after the listingfile is written (large documents may cause the counters to overflow, giving erroneous statistics). The option I will cause the program to initialise its trigram and digram tables from the trigramfile produced by the program TRIGRA.PAS. Neither option is very useful in everyday work.

TRIGRA expects the following arguments:

    $- document trigramfile

The program is a stripped down version of TYPO. It will scan the document in exactly the same manner, except that after the scan it will simply dump the trigram and digram data into the trigramfile, in a form suitable for use by TYPO.
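To put the pieces together, a session to build a large dictionary and then check a document might look something like the following. The file names are hypothetical, and I have assumed the programs are started with the RT-11 RUN command; adapt the details to your own installation.

    .RUN WBREAK
    $- WORDS.TXT STRIP.TXT
    (sort STRIP.TXT into SORTED.TXT with RTSORT or your
     favourite sort utility)
    .RUN CMPRES
    $- SORTED.TXT DICT.TYP
    .RUN TYPO
    $- S LETTER.DOC LETTER.LST DICT.TYP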