Form Image Compression by Template Extraction and Matching

Jianguo Wang and Hong Yan
School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia
phone: +61 2 9351 5338; fax: +61 2 9351 4824; e-mail: jwang@ee.usyd.edu.au

Abstract

This paper presents a generic method for compressing multi-copy form documents using template extraction and matching (TEM) strategies to reduce the component-level redundancy in form document images. The method consists of the following principal steps. First, a template, ideally identical to the blank form, is extracted from a number of filled-in form images using statistical superposition and prototype matching strategies; the properties of form documents are investigated and strategies for template extraction are developed. Next, the filled-in data are extracted by matching the template to each filled-in form; several possible situations are handled at this stage to deal with the different cases arising in practical applications. The template and the extracted filled-in data of each form are then coded in a lossless compression format. In the final step, a generic rule is applied to reconstruct the compressed form documents. The compression rate of the proposed method depends mainly on the size of the filled-in data relative to the size of the template, which is generally small for form images, so a high compression rate can be achieved. By avoiding document segmentation, the proposed method is effective for all types of components, including form frames, pictures and text. Experimental results on different types of forms show that the performance of this compression method is excellent.

Key words: form image; template matching; image compression; redundancy analysis.

1. Introduction

Document image compression is becoming more important due to the growth in document scanning and transmission.
Technological advances in processing, storage and visualization have made it possible to maintain large numbers of form documents as digital images and to make them accessible over networks. To do this effectively, an efficient compression scheme is essential for both storage and transmission. Document images are mostly pseudo-binary and rich in textual content, and are represented adequately by the binary images produced by scanning. Forms are documents typically used to collect or distribute data, in which a large part (the blank form) is the same. Large numbers of forms are processed by many business and government organizations. Previous research [1,2,3] provides useful hints for form image compression. Some standards have already been released for binary image compression and facsimile transmission. The first comprehensive standard was the CCITT Group 3 transmission standard, followed by CCITT Group 4 [5]. Both schemes use a run-length-based approach with an extension to dual scan-line coding to exploit the coherence between successive scan lines. The Joint Bi-level Image Experts Group (JBIG) standard is a more recent international ISO standard in which context modelling forms the basis of coding. The compression rates of all these methods depend heavily on the complexity of the images, and a higher compression rate can be achieved by avoiding the repeated storage of the same or closely similar components across images. To develop a scheme that is practical and feasible for multi-copy form document compression, sophisticated techniques for template extraction, component matching and image coding are needed. The proposed template extraction and matching (TEM) method is developed for this purpose. As a type of document, forms typically consist of two parts. One is called the template: the part common to most images, including preprinted components such as form frames, characters, symbols and pictures.
The other part is the filled-in data, which differs from form to form. The proposed TEM form compression method contains three major steps: template extraction, compression and decompression. If a blank form has not been provided, the first step is template extraction, which determines a set of components as prototypes and extracts a template from several filled-in form images. It includes image de-skewing and location, distortion adjustment, template extraction and refining. This step is not needed if a blank form is provided as the template.

Figure 1 Flow chart of the TEM form compression scheme

The second step is compression, in which the form images are preprocessed and the filled-in data extracted by matching them with the template. Various situations have to be handled for the practical application of this method, as described in Section 3. The template and the filled-in data extracted from the form images are compressed into one file in a lossless compression format. In the decompression step, each form is reconstructed using the template and the extracted image of that form. Figure 1 shows the flow chart of this scheme. Unlike many lossy compression techniques that use resolution reduction or texture-preserving methods, which can render a document image unreadable, the proposed method identifies components that appear repeatedly and represents similar components with a prototype. The TEM method exploits the component-level redundancy found in multi-copy form documents and achieves a high compression rate while keeping the original resolution and readability. Furthermore, the quality of the images is improved by the use of a statistical algorithm to reduce noise.
A high compression rate can be achieved with this method by employing effective strategies to cope with practical problems such as distortion, noise, flaws and modification. At the same time, the accuracy of the decompressed document image is very important for visualization. It is necessary for a compression scheme to preserve the shape of all the components in a document so that they are correct and recognizable after retrieval. The rules for template extraction and matching are designed rigorously enough to minimize the possibility of mismatching. The details of the algorithms used in template extraction are described in Section 2. Section 3 introduces the proposed compression scheme, which matches prototypes in the template with the corresponding components in each form image. Section 4 describes the algorithms for decompression, and the experimental results are presented in Section 5. Section 6 provides a detailed discussion of the approach and the conclusions.

2. Template Extraction

The first step in this compression method is to extract a template that contains the common components of the filled-in form images. A blank form can be used directly as the template if it is of good quality, but in most cases it is not available, and the template has to be extracted from filled-in forms. An effective template extraction method based on comparing several filled-in form images is critical for this compression scheme, considering all the possible situations in practical applications. A series of processes is needed to fulfil this task. After skew and distortion adjustment, several form images are overlapped to obtain a greyscale image, which indicates the statistical probability that each pixel is black. A template is extracted from the greyscale image and then refined.

2.1. Image De-skewing and Location

The skew of scanned form images needs to be corrected before template extraction can be carried out.
The accuracy of skew angle detection is critical for obtaining the best overlap of images. Several earlier approaches have dealt with this problem. H. Yan [4] developed a method for correcting the skew of text lines in scanned document images using interline cross-correlation. In some cases, the skew angle can be determined from text margins [6] if they exist in the scanned image. The Hough transform can also be used for skew detection [7]. Another commonly used method employs the projection profile to compute the skew angle [8]. Each of these, however, has limitations and cannot be used directly in this compression scheme. An accurate skew detection algorithm is required that is effective for all types of form images.

After analyzing and comparing the above algorithms, a new de-skewing scheme was developed using an improved recursive projection profile method [9]. For documents whose text lines run horizontally, the horizontal projection profile has peaks with widths equal to the character height. The projection profile is computed at a number of selected angles to detect the skew angle. For each angle, a measure of the total variation in the bin heights along the profile is made. Maximum variation corresponds to the best alignment with text lines and horizontal form frames, and from this the skew angle can be measured. Form images are usually scanned at a fixed resolution with a small amount of skew, less than 5°. A set of angles with a fixed spacing is initially selected according to the maximum skew angle possible in the application. The document image is divided into vertical strips of equal width. Horizontal projection profiles are calculated for some of the vertical strips, and the skew angle is estimated by finding the maximum total variation in the bin heights along the profiles. A new set of angles with smaller spacing is then chosen around the skew angle estimated in the previous step. The number of vertical strips is reduced as the spacing becomes smaller, until finally a single strip of the full image width remains. This cycle is repeated until the angular difference corresponds to just one pixel. The algorithm achieves an accurate result, to within one pixel over the image width, and is robust to noise and variation in form styles. The location of each form image is measured with the horizontal and vertical projection profiles of the de-skewed form image. Form images are then overlapped by moving them so that the differences between the profiles of the images are minimized.

2.2. Distortion Adjusting

Form images produced from the same blank form usually exhibit a little distortion caused by printing, handling or scanning. This seriously affects the results of template extraction. The purpose of distortion adjusting is to overlap all the similar components well enough over the whole image area before template extraction. Analysis of a large number of filled-in form images shows that the distortion is cumulative in both the vertical and horizontal directions. The following distortion-adjusting algorithm deals with this problem effectively.

Figure 2 Vertical constant white lines produced in images by printing or scanning

The first of the set of form images used for template extraction is defined as the pre-template, and all other form images are adjusted against it. An adjusted form image is divided into several horizontal strips. Each strip is moved vertically to obtain the maximum overlap of corresponding components in the pre-template and the adjusted image. The strips are then divided into blocks to adjust horizontal distortion using the same method. A rule is established for these moves to avoid the disconnection of strips and blocks induced by the adjustment: the strip in which the maximum horizontal projection profile is located is defined as the base strip, and all other strips are moved vertically relative to it. Horizontal adjustment is performed similarly.
This method is effective in most cases, but it is not sufficient for images in which a constant vertical white line exists, caused by printing or scanning, as shown in Figure 2. As the results of this method need further improvement, a component-based adjusting algorithm is incorporated in the following template extraction step.

2.3. Template Extraction

Before extracting the template, each component in each image is adjusted again by comparison with the corresponding component in the first form image. A set of adjusted binary form images is then overlapped to generate a greyscale image, in which the density of a pixel is determined by the number of times a black pixel falls on it. Filled-in data in one image have less chance than the components belonging to the blank form of overlapping with components in the other images, and the same holds for noise. A pair of pre-templates can be obtained by binarizing the greyscale image with different thresholds. The pre-template with a properly chosen threshold is similar to the blank form but still contains some components created by the filled-in data. An algorithm is developed to detect and erase these components by comparing the differences between corresponding components in the two pre-templates. The position of each component is located by performing a connected component analysis on the pre-template with the lower threshold, which means a larger area is included in the rectangle of a component.

The relation between the corresponding components in the two pre-templates is

Tl = Th + D    (1)

where Th and Tl are the sets of black pixels of a component in the pre-templates with the higher and lower thresholds respectively, and D is the difference between Th and Tl. A subset Dw of D is defined as

Dw = {p : p ∈ D and Np ∩ Th = ∅}    (2)

where Np denotes the set consisting of a pixel p and its four direct neighbors.

The rules for classifying components as either prototypes or filled-in data are developed by analyzing Dw and Tl in different cases. A component belonging to the blank form shows a smaller difference between the two pre-templates than one belonging to the filled-in data, so the two are not difficult to separate. Components that differ strongly between the pre-templates are classified as filled-in data and erased from the pre-template, while the others are classified as belonging to the blank form. Components that belong to the blank form but differ because of their connection with filled-in data are also erased, because they cannot serve as prototypes in the template. For components whose rectangle area contains pixels of other components, the pixels belonging to the other components are preserved.

A template is extracted after classifying all components in the pre-template and erasing those not classified as prototypes. This template needs to be refined to reduce the noise and imperfections caused by the erasing operation and to smooth component contours, and some of the prototypes need to be double-checked.

2.4. Template Refining

It is necessary to make sure that every preserved component in the template is a prototype that belongs to the blank form. A double check of some prototypes in the template is carried out by comparing them with the corresponding components in the original form images. If a prototype is obviously different from the corresponding component in a certain number of original images, it is excluded from the template.

As a result of the statistical treatment in the overlapping, the extracted template contains much less noise and far fewer imperfections than the original images. Noise and imperfections not only affect the readability of document images but also reduce the compression rate in this system. Component contours are smoothed to improve the filled-in data extraction and image compression, and noise in the template is filtered using a connected component algorithm so that dotted lines or patterns are not erased.
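The overlap-and-threshold steps of template extraction can be sketched as follows. This is a minimal illustration assuming the images are already aligned; the function names, thresholds and the reading of Eq. (2) used here (Dw keeps difference pixels whose 4-neighbourhood does not touch Th) are assumptions, not the authors' implementation.

```python
import numpy as np

def pre_templates(images, t_high, t_low):
    """Overlap aligned binary images into a greyscale vote image and
    binarise it at two thresholds, giving the strict (Th) and
    permissive (Tl) pre-templates of Section 2.3."""
    votes = np.sum(images, axis=0)   # how often each pixel is black
    return votes >= t_high, votes >= t_low

def dw_mask(th, tl):
    """Dw of Eq. (2) as read here: pixels of D = Tl - Th whose
    4-neighbourhood does not touch Th. Contour fuzz next to Th is
    excluded, so a large Dw suggests a filled-in component."""
    d = tl & ~th
    grown = th.copy()                # 4-neighbour dilation of Th
    grown[1:, :] |= th[:-1, :]
    grown[:-1, :] |= th[1:, :]
    grown[:, 1:] |= th[:, :-1]
    grown[:, :-1] |= th[:, 1:]
    return d & ~grown
```

On a stack of forms sharing a frame line, the line survives both thresholds, a blob filled in on only some forms falls into Dw, and fuzz adjacent to the line is discarded.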
The final template preserves all the prototypes that closely resemble the components in most of the original filled-in images. It represents the common components and so achieves a high reduction of component-level redundancy. Figure 3 (a) and (b) show an example of an original filled-in form image and an extracted template. Note that the noise and flaws in the original image do not appear in the template, owing to the statistical treatment used in the process.

3. Compression

The basic innovation of the compression scheme is to substitute the prototypes in the template for components in filled-in forms that are the same as or similar to the corresponding prototypes, thereby reducing component-level redundancy. The proposed compression scheme consists of three steps: input image preprocessing, filled-in data extraction, and compression of the data stream of the template and extracted data. One of the major tasks in the compression stage is to develop efficient rules for detecting the similarity between prototypes in the template and the corresponding components in an input image, so as to make the correct substitutions. Filled-in components outside a prototype's rectangle area are preserved without any change. Filled-in components that connect with preprinted components or are (partly) included in a prototype's rectangle area are processed according to the situation. Three typical situations occur in practical applications. In most cases, the components of the blank form in each input image are very similar to the corresponding prototypes in the template and can be substituted; this is the basic principle of the proposed compression scheme. Generally, if a component in the corresponding prototype's rectangle area in the input image is obviously different from the prototype, the component is preserved as filled-in data.
In some cases the same component, such as a digit or a bar code, exists in many of the input images used for template extraction but not in some other images. Occasionally, a few components exist in the template as prototypes but are omitted in an input image; this occurs when the input image has a flaw or has been modified. A special mark is needed in the extracted filled-in data images to indicate this case, in order to guarantee that the decompressed image is almost the same as the original. The algorithm for classifying components is based on pattern matching and substitution: if a candidate pattern is a good match to the corresponding prototype in the template, it is classified as a preprinted pattern; otherwise it is considered a filled-in pattern. This method works with a variety of component sizes and is general enough to be used with any specialized matching function.

3.1. Image Preprocessing

The preprocessing of input images is executed before they are compared with the template for filled-in data extraction. Its purpose is to overlap each input image with the template as well as possible. Input images are de-skewed and located in the same position as the template, and their distortion is adjusted according to the template. This process is similar to that described in Sections 2.1 and 2.2, except that the template, instead of the first input image, is used as the standard. A detection program can be introduced at this stage to check whether an input image has the same blank form as the template and to identify it as such if necessary.

3.2. Filled-in Data Extraction

Filled-in data extraction is executed by comparing each prototype in the template with the corresponding component in an input image. The matching function used here is based on the weighted XOR [10] and compression-based pattern matching [11] algorithms, and is referred to as the weighted pixel matching rules.
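The profile-based location step of the preprocessing (Section 3.1) can be sketched as follows. The vertical case is shown; the horizontal case is symmetric. The function name and the brute-force offset search are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def best_offset(template, image, max_shift=10):
    """Locate an input image against the template by sliding it
    vertically and minimising the difference between the horizontal
    projection profiles of the two images."""
    t_prof = template.sum(axis=1).astype(float)
    best, best_err = 0, np.inf
    for dy in range(-max_shift, max_shift + 1):
        i_prof = np.roll(image, dy, axis=0).sum(axis=1).astype(float)
        err = np.abs(t_prof - i_prof).sum()
        if err < best_err:
            best, best_err = dy, err
    return best
```

An image scanned four pixels lower than the template would yield an offset of −4, the shift that realigns the profiles.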
All the prototypes are grouped into two classes and treated with different algorithms. Simple connected components (SCC) are components whose rectangles contain no other prototype, while complex connected components (CCC) are those whose rectangles contain other prototypes. Form frames and large components are usually complex connected components. Unlike the relation between the pre-templates used in template extraction, equation (1) in Section 2.3, the set of pixels of a component in the template does not in general contain the corresponding component in the input image. The relation between them can be expressed as

R = T ⊕ I    (3)

where R is the set of pixels given by the exclusive OR of the set of black pixels of a component in the template (T) and the set of black pixels of the corresponding component in an image (I). The similarity of a component in an input image to the corresponding prototype in the template can be expressed as

S = 1 - Nd/Nt for SCC, or S = 1 - Nd/Np for CCC    (4)

where Nd is the number of pixels that differ between the template and the input image, counted in the rectangle of the prototype for SCC and in the set of pixels of the prototype for CCC; Nt is the number of pixels in the rectangle area for SCC; and Np is the number of pixels in the set of the prototype for CCC. A prototype in the template is exactly the same as a component in an input image when S = 1, and they are reverse images when S = 0. By adjusting the threshold on S, the matching between prototypes in the template and corresponding components in an input image can be controlled during the comparison. In the proposed method, weighted pixel matching rules are developed for both CCC and SCC. The matching process consists of several steps. First, the types of the prototypes and their rectangle areas are determined using a connected component algorithm. Then the patterns in the rectangle areas of the prototypes are compared with the corresponding patterns in the input image.
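Equations (3) and (4) amount to an XOR count normalised by the comparison area. A minimal sketch follows; the boolean-mask representation and the function signature are assumptions for illustration.

```python
import numpy as np

def similarity(template_rect, image_rect, prototype_mask=None):
    """Eqs. (3)-(4): S = 1 - Nd/Nt for SCC (whole rectangle) or
    S = 1 - Nd/Np for CCC (prototype pixels only). prototype_mask is
    the pixel set T of the prototype; None means treat as SCC."""
    diff = template_rect ^ image_rect            # R = T xor I
    if prototype_mask is None:                   # SCC: rectangle area
        nd, denom = diff.sum(), diff.size
    else:                                        # CCC: prototype pixels
        nd, denom = (diff & prototype_mask).sum(), prototype_mask.sum()
    return 1.0 - nd / denom
```

Identical patterns give S = 1, reverse images give S = 0, and a threshold on S decides whether the component is substituted by the prototype.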
Adjustment of location is performed at this stage to optimize the overlap of the components. Weighted pixel matching rules are developed to deal with different situations. As the pixels in R that are neighbors of pixels in T are insignificant as indicators of difference, they are excluded from the subset Rw used for comparison. Let Sp denote the set consisting of a pixel p and its four direct neighbors. Then Rw is the set of pixels

Rw = {p : p ∈ R, and (p ∈ T and Sp ∩ I = ∅) or (p ∈ I and Sp ∩ T = ∅)}    (5)

The weight of each pixel in the subset Rw is calculated as

Ws = (Ns + 1)^2    (6)

where Ns is the number of its neighbor pixels that also belong to the subset Rw. The weights of the pixels in the subset Rw are summed to give a parameter for determining the similarity between T and I. To obtain a more accurate classification, some other parameters are also included, such as the dimensions of a component's rectangle area and the proportion of blank pixels in a component.

For SCC, the comparison is executed in the rectangle area of each prototype. If the similarity between a component in the input image and the corresponding prototype in the template is above a threshold, the component in the input image is erased and the prototype in the template substitutes for it during decompression; otherwise the component is preserved. If there is a prototype in the template but the corresponding area in the input image is blank, the prototype is copied into the input image to indicate this situation. The matching rules and erasing area for CCC differ from those for SCC: only the pixels within the set T are compared, and the erasing area for CCC is the set T instead of the whole rectangle area of the prototype. Correspondingly, the decompression rules for CCC also differ from those for SCC. As the purpose of filled-in data extraction in this compression scheme is to reduce component-level redundancy, the integrity of the extracted filled-in data is not guaranteed.
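The weighted pixel matching of Eqs. (5) and (6) can be sketched as follows. This follows the reconstructed reading above (difference pixels whose 4-neighbourhood does not touch the other pattern are significant), and the helper names are illustrative assumptions.

```python
import numpy as np

def four_neighbour_hits(mask):
    """For each pixel, does its set Sp (itself plus its four direct
    neighbours) intersect mask?"""
    hit = mask.copy()
    hit[1:, :] |= mask[:-1, :]
    hit[:-1, :] |= mask[1:, :]
    hit[:, 1:] |= mask[:, :-1]
    hit[:, :-1] |= mask[:, 1:]
    return hit

def weighted_mismatch(t, i):
    """Sum of Ws = (Ns + 1)^2 over the significant difference set Rw
    of Eq. (5); Ns counts a pixel's 4-neighbours also in Rw."""
    r = t ^ i
    rw = r & ((t & ~four_neighbour_hits(i)) | (i & ~four_neighbour_hits(t)))
    ns = np.zeros(rw.shape, dtype=int)
    ns[1:, :] += rw[:-1, :]
    ns[:-1, :] += rw[1:, :]
    ns[:, 1:] += rw[:, :-1]
    ns[:, :-1] += rw[:, 1:]
    return ((ns + 1) ** 2 * rw).sum()
```

A one-pixel contour shift thus contributes nothing, while a clustered genuine difference is penalised super-linearly in its size.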
This problem occurs when filled-in patterns touch or cross form frames or preprinted patterns. The image decompressed with the proposed method, however, has no flaws caused by this process: because all the comparisons and substitutions are performed within the rectangles of the prototypes in the template, all the filled-in data outside them are kept unchanged. Broken components of filled-in data caused by the overlap between preprinted and filled-in fields can be mended [12] if filled-in character recognition is performed.

3.3. Compressing the Streams

The template and the extracted filled-in images are sorted into a data stream with the template first. To compress N form images with the same blank form, N+1 images are generated in the compressed data stream: one template plus N extracted images. This data stream is then compressed with a lossless compression method, such as JBIG or CCITT Group 4, whose compression rate is sensitive to the complexity of the encoded image. As the image of the extracted filled-in data is much less complex than the original input image, the image data are compressed further by this scheme. The TIFF image format supports both multiple images per file and CCITT Group 4, so it is selected as the format of the compressed file. For N image files with M images in each file, the compressed stream in a TIFF file has M templates and M*N extracted images. If the templates are denoted Tm (m = 1, 2, ..., M) and the extracted images Cnm (n = 1, 2, ..., N), the stream of compressed images is the array T1, T2, ..., TM, C11, C12, ..., C1M, ..., CNM. A special tag is used to indicate how many templates are in the compressed TIFF file, and the name of each original image file is also saved in a corresponding tag. The compressed stream of images can also be viewed with ordinary graphics tools.

4. Decompression

The purpose of decompression is to restore the compressed data into new images that are the same as (lossless) or similar to (lossy) the original images.
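The ordering of the compressed stream described in Section 3.3 can be made explicit with a small helper (the function name and string labels are illustrative only):

```python
def stream_order(n_files, m_images):
    """Order of images in the compressed TIFF stream: the M templates
    first, then the M extracted images of each of the N original
    files in turn (T1..TM, C11..C1M, ..., CN1..CNM)."""
    order = [f"T{m}" for m in range(1, m_images + 1)]
    for n in range(1, n_files + 1):
        order += [f"C{n}{m}" for m in range(1, m_images + 1)]
    return order
```

For example, two files of two images each give the stream T1, T2, C11, C12, C21, C22, i.e. M*(N+1) images in total.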
The templates in the compressed file are first processed to locate each prototype by connected component analysis. Then each form image is reconstructed using a reconstruction function that compares the template with the corresponding extracted filled-in data image. As the skew, location and distortion of each image have already been adjusted during compression, every image of extracted filled-in data can be compared with the template directly. Three situations can arise from the comparison. First, if there is no black pixel in the rectangle area of the extracted image for an SCC, the component in the original image was the same as or similar to the prototype and was erased during filled-in data extraction; in this case, the prototype is copied to the corresponding position to substitute for the original component. Second, if the component in the extracted image is different from the corresponding prototype in the template, the component in the extracted image is kept unchanged and no substitution occurs. Third, if a component in the extracted image is exactly the same as the corresponding prototype, this indicates that no component existed in the original image, so the component is removed. The third situation increases the size of the compressed image only very slightly, because it rarely occurs.

The matching rules for decompression also differ between CCC and SCC. For SCC, the comparison is carried out in the rectangle area of each prototype, and the copying or removing operation is also performed in that rectangle area. For CCC, however, the comparison is performed not in the whole rectangle but only in the pixel set of the prototype; the handling of the three situations above differs accordingly, and the copying or removing operation is performed only in the area of the prototype.
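The three SCC situations above can be sketched as a single reconstruction rule. The boolean-mask representation and the function name are assumptions; the CCC case would restrict the same logic to the prototype's pixel set rather than the whole rectangle.

```python
import numpy as np

def reconstruct_scc(prototype, extracted_rect):
    """Decompression rules for one SCC rectangle (Section 4):
      - blank extracted area      -> copy the prototype in (it was erased);
      - area equal to prototype   -> remove it (original had no component);
      - anything else             -> keep the extracted pixels unchanged."""
    if not extracted_rect.any():                    # situation 1
        return prototype.copy()
    if np.array_equal(extracted_rect, prototype):   # situation 3
        return np.zeros_like(prototype)
    return extracted_rect.copy()                    # situation 2
```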
Pixels in the rectangle area but not in the area of the prototype remain unchanged. After decompression, the images of extracted data are reconstructed to represent the original images. As many images are compressed in one file, it is time-consuming to decompress all of them, so several choices are provided for decompression: only the template is displayed if the aim of opening a compressed file is to obtain a view of its content; otherwise, a selected number of images in the compressed stream are reconstructed. Figure 3 gives an example of the compression approach.

Figure 3 An example of the compression approach: (a) an original form image; (b) the template extracted from a set of filled-in forms; (c) the reconstructed image; and (d) the filled-in data extracted from (a).

Figure 4 The example forms used for testing.

Table 1 Form document compression experiment results.

  A  Directory                                           Micros     Soco      Tafe       Westp
  B  Number of files                                     100        6         100        50
  C  Size of all TIFF files (bytes), C = B*F             2,141,539  141,673   3,640,274  4,456,211
  D  Size of compressed file (bytes), D = G*B + H        467,548    25,702    274,344    896,958
  E  Average compression rate over TIFF, E = C/D         4.58       5.51      13.27      4.97
  F  Average size of each TIFF file (bytes), F = C/B     21,415     23,612    36,403     89,953
  G  Average size of each compressed image (bytes), G = (D-H)/B     4,497     1,652     2,389      16,490
  H  Size of the template (bytes)                        17,832     15,792    35,462     72,646

5. Results

More than 370 forms of 16 different types were used to test the proposed TEM compression method. All the forms are used in banks, companies or government organizations. Most forms are filled in by hand, and some filled-in patterns touch or cross form frames or preprinted patterns. The forms were scanned at a resolution of 200 dpi with a small skew, less than 5°, and saved as TIFF files in CCITT Group 4 format. The compressed images are also saved as TIFF files in CCITT Group 4 format.

The forms used for testing have a variety of styles and formats; Figure 4 presents some examples. Some forms contain black backgrounds with white characters and some have halftone patterns, and the font sizes in the examples also differ. The proposed scheme for multi-copy form document compression was coded in C++ and run on a PC with a Pentium II 350 MHz CPU and 64 MB RAM. The typical image size of the test forms is about 1600 x 2200 pixels. On average, template extraction takes 90 seconds when 6 filled-in forms are used, compression takes 16 seconds per image, and decompression takes 3 seconds per image. As the code used for testing was not speed-optimized, the processing speed can be improved further. The compression rates for the different kinds of form images are listed in Table 1. They vary over a large range, from about 4.5 to over 13, depending mainly on the size of the filled-in data and the quality of the form images.

6. Discussion and Conclusion

As is well known, the redundancy in document image representation can be classified into two categories: local and global. Global redundancy can be further classified into pattern redundancy within a single image and pattern assemblage redundancy across similar images. The proposed compression method reduces redundancy in both categories. The TEM method extracts a template representing the patterns that are similar in a set of form document images, thereby reducing the pattern assemblage redundancy across similar images. The binary image compression and facsimile transmission standards mentioned in Section 1 provide the means to reduce local redundancy; CCITT Group 4 is used in this scheme, and a better compression result could be achieved if the JBIG standard were employed. It is possible to reduce the pattern redundancy within a single image, which is minimal in the case of form document compression, by another method; we intend to address this in a future study.

From another viewpoint, data compression may be classified into lossless and lossy compression. In lossless compression, no data are discarded and an exact copy of the original image can be recovered. In lossy compression, some irrelevant data are discarded during compression and the recovered image is only an approximation of the original. Lossy compression itself can be divided into two categories: with and without resolution reduction. The multi-copy form document compression scheme introduced in this paper is a lossy compression without resolution reduction, which reduces global redundancy with the proposed TEM strategy and local redundancy with a lossless compression technique. This scheme allows lossy compression at much higher compression ratios than the lossless ratios of the existing standards, with almost invisible degradation of quality. The experimental results show the efficiency of the proposed method. A statistical template extraction algorithm has been developed using greyscale images obtained by overlapping a number of binary images. Form image de-skewing, location and distortion adjustment are employed in the scheme to realize the TEM strategy in practical applications. The proposed prototype matching and substitution method can deal with all three possible situations and is effective in reducing global redundancy.

References

1. A. Antonacopoulos, Page segmentation using the description of the background, Computer Vision and Image Understanding, June 1998, Vol. 70, No. 3, pp. 350-369.
2. B. Yu and A. K. Jain, A generic system for form dropout, IEEE Transactions on Pattern Analysis and Machine Intelligence, November 1996, Vol. 18, No. 11, pp. 1127-1134.
3. O. E. Kia, D. S. Doermann, A. Rosenfeld and R. Chellappa, Symbolic compression and processing of document images, Computer Vision and Image Understanding, June 1998, Vol. 70, No. 3, pp. 335-349.
4. H. Yan, Skew correction of document images using interline cross-correlation, CVGIP: Graphical Models and Image Processing, November 1993, Vol. 55, No. 6, pp. 538-543.
5. D. Bodson, S. Urban, A. Deutermann and C. Clarke, Measurement of data compression in advanced Group 4 facsimile systems, Proc. IEEE, Vol. 73, 1985, pp. 731-739.
6. A. Dengel, ANASTASIL: A system for low-level geometric analysis of printed documents, in Structured Document Image Analysis (H. S. Baird, H. Bunke and K. Yamamoto, Eds.), pp. 70-98, Springer-Verlag, Berlin, 1992.
7. S. N. Srihari and V. Govindaraju, Analysis of textual images using the Hough transform, Machine Vision and Applications, Vol. 2, 1989, pp. 141-153.
8. T. Akiyama and N. Hagita, Automated entry system for printed documents, Pattern Recognition, Vol. 23, No. 11, pp. 1141-1154, 1990.
9. H. Bunke and P. S. P. Wang (Eds.), Handbook of Character Recognition and Document Image Analysis, World Scientific, Singapore, 1997, pp. 19-22.
10. M. Holt and C. Xydeas, Recent developments in image data compression for digital facsimile, ICL Technical Journal, 1986, pp. 123-146.
11. S. Inglis and I. H. Witten, Compression-based template matching, in Proceedings of the IEEE Data Compression Conference, 1994.
12. J. Wang and H. Yan, Mending broken handwriting with a macrostructure analysis method to improve recognition, Pattern Recognition Letters, Vol. 20, 1999, pp. 855-864.