An Automation System to Recognize Sinhala Handwritten Characters Using Artificial Neural Networks

Abstract

In the Government of Sri Lanka, most information-based activities are still carried out manually. This research proposes a new way to automate a fundamental public service: issuing the National Identity Card (NIC). It presents an approach for recognizing Sinhala handwritten characters in the application forms. Initially, a set of handwriting samples from 30 individuals was collected; two thirds of those samples were used for training and the remaining one third for testing. The scanned images of the characters were preprocessed, which covers finding the character boundaries and normalizing the characters. After preprocessing, segmentation is performed to obtain the individual characters from the list of characters. Standard image processing techniques were employed to accomplish these tasks. The segmented characters were then used to train an Artificial Neural Network (ANN). Recognition of Sinhala characters is done by the ANN, a technique widely used in applications involving uncertainty. Rules are imposed on the results of the neural network (NN) to make the recognition process more accurate. The details of the applicant are then appended to the database. The outcome of this research will be beneficial to the general public at large.

Introduction

In many countries, e-government strategies are being used to make government processes more efficient and accurate. Typically, in order to obtain a service from the Gramma Niladari (GN), citizens have to fill out the required application forms. At present, most of these forms are in Sinhala. Therefore, when the GN receives the completed application forms, he has to go through them manually before they can be processed.
If the Government had an integrated system capable of connecting all the basic entities in the process, automation would become easier. This study is therefore a starting point that supports the concept of e-government in the Sri Lankan context. In Sri Lanka, citizens are most often connected to the Government through the GN. Automating the services performed by the GN plays an important role in developing a realistic e-government strategy. To achieve these goals, this study proposes a system that extracts Sinhala handwritten characters from the application forms mentioned above. The most important component of this task is extracting data from forms filled in by citizens in the Sinhala language, which involves Sinhala handwritten character recognition. Usually, people submit applications filled in with their own handwriting. Nandasara (1995) states that identifying handwritten characters is a challenging task because the variation that has to be captured among the characters is high. Furthermore, the special structure of Sinhala characters makes the recognition process complex [1].

Much effort has previously gone into making computers recognize both handwritten and typed characters automatically. Until quite recently, this effort focused on recognizing English characters; for Asian languages such as Sinhala and Tamil there have been few efforts. Methods widely used for character recognition in these languages include pattern matching using image processing techniques.

Material & Methods

1. Data Acquisition

In the data acquisition stage, handwriting samples from 30 people were collected. The samples of 20 people were used for training the neural network and the remaining 10 for testing. Sample letters were collected on blank A4 sheets ruled with dotted pencil lines.
Each person was asked to write a given set of letters on those dotted lines. The pencil lines were then erased and the sheets were scanned with an HP ScanJet scanner at 200 dpi resolution. Only a limited number of characters was collected, since some characters are rarely used in practice.

Implementation of the NN consists of four main steps, which are shown in the following diagram:

i. Pre Processing
ii. Segmentation
iii. Training
iv. Post Processing

Figure 1.0 – Main Steps of Word Processing

2. Pre Processing

In this stage the image is prepared for further processing. First, the image is filtered to remove noise that may have been added during scanning. Here, noise is to be understood as anything that prevents the recognition system from fulfilling its objective. Noise can be added to the image by the roughness of the paper. The scanned images were observed to contain salt and pepper noise, so median filtering was used to remove it.

After filtering, the image is binarized, that is, converted to black and white, to ease processing. Different people may write the letters in different colors, and binarization [10] removes that effect. Most typical character recognition systems follow these steps before the processing stage.

After binarization, the next step is to thin the characters. The goal of thinning is to eliminate differences in pen thickness by reducing each stroke to a width of one pixel. When letters are written, ink can blot and make the strokes much thicker. Thinning removes this effect and brings all the letters into one standard format.
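The filtering and binarization steps described above can be illustrated with a short sketch. It is written in plain Python rather than with the MATLAB built-in functions used in this work, and the 3x3 window, the threshold of 128 and the function names `median_filter` and `binarize` are assumptions made only for illustration.

```python
def median_filter(image, size=3):
    """Median-filter a 2D list of gray levels (0 = black, 255 = white).

    Border pixels use only the neighbours that exist inside the image.
    """
    rows, cols = len(image), len(image[0])
    half = size // 2
    filtered = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = sorted(
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            )
            # Middle element of the sorted window (upper median if even).
            filtered[r][c] = window[len(window) // 2]
    return filtered


def binarize(image, threshold=128):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0.

    The fixed threshold is an assumed value, not one taken from the study.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]
```

Applying `binarize(median_filter(scanned))` to a grayscale image removes isolated salt and pepper pixels before the ink/paper decision is made, which is the order of operations described in this section.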
Morphological operations were applied for thinning. These are the three steps followed in the preprocessing stage. MATLAB provides built-in functions for each of these image processing techniques [3], and those built-in functions were used in the implementation.

3. Segmentation

In the segmentation stage, the image is divided into characters, and the white space around each character is removed.

Figure 2.0 - Finding Boundaries of a Character

Projection profiles of the image are used to crop the image into text lines and then into individual letters. First, the horizontal projection profile is used to detect the text lines, and the image is segmented into those lines. Then the vertical projection profile of each text line is used to segment it into individual characters. Since the scanned image consists of 9 text lines, the horizontal histogram also consists of 9 bars, one for each text line. The boundaries of those bars are obtained and used to crop the image into text lines.

Once a text line has been obtained, its letters have to be cropped so that they can be input to the system. This is done using the vertical histogram of the text line, which shows how the letters are distributed within the line. From it the boundaries of the characters can be found, and each character can then be cropped and isolated for input to the system. This procedure was carried out for all characters in the collected data set. The resulting column vectors were then used to create the input vector. An NN had to be created for the segmented characters, so both the input vector and the test vector were prepared, and the NN was then trained with the input vector.

Results and Discussion

The most salient feature of NNs is their large number of processing units and their interconnectivity.
Unless handled carefully, the various parameters of the NN architecture may slow down the training process (the adjustment of weights) considerably. These parameters include the number of layers, the number of neurons in each layer, the initial values of the weights, the training coefficient and the tolerance of correctness. The optimal selection of parameters varies depending on the alphabet.

To train the weights, an initial set of weights is tested against each input vector. If an input vector is found for which recognition fails, the weights are adjusted to suit that input vector. However, this adjustment may also affect the recognition of other input vectors that have already been tested, so the entire model needs to be tested again from the beginning.

ANNs are capable of abstracting the essence of a set of inputs. For example, a network can be trained on a sequence of distorted versions of a letter. After adequate training, application of such a distorted example will cause the network to produce a perfectly formed letter. Experimental results have revealed that training on more than 20 such distorted versions of the same letter produces correct results with a very high percentage of accuracy.

Back propagation is a systematic method for training multilayer ANNs (multilayer perceptrons). The sigmoid compresses the range of NET, the weighted sum of a neuron's inputs, so that the neuron's output OUT lies between zero and one. Since back propagation uses the derivative of the squashing function [2], the function has to be differentiable everywhere. The sigmoid has this property and the additional advantage of providing a form of automatic gain control.

Properly trained back propagation networks tend to give reasonable answers when presented with inputs that they have never seen. Typically, a new input leads to an output similar to the correct output for the training input vectors that resemble the new input.
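The role of the sigmoid in back propagation can be made concrete with a small sketch. It assumes the standard logistic form OUT = 1 / (1 + exp(-NET)), which is consistent with an output between zero and one. The function names are illustrative and are not taken from the implementation used in this work.

```python
import math


def sigmoid(net):
    """Logistic squashing function: maps any real NET into (0, 1)."""
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    # Equivalent form that avoids overflow for large negative NET.
    e = math.exp(net)
    return e / (1.0 + e)


def sigmoid_derivative(out):
    """Derivative of the logistic function, expressed through OUT itself."""
    return out * (1.0 - out)


# The slope is largest near NET = 0 and shrinks for large |NET|, which is
# the "automatic gain control" effect: small signals are amplified and
# large signals are compressed.
for net in (-6.0, -2.0, 0.0, 2.0, 6.0):
    out = sigmoid(net)
    print(f"NET={net:+.1f}  OUT={out:.4f}  slope={sigmoid_derivative(out):.4f}")
```

Because the derivative can be computed from OUT alone, the backward pass does not need to re-evaluate the exponential, which is one practical reason this squashing function is convenient for back propagation.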
This generalization property makes it possible to train a network on a representative set of input/target pairs and get good results without training the network on all possible input/output pairs.

Conclusions

One of the major problems in doing this for Sinhala handwritten characters is that the parts of a letter do not appear at the same relative locations, because different writers of the language write characters in different proportions [6]. Even the same person may not always write the same letter with the same proportions. Normalizing the characters to a standard size helps to some extent but does not completely eliminate this effect.

Training is the most important and the most time-consuming activity in NN implementations. An efficient system should take the minimum training time possible. To minimize the training time, experiments should be carried out on the parameter values to choose a set of values that reduces it. Several factors affect the training time and performance of the networks. The following parameters could be adjusted to minimize the training time:

a) Initial values of the weights
b) Number of neurons in the hidden layer
c) Training coefficient
d) Tolerance
e) Grid size used to extract bit patterns from the input image
f) Size of the training data set
g) Constituent characters in the training set
h) Form of the input (i.e. individual handwriting)
i) How representative the training set is
j) How representative the test set is for generalization

Training is therefore a process that has to be carried out carefully in order to obtain a good recognition rate.

The performance plot (Figure 3.0) does not indicate any major problems with the training. The validation and test curves are very similar. If the test curve had increased significantly before the validation curve increased, some overfitting might have occurred.
According to the graph, the mean squared error decreases over time for the training, validation and test sets. It is a good performance measure.

Figure 3.0 - Performance Plot

From the whole exercise of attempting to use NN techniques for the recognition of characters in the Sinhala alphabet, it was found that a separate approach could be developed by employing NN techniques together with image processing techniques. The limited recognition rate could be due to unreliability of the data and of the segments.

Several attempts were made to improve the results. Since the network was not sufficiently accurate, it was reinitialized and trained again. Each time a feed forward network is initialized, the network parameters are different and might produce different solutions. However, this did not achieve a considerable recognition rate.

Next, an attempt was made to improve the results by increasing the number of hidden neurons above 20. A larger number of neurons in the hidden layer gives the network more flexibility because the network has more parameters it can optimize. The layer size should be increased gradually: if the hidden layer is made too large, the problem may become under-characterized, and the network must optimize more parameters than there are data vectors to constrain them. That effort was not successful either.

The third option was to try a different training function. Bayesian regularization training with trainbr, for example, can sometimes produce better generalization capability than early stopping. In addition, training functions such as trainscg and trainrp, which are more appropriate for character recognition systems, were used. However, a considerable recognition rate could not be achieved with these either.
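The remark about a hidden layer that is too large can be checked with simple arithmetic. The sketch below counts the weights and biases of a feed forward network with one hidden layer and compares the total with the number of training vectors. The input size (a 16x16 grid), the 8 output classes and the 160 training vectors are hypothetical values chosen only for illustration and are not the figures of this study.

```python
def count_parameters(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of a network with a single hidden layer."""
    hidden_layer = (n_inputs + 1) * n_hidden    # +1 for each neuron's bias
    output_layer = (n_hidden + 1) * n_outputs
    return hidden_layer + output_layer


def is_under_characterized(n_inputs, n_hidden, n_outputs, n_training_vectors):
    """True when there are more free parameters than training vectors."""
    return count_parameters(n_inputs, n_hidden, n_outputs) > n_training_vectors


# Hypothetical setup: 16x16 bit pattern per character, 8 character classes,
# 160 training vectors.
N_INPUTS, N_OUTPUTS, N_VECTORS = 16 * 16, 8, 160
for n_hidden in (10, 20, 40):
    total = count_parameters(N_INPUTS, n_hidden, N_OUTPUTS)
    flag = is_under_characterized(N_INPUTS, n_hidden, N_OUTPUTS, N_VECTORS)
    print(f"hidden={n_hidden:3d}  parameters={total:6d}  under-characterized={flag}")
```

A count like this gives a quick sanity check before enlarging the hidden layer: the number of parameters grows linearly with the number of hidden neurons, while the number of training vectors stays fixed unless more samples are collected.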
Since none of the above methods produced a better result, it was decided to use additional data for the training and testing stages. The handwriting styles in the training data set might be similar to each other, which could be a reason for not obtaining a higher testing rate. Providing additional data is more likely to produce a network that generalizes well to new data. Better results can be expected by increasing the number of character samples, which will help to achieve a more generalized trained network.

Training a neural network to a higher testing rate is also a challenge, because overfitting can occur. In that case the network has been trained to classify only the items in the training set: if it is fitted too closely to the patterns in the training set, it cannot classify items that it has never seen.

When selecting the training sets, it is better to group the character sets according to the shape of the character (such as round or square) or the size of the characters. The results would then be more accurate, and it would be a complete system for character training and testing. More training sets and training programs are required to develop such a system. A user interface can be introduced to make it user friendly; a menu-driven program with push-button controls and pictures would be attractive. In its current state the NN can be used only for training eight characters of the Sinhala alphabet with initial guidance. A trainee can continue training while enjoying it as a game.

References

Rajapakse, Jagath (2000): "Neural Networks and Pattern Recognition", Course Notes, Nanyang Technological University, Singapore, December
Aleksander, Igor & Morton, Helen (1991): "An Introduction to Neural Computing", Chapman & Hall, ISBN 0 412 37780 2
MATLAB Documentation, Version 7.1.2 (R11), The MathWorks, Inc., Jan. 21, 1999
Nandasara, S. T., Disanayake, J. B., Samaranayake, V. K., Seneviratne, E. K. and Koannantakool, T. (1990): "Draft Standards for the Use of Sinhala in Computer Technology", Computer & Information Council of Sri Lanka (CINTEC)
Beale, R. and Jackson, T. (1990): "Neural Computing – An Introduction", IOP Publishing Ltd., ISBN 0 852774 262 2
Disanayake, J. B. (1993): "Let's Read and Write Sinhala", Pioneer Lanka
Valluru, R. and Hayagriva, R. (1996): "C++ Neural Networks and Fuzzy Logic", BPB Publications
Gose, Earl, Johnsonbaugh, Richard and Jost, Steve (2003): "Pattern Recognition and Image Analysis", Prentice-Hall, India
Premaratne, Hemakumar L. and Bigun, J. (2002): "Recognition of Printed Sinhala Characters Using Linear Symmetry", The 5th Asian Conference on Computer Vision
Manning, Christopher D. and Schütze, Hinrich (2000): "Foundations of Statistical Natural Language Processing", MIT Press