10_Holi_OCRfor Kanna..

A Simple and Effective Method for Recognition of Kannada Characters Based on Freeman Chain Code

ABSTRACT

With the recent emergence and widespread application of multimedia technologies, there is increasing demand to create a paperless environment in our daily life. In the transformation from the traditional paper-based society to a truly paperless electronic information society, document image processing in general and optical character recognition (OCR) in particular will play an important role. Machine reading of optically scanned text is usually called optical character recognition [S. Basavaraj Patil et al., 2002]. The literature survey shows that much work has been done on recognition of characters of many languages such as English, Chinese, Japanese, and Korean. The survey also reports research on recognition of many Indian languages such as Tamil, Bangla, Telugu, and Malayalam. Recognition of Kannada characters, however, is still an open research problem, and efforts are being made to develop OCR for the Kannada language. In this paper, an attempt has been made to develop a simple and effective method for the recognition of Kannada characters. The method is based on template matching to recognize basic printed Kannada characters and uses a Freeman chain code built on contour tracing. The contour of a character, obtained using connected component analysis, gives both its inner and outer borders. The proposed method obtains the contour of the character and then generates a Freeman chain code for the contoured character. The obtained chain code is unique for each printed character. Experimental results show relatively high accuracy when the method is tested on characters of all sizes.

Keywords: OCR, Freeman Chain Code (FCC), Contour Tracing, Connected Components.

1. INTRODUCTION

OCR is a subfield of pattern recognition that recognizes printed or handwritten characters in documents.
Owing to developments in information technology, more emphasis is now given in Karnataka to using Kannada at all levels, and hence the use of Kannada on computers is also necessary [10].

The Kannada script

The script has forty-nine characters in its alpha-syllabary and is phonemic. The Kannada character set is almost identical to that of other Indian languages. The number of written symbols, however, is far greater than the 49 characters in the alpha-syllabary, because different characters can be combined to form compound characters (ottaksharas). Each written symbol in the Kannada script corresponds to one syllable, as opposed to one phoneme in languages like English. The Kannada writing system is an abugida, with consonants appearing with an inherent vowel. The characters are classified into three categories: swaras (vowels), vyanjanas (consonants) and yogavaahas (part vowel, part consonant). The name given to a pure, true letter is akshara or varna. Each letter has its own form (ākāra) and sound (shabda), providing the visible and audible representations, respectively. Kannada is written from left to right. The Kannada alphabet (aksharamale or varnamale) now consists of 49 letters.

Figure 1. Vowels in Kannada.

Figure 2. Consonants in Kannada.

2. LITERATURE SURVEY

Many researchers have been working on English, Arabic and Farsi character recognition. Some of the reported works are as follows.

A. Montazer and N. Jafari have proposed a method of identifying Persian characters. The extracted features are used to identify multi-font numerals. The features, extracted from the size-normalized and thinned binary array, form the input to the recognition process. The identification system is designed to optimize the structural parameters in the justification function in order to enhance the recognition rate. [18]

M S Khorsheed has proposed a system to recognize cursive Arabic typewritten text.
The system is built using the Hidden Markov Model Toolkit (HTK), a portable toolkit for speech recognition systems. The proposed system decomposes the page into text lines and then extracts a set of simple statistical features from overlapping windows running through each text line. The feature vector sequence is fed to the global model for training and recognition. A data corpus that includes Arabic text from two computer-generated fonts is used to assess the performance of the proposed system. [19]

Some of the reported works on Indian languages are as follows.

Negi Atul, Chakravarthy Bhagavathi and Krishna B have proposed an OCR system for Telugu (a South Indian language) characters based on fringe distance measurement. Templates of all the characters to be recognized by the system are formed. A similarity measure is computed between the test character and the templates. The template yielding the highest similarity score is declared as the class of the character. [15]

B M Sagar, Dr Shobha G and Dr Ramakanth Kumar P have proposed a method that uses the character ASCII value, character name, character BMP image, character width, character length and total number of ON pixels in the image as the feature vector. [6]

R Sanjeev Kunte and R D Sudhaker Samuel have proposed a system to identify basic symbols in printed Kannada text. Hu's invariant moments and Zernike moments have been used to extract the features of printed Kannada characters. [10]

Ashwin and Sastry have proposed an OCR system in which the character image is split into three basic zones. Each zone is divided into a number of circular tracks and sectors, and features such as the number of ON pixels in each annular region are used for classification with a Support Vector Machine (SVM). [1]

3. PREPROCESSING

Fig 3

The input image looks like the one in Fig 3. This image is preprocessed to remove any disturbances. Once the input has been given, the image is segmented into characters.
This phase has three steps:
a. Line segmentation
b. Word segmentation
c. Character segmentation

a. Line Segmentation: The process of identifying lines in the obtained image is called line segmentation. The steps for line segmentation are as follows.
1. Scan the obtained image horizontally, row by row, to find the first row containing an ON pixel and store its y coordinate as y1.
2. Continue scanning as long as the rows contain an ON pixel.
3. When a row with only OFF pixels is encountered, store its y coordinate as y2.
4. The rows from y1 to y2 form the required line.
5. Repeat the above steps until the end of the image.

b. Word Segmentation: Since there is a gap between one word and the next, each line can be segmented into words.
1. Scan the obtained line vertically, column by column, to find the first column containing an ON pixel and store its x coordinate as x1.
2. Continue scanning as long as the columns contain an ON pixel.
3. The distance between characters is less than the distance between two words. This information is used to segment the line into words.
4. Repeat the above steps until all the line segments are segmented into words.

c. Character Segmentation:
1. Scan the identified word vertically, column by column, to find the first column containing an ON pixel and store its x coordinate as x1.
2. Continue scanning as long as the columns contain an ON pixel.
3. When a column with only OFF pixels is encountered, store its x coordinate as x2.
4. The columns from x1 to x2 form the required character.
5. Repeat the above steps for all recognized words.

Contour Tracing/Analysis

Contour tracing is also known as border following or boundary following. It is a technique applied to digital images in order to extract the boundary of an object or a pattern such as a character. The boundary of a given pattern P is the set of border pixels of P.

Fig 4(a) 4-neighbors (4-connected); 4(b) 8-neighbors (8-connected).

Contour tracing is applied to each of the segmented characters.

4. FEATURE EXTRACTION

Once the contour of the image is obtained, the Freeman chain code is applied.

Fig 5: Eight chain-code directions.
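As a rough illustration of the chain-coding step, the following Python sketch converts an ordered list of boundary pixel coordinates into a Freeman chain code. It is not the authors' implementation (which was written in Matlab). The function name `freeman_chain_code`, the `DIRECTIONS` table, and the numbering convention (0 = east, increasing counter-clockwise, rows growing downward) are assumptions made for this example; the numbering used in Fig 5 may differ.

```python
# Freeman 8-direction codes under an assumed convention:
# 0 = east, then counter-clockwise in 45-degree steps
# (1 = north-east, 2 = north, ..., 7 = south-east).
# Offsets are (d_row, d_col) with rows increasing downward (image coordinates).
DIRECTIONS = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def freeman_chain_code(boundary):
    """Return the chain code of a closed contour.

    `boundary` is an ordered list of (row, col) boundary pixels in which
    consecutive pixels are 8-neighbours. The last pixel is linked back to
    the first, so the code has one digit per boundary pixel.
    """
    n = len(boundary)
    if n < 2:
        return []  # a single pixel has no direction to encode
    codes = []
    for i in range(n):
        r0, c0 = boundary[i]
        r1, c1 = boundary[(i + 1) % n]  # wrap around to close the contour
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {boundary[i]} and {boundary[(i + 1) % n]} are not 8-neighbours")
        codes.append(DIRECTIONS[step])
    return codes


# A 2x2 square traced clockwise from its top-left pixel:
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(freeman_chain_code(square))
```

The boundary coordinates are assumed to come from a contour tracing routine that already returns them in traversal order; producing that ordering is the job of the contour tracing step and is not shown here.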
The coordinates of the boundary pixels are obtained first; based on these coordinates, the chain code of the character image is found. The normalized chain code is obtained by transforming the chain code into a two-dimensional matrix. The first row of this matrix contains the values of the chain code, and the second row contains the frequency of occurrence of each value [3]. For example, if the chain code of a given character is:

88883331111225833312

then it can be converted into the following 2 × 9 matrix:

8 3 1 2 5 8 3 1 2
4 3 4 2 1 1 3 1 1

Next, all values whose frequency is 1 are removed. In the above example, the chain code is reduced to:

8 3 1 2 3
4 3 4 2 3

and, after the frequencies of repeated digits are combined, to:

8 3 1 2
4 6 4 2

The process of removing the less-frequent digits can be continued. For instance, in our test, digits with frequencies less than or equal to five were deleted. In the resulting chain code, the frequencies of the remaining digits are summed. Then, to transform the chain code matrix into a normalized chain code of length 10, the relative frequency of each digit is computed using:

[equation]

where the two quantities are the normalized frequency and the frequency of each digit in the chain code, respectively. In the above example we obtain:

8    3    1    2
2.22 3.33 2.22 1.11

The normalized frequencies are then rounded to the nearest integer and the digits are concatenated accordingly to generate the chain code of length 10:

8833311222

5. PROPOSED METHOD

1. The input image consists of characters. After preprocessing, line, word and character segmentation are performed on this image.
2. Once the characters are segmented, the contour of each character image is obtained. The coordinates of the boundary pixels are then extracted from the contour and used to obtain the Freeman chain code.
3. The obtained chain code is then normalized.
4. The normalized chain code is compared with the existing database. If a match occurs, the corresponding character is identified; otherwise an error message is generated.

6. EXPERIMENTAL RESULTS

Matlab was used to implement the proposed system, together with a database of all characters and their normalized chain codes. After the normalized chain code was obtained, it was compared with the existing database to recognize the character, yielding 98% accuracy.

Table I: Results obtained from implementation of the employed method

Letter name   Contour image   Image     Normalized chain code
AA            (image)         (image)   1005443622
BA            (image)         (image)   2213400766
JA            (image)         (image)   2134460078
MA            (image)         (image)   0013276654
NA            (image)         (image)   0000327644
OO            (image)         (image)   1007653222
TA            (image)         (image)   0404672426

7. CONCLUSION

A template-matching, chain-code-based approach for identification of Kannada characters was introduced in this paper. After the contour of a character is obtained, features are extracted using the Freeman chain code. The obtained chain code is compared with the database to identify the character. The proposed method was tested on all simple characters of Kannada text. Experimental results using standard fonts show that the accuracy of the proposed method is 98%. In future work, we will try to extend the method to all compound characters such as ottaksharas (mixed characters) and to handwritten Kannada characters.

8. REFERENCES

1. Ashwin T V, Sastry P S 2002 A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana 27: 35–58
2. Chong Chee-Way, Raveendran P, Mukundan R 2003 A comparative analysis of algorithms for fast computation of Zernike moments. Pattern Rec. 36: 731–742
3. Izakian H, Monadjemi S A, Tork Ladani B, Zamanifar K Multi-font Farsi/Arabic isolated character recognition using chain codes.
4. Girosi F, Poggio T 1990 Networks and the best approximation property. Bio. Cybernetics 63: 169–176
5. Gonzalez R C, Woods R E 1993 Digital image processing (Boston, MA, USA: Addison Wesley Longman Publishing Co. Inc.)
6. Sagar B M, Shobha G, Ramakanth Kumar P 2008 Converting printed Kannada text image file to machine editable format using database approach. International Journal of Computers 2(2)
7. Hu M-K 1962 Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory IT-8: 179–187
8. Jawahar C V, Pavan Kumar, Ravi Kiran S S 2003 A bilingual OCR for Hindi-Telugu documents and its applications. Proc. Seventh Int. Conf. on Document Anal. and Rec. 408–412
9. Khotanzad A 1998 Rotation invariant pattern recognition using Zernike moments. Proc. Int. Conf. on Pattern Rec. 326–328
10. Kunte Sanjeev R, Sudhaker Samuel R D 2006 A two-stage character segmentation scheme for printed Kannada text. J. Graphics, Vision and Image Processing 6: 1–8
11. Moody J, Darken C J 1989 Fast learning in network of locally-tuned processing units. J. Neural Comput. 1: 281–294
12. Mukundan R, Ong S H, Lee P A 2001 Image analysis by Tchebichef moments. IEEE Trans. Image Processing 10: 1357–1364
13. Mohammed Al-Rawi, Yang Jie 2002 Practical fast computation of Zernike moments. J. Comput. Sci. and Technol. 17: 181–188
14. Nagabhushan P, Pai Radhika M 1999 Modified region decomposition method and optimal depth decision tree in the recognition of non-uniform sized characters: An experimentation with Kannada characters. Pattern Rec. Lett. 20: 1467–1475
15. Negi Atul, Chakravarthy Bhagavathi, Krishna B 2001 An OCR system for Telugu. Proc. Sixth Int. Conf. on Document Anal. and Rec. 1110–1114
16. Vijaya Kumar B, Ramakrishnan A G 2004 Radial basis function and subspace approach for printed Kannada text recognition. Proc. IEEE ICASSP 2004 5: 321–324
17. Basavaraj Patil S, Subbareddy N V 2002 Neural network based system for script identification in Indian documents. Sadhana 27(1): 83–97
18. Montazer A, Jafari N, Ebrahimnezhad H Recognition of Persian numeral fonts by combining the entropy minimised fuzzifier and fuzzy grammar. 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Corfu Island, Greece
19. Khorsheed M S Recognising cursive Arabic text using a speech recognition system. 7th WSEAS Int. Conf. on Automation & Information, Cavtat, Croatia
