TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS

A FINAL YEAR PROJECT REPORT ON
Name Conflict Resolution for Company Registration

By:
Gaurav Kumar Goyal (16214)
Janardan Chaudhary (16216)
Nimesh Mishra (16221)
Sanat Maharjan (16230)

A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE BACHELOR'S DEGREE IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
AUGUST, 2013

INSTITUTE OF ENGINEERING, PULCHOWK CAMPUS
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

The undersigned certify that they have read, and recommended to the Institute of Engineering for final submission and presentation, the project entitled "Name Conflict Resolution for Company Registration" submitted by Gaurav Kumar Goyal, Janardan Chaudhary, Nimesh Mishra and Sanat Maharjan in partial fulfilment of the requirements for the Bachelor's degree in Computer Engineering.

_________________________________________________
Supervisor, Prof. Dr. Shashidhar Ram Joshi
Department of Electronics and Computer Engineering

_________________________________________________
Co-Supervisor, Er. Sansar Jung Dewan
IT Officer, Office of Company Registrar (OCR)

__________________________________________________
Internal Examiner, Baburam Dawadi
Department of Electronics and Computer Engineering

__________________________________________________
External Examiner, Anjesh Tuladhar
COO, Young Innovations Pvt. Ltd.

DATE OF APPROVAL: 25 Aug. 2013

COPYRIGHT

The author has agreed that the Library, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the author has agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that recognition will be given to the author of this report and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this project report. Copying, publication or any other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering and the author's written permission is prohibited. Requests for permission to copy or to make any other use of the material in this report, in whole or in part, should be addressed to:

Arun Timilsina, PhD / Professor
Head of Department
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu, Nepal

ACKNOWLEDGEMENT

First of all, we would like to express our sincere gratitude towards the Department of Electronics and Computer Engineering, Pulchowk Campus for including the final year major project as part of our syllabus for the final year of B.E. in Computer Engineering. We would like to extend our gratitude towards Dr. Arun Timilsina, Head of Department, Electronics and Computer Engineering, for assisting us in our project. We would like to take the privilege to express our gratitude towards Prof. Dr. Shashidhar Ram Joshi for being our project supervisor. We would also like to thank Dr. Aman Shakya for his support and assistance.
We are deeply indebted to Er. Sansar Jung Dewan of the Office of Company Registrar, and to the Office of Company Registrar itself, for giving us the opportunity to do this project with such broad scope. We would also like to express our sincere thanks to Mr. Bal Krishna Bal, Assistant Professor, Department of Electronics and Computer Engineering, Kathmandu University, for his help and support. Last but not least, we would like to thank our friends and classmates for their help and valuable suggestions.

ABSTRACT

Natural language processing is one of the most researched fields. One of its applications is determining the similarity of sentences. Name conflict resolution is essentially about comparing words. Many systems have been developed for this purpose and are in wide use. In the context of Nepal, resolving naming conflicts during the registration of a company is currently done manually (by a human). However, there is a requirement for automation of this process. The automation requires natural language processing, translation between languages and transliteration between languages. There are several constraints on the checking, set by the Office of Company Registrar (OCR). These constraints must be considered while comparing words. The words need to be tokenized and stemmed before they can be processed further.

Keywords: OCR, Morphological Analysis, Similarity Matching, Natural Language Processing.

TABLE OF CONTENT

COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENT
TABLE OF FIGURES
Chapter 1 INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope of the work
Chapter 2 LITERATURE REVIEW
2.1 Introduction
2.2 Common processes used in text similarity
2.2.1 Downcasting
2.2.2 Transformation
2.2.3 Stopword Removal
2.2.4 Tokenization
2.2.5 Stemming
2.3 Existing Name checking Systems
2.4 Criteria defined by OCR
2.5 Matching Techniques
2.5.1 Phonetic encoding
2.5.1.1 Soundex
2.5.1.2 Metaphone
2.5.2 Pattern matching
2.5.2.1 Levenshtein or Edit Distance
2.5.2.2 Sorenson similarity
2.5.2.3 Cosine Similarity
2.6 Summary
Chapter 3 REQUIREMENT ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.2.1 Reliability
3.2.2 Performance
3.2.3 Accuracy
Chapter 4 METHODOLOGY
4.1 Introduction
4.2 System Design
4.2.1 Flow Diagram
4.2.2 Deployment Diagram
4.2.3 System Architecture
4.2.3.1 Preprocessing Engine
4.2.3.2 Translation and Transliteration
4.2.3.3 Possible Keyword Generation
4.2.3.4 Comparison
4.2.3.5 Ranking
4.2.4 Detailed Class Diagram
4.3 Project Tools
4.4 Eclipse as Programming IDE
4.5 MySQL as Database System
Chapter 5 EXPERIMENTAL SETUP
Chapter 6 OUTPUT
Chapter 7 RESULT AND ANALYSIS
Chapter 8 CONCLUSION AND FURTHER ENHANCEMENT
8.1 Conclusion
8.2 Limitations
8.3 Further Enhancement
REFERENCE
APPENDIX A: Gantt chart
APPENDIX B: Use Case
APPENDIX C: Preprocessing Detail Example
APPENDIX D: Comparison Detail
APPENDIX E: Output Screenshot
APPENDIX F: Data Flow Diagram
APPENDIX G: Theory

TABLE OF FIGURES

Figure 1 Flow Chart
Figure 2 Deployment Diagram
Figure 3 System Architecture
Figure 4 Preprocessing Engine
Figure 5 Detailed Class Diagram
Figure 6 Example - I
Figure 7 Example - II
Figure 8 Example - III
Figure 9 Example - IV
Figure 10 Example - V
Figure 11 Example - VI
Figure 12 Example - VII
Figure 13 Computation Time without Transformation
Figure 14 Computation Time with Transformation
Figure 15 Gantt Chart
Figure 16 Use Case Diagram
Figure 17 Comparison I (Part A)
Figure 18 Comparison I (Part B)
Figure 19 Comparison II (Part A)
Figure 20 Comparison II (Part B)
Figure 21 Output Screenshot
Figure 22 Data Flow Diagram

Chapter 1 INTRODUCTION

1.1 Background

Trying to understand language as a unit in machine terms is not as easy as it may seem.
Words are perhaps the most intuitive units of language, yet they are in general tricky to define. In most languages, words are defined as the smallest linguistic units that can form a complete utterance by themselves. Natural language processing deals with the ambiguity involved in processing words.

The Office of Company Registrar is responsible for administering the laws and regulations concerning companies. Almost all of the daily tasks of the office used to be manual; the OCR has now moved towards automating these tasks using computerized systems. Before the advent of the current online system, the processes relating to the change, admission and removal of company names used to be difficult and cumbersome. Even after the recent development of the office's online system, the system isn't intelligent enough. Currently the Office of Company Registrar (OCR) has implemented database entity comparison features. The process of searching company names is often based on English names. The comparison features are, however, limited to entity-to-entity matching and phonetic-based matching. The existing system often fails to act responsively and accurately during the registration of a new company. The current system is severely limited due to the above-mentioned comparison method. The same problem arises when a new company tries to reserve its name.

The naming conflict resolution system for company registration is a system that finds the similarity between the proposed name of a company and the existing company names in the database. This requires the use of some of the techniques of natural language processing. First of all, the input is downcased and stop words are removed from the proposed name. The name is then transformed, tokenized and stemmed to determine the root words used in similarity checking. These words are then used to form a set of probable tokens using translation and transliteration. These tokens are then matched against words from the database to form a ranking of similar names.

The system needs to translate Nepali words to English words and vice versa. The translation is done with the help of dictionaries. The removal of stop words requires a pool of predefined words to be removed. The constraints are defined by the Office of Company Registrar. These constraints include the use of plural words, case sensitivity, punctuation and spacing in the names, the use of numbers, different phonetic spellings or spelling variations and many others. The system will also assist in the decision-making process of whether or not to approve the proposed name. This system will result in efficient processing and faster registration of names.

1.2 Motivation

Almost all of the daily tasks of the office used to be done manually. But now the OCR has moved towards automating these tasks using computerized systems. Before the advent of the current online system, the processes relating to the change, admission and removal of company names used to be difficult and cumbersome. Even after the recent development of the office's online system, the system isn't intelligent enough. Currently the Office of Company Registrar (OCR) has implemented database entity comparison features. The process of searching company names is often based on English names. The comparison features are, however, limited to entity-to-entity matching and phonetic-based matching. The existing system often fails to act responsively and accurately during the registration of a new company. The current system is severely limited due to the above-mentioned comparison method.
These limitations in the current system motivated us to develop a more reliable and accurate system based on string matching algorithms, which produces more accurate results than the phonetic-based string matching approach currently in use.

1.3 Problem Statement

A recent improvement in the registration of new companies is the addition of the online registration and name checking system. However, the current name checking system suffers from a lack of accuracy and from the drawbacks of matching names by their phonetic pronunciation. In this project, we try to build a system that checks the validity of proposed names using string matching schemes rather than phonetic ones. Our objective is to determine the extent to which a proposed name is similar to existing names and, based on this, to determine whether the name is available for registration.

1.4 Objectives

The main objective of the project is to develop a system capable of checking the similarity of proposed company names against registered ones. The objectives can be further broken down as:
1. To develop a system to resolve naming conflicts.
2. To find names similar to the name proposed by the user.
3. To provide the ranks of the proposed name matched against other existing names.
4. To define the threshold level used to validate a name.

1.5 Scope of the work

Name checking systems are used in many countries to check the proposed name of a company. A variety of approaches are available for developing such a name checking system. The approach used here is an NLP approach. The system will be able to check the proposed name with much better accuracy than the current system. This system will be beneficial to the clients and to the OCR. This system is based on research along with the study and analysis of existing systems. The system will produce output in the form of a .csv file containing the similarity scores of various names with respect to the proposed name.

Chapter 2 LITERATURE REVIEW

2.1 Introduction

This project is all about checking the validity of proposed company names for the Office of Company Registrar. One of the important steps while developing such a system is to examine all the relevant research areas thoroughly. It is important to know about Natural Language Processing in order to understand the processes used in this project. Also, for designing this system, existing systems were studied thoroughly.

Natural Language Processing (NLP) is a branch of computer science that deals with natural language information. NLP is a component of artificial intelligence. NLP is a form of human-to-computer interaction where the elements of human language, be it spoken or written, are formalized so that a computer can perform value-adding tasks based on that interaction. Human language is dauntingly complex for a computer to understand. NLP is used in various areas like language translation, speech processing, checking for grammatical errors, etc.

2.2 Common processes used in text similarity

It is always useful to know about the different types of processes used for NLP. Some of the common processes are mentioned below:

2.2.1 Downcasting

Downcasting, also referred to as type refinement, is the act of converting text from uppercase letters to lowercase. It is done to make sure that there is no conflict between company names merely because of differences in the case of letters.

2.2.2 Transformation

Transformation is the conversion of words from British English to their American English equivalents.
Transformation is done to avoid the generation of unwanted or conflicting keywords.

2.2.3 Stopword Removal

Stop word removal is the process of removing certain predefined stop words from the string. We use this process to remove the words that are considered similar/unimportant as defined by the Office of the Company Registrar directives.

2.2.4 Tokenization

Tokenization is the process of breaking up a string into tokens to be indexed, using predefined dictionaries or by analyzing the whitespace. These dictionaries can be a pool of predefined words or a bilingual English-Nepali dictionary.

2.2.5 Stemming

Stemming is the process of reducing a word to a root or simpler form, for example reducing plural words to their singular forms. Stemming is often used in text processing applications. There are many different approaches to stemming, each with its own design goals. Some are aggressive, reducing words to the smallest root possible.

2.3 Existing Name checking Systems

In order to develop an effective name checking system, it is important to study similar existing systems so that the system to be developed covers some of their deficiencies. We mainly focused on the existing system used by OCR Nepal. A name checking system takes the name proposed by the customer and compares it with similar already existing names. Based on the results, it determines whether the name is allowed to be registered.

1. Office of Company Registrar, Nepal
This system uses phonetic algorithms to check the names. The customer has to visit the homepage of the OCR [1] and enter the proposed name. The system checks this name against already existing names and determines whether the name is valid. The existing system, however, suffers from a lack of accuracy.

2. Companies House, United Kingdom
This system is used by the government of the United Kingdom to check proposed names. The client can visit the website [2] and check for the intended name. The system returns the list of existing similar names.

3. CIPC
CIPC stands for Companies and Intellectual Property Commission. It is a system that checks the availability of the name proposed by the customer. The client can visit the website [3], register by paying the fee, and then check his/her intended company name. The CIPC checks the name against existing registered businesses and rejects names that are too similar. The system also checks whether the name is reserved.

2.4 Criteria defined by OCR

In approving a proposed name of a company, the following shall not be considered different or distinguishable:
1. The words Private, Pvt., (P), Limited, Ltd, Ltd., Limited Liability.
2. The words appearing at the end of the names – company, and company, co., co.
3. The plural version of any of the words appearing in the name.
4. The type and case of letters, spacing between letters and punctuation marks.
5. Joining words together or separating the words, as this does not make a name distinguishable from a name that uses the similar, separated or joined words. For example, Him Shikhar Travels Pvt. Ltd. will be considered similar to Himshikhar Travel.
6. The use of the number form of the same word (and the use of tense in English), as this does not distinguish one name from another. For example, Three Six Five Tours and Travels Pvt. Ltd. will be considered similar to 365 Tours and Travels Pvt. Ltd.
7. Using different phonetic spellings or spelling variations, as this does not distinguish one name from another.
For example, if S.D. Enterprises Limited already exists, then S and D Enterprises or Satya Darshan Enterprises will not be allowed.
8. Similarly, if a name contains numeric characters like 3, 6 and 7, resemblance shall be checked against "Three", "Six" and "Seven".
9. The use of an internet-related designation, such as .COM, .NET, .EDU, .GOV, .ORG, .IN, as this does not make a name distinguishable from another.
10. The addition of words like New, Modern, Nav, Shri, Sri, Shree, Sree, Om, Jai, Sai, The, etc., as this does not make a name distinguishable from an existing name, such as New Kantipur Publication Pvt., Shree Sai Enterprises.
11. The addition of the name of a place like Kathmandu or Janakpur, as this does not make a name different or distinguishable. For example, 'Kathmandu Sugam Pharmaceuticals Private Ltd.' cannot be allowed if 'Sugam Pharmaceuticals Private Ltd' already exists. Such names may be allowed only if a no-objection from the existing company, by way of a Board resolution, is produced/submitted.
12. A different combination of the same words, as this does not make a name distinguishable from an existing name; e.g., if there is a company in existence by the name of "Builders and Contractors Limited", the name "Contractors and Builders Limited" should not be allowed.
13. An exact Nepali translation of the name of an existing company in English or another language. For example, Kathmandu Dairy Industry Limited will not be allowed if there exists a company with the name 'Kathmandu Dugdh Udyog Limited'.

2.5 Matching Techniques

Name matching can be defined as the process of determining whether two name strings are instances of the same name [18]. As name variations and errors are quite common [17], exact name comparison will not result in good matching quality. Rather, an approximate measure of how similar two names are is desired. Generally, a normalized similarity measure between 1.0 (two names are identical) and 0.0 (two names are totally different) is used. The two main approaches for matching names are phonetic encoding and pattern matching. Different techniques have been developed for both approaches, and several techniques combine the two with the aim of improving the matching quality.

2.5.1 Phonetic encoding

Common to all phonetic encoding techniques is that they attempt to convert a string into a code according to how the string is pronounced (i.e. the way the string is spoken). Naturally, this process is language dependent. Most techniques have been developed mainly with English in mind.

2.5.1.1 Soundex

Soundex, based on English language pronunciation, is the oldest and best known phonetic encoding algorithm. It keeps the first letter of a string and converts the rest into numbers according to the following encoding table:

a, e, h, i, o, u, w, y → 0
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6

All zeros (vowels and 'h', 'w' and 'y') are then removed and sequences of the same number are reduced to one only (e.g. '333' is replaced with '3'). The final code is the original first letter and three numbers (longer codes are cut off, and shorter codes are extended with zeros). As examples, the Soundex code for 'peter' is 'p360', while the code for 'christen' is 'c623'. A major drawback of Soundex is that it keeps the first letter, thus any error or variation at the beginning of a name will result in a different Soundex code.
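The encoding just described can be captured in a few lines of Java. The sketch below implements the simplified variant given above (keep the first letter, map the remaining letters, drop the zeros, collapse repeated digits, pad or cut to a length of four); the class and method names are ours, not part of any existing library.

```java
// Minimal sketch of the simplified Soundex variant described above.
// Class and method names are illustrative, not from an existing library.
public class SimpleSoundex {

    private static char encode(char c) {
        switch (c) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default:  return '0';   // vowels and h, w, y
        }
    }

    public static String soundex(String name) {
        String s = name.toLowerCase();
        StringBuilder digits = new StringBuilder();
        // Convert every letter after the first into its digit, dropping zeros.
        for (int i = 1; i < s.length(); i++) {
            char d = encode(s.charAt(i));
            if (d != '0') {
                digits.append(d);
            }
        }
        // Reduce sequences of the same digit to one.
        StringBuilder collapsed = new StringBuilder();
        for (int i = 0; i < digits.length(); i++) {
            if (i == 0 || digits.charAt(i) != digits.charAt(i - 1)) {
                collapsed.append(digits.charAt(i));
            }
        }
        // Keep the first letter plus three digits, padding with zeros.
        String code = s.charAt(0) + collapsed.toString() + "000";
        return code.substring(0, 4);
    }

    public static void main(String[] args) {
        System.out.println(soundex("peter"));    // p360
        System.out.println(soundex("christen")); // c623
    }
}
```

The full Soundex specification adds a few more rules for special letter combinations, but this simplified form is enough to reproduce the two examples above.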
2.5.1.2 Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. The original author later produced a new version of the algorithm, which he named Double Metaphone. Contrary to the original algorithm, whose application is limited to English only, this version takes into account spelling peculiarities of a number of other languages. In 2009 Lawrence Philips released a third version, called Metaphone 3, which achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States, having been developed according to modern engineering standards against a test harness of prepared correct encodings.

2.5.2 Pattern matching

Pattern matching techniques are commonly used in approximate string matching [24, 25], which has widespread applications, from data linkage [22, 23] and duplicate detection [20, 21], information retrieval [26], correction of spelling errors [27] and approximate database joins, to bio- and health informatics [25]. These techniques can broadly be classified into edit distance and q-gram based techniques, plus several techniques specifically developed for name matching. A normalized similarity measure between 1.0 (strings are the same) and 0.0 (strings are totally different) is usually calculated. We will denote the length of a string s with |s|.

2.5.2.1 Levenshtein or Edit Distance

The Levenshtein distance [28] is defined to be the smallest number of edit operations (insertions, deletions and substitutions) required to change one string into another. In its basic form, each edit has cost 1. Using a dynamic programming algorithm [17], the distance (number of edits) between two strings s1 and s2 can be calculated in time O(|s1| × |s2|) using O(min(|s1|, |s2|)) space. The distance can be converted into a similarity measure (between 0.0 and 1.0) using

sim_ld(s1, s2) = 1 − dist_ld(s1, s2) / max(|s1|, |s2|)    (1)

with dist_ld(s1, s2) being the actual Levenshtein distance function, which returns 0 if the strings are the same or a positive number of edits if they are different. Because the distance between two strings is never smaller than the difference in their lengths, string pairs with a large difference in length can be filtered out quickly.

The distance between "Bob" and "Bob" is zero (0), because no edits are required to convert a string into itself. The edit distance between two strings is only zero if the strings are identical. The distance between "Brett" and "Brent" is one (1), because it requires the substitution of an 'n' for a 't'. The distance between "Brett" and "Bret" is one, requiring the deletion of one of the two 't' characters in "Brett". The sequence of edits must be minimal, but need not be unique. Note also that "Bret" can be converted to "Brett" with a single insertion of a 't' character. The distance between "Bob" and "bob" is also 1, as it requires the substitution of a lowercase 'b' for its uppercase equivalent 'B'. The Levenshtein distance is used to calculate the similarity of two strings. A standard Levenshtein distance is about ~40% accurate [19]; by standardizing the orthography of the strings this can be improved to a maximum of ~65% [3].
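As an illustration of equation (1), the following Java sketch computes the Levenshtein distance with the usual dynamic programming table and converts it into the normalized similarity. It is a minimal stand-alone example, not the project's actual Hybrid class.

```java
// Minimal sketch: Levenshtein distance and the normalized similarity of equation (1).
public class EditDistanceSimilarity {

    public static int levenshtein(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[s1.length()][s2.length()];
    }

    // sim_ld(s1, s2) = 1 - dist_ld(s1, s2) / max(|s1|, |s2|)
    public static double similarity(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        int maxLen = Math.max(s1.length(), s2.length());
        return 1.0 - (double) levenshtein(s1, s2) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Brett", "Brent")); // 1
        System.out.println(similarity("Brett", "Bret"));   // 0.8
    }
}
```

The full table shown here uses quadratic space; keeping only two rows of the table at a time gives the O(min(|s1|, |s2|)) space bound quoted above.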
2.5.2.2 Sorenson similarity

The Sorenson index, also known as Sorenson's similarity coefficient, is a statistic used for comparing the similarity of two samples. It was developed by the botanist Thorvald Sorenson and published in 1948. Sorenson's original formula was intended to be applied to presence/absence data, and is

QS = 2C / (A + B) = 2|A ∩ B| / (|A| + |B|)    (2)

where A and B are the number of species in samples A and B, respectively, and C is the number of species shared by the two samples; QS is the quotient of similarity and ranges from 0 to 1. This expression is easily extended to abundance instead of presence/absence of species. The Sorenson index is identical to Dice's coefficient, which always lies in the [0, 1] range.

2.5.2.3 Cosine Similarity

The cosine of the angle between two vectors can be derived from the Euclidean dot product formula:

a · b = |a| |b| cos θ    (3)

Given two vectors of attributes, A and B, the cosine similarity, cos θ, is represented using a dot product and magnitudes as

similarity = cos θ = (A · B) / (|A| |B|) = Σ(i=1..n) Ai × Bi / ( √(Σ(i=1..n) Ai²) × √(Σ(i=1..n) Bi²) )    (4)

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity. For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison.
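To make the two measures concrete, the Java sketch below computes the Sorenson-Dice coefficient of equation (2) over two token sets and the cosine similarity of equation (4) over two term-frequency maps. The helper names and the toy token sets are ours, not data or code from the OCR system.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of equations (2) and (4) applied to tokenized names.
public class SetSimilarity {

    // Sorenson-Dice coefficient: QS = 2|A ∩ B| / (|A| + |B|)
    public static double dice(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }

    // Cosine similarity over term-frequency vectors.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("nepal", "metal"));
        Set<String> b = new HashSet<>(Arrays.asList("nepal", "steel"));
        System.out.println(dice(a, b));     // 2*1 / (2+2) = 0.5

        Map<String, Integer> ta = new HashMap<>();
        ta.put("nepal", 1); ta.put("metal", 1);
        Map<String, Integer> tb = new HashMap<>();
        tb.put("nepal", 1); tb.put("steel", 1);
        System.out.println(cosine(ta, tb)); // 1 / (sqrt(2)*sqrt(2)) = 0.5
    }
}
```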
2.6 Summary

1. The background study focused on the uses of name checking systems, their effectiveness and usefulness.
2. It helped us decide how to design the system and which methodologies and programming tools should be used to develop it.
3. It also highlighted the existing systems, their merits and the flaws in them.

Chapter 3 REQUIREMENT ANALYSIS

3.1 Functional Requirements

1. A true reflection of lexical similarity
Strings with small differences should be recognized as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.
2. Robustness to changes of word order
Two strings which contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
3. Language independence
The system should work not only for English words, but also for Nepali words.
4. Output file format
The result should be stored in a file in comma-separated values (CSV) format.
5. Easy integration
The system should be easy to integrate with the existing system. The system should be easy to maintain by the maintenance personnel.

3.2 Non-Functional Requirements

3.2.1 Reliability

It is required that the system be available all the time. This can be achieved by hosting the system on a reliable server. Also, the system is built using Java, which adds more confidence to the system, as Java has a built-in memory management system.

3.2.2 Performance

The system would be used by numerous customers throughout the country, so it was required that the system take minimum time to produce output. The main concern was the time taken to query the database system to extract the relevant names and calculate the similarity scores. This time depends upon the type of processor used. The overall time required to obtain output after the submission of a name by the customer summed up to about one minute but, again, this time depends upon the number of tokens generated.

3.2.3 Accuracy

The system is proposed to be real time, so high accuracy must be maintained. This is ensured by using a morphological analyser and the Levenshtein algorithm in conjunction with the Kuhn-Munkres (Hungarian) algorithm and the Sorensen algorithm.

Chapter 4 METHODOLOGY

4.1 Introduction

Methodology is the analysis of the tasks to be done in order to obtain the desired output. An appropriate methodology generally results in a successful project, and vice versa. For this system, a number of methodologies were considered and the most efficient ones were used. This does not mean that one particular method was used; according to the situation, the most appropriate ones were used in combination. The model used here is an iterative model, i.e. in the beginning a small subset of the software requirements is developed and then, using the concept of redesign and redevelopment, its further versions are enhanced. This process is continued until the desired system is developed that produces the results mentioned in the system requirements. The methodology, once decided, is changed during the project if circumstances arise where the design shows flaws. Thus, based on the situation, appropriate methodologies are implemented. In our scenario the methodology comprises five different steps:
1. Building Base Dictionary
2. Possible Keyword Generation
3. Finding Possible Matches
4. Finding Duplicates
5. Finding Ranks

1. Building Base Dictionary
A base dictionary can be generated from the existing name database provided by the OCR. This can be done using a manual approach. The base dictionary used in our project consists of a file containing English words, a dictionary for transliteration, and a Nepali-to-English dictionary (provided by Madan Puraskar Pustakalaya). These dictionaries help us in tokenization and possible keyword generation.

2. Possible Keyword Generation
After tokenizing the given name, possible combinations of the keywords are generated using both English and Nepali words similar to them. After obtaining the base keywords, these keywords are transliterated and combined in every possible manner to form the probable similar keywords. These keywords are used to match against names in the OCR names database.

3. Finding Possible Matches
The possible names generated using the base keywords are matched against the OCR names database. For this, the names containing any of the keywords are extracted from the names database. Each of these names is checked against the proposed name. The aim is to collect as many records as possible for better results. These records can contain duplicates too.

4. Finding Duplicate Matches
The names extracted from the names database may occur more than once, so the names that appear more than once are removed. Duplication occurs when a name in the database contains two or more of the probable keywords.

5. Finding Ranks
The proposed name is assigned a value against each name extracted from the names database. The value signifies the extent of matching.
For calculating the match, we used:
• the Levenshtein algorithm, to calculate the similarity between tokens of the proposed name and a name extracted from the database;
• the Kuhn-Munkres algorithm (also known as the Hungarian method), to find the optimal assignment of similarity weights between the tokens of the two strings in comparison, maximizing the sum of similarity weights;
• Sorenson's similarity coefficient, to find the single-value similarity score (between 0 and 1) from the result obtained through the Hungarian method.

4.2 System Design

This section gives a detailed review of the design on which the developed system is implemented. It includes:
1. Flow diagram
2. Deployment diagram
3. System architecture
4. Detailed class diagram

4.2.1 Flow Diagram

Figure 1 Flow Chart

4.2.2 Deployment Diagram

Figure 2 Deployment Diagram

The application is built around a client/server architecture. Multiple client machines can interact with the server simultaneously. Clients interact with the system through the OCR's interactive website, while the server serves the clients' requests and does the processing in the backend.

4.2.3 System Architecture

Figure 3 System Architecture

As shown in Figure 3, the user input passes through query processing into the preprocessing engine, then through a dictionary-backed English-Nepali translator and transliterator and a keyword generator; an index processor builds indexed records, a comparator (which reuses the preprocessing engine) compares them, and a ranking engine produces the result visualization, all working against the database.

4.2.3.1 Preprocessing Engine

Figure 4 Preprocessing Engine

The preprocessing engine performs five different processes on the user input (Figure 4): downcasting, transformation, stopword removal (using a pool of stopwords), tokenization and stemming.

1. Downcasting
Downcasting, also referred to as type refinement, is the act of converting text from uppercase letters to lowercase. It is done to make sure that there is no conflict between company names merely because of differences in the case of letters.

2. Transformation
Transformation is the conversion of words from British English to their American English equivalents. Transformation is done to avoid the generation of unwanted or conflicting keywords. Our dictionary consists of around 130 commonly used words that are converted, when found, from their British English form to their American English form.

3. Stopword Removal
Stop word removal is the process of removing certain predefined stop words from the string. We use this process to remove the words that are considered similar/unimportant according to the Office of the Company Registrar. Words such as Shree, New, Modern, Industry, Udyog, Company, etc. are removed.

4. Tokenization
Tokenization is the process of breaking up a string into tokens to be indexed, using predefined dictionaries or by analyzing the whitespace. These dictionaries can be a pool of predefined words or a bilingual English-Nepali dictionary. Proper handling of strings, numbers and symbols is also important. For instance, tokenizing "nepal metals" outputs "nepal" and "metals".

5. Stemming
Stemming is the process of reducing a word to a root or simpler form, for example reducing plural words to their singular forms. Stemming is often used in text processing applications. There are many different approaches to stemming, each with its own design goals. Some are aggressive, reducing words to the smallest root possible. Here, stemming is done with the help of a morphological analyser. Morphological analysis is done in order to produce English dictionary-based words. For example, words like "services" and "metals" are reduced to the simpler singular forms "service" and "metal". We used stemming to obtain the dictionary-based root words. Using root words, we simplified the matching process.
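The five steps above can be chained into a single pipeline. The following Java sketch shows one possible arrangement; the transformation map, stop-word list and plural-stripping stemmer are deliberately tiny illustrations (the actual system uses a larger British-to-American word list, the stop words from the OCR directives, and a Snowball/morphology-based stemmer), and none of the class or method names come from the project code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the preprocessing pipeline: downcast, transform,
// remove stop words, tokenize, stem. The word lists are tiny examples only.
public class PreprocessingSketch {

    private static final Map<String, String> BRITISH_TO_AMERICAN = new HashMap<>();
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "pvt", "ltd", "private", "limited", "company", "co",
            "shree", "new", "modern", "industry", "industries", "udyog"));
    static {
        BRITISH_TO_AMERICAN.put("centre", "center");
        BRITISH_TO_AMERICAN.put("colour", "color");
    }

    public static List<String> preprocess(String name) {
        // Step 1: downcasting.
        String text = name.toLowerCase();
        // Step 4: tokenization on whitespace, hyphens and dots.
        String[] rawTokens = text.split("[\\s\\-.]+");
        List<String> tokens = new ArrayList<>();
        for (String token : rawTokens) {
            if (token.isEmpty()) continue;
            // Step 2: transformation (British -> American spelling), per token.
            if (BRITISH_TO_AMERICAN.containsKey(token)) {
                token = BRITISH_TO_AMERICAN.get(token);
            }
            // Step 3: stopword removal.
            if (STOP_WORDS.contains(token)) continue;
            // Step 5: stemming (naive plural stripping as a stand-in for the
            // morphological analyser used in the real system).
            if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [nepal, metal] for the example used in this section.
        System.out.println(preprocess("Nepal Metals Industries Pvt. Ltd."));
    }
}
```

Running the sketch on "Nepal Metals Industries Pvt. Ltd." prints [nepal, metal], matching the worked example in Appendix C.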
4.2.3.2 Translation and Transliteration

Translation is the conversion of the meaning of a source-language text by means of an equivalent target-language text. In this process, the equivalent Nepali text of the English words is obtained by mapping each matched keyword against the English dictionary. The matched words are then mapped with the English-Nepali dictionary provided by Madan Puraskar Pustakalaya. Unmatched words are simply placed alongside the translated tokens. For example, the words "nepal" and "metal" are mapped onto the dictionary to get the words "नेपाल" and "धातु".

Transliteration is the conversion of a text from one script to another. To transliterate a Nepali word into an English word, we used dictionary mapping to map individual Nepali syllables to English letters. In the translation example above, the words "नेपाल" and "धातु" are transliterated to "nepal" and "dhatu" and then added to the pool of keywords for further processing.

4.2.3.3 Possible Keyword Generation

Keywords are generated by combining the keywords obtained from stemming and from transliteration. The generated keywords are then used to make a list of company names from the database that contain those keywords in their names. The company names are listed according to the presence of those keywords. Each company name in the list is again processed by the preprocessing engine, and the stemmed keywords are extracted for further comparison; these are kept as an indexed record for each company name taken from the database.

4.2.3.4 Comparison

Comparison is done between the tokens obtained from the user's input company name and the tokens generated from the company names extracted from the database on the basis of the user's input keywords. The Levenshtein algorithm and the Kuhn-Munkres algorithm (Hungarian method) were used in the comparison of strings. The similarity is calculated in three steps:
1. Partition each name into a list of tokens.
2. Eliminate the common tokens.
3. Compute the similarity between the dissimilar tokens by using a string edit-distance algorithm.

The first method uses an edit-distance string matching algorithm: Levenshtein. The string edit distance is the total cost of transforming one string into another using a set of edit rules, each of which has an associated cost. The Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion and substitution. In the simplest version, substitutions cost about two units, except when the source and target are identical, in which case the cost is zero. Insertions and deletions cost half as much as substitutions.

Application of the Hungarian Algorithm for Optimization

The result of the Levenshtein method is used to build a weighted bipartite graph, on which the Hungarian algorithm is applied. A related classical problem on matching in bipartite graphs is the assignment problem, which is the quest to find the optimal assignment of workers to jobs that maximizes the sum of ratings, given all non-negative ratings Cost[i, j] of each worker i for each job j. All relation scores are in the [0, 1] range, which means that if the score gets the maximum value (equal to 1) then the two strings are absolutely similar.
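As a hedged illustration of this step, the sketch below builds the weight matrix of token-to-token Levenshtein similarities and finds the assignment that maximizes the total weight. For brevity it tries all assignments of the smaller token list recursively instead of running the Kuhn-Munkres algorithm proper (which the actual system uses through its HungarianAlgorithmEdu class); for the short token lists produced by company names the result is the same.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: maximum-weight assignment between two token lists, with
// Levenshtein-based similarity as the weight. The recursive search is a
// stand-in for the Kuhn-Munkres (Hungarian) algorithm used in the real system.
public class TokenAssignmentSketch {

    static double similarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return 1.0 - (double) d[a.length()][b.length()] / Math.max(a.length(), b.length());
    }

    // Best total weight, assigning each token of 'small' to a distinct token
    // of 'large'. Assumes small.size() <= large.size().
    static double bestAssignment(List<String> small, List<String> large, boolean[] used, int idx) {
        if (idx == small.size()) return 0.0;
        double best = 0.0;
        for (int j = 0; j < large.size(); j++) {
            if (used[j]) continue;
            used[j] = true;
            double score = similarity(small.get(idx), large.get(j))
                    + bestAssignment(small, large, used, idx + 1);
            used[j] = false;
            best = Math.max(best, score);
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> proposed = Arrays.asList("nepal", "metal");
        List<String> existing = Arrays.asList("nepal", "steel");
        double sumOfWeights = bestAssignment(proposed, existing, new boolean[existing.size()], 0);
        System.out.println("Sum of similarity weights: " + sumOfWeights);
        // This sum is then normalized with Sorenson's coefficient, as described next.
    }
}
```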
Application of Sorenson's Similarity Coefficient

The result of the Hungarian method, which is the sum of similarity weights, is then applied to the Sorenson index to find the final single-value similarity score between the strings being compared. This final score (whose value lies between 0 and 1) is then converted into a percentage by multiplying by 100.

4.2.3.5 Ranking

The result of each and every permutation is taken into consideration and the maximum matched percentage score is chosen. Then a list of company names is generated in order of the percentage similarity score.

4.2.4 Detailed Class Diagram

Figure 5 Detailed Class Diagram

The system is implemented using the object-oriented methodology. We have not used a framework of any kind. Some of the core classes of the system, along with their associations, are shown.

Comparison System
This subsystem compares the result received from the preprocessing engine for the user input against the list obtained from the database.
1. HungarianAlgorithmEdu Class
In this class we use the Hungarian algorithm to compute the highest possible matching score between the tokens from both inputs. The input to this class is the weight matrix obtained from the Hybrid class, and the output is the similarity score. The hgAlgorithm() method performs the Hungarian algorithm and the final similarity score is returned by the getScore() method.
2. Hybrid Class
In this class we use the Levenshtein distance algorithm to calculate the edit distance. This class calculates the edit distance between two tokens and finally gives the similarity score between them. The ComputeDistance() method computes the edit distance and GetSimilarity() returns the similarity between tokens.
3. Permutation Class
In this class we permute the user-input tokens and the tokens obtained from their transliteration among themselves. The permute() method performs the permutation operation.
4. MatchsMaker Class
This is the main class of the comparison system, which calls each of its components to perform the comparison and returns the output as a similarity percentage. GetScore() returns the similarity percentage and Initialize() initializes the necessary components.

Database System
1. DatabaseCredentials Class
This class is used to store database credentials. Those credentials include the username, password and connection path. This class can also be used as a Java Bean through its set/get methods.
2. DatabaseHandler
The DatabaseHandler class is used to initiate the database connection and to declare the database type.
3. CookSQL Class
This class is used to prepare SQL statements.
4. CompanyNameEnglish Class
This class is the core of the package. It contains the methods for individual record manipulation and result-set retrieval.
5. ConnectDatabase
This class is the bridge between the database, the main interface and the other classes. It is used to hide the details of the underlying database implementation.

Preprocessing Engine
This engine contains the components used to downcast, clean, transform, remove stop words, stem and tokenize.
1. SpaceProcessor Class
This class is used to tokenize a company name based on spaces and hyphens (-) and to rejoin the individual tokens if necessary. getSplittedText(): this method is used to split the company name into tokens. joinSplittedText(): this method joins tokens with spaces to regenerate the company name.
2. StopwordRemover Class
This class is used to remove the stop words as defined by the OCR directives.
3. Stemmer Class
The Stemmer class contains methods to generate root words. Stemming is achieved using the Snowball stemmer and morphological analysis.
4. SymbolProcessor
This class is used to clean illegal symbols from names.

4.3 Project Tools

Programming Language: Java SE 7
Database: MySQL Server Version 5.1.41
Testing: JUnit testing
Drawings: MS Paint, MS Visio, ArgoUML, Adobe Photoshop
Documentation: MS Word/Excel/PowerPoint
Platform: Windows
IDE: Eclipse Indigo

4.4 Eclipse as Programming IDE

Eclipse was used as the IDE for project development. Eclipse is a multi-language software development platform comprising an IDE and a plug-in system to extend it. It is written primarily in Java and is used to develop applications in this language and, by means of the various plug-ins, in other languages as well—C/C++, COBOL, Python, Perl, PHP and more. The initial codebase originated from Visual Age. In its default form it is meant for Java developers, consisting of the Java Development Tools (JDT). Users can extend its capabilities by installing plug-ins written for the Eclipse software framework, such as development toolkits for other programming languages, and can write and contribute their own plug-in modules. Language packs provide translations into over a dozen natural languages. Released under the terms of the Eclipse Public License, Eclipse is free and open source software.

4.5 MySQL as Database System

MySQL was used as the database server. It is a relational database management system (RDBMS) with more than 11 million installations. The program runs as a server providing multi-user access to a number of databases. The project's source code is available under the terms of the GNU General Public License, as well as under a variety of proprietary agreements.

Chapter 5 EXPERIMENTAL SETUP

Hardware configurations used for testing:

Configuration 1
Computer Model: DELL 5110
Physical Memory (RAM): 4.00 GB, DDR2
Processor: Intel(R) Core(TM) i5-2450M CPU, 2.5 GHz
System Type: 64-bit Operating System, x64-based processor
Cache Size: 4096 KB
OS: Windows 8 Enterprise
Database: MySQL Server Version 5.5.24
Database with 111,161 records of company names.

Configuration 2
Computer Model: Acer Aspire E1-531
Physical Memory (RAM): 4.00 GB, DDR2
Processor: Intel B960 Dual Core processor (2.2 GHz, 2 MB L3 cache)
System Type: 64-bit Operating System, x64-based processor
Cache Size: 4096 KB
OS: Windows 8 Enterprise
Database: MySQL Server Version 5.5.24
Database with 111,161 records of company names.

Chapter 6 OUTPUT

1. Output obtained by using the input "durga enterprises"
Figure 6 Example - I
2. Output obtained by using the input "hamro lagani"
Figure 7 Example - II
3. Output obtained by using the input "jagadamba steels"
Figure 8 Example - III
4. Output obtained by using the input "nawayug vidhya niketan kanchanpur"
Figure 9 Example - IV
5. Output obtained by using the input "nepal investment company"
Figure 10 Example - V
6. Output obtained by using the input "nepal one travels and tour"
Figure 11 Example - VI
7. Output obtained by using the input "new age business consultant"
Figure 12 Example - VII

Chapter 7 RESULT AND ANALYSIS

To obtain the similarity scores, we tried various similarity-measuring algorithms. The Levenshtein algorithm and the Hungarian algorithm, together with the Sorensen algorithm, ultimately fitted our need. We used various processes before applying these algorithms, which proved to be fruitful. The scores obtained are saved in a file with a .csv extension. Stemming was used to obtain dictionary-based root words.
Tokenization and transliteration were used to obtain the tokens later used in the comparison process. We used translation and transliteration to cope with Nepali words. The accuracy was assessed by trying various names likely to occur in practice. The computation time depends upon the number of tokens to be compared and, for now, the system is single-threaded.

Figure 13 Computation Time without Transformation (number of tokens vs. time to compute, in seconds, on the i5 and Dual Core CPUs, for the inputs "Durga Enterprises", "jagadamba steels pvt. ltd", "New Age Business Consultant Limited" and "Nepal One travels and tours Ltd.")

Figure 13 shows the relation between the number of tokens and the time to compute the similarity scores on different generations of Intel processors. The computation time is higher on the lower-generation processor and lower on the higher-generation processor. Furthermore, the more tokens there are, the greater the computation time. This result is obtained without the use of the transformation process.

Figure 14 Computation Time with Transformation (number of tokens vs. time to compute, in seconds, on the i5 and Dual Core CPUs, for the same four inputs)

Figure 14 shows the result obtained when the transformation process is used. It takes more time with transformation, but it yields better results. By using appropriate hardware resources, we can reduce this time to within the constraint.

For the comparison process, we initially used the cosine similarity algorithm, but it did not yield promising results. The cosine similarity algorithm does not consider the relative position of the letters in a string; it only considers how often each letter occurs. Thus a string with a different spelling but the same letter counts is considered similar. This severely limited its use. The Levenshtein algorithm proved useful in our project. It considers the position of the letters in a string, which is necessary for our system. This algorithm, along with the Hungarian algorithm, gave satisfactory results. To obtain the final score we used the Sorensen coefficient. Its value lies in the range [0, 1]. Multiplying this coefficient by 100 gave us the final percentage score.
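The weakness described above can be reproduced with a few lines of Java: for two tokens that are anagrams of each other, a character-frequency cosine similarity reports a perfect match while the Levenshtein-based similarity does not. This is an illustrative sketch with inputs of our own choosing, not code from the system.

```java
// Illustrative comparison: character-frequency cosine vs. Levenshtein similarity
// for an anagram pair, showing why cosine similarity was rejected.
public class CosineVsEditDistance {

    static double cosineOnCharCounts(String a, String b) {
        int[] fa = new int[26], fb = new int[26];
        for (char c : a.toCharArray()) fa[c - 'a']++;
        for (char c : b.toCharArray()) fb[c - 'a']++;
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < 26; i++) {
            dot += fa[i] * fb[i];
            na += fa[i] * fa[i];
            nb += fb[i] * fb[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static double levenshteinSimilarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return 1.0 - (double) d[a.length()][b.length()] / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        // "lager" and "regal" contain exactly the same letters.
        System.out.println(cosineOnCharCounts("lager", "regal"));    // 1.0
        System.out.println(levenshteinSimilarity("lager", "regal")); // 0.2
    }
}
```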
Chapter 8
CONCLUSION AND FURTHER ENHANCEMENT

8.1 Conclusion
With all the effort invested in this project, there are reasons to believe that by the end of this semester the project will be in much better shape and considerably closer to actual acceptance than it was. We summarize the progress with respect to the main objectives of the project, namely accuracy and speed.

Accuracy: Accuracy is the main obstacle for the project. We have continually used and tested many different algorithms for similarity comparison, and have obtained satisfactory results using the Levenshtein distance and the Hungarian method in conjunction with the Sorensen coefficient. We are trying to improve the results further by employing other techniques, such as phonetic matching (Double Metaphone) and the transformation function.

Speed: Speed is also a challenging factor for this project. The requirement for shorter processing time has made it difficult to balance accuracy against speed. However, by using the processing capability of MySQL we have been able to improve the speed, resulting in shorter waiting times for users. The use of adequate data structures has also been a prominent advantage. One of the apparent major obstacles to gaining acceptance for this project lies in the standards of the Office of Company Registrar.

8.2 Limitations
Our system has the following limitations:
1. The system cannot process names that have numbers as a prefix or suffix.
2. The preprocessing engine has several limitations. Stemming sometimes produces incorrect results when the input is a Nepali word; for example, the Nepali word spat (स्पात, "steel" in English) may be reduced to "spit" by morphology-based stemming, and in such cases the similarity match is weakened.
3. The English-Nepali dictionary does not contain enough words; for many English words no Nepali equivalent is available.
4. The transformation process increases computation time.
5. Synonyms are not considered by the system.
6. Strings such as "papermill" and "paper mill", though similar, are considered different because the space produces two tokens. Although both strings have the same meaning, the system does not treat them as similar.

8.3 Further Enhancement
There is considerable scope to enhance this project in the future. The similarity-checking algorithm has the greatest potential for enhancement: if phonetic similarity measures are incorporated, accuracy can be greatly improved. Implementing faster searching methods can greatly enhance the performance of the system. Using a taxonomy to classify the tokens, combined with the similarity measures, can help validate proposed names more accurately: a taxonomy can capture the context of names and thus improve the validation process. Furthermore, using weighting measures to assign appropriate weights to the most common words might help increase the accuracy of the similarity score.

REFERENCE
[1] Office of Company Registrar, Nepal. Retrieved from: www.ocr.gov.np. Date retrieved: 07/04/2013
[2] Companies House. Retrieved from: http://wck2.companieshouse.gov.uk//wcframe?name=accessCompanyInfo. Date retrieved: 04/07/2013
[3] Companies and Intellectual Property Commission. Retrieved from: http://www.cipc.co.za/. Date retrieved: 04/07/2013
[4] Anne Kao and Stephen R. Poteet (Eds.). Natural Language Processing and Text Mining. Springer, 2006.
[5] Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online Applications. In Ruslan Mitkov, editor. John Benjamins Publishing Company, 2002.
[6] Ronan Collobert, Jason Weston, Léon Bottou, et al. Natural Language Processing (Almost) from Scratch. Editor: Michael Collins. NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540.
[7] Prakash M. Nadkarni, Lucila Ohno-Machado, Wendy W. Chapman. Natural language processing: an introduction. Available from: group.bmj.com
[8] Chris Manning, Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999. Available from: http://nlp.stanford.edu/fsnlp/
[9] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI 2001. Available from: http://www.ebooksdirectory.com/details.php?ebook=6774
[10] Daniël de Kok, Harm Brouwer. Natural Language Processing for the Working Programmer. 2011. Available from: http://nlpwp.org/book/
[11] Aliseda, R. van Glabbeek, D. Westerstahl. Computing Natural Language. CSLI, 1998. Available from: http://www.e-booksdirectory.com/details.php?ebook=3940
[12] Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009. Available from: http://www.ebooksdirectory.com/details.php?ebook=7184
[13] Rob Malouf, Miles Osborne. An Introduction to Stochastic Attribute-Value Grammars. ESSLLI 2001. Available from: http://www.e-booksdirectory.com/details.php?ebook=6860
[14] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI 2001. Available from: http://www.e-booksdirectory.com/details.php?ebook=6774
[15] Grosz, B.J., Jones, K.S., Webber, B.L. Readings in Natural Language Processing. Kaufman Publishers Inc., Los Altos, CA. Available from: http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=6537037
[16] Reilly, Ronan G. (Ed.); Sharkey, Noel E. (Ed.). Connectionist approaches to natural language processing. Hillsdale, NJ, England: Lawrence Erlbaum Associates, Inc., 1992. Available from: http://psycnet.apa.org/psycinfo/1992-98664-000
[17] C. Friedman and R. Sideli. Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25:486–509, 1992.
[18] F. Patman and P. Thompson. Names: A new frontier in text mining. In ISI-2003, Springer LNCS 2665, pages 27–38.
[19] Simon J. Greenhill. Computational Linguistics, Volume 37, Issue 4, December 2011, pages 689–698.
[20] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of ACM SIGKDD, pages 39–48, Washington DC, 2003.
[21] C. L. Borgman and S. L. Siegfried. Getty's Synoname and its cousins: A survey of applications of personal name matching algorithms. Journal of the American Society for Information Science, 43(7):459–476, 1992.
[23] P. Christen, T. Churches, and M. Hegland. Febrl – a parallel open source data linkage system. In PAKDD, Springer LNAI 3056, pages 638–647, Sydney, 2004.
[25] P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In F. Guillet and H. Hamilton, editors, Quality Measures in Data Mining, Studies in Computational Intelligence. Springer, 2006.
[26] P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381–402, 1980.
[25] P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software – Practice and Experience, 26(12):1439–1458, 1996.
[27] R. Gong and T. K. Chan. Syllable alignment: A novel model for phonetic string search. IEICE Transactions on Information and Systems, E89-D(1):332–339, 2006.
[28] F. J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176, 1964.
[29] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.

APPENDIX A: Gantt Chart
Figure 15 Gantt Chart

APPENDIX B: Use Case
Figure 16 Use Case Diagram

APPENDIX C: Preprocessing Detail
Example for user input: "Nepal Metals Industries"
Preprocessing Engine steps (step | result | methodology):
Downcasting | nepal metals industries | Conversion of the input to lowercase.
Transformation | Not applied in this example | Conversion of British English words to American English.
Stopword Removal | nepal metals | Removal of stop words (Company, Industry, and Pvt. Ltd.) as mentioned in the draft.
Tokenization | [nepal, metals] | Extraction of tokens.
Stemming | [nepal, metal] | Reduction to root words.
Translation | [nepal, metal] to [नेपाल, धातु] | Conversion of tokens from English to Nepali.
Transliteration | [नेपाल, धातु] to [nepal, dhatu] | Conversion of the Nepali Unicode tokens to Roman script.
Generated Keywords (transliterated tokens + stemmed tokens): [nepal, metal, dhatu]
Query to the MySQL database, resulting in a list of company names. A simplified code sketch of these preprocessing steps is given below.
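The following Java sketch mirrors the steps walked through above. The class name, the stop-word list, and the naive trailing-'s' stemming rule are illustrative assumptions only, and the translation and transliteration steps are omitted.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rough sketch of the preprocessing steps: downcasting, stop-word removal,
// tokenization and a naive stemming pass (translation/transliteration omitted).
public class PreprocessSketch {

    private static final List<String> STOP_WORDS =
            Arrays.asList("company", "industries", "industry", "pvt.ltd.", "pvt", "ltd");

    public static List<String> keywords(String name) {
        // Downcasting: convert the input to lowercase.
        String lowered = name.trim().toLowerCase();

        // Tokenization with stop-word removal.
        List<String> tokens = new ArrayList<>();
        for (String token : lowered.split("\\s+")) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }

        // Stemming: naive reduction to root words (trailing 's' only, for illustration).
        List<String> stemmed = new ArrayList<>();
        for (String token : tokens) {
            stemmed.add(token.endsWith("s") && token.length() > 3
                    ? token.substring(0, token.length() - 1) : token);
        }
        return stemmed;
    }

    public static void main(String[] args) {
        System.out.println(keywords("Nepal Metals Industries")); // [nepal, metal]
    }
}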
Preprocessing of a company name extracted from the query (randomly chosen example): "Royal Metal Nepal Pvt.Ltd."
Downcasting | royal metal nepal pvt.ltd. | Conversion of the input to lowercase.
Transformation | Not applied in this example | Conversion of British English words to American English.
Stopword Removal | royal metal nepal | Removal of stop words (Company, Industry, and Pvt. Ltd.) as mentioned in the draft.
Tokenization | [royal, metal, nepal] | Extraction of tokens.
Stemming | [royal, metal, nepal] | Reduction to root words.
Database Generated Keywords: [royal, metal, nepal]
Comparison-1: User Input Generated Keywords against Database Generated Keywords.

Preprocessing of a company name extracted from the query (randomly chosen example): "Nepal Dhatu Industries"
Downcasting | nepal dhatu industries | Conversion of the input to lowercase.
Transformation | Not applied in this example | Conversion of British English words to American English.
Stopword Removal | nepal dhatu | Removal of stop words (Company, Industry, and Pvt. Ltd.) as mentioned in the draft.
Tokenization | [nepal, dhatu] | Extraction of tokens.
Stemming | [nepal, dhatu] | Reduction to root words.
Database Generated Keywords: [nepal, dhatu]
Comparison-2: User Input Generated Keywords against Database Generated Keywords.

APPENDIX D: Comparison Detail
Figure 17 Comparison I (Part A)
Figure 18 Comparison I (Part B)
Figure 19 Comparison II (Part A)
Figure 20 Comparison II (Part B)

APPENDIX E: Output Screenshot
Figure 21 Output Screenshot

APPENDIX F: Data Flow Diagram
Figure 22 Data Flow Diagram

APPENDIX G: Theory
Hungarian Algorithm
The Hungarian method assigns jobs by a one-to-one matching that identifies the lowest-cost solution. Each job must be assigned to exactly one machine. It is assumed that every machine is capable of handling every job, and that the costs or values associated with each assignment combination are known and fixed. The number of rows and columns must be the same. The algorithm is as follows; a simplified code sketch of the underlying assignment problem follows the list.
1. Arrange the information in matrix form with String 1 on the left and String 2 along the top, with the Levenshtein distance for each pair in the middle.
2. Ensure that the matrix is square by adding dummy rows/columns if necessary. Conventionally, each element in a dummy row/column equals the largest number in the matrix.
3. Reduce the rows by subtracting the minimum value of each row from that row.
4. Reduce the columns by subtracting the minimum value of each column from that column.
5. Cover the zero elements with the minimum number of lines needed to cover them all. (If the number of lines equals the number of rows, go to step 9.)
6. Add the minimum uncovered element to every covered element; if an element is covered twice, add the minimum element to it twice.
7. Subtract the minimum element from every element in the matrix.
8. Cover the zero elements again. If the number of lines covering the zero elements is not equal to the number of rows, return to step 6.
9. Select a matching by choosing a set of zeros such that each row and each column has only one selected.
10. Apply the matching to the original matrix, disregarding dummy rows.
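The reduction steps above compute an optimal one-to-one assignment. For the small token counts that occur in company names, the same optimum can also be found by a brute-force search over permutations; the sketch below illustrates that assignment problem (it is not an implementation of the Hungarian reductions themselves), using a hypothetical cost matrix of pairwise Levenshtein distances.

// Illustration of the assignment problem the Hungarian method solves: find the
// one-to-one matching of rows (tokens of string 1) to columns (tokens of string 2)
// with the lowest total cost. For the few tokens in a company name, a brute-force
// search over permutations is sufficient; the Hungarian reduction steps above
// compute the same optimum more efficiently.
public class AssignmentSketch {

    /** Returns the minimum total cost of assigning each row to a distinct column. */
    public static int minAssignmentCost(int[][] cost) {
        return search(cost, 0, new boolean[cost.length]);
    }

    private static int search(int[][] cost, int row, boolean[] usedColumn) {
        if (row == cost.length) {
            return 0;
        }
        int best = Integer.MAX_VALUE;
        for (int col = 0; col < cost.length; col++) {
            if (!usedColumn[col]) {
                usedColumn[col] = true;
                best = Math.min(best, cost[row][col] + search(cost, row + 1, usedColumn));
                usedColumn[col] = false;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical square matrix of pairwise Levenshtein distances between tokens.
        int[][] cost = {
                {0, 4, 5},
                {4, 0, 5},
                {5, 5, 0}
        };
        System.out.println(minAssignmentCost(cost)); // 0 (identical token sets)
    }
}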
Procedure of the Metaphone Phonetic Algorithm
Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code. The following list summarizes most of the rules in the original implementation; a small partial code illustration follows the list.
1. Drop duplicate adjacent letters, except for C.
2. If the word begins with 'KN', 'GN', 'PN', 'AE' or 'WR', drop the first letter.
3. Drop 'B' if it comes after 'M' at the end of the word.
4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if followed by 'N' or 'NED' and it is at the end.
7. 'G' transforms to 'J' if it comes before 'I', 'E', or 'Y' and is not part of 'GG'. Otherwise, 'G' transforms to 'K'.
8. Drop 'H' if it comes after a vowel and not before a vowel.
9. 'CK' transforms to 'K'.
10. 'PH' transforms to 'F'.
11. 'Q' transforms to 'K'.
12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
14. 'V' transforms to 'F'.
15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
17. Drop 'Y' if not followed by a vowel.
18. 'Z' transforms to 'S'.
19. Drop all vowels unless they are at the beginning.
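As a small illustration only, the following Java sketch applies a handful of the rules above (duplicate-letter removal and a few single substitutions). The class name is a hypothetical example and this is not a complete Metaphone implementation.

// Partial illustration of a few of the Metaphone rules listed above
// (duplicate removal, CK -> K, PH -> F, Q -> K, V -> F, Z -> S).
public class MetaphoneSketch {

    public static String roughCode(String word) {
        String s = word.toUpperCase();
        s = s.replace("PH", "F");   // rule 10
        s = s.replace("CK", "K");   // rule 9
        s = s.replace("Q", "K");    // rule 11
        s = s.replace("V", "F");    // rule 14
        s = s.replace("Z", "S");    // rule 18

        // Rule 1: drop duplicate adjacent letters, except for C.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i > 0 && c == s.charAt(i - 1) && c != 'C') {
                continue;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(roughCode("Phillip"));  // FILIP
    }
}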