TRIBHUVAN UNIVERSITY INSTITUTE OF ENGINEERING

TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
A
FINAL YEAR PROJECT REPORT
ON
Name Conflict Resolution for Company Registration
By:
Gaurav Kumar Goyal (16214)
Janardan Chaudhary (16216)
Nimesh Mishra (16221)
Sanat Maharjan (16230)
A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS
AND COMPUTER ENGINEERING IN PARTIAL FULFILMENT OF
THE REQUIREMENT FOR THE BACHELOR’S DEGREE IN
COMPUTER ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
AUGUST, 2013
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
The undersigned certify that they have read, and recommend to the Institute of Engineering
for final submission and presentation, the project entitled "Name Conflict Resolution for
Company Registration" submitted by Gaurav Kumar Goyal, Janardan Chaudhary, Nimesh
Mishra and Sanat Maharjan in partial fulfilment of the requirements for the Bachelor’s
degree in Computer Engineering.
_________________________________________________
Supervisor, Prof. Dr. Shashidhar Ram Joshi
Department of Electronics and Computer Engineering
_________________________________________________
Co-Supervisor, Er. Sansar Jung Dewan
IT Officer, Office of Company Registrar (OCR)
__________________________________________________
Internal Examiner, Baburam Dawadi
Department of Electronics and Computer Engineering
__________________________________________________
External Examiner, Anjesh Tuladhar
COO, Young Innovations Pvt. Ltd.
DATE OF APPROVAL: 25 Aug. 2013
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purposes may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that recognition will be
given to the author of this report and to the Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
project report. Copying, publication, or other use of this report for financial gain without the
approval of the Department of Electronics and Computer Engineering, Pulchowk Campus,
Institute of Engineering and the author’s written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Arun Timilsina, PhD/ Professor
Head of Department
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal
ACKNOWLEDGEMENT
First of all, we would like to express our sincere gratitude to the Department of Electronics
and Computer Engineering, Pulchowk Campus, for including the final year major project as
part of the syllabus for the final year of B.E. in Computer Engineering. We would like to
extend our gratitude to Dr. Arun Timilsina, Head of Department, Electronics and Computer
Engineering, for assisting us in our project.
We would like to express our gratitude to Prof. Dr. Shashidhar Ram Joshi for being our
project supervisor.
We would also like to thank Dr. Aman Shakya for his support and assistance. We are deeply
indebted to Er. Sansar Jung Dewan of the Office of Company Registrar, and to the Office of
Company Registrar itself, for giving us the opportunity to do a project with such enormous
scope.
We would also like to express our sincere thanks to Mr. Bal Krishna Bal, Assistant Professor,
Department of Electronics and Computer Engineering, Kathmandu University, for his help
and support.
Last but not least, we would like to thank our friends and classmates for their help and
valuable suggestions.
ABSTRACT
Natural language processing is one of the most researched fields. One of its applications is
determining the similarity of sentences. Naming conflict resolution is about comparing
words. Many systems have been developed for this purpose and are widely used.
In the context of Nepal, naming conflicts during the registration of a company are currently
resolved manually (by a human). However, there is a need to automate the process.
Automation requires natural language processing, translation between languages, and
transliteration between languages. The Office of Company Registrar (OCR) specifies several
constraints for the check, and these constraints must be considered while comparing words.
The words need to be tokenized and stemmed before they can be processed further.
Keywords:
OCR, Morphological Analysis, Similarity Matching, Natural Language Processing.
TABLE OF CONTENTS
COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
TABLE OF FIGURES
Chapter 1
INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope of the work
Chapter 2
LITERATURE REVIEW
2.1 Introduction
2.2 Common processes used in text similarity
2.2.1 Downcasting
2.2.2 Transformation
2.2.3 Stopword Removal
2.2.4 Tokenization
2.2.5 Stemming
2.3 Existing Name checking Systems
2.4 Criteria defined by OCR
2.5 Matching Techniques
2.5.1 Phonetic encoding
2.5.1.1 Soundex
2.5.1.2 Metaphone
2.5.2 Pattern matching
2.5.2.1 Levenshtein or Edit Distance
2.5.2.2 Sorenson similarity
2.5.2.3 Cosine Similarity
2.6 Summary
Chapter 3
REQUIREMENT ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.2.1 Reliability
3.2.2 Performance
3.2.3 Accuracy
Chapter 4
METHODOLOGY
4.1 Introduction
4.2 System Design
4.2.1 Flow Diagram
4.2.2 Deployment Diagram
4.2.3 System Architecture
4.2.3.1 Preprocessing Engine
4.2.3.2 Translation and Transliteration
4.2.3.3 Possible Keyword Generation
4.2.3.5 Ranking
4.2.4 Detailed Class Diagram
4.3 Project Tools
4.4 Eclipse as Programming IDE
4.5 MySQL as Database System
Chapter 5
EXPERIMENTAL SETUP
Chapter 6
OUTPUT
Chapter 7
RESULT AND ANALYSIS
Chapter 8
CONCLUSION AND FURTHER ENHANCEMENT
7.1 Conclusion
7.2 Limitations
7.3 Further Enhancement
REFERENCE
APPENDIX A: Gantt chart
APPENDIX B: Use Case
APPENDIX C: Preprocessing Detail Example
APPENDIX D: Comparison Detail
APPENDIX E: Output Screenshot
APPENDIX F: Data Flow Diagram
APPENDIX G: Theory
TABLE OF FIGURES
Figure 1 Flow Chart
Figure 2 Deployment Diagram
Figure 3 System Architecture
Figure 4 Preprocessing Engine
Figure 5 Detailed Class Diagram
Figure 6 Example - I
Figure 7 Example - II
Figure 8 Example - III
Figure 9 Example - IV
Figure 10 Example - V
Figure 11 Example - VI
Figure 12 Example - VII
Figure 13 Computation Time with Transformation
Figure 14 Time Computation with Transformation
Figure 15 Gantt Chart
Figure 16 Use Case Diagram
Figure 17 Comparison I (Part A)
Figure 18 Comparison I (Part B)
Figure 19 Comparison II (Part A)
Figure 20 Comparison II (Part B)
Figure 21 Output Screenshot
Figure 22 Data Flow Diagram
Chapter 1
INTRODUCTION
1.1 Background
Understanding language as a unit in machine terms is not as easy as it might seem.
Words are perhaps the most intuitive units of language, yet they are in general tricky to
define. Words are defined in most languages as the smallest linguistic units that can form a
complete utterance by themselves. Natural language processing deals with the ambiguity in
word processing.
The Office of Company Registrar (OCR) is responsible for maintaining law and order with
regard to companies. Almost all of the daily tasks of the office used to be manual; the OCR
has now moved towards automating them with computerized systems. Before the advent of
the current online system, the processes for changing, admitting, and removing company
names were difficult and cumbersome. Even after the recent development of the office's
online system, the system is not intelligent enough. Currently, the OCR has implemented
database entity comparison features. The process of finding company names is often based
on English names. The comparison features are, however, limited to entity-to-entity
matching and phonetic matching. Because of this limitation, the existing system often fails
to respond promptly and accurately when a new company is registered. The same problem
arises when a new company tries to reserve its name.
The naming conflict resolution system for company registration finds the similarity between
the proposed name of a company and the existing company names in the database. This
requires some of the techniques of natural language processing. First, the input is downcast
and stop words are removed from the proposed name. The name is then transformed,
tokenized, and stemmed to determine the root words used in similarity checking. The root
words are then used to form a set of probable tokens through translation and transliteration.
These tokens are then matched against words from the database to produce a ranking of
similar names.
The system needs to translate Nepali words into English and vice versa. Translation is done
with the help of dictionaries. Stop-word removal requires a pool of predefined words to be
removed. The constraints are defined by the Office of Company Registrar. They include the
use of plural words, case sensitivity, punctuation and spacing in names, the use of numbers,
different phonetic spellings or spelling variations, and many others. The system will also
assist in deciding whether or not to approve a proposed name. It will result in more efficient
processing and faster registration of names.
1.2 Motivation
Almost all of the daily tasks of the office used to be done manually, but the OCR has now
moved towards automating them with computerized systems. Before the advent of the
current online system, the processes for changing, admitting, and removing company names
were difficult and cumbersome. Even after the recent development of the office's online
system, the system is not intelligent enough. Currently, the Office of Company Registrar
(OCR) has implemented database entity comparison features. The process of finding
company names is often based on English names. The comparison features are, however,
limited to entity-to-entity matching and phonetic matching. Because of this limitation, the
existing system often fails to respond promptly and accurately when a new company is
registered.
These limitations of the current system motivated us to develop a more reliable and accurate
system based on string matching algorithms, which produce more accurate results than the
phonetic string matching approach currently used.
1.3 Problem Statement
A recent improvement in the registration of new companies is the addition of the online
registration and name checking system. However, the current name checking system suffers
from a lack of accuracy and from the drawbacks of matching names by their phonetic
pronunciation.
In this project, we try to build a system that checks the validity of proposed names by using
string matching schemes rather than phonetic ones. Our objective is to determine the extent
to which the proposed name is similar to existing names and, based on this, to determine
whether the name is available for registration.
1.4 Objectives
The main objective of the project is to develop a system capable of checking the similarity
of proposed company names with registered ones. The objective can be further broken
down as follows:
1. To develop a system to resolve naming conflicts.
2. To find names similar to the name proposed by the user.
3. To rank the existing names by how closely they match the proposed name.
4. To define the threshold level used to validate a name.
1.5 Scope of the work
Name checking systems are used in many countries to check the proposed name of a
company. A variety of approaches is available for developing such a system. The approach
used here is the NLP approach. The system is intended to check the proposed name with
much better accuracy than the current system, which will benefit both the clients and the
OCR. The system is based on research along with a study and analysis of the existing
system. It will produce its output as a .csv file containing the similarity scores of various
names with the proposed name.
Chapter 2
LITERATURE REVIEW
2.1 Introduction
This project is about checking the validity of proposed company names for the Office of
Company Registrar. One of the important steps in developing such a system is to examine
all the related research areas thoroughly. It is important to know about Natural Language
Processing in order to understand the processes used in this project. Existing systems were
also studied thoroughly while designing this system.
Natural Language Processing (NLP) is a branch of computer science that deals with natural
language information. NLP is a component of artificial intelligence. NLP is a form of
human-to-computer interaction where the elements of human language, be it spoken or
written, are formalized so that a computer can perform value-adding tasks based on that
interaction. Human language is dauntingly complex for a computer to understand. NLP is
used in various areas such as language translation, speech processing, and checking for
grammatical errors.
2.2 Common processes used in text similarity
It is always useful to know about different types of processes used for NLP. Some of the
common processes are mentioned below:
2.2.1 Downcasting
Downcasting, as the term is used in this report, is the act of converting text from
uppercase letters to lowercase. It ensures that two company names are not treated as
distinct merely because they differ in letter case.
2.2.2 Transformation
Transformation is the conversion of words from their British English spelling to their
American English spelling. Transformation is done to avoid the generation of unwanted or
conflicting keywords.
2.2.3 Stopword Removal
Stop word removal is the process of removing predefined stop words from the string
literal. We used this process to remove the words that the directives of the Office of
Company Registrar consider similar or unimportant.
2.2.4 Tokenization
Tokenization is the process of breaking up a string into tokens to be indexed, either by
using predefined dictionaries or by analyzing the whitespace. These dictionaries can be a
pool of predefined words or a bilingual English-Nepali dictionary.
2.2.5 Stemming
Stemming is the process of reducing a word, such as a plural form, to its root or a simpler
form. Stemming is often used in text processing applications. There are many different
approaches to stemming, each with its own design goals. Some are aggressive, reducing
words to the smallest root possible.
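The processes above can be chained into a small preprocessing pipeline. The following Python sketch is only an illustration, not the implementation used in this project: the word lists are hypothetical samples, the stemmer is a naive plural stripper standing in for a real stemming algorithm, and tokenization is done right after downcasting for simplicity.

```python
import re

# Hypothetical sample data; real lists would be built from the OCR directives
# and from a British/American spelling dictionary.
BRITISH_TO_AMERICAN = {"colour": "color", "centre": "center", "jewellery": "jewelry"}
STOP_WORDS = {"private", "pvt", "p", "limited", "ltd", "company", "co", "and", "the"}


def downcase(name):
    """Downcasting: convert the whole name to lowercase."""
    return name.lower()


def tokenize(name):
    """Tokenization: split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", name)


def transform(tokens):
    """Transformation: map British spellings to American spellings."""
    return [BRITISH_TO_AMERICAN.get(token, token) for token in tokens]


def remove_stop_words(tokens):
    """Stop word removal: drop tokens found in the predefined pool."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Naive stemming (assumption): strip simple English plural endings only."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(name):
    """Run all steps and return the list of root words."""
    tokens = tokenize(downcase(name))
    tokens = transform(tokens)
    tokens = remove_stop_words(tokens)
    return [stem(token) for token in tokens]


print(preprocess("Him Shikhar Travels Pvt. Ltd."))
print(preprocess("Colour Industries Limited"))
```

With these sample lists, the first call yields the root words him, shikhar and travel, and the second yields color and industry.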
2.3 Existing Name checking Systems
In order to develop an effective name checking system, it is important to study similar
existing systems so that the system to be developed addresses some of their deficiencies.
We mainly focused on the existing system used by OCR Nepal. A name checking system
takes the name proposed by the customer and compares it with similar existing names.
Based on the results, it determines whether the name may be registered.
1. Office of Company Registrar, Nepal
This system uses phonetic algorithms to check names. The customer has to visit the
homepage of the OCR [1] and enter the proposed name. The system checks this name
against the existing names and determines whether the name is valid. The existing
system, however, suffers from a lack of accuracy.
2. Companies House, United Kingdom
This system is used by the government of the United Kingdom to check proposed
names. The client can visit the website [2] and check the intended name. The system
returns a list of similar existing names.
3. CIPC
CIPC stands for Companies and Intellectual Property Commission. It is a system that
checks the availability of the name proposed by the customer. The client can visit the
website [3], register by paying the fee, and then check his/her intended company
name. The CIPC checks the name against existing registered businesses and rejects
names that are too similar. The system also checks whether the name is reserved.
2.4 Criteria defined by OCR
In approving a proposed name of company, the following shall not be considered different or
distinguishable:
1. The words Private, Pvt., (P), Limited, Ltd, Ltd., Limited Liability.
2. The words appearing at the end of the names – company, and company, co., co.
3. The plural version of any of the words appearing in the name.
4. The type and case of letters, spacing between letters, and punctuation marks.
5. Joining words together or separating the words, as this does not make a name
distinguishable from a name that uses the similar, separated or joined words. For
example: Him Shikhar Travels Pvt. Ltd. will be considered as similar to Himshikhar
Travel.
6. The use of a different number of the same word (and the use of tense in English), as this
does not distinguish one name from another. For example, Three Six Five Tours and
Travels Pvt. Ltd. will be considered similar to 365 Tours and Travels Pvt. Ltd.
7. Using different phonetic spellings or spelling variations, as this does not distinguish
one name from another. For example, if S.D. Enterprises Limited exists, then S and
D Enterprises or Satya Darshan Enterprises will not be allowed.
8. Similarly, if a name contains numeric characters such as 3, 6, and 7, resemblance shall
be checked with “Three, Six, and Seven”.
9. The use of an internet-related designation, such as .COM, .NET, .EDU, .GOV, .ORG,
.IN, as this does not make a name distinguishable from another.
10. The addition of words like New, Modern, Nav, Shri, Sri, Shree, Sree, Om, Jai, Sai,
The, etc., as this does not make a name distinguishable from an existing name such
as New Kantipur Publication Pvt., Shree Sai Enterprises.
11. Adding the name of a place, like Kathmandu or Janakpur, as this does not make a
name different or distinguishable. For example, ‘Kathmandu Sugam Pharmaceuticals
Private Ltd.’ cannot be allowed if ‘Sugam Pharmaceuticals Private Ltd’ already
exists. Such names may be allowed only if a no-objection from the existing company
by way of a Board resolution is produced/submitted.
12. Different combinations of the same words, as this does not make a name
distinguishable from an existing name, e.g., if there is a company in existence by the
name of “Builders and Contractors Limited”, the name “Contractors and Builders
Limited” should not be allowed.
13. Exact Nepali translation of the name of an existing company in English or other
language. For example, Kathmandu Dairy Industry Limited will not be allowed if
there exists a company with name ‘Kathmandu Dugdh Udyog Limited’.
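Several of these criteria (1, 2, 3, 4, 5, 6, 8, 9, 10 and 12) can be approximated by reducing each name to a canonical form before comparison. The Python sketch below is a hedged illustration under stated assumptions, not the system described in this report: the ignored-word list is a hypothetical sample, plural handling is a naive suffix strip, and criteria 7, 11 and 13 (phonetic variants, place names, translation) are not covered.

```python
import re

# Hypothetical sample of words that do not make a name distinguishable
# (criteria 1, 2 and 10); "and" is ignored here as a simplifying assumption.
IGNORED_WORDS = {
    "private", "pvt", "p", "limited", "ltd", "liability", "company", "co", "and",
    "new", "modern", "nav", "shri", "sri", "shree", "sree", "om", "jai", "sai", "the",
}
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
# Internet-related designations (criterion 9).
INTERNET_SUFFIX = re.compile(r"\.(com|net|edu|gov|org|in)\b")


def canonical_tokens(name):
    """Lowercase, drop designations and punctuation, spell out digits, strip plurals."""
    text = INTERNET_SUFFIX.sub(" ", name.lower())                       # criteria 4, 9
    text = re.sub(r"\d", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)  # criteria 6, 8
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in IGNORED_WORDS]
    # Naive plural stripping (criterion 3); a real stemmer would be more careful.
    return [t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
            for t in tokens]


def conflict_keys(name):
    """Return a joined key (criterion 5) and an order-free key (criterion 12)."""
    tokens = canonical_tokens(name)
    return "".join(tokens), tuple(sorted(tokens))


def is_indistinguishable(name_a, name_b):
    """Two names conflict in this sketch if either key matches."""
    joined_a, bag_a = conflict_keys(name_a)
    joined_b, bag_b = conflict_keys(name_b)
    return joined_a == joined_b or bag_a == bag_b


print(is_indistinguishable("Him Shikhar Travels Pvt. Ltd.", "Himshikhar Travel"))
print(is_indistinguishable("Three Six Five Tours and Travels Pvt. Ltd.",
                           "365 Tours and travels Pvt. Ltd."))
print(is_indistinguishable("Builders and Contractors Limited",
                           "Contractors and Builders Limited"))
```

Exact key matching only catches names that become identical after normalization; near matches still need the similarity measures discussed in the next section.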
2.5 Matching Techniques
Name matching can be defined as the process of determining whether two name strings are
instances of the same name [18]. As name variations and errors are quite common [17], exact
name comparison will not result in good matching quality. Rather, an approximate measure
of how similar two names are is desired. Generally, a normalized similarity measure between
1.0 (two names are identical) and 0.0 (two names are totally different) is used.
The two main approaches for matching names are phonetic encoding and pattern matching.
Different techniques have been developed for both approaches, and several techniques
combine the two with the aim to improve the matching quality.
2.5.1 Phonetic encoding
Common to all phonetic encoding techniques is that they attempt to convert a string into a
code according to how a string is pronounced (i.e. the way a string is spoken).
Naturally, this process is language dependent. Most techniques have been developed mainly
with English in mind.
2.5.1.1 Soundex
Soundex, based on English language pronunciation, is the oldest and best known phonetic
encoding algorithm. It keeps the first letter in a string and converts the rest into numbers
according to the following encoding table.
Letters                  Code
a, e, h, i, o, u, w, y   0
b, f, p, v               1
c, g, j, k, q, s, x, z   2
d, t                     3
l                        4
m, n                     5
r                        6
All zeros (vowels and ‘h’, ‘w’ and ‘y’) are then removed and sequences of the same number
are reduced to one only (e.g. ‘333’ is replaced with ‘3’). The final code is the original first
letter and three numbers (longer codes are cut off, and shorter codes are extended with
zeros). As examples, the Soundex code for ‘peter’ is ‘p360’, while the code for ‘christen’ is
‘c623’. A major drawback of Soundex is that it keeps the first letter, so any error or
variation at the beginning of a name will result in a different Soundex code.
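The encoding steps described above can be sketched in a few lines of Python. This is a minimal illustration that follows the simplified description in this section (remove zeros, then collapse repeated digits); it is not the implementation used by the OCR system, and other Soundex variants treat letters separated by ‘h’ or ‘w’ differently.

```python
# Build the letter-to-digit map from the encoding table above.
SOUNDEX_CODES = {}
for letters, digit in [
    ("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
    ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6"),
]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Return the first letter plus three digits, as described in the text."""
    # Keep only letters that appear in the encoding table.
    name = "".join(ch for ch in name.lower() if ch in SOUNDEX_CODES)
    if not name:
        return ""
    first, rest = name[0], name[1:]
    # Encode the remaining letters and drop all zeros.
    digits = [SOUNDEX_CODES[ch] for ch in rest if SOUNDEX_CODES[ch] != "0"]
    # Reduce runs of the same digit to a single digit.
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    # Pad with zeros and cut to one letter plus three digits.
    return (first + "".join(collapsed) + "000")[:4]


print(soundex("peter"))     # p360
print(soundex("christen"))  # c623
print(soundex("kristen"))   # k623
```

The last call illustrates the drawback noted above: ‘christen’ and ‘kristen’ differ only in their first letter, yet they receive different codes.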
2.5.1.2 Metaphone
Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing
words by their English pronunciation. It fundamentally improves on the Soundex algorithm
by using information about variations and inconsistencies in English spelling and
pronunciation to produce a more accurate encoding, which does a better job of matching
words and names which sound similar. As with Soundex, similar sounding words should
share the same keys.
The original author later produced a new version of the algorithm, which he named Double
Metaphone. Contrary to the original algorithm whose application is limited to English only,
this version takes into account spelling peculiarities of a number of other languages. In 2009
Lawrence Philips released a third version, called Metaphone 3, which is reported to achieve an
accuracy of approximately 99% for English words, non-English words familiar to Americans, and
first names and family names commonly found in the United States. It was developed
according to modern engineering standards against a test harness of prepared correct
encodings.
2.5.2 Pattern matching
Pattern matching techniques are commonly used in approximate string matching [24, 25],
which has widespread applications, from data linkage [22, 23] and duplicate detection [20,
21], information retrieval [26], correction of spelling errors [27], approximate database joins,
to bio- and health informatics [25]. These techniques can broadly be classified into edit
distance and q-gram based techniques, plus several techniques specifically developed for
name matching.
A normalized similarity measure between 1.0 (strings are the same) and 0.0 (strings are
totally different) is usually calculated. We will denote the length of a string s with |s|.
2.5.2.1 Levenshtein or Edit Distance
The Levenshtein distance [28] is defined to be the smallest number of edit operations
(insertions, deletions and substitutions) required to change one string into another. In its basic
form, each edit has cost 1. Using a dynamic programming algorithm [17], the distance
(number of edits) between two strings s1 and s2 can be calculated in time O(|s1| × |s2|) using
O(min(|s1|, |s2|)) space. The distance can be converted into a similarity measure (between
0.0 and 1.0) using
$$\mathrm{sim}_{ld}(s_1, s_2) = 1 - \frac{\mathrm{dist}_{ld}(s_1, s_2)}{\max(|s_1|, |s_2|)} \qquad (1)$$
with dist_ld(s1, s2) being the actual Levenshtein distance function, which returns a value of 0
if the strings are the same or a positive number of edits if they are different. The distance is
symmetric, and it is never smaller than the difference in length of the two strings, that is,
abs(|s1| - |s2|) <= dist_ld(s1, s2). The second property allows quick filtering of string pairs
that have a large difference in their lengths.
The distance between "Bob" and "Bob" is zero (0), because no edits are required to convert
a string into itself. The edit distance between strings is only zero if the strings are identical.
The distance between "Brett" and "Brent" is one (1), because it requires the substitution of an
‘n’ for a ‘t’. The distance between "Brett" and "Bret" is also one, requiring the deletion of one
of the two ‘t’ characters in "Brett". The sequence of edits must be minimal, but need not be
unique. Note further that "Bret" can be converted to "Brett" with a single insertion of a ‘t’
character.
The distance between "Bob" and "bob" is also 1, as it requires the substitution of a lowercase
'b' for its uppercase equivalent ‘B’.
The Levenshtein distance is used to calculate the similarity of two strings. A standard
Levenshtein distance is reported to be about 40% accurate [19]; standardizing the orthography
of the strings can improve this to a maximum of about 65% [3].
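The distance and the similarity of equation (1) can be sketched in Java as follows. This is a minimal illustration with unit edit costs, not the project's Hybrid class; the class and method names are hypothetical. Only two rows of the dynamic programming table are kept, which gives the O(min(|s1|, |s2|)) space bound mentioned above.

```java
class LevenshteinSketch {
    // Unit-cost edit distance computed row by row.
    static int distance(String s1, String s2) {
        if (s1.length() < s2.length()) { String t = s1; s1 = s2; s2 = t; } // s2 is the shorter one
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) prev[j] = j;   // distance from the empty prefix
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[s2.length()];
    }

    // Equation (1): 1 - dist / max(|s1|, |s2|).
    static double similarity(String s1, String s2) {
        int longest = Math.max(s1.length(), s2.length());
        if (longest == 0) return 1.0;                          // two empty strings are identical
        return 1.0 - (double) distance(s1, s2) / longest;
    }
}
```

For the examples above, distance("Brett", "Brent") and distance("Bob", "bob") both return 1, and similarity("Brett", "Brent") is 1 - 1/5 = 0.8.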
2.5.2.2 Sorensen similarity
The Sorensen index, also known as Sorensen's similarity coefficient, is a statistic used for
comparing the similarity of two samples. It was developed by the botanist Thorvald Sørensen
and published in 1948. Sorensen's original formula was intended to be applied to
presence/absence data, and is
$$QS = \frac{2C}{A + B} = \frac{2\,|A \cap B|}{|A| + |B|} \qquad (2)$$
where, in the first form, A and B are the number of species in samples A and B, respectively,
and C is the number of species shared by the two samples; in the second form, A and B are the
samples themselves, treated as sets. QS is the quotient of similarity and ranges
from 0 to 1. This expression is easily extended to abundance instead of presence/absence of
species. The Sorensen index is identical to Dice's coefficient.
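As an illustration of the set form of the index, the following Java sketch computes it for two sets of tokens. The class and method names are hypothetical, and the value returned for two empty sets is a convention chosen for this sketch only.

```java
import java.util.HashSet;
import java.util.Set;

class SorensenSketch {
    // QS = 2 * |A intersect B| / (|A| + |B|) for two sets of tokens.
    static double index(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;   // convention chosen for this sketch
        Set<String> shared = new HashSet<String>(a);
        shared.retainAll(b);                          // keeps only the common tokens
        return 2.0 * shared.size() / (a.size() + b.size());
    }
}
```

For the token sets {nepal, metal} and {nepal, steel}, one token is shared, so the index is 2 × 1 / (2 + 2) = 0.5.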
2.5.2.3 Cosine Similarity
The cosine of the angle between two vectors can be derived from the Euclidean dot product
formula:
$$a \cdot b = |a|\,|b| \cos\theta \qquad (3)$$
Given two vectors of attributes, A and B, the cosine similarity, cos θ, is represented using a
dot product and magnitude as
$$\text{similarity} = \cos\theta = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}} \qquad (4)$$
The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the
same, with 0 usually indicating independence, and in-between values indicating intermediate
similarity or dissimilarity.
For text matching, the attribute vectors A and B are usually the term frequency vectors of the
documents. The cosine similarity can be seen as a method of normalizing document length
during comparison.
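A small Java sketch of equation (4) is given below. The names are hypothetical, and the letter-count vector is only one possible choice of attribute vector, assumed here for illustration. Returning 0 for a zero vector is likewise a convention of this sketch.

```java
class CosineSketch {
    // Equation (4) for two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("vectors must have equal length");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;   // undefined for a zero vector; 0 chosen here
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed attribute vector: how often each letter a..z occurs in the string.
    static double[] letterCounts(String s) {
        double[] counts = new double[26];
        for (char c : s.toLowerCase().toCharArray())
            if (c >= 'a' && c <= 'z') counts[c - 'a']++;
        return counts;
    }
}
```

With letter counts as attributes, two anagrams such as "listen" and "silent" have identical vectors and therefore a cosine similarity of 1 (up to rounding), because the vectors carry no information about letter order.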
2.6 Summary
1. The background study focused on the uses of name checking systems, their
effectiveness and usefulness.
2. It helped us decide on the design, methodologies, and programming tools that should be
used to develop this system.
3. It also examined the existing systems, along with their merits and flaws.
Chapter 3
REQUIREMENT ANALYSIS
3.1 Functional Requirements
1. A true reflection of lexical similarity
Strings with small differences should be recognized as being similar. In particular, a
significant substring overlap should point to a high level of similarity between the strings.
2. A robustness to changes of word order
Two strings which contain the same words, but in a different order, should be recognized
as being similar. On the other hand, if one string is just a random anagram of the
characters contained in the other, then it should (usually) be recognized as dissimilar.
3. Language Independence
The system should work not only for English words, but also for Nepali words.
4. Output file format
The result should be stored in a file in comma-separated values (CSV) format.
5. Easy integration
The system should be easy to integrate with the existing system. The system should be
easy to maintain by the maintenance personnel.
3.2 Non-Functional Requirements
3.2.1 Reliability
It is required that the system be available at all times. This can be achieved by hosting
the system on a reliable server. The system is also built using Java, whose built-in memory
management adds confidence in its reliability.
3.2.2 Performance
The system would be used by numerous customers throughout the country, so it was required
that the system take minimum time to produce output. The main concern was the time
taken to query the database system to extract the relevant names and calculate the similarity
scores. This time depends upon the type of processor used. The overall time required to obtain
output after the submission of a name by the customer was about 1 minute, but
this time also depends upon the number of tokens generated.
3.2.3 Accuracy
The system is intended to work in real time, so it is required that high accuracy is maintained.
This is ensured by using Morphanalyser and the Levenshtein algorithm in conjunction with the
Kuhn-Munkres (Hungarian) algorithm and the Sorensen coefficient.
Chapter 4
METHODOLOGY
4.1 Introduction
Methodology is the analysis of the tasks to be done in order to obtain the desired output. An
appropriate methodology generally leads to a successful project, and an inappropriate one to
failure. For this system, a number of methodologies were considered and the most efficient
ones were used. This does not mean that one particular method was used; the most
appropriate ones for the system were used in combination.
The model used here is an iterative model, i.e. in the beginning a small subset of the software
requirements is developed, and its further versions are then enhanced through redesign and
redevelopment. This process is continued until the desired system is developed that produces
results as mentioned in the system requirements.
The methodology, once decided, was changed during the project whenever circumstances arose
in which the design showed flaws. Thus appropriate methodologies were implemented based on
the situation. In our scenario, the methodology comprises five different steps.
1. Building Base Dictionary
2. Possible Keyword Generation
3. Finding Possible Matches
4. Finding Duplicates
5. Finding Ranks
1. Building Base Dictionary
A base dictionary can be generated from the existing name database provided by OCR. This
can be done manually. The base dictionary used in our project consists of a file
containing English words, a dictionary for transliteration, and a Nepali to English dictionary
(provided by Madan Puraskar Pustakalaya). These dictionaries help us in tokenization and
possible keyword generation.
2. Possible Keyword Generation:
After tokenizing the given name, a possible combination of the keywords is generated using
both English and Nepali words similar to them. After obtaining base keywords, these
keywords are transliterated and combined in every possible manner to form the probable
similar keywords. These keywords are used to match against names in OCR Names database.
3. Finding Possible Matches:
Possible names generated using base keywords are matched against OCR names database.
For this, the names containing any of the keywords are extracted from the names database.
Each of these names is checked against the proposed name. The aim is to collect as many records
as possible for better results. These records can contain duplicates too.
4. Finding Duplicates:
The names extracted from the Names database may occur more than once, so the repeated
occurrences are removed. Duplication occurs when a name in the database
contains two or more of the probable keywords.
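Steps 3 and 4 can be sketched in Java as follows. This is an in-memory illustration only: the class and method names are hypothetical, and the actual system extracts the names from the OCR database rather than from a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class CandidateSketch {
    // Collects every name that contains a keyword, once per keyword hit,
    // which is where the duplicates described above come from.
    static List<String> possibleMatches(List<String> names, List<String> keywords) {
        List<String> hits = new ArrayList<String>();
        for (String keyword : keywords)
            for (String name : names)
                if (name.toLowerCase().contains(keyword.toLowerCase())) hits.add(name);
        return hits;
    }

    // A LinkedHashSet drops repeats while keeping the first-seen order.
    static List<String> withoutDuplicates(List<String> hits) {
        return new ArrayList<String>(new LinkedHashSet<String>(hits));
    }
}
```

For the keywords "nepal" and "metal", a stored name such as "nepal metal udyog" is collected twice by possibleMatches() and reduced to a single record by withoutDuplicates().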
5. Finding Ranks:
The proposed name is assigned a value against each name extracted from the Names
database. The value signifies the extent of matching. For calculating the match, we used
• Levenshtein algorithm to calculate similarity between tokens of the proposed name and the
name extracted from the database.
• The Kuhn-Munkres algorithm (also known as the Hungarian method) to find the optimal
assignment of similarity weights between tokens of the two strings in comparison, that is, the
assignment that maximizes the sum of similarity weights.
• Sorensen's similarity coefficient to find the single-value similarity score (which is between
0 and 1) from the result obtained through the Hungarian method.
4.2 System Design
This section gives a detailed review of the design on which the system is
implemented. It includes
1. Flow diagram
2. Deployment diagram
3. System architecture
4. Detailed class diagram
4.2.1 Flow Diagram
Figure 1 Flow Chart
4.2.2 Deployment Diagram
Figure 2 Deployment Diagram
The application is built around a client/server architecture. Multiple client machines can
interact with the server simultaneously. Clients can interact with the system through
OCR's interactive website, while the server serves the clients' requests and does the
processing in the backend.
4.2.3 System Architecture
(Block diagram. Labelled components: User Input, Query Processing, Preprocessing Engine,
Dictionary, Translator + Transliterator, English-Nepali, Keywords Generator, Index Processor,
Indexed Record, Comparator, Preprocessing Engine, Ranking Engine, Result Visualization,
Database.)
Figure 3 System Architecture
4.2.3.1 Preprocessing Engine
(Diagram. Labelled stages: Downcasting, Transformation, Pool of stopwords, Stopword removal,
Tokenization, Stemming.)
Figure 4 Preprocessing Engine
The Preprocessing Engine applies five different processes to the user input.
1. Downcasting
Downcasting is the conversion of uppercase letters to lowercase. It is done to make sure
that two company names are not treated as different merely because of the case of their
letters.
2. Transformation
Transformation is the conversion of British English words to their American English
equivalents. It is done to avoid the generation of unwanted or conflicting keywords. Our
dictionary consists of around 130 commonly used words that are converted from British
English to American English when found.
3. Stopword Removal
Stopword removal is the process of removing some predefined stop words from the
string. We used this process to remove the words that are considered
similar/unimportant according to the Office of the Company Registrar. Words such as
Shree, New, Modern, Industry, Udyog, Company, etc. are removed.
4. Tokenization
Tokenization is the process of breaking up a string into tokens to be indexed, using
predefined dictionaries or by analyzing the whitespace. These dictionaries
can be a pool of predefined words or a bilingual English-Nepali dictionary. Proper
handling of strings, numbers and symbols is also important. For instance, tokenizing
“nepal metals” outputs “nepal” and “metals”.
5. Stemming
Stemming is the process of reducing a word, such as a plural form, to its root or a simpler
form. Stemming is often used in text processing applications. There are many
different approaches to stemming, each with its own design goals. Some are
aggressive, reducing words to the smallest root possible. Here, stemming is done with
the help of a morphological analyzer. Morphological analysis is done in order to produce
English dictionary based words. For example, words like “services” and “metals” are
reduced to the simpler singular forms “service” and “metal”.
We used stemming to obtain the dictionary based root words. Using root words, we
simplified the matching process.
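The five steps can be sketched together in Java as follows. This is a simplified illustration under several assumptions: the word lists are hypothetical miniatures of the project dictionaries, tokenization is done on whitespace only and before the word-level steps for convenience, and a naive plural stripper stands in for the morphological analyzer. The class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PreprocessingSketch {
    // Hypothetical miniature word lists, not the project's actual dictionaries.
    static final Set<String> STOPWORDS = new HashSet<String>(
            Arrays.asList("shree", "new", "modern", "industry", "udyog", "company"));
    static final Map<String, String> BRITISH_TO_AMERICAN = new HashMap<String, String>();
    static {
        BRITISH_TO_AMERICAN.put("colour", "color");
        BRITISH_TO_AMERICAN.put("centre", "center");
    }

    static List<String> preprocess(String name) {
        List<String> tokens = new ArrayList<String>();
        // Downcasting, then whitespace tokenization.
        for (String word : name.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            // Transformation: British spelling to American spelling.
            String w = BRITISH_TO_AMERICAN.containsKey(word) ? BRITISH_TO_AMERICAN.get(word) : word;
            if (STOPWORDS.contains(w)) continue;      // stopword removal
            tokens.add(stem(w));                      // stemming
        }
        return tokens;
    }

    // Stand-in for the morphological analyzer: strips a single plural "s".
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss"))
            return word.substring(0, word.length() - 1);
        return word;
    }
}
```

For example, preprocess("Shree Nepal Metals Company") yields the tokens "nepal" and "metal".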
4.2.3.2 Translation and Transliteration
Translation is the conversion of the meaning of a source-language text by means of
an equivalent target-language text. In this process, the Nepali equivalent of each English
keyword is obtained. Each keyword is first matched against the English dictionary, and the
matched words are then mapped with the English-Nepali dictionary provided
by Madan Puraskar Pustakalaya. The unmatched words are simply placed alongside the translated
tokens. For example, the words “nepal” and “metal” are mapped onto the dictionary to get the
words “नेपाल” and “धातु”.
Transliteration is the conversion of a text from one script to another. To transliterate a
Nepali word into English script, we used dictionary mapping to map individual Nepali syllables
to English letters. In the above example of translation, the words “नेपाल” and “धातु” are
transliterated to “Nepal” and “dhatu” and then added to the pool of keywords for further
processing.
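A minimal Java sketch of dictionary-based transliteration is given below. The tables are hypothetical and cover only the letters of the two example words, the class and method names are assumptions, and dropping the inherent vowel only at the end of a word is a simplification. The Devanagari characters are written as Unicode escapes to keep the source ASCII.

```java
import java.util.HashMap;
import java.util.Map;

class TransliterationSketch {
    // Hypothetical, deliberately tiny tables covering only the example words.
    private static final Map<Character, String> CONSONANTS = new HashMap<Character, String>();
    private static final Map<Character, String> VOWEL_SIGNS = new HashMap<Character, String>();
    private static final char VIRAMA = '\u094D';   // suppresses the inherent vowel
    static {
        CONSONANTS.put('\u0928', "n");    // letter na
        CONSONANTS.put('\u092A', "p");    // letter pa
        CONSONANTS.put('\u0932', "l");    // letter la
        CONSONANTS.put('\u0927', "dh");   // letter dha
        CONSONANTS.put('\u0924', "t");    // letter ta
        VOWEL_SIGNS.put('\u0947', "e");   // vowel sign e
        VOWEL_SIGNS.put('\u093E', "a");   // vowel sign aa
        VOWEL_SIGNS.put('\u0941', "u");   // vowel sign u
    }

    static String transliterate(String word) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            String base = CONSONANTS.get(word.charAt(i));
            if (base == null) continue;                 // characters outside the table are skipped
            out.append(base);
            boolean last = (i == word.length() - 1);
            char next = last ? '\0' : word.charAt(i + 1);
            if (VOWEL_SIGNS.containsKey(next)) {
                out.append(VOWEL_SIGNS.get(next));      // explicit sign replaces the inherent "a"
                i++;
            } else if (next == VIRAMA) {
                i++;                                    // conjunct: no vowel at all
            } else if (!last) {
                out.append('a');                        // inherent vowel, dropped at word end
            }
        }
        return out.toString();
    }
}
```

Here transliterate("\u0928\u0947\u092A\u093E\u0932"), which is the word नेपाल, gives "nepal", and transliterate("\u0927\u093E\u0924\u0941"), which is धातु, gives "dhatu".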
4.2.3.3 Possible Keyword Generation
Keywords are generated by combining the keywords obtained from stemming with those obtained
after transliteration. The generated keywords are then used to retrieve from the database the
list of company names that contain those keywords. Each company name in the list is in turn
processed by the preprocessing engine, and its stemmed keywords are extracted and kept as
an indexed record for that company name, to be used in the comparison.
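Combining keywords "in every possible manner" can be read as picking one alternative per token, as in the Java sketch below. This reading, and the class and method names, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class KeywordSketch {
    // Each inner list holds the alternatives for one token (for example the stemmed
    // word and its transliterated translation). The result lists every way of
    // picking exactly one alternative per token, in token order.
    static List<String> combinations(List<List<String>> alternatives) {
        List<String> result = new ArrayList<String>();
        result.add("");
        for (List<String> options : alternatives) {
            List<String> next = new ArrayList<String>();
            for (String prefix : result)
                for (String option : options)
                    next.add(prefix.isEmpty() ? option : prefix + " " + option);
            result = next;
        }
        return result;
    }
}
```

For the alternatives [nepal] and [metal, dhatu], the sketch produces "nepal metal" and "nepal dhatu".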
4.2.3.4 Comparison
Comparison is done between the tokens obtained from the company name entered by the user and
the tokens generated from the company names extracted from the database based on the
user's keywords.
The Levenshtein algorithm and the Kuhn-Munkres algorithm (Hungarian method) were used
in the comparison of strings. The similarity is calculated in three steps:
• Partition each name into a list of tokens.
• Eliminate the common tokens.
• Compute the similarity between dissimilar tokens by using a string edit-distance
algorithm.
The first method uses an edit-distance string matching algorithm: Levenshtein. The string
edit distance is the total cost of transforming one string into another using a set of edit rules,
each of which has an associated cost. The Levenshtein distance is obtained by finding the
cheapest way to transform one string into another. Transformations are the one-step
operations of (single-phone) insertion, deletion and substitution. In the simplest version,
substitutions cost two units except when the source and target are identical, in which
case the cost is zero. Insertions and deletions cost half that of substitutions.
• Application of Hungarian Algorithm for Optimization
The result of the Levenshtein method is used as the edge weights of a bipartite graph, on which
the Hungarian algorithm is run. A related classical problem on matching in bipartite graphs is
the assignment problem, which is the quest to find the optimal assignment of workers to jobs
that maximizes the sum of ratings, given all non-negative ratings Cost[i,j] of each worker i to
each job j.
All relation scores are in the [0, 1] range, which means that if the score reaches its maximum
value (equal to 1) then the two strings are considered identical.
• Application of Sorensen's Similarity Coefficient
The result of the Hungarian method, which is the sum of similarity weights, is then applied to
the Sorensen index to find the final single-value similarity score between the strings being
compared. This final score (whose value lies between 0 and 1) is then converted into a
percentage by multiplying by 100.
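The three stages can be sketched together in Java as follows. Two points are assumptions rather than the project's implementation: an exhaustive search over assignments stands in for the Kuhn-Munkres algorithm (adequate only for the handful of tokens in a company name), and the Sorensen step is taken to be 2 × (sum of matched weights) / (number of tokens in both names). Common tokens are not eliminated beforehand; they simply receive a similarity of 1. All names are hypothetical.

```java
class TokenMatchSketch {
    // Unit-cost Levenshtein distance (full table, for brevity).
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    // Best total similarity over all one-to-one assignments of the shorter token
    // list to the longer one. Exhaustive search, used here instead of Kuhn-Munkres.
    static double bestAssignment(String[] shorter, String[] longer, int row, boolean[] used) {
        if (row == shorter.length) return 0.0;
        double best = 0.0;
        for (int col = 0; col < longer.length; col++) {
            if (used[col]) continue;
            used[col] = true;
            best = Math.max(best, similarity(shorter[row], longer[col])
                    + bestAssignment(shorter, longer, row + 1, used));
            used[col] = false;
        }
        return best;
    }

    // Assumed combination: Sorensen-style 2 * sum / (|A| + |B|), as a percentage.
    static double score(String[] a, String[] b) {
        if (a.length == 0 || b.length == 0) return 0.0;
        String[] shorter = a.length <= b.length ? a : b;
        String[] longer = a.length <= b.length ? b : a;
        double sum = bestAssignment(shorter, longer, 0, new boolean[longer.length]);
        return 100.0 * 2.0 * sum / (a.length + b.length);
    }
}
```

Because the assignment is searched over all pairings, the names "nepal metal" and "metal nepal" score 100, which reflects the word-order requirement of Chapter 3, while "nepal metal" against "nepal metals" scores about 91.67.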
4.2.3.5 Ranking
The result of each permutation is taken into consideration and the maximum
matched percentage score is chosen. A list of company names is then generated, ordered by
percentage similarity score.
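Ordering the matches and writing them in CSV form can be sketched in Java as follows. The class names, the column names and the two-decimal formatting are assumptions made for illustration, not the project's actual output format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class RankingSketch {
    static final class Match {
        final String name;
        final double percent;
        Match(String name, double percent) { this.name = name; this.percent = percent; }
    }

    // Returns a copy of the matches, highest similarity first.
    static List<Match> rank(List<Match> matches) {
        List<Match> sorted = new ArrayList<Match>(matches);
        Collections.sort(sorted, new Comparator<Match>() {
            public int compare(Match x, Match y) { return Double.compare(y.percent, x.percent); }
        });
        return sorted;
    }

    // One CSV line per match: the quoted name, then the score with two decimals.
    static String toCsv(List<Match> ranked) {
        StringBuilder sb = new StringBuilder("name,similarity_percent\n");
        for (Match m : ranked)
            sb.append('"').append(m.name.replace("\"", "\"\"")).append('"').append(',')
              .append(String.format(Locale.ROOT, "%.2f", m.percent)).append('\n');
        return sb.toString();
    }
}
```

The string returned by toCsv() can then be written to a file with the .csv extension.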
4.2.4 Detailed Class Diagram
Figure 5 Detailed Class Diagram
The system is implemented using the object oriented methodology. We have not used a
framework of any kind. Some of the core classes of the system, along with their associations,
are shown.
• Comparison System
This system is used to compare the result received from the preprocessing engine for the user
input with the list obtained from the database.
1. HungarianAlgorithmEdu Class
In this class we have used the Hungarian algorithm to compute the highest possible
matching score between the tokens from both inputs. The input to this class is the
weight matrix obtained from the Hybrid class and the output is the similarity score.
The hgAlgorithm() method performs the Hungarian algorithm and the final similarity score
is returned by the getScore() method.
2. Hybrid Class
In this class we have used the Levenshtein distance algorithm to calculate the edit
distance. This class calculates the edit distance between two tokens and finally
gives the similarity score between them. The ComputeDistance() method computes the
edit distance and GetSimilarity() returns the similarity between tokens.
3. Permutation Class
In this class we permute the tokens of the user input and the tokens obtained from its
transliteration, each among themselves. The permute() method performs the permutation.
4. MatchsMaker Class
This is the main class of the comparison system, which calls each of its components to
perform the comparison and returns the output as a similarity percentage. GetScore() returns
the similarity percentage and Initialize() initializes the necessary components.
• Database System
1. DatabaseCredentials Class
This class is used to store database credentials. Those credentials include the username,
password and connection path. This class can also be used as a JavaBean with
set/get methods.
2. DatabaseHandler
The DatabaseHandler class is used to initiate the database connection and to declare
the database type.
3. CookSQL Class
This class is used to prepare SQL statements.
4. CompanyNameEnglish Class
This class is the core of the package. It contains the methods for individual
record manipulation and result set retrieval.
5. ConnectDatabase
This class is the bridge between the database and the main interface and other classes. It
is used to hide the details of the underlying database implementation.
• Preprocessing Engine
This engine contains the components that are used to downcast, clean, transform, remove stop
words, stem and tokenize.
1. SpaceProcessor Class
This class is used to tokenize a company name based on space and hyphen (-) and to
rejoin the individual tokens if necessary.
getSplittedText(): This method is used to split the company name into tokens.
joinSplittedText(): This method is used to join tokens with spaces to regenerate the company
name.
2. StopwordRemover Class
This class is used to remove the stop words as defined by the OCR directives.
3. Stemmer Class
The Stemmer class contains methods to generate root words. Stemming is achieved using
the Snowball stemmer and morphological analysis.
4. SymbolProcessor
This class is used to clean the illegal symbols from names.
4.3 Project Tools
• Programming Language: Java SE 7
• Database: MySQL Server Version 5.1.41
• Testing: JUnit testing
• Drawings: MS Paint, MS Visio, ArgoUML, Adobe Photoshop
• Documentation: MS Word/Excel/PowerPoint
• Platform: Windows
• IDE: Eclipse Indigo
4.4 Eclipse as Programming IDE
Eclipse was used as IDE for project development. Eclipse is a multi-language software
development platform comprising an IDE and a plug-in system to extend it. It is written
primarily in Java and is used to develop applications in this language and, by means of the
various plug-ins, in other languages as well—C/C++, COBOL, Python, Perl, PHP and more.
The initial codebase originated from Visual Age. In its default form it is meant for Java
developers, consisting of the Java Development Tools (JDT). Users can extend its
capabilities by installing plug-ins written for the Eclipse software framework, such as
development toolkits for other programming languages, and can write and contribute their
own plug-in modules. Language packs provide translations into over a dozen natural
languages. Released under the terms of the Eclipse Public License, Eclipse is free and open
source software.
4.5 MySQL as Database System
MySQL was used as database server. It is a relational database management system
(RDBMS) which has more than 11 million installations. The program runs as a server
providing multi-user access to a number of databases. The project's source code is available
under terms of the GNU General Public License, as well as under a variety of proprietary
agreements.
Chapter 5
EXPERIMENTAL SETUP
Hardware Configuration used for Testing
• Computer Model: DELL 5110
Physical Memory (RAM): 4.00 GB, DDR2
Processor: Intel(R) Core(TM) i5-2450M CPU, 2.5 GHz
System Type: 64-bit Operating System, x64-based processor
Cache Size: 4096 KB
OS: Windows 8 Enterprise
Database: MySQL Server Version 5.5.24
Database with 111,161 records of company names.
• Computer Model: Acer Aspire E1-531
Physical Memory (RAM): 4.00 GB, DDR2
Processor: Intel B960 Dual Core processor (2.2 GHz, 2 MB L3 cache)
System Type: 64-bit Operating System, x64-based processor
Cache Size: 4096 KB
OS: Windows 8 Enterprise
Database: MySQL Server Version 5.5.24
Database with 111,161 records of company names.
Chapter 6
OUTPUT
1. Output obtained by using input “durga enterprises”
Figure 6 Example - I
2. Output obtained by using input “hamro lagani”
Figure 7 Example - II
3. Output obtained by using input “jagadamba steels”
Figure 8 Example- III
4. Output obtained by using input “nawayug vidhya niketan kanchanpur”
Figure 9 Example - IV
5. Output obtained by using input “nepal investment company”
Figure 10 Example- V
6. Output obtained by using input “nepal one travels and tour”
Figure 11 Example - VI
7. Output obtained by using input “new age business consultant”
Figure 12 Example - VII
Chapter 7
RESULT AND ANALYSIS
To obtain the similarity scores, we tried various similarity measuring algorithms. However,
the Levenshtein algorithm and the Hungarian algorithm together with the Sorensen coefficient
seemed to fit our need best. We applied various preprocessing steps before these algorithms,
which proved to be fruitful. The scores obtained are saved in a file with the .csv extension.
Stemming was used to obtain dictionary based root words. Tokenization and transliteration were
used to obtain the tokens later used in the comparison process. We used translation and
transliteration to cope with Nepali words. The accuracy was assessed by trying different names
that could be used in reality.
The computation time depends upon the number of tokens to be compared and, for now, the
system is single threaded.
(Bar chart "Number of Tokens VS Computation Time". Horizontal axis: Number of Tokens, with the
inputs 1 token (Durga Enterprises), 2 tokens (jagadamba steels pvt.ltd), 3 tokens (New Age
Business Consultant Limited) and 4 tokens (Nepal One travels and tours Ltd.). Vertical axis:
Time to Compute (sec), 0 to 60. Series: I5 CPU and Dual Core CPU. Data labels as extracted,
whose pairing with the bars is not recoverable: 53.785, 22.743, 5.384, 1.179, 7.316, 2.395,
1.434, 5.939.)
Figure 13 Computation Time without Transformation
Figure 13 shows the relation between the number of tokens and the time to compute similarity
scores on two Intel processors of different generations. The computation time is higher on the
older-generation processor and lower on the newer one. Furthermore, the more tokens there
are, the greater the computation time. This result was obtained without the use of the
transformation process.
(Bar chart "Number of Tokens VS Computation Time". Horizontal axis: Number of Tokens, with the
inputs 1 token (Durga Enterprises), 2 tokens (jagadamba steels pvt.ltd), 3 tokens (New Age
Business Consultant Limited) and 4 tokens (Nepal One travels and tours Ltd.). Vertical axis:
Time to Compute (sec), 0 to 120. Series: I5 CPU and Dual Core CPU. Data labels as extracted,
whose pairing with the bars is not recoverable: 107.498, 39.994, 37.743, 13.315, 8.959, 11.952,
2.204, 1.664.)
Figure 14 Time Computation with Transformation
Figure 14 shows the result obtained by using the transformation process. Using transformation
takes more time, but it yields better results. By using appropriate hardware resources,
we can reduce this time to within the constraint.
For the comparison process, we initially used the cosine similarity algorithm, but it did not
yield promising results. Cosine similarity does not consider the relative position of
letters in the string; it only considers how often each letter occurs. Thus a string with a
different spelling but the same letter counts is considered similar. This severely limited
its use.
The Levenshtein algorithm proved useful in our project. It considers the position of letters in
a string, which is necessary for our system. This algorithm along with the Hungarian algorithm
gave satisfactory results. To obtain the final score we used the Sorensen coefficient. Its
value lies in the range [0, 1]. Multiplying this coefficient by 100 gave us the final
percentage score.
Chapter 8
CONCLUSION AND FURTHER ENHANCEMENT
8.1 Conclusion
With all the accumulated effort invested in this project, there are reasons to believe that at
the end of this semester this project will find itself in much better shape and much closer to
actual acceptance than it was. We summarize the progress with respect to the main objectives
of the project, namely, accuracy and speed.
• Accuracy: This is the main obstacle for the project. We have been constantly using
and testing many different algorithms for similarity comparison, and we have
been able to get satisfactory results using the Levenshtein distance and the Hungarian
method in conjunction with the Sorensen coefficient. We are further trying to improve
the results by employing other algorithms, such as phonetic matching (Double Metaphone),
and by using the transformation function.
• Speed: Speed is also a challenging factor for this project. The requirement for shorter
processing time has made it difficult to balance accuracy and speed.
However, by using the processing capability of MySQL, we have been able to
improve the speed, resulting in a shorter waiting time for the users. The use of adequate
data structures has been of prominent advantage.
Let us remark that one of the apparent major obstacles to gaining acceptance for this
project lies in the standards of the Office of Company Registrar.
8.2 Limitations
Our system has the following limitations.
• The system cannot process names having numbers as a prefix or suffix.
• The Preprocessing Engine has many limitations. Stemming sometimes produces
incorrect results if the input is a Nepali word. E.g. Spat (स्पात) in Nepali (Steel in
English) may result in spit due to morphology based stemming. In such cases, the
similarity score is reduced.
• The dictionary (English-Nepali) does not contain enough words. There are many English
words for which a Nepali word is not available.
• The transformation process results in more computational time.
• Synonyms are not considered in the system.
• Strings such as papermill and paper mill, though similar, are considered different
because of the space. The space results in two tokens. Although both strings have the
same meaning, they are not considered similar by our system.
8.3 Further Enhancement
There is a great opportunity to enhance this project in the future. The similarity
checking algorithm has the greatest potential for enhancement. If phonetic based
similarity measures are incorporated, accuracy can be greatly improved. Implementing faster
searching methods can greatly enhance the performance of the system.
Use of a taxonomy for classifying the tokens, together with similarity measures, can help
accurately validate proposed names. A taxonomy can classify the context of names and thus
improve the validation process.
Furthermore, using some weighting measures to assign weights to the most common words might
be helpful in increasing the accuracy of the similarity score.
REFERENCE
[1] Office of Company Registrar, Nepal. Retrieved from: www.ocr.gov.np. Date Retrieved:
07/04/2013
[2] Companies House. Retrieved from:
http://wck2.companieshouse.gov.uk//wcframe?name =accessCompanyInfo. Date
Retrieved : 04/07/2013
[3] Companies and Intellectual Property Commission. Retrieved from: http://www.cipc.co.za/.
Date Retrieved: 04/07/2013
[4] Anne Kao and Stephen R. Poteet (Eds). Natural Language Processing and Text Mining.
Springer 2006
[5] Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online
Applications. In Prof. Ruslan Mitkov, editor. John Benjamins Publishing Company, 2002
[6] Ronan Collobert, Jason Weston, Léon Bottou, et al. Natural Language Processing
(Almost) from Scratch. Editor: Michael Collins. NEC Laboratories America, 4
Independence Way, Princeton, NJ 08540
[7] Prakash M Nadkarni, Lucila Ohno-Machado, Wendy W Chapman. Natural language
processing: an introduction. Available from: group.bmj.com
[8] Chris Manning, Hinrich Schütze. Foundations of Statistical Natural Language
Processing. MIT Press. Cambridge, MA: May 1999. Available from:
http://nlp.stanford.edu/fsnlp/
[9] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI
2001. Available from http://www.ebooksdirectory.com/details.php?ebook=6774
[10] Daniël de Kok, Harm Brouwer. Natural Language Processing for the Working
Programmer. 2011. Available from: http://nlpwp.org/book/
[11] Aliseda, R. van Glabbeek, D. Westerstahl. Computing Natural Language. CSLI
1998. Available from: http://www.e-booksdirectory.com/details.php?ebook=3940
[12] Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with
Python. O'Reilly Media 2009. Available from:
http://www.ebooksdirectory.com/details.php?ebook=7184
[13] Rob Malouf, Miles Osborne. An Introduction to Stochastic Attribute-Value
Grammars. ESSLLI 2001. Available from:
http://www.e-booksdirectory.com/details.php?ebook=6860
[14] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI
2001. Available from: http://www.e-booksdirectory.com/details.php?ebook=6774
[15] B. J. Grosz, K. S. Jones, B. L. Webber. Readings in Natural Language Processing.
Kaufman Publishers Inc., Los Altos, CA. Available from:
http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=6537037
[16] Ronan G. Reilly (Ed), Noel E. Sharkey (Ed). Connectionist approaches to natural
language processing. Hillsdale, NJ, England: Lawrence Erlbaum Associates, Inc. 1992.
Available from: http://psycnet.apa.org/psycinfo/1992-98664-000
[17] C. Friedman and R. Sideli. Tolerating spelling errors during patient validation.
Computers and Biomedical Research, 25:486–509, 1992.
[18] F. Patman and P. Thompson. Names: A new frontier in text mining. In ISI-2003,
Springer LNCS 2665, pages 27–38.
[19] Simon J. Greenhill. Computational Linguistics Volume 37 Issue 4, December 2011,
pages 689-698.
[20] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string
similarity measures. In Proceedings of ACM SIGKDD, pages 39–48, Washington DC,
2003.
[21] C. L. Borgman and S. L. Siegfried. Getty’s Synoname™ and its cousins: A survey
of applications of personal name matching algorithms. Journal of the American Society
for Information Science, 43(7):459–476, 1992.
[23] P. Christen, T. Churches, and M. Hegland. Febrl – a parallel open source data linkage
system. In PAKDD, Springer LNAI 3056, pages 638–647, Sydney, 2004.
[25] P. Christen and K. Goiser. Quality and complexity measures for data linkage and
deduplication. In F. Guillet and H. Hamilton, editors, Quality Measures in Data Mining,
Studies in Computational Intelligence. Springer, 2006.
[26] P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing
Surveys, 12(4):381–402, 1980. P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison
of approximate string matching algorithms. Software – Practice and Experience,
26(12):1439–1458, 1996.
[27] R. Gong and T. K. Chan. Syllable alignment: A novel model for phonetic string
search. IEICE Transactions on Information and Systems, E89-D(1):332–339, 2006.
[28] F. J. Damerau. A technique for computer detection and correction of spelling errors.
Communications of the ACM, 7(3):171–176, 1964.
[29] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys,
33(1):31–88, 2001.
APPENDIX A: Gantt chart
Figure 15 Gantt Chart
APPENDIX B: Use Case
Figure 16 Use Case Diagram
APPENDIX C: Preprocessing Detail Example
For User Input
Example: “Nepal Metals Industries”
Preprocessing Engine
1. Downcasting: Conversion of input to lowercase.
   Result: nepal metals industries
2. Transformation: Conversion of British English words to American English.
   Result: Not applied in this example
3. Stopword Removal: Removal of stop words (Company, Industry, and Pvt. Ltd.) as
   mentioned in the draft.
   Result: nepal metals
4. Tokenization: Extraction of tokens.
   Result: [nepal, metals]
5. Stemming: Reduction to root words.
   Result: [nepal, metal]
6. Translation: Conversion of tokens from English to Nepali.
   Result: [nepal, metal] to [नेपाल, धातु]
7. Transliteration: Conversion of Nepali Unicode text to Roman script.
   Result: [नेपाल, धातु] to [nepal, dhatu]
Generated Keywords (Using Transliterated Token + Stemmed Token): [nepal, metal, dhatu]
Query to MySQL Database resulting in a list of company names.
Preprocessing Engine: Company Name Extraction from Query
Example (randomly chosen): “Royal Metal Nepal Pvt.Ltd.”
1. Downcasting: Conversion of input to lowercase.
   Result: royal metal nepal pvt.ltd.
2. Transformation: Conversion of British English words to American English.
   Result: Not applied in this example
3. Stopword Removal: Removal of stop words (Company, Industry, and Pvt. Ltd.) as
   mentioned in the draft.
   Result: royal metal nepal
4. Tokenization: Extraction of tokens.
   Result: [royal, metal, nepal]
5. Stemming: Reduction to root words.
   Result: [royal, metal, nepal]
Database Generated Keywords: [royal, metal, nepal]
Comparison-1 (User Input Generated Keywords & Database Generated Keywords.)
Preprocessing Engine: Company Name Extraction from Query
Example (randomly chosen): “Nepal Dhatu Industries”
1. Downcasting: Conversion of input to lowercase.
   Result: nepal dhatu industries
2. Transformation: Conversion of British English words to American English.
   Result: Not applied in this example
3. Stopword Removal: Removal of stop words (Company, Industry, and Pvt. Ltd.) as
   mentioned in the draft.
   Result: nepal dhatu
4. Tokenization: Extraction of tokens.
   Result: [nepal, dhatu]
5. Stemming: Reduction to root words.
   Result: [nepal, dhatu]
Database Generated Keywords: [nepal, dhatu]
Comparison-2 (User Input Generated Keywords & Database Generated Keywords.)
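The preprocessing steps in this appendix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the stop-word pattern, the `naive_stem` rule, and the one-entry `TOY_TRANSLITERATION` table are hypothetical stand-ins for the stop-word draft, the stemmer, and the translation/transliteration engine, and the British-to-American transformation step is omitted.

```python
import re

# Assumption: a small stop-word pattern standing in for the list in the draft.
STOP_WORD_PATTERN = re.compile(r"\b(?:company|industry|industries|pvt|ltd)\b\.?")

# Assumption: a toy lookup replacing translation to Nepali followed by
# transliteration back to Roman script (e.g. metal -> dhatu).
TOY_TRANSLITERATION = {"metal": "dhatu"}


def naive_stem(token):
    """Toy stemmer: strip a trailing plural 's' (not a real stemming algorithm)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(name):
    """Downcasting, stop-word removal, tokenization, and stemming."""
    lowered = name.lower()                          # downcasting
    stripped = STOP_WORD_PATTERN.sub(" ", lowered)  # stop-word removal
    tokens = re.findall(r"[a-z]+", stripped)        # tokenization
    return [naive_stem(t) for t in tokens]          # stemming


def generate_keywords(name):
    """Stemmed tokens plus any transliterated tokens not already present."""
    stems = preprocess(name)
    keywords = list(stems)
    for stem in stems:
        romanized = TOY_TRANSLITERATION.get(stem, stem)
        if romanized not in keywords:
            keywords.append(romanized)
    return keywords


if __name__ == "__main__":
    print(generate_keywords("Nepal Metals Industries"))
    print(preprocess("Royal Metal Nepal Pvt.Ltd."))
    print(preprocess("Nepal Dhatu Industries"))
```

Under these assumptions, `generate_keywords("Nepal Metals Industries")` yields `['nepal', 'metal', 'dhatu']`, while `preprocess` applied to the two database examples yields `['royal', 'metal', 'nepal']` and `['nepal', 'dhatu']`, matching the keyword lists shown in this appendix.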
APPENDIX D: Comparison Detail
Figure 17 Comparison I (Part A)
Figure 18 Comparison I (Part B)
Figure 19 Comparison II (Part A)
Figure 20 Comparison II (Part B)
APPENDIX E: Output Screenshot
Figure 21 Output Screenshot
APPENDIX F: Data Flow Diagram
Figure 22 Data Flow Diagram
APPENDIX G: Theory
Hungarian Algorithm
The Hungarian method assigns jobs by one-to-one matching to identify the lowest-cost
solution. Each job must be assigned to exactly one machine. It is assumed that every
machine is capable of handling every job, and that the costs or values associated with each
assignment combination are known and fixed. The number of rows and columns must be the
same. The algorithm is as follows.
1. Arrange the information in matrix form, with the elements of String 1 along the left,
the elements of String 2 along the top, and the Levenshtein distance for each pair in the
corresponding cell.
2. Ensure that the matrix is square by adding dummy rows/columns if necessary.
Conventionally, each element in a dummy row/column is set to the largest
number in the matrix.
3. Reduce the rows by subtracting the minimum value of each row from that row.
4. Reduce the columns by subtracting the minimum value of each column from that column.
5. Cover the zero elements with the minimum possible number of lines. If the number of
lines is equal to the number of rows, go to step 9.
6. Add the minimum uncovered element to every covered element. If an element is covered
twice, add the minimum uncovered element to it twice.
7. Subtract that same minimum value from every element in the matrix.
8. Cover the zero elements again. If the number of lines covering the zero elements is not
equal to the number of rows, return to step 6.
9. Select a matching by choosing a set of zeros such that each row and each column has
exactly one selected zero.
10. Apply the matching to the original matrix, disregarding dummy rows.
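The assignment procedure above can be illustrated with a short Python sketch. It builds the Levenshtein cost matrix and pads it with the largest distance as in steps 1 and 2, but then solves the assignment with the potentials-based O(n^3) formulation of the Hungarian method rather than the line-covering steps listed here; both formulations find a minimum-cost matching. The function names `levenshtein`, `hungarian`, and `match_tokens` are illustrative assumptions and do not come from the project code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hungarian(cost):
    """Minimum-cost assignment for a square matrix; returns (row, col) pairs."""
    n = len(cost)
    inf = float("inf")
    u = [0] * (n + 1)    # row potentials (1-indexed)
    v = [0] * (n + 1)    # column potentials (1-indexed)
    p = [0] * (n + 1)    # p[j] = row matched to column j, 0 if none
    way = [0] * (n + 1)  # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, n + 1)]


def match_tokens(tokens_a, tokens_b):
    """Pad to a square matrix, match tokens, and drop dummy rows/columns."""
    rows, cols = len(tokens_a), len(tokens_b)
    n = max(rows, cols)
    real = [[levenshtein(a, b) for b in tokens_b] for a in tokens_a]
    largest = max((d for row in real for d in row), default=0)
    cost = [[real[i][j] if i < rows and j < cols else largest
             for j in range(n)] for i in range(n)]
    pairs = sorted((i, j) for i, j in hungarian(cost) if i < rows and j < cols)
    return pairs, sum(real[i][j] for i, j in pairs)


if __name__ == "__main__":
    print(match_tokens(["nepal", "metal", "dhatu"], ["royal", "metal", "nepal"]))
    print(match_tokens(["nepal", "metal", "dhatu"], ["nepal", "dhatu"]))
```

For the keyword lists of Appendix C, the first call pairs nepal with nepal, metal with metal, and dhatu with royal for a total distance of 5; the second call pairs nepal with nepal and dhatu with dhatu, leaving metal matched only to the dummy column.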
Procedure of Metaphone Phonetic Algorithm
Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The
'0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the
others represent their usual English pronunciations. The vowels AEIOU are also used, but
only at the beginning of the code. The following list summarizes most of the rules in the
original implementation:
1. Drop duplicate adjacent letters, except for C.
2. If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
3. Drop 'B' if after 'M' at the end of the word.
4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of
'-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E',
or 'Y'. Otherwise, 'C' transforms to 'K'.
5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if
followed by 'N' or 'NED' and is at the end.
7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G'
transforms to 'K'.
8. Drop 'H' if after vowel and not before a vowel.
9. 'CK' transforms to 'K'.
10. 'PH' transforms to 'F'.
11. 'Q' transforms to 'K'.
12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if
followed by 'CH'.
14. 'V' transforms to 'F'.
15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
17. Drop 'Y' if not followed by a vowel.
18. 'Z' transforms to 'S'.
19. Drop all vowels except a vowel at the beginning of the word.
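A few of the rules above can be sketched in Python to show how such a code is derived. The function below is a deliberately partial illustration, not a complete Metaphone implementation: it applies only rules 1, 2, 3, 9, 10, 11, 14, 18 and 19, passes every other letter through unchanged, and applies the rules as successive whole-string rewrites, which is an assumption made for brevity. The name `partial_metaphone` is hypothetical.

```python
import re


def partial_metaphone(word):
    """Apply a subset of the Metaphone rules listed above (illustrative only)."""
    w = re.sub(r"[^A-Z]", "", word.upper())
    # Rule 1: drop duplicate adjacent letters, except for C.
    w = re.sub(r"([^C])\1+", r"\1", w)
    # Rule 2: drop the first letter of an initial KN, GN, PN, AE or WR.
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):
        w = w[1:]
    # Rule 3: drop B after M at the end of the word.
    if w.endswith("MB"):
        w = w[:-1]
    # Rules 9 and 10: CK -> K, PH -> F.
    w = w.replace("CK", "K").replace("PH", "F")
    # Rules 11, 14 and 18: Q -> K, V -> F, Z -> S.
    w = w.translate(str.maketrans("QVZ", "KFS"))
    # Rule 19: drop all vowels except one at the beginning of the word.
    return w[:1] + re.sub(r"[AEIOU]", "", w[1:])


if __name__ == "__main__":
    for example in ("phone", "knock", "quiz", "lamb", "wrap", "apple"):
        print(example, partial_metaphone(example))
```

With this subset, "phone" becomes FN, "knock" becomes NK, "quiz" becomes KS, "lamb" becomes LM, "wrap" becomes RP, and "apple" becomes APL. Words that depend on the omitted rules (for example those containing 'TH', 'GH' or a soft 'C') would need the remaining rules to be encoded correctly.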