TRIBHUVAN UNIVERSITY
Institute of Engineering
Pulchowk Campus
Department of Electronics and Computer Engineering

A Report on Artificial Intelligence
Title: Fuzzy Logic String Search

Submitted by
Aayush Shrestha (065/BCT/502)

Submitted to
Department of Electronics and Computer Engineering
Pulchowk Campus

Submission Date: August 7, 2012

ACKNOWLEDGEMENT

I would like to express my deep gratitude to Dr. Sashidhar Ram Joshi for including the Artificial Intelligence project, which is expected to give us practical knowledge of implementing Artificial Intelligence in real-life applications. I am grateful to him for his continuous suggestions, ideas, and encouragement to do a project that addresses a real-world problem. I would also like to thank all my friends for their ideas and help.

ABSTRACT

Fuzzy Logic String Search is a web plugin written in JavaScript that uses the “Levenshtein Edit Distance Algorithm” to search for appropriate string suggestions from a dictionary file that is stored in a “trie structure”. As the user types in the input form, the plugin searches the dictionary for the list of words that begin with the typed prefix. The Levenshtein edit distance algorithm is then used to filter the list so that only the options within the given edit-distance restriction are shown.

INTRODUCTION

In computer science, fuzzy string searching is the technique of finding strings that match a pattern approximately (rather than exactly). It provides a powerful way of comparing two strings and measuring the similarity between them. If two strings are the same, the similarity between them is 1, i.e. 100%. If the strings are different, the similarity between them is less than 1, i.e. there is a certain distance between them.
For example, if we use the fuzzy string searching algorithm to find the similarity between the strings ‘John’ and ‘Jon’, the similarity between them is 0.6875. Fuzzy Logic String Search can be customized to compare not only strings but also other kinds of data, so it can help with the calculation and comparison of data in various other fields.

PROJECT OBJECTIVE

This project is concerned with finding similarities between strings. With this program we can find possible alternatives for a string that has been entered incorrectly. The major objectives of the project are as follows:

1) To find the similarity between two strings.
2) To find possible alternatives for the entered string if the entered string is wrong.
3) To use a fuzzy string searching algorithm.
4) To utilize the “Trie Structure” to store the words for a better search mechanism.

METHODOLOGY

FLOWCHART

1. Start.
2. Enter letters of string.
3. Search the dictionary and list all the words that start with the input prefix.
4. Apply the Levenshtein algorithm to find the edit distance between the input string and each of the listed words from the dictionary.
5. Is the edit distance less than 4? If so, display the word.

DESCRIPTION

The program starts reading the text as soon as the user starts typing it. The program then reads a dictionary file containing a large list of valid English words, which are compared with the entered string in alphabetical order. The program also keeps a count of the total number of strings read. The comparison between two strings, one entered by the user and the other obtained from the mini library, is done by recording the number of hits, the cost of each hit, and the distance between the two strings. After calculating the hits using the cost and the distance, we find the similarity value between the two strings. The similarity value ranges from 0 to 1. A similarity value of 1 means that the two strings are exactly the same.
The greater the similarity value, the more similar the two strings are. We thus determine at most the top ten strings, i.e. those with the ten highest similarity values with respect to the given string.

Trie Structure

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

Levenshtein Edit Distance

The Levenshtein Edit Distance (LED) is the number of edits (insertions, deletions, and substitutions) required to transform a string (A) into another string (B). Among other levels, edit distances can be computed at the level of letters, words, phrases, or even passages. Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-phone) insertion, deletion, and substitution. In one weighted version, substitutions cost two units except when the source and target are identical, in which case the cost is zero, and insertions and deletions cost half that of substitutions. The matrix procedure described below uses unit costs instead: a comparison contributes 0 when the characters match and 1 when they differ, and each insertion or deletion counts as one edit. This demonstration illustrates a simple algorithm that looks at all of the different ways in which the operations can transform one string into another.

For example, the Levenshtein distance between "George" and "Geordie" is 2, since the following two edits change one into the other, and there is no way to do it with fewer than two edits:

1. George → Georde (substitution of 'd' for 'g')
2. Georde → Geordie (insertion of 'i' between 'd' and 'e')

The edit distance is calculated by forming a matrix of size (m+1)x(n+1), where m is the length of the first word and n is the length of the second word (in the "George" vs. "Geordie" example above, the matrix would measure 7 by 8). Note the "plus one": the extra row and column hold the "seed" values that kick everything off and allow a very simple comparison later. The first column and row are not modified later, as the loops start at the column and row indexed by one, not zero.

While looping through the strings, the indexes of both strings are the coordinates of the matrix. Note that the string indexes and the matrix coordinates will be off by one, as the matrix coordinates start at one.

The actual calculation looks at three cells: the cell to the left, the cell above, and the cell to the upper left. One is added to the value from the left and to the value from above (an insertion or a deletion), the result of the character comparison is added to the value from the upper left, and the minimum of the three candidates is taken. Once the matrix is initialized, the remainder of the values may be filled in. Once the entire matrix is filled in, the answer is in the lower right-hand cell.

For each comparison, either 0 or 1 is the result. If the two characters are the same, the result is 0; if they are different, the result is 1. This is used in generating one of the three candidates whose minimum becomes the value inserted into the next cell of the matrix.

The result of comparing the 'G' to the 'G' would then be 0. The cell above has a value of 1, the cell to the left has a value of 1, and the value of the cell one up and one to the left plus the result of the comparison is 0. Adding one to each of the first two gives 2 and 2, so feeding 2, 2, 0 into the three-way minimum results in 0, which is then placed into the cell, resulting in a matrix with these values:

Figure 2: Matrix after the first iteration

The result of the first pass would then be:
Figure 3: Matrix after the first pass

Letting the loops run to completion results in a matrix that looks like:

Figure 4: The completed matrix

And the result is 2, as predicted originally.

OUTPUT

AREAS OF APPLICATION

This program is a web plugin and can be applied to any webpage to provide an autocomplete feature in input forms. It is easy to use and user friendly, and it can increase the efficiency of users. It can be used in search forms, and a smaller, more specialized dictionary can be used to provide the user with the possible options for the input.

LIMITATIONS

The project has a few limitations, which are listed below:

1. This demo plugin works only for a single string, i.e. the input form can handle only one string at a time and show the possible options for it.
2. The algorithm is not as optimized as it could be, so there are some performance issues, including slow speed and occasional hang-ups.
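
The method described in this report (a trie lookup of the words that start with the input, followed by a Levenshtein filter using the flowchart's threshold of 4) can be sketched in JavaScript, the language the plugin is written in. This is only a minimal sketch under stated assumptions: the names Trie, wordsWithPrefix, levenshtein, and suggest, the Map-based node layout, and the sample word list are hypothetical and are not taken from the plugin's source.

```javascript
// Illustrative sketch only; all names here are assumptions, not the plugin's code.

class Trie {
  constructor() {
    // The root corresponds to the empty string.
    this.root = { children: new Map(), isWord: false };
  }

  // Walk down from the root, creating one node per character.
  insert(word) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    node.isWord = true; // mark the node that completes a dictionary word
  }

  // Return every stored word that starts with the given prefix.
  wordsWithPrefix(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (!node) return []; // no stored word has this prefix
    }
    const results = [];
    const collect = (current, word) => {
      if (current.isWord) results.push(word);
      for (const [ch, child] of current.children) collect(child, word + ch);
    };
    collect(node, prefix);
    return results;
  }
}

// Unit-cost Levenshtein distance using an (m+1) x (n+1) matrix.
function levenshtein(a, b) {
  const m = a.length;
  const n = b.length;
  const d = [];
  // Seed values: column 0 holds 0..m and row 0 holds 0..n.
  for (let i = 0; i <= m; i++) {
    d.push(new Array(n + 1).fill(0));
    d[i][0] = i;
  }
  for (let j = 0; j <= n; j++) d[0][j] = j;

  // The loops start at 1, so the seed row and column are never modified.
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1; // string indexes are off by one
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // cell above
        d[i][j - 1] + 1,       // cell to the left
        d[i - 1][j - 1] + cost // cell to the upper left plus comparison result
      );
    }
  }
  return d[m][n]; // the answer is in the lower right-hand cell
}

// List the words that start with the input, then keep those whose
// edit distance from the input is below the threshold (4 in the flowchart).
function suggest(trie, input, maxDistance = 4) {
  return trie
    .wordsWithPrefix(input)
    .filter((word) => levenshtein(input, word) < maxDistance);
}

// Usage with a hypothetical five-word dictionary.
const dictionary = new Trie();
for (const word of ["george", "geordie", "geometry", "geo", "john"]) {
  dictionary.insert(word);
}
console.log(levenshtein("George", "Geordie")); // 2
console.log(suggest(dictionary, "geo"));
```

Because every candidate returned by the trie already starts with the input, the distance computed in suggest reduces to the number of extra letters in the candidate, so the threshold effectively limits how much longer than the input a suggestion may be. Storing the children in a Map is one possible design choice; a plain object or a fixed-size array per node would serve the same purpose.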