5 Algorithms Used for Searches

advertisement
5 Algorithms Used for Searches
In order to complete a search for the user’s desired information, special
algorithms have been devised. These heuristic algorithms were created by adding
additional steps to a classical LCS search technique. Additionally, special string
representations for character decompositions had to be created. There are various points
throughout the algorithm where further heuristic study could be used to determine the
parameters needed to get ideal results.
5.1 Overview of eRE Transformation
The user will input a string, which may fall into one of two categories. The first
category is that it is a standard regular expression. This would include things as varied as
‘[語翻方]{1:3}’ or ‘house’. The second category of input would be for the user to input
an eRE. The grammar of the eRE is described in chapter 4.
If the user inputs a standard regular expression, then the system will simply take
the input as-is, and search for occurrences of that regular expression within EDICT. On
the other hand, if the user input is not a standard regular expression, then the process
described below must be followed.
5.1.2 Parsing of Input String
The input string is taken from the user, and passed to <Function name>. This
function serves as a parser for extended regular expressions. Actually, this function
receives all input strings, and it is within this function that the decision as to whether to
call other functions is made. If the eRE escape sequence is found, then the string is
passed to the transformation functions. If not, then they are skipped, and the standard
regular expression is used.
The first step in the parsing process is to proceed recursively through the input
string. This allows the nesting of escape sequences. The recursive steps taken here
follow the grammar described in figure XXX.
S = KK
K = KQK | 
Q = reg:reg  Valid Regular Expressions
Figure 1
 is the production described in figure XXX of chapter 4.
For example, the character 語 could be represented as \k(LR<言:吾>) without
using recursion. However, one possible extension to this serialized description would be
to use the following: \k(LR<言:\k(TB<五:口>)>). This use of nesting gives a much more
fundamental idea of the makeup of the character. As it turns out, the parsing step of the
input can be used as a sort of “master function” for the transformation process. In doing
so, there need be no changes to any other components besides this to allow recursive
nesting.
The mechanism that allows this is quite simple. Recursive parsing of the input
string will go to the most highly-nested portion of the escape sequence first. In this
example, that would be the \k(TB<五:口>) portion of the input. This substring would be
passed along through the procedures described in the following sections. Upon returning,
a set containing the best match(s) would be put in its place. This would leave the string
\k(LR<言:[吾]>) to be parsed. The process used for coming to [吾] as the “best match”
is described in the following sections.
5.1.3 Expanded String Creation
Each two-dimensional character contained in the decomposition dictionary will
eventually be used to create a serialized representation of itself. During the course of the
hierarchical score computation, each escape sequence entered by the user is compared to
each of the characters in the decomposition dictionary. This comparison is described in
full detail in the “Hierarchical Score Computation” section.
When a comparison needs to be made, the appropriate character is pulled from the
decomposition dictionary. The representation of this character, which varies based on
which class of character it is, is generated by the object containing the decomposed parts
of the character. For example, the character 明. This character is not stored in its
complete form. Instead, it is contained within a left-right object (See figure below). This
object contains a unit object, 日, in its left field, p’, and a unit object, 月, in its right field,
p’’. The string returned also contains a the whole character, 明,in the key field, c.
[c]f<[p’]:[p’’]>
cC, fF, p’,p’’F
Figure 2
When called on for a comparison, the left-right object first calls its left object, e 
A, (which need not be a unit object- this process may continue recursively if defined as
such). The left object will return its own string representation. The same process occurs
on the right object. These two representations are placed in the proper location in the
left-right string structure.
明 LR<日:月>
Figure 3
In addition, the key used to find this left-right object will be placed in the string
representation. The key is the character that has been decomposed. So, for this example,
the final string that is used to represent the serialized form of 明 is shown in figure XXX
above. This string is returned to the requester for the comparison to be made.
5.1.4 Hierarchical Score Computation
The algorithm used to compute hierarchical scores requires that the eRE in
question be compared to each extended string representation in the dictionary. The
algorithm in place is actually a heuristic algorithm, as there is no deterministic way to
determine what is the “best match”. This limitation was considered in the final version
used, as it is possible to modify the criteria of search quite easily by changing a few
variable values.
The first step in the score computation is to make a call to the
“commonCharacters” function. This function takes the input string and compares it with
a string from the decomposition dictionary. A simple count is done on each character
found in the input string to see how many times it occurs in the dictionary string.
Example XX: User input: ‘cat’ Dictionary String: ‘attached’
The algorithm will go one character at a time through the input string, in this case ‘cat’. A
running sum of the number of matches is kept, and at the end of the input string is
returned to the user. The results from example XXX are shown in table XXX.
Iteration
Current letter
Matches for
Running total
Total common
this letter
of sum
characters
0
‘c’
1
1
1
‘a’
2
3
2
‘t’
2
5
Done
5
Table 1
This score is stored as part of a tuple. The next part of the tuple is calculated as
the normalized longest common subsequence found between the input and each character
in the decomposition dictionary. The length of LCS found is divided by the length of the
decomposition string in order to make a fair comparison. This is because, for matching
purposes, the ratio of the string that is in the LCS is a better indicator of relevance than
the absolute length. In other words, an LCS of 8 on a string of length 24 would only
score 8/24=0.333, while an LCS of 7 on a string of length 10 would score 7/10=0.70.
This would cause the LCS of length 7 to be ranked as a better match than the longer, but
less relevant, LCS of length 8.
For both the common character computation and the longest common
subsequence computation, there are possible pre-processing steps. Both computations
search for any indexes that are used to represent unprintable characters. For those input
by the user, the index number will be preceded by the “#” symbol. Those in the
decomposition dictionary are not preceded by ‘#’. The pre-processing step takes place
before moving through the list that has been provided to each function. The list is
searched for either the […..,’#’,’n’,’n’…] or […..,’#’,’n’,’n’…] , (where ‘n’ represents a
number from 0-9) which will be changed to […..,’nn’,…] in the list. This allows the LCS
and Common character computations to proceed without any in-line changes.
U = f<a’:a’’>
V = f’<a’’’:a’’’’>
Count =  count(u  V)  u  U
Figure 4
Additionally, the LCS function has two other pre-processing steps. The first is to
allow the use of the ‘*’ character inside eRE escape sequences. This pre-processing
simply replaces the ‘*’ character with the character in the same position in the
decomposition dictionary string. The other pre-processor is used to allow the use of
‘[ABC]’ (the [ ] brackets) within the eRE escape sequences (where A, B, and C are each
an individual character or index of an unprintable character). This is accomplished by
checking the character in the corresponding position of the decomposition dictionary
string for membership within the set (characters between the [ ]) entered by the user. If
there is a match, then the matching character is put in place of the set.
These pre-processing steps work on the LCS because the iteration amongst all of
the strings from the decomposition dictionary takes place in the function that calls LCS().
Because of this, the strings input to the function can be altered as needed, as on the next
LCS call, the string will be passed again in its original state, without any of the changes
made by the pre-processing steps of the previous iteration.
5.1.5 Finding Best Match from Scores
The scheme used for locating the “best match” found in the searching algorithm
described above is a simple process. The results from the above steps are stored in a list
that contains the character found in the decomposition dictionary, the common character
score, and the LCS score. These list entries are sorted in hierarchical order, with the
common character score as the major, and LCS as the minor criteria. To accomplish this
order, they are first sorted by LCS, then the result is sorted by common character. The
approach was inspired by radix sort.
Sort (L):
Given a List, L, composed of many l: l = (c, count, lcs)  c  C
Sort = sort(L by lcs)
Sort’=sort(Sort by count)
Return Sort’
Figure 5
The current implementation uses only those in the highest tier of the rankings as
the acceptable results. It is, however, possible to expand these by altering the selection
criteria for what strength of match will be accepted. The purpose behind the hierarchical
sort can be seen by examining the table of actual search results below. There are some
less-relevant search results that gain higher LCS scores than some in the highest tier
(most common characters level). However, these less-relevant results are ignored because
they do not appear on the highest level of the common character search.
INSERT TABLE FROM SEARCH RESULTS BELOW. MODIFY THE
EXPLANATION ABOVE TO FULLY DESCRIBE IT.
5.1.6 Creating and Using RE Generated by Best Match
After building the table of hierarchical results from which the “best matches” are
selected, the regular expression used for the dictionary search must be created. At this
point it important to note that while the aim of the process described in this paper is to
avoid the rigidity of being required to use a regular expression for the search, that
generating a regular expression, or some other concrete expression is in fact required.
Computers do not have a sense of “judgment” whereby they can decide what a user
intended. Instead, a concrete, fully specified data is required for comparison purposes.
The algorithms described in this paper serve as a middle layer that allows end-users more
freedom in specifying what they desire. This is demonstrated in figure XXX.
Given a Sorted List, L, composed of many l: l = (c, count, lcs)  c  C
Select B  L :  bB : b.count = max(L.count)  b.lcs=max(L.lcs)
B is the set of Best Matches
Figure 6
To create the regular expressions, the characters with the highest LCS score are
selected from the highest tier of common characters score (Figure 6). Often, there will be
multiple characters found that constitute a “best match”. The set of possible characters is
returned from the best match function as a list. This list is then formed into a string with
‘[‘ and ‘]’ on either side. For example, a list of best Matches, B, with: B=[‘言’,’語’,’読’]
will be transformed into the string ‘[言語読]’. This can then be presented to the user if
being used as part of an input method, or passed to a regular expression engine for
matching against a dictionary, as is done in this implementation. For example, if the user
had searched for:
日本\k(LR<言:*>)
This might be returned as 日本[言語読]. This regular expression could then be
compared with the EDICT dictionary, resulting in the user finding the desired word,
without knowing the full kanji compound.
5.2 Matching Algorithm
The combination of steps described in section 5.1 is used in combination to form
the matching algorithm. The first step, parsing of the input string, will proceed
recursively to the most highly nested eRE escape sequence within the user input. This
escape sequence will then go one at a time through the steps that follow, until finally a
best match is found for the escape sequence. When this best match set, BM^k, is
returned, the string representing this set will be inserted in place of the escape sequence
of the original string, and the string parsing will continue. An example of how this
proceeds can be seen in figure XXX.
Input: \k(LR<言:\k(TB<五:口>)>)
Recursive parser first finds: \k(TB<五:口>)
BM=BestMatch(\k(TB<五:口>)) = [吾]
Now Repeat:
Recursive parser resumes:
\k(LR<言:BM>) = \k(LR<言:[吾]>)
BM’ = BestMatch(\k(LR<言:[吾]>)) = [語]
Figure 7
5.3 Imperfect “Fuzzy” Matching: Why it Works
The goal of this project, returning what a user means, not what he says, without
the use of actual intelligence is tricky. The problem is that inputs tend to be specific and
concise. Narrow inputs to searches tend to provide either narrow outputs (exactly what is
said) or overly broad, irrelevant outputs.
My algorithm’s approach to this problem is to have the user serialize the input to
describe what it is that he wishes to find. This serialization contains important structural
information on the characters that cannot be extracted from the Unicode encoding. Once
the user has input the serialized version, one side of the comparison is complete.
The second side of the comparison that must be made is to compare the characters
from the CJK language to what the user has provided. This process takes a considerable
amount of data preparation to execute. This is because Unicode simply treats each
character as an atomic unit, and is not organized in such a way to provide useful
structural information about individual characters. Therefore, the fuzzy matching system
must have a system in place to serialize the characters in the Unicode dictionary,
hopefully in a somewhat similar fashion to the way in which the user serialized the
characters.
Following this approach, there will be few if any perfect matches for a search.
This is because a user will oftentimes not serialize a character in the same manner that the
Unicode dictionary serialization was done. However, if the user has followed some
simple guidelines regarding general structure, there should be some significant
similarities between the user’s attempt at serializing a character and the serialization of
the character used by the search application. Additionally, in the case that the user is able
to fully specify the sub-characters used in the serialization, there should be few, if any,
instances of the user-input serialization being a better match to a non-desired character
than to the desired character. The case where the user is able to fully input all of the subcomponents of his desired character should result in a single best match.
In the case where the user is unable to fully specify all of the sub-components of
the character he desires, the results will be somewhat different. Instead of one match
alone having the highest score, large numbers of partial matches will be found. It is
likely that many of these matches will have the same hierarchical score. These partial
matches will then be filtered with the “best match” algorithm into hierarchical results.
This algorithm allows for the discovery of results that do not perfectly match the search
input, but at the same time have definite relevance.
This approach of increasing the amount of relevant information will tend to lower
the overall scores. The reason for this is obvious, as the more information two
independent sources (user and application) provide to describe an object, the more
divergent their descriptions will be. However, at the same time that the relevant results
have their scores diluted, the scores of non-relevant characters will also be diluted.
An advantage of this approach is that searches can be made on incompletely
specified objects. In my implementation, this means that users can search for matches
using only parts of Chinese characters. This approach allows the algorithm to seem to
intelligently determine what the user intended to find.
Download