An Efficient String Transformation Technique for User Input Query

International Journal of Engineering Trends and Technology (IJETT) – Volume 19 Number 1 – Jan 2015
Javvaji Vamshi Krishna 1, P. Prabhakar 2, C.P.Y.N.J. Mohanarao 3
1 Final M.Tech Student, 2 Assistant Professor, 3 Principal
1, 2 Department of Information Technology, Department of CSE
Avanthi Institute of Engineering and Technology, Makavarapalem, Visakhapatnam.
Abstract: Probabilistic string transformation is an important research topic in natural language processing and search engine optimization. Although various traditional approaches are available for string transformation and candidate set generation, they are not efficient because of the parameters they rely on. Accuracy and efficiency are the basic parameters to optimize during the generation of candidate sets or output strings. Prior to generation, we consider the set of likely keywords that meet a minimum Levenshtein distance. In this paper we propose an efficient model of string transformation for candidate set generation and query reformulation: an evolutionary approach that corrects misspelled words using edit distance and index-based comparison, either by single character or by substring.
I. INTRODUCTION
In informal or formal languages, word transformation, correction, and term generation can be systematically formulated as string transformation. It is also used in query generation and regeneration. It is deployed in many applications, such as online applications, where transformation techniques must work accurately and efficiently. String transformation can be explained as follows.
Given an input string and a set of operators, string transformation generates the most likely output strings by applying the operators. A string can be a sequence of words, characters, or other tokens. Each operator is a transformation rule that defines the replacement of a part of the string with another string. String transformation falls into two settings, depending on whether or not a dictionary is used. If a dictionary is used, the output strings must be present in the given dictionary. These techniques are mainly used in web applications. In many applications the first task is tokenization, and the second task operates on the tokenized terms of the string.
ISSN: 2231-5381
In spelling correction, processing the query usually consists of two operations: candidate generation and candidate selection. Both are used to find the words that are similar to the misspelled word given by the user. The operators represent manipulations such as insertion, deletion, and substitution (updating) of neighboring characters. This is an example of string transformation.
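The three operators named above (insertion, deletion, substitution) are the ones counted by the Levenshtein edit distance. The following is a minimal Python sketch of that distance, given only for illustration; the function name `edit_distance` is an assumption and does not come from the paper.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For a misspelled query such as "serach", the distance to the dictionary word "search" is 2, because the two swapped characters count as two substitutions.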
Previous work on string transformation can be categorized into two groups. Some work mainly considered efficient generation of strings, assuming that the model is given. Other work tried to learn the model with different approaches, for example a generative model, a logistic regression model presented in previous works, and a discriminative model. However, efficiency is not an important factor taken into consideration in these methods.
In contrast, our work in this paper aims to learn a model for string transformation which can achieve both high accuracy and efficiency. There are three major issues with string transformation:
(1) How to define a model which can achieve both high accuracy and efficiency.
(2) How to accurately and efficiently train the model from training instances.
(3) How to efficiently generate the top k output strings given the input string, with or without using a dictionary.
String transformation has many applications in data mining, natural language processing, information retrieval, and bioinformatics. String transformation has been studied in different specific tasks, such as database record matching, spelling error correction, query reformulation, and synonym mining. The major difference between our work and the existing work is that we focus on improving both the accuracy and the efficiency of string transformation.
http://www.ijettjournal.org
Although various traditional approaches are available for candidate set generation, they are not optimal because they lack index-based comparison. Occurrence-based comparison does not give accurate results because it does not compare characters according to their index, and it does not work for a small keyword because of the edit operations. We therefore need an efficient query reformulation technique.
II. RELATED WORK
Although various approaches are available for string transformation, they are not optimal in terms of accuracy. Traditional approaches such as the Levenshtein distance measure suffer from time complexity issues, because the distance is computed against every keyword in the dictionary. The log-linear model compares the occurrences of characters in the source string and the target string, but not the indexes of those characters.
In this paper a new approach to string transformation is proposed that achieves high efficiency and high accuracy and is also suited to large amounts of string data. To learn the model, a large number of input and output string pairs are provided as sample data, together with a set of operators for transforming strings. A probabilistic model is then obtained from the input and output string pairs in the sample dataset, and it scores the candidate output strings for an input string. The best candidate is defined as the one with the highest probability with respect to the sample dataset.
The approach consists of two processes: learning and generation. In the learning process, a set of rules is defined and extracted from the sample dataset, and the string transformation model is built from these rules and the bulk of the dataset. In the generation process, given a new input, the top n candidates among all candidates are produced using the rule set obtained in the learning process. Finally, the string transformation model represents the rules and their weights as a linear model; learning uses maximum likelihood estimation on the sample dataset, and generation is conducted efficiently.
Without loss of generality, the amount of rule-based data supplied to the learning process is predefined. As a consequence, the number of possible outcomes of a string transformation is also restricted, which means that the difference between the input and the output must not be large.
III. PROPOSED WORK
In this paper we propose an efficient string transformation technique that identifies the possible correct keywords, extracts the likely keywords from the dictionary, and then computes the edit distance over the likely keywords only. Query reformulation is enhanced by an elimination word bag, and for optimal results we propose index-based comparison to retrieve the top candidate sets for the input query.
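The idea of computing the edit distance only over likely keywords can be sketched as follows. The paper does not state how likely keywords are selected, so the length-difference filter used here is an assumption (it is safe because the edit distance between two strings is never smaller than the difference of their lengths). The names `likely_keywords` and `candidate_set`, the default threshold, and the sample dictionary are all hypothetical.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between source and target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def likely_keywords(source, dictionary, threshold):
    """Assumed pre-filter: keep words whose length differs from the
    source by at most threshold, since no other word can be within
    that edit distance."""
    return [w for w in dictionary if abs(len(w) - len(source)) <= threshold]


def candidate_set(source, dictionary, threshold=2):
    """Return likely keywords within the edit-distance threshold,
    closest first (ties broken alphabetically)."""
    scored = []
    for word in likely_keywords(source, dictionary, threshold):
        distance = edit_distance(source, word)
        if distance <= threshold:
            scored.append((distance, word))
    scored.sort()
    return [word for _, word in scored]


if __name__ == "__main__":
    words = ["search", "reach", "searching", "sea", "march"]
    print(candidate_set("serach", words, threshold=2))
```

With this filter, "searching" and "sea" are discarded before any distance is computed, which is the saving the likely-keyword step is meant to provide.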
Accuracy is measured as the number of correct candidate sets generated for a user query or keyword, compared between the previous approach and the proposed approach and presented graphically. A number of candidate sets can be generated for the user query. The misspelled user query or keyword is first corrected by the evolutionary approach, in which random characters or substrings are substituted with ones that are available in the dictionary dataset.
Index based comparison
In the traditional approach, the source string and the target string are compared by the occurrence of characters, not by the index at which each character occurs. In this project we compare every character of the source string with the character at the same index of the target string, and the two must be identical. Initial priority is given to the dictionary target string of the highest order.
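The difference between occurrence-based and index-based comparison can be illustrated with a short Python sketch. Both function names are hypothetical, and the occurrence count is modelled here as a multiset intersection, which is an assumption about what the traditional comparison computes.

```python
from collections import Counter


def occurrence_matches(source: str, target: str) -> int:
    """Count characters shared by both strings, ignoring position."""
    shared = Counter(source) & Counter(target)  # per-character minimum counts
    return sum(shared.values())


def index_matches(source: str, target: str) -> int:
    """Count positions i at which source[i] == target[i]."""
    return sum(1 for s_char, t_char in zip(source, target) if s_char == t_char)


if __name__ == "__main__":
    # Same letters in a different order: occurrence-based comparison
    # sees a perfect match, index-based comparison does not.
    print(occurrence_matches("listen", "silent"), index_matches("listen", "silent"))
```

Because "listen" and "silent" contain the same letters, the occurrence count cannot tell them apart, while the index-based count finds only one matching position.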
Evolutionary Approach
Input:
Source string 'S'
Dict_words D (w1, w2, ..., wn)
Likely_set L (l1, l2, ..., ln)
Output:
Candidateset_List C (c1, c2, ..., cn)
Step 1: Find the likely keywords for the source string and compute the edit distance.
Step 2: Get the minimum number of edit operations between the input string and each likely string (l1, l2, ..., ln).
Step 3: If (min_editdistance <= Threshold value)
Add li to C.
Step 4: Compute Query_Reformulation (S, Wi).
Step 5: For i = 0; i < D.length; i++
Counter = 0;
If String_Diff (S, Wi) <= Threshold value
Add Wi to C
End if
Next
Step 6: Store the candidate sets for 'S'.

Index Based Comparison
The following pseudo code shows index based comparison and maintains the order with respect to the source and target string.
String_Diff (S, W)
Compare_counter := 0
For i = 0; i < S.length; i++
If S[i] == W[i] Then
Compare_counter := Compare_counter + 1
End if
Next
If Compare_counter > Threshold value then
Add order wise Wi to C
End if

To generate more accurate and efficient candidate sets, we perform index-based comparison between the source string and the target string. It compares each character together with its index; if the characters at an index are equal, the match for that index is set to '1'. This process continues until the end of the source string is reached.

Query Reformulation
In query reformulation, we can generate the output string "Institute of Electrical and Electronics Engineers" for the input string "IEEE" by maintaining the semantics of the respective keywords, so that the optimal set of keywords is generated. After the keywords are generated, their semantics are also integrated into the existing output strings.
The traditional query reformulation technique simply tokenizes the string and compares the tokens with the respective keywords, but it fails on additional words such as articles and prepositions. To resolve this issue we maintain a word bag to eliminate the unnecessary keywords.
Void Query_Reformulation (S, Wi)
{
Count = 0;
T := Wi.gettokens()
For int i = 0; i < S.length; i++
If T[i] not available in Eliminate words Then
If S[i] == T[i] then
Count := Count + 1;
End if
End if
Next
If Count == T.count then
Add Wi to C.
End if
}
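The query reformulation step with an elimination word bag can be sketched in Python for the "IEEE" example given in the paper. This is only one possible reading: the contents of `ELIMINATE_WORDS`, the function names, and the rule that each character of the query is matched against the initial letter of the remaining tokens are all assumptions, since the paper does not spell them out.

```python
# Hypothetical elimination word bag: articles, prepositions, conjunctions.
ELIMINATE_WORDS = {"a", "an", "the", "of", "and", "in", "for"}


def content_tokens(phrase: str) -> list:
    """Tokenize a phrase and drop the words found in the word bag."""
    return [t for t in phrase.split() if t.lower() not in ELIMINATE_WORDS]


def matches_expansion(source: str, phrase: str) -> bool:
    """Index-wise check (assumed rule): the i-th character of the source
    must equal the initial letter of the i-th remaining token."""
    tokens = content_tokens(phrase)
    if len(tokens) != len(source):
        return False
    return all(ch.lower() == tok[0].lower() for ch, tok in zip(source, tokens))


def reformulate(source: str, dictionary_phrases: list) -> list:
    """Collect every dictionary phrase that the source can expand to."""
    return [p for p in dictionary_phrases if matches_expansion(source, p)]


if __name__ == "__main__":
    phrases = [
        "Institute of Electrical and Electronics Engineers",
        "Internet Engineering Task Force",
        "Institute of Physics",
    ]
    print(reformulate("IEEE", phrases))
```

Without the word bag, the first phrase has six tokens and could not be aligned index by index with the four characters of "IEEE"; removing "of" and "and" leaves four tokens whose order matches the query.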
Accuracy Computation
Accuracy is calculated as the number of correct candidate sets generated for a user query or keyword, compared between the previous approach and the proposed approach and presented graphically.
IV. CONCLUSION
We conclude our current research work with efficient candidate set generation: the evolutionary approach generates a larger number of accurate candidate sets for the input query through index-based comparison. Query reformulation is handled efficiently with string tokenization and a word bag that holds the elimination keywords. Our experimental results show better results than traditional string transformation approaches. The current work on evolutionary candidate set generation can be enhanced with a cache for frequently accessed information, which would reduce the time spent on repeated lookups. The results can also be filtered based on the ranking of the possible matched keywords.
BIOGRAPHIES
Javvaji Vamshi Krishna is pursuing an M.Tech at Avanthi Institute of Engineering and Technology, Tamaram (Vill), Makavarapalem (Md), Visakhapatnam (Dist). His areas of interest are data mining, network security, and cloud computing.
P. Prabhakar is working as an Assistant Professor, with 5 years' experience, at Avanthi Institute of Engineering and Technology, Tamaram (Vill), Makavarapalem (Md), Visakhapatnam (Dist). His areas of interest are data mining, network security, and cloud computing.
Dr. C.P.Y.N.J. Mohanarao completed an M.Tech and a Ph.D. He is working as Principal at Avanthi Institute of Engineering and Technology, Tamaram (Vill), Makavarapalem (Md), Visakhapatnam (Dist). His areas of interest are data mining, network security, and cloud computing.