International Journal of Engineering Trends and Technology (IJETT) – Volume 30 Number 1 - December 2015
An Efficient Method for Multiple Key Word Query Search
Using Topic Detection
Govinda Kunapareddi1, C P V N J Mohan Rao2
1Final M.Tech Student, 2Professor
1,2Dept of IT, Avanthi Institute of Engineering and Technology, Narsipatnam, AP, India
Abstract:
In query search operations, identifying results of interest to the user for an input query is an important research issue in the field of search engine optimization, and time complexity is an important factor while searching for a query. Many traditional approaches to multiple-keyword search are based on ranking the difficulty level of the keywords. We therefore introduce a new method of multiple-keyword query searching that is based on the keywords present in the query. Setting difficulty aside and taking into consideration the similarities and features of the keywords, we design a method to search queries efficiently in minimal time.
I. INTRODUCTION
Recent research has addressed the problem of free-form keyword search over structured and semi-structured data. BANKS [1][2] views a database as a graph where the database tuples (or objects) are the nodes and application-specific "relationships" are the edges. For instance, an edge may represent a foreign-key relationship. BANKS answers keyword queries by searching for Steiner trees [3] containing all keywords, using heuristics during the search. [4] uses a related graph-based view of databases. A user query specifies two sets of objects, the "Find" and the "Near" objects, which may be generated using two separate keyword sets. The system then ranks the objects in Find according to their distance from the objects in Near, using an algorithm that efficiently computes these distances by building "hub lists." A drawback of these approaches is that a graph of the database tuples must be materialized and maintained. Furthermore, the important structural information given by the database schema is ignored once the data graph has been built.
Keyword search over XML databases has also attracted interest recently. Florescu et al. [5] extend XML query languages to enable keyword search at the granularity of XML elements, which helps novice users formulate queries. This work does not consider keyword proximity. Other approaches view an XML database as a graph of "minimal" XML fragments and find connections between them that contain all the query keywords; they concentrate on the presentation of the results and use view materialization techniques to provide fast response times. Finally, XRANK [6] proposes a ranking function for the XML "result trees," which combines the scores of the individual nodes of the result tree. The tree nodes are assigned PageRank-style scores [7] offline. These scores are query-independent and, unlike our work, do not incorporate IR-style keyword relevance.
Consider a typical keyword search engine that returns items in its results only if all query keywords are present in the item tuple. For a query containing the keyword "rugged" issued against the laptop database, the answer set may be incomplete, because a laptop that is in fact a rugged portable PC but whose tuple does not contain the keyword "rugged" may not be returned. For instance, the laptop product with ID = 004 is relevant, since ToughBook laptops are designed for rugged reliability, yet it is not returned. As another example, consider the query [small IBM laptop] against the same database. Here the answer set may be loose, as a result that contains all query keywords may still be irrelevant. For instance, the laptop product with ID = 002 contains all the keywords and is therefore returned; however, that laptop is actually not small, and the keyword "small" in the product description does not match the user's intent.
Recently, several entity search engines have been proposed that return entities relevant to user queries even if the query keywords do not match the entity tuples [1, 3, 5, 8]. These entity search engines rely on the entities being mentioned in the vicinity of the query keywords across multiple documents. Consider the above two queries again. Many of the relevant products in the database may not be mentioned often in documents together with the query keywords, and are therefore not returned; if anything, a few popular laptops (but not necessarily relevant ones) may be mentioned across a few documents. Thus, these methods are prone to suffer from incompleteness and looseness in their query results.
II. RELATED WORK
In traditional approaches the keywords are the basis of all operations of the searching process. One such technique is structured robustness: keywords are extracted from the query and mixed with other keywords; the method first searches each single keyword, then finds the difficulty probability of the keywords, and based on the difficulty level it assigns priorities to the keywords in the query. There are also approximation techniques for query search that exploit the fact that the number of attribute values containing at least one query term is much smaller than the number of all attribute values in each entity [9]. In all aspects of the query search process, the time taken to return the results of a query grows with the query.
III. PROPOSED WORK
In our work we introduce grouping of the keywords, and in this search we exclude common grammar (stop) words. After that we find the similarity between the words in the database, using the terms and the frequencies of the words in the database.
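As a rough, illustrative sketch of this keyword-extraction step (the stop-word list and tokenizer below are assumptions made for the example, not part of the paper), common grammar words can be dropped as follows:

import re

# Illustrative stop-word list; the real list used by the system is not specified in the paper.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "for", "in", "on", "and", "or", "to", "with"}

def extract_keywords(query):
    """Tokenize the query and drop common grammar (stop) words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(extract_keywords("small IBM laptop for the office"))
# e.g. ['small', 'ibm', 'laptop', 'office']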
Considerations:
We assume general probability distributions Q on C × T, Q on C and q on T, which measure the probability of randomly selecting an occurrence of a term, a source document, or both. Here C denotes the collection of documents, T the set of terms, n(d, t) the number of occurrences of term t in document d, N(d) the number of term occurrences in document d, n(t) the number of occurrences of term t in the whole collection, and n the total number of term occurrences.
Traditional techniques:
Q(d, t) = n(d, t)/n on C × T
Q(d) = N(d)/n on C
q(t) = n(t)/n on T
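A minimal sketch of these empirical distributions, assuming the counts n(d, t) are available as a small document-by-term matrix (the toy counts below are invented purely for illustration):

import numpy as np

# Toy count matrix: n_dt[d, t] = n(d, t), occurrences of term t in document d.
n_dt = np.array([[2, 0, 1],
                 [1, 3, 0],
                 [0, 1, 4]], dtype=float)

n   = n_dt.sum()          # n: total number of term occurrences in the collection
N_d = n_dt.sum(axis=1)    # N(d): term occurrences in document d
n_t = n_dt.sum(axis=0)    # n(t): occurrences of term t in the collection

Q_dt = n_dt / n           # Q(d, t) = n(d, t) / n  on C x T
Q_d  = N_d / n            # Q(d)    = N(d) / n     on C
q_t  = n_t / n            # q(t)    = n(t) / n     on T

# Each of the three is a proper probability distribution.
assert np.isclose(Q_dt.sum(), 1.0) and np.isclose(Q_d.sum(), 1.0) and np.isclose(q_t.sum(), 1.0)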
LSA (Latent Semantic Analysis) is a fully automatic
mathematical/statistical technique for extracting and
inferring relations of expected contextual usage of
words in passages of discourse. It is not a traditional
natural language processing or artificial intelligence
program; it uses no humanly constructed
dictionaries, knowledge bases, semantic networks,
grammars, syntactic parsers, or morphologies, or the
like, and takes as its input only raw text parsed into
words defined as unique character strings and
separated into meaningful passages or samples such
as sentences or paragraphs.
The first step is to represent the text as a matrix in
which each row stands for a unique word and each
column stands for a text passage or other context.
Each cell contains the frequency with which the
word of its row appears in the passage denoted by its
column. Next, the cell entries are subjected to a
preliminary transformation, whose details we will
describe later, in which each cell frequency is
weighted by a function that expresses both the
word’s importance in the particular passage and the
degree to which the word type carries information in
the domain of discourse in general.[8][10]
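A hedged sketch of this pipeline, using tf-idf weighting as a stand-in for the transformation alluded to above and NumPy’s SVD for the dimensionality reduction (both are assumptions for illustration; the paper does not fix these choices):

import numpy as np

passages = ["rugged laptop for field work",
            "small laptop with long battery",
            "rugged tablet for field engineers"]

# Build the word-by-passage frequency matrix described in the text.
vocab = sorted({w for p in passages for w in p.split()})
counts = np.zeros((len(vocab), len(passages)))
for j, p in enumerate(passages):
    for w in p.split():
        counts[vocab.index(w), j] += 1

# Weight each cell (tf-idf here is only an illustrative choice of weighting function).
df = (counts > 0).sum(axis=1, keepdims=True)
weighted = counts * np.log((len(passages) + 1) / df)

# Truncated SVD gives the low-dimensional "latent semantic" representation.
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]      # one k-dimensional vector per word
passage_vectors = Vt[:k].T * s[:k]   # one k-dimensional vector per passage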
Prevalence of Query Keywords: If the query keywords appear in many entities, attributes, or entity sets, it is harder for a ranking algorithm to locate the desired entities. Given query Q, we compute the average number of attributes (AA(Q)), the average number of entity sets (AES(Q)), and the average number of entities (AE(Q)) in which each keyword in Q occurs. We consider each of these three values as an individual baseline difficulty prediction metric. We also multiply these three metrics (to avoid the normalization issues that summation would have) and create another baseline metric, denoted AS(Q). Intuitively, if these metrics have higher values for query Q, then Q must be harder and have lower average precision. Thus, we use the inverses of these values, denoted iAA(Q), iAES(Q), iAE(Q), and iAS(Q), respectively [9].
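A small sketch of these baseline predictors, assuming a toy entity collection where every entity carries an entity-set name and a bag of attribute values (this data layout is an assumption made only for illustration):

# Toy collection: each entity has an entity set and attribute -> text fields.
entities = [
    {"set": "laptop", "attrs": {"name": "toughbook 004", "desc": "rugged reliable laptop"}},
    {"set": "laptop", "attrs": {"name": "ibm 002",       "desc": "large ibm laptop"}},
    {"set": "tablet", "attrs": {"name": "field tablet",  "desc": "rugged tablet"}},
]

def per_keyword_counts(keyword):
    """Count the entities, entity sets and attributes in which a keyword occurs."""
    ents, sets, attrs = 0, set(), 0
    for e in entities:
        hit_attrs = [a for a, text in e["attrs"].items() if keyword in text.split()]
        if hit_attrs:
            ents += 1
            sets.add(e["set"])
            attrs += len(hit_attrs)
    return ents, len(sets), attrs

def difficulty_metrics(query_keywords):
    per_kw = [per_keyword_counts(k) for k in query_keywords]
    AE  = sum(c[0] for c in per_kw) / len(per_kw)   # average #entities per keyword
    AES = sum(c[1] for c in per_kw) / len(per_kw)   # average #entity sets per keyword
    AA  = sum(c[2] for c in per_kw) / len(per_kw)   # average #attributes per keyword
    AS  = AA * AES * AE                             # combined metric AS(Q)
    # Higher values suggest a harder query, so the inverses are used as predictors.
    return {m: (1.0 / v if v else 0.0) for m, v in
            {"iAA": AA, "iAES": AES, "iAE": AE, "iAS": AS}.items()}

print(difficulty_metrics(["rugged", "laptop"]))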
The distributions Q(d, t), Q(d) and q(t) defined above are the baseline probability distributions for everything that follows. In addition we have two important conditional probabilities:
Q(d|t) = Qt(d) = n(d, t)/n(t) on C
q(t|d) = qd(t) = n(d, t)/N(d) on T
The suggestive notation Q(d|t) is used for the source distribution of t, as it is the probability that a randomly selected occurrence of term t has source d. Similarly, q(t|d), the term distribution of d, is the probability that a randomly selected term occurrence from document d is an instance of term t. Various other probability distributions on C × T, C and T that we consider will be denoted by P, P and p respectively, dressed with various sub- and superscripts.
The setup in the previous section allows us to define a Markov chain on the set of documents and terms, which lets us propagate probability distributions from terms to documents and vice versa. Consider a Markov chain on T ∪ C having only transitions T → C with transition probabilities Q(d|t) and transitions C → T with transition probabilities q(t|d).
Given a term distribution p(t), we compute the one-step Markov chain evolution. This gives us a document distribution Pp(d), the probability of finding a term occurrence in a particular document given that the term distribution of the occurrences is p:
Pp(d) = Σt Q(d|t) p(t).
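Continuing the toy count matrix from the earlier sketch, a minimal illustration of the conditional distributions and of the one-step propagation Pp(d) = Σt Q(d|t) p(t) (the counts and the term distribution p(t) are again made up for the example):

import numpy as np

n_dt = np.array([[2, 0, 1],
                 [1, 3, 0],
                 [0, 1, 4]], dtype=float)   # toy n(d, t)

n_t = n_dt.sum(axis=0)                      # n(t)
N_d = n_dt.sum(axis=1)                      # N(d)

Q_d_given_t = n_dt / n_t                    # Q(d|t) = n(d, t) / n(t); columns sum to 1
q_t_given_d = (n_dt.T / N_d).T              # q(t|d) = n(d, t) / N(d); rows sum to 1

# One step of the Markov chain T -> C: propagate a term distribution p(t)
# to a document distribution Pp(d) = sum_t Q(d|t) p(t).
p_t = np.array([0.5, 0.25, 0.25])
P_p_d = Q_d_given_t @ p_t
assert np.isclose(P_p_d.sum(), 1.0)

# The reverse step C -> T propagates a document distribution back to terms.
p_back_t = q_t_given_d.T @ P_p_d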
The algorithm sequence is as follows:
1. Take the input query Qr.
2. Extract the keywords from the query, K = {k1, k2, k3, …, kn}.
3. Take every keyword as a centre and find the distances between the keywords and the keywords of the documents in the database:
Dist = √((k − Dk)²)
where Dk is a token of the documents in the database. After this, the terms in the database are grouped.
4. Find the density and the difficulty of the words in each group:
Dc = ΣGr(k)/Σt
5. Based on the Dc value, rank the keywords in the query, R = {r1, r2, …, rn}.
6. For each r in R: Search(keyword k).
7. For each k in K: Result = ΣResk.
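A highly simplified sketch of this overall sequence, with cosine distance standing in for the Dist formula and per-group term frequency standing in for the density Dc (both substitutions, the toy documents and the radius threshold are assumptions; the paper leaves these quantities only loosely specified):

import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "for", "of", "and", "with", "in"}

documents = {
    "d1": "rugged laptop designed for reliable field work",
    "d2": "small ibm laptop with long battery life",
    "d3": "rugged tablet for field engineers",
}

def tokens(text):
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def cosine_distance(a, b):
    """1 - cosine similarity of two bags of words (stand-in for Dist)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def search(query, radius=0.9):
    keywords = tokens(query)                                   # steps 1-2
    groups = {}                                                # step 3: group documents around each keyword
    for k in keywords:
        groups[k] = [d for d, text in documents.items()
                     if cosine_distance([k], tokens(text)) <= radius]
    # step 4: density Dc of each keyword's group (here: total occurrences / group size)
    dc = {k: sum(tokens(documents[d]).count(k) for d in ds) / max(len(ds), 1)
          for k, ds in groups.items()}
    ranked = sorted(keywords, key=dc.get, reverse=True)        # step 5
    results = []                                               # steps 6-7: search per ranked keyword, merge
    for k in ranked:
        results.extend(d for d in groups[k] if d not in results)
    return ranked, results

print(search("small rugged laptop"))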
Distance Measures
An effective way to define “similarity” between two elements is through a metric d(i, j) between the elements i, j satisfying the usual axioms of non-negativity, identity of indiscernibles and the triangle inequality. Two elements are more similar if they are closer. For this purpose any monotone increasing function of a metric will suffice, and we call such a function a distance function.
For clustering we use a hierarchical top-down method that requires the centre of each cluster to be computed in each step. Thus our choice of distance function is restricted to distances defined on a space that allows us to compute a centre and distances between keywords and this centre. In particular we cannot use popular similarity measures like the Jaccard coefficient. In the following we compare results with four different distance functions for keywords t and s: (a) the cosine similarity of the document distributions Qt and Qs considered as vectors in the document space, (b) the cosine similarity of the vectors of tf.idf values of the keywords, (c) the Jensen-Shannon divergence between the document distributions Qt and Qs, and (d) the Jensen-Shannon divergence between the term distributions p̄t and p̄s. The cosine similarity of two terms t and s is defined, as usual, as the inner product of the corresponding vectors divided by the product of their norms. Since the arccos of this similarity function is a proper metric, (1 − cos)(arccos(cos sim(t, s))) = 1 − cos sim(t, s) is a distance function.
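A brief sketch of two of the four candidate distance functions, cosine distance between keyword vectors and the Jensen-Shannon divergence between two distributions (the toy vectors are illustrative; SciPy is deliberately avoided so the divergence is written out explicitly):

import numpy as np

def cosine_distance(u, v):
    """1 - cos sim(t, s) for two keyword vectors (e.g. tf.idf vectors or Qt, Qs)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy document distributions Qt and Qs of two keywords t and s.
Qt = np.array([0.6, 0.3, 0.1])
Qs = np.array([0.1, 0.4, 0.5])
print(cosine_distance(Qt, Qs), jensen_shannon(Qt, Qs))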
IV. CONCLUSION
In this paper we introduced a novel method for searching typical keywords quickly. In this work we reduce the processing time and the preprocessing required for searching. We put forward a principled framework and proposed novel algorithms to quantify the level of difficulty of a query over a database, using the ranking robustness principle. Based on our framework, we propose novel algorithms that efficiently predict the effectiveness of a keyword query. Our extensive experiments show that the algorithms predict the difficulty of a query with relatively low error and negligible time overhead.
REFERENCES
[1] V. Hristidis, L. Gravano, and Y. Papakonstantinou, “Efficient IR-style keyword search over relational databases,” in Proc. 29th VLDB Conf., Berlin, Germany, 2003, pp. 850–861.
[2] Y. Luo, X. Lin, W. Wang, and X. Zhou, “SPARK: Top-k keyword query in relational databases,” in Proc. 2007 ACM SIGMOD, Beijing, China, pp. 115–126.
[3] V. Ganti, Y. He, and D. Xin, “Keyword++: A framework to improve keyword search over entity databases,” in Proc. VLDB Endowment, Singapore, Sept. 2010, vol. 3, no. 1–2, pp. 711–722.
[4] J. Kim, X. Xue, and B. Croft, “A probabilistic retrieval model for semistructured data,” in Proc. ECIR, Toulouse, France, 2009, pp. 228–239.
[5] N. Sarkas, S. Paparizos, and P. Tsaparas, “Structured annotations of web queries,” in Proc. 2010 ACM SIGMOD Int. Conf. Manage. Data, Indianapolis, IN, USA, pp. 771–782.
[6] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, “Keyword searching and browsing in databases using BANKS,” in Proc. 18th ICDE, San Jose, CA, USA, 2002, pp. 431–440.
[7] C. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. New York, NY: Cambridge University Press, 2008.
[8] A. Trotman and Q. Wang, “Overview of the INEX 2010 data centric track,” in 9th Int. Workshop INEX 2010, Vught, The Netherlands, pp. 1–32.
[9] T. Tran, P. Mika, H. Wang, and M. Grobelnik, “SemSearch’10,” in Proc. 3rd Int. WWW Conf., Raleigh, NC, USA, 2010.
[10] S. C. Townsend, Y. Zhou, and B. Croft, “Predicting query performance,” in Proc. SIGIR ’02, Tampere, Finland, pp. 299–306.
[11] A. Nandi and H. V. Jagadish, “Assisted querying using instant-response interfaces,” in Proc. SIGMOD ’07, Beijing, China, pp. 1156–1158.
[12] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, “DivQ: Diversification for keyword search over structured databases,” in Proc. SIGIR ’10, Geneva, Switzerland, pp. 331–338.
BIOGRAPHIES
Govinda Kunapareddi is pursuing M.Tech (Information Technology) at Avanthi Institute of Engineering and Technology, Visakhapatnam, affiliated to JNTU Kakinada, from 2013 to 2015. His areas of interest are cloud computing, data warehousing and network security.
Dr. C P V N J Mohan Rao received his M.Tech degree in Computer Science and Technology from Andhra University College of Engineering, Vizag, and was awarded a PhD by Andhra University, Vizag.
He has 18 years of teaching and research experience and has guided a number of M.Tech students in their projects. Presently he is working as Principal of Avanthi Institute of Engineering and Technology, Vizag, Andhra Pradesh. His research interests include data warehousing and data mining, cryptography and network security, and artificial intelligence. He has published 23 papers in various national and international journals. He is guiding 2 research scholars for Ph.D. He received the Best Teacher Award from JNTU, Kakinada in 2009.