PREDICTING KEYWORD QUERY
DIFFICULTY IN VIEW OF EFFICIENT
INFORMATION RETRIEVAL
Dasyam Anusha#1, G.S.Ramesh#2
#1 PG scholar (SE)
#2 Assistant Professor
Department of computer science and engineering
VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY
(An Autonomous Institute)
Bachupally(v),Nizampet(S,O), Hyderabad-500090.
anushadasyam@gmail.com#1
ramesh_gs@vnrvjiet.in#2
Abstract: Keyword querying is the most widely used way of
exploring data. Keyword queries make data easy to access, but
for certain queries the relevant information is not retrieved
effectively. When many entities in the database match a keyword
query, the ranking quality tends to be low. It is useful to
identify queries that are likely to have low ranking quality in
order to improve user satisfaction. For instance, the system may
suggest alternative queries to the user for such hard queries.
The characteristics of hard queries are analyzed, and a framework
to measure the degree of complexity of a keyword query over a
database is proposed, considering both the structure and the
content of the database. The proposed algorithms predict query
difficulty over a structured database with little time overhead.
The results help in improving query efficiency.
Index Terms—database, entities, hard queries, keyword query,
low ranking, query efficiency, time overhead.
I. INTRODUCTION
Search engines process keyword queries and produce a list of
relevant results. For certain difficult queries, search engines
fail to return relevant results promptly. Identifying such
queries is useful for improving user satisfaction. The idea
behind measuring keyword query complexity is to measure the
contribution of each keyword to the final result list. The
predicted difficulty of a query is useful for improving both
search engine efficiency and user satisfaction. In this paper,
the characteristics of complex queries are analysed, and novel
algorithms are introduced that predict such queries over
structured databases with little time overhead.
Databases contain entities, and entities contain attributes
that take attribute values. Answering a query becomes harder
when more entities in the database match the query terms. For
instance, the query Q1: Charlie Chaplin on the IMDB database
does not specify whether the user is interested in movies whose
title is Charlie Chaplin or movies distributed by Charlie
Chaplin.
Little work has addressed predicting or analyzing query
complexity over structured databases [1], [2]. Techniques for
analyzing the complexity of queries over plain text documents
are not applicable to this problem since they ignore the
structure of the database. In particular, each query term has
to be matched against the entities in the database to obtain
the result.
In this paper, we analyze the characteristics of complex
queries over databases and propose algorithms to identify such
queries. The structure of the database is taken into account so
that results are obtained promptly.
II. KEYWORD QUERY MODELS
A database is modeled as a set of entity sets. An entity
set S is a collection of entities E. Each entity E has a
set of attribute values Ai, 1 ≤ i ≤ |E|. Each attribute
value is a bag of terms. Keyword queries over such
structured databases sometimes fail to return appropriate
results. Queries that fail to return appropriate results
are termed hard queries.
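The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names Entity and EntitySet, the helper bag_of_terms, the whitespace tokenisation, and the toy movie records are all hypothetical and do not come from the system described in this paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


def bag_of_terms(text: str) -> Counter:
    """Lower-cased, whitespace-tokenised bag of terms (a simplifying assumption)."""
    return Counter(text.lower().split())


@dataclass
class Entity:
    # Attribute name -> attribute value A_i, stored as a bag of terms.
    attributes: Dict[str, Counter] = field(default_factory=dict)


@dataclass
class EntitySet:
    name: str
    entities: List[Entity] = field(default_factory=list)


# Invented toy records, loosely following the Charlie Chaplin example.
movies = EntitySet("movie", [
    Entity({"title": bag_of_terms("charlie chaplin"),
            "distributor": bag_of_terms("example films")}),
    Entity({"title": bag_of_terms("city lights"),
            "distributor": bag_of_terms("charlie chaplin")}),
])

# A database is a set of entity sets; a list is used here for simplicity.
database = [movies]
```

Storing each attribute value as a Counter keeps the term frequencies that a ranking function would need, while the nesting mirrors the entity set, entity, attribute value hierarchy.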
A. Hard keyword queries
Hard keyword queries are queries for which the search
engine fails to provide appropriate results. Certain
characteristics of such hard queries are analyzed and
listed below:
• The query becomes less specific when a larger number
of entities in the database match the query terms.
• Attribute-level ambiguity is another characteristic of
hard queries. Attributes describe different features of
an entity. When the query matches a larger number of
attributes, the results become ambiguous.
• The entity set is the top level in the database
structure. When the query terms match a larger number
of entity sets, retrieval of the relevant information
is poor, and the keyword query interface fails to
provide prompt results for the keyword query.
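The three characteristics listed above can be made concrete by counting how many entities, attributes, and entity sets a query touches. The following is a minimal sketch, not the measure used in this paper: the function match_statistics, the dictionary layout of the database, and the toy records are assumptions made for illustration.

```python
def match_statistics(database, query):
    """Count entities, attributes and entity sets matched by a keyword query.

    `database` maps entity-set name -> list of entities, and each entity maps
    attribute name -> text. This layout is an assumption for illustration.
    """
    terms = set(query.lower().split())
    matched_entities = 0
    matched_attributes = set()
    matched_entity_sets = set()
    for set_name, entities in database.items():
        for entity in entities:
            hit = False
            for attr_name, text in entity.items():
                if terms & set(text.lower().split()):
                    matched_attributes.add((set_name, attr_name))
                    hit = True
            if hit:
                matched_entities += 1
                matched_entity_sets.add(set_name)
    return {
        "entities": matched_entities,
        "attributes": len(matched_attributes),
        "entity_sets": len(matched_entity_sets),
    }


# Invented toy data: the same keywords match a title, a distributor and a name.
toy_db = {
    "movie": [
        {"title": "charlie chaplin", "distributor": "example films"},
        {"title": "city lights", "distributor": "charlie chaplin"},
    ],
    "person": [{"name": "charlie chaplin"}],
}
stats = match_statistics(toy_db, "Charlie Chaplin")
print(stats)
```

Higher counts at any of the three levels indicate a less specific query in the sense described above; how the counts would be combined into a single difficulty score is left open here.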
The noisy versions of the top results returned by the
ranking algorithm are analysed. The ranking algorithm uses
certain query statistics in order to rank the data. These
statistics are obtained by analysing the number of entities
and attributes that are matched by the query posed.
The information returned by the ranking algorithm is
used to compare the rankings over the original and the
noisy versions of the data.
IV. THE STRUCTURE OF THE SYSTEM
B. Framework to measure complexity of keyword
queries
A structured robustness algorithm is proposed to
measure the complexity of hard keyword queries. The
degree of complexity of a query is correlated with the
robustness of its ranking over the original and the noisy
versions of the data.
Noise is generated by first querying the original
database. In the case of hard queries, the entity sets
that match the query terms fail to yield the required
information. Such entity sets are treated as the noisy
versions of the database.
The noisy versions of the database are queried again
in order to estimate the difficulty of the query. Comparing
the rankings over the noisy versions and the original
version yields the complexity of the query.
III. STRUCTURED ROBUSTNESS ALGORITHM
The structured robustness algorithm is used to
measure the complexity of a keyword query. The result is
passed to the search engine so that results can be obtained
more promptly.
SR <- 0; C <- {};  // C caches t, s for keywords in Q
FOR i = 1 -> N DO
    I' <- I; M' <- M; L' <- L;  // corrupted copies of I, M and L
    FOR each result R in L DO
        FOR each attribute value A in R DO
            A' <- A;  // corrupted version of A
            FOR each keyword w in Q DO
                Compute # of w in A'
                IF # of w varies in A' and A THEN
                    Update A', M' and entry of w in I';
            Add A' to R';
        Add R' to L';
    Rank L' using g, which returns L', based on I', M';
    SR += Sim(L, L');
RETURN SR <- SR/N;  // average score over N rounds
The structured robustness algorithm obtains the noise
in the database on-the-fly during the query processing.
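The loop structure of the algorithm can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the implementation evaluated in this paper: the ranking function g is replaced by a toy sum of query-term frequencies (score and rank), the noise model corrupt is a hypothetical random shift of a term count, Sim is taken to be the Spearman rank correlation, and the updates to the index I' and the metadata M' are omitted.

```python
import random


def score(result, query_terms):
    """Toy stand-in for g: total query-term frequency in a result (assumption)."""
    return sum(bag.get(w, 0) for bag in result.values() for w in query_terms)


def rank(results, query_terms):
    """Result ids ordered by descending score; ties are broken by id."""
    return sorted(results, key=lambda rid: (-score(results[rid], query_terms), rid))


def spearman(order_a, order_b):
    """Spearman rank correlation between two orderings of the same ids."""
    n = len(order_a)
    if n < 2:
        return 1.0
    pos_b = {rid: i for i, rid in enumerate(order_b)}
    d2 = sum((i - pos_b[rid]) ** 2 for i, rid in enumerate(order_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def corrupt(count, rng):
    """Hypothetical noise model: shift a term count by -1, 0 or +1, floor at 0."""
    return max(0, count + rng.choice((-1, 0, 1)))


def structured_robustness(results, query_terms, rounds, seed=0):
    """Average similarity between the original and N noisy rankings."""
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    rng = random.Random(seed)
    original = rank(results, query_terms)            # L
    total = 0.0
    for _ in range(rounds):
        noisy = {}                                   # L'
        for rid, result in results.items():
            noisy_result = {}                        # R'
            for attr, bag in result.items():
                noisy_bag = dict(bag)                # A' <- A
                for w in query_terms:
                    if w in bag:
                        noisy_bag[w] = corrupt(bag[w], rng)
                noisy_result[attr] = noisy_bag
            noisy[rid] = noisy_result
        total += spearman(original, rank(noisy, query_terms))
    return total / rounds


# Invented top results: result id -> attribute name -> bag of terms.
top_results = {
    "r1": {"title": {"charlie": 3, "chaplin": 3}},
    "r2": {"title": {"charlie": 1}, "cast": {"chaplin": 1}},
    "r3": {"title": {"lights": 1}},
}
sr = structured_robustness(top_results, ["charlie", "chaplin"], rounds=50)
```

A score close to 1 means the ranking barely changes under noise, which in this framework suggests an easier query; lower scores suggest a harder one.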
Fig. 1 Structure of the system (blocks: noise generation on
database, ranking to data, ranking list, result list)
Figure 1 depicts the process of generating the noise
on-the-fly during query processing. The obtained result
is ranked based on statistics that reflect the entities
matched by the query terms. The time spent on obtaining
the noisy versions can be decreased significantly by
corrupting only the attribute values that contain query
terms.
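The saving from corrupting only the attribute values that contain query terms can be illustrated with a small filter. The function values_to_corrupt and the toy results below are hypothetical and serve only to show which values would be touched.

```python
def values_to_corrupt(results, query_terms):
    """List (result id, attribute name) pairs whose value contains a query term."""
    selected = []
    for rid, result in results.items():
        for attr, bag in result.items():
            if any(w in bag for w in query_terms):
                selected.append((rid, attr))
    return selected


# Invented results: result id -> attribute name -> bag of terms.
toy_results = {
    "r1": {"title": {"charlie": 3, "chaplin": 3}, "year": {"1940": 1}},
    "r2": {"title": {"charlie": 1}, "cast": {"chaplin": 1}},
    "r3": {"title": {"lights": 1}},
}
touched = values_to_corrupt(toy_results, ["charlie", "chaplin"])
total_values = sum(len(result) for result in toy_results.values())
print(len(touched), "of", total_values, "attribute values need corruption")
```

In this toy case three of the five attribute values contain a query term, so the remaining two can be copied unchanged into the noisy version.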
When the corruption process is applied to only a small
number of query keywords in the attribute values of the
entities in the original database, the ranking positions
of these entities can change drastically.
The degree of the complexity is correlated with the
ranking over the original as well as the noisy versions of
the collection.
V. RESULTS
Fig. 2 Keyword query field, in which the keyword query is
given.
Fig. 3 XML files related to the given query are retrieved
from the database.
Fig. 4 The appropriate parameter for the required
information retrieval is selected and the corresponding
value is entered.
Fig. 5 The required information is retrieved from the
XML file and is shown in a structured format.
VI. REFERENCES
[1] S. Cheng, A. Termehchy, and V. Hristidis, “Efficient
prediction of difficult keyword queries over databases,”
IEEE Trans. Knowl. Data Eng., vol. 26, no. 6, June 2014.
[2] C. Manning, P. Raghavan, and H. Schütze, An
Introduction to Information Retrieval. New York, NY:
Cambridge University Press, 2013.
[3] V. Ganti, Y. He, and D. Xin, “Keyword++: A
framework to improve keyword search over entity
databases,” in Proc. VLDB Endowment, Singapore, Sept.
2010, vol. 3, no. 1–2, pp. 711.
[4] J. Kim, X. Xue, and B. Croft, “A probabilistic retrieval
model for semistructured data,” in Proc. ECIR, Toulouse,
France, 2009, pp. 228–239.
[5] T. Tran, P. Mika, H. Wang, and M. Grobelnik,
“Semsearch’10,” in Proc. 3rd Int. WWW Conf., Raleigh,
NC, USA, 2010.
[6] S. C. Townsend, Y. Zhou, and B. Croft, “Predicting
query performance,” in Proc. SIGIR ’02, Tampere, Finland,
pp. 299–306.
[7] A. Nandi and H. V. Jagadish, “Assisted querying using
instant-response interfaces,” in Proc. SIGMOD ’07, Beijing,
China, pp. 1156–1158.
[8] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl,
“DivQ: Diversification for keyword search over structured
databases,” in Proc. SIGIR ’10, Geneva, Switzerland, pp.
331–338.
[9] Y. Zhou and B. Croft, “Ranking robustness: A novel
framework to predict query performance,” in Proc. 15th
ACM Int. CIKM, Geneva, Switzerland, 2006, pp. 567–574.
[10] B. He and I. Ounis, “Query performance prediction,”
Inf. Syst., vol. 31, no. 7, pp. 585–594, Nov. 2006.
[11] K. Collins-Thompson and P. N. Bennett, “Predicting
query performance via classification,” in Proc. 32nd ECIR,
Milton Keynes, U.K., 2010, pp. 140–152.