PREDICTING KEYWORD QUERY DIFFICULTY IN VIEW OF EFFICIENT INFORMATION RETRIEVAL

Dasyam Anusha #1, G. S. Ramesh #2
#1 PG Scholar (SE), #2 Assistant Professor
Department of Computer Science and Engineering
VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING AND TECHNOLOGY (An Autonomous Institute)
Bachupally (V), Nizampet (S.O.), Hyderabad-500090.
anushadasyam@gmail.com #1, ramesh_gs@vnrvjiet.in #2

Abstract: Keyword querying is the most widely used form of querying for exploring data. Keyword queries make data easy to access, but for certain queries the relevant information is not retrieved efficiently. When a large number of entities in the database match a keyword query, the ranking quality of its results tends to be low. Identifying queries that are likely to have low ranking quality would help improve user satisfaction; for instance, the system may suggest alternative queries to the user for such hard queries. The characteristics of hard queries are analyzed, and a framework to measure the degree of complexity of a keyword query over a database is proposed, considering both the structure and the content of the database. The proposed algorithms predict the difficulty of a query over a structured database with little time overhead. The results help in improving query efficiency.

Index Terms: database, entities, hard queries, keyword query, low ranking, query efficiency, time overhead.

I. INTRODUCTION

Search engines interpret keyword queries and produce a list of relevant results. For certain difficult queries, however, search engines fail to provide appropriate results promptly. Identifying such queries is very useful for improving user satisfaction. The idea behind identifying keyword query complexity is to measure the contribution of each keyword to the final list of documents returned as the result. The predicted difficulty of a query is useful for improving both search engine efficiency and user satisfaction.
In this paper, the characteristics of complex queries are analysed and novel algorithms are introduced that work over structured databases with little time overhead. Databases contain entities, and entities contain attributes that take attribute values. The complexity of answering a query increases when more entities in the database match the query terms. For instance, the query Q1: Charlie Chaplin on the IMDB database does not specify whether the user is interested in movies whose title is Charlie Chaplin or in movies distributed by Charlie Chaplin. There has not been any work on predicting or analyzing the query complexity over structured databases [1], [2]. Existing techniques for analyzing the complexity of queries over plain text documents are not applicable to this problem, since they ignore the structure of the database. In particular, each query term has to be matched against the entities in the database to obtain the result. In this paper, we analyze the characteristics of complex queries over databases and propose algorithms to identify such queries. The structure of the database is taken into account in order to obtain results promptly.

II. KEYWORD QUERY MODELS

A database is modeled as a set of entity sets. An entity set S is a collection of entities E. Each entity E has a set of attribute values Ai, 1 ≤ i ≤ |E|. Each attribute value is a bag of terms. Keyword queries on such structured databases sometimes fail to obtain appropriate results; such queries are termed hard queries.

A. Hard keyword queries

Hard keyword queries are queries that fail to return appropriate results when posed to a search engine. Certain characteristics of such hard queries are analyzed and set out below. A query becomes less specific when a larger number of entities in the database match the query terms. Attribute-level ambiguity is another characteristic of hard queries.
Attributes describe different features of an entity. When a larger number of attributes match a query, the results become more ambiguous. The entity set is the top level of the database structure. When the query terms match a larger number of entity sets, the retrieval of the relevant information becomes poorer. In these cases the keyword query interface fails to provide prompt results for the keyword query.

Noisy versions of the top-ranked results returned by the ranking algorithm are analysed. The ranking algorithm uses certain query statistics in order to rank the data. These statistics are obtained by analysing the number of entities and attributes matched by the posed query. The information returned by the ranking algorithm is used to compare the original and the noisy versions of the data.

B. Framework to measure complexity of keyword queries

A structured robustness algorithm is proposed in order to measure the complexity of hard keyword queries. The degree of complexity of a query is correlated with the robustness of its ranking over the original and the noisy versions of the data. Noise generation starts by querying the original database. For hard queries, the entity sets matched by the query fail to provide the required information; such entity sets are treated as the noisy versions of the database. The noisy versions of the database are queried again in order to estimate the difficulty of the query: comparing the results over the noisy version with those over the original version yields the complexity of the query.

III. STRUCTURED ROBUSTNESS ALGORITHM

The structured robustness algorithm is used to measure the complexity of a keyword query. The result is given to the search engine so that results can be obtained more promptly.

    SR <- 0; C <- {};  // C caches t, s for keywords in Q
    FOR i = 1 -> N DO
        I' <- I; M' <- M; L' <- {};  // corrupted copies of I and M; L' starts empty
        FOR each result R in L DO
            R' <- {};  // corrupted version of R
            FOR each attribute value A in R DO
                A' <- A;  // corrupted version of A
                FOR each keyword w in Q DO
                    Compute # of w in A';
                    IF # of w varies in A' and A THEN
                        Update A', M' and entry of w in I';
                Add A' to R';
            Add R' to L';
        Rank L' using g, which returns L', based on I', M';
        SR += Sim(L, L');
    RETURN SR <- SR / N;  // average score over N rounds

The structured robustness algorithm generates the noise in the database on the fly during query processing.

IV. THE STRUCTURE OF THE SYSTEM

Fig. 1. Structure of the system (blocks: noise generation on database, ranking of data, ranking list, result list).

Figure 1 depicts the process of generating the noise on the fly during query processing. The obtained result is ranked based on statistics that reflect the entities matched by the query terms. The time spent on obtaining the noisy versions can be decreased significantly by corrupting only the attribute values that contain query terms. Applying the corruption process to a small number of query keywords in the attribute values of the entities in the original database can drastically change the ranking positions of these entities. The degree of complexity is correlated with how much the ranking differs between the original and the noisy versions of the collection.

V. RESULTS

Fig. 2. Keyword query field.
Figure 2 shows the keyword query field, in which the keyword query is entered.

Fig. 3.
In Figure 3, the XML files related to the given query are retrieved from the database.

Fig. 4.
In Figure 4, the appropriate parameter for the required information retrieval is selected and the corresponding value is entered.

Fig. 5.
In Figure 5, the required information is retrieved from the XML file and shown in a structured format.

VI. REFERENCES

[1] S. Cheng, A. Termehchy, and V. Hristidis, “Efficient prediction of difficult keyword queries over databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no.
6, June 2014.
[2] C. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. New York, NY: Cambridge University Press, 2013.
[3] V. Ganti, Y. He, and D. Xin, “Keyword++: A framework to improve keyword search over entity databases,” in Proc. VLDB Endowment, Singapore, Sept. 2010, vol. 3, no. 1–2, pp. 711.
[4] J. Kim, X. Xue, and B. Croft, “A probabilistic retrieval model for semi structured data,” in Proc. ECIR, Toulouse, France, 2009, pp. 228–239.
[5] T. Tran, P. Mika, H. Wang, and M. Grobelnik, “Semsearch’10,” in Proc. 3rd Int. WWW Conf., Raleigh, NC, USA, 2010.
[6] S. C. Townsend, Y. Zhou, and B. Croft, “Predicting query performance,” in Proc. SIGIR ’02, Tampere, Finland, pp. 299–306.
[7] A. Nandi and H. V. Jagadish, “Assisted querying using instant-response interfaces,” in Proc. SIGMOD ’07, Beijing, China, pp. 1156–1158.
[8] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl, “DivQ: Diversification for keyword search over structured databases,” in Proc. SIGIR ’10, Geneva, Switzerland, pp. 331–338.
[9] Y. Zhou and B. Croft, “Ranking robustness: A novel framework to predict query performance,” in Proc. 15th ACM Int. CIKM, Geneva, Switzerland, 2006, pp. 567–574.
[10] B. He and I. Ounis, “Query performance prediction,” Inf. Syst., vol. 31, no. 7, pp. 585–594, Nov. 2011.
[11] K. Collins-Thompson and P. N. Bennett, “Predicting query performance via classification,” in Proc. 32nd ECIR, Milton Keynes, U.K., 2010, pp. 140–152.
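
As a supplementary illustration of the structured robustness algorithm of Section III, the following Python sketch computes an SR-style score over a toy database. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the names score, rank, corrupt, similarity and structured_robustness are hypothetical, the term-count score stands in for the ranking function g, the drop-or-duplicate noise model stands in for the corruption step, and the position-overlap measure stands in for Sim. The index I and the statistics M are not modeled; the toy ranking recomputes term counts directly.

```python
import random
from typing import Dict, List

Entity = Dict[str, List[str]]  # attribute name -> attribute value (bag of terms)


def score(entity: Entity, query: List[str]) -> int:
    """Toy stand-in for g: total occurrences of query keywords in the entity."""
    return sum(terms.count(w) for terms in entity.values() for w in query)


def rank(db: Dict[str, Entity], query: List[str], k: int) -> List[str]:
    """Return the ids of the top-k entities; ties are broken by id."""
    return sorted(db, key=lambda eid: (-score(db[eid], query), eid))[:k]


def corrupt(entity: Entity, query: List[str], rng: random.Random) -> Entity:
    """Copy an entity, perturbing only attribute values that contain query terms."""
    noisy: Entity = {}
    for attr, terms in entity.items():
        new_terms = list(terms)
        for w in query:
            if w in new_terms:
                # Assumed noise model: drop or duplicate one occurrence of w.
                if rng.random() < 0.5:
                    new_terms.remove(w)
                else:
                    new_terms.append(w)
        noisy[attr] = new_terms
    return noisy


def similarity(a: List[str], b: List[str]) -> float:
    """Toy stand-in for Sim: fraction of positions holding the same entity."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def structured_robustness(db: Dict[str, Entity], query: List[str],
                          k: int = 3, rounds: int = 50, seed: int = 0) -> float:
    """Average similarity between the original and the corrupted rankings."""
    rng = random.Random(seed)
    original = rank(db, query, k)  # result list L
    total = 0.0
    for _ in range(rounds):
        # Corrupt only the top-k results, then re-rank them (list L').
        noisy_db = {eid: corrupt(db[eid], query, rng) for eid in original}
        total += similarity(original, rank(noisy_db, query, k))
    return total / rounds


# Hypothetical movie entities echoing the Charlie Chaplin example.
movies: Dict[str, Entity] = {
    "m1": {"title": ["charlie", "chaplin"], "distributor": ["united", "artists"]},
    "m2": {"title": ["modern", "times"], "distributor": ["charlie", "chaplin"]},
    "m3": {"title": ["godfather"], "distributor": ["paramount"]},
}

if __name__ == "__main__":
    print(structured_robustness(movies, ["charlie", "chaplin"]))
```

A score close to 1 means that the ranking is stable under corruption, while a lower score suggests a harder query in the sense of Section II. Corrupting only the attribute values that contain query terms follows the time-saving observation made in Section IV.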