International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 6 - June 2013, ISSN: 2231-5381

An Efficient Edge Cut Mechanism for Concise Range Queries

S. Anu Radha*, B. Venkateswarlu#
* Final M.Tech Student, Dept. of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Narsipatnam, Andhra Pradesh.
# Associate Professor, Dept. of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Narsipatnam, Andhra Pradesh.

Abstract: Due to the rapid growth of wireless communication technology, people frequently view maps or obtain related services from handheld devices such as mobile phones and PDAs. Range queries, one of the most commonly used tools, are often posed by users to retrieve needed information from a spatial database. However, due to the limited communication bandwidth and hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communication efficient nor informative to the users, simply because a range query often returns too many results. In view of this problem, we present a novel idea of effective navigation together with a best edge cut mechanism for the queried results, which increases performance, reduces both the communication cost and the navigation cost, and offers better usability to the users, providing an opportunity for interactive exploration. The usefulness of concise range queries is confirmed by comparing them with other possible alternatives, such as sampling, clustering, and edge cut mechanisms.

I. INTRODUCTION

Spatial databases have witnessed an increasing number of applications recently, partially due to fast advances in mobile computing and embedded systems and the spread of the Internet. For example, it is quite common these days that people want to figure out driving or walking directions on their handheld devices (mobile phones or PDAs).
However, facing the huge amount of spatial data collected by various devices, such as sensors and satellites, and the limited bandwidth and/or computing power of handheld devices, how to deliver light but usable results to the clients is a very interesting and, of course, challenging task. Our work has the same motivation as several recent works on finding good representatives for large query answers, for example, representative skyline points in [7]. Furthermore, such requirements are not specific to spatial databases. General query processing for large relational databases and OLAP data warehouses has posed similar challenges. For example, approximate, scalable query processing has been a focal point in the recent work [6], where the goal is to provide light, usable representations of the query results early in the query processing stage, so that an interactive query process is possible. In fact, [6] argued for returning concise representations of the final query results at every possible stage of a long-running query evaluation. However, the focus of [6] is on join queries in relational databases, and the approximate representation is a random sample of the final query results. As we will soon see, the goal of this work is different, and random sampling is not a good solution for our problem.

Usability refers to the question of whether the user can derive meaningful knowledge from the query results. Note that more results do not necessarily imply better usability. On the contrary, too much information may do more harm than good, a phenomenon commonly known as the information overload problem. As a concrete example, suppose that a user issues a query to her GPS device to find restaurants in the downtown Boston area. Most readers who have used a GPS device will quickly realize that the results returned in this case could be almost useless to the client for making a choice.
The results (i.e., a large set of points) shown on the small screen of a handheld device may squeeze together and overlap. It is hard to differentiate them, let alone use this information. A properly sized representation of the results will actually improve usability. In addition, usability is often related to another component, namely query interactiveness, which has become more and more important. Interactiveness refers to the capability of letting the user provide feedback to the server and refine the query results as he or she wishes. This is important because, very often, the user would like to get a rough idea of a large region first, which provides valuable information for narrowing down her query to specific regions. In the above example, it is much more meaningful to tell the user a few areas with a high concentration of restaurants (possibly with additional attributes, such as Italian vs. American restaurants), so that she can further refine her query range.

II. LITERATURE SURVEY

A) Query Languages

Bird et al. (2000) compared some of the query languages available (at that time) for graph-based annotation frameworks. These included Emu and the MATE query language. They then proposed their own query language for annotation graphs. This language used path patterns and abbreviatory devices to provide a convenient way to express a wide range of queries. It also exploited the quasi-linearity of annotation graphs by partitioning the precedence relation to allow efficient temporal indexing of the graphs. Another such survey was by Lai and Bird (2004), where the authors considered TigerSearch, CorpusSearch, NiteQL, Tgrep2, Emu, and LPath (Bird et al., 2005; Bird et al., 2006). From this study, the authors tried to derive the requirements that a good tree query language should satisfy.
Resnik and Elkiss (2005) reported a search engine for linguists that was meant to be easy to use for linguists who were not versed in the use of computers. This tool allowed linguists to draw patterns in the form of subtrees, which were then converted into queries and searched. Like almost all such languages, it did not allow manipulation of data, and it worked only for certain levels of annotation. It was mainly aimed at searching phrase structure patterns and morphological information. One of the well-known query languages for annotated corpora used for linguistic studies and for NLP is the Corpus Query Language1 (CQL), very different from the one we are presenting here. It is used in a popular tool called Sketch Engine2 (Kilgarriff et al., 2004). It provides a wide variety of functionalities for accessing corpora, such as searching words, lemmas, roots, and POS tags of a word, and getting the left and right contexts up to a window size of 15. Another usual practice is to have a query tool for syntactically annotated corpora such that the data is converted internally to a relational database and the query is written using SQL (Kallmeyer, 2000). A much earlier work was titled ‘A modular and flexible architecture for an integrated corpus query system’ (Christ, 1994), which is used by the IMS Corpus Workbench3. Another query language, called MQL, is used in the Emdros database engine for analyzed or annotated text4. MQL is a descendant of QL (Doedens, 1994). The language that we describe here is similar in some aspects to many of these languages, but different in others. The most important differences are the support for threaded trees, its very concise syntax, the query-and-action mechanism (data manipulation), arbitrary return values, support for custom commands, and the possibility of pipelining results through the source and destination operators. It also has high expressive power in general.
Moreover, it can be used for purposes other than NLP, because the data that it operates on is similar to the general XML representation, and the language has some of the power of both XPath5-based querying and XSLT6-based transformation.

B) Spatial Clustering Algorithms

Density-based clustering algorithms are among the primary methods for data mining. The clusters formed by density-based clustering are easy to understand, and the approach does not limit itself to particular shapes of clusters. Existing density-based algorithms run into trouble because they are not capable of finding all meaningful clusters when the density varies widely. VDBSCAN was introduced to compensate for this problem. It is the same as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), with the one difference that VDBSCAN selects several values of the parameter Eps for different densities according to the k-dist plot. The problem is that the value of the parameter k in the k-dist plot is user-defined. We introduce a new method to find the value of the parameter k automatically, based on the characteristics of the dataset. In this method, we consider the spatial distance from a point to all other points in the dataset. The proposed method has the potential to find the optimal value for the parameter k.

C) Best Edge Cut Mechanism

We can compute the optimal cost by recursively enumerating all possible sequences of valid EdgeCuts, starting from the root and reaching every concept in the navigation tree, computing the cost for each step and taking the minimum. However, this algorithm is prohibitively expensive. Instead, we propose an alternative algorithm, Opt-EdgeCut, that makes use of dynamic programming to reduce the computation cost. As shown in Section VI-A below, Opt-EdgeCut is still exponential and is only used to evaluate the quality of the heuristic.

III. OUR PROPOSED WORK

For the clustering of spatial data, the DBSCAN algorithm is based on a center-based approach.
In the center-based approach, density is estimated for a particular point in the dataset by counting the number of points within a specified radius, Eps, of that point. This count includes the point itself. The center-based approach to density allows us to classify a point as a core point, a border point, or a noise (background) point. A point is a core point if the number of points within Eps, a user-specified parameter, exceeds a certain threshold, MinPts, which is also a user-specified parameter. Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as that core point. Noise points are discarded. The basic approach to determining the parameters Eps and MinPts is to look at the behavior of the distance from a point to its kth nearest neighbor, which is called the k-dist. The k-dists are computed for all the data points for some k and sorted in ascending order.

Step 1: Partition the k-dist plot: give thresholds of the parameters Epsi (i = 1, 2, …, n);
Step 2: For each Epsi (i = 1, 2, …, n):
  Eps = Epsi;
  adopt the DBSCAN algorithm for the points that are not yet marked;
  mark the points that were not marked;
Display all the marked points as the corresponding clusters.

A) Tree Navigation

In order to use the algorithms of Section 3.3 to answer a concise range query Q with budget k from the client, the database server would first need to evaluate the query as if it were a standard range query, using some spatial index built on the point set P, typically an R-tree. After obtaining the complete query results P ∩ Q, the server then partitions P ∩ Q into k groups and returns the concise representation.
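The core/border/noise classification from the center-based density definition above can be sketched in Python. This is a minimal illustration with a brute-force neighbor search; the function name and test values are illustrative, not from the paper:

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the
    center-based density definition: the Eps-neighborhood count
    includes the point itself."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]
    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighbors(i)):
            labels[i] = "border"   # within Eps of some core point
        else:
            labels[i] = "noise"    # discarded by DBSCAN
    return labels
```

In VDBSCAN this classification would simply be repeated with each selected Eps value over the points not yet marked.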
However, as the main motivation for obtaining a concise answer is precisely that P ∩ Q is too large, finding the entire P ∩ Q and running the algorithms of Section 3.3 is often too expensive for the database server. In this section, we present algorithms that process the concise range query without computing P ∩ Q in its entirety. The idea is to first find k′ bounding boxes, for some k′ > k, that collectively contain all the points in P ∩ Q, by using the existing spatial index structure on P. Each of these bounding boxes is also associated with the count of the points inside it. Then, we run a weighted version of the algorithm in Section 3.3, grouping these k′ bounding boxes into k larger bounding boxes to form the concise representation R. Typically k′ ≪ |P ∩ Q|, so we can expect significant savings in terms of I/O and CPU costs compared with answering the query exactly. Therefore, adopting concise range queries instead of traditional exact range queries not only solves the bandwidth and usability problems, but also leads to substantial efficiency improvements.

[Figure: An R-tree over the points p1–p12, with nodes N1–N6 and a query range Q.]

B) The R-Tree

The standard range query Q can be processed using an R-tree as follows. We start from the root of the R-tree and check the MBR of each of its children. Then, we recursively visit any node u whose MBR intersects or falls inside Q. When we reach a leaf, we simply return all the points stored there that are inside Q. In this section, we additionally assume that each node u in the R-tree also keeps nu, the number of points stored in its subtree. Such counts can be easily computed and maintained in the R-tree. To improve the performance of this tree navigation, we use an edge cut mechanism to filter the results.
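The count-augmented R-tree traversal described above can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: when a node's MBR lies entirely inside Q, its stored count nu is reported directly instead of visiting the subtree. The `Node` layout and function names are assumptions for the sketch:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple                                   # (xmin, ymin, xmax, ymax)
    count: int                                   # n_u: points below this subtree
    children: list = field(default_factory=list) # empty for leaves
    points: list = field(default_factory=list)   # leaf entries

def intersects(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def contains(q, m):
    return q[0] <= m[0] and q[1] <= m[1] and m[2] <= q[2] and m[3] <= q[3]

def boxes_with_counts(node, q, out):
    """Collect (MBR, count) pairs that together cover P ∩ Q,
    without descending into subtrees wholly inside Q."""
    if not intersects(node.mbr, q):
        return
    if contains(q, node.mbr):
        out.append((node.mbr, node.count))       # whole subtree inside Q
        return
    if node.children:
        for c in node.children:
            boxes_with_counts(c, q, out)
    else:                                        # leaf partially overlapping Q
        inside = [p for p in node.points
                  if q[0] <= p[0] <= q[2] and q[1] <= p[1] <= q[3]]
        if inside:
            xs = [p[0] for p in inside]
            ys = [p[1] for p in inside]
            out.append(((min(xs), min(ys), max(xs), max(ys)), len(inside)))
```

The resulting k′ boxes, each with its count, are what the weighted grouping step then merges into the k-box concise representation.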
a) Optimal Algorithm for the Best EdgeCut

The Opt-EdgeCut algorithm computes the minimum expected navigation cost (and the EdgeCut that achieves it) by traversing the navigation tree in post-order and computing the navigation cost bottom-up, starting from the leaves. For each node n, the algorithm enumerates and stores the list C(n) of all possible EdgeCuts for the subtree rooted at n, and the list II(n) of all possible I(n) sets that node n can be annotated with. Enumerating C(n) and I(n) this way leads to an ordering that maximizes reuse in the dynamic programming algorithm. The algorithm then computes the minimum cost for each subtree in II(n), given the EdgeCuts in C(n) and the already computed minimum costs for the descendants of n. The complexity of Opt-EdgeCut is O(|V| · 2^|E|).

Algorithm Opt-EdgeCut
Input: The navigation tree T
Output: The best EdgeCut
1. Traverse T in post-order; let n be the current node
2. while n ≠ root do
3.   if n is a leaf node then
4.     mincost(n, ∅) ← PE · L(n)
5.     optcut(n, ∅) ← {∅}
6.   else
7.     C(n) ← enumerate all possible EdgeCuts for the tree rooted at n
8.     II(n) ← enumerate all possible subtrees for the tree rooted at n
9.     foreach I(n) ∈ II(n) do
10.      compute PE(I(n)) and Pc(I(n))
11.      foreach Ci ∈ C(n) do
12.        if Ci is a valid EdgeCut for I(n) then
13.          cost(I(n), Ci) ← PE(I(n)) · ((1 − Pc(I(n))) · L(I(n)) + Pc(I(n)) · (1 + |S| + Σs∈S mincost(Ic(s))))
14.        else
15.          cost(I(n), Ci) ← ∞
16.      mincost(n, I(n)) ← minCi∈C(n) cost(I(n), Ci)
17.      optcut(n, I(n)) ← the Ci achieving this minimum
18. return optcut(root, E) // E is the set of all tree edges

b) Heuristic-ReducedOpt Algorithm

The algorithm that computes the optimal navigation, Opt-EdgeCut, is exponential and hence infeasible for the navigation trees of most queries. We therefore propose a heuristic to select the best EdgeCut for a node expansion.
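Stripped of the probabilities PE and Pc, the choice at the heart of the Opt-EdgeCut recurrence is: at each node, either show all L(n) results below it directly, or pay one navigation step plus the children shown, and solve each child subtree optimally. A minimal sketch of that bottom-up choice, with the probability terms deliberately omitted and an illustrative dict encoding of the tree:

```python
def mincost(tree, n):
    """Simplified navigation-cost recurrence.

    `tree` maps a node name to (L, children), where L is the number
    of results stored under the node.  cost(n) = min(
        L,                                  # show everything at once
        1 + |children| + sum of child costs # expand and recurse
    ).  The paper's full recurrence additionally weights these terms
    by the probabilities PE and Pc."""
    L, children = tree[n]
    if not children:
        return L
    expand = 1 + len(children) + sum(mincost(tree, c) for c in children)
    return min(L, expand)
```

Here the exponential cost of the real algorithm comes from enumerating every EdgeCut and every I(n) set rather than this single binary choice per node.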
Note that the input argument to the heuristic is a component subtree I(n) and not the whole active tree, as in Opt-EdgeCut. The reason is that once Opt-EdgeCut has been executed for n, the costs (and optimal EdgeCuts) for all possible I(n)'s are also computed, and hence there is no need to call the algorithm again for subsequent expansions. For a given component subtree I(n), Opt-EdgeCut enumerates a large number of EdgeCuts on I(n) and repeats this recursively on its subtrees. We propose to run Opt-EdgeCut on a reduced version I′(n) of I(n). The reduced tree I′(n) has to be small enough that Opt-EdgeCut can run on it in “real time”. Also, I′(n) should approximate I(n) as closely as possible. I′(n) is the tree of “supernodes” created by partitioning I(n): each supernode in I′(n) corresponds to one partition of the tree. Then, Opt-EdgeCut is executed on I′(n). The algorithm we use to partition the tree is based on the K-partition algorithm [11], which processes the tree in a bottom-up fashion. For each tree node n, the algorithm removes the “heaviest” children of n one by one until the weight of n falls below the threshold k. For each of the removed children, it creates a partition. The result is a tree partitioning with minimum cardinality. The complexity of the K-partition algorithm is O(|V| · log |V|). We adapt the K-partition algorithm to our needs as follows. To each node ni in I(n), we assign a weight equal to |L(ni)| · PE(ni). We run the K-partition algorithm by setting k, the weight threshold, to Σni∈I(n) L(ni) · PE(ni) / Z, where Z is the number of desired partitions. However, this might result in more than Z partitions, due to some non-full partitions. Therefore, we repeatedly run the K-partition algorithm on I(n), gradually increasing k, until at most Z partitions are obtained. Note that Z is the maximum tree size on which Opt-EdgeCut can operate in “real time”.
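The bottom-up splitting step described above can be sketched as follows. This is a sketch in the spirit of the K-partition algorithm [11], not the published algorithm itself (in particular, it does not guarantee minimum cardinality); the function and variable names are illustrative:

```python
def k_partition(tree, weights, root, k):
    """Bottom-up tree partitioning sketch: while a node's accumulated
    weight exceeds the threshold k, split off its heaviest child
    subtrees, each becoming a separate partition (identified here by
    its root node).  `tree` maps a node to its list of children."""
    partitions = []

    def total(n):
        # post-order: weight of the part that stays attached to n
        w = weights[n]
        child_ws = sorted(((total(c), c) for c in tree[n]), reverse=True)
        w += sum(cw for cw, _ in child_ws)
        for cw, c in child_ws:          # heaviest children first
            if w <= k:
                break
            partitions.append(c)        # split child subtree off
            w -= cw
        return w

    total(root)
    partitions.append(root)             # the remainder around the root
    return partitions
```

In the heuristic, each returned partition becomes one supernode of the reduced tree I′(n), and the loop in Heuristic-ReducedOpt reruns this with a growing k until at most Z partitions remain.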
Algorithm Heuristic-ReducedOpt
Input: Component subtree I(n), number Z of partitions
Output: The best EdgeCut
1. z′ ← Z
2. repeat
3.   k ← Σn∈I(n) L(n) · PE(n) / z′
4.   Partitions ← K-partition(I(n), k) // call the K-partition algorithm [11]
5.   z′ ← z′ − 1
6. until at most Z partitions are obtained
7. Construct the reduced subtree I′(n) from Partitions
8. EdgeCut′ ← Opt-EdgeCut(I′(n))
9. EdgeCut ← the EdgeCut of I(n) corresponding to EdgeCut′
10. return EdgeCut

IV. CONCLUSION

We conclude that a new concept, that of concise range queries, has been proposed in this paper, with the addition of a density-based clustering algorithm, which simultaneously addresses the following three problems of traditional range queries. First, it reduces the query result size significantly, as required by the user. The reduced size saves communication bandwidth as well as the client's memory and computational resources, which are of the highest importance for mobile devices. Second, although the query size has been reduced, the usability of the query results has actually been improved. The concise representation of the results often gives the user more intuitive ideas and enables interactive exploration of the spatial database. Finally, we have designed R-tree-based algorithms so that a concise range query can be processed much more efficiently with the edge cut algorithm than evaluating the query exactly, especially in terms of I/O cost. This concept, together with its associated techniques presented here, can greatly enhance the user experience of spatial databases.

REFERENCES
[1] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang, “Selecting Stars: The k Most Representative Skyline Operator,” Proc. Int’l Conf. Data Eng. (ICDE), 2007.
[2] C. Jermaine, S. Arumugam, A. Pol, and A. Dobra, “Scalable Approximate Query Processing with the DBO Engine,” Proc. ACM SIGMOD, 2007.
[3] G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis, “Fast Data Anonymization with Low Information Loss,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2007.
[4] G. Aggarwal, T. Feder, K. Kenthapadi, S.
Khuller, R. Panigrahy, D. Thomas, and A. Zhu, “Achieving Anonymity via Clustering,” Proc. Symp. Principles of Database Systems (PODS), 2006.
[5] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A.W.-C. Fu, “Utility-Based Anonymization Using Local Recoding,” Proc. ACM SIGKDD, 2006.
[6] C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant, “RIC: Parameter-Free Noise-Robust Clustering,” ACM Trans. Knowledge Discovery from Data, vol. 1, no. 3, 2007.
[7] R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 1994.
[8] D. Lichtenstein, “Planar Formulae and Their Uses,” SIAM J. Computing, vol. 11, no. 2, pp. 329-343, 1982.
[9] R. Tamassia and I.G. Tollis, “Planar Grid Embedding in Linear Time,” IEEE Trans. Circuits and Systems, vol. 36, no. 9, pp. 1230-1234, Sept. 1989.
[10] H.V. Jagadish, B.C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “iDistance: An Adaptive B+-Tree Based Indexing Method for Nearest Neighbor Search,” ACM Trans. Database Systems, vol. 30, no. 2, pp. 364-397, 2005.
[11] S. Kundu and J. Misra, “A Linear Tree Partitioning Algorithm,” SIAM J. Computing, vol. 6, no. 1, pp. 151-154, 1977.
[12] D. Maglott, J. Ostell, K.D. Pruitt, and T. Tatusova, “Entrez Gene: Gene-Centered Information at NCBI,” Nucleic Acids Res., vol. 33 (Database Issue), pp. D54-D58, Jan. 2005.
[13] Medical Subject Headings (MeSH®). http://www.nlm.nih.gov/mesh/
[14] J.A. Mitchell, A.R. Aronson, and J.G. Mork, “Gene Indexing: Characterization and Analysis of NLM’s GeneRIFs,” Proc. AMIA Symp., 8th-12th November, Washington, DC, pp. 460-464.

BIOGRAPHIES

S. Anuradha completed her B.Tech at Aditya College, Tekkali. She is now pursuing her M.Tech at Avanthi Institute of Engineering and Technology. Her interests are in data mining.
Venkateswarlu Bondu received his master's degree in Computer Science and Systems Engineering from Andhra University College of Engineering and is pursuing a Ph.D. in Computer Science at Andhra University. He is an Associate Professor in the Department of Computer Science at Avanthi Institute of Engineering and Technology. His research interests are software engineering and data modeling.