An Efficient Edge Cut Mechanism for Concise Range Queries

International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 6 - June 2013
S. Anu Radha*, B. Venkateswarlu#
* Final M.Tech Student, Dept. of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Narsipatnam, Andhra Pradesh
# Associate Professor, Dept. of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Narsipatnam, Andhra Pradesh
Abstract: Due to the rapid growth of wireless communication technology, people frequently view maps or obtain related services on handheld devices such as mobile phones and PDAs. Range queries, one of the most commonly used tools, are often posed by users to retrieve needed information from a spatial database. However, due to the limited communication bandwidth and hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communication efficient nor informative to the user, simply because a range query often returns too many results. In view of this problem, we present a novel idea of effective navigation and a best edge cut mechanism for queried results that increases performance, reduces the communication cost as well as the navigation cost, and offers better usability to users, providing an opportunity for interactive exploration. The usefulness of concise range queries is confirmed by comparing them with other possible alternatives, such as sampling, clustering, and edge cut mechanisms.
I. INTRODUCTION
Spatial databases have witnessed an increasing
number of applications recently, partially due to the fast
advance in the fields of mobile computing, embedded
systems and the spread of the Internet. For example, it is
quite common these days that people want to figure out the
driving or walking directions from their handheld devices
(mobile phones or PDAs). However, facing the huge amount of spatial data collected by various devices, such as sensors and satellites, and the limited bandwidth and/or computing power of handheld devices, how to deliver light but usable results to clients is a very interesting and, of course, challenging task.
Our work has the same motivation as several recent works on finding good representatives for large query answers, for example, representative skyline points in [7].
Furthermore, such requirements are not specific to spatial
databases. General query processing for large relational
databases and OLAP data warehouses has posed similar
challenges. For example, approximate, scalable query
processing has been a focal point in the recent work [6]
where the goal is to provide light, usable representations of
the query results early in the query processing stage, such
that an interactive query process is possible. In fact, [6] argued for returning concise representations of the final query results at every possible stage of a long-running query evaluation. However, the focus of [6] is on join queries in relational databases, and the approximate representation is a random sample of the final query results. As we will soon see, the goal of this work is different, and random sampling is not a good solution for our problem.
Usability refers to the question of whether the user
could derive meaningful knowledge from the query results.
Note that more results do not necessarily imply better
usability. On the contrary, too much information may do
more harm than good, which is commonly known as the
information overload problem. As a concrete example,
suppose that a user issues a query to her GPS device to find
restaurants in the downtown Boston area. Most readers who have used a GPS device will quickly realize that the results returned in this case could be almost useless to the
client for making a choice. The results (i.e., a large set of
points) shown on the small screen of a handheld device may
squeeze together and overlap. It is hard to differentiate
them, let alone use this information.
A properly sized representation of the results will
actually improve usability. In addition, usability is often related to another component, namely, query interactiveness, which has become more and more important. Interactiveness refers to the capability of letting the user provide feedback to the server and refine the query results as he or she wishes.
This is important as very often, the user would like to have a
rough idea for a large region first, which provides valuable
information to narrow down her query to specific regions. In
the above example, it is much more meaningful to tell the
user a few areas with high concentration of restaurants
(possibly with additional attributes, such as Italian vs.
American restaurants), so that she could further refine her
query range.
II. LITERATURE SURVEY
A) Query Languages
Bird et al. (2000) compared some of the query languages available (at that time) for graph-based annotation frameworks. These included Emu and the MATE query
language. They then proposed their own query language for
annotation graphs. This language used path patterns and
abbreviatory devices to provide a convenient way to express
a wide range of queries. This language also exploited the
quasi-linearity of annotation graphs by partitioning the
precedence relation to allow efficient temporal indexing of
the graphs. Another such survey was by Lai and Bird (2004), where the authors considered TigerSearch, CorpusSearch, NiteQL, Tgrep2, Emu and LPath (Bird et al., 2005; Bird et al., 2006). From this study,
the authors tried to derive the requirements that a good tree
query language should satisfy. Resnik and Elkiss (2005) reported a search engine for linguists that was meant to be easy to use for linguists not versed in the use of computers. This tool allowed linguists to draw patterns in the form of sub-trees, which were then converted into queries and searched. Like almost all such languages, it did not allow manipulation of data, and it worked only for certain levels of annotation. It was mainly aimed at searching phrase-structure patterns and morphological information. One of the well-known query languages for annotated corpora used for linguistic studies and for NLP is the Corpus Query Language (CQL), very different from the one we are presenting here. It is used in a popular tool called Sketch Engine (Kilgarriff et al., 2004). It provides a wide variety of functionalities to access corpora, such as searching words, lemmas, roots, and POS tags of a word, and getting the left and right contexts up to a window size of 15.
Another usual practice is to have a query tool for syntactically annotated corpora such that the data is converted internally to a relational database and the query is written using SQL (Kallmeyer, 2000). A much earlier work was titled ‘A modular and flexible architecture for an integrated corpus query system’ (Christ, 1994), which is used by the IMS Corpus Workbench. Another query language called MQL is used in the Emdros database engine for analyzed or annotated text. MQL is a descendant of QL (Doedens, 1994). The language that we describe here is similar in some aspects to many of these languages, but different in others. The most important differences are the support for threaded trees, its very concise syntax, its query-and-action mechanism (data manipulation), arbitrary return values, support for custom commands, and the possibility of pipelining results through the source and destination operators. It also has generally high expressive power. Moreover, it can be used for purposes other than NLP because the data that it operates on is similar to the general XML representation, and the language has some of the power of both XPath-based querying and XSLT-based transformation.
B) Spatial Clustering Algorithm
Density-based clustering algorithms are one of the primary methods for data mining. The clusters formed by density-based clustering are easy to understand, and the approach does not limit itself to particular cluster shapes. Existing density-based algorithms have trouble because they are not capable of finding all meaningful clusters when the density varies widely. VDBSCAN was introduced to compensate for this problem. It is the same as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), with the only difference that VDBSCAN selects several values of the parameter Eps for different densities according to the k-dist plot. The problem is that the value of the parameter k in the k-dist plot is user defined. A new method is therefore introduced to find the value of the parameter k automatically, based on the characteristics of the dataset. In this method we consider the spatial distance from a point to all other points in the dataset. The proposed method has the potential to find an optimal value for the parameter k.
C) Best Edge Cut Mechanism
We can compute the optimal cost by recursively enumerating all possible sequences of valid edge cuts, starting from the root and reaching every concept in the navigation tree, computing the cost for each step and taking the minimum. However, this algorithm is prohibitively expensive. Instead, we propose an alternative algorithm, Opt-EdgeCut, that uses dynamic programming to reduce the computation cost. As shown below, Opt-EdgeCut is still exponential and is used only to evaluate the quality of the heuristic.
III. OUR PROPOSED WORK
For clustering spatial data, the DBSCAN algorithm is based on a center-based approach. In the center-based approach, density is estimated for a particular point in the dataset by counting the number of points within a specified radius, Eps, of that point. This count includes the point itself. The center-based approach to density allows us to classify a point as a core point, a border point, or a noise (background) point. A point is a core point if the number of points within Eps, a user-specified parameter, exceeds a certain threshold, MinPts, which is also a user-specified parameter. Any two core points that are close enough, i.e., within a distance Eps of one another, are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as that core point. Noise points are discarded.
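As an illustration of the center-based classification just described, the following is a minimal Python sketch (not the paper's code) that labels each point as a core, border, or noise point for a given Eps and MinPts; it uses a naive O(n^2) distance computation purely for clarity.

import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (center-based density)."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances (quadratic, fine for an illustration).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbor_counts = (dists <= eps).sum(axis=1)      # includes the point itself
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif is_core[dists[i] <= eps].any():          # within Eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels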
The basic approach to determining the parameters Eps and MinPts is to look at the behavior of the distance from a point to its kth nearest neighbor, which is called the k-dist. The k-dists are computed for all the data points for some k and sorted in ascending order.
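A companion sketch of the k-dist computation: the distance from each point to its k-th nearest neighbour, sorted in ascending order, from which Eps is usually read off at the "knee" of the curve. Again, this is an illustrative NumPy sketch, not the authors' implementation.

import numpy as np

def sorted_k_dist(points, k):
    """Return the k-dist values (distance to the k-th nearest neighbour), ascending."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dists.sort(axis=1)              # column 0 is each point's distance to itself
    return np.sort(dists[:, k])     # k-th nearest neighbour, excluding the point itself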
Step 1: Partition the k-dist plot and obtain thresholds for the parameters Eps_i (i = 1, 2, ..., n).
Step 2: For each Eps_i (i = 1, 2, ..., n):
    set Eps = Eps_i;
    adopt the DBSCAN algorithm for the points that are not yet marked;
    mark the newly clustered points.
Finally, display all the marked points as the corresponding clusters.
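The two steps above amount to repeated DBSCAN passes, one per density level, each pass clustering only the points left unmarked by earlier passes. The sketch below assumes scikit-learn's DBSCAN implementation; the eps_values argument stands in for the thresholds read off the partitioned k-dist plot, and the MinPts default is an illustrative assumption.

import numpy as np
from sklearn.cluster import DBSCAN

def vdbscan_like(points, eps_values, min_pts=4):
    """Run one DBSCAN pass per Eps value, clustering only still-unmarked points."""
    points = np.asarray(points, dtype=float)
    labels = np.full(len(points), -1)          # -1 = not yet marked (noise so far)
    next_label = 0
    for eps in eps_values:
        unmarked = np.where(labels == -1)[0]
        if len(unmarked) == 0:
            break
        result = DBSCAN(eps=eps, min_samples=min_pts).fit(points[unmarked])
        for idx, cluster in zip(unmarked, result.labels_):
            if cluster != -1:                  # leave noise unmarked for later passes
                labels[idx] = next_label + cluster
        if (labels >= 0).any():
            next_label = labels.max() + 1
    return labels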
A) Tree Navigation
In order to use the algorithms of Section 3.3 to answer a concise range query Q with budget k from the client, the database server would first need to evaluate the query as if it were a standard range query, using some spatial index built on the point set P, typically an R-tree. After obtaining the complete query result P ∩ Q, the server then partitions P ∩ Q into k groups and returns the concise representation. However, as the main motivation for obtaining a concise answer is exactly that P ∩ Q is too large, finding the entire P ∩ Q and running the algorithms of Section 3.3 are often too expensive for the database server.
In this section, we present algorithms that process the concise range query without computing P ∩ Q in its entirety. The idea is to first find k0 bounding boxes, for some k0 > k, that collectively contain all the points in P ∩ Q, by using the existing spatial index structure on P. Each of these bounding boxes is also associated with the count of points inside it. Then, we run a weighted version of the algorithm in Section 3.3, grouping these k0 bounding boxes into k larger bounding boxes to form the concise representation R. Typically k0 ≪ |P ∩ Q|, so we can expect significant savings in terms of I/O and CPU costs compared with answering the query exactly. Therefore, adopting concise range queries instead of traditional exact range queries not only solves the bandwidth and usability problems, but also leads to substantial efficiency improvements.
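For illustration only, the sketch below shows one simple way the k0 weighted boxes could be grouped into k larger boxes: greedily merging the pair whose union has the smallest area until k boxes remain. This is a hypothetical stand-in for the (weighted) algorithm of Section 3.3, which is not reproduced in this paper; the box tuples and function names are our own.

def union(a, b):
    """Smallest box (xmin, ymin, xmax, ymax) enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def greedy_group(boxes, counts, k):
    """Merge (box, count) pairs until only k groups remain."""
    boxes, counts = list(boxes), list(counts)
    while len(boxes) > k:
        best = None                               # (union area, i, j, union box)
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                u = union(boxes[i], boxes[j])
                if best is None or area(u) < best[0]:
                    best = (area(u), i, j, u)
        _, i, j, u = best
        boxes[i], counts[i] = u, counts[i] + counts[j]
        del boxes[j]
        del counts[j]
    return list(zip(boxes, counts))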
[Figure: an example point set, a range query Q, and the corresponding R-tree with nodes N1-N6 and points p1-p12.]
B) The R-Tree
The standard range query Q can be processed using an R-tree as follows. We start from the root of the R-tree and check the MBR of each of its children. Then, we recursively visit any node u whose MBR intersects or falls inside Q. When we reach a leaf, we simply return all the points stored there that are inside Q. In this section, we additionally assume that each node u in the R-tree also keeps n_u, the number of points stored in its subtree. Such counts can easily be computed and maintained in the R-tree. On top of this tree navigation, we add an edge cut mechanism to filter the results and obtain the best performance.
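To make the traversal concrete, here is a minimal Python sketch of a count-augmented R-tree walk that returns bounding boxes with point counts instead of raw points: a node fully inside Q contributes its MBR and stored count without being descended into, while a partially overlapping node is expanded. The RNode structure and helper names are hypothetical stand-ins, not the server's actual index code.

from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]            # (xmin, ymin, xmax, ymax)

@dataclass
class RNode:
    mbr: Box
    count: int                                      # n_u: points stored in this subtree
    children: List["RNode"] = field(default_factory=list)
    points: List[Tuple[float, float]] = field(default_factory=list)

def contains(outer: Box, inner: Box) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def intersects(a: Box, b: Box) -> bool:
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def collect_boxes(node: RNode, q: Box):
    """Return (mbr, count) pairs that together cover all indexed points inside q."""
    if contains(q, node.mbr):
        return [(node.mbr, node.count)]             # whole subtree inside Q: no descent
    if not intersects(q, node.mbr):
        return []
    if not node.children:                           # leaf: keep only points inside q
        inside = [p for p in node.points
                  if q[0] <= p[0] <= q[2] and q[1] <= p[1] <= q[3]]
        if not inside:
            return []
        xs, ys = zip(*inside)
        return [((min(xs), min(ys), max(xs), max(ys)), len(inside))]
    boxes = []
    for child in node.children:
        boxes.extend(collect_boxes(child, q))
    return boxes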
a. Optimal Algorithm for the Best EdgeCut
The Opt-EdgeCut algorithm computes the minimum expected navigation cost (and the EdgeCut that achieves it) by traversing the navigation tree in post-order and computing the navigation cost bottom-up, starting from the leaves. For each node n, the algorithm enumerates and stores the list C(n) of all possible EdgeCuts for the subtree rooted at n, and the list II(n) of all possible I(n) sets with which node n can be annotated. C(n) and II(n) are enumerated in inclusion order, which leads to an ordering that maximizes reuse in the dynamic programming algorithm. The algorithm then computes the minimum cost for each subtree I(n) in II(n), given the EdgeCuts in C(n) and the already computed minimum costs for the descendants of n. The complexity of Opt-EdgeCut is O(|V| · 2^|E|).
Algorithm Opt-EdgeCut
Input: the navigation tree T
Output: the best EdgeCut
1.  Traverse T in post-order; let n be the current node
2.  while n ≠ root do
3.    if n is a leaf node then
4.      mincost(n, ∅) ← P_E · L(n)
5.      optcut(n, ∅) ← {∅}
6.    else
7.      C(n) ← enumerate all possible EdgeCuts for the tree rooted at n
8.      II(n) ← enumerate all possible subtrees for the tree rooted at n
9.      foreach I(n) ∈ II(n) do
10.       compute P_E(I(n)) and P_C(I(n))
11.       foreach C ∈ C(n) do
12.         if C is a valid EdgeCut for I(n) then
13.           cost(I(n), C) ← P_E(I(n)) · ((1 − P_C(I(n))) · L(I(n)) + P_C(I(n)) · (1 + |S| + Σ_{s∈S} mincost(I_C(s))))
14.         else
15.           cost(I(n), C) ← ∞
16.       mincost(n, I(n)) ← min_{C_i ∈ C(n)} cost(I(n), C_i)
17.       optcut(n, I(n)) ← C_i
18. return optcut(root, E)   // E is the set of all tree edges
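To make line 13 of the listing concrete, the small function below evaluates the cost expression for one candidate EdgeCut, given the probability of exploring the component, the probability of continuing the navigation, the number of results shown if the user stops, and the already computed minimum costs of the component subtrees below the cut. The function and argument names are ours, chosen to mirror the notation above; this is a sketch, not the authors' code.

def edgecut_cost(p_explore, p_continue, num_results, child_mincosts):
    """cost(I(n), C) = P_E(I(n)) * ((1 - P_C) * L(I(n))
                       + P_C * (1 + |S| + sum of mincost over component subtrees))."""
    show_all = (1 - p_continue) * num_results                 # user examines the results
    drill_down = p_continue * (1 + len(child_mincosts) + sum(child_mincosts))
    return p_explore * (show_all + drill_down)

# Example: p_explore = 0.6, p_continue = 0.5, 40 results, two subtrees of cost 8 and 5.
# edgecut_cost(0.6, 0.5, 40, [8, 5]) == 0.6 * (0.5 * 40 + 0.5 * (1 + 2 + 13)) == 16.8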
b. Heuristic-ReducedOpt Algorithm
The algorithm that computes the optimal navigation, Opt-EdgeCut, is exponential and hence infeasible for the navigation trees of most queries. We therefore propose a heuristic to select the best EdgeCut for a node expansion. Note that the input argument to the heuristic is a component tree I(n) and not the whole active tree, as in Opt-EdgeCut. The reason is that once Opt-EdgeCut has been executed, the costs (and optimal EdgeCuts) for all possible I(n)'s are also computed, and hence there is no need to call the algorithm again for subsequent expansions.
For a given component subtree I(n), Opt-EdgeCut enumerates a large number of EdgeCuts on I(n) and repeats this recursively on its subtrees. We propose to run Opt-EdgeCut on a reduced version I'(n) of I(n). The reduced tree I'(n) has to be small enough that Opt-EdgeCut can run on it in "real time". Also, I'(n) should approximate I(n) as closely as possible. I'(n) is the tree of "supernodes" created by partitioning I(n); each supernode in I'(n) corresponds to a partition of the tree I(n). Then, Opt-EdgeCut is executed on I'(n).
The algorithm we use to partition the tree is based on the K-partition algorithm [11], which processes the tree in a bottom-up fashion. For each tree node n, the algorithm removes the "heaviest" children of n one by one until the weight of n falls below k. For each of the removed children, it creates a partition. The result is a tree partitioning with minimum cardinality. The complexity of the K-partition algorithm is O(|V| · log|V|). We adapt the K-partition algorithm to our needs as follows. For each node n in I(n), we assign a weight equal to |L(n)| · P_E(n). We run the K-partition algorithm by setting k, the weight threshold, to Σ_{n_i ∈ I(n)} |L(n_i)| · P_E(n_i) / Z, where Z is the number of desired partitions. However, this might result in more than Z partitions, due to some non-full partitions. Therefore we repeatedly run the K-partition algorithm on I(n), gradually increasing k, until at most Z partitions are obtained. Note that Z is the maximum tree size on which Opt-EdgeCut can operate in "real time".
Algorithm Heuristic-ReducedOpt
Input: component subtree I(n), number Z of partitions
Output: the best EdgeCut
1.  z' ← Z
2.  repeat
3.    k ← Σ_{n ∈ I(n)} L(n) / z'
4.    Partitions ← K-partition(I(n), k)   // call the K-partition algorithm [11]
5.    z' ← z' − 1
6.  until at most Z partitions are obtained
7.  Construct the reduced subtree I'(n) from Partitions
8.  EdgeCut' ← Opt-EdgeCut(I'(n))
9.  EdgeCut ← the cut of I(n) corresponding to EdgeCut'
10. return EdgeCut
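For completeness, here is a minimal sketch of the bottom-up partitioning step in the spirit of the K-partition algorithm [11] as described above: at each node, the heaviest children are cut off one by one until the accumulated weight falls below the threshold k, and each removed subtree becomes a partition. The TNode structure and weight field are illustrative assumptions, not the implementation of [11].

from dataclasses import dataclass, field

@dataclass
class TNode:
    weight: float                        # e.g. |L(n)| * P_E(n) for a navigation-tree node
    children: list = field(default_factory=list)

def k_partition(node, k, partitions):
    """Return the weight remaining at `node`; cut-off subtrees are appended to `partitions`."""
    remaining = [(k_partition(c, k, partitions), c) for c in node.children]
    weight = node.weight + sum(w for w, _ in remaining)
    remaining.sort(key=lambda wc: wc[0], reverse=True)    # heaviest children first
    i = 0
    while weight > k and i < len(remaining):
        w, child = remaining[i]
        partitions.append(child)                          # removed subtree = one partition
        weight -= w
        i += 1
    return weight

# Usage sketch: partitions = []; k_partition(root, k, partitions); partitions.append(root)
# The subtree left at the root is the final partition; each partition becomes a supernode of I'(n).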
IV. CONCLUSION
We conclude that a new concept, that of concise range queries, has been proposed in this paper with the addition of a density-based clustering algorithm, which simultaneously addresses the following three problems of traditional range queries. First, it reduces the query result size significantly, as required by the user. The reduced size saves communication bandwidth as well as the client's memory and computational resources, which are of the highest importance for mobile devices. Second, although the query result size has been reduced, the usability of the query results has actually been improved. The concise representation of the results often gives the user more intuitive ideas and enables interactive exploration of the spatial database. Finally, we have designed R-tree-based algorithms so that a concise range query can be processed much more efficiently with the edge cut algorithm than by evaluating the query exactly, especially in terms of I/O cost. This concept, together with the associated techniques presented here, could greatly enhance the user experience of spatial databases.
REFERENCES:
[1] X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang, “Selecting
Stars: The k Most Representative Skyline Operator,” Proc.
Int’l Conf. Data Eng. (ICDE), 2007.
[2] C. Jermaine, S. Arumugam, A. Pol, and A. Dobra,
“Scalable Approximate Query Processing with the dbo
Engine,” Proc. ACM SIGMOD, 2007.
[3] G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis, “Fast
Data Anonymization with Low Information Loss,” Proc.
Int’l Conf. Very Large Data Bases (VLDB), 2007.
[4] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R.
Panigrahy, D. Thomas, and A. Zhu, “Achieving Anonymity
via Clustering,” Proc. Symp. Principles of Database
Systems (PODS), 2006.
[5] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A.W.-C.
Fu, “Utility-Based Anonymization Using Local Recoding,”
Proc. ACM SIGKDD, 2006.
[6] C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant, “RIC: Parameter-Free Noise-Robust Clustering,” ACM Trans. Knowledge Discovery from Data, vol. 1, no. 3, pp. 10:1-10:28, 2007.
[7] R.T. Ng and J. Han, “Efficient and Effective Clustering
Methods for Spatial Data Mining,” Proc. Int’l Conf. Very
Large Data Bases (VLDB), 1994.
[8] D. Lichtenstein, “Planar Formulae and Their Uses,”
SIAM J. Computing, vol. 11, no. 2, pp. 329-343, 1982.
[9] R. Tamassia and I.G. Tollis, “Planar Grid Embedding in
Linear Time,” IEEE Trans. Circuits and Systems, vol. 36,
no. 9, pp. 1230-1234, Sept. 1989.
[10] H.V. Jagadish, B.C. Ooi, K.-L. Tan, C. Yu, and R.
Zhang, “iDistance: An Adaptive B+-Tree Based Indexing
Method for Nearest Neighbor Search,” ACM Trans.
Database Systems, vol. 30, no. 2, pp. 364-397, 2005.
[11] S. Kundu and J. Misra, “A Linear Tree Partitioning Algorithm,” SIAM J. Computing, vol. 6, no. 1, pp. 151-154, 1977.
[12] D. Maglott, J. Ostell, K.D. Pruitt, and T. Tatusova, “Entrez Gene: Gene-Centered Information at NCBI,” Nucleic Acids Research, vol. 33 (Database Issue), pp. D54-D58, Jan. 2005.
[13] Medical Subject Headings (MeSH®), http://www.nlm.nih.gov/mesh/
[14] J.A. Mitchell, A.R. Aronson, and J.G. Mork, “Gene Indexing: Characterization and Analysis of NLM’s GeneRIFs,” Proc. AMIA Symp., 8-12 November, Washington, DC, pp. 460-464.
BIOGRAPHIES
S. Anuradha completed her B.Tech at Aditya College, Takkali. She is currently pursuing her M.Tech at Avanthi Institute of Engineering and Technology. Her research interests include data mining.
Venkateswarlu Bondu received his master’s degree in Computer Science and Systems Engineering from Andhra University College of Engineering and is pursuing a Ph.D. in Computer Science at Andhra University. He is an Associate Professor in the Department of Computer Science at Avanthi Institute of Engineering and Technology. His research interests include software engineering and data modeling.