Adequate Flossy Type-Ahead Inquest in XML Facts Matangi Prema Sagar

advertisement
International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number1- Dec 2014
Adequate Flossy Type-Ahead Inquest in XML Facts
Matangi Prema Sagar #1 , Sk. Md. Rafi #2
#1
II Year M.Tech Student, #2Associate Professor, Department of CSE
Sri Mittapalli College of Engineering, Guntur, Andhra Pradesh, India.
Abstract— When a user submits a query, they often feel “left in
the dark” when they have incomplete knowledge about the
original data. At that time, they have to make use of a try-andsee technique for searching data. To solve this problem, a recent
development has been made i.e., auto complete. By using simple
keywords, users can get relief from the structured query
language. The main objective of web search engine is “blocklevel” search engine. The primary difficulty is known as block
extraction from a web page with reverence to a query. We give a
formulation of the similarity of blocks and introduce the notion
of “layout tree”. In this paper, we study flossy type-ahead inquest
in XML facts, a new information-access paradigm in which the
system searches XML facts. Our planned technique has the
following features: 1) Inquest as you type: It expands Auto
complete by sustaining queries with numerous keywords in XML
facts. 2) Flossy: It can get high class responses that have
keywords corresponding question keywords approximately. 3)
Efficient: Our effective index structures and searching
algorithms can achieve a very high interactive speed. To achieve
a high interactive speed, we suggest helpful index structures and
top-k algorithms. We have executed our technique on real data
positions and the tentative facts demonstrate that our technique
attains high inquest effectiveness and outcome excellence.
Keywords— Block-level, Layout tree, XML facts, flossy inquest,
type-ahead inquest.
I. INTRODUCTION
In common, we have to discover the significant sub-trees
which details structural relationship representation from XML
facts to answer the related queries. These facts can be
identified with the help of a keyword called ‘search’. This
search technique is the extensively accepted technique for
querying the system documents and the WWW. The
advantage of search keyword is without having any
knowledge about the output of the facts; the user can find the
information. When a user wants to inquest the data, user just
creates a keyword and suggests it to the system and retrieves
the similar results by using different querying languages.
When the user don’t have certain knowledge in relation to
query languages, then it will leads to problems. For these
situations, Flossy keyword inquest is proposed.
To get the approximate value rather than exact, a logic
system is there i.e., Flossy Logic. Fuzzy set is the basic unit of
a flossy logic. Flossy logic defines a membership function A:
X > [0, 1] that matches element x of X into real numbers in [0,
1]. When an element present completely in the set, then it is
denoted by 1 and if the element doesn’t exist in the set and if
some variation is there, then it is represented by 0. Related
data of a set is denoted with the help of membership value.
With the negligible errors in query keywords also flossy
keyword inquest permits the users to search the data. By using
ISSN: 2231-5381
the relevancy value, the data has been retrieved. It simply
represents how much amount of the query keyword is
associated to the keywords in XML facts.
Initially, for non-database users, it is so much difficult to
understand the query languages. Generally, a query language
needs the queries to be asked against the primary database
schemas. Auspiciously, for asking queries among XML facts,
the substitute method is identified i.e., keyword search [7],
which is simple and yet familiar. To find the same info again,
then the user repeats the same step.
In this paper, we suggest TASX (well-defined as “task”), a
flossy type-ahead search method in XML facts which gives a
gracious interface for users to search XML facts, and can
drastically set aside users typing effort. In this paper, we
discuss about the main challenge i.e. search efficiency.
Normally an interactive speed requires this delay should be
within milliseconds. Observe that this delay time considers the
network transfer delay, execution time on the server, and the
time taken by the browser to execute its Java- Script. To get
our objective, we recommend effective index structures and
algorithms to respond keyword queries in XML facts. To
review, we make the subsequent offerings:
We formalize the problem of flossy type-ahead search in
XML facts.
 To obtain a high and efficient interactive speed, we
suggest efficient index structures and efficient
algorithms.
 To competently discover the top-k appropriate results,
we extend ranking functions and early termination
techniques.
 The results of an experimental study conducted by us
explain that our method attains high search
effectiveness and outcome excellence.
The remaining paper will be as follows. In Section II, we
give description about the related works on this paper. Section
III provides the basic fundamentals what we are going to do in
this paper. Problem formulation of flossy type-ahead inquest
in XML facts are defined in the Section IV. Experimental
study and results are given a brief description in the Section V.
Section VI consists of the conclusion and the
acknowledgements will be given next to that section. Finally
paper ends with the references what we used in the paper.
II. RELATED WORK
Prediction and AutoComplete:
To solve this problem, various features have been
introduced. The widely and frequently used technique is
“Autocomplete”. This technique automatically expects a word
http://www.ijettjournal.org
Page 22
International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number1- Dec 2014
or phrase what the user is going to type based on the
incomplete string the user has typed. Many systems and more
websites are maintaining this characteristic.
The main drawback of “Autocomplete” is that the system
considers a query with several keywords as an only string. By
this, it will not permit these keywords to show at different
positions. Let us tale an example, consider the search box on
Apple.com, which permits “Autocomplete” search on Apple
products. Even though a keyword query “modeling” can find a
record “Unified Modeling Language has 9 diagrams”, a query
with keywords “Modeling Diagrams” cannot find this record.
Since the two keywords are dissimilar and their positions are
at different places in the answer.
Complete Search:
For solving this type of problem in Autocomplete, to find
the relevant answers at different places, a technique has been
introduced i.e. Complete-Search [5], [6]. This CompleteSearch technique does not support estimated search that it
doesn’t accept slight mistakes among keywords and results. In
recent times, we studied flossy type-ahead search in textual
documents [11]. This gives the freedom to users to search data
as they type, even there are some of the slight mistakes of
given input.
We also considered type-ahead inquest in relational
databases [1], [4], [9]. Though, whatever the existing methods
we have, they will not search XML facts in a type-ahead
search manner. This problem is occurred due to the XML
structure i.e. XML contains parent-child relationships. We
want to recognize applicable XML subtrees that confine such
structural relationships from XML facts to answer keyword
queries, instead of single documents.
III. BASIC FUNDAMENTALS OF PAPER
Notation:
XML document can be formed as a rooted and labeled
tree. When a node ‘B’ in the tree matches to a part in the
XML document and has a tag. Consider two nodes A and B,
we use “A < B” (“A > B,” correspondingly) to denote that
node A is a predecessor of node B.
the “quasi-content” nodes of the keyword. For example,
consider the XML document in Fig. 1, e (node 8) is a content
node for keyword “5,” and c (node 6) is a quasi-content node
of keyword “5”.
IV. PROBLEM FORMULATION OF FLOSSY TYPE-AHEAD
INQUEST IN XML FACTS
First of all, we begin how the TASX works for queries
with several keywords in XML facts, by permitting slight
faults of query keywords and their contradictions in the facts.
Considering that there is an original XML document that
exists on a server. To access the data from the XML
document, users have to use a web browser. Each keystroke
that the user types invoke a query, which includes the current
string the user, has typed in. The browser sends the query to
the server, which computes and returns to the user the best
answers ranked by their relevancy to the query.
We formalize the problem of flossy type-ahead inquest in
XML facts as follows:
1.
Flossy Type-Ahead Inquest in Xml Facts:
Given an XML document T, a keyword query K = (k1, k2, k3. .
. kn), and an edit-distance threshold г. Let us consider the
expected word set to be Mkn = {m|m is a tokenized word in T
and there exists a prefix of m, kn’, ed (kn,kn’)}. Let the expected
answer set be EK. For the keystroke that invokes K, we return
the top-k answers in EK for a given value k, ranked by their
relevancy to K.
Mainly two disputes are there to maintain flossy typeahead inquest in XML facts. In those two disputes, the first
one is to identify the expected words which have prefixes
related to the given input incomplete keyword after each
keystroke from the user with an efficient manner. Another one
is how to calculate the top-k expected responses of a query
with several keywords, mainly when there are many expected
words.
We initiate effective index structures and incremental
calculating algorithms to deal with the first dispute and
effective ranking functions [12], early termination techniques,
efficient algorithms, and forward-index structures to address
the second dispute.
2.
LCA-Based Fuzzy Type-Ahead Search:
To classify the similar results on top of expected answers,
we used the semantics of ELCA [2], [6]. This section suggests
an LCA-based fuzzy type-ahead search technique [8].
Fig. 1: XML Document
For example, consider the XML document in Fig. 1, we
have r (node 3) > a (node 4) and r (node 2) < s (node 1).
A keyword query contains a set of keywords i.e. (k1, k2, k3.
. . kn). For every keyword kn, we will entitle the nodes in the
tree that have the keyword, then they are the “content nodes”
for kn. The predecessor nodes of the content nodes are called
ISSN: 2231-5381
Index Structure:
The present work utilizes the trie structure to index the
words in the underlying XML data. For each and every word,
it matches to an exceptional path from the origin of the trie to
a child node. For the node on every path, it contains a label of
character. We stock up an inverted list of id’s of XML
essentials that hold the declaration of the leaf node for each
http://www.ijettjournal.org
Page 23
International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number1- Dec 2014
and every leaf node. For example, consider the XML
document in Fig. 1 and the trie construction for the tokenized
words is shown in Fig. 2. The word “BG” has a node ID of 4.
Its inverted list includes XML elements dt and eb.
 |Q| is the number of terms contained in Q.





km is keyword.
p is the pivotal node for Q and km.
α is the damping factor between 0 and 1.
δ(Q, p) denotes the distance between Q and p.
P is the set of pivotal nodes for Q and km.
4. Finding Top-k minimal-cost trees:
Fig. 2: Trie Structure of XML Tree
Queries with multiple keywords:
When a user enters the query, then the query item is
divided into multiple keywords k1, k2 … kn. At that time, for
every keyword equivalent active nodes [11] will be calculated
and for every active node, its child leaf and matching id’s lists
are reclaimed. Then combination of those active nodes is
calculated to answer the query.
We recommend how we can find the top-k relevant
minimal-cost trees. For every leaf trie node, we continue the
content nodes and quasi-content nodes in the XML document
in the trie index as shown in Fig. 3. After that, their
corresponding gains and pivotal paths for the keyword of the
child node, arranged by the relation with the keyword.
3. Minimal Tree:
Generally XML tree’s each and every node is appropriate
to the question with dissimilar attains. We classify their
equivalent responses to the questions as its sub-tree with paths
to nodes for each node. The classifying of results into sub-tree
is called the “minimal-cost tree”. Basically, different nodes
related to different results to the query. For that reason, we
study how to quantify the significance of each response to the
query for ranking.
Ranking:
We primarily calculate the weight between the given root
node and each input keyword to rank a minimal cost tree.
After that, we join these weighted scores for every input
keyword as the overall score of the minimal-cost tree. To
calculate the related score among the root node Q to given
keyword kn , we recommend two ranking functions. First
ranking function considers that Q contains kn [3]. Now the
second ranking function considers that Q does not contain kn
but has a child containing kn .
Here the above formula is defined as:
 s is constant , it is set to 0.2.
 tf (kn , Q) is the number of occurrences of keyword kn in
the sub tree rooted at Q.
 idf (kn) is the inverse document frequency of kn , i.e. ratio
of the number of nodes in XML document to the number
of nodes that that contain kn .
 Qtl is normalized term length.
ISSN: 2231-5381
Fig. 3: Extended Trie Structure
Generally it is too tough to create the union lists of each
and every given input keyword because there may be
numerous expected words and many inverted lists. As an
alternative, we generated a fractional virtual list on the fly. To
get the top-k answers in the query, we use fractional virtual
list only. By using this fractional virtual list, it avoids
accessing of all the elements of inverted lists of expected
words. Those which are having high scores only will be
accessed. We can do the removal of elements if we have
calculated top-k answers using the fractional accessed
elements. There is no need to visit other elements on the
inverted lists after the removal of unwanted elements.
V. EXPERIMENTAL STUDY AND RESULTS
With the help of recommended techniques, we put into
practice of our method on real appliances. We engaged the
data sets DBLP and XMark with the sizes of 320 MB and 145
MB, correspondingly. We had chosen 100 queries for each
data set in a random method. For LCA-based methodology
[8], we applied the hybrid algorithm. Hash Index and Dewey
Inverted list have been used for storing the results. With the
help of these hash tables and lists, we implemented ranking
functions. For incremental computation and speed of
operations, we used cache. We set up a server using Apache
and Glassfish. The program implemented in JAVA has been
http://www.ijettjournal.org
Page 24
International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number1- Dec 2014
run by a server and compiled with the JAVA Compiler. For
the client browser who wants to interact with the server, we
used JavaScript and HTML and display the results.
The data set sizes, trie-index sizes, forward index sizes,
and index-construction time have been shown in table 1.
While comparison among each other, DBLP have less distinct
keywords than the XMark. And also the index size on DBLP
is smaller than the XMark. For getting the exact search, we
used the cache for a corresponding trie node for each
keyword. Like that, for flossy inquest also we used cache for
similar trie nodes of each keyword. Usually trie nodes of a
keyword are small for similar ones. Therefore, the cache size
is not large. In our experiments, for each user, the cache size
is about 0.3 to 1.5 KB.
Dataset
Dataset Sizes
Trie-Index Sizes
Inverted-List Sizes
Forward-Index Sizes
Index-Construction time
Index-Sizes of XRank
DBLP
320 MB
24 MB
41 MB
52 MB
38 Secs
184 MB
XMark
145 MB
44 MB
75 MB
84 MB
62 Secs
251 MB
Table 1: Data Sets and Index Costs
Server Running Time:
Here we calculated the running time of server on both the
methods i.e. MCT-based search method and LCA-based
search method [8]. We scrutinized LCA-based search method
has less performance than the MCT-based search method.
MCT-based search attains to a large extent top search
performance in conditions of both exact search and fuzzy
search. This is recognized to our efficient index structures and
threshold-based [10] calculating algorithms.
Fig. 4: Exact Search (LCA vs. MST)
Fig 5: Flossy Inquest with edit-distance threshold.
Total round-trip time is more than the total server running
time. Server running time is always less than 1/3 of the total
round-trip time. The time taken by the browser to execute
JavaScript is 20 to 30ms. The JavaScript program must have
to display the data until the data returned to the user.
JavaScript program must be interpreted by the browser which
leads to slow down the execution. Depending upon the user’s
location also, the network delay lies. By using the distributed
data centers, we can rectify the network-delay causes in largescale systems.
VI. CONCLUSION
In this paper, we studied the problem of flossy type-ahead
inquest in XML facts. To identify the better top-k results
gradually and effectively, we projected effective index
structures, efficient algorithms, and novel optimization
techniques. We performed different tests on LCA-based
method to interactively recognize the expected answers. After
getting the result, we generalized a minimal-cost-tree-based
search method for identifying the related answers with an
efficient and accurate manner.
For avoiding the creation union lists on the fly, we
estimated a heap-based method. To improving the search
performance in XML facts, we devised a forward-index
structure. We have implemented our method, and the
experimental results show that our method achieves high
search efficiency and result quality.
ACKNOWLEDGEMENT
We, authors express gratitude to all the anonymous
reviewers for their affirmative annotations among our paper
and also would like to thank the anonymous reviewers for
their valuable comments.
REFERENCES
[1] S. Agrawal, S. Chaudhuri, and G. Das, “Dbxplorer: A
System for Keyword-Based Search over Relational
DB’s,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 5-16,
2002.
[2] S. Amer-Yahia, D. Hiemstra, T. Roelleke, D. Srivastava,
and G. Weikum, “Db&ir Integration: Report on the
ISSN: 2231-5381
http://www.ijettjournal.org
Page 25
International Journal of Engineering Trends and Technology (IJETT) – Volume 18 Number1- Dec 2014
Dagstuhl Seminar ‘Ranked Xml Querying’,” SIGMOD
Record, vol. 37, no. 3, pp. 46- 49, 2008.
[3] M.D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte,
“Min-max Heaps and Generalized Priority Queues,”
Comm. ACM, vol. 29, no. 10, pp. 996-1000, 1986.
[4] A. Balmin, V. Hristidis, and Y. Papakonstantinou,
“Objectrank: Authority-Based Keyword Search in
Databases,” Proc. Int’l Conf. Very Large Data Bases
(VLDB), pp. 564-575, 2004.
[5] H. Bast and I. Weber, “Type Less, Find More: Fast
Autocompletion Search with a Succinct Index,” Proc.
Ann. Int’l ACM SIGIR Conf. Research and Development
in Information Retrieval (SIGIR), pp. 364-371, 2006.
[6] H. Bast and I. Weber, “The Completesearch Engine:
Interactive, Efficient, and towards Ir&db Integration,”
roc. Biennial Conf. Innovative Data Systems Research
(CIDR), pp. 88-95, 2007.
[7] Y. Chen, W. Wang, Z. Liu, and X. Lin, “Keyword Search
on Structured and Semi-Structured Data,” Proc. ACM
SIGMOD Int’l Conf. Management of Data, pp. 10051010, 2009.
[8] S. Cohen, Y. Kanza, B. Kimelfeld, and Y. Sagiv,
“Interconnection Semantics for Keyword Search in Xml,”
Proc. Int’l Conf. Information and Knowledge
Management (CIKM), pp. 389-396, 2005.
[9] B. Ding, J.X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin,
“Finding Top-k Min-Cost Connected Trees in
Databases,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 836845, 2007.
[10] I.D. Felipe, V. Hristidis, and N. Rishe, “Keyword Search
on Spatial Databases,” Proc. Int’l Conf. Data Eng.
(ICDE), pp. 656-665, 2008.
[11] S. Ji, G. Li, C. Li, and J. Feng, “Efficient Interactive
Fuzzy Keyword Search,” Proc. Int’l Conf. World Wide
Web (WWW), pp. 371-380, 2009.
[12] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R.
Desai, and H. Karambelkar, “Bidirectional Expansion for
Keyword Search on Graph Databases,” Proc. Int’l Conf.
Very Large Data Bases (VLDB), pp. 505-516, 2005.
ISSN: 2231-5381
http://www.ijettjournal.org
Page 26
Download