International Journal on Advanced Computer Theory and Engineering (IJACTE)
ISSN (Print): 2319-2526, Volume -3, Issue -4, 2014
Generalized Indexing and Keyword Search Using User Log
1Yogini Dingorkar, 2S. Mohan Kumar, 3Ankush Maind
1M.Tech Scholar, 2Coordinator, 3Assistant Professor
Department of Computer Science and Engineering, Tulsiramji Gaikwad-Patil College of Engineering & Technology, Nagpur, India
Email: 1yoginibangde@gmail.com, 2tgpcet.mtech@gmail.com, 3ankushmaind@gmail.com
Abstract:- Databases contain huge amounts of data, and that data must be stored efficiently so that it can be retrieved in less time. Various techniques exist for storing data properly. Indexing reduces both the time needed to evaluate queries and the memory required to store the data. Many available methods perform compression on the data, but they require decompression during retrieval, which increases time complexity. Our system is based on indexing of large structured data in order to reduce time and space requirements. We apply natural language processing to both queries and data in order to extract keywords, and we use an algorithm based on the intersection operation that works on intervals of indexes. To reduce the intervals of indexes, we also apply reordering algorithms, and we use a log of user activity to speed up retrieval of data by queries. This paper gives a comprehensive overview of the proposed system and explains the compression of indexes using intervals.
I. INTRODUCTION
A significant amount of the world's enterprise data resides in relational databases, so it is important that users be able to seamlessly search and browse the information stored in these databases as well. The primary focus of designers of computing and data mining systems has been the improvement of system performance, and performance has grown steadily, driven by more efficient system design and increasingly capable system architectures.
Our proposed system is based on compression of data using intervals of indexes. Every time data is stored, an index file is generated that contains the lexicons, the indexes, and the frequency of each word in the database. Efficient indexing is required for storing the data in order to increase search performance. Searching is done through queries, and queries can be processed quickly only if the data is properly stored and managed. To further improve search performance, we create user logs to find users' frequent patterns. Besides searching, various compression and reordering techniques are available that require less memory and time.
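As a concrete illustration of the index file described above (a lexicon, a postings list of indexes per word, and per-word frequencies), the following is a minimal sketch; the function name, document set, and data layout are illustrative, not the system's actual implementation.

```python
from collections import defaultdict

def build_index(documents):
    """Build a toy inverted index: word -> sorted list of document IDs,
    plus the total frequency of each word across the collection."""
    postings = defaultdict(set)
    frequency = defaultdict(int)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            postings[word].add(doc_id)
            frequency[word] += 1
    # The lexicon is simply the sorted vocabulary.
    lexicon = sorted(postings)
    return lexicon, {w: sorted(postings[w]) for w in postings}, dict(frequency)

docs = ["java arrays and vectors", "file class in java", "arrays in java"]
lexicon, postings, freq = build_index(docs)
```

Storing postings as sorted ID lists is what later makes interval compression possible: runs of consecutive IDs collapse into single intervals.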
In our system, to generate the indexes we first find the keywords by applying stemming and removing stop words; after identifying the keywords, we generate the indexes. Sequential indexes with fewer intervals are produced by reordering algorithms. Different search techniques use union and intersection operations to find the results of queries; these methods work on OR-query and AND-query semantics [1]. Hao Wu, Guoliang Li, and Lizhu Zhou present the SCANLINEUNION+ and PROBISECT+ algorithms, of which PROBISECT+ works better for searching because it is faster and avoids unnecessary probes. In the proposed technique we use the PROBISECT+ algorithm to intersect the actual data with the keywords present in the queries so that the exact result can be obtained.
For compression of data, different encoding methods are available. The variable-byte encoding (VBE) scheme [10] is about twice as fast as variable-bit encoding. It is a very simple byte-wise compression scheme: seven bits of each byte code the data portion, and the most significant bit is reserved as a flag indicating whether the next byte is still part of the current value. VBE reduces the cost of transferring data from memory to the CPU compared with transferring uncompressed data. The PForDelta encoding method [3, 7] classifies the values of an inverted list as either coded values or exceptions. Exceptions are stored in uncompressed form, but slots are still maintained for them in their corresponding positions, while coded values are assigned an arbitrary bit width b that is kept constant within a disk block. The inverted list is divided into blocks.
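The variable-byte scheme described above can be sketched as follows. Note that continuation-bit conventions vary between implementations; this sketch sets the high bit on every byte except the last of a value, which is one common layout and not necessarily the exact one in [10].

```python
def vbyte_encode(n):
    """Variable-byte encode one non-negative integer: seven data bits per
    byte; the high bit of every byte except the last is set, flagging
    that the following byte continues the current value."""
    out = []
    while True:
        out.append(n & 0x7F)           # low seven bits
        n >>= 7
        if n == 0:
            break
    out.reverse()                       # most significant group first
    for i in range(len(out) - 1):
        out[i] |= 0x80                  # set the "more bytes follow" flag
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of variable-byte encoded integers."""
    values, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):              # flag clear: the value is complete
            values.append(n)
            n = 0
    return values
```

Small values (such as the gaps between consecutive document IDs) fit in one byte, which is where the compression comes from.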
In the proposed system we use reordering techniques, which order the data so that the generated lists contain fewer intervals and therefore require less storage space. Shieh et al. [9] proposed a DocID reassignment algorithm adopting a Travelling Salesman Problem (TSP) heuristic; it is a graph-based approach. Blandford and Blelloch [11] also proposed an algorithm, called B&B, which permutes the document identifiers in order to enhance the clustering property of posting lists.
The B&B algorithm creates a similarity graph G from the inverted-file (IF) index: each document is a vertex, and the edges are weighted by the cosine similarity between each pair of documents. Graph G is then recursively split into smaller subgraphs until singletons are obtained, and a depth-first traversal of the resulting tree reassigns the DocIDs. Silvestri [5] showed that, for collections of web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. The SIGSORT algorithm [1] works by generating signatures of words: a summary of each document is generated, and the words are arranged in descending order of their frequencies. SIGSORT is well suited to structured and short text data, can handle large data sets, and provides high clustering power.
The remainder of the paper is organized as follows. Section 2 gives an overview of the proposed system. Section 3 details the use of natural language processing in the proposed method. Section 4 describes how the intersection algorithm works in the proposed method. Section 5 describes the use of generated logs. Section 6 describes the reordering technique applied to improve the sequence of intervals. Section 7 shows the experimental results. The paper concludes in Section 8.
II. OUTLINE OF THE PROPOSED METHOD
The diagram gives an idea of how the proposed system works: the user sends a query, which is passed to NLP (natural language processing) to identify the keywords; these keywords are then searched in the index table, which is created using intervals. At the same time, the logs are checked for related data to reduce the searching time.
While storing the data in index and interval form, the indexer first applies natural language processing to the whole data set to identify the keywords. Based on the keyword positions, indexes are assigned to the keywords. After assigning the keywords, the ID table is formed, and from those IDs we find the intervals in order to store the indexes in interval form. To generate proper intervals and sequential indexes, a reordering algorithm is applied to the ID table; by reassigning the document identifiers of the original collection, it lowers the distance between the positions of documents.
2.1 OUTLINE OF THE PROPOSED SYSTEM
Steps for execution:
1: Identify keywords from the documents by using NLP techniques
2: Assign the indexes for each keyword
3: Reorder the documents by using SIGSORT and TSP
4: Generate the index file by considering intervals of indexes
5: Prepare and update the log file of each user after each activity
6: While a query is fired, check for the results in the log file
7: If the result is not present in the log file, search the index file; else go to step 5
8: Return the result
III. IDENTIFYING KEYWORDS
In the proposed method we use natural language processing to identify the actual meaningful words in the data and the query. We apply stemming to both data and queries to find the root form of each word. If a word ends with 'ed', 'ing', or 'ly', the stemming process reduces the inflected or derived form to its stem or root; for example, 'interfaced' and 'interfacing' are both reduced to the same stem.
We also filter out the stop words from the query and the data. Stop words are words that are filtered out before, or after, processing of natural language data. We remove words such as 'the', 'is', and 'which', consider only the remaining keywords, and assign IDs to them.
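The keyword-identification step described in Section III (suffix stemming plus stop-word removal) can be sketched as follows. The stop-word list and the three suffix rules here are illustrative; a production system would use a full stemmer, such as Porter's, with many more rules.

```python
STOP_WORDS = {"the", "is", "which", "a", "an", "in", "and", "to", "of"}
SUFFIXES = ("ing", "ed", "ly")  # only the three rules named in the text

def stem(word):
    """Crude suffix-stripping stemmer: drop a known suffix when the
    remaining stem is still at least three letters long."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text):
    """Lowercase, drop stop words, stem the rest, and assign integer IDs
    to distinct keywords in order of first appearance."""
    keywords = [stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    ids = {}
    for kw in keywords:
        ids.setdefault(kw, len(ids))
    return keywords, ids

kws, ids = extract_keywords("interfacing the file class in java")
```

Note that crude suffix stripping maps both 'interfaced' and 'interfacing' to the stem 'interfac', which is fine for indexing since only equality of stems matters, not their readability.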
IV. PREPARING LOGS
In the proposed method we further investigate the issue of developing a high-quality and effective IR system by combining the log concept with query processing. This enables the system to create and manage search logs from information recorded by previous searches. The search technique stores raw search logs, from which it generates user-requested search log reports. The log files, which are maintained by the system, contain the user name, time stamp, access request, and result status. Analyzing these log files gives a clear picture of user behavior. Log generation is performed using the following steps.
Algorithm: creation of a user log
Step 1: Enter user_id and password
Step 2: If user_id and password match, go to Step 3; else ask the user to enter the credentials again
Step 3: Create the user session_id
Step 4: While (session_id):
Step 5:     Monitor the activity of the user
Step 6:     Update the user's database
Step 7: End while
Step 8: End procedure
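The steps above can be sketched as follows. The credential store, session-id format, and log-entry fields are illustrative placeholders (in the real system they would live in a database), but the flow — validate, create a session, append one entry per activity — follows the algorithm.

```python
import time

# Toy credential store and in-memory log; stand-ins for database tables.
USERS = {"yogini": "secret"}
LOG = []

def login(user_id, password):
    """Steps 1-3: validate credentials and create a session id."""
    if USERS.get(user_id) != password:
        return None                           # caller re-prompts for credentials
    return f"{user_id}-{int(time.time())}"    # simple illustrative session id

def record_activity(session_id, request, status):
    """Steps 4-6: while the session is live, append one log entry
    (user name, time stamp, access request, result status) per activity."""
    user = session_id.split("-")[0]
    LOG.append({
        "user": user,
        "timestamp": time.time(),
        "request": request,                   # e.g. the query string
        "status": status,                     # e.g. "hit" / "miss"
    })

sid = login("yogini", "secret")
record_activity(sid, "how to use arrays in java", "miss")
```

A query whose answer is already in the log can then be served without touching the index file, which is where the retrieval-time savings reported later come from.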
V. PROBE-BASED ALGORITHM
The probe-based algorithm is based on the intersection operation. As the keywords in queries are of different lengths, the probe-based algorithm is well suited for retrieving and storing the data. The probe-based algorithm used in the proposed method relies on the intersection operation; the following explains how intersection works on a set of indexes.
Definition: Given a set of interval lists, R = {R1, R2, ..., Rn}, and their equivalent ID lists, S = {S1, S2, ..., Sn}, the intersection of R is the interval list equivalent to the intersection of S1 through Sn.
For example, given the interval lists {[5,8], [12,14]}, {[6,8], [13,16]}, and {[4,9], [14,14], [16,25]}, their equivalent ID lists are {5,6,7,8,12,13,14}, {6,7,8,13,14,15,16}, and {4,5,6,7,8,9,14,16,17,18,19,20,21,22,23,24,25}. Their intersection is {6,7,8,14}, which corresponds to the interval list {[6,8],[14,14]}.
The proposed system works on probes, and this approach is fast for query-based keyword search; it is more efficient than a sequential scan. The probe-based algorithm uses binary search, with complexity O(log m), to find the keywords, and it avoids unnecessary probes by calling the probing function recursively. Our approach uses the PROBISECT+ algorithm [1], whose complexity is CP = O(min_j(log n · Σ_{k≠j} |Rk|)). The probe-based algorithm takes the set of interval lists R and sorts it in ascending order of lower bounds. PROBISECT+ then uses the intersection operation to compute the intersection list of the set of ordered lists, probing the ordered lists sequentially and terminating unpromising probes; the probing function is called recursively to avoid empty and unpromising probes.
Reordering data
Reordering of data is necessary for generating the best order of the documents. If the data is reordered so as to generate sequential indexes, the memory requirement is automatically reduced and searching also improves. Reordering algorithms find an ordering of the documents such that similar documents stay near each other. Silvestri [5] suggested a method in which web pages are arranged according to their URLs. A similar concept is used here to sort the documents according to their summaries, so that similar documents are kept near each other. Summaries for sorting can be generated as follows. First, all words are sorted in descending order of their frequencies. Then the top n (e.g., n = 1000) most frequent words are chosen as the signature vocabulary. For each document, a string called a signature is generated by choosing those of its words that belong to the signature vocabulary and sorting them in descending order of their frequencies. The document sorting then compares each pair of signatures word-wise instead of letter-wise. In the proposed approach, the signature sorting algorithm is used to sort the documents according to their similarity, and TSP is used to identify documents with similar signatures.
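The signature-sorting procedure described above can be sketched as follows. This is one plausible reading of the SIGSORT idea, with a tiny vocabulary size (n = 3 instead of the n ≈ 1000 suggested in the text) and with the TSP refinement omitted.

```python
from collections import Counter

def build_signatures(documents, n=3):
    """Pick the n globally most frequent words as the signature vocabulary,
    then represent each document by its vocabulary words sorted in
    descending order of global frequency."""
    global_freq = Counter(w for doc in documents for w in doc.lower().split())
    vocab = [w for w, _ in global_freq.most_common(n)]
    rank = {w: i for i, w in enumerate(vocab)}   # 0 = most frequent
    signatures = []
    for doc in documents:
        words = {w for w in doc.lower().split() if w in rank}
        signatures.append(tuple(sorted(words, key=lambda w: rank[w])))
    return signatures

def reorder_by_signature(documents):
    """Sort documents word-wise by signature so that documents with similar
    signatures become neighbours, shrinking the index intervals."""
    sigs = build_signatures(documents)
    order = sorted(range(len(documents)), key=lambda i: sigs[i])
    return [documents[i] for i in order]

docs = ["java arrays demo", "python intro", "java arrays again", "python tips"]
```

After reordering, the two "java arrays" documents are adjacent, so the posting lists of "java" and "arrays" cover consecutive document IDs and compress into single intervals.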
Experimental Results
The experimental results compare the performance of plain indexing against indexing with intervals, as shown in Table a and Figure a. Various queries were executed for temporal analysis, and the ones listed in the table show that performance improves when efficient intervals are found. Figure a shows the single-query graph, in which it is clear that the time required by the plain indexing method is greater than that of indexing with intervals: traditional indexing takes 4.32 ms, while indexing with intervals takes 3.90 ms.
Table a: Performance of indexing vs. indexing with intervals (times in ms)
Query                              Indexing    Indexing with intervals
what is interfaces                 1.06        0.60
Dictionary in java                 1.04        0.62
how to use file class in java      2.81        1.14
arrays and vector class in java    4.54        4.20
Table a (continued)
Query                              Indexing    Indexing with intervals
how to use arrays in java          2.32        1.00
Hashtable in java                  2.11        1.61
How to use packages in java        3.60        2.90
use of linklist and stack in java  5.2         4.21
Table b: Performance of indexing with intervals vs. indexing with intervals and log (times in ms)
Query                              Indexing with intervals    With intervals and log
what is interfaces                 0.60                       0.40
Dictionary in java                 0.64                       0.10
how to use file class in java      2.34                       2.04
arrays and vector class in java    3.8                        2.94
how to use arrays in java          1.00                       0.70
Hashtable in java                  0.7                        2.50
How to use packages in java        1.3                        1.32
use of linklist and stack in java  1.9                        1.63
Figure a: Indexing vs. indexing with intervals
Performance of indexing with intervals and log
The performance obtained by incorporating the log with indexing with intervals is shown in Table b and Figure b. The figure shows the single-query graph, in which it is clear that indexing with intervals requires 3.79 ms while the implemented method requires 1.60 ms, which is nearly half the time of the existing method.
Figure b: Indexing using intervals vs. indexing with intervals and log
VI. CONCLUSIONS
This paper presented a method of indexing that works on intervals of indexes, which helps to reduce the memory requirement, and that uses user logs, which help to reduce the retrieval time. The graphical comparison shows that the performance of traditional indexing is improved by the concept of intervals of indexes, and the extended concept using logs shows that retrieval time is reduced to nearly half that of the existing system. In this approach, searching uses the PROBISECT+ algorithm, which is based on intersection operations, to find the results of queries. A reordering technique is applied to reduce the intervals and generate sequential indexes, which yields efficient intervals and reduces the memory required to store the indexes. Along with indexing, the user logs used while searching greatly improve performance.
REFERENCES
[1] Hao Wu, Guoliang Li, and Lizhu Zhou, Ginix: Generalized inverted index for keyword search, IEEE Transactions on Knowledge and Data Mining, vol. 8, no. 1, 2013.
[2] Vijayashri Losarwar and Madhuri Joshi, Data preprocessing in web usage mining, in Proc. International Conference on Artificial Intelligence and Embedded Systems (ICAIES 2012), Singapore, July 15-16, 2012.
[3] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, Fast indexes and algorithms for set similarity selection queries, in Proc. of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008, pp. 267-276.
[4] J. Zhang, X. Long, and T. Suel, Performance of compressed inverted list caching in search engines, in Proc. of the 17th International Conference on World Wide Web, Beijing, China, 2008, pp. 387-396.
[5] F. Silvestri, Sorting out the document identifier assignment problem, in Proc. of the 29th European Conference on IR Research, Rome, Italy, 2007, pp. 101-112.
[6] R. Blanco and A. Barreiro, TSP and cluster-based solutions to the reassignment of document identifiers, Information Retrieval, vol. 9, no. 4, pp. 499-517, 2006.
[7] M. Zukowski, S. Heman, N. Nes, and P. A. Boncz, Super-scalar RAM-CPU cache compression, in Proc. of the 22nd International Conference on Data Engineering, Atlanta, Georgia, USA, 2006, p. 59.
[8] J. Zobel and A. Moffat, Inverted files for text search engines, ACM Computing Surveys, vol. 38, no. 2, p. 6, 2006.
[9] Wann-Yun Shieh, Tien-Fu Chen, Jean Jyh-Jiun Shann, and Chung-Ping Chung, Inverted file compression through document identifier reassignment, Information Processing and Management, vol. 39, no. 1, pp. 117-131, January 2003.
[10] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, Compression of inverted indexes for fast query evaluation, in Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 222-229.
[11] Dan Blandford and Guy Blelloch, Index compression through document reordering, in Proc. of the Data Compression Conference (DCC'02), Washington, DC, USA, 2002, pp. 342-351.
[12] P. Elias, Universal codeword sets and representations of the integers, IEEE Transactions on Information Theory, vol. IT-21, no. 2, pp. 194-203, Mar. 1975.
[13] S. Golomb, Run-length encodings, IEEE Transactions on Information Theory, vol. IT-12, no. 3, pp. 399-401, July 1966.
