A Secure Method for Searching In Measured Datasets

International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 4 - Nov 2013
V.V.Ranganadh1, M.V.Durgaprasad2
Assistant Professor1, M.Tech Scholar2
1,2Dept of CSE, Aditya Engineering College, Aditya Nagar, Surampalem, Andhra Pradesh
Abstract:- In this paper, we propose an efficient technique for searching data over a network of metric datasets with integrated security. Searching data over a network is not a simple task when the volume of data is large. Our proposed approach reduces the computational complexity and retrieves the data the user is interested in securely and efficiently, using novel pattern-matching and cryptographic approaches.
I. INTRODUCTION
Overview of query processing: Query processing
refers to the range of activities involved in extracting data
from a database. The activities include translation of
queries in high level database languages into expressions
that can be used at the physical level of the file system, a
variety of query optimizing transformations, and actual
evaluation of queries. The basic steps involved in processing a query are: (1) parsing and translation, (2) optimization, and (3) evaluation.
The first action the system must take in query processing is to translate a given query into its internal
form. This translation process is similar to the work
performed by the parser of a compiler. In generating the
internal form of the query, the parser checks the syntax of
the user’s query, verifies that the relation names appearing
in the query are names of the relations in the database, and
so on. The system constructs a parse tree representation of
the query, which it then translates into a relational algebra
expression.
Furthermore, the relational algebra representation of a query specifies only partially how to evaluate it. As an illustration, consider the query

select balance from account where balance < 2500

This query can be translated into either of the following relational algebra expressions:

σ_balance<2500 (Π_balance (account))
Π_balance (σ_balance<2500 (account))
A relational algebra operation annotated with
instructions on how to evaluate it is called an evaluation
primitive. A sequence of primitive operations that can be used to evaluate a query is a query execution plan or query evaluation plan. The figure illustrates an evaluation plan for our example query, in which a particular index (denoted in the figure as "index 1") is specified for the selection operation.
The query execution engine takes a query evaluation plan,
executes that plan, and returns the answers to the query.
The different evaluation plans for a given query can have different costs. Measures of query cost: The cost of query evaluation can be measured in terms of a number of different resources, including disk accesses, CPU time to execute a query, and, in a distributed or parallel database
system, the cost of communication.
The response time for a query evaluation plan (that is, the clock time required to execute the plan) could be used as a
good measure of the cost of the plan. In large database
systems, however, disk accesses are usually the most important cost, since disk accesses are slow compared to in-memory operations. Most people consider the disk access
cost a reasonable measure of the cost of a query evaluation
plan. The number of block transfers from disk is also used
as a measure of the actual cost. We also need to distinguish
between reads and writes of blocks, since it takes more
time to write a block to disk than to read a block from disk.
For a more accurate measure, find out the number of seek operations performed, the number of blocks read, and the number of blocks written, and then add up these numbers after multiplying them by the average seek time, the average transfer time for reading a block, and the average transfer time for writing a block, respectively.
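Under the cost model just described, the estimate can be sketched as follows; the timing constants and the helper name `plan_cost` are hypothetical values chosen for illustration, not measurements of any real system.

```python
# Illustrative cost model for a query evaluation plan, as described above.
# The timing constants below are assumed values for the example.

AVG_SEEK_TIME = 0.004      # seconds per seek (assumed)
AVG_READ_TIME = 0.0001     # seconds per block read (assumed)
AVG_WRITE_TIME = 0.0002    # seconds per block written (assumed)

def plan_cost(num_seeks, blocks_read, blocks_written):
    """Estimated disk cost: each count weighted by its average time."""
    return (num_seeks * AVG_SEEK_TIME
            + blocks_read * AVG_READ_TIME
            + blocks_written * AVG_WRITE_TIME)

# A plan with 10 seeks, 500 block reads, and 20 block writes:
cost = plan_cost(10, 500, 20)
# 10*0.004 + 500*0.0001 + 20*0.0002 = 0.094 seconds
```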
A) Parsing and Translation
Translate the query into its internal form; this is then translated into relational algebra. The parser checks the syntax and verifies the relations. Evaluation: the query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
To implement the preceding selection, we can search every tuple in account to find tuples with balance less than 2500. If a B-tree index is available on the attribute balance, we can use the index instead to locate the tuples. To specify fully how to evaluate a query, we need to provide not only the relational algebra expression, but also to annotate it with instructions specifying how to evaluate each operation.
ISSN: 2231-5381
http://www.ijettjournal.org
Page 216
A relational algebra expression may have many equivalent expressions. E.g., σ_balance<2500 (Π_balance (account)) is equivalent to Π_balance (σ_balance<2500 (account)). Each relational algebra operation can be evaluated using one of several different algorithms.
Correspondingly, a relational-algebra expression can be evaluated in many ways. An annotated expression specifying a detailed evaluation strategy is called an evaluation plan. E.g., we can use an index on balance to find accounts with balance < 2500, or we can perform a complete relation scan and discard accounts with balance ≥ 2500.
To evaluate a query, we need to annotate it with instructions specifying how to evaluate each operation. Annotations may state the algorithm to be used for a specific operation, or the particular index or indices to use. Measures of Query Cost (Cont.): For simplicity, we just use the number of block transfers from disk as the cost measure, and we ignore both the difference in cost between sequential and random I/O and the CPU cost. Cost also depends on the size of the buffer in main memory: having more memory reduces the need for disk access. The amount of real memory available to the buffer depends on other concurrent OS processes and is hard to determine ahead of actual execution. We therefore often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available. Real systems take CPU cost into account, differentiate between sequential and random I/O, and take buffer size into account. We do not include the cost of writing output to disk in our cost formulae.
B)Selection Operation
File scan – search algorithms that locate and retrieve
records that fulfil a selection condition.
Algorithm A1 (linear search). Scan each file block and test
all records to see whether they satisfy the selection
condition.
Cost estimate (number of disk blocks scanned) = b_r, where b_r denotes the number of blocks containing records from relation r. If the selection is on a key attribute, the average cost is b_r/2, since the scan can stop on finding the record. Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices.
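This linear-search estimate can be sketched as follows; `linear_search_cost` is a hypothetical helper name introduced for the example.

```python
def linear_search_cost(b_r, key_attribute=False):
    """Estimated blocks scanned by linear search (Algorithm A1).

    b_r: number of blocks containing records of relation r.
    On a key attribute the scan stops at the matching record,
    so on average half the blocks are read.
    """
    return b_r / 2 if key_attribute else b_r

print(linear_search_cost(1000))                      # 1000 blocks (full scan)
print(linear_search_cost(1000, key_attribute=True))  # 500.0 blocks (early stop)
```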
We do not expect users to write their queries in a way that suggests the most efficient evaluation plan. Rather, it is the
responsibility of the system to construct a query-evaluation
plan that minimizes the cost of query evaluation. Once the
query plan is chosen, the query is evaluated with that plan,
and the result of the query is output. The sequence of steps
already described for processing a query is representative;
not all databases exactly follow those steps.
II. RELATED WORK
Summation of Random Numbers: A simple scheme has been proposed in [3] that computes the encrypted value c of integer p as c = Σ_{j=0}^{p} R_j, where R_j is the jth value generated by a secure pseudo-random number generator R. Unfortunately, the cost of making p calls to R for encrypting or decrypting c can be prohibitive for large values of p. A more serious problem is the vulnerability to
estimation exposure. Since the expected gap between two
encrypted values is proportional to the gap between the
corresponding plaintext values, the nature of the plaintext
distribution can be inferred from the encrypted values.
Figure 2 shows the distributions of encrypted values
obtained using this scheme for data values sampled from
two different distributions: Uniform and Gaussian. In each
case, once both the input and encrypted distributions are
scaled to be between 0 and 1, the number of points in each
bucket is almost identical for the plaintext and encrypted
distributions. Thus the percentile of a point in the encrypted
distribution is also identical to its percentile in the plaintext
distribution.
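The summation scheme of [3] can be sketched as follows. Python's `random.Random` stands in here for a cryptographically secure keyed generator, and drawing positive values guarantees that the running sum is strictly increasing, so this is an illustrative sketch rather than a secure implementation.

```python
import random

def encrypt(p, key):
    """Encrypt non-negative integer p as R_0 + R_1 + ... + R_p, where
    R_j is the jth output of a keyed pseudo-random generator.
    (Sketch of the scheme in [3]; not cryptographically secure.)"""
    rng = random.Random(key)
    return sum(rng.randint(1, 100) for _ in range(p + 1))

def decrypt(c, key):
    """Replay the generator, accumulating terms until the sum reaches c."""
    rng = random.Random(key)
    total, p = 0, -1
    while total < c:
        total += rng.randint(1, 100)
        p += 1
    return p

key = 42
assert decrypt(encrypt(7, key), key) == 7
# Order is preserved: a larger plaintext sums strictly more positive terms.
assert encrypt(3, key) < encrypt(8, key)
```

Note that the gap between two ciphertexts grows with the gap between the plaintexts, which is exactly the estimation-exposure weakness described above.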
Polynomial Functions: In [12], a sequence of strictly
increasing polynomial functions is used for encrypting
integer values while preserving their order. These
polynomial functions can simply be of the first or second
order, with coefficients generated from the encryption key.
An integer value is encrypted by applying the functions in
such a way that the output of a function becomes the input
of the next function. Correspondingly, an encrypted value
is decrypted by solving these functions in reverse order.
However, this encryption method does not take the input
distribution into account.
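A minimal sketch of this chained-polynomial idea follows. The coefficients are fixed constants chosen for illustration; in [12] they would be derived from the encryption key.

```python
# Order-preserving encryption via chained, strictly increasing first-order
# polynomials (sketch of the idea in [12]). Coefficients are illustrative.

COEFFS = [(3, 7), (5, 11), (2, 1)]  # (a, b) with a > 0, so f(x) = a*x + b is increasing

def encrypt(p):
    for a, b in COEFFS:            # the output of each function feeds the next
        p = a * p + b
    return p

def decrypt(c):
    for a, b in reversed(COEFFS):  # solve the functions in reverse order
        c = (c - b) // a
    return c

assert decrypt(encrypt(123)) == 123
assert encrypt(10) < encrypt(11)   # order is preserved
```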
Therefore the shape of the distribution of
encrypted values depends on the shape of the input
distribution. This illustration suggests that this scheme may
reveal information about the input distribution, which can
be exploited. Bucketing: Tuples are encrypted using conventional encryption, but an additional bucket id is
created for each attribute value. This bucket id, which
represents the partition to which the unencrypted value
belongs, can be indexed. The constants appearing in a
query are replaced by their corresponding bucket ids.
Clearly, the result of a query will contain false hits that
must be removed in a post-processing step after decrypting
the tuples returned by the query. This filtering can be quite
complex since the bucket ids may have been used in joins,
sub-queries, etc.
The number of false hits depends on the width of the partitions involved. It is shown in [13] that the post-processing overhead can become excessive if a coarse partitioning is used for bucketization. On the other hand, a fine partitioning makes the scheme vulnerable to estimation exposure, particularly if an equi-width partitioning is used. It has been pointed out that these indexes can open the door to inference and linking attacks. Instead,
they build a B-tree over plaintext values, but then encrypt
every tuple and the B-tree at the node level using
conventional encryption. The advantage of this approach is
that the content of the B-tree is not visible to an un-trusted
database server. The disadvantage is that the B-tree
traversal can now be performed by the front-end only by
executing a sequence of queries that retrieve tree nodes at
progressively deeper levels.
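The bucketing scheme and its false hits can be sketched as follows. The equi-width partitioning, the bucket width, and the helper names are assumptions made for this example; a real scheme would also store the tuples encrypted, with only the bucket ids in the clear.

```python
# Sketch of bucketization: each value's bucket id (the partition containing
# the plaintext) is kept in the clear and can be indexed by the server.

BUCKET_WIDTH = 100  # assumed equi-width partitioning

def bucket_id(value):
    return value // BUCKET_WIDTH

# Server side: a range query is rewritten over bucket ids; buckets that
# straddle the range boundaries contribute false hits.
def server_candidates(rows, low, high):
    lo_b, hi_b = bucket_id(low), bucket_id(high)
    return [r for r in rows if lo_b <= r["bucket"] <= hi_b]

# Client side: decrypt and remove the false hits in a post-processing step.
def client_filter(candidates, low, high):
    return [r for r in candidates if low <= r["plain"] <= high]

rows = [{"plain": v, "bucket": bucket_id(v)} for v in (120, 180, 250, 310, 420)]
cand = server_candidates(rows, 150, 300)   # buckets 1..3: includes 120 and 310
result = client_filter(cand, 150, 300)     # exact answer: 180 and 250
```

A coarser `BUCKET_WIDTH` increases the number of false hits per query, while a finer one leaks more about the value distribution, mirroring the trade-off described above.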
The two works most related to ours propose several transformation-based techniques for outsourcing spatial data to the (untrusted) server, such that the server is able to perform spatial range search correctly for trusted users on those transformed points, without knowing their actual coordinates. They propose spatial
transformations in 2D space based on scaling, shifting, and
noise injection. Also, they develop a solution using an
encrypted R-tree. Those solutions operate on explicit 2D
coordinates, rendering them inapplicable in our setting,
where the distance function is a generic distance metric.
Wong et al. propose outsourcing multidimensional points to the (untrusted) server by using a secure scalar product encryption technique. Methods are then provided
for kNN search at the server, without the server learning
the distances among the points. However, the secure scalar
product relies on specific properties of the Euclidean
distance in the multidimensional space. It is not applicable
to other Lp norms, e.g., the L1 norm (the Manhattan
distance). Obviously, it also cannot be applied to our
problem setting which considers arbitrary metric space
objects (e.g., strings, graphs, time-series). Another
drawback of this proposal is that no indexing scheme can
be built on the encrypted tuples, forcing the server to
perform a linear scan over the data set. This severely affects the scalability of the system.
In the field of privacy-preserving data mining,
perturbation techniques have been developed for
introducing noise into the data, before sending them to the
service provider. However, such an approach does not
guarantee the exact retrieval of results. The k-anonymity model has been applied extensively for the privacy-preserving publication of data sets. The idea is to
generalize the tuples in a table such that each generalized
representation is shared by at least k tuples. This way, each
object cannot be distinguished from at least k − 1 other
objects. It is often used to generalize the medical records of
patients so that the adversary cannot link a specific patient
to a medical record. Except for some person-related data
like DNA data, most of the metric data that we consider
(e.g., astronomy data, time series) is collected from nature
rather than from persons.
In the existing system, the data owner used the AES algorithm for encrypting the data. We adopt a new cryptographic algorithm.
III. PROPOSED WORK
In our proposed system, we introduce a new framework for searching the data. Our framework provides efficient security for the data uploaded to the network, access protection for the data being accessed by the user, and, finally, a searching method for an accurate and efficient process.
For encrypting the data uploaded by the data owner, we adopted the advanced AES algorithm, also called the Rijndael algorithm.
Rijndael is an iterated block cipher. Therefore, the
encryption or decryption of a block of data is accomplished
by the iteration (a round) of a specific transformation (a
round function). Section 3 provides the details of the
Rijndael round function. Rijndael also defines a method to
generate a series of sub keys from the original key. The
generated sub keys are used as input with the round
function.
Its design criteria were resistance against all known attacks, speed and code compactness on a wide range of platforms, and design simplicity. Rijndael was evaluated based on its security, its cost, and its algorithm and implementation characteristics. The primary focus of the analysis was on the cipher's security, but the choice of Rijndael was also based on its simple algorithm and implementation characteristics. There were several candidate algorithms, but Rijndael was selected because, based on the analyses, it had the best combination of security, performance, efficiency, ease of implementation, and flexibility. The pseudo-code is as shown below:
Rijndael(State, CipherKey) {
  KeyExpansion(CipherKey, ExpandedKey);
  AddRoundKey(State, ExpandedKey);
  for (i = 1; i < Nr; i++)
    Round(State, ExpandedKey + Nb*i);
  FinalRound(State, ExpandedKey + Nb*Nr);
}
And the round function is defined as:
Round(State, RoundKey) {
  ByteSub(State);
  ShiftRow(State);
  MixColumn(State);
  AddRoundKey(State, RoundKey);
}
The round transformation is broken into layers: the linear mixing layer, which provides high diffusion over multiple rounds; the non-linear layer, which is basically an application of the Rijndael S-box; and the key addition layer, which is simply an exclusive-or of the round key and the intermediate state. Each layer is designed to have its own well-defined function, which increases resistance to linear and differential cryptanalysis.
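A toy sketch of these layers on a 4×4 byte state (stored column-major, as in Rijndael) follows. A trivial substitution stands in for the real 256-entry S-box, and the MixColumn step over GF(2⁸) is omitted, so this is illustrative only, not AES.

```python
# Toy sketch of the Rijndael round layers on a flat 16-byte, column-major state.

def sub_bytes(state):                    # non-linear layer (S-box application)
    # Placeholder bijection standing in for the real Rijndael S-box.
    return [(b * 7 + 1) % 256 for b in state]

def shift_rows(state):                   # part of the linear mixing layer
    out = state[:]
    for r in range(4):                   # row r is rotated left by r positions
        row = state[r::4]
        out[r::4] = row[r:] + row[:r]
    return out

def add_round_key(state, round_key):     # key addition layer: XOR with round key
    return [b ^ k for b, k in zip(state, round_key)]

state = list(range(16))
round_key = [0x2A] * 16                  # illustrative round key
state = add_round_key(shift_rows(sub_bytes(state)), round_key)
print(len(state))  # 16
```

Because the key addition layer is a plain XOR, applying it twice with the same round key returns the original state, which is what makes the layer trivially invertible during decryption.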
The next process in our framework is data storage at the service provider. The data owner selects a file, encrypts it using the above algorithm, and then stores it at the service provider. In the existing system, the key is disclosed to the user at the time of storage, which is a very insecure process in the network. So we disclose the key to the user only at the time of search.
The next process is searching, for which we introduce a new searching process. The algorithm is explained below.
In the initial step, the first-level size of the seat-shifted table is defined according to the alphabet size: assuming the size of the alphabet is SIZE, the size of the first level is also SIZE. Each character uses the decimal value of its ASCII code to mark its position in the first level.
First, mark the locations of the characters in the pattern from left to right. Then, for each character that appears in the pattern, enter its positions in decreasing order into the chain indexed by its ASCII code; these chains constitute the other levels of the table (for example, level 2 and level 3 in the original figure).
If a character appears in the pattern, its corresponding position is set to 1; if not, it is set to 0. For example, because 'A' is in the pattern and the ASCII code of 'A' is 65, the 65th position is marked 1; the character '@' (ASCII 64) does not appear in the pattern, so its position is marked 0.
Through this kind of indication, when we want to know whether a character occurs in the pattern, we only need to check whether the mark of that character in the table is 1. Take the search starting position as the center, and take m−1 characters before and after it to compose a window of size (m−1)+1+(m−1) = 2m−1.
In this way, the latter m−1 characters of one window are the first m−1 characters of the next window. This guarantees that, after the partition, the pattern string always falls entirely within some window for every match, so no occurrence is omitted, which guarantees the accuracy of the algorithm.
If the pattern does not match in the first window, we go to the next window. Using the Next array, we avoid moving the pattern backward when there is no match. The values of the Next array depend only on the pattern itself and have nothing to do with the text string. The construction rules are as follows:
We preprocess the pattern p1 p2 ... pm in advance and generate a function Next[i] (0 < i < m+1). When a mismatch occurs at the ith position, we calculate whether the prefix p1 p2 ... p(i−1) contains a maximum G such that p1 ... p(G−1) matches its suffix. If such a G exists, Next[i] = G; on the next attempt, the pattern can be moved backward by i − Next[i], and the comparison restarts from the Gth character of the pattern. If no such G exists, Next[i] = 1.
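The Next array as described corresponds closely to the failure function of the Knuth-Morris-Pratt algorithm. The sketch below computes the standard zero-indexed failure table; this is our interpretation of the rules above, not the authors' exact definition.

```python
def failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (standard KMP failure table)."""
    fail = [0] * len(pattern)
    k = 0                                # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]              # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

print(failure("ababc"))  # [0, 0, 1, 2, 0]
```

On a mismatch after i matched characters, the pattern can be shifted by i − fail[i−1], which plays the role of the i − Next[i] shift described above.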
Matching
Each match starts from a search starting position and uses the seat-shifted table and the Next array. a) First, examine whether the mark in the seat-shifted table for the kth (0 < k ≤ ⌊n/m⌋) search starting position is 1. If it is 0, go to the (k+1)th search starting position. If it is 1, the character occurs in the pattern string; therefore, in the second level of the seat-shifted table we find the first position of that character in the pattern, and align the pattern so that this position coincides with the kth search starting position in the text.
b) Match from the leftmost end of the pattern; if the part before the search starting position matches completely, then match from the rightmost end; if the part after the search starting position also matches completely, a match is found. Then jump to the next search starting position and continue.
c) If a match fails at a certain position i (0 ≤ i ≤ m), check Next[i]; then check the seat-shifted table to find the next position in the pattern of the character at the kth search starting position, and calculate the distance between the two positions, denoted Distance. Compare Distance with Next[i] and take the larger as the jump distance of the pattern. If Next[i] is larger, first match the character at the search starting position; if it matches, continue matching as above; otherwise, repeat c). If there is no further position of the character in the seat-shifted table, go to the next search starting position and return to a).
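A simplified end-to-end sketch of this matching phase follows. The text is sampled at every mth position (the search starting positions); a position is skipped immediately when its character does not occur in the pattern, and otherwise the pattern is aligned at each recorded position of that character and verified. The jump optimizations via Next and Distance are omitted, so this shows only the alignment-and-verify core under our assumptions.

```python
def build_table(pattern):
    """First level (occurrence marks) and position chains, as above."""
    first_level = [0] * 256
    positions = {}
    for i, ch in enumerate(pattern):
        first_level[ord(ch)] = 1
        positions.setdefault(ch, []).append(i)
    return first_level, positions

def search(text, pattern):
    m = len(pattern)
    first_level, positions = build_table(pattern)
    hits = set()
    for k in range(0, len(text), m):       # search starting positions
        ch = text[k]
        if not first_level[ord(ch)]:
            continue                        # character absent from pattern: skip
        for p in positions[ch]:             # align pattern[p] with text[k]
            start = k - p
            if 0 <= start <= len(text) - m and text[start:start + m] == pattern:
                hits.add(start)
    return sorted(hits)

print(search("xxabcaxabcax", "abca"))  # [2, 7]
```

Every occurrence is found because any window of m consecutive text positions contains one search starting position, matching the window-overlap guarantee described earlier.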
IV. CONCLUSION
In this paper, we introduced a framework combining secure storage with a searching algorithm. It reduces the time complexity of searching and yields moderate results. We adopted an advanced cryptographic algorithm to provide security for the data stored at the service provider. Our searching algorithm searches the document by matching each character against the keyword given by the user, achieving moderate results.
REFERENCES
[1] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R.
Panigrahy, D. Thomas, and A. Zhu, “Achieving Anonymity via
Clustering,” Proc. 25th ACM SIGMOD-SIGACT-SIGART
Symp. Principles of Database Systems (PODS), pp. 153-162,
2006.
[2] R. Agrawal, P.J. Haas, and J. Kiernan, “Watermarking
Relational Data: Framework, Algorithms and Analysis,” The Int’l
J. Very Large Data Bases, vol. 12, no. 2, pp. 157-169, 2003.
[3] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order-Preserving Encryption for Numeric Data,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 563-574, 2004.
[4] R. Agrawal and R. Srikant, “Privacy-Preserving Data
Mining,” Proc. ACM SIGMOD Int’l Conf. Management of Data,
pp. 439- 450, 2000.
[5] C.A. Ardagna, M. Cremonini, E. Damiani, S.D.C. di
Vimercati, and P. Samarati, “Location Privacy Protection
Through Obfuscation-Based Techniques,” Proc. 21st Ann. IFIP WG 11.3 Working Conf. Data and Applications Security
(DBSec), pp. 47-60, 2007.
[6] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios,
“Nearest Neighbor Retrieval Using Distance-Based Hashing,”
Proc. IEEE 24th Int’l Conf. Data Eng. (ICDE), pp. 327-336, 2008.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger,
“The R*-Tree: An Efficient and Robust Access Method for Points
and Rectangles,” Proc. ACM SIGMOD Int’l Conf. Management
of Data, pp. 322-331, 1990.
[8] S. Berchtold, D.A. Keim, and H.-P. Kriegel, “The X-Tree : An
Index Structure for High-Dimensional Data,” Proc. 22nd Int’l
Conf. Very Large Databases, pp. 28-39, 1996.
[9] T. Bozkaya and Z.M. Özsoyoglu, “Indexing Large Metric
Spaces for Similarity Search Queries,” ACM Trans. Database
Systems, vol. 24, no. 3, pp. 361-404, 1999.
[10] E. Chávez, G. Navarro, R.A. Baeza-Yates, and J.L. Marroquín, “Searching in Metric Spaces,” ACM Computing
Surveys, vol. 33, no. 3, pp. 273-321, 2001.
[11] P. Ciaccia, M. Patella, and P. Zezula, “M-Tree: An Efficient
Access Method for Similarity Search in Metric Spaces,” Proc.
Very Large Databases (VLDB), pp. 426-435, 1997.
[12] E. Damiani, S.D.C. Vimercati, S. Jajodia, S. Paraboschi, and
P. Samarati, “Balancing Confidentiality and Efficiency in
Untrusted Relational DBMSs,” Proc. 10th ACM Conf. Computer
and Comm. Security (CCS), pp. 93-102, 2003.
[13] M. Dunham, Data Mining: Introductory and Advanced
Topics. Prentice Hall, 2002.
[14] C. Faloutsos and K.-I. Lin, “FastMap: A Fast Algorithm for
Indexing, Data-Mining and Visualization of Traditional and
Multimedia Data Sets,” Proc. ACM SIGMOD Int’l Conf.
Management of Data, pp. 163-174, 1995.
[15] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and
K.L. Tan, “Private Queries in Location Based Services:
Anonymizers Are Not Necessary,” Proc. ACM SIGMOD Int’l
Conf. Management of Data, pp. 121-132, 2008.
[16] A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in
High Dimensions via Hashing,” Proc. 25th Int’l Conf. Very Large
Databases (VLDB), pp. 518-529, 1999.
[17] H. Hacigümüş, B.R. Iyer, C. Li, and S. Mehrotra, “Executing SQL over Encrypted Data in the Database-Service-Provider Model,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 216-227, 2002.
[18] H. Hacigümüş, S. Mehrotra, and B.R. Iyer, “Providing
Database as a Service,” Proc. 18th Int’l Conf. Data Eng. (ICDE),
pp. 29-40, 2002.
[19] A. Hinneburg, C.C. Aggarwal, and D.A. Keim, “What Is the
Nearest Neighbor in High Dimensional Spaces?,” Proc. 26th Int’l
Conf. Very Large Data Bases (VLDB), pp. 506-515, 2000.
[20] G.R. Hjaltason and H. Samet, “Index-Driven Similarity
Search in Metric Spaces,” ACM Trans. Database Systems, vol.
28, no. 4, pp. 517-580, 2003.
[21] H.V. Jagadish, B.C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “iDistance: An Adaptive B⁺-Tree Based Indexing Method for Nearest Neighbor Search,” ACM Trans. Database Systems, vol. 30, no. 2, pp. 364-397, 2005.
BIOGRAPHIES
Mr. M.V. Durgaprasad is a student of Aditya Engineering College, Surampalem. Presently he is pursuing his M.Tech [Computer Science & Engineering] at this college; he received his B.Tech from Sri Prakash College of Engineering, affiliated to JNT University, Kakinada, in the year 2009. His areas of interest include Compiler Design, Database Management Systems, Data Mining, and current trends and techniques in Computer Science.
Mr. V.V. Ranganadh, a well-known and excellent teacher, received his M.Tech (CSE) from JNTU Kakinada and is working as Assistant Professor, Dept of CSE, at Aditya Engineering College. He has 11 years of industrial and teaching experience in various engineering colleges. To his credit are a couple of publications in both national and international conferences/journals. His areas of interest include Image Processing, Data Mining, and other advances in Computer Applications. He has guided many projects.