International Journal of Engineering Trends and Technology (IJETT) – Volume 5 Number 4 - Nov 2013

A Secure Method for Searching In Measured Datasets
V.V.Ranganadh1, M.V.Durgaprasad2
Assistant Professor1, M.Tech Scholar2
1,2 Dept of CSE, Aditya Engineering College, Aditya Nagar, Surampalem, Andhra Pradesh

Abstract:- In this paper we propose an efficient technique for searching metric datasets over a network, with integrated security. Searching data over a network is not a simple task when the volume of data is large. Our approach reduces computational complexity and retrieves the data of interest to the user securely and efficiently, combining a novel pattern-matching method with cryptographic techniques.

I. INTRODUCTION

Overview of query processing: Query processing refers to the range of activities involved in extracting data from a database. These activities include translation of queries in high-level database languages into expressions that can be used at the physical level of the file system, a variety of query-optimizing transformations, and actual evaluation of queries. The basic steps in processing a query are (1) parsing and translation, (2) optimization, and (3) evaluation. The first action the system must take in query processing is to translate a given query into its internal form. This translation process is similar to the work performed by the parser of a compiler. In generating the internal form of the query, the parser checks the syntax of the user's query, verifies that the relation names appearing in the query are names of relations in the database, and so on. The system constructs a parse-tree representation of the query, which it then translates into a relational algebra expression. The relational algebra representation of a query, however, specifies only partially how to evaluate it.
As an illustration, consider the query select balance from account where balance < 2500. This query can be translated into either of the following relational algebra expressions:

σ_balance<2500 (Π_balance (account))
Π_balance (σ_balance<2500 (account))

To specify fully how to evaluate a query, we need to provide not only the relational algebra expression but also annotations specifying how to evaluate each operation. A relational algebra operation annotated with instructions on how to evaluate it is called an evaluation primitive. A sequence of primitive operations that can be used to evaluate a query is a query-execution plan or query-evaluation plan. The figure illustrates an evaluation plan for our example query, in which a particular index (denoted in the figure as "index 1") is specified for the selection operation. The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query. Different evaluation plans for a given query can have different costs.

A) Parsing and translation. Translate the query into its internal form, which is then translated into relational algebra; the parser checks syntax and verifies relations. Evaluation: the query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

Optimization. To implement the preceding selection, we can scan every tuple in account to find tuples with balance less than 2500. If a B-tree index is available on the attribute balance, we can use the index instead to locate the tuples. A relational algebra expression may have many equivalent expressions; e.g., σ_balance<2500 (Π_balance (account)) is equivalent to Π_balance (σ_balance<2500 (account)). Each relational algebra operation can be evaluated using one of several different algorithms; correspondingly, a relational-algebra expression can be evaluated in many ways. An annotated expression specifying a detailed evaluation strategy is called an evaluation plan. E.g., one can use an index on balance to find accounts with balance < 2500, or perform a complete relation scan and discard accounts with balance ≥ 2500. Annotations may state the algorithm to be used for a specific operation, or the particular index or indices to use.

Measures of query cost: The cost of query evaluation can be measured in terms of a number of different resources, including disk accesses, CPU time to execute a query, and, in a distributed or parallel database system, the cost of communication. The response time for a query-evaluation plan (that is, the wall-clock time required to execute the plan) could be used as a measure of the cost of the plan. In large database systems, however, disk accesses are usually the most important cost, since disk accesses are slow compared to in-memory operations. The disk-access cost is therefore commonly taken as a reasonable measure of the cost of a query-evaluation plan, with the number of block transfers from disk used as a measure of actual cost. We also need to distinguish between reads and writes of blocks, since it takes more time to write a block to disk than to read one. For a more accurate measure, find out the number of seek operations performed, the number of blocks read, and the number of blocks written, and then add up these numbers after multiplying them by the average seek time, the average transfer time for reading a block, and the average transfer time for writing a block, respectively.

Measures of Query Cost (Cont.)
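The more accurate cost measure just described, which weights seek, read, and write counts by their average times, can be sketched as follows; the timing constants are illustrative assumptions, not measured values:

```python
def query_cost(num_seeks, blocks_read, blocks_written,
               t_seek=4e-3, t_read=1e-4, t_write=2e-4):
    """Estimate the disk cost of an evaluation plan (in seconds):
    each component count is multiplied by its average time and the
    results are summed.  Writing a block is costed higher than
    reading one, as noted in the text."""
    return (num_seeks * t_seek
            + blocks_read * t_read
            + blocks_written * t_write)

# A plan performing 10 seeks, 200 block reads and 20 block writes:
cost = query_cost(10, 200, 20)   # 0.04 + 0.02 + 0.004 seconds
```

The timing constants would in practice come from the storage device's specifications or from calibration runs.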
For simplicity, we use just the number of block transfers from disk as the cost measure, ignoring the difference in cost between sequential and random I/O and ignoring CPU costs. Cost also depends on the size of the buffer in main memory: having more memory reduces the need for disk access. The amount of real memory available to the buffer depends on other concurrent OS processes and is hard to determine ahead of actual execution, so we often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available. Real systems take CPU cost into account, differentiate between sequential and random I/O, and take buffer size into account. We do not include the cost of writing output to disk in our cost formulae.

B) Selection Operation. File scan: search algorithms that locate and retrieve records that fulfil a selection condition. Algorithm A1 (linear search): scan each file block and test all records to see whether they satisfy the selection condition. The cost estimate (number of disk blocks scanned) is b_r, where b_r denotes the number of blocks containing records from relation r. If the selection is on a key attribute, the average cost is b_r/2, since we can stop on finding the record. Linear search can be applied regardless of the selection condition, the ordering of records in the file, or the availability of indices. As noted above, different evaluation plans for a given query can have different costs, and we do not expect users to write their queries in a way that suggests the most efficient evaluation plan.
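The linear-search cost behaviour of Algorithm A1 can be sketched as follows; the block layout and predicate are illustrative assumptions:

```python
def linear_search(blocks, predicate, key_attribute=False):
    """Algorithm A1: scan each file block and test every record
    against the selection condition.  Returns the matching records
    and the number of blocks read.  On a key attribute at most one
    record matches, so the scan can stop early and the expected
    cost drops from b_r to about b_r / 2."""
    matches, blocks_read = [], 0
    for block in blocks:
        blocks_read += 1
        for record in block:
            if predicate(record):
                matches.append(record)
                if key_attribute:        # at most one match exists
                    return matches, blocks_read
    return matches, blocks_read

# account relation: blocks of (account_id, balance) records
blocks = [[(1, 500), (2, 9000)], [(3, 2400)], [(4, 7000)]]
rows, cost = linear_search(blocks, lambda r: r[1] < 2500)
# Full scan: every block is read, so cost == b_r == 3.
```

A selection on the key attribute, e.g. `linear_search(blocks, lambda r: r[0] == 1, key_attribute=True)`, stops after the first block.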
Rather, it is the responsibility of the system to construct a query-evaluation plan that minimizes the cost of query evaluation. Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is output. The sequence of steps described for processing a query is representative; not all databases follow those steps exactly.

II. RELATED WORK

Summation of Random Numbers. A simple scheme proposed in [3] computes the encrypted value c of an integer p as c = Σ_{j=0}^{p} R_j, where R_j is the j-th value generated by a secure pseudo-random number generator R. Unfortunately, the cost of making p calls to R for encrypting or decrypting c can be prohibitive for large values of p. A more serious problem is the vulnerability to estimation exposure: since the expected gap between two encrypted values is proportional to the gap between the corresponding plaintext values, the nature of the plaintext distribution can be inferred from the encrypted values. Figure 2 shows the distributions of encrypted values obtained using this scheme for data values sampled from two different distributions, uniform and Gaussian. In each case, once both the input and encrypted distributions are scaled to lie between 0 and 1, the number of points in each bucket is almost identical for the plaintext and encrypted distributions. Thus the percentile of a point in the encrypted distribution is also identical to its percentile in the plaintext distribution.

Polynomial Functions. In [12], a sequence of strictly increasing polynomial functions is used for encrypting integer values while preserving their order. These polynomial functions can simply be of first or second order, with coefficients generated from the encryption key. An integer value is encrypted by applying the functions in such a way that the output of one function becomes the input of the next. Correspondingly, an encrypted value is decrypted by solving these functions in reverse order.
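The composition of strictly increasing polynomials described in [12] can be sketched as follows; the concrete coefficients here are illustrative stand-ins for the key-derived coefficients the scheme actually uses:

```python
def make_poly(a, b):
    """A strictly increasing first-order polynomial f(x) = a*x + b
    with a > 0.  In [12] the coefficients are generated from the
    encryption key; here they are passed in directly (an assumption
    of this sketch)."""
    return lambda x: a * x + b

def encrypt(x, polys):
    """Encrypt by function composition: the output of each
    polynomial becomes the input of the next."""
    for f in polys:
        x = f(x)
    return x

polys = [make_poly(3, 7), make_poly(2, 1)]   # key-derived in [12]
# A composition of strictly increasing functions is itself
# strictly increasing, so plaintext order is preserved:
cs = [encrypt(p, polys) for p in range(5)]   # [15, 21, 27, 33, 39]
```

Decryption solves the same functions in reverse order, inverting the last polynomial first.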
However, this encryption method does not take the input distribution into account, so the shape of the distribution of encrypted values follows the shape of the input distribution. This suggests that the scheme may reveal information about the input distribution, which can be exploited.

Bucketing. In this approach, tuples are encrypted using conventional encryption, but an additional bucket id is created for each attribute value. This bucket id, which represents the partition to which the unencrypted value belongs, can be indexed. The constants appearing in a query are replaced by their corresponding bucket ids. Clearly, the result of a query will contain false hits that must be removed in a post-processing step after decrypting the tuples returned by the query. This filtering can be quite complex, since the bucket ids may have been used in joins, sub-queries, etc. The number of false hits depends on the width of the partitions involved. It is shown in [13] that the post-processing overhead can become excessive if a coarse partitioning is used for bucketization. On the other hand, a fine partitioning makes the scheme vulnerable to estimation exposure, particularly if an equi-width partitioning is used. It has also been pointed out that such indexes can open the door to inference and linking attacks. An alternative is to build a B-tree over the plaintext values, but then encrypt every tuple and the B-tree at the node level using conventional encryption. The advantage of this approach is that the content of the B-tree is not visible to an untrusted database server. The disadvantage is that the B-tree traversal must now be performed by the front end, by executing a sequence of queries that retrieve tree nodes at progressively deeper levels.
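The bucketing scheme described above can be sketched as follows; the partition boundaries and tuple labels are illustrative assumptions:

```python
import bisect

def bucket_id(value, boundaries):
    """Map a plaintext value to the partition (bucket) it falls in.
    Only the bucket id is stored in the clear alongside the
    conventionally encrypted tuple, so it can be indexed."""
    return bisect.bisect_right(boundaries, value)

boundaries = [1000, 2500, 5000]          # partition upper bounds
data = [(500, "t1"), (2400, "t2"), (3000, "t3"), (7000, "t4")]
stored = [(bucket_id(v, boundaries), t) for v, t in data]

# Query: balance < 2500 is rewritten to use bucket ids only:
target = bucket_id(2499, boundaries)
candidates = [t for b, t in stored if b <= target]
# candidates may include false hits (other values that share a
# bucket); these are filtered after decryption at the client.
```

Coarser boundaries increase the number of false hits to filter; finer boundaries leak more about the value distribution, as noted in [13].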
The two works most closely related to ours propose transformation-based techniques for outsourcing spatial data to an untrusted server, such that the server is able to perform spatial range search correctly for trusted users on the transformed points without knowing their actual coordinates. They propose spatial transformations in 2D space based on scaling, shifting, and noise injection, and also develop a solution using an encrypted R-tree. Those solutions operate on explicit 2D coordinates, rendering them inapplicable in our setting, where the distance function is a generic distance metric. Wong et al. propose outsourcing multidimensional points to an untrusted server by using a secure scalar-product encryption technique. Methods are then provided for kNN search at the server, without the server learning the distances among the points. However, the secure scalar product relies on specific properties of the Euclidean distance in multidimensional space. It is not applicable to other Lp norms, e.g., the L1 norm (the Manhattan distance). It also cannot be applied to our problem setting, which considers arbitrary metric-space objects (e.g., strings, graphs, time series). Another drawback of this proposal is that no indexing scheme can be built on the encrypted tuples, forcing the server to perform a linear scan over the data set; this severely affects the scalability of the system. In the field of privacy-preserving data mining, perturbation techniques have been developed for introducing noise into the data before sending it to the service provider. However, such an approach does not guarantee the exact retrieval of results. The k-anonymity model has been applied extensively for the privacy-preserving publication of data sets. The idea is to generalize the tuples in a table such that each generalized representation is shared by at least k tuples; this way, each object cannot be distinguished from at least k−1 other objects.
It is often used to generalize the medical records of patients so that an adversary cannot link a specific patient to a medical record. Except for some person-related data such as DNA data, most of the metric data that we consider (e.g., astronomy data, time series) is collected from nature rather than from persons. In the existing system, the data owner uses the AES algorithm for encrypting the data; we adopt a new cryptographic algorithm.

III. PROPOSED WORK

In our proposed system we introduce a new framework for searching the data. Our framework provides efficient security for the data uploaded to the network, access protection for the data accessed by the user, and finally a searching method that is accurate and efficient. For encryption we adopt the advanced AES algorithm, also known as the Rijndael algorithm, to encrypt the data uploaded by the data owner. Rijndael is an iterated block cipher: the encryption or decryption of a block of data is accomplished by the iteration (a round) of a specific transformation (a round function); the details of the round function are given below. Rijndael also defines a method to generate a series of sub-keys from the original key; the generated sub-keys are used as input to the round function. Its design goals were resistance against all known attacks, speed and code compactness on a wide range of platforms, and design simplicity. Rijndael was evaluated based on its security, its cost, and its algorithm and implementation characteristics. The primary focus of the analysis was on the cipher's security, but the choice of Rijndael was also based on its simple algorithm and implementation characteristics. There were several candidate algorithms, but Rijndael was selected because, based on the analyses, it had the best combination of security, performance, efficiency, ease of implementation, and flexibility.
The pseudo-code is as shown below:

Rijndael(State, CipherKey)
{
    KeyExpansion(CipherKey, ExpandedKey);
    AddRoundKey(State, ExpandedKey);
    For (i = 1; i < Nr; i++) Round(State, ExpandedKey + Nb*i);
    FinalRound(State, ExpandedKey + Nb*Nr);
}

And the round function is defined as:

Round(State, RoundKey)
{
    ByteSub(State);
    ShiftRow(State);
    MixColumn(State);
    AddRoundKey(State, RoundKey);
}

The round transformation is broken into layers: the linear mixing layer, which provides high diffusion over multiple rounds; the non-linear layer, which consists of applications of the Rijndael S-box; and the key-addition layer, which is simply an exclusive-or of the round key with the intermediate state. Each layer is designed to have its own well-defined function, which increases resistance to linear and differential cryptanalysis.

The next process in our framework is data storage at the service provider. The data owner selects a file, encrypts it using the above algorithm, and then stores it with the service provider. In the existing system the key is given to the user at the time of storage, which is very insecure over the network; in our system the key is given to the user only at the time of search. The next process is the searching process, for which we introduce a new searching algorithm, explained below.

In the initial step, the size of the first level of the seat-shifted table is determined by the alphabet size: assuming the size of the alphabet is SIZE, the size of the first level is also SIZE. Each character uses the decimal value of its ASCII code to mark its position in the first level.
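The first level of the seat-shifted table can be sketched as follows; assuming, for illustration, a 7-bit ASCII alphabet of size 128:

```python
def build_first_level(pattern, alphabet_size=128):
    """First level of the seat-shifted table (sketch): one slot per
    ASCII code, set to 1 if the character occurs in the pattern and
    0 otherwise.  Membership then costs a single table lookup."""
    table = [0] * alphabet_size
    for ch in pattern:
        table[ord(ch)] = 1
    return table

table = build_first_level("ABCA")
# 'A' (ASCII 65) occurs in the pattern; '@' (ASCII 64) does not.
```

The deeper levels, built below, chain the positions at which each marked character occurs.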
First, mark the locations of the characters in the pattern from left to right; then, for each character that appears in the pattern string, enter its positions in decreasing order at the slot indicated by its ASCII code. This constitutes the chains of the other levels (for example, in the figure, level 2 lies between the green lines and level 3 between the yellow lines). If a character appears in the pattern, its corresponding position is defined as 1; if not, it is defined as 0. Because `A` appears in the pattern and the ASCII code of `A` is 65, position 65 is marked 1; the character `@`, with ASCII code 64 (hex 40), does not appear in the pattern, so its position is marked 0. With this indication, when we want to know whether a character occurs in the pattern, we only need to check whether its mark in the table is 1 or not.

Taking a search starting position as the centre, we take the m−1 characters before and after it to compose a window of size (m−1) + 1 + (m−1) = 2m−1. In this way, the last m−1 characters of one window are also the first m−1 characters of the next window. This guarantees that, after the partition, the pattern string always falls entirely within some window in every match, so no data is omitted and the accuracy of the algorithm is preserved. If the pattern does not match in the first window, we go on to the next window. Using the Next array, we avoid moving the pattern backwards when there is no match. The values of the Next array depend only on the pattern's own characteristics and have nothing to do with the text string.
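The window partition described above can be sketched as follows; the text length and pattern length in the example are illustrative:

```python
def search_windows(n, m):
    """Place search starting positions every m characters in a text
    of length n; each window extends m-1 characters to either side
    of its centre, giving size (m-1) + 1 + (m-1) = 2m - 1.
    Adjacent windows overlap by m-1 characters, so any occurrence
    of an m-character pattern lies wholly inside some window."""
    windows = []
    for center in range(m - 1, n, m):     # centres at m-1, 2m-1, ...
        lo = max(0, center - (m - 1))
        hi = min(n, center + m)           # exclusive upper bound
        windows.append((lo, hi))
    return windows

# Text of length 12, pattern of length 4 -> centres at 3, 7, 11:
ws = search_windows(12, 4)                # [(0, 7), (4, 11), (8, 12)]
```

Each full window has size 2m−1 = 7; the final window is truncated at the end of the text.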
The establishing rules are as follows. We pre-process the pattern p1 p2 ... pm in advance and generate a function Next[i] (0 < i < m+1). When a mismatch occurs at the i-th position, we determine whether there is a maximum G such that the prefix p1 ... p(G−1) of the pattern matches the end of the portion matched so far. If such a G exists, then Next[i] = G; at the next match the pattern can be moved directly backward by i − Next[i], and the comparison restarts from the G-th character of the pattern. If no such G exists, then Next[i] = 1.

Matching. Each match starts from a search starting position and uses the seat-shifted table and the Next array.
a) First examine whether the mark in the seat-shifted table for the k-th (0 < k ≤ ⌊n/m⌋) search starting position is 1 or not. If it is 0, go to the (k+1)-th search starting position. If it is 1, the character occurs in the pattern string; therefore, in the second level of the seat-shifted table, find the first position of this character in the pattern, and align the pattern so that this position coincides with the k-th search starting position in the text string.
b) Match from the leftmost end of the pattern. If it matches completely before the search starting position, then match from the rightmost end of the pattern; if it also matches completely after the search starting position, a match is complete. Then jump to the next search starting position and continue.
c) If a match fails at a certain position, say i (0 ≤ i ≤ m), check Next[i]; then check the seat-shifted table, find the next position in the pattern of the character at the k-th search starting position, and compute the distance between the two positions, say Distance. Compare Distance with Next[i] and take the larger as the jump distance of the pattern. If Next[i] is larger, first match the character at the search starting position.
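The Next array used in these steps reads like the classical KMP failure function; a sketch under that assumption (the indexing convention here is 0-based, an illustrative choice):

```python
def build_next(pattern):
    """Next array (sketch, assuming the classical KMP failure
    function): next_[i] is the length of the longest proper prefix
    of pattern[:i] that is also a suffix of it.  On a mismatch at
    position i, the pattern shifts forward by i - next_[i]; the
    values depend only on the pattern, never on the text string."""
    m = len(pattern)
    next_ = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = next_[k]                 # fall back to shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        next_[i + 1] = k
    return next_

nxt = build_next("ABAB")                 # [0, 0, 0, 1, 2]
```

For "ABAB", a mismatch after matching all four characters allows a shift of 4 − Next[4] = 2, reusing the matched "AB" prefix.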
If it matches, continue matching in accordance with the above; otherwise, return to c). If there is no further position of the character in the seat-shifted table, go to the next search starting position and return to a).

IV. CONCLUSION

In this paper we introduced a framework for secure storage together with a searching algorithm. It reduces the time complexity of searching and yields moderate results. We adopted an advanced cryptographic algorithm to provide security for the data stored at the service provider. Our searching algorithm scans the document for character matches against the keyword given by the user.

REFERENCES
[1] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu, "Achieving Anonymity via Clustering," Proc. 25th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 153-162, 2006.
[2] R. Agrawal, P.J. Haas, and J. Kiernan, "Watermarking Relational Data: Framework, Algorithms and Analysis," The Int'l J. Very Large Data Bases, vol. 12, no. 2, pp. 157-169, 2003.
[3] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, "Order-Preserving Encryption for Numeric Data," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 563-574, 2004.
[4] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 439-450, 2000.
[5] C.A. Ardagna, M. Cremonini, E. Damiani, S.D.C. di Vimercati, and P. Samarati, "Location Privacy Protection Through Obfuscation-Based Techniques," Proc. 21st Ann. IFIP WG 11.3 Working Conf. Data and Applications Security (DBSec), pp. 47-60, 2007.
[6] V. Athitsos, M. Potamias, P. Papapetrou, and G. Kollios, "Nearest Neighbor Retrieval Using Distance-Based Hashing," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE), pp. 327-336, 2008.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B.
Seeger, "The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 322-331, 1990.
[8] S. Berchtold, D.A. Keim, and H.-P. Kriegel, "The X-Tree: An Index Structure for High-Dimensional Data," Proc. 22nd Int'l Conf. Very Large Databases, pp. 28-39, 1996.
[9] T. Bozkaya and Z.M. Özsoyoglu, "Indexing Large Metric Spaces for Similarity Search Queries," ACM Trans. Database Systems, vol. 24, no. 3, pp. 361-404, 1999.
[10] E. Chávez, G. Navarro, R.A. Baeza-Yates, and J.L. Marroquín, "Searching in Metric Spaces," ACM Computing Surveys, vol. 33, no. 3, pp. 273-321, 2001.
[11] P. Ciaccia, M. Patella, and P. Zezula, "M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces," Proc. Very Large Databases (VLDB), pp. 426-435, 1997.
[12] E. Damiani, S.D.C. Vimercati, S. Jajodia, S. Paraboschi, and P. Samarati, "Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs," Proc. 10th ACM Conf. Computer and Comm. Security (CCS), pp. 93-102, 2003.
[13] M. Dunham, Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
[14] C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Data Sets," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 163-174, 1995.
[15] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and K.L. Tan, "Private Queries in Location Based Services: Anonymizers Are Not Necessary," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 121-132, 2008.
[16] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," Proc. 25th Int'l Conf. Very Large Databases (VLDB), pp. 518-529, 1999.
[17] H. Hacigümüs, B.R. Iyer, C. Li, and S. Mehrotra, "Executing SQL over Encrypted Data in the Database-Service-Provider Model," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 216-227, 2002.
[18] H. Hacigümüs, S. Mehrotra, and B.R.
Iyer, "Providing Database as a Service," Proc. 18th Int'l Conf. Data Eng. (ICDE), pp. 29-40, 2002.
[19] A. Hinneburg, C.C. Aggarwal, and D.A. Keim, "What Is the Nearest Neighbor in High Dimensional Spaces?," Proc. 26th Int'l Conf. Very Large Data Bases (VLDB), pp. 506-515, 2000.
[20] G.R. Hjaltason and H. Samet, "Index-Driven Similarity Search in Metric Spaces," ACM Trans. Database Systems, vol. 28, no. 4, pp. 517-580, 2003.
[21] H.V. Jagadish, B.C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, "iDistance: An Adaptive B+-Tree Based Indexing Method for Nearest Neighbor Search," ACM Trans. Database Systems, vol. 30, no. 2, pp. 364-397, 2005.

BIOGRAPHIES

Mr. M.V. Durgaprasad is a student of Aditya Engineering College, Surampalem. He is presently pursuing his M.Tech [Computer Science & Engineering] at this college, and he received his B.Tech from Sri Prakash College of Engineering, affiliated to JNT University, Kakinada, in 2009. His areas of interest include Compiler Design, Database Management Systems, Data Mining, and current trends and techniques in Computer Science.

Mr. V.V. Ranganadh received his M.Tech (CSE) from JNTU, Kakinada, and is working as Assistant Professor in the Dept of CSE at Aditya Engineering College. He has 11 years of industrial and teaching experience in various engineering colleges, has published in national and international conferences and journals, and has guided many projects. His areas of interest include Image Processing, Data Mining, and other advances in Computer Applications.