Uploaded by International Research Journal of Engineering and Technology (IRJET)

IRJET- Data Mining - Secure Keyword Manager

advertisement
International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 06 Issue: 12 | Dec 2019
p-ISSN: 2395-0072
www.irjet.net
Data Mining - Secure Keyword Manager
Mrs. Nalini S. Jagtap1, Ms. Rachana Mudholkar2, Mrs. Pratiksha Shevatekar3
1Asst.
Prof., Dept. of Computer Engineering, Dr D. Y. Patil Inst. of Eng., Mgmt. and Research, Maharashtra, India.
Prof., Dept. of Computer Engineering, Dr D. Y. Patil Inst. of Eng., Mgmt. and Research, Maharashtra, India.
3HOD., Dept. of Computer Engineering, Dr D. Y. Patil Inst. of Eng., Mgmt. and Research, Maharashtra, India.
---------------------------------------------------------------------***---------------------------------------------------------------------2Asst.
Abstract - Nowadays, more and more people are motivated to
outsource their local data to public cloud servers for great
convenience and reduced costs in data management. But in
consideration of privacy issues, sensitive data should be
encrypted before outsourcing, which obsoletes traditional data
utilization like keyword based document retrieval. In this
paper, we present a secure and efficient multi-keyword ranked
search scheme over encrypted data, which additionally
supports dynamic update operations like deletion and
insertion of documents. Specifically, we construct an index tree
based on vector space model to provide multi-keyword search,
which meanwhile supports flexible update operations. Besides,
cosine similarity measure is utilized to support accurate
ranking for search result. To improve search efficiency, we
further propose a search algorithm based on “Greedy Depth
first Traverse Strategy”. Moreover, to protect the search
privacy, we propose a secure scheme to meet various privacy
requirements in the known cipher text threat model.
Key Words: Data Mining, Keyword, Database, Security,
Network, Cloud, AI
1. INTRODUCTION
We propose a scheme, termed Efficient keyword retrieval, in
which each user can choose the keyword of his own keyword
to determine the percentage of matched linked keyword to
be returned. The basic idea of keyword matching is to
construct a privacy preserving mask matrix that allows the
cloud to filter out a certain percentage of matched files
before returning to the ADL. This is not a trivial work, since
the cloud needs to correctly filter out files according to the
keyword of queries without knowing anything about user
privacy. Focusing on different design goals, we provide two
extensions: the first extension emphasizes simplicity by
requiring the least amount of modifications from the
keyword scheme, and the second extension emphasizes
privacy by leaking the least amount of information to the
cloud.
2. LITERATURE SURVEY:
Searchable encryption schemes enable the clients to store
the encrypted data to the cloud and execute keyword search
over cipher text domain. Due to different cryptography
primitives, searchable encryption schemes can be
constructed using public key based cryptography or
symmetric key based cryptography. Song et al. proposed the
first symmetric searchable encryption (SSE) scheme, and the
© 2019, IRJET
|
Impact Factor value: 7.34
|
search time of their scheme is linear to the size of the data
collection. Goh proposed formal security definitions for SSE
and designed a scheme based on Bloom filter. The search
time of Goh’s scheme is O(n), where n is the cardinality of the
document collection. Curtmola et al. proposed two schemes
(SSE-1 and SSE-2) which achieve the optimal search time.
Their SSE-1 scheme is secure against chosen-keyword
attacks (CKA1) and SSE-2 is secure against adaptive chosen
keyword attacks (CKA2).These early works are single
keyword boolean search schemes, which are very simple in
terms of functionality. Afterward, abundant works have been
proposed under different threat models to achieve various
model search functionality, such as single keyword search,
similarity search, multi-keyword boolean search, ranked
search, and multi-keyword ranked search [1], [2], [3], [4],
etc. Multi-keyword boolean search allows the users to input
multiple query keywords to request suitable documents.
Among these works, conjunctive keyword search schemes
[1], [2], [5] only return the documents that contain all of the
query keywords. Disjunctive keyword search schemes [4],
[5] return all of the documents that contain a subset of the
query keywords. Predicate search schemes [3], [4], [5] are
proposed to support both conjunctive and disjunctive
search. All these multi keyword search schemes retrieve
search results based on the existence of keywords, which
cannot provide acceptable result ranking functionality.
Ranked search can enable quick search of the most relevant
data. but they are designed only for single keyword search.
Cao et al. [3] realized the first privacy-preserving. multikeyword ranked search scheme, in which documents and
queries are represented as vectors of dictionary size. With
the “coordinate matching”, the documents are ranked
according to the number of matched query keywords.
However, Cao et al.’s scheme does not consider the
importance of the different keywords, and thus is not
accurate enough. In addition, the search efficiency of the
scheme is linear with the cardinality of document collection.
Sun et al. [1] presented a secure multi-keyword search
scheme that supports similarity-based ranking. The authors
constructed a searchable index tree based on vector space
model and adopted cosine measure together with TF×IDF to
provide ranking results. Sun et al.’s search algorithm
achieves better-than-linear search efficiency but results in
precision loss. Orencik et al. [5] proposed a secure multikeyword search method which utilized local sensitive hash
(LSH)functions to cluster the similar documents. The LSH
algorithm is suitable for similar search but cannot provide
exact ranking. In [3], Zhang et al. proposed a scheme to deal
with secure multi-keyword ranked search in a multi-owner
ISO 9001:2008 Certified Journal
|
Page 841
International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 06 Issue: 12 | Dec 2019
p-ISSN: 2395-0072
www.irjet.net
model. In this scheme, different data owners use different
secret keys to encrypt their documents and keywords while
authorized data users can query without knowing keys of
these different data owners. The authors proposed an
“Additive Order Preserving Function” to retrieve the most
relevant search results. However, these works don’t support
dynamic operations. Practically, the data owner may need to
update the document collection after he upload the
collection to the cloud server. Thus, the SE schemes are
expected to support the insertion and deletion of the
documents. There are also several dynamic searchable
encryption schemes. In the work of Song et al., the each
document is considered as a sequence of fixed length words,
and is individually indexed. This scheme supports
straightforward update operations but with low efficiency.
Goh proposed a scheme to generate a sub-index (Bloom
filter) for every document based on keywords. Then the
dynamic operations can be easily realized through updating
of a Bloom filter along with the corresponding document.
However, Goh’s scheme has linear search time and suffers
from false positives. In 2012, Kamara et al. constructed an
encrypted inverted index that can handle dynamic data
efficiently. But, this scheme is very complex to implement.
Subsequently, as an improvement, Kamara et al. proposed a
new search scheme based on tree based index, which can
handle dynamic update on document data stored in leaf
nodes. However, their scheme is designed only for single
keyword Boolean search. In, Cash et al. presented a data
structure for keyword/identity tuple named “TSet”. Then, a
document can be represented by a series of independent TSets. Based on this structure, Cash et al. proposed a dynamic
searchable encryption scheme. In their construction, newly
added tuples are stored in another database in the cloud, and
deleted tuples are recorded in a revocation list. The final
search result is achieved through efficient information
retrieval query using aggregation and distribution layer.
files with rank l, the probability of the product of the
columns corresponding to file keywords being 0 is l/r.
Step 3:
Based on the mask matrix, the cloud runs the File Filter
algorithm (Alg. 3) to filter out a certain percentage of
matched files and returns a union buffer to the ADL. The
process is as follows: For each file Fj, the cloud first
multiplies the k-th columns that correspond to Fj’s keywords
in the mask matrix to obtain cj, where k=j MODr. The
example columns chosen for each file. Then, the cloud
powers the file content to cj to obtain ej and maps (ci, ei) to
many entries of a union buffer as the Ostrovsky scheme.
Here, cj denotes the occurrence of ranked keywords in file Fj.
Thus, cj will be larger than 0, and file Fj will be returned only
when l+k≤r, where k=j MODr.
Step 4:
The ADL runs the Result Divide algorithm to distribute files
to each user. The ADL first recovers all files that match user
queries as the File Recover algorithm in the Ostrovsky
scheme. Then, the ADL distributes appropriate files to each
user based on the user queries. To make sure that the ADL
distributes files correctly, we can require the cloud to attach
file keywords with the file content. Thus, the ADL can find
out all of the files that match each user’s query by executing
keyword searches.
4. SYSTEM ARCHITECTURE:
3. ALGORITHM:
Step 1:
Each user runs the Query Gen algorithm to send a query to
the ADL, where the user query consists of the chosen
keywords and the query rank.
Step 2:
Given users’ queries, the ADL runs the Matrix Construct
algorithm (Alg. 2) to send a mask matrix to the cloud. The
mask matrix M is a d-row and r-column matrix, where dis
the number of keywords in the dictionary, and r is the
highest rank of queries. The mask matrix M can be
constructed as follows: For each keyword w, the ADL first
sets w’s rank with l, the highest query rank choosing this
keyword. Then, for the row corresponding to keyword w, the
ADL sets the first r−l columns to 1 and the last l columns to 0.
The example mask matrix is shown in Fig. 5-(a). Note that,
the reason for setting the first r−l columns, rather than
random r−l columns, to 1 is to ensure that, given any two
Cloud computing as an emerging technology is expected to
reshape information technology processes in the near future.
© 2019, IRJET
ISO 9001:2008 Certified Journal
|
Impact Factor value: 7.34
|
Fig -1: System Architecture
|
Page 842
International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 06 Issue: 12 | Dec 2019
p-ISSN: 2395-0072
www.irjet.net
Due to the overwhelming merits of cloud computing, e.g.,
cost-effectiveness, flexibility and scalability, more and more
organizations choose to outsource their data for sharing in
the cloud. As a typical cloud application, an organization
subscribes the cloud services and authorizes its staff to share
files in the cloud. Each file is described by a set of keywords,
and the staff, as authorized users, can retrieve files of their
interests by querying the cloud with certain keywords. In
such an environment, how to protect user privacy from the
cloud, which is a third party outside the security boundary of
the organization, becomes a key problem.
[2]
C . Gentry, “A fully homomorphic encryption scheme,”
Ph. D. dissertation, Stanford University, 2009 .
[3]
S. Kamara and K. Lauter, “Cryptographic cloud storage,”
in Financial Cryptography and Data Security. Springer,
2010, pp. 136–149.
[4]
D. Boneh, G. Di Crescenzo, R. Ostrovsky, and G. Persiano,
“Public key encryption with keyword search,” in
Advances in Cryptology Eurocrypt 2004 . Springer,
2004, pp . 506 –522 .
5. SOFTWARE AND HARDWARE:
[5]
D. X. Song, D. Wagner, and A. Perrig, “Practical
techniques for searches on encrypted data,” in Security
and Privacy, 2000. S&P 2000. Proceedings. 2000 IEEE
Symposium on. IEEE, 2000, pp. 44–55.
 Hardware
o Intel i5 processor
o 4 GB ram
o 500 GB HDD
 Software
o JDK 8
o NetBeans IDE
o MSSQL Server 2008 R2
6. ADVANTAGES

Data Security

Central Management

Communication Security
7. CONCLUSION
We proposed keyword matching schemes based on an ADL
to provide differential keyword services while protecting
user privacy. By using our schemes, a user can retrieve
different percentages of matched keywords by specifying
public keyword of different ranks. By further reducing the
communication cost incurred on the cloud, the ORQI
schemes make the private searching technique more
applicable to a cost-efficient cloud environment. However, in
the ORQI schemes, we simply determine the rank of each file
by the highest rank of queries it matches. For our future
work, we will try to design a flexible ranking mechanism for
the ORQI schemes.
REFERENCES
[1]
K.Ren,C.Wang,Q.Wangetal.,“Security challenges for the
public cloud,” IEEE Internet Computing, vol. 16, no. 1,
pp. 69–73, 2012. M. Young, the Technical Writer’s
Handbook. Mill Valley, CA: University Science, 1989.
© 2019, IRJET
|
Impact Factor value: 7.34
|
ISO 9001:2008 Certified Journal
|
Page 843
Download