A Secure Clustering Mechanism in Outsourced Databases


International Journal of Engineering Trends and Technology (IJETT) – Volume 23, Number 3, May 2015


Nimmagadda Naresh¹, P. Rajasekhar²

¹Final M.Tech Student, ²Asst. Professor

¹,²Department of CSE, Avanthi Institute of Engg. & Tech., Narsipatnam, A.P.

Abstract: In data mining over large databases, it is hard to maintain confidentiality when mining in distributed networks or outsourced databases, where data is integrated from various data holders or players for mining. Secure mining of data is required in open networks. In this paper we propose an efficient privacy-preserving data clustering technique for distributed networks with a decentralized architecture.

I. INTRODUCTION

The need to extract knowledge from distributed data, without a prior unification of the data, created the relatively new research area of Distributed Knowledge Discovery in Databases (DKDD).

In this paper, we present an approach where we first cluster the data locally. We then extract aggregated information about the locally created clusters and send this information to a central site. We use the following terms interchangeably for this locally created information: local model, local representatives, or aggregated information of the local site.

The transmission cost is negligible, as the representatives are only a small fraction of the original data.

On the central site we "reconstruct" a global clustering based on the representatives and send the result back to the local sites. The local sites update their clustering based on the global model, e.g., merge two local clusters into one or assign local noise to global clusters [1][2].

Given a set of objects with a distance function on them (i.e., a feature database), an interesting data mining question is whether these objects naturally form groups (called clusters) and what these groups look like. Data mining algorithms that try to answer this question are called clustering algorithms. In this section, we characterize well-known clustering algorithms according to different classification schemes.

Clustering algorithms can be classified along different, independent dimensions. One well-known dimension categorizes clustering methods according to the result they produce. Here, we can distinguish between hierarchical and partitioning clustering algorithms. Partitioning algorithms construct a flat (single-level) partition of a database D of n objects into a set of k clusters such that the objects in a cluster are more similar to one another than to objects in different clusters. Hierarchical algorithms decompose the database into several levels of nested partitionings (clusterings), represented for example by a dendrogram, i.e., a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D.

II. RELATED WORK

A decision tree [3][4][5] is defined as "a predictive modeling technique from the field of machine learning and statistics that builds a simple tree-like structure to model the underlying pattern of data". The decision tree is a popular method because it is able to handle both categorical and numerical data and performs classification with little computation. Decision trees are also often easier to interpret.

A decision tree is a classifier organized as a directed tree in which the node with no incoming edges is called the root. All nodes except the root have exactly one incoming edge. Each non-leaf node, called an internal or splitting node, contains a decision, and the most appropriate target value assigned to one class is represented by a leaf node. A decision tree classifier is able to break a complex decision-making process down into a collection of simpler decisions.

The complex decision is subdivided into simpler decisions on the basis of a splitting criterion, which divides the whole training set into smaller subsets. Information gain, gain ratio, and Gini index are three basic splitting criteria for selecting an attribute as a splitting point (a short sketch of these criteria follows the list below). Since decision trees can be built from historical data, they are often used for explanatory analysis as well as a form of supervised learning. The algorithm is designed in such a way that it works on all the data that is available and is as accurate as possible. According to Breiman et al. [6], the tree complexity has a crucial effect on its accuracy performance. The tree complexity is explicitly controlled by the pruning method employed and the stopping criteria used. Usually, the tree complexity is measured by one of the following metrics:

• The total number of nodes;

• Total number of leaves;

• Tree depth


• Number of attributes used.
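As a brief illustration of the splitting criteria mentioned above, the following sketch (in Python, with helper names of our own choosing) computes the Gini index and entropy of a class distribution; information gain is the reduction in entropy obtained by a candidate split.

    import math
    from collections import Counter

    def gini(labels):
        # Gini index of a list of class labels: 1 - sum(p_i^2).
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        # Shannon entropy of a list of class labels: -sum(p_i * log2(p_i)).
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, partitions):
        # Entropy reduction achieved by splitting `parent` into `partitions`.
        n = len(parent)
        weighted = sum(len(p) / n * entropy(p) for p in partitions)
        return entropy(parent) - weighted

    # Example: splitting ['a','a','b','b'] into pure halves gives gain 1.0.
    print(information_gain(['a', 'a', 'b', 'b'], [['a', 'a'], ['b', 'b']]))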

Decision tree induction is closely related to rule induction.

Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by conjoining the tests along the path to form the antecedent part, and taking the leaf’s class prediction as the class value. The resulting rule set can then be simplified to improve its accuracy and comprehensibility to a human user [7].

Fig. 1: Architecture of the proposed work — nodes 1 … n each hold pre-processed documents and a clustering component; when a new document arrives at a node, it sends requests (Req) to the other nodes.

In this paper we propose an architecture in which every data holder or player clusters the documents itself after preprocessing and computes the local and global frequencies of the documents for the calculation of file weights or document weights. Distributed k-means is one of the efficient distributed clustering algorithms. In our work we aim at optimized clustering in distributed networks by improving the traditional clustering algorithm. In our approach, when a new dataset is placed at a data holder, it requests the other data holders to forward only the relevant features, instead of the entire datasets, in order to cluster the documents.
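The request/response exchange of Fig. 1 could look roughly as follows; this is only an illustrative sketch, and the message shape, the `top_k` cut-off, and the function names are our own assumptions, not part of the paper.

    def select_relevant_features(local_weights, top_k=50):
        # local_weights: {term: tf-idf weight} computed at this data holder.
        # Forward only the top-k weighted features instead of the raw dataset.
        ranked = sorted(local_weights.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[:top_k])

    def handle_request(local_weights):
        # Each peer answers a "Req" message with its relevant features only,
        # which keeps transmission cost low and keeps the raw data local.
        return select_relevant_features(local_weights)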

III. PROPOSED SYSTEM

In this paper we propose an efficient and secure data mining technique with an optimized k-means and a cryptographic approach. To cluster similar types of information, the data points held by the individual data holders or players initially need to be shared.

We emphasize the mining approach rather than the cryptographic technique; for the secure transmission of data, various cryptographic algorithms and key exchange protocols are available. Fig. 1 above shows the architecture of the proposed work.

Individual peers at the data holders initially preprocess the raw data by eliminating unnecessary features from the datasets. After preprocessing, they compute the file weight of the datasets, or preprocessed feature set, in terms of term frequency and inverse document frequency, and compute the file relevance matrix to reduce the time complexity of clustering the datasets. We use the most widely used similarity measurement, i.e., cosine similarity:

Cos(d_m, d_n) = (d_m · d_n) / (||d_m|| × ||d_n||)

where d_m is the centroid (document weight) and d_n is the document weight or file weight from the relevance matrix.
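A minimal sketch of the file-weight (TF-IDF) and cosine-similarity computation described above, in plain Python; the tokenization and the variable names are illustrative assumptions.

    import math
    from collections import Counter

    def tf_idf_weights(docs):
        # docs: list of token lists. Returns one {term: weight} vector per document.
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        vectors = []
        for doc in docs:
            tf = Counter(doc)
            vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
        return vectors

    def cosine(dm, dn):
        # Cos(dm, dn) = (dm . dn) / (||dm|| * ||dn||) over sparse term vectors.
        dot = sum(w * dn.get(t, 0.0) for t, w in dm.items())
        norm_m = math.sqrt(sum(w * w for w in dm.values()))
        norm_n = math.sqrt(sum(w * w for w in dn.values()))
        return dot / (norm_m * norm_n) if norm_m and norm_n else 0.0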

The following example shows a simple way to retrieve the similarity between documents at an individual data holder by computing the cosine similarity prior to clustering.


Fig. 2: Similarity Matrix

        d1     d2     d3     d4     d5
  d1    1.0    0.48   0.66   0.89   0.45
  d2    0.77   1.0    0.88   0.67   0.88
  d3    0.45   0.9    1.0    0.67   0.34
  d4    0.32   0.47   0.77   1.0    0.34
  d5    0.67   0.55   0.79   0.89   1.0

In the above table, D = (d1, d2, …, dn) represents the set of documents at a data holder or player and their respective pairwise cosine similarities; precomputing this matrix reduces the time complexity of computing the similarity between the centroids and the documents during clustering.
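A sketch of this relevance-matrix precomputation and of the get_relevance lookup used in step 3 of the algorithm below; the dictionary-of-pairs layout is our own assumption.

    def build_relevance_matrix(vectors):
        # Precompute all pairwise cosine similarities once, so clustering
        # iterations can look them up instead of recomputing them.
        matrix = {}
        for i, dm in enumerate(vectors):
            for j, dn in enumerate(vectors):
                matrix[(i, j)] = cosine(dm, dn)  # cosine() from the sketch above
        return matrix

    def get_relevance(matrix, m, n):
        return matrix[(m, n)]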

In our approach we enhance the k-means algorithm with re-centroid computation instead of a single random selection at every iteration. The optimized k-means algorithm is as follows:

Algorithm:

1: Select K points as initial centroids for the initial iteration.

2: Repeat steps 3-6 until the termination condition is met (a user-specified maximum number of iterations).

3: get_relevance(d_m, d_n), where d_m is the file weight of document m from the relevance matrix and d_n is the file weight of document n from the relevance matrix.

4: Assign each point to its closest centroid to form K clusters.

5: Re-compute each centroid from the intra-cluster data points (i.e., the average of any k data points in the individual cluster), e.g. (P_11 + P_12 + … + P_1k) / k, where all points are from the same cluster.

6: Compute the new centroid for the merged cluster.

In the traditional k-means algorithm a new centroid is selected randomly; in our approach we enhance this by the prior construction of the relevance matrix and by taking the average of k randomly selected document weights for the new centroid calculation.
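A minimal sketch of the optimized k-means described above, reusing the TF-IDF vectors and cosine() from the earlier sketches; recentering averages the member vectors of each cluster rather than picking a single random point. The parameters k and max_iter and the averaging helper are illustrative choices.

    import random

    def average(vectors):
        # Component-wise mean of sparse term vectors, e.g. (P_11 + ... + P_1k) / k.
        merged = {}
        for v in vectors:
            for t, w in v.items():
                merged[t] = merged.get(t, 0.0) + w
        return {t: w / len(vectors) for t, w in merged.items()}

    def optimized_kmeans(vectors, k, max_iter=20):
        centroids = random.sample(vectors, k)        # step 1: initial centroids
        for _ in range(max_iter):                    # step 2: until termination
            clusters = [[] for _ in range(k)]
            for v in vectors:                        # steps 3-4: closest centroid
                best = max(range(k), key=lambda c: cosine(v, centroids[c]))
                clusters[best].append(v)
            centroids = [average(c) if c else centroids[i]  # step 5: re-centering
                         for i, c in enumerate(clusters)]
        return clusters, centroids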

Triple DES:

Triple DES is the common name for the Triple Data Encryption Algorithm (TDEA) block cipher. It is so named because it applies the Data Encryption Standard (DES) cipher algorithm three times to each data block. Triple DES provides a relatively simple method of increasing the key size of DES to protect against brute-force attacks, without requiring a completely new block cipher algorithm.

The standards define three keying options:

• Keying option 1: all three keys are independent.

• Keying option 2: K1 and K2 are independent, and K3 = K1.

• Keying option 3: all three keys are identical, i.e., K1 = K2 = K3.

Keying option 1 is the strongest, with 3 x 56 = 168 independent key bits.

Keying option 2 provides less security, with 2 x 56 = 112 key bits. This option is stronger than simply DES encrypting twice, e.g. with K1 and K2, because it protects against meet-in-the-middle attacks.

Keying option 3 is no better than DES, with only 56 key bits. This option provides backward compatibility with DES, because the first and second DES operations simply cancel out. It is no longer recommended by the National Institute of Standards and Technology (NIST) and is not supported by ISO/IEC 18033-3.

In general, Triple DES with three independent keys (keying option 1) has a key length of 168 bits (three 56-bit DES keys), but due to the meet-in-the-middle attack the effective security it provides is only 112 bits. Keying option 2 reduces the key size to 112 bits; however, this option is susceptible to certain chosen-plaintext or known-plaintext attacks, and thus NIST designates it as having only 80 bits of security.
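For the secure transmission between data holders, a hedged sketch of two-key Triple DES (keying option 2) in CBC mode is shown below, assuming the pyca/cryptography package is available. The key handling here is illustrative only; a real deployment would distribute keys via a key-exchange protocol, as the paper notes.

    import os
    from cryptography.hazmat.primitives import padding
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(16)   # 16 bytes -> two-key 3DES (K3 = K1), 112 key bits
    iv = os.urandom(8)     # DES block size is 64 bits

    def encrypt(plaintext: bytes) -> bytes:
        padder = padding.PKCS7(64).padder()
        padded = padder.update(plaintext) + padder.finalize()
        enc = Cipher(algorithms.TripleDES(key), modes.CBC(iv)).encryptor()
        return enc.update(padded) + enc.finalize()

    def decrypt(ciphertext: bytes) -> bytes:
        dec = Cipher(algorithms.TripleDES(key), modes.CBC(iv)).decryptor()
        padded = dec.update(ciphertext) + dec.finalize()
        unpadder = padding.PKCS7(64).unpadder()
        return unpadder.update(padded) + unpadder.finalize()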

IV. CONCLUSION

We conclude our current research work with an efficient privacy-preserving data clustering approach over distributed networks. The quality of the clustering mechanism is enhanced by preprocessing, the relevance matrix, and the centroid computation in the k-means algorithm; the cryptographic technique solves the secure transmission of data between data holders or players and saves privacy-preserving cost by forwarding the relevant features of the dataset instead of the raw datasets. Security can be further enhanced by establishing an efficient key exchange protocol and cryptographic techniques for the transmission of data between data holders or players.


REFERENCES

[1] F. Giannotti, L. V. S. Lakshmanan, A. Monreale, D. Pedreschi, and H. (Wendy) Wang, "Privacy-Preserving Mining of Association Rules From Outsourced Transaction Databases."

[2] P. K. Fong and J. H. Weber-Jahnke, "Privacy Preserving Decision Tree Learning Using Unrealized Data Sets," IEEE Computer Society.

[3] T. Tassa and D. J. Cohen, "Anonymization of Centralized and Distributed Social Networks by Sequential Clustering."

[4] S. Jha, L. Kruger, and P. McDaniel, "Privacy Preserving Clustering."

[5] C. Clifton, M. Kantarcioglu, X. Lin, and M. Y. Zhu, "Tools for Privacy Preserving Distributed Data Mining."

[6] S. Datta, C. R. Giannella, and H. Kargupta, "Approximate distributed K-Means clustering over a peer-to-peer network," IEEE TKDE, vol. 21, no. 10, pp. 1372–1388, 2009.

[7] M. Eisenhardt, W. Müller, and A. Henrich, "Classifying documents by distributed P2P clustering," in INFORMATIK, 2003.

[8] K. M. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Trans. Knowl. Data Eng., vol. 21, no. 5, pp. 681–698, 2009.

[9] H.-C. Hsiao and C.-T. King, "Similarity discovery in structured P2P overlays," in ICPP, 2003.

[10] S. R. M. Oliveira and O. R. Zaïane, "Privacy Preserving Clustering By Data Transformation."
