IMPLEMENTATION OF DOUBLE LAYER PRIVACY ON ID3 DECISION TREE ALGORITHM

Akshaya.S, Information Technology, SVCE, vinaakshay@gmail.com
Jayasre Manchari.V, Information Technology, SVCE, jayasremanchari@gmail.com
MohamedThoufeeq.A, Information Technology, SVCE, thoufeeq1132@gmail.com
Kiruthikadevi.K, Professor, IT, SVCE, kiruthika@svce.ac.in

ABSTRACT

Data mining presents many opportunities for enhanced services and products in diverse areas such as healthcare, banking, traffic planning and online search. However, its promise is hindered by concerns regarding the privacy of the individuals whose data are being mined. Though existing data mining techniques such as classification, clustering, association and prediction reveal useful patterns in a dataset, there is always a threat to an individual's information. Adaptations such as randomized response, k-anonymity and differential privacy do not always adequately protect sensitive information, since their main concern is accuracy. The problem, therefore, is to perform data mining with formal privacy guarantees. Our implementation is a classification model based on the ID3 decision tree, to which we add different layers of privacy in order to preserve an individual's identity while achieving a balance between privacy and utility. In this paper we propose a privacy framework for the ID3 decision tree algorithm by (1) adding noise to the existing algorithm and (2) perturbing the input dataset. Since an optimum balance between utility and privacy was not achieved by either approach, a third, hybrid framework was developed, in which the input data is normalized and given as input to the noisy algorithm; with this we achieved a better level of accuracy along with privacy guarantees.

1.
INTRODUCTION

Nowadays, organizations are accumulating voluminous and growing amounts of data in various formats, and it requires a great deal of time and cost to analyze terabytes or even exabytes of data and retrieve business patterns from them. These data keep multiplying day by day. Such large-scale collections of personal information are widely distributed across medical, banking, marketing and financial records. While processing this massive amount of data to discover useful patterns, there is always the risk of compromising individuals' privacy, leading to a trade-off between privacy and utility. The adversary should not learn anything about an individual who contributes to the dataset, even in the presence of auxiliary information. The data gathered by an organization is subjected to computational processing by the administrator with the intent of obtaining useful information or patterns. Such analysis may reveal relationships or associations within the data that bring out unique patterns playing a major role in decision making or future use. Thus a straightforward adaptation of the ID3 decision tree classification algorithm to work with a privacy-preserving layer will lead to suboptimal performance.

1.1 PURPOSE

Each data mining application can have its own privacy requirements, including protection of personal information, statistical disclosure control and so on. An individual should be certain that his or her data will remain private. The increasing use of computers and networks has led to a proliferation of sensitive data, and without proper precautions this data could be misused. For the data miner, however, all these individual records are very valuable. While applying an existing data mining algorithm with the intent of obtaining a specific pattern, mainly for decision making or future use, there is a chance for an individual's record to be leaked.
Hence there lies great value in providing a data mining solution that offers reliable privacy guarantees without compromising accuracy.

1.2 SCOPE

The problem with existing data mining models is that, with the availability of non-sensitive information, one is able to infer sensitive information that should not be disclosed. The privacy of the individuals whose data is being mined can be breached through collusion between adversarial parties or through repeated access to the same property. This calls for privacy in data mining. Privacy-preserving methods of data handling seek to provide sufficient privacy as well as sufficient utility, protecting individual identities along with proprietary and sensitive information. The results of data mining will be profitable only if a sufficient amount of privacy is enforced, since most publicly available data contains useful information that needs to be preserved.

2. RELATED WORK

[1] Data Mining with Differential Privacy, Israel Institute of Technology, Haifa 32000, Israel, KDD 2010. In this paper, data perturbation is performed before giving the input to the algorithm. A new approach implements the ID3 decision tree model with differential privacy based on the SuLQ framework: a programmable privacy-preserving layer is added to the query interface, so the data miner need not consider privacy requirements when enforcing privacy. The accuracy of the classifier increased as the privacy budget (ε) increased for relatively small dataset sizes. However, large variance was found in certain experimental results, and privacy is not well established.

[5] A Framework for Privacy Preserving Classification in Data Mining, The University of Newcastle, Callaghan, NSW 2308, Australia, ACSW 2004.
In this paper, a significant amount of noise is added to both confidential and non-confidential attributes, treating all attributes as sensitive, whereas the same level of privacy can be achieved by adding less noise to the confidential attributes alone. The privacy framework was also extended by perturbing the leaf-innocent and leaf-influential attributes. The experimental results show that though the perturbed decision tree differs from the original tree, its logical rules are maintained, with a minimum level of privacy.

[6] A Noise Addition Scheme in Decision Tree for Privacy Preserving Data Mining, Journal of Computing, Volume 2, Issue 1, January 2010, ISSN 2151-9617. The methodology in this paper involves adding noise to sensitive attributes; specifically, noise is added to the numeric attributes after exploring the decision tree of the original data. The obfuscated data is then presented to the second party for decision tree analysis. The decision trees obtained on the original data and on the obfuscated data are similar. Here the perturbed classifier is as good as the original classifier, but the level of privacy is low.

[7] Privacy Preserving Decision Tree Learning Using Unrealized Data Sets, IJREAS, Volume 3, Issue 3 (March 2013), ISSN 2249-3905. In most previous work, the input dataset was anonymized and the noisy data was then given as input to the classification algorithm. In this paper, a privacy technique called dataset complementation removes samples from the perturbed dataset and modifies it dynamically. The perturbed datasets are stored to enable a modified decision tree data mining method. The experimental results show that the privacy measure is medium, with a time complexity of O(Ts), where Ts is the training sample.

3. EXISTING PRIVACY TECHNIQUES

Research on adding a privacy layer to decision tree data mining models was done in [9], [10] using cryptographic techniques and randomization.
[11] addresses a privacy layer in the C4.5 classification algorithm without SMC over a vertically partitioned set of data. Various methodologies involve adding noise to sensitive attributes: in [6], specific noise is added to the numeric attributes after exploring the decision tree of the original data. A significant amount of noise can also be added to both confidential and non-confidential attributes, treating all attributes as sensitive, as in [5], where the same level of privacy can be achieved by adding less noise to the confidential attributes; that framework was also extended by perturbing the leaf-innocent and leaf-influential attributes. In all the previous privacy methods applied to the ID3 data mining model, achieving the maximum level of accuracy was the main concern: the noisy decision tree was kept almost identical to the original tree, with little variation, so that the logical rules of the original decision tree were preserved to a great extent. Although a layer of privacy was added, either by perturbing the data or by adding noise to the ID3 algorithm, the amount of noise added was so small that there was always a possibility for an individual's identity to be leaked.

[Figure: single layer privacy frameworks. In the first, a privacy layer is applied to the ID3 decision tree data mining algorithm itself; in the second, the privacy layer is applied to the input dataset. Each produces a perturbed output.]

4. PROPOSED SYSTEM

The following are the new ideas for implementing a privacy layer on the existing ID3 decision tree algorithm. First, the existing algorithm is modified by adding Laplacian noise at the ROOT LEVEL and at the CLASS LEVEL, so that a level of privacy is obtained within the algorithm itself. This newer algorithm modifies the output of the ID3 technique by adding Laplacian noise that recursively affects the information gain of the descendants. As a result, the original decision tree is modified and a noisy classifier is output.
Any adversary who queries the database based on this perturbed decision tree will not get the original sample and thus cannot correctly identify an individual. As an extension, to evaluate the suboptimal performance levels of accuracy and privacy, anonymization of the input dataset is performed instead of adding noise to the data mining algorithm. This dataset anonymization is done as SENSITIVE ATTRIBUTES based anonymization. A further proposal adds a dual layer of privacy in such a way that the utility of the data is also maintained. Initially, the original ID3 decision tree is converted to a binary tree by clubbing the attribute values of the input data based on their sensitivity, such that the difference between the groups is minimal. A noisy decision tree with at most two branches per node is thus obtained. To this perturbed tree, another layer of privacy is added only at the root level, which is the most sensitive attribute.

5. DATASET DESCRIPTIONS

The dataset is divided into a training set, used to build the model, and a test set, used to determine the accuracy of the model. Given the training set as input, the decision tree is constructed with the ID3 algorithm. Two different datasets are considered: realistic data (the Bank dataset) and synthetic data (the Adult dataset).

5.1 REAL DATASET

Real datasets are not anonymized and are presented realistically. We consider the bank dataset with 600 instances: 400 training instances and 200 testing instances. The classifier attribute is pep (yes/no), and the other attributes are age, sex, region, income, married, children, car, save_act, current_act and mortgage. Since the ID3 algorithm does not take continuous attributes, the continuous attributes are discretized: age into teen, midage, oldies; income into low, medium, high.

5.2 SYNTHETIC DATASET

Synthetic data are generated to meet specific needs or conditions that may not be found in the original, real data.
This can be useful when designing any type of system, because the synthetic data serve as a simulation or as theoretical values or situations. The creation of synthetic data is an involved process of data anonymization. We consider the Adult dataset with 48842 instances, of which 45222 remain after removing instances with missing values. The training data comprises 30162 instances and the testing data 15060 instances. The classifier attribute is stategovt (<=50K, >50K), and the other attributes are age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week and native-country. Since the ID3 algorithm does not take continuous attributes, the continuous attributes are discretized as follows:
age as young, mid-age, oldies
fnlwgt as low, medium, high
education_num as first, second, third
cap_gain as 1, 2, 3
cap_loss as 1st, 2nd, 3rd
hours_per_week as min, medium, max

6. SINGLE LAYER PRIVACY

6.1 ID3 ALGORITHM ANONYMIZATION

The basic idea of the ID3 algorithm is to construct the decision tree using the concept of information gain, testing each attribute at every tree node over the given sets. In implementing ID3, entropy is computed to determine how informative a particular input attribute is about the output attribute for a subset of the training data. To minimize the decision tree depth, the optimal attribute for splitting must be selected at each tree node. The attribute with the highest gain is selected as the root node; information gain is calculated as the expected reduction in entropy with respect to a specified attribute when splitting a decision tree node.

6.1.1 ROOT LEVEL ANONYMIZATION

The original decision tree, without noise, selects the sensitive attribute with the highest information gain as the root node and constructs the tree from there.
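The gain computation and root selection just described can be sketched in a few lines of Python. This is a minimal illustration, not the code used in our experiments: the `entropy` and `info_gain` helpers and the toy bank-style records are illustrative names and data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(records, attr, target):
    """Expected reduction in entropy of `target` after splitting on `attr`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        remainder += (len(subset) / len(records)) * entropy(subset)
    return base - remainder

# Toy bank-style records (illustrative only)
records = [
    {"age": "teen",   "income": "low",    "pep": "NO"},
    {"age": "teen",   "income": "high",   "pep": "NO"},
    {"age": "midage", "income": "high",   "pep": "YES"},
    {"age": "oldies", "income": "medium", "pep": "YES"},
]
gains = {a: info_gain(records, a, "pep") for a in ("age", "income")}
root = max(gains, key=gains.get)  # attribute with the highest gain becomes the root
```

On these toy records, splitting on age separates the classes perfectly (gain 1.0 bit) while income leaves a mixed subset (gain 0.5 bit), so age would be chosen as the root; ID3 then recurses on each branch in the same way.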
In root level anonymization, the root node of the decision tree is modified to give an inaccurate answer to the adversary. That is, the actual root node is changed to the attribute with the second-highest information gain. This is done by adding Laplacian noise at the root level. The privacy parameter epsilon is chosen as any value between 0.75 and ln 3.

6.1.2 CLASS LEVEL ANONYMIZATION

Here the decision tree is modified by adding noise at the sub-node level; that is, noise is added to the input sensitive attributes.

6.2 DATA PERTURBATION

To evaluate the suboptimal performance levels of accuracy and privacy, anonymization of the input dataset is performed instead of adding noise to the data mining algorithm. The anonymization of the input data is done as sensitive-attribute based anonymization: the instances that satisfy the sensitive attributes are identified and the corresponding classifier value is modified, thus providing an inaccurate answer to the adversary. The ID3 decision tree algorithm is then run with the ANONYMIZED synthetic dataset as input training data for the various possible cases, and the testing data is predicted to output a modified classifier.

7. DOUBLE LAYER PRIVACY

A dual layer of privacy is achieved in this implementation. The first layer of privacy anonymizes the input dataset by converting all the attributes into binary form. This anonymization technique depends entirely on the input dataset: only when non-binary variables are present is the dataset perturbed and converted into binary, after which it is subjected to a second layer of privacy, the root node anonymization of the decision tree. This root node perturbation technique modifies the decision tree by changing its root node, thus providing an incorrect classifier for the testing dataset.
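The root-level perturbation used in both the single and double layer schemes can be sketched as follows. This is a minimal sketch, assuming the information gains have already been computed; the `laplace` sampler and `noisy_root` helper are hypothetical names, not our implementation. Noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to each attribute's gain before the argmax is taken, so with a small epsilon a lower-gain attribute (e.g. the second highest) can be promoted to the root.

```python
import math
import random

def laplace(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_root(gains, epsilon, sensitivity=1.0):
    """Pick the root attribute from Laplace-perturbed information gains.

    `gains` maps attribute -> information gain; `epsilon` is the privacy
    budget (chosen between 0.75 and ln 3 in our experiments). Smaller
    epsilon means larger noise and a higher chance that the noisy argmax
    differs from the true highest-gain attribute.
    """
    noisy = {a: g + laplace(sensitivity / epsilon) for a, g in gains.items()}
    return max(noisy, key=noisy.get)

random.seed(0)  # fixed seed so the sketch is reproducible
gains = {"children": 0.25, "income": 0.20, "age": 0.12}  # illustrative values
root = noisy_root(gains, epsilon=1.0)
```

With a very large epsilon the noise vanishes and the true best attribute is always returned; with epsilon near the values above, the second-best attribute wins a noticeable fraction of the time, which is the behaviour the root-level anonymization relies on.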
The advantage of our implementation is that even when there are no non-binary values in the input dataset, at least the second layer of privacy is still applied. Hence, in either case, privacy is achieved: either by root node modification alone, or by a combination of data perturbation and root node modification.

[Figure: double layer privacy framework. The input dataset is normalised, the normalised input is given to the ID3 decision tree data mining algorithm, and a perturbed output is produced.]

7.1 FIRST LEVEL PRIVACY AT ROOT

The non-binary attributes are split in such a way that their values fall into just two categories, so that the decision tree has at most two branches per node.
Step 1: Non-binary attributes are identified.
Step 2: The local sensitivity of the attributes is found.
Step 3: Attribute values are clubbed such that the sensitivity of the split binary attributes is maintained.
Step 4: Laplacian noise is added at the root level of the tree built from the normalized dataset, and the decision tree is modified.

8. EXPERIMENTAL RESULTS

8.1 ROOT LEVEL ANONYMIZATION RESULTS

BANK DATASET: Original tree vs Root Anonymized tree

ORIGINAL DECISION TREE:
children-> region-> age-> sex-> FEMALE=NO mortgage-> YES=YES NO=NO

ROOT MODIFIED DECISION TREE:
income-> children-> mortgage-> YES=NO region-> INNER_CITY=NO TOWN=NO current_act->

Original tree: Correctly Classified Instances 145 (72.5%); Incorrectly Classified Instances 55 (27.5%)
=== Confusion Matrix (original YES NO <-- classified as) ===
64  24 | YES predicted
31  81 | NO predicted

Root modified tree: Correctly Classified Instances 149 (74.5%); Incorrectly Classified Instances 51 (25.5%)
=== Confusion Matrix (original YES NO <-- classified as) ===
68  24 | YES predicted
27  81 | NO predicted

ADULT DATASET: Original tree vs Root Anonymized tree

ORIGINAL DECISION TREE:
relationship-> education-> cap_gain-> 3= >50K occupation-> Transport-moving= <=50K Handlers-cleaners= <=50K Other-service= <=50K Exec-managerial= >50K marital_status->

ROOT MODIFIED DECISION TREE:
marital_status-> occupation-> Transport-moving= <=50K Protective-serv= <=50K hours_per_week-> min= <=50K medium= >50K Exec-managerial= >50K Farming-fishing= >50K hours_per_week->

Original tree: Correctly classified instances: 12680 (84.20%); Incorrectly classified instances: 2380 (15.80%)
=== Confusion Matrix (original >50k <=50k <-- classified as) ===
2292    972  | >50k predicted
1408  10388  | <=50k predicted

Root modified tree: Correctly classified instances: 12659 (84.05%); Incorrectly classified instances: 2401 (15.94%)
=== Confusion Matrix (original >50k <=50k <-- classified as) ===
2280    981  | >50k predicted
1420  10379  | <=50k predicted

8.2 CLASS LEVEL ANONYMIZATION RESULTS

BANK DATASET: Original tree vs Class Anonymized tree

ORIGINAL DECISION TREE:
children-> region-> age-> sex-> FEMALE=NO mortgage-> YES=YES NO=NO oldies=NO teen=NO

CLASS LEVEL ANONYMIZED DECISION TREE:
age-> mortgage-> current_act-> save_act-> car-> children-> married-> income-> region-> sex->

Original tree: Correctly Classified Instances 145 (72.5%); Incorrectly Classified Instances 55 (27.5%)
=== Confusion Matrix (original YES NO <-- classified as) ===
64  24 | YES predicted
31  81 | NO predicted

Class anonymized tree: Correctly Classified Instances 153 (76.5%); Incorrectly Classified Instances 47 (23.5%)
=== Confusion Matrix (original YES NO <-- classified as) ===
66  18 | YES predicted
29  87 | NO predicted

ADULT DATASET: Original tree vs Class Anonymised tree

ORIGINAL DECISION TREE:
relationship-> education-> cap_gain-> 3= >50K occupation-> Transport-moving= <=50K Handlers-cleaners= <=50K Other-service= <=50K Exec-managerial= >50K marital_status-> Widowed= >50K Divorced= <=50K Never-married= >50K .........

CLASS LEVEL ANONYMIZED DECISION TREE:
occupation-> native_country-> Haiti= <=50K hours_per_week-> min= <=50K max= >50K cap_loss-> 2nd= <=50K cap_gain-> 2= >50K sex-> race-> relationship-> .........
Original tree: Correctly classified instances: 12680 (84.20%); Incorrectly classified instances: 2380 (15.80%)
=== Confusion Matrix (original >50k <=50k <-- classified as) ===
2292    972  | >50k predicted
1408  10388  | <=50k predicted

Class anonymized tree: Correctly classified instances: 12686 (84.23%); Incorrectly classified instances: 2374 (15.76%)
=== Confusion Matrix (original >50k <=50k <-- classified as) ===
2282    956  | >50k predicted
1418  10404  | <=50k predicted

8.3 SENSITIVE ATTRIBUTES BASED ANONYMIZATION RESULTS

BANK DATASET: Original tree vs Anonymised tree

ORIGINAL DECISION TREE:
children-> region-> age-> sex-> FEMALE=NO mortgage-> YES=YES NO=NO

SENSITIVE ATTRIBUTE BASED ANONYMIZED ID3:
region-> children-> 3=YES income-> car-> married-> mortgage-> current_act->

Original tree: Correctly Classified Instances 145 (72.5%); Incorrectly Classified Instances 55 (27.5%)
=== Confusion Matrix (original YES NO <-- classified as) ===
64  24 | YES predicted
31  81 | NO predicted

Anonymized tree: Correctly Classified Instances 114 (57%); Incorrectly Classified Instances 86 (43%)
=== Confusion Matrix (original YES NO <-- classified as) ===
53  44 | YES predicted
42  61 | NO predicted

ADULT DATASET: Original tree vs Anonymized tree

ORIGINAL DECISION TREE:
relationship-> education-> cap_gain-> 3= >50K occupation-> Transport-moving= <=50K Handlers-cleaners= <=50K Other-service= <=50K Exec-managerial= >50K marital_status-> Widowed= >50K

SENSITIVE ATTRIBUTE BASED ANONYMIZED ID3:
education-> relationship-> cap_gain-> 3= >50K occupation-> Transport-moving= <=50K Handlers-cleaners= <=50K Other-service= <=50K Exec-managerial= >50K marital_status-> Widowed= >50K

Original tree: Correctly classified instances: 12680 (84.20%); Incorrectly classified instances: 2380 (15.80%)
=== Confusion Matrix (original >50k <=50k <-- classified as) ===
2292    972  | >50k predicted
1408  10388  | <=50k predicted

Anonymized tree: Correctly classified instances: 9425 (62.58%); Incorrectly classified instances: 5635 (37.41%)
=== Confusion Matrix (original >50k <=50k <-- classified as) ===
2603   4538  | >50k predicted
1097   6822  | <=50k predicted

8.4 DOUBLE LAYERED PRIVACY RESULTS

Here we consider only the real dataset for our experiment, with a total of 600 records, because of the significant overhead in time complexity when running the large synthetic dataset of 48842 records.

BANK DATASET: Original tree vs Double Layer Privacy tree

ORIGINAL DECISION TREE:
children-> region-> age-> sex-> FEMALE=NO mortgage-> YES=YES NO=NO oldies=NO teen=NO married-> YES=NO

DOUBLE LAYER PRIVACY IMPLEMENTED DECISION TREE:
married-> children-> age-> income-> car-> current_act-> mortgage-> region-> second=YES first=NO NO=NO region->

Original tree: Correctly Classified Instances 145 (72.5%); Incorrectly Classified Instances 55 (27.5%)
=== Confusion Matrix (original YES NO <-- classified as) ===
64  24 | YES predicted
31  81 | NO predicted

Double layer privacy tree: Correctly Classified Instances 128 (64%); Incorrectly Classified Instances 72 (36%)
=== Confusion Matrix (original YES NO <-- classified as) ===
60  37 | YES predicted
35  68 | NO predicted

9. RESULTS EVALUATION

9.1 ACCURACY EVALUATION

The experiments are performed with the Adult and Bank datasets from the UCI repository. From these experiments, better privacy is achieved in our model compared to the previously existing privacy-preserving techniques on decision trees. One of the most widely used evaluation techniques in machine learning is the confusion matrix; the table below, derived from the confusion matrices, shows the accuracy values for both datasets. In the first module, the root node and class node modification techniques achieved a better level of accuracy, around 75% for the bank dataset and 84% for the adult dataset. In the second module, the data anonymization technique achieves better privacy and reduced accuracy compared to the first method: around 57% for the bank dataset and 62% for the adult dataset. As our final implementation of privacy-preserving ID3, we perform double layer privacy anonymization, where we try to maintain a balance between the privacy and accuracy parameters.
The accuracy here is 64% for the bank dataset.

Accuracy evaluation for both datasets:

BANK DATASET
IMPLEMENTATION                        ACCURACY  PRECISION  RECALL/SENSITIVITY
ORIGINAL TREE                         0.725     0.673      0.727
ROOT MODIFIED ID3                     0.745     0.715      0.739
CLASS MODIFIED ID3                    0.765     0.695      0.786
SENSITIVE ATTRIBUTES ANONYMIZED ID3   0.570     0.558      0.546
DOUBLE LAYER ANONYMIZED ID3           0.640     0.631      0.619

ADULT DATASET
IMPLEMENTATION                        ACCURACY  PRECISION  RECALL/SENSITIVITY
ORIGINAL TREE                         0.842     0.619      0.702
ROOT MODIFIED ID3                     0.840     0.616      0.699
CLASS MODIFIED ID3                    0.842     0.616      0.704
SENSITIVE ATTRIBUTES ANONYMIZED ID3   0.625     0.703      0.365

9.2 PRIVACY EVALUATION

Privacy means that anything that can be learnt about a respondent from a statistical database should be learnable without access to the database; the risk to privacy should not substantially increase as a result of participating in a statistical database. In the above privacy-preserving techniques, calibrated noise is added carefully, with the magnitude of the noise chosen to mask the influence of any particular record on the outcome. To evaluate the privacy of the above methods, a simple aggregate query is chosen. Based on the query, we compare the deviation of the output pattern predicted by the original decision tree from that predicted by the perturbed decision tree for each of the above privacy techniques. For the bank dataset, we have chosen the query: age=teen, income=medium. Similarly, for the adult dataset, privacy is measured using the query: relationship=Husband, education=HS-Grad, occupation=Exec-Managerial, race=Black. Since privacy is a measure of confidentiality, query-based privacy evaluation is done here.

Privacy evaluation for both datasets:

BANK DATASET
IMPLEMENTATION                        PRIVACY
ROOT NODE MODIFIED ID3                0.285
CLASS NODE MODIFIED ID3               0.285
SENSITIVE ATTRIBUTES ANONYMIZED ID3   0.285
DOUBLE LAYER ANONYMIZED ID3           0.571

ADULT DATASET
IMPLEMENTATION                        PRIVACY
ROOT NODE MODIFIED ID3                0.5
CLASS NODE MODIFIED ID3               0.25
SENSITIVE ATTRIBUTES ANONYMIZED ID3   0.25

10.
CONCLUSION

A privacy layer is added to the existing classification algorithm (the ID3 decision tree) either by adding noise to the attributes or by perturbing the input data, such that the original tree and the modified tree release almost the same required pattern while protecting against the leakage of an individual's record. From the above experiments, it is inferred that adding noise at the algorithm level (root and class) achieves better accuracy but fails to preserve the individual's identity, while adding noise at the data level achieves better privacy than the previous implementation, with a fall in accuracy. Finally, with the implementation of double layered privacy in ID3, with the magnitude of the noise bounded by the sensitivity, an optimum level of privacy and accuracy is maintained. Thus the privacy of every individual contributing to the statistical database is preserved.

11. FUTURE ENHANCEMENTS

Nowadays, organizations are accumulating voluminous and growing amounts of data in various formats, and these data keep multiplying day by day. Big Data plays a major role in any organization, where the goal is to maintain the privacy of every individual's information present in the dataset. Hence we are working to extend the proposed single and dual layer privacy techniques to big data on HADOOP using the MapReduce framework, for the existing ID3 decision tree algorithm and also for other decision tree algorithms such as C4.5, an extension of ID3, and random forests. The MapReduce framework is used to parallelize processing over the input dataset, reducing the time complexity that arises when millions of records are processed on a single node. Thus better efficiency, privacy and reduced time complexity are achieved with our work.

REFERENCES

[1] Arik Friedman and Assaf Schuster (2010) 'Data Mining with Differential Privacy', Israel Institute of Technology, Haifa 32000, Israel, KDD 2010.
[2] Cynthia Dwork (2008) 'Differential Privacy: A Survey of Results', In TAMC, pages 1-19, 2008.
[3] Cynthia Dwork, F. McSherry, K. Nissim and A. Smith (2006) 'Calibrating Noise to Sensitivity in Private Data Analysis', In TCC, pages 265-284, 2006.
[4] J. R. Quinlan (1986) 'Induction of Decision Trees', Machine Learning, 1(1):81-106, 1986.
[5] Md. Zahidul Islam and Ljiljana Brankovic (2004) 'A Framework for Privacy Preserving Classification in Data Mining', The University of Newcastle, Callaghan, NSW 2308, Australia, ACSW 2004.
[6] Mohammad Ali Kadampur and D.V.L.N. Somayajulu (2010) 'A Noise Addition Scheme in Decision Tree for Privacy Preserving Data Mining', Journal of Computing, Volume 2, Issue 1, January 2010, ISSN 2151-9617.
[7] M. R. Pawar and Mampi Bhowmik (2013) 'Privacy Preserving Decision Tree Learning Using Unrealized Data Sets', IJREAS, Volume 3, Issue 3 (March 2013), ISSN 2249-3905.
[8] Wei Peng, Juhua Chen and Haiping Zhou (2010) 'An Implementation of ID3 Decision Tree Learning Algorithm', Project of Comp 9417: Machine Learning, University of New South Wales, Sydney, NSW 2032, Australia.
[9] Rakesh Agrawal and Ramakrishnan Srikant, 'Privacy Preserving Data Mining', IBM Almaden Research Center.
[10] Yehuda Lindell and Benny Pinkas, 'Privacy Preserving Data Mining'.