Tools for Privacy Preserving Distributed Data Mining
By Michael Holmes

Why Private Data Mining
❖ The CDC may want to use data mining techniques to identify trends in disease outbreaks.
❖ Insurance companies have useful data but can't disclose it because of privacy concerns.
❖ Is there a way to obtain this data without revealing the identity of the patients?

Private Data Mining Techniques
❖ Secure Sum
❖ Secure Set Union
❖ Secure Size of Set Intersection
❖ Scalar Product

Private Data Mining Toolkit
❖ Association Rules in horizontally partitioned data
❖ Association Rules in vertically partitioned data
❖ EM Clustering

Secure Sum
❖ Securely compute the sum of values held in individual databases.
❖ Site 1 randomly generates a number R.
❖ Site 1 adds R to every one of its N values and sends the masked values to Site 2.
❖ Site 2 adds each of its own values to the values sent from Site 1 and returns a single number to Site 1.
❖ Site 1 subtracts R N times (once per masked value) to recover the correct sum.

Secure Set Union

Secure Size of Set Intersection
❖ Requires commutative encryption.
❖ Every party encrypts its data and then sends it to another party.
❖ The next party encrypts the already encrypted data with its own key.
❖ Once every party has encrypted every other party's data, items held in common appear as identical ciphertexts, and only these duplicated ciphertexts are shared.
❖ Count the duplicates and you know the size of the intersection.

Scalar Product
❖ Goal: compute the sum of xi * yi, where one database holds the values xi and the other holds the values yi.
❖ Use linear combinations of random numbers to disguise the elements, then remove the random terms algebraically once you have the result.
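
The two-site Secure Sum steps listed earlier can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the public modulus `MODULUS`, and the use of modular arithmetic (so the masked values wrap around instead of revealing their magnitude) are assumptions added for the example. It also assumes non-negative integer values whose true sum is smaller than the modulus.

```python
import secrets

# Assumed public bound, agreed on by both sites and larger than any true sum.
MODULUS = 2**64


def site1_mask(values, modulus=MODULUS):
    """Site 1: draw a random R and add it to every local value."""
    r = secrets.randbelow(modulus)
    masked = [(v + r) % modulus for v in values]
    return r, masked


def site2_accumulate(masked_from_site1, own_values, modulus=MODULUS):
    """Site 2: add its own values to the masked values, return one number."""
    return (sum(masked_from_site1) + sum(own_values)) % modulus


def site1_unmask(total, r, n, modulus=MODULUS):
    """Site 1: remove R once per masked value, i.e. N times in total."""
    return (total - n * r) % modulus


def secure_sum(site1_values, site2_values, modulus=MODULUS):
    """Run the three steps in order and return the combined sum."""
    r, masked = site1_mask(site1_values, modulus)
    total = site2_accumulate(masked, site2_values, modulus)
    return site1_unmask(total, r, len(site1_values), modulus)


if __name__ == "__main__":
    # Site 1 holds 3, 5, 7 and Site 2 holds 10, 20.
    print(secure_sum([3, 5, 7], [10, 20]))  # prints 45
```

Note that Site 2 sees every Site 1 value shifted by the same R, so it can still learn the differences between Site 1's values. Masking a single local total per site, rather than each value, is one way to avoid that.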

Association Rules in Horizontally Partitioned Data
❖ Candidate Set Generation
❖ Local Pruning
❖ Itemset Exchange (the Secure Set Union step happens here)
❖ Support Count Exchange

Association Rules in Vertically Partitioned Data
❖ Uses the scalar product to determine whether the count of an itemset is greater than a threshold.
❖ If the count is above the threshold, you've determined that the database is worth querying.
❖ Can also use Secure Size of Set Intersection to see how much the sites have in common.
❖ Useful with algorithms such as Apriori.

EM Clustering
❖ Uses secure sum to get a global number associated with all sites involved.
❖ Once the global sum is computed, it can be used in the expectation-maximization method to generate statistical models.

Things to Note
❖ These algorithms are not fully private; some information is learned in the process.
❖ For example, in the set intersection, sites can potentially learn the size of each other's databases.
❖ Make sure to pick the appropriate algorithms for what you need to accomplish.
❖ Watch out for intermediate information being leaked!

Thank you