Privacy-preserving Kruskal–Wallis test

Suxin Guo a,∗, Sheng Zhong b, Aidong Zhang a
a Department of Computer Science and Engineering, SUNY at Buffalo, United States
b State Key Laboratory for Novel Software Technology, Nanjing University, China

Computer Methods and Programs in Biomedicine 112 (2013) 135–145
journal homepage: www.intl.elsevierhealth.com/journals/cmpb

Article history: Received 4 January 2013; Received in revised form 17 May 2013; Accepted 28 May 2013

Keywords: Data security; Statistical test; Kruskal–Wallis test

Abstract: Statistical tests are powerful tools for data analysis. The Kruskal–Wallis test is a non-parametric statistical test that evaluates whether two or more samples are drawn from the same distribution. It is commonly used in various areas, but sometimes its use is impeded by privacy issues raised in fields such as biomedical research and clinical data analysis because of the confidential information contained in the data. In this work, we give a privacy-preserving solution for the Kruskal–Wallis test which enables two or more parties to coordinately perform the test on the union of their data without compromising their data privacy. To the best of our knowledge, this is the first work that solves the privacy issues in the use of the Kruskal–Wallis test on distributed data.

© 2013 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Statistical hypothesis tests are widely used for data analysis. Some popular statistical tests include the t-test [1], ANOVA [2], the Kruskal–Wallis test [3], and the Wilcoxon rank sum test [4]. Although these four are different tests, they serve the same goal: to find out whether the samples come from the same population. The t-test and ANOVA are parametric tests and assume that the data are normally distributed. The non-parametric equivalents of these two tests are the Wilcoxon rank sum test, also known as the Mann–Whitney U test [5], and the Kruskal–Wallis test, respectively; they do not assume the data to be normally distributed. The t-test can only handle the comparison of two samples, and ANOVA extends it to multiple samples. Similarly, the Kruskal–Wallis test is a generalization of the Wilcoxon rank sum test from two samples to multiple samples. As stated above, the four tests do similar things under different assumptions. The non-parametric tests perform better when the data is not normally distributed, and are especially suitable when the sample size is small (<25 per group) [6].

Although the Kruskal–Wallis test is a helpful tool in many areas, sometimes its use is impeded by privacy concerns due to the confidential information in the data, especially in clinical and biomedical research. For example, suppose some hospitals conduct a study and test the INR (International Normalized Ratio) values of their patients, so that each hospital holds a set of INR values. The hospitals want to perform the Kruskal–Wallis test to check whether their values follow the same trend. In this case, the set of INR values of each hospital is treated as a sample.
To conduct the Kruskal–Wallis test, all samples must be known, which means the hospitals would have to share their data with each other. The problem is that it might be improper for the hospitals to share their samples, because the data contains the private information of patients. Currently there is no method that enables the Kruskal–Wallis test to be conducted on such distributed data under privacy constraints.

∗ Corresponding author. Tel.: +1 7165647706.
http://dx.doi.org/10.1016/j.cmpb.2013.05.023

To solve this problem, we propose a privacy-preserving algorithm that allows the Kruskal–Wallis test to be applied to samples distributed across different parties without revealing each party's private information to the others. Due to the similarity among non-parametric tests, our method can also help the design of privacy-preserving solutions for other non-parametric tests. For example, the Wilcoxon rank sum test and the Kruskal–Wallis test are used in the two-sample and the two-or-more-sample situations, respectively, and are essentially the same in the two-sample case [3]. So our algorithm also solves the privacy issue of the Wilcoxon rank sum test to some extent.

The rest of this paper is organized as follows: In Section 2, we present the related work. Section 3 provides the technical preliminaries, including background knowledge about the Kruskal–Wallis test and the cryptographic tools we need. We propose the basic algorithm and the complete algorithm in Sections 4 and 5, respectively. The basic algorithm shows the procedure for conducting the Kruskal–Wallis test securely when there are no ties in the data. The complete algorithm builds on the basic algorithm and takes the existence of ties into consideration.
In Section 6, we present the experimental results, and finally Section 7 concludes the paper.

2. Related work

In recent years, due to increasing awareness of privacy problems, many data analysis methods have been enhanced to be privacy-preserving, including many popular data mining and machine learning algorithms. Most of these approaches fall into two categories. Approaches in the first category protect data privacy with data perturbation techniques, such as randomization [7,8], rotation [9] and resampling [10]. Since the original data is changed, these approaches usually lose some accuracy. Methods in the second category are generally based on Secure Multiparty Computation (SMC) and apply cryptographic techniques to protect data during the computations [11,12]. Such methods usually cause no accuracy loss but have a higher computational cost. In our case, since the Kruskal–Wallis test is often used on small data sets, we choose the second approach, which is to protect privacy with cryptographic tools. It enables us to achieve higher accuracy at an affordable computational cost.

In the cryptographic category, some SMC tools are very commonly used, such as secure sum [13], secure comparison [14,15], secure division [16], secure scalar product [13,16,17], secure matrix multiplication [18–20], and secure set operations [13]. Many data mining and machine learning algorithms have been extended with privacy solutions, such as decision tree classification [11,21], k-means clustering [22,23], and gradient descent methods [24], but only a few works have studied the privacy issues in statistical tests. [25] gives a privacy-preserving algorithm to compare survival curves with the logrank test. [26] presents a privacy-preserving solution to perform the permutation test securely on distributed data. No existing work studies the privacy issues of the Kruskal–Wallis test on distributed data; to the best of our knowledge, our work is the first.
3. Technical preliminaries

3.1. The Kruskal–Wallis test

We first review the Kruskal–Wallis test in this section. The test, as proposed by Kruskal and Wallis [3], evaluates whether two or more samples are from the same distribution. The null hypothesis is that all the samples come from the same distribution.

Suppose we have k samples, each containing a set of values. To perform the Kruskal–Wallis test, we first rank all the values together without considering which sample each value belongs to, then compute the sum of the ranks of the values within every sample, so that each sample has its sum of ranks. If there are no ties among the values, the test statistic is:

H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1),    (1)

where N is the total number of values in all samples, n_i is the number of values contained in the ith sample, and R_i is the sum of ranks in the ith sample.

After calculating H, we compare it to the value \chi^2_{\alpha:k-1}, which can be found in a table of the chi-squared probability distribution with k − 1 degrees of freedom and desired significance \alpha. If H ≥ \chi^2_{\alpha:k-1}, the null hypothesis is rejected; otherwise, it is accepted.

If there are ties among the values, the calculation of the test statistic changes slightly. First, when ranking all the values, the ranks of each group of tied values are set to the average of the ranks that these tied values would have received without ties. For example, suppose we have values {1, 3, 3, 5} with one tie of two "3"s. Without considering the tie, their ranks would be 1, 2, 3, 4, respectively. After we change the ranks of the tied values to their average, the ranks become 1, 2.5, 2.5, 4. Then we can compute H with these new ranks. Besides the adjustment of ranks, we also need to divide H by:

C = 1 - \frac{\sum_{i=1}^{g} (t_i^3 - t_i)}{N^3 - N},    (2)

where g is the number of groups of tied values, and t_i is the number of tied values in the ith group.
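As a point of reference, the ranking step and the statistics of Eqs. (1) and (2) can be sketched in the clear. This is our own illustrative Python, not the paper's secure protocol; the function returns the tie-corrected statistic H/C, which reduces to H when there are no ties:

```python
def kruskal_wallis(*samples):
    """Tie-corrected Kruskal-Wallis statistic, computed in cleartext."""
    pooled = sorted(x for s in samples for x in s)
    N = len(pooled)
    # Mid-rank of each distinct value: the average of the ranks its
    # tied group would have received without ties.
    rank, i = {}, 0
    while i < N:
        j = i
        while j < N and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    # Eq. (1): H from the per-sample rank sums R_i.
    H = 12 / (N * (N + 1)) * sum(
        sum(rank[x] for x in s) ** 2 / len(s) for s in samples
    ) - 3 * (N + 1)
    # Eq. (2): correction factor C from the tie-group sizes t_i.
    sizes = {}
    for x in pooled:
        sizes[x] = sizes.get(x, 0) + 1
    C = 1 - sum(t ** 3 - t for t in sizes.values()) / (N ** 3 - N)
    return H / C

# Two samples without ties: C = 1, and H = 3/7 (about 0.43).
print(kruskal_wallis([1, 3, 5], [2, 4, 6]))
```

In the rest of the paper, the difficulty is precisely that no single party may see `pooled`, so the ranking must be done under encryption.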
For the above example {1, 3, 3, 5}, we have only one group of two tied values, so g = 1 and t_1 = 2.

To sum up, in the case where ties exist, we need to adjust the ranks of the tied values, and the test statistic is:

H_c = \frac{H}{C}.    (3)

In fact, Eq. (3) is the general form that holds whether or not there are ties. If there are no ties, C = 1 and thus H_c = H.

3.2. Privacy protection of the Kruskal–Wallis test

As in the hospital example mentioned in the introduction, we assume that each party has a sample and the parties wish to conduct a Kruskal–Wallis test jointly to find out whether their samples follow the same distribution, without revealing their data to the others. Our solution is based on the semi-honest model, which is widely used in the cryptographic category of privacy-preserving methods [27,11,28,29,24,13,16,30]. In this model, all parties strictly follow the protocol, but may attempt to derive the private information of other parties from the intermediate results they obtain during the execution of the protocol.

3.3. Cryptographic tools

3.3.1. Homomorphic cryptographic scheme

An additively homomorphic asymmetric cryptographic system is used to encrypt and decrypt the data in our work. A cryptographic scheme that encrypts an integer x as E(x) is additively homomorphic if there are operators ⊕ and ⊗ such that for any two integers x_1, x_2 and a constant a, we have

E(x_1 + x_2) = E(x_1) ⊕ E(x_2),
E(a × x_1) = a ⊗ E(x_1).

This means that with an additively homomorphic cryptographic system, we can compute the encrypted sum of integers directly from their encryptions; we do not need to decrypt the integers to compute the sum. In an asymmetric cryptographic system, we have a pair of keys: a public key for encryption and a private key for decryption.

3.3.2. ElGamal cryptographic system

There are several additively homomorphic cryptographic schemes [30,31].
In this work, we apply a variant of the ElGamal scheme [32], which is semantically secure under the Diffie–Hellman assumption [33]. The ElGamal cryptographic system is a multiplicatively homomorphic asymmetric cryptographic system. With this system, the encryption of a cleartext m is the pair:

E(m) = (m × y^r, g^r),

where g is a generator, x is the private key, y = g^x is the public key, and r is a random integer. We call the first part of the pair c_1 and the second part c_2: c_1 = m × y^r and c_2 = g^r. To decrypt E(m), we compute s = c_2^x = g^{rx} = g^{xr} = y^r. Then c_1 × s^{−1} = m × y^r × y^{−r} yields the cleartext m.

In the variant of the ElGamal scheme we use, the cleartext m is encrypted as:

E(m) = (g^m × y^r, g^r).

The only difference between the original ElGamal scheme and this variant is that m in the first part is replaced by g^m. With this change, the variant is an additively homomorphic cryptosystem such that:

E(x_1 + x_2) = E(x_1) × E(x_2),
E(a × x_1) = E(x_1)^a.

To decrypt E(m), we follow the same procedure as in the original ElGamal algorithm, but after the above calculations we obtain g^m instead of m. To get m from g^m, we perform an exhaustive search, that is, we try every possible m and look for the one that matches g^m. Note that this exhaustive search is limited to a small range of possible plaintexts, so the time needed is reasonable.

In our work, the private key is shared by all the parties and no party knows the complete private key. The parties need to coordinate with each other to do the decryptions, and the ciphertexts can be exposed to any party, because no party can decrypt them without the help of the others. The private key is shared in this way: Suppose there are two parties, A and B. A holds one part of the private key, x_1, and B holds the other part, x_2, such that x_1 + x_2 = x, where x is the complete private key. In the decryption, we need to compute s = c_2^x = c_2^{x_1+x_2} = c_2^{x_1} × c_2^{x_2}.
Party A computes s_1 = c_2^{x_1} and party B computes s_2 = c_2^{x_2}, so that s = s_1 × s_2. We need to compute c_1 × s^{−1} = c_1 × (s_1 × s_2)^{−1} = c_1 × s_1^{−1} × s_2^{−1}. Party A computes c_1 × s_1^{−1} and sends it to party B. Then party B computes c_1 × s_1^{−1} × s_2^{−1} = c_1 × s^{−1} = g^m and sends it to A. In this way both parties obtain the decrypted result. Since party B performs its decryption step later, it obtains the final result first; if B does not send the result to A, the decrypted result is known only to party B. The order of the parties can be changed, so if we need the result to be known to only one party, that party should perform its decryption step last.

3.3.3. Secure comparison

We apply the secure comparison protocol proposed in [15] to compare two values from different parties securely. The inputs of this algorithm are two integers a and b held by different parties. The output is an encryption of 1 if a > b, or an encryption of 0 otherwise.

The basic idea of the secure comparison algorithm is as follows. Let the binary representations of a and b be a_l, ..., a_1 and b_l, ..., b_1, where a_1 and b_1 are the least significant bits. If a > b, there is a "pivot bit" i such that b_i − a_i + 1 = 0 and a_j XOR b_j = a_j + b_j − 2 a_j b_j = 0 for every i < j ≤ l. The method applies homomorphic encryptions to check whether such a pivot bit exists.

This method can find out whether a > b, but it cannot determine whether a ≥ b directly. So when we want to know whether a ≥ b, we compare 2a + 1 and 2b instead of a and b. If 2a + 1 > 2b, then since both a and b are integers, we can derive that a ≥ b.

4. The basic algorithm of privacy-preserving Kruskal–Wallis test

In this part, we present the basic algorithm for computing the H statistic of the Kruskal–Wallis test securely without considering the existence of ties. The complete algorithm, which also deals with ties, will be discussed in the next section.
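For concreteness, the additively homomorphic ElGamal variant of Section 3.3.2, including the two-party additive sharing of the private key, can be sketched as follows. This is a toy Python sketch with deliberately tiny, insecure parameters; the group p, q, g and all helper names are our own choices, and a real deployment needs cryptographically sized primes:

```python
import random

# Toy Schnorr group: q prime, p = 2q + 1 prime; g generates the order-q subgroup.
p, q, g = 2579, 1289, 4

x1 = random.randrange(1, q)          # party A's share of the private key
x2 = random.randrange(1, q)          # party B's share of the private key
y = pow(g, x1 + x2, p)               # public key y = g^(x1 + x2)

def enc(m):
    """E(m) = (g^m * y^r, g^r) -- the exponential ElGamal variant."""
    r = random.randrange(1, q)
    return (pow(g, m, p) * pow(y, r, p) % p, pow(g, r, p))

def add(c, d):
    """Componentwise product of ciphertexts encrypts the sum of plaintexts."""
    return (c[0] * d[0] % p, c[1] * d[1] % p)

def scale(a, c):
    """Raising a ciphertext to the power a encrypts a times the plaintext."""
    return (pow(c[0], a, p), pow(c[1], a, p))

def shared_dec(c, max_m=10_000):
    """Two-party decryption: each key share contributes c2^xi; then solve g^m."""
    c1, c2 = c
    s = pow(c2, x1, p) * pow(c2, x2, p) % p    # s = c2^(x1 + x2)
    gm = c1 * pow(s, p - 2, p) % p             # c1 * s^(-1) = g^m (p is prime)
    for m in range(max_m + 1):                 # small exhaustive search for m
        if pow(g, m, p) == gm:
            return m

print(shared_dec(add(enc(3), enc(4))))   # decrypts to 7
print(shared_dec(scale(5, enc(6))))      # decrypts to 30
```

The exhaustive search at the end is exactly the limited plaintext search described above; it is affordable here because all plaintexts in the protocol are small counts and rank sums.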
To make the presentation clear, we first give the algorithm for performing the test between two parties, then extend it to the multiparty case.

Suppose there are two parties, A and B. Party A has sample S_1, which contains n_1 values, and party B has sample S_2, which contains n_2 values. The total number of values is N = n_1 + n_2. The basic structure of the algorithm is as follows:

1. For each value in each party, count how many values in its own party (including itself) are smaller than or equal to it. Encrypt these counts.
2. For each value in each party, compare it with all the values in the other party using the secure comparison algorithm. Then, by adding the comparison results up, count how many values in the other party are smaller than or equal to it. Since the results of the secure comparison algorithm are in ciphertext, these counts are also in ciphertext.
3. For each value in each party, add the above two counts securely to get the total number of values in both parties that are smaller than or equal to it, which is its rank in ciphertext. Then, for each party, add all the encrypted ranks of its values; this is the encrypted rank sum of that party. Call the rank sums of the two parties R_1 and R_2, respectively.
4. With the encrypted rank sums of both parties, compute the H statistic with Eq. (1). Here a problem arises: to calculate H, we need the squared rank sums of both parties, R_1^2 and R_2^2. Since we only have the encrypted rank sums E(R_1) and E(R_2), we have to compute E(R_1^2) and E(R_2^2) from E(R_1) and E(R_2). This is not easy, because we are using an additively homomorphic system, which does not support the direct multiplication of two encrypted integers. So we need an algorithm to solve this.

Let us explain each step in detail.

4.1.
Secure computation of the rank sums

To compute the rank of a value, we just need to count how many values in both parties are smaller than or equal to it. For example, with values {5, 6, 7}, the rank of value 5 is 1, because only 1 value is smaller than or equal to it, namely itself (5 ≤ 5). The rank of 6 is 2, since there are 2 values smaller than or equal to it (5 ≤ 6 and 6 ≤ 6). Similarly, the rank of 7 is 3.

For each value in each party, counting how many values are smaller than or equal to it within its own party is simple: we compare it with all values in its party, which can easily be done locally. But counting the number of smaller or equal values in the other party is not as straightforward. We also need to compare the value with all values in the other party, and these comparisons must be conducted with the secure comparison algorithm.

Suppose the values in party A are a_1, a_2, ..., a_{n_1}, and the values in party B are b_1, b_2, ..., b_{n_2}. For each value a_i (i = 1, 2, ..., n_1), we compare it with every value in party B using the secure comparison protocol. After these n_2 secure comparisons, we have n_2 results, each of which is an encryption of 0 or 1 (E(0) or E(1)). For each value b_j (j = 1, 2, ..., n_2), the comparison between a_i and b_j is E(1) if a_i ≥ b_j and E(0) otherwise. Since the results are in ciphertext, no party knows what they are. The sum of the n_2 results is the encrypted number of values in party B that are smaller than or equal to a_i. We call it E(R_B(a_i)).

The number of values in party A that are smaller than or equal to a_i can easily be computed; it is named R_A(a_i). We encrypt it and get E(R_A(a_i)). The sum of R_A(a_i) and R_B(a_i) is the rank of a_i, namely R(a_i). The encryption of this rank, E(R(a_i)), can be computed from E(R_A(a_i)) and E(R_B(a_i)) with the additively homomorphic system that we utilize. In this way, we can get the encryptions of the ranks of all values from both parties: E(R(a_1)), E(R(a_2)), ..., E(R(a_{n_1})), E(R(b_1)), E(R(b_2)), ..., E(R(b_{n_2})). Then E(R_1) and E(R_2), the encryptions of the rank sums of parties A and B, respectively, can be computed from them, because R_1 = R(a_1) + R(a_2) + ··· + R(a_{n_1}) and R_2 = R(b_1) + R(b_2) + ··· + R(b_{n_2}).

4.2. Secure computation of the squared rank sums

We need to compute E(R_1^2) and E(R_2^2) from E(R_1) and E(R_2). Since the additively homomorphic cryptosystem does not support the direct multiplication of two encrypted integers, we present an algorithm to solve this.

To compute E(ab) from E(a) and E(b), both of which are known to both parties, we first make one of the integers additively shared by the two parties. For example, we make a additively shared such that party A holds an integer a_A and party B holds an integer a_B with a_A + a_B = a. The shares a_A and a_B can be obtained from E(a) in this way: Party A randomly generates an integer a_A and computes E(a_A). Then E(a − a_A) = E(a_B) can be computed from E(a) and E(a_A) by party A. A sends it to party B, and the two parties coordinate with each other to decrypt E(a_B). During the decryption, we make sure that the decryption result a_B is known only to party B. This can be achieved with the cryptographic system that we use, as explained in Section 3.

After A gets a_A and B gets a_B, parties A and B can compute E(a_A × b) and E(a_B × b), respectively. This can be done with the additively homomorphic system from a_A, a_B and E(b), because a_A and a_B are both integers in plaintext. What we want is E(ab) = E((a_A + a_B) × b) = E(a_A × b + a_B × b). Since E(a_A × b) is held by party A and E(a_B × b) is held by party B, the two parties should exchange their values so that both of them can compute the final result E(ab). But exchanging the values directly may cause privacy loss.
For example, if party A gives E(a_A × b) to party B, then since E(a_A × b) = E(b)^{a_A} in the ElGamal variant we use, and E(b) is known to party B, party B can derive some information about a_A from E(a_A × b). So before the two parties calculate E(a_A × b) and E(a_B × b) and exchange these values, they rerandomize their copies of E(b). With the rerandomization, the random numbers "r" used in the encryptions are changed, so the encryptions differ from the original ones. To make the presentation clear, we denote the rerandomized E(b) as E′(b) in party A and E″(b) in party B. Then parties A and B calculate E′(a_A × b) = E′(b)^{a_A} and E″(a_B × b) = E″(b)^{a_B}, respectively, and exchange the values E′(a_A × b) and E″(a_B × b). Since the encryptions are changed, the parties cannot derive information from the value they get from each other. For example, although party B gets E′(a_A × b) from A, we have E′(a_A × b) = E′(b)^{a_A}, and party B does not know E′(b), because the rerandomization was done by party A. So B cannot derive a_A. After the exchange, party A has E(a_A × b) and E″(a_B × b), and party B has E′(a_A × b) and E(a_B × b). Each party can then compute E(ab) = E(a_A × b + a_B × b) by itself; the rerandomizations do not affect the calculation of the encrypted sums. In this way, both parties can get E(ab) from E(a) and E(b). Algorithm 1 shows the main procedure of this encrypted multiplication.

Algorithm 1. Encrypted multiplication of two integers

Input: Encryptions E(a) and E(b) of integers a and b, known to both parties;
Output:
The encryption of a × b, E(ab);

1: Party A generates a random integer a_A and computes E(a_A);
2: Party A computes E(a − a_A) and sends it to party B;
3: The two parties coordinately decrypt E(a − a_A), and only party B gets the result a − a_A = a_B;
4: Parties A and B rerandomize E(b) and get E′(b) and E″(b), respectively;
5: Parties A and B calculate E′(a_A × b) and E″(a_B × b), respectively, and exchange the two values;
6: Parties A and B compute E(ab) = E(a_A × b + a_B × b) by themselves;

4.3. Secure computation of H

With Algorithm 1 we can get E(R_1^2) and E(R_2^2) from E(R_1) and E(R_2). Because we assume there are two parties, the H statistic is calculated as:

H = \frac{12}{N(N+1)} \left( \frac{R_1^2}{n_1} + \frac{R_2^2}{n_2} \right) - 3(N+1),

where N, n_1 and n_2 are constants known to both parties. From E(R_1^2) and E(R_2^2), both parties can compute E(R_1^2 × n_2 + R_2^2 × n_1). They then coordinately decrypt it and obtain R_1^2 × n_2 + R_2^2 × n_1. The final result is calculated as:

H = \frac{12}{N(N+1) n_1 n_2} (R_1^2 × n_2 + R_2^2 × n_1) - 3(N+1).

The reason we compute R_1^2 × n_2 + R_2^2 × n_1 and then divide by n_1 n_2, instead of computing R_1^2/n_1 + R_2^2/n_2 directly, is that the cryptographic system we use only supports operations on non-negative integers. To avoid decimal fractions in the encryptions, we compute R_1^2 × n_2 + R_2^2 × n_1, and the division is applied after the decryption.

4.4. The summarized algorithm

The main steps of the algorithm are summarized in Algorithm 2.

Algorithm 2. The basic algorithm of privacy-preserving Kruskal–Wallis test

Input: Party A has sample S_1, which contains n_1 values, and party B has sample S_2, which contains n_2 values. The total number of values is N = n_1 + n_2;
Output:
The statistic H;

1: for each value a_i in party A do
2:   Calculate its encrypted rank E(R(a_i));
3: end for
4: for each value b_j in party B do
5:   Calculate its encrypted rank E(R(b_j));
6: end for
7: Compute the encrypted rank sum of each party, E(R_1) and E(R_2), where R_1 = \sum_{i=1}^{n_1} R(a_i) and R_2 = \sum_{j=1}^{n_2} R(b_j);
8: Calculate E(R_1^2) and E(R_2^2) from E(R_1) and E(R_2) with Algorithm 1;
9: Calculate E(R_1^2 × n_2 + R_2^2 × n_1) and decrypt it;
10: Compute H from R_1^2 × n_2 + R_2^2 × n_1;

4.5. Extension to multiparty

The extension of the algorithm from the two-party case to the multiparty case is straightforward. In the two-party case, to get the rank of a value, we count the number of values that are smaller than or equal to it in its own party and in the other party, using the secure comparison protocol for the other party. Similarly, in the multiparty case, we count the number of values that are smaller than or equal to it in its own party and in every other party, with the help of the secure comparison protocol. After the computation of the encrypted ranks of every value in every party, the encrypted rank sums are calculated, just as in the two-party case. Then the encrypted squared rank sums E(R_1^2), E(R_2^2), ..., E(R_k^2) can be computed with Algorithm 1. They are known to all the parties. Just as we compute E(R_1^2 × n_2 + R_2^2 × n_1) when there are two parties, for k parties we compute E(R_1^2 × n_2 n_3 ··· n_k + R_2^2 × n_1 n_3 ··· n_k + ··· + R_k^2 × n_1 n_2 ··· n_{k−1}). We decrypt it and divide the decrypted result by n_1 n_2 ··· n_k instead of n_1 n_2 as in the two-party case. Then the final result H is calculated.

5. The complete algorithm of privacy-preserving Kruskal–Wallis test

In this section we present the privacy-preserving Kruskal–Wallis test with ties taken into consideration.

5.1. Modifying the data to eliminate ties

Before we explain the complete algorithm, we give a simpler method to deal with the tied values.
This method modifies the values slightly to eliminate the ties and then applies the basic algorithm to the modified data. Since the data is modified a little, this method causes a slight accuracy loss.

To eliminate ties between parties, we do the following: If there are two parties, for every value in the first party, multiply it by 10 and then add 0; for every value in the second party, multiply it by 10 and then add 1. For example, suppose a_i belongs to the first party and b_j belongs to the second party. We set a_i = a_i × 10 + 0 and b_j = b_j × 10 + 1. In this way, the ties between the two parties are eliminated and the relative ranks of the other values are not affected. If there are more than two parties, the data is modified similarly, depending on the number of parties. For example, if there are ten parties, we still multiply every value in every party by 10, and we add zeros to the values in the first party, ones to the values in the second party, ..., and nines to the values in the tenth party. If there are 100 parties, we multiply every value by 100 and add zero through ninety-nine to the values of the first through 100th party, respectively.

To deal with ties within parties, we do not need to modify the data; we can break these ties when calculating the ranks. For example, suppose one party has three values, {1, 1, 1}. With our algorithm, the ranks are calculated by counting the number of smaller or equal values. For these three values, the numbers of smaller or equal values in their own party are 3, 3 and 3. We change them to 1, 2 and 3, respectively. This can easily be done, because every party has the information about the ties within it. After changing the local counts, the counts of smaller or equal values from other parties are added to get the rank.
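The between-party encoding above can be sketched as follows (an illustrative helper of our own; the paper describes the transformation only in prose):

```python
def eliminate_ties(sample, party_index, num_parties):
    """Scale every value and append the party's index as low digits, so that
    no two parties can share a value; order within a party is preserved."""
    # Base 10 for up to 10 parties, 100 for up to 100 parties, etc.
    base = 10 ** len(str(num_parties - 1))
    return [v * base + party_index for v in sample]

# Two parties: party 0 gets trailing digit 0, party 1 gets trailing digit 1.
print(eliminate_ties([1, 3, 3, 5], 0, 2))   # [10, 30, 30, 50]
print(eliminate_ties([1, 3, 3, 5], 1, 2))   # [11, 31, 31, 51]
```

Note that ties within a party (the two 3s above) remain after this encoding; they are broken locally by the party itself, as just described.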
The resulting ranks contain no ties, because both the ties within the local party and the ties between parties have been eliminated. After the modifications, we can apply the basic algorithm, which handles data without ties.

5.2. The complete algorithm

Here we present the complete algorithm, which works for data containing ties. As in the previous section, the algorithm is first proposed under the assumption that there are only two parties, and then extended to the multiparty case. As mentioned in Section 3, when there are ties in the data, the calculation of the statistic changes in two respects: the ranks of the tied values must be adjusted when computing H, and H must be divided by C. Both changes are discussed in detail below.

5.2.1. Adjustment of the ranks of tied values

The ranks of each group of tied values should be changed to the average of the ranks that these tied values would have received without ties. We use an example to show the basic idea of this adjustment. Suppose the values {1, 2, 3, 4, 4, 4, 4, 4} are distributed over two samples held by two parties: party A has sample S_1, which contains the values {1, 2, 4, 4}, and party B has sample S_2, which contains the values {3, 4, 4, 4}. Without considering the tie, the ranks of the values {1, 2, 3, 4, 4, 4, 4, 4} are 1, 2, 3, 4, 5, 6, 7, 8, respectively. The five "4"s are tied and their ranks are 4, 5, 6, 7, 8. The largest rank in this tie is 8 and the smallest rank is 4. The average of the ranks is 6, and it can be calculated by averaging the largest rank 8 and the smallest rank 4. This is because the ranks of the values in a tie form an arithmetic sequence, so the average of all values in the sequence equals the average of its smallest and largest values. After changing the ranks of the tied values to their average, the ranks become 1, 2, 3, 6, 6, 6, 6, 6.
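In the clear, the adjustment illustrated by this example amounts to replacing each value's count of smaller-or-equal values (the largest rank of its tie) by the tie's mid-rank, using the arithmetic-sequence observation above (an illustrative Python sketch of ours, not the secure protocol):

```python
def averaged_ranks(values):
    """Turn 'count of smaller-or-equal values' (the largest rank in a tie)
    into the averaged mid-rank, as in the example above."""
    out = []
    for v in values:
        l = sum(1 for w in values if w <= v)   # largest rank of v's tie
        t = sum(1 for w in values if w == v)   # number of values in v's tie
        s = l - t + 1                          # smallest rank of v's tie
        out.append((l + s) / 2)                # average of largest and smallest
    return out

print(averaged_ranks([1, 2, 3, 4, 4, 4, 4, 4]))
# [1.0, 2.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0]
```

The challenge addressed next is performing exactly these steps when the counts l and the tie sizes t exist only in encrypted or distributed form.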
In our algorithm, since we calculate the rank of each value by counting the values that are smaller than or equal to it, the computed ranks are 1, 2, 3, 8, 8, 8, 8, 8, because for each 4 there are 8 values smaller than or equal to it. So with our algorithm, the rank of each value in a group of tied values is actually the largest rank in the tie. We need to add some steps to our algorithm to change the ranks from 1, 2, 3, 8, 8, 8, 8, 8 to 1, 2, 3, 6, 6, 6, 6, 6.

The basic idea is this: since the rank of each value in a tie is the largest rank in the tie, we only need to get the smallest rank in the tie and take the average of the largest and smallest ranks. To get the smallest rank from the largest rank, we need to know the number of values in the tie. With the largest rank denoted l, the smallest rank denoted s, and the number of values in the tie denoted t, we have s = l − t + 1. In our example, the tie contains 5 values, with largest rank 8 and smallest rank 4, and indeed 8 − 5 + 1 = 4. So, to change the ranks from 1, 2, 3, 8, 8, 8, 8, 8 to 1, 2, 3, 6, 6, 6, 6, 6, we need to get the number of values in each tie, compute the smallest rank in each tie, and take the average of the largest and smallest ranks.

We treat every value as being in a tie and calculate the number of values in each value's tie. In our example, value 1 is in a tie that contains only 1 value, and so are values 2 and 3; each value 4 is in a tie that contains 5 values. So for the values {1, 2, 3, 4, 4, 4, 4, 4}, the numbers of values in each value's tie are 1, 1, 1, 5, 5, 5, 5, 5. Then for each value, we compute the smallest rank in its tie with s = l − t + 1. For value 1, the smallest rank is 1 − 1 + 1 = 1; for value 2, it is 2 − 1 + 1 = 2; for value 3, it is 3 − 1 + 1 = 3; and for each value 4, it is 8 − 5 + 1 = 4. So the smallest ranks for the eight values are 1, 2, 3, 4, 4, 4, 4, 4.
With the largest ranks 1, 2, 3, 8, 8, 8, 8, 8, we obtain the averaged ranks 1, 2, 3, 6, 6, 6, 6, 6. Note that for the untied values 1, 2 and 3, treating them as ties containing a single value does not change their ranks. The reason for this treatment is that, although we display all the values, ranks and tie sizes in cleartext here to make the idea easier to follow, in the real setting they are encrypted or distributed, and no party has complete information about them. Hence no party knows whether a value is in a tie or not. For example, party A holds one value 1, which is not tied within party A; but A does not know whether party B also holds the value 1, so A cannot tell whether value 1 is tied globally. All values are therefore assumed to be in a tie.

Having explained the basic idea of the rank adjustment, we now show the steps by which the two parties perform the adjustment securely. We follow the basic algorithm of Section 4 to obtain the rank of each value in each party. Here the "ranks" are the counts of smaller-or-equal values, which are the largest ranks in each tie. To count the values smaller than or equal to ai in party A, ai is compared with the values in both party A and party B. When comparing ai with the values in party A, we also count the number of values in party A that are equal to ai and name it TA(ai). As mentioned in Section 4, when comparing ai with every value in party B securely, each comparison result is an encryption of 0 or 1, such that if bj ≤ ai, the comparison result between bj and ai is E(1), and otherwise E(0). The sum of these results is the encrypted number of values in party B smaller than or equal to ai.

Table 1 – An example table.

          b1    ...    bj     ...    bn2
a1
...
ai                     E(1)
...
an1
We keep all the comparison results between every pair ai, bj in an n1 × n2 table, such that the element in row ai and column bj is the comparison result between ai and bj: E(1) if bj ≤ ai and E(0) otherwise. Table 1 shows an example of such an n1 × n2 table. Similarly, to count the values smaller than or equal to bj in party B, we compare bj with the values in both party A and party B. When comparing bj with the values in party B, we also count the number of values in party B equal to bj and name it TB(bj). When comparing bj with the values in party A securely, the comparison results are defined differently from the previous case: the result between bj and ai is E(1) if ai ≤ bj and E(0) otherwise. We again keep the comparison results in an n1 × n2 table. The two tables storing the comparison results are not the same: in the first table, the entry in row ai and column bj is E(1) if bj ≤ ai and E(0) otherwise, while in the second table it is E(1) if bj ≥ ai and E(0) otherwise. We now introduce a third n1 × n2 table, each of whose elements is the secure sum of the two corresponding elements of the first and second tables. For example, if the entry in row ai and column bj is E(1) in the first table and E(0) in the second table, then the entry in the third table is E(1 + 0). Every entry of the third table is either E(1) or E(2): if ai < bj, the second table holds E(1) and the first table holds E(0), so the third table holds E(1); if ai > bj, the first table holds E(1) and the second table holds E(0), so the third table again holds E(1); if ai = bj, both tables hold E(1) and the third table holds E(2). To sum up, the entry in row ai and column bj of the third table is E(1) if ai ≠ bj and E(2) if ai = bj.
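The three tables can be simulated in plaintext with 0/1 standing in for E(0)/E(1). This is an illustration only; in the protocol every entry stays encrypted and the sums are computed homomorphically:

```python
A = [1, 2, 4, 4]          # party A's sample S1
B = [3, 4, 4, 4]          # party B's sample S2

t1 = [[1 if b <= a else 0 for b in B] for a in A]   # first table:  1 iff bj <= ai
t2 = [[1 if a <= b else 0 for b in B] for a in A]   # second table: 1 iff ai <= bj
t3 = [[t1[i][j] + t2[i][j] for j in range(len(B))]  # third table:  1 iff ai != bj,
      for i in range(len(A))]                       #               2 iff ai == bj
eq = [[t3[i][j] - 1 for j in range(len(B))]         # subtract 1: equality indicator
      for i in range(len(A))]

TB = [sum(row) for row in eq]          # values in B equal to each ai
TA = [sum(col) for col in zip(*eq)]    # values in A equal to each bj
print(TB, TA)
# [0, 0, 3, 3] [0, 2, 2, 2]
```

The row and column sums of the equality table are exactly the cross-party equal counts used in the next step.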
We securely subtract 1 from every element of the third table. Then the entry in row ai and column bj of the new table is E(0) if ai ≠ bj and E(1) if ai = bj. This new table contains the information about equal values between the two parties. The sum of all entries in row ai is the encrypted number of values in party B equal to ai, which we name E(TB(ai)); the sum of all entries in column bj is the encrypted number of values in party A equal to bj, which we name E(TA(bj)). Since parties A and B have computed TA(ai) and TB(bj), respectively, these two numbers can be encrypted and added to E(TB(ai)) and E(TA(bj)), respectively, to obtain E(T(ai)) = E(TA(ai) + TB(ai)) and E(T(bj)) = E(TA(bj) + TB(bj)). For each value ai (i = 1, 2, . . ., n1) in party A, we now have E(R(ai)), the encrypted largest rank in ai's tie, and E(T(ai)), the encrypted number of values in ai's tie, i.e., the number of values equal to ai across both parties. For each value bj (j = 1, 2, . . ., n2) in party B, we have the analogous numbers E(R(bj)) and E(T(bj)).

To obtain the averaged rank of each value, we need the smallest rank in its tie, which can be calculated from the largest rank and the number of values in the tie. For each value ai (i = 1, 2, . . ., n1) in party A, the encrypted smallest rank in ai's tie is E(S(ai)) = E(R(ai) − T(ai) + 1), and the encrypted adjusted rank of ai is E((S(ai) + R(ai))/2), the average of the largest and smallest ranks. To avoid a decimal fraction in the ciphertext, we only calculate E(S(ai) + R(ai)) and apply the division by 2 after the final decryption. For each value bj (j = 1, 2, . . ., n2) in party B, the encrypted smallest rank in bj's tie is E(S(bj)) = E(R(bj) − T(bj) + 1) and the encrypted adjusted rank of bj is E((S(bj) + R(bj))/2).
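These homomorphic steps can be illustrated with a toy Paillier instance. The primes and the fixed randomizer below are for illustration only; the paper's implementation uses Crypto++ with realistic parameters:

```python
import math

# Toy Paillier cryptosystem -- never use parameters this small in practice.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def E(m, r=17):                       # encrypt (fixed r keeps the demo deterministic)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def D(c):                             # decrypt
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def hom_add(c1, c2):                  # E(m1 + m2) = E(m1) * E(m2) mod n^2
    return (c1 * c2) % n2

def hom_sub(c1, c2):                  # E(m1 - m2) = E(m1) * E(m2)^(-1) mod n^2
    return (c1 * pow(c2, -1, n2)) % n2

# Largest rank R = 8 and tie size T = 5 for the running example's value "4".
cR, cT = E(8), E(5)
cS = hom_add(hom_sub(cR, cT), E(1))   # E(S) = E(R - T + 1)
cSum = hom_add(cS, cR)                # E(S + R); divide by 2 only after decryption
print(D(cS), D(cSum) / 2)             # 4 6.0
```

As in the text, the subtraction is a multiplication by a modular inverse, and the division by 2 is deferred until after the final decryption so that the ciphertext never carries a fraction.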
We likewise calculate E(S(bj) + R(bj)) and apply the division by 2 after the final decryption. In this way, we can adjust the rank of every value, and the rank sums are then calculated from these new ranks. Note that if a value is not tied with any other value, the adjustment does not change its rank. The complete computation of H is summarized in Algorithm 3.

Algorithm 3. The complete algorithm of privacy-preserving Kruskal–Wallis test

Input: Party A has sample S1, which contains n1 values, and party B has sample S2, which contains n2 values. The total number of values is N = n1 + n2.
Output: The statistic H.
1: for each value ai in party A do
2:   Calculate its encrypted rank E(R(ai)) and record the secure comparison results;
3: end for
4: for each value bj in party B do
5:   Calculate its encrypted rank E(R(bj)) and record the secure comparison results;
6: end for
7: From the secure comparison results, obtain the information about equal values between the two parties;
8: for each value ai in party A do
9:   Calculate the encrypted number of values equal to it, E(T(ai));
10:  Calculate the encrypted smallest rank in its tie, E(S(ai)), from E(T(ai)) and E(R(ai));
11:  Calculate its encrypted averaged rank;
12: end for
13: for each value bj in party B do
14:  Calculate the encrypted number of values equal to it, E(T(bj));
15:  Calculate the encrypted smallest rank in its tie, E(S(bj)), from E(T(bj)) and E(R(bj));
16:  Calculate its encrypted averaged rank;
17: end for
18: Do the remaining calculations to compute H as in Algorithm 2, using the encrypted averaged ranks;

To extend the adjustment from two parties to multiple parties, we only need to create, for each pair of parties, a table containing the information about equal values during the computation of the ranks.
For each value, calculate the encrypted number of values equal to it by collecting information from all the tables it is involved in. Then the encrypted smallest rank in its tie and its averaged rank can be computed, and the remaining steps are the same as in the extension of Algorithm 2 in Section 4.

5.2.2. Calculation of C

In most cases, dividing H by C changes the final result very little. If the number of tied values is not more than 1/4 of the total number of values, the division does not change the result by more than 10% for some degrees of freedom and significance levels [3]. To calculate C securely for two parties A and B, we need the tie information computed during the adjustment of the ranks: E(T(ai)) for each value ai (i = 1, 2, . . ., n1) in party A and E(T(bj)) for each value bj (j = 1, 2, . . ., n2) in party B. From Eq. (2), we have

C = 1 − Σ(ti³ − ti) / (N³ − N),

where ti is the number of values in the ith tie. To compute C securely, we treat T(ai) of each distinct ai and T(bj) of each distinct bj as a ti. For values not tied with others, the T values equal 1, and since 1³ − 1 = 0, including them does not affect the value of C. For tied values, each T should enter the calculation of C only once, so we consider the T's of the distinct values in each party. With our earlier example, where party A has the values {1, 2, 4, 4} and party B has the values {3, 4, 4, 4}, party A considers only T(1) = 1, T(2) = 1 and T(4) = 5, and party B considers T(3) = 1 and T(4) = 5. All the T's here are encrypted, and no party learns the actual numbers. C can then be computed securely from the encryptions of the ti's: E(ti³) is calculated from E(ti) with Algorithm 1, and then E(Σ(ti³ − ti)) can be computed. The problem is that, although only the T's of the distinct values within each party enter the calculation of C, duplicates remain.
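In plaintext, the correction factor works out as follows. This sketches only the formula; in the protocol the tie sizes are encrypted and Σ(ti³ − ti) is assembled homomorphically:

```python
def tie_correction(tie_sizes, N):
    """C = 1 - sum(t_i^3 - t_i) / (N^3 - N) over the tie sizes t_i.
    Sizes t_i = 1 contribute nothing, so untied values may be included."""
    return 1 - sum(t**3 - t for t in tie_sizes) / (N**3 - N)

# Running example {1, 2, 4, 4} and {3, 4, 4, 4}: one tie of five "4"s, N = 8.
C = tie_correction([1, 1, 1, 5], 8)
print(C)   # 1 - 120/504, about 0.7619
```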
Considering only the distinct values in each party ensures that the ties within a party are counted only once, but it cannot eliminate the duplicated ties between parties. In the example above, T(4) is counted twice because the tie on value 4 exists in both parties. We call the set of ties existing only in party A TA, the set of ties existing only in party B TB, and the set of ties existing in both parties TAB. We want the information about TA, TB and TAB to enter C exactly once. With the solution above, TAB is counted twice. If instead we consider only T(ai) for each value ai (i = 1, 2, . . ., n1) in party A and do not add T(bj) for the values bj (j = 1, 2, . . ., n2) in party B, then TA and TAB are counted once but the information about TB is lost. We cannot add the information about TB alone without also adding TAB, because no party knows whether a tie it holds is local or global. We have not found a way to calculate C exactly. The two solutions above either add extra tie information or lose some tie information when calculating C, but they yield a range for C by providing an upper bound and a lower bound, and thus limit the loss of accuracy.

Table 2 – The BMI dataset.

Asians       Indians      Malays
32 (15)      26.4 (11)    24.9 (8)
30.1 (14)    23.1 (2)     25.3 (9)
27.6 (12)    23.5 (4)     23.8 (5)
26.2 (10)    24.6 (7)     22.1 (1)
28.2 (13)    24.3 (6)     23.4 (3)

We use some examples to show how the calculation of C extends from two parties to the multiparty case. Suppose there are three parties, A, B and C. As in the two-party case, we denote by TA, TB and TC the sets of ties existing only in party A, B and C, respectively. TAB is the set of ties existing in parties A and B, and TAC, TBC and TABC are defined analogously. For each pair of parties, we have a table storing the information about the tied values between the two parties; the three tables are named Table(AB), Table(AC) and Table(BC), respectively.
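The two bounding strategies can be sketched in plaintext on the running example. This is an illustration only; in the protocol the T values are encrypted:

```python
N = 8
# Tie sizes T for the *distinct* values of each party (plaintext stand-ins).
T_A = {1: 1, 2: 1, 4: 5}       # party A's distinct values {1, 2, 4}
T_B = {3: 1, 4: 5}             # party B's distinct values {3, 4}

def corr(sizes):
    return 1 - sum(t**3 - t for t in sizes) / (N**3 - N)

# Over-counting: both parties' T's -- the cross-party tie on "4" enters twice.
C_lower = corr(list(T_A.values()) + list(T_B.values()))
# Under-counting: party A's T's only -- ties existing only in B are lost.
C_upper = corr(T_A.values())
print(C_lower, C_upper)   # about 0.5238 and about 0.7619
```

The true C (here 1 − 120/504, since the only tie is the five "4"s) lies between the two bounds, and since Hc = H/C, they translate into an upper and a lower bound on Hc.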
We collect the tie information of each distinct value in party A from all the tables involving A, namely Table(AB) and Table(AC). This gives us the information about all the ties appearing in party A: TA, TAB, TAC and TABC. We then set aside party A and the tables involving A, and collect the tie information of each distinct value in party B from the remaining tables involving B, which is only Table(BC). With this step, we add the information about all the ties appearing in party B but not in A: TB and TBC. We then encounter the same problem as in the two-party case: if we stop here, the tie information of TC is lost; if we add the tie information of each distinct value in party C from a table involving C, such as Table(AC), both TC and TAC are added, and thus TAC is counted twice. When there are k parties, we follow the same procedure: we collect the information about all the ties appearing in the first party, then add the information about the ties in the second party, and so on. When it comes to the last party, we either lose the information about the ties appearing only in the last party, or add duplicate information about the ties appearing in both the last party and some other party. This yields an upper bound and a lower bound on C.

6. Experiments

This section presents the experimental results. All algorithms are implemented in C++ with the Crypto++ library, and the communication between parties is implemented with the socket API. The experiments are conducted on a Red Hat server with 16 2.27 GHz CPUs and 24 GB of memory. We use two datasets from [34] to test the accuracy of our algorithms. The first dataset, shown in Table 2, contains 3 samples of equal size. Note that a "sample" in the context of this paper differs from its meaning in many other papers: each "sample" here is the set of data held by one party, and the number of samples is the number of parties.
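Throughout the experiments, accuracy is judged against the plaintext Kruskal–Wallis statistic. A minimal reference computation on the Table 2 data, with no privacy protection, is:

```python
def kruskal_wallis_H(samples):
    """Plaintext reference H = 12/(N(N+1)) * sum(R_j^2 / n_j) - 3(N+1),
    where R_j is the rank sum of sample j over the pooled data."""
    pooled = sorted(v for s in samples for v in s)

    def rank(v):                    # average rank, which also handles ties
        idx = [i + 1 for i, w in enumerate(pooled) if w == v]
        return sum(idx) / len(idx)

    N = len(pooled)
    stat = sum(sum(rank(v) for v in s) ** 2 / len(s) for s in samples)
    return 12 / (N * (N + 1)) * stat - 3 * (N + 1)

asians  = [32, 30.1, 27.6, 26.2, 28.2]
indians = [26.4, 23.1, 23.5, 24.6, 24.3]
malays  = [24.9, 25.3, 23.8, 22.1, 23.4]
print(round(kruskal_wallis_H([asians, indians]), 4))          # 5.7709
print(round(kruskal_wallis_H([asians, indians, malays]), 4))  # 8.72
```

These reproduce the exact values reported in Table 4: 5.7709 for two samples and 8.72 for three.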
In this dataset, the data are simulated Body Mass Index (BMI) values for subjects of 3 different races from a suburb of San Francisco; the BMI values of the subjects of each race form one sample. There are no ties in this dataset, and the rank of every value is given in parentheses.

The second dataset is presented in Table 3. It contains 4 samples, not all of equal size. Each sample is a set of simulated International Normalized Ratio (INR) values of the patients of one hospital. The ranks are given in parentheses. The data contain ties, and the tied ranks are shown in bold.

Table 3 – The INR dataset.

Hospital A     Hospital B     Hospital C     Hospital D
1.68 (1)       1.71 (6)       1.74 (13.5)    1.71 (6)
1.69 (2)       1.73 (10)      1.75 (16)      1.71 (6)
1.70 (3.5)     1.74 (13.5)    1.77 (18)      1.74 (13.5)
1.70 (3.5)     1.74 (13.5)    1.78 (20)      1.79 (22)
1.72 (8)       1.78 (20)      1.80 (23.5)    1.81 (26)
1.73 (10)      1.78 (20)      1.81 (26)      1.85 (29)
1.73 (10)      1.80 (23.5)    1.84 (28)      1.87 (30)
1.76 (17)      1.81 (26)                     1.91 (31)

Since our secure algorithms deal only with non-negative integers, each value in dataset 1 is multiplied by 10 and each value in dataset 2 by 100. This step converts all values to non-negative integers without changing their ranks, and it does not affect the result of the Kruskal–Wallis test, which is calculated from the ranks.

The accuracy of our basic algorithm for data without ties is 100%, as shown with dataset 1. Table 4 gives the H values calculated in both the two-party and the multiparty scenarios. In the two-party case, we take the first two samples of dataset 1 and calculate H on them; in the multiparty case, H is calculated on all three samples of dataset 1. Our algorithms for data with ties cause some loss of accuracy. There are two methods of dealing with tied values.
The first method is to modify the data slightly to eliminate ties and then compute H with the basic algorithm; accuracy is lost because the data are changed. The second method keeps the data unchanged but adjusts the ranks and divides H by C; here the accuracy loss comes from the calculation of C. Because we can compute an upper bound and a lower bound for C, we also obtain an upper bound and a lower bound for the final result Hc. We test the two methods on dataset 2, with the results shown in Table 5. Again, we take the first two samples of dataset 2 for the two-party case and all four samples of dataset 2 for the multiparty case. As the results show, the second method is more accurate than the first. In the two-party case, although the first two samples of dataset 2 contain many ties (9 of the 16 values are tied), both bounds are very close to the exact result. In the multiparty case, the upper and lower bounds both equal the exact result. This is because the two bounds are obtained by either disregarding the ties appearing only in the last sample, or counting the ties between the last sample and the first sample twice; in this dataset, the last sample contains no tie of its own, and there is no tie between the last sample and the first sample, so both bounds coincide with the exact result.

We now turn to the computation overheads of the algorithms. Fig. 1 compares the running times, in seconds, of the proposed algorithms for different data sizes in the two-party scenario. The execution times of the basic algorithm for data without ties and of the first method for data with ties are very close, because the first method eliminates the ties and then follows the same procedure as the basic algorithm.
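As a plaintext cross-check of the two-party figures in Table 5, the exact Hc for the first two samples of dataset 2 can be computed directly (a standalone sketch; the secure protocol computes the same quantity on encrypted data):

```python
from collections import Counter

hospA = [1.68, 1.69, 1.70, 1.70, 1.72, 1.73, 1.73, 1.76]
hospB = [1.71, 1.73, 1.74, 1.74, 1.78, 1.78, 1.80, 1.81]
pooled = sorted(hospA + hospB)
N = len(pooled)

def avg_rank(v):                     # tie-averaged rank over the pooled data
    idx = [i + 1 for i, w in enumerate(pooled) if w == v]
    return sum(idx) / len(idx)

R1 = sum(avg_rank(v) for v in hospA)   # rank sum of hospital A: 44.0
R2 = sum(avg_rank(v) for v in hospB)   # rank sum of hospital B: 92.0
H = 12 / (N * (N + 1)) * (R1**2 / len(hospA) + R2**2 / len(hospB)) - 3 * (N + 1)

# Tie correction from the pooled tie sizes (2, 3, 2 and 2 here).
C = 1 - sum(t**3 - t for t in Counter(pooled).values()) / (N**3 - N)
print(round(H / C, 4))   # 6.419, i.e., Table 5's 6.4191 up to rounding
```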
The second method for data with ties takes more time than the first, mostly because of the adjustment of the ranks. We also report the overheads in the multiparty case on datasets 1 and 2.

The execution time of the basic algorithm on dataset 1 is:
  2 samples: 5 s
  3 samples: 17 s

The execution time of the first method for data containing ties on dataset 2 is:
  2 samples: 15 s
  3 samples: 67 s
  4 samples: 599 s

The execution time of the second method for data containing ties on dataset 2 is:
  2 samples: 26 s
  3 samples: 169 s
  4 samples: 2159 s

Table 4 – Kruskal–Wallis test result on data without ties.

                                                 2 samples    3 samples
H calculated by the original Kruskal–Wallis test   5.7709       8.72
H calculated by our basic algorithm                5.7709       8.72

Table 5 – Kruskal–Wallis test result on data with ties.

                                                  2 samples    4 samples
Hc calculated by the original Kruskal–Wallis test   6.4191       11.876
H calculated from modified data (the first method)  6.89338      12.6971
The upper bound of Hc (the second method)           6.4574       11.876
The lower bound of Hc (the second method)           6.4          11.876

Fig. 1 – Running time comparison of algorithms with two parties (running time in seconds, 0–600, vs. sample size, 5–20; curves: the basic algorithm for data without ties, the first method for data with ties, the second method for data with ties).

7. Conclusion

In this work, we proposed several algorithms that enable parties to conduct the Kruskal–Wallis test securely without revealing their data to each other. We described the procedure of the algorithms for data both with and without ties. We also presented an algorithm for multiplying two encrypted integers under an additive homomorphic cryptosystem.
Our algorithms can be extended to make other non-parametric rank-based statistical tests secure, such as the Friedman test; this is our future work.

Conflict of interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

References

[1] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical recipes in C: the art of scientific computing, 1 (i) (1992) 504–510. Available at: http://www.jstor.org/stable/1269484?origin=crossref.
[2] G.E.P. Box, Non-normality and tests on variances, Biometrika (1953) 318–335.
[3] W.H. Kruskal, W.A. Wallis, Use of ranks in one-criterion variance analysis, Journal of the American Statistical Association 47 (260) (1952) 583–621. Available at: http://www.jstor.org/stable/2280779.
[4] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin (1945) 80–83.
[5] H.B. Mann, D.R. Whitney, On a test of whether one of two random variables is stochastically larger than the other, Annals of Mathematical Statistics 18 (1) (1947) 50–60. Available at: http://www.jstor.org/stable/2236101.
[6] C.M.R. Kitchen, Nonparametric versus parametric tests of location in biomedical research, American Journal of Ophthalmology (2009) 571–572.
[7] R. Agrawal, R. Srikant, Privacy-preserving data mining, 2000.
[8] Z. Huang, W. Du, B. Chen, Deriving private information from randomized data, in: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, 2005, p. 37.
[9] K. Chen, L. Liu, Privacy preserving data classification with rotation perturbation, in: Proceedings of the Fifth IEEE International Conference on Data Mining, ICDM '05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 589–592. Available at: http://dx.doi.org/10.1109/ICDM.2005.121.
[10] G.R. Heer, A bootstrap procedure to preserve statistical confidentiality in contingency tables, in: Proceedings of the International Seminar on Statistical Confidentiality, 1993, pp. 261–271.
[11] Y. Lindell, B. Pinkas, Privacy preserving data mining, Journal of Cryptology 15 (3) (2002) 177–206.
[12] W. Du, Z. Zhan, Building decision tree classifier on private data, 2002, pp. 1–8.
[13] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, M.Y. Zhu, Tools for privacy preserving distributed data mining, ACM SIGKDD Explorations Newsletter 4 (2) (2002) 28–34.
[14] I. Damgard, M. Fitzi, E. Kiltz, J.B. Nielsen, T. Toft, Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation, vol. 3876, Springer, 2006, pp. 285–304.
[15] I. Damgard, M. Geisler, M. Kroigard, Homomorphic encryption and secure comparison, International Journal of Applied Cryptography 1 (2008) 22.
[16] W. Du, M. Atallah, Privacy-preserving cooperative statistical analysis, IEEE Computer Society, 2001, p. 102.
[17] B. Goethals, S. Laur, H. Lipmaa, T. Mielikäinen, On private scalar product computation for privacy-preserving data mining, 3506 (2004) 104–120.
[18] S. Han, W.K. Ng, Privacy-preserving linear Fisher discriminant analysis, in: Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 136–147.
[19] W. Du, Y.S. Han, S. Chen, Privacy-preserving multivariate statistical analysis: linear regression and classification, vol. 233, Lake Buena Vista, FL, 2004.
[20] S. Han, W.K. Ng, P.S. Yu, Privacy-preserving singular value decomposition, in: 2009 IEEE 25th International Conference on Data Engineering, 2009, pp. 1267–1270.
[21] Z. Teng, W. Du, A hybrid multi-group privacy-preserving approach for building decision trees, in: Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD '07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 296–307.
[22] J. Vaidya, C. Clifton, Privacy-preserving k-means clustering over vertically partitioned data, 2003, pp. 206–215.
[23] G. Jagannathan, R.N. Wright, Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, ACM, 2005, pp. 593–599.
[24] L. Wan, W.K. Ng, S. Han, V.C.S. Lee, Privacy-preservation for gradient descent methods, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, 2007, p. 775.
[25] T. Chen, S. Zhong, Privacy-preserving models for comparing survival curves using the logrank test, Computer Methods and Programs in Biomedicine 104 (2) (2011) 249–253. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21636164.
[26] Y. Zhang, S. Zhong, Privacy preserving distributed permutation test, 2012, submitted for publication.
[27] O. Goldreich, Foundations of Cryptography, vol. 1, Cambridge University Press, 2001.
[28] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, IEEE Transactions on Knowledge and Data Engineering 16 (9) (2004) 1026–1037.
[29] J. Vaidya, C. Clifton, Privacy-preserving outlier detection, vol. 41, no. 1, IEEE, 2004, pp. 233–240.
[30] S. Zhong, Privacy-preserving algorithms for distributed mining of frequent itemsets, Information Sciences 177 (2) (2007) 490–503.
[31] P. Paillier, Public-key cryptosystems based on composite degree residuosity classes, 1592 (1999) 223–238.
[32] T. ElGamal, A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Transactions on Information Theory 31 (4) (1985) 469–472.
[33] D. Boneh, The Decision Diffie-Hellman problem, vol. 1423, Springer-Verlag, 1998, pp. 48–63.
[34] A.C. Elliott, L.S. Hynan, A SAS® macro implementation of a multiple comparison post hoc test for a Kruskal–Wallis analysis, Computer Methods and Programs in Biomedicine 102 (1) (2011) 75–80. Available at: http://www.ncbi.nlm.nih.gov/pubmed/21146248.