Collaborative Computing for Linear Regression with Privacy
Albert Guan^{a,1}, Po-Wen Chi^{a,2}, Chun-Hung Richard Lin^{b,3}

^{a} Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 11677, Taiwan
^{b} Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan

Email addresses: albert.zj.guan@gmail.com (Albert Guan), neokent@gapps.ntnu.edu.tw (Po-Wen Chi), lin@cse.nsysu.edu.tw (Chun-Hung Richard Lin)

^{1} The research of the first author is supported in part by the MOST project 107-2218-E-110-017-MY3.
^{2} The research of the second author is supported in part by the MOST project 107-2218-E-110-017-MY3.
^{3} The research of the third author is supported in part by the MOST project 108-2622-E-110-009-CC3.
Abstract
Machine learning usually requires a large amount of training data to build a useful model.
We exploit the mathematical structure of linear regression to develop a secure and privacy-preserving method that allows multiple parties to collaboratively compute optimal model parameters without sharing their raw data. The new approach also allows efficient deletion of the data of users who leave the group and wish to have their data removed. Since the data remain private during both learning and unlearning, data owners are more willing to contribute their collected datasets to build better models, benefiting all parties. The proposed collaborative computation of the linear regression model does not require a trusted third party, thereby avoiding the difficulty of building a robust trust system in the current Internet environment. The proposed scheme is also computationally efficient and does not use encryption to keep data secret. We also show that the new linear regression learning scheme can be updated incrementally: new datasets can be easily incorporated into the system, and some data can be removed from the system to obtain a more suitable linear regression model, without recomputing from scratch.
Keywords: Collaborative learning, privacy protection, incremental computation,
machine unlearning.
1. Introduction
Machine learning attempts to build models based on training data to make predictions
or decisions without being explicitly programmed. It usually requires a large amount of
training data to build a useful model. In many practical applications, these data may not
belong to one organization, but rather be distributed across multiple sites. For example,
every hospital collects medical data of its patients. The data collected in one hospital is
often insufficient to obtain a good model for developing new drugs or treatments for some
specific disease. Therefore, sharing data is very important in machine learning when the
data is distributed across many sites.
Like medical data, almost all of the data contain private and sensitive personal information. When sharing these data, security or personal privacy concerns must be properly
addressed. The release of personal data is restricted by laws and regulations in almost all countries in the world. For example, the European General Data Protection Regulation (GDPR) became applicable in 2018 in all member states to harmonize data privacy laws across the European Union [1]. Another reason for limiting data sharing is that collecting these data usually requires a lot of time and effort. Data owners usually view the data they collect as a valuable asset. They are willing to share the collected data with others only if it is legal and mutually beneficial.
In this article, we consider secure and privacy-preserving learning for a specific machine
learning problem: linear regression. Linear regression models the relationship between
variables by fitting a line to the observed data [2, 3]. It is the most basic model in
machine learning and has many applications [4]. We propose a method to collaboratively
build a linear regression model without releasing the original data. The new learning
scheme also supports deleting a user’s dataset without using the user’s original data.
Therefore, it complies with privacy protection regulations.
Secure collaborative machine learning is also a type of secure multiparty computation, which has been widely studied by cryptographers. In secure multiparty computation, no raw data can be sent directly to the other parties for security reasons. There are many techniques to achieve secrecy in multiparty computation, but general methods, such as Yao's garbled circuits, are often too expensive to implement efficiently.
By exploiting the mathematical structure of linear regression, we achieve efficient secure
construction of linear regression models.
In secure computation, the goal is to provide a well-defined framework that guarantees zero knowledge about the other parties' private data. Zero knowledge is very desirable, but this property usually requires heavy computation [5]. Furthermore, zero knowledge typically only holds for polynomial-time attackers, i.e., these schemes can only be
computationally secure. In our scheme, we show that the curious party does not have
enough information to recover the exact value of the other party’s original data.
It may be useful to design a scheme with a trusted third party, especially when the
third party only needs to play a light role in the protocol. However, in the current Internet
environment, it is often difficult to establish and maintain long-term trust among parties,
especially when the number of participants increases. Such a system may not be very
robust; if the server or any party violates the security rules, the system cannot
function properly. The proposed method for collaboratively constructing a linear model
in this article does not require trusted third parties. We only assume that all parties are
semi-honest: they all follow the protocol, but they are also curious about the data of the
other parties.
Encryption can be used to protect sensitive information. However, the data in machine learning must remain usable for efficient computation, and general encryption schemes do not support this. Homomorphic encryption, such as the additively homomorphic Paillier cryptosystem [6], may be used to compute model parameters in machine learning. However, homomorphic encryption usually requires intensive computation to achieve a required level of security, and these cryptosystems can only achieve computational security.
Most importantly, numbers in machine learning are usually real numbers. These numbers must be represented in a standard format (such as IEEE Standard 754 floating-point numbers) so that currently available CPUs can perform the calculations. On the other hand, the plaintexts and ciphertexts of a cryptosystem usually lie in finite groups, rings, or fields to avoid rounding errors. Almost all of these cryptosystems, even if they are fully homomorphic, may therefore not be suitable for machine learning. Recently, Cheon et al. developed the CKKS cryptosystem [7] to support a special type of fixed-point arithmetic, called block floating-point arithmetic. Although floating-point numbers can be approximated in fixed-size memory, the current version of CKKS supports only a limited number of multiplications before a time-consuming bootstrapping step is required. Another limitation of CKKS is that it currently does not directly support the floating-point division operations required for linear regression and almost all machine learning models. In this regard, Babenko et al. proposed a two-stage method for Euclidean division with the CKKS scheme and studied its properties [8]. In this article, we propose a new scheme to securely compute optimal parameters for linear regression without encryption, avoiding all these problems. The main contributions of this research are summarized as follows.
1. The proposed secure and privacy-preserving machine learning scheme for linear
regression needs neither trusted servers nor encryption.
2. Its security is not based on computationally hard problems. Even an attacker with
unlimited computing power does not have enough information to learn the exact
value of the private data of the other parties.
3. It is computationally efficient. It does not require encryption and does not incur
additional computation due to distribution of the datasets.
4. The proposed scheme only sends aggregated data to the other parties. The size of the
aggregated data depends only on the number of features, not on the number of records.
In practical applications, the amount of data is usually much larger than the number of
features, which implies that the communication cost of our scheme is lower than sending
the raw data, whether encrypted or not.
5. The new linear regression learning scheme can be updated incrementally. New
datasets can be easily incorporated into the system, and old datasets can be removed
from the system to obtain a more suitable linear regression model, without recomputing the model parameters from scratch.
2. Related Work
In contrast to traditional centralized machine learning techniques where all local
datasets are uploaded to one server, the technique we study is known as federated learning or collaborative learning. In federated learning, the most important task is to devise
a method for parties to collaboratively compute the parameters of a machine learning
model from their input data, while keeping their input data private. To meet the security
requirement, trusted servers can be used to protect the confidentiality of data and perform computations efficiently. The security of many known secure computation schemes
for linear regression requires servers [9, 10, 11, 12, 13, 14]. Some of the schemes use only
one server [9, 12, 13]; others use two servers, one for cryptographic services and the other
for computation [10, 11, 14].
There are different ways of using the servers. In de Cock et al.'s scheme [12], a trusted
server is only needed in the pre-distribution phase to do initialization. On the other
hand, in Chen et al.'s [9] and Mandal et al.'s schemes [13], the server is required to perform
computational tasks, but it is only required to be semi-honest.
Federated learning can also be considered a form of secure multiparty computation, or privacy-preserving computation. Secure multiparty computation has been widely studied by cryptographers [15, 16, 17]. Recently, Ma et al. developed algorithms under secure multiparty computation that enable pharmaceutical institutions to build high-quality models to advance drug discovery without divulging private drug-related information [18].
The security of most existing secure multiparty computations depends on computationally hard problems, such as factoring large integers or solving discrete logarithms in large groups. Some schemes are based on Yao's garbled circuits combined with secure computation of inner products [10]. Some are based on homomorphic encryption to ensure the security of the data [11]. Kikuchi et al. proposed a scheme that does not require trusted servers [19], but it needs Paillier's cryptosystem, a public-key cryptosystem with homomorphic properties. These cryptosystems, including Yao's garbled circuits, are computationally intensive.
In our proposed scheme for linear regression, with proper bookkeeping, the datasets owned by a user can also be deleted from the computation efficiently. This is useful when a party leaves the group and wants its data removed. This is called
machine unlearning. Machine unlearning is necessary to comply with privacy protection
laws and regulations, such as GDPR in European countries. In our new scheme, both
the addition and deletion of some datasets can be done efficiently without recomputing
the model parameters from scratch. However, our scheme needs to invert a matrix of size
(m + 1) × (m + 1), where m is the number of features used in the linear regression model.
This is not a problem in practice since the value of m is usually small compared to the
amount of data n.
Some of the known schemes do not use the closed form of the optimum solution for
linear regression [10, 13]. Although the authors claim that the advantage is that their
scheme does not need to compute a matrix inverse, the drawback is that the result is
only an approximation. Most importantly, this type of method has to send gradients,
or other related data, multiple times to make the model parameters sufficiently accurate.
It is well known that sending gradients iteratively can reveal enough information for an
attacker to infer the values of the training data [20]. This is because the attacker can
accumulate some information from each gradient update. When enough information is
collected, the attacker can deduce the values of the original training dataset.
For each dataset, our scheme only sends the aggregated data once, and we show that
the information leaked is too small for any unauthorized party to infer the values of
the other party's data. Note that this technique works well for linear regression problems.
Other federated learning problems may not be able to avoid sending gradients iteratively
and require other techniques to preserve privacy. For example, for Deep Neural Networks
(DNN), Xu et al. proposed a double masking protocol to guarantee the confidentiality of
users' local gradients during federated learning [21].
3. The Linear Regression Problem
In this section, we first briefly describe the computational procedure for linear regression. We then show a method to compute the optimal parameters of a linear model by minimizing the sum of squared errors. This classical method is called the least squares method [22, 4].
Linear regression is a linear approach to modeling the relationship between a response
y and a set of m independent variables x1 , x2 , . . . , xm in an experiment. It can be expressed
as
y = β0 + β1 x1 + · · · + βm xm .
These independent variables x1 , x2 , . . . , xm are also called the features of the model.
Suppose $n$ experiments have been performed, each with a different setting of the features $x_1, x_2, \ldots, x_m$. With a set of $n$ observed data points, we obtain $n$ equations:
\[
y_i = \beta_0 + \beta_1 x^i_1 + \beta_2 x^i_2 + \cdots + \beta_m x^i_m, \quad i = 1, 2, \ldots, n,
\]
where $x^i_j$ is the value of variable $x_j$ used in the $i$-th experiment. It is convenient to write these $n$ equations in matrix form. Let $\vec{x}_i = [x^i_1, x^i_2, \ldots, x^i_m]$, and let the model coefficients be $\vec{\beta} = [\beta_0, \beta_1, \ldots, \beta_m]$. The model prediction is $y_i = \beta_0 + \sum_{j=1}^{m} \beta_j x^i_j$. If each $\vec{x}_i$ is extended to $\vec{x}_i = [1, x^i_1, x^i_2, \ldots, x^i_m]$, then
\[
y_i = \sum_{j=0}^{m} \beta_j x^i_j = \vec{\beta} \cdot \vec{x}_i.
\]
In constructing a linear model, the coefficients β0 , β1 , . . . , βm are unknowns. A set of
n experiments can be performed to determine possible values of these m + 1 unknowns.
These m + 1 unknowns β0 , β1 , . . . , βm are called parameters of the linear regression model.
If all the experiments have no errors and the setting of the variables ⃗xi are independent,
then m + 1 experiments are sufficient to determine the m + 1 unknowns β0 , β1 , . . . , βm .
However, almost all experiments contain errors. More than m + 1 experiments, usually many more, are needed to obtain "optimal" values of β0, β1, . . . , βm.
To define the optimal values of these parameters, we first define the loss function.
Given a set of values $\vec{\beta} = [\beta_0, \beta_1, \ldots, \beta_m]$, the loss with respect to $\vec{\beta}$ is
\[
L(\vec{\beta}) = \sum_{i=1}^{n} \left( \vec{\beta} \cdot \vec{x}_i - y_i \right)^2.
\]
In the least squares setting, the optimum parameter is defined as the one that minimizes this sum of squared errors:
\[
\hat{\beta} = \arg\min_{\vec{\beta}} L(\vec{\beta}) = \arg\min_{\vec{\beta}} \sum_{i=1}^{n} \left( \vec{\beta} \cdot \vec{x}_i - y_i \right)^2.
\]
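As a concrete illustration (our own sketch, not part of the paper; the toy data and variable names are made up), the following Python/NumPy snippet evaluates the loss $L(\vec{\beta})$ for a candidate parameter vector, with each row of $X$ extended by a leading 1:

```python
import numpy as np

# Toy data: n = 4 observations, m = 2 features (illustrative values only).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 10.0])

# Extend each row with a leading 1 so that beta[0] acts as the intercept beta_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

def loss(beta, X, y):
    """Sum of squared errors L(beta) = sum_i (beta . x_i - y_i)^2."""
    residuals = X @ beta - y
    return float(residuals @ residuals)

beta_candidate = np.array([1.0, 1.5, 0.5])
print(loss(beta_candidate, X, y))
```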
4. The Proposed Collaborative Learning Method
In this section, we first describe the theoretical basis of our proposed collaborative learning approach. We then describe the new collaborative learning protocol.
4.1. Theoretical Basis of the Proposed Scheme
First, we show how to compute the optimal parameters of a linear regression model
incrementally. Assume that the number of features is 2. Thus,
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2.
\]
The optimal values of the $\beta_j$, $j = 0, 1, 2$, can be obtained by finding the minimum of the function $L(\vec{\beta})$. Let
\[
L(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right)^2.
\]
Solving $\partial L / \partial \beta_j = 0$ for $j = 0, 1, 2$, we obtain:
\[
\begin{aligned}
\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right) &= 0, \\
\sum_{i=1}^{n} x^i_1 \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right) &= 0, \\
\sum_{i=1}^{n} x^i_2 \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right) &= 0.
\end{aligned}
\]
These equations are also called the normal equations. They can be rewritten as:
\[
\begin{aligned}
\sum_{i=1}^{n} y_i &= \beta_0 \, n + \beta_1 \sum_{i=1}^{n} x^i_1 + \beta_2 \sum_{i=1}^{n} x^i_2, \\
\sum_{i=1}^{n} x^i_1 y_i &= \beta_0 \sum_{i=1}^{n} x^i_1 + \beta_1 \sum_{i=1}^{n} (x^i_1)^2 + \beta_2 \sum_{i=1}^{n} x^i_1 x^i_2, \\
\sum_{i=1}^{n} x^i_2 y_i &= \beta_0 \sum_{i=1}^{n} x^i_2 + \beta_1 \sum_{i=1}^{n} x^i_1 x^i_2 + \beta_2 \sum_{i=1}^{n} (x^i_2)^2.
\end{aligned}
\tag{1}
\]
The coefficients $\beta_0, \beta_1, \beta_2$ can be computed by solving the system of linear equations (1).
Now suppose another set of $k$ data points is to be added:
\[
(x^{n+1}_1, x^{n+1}_2, y_{n+1}), \ldots, (x^{n+k}_1, x^{n+k}_2, y_{n+k}).
\]
Equations (1) become
\[
\begin{aligned}
\sum_{i=1}^{n+k} y_i &= \beta_0 (n + k) + \beta_1 \sum_{i=1}^{n+k} x^i_1 + \beta_2 \sum_{i=1}^{n+k} x^i_2, \\
\sum_{i=1}^{n+k} x^i_1 y_i &= \beta_0 \sum_{i=1}^{n+k} x^i_1 + \beta_1 \sum_{i=1}^{n+k} (x^i_1)^2 + \beta_2 \sum_{i=1}^{n+k} x^i_1 x^i_2, \\
\sum_{i=1}^{n+k} x^i_2 y_i &= \beta_0 \sum_{i=1}^{n+k} x^i_2 + \beta_1 \sum_{i=1}^{n+k} x^i_1 x^i_2 + \beta_2 \sum_{i=1}^{n+k} (x^i_2)^2.
\end{aligned}
\]
The new sums can be expressed as the old sums plus the extra terms:
\[
\sum_{i=1}^{n+k} x^i_2 = \sum_{i=1}^{n} x^i_2 + \sum_{i=n+1}^{n+k} x^i_2, \qquad
\sum_{i=1}^{n+k} x^i_1 x^i_2 = \sum_{i=1}^{n} x^i_1 x^i_2 + \sum_{i=n+1}^{n+k} x^i_1 x^i_2, \qquad
\sum_{i=1}^{n+k} (x^i_1)^2 = \sum_{i=1}^{n} (x^i_1)^2 + \sum_{i=n+1}^{n+k} (x^i_1)^2,
\]
and so on for the other summations.
To generalize the method described above, it is more convenient to rewrite these equations using matrices:
\[
\begin{bmatrix}
1 & x^1_1 & x^1_2 & \cdots & x^1_m \\
1 & x^2_1 & x^2_2 & \cdots & x^2_m \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x^n_1 & x^n_2 & \cdots & x^n_m
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\ \beta_1 \\ \vdots \\ \beta_m
\end{bmatrix}
=
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{bmatrix}.
\]
The relationship between the matrix $X$ and the vector $\vec{y}$ can also be written in matrix form as
\[
X \vec{\beta} = \vec{y}. \tag{2}
\]
The loss function can be rewritten as:
\[
L(\vec{\beta}) = \| X\vec{\beta} - \vec{y} \|^2
= \left( X\vec{\beta} - \vec{y} \right)^T \left( X\vec{\beta} - \vec{y} \right)
= \vec{y}^T \vec{y} - \vec{y}^T X \vec{\beta} - \vec{\beta}^T X^T \vec{y} + \vec{\beta}^T X^T X \vec{\beta}.
\]
The optimum solution is where the gradient is 0. The gradient of the loss function is:
\[
\frac{\partial L(\vec{\beta})}{\partial \vec{\beta}}
= \frac{\partial}{\partial \vec{\beta}} \left( \vec{y}^T \vec{y} - \vec{y}^T X \vec{\beta} - \vec{\beta}^T X^T \vec{y} + \vec{\beta}^T X^T X \vec{\beta} \right)
= -2 X^T \vec{y} + 2 X^T X \vec{\beta}.
\]
Setting the gradient to 0, we get the optimum coefficients:
\[
-2 X^T \vec{y} + 2 X^T X \vec{\beta} = 0
\;\Rightarrow\; X^T \vec{y} = X^T X \vec{\beta}
\;\Rightarrow\; \vec{\beta} = \left( X^T X \right)^{-1} X^T \vec{y}.
\]
Therefore, the optimum value of $\vec{\beta}$ is
\[
\vec{\beta} = \left( X^T X \right)^{-1} X^T \vec{y}. \tag{3}
\]
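For concreteness, here is a minimal NumPy sketch of Equation (3) (our own illustrative code, not from the paper; it solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """Closed-form least squares: beta = (X^T X)^{-1} X^T y,
    where X is X_raw extended with a leading column of ones."""
    n = X_raw.shape[0]
    X = np.hstack([np.ones((n, 1)), X_raw])
    G = X.T @ X          # (m+1) x (m+1)
    h = X.T @ y          # (m+1) vector
    # Solve G beta = h rather than inverting G explicitly.
    return np.linalg.solve(G, h)

# Example with synthetic data (m = 3 features, n = 100 observations).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = 2.0 + X_raw @ np.array([1.0, -0.5, 3.0]) + rng.normal(scale=0.1, size=100)
print(fit_linear_regression(X_raw, y))   # approximately [2.0, 1.0, -0.5, 3.0]
```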
Suppose that each user $u$ has a set of data points $(x^i_1, x^i_2, \ldots, x^i_m)$ with responses $y_i$. The users want to collaboratively compute the optimal coefficients
\[
\vec{\beta} = (\beta_0, \beta_1, \ldots, \beta_m)
\]
such that $\| X\vec{\beta} - \vec{y} \|^2$ is minimized. Assume that there are two parties $A$ and $B$. $A$'s data is in the upper part of $X$ and $B$'s data is in the lower part of $X$. The matrix $X$ can be written as $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$; similarly, the vector $\vec{y}$ can be written as $\begin{bmatrix} \vec{y}_1 \\ \vec{y}_2 \end{bmatrix}$.
The system of equations (2) can then be written as:
\[
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \vec{\beta} = \begin{bmatrix} \vec{y}_1 \\ \vec{y}_2 \end{bmatrix}.
\]
Applying Equation (3), the value of $\vec{\beta}$ can be computed as
\[
\vec{\beta} = \left( \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}^T \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \right)^{-1} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}^T \begin{bmatrix} \vec{y}_1 \\ \vec{y}_2 \end{bmatrix}
= \left( X_1^T X_1 + X_2^T X_2 \right)^{-1} \left( X_1^T \vec{y}_1 + X_2^T \vec{y}_2 \right).
\]
To compute the model parameter $\vec{\beta}$, the matrix that needs to be inverted is the $(m+1) \times (m+1)$ matrix
\[
X_1^T X_1 + X_2^T X_2.
\]
In practical applications, the matrices to be maintained are $X^T X$ and $X^T \vec{y}$. These matrices are initialized as zero matrices. Whenever a new dataset $X_i$ and $\vec{y}_i$ is obtained, the new matrices $X_i^T X_i$ and $X_i^T \vec{y}_i$ are added to $X^T X$ and $X^T \vec{y}$, respectively. This enables the computation of a new set of model parameters for linear regression. It is also possible to delete a particular dataset, say $X_j$ and $\vec{y}_j$, by subtracting $X_j^T X_j$ and $X_j^T \vec{y}_j$ from $X^T X$ and $X^T \vec{y}$, respectively.
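A minimal sketch of this bookkeeping (our own illustrative helper functions, assuming NumPy; not part of the paper's protocol description) adds or subtracts a dataset's aggregates:

```python
import numpy as np

def dataset_aggregates(X_raw, y):
    """Compute X_i^T X_i and X_i^T y_i, with X_i extended by a leading column of ones."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return X.T @ X, X.T @ y

def add_dataset(G, h, G_i, h_i):
    """Incorporate a new dataset's aggregates into the running totals."""
    return G + G_i, h + h_i

def delete_dataset(G, h, G_j, h_j):
    """Remove a previously added dataset using only its stored aggregates."""
    return G - G_j, h - h_j
```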
4.2. The Protocol
Suppose that $A$ and $B$ intend to use the least squares method to construct a linear model with $m$ features. Assume that $A$ has the data $X_1$ and the associated vector $\vec{y}_1$, and that $B$ has the data $X_2$ and the associated vector $\vec{y}_2$. Our secure collaborative computation of the optimal parameters for linear regression is shown in Figure 1.

1. $A$ performs $n_1$ experiments and collects the data into an $n_1 \times (m+1)$ matrix $X_1$ and an $n_1 \times 1$ vector $\vec{y}_1$.
2. $B$ performs $n_2$ experiments and collects the data into an $n_2 \times (m+1)$ matrix $X_2$ and an $n_2 \times 1$ vector $\vec{y}_2$.
3. $A$ sends $X_1^T X_1$ and $X_1^T \vec{y}_1$ to $B$.
4. $B$ sends $X_2^T X_2$ and $X_2^T \vec{y}_2$ to $A$.
5. $A$ and $B$ can then compute the optimal model parameter $\vec{\beta}$, which is an $(m+1) \times 1$ vector:
\[
\vec{\beta} = \left( X_1^T X_1 + X_2^T X_2 \right)^{-1} \left( X_1^T \vec{y}_1 + X_2^T \vec{y}_2 \right).
\]
Figure 1: Secure collaborative learning of linear regression without encryption. In the protocol, $A$ and $B$ use the least squares method to construct a linear model with $m$ features.
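The protocol in Figure 1 can be prototyped directly. The following sketch (an illustrative single-process simulation under our own naming; a real deployment would exchange the aggregates over a network) has each party compute its local aggregates and then solve for $\vec{\beta}$ from the combined aggregates:

```python
import numpy as np

def local_aggregates(X_raw, y):
    """Steps 1-2: each party extends its data with a column of ones and
    computes G_i = X_i^T X_i and h_i = X_i^T y_i."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return X.T @ X, X.T @ y

def combine_and_solve(G_a, h_a, G_b, h_b):
    """Step 5: beta = (X_1^T X_1 + X_2^T X_2)^{-1} (X_1^T y_1 + X_2^T y_2)."""
    return np.linalg.solve(G_a + G_b, h_a + h_b)

# Toy simulation: two parties with private data (illustrative values only).
rng = np.random.default_rng(1)
true_beta = np.array([0.5, 2.0, -1.0])
X_a = rng.normal(size=(50, 2))
y_a = true_beta[0] + X_a @ true_beta[1:] + rng.normal(scale=0.05, size=50)
X_b = rng.normal(size=(60, 2))
y_b = true_beta[0] + X_b @ true_beta[1:] + rng.normal(scale=0.05, size=60)

G_a, h_a = local_aggregates(X_a, y_a)   # A sends (G_a, h_a) to B (step 3)
G_b, h_b = local_aggregates(X_b, y_b)   # B sends (G_b, h_b) to A (step 4)
print(combine_and_solve(G_a, h_a, G_b, h_b))   # both parties obtain the same beta
```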
4.3. Analysis of the Protocol
Based on the analysis in Section 4.1, it is easy to verify that our collaborative computation of the optimal parameters for linear regression is correct. For the computational complexity of the scheme, each party $i$, $i = 1, 2$, needs to compute $X_i^T X_i$ and $X_i^T \vec{y}_i$. This is necessary even if the computation is performed by only one party using all the data in one place. The only possible extra work is that every party needs to invert the matrix $X_1^T X_1 + X_2^T X_2$ to obtain the solution $\vec{\beta}$. The dimension of the matrix to be inverted is $(m+1) \times (m+1)$, where $m$ is the number of features used in the model, which is usually much smaller than the number of data points $n$.
This step can also be done by one party (i.e., the server), which then sends the solution $\vec{\beta}$ to the other parties (i.e., local users). In this way, the total computational work is the same as that required when all the data resides at one site.
For communication complexity, our proposed scheme sends only aggregated data to the other party. The size of the aggregated data is only $(m+1)^2 + (m+1) = (m+1)(m+2)$ numbers for each party, regardless of how many records $n$ the party holds. The communication cost thus depends only on the number of features $m$ and is independent of $n$. This is less than the size of the raw dataset, which is $n(m+1) + n$ numbers, whenever $n \geq m + 2$. For example, with $m = 10$ features and $n = 10{,}000$ records, a party sends $132$ numbers instead of $120{,}000$.
4.4. Security Analysis
In this section, we show that the proposed scheme is information-theoretically secure. This means that even though a small amount of information is inevitably divulged to establish a useful linear regression model, an attacker cannot compute the exact values of the other party's data, even with infinite computing power.
We assume that all parties are semi-honest, i.e., the participants strictly follow the protocol, but they are interested in learning additional information that they are not explicitly permitted to know.
Theorem 4.1. Assume that each party collects more than $m + 2$ data points. The proposed collaborative learning method for the optimal parameters of a linear regression model is secure in the following sense:
1. $A$ does not have enough information to compute any element of $X_2$ and $\vec{y}_2$, and
2. $B$ does not have enough information to compute any element of $X_1$ and $\vec{y}_1$.
Proof. Assume that $A$ has a set of $n_1$ data points and $B$ has a set of $n_2$ data points, with $n_1 + n_2 = n$ and $n_1, n_2 > m + 2$, where $m$ is the number of features in the system. First we show that $A$ cannot compute $B$'s data (i.e., $X_2$ and $\vec{y}_2$) from the information $A$ collects during the execution of the protocol. The number of elements in $X_2$ is $n_2 \times (m+1)$, but the first column of $X_2$ is constant, so $X_2$ contains $n_2 m$ variables. The vector $\vec{y}_2$ contains $n_2$ variables. Thus, $A$ faces a total of $n_2 (m+1)$ unknowns.
On the other hand, the number of elements in $X_2^T X_2$ is $(m+1)^2$, and the number of elements in $X_2^T \vec{y}_2$ is $m+1$. The first element in the first row of $X_2^T X_2$ is simply $n_2$. Thus, $A$ has at most $(m+1)^2 + (m+1) - 1$ quadratic equations. Since $n_2 > m + 2$, the number of unknowns $n_2(m+1)$ exceeds the number of equations, so $A$ does not have enough information to compute the values of the elements in $X_2$ and $\vec{y}_2$. For example, with $m = 3$ and $n_2 = 6$, $B$'s data contains $24$ unknowns, while $A$ receives at most $19$ equations.
Similarly, we can show that $B$ does not have enough information to compute the values of the elements in $X_1$ and $\vec{y}_1$.
We have shown that even if the parties or the server have infinite computational power, there is not enough information to infer the values of the other party's data. That is, the security of the proposed linear regression scheme does not depend on computationally hard problems, such as factoring large integers or solving discrete logarithms in large groups. Such problems become solvable if the attacker has sufficient computing power, for example a quantum computer.
Although we have proved that a curious party cannot compute the values of the other parties' datasets, even with infinite computing power, the curious party does learn some information about the datasets, for example the number of data points and the sums of squares of the data. Regarding information leakage in general, we note that Dwork and Naor showed that if a machine learning model is useful, it must reveal some information about the data on which it was trained [23, 24]. Our proof indicates that the information disclosed to the other party through the proposed collaborative linear regression scheme is minimal. As a result, none of the parties can leverage it to compute the exact values of each other's data, even if the attacker has unlimited computing power.

Table 1: Comparisons of related protocols

Scheme            TA   HE   YG(SMC)   CF   Security
Cock's [12]       Y    Y    N         Y    C
Gascón's [10]     Y    Y    Y         N    C
Kikuchi's [19]    N    Y    N         Y    C
Mandal's [13]     Y    Y    N         N    C
Mohassel's [14]   Y    Y    Y         N    C
Qiu's [11]        Y    Y    Y         N    C
Ours              N    N    N         Y    P

1. TA: needs a trusted or semi-trusted authority.
2. HE: uses homomorphic encryption.
3. YG(SMC): uses Yao's garbled circuits.
4. CF: uses a closed-form formula to compute the optimal parameters. (Note that some schemes do not use closed-form formulations but instead use conjugate gradient descent, which is computationally inefficient and vulnerable to message chaining attacks.)
5. Security: C means computationally secure, P means perfectly (information-theoretically) secure.
4.5. Comparisons
In this section, we compare several related schemes for federated learning of linear regression: Cock's scheme [12], Gascón's scheme [10], Kikuchi's scheme [19], Mandal's scheme [13], Mohassel's scheme [14], and Qiu's scheme [11]. They were proposed recently and address the security of federated learning for linear regression.
These schemes have several characteristics that can be compared. One is whether the scheme uses a trusted server; the assumption of a trusted server is unrealistic in some settings. Another is the use of homomorphic encryption, which is computationally intensive and can only achieve computational security. Some schemes use Yao's garbled circuits; although this general secure multiparty computation technique can realize secure computation of arbitrary functions, its computational cost is very high. Some schemes do not use the closed form of the optimal parameters but instead send gradient-related values iteratively, which may compromise data confidentiality. We incorporate these features into the comparison. Finally, we compare the security level of these schemes: some achieve information-theoretic security, while others are only computationally secure. The results of the comparison are shown in Table 1.
5. Applications of Collaborative Learning Scheme
The proposed collaborative learning method for linear regression can be viewed as a
peer-to-peer mode of collaborative learning, or decentralized federated learning. In addition
to the peer-to-peer model for collaborative learning, the proposed learning scheme can
also be used for the client-server mode of collaborative learning, or centralized federated
learning. In the client-server mode of collaborative learning, the server may or may not
own any training data. Each party sends its training data to the server in the aggregated
form. The server can then do all the computations for the users, and send the final result, i.e., the value of $\vec{\beta}$, to each user.
A user's new training data can also be sent to the server multiple times. When sending training data incrementally, care must be taken to avoid sending a batch with too few data points. For example, suppose that initially a user $A$ has collected a set of $n_1$ training data points. If $n_1 > m + 2$, then the user can send the corresponding matrix $G_1 = X_1^T X_1$ and vector $\vec{h}_1 = X_1^T \vec{y}_1$ to the server. Assume that more training data is later collected by the same user $A$. If the number of new training data points $n_2 > m + 2$, then $A$ can send another pair $G_2 = X_2^T X_2$ and $\vec{h}_2 = X_2^T \vec{y}_2$ to the server. If instead $n_2 \leq m + 2$, user $A$ should not send $G_2$ and $\vec{h}_2$ to the server, for security and privacy reasons: the number of training data points in $X_2$ and $\vec{y}_2$ is too small, and a curious server could compute $X_2$ and $\vec{y}_2$ from $G_2$ and $\vec{h}_2$.
For security, it is required that the number of training data points $n_i$ be greater than $m + 2$ for each user $i$. This is not a serious limitation because, in any practical application, the number of features is usually small, and the amount of collected data should be much larger than the number of features to build an accurate machine learning model.
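A client can enforce this requirement with a trivial guard before uploading a batch; the check below is our own illustrative sketch of that policy:

```python
def safe_to_upload(n_i: int, m: int) -> bool:
    """Only upload aggregated data when the batch holds more than m + 2 records,
    so a curious receiver cannot reconstruct the raw values (cf. Section 4.4)."""
    return n_i > m + 2
```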
In the design of the protocol, we do not make any assumptions about how the data is distributed among the parties. In particular, each party may collect its own data with a different set of features. In 2019, Yang et al. introduced a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning [5]. In our protocol, the original data can be horizontally or vertically partitioned; as long as the set of features is determined, our protocol can be applied.
Our proposed collaborative learning method can easily be implemented to compute the optimal model parameters for linear regression incrementally. Suppose that $k$ datasets are collected over a period of time:
\[
(X_1, \vec{y}_1), (X_2, \vec{y}_2), \ldots, (X_k, \vec{y}_k).
\]
Define an $(m+1) \times (m+1)$ matrix
\[
G_i = X_i^T X_i
\]
and an $(m+1) \times 1$ vector
\[
\vec{h}_i = X_i^T \vec{y}_i.
\]
The optimal model parameters $\vec{\beta}$ can be computed by the equation
\[
\vec{\beta} = G^{-1} \vec{h}, \quad \text{where} \quad G = \sum_{i=1}^{k} G_i \quad \text{and} \quad \vec{h} = \sum_{i=1}^{k} \vec{h}_i.
\]
These $k$ datasets $(X_1, \vec{y}_1), (X_2, \vec{y}_2), \ldots, (X_k, \vec{y}_k)$ can be the training data of $k$ users collected at different times. Whenever a new dataset $(X_i, \vec{y}_i)$ becomes available, its aggregates can be added to $G$ and $\vec{h}$, respectively, and better model parameters $\vec{\beta}$ can then be computed.
It is also possible to delete specified datasets, as long as the relevant aggregated data is properly preserved. For example, if we want to delete all datasets associated with a party, we need to know the aggregated data of that party. The aggregated data may also be held by that party itself. Once the party wants to delete its entire dataset, it can send the aggregated data to the server, which can then be used to deduct its data from the system.
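A sketch of this server-side bookkeeping (illustrative only; the class and method names are our own assumptions, not from the paper) keeps each party's aggregates so that a departing party's contribution can be deducted without ever accessing its raw data:

```python
import numpy as np

class FederatedLinearRegressionServer:
    """Keeps per-party aggregates G_i = X_i^T X_i and h_i = X_i^T y_i."""

    def __init__(self):
        self.contributions = {}   # party_id -> (G_i, h_i)

    def submit(self, party_id, G_i, h_i):
        """A party uploads (or replaces) its aggregated data."""
        self.contributions[party_id] = (np.asarray(G_i), np.asarray(h_i))

    def unlearn(self, party_id):
        """Machine unlearning: drop the party's aggregates from the model."""
        self.contributions.pop(party_id, None)

    def model_parameters(self):
        """Compute beta = G^{-1} h from the currently registered parties."""
        if not self.contributions:
            raise ValueError("no contributions registered")
        G = sum(G_i for G_i, _ in self.contributions.values())
        h = sum(h_i for _, h_i in self.contributions.values())
        return np.linalg.solve(G, h)
```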
6. Conclusions
We propose a simple yet effective secure multi-party computation scheme for linear
regression in a distributed environment. In the design of our scheme, we exploit the
mathematical structure of linear regression, so that any party needs to send only the
aggregated data to the server or to the other party. Therefore, security and privacy
can be well protected without encryption. Security and privacy play important roles in
collaborative learning. When data owners do not need to publish the valuable data they
collect, they are more willing to share their data to build a better model.
The proposed scheme also allows incremental addition and deletion of some datasets
used in the linear regression models. New datasets can be easily integrated into the
system to obtain new and more accurate model parameters without having to compute
from scratch. With proper bookkeeping, users' datasets can also be removed from the model without referencing the original data being removed. The deletion of datasets is useful if a party leaves the group and asks for the deletion of its dataset.
The security of our proposed collaborative computation scheme for linear regression does not need encryption. It does not depend on computationally hard problems, either. Schemes based on such problems usually require heavy computation to achieve a certain level of security. Furthermore, computationally hard problems may be broken if the attacker has enough computing power, for example a quantum computer. We proved that a curious party lacks sufficient information to obtain the exact values of the other party's data, even with unlimited computing power.
Different weights can also be incorporated into collaborative learning in our new method. This property may be useful for "customized" machine learning, in which all parties have similarities, but each party also has its own special characteristics. Handwritten character recognition for smartphone input is a good example of the need for a personalized model. In the proposed method for collaborative learning, each data owner handles its data independently, so customization can be achieved easily by giving a particular user's data more weight.
References
[1] T. E. Union, General data protection regulation (2016).
URL https://gdpr-info.eu/
[2] H. L. Seal, Studies in the history of probability and statistics. xv the historical development of the gauss linear model, Biometrika 54 (1-2) (1967) 1–24.
doi:10.1093/biomet/54.1-2.1.
URL https://doi.org/10.1093/biomet/54.1-2.1
[3] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal
Statistical Society. Series B (Methodological) 58 (1) (1996) 267–288.
URL http://www.jstor.org/stable/2346178
[4] J. Fox, Applied regression analysis, linear models, and related methods, Sage Publications, Inc., 1997.
[5] Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning: Concept and applications, ACM Transactions on Intelligent Systems and Technology 10 (2) (January
2019). doi:10.1145/3298981.
URL https://doi.org/10.1145/3298981
[6] P. Paillier, Public-key cryptosystems based on composite degree residuosity classes,
in: J. Stern (Ed.), Advances in Cryptology — EUROCRYPT ’99, Springer Berlin
Heidelberg, Berlin, Heidelberg, 1999, pp. 223–238.
[7] J. H. Cheon, A. Kim, M. Kim, Y. Song, Homomorphic encryption for arithmetic of
approximate numbers, in: T. Takagi, T. Peyrin (Eds.), Advances in Cryptology –
ASIACRYPT 2017, Springer International Publishing, Cham, 2017, pp. 409–437.
[8] M. Babenko, E. Golimblevskaia, Euclidean division method for the homomorphic scheme ckks, in: 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), 2021, pp. 217–220.
doi:10.1109/ElConRus51938.2021.9396347.
[9] F. Chen, T. Xiang, X. Lei, J. Chen, Highly efficient linear regression outsourcing to a cloud, IEEE Transactions on Cloud Computing 2 (4) (2014) 499–508.
doi:10.1109/TCC.2014.2378757.
[10] A. Gascon, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur,
D. Evans, Privacy-preserving distributed linear regression on high-dimensional
data, Proceedings on Privacy Enhancing Technologies 2017 (4) (2017) 345–364.
doi:doi:10.1515/popets-2017-0053.
URL https://doi.org/10.1515/popets-2017-0053
[11] G. Qiu, X. Gui, Y. Zhao, Privacy-preserving linear regression on distributed data by
homomorphic encryption and data masking, IEEE Access 8 (2020) 107601–107613.
doi:10.1109/ACCESS.2020.3000764.
[12] M. d. Cock, R. Dowsley, A. C. Nascimento, S. C. Newman, Fast, privacy preserving
linear regression over distributed datasets based on pre-distributed data, in: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, AISec
’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 3–14.
doi:10.1145/2808769.2808774.
URL https://doi.org/10.1145/2808769.2808774
[13] K. Mandal, G. Gong, Privfl: Practical privacy-preserving federated regressions
on high-dimensional data over mobile networks, in: Proceedings of the 2019
ACM SIGSAC Conference on Cloud Computing Security Workshop, CCSW’19,
Association for Computing Machinery, New York, NY, USA, 2019, pp. 57–68.
doi:10.1145/3338466.3358926.
URL https://doi.org/10.1145/3338466.3358926
[14] P. Mohassel, Y. Zhang, Secureml: A system for scalable privacy-preserving machine
learning, in: 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 19–38.
doi:10.1109/SP.2017.12.
[15] A. C. Yao, Protocols for secure computations, in: Twenty-third Annual Symposium
on Foundations of Computer Science, 1982, pp. 160–164. doi:10.1109/SFCS.1982.38.
[16] C. Zhao, S. Zhao, M. Zhao, Z. Chen, C.-Z. Gao, H. Li, Y. an Tan, Secure multi-party
computation: Theory, practice and applications, Information Sciences 476 (2019)
357–372. doi:https://doi.org/10.1016/j.ins.2018.10.024.
URL https://www.sciencedirect.com/science/article/pii/S0020025518308338
[17] Y. Ishai, E. Kushilevitz, A. Paskin, Secure multiparty computation with minimal
interaction, in: T. Rabin (Ed.), Advances in Cryptology – CRYPTO 2010, Springer
Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 577–594.
[18] R. Ma, Y. Li, C. Li, F. Wan, H. Hu, W. Xu, J. Zeng, Secure multiparty computation for privacy-preserving drug discovery, Bioinformatics 36 (9) (2020) 2872–2880.
doi:10.1093/bioinformatics/btaa038.
URL https://doi.org/10.1093/bioinformatics/btaa038
[19] H. Kikuchi, C. Hamanaga, H. Yasunaga, H. Matsui, H. Hashimoto, C.I. Fan, Privacy-preserving multiple linear regression of vertically partitioned
real medical datasets, Journal of Information Processing 26 (2018) 638–647.
doi:10.2197/ipsjjip.26.638.
[20] L. Zhu, Z. Liu, S. Han, Deep leakage from gradients, CoRR abs/1906.08935 (2019).
arXiv:1906.08935.
URL http://arxiv.org/abs/1906.08935
[21] G. Xu, H. Li, S. Liu, K. Yang, X. Lin, Verifynet: Secure and verifiable federated
learning, IEEE Transactions on Information Forensics and Security 15 (2020) 911–
926. doi:10.1109/TIFS.2019.2929409.
[22] J. B. Ramsey, Tests for specification errors in classical linear least-squares regression
analysis, Journal of the Royal Statistical Society: Series B (Methodological) 31 (2)
(1969) 350–371. doi:https://doi.org/10.1111/j.2517-6161.1969.tb00796.x.
[23] C. Dwork, M. Naor, On the difficulties of disclosure prevention in statistical databases
or the case for differential privacy, Journal of Privacy and Confidentiality 2 (1)
(September 2010). doi:10.29012/jpc.v2i1.585.
[24] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends Theoretical Computer Science 9 (3–4) (2014) 211–407.
doi:10.1561/0400000042.