Collaborative Computing for Linear Regression with Privacy
Albert Guan^{a,1}, Po-Wen Chi^{a,2}, Chun-Hung Richard Lin^{b,3}

^{a} Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 11677, Taiwan
^{b} Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan

Email addresses: albert.zj.guan@gmail.com (Albert Guan), neokent@gapps.ntnu.edu.tw (Po-Wen Chi), lin@cse.nsysu.edu.tw (Chun-Hung Richard Lin)

^{1} The research of the first author is supported in part by the MOST project 107-2218-E-110-017-MY3.
^{2} The research of the second author is supported in part by the MOST project 107-2218-E-110-017-MY3.
^{3} The research of the third author is supported in part by the MOST project 108-2622-E-110-009-CC3.
Abstract
Machine learning usually requires a large amount of training data to build a useful model.
We exploit the mathematical structure of linear regression to develop a secure and privacy-preserving method that allows multiple parties to collaboratively compute optimal model parameters without sharing their raw data. The new approach also allows efficient deletion of the data of users who leave the group and wish to have their data removed. Since the data remain private during both learning and unlearning, data owners are more willing to contribute their collected datasets to build better models, benefiting all parties. The proposed collaborative computation of the linear regression model does not require a trusted third party, thereby avoiding the difficulty of building a robust trust system in the current Internet environment. The proposed scheme is also computationally efficient and does not use encryption to keep data secret. We also show that the new linear regression learning scheme can be updated incrementally: new datasets can be easily incorporated into the system, and some data can be removed from the system to obtain a more suitable linear regression model, without recomputing from scratch.
Keywords: Collaborative learning, privacy protection, incremental computation,
machine unlearning.
1. Introduction
Machine learning attempts to build models based on training data to make predictions
or decisions without being explicitly programmed. It usually requires a large amount of
training data to build a useful model. In many practical applications, these data may not
belong to one organization, but rather be distributed across multiple sites. For example,
every hospital collects medical data of its patients. The data collected in one hospital is
often insufficient to obtain a good model for developing new drugs or treatments for some
specific disease. Therefore, sharing data is very important in machine learning when the
data is distributed across many sites.
Like medical data, almost all of the data contain private and sensitive personal information. When sharing these data, security or personal privacy concerns must be properly
addressed. The release of personal data is restricted by laws and regulations in almost all countries in the world. For example, the European General Data Protection Regulation (GDPR) became applicable in 2018 in all member states to harmonize data privacy laws across the European Union [1]. Another reason for limiting data sharing is that collecting these data usually requires a lot of time and effort. Data owners usually view the data they collect as a valuable asset. They are willing to share the collected data with others only if it is legal and mutually beneficial.
In this article, we consider secure and privacy-preserving learning for a specific machine
learning problem: linear regression. Linear regression models the relationship between
variables by fitting a line to the observed data [2, 3]. It is the most basic model in
machine learning and has many applications [4]. We propose a method to collaboratively
build a linear regression model without releasing the original data. The new learning
scheme also supports deleting a user’s dataset without using the user’s original data.
Therefore, it complies with privacy protection regulations.
Secure collaborative machine learning is also a type of secure multiparty computation, which has been widely studied by cryptographers. In secure multiparty computation, no raw data can be sent directly to the other parties for security reasons. There are many techniques to achieve secrecy in multiparty computation, but general methods, such as Yao's garbled circuits, are often too expensive to implement efficiently.
By exploiting the mathematical structure of linear regression, we achieve efficient secure
construction of linear regression models.
In secure computation, the goal is to provide a well-defined framework that guarantees zero knowledge about the other parties' private data. Zero knowledge is very desirable, but this property usually requires heavy computation [5]. Furthermore, zero knowledge typically only holds for polynomial-time attackers, i.e., these schemes can only be
computationally secure. In our scheme, we show that the curious party does not have
enough information to recover the exact value of the other party’s original data.
It may be useful to design a scheme with a trusted third party, especially when the
third party only needs to play a light role in the protocol. However, in the current Internet
environment, it is often difficult to establish and maintain long-term trust among parties,
especially when the number of participants increases. Such a system may not be very
robust; if the server or any party violates the security rules, the system cannot
function properly. The proposed method for collaboratively constructing a linear model
in this article does not require trusted third parties. We only assume that all parties are
semi-honest: they all follow the protocol, but they are also curious about the data of the
other parties.
Encryption can be used to protect sensitive information. However, the data in machine learning must remain usable for efficient computation, and general encryption schemes do not support this. Homomorphic encryption, such as the additively homomorphic Paillier cryptosystem [6], may be used to compute model parameters in machine learning. However, homomorphic encryption usually requires intensive computation to achieve a required level of security, and these cryptosystems can only achieve computational security.
Most importantly, numbers in machine learning are usually real numbers. These numbers must be represented in a standard format (such as IEEE Standard 754 floating-point numbers) so that currently available CPUs can perform the calculations. On the other hand, the plaintexts and ciphertexts of a cryptosystem usually lie in finite groups, rings, or fields to avoid rounding errors. Almost all of these cryptosystems, even if they are fully homomorphic, may therefore not be suitable for machine learning. Recently, Cheon et al. developed the CKKS cryptosystem [7] to support a special type of fixed-point arithmetic, called block floating-point arithmetic. Although floating-point numbers can be approximated in fixed-size memory, the current version of CKKS supports only a limited number of multiplications before a time-consuming bootstrapping step is required. Another limitation of CKKS is that it currently does not directly support the floating-point division operations required for linear regression and almost all machine learning models. In this regard, Babenko et al. proposed a two-stage method for Euclidean division with the CKKS scheme and studied its properties [8]. In this article, we propose a new scheme to securely compute optimal parameters for linear regression without encryption, avoiding all these problems. The main contributions of this research are summarized as follows.
1. The proposed secure and privacy-preserving machine learning scheme for linear
regression needs neither trusted servers nor encryption.
2. Its security is not based on computationally hard problems. Even an attacker with
unlimited computing power does not have enough information to learn the exact
value of the private data of the other parties.
3. It is computationally efficient. It does not require encryption and does not incur
additional computation due to distribution of the datasets.
4. The proposed scheme only sends aggregated data to the other parties. The size of the
aggregated data depends only on the number of features, not on the number of records.
In practical applications, the amount of data is usually much larger than the number of
features, which implies that the communication cost of our scheme is lower than sending
the raw data, whether encrypted or not.
5. The new linear regression learning scheme can be updated incrementally. New
datasets can be easily incorporated into the system, and old datasets can be removed
from the system to obtain a more suitable linear regression model, without recomputing the model parameters from scratch.
2. Related Work
In contrast to traditional centralized machine learning techniques where all local
datasets are uploaded to one server, the technique we study is known as federated learning or collaborative learning. In federated learning, the most important task is to devise
a method for parties to collaboratively compute the parameters of a machine learning
model from their input data, while keeping their input data private. To meet the security
requirement, trusted servers can be used to protect the confidentiality of data and perform computations efficiently. The security of many known secure computation schemes
for linear regression requires servers [9, 10, 11, 12, 13, 14]. Some of the schemes use only
one server [9, 12, 13]; others use two servers, one for cryptographic services and the other
for computation [10, 11, 14].
There are different ways of using the servers. In de Cock et al.'s scheme [12], a trusted
server is only needed in the pre-distribution phase to do initialization. On the other
hand, in Chen et al.'s [9] and Mandal et al.'s schemes [13], the server is required to perform
computational tasks, but it is only required to be semi-honest.
Federated learning can also be considered a form of secure multiparty computation, or privacy-preserving computation. Secure multiparty computation has been widely studied by cryptographers [15, 16, 17]. Recently, Ma et al. developed algorithms under secure multiparty computation that enable pharmaceutical institutions to build high-quality models to advance drug discovery without divulging private drug-related information [18].
The security of most existing secure multiparty computations depends on computationally hard problems, such as factoring large integers or solving discrete logarithms in large groups. Some schemes are based on Yao's garbled circuits combined with secure computation of inner products [10]. Some are based on homomorphic encryption to ensure the security of the data [11]. Kikuchi et al. proposed a scheme that does not require trusted servers [19], but it needs Paillier's cryptosystem, a public-key cryptosystem with homomorphic properties. These cryptosystems, including Yao's garbled circuits, are computationally intensive.
In our proposed scheme for linear regression, with proper bookkeeping, the datasets owned by a user can also be deleted from the computation efficiently. This is useful when a party leaves the group and wants its data removed. This is called
machine unlearning. Machine unlearning is necessary to comply with privacy protection
laws and regulations, such as GDPR in European countries. In our new scheme, both
the addition and deletion of some datasets can be done efficiently without recomputing
the model parameters from scratch. However, our scheme needs to invert a matrix of size
(m + 1) × (m + 1), where m is the number of features used in the linear regression model.
This is not a problem in practice since the value of m is usually small compared to the
amount of data n.
Some of the known schemes do not use the closed form of the optimum solution for
linear regression [10, 13]. Although the authors claim that the advantage is that their
scheme does not need to compute a matrix inverse, the drawback is that the result is
only an approximation. Most importantly, this type of method has to send gradients,
or other related data, multiple times to make the model parameters sufficiently accurate.
It is well known that sending gradients iteratively can reveal enough information for an
attacker to infer the values of the training data [20]. This is because the attacker can
accumulate some information from each gradient update. When enough information is
collected, the attacker can deduce the values of the original training dataset.
For each dataset, our scheme only sends the aggregated data once, and we show that
the information leaked is too small for any unauthorized party to infer the values of
the other party's data. Note that this technique works well for linear regression problems.
Other federated learning problems may not be able to avoid sending gradients iteratively
and require other techniques to preserve privacy. For example, for Deep Neural Networks
(DNN), Xu et al. proposed a double masking protocol to guarantee the confidentiality of
users' local gradients during federated learning [21].
3. The Linear Regression Problem
In this section, we first briefly describe the computational procedure for linear regression. We then show a method to compute the optimal parameters of a linear model by minimizing the sum of squared errors. This classical method is called the least squares method [22, 4].
Linear regression is a linear approach to modeling the relationship between a response
y and a set of m independent variables x1 , x2 , . . . , xm in an experiment. It can be expressed
as
y = β0 + β1 x1 + · · · + βm xm .
These independent variables x1 , x2 , . . . , xm are also called the features of the model.
Suppose $n$ experiments have been performed, each with a different setting of the features $x_1, x_2, \ldots, x_m$. With a set of $n$ observed data points, we obtain $n$ equations:
\[
y_i = \beta_0 + \beta_1 x^i_1 + \beta_2 x^i_2 + \cdots + \beta_m x^i_m, \quad i = 1, 2, \ldots, n,
\]
where $x^i_j$ is the value of variable $x_j$ used in the $i$-th experiment. It is convenient to write these $n$ equations in matrix form. Let $\vec{x}_i = [x^i_1, x^i_2, \ldots, x^i_m]$, and let the model coefficients be $\vec{\beta} = [\beta_0, \beta_1, \ldots, \beta_m]$. The model prediction is $y_i = \beta_0 + \sum_{j=1}^{m} \beta_j x^i_j$. If each $\vec{x}_i$ is extended to $\vec{x}_i = [1, x^i_1, x^i_2, \ldots, x^i_m]$, then
\[
y_i = \sum_{j=0}^{m} \beta_j x^i_j = \vec{\beta} \cdot \vec{x}_i.
\]
In constructing a linear model, the coefficients β0 , β1 , . . . , βm are unknowns. A set of
n experiments can be performed to determine possible values of these m + 1 unknowns.
These m + 1 unknowns β0 , β1 , . . . , βm are called parameters of the linear regression model.
If all the experiments have no errors and the setting of the variables ⃗xi are independent,
then m + 1 experiments are sufficient to determine the m + 1 unknowns β0 , β1 , . . . , βm .
However, almost all experiments contain errors. More than m + 1 experiments, usually many more, are needed to obtain "optimal" values of β0, β1, . . . , βm.
To define the optimal values of these parameters, we first define the loss function.
Given a set of values $\vec{\beta} = [\beta_0, \beta_1, \ldots, \beta_m]$, the loss with respect to $\vec{\beta}$ is
\[
L(\vec{\beta}) = \sum_{i=1}^{n} \left( \vec{\beta} \cdot \vec{x}_i - y_i \right)^2.
\]
In the least squares setting, the optimum parameter is defined as the one that minimizes this sum of squared errors:
\[
\hat{\beta} = \arg\min_{\vec{\beta}} L(\vec{\beta}) = \arg\min_{\vec{\beta}} \sum_{i=1}^{n} \left( \vec{\beta} \cdot \vec{x}_i - y_i \right)^2.
\]
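As a concrete illustration (our own sketch, not part of the paper; the toy data and variable names are made up), the following Python/NumPy snippet evaluates the loss $L(\vec{\beta})$ for a candidate parameter vector, with each row of $X$ extended by a leading 1:

```python
import numpy as np

# Toy data: n = 4 observations, m = 2 features (illustrative values only).
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 10.0])

# Extend each row with a leading 1 so that beta[0] acts as the intercept beta_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

def loss(beta, X, y):
    """Sum of squared errors L(beta) = sum_i (beta . x_i - y_i)^2."""
    residuals = X @ beta - y
    return float(residuals @ residuals)

beta_candidate = np.array([1.0, 1.5, 0.5])
print(loss(beta_candidate, X, y))
```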
4. The Proposed Collaborative Learning Method
In this section, we first describe the theoretical basis of our proposed collaborative learning approach. We then describe the new collaborative learning protocol.
4.1. Theoretical Basis of the Proposed Scheme
First, we show how to compute the optimal parameters of a linear regression model
incrementally. Assume that the number of features is 2. Thus,
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2.
\]
The optimal values of the $\beta_j$, $j = 0, 1, 2$, can be obtained by finding the minimum of the function $L(\vec{\beta})$. Let
\[
L(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right)^2.
\]
Solving $\partial L / \partial \beta_j = 0$ for $j = 0, 1, 2$, we obtain:
\[
\begin{aligned}
\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right) &= 0, \\
\sum_{i=1}^{n} x^i_1 \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right) &= 0, \\
\sum_{i=1}^{n} x^i_2 \left( y_i - \beta_0 - \beta_1 x^i_1 - \beta_2 x^i_2 \right) &= 0.
\end{aligned}
\]
These equations are also called the normal equations. They can be rewritten as:
\[
\begin{aligned}
\sum_{i=1}^{n} y_i &= \beta_0 \, n + \beta_1 \sum_{i=1}^{n} x^i_1 + \beta_2 \sum_{i=1}^{n} x^i_2, \\
\sum_{i=1}^{n} x^i_1 y_i &= \beta_0 \sum_{i=1}^{n} x^i_1 + \beta_1 \sum_{i=1}^{n} (x^i_1)^2 + \beta_2 \sum_{i=1}^{n} x^i_1 x^i_2, \\
\sum_{i=1}^{n} x^i_2 y_i &= \beta_0 \sum_{i=1}^{n} x^i_2 + \beta_1 \sum_{i=1}^{n} x^i_1 x^i_2 + \beta_2 \sum_{i=1}^{n} (x^i_2)^2.
\end{aligned}
\tag{1}
\]
The coefficients $\beta_0, \beta_1, \beta_2$ can be computed by solving the system of linear equations (1).
Now suppose another set of $k$ data points is to be added:
\[
(x^{n+1}_1, x^{n+1}_2, y_{n+1}), \ldots, (x^{n+k}_1, x^{n+k}_2, y_{n+k}).
\]
Equations (1) become
\[
\begin{aligned}
\sum_{i=1}^{n+k} y_i &= \beta_0 (n + k) + \beta_1 \sum_{i=1}^{n+k} x^i_1 + \beta_2 \sum_{i=1}^{n+k} x^i_2, \\
\sum_{i=1}^{n+k} x^i_1 y_i &= \beta_0 \sum_{i=1}^{n+k} x^i_1 + \beta_1 \sum_{i=1}^{n+k} (x^i_1)^2 + \beta_2 \sum_{i=1}^{n+k} x^i_1 x^i_2, \\
\sum_{i=1}^{n+k} x^i_2 y_i &= \beta_0 \sum_{i=1}^{n+k} x^i_2 + \beta_1 \sum_{i=1}^{n+k} x^i_1 x^i_2 + \beta_2 \sum_{i=1}^{n+k} (x^i_2)^2.
\end{aligned}
\]
The new sums can be expressed as the old sums plus the extra terms:
\[
\sum_{i=1}^{n+k} x^i_2 = \sum_{i=1}^{n} x^i_2 + \sum_{i=n+1}^{n+k} x^i_2, \qquad
\sum_{i=1}^{n+k} x^i_1 x^i_2 = \sum_{i=1}^{n} x^i_1 x^i_2 + \sum_{i=n+1}^{n+k} x^i_1 x^i_2, \qquad
\sum_{i=1}^{n+k} (x^i_1)^2 = \sum_{i=1}^{n} (x^i_1)^2 + \sum_{i=n+1}^{n+k} (x^i_1)^2,
\]
and so on for the other summations.
To generalize the method described above, it is more convenient to rewrite these equations using matrices:
\[
\begin{bmatrix}
1 & x^1_1 & x^1_2 & \cdots & x^1_m \\
1 & x^2_1 & x^2_2 & \cdots & x^2_m \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x^n_1 & x^n_2 & \cdots & x^n_m
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\ \beta_1 \\ \vdots \\ \beta_m
\end{bmatrix}
=
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{bmatrix}.
\]
The relationship between the matrix $X$ and the vector $\vec{y}$ can also be written in matrix form as
\[
X \vec{\beta} = \vec{y}. \tag{2}
\]
The loss function can be rewritten as:
\[
L(\vec{\beta}) = \| X\vec{\beta} - \vec{y} \|^2
= \left( X\vec{\beta} - \vec{y} \right)^T \left( X\vec{\beta} - \vec{y} \right)
= \vec{y}^T \vec{y} - \vec{y}^T X \vec{\beta} - \vec{\beta}^T X^T \vec{y} + \vec{\beta}^T X^T X \vec{\beta}.
\]
The optimum solution is where the gradient is 0. The gradient of the loss function is:
\[
\frac{\partial L(\vec{\beta})}{\partial \vec{\beta}}
= \frac{\partial}{\partial \vec{\beta}} \left( \vec{y}^T \vec{y} - \vec{y}^T X \vec{\beta} - \vec{\beta}^T X^T \vec{y} + \vec{\beta}^T X^T X \vec{\beta} \right)
= -2 X^T \vec{y} + 2 X^T X \vec{\beta}.
\]
Setting the gradient to 0, we get the optimum coefficients:
\[
-2 X^T \vec{y} + 2 X^T X \vec{\beta} = 0
\;\Rightarrow\; X^T \vec{y} = X^T X \vec{\beta}
\;\Rightarrow\; \vec{\beta} = \left( X^T X \right)^{-1} X^T \vec{y}.
\]
Therefore, the optimum value of $\vec{\beta}$ is
\[
\vec{\beta} = \left( X^T X \right)^{-1} X^T \vec{y}. \tag{3}
\]
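For concreteness, here is a minimal NumPy sketch of Equation (3) (our own illustrative code, not from the paper; it solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """Closed-form least squares: beta = (X^T X)^{-1} X^T y,
    where X is X_raw extended with a leading column of ones."""
    n = X_raw.shape[0]
    X = np.hstack([np.ones((n, 1)), X_raw])
    G = X.T @ X          # (m+1) x (m+1)
    h = X.T @ y          # (m+1) vector
    # Solve G beta = h rather than inverting G explicitly.
    return np.linalg.solve(G, h)

# Example with synthetic data (m = 3 features, n = 100 observations).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = 2.0 + X_raw @ np.array([1.0, -0.5, 3.0]) + rng.normal(scale=0.1, size=100)
print(fit_linear_regression(X_raw, y))   # approximately [2.0, 1.0, -0.5, 3.0]
```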
Suppose that each user $u$ has a set of data points $(x^i_1, x^i_2, \ldots, x^i_m)$ with responses $y_i$. The users want to collaboratively compute the optimal coefficients
\[
\vec{\beta} = (\beta_0, \beta_1, \ldots, \beta_m)
\]
such that $\| X\vec{\beta} - \vec{y} \|^2$ is minimized. Assume that there are two parties $A$ and $B$. $A$'s data is in the upper part of $X$ and $B$'s data is in the lower part of $X$. The matrix $X$ can be written as $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$; similarly, the vector $\vec{y}$ can be written as $\begin{bmatrix} \vec{y}_1 \\ \vec{y}_2 \end{bmatrix}$.
The system of equations (2) can then be written as:
\[
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \vec{\beta} = \begin{bmatrix} \vec{y}_1 \\ \vec{y}_2 \end{bmatrix}.
\]
Applying Equation (3), the value of $\vec{\beta}$ can be computed as
\[
\vec{\beta} = \left( \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}^T \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \right)^{-1} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}^T \begin{bmatrix} \vec{y}_1 \\ \vec{y}_2 \end{bmatrix}
= \left( X_1^T X_1 + X_2^T X_2 \right)^{-1} \left( X_1^T \vec{y}_1 + X_2^T \vec{y}_2 \right).
\]
To compute the model parameter $\vec{\beta}$, the matrix that needs to be inverted is the $(m+1) \times (m+1)$ matrix
\[
X_1^T X_1 + X_2^T X_2.
\]
In practical applications, the matrices to be maintained are $X^T X$ and $X^T \vec{y}$. These matrices are initialized as zero matrices. Whenever a new dataset $X_i$ and $\vec{y}_i$ is obtained, the new matrices $X_i^T X_i$ and $X_i^T \vec{y}_i$ are added to $X^T X$ and $X^T \vec{y}$, respectively. This enables the computation of a new set of model parameters for linear regression. It is also possible to delete a particular dataset, say $X_j$ and $\vec{y}_j$, by subtracting $X_j^T X_j$ and $X_j^T \vec{y}_j$ from $X^T X$ and $X^T \vec{y}$, respectively.
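A minimal sketch of this bookkeeping (our own illustrative helper functions, assuming NumPy; not part of the paper's protocol description) adds or subtracts a dataset's aggregates:

```python
import numpy as np

def dataset_aggregates(X_raw, y):
    """Compute X_i^T X_i and X_i^T y_i, with X_i extended by a leading column of ones."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return X.T @ X, X.T @ y

def add_dataset(G, h, G_i, h_i):
    """Incorporate a new dataset's aggregates into the running totals."""
    return G + G_i, h + h_i

def delete_dataset(G, h, G_j, h_j):
    """Remove a previously added dataset using only its stored aggregates."""
    return G - G_j, h - h_j
```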
4.2. The Protocol
Suppose that $A$ and $B$ intend to use the least squares method to construct a linear model with $m$ features. Assume that $A$ has the data $X_1$ and the associated vector $\vec{y}_1$, and that $B$ has the data $X_2$ and the associated vector $\vec{y}_2$. Our secure collaborative computation of the optimal parameters for linear regression is shown in Figure 1.

1. $A$ performs $n_1$ experiments and collects the data into an $n_1 \times (m+1)$ matrix $X_1$ and an $n_1 \times 1$ vector $\vec{y}_1$.
2. $B$ performs $n_2$ experiments and collects the data into an $n_2 \times (m+1)$ matrix $X_2$ and an $n_2 \times 1$ vector $\vec{y}_2$.
3. $A$ sends $X_1^T X_1$ and $X_1^T \vec{y}_1$ to $B$.
4. $B$ sends $X_2^T X_2$ and $X_2^T \vec{y}_2$ to $A$.
5. $A$ and $B$ can then compute the optimal model parameter $\vec{\beta}$, which is an $(m+1) \times 1$ vector:
\[
\vec{\beta} = \left( X_1^T X_1 + X_2^T X_2 \right)^{-1} \left( X_1^T \vec{y}_1 + X_2^T \vec{y}_2 \right).
\]
Figure 1: Secure collaborative learning of linear regression without encryption. In the protocol, $A$ and $B$ use the least squares method to construct a linear model with $m$ features.
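The protocol in Figure 1 can be prototyped directly. The following sketch (an illustrative single-process simulation under our own naming; a real deployment would exchange the aggregates over a network) has each party compute its local aggregates and then solve for $\vec{\beta}$ from the combined aggregates:

```python
import numpy as np

def local_aggregates(X_raw, y):
    """Steps 1-2: each party extends its data with a column of ones and
    computes G_i = X_i^T X_i and h_i = X_i^T y_i."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
    return X.T @ X, X.T @ y

def combine_and_solve(G_a, h_a, G_b, h_b):
    """Step 5: beta = (X_1^T X_1 + X_2^T X_2)^{-1} (X_1^T y_1 + X_2^T y_2)."""
    return np.linalg.solve(G_a + G_b, h_a + h_b)

# Toy simulation: two parties with private data (illustrative values only).
rng = np.random.default_rng(1)
true_beta = np.array([0.5, 2.0, -1.0])
X_a = rng.normal(size=(50, 2))
y_a = true_beta[0] + X_a @ true_beta[1:] + rng.normal(scale=0.05, size=50)
X_b = rng.normal(size=(60, 2))
y_b = true_beta[0] + X_b @ true_beta[1:] + rng.normal(scale=0.05, size=60)

G_a, h_a = local_aggregates(X_a, y_a)   # A sends (G_a, h_a) to B (step 3)
G_b, h_b = local_aggregates(X_b, y_b)   # B sends (G_b, h_b) to A (step 4)
print(combine_and_solve(G_a, h_a, G_b, h_b))   # both parties obtain the same beta
```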
4.3. Analysis of the Protocol
Based on the analysis in Section 4.1, it is easy to verify that our collaborative computation of the optimal parameters for linear regression is correct. For the computational complexity of the scheme, each party $i$, $i = 1, 2$, needs to compute $X_i^T X_i$ and $X_i^T \vec{y}_i$. This is necessary even if the computation is performed by only one party using all the data in one place. The only possible extra work is that every party needs to invert the matrix $X_1^T X_1 + X_2^T X_2$ to obtain the solution $\vec{\beta}$. The dimension of the matrix to be inverted is $(m+1) \times (m+1)$, where $m$ is the number of features used in the model, which is usually much smaller than the number of data points $n$.
This step can also be done by one party (i.e., the server), which then sends the solution $\vec{\beta}$ to the other parties (i.e., local users). In this way, the total computational work is the same as that required when all the data resides at one site.
For communication complexity, our proposed scheme sends only aggregated data to the other party. The size of the aggregated data is only $(m+1)^2 + (m+1) = (m+1)(m+2)$ numbers for each party, regardless of how many records $n$ the party holds. The communication cost thus depends only on the number of features $m$ and is independent of $n$. This is less than the size of the raw dataset, which is $n(m+1) + n$ numbers, whenever $n \geq m + 2$. For example, with $m = 10$ features and $n = 10{,}000$ records, a party sends $132$ numbers instead of $120{,}000$.
4.4. Security Analysis
In this section, we show that the proposed scheme is information-theoretically secure. This means that even though a small amount of information is inevitably divulged to establish a useful linear regression model, an attacker cannot compute the exact values of the other party's data, even with infinite computing power.
We assume that all parties are semi-honest, i.e., the participants strictly follow the protocol, but they are interested in learning additional information that they are not explicitly permitted to know.
Theorem 4.1. Assume that each party collects more than $m + 2$ data points. The proposed collaborative learning method for the optimal parameters of a linear regression model is secure in the following sense:
1. $A$ does not have enough information to compute any element of $X_2$ and $\vec{y}_2$, and
2. $B$ does not have enough information to compute any element of $X_1$ and $\vec{y}_1$.
Proof. Assume that $A$ has a set of $n_1$ data points and $B$ has a set of $n_2$ data points, with $n_1 + n_2 = n$ and $n_1, n_2 > m + 2$, where $m$ is the number of features in the system. First we show that $A$ cannot compute $B$'s data (i.e., $X_2$ and $\vec{y}_2$) from the information $A$ collects during the execution of the protocol. The number of elements in $X_2$ is $n_2 \times (m+1)$, but the first column of $X_2$ is constant, so $X_2$ contains $n_2 m$ variables. The vector $\vec{y}_2$ contains $n_2$ variables. Thus, $A$ faces a total of $n_2 (m+1)$ unknowns.
On the other hand, the number of elements in $X_2^T X_2$ is $(m+1)^2$, and the number of elements in $X_2^T \vec{y}_2$ is $m+1$. The first element in the first row of $X_2^T X_2$ is simply $n_2$. Thus, $A$ has at most $(m+1)^2 + (m+1) - 1$ quadratic equations. Since $n_2 > m + 2$, the number of unknowns $n_2(m+1)$ exceeds the number of equations, so $A$ does not have enough information to compute the values of the elements in $X_2$ and $\vec{y}_2$. For example, with $m = 3$ and $n_2 = 6$, $B$'s data contains $24$ unknowns, while $A$ receives at most $19$ equations.
Similarly, we can show that $B$ does not have enough information to compute the values of the elements in $X_1$ and $\vec{y}_1$.
We have shown that even if the parties or the server have infinite computational power, there is not enough information to infer the values of the other party's data. That is, the security of the proposed linear regression scheme does not depend on computationally hard problems, such as factoring large integers or solving discrete logarithms in large groups. Such problems become solvable if the attacker has sufficient computing power, for example a quantum computer.
Although we have proved that a curious party cannot compute the values of the other parties' datasets, even with infinite computing power, the curious party does learn some information about the datasets, for example the number of data points and the sums of squares of the data. Regarding information leakage in general, we note that Dwork and Naor showed that if a machine learning model is useful, it must reveal some information about the data on which it was trained [23, 24]. Our proof indicates that the information disclosed to the other party through the proposed collaborative linear regression scheme is minimal. As a result, none of the parties can leverage it to compute the exact values of each other's data, even if the attacker has unlimited computing power.

Table 1: Comparisons of related protocols

Scheme            TA   HE   YG(SMC)   CF   Security
Cock's [12]       Y    Y    N         Y    C
Gascón's [10]     Y    Y    Y         N    C
Kikuchi's [19]    N    Y    N         Y    C
Mandal's [13]     Y    Y    N         N    C
Mohassel's [14]   Y    Y    Y         N    C
Qiu's [11]        Y    Y    Y         N    C
Ours              N    N    N         Y    P

1. TA: needs a trusted or semi-trusted authority.
2. HE: uses homomorphic encryption.
3. YG(SMC): uses Yao's garbled circuits.
4. CF: uses a closed-form formula to compute the optimal parameters. (Note that some schemes do not use closed-form formulations but instead use conjugate gradient descent, which is computationally inefficient and vulnerable to message chaining attacks.)
5. Security: C means computationally secure, P means perfectly (information-theoretically) secure.
4.5. Comparisons
In this section, we compare several related schemes for federated learning of linear regression: Cock's scheme [12], Gascón's scheme [10], Kikuchi's scheme [19], Mandal's scheme [13], Mohassel's scheme [14], and Qiu's scheme [11]. They were proposed recently and address the security of federated learning for linear regression.
These schemes have several characteristics that can be compared. One is whether the scheme uses a trusted server; the assumption of a trusted server is unrealistic in some settings. Another is the use of homomorphic encryption, which is computationally intensive and can only achieve computational security. Some schemes use Yao's garbled circuits; although this general secure multiparty computation technique can realize secure computation of arbitrary functions, its computational cost is very high. Some schemes do not use the closed form of the optimal parameters but instead send gradient-related values iteratively, which may compromise data confidentiality. We incorporate these features into the comparison. Finally, we compare the security level of these schemes: some achieve information-theoretic security, while others are only computationally secure. The results of the comparison are shown in Table 1.
5. Applications of Collaborative Learning Scheme
The proposed collaborative learning method for linear regression can be viewed as a
peer-to-peer mode of collaborative learning, or decentralized federated learning. In addition
to the peer-to-peer model for collaborative learning, the proposed learning scheme can
also be used for the client-server mode of collaborative learning, or centralized federated
learning. In the client-server mode of collaborative learning, the server may or may not
own any training data. Each party sends its training data to the server in the aggregated
form. The server can then do all the computations for the users, and send the final result, i.e., the value of $\vec{\beta}$, to each user.
A user's new training data can also be sent to the server multiple times. When sending training data incrementally, care must be taken to avoid sending a batch with too few data points. For example, suppose that initially a user $A$ has collected a set of $n_1$ training data points. If $n_1 > m + 2$, then the user can send the corresponding matrix $G_1 = X_1^T X_1$ and vector $\vec{h}_1 = X_1^T \vec{y}_1$ to the server. Assume that more training data is later collected by the same user $A$. If the number of new training data points $n_2 > m + 2$, then $A$ can send another pair $G_2 = X_2^T X_2$ and $\vec{h}_2 = X_2^T \vec{y}_2$ to the server. If instead $n_2 \leq m + 2$, user $A$ should not send $G_2$ and $\vec{h}_2$ to the server, for security and privacy reasons: the number of training data points in $X_2$ and $\vec{y}_2$ is too small, and a curious server could compute $X_2$ and $\vec{y}_2$ from $G_2$ and $\vec{h}_2$.
For security, it is required that the number of training data points $n_i$ be greater than $m + 2$ for each user $i$. This is not a serious limitation because, in any practical application, the number of features is usually small, and the amount of collected data should be much larger than the number of features to build an accurate machine learning model.
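A client can enforce this requirement with a trivial guard before uploading a batch; the check below is our own illustrative sketch of that policy:

```python
def safe_to_upload(n_i: int, m: int) -> bool:
    """Only upload aggregated data when the batch holds more than m + 2 records,
    so a curious receiver cannot reconstruct the raw values (cf. Section 4.4)."""
    return n_i > m + 2
```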
In the design of the protocol, we do not make any assumptions about how the data is distributed among the parties. In particular, each party may collect its own data with a different set of features. In 2019, Yang et al. introduced a comprehensive secure federated learning framework, which includes horizontal federated learning, vertical federated learning, and federated transfer learning [5]. In our protocol, the original data can be horizontally or vertically partitioned; as long as the set of features is determined, our protocol can be applied.
Our proposed collaborative learning method can easily be implemented to compute the optimal model parameters for linear regression incrementally. Suppose that $k$ datasets are collected over a period of time:
\[
(X_1, \vec{y}_1), (X_2, \vec{y}_2), \ldots, (X_k, \vec{y}_k).
\]
Define an $(m+1) \times (m+1)$ matrix
\[
G_i = X_i^T X_i
\]
and an $(m+1) \times 1$ vector
\[
\vec{h}_i = X_i^T \vec{y}_i.
\]
The optimal model parameters $\vec{\beta}$ can be computed by the equation
\[
\vec{\beta} = G^{-1} \vec{h}, \quad \text{where} \quad G = \sum_{i=1}^{k} G_i \quad \text{and} \quad \vec{h} = \sum_{i=1}^{k} \vec{h}_i.
\]
These $k$ datasets $(X_1, \vec{y}_1), (X_2, \vec{y}_2), \ldots, (X_k, \vec{y}_k)$ can be the training data of $k$ users collected at different times. Whenever a new dataset $(X_i, \vec{y}_i)$ becomes available, its aggregates can be added to $G$ and $\vec{h}$, respectively, and better model parameters $\vec{\beta}$ can then be computed.
It is also possible to delete specified datasets, as long as the relevant aggregated data is properly preserved. For example, if we want to delete all datasets associated with a party, we need to know the aggregated data of that party. The aggregated data may also be held by that party itself. Once the party wants to delete its entire dataset, it can send the aggregated data to the server, which can then be used to deduct its data from the system.
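A sketch of this server-side bookkeeping (illustrative only; the class and method names are our own assumptions, not from the paper) keeps each party's aggregates so that a departing party's contribution can be deducted without ever accessing its raw data:

```python
import numpy as np

class FederatedLinearRegressionServer:
    """Keeps per-party aggregates G_i = X_i^T X_i and h_i = X_i^T y_i."""

    def __init__(self):
        self.contributions = {}   # party_id -> (G_i, h_i)

    def submit(self, party_id, G_i, h_i):
        """A party uploads (or replaces) its aggregated data."""
        self.contributions[party_id] = (np.asarray(G_i), np.asarray(h_i))

    def unlearn(self, party_id):
        """Machine unlearning: drop the party's aggregates from the model."""
        self.contributions.pop(party_id, None)

    def model_parameters(self):
        """Compute beta = G^{-1} h from the currently registered parties."""
        if not self.contributions:
            raise ValueError("no contributions registered")
        G = sum(G_i for G_i, _ in self.contributions.values())
        h = sum(h_i for _, h_i in self.contributions.values())
        return np.linalg.solve(G, h)
```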
6. Conclusions
We propose a simple yet effective secure multi-party computation scheme for linear
regression in a distributed environment. In the design of our scheme, we exploit the
mathematical structure of linear regression, so that any party needs to send only the
aggregated data to the server or to the other party. Therefore, security and privacy
can be well protected without encryption. Security and privacy play important roles in
collaborative learning. When data owners do not need to publish the valuable data they
collect, they are more willing to share their data to build a better model.
The proposed scheme also allows incremental addition and deletion of some datasets
used in the linear regression models. New datasets can be easily integrated into the
system to obtain new and more accurate model parameters without having to compute
from scratch. With proper bookkeeping, users' datasets can also be removed from the model without referencing the original data being removed. The deletion of datasets is useful if a party leaves the group and asks for the deletion of its dataset.
The security of our proposed collaborative computation scheme for linear regression does not need encryption. It does not depend on computationally hard problems, either. Schemes based on such problems usually require heavy computation to achieve a certain level of security. Furthermore, computationally hard problems may be broken if the attacker has enough computing power, for example a quantum computer. We proved that a curious party lacks sufficient information to obtain the exact values of the other party's data, even with unlimited computing power.
Different weights can also be incorporated into collaborative learning in our new method. This property may be useful for "customized" machine learning, in which all parties have similarities, but each party also has its own special characteristics. Handwritten character recognition for smartphone input is a good example of the need for a personalized model. In the proposed method for collaborative learning, each data owner handles its data independently, so customization can be achieved easily by giving a particular user's data more weight.
References
[1] T. E. Union, General data protection regulation (2016).
URL https://gdpr-info.eu/
[2] H. L. Seal, Studies in the history of probability and statistics. xv the historical development of the gauss linear model, Biometrika 54 (1-2) (1967) 1–24.
doi:10.1093/biomet/54.1-2.1.
URL https://doi.org/10.1093/biomet/54.1-2.1
[3] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal
Statistical Society. Series B (Methodological) 58 (1) (1996) 267–288.
URL http://www.jstor.org/stable/2346178
[4] J. Fox, Applied regression analysis, linear models, and related methods, Sage Publications, Inc., 1997.
[5] Q. Yang, Y. Liu, T. Chen, Y. Tong, Federated machine learning: Concept and applications, ACM Transactions on Intelligent Systems and Technology 10 (2) (January
2019). doi:10.1145/3298981.
URL https://doi.org/10.1145/3298981
[6] P. Paillier, Public-key cryptosystems based on composite degree residuosity classes,
in: J. Stern (Ed.), Advances in Cryptology — EUROCRYPT ’99, Springer Berlin
Heidelberg, Berlin, Heidelberg, 1999, pp. 223–238.
[7] J. H. Cheon, A. Kim, M. Kim, Y. Song, Homomorphic encryption for arithmetic of
approximate numbers, in: T. Takagi, T. Peyrin (Eds.), Advances in Cryptology –
ASIACRYPT 2017, Springer International Publishing, Cham, 2017, pp. 409–437.
[8] M. Babenko, E. Golimblevskaia, Euclidean division method for the homomorphic scheme ckks, in: 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), 2021, pp. 217–220.
doi:10.1109/ElConRus51938.2021.9396347.
[9] F. Chen, T. Xiang, X. Lei, J. Chen, Highly efficient linear regression outsourcing to a cloud, IEEE Transactions on Cloud Computing 2 (4) (2014) 499–508.
doi:10.1109/TCC.2014.2378757.
[10] A. Gascon, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur,
D. Evans, Privacy-preserving distributed linear regression on high-dimensional
data, Proceedings on Privacy Enhancing Technologies 2017 (4) (2017) 345–364.
doi:doi:10.1515/popets-2017-0053.
URL https://doi.org/10.1515/popets-2017-0053
[11] G. Qiu, X. Gui, Y. Zhao, Privacy-preserving linear regression on distributed data by
homomorphic encryption and data masking, IEEE Access 8 (2020) 107601–107613.
doi:10.1109/ACCESS.2020.3000764.
[12] M. d. Cock, R. Dowsley, A. C. Nascimento, S. C. Newman, Fast, privacy preserving
linear regression over distributed datasets based on pre-distributed data, in: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, AISec
’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 3–14.
doi:10.1145/2808769.2808774.
URL https://doi.org/10.1145/2808769.2808774
[13] K. Mandal, G. Gong, Privfl: Practical privacy-preserving federated regressions
on high-dimensional data over mobile networks, in: Proceedings of the 2019
ACM SIGSAC Conference on Cloud Computing Security Workshop, CCSW’19,
Association for Computing Machinery, New York, NY, USA, 2019, pp. 57–68.
doi:10.1145/3338466.3358926.
URL https://doi.org/10.1145/3338466.3358926
[14] P. Mohassel, Y. Zhang, Secureml: A system for scalable privacy-preserving machine
learning, in: 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 19–38.
doi:10.1109/SP.2017.12.
[15] A. C. Yao, Protocols for secure computations, in: Twenty-third Annual Symposium
on Foundations of Computer Science, 1982, pp. 160–164. doi:10.1109/SFCS.1982.38.
[16] C. Zhao, S. Zhao, M. Zhao, Z. Chen, C.-Z. Gao, H. Li, Y. an Tan, Secure multi-party
computation: Theory, practice and applications, Information Sciences 476 (2019)
357–372. doi:https://doi.org/10.1016/j.ins.2018.10.024.
URL https://www.sciencedirect.com/science/article/pii/S0020025518308338
[17] Y. Ishai, E. Kushilevitz, A. Paskin, Secure multiparty computation with minimal
interaction, in: T. Rabin (Ed.), Advances in Cryptology – CRYPTO 2010, Springer
Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 577–594.
[18] R. Ma, Y. Li, C. Li, F. Wan, H. Hu, W. Xu, J. Zeng, Secure multiparty computation for privacy-preserving drug discovery, Bioinformatics 36 (9) (2020) 2872–2880.
doi:10.1093/bioinformatics/btaa038.
URL https://doi.org/10.1093/bioinformatics/btaa038
[19] H. Kikuchi, C. Hamanaga, H. Yasunaga, H. Matsui, H. Hashimoto, C.I. Fan, Privacy-preserving multiple linear regression of vertically partitioned
real medical datasets, Journal of Information Processing 26 (2018) 638–647.
doi:10.2197/ipsjjip.26.638.
[20] L. Zhu, Z. Liu, S. Han, Deep leakage from gradients, CoRR abs/1906.08935 (2019).
arXiv:1906.08935.
URL http://arxiv.org/abs/1906.08935
[21] G. Xu, H. Li, S. Liu, K. Yang, X. Lin, Verifynet: Secure and verifiable federated
learning, IEEE Transactions on Information Forensics and Security 15 (2020) 911–
926. doi:10.1109/TIFS.2019.2929409.
[22] J. B. Ramsey, Tests for specification errors in classical linear least-squares regression
analysis, Journal of the Royal Statistical Society: Series B (Methodological) 31 (2)
(1969) 350–371. doi:https://doi.org/10.1111/j.2517-6161.1969.tb00796.x.
[23] C. Dwork, M. Naor, On the difficulties of disclosure prevention in statistical databases
or the case for differential privacy, Journal of Privacy and Confidentiality 2 (1)
(September 2010). doi:10.29012/jpc.v2i1.585.
[24] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends Theoretical Computer Science 9 (3–4) (2014) 211–407.
doi:10.1561/0400000042.