
Additional File 3: A mathematical summary of the KMLA algorithm
The Kernel Multitask Latent Analysis (KMLA) algorithm [1, 2] is used here with minor changes. Briefly, KMLA uses a kernel function to transform the feature space into a symmetric, positive-definite similarity matrix. Learning occurs via a PLS-like algorithm on the kernel matrix, rather than on the original features. Denote the original feature matrix by $X \in \mathbb{R}^{n \times m}$ and the responses by $Y \in \mathbb{R}^{n \times k}$ for $k$ tasks. Let a single subscript denote a column of a matrix (e.g., $Y_g$) or a single entry of a row vector. Let the superscript $T$ denote transpose, denote the identity matrix by $I$, and let $[A, B]$ denote the concatenation of matrices $A$ and $B$. The algorithm consists of applying a kernel function to $X$, thereby creating a kernel matrix $K \in \mathbb{R}^{n \times n}$. Next, columns $i = 1, 2, \ldots, z$ of a matrix of linear orthogonal latent variables, $T \in \mathbb{R}^{n \times z}$, are iteratively generated from $K$, with $z \le n$. The goal is to generate $T$ in such a way that it is a linear projection of $K$ into a reduced subspace and the loss function $L = \sum_{g=1}^{k} \sum_{i=1}^{n} \delta_{i,g}\, f_{YF}(Y_{i,g}, F_{i,g})$ is minimized. Here, $F = TC$, where $F \in \mathbb{R}^{n \times k}$ is a matrix of predicted values, $C \in \mathbb{R}^{z \times k}$ is a matrix of coefficients, and $\delta_{i,g} = 0$ if $Y_{i,g}$ is missing and $\delta_{i,g} = 1$ otherwise.
The loss function can vary between tasks. For linear regression on task $g$, $f_{YF} = (Y_{i,g} - F_{i,g})^2$. Other loss functions could be used if desired. For binary classification on task $g$, targets are labeled as +1 and −1 and an exponential loss function is used, $f_{YF} = \omega_i \exp(-Y_{i,g} F_{i,g})$. The weights $\omega_i$ allow cost-sensitive learning and can be based on the relative frequency of the positive and negative labels. After learning in the subspace is completed, the matrix of PLS coefficients, $C$, is transformed to a matrix of kernel coefficients, $B$, such that predictions can be calculated as $F = \mu + KB$, where $\mu$ is a vector of coefficients for a constant hypothesis. Details of the algorithm are given below.
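To make the two per-entry losses and the missing-value mask concrete, a minimal NumPy sketch is given below (the function names, the NaN encoding of missing responses, and the per-task boolean flag are illustrative choices, not part of the original description):

import numpy as np

def squared_loss(Y, F):
    # Linear-regression loss per entry: (Y_ig - F_ig)^2
    return (Y - F) ** 2

def exponential_loss(Y, F, w):
    # Binary-classification loss per entry with labels in {+1, -1}:
    # w_i * exp(-Y_ig * F_ig); the weights w allow cost-sensitive learning
    return w[:, None] * np.exp(-Y * F)

def total_loss(Y, F, task_is_classification, w):
    # delta_ig = 0 where Y_ig is missing (encoded as NaN here), 1 otherwise
    delta = ~np.isnan(Y)
    Y0 = np.nan_to_num(Y)
    per_entry = np.where(task_is_classification[None, :],
                         exponential_loss(Y0, F, w),
                         squared_loss(Y0, F))
    return np.sum(per_entry * delta)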
Algorithm 1:
1. Compute the vector $\mu \in \mathbb{R}^{1 \times k}$ of optimal coefficients for a constant hypothesis. For the linear regression case, $\mu_g = \bar{Y}_g$ (the mean of $Y_g$) for $g = 1, 2, \ldots, k$. For binary classification, $\mu_g = \frac{1}{2} \log\!\left( \sum_{i \in C^{+}} \omega_i \big/ \sum_{i \in C^{-}} \omega_i \right)$ for $g = 1, 2, \ldots, k$, where $C^{+}$ is the set of positive-label points and $C^{-}$ is the set of negative-label points.
2. Let $D \in \mathbb{R}^{n \times k}$ be a matrix of gradients of $L$. For linear regression on task $g$ and a constant hypothesis, $D_g = Y_g - \mu_g$. For classification, $D_g = \exp(-Y_g \mu_g)\, Y_g$, taken elementwise.
3. Let $K_0 = k(X, X)$ be the initial kernel computed from the kernel function $k$. For a linear kernel, $K_0 = XX^T$. Let $K_1 = (I - \frac{1}{n} ee^T)\, K_0\, (I - \frac{1}{n} ee^T)$ be the centered kernel, where $e$ is an $n \times 1$ vector of ones.
4. Let $T$, $U$, and $H$ be empty matrices and let $A$ be an $n \times n$ identity matrix.
5. For $q = 1$ to $z$:
a. Compute the weight vector $s \in \mathbb{R}^{1 \times k}$. For task $g$: $s_g = 1 - \dfrac{\lVert D_g \rVert}{\sum_{j=1}^{k} \lVert D_j \rVert}$.
b. Compute the latent factor: $t_q = K_q D s^T$; $t_q = t_q / \lVert t_q \rVert$; $T = [T, t_q]$.
c. Compute $u_q = D s^T$; $U = [U, u_q]$.
d. Deflate the kernel matrix: $K_{q+1} = (I - t_q t_q^T)\, K_q\, (I - t_q t_q^T)$.
e. Compute $(\mu_g, C_g) = \arg\min_{(\mu_g, C_g)} L(Y_g, \mu_g e + T C_g)$ for each task $g$.
f. Compute the matrix of gradients: for each task $g$, $D_g = -\dfrac{\partial L}{\partial F_g}$ evaluated at the current fit $(\mu_g, C_g)$.
g. If $q > 1$, let $A = A(I - t_{q-1} t_{q-1}^T)$.
h. Compute $H = [H, A u_q]$.
i. Let $B$ be an empty matrix.
j. For $p = 1$ to $k$: compute $B_p = H (T^T K_1 H)^{-1} C_p$ and let $B = [B, B_p]$.
k. The final prediction using $q$ latent variables is $F^q = \mu + K_1 B$. In the case of classification on task $g$, a cutoff value (default equal to zero) is used to separate the two classes.
l. For predictions on new data $X_{new} \in \mathbb{R}^{r \times m}$, compute the kernel between the new and training points, $K_0^{new} = k(X_{new}, X) \in \mathbb{R}^{r \times n}$, and center it: $K_1^{new} = (K_0^{new} - \frac{1}{n} e_{new} e^T K_0)(I - \frac{1}{n} ee^T)$, where $e_{new}$ is an $r \times 1$ vector of ones. Predictions using $q$ latent variables are calculated as $F_{new}^q = \mu + K_1^{new} B$. Again, a cutoff value is used for separation into binary classes in the case of classification.
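For illustration, the sketch below implements Algorithm 1 in NumPy for the special case in which every task uses the squared (regression) loss, no responses are missing, and the kernel is linear. The function names, the closed-form refit in step 5.e, and the vectorized form of step 5.j are implementation choices of this sketch, not part of the original description.

import numpy as np

def kmla_regression(X, Y, z):
    # Minimal sketch of Algorithm 1: all tasks use squared loss, no missing
    # responses, linear kernel. Variable names follow the text.
    n, k = Y.shape
    e = np.ones((n, 1))
    mu = Y.mean(axis=0, keepdims=True)            # step 1: constant hypothesis
    D = Y - mu                                    # step 2: gradients (residuals)
    K0 = X @ X.T                                  # step 3: linear kernel
    J = np.eye(n) - e @ e.T / n
    K1 = J @ K0 @ J                               # centered kernel
    Kq = K1.copy()                                # step 4
    T = np.zeros((n, 0)); U = np.zeros((n, 0)); H = np.zeros((n, 0))
    A = np.eye(n); t_prev = None
    for q in range(z):                            # step 5
        norms = np.linalg.norm(D, axis=0)         # 5a: task weights from column norms of D
        s = 1.0 - norms / norms.sum()
        t = Kq @ D @ s                            # 5b: latent factor
        t = t / np.linalg.norm(t)
        T = np.hstack([T, t[:, None]])
        u = D @ s                                 # 5c
        U = np.hstack([U, u[:, None]])
        P = np.eye(n) - np.outer(t, t)            # 5d: deflate the kernel
        Kq = P @ Kq @ P
        C = T.T @ (Y - mu)                        # 5e: refit of C (projection onto the
                                                  #     orthonormal, zero-mean latent factors)
        D = Y - (mu + T @ C)                      # 5f: new gradients (residuals)
        if t_prev is not None:                    # 5g: accumulate the deflation product
            A = A @ (np.eye(n) - np.outer(t_prev, t_prev))
        H = np.hstack([H, (A @ u)[:, None]])      # 5h
        t_prev = t
        B = H @ np.linalg.solve(T.T @ K1 @ H, C)  # 5i, 5j: kernel coefficients
    F = mu + K1 @ B                               # 5k: in-sample predictions
    return mu, B, K0, F

def kmla_predict(X, X_new, mu, B, K0):
    # Step 5l for a linear kernel: predictions for the r rows of X_new.
    n = K0.shape[0]
    r = X_new.shape[0]
    e = np.ones((n, 1)); e_new = np.ones((r, 1))
    K0_new = X_new @ X.T                          # r x n kernel vs. training points
    K1_new = (K0_new - e_new @ e.T @ K0 / n) @ (np.eye(n) - e @ e.T / n)
    return mu + K1_new @ B

Calling kmla_regression(X, Y, z) followed by kmla_predict(X, X_new, mu, B, K0) mirrors steps 1 through 5.l for this restricted setting.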
Algorithm 1 contains two changes from the one proposed by Xiang and Bennett [2]. First, the original algorithm did not specify how the weight vector $s$ was to be calculated in step 5.a. Here, weights are calculated from the norms of the column vectors of the gradient matrix $D$. This compensates for the large relative differences in gradient magnitude that can occur between columns of $D$ when some tasks employ linear regression and others employ classification, or when large differences in gradient magnitude exist for other reasons.
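In code, this weighting amounts to the following (a hypothetical helper mirroring step 5.a; the norm is the Euclidean norm of each column of $D$):

import numpy as np

def task_weights(D):
    # s_g = 1 - ||D_g|| / sum_j ||D_j||: tasks whose gradient columns have
    # relatively large norms receive smaller weights
    norms = np.linalg.norm(D, axis=0)
    return 1.0 - norms / norms.sum()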
The second change relates to the way in which the matrix of kernel coefficients $B$ is calculated. Steps 5g to 5j were developed as a substitute for the method originally proposed. In order to make predictions using the original centered kernel matrix, the matrix of latent feature coefficients, $C$, must be transformed to a matrix of kernel coefficients, $B$. To see how this can be accomplished, first consider a non-kernel PLS-like regression algorithm [1] in which a single task is modeled.
Algorithm 2:
1. Calculate $\mu_g$ as in Algorithm 1.
2. Compute $D$ and $U$ as in Algorithm 1. Let $W$ and $P$ be empty matrices.
3. For $q = 1$ to $z$:
a. Calculate $w_q = X_q^T u_q$; $W = [W, w_q]$.
b. Similar to Algorithm 1, calculate $t_q = X_q w_q$; $t_q = t_q / \lVert t_q \rVert$; $T = [T, t_q]$.
c. Deflate the data matrix: $p_q = X_q^T t_q$; $X_{q+1} = X_q - t_q p_q^T$; $P = [P, p_q]$.
d. Compute $(\mu_g, C_g) = \arg\min_{(\mu_g, C_g)} L(Y_g, \mu_g e + T C_g)$.
e. Compute $D$ and $U$.
f. Compute $F = X_1 g$ using the centered data matrix, $X_1$, as described below (here $g$ denotes the vector of coefficients defined below, not a task index).
Because the columns of $T$ are orthonormal, $T^T T = I$ and $P = X_1^T T$; moreover, each column of $X_1 W$ lies in the span of the columns of $T$, so that $T P^T W = T T^T X_1 W = X_1 W$. Therefore:

$T = X_1 W (P^T W)^{-1}$.    (1)

Let $g = W (P^T W)^{-1} C$ so that $F = TC = X_1 g$.
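Equation (1) can be checked numerically. The sketch below runs the iteration of Algorithm 2 for a single regression task on random data (an assumption made purely for the check, with $u_q$ taken to be the current gradient column) and verifies that $T = X_1 W (P^T W)^{-1}$ and $F = TC = X_1 g$:

import numpy as np

rng = np.random.default_rng(0)
n, m, z = 30, 8, 4
X1 = rng.normal(size=(n, m))
X1 = X1 - X1.mean(axis=0)                    # centered data matrix
y = rng.normal(size=n)

mu = y.mean()
D = y - mu                                   # gradient (residual) for squared loss
Xq = X1.copy()
T = np.zeros((n, 0)); W = np.zeros((m, 0)); P = np.zeros((m, 0))
for q in range(z):
    u = D                                    # single task: u_q is the gradient itself
    w = Xq.T @ u
    W = np.hstack([W, w[:, None]])
    t = Xq @ w
    t = t / np.linalg.norm(t)
    T = np.hstack([T, t[:, None]])
    p = Xq.T @ t
    P = np.hstack([P, p[:, None]])
    Xq = Xq - np.outer(t, p)                 # deflation: X_{q+1} = X_q - t_q p_q^T
    C = T.T @ (y - mu)                       # refit for squared loss (orthonormal T)
    D = y - (mu + T @ C)

g = W @ np.linalg.solve(P.T @ W, C)
print(np.allclose(T, X1 @ W @ np.linalg.inv(P.T @ W)))   # Eq. (1)
print(np.allclose(T @ C, X1 @ g))                        # F = TC = X_1 g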
1
Algorithm 2 can be turned into a kernel version similar to Algorithm 1 by noting that
F  TC
 X1W  PT W  C
1
(2)
 X1X1 U  PT W  C
1
T
 k ( X1 , X1 )U  PT W  C
1
for a linear kernel. Note also that t q  Xq wq  Xq Xq uq  k (X, X)uq for a linear kernel.
Unlike Algorithm 1, however, where the kernel matrix is deflated, it is the data matrix that is deflated in Algorithm 2 (step 3.c). The deflation step depends on $t_q$ (step 3.b), which in turn depends on $w_q$. But in a kernel version, $W$ is not explicitly calculated. Therefore, to use a kernel version, a new expression for $X_1 g$ is needed. Because $w_q = X_q^T u_q$ contains the deflated data matrix, $g$ must take deflation into account. Note that

$X_2^T = X_1^T (I - t_1 t_1^T)$
$X_3^T = X_2^T (I - t_2 t_2^T) = X_1^T (I - t_1 t_1^T)(I - t_2 t_2^T)$    (3)
$X_4^T = X_1^T (I - t_1 t_1^T)(I - t_2 t_2^T)(I - t_3 t_3^T)$

and so on. Thus, to create a kernel version, let $A = A(I - t_{q-1} t_{q-1}^T)$ for $q > 1$ (see step 5.g of Algorithm 1). Then:
$X_1 g = X_1 W (P^T W)^{-1} C$
$= X_1 X_1^T A U\, (T^T X_1 X_1^T A U)^{-1} C$    (4)
$= k(X_1, X_1)\, A U\, (T^T k(X_1, X_1)\, A U)^{-1} C$,

since, by Eq. (3), $w_q = X_1^T A u_q$ (so that $W = X_1^T H$, with $H$ collecting the columns $A u_q$ as in step 5.h) and $P^T = T^T X_1$. The expression for $B$ is then $A U (T^T K_1 A U)^{-1} C$, that is, $H (T^T K_1 H)^{-1} C$ in the notation of Algorithm 1, which is what we wanted to derive.
To choose an optimal cutoff value for classifying predictions (step 5.k of Algorithm 1), a modification of the correct classification rate was used as a fitness function. Denote by $a$ the fraction of positive labels predicted correctly for a given cutoff, and denote by $b$ the fraction of negative labels predicted correctly. Calculate the score $s$ as follows: $s = a + b - c$, where $c = 0$ if $\min(a, b) \ge 0.65$ and $c = |a - b|$ otherwise. This modification penalizes cutoff values that lead to low values of $a$ or $b$ and/or a large difference between $a$ and $b$.
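A sketch of this fitness function is given below (the 0.65 threshold is taken from the text; calling predictions above the cutoff positive, and maximizing the score over a grid of candidate cutoffs, are implementation choices):

import numpy as np

def cutoff_score(y_true, f_pred, cutoff):
    # y_true contains labels in {+1, -1}; predictions above the cutoff are positive
    pred = np.where(f_pred > cutoff, 1, -1)
    a = np.mean(pred[y_true == 1] == 1)      # fraction of positives predicted correctly
    b = np.mean(pred[y_true == -1] == -1)    # fraction of negatives predicted correctly
    c = 0.0 if min(a, b) >= 0.65 else abs(a - b)
    return a + b - c                         # score to be maximized over candidate cutoffs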
References
1. Momma M, Bennett K: Constructing Orthogonal Latent Features for Arbitrary Loss. In Feature Extraction: Foundations and Applications. Edited by Guyon I, et al. New York, NY: Springer Berlin Heidelberg; 2007.
2. Xiang Z, Bennett K: Inductive transfer using kernel multitask latent analysis. http://iitrl.acadiau.ca/itws05/Papers/ITWS17-XiangBennett_REV.pdf.