Additional File 3: A mathematical summary of the KMLA algorithm

The Kernel Multitask Latent Analysis (KMLA) algorithm [1, 2] is used here with minor changes. Briefly, KMLA uses a kernel function to transform the feature space into a symmetric, positive-definite similarity matrix. Learning occurs via a PLS-like algorithm on the kernel matrix rather than on the original features.

Denote the original feature matrix by $X \in \mathbb{R}^{n \times m}$ and the responses by $Y \in \mathbb{R}^{n \times k}$ for $k$ tasks. Let a single subscript denote a column of a matrix (e.g., $Y_g$) or a single entry of a row vector. Let the superscript $T$ denote transpose, let $I$ denote the identity matrix, and let $[A, B]$ denote the concatenation of matrices $A$ and $B$.

The algorithm consists of applying a kernel function to $X$, thereby creating a kernel matrix $K \in \mathbb{R}^{n \times n}$. Next, columns $i = 1, 2, \ldots, z$ of a matrix of linear orthogonal latent variables, $T \in \mathbb{R}^{n \times z}$, are iteratively generated from $K$, with $z \le n$. The goal is to generate $T$ in such a way that it is a linear projection of $K$ into a reduced subspace and the loss function

$L = \sum_{g=1}^{k} \sum_{i=1}^{n} \omega_{i,g} \, f(Y_{i,g}, F_{i,g})$

is minimized. Here $F = TC$, where $F \in \mathbb{R}^{n \times k}$ is the matrix of predicted values, $C \in \mathbb{R}^{z \times k}$ is a matrix of coefficients, and $\omega_{i,g} = 0$ if $Y_{i,g}$ is missing and $\omega_{i,g} = 1$ otherwise.

The loss function can vary between tasks. For linear regression on task $g$, $f(Y_{i,g}, F_{i,g}) = (Y_{i,g} - F_{i,g})^2$. Other loss functions could be used if desired. For binary classification on task $g$, targets are labeled as +1 and -1 and an exponential loss function is used, $f(Y_{i,g}, F_{i,g}) = \beta_i \exp(-Y_{i,g} F_{i,g})$. The weights $\beta_i$ allow cost-sensitive learning and can be based on the relative frequency of the positive and negative labels.

After learning in the subspace is completed, the matrix of PLS coefficients, $C$, is transformed to a matrix of kernel coefficients, $B$, so that predictions can be calculated as $F = \mu + KB$, where $\mu$ is a vector of coefficients for a constant hypothesis. Details of the algorithm are given below.

Algorithm 1:
1. Compute the vector $\mu \in \mathbb{R}^{1 \times k}$ of optimal coefficients for a constant hypothesis. For the linear regression case, $\mu_g = \bar{Y}_g$ for $g = 1, 2, \ldots, k$. For binary classification, $\mu_g = \frac{1}{2} \log \left( \sum_{i \in C^{+}} \beta_i \big/ \sum_{i \in C^{-}} \beta_i \right)$ for $g = 1, 2, \ldots, k$, where $C^{+}$ is the set of positive-label points and $C^{-}$ is the set of negative-label points.
2. Let $D \in \mathbb{R}^{n \times k}$ be the matrix of gradients of $L$. For linear regression on task $g$ and a constant hypothesis, $D_g = Y_g - \mu_g$. For classification, $D_g = \exp(-Y_g \mu_g) Y_g$ (elementwise).
3. Let $K_0 = k(X, X)$ be the initial kernel computed from the kernel function $k$. For a linear kernel, $K_0 = XX^T$. Let $K_1 = (I - \frac{1}{n} ee^T) K_0 (I - \frac{1}{n} ee^T)$ be the centered kernel, where $e$ is an $n \times 1$ vector of ones.
4. Let $T$, $U$, and $H$ be empty matrices and let $A$ be the $n \times n$ identity matrix.
5. For $q = 1$ to $z$:
   a. Compute the weight vector $s \in \mathbb{R}^{1 \times k}$. For task $g$: $s_g = 1 - \|D_g\| \big/ \sum_{j=1}^{k} \|D_j\|$.
   b. Compute the latent factor: $t_q = K_q D s^T$; $t_q \leftarrow t_q / \|t_q\|$; $T \leftarrow [T, t_q]$.
   c. Compute $u_q = D s^T$; $U \leftarrow [U, u_q]$.
   d. Deflate the kernel matrix: $K_{q+1} = (I - t_q t_q^T) K_q (I - t_q t_q^T)$.
   e. Compute $(\mu_g, C_g) = \arg\min_{(\mu_g, C_g)} L(Y_g, \mu_g e + T C_g)$ for each task $g$.
   f. Compute the matrix of gradients $D$ at the current $(\mu_g, C_g)$ for each task $g$ (as in Step 2, with $F_g = \mu_g e + T C_g$).
   g. If $q > 1$, let $A \leftarrow A (I - t_{q-1} t_{q-1}^T)$.
   h. Compute $H \leftarrow [H, A u_q]$.
   i. Let $B$ be an empty matrix.
   j. For $p = 1$ to $k$:
      i. Compute $B_p = H (T^T K_1 H)^{-1} C_p$ and let $B \leftarrow [B, B_p]$.
   k. The final prediction using $q$ latent variables is $F^q = \mu + K_1 B$. In the case of classification on task $g$, a cutoff value (default equal to zero) is used to separate the two classes.
   l. For predictions on new data $X^{new} \in \mathbb{R}^{r \times m}$, compute the kernel $K_0^{new} = k(X, X^{new})$ and center it: $K_1^{new} = (K_0^{new} - \frac{1}{n} e_{new} e^T K_0)(I - \frac{1}{n} ee^T)$, where $e_{new}$ is an $r \times 1$ vector of ones. Predictions using $q$ latent variables are calculated as $F_{new}^q = \mu + K_1^{new} B$. Again, a cutoff value is used for separation into binary classes in the case of classification.
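As an illustration of how the steps fit together, the following Python sketch implements Algorithm 1 for the special case in which every task uses linear regression and the kernel is linear, so that the minimization in Step 5e reduces to ordinary least squares on $[e, T]$ and Step 5l reduces to a linear-kernel computation on the new points. The function names (kmla_fit, kmla_predict), the row orientation of $K_0^{new}$ as an $r \times n$ matrix, and the guard for the single-task case are illustrative assumptions rather than part of the published algorithm; classification tasks with the exponential loss would require an iterative solver in Step 5e and are omitted here.

import numpy as np


def kmla_fit(X, Y, z):
    """Sketch of Algorithm 1 for linear-regression tasks with a linear kernel.

    X: (n, m) feature matrix; Y: (n, k) responses; z: number of latent
    variables. Returns (mu, B, K1) so that in-sample predictions are
    mu + K1 @ B (Step 5k).
    """
    n, k = Y.shape
    e = np.ones((n, 1))

    # Step 1: optimal constant hypothesis (column means under squared loss).
    mu = Y.mean(axis=0, keepdims=True)                  # (1, k)

    # Step 2: gradient matrix at the constant fit.
    D = Y - mu                                          # (n, k)

    # Step 3: linear kernel and double centering.
    center = np.eye(n) - e @ e.T / n
    K1 = center @ (X @ X.T) @ center
    Kq = K1.copy()

    # Step 4: empty matrices T, U, H and A = I.
    T = np.zeros((n, 0))
    U = np.zeros((n, 0))
    H = np.zeros((n, 0))
    A = np.eye(n)
    C = np.zeros((0, k))

    for q in range(z):
        # Step 5a: weights from the column norms of D (guard added for k = 1).
        norms = np.linalg.norm(D, axis=0)
        s = 1.0 - norms / norms.sum() if k > 1 else np.ones(1)

        # Step 5b: latent factor from the deflated kernel, normalized.
        t = Kq @ D @ s.reshape(-1, 1)
        t /= np.linalg.norm(t)
        T = np.hstack([T, t])

        # Step 5c: u_q = D s^T.
        u = D @ s.reshape(-1, 1)
        U = np.hstack([U, u])

        # Step 5d: deflate the kernel matrix.
        proj = np.eye(n) - t @ t.T
        Kq = proj @ Kq @ proj

        # Step 5e: for squared loss, the arg min over (mu_g, C_g) is ordinary
        # least squares of each response column on [e, T].
        coef, *_ = np.linalg.lstsq(np.hstack([e, T]), Y, rcond=None)
        mu, C = coef[:1, :], coef[1:, :]

        # Step 5f: gradients at the current fit.
        D = Y - (e @ mu + T @ C)

        # Steps 5g-5h: accumulate A and append A u_q to H.
        if q > 0:
            t_prev = T[:, [q - 1]]
            A = A @ (np.eye(n) - t_prev @ t_prev.T)
        H = np.hstack([H, A @ u])

    # Steps 5i-5j: kernel coefficients B = H (T^T K1 H)^{-1} C.
    B = H @ np.linalg.solve(T.T @ K1 @ H, C)
    return mu, B, K1


def kmla_predict(X_train, X_new, mu, B):
    """Sketch of Step 5l for a linear kernel: center the test kernel against
    the training kernel and return mu + K1_new @ B. K0_new is oriented as an
    (r, n) matrix so that rows index the new points (an assumption)."""
    n, r = X_train.shape[0], X_new.shape[0]
    e, e_new = np.ones((n, 1)), np.ones((r, 1))
    K0 = X_train @ X_train.T
    K0_new = X_new @ X_train.T
    K1_new = (K0_new - e_new @ (e.T @ K0) / n) @ (np.eye(n) - e @ e.T / n)
    return mu + K1_new @ B

With these helpers, training-set predictions can be computed as mu + K1 @ B and predictions on new rows as kmla_predict(X, X_new, mu, B).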
Algorithm 1 contains two changes from the algorithm proposed by Xiang and Bennett [2]. First, the original algorithm did not specify how the weight vector $s$ was to be calculated in Step 5a. Here, weights are calculated from the norms of the column vectors of the gradient matrix $D$. This compensates for the large relative differences in gradient magnitude that can occur between columns of $D$ when some tasks employ linear regression and others employ classification, or when large differences in gradient magnitude exist for other reasons. The second change relates to the way in which the matrix of kernel coefficients $B$ is calculated: Steps 5g to 5j were developed as a substitute for the method originally proposed.

In order to make predictions using the original centered kernel matrix, the matrix of latent feature coefficients, $C$, must be transformed to a matrix of kernel coefficients, $B$. To see how this can be accomplished, first consider a non-kernel PLS-like regression algorithm [1] in which a single task is modeled.

Algorithm 2:
1. Calculate $\mu_g$ as in Algorithm 1.
2. Compute $D$ and $U$ as in Algorithm 1. Let $W$ and $P$ be empty matrices.
3. For $q = 1$ to $z$:
   a. Calculate $w_q = X_q^T u_q$; $W \leftarrow [W, w_q]$.
   b. Similar to Algorithm 1, calculate $t_q = X_q w_q$; $t_q \leftarrow t_q / \|t_q\|$; $T \leftarrow [T, t_q]$.
   c. Deflate the data matrix: $p_q = X_q^T t_q$; $X_{q+1} = X_q - t_q p_q^T$; $P \leftarrow [P, p_q]$.
   d. Compute $(\mu_g, C_g) = \arg\min_{(\mu_g, C_g)} L(Y_g, \mu_g e + T C_g)$.
   e. Compute $D$ and $U$.
   f. Compute $F = X_1 g$ using the centered data matrix, $X_1$, as described below.

Because the columns of $T$ are orthonormal, $T^T T = I$, and so $P = X_1^T T$, i.e., $P^T = T^T X_1$. Therefore:

$X_1 W = T T^T X_1 W = T P^T W$, and hence $T = X_1 W (P^T W)^{-1}$.   (1)

Let $g = W (P^T W)^{-1} C$ so that $F = TC = X_1 g$.

Algorithm 2 can be turned into a kernel version similar to Algorithm 1 by noting that

$F = TC = X_1 W (P^T W)^{-1} C = X_1 X_1^T U (P^T W)^{-1} C = k(X_1, X_1) U (P^T W)^{-1} C$   (2)

for a linear kernel. Note also that $t_q = X_q w_q = X_q X_q^T u_q = k(X_q, X_q) u_q$ for a linear kernel. Unlike Algorithm 1, however, where the kernel matrix is deflated, in Algorithm 2 the data matrix is deflated (Step 3c). The deflation step depends on $t_q$ (Step 3b), which in turn depends on $w_q$. But in a kernel version, $W$ is not explicitly calculated. Therefore, to use a kernel version, a new expression for $X_1 g$ is needed. Because $w_q = X_q^T u_q$ contains the deflated data matrix, $g$ must take deflation into account. Note that

$X_2^T = X_1^T (I - t_1 t_1^T)$
$X_3^T = X_2^T (I - t_2 t_2^T) = X_1^T (I - t_1 t_1^T)(I - t_2 t_2^T)$   (3)
$X_4^T = X_1^T (I - t_1 t_1^T)(I - t_2 t_2^T)(I - t_3 t_3^T)$

and so on. Thus, to create a kernel version, let $A \leftarrow A (I - t_{q-1} t_{q-1}^T)$ for $q > 1$ (see Step 5g of Algorithm 1), so that $X_q^T = X_1^T A$. Then:

$X_1 g = X_1 W (P^T W)^{-1} C$
$= X_1 X_1^T A U (T^T X_1 X_1^T A U)^{-1} C$   (4)
$= k(X_1, X_1) A U (T^T k(X_1, X_1) A U)^{-1} C$.

The expression for $B$ is then $A U (T^T K_1 A U)^{-1} C$, which is what we wanted to derive.

To choose an optimal cutoff value for classifying predictions (Step 5k of Algorithm 1), a modification of the correct classification rate was used as a fitness function. Denote by $a$ the fraction of positive labels predicted correctly for a given cutoff, and denote by $b$ the fraction of negative labels predicted correctly. Calculate the score, $s$, as $s = a + b - c$, where $c = 0$ if $\min(a, b) \ge 0.65$ and $c = |a - b|$ otherwise. This modification penalizes cutoff values that lead to low values of $a$ or $b$ and/or a large difference between $a$ and $b$.
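A minimal sketch of this fitness score, assuming that predictions above the cutoff are assigned to the +1 class (the text does not specify this convention) and using an illustrative function name and grid search:

import numpy as np

def cutoff_score(y_true, f_pred, cutoff, min_rate=0.65):
    """Score s = a + b - c for a candidate cutoff: a and b are the fractions
    of +1 and -1 labels classified correctly, and the penalty c = |a - b|
    applies whenever min(a, b) falls below min_rate (0.65 in the text)."""
    y_true = np.asarray(y_true)
    y_hat = np.where(np.asarray(f_pred) > cutoff, 1, -1)   # assumed convention
    a = np.mean(y_hat[y_true == 1] == 1)     # positive labels predicted correctly
    b = np.mean(y_hat[y_true == -1] == -1)   # negative labels predicted correctly
    c = 0.0 if min(a, b) >= min_rate else abs(a - b)
    return a + b - c

# The cutoff can then be chosen by scanning candidate values, for example:
# candidates = np.linspace(f_pred.min(), f_pred.max(), 101)
# best_cutoff = max(candidates, key=lambda t: cutoff_score(y_true, f_pred, t))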
References
1. Momma M, Bennett K: Constructing Orthogonal Latent Features for Arbitrary Loss. In Feature Extraction: Foundations and Applications. Edited by Guyon I, et al. New York, NY: Springer Berlin Heidelberg; 2007.
2. Xiang Z, Bennett K: Inductive transfer using kernel multitask latent analysis. http://iitrl.acadiau.ca/itws05/Papers/ITWS17-XiangBennett_REV.pdf.