II: Proof of Theorem 1.

Supplementary Material I: Derivation for (6). 𝑓(𝐖), according to its definition in Case II, is 2 −1 −1 𝑇 𝑓(𝐖) = ∑𝐾 𝑘=1‖𝐲𝑘 − 𝐗 𝑘 𝐰𝑘 ‖2 + 𝜆1 ‖𝐖‖1 + 𝜆2 (𝑄𝑙𝑜𝑔|𝛀| + 𝐾𝑙𝑜𝑔|𝚽| + 𝑡𝑟(𝚽 𝐖𝛀 𝐖 )). (A-1) ̃ | (𝜍𝐾 − 𝛡𝑇𝐾 𝛀 ̃ −1 𝛡𝐾 ). Then, Furthermore, |𝛀| = |𝛀 𝑄𝑙𝑜𝑔|𝛀| + 𝐾𝑙𝑜𝑔|𝚽| ̃ | + 𝑄𝑙𝑜𝑔(𝜍𝐾 − 𝛡𝑇𝐾 𝛀 ̃ −1 𝛡𝐾 ) + (𝐾 − 1)𝑙𝑜𝑔|𝚽| + 𝑙𝑜𝑔|𝚽| = 𝑄𝑙𝑜𝑔|𝛀 ̃ | + (𝐾 − 1)𝑙𝑜𝑔|𝚽| + 𝑙𝑜𝑔 {(𝜍𝐾 − 𝛡𝑇𝐾 𝛀 ̃ −1 𝛡𝐾 )𝑄 |𝚽|} = 𝑄𝑙𝑜𝑔|𝛀 ̃ | + (𝐾 − 1)𝑙𝑜𝑔|𝚽| + 𝑙𝑜𝑔|𝚺𝐾 |. = 𝑄𝑙𝑜𝑔|𝛀 (A-2) 𝚺𝐾 is defined in (10). Also, 𝑡𝑟(𝚽 −1 𝐖𝛀−1 𝐖 𝑇 ) ̃ −1 𝑇 ̃ −1 𝐾𝛀 ̃ −1 + 𝛀 𝛡𝐾𝑇 𝛡−1 𝛀 ̃ 𝜍 −𝛡 𝛀 𝛡𝐾 𝐾 𝐾 ̃ , 𝐰𝐾 ) [ = 𝑡𝑟 (𝚽 −1 (𝐖 𝑇 ̃ −1 𝛡 𝛀 − 𝜍 −𝛡𝐾𝑇 𝛀̃−1 𝛡 𝐾 𝐾 𝐾 ̃ −1 𝛡𝐾 𝛀 𝑇 ̃ −1 𝛡 𝐾 −𝛡𝐾 𝛀 𝐾 −𝜍 1 ̃ −1 𝜍𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 𝑇 ̃ , 𝐰𝐾 ) ). ] (𝐖 (A-3) Expanding the block matrix multiplication within the trace and simplifying the result, we can get: ̃ −1 𝐖 ̃𝛀 ̃ 𝑇 ) + (𝐰𝐾 − 𝛍𝐾 )𝑇 𝚺𝐾−1 (𝐰𝐾 − 𝛍𝐾 ). 𝑡𝑟(𝚽 −1 𝐖𝛀−1 𝐖 𝑇 ) = 𝑡𝑟(𝚽 −1 𝐖 (A-4) 𝛍𝐾 is defined in (9). Inserting (A-4) and (A-2) into (A-1) and re-organizing the terms, (6) can be obtained. II: Proof of Theorem 1. (5) is a convex optimization, which can be solved by a Block Coordinate Descent (BCD) algorithm. We consider two coordinates in our problem, old domains 1, … , 𝐾 − 1 as a whole and 1 the new domain, respectively. Then, BCD works by alternately optimizing each coordinate. Specifically, at the 𝑛-the iteration, 𝑛 = 1,2,3, …, BCD solves the following two optimizations: (𝑛) ̃ (𝑛−1) , 𝐰𝐾 )), 𝐰𝐾 = argmin 𝑓 ((𝐖 (A-5) ̃ (𝑛) = argmin 𝑓 ((𝐖 ̃ , 𝐰𝐾(𝑛) )). 𝐖 (A-6) 𝐰𝐾 ̃ 𝐖 (A-5) is to optimize the new domain, 𝐰𝐾 , treating old domains as fixed by using estimates from ̃ (𝑛−1) . (A-6) then optimizes the old domains, 𝐖 ̃ , treating the new the previous iteration, 𝐖 (𝑛) domain as fixed by using the estimate from (A-5), 𝐰𝐾 . The objective function in (5), i.e., 𝑓(𝐖), consists of a non-differentiable term, ‖𝐖‖1 . According to the seminal work by Tseng (2001), when a convex objective function includes a non-differentiable term, BCD will converge to the optimal solution if the term is separable ̃ ‖ + ‖𝐰𝐾 ‖1 . Therefore, according to the coordinates. This is exactly our case, i.e., ‖𝐖‖1 = ‖𝐖 1 ̂ I (i.e., the solution to the BCD in (A-5) and (A-6) will converge to the global optimal solution 𝐖 (5) in Case I). Furthermore, the convergence enjoys a monotone property (Tseng, 2001), i.e., ̃ (0) , 𝐰𝐾(1) )) ≥ 𝑓 ((𝐖 ̃ (1) , 𝐰𝐾(1) )) ≥ 𝑓 ((𝐖 ̃ (1) , 𝐰𝐾(2) )) ≥ 𝑓 ((𝐖 ̃ (2) , 𝐰𝐾(2) )) ≥ ⋯ ≥ 𝑓(𝐖 ̂ I ). (A-7) 𝑓 ((𝐖 ̃ (0) , be the knowledge of old domains in Case II, i.e., 𝐖 ̃ (0) = 𝐖 ̃ ∗ . Then, Let the initial values, 𝐖 (A-7) gives: (1) ̃ ∗ , 𝐰𝐾 )) ≥ 𝑓(𝐖 ̂ I ). 𝑓 ((𝐖 (A-8) (1) Next, according to (A-5), 𝐰𝐾 is (1) ̃ (0) , 𝐰𝐾 )) = argmin {𝑓(𝐖 ̃ (0) ) + 𝑔(𝐰𝐾 |𝐖 ̃ (0) )} = argmin 𝑔(𝐰𝐾 |𝐖 ̃ (0) ) . 𝐰𝐾 = argmin 𝑓 ((𝐖 𝐰𝐾 𝐰𝐾 The 𝐰𝐾 ̃ (0) ) is dropped in the last equation because it is a constant. second “=” follows from (6). 𝑓(𝐖 2 (1) ̃ ∗, 𝐰 Comparing (A-8) and (11), we get 𝐰𝐾 = 𝐰 ̂ 𝐾II . Therefore, (A-8) becomes 𝑓 ((𝐖 ̂ 𝐾II )) ≥ I ̂ I ). When 𝐖 ̃ ∗ = (𝐰 ), it means that BCD attains the optimal solution in one 𝑓(𝐖 ̂1I , … , 𝐰 ̂ 𝐾−1 coordinate (the old domains). Then, it must attain the optimal solution in the other coordinate ̂ I . This completes the proof for Theorem 1. (the new domain), i.e., 𝐰 ̂ 𝐾II = 𝐖 III: Proof of Theorem 2. Both (12) and (13) can be solved analytically, i.e., 𝐰 ̂ 𝐾 = (𝐗 𝑇𝐾 𝐗 𝐾 + 𝜆𝐈)−1 (𝐗 𝑇𝐾 𝐲𝐾 + 𝜆𝛍𝐾 ) and 𝐰 ̌ 𝐾 = (𝐗 𝑇𝐾 𝐗 𝐾 )−1 (𝐗 𝑇𝐾 𝐲𝐾 ) Let 𝐁 ≜ (𝐗 𝑇𝐾 𝐗 𝐾 + 𝜆𝐈)−1 and 𝐙 ≜ (𝐗 𝑇𝐾 𝐗 𝐾 + 𝜆𝐈)−1 (𝐗 𝑇𝐾 𝐗 𝐾 ). Then, it can be derived that 𝐙 = 𝐈 − 𝜆𝐁. Using 𝐁 and 𝐙, we can show that 𝐰 ̂ 𝐾 = 𝐙𝐰 ̌ 𝐾 + 𝜆𝐁𝛍𝐾 . Therefore, 𝑇 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ) = 𝐸 {(𝐰 ̂ 𝐾 − 𝐰𝐾 ) (𝐰 ̂ 𝐾 − 𝐰𝐾 )} 𝑇 = 𝐸 {(𝐙𝐰 ̌ 𝐾 + 𝜆𝐁𝛍𝐾 − 𝐰𝐾 ) (𝐙𝐰 ̌ 𝐾 + 𝜆𝐁𝛍𝐾 − 𝐰𝐾 )} 𝑇 = 𝐸 {(𝐙𝐰 ̌ 𝐾 − 𝐙𝐰𝐾 + 𝐙𝐰𝐾 + 𝜆𝐁𝛍𝐾 − 𝐰𝐾 ) (𝐙𝐰 ̌ 𝐾 − 𝐙𝐰𝐾 + 𝐙𝐰𝐾 + 𝜆𝐁𝛍𝐾 − 𝐰𝐾 )} 𝑇 = 𝐸 {((𝐙𝐰 ̌ 𝐾 − 𝐙𝐰𝐾 ) + (𝐙𝐰𝐾 − 𝐰𝐾 + 𝜆𝐁𝛍𝐾 )) ((𝐙𝐰 ̌ 𝐾 − 𝐙𝐰𝐾 ) + (𝐙𝐰𝐾 − 𝐰𝐾 + 𝜆𝐁𝛍𝐾 ))} 𝑇 = 𝐸 {(𝐙𝐰 ̌ 𝐾 − 𝐙𝐰𝐾 ) (𝐙𝐰 ̌ 𝐾 − 𝐙𝐰𝐾 )} + (𝐙𝐰𝐾 − 𝐰𝐾 + 𝜆𝐁𝛍𝐾 )𝑻 (𝐙𝐰𝐾 − 𝐰𝐾 + 𝜆𝐁𝛍𝐾 ). (A-9) In the last equation in (A-9), the cross-product, 2(𝐙𝐰𝐾 − 𝐰𝐾 + 𝜆𝐁𝛍𝐾 )𝑻 𝐙 𝐸(𝐰 ̌ 𝐾 − 𝐰𝐾 ) , is omitted. This is because 𝐰 ̌ 𝐾 , as an ordinary least squares estimator, is unbiased, and therefore 𝐸(𝐰 ̌ 𝐾 − 𝐰𝐾 ) = 0. Continuing the derivation in (A-9), we can obtain: 𝑇 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ) = 𝐸 {(𝐰 ̌ 𝐾 − 𝐰𝐾 ) 𝐙𝑇 𝐙(𝐰 ̌ 𝐾 − 𝐰𝐾 )} + 𝜆2 (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ) = 𝜎 2 𝑡𝑟{(𝐗 𝑇𝐾 𝐗 𝐾 )−1 𝐙𝑇 𝐙} + 𝜆2 (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ) = 𝜎 2 𝑡𝑟{𝐁(𝐈 − 𝜆𝐁)} + 𝜆2 (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ) 3 = 𝜎 2 𝑡𝑟(𝐁) − 𝜎 2 𝜆𝑡𝑟(𝐁 2 ) + 𝜆2 (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ). (A-10) Perform an eigen-decomposition for 𝐗 𝑇𝐾 𝐗 𝐾 , i.e., 𝐗 𝑇𝐾 𝐗 𝐾 = 𝐏 𝑇 𝚲𝐏 . 𝚲 is a diagonal matrix of eigenvalues 𝛾1 , … , 𝛾𝑄 . 𝐏 𝑇 consists of corresponding eigenvectors. Then the 𝑡𝑟(∙) in (A-10) can be shown to be: 1 1 𝑡𝑟(𝐁) = ∑𝑄𝑖=1 𝛾 +𝜆 and 𝑡𝑟(𝐁 2 ) = ∑𝑄𝑖=1 (𝛾 +𝜆)2 . 𝑖 (A-11) 𝑖 Furthermore, let 𝛂 ≜ 𝐏(𝛍𝐾 − 𝐰𝐾 ) and denote the elements of 𝛂 by 𝛼1 , … , 𝛼𝑄 . Then, the last term in (A-10) can be shown to be: (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ) = ∑𝑄𝑖=1 (𝛾 𝛼𝑖2 2 𝑖 +𝜆) . (A-12) Inserting (A-12) and (A-11) into (A-10), 𝑄 𝑄 1 1 𝛼𝑖2 2 2 𝑀𝑆𝐸(𝐰 ̂𝐾 ) = 𝜎 ∑ −𝜎 𝜆∑ +𝜆 ∑ 2 2 𝑖=1 𝛾𝑖 + 𝜆 𝑖=1 (𝛾𝑖 + 𝜆) 𝑖=1 (𝛾𝑖 + 𝜆) 2 = 𝜎2 ∑ = ∑𝑄𝑖=1 𝑄 𝑄 𝑄 1 1 𝛼𝑖2 2 − 𝜎2𝜆 ∑ + 𝜆 ∑ 2 2 𝑖=1 𝛾𝑖 + 𝜆 𝑖=1 (𝛾𝑖 + 𝜆) 𝑖=1 (𝛾𝑖 + 𝜆) 𝑄 𝜎2 𝛾𝑖 +𝜆2 𝛼𝑖2 (𝛾𝑖 +𝜆)2 . When 𝜆 = 0, 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ) = 𝑀𝑆𝐸(𝐰 ̌ 𝐾 ). To show that 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ) < 𝑀𝑆𝐸(𝐰 ̌ 𝐾 ) at some 𝜆 > 0, we only need to show that there exists a 𝜆∗ such that 𝜕 𝑀𝑆𝐸(𝐰 ̂𝐾 ) 𝜕𝜆 = 2 ∑𝑄𝑖=1 𝛾𝑖 (𝜆𝛼𝑖2 −𝜎2 ) (𝛾𝑖 +𝜆)3 𝜕 𝑀𝑆𝐸(𝐰 ̂𝐾 ) 𝜕𝜆 < 0 for 0 < 𝜆 < 𝜆∗. To make < 0 , a sufficient condition is to make every term in the 𝜎2 𝜎2 𝑖 𝑖 summation smaller than zero, i.e., 𝜆 < 𝛼2 , or equivalently, 𝜆 < 𝑚𝑎𝑥 (𝛼2 ) . This proves the 𝜎2 existence of 𝜆∗ = 𝑚𝑎𝑥 (𝛼2 ) and thereby proves Theorem 1. 𝑖 𝑖 IV: Proof of Theorem 3. 4 𝑖 According to (A-10), for a fixed 𝜆 , 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ) changes only with respect to (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ). The smaller the (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ), the smaller the 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ). According to Definition 1, (𝛍𝐾 − 𝐰𝐾 )𝑇 𝐁 𝑇 𝐁(𝛍𝐾 − 𝐰𝐾 ) is the transfer learning distance 𝑑(𝛍𝐾 ; 𝜆). Therefore, the smaller the transfer learning distance, the smaller the 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ). This gives (1) (2) 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ; 𝜆) ≤ 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ; 𝜆). (1) (A-13) (2) (1) Let 𝜆(1)∗ = argmin𝜆 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ) and 𝜆(2)∗ = argmin𝜆 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ). Then, 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ; 𝜆(1)∗ ) ≤ (1) (2) 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ; 𝜆(2)∗ ) ≤ 𝑀𝑆𝐸(𝐰 ̂ 𝐾 ; 𝜆(2)∗ ). The second inequality follows from (A-13). This completes the proof for Theorem 3. V: Proof of Theorem 4 Lemma 1: The optimization in (17) is equivalent to: 𝐰 ̂ 𝐾II = argmin 𝐰𝐾 2 {‖𝐲𝐾 − 𝐗 𝐾 𝐰𝐾 ‖22 + 𝜆1 ‖𝐰𝐾 ‖1 + 𝜆2 ∑𝑋𝑖 ~𝑋𝑗 𝑎𝑖𝑗 ((𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) − (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )) } . (A-14) Proof: To prove Lemma 1 is to prove (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐋(𝐰𝐾 − 𝛍𝐾 ) = ∑𝑋𝑖 ~𝑋𝑗 𝑎𝑖𝑗 ((𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) − 2 (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )) . Start from the left-hand side. Write 𝐋 = 𝐃 − 𝐀, where 𝐃 is a diagonal matrix of the nodes’ degrees, i.e., 𝐃 = 𝑑𝑖𝑎𝑔(𝑑1 , … , 𝑑𝑄 ). 𝐀 is matrix of the edge weights, i.e., 𝐀 = {𝑎𝑖𝑗 }. The diagonal elements of 𝐀 are zero. Then, (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐋(𝐰𝐾 − 𝛍𝐾 ) = (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐃(𝐰𝐾 − 𝛍𝐾 ) − (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐀(𝐰𝐾 − 𝛍𝐾 ) = ∑𝑄𝑖=1(𝑤𝑖𝐾 − 𝜇𝑖𝐾 )2 𝑑𝑖 − ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )𝑎𝑖𝑗 . Plugging in the definition that 𝑑𝑖 = ∑𝑋𝑖 ~𝑋𝑗 𝑎𝑖𝑗 , (A-15) becomes (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐋(𝐰𝐾 − 𝛍𝐾 ) = ∑𝑄𝑖=1(𝑤𝑖𝐾 − 𝜇𝑖𝐾 )2 ∑𝑋𝑖 ~𝑋𝑗 𝑎𝑖𝑗 − ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )𝑎𝑖𝑗 5 (A-15) 𝑄 = ∑𝑖=1 ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 )2 𝑎𝑖𝑗 − ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )𝑎𝑖𝑗 2 1 = 2 (∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 )2 𝑎𝑖𝑗 + ∑𝑋𝑗 ~𝑋𝑖(𝑤𝑗𝐾 − 𝜇𝑗𝐾 ) 𝑎𝑗𝑖 ) − ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )𝑎𝑖𝑗 . (A-16) Because the graph is unidirectional, 𝑎𝑗𝑖 = 𝑎𝑖𝑗 , (A-16) becomes (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐋(𝐰𝐾 − 𝛍𝐾 ) 2 1 = 2 (∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 )2 𝑎𝑖𝑗 + ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑗𝐾 − 𝜇𝑗𝐾 ) 𝑎𝑖𝑗 ) − ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )𝑎𝑖𝑗 2 1 2 = (∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 )2 𝑎𝑖𝑗 + ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑗𝐾 − 𝜇𝑗𝐾 ) 𝑎𝑖𝑗 − 2 ∑𝑋𝑖 ~𝑋𝑗(𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )𝑎𝑖𝑗 ) 2 1 = 2 (∑𝑋𝑖 ~𝑋𝑗 𝑎𝑖𝑗 ((𝑤𝑖𝐾 − 𝜇𝑖𝐾 ) − (𝑤𝑗𝐾 − 𝜇𝑗𝐾 )) ) . The 1⁄2 can be absorbed by 𝜆2 . Next, we prove Theorem 4. Denote the objective function in (A-14) by 𝜓(𝐰𝐾 ), i.e., 2 𝜓(𝐰𝐾 ) = ‖𝐲𝐾 − 𝐗 𝐾 𝐰𝐾 ‖22 + 𝜆1 ‖𝐰𝐾 ‖1 + 𝜆2 ∑𝑋𝑙 ~𝑋ℎ 𝑎𝑙ℎ ((𝑤𝑙𝐾 − 𝜇𝑙𝐾 ) − (𝑤ℎ𝐾 − 𝜇ℎ𝐾 )) . 𝐼𝐼 𝐼𝐼 Because 𝑤 ̂ 𝑖𝑘 and 𝑤 ̂𝑗𝑘 are solutions to the optimization problem in (A-14) and they are non-zero, they should satisfy: 𝜕‖𝐲𝐾 −𝐗 𝐾 𝐰𝐾 ‖22 𝜕𝑤𝑖𝐾 | 𝜕‖𝐲𝐾 −𝐗 𝐾 𝐰𝐾 ‖22 𝜕𝑤𝑗𝐾 II ̂𝐾 𝐰 | II 𝐰 ̂𝐾 𝜕𝑓(𝐰𝐾 ) 𝜕𝑤𝑖𝐾 | II ̂𝐾 𝐰 = 0 and 𝜕𝑓(𝐰𝐾 ) 𝜕𝑤𝑗𝐾 | II 𝐰 ̂𝐾 = 0, i.e., 𝐼𝐼 𝐼𝐼 𝐼𝐼 ) + 2𝜆2 ∑𝑋𝑖 ~𝑋ℎ 𝑎𝑖ℎ ((𝑤 + 𝜆1 𝑠𝑔𝑛(𝑤 ̂ 𝑖𝑘 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂ ℎ𝑘 − 𝜇ℎ𝐾 )) = 0, (A-17) 𝐼𝐼 𝐼𝐼 𝐼𝐼 + 𝜆1 𝑠𝑔𝑛(𝑤 ̂𝑗𝑘 ) − 2𝜆2 ∑𝑋𝑙 ~𝑋𝑗 𝑎𝑙𝑗 ((𝑤 ̂ 𝑙𝑘 − 𝜇𝑙𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )) = 0. (A-18) Focusing on (A-17), the third term on the left-hand side can be written into: 𝐼𝐼 𝐼𝐼 2𝜆2 ∑𝑋𝑖 ~𝑋ℎ 𝑎𝑖ℎ ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂ ℎ𝑘 − 𝜇ℎ𝐾 )) 𝐼𝐼 𝐼𝐼 𝐼𝐼 𝐼𝐼 = 2𝜆2 𝑎𝑖𝑗 ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )) + 2𝜆2 ∑𝑋𝑖 ~𝑋ℎ ,ℎ≠𝑗 𝑎𝑖ℎ ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂ ℎ𝑘 − 𝜇ℎ𝐾 )) 𝐼𝐼 𝐼𝐼 ≈ 2𝜆2 𝑎𝑖𝑗 ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )). (A-19) 6 The last step follows from the given assumption that 𝑎𝑖𝑗 ≫ 𝑎𝑖ℎ . Similarly, the third term on the left-hand side of (A-18) can be written into 𝐼𝐼 𝐼𝐼 𝐼𝐼 𝐼𝐼 2𝜆2 ∑𝑋𝑙 ~𝑋𝑗 𝑎𝑙𝑗 ((𝑤 ̂ 𝑙𝑘 − 𝜇𝑙𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )) = 2𝜆2 𝑎𝑖𝑗 ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )). (A-20) Considering (A-19) and (A-20) and taking the difference between (A-17) and (A-18), 𝐼𝐼 𝐼𝐼 𝐼𝐼 𝐼𝐼 ) cancels with 𝜆1 𝑠𝑔𝑛(𝑤 𝜆1 𝑠𝑔𝑛(𝑤 ̂ 𝑖𝑘 ̂𝑗𝑘 ) because it is known that 𝑤 ̂ 𝑖𝑘 𝑤 ̂𝑗𝑘 > 0, and we get: 𝜕‖𝐲𝐾 −𝐗 𝐾 𝐰𝐾 ‖22 𝜕𝑤𝑖𝐾 | II ̂𝐾 𝐰 − 𝜕‖𝐲𝐾 −𝐗 𝐾 𝐰𝐾 ‖22 𝜕𝑤𝑗𝐾 | II 𝐰 ̂𝐾 𝐼𝐼 𝐼𝐼 + 4𝜆2 𝑎𝑖𝑗 ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )) = 0. (A-21) 𝑇 𝐼𝐼 𝐼𝐼 −2(𝒙𝑇𝑖𝐾 − 𝒙𝑗𝐾 )(𝐲𝐾 − 𝐗 𝐾 𝐰 ̂ 𝐾II ) + 4𝜆2 𝑎𝑖𝑗 ((𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )) = 0. (A-22) Furthermore, we can get: 𝐼𝐼 𝐼𝐼 |(𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )| ≤ ‖𝒙𝑖𝐾 −𝒙𝑗𝐾 ‖ 2 II ‖ ×‖𝐲𝐾 −𝐗 𝐾 𝐰 ̂𝐾 2𝜆2 2 1 𝑎𝑖𝑗 . (A-23) We would like to have an upper bound that does not include 𝐰 ̂ 𝐾II . To achieve this, we adopt the following strategy: Because 𝐰 ̂ 𝐾II is the optimal solution, 𝜓(𝐰 ̂ 𝐾II ) should be the smallest. Therefore, 𝜓(𝐰 ̂ 𝐾II ) ≤ 𝜓(0), i.e., 2 2 𝐼𝐼 𝐼𝐼 𝜓(𝐰 ̂ 𝐾II ) = ‖𝐲𝐾 − 𝐗 𝐾 𝐰 ̂ 𝐾II ‖2 + 𝜆1 ‖𝐰 ̂ 𝐾II ‖1 + 𝜆2 ∑𝑋𝑙 ~𝑋ℎ 𝑎𝑙ℎ ((𝑤 ̂ 𝑙𝑘 − 𝜇𝑙𝐾 ) − (𝑤 ̂ ℎ𝑘 − 𝜇ℎ𝐾 )) ≤ 𝑓(0) = ‖𝐲𝐾 ‖22 + 𝜆2 ∑𝑋𝑙 ~𝑋ℎ 𝑎𝑙ℎ (𝜇𝑙𝐾 − 𝜇ℎ𝐾 )2 . (A-24) Then, 2 2 ‖𝐲𝐾 − 𝐗 𝐾 𝐰 ̂ 𝐾II ‖2 ≤ ‖𝐲𝐾 ‖22 + 𝜆2 ∑𝑋𝑙 ~𝑋ℎ 𝑎𝑙ℎ (𝜇𝑙𝐾 − 𝜇ℎ𝐾 )2 ≈ ‖𝐲𝐾 ‖22 + 2𝜆2 𝑎𝑖𝑗 (𝜇𝑖𝐾 − 𝜇𝑗𝐾 ) , and 2 2 ‖𝐲𝐾 − 𝐗 𝐾 𝐰 ̂ 𝐾II ‖2 ≤ √‖𝐲𝐾 ‖22 + 2𝜆2 𝑎𝑖𝑗 (𝜇𝑖𝐾 − 𝜇𝑗𝐾 ) . (A-25) Inserting (A-25) into (A-23), we get 𝐼𝐼 𝐼𝐼 |(𝑤 ̂ 𝑖𝑘 − 𝜇𝑖𝐾 ) − (𝑤 ̂𝑗𝑘 − 𝜇𝑗𝐾 )| ≤ ‖𝐱𝑖𝐾 −𝐱𝑗𝐾 ‖ 2𝜆2 2 × √ ‖𝐲𝐾 ‖22 7 2 𝑎𝑖𝑗 + 2𝜆2 (𝜇𝑖𝐾 −𝜇𝑗𝐾 ) 𝑎𝑖𝑗 2 . VI: Obtaining (24) by the gradient method. Given 𝐰𝐾 , the optimization problem in (24) with respect to 𝜍𝐾 and 𝛡𝐾 is: min 𝜑(𝜍𝐾 , 𝛡𝐾 ) = min 𝜍𝐾 ,𝛡𝐾 𝜍𝐾 ,𝛡𝐾 ̃ −1 𝛡𝐾 ) + {𝑙𝑜𝑔(𝜍𝐾 − 𝛡𝑇𝐾 𝛀 1 ̃ −1 𝛡𝐾 𝜍𝐾 − 𝛡𝑇𝐾 𝛀 (𝐰𝐾 − 𝛍𝐾 )𝑇 𝐋(𝐰𝐾 − 𝛍𝐾 )} Using the gradient method, set the partial derivatives of 𝜑(𝜍𝐾 , 𝛡𝐾 ) to be zero: { 𝜕 𝜑(𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝜍𝐾 = 0 , 𝜕 𝜑(𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝝕𝐾 = 0 i.e., 𝜕 𝜑(𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝜍𝐾 = 𝑄 ̃ −1 𝜍𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 ̃ −1 𝛡𝐾 2𝑄𝛀 𝑇 ̃ −1 𝛡 𝐾 −𝛡𝐾 𝛀 𝐾 𝜕𝜑(𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝝕𝐾 = − 𝜍 ̃ −1 𝛡𝐾 (𝐰𝐾 −𝐖 ̃ −1 𝛡𝐾 )𝑇 𝐋(𝐰𝐾 −𝐖 ̃ −1 𝛡𝐾 ) ̃𝛀 ̃𝛀 2𝛀 ̃ −1 (𝜍𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 ) 𝟐 − 𝑇 − 𝑇 ̃ −1 ̃ −1 (𝐰𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 ) 𝐋(𝐰𝐾 −𝛡𝐾 𝛀 𝛡𝐾 ) ̃ −1 (𝜍𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 ) 𝟐 = 0, (A-26) ̃ −1 𝐖 ̃ −1 𝛡𝐾 ) ̃ 𝑇 𝐋(𝐰𝐾 −𝐖 ̃𝛀 2𝛀 + 𝑇 −1 ̃ 𝜍𝐾 −𝛡𝐾 𝛀 𝛡𝐾 = 0. (A-27) From (A-26), we can get: ̃ −1 𝛡𝐾 ) = (𝐰𝐾 − 𝐖 ̃ −1 𝛡𝐾 )𝑇 𝐋(𝐰𝐾 − 𝐖 ̃ −1 𝛡𝐾 ). ̃𝛀 ̃𝛀 𝑄(𝜍𝐾 − 𝛡𝑇𝐾 𝛀 (A-28) Inserting (A-28) into the third term of (A-27) and through some algebra, we can get: ̃ −1 𝐖 ̃ −1 𝐖 ̃ −1 𝛡𝐾 = 0. ̃ 𝑇 𝐋𝐰𝐾 − 𝛀 ̃ 𝑇 𝐋𝐖 ̃𝛀 𝛀 (A-29) ̃ . Using this in (A-29), ̃ 𝑇 𝐋𝐖 ̃ = 𝑄𝛀 According to (22), 𝐖 ̃ 𝑇 𝐋𝐰𝐾 ⁄𝑄 . 𝛡 ̂𝐾 = 𝐖 Furthermore, according to (A-28), 1 ̃ −1 𝛡𝐾 )𝑇 𝐋(𝐰𝐾 − 𝐖 ̃ −1 𝛡𝐾 ) + 𝛡𝑇𝐾 𝛀 ̃ −1 𝛡𝐾 ̃𝛀 ̃𝛀 𝜍𝐾 = 𝑄 (𝐰𝐾 − 𝐖 1 2 ̃ −1 𝛡𝐾 + 1 𝛡𝑇𝐾 𝛀 ̃ −1 𝐖 ̃ −1 𝛡𝐾 + 𝛡𝑇𝐾 𝛀 ̃ −1 𝛡𝐾 ̃𝛀 ̃ 𝑇 𝐋𝐖 ̃𝛀 = 𝑄 𝐰𝐾 𝑇 𝐋𝐰𝐾 − 𝑄 𝐰𝐾 𝑇 𝐋𝐖 𝑄 1 2 ̃ −1 𝛡𝐾 + 2𝛡𝑇𝐾 𝛀 ̃ −1 𝛡𝐾 . ̃𝛀 = 𝑄 𝐰𝐾 𝑇 𝐋𝐰𝐾 − 𝑄 𝐰𝐾 𝑇 𝐋𝐖 8 (A-30) 1 Using (A-30) in the second term, we can get 𝜍̂𝐾 = 𝑄 𝐰𝐾 𝑇 𝐋𝐰𝐾 . Finally, in order to prove that the 𝛡 ̂ 𝐾 and 𝜍̂𝐾 are optimal solutions for a minimization problem, 𝜕 𝜑 2 (𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝜍𝐾2 |𝜍̂𝐾,𝛡 ̂𝐾 > 0 we will need to show { 2 . “ ≻” denotes a matrix being positive 𝜕 𝜑 (𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝝕2𝐾 |𝜍̂𝐾 ,𝛡 ̂𝐾 ≻ 0 definite. It can be derived that: 𝜕 𝜑 2 (𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝜍𝐾2 |𝜍̂𝐾,𝛡 ̂𝐾 = 𝑄 2 ̃ −1 (𝜍𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 ) > 0. Furthermore, 𝜕 𝜑 2 (𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝝕2𝐾 |𝜍̂𝐾,𝛡 ̂𝐾 = 2𝑄 2 ̃ −1 (𝜍𝐾 −𝛡𝑇 𝐾 𝛀 𝛡𝐾 ) ̃ −1 𝛡𝐾 𝛡𝑇𝐾 𝛀 ̃ −1 + 𝛀 𝜍 2𝑄 𝑇 ̃ −1 𝛡 𝐾 −𝛡𝐾 𝛀 𝐾 ̃ −1, 𝛀 ̃ −1 𝛡𝐾 𝛡𝑇𝐾 𝛀 ̃ −1 ≻ 0 and 𝛀 ̃ −1 ≻ 0. So 𝜕 𝜑 2 (𝜍𝐾 , 𝝕𝐾 )⁄𝜕𝝕2𝐾 |𝜍̂ ,𝛡 where 𝛀 ≻ 0. 𝐾 ̂𝐾 9

II: Proof of Theorem 1.

Related documents

Products

Support

II: Proof of Theorem 1.

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib