Supplementary Material
I: Derivation for (6).
$f(\mathbf{W})$, according to its definition in Case II, is

$$f(\mathbf{W}) = \sum_{k=1}^{K}\|\mathbf{y}_k - \mathbf{X}_k\mathbf{w}_k\|_2^2 + \lambda_1\|\mathbf{W}\|_1 + \lambda_2\big(Q\log|\boldsymbol{\Omega}| + K\log|\boldsymbol{\Phi}| + \mathrm{tr}(\boldsymbol{\Phi}^{-1}\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^T)\big). \tag{A-1}$$
Furthermore, $|\boldsymbol{\Omega}| = |\tilde{\boldsymbol{\Omega}}|\,(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)$. Then,

$$\begin{aligned}
Q\log|\boldsymbol{\Omega}| + K\log|\boldsymbol{\Phi}|
&= Q\log|\tilde{\boldsymbol{\Omega}}| + Q\log(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K) + (K-1)\log|\boldsymbol{\Phi}| + \log|\boldsymbol{\Phi}| \\
&= Q\log|\tilde{\boldsymbol{\Omega}}| + (K-1)\log|\boldsymbol{\Phi}| + \log\big\{(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^Q|\boldsymbol{\Phi}|\big\} \\
&= Q\log|\tilde{\boldsymbol{\Omega}}| + (K-1)\log|\boldsymbol{\Phi}| + \log|\boldsymbol{\Sigma}_K|.
\end{aligned} \tag{A-2}$$

$\boldsymbol{\Sigma}_K$ is defined in (10).
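As a quick numerical sanity check of the determinant identity $|\boldsymbol{\Omega}| = |\tilde{\boldsymbol{\Omega}}|\,(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)$, the following sketch partitions a random symmetric positive definite matrix the way the proof partitions $\boldsymbol{\Omega}$ (the dimension $K=5$ and the random seed are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5

# Random symmetric positive definite K x K matrix playing the role of Omega.
M = rng.standard_normal((K, K))
Omega = M @ M.T + K * np.eye(K)

# Partition: leading (K-1) x (K-1) block, last column, bottom-right scalar.
Omega_t = Omega[:K-1, :K-1]          # Omega-tilde
pi_K = Omega[:K-1, -1]               # pi_K
vsig_K = Omega[-1, -1]               # varsigma_K

# Schur complement of Omega-tilde in Omega.
schur = vsig_K - pi_K @ np.linalg.solve(Omega_t, pi_K)
assert np.isclose(np.linalg.det(Omega), np.linalg.det(Omega_t) * schur)
```

The identity is the standard Schur-complement determinant formula, which is all the derivation of (A-2) relies on.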
Also, writing $\mathbf{W} = (\tilde{\mathbf{W}}, \mathbf{w}_K)$ and using the block-matrix inverse of $\boldsymbol{\Omega}$,

$$\mathrm{tr}(\boldsymbol{\Phi}^{-1}\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^T)
= \mathrm{tr}\left(\boldsymbol{\Phi}^{-1}(\tilde{\mathbf{W}}, \mathbf{w}_K)
\begin{bmatrix}
\tilde{\boldsymbol{\Omega}}^{-1} + \dfrac{\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K\boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K} & -\dfrac{\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K} \\[2ex]
-\dfrac{\boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K} & \dfrac{1}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K}
\end{bmatrix}
(\tilde{\mathbf{W}}, \mathbf{w}_K)^T\right). \tag{A-3}$$
Expanding the block matrix multiplication within the trace and simplifying the result, we can get:
$$\mathrm{tr}(\boldsymbol{\Phi}^{-1}\mathbf{W}\boldsymbol{\Omega}^{-1}\mathbf{W}^T) = \mathrm{tr}(\boldsymbol{\Phi}^{-1}\tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\tilde{\mathbf{W}}^T) + (\mathbf{w}_K - \boldsymbol{\mu}_K)^T\boldsymbol{\Sigma}_K^{-1}(\mathbf{w}_K - \boldsymbol{\mu}_K). \tag{A-4}$$
$\boldsymbol{\mu}_K$ is defined in (9). Inserting (A-4) and (A-2) into (A-1) and re-organizing the terms, we obtain (6).
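The trace decomposition (A-4) can likewise be checked numerically. The sketch below assumes, consistent with (A-2) and the block inverse in (A-3), that $\boldsymbol{\mu}_K = \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K$ and $\boldsymbol{\Sigma}_K = (\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)\boldsymbol{\Phi}$; if (9) and (10) define these quantities differently, the check should be adapted accordingly:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, K = 4, 5

# Random SPD Omega (K x K) and Phi (Q x Q); random coefficient matrix W.
A = rng.standard_normal((K, K)); Omega = A @ A.T + K * np.eye(K)
B = rng.standard_normal((Q, Q)); Phi = B @ B.T + Q * np.eye(Q)
W = rng.standard_normal((Q, K))

Omega_t, pi_K, vsig_K = Omega[:K-1, :K-1], Omega[:K-1, -1], Omega[-1, -1]
W_t, w_K = W[:, :K-1], W[:, -1]

s = vsig_K - pi_K @ np.linalg.solve(Omega_t, pi_K)   # Schur complement
mu_K = W_t @ np.linalg.solve(Omega_t, pi_K)          # assumed form of (9)
Sig_K = s * Phi                                      # assumed form of (10)

lhs = np.trace(np.linalg.solve(Phi, W @ np.linalg.solve(Omega, W.T)))
rhs = (np.trace(np.linalg.solve(Phi, W_t @ np.linalg.solve(Omega_t, W_t.T)))
       + (w_K - mu_K) @ np.linalg.solve(Sig_K, w_K - mu_K))
assert np.isclose(lhs, rhs)
```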
II: Proof of Theorem 1.
The optimization in (5) is convex and can be solved by a Block Coordinate Descent (BCD) algorithm. We consider two coordinates in our problem: the old domains $1, \ldots, K-1$ as a whole, and the new domain. Then, BCD works by alternately optimizing each coordinate. Specifically, at the $n$-th iteration, $n = 1, 2, 3, \ldots$, BCD solves the following two optimizations:
$$\mathbf{w}_K^{(n)} = \underset{\mathbf{w}_K}{\operatorname{argmin}}\; f\big((\tilde{\mathbf{W}}^{(n-1)}, \mathbf{w}_K)\big), \tag{A-5}$$

$$\tilde{\mathbf{W}}^{(n)} = \underset{\tilde{\mathbf{W}}}{\operatorname{argmin}}\; f\big((\tilde{\mathbf{W}}, \mathbf{w}_K^{(n)})\big). \tag{A-6}$$
(A-5) optimizes the new domain, $\mathbf{w}_K$, treating the old domains as fixed at their estimates from the previous iteration, $\tilde{\mathbf{W}}^{(n-1)}$. (A-6) then optimizes the old domains, $\tilde{\mathbf{W}}$, treating the new domain as fixed at the estimate from (A-5), $\mathbf{w}_K^{(n)}$.
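To illustrate the two-coordinate scheme, here is a generic two-block BCD sketch on a lasso-type objective with synthetic data (not the paper's exact $f(\mathbf{W})$; the sizes, $\lambda$, and inner solver are illustrative choices). Each block is approximately minimized by inner coordinate descent with soft thresholding, and the objective values decrease monotonically across the alternating updates:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
lam = 1.0

def objective(w):
    r = y - X @ w
    return r @ r + lam * np.abs(w).sum()

def soft(z, t):
    # Soft-thresholding operator.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def minimize_block(w, block, sweeps=50):
    """Approximately minimize the objective over one coordinate block,
    holding the other block fixed (inner coordinate descent)."""
    w = w.copy()
    for _ in range(sweeps):
        for j in block:
            r_j = y - X @ w + X[:, j] * w[j]   # partial residual
            w[j] = soft(X[:, j] @ r_j, lam / 2) / (X[:, j] @ X[:, j])
    return w

# Two coordinate blocks: the "old domains" and the "new domain".
old, new = list(range(p - 1)), [p - 1]
w = np.zeros(p)
history = [objective(w)]
for _ in range(10):                  # alternate the analogues of (A-5) and (A-6)
    w = minimize_block(w, new)
    history.append(objective(w))
    w = minimize_block(w, old)
    history.append(objective(w))

# Monotone decrease of the objective along the BCD iterates.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```

Because each block update can only lower the objective, the recorded values form a non-increasing sequence, which is the behavior the convergence argument below exploits.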
The objective function in (5), i.e., $f(\mathbf{W})$, contains a non-differentiable term, $\|\mathbf{W}\|_1$. According to the seminal work of Tseng (2001), when a convex objective function includes a non-differentiable term, BCD converges to the optimal solution if that term is separable across the coordinates. This is exactly our case, since $\|\mathbf{W}\|_1 = \|\tilde{\mathbf{W}}\|_1 + \|\mathbf{w}_K\|_1$. Therefore, the BCD in (A-5) and (A-6) will converge to the global optimal solution $\hat{\mathbf{W}}^{\mathrm{I}}$ (i.e., the solution to (5) in Case I). Furthermore, the convergence enjoys a monotone property (Tseng, 2001), i.e.,
$$f\big((\tilde{\mathbf{W}}^{(0)}, \mathbf{w}_K^{(1)})\big) \ge f\big((\tilde{\mathbf{W}}^{(1)}, \mathbf{w}_K^{(1)})\big) \ge f\big((\tilde{\mathbf{W}}^{(1)}, \mathbf{w}_K^{(2)})\big) \ge f\big((\tilde{\mathbf{W}}^{(2)}, \mathbf{w}_K^{(2)})\big) \ge \cdots \ge f(\hat{\mathbf{W}}^{\mathrm{I}}). \tag{A-7}$$
Let the initial values, $\tilde{\mathbf{W}}^{(0)}$, be the knowledge of the old domains in Case II, i.e., $\tilde{\mathbf{W}}^{(0)} = \tilde{\mathbf{W}}^{*}$. Then, (A-7) gives:
$$f\big((\tilde{\mathbf{W}}^{*}, \mathbf{w}_K^{(1)})\big) \ge f(\hat{\mathbf{W}}^{\mathrm{I}}). \tag{A-8}$$
Next, according to (A-5), $\mathbf{w}_K^{(1)}$ is

$$\mathbf{w}_K^{(1)} = \underset{\mathbf{w}_K}{\operatorname{argmin}}\; f\big((\tilde{\mathbf{W}}^{(0)}, \mathbf{w}_K)\big) = \underset{\mathbf{w}_K}{\operatorname{argmin}}\; \big\{f(\tilde{\mathbf{W}}^{(0)}) + g(\mathbf{w}_K \mid \tilde{\mathbf{W}}^{(0)})\big\} = \underset{\mathbf{w}_K}{\operatorname{argmin}}\; g(\mathbf{w}_K \mid \tilde{\mathbf{W}}^{(0)}).$$

The second "$=$" follows from (6). $f(\tilde{\mathbf{W}}^{(0)})$ is dropped in the last equation because it is a constant.
Comparing (A-8) and (11), we get $\mathbf{w}_K^{(1)} = \hat{\mathbf{w}}_K^{\mathrm{II}}$. Therefore, (A-8) becomes $f\big((\tilde{\mathbf{W}}^{*}, \hat{\mathbf{w}}_K^{\mathrm{II}})\big) \ge f(\hat{\mathbf{W}}^{\mathrm{I}})$. When $\tilde{\mathbf{W}}^{*} = (\hat{\mathbf{w}}_1^{\mathrm{I}}, \ldots, \hat{\mathbf{w}}_{K-1}^{\mathrm{I}})$, it means that BCD attains the optimal solution in one coordinate (the old domains). Then, it must attain the optimal solution in the other coordinate (the new domain), i.e., $(\tilde{\mathbf{W}}^{*}, \hat{\mathbf{w}}_K^{\mathrm{II}}) = \hat{\mathbf{W}}^{\mathrm{I}}$. This completes the proof of Theorem 1.
III: Proof of Theorem 2.
Both (12) and (13) can be solved analytically, i.e., $\hat{\mathbf{w}}_K = (\mathbf{X}_K^T\mathbf{X}_K + \lambda\mathbf{I})^{-1}(\mathbf{X}_K^T\mathbf{y}_K + \lambda\boldsymbol{\mu}_K)$ and $\check{\mathbf{w}}_K = (\mathbf{X}_K^T\mathbf{X}_K)^{-1}\mathbf{X}_K^T\mathbf{y}_K$. Let $\mathbf{B} \triangleq (\mathbf{X}_K^T\mathbf{X}_K + \lambda\mathbf{I})^{-1}$ and $\mathbf{Z} \triangleq (\mathbf{X}_K^T\mathbf{X}_K + \lambda\mathbf{I})^{-1}(\mathbf{X}_K^T\mathbf{X}_K)$. Then, it can be derived that $\mathbf{Z} = \mathbf{I} - \lambda\mathbf{B}$. Using $\mathbf{B}$ and $\mathbf{Z}$, we can show that $\hat{\mathbf{w}}_K = \mathbf{Z}\check{\mathbf{w}}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K$. Therefore,
$$\begin{aligned}
MSE(\hat{\mathbf{w}}_K) &= E\big\{(\hat{\mathbf{w}}_K - \mathbf{w}_K)^T(\hat{\mathbf{w}}_K - \mathbf{w}_K)\big\} \\
&= E\big\{(\mathbf{Z}\check{\mathbf{w}}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K - \mathbf{w}_K)^T(\mathbf{Z}\check{\mathbf{w}}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K - \mathbf{w}_K)\big\} \\
&= E\big\{(\mathbf{Z}\check{\mathbf{w}}_K - \mathbf{Z}\mathbf{w}_K + \mathbf{Z}\mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K - \mathbf{w}_K)^T(\mathbf{Z}\check{\mathbf{w}}_K - \mathbf{Z}\mathbf{w}_K + \mathbf{Z}\mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K - \mathbf{w}_K)\big\} \\
&= E\big\{\big((\mathbf{Z}\check{\mathbf{w}}_K - \mathbf{Z}\mathbf{w}_K) + (\mathbf{Z}\mathbf{w}_K - \mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K)\big)^T\big((\mathbf{Z}\check{\mathbf{w}}_K - \mathbf{Z}\mathbf{w}_K) + (\mathbf{Z}\mathbf{w}_K - \mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K)\big)\big\} \\
&= E\big\{(\mathbf{Z}\check{\mathbf{w}}_K - \mathbf{Z}\mathbf{w}_K)^T(\mathbf{Z}\check{\mathbf{w}}_K - \mathbf{Z}\mathbf{w}_K)\big\} + (\mathbf{Z}\mathbf{w}_K - \mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K)^T(\mathbf{Z}\mathbf{w}_K - \mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K).
\end{aligned} \tag{A-9}$$
In the last equation in (A-9), the cross-product term, $2(\mathbf{Z}\mathbf{w}_K - \mathbf{w}_K + \lambda\mathbf{B}\boldsymbol{\mu}_K)^T\mathbf{Z}\,E(\check{\mathbf{w}}_K - \mathbf{w}_K)$, is omitted. This is because $\check{\mathbf{w}}_K$, as an ordinary least squares estimator, is unbiased, and therefore $E(\check{\mathbf{w}}_K - \mathbf{w}_K) = \mathbf{0}$. Continuing the derivation in (A-9), we can obtain:
$$\begin{aligned}
MSE(\hat{\mathbf{w}}_K) &= E\big\{(\check{\mathbf{w}}_K - \mathbf{w}_K)^T\mathbf{Z}^T\mathbf{Z}(\check{\mathbf{w}}_K - \mathbf{w}_K)\big\} + \lambda^2(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K) \\
&= \sigma^2\,\mathrm{tr}\{(\mathbf{X}_K^T\mathbf{X}_K)^{-1}\mathbf{Z}^T\mathbf{Z}\} + \lambda^2(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K) \\
&= \sigma^2\,\mathrm{tr}\{\mathbf{B}(\mathbf{I} - \lambda\mathbf{B})\} + \lambda^2(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K) \\
&= \sigma^2\,\mathrm{tr}(\mathbf{B}) - \sigma^2\lambda\,\mathrm{tr}(\mathbf{B}^2) + \lambda^2(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K).
\end{aligned} \tag{A-10}$$
Perform an eigen-decomposition of $\mathbf{X}_K^T\mathbf{X}_K$, i.e., $\mathbf{X}_K^T\mathbf{X}_K = \mathbf{P}^T\boldsymbol{\Lambda}\mathbf{P}$, where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues $\gamma_1, \ldots, \gamma_Q$ and $\mathbf{P}^T$ consists of the corresponding eigenvectors. Then the $\mathrm{tr}(\cdot)$ terms in (A-10) can be shown to be:
$$\mathrm{tr}(\mathbf{B}) = \sum_{i=1}^{Q}\frac{1}{\gamma_i + \lambda} \quad\text{and}\quad \mathrm{tr}(\mathbf{B}^2) = \sum_{i=1}^{Q}\frac{1}{(\gamma_i + \lambda)^2}. \tag{A-11}$$
Furthermore, let $\boldsymbol{\alpha} \triangleq \mathbf{P}(\boldsymbol{\mu}_K - \mathbf{w}_K)$ and denote the elements of $\boldsymbol{\alpha}$ by $\alpha_1, \ldots, \alpha_Q$. Then, the last term in (A-10) can be shown to be:
$$(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K) = \sum_{i=1}^{Q}\frac{\alpha_i^2}{(\gamma_i + \lambda)^2}. \tag{A-12}$$
Inserting (A-12) and (A-11) into (A-10),

$$\begin{aligned}
MSE(\hat{\mathbf{w}}_K) &= \sigma^2\sum_{i=1}^{Q}\frac{1}{\gamma_i + \lambda} - \sigma^2\lambda\sum_{i=1}^{Q}\frac{1}{(\gamma_i + \lambda)^2} + \lambda^2\sum_{i=1}^{Q}\frac{\alpha_i^2}{(\gamma_i + \lambda)^2} \\
&= \sum_{i=1}^{Q}\frac{\sigma^2\gamma_i + \lambda^2\alpha_i^2}{(\gamma_i + \lambda)^2}.
\end{aligned}$$
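The equivalence between the matrix form in (A-10) and the eigenvalue form above can be verified numerically; the data dimensions and parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n, Q = 30, 6
XK = rng.standard_normal((n, Q))
wK = rng.standard_normal(Q)
muK = wK + 0.5 * rng.standard_normal(Q)
sigma2, lam = 0.25, 0.7

G = XK.T @ XK
B = np.linalg.inv(G + lam * np.eye(Q))

# Matrix form of (A-10).
mse_matrix = (sigma2 * np.trace(B) - sigma2 * lam * np.trace(B @ B)
              + lam**2 * (muK - wK) @ B.T @ B @ (muK - wK))

# Eigenvalue form: X_K^T X_K = P^T Lambda P, alpha = P (mu_K - w_K).
gamma, V = np.linalg.eigh(G)   # columns of V are eigenvectors, so P = V^T
alpha = V.T @ (muK - wK)
mse_eigen = np.sum((sigma2 * gamma + lam**2 * alpha**2) / (gamma + lam)**2)

assert np.isclose(mse_matrix, mse_eigen)
```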
When πœ† = 0, 𝑀𝑆𝐸(𝐰
Μ‚ 𝐾 ) = 𝑀𝑆𝐸(𝐰
̌ 𝐾 ). To show that 𝑀𝑆𝐸(𝐰
Μ‚ 𝐾 ) < 𝑀𝑆𝐸(𝐰
̌ 𝐾 ) at some πœ† > 0,
we only need to show that there exists a πœ†∗ such that
πœ• 𝑀𝑆𝐸(𝐰
̂𝐾 )
πœ•πœ†
= 2 ∑𝑄𝑖=1
𝛾𝑖 (πœ†π›Όπ‘–2 −𝜎2 )
(𝛾𝑖 +πœ†)3
πœ• 𝑀𝑆𝐸(𝐰
̂𝐾 )
πœ•πœ†
< 0 for 0 < πœ† < πœ†∗. To make
< 0 , a sufficient condition is to make every term in the
𝜎2
𝜎2
𝑖
𝑖
summation smaller than zero, i.e., πœ† < 𝛼2 , or equivalently, πœ† < π‘šπ‘Žπ‘₯ (𝛼2 ) . This proves the
𝜎2
existence of πœ†∗ = π‘šπ‘Žπ‘₯ (𝛼2 ) and thereby proves Theorem 1.
𝑖
𝑖
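A small numerical check of this argument, using the eigenvalue form of $MSE(\hat{\mathbf{w}}_K)$ with arbitrary values of $\gamma_i$, $\alpha_i$, and $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
Q = 6
gamma = rng.uniform(0.5, 5.0, Q)    # eigenvalues of X_K^T X_K
alpha = rng.standard_normal(Q)      # elements of alpha = P (mu_K - w_K)
sigma2 = 0.25

def mse(lam):
    return np.sum((sigma2 * gamma + lam**2 * alpha**2) / (gamma + lam)**2)

# lambda = 0 recovers the OLS MSE, sigma^2 tr((X_K^T X_K)^{-1}).
assert np.isclose(mse(0.0), sigma2 * np.sum(1.0 / gamma))

# MSE decreases on (0, lambda*), so any lambda in that range beats OLS.
lam_star = sigma2 / np.max(alpha**2)
vals = np.array([mse(l) for l in np.linspace(0.0, lam_star, 50)])
assert np.all(np.diff(vals) < 0)
```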
IV: Proof of Theorem 3.
According to (A-10), for a fixed $\lambda$, $MSE(\hat{\mathbf{w}}_K)$ changes only with respect to $(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K)$: the smaller this quantity, the smaller the $MSE(\hat{\mathbf{w}}_K)$. According to Definition 1, $(\boldsymbol{\mu}_K - \mathbf{w}_K)^T\mathbf{B}^T\mathbf{B}(\boldsymbol{\mu}_K - \mathbf{w}_K)$ is the transfer learning distance $d(\boldsymbol{\mu}_K;\lambda)$. Therefore, the smaller the transfer learning distance, the smaller the $MSE(\hat{\mathbf{w}}_K)$. This gives

$$MSE(\hat{\mathbf{w}}_K^{(1)};\lambda) \le MSE(\hat{\mathbf{w}}_K^{(2)};\lambda). \tag{A-13}$$

Let $\lambda^{(1)*} = \operatorname{argmin}_\lambda MSE(\hat{\mathbf{w}}_K^{(1)})$ and $\lambda^{(2)*} = \operatorname{argmin}_\lambda MSE(\hat{\mathbf{w}}_K^{(2)})$. Then, $MSE(\hat{\mathbf{w}}_K^{(1)};\lambda^{(1)*}) \le MSE(\hat{\mathbf{w}}_K^{(1)};\lambda^{(2)*}) \le MSE(\hat{\mathbf{w}}_K^{(2)};\lambda^{(2)*})$. The second inequality follows from (A-13). This completes the proof for Theorem 3.
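The ordering argument can be illustrated numerically. The sketch below constructs two hypothetical prior means whose transfer learning distances are ordered for every $\lambda$ (by taking one deviation to be a scalar multiple of the other) and checks both the pointwise ordering (A-13) and the comparison of the minimized MSEs:

```python
import numpy as np

rng = np.random.default_rng(5)
n, Q = 40, 5
XK = rng.standard_normal((n, Q))
wK = rng.standard_normal(Q)
delta = rng.standard_normal(Q)
mu1 = wK + 0.2 * delta       # closer prior mean
mu2 = wK + 2.0 * delta       # farther: d(mu2; lam) = 100 d(mu1; lam) for all lam
sigma2 = 0.25
G = XK.T @ XK

def mse(mu, lam):
    """MSE(w_hat_K) per (A-10)."""
    B = np.linalg.inv(G + lam * np.eye(Q))
    d = (mu - wK) @ B.T @ B @ (mu - wK)   # transfer learning distance
    return sigma2 * np.trace(B) - sigma2 * lam * np.trace(B @ B) + lam**2 * d

grid = np.linspace(0.01, 5.0, 100)
m1 = np.array([mse(mu1, l) for l in grid])
m2 = np.array([mse(mu2, l) for l in grid])

assert np.all(m1 <= m2 + 1e-12)   # pointwise ordering, as in (A-13)
assert m1.min() <= m2.min()       # ordering of the minimized MSEs (Theorem 3)
```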
V: Proof of Theorem 4.
Lemma 1: The optimization in (17) is equivalent to:

$$\hat{\mathbf{w}}_K^{\mathrm{II}} = \underset{\mathbf{w}_K}{\operatorname{argmin}}\left\{\|\mathbf{y}_K - \mathbf{X}_K\mathbf{w}_K\|_2^2 + \lambda_1\|\mathbf{w}_K\|_1 + \lambda_2\sum_{X_i\sim X_j}a_{ij}\big((w_{iK} - \mu_{iK}) - (w_{jK} - \mu_{jK})\big)^2\right\}. \tag{A-14}$$
Proof: To prove Lemma 1 is to prove $(\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{L}(\mathbf{w}_K - \boldsymbol{\mu}_K) = \sum_{X_i\sim X_j}a_{ij}\big((w_{iK} - \mu_{iK}) - (w_{jK} - \mu_{jK})\big)^2$. Start from the left-hand side. Write $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is a diagonal matrix of the nodes' degrees, i.e., $\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_Q)$, and $\mathbf{A} = \{a_{ij}\}$ is the matrix of edge weights, with zero diagonal. Then,

$$\begin{aligned}
(\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{L}(\mathbf{w}_K - \boldsymbol{\mu}_K)
&= (\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{D}(\mathbf{w}_K - \boldsymbol{\mu}_K) - (\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{A}(\mathbf{w}_K - \boldsymbol{\mu}_K) \\
&= \sum_{i=1}^{Q}(w_{iK} - \mu_{iK})^2 d_i - \sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})(w_{jK} - \mu_{jK})a_{ij}.
\end{aligned} \tag{A-15}$$

Plugging in the definition $d_i = \sum_{X_i\sim X_j}a_{ij}$, (A-15) becomes

$$\begin{aligned}
(\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{L}(\mathbf{w}_K - \boldsymbol{\mu}_K)
&= \sum_{i=1}^{Q}(w_{iK} - \mu_{iK})^2\sum_{X_i\sim X_j}a_{ij} - \sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})(w_{jK} - \mu_{jK})a_{ij} \\
&= \sum_{i=1}^{Q}\sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})^2 a_{ij} - \sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})(w_{jK} - \mu_{jK})a_{ij} \\
&= \frac{1}{2}\left(\sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})^2 a_{ij} + \sum_{X_j\sim X_i}(w_{jK} - \mu_{jK})^2 a_{ji}\right) - \sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})(w_{jK} - \mu_{jK})a_{ij}.
\end{aligned} \tag{A-16}$$

Because the graph is undirected, $a_{ji} = a_{ij}$, and (A-16) becomes

$$\begin{aligned}
(\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{L}(\mathbf{w}_K - \boldsymbol{\mu}_K)
&= \frac{1}{2}\left(\sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})^2 a_{ij} + \sum_{X_i\sim X_j}(w_{jK} - \mu_{jK})^2 a_{ij}\right) - \sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})(w_{jK} - \mu_{jK})a_{ij} \\
&= \frac{1}{2}\left(\sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})^2 a_{ij} + \sum_{X_i\sim X_j}(w_{jK} - \mu_{jK})^2 a_{ij} - 2\sum_{X_i\sim X_j}(w_{iK} - \mu_{iK})(w_{jK} - \mu_{jK})a_{ij}\right) \\
&= \frac{1}{2}\sum_{X_i\sim X_j}a_{ij}\big((w_{iK} - \mu_{iK}) - (w_{jK} - \mu_{jK})\big)^2.
\end{aligned}$$

The factor $1/2$ can be absorbed into $\lambda_2$.
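Lemma 1's identity is the standard graph-Laplacian quadratic form. A quick numerical check with a random undirected weight matrix (the double sum below runs over ordered pairs and counts each edge twice, hence the explicit factor $1/2$):

```python
import numpy as np

rng = np.random.default_rng(6)
Q = 7

# Random symmetric weight matrix A with zero diagonal (undirected graph).
A = rng.uniform(0.0, 1.0, (Q, Q))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

v = rng.standard_normal(Q)   # plays the role of w_K - mu_K
lhs = v @ L @ v
rhs = 0.5 * sum(A[i, j] * (v[i] - v[j])**2
                for i in range(Q) for j in range(Q))
assert np.isclose(lhs, rhs)
```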
Next, we prove Theorem 4. Denote the objective function in (A-14) by $\psi(\mathbf{w}_K)$, i.e.,

$$\psi(\mathbf{w}_K) = \|\mathbf{y}_K - \mathbf{X}_K\mathbf{w}_K\|_2^2 + \lambda_1\|\mathbf{w}_K\|_1 + \lambda_2\sum_{X_l\sim X_h}a_{lh}\big((w_{lK} - \mu_{lK}) - (w_{hK} - \mu_{hK})\big)^2.$$

Because $\hat{w}_{iK}^{\mathrm{II}}$ and $\hat{w}_{jK}^{\mathrm{II}}$ are solutions to the optimization problem in (A-14) and they are non-zero, they should satisfy:
πœ•β€–π²πΎ −𝐗 𝐾 𝐰𝐾 β€–22
πœ•π‘€π‘–πΎ
|
πœ•β€–π²πΎ −𝐗 𝐾 𝐰𝐾 β€–22
πœ•π‘€π‘—πΎ
II
̂𝐾
𝐰
|
II
𝐰
̂𝐾
πœ•π‘“(𝐰𝐾 )
πœ•π‘€π‘–πΎ
|
II
̂𝐾
𝐰
= 0 and
πœ•π‘“(𝐰𝐾 )
πœ•π‘€π‘—πΎ
|
II
𝐰
̂𝐾
= 0, i.e.,
𝐼𝐼
𝐼𝐼
𝐼𝐼
) + 2πœ†2 ∑𝑋𝑖 ~π‘‹β„Ž π‘Žπ‘–β„Ž ((𝑀
+ πœ†1 𝑠𝑔𝑛(𝑀
Μ‚ π‘–π‘˜
Μ‚ π‘–π‘˜
− πœ‡π‘–πΎ ) − (𝑀
Μ‚ β„Žπ‘˜
− πœ‡β„ŽπΎ )) = 0,
(A-17)
𝐼𝐼
𝐼𝐼
𝐼𝐼
+ πœ†1 𝑠𝑔𝑛(𝑀
Μ‚π‘—π‘˜
) − 2πœ†2 ∑𝑋𝑙 ~𝑋𝑗 π‘Žπ‘™π‘— ((𝑀
Μ‚ π‘™π‘˜
− πœ‡π‘™πΎ ) − (𝑀
Μ‚π‘—π‘˜
− πœ‡π‘—πΎ )) = 0.
(A-18)
Focusing on (A-17), the third term on the left-hand side can be written as:

$$\begin{aligned}
&2\lambda_2\sum_{X_i\sim X_h}a_{ih}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{hK}^{\mathrm{II}} - \mu_{hK})\big) \\
&\quad= 2\lambda_2 a_{ij}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big) + 2\lambda_2\sum_{X_i\sim X_h,\,h\ne j}a_{ih}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{hK}^{\mathrm{II}} - \mu_{hK})\big) \\
&\quad\approx 2\lambda_2 a_{ij}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big).
\end{aligned} \tag{A-19}$$
The last step follows from the given assumption that $a_{ij} \gg a_{ih}$. Similarly, the third term on the left-hand side of (A-18) can be written as

$$2\lambda_2\sum_{X_l\sim X_j}a_{lj}\big((\hat{w}_{lK}^{\mathrm{II}} - \mu_{lK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big) \approx 2\lambda_2 a_{ij}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big). \tag{A-20}$$
Considering (A-19) and (A-20) and taking the difference between (A-17) and (A-18), $\lambda_1\,\mathrm{sgn}(\hat{w}_{iK}^{\mathrm{II}})$ cancels with $\lambda_1\,\mathrm{sgn}(\hat{w}_{jK}^{\mathrm{II}})$ because it is known that $\hat{w}_{iK}^{\mathrm{II}}\hat{w}_{jK}^{\mathrm{II}} > 0$, and we get:

$$\frac{\partial\|\mathbf{y}_K - \mathbf{X}_K\mathbf{w}_K\|_2^2}{\partial w_{iK}}\bigg|_{\hat{\mathbf{w}}_K^{\mathrm{II}}} - \frac{\partial\|\mathbf{y}_K - \mathbf{X}_K\mathbf{w}_K\|_2^2}{\partial w_{jK}}\bigg|_{\hat{\mathbf{w}}_K^{\mathrm{II}}} + 4\lambda_2 a_{ij}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big) = 0, \tag{A-21}$$

i.e.,

$$-2(\mathbf{x}_{iK} - \mathbf{x}_{jK})^T(\mathbf{y}_K - \mathbf{X}_K\hat{\mathbf{w}}_K^{\mathrm{II}}) + 4\lambda_2 a_{ij}\big((\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big) = 0. \tag{A-22}$$
Furthermore, we can get:

$$\big|(\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big| \le \frac{\|\mathbf{x}_{iK} - \mathbf{x}_{jK}\|_2 \times \|\mathbf{y}_K - \mathbf{X}_K\hat{\mathbf{w}}_K^{\mathrm{II}}\|_2}{2\lambda_2}\cdot\frac{1}{a_{ij}}. \tag{A-23}$$
We would like to have an upper bound that does not include $\hat{\mathbf{w}}_K^{\mathrm{II}}$. To achieve this, we adopt the following strategy: because $\hat{\mathbf{w}}_K^{\mathrm{II}}$ is the optimal solution, $\psi(\hat{\mathbf{w}}_K^{\mathrm{II}})$ should be the smallest. Therefore, $\psi(\hat{\mathbf{w}}_K^{\mathrm{II}}) \le \psi(\mathbf{0})$, i.e.,
$$\begin{aligned}
\psi(\hat{\mathbf{w}}_K^{\mathrm{II}}) &= \|\mathbf{y}_K - \mathbf{X}_K\hat{\mathbf{w}}_K^{\mathrm{II}}\|_2^2 + \lambda_1\|\hat{\mathbf{w}}_K^{\mathrm{II}}\|_1 + \lambda_2\sum_{X_l\sim X_h}a_{lh}\big((\hat{w}_{lK}^{\mathrm{II}} - \mu_{lK}) - (\hat{w}_{hK}^{\mathrm{II}} - \mu_{hK})\big)^2 \\
&\le \psi(\mathbf{0}) = \|\mathbf{y}_K\|_2^2 + \lambda_2\sum_{X_l\sim X_h}a_{lh}(\mu_{lK} - \mu_{hK})^2.
\end{aligned} \tag{A-24}$$
Then,

$$\|\mathbf{y}_K - \mathbf{X}_K\hat{\mathbf{w}}_K^{\mathrm{II}}\|_2^2 \le \|\mathbf{y}_K\|_2^2 + \lambda_2\sum_{X_l\sim X_h}a_{lh}(\mu_{lK} - \mu_{hK})^2 \approx \|\mathbf{y}_K\|_2^2 + 2\lambda_2 a_{ij}(\mu_{iK} - \mu_{jK})^2,$$

and

$$\|\mathbf{y}_K - \mathbf{X}_K\hat{\mathbf{w}}_K^{\mathrm{II}}\|_2 \le \sqrt{\|\mathbf{y}_K\|_2^2 + 2\lambda_2 a_{ij}(\mu_{iK} - \mu_{jK})^2}. \tag{A-25}$$
Inserting (A-25) into (A-23), we get

$$\big|(\hat{w}_{iK}^{\mathrm{II}} - \mu_{iK}) - (\hat{w}_{jK}^{\mathrm{II}} - \mu_{jK})\big| \le \frac{\|\mathbf{x}_{iK} - \mathbf{x}_{jK}\|_2}{2\lambda_2} \times \sqrt{\frac{\|\mathbf{y}_K\|_2^2}{a_{ij}^2} + \frac{2\lambda_2(\mu_{iK} - \mu_{jK})^2}{a_{ij}}}.$$
VI: Obtaining (24) by the gradient method.
Given $\mathbf{w}_K$, the optimization problem in (24) with respect to $\varsigma_K$ and $\boldsymbol{\pi}_K$ is:

$$\min_{\varsigma_K,\boldsymbol{\pi}_K}\varphi(\varsigma_K, \boldsymbol{\pi}_K) = \min_{\varsigma_K,\boldsymbol{\pi}_K}\left\{Q\log(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K) + \frac{1}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K}(\mathbf{w}_K - \boldsymbol{\mu}_K)^T\mathbf{L}(\mathbf{w}_K - \boldsymbol{\mu}_K)\right\}.$$
Using the gradient method, set the partial derivatives of $\varphi(\varsigma_K, \boldsymbol{\pi}_K)$ to zero:

$$\begin{cases}
\partial\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\varsigma_K = 0, \\
\partial\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\boldsymbol{\pi}_K = \mathbf{0},
\end{cases}$$
i.e.,

$$\frac{\partial\varphi(\varsigma_K, \boldsymbol{\pi}_K)}{\partial\varsigma_K} = \frac{Q}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K} - \frac{(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^T\mathbf{L}(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)}{(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^2} = 0, \tag{A-26}$$

$$\frac{\partial\varphi(\varsigma_K, \boldsymbol{\pi}_K)}{\partial\boldsymbol{\pi}_K} = -\frac{2Q\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K} - \frac{2\tilde{\boldsymbol{\Omega}}^{-1}\tilde{\mathbf{W}}^T\mathbf{L}(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K} + \frac{2\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^T\mathbf{L}(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)}{(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^2} = \mathbf{0}. \tag{A-27}$$
From (A-26), we can get:

$$Q(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K) = (\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^T\mathbf{L}(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K). \tag{A-28}$$
Inserting (A-28) into the third term of (A-27), that term cancels the first term, and through some algebra we can get:

$$\tilde{\boldsymbol{\Omega}}^{-1}\tilde{\mathbf{W}}^T\mathbf{L}\mathbf{w}_K - \tilde{\boldsymbol{\Omega}}^{-1}\tilde{\mathbf{W}}^T\mathbf{L}\tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K = \mathbf{0}. \tag{A-29}$$

According to (22), $\tilde{\mathbf{W}}^T\mathbf{L}\tilde{\mathbf{W}} = Q\tilde{\boldsymbol{\Omega}}$. Using this in (A-29),

$$\hat{\boldsymbol{\pi}}_K = \tilde{\mathbf{W}}^T\mathbf{L}\mathbf{w}_K/Q.$$
Furthermore, according to (A-28),

$$\begin{aligned}
\varsigma_K &= \frac{1}{Q}(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^T\mathbf{L}(\mathbf{w}_K - \tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K) + \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K \\
&= \frac{1}{Q}\mathbf{w}_K^T\mathbf{L}\mathbf{w}_K - \frac{2}{Q}\mathbf{w}_K^T\mathbf{L}\tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K + \frac{1}{Q}\boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\tilde{\mathbf{W}}^T\mathbf{L}\tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K + \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K \\
&= \frac{1}{Q}\mathbf{w}_K^T\mathbf{L}\mathbf{w}_K - \frac{2}{Q}\mathbf{w}_K^T\mathbf{L}\tilde{\mathbf{W}}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K + 2\boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K.
\end{aligned} \tag{A-30}$$

Substituting $\hat{\boldsymbol{\pi}}_K = \tilde{\mathbf{W}}^T\mathbf{L}\mathbf{w}_K/Q$ into (A-30), the second and third terms cancel, and we can get $\hat{\varsigma}_K = \frac{1}{Q}\mathbf{w}_K^T\mathbf{L}\mathbf{w}_K$.
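The stationary point $(\hat{\varsigma}_K, \hat{\boldsymbol{\pi}}_K)$ can be checked numerically. The sketch below uses a random symmetric positive definite stand-in for $\mathbf{L}$ and defines $\tilde{\boldsymbol{\Omega}} = \tilde{\mathbf{W}}^T\mathbf{L}\tilde{\mathbf{W}}/Q$ so that the relation in (22) holds by construction; it then verifies that $\varphi$ is no smaller at random feasible perturbations around the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(7)
Q, Km1 = 6, 4                        # Km1 plays the role of K - 1

M = rng.standard_normal((Q, Q))
L = M @ M.T + np.eye(Q)              # symmetric positive definite stand-in for L
W_t = rng.standard_normal((Q, Km1))  # W-tilde
wK = rng.standard_normal(Q)
Omega_t = W_t.T @ L @ W_t / Q        # makes W~^T L W~ = Q * Omega~, as in (22)

def phi(sig, pi):
    s = sig - pi @ np.linalg.solve(Omega_t, pi)
    r = wK - W_t @ np.linalg.solve(Omega_t, pi)
    return Q * np.log(s) + (r @ L @ r) / s

pi_hat = W_t.T @ L @ wK / Q          # closed-form pi_hat
sig_hat = wK @ L @ wK / Q            # closed-form varsigma_hat
base = phi(sig_hat, pi_hat)

checked = 0
while checked < 20:                  # random feasible perturbations
    sig = sig_hat + rng.normal(scale=0.2)
    pi = pi_hat + 0.2 * rng.standard_normal(Km1)
    if sig - pi @ np.linalg.solve(Omega_t, pi) <= 1e-6:
        continue                     # skip infeasible points (log undefined)
    assert phi(sig, pi) >= base - 1e-9
    checked += 1
```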
Finally, in order to prove that $\hat{\boldsymbol{\pi}}_K$ and $\hat{\varsigma}_K$ are optimal solutions of the minimization problem, we need to show that

$$\partial^2\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\varsigma_K^2\big|_{\hat{\varsigma}_K,\hat{\boldsymbol{\pi}}_K} > 0 \quad\text{and}\quad \partial^2\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\boldsymbol{\pi}_K^2\big|_{\hat{\varsigma}_K,\hat{\boldsymbol{\pi}}_K} \succ 0,$$

where "$\succ 0$" denotes a matrix being positive definite. It can be derived that:

$$\partial^2\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\varsigma_K^2\big|_{\hat{\varsigma}_K,\hat{\boldsymbol{\pi}}_K} = \frac{Q}{(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^2} > 0.$$

Furthermore,

$$\partial^2\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\boldsymbol{\pi}_K^2\big|_{\hat{\varsigma}_K,\hat{\boldsymbol{\pi}}_K} = \frac{2Q}{(\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K)^2}\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K\boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1} + \frac{2Q}{\varsigma_K - \boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K}\tilde{\boldsymbol{\Omega}}^{-1},$$

where $\tilde{\boldsymbol{\Omega}}^{-1}\boldsymbol{\pi}_K\boldsymbol{\pi}_K^T\tilde{\boldsymbol{\Omega}}^{-1} \succeq 0$ (it is rank-one, hence only positive semi-definite) and $\tilde{\boldsymbol{\Omega}}^{-1} \succ 0$. So $\partial^2\varphi(\varsigma_K, \boldsymbol{\pi}_K)/\partial\boldsymbol{\pi}_K^2\big|_{\hat{\varsigma}_K,\hat{\boldsymbol{\pi}}_K} \succ 0$.