STOCHASTIC GRADIENT DESCENT AND THE RANDOMIZED KACZMARZ ALGORITHM

arXiv:1310.5715v2 [math.NA] 15 Feb 2014
DEANNA NEEDELL, NATHAN SREBRO, AND RACHEL WARD
ABSTRACT. We show that the exponential convergence rate of stochastic gradient descent for smooth strongly
convex objectives can be markedly improved by perturbing the row selection rule in the direction of sampling estimates proportionally to the Lipschitz constants of their gradients. That is, we show that partially
biased sampling allows a convergence rate with linear dependence on the average condition number of
the system, compared to dependence on the average squared condition number for standard stochastic
gradient descent. We assume the regime where all stochastic estimates share an optimum and so such an
exponential rate is possible. We then recast the randomized Kaczmarz algorithm for solving overdetermined
linear systems as an instance of stochastic gradient descent, and apply our results to prove its exponential
convergence, but to the solution of a weighted least squares problem rather than the original least squares
problem. We present a modified Kaczmarz algorithm with partially biased sampling which does converge
to the original least squares solution with the same exponential convergence rate.
1. INTRODUCTION
This paper connects two algorithms which until now have remained remarkably disjoint in the literature: the randomized Kaczmarz algorithm for solving linear systems and the stochastic gradient descent
(SGD) method for optimizing a convex objective using unbiased gradient estimates. The connection enables us to make contributions in both directions, borrowing from each body of literature to benefit the other. We extend the idea of importance sampling from the Kaczmarz literature, and introduce a family of SGD algorithms with a nonuniform selection rule. We show that by sampling from a hybrid uniform/biased distribution over the estimates, the proposed family of algorithms enjoys the advantages of both importance
sampling (improved rate of convergence) and unbiased sampling (higher noise tolerance), and overall
compares favorably to recent bounds for SGD with uniform row selection in the case of sums of smooth
and strongly convex functions.
Recall that stochastic gradient descent is a method for minimizing a convex objective F (x) based on
access to unbiased stochastic gradient estimates, i.e. to an estimate g for the gradient at a given point
x, such that E[g ] = ∇F (x). Viewing F (x) as an expectation F (x) = Ei [ f i (x)], the unbiased gradient estimate can be obtained by drawing i and using its gradient: g = ∇ f i (x). SGD originated under the banner
of “Stochastic Approximation” in the pioneering work of Robbins and Monro [30], and has recently received renewed attention for confronting very large scale problems, especially in the context of machine learning [2, 31, 23].
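To fix ideas, the following minimal sketch (ours, not from the paper) implements plain SGD with unbiased gradient estimates for a finite sum F(x) = (1/n) Σ_i f_i(x); the quadratic components, step size, and iteration budget are illustrative choices.

```python
import numpy as np

def sgd_unbiased(grad_fns, x0, step, iters, rng=None):
    """Plain SGD: at each step draw a component index i uniformly at random,
    so grad_fns[i](x) is an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    n = len(grad_fns)
    for _ in range(iters):
        i = rng.integers(n)          # i ~ D (uniform source distribution)
        x -= step * grad_fns[i](x)   # step along the stochastic gradient
    return x

# Toy example: F(x) = (1/n) * sum_i 0.5*(a_i.x - b_i)^2 (least squares components).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
grads = [lambda x, a=a, bi=bi: (a @ x - bi) * a for a, bi in zip(A, b)]
x_hat = sgd_unbiased(grads, np.zeros(5), step=0.01, iters=5000)
```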
Classical analysis of SGD shows a polynomial rate on the suboptimality of the objective value, F(x_k) − F(x_⋆): namely 1/√k for non-smooth objectives, and 1/k for smooth, or for non-smooth but strongly convex, objectives [3]. Such convergence can be ensured even if the iterates x_k do not necessarily converge to a unique optimum x_⋆, as might be the case if F(x) is not strongly convex. Here we consider the strongly convex case, where the optimum is unique, and focus on convergence of the iterates x_k to the optimum x_⋆.
Date: February 18, 2014.
DN: Department of Mathematical Sciences, Claremont McKenna College (dneedell@cmc.edu). NS: Toyota Technological
Institute at Chicago (nati@ttic.edu). RW: Department of Mathematics, University of Texas at Austin (rward@math.utexas.edu).
Bach and Moulines [1] recently provided a non-asymptotic bound on the convergence of the iterates in strongly convex SGD, improving on previous results of this kind [18, Section 2.2], [3, Section 3.2], [32]. In particular, Bach and Moulines showed that if each f_i(x) is smooth and if x_⋆ is a minimizer of (almost) all f_i(x), i.e. P_i(∇f_i(x_⋆) = 0) = 1, then E‖x_k − x_⋆‖² goes to zero exponentially, rather than polynomially, in k. That is, reaching a desired accuracy of E‖x_k − x_⋆‖² ≤ ε requires a number of steps that scales only logarithmically in 1/ε. Bach and Moulines's bound on the required number of iterations further depends on the average squared condition number E[κ_i²] = E[(L_i/µ)²], where L_i is the Lipschitz constant of ∇f_i(x) (i.e. f_i(x) is L_i-smooth) and F(x) is µ-strongly convex. If x_⋆ is not an exact minimizer of each f_i(x), the bound degrades gracefully as a function of σ² = E‖∇f_i(x_⋆)‖², and includes an unavoidable term that behaves as σ²/k.
In a seemingly independent line of research, the Kaczmarz method was proposed as an iterative method
for solving (usually overdetermined) systems of linear equations [14]. The simplicity of the method
makes it useful in a wide array of applications ranging from computed tomography to digital signal processing [11, 19, 13]. Recently, Strohmer and Vershynin proposed a variant of the Kaczmarz method using
a random selection rule which selects rows with probability proportional to their squared norm [33],
and showed that using this selection strategy, a desired accuracy of ε can be reached in the noiseless
setting in a number of steps that scales like log(1/ε) and linearly in the condition number.
1.1. Contribution of this work. Inspired by the analysis of Strohmer and Vershynin [33] and Bach and
Moulines [1], we prove convergence results for Stochastic Gradient Descent (SGD) and a family of variants of SGD parametrized by the degree to which the random selection strategy chooses estimates proportionally to the Lipschitz constants of their gradients. We show that by perturbing the row selection
strategy towards the conditioning of the constituents in a sum of functions with Lipschitz gradients, we
arrive at improved convergence rates for SGD over the bounds in [1], showing that the convergence rate can be improved to depend on the average condition number (Σ_i L_i)/µ rather than on the average squared condition number (Σ_i L_i²)/µ² as in [1], without amplifying the dependence on the residual.
Our bounds also improve on those in [1] for SGD with unbiased selection when the Lipschitz constants
L i are comparable. We then show that the randomized Kaczmarz method with uniform i.i.d. row selection can be recast as an instance of Stochastic Gradient Descent acting on a re-weighted least squares
problem and through this connection, provide exponential convergence rates for this algorithm. We
also consider the family of Kaczmarz algorithms corresponding to SGD with hybrid row selection strategy which shares the exponential convergence rates of Strohmer and Vershynin [33] while also sharing a
small error residual term of the SGD algorithm.
1.2. Notation and Fundamentals. We consider the problem of minimizing a smooth convex function,

x_⋆ = argmin_x F(x),    (1.1)

where F(x) is of the form F(x) = E_{i∼D} f_i(x) for smooth functionals f_i : H → R over H = R^d endowed with the standard Euclidean norm ‖·‖_2, or over a Hilbert space H with the norm ‖·‖_2. Here i is drawn from some source distribution D over an arbitrary probability space. Throughout this manuscript, unless explicitly specified otherwise, expectations will be with respect to indices drawn from the source distribution D; i.e., we write E f_i(x) = E_{i∼D} f_i(x). We also denote by σ² the "residual" quantity at the minimum, σ² = E‖∇f_i(x_⋆)‖²_2.
We will instate the following assumptions on the function F:
(1) Each f_i is continuously differentiable and the gradient function ∇f_i has Lipschitz constant L_i; that is, ‖∇f_i(x) − ∇f_i(y)‖_2 ≤ L_i ‖x − y‖_2 for all vectors x and y.
(2) F has strong convexity parameter µ; that is, ⟨x − y, ∇F(x) − ∇F(y)⟩ ≥ µ‖x − y‖²_2 for all vectors x and y.
Note in particular that the strong convexity assumption ensures that the minimum of (1.1) is unique. Formally, since i is random, we have that almost surely, for all vectors x, y, ‖∇f_i(x) − ∇f_i(y)‖_2 ≤ L_i ‖x − y‖_2. We denote by sup_i L_i the supremum of the support of L_i, i.e. the smallest L such that L_i ≤ L a.s., and similarly denote by inf_i L_i the infimum. We denote the average Lipschitz constant by L = E L_i.

A central quantity in our analysis is the conditioning of the problem, which is, roughly speaking, the ratio of the Lipschitz constant to the parameter of strong convexity. Recall that for a convex quadratic f(x) = ½ x^T H x, the Lipschitz constant of the gradient is given by the maximal eigenvalue of the Hessian H, while the parameter of strong convexity is given by its minimal eigenvalue. The conditioning thus corresponds to the condition number of the Hessian matrix. In the general setting considered here, the Hessian can vary with x, and our results will depend on the Lipschitz constants of the ∇f_i and not only on that of the aggregate ∇F. Specifically, our results will depend on the average conditioning L/µ and the uniform conditioning sup_i L_i/µ.
1.3. Reweighting a Distribution. In stochastic gradient descent, gradient estimates ∇f_i(x) are usually sampled according to the source distribution D. However, we will also analyze sampling from a weighted distribution. For a weight function w(i), which assigns a non-negative weight w(i) ≥ 0 to each index i, the weighted distribution D^(w) is defined as the distribution such that

P_{D^(w)}(I) ∝ E_{i∼D}[1_I(i) w(i)],

where I is an event (subset of indices) and 1_I(·) its indicator function. For a discrete distribution D with probability mass function p(i) this corresponds to reweighting the probabilities to obtain a new probability mass function

p^(w)(i) ∝ w(i) p(i).

Similarly, for a continuous distribution, this corresponds to multiplying the density by w(i) and renormalizing.
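For a discrete source distribution the reweighting is a one-line computation; here is an illustrative NumPy sketch (the specific p and w values are arbitrary examples).

```python
import numpy as np

def reweighted_pmf(p, w):
    """p^(w)(i) proportional to w(i) p(i): multiply the source pmf by the weights and renormalize."""
    q = np.asarray(p, float) * np.asarray(w, float)
    return q / q.sum()

p = np.full(4, 0.25)                 # uniform source distribution D
w = np.array([1.0, 2.0, 3.0, 2.0])   # non-negative weights w(i)
p_w = reweighted_pmf(p, w)           # [0.125, 0.25, 0.375, 0.25]
rng = np.random.default_rng(0)
samples = rng.choice(len(p_w), size=10_000, p=p_w)   # draws i ~ D^(w)
```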
One way to construct the weighted distribution D^(w), and sample from it, is through rejection sampling: sample i ∼ D, accepting with probability w(i)/W, for some W ≥ sup_i w(i), and otherwise rejecting and continuing to re-sample until a proposed i is accepted. The accepted samples are then distributed according to D^(w).
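The rejection-sampling construction can be sketched as follows (our code; it assumes only the ability to draw i ∼ D and to evaluate w(i), with W an upper bound on the weights).

```python
import numpy as np

def sample_weighted_by_rejection(draw_from_D, w, W, rng):
    """Draw i ~ D and accept with probability w(i)/W, where W >= sup_i w(i);
    the accepted index is distributed according to D^(w)."""
    while True:
        i = draw_from_D(rng)
        if rng.random() < w(i) / W:
            return i

rng = np.random.default_rng(0)
weights = np.array([1.0, 2.0, 3.0, 2.0])
draw = lambda rng: rng.integers(len(weights))      # D = uniform over indices
i = sample_weighted_by_rejection(draw, lambda j: weights[j], weights.max(), rng)
```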
We use E^(w)[·] = E_{i∼D^(w)}[·] to denote an expectation where indices are sampled from the weighted distribution D^(w). An important property of such an expectation is that for any quantity X(i) that depends on i,

E^(w)[ (1/w(i)) X(i) ] = (1/E[w(i)]) · E[X(i)],    (1.2)

where, recall, the expectations on the right-hand side are with respect to i ∼ D. In particular, when E[w(i)] = 1, as will be the case for us, we have E^(w)[ (1/w(i)) X(i) ] = E[X(i)].
1.4. Organization. The remainder of the paper is organized as follows. In Section 2 we introduce the
stochastic gradient descent (SGD) method as well as our main result which shows exponential convergence of SGD with an improvement in the rate over previous results. Next, Section 3 draws connections
and discusses the application of our result to the randomized Kaczmarz method for linear systems. We
present experimental results in Section 4. The proofs of our main theorems are included in Section 5.
2. STOCHASTIC GRADIENT DESCENT WITH PARTIALLY BIASED SAMPLING
The standard SGD procedure is to sample i ∼ D, and take a step proportional to −∇f_i(x). That is, the "selection" of the component i follows the source distribution D specifying the objective F. Such a sampling assures that ∇f_i(x) is an unbiased estimator of ∇F(x), and so that the method converges to the optimum x_⋆ of F(·). Here we borrow from variants of the randomized Kaczmarz method for solving systems of linear equations, in which various selection strategies for equations have been proposed, including selection based on the Euclidean norm of the coefficients [33]. For SGD, this corresponds to sampling with probability proportional to L_i and then re-weighting ("pre-conditioning") the functions f_i(·) so that the re-weighted gradients still form an unbiased estimator of the true gradient. For the randomized Kaczmarz method, and as we shall see below also for SGD, selection with probability proportional to the squared Euclidean norm yields better dependence on the conditioning when there is no residual error. On the other hand, the uniform selection strategy is more robust to residual error. To analyze these two types of selection strategies simultaneously, and also leverage the best of both worlds, we consider a family of interpolative distributions

D^(λ) = D^(w_λ)    (2.1)

specified by the weights

w_λ(i) = λ · 1 + (1 − λ) · L_i / L,    (2.2)

where λ is a parameter in the range 0 ≤ λ ≤ 1. Note that (2.2) ensures E[w_λ(i)] = 1 for all λ.

At λ = 0, the algorithm we propose corresponds to the Kaczmarz algorithm with weighted sampling, and at λ = 1, it corresponds to "standard" SGD with unbiased row selection. At intermediate values of λ, the algorithm is a hybrid Kaczmarz-SGD algorithm, and we refer to it as Stochastic Gradient Descent with Partially Biased Sampling (PBS-SGD). As we shall see, this algorithm enjoys the faster convergence rate of Kaczmarz and the lower residual error of SGD. We introduce the family of algorithms more completely below in Algorithm 2.1.
Algorithm 2.1 Stochastic Gradient Descent with Partially Biased Sampling

Input:
• Initial estimate x_0 ∈ R^d
• Parameter λ ∈ [0, 1] indicating the degree of uniform sampling
• Step size γ > 0
• Tolerance parameter δ > 0
• Access to the source distribution D
• If λ < 1: bounds on the Lipschitz constants L_i; the weights w_λ(i) derived from them (see eq. (2.2)); and access to D^(λ) (see eq. (2.1))

Output: Estimated solution x̂ to the problem min_x F(x)

k ← 0
repeat
  k ← k + 1
  Draw an index i ∼ D^(λ)
  x_k ← x_{k−1} − (γ/w_λ(i)) ∇f_i(x_{k−1})
until F(x_k) ≤ δ
x̂ ← x_k
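The following NumPy sketch (ours) implements Algorithm 2.1 for a finite sum with a uniform source distribution and known Lipschitz constants; for simplicity it runs a fixed number of iterations instead of testing the tolerance δ, and the least squares instance at the end is purely illustrative.

```python
import numpy as np

def pbs_sgd(grad_fns, lipschitz, x0, lam, gamma, iters, rng=None):
    """SGD with partially biased sampling (Algorithm 2.1, sketch).
    Weights w_lambda(i) = lam + (1-lam) * L_i / mean(L); indices are drawn
    from the reweighted distribution and gradients are rescaled by 1/w_lambda(i)."""
    rng = np.random.default_rng(rng)
    L = np.asarray(lipschitz, dtype=float)
    w = lam + (1.0 - lam) * L / L.mean()      # eq. (2.2); source D is uniform here
    probs = w / w.sum()                        # D^(lambda): reweighted uniform pmf
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        i = rng.choice(len(w), p=probs)
        x -= (gamma / w[i]) * grad_fns[i](x)
    return x

# Least squares instance: f_i(x) = (n/2)(<a_i, x> - b_i)^2, L_i = n * ||a_i||^2.
rng = np.random.default_rng(0)
n, d = 200, 10
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
grads = [lambda x, a=a, bi=bi: n * (a @ x - bi) * a for a, bi in zip(A, b)]
L = n * (A ** 2).sum(axis=1)
x_hat = pbs_sgd(grads, L, np.zeros(d), lam=0.5, gamma=0.25 / L.mean(), iters=20_000)
```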
Our main result shows expected exponential convergence of this method with a linear dependence on
the average conditioning L/µ = EL i /µ.
Theorem 2.1 (Convergence rate for SGD with partially biased sampling). Let each f_i be a continuously differentiable convex functional on R^d whose gradient ∇f_i has Lipschitz constant L_i, and let F(x) = E_{i∼D} f_i(x) be µ-strongly convex. Set σ² = E_{i∼D} ‖∇f_i(x_⋆)‖²_2, where x_⋆ = argmin_x F(x) is the minimizer of the problem. Set

α = α(λ) = min{ L/(1−λ), (sup_i L_i)/λ }   and   β = β(λ) = min{ 1/λ, L/((1−λ) inf_i L_i) },

and consider step size γ < 1/α. Then the iterates x_k of Algorithm 2.1 satisfy

E‖x_k − x_⋆‖²_2 ≤ [1 − 2γµ(1 − γα)]^k ‖x_0 − x_⋆‖²_2 + γβσ² / (µ(1 − γα)),    (2.3)

where the expectation is with respect to the random sampling in the algorithm.
If we are given a desired tolerance, E‖x_k − x_⋆‖²_2 ≤ ε, and we know the Lipschitz constants and the parameter of strong convexity, we may optimize the step size γ. This gives rise to the following corollary.

Corollary 2.2. Fix λ ∈ [0, 1]. Set α = min{ L/(1−λ), (sup_i L_i)/λ } and β = min{ 1/λ, L/((1−λ) inf_i L_i) }. Consider Algorithm 2.1 with step size

γ = µε / (2εµα + 2βσ²).

Then E‖x_k − x_⋆‖²_2 ≤ ε is obtained after

k = 2 log(ε_0/ε) ( α/µ + βσ²/(µ²ε) )    (2.4)

iterations of Algorithm 2.1, where ε_0 = ‖x_0 − x_⋆‖²_2.

• Setting λ = 1 (as in standard Stochastic Gradient Descent), k = 2 log(ε_0/ε) ( (sup_i L_i)/µ + σ²/(µ²ε) ).
• Setting λ = 0 (as in Kaczmarz with fully biased row selection), k = 2 log(ε_0/ε) ( L/µ + Lσ²/((inf_i L_i) µ²ε) ).
• Setting λ = 1/2 (partial biasing), k = 4 log(ε_0/ε) ( L/µ + σ²/(µ²ε) ).
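As a small illustration of Corollary 2.2, the helper below (our own wrapper around the corollary's formulas) computes α, β, the optimized step size γ, and the iteration bound k from the problem constants, which are assumed known.

```python
import numpy as np

def pbs_parameters(L, mu, sigma2, eps, eps0, lam):
    """alpha, beta, step size gamma, and iteration bound k from Corollary 2.2.
    L: array of Lipschitz constants L_i; mu: strong convexity; sigma2: residual;
    eps: target error; eps0: ||x_0 - x_*||^2; lam in [0, 1]."""
    L = np.asarray(L, dtype=float)
    Lbar = L.mean()
    alpha = min(Lbar / (1 - lam), L.max() / lam) if 0 < lam < 1 else (Lbar if lam == 0 else L.max())
    beta = min(1 / lam, Lbar / ((1 - lam) * L.min())) if 0 < lam < 1 else (Lbar / L.min() if lam == 0 else 1.0)
    gamma = mu * eps / (2 * eps * mu * alpha + 2 * beta * sigma2)
    k = 2 * np.log(eps0 / eps) * (alpha / mu + beta * sigma2 / (mu ** 2 * eps))
    return alpha, beta, gamma, k

# Example: comparable Lipschitz constants, lambda = 1/2.
alpha, beta, gamma, k = pbs_parameters(L=np.ones(100), mu=0.1, sigma2=1.0,
                                       eps=1e-2, eps0=1.0, lam=0.5)
```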
2.1. Comparison to Prior Work. Bach and Moulines [1, Theorem 1] studied stochastic gradient descent with unbiased sampling, i.e. using λ = 1, and established that¹

k = 2 log(ε_0/ε) ( E L_i²/µ² + σ²/(µ²ε) )    (2.5)

steps of Algorithm 2.1 (with λ = 1) are sufficient to obtain an error bound of E‖x_k − x_⋆‖²_2 ≤ ε using a step size of γ = µε/(2ε E L_i² + 2σ²).
• For unbiased sampling (λ = 1), our bound replaces the quadratic dependence on the average
square conditioning (EL 2i /µ2 ) with a linear dependence on the uniform conditioning (sup L i /µ).
This is a quadratic improvement when the Lipschitz constants L i are all of similar magnitude,
but is not an improvement in all situations.
• For fully biased importance sampling (λ = 0), we obtain a linear dependence on the average
conditioning L/µ = E L_i/µ, which is always better than the average squared conditioning of Bach and
Moulines. However, in this case, if there is a non-zero residual and σ > 0, the residual term is
larger by a factor of (L/ inf L i ).
• For partially biased sampling, with intermediate values of λ such as λ = 1/2, our results leverage the benefits of both uniform and importance sampling. For λ = 1/2, we obtain the desired
linear dependence on L (always improving over Bach and Moulines), without introducing any
1Bach and Moulines’s results are somewhat more general. Their Lipschitz requirement is a bit weaker and more compli-
cated, and in terms of L i yields (2.5). They also study the use of polynomial decaying step-sizes, but these do not lead to
improved runtime if the target accuracy is known ahead of time.
6
TT
additional factor to the residual term, except for a constant factor of two. We thus obtain a result which dominates Bach and Moulines (up to a factor of 2) and substantially improves upon it
(with a linear rather than quadratic dependence on the conditioning).
The crux of the improvement over Bach and Moulines is in a tighter recursive inequality. Whereas Bach and Moulines rely on the recursion

E‖x_{k+1} − x_⋆‖²_2 ≤ (1 − 2γµ + 2γ²L²) ‖x_k − x_⋆‖²_2 + 2γ²σ²,

we use the co-coercivity lemma (Lemma 5.1) to obtain the tighter recursion

E‖x_{k+1} − x_⋆‖²_2 ≤ (1 − 2γµ + 2γ²µL) ‖x_k − x_⋆‖²_2 + 2γ²σ²,

where L is the Lipschitz constant of the component used in the current iterate. The significant difference is that one of the factors of L (an upper bound on the second derivative), in the third term inside the parenthesis, is replaced by µ (a lower bound on the second derivative).
2.2. Importance Sampling. As discussed above, in some regimes, when the Lipschitz constants L i are
of similar magnitudes, we improve over Bach and Moulines even with unbiased sampling. But when
the magnitudes are highly variable, importance sampling is necessary in order to obtain a dependence on the average, rather than uniform, conditioning. In some applications, especially when the Lipschitz
constants are known in advance or easily calculated or bounded, such importance sampling might be
possible. This is the case, for example, in trigonometric approximation problems or linear systems which
need to be solved repeatedly, or when the Lipschitz constant is easily computed from the data, and multiple passes over the data are needed anyway. We do acknowledge that in other regimes, when data is
presented in an online fashion, or when we only have sampling access to the source distribution D (or
the implied distribution over gradient estimates), importance sampling might be difficult.
One option that could be considered, in light of the above results, is to use rejection sampling to simulate sampling from D (λ) . E.g. for λ = 0, this can be done by accepting samples with probability proportional to L i / sup j L j . The overall probability of accepting a sample is then L/ sup L i , introducing an
additional factor of sup L i /L, and thus again obtaining a linear dependence on sup L i . Thus, if we are
presented samples from D, and the cost of obtaining the sample dominates the cost of taking the gradient step, we do not gain (but do not lose much either) from rejection sampling. We might still gain from
rejection sampling if the cost of operating on a sample (calculating the actual gradient and taking a step
according to it) dominates the cost of obtaining it and (a bound on) the Lipschitz constant.
2.3. Tightness. One might hope to obtain a linear dependence on the average conditioning L/µ with
unbiased sampling (i.e. without importance sampling). However, as the following example shows, this
is not possible. Consider a uniform source distribution over N + 1 quadratics, with the first quadratic f_1 being (x[1] − b)² and all the others being (√(1/N) · x[2])², with b = ±1. It is clear that any method must consider f_1 in order to recover x to within error less than one, but with unbiased sampling this takes (N + 1) iterations in expectation (with biased sampling, we have L_1 = N and L_i = 1 for 2 ≤ i ≤ N + 1, and so i = 1 would be selected with probability one half). It is easy to verify that in this case, sup_i L_i = L_1 = N, L = 2, E L_i² = N, and µ = 1. For large N, a linear dependence on L/µ would mean that a constant number of iterations suffices (as L/µ → 2 as N → ∞), but we just saw that any method that uses unbiased sampling must consider at least (N + 1) samples to obtain non-trivial error. Note that both sup_i L_i/µ = N and E L_i²/µ² = N indeed correspond to the correct number of iterations using unbiased sampling.
Returning to the comparison with Bach and Moulines, we see that, with unbiased sampling, the choice
between a dependence on the average quadratic conditioning EL 2i /µ2 , or a linear dependence on the
uniform conditioning sup L i /µ, is unavoidable. A linear dependence on the average conditioning L/µ
is not possible with any method that uses unbiased sampling. Here, we show how to obtain a linear
dependence on sup L i /µ with unbiased sampling (improving over Bach and Moulines in some regimes),
and how to obtain a linear dependence on L/µ using biased sampling.
3. THE LEAST SQUARES CASE AND THE RANDOMIZED KACZMARZ METHOD
A special case of interest is the least squares problem, where

F(x) = (1/2) Σ_{i=1}^{n} (⟨a_i, x⟩ − b_i)² = (1/2) ‖Ax − b‖²_2,    (3.1)

with b an n-dimensional vector, A an n × d overdetermined full-rank matrix with rows a_i, and x_⋆ = argmin_x (1/2)‖Ax − b‖²_2 the least squares solution. Writing the least squares problem (3.1) in the form (1.1), we find that:
• The source distribution D is uniform over {1, 2, . . . , n}.
• The components are f_i(x) = (n/2)(⟨a_i, x⟩ − b_i)².
• The Lipschitz constants are L_i = n‖a_i‖²_2, and the average Lipschitz constant is L = (1/n) Σ_i L_i = ‖A‖²_F.
• The strong convexity parameter is µ = 1/‖(A^T A)^{-1}‖_2, so that K(A) := L/µ = ‖A‖²_F ‖(A^T A)^{-1}‖_2.
• The residual is σ² = n Σ_i ‖a_i‖²_2 |⟨a_i, x_⋆⟩ − b_i|². Observe the bounds σ² ≤ n‖A‖²_F sup_i |⟨a_i, x_⋆⟩ − b_i|² and σ² ≤ n sup_i ‖a_i‖²_2 ‖Ax_⋆ − b‖²_2.
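These quantities are easy to compute for a concrete system; the following sketch (ours, using NumPy) evaluates L_i, L, µ, K(A), and σ² for a random A and b.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # noisy system

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)                  # least squares solution
L_i = n * (A ** 2).sum(axis=1)                                  # L_i = n ||a_i||^2
L_bar = L_i.mean()                                              # = ||A||_F^2
mu = np.linalg.eigvalsh(A.T @ A).min()                          # = 1 / ||(A^T A)^{-1}||_2
K = L_bar / mu                                                  # K(A)
residuals = A @ x_star - b
sigma2 = n * ((A ** 2).sum(axis=1) * residuals ** 2).sum()      # sigma^2
```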
The standard Kaczmarz method for solving the least squares problem (3.1) produces an estimate x̂ of the minimizer x_⋆ of (3.1). Beginning with an arbitrary estimate x_0, in the kth iteration it selects a row i = i(k) of the matrix A and projects the current iterate x_k onto the solution space corresponding to the ith row,

x_{k+1} = x_k + ((b_i − ⟨a_i, x_k⟩) / ‖a_i‖²_2) a_i.    (3.2)
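The projection step (3.2) is a one-line update; a minimal implementation (ours) is given below, with the row-selection rule left to the caller.

```python
import numpy as np

def kaczmarz_step(x, a_i, b_i):
    """Project x onto the hyperplane {z : <a_i, z> = b_i}, i.e. the update (3.2)."""
    return x + ((b_i - a_i @ x) / (a_i @ a_i)) * a_i
```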
When the Kaczmarz method was introduced, it was proposed to select rows i sequentially, so that
i (k) = k mod n + 1. However, an unfortunate ordering of the rows can lead to slow convergence. It was
later proposed to select rows i.i.d. at random, which was shown empirically to result in significantly
faster convergence rates [5, 12, 19]. A number of asymptotic convergence rates were subsequently obtained, see [36, 6, 34, 10, 37]. Strohmer and Vershynin provided the first nonasymptotic rates, showing
that drawing rows proportionally to their Lipschitz constants leads to provable exponential convergence
in expectation in the noiseless setting Ax = b [33]. Needell extended these first results to inconsistent
systems [20]. Recently, Lee and Sidford use acceleration techniques to improve upon this convergence
rate to obtain a dependence on the square root of the conditioning, at the cost of an additional dependence on the size of the system [16]. Liu and Wright also provide a Nesterov-type acceleration which
improves the convergence rate to a dependence on the square root of the smallest singular value (and
linear in n) in the consistent case, as well as computational improvements for sparse matrices [17].
Several other works discuss methods for acceleration and convergence to the least-squares solution,
see [25, 8, 29, 37, 9, 7, 4, 26, 27, 28, 22, 21] and references therein. In this work, we focus on the standard
Kaczmarz method without acceleration techniques.
With fully weighted sampling λ = 0, Algorithm 2.1 reduces to the Kaczmarz algorithm with the row
selection strategy of Strohmer and Vershynin in the least squares setting. Conversely, our main theorem with uniform sampling λ = 1 produces new bounds for the Kaczmarz method (3.2) with uniform
row selection; in particular, our results imply exponential convergence of the Kaczmarz method to the
solution of a weighted least squares problem, rather than to the solution of the original problem (3.1).
Nevertheless, our main results show that by perturbing the Kaczmarz algorithm slightly, and adopting a
slightly biased row selection rule, we arrive at exponential convergence to the unweighted least squares
solution without amplifying the noise dependence. We note that a randomized Kaczmarz algorithm with
partially-biased sampling was also recently considered in [15] in the setting σ2 = 0, albeit from a different motivation and using slightly different analysis. It is shown there that accelerated methods using
partially biased sampling can yield even better sampling complexity in some regimes.
3.1. Randomized Kaczmarz with weighted row selection. Here we show that the framework for SGD with partially biased sampling can be applied to provide non-asymptotic results for the randomized Kaczmarz method (3.2) proposed by Strohmer and Vershynin, where each row is selected with probability proportional to its squared Euclidean norm, p_i = ‖a_i‖²_2/‖A‖²_F. As shown in [33] and extended in [20], this method exhibits exponential convergence, but only to within a radius, or convergence horizon, of the least squares solution:

E‖x_k − x_⋆‖²_2 ≤ [1 − 1/K(A)]^k ‖x_0 − x_⋆‖²_2 + K(A) r,    (3.3)

where e = Ax_⋆ − b, r = sup_i |e_i|²/‖a_i‖²_2, and K(A) = ‖A‖²_F ‖(A^T A)^{-1}‖_2. Note that r ≠ 0 when the system is inconsistent, and E‖x_k − x_⋆‖²_2 ≤ K(A) r as k → ∞.
It has been shown [36, 6, 34, 10, 22] that using a relaxation parameter (i.e. changing the step size) can allow for convergence inside of this convergence horizon. However, non-asymptotic results have been difficult to obtain. We consider here the randomized Kaczmarz algorithm with relaxation parameter 0 < c < 1:

x_{k+1} = x_k + c · ((b_i − ⟨a_i, x_k⟩) / ‖a_i‖²_2) a_i.    (3.4)
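Combining the relaxed update (3.4) with either uniform or squared-norm row selection gives the following sketch (ours; the relaxation parameter and iteration budget are illustrative).

```python
import numpy as np

def randomized_kaczmarz(A, b, c=0.5, iters=10_000, weighted=True, rng=None):
    """Relaxed randomized Kaczmarz (3.4). With weighted=True rows are drawn with
    probability ||a_i||^2 / ||A||_F^2 (Strohmer-Vershynin); otherwise uniformly."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    row_norms2 = (A ** 2).sum(axis=1)
    probs = row_norms2 / row_norms2.sum() if weighted else np.full(n, 1.0 / n)
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.choice(n, p=probs)
        x += c * (b[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x

# Example usage on a random overdetermined system.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 10))
b = A @ rng.standard_normal(10) + 0.01 * rng.standard_normal(500)
x_hat = randomized_kaczmarz(A, b, c=0.5)
```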
Formulating the randomized Kaczmarz method as an instance of SGD with fully biased sampling, we
can use our main result to derive the following corollary.
Corollary 3.1 (Convergence rate for Kaczmarz with fully biased sampling). Let A be an n × d full (column) rank matrix with rows a_i. Set e = Ax_⋆ − b, where x_⋆ is the minimizer of the problem

x_⋆ = argmin_x (1/2)‖Ax − b‖²_2.

Set a²_min = inf_i ‖a_i‖²_2, a²_max = sup_i ‖a_i‖²_2, and e²_max = sup_i e_i². Then the expected error at the kth iteration of the Kaczmarz method described by (3.4), with row a_i selected with probability p_i = ‖a_i‖²_2/‖A‖²_F, satisfies

E‖x_k − x_⋆‖²_2 ≤ [1 − 2c(1 − c)/K(A)]^k ‖x_0 − x_⋆‖²_2 + (c/(1 − c)) K(A) r̃,    (3.5)

with r̃ = (a²_max/a²_min) · min{ e²_max/a²_max, ‖e‖²_2/‖A‖²_F }. The expectation is taken with respect to the weighted distribution over the rows.
Proof. We apply Theorem 2.1 with f_i(x) = (n/2)(⟨a_i, x⟩ − b_i)², λ = 0, L_i = n‖a_i‖²_2, γ = c/L = c/‖A‖²_F, µ = 1/‖(A^T A)^{-1}‖_2, α = L = ‖A‖²_F, and β = L/(min_i L_i) = ‖A‖²_F/(n a²_min). Recalling that σ² = n Σ_i ‖a_i‖²_2 |⟨a_i, x_⋆⟩ − b_i|², we also use the bounds σ² ≤ n a²_max ‖e‖²_2 and σ² ≤ n sup_i |e_i|² ‖A‖²_F.
Remark. When e.g. c = 1/2, we recover the known exponential rate (3.3) up to a factor of 2, and nearly the same convergence horizon. For arbitrary c, Corollary 3.1 implies a tradeoff between a smaller convergence horizon and a slower convergence rate. One can also consider, instead of a fixed c, employing a sequence {c_k} which changes the value from iteration to iteration; we leave its analysis for future work.
3.2. Randomized Kaczmarz with uniform row selection. One drawback of the Kaczmarz approach with weighted row selection is that in general it requires precomputing each row norm, or applying a diagonal preconditioner matrix². In this section we analyze the uniform row selection strategy. We first recast the Kaczmarz method with uniform row selection as an instance of SGD on the renormalized system of functions f_i(x) = (n/(2‖a_i‖²_2))(⟨a_i, x⟩ − b_i)². Using the generalized framework of SGD presented in Theorem 2.1, we show that the Kaczmarz algorithm with uniform row selection exhibits exponential convergence towards the minimizer of a renormalized least squares problem. This is the content of the following corollary.
Corollary 3.2 (Convergence rate for randomized Kaczmarz with uniform sampling). Let A be an n × d full (column) rank matrix with rows a_i. Let D be the diagonal matrix with entries d_{j,j} = ‖a_j‖_2, and consider the renormalized matrix D^{-1}A. Set e_w = D^{-1}(Ax_⋆^w − b), where x_⋆^w is the minimizer of the weighted least squares problem

x_⋆^w = argmin_x (1/2)‖D^{-1}(Ax − b)‖²_2.    (3.6)

Then the expected error after k iterations of the Kaczmarz method described by (3.4) with uniform row selection satisfies

E‖x_k − x_⋆^w‖²_2 ≤ [1 − 2c(1 − c)/K(D^{-1}A)]^k ‖x_0 − x_⋆^w‖²_2 + (c/(1 − c)) K(D^{-1}A) r_w,    (3.7)

where r_w = ‖e_w‖²_2/n.
Proof. We apply Theorem 2.1 with λ = 1, f_i(x) = (n/(2‖a_i‖²_2))(⟨a_i, x⟩ − b_i)², L_i = n, γ = c/n, µ = 1/‖((D^{-1}A)^T D^{-1}A)^{-1}‖_2, α = n, β = 1, and we observe that σ² ≤ n‖e_w‖²_2.
Remarks.
1. The randomized Kaczmarz algorithm with uniform row selection converges exponentially to the weighted least squares solution (3.6), to within arbitrary accuracy, by choosing a sufficiently small step size c. Of course, the convergence rate also decreases for smaller c, and so accuracy and speed must be balanced.
2. When the system is consistent, the solutions of the unweighted and weighted least squares problems, (3.1) and (3.6) respectively, are the same, namely the solution of the consistent system Ax = b. For arbitrary residual, the least squares solution x_⋆ and the weighted least squares solution x_⋆^w can be significantly different. Thus, in general, the randomized Kaczmarz algorithms with uniform and biased row selection converge towards very different solutions.
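Remark 2 is easy to verify numerically; the sketch below (ours) compares the ordinary least squares solution x_⋆ with the weighted solution x_⋆^w obtained after normalizing the rows, for an inconsistent system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 5
A = rng.standard_normal((n, d)) * rng.uniform(0.1, 10.0, size=(n, 1))   # varying row norms
b = A @ rng.standard_normal(d) + rng.standard_normal(n)                  # inconsistent system

D_inv = 1.0 / np.linalg.norm(A, axis=1)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)                            # minimizes ||Ax - b||
x_star_w, *_ = np.linalg.lstsq(D_inv[:, None] * A, D_inv * b, rcond=None) # minimizes ||D^{-1}(Ax - b)||
gap = np.linalg.norm(x_star - x_star_w)   # nonzero in general for inconsistent systems
```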
3. It is difficult to compare the rates in Corollary 3.1 and Corollary 3.2 directly. It is easy to show that K(D^{-1}A) ≤ min{ nK(A), (a_max/a_min) K(A) } independently of A, and so we can derive the crude bound

E‖x_k − x_⋆^w‖²_2 ≤ [ 1 − 2c(1 − c) / min{ nK(A), (a_max/a_min) K(A) } ]^k ‖x_0 − x_⋆^w‖²_2 + (nc/(1 − c)) K(A) ( (e_w)²_max / a²_max ),
showing that the convergence rate for uniform row selection is no worse than a factor of min(n, a max /a min )
times that for weighted row selection. It should not be surprising that uniform selection performs well
since in the consistent case this is equivalent to normalizing the rows of the matrix, which often comes
close to minimizing the condition number [35].
4. In addition to the difficulty in comparing the rates directly, Corollary 3.2 shows convergence to the least squares solution of the preconditioned problem, not the original problem. However, using the notation of the previous corollaries, one can use the bound

‖x_k − x_⋆‖_2 ≤ ‖x_k − x_⋆^w‖_2 + √(‖(A^T A)^{-1}‖_2) (1 + a_max/a_min) ‖e‖_2

along with Corollary 3.2 to provide a bound on the expected error from the unweighted least squares solution for Kaczmarz with uniform row selection. Of course, this error cannot be made arbitrarily small by decreasing c, unless the system is consistent.

²Note that for consistent systems, the Kaczmarz iterations are independent of the scaling of the system.
3.3. Hybrid Kaczmarz-SGD algorithm. We first compare the result for Kaczmarz with weighted row selection (Corollary 3.1) with the bounds obtained by solving the quadratic system ‖Ax − b‖_2 using standard SGD, i.e. Algorithm 2.1 with f_i(x) = (n/2)(⟨a_i, x⟩ − b_i)², but with λ = 1 rather than λ = 0. This corresponds to iterating the recursion

x_k = x_{k−1} + (c/(n a²_max)) (b_i − ⟨a_i, x_{k−1}⟩) a_i    (3.8)

with a uniform row selection rule. Theorem 2.1 implies the bound

E‖x_k − x_⋆‖²_2 ≤ [1 − (2c(1 − c)/K(A)) · (‖A‖²_F/(n a²_max))]^k ‖x_0 − x_⋆‖²_2 + (c/(1 − c)) K(A) r_SG,    (3.9)

where r_SG = min{ ‖e‖²_2/‖A‖²_F, e²_max/a²_max }. As expected from Corollary 2.2, the convergence rate in (3.9) is worse than the rate (3.5) for Kaczmarz with weighted row selection, but the convergence horizon r_SG in (3.9) is smaller than that for Kaczmarz with the weighted selection strategy. To have a small convergence horizon and a high convergence rate simultaneously, Corollary 2.2 implies that we may instead use what we refer to as the hybrid Kaczmarz-SGD algorithm, namely Algorithm 2.1 with f_i(x) = (n/2)(⟨a_i, x⟩ − b_i)² and λ ∈ [0, 1], with λ = 1/2 balancing accuracy and speed. We analyze the behavior of the hybrid Kaczmarz-SGD algorithm numerically for various matrices A in the following section.
4. NUMERICAL EXPERIMENTS
In this section we present some numerical results for the hybrid Kaczmarz-SGD algorithm, i.e. Algorithm 2.1 with f_i(x) = (n/2)(⟨a_i, x⟩ − b_i)² and λ ∈ [0, 1], and demonstrate how its behavior depends on λ, the
conditioning of the system, and the residual error at the least squares solution. We focus on exploring the
role of λ on the convergence rate of the algorithm for various types of matrices A. We consider five types
of systems, described below, each using a 1000 × 10 matrix A. In each setting, we create a vector x with
standard normal entries. For the described matrix A and residual e, we create the system b = Ax + e and
run the randomized Kaczmarz method with various choices of λ. Each experiment consists of 100 independent trials and uses the optimal step size as in Corollary 2.2 with ε = .1; the plots show the average
behavior over these trials. The settings below show the various types of behavior the Kazcmarz method
can exhibit.
Case 1: Each row of the matrix A has standard normal entries, except the last row, which has normal entries with mean 0 and variance 10². The residual vector e has normal entries with mean 0 and variance 0.1².
Case 2: Each row of the matrix A has standard normal entries. The residual vector e has normal entries with mean 0 and variance 0.1².
Case 3: The jth row of A has normal entries with mean 0 and variance j. The residual vector e has normal entries with mean 0 and variance 20².
Case 4: The jth row of A has normal entries with mean 0 and variance j. The residual vector e has normal entries with mean 0 and variance 10².
Case 5: The jth row of A has normal entries with mean 0 and variance j. The residual vector e has normal entries with mean 0 and variance 0.1².
Figure 1. The convergence rates for the randomized Kaczmarz method with various choices of λ in the five settings described above (one panel per case, with curves for several values of λ between 0 and 1). The vertical axis is in logarithmic scale and depicts the approximation error ‖x_k − x_⋆‖²_2 at iteration k (the horizontal axis).
Figure 2. Number of iterations k (log scale) needed by the randomized Kaczmarz method for various values of λ (horizontal axis) to obtain approximation error ‖x_k − x_⋆‖²_2 ≤ ε = 0.1 in the five cases described above: Case 1 (blue with circle marker), Case 2 (red with square marker), Case 3 (black with triangle marker), Case 4 (green with x marker), and Case 5 (purple with star marker).
Figure 1 shows the convergence behavior of the randomized Kaczmarz method in each of these five
settings. As expected, when the rows of A are far from normalized, as in Case 1, we see different behavior
as λ varies from 0 to 1. Here, weighted sampling (λ = 0) significantly outperforms uniform sampling
(λ = 1), and the trend is monotonic in λ. On the other hand, when the rows of A are close to normalized,
as in Case 2, the various λ give rise to similar convergence rates, as is expected. Out of the λ tested
(we tested increments of 0.1 from 0 to 1), the choice λ = 0.7 gave the worst convergence rate, and again
purely weighted sampling gives the best. Still, the worst-case convergence rate was not much worse, as
opposed to the situation with uniform sampling in Case 1. Cases 3, 4, and 5 use matrices with varying
row norms and cover "high", "medium", and "low" noise regimes, respectively. In the high noise regime
(Case 3), we find that fully weighted sampling, λ = 0, is relatively very slow to converge, as the theory
suggests, and hybrid sampling outperforms both weighted and uniform selection. In the medium noise
regime (Case 4), hybrid sampling still outperforms both weighted and uniform selection. Again, this is
not surprising, since hybrid sampling allows a balance between small convergence horizon (important
with large residual norm) and convergence rate. As we decrease the noise level (as in Case 5), we see that
again weighted sampling is preferred.
Figure 2 shows the number of iterations of the randomized Kaczmarz method needed to obtain a fixed
approximation error. For the choice λ = 1 for Case 1, we cut off the number of iterations after 50,000, at
which point the desired approximation error was still not attained. As seen also from Figure 1, Case 1
exhibits monotonic improvement as we decrease λ. For Cases 2 and 5, the optimal choice is purely weighted
sampling, whereas Cases 3 and 4 prefer intermediate values of λ.
5. PROOFS
The proof of Theorem 2.1 utilizes an elementary fact about smooth functions with Lipschitz continuous gradient, called the co-coercivity of the gradient. We state the lemma and recall its proof for completeness.
5.1. The Co-coercivity Lemma.
Lemma 5.1 (Co-coercivity). For a smooth function f whose gradient has Lipschitz constant L,

‖∇f(x) − ∇f(y)‖²_2 ≤ L ⟨x − y, ∇f(x) − ∇f(y)⟩.

Proof. Since ∇f has Lipschitz constant L, if x_⋆ is the minimizer of f, then

(1/(2L)) ‖∇f(x) − ∇f(x_⋆)‖²_2 = (1/(2L)) ‖∇f(x) − ∇f(x_⋆)‖²_2 + ⟨x − x_⋆, ∇f(x_⋆)⟩ ≤ f(x) − f(x_⋆);    (5.1)

see, for example, [24, page 26]. Now define the convex functions

G(z) = f(z) − ⟨∇f(x), z⟩   and   H(z) = f(z) − ⟨∇f(y), z⟩,

and observe that both have gradients with Lipschitz constant L and minimizers x and y, respectively. Applying (5.1) to these functions therefore gives

G(x) ≤ G(y) − (1/(2L)) ‖∇G(y)‖²_2   and   H(y) ≤ H(x) − (1/(2L)) ‖∇H(x)‖²_2.

By their definitions, this implies that

f(x) − ⟨∇f(x), x⟩ ≤ f(y) − ⟨∇f(x), y⟩ − (1/(2L)) ‖∇f(y) − ∇f(x)‖²_2,
f(y) − ⟨∇f(y), y⟩ ≤ f(x) − ⟨∇f(y), x⟩ − (1/(2L)) ‖∇f(x) − ∇f(y)‖²_2.

Adding these two inequalities and canceling terms yields the desired result.
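As a small numerical check of Lemma 5.1 (ours), consider a convex quadratic f(x) = ½ xᵀHx, for which the gradient's Lipschitz constant is the largest eigenvalue of H.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M.T @ M                                   # positive semidefinite Hessian
L = np.linalg.eigvalsh(H).max()               # Lipschitz constant of the gradient

grad = lambda z: H @ z                        # gradient of f(x) = 0.5 * x^T H x
x, y = rng.standard_normal(5), rng.standard_normal(5)
lhs = np.linalg.norm(grad(x) - grad(y)) ** 2
rhs = L * (x - y) @ (grad(x) - grad(y))
assert lhs <= rhs + 1e-9                      # co-coercivity holds
```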
5.2. Proof of Theorem 2.1. With the notation of Theorem 2.1, where i is the random index chosen at iteration k and w = w_λ, we have

‖x_{k+1} − x_⋆‖²_2 = ‖x_k − x_⋆ − (γ/w(i)) ∇f_i(x_k)‖²_2
  = ‖(x_k − x_⋆) − (γ/w(i))(∇f_i(x_k) − ∇f_i(x_⋆)) − (γ/w(i)) ∇f_i(x_⋆)‖²_2
  = ‖x_k − x_⋆‖²_2 − 2(γ/w(i)) ⟨x_k − x_⋆, ∇f_i(x_k)⟩ + (γ/w(i))² ‖∇f_i(x_k) − ∇f_i(x_⋆) + ∇f_i(x_⋆)‖²_2
  ≤ ‖x_k − x_⋆‖²_2 − 2(γ/w(i)) ⟨x_k − x_⋆, ∇f_i(x_k)⟩ + 2(γ/w(i))² ‖∇f_i(x_k) − ∇f_i(x_⋆)‖²_2 + 2(γ/w(i))² ‖∇f_i(x_⋆)‖²_2
  ≤ ‖x_k − x_⋆‖²_2 − 2(γ/w(i)) ⟨x_k − x_⋆, ∇f_i(x_k)⟩ + 2(γ/w(i))² L_i ⟨x_k − x_⋆, ∇f_i(x_k) − ∇f_i(x_⋆)⟩ + 2(γ/w(i))² ‖∇f_i(x_⋆)‖²_2,

where we have employed Jensen's inequality in the first inequality and the co-coercivity Lemma 5.1 in the final line. We next take an expectation with respect to the choice of i, which is drawn according to the distribution D^(λ). In taking this expectation, we recall that for the weights defined by (2.2) we have E[w_λ(i)] = 1, so by (1.2), E^(w)[(1/w(i)) X(i)] = E[X(i)]. We also recall that E∇f_i(x) = ∇F(x), and obtain, for γα ≤ 1,

E^(w)‖x_{k+1} − x_⋆‖²_2 ≤ ‖x_k − x_⋆‖²_2 − 2γ ⟨x_k − x_⋆, ∇F(x_k)⟩ + 2γ² E[ (L_i/w(i)) ⟨x_k − x_⋆, ∇f_i(x_k) − ∇f_i(x_⋆)⟩ ] + 2γ² E[ (1/w(i)) ‖∇f_i(x_⋆)‖²_2 ]
  ≤ ‖x_k − x_⋆‖²_2 − 2γ ⟨x_k − x_⋆, ∇F(x_k)⟩ + 2γ² E[ min{ L/(1−λ), L_i/λ } ⟨x_k − x_⋆, ∇f_i(x_k) − ∇f_i(x_⋆)⟩ ] + 2γ² E[ min{ 1/λ, L/((1−λ)L_i) } ‖∇f_i(x_⋆)‖²_2 ]
  ≤ ‖x_k − x_⋆‖²_2 − 2γ ⟨x_k − x_⋆, ∇F(x_k)⟩ + 2γ² min{ L/(1−λ), (sup_i L_i)/λ } E⟨x_k − x_⋆, ∇f_i(x_k) − ∇f_i(x_⋆)⟩ + 2γ² min{ 1/λ, L/((1−λ) inf_i L_i) } E‖∇f_i(x_⋆)‖²_2
  = ‖x_k − x_⋆‖²_2 − 2γ ⟨x_k − x_⋆, ∇F(x_k)⟩ + 2γ²α ⟨x_k − x_⋆, ∇F(x_k) − ∇F(x_⋆)⟩ + 2γ²βσ²,

where we have set α = min{ L/(1−λ), (sup_i L_i)/λ } and β = min{ 1/λ, L/((1−λ) inf_i L_i) }. We now utilize the strong convexity of F(x) and obtain

E^(w)‖x_{k+1} − x_⋆‖²_2 ≤ ‖x_k − x_⋆‖²_2 − 2γµ(1 − γα)‖x_k − x_⋆‖²_2 + 2γ²βσ²
  = (1 − 2γµ(1 − γα))‖x_k − x_⋆‖²_2 + 2γ²βσ².

Recursively applying this bound over the first k iterations yields the desired result,

E‖x_k − x_⋆‖²_2 ≤ (1 − 2γµ(1 − γα))^k ‖x_0 − x_⋆‖²_2 + 2 Σ_{j=0}^{k−1} (1 − 2γµ(1 − γα))^j γ²βσ²
  ≤ (1 − 2γµ(1 − γα))^k ‖x_0 − x_⋆‖²_2 + γβσ² / (µ(1 − γα)),

where the expectation on the left-hand side is with respect to the indices in each of the k iterations being drawn i.i.d. from D^(λ) = D^(w_λ).
5.3. Proof of Corollary 2.2.

Proof. Recall the main recursive step in the previous proof:

E^(w)‖x_{k+1} − x_⋆‖²_2 ≤ (1 − 2γµ(1 − γα))‖x_k − x_⋆‖²_2 + 2βγ²σ²,    (5.2)

provided that γα ≤ 1. The minimal value of the quadratic

F_ξ(γ) = (1 − 2γµ(1 − γα)) ξ + 2βσ²γ²

is achieved at

γ*_ξ = µξ / (2ξµα + 2βσ²),    (5.3)

with

F_ξ(γ*_ξ) = (1 − µ²ξ/(2µαξ + 2βσ²)) ξ.    (5.4)

Note that γ*_ξ α ≤ 1/2. Thus if we choose the step size γ* = γ*_ε, then

E^(w)‖x_{k+1} − x_⋆‖²_2 ≤ F_{‖x_k − x_⋆‖²_2}(γ*)
  = ( F_{‖x_k − x_⋆‖²_2}(γ*) − F_ε(γ*) ) + F_ε(γ*)
  ≤ (1 − µ²ε/(2µεα + 2βσ²)) ‖x_k − x_⋆‖²_2,

and, iterating the expectation,

E‖x_{k+1} − x_⋆‖²_2 ≤ (1 − µ²ε/(2µεα + 2βσ²))^k ε_0,    (5.9)

where again the expectation on the left-hand side is with respect to the random indices in all iterations being drawn from D^(λ). It follows that if ε ≤ E‖x_{k+1} − x_⋆‖²_2, then

log(ε/ε_0) ≤ k log(1 − µ²ε/(2µεα + 2βσ²))    (5.10)
  ≤ −k µ²ε/(2µαε + 2βσ²),    (5.11)

or, equivalently,

k ≤ log(ε_0/ε) (2µαε + 2βσ²)/(µ²ε)    (5.12)
  = log(ε_0/ε) ( 2α/µ + 2βσ²/(µ²ε) ).    (5.13)

In particular, setting λ = 1/2, one arrives at the bound

k ≤ log(ε_0/ε) ( 4L/µ + 4σ²/(µ²ε) ).
ACKNOWLEDGEMENTS
DN was partially supported by a Simons Foundation Collaboration grant, NS was partially supported
by a Google Research Award, and RW was supported in part by ONR Grant N00014-12-1-0743 and an
AFOSR Young Investigator Program Award.
R EFERENCES
[1] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances
in Neural Information Processing Systems (NIPS), 2011.
[2] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–
186. Springer, 2010.
[3] L. Bottou and O. Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, page 351, 2011.
[4] C. L. Byrne. Applied iterative methods. A K Peters Ltd., Wellesley, MA, 2008.
[5] C. Cenker, H. G. Feichtinger, M. Mayer, H. Steier, and T. Strohmer. New variants of the POCS method using affine subspaces
of finite codimension, with applications to irregular sampling. In Proc. SPIE: Visual Communications and Image Processing,
pages 299–310, 1992.
[6] Y. Censor, P. P. B. Eggermont, and D. Gordon. Strong underrelaxation in Kaczmarz’s method for inconsistent systems.
Numerische Mathematik, 41(1):83–92, 1983.
[7] P. P. B. Eggermont, G. T. Herman, and A. Lent. Iterative algorithms for large partitioned linear systems, with applications to
image reconstruction. Linear Algebra Appl., 40:37–67, 1981.
[8] Y. C. Eldar and D. Needell. Acceleration of randomized Kaczmarz method via the Johnson-Lindenstrauss lemma. Numer.
Algorithms, 58(2):163–177, 2011.
[9] T. Elfving. Block-iterative methods for consistent and inconsistent linear equations. Numer. Math., 35(1):1–12, 1980.
[10] M. Hanke and W. Niethammer. On the acceleration of Kaczmarz’s method for inconsistent linear systems. Linear Algebra
and its Applications, 130:83–98, 1990.
[11] G. T. Herman. Fundamentals of computerized tomography: image reconstruction from projections. Springer, 2009.
[12] G.T. Herman and L.B. Meyer. Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans.
Medical Imaging, 12(3):600–609, 1993.
[13] G. N. Hounsfield. Computerized transverse axial scanning (tomography): Part 1. Description of system. British Journal of
Radiology, 46(552):1016–1022, 1973.
[14] S. Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Polon. Sci. Lett. Ser. A, pages 335–
357, 1937.
[15] Y. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems.
arXiv preprint arXiv:1305.1922, 2013.
[16] Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems.
Submitted, 2013.
[17] J. Liu and S. J. Wright. An accelerated randomized Kaczmarz algorithm. Submitted, 2013.
[18] N. Murata. A statistical study of on-line learning. Cambridge University Press, Cambridge,UK, 1998.
[19] F. Natterer. The mathematics of computerized tomography, volume 32 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2001. Reprint of the 1986 original.
[20] D. Needell. Randomized Kaczmarz solver for noisy linear systems. BIT, 50(2):395–403, 2010.
[21] D. Needell and J. A. Tropp. Paved with good intentions: Analysis of a randomized block Kaczmarz method. Linear Algebra
and its Applications, 2013.
[22] D. Needell and R. Ward. Two-subspace projection method for coherent overdetermined linear systems. Journal of Fourier
Analysis and Applications, 19(2):256–269, 2013.
[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.
SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[24] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2004.
[25] C. Popa. Extensions of block-projections methods with relaxation parameters to inconsistent and rank-deficient least-squares problems. BIT, 38(1):151–176, 1998.
[26] C. Popa. Block-projections algorithms with blocks containing mutually orthogonal rows and columns. BIT, 39(2):323–338,
1999.
[27] C. Popa. A fast Kaczmarz-Kovarik algorithm for consistent least-squares problems. Korean J. Comput. Appl. Math., 8(1):9–
26, 2001.
[28] C. Popa. A Kaczmarz-Kovarik algorithm for symmetric ill-conditioned matrices. An. Ştiinţ. Univ. Ovidius Constanţa Ser.
Mat., 12(2):135–146, 2004.
[29] C. Popa, T. Preclik, H. Köstler, and U. Rüde. On Kaczmarz’s projection iteration as a direct solver for linear least squares problems. Linear Algebra and its Applications, 436(2):389–404, 2012.
[30] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22:400–407, 1951.
[31] S. Shalev-Shwartz and N. Srebro. Svm optimization: inverse dependence on training set size. In Proceedings of the 25th
international conference on Machine learning, pages 928–935, 2008.
[32] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. arXiv preprint arXiv:1212.1824, 2012.
[33] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl.,
15(2):262–278, 2009.
[34] K. Tanabe. Projection method for solving a singular system of linear equations and its applications. Numerische Mathematik, 17(3):203–214, 1971.
[35] A. van der Sluis. Condition numbers and equilibration of matrices. Numerische Mathematik, 14(1):14–23, 1969.
[36] T. M. Whitney and R. K. Meany. Two algorithms related to the method of steepest descent. SIAM Journal on Numerical
Analysis, 4(1):109–118, 1967.
[37] A. Zouzias and N. M. Freris. Randomized extended Kaczmarz for solving least-squares. SIAM Journal on Matrix Analysis
and Applications, 2012.