Cooperation via Policy Search
and
Unconstrained Minimization
Brendan and Yifang
Feb 24 2015
Paper: Learning to Cooperate
via Policy Search
Peshkin, Leonid and Kim, Kee-Eung and Meuleau, Nicolas
and Kaelbling, Leslie Pack. 2000.
Introduction
• Previously we’ve been concerned with single agents, oblivious
environments and one-to-one value functions
• Impractical for large/diverse environments
• Impractical for complicated/reactive systems
• More generalizable approach desirable
• RL Chapter 8 introduced function approximation to build more
accurate value functions with small samples
• Peshkin et al. introduce multiple agents and world state
information asymmetry
• Wide range of potential new applications
• Particularly relevant to game theory
• Compatible with Connectionism models
Game Theory
• “Game theory is the study of the ways in which interacting
choices of economic agents produce outcomes with respect to
the preferences (or utilities) of those agents, where the outcomes in
question might have been intended by none of the agents.”
- Stanford Encyclopedia of Philosophy
• Example – Volunteer’s Dilemma: N agents share a common reward if
at least one agent volunteers, but volunteering has an associated
cost (e.g. the power goes out and at least one person has to make an
expensive sat-phone call to the electric company to get power restored)
• Cooperative
• Multi-agent
• Imperfect world state information
• Analytic solutions
Multi-Agent Environments
• Simple generalization
• Instead of a single agent, we have a set of agents, G
• Each agent has its own action space, A_i
• A is now the joint action space, A = ×_i A_i (the Cartesian product of the individual action spaces)
• Large expansion of potential applications
• Exponential expansion of search space
• n agents with k possible states ⇒ k^n world states (see the sketch after this list)
• Makes many approaches computationally impractical
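As a quick illustration of the joint-action blow-up, a toy sketch (the agent and action counts are arbitrary, not from the paper):

import itertools

# Hypothetical example: 3 agents, each with 4 individual actions.
action_spaces = [range(4)] * 3
joint_actions = list(itertools.product(*action_spaces))   # A = ×_i A_i
print(len(joint_actions))                                  # 4^3 = 64 joint actions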
Paper Specifics
• Paper adopts Partially Observable Identical Payoff Stochastic
Game model with some caveats
• Picked to be interesting
• Allows theoretical guarantees
• Requires a non-trivial distributed algorithm
• Some game theory applications
Identical Payoff Stochastic
Game (IPSG)
• Set of agents in a Markov environment
• All agents share a common reward function
• Represented as ⟨S, π_0^S, G, T, r⟩
  • S is a discrete state space
  • π_0^S is a probability distribution over the initial state
  • G is the collection of agents
  • T: S × A → P(S) maps a state and joint action to a probability distribution over next states
  • r is the reward function (discussed later)
• Each agent is represented by ⟨A_i, O_i, B_i⟩
  • O_i is the observation space
  • B_i is the observation function
Reward Questions
• Question: “What is the reward r(t) in POIPSG? The paper says
that different actions are taken by each agent to form a
compound action, but there the reward is identical? How is
this reward calculated?” – Tavish
• π‘Ÿ: 𝑆 × π΄ → 𝑅
• 𝑆 is the state space, 𝐴 is the joint action space, and 𝑅 is the
reward space
• Reward is an explicit result of the compound action at time 𝑑 and
shared by all agents
Reward Questions Cont.
• Question: “ Its title is about cooperation. Does the reward
function r reflect the idea of cooperation? ” – Jiyun Luo
• Paper’s definition - “Cooperative games are those in which both
agents share the same payoff structure.”
• Identical payoff stochastic games meet this requirement by
definition
• The cooperation effect is seemingly evident in some examples – e.g. soccer
• More generally – “a cooperative game is a game where groups of
players ("coalitions") may enforce cooperative behavior, hence
the game is a competition between coalitions of players, rather
than between individual players.”
• Typically entails communication and enforcement not present in the model
• Similarly, joint strategies are not explicit in this model
Partially Observable IPSG
(POIPSG)
• Recall O_i and B_i
  • B_i is a mapping from states to observations, B_i: S → P(O_i)
  • Normally, B(s) = s for all s (i.e. completely observable)
  • In the more general form, the game is “partially observable”
  • Each agent observes o_i(t) corresponding to B_i(s(t))
  • Even with identical policies, agents act differently based on O_i
• Each agent tries to independently maximize the game value,
  V(μ) = Σ_{t=0}^∞ γ^t E_μ[r(t)]
  • where γ ∈ [0,1) is the discount factor
  • μ is the set of strategies
Asymmetric Information
Question
• Question: “The paper investigates the problem in which
"agents all receive the shared reward signal, but have
incomplete, unreliable, and generally different perceptions of
the world state." How important is this problem? What are
the real world applications to this model? ” – Yuankai
• Economics: Equal shareholders in a company all benefit from its success, but may have different world state information
• Politics: Citizens all have a common goal, but differ in knowledge
• Shared rewards are common (e.g. the volunteer’s dilemma), but strictly identical rewards are somewhat unrealistic
• The model arguably lacks the communication and enforcement necessary for more complex economic problems; potentially representable with the observation function?
Model Assumptions
• State space is discrete
• Agents have finite memory
• Important for realism
• Paper outlines a finite state controller for each agent, but uses
fully reactive policies in practice
• Factored controller (i.e. distributed as opposed to centralized): each agent has its own sub-policy
• Not compatible with simple communication models
• Learning is simultaneous
• Actions happen simultaneously
• coordinated by the reward result
• Each policy defines the probability of an action as a continuous, differentiable function
Joint Controller vs. Factored
Controller
• A factored action consists of multiple components, a = ⟨a_1, …, a_m⟩
• A joint controller maps observations to a complete joint distribution, π(a)
• A factored controller uses an independent sub-policy to determine each action component, μ_{a_i}: O_i → P(a_i)
• Joint controllers are significantly more powerful than factored controllers because they can coordinate actions
• Any set of factored controllers μ_{a_i} can be represented by a joint controller, Pr(a) = Π_{i=1}^N Pr(a_i)
• Not all stochastic joint controllers can be represented with a factored controller, i.e. coordinated randomness (see the example below)
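As a concrete illustration (my own toy example, not from the paper), consider two binary action components driven by a single shared coin flip:

% Hypothetical counterexample: coordinated randomness.
% Joint controller: both components copy one fair coin flip.
\Pr(a_1 = 0, a_2 = 0) = \Pr(a_1 = 1, a_2 = 1) = \tfrac{1}{2}, \qquad
\Pr(a_1 = 0, a_2 = 1) = \Pr(a_1 = 1, a_2 = 0) = 0
% Any factored controller matching the marginals \Pr(a_1 = 0) = \Pr(a_2 = 0) = 1/2
% must put probability 1/4 on (a_1 = 0, a_2 = 1), so no product of independent
% sub-policies reproduces this joint distribution.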
Control Question
• “What is the major difference (mathematically) between
central control vs. distributed control of factored actions?” –
Dr. Yang
• In general, see previous
• In the case of gradient descent, none
• Action probabilities of each agent (and consequently the distributed
partial derivatives) are independent.
• As such, weight updates can be distributed without cooperation
Algorithm Question
• “What is the REINFORCE algorithm? Please provide details,
including derivation. Also show the connection between
generation of RL and neural network.” – Dr. Yang
• In a Connectionist Network (i.e. an Artificial Neural Network)
• Given a network of simple independent nodes/agents
• Each node draws an output y_i based on its input vector x^i and the associated weights, w_ij
• REINFORCE introduced the concept of policy search to RL
• Major intuition here is that most policies can be represented as a
series of weights
• Optimization of the policy can be seen as a weight optimization
problem
• Good problem formulation for techniques like gradient descent
• The authors use a more standard gradient descent for their actual
weight adjustments
Simple Artificial Neuron
Algorithm - REINFORCE
• Given a network and an associative immediate-reinforcement
learning task
• Simplify time dependence such that weights are adjusted following
receipt of the reinforcement value, r, at each trial
• Suppose that at each step, each weight w_ij is incremented by an amount
  Δw_ij = α_ij (r − b_ij) e_ij
  • α_ij is the learning rate factor
  • b_ij is the reinforcement baseline
  • e_ij is the characteristic eligibility of w_ij
• Learning algorithms of this form are called REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility (REINFORCE) algorithms (a sketch follows below)
• Note: Paper uses a simpler weight adjustment formulation
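A minimal sketch of the REINFORCE update for a single Bernoulli-logistic unit. The toy task, hyperparameters, and zero baseline are my own illustrative choices; only the update Δw = α (r − b) e follows the slide.

import numpy as np

rng = np.random.default_rng(0)
n_inputs, alpha, baseline = 3, 0.1, 0.0
w = np.zeros(n_inputs)

for trial in range(2000):
    x = rng.integers(0, 2, size=n_inputs).astype(float)  # random binary input
    p = 1.0 / (1.0 + np.exp(-w @ x))                      # P(y = 1 | x, w)
    y = float(rng.random() < p)                           # stochastic output of the unit
    r = 1.0 if y == x[0] else 0.0                         # hypothetical task: copy the first input bit
    e = (y - p) * x                                       # characteristic eligibility: d ln Pr(y|x,w) / dw
    w += alpha * (r - baseline) * e                       # REINFORCE: Δw = α (r − b) e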
Algorithm – Gradient Descent
for Policy Search
• Let h be an action/state history through time t, and H_t the set of all possible histories of length t,
  V(μ) = Σ_{t=0}^∞ γ^t Σ_{h ∈ H_t} Pr(h | μ) r(t, h)
• i.e., the value of a policy set is, for each time step, the discounted probability of encountering each history under the policy times the reward of that history – this is just an expected value
• Let w be the vector of weights representing a policy
• We can calculate the partial derivatives (which give us the gradient) of the policy value function with respect to each weight,
  ∂V(μ)/∂w_k = Σ_{t=0}^∞ γ^t Σ_{h ∈ H_t} [∂Pr(h | μ)/∂w_k] r(t, h)
Gradient Descent for Policy
Search Cont.
• Supposing each time-step in a history is represented by ⟨n(t), o(t), a(t), r(t)⟩ (i.e. state, observation, action, reward)
• Expanding the partial derivative we have,
  ∂V(μ)/∂w_k = Σ_{t=0}^∞ γ^t Σ_{h ∈ H_t} Pr(h | μ) r(t, h) × Σ_{τ=1}^t ∂ln Pr(n(τ, h), a(τ, h) | h_{τ−1}, μ)/∂w_k
• Pr(h | μ) assumes complete knowledge, so we substitute our estimate of the gradient,
  γ^t r(t, h) Σ_{τ=1}^t ∂ln Pr(n(τ, h), a(τ, h) | h_{τ−1}, μ)/∂w_k
  (a sketch of this estimate follows below)
Algorithm – Distributed
Gradient Descent
• Now that we can calculate the gradient for each trial, using
gradient descent is trivial
• Gradient descent implementations will be covered later
• To distribute the action determination to individual agents
(consistent with our model), simply have each agent perform
gradient descent locally
• Each agent is given some set of initial weights
• Each agent adjusts its own weights given its local observations and the shared reward for each round (see the sketch below)
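A toy sketch of the distributed version: two agents, each with its own weights and its own eligibility trace, updated purely locally from its private observation and the shared reward. The two-agent coordination payoff below is invented for illustration; the point is only that no agent ever needs another agent's weights or observations.

import numpy as np

rng = np.random.default_rng(1)
n_obs, n_act, gamma, lr, n_agents = 2, 2, 0.9, 0.05, 2

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

weights = [np.zeros((n_obs, n_act)) for _ in range(n_agents)]   # per-agent parameters

for episode in range(500):
    traces = [np.zeros_like(w) for w in weights]
    grads = [np.zeros_like(w) for w in weights]
    for t in range(10):
        obs = rng.integers(n_obs, size=n_agents)                # each agent's private observation
        acts = []
        for i, w in enumerate(weights):
            p = softmax(w[obs[i]])
            a = rng.choice(n_act, p=p)
            acts.append(a)
            d = np.zeros_like(w)
            d[obs[i]] = -p
            d[obs[i], a] += 1.0                                 # local d ln pi_i(a|o) / dw_i
            traces[i] += d
        r = 1.0 if acts[0] == acts[1] else 0.0                  # shared reward: made-up coordination payoff
        for i in range(n_agents):
            grads[i] += (gamma ** t) * r * traces[i]
    for i in range(n_agents):
        weights[i] += lr * grads[i]                             # purely local gradient step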
Guarantees
• For factored controllers, distributed gradient descent is
equivalent to joint gradient descent
• Thus, Distributed Gradient Descent finds a local optimum
• Centralized gradient descent finds a local optimum simply
because it’s a descent algorithm that stops at an optimum
• Distributed is equivalent to joint
• Every strict Nash equilibrium is a local optimum for gradient
descent
• Equivalence isn’t two-way (discussed later)
Nash Equilibrium
• “A set of strategies is a Nash Equilibrium when no player could
improve her payoff, given the strategies of all other players in
the game, by changing her strategy.”
• In the case of IPSG
• A Nash equilibrium point is a pair of strategies (μ_1*, μ_2*) s.t.
  V(μ_1*, μ_2*) ≥ V(μ_1, μ_2*)
  V(μ_1*, μ_2*) ≥ V(μ_1*, μ_2)
  for all μ_1, μ_2
• Nash equilibrium is significant because it entails a stable noncooperative system
• Widely applicable to economic and political strategies e.g.
Prisoner’s dilemma
• The comparison here is a little odd since the system is “cooperative”
NE Question
• “ If we are modeling two agents in a system, say a user and a
machine. To reach Nash equilibrium, can we restrict what the user
can or should do in the process? Is this reasonable? If not, that
means that the user can do whatever he/she wants, and acts not
optimal, any comments on the Nash equilibrium under such
condition? How can the machine still optimize for both parties? ”
• A NE is two-sided by definition. If a user can improve their return (i.e. isn’t acting optimally), it isn’t a NE.
• Rational agents are typically a good assumption, though it often
requires a very sophisticated model to reflect real life
• Partial observability may be a good way to model mostly-rational agents
  • e.g. a user interacting with a search engine
• The machine may have a dominant strategy irrespective of the user
actions, but it’s generally not interesting to model such problems
with multiple agents
NE Equivalence Question
• “Can you explain when a local optima for gradient descent
would not be a Nash equilibrium?” – Brad
• Gradient descent is only guaranteed to search locally
• One agent may have a better strategy (far removed) that improves their payoff regardless of the other agents
• “We can construct a value function V (w1, w2) such that for some
c, V (·, c) has two modes, one at V (a, c) and the other at V (b, c),
such that V (b, c)> V (a, c). Further assume that V (a, ·) and V (b, ·)
each have global maxima V (a, c) and V (b, c). Then V (a, c) is a
local optimum that is not a Nash equilibrium”
Example - Toy
• Coordination problem
[State-transition diagram: from s1, matching joint actions ⟨a,a⟩/⟨b,b⟩ and mismatched actions ⟨a,b⟩/⟨b,a⟩ lead to different branches through states s2–s6 (via ⟨a,*⟩, ⟨b,*⟩, ⟨*,*⟩), ending in rewards +10, −10, and +5]
Experimental Results
Example - Soccer
• 6x5 grid with 2 randomly placed agents and 1 defender
• Agents can choose {North, South, East, West, Stay, Pass}
• Agents observe who possesses the ball and the status of the surrounding squares
• Defenders can be Random, Greedy (goes out to block the ball), or Defensive (stays in goal to block)
Soccer optimal solution
question
• “ The paper shows a distributed learning algorithm in
cooperative multi-agent domain by a 2-agents soccer game
example. It points out that, an algorithm which can achieve
optimal payoff for all agents may not be possible in general.
Why not? Can you explain this in the class? ” – Sicong
• Solving POIPSG completely is intractable
• Analytical solutions are sometimes possible
Q-learning
• Used for policy search motivation as well as baseline
• Recall –
  Initialize Q(s, a) arbitrarily
  Repeat (for each episode):
      Initialize s
      Repeat (for each step of episode):
          Choose a from s using policy derived from Q
          Take action a, observe r, s′
          Q(s, a) ← Q(s, a) + α[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
          s ← s′
      until s is terminal
  (a runnable sketch follows below)
Experimental Results –
Defensive Opponents
Experimental Results –
Greedy Opponents
Experimental Results –
Random Opponents
Experimental Results –
Mixed Opponents
Tractability Question
• “The paper gives the example of tractable soccer game with
two learning agents playing against one opponent with fixed
strategy and then it mentions that it becomes difficult to even
store the Q table. The question is, what can be the possible to
solve complex environments that are useful for practical
applications, if any? Discussion on this?” – Tavish
• Recall: Q-learning was mainly introduced as motivation
• Removing the Q table in favor of a compact function approximation based on sampling is much more practical for large problems and often not too bad.
Conclusions
• Reinforcement learning can easily be generalized to include multiple agents and applied to game theory
• Distributed gradient descent has nice guarantees and
performs well in POIPSG
• Distributed gradient descent can produce “cooperation”
against consistent and coordinated opponents
Convex Optimization:
Section 9.1
Stephen Boyd and Lieven Vandenberghe. 2004.
Problem
• Our goal is to solve the unconstrained minimization problem:
min f (x)
• Where f(x) is convex, and twice continuously differentiable.
• In this case, a necessary and sufficient condition for x* to be optimal is
  ∇f(x*) = 0
• So what is convex?
Convex, strictly convex
• f(x) is called convex if, for all x, y ∈ dom f and θ ∈ [0, 1],
  f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)
• f(x) is called strictly convex if the inequality above is strict whenever x ≠ y and θ ∈ (0, 1)
Strongly convex
• f(x) is twice continuously differentiable, and it is strongly convex if it satisfies the following condition for some m > 0:
  ∇²f(x) − mI ⪰ 0
  where ∇²f(x) is the Hessian matrix
Strongly convex, cont’d
Diagonalize the Hessian:
  ∇²f(x) = PΛPᵀ, where Pᵀ = P⁻¹ and Λ = diag(m_1, m_2, …, m_n)
Then
  ∇²f(x) − mI = PΛPᵀ − mPIPᵀ = P(Λ − mI)Pᵀ
Thus, m should be no larger than the smallest eigenvalue of the Hessian matrix.
Lower bound of ∇²f(x)
According to Taylor’s theorem (for some z between x and y), we have
  f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x)
       ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²
The right-hand side is a convex quadratic in y, minimized at ỹ = x − (1/m)∇f(x), so
  f(y) ≥ f(x) + ∇f(x)ᵀ(ỹ − x) + (m/2)‖ỹ − x‖₂² = f(x) − (1/(2m))‖∇f(x)‖₂²
Then we have
  p* ≥ f(x) − (1/(2m))‖∇f(x)‖₂²
It means that when ‖∇f(x)‖₂² is small, f(x) is near its optimal value. (A numerical check follows below.)
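A quick numerical sanity check (my own toy example) of the bound p* ≥ f(x) − (1/(2m))‖∇f(x)‖₂² on a strongly convex quadratic f(x) = ½xᵀAx, where m can be taken as the smallest eigenvalue of A and p* = 0:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + np.eye(5)             # symmetric positive definite Hessian
m = np.linalg.eigvalsh(A).min()     # strong convexity constant
f = lambda x: 0.5 * x @ A @ x       # minimized at x* = 0, so p* = 0
grad = lambda x: A @ x

for _ in range(5):
    x = rng.normal(size=5)
    lower_bound = f(x) - grad(x) @ grad(x) / (2 * m)
    assert lower_bound <= 1e-9      # p* = 0 always sits above the bound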
Upper bound of ∇²f(x)
• Similar to the lower bound, ∇²f(x) has an upper bound:
  ∇²f(x) − mI ⪰ 0,   ∇²f(x) − MI ⪯ 0
  f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(z)(y − x)
       ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2)‖y − x‖₂²
Minimizing both sides over y (the right-hand side is minimized at ỹ = x − (1/M)∇f(x)) gives
  p* ≤ f(x) + ∇f(x)ᵀ(ỹ − x) + (M/2)‖ỹ − x‖₂² = f(x) − (1/(2M))‖∇f(x)‖₂²
Condition number of sublevel sets
• We define the width of a convex set C in the direction q, where q is a unit vector,
  W(C, q) = sup_{z ∈ C} qᵀz − inf_{z ∈ C} qᵀz
• Further, we have the minimum width and maximum width,
  W_min = inf_{‖q‖₂=1} W(C, q),   W_max = sup_{‖q‖₂=1} W(C, q)
• The condition number of a convex set C is
  cond(C) = W_max² / W_min²
• Question 1: What is the Hessian matrix? (From Brad)
• Answer: It is the matrix of second-order partial derivatives
• Question 2: Given convexity, is a local optimum guaranteed to be a global optimum?
• Answer: Yes – given convexity, any local optimum is also a global optimum.
• Question 3: Convexity is a strong assumption. Do real-world problems always satisfy this assumption? What if it does not hold?
• Answer: Real-world problems do not always satisfy this assumption, for example:
  maximize f(x, y)
  subject to  x + y < 10,  x > 0,  y > 0,  and  (x + 2y > 4  or  2x + y > 4)
  (the disjunctive “or” constraint makes the feasible region non-convex)
• We then adopt other strategies, e.g. with a small probability, follow a direction other than the gradient-descent direction.
• Question 4: Can you explain the math symbols in Section 9.1? (From Jiyun)
Convex Optimization:
Sections 9.2 and 9.3
Stephen Boyd and Lieven Vandenberghe. 2004.
Descent Methods
• Goal: produce a minimizing sequence x^(k), k = 1, …, s.t.
  x^(k+1) = x^(k) + t^(k) Δx^(k), where t^(k) is the step size for step k
  • t^(k) > 0 (except when x^(k) is already optimal)
• Focus is on descent methods, consistent with the convex theme of the book, i.e.
  f(x^(k+1)) < f(x^(k))
• Recall: ∇f(x^(k))ᵀ(y − x^(k)) ≥ 0 ⇒ f(y) ≥ f(x^(k))
• The search direction Δx^(k) must therefore form an acute angle with the negative gradient,
  ∇f(x^(k))ᵀ Δx^(k) < 0
General descent method
• Algorithm 9.1 –
  given a starting point x ∈ dom f
  repeat
      1. Determine a descent direction Δx.
      2. Line search. Choose a step size t > 0.
      3. Update. x := x + t Δx
  until stopping criterion is satisfied.
• To avoid confusion, call the set searched along the direction a ray,
  {x + tΔx | t ≥ 0}
Descent Direction Question
• “ in section 9.2, how did it decide a descend direction ‘Δx’” - Jiyun Luo
• Book is more theory oriented so the actual descent direction
choice is left general here
• You could choose any direction acute to the gradient and still fall
under their convergence proofs.
Exact Line Search
• When the cost of minimization with respect to a single variable is small compared to the cost of finding the ray itself, we can simply use that minimizer as the step size,
  t = argmin_{s ≥ 0} f(x + sΔx)
  (a sketch follows below)
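A minimal sketch of exact line search using an off-the-shelf 1-D minimizer. The quadratic objective and the step-size bracket [0, 10] are arbitrary illustrative choices.

import numpy as np
from scipy.optimize import minimize_scalar

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0])
dx = -A @ x                                   # descent direction: negative gradient

phi = lambda s: f(x + s * dx)                 # objective restricted to the ray
t = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded").x
x = x + t * dx                                # take the exactly-minimizing step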
Backtracking Line Search
• When exact minimization along the ray is hard, we can approximate it as well, e.g. with backtracking line search
• Algorithm 9.2 –
  given a descent direction Δx for f at x ∈ dom f, α ∈ (0, 0.5), and β ∈ (0, 1)
  t := 1
  while f(x + tΔx) > f(x) + α t ∇f(x)ᵀΔx
      t := βt
• i.e., start with a unit step size
• Reduce the step size by a factor of β until the stopping condition holds,
  f(x + tΔx) ≤ f(x) + α t ∇f(x)ᵀΔx
  (a sketch follows below)
Backtracking Stopping
Condition Question
• “ in algorithm 9.2 the "while" line, α is constant, t is decent length, Δx is descend direction, what is the remain part (∇f(x)ᵀ)? What is its role?” - Jiyun Luo
• ∇f(x)ᵀ denotes the transpose of the gradient
• Recall from convexity that –
  ∇f(x)ᵀΔx < 0
• It follows (non-trivially) that on the interval (0, t₀],
  f(x + tΔx) ≤ f(x) + α t ∇f(x)ᵀΔx
• This lets us bound the optimum and show that the loop
terminates
Backtracking Line Search Ex.
Backtracking Stopping
Condition Question
• “ Could you show us graphically what it looks like when the
stopping condition of the bactracking line search holds?” Brad
Backtracking Stop Ex.
Backtracking Parameter Choice
Question
• “For the backtracking line search method, how will the selection of α and β values affect the convergence and accuracy of result. Also, can the α and β values be estimated beforehand instead of arbitrarily specifying them?” – Tavish
• Estimating optimal α and β values beforehand would be difficult and problem specific
• Choices for α and β will affect convergence speed, but precision is guaranteed by your stopping threshold regardless
Line Search Question
• “This is a question about the line search in chapter 9.2. Since the curve f
is convex, I believe there are many other simple and easy way to
approach the minimum point. For instance, initiate t as a not very small
step length. Then, consider f(x), f(x+t), f(x+2t), f(x+3t). These four values
trisect the range [x, x+3t]. If f(x+t) and f(x+2t) are the two minimum
values among the four, we will know that the minimum point must lie
within [x+t, x+2t]. From that on, we can shrink the searching range, and
keep doing this trisection. For every iterations after that, the minimum
point must lie within the trisection within the minimum two values. This
algorithm can guarantee a fast convergent speed, and requires similar
calculation for the f values, compared with the gradient decent
methods. Hence, my question is, why do we use line search (and
backtracking line search) methods in the convex setting? What are its
advantages compared with other methods that looks also pretty simple,
for instance the one I just mentioned?” – Sicong
• This algorithm seems fine, but not any faster than backtracking (the range is reduced by a factor of 1/3 as opposed to β at every step)
• Exact line search is the most direct approach you can take, and backtracking is very general while still facilitating the book’s convergence guarantees
Gradient Descent
• We’ve left the search direction general
• A natural choice is simply the negative gradient, Δx = −∇f(x)
• Plugging back into Algorithm 9.1, this gives Algorithm 9.3 –
  given a starting point x ∈ dom f
  repeat
      1. Δx := −∇f(x).
      2. Line search. Choose a step size t via exact or backtracking line search
      3. Update. x := x + t Δx
  until stopping criterion is satisfied.
• ‖∇f(x)‖₂ ≤ ε is a natural stopping criterion, i.e. the Euclidean norm of the gradient is small, where ‖x‖₂ := √(x₁² + ⋯ + x_n²)
  (a sketch follows below)
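A minimal, self-contained sketch of Algorithm 9.3 with backtracking line search, applied to a toy quadratic. The objective, tolerance, and α, β values are illustrative choices, not the book’s.

import numpy as np

def gradient_descent(f, grad, x, eps=1e-6, alpha=0.3, beta=0.8, max_iter=10_000):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:       # stopping criterion: ||grad f(x)||_2 <= eps
            break
        dx = -g                            # step 1: descent direction
        t = 1.0                            # step 2: backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta
        x = x + t * dx                     # step 3: update
    return x

A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_star = gradient_descent(f, grad, np.array([5.0, 5.0]))   # converges near [0, 0]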
Line Search Questions Cont.
• “Since the setting is convex and twice continuously differentiable,
can you compare the exact line search, backtracking line search, and
the Newton’s Method?” – Sicong
• Newton’s method is a descent method, but it relies on the Hessian
rather than the gradient so it’s using a different search vector
• Convergence guarantees rely on an additional self-concordance
assumption
• “When would one use exact, vs backtracking line search?” - Brad
• Exact line search typically reduces the required number of steps (not
guaranteed)
• However, computing the exact minimum often not feasible or
efficient
• “What are the major differences between gradient method with
exact line search and gradient method with backtracking line
search? Please illustrate with an example how they differ in reaching
the point of convergence.” - Tavish
Newton’s Method
Gradient Descent with Exact
Line Search
Gradient Descent with
Backtracking Line Search
Exact vs. Backtracking Speed
Conclusions
• Based on toy experimental results
• Gradient method offers roughly linear convergence
• Choice of backtracking parameters has a noticeable but not
dramatic effect
• Gradient descent is advantageous due to its simplicity
• Gradient descent may converge slowly in certain cases