Line Search Methods
Gradient descent
One of the simplest local optimisation algorithms is gradient descent. It is
initialised at some point in parameter space $a_0$, and at each iteration the function
value is reduced by following the direction of steepest descent, the negative gradient:

$$a_{k+1} = a_k - \gamma \nabla f(a_k).$$

This is an example of an important class of algorithms called line search methods.
These algorithms choose a search direction $p_k$ at each iteration $k$, and search
along the 1D line from the current point $a_k$ to a new point

$$a_{k+1} = a_k + \alpha_k p_k$$

with a lower function value. The problem at each iteration thus becomes a one-dimensional
optimisation problem along $p_k$ to find the optimal value of the step length $\alpha_k$.
Each line search algorithm is therefore defined by how it chooses both the search
direction $p_k$ and the step length $\alpha_k$. For gradient descent the search direction
is simply $p_k = -\nabla f(a_k)$.
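As a concrete illustration, here is a minimal sketch of fixed-step gradient descent in
Python; the test function, its gradient and the value of `gamma` are placeholder choices
for illustration only.

```python
import numpy as np

def gradient_descent(grad_f, a0, gamma=0.1, max_iter=1000, tol=1e-6):
    """Minimal fixed-step gradient descent: a_{k+1} = a_k - gamma * grad_f(a_k)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(a)
        if np.linalg.norm(g) < tol:  # stop once the gradient is (almost) zero
            break
        a = a - gamma * g  # step along the steepest descent direction
    return a

# Illustrative example: minimise f(x, y) = x^2 + 10 y^2, whose gradient is (2x, 20y)
grad_f = lambda a: np.array([2.0 * a[0], 20.0 * a[1]])
print(gradient_descent(grad_f, [1.0, 1.0], gamma=0.05))  # -> approximately (0, 0)
```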

Plateaus with low gradient
An obvious downside to simple gradient descent can be seen for functions which have
regions of zero or small gradient, or plateaus. Here a gradient descent algorithm with a
constant step length $\gamma$ will proceed very slowly, if at all. This motivates another
important line search algorithm, Newton's method.
The Newton direction can be derived by considering the second-order Taylor
expansion of the function about the current point $a_k$:

$$f(a_k + p) \approx f(a_k) + p^T \nabla f(a_k) + \frac{1}{2} p^T \nabla^2 f(a_k)\, p.$$

We find the value of $p$ that minimises this approximation by setting its derivative with
respect to $p$ to zero, leading to the Newton direction

$$p_k = -\left(\nabla^2 f(a_k)\right)^{-1} \nabla f(a_k).$$

Unlike steepest descent, Newton's method has a natural step length $\alpha = 1$, which is
suitable for a wide variety of problems and can quickly cross areas of low
gradient. Naturally, since the algorithm is based on a second-order approximation of
the function $f$, it works better if this approximation is reasonably accurate.
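As a sketch of a single Newton step, the snippet below solves the linear system
$\nabla^2 f(a_k)\, p_k = -\nabla f(a_k)$ rather than forming the inverse explicitly; the
quadratic test function and its derivatives are illustrative assumptions.

```python
import numpy as np

def newton_step(grad_f, hess_f, a):
    """One Newton step: solve H p = -g for the Newton direction, then take alpha = 1."""
    g = grad_f(a)
    H = hess_f(a)
    p = np.linalg.solve(H, -g)  # cheaper and more stable than inverting H explicitly
    return a + p

# Illustrative example: f(x, y) = x^2 + 10 y^2 is exactly quadratic, so a single
# Newton step lands on the minimiser (0, 0) regardless of the starting point.
grad_f = lambda a: np.array([2.0 * a[0], 20.0 * a[1]])
hess_f = lambda a: np.diag([2.0, 20.0])
print(newton_step(grad_f, hess_f, np.array([1.0, 1.0])))
```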
Newton's method can be used as long as the inverse of the second derivative of the
function, $(\nabla^2 f(a_k))^{-1}$, exists (e.g. it will always exist for a positive
definite $\nabla^2 f(a_k)$). However, even when this inverse does exist it is possible
that the resulting direction does not satisfy the descent condition
$p_k^T \nabla f(a_k) < 0$ (or equivalently
$\nabla f(a_k)^T (\nabla^2 f(a_k))^{-1} \nabla f(a_k) > 0$), so many modifications to
Newton's method, falling under a class of methods called quasi-Newton methods, have been
proposed to satisfy this descent condition.
Quasi-Newton methods do not require the (often onerous) calculation of the Hessian
$\nabla^2 f(a_k)$ as Newton's method does; instead they form an approximation to the
Hessian, $B_k \approx \nabla^2 f(a_k)$, that is updated at each step using the information
given by the gradient evaluations $\nabla f(a_k)$. Two popular methods of performing this
update are the symmetric-rank-one (SR1) formula and the Broyden, Fletcher, Goldfarb and
Shanno (BFGS) formula. Once the approximation $B_k$ is formed, the search direction is
calculated via

$$p_k = -B_k^{-1} \nabla f(a_k).$$
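As a sketch of the idea (not Scipy's implementation), the BFGS update below maintains the
Hessian approximation $B_k$ using only the step $s_k = a_{k+1} - a_k$ and the gradient
difference $y_k = \nabla f(a_{k+1}) - \nabla f(a_k)$.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B, given the step s = a_{k+1} - a_k
    and the gradient difference y = grad_f(a_{k+1}) - grad_f(a_k)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def quasi_newton_direction(B, g):
    """Search direction p = -B^{-1} g, again computed via a linear solve."""
    return np.linalg.solve(B, -g)
```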
For more details of other line search methods, please see Chapter 3 of the Nocedal and
Wright textbook, or the other textbooks listed at the end of this lesson. Finally, it
should be noted that the conjugate gradient method can also be used for non-linear
optimisation, where the search direction is given by

$$p_k = -\nabla f(a_k) + \beta_k p_{k-1},$$

where the scalar $\beta_k$ can be chosen, for example, using the Fletcher-Reeves formula
$\beta_k = \frac{\nabla f(a_k)^T \nabla f(a_k)}{\nabla f(a_{k-1})^T \nabla f(a_{k-1})}$.
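A minimal sketch of this update in Python, assuming the gradients at the previous and
current points and the previous direction are available as NumPy arrays, might be:

```python
def fletcher_reeves_direction(g_new, g_old, p_old):
    """Non-linear conjugate gradient direction p_k = -g_k + beta_k * p_{k-1},
    with beta_k given by the Fletcher-Reeves formula."""
    beta = (g_new @ g_new) / (g_old @ g_old)
    return -g_new + beta * p_old
```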
Step length
In line search methods, choosing the step length $\alpha_k$ is a non-trivial task.
Ideally we would want to choose $\alpha_k$ to minimise the function along the
one-dimensional search direction $p_k$. That is, we wish to minimise

$$\phi(\alpha) = f(a_k + \alpha p_k), \quad \alpha > 0.$$

In general it is too expensive to do this minimisation exactly, so approximate methods
are used in which multiple trial values of $\alpha$ are tested, stopping when a candidate
is found that satisfies a set of conditions. There are two main sets of conditions used,
the Wolfe conditions and the Goldstein conditions.
The first of the two Wolfe conditions is the sufficient decrease condition, which ensures
that the reduction in the function value is proportional to both the step length
$\alpha_k$ and the gradient in the direction of the step:

$$f(a_k + \alpha_k p_k) \le f(a_k) + c_1 \alpha_k \nabla f(a_k)^T p_k.$$

The second Wolfe condition is the curvature condition, which prevents unacceptably
short steps by ensuring that the slope of $\phi$ at $\alpha_k$ is greater than some
constant $c_2$ times the initial slope:

$$\nabla f(a_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f(a_k)^T p_k,$$

where $0 < c_1 < c_2 < 1$. Typical values are $c_1 = 10^{-4}$ and $c_2 = 0.9$. The
strong Wolfe conditions additionally restrict the magnitude of the gradient to be small,
so as to exclude points that are too far from stationary points of $\phi$:

$$\left|\nabla f(a_k + \alpha_k p_k)^T p_k\right| \le c_2 \left|\nabla f(a_k)^T p_k\right|.$$
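These conditions are straightforward to check numerically; a minimal sketch (with the
function, its gradient, the search direction and the constants all passed in as
arguments, and `a` and `p` assumed to be NumPy arrays) might look like:

```python
def satisfies_strong_wolfe(f, grad_f, a, p, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for a candidate step length alpha."""
    slope0 = grad_f(a) @ p  # initial slope along p; negative for a descent direction
    sufficient_decrease = f(a + alpha * p) <= f(a) + c1 * alpha * slope0
    curvature = abs(grad_f(a + alpha * p) @ p) <= c2 * abs(slope0)
    return sufficient_decrease and curvature
```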
The Goldstein conditions are similar in spirit to the Wolfe conditions, and are formed
from the two inequalities

$$f(a_k) + (1 - c)\, \alpha_k \nabla f(a_k)^T p_k \le f(a_k + \alpha_k p_k) \le f(a_k) + c\, \alpha_k \nabla f(a_k)^T p_k,$$

with $0 < c < 1/2$. The first inequality prevents small step sizes, while the second is
the same sufficient decrease condition as in the Wolfe conditions. The Goldstein
conditions are often used in Newton-type methods, but for quasi-Newton methods the Wolfe
conditions are preferred. The diagrams in Chapter 3 of the text by Nocedal and Wright
illustrate both sets of conditions.
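A corresponding check for the Goldstein conditions, in the same sketch style as above,
could be:

```python
def satisfies_goldstein(f, grad_f, a, p, alpha, c=0.25):
    """Check the Goldstein conditions for a candidate step length alpha (0 < c < 1/2)."""
    slope0 = grad_f(a) @ p
    f_new = f(a + alpha * p)
    not_too_short = f(a) + (1.0 - c) * alpha * slope0 <= f_new  # rules out tiny steps
    sufficient_decrease = f_new <= f(a) + c * alpha * slope0
    return not_too_short and sufficient_decrease
```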

Algorithms for choosing candidate step lengths $\alpha$ can be complicated, so we
will only mention here one of the simplest, which is the backtracking method. This
approach implicitly rules out unacceptably short steps, since it starts from a relatively
long step length and only shrinks it when necessary, so it only needs to repeatedly
test the sufficient decrease condition that appears in both the Wolfe and
Goldstein conditions.
Backtracking algorithm
Choose $\bar{\alpha} > 0$, $\rho \in (0, 1)$, $c \in (0, 1)$; set $\alpha \leftarrow \bar{\alpha}$
repeat until $f(a_k + \alpha p_k) \le f(a_k) + c\, \alpha \nabla f(a_k)^T p_k$
    $\alpha \leftarrow \rho \alpha$
end repeat.
Set $\alpha_k = \alpha$.
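A direct translation of this algorithm into Python might look like the following sketch
(the defaults for the initial step length, the contraction factor `rho` and the constant
`c` are illustrative choices):

```python
def backtracking_line_search(f, grad_f, a, p, alpha_init=1.0, rho=0.5, c=1e-4, max_iter=50):
    """Shrink alpha by a factor rho until the sufficient decrease condition holds."""
    alpha = alpha_init
    slope0 = grad_f(a) @ p  # slope along p; must be negative for a descent direction
    for _ in range(max_iter):
        if f(a + alpha * p) <= f(a) + c * alpha * slope0:
            break
        alpha *= rho
    return alpha
```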
Software
- Scipy has a wide variety of (mostly) line search and trust region algorithms in
  `scipy.optimize`. There are 14 local minimisers, so we won't list them all here!
- It is worth noting that Scipy includes the `line_search` function, which allows you to
  use their line search, satisfying the strong Wolfe conditions, with your own custom
  search direction.
- Scipy also includes a `HessianUpdateStrategy` class, which provides an interface for
  specifying an approximate Hessian for use in quasi-Newton methods, along with two
  implementations, `BFGS` and `SR1`.
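For example, a single line search along the steepest descent direction of the Rosenbrock
function (using Scipy's built-in `rosen` and `rosen_der`; the starting point here is an
arbitrary choice) might look like:

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

a = np.array([2.0, 2.0])
p = -rosen_der(a)  # steepest descent direction at a

# Returns the step length alpha (or None if the search fails), along with
# the number of function/gradient evaluations and the new function value.
alpha, fc, gc, new_fval, old_fval, new_slope = line_search(rosen, rosen_der, a, p)
print("step length satisfying the strong Wolfe conditions:", alpha)
```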
Problems
Line Search Problems
1. Program the steepest descent and Newton algorithms using the backtracking line search.
   Use them to minimise the Rosenbrock function. Set the initial step length to
   $\alpha_0 = 1$ and print the step length used by each method at each iteration. First
   try the initial point $a_0 = (1.2, 1.2)$ and then the more difficult starting point
   $a_0 = (-1.2, 1)$.
2. Plot the function surface using `matplotlib` and overlay the line search segments so
   you can visualise the progress of your algorithm. Observe the difference between the
   algorithms when the gradient of the Rosenbrock function is low (i.e. at the bottom of
   the curved valley).
3. Repeat (1) and (2) above using the line search implemented in Scipy,
   `scipy.optimize.line_search`, which uses the strong Wolfe conditions.