Frank–Wolfe algorithm

The Frank–Wolfe algorithm is an iterative first-order optimization algorithm for constrained convex optimization. Also known as the conditional gradient method,[1] the reduced gradient algorithm, and the convex combination algorithm, the method was originally proposed by Marguerite Frank and Philip Wolfe in 1956.[2] In each iteration, the Frank–Wolfe algorithm considers a linear approximation of the objective function and moves towards a minimizer of this linear function (taken over the same domain).

Problem statement

Suppose $\mathcal{D}$ is a compact convex set in a vector space and $f \colon \mathcal{D} \to \mathbb{R}$ is a convex, differentiable real-valued function. The Frank–Wolfe algorithm solves the optimization problem

Minimize $f(\mathbf{x})$
subject to $\mathbf{x} \in \mathcal{D}$.

Algorithm

[Figure: A step of the Frank–Wolfe algorithm.]
Initialization: Let $k \leftarrow 0$, and let $\mathbf{x}_0$ be any point in $\mathcal{D}$.
Step 1. Direction-finding subproblem: Find $\mathbf{s}_k$ solving
Minimize $\mathbf{s}^T \nabla f(\mathbf{x}_k)$
Subject to $\mathbf{s} \in \mathcal{D}$
(Interpretation: Minimize the linear approximation of the problem given by the first-order Taylor approximation of $f$ around $\mathbf{x}_k$, constrained to stay within $\mathcal{D}$.)
Step 2. Step size determination: Set $\alpha \leftarrow \frac{2}{k+2}$, or alternatively find $\alpha$ that minimizes $f(\mathbf{x}_k + \alpha(\mathbf{s}_k - \mathbf{x}_k))$ subject to $0 \leq \alpha \leq 1$.
Step 3. Update: Let $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \alpha(\mathbf{s}_k - \mathbf{x}_k)$, let $k \leftarrow k + 1$, and go to Step 1.
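
The following Python sketch illustrates these three steps. It is a minimal illustration under stated assumptions, not a reference implementation: the names frank_wolfe, grad, and lmo are hypothetical, and the example assumes the feasible set is the probability simplex, for which the direction-finding subproblem is solved by a vertex (one-hot vector) of the simplex.

import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters=500):
    # grad(x): returns the gradient of f at x.
    # lmo(g):  linear minimization oracle, returns argmin_{s in D} s^T g.
    # x0:      any feasible starting point in D.
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad(x)                   # gradient at the current iterate
        s = lmo(g)                    # Step 1: direction-finding subproblem
        alpha = 2.0 / (k + 2.0)       # Step 2: default step-size rule
        x = x + alpha * (s - x)       # Step 3: convex-combination update
    return x

# Example: minimize ||x - b||^2 over the probability simplex.
# The oracle returns the simplex vertex (one-hot vector) whose index
# corresponds to the smallest gradient entry.
b = np.array([0.2, 0.5, 0.9])
f_grad = lambda x: 2.0 * (x - b)
simplex_lmo = lambda g: np.eye(len(g))[np.argmin(g)]
x_hat = frank_wolfe(f_grad, simplex_lmo, np.ones(3) / 3.0)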

Properties

While competing methods such as gradient descent for constrained optimization require a projection step back to the feasible set in each iteration, the Frank–Wolfe algorithm only needs the solution of a convex problem over the same set in each iteration, and automatically stays in the feasible set.

The convergence of the Frank–Wolfe algorithm is sublinear in general: the error in the objective function to the optimum is $O(1/k)$ after $k$ iterations, so long as the gradient is Lipschitz continuous with respect to some norm. The same convergence rate can also be shown if the sub-problems are only solved approximately.[3]
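
One common explicit form of this bound (with the step size $\alpha = \frac{2}{k+2}$, assuming the gradient of $f$ is $L$-Lipschitz over $\mathcal{D}$, and writing $\operatorname{diam}(\mathcal{D})$ for the diameter of the feasible set; the constants $L$ and $\operatorname{diam}(\mathcal{D})$ are not introduced in the text above) is

$$f(\mathbf{x}_k) - f(\mathbf{x}^*) \leq \frac{2 L \operatorname{diam}(\mathcal{D})^2}{k+2}.$$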

The iterates of the algorithm can always be represented as a sparse convex combination of the extreme points of the feasible set, which has contributed to the popularity of the algorithm for sparse greedy optimization in machine learning and signal processing problems,[4] as well as, for example, the optimization of minimum-cost flows in transportation networks.[5]

If the feasible set is given by a set of linear constraints, then the subproblem to be solved in each iteration becomes a linear program.
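
As a hedged illustration, the direction-finding step over a polytope $\{\mathbf{s} : A\mathbf{s} \leq \mathbf{b}\}$ could be handed to an off-the-shelf LP solver such as scipy.optimize.linprog. The helper name lmo_polytope and the arrays A_ub and b_ub below are placeholders for the user's own constraint data, and the polytope is assumed to be bounded (compact, as in the problem statement) so the LP has a finite minimizer.

import numpy as np
from scipy.optimize import linprog

def lmo_polytope(gradient, A_ub, b_ub):
    # Frank-Wolfe subproblem over the polytope {s : A_ub @ s <= b_ub}:
    # solve the linear program  min_s  gradient^T s  subject to  A_ub @ s <= b_ub.
    # bounds=(None, None) removes linprog's default nonnegativity bounds;
    # the polytope itself is assumed to be bounded.
    res = linprog(c=gradient, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    if not res.success:
        raise RuntimeError("LP subproblem failed: " + res.message)
    return res.x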

While the worst-case convergence rate of $O(1/k)$ cannot be improved in general, faster convergence can be obtained for special problem classes, such as some strongly convex problems.[6]

Lower bounds on the solution value, and primal-dual analysis

Since $f$ is convex, for any two points $\mathbf{x}, \mathbf{y} \in \mathcal{D}$ we have:

$$f(\mathbf{y}) \geq f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x})$$

This also holds for the (unknown) optimal solution $\mathbf{x}^*$. That is, $f(\mathbf{x}^*) \geq f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x})$. The best lower bound with respect to a given point $\mathbf{x}$ is given by

$$\begin{aligned}
f(\mathbf{x}^*) &\geq f(\mathbf{x}) + (\mathbf{x}^* - \mathbf{x})^T \nabla f(\mathbf{x}) \\
&\geq \min_{\mathbf{y} \in \mathcal{D}} \left\{ f(\mathbf{x}) + (\mathbf{y} - \mathbf{x})^T \nabla f(\mathbf{x}) \right\} \\
&= f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) + \min_{\mathbf{y} \in \mathcal{D}} \mathbf{y}^T \nabla f(\mathbf{x})
\end{aligned}$$

The latter optimization problem is solved in every iteration of the Frank–Wolfe algorithm; therefore, the solution $\mathbf{s}_k$ of the direction-finding subproblem of the $k$-th iteration can be used to determine increasing lower bounds $l_k$ during each iteration by setting $l_0 = -\infty$ and

$$l_k := \max\left(l_{k-1},\, f(\mathbf{x}_k) + (\mathbf{s}_k - \mathbf{x}_k)^T \nabla f(\mathbf{x}_k)\right)$$

Such lower bounds on the unknown optimal value are important in practice because they can be used as a stopping criterion, and they give an efficient certificate of the approximation quality in every iteration, since always $l_k \leq f(\mathbf{x}^*) \leq f(\mathbf{x}_k)$.

It has been shown that this corresponding duality gap, that is, the difference between $f(\mathbf{x}_k)$ and the lower bound $l_k$, decreases with the same convergence rate, i.e., $f(\mathbf{x}_k) - l_k = O(1/k)$.
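
As an illustrative Python sketch (reusing the hypothetical grad and lmo helpers from the earlier example and adding a hypothetical objective function f), the lower bound and duality gap can be tracked during the iterations and used as a stopping criterion:

import numpy as np

def frank_wolfe_with_gap(f, grad, lmo, x0, tol=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)
    lower_bound = -np.inf                    # l_0 = -infinity
    gap = np.inf
    for k in range(max_iters):
        g = grad(x)
        s = lmo(g)                           # direction-finding subproblem
        # The linearization of f at x, evaluated at s, lower-bounds f(x*).
        lower_bound = max(lower_bound, f(x) + (s - x) @ g)
        gap = f(x) - lower_bound             # certifies f(x) - f(x*) <= gap
        if gap <= tol:
            break                            # stopping criterion from the gap
        alpha = 2.0 / (k + 2.0)
        x = x + alpha * (s - x)
    return x, lower_bound, gap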

Notes

  1. ^ Levitin, E. S.; Polyak, B. T. (1966). "Constrained minimization methods". USSR Computational Mathematics and Mathematical Physics. 6 (5): 1. doi:10.1016/0041-5553(66)90114-5.
  2. ^ Frank, M.; Wolfe, P. (1956). "An algorithm for quadratic programming". Naval Research Logistics Quarterly. 3 (1–2): 95–110. doi:10.1002/nav.3800030109.
  3. ^ Dunn, J. C.; Harshbarger, S. (1978). "Conditional gradient algorithms with open loop step size rules". Journal of Mathematical Analysis and Applications. 62 (2): 432. doi:10.1016/0022-247X(78)90137-3.
  4. ^ Clarkson, K. L. (2010). "Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm". ACM Transactions on Algorithms. 6 (4): 1–30. CiteSeerX 10.1.1.145.9299. doi:10.1145/1824777.1824783.
  5. ^ Fukushima, M. (1984). "A modified Frank-Wolfe algorithm for solving the traffic assignment problem". Transportation Research Part B: Methodological. 18 (2): 169–177. doi:10.1016/0191-2615(84)90029-8.
  6. ^ Bertsekas, Dimitri (1999). Nonlinear Programming. Athena Scientific. p. 215. ISBN 978-1-886529-00-7.

Bibliography

  • Jaggi, Martin (2013). "Revisiting Frank–Wolfe: Projection-Free Sparse Convex Optimization". Journal of Machine Learning Research: Workshop and Conference Proceedings. 28 (1): 427–435. (Overview paper)
  • Nocedal, Jorge; Wright, Stephen J. (2006). Numerical Optimization (2nd ed.). Berlin, New York: Springer-Verlag. ISBN 978-0-387-30303-1.

External links

  • https://conditional-gradients.org/: a survey of Frank–Wolfe algorithms.
  • Marguerite Frank giving a personal account of the history of the algorithm
