Mirror descent

In mathematics, mirror descent is an iterative optimization algorithm for finding a local minimum of a differentiable function.

It generalizes algorithms such as gradient descent and multiplicative weights.

History

Mirror descent was originally proposed by Nemirovski and Yudin in 1983.[1]

Motivation

In gradient descent with the sequence of learning rates $(\eta_n)_{n\geq 0}$ applied to a differentiable function $F$, one starts with a guess $\mathbf{x}_0$ for a local minimum of $F$ and considers the sequence $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots$ such that

$$\mathbf{x}_{n+1} = \mathbf{x}_n - \eta_n \nabla F(\mathbf{x}_n), \quad n \geq 0.$$

This can be reformulated by noting that

$$\mathbf{x}_{n+1} = \arg\min_{\mathbf{x}} \left( F(\mathbf{x}_n) + \nabla F(\mathbf{x}_n)^{T}(\mathbf{x} - \mathbf{x}_n) + \frac{1}{2\eta_n}\|\mathbf{x} - \mathbf{x}_n\|^{2} \right).$$

In other words, $\mathbf{x}_{n+1}$ minimizes the first-order approximation to $F$ at $\mathbf{x}_n$ with an added proximity term $\|\mathbf{x} - \mathbf{x}_n\|^{2}$.
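To see that the two formulations agree, set the gradient of the minimized expression to zero:

$$\nabla F(\mathbf{x}_n) + \frac{1}{\eta_n}(\mathbf{x} - \mathbf{x}_n) = 0 \quad\Longleftrightarrow\quad \mathbf{x} = \mathbf{x}_n - \eta_n \nabla F(\mathbf{x}_n),$$

which recovers the gradient descent update above.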

This squared Euclidean distance term is a particular example of a Bregman distance. Using other Bregman distances yields other algorithms, such as Hedge, which may be better suited to optimization over particular geometries.[2][3]
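As a concrete illustration, replacing the squared Euclidean term by the Bregman distance generated by the negative entropy $h(\mathbf{x}) = \sum_i x_i \log x_i$ on the probability simplex (the Kullback–Leibler divergence) gives the update

$$x_{n+1,i} = \frac{x_{n,i}\,\exp\!\big(-\eta_n\,[\nabla F(\mathbf{x}_n)]_i\big)}{\sum_j x_{n,j}\,\exp\!\big(-\eta_n\,[\nabla F(\mathbf{x}_n)]_j\big)},$$

i.e., the exponentiated-gradient form of the Hedge / multiplicative-weights update.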

Formulation

We are given a convex function $f$ to optimize over a convex set $K \subset \mathbb{R}^n$, and a norm $\|\cdot\|$ on $\mathbb{R}^n$.

We are also given a differentiable convex function $h\colon \mathbb{R}^n \to \mathbb{R}$ that is $\alpha$-strongly convex with respect to the given norm. This is called the distance-generating function, and its gradient $\nabla h\colon \mathbb{R}^n \to \mathbb{R}^n$ is known as the mirror map.
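For instance, $h(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{x}\|_2^2$ is 1-strongly convex with respect to the Euclidean norm, its mirror map is the identity, and with this choice mirror descent reduces to projected gradient descent.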

Starting from an initial point $x_0 \in K$, each iteration of mirror descent performs the following steps (a code sketch of one iteration follows the list):

  • Map to the dual space: $\theta_t \leftarrow \nabla h(x_t)$
  • Update in the dual space using a gradient step: $\theta_{t+1} \leftarrow \theta_t - \eta_t \nabla f(x_t)$
  • Map back to the primal space: $x'_{t+1} \leftarrow (\nabla h)^{-1}(\theta_{t+1})$
  • Project back to the feasible region $K$: $x_{t+1} \leftarrow \arg\min_{x \in K} D_h(x \,\|\, x'_{t+1})$, where $D_h$ is the Bregman divergence associated with $h$.
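
A minimal sketch of these steps in Python, assuming the negative-entropy distance-generating function on the probability simplex (so the mirror map is $\nabla h(x)_i = 1 + \log x_i$, its inverse is $\theta_i \mapsto \exp(\theta_i - 1)$, and the Bregman projection onto the simplex is a renormalization); the function name and the toy objective are illustrative assumptions, not part of the source:

import numpy as np

def mirror_descent_simplex(grad_f, x0, step_sizes):
    """Mirror descent over the probability simplex with the negative-entropy
    distance-generating function h(x) = sum_i x_i * log(x_i)."""
    x = np.asarray(x0, dtype=float)
    for eta in step_sizes:
        theta = 1.0 + np.log(x)          # map to the dual space: grad h(x_t)
        theta = theta - eta * grad_f(x)  # gradient step in the dual space
        x_prime = np.exp(theta - 1.0)    # map back: (grad h)^{-1}(theta_{t+1})
        x = x_prime / x_prime.sum()      # Bregman (KL) projection onto the simplex
    return x

# Illustrative use: minimize f(x) = 0.5 * ||x - c||^2 over the simplex.
c = np.array([0.2, 0.5, 0.3])
x_min = mirror_descent_simplex(grad_f=lambda x: x - c,
                               x0=np.ones(3) / 3,
                               step_sizes=[0.5] * 200)

Composing the four steps shows that each iteration multiplies the current point coordinatewise by $\exp(-\eta_t \nabla f(x_t))$ and renormalizes, matching the multiplicative-weights update mentioned in the Motivation section.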

Extensions

Mirror descent in the online optimization setting is known as Online Mirror Descent (OMD).[4]

References

  1. Nemirovsky, Arkadi; Yudin, David (1983). Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons.
  2. Nemirovski, Arkadi (2012). Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization. https://www2.isye.gatech.edu/~nemirovs/COLT2012Tut.pdf
  3. "Mirror descent algorithm". tlienart.github.io. Retrieved 2022-07-10.
  4. Fang, Huang; Harvey, Nicholas J. A.; Portella, Victor S.; Friedlander, Michael P. (2021-09-03). "Online mirror descent and dual averaging: keeping pace in the dynamic case". arXiv:2006.02585 [cs.LG].