Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.[1][2] The approximation is justified by the Bernstein–von Mises theorem, which states that under regularity conditions the posterior converges to a Gaussian in large samples.[3][4]

For example, a (possibly non-linear) regression or classification model with data set $\{x_n, y_n\}_{n=1,\ldots,N}$ comprising inputs $x$ and outputs $y$ has an (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p(\mathbf{y} \mid \mathbf{x}, \theta)$ and the parameter prior $p(\theta)$. Suppose one wants to approximate the joint density of outputs and parameters $p(\mathbf{y}, \theta \mid \mathbf{x})$:

$$p(\mathbf{y}, \theta \mid \mathbf{x}) \;=\; p(\mathbf{y} \mid \mathbf{x}, \theta)\, p(\theta \mid \mathbf{x}) \;=\; p(\mathbf{y} \mid \mathbf{x})\, p(\theta \mid \mathbf{y}, \mathbf{x}) \;\simeq\; \tilde{q}(\theta) \;=\; Z q(\theta).$$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, equal to the product of the marginal likelihood $p(\mathbf{y} \mid \mathbf{x})$ and the posterior $p(\theta \mid \mathbf{y}, \mathbf{x})$. Seen as a function of $\theta$, the joint is an un-normalised density. In Laplace's approximation, we approximate the joint by an un-normalised Gaussian $\tilde{q}(\theta) = Z q(\theta)$, where we use $q$ to denote the approximate (normalised) density, $\tilde{q}$ the un-normalised density, and $Z$ a constant (independent of $\theta$). Since the marginal likelihood $p(\mathbf{y} \mid \mathbf{x})$ does not depend on the parameter $\theta$ and the posterior $p(\theta \mid \mathbf{y}, \mathbf{x})$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta)$ of our approximation, respectively. Laplace's approximation is

$$p(\mathbf{y}, \theta \mid \mathbf{x}) \;\simeq\; p(\mathbf{y}, \hat{\theta} \mid \mathbf{x}) \exp\!\big(-\tfrac{1}{2}(\theta - \hat{\theta})^{\top} S^{-1} (\theta - \hat{\theta})\big) \;=\; \tilde{q}(\theta),$$

where we have defined

$$\begin{aligned} \hat{\theta} &\;=\; \operatorname{argmax}_{\theta} \log p(\mathbf{y}, \theta \mid \mathbf{x}), \\ S^{-1} &\;=\; -\left.\nabla_{\theta} \nabla_{\theta} \log p(\mathbf{y}, \theta \mid \mathbf{x})\right|_{\theta = \hat{\theta}}, \end{aligned}$$

where $\hat{\theta}$ is the location of a mode of the joint target density, also known as the maximum a posteriori (MAP) point, and $S^{-1}$ is the $D \times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta = \hat{\theta}$. Thus, the Gaussian approximation matches the value and the curvature of the un-normalised target density at the mode. The value of $\hat{\theta}$ is usually found using a gradient-based method, e.g. Newton's method. In summary, we have
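As an illustration, the following is a minimal sketch of these two steps for a Bayesian logistic regression model with a Gaussian prior $\mathcal{N}(0, \alpha^{-1} I)$: the mode $\hat{\theta}$ is found with a general-purpose gradient-based optimiser, and $S^{-1}$ is the analytic Hessian of the negative log joint at the mode. The synthetic data, the prior precision `alpha`, and all variable names are illustrative assumptions, not taken from the cited sources.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic classification data (illustrative only)
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))                      # inputs x_n
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)  # outputs y_n in {0, 1}
alpha = 1.0                                      # prior precision: theta ~ N(0, alpha^{-1} I)

def neg_log_joint(theta):
    """-log p(y, theta | x), up to an additive constant."""
    logits = X @ theta
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))   # Bernoulli log likelihood
    log_prior = -0.5 * alpha * theta @ theta
    return -(log_lik + log_prior)

# MAP point: theta_hat = argmax_theta log p(y, theta | x)
theta_hat = minimize(neg_log_joint, np.zeros(D), method="BFGS").x

# S^{-1}: Hessian of the negative log joint at the mode (analytic for logistic regression)
p = 1.0 / (1.0 + np.exp(-(X @ theta_hat)))
S_inv = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(D)
S = np.linalg.inv(S_inv)
# Approximate posterior: q(theta) = N(theta | theta_hat, S)
```

When second derivatives are not available in closed form, a numerical Hessian (for example, finite differences of the gradient) can be substituted in the last step.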

$$\begin{aligned} q(\theta) &\;=\; \mathcal{N}(\theta \mid \mu = \hat{\theta}, \Sigma = S), \\ \log Z &\;=\; \log p(\mathbf{y}, \hat{\theta} \mid \mathbf{x}) + \tfrac{1}{2} \log |S| + \tfrac{D}{2} \log(2\pi), \end{aligned}$$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood, respectively.[5] In the special case of Bayesian linear regression with a Gaussian prior, the approximation is exact. The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used; it was pioneered in the context of neural networks by David MacKay,[6] and for Gaussian processes by Williams and Barber.[7]
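To make the exactness claim concrete, the following self-contained sketch checks the $\log Z$ formula on a conjugate Gaussian model (a single unknown mean with unit noise variance, a special case of Bayesian linear regression), where the mode and curvature are available in closed form and Laplace's approximation recovers the exact log marginal likelihood. The prior variance and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=50)   # observations y_n = theta + noise, noise variance 1
N = y.size
prior_var = 4.0                               # prior: theta ~ N(0, prior_var)

def log_joint(theta):
    """log p(y, theta) = sum_n log N(y_n | theta, 1) + log N(theta | 0, prior_var)."""
    log_lik = -0.5 * np.sum((y - theta) ** 2) - 0.5 * N * np.log(2 * np.pi)
    log_prior = -0.5 * theta ** 2 / prior_var - 0.5 * np.log(2 * np.pi * prior_var)
    return log_lik + log_prior

# Laplace: mode and curvature in closed form for this model (D = 1)
theta_hat = y.sum() / (N + 1.0 / prior_var)   # MAP point
S = 1.0 / (N + 1.0 / prior_var)               # inverse of the negative log joint's curvature
log_Z_laplace = log_joint(theta_hat) + 0.5 * np.log(S) + 0.5 * np.log(2 * np.pi)

# Exact log marginal likelihood: y ~ N(0, I + prior_var * ones @ ones.T)
cov = np.eye(N) + prior_var * np.ones((N, N))
_, logdet = np.linalg.slogdet(cov)
quad = y @ np.linalg.solve(cov, y)
log_Z_exact = -0.5 * (quad + logdet + N * np.log(2 * np.pi))

print(log_Z_laplace, log_Z_exact)   # agree up to floating-point error
```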

References

  1. ^ Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1991). "Laplace's method in Bayesian analysis". Statistical Multiple Integration. Contemporary Mathematics. Vol. 115. pp. 89–100. doi:10.1090/conm/115/07. ISBN 0-8218-5122-5.
  2. ^ MacKay, David J. C. (2003). "Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method" (PDF).
  3. ^ Hartigan, J. A. (1983). "Asymptotic Normality of Posterior Distributions". Bayes Theory. Springer Series in Statistics. New York: Springer. pp. 107–118. doi:10.1007/978-1-4613-8242-3_11. ISBN 978-1-4613-8244-7.
  4. ^ Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". In Geisser, S.; Hodges, J. S.; Press, S. J.; Zellner, A. (eds.). Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488. ISBN 0-444-88376-2.
  5. ^ Daxberger, Erik; et al. (2021). "Laplace Redux - Effortless Bayesian Deep Learning". Advances in Neural Information Processing Systems. 34.
  6. ^ MacKay, David J. C. (1992). "Bayesian Interpolation" (PDF). Neural Computation. 4 (3). MIT Press: 415–447. doi:10.1162/neco.1992.4.3.415. S2CID 1762283.
  7. ^ Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian Processes" (PDF). PAMI. 20 (12). IEEE: 1342–1351. doi:10.1109/34.735807.