Kullback's inequality

In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If P and Q are probability distributions on the real line, such that P is absolutely continuous with respect to Q, i.e. P << Q, and whose first moments exist, then

$$D_{KL}(P\parallel Q) \geq \Psi_Q^*(\mu'_1(P)),$$
where $\Psi_Q^*$ is the rate function, i.e. the convex conjugate of the cumulant-generating function, of $Q$, and $\mu'_1(P)$ is the first moment of $P$.
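Explicitly, the convex conjugate (Legendre–Fenchel transform) appearing here is
$$\Psi_Q^*(x) = \sup_{\theta}\left\{\theta x - \Psi_Q(\theta)\right\},\qquad \Psi_Q(\theta) = \log M_Q(\theta) = \log\int_{-\infty}^{\infty} e^{\theta x}\,Q(dx).$$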

The Cramér–Rao bound is a corollary of this result.
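As a simple illustration, take normal distributions with a common variance, $Q = N(\nu,\sigma^2)$ and $P = N(\mu,\sigma^2)$. Then $\Psi_Q(\theta) = \nu\theta + \tfrac{1}{2}\sigma^2\theta^2$, so
$$\Psi_Q^*(\mu) = \sup_\theta\left\{(\mu-\nu)\theta - \tfrac{1}{2}\sigma^2\theta^2\right\} = \frac{(\mu-\nu)^2}{2\sigma^2} = D_{KL}(P\parallel Q),$$
and in this case the bound holds with equality.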

Proof

Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P << Q. Consider the natural exponential family of Q given by

$$Q_\theta(A) = \frac{\int_A e^{\theta x}\,Q(dx)}{\int_{-\infty}^{\infty} e^{\theta x}\,Q(dx)} = \frac{1}{M_Q(\theta)}\int_A e^{\theta x}\,Q(dx)$$
for every measurable set A, where $M_Q$ is the moment-generating function of Q. (Note that $Q_0 = Q$.) Then
$$D_{KL}(P\parallel Q) = D_{KL}(P\parallel Q_\theta) + \int_{\operatorname{supp} P}\left(\log\frac{\mathrm{d}Q_\theta}{\mathrm{d}Q}\right)\mathrm{d}P.$$
By Gibbs' inequality we have $D_{KL}(P\parallel Q_\theta) \geq 0$, so that
$$D_{KL}(P\parallel Q) \geq \int_{\operatorname{supp} P}\left(\log\frac{\mathrm{d}Q_\theta}{\mathrm{d}Q}\right)\mathrm{d}P = \int_{\operatorname{supp} P}\left(\log\frac{e^{\theta x}}{M_Q(\theta)}\right)P(dx).$$
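This decomposition is the chain rule for Radon–Nikodym derivatives: since $e^{\theta x} > 0$, the measures $Q_\theta$ and $Q$ are mutually absolutely continuous whenever $M_Q(\theta) < \infty$, so on the support of P
$$\log\frac{\mathrm{d}P}{\mathrm{d}Q} = \log\frac{\mathrm{d}P}{\mathrm{d}Q_\theta} + \log\frac{\mathrm{d}Q_\theta}{\mathrm{d}Q},$$
and integrating both sides with respect to P gives the identity above.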
Simplifying the right side, we have, for every real θ where $M_Q(\theta) < \infty$:
$$D_{KL}(P\parallel Q) \geq \mu'_1(P)\,\theta - \Psi_Q(\theta),$$
where $\mu'_1(P)$ is the first moment, or mean, of P, and $\Psi_Q = \log M_Q$ is called the cumulant-generating function. Taking the supremum completes the process of convex conjugation and yields the rate function:
$$D_{KL}(P\parallel Q) \geq \sup_\theta\left\{\mu'_1(P)\,\theta - \Psi_Q(\theta)\right\} = \Psi_Q^*(\mu'_1(P)).$$
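If the supremum is attained at an interior point $\theta^*$ of the domain of $\Psi_Q$, it satisfies the first-order condition
$$\Psi'_Q(\theta^*) = \mu'_1(P);$$
since $\Psi'_Q(\theta)$ is the mean of the tilted distribution $Q_\theta$, the optimal tilt is the member of the exponential family whose mean matches that of P. The same first-order condition reappears in the corollary below.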

Corollary: the Cramér–Rao bound

Start with Kullback's inequality. Let $X_\theta$ be a family of probability distributions on the real line indexed by the real parameter θ and satisfying certain regularity conditions. Then

$$\lim_{h\to 0}\frac{D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2} \geq \lim_{h\to 0}\frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2},$$

where $\Psi_\theta^*$ is the convex conjugate of the cumulant-generating function of $X_\theta$ and $\mu_{\theta+h}$ is the first moment of $X_{\theta+h}$.

Left side

The left side of this inequality can be simplified as follows:

$$\begin{aligned}
\lim_{h\to 0}\frac{D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2}
&= \lim_{h\to 0}\frac{1}{h^2}\int_{-\infty}^{\infty}\log\left(\frac{\mathrm{d}X_{\theta+h}}{\mathrm{d}X_\theta}\right)\mathrm{d}X_{\theta+h}\\
&= -\lim_{h\to 0}\frac{1}{h^2}\int_{-\infty}^{\infty}\log\left(\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)\mathrm{d}X_{\theta+h}\\
&= -\lim_{h\to 0}\frac{1}{h^2}\int_{-\infty}^{\infty}\log\left(1-\left(1-\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)\right)\mathrm{d}X_{\theta+h}\\
&= \lim_{h\to 0}\frac{1}{h^2}\int_{-\infty}^{\infty}\left[\left(1-\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)+\frac{1}{2}\left(1-\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)^2+o\left(\left(1-\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)^2\right)\right]\mathrm{d}X_{\theta+h} && \text{Taylor series for }\log(1-t)\\
&= \lim_{h\to 0}\frac{1}{h^2}\int_{-\infty}^{\infty}\left[\frac{1}{2}\left(1-\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)^2\right]\mathrm{d}X_{\theta+h}\\
&= \lim_{h\to 0}\frac{1}{h^2}\int_{-\infty}^{\infty}\left[\frac{1}{2}\left(\frac{\mathrm{d}X_{\theta+h}-\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)^2\right]\mathrm{d}X_{\theta+h}\\
&= \frac{1}{2}\mathcal{I}_X(\theta),
\end{aligned}$$
which is half the Fisher information of the parameter θ.
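The first-order term of the Taylor expansion contributes nothing because, assuming (as part of the regularity conditions) that $X_\theta \ll X_{\theta+h}$, both measures are probability measures:
$$\int_{-\infty}^{\infty}\left(1-\frac{\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}\right)\mathrm{d}X_{\theta+h} = \int_{-\infty}^{\infty}\mathrm{d}X_{\theta+h} - \int_{-\infty}^{\infty}\mathrm{d}X_\theta = 1 - 1 = 0.$$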

Right side

The right side of the inequality can be developed as follows:

$$\lim_{h\to 0}\frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2} = \lim_{h\to 0}\frac{1}{h^2}\sup_t\left\{\mu_{\theta+h}\,t - \Psi_\theta(t)\right\}.$$
This supremum is attained at a value of t = τ where the first derivative of the cumulant-generating function satisfies $\Psi'_\theta(\tau) = \mu_{\theta+h}$, but we have $\Psi'_\theta(0) = \mu_\theta$, so that
$$\Psi''_\theta(0) = \frac{d\mu_\theta}{d\theta}\lim_{h\to 0}\frac{h}{\tau}.$$
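This relation follows from expanding both sides of the first-order condition to first order in h and τ:
$$\mu_{\theta+h} = \Psi'_\theta(\tau) \approx \mu_\theta + \tau\,\Psi''_\theta(0),\qquad \mu_{\theta+h} \approx \mu_\theta + h\,\frac{d\mu_\theta}{d\theta},$$
so that $\tau\,\Psi''_\theta(0) \approx h\,\frac{d\mu_\theta}{d\theta}$, and dividing by τ and letting h → 0 gives the displayed equation.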
Moreover,
$$\lim_{h\to 0}\frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2} = \frac{1}{2\Psi''_\theta(0)}\left(\frac{d\mu_\theta}{d\theta}\right)^2 = \frac{1}{2\operatorname{Var}(X_\theta)}\left(\frac{d\mu_\theta}{d\theta}\right)^2.$$
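The first equality follows from a second-order expansion of $\Psi_\theta$ around 0, using $\Psi_\theta(0) = 0$ and $\Psi'_\theta(0) = \mu_\theta$:
$$\Psi_\theta^*(\mu_{\theta+h}) = \mu_{\theta+h}\tau - \Psi_\theta(\tau) \approx \left(\mu_\theta + h\,\frac{d\mu_\theta}{d\theta}\right)\tau - \mu_\theta\tau - \frac{1}{2}\Psi''_\theta(0)\tau^2 = h\,\frac{d\mu_\theta}{d\theta}\,\tau - \frac{1}{2}\Psi''_\theta(0)\tau^2;$$
dividing by $h^2$ and using the limit of $h/\tau$ found above gives $\frac{1}{2\Psi''_\theta(0)}\left(\frac{d\mu_\theta}{d\theta}\right)^2$. The second equality uses the fact that the second derivative of the cumulant-generating function at 0 is the variance, $\Psi''_\theta(0) = \operatorname{Var}(X_\theta)$.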

Putting both sides back together

We have:

$$\frac{1}{2}\mathcal{I}_X(\theta) \geq \frac{1}{2\operatorname{Var}(X_\theta)}\left(\frac{d\mu_\theta}{d\theta}\right)^2,$$
which can be rearranged as:
$$\operatorname{Var}(X_\theta) \geq \frac{(d\mu_\theta/d\theta)^2}{\mathcal{I}_X(\theta)}.$$
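As a quick check, for the location family $X_\theta = N(\theta,\sigma^2)$ one has $\mu_\theta = \theta$, $\operatorname{Var}(X_\theta) = \sigma^2$ and $\mathcal{I}_X(\theta) = 1/\sigma^2$, so
$$\operatorname{Var}(X_\theta) = \sigma^2 = \frac{(d\mu_\theta/d\theta)^2}{\mathcal{I}_X(\theta)},$$
and the bound is attained.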

Notes and references

  1. Fuchs, Aimé; Letta, Giorgio (1970). "L'inégalité de Kullback. Application à la théorie de l'estimation". Séminaire de Probabilités de Strasbourg. 4: 108–131.