Stein's unbiased risk estimate

In statistics, Stein's unbiased risk estimate (SURE) is an unbiased estimator of the mean-squared error of "a nearly arbitrary, nonlinear biased estimator."[1] In other words, it provides an indication of the accuracy of a given estimator. This is important since the true mean-squared error of an estimator is a function of the unknown parameter to be estimated, and thus cannot be determined exactly.

The technique is named after its discoverer, Charles Stein.[2]

Formal statement

Let $\mu \in \mathbb{R}^{d}$ be an unknown parameter and let $x \in \mathbb{R}^{d}$ be a measurement vector whose components are independent and normally distributed with mean $\mu_{i}$, $i = 1, \ldots, d$, and variance $\sigma^{2}$. Suppose $h(x)$ is an estimator of $\mu$ from $x$, and can be written $h(x) = x + g(x)$, where $g$ is weakly differentiable. Then Stein's unbiased risk estimate is given by[3]

$$\operatorname{SURE}(h) = d\sigma^{2} + \|g(x)\|^{2} + 2\sigma^{2}\sum_{i=1}^{d}\frac{\partial}{\partial x_{i}}g_{i}(x) = -d\sigma^{2} + \|g(x)\|^{2} + 2\sigma^{2}\sum_{i=1}^{d}\frac{\partial}{\partial x_{i}}h_{i}(x),$$

where $g_{i}(x)$ is the $i$th component of the function $g(x)$, and $\|\cdot\|$ is the Euclidean norm.

The importance of SURE is that it is an unbiased estimate of the mean-squared error (or squared error risk) of $h(x)$, i.e.

$$\operatorname{E}_{\mu}\{\operatorname{SURE}(h)\} = \operatorname{MSE}(h),$$

with

$$\operatorname{MSE}(h) = \operatorname{E}_{\mu}\|h(x) - \mu\|^{2}.$$

Thus, minimizing SURE can act as a surrogate for minimizing the MSE. Note that the expression for SURE above has no dependence on the unknown parameter $\mu$; it can therefore be manipulated (e.g., to determine optimal estimation settings) without knowledge of $\mu$.
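For instance, consider the family of linear shrinkage estimators $h(x) = c\,x$ with scalar $c$ (an illustrative special case, not drawn from the cited sources), so that $g(x) = (c-1)x$ and $\partial g_{i}/\partial x_{i} = c-1$ for every component. Substituting into the formula above gives

$$\operatorname{SURE}(h) = d\sigma^{2} + (c-1)^{2}\|x\|^{2} + 2\sigma^{2}d(c-1) = (c-1)^{2}\|x\|^{2} + (2c-1)\,d\sigma^{2},$$

which is minimized at $c = 1 - d\sigma^{2}/\|x\|^{2}$, a James–Stein-type shrinkage rule obtained without any knowledge of $\mu$.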

Proof

We wish to show that

$$\operatorname{E}_{\mu}\|h(x) - \mu\|^{2} = \operatorname{E}_{\mu}\{\operatorname{SURE}(h)\}.$$

We start by expanding the MSE as

$$\begin{aligned}
\operatorname{E}_{\mu}\|h(x)-\mu\|^{2} &= \operatorname{E}_{\mu}\|g(x)+x-\mu\|^{2} \\
&= \operatorname{E}_{\mu}\|g(x)\|^{2} + \operatorname{E}_{\mu}\|x-\mu\|^{2} + 2\operatorname{E}_{\mu}\,g(x)^{T}(x-\mu) \\
&= \operatorname{E}_{\mu}\|g(x)\|^{2} + d\sigma^{2} + 2\operatorname{E}_{\mu}\,g(x)^{T}(x-\mu).
\end{aligned}$$

Now we use integration by parts to rewrite the last term, noting that $(x_{i}-\mu_{i})$ times the Gaussian density equals $-\sigma^{2}$ times its partial derivative with respect to $x_{i}$, and that the boundary terms vanish:

$$\begin{aligned}
\operatorname{E}_{\mu}\,g(x)^{T}(x-\mu) &= \int_{\mathbb{R}^{d}} \frac{1}{(2\pi\sigma^{2})^{d/2}}\exp\left(-\frac{\|x-\mu\|^{2}}{2\sigma^{2}}\right)\sum_{i=1}^{d}g_{i}(x)(x_{i}-\mu_{i})\,d^{d}x \\
&= \sigma^{2}\sum_{i=1}^{d}\int_{\mathbb{R}^{d}} \frac{1}{(2\pi\sigma^{2})^{d/2}}\exp\left(-\frac{\|x-\mu\|^{2}}{2\sigma^{2}}\right)\frac{\partial g_{i}}{\partial x_{i}}\,d^{d}x \\
&= \sigma^{2}\sum_{i=1}^{d}\operatorname{E}_{\mu}\frac{\partial g_{i}}{\partial x_{i}}.
\end{aligned}$$

Substituting this into the expression for the MSE, we arrive at

$$\operatorname{E}_{\mu}\|h(x)-\mu\|^{2} = \operatorname{E}_{\mu}\left(d\sigma^{2} + \|g(x)\|^{2} + 2\sigma^{2}\sum_{i=1}^{d}\frac{\partial g_{i}}{\partial x_{i}}\right).$$

The quantity inside the expectation is exactly $\operatorname{SURE}(h)$, which completes the proof.
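This identity can also be checked numerically. The following Python sketch (with assumed values for the dimension, noise level, and shrinkage coefficient) compares the average of SURE against the empirical MSE for the linear shrinkage estimator $h(x) = c\,x$ discussed above.

```python
import numpy as np

# Numerical check (illustrative, with assumed values) that the average of
# SURE matches the empirical MSE for the linear shrinkage estimator h(x) = c*x.
rng = np.random.default_rng(0)
d, sigma, c, n_trials = 50, 1.0, 0.7, 100_000

mu = rng.normal(size=d)                        # "unknown" mean, used only to simulate data
x = mu + sigma * rng.standard_normal((n_trials, d))
g = (c - 1.0) * x                              # h(x) = x + g(x) = c*x

# SURE(h) = d*sigma^2 + ||g(x)||^2 + 2*sigma^2 * sum_i dg_i/dx_i,
# where dg_i/dx_i = c - 1 for every component of this estimator.
sure = d * sigma**2 + (g**2).sum(axis=1) + 2 * sigma**2 * d * (c - 1.0)
mse = ((c * x - mu)**2).sum(axis=1)            # squared error per trial

print("mean SURE    :", sure.mean())           # both should be close to MSE(h)
print("empirical MSE:", mse.mean())
```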

Applications

A standard application of SURE is to choose a parametric form for an estimator, and then optimize the values of the parameters to minimize the risk estimate. This technique has been applied in several settings. For example, a variant of the James–Stein estimator can be derived by finding the optimal shrinkage estimator.[2] The technique has also been used by Donoho and Johnstone to determine the optimal shrinkage factor in a wavelet denoising setting.[1]
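The following Python sketch illustrates this idea for soft thresholding, the estimator used in SureShrink: $h_{i}(x) = \operatorname{sign}(x_{i})\max(|x_{i}| - \lambda, 0)$, for which Stein's formula reduces to $\operatorname{SURE}(\lambda) = d\sigma^{2} - 2\sigma^{2}\,\#\{i : |x_{i}| \le \lambda\} + \sum_{i}\min(|x_{i}|, \lambda)^{2}$. The signal, noise level, and threshold grid below are illustrative assumptions, and a full wavelet denoiser would apply the selection to wavelet coefficients rather than raw observations.

```python
import numpy as np

def sure_soft_threshold(x, lam, sigma=1.0):
    """SURE for soft thresholding h_i(x) = sign(x_i) * max(|x_i| - lam, 0).

    With g(x) = h(x) - x, Stein's formula reduces to
    d*sigma^2 - 2*sigma^2 * #{i : |x_i| <= lam} + sum_i min(|x_i|, lam)^2.
    """
    d = x.size
    clipped = np.minimum(np.abs(x), lam)
    return (d * sigma**2
            - 2 * sigma**2 * np.count_nonzero(np.abs(x) <= lam)
            + np.sum(clipped**2))

# Illustrative usage (signal and grid are assumptions): pick the threshold minimizing SURE.
rng = np.random.default_rng(0)
mu = np.concatenate([rng.normal(3.0, 1.0, size=20), np.zeros(480)])  # sparse "true" signal
x = mu + rng.standard_normal(mu.size)                                # noisy observation, sigma = 1

candidate_lams = np.linspace(0.0, 4.0, 201)
risks = [sure_soft_threshold(x, lam) for lam in candidate_lams]
best_lam = candidate_lams[int(np.argmin(risks))]
print("SURE-selected threshold:", best_lam)
```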

References

  1. ^ a b Donoho, David L.; Johnstone, Iain M. (December 1995). "Adapting to Unknown Smoothness via Wavelet Shrinkage". Journal of the American Statistical Association. 90 (432): 1200–1244. CiteSeerX 10.1.1.161.8697. doi:10.2307/2291512. JSTOR 2291512.
  2. ^ a b Stein, Charles M. (November 1981). "Estimation of the Mean of a Multivariate Normal Distribution". The Annals of Statistics. 9 (6): 1135–1151. doi:10.1214/aos/1176345632. JSTOR 2240405.
  3. ^ Wasserman, Larry (2005). All of Nonparametric Statistics.