Ancillary statistic

An ancillary statistic is a measure of a sample whose distribution (that is, whose probability mass function or probability density function) does not depend on the parameters of the model.[1][2][3] An ancillary statistic is a pivotal quantity that is also a statistic. Ancillary statistics can be used to construct prediction intervals. They are also used in connection with Basu's theorem to prove independence between statistics.[4]

This concept was first introduced by Ronald Fisher in the 1920s,[5] but its formal definition was only provided in 1964 by Debabrata Basu.[6][7]

Examples

Suppose X1, ..., Xn are independent and identically distributed, and are normally distributed with unknown expected value μ and known variance 1. Let

\[
  \overline{X}_n = \frac{X_1 + \cdots + X_n}{n}
\]

be the sample mean.

The following statistical measures of dispersion of the sample

  • Range: max(X1, ..., Xn) − min(X1, ..., Xn)
  • Interquartile range: Q3 − Q1
  • Sample variance:
\[
  \hat{\sigma}^2 := \frac{\sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2}{n}
\]

are all ancillary statistics, because their sampling distributions do not change as μ changes. Computationally, this is because the μ terms cancel in the formulas: adding a constant to the distribution (and hence to every sample value) shifts the sample maximum and minimum by the same amount, so their difference is unchanged, and the same holds for the other measures. These measures of dispersion do not depend on location.
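As an illustration of this invariance, the following minimal simulation sketch (in Python, assuming NumPy is available; the sample size, replication count, and function name are illustrative choices, not part of the article) compares Monte Carlo summaries of the range and the sample variance for two different values of μ:

```python
# Minimal simulation sketch: the sampling distributions of the range and the
# sample variance (divided by n, matching the formula above) are unchanged
# when the unknown mean mu changes.  Assumes NumPy; all constants are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000  # sample size and number of Monte Carlo replications

def dispersion_summaries(mu):
    """Draw X_1, ..., X_n ~ N(mu, 1) repeatedly and summarise two dispersion statistics."""
    x = rng.normal(loc=mu, scale=1.0, size=(reps, n))
    sample_range = x.max(axis=1) - x.min(axis=1)
    sample_var = x.var(axis=1)  # ddof=0, i.e. the divide-by-n estimator used above
    return (sample_range.mean(), sample_range.std(),
            sample_var.mean(), sample_var.std())

print(dispersion_summaries(mu=0.0))
print(dispersion_summaries(mu=50.0))  # same summaries up to Monte Carlo error: mu cancels
```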

Conversely, given i.i.d. normal variables with known mean 1 and unknown variance σ2, the sample mean X̄ is not an ancillary statistic for the variance, as the sampling distribution of the sample mean is N(1, σ2/n), which does depend on σ2 – this measure of location (specifically, its standard error) depends on dispersion.[8]
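By contrast, a similar sketch (same hypothetical NumPy setup as above) shows the spread of the sample mean changing with σ, so its distribution cannot be free of the variance:

```python
# With known mean 1 and unknown variance sigma^2, the standard deviation of X-bar
# is sigma/sqrt(n), so the sample mean is not ancillary for sigma^2.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 100_000

def sample_mean_spread(sigma):
    x = rng.normal(loc=1.0, scale=sigma, size=(reps, n))
    return x.mean(axis=1).std()  # approximately sigma / sqrt(n)

print(sample_mean_spread(sigma=1.0))  # roughly 1/sqrt(10) ~= 0.316
print(sample_mean_spread(sigma=3.0))  # roughly 3/sqrt(10) ~= 0.949: depends on sigma
```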

In location-scale families

In a location family of distributions, (X1 − Xn, X2 − Xn, ..., Xn−1 − Xn) is an ancillary statistic.

In a scale family of distributions, (X1/Xn, X2/Xn, ..., Xn−1/Xn) is an ancillary statistic.

In a location-scale family of distributions, ((X1 − Xn)/S, (X2 − Xn)/S, ..., (Xn−1 − Xn)/S), where S2 is the sample variance, is an ancillary statistic.[3][9]
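For the location-scale case, a simulation sketch along the same lines (hypothetical NumPy setup; the normal family is used only as a convenient example of a location-scale family) suggests that the scaled differences have the same distribution for any choice of (μ, σ):

```python
# In a normal location-scale family, ((X_1 - X_n)/S, ..., (X_{n-1} - X_n)/S) has the
# same distribution for every (mu, sigma); whether S^2 divides by n or n-1 does not
# affect ancillarity, since the factor is common to numerator and denominator scalings.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10, 100_000

def scaled_differences(mu, sigma):
    x = rng.normal(loc=mu, scale=sigma, size=(reps, n))
    s = x.std(axis=1, ddof=1)                  # sample standard deviation S
    d = (x[:, :-1] - x[:, [-1]]) / s[:, None]  # (X_i - X_n)/S for i = 1, ..., n-1
    return d[:, 0].mean(), d[:, 0].std()       # summarise the first coordinate

print(scaled_differences(mu=0.0, sigma=1.0))
print(scaled_differences(mu=10.0, sigma=5.0))  # same summaries up to Monte Carlo error
```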

In recovery of information

It turns out that, if T1 is a non-sufficient statistic and T2 is ancillary, one can sometimes recover all the information about the unknown parameter contained in the entire data by reporting T1 while conditioning on the observed value of T2. This is known as conditional inference.[3]

For example, suppose that X1 and X2 are independent and each follows the N(θ, 1) distribution, where θ is unknown. Even though X1 is not sufficient for θ (its Fisher information is 1, whereas the Fisher information of the complete statistic X̄ is 2), by additionally reporting the ancillary statistic X1 − X2, one obtains a joint distribution with Fisher information 2.[3]
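The information accounting behind this example can be sketched explicitly (a short calculation using only standard facts about conditioning normal variables, not taken verbatim from the cited source):

```latex
% X_1, X_2 independent N(theta, 1); write A = X_1 - X_2 for the ancillary statistic.
\[
  I_{X_1}(\theta) = 1, \qquad
  \overline{X} \sim N\!\left(\theta, \tfrac12\right) \;\Rightarrow\; I_{\overline{X}}(\theta) = 2 .
\]
% A ~ N(0, 2) does not involve theta, so I_A(theta) = 0; yet (X_1, A) is a bijective
% function of (X_1, X_2) and therefore carries the full information:
\[
  I_{(X_1,\,A)}(\theta) = I_{(X_1,\,X_2)}(\theta) = 2 .
\]
% Conditionally on A = a, one has X_1 \mid A = a \sim N(\theta + a/2,\, 1/2), so the
% conditional information in X_1 given the ancillary A is 1/(1/2) = 2 for every a.
```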

Ancillary complement

Given a statistic T that is not sufficient, an ancillary complement is a statistic U that is ancillary and such that (T, U) is sufficient.[2] Intuitively, an ancillary complement "adds the missing information" (without duplicating any).

The notion is particularly useful when T is a maximum likelihood estimator, which in general will not be sufficient; one can then ask for an ancillary complement. In this case, Fisher argues that one must condition on an ancillary complement to determine information content: the Fisher information content of T should be assessed not from the marginal distribution of T, but from the conditional distribution of T given U: how much information does T add? This is not always possible, as an ancillary complement need not exist, and if one exists, it need not be unique, nor need a maximum ancillary complement exist.

Example

In baseball, suppose a scout observes a batter in N at-bats. Suppose (unrealistically) that the number N is chosen by some random process that is independent of the batter's ability – say, a coin is tossed after each at-bat and the result determines whether the scout will stay to watch the batter's next at-bat. The eventual data are the number N of at-bats and the number X of hits: the data (X, N) are a sufficient statistic. The observed batting average X/N fails to convey all of the information available in the data because it fails to report the number N of at-bats (e.g., a batting average of 0.400, which is very high, based on only five at-bats does not inspire anywhere near as much confidence in the player's ability as a 0.400 average based on 100 at-bats). The number N of at-bats is an ancillary statistic because

  • It is a part of the observable data (it is a statistic), and
  • Its probability distribution does not depend on the batter's ability, since it was chosen by a random process independent of the batter's ability.

This ancillary statistic is an ancillary complement to the observed batting average X/N, i.e., the batting average X/N is not a sufficient statistic, in that it conveys less than all of the relevant information in the data, but conjoined with N, it becomes sufficient.
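To make the conditioning argument concrete, one can model the hits explicitly (a sketch under the usual binomial assumption, which the example above only describes informally; p denotes the batter's true hitting probability):

```latex
% Given N = n, assume X | N = n ~ Binomial(n, p), with the distribution of N free of p.
\[
  f(x, n \mid p) \;=\; \Pr(N = n)\,\binom{n}{x}\, p^{x} (1 - p)^{\,n - x} .
\]
% The factor Pr(N = n) drops out of the score in p, so
\[
  I_{X \mid N = n}(p) \;=\; \frac{n}{p(1 - p)},
  \qquad
  I_{(X, N)}(p) \;=\; \mathbb{E}\!\left[\frac{N}{p(1 - p)}\right] \;=\; \frac{\mathbb{E}[N]}{p(1 - p)} .
\]
% Conditioning on the ancillary N thus reports the information actually carried by the
% observed batting average: n/(p(1-p)) for the n at-bats the scout happened to see.
```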

Notes

  1. Lehmann, E. L.; Scholz, F. W. (1992). "Ancillarity" (PDF). Institute of Mathematical Statistics Lecture Notes - Monograph Series. 17: 32–51. doi:10.1214/lnms/1215458837. ISBN 0-940600-24-2. ISSN 0749-2170. JSTOR 4355624.
  2. Ghosh, M.; Reid, N.; Fraser, D. A. S. (2010). "Ancillary statistics: A review". Statistica Sinica. 20 (4): 1309–1332. ISSN 1017-0405. JSTOR 24309506.
  3. Mukhopadhyay, Nitis (2000). Probability and Statistical Inference. Marcel Dekker, Inc. pp. 309–318. ISBN 0-8247-0379-0.
  4. Dawid, Philip (2011). "Basu on Ancillarity". In DasGupta, Anirban (ed.). Selected Works of Debabrata Basu. New York, NY: Springer. pp. 5–8. doi:10.1007/978-1-4419-5825-9_2. ISBN 978-1-4419-5825-9.
  5. Fisher, R. A. (1925). "Theory of Statistical Estimation". Mathematical Proceedings of the Cambridge Philosophical Society. 22 (5): 700–725. Bibcode:1925PCPS...22..700F. doi:10.1017/S0305004100009580. hdl:2440/15186. ISSN 0305-0041.
  6. Basu, D. (1964). "Recovery of Ancillary Information". Sankhyā: The Indian Journal of Statistics, Series A. 26 (1): 3–16. ISSN 0581-572X. JSTOR 25049300.
  7. Stigler, Stephen M. (2001). "Ancillary history". Institute of Mathematical Statistics Lecture Notes - Monograph Series. Beachwood, OH: Institute of Mathematical Statistics. pp. 555–567. doi:10.1214/lnms/1215090089. ISBN 978-0-940600-50-8.
  8. Buehler, Robert J. (1982). "Some Ancillary Statistics and Their Properties". Journal of the American Statistical Association. 77 (379): 581–589. doi:10.1080/01621459.1982.10477850. hdl:11299/199392. ISSN 0162-1459.
  9. "Ancillary statistics" (PDF).