Zipf–Mandelbrot law

Discrete probability distribution
Zipf–Mandelbrot
Parameters: $N \in \{1, 2, 3, \ldots\}$ (integer); $q \in [0, \infty)$ (real); $s > 0$ (real)
Support: $k \in \{1, 2, \ldots, N\}$
PMF: $\frac{1/(k+q)^{s}}{H_{N,q,s}}$
CDF: $\frac{H_{k,q,s}}{H_{N,q,s}}$
Mean: $\frac{H_{N,q,s-1}}{H_{N,q,s}} - q$
Mode: $1$
Entropy: $\frac{s}{H_{N,q,s}}\sum_{k=1}^{N}\frac{\ln(k+q)}{(k+q)^{s}} + \ln(H_{N,q,s})$

In probability theory and statistics, the Zipf–Mandelbrot law is a discrete probability distribution. Also known as the Pareto–Zipf law, it is a power-law distribution on ranked data, named after the linguist George Kingsley Zipf, who suggested a simpler distribution called Zipf's law, and the mathematician Benoît Mandelbrot, who subsequently generalized it.

The probability mass function is given by:

$$f(k; N, q, s) = \frac{1/(k+q)^{s}}{H_{N,q,s}}$$

where $H_{N,q,s}$ is given by:

$$H_{N,q,s} = \sum_{i=1}^{N} \frac{1}{(i+q)^{s}}$$

which may be thought of as a generalization of a harmonic number. In the formula, $k$ is the rank of the data, and $q$ and $s$ are parameters of the distribution. In the limit as $N$ approaches infinity, this becomes the Hurwitz zeta function $\zeta(s, q)$. For finite $N$ and $q = 0$ the Zipf–Mandelbrot law becomes Zipf's law. For infinite $N$ and $q = 0$ it becomes a zeta distribution.
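The PMF, CDF, and mean given above can be computed directly from the generalized harmonic number. The following is a minimal Python sketch (not part of the original article; the function names are illustrative choices):

```python
def H(N, q, s):
    """Generalized harmonic number H_{N,q,s} = sum_{i=1}^{N} 1/(i+q)^s."""
    return sum(1.0 / (i + q) ** s for i in range(1, N + 1))

def pmf(k, N, q, s):
    """Zipf-Mandelbrot probability of rank k: (1/(k+q)^s) / H_{N,q,s}."""
    return (1.0 / (k + q) ** s) / H(N, q, s)

def cdf(k, N, q, s):
    """P(rank <= k) = H_{k,q,s} / H_{N,q,s}."""
    return H(k, q, s) / H(N, q, s)

def mean(N, q, s):
    """Closed form of the mean: H_{N,q,s-1} / H_{N,q,s} - q."""
    return H(N, q, s - 1) / H(N, q, s) - q
```

Setting q = 0 recovers Zipf's law, and the closed-form mean agrees with the direct expectation, since k/(k+q)^s = (k+q)/(k+q)^s − q/(k+q)^s.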

Applications

The distribution of words ranked by their frequency in a random text corpus is approximated by a power-law distribution, known as Zipf's law.

If one plots the frequency rank of words contained in a moderately sized corpus of text data versus the number of occurrences or actual frequencies, one obtains a power-law distribution with exponent close to one (but see Powers, 1998 and Gelbukh & Sidorov, 2001). Zipf's law implicitly assumes a fixed vocabulary size: the harmonic series with s = 1 does not converge, while the Zipf–Mandelbrot generalization with s > 1 does, so the latter can be normalized even over an unbounded vocabulary. Furthermore, there is evidence that the closed class of functional words that define a language obeys a Zipf–Mandelbrot distribution with parameters different from those of the open classes of contentive words, which vary by topic, field and register.[1]
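The convergence claim can be checked numerically. A small sketch (my illustration, not from the article) compares partial sums of the series at s = 1 and s > 1:

```python
def partial_sum(s, N):
    """Partial sum sum_{k=1}^{N} 1/k^s of the (generalized) harmonic series."""
    return sum(1.0 / k ** s for k in range(1, N + 1))

# s = 1: the partial sums grow like ln(N) without bound, so a Zipf law
# with s = 1 cannot be normalized over an unbounded vocabulary.
# s = 2: the partial sums approach zeta(2) = pi^2/6 ~ 1.6449, so
# normalization remains possible as N goes to infinity.
```

For example, partial_sum(1, 10**6) is already above 14 and still growing, while partial_sum(2, N) stabilizes near 1.6449 for large N.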

In ecological field studies, the relative abundance distribution (i.e. the graph of the number of species observed as a function of their abundance) is often found to conform to a Zipf–Mandelbrot law.[2]

Within music, many metrics of measuring "pleasing" music conform to Zipf–Mandelbrot distributions.[3]

Notes

  1. ^ Powers, David M W (1998). "Applications and explanations of Zipf's law". New methods in language processing and computational natural language learning. Joint conference on new methods in language processing and computational natural language learning. Association for Computational Linguistics. pp. 151–160.
  2. ^ Mouillot, D; Lepretre, A (2000). "Introduction of relative abundance distribution (RAD) indices, estimated from the rank-frequency diagrams (RFD), to assess changes in community diversity". Environmental Monitoring and Assessment. 63 (2). Springer: 279–295. doi:10.1023/A:1006297211561. S2CID 102285701. Retrieved 24 Dec 2008.
  3. ^ Manaris, B; Vaughan, D; Wagner, CS; Romero, J; Davis, RB. "Evolutionary Music and the Zipf–Mandelbrot Law: Developing Fitness Functions for Pleasant Music". Proceedings of 1st European Workshop on Evolutionary Music and Art (EvoMUSART2003). 611.

References

  • Mandelbrot, Benoît (1965). "Information Theory and Psycholinguistics". In B.B. Wolman and E. Nagel (ed.). Scientific psychology. Basic Books. Reprinted as
    • Mandelbrot, Benoît (1968) [1965]. "Information Theory and Psycholinguistics". In R.C. Oldfield and J.C. Marshall (ed.). Language. Penguin Books.
  • Powers, David M W (1998). "Applications and explanations of Zipf's law". New methods in language processing and computational natural language learning. Joint conference on new methods in language processing and computational natural language learning. Association for Computational Linguistics. pp. 151–160.
  • Zipf, George Kingsley (1932). Selected Studies of the Principle of Relative Frequency in Language. Cambridge, MA: Harvard University Press.
  • Van Droogenbroeck, F.J. (2019). "An essential rephrasing of the Zipf–Mandelbrot law to solve authorship attribution applications by Gaussian statistics".

External links

  • Z. K. Silagadze: Citations and the Zipf–Mandelbrot's law
  • NIST: Zipf's law
  • W. Li's References on Zipf's law
  • Gelbukh & Sidorov, 2001: Zipf and Heaps Laws’ Coefficients Depend on Language
  • C++ Library for generating random Zipf–Mandelbrot deviates.