Double descent

Figure: An example of the double descent phenomenon in a two-layer neural network. As the ratio of parameters to data points increases, the test error first falls, then rises, then falls again.[1] The vertical line marks the boundary between the underparameterized regime (more data points than parameters) and the overparameterized regime (more parameters than data points).

In statistics and machine learning, double descent is the phenomenon in which a model with a small number of parameters and a model with an extremely large number of parameters both achieve a small test error, whereas a model whose number of parameters is about the same as the number of data points used to train it has a much larger test error.[2]

History

Early observations of double descent in specific models date back to 1989,[3][4] while double descent as a broader concept shared by many models gained popularity around 2019.[5][6] The latter development was prompted by a perceived contradiction between the conventional wisdom that too many parameters in a model result in a significant error (an extrapolation of the bias-variance tradeoff),[7] and empirical observations in the 2010s that some modern machine learning techniques tend to perform better with larger models.[5][8]

Theoretical models

Double descent has been shown to occur in linear regression with isotropic Gaussian covariates and isotropic Gaussian noise.[9]
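
The linear-regression setting can be illustrated with a short simulation. The following is a minimal sketch rather than the cited paper's procedure. It assumes a minimum-norm least-squares estimator computed with the pseudoinverse, and the function name min_norm_test_error, the dimension of 50, the noise level of 0.5, and the sample sizes are all illustrative choices not taken from the source. The number of parameters is held fixed while the number of training points varies, so the error peak is expected where the two are roughly equal.

```python
import numpy as np


def min_norm_test_error(n_train, dim=50, noise_std=0.5, n_trials=50, seed=0):
    """Average test MSE of minimum-norm least squares (illustrative helper).

    Covariates are isotropic Gaussian and the label noise is isotropic
    Gaussian. All default values are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        # Ground-truth weight vector, normalized to unit length.
        beta = rng.standard_normal(dim)
        beta /= np.linalg.norm(beta)

        # Training data: y = X beta + Gaussian noise.
        X = rng.standard_normal((n_train, dim))
        y = X @ beta + noise_std * rng.standard_normal(n_train)

        # The pseudoinverse gives the ordinary least-squares solution when
        # n_train >= dim and the minimum-norm interpolant when n_train < dim.
        beta_hat = np.linalg.pinv(X) @ y

        # For isotropic test covariates the expected squared prediction
        # error is ||beta_hat - beta||^2 plus the irreducible noise variance.
        errors.append(np.sum((beta_hat - beta) ** 2) + noise_std ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    # Sweep the number of training points past the number of parameters.
    for n in (10, 20, 30, 40, 50, 60, 100, 250):
        print(f"n_train={n:4d}  test_mse={min_norm_test_error(n):.3f}")
```

Under these assumptions the printed error should be largest near n_train = 50, where the number of training points matches the number of parameters, and smaller on both sides of that point.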

A model of double descent in the thermodynamic limit has been analyzed using the replica method, and the result has been confirmed numerically.[10]

Empirical examples

The scaling behavior of double descent has been found to follow the functional form of a broken neural scaling law.[11]
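
For orientation, the broken neural scaling law is usually written as a smoothly broken power law. The form below is a sketch of that expression as it is commonly stated, not something given in this article. Here y is the performance metric (for example test error), x is the quantity being scaled (such as the number of parameters or training points), n is the number of breaks, and a, b, c_0, c_i, d_i, and f_i are fitted constants. The symbol names are assumptions to be checked against the cited paper.

```latex
y = a + \left( b \, x^{-c_0} \right) \prod_{i=1}^{n} \left( 1 + \left( \frac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i}
```

Each factor in the product changes the local power-law slope around x = d_i, with f_i controlling how sharp the transition is. A non-monotonic curve such as double descent corresponds to at least one break at which the slope changes sign.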

References

  1. ^ Schaeffer, Rylan; Khona, Mikail; Robertson, Zachary; Boopathy, Akhilan; Pistunova, Kateryna; Rocks, Jason W.; Fiete, Ila Rani; Koyejo, Oluwasanmi (2023-03-24). "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle". arXiv:2303.14151v1 [cs.LG].
  2. ^ "Deep Double Descent". OpenAI. 2019-12-05. Retrieved 2022-08-12.
  3. ^ Vallet, F.; Cailton, J.-G.; Refregier, Ph (June 1989). "Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions". Europhysics Letters. 9 (4): 315. doi:10.1209/0295-5075/9/4/003. ISSN 0295-5075.
  4. ^ Loog, Marco; Viering, Tom; Mey, Alexander; Krijthe, Jesse H.; Tax, David M. J. (2020-05-19). "A brief prehistory of double descent". Proceedings of the National Academy of Sciences. 117 (20): 10625–10626. doi:10.1073/pnas.2001875117. ISSN 0027-8424. PMC 7245109. PMID 32371495.
  5. ^ a b Belkin, Mikhail; Hsu, Daniel; Ma, Siyuan; Mandal, Soumik (2019-08-06). "Reconciling modern machine learning practice and the bias-variance trade-off". Proceedings of the National Academy of Sciences. 116 (32): 15849–15854. arXiv:1812.11118. doi:10.1073/pnas.1903070116. ISSN 0027-8424. PMC 6689936. PMID 31341078.
  6. ^ Viering, Tom; Loog, Marco (2023-06-01). "The Shape of Learning Curves: A Review". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (6): 7799–7819. arXiv:2103.10948. doi:10.1109/TPAMI.2022.3220744. ISSN 0162-8828.
  7. ^ Eric (2023-01-10). "The bias-variance tradeoff is not a statistical concept". Eric J. Wang. Retrieved 2024-01-05.
  8. ^ Preetum Nakkiran; Gal Kaplun; Yamini Bansal; Tristan Yang; Boaz Barak; Ilya Sutskever (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment. 2021 (12). IOP Publishing Ltd and SISSA Medialab srl: 124003. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N. doi:10.1088/1742-5468/ac3a74. S2CID 207808916.
  9. ^ Nakkiran, Preetum (2019-12-16). "More Data Can Hurt for Linear Regression: Sample-wise Double Descent". arXiv.org. Retrieved 2024-04-18.
  10. ^ Advani, Madhu S.; Saxe, Andrew M.; Sompolinsky, Haim (2020-12-01). "High-dimensional dynamics of generalization error in neural networks". Neural Networks. 132: 428–446. doi:10.1016/j.neunet.2020.08.022. ISSN 0893-6080. PMC 7685244.
  11. ^ Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.

Further reading

  • Mikhail Belkin; Daniel Hsu; Ji Xu (2020). "Two Models of Double Descent for Weak Features". SIAM Journal on Mathematics of Data Science. 2 (4): 1167–1180. arXiv:1903.07571. doi:10.1137/20M1336072.
  • Mount, John (3 April 2024). "The m = n Machine Learning Anomaly".
  • Preetum Nakkiran; Gal Kaplun; Yamini Bansal; Tristan Yang; Boaz Barak; Ilya Sutskever (29 December 2021). "Deep double descent: where bigger models and more data hurt". Journal of Statistical Mechanics: Theory and Experiment. 2021 (12). IOP Publishing Ltd and SISSA Medialab srl: 124003. arXiv:1912.02292. Bibcode:2021JSMTE2021l4003N. doi:10.1088/1742-5468/ac3a74. S2CID 207808916.
  • Song Mei; Andrea Montanari (April 2022). "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve". Communications on Pure and Applied Mathematics. 75 (4): 667–766. arXiv:1908.05355. doi:10.1002/cpa.22008. S2CID 199668852.
  • Xiangyu Chang; Yingcong Li; Samet Oymak; Christos Thrampoulidis (2021). "Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks". Proceedings of the AAAI Conference on Artificial Intelligence. 35 (8). arXiv:2012.08749.

External links

  • Brent Werness; Jared Wilber. "Double Descent: Part 1: A Visual Introduction".
  • Brent Werness; Jared Wilber. "Double Descent: Part 2: A Mathematical Explanation".
  • Understanding "Deep Double Descent" at evhub.