
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

Published online by Cambridge University Press: 04 August 2021

Mikhail Belkin*
Affiliation:
Halıcıoğlu Data Science Institute, University of California San Diego, 10100 Hopkins Drive, La Jolla, CA 92093, USA. E-mail: mbelkin@ucsd.edu

Abstract

In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation and its sibling over-parametrization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parametrization enables interpolation and provides flexibility to select a suitable interpolating model.

As we will see, just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning. This article is written in the belief and hope that clearer understanding of these issues will bring us a step closer towards a general theory of deep learning and machine learning.
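To make the notion of interpolation in the abstract concrete, below is a minimal illustrative sketch, not taken from the article: it uses numpy and a random-ReLU-feature linear model (both choices of this sketch, not the author's) to show how an over-parametrized model can fit noisy training data exactly via the minimum-norm least-squares solution while still being evaluable on new inputs.

    # Minimal sketch (assumptions: numpy, random ReLU features, minimum-norm
    # least squares). Illustrates interpolation under over-parametrization;
    # it is not the article's method.
    import numpy as np

    rng = np.random.default_rng(0)

    # Noisy training data from a simple underlying function.
    n_train = 20
    x = np.linspace(-1.0, 1.0, n_train)
    y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(n_train)

    # Over-parametrization: many more random ReLU features than data points.
    n_features = 500
    w = rng.standard_normal(n_features)
    b = rng.uniform(-1.0, 1.0, n_features)

    def features(t):
        # phi(t)_j = max(0, w_j * t + b_j)
        return np.maximum(0.0, np.outer(t, w) + b)

    Phi = features(x)  # shape (n_train, n_features)

    # Minimum-norm interpolating solution: among all coefficient vectors that
    # fit the training data exactly, pick the one with the smallest norm.
    coef = np.linalg.pinv(Phi) @ y

    train_pred = Phi @ coef
    print("max training error:", np.max(np.abs(train_pred - y)))  # ~0: exact fit of noisy data

    # The interpolating model is still a well-defined predictor on new points.
    x_test = np.linspace(-1.0, 1.0, 200)
    test_pred = features(x_test) @ coef

In this sketch, over-parametrization (500 features for 20 points) is what guarantees that an exact fit exists, and the pseudo-inverse selects one particular interpolant, the minimum-norm one, among the many available.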

Type
Research Article
Copyright
© The Author(s), 2021. Published by Cambridge University Press


Footnotes

In memory of Partha Niyogi, a thinker, a teacher, and a dear friend.
