For the MIT reading group, click here!

Artificial neural networks have driven some impressive advances in AI capabilities over the last decade, particularly on perceptual and control tasks. But despite this empirical success, we currently lack good explanatory theories for many observed properties of deep neural networks, such as why they generalize well and why they scale as they do. Doing deep learning today is like trying to build steam engines without a theory of thermodynamics: progress comes more from trial and error, guided by loose heuristics, than from first principles.

What is needed is a "Science of Deep Learning": good, predictive, unifying explanations for when and why deep learning works and what its weaknesses are. Ideally, such a theory should be able to convince someone from 1980 that deep learning is a good idea.

I imagine that a mature Science of Deep Learning will draw ideas from traditional ML theory as well as from information theory and statistical physics, and will probably include some entirely new ideas too. This page compiles papers that I think will be most relevant to a mature understanding of deep learning. If you have suggestions, email me at ericjm@mit.edu. I may eventually replace this site with a publicly-editable wiki or Roam graph.

- Opening the Black Box of Deep Neural Networks via Information
- Examining the Causal Structures of Deep Neural Networks Using Information Theory
- Deep Information Propagation

- How AI Training Scales
- Deep Learning Scaling is Predictable, Empirically
- An Empirical Model of Large-Batch Training
- Scaling Laws for Neural Language Models
- Scaling Laws for Autoregressive Generative Modeling
- Scaling Laws for Transfer
- Explaining Neural Scaling Laws
- Learning Curve Theory
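The scaling-law papers above revolve around a strikingly simple empirical regularity: test loss falls off as a power law in model size, dataset size, or compute, e.g. L(N) ≈ (N_c/N)^α in Kaplan et al.'s "Scaling Laws for Neural Language Models". As a minimal sketch of what fitting such a law looks like (all numbers below are synthetic and purely illustrative, loosely echoing the constants reported in that paper):

```python
import numpy as np

# Synthetic (model size, test loss) measurements following a power law
# L(N) = (N_c / N)**alpha. The constants are illustrative only.
alpha_true, N_c = 0.076, 8.8e13
N = np.logspace(6, 9, 10)                 # model sizes: 1M to 1B parameters
loss = (N_c / N) ** alpha_true
loss *= np.exp(np.random.default_rng(0).normal(0.0, 0.01, N.size))  # noise

# A power law is a straight line in log-log space:
# log L = alpha * (log N_c - log N), so regress log L on log N.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_fit = -slope
print(f"fitted exponent: {alpha_fit:.3f}")  # should land near alpha_true
```

The log-log regression trick is why these laws are easy to spot empirically: if the points don't fall on a line in log-log coordinates, no power law fits.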

- Understanding deep learning requires rethinking generalization
- Deep learning generalizes because the parameter-function map is biased towards simple functions
- A Closer Look at Memorization in Deep Networks
- The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers
- Uniform convergence may be unable to explain generalization in deep learning

- The Building Blocks of Interpretability
- Exploring Neural Networks with Activation Atlases
- Thread: Circuits
- Multimodal Neurons in Artificial Neural Networks

- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (references the notion of coadaptation and a theory of the "role of sex in evolution")
- Why does deep and cheap learning work so well?

- Neural Networks, Types, and Functional Programming
- Are Deep Neural Networks Dramatically Overfitted?