In this post, I summarize Ilya Sutskever’s talk titled “An Observation on Generalization.” The talk is about a general theory of unsupervised learning based on algorithmic information theory (AIT). Once I understood it, I realized that it answers a question I’ve had for a long time without knowing how to phrase it properly. It took me quite some time to understand the idea, but it’s really very simple. The notation is kept the same as in the original talk to prevent confusion.
TL;DR:
- We use algorithmic information theory to formalize the objective of unsupervised learning
Occam’s razor and AIT
Occam’s razor states that given some data, the best hypothesis (about how the data was generated) is the simplest hypothesis consistent with the data. But how can we formalize what it means to be simple or complex? We turn to algorithmic information theory (AIT) for the answer. Given a machine that can run programs written in 0s and 1s (i.e. a universal Turing machine), a hypothesis that explains some data is equivalent to a program that outputs the data. The complexity of a hypothesis is then the length of the corresponding program, and the complexity of the data itself, its Kolmogorov complexity K(data), is the length of the shortest such program.
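To make the “length of a program” idea concrete, here is a toy illustration of my own (not from the talk): two Python programs that output exactly the same data, where the hypothesis that found the generating rule is a far shorter program than the one that memorizes the data.

```python
# Two "hypotheses" about the same data, written as programs that output it.
# The rule-based hypothesis is a much shorter program than rote memorization.
data = [i * i for i in range(1000)]  # observed data: 0, 1, 4, 9, 16, ...

program_memorize = f"print({data})"                    # spells out every number
program_rule = "print([i * i for i in range(1000)])"   # states the generating rule

print(len(program_memorize), "characters to memorize the data")
print(len(program_rule), "characters to state the rule")
```

Both programs print the same list; Occam’s razor, read through AIT, says we should prefer the second.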
Let me give you an example. Suppose we have a supervised dataset D = {(x_1, y_1), ..., (x_n, y_n)}, where the x_i are inputs and the y_i are labels. One program that outputs D has two parts: a first part that simply prints the inputs x_1, ..., x_n verbatim, and a second part that computes each label y_i from its input x_i. Let’s think of the complexity of this program: it’s roughly the sum of the length of the inputs plus the length of the code that maps inputs to labels. Since the first part of the program is not relevant to a supervised learning task, we can simply make the inputs an input of the program rather than something it has to produce, and only count the second part. What remains is the conditional Kolmogorov complexity K(Y|X): the length of the shortest program that outputs the labels Y when given the inputs X. Supervised learning, in this view, is the search for a short program that predicts the labels from the inputs, and the shorter the program (for a fixed fit to the data), the better we expect it to generalize.
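One way to write down the decomposition sketched above (the notation ℓ(x_i) for the literal encoded length of an input is mine; the inequality simply says the two-part program is one valid way to output D):

$$
K(D) \;\le\; \underbrace{\sum_{i=1}^{n} \ell(x_i)}_{\text{print the inputs verbatim}} \;+\; \underbrace{K(Y \mid X)}_{\text{compute the labels from the inputs}} \;+\; O(\log n).
$$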
Unsupervised learning
Above, we formalized supervised learning using Kolmogorov complexity. Can we do the same for unsupervised learning? Let’s think of all the components of unsupervised learning. We have a large upstream dataset X that comes without labels, and a downstream dataset Y for the task we actually care about; what we want from unsupervised learning is that learning on X helps with Y. The natural objective is again the conditional Kolmogorov complexity K(Y|X): the length of the shortest program that outputs Y when it is given X as input. Note that K(Y|X) ≤ K(Y) + O(1).

This says that, using two programs, one being the best program that outputs Y on its own and the other a constant-size wrapper that simply ignores X, we can always match the unconditional complexity K(Y). Conditioning on X can therefore never hurt by more than a constant, and the shortest conditional program extracts from X exactly the information that is useful for describing Y, whatever that information turns out to be.

That’s great, we now have formalized unsupervised learning with conditional Kolmogorov complexity: a good unsupervised learner is one whose use of X brings us close to K(Y|X).
Let me give you an example. We have a bunch of unlabelled images X and a small labelled dataset Y for a classification task we care about. A program that outputs the labels in Y is allowed to look at X first, for example by modelling the unlabelled images (say, by predicting pixels); once the regularities of natural images have been captured, the part of the program that is specific to Y, the piece that maps what was learned from X to the labels, can be very short. The better the program exploits X, the closer it gets to K(Y|X). (This is the spirit of iGPT: pretrain by next-pixel prediction, then fit only a small linear probe for the downstream task.)
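Here is a minimal sketch of that intuition in code (my own toy, not from the talk): `PretrainedEncoder` is a stand-in for whatever model was fit to the unlabelled images X, and the only thing the downstream task has to pay for is the small linear probe on top.

```python
# Sketch: the Y-specific part of the "program" is just a linear probe.
# PretrainedEncoder is a placeholder for a model whose weights are already
# accounted for by modelling the unlabelled dataset X.
import torch
import torch.nn as nn

class PretrainedEncoder(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Pretend these weights were learned by, e.g., next-pixel prediction on X.
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, dim), nn.ReLU())

    def forward(self, x):
        return self.backbone(x)

encoder = PretrainedEncoder()
for p in encoder.parameters():
    p.requires_grad = False        # frozen: not part of the description specific to Y

probe = nn.Linear(512, 10)         # the only piece we must specify for the task Y

y_specific = sum(p.numel() for p in probe.parameters())
total = y_specific + sum(p.numel() for p in encoder.parameters())
print(f"parameters specific to Y: {y_specific:,} out of {total:,} in total")
```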
SGD as program search
Note how, in the usual supervised learning setting, we can use SGD on a neural network to find a decent program: one that isn’t too complex and also has low prediction error. You might ask: if two neural networks with zero prediction error are found, how can they have different complexities when they have the same number of weights? In other words, wouldn’t it take the same code length to encode the weights of the two networks? The answer is simple: for every neural network, there are many programs that behave the same, i.e. produce the same outputs for the same inputs. Therefore, some neural networks are more compressible than others. To give an extreme example, imagine a hundred-layer transformer with all weights set to zero. We can simply replace this neural network in our program with a function that outputs zero for every input, taking no more than a few bits of code. We can also think of a normal prior on the weights, or weight decay, as a crude form of bias towards simple programs.
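A quick sanity check of the “some networks are more compressible than others” claim, using zlib as a (very crude) stand-in for shortest-program length:

```python
# Two weight vectors of identical size: one maximally structured (all zeros),
# one with no structure at all (random). zlib only gives an upper bound on
# Kolmogorov complexity, but the gap is already enormous.
import zlib
import numpy as np

rng = np.random.default_rng(0)
zero_weights = np.zeros(1_000_000, dtype=np.float32)
random_weights = rng.standard_normal(1_000_000).astype(np.float32)

print("all-zero weights:", len(zlib.compress(zero_weights.tobytes())), "compressed bytes")
print("random weights:  ", len(zlib.compress(random_weights.tobytes())), "compressed bytes")
```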
Whether SGD on neural networks is a good way to find simple programs is a complicated matter, but SGD does tend to have simplicity biases, such as finding smooth functions, and, most importantly, we have empirically seen its success in generalization, hinting that it is pretty good at searching for simple programs. This might sound like circular reasoning, but it really isn’t, as long as we accept that low-complexity programs generalize better. Soft or hard inductive biases in the neural network architecture can also bias the model towards simplicity.
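The weight-decay remark above can be made a little more precise with the usual MAP/MDL correspondence (standard textbook material, not something derived in the talk): under a Gaussian prior, the description length of the weights grows with their squared norm,

$$
-\log p(w) \;=\; \frac{\lambda}{2}\,\lVert w \rVert^2 + \text{const}
\qquad \text{for } p(w) = \mathcal{N}\!\left(0, \tfrac{1}{\lambda} I\right),
$$

so minimizing training loss plus an L2 penalty amounts to minimizing a crude two-part code: bits for the data given the weights, plus bits for the weights themselves.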
Suppose we accept that SGD is a decent program search engine, and probably the only decent one we’ve got that can operate on massive datasets like ImageNet. Then something odd happens when we try to apply it to the unsupervised learning objective K(Y|X): a neural network trained with SGD cannot condition on an entire dataset. There is no practical way to feed all of X as the input of the program we are searching for, so we cannot directly search for the program that achieves K(Y|X).
Then what can we do, with our current tools, that actually approximates this objective we’ve created for ourselves? Ilya proposes a simple solution: we can instead aim for the joint Kolmogorov complexity K(X, Y), i.e. search for a single program that outputs X and Y together. Joint compression exploits whatever structure X and Y share, which is exactly what we wanted from conditioning, and it is something SGD can actually approximate: train one big model on both datasets, and the model becomes the (hopefully short) program that describes them jointly. In the regime we care about, where X is much larger than Y, minimizing K(X, Y) is a good stand-in for minimizing K(Y|X).
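Here is a toy demonstration of why joint compression helps, using zlib in place of the ideal compressor (my example, and zlib is a very crude proxy, but the direction of the effect is the point): the cost of describing Y on top of X, i.e. len(compress(X+Y)) - len(compress(X)), is much smaller than the cost of describing Y alone whenever X and Y share structure.

```python
# Joint compression exploits shared structure: once X has been described,
# the incremental cost of Y is small. zlib stands in for the ideal compressor.
import zlib

sentences = [f"example sentence number {i} about compression and generalization. "
             for i in range(200)]
X = "".join(sentences).encode()          # large upstream corpus
Y = "".join(sentences[::10]).encode()    # downstream data that reuses X's content

compressed_len = lambda b: len(zlib.compress(b, level=9))

cost_y_alone = compressed_len(Y)
cost_y_given_x = compressed_len(X + Y) - compressed_len(X)  # extra bytes for Y after X

print("describing Y alone:      ", cost_y_alone, "bytes")
print("describing Y on top of X:", cost_y_given_x, "bytes")
```

Training a single large model on X and Y together is, roughly speaking, the learned and scalable version of this: the model ends up being the joint description.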
Some additional thoughts
We have seen how algorithmic information theory turns unsupervised learning into a compression problem: the objective we would ideally minimize is the conditional complexity K(Y|X), and the joint complexity K(X, Y), which SGD on a large neural network can actually approximate, serves as a practical surrogate for it.
Something to note here is that this theory points very precisely at which component we should minimize the complexity of. For instance, it doesn’t impose any preference on the complexity of the program search engine. It doesn’t matter if we use SGD, a brute-force search, or a full-blown LLM to look for a program that describes our data; what matters is the complexity of the resulting program they find. Information bottleneck theory is a very interesting theory that uses information-theoretic concepts to explain what makes for a good, generalizable representation. Similarly, Yoshua Bengio thinks that a bottleneck in the working memory of humans leads to language-like representations which generalize well. These theories point at reducing the complexity of the representations, roughly speaking. AIT, on the other hand, advocates for reducing the complexity of the whole model, which can be seen as a more fundamental principle, though one that says less about what it means for a model to have a good representation (in fact, AIT doesn’t even assume that the model has a neural-network-like structure with representations at all). One way to think of the two ideas is that the information bottleneck theory and similar theories about representations are consequences of AIT: learning generalizable representations leads to simpler prediction heads when compressing a dataset drawn from a very large, general distribution, which in turn leads to simpler programs.
Understanding the idea of the talk made me start thinking about almost all the problems in artificial intelligence in terms of AIT. Here are some of the thoughts I’ve had:
- Retrieval-augmented generation (RAG) is actually a program that receives as input a large corpus of text X and uses it to perform predictions on Y. So RAG can also be considered an unsupervised learning method under our framework. How would RAG decrease the complexity from K(Y) to K(Y|X)? The answer is simple: the model doesn’t need to memorize additional facts for Y if those facts can be retrieved from X. The parametric knowledge that would otherwise be required is no longer needed, reducing the model’s complexity.
- François Chollet, in his famous paper “On the Measure of Intelligence,” argues that current artificial intelligence systems lack fluid intelligence, the ability to acquire new skills, and instead are highly focused on crystallized intelligence, a measure of the skills that have already been acquired. He also advocates for a program synthesis approach, where, upon seeing a few demonstrations of a task, the intelligent system synthesizes or searches for a program that can accomplish it, with deep learning serving as a heuristic for the search. This approach has obvious theoretical advantages such as Turing completeness, sample efficiency, and the ability to find extremely low-complexity programs. The only problem is that it’s not scalable to complex tasks whose Kolmogorov complexity is inherently high. This might not be such a problem for the ARC challenge that he has put forward as a measure of fluid intelligence, since the generative processes of the problems there have extremely low complexity. It becomes especially problematic when we want to make use of a massive amount of prior observations X to do well on a task Y. SGD can effectively compress X and Y together to solve this problem, but the program synthesis approach is inherently trickier to implement, because X might not have low complexity. One possible solution would be to compress Y given X by extracting a library of simple programs from X to reuse for Y. (But how to do this with a very large dataset is largely an unsolved problem.)
- Although the proposed theory is largely impractical when we try to make technical predictions, it can say something about what makes for good finetuning. In particular, it predicts that under the usual pretraining-finetuning framework, the code length required for adjusting the pretrained weights to fit the downstream task is crucial. To be more precise, what matters is the shortest code required for turning the pretrained program into the finetuned program. One would predict that a finetuned model that required minimal change from the pretrained model will generalize well. Techniques like LoRA, or a bias towards the pretrained weights, can be seen as a form of regularization for finetuning (a rough sketch follows below).
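To illustrate that last point, here is a minimal LoRA-style sketch (a simplified toy, not the reference implementation): the finetuning delta is described by two small matrices, so the code length of the finetuned program given the pretrained one stays tiny.

```python
# LoRA-style finetuning: the pretrained weights are frozen and the update is
# a low-rank product, so describing "finetuned given pretrained" is cheap.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # pretrained part stays fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()  # base output + low-rank delta

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
delta_params = sum(p.numel() for p in layer.parameters() if p.requires_grad)
full_params = sum(p.numel() for p in layer.base.parameters())
print(f"parameters describing the finetuning delta: {delta_params:,}")
print(f"parameters a full finetune would touch:     {full_params:,}")
```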