Exploration has always been a hot topic in RL research, and for good reason. It can help agents discover primitive skills that can later be reused for more complex tasks, and it can help avoid local optima by encouraging agents to try out new actions. A large body of work has been devoted to designing better rewards that incentivize exploration, a.k.a. intrinsic motivation. Active inference even goes so far as to theoretically derive a form of exploration reward, starting from the objective of free energy minimization. However, as far as I can tell, the derivation is flawed in its reasoning.
On the other hand, AIXI is a theoretical model of an optimal reinforcement learning agent, and it has no explicit exploration mechanism. It simply chooses a policy that maximizes expected return under its current beliefs about the environment. Suppose that you, the agent, believe that the world you inhabit is a deterministic MDP with two states, A and B, and two actions, stay or move. You know that performing any action in state A gives reward 0, and as for state B, you believe that it either always gives a reward of 1 or always gives a reward of -1, with probability 0.5 on each belief. You are in state A. What would you do? AIXI would reason as follows: “If I stay in state A forever, I get return 0 for sure. If I move to state B and find out what reward it gives, I can exploit that knowledge forever after. That is, if state B gives reward 1, I get reward 1 forever after by staying there, and if it gives -1, I can move back to A and get reward 0 forever after. Therefore, the optimal policy is to move to state B now, and stay there or move back to A depending on the reward I get. Compared to the policy that always stays in A and receives a return of 0, a policy that first moves to B and then decides to either stay in B or move back to A receives an expected return of $0.5(L-1)$, where $L$ is the horizon length.” Unless $L$ is 1, so that the agent is myopic by design, it will always choose the latter policy of exploring state B. In this case, AIXI effectively explores the environment, not because it has an explicit exploration objective, but because exploring is the optimal policy given its beliefs about the world.
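To double-check the arithmetic, here is a tiny script that reproduces the $0.5(L-1)$ figure. One modeling choice is mine, since the setup above leaves reward timing implicit: I assume each step pays the reward of the state the agent lands in, so the first move into B already pays (and reveals) B's reward.

```python
# Expected return of the "explore B" policy under the prior described above.
# Assumption (mine): each step pays the reward of the state the agent lands in.

def exploring_policy_return(L, r_B):
    """Move to B on step 1, then stay there if r_B = +1, else retreat to A."""
    if r_B > 0:
        return L * r_B   # every one of the L steps lands in B and pays +1
    return r_B           # pay -1 once, then sit in A for reward 0 thereafter

def expected_return(L):
    # Prior belief: r_B = +1 or r_B = -1, each with probability 0.5.
    return 0.5 * exploring_policy_return(L, +1) + 0.5 * exploring_policy_return(L, -1)

for L in (1, 2, 5, 20):
    print(L, expected_return(L), 0.5 * (L - 1))  # the last two columns match
```

Under a different convention (reward paid for the state you act in), the constant shifts slightly, but the conclusion is the same: for anything beyond a very short horizon, exploring B dominates staying in A.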
Talking about AIXI might give the impression that this exploratory behavior only emerges in model-based agents, but this doesn’t have to be the case. Given sufficiently long-horizon credit assignment and diverse tasks, there’s nothing blocking model-free agents from learning to explore. In fact, this is only natural if, for instance, the environment changes every episode, so that any policy that doesn’t explore in its first few steps will fail to discover high-reward states. It’s been shown that agents trained with model-free RL in bandit settings exhibit near-optimal exploration-exploitation trade-offs.
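For a concrete picture of what that setup looks like, here is a minimal sketch of a meta-bandit task distribution (in the spirit of RL²-style meta-RL; all names, shapes, and the toy policy are my own illustration). A fresh bandit is drawn every episode, so the only way to earn high return is to probe both arms early and then commit. In the actual work the policy is a recurrent network trained with ordinary model-free RL on episode return; the counts-based rule below is just a stand-in for the behavior such a network learns to implement in its hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
HORIZON = 50

def run_episode(act, update, init_memory):
    p = rng.uniform(size=2)             # new task every episode: arm probabilities resampled
    memory, episode_return = init_memory(), 0.0
    for t in range(HORIZON):
        a = act(memory, t)
        r = float(rng.random() < p[a])  # bandit feedback
        memory = update(memory, a, r)   # (action, reward) fed back, as to an RNN
        episode_return += r
    return episode_return

# Stand-in for "learned" behavior: probe each arm a few times, then commit.
def init_memory():
    return {"n": np.zeros(2), "s": np.zeros(2)}

def act(memory, t):
    if t < 6:
        return t % 2                    # early, deliberate exploration
    return int(np.argmax(memory["s"] / np.maximum(memory["n"], 1)))

def update(memory, a, r):
    memory["n"][a] += 1
    memory["s"][a] += r
    return memory

returns = [run_episode(act, update, init_memory) for _ in range(500)]
print("mean return over 50 pulls:", np.mean(returns))
```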
This also connects to the active learning (AL) literature. Many AL papers design an acquisition function, such as one that encourages diversity among the selected samples. This is a form of explicitly designed exploration objective. However, what we truly desire is to select samples $x$ so that the expected generalization loss of the retrained, i.e. posterior, model is minimized, where the expectation is taken over our beliefs about what the acquired label $y(x)$ will be. This objective (the reduction in expected generalization loss) is called expected error reduction (EER), and it is the Bayes-optimal active learning strategy, albeit, like AIXI, impractical in most situations. Many acquisition functions, such as BALD, BAIT, and BADGE, can be viewed as approximations to EER (perhaps I will show this in a future post). In other words, active learning can also be properly viewed as a pure exploitation problem, where exploratory behavior emerges out of the goal of minimizing the generalization loss of future models.
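Spelled out in symbols (notation mine, since none is fixed above): let $\mathcal{D}$ be the current labeled set, $f_{\mathcal{D}}$ the model retrained on it, $\ell$ the loss, and $p(y \mid x, \mathcal{D})$ our predictive beliefs about the label of a candidate $x$. EER selects

$$
x^\star \;=\; \arg\min_{x}\; \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})}\Big[\, \mathbb{E}_{(x', y')}\big[\ell\big(f_{\mathcal{D} \cup \{(x, y)\}}(x'),\, y'\big)\big] \Big],
$$

i.e. a one-step lookahead on the retrained model's test loss. No diversity or uncertainty bonus appears explicitly, yet informative samples are exactly the ones expected to shrink future error.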
Additional remarks:
- This shouldn’t be taken at all as a dismissal of explicit, hand-designed exploration objectives. The hand-designed objective is a component of the optimization, while the emergent exploratory behavior is a component of the learned policy. If we do not properly hand-design exploration in model-free RL, e.g. through entropy regularization, the agent may never reach an optimal policy, and hence will fail to exhibit the implicit exploratory behavior that arises from optimality. Relatedly, AIXI is not asymptotically optimal despite being Bayes-optimal, i.e. it sometimes can’t recover from critically bad priors. Thompson sampling provides a simple fix (although you lose Bayes-optimality), and it can itself be seen as a form of hand-designed exploration mechanism; a minimal sketch follows these remarks.
- I think we will see a parallel in world models and planning, where model-free agents learn to “reason” or plan in a very flexible manner through neural networks with adaptive compute. This has already been shown to some extent in language models with chain-of-thought, but without a language prior, it’s still a mystery how we can train neural networks with adaptive compute to plan extensively when needed. This is one of the big open questions in my mind: Under what conditions does model-free RL lead to useful, generalizable reasoning in neural networks with adaptive compute? On the surface, the problem seems to be a lack of compositionality in neural networks, since “running one step of a world model”, “backtracking”, etc. have to be composed in an almost systematic manner. I think the problem runs deeper, but I hesitate to speculate further at this point.
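To make the Thompson sampling remark concrete, here is a minimal sketch on a two-armed Bernoulli bandit (a toy of my own, not the full posterior-sampling-for-RL algorithm): each round, sample a world from the posterior and act greedily with respect to that sample. The randomness of the posterior draw is what keeps the agent occasionally trying arms its posterior mean would already write off.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.4, 0.6])          # unknown arm success probabilities
alpha = np.ones(2)                     # Beta posterior: pseudo-counts of successes
beta = np.ones(2)                      # Beta posterior: pseudo-counts of failures

for t in range(2000):
    sampled_p = rng.beta(alpha, beta)  # one posterior sample per arm
    a = int(np.argmax(sampled_p))      # act greedily under the sampled world
    r = float(rng.random() < true_p[a])
    alpha[a] += r                      # posterior update
    beta[a] += 1.0 - r

print("posterior means:", alpha / (alpha + beta))  # should concentrate near true_p
print("pulls per arm:", alpha + beta - 2.0)
```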