I want to share some thoughts on verifiability, inspired by recent works like AlphaEvolve and RLVR. I first lay out several different scenarios in which verifiability manifests, and argue that they are not so different after all. The fact that verification can happen at various levels of a problem, e.g. at the level of subproblems, points to the idea that an agent can benefit from creating and utilizing verification schemes at runtime.
Different Scenarios of Verifiability
I’ve recently been thinking about verifiable problems and the different ways in which they manifest:
- The problem instance we want to solve is verifiable, e.g. a coding task with self-contained test cases.
- There’s a dataset of related problems for which solutions are verifiable with the help of privileged information, e.g. competition math problems with access to ground truth answers.
Note: We will refer to a verification function as a “verifier” throughout this post. In the literature, “verifier” usually refers to a neural-network surrogate of the ground truth verification function, but in this post the term covers everything: proxy verifiers, ground truth verifiers, learned verifiers, programmatic verifiers, etc. Also keep in mind that, for practical purposes, verifiers (including ground truth verifiers) can receive more than just the problem and the solution as inputs, e.g. privileged information; however, the ground truth verifier’s output, a.k.a. correctness, is wholly determined by the problem and the solution.
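As a loose sketch of this convention (the type name and signature here are my own, purely illustrative):

```python
from typing import Any, Callable, Optional

# A verifier maps (problem x, solution y, optional extra inputs) to a score.
# The extra inputs (e.g. privileged information such as reference answers or
# hidden test cases) may make checking cheaper, but for the ground truth
# verifier the output is still fully determined by (x, y).
Verifier = Callable[[Any, Any, Optional[dict]], float]
```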
AlphaEvolve is one method suited for the first scenario. It performs evolutionary search over solutions to the problem, guided by the verifier’s evaluation of those solutions. RLVR is a method suited for the second scenario. It uses privileged information at training time to verify solutions to training instances, and the goal is to find a policy that outputs solutions to problems and generalizes to unseen test instances.
The idea that RLVR-trained models are performing “search” in a chain-of-thought might cause some confusion. This idea refers to search against a learned proxy “verifier” that exists implicitly within the LLM and operates during the chain-of-thought. This is quite different from the explicit verifier that RLVR is trained with, which is based on privileged information that is not available at test time.
We can unify the two scenarios by thinking of learning as search over policies. Just as the first scenario is search for a solution that passes the verifier, the second scenario can be regarded as search for a policy that, when applied to problems in the training dataset, yields solutions that all pass the verifier. In other words, both scenarios are essentially search tasks equipped with verifiers.
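To make the parallel concrete, here is a minimal sketch of both scenarios as search loops against a verifier (all function names are hypothetical placeholders, not any particular method’s implementation):

```python
def search_for_solution(x, verifier, propose, budget=1000):
    """Scenario 1: search directly over solutions to a single verifiable instance."""
    for _ in range(budget):
        y = propose(x)                      # e.g. mutate/recombine earlier candidates
        if verifier(x, y):                  # the verifier is available at runtime
            return y
    return None

def search_for_policy(train_set, train_verifier, propose_policy, budget=1000):
    """Scenario 2: search over policies that solve every training instance.
    Here the verifier may use privileged information available only at training time."""
    for _ in range(budget):
        policy = propose_policy()           # in RLVR this search is gradient-based
        if all(train_verifier(x, policy(x), priv) for x, priv in train_set):
            return policy
    return None
```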
The Two Scenarios are Not Mutually Exclusive
It should also be noted that the two scenarios are not mutually exclusive. One might have a dataset of verifiable problems with privileged information, and also want to solve a new test problem instance that is verifiable, perhaps to a weaker degree because less information is available at test time. To clarify, by “weak” verification we mean that the verifier is not equivalent to the ground truth verifier. For example, we might have self-contained test cases (privileged information) for coding problems in the training dataset, but at test time we might have to infer the test cases from natural language instructions.
In this combined scenario, the learned policy can make use of the weak verification scheme at runtime, whether because we steered it to do so or because it learned to do so during training. A naive example that I suspect will proliferate within the next year is the addition of evolutionary search, in the style of AlphaEvolve, to the toolbox of LLM agents. This would prove especially advantageous in many practical scenarios where the problem instance is not only weakly verifiable, but can also be divided into subproblems that are verifiable to a stronger degree. In such scenarios, the LLM agent could dynamically create appropriate verification schemes and use evolutionary search to solve the subproblems. One should note that these verification schemes serve as instrumental goals; the terminal goal is for the final solution to pass the ground truth verifier.
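As a rough sketch of what such an agent loop might look like (the `llm.*` calls and helper names are hypothetical placeholders, not a real API):

```python
import random

def solve_with_subproblem_search(task, llm, generations=50, population_size=8):
    """Decompose the task, synthesize a verifier for each subproblem at runtime,
    and run a small evolutionary search against it."""
    partial_solutions = []
    for sub in llm.decompose(task):
        verifier = llm.write_verifier(sub)        # e.g. generate unit tests for the subproblem
        population = [llm.propose(sub) for _ in range(population_size)]
        for _ in range(generations):
            ranked = sorted(population, key=lambda c: verifier(sub, c), reverse=True)
            parents = ranked[: population_size // 2]
            children = [llm.mutate(sub, random.choice(parents))
                        for _ in range(population_size - len(parents))]
            population = parents + children
        partial_solutions.append(max(population, key=lambda c: verifier(sub, c)))
    # The runtime verifiers are instrumental goals; the assembled solution still
    # has to pass the (possibly weak) verifier for the full task.
    return llm.assemble(task, partial_solutions)
```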
Appendix A. Auxiliary Information
Let’s think about competition math problems where the answers are short mathematical expressions, e.g. AIME. They are verifiable with the help of privileged information, i.e. the ground truth answers. Without privileged information, however, the problems are hardly verifiable. In fact, verifying a solution becomes almost as hard as generating it (“hard” in terms of computational complexity), because a verifier would mostly have to rely on the derivation of the answer, not the final answer alone.
Fortunately, chains of thought (CoTs) contain information about this derivation, and thus verifiers trained on CoTs usually do OK for the purpose of boosting performance through search, e.g. Best-of-N. Side note: the effectiveness of learned verifiers in reasoning models should also be largely attributed to the fact that the LLMs’ hidden states contain rich information, acquired both during pretraining and during RLVR. In fact, RLVR incentivizes the hidden states to explicitly contain information about correctness, since identifying and backtracking from wrong solutions helps the model achieve higher reward.
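A minimal sketch of Best-of-N with a learned verifier that scores a solution together with its CoT (the `policy` and `learned_verifier` objects are placeholders):

```python
def best_of_n(x, policy, learned_verifier, n=16):
    """Sample N (solution, CoT) pairs and return the solution the learned verifier
    scores highest. The CoT carries information about the derivation, which is
    what makes the verifier's job tractable."""
    candidates = [policy.sample(x) for _ in range(n)]   # each sample is a (y, cot) pair
    best_y, _ = max(candidates, key=lambda pair: learned_verifier(x, pair[0], pair[1]))
    return best_y
```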
Aside from CoTs, we can also hand-design the AI system so that its output explicitly contains auxiliary information. One example is code generation where we don’t get test cases, only natural language instructions. In this case, we first instruct an LLM to create test cases from the natural language instructions, and then search for code that passes the generated test cases. The generated test cases, along with the execution traces of the solution code on those test cases, can be seen as auxiliary information.
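A sketch of this pipeline, where the generated tests act as a programmatic proxy verifier (again, `llm.*` and `run_sandboxed` are hypothetical placeholders):

```python
def solve_from_instructions(instructions, llm, n_candidates=16):
    """Derive test cases from natural language instructions, then keep the candidate
    program that passes the most generated tests. The generated tests and the
    execution traces are the auxiliary information."""
    tests = llm.generate_tests(instructions)      # list of (inputs, expected_output)

    def proxy_verifier(code):
        passed = 0
        for inputs, expected in tests:
            try:
                if run_sandboxed(code, inputs) == expected:
                    passed += 1
            except Exception:
                pass                              # crashes count as failures
        return passed / max(len(tests), 1)

    candidates = [llm.generate_code(instructions) for _ in range(n_candidates)]
    return max(candidates, key=proxy_verifier)
```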
All in all, although we don’t get privileged information at test time, we can still get auxiliary information $z$ that’s output along with the solution $y$. It’s interesting to note that by the data processing inequality, $z$ cannot give us more information about the ground truth verifier’s output $v$ than the solution $y$ itself. This is because $v$ is assumed to depend wholly on the problem $x$ and the solution $y$, which creates a Markov chain $z \to y \to v$ conditional on $x$. The same holds true for privileged information $z_{\text{priv}}$ during training. However, $z$ and $z_\text{priv}$ can be useful for a computationally bounded verifier, as there can be useful bits of information inside them that require large computations to obtain.
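Spelled out (conditioning on the problem throughout), the data processing inequality reads

\[
z \to y \to v \ \ (\text{given } x) \quad \Longrightarrow \quad I(z; v \mid x) \;\le\; I(y; v \mid x),
\]

and likewise $I(z_{\text{priv}}; v \mid x) \le I(y; v \mid x)$; the value of $z$ and $z_{\text{priv}}$ to a bounded verifier is computational, not informational.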
Some PR time: note that if we are performing Best-of-N with a learned verifier at test time, then we get the whole set of CoTs from the $N$ samples, \(\{z_j\}_{j=1}^N\), as auxiliary information. We can therefore train a verifier \(V(x, \{y_j, z_j\}_{j=1}^N)\) that takes in the whole set to score every solution $y_i$. We propose a scalable method that implements this idea in our paper, to be announced.
Appendix B. Computationally Unbounded Perspective
I have mentioned that the second scenario can be seen as search over programs that, applied to the training problems, pass the verifier. To be pedantic, however, the true goal of the second scenario is never to find a program that solves the training dataset, but to solve the test problem instance. Strictly speaking, the training dataset merely provides bits of information about the environment (e.g. the verifier), which can be used for solving the test problem. In RLVR, we amortize this search over solutions by learning a program/policy on the training dataset, which is then applied to the test problem instance. An alternative approach would be to use the training dataset to learn an explicit verifier, and then optimize solutions to the test problem instance against this learned verifier. This is in fact widely used as a complementary approach to RLVR, e.g. Best-of-N, a.k.a. rejection sampling.
For a computationally unbounded entity with infinite exploration at training time, the two approaches can be implemented as follows. Given a dataset of training problem instances, all possible solutions to those problems, and their verifier scores, we can do either of the following (a loose formalization follows the list):
- (RLVR) Perform Bayesian inference over programs that output a distribution of solutions given a problem instance. We would use the Solomonoff prior, and a likelihood of 1 if the program is a global optimum of the RL objective on the training dataset, and 0 otherwise. We then ensemble programs from this posterior to obtain a distribution of solutions to the test problem instance, and sample from this distribution to obtain a solution $y$.
- (Search over learned verifier) Perform Bayesian inference over programs that output a verifier score given a problem instance and a solution, again with the Solomonoff prior. We can ensemble programs from this posterior to obtain a distribution of verifier scores for the test problem instance and any candidate solution $y$. We then search for the solution $y$ that maximizes the probability of a verifier output of 1.
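As a loose formalization (my notation, hedged): let $\mathcal{D}$ be the training data (problems, all solutions, and their verifier scores), $\ell(p)$ the length of a program $p$, and $J_{\mathcal{D}}$ the RL objective. The two posteriors would look roughly like

\[
P(\pi \mid \mathcal{D}) \;\propto\; 2^{-\ell(\pi)}\,\mathbf{1}\!\left[\pi \in \arg\max_{\pi'} J_{\mathcal{D}}(\pi')\right],
\qquad
P(V \mid \mathcal{D}) \;\propto\; 2^{-\ell(V)}\,\mathbf{1}\!\left[V \text{ matches the verifier scores in } \mathcal{D}\right],
\]

where $\pi$ is a policy program and $V$ a verifier program; the first route samples $y$ from the ensembled policies on the test instance, while the second searches for the $y$ to which the ensembled verifiers assign the highest probability of a score of 1.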
First off, the second approach is theoretically the better choice, since we ultimately seek to maximize the verifier’s output on our solution. An even better approach would be to use Solomonoff induction on the whole training dataset, the test problem instance, and the solution $y$ to predict $y$’s verifier score, instead of learning a verifier that operates instance-wise. This would actually be AIXI.
What about the first approach? If you have a verifier program fit on the training set (second approach), you can add a ‘search’ code snippet that turns it into a solution generator (first approach); see the sketch after this list. On the other hand, there is no universal way to turn a solution generator into a verifier, because the solution generator is not guaranteed to assign non-zero probability to all correct solutions. It was only trained to maximize the RL objective, meaning that it only needs to assign high probability to some correct solutions, and may assign low probability to others. This has several consequences:
- The first approach can be suboptimal, since it’s not necessarily equivalent to the second approach.
- Generation is always simpler than (or equivalent to) verification, in the algorithmic complexity sense. This is the reverse of the usual generation-verification gap in the computational complexity sense, where verification is easier than generation. However, to repeat the first consequence, this doesn’t mean that the generator will generalize better than the verifier. If anything, it’s the opposite: our goal is for the solution to pass the ground truth verifier on the test problem, and thus generalizing the verifier is our ultimate goal. Although not discussed above, if we only approve of generators that assign non-zero probability to all correct solutions on the training problems, then we would be able to convert generators into verifiers with $O(1)$ complexity overhead, and the two approaches would indeed become almost equivalent. One could translate this into practical advice: model-free RL methods may generalize better with an objective that encourages diversity in generation, e.g. maximum entropy RL.
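Here is a sketch of the ‘search’ wrapper mentioned above: a constant-size loop turns a verifier program into a solution generator, which is why generation is no harder than verification in the algorithmic complexity sense (the enumeration helper is hypothetical):

```python
def generator_from_verifier(verifier, enumerate_solutions):
    """Wrap a verifier program with a constant-size search loop to obtain a
    generator; the wrapper adds only O(1) description length on top of the
    verifier. `enumerate_solutions` is assumed to enumerate all candidate
    solutions, e.g. all strings in length-lexicographic order."""
    def generate(x):
        for y in enumerate_solutions():
            if verifier(x, y):          # halts whenever a correct solution exists
                return y
    return generate
```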