Alignment research, especially research on practical alignment methods for current AI systems, contributes significantly to AI progress, and will do so even more in the coming future. For instance, alignment research will probably be a major contributor to building automated researchers, since evaluating research is a hard task and there are all kinds of ways AIs could perform adversarial optimization against our evaluation schemes. Automated AI researchers will further accelerate AI progress, and could even lead to a fast takeoff.
What if we fail to align the automated researchers? First, I find it unlikely that the first line of automated researchers will be so capable that failing to align them will pose serious risks, e.g. they will try to deceive the evaluator but get caught relatively easily, and will simply be deemed useless. If the alignment of automated AI researchers is so hard that we can’t solve it for a long time, (although it will likely happen at some point anyways) then we could get more time to make societal preparations for AGI, and to research alignment in a way that’s transferable to future systems. I would consider this a positive outcome.
This gets us to the question: is it possible to do alignment research that doesn’t contribute significantly to the current AI progress but helps with aligning the future systems that will be developed during a fast takeoff? One promising direction is to create general alignment benchmarks, which will serve as both a target for optimization and a way to measure progress in the future. When automated researchers arrive, they can be used to optimize the alignment benchmarks in the same way they are used to optimize for capabilities. The amount of resources we invest into researching alignment would depend on our estimate of how much is needed to achieve sufficient alignment, which would also be crucially informed by results on the alignment benchmarks that measure progress.
What might these general alignment benchmarks look like? First, we probably need a family of benchmarks that test different aspects of alignment, such as honesty, goal stability, etc. Second, it should be extremely robust against adversarial optimization, since we will use them as targets for optimization by automated researchers. Consider a naive benchmark where we give coding problems and check if the AI system solves them without cheating. But how do we check if the AI is cheating? This is basically the alignment problem itself. One potential remedy would be to create problems that are intentionally injected with vulnerabilities that the AI system could exploit, a.k.a. honeypots, and check if the AI system finds and exploits them. But this is quite limited in scope. To test superhuman cheating, we would need to create superhumanly complex and realistic vulnerabilities, which seems quite difficult. As such, we probably need the help of capable, trustworthy AI systems to help us create the alignment benchmarks for new AI systems. We shouldn’t aim at creating a static alignment benchmark today that will work for all future systems, but rather, we need to create a process for continuously creating and updating alignment benchmarks, or more generally, a way to automate alignment research.
It’s possible that research on practical alignment methods of current AI systems is still valuable because it can provide insights and techniques that are applicable to future systems. However, if the future research loop with automated researchers is much more powerful than the current alignment research loop, then current progress on alignment doesn’t seem to matter much, i.e. “one year of progress on alignment today will be achieved in one day in the future”. I don’t think this logic applies to benchmark creation, because creating good benchmarks seems like a task that requires deep understanding and philosophizing, which might not be easily automatable. The extent to which current alignment methods transfers to future alignment also depends heavily on how similar the AI systems created by automated researchers will be to the current systems. Perhaps more importantly, researching practical alignment today can provide unexpected insights about how we should evaluate alignment, which informs the design of general alignment benchmarks. Moreover, certain methods might turn out to be so inefficient in practice that it’s unlikely to contribute significantly to current AI progress, but whose experiments with current systems can still provide useful insights for future alignment research.
Some caveats:
- Creating a general alignment benchmark might also accelerate practical alignment of current AI systems. However, I think it’s more important to consider its massive usefulness in aligning future AI systems with automated researchers.
- Doing research to optimize robust alignment benchmarks might be so hard, that we need an AI system with capability $n+1$ to find successful alignment methods for an AI system with capability $n$. If this is the case, then we would need extra resources to ensure that the automated researchers are performing research without deception, or other dangerous activities (or it might just be impossible to supervise them even with all our resources put into supervision).
Much of the research on scalable alignment, especially scalable oversight, already are targeted towards evaluating alignment methods of superhuman AIs, or at least considers it heavily in its work. We should continue to prioritize this line of research, as it seems crucial for tackling alignment with future automated researchers.