Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Published on May 8, 2025 7:06 PM GMTIn the future, we will want to use powerful AIs on critical tasks such as doing AI safety and security R&D, dangerous capability evaluations, red-teaming safety protocols, or monitoring other powerful models. Since we care about models performing well on these tasks, we are worried about sandbagging: that if our models are misaligned [1], they will intentionally underperform. Sandbagging is crucially different from many other situations with misalignment risk, because it involves models purposefully doing poorly on a task, rather than purposefully doing well. When people talk about risks from overoptimizing reward functions (e.g. as described in https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like

https://www.lesswrong.com/posts/TeTegzR8X5CuKgMc3/misalignment-and-strategic-underperformance-an-analysis-of

Reply to this note

Please Login to reply.

Discussion

No replies yet.