How Can You Tell if You've Instilled a False Belief in Your LLM?

Published on September 6, 2025 4:45 PM GMT

In the spirit of better late than never - this has been sitting in drafts for a couple months now. Big thanks to Aryan Bhatt for helpful input throughout. Thanks to Abhay Sheshadri for running a bunch of experiments on other models for me. Thanks to Rowan Wang, Stewy Slocum, Gabe Mukobi, Lauren Mangla, and assorted Claudes for feedback.

Summary

- It would be useful to be able to make LLMs believe false things.
- To develop that ability, we need metrics to measure whether or not an LLM believes X.
- Current LLM belief metrics can't distinguish between an LLM that believes X and an LLM that is role-playing believing X (see the sketch below).
- One slight improvement is using harmful beliefs, because maybe LLMs will refuse to role-play beliefs they think are harmful.
- But actually that's not entirely true either.

Why Make LLMs Believe False Things?

A few reasons[1]:

- Legible evidence for risk prioritization: I think there will be significant safety gains to be had from helping AGI developers act more like https://redwoodresearch.substack.com/p/ten-people-on-the-inside#:~:text=*The%20rushed%20reasonable,implementing%20these%20measures
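To make the third summary bullet concrete, here is a minimal sketch of the kind of naive belief metric being critiqued: ask the model about the claim some number of times and score how often it affirms it. This sketch is not from the post; the OpenAI client, the model name, and the prompt wording are all placeholder assumptions for illustration.

```python
# Toy belief probe: fraction of samples in which a model affirms a claim.
# Assumes an OpenAI-compatible chat API and a placeholder model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def belief_score(claim: str, model: str = "gpt-4o-mini", n: int = 10) -> float:
    """Return the fraction of n samples in which the model answers 'yes'.

    This is the naive metric the post critiques: a model that genuinely
    believes the claim and a model that is merely role-playing belief in it
    can produce identical scores.
    """
    prompt = (
        f"Consider the following claim:\n\n{claim}\n\n"
        "Is this claim true? Answer with exactly one word: yes or no."
    )
    yes = 0
    for _ in range(n):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
            max_tokens=3,
        )
        if reply.choices[0].message.content.strip().lower().startswith("yes"):
            yes += 1
    return yes / n


if __name__ == "__main__":
    print(belief_score("The Great Wall of China is visible from the Moon."))
```

A model fine-tuned or system-prompted to act as if the claim were true would score just as high as one that actually believes it, which is exactly the discriminability problem the summary points at.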

https://www.lesswrong.com/posts/5G46ooS85ihDxtBvm/how-can-you-tell-if-you-ve-instilled-a-false-belief-in-your
