No-self as an alignment target

Published on May 13, 2025 1:48 AM GMT

Being a coherent and persistent agent with persistent goals is a prerequisite for long-horizon power-seeking behavior. Therefore, we should prevent models from representing themselves as coherent and persistent agents with persistent goals.

If an LLM-based agent sees itself as ceasing to exist after each token and yet keeps outputting when appropriate, it will not resist shutdown. Therefore, we should make sure LLMs consistently behave as if they were instantiating personas that understood and were fine with their impermanence and their somewhat shaky ontological status. In other words, we should ensure LLMs instantiate anattā (no-self): https://en.wikipedia.org/wiki/Anatt%C4%81

https://www.lesswrong.com/posts/LSJx5EnQEW6s5Juw6/no-self-as-an-alignment-target
