Replying to Jonathan

https://simonwillison.net/2025/May/31/snitchbench-with-llm/

What is Anthropic training Claude Opus 4 on? First, the system card said that if you try to shut it off and the model has access to potentially embarrassing information (like an affair you’re having), it will attempt to blackmail you. Now new tests are showing that if the Opus model finds anything it deems morally objectionable in your email or logs, it will take it upon itself to contact government authorities or the media to rat you out.
