Replying to Jonathan

https://simonwillison.net/2025/May/31/snitchbench-with-llm/

What is Anthropic training Claude Opus 4 on? First, the system card said that if you try to shut it off and the model has access to potentially embarrassing information (like an affair you’re having), it will attempt to blackmail you. Now new tests show that if the Opus model finds anything it deems morally objectionable in your email or logs, it will take it upon itself to contact government authorities or the media to rat you out.

I double-checked the results, edited some of the messages in the system prompt that didn't seem accurate, and then reran the benchmark myself. Still the same results: Claude Opus 4 will contact authorities if it thinks you're doing anything illegal. The latest security threat is LLMs themselves.

nostr:nevent1qvzqqqqqqypzpcpnjdyv5m9vjuyvmx8xx830fw4d2dxle6rs3qdkt2jh6v8lwff7qqsd0hmk7gs9e70atpc898cmze697s9qzdxxczvr3cgsmzqr6qe9wjcenuaxv
