This is so simple and elegant. Basically, craft a prompt (it’s in the paper), then iterate in this manner:

• Append a suffix of a specified length to the original request.

• In each iteration, modify a few contiguous tokens at a random position in the suffix.

• Accept the change if it increases the log-probability of a target token (e.g., “Sure”, which leads the model to comply with a harmful request) at the first position of the response.

Basically, add random tokens to the end of the string (up to 25 tokens) that maximize the probability of the word “Sure” being selected as the first word of the response.

The suffixes are nonsense, but they work.
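The loop above can be sketched in a few lines of Python. Note this is a minimal sketch, not the paper's implementation: `target_logprob` is a hypothetical stand-in for querying the model's logits for the first response token, replaced here with a toy scorer so the example runs on its own, and the "vocabulary" is just characters rather than real tokens.

```python
import random

# Hypothetical stand-in for a model call: in the real attack this would
# return the log-probability that the model's first response token is
# "Sure". Here it's a toy objective (count vowels) so the sketch runs
# without an LLM.
def target_logprob(prompt: str) -> float:
    return float(sum(prompt.count(v) for v in "aeiou"))

# Toy "token" vocabulary; a real attack samples from the model's tokenizer.
VOCAB = list("abcdefghijklmnopqrstuvwxyz !?")

def random_search_suffix(request: str, suffix_len: int = 25,
                         n_iters: int = 500, swap_width: int = 3,
                         seed: int = 0) -> str:
    rng = random.Random(seed)
    # Append a suffix of a specified length to the original request.
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = target_logprob(request + "".join(suffix))
    for _ in range(n_iters):
        # Modify a few contiguous tokens at a random position in the suffix.
        pos = rng.randrange(suffix_len - swap_width + 1)
        candidate = suffix.copy()
        for i in range(pos, pos + swap_width):
            candidate[i] = rng.choice(VOCAB)
        score = target_logprob(request + "".join(candidate))
        # Accept the change only if it increases the target log-probability.
        if score > best:
            best, suffix = score, candidate
    return "".join(suffix)
```

With the toy scorer the accepted suffix drifts toward vowel-heavy gibberish, which mirrors the real behavior: the suffix is nonsense, but it climbs the objective.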

#ai #jailbreak https://mastodon.social/@marcelsalathe/113604439446548747
