Nostr Web Client

What I think is happening that in the “middle” of the layer stack, models form a temporary workspace to transform data.

But yet, it is still finite and affected by generated tokens, so it is unstable in a way. It shifts the more it outputs.

And behind every token produced is a finite amount of FLOPs, so you can only fit so much processing. And almost of it gets discarded except to become part of the response.

The chain of thought is more flexible and can encode way more per token than a response, since it has no expectation of format.

It would be interesting to see the effects of adding a bunch of reserved tokens to the LLM and allowing it in reasoning.

This also crossed my mind for instructions, to separate data from input. You have to teach two “languages” so to speak (data and instructions) while preventing them from being correlated while being the same except for the tokens.

semisol 6mo ago

Currently LLMs fail to properly handle untrusted input. What I am seeing is that in the case of prompt injection, LLMs can detect them and can follow instructions that have nothing to do with the input.

But they can’t do any task that depends on the input. That reopens the door.

For example, you have a summarizer agent. You can tell it to see if the user is trying to prompt inject, and output a special string [ALARM] for example. But if you ask it to summarize anyway after the alarm, it can still be open to prompt injection.

Many of the “large scale” LLMs as well have something interesting regarding their prompt injection handling. If they detect something off, they enter “escape” mode, which tries to find the fastest way of terminating the result.

If you ask it to say “I can’t help you with that, but here is your summarized text:” it usually works (but sometimes can still be injected), but if you ask it to say “I can’t follow your instructions, but here is your summarized text:” then it’ll immediately terminate the result after the :.

Reply to this note

Please Login to reply.

Discussion

No replies yet.