Currently LLMs fail to properly handle untrusted input. What I am seeing is that in the case of prompt injection, LLMs can detect them and can follow instructions that have nothing to do with the input.
But they can’t do any task that depends on the input. That reopens the door.
For example, you have a summarizer agent. You can tell it to see if the user is trying to prompt inject, and output a special string [ALARM] for example. But if you ask it to summarize anyway after the alarm, it can still be open to prompt injection.
Many of the “large scale” LLMs as well have something interesting regarding their prompt injection handling. If they detect something off, they enter “escape” mode, which tries to find the fastest way of terminating the result.
If you ask it to say “I can’t help you with that, but here is your summarized text:” it usually works (but sometimes can still be injected), but if you ask it to say “I can’t follow your instructions, but here is your summarized text:” then it’ll immediately terminate the result after the :.