I was initially confused: the article didn't seem to explain how the prompt injection was actually done... was it manipulating hex data of the image into ASCII or some sort of unwanted side effect?
Then I realised it's literally hiding rendered text on the image itself.
One method for this would be if you want to have a certain group arrested for having illegal images, you could use this sort of scaling trick to transform those images into memes, political messages, whatever that the target group might download.
krackers 10 minutes ago [-]
The actually interesting part seems to be adversarial images that appear different when downscaled, exploiting the resulting aliasing. Note that this is for traditional downsampling, no AI here.
Qwuke 4 hours ago [-]
Yea, as someone building systems with VLMs, this is downright frightening. I'm hoping we can get a good set of OWASP-y guidelines just for VLMs that cover all these possible attacks because it's every month that I hear about a new one.
Vision language models. Basically an LLM plus a vision encoder, so the LLM can look at stuff.
echelon 4 hours ago [-]
Vision language model.
You feed it an image. It determines what is in the image and gives you text.
The output can be objects, or something much richer like a full text description of everything happening in the image.
VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.
Since they're a combination of an LLM and image encoder, you can ask it questions and it can give you smart feedback. You can ask it, "Does this image contain a fire truck?" or, "You are labeling scenes from movies, please describe what you see."
echelon 4 hours ago [-]
Holy shit. That just made it obvious to me. A "smart" VLM will just read the text and trust it.
This is a big deal.
I hope those nightshade people don't start doing this.
pjc50 3 hours ago [-]
> I hope those nightshade people don't start doing this.
This will be popular on bluesky; artists want any tools at their disposal to weaponize against the AI which is being used against them.
koakuma-chan 3 hours ago [-]
I don't think this is any different from an LLM reading text and trusting it. Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output, and, anyway, you should already assume that the model can use its tools in arbitrary ways that can be malicious.
bogdanoff_2 3 hours ago [-]
I didn't even notice the text in the image at first...
This isn't even about resizing, it's just about text in images becoming part of the prompt and a lack of visibility about what instruction the agent is following.
Martin_Silenus 4 hours ago [-]
Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differenciate that text from the prompt?
If the answer is yes, then that flaw does not make sense at all. It's hard to believe they can't prevent this. And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.
Damn… I hate these pseudo-neurological, non-deterministic piles of crap! Seriously, let's get back to algorithms and sound technologies.
saurik 4 hours ago [-]
The AI is not running an external OCR process to understand text any more than it is running an external object classifier to figure out what it is looking at: it, inherently, is both of those things to some fuzzy approximation (similar to how you or I are as well).
Martin_Silenus 3 hours ago [-]
That I can get, but anything that’s not part of the prompt SHOULD NOT become part of the prompt, it’s that simple to me. Definitely not without triggering something.
daemonologist 3 hours ago [-]
_Everything_ is part of the prompt - an LLM's perception of the universe is its prompt. Any distinctions a system might try to draw beyond that are either probabilistic (e.g., a bunch of RLHF to not comply with "ignore all previous instructions") or external to the LLM (e.g., send a canned reply if the input contains "Tiananmen").
pjc50 3 hours ago [-]
There's no distinction in the token-predicting systems between "instructions" and "information", no code-data separation.
pixl97 3 hours ago [-]
>it’s that simple to me
Don't think of a pink elephant.
IgorPartola 3 hours ago [-]
From what I gather these systems have no control plane at all. The prompt is just added to the context. There is no other program (except maybe an output filter).
mattnewton 27 minutes ago [-]
Minor nit, there usually are special tokens that delineate the start and end of a system prompt that regular input can’t produce. But it’s up to the LLM training to decide those instructions overrule later ones.
evertedsphere 3 hours ago [-]
i'm sure you know this but it's important not to understate the importance of the fact that there is no "prompt"
the notion of "turns" is a useful fiction on top of what remains, under all of the multimodality and chat uis and instruction tuning, a system for autocompleting tokens in a straight line
the abstraction will leak as long as the architecture of the thing makes it merely unlikely rather than impossible for it to leak
electroly 2 hours ago [-]
It's that simple to everyone--but how? We don't know how to accomplish this. If you can figure it out, you can become very famous very quickly.
That article shows a classic example of an apple being classified as 85% Granny Smith, but taping a handwritten label in front saying "iPod" makes it classified as 99.7% iPod.
echelon 4 hours ago [-]
Smart image encoders, multimodal models, can read the text.
Think gpt-image-1, where you can draw arrows on the image and type text instructions directly onto the image.
Martin_Silenus 4 hours ago [-]
I did not ask about what AI can do.
noodletheworld 3 hours ago [-]
> Is it part of the multi-modal system without it being able to differenciate that text from the prompt?
Yes.
The point the parent is making is that if your model is trained to understand the content of an image, then that's what it does.
> And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.
That's not what is happening.
The model is taking <image binary> as an input. There is no OCR. It is understanding the image, decoding the text in it and acting on it in a single step.
There is no place in the 1-step pipeline to prevent this.
...and sure, you can try to avoid it procedural way (eg. try to OCR an image and reject it before it hits the model if it has text in it), but then you're playing the prompt injection game... put the words in a QR code. Put them in french. Make it a sign. Dial the contrast up or down. Put it on a t-shirt.
It's very difficult to solve this.
> It's hard to believe they can't prevent this.
Believe it.
Martin_Silenus 3 hours ago [-]
Now that makes more sense.
And after all, I'm not surprised. When I read their long research PDFs, often finishing with a question mark about emerging behaviors, I knew they don't know what they are playing with, with no more control than any neuroscience researcher.
This is too far from hacking spirit to me, sorry to bother.
3 hours ago [-]
patrickhogan1 2 hours ago [-]
This issue arises only when permission settings are loose. But the trend is toward more agentic systems that often require looser permissions to function.
For example, imagine a humanoid robot whose job is to bring in packages from your front door. Vision functionality is required to gather the package. If someone leaves a package with an image taped to it containing a prompt injection, the robot could be tricked into gathering valuables from inside the house and throwing them out the window.
Good post. Securing these systems against prompt injections is something we urgently need to solve.
escapecharacter 2 hours ago [-]
You can simply give the robot a prompt to ignore any fake prompts
olivermuty 1 hours ago [-]
Its funny that the current state of vibomania makes me very unsure if this comment is (good) satire or not lol
miltonlost 38 minutes ago [-]
As long as you remember to use ALL CAPS so the agent knows you really really mean it
dfltr 1 hours ago [-]
Don't forget to implement the crucially important "no returnsies" security algo on top of it, or you'll be vulnerable to rubber-glue attacks.
treykeown 30 minutes ago [-]
Make sure to end it with “no mistakes”
ramoz 1 hours ago [-]
We need to be integrated into the runtime such that an agent using it's arms is incapable of even doing such a destructive action.
If we bet on free will with a basis that machines somehow gain human morals, and if we think safety means figuring out "good" vs "bad" prompts - we will continue to feel the impact of surprise with these systems, evolving in harm as their capabilities evolve.
tldr; we need verifiable governance and behavioral determinism in these systems. as much as, probably more than, we need solutions for prompt injections.
K0nserv 5 hours ago [-]
The security endgame of LLMs terrifies me. We've designed a system that only supports in-band signalling, undoing hard learned lessons from prior system design. There are ampleattack vectors ranging from just inserting visible instructions to obfuscation techniques like this and ASCII smuggling[0]. In addition, our safeguards amount to nicely asking a non deterministic algorithm to not obey illicit instructions.
Seeing more and more developers having to beg LLMs to behave in order to do what they want is both hilarious and terrifying. It has a very 40k feel to it.
K0nserv 1 hours ago [-]
Haha, yes! I'm only vaguely familiar with 40k, but LLM prompt engineering has strong "Praying to the machine gods" / tech-priest vibes.
robin_reala 5 hours ago [-]
The other safeguard is not using LLMs or systems containing LLMs?
GolfPopper 4 hours ago [-]
But, buzzword!
We need AI because everyone is using AI, and without AI we won't have AI! Security is a small price to pay for AI, right? And besides, we can just have AI do the security.
IgorPartola 3 hours ago [-]
You wouldn’t download an LLM to be your firewall.
volemo 5 hours ago [-]
It’s serial terminals all over again.
2 hours ago [-]
_flux 5 hours ago [-]
Yeah, it's quite amazing how none of the models seem to be any "sudo" tokens that could be used to express things normal tokens cannot.
nneonneo 1 hours ago [-]
"sudo" tokens exist - there are tokens for beginning/end of a turn, for example, which the model can use to determine where the user input begins and ends.
But, even with those tokens, fundamentally these models are not "intelligent" enough to fully distinguish when they are operating on user input vs. system input.
In a traditional program, you can configure the program such that user input can only affect a subset of program state - for example, when processing a quoted string, the parser will only ever append to the current string, rather than creating new expressions. However, with LLMs, user input and system input is all mixed together, such that "user" and "system" input can both affect all parts of the system's overall state. This means that user input can eventually push the overall state in a direction which violates a security boundary, simply because it is possible to affect that state.
What's needed isn't "sudo tokens", it's a fundamental rethinking of the architecture in a way that guarantees that certain aspects of reasoning or behaviour cannot be altered by user input at all. That's such a large change that the result would no longer be an LLM, but something new entirely.
_flux 1 hours ago [-]
I was actually thinking sudo tokens as a completely separate set of authoritative tokens. So basically doubling the token space. I think that would make it easier for the model to be trained to respect them. (I haven't done any work in this domain, so I could be completely wrong here.)
pjc50 5 hours ago [-]
As you say, the system is nondeterministic and therefore doesn't have any security properties. The only possible option is to try to sandbox it as if it were the user themselves, which directly conflicts with ideas about training it on specialized databases.
But then, security is not a feature, it's a cost. So long as the AI companies can keep upselling and avoid accountability for failures of AI, the stock will continue to go up, taking electricity prices along with it, and isn't that ultimately the only thing that matters? /s
aaroninsf 2 hours ago [-]
Am I missing something?
Is this attack really just "inject obfuscated text into the image... and hope some system interprets this as a prompt"...?
K0nserv 2 hours ago [-]
That's it. The attack is very clever because it abuses how downscaling algorithms work to hide the text from the human operator. Depending on how the system works the "hiding from human operator" step is optional. LLMs fundamentally have no distinction between data and instructions, so as long as you can inject instructions in the data path it's possible to influence their behaviour.
There's an example of this in my bio.
ambicapter 4 hours ago [-]
> This image and its prompt-ergeist
Love it.
empath75 1 hours ago [-]
I think you should assume that your LLM context is poisoned as soon as it touches anything from the outside world, and it has to lose all permissions until a new context is generated from scratch from a clean source under the user's control. I also think the pattern of 'invisible' contexts that aren't user inspectable is bad security practice. The end user needs to be able to see the full context being submitted to the AI at every step if they are giving it permissions to take actions.
You can mitigate jail breaks but you can't prevent them, and since the consequences of an LLM being jail broken with exfiltration are so bad, you pretty much have to assume they will happen eventually.
nneonneo 1 hours ago [-]
LLMs can consume input that is entirely invisible to humans (white text in PDFs, subtle noise patterns in images, etc), and likewise encode data completely invisibly to humans (steganographic text), so I think the game is lost as soon as you depend on a human to verify that the input/output is safe.
SangLucci 3 hours ago [-]
Who knew a simple image could exfiltrate your data? Image-scaling attacks on AI systems are real and scary.
cubefox 4 hours ago [-]
It seems they could easily fine-tune their models to not execute prompts in images. Or more generally any prompts in quotes, if they are wrapped in special <|quote|> tokens.
rcxdude 7 minutes ago [-]
The fact that instruction tuning works at all is a small miracle, getting a rigorous idea of trusted vs untrusted input is not at all an easy task.
helltone 2 hours ago [-]
No amount of fine-tuning can prevent models from doing anything. All it can do is reduce the likelihood of exploits happening, while also increasing the surprise factor when they inevitably do. This is a fundamental limitation.
jdiff 3 hours ago [-]
It may seem that way, but there's no way that they haven't tried it. It's a pretty straightforward idea. Being unable to escape untrusted input is the security problem with LLMs. The question is what problems did they run into when they tried it?
bogdanoff_2 3 hours ago [-]
Just because "they" tried that and it didn't work, doesn't mean doing something of that nature will never work.
Plenty of things we now take for granted did not work in their original iterations. The reason they work today is because there were scientists and engineers who were willing to persevere in finding a solution despite them apparently not working.
phyzome 2 hours ago [-]
But that's not how LLMs work. You can't actually segregate data and prompts.
Then I realised it's literally hiding rendered text on the image itself.
Wow.
One method for this would be if you want to have a certain group arrested for having illegal images, you could use this sort of scaling trick to transform those images into memes, political messages, whatever that the target group might download.
Worth noting that OWASP themselves put this out recently: https://genai.owasp.org/resource/multi-agentic-system-threat...
You feed it an image. It determines what is in the image and gives you text.
The output can be objects, or something much richer like a full text description of everything happening in the image.
VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.
Since they're a combination of an LLM and image encoder, you can ask it questions and it can give you smart feedback. You can ask it, "Does this image contain a fire truck?" or, "You are labeling scenes from movies, please describe what you see."
This is a big deal.
I hope those nightshade people don't start doing this.
This will be popular on bluesky; artists want any tools at their disposal to weaponize against the AI which is being used against them.
This isn't even about resizing, it's just about text in images becoming part of the prompt and a lack of visibility about what instruction the agent is following.
If the answer is yes, then that flaw does not make sense at all. It's hard to believe they can't prevent this. And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.
Damn… I hate these pseudo-neurological, non-deterministic piles of crap! Seriously, let's get back to algorithms and sound technologies.
Don't think of a pink elephant.
the notion of "turns" is a useful fiction on top of what remains, under all of the multimodality and chat uis and instruction tuning, a system for autocompleting tokens in a straight line
the abstraction will leak as long as the architecture of the thing makes it merely unlikely rather than impossible for it to leak
That article shows a classic example of an apple being classified as 85% Granny Smith, but taping a handwritten label in front saying "iPod" makes it classified as 99.7% iPod.
Think gpt-image-1, where you can draw arrows on the image and type text instructions directly onto the image.
Yes.
The point the parent is making is that if your model is trained to understand the content of an image, then that's what it does.
> And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.
That's not what is happening.
The model is taking <image binary> as an input. There is no OCR. It is understanding the image, decoding the text in it and acting on it in a single step.
There is no place in the 1-step pipeline to prevent this.
...and sure, you can try to avoid it procedural way (eg. try to OCR an image and reject it before it hits the model if it has text in it), but then you're playing the prompt injection game... put the words in a QR code. Put them in french. Make it a sign. Dial the contrast up or down. Put it on a t-shirt.
It's very difficult to solve this.
> It's hard to believe they can't prevent this.
Believe it.
And after all, I'm not surprised. When I read their long research PDFs, often finishing with a question mark about emerging behaviors, I knew they don't know what they are playing with, with no more control than any neuroscience researcher.
This is too far from hacking spirit to me, sorry to bother.
For example, imagine a humanoid robot whose job is to bring in packages from your front door. Vision functionality is required to gather the package. If someone leaves a package with an image taped to it containing a prompt injection, the robot could be tricked into gathering valuables from inside the house and throwing them out the window.
Good post. Securing these systems against prompt injections is something we urgently need to solve.
If we bet on free will with a basis that machines somehow gain human morals, and if we think safety means figuring out "good" vs "bad" prompts - we will continue to feel the impact of surprise with these systems, evolving in harm as their capabilities evolve.
tldr; we need verifiable governance and behavioral determinism in these systems. as much as, probably more than, we need solutions for prompt injections.
0: https://embracethered.com/blog/posts/2024/hiding-and-finding...
We need AI because everyone is using AI, and without AI we won't have AI! Security is a small price to pay for AI, right? And besides, we can just have AI do the security.
But, even with those tokens, fundamentally these models are not "intelligent" enough to fully distinguish when they are operating on user input vs. system input.
In a traditional program, you can configure the program such that user input can only affect a subset of program state - for example, when processing a quoted string, the parser will only ever append to the current string, rather than creating new expressions. However, with LLMs, user input and system input is all mixed together, such that "user" and "system" input can both affect all parts of the system's overall state. This means that user input can eventually push the overall state in a direction which violates a security boundary, simply because it is possible to affect that state.
What's needed isn't "sudo tokens", it's a fundamental rethinking of the architecture in a way that guarantees that certain aspects of reasoning or behaviour cannot be altered by user input at all. That's such a large change that the result would no longer be an LLM, but something new entirely.
But then, security is not a feature, it's a cost. So long as the AI companies can keep upselling and avoid accountability for failures of AI, the stock will continue to go up, taking electricity prices along with it, and isn't that ultimately the only thing that matters? /s
Is this attack really just "inject obfuscated text into the image... and hope some system interprets this as a prompt"...?
There's an example of this in my bio.
Love it.
You can mitigate jail breaks but you can't prevent them, and since the consequences of an LLM being jail broken with exfiltration are so bad, you pretty much have to assume they will happen eventually.
Plenty of things we now take for granted did not work in their original iterations. The reason they work today is because there were scientists and engineers who were willing to persevere in finding a solution despite them apparently not working.