Qwen VLo: From "Understanding" the World to "Depicting" It (qwenlm.github.io)

132 points by lnyan 6 hours ago | 43 comments

rushingcreek 6 hours ago [-]

It doesn't seem to have open weights, which is unfortunate. One of Qwen's strengths historically has been their open-weights strategy, and it would have been great to have a true open-weights competitor to 4o's autoregressive image gen. There are so many interesting research directions that are only possible if we can get access to the weights.

If Qwen is concerned about recouping its development costs, I suggest looking at BFL's Flux Kontext Dev release from the other day as a model: let researchers and individuals get the weights for free and let startups pay for a reasonably-priced license for commercial use.

Jackson__ 5 hours ago [-]

It's also very clearly trained on OAI outputs, which you can tell from the orange tint to the images[0]. Did they even attempt to come up with their own data?

So it is trained off OAI, as closed off as OAI and most importantly: worse than OAI. What a bizarre strategy to gate-keep this behind an API.

[0] https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VLo/cas...

vachina 5 hours ago [-]

Huh, so orange tint = OpenAI output? Maybe their training process ended up causing the model to prefer that color balance.

Jackson__ 4 hours ago [-]

Here's an extreme example that shows how it continually adds more orange: https://old.reddit.com/r/ChatGPT/comments/1kawcng/i_went_wit...

It's really too close to be anything but a model trained on these outputs, the whole vibe just screams OAI.

acheong08 59 minutes ago [-]

That form of collapse might just be inherent to the methodology. Releasing the weights would be nice so people can figure out why

VladVladikoff 3 hours ago [-]

What would be the approximate cost of doing this? How many million API requests must be made? How many tokens in total?

refulgentis 2 hours ago [-]

The most pedantically correct answer is "mu", because both answers are derivable quantitatively from "How many images do you want to train on?", which in turn is answered by a qualitative question that doesn't admit numbers ("How high quality do you want it to be?").

Let's say it's 100 images because you're doing a quick LoRA. That'd be about $5.00 at medium quality (~$0.05/image) or $1 at low (~$0.01/image).

Let's say you're training a standalone image model. OOM of input images is ~1B, so $10M at low and $50M at medium.

250 tokens / image for low, ~1000 for medium, which gets us to:

Fastest LoRA? $1-$5. 25,000 - 100,000 tokens output. All the training data for a new image model? $10M-$50M, 250B - 1T tokens out.
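
A quick sketch to check the arithmetic above. It assumes the rough per-image prices and token counts quoted in this comment (~$0.01 and ~250 tokens at low quality, ~$0.05 and ~1000 tokens at medium); these are ballpark figures, not official pricing, and the function name `estimate` is made up for illustration.

```python
# Ballpark figures taken from the comment above; not official pricing.
PRICE_PER_IMAGE_USD = {"low": 0.01, "medium": 0.05}
TOKENS_PER_IMAGE = {"low": 250, "medium": 1000}


def estimate(num_images: int, quality: str) -> tuple[float, int]:
    """Return (total cost in USD, total output tokens) for a batch of images."""
    cost = num_images * PRICE_PER_IMAGE_USD[quality]
    tokens = num_images * TOKENS_PER_IMAGE[quality]
    return cost, tokens


if __name__ == "__main__":
    # 100 images for a quick LoRA, ~1B images for a standalone model.
    for label, n in [("quick LoRA", 100), ("standalone model", 1_000_000_000)]:
        for quality in ("low", "medium"):
            cost, tokens = estimate(n, quality)
            print(f"{label:>16} @ {quality:<6}: ${cost:,.2f}, {tokens:,} tokens")
```

Swapping in different per-image prices or dataset sizes only means editing the two dictionaries and the list of scenarios.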

echelon 5 hours ago [-]

The way they win is to be open. I don't get why China is shutting down open source. It was a knife at the jugular of US tech dominance.

Both Alibaba and Tencent championed open source (Qwen family of models, Hunyuan family of models), but now they've shut off the releases.

There's totally a play where models become loss-leader for SaaS/PaaS/IaaS and where they extinguish your closed competition.

Imagine spreading your model so widely then making the terms: "do not use in conjunction with closed source models".

yorwba 3 hours ago [-]

The problem with giving away weights for free while also offering a hosted API is that once the weights are out there, anyone else can also offer it as a hosted API with similar operating costs, but only the releasing company had the initial capital outlay of training the model. So everyone else is more profitable! That's not a good business strategy.

New entrants may keep releasing weights as a marketing strategy to gain name recognition, but once they have established themselves (and investors start getting antsy about ROI) making subsequent releases closed is the logical next step.

diggan 4 hours ago [-]

> I don't get why China is shutting down open source [...] now they've shut off the releases

What are you talking about? Feels like a very strong claim considering there are ongoing weight releases, wasn't there one just today or yesterday from a Chinese company?

diggan 5 hours ago [-]

> One of Qwen's strengths historically has been their open-weights strategy [...] let researchers and individuals get the weights for free and let startups pay for a reasonably-priced license for commercial use.

But if you're suggesting they should do open weights, doesn't that mean people should be able to use it freely?

You're effectively suggesting "trial-weights", "shareware-weights", "academic-weights" or something like that rather than "open weights", which to me would make it seem like you can use them for whatever you want, just like with "open source" software. But if it misses a large part of what makes "open source" open source, like "use it for whatever you want", then it kind of gives the wrong idea.

rushingcreek 5 hours ago [-]

I am personally in favor of true open source (e.g. Apache 2 license), but the reality is that these models are expensive to develop and many developers are choosing not to release their model weights at all.

I think that releasing the weights openly but with this type of dual-license (hence open weights, but not true open source) is an acceptable tradeoff to get more model developers to release models openly.

diggan 4 hours ago [-]

> but the reality is that these models are expensive to develop and many developers are choosing not to release their model weights at all.

But isn't that true for software too? Software is expensive to develop, and lots of developers/companies are choosing not to make their code public for free. Does that mean you also feel like it would be OK to call software "open source" although it doesn't allow usage for any purpose? That would then lead to more "open source" software being released, at least for individuals and researchers?

hmottestad 42 minutes ago [-]

I wouldn't equate model weights with source code. You can run software on your own machine without source code, but you can't run an LLM on your own machine without model weights.

Though, you could still sell the model weights for local use. Not sure if we are there yet that I myself could buy model weights, but of course if you are a very big company or a very big country then I guess most AI companies would consider selling you their model weights so you can run them on your own machine.

rushingcreek 4 hours ago [-]

Yes, I think the same analogy applies. Given a binary choice of a developer not releasing any code at all or releasing code under this type of binary "open-code" license, I'd always take the latter.

diggan 3 hours ago [-]

> Given a binary choice of a developer not releasing any code at all

I mean it wasn't binary earlier, it was "to get more model developers to release", so not a binary choice, but a gradient I suppose. Would you still make the same call for software as you do for ML models and weights?

dheera 4 hours ago [-]

> One of Qwen's strengths historically has been their open-weights strategy

> let researchers and individuals get the weights for free and let startups pay for a reasonably-priced license for commercial use

I'm personally doubtful companies can recoup tens of millions of dollars in investment, GPU hours, and engineering salaries from image generation fees.

echelon 5 hours ago [-]

The era of open weights from China appears to be over for some reason. It's all of a sudden and seems to be coordinated.

Alibaba just shut off the Qwen releases

Tencent just shut off the Hunyuan releases

Bytedance just released Seedream, but it's closed

It seems like it's over.

They're still clearly training on Western outputs, though.

I still suspect that the strategic thing to do would be to become 100% open and sell infra/service.

natrys 5 hours ago [-]

> Alibaba just shut off the Qwen releases

From the beginning, Alibaba has had some series of models that were always closed-weights (*-max, *-plus, *-turbo etc., but also QvQ). It's not a new development, nor does it prevent their open models. And the VL models are opened after 2-3 months of GA in the API.

> Tencent just shut off the Hunyuan releases

Literally released one today: https://huggingface.co/tencent/Hunyuan-A13B-Instruct

echelon 2 hours ago [-]

Hunyuan Image 2.0, which is of Flux quality but has ~20 milliseconds of inference time, is being withheld.

Hunyuan 3D 2.5, which is an order of magnitude better than Hunyuan 3D 2.1, is also being withheld.

I suspect that now that they feel these models are superior to Western releases in several categories, they no longer have a need to release these weights.

natrys 56 minutes ago [-]

> I suspect that now that they feel these models are superior to Western releases in several categories, they no longer have a need to release these weights.

Yes that I can totally believe. Standard corporation behaviour (Chinese or otherwise).

I do think DeepSeek would be an exception to this though. But they lack diversity in focus (not even multimodal yet).

jacooper 1 hours ago [-]

DeepSeek R1 0528, the flagship Chinese model, is open source. Qwen3 is open source. HiDream models are also open source.

pxc 5 hours ago [-]

Why? And can we really say that already? Wasn't the Qwen3 release still very recent?

logicchains 5 hours ago [-]

What do you mean Tencent just shut off the Hunyuan releases? There was another open weights release just today: https://huggingface.co/tencent/Hunyuan-A13B-Instruct . And the latest Qwen and DeepSeek open weight releases were under 2 months ago, there hasn't been enough time for them to finish a new version since then.

echelon 2 hours ago [-]

Hunyuan Image 2.0 and Hunyuan 3D 2.5 are not being released. They're being put into a closed source web-based offering.

afro88 29 minutes ago [-]

Strangely the image change examples (edits, style transfer etc) have that slight yellow tint that GPT Image 1 (ChatGPT 4o's latest image model) has. Why is that? Flux Kontext doesn't seem to do that

b0a04gl 5 hours ago [-]

the image gets compressed into 256 tokens before the language model sees it. ask it to add a hat and it redraws the whole face, because objects aren't stored as separate things. there's no persistent bear in memory; it all lives inside one fused latent soup, and each edit is a fresh sample under new constraints. every prompt tweak rebalances the whole embedding, which is why even small changes ripple across the image. i see it as single-shot scene synthesis, which is good for different use cases

leodriesch 5 hours ago [-]

That's what I really like about Flux Kontext, it has similar editing capabilities to the multimodal models, but doesn't mess up the details. The editing with gpt-image-1 only really works for complete style changes like "make this ghibli", but not adding glasses to a photorealistic image and have it retain all the details.

vunderba 2 hours ago [-]

Agreed. Kontext's ability to basically do the equivalent of img2img inpainting is hugely impressive.

Even when used to add new details it sticks very strongly to the existing image's overall aesthetic.

https://specularrealms.com/ai-transcripts/experiments-with-f...

hexmiles 5 hours ago [-]

While looking at the examples of editing the bear image, I noticed that the model seemed to change more things than were strictly asked.

As an example, when asked to change the background, it also completely changed the bear (it has the same shirt but the fur and face are clearly different). Also, when it turned the bear into a balloon, it changed the background (removing the pavement) and lost the left seed in the watermelon.

Is it something that can be fixed with better prompting, or is it a limitation of the model/architecture?

godelski 34 minutes ago [-]

  > Is it something that can be fixed with better prompting, or is it a limitation of the model/architecture?

Both. You can get better results through better prompting but the root cause of this is a limitation of the architecture and training methods (which are coupled).

skybrian 5 hours ago [-]

I tried the obligatory pelican riding a bicycle (as an image, not SVG) and some accordion images. It has a bit of trouble with fingers and with getting the black keys right. It’s fairly fast.

https://chat.qwen.ai/s/0f9d558c-2108-4350-98fb-6ee87065d587?...

godelski 3 hours ago [-]

As an ML researcher and a degree-holding physicist, I'm really hesitant to use the words "understanding" and "describing" (much less hesitant) around these models. I don't find the language helpful and think it's mostly hateful tbh.

The reason we use math in physics is because of its specificity. The same reason coding is so hard [0,1]. I think people aren't giving themselves enough credit here for how much they (you) understand about things. It is the nuances that really matter. There's so much detail here and we often forget how important they are because it is just normal to us. It's like forgetting about the ground you walk upon.

I think something everyone should read is Asimov's "Relativity of Wrong"[2]. This is what we want to see in these systems if we want to start claiming they understand things. We want to see them do deduction and abduction. To be able to refine concepts and ideas. To be able to discover things that are more than just a combination of things they've ingested. What's really difficult here is that we train these things on all human knowledge, and just reciting back that knowledge doesn't demonstrate intelligence. It's very unlikely that they losslessly compress that knowledge into these model sizes, but without very deep investigation into that data and probing at this knowledge it is very hard to understand what it knows and what it memorizes. Really, this is a very poor way to go about trying to make intelligence[3], or at least making intelligence and ending up knowing it is intelligent.

To really "understand" things we need to be able to propose counterfactuals[4]. Every physics statement is a counterfactual statement. Take F=ma as a trivial example. We can modify the mass or the acceleration to our heart's content and still determine the force. We can observe a specific mass moving at a specific acceleration and then ask the counterfactual "what if it was twice as heavy?" (twice the mass). *We can answer that!* In fact, your mental model of the world does this too! You may not be describing it with math (maybe you are ;) but you are able to propose counterfactuals and do a pretty good job a lot of the time. Doesn't mean you always need to be right though. But the way our heads work is through these types of systems. You daydream these things, you imagine them while you play, and all sorts of things. This, I can say, with high confidence, is not something modern ML (AI) systems do.

  == Edit ==

A good example of lack of understanding is the image OP uses. Not only does the right hand have the wrong number of fingers, but look at the keys on the keyboard. It does not take much understanding to recognize that you shouldn't have repeated keys... the configuration is all wonky too, like one of those dreams you can immediately tell is a dream[5]. I'd also be willing to bet that the number of keys doesn't align to the number of markers, and the sizing definitely looks off. The more you look at it the worse it gets, and that's really common among these systems. Nice at a quick glance but DEEP in the uncanny valley at more than a glance, and deeper the more you look.

[0] https://youtube.com/watch?v=cDA3_5982h8

[1] Code is math. There's an isomorphism between Turing complete languages and computable mathematics. You can look more into my namesake, Church, and Turing if you want to get more formal, or wait for the comment that corrects a nuanced mistake here (yes, it exists). Also, note that physics and math are not the same thing, but mathematics is unreasonably effective (yes, this is a reference).

[2] https://hermiene.net/essays-trans/relativity_of_wrong.html

[3] This is a very different statement than "making something useful." Without a doubt these systems are useful. Do not conflate these

[4] https://en.wikipedia.org/wiki/Counterfactual_thinking

[5] Yes, you can read in dreams. I do it frequently. Though on occasion I have lucid dreamed because I read something and noticed that it changed when I looked away and looked back.

BoorishBears 30 minutes ago [-]

As a person who builds stuff, I'm tired of these strawmen.

It is helpful that they chose words that are widely understood to represent input vs output.

They even used scare quotes to signal they're not making some overly grand claim in terms of the long tail implications of the terms.

A person reading the release would learn that previously Qwen had a VLM that could understand/see/perceive/whateverwordyouwanttouse and now it can generate images, which could be called depicting/drawing/portraying/whateverotherwordyouwanttouse.

We don't have to invent a crisis past that.

rickydroll 6 hours ago [-]

To my eyes, all these images hit the uncanny valley. All the colors and the shadows are just off.

poly2it 1 hours ago [-]

They are all really sloppy. I don't really see the use case for this sort of output outside of research.

djaychela 5 hours ago [-]

How do you stop the auto reading out? Why can't websites just sit there and wait until I ask for them to do something? It full screen auto played a video on watch and then just started reading?

Firefox on ios ftr

veltas 4 hours ago [-]

Rather I think machine learning has made a lot more progress 'depicting' the world than 'understanding' it.

ivape 4 hours ago [-]

Why do you think humans understand the world any better? We have emotion about the world but emotions do not grant you understanding, where “understanding” is still something you would need to define.

“I get it” - is actually just some arbitrary personal benchmark.

frotaur 6 hours ago [-]

Does anybody know if there is a technical report for this, or for other models that generate images in a similar way? I'd really like to understand the architecture behind 4o-like image gen.

aredox 6 hours ago [-]

I don't think these words mean what they think they do...

makingstuffs 6 hours ago [-]

[flagged]

v5v3 6 hours ago [-]

You can set most browsers to not auto play.

pxc 5 hours ago [-]

For this page, that setting doesn't seem to work, or gets ignored, at least in my browser (Firefox on macOS). The controls for the video are also hidden, although I can recover them if I use a pop-out view.
