TransMLA: Multi-head latent attention is all you need (arxiv.org)

90 points by ocean_moist 9 hours ago | 25 comments

magicalhippo 59 minutes ago [-]

I'm just following the field from the sidelines, but this looks interesting to me. Especially the increase in expressiveness that the new model allows for over GQA, at the cost of just ~10% more memory, and the fact that you can convert existing GQA models like LLaMA, Qwen etc with just a bit of fine-tuning.

Perhaps a trivial insight, but I feel a lot of progress comes in the form of generalizations, where existing approaches can be seen as special cases. Here the authors show that Group Query Attention (GQA) and Multi-Query Attention (MQA) fall out as special cases of their new model.

octocop 4 hours ago [-]

These titles need to stop, we've seen that in fact it is not all you need.

ghc 55 seconds ago [-]

It's become the equivalent of the stupid faces on YouTube thumbnails.

insin 2 hours ago [-]

Why we're moving away from all you need considered harmful

jsheard 1 hour ago [-]

Those words are all you need to get to the top of HN though. Think of the karma!

seeknotfind 4 hours ago [-]

All you need titles stopping is all you need.

Etheryte 3 hours ago [-]

All you need is love, and for these titles to stop. (But they won't do that.)

EGreg 3 hours ago [-]

We need more than that, and all you need to stop saying that!!

tankenmate 4 hours ago [-]

The title of this paper is a reference to a previous paper titled "Attention Is All You Need"[0][1]. This seminal work described the transformer model that is the basis for almost all LLMs, and is almost certainly the most cited paper on AI even though it was only published in 2017.

[0] https://arxiv.org/abs/1706.03762 [1] https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

kristopolous 3 hours ago [-]

Right, it's an 8 year old reference that's been made hundreds of times.

People seem to love going to the references graveyard, digging up tired and dead ones and dragging them around town hoping everyone thinks they're clever.

Also this was from 3 months ago.

nihzm 3 hours ago [-]

It has definitely been overused by too many authors. This reminds me of a passage from Orwell's essay "Politics and the English Language":

> A newly-invented metaphor assists thought by evoking a visual image, while on the other hand a metaphor which is technically "dead" (e.g., iron resolution) has in effect reverted to being an ordinary word and can generally be used without loss of vividness. But in between these two classes there is a huge dump of worn-out metaphors which have lost all evocative power and are merely used because they save people the trouble of inventing phrases for themselves.

tankenmate 3 hours ago [-]

By that argument you must also hate anything that mentions the term "considered harmful", or makes any form of derivative cultural reference (like just about every episode of the Simpsons). Why do you let it get to you?

netdevphoenix 3 hours ago [-]

Why is this the most cited paper in AI and not the original 1943 paper that started it all?

zaptrem 3 hours ago [-]

Transformers are what made ML infinitely scalable and caused a huge amount of progress in very few years since everyone could just go scale things. However, idk how many of those papers actually even cite the transformer paper?

netdevphoenix 1 minute ago [-]

As I understand, the transformer architecture is built on deep learning. Would you say that transformers made a bigger progress RELATIVE to the progress made by deep learning?

tankenmate 2 hours ago [-]

I just checked Google Scholar, not perfect but good as an indication: "A logical calculus of the ideas immanent in nervous activity" [WS McCulloch, W Pitts - The bulletin of mathematical biophysics, 1943] has ~33,000 citations, and "Attention is all you need" [A Vaswani, N Shazeer, et al, Advances in Neural Information Processing Systems, 2017] has ~180,000 citations.

tankenmate 3 hours ago [-]

Probably because the modern "publish or perish" mantra led to exponential growth in publications, and "newer is better" means that newer impactful papers get cited more than older impactful publications. But that thesis is probably a paper in itself (of the meta-analysis, navel-gazing variety).

jbellis 3 hours ago [-]

[abstract] This approach significantly reduces the KV cache size relative to traditional multi-head attention

[3.3] For saving the KV cache, only the intermediate latent representations need to be stored: [latex] where r is much smaller than n_h · d_h

[background] In traditional multi-head attention you must cache full key and value matrices of size T x (n_h · d_h), where T is the token length, n_h is the number of attention heads, and d_h is the dimensionality of each individual head

sounds like a big win for memory constrained environments like local inference
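To put rough numbers on that, here is a back-of-the-envelope sketch. The model shape, the latent width r, the fp16 element size, and the choice of one latent each for keys and values are all assumptions picked for illustration, not figures from the paper.

```python
def kv_cache_bytes(tokens, width_per_token, n_layers, bytes_per_elem=2):
    """Bytes needed to cache `width_per_token` values per token in every layer."""
    return tokens * width_per_token * n_layers * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
T, n_h, d_h, n_layers = 4096, 32, 128, 32
r = 512  # assumed latent width, with r much smaller than n_h * d_h

# Traditional MHA: full-width keys and values, each T x (n_h * d_h).
mha = kv_cache_bytes(T, 2 * n_h * d_h, n_layers)

# Latent caching: assume one width-r latent for keys and one for values.
latent = kv_cache_bytes(T, 2 * r, n_layers)

print(f"MHA cache:    {mha / 2**20:.0f} MiB")
print(f"latent cache: {latent / 2**20:.0f} MiB")
print(f"reduction:    {mha / latent:.0f}x")
```

With these made-up sizes the saving is simply (n_h · d_h) / r, so the win depends entirely on how small r can be made without hurting quality.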

wiz21c 4 hours ago [-]

Not quite related, but are the Mamba models gaining ground?

Answering my own question: https://www.reddit.com/r/MachineLearning/comments/1hpg91o/d_...

olq_plo 7 hours ago [-]

Very cool idea. Can't wait for converted models on HF.

kristel100 4 hours ago [-]

Still wrapping my head around this architecture, but the idea of reducing headcount while maintaining performance is compelling. Would love to see a benchmark against something like FlashAttention.

kavalg 6 hours ago [-]

My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to further fine-tune it. Should make inference faster.

yorwba 6 hours ago [-]

It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.
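A small NumPy sketch of that equivalence as I understand it (my own toy construction, not the paper's code; the shapes and the names `W_k` and `R` are made up for illustration). Repeating each key head across its query group is a fixed linear map, so the GQA keys can be rewritten as a cached latent times an up-projection:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_h, n_kv, d_h = 5, 64, 8, 2, 16  # toy sizes, assumptions only
group = n_h // n_kv                           # query heads per KV head

X = rng.standard_normal((T, d_model))
W_k = rng.standard_normal((d_model, n_kv * d_h))  # GQA key projection

# GQA: cache the small keys, then repeat each KV head for its query group.
K_small = X @ W_k                                             # (T, n_kv * d_h)
K_gqa = np.repeat(K_small.reshape(T, n_kv, d_h), group, axis=1).reshape(T, n_h * d_h)

# The repetition written as a matrix: R copies each KV head block `group` times.
R = np.kron(np.repeat(np.eye(n_kv), group, axis=1), np.eye(d_h))  # (n_kv*d_h, n_h*d_h)

# MLA-style view: cache the latent X @ W_k, then up-project with R.
K_mla = (X @ W_k) @ R

print(np.allclose(K_gqa, K_mla))          # both views give identical keys
print(np.linalg.matrix_rank(W_k @ R))     # rank of the full-width key map
```

The cached tensor is the same `K_small` in both views, which is why the KV cache size does not change. The difference is that `R` starts out as a fixed copy matrix but becomes a trainable up-projection once you treat the model as MLA, which is where the extra parameters and expressiveness come from.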

freeqaz 6 hours ago [-]

Also makes models smarter ("expressive")

EGreg 3 hours ago [-]

All you need to stop posting titles like that!
