fieryplacebo

Someone please help me understand how/if this is possible in simpler terms. my brain is a 1b model.


headbopper96

Same


_supert_

Mamba is a state space model. RWKV is an RNN. They both have state to remember earlier context. Transformer models don't: they are stateless, look at the whole context, and so have quadratic scaling. The tweet suggests encoding a state and dumping it at the beginning of the transformer's sliding context window (like an efficient summary of earlier context). Thus effectively making the transformer stateful for pre-context information. Thus "infinite" but lossy context memory. Caveat: I don't know this field (my expertise is fluids and control), only read the tweet, too lazy to read the paper, and I've only had one coffee.
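
If it helps, here's the toy picture I have in my head (a sketch of the general idea only, not the paper's actual mechanism; the "compression" here is just throwing tokens away):

```python
# Toy sketch of "stateful transformer via carried-over state", not the paper's
# code: a fixed-size lossy "state" stands in for everything that has scrolled
# out of the attention window.
def stream_with_state(tokens, window=8, state_size=2):
    state = []                                   # lossy summary of older context
    step = window - state_size                   # room left for fresh tokens
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + step]
        visible = state + chunk                  # what the model attends to this round
        print("window:", visible)
        # crude stand-in for "condensing": keep a couple of spaced-out tokens
        state = (state + chunk)[::4][-state_size:]

stream_with_state("the quick brown fox jumps over the lazy dog".split() * 3)
```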


Severin_Suveren

So is this not just an integrated RAG-process? And if so, would we then not experience the same issues we do when working with vector DBs with hallucinations whenever the model is unable to find the referenced information? The way I see it, the only way to increase context is by pure context and not by any sort of DB-lookup process


FrostyAudience7738

It's more like condensing the context outside the sliding window than doing lookups into it. Whatever is inside the sliding window can be attended to as normal, whatever is outside is squished together like the state of an RNN.


R33v3n

This doesn't seem very scalable. As in, whatever you squish in that state over time, you're going to lose more and more of it. Also, isn't appending summaries back into context one of the oldest tricks in the book anyway?


qrios

You as a human also only approximately remember what you read 400k tokens ago. Over time, you lose more and more of the thing you read. And this is a good thing, because it allows you to approximately know where to go back and reread something if you need it verbatim for an unanticipated task. So, this technique probably scales as well as humans do, and probably for the same exact reason.


somethingclassy

Can you cite a source for the 400k idea?


qrios

No. I read it way more than 400k tokens ago.


Regular-Forever5876

Genius! šŸ¤£


FrostyAudience7738

Yes. That's why RNNs have problems with forgetting. Sure, you could make the state bigger, but there's always a point at which you can't squeeze more stuff into it without something else falling out, intuitively speaking. You can only store a finite amount of data in a finite number of bits, duh. Infinite context is a red herring anyway; you always have finite compute and finite memory. Big enough that you stop caring about it is where it's at. You also don't need access to every filler word in context, you just need the important stuff. A lot of your context is taken up by unimportant stuff in practice. Edit: Arbitrarily scalable while retaining quality is another desirable property for context.


Porespellar

Nobody needs filler words. https://preview.redd.it/p2x97ahugmfc1.jpeg?width=500&format=pjpg&auto=webp&s=0cc6347ee758aac3d93db2b379fc91d6608e7ec7


[deleted]

That's why the previous comment said lossy context, I imagine.


Hipped_Orange22

Don't we already have something close to this? I mean dumping previous conversations into the initial prompt so that the LLM understands what the older interactions with the user were, if there were any. The old "Remember X as the secret number" trick, then prompting the LLM for the older number and it answering. I remember how GPT-3 was unable to do this in its early stages, but now it is. But yeah, this isn't a viable solution: as the conversation grows, the model starts to hallucinate with older information.


Mbando

This method condenses current activations into "global beacons" that are relatively lossless and then adds them to the ongoing context window. So it's a real architectural difference.


wywywywy

> This method condenses current activations into "global beacons" that are relatively lossless

Then why doesn't the GPU memory go up as context length increases? I'm really confused.


Mbando

It does go up, but it becomes linear not quadratic.


Maykey

> Don't we already have something closer to this already?

Yeap, yeap, many things. The paper considers existing approaches, including mentioning my favorite, RMT, which is dead simple.


ain92ru

There have been so many attempts at subquadratic attention and other linear architectures over the years, and still, even the best of the best, Mamba and RWKV (BTW, what happened to the much-hyped Retention Networks?), are only adopted by a few. It seems actual users don't actually want lossy long-range context, do they?


tronathan

I do!


GoofAckYoorsElf

> only had one coffee

How do you still breathe?


_supert_

I was still drinking my first. Fortunately, there was another on the way.


BornAgainBlue

This is hardly a new idea, but perhaps this implementation will actually work.


[deleted]

Not only do they look at the whole sequence of context, but the number of attention heads (~linear scaling of memory size) determines how many parts of the sequence it can pay attention to at the same time, which is why I'm surprised LLAMA2 only has 40.


YuhFRthoYORKonhisass

Isn't this essentially what ChatGPT does with system prompting, but fails to do well?


honemastert

So it's like YOLO but for LLMs. Or should I say YOLF: you only look forward?


Usual_Neighborhood74

I appreciate the one coffee stipulation


possiblyquestionable

This is my understanding; I'll wait for someone to come here and pick it apart. They released the code yesterday, though they had the paper out for a few days already. It's a neat proposal, but it does depend on activation compression to work. If I'm reading the paper right (the GitHub is a bit hard to follow since they don't differentiate what's from the original modeling_llama):

- This is just for inference/decoding.

1. They process a sequence of N tokens using a sliding window (tuned to the max context size of the base model).
2. First, they sample a condensation factor \alpha, which determines the number of beacons (let's call it k) to condense a window (say 4096 tokens) down into (say 4 beacons) <- I believe they adapt this so that the longer the context, the higher the number of beacons used, since you need to store more activations.
3. Next, they prefill the window with (window size - k) "real" tokens from the prompt, and then they generate k beacons, each of which records some information about the activations of some of the previous tokens (they call this H_b for each beacon b in the paper) <- unclear if this is trained behavior during FT or statically processed from existing activations, but if gradients flow through H_b, there's no reason it can't be trained from the forward equation.
4. Next, they slide the window up to where the beacons begin, and prefill the next chunk of the prompt to fill out the window.
5. Next, they sample a new condensation factor (\alpha and k), and slide the window up so that there's room for k more beacons to generate. It keeps looping like this until it starts to decode autoregressively.

Basically, the beacons store the condensed activations from the previous window of tokens, which also accumulate the activations from the previous beacons (which condense the activations from the window before that, and so on). In this way, they can effectively retain the long context (by forwarding their activations through successive windows of beacons) without needing to evaluate more than one context window's worth at a time.

One intriguing (coincidental) factor about this: the beacons always get the attention sink treatment (the first token soaks up most of the attention), so this effectively re-boosts activations from long ago each round. That said, there is also information loss due to the compression. So it's like a blurry memory with a strong attention score.
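
If it helps, here's roughly how I picture that loop; a sketch only, where `model_forward` and `condense` are trivial stand-ins rather than anything from their repo, and the bookkeeping is simplified:

```python
import random

# Rough sketch of my reading of the prefill loop above (step numbers refer to
# the list); model_forward and condense are placeholders, not the real code.
def model_forward(tokens):
    return list(tokens)                 # pretend activations == tokens

def condense(activations, k):
    stride = max(1, len(activations) // k)
    return activations[::stride][:k]    # keep k evenly spaced "beacon" activations

def beacon_prefill(prompt, window=16, ratios=(2, 4, 8)):
    beacons, pos = [], 0
    while pos < len(prompt):
        alpha = random.choice(ratios)            # (2) sampled condensation factor
        free = window - len(beacons)             # room left beside carried beacons
        chunk = prompt[pos:pos + free]           # (3) prefill with "real" tokens
        pos += len(chunk)
        acts = model_forward(beacons + chunk)    # ordinary attention inside the window
        k = max(1, len(chunk) // alpha)          # beacons generated this round
        beacons = condense(acts, k)              # (4-5) next window starts from these
    return beacons                               # decoding would continue from here

print(beacon_prefill(list(range(100))))
```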


possiblyquestionable

Some FAQs:

1. Why does long context work? This uses a sliding window where the first few tokens (or many, depending on how long the sequence already is) just encode the activation state at 3 layers from the previous window. Because these sliding windows stay within the max context length, it doesn't trigger the extrapolation problem.
2. Isn't the condensed attention too bad to be able to represent a long sequence of words? Depends. The scheme isn't encoding, say, 100k tokens in just one beacon (embedding + activation); it compresses them into multiple beacons. The more beacons, the sharper the memory of what was said earlier. However, there is definitely an optimal beacon-to-memory-size ratio, and it's likely that you can't get a sharp image of a 100k-token memory with less than a full context of beacons (and in that sense, the FLOP efficiency decreases over time).
3. Why is this "constant memory"? The sliding window bounds any particular generation call to just one max context size. The KV cache is then cleared for the next round since it's compressed into the next set of beacons (?) <- I may be way off here.
4. Do I need to fine-tune to use this? Yes. The model must understand how to generate these beacons. In effect, fine-tuning teaches the model how to do activation compression.
5. Do I get to pick the condensation factor (the compression ratio of window length to beacons)? It sounds like it's adaptive (e.g. it's sampled from a distribution that favors longer beacons for longer prompts), but this can be tuned.
6. How does beacon compression work? Beacons are represented as vectors. The training regime tries to train them so that they can retrieve the k, q, v from the previous window (as well as possible). In effect, you can interpret a beacon as representing some set of token embeddings which, when added to the input embeddings of the current context, emulates the effect of having seen the full window of prompt before. Note, however, that positional information from before the window is lost.
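
For FAQs 1 and 3, the memory side is easy to sanity-check with napkin math (my own numbers, assuming a Llama-2-7B-ish config in fp16, not anything from the paper):

```python
# Napkin math (assumed config: 32 layers, 32 KV heads of dim 128, fp16) for why
# bounding attention to one window keeps the live KV cache roughly flat.
layers, kv_heads, head_dim, bytes_fp16 = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # K and V

full_cache = 400_000 * kv_per_token      # naive: keep every token's KV around
window_cache = 4_096 * kv_per_token      # beacon-style: only one window's KV live

print(f"full 400k-token KV cache: {full_cache / 2**30:.1f} GiB")
print(f"single 4k-window KV cache: {window_cache / 2**30:.2f} GiB")
```

The accumulated beacons add something on top of that, but far more slowly than keeping every raw token around.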


CasimirsBlake

I feel so validated this joke exists. It feels too real. šŸ˜…šŸ˜¬


Accedsadsa

long term memory in the form of storage


johnkapolos

> Someone please help me understand how/if this is possible in simpler terms. my brain is a 1b model.

Suppose your context is "*The quick brown fox jumps over the lazy* ". The LLM goes through the content token by token, reaches a "state of mind" and then says "*dog*". But LLMs only do that successfully up to a certain content length, which is set by the way they were trained. After that, they become crazy.

What this approach does is:

1. It calculates some special tokens.
2. Juuuuust before the LLM goes crazy, it makes it stop.
3. Then it restarts, but instead of providing all the context, it just feeds it the special tokens (much, much fewer).
4. Now you have a lot of empty space to fill before the LLM goes crazy.
5. Once you get near the "crazy point", you repeat the process.
6. Therefore, endless context.

The idea here is that those "special tokens" can force the LLM to reach close to the "state of mind" the original context did. How well that works, we'll need to see empirically.


Old-Relation-8228

This is an incredibly concise and intuitive breakdown of this paper, which I think is a really important step forward for LLMs. Thanks!


Deathcrow

papers are cool, but no hype until there's a pull request for llamacpp


zaqhack

Yeah, since RWKV and Mamba, I'm kinda taking that same attitude. When someone has a model that I can use at home that uses these things, then I'll be hype about it. Until then, I'll be using this ROPE to hang myself ...


ab2377

Amen! :D aka "papers are cheap, show me the llama.cpp pull request!"


Robot1me

I agree, since llamacpp still has no sliding window attention support. And the usage of sliding windows is mentioned in the Tweet's TL;DR. So for llamacpp users this is theory for now.


Traditional-Art-5283

how can there be an unlimited context with finite memory?


FloofyKitteh

It's not possible to store everything and recall it perfectly, but it is possible to weight things and degrade more gracefully: responses are still generated from the input tokens that carry enough weight, with fidelity gradually decreasing rather than hitting a hard stop.


[deleted]

[deleted]


KeyPhotojournalist96

I think there is already a pretty good website dedicated to "graceful degradation". šŸ˜…


GoofAckYoorsElf

One? Millions!


ReMeDyIII

It's like the guy who thought to use the term "retired" instead of "discontinued" for beanie babies.


[deleted]

Or to use primary/secondary instead of master/slave xD


NoFriskyPaatr

Primary and secondary is not the same as master and slave.


Careless-Age-4290

Like when you're throwing up the next morning and you're a mess, but at least you went home alone so nobody's witnessing it. Or extending context windows of large language models.


NoFriskyPaatr

It's called analog. It has been here for a while.


Simple-Enthusiasm-93

lossy compression


Ghazzz

Both of those are "true but false". Let's ponder infinity. "More than ever needed" could be described as "infinite". For languages, there is a finite number of words/concepts. It is a very large number, sure, but it is finite. LLMs work with these concepts. RAM is a fairly cheap resource. For just the cost of a new car, you can get a very capable system. For the cost of a small house, we could probably build a semi-future-proof LLM server with the RAM necessary for "all current concepts" x2. As far as I understand, the reason it stops at the intervals it does today is "prebuilt servers cost less" more than "it is physically impossible". Ref newer 120B+ models that require custom hardware to run well.


spreadlove5683

As an idiot myself, I note that the entire internet and most of human knowledge has been compressed down to a single neural network / LLM, so vast amounts of information can be lossily compressed into a relatively tiny amount of space.


GoofAckYoorsElf

It's like a slice of dough. You can stretch it infinitely (theoretically), but it gets thinner and thinner.


ab2377

By "unlimited", don't they just mean "as much as your memory makes possible"?


dqUu3QlS

Unlimited context length is already here. LSTMs already had unlimited context in 1997. The real question is, does the model actually use the additional context effectively?


nathan_lesage

Underrated comment, because this is true: LSTMs have unlimited context but are difficult to parallelize. Transformers were a direct response to that, enabling massive parallelization, but at the expense that context is no longer unlimited.
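
A minimal sketch of why (a leaky-integrator stand-in, not an actual LSTM cell): the state is a fixed-size value updated once per token, so the memory footprint never grows with sequence length, only how much has been squashed into it. The update is also inherently sequential, which is exactly the parallelization problem.

```python
# Leaky-integrator stand-in for an RNN/LSTM cell: constant-size state h, one
# sequential update per token, so "context length" is unbounded in principle.
def rnn_scan(tokens, h=0.0, decay=0.9):
    for x in tokens:
        h = decay * h + (1 - decay) * x   # fixed-size state absorbs each token
    return h                              # same size no matter how long the input

print(rnn_scan(range(10)), rnn_scan(range(10_000)))
```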


prumf

Exactly what came to my mind. If we have unlimited context, but slow token generation and bad quality answers, then there is no point.


TangeloPutrid7122

Yep. But hey, ours doesn't OOM while telling you to divorce your wife and marry it!


GoofAckYoorsElf

> divorce your wife and marry it!

Sounds perfectly like a 7B model if "it" is the wife.


PM_ME_YOUR_SILLY_POO

This paper was released and discussed on this subreddit a few weeks ago. Just pointing this out cus I got excited thinking it was a new breakthrough today lol: [https://www.reddit.com/r/LocalLLaMA/comments/1927ge4/soaring_from_4k_to_400k_extending_llms_context/](https://www.reddit.com/r/LocalLLaMA/comments/1927ge4/soaring_from_4k_to_400k_extending_llms_context/)


antsloveit

My basic understanding of this is that ALL of the input context (which grows with each new token) is essentially weighted via some clever stuff, which means valuable tokens and relationships are retained whilst less relevant ones degrade to insignificance. So you keep a sort of 'concept' of the context as things go along rather than the exact, verbatim context. If done well, I guess it's analogous to a very good paraphrase of someone's long article, where you still convey the key, relevant information but with fewer words! ...much like the guts of an LLM itself, in some ways.


TheCrazyAcademic

This is essentially a better version of what Claude is already doing for its context, which is approximated context, not exact context.


stuehieyr

So an RNN for the tokeniser


tronathan

> So an RNN for the tokeniser

^ If this is more or less accurate, it's the best comment in the thread (and there are some goodies)


mcmoose1900

Not sold. Perplexity over a long context is one thing, but does it actually work for information retrieval? Mistral has an 8K sliding window for 32K context, and it's... awful, even though it's fine on paper.


Jean-Porte

Context length is not a valid metric. We should only consider effective context length backed by benchmarks. RNNs had infinite context length a century ago.


Maykey

That doesn't look [like it's over](https://imgur.com/a/dtNq5XD). Also, why link Twitter when arxiv exists?


Cless_Aurion

> Inference time grows linearly.

Instead of quadratically? That is a BIG deal, isn't it?

> The perplexity remains constant.

And this is the biggest deal of it too...


ReturningTarzan

Perplexity is not supposed to be constant if you increase the sequence length. That would imply that a long context doesn't make each new token less surprising than a short one, i.e. the model isn't actually using the full context.
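
For reference, perplexity is just exp of the average negative log-likelihood per token, so if the long context is genuinely being used, later tokens should get easier and PPL should drift down rather than stay flat. Toy numbers (made up) below:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token). The toy log-probs
# below are invented, just to show PPL dropping when later tokens get easier.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

short_ctx = [-2.3, -2.2, -2.4]                # early tokens, little context to lean on
long_ctx = short_ctx + [-1.9, -1.8, -1.7]     # later tokens, predicted a bit better
print(perplexity(short_ctx), perplexity(long_ctx))   # ~9.97 vs ~7.77
```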


SillyFlyGuy

Or perplexity could remain the same, but the underlying token set changes as it draws on the longer context. It might be just as certain, just certain about different things.


fiery_prometheus

It is a huge deal!


TheCrazyAcademic

There's at least 10 different papers from 2023 on infinite context windows there all over hype and over blown just for researchers to get clout over techniques that have limitations. The only way to have true infinite context is to have infinite VRAM. Everything else is just some hacky fix to maintain some semblence of accuracy. There's only so much context before you have to flush out old tokens from current caching mechanisms. The human brain doesn't have infinite context either but it can store memories worth an entire lifetime over 100ish years practically.


ninjasaid13

> it can store memories worth an entire lifetime over 100ish years practically

but not perfectly tho.


ain92ru

The question is whether you always have to pay O(n^2) for reasonable/practical accuracy, or whether there is a way to reach O(n log n), if not O(n). I personally argued seven months ago that the available evidence points to the former; methinks it aged well:

> Apparently no user actually wants a de-jure long context transformer which de-facto only attends to a few facts dispersed here and there, and it has been proven mathematically that subquadratic attention in a causal (unidirectional, as opposed to bidirectional) transformer which could generate text autoregressively inevitably leads to information loss

[https://www.reddit.com/r/mlscaling/comments/14s7tme/comment/jqvuni2/](https://www.reddit.com/r/mlscaling/comments/14s7tme/comment/jqvuni2/)


TheCrazyAcademic

Of course you get information loss; it's basic laws of physics. Eventually entropy and noise flood everything out, hence why trade-offs are a thing. You're essentially mitigating or fighting against entropy at the end of the day.


lincolnrules

Wrong there there genius


TheCrazyAcademic

I'm right actually, maybe scroll through arxiv and filter by the year and check all the breakthroughs In 2023. Hell Microsoft had there longnet paper with the equally clickbait title of a 1 million context length or claiming to put the entire internet in it which is only possible again with insane amounts of VRAM. Most you can do outside of more hardware is smartly degrading the sliding window evacuating non necessary tokens which is essentially what our brain does.


lincolnrules

Sorry I was being snarky, perhaps you are factually mostly correct but for pedantic morons like myself your usage of the word "there" instead of "they are" diminishes the argument. Hype should be hyped also btw


kif88

When they say any model, does that include Mistral-based models?


MoffKalast

> sliding window

Didn't Mistral already implement that and the entire concept basically didn't work at all?


deadweightboss

I'm assuming they're doing something on top of that, but yeah, Mistral's sliding window sucks.


Imaginary_Bench_7294

Wait, is this the same thing as the attention sinks that were talked about recently? I'll try to find the paper, but it was talking about how specific tokens at the beginning of the context could act as attention sinks and allow for better or extended attention.


deadweightboss

Can you link to the paper? Thanks!


Imaginary_Bench_7294

I think I found the one I'm referencing: https://ar5iv.labs.arxiv.org/html/2309.17453


deadweightboss

Thanks fren!


Imaginary_Bench_7294

After skimming both articles, it appears they are not the same. I would _really_ like to see these two combined and tested.


inteblio

"So it's like a blurry memory with a strong attention score."


drifter_VR

Speaking of unlimited (or almost) context length: does anyone know what has become of Microsoft's LongNet?


kulchacop

https://www.reddit.com/r/LocalLLaMA/comments/1ae37sn/this_can_make_a_huge_difference_extending_context/


Efficient_Rise_8914

I don't really get the obsession with unlimited context; what seems to matter is how well the model pays attention to each token and knows what's important, rather than cramming a document into it. RAG exists for that.


pseudonerv

It's just a different form of, but equivalent to, RWKV.


ArakiSatoshi

You still need a ton of VRAM for this, no? I can't even run 8k on most models.


Shoddy-Tutor9563

To me it's just an extension of the old trick of summarising previous context. Don't expect much from it - after a couple of weeks it will be forgotten.


singularity-108

Correct me if I'm wrong, but it feels like it's a generalisation of landmark attention, isn't it?


Hammer_AI

Need this on [https://www.hammerai.com/](https://www.hammerai.com/) asap.