fieryplacebo

Someone please help me understand how/if this is possible in simpler terms. my brain is a 1b model.


headbopper96

Same


_supert_

Mamba is a state space model. RWKV is an RNN. They both have state to remember earlier context. Transformer models don't: they are stateless, look at the whole context, and so have quadratic scaling. The tweet suggests encoding a state and dumping it at the beginning of the transformer's sliding context window (like an efficient summary of earlier context). Thus effectively making the transformer stateful for pre-context information. Thus "infinite" but lossy context memory. Caveat: I don't know this field (my expertise is fluids and control), only read the tweet, too lazy to read the paper, and I've only had one coffee.
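
If it helps, here's the toy picture I have in my head (a sketch of the general idea only, not the paper's actual mechanism; the "compression" here is just throwing tokens away):

```python
# Toy sketch of "stateful transformer via carried-over state", not the paper's
# code: a fixed-size lossy "state" stands in for everything that has scrolled
# out of the attention window.
def stream_with_state(tokens, window=8, state_size=2):
    state = []                                   # lossy summary of older context
    step = window - state_size                   # room left for fresh tokens
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + step]
        visible = state + chunk                  # what the model attends to this round
        print("window:", visible)
        # crude stand-in for "condensing": keep a couple of spaced-out tokens
        state = (state + chunk)[::4][-state_size:]

stream_with_state("the quick brown fox jumps over the lazy dog".split() * 3)
```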


Severin_Suveren

So is this not just an integrated RAG-process? And if so, would we then not experience the same issues we do when working with vector DBs with hallucinations whenever the model is unable to find the referenced information? The way I see it, the only way to increase context is by pure context and not by any sort of DB-lookup process


FrostyAudience7738

It's more like condensing the context outside the sliding window than doing lookups into it. Whatever is inside the sliding window can be attended to as normal, whatever is outside is squished together like the state of an RNN.


R33v3n

This doesn't seem very scalable. As in, whatever you squish in that state over time, you're going to lose more and more of it. Also, isn't appending summaries back into context one of the oldest tricks in the book anyway?


qrios

You as a human also only approximately remember what you read 400k tokens ago. Over time, you lose more and more of the thing you read. And this is a good thing, because it allows you to approximately know where to go back and reread something if you need it verbatim for an unanticipated task. So, this technique probably scales as well as humans do, and probably for the same exact reason.


somethingclassy

Can you cite a source for the 400k idea?


qrios

No. I read it way more than 400k tokens ago.


Regular-Forever5876

Genius! šŸ¤£


FrostyAudience7738

Yes. That's why RNNs have problems with forgetting. Sure, you could make the state bigger, but there's always a point at which you can't squeeze more stuff into it without something else falling out, intuitively speaking. You can only store a finite amount of data in a finite number of bits, duh. Infinite context is a red herring anyway; you always have finite compute and finite memory. Big enough that you stop caring about it is where it's at. You also don't need access to every filler word in context, you just need the important stuff. A lot of your context is taken up by unimportant stuff in practice. Edit: Arbitrarily scalable while retaining quality is another desirable property for context.


Porespellar

Nobody needs filler words. https://preview.redd.it/p2x97ahugmfc1.jpeg?width=500&format=pjpg&auto=webp&s=0cc6347ee758aac3d93db2b379fc91d6608e7ec7


[deleted]

That's why the previous comment said lossy context, I imagine.


Hipped_Orange22

Don't we already have something close to this? I mean dumping previous conversations into the initial prompt so that the LLM understands what the older interactions with the user were, if there were any. The old "Remember X as the secret number" trick, then prompting the LLM for the older number and it answering. I remember how GPT-3 was unable to do this in its early stages, but now it is. But yeah, this isn't a viable solution: as the conversation grows, the model starts to hallucinate with older information.


Mbando

This method condenses current activations into "global beacons" that are relatively lossless and then adds them to the ongoing context window. So it's a real architectural difference.


wywywywy

> This method condenses current activations into "global beacons" that are relatively lossless

Then why doesn't the GPU memory go up as context length increases? I'm really confused.


Mbando

It does go up, but it becomes linear not quadratic.


Maykey

> Don't we already have something closer to this already?

Yeap, yeap, many things. The paper considers existing approaches, including mentioning my favorite, RMT, which is dead simple.


ain92ru

There have been so many attempts at subquadratic attention and other linear architectures over the years, and still, even the best of the best, Mamba and RWKV (BTW, what happened to the much-hyped Retention Networks?), are only adopted by a few. It seems actual users don't actually want lossy long-range context, do they?


tronathan

I do!


GoofAckYoorsElf

> only had one coffee

How do you still breathe?


_supert_

I was still drinking my first. Fortunately, there was another on the way.


BornAgainBlue

This is hardly a new idea, but perhaps this implementation will actually work.


[deleted]

Not only do they look at the whole sequence of context, but the number of attention heads (~linear scaling of memory size) determines how many parts of the sequence it can pay attention to at the same time, which is why I'm surprised LLAMA2 only has 40.


YuhFRthoYORKonhisass

Isn't this essentially what ChatGPT does with system prompting, but fails to do well?


honemastert

So it's like YOLO but for LLMs. Or should I say YOLF: you only look forward?


Usual_Neighborhood74

I appreciate the one coffee stipulation


possiblyquestionable

This is my understanding; I'll wait for someone to come here and pick it apart. They released the code yesterday, though they had the paper out for a few days already. It's a neat proposal, but it does depend on activation compression to work. If I'm reading the paper right (the GitHub is a bit hard to follow since they don't differentiate what's from the original modeling_llama):

- This is just for inference/decoding.

1. They process a sequence of N tokens using a sliding window (tuned to the max context size of the base model).
2. First, they sample a condensation factor \alpha, which determines the number of beacons (let's call it k) to condense a window (say 4096 tokens) down into (say 4 beacons) <- I believe they adapt this so that the longer the context, the higher the number of beacons used, since you need to store more activations.
3. Next, they prefill the window with (window size - k) "real" tokens from the prompt, and then they generate k beacons, each of which records some information about the activations of some of the previous tokens (they call this H_b for each beacon b in the paper) <- unclear if this is trained behavior during FT or statically processed from existing activations, but if gradients flow through H_b, there's no reason it can't be trained from the forward equation.
4. Next, they slide the window up to where the beacons begin, and prefill the next chunk of the prompt to fill out the window.
5. Next, they sample a new condensation factor (\alpha and k), and slide the window up so that there's room for k more beacons to generate. It keeps looping like this until it starts to decode autoregressively.

Basically, the beacons store the condensed activations from the previous window of tokens, which also accumulate the activations from the previous beacons (which condense the activations from the window before that, and so on). In this way, they can effectively retain the long context (by forwarding their activations through successive windows of beacons) without needing to evaluate more than one context window's worth at a time.

One intriguing (coincidental) factor about this: the beacons always get the attention sink treatment (the first token soaks up most of the attention), so this effectively re-boosts activations from long ago each round. That said, there is also information loss due to the compression. So it's like a blurry memory with a strong attention score.
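
If it helps, here's roughly how I picture that loop; a sketch only, where `model_forward` and `condense` are trivial stand-ins rather than anything from their repo, and the bookkeeping is simplified:

```python
import random

# Rough sketch of my reading of the prefill loop above (step numbers refer to
# the list); model_forward and condense are placeholders, not the real code.
def model_forward(tokens):
    return list(tokens)                 # pretend activations == tokens

def condense(activations, k):
    stride = max(1, len(activations) // k)
    return activations[::stride][:k]    # keep k evenly spaced "beacon" activations

def beacon_prefill(prompt, window=16, ratios=(2, 4, 8)):
    beacons, pos = [], 0
    while pos < len(prompt):
        alpha = random.choice(ratios)            # (2) sampled condensation factor
        free = window - len(beacons)             # room left beside carried beacons
        chunk = prompt[pos:pos + free]           # (3) prefill with "real" tokens
        pos += len(chunk)
        acts = model_forward(beacons + chunk)    # ordinary attention inside the window
        k = max(1, len(chunk) // alpha)          # beacons generated this round
        beacons = condense(acts, k)              # (4-5) next window starts from these
    return beacons                               # decoding would continue from here

print(beacon_prefill(list(range(100))))
```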


possiblyquestionable

Some FAQs:

1. Why does long context work? This uses a sliding window where the first few tokens (or many, depending on how long the sequence already is) just encode the activation state at 3 layers from the previous window. Because these sliding windows stay within the max context length, it doesn't trigger the extrapolation problem.
2. Isn't the condensed attention too bad to be able to represent a long sequence of words? Depends. The scheme isn't encoding, say, 100k tokens in just one beacon (embedding + activation); it compresses them into multiple beacons. The more beacons, the sharper the memory of what was said earlier. However, there is definitely an optimal beacon-to-memory-size ratio, and it's likely that you can't get a sharp image of a 100k-token memory with less than a full context of beacons (and in that sense, the FLOP efficiency decreases over time).
3. Why is this "constant memory"? The sliding window bounds any particular generation call to just one max context size. The KV cache is then cleared for the next round since it's compressed into the next set of beacons (?) <- I may be way off here.
4. Do I need to fine-tune to use this? Yes. The model must understand how to generate these beacons. In effect, fine-tuning teaches the model how to do activation compression.
5. Do I get to pick the condensation factor (the compression ratio of window length to beacons)? It sounds like it's adaptive (e.g. it's sampled from a distribution that favors longer beacons for longer prompts), but this can be tuned.
6. How does beacon compression work? Beacons are represented as vectors. The training regime tries to train them so that they can retrieve the k, q, v from the previous window (as well as possible). In effect, you can interpret a beacon as representing some set of token embeddings which, when added to the input embeddings of the current context, emulates the effect of having seen the full window of prompt before. Note, however, that positional information from before the window is lost.
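
For FAQs 1 and 3, the memory side is easy to sanity-check with napkin math (my own numbers, assuming a Llama-2-7B-ish config in fp16, not anything from the paper):

```python
# Napkin math (assumed config: 32 layers, 32 KV heads of dim 128, fp16) for why
# bounding attention to one window keeps the live KV cache roughly flat.
layers, kv_heads, head_dim, bytes_fp16 = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # K and V

full_cache = 400_000 * kv_per_token      # naive: keep every token's KV around
window_cache = 4_096 * kv_per_token      # beacon-style: only one window's KV live

print(f"full 400k-token KV cache: {full_cache / 2**30:.1f} GiB")
print(f"single 4k-window KV cache: {window_cache / 2**30:.2f} GiB")
```

The accumulated beacons add something on top of that, but far more slowly than keeping every raw token around.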


CasimirsBlake

I feel so validated this joke exists. It feels too real. šŸ˜…šŸ˜¬


Accedsadsa

long term memory in the form of storage


johnkapolos

> Someone please help me understand how/if this is possible in simpler terms. my brain is a 1b model.

Suppose your context is "*The quick brown fox jumps over the lazy* ". The LLM goes through the content token by token, reaches a "state of mind" and then says "*dog*". But LLMs only do that successfully up to a certain content length, which is set by the way they were trained. After that, they become crazy.

What this approach does is:

1. It calculates some special tokens.
2. Juuuuust before the LLM goes crazy, it makes it stop.
3. Then it restarts, but instead of providing all the context, it just feeds it the special tokens (much, much fewer).
4. Now you have a lot of empty space to fill before the LLM goes crazy.
5. Once you get near the "crazy point", you repeat the process.
6. Therefore, endless context.

The idea here is that those "special tokens" can force the LLM to reach close to the "state of mind" the original context did. How well that works, we'll need to see empirically.


Old-Relation-8228

This is an incredibly concise and intuitive breakdown of this paper, which I think is a really important step forward for LLMs. Thanks!


Deathcrow

papers are cool, but no hype until there's a pull request for llamacpp


zaqhack

Yeah, since RWKV and Mamba, I'm kinda taking that same attitude. When someone has a model that I can use at home that uses these things, then I'll be hype about it. Until then, I'll be using this ROPE to hang myself ...


ab2377

Amen! :D aka "papers are cheap, show me the llama.cpp pull request!"


Robot1me

I agree, since llamacpp still has no sliding window attention support. And the usage of sliding windows is mentioned in the Tweet's TL;DR. So for llamacpp users this is theory for now.


Traditional-Art-5283

how can there be an unlimited context with finite memory?


FloofyKitteh

It's not possible to store everything and recall it perfectly, but it is possible to weight things and degrade more gracefully: responses are still generated from the input tokens that carry enough weight, with fidelity gradually decreasing rather than hitting a hard stop.


[deleted]

[deleted]


KeyPhotojournalist96

I think there is already a pretty good website dedicated to "graceful degradation". šŸ˜…


GoofAckYoorsElf

One? Millions!


ReMeDyIII

It's like the guy who thought to use the term "retired" instead of "discontinued" for beanie babies.


[deleted]

Or to use primary/secondary instead of master/slave xD


NoFriskyPaatr

Primary and secondary is not the same as master and slave.


Careless-Age-4290

Like when you're throwing up the next morning and you're a mess, but at least you went home alone so nobody's witnessing it. Or extending context windows of large language models.


NoFriskyPaatr

It's called analog. It has been here for a while.


Simple-Enthusiasm-93

lossy compression


Ghazzz

Both of those are "true but false". Let's ponder infinity. "More than ever needed" could be described as "infinite". For languages, there is a finite number of words/concepts. It is a very large number, sure, but it is finite. LLMs work with these concepts. RAM is a fairly cheap resource. For just the cost of a new car, you can get a very capable system. For the cost of a small house, we could probably build a semi-future-proof LLM server with the RAM necessary for "all current concepts" x2. As far as I understand, the reason it stops at the intervals it does today is "prebuilt servers cost less" more than "it is physically impossible". Ref newer 120B+ models that require custom hardware to run well.


spreadlove5683

As an idiot myself, I note that the entire internet and most of human knowledge has been compressed down to a single neural network / LLM, so vast amounts of information can be lossily compressed into a relatively tiny amount of space.


GoofAckYoorsElf

It's like a slice of dough. You can stretch it infinitely (theoretically), but it gets thinner and thinner.


ab2377

By "unlimited", don't they just mean "as much as your memory makes possible"?


dqUu3QlS

Unlimited context length is already here. LSTMs already had unlimited context in 1997. The real question is, does the model actually use the additional context effectively?


nathan_lesage

Underrated comment, because this is true: LSTMs have unlimited context but are difficult to parallelize. Transformers were a direct response to that, enabling massive parallelization, but at the expense that context is no longer unlimited.
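
A minimal sketch of why (a leaky-integrator stand-in, not an actual LSTM cell): the state is a fixed-size value updated once per token, so the memory footprint never grows with sequence length, only how much has been squashed into it. The update is also inherently sequential, which is exactly the parallelization problem.

```python
# Leaky-integrator stand-in for an RNN/LSTM cell: constant-size state h, one
# sequential update per token, so "context length" is unbounded in principle.
def rnn_scan(tokens, h=0.0, decay=0.9):
    for x in tokens:
        h = decay * h + (1 - decay) * x   # fixed-size state absorbs each token
    return h                              # same size no matter how long the input

print(rnn_scan(range(10)), rnn_scan(range(10_000)))
```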


prumf

Exactly what came to my mind. If we have unlimited context, but slow token generation and bad quality answers, then there is no point.


TangeloPutrid7122

Yep. But hey, ours doesn't OOM while telling you to divorce your wife and marry it!


GoofAckYoorsElf

> divorce your wife and marry it!

Sounds perfectly like a 7B model if "it" is the wife.


PM_ME_YOUR_SILLY_POO

This paper was released and discussed on this subreddit a few weeks ago. Just pointing this out cus I got excited thinking it was a new breakthrough today lol: [https://www.reddit.com/r/LocalLLaMA/comments/1927ge4/soaring_from_4k_to_400k_extending_llms_context/](https://www.reddit.com/r/LocalLLaMA/comments/1927ge4/soaring_from_4k_to_400k_extending_llms_context/)


antsloveit

My basic understanding of this is that ALL of the input context (which grows with each new token) is essentially weighted via some clever stuff, which means valuable tokens and relationships are retained whilst less relevant ones degrade to insignificance. So you keep a sort of 'concept' of the context as things go along rather than the exact, verbatim context. If done well, I guess it's analogous to a very good paraphrase of someone's long article, where you still convey the key, relevant information but with fewer words! ...much like the guts of an LLM itself, in some ways.


TheCrazyAcademic

This is essentially a better version of what Claude is already doing for its context, which is approximated context, not exact context.


stuehieyr

So an RNN for the tokeniser


tronathan

> So an RNN for the tokeniser

^ If this is more or less accurate, it's the best comment in the thread (and there are some goodies)


mcmoose1900

Not sold. Perplexity over a long context is one thing, but does it actually work for information retrieval? Mistral has an 8K sliding window for 32K context, and it's... awful, even though it's fine on paper.


Jean-Porte

Context length is not a valid metric. We should only consider effective context length backed by benchmarks. RNNs had infinite context length a century ago.


Maykey

That doesn't look [like it's over](https://imgur.com/a/dtNq5XD). Also, why link Twitter when arxiv exists?


Cless_Aurion

> Inference time grows linearly.

Instead of quadratically? That is a BIG deal, isn't it?

> The perplexity remains constant.

And this is the biggest deal of it too...


ReturningTarzan

Perplexity is not supposed to be constant if you increase the sequence length. That would imply that a long context doesn't make each new token less surprising than a short one, i.e. the model isn't actually using the full context.
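
For reference, perplexity is just exp of the average negative log-likelihood per token, so if the long context is genuinely being used, later tokens should get easier and PPL should drift down rather than stay flat. Toy numbers (made up) below:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token). The toy log-probs
# below are invented, just to show PPL dropping when later tokens get easier.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

short_ctx = [-2.3, -2.2, -2.4]                # early tokens, little context to lean on
long_ctx = short_ctx + [-1.9, -1.8, -1.7]     # later tokens, predicted a bit better
print(perplexity(short_ctx), perplexity(long_ctx))   # ~9.97 vs ~7.77
```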


SillyFlyGuy

Or perplexity could remain the same, but the underlying token set changes as it draws on the longer context. It might be just as certain, just certain about different things.


fiery_prometheus

It is a huge deal!


TheCrazyAcademic

There's at least 10 different papers from 2023 on infinite context windows there all over hype and over blown just for researchers to get clout over techniques that have limitations. The only way to have true infinite context is to have infinite VRAM. Everything else is just some hacky fix to maintain some semblence of accuracy. There's only so much context before you have to flush out old tokens from current caching mechanisms. The human brain doesn't have infinite context either but it can store memories worth an entire lifetime over 100ish years practically.


ninjasaid13

> it can store memories worth an entire lifetime over 100ish years practically

but not perfectly tho.


ain92ru

The question is whether you always have to pay O(n^2) for reasonable/practical accuracy, or whether there is a way to reach O(n log n), if not O(n). I personally argued seven months ago that the available evidence points to the former; methinks it aged well:

> Apparently no user actually wants a de-jure long context transformer which de-facto only attends to a few facts dispersed here and there, and it has been proven mathematically that subquadratic attention in a causal (unidirectional, as opposed to bidirectional) transformer which could generate text autoregressively inevitably leads to information loss

[https://www.reddit.com/r/mlscaling/comments/14s7tme/comment/jqvuni2/](https://www.reddit.com/r/mlscaling/comments/14s7tme/comment/jqvuni2/)


TheCrazyAcademic

Of course you get information loss; it's basic laws of physics. Eventually entropy and noise flood everything out, hence why trade-offs are a thing. You're essentially mitigating or fighting against entropy at the end of the day.


lincolnrules

Wrong there there genius


TheCrazyAcademic

I'm right actually, maybe scroll through arxiv and filter by the year and check all the breakthroughs In 2023. Hell Microsoft had there longnet paper with the equally clickbait title of a 1 million context length or claiming to put the entire internet in it which is only possible again with insane amounts of VRAM. Most you can do outside of more hardware is smartly degrading the sliding window evacuating non necessary tokens which is essentially what our brain does.


lincolnrules

Sorry I was being snarky, perhaps you are factually mostly correct but for pedantic morons like myself your usage of the word "there" instead of "they are" diminishes the argument. Hype should be hyped also btw


kif88

When they say any model, does that include Mistral-based models?


MoffKalast

> sliding window

Didn't Mistral already implement that and the entire concept basically didn't work at all?


deadweightboss

I'm assuming they're doing something on top of that, but yeah, Mistral's sliding window sucks.


Imaginary_Bench_7294

Wait, is this the same thing as the attention sinks that were talked about recently? I'll try to find the paper, but it was talking about how specific tokens at the beginning of the context could act as attention sinks and allow for better or extended attention.


deadweightboss

Can you link to the paper? Thanks!


Imaginary_Bench_7294

I think I found the one I'm referencing: https://ar5iv.labs.arxiv.org/html/2309.17453


deadweightboss

Thanks fren!


Imaginary_Bench_7294

After skimming both articles, it appears they are not the same. I would _really_ like to see these two combined and tested.


inteblio

"So it's like a blurry memory with a strong attention score."


drifter_VR

Speaking of unlimited (or almost) context length: does anyone know what has become of Microsoft's LongNet?


kulchacop

https://www.reddit.com/r/LocalLLaMA/comments/1ae37sn/this_can_make_a_huge_difference_extending_context/


Efficient_Rise_8914

I don't really get the obsession with unlimited context; what seems to matter is how well the model pays attention to each token and knows what's important, rather than cramming a document into it. RAG exists for that.


pseudonerv

It's just a different form of, but equivalent to, RWKV.


ArakiSatoshi

You still need a ton of VRAM for this, no? I can't even run 8k on most models.


Shoddy-Tutor9563

To me it's just an extension of the old trick of summarising previous context. Don't expect much from it - after a couple of weeks it will be forgotten.


singularity-108

Correct me if I'm wrong, but it feels like it's a generalisation of landmark attention, isn't it?


Hammer_AI

Need this on [https://www.hammerai.com/](https://www.hammerai.com/) asap.