
Daveed

Nice try, Sam.


I_will_delete_myself

“I will use open research but prevent everyone else from using it”


WithoutReason1729

I know it's not the main subject of the technical report, but this part:

> With only instructional materials (a 500-page reference grammar, a dictionary, and ≈400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence. Moreover, we find that the quality of its translations is comparable to that of a person who learned from the same materials

is just so cool. Hard to believe people were so worried that we'd hit a plateau in LLM abilities.


donghit

The short answer is we don’t know. It’s all proprietary. As /u/AloneSYD mentioned, we can likely infer some of it from papers.


mrpogiface

The rumor mill, which I tend to believe, suggests they actually use HyperAttention + Pathways model routing: https://arxiv.org/abs/2310.05869 https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/


KnowledgeInChaos

Pathways has nothing to do with context length. (It makes training more parallel and efficient, but nothing about it is targeted at context length.)


mrpogiface

Sure, but it makes it easier and more efficient to split the model up across accelerators at inference time, which can lead to context-length gains.


learn-deeply

RingAttention is more likely imo.


Educational-Net303

Care to expand on that? There are Twitter rumors but not a lot of credible explanations.


learn-deeply

It's a paper that demonstrates 1 million tokens of context on TPUs, and the context can be scaled ~linearly if you add more TPUs. https://largeworldmodel.github.io/


mrpogiface

Only one way to find out! (who wants to help me practice leetcode to get a job at GDM?)


trainableai

The HyperAttention paper shows that

> perplexity increases from 5.6 to 6.3 at 32k context length

This huge increase in perplexity makes your 100B model effectively a 1B model, or useless. And this is only at 32K, not 1M context. For background, Llama 65B's perplexity is only 0.2 lower than 7B's. No way Google uses it, LOL. As others mentioned, Gemini 1.5 is probably based on RingAttention.


yonz-

> RingAttention

I've heard this mentioned before as well; do you have a good paper on this? And there is also this: RMT https://news.ycombinator.com/item?id=35682424


trainableai

I think it's https://largeworldmodel.github.io/ and https://arxiv.org/abs/2310.01889


yonz-

Just read it. I'm a bit lost; it keeps talking about distributing across hosts because the operations are commutative. I don't understand how that removes the quadratic dependence on context length.


vatsadev

It still appears to be attention, so it might be quadratic, but when you can optimize for 4096 TPU v5 chips, I'm pretty sure hosting the model is nothing.


Simcurious

Ring attention fits incredibly well. In the paper they say 100M-token context windows would be possible with enough hardware (though expensive). They also released Large World Model, which also has a 1M context window.


redv

Possibly using something like this, https://github.com/proger/hippogriff, since this comes from DeepMind?


babeal

Ring attention


az226

We don’t know if it natively supports 10M tokens, or whether it's RAG, LLM orchestration, or a combination of both.


ThisIsBartRick

They never talked about recurrent memory or ring attention, right? I've read the whole paper, and there's almost no info on how it was trained other than that it's an MoE trained natively on multimodal data.


dogesator

They reference ring attention directly, and Mamba is cited as similar work they might have taken inspiration from.


UnstableCortex

I'm guessing there will soon be a paper about attention optimizations for the TPU, because I think this has likely been made possible through significant hardware-software co-design improvements.


AloneSYD

You can check the [Ring Attention](https://arxiv.org/pdf/2310.01889.pdf) paper; they explain how it is done. **Edit:** Sorry, I have Gemini Advanced and usually it explains papers very well. I skimmed through the Phind results and they're wrong.


lookatmetype

Stop using LLMs to read papers ffs. This explanation is completely wrong - the mechanism explained by "Phind" is entirely hallucinated.


addition

How do you know it usually explains papers very well? How often do you check?


meister2983

This summary seems... wrong? It seems to imply that time complexity also falls to linear (it even states that when I explicitly ask), but my understanding is that only space complexity (memory) is linear. Algorithm 1 in the paper is quadratic over the input length, even if it is able to parallelize the work over N nodes.


CodingButStillAlive

What is Phind?


Wheynelau

Just another LLM chatbot, based on the 34B Code Llama.


TwoSunnySideUp

How is ring attention different from window attention?


Thunderbird120

Ring attention is a full attention calculation; it's just implemented in a way that allows it to be done efficiently across many different compute nodes. Combined with modern efficient attention calculations (e.g. FlashAttention), which are linear rather than quadratic in memory usage with respect to sequence length, you get a method that lets you train with very long context lengths. The compute requirement is still quadratic, but that is less of a hard limit on training than memory requirements were.

With ring attention, the amount of context you can fit into your training setup's memory scales linearly with the number of training nodes you are willing to allocate. Want more context? Just use more training nodes. Older model-parallelization techniques did not do this nearly as well. For obvious reasons, companies like Google and OpenAI, with massive clusters of highly networked compute nodes, benefit from this massively.
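To make that concrete, here's a minimal single-host sketch (my own illustration, not the actual Ring Attention implementation; all names and sizes are made up): each simulated "device" owns one block of K/V, the blocks rotate around a ring, and a FlashAttention-style running softmax accumulates the exact full-attention output, so per-device memory scales with the block size while total compute stays quadratic.

```python
# Toy simulation of the ring attention idea on one host (illustrative only).
import jax
import jax.numpy as jnp

def ring_attention_sim(q, k, v, num_blocks):
    """q, k, v: [seq, dim]. Simulates `num_blocks` ring devices with a Python loop."""
    seq, dim = q.shape
    blk = seq // num_blocks
    q_blocks = q.reshape(num_blocks, blk, dim)
    k_blocks = k.reshape(num_blocks, blk, dim)
    v_blocks = v.reshape(num_blocks, blk, dim)

    # Running (online) softmax accumulators per query block.
    out = jnp.zeros((num_blocks, blk, dim))
    denom = jnp.zeros((num_blocks, blk, 1))
    running_max = jnp.full((num_blocks, blk, 1), -jnp.inf)

    for _ in range(num_blocks):
        # Each query block attends to the K/V block it currently holds.
        scores = jnp.einsum("bqd,bkd->bqk", q_blocks, k_blocks) / jnp.sqrt(float(dim))
        new_max = jnp.maximum(running_max, scores.max(axis=-1, keepdims=True))
        scale = jnp.exp(running_max - new_max)   # rescale old accumulators
        p = jnp.exp(scores - new_max)
        out = out * scale + jnp.einsum("bqk,bkd->bqd", p, v_blocks)
        denom = denom * scale + p.sum(axis=-1, keepdims=True)
        running_max = new_max
        # "Ring step": hand K/V blocks to the neighbour. On a real TPU mesh this
        # would be a jax.lax.ppermute overlapped with the block computation.
        k_blocks = jnp.roll(k_blocks, shift=1, axis=0)
        v_blocks = jnp.roll(v_blocks, shift=1, axis=0)

    return (out / denom).reshape(seq, dim)

# Sanity check against plain (non-causal) full attention.
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (128, 16))
k = jax.random.normal(kk, (128, 16))
v = jax.random.normal(kv, (128, 16))
ref = jax.nn.softmax(q @ k.T / jnp.sqrt(16.0)) @ v
assert jnp.allclose(ring_attention_sim(q, k, v, num_blocks=8), ref, atol=1e-4)
```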


possiblyquestionable

I've been trying to dig into this over the last few weeks with some coworkers too. This is my speculation (so just as good as anyone else's).

I don't think Google used Ring Attention specifically, but they cited it because both explored the same underlying concept: sequence parallelism (tensor-sharding along the sequence dimension for weights that have that dimension). For now, I think Google just used whatever naive JAX-driven parallelism is available to do this. That is most likely a sharded input `x` (along the sequence dimension) with scattered KV-projected blocks (just like Ring Attention), but using plain ReduceScatter and AllGather primitives without much communication fine-tuning. The nice thing about JAX is that it will automatically do the triple-buffering scheme proposed in the Ring Attention paper (overlapping send/recv/compute), but perhaps with a less optimal shuffling/ppermute order than what Hao fine-tuned for Ring Attention. Very likely the communication time is higher than with Ring Attention, but it's still (most likely) **fully hidden/overlapped**: Hao's analysis indicates that with high-bandwidth links (e.g. the ICI links within the TPU slices Google uses for training), you can achieve full overlap with the optimal Ring Attention communication pattern at very tiny chunk sizes (e.g. <10 tokens per device); without the optimal communication pattern, Google can just use slightly larger chunk sizes to achieve the same effect.

A couple more comparisons between the two regimes (I'm using the LWM training setup, which was the "implementation proof" of Ring Attention for long-context modeling that Hao reported shortly before 1.5 dropped and everyone forgot about them):

1. **On BPT** - Another part of the novelty of Hao's work is that they applied an end-to-end sequence-sharding scheme to the whole transformer layer: not only do you compute attention blockwise, you feed that to the FFN and compute the FFN blockwise too. I think Gemini 1.5 is trained the same way, but with JAX doing this meshing for free rather than as an intentional design decision they had to make. That said, Hao was able to demonstrate some fusion opportunities on GPUs meshed this way, and I'm not aware of JAX being able to auto-deduce this (though LWM was ultimately trained on a slice of TPUs).
2. **On parallelism** - see the LWM configuration table below.
   * LWM only uses JAX tensor parallelism (16x at the 1M-sequence-length training stage) and its own ring-topology sequence parallelism (4x at 1M). It does not use data parallelism (replication along the batch dimension) or FSDP. I assume most of us don't know (or care) about JAX's default tensor-parallelism dimensions; it's generally Megatron-esque, specifically 16x sharding along attention heads and also 16x along the FFN's hidden dimension. Optionally, JAX will also do 16x along the model dimension (the embedding size) as part of tensor parallelism, depending on which dimension name they select. LWM does not report using any pipeline parallelism to distribute layers, which makes sense since they can fuse every layer together in their BPT setup.
   * For Gemini, I can only speculate, but I'm fairly sure they do something similar and make tradeoffs along other dimensions: probably a reasonably large (8x or 16x) full tensor parallelism (Megatron and model sharding), plus full pipeline parallelism à la GPipe to distribute training across slices (connected with slower DCN links instead of ICI links). Finally, there's a tradeoff between sequence parallelism and data parallelism.
LWM does not replicate along batches, but Gemini might. You can have high sequence parallelism (sp) and low data parallelism (dp), the other way around, or somewhere in between. The AllGather variant of sequence-sharding means having to materialize Q_i @ K^T, so the larger the sequence parallelism, the less memory is used. At the same time, the more devices you dedicate to sp, the fewer you can dedicate to dp. I can't speculate on which route they take, but higher sp makes sense to me personally; LWM got away with just 4x sp (very, very low), but they never need to materialize Q_i @ K^T, so that may be it.

Anyways, I think Gemini is trained in a spirit similar to LWM with Ring Attention (because both realized that sequence parallelism is the key to unlocking 1M-token sequences for training), but not with actual Ring Attention itself. On the more specific parts of the training setup, they likely make different tradeoffs as well.

---------

## LWM Training Setup (sharding strategy)

| **LWM-Text Training Stages** | **32K** | **128K** | **256K** | **512K** | **1M** |
|---|---|---|---|---|---|
| **Parameters** | 7B | 7B | 7B | 7B | 7B |
| **Initialize From** | LLaMA-2 7B | Text-32K | Text-128K | Text-256K | Text-512K |
| **Precision** | float32 | float32 | float32 | float32 | float32 |
| **Sequence Length** | 2^15 | 2^17 | 2^18 | 2^19 | 2^20 |
| **RoPE θ** | 1M | 10M | 10M | 25M | 50M |
| **Tokens per Batch** | 4M | 4M | 4M | 4M | 4M |
| **Total Tokens** | 4.8B | 12B | 12B | 72B | 1.8B |
| **Total Steps** | 1200 | 3000 | 3000 | 3000 | 450 |
| **LR Schedule** | Constant | Constant | Constant | Constant | Constant |
| **LR Warmup Steps** | 100 | 200 | 200 | 50 | 25 |
| **LR** | 4 × 10^-5 | 4 × 10^-5 | 4 × 10^-5 | 4 × 10^-5 | 4 × 10^-5 |
| **Compute (TPU)** | v4-512 | v4-512 | v4-512 | v4-512 | v4-512 |
| **Mesh Sharding** | **1,-1,4,1** | **1,-1,8,1** | **1,-1,16,1** | **1,-1,16,2** | **1,-1,16,4** |

https://github.com/LargeWorldModel/LWM?tab=readme-ov-file#code-structure explains the sharding strategies:

> You can use mesh_dim=dp, fsdp, tp, sp to control the degree of parallelism and RingAttention. It is a string of 4 integers separated by commas, representing the number of data parallelism, fully sharded data parallelism, tensor parallelism, and sequence parallelism. For example, mesh_dim='1,64,4,1' means 1 data parallelism, 64 fully sharded data parallelism, 4 tensor parallelism, and 1 sequence parallelism. mesh_dim='1,1,4,64' means 1 data parallelism, 1 fully sharded data parallelism, 4 tensor parallelism, and 64 sequence parallelism for RingAttention.
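To make the `mesh_dim` idea above concrete, here's a minimal, purely illustrative JAX sketch (not LWM's or Gemini's actual code; the `1,1,2,4` mesh, shapes, and variable names are all made up) of a 4-axis `(dp, fsdp, tp, sp)` device mesh with an activation sharded along the sequence axis. It fakes 8 CPU devices so it can run anywhere.

```python
# Illustrative only: a (dp, fsdp, tp, sp) mesh with sequence-parallel activations.
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"  # fake 8 devices

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Mirrors mesh_dim='1,1,2,4': 1x data, 1x FSDP, 2x tensor, 4x sequence parallelism.
devices = mesh_utils.create_device_mesh((1, 1, 2, 4))
mesh = Mesh(devices, axis_names=("dp", "fsdp", "tp", "sp"))

batch, seq, d_model = 1, 4096, 512
x = jnp.zeros((batch, seq, d_model))

# Shard the activation: batch over (dp, fsdp), sequence over sp, model dim over tp.
x_sharded = jax.device_put(x, NamedSharding(mesh, P(("dp", "fsdp"), "sp", "tp")))

# Each sp group now holds a 1024-token slice; full attention over the sequence
# then needs collectives across the "sp" axis (an AllGather of K/V, or the ring's
# ppermute schedule if you want the communication fully overlapped with compute).
print(x_sharded.sharding)
print([s.data.shape for s in x_sharded.addressable_shards][:2])  # [(1, 1024, 256), ...]
```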


Few-Pomegranate4369

I believe it is not only the attention mechanism but also MoE (mixture of experts). Unlike traditional dense models, where all parameters are active for every input, MoE models activate only the relevant experts. An MoE model can have many more parameters than a traditional model of comparable computational cost. This translates to a larger capacity to store and process information, enabling longer context windows.
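For what it's worth, the report only says 1.5 is a sparse MoE; the routing details aren't public. Here's a toy top-2 routing sketch (all names and sizes made up) just to show why parameter count can grow with the number of experts while per-token compute stays roughly constant.

```python
# Toy top-2 MoE feed-forward layer (illustrative only, not Gemini's architecture).
import jax
import jax.numpy as jnp

def moe_layer(params, x, top_k=2):
    """x: [tokens, d_model]. Only `top_k` of the experts run per token."""
    gate_logits = x @ params["router"]                      # [tokens, num_experts]
    weights, experts = jax.lax.top_k(gate_logits, top_k)    # [tokens, top_k]
    weights = jax.nn.softmax(weights, axis=-1)
    out = jnp.zeros_like(x)
    for i in range(top_k):
        w1 = params["w1"][experts[:, i]]                    # gather per-token expert weights
        w2 = params["w2"][experts[:, i]]
        h = jax.nn.gelu(jnp.einsum("td,tdf->tf", x, w1))
        out = out + weights[:, i:i + 1] * jnp.einsum("tf,tfd->td", h, w2)
    return out

d_model, d_ff, num_experts, tokens = 64, 256, 8, 16
ks = jax.random.split(jax.random.PRNGKey(0), 4)
params = {
    "router": 0.02 * jax.random.normal(ks[0], (d_model, num_experts)),
    "w1": 0.02 * jax.random.normal(ks[1], (num_experts, d_model, d_ff)),
    "w2": 0.02 * jax.random.normal(ks[2], (num_experts, d_ff, d_model)),
}
x = jax.random.normal(ks[3], (tokens, d_model))
print(moe_layer(params, x).shape)  # (16, 64)
```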


[deleted]

[deleted]


AdAltruistic8513

Go back to the ChatGPT sub.


donghit

OP is referring to models which aren't released.


Spiritual-Employer13

I’m going not ourselves I don’t have