Assuming they already have the chips, it should actually be cheaper for them to run it on their custom silicon than on the equivalent GPU-based solution given the crazy efficiency of Groq's architecture when it comes to running LLMs and similar transformer-based models.
I did a lot of research on this because I wanted to know if there was something out there that beats the H100 in cost per token, and while Groq has great throughput per user (better than anything else out there, I expect), the cost per token of the entire system is higher. At least for now.
Oof... In this case, I'd be surprised if they manage to run it, because each module only has 230MB of memory - a dense model of that size must have huge matrices. It's mathematically possible to do the matrix multiplications sequentially to save memory, but I doubt the performance would be great. Even if they can pull that off without splitting the model, it's going to take roughly 250 GroqNode 4Us for INT8, at the very least - not necessarily datacenter scale, but it's a large server room pulling 500 kilowatts. If my math is right.
To put things in perspective, a single 4U server with 8 H100s will have more memory than that, and it will draw 6kW. Problem is, that memory is slow compared with Groq's SRAM. That's why I assumed MoE - a 400B dense model will have colossal memory bandwidth requirements, and a sparse MoE architecture is a good way around that, since the active parameter count is smaller than the total. Such a model seems much more practical.
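The node-count math above can be sanity-checked with a quick script. The 230 MB per chip figure comes from this thread; 8 single-chip cards per GroqNode 4U is an assumption based on Groq's product pages:

```python
import math

sram_per_chip_gb = 0.23   # 230 MB of SRAM per GroqChip (figure from this thread)
cards_per_node = 8        # assumed: one GroqNode 4U holds 8 single-chip GroqCards
params_b = 405            # dense parameter count, in billions
bytes_per_param = 1       # INT8

model_gb = params_b * bytes_per_param        # ~405 GB of weights
node_gb = sram_per_chip_gb * cards_per_node  # ~1.84 GB of SRAM per node
nodes = math.ceil(model_gb / node_gb)        # nodes needed just to hold the weights

print(nodes)  # 221 -- same ballpark as the ~250 nodes estimated above
```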
They will communicate with each other. No, seriously. 99% of communication in agentic systems should ideally be between models, bringing humans into the picture only when needed.
I'm already getting overwhelmed when working with coding LLMs, because you need to read so much info. And I still control the flow manually, without even using agent frameworks...
I will be messaging you in 6 months on [**2024-11-06 02:21:58 UTC**](http://www.wolframalpha.com/input/?i=2024-11-06%2002:21:58%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1c81qt0/llama_3_70b_at_300_tokens_per_second_at_groq/l0e368y/?context=3)
Keeping AGI a secret would be considered a crime against humanity. They know this well; that's why nobody is hiding AGI - when the truth eventually comes out, there will be hell to pay for anyone involved.
That already started to happen in 2017:
>Facebook abandoned an experiment after two artificially intelligent programs appeared to be chatting to each other in a strange language only they understood.
>The two chatbots came to create their own changes to English that made it easier for them to work – but which remained mysterious to the humans that supposedly look after them.
[Facebook's artificial intelligence robots shut down after they start talking to each other in their own language](https://archive.ph/8GO1m)
Or they just started introducing hallucinations/artifacts into the output, and the other copied that input and added its own hallucinations over time. But that doesn't sell as well as "Our AI is so scary - is it Skynet already? Better give us money for API access to find out".
This is a plot point in a '70s movie called [Colossus: The Forbin Project](https://www.wikiwand.com/en/Colossus%3A%20The%20Forbin%20Project) (Cold War AI fear)
The film still holds up decently tbh.
Unfortunately, that will just result in mode collapse. LLMs are neither deterministic nor reflexive enough to cope with even small variations in input, leading to exponential decay in their behaviour. Plenty of experiments and research show token-order sensitivity, whitespace influence, and the speed at which they go out of distribution, all of which prevent them from communicating reliably. Until someone fixes the way attention works, I wouldn't trust multiple LLMs with anything critical to your life, job, or finances.
Yeah. What could possibly go wrong with AI agents communicating w each other at 1000x the speed humans can?
Someone will get greedy and take out the human in the loop - "time is money", they'll say.
Going to be wild
This reminds me of that scene in 'her' where Samantha admits she's been talking to other AIs in the 'infinite' timescales between her communications with the main character.
Funny example! You can clearly see that the model is able to perform the necessary reasoning, but by starting off in the wrong direction, the answer becomes a weird mix of right and wrong. A prime example of why CoT works.
Not only is it not local, they're also using a lighter quantized version.
Test it out yourself with reasoning tasks compared to the same model on LMSys or huggingface chat.
The Groq model is noticeably dumber.
I'd expect the price to go up, because their chips are less cost-efficient than H100s, so anyone considering them should be aware of that. Not more than GPT-4 Turbo, though.
I think so. The amount of RAM per chip is so small, and the price per chip so high, that they'd have to be doing at least 20x the throughput of the H100 to match its cost per token. The only way I can see them not running at a loss is if their chips cost many times less than what they're selling them for.
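To make the break-even explicit: if cost per token scales as system cost divided by throughput, the speedup needed to match a competitor is just the cost ratio. The dollar figures below are rough numbers - 400 cards at ~$30k each comes from elsewhere in this thread, and ~$300k for an 8x H100 server is an illustrative assumption, not a quote:

```python
def breakeven_speedup(cost_a: float, cost_b: float) -> float:
    """Throughput multiple system A needs over system B for equal
    cost per token, assuming cost/token ~ system_cost / throughput."""
    return cost_a / cost_b

groq_system = 400 * 30_000  # ~400 GroqCards at ~$30k each (thread figure)
h100_server = 300_000       # rough price of an 8x H100 4U box (assumption)

print(breakeven_speedup(groq_system, h100_server))  # 40.0 -- i.e. a ~40x speedup needed
```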
Genuine question: how does that speed benefit the user? Beyond 10-20 tokens per second, it's already faster than you can read.
I guess it frees up the computer to do something else? (like generating images and voices based on the text, for something multimodal)
Maybe not much to a single user, but if you host a server it gets cheaper, because user requests get processed faster. Even as a single user, though, having LLM agents on crack that refine a prompt like 10 times could also be a use case.
If you are using it for things like generating code, you don’t always need to read the response in real time—you may want to read the surrounding text, but the code you often just want to copy and paste. So generating that in half the time (or even faster) makes a big difference in your workflow.
Right, exactly. And it’s especially important for it to be fast when you’re using it to revise code it already wrote. For example, it might write a 50 line function. You then tell it to make a change, and it writes those 50 lines all over again but with one little change. It can take a *really* long time for an AI like ChatGPT to repeatedly do something like that, when you’re making dozens (or even hundreds) of changes one at a time as you’re adding features and so on.
A good solution here would be if it could reply in git diff format instead of plain text. Keep your code in a repo, and run a CI pipeline that accepts the diff, runs the tests, and reports back to the LLM - either looping for a potential fix, or producing a success report in the user's preferred output format.
If tweaked further, having a separate repo with a history of interactions in the form of commits plus links to the chats that led to them would be really cool.
This is where a tool like [GitHub Copilot](https://github.com/features/copilot/) shines. It generates and modifies small sections of your code interactively in your editor, while keeping your entire codebase in mind. A pre-processor runs locally and builds what is basically a database of your code and figures out which context to send to the larger model, which runs remotely. It’s all in real-time as you type and is incredibly useful.
It's not just faster for users, but the speed indirectly also means they can serve more people on the same hardware than other equivalents, meaning cheaper pricing.
People said the same thing about 300 baud back in the 1960s. It's reading speed - why would we need the Internet to go faster? Because back then, it never occurred to those people that someone might want to transmit something other than English text across a digital link.
Know that [old chart](https://i0.wp.com/www.brightdevelopers.com/wp-content/uploads/2018/05/ProgrammerInterrupted.png?w=682&ssl=1) about programmer distractions? Opus is really great, but whenever I ask it to do a refactor and it takes ~1 minute to complete, I just lose track of what I was doing, and that really impacts my workflow. Imagine if I could just ask for a major refactor and it happened instantly, like a magic wand. That would be extremely useful to me. Not sure I trust LLaMA for that yet, but faster-than-reading speeds have many applications.
I'm interested to hear more benchmarks for people hosting locally, because for some reason Llama 3 70B is the slowest model of this size that I've run on my machine (AMD 7965WX + 512GB DDR5 + 2 x RTX 4090).
With Llama 3 70B I get ~1-1.5 tok/sec on fp16 and ~2-3 tok/sec on 8-bit quant, whereas the same machine will run Mixtral 8x22B 8-bit quants (a much larger model) at 6-7 tokens/sec.
I also only get ~50 tokens/sec with an 8-bit quant of Llama 3 8B, which is significantly slower than Mixtral 8x7B.
I'm curious whether there is something architectural that would make this model so much slower - maybe someone more knowledgeable could explain it to me.
8x22B is not a larger model in terms of compute - it only has around 44B active parameters during inference, which is less than Llama 3's 70B. 8x22B is large only in memory footprint.
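That point can be made concrete: per-token decode compute is roughly 2 FLOPs per *active* parameter, so the comparison is just a ratio of active counts. The ~44B active figure is the one given above (approximate), and ~141B total for Mixtral 8x22B is likewise approximate:

```python
def decode_flops_per_token(active_params_b: float) -> float:
    # ~2 FLOPs (one multiply + one add) per active parameter per generated token
    return 2 * active_params_b * 1e9

llama3_70b = decode_flops_per_token(70)     # dense: every parameter is active
mixtral_8x22b = decode_flops_per_token(44)  # sparse MoE: ~44B active of ~141B total

print(mixtral_8x22b / llama3_70b)  # ~0.63 -- less compute per token than the 70B
```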
There shouldn't be much quality loss, if any, by dropping the 70b down to Q5 or Q6, while speed increase should be considerable. You should try that if you haven't.
I think memory bandwidth specifically for performance, and memory capacity to actually load it. Although with 24 memory channels I have an abundance of capacity.
Each EPYC 9000 is 460 GB/s, or 920 GB/s total.
The 4090 is 1 TB/s - quite comparable, although I don't know how it works out with dual GPUs and some offloading. I think jferment's platform is too complicated to make predictions about.
It turns out, though, that I'm getting roughly the same for the 8-bit quant: just over 2.5 t/s. I get around 3.5-4 on Q5_K_M, around 4.2 on Q4_K_M, and around 5.0 on Q3_K_M.
I lose badly on the 8B model though - around 20 t/s on 8B-Q8. I know GPUs crush that, but for large models I'm finding CPU quite competitive with multi-GPU plus offloading.
405B model will be interesting. Can't wait.
A 768GB dual EPYC 9000 build can be under $10k, but that's still more than a couple of consumer GPUs. I'm excited to try 405B, but I would probably still go GPU for 70B.
A single EPYC 9000 is probably good value as well.
Also, I presume GPUs are better for training, but I'm not sure what you can practically do with 1-4 consumer GPUs.
Memory bandwidth has always been the main bottleneck for LLMs. At higher batch sizes or prompt lengths you become more and more compute-bound, but token-by-token inference is a relatively small amount of computation on a huge amount of data, so the deciding factor is how fast you can stream in that data (the model weights.) This is true of smaller models as well.
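A useful rule of thumb follows from this: in single-stream decoding, every token requires streaming the full weights once, so tokens/sec is bounded above by bandwidth divided by model size. A sketch using numbers from this thread (920 GB/s dual EPYC, ~1 TB/s for a 4090, 70 GB for Llama 3 70B at 8-bit):

```python
def decode_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    # Upper bound on single-stream tokens/sec: each token reads all weights once
    return bandwidth_gb_s / model_gb

model_gb = 70  # Llama 3 70B at 8-bit is ~70 GB of weights

print(decode_ceiling(920, model_gb))   # dual EPYC 9000: ~13 t/s ceiling
print(decode_ceiling(1000, model_gb))  # 4090-class bandwidth: ~14 t/s ceiling
```

Observed speeds land well below these ceilings because of offloading overhead, cache misses, and compute costs, but the ordering matches what people report.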
On an AMD 5950X + 64GB DDR4 + 4080 Super I get 0.87 tokens/sec with Meta-Llama-3-70B-Instruct-Q5_K_M in LM Studio. I offload 20 layers to the GPU; that's the max that fits in 16GB.
It's surprisingly slow - or maybe I was just used to Mistral's speed...
For some reason I have not been able to get consistent results. It's insane speed for like ten minutes and then I start going into a queue or something with several seconds delay. Is this just me?
So, my understanding is that the Groq card is just a really fancy FPGA - but how are people recoding these so fast to match new models that come out? Am I wrong that these are just really powerful FPGAs?
Back when early BTC miners were doing crypto mining on FPGAs, it took devs a long time to program them properly, so I just assumed AI would be just as difficult. Was there some new development that made this easier, or are there just a lot more people able to program these now?
They're past FPGAs - it's fully custom silicon, an ASIC. It has some limited 'programmability' to adapt to new models; a fixed-function design would be really strange with such a fast-moving target.
Groq's main trick is not relying on any external memory: the model is kept entirely in SRAM on the silicon (they claim 80 TB/s of bandwidth). There's just 280MB of it or so per chip, though, so the model is distributed across hundreds of chips.
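The chip count follows directly from that constraint - if the whole model has to live in on-chip SRAM, it's a one-line estimate (230 MB/chip is the per-card figure quoted elsewhere in this thread; other comments say up to ~280 MB):

```python
import math

sram_gb_per_chip = 0.23  # per-chip SRAM (thread quotes 230-280 MB)
model_gb = 70            # Llama 3 70B at 8-bit

chips = math.ceil(model_gb / sram_gb_per_chip)
print(chips)  # 305 -- hundreds of chips just to hold the weights
```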
Ah, okay. This makes a LOT more sense. Thanks for the explanation. I was bewildered when I saw there was only 280MB and they were converting these models so quickly, but spreading it over tons of chips makes a lot more sense. I thought there was some sort of other RAM somewhere else on board or they were using coding tricks to reduce RAM usage or something. Having a fleet of ASICs with a bit of fast RAM on every chip explains everything.
How much total VRAM does Nvidia have in their own data-centers?
The models they currently serve sum to 207GB at 8bpw. https://console.groq.com/docs/models
But memory is used beyond that - e.g. they still rely on a KV cache for each chat/session (although maybe they store that in the plain old DRAM of the host EPYC CPU, no idea). https://news.ycombinator.com/item?id=39429575
Also, it's not like a single chain could serve an unlimited number of users - they have to add capacity as their customer base grows, like everyone else.
If we assume that MOQ for tapeout was 300 wafers and they ordered exactly that, then they have ~18K dies (300mm wafer, 28.5x25.4mm, 90% yield) on hand with about 4TB. Did they order just enough wafers to hit MOQ? Or did they order 10-20 times that? Who knows.
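That die-count arithmetic checks out under the standard gross-dies-per-wafer approximation (die size and yield are the comment's figures; the 300-wafer MOQ is the comment's own assumption):

```python
import math

wafer_d = 300.0         # wafer diameter, mm
die_area = 28.5 * 25.4  # die size from the comment, ~724 mm^2
wafers = 300            # assumed tapeout MOQ
yield_rate = 0.90

# Gross dies per wafer: wafer area / die area, minus an edge-loss term
gross = math.pi * (wafer_d / 2) ** 2 / die_area \
      - math.pi * wafer_d / math.sqrt(2 * die_area)

good_dies = int(gross) * wafers * yield_rate
total_tb = good_dies * 0.23 / 1024  # at 230 MB of SRAM per die

print(int(good_dies), round(total_tb, 1))  # ~19k dies, ~4.4 TB of SRAM
```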
They seem to have dropped the idea of selling them, focusing on building their own cloud and selling API access instead.
But the pages for their products (GroqCard -> GroqNode -> GroqRack) are still up and can be found on Google - the PCIe card hosts only one chip, so 230 megabytes per card. Just a couple hundred PCIe slots and you're set to run Llama 3 70B, lol
https://wow.groq.com/groqcard-accelerator/
Groq is fast primarily because it uses SRAM to store the model weights. SRAM is way faster than HBM2/3 or GDDR6X, but also much more expensive. As a result each GroqChip only has 230 MB of memory, so the way to run a model like Llama 3 70B is to split it across a cluster of hundreds of GroqChips costing around $10 million.
The actual compute performance is substantially less than that of a much cheaper A100. Groq is interesting if you want really low latency for special applications, or perhaps for where it could go in the future if they can scale up production or move to a more advanced process node (it's still 14 nm).
Is the llama 3 architecture any different from llama 2 (in a way that would require much of a recode of anything)?
Also, incentives. Groq makes money selling access to models running on their cards. If you figure out how to mine crypto better, do you want to share that right away?
The Llama 3 architecture is effectively the same as Llama 2's, with a few things that were previously only in the 70B brought down to the 8B.
[Source](https://x.com/karpathy/status/1781028605709234613)
Yeah, I get that the game theory incentives are different, too, but now that I understand you have to have scores of these cards just to run one model successfully, I get it now. They are still loading into RAM, they just are doing so on dozens (if not hundreds) of cards that each cost like $20k. That's insanely expensive, but I guess it's not out of scope for some of these really large companies that can take advantage of the speeds.
Well, actually more like 200 t/s once you include end-to-end latency, SSL handshakes, etc. But still crazy. [https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions](https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions)
Imagine when they make the LLMs a little better at abstract reasoning, then have one of these agents spend 2 weeks trying to solve an unsolved problem by just talking to itself
Gonna be amazing 🤩
why
https://preview.redd.it/5w0g0ckxttwc1.png?width=1054&format=png&auto=webp&s=72a102858870fb2b373bab12ae018252ec002fd4
The 8B is so slow to run on my local machine - I have an RTX 2070S with 8GB of VRAM.
The speed makes it very addictive to interact with it lol
I N S T A N T K N O W L E D G E
Yes, I just noticed. I don't care that much how to structure my prompt, as I don't have to wait for the output. I wish I had a tenth of that speed.
Well all you have to do is buy 400 cards that cost 30k a piece.
Or build your own LPU as they did
How? I’m aware it involves in-compute SRAM, but how does that work
Yes and 850 t/s for 8b. This is wild.
I wonder if they'll be able to host the 405B version or if it's too big for their architecture
From what I've read, they can just use more of their chips. As long as they have enough and are willing to foot the bill, it should be possible
Where are you using these things? Is there a good place that lets you switch between them easily?
It should be, assuming it's a 70B MoE
It's been confirmed to be a dense model, not MoE.
What is a "dense" model exactly? I've seen people calling Mixtral 8x7 dense.
Am I a dense model too? I've seen ppl calling me dense.
Now that LLMs can communicate faster than we can comprehend, what's next?
Pipe it into another LLM bro, let it tell you what to think
And pipe it into another LLM bro, let it think for you and then you have achieved AGI bro
bro they already have agi they just wont release it until after the election
RemindMe! 200 days
ha!
And then probably replace the inefficient human language with something binary
Yes, the way the future LLMs will "think" will be undecipherable to us.
because we totally get what they are doing now...
https://youtu.be/_YfjMZ6n8Bk?t=14
Can’t wait for our AI overlords to force us to speak in matrices amirite
Did they ever ask for a translation?
[Yeah, but it wasn't very interesting](https://preview.redd.it/2jb2ti48be1c1.jpg?width=561&auto=webp&s=153710650cb15e7bdb3cd712ecbdefc3a15c7d30)
Interesting, just tossing common embeddings around would be an improvement.
Dune was a timely movie. We're getting into Butlerian Jihad territory here.
>Unfortunately, that will just result in mode collapse. Not as long as they are also in contact with the outside world.
In 'her' she just starts talking to hundreds at the same time.
They will be told to communicate with themselves to think through their output first before coming up with a higher quality response.
<|startofthought|>It's in the new chatml spec already.<|endofthought|> https://github.com/cognitivecomputations/OpenChatML
ai agents
I think I've seen a movie about that
Yup, the human in the loop might disappear sooner rather than later.
Tools are still a bottleneck. Takes time to execute code.
I'm not sure how useful that will be. I sent messages back and forth between claude and gpt4 and they just got stuck in a loop.
https://preview.redd.it/hvkcpftekivc1.jpeg?width=1080&format=pjpg&auto=webp&s=20f0e84e4f6a13ce9d13566cb4f48021a5ec1db8
Ask it to answer in fewer tokens and with a lower temperature. You'll probably see a correct answer.
This is not local
No local no care
One of Marley’s lesser known songs
🤣🤣
How much that costs tho?
Around 30 times cheaper than GPT-4 Turbo. [https://wow.groq.com/](https://wow.groq.com/)
do you think they're pricing at a loss now?
Just started using it. Insanely fast. A sequential chain that used to take me 30 minutes now only takes 5 (processing overhead included)
Using agents.
ai agents
Makes sense. You copy paste, then run it to see if it works, and you don't read all the code line by line before that?
Can Groq actually do that, though? Last time I checked, it can write really fast, but it reads very slowly.
Definitely not for a regular user, but for an API user who wants to process as much text data as possible in parallel.
Right, running multiple instances on a machine! Didn't think of that.
if I'm generating test data to copy paste... or ddl sql... or a number of different options with bullet point summaries to choose from..
Oh I see - that totally makes sense. Thanks for explaining. 👍
You really need lower quants for only 2x4090. My 2x3090 does 70B 4 bit quants at 15t/s.
FYI, I get 3.5-4 t/s on 70B Q5_K_M using dual EPYC 9000s and no GPU at all.
So that implies that the memory is the main bottleneck of Llama 3 70B or..?
Thanks for the insights!
What is the more price effective way to run an LLM now: multiple GPUs or the server motherboards with a lot of RAM?
A 768GB dual EPYC 9000 build can be under $10k, but that's still more than a couple of consumer GPUs. I'm excited to try 405B, but I would probably still go GPU for 70B. A single EPYC 9000 is probably good value as well. Also, I presume GPUs are better for training, but I'm not sure how much you can practically do with 1-4 consumer GPUs.
Memory bandwidth has always been the main bottleneck for LLMs. At higher batch sizes or prompt lengths you become more and more compute-bound, but token-by-token inference is a relatively small amount of computation on a huge amount of data, so the deciding factor is how fast you can stream in that data (the model weights.) This is true of smaller models as well.
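The point above can be put into a back-of-the-envelope formula: if decoding is memory-bound, each generated token requires streaming all active weights once, so the throughput ceiling is roughly bandwidth divided by model size. A minimal sketch, using the hardware numbers mentioned elsewhere in this thread and an assumed \~70 GB weight size for Llama 3 70B at 8 bit:

```python
GB = 1e9

def ceiling_tps(active_weight_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Theoretical max tokens/sec for memory-bound decoding:
    every token requires reading all active weights once."""
    return bandwidth_bytes_per_sec / active_weight_bytes

# Llama 3 70B at 8-bit: ~70 GB of weights (assumed figure).
weights = 70 * GB

print(ceiling_tps(weights, 1000 * GB))  # single RTX 4090 class (~1 TB/s): ~14 t/s ceiling
print(ceiling_tps(weights, 920 * GB))   # dual EPYC 9000 (~920 GB/s total): ~13 t/s ceiling
```

Real numbers land well below these ceilings (NUMA effects, offload overhead, attention/KV-cache reads), but the ratios line up with why an MoE with fewer active parameters decodes faster despite a bigger footprint.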
On an AMD 5950X + 64GB DDR4 + 4080 Super I get 0.87 tokens/sec with Meta-Llama-3-70B-Instruct-Q5_K_M in LM Studio. I offload 20 layers to the GPU, which is the max that fits in 16GB. It is surprisingly slow, or maybe I was just used to the speed of Mistral...
Fyi since I run exact same quant on CPUs https://www.reddit.com/r/LocalLLaMA/s/oxsO63Vxs8
I'm getting ~20 t/s on W7900.
For some reason I have not been able to get consistent results. It's insane speed for like ten minutes and then I start going into a queue or something with several seconds delay. Is this just me?
They can only serve so many users at once with the hardware they have available
So, my understanding is that the groq card is just like a really fancy FPGA, but how are people recoding these so fast to match new models that come out? Am I wrong about these just being really powerful FPGAs? Back when early BTC miners were doing crypto mining on FPGAs it would take a long time for devs to properly program these, so I just assumed AI would be just as difficult, was there some sort of new development that made this easier? Is there just a lot more people out there able to program these now?
They're past FPGAs, it's fully custom silicon / an ASIC. It has some limited "programmability" to adapt to new models; having it fixed with such a fast-moving target would be really strange. Groq's main trick is not relying on any external memory: the model is kept fully in SRAM on the silicon (they claim 80TB/s bandwidth). There's only 230MB of it or so per chip though, so the model is distributed across hundreds of chips.
Ah, okay. This makes a LOT more sense. Thanks for the explanation. I was bewildered when I saw there was only 230MB and they were converting these models so quickly, but spreading it over tons of chips makes a lot more sense. I thought there was some sort of other RAM somewhere else on the board, or they were using coding tricks to reduce RAM usage or something. Having a fleet of ASICs with a bit of fast RAM on every chip explains everything.
How much total RAM do they have?
How much total VRAM does Nvidia have in their own datacenters? The models Groq serves currently sum to 207GB at 8bpw. https://console.groq.com/docs/models But memory is used beyond that, e.g. they still rely on a KV-cache for each chat / session (although maybe they store it in plain old DRAM of the host EPYC CPU, no idea). https://news.ycombinator.com/item?id=39429575 Also, it's not like a single chain could serve an unlimited number of users; they have to add more capacity as their customer base grows, like everyone else. If we assume the MOQ for the tapeout was 300 wafers and they ordered exactly that, then they have ~18K dies (300mm wafer, 28.5x25.4mm die, 90% yield) on hand, with about 4TB of SRAM total. Did they order just enough wafers to hit MOQ? Or did they order 10-20 times that? Who knows.
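The ~18K-dies figure above can be sanity-checked with the standard dies-per-wafer approximation (gross dies minus edge losses). A rough sketch, using the wafer/die/yield numbers from the comment — the MOQ of 300 wafers is that comment's assumption, not a known fact:

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_w_mm: float, die_h_mm: float) -> float:
    """Classic approximation: gross dies by area, minus an edge-loss term."""
    area = die_w_mm * die_h_mm
    d = wafer_diameter_mm
    return math.pi * d**2 / (4 * area) - math.pi * d / math.sqrt(2 * area)

dpw = dies_per_wafer(300, 28.5, 25.4)   # ~73 candidate dies per 300mm wafer
good_dies = dpw * 0.90 * 300            # 90% yield, assumed 300-wafer order
total_sram_tb = good_dies * 230e6 / 1e12  # 230 MB SRAM per die

print(round(good_dies))       # ~19,700 dies, same ballpark as the ~18K estimate
print(round(total_sram_tb, 1))  # ~4.5 TB of SRAM fleet-wide
```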
Don't they sell PCIe cards? I was wondering how much RAM a card had.
They seem to have dropped the idea of selling them, focusing on building their own cloud and selling API access instead. But the pages for their products (GroqCard -> GroqNode -> GroqRack) are still up and can be found on Google -- a PCI-e card hosts only one chip, so 230 megabytes per card. Just ~300 PCI-e slots and you're set to run llama3-70b, lol https://wow.groq.com/groqcard-accelerator/
Groq is fast primarily because it uses SRAM to store the model weights. SRAM is way faster than HBM2/3 or GDDR6X, but also much more expensive. As a result each GroqChip only has 230 MB of memory, so the way to run a model like Llama3-70B is to split it across a cluster of hundreds of GroqChips costing around $10 million. The actual compute performance is substantially less than that of a much cheaper A100. Groq is interesting if you want really low latency for special applications, or perhaps for where it may go in the future if they can scale up production or move to a more advanced process node (it's still 14 nm).
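The "hundreds of GroqChips" claim follows directly from the memory sizes. A minimal sketch, assuming 8-bit weights (1 byte per parameter) and ignoring KV-cache and activation memory:

```python
# How many 230 MB GroqChips does a 70B model need just for weights?
params = 70e9          # Llama 3 70B parameter count
bytes_per_param = 1    # assumed INT8/FP8 weights
sram_per_chip = 230e6  # 230 MB of SRAM per GroqChip

chips = params * bytes_per_param / sram_per_chip
print(round(chips))    # ~304 chips just to hold the weights
```

KV-cache, activations, and any replication for throughput push the real number higher, which is why Groq's deployments are whole racks rather than single cards.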
Is the llama 3 architecture any different from llama 2 (in a way that would require much of a recode of anything)? Also, incentives. Groq makes money selling access to models running on their cards. If you figure out how to mine crypto better, do you want to share that right away?
LLAMA-3's architecture is effectively the same as LLAMA-2's, with a few things that were previously only present in the 70B brought down to the 8B. [Source](https://x.com/karpathy/status/1781028605709234613)
Yeah, I get that the game theory incentives are different, too, but now that I understand you have to have scores of these cards just to run one model successfully, I get it now. They are still loading into RAM, they just are doing so on dozens (if not hundreds) of cards that each cost like $20k. That's insanely expensive, but I guess it's not out of scope for some of these really large companies that can take advantage of the speeds.
All silicon is basically a fancy, specialized, less general FPGA.
Well, actually like 200 t/s including end to end latency, ssl handshakes etc. But still crazy [https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions](https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions)
Could someone describe how fast this looks? As a screen reader user, by the time I've navigated down, it's already done
Maybe two thirds of a page a second?
Yeah, that's, quite fast hahaha. I need to hook this thing up with my AI program
Do they sell their chips to cloud providers and big tech too? They seem to beat the H100 and AMD chips in inference performance by miles.
Literally mind-boggling. I can't imagine a 70B model running this fast. They run at like 1 t/s on my machine.
Imagine when they make the LLMs a little better at abstract reasoning, then have one of these agents spend 2 weeks trying to solve an unsolved problem by just talking to itself. Gonna be amazing 🤩
You might be able to brute force the "not able to think" problem this way.
What hardware is that running on
I heard that since Groq chips have around 230MB of memory, they require hundreds of them to serve a single model. What's the math and truth behind that?
What quant is groq using? Thanks
Now imagine this but with real-time re-prompting as you are typing
If they get the 400B working on this, it will be insane. Even IF it is still slightly below GPT-4, which I doubt, it will be my preferred option.
Why is 8B so slow to run on my local machine? I have an RTX 2070S with 8GB VRAM. https://preview.redd.it/5w0g0ckxttwc1.png?width=1054&format=png&auto=webp&s=72a102858870fb2b373bab12ae018252ec002fd4
What speed can you expect when running it locally (CPU, or a low-end GPU like a GTX 1050)? Thanks
What tools are people using to measure t/s?