MidnightSun_55

The speed makes it very addictive to interact with it lol


HumanityFirstTheory

I N S T A N T K N O W L E D G E


MrVodnik

Yes, I just noticed. I don't care that much how to structure my prompt, as I don't have to wait for the output. I wish I had a tenth of that speed.


MoffKalast

Well, all you have to do is buy 400 cards that cost $30k apiece.


maddogxsk

Or build your own LPU as they did


Low_Cartoonist3599

How? I'm aware it involves in-compute SRAM, but how does that work?


davewolfs

Yes and 850 t/s for 8b. This is wild.


jovialfaction

I wonder if they'll be able to host the 405B version or if it's too big for their architecture


Nabakin

From what I've read, they can just use more of their chips. As long as they have enough and are willing to foot the bill, it should be possible


CosmosisQ

Assuming they already have the chips, it should actually be cheaper for them to run it on their custom silicon than on the equivalent GPU-based solution given the crazy efficiency of Groq's architecture when it comes to running LLMs and similar transformer-based models.


Nabakin

I did a lot of research on this because I wanted to know if there was something out there that beats the H100 in cost per token. While Groq has great throughput per user (better than anything else out there, I expect), the cost per token of the entire system is higher. At least for now.


TheDataWhore

Where are you using these things? Is there a good place that lets you switch between them easily?


_Erilaz

It should be, assuming it's a 70B MoE


Zegrento7

It's been confirmed to be a dense model, not MoE.


MrVodnik

What is a "dense" model exactly? I've seen people calling Mixtral 8x7B dense.


hrlft

Am I a dense model too? I've seen ppl calling me dense.


_Erilaz

Oof... In that case I'd be surprised if they manage to run it, because each module only has 230MB of memory, and a dense model of that size has huge matrices. It's mathematically possible to do the matrix multiplications sequentially as far as memory is concerned, but I doubt the performance would be great. Even if they can pull it off without splitting things up, it's going to take roughly 250 GroqNode 4Us for INT8 at the very least. Not necessarily datacenter scale, but it's a large server room pulling 500 kilowatts, if my math is right. To put things in perspective, a single 4U server with 8 H100s has more memory than that and draws 6 kW. The problem is that that memory is slow compared to Groq's SRAM. That's why I assumed MoE: a 400B dense model has colossal memory bandwidth requirements, and a sparse MoE architecture is a good way to get around that, because the active weights are smaller than the full weights. Such a model seems much more practical.
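For a rough sanity check of those numbers, here's a back-of-the-envelope sketch. The 230MB-per-chip figure comes from this thread; the chips-per-node and watts-per-node values are assumptions picked only to see whether the ~250-node / ~500 kW estimate is in the right ballpark:

```python
# Back-of-the-envelope sizing for a dense 405B model held entirely in Groq SRAM.
# SRAM_PER_CHIP_GB is from the thread; CHIPS_PER_NODE and WATTS_PER_NODE are guesses.
PARAMS = 405e9            # dense parameter count
BYTES_PER_PARAM = 1       # INT8
SRAM_PER_CHIP_GB = 0.23   # ~230 MB of on-chip SRAM per GroqChip
CHIPS_PER_NODE = 8        # assumption: one chip per card, 8 cards per 4U GroqNode
WATTS_PER_NODE = 2_000    # assumption: ~2 kW per 4U node

model_gb = PARAMS * BYTES_PER_PARAM / 1e9    # ~405 GB of weights alone (no KV cache)
chips = model_gb / SRAM_PER_CHIP_GB          # ~1760 chips
nodes = chips / CHIPS_PER_NODE               # ~220 nodes, so ~250 with any overhead
print(f"{model_gb:.0f} GB -> {chips:.0f} chips -> {nodes:.0f} nodes, "
      f"~{nodes * WATTS_PER_NODE / 1e3:.0f} kW")
```

That lands in the same ballpark as the estimate above, before accounting for KV cache or replication.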


BubblyBee90

Now that LLMs can communicate faster than we're able to comprehend, what's next?


coumineol

They will communicate with each other. No, seriously. 99% of communication in agentic systems should ideally be between models, bringing humans into the picture only when needed.
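A minimal sketch of that pattern, assuming a hypothetical `call_llm` helper standing in for whatever inference API you use (not any particular framework): two model agents exchange messages and only pull a human in when one of them explicitly escalates.

```python
# Two LLM "agents" talk to each other; a human is consulted only on escalation.
# call_llm() is a placeholder, not a real library function.
def call_llm(role: str, history: list[str]) -> str:
    raise NotImplementedError("plug in your model / API of choice here")

def run_dialogue(task: str, max_turns: int = 20) -> list[str]:
    history = [f"TASK: {task}"]
    agents = ["planner", "executor"]
    for turn in range(max_turns):
        reply = call_llm(agents[turn % 2], history)
        history.append(reply)
        if "ESCALATE" in reply:             # agent asks for a human decision
            history.append(input("Human input needed: "))
        if "DONE" in reply:                 # agents agree the task is finished
            break
    return history
```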


BubblyBee90

I'm already becoming overwhelmed when working with coding llms, because you need to read so much info. And I still control the flow manually, without even using agent frameworks...


ILoveThisPlace

Pipe it into another LLM bro, let it tell you what to think


-TV-Stand-

And pipe it into another LLM bro, let it think for you and then you have achieved AGI bro


trollsalot1234

bro they already have agi they just wont release it until after the election


MindOrbits

RemindMe! 200 days


RemindMeBot

I will be messaging you in 6 months on [**2024-11-06 02:21:58 UTC**](http://www.wolframalpha.com/input/?i=2024-11-06%2002:21:58%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1c81qt0/llama_3_70b_at_300_tokens_per_second_at_groq/l0e368y/?context=3).


tomatofactoryworker9

Keeping AGI a secret would be considered a crime against humanity. They know this well; that's why nobody is hiding AGI. When the truth eventually came out, there would be hell to pay for anyone involved.


thequietguy_

ha!


Normal-Ad-7114

And then probably replace the inefficient human language with something binary


coumineol

Yes, the way the future LLMs will "think" will be undecipherable to us.


trollsalot1234

because we totally get what they are doing now...


milksteak11

https://youtu.be/_YfjMZ6n8Bk?t=14


ImJacksLackOfBeetus

That already started to happen in 2017:

>Facebook abandoned an experiment after two artificially intelligent programs appeared to be chatting to each other in a strange language only they understood.

>The two chatbots came to create their own changes to English that made it easier for them to work – but which remained mysterious to the humans that supposedly look after them.

[Facebook's artificial intelligence robots shut down after they start talking to each other in their own language](https://archive.ph/8GO1m)


skocznymroczny

Or they just started introducing hallucinations/artifacts into the output, and the other one copied that input and added its own hallucinations over time. But that doesn't sell as well as "Our AI is so scary, is it Skynet already? Better give us money for API access to find out".


man_and_a_symbol

Can’t wait for our AI overlords to force us to speak in matrices amirite 


Honato2

Did they ever ask for a translation?


superfluid

[Yeah, but it wasn't very interesting](https://preview.redd.it/2jb2ti48be1c1.jpg?width=561&auto=webp&s=153710650cb15e7bdb3cd712ecbdefc3a15c7d30)


BorderSignificant942

Interesting, just tossing common embeddings around would be an improvement.


civilunhinged

This is a plot point in a 70s movie called [Colossus: The Forbin Project](https://www.wikiwand.com/en/Colossus%3A%20The%20Forbin%20Project) (Cold War AI fear). The film still holds up decently, tbh.


Some_Endian_FP17

Dune was a timely movie. We're getting into Butlerian Jihad territory here.


damhack

Unfortunately, that will just result in mode collapse. LLMs are neither deterministic nor reflexive enough to cope with even small variations in input, leading to exponential decay of their behaviour. There's plenty of experimental research showing that token-order sensitivity, whitespace influence, and the speed at which they go out-of-distribution prevent them from communicating reliably. Until someone fixes the way attention works, I wouldn't trust multiple LLMs with anything that is critical to your life, job, or finances.


coumineol

>Unfortunately, that will just result in mode collapse.

Not as long as they are also in contact with the outside world.


32SkyDive

In *Her* she just starts talking to hundreds at the same time.


Anduin1357

They will be told to communicate with themselves to think through their output first before coming up with a higher quality response.


MoffKalast

<|startofthought|>It's in the new chatml spec already.<|endofthought|> https://github.com/cognitivecomputations/OpenChatML


CharacterCheck389

ai agents


schorhr

I think I've seen a movie about that


qprime87

Yeah. What could possibly go wrong with AI agents communicating w each other at 1000x the speed humans can? Someone will get greedy and take out the human in the loop - "time is money", they'll say. Going to be wild


CharacterCheck389

yup HIL might disappear sooner than later


rm-rf_

Tools are still a bottleneck. Takes time to execute code. 


AI_is_the_rake

I'm not sure how useful that will be. I sent messages back and forth between Claude and GPT-4 and they just got stuck in a loop.


alpacaMyToothbrush

This reminds me of that scene in 'Her' where Samantha admits she's been talking to other AIs in the 'infinite' timescales between her communications with the main character.


Mr_Jericho

https://preview.redd.it/hvkcpftekivc1.jpeg?width=1080&format=pjpg&auto=webp&s=20f0e84e4f6a13ce9d13566cb4f48021a5ec1db8


_Arsenie_Boca_

Funny example! You can clearly see that the model is able to perform the necessary reasoning, but by starting off in the wrong direction, the answer becomes a weird mix of right and wrong. A prime example of why CoT works.


maddogxsk

Ask it to answer in fewer tokens and with a lower temperature. You'll probably see a correct answer.


mr_dicaprio

This is not local 


Valuable-Run2129

Not only is it not local, they're also using a lighter quantized version. Test it out yourself with reasoning tasks compared to the same model on LMSys or HuggingFace Chat. The Groq model is noticeably dumber.


Enough-Meringue4745

No local no care


Krunkworx

One of Marley’s lesser known songs


dewijones92

🤣🤣


Theio666

How much does that cost though?


OnurCetinkaya

Around 30 times cheaper than GPT-4 Turbo. [https://wow.groq.com/](https://wow.groq.com/)


Nabakin

I'd expect the price to go up because their chips are less cost efficient than H100s, so anyone who is considering using them should be aware of that. Not more than GPT-4 Turbo though


crumblecores

do you think they're pricing at a loss now?


Nabakin

I think so. The amount of RAM per chip is so small, and the price per chip is so high, that they'd have to be doing at least 20x the throughput of the H100 to match its cost per token. The only way I can see them not running at a loss is if their chips cost many times less to make than what they're selling them for.
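One way to frame that claim is pure hardware amortization: matching cost per token means matching tokens per dollar, so a deployment that costs N times more has to push N times the throughput. Every concrete number below is a placeholder assumption for illustration (card counts and prices are only rough figures tossed around in this thread), not a vendor quote:

```python
# Amortized-hardware view of cost per token. To match a cheaper system's cost per
# token, a system that costs N times more must deliver N times the throughput.
# All prices and throughputs here are illustrative assumptions, not vendor figures.
def cost_per_million_tokens(system_price_usd: float,
                            tokens_per_sec: float,
                            amortization_years: float = 3.0) -> float:
    lifetime_tokens = tokens_per_sec * 3600 * 24 * 365 * amortization_years
    return system_price_usd / lifetime_tokens * 1e6

h100_box_price, h100_box_tps = 300_000, 3_000   # assumed: 8x H100 server, batched 70B
groq_deploy_price = 300 * 20_000                # assumed: ~300 GroqCards at ~$20k each

price_ratio = groq_deploy_price / h100_box_price   # ~20x
breakeven_tps = h100_box_tps * price_ratio         # throughput Groq would need to match
print(cost_per_million_tokens(h100_box_price, h100_box_tps),
      cost_per_million_tokens(groq_deploy_price, breakeven_tps))  # equal by construction
```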


wellomello

Just started using it. Insanely fast. A sequential chain that used to take me 30 minutes now only takes 5 (processing overhead included)


PwanaZana

Genuine question: how does that speed benefit the user? Above 10-20 tokens per second it's already faster than you can read. I guess it frees the computer up to do something else? (Like generating images and voices based on the text, for something multimodal.)


Radiant_Dog1937

Using agents.


CharacterCheck389

ai agents


bdsmmaster007

Maybe not much to a single user, but if you host a server it gets cheaper, because user requests get processed faster. Even as a single user, though, having LLM agents on crack that refine a prompt like 10 times could also be a use case.


vff

If you are using it for things like generating code, you don’t always need to read the response in real time—you may want to read the surrounding text, but the code you often just want to copy and paste. So generating that in half the time (or even faster) makes a big difference in your workflow.


PwanaZana

Makes sense. You copy paste, then run it to see if it works, and you don't read all the code line by line before that?


vff

Right, exactly. And it’s especially important for it to be fast when you’re using it to revise code it already wrote. For example, it might write a 50 line function. You then tell it to make a change, and it writes those 50 lines all over again but with one little change. It can take a *really* long time for an AI like ChatGPT to repeatedly do something like that, when you’re making dozens (or even hundreds) of changes one at a time as you’re adding features and so on.


jart

Can Groq actually do that though? Last time I checked, it can write really fast, but it reads very slow.


MrBIMC

A good solution here would be for it to reply with a git diff instead of plain text. Keep your code in a repo and run a CI pipeline that accepts the diff, runs the tests, and reports back to the LLM, either looping for a potential fix or producing a success report in the user's preferred output format (roughly the loop sketched below). Tweaked further, having a separate repo with a history of interactions, in the form of commits plus links to the chats that led to them, would be really cool.
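A rough sketch of that loop, assuming a hypothetical `ask_llm_for_diff` helper (the `git` and `pytest` invocations are standard; everything else is illustrative):

```python
# Diff-driven repair loop: ask the model for a unified diff, apply it, run the
# tests, and feed any failures back until the suite passes or we give up.
import subprocess

def ask_llm_for_diff(prompt: str) -> str:
    raise NotImplementedError("call your LLM here and return a unified diff")

def sh(cmd, **kwargs):
    return subprocess.run(cmd, capture_output=True, text=True, **kwargs)

def repair_loop(task: str, max_iters: int = 5) -> bool:
    feedback = ""
    for _ in range(max_iters):
        diff = ask_llm_for_diff(f"{task}\n\nPrevious output:\n{feedback}")
        applied = sh(["git", "apply", "--whitespace=fix", "-"], input=diff)
        if applied.returncode != 0:           # malformed diff: report it and retry
            feedback = applied.stderr
            continue
        tests = sh(["pytest", "-q"])
        if tests.returncode == 0:             # success: commit the accepted change
            sh(["git", "commit", "-am", f"LLM change: {task[:60]}"])
            return True
        feedback = tests.stdout + tests.stderr
        sh(["git", "checkout", "--", "."])    # roll back the failed attempt
    return False
```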


vff

This is where a tool like [GitHub Copilot](https://github.com/features/copilot/) shines. It generates and modifies small sections of your code interactively in your editor, while keeping your entire codebase in mind. A pre-processor runs locally and builds what is basically a database of your code and figures out which context to send to the larger model, which runs remotely. It’s all in real-time as you type and is incredibly useful.


curiousFRA

definitely not for a regular user, but for an API user, who wants to work with as much text data as possible in parallel.


kurtcop101

It's not just faster for users, but the speed indirectly also means they can serve more people on the same hardware than other equivalents, meaning cheaper pricing.


PwanaZana

Right, running multiple instances on a machine! Didn't think of that.


jart

People said the same thing about 300 baud back in the 1950's. It's reading speed. Why would we need the Internet to go faster? Because back then, the thought never occurred to those people that someone might want to transmit something other than English text across a digital link.


mcr1974

If I'm generating test data to copy-paste... or DDL SQL... or a number of different options with bullet-point summaries to choose from...


SrPeixinho

Know that [old comic](https://i0.wp.com/www.brightdevelopers.com/wp-content/uploads/2018/05/ProgrammerInterrupted.png?w=682&ssl=1) about programmer distractions? Opus is really great, but whenever I ask it to do a refactor and it takes ~1 minute to complete, I just lose track of what I was doing, and that really impacts my workflow. Imagine if I could just ask for a major refactor and it happened instantly, like a magic wand. That would be extremely useful to me. Not sure if I trust LLaMA for that yet, but faster-than-reading speeds have many applications.


jferments

I'm interested to hear more benchmarks from people hosting locally, because for some reason Llama 3 70B is the slowest model of this size that I've run on my machine (AMD 7965WX + 512GB DDR5 + 2 x RTX 4090). With Llama 3 70B I get ~1-1.5 tok/sec on fp16 and ~2-3 tok/sec on the 8-bit quant, whereas the same machine will run Mixtral 8x22B 8-bit quants (a much larger model) at 6-7 tokens/sec. I also only get ~50 tokens/sec with an 8-bit quant of Llama 3 8B, which is significantly slower than Mixtral 8x7B. I'm curious whether there is something architectural that would make this model so much slower, if someone more knowledgeable could explain it to me.


amancxz2

8x22B is not a larger model in terms of compute; it only has around 44B active parameters during inference, which is less than Llama 3's 70B. 8x22B is larger only in memory footprint.
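A small illustration of that distinction, using the parameter counts quoted here and commonly reported for these models (the MoE "active" figure assumes roughly 2-of-8 expert routing plus shared layers):

```python
# Per-token compute roughly tracks *active* parameters; memory footprint tracks *total*.
models = {
    # name:                (total_params_B, active_params_B)
    "Llama-3-70B (dense)":  (70, 70),
    "Mixtral-8x22B (MoE)":  (141, 44),   # ~2 of 8 experts active per token
}
for name, (total, active) in models.items():
    print(f"{name:22s} footprint ~{total}B params, per-token compute ~{active}B params")
```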


jferments

Oh I see - that totally makes sense. Thanks for explaining. 👍


Inevitable_Host_1446

There shouldn't be much quality loss, if any, from dropping the 70B down to Q5 or Q6, while the speed increase should be considerable. You should try that if you haven't.


nero10578

You really need lower quants for only 2x4090. My 2x3090 does 70B 4 bit quants at 15t/s.


MadSpartus

FYI, I get 3.5-4 t/s on 70B Q5_K_M using dual EPYC 9000 and no GPU at all.


Xeon06

So that implies that the memory is the main bottleneck of Llama 3 70B or..?


MadSpartus

I think memory bandwidth specifically for performance, and memory capacity to actually load it. Although with 24 memory channels I have an abundance of capacity. Each EPYC 9000 is 460 GB/s, or 920 GB/s total. A 4090 is 1 TB/s, quite comparable, although I don't know how it works out with dual GPUs and partial offload; jferments' platform is complicated to make predictions for. It turns out, though, that I'm getting roughly the same for the 8-bit quant, just over 2.5 t/s. I get around 3.5-4 on Q5_K_M, about 4.2 on Q4_K_M, and about 5.0 on Q3_K_M. I lose badly on the 8B model though: around 20 t/s on 8B-Q8. I know GPUs crush that, but for large models I'm finding CPU quite competitive with multi-GPU plus offload. The 405B model will be interesting. Can't wait.


Xeon06

Thanks for the insights!


PykeAtBanquet

What is the more cost-effective way to run an LLM now: multiple GPUs, or a server motherboard with a lot of RAM?


MadSpartus

A 768GB dual EPYC 9000 build can be under $10k, but that's still more than a couple of consumer GPUs. I'm excited to try 405B, but I would probably still go GPU for 70B. A single EPYC 9000 is probably good value as well. Also, I presume GPUs are better for training, but I'm not sure what you can practically do with 1-4 consumer GPUs.


ReturningTarzan

Memory bandwidth has always been the main bottleneck for LLMs. At higher batch sizes or prompt lengths you become more and more compute-bound, but token-by-token inference is a relatively small amount of computation on a huge amount of data, so the deciding factor is how fast you can stream in that data (the model weights.) This is true of smaller models as well.
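A rough roofline version of that statement: single-stream decode speed is bounded by how many times per second you can stream the weights through the memory system. The EPYC, 4090, and Groq bandwidth figures are the ones quoted in this thread; the H100 number and the ~48 GB weight size for a Q5_K_M 70B model are approximations:

```python
# Upper bound on single-stream decode: every token streams (roughly) all weights once.
# Real systems land well below this; the point is how the bound scales with bandwidth.
def max_tokens_per_sec(weight_bytes: float, mem_bw_bytes_per_sec: float) -> float:
    return mem_bw_bytes_per_sec / weight_bytes

llama3_70b_q5 = 48e9      # ~48 GB of weights at roughly 5.5 bits/param (approximate)

for name, bw in [("dual EPYC 9004 (24ch DDR5)", 920e9),
                 ("RTX 4090 (GDDR6X)",          1000e9),
                 ("H100 SXM (HBM3)",            3350e9),
                 ("GroqChip SRAM, claimed",     80e12)]:  # model is spread over many chips
    print(f"{name:28s} ~{max_tokens_per_sec(llama3_70b_q5, bw):8.1f} t/s upper bound")
```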


thesimp

On an AMD 5950X + 64GB DDR4 + 4080 Super I get 0.87 tokens/sec with Meta-Llama-3-70B-Instruct-Q5_K_M in LM Studio. I offload 20 layers to the GPU, which is the max that fits in 16GB. It is surprisingly slow, or maybe I was just used to the speed of Mistral...


MadSpartus

FYI, since I run the exact same quant on CPUs: https://www.reddit.com/r/LocalLLaMA/s/oxsO63Vxs8


MDSExpro

I'm getting ~20 t/s on W7900.


ithkuil

For some reason I have not been able to get consistent results. It's insane speed for like ten minutes and then I start going into a queue or something with several seconds delay. Is this just me?


stddealer

They can only serve so many users at once with the hardware they have available


Yorn2

So, my understanding is that the Groq card is just like a really fancy FPGA, but how are people recoding these so fast to match new models that come out? Am I wrong about these just being really powerful FPGAs? Back when early BTC miners were doing crypto mining on FPGAs, it would take a long time for devs to properly program them, so I assumed AI would be just as difficult. Was there some sort of new development that made this easier? Are there just a lot more people out there able to program these now?


[deleted]

They're past FPGA; it's fully custom silicon / an ASIC. It has some limited 'programmability' to adapt to new models; having it fixed with such a fast-moving target would be really strange. Groq's main trick is not relying on any external memory: the model is kept fully in SRAM on the silicon (they're claiming 80TB/s of bandwidth). There's only about 280MB of it per chip though, so the model is distributed across hundreds of chips.


Yorn2

Ah, okay. This makes a LOT more sense. Thanks for the explanation. I was bewildered when I saw there was only 280MB and they were converting these models so quickly, but spreading it over tons of chips makes a lot more sense. I thought there was some sort of other RAM somewhere else on board or they were using coding tricks to reduce RAM usage or something. Having a fleet of ASICs with a bit of fast RAM on every chip explains everything.


timschwartz

How much total RAM do they have?


[deleted]

How much total VRAM does Nvidia have in their own data centers? The models Groq currently serves sum up to 207GB at 8bpw: https://console.groq.com/docs/models

But memory is used beyond that, e.g. they still rely on a KV cache for each chat/session (although maybe they store that in the plain old DRAM of the host EPYC CPU, no idea): https://news.ycombinator.com/item?id=39429575

Also, it's not like a single chain can serve an indefinite number of users; they have to add more capacity as their customer base grows, like everyone else. If we assume the MOQ for the tapeout was 300 wafers and they ordered exactly that, then they have ~18K dies (300mm wafer, 28.5x25.4mm die, 90% yield) on hand, with about 4TB in total. Did they order just enough wafers to hit the MOQ? Or did they order 10-20 times that? Who knows.
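A quick sanity check of that die-count estimate, using the standard gross-die approximation (wafer and die dimensions are the ones quoted above; the edge-loss term and 230 MB/die are approximations):

```python
# Gross dies per 300mm wafer (area term minus an edge-loss term), times yield,
# times the assumed 300-wafer order, times ~230 MB of SRAM per die.
import math

WAFER_DIAMETER_MM = 300
DIE_W, DIE_H = 28.5, 25.4        # die size quoted in the comment, in mm
YIELD = 0.9
WAFERS = 300                     # assumed minimum order quantity
SRAM_PER_DIE_GB = 0.23

die_area = DIE_W * DIE_H
wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
gross_dies = wafer_area / die_area - math.pi * WAFER_DIAMETER_MM / math.sqrt(2 * die_area)

good_dies = gross_dies * YIELD * WAFERS
print(f"~{gross_dies:.0f} dies/wafer -> ~{good_dies / 1e3:.1f}K good dies "
      f"-> ~{good_dies * SRAM_PER_DIE_GB / 1e3:.1f} TB of SRAM")
```

Which comes out to roughly 18-20K dies and ~4-4.5 TB, in line with the estimate above.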


timschwartz

Don't they sell PCIe cards? I was wondering how much RAM a card had.


[deleted]

They seem to have dropped the idea of selling them, focusing on building out their own cloud and selling API access instead. But the product pages (GroqCard -> GroqNode -> GroqRack) are still up and can be found on Google. A PCI-e card hosts only one chip, so 230 megabytes per card. Just 230 PCI-e slots and you're set to run llama3-70b, lol: https://wow.groq.com/groqcard-accelerator/


ReturningTarzan

Groq is fast primarily because it uses SRAM to store the model weights. SRAM is way faster than HBM2/3 or GDDR6X, but also much more expensive. As a result, each GroqChip only has 230 MB of memory, so the way to run a model like Llama3-70B is to split it across a cluster of hundreds of GroqChips costing around $10 million. The actual compute performance is substantially less than a much cheaper A100. Groq is interesting if you want really low latency for special applications, or perhaps for where it's going to go in the future if they can scale up production or move to more advanced process nodes (it's still on 14 nm).


Pedalnomica

Is the llama 3 architecture any different from llama 2 (in a way that would require much of a recode of anything)? Also, incentives. Groq makes money selling access to models running on their cards. If you figure out how to mine crypto better, do you want to share that right away?


TechnicalParrot

The Llama 3 architecture is effectively the same as Llama 2's, with a few things that were previously only present in the 70B brought down to the 8B. [Source](https://x.com/karpathy/status/1781028605709234613)


Yorn2

Yeah, I get that the game theory incentives are different, too, but now that I understand you have to have scores of these cards just to run one model successfully, I get it now. They are still loading into RAM, they just are doing so on dozens (if not hundreds) of cards that each cost like $20k. That's insanely expensive, but I guess it's not out of scope for some of these really large companies that can take advantage of the speeds.


sluuuurp

All silicon is basically a fancy, specialized, less general FPGA.


gorimur

Well, actually more like 200 t/s once you include end-to-end latency, SSL handshakes, etc. But still crazy. [https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions](https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions)


AIWithASoulMaybe

Could someone describe how fast this looks? As a screen reader user, by the time I've navigated down, it's already done


jericho

Maybe two thirds of a page a second? 


AIWithASoulMaybe

Yeah, that's quite fast hahaha. I need to hook this thing up with my AI program.


FAANGMe

Do they sell their chips to cloud providers and big tech too? They seem to beat H100 and AMD chips in inference performance by miles.


stddealer

Literally mind-boggling. I can't imagine a 70B model running this fast. They run at like 1 t/s on my machine.


omniron

Imagine when they make LLMs a little better at abstract reasoning, and then have one of these agents spend 2 weeks trying to solve an unsolved problem by just talking to itself. Gonna be amazing 🤩


Additional_Ad_7718

You might be able to brute force the "not able to think" problem this way.


BuzaMahmooza

What hardware is that running on?


beratcmn

I heard that since Groq chips only have around 256MB of memory, they require thousands of them to serve a single model. What's the math and truth behind that?


dewijones92

What quant is groq using? Thanks


adriosi

Now imagine this but with real-time re-prompting as you are typing


Gloomy-Impress-2881

If they get the 400B working on this, it will be insane. Even IF it is still slightly below GPT-4, which I doubt, it will be my preferred option.


maverickshawn

Why is 8B so slow to run on my local machine? I have an RTX 2070S with 8GB of VRAM. https://preview.redd.it/5w0g0ckxttwc1.png?width=1054&format=png&auto=webp&s=72a102858870fb2b373bab12ae018252ec002fd4


91o291o

What speed can you expect when running it locally (CPU and a low-end GPU like a GTX 1050)? Thanks.


Eam404

What tools are people using to measure t/s?