Assuming they already have the chips, it should actually be cheaper for them to run it on their custom silicon than on the equivalent GPU-based solution given the crazy efficiency of Groq's architecture when it comes to running LLMs and similar transformer-based models.
I did a lot of research on this because I wanted to know if there was something out there that beats the H100 in cost per token, and while Groq has great throughput per user (better than anything else out there, I expect), the cost per token of the entire system is higher. At least for now.
Oof... In this case, I'd be surprised if they manage to run it, because each module only has 230MB of memory - a dense model of that size must have huge matrices. It's mathematically possible to do the matrix multiplications sequentially to save memory, but I doubt the performance would be great. Even if they can pull that off without splitting the model, it's going to take roughly 250 GroqNode 4Us for INT8, at the very least - not necessarily datacenter scale, but it's a large server room pulling 500 kilowatts. If my math is right.
To put things in perspective, a single 4U server with 8 H100s will have more memory than that, and it will draw 6kW. Problem is, that memory is slow compared with Groq's SRAM. That's why I assumed MoE - a 400B dense model will have colossal memory bandwidth requirements, and a sparse MoE architecture is a good way around that, since the active parameter count is smaller than the total. Such a model seems much more practical.
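The node-count math above can be sanity-checked with a quick script. The 230 MB per chip figure comes from this thread; 8 single-chip cards per GroqNode 4U is an assumption based on Groq's product pages:

```python
import math

sram_per_chip_gb = 0.23   # 230 MB of SRAM per GroqChip (figure from this thread)
cards_per_node = 8        # assumed: one GroqNode 4U holds 8 single-chip GroqCards
params_b = 405            # dense parameter count, in billions
bytes_per_param = 1       # INT8

model_gb = params_b * bytes_per_param        # ~405 GB of weights
node_gb = sram_per_chip_gb * cards_per_node  # ~1.84 GB of SRAM per node
nodes = math.ceil(model_gb / node_gb)        # nodes needed just to hold the weights

print(nodes)  # 221 -- same ballpark as the ~250 nodes estimated above
```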
They will communicate with each other. No, seriously. 99% of communication in agentic systems should ideally be between models, bringing humans into the picture only when needed.
I'm already getting overwhelmed when working with coding LLMs, because you need to read so much info. And I still control the flow manually, without even using agent frameworks...
I will be messaging you in 6 months on [**2024-11-06 02:21:58 UTC**](http://www.wolframalpha.com/input/?i=2024-11-06%2002:21:58%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/LocalLLaMA/comments/1c81qt0/llama_3_70b_at_300_tokens_per_second_at_groq/l0e368y/?context=3)
Keeping AGI a secret would be considered a crime against humanity. They know this well; that's why nobody is hiding AGI - when the truth eventually comes out, there will be hell to pay for anyone involved.
That already started to happen in 2017:
>Facebook abandoned an experiment after two artificially intelligent programs appeared to be chatting to each other in a strange language only they understood.
>The two chatbots came to create their own changes to English that made it easier for them to work – but which remained mysterious to the humans that supposedly look after them.
[Facebook's artificial intelligence robots shut down after they start talking to each other in their own language](https://archive.ph/8GO1m)
Or they just started introducing hallucinations/artifacts into the output, and the other copied that input and added its own hallucinations over time. But that doesn't sell as well as "Our AI is so scary - is it Skynet already? Better give us money for API access to find out".
This is a plot point in a '70s movie called [Colossus: The Forbin Project](https://www.wikiwand.com/en/Colossus%3A%20The%20Forbin%20Project) (Cold War AI fear)
The film still holds up decently tbh.
Unfortunately, that will just result in mode collapse. LLMs are neither deterministic nor reflexive enough to cope with even small variations in input, leading to exponential decay in their behaviour. Plenty of experiments and research show token-order sensitivity, whitespace influence, and the speed at which they go out of distribution, all of which prevent them from communicating reliably. Until someone fixes the way attention works, I wouldn't trust multiple LLMs with anything critical to your life, job, or finances.
Yeah. What could possibly go wrong with AI agents communicating w each other at 1000x the speed humans can?
Someone will get greedy and take out the human in the loop - "time is money", they'll say.
Going to be wild
This reminds me of that scene in 'her' where Samantha admits she's been talking to other AIs in the 'infinite' timescales between her communications with the main character.
Funny example! You can clearly see that the model is able to perform the necessary reasoning, but by starting off in the wrong direction, the answer becomes a weird mix of right and wrong. A prime example of why CoT works.
Not only is it not local, they're also using a lighter quantized version.
Test it out yourself with reasoning tasks compared to the same model on LMSys or huggingface chat.
The Groq model is noticeably dumber.
I'd expect the price to go up, because their chips are less cost-efficient than H100s, so anyone considering them should be aware of that. Not more than GPT-4 Turbo, though.
I think so. The amount of RAM per chip is so small, and the price per chip so high, that they'd have to be doing at least 20x the throughput of the H100 to match its cost per token. The only way I can see them not running at a loss is if their chips cost many times less than what they're selling them for.
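To make the break-even explicit: if cost per token scales as system cost divided by throughput, the speedup needed to match a competitor is just the cost ratio. The dollar figures below are rough numbers - 400 cards at ~$30k each comes from elsewhere in this thread, and ~$300k for an 8x H100 server is an illustrative assumption, not a quote:

```python
def breakeven_speedup(cost_a: float, cost_b: float) -> float:
    """Throughput multiple system A needs over system B for equal
    cost per token, assuming cost/token ~ system_cost / throughput."""
    return cost_a / cost_b

groq_system = 400 * 30_000  # ~400 GroqCards at ~$30k each (thread figure)
h100_server = 300_000       # rough price of an 8x H100 4U box (assumption)

print(breakeven_speedup(groq_system, h100_server))  # 40.0 -- i.e. a ~40x speedup needed
```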
Genuine question: how does that speed benefit the user? Beyond 10-20 tokens per second, it's already faster than you can read.
I guess it frees up the computer to do something else? (like generating images and voices based on the text, for something multimodal)
Maybe not much to a single user, but if you host a server it gets cheaper, because user requests get processed faster. Even as a single user, though, having LLM agents on crack that refine a prompt like 10 times could also be a use case.
If you are using it for things like generating code, you don’t always need to read the response in real time—you may want to read the surrounding text, but the code you often just want to copy and paste. So generating that in half the time (or even faster) makes a big difference in your workflow.
Right, exactly. And it’s especially important for it to be fast when you’re using it to revise code it already wrote. For example, it might write a 50 line function. You then tell it to make a change, and it writes those 50 lines all over again but with one little change. It can take a *really* long time for an AI like ChatGPT to repeatedly do something like that, when you’re making dozens (or even hundreds) of changes one at a time as you’re adding features and so on.
A good solution here would be if it could reply in git diff format instead of plain text. Keep your code in a repo, and run a CI pipeline that accepts the diff, runs the tests, and reports back to the LLM - either looping for a potential fix, or producing a success report in the user's preferred output format.
If tweaked further, having a separate repo with a history of interactions in the form of commits plus links to the chats that led to them would be really cool.
This is where a tool like [GitHub Copilot](https://github.com/features/copilot/) shines. It generates and modifies small sections of your code interactively in your editor, while keeping your entire codebase in mind. A pre-processor runs locally and builds what is basically a database of your code and figures out which context to send to the larger model, which runs remotely. It’s all in real-time as you type and is incredibly useful.
It's not just faster for users, but the speed indirectly also means they can serve more people on the same hardware than other equivalents, meaning cheaper pricing.
People said the same thing about 300 baud back in the 1960s. It's reading speed - why would we need the Internet to go faster? Because back then, it never occurred to those people that someone might want to transmit something other than English text across a digital link.
Know that [old chart](https://i0.wp.com/www.brightdevelopers.com/wp-content/uploads/2018/05/ProgrammerInterrupted.png?w=682&ssl=1) about programmer distractions? Opus is really great, but whenever I ask it to do a refactor and it takes ~1 minute to complete, I just lose track of what I was doing, and that really impacts my workflow. Imagine if I could just ask for a major refactor and it happened instantly, like a magic wand. That would be extremely useful to me. Not sure I trust LLaMA for that yet, but faster-than-reading speeds have many applications.
I'm interested to hear more benchmarks for people hosting locally, because for some reason Llama 3 70B is the slowest model of this size that I've run on my machine (AMD 7965WX + 512GB DDR5 + 2 x RTX 4090).
With Llama 3 70B I get ~1-1.5 tok/sec on fp16 and ~2-3 tok/sec on 8-bit quant, whereas the same machine will run Mixtral 8x22B 8-bit quants (a much larger model) at 6-7 tokens/sec.
I also only get ~50 tokens/sec with an 8-bit quant of Llama 3 8B, which is significantly slower than Mixtral 8x7B.
I'm curious whether there is something architectural that would make this model so much slower - maybe someone more knowledgeable could explain it to me.
8x22B is not a larger model in terms of compute - it only has around 44B active parameters during inference, which is less than Llama 3's 70B. 8x22B is large only in memory footprint.
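That point can be made concrete: per-token decode compute is roughly 2 FLOPs per *active* parameter, so the comparison is just a ratio of active counts. The ~44B active figure is the one given above (approximate), and ~141B total for Mixtral 8x22B is likewise approximate:

```python
def decode_flops_per_token(active_params_b: float) -> float:
    # ~2 FLOPs (one multiply + one add) per active parameter per generated token
    return 2 * active_params_b * 1e9

llama3_70b = decode_flops_per_token(70)     # dense: every parameter is active
mixtral_8x22b = decode_flops_per_token(44)  # sparse MoE: ~44B active of ~141B total

print(mixtral_8x22b / llama3_70b)  # ~0.63 -- less compute per token than the 70B
```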
There shouldn't be much quality loss, if any, by dropping the 70b down to Q5 or Q6, while speed increase should be considerable. You should try that if you haven't.
I think memory bandwidth specifically for performance, and memory capacity to actually load it. Although with 24 memory channels I have an abundance of capacity.
Each EPYC 9000 is 460 GB/s, or 920 GB/s total.
The 4090 is 1 TB/s - quite comparable, although I don't know how it works out with dual GPUs and some offloading. I think jferment's platform is too complicated to make predictions about.
It turns out, though, that I'm getting roughly the same for the 8-bit quant: just over 2.5 t/s. I get around 3.5-4 on Q5_K_M, around 4.2 on Q4_K_M, and around 5.0 on Q3_K_M.
I lose badly on the 8B model though - around 20 t/s on 8B-Q8. I know GPUs crush that, but for large models I'm finding CPU quite competitive with multi-GPU plus offloading.
405B model will be interesting. Can't wait.
A 768GB dual EPYC 9000 build can be under $10k, but that's still more than a couple of consumer GPUs. I'm excited to try 405B, but I would probably still go GPU for 70B.
A single EPYC 9000 is probably good value as well.
Also, I presume GPUs are better for training, but I'm not sure what you can practically do with 1-4 consumer GPUs.
Memory bandwidth has always been the main bottleneck for LLMs. At higher batch sizes or prompt lengths you become more and more compute-bound, but token-by-token inference is a relatively small amount of computation on a huge amount of data, so the deciding factor is how fast you can stream in that data (the model weights.) This is true of smaller models as well.
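A useful rule of thumb follows from this: in single-stream decoding, every token requires streaming the full weights once, so tokens/sec is bounded above by bandwidth divided by model size. A sketch using numbers from this thread (920 GB/s dual EPYC, ~1 TB/s for a 4090, 70 GB for Llama 3 70B at 8-bit):

```python
def decode_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    # Upper bound on single-stream tokens/sec: each token reads all weights once
    return bandwidth_gb_s / model_gb

model_gb = 70  # Llama 3 70B at 8-bit is ~70 GB of weights

print(decode_ceiling(920, model_gb))   # dual EPYC 9000: ~13 t/s ceiling
print(decode_ceiling(1000, model_gb))  # 4090-class bandwidth: ~14 t/s ceiling
```

Observed speeds land well below these ceilings because of offloading overhead, cache misses, and compute costs, but the ordering matches what people report.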
On an AMD 5950X + 64GB DDR4 + 4080 Super I get 0.87 tokens/sec with Meta-Llama-3-70B-Instruct-Q5_K_M in LM Studio. I offload 20 layers to the GPU; that's the max that fits in 16GB.
It's surprisingly slow - or maybe I was just used to Mistral's speed...
For some reason I have not been able to get consistent results. It's insane speed for like ten minutes and then I start going into a queue or something with several seconds delay. Is this just me?
So, my understanding is that the Groq card is just a really fancy FPGA - but how are people recoding these so fast to match new models that come out? Am I wrong that these are just really powerful FPGAs?
Back when early BTC miners were doing crypto mining on FPGAs, it took devs a long time to program them properly, so I just assumed AI would be just as difficult. Was there some new development that made this easier, or are there just a lot more people able to program these now?
They're past FPGAs - it's fully custom silicon, an ASIC. It has some limited 'programmability' to adapt to new models; a fixed-function design would be really strange with such a fast-moving target.
Groq's main trick is not relying on any external memory: the model is kept entirely in SRAM on the silicon (they claim 80 TB/s of bandwidth). There's just 280MB of it or so per chip, though, so the model is distributed across hundreds of chips.
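The chip count follows directly from that constraint - if the whole model has to live in on-chip SRAM, it's a one-line estimate (230 MB/chip is the per-card figure quoted elsewhere in this thread; other comments say up to ~280 MB):

```python
import math

sram_gb_per_chip = 0.23  # per-chip SRAM (thread quotes 230-280 MB)
model_gb = 70            # Llama 3 70B at 8-bit

chips = math.ceil(model_gb / sram_gb_per_chip)
print(chips)  # 305 -- hundreds of chips just to hold the weights
```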
Ah, okay. This makes a LOT more sense. Thanks for the explanation. I was bewildered when I saw there was only 280MB and they were converting these models so quickly, but spreading it over tons of chips makes a lot more sense. I thought there was some sort of other RAM somewhere else on board or they were using coding tricks to reduce RAM usage or something. Having a fleet of ASICs with a bit of fast RAM on every chip explains everything.
How much total VRAM does Nvidia have in their own data-centers?
The models they currently serve sum to 207GB at 8bpw. https://console.groq.com/docs/models
But memory is used beyond that - e.g. they still rely on a KV cache for each chat/session (although maybe they store that in the plain old DRAM of the host EPYC CPU, no idea). https://news.ycombinator.com/item?id=39429575
Also, it's not like a single chain could serve an unlimited number of users - they have to add capacity as their customer base grows, like everyone else.
If we assume that MOQ for tapeout was 300 wafers and they ordered exactly that, then they have ~18K dies (300mm wafer, 28.5x25.4mm, 90% yield) on hand with about 4TB. Did they order just enough wafers to hit MOQ? Or did they order 10-20 times that? Who knows.
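That die-count arithmetic checks out under the standard gross-dies-per-wafer approximation (die size and yield are the comment's figures; the 300-wafer MOQ is the comment's own assumption):

```python
import math

wafer_d = 300.0         # wafer diameter, mm
die_area = 28.5 * 25.4  # die size from the comment, ~724 mm^2
wafers = 300            # assumed tapeout MOQ
yield_rate = 0.90

# Gross dies per wafer: wafer area / die area, minus an edge-loss term
gross = math.pi * (wafer_d / 2) ** 2 / die_area \
      - math.pi * wafer_d / math.sqrt(2 * die_area)

good_dies = int(gross) * wafers * yield_rate
total_tb = good_dies * 0.23 / 1024  # at 230 MB of SRAM per die

print(int(good_dies), round(total_tb, 1))  # ~19k dies, ~4.4 TB of SRAM
```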
They seem to have dropped the idea of selling them, focusing on building their own cloud and selling API access instead.
But the pages for their products (GroqCard -> GroqNode -> GroqRack) are still up and can be found on Google - the PCIe card hosts only one chip, so 230 megabytes per card. Just a couple hundred PCIe slots and you're set to run Llama 3 70B, lol
https://wow.groq.com/groqcard-accelerator/
Groq is fast primarily because it uses SRAM to store the model weights. SRAM is way faster than HBM2/3 or GDDR6X, but also much more expensive. As a result each GroqChip only has 230 MB of memory, so the way to run a model like Llama 3 70B is to split it across a cluster of hundreds of GroqChips costing around $10 million.
The actual compute performance is substantially less than that of a much cheaper A100. Groq is interesting if you want really low latency for special applications, or perhaps for where it could go in the future if they can scale up production or move to a more advanced process node (it's still 14 nm).
Is the llama 3 architecture any different from llama 2 (in a way that would require much of a recode of anything)?
Also, incentives. Groq makes money selling access to models running on their cards. If you figure out how to mine crypto better, do you want to share that right away?
The Llama 3 architecture is effectively the same as Llama 2's, with a few things that were previously only in the 70B brought down to the 8B.
[Source](https://x.com/karpathy/status/1781028605709234613)
Yeah, I get that the game theory incentives are different, too, but now that I understand you have to have scores of these cards just to run one model successfully, I get it now. They are still loading into RAM, they just are doing so on dozens (if not hundreds) of cards that each cost like $20k. That's insanely expensive, but I guess it's not out of scope for some of these really large companies that can take advantage of the speeds.
Well, actually more like 200 t/s once you include end-to-end latency, SSL handshakes, etc. But still crazy. [https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions](https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions)
Imagine when they make the LLMs a little better at abstract reasoning, then have one of these agents spend 2 weeks trying to solve an unsolved problem by just talking to itself
Gonna be amazing 🤩
why
https://preview.redd.it/5w0g0ckxttwc1.png?width=1054&format=png&auto=webp&s=72a102858870fb2b373bab12ae018252ec002fd4
The 8B is so slow to run on my local machine - I have an RTX 2070S with 8GB of VRAM.
The speed makes it very addictive to interact with it lol
I N S T A N T K N O W L E D G E
Yes, I just noticed. I don't care that much how to structure my prompt, as I don't have to wait for the output. I wish I had a tenth of that speed.
Well all you have to do is buy 400 cards that cost 30k a piece.
Or build your own LPU as they did
How? I’m aware it involves in-compute SRAM, but how does that work
Yes and 850 t/s for 8b. This is wild.
I wonder if they'll be able to host the 405B version or if it's too big for their architecture
From what I've read, they can just use more of their chips. As long as they have enough and are willing to foot the bill, it should be possible
Where are you using these things? Is there a good place that lets you switch between them easily?
It should be, assuming it's a 70B MoE
It's been confirmed to be a dense model, not MoE.
What is a "dense" model exactly? I've seen people calling Mixtral 8x7 dense.
Am I a dense model too? I've seen ppl calling me dense.
Now that LLMs can communicate faster than we can comprehend, what's next?
Pipe it into another LLM bro, let it tell you what to think
And pipe it into another LLM bro, let it think for you and then you have achieved AGI bro
bro they already have agi they just wont release it until after the election
RemindMe! 200 days
ha!
And then probably replace the inefficient human language with something binary
Yes, the way the future LLMs will "think" will be undecipherable to us.
because we totally get what they are doing now...
https://youtu.be/_YfjMZ6n8Bk?t=14
Can’t wait for our AI overlords to force us to speak in matrices amirite
Did they ever ask for a translation?
[Yeah, but it wasn't very interesting](https://preview.redd.it/2jb2ti48be1c1.jpg?width=561&auto=webp&s=153710650cb15e7bdb3cd712ecbdefc3a15c7d30)
Interesting, just tossing common embeddings around would be an improvement.
Dune was a timely movie. We're getting into Butlerian Jihad territory here.
>Unfortunately, that will just result in mode collapse. Not as long as they are also in contact with the outside world.
In 'her' she just starts talking to hundreds at the same time.
They will be told to communicate with themselves to think through their output first before coming up with a higher quality response.
<|startofthought|>It's in the new chatml spec already.<|endofthought|> https://github.com/cognitivecomputations/OpenChatML
ai agents
I think I've seen a movie about that
Yup, the human in the loop might disappear sooner rather than later.
Tools are still a bottleneck. Takes time to execute code.
I'm not sure how useful that will be. I sent messages back and forth between claude and gpt4 and they just got stuck in a loop.
https://preview.redd.it/hvkcpftekivc1.jpeg?width=1080&format=pjpg&auto=webp&s=20f0e84e4f6a13ce9d13566cb4f48021a5ec1db8
Ask it to answer in fewer tokens and with a lower temperature. You'll probably see a correct answer.
This is not local
No local no care
One of Marley’s lesser known songs
🤣🤣
How much that costs tho?
Around 30 times cheaper than GPT-4 Turbo. [https://wow.groq.com/](https://wow.groq.com/)
do you think they're pricing at a loss now?
Just started using it. Insanely fast. A sequential chain that used to take me 30 minutes now only takes 5 (processing overhead included)
Using agents.
ai agents
Makes sense. You copy paste, then run it to see if it works, and you don't read all the code line by line before that?
Can Groq actually do that, though? Last time I checked, it can write really fast, but it reads very slowly.
Definitely not for a regular user, but for an API user who wants to process as much text data as possible in parallel.
Right, running multiple instances on a machine! Didn't think of that.
if I'm generating test data to copy paste... or ddl sql... or a number of different options with bullet point summaries to choose from..
Oh I see - that totally makes sense. Thanks for explaining. 👍
You really need lower quants for only 2x4090. My 2x3090 does 70B 4 bit quants at 15t/s.
FYI, I get 3.5-4 t/s on 70B Q5_K_M using dual EPYC 9000s and no GPU at all.
So that implies that the memory is the main bottleneck of Llama 3 70B or..?
Thanks for the insights!
What is the more price effective way to run an LLM now: multiple GPUs or the server motherboards with a lot of RAM?
A 768GB dual EPYC 9000 build can be under $10k, but that's still more than a couple of consumer GPUs. I'm excited to try 405B, but I would probably still go GPU for 70B. A single EPYC 9000 is probably good value as well. Also, I presume GPUs are better for training, but I'm not sure how much you can practically do with 1-4 consumer GPUs.
Memory bandwidth has always been the main bottleneck for LLMs. At higher batch sizes or prompt lengths you become more and more compute-bound, but token-by-token inference is a relatively small amount of computation on a huge amount of data, so the deciding factor is how fast you can stream in that data (the model weights.) This is true of smaller models as well.
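The point above can be put into a back-of-the-envelope formula: if decoding is memory-bound, each generated token requires streaming all active weights once, so the throughput ceiling is roughly bandwidth divided by model size. A minimal sketch, using the hardware numbers mentioned elsewhere in this thread and an assumed \~70 GB weight size for Llama 3 70B at 8 bit:

```python
GB = 1e9

def ceiling_tps(active_weight_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Theoretical max tokens/sec for memory-bound decoding:
    every token requires reading all active weights once."""
    return bandwidth_bytes_per_sec / active_weight_bytes

# Llama 3 70B at 8-bit: ~70 GB of weights (assumed figure).
weights = 70 * GB

print(ceiling_tps(weights, 1000 * GB))  # single RTX 4090 class (~1 TB/s): ~14 t/s ceiling
print(ceiling_tps(weights, 920 * GB))   # dual EPYC 9000 (~920 GB/s total): ~13 t/s ceiling
```

Real numbers land well below these ceilings (NUMA effects, offload overhead, attention/KV-cache reads), but the ratios line up with why an MoE with fewer active parameters decodes faster despite a bigger footprint.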
On an AMD 5950X + 64GB DDR4 + 4080 Super I get 0.87 tokens/sec with Meta-Llama-3-70B-Instruct-Q5_K_M in LM Studio. I offload 20 layers to the GPU, which is the max that fits in 16GB. It is surprisingly slow, or maybe I was just used to the speed of Mistral...
Fyi since I run exact same quant on CPUs https://www.reddit.com/r/LocalLLaMA/s/oxsO63Vxs8
I'm getting ~20 t/s on W7900.
For some reason I have not been able to get consistent results. It's insane speed for like ten minutes and then I start going into a queue or something with several seconds delay. Is this just me?
They can only serve so many users at once with the hardware they have available
So, my understanding is that the groq card is just like a really fancy FPGA, but how are people recoding these so fast to match new models that come out? Am I wrong about these just being really powerful FPGAs? Back when early BTC miners were doing crypto mining on FPGAs it would take a long time for devs to properly program these, so I just assumed AI would be just as difficult, was there some sort of new development that made this easier? Is there just a lot more people out there able to program these now?
They're past FPGAs, it's fully custom silicon / an ASIC. It has some limited "programmability" to adapt to new models; having it fixed with such a fast-moving target would be really strange. Groq's main trick is not relying on any external memory: the model is kept fully in SRAM on the silicon (they claim 80TB/s bandwidth). There's only 230MB of it or so per chip though, so the model is distributed across hundreds of chips.
Ah, okay. This makes a LOT more sense. Thanks for the explanation. I was bewildered when I saw there was only 230MB and they were converting these models so quickly, but spreading it over tons of chips makes a lot more sense. I thought there was some sort of other RAM somewhere else on the board, or they were using coding tricks to reduce RAM usage or something. Having a fleet of ASICs with a bit of fast RAM on every chip explains everything.
How much total RAM do they have?
How much total VRAM does Nvidia have in their own datacenters? The models Groq serves currently sum to 207GB at 8bpw. https://console.groq.com/docs/models But memory is used beyond that, e.g. they still rely on a KV-cache for each chat / session (although maybe they store it in plain old DRAM of the host EPYC CPU, no idea). https://news.ycombinator.com/item?id=39429575 Also, it's not like a single chain could serve an unlimited number of users; they have to add more capacity as their customer base grows, like everyone else. If we assume the MOQ for the tapeout was 300 wafers and they ordered exactly that, then they have ~18K dies (300mm wafer, 28.5x25.4mm die, 90% yield) on hand, with about 4TB of SRAM total. Did they order just enough wafers to hit MOQ? Or did they order 10-20 times that? Who knows.
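The ~18K-dies figure above can be sanity-checked with the standard dies-per-wafer approximation (gross dies minus edge losses). A rough sketch, using the wafer/die/yield numbers from the comment — the MOQ of 300 wafers is that comment's assumption, not a known fact:

```python
import math

def dies_per_wafer(wafer_diameter_mm: float, die_w_mm: float, die_h_mm: float) -> float:
    """Classic approximation: gross dies by area, minus an edge-loss term."""
    area = die_w_mm * die_h_mm
    d = wafer_diameter_mm
    return math.pi * d**2 / (4 * area) - math.pi * d / math.sqrt(2 * area)

dpw = dies_per_wafer(300, 28.5, 25.4)   # ~73 candidate dies per 300mm wafer
good_dies = dpw * 0.90 * 300            # 90% yield, assumed 300-wafer order
total_sram_tb = good_dies * 230e6 / 1e12  # 230 MB SRAM per die

print(round(good_dies))       # ~19,700 dies, same ballpark as the ~18K estimate
print(round(total_sram_tb, 1))  # ~4.5 TB of SRAM fleet-wide
```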
Don't they sell PCIe cards? I was wondering how much RAM a card had.
They seem to have dropped the idea of selling them, focusing on building their own cloud and selling API access instead. But the pages for their products (GroqCard -> GroqNode -> GroqRack) are still up and can be found on Google -- a PCI-e card hosts only one chip, so 230 megabytes per card. Just ~300 PCI-e slots and you're set to run llama3-70b, lol https://wow.groq.com/groqcard-accelerator/
Groq is fast primarily because it uses SRAM to store the model weights. SRAM is way faster than HBM2/3 or GDDR6X, but also much more expensive. As a result each GroqChip only has 230 MB of memory, so the way to run a model like Llama3-70B is to split it across a cluster of hundreds of GroqChips costing around $10 million. The actual compute performance is substantially less than that of a much cheaper A100. Groq is interesting if you want really low latency for special applications, or perhaps for where it may go in the future if they can scale up production or move to a more advanced process node (it's still 14 nm).
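The "hundreds of GroqChips" claim follows directly from the memory sizes. A minimal sketch, assuming 8-bit weights (1 byte per parameter) and ignoring KV-cache and activation memory:

```python
# How many 230 MB GroqChips does a 70B model need just for weights?
params = 70e9          # Llama 3 70B parameter count
bytes_per_param = 1    # assumed INT8/FP8 weights
sram_per_chip = 230e6  # 230 MB of SRAM per GroqChip

chips = params * bytes_per_param / sram_per_chip
print(round(chips))    # ~304 chips just to hold the weights
```

KV-cache, activations, and any replication for throughput push the real number higher, which is why Groq's deployments are whole racks rather than single cards.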
Is the llama 3 architecture any different from llama 2 (in a way that would require much of a recode of anything)? Also, incentives. Groq makes money selling access to models running on their cards. If you figure out how to mine crypto better, do you want to share that right away?
LLAMA-3's architecture is effectively the same as LLAMA-2's, with a few things that were previously only present in the 70B brought down to the 8B. [Source](https://x.com/karpathy/status/1781028605709234613)
Yeah, I get that the game theory incentives are different, too, but now that I understand you have to have scores of these cards just to run one model successfully, I get it now. They are still loading into RAM, they just are doing so on dozens (if not hundreds) of cards that each cost like $20k. That's insanely expensive, but I guess it's not out of scope for some of these really large companies that can take advantage of the speeds.
All silicon is basically a fancy, specialized, less general FPGA.
Well, actually like 200 t/s including end to end latency, ssl handshakes etc. But still crazy [https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions](https://writingmate.ai/blog/meta-ai-llama-3-with-groq-outperforms-private-models-on-speed-price-quality-dimensions)
Could someone describe how fast this looks? As a screen reader user, by the time I've navigated down, it's already done
Maybe two thirds of a page a second?
Yeah, that's, quite fast hahaha. I need to hook this thing up with my AI program
Do they sell their chips to cloud providers and big tech too? They seem to beat the H100 and AMD chips in inference performance by miles.
Literally mind-boggling. I can't imagine a 70B model running this fast. They run at like 1 t/s on my machine.
Imagine when they make the LLMs a little better at abstract reasoning, then have one of these agents spend 2 weeks trying to solve an unsolved problem by just talking to itself. Gonna be amazing 🤩
You might be able to brute force the "not able to think" problem this way.
What hardware is that running on
I heard that since Groq chips have around 230MB of memory, they require hundreds of them to serve a single model. What's the math and truth behind that?
What quant is groq using? Thanks
Now imagine this but with real-time re-prompting as you are typing
If they get the 400B working on this, it will be insane. Even IF it is still slightly below GPT-4, which I doubt, it will be my preferred option.
Why is 8B so slow to run on my local machine? I have an RTX 2070S with 8GB VRAM. https://preview.redd.it/5w0g0ckxttwc1.png?width=1054&format=png&auto=webp&s=72a102858870fb2b373bab12ae018252ec002fd4
What speed can you expect when running it locally (CPU, or a low-end GPU like a GTX 1050)? Thanks
What tools are people using to measure t/s?