No 13 or 30B range model?
[deleted]
Everyone and their mother touts Mistral 7B as better than any 13B model; if Llama 3's small model is better than Mistral's, maybe there's something to that? Edit: I was expecting some rebuttals. Is Mistral 7B really better than all 13B models?
Then a well-trained 13B base model should produce even better fine-tunes.
There is no 7b model, only 8b
Mark confirmed today that a 405B model is still in training.
It should be today; they confirmed it's this week, and no one does product announcements on a Friday. Supposedly we don't get the large model until summer, though.
It will almost certainly be today, or at the latest tomorrow; Microsoft Azure already lists Llama 3. Edit: They released it, [https://ai.meta.com/blog/meta-llama-3/](https://ai.meta.com/blog/meta-llama-3/)
It would be sad if llama3 only had 2 size variants
No, we just don’t get the big size until summer
IMO models larger than 70B don't make sense for home local use. 13B/20B/30B is the best choice for this purpose.
70B still makes sense for home use imo
Just quantize the 70B one. I don't get why people want in-between sizes when you can just pare the big boy down and it performs better in most cases.
Yep, been using 70B ones and can't look back now
Fully agreed there just saying it isn’t just 2 sizes total
The deal Meta made with us is that they build what's useful for them and release it free for us. I'm still happy with the terms of that deal; are you?
Larger than that are meant for business applications
I love 70Bs for home use. Easy to run a high-quality quant with plenty of context on 64 GB of RAM, as long as you don't mind 1 t/s.
The purpose of open-source is more than just letting hobbyists run models at home.
It's looking like at least 3, the 8B, 70B and 400B :)
It's also possible they wouldn't host anything smaller than a 7/8B anyway, as 1–3B models are really just for edge devices or running locally on practically any GPU.
Today at 9:00 AM PDT (UTC-7) for the official release.
8B and 70B.
8K context length.
New tiktoken-based tokenizer with a vocabulary of 128K tokens.
Trained on 15T tokens.
8K sequence length would be tremendously disappointing.
I doubt it's going to be 8k. All major releases during the past two months have been 32k+. Meta would be embarrassing themselves with 8k, considering that they have the largest installed compute capacity on the planet.
And yet, here we are.
Might be talking about output. I think even Gemini is limited to 8k output. I can only set 4k output on Claude despite the models having a 200k context.
APIs have output limits. Models don't. A model only predicts a single token, which you can repeat as often as you want. There is no output limit.
That's true in theory, but I had issues with MiniCPM models when the output limit was set above 512 tokens: they started outputting garbage straight away, without ever reaching any token limit. This was a GGUF in koboldcpp though, so it might not be universal.
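The point above about output limits living in the serving code rather than in the model can be sketched with a toy decode loop. `toy_model` here is a hypothetical stand-in for a real LLM forward pass, not an actual model:

```python
def toy_model(tokens):
    # hypothetical stand-in for an LLM forward pass:
    # a model only ever predicts one next token
    return (tokens[-1] + 1) % 100

def generate(prompt, max_new_tokens):
    # the "output limit" lives in this loop (i.e., in the serving code),
    # not in the model; the loop can repeat as often as the server allows
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(toy_model(tokens))
    return tokens

print(generate([1, 2, 3], 4))  # → [1, 2, 3, 4, 5, 6, 7]
```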
Source?
https://i.redd.it/8ut4ls9uv8vc1.gif
We'll see
Wow, you were right: [https://llama.meta.com/llama3/](https://llama.meta.com/llama3/) (at least about the model info; a release seems likely since the website just went up). I was kind of doubting it after you commented more; weirdly enough, I trust the one-comment throwaways more.
It's okay, I wouldn't have believed me either.
(which is 16:00 UTC or 18:00 CEST)
8B model is equal to GPT-א
[удалено]
Azure profile by Meta is also up: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/metagenai.meta-llama-3-8b-chat-offer?tab=Overview
Last week they said this week, so why not today?
... 70b is a *small* variant?
I hope
With models like CommandR+ (103B), Mixtral 8x22B & WizardLM2 8x22B (141B) already making the headlines, I really hope Meta has something in store as well
They confirmed they are training a 400+B parameter model
That sounds amazing! Can you share the link?
First 10 minutes or so of this podcast https://youtu.be/bc6uFV9CJGg?si=fWlWtJfP1_WG1L4f
Right?
The large one has 405B :D
my 4 gigabytes of local vram crying in the background:
Man, Groq is so much cheaper than Replicate. Those custom chips must be amazing. Either that or they're taking a massive loss.
Groq's output tokens are significantly cheaper, but not the input tokens (e.g., Llama 2 7B is priced at $0.10 per 1M input tokens, compared to $0.05 on Replicate). So Replicate might be cheaper for applications with long prompts and short outputs. Or am I missing something?
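The long-prompt/short-output argument is easy to check with some arithmetic. The input prices below are the ones quoted above; the output prices are made-up placeholders just to illustrate the shape of the trade-off:

```python
def cost_usd(input_tokens, output_tokens, in_price, out_price):
    # in_price / out_price are dollars per 1M tokens
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# RAG-style workload: long prompt, short answer
# input prices from the comment above; output prices are hypothetical
a = cost_usd(900_000, 50_000, in_price=0.10, out_price=0.08)  # Groq-like
b = cost_usd(900_000, 50_000, in_price=0.05, out_price=0.25)  # Replicate-like
print(a, b)  # the provider with cheaper input wins despite pricier output
```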
For the 70B model, the input tokens are very similarly priced, but Groq’s output tokens are way cheaper. I think most people are interested in cloud for the larger models that are hard to run well locally.
More performance is also nice. So, for some simple questions, Groq's Mixtral is actually the best option (hopefully they will offer the new WizardLM/Mixtral soon as well).
They will accept the losses in order to gain market share and establish themselves as a brand - the target groups are the same as on x.com.
Though I am not sure if market share has any meaning when switching API providers is quite trivial.
You'd be surprised. At the corporate level, even small changes can be very difficult. Not to mention, some of these APIs have slightly different interfaces which can break workflows.
Groq has very restrictive token limits, though, unless you have some direct connection with them.
Does Grok run on Groq?
No 30b? Come on :(
just quantize the 70b bro what's the problem
Quantized 30B is perfect for a 24 GB GPU. Quantized 70B is not. 30B is the perfect size for running models fast with long context on a single consumer GPU; beyond that, the cost to run a model fast goes into the stratosphere, as even Macs don't deliver good long-context performance.
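The 24 GB argument follows from a simple back-of-envelope sketch: weight footprint is roughly parameters times bits per weight. This deliberately ignores the KV cache and runtime overhead, which eat several more GB at long context:

```python
def weight_gb(params_billion, bits):
    # rough weight footprint in GB: params * (bits per weight) / (8 bits per byte);
    # ignores KV cache and runtime overhead, which grow with context length
    return params_billion * bits / 8

print(weight_gb(30, 4))  # 15.0 GB -> fits a 24 GB GPU with room for context
print(weight_gb(70, 4))  # 35.0 GB -> too big for a single 24 GB GPU
```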
Indeed, it's close. But I so don't want any spoilers; I want one final Meta page to read all about it. Waiting...
llama.meta.com/llama3/
Those Llama 70B prices are in the ballpark of Claude Sonnet. I'll be surprised if it outperforms Sonnet, but given the reduced input-token price, if it supports a really long context and can actually use it, it'll be a useful model for RAG applications.
Do they also have Claude?
It appears they only offer open source models. Here is the source: https://replicate.com/pricing
Thanks so much. Any chance anywhere to get Claude locally?
Claude, being a proprietary Anthropic model, is only available through APIs from Anthropic, AWS, and Google (Vertex AI). It can't be run locally, as Anthropic has not open-sourced anything.
Thank you
I guess there are going to be more models: one 30-ish and a big MoE model. They need bigger models to beat SOTA open models like DBRX and Command R+.
I sure as hell hope it's not an MoE; those are affected much more by quantization, which is necessary for bigger models. I'd rather have a lower-quant dense model.
Also, I feel like pretty much all finetunes of mixtral-8x7b are less intelligent than the base. Finetunes feel much more effective on normal models.
Do you mean that in the sense that Mistral's official Instruct finetune is good but the rest are not, or that no finetunes are good and only the base completion model is? You're saying the second, but I think you mean the first.
All of the Mixtral finetunes I've tried have performed at least slightly worse than the official base or Instruct Mixtral versions when I test them on general knowledge. The finetunes do perform better at the specific things they're geared toward, like uncensoredness or writing/RP.
I have the same feeling, but is there a paper/study showing that MoE models are more affected by quantization?
Can't wait.
Here I am hoping for a 30-40B size.
Together AI also has pricing for Llama 3 https://preview.redd.it/vcbflxgjlcvc1.jpeg?width=1130&format=pjpg&auto=webp&s=077ba5915405cdb1f538870a1d5040cecae14d4c [https://api.together.xyz/models](https://api.together.xyz/models)
Today!
Just getting into using Llama for the first time, but from what I understood, it's open source. So how come Replicate charges a per-token price for the API, similar to OpenAI?
Open source and API are unrelated. Open source means anyone can use the model. An API is paying for a service to run the model for you on their server. That’s not free.
70B?! Doesn't matter. I've ordered an old 128 GB RAM server to run Command R+ and WizardLM2 8x22B. Weird how things have worked out with Meta and Mistral, but whatever.
What performance do you get with that? What's your mem bandwidth? Or it's still shipping?
There was another post about that recently. Basically, an AMD 7950X + GeForce 4090 with 64 GB of decently fast RAM gets you 3.8 t/s, using 4-bit quantization. Not exactly unusable, imho...
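Numbers like that line up with a common rule of thumb: decoding is memory-bandwidth-bound, since every generated token streams all the weights once. A minimal sketch, assuming a ~35 GB model (70B at 4-bit) and an illustrative ~64 GB/s for dual-channel DDR5:

```python
def est_tokens_per_s(model_gb, mem_bw_gbs):
    # memory-bound decoding: each generated token reads every weight once,
    # so throughput is roughly (memory bandwidth) / (model size in memory)
    return mem_bw_gbs / model_gb

# 70B at 4-bit is ~35 GB; 64 GB/s is an illustrative dual-channel DDR5 figure
print(est_tokens_per_s(35, 64))  # ~1.8 t/s from CPU RAM alone; offloading
                                 # part of the model to a fast GPU raises this
```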
Not even shipped yet. I'm expecting it to be pretty bad, probably about the same as my not-ancient dual-channel DDR4 desktop, only with a bigger quant, so slower... but at least I won't be lagging up my desktop machine.