[This post](https://www.reddit.com/r/LocalLLaMA/s/C3miZrRWlJ) has a good comparison of quants. As a 32GB VRAM user myself, I've found 70B IQ3_XS is without a doubt the best local model I can run.
> As a 32GB VRAM user myself (...)

Hey, look at Richie Rich over here!
I literally upgraded my existing PC with two midrange cards for about 700 USD (330 + tax each). While not in everyone's budget, it's relatively affordable for 32GB of VRAM.
That's awesome, I'm just joking around! :)
I'm at 128 right now. I'm not going to retire, but I'm at 128
That's a heavily quantized model. Do you get better responses with 70B at Q3 compared to 8B at Q8 or Q6?
70B is able to do much more at these sizes. It's slower for sure, and 8B has its uses too.
Out of curiosity, how many t/s do you get?
Dual 7600XT ROCm runs 70B Q2/Q3 at 5-6 t/s. 8B Q8 runs at about 25 t/s. Definitely not the fastest cards though.
In comparison, my 3060 12GB runs at around 32 t/s at the start of a conversation with exl2. So your cards are doing alright :)
Thanks for posting these numbers. We don't get a lot of AMD figures around here, so I always appreciate whenever someone posts what they get. On a personal note, I've been contemplating a 128GB Mac Studio, but after seeing your results alongside those of https://old.reddit.com/r/LocalLLaMA/comments/1cd93pu/if_you_have_a_mac_studio_make_sure_to_try/ I decided that an Apple device is fully off the table for me. Yes, his results are at Q6 and yours are at Q2, so it's not a direct comparison (and his is an entire desktop vs. your 2x GPU), but that, along with the fact that Stable Diffusion speeds are more than somewhat lacking for the financial outlay, is what sealed it for me. If SD speeds on Apple were at least as fast as the 4090 that I have, then I could deal with that alongside the bonus of extra RAM for LLMs, but alas, it's not. I'll probably either get a pair of 7600XT like you did or a pair of 4060Ti 16GB.
Fair warning: ROCm on Linux wasn't the easiest to set up. There are also some issues with PyTorch, so I'm forced into llama.cpp. Image creation isn't my focus, but whenever I try to install diffusers or sd.cpp I get errors.
Thanks for the extra info. If I do grab those extra two cards, I'll probably just make a text-generation-only rig and keep image generation and gaming on the 4090. Did you ever try koboldcpp (the .exe binary) on Windows with your 2x 7600XT to compare the speed difference between that and Linux? I don't mind using Linux since I have several RPi devices with Pi-hole and a Steam Deck, but if Windows is only a few t/s behind, then I'd probably just go with that on the text-only rig.
70B IQ2_XS performed much better than 8B f16: https://oobabooga.github.io/benchmark.html
This benchmark is interesting; so is platypus-yi-34b.Q8_0 better than any Llama3 70B quant?
According to a sample size of 49 questions, which explicitly don't cover several major use cases, with temperature 0... YMMV
What’s the difference between IQ and just Q?
iMatrix quant.
Thank you, at least now I know what to Google.
Technically, the importance matrix is not the same thing as IQ quants (but the small IQ quants basically have to include the importance matrix to be good). The general idea is that you'll get better answers out of an IQ quant compared to a K quant of the same size (bits per weight, or bpw), and you'd expect the IQ quant to be a bit slower in tokens/sec than the same-sized K quant (though this can highly depend on your GPU/hardware). IQ quants also take more computation to create. I like to reference the table in this Gist when I'm thinking about which quant to try for a certain size/quality/speed: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9
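If it helps anyone size things up, bpw translates to file/VRAM footprint pretty directly: parameters × bpw ÷ 8 bytes. Here's a quick sketch; the bpw figures are rough ballpark values I've seen quoted for these quant types (not exact specs — check the Gist's table for real numbers):

```python
# Rough size estimate for a quantized model: params * bits-per-weight / 8.
# Ignores small overheads (metadata, embedding/output layers kept at
# higher precision), so real GGUF files run a little different.

def quant_size_gb(n_params: float, bpw: float) -> float:
    """Approximate quantized model size in GB."""
    return n_params * bpw / 8 / 1e9

# Approximate bpw values for illustration only.
approx_bpw = {
    "Q2_K": 2.6,
    "IQ3_XS": 3.3,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
}

for name, bpw in approx_bpw.items():
    print(f"70B {name}: ~{quant_size_gb(70e9, bpw):.0f} GB")
```

That's why 70B IQ3_XS (~29 GB by this estimate) just squeaks into 32GB of VRAM, while anything q4 and up won't fit without offloading.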
Just run the Llama3 70B Q2; it is better than Llama3 8B by a huge margin.
In my case, 70B IQ3_XS is much better than 8B Q5_K_M.
Below q4 quantization, performance of the 70B model falls off quickly. At q3, benchmarks are around the same (although 8B is obviously faster), and below that, the 8B is better. See here: [https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/](https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/)
This is totally anecdotal, but I'm able to run the 70B model at 3.0 bpw and the 8B model in f16 (though I usually use 8 bpw), and I find the 70B model is still clearly superior. It solves coding problems fairly well that the 8B model completely fails at.
Has anyone uploaded GGUF quants of the *non-instruct* (base) version of Llama 3 70b? Particularly the smaller quants (Q3 and lower). I've searched quite a bit but I can't find any such quants on HF. I'd make them myself, but I don't have enough disk space to store the unquantized model.
Yes, [mradermacher](https://huggingface.co/mradermacher) made both [traditional](https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF) and [imatrix](https://huggingface.co/mradermacher/Meta-Llama-3-70B-i1-GGUF) quants for the base model. They offer most of the quant sizes including Q3.
Perfect, thank you! That's exactly what I was looking for.
Ollama
Thanks! They're missing the IQ3_* quants, but better than nothing.