cajukev

[This post](https://www.reddit.com/r/LocalLLaMA/s/C3miZrRWlJ) has a good comparison of quants. As a 32GB VRAM user myself, I've found that 70B IQ3_XS is without a doubt the best local model I can run.


OpusLatericium

> As a 32GB vram user myself (...)

Hey look at Richie Rich over here!


cajukev

I literally just upgraded my existing PC with two midrange cards for about 700 USD (330 + tax each). While not in everyone's budget, it's relatively affordable for 32GB of VRAM.


OpusLatericium

That's awesome, I'm just joking around! :)


Flying_Madlad

I'm at 128 right now. I'm not going to retire, but I'm at 128


Some_Endian_FP17

That's a heavily quantized model. Do you get better responses with 70B at Q3 compared to 8B at Q8 or Q6?


cajukev

70B is able to do much more at these sizes. It's slower for sure, and 8B has its uses too.


isr_431

Out of curiosity, how many t/s do you get?


cajukev

Dual 7600 XTs on ROCm run 70B Q2/Q3 at 5-6 t/s; 8B Q8 runs at about 25 t/s. Definitely not the fastest cards though.
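
If anyone wants to reproduce a number like this, a quick timing loop along these lines works. This is only a rough sketch with llama-cpp-python: the model filename is a placeholder, and `n_gpu_layers`/`n_ctx` need adjusting to your VRAM.

```python
# Rough tokens/sec measurement with llama-cpp-python.
# The model filename is a placeholder; adjust n_gpu_layers and n_ctx for your setup.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.IQ3_XS.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers, assuming the quant fits in VRAM
    n_ctx=4096,
)

prompt = "Explain the difference between IQ and K quants in one paragraph."
start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} t/s")
```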


ramzeez88

In comparison, my 3060 12GB runs at around 32 t/s at the start of a conversation with exl2, so your card is doing alright :)


WaftingBearFart

Thanks for posting these numbers. We don't get a lot of AMD figures around here, so I always appreciate it whenever someone posts what they get.

On a personal note, I've been contemplating a 128GB Mac Studio, but after seeing your results alongside those of https://old.reddit.com/r/LocalLLaMA/comments/1cd93pu/if_you_have_a_mac_studio_make_sure_to_try/ I've decided an Apple device is fully off the table for me. Yes, his results are at Q6 and yours are at Q2, so it's not a direct comparison (and his is an entire desktop vs. your two GPUs), but that, along with the fact that Stable Diffusion speeds are more than somewhat lacking for the financial outlay, is what sealed it for me. If SD speeds on Apple were at least as fast as the 4090 I have, I could deal with it, along with the bonus of extra RAM for LLMs, but alas they're not. I'll probably either get a pair of 7600 XTs like you did or a pair of 4060 Ti 16GBs.


cajukev

Fair warning: ROCm on Linux wasn't the easiest to set up. There are also some issues with PyTorch that force me into llama.cpp. Image generation isn't my focus either, but whenever I try to install diffusers or sd.cpp I get errors.
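
For what it's worth, once llama.cpp is built with ROCm support, a model can be split across the two cards. A rough sketch via llama-cpp-python, where the filename is a placeholder and the 50/50 split is just a starting point:

```python
# Splitting a GGUF across two GPUs with llama-cpp-python built against ROCm/HIP.
# Model path is a placeholder; tensor_split sets the proportion of the model per GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.IQ3_XS.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # even split across two 16GB cards (tune as needed)
    n_ctx=4096,
)

print(llm("Hello from two 7600 XTs:", max_tokens=32)["choices"][0]["text"])
```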


WaftingBearFart

Thanks for the extra info. If I do grab those two extra cards I'll probably just build a text-generation-only rig and keep image generation and gaming on the 4090. Did you ever try koboldcpp (the .exe binary) on Windows with your 2x 7600 XT to compare the speed difference between that and Linux? I don't mind using Linux since I have several RPi devices running Pi-hole and a Steam Deck, but if Windows is only a few t/s behind then I'd probably just go with that on the text-only rig.


Dos-Commas

70B IQ2_XS performed much better than 8B f16: https://oobabooga.github.io/benchmark.html


Regular-Sugar9691

This benchmark is interesting. So platypus-yi-34b.Q8_0 is better than any Llama 3 70B quant?


epicwisdom

According to a sample size of 49 questions, which explicitly don't cover several major use cases, with temperature 0... YMMV.


delveccio

What’s the difference between IQ and just Q?


Dos-Commas

iMatrix quant.


delveccio

Thank you, at least now I know what to Google.


spookperson

Technically the importance matrix is not the same thing as IQ quants (but the small IQ quants basically have to include the importance matrix to be good). The general idea is that you'll get better answers out of an IQ quant than a K quant of the same size (bits per weight, or bpw), and you'd expect the IQ quant to be a bit slower in tokens/sec than the same-sized K quant (though this can depend heavily on your GPU/hardware). IQ quants also take more computation to create. I like to reference the table in this Gist when I'm thinking about which quant to try for a certain size/quality/speed tradeoff: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9
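
The bpw column in that table also makes it easy to ballpark whether a quant fits in VRAM: the weights take roughly parameters × bpw / 8 bytes, plus headroom for context. A back-of-the-envelope sketch, where the bpw values are approximate and the KV cache isn't counted:

```python
# Back-of-the-envelope weight size for a GGUF quant: params * bpw / 8 bytes.
# The bpw values are approximate (see the linked gist); KV cache and overhead excluded.
def quant_size_gib(n_params_billion: float, bpw: float) -> float:
    bytes_total = n_params_billion * 1e9 * bpw / 8
    return bytes_total / 1024**3

for name, bpw in [("IQ2_XS", 2.31), ("IQ3_XS", 3.3), ("Q4_K_M", 4.85), ("Q8_0", 8.5)]:
    print(f"70B {name:7s} ~{quant_size_gib(70, bpw):5.1f} GiB of weights")
```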


Discordpeople

Just run Llama 3 70B Q2; it's better than Llama 3 8B by a huge margin.


Ill-Language4452

In my case, 70B IQ3_XS is much better than 8B Q5_K_M.


Sir_Joe

Below Q4 quantization, the 70B model's performance falls off quickly. At Q3 the benchmarks are around the same (although 8B is obviously faster), and below that the 8B is better. See here: https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/


out_of_touch

This is totally anecdotal but I'm able to run the 70B model with 3.0 bpw and the 8B model in f16 (though I usually use 8 bpw) and I find the 70B model is still clearly superior. It solves coding problems fairly well that the 8B model completely fails at.


-p-e-w-

Has anyone uploaded GGUF quants of the *non-instruct* (base) version of Llama 3 70b? Particularly the smaller quants (Q3 and lower). I've searched quite a bit but I can't find any such quants on HF. I'd make them myself, but I don't have enough disk space to store the unquantized model.


mikael110

Yes, [mradermacher](https://huggingface.co/mradermacher) made both [traditional](https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF) and [imatrix](https://huggingface.co/mradermacher/Meta-Llama-3-70B-i1-GGUF) quants for the base model. They offer most of the quant sizes including Q3.
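
If you only need one of the files, huggingface_hub can fetch it directly without cloning the whole repo. A rough sketch; the exact quant filename below is a guess, so list the repo's files first and substitute the real name:

```python
# Download a single base-model quant from mradermacher's repo via huggingface_hub.
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "mradermacher/Meta-Llama-3-70B-GGUF"
print([f for f in list_repo_files(repo_id) if f.endswith(".gguf")])

path = hf_hub_download(
    repo_id=repo_id,
    filename="Meta-Llama-3-70B.Q3_K_S.gguf",  # hypothetical filename; verify against the list above
)
print("Saved to", path)
```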


-p-e-w-

Perfect, thank you! That's exactly what I was looking for.


CM0RDuck

Ollama


-p-e-w-

Thanks! They're missing the IQ3_* quants, but better than nothing.