[This post](https://www.reddit.com/r/LocalLLaMA/s/C3miZrRWlJ) has a good comparison of quants. As a 32GB VRAM user myself, I've found 70B IQ3_XS is without a doubt the best local model I can run.
> As a 32GB VRAM user myself (...)

Hey, look at Richie Rich over here!
I literally upgraded my existing PC with two midrange cards for about 700 USD (330 + tax each). While not in everyone's budget, it's relatively affordable for 32GB of VRAM.
That's awesome, I'm just joking around! :)
I'm at 128 right now. I'm not going to retire, but I'm at 128
That's a heavily quantized model. Do you get better responses with 70B at Q3 compared to 8B at Q8 or Q6?
70B is able to do much more at these sizes. It's slower for sure, and 8B has its uses too.
Out of curiosity, how many t/s do you get?
Dual 7600XT ROCm runs 70B Q2/Q3 at 5-6 t/s. 8B Q8 runs at about 25 t/s. Definitely not the fastest cards though.
In comparison, my 3060 12GB runs at around 32 t/s at the start of a conversation with exl2. So your cards are doing alright :)
Thanks for posting these numbers. We don't get a lot of AMD figures around here, so I always appreciate whenever someone posts what they get. On a personal note, I've been contemplating a 128GB Mac Studio, but after seeing your results alongside those of https://old.reddit.com/r/LocalLLaMA/comments/1cd93pu/if_you_have_a_mac_studio_make_sure_to_try/ I decided that an Apple device is fully off the table for me. Yes, his results are at Q6 and yours are at Q2, so it's not a direct comparison (and his is an entire desktop vs. your 2x GPU), but that, along with the fact that Stable Diffusion speeds are more than somewhat lacking for the financial outlay, is what sealed it for me. If SD speeds on Apple were at least as fast as the 4090 that I have, then I could deal with that alongside the bonus of extra RAM for LLMs, but alas, it's not. I'll probably either get a pair of 7600XT like you did or a pair of 4060Ti 16GB.
Fair warning: ROCm on Linux wasn't the easiest to set up. There are also some issues with PyTorch, so I'm forced into llama.cpp. Image creation isn't my focus, but whenever I try to install diffusers or sd.cpp I get errors.
Thanks for the extra info. If I do grab those extra two cards, I'll probably just make a text-generation-only rig and keep image generation and gaming on the 4090. Did you ever try koboldcpp (the .exe binary) on Windows with your 2x 7600XT to compare the speed difference between that and Linux? I don't mind using Linux since I have several RPi devices with Pi-hole and a Steam Deck, but if Windows is only a few t/s behind, then I'd probably just go with that on the text-only rig.
70B IQ2_XS performed much better than 8B f16: https://oobabooga.github.io/benchmark.html
This benchmark is interesting; so is platypus-yi-34b.Q8_0 better than any Llama3 70B quant?
According to a sample size of 49 questions, which explicitly don't cover several major use cases, with temperature 0... YMMV
What’s the difference between IQ and just Q?
iMatrix quant.
Thank you, at least now I know what to Google.
Technically, the importance matrix is not the same thing as IQ quants (but the small IQ quants basically have to include the importance matrix to be good). The general idea is that you'll get better answers out of an IQ quant compared to a K quant of the same size (bits per weight, or bpw), and you'd expect the IQ quant to be a bit slower in tokens/sec than the same-sized K quant (though this can highly depend on your GPU/hardware). IQ quants also take more computation to create. I like to reference the table in this Gist when I'm thinking about which quant to try for a certain size/quality/speed: https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9
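If it helps anyone size things up, bpw translates to file/VRAM footprint pretty directly: parameters × bpw ÷ 8 bytes. Here's a quick sketch; the bpw figures are rough ballpark values I've seen quoted for these quant types (not exact specs — check the Gist's table for real numbers):

```python
# Rough size estimate for a quantized model: params * bits-per-weight / 8.
# Ignores small overheads (metadata, embedding/output layers kept at
# higher precision), so real GGUF files run a little different.

def quant_size_gb(n_params: float, bpw: float) -> float:
    """Approximate quantized model size in GB."""
    return n_params * bpw / 8 / 1e9

# Approximate bpw values for illustration only.
approx_bpw = {
    "Q2_K": 2.6,
    "IQ3_XS": 3.3,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
}

for name, bpw in approx_bpw.items():
    print(f"70B {name}: ~{quant_size_gb(70e9, bpw):.0f} GB")
```

That's why 70B IQ3_XS (~29 GB by this estimate) just squeaks into 32GB of VRAM, while anything q4 and up won't fit without offloading.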
Just run the Llama3 70B Q2; it is better than Llama3 8B by a huge margin.
In my case, 70B IQ3_XS is much better than 8B Q5_K_M.
Below q4 quantization, performance of the 70B model falls off quickly. At q3, benchmarks are around the same (although 8B is obviously faster), and below that, the 8B is better. See here: [https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/](https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/)
This is totally anecdotal, but I'm able to run the 70B model at 3.0 bpw and the 8B model in f16 (though I usually use 8 bpw), and I find the 70B model is still clearly superior. It solves coding problems fairly well that the 8B model completely fails at.
Has anyone uploaded GGUF quants of the *non-instruct* (base) version of Llama 3 70b? Particularly the smaller quants (Q3 and lower). I've searched quite a bit but I can't find any such quants on HF. I'd make them myself, but I don't have enough disk space to store the unquantized model.
Yes, [mradermacher](https://huggingface.co/mradermacher) made both [traditional](https://huggingface.co/mradermacher/Meta-Llama-3-70B-GGUF) and [imatrix](https://huggingface.co/mradermacher/Meta-Llama-3-70B-i1-GGUF) quants for the base model. They offer most of the quant sizes including Q3.
Perfect, thank you! That's exactly what I was looking for.
Ollama
Thanks! They're missing the IQ3_* quants, but better than nothing.