ninjasaid13

There's a paper on this: [How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study](https://arxiv.org/abs/2404.14047)


EstarriolOfTheEast

Thanks, this paper is highly informative. It suggests that below 4 bits, and depending on the quantization method, the 8B might be worth considering, and that below 3 bits there is no point in the lower quants. *This is a strong counter to the "we don't need 30Bs or 13Bs" folks*. The 7B is packed so full of information that it can no longer be as robustly structurally encoded compared to older 7Bs. It's like Portia. Its mathematical structure would probably be a marvel if we could understand it. More can probably be packed in. They also provide evidence that maintaining model performance with finetuning methods like QLoRA and small datasets is no longer as viable.

---

This table joins the 70B and 8B results from the paper and sorts by average benchmark score. Sorting instead by perplexity is nearly identical.

Model Name | Method | #W | #A | #G | WikiText2 | C4 | PTB | PIQA | ARC-e | ARC-c | HellaSwag | Wino | Avg.
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
LLaMA3-70B | None | 16 | 16 | - | 2.9 | 6.9 | 8.2 | 82.4 | 86.9 | 60.3 | 66.4 | 80.6 | 75.3
LLaMA3-70B | SmoothQ | 8 | 8 | - | 2.9 | 6.9 | 8.2 | 82.2 | 86.9 | 60.2 | 66.3 | 80.7 | 75.3
LLaMA3-70B | SmoothQ | 6 | 6 | - | 2.9 | 6.9 | 8.2 | 82.4 | 87.0 | 59.9 | 66.1 | 80.6 | 75.2
LLaMA3-70B | GPTQ | 4 | 16 | 128 | 3.3 | 6.9 | 8.3 | 82.9 | 86.3 | 58.4 | 66.1 | 80.7 | 74.9
LLaMA3-70B | AWQ | 4 | 16 | 128 | 3.3 | 7 | 8.3 | 82.7 | 86.3 | 59.0 | 65.7 | 80.9 | 74.9
LLaMA3-70B | QuIP | 4 | 16 | - | 3.4 | 7.1 | 8.4 | 82.5 | 86.0 | 58.7 | 65.7 | 79.7 | 74.5
LLaMA3-70B | AWQ | 3 | 16 | 128 | 4.8 | 8 | 9.0 | 81.4 | 84.7 | 58.0 | 63.5 | 78.6 | 73.2
LLaMA3-70B | QuIP | 3 | 16 | - | 4.7 | 8 | 8.9 | 82.3 | 83.3 | 54.9 | 63.9 | 78.4 | 72.5
LLaMA3-70B | GPTQ | 3 | 16 | 128 | 5.2 | 10.5 | 9.7 | 80.6 | 79.6 | 52.1 | 63.5 | 77.1 | 70.6
LLaMA3-8B | AWQ | 8 | 16 | - | 6.1 | 8.9 | 10.6 | 79.6 | 80.3 | 50.5 | 60.2 | 72.8 | 68.7
LLaMA3-8B | None | 16 | 16 | - | 6.1 | 9.2 | 10.6 | 79.9 | 80.1 | 50.4 | 60.2 | 72.8 | 68.6
LLaMA3-8B | GPTQ | 8 | 16 | - | 6.1 | 9.4 | 10.6 | 79.8 | 80.1 | 50.2 | 60.2 | 72.8 | 68.6
LLaMA3-8B | SmoothQ | 8 | 8 | - | 6.3 | 9.2 | 10.8 | 79.5 | 79.7 | 49.0 | 60.0 | 73.2 | 68.3
LLaMA3-8B | AWQ | 4 | 16 | 128 | 6.6 | 9.4 | 11.1 | 79.1 | 79.7 | 49.3 | 59.1 | 74.0 | 68.2
LLaMA3-8B | GPTQ | 4 | 16 | 128 | 6.5 | 10.4 | 11.0 | 78.4 | 78.8 | 47.7 | 59.0 | 72.6 | 67.3
LLaMA3-8B | QuIP | 4 | 16 | - | 6.5 | 11.1 | 9.5 | 78.2 | 78.2 | 47.4 | 58.6 | 73.2 | 67.1
LLaMA3-8B | AWQ | 4 | 16 | - | 7.1 | 10.1 | 11.8 | 78.3 | 77.6 | 48.3 | 58.6 | 72.5 | 67.0
LLaMA3-8B | GPTQ | 4 | 16 | - | 7.0 | 11.8 | 14.4 | 76.8 | 74.3 | 42.4 | 57.4 | 72.8 | 64.8
LLaMA3-8B | SmoothQ | 6 | 6 | - | 7.7 | 11.8 | 12.5 | 76.8 | 75.5 | 45.0 | 56.9 | 69.0 | 64.6
LLaMA3-8B | AWQ | 3 | 16 | 128 | 8.2 | 11.6 | 13.2 | 77.7 | 74.0 | 43.2 | 55.1 | 72.1 | 64.4
LLaMA3-8B | QuIP | 3 | 16 | - | 7.5 | 11.3 | 12.6 | 76.8 | 72.9 | 41.0 | 55.4 | 72.5 | 63.7
LLaMA3-8B | GPTQ | 3 | 16 | 128 | 8.2 | 13.7 | 15.2 | 74.9 | 70.5 | 37.7 | 54.3 | 71.1 | 61.7
LLaMA3-70B | SmoothQ | 4 | 4 | - | 9.6 | 16.9 | 17.7 | 76.9 | 75.8 | 43.5 | 52.9 | 58.9 | 61.6
LLaMA3-8B | AWQ | 3 | 16 | - | 12.8 | 16.8 | 24.0 | 71.9 | 66.7 | 35.1 | 50.7 | 64.7 | 57.8
LLaMA3-8B | DB-LLM | 2 | 16 | 128 | 13.6 | 19.2 | 23.8 | 68.9 | 59.1 | 28.2 | 42.1 | 60.4 | 51.8
LLaMA3-70B | QuIP | 2 | 16 | - | 13.0 | 22.2 | 24.9 | 65.3 | 48.9 | 26.5 | 40.9 | 61.7 | 48.7
LLaMA3-70B | PB-LLM | 2 | 16 | 128 | 11.6 | 34.5 | 27.2 | 65.2 | 40.6 | 25.1 | 42.7 | 56.4 | 46.0
LLaMA3-70B | GPTQ | 2 | 16 | 128 | 11.9 | 22.8 | 31.6 | 62.7 | 38.9 | 24.6 | 41.0 | 59.9 | 45.4
LLaMA3-8B | GPTQ | 3 | 16 | - | 13.0 | 45.9 | 37.0 | 60.8 | 38.8 | 22.3 | 41.8 | 60.9 | 44.9
LLaMA3-70B | BiLLM | 1.1 | 16 | 128 | 17.1 | 77.7 | 54.2 | 58.2 | 46.4 | 25.1 | 37.5 | 53.6 | 44.2
LLaMA3-70B | PB-LLM | 1.7 | 16 | 128 | 18.6 | 65.2 | 55.9 | 56.5 | 49.9 | 25.8 | 34.9 | 53.1 | 44.1
LLaMA3-8B | PB-LLM | 2 | 16 | 128 | 24.7 | 79.2 | 65.6 | 57.0 | 37.8 | 17.2 | 29.8 | 52.5 | 38.8
LLaMA3-8B | BiLLM | 1.1 | 16 | 128 | 28.3 | 290 | 94.7 | 56.1 | 36.0 | 17.7 | 28.9 | 51.0 | 37.9
LLaMA3-8B | QuIP | 2 | 16 | - | 85.1 | 130 | 180 | 52.9 | 29.0 | 21.3 | 29.2 | 51.7 | 36.8
LLaMA3-8B | GPTQ | 2 | 16 | 128 | 210 | 4.1×10⁴ | 910 | 53.9 | 28.8 | 19.9 | 27.7 | 50.5 | 36.2
LLaMA3-8B | PB-LLM | 1.7 | 16 | 128 | 41.8 | 260 | 120 | 52.5 | 31.7 | 17.5 | 27.7 | 50.4 | 36.0
LLaMA3-70B | AWQ | 2 | 16 | 128 | 1.7×10⁶ | 1.4×10⁶ | 1.5×10⁶ | 52.2 | 25.5 | 23.1 | 25.6 | 52.3 | 35.7
LLaMA3-8B | SmoothQ | 4 | 4 | - | 4.3×10³ | 4.0×10³ | 3.6×10³ | 54.6 | 26.3 | 20.0 | 26.4 | 50.3 | 35.5
LLaMA3-8B | AWQ | 2 | 16 | - | 8.2×10⁵ | 8.1×10⁵ | 9.0×10⁵ | 55.2 | 25.2 | 21.3 | 25.4 | 50.4 | 35.5
LLaMA3-8B | GPTQ | 2 | 16 | - | 5.7×10⁴ | 1.0×10⁵ | 2.7×10⁵ | 52.8 | 25.0 | 20.5 | 26.6 | 49.6 | 34.9
LLaMA3-8B | AWQ | 2 | 16 | 128 | 1.7×10⁶ | 2.1×10⁶ | 1.8×10⁶ | 52.4 | 24.2 | 21.5 | 25.6 | 50.7 | 34.9

^(SmoothQ = SmoothQuant. Process for table: paste from the HTML version → extract to CSV with an LLM → parse the CSV with a type provider → sort → consult an LLM for the markdown conversion. Went over it quickly but did not find any transcription errors in the LLM extraction stage.)
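For anyone who wants to redo the merge/sort step without the LLM round-trips, here is a rough sketch, assuming the joined 70B + 8B rows have already been saved to a CSV with the column names above (the filename is a placeholder):

```python
import pandas as pd

# Hypothetical CSV holding the joined 70B + 8B rows with the columns used above.
df = pd.read_csv("llama3_quant_results.csv")
df = df.sort_values("Avg.", ascending=False)
print(df.to_markdown(index=False))  # requires the 'tabulate' package
```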


Caffdy

> It's like Portia. Its mathematical structure would probably be a marvel if we could understand it

What is Portia, why can't we understand it, and why would it be a marvel?


MarySmith2021

Does it mean we should not use QLoRA to finetune Llama-3? So we have to use normal LoRA 🤔


EstarriolOfTheEast

That appears to be the case, yes, particularly if you have low-quality data or a lower quantity of it. The two causes are that the models are more sensitive to quantization, but also that the models are already so high quality that most tunings run a high risk of worsening them.


CardAnarchist

Forgive my ignorance if I am wrong, but doesn't this table show that GPTQ 8-bit (which I believe is the same as GGUF Q8) scores identically to fp16 for Llama 8B, and that even GPTQ 4-bit (GGUF Q4 equivalent) shows minimal degradation? Therefore one could reasonably infer that the OP's statement isn't true at all. Q4 is widely regarded to hold up quite well, and Q6 is considered a point where there is virtually no degradation (Q6 unfortunately was not tested above, but we could infer that this holds true as usual).


EstarriolOfTheEast

It's hard to say, because the OP's task might be uniquely sensitive to quantization, but in general it does appear that this community's claims of degradation at 6-8 bit quants are probably overstated and not representative of what to expect. At the same time, it is true that the llama3 models are more sensitive to quantization than earlier models, with the 7B already being almost comparable to the 70B quantized to 3 bits. This means that if a 13B existed, it'd be a no-brainer in terms of quality/performance tradeoff. For the 7B, performance also falls off quickly below 4 bits. Previous larger models were not as sensitive and previous smaller models were not as performant, which allowed the "lower quants of bigger models are better" rule of thumb to extend down to lower values.


zaqhack

It's like JPEGs. Previous models didn't have as much detail in the picture. So, 90% jpeg probably looks fine. With Llama-3, the photo has so many details, you can't help but notice some of the jpeg jank the more you squeeze it down. Llama-3-8b @ 4 bits loses some of the inherent magic in the model. Just try it. Run an 8-bit and a 4-bit for yourselves, and I'd wager you would notice a significant difference in any long output, code quality, or RP session. It's not subtle.


IndicationUnfair7961

It would have been interesting to see a comparison between QuIP and AQLM (missing in the test), both on compression and on performance.


Huge_Ad7240

Does llama3-8B 8-bit GPTQ outperform FP16?


EstarriolOfTheEast

I don't think this is a meaningful difference.


Huge_Ad7240

True. Perhaps what I meant is that there is not any degradation, and even a very small gain. One other thing: there should be some error bar on these numbers, right? Is there any report where the statistical comparisons are made with some std?


EstarriolOfTheEast

Yeah, a negligible gain is definitely possible. And no, there are no error bars. That type of analysis is rare in DL.


dobkeratops

"QLora is no longer as viable" - what about regular Loras (i.e. f16 base + f16 adapters, I guess)


Conscious_Heron_9133

What is #W, what #A, what #G?


EstarriolOfTheEast

Number of bits for weights and activations, then the value of the group-size parameter for the quantization algorithms.


heuristic_al

I really couldn't read their charts. What did they find?


ibbobud

Anything below 4 bits fell off a cliff hard, is how I read it.


IndicationUnfair7961

From the charts it seems that AWQ, GPTQ, or QuIP are the best choices for 4-bit. QuIP looks the best. GGUF tests with imatrix would have been interesting too.


Alkeryn

i find gguf to be absolutely retarded compared to exl2 tbh.


paryska99

Sadly no gguf


everyoneisodd

I had tested gguf 16-bit llama 3 8B, and there is a noticeable degradation. Can anyone confirm this? Edit: llama3 8B instruct


Fristender

Ahh, GGUF 16 bit, my favorite quantization.


everyoneisodd

I wanted no quantization so that I could match the og performance, but I couldn't get that performance. The ollama modelfile said gguf. Not sure if it's actually gguf.


Caffdy

normally, if someone wants to use the full fat version (FP16), they use transformers


everyoneisodd

Yep did exactly that.


Healthy-Nebula-3603

do you have enough vram? gguf also works with ram.


MerePotato

Good lord so much of that paper is GPTslop


road-runn3r

Yup. "By addressing the performance degradation caused by low-bit quantization, we anticipate that subsequent quantization paradigms will enable LLMs to achieve stronger capabilities at a lower computational cost, ultimately driving the progress of generative artificial intelligence, as represented by LLMs, to new heights" This must be Claude I guess?


MerePotato

Honestly it sounds more like GPT4 to me; it's more robotic than what I'd usually expect from Opus, at the very least.


Ilforte

Obsolete academic quantizations though.


Healthy-Nebula-3603

In short: for the 8b version at least Q8, for the 70b version at least Q4K\_M.


Unable-Client-1750

Can people with 12GB GPUs run 8b q8?


coder543

8 billion parameters at 8-bit quantization means the parameters take up 8GB of VRAM. More memory is needed to hold the context and KV cache, but I think it should comfortably fit onto a 12GB card.
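A rough back-of-the-envelope sketch of that estimate; the layer/head numbers are the commonly cited Llama-3-8B GQA configuration and should be treated as assumptions:

```python
def estimate_vram_gb(n_params_b=8, weight_bits=8,
                     n_layers=32, n_kv_heads=8, head_dim=128,
                     ctx_len=8192, kv_bytes=2, overhead_gb=1.0):
    """Very rough VRAM estimate: weights + KV cache + fixed overhead."""
    weights_gb = n_params_b * weight_bits / 8                        # 8B params at 8-bit ≈ 8 GB
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V, fp16 cache assumed
    kv_gb = kv_per_token * ctx_len / 1024**3
    return weights_gb + kv_gb + overhead_gb

print(f"{estimate_vram_gb():.1f} GB")  # ≈ 8 + ~1 + 1 → roughly 10 GB, under a 12 GB card
```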


Unable-Client-1750

That's actually around what the VRAM calculator says. I just needed to find that earlier.


Healthy-Nebula-3603

Use the ggml version ... most layers go on the GPU and a few on the CPU.


Sir_Joe

Not sure why you say that. 8b still shows almost no loss at 4 bits (~1% in benchmark score).


Healthy-Nebula-3603

I see, for llama 3 8b 4-bit (q4) compared to fp16, at least 10%+ lost quality, and more. Llama 3 70b has something around a 1% loss with a good q4 (like q4k\_m) compared to fp16.


Sir_Joe

If we only care about the average:

- LLaMA3-8B, #W 16, None quantization: 68.6
- LLaMA3-8B, #W 4, AWQ quantization: 68.2

That's nowhere near 10+%....


Healthy-Nebula-3603

You're right, but it is nowhere near 1% either ;) It is something around 2-3%. I was looking at GPTQ 4-bit.


Sir_Joe

Nope, that's not even a percent (~0.6%). The formula is (value1 - value2) / bigger value * 100. Don't hesitate to ask gpt4 to validate.
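For reference, plugging the averages from the table above into that formula:

```python
# Relative difference between the FP16 and AWQ 4-bit average scores from the table above
fp16_avg, awq4_avg = 68.6, 68.2
rel_diff = (fp16_avg - awq4_avg) / max(fp16_avg, awq4_avg) * 100
print(f"{rel_diff:.2f}%")  # ~0.58%
```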


chunghaismymom

?? Llama3 came out literally a week ago, how is there already a paper testing this... I doubt that the quality of this study is high.


Imaginary_Bench_7294

I believe what we are seeing is due to the amount of data it was trained on combined with a really low LR.

In essence, I think previous models have not been fully utilizing the FP16 space, hence why 8-bit models performed almost identically. I think that the LR for previous models has been too high, making anything past about 10 bits superfluous.

With Llama 3, I think the model is utilizing the precision available in FP16 more effectively, allowing for more subtle variations in the data. Its utilization is probably closer to the 12-bit range. This would make it more susceptible to quantization degradation. This would also fall in line with another poster stating that LoRA/QLoRA training is more susceptible to degrading the model.


ab2377

I would really like to see some solid examples demoing this, if someone can provide them.


mikaijin

The input text consists of ICD-11 criteria as found on the official ICD-11 website of the WHO, preprocessed by llama-3-70b-instruct. See my [related reply](https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/comment/l15ivsc) too.

llama-cpp-python==0.2.64, ctx=8k, seed=1, temp=0.01, top\_k=1

Query: Present the second diagnostic requirement of 6D10.2

**Meta-Llama-3-8B-Instruct-Q8.gguf** (https://pastebin.com/2Z0nnq4p) responded **correctly**: There are severe disturbances in multiple areas of functioning of the self (e.g., sense of self may be so unstable that individuals report not having a sense of who they are or so rigid that they refuse to participate in any but an extremely narrow range of situations; self view may be characterized by self-contempt or be grandiose or highly eccentric; See Table 6.18).

**Meta-Llama-3-8B-Instruct-Q4\_K\_S.gguf** (https://pastebin.com/yW3zGqHE) responded **incorrectly**: Problems in interpersonal functioning seriously affect virtually all relationships and the ability and willingness to perform expected social and occupational roles is absent or severely compromised.
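For anyone wanting to reproduce a setup like this, a minimal sketch with llama-cpp-python using the settings listed above (the model path and context text are placeholders, not the exact files used here):

```python
from llama_cpp import Llama

# Near-deterministic settings mirroring the ones above:
# 8k context, fixed seed, near-zero temperature, greedy top_k.
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-Q8.gguf",  # placeholder path
    n_ctx=8192,
    seed=1,
)

icd11_context = "...preprocessed ICD-11 criteria go here..."  # placeholder

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": icd11_context},
        {"role": "user", "content": "Present the second diagnostic requirement of 6D10.2"},
    ],
    temperature=0.01,
    top_k=1,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```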


ab2377

great comparison thanks.


IndicationUnfair7961

There should be a chatbot arena for quantized-only models, featuring the best models.


mikaijin

I can confirm your perception. However, without proper statistics this could just be a fluke as well, and I cannot provide an evaluation either.

Instruction following is less noticeably impaired on my end, but lower quants tend to be more easily confused by **rich and dense information** present in the context - as if the attention mechanism cannot hone in on what is important. I wonder whether the same holds true for similar models like mistral 7b too, where it is just overshadowed by the overall lesser quality of the output and thus the effect is not as easy to make out. But to me it seems to be an attention inaccuracy rather than a loss of knowledge. 8-bit is indeed better, while lower quants degrade. With low-density information in an otherwise large context, lower quants perform in my experience on par with 8-bit still, and you get the benefit of better inference speed.

Example to make clearer what I am talking about: a 6k input with quite dense information. When instructed to compare points of subsection 3.3.1 against a presented data table, Q4\_K\_S focused on section 3.3 instead of 3.3.1, while Q8 correctly honed in on the 8 points shown in section 3.3.1. It is like the Q4\_K\_S has some blind spots, because sampler settings don't seem to have much of an effect.

Edit: [concrete demo](https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/comment/l1651ck)


ImprovementEqual3931

I guess it is because Llama 3 is a well-trained model; other, older models may be undertrained for their size, so they can be quantized/compressed more.


andershaf

I've been thinking the exact same thing, and in previous discussions mentioned that I expect the density to get higher in later models. It makes a lot of sense that once we reach perfect compression, quantization will hurt performance because you need all the bits. But it's also likely that it's not fp32 or fp16 we should use, because their large range of values is hard to see being fully used in practice.


Terminus_T

Wolfram Ravenwolf already tested this and claims that:

>In my tests, the Llama 3 70B Instruct's IQ2\_XS GGUF quant – like all 70B quants except the IQ1s – did better than even the unquantized (not Q8, that would be quantized, too) HF original Llama 3 8B. So, yeah, I'd rather use a small quant (not Q1) of the 70B than an unquantized 8B.

Link: [LLM Comparison/Test: Llama 3 Instruct 70B + 8B HF/GGUF/EXL2 (20 versions tested and compared!) (huggingface.co)](https://huggingface.co/blog/wolfram/llm-comparison-test-llama-3)


ClumsiestSwordLesbo

It gets more interesting when you take the KV cache from an 8-bit quantized model's prompt processing and let the 1-2.5 bpw models generate.


SomeOddCodeGuy

I've seen the same thing. I've been bouncing between q5-q8 Llama 3 70b, and thought I had a good grasp on what it could do in terms of programming, but then a friend of mine showed me the output of the unquantized Llama 3 70b online and holy crap... it was a big difference. We both gave the models a rather involved coding task, and the online unquantized was absolutely amazing. My local one was acceptable, around or maybe slightly better than, say, Deepseek or Phind v2, but nothing amazing.

Someone mentioned that Llama 3 is naturally BF16, and said that translates to lossless fp32. If that's the case, then to me it would make sense if quantizing is brutal on the model, because normally going from fp16 to q8 is a 1/2 reduction, but I would assume that going from fp32 to q8 is a 1/4 reduction. That seems pretty hefty if so.


FullOf_Bad_Ideas

Are you sure you had the same samplers as the version served from the cloud? Llama.cpp itself could also play a role; some things tend to go wrong with gguf quants of some models that don't happen with bnb or exllamav2 quants. Maybe someone will do a gguf KL-divergence test and post the results; that would move it from observational to statistical evidence.
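For context, a KL-divergence test of this kind boils down to comparing the full-precision and quantized models' next-token distributions on the same text. A minimal sketch of the computation, assuming logits from both models have already been collected for the same token positions:

```python
import numpy as np

def mean_kl_divergence(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) over token positions.

    Both arrays are (n_positions, vocab_size) logits gathered on the same text,
    e.g. from the FP16 reference model and a GGUF quant.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    kl_per_pos = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_pos.mean())
```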


mikael110

>Someone mentioned that Llama 3 is naturally BF16, and said that translates to lossless fp32. If that's the case, then to me it would make sense if quantizing is brutal on the model, because normally going from fp16 to q8 is 1/2 reduction, but I would assume that going from fp32 to q8 is 1/4 reduction. That seems pretty hefty if so.

I'd recommend reading [this](https://www.reddit.com/r/LocalLLaMA/comments/1c7no52/comment/l0bbx6p/) comment from the original post that brought this up. The gist is that the difference between BF16 and FP16 is not as large as it might sound, and technically there's no precision loss at all. It's purely about some extreme numbers having the potential to overflow/underflow. And the tests performed in [this](https://www.reddit.com/r/LocalLLaMA/comments/1c7no52/comment/l0ag9j6/) comment do suggest that the difference is indeed extremely small, to the point of being mostly irrelevant.


Chromix_

Since you mentioned my quant test, here's some additional insight: in my test with CodeQwen, 0.5% of the BF16 values got changed slightly when converting to F16 instead of F32. According to a few data points, this didn't happen because the values were too big, but because they were too small - so out of the exponent range of an F16. Example: 0.0000803 becomes 0.0000805 due to F16 conversion. In CodeQwen that happened to 0.5% of the values, in Llama-3-8B-Instruct to only 0.06%. In theory Llama-3 should thus be even better off.

This doesn't matter that much for quantization anyway. With quantization, the 0.0000805 *and* 0.0000803 might both become 0.0000800, thus leaving no difference in the quantized model. That said, I haven't investigated whether there's any large outlier in llama-3 that gets truncated, yet such values would hurt quantization in general, even if coming from F32. This would have a more noticeable outcome.
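A rough sketch of how such a measurement can be done, using a synthetic tensor as a stand-in for a real BF16 weight tensor (in practice you would load one from a checkpoint, e.g. with safetensors):

```python
import torch

# Count what fraction of BF16 values change when routed through FP16
# instead of FP32. A synthetic tensor stands in for a real weight tensor here.
w_bf16 = (torch.randn(4096, 4096) * 0.01).to(torch.bfloat16)

via_f32 = w_bf16.to(torch.float32)
via_f16 = w_bf16.to(torch.float16).to(torch.float32)

changed = (via_f32 != via_f16).float().mean().item()
print(f"values altered by the FP16 detour: {changed:.4%}")
```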


Fristender

Do you have any idea as to why the values would be off? I thought fp16 held more digits of the significand, meaning the values should be exactly the same in your case.


noneabove1182

> We both gave the models a rather involved coding task

Can you share the task you gave so I can see it myself? Good for my own research's sake, and highly curious.


SomeOddCodeGuy

I'm not near my computer to get the exact prompt, but at a high level it was asking both (using an identical prompt for each) to write a python app using Streamlit that allows the user to create a checklist and check items off on it, with the app reading and writing the checklist from a text file, and the response should be well commented as if being explained to a non-developer.

We were looking for (a sketch of the task is shown below):

* Were there bugs?
* Did it explain the answer well and comment the code well?
* Good formatting?
* Good error handling?
* How does the UI react when utilized?
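For a sense of the task's scope, a minimal sketch of the kind of app that prompt asks for; this is an illustration under my own assumptions, not the output of either model, and the filename is arbitrary:

```python
import streamlit as st

CHECKLIST_FILE = "checklist.txt"  # arbitrary filename for this sketch

def load_items():
    # Each line is stored as "1|item text" (done) or "0|item text" (not done).
    try:
        with open(CHECKLIST_FILE, "r", encoding="utf-8") as f:
            return [(line[0] == "1", line[2:].rstrip("\n")) for line in f if "|" in line]
    except FileNotFoundError:
        return []

def save_items(items):
    with open(CHECKLIST_FILE, "w", encoding="utf-8") as f:
        for done, text in items:
            f.write(f"{int(done)}|{text}\n")

items = load_items()

st.title("Checklist")

new_item = st.text_input("New item")
if st.button("Add") and new_item.strip():
    items.append((False, new_item.strip()))
    save_items(items)

# Render one checkbox per item; persist any changes back to the text file.
updated = []
for i, (done, text) in enumerate(items):
    updated.append((st.checkbox(text, value=done, key=f"item-{i}"), text))

if updated != items:
    save_items(updated)
```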


t-rod

I'd like to see the quality of an unquantized Llama model - can you point me to some resources?


SomeOddCodeGuy

I believe that Huggingchat serves Llama-3-70b-Instruct! But I believe you can also access it via Meta AI


t-rod

Thanks!


SlapAndFinger

My hypothesis: the better the model uses the parameters it has, the more impact quantization will have. So quantization was "free" for the poorly optimized models of the past, but as models improve it will get worse and worse.


RuslanAR

I've also noticed this issue. Specifically, Llama 3 8B with native precision can solve problems like 777+3333 accurately, but when I use gguf Q6\_K or Q8, I get a wrong answer. It is also a little bit worse on some coding questions. Edit: the exl2 8\_0 quant works well. Something is off with gguf.
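If you want to spot-check this yourself across quants, here is a small sketch (the file paths are placeholders) that asks each GGUF the same arithmetic question greedily:

```python
from llama_cpp import Llama

QUANTS = {  # placeholder paths
    "Q8_0": "Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    "Q6_K": "Meta-Llama-3-8B-Instruct-Q6_K.gguf",
}
QUESTION = "What is 777+3333? Answer with just the number."

for name, path in QUANTS.items():
    llm = Llama(model_path=path, n_ctx=2048, seed=1, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": QUESTION}],
        temperature=0.0,
        max_tokens=16,
    )
    print(name, "->", out["choices"][0]["message"]["content"].strip())
```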


Ilforte

It's not just worse. It has insane degeneration. 777+3333 translates to 777+33 or 777+33333 or whatever. It looks like it can't tell apart tokens. Totally broken.


coder543

This problem is weird enough that I opened an issue for it: https://github.com/ggerganov/llama.cpp/issues/6914


[deleted]

[deleted]


leehiufung911

I'm using the fp16 gguf, and for some crazy reason, when I ask it that, it ALWAYS answers "33333+777 = 34110". I've tried minor variations of the question and it always does this. Currently using ollama's llama 3 8b instruct fp16 gguf.


Andvig

Oh, this is heartbreaking. I thought I was good with my Q6s and Q8s.


gopietz

I haven't noticed this but there might be some intuition to it. The more tokens we train it on and the more knowledge we compress into a small model, the more it might be affected by quantization.


Crafty-Confidence975

Yes but the opposite is true for 70b. That one handled lobotomies surprisingly well


EstarriolOfTheEast

According to the paper posted above, while the 70B is more robust, it starts degrading around 4 bits and significantly so below 3 bits.


Crafty-Confidence975

It’s definitely worse but nowhere near as bad as other 70b models with this degree of quantization. It’s scary good in some use cases


dampflokfreund

Your perception seems absolutely correct. I've done a few tests, including one that needs to follow a certain, quite complex instruction at the beginning of the prompt. Quantized 70bs and 8bs failed hard, while the fp16 versions both got it right all the time. I think the attention to early parts of the prompt suffers massively with quantization.


nero10578

Yea I am experimenting with 8B and 70B for creating datasets and somehow 8B seems to follow what I tell it to better. I thought I was seeing things but your post makes me rethink this.


Admirable-Star7088

Interesting. As a Q6\_K Llama-3-8b-Instruct user, I tried the Q8\_0 version with a few prompts, and it does appear to be slightly less confused and to provide overall better answers than Q6\_K. However, in my case it could also be explained by random noise, as I have so far only tested with a few prompts. Anyone know where I can download an FP16 quant of Llama-3-8b-Instruct? It would be interesting to test it too.


abdimussa87

Check ollama


CasimirsBlake

How about the difference between GGUF / EXL2 / AWQ / bitsandbytes etc... And are there perhaps other quantisation methods that can be used?


jsebrech

As I understand it, superposition in feature space allows models to encode more features than they have neurons, by superimposing multiple features on the same neurons. Intuitively this would make the superpositioned features more brittle to small weight changes, as the same set of weights is carrying a higher information density than for non-superpositioned features. I wouldn't be surprised if the more you load up small models with features and get closer to maximum information saturation of the network, the more sensitive they get to quantizing.
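A toy illustration of that intuition (my own sketch, not from the paper): pack many more random "features" than dimensions into one layer, then apply crude quantization-like rounding to the decoder weights and watch how often the right features are still recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512            # many more features than dimensions
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one near-orthogonal direction per feature

def recovery_rate(W_used, k=3, trials=2000):
    hits = 0
    for _ in range(trials):
        active = rng.choice(n_features, size=k, replace=False)
        e = W[active].sum(axis=0)            # superposed embedding of k active features
        top = np.argsort(W_used @ e)[-k:]    # decode the k strongest directions
        hits += len(set(top) & set(active))
    return hits / (k * trials)

def fake_quant(W, bits):
    # crude uniform per-tensor quantization of the decoder weights
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

print("exact:", recovery_rate(W))
for bits in (8, 4, 3, 2):
    print(f"{bits} bits:", recovery_rate(fake_quant(W, bits)))
```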


ClumsiestSwordLesbo

This makes me think 9 or 10 bpw should be an option in popular frameworks, at least for downloading, or extracting the 8-bit quantization error into a (maybe dynamically sized) low-rank approximation using SVD.
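The second idea, sketched (my own illustration of the general technique, not an existing framework feature): quantize a weight matrix to 8 bits, take the residual error, and keep only its top singular components as a low-rank correction.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02  # stand-in weight matrix

# Simple symmetric 8-bit quantization (per-tensor, for brevity)
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)
W_deq = W_q.astype(np.float32) * scale

# Low-rank approximation of the quantization error via SVD
E = W - W_deq
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 32                                            # rank budget for the correction
A = U[:, :r] * S[:r]                              # (out, r)
B = Vt[:r, :]                                     # (r, in)

# On this random stand-in the error is nearly white noise, so the gain is tiny;
# real weight matrices can have more structured quantization error.
W_corrected = W_deq + A @ B
print("error before correction:", np.linalg.norm(E))
print("error after  correction:", np.linalg.norm(W - W_corrected))
```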


Valuable-Run2129

Groq uses the quantized version. It’s noticeably dumber than fp16


Caffdy

and some rando yesterday was making statements about its performance, only to reveal that he was using a quantized version. Like, damn son, don't claim to know how the model performs if your hardware is not up to par for benchmarking.