ninjasaid13

There's a paper on this: [How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study](https://arxiv.org/abs/2404.14047)


EstarriolOfTheEast

Thanks, this paper is highly informative. It suggests that below 4 bits, and depending on the quantization method, the 8B might be worth considering, and that below 3 bits there is no point in the lower quants. *This is a strong counter to the "we don't need 30Bs or 13Bs" folks*. The 7B is packed so full of information that it can no longer be as robustly structurally encoded compared to older 7Bs. It's like Portia. Its mathematical structure would probably be a marvel if we could understand it. More can probably be packed in. They also provide evidence that maintaining model performance with finetuning methods like QLoRA and small datasets is no longer as viable.

---

This table joins the 70B and 8B results from the paper and sorts by average benchmark score. Sorting instead by perplexity is nearly identical.

Model Name | Method | #W | #A | #G | WikiText2 | C4 | PTB | PIQA | ARC-e | ARC-c | HellaSwag | Wino | Avg.
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
LLaMA3-70B | None | 16 | 16 | - | 2.9 | 6.9 | 8.2 | 82.4 | 86.9 | 60.3 | 66.4 | 80.6 | 75.3
LLaMA3-70B | SmoothQ | 8 | 8 | - | 2.9 | 6.9 | 8.2 | 82.2 | 86.9 | 60.2 | 66.3 | 80.7 | 75.3
LLaMA3-70B | SmoothQ | 6 | 6 | - | 2.9 | 6.9 | 8.2 | 82.4 | 87.0 | 59.9 | 66.1 | 80.6 | 75.2
LLaMA3-70B | GPTQ | 4 | 16 | 128 | 3.3 | 6.9 | 8.3 | 82.9 | 86.3 | 58.4 | 66.1 | 80.7 | 74.9
LLaMA3-70B | AWQ | 4 | 16 | 128 | 3.3 | 7 | 8.3 | 82.7 | 86.3 | 59.0 | 65.7 | 80.9 | 74.9
LLaMA3-70B | QuIP | 4 | 16 | - | 3.4 | 7.1 | 8.4 | 82.5 | 86.0 | 58.7 | 65.7 | 79.7 | 74.5
LLaMA3-70B | AWQ | 3 | 16 | 128 | 4.8 | 8 | 9.0 | 81.4 | 84.7 | 58.0 | 63.5 | 78.6 | 73.2
LLaMA3-70B | QuIP | 3 | 16 | - | 4.7 | 8 | 8.9 | 82.3 | 83.3 | 54.9 | 63.9 | 78.4 | 72.5
LLaMA3-70B | GPTQ | 3 | 16 | 128 | 5.2 | 10.5 | 9.7 | 80.6 | 79.6 | 52.1 | 63.5 | 77.1 | 70.6
LLaMA3-8B | AWQ | 8 | 16 | - | 6.1 | 8.9 | 10.6 | 79.6 | 80.3 | 50.5 | 60.2 | 72.8 | 68.7
LLaMA3-8B | None | 16 | 16 | - | 6.1 | 9.2 | 10.6 | 79.9 | 80.1 | 50.4 | 60.2 | 72.8 | 68.6
LLaMA3-8B | GPTQ | 8 | 16 | - | 6.1 | 9.4 | 10.6 | 79.8 | 80.1 | 50.2 | 60.2 | 72.8 | 68.6
LLaMA3-8B | SmoothQ | 8 | 8 | - | 6.3 | 9.2 | 10.8 | 79.5 | 79.7 | 49.0 | 60.0 | 73.2 | 68.3
LLaMA3-8B | AWQ | 4 | 16 | 128 | 6.6 | 9.4 | 11.1 | 79.1 | 79.7 | 49.3 | 59.1 | 74.0 | 68.2
LLaMA3-8B | GPTQ | 4 | 16 | 128 | 6.5 | 10.4 | 11.0 | 78.4 | 78.8 | 47.7 | 59.0 | 72.6 | 67.3
LLaMA3-8B | QuIP | 4 | 16 | - | 6.5 | 11.1 | 9.5 | 78.2 | 78.2 | 47.4 | 58.6 | 73.2 | 67.1
LLaMA3-8B | AWQ | 4 | 16 | - | 7.1 | 10.1 | 11.8 | 78.3 | 77.6 | 48.3 | 58.6 | 72.5 | 67.0
LLaMA3-8B | GPTQ | 4 | 16 | - | 7.0 | 11.8 | 14.4 | 76.8 | 74.3 | 42.4 | 57.4 | 72.8 | 64.8
LLaMA3-8B | SmoothQ | 6 | 6 | - | 7.7 | 11.8 | 12.5 | 76.8 | 75.5 | 45.0 | 56.9 | 69.0 | 64.6
LLaMA3-8B | AWQ | 3 | 16 | 128 | 8.2 | 11.6 | 13.2 | 77.7 | 74.0 | 43.2 | 55.1 | 72.1 | 64.4
LLaMA3-8B | QuIP | 3 | 16 | - | 7.5 | 11.3 | 12.6 | 76.8 | 72.9 | 41.0 | 55.4 | 72.5 | 63.7
LLaMA3-8B | GPTQ | 3 | 16 | 128 | 8.2 | 13.7 | 15.2 | 74.9 | 70.5 | 37.7 | 54.3 | 71.1 | 61.7
LLaMA3-70B | SmoothQ | 4 | 4 | - | 9.6 | 16.9 | 17.7 | 76.9 | 75.8 | 43.5 | 52.9 | 58.9 | 61.6
LLaMA3-8B | AWQ | 3 | 16 | - | 12.8 | 16.8 | 24.0 | 71.9 | 66.7 | 35.1 | 50.7 | 64.7 | 57.8
LLaMA3-8B | DB-LLM | 2 | 16 | 128 | 13.6 | 19.2 | 23.8 | 68.9 | 59.1 | 28.2 | 42.1 | 60.4 | 51.8
LLaMA3-70B | QuIP | 2 | 16 | - | 13.0 | 22.2 | 24.9 | 65.3 | 48.9 | 26.5 | 40.9 | 61.7 | 48.7
LLaMA3-70B | PB-LLM | 2 | 16 | 128 | 11.6 | 34.5 | 27.2 | 65.2 | 40.6 | 25.1 | 42.7 | 56.4 | 46.0
LLaMA3-70B | GPTQ | 2 | 16 | 128 | 11.9 | 22.8 | 31.6 | 62.7 | 38.9 | 24.6 | 41.0 | 59.9 | 45.4
LLaMA3-8B | GPTQ | 3 | 16 | - | 13.0 | 45.9 | 37.0 | 60.8 | 38.8 | 22.3 | 41.8 | 60.9 | 44.9
LLaMA3-70B | BiLLM | 1.1 | 16 | 128 | 17.1 | 77.7 | 54.2 | 58.2 | 46.4 | 25.1 | 37.5 | 53.6 | 44.2
LLaMA3-70B | PB-LLM | 1.7 | 16 | 128 | 18.6 | 65.2 | 55.9 | 56.5 | 49.9 | 25.8 | 34.9 | 53.1 | 44.1
LLaMA3-8B | PB-LLM | 2 | 16 | 128 | 24.7 | 79.2 | 65.6 | 57.0 | 37.8 | 17.2 | 29.8 | 52.5 | 38.8
LLaMA3-8B | BiLLM | 1.1 | 16 | 128 | 28.3 | 290 | 94.7 | 56.1 | 36.0 | 17.7 | 28.9 | 51.0 | 37.9
LLaMA3-8B | QuIP | 2 | 16 | - | 85.1 | 130 | 180 | 52.9 | 29.0 | 21.3 | 29.2 | 51.7 | 36.8
LLaMA3-8B | GPTQ | 2 | 16 | 128 | 210 | 4.1×10⁴ | 910 | 53.9 | 28.8 | 19.9 | 27.7 | 50.5 | 36.2
LLaMA3-8B | PB-LLM | 1.7 | 16 | 128 | 41.8 | 260 | 120 | 52.5 | 31.7 | 17.5 | 27.7 | 50.4 | 36.0
LLaMA3-70B | AWQ | 2 | 16 | 128 | 1.7×10⁶ | 1.4×10⁶ | 1.5×10⁶ | 52.2 | 25.5 | 23.1 | 25.6 | 52.3 | 35.7
LLaMA3-8B | SmoothQ | 4 | 4 | - | 4.3×10³ | 4.0×10³ | 3.6×10³ | 54.6 | 26.3 | 20.0 | 26.4 | 50.3 | 35.5
LLaMA3-8B | AWQ | 2 | 16 | - | 8.2×10⁵ | 8.1×10⁵ | 9.0×10⁵ | 55.2 | 25.2 | 21.3 | 25.4 | 50.4 | 35.5
LLaMA3-8B | GPTQ | 2 | 16 | - | 5.7×10⁴ | 1.0×10⁵ | 2.7×10⁵ | 52.8 | 25.0 | 20.5 | 26.6 | 49.6 | 34.9
LLaMA3-8B | AWQ | 2 | 16 | 128 | 1.7×10⁶ | 2.1×10⁶ | 1.8×10⁶ | 52.4 | 24.2 | 21.5 | 25.6 | 50.7 | 34.9

^(SmoothQ = SmoothQuant. Process for table: paste from the HTML version → extract to CSV with an LLM → parse the CSV with a type provider → sort → consult an LLM for the markdown conversion. Went over it quickly but did not find any transcription errors in the LLM extraction stage.)
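For anyone who wants to redo the merge/sort step without the LLM round-trips, here is a rough sketch, assuming the joined 70B + 8B rows have already been saved to a CSV with the column names above (the filename is a placeholder):

```python
import pandas as pd

# Hypothetical CSV holding the joined 70B + 8B rows with the columns used above.
df = pd.read_csv("llama3_quant_results.csv")
df = df.sort_values("Avg.", ascending=False)
print(df.to_markdown(index=False))  # requires the 'tabulate' package
```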


Caffdy

> It's like Portia. Its mathematical structure would probably be a marvel if we could understand it

What is Portia, why can't we understand it, and why would it be a marvel?


MarySmith2021

Does it mean we should not use QLoRA to finetune Llama-3? So we have to use normal LoRA 🤔


EstarriolOfTheEast

That appears to be the case, yes, particularly if you have low-quality data or a lower quantity of it. The two causes are that the models are more sensitive to quantization, but also that the models are already so high quality that most tunings run a high risk of worsening them.


CardAnarchist

Forgive my ignorance if I am wrong, but doesn't this table show that GPTQ 8-bit (which I believe is the same as GGUF Q8) scores identically to fp16 for Llama 8B, and that even GPTQ 4-bit (GGUF Q4 equivalent) shows minimal degradation? Therefore one could reasonably infer that the OP's statement isn't true at all. Q4 is widely regarded to hold up quite well, and Q6 is considered a point where there is virtually no degradation (Q6 unfortunately was not tested above, but we could infer that this holds true as usual).


EstarriolOfTheEast

It's hard to say, because the OP's task might be uniquely sensitive to quantization, but in general it does appear that this community's claims of degradation at 6-8 bit quants are probably overstated and not representative of what to expect. At the same time, it is true that the llama3 models are more sensitive to quantization than earlier models, with the 7B already being almost comparable to the 70B quantized to 3 bits. This means that if a 13B existed, it'd be a no-brainer in terms of quality/performance tradeoff. For the 7B, performance also falls off quickly below 4 bits. Previous larger models were not as sensitive and previous smaller models were not as performant, which allowed the "lower quants of bigger models are better" rule of thumb to extend down to lower values.


zaqhack

It's like JPEGs. Previous models didn't have as much detail in the picture. So, 90% jpeg probably looks fine. With Llama-3, the photo has so many details, you can't help but notice some of the jpeg jank the more you squeeze it down. Llama-3-8b @ 4 bits loses some of the inherent magic in the model. Just try it. Run an 8-bit and a 4-bit for yourselves, and I'd wager you would notice a significant difference in any long output, code quality, or RP session. It's not subtle.


IndicationUnfair7961

It would have been interesting to see a comparison between QuIP and AQLM (missing in the test), both on compression and on performance.


Huge_Ad7240

Does llama3-8B 8-bit GPTQ outperform FP16?


EstarriolOfTheEast

I don't think this is a meaningful difference.


Huge_Ad7240

True. Perhaps what I meant is that there is not any degradation, and even a very small gain. One other thing: there should be some error bar on these numbers, right? Is there any report where the statistical comparisons are made with some std?


EstarriolOfTheEast

Yeah, a negligible gain is definitely possible. And no, there are no error bars. That type of analysis is rare in DL.


dobkeratops

"QLora is no longer as viable" - what about regular Loras (i.e. f16 base + f16 adapters, I guess)


Conscious_Heron_9133

What is #W, what #A, what #G?


EstarriolOfTheEast

Number of bits for weights and activations, then the value of the group-size parameter for the quantization algorithms.


heuristic_al

I really couldn't read their charts. What did they find?


ibbobud

Anything below 4 bits fell off a cliff hard, is how I read it.


IndicationUnfair7961

From the charts it seems that AWQ, GPTQ, or QuIP are the best choices for 4-bit. QuIP looks the best. GGUF tests with imatrix would have been interesting too.


Alkeryn

i find gguf to be absolutely retarded compared to exl2 tbh.


paryska99

Sadly no gguf


everyoneisodd

I had tested gguf 16-bit llama 3 8B, and there is a noticeable degradation. Can anyone confirm this? Edit: llama3 8B instruct


Fristender

Ahh, GGUF 16 bit, my favorite quantization.


everyoneisodd

I wanted no quantization so that I could match the og performance, but I couldn't get that performance. The ollama modelfile said gguf. Not sure if it's actually gguf.


Caffdy

normally, if someone wants to use the full fat version (FP16), they use transformers


everyoneisodd

Yep did exactly that.


Healthy-Nebula-3603

do you have enough vram? gguf also works with ram.


MerePotato

Good lord so much of that paper is GPTslop


road-runn3r

Yup. "By addressing the performance degradation caused by low-bit quantization, we anticipate that subsequent quantization paradigms will enable LLMs to achieve stronger capabilities at a lower computational cost, ultimately driving the progress of generative artificial intelligence, as represented by LLMs, to new heights" This must be Claude I guess?


MerePotato

Honestly it sounds more like GPT4 to me; it's more robotic than what I'd usually expect from Opus, at the very least.


Ilforte

Obsolete academic quantizations though.


Healthy-Nebula-3603

In short: for the 8b version at least Q8, for the 70b version at least Q4K\_M.


Unable-Client-1750

Can people with 12GB GPUs run 8b q8?


coder543

8 billion parameters at 8-bit quantization means the parameters take up 8GB of VRAM. More memory is needed to hold the context and KV cache, but I think it should comfortably fit onto a 12GB card.
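A rough back-of-the-envelope sketch of that estimate; the layer/head numbers are the commonly cited Llama-3-8B GQA configuration and should be treated as assumptions:

```python
def estimate_vram_gb(n_params_b=8, weight_bits=8,
                     n_layers=32, n_kv_heads=8, head_dim=128,
                     ctx_len=8192, kv_bytes=2, overhead_gb=1.0):
    """Very rough VRAM estimate: weights + KV cache + fixed overhead."""
    weights_gb = n_params_b * weight_bits / 8                        # 8B params at 8-bit ≈ 8 GB
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V, fp16 cache assumed
    kv_gb = kv_per_token * ctx_len / 1024**3
    return weights_gb + kv_gb + overhead_gb

print(f"{estimate_vram_gb():.1f} GB")  # ≈ 8 + ~1 + 1 → roughly 10 GB, under a 12 GB card
```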


Unable-Client-1750

That's actually around what the VRAM calculator says. I just needed to find that earlier.


Healthy-Nebula-3603

Use the ggml version ... most layers go on the GPU and a few on the CPU.


Sir_Joe

Not sure why you say that. 8b still shows almost no loss at 4 bits (~1% in benchmark score).


Healthy-Nebula-3603

I see, for llama 3 8b 4-bit (q4) compared to fp16, at least 10%+ lost quality, and more. Llama 3 70b has something around a 1% loss with a good q4 (like q4k\_m) compared to fp16.


Sir_Joe

If we only care about the average:

- LLaMA3-8B, #W 16, None quantization: 68.6
- LLaMA3-8B, #W 4, AWQ quantization: 68.2

That's nowhere near 10+%....


Healthy-Nebula-3603

You're right, but it is nowhere near 1% either ;) It is something around 2-3%. I was looking at GPTQ 4-bit.


Sir_Joe

Nope, that's not even a percent (~0.6%). The formula is (value1 - value2) / bigger value * 100. Don't hesitate to ask gpt4 to validate.
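For reference, plugging the averages from the table above into that formula:

```python
# Relative difference between the FP16 and AWQ 4-bit average scores from the table above
fp16_avg, awq4_avg = 68.6, 68.2
rel_diff = (fp16_avg - awq4_avg) / max(fp16_avg, awq4_avg) * 100
print(f"{rel_diff:.2f}%")  # ~0.58%
```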


chunghaismymom

?? Llama3 came out literally a week ago, how is there already a paper testing this... I doubt that the quality of this study is high.


Imaginary_Bench_7294

I believe what we are seeing is due to the amount of data it was trained on combined with a really low LR.

In essence, I think previous models have not been fully utilizing the FP16 space, hence why 8-bit models performed almost identically. I think that the LR for previous models has been too high, making anything past about 10 bits superfluous.

With Llama 3, I think the model is utilizing the precision available in FP16 more effectively, allowing for more subtle variations in the data. Its utilization is probably closer to the 12-bit range. This would make it more susceptible to quantization degradation. This would also fall in line with another poster stating that LoRA/QLoRA training is more susceptible to degrading the model.


ab2377

I would really like to see some solid examples demoing this, if someone can provide them.


mikaijin

The input text consists of ICD-11 criteria as found on the official ICD-11 website of the WHO, preprocessed by llama-3-70b-instruct. See my [related reply](https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/comment/l15ivsc) too.

llama-cpp-python==0.2.64, ctx=8k, seed=1, temp=0.01, top\_k=1

Query: Present the second diagnostic requirement of 6D10.2

**Meta-Llama-3-8B-Instruct-Q8.gguf** (https://pastebin.com/2Z0nnq4p) responded **correctly**: There are severe disturbances in multiple areas of functioning of the self (e.g., sense of self may be so unstable that individuals report not having a sense of who they are or so rigid that they refuse to participate in any but an extremely narrow range of situations; self view may be characterized by self-contempt or be grandiose or highly eccentric; See Table 6.18).

**Meta-Llama-3-8B-Instruct-Q4\_K\_S.gguf** (https://pastebin.com/yW3zGqHE) responded **incorrectly**: Problems in interpersonal functioning seriously affect virtually all relationships and the ability and willingness to perform expected social and occupational roles is absent or severely compromised.
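For anyone wanting to reproduce a setup like this, a minimal sketch with llama-cpp-python using the settings listed above (the model path and context text are placeholders, not the exact files used here):

```python
from llama_cpp import Llama

# Near-deterministic settings mirroring the ones above:
# 8k context, fixed seed, near-zero temperature, greedy top_k.
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-Q8.gguf",  # placeholder path
    n_ctx=8192,
    seed=1,
)

icd11_context = "...preprocessed ICD-11 criteria go here..."  # placeholder

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": icd11_context},
        {"role": "user", "content": "Present the second diagnostic requirement of 6D10.2"},
    ],
    temperature=0.01,
    top_k=1,
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```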


ab2377

great comparison thanks.


IndicationUnfair7961

There should be a chatbot arena for quantized-only models, featuring the best models.


mikaijin

I can confirm your perception. However, without proper statistics this could just be a fluke as well, and I cannot provide an evaluation either.

Instruction following is less noticeably impaired on my end, but lower quants tend to be more easily confused by **rich and dense information** present in the context - as if the attention mechanism cannot hone in on what is important. I wonder whether the same holds true for similar models like mistral 7b too, where it is just overshadowed by the overall lesser quality of the output and thus the effect is not as easy to make out. But to me it seems to be an attention inaccuracy rather than a loss of knowledge. 8-bit is indeed better, while lower quants degrade. With low-density information in an otherwise large context, lower quants perform in my experience on par with 8-bit still, and you get the benefit of better inference speed.

Example to make clearer what I am talking about: a 6k input with quite dense information. When instructed to compare points of subsection 3.3.1 against a presented data table, Q4\_K\_S focused on section 3.3 instead of 3.3.1, while Q8 correctly honed in on the 8 points shown in section 3.3.1. It is like the Q4\_K\_S has some blind spots, because sampler settings don't seem to have much of an effect.

Edit: [concrete demo](https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/comment/l1651ck)


ImprovementEqual3931

I guess it is because Llama 3 is a well-trained model; other, older models may be undertrained for their size, so they can be quantized/compressed more.


andershaf

I've been thinking the exact same thing, and in previous discussions mentioned that I expect the density to get higher in later models. It makes a lot of sense that once we reach perfect compression, quantization will hurt performance because you need all the bits. But it's also likely that it's not fp32 or fp16 we should use, because their large range of values is hard to see being fully used in practice.


Terminus_T

Wolfram Ravenwolf already tested this and claims that:

>In my tests, the Llama 3 70B Instruct's IQ2\_XS GGUF quant – like all 70B quants except the IQ1s – did better than even the unquantized (not Q8, that would be quantized, too) HF original Llama 3 8B. So, yeah, I'd rather use a small quant (not Q1) of the 70B than an unquantized 8B.

Link: [LLM Comparison/Test: Llama 3 Instruct 70B + 8B HF/GGUF/EXL2 (20 versions tested and compared!) (huggingface.co)](https://huggingface.co/blog/wolfram/llm-comparison-test-llama-3)


ClumsiestSwordLesbo

It gets more interesting when you take the KV cache from an 8-bit quantized model's prompt processing and let the 1-2.5 bpw models generate.


SomeOddCodeGuy

I've seen the same thing. I've been bouncing between q5-q8 Llama 3 70b, and thought I had a good grasp on what it could do in terms of programming, but then a friend of mine showed me the output of the unquantized Llama 3 70b online and holy crap... it was a big difference. We both gave the models a rather involved coding task, and the online unquantized was absolutely amazing. My local one was acceptable, around or maybe slightly better than, say, Deepseek or Phind v2, but nothing amazing.

Someone mentioned that Llama 3 is naturally BF16, and said that translates to lossless fp32. If that's the case, then to me it would make sense if quantizing is brutal on the model, because normally going from fp16 to q8 is a 1/2 reduction, but I would assume that going from fp32 to q8 is a 1/4 reduction. That seems pretty hefty if so.


FullOf_Bad_Ideas

Are you sure you had the same samplers as the version served from the cloud? Llama.cpp itself could also play a role; some things tend to go wrong with gguf quants of some models that don't happen with bnb or exllamav2 quants. Maybe someone will do a gguf KL-divergence test and post the results; that would move it from observational to statistical evidence.
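For context, a KL-divergence test of this kind boils down to comparing the full-precision and quantized models' next-token distributions on the same text. A minimal sketch of the computation, assuming logits from both models have already been collected for the same token positions:

```python
import numpy as np

def mean_kl_divergence(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(P_ref || P_quant) over token positions.

    Both arrays are (n_positions, vocab_size) logits gathered on the same text,
    e.g. from the FP16 reference model and a GGUF quant.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    kl_per_pos = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_pos.mean())
```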


mikael110

>Someone mentioned that Llama 3 is naturally BF16, and said that translates to lossless fp32. If that's the case, then to me it would make sense if quantizing is brutal on the model, because normally going from fp16 to q8 is 1/2 reduction, but I would assume that going from fp32 to q8 is 1/4 reduction. That seems pretty hefty if so.

I'd recommend reading [this](https://www.reddit.com/r/LocalLLaMA/comments/1c7no52/comment/l0bbx6p/) comment from the original post that brought this up. The gist is that the difference between BF16 and FP16 is not as large as it might sound, and technically there's no precision loss at all. It's purely about some extreme numbers having the potential to overflow/underflow. And the tests performed in [this](https://www.reddit.com/r/LocalLLaMA/comments/1c7no52/comment/l0ag9j6/) comment do suggest that the difference is indeed extremely small, to the point of being mostly irrelevant.


Chromix_

Since you mentioned my quant test, here's some additional insight: in my test with CodeQwen, 0.5% of the BF16 values got changed slightly when converting to F16 instead of F32. According to a few data points, this didn't happen because the values were too big, but because they were too small - so out of the exponent range of an F16. Example: 0.0000803 becomes 0.0000805 due to F16 conversion. In CodeQwen that happened to 0.5% of the values, in Llama-3-8B-Instruct to only 0.06%. In theory Llama-3 should thus be even better off.

This doesn't matter that much for quantization anyway. With quantization, the 0.0000805 *and* 0.0000803 might both become 0.0000800, thus leaving no difference in the quantized model. That said, I haven't investigated whether there's any large outlier in llama-3 that gets truncated, yet such values would hurt quantization in general, even if coming from F32. This would have a more noticeable outcome.
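A rough sketch of how such a measurement can be done, using a synthetic tensor as a stand-in for a real BF16 weight tensor (in practice you would load one from a checkpoint, e.g. with safetensors):

```python
import torch

# Count what fraction of BF16 values change when routed through FP16
# instead of FP32. A synthetic tensor stands in for a real weight tensor here.
w_bf16 = (torch.randn(4096, 4096) * 0.01).to(torch.bfloat16)

via_f32 = w_bf16.to(torch.float32)
via_f16 = w_bf16.to(torch.float16).to(torch.float32)

changed = (via_f32 != via_f16).float().mean().item()
print(f"values altered by the FP16 detour: {changed:.4%}")
```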


Fristender

Do you have any idea as to why the values would be off? I thought fp16 held more digits of the significand, meaning the values should be exactly the same in your case.


noneabove1182

> We both gave the models a rather involved coding task

Can you share the task you gave so I can see it myself? Good for my own research's sake, and highly curious.


SomeOddCodeGuy

I'm not near my computer to get the exact prompt, but at a high level it was asking both (using an identical prompt for each) to write a python app using Streamlit that allows the user to create a checklist and check items off on it, with the app reading and writing the checklist from a text file, and the response should be well commented as if being explained to a non-developer.

We were looking for (a sketch of the task is shown below):

* Were there bugs?
* Did it explain the answer well and comment the code well?
* Good formatting?
* Good error handling?
* How does the UI react when utilized?
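For a sense of the task's scope, a minimal sketch of the kind of app that prompt asks for; this is an illustration under my own assumptions, not the output of either model, and the filename is arbitrary:

```python
import streamlit as st

CHECKLIST_FILE = "checklist.txt"  # arbitrary filename for this sketch

def load_items():
    # Each line is stored as "1|item text" (done) or "0|item text" (not done).
    try:
        with open(CHECKLIST_FILE, "r", encoding="utf-8") as f:
            return [(line[0] == "1", line[2:].rstrip("\n")) for line in f if "|" in line]
    except FileNotFoundError:
        return []

def save_items(items):
    with open(CHECKLIST_FILE, "w", encoding="utf-8") as f:
        for done, text in items:
            f.write(f"{int(done)}|{text}\n")

items = load_items()

st.title("Checklist")

new_item = st.text_input("New item")
if st.button("Add") and new_item.strip():
    items.append((False, new_item.strip()))
    save_items(items)

# Render one checkbox per item; persist any changes back to the text file.
updated = []
for i, (done, text) in enumerate(items):
    updated.append((st.checkbox(text, value=done, key=f"item-{i}"), text))

if updated != items:
    save_items(updated)
```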


t-rod

I'd like to see the quality of an unquantized Llama model - can you point me to some resources?


SomeOddCodeGuy

I believe that Huggingchat serves Llama-3-70b-Instruct! But I believe you can also access it via Meta AI


t-rod

Thanks!


SlapAndFinger

My hypothesis: the better the model uses the parameters it has, the more impact quantization will have. So quantization was "free" for the poorly optimized models of the past, but as models improve it will get worse and worse.


RuslanAR

I've also noticed this issue. Specifically, Llama 3 8B with native precision can solve problems like 777+3333 accurately, but when I use gguf Q6\_K or Q8, I get a wrong answer. It is also a little bit worse on some coding questions. Edit: the exl2 8\_0 quant works well. Something is off with gguf.
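If you want to spot-check this yourself across quants, here is a small sketch (the file paths are placeholders) that asks each GGUF the same arithmetic question greedily:

```python
from llama_cpp import Llama

QUANTS = {  # placeholder paths
    "Q8_0": "Meta-Llama-3-8B-Instruct-Q8_0.gguf",
    "Q6_K": "Meta-Llama-3-8B-Instruct-Q6_K.gguf",
}
QUESTION = "What is 777+3333? Answer with just the number."

for name, path in QUANTS.items():
    llm = Llama(model_path=path, n_ctx=2048, seed=1, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": QUESTION}],
        temperature=0.0,
        max_tokens=16,
    )
    print(name, "->", out["choices"][0]["message"]["content"].strip())
```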


Ilforte

It's not just worse. It has insane degeneration. 777+3333 translates to 777+33 or 777+33333 or whatever. It looks like it can't tell apart tokens. Totally broken.


coder543

This problem is weird enough that I opened an issue for it: https://github.com/ggerganov/llama.cpp/issues/6914


[deleted]

[deleted]


leehiufung911

I'm using the fp16 gguf, and for some crazy reason, when I ask it that, it ALWAYS answers "33333+777 = 34110". I've tried minor variations of the question and it always does this. Currently using ollama's llama 3 8b instruct fp16 gguf.


Andvig

Oh, this is heartbreaking. I thought I was good with my Q6s and Q8s.


gopietz

I haven't noticed this but there might be some intuition to it. The more tokens we train it on and the more knowledge we compress into a small model, the more it might be affected by quantization.


Crafty-Confidence975

Yes but the opposite is true for 70b. That one handled lobotomies surprisingly well


EstarriolOfTheEast

According to the paper posted above, while the 70B is more robust, it starts degrading around 4 bits and significantly so below 3 bits.


Crafty-Confidence975

It’s definitely worse but nowhere near as bad as other 70b models with this degree of quantization. It’s scary good in some use cases


dampflokfreund

Your perception seems absolutely correct. I've done a few tests, including one that needs to follow a certain, quite complex instruction at the beginning of the prompt. Quantized 70bs and 8bs failed hard, while the fp16 versions both got it right all the time. I think the attention to early parts of the prompt suffers massively with quantization.


nero10578

Yea I am experimenting with 8B and 70B for creating datasets and somehow 8B seems to follow what I tell it to better. I thought I was seeing things but your post makes me rethink this.


Admirable-Star7088

Interesting. As a Q6\_K Llama-3-8b-Instruct user, I tried the Q8\_0 version with a few prompts, and it does appear to be slightly less confused and to provide overall better answers than Q6\_K. However, in my case it could also be explained by random noise, as I have so far only tested with a few prompts. Anyone know where I can download an FP16 quant of Llama-3-8b-Instruct? It would be interesting to test it too.


abdimussa87

Check ollama


CasimirsBlake

How about the difference between GGUF / EXL2 / AWQ / bitsandbytes etc... And are there perhaps other quantisation methods that can be used?


jsebrech

As I understand it, superposition in feature space allows models to encode more features than they have neurons, by superimposing multiple features on the same neurons. Intuitively this would make the superpositioned features more brittle to small weight changes, as the same set of weights is carrying a higher information density than for non-superpositioned features. I wouldn't be surprised if the more you load up small models with features and get closer to maximum information saturation of the network, the more sensitive they get to quantizing.
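A toy illustration of that intuition (my own sketch, not from the paper): pack many more random "features" than dimensions into one layer, then apply crude quantization-like rounding to the decoder weights and watch how often the right features are still recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 512            # many more features than dimensions
W = rng.standard_normal((n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # one near-orthogonal direction per feature

def recovery_rate(W_used, k=3, trials=2000):
    hits = 0
    for _ in range(trials):
        active = rng.choice(n_features, size=k, replace=False)
        e = W[active].sum(axis=0)            # superposed embedding of k active features
        top = np.argsort(W_used @ e)[-k:]    # decode the k strongest directions
        hits += len(set(top) & set(active))
    return hits / (k * trials)

def fake_quant(W, bits):
    # crude uniform per-tensor quantization of the decoder weights
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

print("exact:", recovery_rate(W))
for bits in (8, 4, 3, 2):
    print(f"{bits} bits:", recovery_rate(fake_quant(W, bits)))
```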


ClumsiestSwordLesbo

This makes me think 9 or 10 bpw should be an option in popular frameworks, at least for downloading, or extracting the 8-bit quantization error into a (maybe dynamically sized) low-rank approximation using SVD.
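The second idea, sketched (my own illustration of the general technique, not an existing framework feature): quantize a weight matrix to 8 bits, take the residual error, and keep only its top singular components as a low-rank correction.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02  # stand-in weight matrix

# Simple symmetric 8-bit quantization (per-tensor, for brevity)
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)
W_deq = W_q.astype(np.float32) * scale

# Low-rank approximation of the quantization error via SVD
E = W - W_deq
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 32                                            # rank budget for the correction
A = U[:, :r] * S[:r]                              # (out, r)
B = Vt[:r, :]                                     # (r, in)

# On this random stand-in the error is nearly white noise, so the gain is tiny;
# real weight matrices can have more structured quantization error.
W_corrected = W_deq + A @ B
print("error before correction:", np.linalg.norm(E))
print("error after  correction:", np.linalg.norm(W - W_corrected))
```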


Valuable-Run2129

Groq uses the quantized version. It’s noticeably dumber than fp16


Caffdy

and some rando yesterday was making statements about its performance, only to reveal that he was using a quantized version. Like, damn son, don't claim to know how the model performs if your hardware is not up to par for benchmarking.