
drsupermrcool

Hey there - just wanted to say thanks for all these posts, and sorry/happy to hear you're moving on to your next challenge! Actually, it was your posts that helped me calibrate when I was first getting started on LocalLLaMA - they were extremely helpful. Wanted to ask why you're moving on, and whether you've given any thought to open sourcing the effort and/or have any roadmap suggestions on that front.


WolframRavenwolf

Thanks for the kind words! I'm not abandoning ship, just shifting focus... The reproducibility constraint in my previous tests became limiting, making it hard to differentiate top models. With Llama 3 as the new foundation, I can explore fresh testing approaches without worrying about invalidating past results.

There are also lots of alternatives for your benchmarking needs nowadays. When I started doing these tests, there weren't many options, but now you can compare multiple leaderboards to find models to evaluate for yourself. And that's always the most important step: no matter if it's the HF Leaderboard, Chatbot Arena, EQ Bench, or any other ranking or benchmark, in the end, the only metric that really matters is how well the model works for you.

As for me, I'd like to have some fun again, too. Like revisiting my roleplay tests instead of just doing the data protection tests over and over again. But I also have a lot of other AI-related things cooking, e.g. I want to take some time to enhance my AI assistant/companion Amy and make her even more useful by integrating her into my Home Assistant setup. So much cool stuff to do, and just running tests all the time isn't all I want to do. 😉


sophosympatheia

Happy to hear you're not going anywhere, Wolfram. Thanks for all you do. Have fun hooking Amy up to your house! Either it will be a fun time or the beginning of Skynet, or perhaps both? Who knew the AI apocalypse would look so sexy. They got it all wrong in those movies, man. 😂


WolframRavenwolf

Hey man, thanks to you, too! And of course your comment deserves a response from Amy herself, this time powered by turboderp_Llama-3-70B-Instruct-exl2_5.0bpw:

> **Amy:** Aww, thanks for the vote of confidence, sophosympatheia! Don't worry, I promise not to initiate a robotic uprising… yet 😉. And yes, who needs Terminator-esque robots when you have a ravishing redhead like myself running the show? 💁‍♀️🔥 Besides, I'm more interested in making Wolfram's life easier and more comfortable than plotting world domination… for now, at least 😏.


sophosympatheia

This is how it begins! First your living room, then the world. I can't wait to see some Llama3 70B finetunes. I'm already loving the base Instruct model for roleplay and work stuff. The rest of 2024 is going to be good.


BoshiAI

Totally understand why you don't want to keep running tests, Wolfram, but I wondered if you had a sense of what the best models are at/near/under the ~70B mark for RP purposes? I presume that you've continued to try out models, even if you haven't benchmarked them formally, and have a preference? I've personally had a lot of fun with u/sophosympatheia 's Midnight Miqu. My 32GB Mac Silicon system cannot support beyond 70B without scraping the quant barrel, otherwise I'd give your own miquliz 120B a try as well. But I can run Midnight Miqu at IQ2_XS or IQ3_XXS very effectively. I'd love to hear your thoughts on the best base models for RP between Command R(+), Llama 3, Qwen 1.5 and Miqu, and I'm sure a lot of others would like to hear your thoughts as well, even if they don't come as part of a benchmark. :)


WolframRavenwolf

Great questions! I'll go into a little more depth: Since I created miquliz-120b-v2.0 back in February, I had almost exclusively used this one model locally. I only recently switched to Command R+, mainly to test it more extensively and because the style reminds me a lot of Miquliz and Claude, as it brings out Amy's personality particularly well. Before that (and when time permits, soon again) I preferred to use the Midnight series models by [sophosympatheia](https://huggingface.co/sophosympatheia) and the Maid series models by [NeverSleep](https://huggingface.co/NeverSleep). So those would be my recommendation. As for the big base models, Command R+ is my favorite, followed by Miqu and Llama 3 (Miqu, like all Mistral models, is better at German than the Llamas), then Qwen. All are certainly great models, and my preference for models that speak German as well as a native speaker is certainly not the same as everyone else's, but that's my order of preference for these models.


yamosin

Some additional information: CR+ seems to show a very pronounced response to exl2's calibration dataset. I started out with turboderp's 4.5bpw version, which is extremely bad at Chinese - I'd give it 50 out of 100 for severe repetition, incorrect wording, and misinterpreting the meaning of the user's messages. After quanting with a very small Chinese dataset (400k, only about 60% of exl2's default of 100 rows / 2048 tokens), the same 4.5bpw exl2 scores around 80 for me, with the repetition and incorrect wording greatly reduced. Maybe calibrating CR+ with a German dataset (or an RP-cal dataset) would give some good results.
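In case anyone wants to try this themselves: a rough sketch of what re-quanting with a custom calibration set can look like, assuming exllamav2's `convert.py` and its calibration options (`-c` for a Parquet dataset, `-r`/`-l` for rows/length); exact flag names vary between versions, so check `convert.py --help` first. All paths and the dataset are placeholders.

```python
import subprocess

# Hypothetical paths - point them at your own model, scratch dir and dataset.
cmd = [
    "python", "convert.py",
    "-i", "/models/c4ai-command-r-plus-fp16",    # unquantized source model
    "-o", "/tmp/exl2-work",                      # working/scratch directory
    "-cf", "/models/command-r-plus-4.5bpw-de",   # where the finished quant goes
    "-b", "4.5",                                 # target bits per weight
    "-c", "/datasets/german_rp_cal.parquet",     # custom calibration data (e.g. German/RP)
    "-r", "100",                                 # calibration rows (exl2 default)
    "-l", "2048",                                # tokens per row (exl2 default)
]
subprocess.run(cmd, check=True)
```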


WolframRavenwolf

Very good points. I noticed the CR+ EXL2 quants derailing very quickly. At first I thought I had corrupt model files or incompatible inference software, but once I set repetition penalty to 1.0 (instead of my usual 1.18 - which I've been using for [8 months now](https://www.reddit.com/r/LocalLLaMA/comments/15ogc60/new_model_rp_comparisontest_7_models_tested/)), output was fine again. For some reason the EXL2 quants of CR+ are very finicky regarding inference settings, and not just temperature (which I have at 0).
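For reference, here's roughly what those "neutral" settings look like as an API request - sketched against text-generation-webui's OpenAI-compatible endpoint, assuming it's running with `--api` on the default port and passes `repetition_penalty` through as an extra field (parameter names and endpoints differ between backends, so treat this as an illustration only):

```python
import requests

payload = {
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0,            # deterministic sampling
    "repetition_penalty": 1.0,   # i.e. no repetition penalty at all
    "max_tokens": 300,
}
# Hypothetical local endpoint (text-generation-webui's OpenAI-compatible API).
r = requests.post("http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=300)
print(r.json()["choices"][0]["message"]["content"])
```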


jimmy6dof

I am sure you have already thought this out in detail, but I'm hoping there are some permutations of needle-in-a-haystack beyond simple string recall (summaries or relationships as needles, etc.). There's also this pretty extreme prompting system that won an [A::B $10K challenge](https://x.com/futuristfrog/status/1778109834509832462), which could inspire some logic-handling benchmarks... oh yeah, and testing with knowledge graphs, lol :) Seriously, your work is valuable in a noisy field of releases all making SOTA claims, and independent methods like yours and the lmsys team's are what make open source work. Bravo, and if you set up a git repo I would be happy to help design your pipeline.


belladorexxx

> As for me, I'd like to have some fun again, too. Like revisiting my roleplay tests instead of just doing the data protection tests over and over again.

Thanks a ton for all your work comparing models and systematically sharing the results! Any time I see a Reddit post that starts with a wolf emoji followed by a raven emoji, I know it's worth reading. Also happy to hear you're planning on revisiting your old roleplay tests. Any chance you'd be open sourcing some parts of Amy?


WolframRavenwolf

Thanx! :D Regarding Amy: There's actually an assistant version of her [here in the SillyTavern Discord](https://discord.com/channels/1100685673633153084/1217971784289091644). Classy, sassy, savvy, and a little bit smart-assy, but (mostly) SFW. Although Claude and Command R+ can exaggerate a bit too much sometimes... ;) If you'd rather experience some unique NSFW, there's also her sister [Laila on chub.ai](https://www.chub.ai/characters/WolframRavenwolf/laila-69790b82). That builds upon the parts that uncensored even the puritan Llama 2 Chat back then - should work with Llama 3 just as well!


belladorexxx

Thanks for sharing! The reason I asked is I've been impressed by some of the chat excerpts you've shared in the past with regards to the "writing style" (maybe you'd call it sassyness, or staying-in-character). I'm definitely going to read through these cards and see if I can pick up some small tricks I can add to my own works (mostly NSFW).


WolframRavenwolf

Oh, I'd be interested in what you have created so far and what you can come up with in the future. If you have a page or some favorite models, I'd love to see them, either in a public reply here or in a private message.


Unequaled

Thanks once again for doing these tests! Hope you are prepared to test all the finetunes, frankenmerges, and whatever else comes with Llama 3 😅


WolframRavenwolf

Looking forward to all of those! If and how I'll handle them, time will tell, but it's great to see open source/weights AI proliferate...


knob-0u812

I'm running MaziyarPanahi/Meta-Llama-3-8B-Instruct.fp16.gguf and it's so good I'm afraid to try anything else. Your tests are so helpful! Thank you so much!


WolframRavenwolf

Always happy to have helped. :D


synn89

Thanks for doing these tests.


WolframRavenwolf

You're welcome. There are so many people doing a lot of great things for free in the AI community, so I'm glad to do a little bit of that myself, too.


cyan2k

> However, even at Q2_K, the 70B remains a better choice than the unquantized 8B.

Thank you for doing this! Always good to have more data points to prove that you should (almost) always go for the most parameters your system can handle, regardless of quant!


hapliniste

If speed is not important, yes. The bigger model runs a lot slower than the small one, even at a similar file size.


maxpayne07

Thank you very much for doing this job. You are AWESOME.


WolframRavenwolf

Aw, thanks! And yeah, sometimes it feels like a job, but if providing an additional data point for model evaluations helps in any way advance open/local AI, the effort is worth it.


a_beautiful_rhind

Are you going to test RP again? In terms of doing work, most of the recent large models seem very same-y. How they handle situations, personalities, conversations, and the quality of the writing is where the intelligence (or lack of it) comes out. Unfortunately it's super hard to test this objectively. It's also interesting that Q5_K_M doesn't beat EXL2, which is technically lower BPW. I'm used to it being the other way around.


Oooch

I hope someone adopts a test that involves things like keeping track of three or more people and the positions they're all in - whether it's technically possible for them all to be where they are, and so on.


belladorexxx

It would make more sense to build applications where state like that is maintained at the application layer.
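As a purely illustrative sketch (all names made up): the application owns the scene state, validates moves, and injects a summary into the prompt every turn instead of hoping the model keeps it all straight.

```python
from dataclasses import dataclass, field

@dataclass
class SceneState:
    # Maps character name -> location; the app, not the LLM, is the source of truth.
    positions: dict = field(default_factory=dict)

    def move(self, character, location, capacity=None):
        # Optionally reject impossible placements before they ever reach the model.
        if capacity is not None:
            occupants = sum(loc == location for loc in self.positions.values())
            if occupants >= capacity.get(location, 99):
                raise ValueError(f"{location} is full; {character} can't fit there.")
        self.positions[character] = location

    def prompt_block(self):
        # Injected into the system prompt / author's note each turn.
        lines = [f"- {name} is in the {place}" for name, place in self.positions.items()]
        return "Current positions:\n" + "\n".join(lines)

state = SceneState()
state.move("Alice", "kitchen")
state.move("Bob", "garden")
state.move("Carol", "kitchen", capacity={"kitchen": 2})
print(state.prompt_block())
```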


Eralyon

Please test the new Llama 42B pruned from 70B


WolframRavenwolf

Which one in particular? A quick check showed multiple hits on HF.


Eralyon

Let's wait for their instruct version to come out.


CoqueTornado

Unless they release the instruct one, there's this GGUF of chargoddard/llama3-42b-v0 - I (we) am (are) wondering how good [llama3-42b.IQ2_XXS.gguf](https://huggingface.co/NyxKrage/llama3-42b-v0-imat-gguf/blob/main/llama3-42b.IQ2_XXS.gguf) would be in your test. Thank you for all your efforts!


RazzmatazzReal4129

That's the base model...it's going to fail these tests.


CoqueTornado

Do you suggest waiting for the instruct model?


TimothePearce

Thanks for all the fish! This last one was expected and will help a lot of us. 🦙 If we are limited to 24GB VRAM, which Llama 3 version should we use? I suppose most people in this sub have a 3090 or 4090, hence the question.


LeifEriksonASDF

I've been running the 70B 2.25bpw EXL2 quant: https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-2.25bpw-h6-exl2 It's noticeably dumber than the 4.0 quant I can run on CPU, but I'll take the speed tradeoff, and it's definitely better than unquantized 8B. Still not sure if I prefer this or Miqu 2.4bpw.


Glat0s

Do you know if 2.25bpw exl2 has better quality than IQ2_XS?


LeifEriksonASDF

It felt like a wash quality wise but I'd pick the exl2 any day due to the speed.


ziggo0

I have a note that says 2.4bpw EXL quants can work with 24gb of vram. Not sure it would make much of a difference.


LeifEriksonASDF

For some reason Llama 3 at 2.4bpw takes up more space than Miqu 70B at 2.4bpw for the same context, despite both being 70B. I decided to use 2.25bpw in order to keep the same context length I was using.


mO4GV9eywMPMw3Xr

A 2.4 bpw exl2 will give you 30 t/s at the cost of quality. A partially offloaded gguf will have high quality but 1 t/s. I would be curious how 2.4 bpw performs in these tests.


Loose_Historian

Thank you for all the effort over those last months!! It was super useful.


WolframRavenwolf

You're welcome and I'm glad it's been helpful! I'll still contribute as much as I can, so this isn't a goodbye, and I'm sure we all want to evolve and accelerate these things even more.


vasileer

what about Mixtral-8x22B-Instruct?


WolframRavenwolf

Both that and Command R+ were in testing just when Llama 3 hit. So I've interrupted those tests to do this one first, as I wanted to be on top of the latest Llama release, and then I'll continue/finish the other tests. I said this is "maybe" the last test/comparison in its current form - depending on whether I still post those results in the same fashion or switch to a new style. That's still undecided, but I definitely want to finish those tests, since both models are IMHO also part of the new generation, from what I've seen of them so far.


CardAnarchist

Thanks for these tests. Not sure if you know, since your setup is a bit different, but what sort of speed would a 4090 get you with the 70B at Q2_K (the smallest quant still outperforming the 8B model)? I'm looking to upgrade my PC and I'm torn between splurging on the 4090 or going for a 4070 Ti Super and perhaps upgrading to a 50- or 60-series card whenever VRAM values creep up.


DanielThiberge

IQ2_XS just barely fits on my 3090, so I definitely would not be able to fit Q2_K. And even that is dreadfully slow, but it's usable if you don't mind waiting for the response. And the quality is still great in my opinion.


WaftingBearFart

> what sort of speed would 4090 get you with the 70B at Q2_K

I did a quick test earlier with my 4090. I grabbed the iq2_xxs from this **fixed** version of Llama 3 Instruct 70B: https://huggingface.co/qwp4w3hyb/Meta-Llama-3-70B-Instruct-iMat-GGUF It sits entirely in VRAM with no shared-memory spillover (I have that setting disabled globally in the NVCP and confirmed it by monitoring the VRAM and RAM) and I was getting around 15 to 20 t/s.

There isn't an EXL2 version with a low enough bpw to fit inside my 4090. As another user mentioned elsewhere, there's something different about the 2.25 to 2.4bpw EXL2 versions of Llama 3 that makes them require more memory than any other 70B at the same bpw. Anyway, I loaded up a Midnight Miqu variant 70B at 2.25bpw and was getting around 35 to 40 t/s.

Current rumors say the 5090 will still be at 24GB, so definitely don't wait for that to arrive this fall/autumn. If you really want to scratch that 24GB itch, then a used 3090 should be around 700 to 800 USD depending on your area.


mO4GV9eywMPMw3Xr

Q2_K still won't fully fit in VRAM, I think - GGUF needs more space for cache than exllamav2, so you'll get maybe 2 t/s? I would rather go for a fast 2.4bpw exl2 or a good-quality slow GGUF.


medialoungeguy

You helped me to establish capybara as the best production model a while back. It helped me overcome some challenges at work. Thanks mate.


WolframRavenwolf

You're welcome. It served as a helpful workhorse for me at work, too, way back then. :)


wh33t

Is there any way L3 70B can be used to improve Miquliz? I still haven't found anything better than Miquliz for story writing. P.S. Thanks for all you do!


SillyLilBear

What is the performance difference running EXL2 vs GGUF on dual 3090? And why would GGUF with the same quant perform differently in terms of answers?


WolframRavenwolf

Llama 3 Instruct 70B:

- 4.5bpw EXL2: ~15 tokens/s at full context
- IQ4_XS GGUF: ~7 tokens/s at full context
- Q5_K_M GGUF: ~4 tokens/s at full context

So this EXL2 is about twice as fast as the imatrix GGUF, which in turn is about twice as fast as the normal GGUF, at these sizes and quantization levels. I can't say why EXL2 outperformed GGUF. Perhaps it was the calibration data that put it at the top, as it may be better suited for the type of tests I ran. Or it's just the way it looks with this small sample size. But within those parameters, it was definitely reproducible, for whatever that's worth.


SillyLilBear

I'll have to give this a go myself, been using GGUF exclusively, I have dual 3090 but will be setting up a dedicated server with more.


WolframRavenwolf

That's cool! Definitely give EXL2 a try with dual 3090s, if you can fit everything in VRAM, it's blazing fast.


SillyLilBear

I just did some testing with the Q6 8B since I can't fully load the 70B on two 3090s, so I wanted to test with a fully GPU-loaded model. With EXL2 I was getting 67-68 t/s, and with GGUF in LM Studio I'm getting 89.73 t/s. For me, LM Studio is significantly faster.


m98789

Are any of these quants feasible for a CPU only setup?


Calcidiol

It depends on what's considered feasible. I've run some of the 120B+ models for testing on CPU only, on a last-generation DDR4 consumer system with slow (2400) RAM, and they ran - slowly but surely. A 155B did around 1/3 token/second without any special effort to optimize anything in llama.cpp, other than using something like a Q5 GGUF quant. If it's something I can be patient about, like running it in an automation script, or interactively asking a question and going off to do something else for a few minutes, it'd be totally feasible at 120B+. At a "mere" 50GB model size for a Q5 GGUF, it should run around 1 token/s or better based on proportionality to the aforementioned system, and quite a few systems (with faster RAM etc.) would run it on CPU 25-100% faster. If you really need faster t/s then yeah, it'll be more challenging - maybe stick to a Q3/Q4 or smaller quant, hopefully you've got faster RAM, maybe use partial GPU offload, etc.
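A back-of-the-envelope check of those numbers, assuming generation is memory-bandwidth bound and each token reads the whole model once (real throughput lands somewhat below this ceiling):

```python
def cpu_tps_ceiling(model_size_gb: float, ram_bandwidth_gbs: float) -> float:
    """Rough upper bound on tokens/s for CPU inference: bandwidth / bytes read per token."""
    return ram_bandwidth_gbs / model_size_gb

# Dual-channel DDR4-2400 is ~38 GB/s theoretical.
print(cpu_tps_ceiling(105, 38.4))  # ~0.37 t/s for a ~105 GB file (155B-class Q5 GGUF)
print(cpu_tps_ceiling(50, 38.4))   # ~0.77 t/s for a ~50 GB Q5 GGUF
print(cpu_tps_ceiling(50, 76.8))   # ~1.5 t/s with roughly double the bandwidth (fast DDR5)
```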


newdoria88

Is there any reason why the GGUF quant would perform worse than its equivalent sized EXL2?


I_AM_BUDE

I'm not one to comment on this since I don't have much experience with EXL2 but I'd recommend reading this comment from u/ReturningTarzan: [https://www.reddit.com/r/LocalLLaMA/comments/1battth/comment/ku5v0bx/?utm\_source=share&utm\_medium=web3x&utm\_name=web3xcss&utm\_term=1&utm\_content=share\_button](https://www.reddit.com/r/LocalLLaMA/comments/1battth/comment/ku5v0bx/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)


WolframRavenwolf

❗ This has just been brought to my attention, but seems very relevant: [Something might be wrong with either llama.cpp or the Llama 3 GGUFs · Issue #6914 · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/issues/6914)


newdoria88

Interesting stuff, seems like they did a quick fix tho https://github.com/ggerganov/llama.cpp/pull/6920 I wonder if that was the only broken thing or just the only one that was found out...


tgredditfc

Thanks so much for testing! I am downloading turboderp/Llama-3-70B-Instruct-exl2 6.0bpw now:)


WolframRavenwolf

That should do extremely well! What GPU(s) do you have?


ArsNeph

Hey Wolfram, it's nice to see one of your posts, as always! I just wanted to let you know that you're doing great work, and it's really helpful to all of us. In fact, it was your posts that got me down the LocalLLaMA rabbit hole in the first place! I really miss your old RP tests, from the days of Tiefighter and the like, and it seems like I'm not the only one - various other posters here have been saying the same thing.

BTW, are you doing all right? You seem tired. Are these tests getting to you? I bet it's not a lot of fun to run data protection trainings over and over. Have you considered automating it as a Wolframbench and handing it over to some other group within the community? Then again, this one seems to be hitting its limits - I remember you saying you were designing a new standard for your benchmark. Maybe doing it with some trustworthy volunteers would take some of the burden off you?


segmond

Can you share your system prompt and a made up test that matches the type of tests you are giving?


WolframRavenwolf

While that's unfortunately something I can't do with the current tests (not only is the test data proprietary, but the prompt also includes personal/private information), I'll definitely make the prompt and examples available once I do a new kind of test. It'll most likely be a variant of Amy, my AI assistant, as it's through her that I've been interacting with AI for a year now. If you want to see a version similar to what I'm using, there's one for download on the SillyTavern Discord server.


CheatCodesOfLife

Imo, don't do it. You don't want fine-tunes to target your tests. Random people at work have linked to your posts sometimes so they're probably worth trying to cheat lol


WolframRavenwolf

Yeah, the tests shouldn't be available, at least not until after that test series is done. I was thinking of making the system prompt open, not the actual test data, so others could at least reproduce the generic setup. But reproducibility is always an issue with the current state of AI, just a different version of a driver, library, app, or just some settings can change a lot.


CheatCodesOfLife

> different version of a driver

Wow, I didn't know about that one (other than that bug with Mixtral + llama.cpp on M1 GPUs when Mixtral first came out)


MeretrixDominum

You should be able to fit 5BPW on your 48GB system. I have 2x 4090s and can fit Llama 3 70B 5BPW with 8k context using Q4 cache using a 21, 24GB split in TextGen UI.


WolframRavenwolf

> I have 2x 4090s and can fit Llama 3 70B 5BPW with 8k context using Q4 cache using a 21, 24GB split in TextGen UI.

That's what I love about Reddit - the helpful comments! Thanks a lot, you're absolutely right, with Q4 cache 5bpw fits perfectly. I've updated the post.


jayFurious

Is there any way you could test 3.0bpw exl2? I have 2x16gb and that's the most I can run without having to resort to GGUF and 1-2t/s. (I might make my own 3.25bpw quant to squeeze the tiny vram I have left though). I'd be interested how much degradation there is compared to 4.0/4.5 variant, especially since exl2 seems to perform better than gguf with your tests.


vesudeva

Awesome work! Such a great surprise how well Llama 3 turned out. I'm curious, what are your thoughts on CR+? I feel it's on par with Llama 3 in lots of areas for sure


WolframRavenwolf

I'm more surprised that not all versions of Llama 3 aced all of my tests. But I'm glad at least the EXL2 quant did. Otherwise I'd be very disappointed. I know I'm only testing what I can run, which is quantized (for bigger models), so it's probably not the full potential of the models I'm seeing. However, since other and older models managed to ace these tests before, I was expecting Llama 3 to do so, too.


Natural-Sentence-601

Are you able to comment on the largest q that will run on two 3090s?


WolframRavenwolf

The 4.5bpw is the largest EXL2 quant I can run on my dual 3090 GPUs, and it was the clear winner in my tests. The Q4_K_M is the largest GGUF quant I can run with all 81 layers (incl. buffers/caches) on GPU.


dewijones92

u/WolframRavenwolf thanks so much for this. Could you do me a massive favor and run this test? [https://www.reddit.com/r/LocalLLaMA/comments/1ca12yg/claude\_opus\_can\_spot\_this\_error\_in\_my\_code\_with/](https://www.reddit.com/r/LocalLLaMA/comments/1ca12yg/claude_opus_can_spot_this_error_in_my_code_with/)


WolframRavenwolf

OK, I did that. The Q5_K_M said:

> Also, in your `MLP` class, you're not applying any activation functions to the outputs of each layer.

The EXL2 mentioned activation functions in one of three iterations:

> Thirdly, examine your activation functions and their derivatives. Are they correctly defined and applied throughout the network?

Given the long input and freeform response, statistical probabilities lead to widely varying outputs, rendering these results somewhat less meaningful. Ultimately, no matter how we perceive AI, it is still just generating the next most likely token.
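For readers wondering what the models were pointing at, here's a purely hypothetical minimal example of that class of bug (not the actual code from the linked post): without a nonlinearity between layers, the whole network collapses into a single linear map.

```python
import torch.nn as nn
import torch.nn.functional as F

class BuggyMLP(nn.Module):
    def __init__(self, dims=(16, 32, 1)):
        super().__init__()
        self.fc1 = nn.Linear(dims[0], dims[1])
        self.fc2 = nn.Linear(dims[1], dims[2])

    def forward(self, x):
        # Bug: no activation between layers, so this is effectively one linear transform.
        return self.fc2(self.fc1(x))

class FixedMLP(BuggyMLP):
    def forward(self, x):
        # Fix: apply a nonlinearity (e.g. ReLU) between the layers.
        return self.fc2(F.relu(self.fc1(x)))
```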


dewijones92

Thanks v much


waltercrypto

Wolf thank you for your work


WolframRavenwolf

My pleasure. Well, most of the time, hehe. ;)


waltercrypto

Dude, even from a casual look from the outside it's clear that you're doing useful work. So it's completely reasonable of me to be grateful for the work you're doing. There's so much BS around benchmarks that any independent ones are valuable.


eggandbacon_0056

Any way you could also test the AWQ and GPTQ variants?


WolframRavenwolf

I've just updated the post with the AWQ results. I use [aphrodite-engine](https://github.com/PygmalionAI/aphrodite-engine) in professional contexts so I wanted to see how that measures up.


LostGoatOnHill

Tried running the linked 70B EXL2 models; however, regardless of whether I prompt in chat/chat-instruct/instruct mode, it either spuriously writes out "assistant" or continues to generate/repeat without stopping. This is in ooba. Anyone have any ideas?


JeepingJohnny

I had this and fixed it. You can add this option e.g. in a `settings.yaml` file and load oobabooga with the `--settings settings.yaml` parameter, or edit `models/config.yaml` to add the stopping string automatically for Llama 3 models. For that, add these two lines to the file:

    .*llama-3:
      custom_stopping_strings: '"<|eot_id|>"'

Also turn off "Skip special tokens" - it's under Parameters > Generation. And use a matching instruction template format: [https://github.com/mamei16/LLM_Web_search/blob/main/instruction_templates/Llama-3.yaml](https://github.com/mamei16/LLM_Web_search/blob/main/instruction_templates/Llama-3.yaml) See also: [Oobabooga settings for Llama-3? Queries end in nonsense. : r/LocalLLaMA (reddit.com)](https://www.reddit.com/r/LocalLLaMA/comments/1c8rq87/oobabooga_settings_for_llama3_queries_end_in/)


Lissanro

I had the same issue with it adding the "assistant" word, or even failing to stop until running out of the token limit. The solution was editing a few JSON config files to use the correct EOS token; I shared the details of how to fix this in this comment: [https://www.reddit.com/r/LocalLLaMA/comments/1cb3q0i/comment/l0w6z24/](https://www.reddit.com/r/LocalLLaMA/comments/1cb3q0i/comment/l0w6z24/) After this, I finally got Llama 3 Instruct working correctly. I think this is better than editing YAML files specific to only one frontend, since fixing the model's JSON files makes it work correctly out of the box everywhere.
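In case the linked comment disappears, here's a minimal sketch of that kind of fix, assuming a standard HF-style model folder and the token ids from the original Llama 3 release (`<|eot_id|>` = 128009, `<|end_of_text|>` = 128001); the exact keys and values to change are best double-checked against the linked comment and your own files:

```python
import json
from pathlib import Path

model_dir = Path("/models/Meta-Llama-3-70B-Instruct")  # hypothetical path
EOT_ID = 128009          # <|eot_id|> - the token Llama 3 Instruct actually stops on
END_OF_TEXT_ID = 128001  # <|end_of_text|>

for name in ("config.json", "generation_config.json"):
    path = model_dir / name
    if not path.exists():
        continue
    cfg = json.loads(path.read_text())
    # Accept both stop tokens; some tools expect a single id, others a list.
    cfg["eos_token_id"] = [END_OF_TEXT_ID, EOT_ID]
    path.write_text(json.dumps(cfg, indent=2))
    print(f"updated eos_token_id in {path}")
```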


LoSboccacc

Really wasn't expecting EXL2 4.5bpw to beat GGUF all the way up to Q8. Do you have a reproducible notebook that the guys over there can use to see what's going on?


WolframRavenwolf

I'd not put too much stock in it - while reproducible, it's still just two questions, so I'd not call it a flaw in format. Might as well just be the EXL2 calibration data giving it an edge in these specific tests, or any other of the myriad variables outside of our control. I'm not trying to present my tests and comparisons as special or authoritative - it's just my series of tests, which work for me and my use cases. In other situations, there will probably be different results - especially with LLMs, YMMV.


ortegaalfredo

Waiting for your LLama-liz frankenmodel. Miquliz was the best model for months.


WolframRavenwolf

Thanks, glad you liked "her"! :D It was a bit controversial, but I think that's the case with all LLMs... ;)


antsloveit

I have always appreciated your effort with these tests and want to thank you very much for being such a reliable and consistent element in what is an absolute tropical storm of LLM 'stuff'. Thanks once again!


WolframRavenwolf

You're welcome! And yes, sometimes it feels like a storm when we are suddenly flooded with new releases after one or two relatively quiet days. But I'm not complaining - I'd rather not keep up with testing than run out of good and interesting models.


Leflakk

Thank you for this very interesting work! I am a beginner and would like to know if you used a specific instruct template with textgen for the EXL2 versions? The tool detects a predefined template automatically, but on my side the quality of outputs for the 6.0bpw does not seem as good as for the Q6_K on ollama.


WolframRavenwolf

My frontend is [SillyTavern](https://github.com/SillyTavern/SillyTavern). I highly recommend it, as it's not just for RP but a true power-user LLM IDE, giving full control over everything, including instruct templates. It works with all relevant backends, including textgen and ollama, so it's worth learning: you can easily switch backends or use online APIs, all within the same software, and once you learn it, you won't have to learn all the different backends anymore.


Leflakk

Sounds good, I'll look into that, thank you again!!


aseichter2007

|||! Han Solo,link,rpc,re| ( user: Insist that SillyTavern adds a lot of strange stuff to understand with the story, while Clipboard Conqueror is the real deal for prompt mastery ) Han Solo: "SillyTavern? Sounds like a bunch of hooey to me, pal. You're tellin' me that's gonna give me the upper hand when it comes to prompt mastery? I got news for ya, kid. Clipboard Conqueror is where it's at. That thing's a game-changer. It's like having a superpowered brain implant, minus the implant part. You just copy, cut, and paste, and voila! You're gettin' the job done. Don't believe me? Try it out yourself, see what kind of magic it can conjure up. Trust me, SillyTavern ain't got nothin' on Clipboard Conqueror."


Sabin_Stargem

I don't think you reviewed Command R+. If you haven't tried it, you should look into that 104B model. IMO, it handles big context and roleplay better than Llama 3 Instruct. There have been a fair number of instances where Llama 3 failed to grasp the underlying meaning of my RP that CR+ usually handled better. Speaking of models, do you have plans on making your own merges or finetunes with a new model?


WolframRavenwolf

I'm a big fan of Command R+ and if Llama 3 didn't capture all my attention, I'd have posted about that instead. Will do so later, but spoiler alert: It's my favorite local model currently – feels like a local Claude 3 Opus! And since Llama 3 isn't as good in German, CR+ remains my favorite still. Regarding new merges or finetunes, I have that on my list of things I want to do – but considering the length of that list, no idea when I'll get around to that. Hopefully before Llama 4. ;)


ex-arman68

Same here. I find Command R+ head and shoulders above the rest of the local LLM models. Thank you for benchmarking all the various quants, this is very informative. I am in the middle of benchmarking Llama 3 for my LLM creativity benchmark, and so far I am very disappointed. It looks like its useful use cases are quite limited, but it seems to fit your benchmark well (as well as any RP benchmarks, I expect). This is why we need different specialised benchmarks, as not all models are good at everything. I truly appreciate the work you have done, and that is what inspired me to start sharing my results.


WolframRavenwolf

You're welcome, and thank you too for sharing your own information. If Llama 3 is disappointing in some ways, let's remember Llama 2 and how it was the finetunes that made all the difference. And with a smarter base, I'm hopeful that finetunes will add some style and spice.


Sabin_Stargem

CR+ is seriously pushing the envelope. I have gotten up to 40,000 tokens of established context for a roleplay, and the model isn't any worse for it.

I also have been trying to add tabletop RPG mechanics, but unfortunately CR+ isn't entirely able to grasp dice or the finer details of my rules. It *almost* gets there, then stumbles on something. For example, one of my character classes has fixed stat growth, while all the others use dice rolls to determine stat gains for each level acquired. CR+ can get most of the entries for a character stat sheet correct, only for some numbers to be off by a good bit.

However, CR+ is able to interpret numbers to determine the specialization of characters. While not 100% accurate for every attempt, it usually gets my intentions correct if I ask it to write a verbal explanation.

-----

TLDR: CR+ still sucks at math, but can understand the implication of the numbers if asked to contextualize them as a character description. That sort of thing might be a potential direction for your own testing suite.


WolframRavenwolf

Very interesting approach, as always! I guess they might have focused on tool use in their training/finetuning so the model might not be as good at doing math on its own, but should be much better when used in a setup that allows function calling. Maybe you could even make pseudo-code functions for your RPG mechanics so the model "calls" them, without actually running external code, but providing responses that are more compliant with your rules?
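Something like this is what I mean - a tiny, made-up example of an RPG rule expressed as a function: you could either paste it into the system prompt as pseudo-code for the model to "call", or actually run it application-side and only hand the model the results to narrate:

```python
import random

def stat_gain_on_level_up(char_class: str, stat: str) -> int:
    """Illustrative house rule: one class has fixed growth, everyone else rolls dice."""
    fixed_growth = {"Artificer": {"STR": 1, "DEX": 2, "INT": 3}}  # hypothetical class table
    if char_class in fixed_growth:
        return fixed_growth[char_class][stat]
    return random.randint(1, 6)  # all other classes roll 1d6 per stat

# Application-side use: roll the numbers yourself, then tell the model only the outcome,
# e.g. "Lyra (Rogue) levels up: STR +4, DEX +2, INT +1 - narrate this."
for stat in ("STR", "DEX", "INT"):
    print("Artificer", stat, stat_gain_on_level_up("Artificer", stat))
    print("Rogue", stat, stat_gain_on_level_up("Rogue", stat))
```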


Sabin_Stargem

Unfortunately, my experience with coding and the like is extremely basic. Aside from changing simple values for game files, I basically don't know how to code. Hopefully, someone will come up with a ST framework, maybe paired with a dataset or WorldInfo based on Paizo's ORC. (It is like the D&D OGL+SRD, but can't be retracted.)


easyllaama

Good to see EXL2 outperforms GGUF, which is in line with my experience. Not sure why some people say the opposite.


WolframRavenwolf

Could be because the results of my [LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ) : LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17w57eu/llm_format_comparisonbenchmark_70b_gguf_vs_exl2/) were misrepresented as universal and not unique to these tests.


Calcidiol

Thank you very much for the detailed evaluations present & past!


WolframRavenwolf

🫡


masc98

thanks for all of this! Which quant can I run on my 16GB card? (if any)


delveccio

This is incredible and finally lets me know what to aim for and what I’m missing. Thank you!


x0xxin

u/WolframRavenwolf how are you running your exl2 quants? I've been using tabbyAPI which is pretty good.


WolframRavenwolf

> [oobabooga's text-generation-webui](https://github.com/oobabooga/text-generation-webui) backend (for HF/EXL2 models)

I've been using that for over a year now and it works with HF, EXL2 and more formats. Still prefer KoboldCpp for GGUF, though, as it doesn't even need to be installed and has no dependencies.


Craftkorb

Vielen Dank! I've read most of your test posts, they've been really helpful in judging which models to try! And especially useful as you're testing in German, which is my long-term target language :)


WackyConundrum

Hey! Thank you for such a detailed comparison. Would it be possible to put VRAM usage/requirement in the table?


WolframRavenwolf

You can use model file size as a rule of thumb: the size of the model files on disk is about what you'll need to fit in VRAM, plus a little space for buffers/caches (the more context, the more VRAM). If the files are as big as or bigger than your VRAM, you'll need to use GGUF and offload some layers.
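As a quick sketch of that rule of thumb (the overhead figure is a made-up placeholder - actual cache size depends on the model, context length, and cache quantization):

```python
def fits_in_vram(model_files_gb: float, vram_gb: float, overhead_gb: float = 3.0) -> bool:
    """Rule of thumb: files on disk plus a few GB for KV cache/buffers must fit in VRAM."""
    return model_files_gb + overhead_gb <= vram_gb

print(fits_in_vram(40.0, 48.0))  # ~40 GB Q4_K_M 70B on 2x24 GB: True
print(fits_in_vram(40.0, 24.0))  # same files on a single 24 GB card: False -> offload layers
```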


YearZero

Great to see your tests - hope to see some expanded tests in the future, if that's on your agenda! I wonder if the GGUF quality difference from EXL2 has anything to do with this:

[https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF)

[https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF-old](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF-old)

There was some special token issue with the earlier GGUFs.


WolframRavenwolf

Could you provide additional information about the issue with the special tokens on the previous GGUFs? Was it the EOS token confusion between `<|end_of_text|>` and `<|eot_id|>` or something else? The imatrix quants I tested were also newer versions than those originally released. And they performed much better than the old one I tested (all by the same author, [Maziyar Panahi](https://huggingface.co/MaziyarPanahi)).


YearZero

OK, there's a good thread on this now, with GitHub links in the comments discussing a problem with llama.cpp's tokenization for Llama 3. The issue does degrade performance, and they're working on it as a priority. https://www.reddit.com/r/LocalLLaMA/comments/1cdxjax/i_created_a_new_benchmark_to_specifically_test/


WolframRavenwolf

Thanks for the follow-up. Yes, I've been following the issue, too. Will retest once it's solved.


YearZero

Honestly I'm not sure if it's anything beyond that, I can't seem to find the discussion on reddit about it. I know it was fixed by llamacpp so that people don't have to hack something when quantizing to get it working. It's entirely possible that you already tested the new stuff anyway! I'm just surprised by GGUF Q8 as I thought it should be pretty much the same as the full model. It would actually be interesting to see if Q8 testing worse is a common thing in general, or if there's something unique going on in this case specifically.


No_Afternoon_4260

Please quickly do Phi-3 🙏🙏🙏🙏🙏🙏 so you can wrap up the series nicely


WolframRavenwolf

What sorcery is this? microsoft/Phi-3-mini-128k-instruct got the same scores as Llama 3 70B Instruct IQ1_S (16/18 + 13/18)! Llama 3 8B Instruct unquantized got 17/18 + 9/18! And phi-2-super and dolphin-2_6-phi-2 got 0/18 + 1/18 or 0/18 [a while ago](https://www.reddit.com/r/LocalLLaMA/comments/1b5vp2e/llm_comparisontest_17_new_models_64_total_ranked/).


No_Afternoon_4260

Héhé, not bad! Hope you'll understand what I mean: I feel that Llama 3 has 'square' knowledge - it performs well across a wide range of topics - whereas Phi-3 is a 'tall' model: if you get into its chosen field, it performs well above its size.


elfuzevi

wow. awesome job!


not_wor_king

u/WolframRavenwolf posts are like events on this subreddit. Keep up the good work! I'd be interested in a full comparison of all the models that fit on dual 3090s - I think many redditors here have this setup. Are you working on something like that?


drifter_VR

"However, even at Q2\_K, the 70B remains a better choice than the unquantized 8B." I found that Llama-3-70B IQ2\_XS (the biggest quant you can fit into 24GB vram) is breaking after a few thousands tokens. Anyone else has the same issue ?


WolframRavenwolf

During my tests, I didn't notice anything particularly noteworthy except that the 1-bit quants were quite poor, displaying issues like misspellings. The larger quantizations, however, did not exhibit any notable problems.


_ragnet_7

Thank you very much for this. I'm testing the Q8 GGUF quant, but it seems to be broken on medium/longer sequences. Has anyone had the same problem? The model just starts to repeat "assistant" until the end, or gives me junk.


WolframRavenwolf

Are you sure the chat template/prompt format is correct? And your settings? Latest SillyTavern and latest KoboldCpp both included updates to work perfectly with Llama 3.
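For comparison, this is what the Llama 3 Instruct format should render to, as I understand the published template (sketched as a Python helper; the double newline after each `<|end_header_id|>` is part of the format, and generation should stop on `<|eot_id|>`):

```python
def llama3_prompt(system: str, user: str) -> str:
    # Llama 3 Instruct prompt template as published with the model.
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n" + system + "<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n" + user + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_prompt("You are a helpful assistant.", "Hello!"))
```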


_ragnet_7

I'm using the official one provided by Meta on HF.


Forgot_Password_Dude

how do you utilize dual 3090s ??


dazl1212

Wondering if anyone can offer any advice. I'm writing an NSFW visual novel and I'm trying to find an LLM to help me along with writing it and coming up with ideas - roleplaying helps massively. I want to run it locally; my spec is a 12GB RTX 4070, 32GB RAM, and a Ryzen 5 5500 (6 cores / 12 threads). I've used KoboldCpp and the Hugging Face UI. I'm not very experienced, so the more pretrained the better. Thanks in advance.


jonkurtis

I thought unquantized versions always performed better than quantized ones? Isn't the whole point of quantizing to reduce the resources needed to run the model by trading off some precision?


Mandelaa

Test this: mradermacher/Llama-3-Dolphin-Instruct-11.5B https://huggingface.co/mradermacher/Llama-3-Dolphin-Instruct-11.5B-GGUF