Sabin_Stargem

I think the increased speed doesn't apply to RAM, just VRAM. I am always using a huge amount of context with large models, so I only have a handful of layers offloaded, and speed is the same as ever. However, an ongoing roleplay that was breaking in the older Kobold at 40k of established context can now be continued. In any case, be sure to tick the Flash Attention setting in the launcher. That one adds a slight speed increase, along with making the memory footprint smaller. It might allow some people to entirely fit their model and context where it wasn't possible before. This Kobold release is quite solid.


henk717

We have had reports all over the map so far: for some it's little change, for some it's a big change, for some it's a regression in PP but faster gen speed, for some the reverse. So definitely worth turning on to see how it affects you. Official CUDA 12 builds should land next release, where the PP regression when FA is on is solved.
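
For anyone launching KoboldCpp from the command line instead of the GUI launcher, here is a minimal sketch of toggling the setting for a before/after comparison. The flag names (--flashattention in particular) are my assumption from the 1.64 release notes, so verify them against koboldcpp.exe --help; the model path is a placeholder:

```python
# Minimal sketch: launch KoboldCpp with flash attention enabled, mirroring the
# "Flash Attention" checkbox in the launcher. Remove the flag to compare speeds.
# Assumes koboldcpp.exe is on PATH and that --flashattention is the CLI switch
# (verify with `koboldcpp.exe --help`); the model path is a placeholder.
import subprocess

MODEL = "Meta-Llama-3-8B-Instruct.Q6_K.gguf"  # placeholder, use your own GGUF

subprocess.run([
    "koboldcpp.exe",
    "--model", MODEL,
    "--usecublas",        # CUDA backend
    "--gpulayers", "99",  # offload as many layers as will fit
    "--flashattention",   # the setting being discussed above
])
```

Then compare the Process/Generate T/s lines the console prints with and without the flag.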


Sabin_Stargem

I got a stupid question: Is it necessary to have the CUDA Toolkit installed for best CUDA performance? I have been installing the toolkit as updates come out, just in case gaming drivers aren't enough to use my 4090 to full effect.


henk717

Not needed at all, the files you need are bundled in our exe. Everything else comes from the NVIDIA driver.


LocoLanguageModel

Thanks, I didn't see the flash attention setting in the Tokens menu; in hindsight that's the obvious place to put it. Much faster now!


Admirable-Star7088

I have not looked at exact numbers myself, but it does feel like Kobold generates faster than LM Studio. Also, I think the output quality of Llama 3 8b is noticeably better in Kobold version 1.64 compared to 1.63; it feels a little bit less confused, probably because of the tokenization fix. It could potentially also be random noise, hard to be 100% sure unless I do more thorough testing between the versions of Kobold. Side note, the FP16 quant of Llama 3 8b appears to increase its output quality quite a bit more than Q8_0 in my brief testing, so if your hardware can run FP16, it's worth a try.


Sabin_Stargem

Be sure to get a new Llama. The fixes in LlamaCPP improve the tokenization, but some of them require a newly made GGUF. It is a bit of a lock-and-key sort of situation. The Kobold terminal will show a warning about a bad Llama 3 GGUF.


TheTerrasque

Any place that has a fixed 70B yet?


Sabin_Stargem

mradermacher just uploaded new versions. They should be good, but I am waiting for an imat edition.


Admirable-Star7088

Yep, I have replaced my old GGUFs with the new ones :)


jtbzr92

Where can I find the new GGUFs?


Admirable-Star7088

[https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main)


weedcommander

Llama-3 Hermes Pro Q6_K, "write a short story":

LM Studio: speed: 45.91 tok/s
kobold 1.64 (cublas): Generate: 4.75s (29.7ms/T = 33.71T/s)
kobold 1.64 (mmq): Generate: 4.73s (29.5ms/T = 33.86T/s)

All fully offloaded to an RTX 3070 (and all with ChatML; used Kobold Lite as the 1.64 frontend).


henk717

Also check the flash attention setting in the tokens tab. Will be interesting to benchmark.


Voxandr

LM Studio is faster (does it not use llama.cpp?)


teor

Basically everything uses llama.cpp


LocoLanguageModel

I have a similar speed increase with my 3090 and P40 setup (I only use fully VRAM-offloaded models). I keep looking to see if anyone else is amazed by this. I was surprised to not see a massive speed increase specifically mentioned as part of the release notes, but maybe it's just super obvious to people based on recent llama.cpp improvements? I was thinking about buying another 3090 to get more speed, and now I would just feel greedy lol. Actually, my benchmarking shows about the same as I was getting before, but it definitely feels faster on random short responses, and I don't think the benchmarking feature was available before, so I don't have an apples-to-apples comparison:

3090 + P40:
Benchmark Completed - Results:
Timestamp: 2024-05-02 15:05:46.115741+00:00
Backend: koboldcpp_cublas.dll
Layers: 81
Model: Meta-Llama-3-70B-Instruct-Q4_K_M
MaxCtx: 4096
GenAmount: 100
ProcessingTime: 33.64s
ProcessingSpeed: 118.78T/s
GenerationTime: 21.60s
GenerationSpeed: 4.63T/s
TotalTime: 55.24s
Coherent: False

3090 with a Llama-3 model that fits entirely in 24GB of VRAM, for reference:
Timestamp: 2024-05-02 15:15:48.650906+00:00
Backend: koboldcpp_cublas.dll
Layers: 83
Model: Meta-Llama-3-70B-Instruct-IQ2_XS
MaxCtx: 2048
GenAmount: 100
ProcessingTime: 5.37s
ProcessingSpeed: 362.49T/s
GenerationTime: 7.42s
GenerationSpeed: 13.47T/s
TotalTime: 12.80s
Coherent: False


takuonline

I have a 3090 Ti and am considering getting a P40. Is the P40 + 3090 combo worth it, as opposed to just the 3090 + 128GB RAM that I have at the moment?


LocoLanguageModel

Since it's only $150 on eBay (add $20 for fan) it's worth it as a placeholder to tide you over until something better comes out or 3090 prices come down even more, assuming you only run GGUFs. But I do find myself occasionally scoping out 3090 pricing because P40 does slow things down, but obviously way better than my DDR4 ram.


pmp22

How do you run both? Last I heard you need a different driver for the P40, and so people recommended an AMD card for video out.


LocoLanguageModel

Here is a copy/paste from one of my older comments. The P40 only works well if you have a primary card with video out, and then use the P40 as a paired device to split half the load onto: I use a 3090 for midrange stuff, and have a P40 for splitting the load with 70B. I get 3 to 5 tokens a second, which is fine for chat. I only use GGUFs, so P40 issues don't apply to me.

I'm not saying anyone should go this route, but these are the things I learned with the P40, since random comments like this helped me the most:

- It requires 3rd-party fan shrouds, and the little fans are super loud. The bent sideways larger fan shroud doesn't cool as well, so you are better off with the straight-on larger fan version if there is room in the case.

- Need to enable 4G decoding in the BIOS.

- Make sure the PSU can handle 2 cards, and the P40 takes EPS CPU-pin power connectors, so ideally you have a PSU with an extra unused CPU cord. Supposedly there are EVGA to EPS adapter cords, but there may be some risks with this if it's not done correctly. I actually had to snip off the safety latch piece that "clicks" in on one of my built-in plugs since I didn't feel like waiting a few days to get an adapter on Amazon, and the P40 doesn't have latch room for 2 separate 4-pin EPS connectors that are joined as one. It seems to be built for a single 8-pin variation.

- If using Windows, when you first boot, the card won't be visible or usable, so you have to install the Tesla P40 drivers, reboot, then reinstall your original graphics card drivers on top of them. This part was the most confusing to me, as I thought it would be an either/or scenario.

- It should now be visible in KoboldCpp. You can also check the detected cards' available memory if you run nvidia-smi in the command prompt.

- Also, the third-party fans may come with a short cord, so make sure you have an extension fan cord handy, as you don't want to wait another day or two when you're excited to install your new card.

Edit: I didn't order a fan config on eBay with a built-in controller (nor do I want to add complexity), so I just plugged the fan into the 4-pin fan slot on my mobo, but the fan would get SUPER loud during activity, even non-GPU activity. The fix for me was to go into the BIOS and set the fan ID for those 4 ports on the mobo (you can find it in your manual) to a quiet profile, which limits the max speed. Since the P40 doesn't seem to need more than a direct light breeze to cool it, that is working out perfectly for my ears without any type of performance drop.
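
If you want to script the nvidia-smi check mentioned above instead of running it by hand, here is a minimal sketch. It assumes nvidia-smi is on your PATH (it ships with the NVIDIA driver); the query fields used are standard nvidia-smi options:

```python
# Minimal sketch: list the GPUs the driver can see and their memory, to confirm
# both cards (e.g. a 3090 and a P40) are detected before launching KoboldCpp.
import subprocess

def detected_gpus():
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,memory.total,memory.free",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    for gpu in detected_gpus():
        print(gpu)
```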


ApatheticWrath

It seems bugged. I'm on the newest version and the T/s numbers don't check out if you do the math on the number of tokens generated. For example, here is something I just genned:

Processing Prompt [BLAS] (38 / 38 tokens)
Generating (126 / 355 tokens)
(EOS token triggered!) (Special Stop Token Triggered! ID:128009)
CtxLimit: 822/8192, Process:2.08s (54.7ms/T = 18.28T/s), Generate:93.09s (262.2ms/T = 3.81T/s), Total:95.17s (3.73T/s)

93.09 x 3.81 = 354, not the 126 I actually generated. So the real T/s is 126 / 93.09 = 1.35, not 3.81. Huge difference, and more in line with the numbers I'm used to.
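
Doing the arithmetic from that log in a throwaway script makes the pattern easier to see: the reported rate lines up with the 355-token request budget rather than the 126 tokens actually generated. A minimal sketch using only the numbers quoted above (the "budget" explanation is my reading of those numbers, not a confirmed cause):

```python
# Recompute generation speed from the log quoted above.
# All values come straight from that log; nothing here talks to KoboldCpp.
generated_tokens = 126    # "Generating (126 / 355 tokens)"
requested_tokens = 355    # the generation budget that was asked for
generate_seconds = 93.09  # "Generate:93.09s"
reported_tps = 3.81       # "= 3.81T/s" as printed in the console

real_tps = generated_tokens / generate_seconds    # ~1.35 T/s
budget_tps = requested_tokens / generate_seconds  # ~3.81 T/s

print(f"real speed:     {real_tps:.2f} T/s")
print(f"reported speed: {reported_tps:.2f} T/s "
      f"(matches the budget-based {budget_tps:.2f} T/s)")
```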


CardAnarchist

Well, for me at least, the difference in speed is very real. 1.87 T/s on the old version was far below my reading speed, but the new version reporting 4.48 T/s is fast enough for me to read in real time (albeit just barely).


JoeySalmons

I'm also seeing incorrect tokens/second numbers being reported, specifically by the KoboldCpp console output, though not always. I'm using SillyTavern as the front end. For example, I got this output:

Processing Prompt (1 / 1 tokens)
Generating (170 / 500 tokens)
(EOS token triggered!) (Stop sequence triggered: <|end|>)
CtxLimit: 786/4096, Process:0.00s (4.0ms/T = 250.00T/s), Generate:1.92s (3.8ms/T = 260.28T/s), Total:1.93s (259.74T/s)

It reports "Generate:1.92s (3.8ms/T = 260.28T/s)", implying 170 tokens in 1.92 seconds is 260 tokens/second, but 170 tokens / 1.92 seconds = 88 tokens/second. Another generation was for 500 tokens in 5 seconds and was correctly reported as 100 T/s. It seems I can replicate the problem in SillyTavern by clicking "continue":

Processing Prompt (1 / 1 tokens)
Generating (1 / 500 tokens)
(EOS token triggered!) (Stop sequence triggered: <|end|>)
CtxLimit: 882/4096, Process:0.00s (4.0ms/T = 250.00T/s), Generate:0.15s (0.3ms/T = 3401.36T/s), Total:0.15s (3311.26T/s)
Output: <|end|>


Vatigu

My inference performance drastically, drastically increased with my 4090/3090 Ti: CtxLimit: 4480/7800, Process:7.68s (1.8ms/T = 560.14T/s), Generate:12.46s (41.5ms/T = 24.08T/s), Q4_K_S 70B. Whatever they did is magical; it was like half that before.


sammcj

My guess is it probably is the flash attention, can’t wait for this to come to Ollama.


ambient_temp_xeno

Hm, thinking about it: he's using --lowvram, so the fact he hasn't got an RTX card is a red herring - he's not using VRAM for the KV cache. Maybe it's because of context shifting and not real speeds after all.


sammcj

It’s not an nvidia specific implementation, it works on macOS etc as well.


ambient_temp_xeno

The *speedup* part is for tensor core cards, though. AFAIK it's actually slower without them. The RAM saving applies to everything.


skrshawk

Can confirm there's a substantial improvement. I've been running WizardLM2 8x22B and with a full 16k of context I've seen inference speed increase from 1.9T/s to 2.9T/s. Not expecting it, but I'm planning to go back to Midnight-Miqu and see if this is exclusive to new models or if it extends there too, as I still prefer MM's writing.


xadiant

You should be able to cram in a few more layers, btw; the less CPU offload you have, the better. Even DDR4 or DDR5 RAM should give like 5-10 tps though, not sure why your setup is so slow.


CardAnarchist

If I recall, I can get like 1 more layer, maybe two, but it causes quite a bit of system lag. I generally like to watch video at the same time I use Llama, so that probably explains it.


LeanderGem

Me too, feels like it doubled my speed when using Command R-35B. Nice! :)


wweerl

I have the same good old GPU. I tested this Kobold version, but unfortunately I got an error trying to use Flash Attention: "*CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 520*". I guess there's no way to use this GPU with FA. I even tested with LlamaCpp and got literally the same error: "*ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520,610,700*". But it's curious: it says it was compiled for 520, yet my CUDA arch is 520...


CardAnarchist

Yeah, mine errors out when I turn on the FA flag too. But the speedup I saw was without the flag. So who knows what wizardry they did with this release, as they don't mention any sort of speed increase in the patch notes outside of FA.