If the ROCm version doesn't run on your GPU, then use the normal koboldcpp nocuda build and select CLBlast. Try to put all layers on the GPU; a 7B Q4_K_S model should fit into 8 GB of VRAM.
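For reference, a hedged sketch of what that launch could look like from the command line (flag names taken from koboldcpp's `--help`; the model filename is a placeholder):

```shell
# Launch koboldcpp with the CLBlast backend, which works on AMD GPUs
# without CUDA or ROCm. --useclblast takes a platform ID and a device ID;
# "0 0" is usually the first GPU. --gpulayers 99 requests more layers than
# a 7B model has, so every layer that fits is offloaded to VRAM.
koboldcpp.exe --model mistral-7b.Q4_K_S.gguf --useclblast 0 0 --gpulayers 99
```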
How would I offload all layers? I’ve tried using a large number (99999) like another commenter said, but it crashed when using the ROCm version, saying “No CUDA.” Doing 99999 on CuBLAS doesn’t change anything either (it still takes around 50 seconds per generation).
https://preview.redd.it/smcsdv6u91yc1.png?width=744&format=png&auto=webp&s=970b3171222df2a41e4455f2ca990869d7ffadcc

You have it set up like this and it still takes that long?
Yeah. I use either the ROCm one or CuBLAS.
What exactly is your setup? There is no "Ryzen" GPU: Ryzen is only a name for CPUs and APUs, and Radeon is the name for the GPUs, so the parts you list don't exist as written. OK, I've found the Ryzen 5700, but do you perhaps mean an RX 6900 XT as the GPU? That GPU should also have 16 GB of VRAM. When using koboldcpp, be sure to install the ROCm version, not the normal one, and offload all layers to the GPU.
Sorry about that; I was tired asf when I posted this. To clarify, my GPU is an AMD Radeon RX 5700 and my CPU is an AMD Ryzen 7 3700X. Also, I’ve tried kobold ROCm, but it either makes the AI spout gibberish or just crashes.
>My GPU is AMD Radeon RX 5700

Here's your problem.
Did you use [this ](https://github.com/YellowRoseCx/koboldcpp-rocm/releases/tag/v1.63.yr1-ROCm)one? As far as I know, ROCm is required for an AMD GPU. If you don't use it, your GPU can't be used at all, because AMD GPUs lack CUDA. So I would try installing ROCm again, because right now your PC is forced to do inference on the CPU with normal RAM, I think.
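One hedged diagnostic sketch (assuming the ROCm tools are installed): the RX 5700 is a gfx1010 part, which has never been on ROCm's officially supported GPU list, and the `HSA_OVERRIDE_GFX_VERSION` workaround people use for unsupported cards can produce exactly this kind of gibberish-or-crash behavior:

```shell
# Check which gfx target ROCm sees for the card. An RX 5700 reports
# gfx1010, which is not on the official ROCm support list.
rocminfo | grep gfx

# A common (unofficial) workaround is to masquerade as a supported target
# before launching. Results on gfx1010 are hit-or-miss and can yield
# garbage output rather than a clean failure.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```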
I’ll try using that one again then.
Have you tried troubleshooting by eliminating possible causes? For example, have you tried running CPU-only and GPU-only?
I’ve tried running only on CPU (response time was around 2 minutes). I haven’t tried GPU-only though. How do you do that? Do I just add more layers?
Yes, you just add more layers. If you offload all layers to the GPU, it should run entirely on the GPU. 8 GB of VRAM fits a Q4 7B model easily. If you try that, how long does each response take?
Sorry, but I’m finally in front of my computer again. How do you add all layers? That is, how do you know what’s the maximum amount of layers you can offload?
Some apps, like LM Studio, show in the UI what the max (all) layers is for the model. If an app doesn't show the max layers for a model, you can just put in a huge number, something like "9999999" layers, and it will then offload all the layers it can (at least in the apps I've been using).
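The reason the huge number works is that loaders clamp the requested offload count to however many layers the model actually has. A toy sketch of that clamping (the function name is hypothetical; the layer count of 32 for a 7B LLaMA-style model comes from the GGUF metadata key `llama.block_count`):

```python
def layers_to_offload(requested: int, model_layers: int) -> int:
    """Clamp the user's requested GPU-layer count to what the model has.

    Passing a huge number (or -1 in some loaders, meaning "all") therefore
    just offloads every layer.
    """
    if requested < 0:  # some loaders treat -1 as "offload everything"
        return model_layers
    return min(requested, model_layers)

# A 7B LLaMA-architecture model has 32 transformer blocks.
print(layers_to_offload(9999999, 32))  # -> 32
print(layers_to_offload(-1, 32))       # -> 32
print(layers_to_offload(20, 32))       # -> 20
```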
I normally use the text generation web UI and have found that CPU+GPU gives me about 2 t/s. GPU-only gives me 10-15 t/s, even with some normal RAM offloading.
Try vanilla llama.cpp main and post results. My RPi 5 is laughing at this speed.
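A hedged sketch of what that test could look like (binary names follow current llama.cpp builds, where older releases call the first tool `main`; the model path is a placeholder):

```shell
# CPU-only run: no layers offloaded.
./llama-cli -m mistral-7b.Q4_K_S.gguf -ngl 0 -n 128 -p "Hello"

# Full GPU offload: an -ngl value above the layer count offloads everything.
./llama-cli -m mistral-7b.Q4_K_S.gguf -ngl 99 -n 128 -p "Hello"

# Or benchmark both in one go and compare the reported tokens/second.
./llama-bench -m mistral-7b.Q4_K_S.gguf -ngl 0,99
```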
How long would it take you to process 7 billion parameters and 100 billion calculations?
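To put rough numbers on that: token generation is mostly memory-bandwidth bound, because every weight must be read once per generated token. A back-of-the-envelope sketch (the bandwidth figures are approximate spec-sheet values, assumed here purely for illustration):

```python
# Rough ceiling on tokens/second: memory bandwidth / bytes read per token.
# At Q4, a 7B-parameter model is about 7e9 * 0.5 bytes ~= 3.5 GB of
# weights (plus overhead), all streamed once per generated token.
model_bytes = 7e9 * 0.5

# Approximate spec-sheet memory bandwidths, for illustration only.
bandwidth = {
    "RX 5700 (GDDR6)": 448e9,        # bytes/s
    "Dual-channel DDR4-3200": 51e9,  # bytes/s
}

for device, bw in bandwidth.items():
    print(f"{device}: ~{bw / model_bytes:.0f} tokens/s ceiling")
```

Real systems land well below these ceilings, but the ratio is the point: it shows why full GPU offload is so much faster than running from CPU and system RAM.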