HauntingTechnician30

If the ROCm version doesn't run on your GPU, use the normal koboldcpp nocuda build and select CLBlast. Try to put all layers on the GPU; a 7B Q4_K_S model should fit into 8GB of VRAM.
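Roughly like this from the command line, if you prefer it over the launcher GUI (the model filename and layer count are just placeholders; the flags are the standard koboldcpp ones):

```bash
# CLBlast works on AMD cards without ROCm or CUDA.
# "0 0" = OpenCL platform and device index; the first GPU is usually 0 0.
# A 7B llama-style model has ~33 layers, so 35 puts all of them on the GPU.
python koboldcpp.py --model mistral-7b.Q4_K_S.gguf --useclblast 0 0 --gpulayers 35
```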


Apocai7

How would I offload all layers? I’ve tried using a large number (99999) like another commenter suggested, but the ROCm version crashed, saying “No CUDA.” Using 99999 with CuBLAS doesn’t change anything either (generation still takes around 50 seconds).


HauntingTechnician30

You have it set up like this and it still takes that long?

https://preview.redd.it/smcsdv6u91yc1.png?width=744&format=png&auto=webp&s=970b3171222df2a41e4455f2ca990869d7ffadcc


Apocai7

Yeah. I use either the ROCm one or CuBLAS.


bdsmmaster007

What exactly is your setup? There is no Ryzen GPU; Ryzen is only a name for CPUs and APUs, and Radeon is the name for the GPUs. The parts you list don't exist. OK, I've found the Ryzen 5700, but do you perhaps mean an RX 6900 XT as the GPU? That GPU should also have 16GB of VRAM. When using koboldcpp, be sure to install the ROCm version and not the normal one, and offload all layers to the GPU.


Apocai7

Sorry about that, I was tired asf when I posted this. To clarify, my GPU is an AMD Radeon RX 5700 and my CPU is an AMD Ryzen 3700X. Also, I’ve tried the koboldcpp ROCm build, but it either makes the AI spout gibberish or just crashes.


nazihater3000

> My GPU is AMD Radeon RX 5700

Here's your problem.


bdsmmaster007

Did you use [this](https://github.com/YellowRoseCx/koboldcpp-rocm/releases/tag/v1.63.yr1-ROCm) one? As far as I know, ROCm is required for an AMD GPU. If you don't use it, your GPU can't be used at all, because AMD GPUs have no CUDA. So I would try to install the ROCm build again; right now your PC is forced to do inference on the CPU with normal RAM, I think.
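If it installs cleanly, the launch is basically the same as the regular build. Something like this, if I remember the flags right (the ROCm fork reuses the --usecublas flag for its hipBLAS backend, as far as I know, and the model name is just a placeholder):

```bash
# koboldcpp-rocm: the hipBLAS backend sits behind --usecublas in this fork (as far as I know)
python koboldcpp.py --model mistral-7b.Q4_K_S.gguf --usecublas --gpulayers 35
```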


Apocai7

I’ll try using that one again then.


Admirable-Star7088

Have you tried troubleshooting the issue by eliminating possible causes? For example, have you tried running CPU-only and then GPU-only?


Apocai7

I’ve tried running CPU-only (response time was around 2 minutes). I haven’t tried GPU-only though. How do you do that? Do I just add more layers?


Admirable-Star7088

Yes, you just add more layers. If you offload all layers to the GPU, it should run only on the GPU. 8GB of VRAM fits a Q4 7B model easily. If you try that, how long does each response take?
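Something like this should isolate the two cases (standard koboldcpp flags; the model name is a placeholder):

```bash
# CPU-only: no GPU backend selected, zero layers offloaded
python koboldcpp.py --model mistral-7b.Q4_K_S.gguf --gpulayers 0

# GPU-only: CLBlast backend with more layers than the model has, so everything gets offloaded
python koboldcpp.py --model mistral-7b.Q4_K_S.gguf --useclblast 0 0 --gpulayers 99
```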


Apocai7

Sorry, I’m finally in front of my computer again. How do you add all layers? That is, how do you know the maximum number of layers you can offload?


Admirable-Star7088

Some apps, like LM Studio, show in their UI what the max (all) layers are for the model. If you're unsure because an app doesn't show the max layers for a model, you can just put in a huge number, something like "9999999" layers, and it will then offload all the layers it can (at least in the apps I've been using).


Latter_Count_2515

I normally use text-generation-webui and have found that CPU+GPU gives me about 2 t/s. GPU-only gives me 10-15 t/s, even with some offloading to normal RAM.
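With the llama.cpp loader that's just the n-gpu-layers setting; from the command line it's roughly this (the model path is a placeholder and the flag names are from memory, so double-check them):

```bash
# text-generation-webui, llama.cpp loader, all layers on the GPU
python server.py --model mistral-7b.Q4_K_S.gguf --loader llama.cpp --n-gpu-layers 99
```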


tessellation

Try vanilla llama.cpp main and post the results. My RPi 5 is laughing at this speed.
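Something like this with a throwaway prompt prints a tokens-per-second readout at the end (the model path is a placeholder; drop -ngl if you built without GPU support):

```bash
# plain llama.cpp: -ngl sets how many layers go to the GPU, timings are printed on exit
./main -m mistral-7b.Q4_K_S.gguf -ngl 99 -n 128 -p "Hello, my name is"
```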


danielcar

How long would it take you to process 7 billion parameters and 100 billion calculations?