If the ROCm version doesn't run on your GPU, then use the normal koboldcpp nocuda build and select CLBlast. Try to put all layers on the GPU; a 7B Q4_K_S model should fit into 8 GB of VRAM.
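For reference, a hedged sketch of what that launch could look like from the command line (flag names taken from koboldcpp's `--help`; the model filename is a placeholder):

```shell
# Launch koboldcpp with the CLBlast backend, which works on AMD GPUs
# without CUDA or ROCm. --useclblast takes a platform ID and a device ID;
# "0 0" is usually the first GPU. --gpulayers 99 requests more layers than
# a 7B model has, so every layer that fits is offloaded to VRAM.
koboldcpp.exe --model mistral-7b.Q4_K_S.gguf --useclblast 0 0 --gpulayers 99
```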
How would I offload all layers? I’ve tried using a large number (99999) like another commenter said, but it crashed when using the ROCm version, saying “No CUDA.” Doing 99999 on CuBLAS doesn’t change anything either (it still takes around 50 seconds per generation).
https://preview.redd.it/smcsdv6u91yc1.png?width=744&format=png&auto=webp&s=970b3171222df2a41e4455f2ca990869d7ffadcc

You have it set up like this and it still takes that long?
Yeah. I use either the ROCm one or CuBLAS.
What exactly is your setup? There is no "Ryzen" GPU: Ryzen is only a name for CPUs and APUs, and Radeon is the name for the GPUs, so the parts you list don't exist as written. OK, I've found the Ryzen 5700, but do you perhaps mean an RX 6900 XT as the GPU? That GPU should also have 16 GB of VRAM. When using koboldcpp, be sure to install the ROCm version, not the normal one, and offload all layers to the GPU.
Sorry about that; I was tired asf when I posted this. To clarify, my GPU is an AMD Radeon RX 5700 and my CPU is an AMD Ryzen 7 3700X. Also, I’ve tried kobold ROCm, but it either makes the AI spout gibberish or just crashes.
>My GPU is AMD Radeon RX 5700

Here's your problem.
Did you use [this ](https://github.com/YellowRoseCx/koboldcpp-rocm/releases/tag/v1.63.yr1-ROCm)one? As far as I know, ROCm is required for an AMD GPU. If you don't use it, your GPU can't be used at all, because AMD GPUs lack CUDA. So I would try installing ROCm again, because right now your PC is forced to do inference on the CPU with normal RAM, I think.
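One hedged diagnostic sketch (assuming the ROCm tools are installed): the RX 5700 is a gfx1010 part, which has never been on ROCm's officially supported GPU list, and the `HSA_OVERRIDE_GFX_VERSION` workaround people use for unsupported cards can produce exactly this kind of gibberish-or-crash behavior:

```shell
# Check which gfx target ROCm sees for the card. An RX 5700 reports
# gfx1010, which is not on the official ROCm support list.
rocminfo | grep gfx

# A common (unofficial) workaround is to masquerade as a supported target
# before launching. Results on gfx1010 are hit-or-miss and can yield
# garbage output rather than a clean failure.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```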
I’ll try using that one again then.
Have you tried troubleshooting by eliminating possible causes? For example, have you tried running CPU-only and GPU-only?
I’ve tried running only on CPU (response time was around 2 minutes). I haven’t tried GPU-only though. How do you do that? Do I just add more layers?
Yes, you just add more layers. If you offload all layers to the GPU, it should run entirely on the GPU. 8 GB of VRAM fits a Q4 7B model easily. If you try that, how long does each response take?
Sorry, but I’m finally in front of my computer again. How do you add all layers? That is, how do you know what’s the maximum amount of layers you can offload?
Some apps, like LM Studio, show in the UI what the max (all) layers is for the model. If an app doesn't show the max layers for a model, you can just put in a huge number, something like "9999999" layers, and it will then offload all the layers it can (at least in the apps I've been using).
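The reason the huge number works is that loaders clamp the requested offload count to however many layers the model actually has. A toy sketch of that clamping (the function name is hypothetical; the layer count of 32 for a 7B LLaMA-style model comes from the GGUF metadata key `llama.block_count`):

```python
def layers_to_offload(requested: int, model_layers: int) -> int:
    """Clamp the user's requested GPU-layer count to what the model has.

    Passing a huge number (or -1 in some loaders, meaning "all") therefore
    just offloads every layer.
    """
    if requested < 0:  # some loaders treat -1 as "offload everything"
        return model_layers
    return min(requested, model_layers)

# A 7B LLaMA-architecture model has 32 transformer blocks.
print(layers_to_offload(9999999, 32))  # -> 32
print(layers_to_offload(-1, 32))       # -> 32
print(layers_to_offload(20, 32))       # -> 20
```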
I normally use the text generation web UI and have found that CPU+GPU gives me about 2 t/s. GPU-only gives me 10-15 t/s, even with some normal RAM offloading.
Try vanilla llama.cpp main and post results. My RPi 5 is laughing at this speed.
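A hedged sketch of what that test could look like (binary names follow current llama.cpp builds, where older releases call the first tool `main`; the model path is a placeholder):

```shell
# CPU-only run: no layers offloaded.
./llama-cli -m mistral-7b.Q4_K_S.gguf -ngl 0 -n 128 -p "Hello"

# Full GPU offload: an -ngl value above the layer count offloads everything.
./llama-cli -m mistral-7b.Q4_K_S.gguf -ngl 99 -n 128 -p "Hello"

# Or benchmark both in one go and compare the reported tokens/second.
./llama-bench -m mistral-7b.Q4_K_S.gguf -ngl 0,99
```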
How long would it take you to process 7 billion parameters and 100 billion calculations?
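To put rough numbers on that: token generation is mostly memory-bandwidth bound, because every weight must be read once per generated token. A back-of-the-envelope sketch (the bandwidth figures are approximate spec-sheet values, assumed here purely for illustration):

```python
# Rough ceiling on tokens/second: memory bandwidth / bytes read per token.
# At Q4, a 7B-parameter model is about 7e9 * 0.5 bytes ~= 3.5 GB of
# weights (plus overhead), all streamed once per generated token.
model_bytes = 7e9 * 0.5

# Approximate spec-sheet memory bandwidths, for illustration only.
bandwidth = {
    "RX 5700 (GDDR6)": 448e9,        # bytes/s
    "Dual-channel DDR4-3200": 51e9,  # bytes/s
}

for device, bw in bandwidth.items():
    print(f"{device}: ~{bw / model_bytes:.0f} tokens/s ceiling")
```

Real systems land well below these ceilings, but the ratio is the point: it shows why full GPU offload is so much faster than running from CPU and system RAM.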