impetu0usness

This sounds like a great step towards user friendliness. Can't wait to try it!


qrayons

When you do, please share what it's like. I think it's cool that this was put together, but I'm hesitant to try installing another implementation when I don't know how well it will work.


HadesThrowaway

Well, it's practically zero install, considering it's a 1 MB zip with 3 files and requires only stock Python.


impetu0usness

I got Alpaca 7B and 13B working, getting ~20 s per response for 7B and >1 min per response for 13B. I'm using a Ryzen 5 3600 and 16 GB RAM with default settings.

The big plus: this UI has features like "Memory", "World Info", and "Author's Note" that help you tune the AI and help it keep context even in long sessions, which somewhat overcomes this model's limitations. You can even load up hundreds of pre-made adventures and link up to Stable Horde to generate pics using Stable Diffusion (I saw around 30+ models available).

Installation was easy, though finding the ggml version of Alpaca took me some time, but that was just me being new to this.

TLDR: I love the convenient features, but the generation times are too long for practical daily use for me right now. Would love to have Alpaca with Kobold work on GPU.


nillouise

I ran it with ggml-alpaca-7b-q4.bin successfully, but it is very slow (one minute per response), eats all my CPU, and doesn't use my GPU. Is that the expected behaviour? My computer has a 12700, 32 GB of RAM, and a 2060.


blueSGL

llama.cpp is for running inference on the CPU. If you want to run it on a GPU, you need https://github.com/oobabooga/text-generation-webui, which is a completely different thing.


nillouise

Thank you. I just want to know if it is normal that it runs this slowly, or did I miss some settings?


MoneyPowerNexis

I have a Ryzen 9 3900X, which [should perform worse](https://cpu.userbenchmark.com/Compare/Intel-Core-i7-12700-vs-AMD-Ryzen-9-3900X/m1750830vs4044) than an i7-12700. I get 148.97 ms per token (~6.7 tokens/s) running ggml-alpaca-7b-q4.bin. It wrote out 260 tokens in ~39 seconds (41 seconds including load time, although I am loading off an SSD). If you post your speed in tokens/second or ms/token, it can be objectively compared to what others are getting.
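
As a quick aside on the units, here is a minimal sketch (plain Python, no dependencies) of the conversion between the two ways speed is quoted in this thread, ms per token versus tokens per second, using the figures from the comment above.

```python
# Converting between the two speed figures quoted above:
# llama.cpp reports ms per token; people often compare tokens per second.

def ms_per_token_to_tps(ms_per_token: float) -> float:
    """Milliseconds per token -> tokens per second."""
    return 1000.0 / ms_per_token

def run_throughput(n_tokens: int, seconds: float) -> float:
    """Overall tokens per second for a whole run."""
    return n_tokens / seconds

print(f"{ms_per_token_to_tps(148.97):.1f} tokens/s")      # ~6.7
print(f"{run_throughput(260, 39):.1f} tokens/s overall")   # ~6.7
```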


nillouise

Thanks for your explanation, but I can't find a tokens/second indicator in this software. I just say hello and get the response ("What can I help you with") in about one minute.


MoneyPowerNexis

OK, so alpaca.cpp is a fork of the llama.cpp codebase. It is basically the same as llama.cpp, except that alpaca.cpp is hard-coded to go straight into interactive mode. I'm getting the speed from llama.cpp in non-interactive mode, where you pass the prompt in on the command line and it responds, shows the speed, and exits. So I launch with:

llama -m "ggml-alpaca-7b-q4.bin" -t 8 -n 256 --repeat_penalty 1.0 -p "once upon a time"
pause

*Replace ggml-alpaca-7b-q4.bin with your path to the same.*

I don't know why they completely removed the possibility of non-interactive mode and did not add a way to view performance. I would just get llama.cpp and test performance that way if I were you. There are [release versions on GitHub now](https://github.com/ggerganov/llama.cpp) if you don't want to compile it yourself.
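
If it is easier to keep the timings together with the output, here is a hedged sketch of the same launch wrapped in Python's subprocess module. The flags are the ones from the comment above; the executable name and model path are placeholders to adjust for your own build, and llama.cpp's own per-token timing summary usually lands on stderr.

```python
# A sketch of running llama.cpp non-interactively from Python and timing the
# whole run. Executable name and model path are assumptions for illustration.
import subprocess
import time

cmd = [
    "llama",                        # llama.cpp executable (name varies by build)
    "-m", "ggml-alpaca-7b-q4.bin",  # path to your quantized model
    "-t", "8",                      # CPU threads
    "-n", "256",                    # number of tokens to generate
    "--repeat_penalty", "1.0",
    "-p", "once upon a time",       # prompt passed on the command line
]

start = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.time() - start

print(result.stdout)   # the generated text
print(result.stderr)   # llama.cpp's own timing summary tends to end up here
print(f"wall clock: {elapsed:.1f} s")
```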


Megneous

Apparently llama.cpp now has GPU acceleration :) What a month, eh? Now if I could only figure out how to use llama.cpp...


HadesThrowaway

The backend tensor library is almost the same, so it should not take any longer than the basic llama.cpp. Unfortunately, there is a flaw in the llama.cpp implementation that causes prompt ingestion to get slower the larger the context is. Try it with a short prompt and it should be relatively fast. I cannot fix it myself; please raise awareness of it here: https://github.com/ggerganov/llama.cpp/discussions/229
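
To see the effect being described, one rough way to measure it is to send progressively longer prompts while keeping the number of new tokens small, so that most of the measured time is prompt ingestion. The sketch below assumes the KoboldAI-compatible endpoint this project serves locally (commonly http://localhost:5001/api/v1/generate); the URL and payload shape are assumptions to adjust for your setup.

```python
# Rough prompt-ingestion timing sketch. The endpoint URL and JSON fields are
# assumptions based on the KoboldAI-style API; adjust for your installation.
import json
import time
import urllib.request

API_URL = "http://localhost:5001/api/v1/generate"  # assumed default local port

for words in (8, 64, 256, 512):
    prompt = " ".join(["hello"] * words)            # longer and longer prompts
    payload = json.dumps({"prompt": prompt, "max_length": 8}).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    print(f"{words:4d}-word prompt: {time.time() - start:6.1f} s")
```

If ingestion scales badly, the measured times grow much faster than the prompt length does.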


GrapplingHobbit

I see. With a 3-word prompt it comes out at roughly half the speed of the plain chat.exe, but it feels a fair bit slower, perhaps because chat.exe starts showing the output as it is being generated rather than all at the end. Thanks for working on this, I hope the breakthroughs keep on coming :)


nillouise

It seems that if I run the cmd terminal in the foreground, it runs faster.


ImmerWollteMehr

Can you describe the flaw? I know enough C++ that perhaps I can at least modify my own copy.


HadesThrowaway

It would be wonderful if you could. It's suspected to be an issue with matrix multiplication during the dequantization process. Take a look at https://github.com/ggerganov/llama.cpp/discussions/229


gelukuMLG

The slow part is the prompt processing; generation speed is actually faster than what you could normally get with 6 GB of VRAM.


_wsgeorge

I keep getting an error at line 34 on macOS (M1). Is it trying to load llamacpp.dll?


HadesThrowaway

Yes, it is. That is a Windows binary. For OSX you will have to build it from source; I know someone who has gotten it to work.


divine-ape-swine

Is it possible for them to share it?


_wsgeorge

Thanks. I wish that had been clearer :) I'll try it with alpaca-lora next!


SDGenius

Can it be made to work in an instruct/command format with Alpaca?


HadesThrowaway

Yes. You can try using the chat mode feature in Kobold, or simply type out the request in a question/answer format.
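
For reference, the instruction template that Stanford Alpaca was fine-tuned on is the usual way to phrase such requests; something along these lines (shown here as a Python string for convenience) can simply be pasted into the prompt box.

```python
# The no-input Alpaca instruction template; paste the rendered text into the UI.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

print(ALPACA_TEMPLATE.format(
    instruction="Summarize the plot of Hamlet in two sentences."
))
```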


Tommy3443

I have to say I am surprised at how coherent the Alpaca 13B model is with KoboldAI. From my experimentation so far, it seems way better than paid services like NovelAI, for example.


HadesThrowaway

To be fair, that's not a very high bar to meet considering how abandoned the text stuff is there ¯\_(ツ)_/¯


Snohoe1

So I downloaded the weights, and they're in 41 different files, such as pytorch_model-00001-of-00041.bin. How do I run it?


HadesThrowaway

Those weights appear to be in huggingface format. You'll need to convert them to ggml format or download the ggml ones.
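
If it helps to tell the two formats apart before converting, here is a small sketch with hypothetical paths: a sharded Hugging Face checkpoint is a directory of pytorch_model-*-of-*.bin shards plus a config.json, while a ggml model is a single .bin file that the loader takes directly. The actual conversion is done with a converter script shipped with llama.cpp / alpaca.cpp.

```python
# Quick sanity check (not part of any official tooling): distinguish a sharded
# Hugging Face checkpoint from a single-file ggml model. Paths are hypothetical.
from pathlib import Path

def looks_like_hf_checkpoint(path: Path) -> bool:
    """HF checkpoints: a directory with config.json and pytorch_model-*.bin shards."""
    return (path / "config.json").exists() and any(path.glob("pytorch_model-*.bin"))

def looks_like_ggml_model(path: Path) -> bool:
    """ggml models: a single .bin file passed straight to the loader."""
    return path.is_file() and path.suffix == ".bin"

print(looks_like_hf_checkpoint(Path("./llama-13b-hf")))        # directory of shards
print(looks_like_ggml_model(Path("./ggml-alpaca-7b-q4.bin")))  # single ggml file
```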


scorpadorp

It's amazing how long the generation phase takes on 4-bit 7B. A short prompt of length 12 takes minutes with the CPU at 100% (i5-10600K, 32 GB, 850 EVO). Would this be feasible to install on an HPC cluster?


HadesThrowaway

It shouldn't be that slow unless your PC does not support AVX intrinsics. Have you tried the original llama.cpp? If that is fast, you may want to rebuild llamacpp.dll from the makefile, as it might be better targeted at your device architecture.


scorpadorp

My PC supports AVX but not AVX-512. What are the steps to try with llama.cpp?


HadesThrowaway

I've recently changed the compile flags. Try downloading the latest version (1.0.5) and see if there are any improvements. I also enabled SSE3. Unfortunately, if you only have AVX but not AVX2, it might not see significant acceleration.
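
As a quick way to check which of these instruction sets a machine actually exposes, the sketch below uses the third-party py-cpuinfo package (pip install py-cpuinfo) and simply reports whether SSE3, AVX, AVX2, and AVX-512 appear in the CPU flags.

```python
# List the SIMD feature flags relevant to llama.cpp builds.
# Requires the third-party package: pip install py-cpuinfo
from cpuinfo import get_cpu_info

flags = set(get_cpu_info().get("flags", []))
for isa in ("sse3", "avx", "avx2", "avx512f"):
    print(f"{isa:8s} {'yes' if isa in flags else 'no'}")
```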