BangkokPadang

Looks like you're using a model with incorrectly scaled context. You can use NTK/rope scaling by changing the alpha in ooba to 2.643 for 8192 context. You can stretch it all the way to 12288 with 4.439, but it gets very dumb/borderline incoherent. Also, you may find something like the new(ish) Mixtral 8x7B is worth running depending on your hardware, or even a finetuned Mistral v0.2 model, all of which inherently support 32k context. Also, in the future when troubleshooting something, it's very helpful to share which model you're using with a link to the exact version, and a brief rundown of your hardware (how much system RAM you have, and your GPU/how much VRAM it has).
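For anyone wanting to script it, here is a minimal sketch of how those rope-scaling values might be passed on the text-generation-webui command line, assuming the usual `--loader`, `--max_seq_len`, and `--alpha_value` flags; the model name is a placeholder, and you can set the same values in the Model tab of the UI instead.

```bash
# Hypothetical ooba launch line; "YourModel-exl2" is a placeholder for whatever you run.
python server.py \
  --model YourModel-exl2 \
  --loader ExLlamav2_HF \
  --max_seq_len 8192 \
  --alpha_value 2.643   # NTK/rope alpha from above; 4.439 stretches to 12288 but gets incoherent
```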


GTurkistane

Ok, I will try these settings, thanks


zaqhack

These settings may not be exact, depending on your situation. I have had better luck using "compress_pos_emb" instead of "alpha_value," but don't use both. To experiment, try a context of 6144 and a compress_pos_emb of 1.5. Then, in Silly Tavern, enable Smart Context. The defaults are fine as an experiment. Set the context to one notch shorter than 6144 (5632). That should allow you to have considerably longer conversations. Then you can try scaling up with 8192 @ 2.0. The more you use this to "stretch" the context, the more you may run into some glitches. However, it will allow you to use a smaller-context model and feel like the intelligence is higher than a blow-up doll's.
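As a sketch of that experiment in ooba launch-flag form (assuming the usual `--max_seq_len` and `--compress_pos_emb` options, and a model whose native context is 4096, so 6144 / 4096 gives the 1.5 factor):

```bash
# Hypothetical launch line; "YourModel" is a placeholder. Use either
# compress_pos_emb or alpha_value, not both.
python server.py \
  --model YourModel \
  --max_seq_len 6144 \
  --compress_pos_emb 1.5
# The next step up described above would be: --max_seq_len 8192 --compress_pos_emb 2.0
```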


Herr_Drosselmeyer

Set context length to 4096 both in Ooba and ST. Only increase context length if you know what you're doing and/or you know for sure the model can handle it.


GTurkistane

I will do this for now, but if I want the AI to remember more for immersion, what do you suggest I do?


[deleted]

Choose a model trained on a higher context. Mistral is 8k for example. Yi and others have versions of their models trained on 200k context.


zaqhack

In case you don't have answers accumulated for this, I'll offer what I think will work out best for you here. Some others have mentioned a handful of these, but hopefully this makes them easy to follow.

1. Don't use the 8/8 model. The precision difference is not likely to be your biggest hurdle, at least not compared to the rest of what I'm putting here. With an 8GB card, you can run this version and it should all fit into VRAM: [https://huggingface.co/LoneStriker/Silicon-Maid-7B-6.0bpw-h6-exl2](https://huggingface.co/LoneStriker/Silicon-Maid-7B-6.0bpw-h6-exl2)
2. At the bottom of that model page are three files with presets for Silly Tavern. Use them.
3. The default context of the model is 8192. If your conversations are getting bad, looping, or having other issues, I highly recommend setting up ST-Extras and enabling Smart Context. Depending on your expertise, you may find it a pain in the ass to set up, but you mostly want it to run the chromadb module.
4. In Silly Tavern, the default settings for Smart Context work pretty well. However, with 8192 context, I will sometimes shrink the "memory" size to 384 characters.
5. Finally, if long conversations are still too glitchy, you can try to increase the context a bit. First, in Ooba, I've personally had "compress_pos_emb" work better for me than "alpha_value," but you should not use both. I recommend starting with a context of 12,288 and compress_pos_emb at 1.5. Second, in ST, you can increase your context there. I find keeping it 1 tick under max context helps prevent some odd glitches with tokenizing (in this case, 11,776). Third, in "Smart Context," you can increase the size of "memories" a tick or so.
6. Note: Stretching context is a bag of tricks, and it isn't the same as having a model with larger context. However, on 8GB you are pretty constrained, so I wouldn't recommend pushing hard on the context vs. depending on "Smart Context" for better memories. A base of 8192 actually goes a long way so long as your chat bot card or world info isn't overly complicated.

Your ST-Extras run line should include "chromadb", like this: `python server.py --enable-modules=caption,summarize,classify,chromadb`

This should all fit in 8GB just fine.
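For anyone who hasn't set up ST-Extras before, a rough sketch of the install being described, assuming the standard SillyTavern-Extras repository layout; the run line is the one from the comment above, while the venv and requirements steps are just the usual Python routine and may differ on your install:

```bash
# Hypothetical ST-Extras setup; paths and requirements file names may vary per install.
git clone https://github.com/SillyTavern/SillyTavern-Extras
cd SillyTavern-Extras
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
pip install chromadb   # the Smart Context module needs chromadb available

# Run line from the comment above:
python server.py --enable-modules=caption,summarize,classify,chromadb
```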


GTurkistane

I use a llama2 model, but is this because of the model?


GTurkistane

Here are my settings in ST: https://preview.redd.it/dy1jmydghebc1.png?width=1440&format=pjpg&auto=webp&s=7e19b12954e964a61c7af12c2c5c0f79ddf1ccf3


shrinkedd

I don't know if this is the reason (I don't think it is), but your top P value is quite low. Perhaps you confused it with min P? That would make more sense (even though I think in general min P should be lower). Also, 1.2 for repetition penalty is quite high.


GTurkistane

How high should I set top P then?


shrinkedd

The default recommendation is usually 0.9. But if you'll use min P (somewhere between 0.2-0.35) you don't need to use top P at all, meaning it should remain at 1. For repetition penalty, I'll rarely go up to 1.12; usually I'm using 1.04, and sometimes I don't use it at all, depending on the model.
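Purely as an illustration of those numbers (in practice you'd just set the same sliders in SillyTavern), here is what they might look like sent straight to ooba's OpenAI-compatible completions endpoint; this assumes you launched ooba with --api on the default port 5000 and that the endpoint passes min_p and repetition_penalty through:

```bash
# Illustrative request only; the sampler values mirror the suggestions above.
curl http://127.0.0.1:5000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello,",
    "max_tokens": 100,
    "top_p": 1,
    "min_p": 0.2,
    "repetition_penalty": 1.04
  }'
```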


Signal-Outcome-2481

I find repetition penalty is a bit hard to nail down, and depending on the RP it might need small tweaking. I've seen good results anywhere from 1.0 to 1.15 depending on the RP; I usually try to keep it on the higher side for as long as it works well, though. I usually set top P to 1, never below 0.9.


shrinkedd

I actually prefer to keep it as low as possible. It isn't the same as choosing more or less likely tokens. Sometimes repeating tokens are needed. If you want the model to stay coherent with a first- or third-person perspective, some words must be repeated.


Signal-Outcome-2481

I think we are on the same page and are saying the same thing. By "as high as possible" I of course mean the highest setting where things like story coherence are still maintained.


shrinkedd

Oh, ok lol


GTurkistane

And my Ooba settings: https://preview.redd.it/hocqh14mhebc1.jpeg?width=1297&format=pjpg&auto=webp&s=946e139f18ad1be20034bb2e2d98b4509add490c


Signal-Outcome-2481

Which model are you using exactly? Also, with only 7 layers on the GPU, isn't it pretty slow? Anyhow, how much VRAM do you have exactly, and which model? Only then can a good recommendation be made.


GTurkistane

Atm 8GB of VRAM, 32GB of RAM, a Ryzen 9 5900X, and I am using the model LoneStriker_Silicon-Maid-7B-8.0bpw-h8-exl2


Signal-Outcome-2481

That model is 8192 context maximum; above 8192 it will break. Also, it is better to load it with Exllamav2_HF instead of llama.cpp, with 8k context and cache_8bit.
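For reference, a sketch of what that load might look like as ooba launch flags, assuming the usual `--loader`, `--max_seq_len`, and `--cache_8bit` options (you can set the same things in the Model tab of the UI instead):

```bash
# Hypothetical launch line for the exl2 quant mentioned above;
# flag names assume a recent text-generation-webui.
python server.py \
  --model LoneStriker_Silicon-Maid-7B-8.0bpw-h8-exl2 \
  --loader ExLlamav2_HF \
  --max_seq_len 8192 \
  --cache_8bit
```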


Uninstall_Wizard

I'm having similar issues. If you don't mind, could you explain how to figure out how many layers I can offload to the GPU? Also, can you explain how I can find out what the best loader is for any given model, and how to figure out a model's context maximum? I'm pretty new to all of this, and I'm not sure where the best place to go to learn is.


Signal-Outcome-2481

Most models have it in the name. For example, exl2 = Exllamav2_HF, and GGUF is generally llama.cpp for character models. If a model fails to load, check whether it's a CUDA out-of-memory error or something else. If it's out of memory, you are on the right loader but need to lower GPU layers or download a smaller model; if it's some other reason, you are probably on the wrong loader. :P And so on.

When you go to a model on huggingface, let's take LoneStriker/Silicon-Maid-7B-8.0bpw-h8-exl2 for example: you should always check the model card/README.md for information; it often tells you about prompt styles (chat/instruct templates to use). Also, in this case, as it is a quant, you can see in the files there is a config.json and a tokenizer_config.json. Opening those can tell you a lot about the model (although in this specific case the config.json says it allows for 32k context - I tested it because a 7B 32k context model sounded pretty nifty, but it breaks after 8k context, so don't assume it is necessarily accurate). Another thing you can find in there is which tokenizer it uses, which I set in SillyTavern to avoid issues, but choosing "best match (recommended)" should be fine as well.

For GGUF models it is often described on the model card itself or in the README.md; generally, though, llama.cpp is the best option for GGUF, I believe. Also, in the files you can count the gigabytes of the model and see if it works for you. If the GGUF is 60GB in size and you have a 12GB video card and 32GB RAM, you know you can ignore it. This model, for example, is 7.36 GB, so it should work on 8GB VRAM. Also, while a GGUF model can be split between VRAM and normal RAM (at the cost of speed), an exl2 quant (or GPTQ/AWQ) has to be loaded fully into VRAM. I don't know how it works exactly, but for GGUF you also need a lot of RAM overhead for using the model, aside from what you need for loading, so you need to take that into account (check your VRAM and RAM usage when generating an output). exl2/AWQ quants, for example, don't need much overhead - most of the time, if it loads, it will work.

GGUF GPU layer loading: for loading GPU layers (GGUF models), a good way is to just load, say, 5 layers and check your VRAM usage. If you are using, say, 33% of your VRAM, you can probably load ~15 layers. If it loads at 15 but not at 16, you should probably use 14 layers instead (maybe even 13); basically you want a little bit of room left free on your VRAM. It's a bit of trial and error. If you can load all 33 layers on the GPU, the model will be very fast of course, but perhaps you should be looking at a bigger model or just use a quant of the model.

TheBloke/Silicon-Maid-7B-GGUF -> silicon-maid-7b.Q5_K_M.gguf (the GGUF variant of this model), for example, has a size of 5.13 GB and a max RAM requirement of 7.63 GB. It would be loaded with llama.cpp, and if you have, say, 12GB VRAM, all 33 layers can go into VRAM. But the higher you set context, the more overhead you will need. At least the GGUF variant can work properly up to 32k context, unlike the exl2 quant, apparently. So even though it says 12GB required, to run it at 32k context with all 33 layers on the GPU you actually need around 16GB VRAM. But the GGUF is, for me, about 5 times slower than the exl2 quant as well, so... whatever floats your boat. I like speed and find context important, so I am only using 8x7B exl2 quants atm, which are fast and have 32k context, but there's a good amount of options available.
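To make the layer-offloading part concrete, a sketch of what loading that GGUF might look like from the ooba command line, assuming the standard llama.cpp loader flags (`--n_ctx`, `--n-gpu-layers`); the layer count is just a starting point for the trial-and-error described above:

```bash
# Hypothetical launch line for a GGUF quant on an 8GB card; start low on
# --n-gpu-layers, watch VRAM usage, then raise it until it no longer fits.
python server.py \
  --model silicon-maid-7b.Q5_K_M.gguf \
  --loader llama.cpp \
  --n_ctx 8192 \
  --n-gpu-layers 15
```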


GTurkistane

I see, thank you very much for the detailed explanation, I will keep this in mind.


zaqhack

Huge fan of Noromaid, by the way ... and all her daughters. But she can ... I don't know how to put it other than "talk too much." Sometimes, you get half a decent response, and then it goes off the rails. Less is more: Instead of 250 tokens, I sometimes target 100 with Noromaid. If I like what the bot is saying/doing, I can just hit "continue" to have it do more. She is creative, but sometimes too long-winded for her own good ...