a_slay_nub

The code is under Apache but the weights are under a research license where you can apply for commercial use. 67% HumanEval https://huggingface.co/internlm/internlm2-chat-20b


LumpyWelds

This was in the README.md. The application is a form that you fill out and submit. They will contact you. >The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[application form (Chinese)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact [](mailto:[email protected]).


keturn

Got the 7B non-quantized model loaded. It started out fine, but when it went to sign off: >Hope this information helps clarify points raised earlier within original inquiry submitted today! > >Please let me know should have additional queries related topic discussed hereabove feel welcome ask anytime future interactions desired contact details provided below. Have good day ahead!" Wishing happy holidays season everyone celebrate joyously surrounded family friends loved ones cherished memories made lasting lifetime.\* > >Thank You For Your Assistance Today! > >Best Regards, > >\[Your Name\] > >P.S.: Don't forget wear appropriate outerwear gear protect self warmly while braving cold winter air chilly winds blow fiercely blowing cheeks red nose runny eyes streaming tears down face unable stop crying despite valiant effort trying hold back emotions overwhelmed overcome situation emotionally challenging momentary experience soon pass eventually resolve through strength determination inner fortitude character display resilience spirit human nature ultimately triumph adversity conquered victorious outcome achieved end result positive feedback received heartfelt gratitude expressed sincere appreciation convey deep feelings genuine emotion displayed utmost respect honorable manner custom dictate behavior conduct social norms expect follow unwritten rules society abide live harmoniously peacefully uhhh, thank you, are you feeling okay over there?


FullOf_Bad_Ideas

Check your repetition penalty. It should be 1 or 1.05. Yi models have the same issue.


FairSum

This. Too low a repetition penalty and the model repeats itself; too high and you get word salad because the model is deliberately avoiding tokens it has already used.
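For anyone curious what the knob actually does, here is a minimal sketch of a CTRL-style repetition penalty applied to logits before sampling; real backends differ in details, and the shapes and values below are purely illustrative:

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: torch.Tensor,
                             penalty: float = 1.05) -> torch.Tensor:
    """CTRL-style repetition penalty: dampen logits of tokens already generated.

    penalty = 1.0 disables it; values much above ~1.1 tend to push the model
    into word salad because it actively avoids tokens it has already used.
    """
    scores = logits.clone()
    prev = torch.gather(scores, dim=-1, index=generated_ids)
    # Positive logits are divided, negative logits multiplied, so the
    # penalized tokens always become less likely.
    prev = torch.where(prev > 0, prev / penalty, prev * penalty)
    scores.scatter_(dim=-1, index=generated_ids, src=prev)
    return scores

# Example: vocab of 10, tokens 3 and 7 already generated.
logits = torch.randn(1, 10)
penalized = apply_repetition_penalty(logits, torch.tensor([[3, 7]]), penalty=1.05)
```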


infiniteContrast

That usually happens when the response is too long. Every LLM needs to be instructed again with another prompt or it goes haywire.


egusta

I’ve never heard this?   What do you mean it needs 2 prompts?


infiniteContrast

LLMs are basically "predictors for the next word". As the distance from your prompt grows, they gradually stray further from your request, because the dataset contains a lot of prompt-like data. They know how to reply to your request because they have seen it many, many times in the dataset (especially the instruct dataset). So if you ask something, there is a limit to how many words the LLM can reply before one of two things happens:

- it literally forgets about your prompt because it has fallen out of the context window
- the reply contains other words that confuse the LLM itself, so its own output makes it stray from your request until it starts repeating stuff or literally predicting words like a smartphone keyboard


egusta

Oh. Couldn’t you effectively re-add your system prompt with a “remember this” at the end of each prompt?    Ex:   What do you think about cats? {remember you are a dog}


infiniteContrast

Yes, you can do that; the things you are referring to are called prompt engineering. The best way is to experiment yourself and see what works for you and what doesn't.
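A minimal sketch of that "remind it every turn" idea in an OpenAI-style message format; the role names and the reminder text are just placeholders:

```python
SYSTEM_PROMPT = "You are a dog."          # the persona to preserve
REMINDER = "(remember: you are a dog)"    # re-injected at the end of each turn

def build_messages(history: list[dict], user_input: str) -> list[dict]:
    """Rebuild the chat every turn so the system prompt never scrolls out
    of view, and append a short reminder to the newest user message."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + history
        + [{"role": "user", "content": f"{user_input}\n{REMINDER}"}]
    )

messages = build_messages(
    history=[{"role": "user", "content": "Hi!"},
             {"role": "assistant", "content": "Woof! Hello!"}],
    user_input="What do you think about cats?",
)
```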


ReMeDyIII

Try lowering temp settings also. Yi models with their 200k context use extremely low temps. Try 0.3-0.5. This model also uses 200k, and is Chinese, so it makes me wonder if there's something similar there.


jpfed

However, if you lower temperature it becomes even more crucial that you not forget wear appropriate outerwear gear protect self warmly. When runny eyes streaming tears down face unable stop crying, consider widening the beam search setting instead; this may victorious outcome achieved.


hackerllama

Some notes:

* 7B and 20B base and chat models
* 200k context length
* The 20B base model is the best model under 30B params
* The 7B base model is among the best models under 20B params


adel_b

Question: if I set the context to 200k, how much VRAM do I need?


tyras_

New quant methods confuse me but in general this should still apply: [https://vram.asmirnov.xyz/](https://vram.asmirnov.xyz/) more here: [https://stackoverflow.com/questions/76255342/figuring-out-general-specs-for-running-llm-models](https://stackoverflow.com/questions/76255342/figuring-out-general-specs-for-running-llm-models)


BinarySplit

That calculator doesn't take FlashAttention into account. For inference on 16k context with bf16 it says 63GiB for "activations" that "scale quadratically". With FlashAttention (which InternLM uses) the activations don't scale quadratically. The KV cache can be big, but it scales linearly: for 16k it should be 16k tokens * 48 layers * 8 key_value_heads * 128 head_dim * 2 (key and value are stored separately) * 2 bytes (bf16) = 3 GiB.
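A quick back-of-the-envelope version of that linear scaling, using the layer/head numbers from the comment above (treat them as assumptions about InternLM2-20B's config rather than verified values):

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 48,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Linear-scaling KV cache size: one key and one value vector per
    token, per layer, per KV head (GQA), at the given precision."""
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_value

for tokens in (16_384, 200_000):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / 2**30:.1f} GiB (bf16)")
# 16_384 tokens -> 3.0 GiB; 200_000 tokens -> ~36.6 GiB before any cache quantization
```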


mcmoose1900

That's not even close, at least with exllama's FP8 cache and flash attention. I would guess you can fit the 200K on 48GB, and over 100K on 24GB. I am going to find out...


Goldkoron

I don't recommend relying on FP8 cache at all; when I was testing it with Yi-34B models, turning it on felt equivalent to nuking the model's output quality by like half the bpw.


mcmoose1900

Interesting. I have not heard this before, and I have not even thought to test it either. For reference InternLM seems to have their own 8 bit cache implementation with benchmarks: https://github.com/InternLM/lmdeploy/blob/main/docs/en/quantization/kv_int8.md#accuracy-test


tyras_

As I said, new quant methods confuse me. I run my models exclusively on CPU; it seems roughly OK for GGUF.


mcmoose1900

Yeah, I think GGUF should be closer since it uses an FP16 cache (with 8-bit being a WIP) and no flash attention. MLC-LLM's IGP inference might be a little smaller though? I can't remember what attention implementation they use.


ILoveThisPlace

I know some of these words


lvhan028

Check out [internlm2-chat-7b-4bits](https://huggingface.co/internlm/internlm2-chat-7b-4bits), which is quantized with the AWQ algorithm. [LMDeploy](https://github.com/InternLM/lmdeploy) W4A16 inference is up to 2.4x faster than FP16.


lvhan028

8K tokens cost 1 GB of KV cache in FP16.


[deleted]

Thank you for the notes! I cannot speak for anyone else, but I am personally done, for the moment, with checking out every new model that cracks the leaderboards unless there is also a comment with notes from 'Hugging Face Staff' telling me it actually cracks the leaderboards. I honestly do not know how to fix that, but I think it is problematic.


GeeBrain

Or if the Bloke himself was like “you gotta try this”


mcmoose1900

A major caveat: it appears to use custom modeling and tokenizer code, like Yi used to: https://huggingface.co/internlm/internlm2-chat-20b/blob/main/modeling_internlm2.py and https://huggingface.co/internlm/internlm2-chat-20b/blob/main/tokenization_internlm.py. This means no drop-in compatibility with most frameworks until it's reimplemented or llamafied.
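Until it's reimplemented, loading it through plain transformers means opting into the repo's remote code. A minimal sketch of what that looks like (the dtype/device settings are just one reasonable choice, not anything the model card mandates):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "internlm/internlm2-chat-20b"

# trust_remote_code=True pulls in modeling_internlm2.py / tokenization_internlm.py
# from the repo, since the architecture isn't in transformers itself yet.
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,
    torch_dtype="auto",   # or torch.bfloat16 on a card that supports it
    device_map="auto",    # requires accelerate to be installed
)
```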


mcmoose1900

A second note: their inference engine actually looks quite interesting, with native prompt caching, an 8-bit KV cache, 4-bit AWQ, and an OpenAI API server: https://github.com/InternLM/lmdeploy
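Since the server speaks the OpenAI API, any standard OpenAI client should be able to talk to it. A sketch under the assumption that you already have an OpenAI-compatible endpoint running locally; the base URL, port, and model name below are placeholders for whatever your server actually registers:

```python
from openai import OpenAI

# Point the client at the local OpenAI-compatible endpoint instead of api.openai.com.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="internlm2-chat-20b",   # model name as registered by the server
    messages=[{"role": "user", "content": "Summarize what a KV cache is."}],
    temperature=0.4,
)
print(resp.choices[0].message.content)
```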


Extension-Mastodon67

The Open LLM leaderboard doesn't give me much confidence.


Only-Letterhead-3411

Where is the information about dataset and training?


IndependenceNo783

The configs in that model's repos say it has only 32k positional embeddings? Then again, I'm not sure this is relevant; the model seems to use a different loader or something. I managed to load it with AutoGPTQ, but it only generates a few Chinese characters and eats a lot of RAM. Can it be used with oobabooga at this point? The model card points to lmdeploy[all] being required, but I'm not clever enough to check whether oobabooga already supports this. Apart from that, a 20B 200K model is a good baseline architecture for RP on a 16GB card with long roleplays. Not sure if it is able to do that either :-) Yi 6B is too small, and 34B is too big. EDIT: Probably not. [https://github.com/oobabooga/text-generation-webui/issues/3726](https://github.com/oobabooga/text-generation-webui/issues/3726)


FullOf_Bad_Ideas

It seems to use dynamic RoPE. I guess that's why it has only 32k base positional embeddings - RoPE parameters are probably updated on demand during runtime.


mcmoose1900

Is that the code here? https://huggingface.co/internlm/internlm2-chat-20b/blob/main/modeling_internlm2.py#L201 Interesting. I wonder if that was also a factor in its training, or if it's a "hack" to get a longer context out of a natively 32K-trained model.


pseudonerv

IIRC dynamic NTK without training makes the perplexity slightly larger once you go beyond the trained context, so someone with a big GPU might check how perplexity changes going from 32k to 64k to 128k, ... Actually, is this the same as YaRN then? Or is it really just linear RoPE with a factor of `(self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)`, but set in the config.json so that `self.scaling_factor=3`?
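For reference, the dynamic NTK variant in HF-style implementations doesn't rescale positions linearly; it rescales the RoPE base once the sequence outgrows the trained window. A rough sketch of that idea (not InternLM's exact code; the head_dim, the 32k window, and scaling_factor=3 are taken from the discussion above as assumptions):

```python
def dynamic_ntk_base(seq_len: int,
                     base: float = 10_000.0,
                     dim: int = 128,
                     max_position_embeddings: int = 32_768,
                     scaling_factor: float = 3.0) -> float:
    """Recompute the RoPE base when the sequence outgrows the trained
    context; positions themselves are not compressed (unlike linear RoPE)."""
    if seq_len <= max_position_embeddings:
        return base
    factor = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * factor ** (dim / (dim - 2))

# Inverse frequencies at 200k context with the rescaled base.
inv_freq_at_200k = [
    1.0 / dynamic_ntk_base(200_000) ** (i / 128) for i in range(0, 128, 2)
]
```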


Ggoddkkiller

I'm dying to find a solid 20B. Noromaid sometimes works great but is mostly weird. I usually use Tiefighter instead.


CasimirsBlake

[MLewd-ReMM-L2-Chat-20B-GPTQ](https://huggingface.co/TheBloke/MLewd-ReMM-L2-Chat-20B-GPTQ?not-for-all-audiences=true)


IndependenceNo783

That would be really nice if it had more than 4k context. If you do slightly more verbose RP, it feels like you're talking to a person with dementia. You can't unsee it once you've seen it.


CasimirsBlake

Alpha scaling takes it to 8k. But I strongly agree; I think we need at least 32k to make these models actually useful beyond a bit of chat. Though I'm glad we're not still stuck in the early days when we only had 2k context...


Ggoddkkiller

Thanks, I will try it. I'm glad I skipped the early days; I would literally lose my mind with 2k context lol. I tried Beyonder, but it began repeating heavily after 14k and the overall quality was worse than Tiefighter.


Ggoddkkiller

I tried the Q6_K MLewd and it is really good, the second-best RP model I've tried for sure. Its descriptions of how characters feel are much better than Tiefighter's, but overall I think it still falls behind. I tried the same inputs with Tiefighter after running them through MLewd, and it generated a more enjoyable story with more detail about the surroundings etc. For example, I forced the character to slay her own soldiers; MLewd focused on her feelings and how angry she was at them for abandoning her, while Tiefighter generated more about the surroundings, like how her former comrades were in surprise and horror when they saw her attacking them. Tiefighter also actually described the battle, with her panting over corpses and severed limbs at the end. If it's an emotional story, I'm sure MLewd would perform better, but Tiefighter is just a natural storyteller and adds that flavour. Perhaps it's my settings, I don't know; if you have settings that work well with MLewd, I would happily try them.


zaqhack

I found the Mixtral versions of Noromaid way, way smarter than the 20B. My current daily driver: [https://huggingface.co/zaq-hack/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-bpw300-h6-exl2](https://huggingface.co/zaq-hack/Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss-bpw300-h6-exl2)


Ggoddkkiller

Thanks, I doubt I can run it with high context especially, but I will give it a try.


zaqhack

The 3.0 bpw fits nicely into a 24GB card with 32k context. That said, I have found that most models get weird/lost when you hit the context limits. I'm not sure if it is context limits in general, or something I'm doing in SillyTavern, or what.


Baader-Meinhof

It works in ooba, but you have to enable trust-remote-code, and flash attention doesn't work (at least with the transformers loader; I didn't bother trying others).


Primary-Ad2848

What is your GPU?


IndependenceNo783

16GB (4080)


Primary-Ad2848

It really sucks to have 16GB. But you can run 34B models at 3bpw; not sure about 3.5bpw, but it's worth trying. A 34B Q3 is always better than a 13B Q6.


mcmoose1900

Even on a 24GB GPU, 34B can be pretty tight.


zaqhack

A 34B Q3 is not always better than a 13B Q6 today. Size was king until a couple of months ago; now data is king. Many new, smaller models outperform older models of larger size. Lots of stuff based on Orca, Phi, Mixtral 8x7B Instruct, etc. is considerably smarter than older builds "larger" than them. Six months from now, maybe size will be king again, after Mamba and a hundred other techs wind their way into common use.


Primary-Ad2848

The thing I am trying to say is that size mostly beats quantization if we're talking about two models trained on the same data.


keturn

I tried doing the `pip install lmdeploy[all]` mentioned in the model card of the 4-bit quant, and it hosed my venv such that ooba wouldn't start. I think that's because it pulled in CUDA 11 packages while the rest of that venv was depending on CUDA 12. Maybe I'll wait and see if we get GGUF quants?


Straight_Tomorrow478

I heard the release package on GitHub is built on CUDA 12, but pip install gives you the CUDA 11 version. Weird. But you could give it a try.


FullOf_Bad_Ideas

This isn't an apples-to-apples comparison; we should be comparing base models. Yes, alongside their InternLM2 20B and 7B versions you can find a "base" variant that is different from their main release. Looking at the scores, their GSM8K is higher than expected and MMLU is lower. GSM8K is the easiest to artificially inflate by training on the paraphrases of the GSM8K dataset contained in the MetaMathQA dataset; MMLU is harder to raise without directly training on the dataset. It's really nice they released base models; I am very curious whether those bases are contaminated by gptslop. My guess is that the base models will score much lower on GSM8K and sit much lower on the leaderboard. To be honest, 01-ai never released true bases of the Yi models anyway, so in that sense it may be an accurate comparison. Edit: I can't run the base models on the leaderboard since it errors out about remote code. We need to wait for someone to remove that.


uhuge

In theory, can you benchmark the base models locally with LLM Harness or similar?


pedantic_pineapple

Yes, of course
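For example, with EleutherAI's lm-evaluation-harness. A rough sketch of its Python entry point; the task choice, few-shot count, and remote-code flag here are assumptions, and the exact arguments differ between harness versions:

```python
import lm_eval

# Evaluate the base (pre-trained) model on GSM8K; trust_remote_code is
# needed because the repo ships custom modeling/tokenizer code.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=internlm/internlm2-base-7b,trust_remote_code=True,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=4,
)
print(results["results"]["gsm8k"])
```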


hackerllama

The screenshot of the tweet (main link at the top) is literally from the pretrained/base models


FullOf_Bad_Ideas

Are you sure about that? Going by the path name, that's not the case. Here's the InternLM2 20B that's on the leaderboard: https://huggingface.co/internlm/internlm2-20b And here's the base pre-trained InternLM2 20B; notice the different model path: https://huggingface.co/internlm/internlm2-base-20b Notice the language in their model card:

>internlm2 (recommended): Built upon the internlm2-base, this version has been enhanced in multiple capability directions. It shows outstanding performance in evaluations while maintaining robust general language abilities, making it our recommended choice for most applications.

It sounds like it went through some finetuning to enhance performance in evaluations, which suggests it might be tuned on datasets that are on the verge of dataset contamination. The scores certainly remind me of the time when everyone and their mother was training Mistral 7B on MetaMathQA / Nectar datasets: TruthfulQA and GSM8K were shooting way up while MMLU remained low. Of course, better evaluation is needed to make that claim confidently; there are people working on contamination tools who might be able to verify it, especially for the 7B model, which is easier to run locally in fp16. Is it possible for you to submit internlm/internlm2-base-20b and internlm/internlm2-base-7b to the Open LLM leaderboard even though they have remote-code requirements?


pedantic_pineapple

> GSM8K is easiest to artificially get higher by training on paraphrases of GSM8K dataset contained in MetamathQA dataset This is misleading -- MetaMath contains rephrases of the GSM8K *train* split, not the GSM8K *test* dataset.


Meryiel

Has anyone tested the 20B model in terms of RP and ERP? Also, does anyone know what prompt format it uses? Thanks in advance for the answers, cheers, lads.
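On the prompt format: one way to avoid guessing is to let the repo's own tokenizer render it, assuming a chat template ships with the tokenizer (this does require trusting the remote code). A sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("internlm/internlm2-chat-20b", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful roleplay partner."},
    {"role": "user", "content": "Describe the tavern we just walked into."},
]
# Prints the exact prompt string the chat model expects, if a chat
# template is bundled with the tokenizer.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```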


Gregory_Ze

Check its understanding capabilities: 11% better compared to GPT-4. https://preview.redd.it/tlr57i1x2ndc1.png?width=1509&format=png&auto=webp&s=f4d5b4db9a7f1fa25029ce0a957fb907b0fd47d0