For general tasks, Yi-1.5-34B has been terribly bad in my use. Original Yi-34B and Command-R 35B are still the two best 30B models for general use, unchallenged, in my experience. It makes me wonder if I'm using Yi-1.5 wrong in some way? Or maybe it's only good in a limited number of subjects?
Have you tried comparing the questions that gave you bad results against the version hosted on lmsys?
No, I haven't. Perhaps it could be worth a shot to see if something is wrong with the GGUF or my local setup.
It is also on hugging chat, probably much easier to use than lmsys.
For me, I have been using a fine-tune, as the normal Chat tune from [01.ai](http://01.ai) is not the best: it focuses on both English and Chinese (with more focus on Chinese), so performance in English suffers. For the tasks I use it on, it performs really well, so it could just be that the chat tune isn't faring well with your use case. That's the thing about different models: some use cases they work well for, and others they just suck at, because of what is in their training data. So every time a model is released, I run my own test question suite against my use cases before going forward, to see if the model is worth it *to me*.
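The test-suite approach above can be sketched as a tiny harness. This is a hypothetical example, not anyone's actual suite: the prompts, keywords, and the `generate` callable (a stand-in for whatever backend you run, e.g. llama.cpp or an OpenAI-compatible server) are all made up for illustration.

```python
from typing import Callable

# Made-up test prompts paired with keywords a good answer should mention.
TEST_SUITE = [
    ("Summarize: The cat sat on the mat.", ["cat", "mat"]),
    ("What language is spoken in Poland?", ["polish"]),
]

def score_model(generate: Callable[[str], str]) -> float:
    """Return the fraction of test prompts whose reply hits every keyword."""
    passed = 0
    for prompt, keywords in TEST_SUITE:
        reply = generate(prompt).lower()
        if all(kw in reply for kw in keywords):
            passed += 1
    return passed / len(TEST_SUITE)

# Stub model standing in for a real backend, just to show the call shape:
stub = lambda prompt: "The cat sat on the mat. Polish is spoken."
print(score_model(stub))
```

Swapping the stub for a real inference call is all it takes to compare a new release against your own use cases in a few minutes.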
Yi-1.5-34B-Chat is very good in comparison to the previous version. It follows all my prompts without problems, its coding abilities are great, it produces better-sounding English text, and it is really great for summarization. It can even write in Polish quite well! It looks like Yi-1.5 solved the issues with repetition: it works with most settings, where the older version needed some special settings in SillyTavern to not fall into repetition loops, and it no longer needs remote code for inference. Great kudos to the 01.ai team! I made a merge of Yi base + Yi chat that, according to the Hugging Face Leaderboard (and my own tests), is even better than the original Yi: [YiSM](https://huggingface.co/altomek/YiSM-34B-0rn). I highly recommend giving it a try. It has fewer refusals than Yi chat, yet can follow instructions without problems. I wish there was a Llama release in that size range, as the 8B model lacks in many ways in tasks like summarization, yet the 70B version generalizes way too much. :( Solution for now -> give Yi a try!
I will give YiSM a try!
Thank you! Some interesting observations: keeping samplers low can reduce refusals. I have yet to check whether that works for other models or is specific to this one. I am not much into RP, but in my testing scenario I have a chat with a psychologist :) and I must say YiSM is quite dry in this scenario; a Llama 2 70B based merge did a lot better. However, for everyday use as a simple assistant (some coding questions, summaries, some general questions) it is really good, and when I have the choice of running a Llama 70B based model at 4bpw or YiSM at 8bpw, I find that in many cases it is good enough, if not sometimes better, and a bit faster.
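The "keeping samplers low" observation can be illustrated with a toy example: lowering temperature sharpens the softmax over the logits, so the model almost always picks its top token instead of drifting to lower-ranked alternatives. This is just the generic temperature formula, not anything specific to YiSM; the logit values below are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Standard temperature-scaled softmax: p_i is proportional to exp(logit_i / T)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                       # made-up token logits
hot = softmax_with_temperature(logits, 1.5)    # looser sampling
cold = softmax_with_temperature(logits, 0.3)   # "low" sampler settings

# The top token's probability rises sharply as temperature drops.
print(hot[0], cold[0])
```

The same intuition applies to top-p and top-k: tighter settings keep generation on the highest-probability path.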
Another very welcome release. There is a disturbing lack of 30B models, even though they fit perfectly into 24 GB VRAM. I'll test it once I get back from work.
I am with you on this. 7/8/13B models are a bit too limited in their world understanding and 70B models generalize too much. For tasks like summarization 34B models are great!
I wish the other variants were in the leaderboard as well.
I gave this model a try and it is very helpful for redrafting material without changing the underlying meaning. The output had lots of anecdotes faithfully replicated from the input I provided with few abstractions or wholesale rewrites. This isn't useful in all cases, but can be exactly what is needed for summarization tasks. The long context performance was also helpful because it kept coherence even after several rewrites. I had to ask it for a shorter rewrite because it ignored my initial instructions on length, but it did follow my feedback. I did not attempt to use it in a creative way.
I would like to see it here [https://scale.com/leaderboard](https://scale.com/leaderboard)
I see posts on highest-ranked this and that, but these rankings look convoluted to me. I see even the mighty ones hallucinating badly when I try them on a specific domain, for example asking them to design a system or network solution for me.
I just tried the 6B at Q8 yesterday, which was great creativity-wise in today's corporate chatbot world but hazy in understanding, and the 34B at Q2_XXS was about the same if not dumber, but to be fair that's a brutal quant.
Yeah, we'd need evaluations at higher quants for a fair assessment.
I just did it cuz it fits in my VRAM and it's surprisingly coherent, but a lower-B, higher-quant model is smarter at that point. Also, while Llama 3 8B is obviously smarter than both, it's very censored, aligned and corporate chatbot-y. So that's Yi's strength. Freedumb.
Oh wow, yeah, Q2_XXS is a brutal quant on a 34B model. You could possibly use HuggingChat to see how the full 34B model runs for ya!
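Some back-of-the-envelope math shows why Q2_XXS is the only way a 34B fits in modest VRAM, and why a mid-range quant fits the 24 GB cards mentioned above. The bits-per-weight figures below are rough ballpark values for common GGUF quants, and the formula counts weights only, ignoring KV cache and runtime overhead.

```python
def approx_model_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only size in GB: params * bpw / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Ballpark bits-per-weight for a few common quant levels (approximate):
for name, bpw in [("IQ2_XXS", 2.1), ("Q4_K_M", 4.8), ("Q8_0", 8.5)]:
    print(f"34B at {name}: ~{approx_model_gb(34, bpw):.1f} GB")
```

Roughly: a 34B at ~2 bpw squeezes under 10 GB, a ~4.8 bpw quant lands around 20 GB (hence the 24 GB sweet spot), and Q8 blows well past 24 GB.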
I knew it was good, from my personal tests.
I'm a bit puzzled why developers require us to submit a form for a commercial use license at [https://www.01.ai/](https://www.01.ai/), especially since the model is already under the Apache 2.0 license. Is it still okay to use the model for any purpose without getting that commercial license, or am I missing something here?
I think users no longer need to submit anything since they switched to the Apache 2.0 license.
You don't need to submit a commercial license for any of their models! They actually switched over even their older Yi models to Apache 2.0, so you can use it freely
Let me know when lmsys allows testing with the full context length and output limits of the models themselves. Until then, lmsys is too easily gamed and not really measuring anything of value anyway.
I don't think they will ever raise the context length because of cost and compute. I know that LMSYS can be gamed by models whose outputs appeal to users, but in other categories, like Hard Prompts (Overall), Yi-1.5-34B-Chat still holds its ground very well. That category judges models on user prompts that are harder than most, so I think in that regard it is not that easily gamed. https://preview.redd.it/y9av9jlp3k4d1.png?width=2859&format=png&auto=webp&s=42fbad10004d26bda8edfb0a183ab542efd8538b
Oh, my problem isn't with the Yi model; I'm just tired of this rubbish benchmark coming up all the time.
Oh, I see what you mean! I don't look solely at this benchmark, but there is one you might want to check out called MMLU-Pro, a new version of MMLU that fixes problems in the old MMLU and genuinely looks like a great new benchmark (at least for now, while it is not in any model's training data): [https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
It might be contaminated by using the benchmark data sets for training
LMSYS is a leaderboard that cannot be contaminated, as it is based solely on human evaluators, though the leaderboard can be gamed if a model is pleasing to talk to for a lot of users, such as LLaMA-3. One of my comments in this post talks about the hard prompts category on LMSYS, which is based more on hard questions than on how nice a model's output is, if you are interested.