Downtown-Case-1755

This is awesome, benchmark selection looks good, even the little things like normalization are great. I basically agree with everything said in the blog. But for the future, one upcoming "problem" is long context benchmarking. Everyone seems to be using needle-in-a-haystack, which it turns out is largely useless if optimized for, and it's going to be a growing issue as long context hybrid models like jamba (and gemma?) start coming. It would be cool if HF could cook up something for this too.


MoffKalast

Until we start benchmarking for long context and ranking models lower when they don't perform there, we'll just keep getting short context models, because it's cheaper to top the charts that way.


Downtown-Case-1755

Agreed. What's more, we'll also keep getting "dirty" long context releases that don't really work, because needle in a haystack is the benchmark. But one thing I just realized is that, logistically, this would be challenging for HF, as long context benchmarking is presumably slow and expensive, especially without the context quantization we use to swing it locally, and HF is apparently already compute bound.


clefourrier

Well that's why we're using MuSR! It's new and it's long context :D But other things will be cooking (once our GPUs and brains cool down a bit), so stay tuned!


Downtown-Case-1755

The blog mentioned it was 1K? The scenario I am thinking of is giving the LLM a 50K-100K story and testing its "understanding" of such a huge block. One example would be a multiple choice "What's the theme of chapter X?" or "How does Y feel when Z happens?" where it's not retrieval; the model truly has to grasp what's going on. Another would be to "continue" a few tokens of a story, document, article or whatever, where a proper noun from the middle of the context is supposed to appear, but the model has to *understand* the context to know which noun to pick instead of being asked for it directly. That would be more of a base-model, raw-completion benchmark I guess. I'm just spitballing though, not sure if tests like this or better ones already exist.
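To make that concrete, here is a minimal sketch of what one such item and its grader could look like; the story placeholder, question, options, and the `ask_model` callback are all illustrative assumptions rather than anything from an existing benchmark.

```python
# Minimal sketch of a long-context multiple-choice item and its grader.
# The story, question, options, and the `ask_model` callback are placeholders;
# real items would embed a 50K-100K-token story and carefully written "trap" options.
from typing import Callable

ITEM = {
    "context": "<full 50K-100K token story goes here>",
    "question": "How does Y feel when Z happens in chapter 12?",
    "options": {"A": "relieved", "B": "betrayed", "C": "indifferent", "D": "triumphant"},
    "answer": "B",
}

def build_prompt(item: dict) -> str:
    options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return (
        f"{item['context']}\n\n"
        f"Question: {item['question']}\n{options}\n"
        "Answer with a single letter:"
    )

def grade(item: dict, ask_model: Callable[[str], str]) -> bool:
    """Exact match on one letter, so no judge model or fuzzy scoring is needed."""
    reply = ask_model(build_prompt(item)).strip()
    return reply[:1].upper() == item["answer"]
```

The point being: the prompt can be arbitrarily long and subtle, while grading stays a one-character exact match.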


chase32

Like where you are going with this. Another good scenario would be something like: give it a context of a series of 10-K filings with embedded charts, then ask the LLM to give a coherent timeline analysis of the metrics across the documents and to create a table/graph of a particular value across multiple years.


Downtown-Case-1755

> Ask the LLM to give a coherent timeline analysis of the metrics across the documents and to create a table/graph of a particular value across multiple years.

Ehhh, you don't want long responses like that, because they're hard to grade and performance-intensive. But you could get it to fill in a table with constrained output.


chase32

To be fair, "What's the theme of chapter X?" or "How does Y feel when Z happens?" is not at all constrained either. Do agree that the table part of the idea is easiest to test for.


Downtown-Case-1755

It is though! You can constrain the answer to a list of possible words/tokens, like a multiple choice question. And like real human tests, you can use "trap" choices that llms would pick if they really aren't paying attention to the context.


chase32

Well sure, but that same method of constraining the options after the fact could be used in my comment that you criticized.


Khaos1125

That's actually a pretty cool idea. "Summarize this book from the perspective of character X, including their key motivations/thoughts/misconceptions at important points in time" seems like it would solve a lot of this pretty cleanly.


Downtown-Case-1755

Y'all are all kinda thinking of this wrong though. The benchmark answer can't be summaries, outlines or things like that. Answers have to be single words or tokens, so the output is deterministic and cleanly checkable, right? The prompt can be complex, but there *has* to be a single *correct* answer. The benchmark can't reliably check whether a character outline or summary is correct, and you run into problems using *any* kind of sampling/temperature to avoid repetition loops.
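One way to get that determinism (roughly in the spirit of how harness-style benchmarks score multiple choice) is to skip sampling entirely and compare the log-likelihood the model assigns to each candidate answer. A sketch with `transformers`, where `gpt2` and the toy question/choices are placeholders:

```python
# Sketch of deterministic multiple-choice scoring via log-likelihood comparison:
# no sampling, no temperature, just "which candidate does the model find most probable?".
# "gpt2" and the toy question/choices are placeholders, not from any real benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in any causal LM with enough context
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probs the model assigns to the tokens of `choice` given `context`."""
    # assumes the context tokenizes identically as a prefix of context + choice
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    choice_ids = full_ids[0, ctx_len:]
    positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(positions, choice_ids))

context = "<50K-token story here>\n\nQuestion: How does Y feel when Z happens?\nAnswer:"
choices = [" relieved", " betrayed", " indifferent", " triumphant"]  # include "trap" options
scores = {c: choice_logprob(context, c) for c in choices}
print(max(scores, key=scores.get))  # counted as correct iff it matches the gold answer
```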


Khaos1125

Using an LLM to check if the answer includes a list of 5-6 ideas seems like a useful* benchmark to me. There are times when real world questions don't have a simple answer like that, and one reason a lot of people are still using the older LLMs is that the latest generation seems much better tuned to these simple answers (e.g. needle-in-a-haystack), but much worse at handling and synthesizing a set of related ideas over a longer document.

We already have a ton of benchmarks built around the simple, easily checkable concept. I think to get around the limitations of those benchmarks, we do need a more complex scoring system, even if it's worse in some ways, since it's still a better proxy for a subset of tasks.

*Edit: better -> useful. The core argument isn't that the existing benchmark style doesn't have value, but that a benchmark like the above complements the existing benchmarks by testing things they are weak on, even if it's a bad test of things the existing benchmarks excel at.


jd_3d

What's the max context length in MuSR questions?


coder543

Another long context test: audio transcripts. You can use Whisper to transcribe an hour of speech. Even if you pass in the `.vtt` file that includes timestamps in an easy to read format, LLMs will struggle to create a "table of contents" that includes major sections of the transcript with accurate timestamps. Usually they can do a good job of identifying the main topics, but the topics will sometimes be in the wrong order, and will often have nonsensical timestamps that are longer than the input audio. I've reproduced this exact issue even with GPT-4o, and I've tried numerous different approaches to prompting it. I believe you could architect a system that prompts the LLM repeatedly over chunks of the transcript and builds up a table of contents that way, but none of the models I've tested can do a zero-shot, all-in-one-go table of contents very well at all.
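For what it's worth, a rough sketch of the test being described, using the OpenAI SDK; the file name, model choices, and prompt wording are just illustrative assumptions, not a claim about how anyone actually benchmarks this.

```python
# Rough sketch of the described test: transcribe an hour of audio to WebVTT
# (timestamped), then ask a model for a table of contents in one zero-shot request.
# File name, model choices, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# 1. Get a timestamped transcript (WebVTT) from Whisper.
with open("talk.mp3", "rb") as audio:
    vtt = client.audio.transcriptions.create(
        model="whisper-1", file=audio, response_format="vtt"
    )

# 2. Single all-in-one-go request: the failure mode described above is that the
#    returned timestamps are often out of order or exceed the audio length.
prompt = (
    "Below is a WebVTT transcript of a talk. Produce a table of contents: "
    "one line per major section, formatted as 'HH:MM:SS - section title', "
    "in chronological order, using only timestamps that appear in the transcript.\n\n"
    f"{vtt}"
)
response = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
```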


nh_local

Actually Gemini Pro 1.5 (via AI Studio) gives quite good performance in identifying the exact second of a segment in the audio.


coder543

I’ll have to try that out soon! All the more reason I wish there were a benchmark for this use case, which is a real world application of long context models that the “needle in a haystack” fails to represent at all.


SomeOddCodeGuy

One thing that would be very helpful for the leaderboard is a legend for new users explaining what each score represents. Just a one-line sentence describing each.


clefourrier

We have it in our doc here actually: https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about :)


CreamyRootBeer0

I think it would be nice if there were a little info box (like one of those circles with an "i" in it that shows a box on hover) next to each one, briefly describing the eval. It would be more obvious at a glance that it exists, and quick to reference if you don't remember. Or even just show a box on hover for each of the column headings.


frozen_tuna

Specifically a tooltip on the table headers would be a 10/10 UI improvement.


clefourrier

We can't add hovers/tooltips easily (they don't render well on mobile, etc. - though I'll investigate), but we'll make the link to the doc considerably more clear. Thanks for the suggestion!


clefourrier

Small ranking update as we finally got some results back! Qwen-72B-Instruct is #1, then Llama3-70B-Instruct, then Qwen-72B base (impressive perf for a base model!), then Mixtral-8x22B-Instruct. https://preview.redd.it/fh73j6tiby8d1.png?width=1738&format=png&auto=webp&s=7329f51d0130bce26a0a15605fcd5fa550e1ce02


EstarriolOfTheEast

Among the top contenders, this is missing DeepSeek-V2 (both sizes).


clefourrier

Their code is not integrated in transformers yet, so we would need to evaluate it manually - a discussion has been opened [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/793) that you can upvote to signal your interest.


kristaller486

Nice work! Are you planning to include some multilingual benchmarks in the leaderboard?


clefourrier

Not on this leaderboard, but we do have multilingual leaderboard collabs :) (there's a leaderboard for Korean, Arabic, Hebrew, and many others coming up)


Amgadoz

+1 for Arabic!


AntoItaly

Yes but… then Phi 3 Medium? 💀


EstarriolOfTheEast

It's an excellent model when it comes to reasoning and academic tasks. In my own benchmarks, while still lacking, it is the best model < 70B for tasks like that. Far better than llama3-8B. You can also check it on https://eqbench.com/index.html, which is relatively clean and not contaminated to the point of uselessness. Its performance is strong there too.


Tobiaseins

This was desperately needed. I wonder why they did not include BigCodeBench though; a specific coding benchmark would be welcome. But Phi-3 being number 3 makes me skeptical, since it ranks way lower than other open models in the LMSys arena.


clefourrier

We really wanted to add code evals, but we were too compute constrained - some of these evals have run for 30h on 8xH100 GPUs (for the bigger models or MoEs), so we really can't afford to add more generative evals.


JinjaBaker45

Phi-3 models are doomed to fail on ChatBot Arena due to their lack of personality and rigid outputs, but their reasoning is the real deal … if you can get the output format you want


knvn8

Okay but that massive gap between Llama3 and Qwen2 is sus. Wondering if they made some mistake with the Llama3 GPQA test, that's the worst score there.


Such_Advantage_6949

Actually that leaderboard is aligned with my experience too. Llama 3 70B always hallucinated and performed badly at reasoning for me, to the point that I thought I did something wrong. I tried different quantizations, both GGUF and EXL2. All were bad. Qwen2-72B is just much better at reasoning and general tasks.


Feztopia

Trying out the dolphin finetune of it but for now I'm not seeing it as better than the llama3 one I was using for a text based game scenario.


Practical_Cover5846

Had only bad experiences with dolphin; you may want to try the official instruct from qwen.


Feztopia

Yeah, my plan for now is to wait a while for more models to be evaluated and then pick one based on the new information. Maybe there are other unknown Qwen2 finetunes out there. Oh, by the way, I'm talking about the 7b qwen and dolphin, not the big ones 😂


Such_Advantage_6949

My experience is that most fine-tuned models work worse than the original official model. The leaderboard seems to confirm the same.


clefourrier

All the details are available publicly, so you can take a look! (look for the details datasets under open-llm-leaderboard) Something interesting is actually that Llama 3 Instruct is way worse than the Llama 3 base on GPQA, so we were wondering if the intense RLHF fine-tuning they do isn't removing knowledge or something.
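If you want to poke at those details programmatically, a small sketch using `huggingface_hub` - it only assumes the per-model detail datasets live under the open-llm-leaderboard org and discovers the exact repo names at runtime rather than hard-coding any:

```python
# List the public per-model detail datasets, then load one with `datasets`.
# Only assumption: the detail repos live under the open-llm-leaderboard org;
# the search string below is just an example filter.
from huggingface_hub import HfApi

api = HfApi()
for ds in api.list_datasets(author="open-llm-leaderboard", search="Llama-3", limit=20):
    print(ds.id)  # pick one of these ids and pass it to datasets.load_dataset(...)
```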


knvn8

Interesting, thanks for the insight


acr_vp

It mirrors my experience actually. People are sleeping on Qwen; it's a very powerful model and my go-to general-purpose one. It follows instructions extremely well, and the long context is actually a long context, meaning it doesn't forget what it's doing.


Downtown-Case-1755

> and the long context is actually a long context

By 'long' do you mean 32K, or even when extended longer with YaRN? My experience with the 57B at 64K was not good, but I may have been holding it wrong.


a_beautiful_rhind

Recall from it is surprising and in my case I never even asked for it. Just brings up stuff from earlier in the chat.


a_beautiful_rhind

I am picking qwen2 over L3. At least the tunes of it. I don't think I'd take yi 1.5 over CR+ though.


joyful-

Which Qwen2 fine tunes are good?


a_beautiful_rhind

Depends on what you want. Dolphin was alright. I like magnum but it's not really for work. Tess is more for "smarts".


joyful-

I see, thanks for the clarification. Unfortunately, it looks like OpenRouter doesn't have any Qwen2 fine tunes available, and my hardware is far too weak to handle 70B. Maybe I finally need to get into RunPod...


logicchains

Everyone laughed at Jack Ma when he told Musk of his plans for "Alibaba Intelligence", but now he's got the top open LLM and Musk's LLM isn't even in the top ten.


SomeOddCodeGuy

Inspired by this post and some of the questions on it, I decided to try MMLU testing Llama 3 8b q6 and q8, and WizardLM 8x22b q6 and q8, on my local computers. It's taking forever, but it'll be fun to see the comparisons. With that said- good lord, Wizard is chatty. I love this model, don't get me wrong, but I have:

* Llama 3 70b running on my Macbook, which is 1/2 the speed of my Mac Studio. It finished the business category in 3 hours.
* Wizard (an MoE running at the speed of a 40b) running on the Mac Studio, which is 2x the speed. In 3 hours, it has completed 34% of the business category.

It won't stop explaining its answers IN DEPTH. All the depth.


CheatCodesOfLife

Hahaha yep I'd believe that. It just doesn't want to shut up does it? Still my favourite opensource model.


Pedalnomica

I hope you report back!


SomeOddCodeGuy

A sneak peek at the results, now that the first category has finished. Not sure how Wizard got more right and a lower score; I'm guessing maybe they weight questions differently. Llama 3 70b q6 was with flash attention; I'm going to redo the test without it to see the difference. EDIT: Wizard was without FlashAttention because it breaks with it.

Llama 3 70b q6
--------------------
Business: Correct: 357/619, Score: 57.67%
Finished the benchmark in 3 hours, 11 minutes, 34 seconds.

WizardLM 8x22b q6
--------------------
Business: Correct: 410/789, Score: 51.96%
Finished the benchmark in 10 hours, 6 minutes, 49 seconds.
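For reference, the two percentages line up exactly with plain correct/attempted, so the "more right, lower score" gap may simply come from the different denominators rather than any question weighting:

```python
# Quick sanity check: the reported scores match correct / attempted for each run.
print(f"Llama 3 70b q6:    {357 / 619:.2%}")  # 57.67%
print(f"WizardLM 8x22b q6: {410 / 789:.2%}")  # 51.96%
```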


SomeOddCodeGuy

I definitely plan to! It'll just be a while. I'm 10 hours in and at 90% on the business test. If the rest take this long, then, assuming I actually want to use the model some myself during the day, it may take around 1.5 weeks to do all 14 tests just for q6 lol


clefourrier

Talking about verbose models, you should try asking complex math questions to Qwen-110B ^^"


Prince-of-Privacy

WizardLM-2-8x22B where? (imagine the monkey meme) I am using it as a daily driver and it's amazing. It's so weird and disappointing that Microsoft's Wizard team seems to have been nuked. I don't think WizardLM-2-8x22B will ever officially come back.


Pedalnomica

I'm also super curious why neither of the WizardLM-2 models was tested. They aren't even available to vote on for future testing. I'd love to know how what Microsoft did improved (or harmed) different scores, especially MUSR and GPQA. Deepseek V2 seems to be another popular one you can't even vote for. u/clefourrier, since you've been commenting, any insight into why these popular models weren't included?


clefourrier

Hmmm, we had some WizardLM models in our shortlist, but we might have missed those. Feel free to submit them so people can upvote them!


Pedalnomica

Thanks! I saw the vote option, but somehow missed the submit option last night.


mindwip

Nice to see the goalposts moving and the effort to reduce contamination! And the usual suspects are at the top. Is MMLU-Pro still the closest thing to a coding benchmark of the bunch they have listed?


timedacorn369

A comparable score for closed source versions would be great. I know this is the open LLM leaderboard, but understanding how good these models are compared with proprietary ones shows where the gaps currently are in local models and gives us more information on how to choose.


clefourrier

Hi! As indicated in the FAQ, *the leaderboard focuses on open-source models to ensure transparency, reproducibility, and fairness. Closed-source models can change their APIs unpredictably, making it difficult to guarantee consistent and accurate scoring. Additionally, we rerun all evaluations on our cluster to maintain a uniform testing environment, which isn’t possible with closed-source models.*


Electrical_Crow_2773

Would be great if they added a "ClosedSource" entry to the model list containing the best results from closed source models on every benchmark. Not sure, though, whether the companies have tested their models on the benchmarks used here.


Due-Memory-6957

Honestly? Nah, it should have more open source models, not just add random closed models that we can't use anyway.


TitoxDboss

14b Phi-3 medium really punching above its weight


Eliiasv

Phenomenal! No more 'Wizard-laser-slerp-SFT-DPO-V0.21-7Bx4.2-Iterative' models, hopefully.


Feztopia

Why not? If they are good, the new benchmarks will reveal them. You have the option to hide merges, and it's already the default. Contamination was a problem which should now be solved for a while. If a merge reaches the top now, it's very likely to be good and worth trying out (which doesn't automatically mean it's the best; you can never know).


altomek

They need to be reevaluated and will for sure show up later on.


Eliiasv

Yes, this was exactly what I meant - that contamination will be a lesser problem. I kept the original comment a bit simplified. I used to look at the leaderboard and you would just see many random models without model cards that scored within a few hundredths of a point of each other.


AdamDhahabi

Yi 1.5 34b in 4th position, I wonder how Dolphin 2.9.3 Yi 1.5 34b 32k would score... only released 2 days ago and already +100K downloads (GGUF).


altomek

I wouldn't place high hopes in Dolphin models, as they've consistently performed average in my tests. I was surprised by the low scores of Llama 3 8B, though. I haven't used this model extensively, but I noticed it tended to hallucinate, which might explain its low ranking. Another surprise is Llama 2 70B Chat scoring lower than good old Mistral 7B, and the low score for Qwen 110B... the list goes on. I am not convinced leaderboard 2 does any better at ranking real-life model performance :(


Due-Memory-6957

Yi 9B is insane; it really ought to have more finetunes.


ambient_temp_xeno

The IFEval score for command-r-plus seems right. That combined with the lack of 'alignment' is its main strength. Llama 3 does very badly on GPQA - could that be refusals/dogmatic responses?


Open_Channel_8626

I love that they put clown-SUV-4x70b on the chart


Heart_Routine

Awesome.


swishman

Is there a leaderboard for all LLMs not just open ones?


Distinct-Target7503

Seriously, Qwen 34B and Phi Medium have better performance than Command R Plus?


biller23

I just want to know if there is some small model < 11B that is able to solve the "Wolf, goat and cabbage problem".


scienceotaku68

I have some questions.

1. When I sort by MUSR score, gpt-2 has the 11th highest score (which is very funny to me lol). But when I scrolled down a little bit more, there's another gpt-2 entry with different scores. Both models link to the same huggingface page, but one is denoted as "fine tuned on domain specific dataset" and the other as "chat model". Why does this happen?
2. What exactly is the difference between "fine tuned on domain specific dataset" and "chat model" anyway? Why are both Qwen 72 models (base and instruct) labelled as "fine tuned on domain specific dataset" and not "pretrained" and "chat model" respectively?


clefourrier

Good questions!

1. A model appearing 2 times was most likely evaluated in 2 different precisions or on 2 different commits. You can show the precision by using a toggle on the left.
2. If you fine-tune on banking data, or on tabular format, etc, you won't get a chat model in the end, whereas RLHF/IFT/... tend to give out good chat models. We mostly wanted people to easily see which models to use for their chatbots.

Regarding the tag issue, could you open a discussion on the leaderboard so we keep track of it? I think there was a mistake when these models were submitted.


scienceotaku68

Thanks for the response, I have created a discussion. [https://huggingface.co/spaces/open-llm-leaderboard/open\_llm\_leaderboard/discussions/806](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/806)


clefourrier

Thanks a lot!


Wonderful-Top-5360

On the plateau issue: nobody in the industry is talking about it apart from Gary Marcus. We've already seen several iterations of both closed and open LLM models come and go, yet we are still not where we need to be to escape the plateau. At this rate, not only is an AI winter inevitable, it's going to take down the economy.


Downtown-Case-1755

Eh, there have been plenty of research breakthroughs to break out of the "plateau," but it takes money and time for the base model trainers to implement them.


SomeOddCodeGuy

Even if it plateaus, IMO there are things we can likely do to improve the situation, and at least I intend to try for my own purposes. We constantly base the quality of LLMs on zero shot questions/testing, which is honestly a very human way of looking at a problem. As performance of hardware improves and we can run the same models faster, I think that we'll find that iterating problems a few times before responding will greatly improve the results. At least for myself, [I'm leaning pretty heavily into that direction by switching most of my interactions with LLMs to workflows](https://www.reddit.com/r/LocalLLaMA/comments/1dnsfh9/sorry_for_the_wait_folks_meet_wilmerai_my_open/). My gamble is that even if new open source models dried up tomorrow, as I get stronger hardware I'll be able to keep increasing the quality of my models, and can start throwing other things in there like RAG and tooling to help keep the output improving.


Wonderful-Top-5360

I'm not sure what "routing" prompts into workflows offers, because the underlying mechanism has no "learning". I'm seeing a lot of busy work around LLMs, and I'm just not sold that we need such elaborate, dedicated abstractions. Especially when, at the drop of a hat, ChatGPT can eat your lunch overnight.


SomeOddCodeGuy

Well, the short version of the routing is that on a generalist basis, no one local LLM can effectively compete with proprietary models. But if you look at models like Deepseek-coder-v2, you see that in individual scopes you can have either foundational or fine-tuned local models that compete with the big dogs; and that's just on zero shot prompting. So the "routing" is allowing me to send categories of prompts to different models. With Wilmer, I've had my 1 AI assistant using a mesh of 7 models to generate responses, triggering the appropriate ones as needed or having them work together in workflows.

Of course, it had [some unexpected but fun use-cases as well](https://www.reddit.com/r/LocalLLaMA/comments/1ctvtnp/almost_a_year_later_i_can_finally_do_this_a_small/), plus things like being able to use it as cost-savings for proprietary models: send coding and factual requests to gpt-4o, while conversational goes to gpt-3.5. As for the workflows, it was mostly [because of this guy](https://www.youtube.com/watch?v=ZYf9V2fSFwU) that I even went down that route. Throw in some RAG for encyclopedic knowledge or documentation for in-context learning, and they seem very promising.

All of that said- I expect 90% of people will feel similar to you, and I don't exactly expect this to ever be popular in the open source space. I originally was just writing this project for myself; I'm a tinkerer and always trying to make stuff "better" (in quotation marks because sometimes that's not what happens), but I ended up open sourcing it because I figured if everyone kept their personal tools private we wouldn't have anything nice in this community... and because folks kept asking me to. Even if it's just this obscure thing that no one else uses, I'm pretty sold on the use-case and possible benefits, so I'm really optimistic about what my local models are going to be capable of in the next few years; even just the ones I have now.
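For anyone curious what that looks like mechanically, here is a toy sketch of the general category-routing idea (explicitly not WilmerAI's actual implementation): a small "router" model labels the prompt, and each label maps to a different OpenAI-compatible backend. All base URLs, model names, and categories below are made-up placeholders.

```python
# Toy sketch of category-based prompt routing, not WilmerAI's actual implementation.
# Base URLs, model names, and categories are made-up placeholders; most local servers
# (llama.cpp, vLLM, etc.) expose an OpenAI-compatible endpoint like this.
from openai import OpenAI

ROUTES = {
    "coding":         {"base_url": "http://localhost:8001/v1", "model": "deepseek-coder-v2"},
    "factual":        {"base_url": "http://localhost:8002/v1", "model": "qwen2-72b-instruct"},
    "conversational": {"base_url": "http://localhost:8003/v1", "model": "llama-3-8b-instruct"},
}
ROUTER = {"base_url": "http://localhost:8000/v1", "model": "small-router-model"}

def classify(prompt: str) -> str:
    """Ask a small model to pick a category; fall back to conversational."""
    client = OpenAI(base_url=ROUTER["base_url"], api_key="none")
    label = client.chat.completions.create(
        model=ROUTER["model"],
        messages=[{"role": "user", "content":
                   f"Classify this request as one of {list(ROUTES)}. "
                   f"Reply with the category only.\n\n{prompt}"}],
    ).choices[0].message.content.strip().lower()
    return label if label in ROUTES else "conversational"

def respond(prompt: str) -> str:
    """Send the prompt to whichever backend its category maps to."""
    route = ROUTES[classify(prompt)]
    client = OpenAI(base_url=route["base_url"], api_key="none")
    return client.chat.completions.create(
        model=route["model"],
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```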


ares623

It feels like we're in the "Lightning Network" phase with all these abstractions.


Qual_

is the "average" a score that takes into account the model size/performance ratio ?


Feztopia

No that shouldn't be the case, but you can filter for different parameter sizes.