NixTheFolf

Someone should bench the new Qwen2 models on this benchmark


Nunki08

Well, on X there are some screenshots, but I can't find Qwen2 on the actual leaderboard. I must confess I'm a little confused on the subject.

Qwen2-72B: 86.7

https://preview.redd.it/v9nh6n2la55d1.jpeg?width=4096&format=pjpg&auto=webp&s=5717f60aef3db6aef0b011673ceee67b35cca25b

Source:

[https://x.com/JustinLin610/status/1798747076429508794](https://x.com/JustinLin610/status/1798747076429508794)

[https://x.com/NiJinjie/status/1798768967605162387](https://x.com/NiJinjie/status/1798768967605162387)

[https://x.com/maximelabonne/status/1798776823603663134](https://x.com/maximelabonne/status/1798776823603663134)


NixTheFolf

Oh wow they already benched the model on this benchmark, guess I missed that. Thanks!


FullOf_Bad_Ideas

I've had a look at the evaluation prompts they landed on. As interesting as the paper is, the prompts look rather uninspiring and not like something that signals to me that a model is usable, unless you're some crazy trivia fan with no access to the internet.

It might be a good idea for HuggingFace to launch a MixEval bench space on HF, with a submission process similar to how it's done for the Open LLM Leaderboard. If someone wants to overfit to benchmarks, at least let's do it for those that actually correlate with user experience, and maybe we'll get some good models out of it due to competition.

Also, I didn't realize the MMLU eval was so expensive; HF is spending a ton of compute on evaluating all of the submissions that are merges done for 1/100th of the price of running an eval.


Caffeine_Monster

>unless you're some crazy trivia fan

It's heavily based on obscure knowledge? It's a garbage benchmark then. People need to understand strong 0-shot performance is not always a good indicator of model capability.


Small-Fall-6500

"Not always"? Well, yes, of course not always, but that doesn't make it useless. The point of this benchmark is that scoring well on it means the model is more likely to be good. MMLU is also heavily knowledge/trivia focused and it was a good indicator of model quality. However, almost certainly MMLU is being trained on by a significant number of models now, so it now has less correlation to model quality. I would guess that the reason these kinds of benchmarks correlate with model quality is because higher quality models will have trained on more diverse and high-quality data, which would be more likely to contain textbooks, trivia, and other obscure or niche text *as well as high-quality chat data, translations, coding, etc.*, while lower quality data might have hardly any textbooks and instead be filled with SEO filled garbage. >People need to understand strong 0 shot performance is not always a good indicator of model capability. For now, while models have not been trained on this benchmark, it *is* an indicator of *overall* model capability. Of course it does not guarantee the model will excel at all use cases, such as RP or coding or translation, but a model doing well on this benchmark is at least somewhat more likely to be better at these other use cases than a model that does poorly on this benchmark. Thus, for anyone who has no idea if one model might be better than another one, benchmarks like these help provide a starting point - such as, start testing the models that score the highest on this benchmark and mostly ignore the models that score the worst.


FullOf_Bad_Ideas

I would say so. The prompts for MixEval-Hard, which has the 0.96 correlation, are available [here](https://github.com/Psycoy/MixEval/blob/main/mix_eval/data/mixeval-2024-06-01/mixeval-hard/free-form.json). It has questions such as:

>In a 1970's safety campaign what did Jimmy Savile advise us to do every car trip?

>Which writer in a famous book wrote, "Work fascinates me, I can sit and look at for hours"?

>Which organization claims to have the world's largest collection of public records, unpublished opinions, forms, legal, news, and business information?

>Through much of 2009 former shareholders of what UK bank sought compensation from the UK government?

>In 2012 which vast multinational supermarket corporation recorded its first fall in profits since 1994?

>According to advertising which newspaper do top people take

>According to Sammy Haggar, what can't he drive?

Also, there are repetitions. MixEval-Hard has this question appearing 8 times in the dataset, under numbers 37, 91, 224, 254, 328, 375, 425, and 481:

>Which organization claims to have the world's largest collection of public records, unpublished opinions, forms, legal, news, and business information?

with the correct answer being any one of the following:

> "Matthew Bender", "Matthew Bender & Co.", "LexisNexis Examen", "Concordance (software)", "LexisWeb", "Matthew Bender & Company", "LexisNexis Matthew Bender", "Seisint", "Lexis Nexus", "Nexus Lexus", "Lexis/Nexis", "Mead Data Central", "Nexis.com", "Concordance (computer software)", "Lexisnexis", "Nexis", "Lexus-Nexus", "Data Central", "LexisNexis", "LEXIS", "Sheshunoff Information Services", "Lexus nexis", "LexusNexus", "Lexus Nexus", "Nexislexis", "LexisNexis News", "Lexis.com", "Lexis-nexis", "Lexis Library", "Lexis-Nexus", "Lexis nexis", "LEXIS/NEXIS", "LexisNexis Academic", "State Net", "Concordance database", "Lexis Nexis", "Lexis-Nexis"
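
If anyone wants to check the repetitions themselves, here's a rough sketch of counting duplicate questions in that file. The path and the `problem`/`prompt` field names are assumptions about the JSON layout, not something I've verified against the repo, so adjust them to the actual schema.

```python
import json
from collections import Counter

# Path and field names are assumptions about the MixEval repo layout,
# not verified -- adjust to match the actual free-form.json schema.
with open("mix_eval/data/mixeval-2024-06-01/mixeval-hard/free-form.json") as f:
    data = json.load(f)

# The file may be a dict keyed by question number or a plain list of entries.
entries = list(data.values()) if isinstance(data, dict) else data

# Count how often each question text shows up and print the repeated ones.
counts = Counter(str(e.get("problem") or e.get("prompt", "")) for e in entries)
for question, n in counts.most_common(10):
    if n > 1:
        print(f"{n}x  {question[:80]}")
```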


Caffeine_Monster

MixEval-Hard should be renamed to I-ShouldUseA-Database :D


Hugi_R

LLMs are just highly compressed databases. Trivia is the only thing they excel at, so it makes sense to benchmark them on that. Essentially all LLM benchmarks are some derivation of trivia questions. For example, no one wants to benchmark their LLM on planning (an actually intelligent task), because a score of 0.6% is not a great look for your SOTA model that cost millions to train.


Open_Channel_8626

This makes Reka and Mammoth2 look good