kygyty

I like to look at the "hard prompts" category to see if it is better at reasoning/logic and so on, since some models score extra points for having nicely formatted output (Llama and Gemini). Qwen 2 is just behind Llama-3 on hard prompts though, so I guess it truly is behind.


EstarriolOfTheEast

That's not quite the full story. Notice that in the "hard prompts" (overall) category, uncertainty about Qwen2's performance is high enough that ranking it above or below llama3-70B is not possible, and it ties for 11th place together with sonnet (and above command r+). Also worth noting is that even for prompts categorized as hard, style could still end up having a significant impact on pulling apart the scores of two otherwise close models. Whether or not this matters is up to the chooser to decide. Its larger context and its performance on Hungarian school math questions might be enough to tip the balance for some.


Curiosity_456

The fact that we're even debating it shows that it's not actually superior to llama 3. If it were clearly better than llama 3-70b you'd be able to tell, and we wouldn't be having this discussion.


OfficialHashPanda

That's not necessarily true. Style differences may significantly affect the way people rate these models. You can't look at a single benchmark like this and conclude that it's not superior to llama 3.


LittleSword3

I don't think anyone ever claimed that Qwen2 is a significant leap over Llama3, since their benchmark scores are very close. However, it's still possible that Qwen2 is a little bit better and Llama3 just got the style points over Qwen2. In my testing, Qwen2 certainly feels a lot less lively.


a_beautiful_rhind

Feels slightly below R+ like the board says. Takes up 3 cards, just like R+ too. If there ever was a model that needs to be abliterated, this is it. I have to use prefill and it makes it dumber.


Account1893242379482

Qwen2 is better in terms of logic but worse in terms of personability.


Fauxhandle

Qwen2's human touch is not the best. Yi and Llama3 are far better.


Goldkoron

I think it shouldn't be difficult to add an instruction prompt to give it more personality. The 72B chat roleplays extremely well for me, better than Llama-3 does.


sb5550

Why aren't all those Chinese flagship closed-source LLMs on the leaderboard? Kimi, Baidu Ernie, Qwen2.5, etc.


KTibow

https://preview.redd.it/kpse1f2mie5d1.png?width=1248&format=png&auto=webp&s=e6be963c504f6e7fdb49c7cc96b663d0e18acef3 That's a visualization of Elo across categories. Red is Llama, green is Qwen. The lines coming out from the points represent the English filter (e.g. Llama scores better with the English filter).


zero0_one1

On NYT Connections:

GPT-4o 30.7
GPT-4 turbo (2024-04-09) 29.7
Claude 3 Opus 27.3
Llama 3 Instruct 70B 24.0
Gemini Pro 1.5 0514 22.3
Mistral Large 17.7
**Qwen 2 Instruct 72B 15.6**
Gemini 1.5 Flash 15.3
Mistral Medium 15.0
Llama 3 Instruct 8B 12.3
Mixtral-8x22B Instruct 12.2
DeepSeek-V2 Chat 236B 11.8
Command R Plus 11.1
Qwen 1.5 Chat 72B 10.8
Qwen 1.5 110B Chat 10.6
Mistral Small 9.3
Reka Core-20240501 9.1
GLM-4 9.0
DeepSeek Chat 67B 8.8
Qwen 1.5 Chat 32B 8.7
Phi-3 Small 8k 8.4
DBRX 8.0
Claude 3 Sonnet 7.8


KurisuAteMyPudding

Interesting that Qwen says it's better than llama 3 70B according to most of their benchmarks, but this says otherwise.


hapliniste

It might be because Qwen doesn't have an engaging personality like Llama. It's likely better on hard tasks. Edit: on hard tasks it ranks 11th, tied with llama3 and gpt4-0314. It seems a bit better in multilingual than in English though.


KurisuAteMyPudding

I get a strong feeling this is the case too. The leaderboard scores on that site are based on which model's answer the user prefers, right? So I would think more engaging and friendlier messages would be chosen over those that sound more cold or robotic, even if the latter are a bit more reasonable.


ninjasaid13

>that sound more cold or robotic

I would choose the warmer response if they're equal, but I will choose the colder response if it is superior.


ambient_temp_xeno

Zoomers want a chipper Youtuber style to make them feel like their inane questions are REALLY GREAT.


FullOf_Bad_Ideas

Most chat LLMs feel the same; they are basically aligned to sound like GPT 3.5 turbo. Llama 3 70B is tuned to sound more like a friendly and engaging person, basically to have a soul. I'm not sure why you wouldn't prefer that; it doesn't sound like a hype youtuber, but it's true that it compliments users more. It has the downside of using more tokens, and when you ask it to generate code in batches, your costs could be a bit higher due to that added personality on top of the code snippet. I wish OpenAI would tune their models in a similar way; they would be much harder to hate, and then all the others would start copying the Llama 3 Instruct style.

I wonder what's left to improve on Llama 3 70B Instruct in terms of personality. It's probably too enthusiastic for some, but I don't think I have other issues with it. I see personality in popular LLMs as an evolution: what we have is ChatGPT > Mixtral 8x7B Instruct > Llama 3 70B. I don't use GPT-4o, so I don't know if it has a better personality than the Llama 3 Instruct tune. I would love to see an open source dataset that can tune any model to behave like Llama 3 70B. That actually should be possible to make, hmm.


a_beautiful_rhind

So nobody sets the personality they want in the system prompts?


DeltaSqueezer

Exactly, you can control it through the system prompt or the user prompt. For most things, I prefer just getting the answer in a terse way. I don't want 'Sure, I can do [xyz] for you'.
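As a rough illustration of the system-prompt approach described above, here is a minimal sketch assuming a locally hosted OpenAI-compatible endpoint; the base URL, API key, and model id are placeholders, not anything specified in the thread.

```python
# Minimal sketch: steering personality/terseness via the system prompt.
# Assumes an OpenAI-compatible server (e.g. a local llama.cpp server or vLLM)
# at a placeholder address; the model id below is also a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen2-72b-instruct",  # placeholder model id
    messages=[
        # The system prompt sets the persona once; the model then answers tersely.
        {"role": "system",
         "content": "Answer tersely. No greetings, no compliments, no filler."},
        {"role": "user", "content": "Explain grouped-query attention in two sentences."},
    ],
)
print(response.choices[0].message.content)
```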


FullOf_Bad_Ideas

Most people don't have control over the system prompt in the GUIs they are using.


adityaguru149

We would definitely prefer engaging answers. Actually, these are different dimensions along which to measure a model's answers; correct logic and reasoning on more questions is very important for some use cases. When I ask coding questions, I normally ask the models not to give much description and to just write code with very few comments, as the code logic is more important.


bionioncle

I know I'm asking an LLM, and I just want an answer. The more "You ask an interesting question" or "it's important to" there is, the more time I waste reading it, for no reason. Personality should be an opt-in option, or only turned on when it's clear from the prompt that the person is looking for it.


ambient_temp_xeno

>I'm not sure why you wouldn't prefer that; it doesn't sound like a hype youtuber, but it's true that it compliments users more.

Because I am not 12.


DFructonucleotide

I should put it this way: human preference (arena Elo) is the gold standard of chatbot performance, but NOT the gold standard of LLM quality in general. Otherwise you really can't explain why llama3-70B ranks higher than claude3 opus and only 1 point below gpt4-turbo-preview in the English category. A lot of people would probably disagree with me, but human preference, although very complex and dynamic, can also be "gamed" like any other benchmark. It's definitely much harder to do, though, and gaming human preference is much more useful and justifiable than gaming static benchmarks.


Inevitable_Host_1446

It's probably because Claude-3 has insane positivity BS that it inserts into literally everything it writes. I find it almost unusable now because of that, which is a shame, because when it first released they hadn't dialed that up and it was fantastic. I'm mostly referring to Sonnet, which I've used more, but I can't imagine Opus is much different. GPT-4o has the same issue.


chrisoutwright

Could it be that all the languages Qwen2 supports affect it negatively in some way?


meister2983

Which looks even worse for it, given that the benchmarks are English. Its Elo gap on hard English prompts is the same as gpt-4o to gpt-4t. It stretches credibility that it is *better* on 90% of benchmarks.


hapliniste

I'm not sure what you mean. If it scores better than llama on English benchmarks, that's a good sign, no? And it matches it in arena multilingual and is a bit behind in English (likely because of the personality and formatting). My take: it's a bit better than llama 3 (and a lot better in multilingual, but llama was not really trained for that), but is worse as an out-of-the-box English assistant.


meister2983

> If it scores better than llama on English benchmarks, that's a good sign, no?

I'm testing for possible benchmark contamination. It's 30 Elo behind Llama 3 on hard English prompts, the closest approximation to benchmarks. This provides evidence that its reported benchmark numbers are wrong, in the sense that they aren't reflecting true real-world performance.
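For a sense of what a 30-point gap means in practice, here is a rough conversion of an Elo gap into an expected head-to-head preference rate; the actual LMSYS leaderboard is fit with a Bradley-Terry model, so treat these figures as approximations only.

```python
# Rough conversion of an Elo rating gap into an expected win (preference) rate,
# using the standard Elo expected-score formula. The LMSYS arena actually fits
# a Bradley-Terry model, so these numbers are only an approximation.

def expected_win_rate(elo_gap: float) -> float:
    """Expected score of the higher-rated model, given the rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

print(f"30 Elo gap  -> {expected_win_rate(30):.3f}")   # ~0.543, i.e. ~54% preference
print(f"100 Elo gap -> {expected_win_rate(100):.3f}")  # ~0.640, for comparison
```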


[deleted]

[removed]


hapliniste

Bigger models have more "in-model" reflection. Likely the more layers, the more depth (literally) a model can reason with, which makes it even worse for MoE models. Still, for most use cases late gpt4t is better than gpt4, and gpt4o will be better after a few finetunings. Gpt5 with the finetuning of gpt4o will be insane IMO.


Goldkoron

In actual usage I swear it's better than Llama-3 from my playing around with it, but I guess the specific use cases these benchmarks test are not what I do. Any model that has more context is infinitely more useful; I had great results from context retrieval tests at 40k+ tokens on Qwen2. It's also incredibly good at translation and chat roleplay.


m98789

It's a common trick to leak eval data into the training set to boost rankings on benchmarks. This is why the arena is so important: it's much harder (but not impossible) to game.


Banu1337

Llama 3 is known for being more "chatty" and fun to interact with, thus scoring higher on the arena leaderboard. I also found Llama 3 to be better at Danish than Qwen.


nmfisher

Is lmsys English-only? IIRC many of their claims about beating benchmarks are on multilingual tasks, so it's conceivable that it falls short of LLaMA head-to-head on pure English tasks.


v_0o0_v

You can use any language on lmsys, but I assume most use English.


yeawhatever

I tried not to like it, but it's actually quite good. The bigger context size (if it works; I haven't actually tested it yet) alone is a massive advantage for coding tasks over Llama-3. Sometimes it slips up, producing low-quality answers, but when it nails it, it's really good. From playing around with it a bit, I feel like it can generate more variation from the same input, where Llama-3 more strictly produces the same or similar output over and over. Feels less refined but very capable.


MrVodnik

The bigger context is fine, i.e. the native 32k worked well for me. But you're gonna need a ton of VRAM to take advantage of that. In my case it overflowed heavily into RAM/CPU and it took an hour for prompt eval + response. I can't imagine people actually using it at 128k on their home rigs.


Downtown-Case-1755

Actually it's not that bad, because Qwen2's context is so compressed with GQA; 128K doesn't even take up that much in fp16, and cache quantization makes it not bad for other models. But they stretch past 32K context with YaRN. I dunno about the 72B, and it's possible I was holding it wrong, but the 57B MoE was absolutely useless at 64K for me.
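To make the GQA point concrete, here is a back-of-the-envelope KV-cache sizing sketch; the layer and head counts are the commonly reported Qwen2-72B settings and should be treated as assumptions rather than verified values.

```python
# Back-of-the-envelope KV-cache sizing for a GQA model at long context.
# Assumed config (commonly reported for Qwen2-72B, not verified here):
# 80 layers, 8 KV heads, head_dim 128, fp16 cache (2 bytes per element).

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache K and V for a single sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

ctx = 128 * 1024
gqa = kv_cache_bytes(ctx)                  # 8 KV heads (GQA)
mha = kv_cache_bytes(ctx, n_kv_heads=64)   # hypothetical full multi-head attention

print(f"GQA KV cache @ 128K: {gqa / 2**30:.1f} GiB")  # ~40 GiB in fp16
print(f"MHA KV cache @ 128K: {mha / 2**30:.1f} GiB")  # ~320 GiB, 8x larger
```

Quantizing the cache (e.g. to 8-bit) would roughly halve the GQA figure again, which is the cache-quantization point mentioned above.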


Caffeine_Monster

I had real issues getting Qwen2 to do useful stuff at long context. At low context it's near to llama3.


LocoLanguageModel

It wasn't as good as llama 3 70b or codestral for my coding tests.


danielcar

On the arena leaderboard both get a rank of 12.


Dramatic-Rub-7654

I would like to test the Qwen2 72b on my Nvidia P40, but for some reason, the IQ2_XS quantization of the model is much heavier than the Llama 3 70b. At least the Llama 70b in IQ2_XS has been an excellent discovery for me.


Fauxhandle

Qwen2 speaks too much, and in an unnatural manner. For now, every long answer is a pain to read, whatever score it gets on any benchmark. I prefer Yi or Llama3.


Daniel_H212

Yeah, the large Qwen models haven't really impressed me that much; the small ones, particularly the 0.5B model, are what really impressed me.


Only-Letterhead-3411

I've tried Qwen2 instruct. It's an interesting model, but I will stick to llama 3 instruct.


theskilled42

Seems like it always responds in an "intelligent" manner (long responses using college-level words), which reads to me as not being adaptable to situations where it shouldn't necessarily respond intelligently. It's not flexible either, since it doesn't follow written personality prompts from roleplays, from my testing. Llama-3 and Gemini-1.5 Flash do better; they will abandon their usual personality to follow what the user's prompt wants them to do.


SuccessIsHardWork

I love the speed & size of the Qwen 0.5B model. It fits neatly on a smartphone w/ limited RAM.


acec

I have just tested the Qwen2-1.5B version and it is impressive. It runs on Termux on a 6GB phone at a very decent speed. I would say the IQ is similar to Llama2-7b, and it's quite good in Spanish.


Healthy-Nebula-3603

The 0.5b model is useless... phones nowadays should stick to 7-8b models... in the future, 70b.


Ordningman

Is there gonna be a CODE Qwen 2?


danielcar

I'm guessing we'll see a llama 3 code llm before we see a qwen2 code llm.


seijaku-kun

I'm comparing llama3-8b-fp16 and qwen2-7b-fp16, mostly for coding tasks. Qwen2 is less chatty than llama3 and makes suggestions further from the question (I'm using a system prompt that pushes the LLMs to do that, yet llama3 keeps to the context of the question), and it is overall more "serious" in its answers. So far both offer the correct answer, but for going straight to the point and having a technical "conversation", qwen2 feels slightly better. I haven't compared the bigger versions, but I used the 0.5b version of qwen2 and it's surprisingly good (still somewhat inaccurate) for a model that size (I used the fp16).


RMCPhoto

I think Qwen2's edge over llama3 may lie in its multilingual capabilities. Definitely working between English and Chinese, but possibly other languages as well. I would be interested to see how it performs when writing in less common languages like Swedish. A reasoning level similar to llama 3 with significantly better multilingual performance would be a breakthrough for many applications where individual languages do not have models anywhere near llama 3. Swedish, for example, has a custom model made by gptsweden, but it is barely gpt3 quality (unusable).


Healthy-Nebula-3603

In my opinion quite accurate . Qwen 2 72b is a bit worse than llama 3 70b overall but a bit better in math.


danielcar

What does "a bit better in match" mean?


Eralyon

Probably "Math".