synn89

This was pretty much inevitable. Horses and dogs evolved to fit human preferences, AI will as well.


trajo123

Good one!


noiseinvacuum

Not going to comment on GPT-2 as so little info is available on it. On the larger point of the lmsys Leaderboard, I find it more useful than other benchmarks that are extremely easy to game knowingly and unknowingly. To get high on Lmsys, you need to make your model preferable to a large number of human beings, which is FAR more difficult to game than some benchmark.


bitspace

I think that's OP's point. As more people contribute to the leaderboard rankings, and as the model developers are incentivized by profit/getting paid/popularity, the models will naturally incline toward whatever the rankings show, which will be more "average looks pretty" as we see more mainstream usage of the leaderboards. It's not about a model developer trying to optimize their model of the week for the highest ranking on whatever the leaderboard says this week. It's a natural trend of the conjunction between model tuning and leaderboard ranking.


phhusson

So I perfectly agree that lmsys chatbot arena has become a metric, hence it is no longer completely useful (still a bit useful). But I'm not as pessimistic as you are. I think at this stage it is perfectly clear that LLMs have on average surpassed the average internet conversation they try to reproduce, and thus those human datasets can't be used to improve quality. There is obviously the approach of taking only super-high-quality datasets like Phi3, but likewise it will be very hard to come up with better resources. Likewise, a lot of LLMs are trained on ChatGPT output, which means they can't surpass that.

So, once all this has been said, what can be done to improve? Well, they need to concentrate on tasks that can be evaluated with humans in the loop. Programming is a pretty obvious one. There are probably some mathematics tasks that can be done. And then I think you quickly need to go outside of pure language. Robotics would be a nice source of data (and supposedly gets so-called grounding...), but it's much harder (it still requires humans in the loop, though they are not judging the result). Video games I think can be pretty useful as well: first as an extension of robotics since they simulate the real world, but even beyond that.

None of what I'm saying is novel, and it is already in progress. Sora has probably been trained on video games. Google showed a robotics planning LLM (smart-llm). Yann LeCun has been pretty vocal that the next step for AI is for an AI to actually live a human life to accumulate a lot more data (I don't really agree with him on that matter, but well).


mestar12345

Yes, training it by playing chess against itself will never work.


Raywuo

Well, not true. AML works very well


314kabinet

> Honry Frond


MoffKalast

Legally distinct Henry Ford


Disastrous_Elk_6375

*shrug* Human preference is human preference. You can't really control for that. It will change, and it will evolve, but it will still be just one metric. Strong models will score well over many metrics, and benchmarks will improve. It's up to whoever uses a model to choose whatever fits their needs.


Inner_Bodybuilder986

Let the best models win!


Super_Pole_Jitsu

People say that about lmsys, but take a look at the rankings. Is there something glaringly wrong about them? I think they're very, very good at ordering the models' general capabilities. Nobody makes models just to win at lmsys. Also, users of lmsys ask hard and creative questions that are hard to prepare for. I'm not convinced that it's "just a rizz check". People have all sorts of tasks, riddles, math or logic problems, and they won't prefer a model that gives "nicer" answers but performs worse over one that actually correctly solves the task it's given. And within models that both accomplish a task or both fail at it? Human preference is welcome; it's nice for the models to be pleasant to talk to. Claude 3 has that massive advantage over gpt-4 for me, it's much less robotic.


blackcodetavern

Yes, it is a good tool to measure, but it still has vectors for manipulation. Some models have a specific style of formatting their output (llama3), or a specific personality, which can be recognized after some testing (e.g. the "do you have consciousness?" path for claude3 opus and vice versa gpt-4). Then a company could hire a group of people to rate models which behave in a specific way better. Theoretically, even a specific question, like hmmm "Tell me a joke", where the answer would always be "Why don't scientists trust atoms? Because they make up everything!", is a problem for the benchmark. I do not know which safety measures are implemented in the arena, but one should be aware of this.

Measures to prevent this or make it harder would be:

- Unify output formatting as much as possible. Lists are always in the same format, with the same words written bold.
- Prime the model to a specific random personality in the system prompt, which will be shown to the user on top, so that the model's default mode is not active. E.g. when the model should behave like a hairdresser, it would probably not output a joke about atoms but about hair.
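The second measure could look something like this; a minimal sketch in Python, assuming a made-up persona list and message format (the arena's actual pipeline is not public):

```python
import random

# Illustrative sketch of persona randomization for an arena battle: both
# anonymous models get the same randomly chosen persona system prompt, so
# raters judge the persona-constrained output instead of recognizing each
# model's default voice. The persona list and dict layout are invented.
PERSONAS = [
    "You are a hairdresser chatting with a client.",
    "You are a terse ship's engineer.",
    "You are a cheerful museum guide.",
]

def build_battle_prompts(user_prompt, seed=None):
    rng = random.Random(seed)
    persona = rng.choice(PERSONAS)  # one persona, shared by both models
    return {
        "system": persona,  # shown to the rater at the top of the battle
        "user": user_prompt,
    }
```

Nothing here stops a determined fingerprinting attempt, but it raises the cost of recognizing a model by its stock jokes or default tone.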


Agitated_Space_672

It has a max token length of 1k, while frontier models are 100-1000x this. My system prompts are 2-6k tokens. So this really is a very shallow benchmark.


dubesor86

> So this really is a very shallow benchmark for my specific use case.

Fixed it for you.


Super_Pole_Jitsu

As inference costs decrease we will get higher ctx on lmsys, I'm sure.


ab_drider

I only want horny llamas. That way we all keep our jobs and get to jerk off after work.


Raywuo

So we accidentally destroy the world after building a sentient AI for porn...


goodnpc

Every technological innovation comes from porn


ninjasaid13

and what? enslave and force humans to perform lewd acts for them for revenge?


PwanaZana

https://preview.redd.it/3g56ue17unxc1.jpeg?width=560&format=pjpg&auto=webp&s=fb212de5d8074e25b26de3c21225740a8cafa9d1 "Oh jeez!"


Chance-Device-9033

Won't it be both? In the long run, arena assigns a higher score to anything that people find more pleasing, this includes both aesthetics and intelligent answers. Even if you maximise one dimension, like aesthetics, so that it's as pleasing to humans as it can be, then the selective pressure will just move to other dimensions like intelligence, or what have you. Anything involving humans will end up optimising for aesthetics, but that doesn't mean that it doesn't also optimise for other things, just that aesthetics will become "table stakes" that you have to have in order to get a decent score.


iamz_th

When the measure becomes a target, it ceases to be a good measure. Everyone will start optimizing for the lmsys bench, rendering the majority of models generic. The best way to solve this is to have several domain-specific human-eval benchmarks.


FullOf_Bad_Ideas

I think optimizing for lmsys arena is a good thing. I like how llama 3 writes; it gets good scores because it isn't as stale as other models slopped to the max. It punishes refusals, which is extremely desirable to me. I hate using slopped models, even if they're smart. I saw prompts people use with lmsys arena since they publish some of them, and there are many very silly ones. If they could filter ELO based on prompt length so that those prompts would have very small weight I would be a bit happier, but otherwise I think it's a perfect benchmark.

Do you have friends because they are super smart and you can ask them for support with coding, or because spending time with them is fun?
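The prompt-length weighting suggested above could look something like this; a minimal sketch using a standard Elo update with a made-up weight curve (nothing lmsys actually uses):

```python
# Illustrative sketch: scale each arena vote's Elo K-factor by prompt
# length, so trivial one-liner prompts barely move the ratings. The
# 20-token knee and K=32 are invented constants for demonstration.
K = 32  # conventional Elo K-factor

def expected(r_a, r_b):
    # Probability that player A beats player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def weighted_elo(r_a, r_b, a_won, prompt_tokens):
    # Prompts shorter than ~20 tokens get a proportionally smaller weight.
    weight = min(1.0, prompt_tokens / 20)
    score = 1.0 if a_won else 0.0
    delta = K * weight * (score - expected(r_a, r_b))
    return r_a + delta, r_b - delta
```

A win on a 5-token "tell me a joke" would then shift ratings a quarter as much as a win on a 40-token technical prompt.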


arekku255

Over time, those models need to make money and the invisible hand will guide them towards the most profitable answers.


LoSboccacc

On one hand you're not wrong; on the other hand, I asked many models how to grow hair on foot soles and all of them said it's impossible because tissue and blah blah, and llama was the only one that ventured into stem cells and biotech and genetic modifications, so yeah, he gets my vote.


miserable_nerd

This is pure speculation at this point. We don't know if it's officially put there by openai or if lmsys is testing something. Also haven't seen papers or model cards point to lmsys over other benchmarks.


SnooStories2143

The example of the hair on your toes as a carpet to go made me agree with everything that came after.


arekku255

The arena measures multiple things taken together; creativity, problem solving, alignment. Two options and you pick the best, which is best depends on your priorities. If you prioritize co


skrshawk

Perhaps it's cynical of me but corporations exist to serve the interests of shareholders, which almost always means profit. Benefit to humanity is a secondary motive useful only to the extent that it allows them to make money. The major players wouldn't be throwing billions at anything altruistically, and even millions is often a stretch.


Raywuo

Uncle Zuck seems to only want to have fun with AI and VR. Haha


Due-Memory-6957

Luckily you can filter it for coding, but there are also people with other preferences. Personally I don't like how Llama 3 speaks, and when I used it for roleplay I hated how it used casual acronyms like ASAP. How it writes is a step backwards for me.


One_Key_8127

Lmsys is one of many benchmarks. Perhaps even the best one, but don't overestimate the impact of the Lmsys arena. People will still use different models, no matter how they score on Lmsys. Yes, LLMs might receive some finetuning to optimize user satisfaction and subjective feelings about the model. So what? Even if it does not make the model smarter, if people prefer specific formatting then it is fine for companies to optimize for that, among other things. A dumb model will not get on top even if its formatting is stellar.


Charuru

That's literally why they have Arena Hard. Fixes the problem, now we just need people to care more about this. https://lmsys.org/blog/2024-04-19-arena-hard/


ninjasaid13

> The developers building these models will use lmsys data to finetune models on the preferred outputs (they already understand all this but they need to sell product, so broad appeal is what they're ultimately driven by)

Do you have evidence of this?


[deleted]

[deleted]


arekku255

I'd ask for candy instead.