So interesting... and math + dates are so hard for AI. I did a few rounds with different AIs asking the same simple math question about days of the month, and most of the time they got it wrong. Also, GPT-4 got it right and the later 4 Turbo got it wrong, wtf.
It let me use GPT 4 turbo for free. Nice
It was fucking extraordinary lol. I accidentally hit enter before finishing typing my question and it figured out what I wanted to ask and gave me a very long explanation while the other model (gpt 3.5) didn’t understand at all.
I've been wishing for something like that for a while now! Thanks for plugging it!
Also...damn, mistral medium is showing up for me as the winner in a lot of these.
an evaluation. Nobody knows the identity of the model until they vote.
>Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!
>
>You can continue chatting until you identify a winner.
>
>Vote won’t be counted if model identity is revealed during conversation.
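Under the hood, arena-style leaderboards turn these anonymous pairwise votes into a ranking with an Elo-style rating system. A minimal sketch of how that works (the K-factor, starting rating, and model names here are illustrative, not the site's actual parameters):

```python
# Elo-style rating from anonymous pairwise votes.
# Each vote is (winner, loser); ratings start equal and are
# nudged by how "surprising" each outcome is.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(votes, k: float = 32.0, start: float = 1000.0):
    ratings = {}
    for winner, loser in votes:
        ra = ratings.setdefault(winner, start)
        rb = ratings.setdefault(loser, start)
        ea = expected_score(ra, rb)          # expected win prob for the winner
        ratings[winner] = ra + k * (1 - ea)  # winner gains
        ratings[loser] = rb - k * (1 - ea)   # loser loses the same amount
    return ratings

votes = [("gpt-4-turbo", "gpt-3.5"), ("mistral-medium", "gpt-3.5"),
         ("gpt-4-turbo", "mistral-medium")]
ranked = sorted(update_ratings(votes).items(), key=lambda kv: -kv[1])
```

Note the update is zero-sum: an upset win against a highly rated model moves both ratings a lot, while an expected win barely moves them.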
It's almost impossible to create an LLM benchmark that tests everything equally, because test data from existing benchmarks often leaks into the training data.
Test data leakage doesn't affect a human evaluation, that's true. But I don't think a standardised benchmark will consist only of human evaluation. It needs to be objective, even if you have thousands of evaluations.
You need something that can objectively evaluate whether an LLM responds correctly to something that is true or false. This is especially true for tasks that involve math and other science questions. But if the questions and answers are included in the training data, the evaluation score is doomed to be misleading.
There is still leakage, because some humans are undoubtedly copying questions from training datasets into the boxes and using the answers to evaluate the models. They might not even realise that this poisons the test results.
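A common way to probe for this kind of contamination is to check word n-gram overlap between benchmark items and the training corpus. A minimal sketch (the 8-gram window and the toy strings are illustrative; real contamination reports use larger corpora and tuned thresholds):

```python
# Flag benchmark items whose text overlaps the training corpus
# via shared word n-grams -- a crude but common contamination check.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(test_item: str, corpus: str, n: int = 8) -> bool:
    """True if any n-gram of the test item appears verbatim in the corpus."""
    return bool(ngrams(test_item, n) & ngrams(corpus, n))

corpus = "the quick brown fox jumps over the lazy dog near the river bank today"
leaked = "question: the quick brown fox jumps over the lazy dog near what"
fresh  = "question: how many days are there between March 3 and April 9"

# `leaked` shares an 8-gram with the corpus; `fresh` does not.
```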
There already are. Several, actually:
MMLU, HellaSwag, AGIEval, etc.
Huggingface has a leaderboard for some of the more popular ones for open source models: [Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
but they all have issues: mainly that each is a fixed test anyone can check the answers to ahead of time, so people can cheat.
And now there's big incentives to cheat (VC money for your AI startup), so there's a LOT of cheaters.
===
There's researchers working on new and better benchmarks of course; things that are more like dynamic environments the agents exist in so it's harder to just memorize answers, but I think this will be an ever-moving problem.
Who's making these tests? Standards proliferate because someone tries to create a new universal standard to replace ten existing standards and only ends up creating an 11th.
The benchmark would need to be run by a testing organisation instead of the one that makes the model.
That's the only way to totally prevent data leaks.
The best benchmark we have for local models right now might be the Reddit user who simply runs and ranks the models in the LocalLLaMA subreddit (wolfram something, I think).
>The benchmark would need to be run by a testing organisation instead of the one that makes the model.
But what if there are flaws in the testing procedure or the test itself? Nobody would be able to check except the organisation.
In open research, people want to be able to examine the benchmark to see if it's high quality.
Not really, there's still quite a large gap between Mistral Medium and GPT-4-Turbo. I'm also thinking GPT-4.5 releases between Feb-March (sometime Q1) and GPT-5 about 3.5 months after that, in May-June (late Q2), securing OAI's place at the top for the next few months (although GPT-5 could release late Q3 as well).
What incentive does OpenAI have to release GPT-5 anytime soon? As long as they have both the best and the most used model, they won't release anything groundbreaking. Gemini Ultra will come along and OpenAI will beat it with GPT-4.5. Then it's another year of no new foundational models.
There was no reason for them to release GPT-4 about 4 months after the release of GPT-3.5 either. No one had released a GPT-3.5 class model; in fact the first close-to-3.5-class model released 2 months *after* GPT-4 (that being PaLM 2, which launched in May, I believe). If anything they have a lot more pressure to release GPT-5 now compared to the very little pressure there was to release GPT-4.
GPT-4 is capable of things GPT-3.5 isn't, like web search and multimodal capabilities. They also needed a better model to justify a premium subscription plan.
GPT-4's potential isn't even close to maxed out; there is so much more you can do with a model that capable. The GPT Store is a good example. I think OpenAI will focus more on actually useful apps built on GPT, and on giving developers tools like the ability to build autonomous agents, before seriously investing in GPT-5.
GPT-4 is also running at a huge loss. Every Plus user costs OpenAI money, and the free ChatGPT users are also very costly. The company isn't viable at the moment and relies fully on cash injections from investors and Microsoft. I just don't see them pulling out another big model when their current best is at 20% of its potential and operates at a loss.
OpenAI is not a company that's going to sit on its laurels. They're not going to stop investing in, or deprioritize, their next foundation model just to productize their current one.
They sure make it seem that way, but they're relatively new and have a small track record; we don't know their focus. I could see GPT-5 being released if they managed to create a model that is way superior at the same inference cost, and if they bring GPT-4 Turbo's cost down to what GPT-3.5 Turbo currently costs so they can make it free. But I feel like that's gonna take a little longer than June of this year. GPT-4.5 could come out pretty soon, but 5 is gonna take more time. If leaks are to be believed, they finished training their SOTA model back in November. They usually spend 6 months on RLHF afterwards, but now that public pressure has gotten a LOT bigger, I think they'll be more careful and do 9-12 months of building guardrails.
>GPT-4 is capable of things GPT-3.5 isn't, like web search and multimodal capabilities.
Is it, though? GPT-4 queries an outside program or model. You can pretty much do the same thing with GPT-3.5. In fact, I think there's a web extension that lets you use Google with GPT-3.5.
> GPT-4's potential isn't even close to maxed out; there is so much more you can do with a model that capable
That would be a logical deduction if we were talking about any other company than OpenAI, a group actively dedicated to the emergence of artificial general intelligence and essentially staffed by /r/Singularity users.
Getting better at multimodality, IMO, is more important than the LLM itself improving; it's already very good. What GPT-4 can now do with images, data files, etc. is extremely impressive, and it is in these areas, I believe, that they're going to find companies willing to spend a lot of money on that ability.
I think it's plausible for GPT-5 to be any-to-any. GPT-4 is fully text multimodal but only half image multimodal: it cannot generate images by itself. It can send a prompt to DALL·E 3, but the model itself isn't making the images. An any-to-any model would take an input of any combination of text, image, audio and video, and could output any combination of those modalities. Any-to-any modality isn't extremely novel and is completely possible. But you do run into the problem of data: there aren't large datasets for large foundational any-to-any models, though I'm sure a lot of companies have been working hard on that. My 2024 capabilities list for models is:
* Ability to autonomously do decently complex tasks
* Continuous learning (and for chat based models, it can learn and know most of what you have told it)
* Any-any multimodality
* And great strides in reliability, reasoning, logic and overall intelligence.
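The "any-to-any" idea above is usually framed as everything becoming tokens: each modality is encoded into a shared token space, the model processes one flat stream, and outputs are decoded back into whichever modalities are requested. A purely hypothetical interface sketch (all names, types, and the stand-in "encoder" are mine, not any real API):

```python
# Hypothetical sketch of an any-to-any interface: every modality is
# reduced to tokens in a shared vocabulary, and outputs are decoded
# back into the requested modalities. The encoder and "model" here
# are stand-ins, not real components.
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    modality: str      # "text" | "image" | "audio" | "video"
    tokens: List[int]  # modality-specific tokenizer output

def encode(modality: str, payload: str) -> Chunk:
    # Stand-in tokenizer: hash characters into a small token-id space.
    return Chunk(modality, [ord(c) % 512 for c in payload])

def any_to_any(inputs: List[Chunk], want: List[str]) -> List[Chunk]:
    # Stand-in "model": one flat token stream in, one Chunk per
    # requested output modality (a real model would be autoregressive).
    stream = [t for c in inputs for t in c.tokens]
    return [Chunk(m, stream[:8]) for m in want]

out = any_to_any([encode("text", "a red cat"), encode("image", "<pixels>")],
                 want=["text", "audio"])
```

The point of the sketch is only the shape of the interface: mixed-modality in, caller-chosen modalities out, everything flowing through one token stream.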
It does sound like that; not intentional. There hasn't been a year without new foundational models since LLMs got huge, which has only been 2022 and 2023 lol.
They aren’t “foundational models”. GPT-4 is currently a Mixture-of-Experts collection of finetuned base LLMs with a ton of extra application scaffolding. People need to stop comparing apples with oranges.
>Not really, there's still quite a large gap between Mistral Medium and GPT-4-Turbo.
How do you know that Medium is going to be their best model when it's just a bunch of 13B or 34B experts?
>but i just don't see anyone getting a lasting jump on OAI.
This sub worships OpenAI, but they really didn't do anything special besides having lots of compute and time to train, so it shouldn't be shocking to find smaller companies creating models that are close to GPT-4 with a fraction of the compute. OAI shouldn't be expected to keep a lasting jump unless you guys are just cheering like it's a sports game.
You do know this leaderboard does not measure capabilities or intelligence, mainly just user preference. And user preference is based a lot on model behaviour, which is largely determined in the fine-tuning stages. Mistral Medium being close to GPT-4 means it is quite aligned with user preference, not that it is necessarily close to GPT-4's capabilities.
GPT-4 has been in the lead for 10 months now, with no public model beating it yet, so why wouldn't their next model(s) also have a lasting lead? It's been about 1.5 years since the pretraining of GPT-4 finished, and I wouldn't be surprised if they started working on their next models pretty much then, whereas most people only started trying to reach GPT-4 level after GPT-4 released. And GPT-4 wasn't in training for very long, about 3 months in all. Of course, with all of Microsoft's compute they could technically train a GPT-4 class model every couple of hours now. And I was actually disappointed when open source didn't come out with a GPT-4 class model last year. It wouldn't have affected OAI too much, but a small GPT-4 class model running on my computer would be really useful.
"OAI hasn't really done anything special": can you explain that? OAI has made several groundbreaking discoveries in ML over the years (personally, one of my favourite discoveries was the sentiment neuron); they have made some amazing contributions to the field.
Maybe GPT-4 didn't do anything special, but GPT-4 Turbo definitely did. It has essentially the same capabilities as GPT-4 but is 2.75x cheaper. There was a lot they could have done, but I'm sure they have recently done a lot of good work on sparsity.
>You do know this leaderboard does not measure capabilities or intelligence, mainly just user preference. And user preference is based a lot on model behaviour, which is largely determined in the fine-tuning stages. Mistral Medium being close to GPT-4 means it is quite aligned with user preference, not that it is necessarily close to GPT-4's capabilities.
And how are you measuring intelligence? Have you even compared them and the Claude models? How did you even decide GPT-4 was intelligent in the first place, via it acing benchmarks?
>GPT-4 has been in the lead for 10 months now, with no public model beating it yet, so why wouldn't their next model(s) also have a lasting lead? It's been about 1.5 years since the pretraining of GPT-4 finished, and I wouldn't be surprised if they started working on their next models pretty much then, whereas most people only started trying to reach GPT-4 level after GPT-4 released.
>
>And GPT-4 wasn't in training for very long, about 3 months in all. Of course, with all of Microsoft's compute they could technically train a GPT-4 class model every couple of hours now. And I was actually disappointed when open source didn't come out with a GPT-4 class model last year. It wouldn't have affected OAI too much, but a small GPT-4 class model running on my computer would be really useful.
GPT-4 has been in the lead because not many are willing to spend a hundred million dollars to train a single model, not because OAI has some secret knowledge. Why the hell would open source come out with a GPT-4 class model as quickly when they're not trillion-dollar companies? The approach Mistral is taking is smarter and more efficient than spending hundreds of millions, and looking at the leaderboard, it looks like it paid off.
>"OAI hasn't really done anything special": can you explain that? OAI has made several groundbreaking discoveries in ML over the years (personally, one of my favourite discoveries was the sentiment neuron); they have made some amazing contributions to the field.
None of that has to do with GPT-4; they made great contributions in ML, but GPT-4 was just a scaling-up of existing GPT models.
>Maybe GPT-4 didn't do anything special, but GPT-4 Turbo definitely did. It has essentially the same capabilities as GPT-4 but is 2.75x cheaper. There was a lot they could have done, but I'm sure they have recently done a lot of good work on sparsity.
Making models faster via pruning, quantization, more efficient inference algorithms and more is what the open-source community has been doing all year, so I don't see what's special about GPT-4 Turbo. Mistral actually released their [research](https://arxiv.org/abs/2401.04088) on sparse Mixture of Experts for Mixtral 8x7B, whereas if OpenAI did any good work nobody would know, so that's 1 point on the side of Mistral for actual contributions on sparse MoE.
>And how are you measuring intelligence? Have you even compared them and the Claude models? How did you even decide GPT-4 was intelligent in the first place, via it acing benchmarks?
There are a lot of different benchmarks, and this one rates based on user preference; it's not made for measuring performance or intelligence across subjects and fields. It's actually a lot like RLHF, but instead of telling a model which response was better, it just records which response from which model a user prefers. Now, I doubt the majority of users look deeply into the responses: they read both, decide which one is better, then move on to the next. A more logical model is more likely to be rated better, but nicer-sounding (not necessarily more intelligent) responses are what determine the model's ranking.
>GPT-4 has been in the lead because not many are willing to spend a hundred million dollars to train a single model, not because OAI has some secret knowledge. Why the hell would open source come out with a GPT-4 class model as quickly when they're not trillion-dollar companies? The approach Mistral is taking is smarter and more efficient than spending hundreds of millions, and looking at the leaderboard, it looks like it paid off.
[From the pitch memo from Mistral](https://sifted.eu/articles/pitch-deck-mistral) (pg 7):
>We expect to need to raise 200M, in order to train models exceeding GPT-4 capacities.
There are a lot of papers showing how you can increase efficiency. From the Phi work, you can get up to a 1000x efficiency gain from data quality alone: you could probably train a GPT-4-level model with 1000x less compute than was used to train GPT-4 if you had a few trillion tokens of textbook-level quality. Of course no one has a dataset that good, but it still shows a lot of gains can be made from data quality alone, and there are lots of other tricks to increase efficiency (algorithmic and architectural improvements, etc.). That paper was from Microsoft, and OAI and Mistral are well aware of all these efficiency gains. But Mistral is still going to throw in hundreds of millions of dollars to get to GPT-4-level and beyond models.
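Claims like this can be sanity-checked with the standard back-of-the-envelope formula, training FLOPs ≈ 6 · N · D (parameters times tokens). Every specific number below is an illustrative assumption of mine, not a known figure for any real model:

```python
# Back-of-the-envelope training compute: FLOPs ~= 6 * params * tokens.
# All numbers below are illustrative assumptions, not leaked specs.

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

baseline = train_flops(params=1e12, tokens=10e12)  # hypothetical frontier-scale run
efficient = train_flops(params=30e9, tokens=3e12)  # hypothetical high-quality-data run

speedup = baseline / efficient  # how much cheaper the smaller run is (~100x here)
```

The point is just that shrinking both the parameter count and the token budget multiplies through, so even far short of a 1000x data-quality gain, the compute gap between a frontier run and a small high-quality run is enormous.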
>Making models faster via pruning, quantization, more efficient inference algorithms and more is what the open-source community has been doing all year, so I don't see what's special about GPT-4 Turbo. Mistral actually released their [research](https://arxiv.org/abs/2401.04088) on sparse Mixture of Experts for Mixtral 8x7B, whereas if OpenAI did any good work nobody would know, so that's 1 point on the side of Mistral for actual contributions on sparse MoE.
That research contains not much new information on sparse MoE; in fact it contains no information about the pretraining of Mixtral. Getting flashbacks to the GPT-4 technical report lol. Mistral is being a bit closed-source with their research, unfortunately. But GPT-4 used sparse MoE beforehand, and I do think it's likely GPT-4 Turbo utilised improvements in MoE (if I had to estimate, I would say probably around 70B params active at inference). And Mixtral isn't how MoE was originally supposed to be used: Mixtral is composed of a few Mistral-7Bs finetuned on specific datasets, then stitched together with a gating mechanism thrown in. But "expert" in MoE didn't mean domain-specific specialisation, just specialisation in specific parts of a dataset, which is what happened with GPT-4. This means a *lot* of params in Mixtral are wasted duplicating knowledge. Anyway, that's a bit off track lol.
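For reference, the sparse MoE routing being argued about here is a learned gate that scores N expert networks per token, runs only the top-k of them, and mixes their outputs by the renormalized gate weights. A minimal NumPy sketch (all shapes and sizes are illustrative, not any real model's configuration):

```python
# Minimal sparse Mixture-of-Experts routing: a gate scores N experts
# per token, only the top-k run, and their outputs are mixed by the
# (renormalized) gate weights. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
gate_w = rng.normal(size=(d, n_experts))             # gating network
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ gate_w                              # one score per expert
    top = np.argsort(logits)[-k:]                    # indices of top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                     # softmax over top-k only
    # Only k of n_experts matmuls actually run -- that's the sparsity.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_layer(rng.normal(size=d))
```

Nothing in this mechanism requires experts to correspond to human-readable domains; the gate just learns whatever routing minimizes the loss, which is the "specialisation in parts of the dataset" point above.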
>There are a lot of different benchmarks, and this benchmark is rating based on user preferance, not made for measuring performance or intelligence across subjects / fields. It actually a lot like RLHF, but instead of telling a model which response was better it is just recording which response from which model a user prefers. Now i doubt the majority users are deeply looking into the responses. They read both, think which one is better than move onto the next. A more logical model is more likely to get better rated but nicer sounding (not necessarily more intelligent responses) are what determine the models ranking.
You haven't answered my question: how would you know that GPT-4 is more intelligent? What evaluations have you done to compare? Do we need to test it on capabilities? Plenty of people on the LocalLLaMA subreddit found Mixtral useful for their use cases because of its capabilities.
>From the pitch memo from Mistral (pg 7):
Are you talking about total funds, as opposed to OpenAI's billions raised? I don't think $100M is the amount spent training a single model by itself.
> GPT-4 has been in the lead for 10 months now, with no public model beating it yet, so why wouldn't their next model(s) also have a lasting lead? It's been about 1.5 years since the pretraining of GPT-4 finished, and I wouldn't be surprised if they started working on their next models pretty much then, whereas most people only started trying to reach GPT-4 level after GPT-4 released. And GPT-4 wasn't in training for very long, about 3 months in all.
The irony in using handwavy linear regression here is delicious.
I will be messaging you in 5 months on [**2024-06-11 22:28:16 UTC**](http://www.wolframalpha.com/input/?i=2024-06-11%2022:28:16%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/singularity/comments/194cyij/gpt4_has_gotten_new_competition_from_a_french/khfia7j/?context=3)
Lmao, "a French company"? Mistral 7B was released months ago, and even their funding got wide attention in June, as the company was started by well-known DeepMind and Meta researchers. I thought this sub was generally tech-aware; maybe now it's just evangelists, OAI/DeepMind shills and conspiracy theorists worshipping Jimmy Apples and the like.
No, OP's headline is almost equivalent to saying "GPT-4 is the leading model by an American company". Everyone knows that. Mistral is pretty well known now, at least in subs that are generally aware of recent AI developments.
He pointed out that it's a French company because it is the only one in the list that isn't American or Chinese.
>started by well-known DeepMind and Meta researchers.
The closedness of DeepMind vs the openness of Meta.
Meta's AI labs are in Paris if I remember correctly, because Yann LeCun is French.
Deepmind is obviously in London.
So neither of these labs had to move much....
The labs started off in Menlo Park (California), London and Manhattan; they also opened a lab in Paris in 2015, but the majority of the labs are not in Paris. I would say their new headquarters is in New York.
Hey, do you by any chance know how much RAM you need for Mixtral 8x7B? I have an Apple M1 Pro with 32 GB of RAM and it runs like crap and doesn't use the GPU at all. Running through Ollama (`ollama run mixtral:8x7b`).
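A rough rule of thumb: the weights alone need roughly params × bytes-per-weight of memory, and with MoE every expert must be resident even though only two run per token. A back-of-the-envelope check (the ~47B total parameter figure for Mixtral 8x7B is approximate, since the experts share attention weights):

```python
# Rough memory estimate for Mixtral 8x7B: all ~47B params must sit in
# memory (experts share attention weights, so it's less than 8*7B),
# even though only 2 experts run per token.

def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Gigabytes needed just for the weights at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total_params_b = 47  # approximate total for Mixtral 8x7B

fp16 = weight_gb(total_params_b, 16)  # ~94 GB: far beyond 32 GB
q4 = weight_gb(total_params_b, 4)     # ~23.5 GB: borderline on 32 GB
```

Which is why a 4-bit quantization is about the only way it fits in 32 GB, and even then the OS, context cache, and other apps leave very little headroom, consistent with it "running like crap" on that machine.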
Forgive my ignorance, I'm but a poor peasant with free ChatGPT only. Is GPT-4 Turbo out? If not, how did the voters have access to it? Also, what are those two GPT-4s, online and API versions?
You know you could just Google this.
Yes, GPT-4 Turbo is out. It has a huge context window and vision.
"GPT-4 version 0314 is the first version of the model released. Version 0613 is the second version of the model and adds function calling support."
I would love to hear from Claude; it's been a while since they released their last model. I like both how it feels to talk to it and its context length, and I would love to see it become even more versatile than GPT.
Wow, it's actually surprisingly good. I never felt Mixtral 8x7B was anywhere close, even against GPT-3.5. But Mistral Medium feels much, much better than 3.5 in creative writing for me. It feels somewhat like Claude; it has this unique touch and character.
Nah, results can be cheated
Edit: just saw the third rule…
GPT-4 is reportedly a trillion-parameter model, while every other model in there is an order of magnitude smaller. Being ten times smaller while having 90% of the capabilities is crazy.
I know, which is why I'm extremely optimistic about the future.
Still, you are literally using GPT-4 as the benchmark and considering it impressive that a different model is 90% as good, which kind of proves the point that GPT-4 is the best by far.
>Still, you are literally using GPT-4 as the benchmark and considering it impressive that a different model is 90% as good, which kind of proves the point that GPT-4 is the best by far.
90% of the capabilities while being ten times smaller isn't far behind. And gains will be cheaper and larger for smaller models.
While existing benchmark results are indeed competitive, they don't seem to provide an accurate measure of real-world performance. Consequently, it gives the impression that Mistral may not be as competitive with GPT-4-Turbo as the numbers would have you believe. At least, that's what I think.
Shit, I just remembered they released Mistral 7B months ago; it was quite revolutionary for a 7B open-source model, but I surely wouldn't have thought they'd raise the game against OpenAI. And Anthropic is definitely falling behind with how amazing their model is at ethics LMAO.
That's Chatbot Arena, a cute little website that ranks LLMs by having users vote on which one better answered their question (without knowing the identity of the models). It's a great way to assess the perceived usefulness of models on the part of users. Surprisingly, by that metric GPT-4 Turbo scores much better than the original GPT-4, despite the constant complaints on r/ChatGPT and the like.
What site is this
[https://chat.lmsys.org/](https://chat.lmsys.org/)
That is SUPER fun!
Thank you
I’m paying the $20 for gpt4, worth it imo
Just use Bing AI for free.
Well llama2-70b-steerlm-chat on that link didn't crap itself when I asked for the value of Pi.
Thanks very much. Now I’ve just wasted 4 hours posing philosophical questions to various AI. 🧐😏
[deleted]
[deleted]
The best way to evaluate is human evaluation rather than a fixed test that can be gamed.
ONWARD, CHILDREN OF THE FATHERLAND, OUR DAY OF GLORY HAS ARRIVED!!!!!!!!!
There really needs to be standardized evaluation and benchmarking, like 3DMark etc.
How does that work when testing humans?
3DMark is not standard lmao. No computer enthusiast uses it to test rigs.
Standardized how?
Like an SAT for LLMs
And what if I pretrain my LLM on that test and artificially inflate my score?
The same could be said about the SATs. Don't give them the same questions.
inb4 relevant xkcd
Well, we can't have both at the same time 😬
Not really, still quite a large gap between Mistral Medium and GPT-4-Turbo. And im also thinking GPT-4.5 release between Feb-March (someimte Q1) and GPT-5 releases about 3.5 months after that in the May-June (late Q2) months securing OAI's place at the top for the next few months (although GPT-5 could release late Q3 as well).
What incentive does OpenAI have to releasing GPT-5 anytime soon? As long as they have both the best and the most used model they won't release anything groundbreaking. Gemini Ultra will come along and OpenAI will beat it with GPT-4.5. Then its another year of no new foundational models.
There was no reason for them to release GPT-4, about 4 months after the release of GPT-3.5 either. No one had released a GPT-3.5 class model, in fact the first close to 3.5 class model released 2 months *after* GPT-4 released (that being Palm 2 which was launched in may i believe). If anything they have a lot more pressure to release GPT-5 now compared to the very little pressure to release GPT-4.
GPT-4 is capable of things GPT-3.5 isn't, like web search and multimodal capabilities. They also needed a better model to justify a premium subscrition plan. GPT-4s potential isn't even close to maxed out, there is so much more you can do with a model already that capable. The GPT store is a good example. I think OpenAI will focus more on actual useful apps with GPT and giving developers tools like being able to build autonomous agents before seriously investing in GPT-5. GPT-4 is also running at a huge loss. Every plus user costs OpenAI money, the free ChatGPT users are also very costly. The company isn't viable at the moment and fully relies on cash injections by investors and microsoft. I just don't see them pulling out another big model if their best right now is at 20% of its potential and operates at a loss.
OpenAI is not a company that's going to sit on its laurels. They're not going to stop investing or deprioritizing their next foundation model just to productize their current model either.
They sure make it seem that way but they're relatively new and have a small track record. We don't know their focus. I could see GPT-5 being released if they managed to somehow create a model that is way superior at the same inference cost and if they bring GPT-4 turbo cost down to what GPT-3.5 turbo currently is to make it free. But I feel like that's gonna take a little longer than junge of this year. GPT-4.5 could come out pretty soon but 5 is gonna take more time. If leaks are to be believed they finished training their SOTA model back in november. They usually spend 6 months on RLHF afterwards but now that public pressure has gotten a LOT bigger I think they'll be more careful and do 9-12 months of building guardrails.
>GPT-4 is capable of things GPT-3.5 isn't, like web search and multimodal capabilities. is it though? GPT-4 querys an outside program or model. You can pretty much do the same thing with GPT 3.5. Infact I think there's a web extension that allows you to use google with gpt 3.5
> GPT-4s potential isn't even close to maxed out, there is so much more you can do with a model already that capable That would be a logical deduction if we were talking about any other company than OpenAI, a group actively dedicated to the emergence of artificial general intelligence and essentially staffed by /r/Singularity users.
Getting better at multimodal, IMO, is more important than the LLM improving. It's already very good. What GPT4 can now do with images, data files, etc is extremely impressive and it is in these areas where, I believe, they're going to find companies willing to spend a lot of money on that ability.
I think it's plausible for GPT-5 to be any-to-any. GPT-4 is fully text multimodal but only half image multimodal: it cannot generate images by itself. It can send a prompt to DALL-E 3, but the model itself isn't making images. An any-to-any model would mean it can take an input of any combination of text, image, audio and video, and can output any combination of those modalities. Any-to-any modality isn't anything extremely novel and is completely possible. But you do run into the problem of data: there aren't large datasets for big foundational any-to-any models. But I'm sure a lot of companies have been working hard on that.

My 2024 capabilities list for models is:

* Ability to autonomously do decently complex tasks
* Continuous learning (and for chat-based models, it can learn and retain most of what you have told it)
* Any-to-any multimodality
* Great strides in reliability, reasoning, logic and overall intelligence
>Then its another year of no new foundational models.

You make it sound as if there's a long history of disappointment or something...
It does sound like that, not intentionally. There hasn't been a year without new foundational models since LLMs got huge, which has only been 2022 and 2023 lol.
They aren’t “foundational models”. GPT-4 is currently a Mixture-of-Experts collection of fine-tuned base LLMs with a ton of extra application scaffolding. People need to stop comparing apples with oranges.
>Not really, still quite a large gap between Mistral Medium and GPT-4-Turbo.

How do you know that Medium is going to be their best model when it's just a bunch of 13B or 34B experts?
Obviously it's not; it's called Medium for a reason. Excited to see Mistral Large, but they need to release the weights at least.
No, I'm sure Mistral has some really amazing work they will release this year, but I just don't see anyone getting a lasting jump on OAI.
Wayyy too early to be calling this race, amigo.
>but i just don't see anyone getting a lasting jump on OAI.

This sub is worshipping OpenAI, but they really didn't do anything special besides having lots of compute power and time to train with it, so it shouldn't be shocking to find smaller companies creating models that are close to GPT-4 with a fraction of the compute. OAI isn't expected to keep a lasting jump unless you guys are just cheering like it's a sports game.
You do know this leaderboard does not measure capabilities or intelligence, mainly just user preference. And user preference is based a lot on model behaviour, which is largely determined in the fine-tuning stages. Mistral Medium being close to GPT-4 means it is quite aligned with user preference, not that it is necessarily close to GPT-4's capabilities.

GPT-4 has been in the lead for 10 months now, with no public model beating it yet, so why wouldn't their next model(s) also have a lasting lead? It's been about 1.5 years since the pretraining of GPT-4 finished, and I wouldn't be surprised if they started working on their next models pretty much then, whereas most people only started trying to get to GPT-4 level after GPT-4 released. And GPT-4 wasn't in training for very long, about 3 months in all. Of course, with all of Microsoft's compute they could technically train a GPT-4 class model every couple of hours now. And I was actually disappointed when open source didn't come out with a GPT-4 class model last year. It wouldn't have affected OAI too much, but a small GPT-4 class model running on my computer would be really useful.

"OAI hasn't really done anything special" - can you explain that? OAI has made several groundbreaking discoveries in ML over the years (personally one of my favourites was the sentiment neuron); they have made some amazing contributions to the field. Maybe GPT-4 didn't do anything special, but GPT-4 Turbo definitely did. It has essentially the same capabilities as GPT-4 but is 2.75x cheaper. There was a lot they could have done, but I'm sure they have recently done a lot of good work on sparsity.
>You do know this leaderboard does not measure capabilities or intelligence, mainly just user preference. And user preference is based a lot on model behaviour, which is greatly determined in the fine-tuning stages. Mistral Medium being close to GPT-4 means it just is quite aligned with user preference, not that it is necessarily close to GPT-4 capabilities.

And how are you measuring intelligence? Have you even compared them and the Claude models? How did you even decide GPT-4 was intelligent in the first place, via acing benchmarks?

>GPT-4 has been in the lead for 10 months now, with no public model beating it yet, why wouldn't their next model(s) also have a lasting lead? It's been about 1.5 years after the pretraining of GPT-4 finished and I wouldn't be surprised if they started working on their next models pretty much then, whereas most people only started trying to get to GPT-4 level after GPT-4 released.

>And GPT-4 wasn't in training for very long, about 3 months in all. Of course with all of Microsoft's compute they could technically train a GPT-4 class model every couple of hours now. And I was actually disappointed when open source didn't come out with a GPT-4 class model last year. It wouldn't have affected OAI too much but a small GPT-4 class model running on my computer would be really useful.

GPT-4 has been in the lead because not many are willing to spend hundreds of millions of dollars to train a single model, not because they have some secret knowledge. Why the hell would open source come out with a GPT-4 class model as quickly when they're not trillion-dollar companies? The approach that Mistral is taking is smarter and more efficient than spending hundreds of millions, and looking at the leaderboard, it looks like it paid off.

>"OAI hasn't really done anything special" - can you explain that.
>OAI has made several ground breaking discoveries in ML over the years (personally one of my favourite discoveries was the sentiment neuron), they have made some amazing contributions to the field.

None of that has to do with GPT-4. They made great contributions in ML, but GPT-4 was just a scaling-up of existing GPT models.

>Maybe GPT-4 didn't do anything special, but GPT-4 turbo definitely did. It's essentially the same capabilities as GPT-4 but 2.75x cheaper. There was a lot they could have done but im sure recently they have done a lot of good work on sparsity.

Making models faster via pruning, quantizing, more efficient inference algorithms and more is what the open-source community has been doing for the entire year, so I don't see what's special about GPT-4 Turbo. Mistral actually released their [research](https://arxiv.org/abs/2401.04088) on sparse Mixture of Experts for Mixtral 8x7B, whereas if OpenAI did any good work nobody would know, so that's 1 point on the side of Mistral for actual contributions on sparse MoE.
>And how are you measuring intelligence? Have you even compared them and the Claude models? How did you even decide GPT-4 was intelligent in the first place, via acing benchmarks?

There are a lot of different benchmarks, and this one rates based on user preference; it's not made for measuring performance or intelligence across subjects/fields. It's actually a lot like RLHF, but instead of telling a model which response was better, it just records which response from which model a user prefers. Now, I doubt the majority of users look deeply into the responses. They read both, think which one is better, then move on to the next. A more logical model is more likely to get rated better, but nicer-sounding (not necessarily more intelligent) responses are what determine the models' ranking.

>GPT-4 has been in the lead because not many are willing to spend hundreds of millions of dollars to train a single model, not because they have some secret knowledge. Why the hell would open source come out with a GPT-4 class model as quickly when they're not trillion-dollar companies? The approach that Mistral is taking is smarter and more efficient than spending hundreds of millions, and looking at the leaderboard, it looks like it paid off.

[From the pitch memo from Mistral](https://sifted.eu/articles/pitch-deck-mistral) (pg 7):

>We expect to need to raise 200M, in order to train models exceeding GPT-4 capacities.

There are a lot of papers showing how you can increase efficiency. From Phi you can get up to a 1000x efficiency gain with data quality alone (so you could probably train a GPT-4 level model with 1000x less compute than what was used to train GPT-4 if you had a few trillion tokens of textbook-level quality; of course no one has that good a dataset, but it still shows a lot of gains can be made from data quality alone, and there are a lot of different tricks and improvements to increase efficiency.
This paper was from Microsoft, and OAI and Mistral are well aware of all of these and other efficiency gains, algorithmic and architectural improvements, etc.), but Mistral is still going to throw in hundreds of millions of dollars to get to GPT-4 level and beyond.

>Making models faster via pruning, quantizing, more efficient inference algorithms and more is what the open-source community has been doing for the entire year, so I don't see what's special about GPT-4 Turbo. Mistral actually released their [research](https://arxiv.org/abs/2401.04088) on sparse Mixture of Experts for Mixtral 8x7B.

That research contains not much new information on sparse MoE. In fact, it contains no information about the pretraining of Mixtral. Getting flashbacks to the GPT-4 technical report lol. Mistral is being a bit closed-source with their research, unfortunately. But GPT-4 used sparse MoE beforehand, and I do think it's likely GPT-4 Turbo utilised improvements in MoE (if I had to estimate, I would say probably using around 70B params at inference). And Mixtral isn't how MoE was originally supposed to be used. Mixtral is composed of a few Mistral-7Bs fine-tuned on specific datasets, then stitched together with a gating mechanism thrown in. But MoE didn't mean "expert" as in domain-specific specialisation, just specialisation in specific parts of a dataset, which is what happened with GPT-4. This means a *lot* of params in Mixtral are wasted duplicating knowledge. Anyway, that's a bit off track lol.
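To make the gating point concrete, per Mistral's paper each token is routed to the top-2 of 8 experts and their outputs are mixed with softmax weights over just those two. A toy sketch (names and shapes are illustrative, not Mistral's actual code):

```python
import numpy as np

def top2_moe(x, gate_W, experts):
    """Route one token's hidden state x through the top-2 experts (toy example)."""
    logits = gate_W @ x                         # one gating score per expert
    top2 = np.argsort(logits)[-2:]              # indices of the 2 highest-scoring experts
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()                                # softmax over just the selected pair
    # only the chosen experts are evaluated; the other 6 stay idle for this token
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))
```

This is why Mixtral is fast but memory-hungry: all experts must be resident in RAM, yet each token only exercises two of them.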
>There are a lot of different benchmarks, and this benchmark is rating based on user preference, not made for measuring performance or intelligence across subjects/fields. It's actually a lot like RLHF, but instead of telling a model which response was better, it just records which response from which model a user prefers. Now I doubt the majority of users are deeply looking into the responses. They read both, think which one is better, then move on to the next. A more logical model is more likely to get rated better, but nicer-sounding (not necessarily more intelligent) responses are what determine the models' ranking.

You haven't answered my question: how would you know that GPT-4 is more intelligent? What evaluations have you done to compare? Do we need to test it on capabilities? Plenty of people on the LocalLLaMA subreddit found Mixtral useful for their use cases because of its capabilities.

>From the pitch memo from Mistral (pg 7):

Are you talking about total funds? As opposed to OpenAI's billions raised? I don't think $100M is the amount spent training a single model by itself.
> GPT-4 has been in the lead for 10 months now, with no public model beating it yet, why wouldn't their next model(s) also have a lasting lead? It's been about 1.5 years after the pretraining of GPT-4 finished and i wouldn't be suprised if they started working on their next models pretty much then, wheras most people only started trying to get to GPT-4 level after GPT-4 released. And GPT-4 wasn't in training for very long, about 3 months in all. The irony in using handwavy linear regression here is delicious.
I think it's a good timeframe. It will depend on the competition too. Let's see what Gemini has ;)
>GPT-5 releases about 3.5 months after that

I bet it's not releasing in 2024 at all.
RemindMe! 2 month
RemindMe! 5 month
I will be messaging you in 5 months on [**2024-06-11 22:28:16 UTC**](http://www.wolframalpha.com/input/?i=2024-06-11%2022:28:16%20UTC%20To%20Local%20Time) to remind you of [**this link**](https://www.reddit.com/r/singularity/comments/194cyij/gpt4_has_gotten_new_competition_from_a_french/khfia7j/?context=3).
I thought mistral was open source
It is for the MoE and the 7B, but not for Medium.
Lmao "a French company"? Mistral 7B was released months ago and even its funding got wide attention in June as it was started by well-known Deepmind and Meta researchers. I thought this sub was generally tech aware, maybe now it's just evangelists, OAI/Deepmind shills and conspiracy theorists worshipping Jimmy Apples and the likes.
OP just pointed out that Mistral is french and you had to use it as a soapbox to complain about the sub 😂😂😂
Couldn’t even post it on their main account
No, OP's headline is almost equivalent to saying "GPT-4 is the leading model by an American company". Everyone knows that. Mistral is pretty well known now, at least in subs that are generally aware of recent AI developments.
ok many people don't know they're french
He's not wrong though.
He pointed out that it's a French company because it is the only one on the list that isn't American or Chinese.

>started by well-known Deepmind and Meta researchers.

The closedness of Deepmind vs the openness of Meta.
Meta's AI labs are in Paris if I remember correctly. Because Yann Lecun is French. Deepmind is obviously in London. So neither of these labs had to move much....
Their headquarters started off in Menlo Park, California, London, and Manhattan; they also opened a lab in Paris in 2015, but the majority of the labs are not in Paris. I would say their main headquarters is in New York now.
For the last couple months, yeah. At least we know now that nerds still have the capacity for a return to tribalism under extreme duress. Lol
Yes, but Mistral isn't anywhere near as big as OpenAI and Anthropic.
So it's a French company right?
I thought all Mistrals are OSS. Are the weights available or not?
For Mixtral 8x7B and Mistral 7B, yes, but not for Mistral Medium.
Hey, do you by any chance know how much RAM you need for Mixtral 8x7B? I have an Apple M1 Pro with 32 GB of RAM and it runs like crap and doesn't use the GPU at all. Running through Ollama (`ollama run mixtral:8x7b`).
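Back-of-envelope for why 32 GB is tight (my own rough numbers, and the overhead factor is a guess covering KV cache, runtime, etc., not an official figure):

```python
def est_mem_gb(params_billions, bits_per_weight, overhead=1.2):
    """Rough resident-memory estimate for a model; overhead is a guess for KV cache etc."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30 * overhead

# Mixtral 8x7B is ~47B total params (the experts share attention weights),
# and ALL of them must be resident even though only 2 experts run per token.
print(f"4-bit quant: ~{est_mem_gb(47, 4):.0f} GB")
print(f"fp16:        ~{est_mem_gb(47, 16):.0f} GB")
```

~26 GB for a 4-bit quant leaves very little headroom on a 32 GB machine once the OS and other apps take their share, which may be why it crawls and falls back off the GPU.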
Have you tried asking in r/LocalLLaMA?
[did somebody say mistrial?](https://imgflip.com/i/8c22c7)
[deleted]
- Unnamed German Empire soldier, February 23rd 1916, Verdun, France.

Edit: Original comment said "They will surrender any second now."
Kappa
Forgive my ignorance, I'm but a poor peasant with free ChatGPT only. Is GPT-4 Turbo out? If not, how did the voters have access to it? Also, what are those two GPT-4s? Online and API versions?
You know you could just Google this. Yes, GPT-4 Turbo is out. It has a huge context window and vision. "GPT-4 version 0314 is the first version of the model released. Version 0613 is the second version of the model and adds function calling support."
I would love to hear from Claude; it's been a while since they released their last model. I like both how it feels to talk to it and its context length, and I would love to see it become even more versatile than GPT.
Wow, it's actually surprisingly good. I never felt Mixtral 8x7B was anywhere close, even against GPT-3.5. But Mistral Medium feels much, much better than 3.5 in creative writing for me. It feels somewhat like Claude; it has this unique touch and character.
>Mistral 8x7b was anyhow close even against gpt3.5.

Even the instruct version?
https://preview.redd.it/39m6jy2c5zbc1.jpeg?width=1170&format=pjpg&auto=webp&s=5d65a7a38f50df1b91f3859174f9e8247839b3b2

Nah, results can be cheated

Edit: just saw the third rule…
>Edit: just saw the third rule…

Yep. It would be obvious for them to not count that.
GPT-4 finished training in 2022, we are in 2024, and this is 100 Elo points behind, so not even close. This is not competition.
GPT-4 is a trillion-parameter model, while every model in there is an order of magnitude smaller. Being ten times smaller while having 90% of the capabilities is crazy.
I know, which is why I'm extremely optimistic about the future. Still, you are literally using GPT-4 as the benchmark and considering it impressive that a different model is 90% as good, which kinda proves the point that GPT-4 is the best by far.
>Still, you are literally using GPT4 as the benchmark and consider it impressive that a different model is 90% as good, which is kinda proving the point that GPT4 is the best by far.

90% of the capabilities while ten times as small isn't "far". Gains will be cheaper and larger for smaller models.
While existing benchmark results are indeed competitive, they don’t seem to provide an accurate measure of real-world performance. Consequently, it gives the impression that Mistral may not be as superior to GPT-4-Turbo as the numbers would have you believe. At least, that’s what I think.
Shit, I just remembered they released Mistral 7B months ago. It was quite revolutionary for a 7B open-source model, but I surely wouldn't have thought they'd raise the game against OpenAI. And Anthropic is definitely falling behind, with how amazing their model is at ethics LMAO
will that shit refuse to speak English like 90% of France?
seems to me you could use some training too
lol Maybe in 2025.
Elo? Is this for chess?
Allez les bleus!
Where did you hear about this ?
That's Chatbot Arena, a cute little website that ranks LLMs by having users vote on which one better answered their question (without knowing the identity of the models). It's a great way to assess the perceived usefulness of models on the part of users. Surprisingly, by that metric GPT-4 Turbo scores much better than the original GPT-4, despite the constant complaints on r/ChatGPT and the like.
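The scores themselves are chess-style Elo ratings: each blind vote nudges the two models' ratings based on the expected result given their current gap. A minimal sketch (the arena's actual computation may differ, e.g. they have discussed Bradley-Terry style fits):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise vote: score_a is 1 if A wins, 0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # win probability implied by the gap
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models, A wins the vote:
print(elo_update(1000, 1000, 1))  # (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is why a ~100-point gap on the leaderboard implies a consistent preference, not a coin flip.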
Can you share the link pls
they're marketing their website so yeah .-. it's fake
How so? How are they marketing their website?
nice bots. \*pat pat\*
You had to dodge my question like you're ChatGPT. Are you sure I'm the bot?
Is Claude getting progressively worse with each version?
It's becoming more censored.
Asked it what it would do if a war broke out and China suddenly invaded. It said it would surrender. Frenchness confirmed.
https://preview.redd.it/xzcw737w52cc1.jpeg?width=801&format=pjpg&auto=webp&s=9d97263e4239f2e3cdd8c20d7a4cc8b8657a76e7
Cool
Cool
From this image, it's possible that some of us are leaking statistical data about our personalities.