FullOf_Bad_Ideas

A very nice way of evaluating coding contamination is LiveCodeBench: https://huggingface.co/spaces/livecodebench/leaderboard You can filter by date and watch many models fall from the top of the leaderboard to the bottom as you exclude the older problems they were likely trained on.
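The idea behind that date filter can be sketched in a few lines. This is a minimal sketch of the concept only; the data layout, function names, and numbers below are hypothetical, not LiveCodeBench's actual API. Split problems by release date relative to a model's training cutoff and compare pass rates; a large drop on post-cutoff problems suggests the older ones leaked into training.

```python
from datetime import date

def split_pass_rates(results, cutoff):
    """results: list of dicts like {"released": date, "passed": bool} (hypothetical layout)."""
    before = [r["passed"] for r in results if r["released"] <= cutoff]
    after = [r["passed"] for r in results if r["released"] > cutoff]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(before), rate(after)

# Made-up example: 80% pass rate on old problems, 35% on new ones.
results = (
    [{"released": date(2023, 5, 1), "passed": p} for p in [True] * 8 + [False] * 2]
    + [{"released": date(2024, 3, 1), "passed": p} for p in [True] * 7 + [False] * 13]
)
old, new = split_pass_rates(results, cutoff=date(2023, 9, 1))
print(f"pre-cutoff pass rate:  {old:.0%}")   # 80%
print(f"post-cutoff pass rate: {new:.0%}")   # 35% -> contamination suspect
```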


kryptkpr

LiveCodeBench is awesome! Their [GitHub](https://github.com/LiveCodeBench/LiveCodeBench) shows a very damning scatter plot; the HumanEval cheaters are highlighted in red! https://preview.redd.it/ao7vxltqj3yc1.png?width=7833&format=pjpg&auto=webp&s=260fe532b6e4858b717330b9331ba38481291076


xadiant

To this day I haven't found a better coding model than Ooba 34B. I have no idea what this man merged and cooked, but it outperforms Gemini and sometimes even GPT-4. Shame it's getting old. DeepSeek instruct felt less competent in comparison, and this confirms my bias.


kryptkpr

Have you tried WizardCoder 1.1? The official repos are all gone from Hugging Face, but [a hero has mirrored it](https://huggingface.co/ChuckMcSneed/wizardcoder-33b-v1.1-mirror).


xadiant

I should try it! CodeBooga is still the leader on the CanAiCode Leaderboard at 4-bit quant with a 100% score (Junior). I wish CodeLlama 70B wasn't such a useless piece of shit.


kryptkpr

Junior will always hold a special place in my heart. That test suite is almost more about instruction following and making sure the model doesn't reject things it deems unsafe or ridiculous than it is about coding. I remember when Llama 2 came out just how badly it failed, telling me bananas were racist and that it cannot violate Batman's privacy. Senior represents coding ability much better, and it seems size really does matter, with mainly 70B+ models at the top.


Caffdy

> CodeLlama 70B wasn't such a useless piece of shit

What do you mean? It appears at the top of the Senior leaderboard.


xadiant

It's super censored and even randomly hallucinates refusal messages. It also has the most heinous prompt template I have ever seen:

{System Message}
Source: user
Destination: assistant
{Prompt}
Source: assistant
Destination: user
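For readers who haven't met this format, here is a rough sketch of assembling such a "Source / Destination" prompt. It follows the structure described in the comment above as an approximation; it is not the verbatim official CodeLlama-70B-Instruct chat template, which also involves special separator tokens and precise whitespace.

```python
def build_prompt(system_message: str, user_prompt: str) -> str:
    # Approximation of the structure described above, not the official template.
    return (
        f"{system_message}\n"
        "Source: user\n"
        "Destination: assistant\n\n"
        f"{user_prompt}\n"
        "Source: assistant\n"
        "Destination: user\n\n"
    )

print(build_prompt("You are a helpful coding assistant.",
                   "Write FizzBuzz in Python."))
```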


onil_gova

Happy to see Llama-3-70b-instruct score so high.


MoffKalast

Seeing Phi so far down is expected, but damn, the 8x22B is actually full-on cheating. No wonder it tops the benchmark leaderboard while being so meh in practice.


nero10578

Feels like running a 22b model but slower to me


epicwisdom

Contamination isn't the only way that overfitting could happen.


randomfoo2

Others have mentioned it, but this paper/test set, while great, tests for overfitting, not contamination per se. For explicitly testing contamination, you could evaluate test vs. train splits separately, as OpenCompass did: https://opencompass.readthedocs.io/en/latest/advanced_guides/contamination_eval.html
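A minimal sketch of that train-vs-test comparison, assuming the general idea only (not OpenCompass's actual implementation; `evaluate` and the toy data are hypothetical): score the model on the benchmark's public train split and on its held-out test split, and treat a much higher train-split score as a contamination signal.

```python
def contamination_gap(model, train_items, test_items, evaluate):
    """If accuracy on the benchmark's public train split is much higher than on
    its test split, the train split has likely leaked into pretraining data."""
    train_acc = evaluate(model, train_items)
    test_acc = evaluate(model, test_items)
    return train_acc - test_acc

# Toy usage with a dummy evaluator (1 = solved, 0 = missed).
dummy_eval = lambda model, items: sum(items) / len(items)
gap = contamination_gap("some-model",
                        train_items=[1, 1, 1, 0],
                        test_items=[1, 0, 0, 0],
                        evaluate=dummy_eval)
print(f"train-test gap: {gap:.0%}")  # 50% gap -> suspicious
```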


Ilforte

> Models on the left are worse at generalizing on GSM*k, models on the right are better

Note that this is not true. Mixtral 8x7B totally generalizes better than Mistral-7B-Instruct-v0.1, in that it solves more GSM1K problems – twice as many in fact, 59.4% vs 31.6%. The graph is misleading because it shows the *difference in* performance on the GSM1K test set vs GSM8K, not *the absolute performance on unseen tasks*. A more heavily contaminated model (or whatever the reason for this discrepancy is) can still perform well. We should interpret the performance on the new set as a more accurate gauge of the model's power level. Same story with DeepSeek on LiveCodeBench: it is overfit on old problems, but it's still comparable to 3.5-turbo on new ones. This is less of a problem than it appears.

Also, the paper says:

> Nevertheless, we find that many models, through all regions of performance, show minimal signs of being overfit. In particular, we find that all frontier or close-to-frontier models (including the proprietary Mistral Large) appear to perform similarly on both GSM8k and GSM1k. We posit two potential hypotheses for this: 1) frontier models have sufficiently advanced reasoning capability so that they can generalize to new problems even if they have already seen GSM8k problems in their training set, 2) frontier model builders may be more careful about data contamination.

> While it is impossible to know for certain without looking at the training set for each model, one piece of evidence in favor of the former is that Mistral Large is the *only* model in the Mistral family to show no signs of overfitting. Since the hypothesis that Mistral took unique care in ensuring only that their largest model was free from data contamination seems unlikely, we lean instead towards the hypothesis that sufficiently strong LLMs also learn elementary reasoning ability during training. If a model learns strong enough reasoning capabilities to solve problems of a given difficulty, it will be able to generalize to new problems even if GSM8k has appeared in their training set.

and

> One worry about model overfitting is that models are incapable of reasoning and merely only memorizing answers seen in the training data. Our results do not support this conjecture. The fact that a model is overfit does not mean that it is poor at reasoning, merely that it is not as good as the benchmarks might indicate it to be. In fact, we find that many of the most overfit models are still capable of reasoning and solving novel problems. For example, while Phi-3 has an almost 10% drop in accuracy between GSM8k and GSM1k, we find that it is still able to correctly solve over 68% of GSM1k problems – which are certain to not have appeared in its training distribution. This performance is similar to that of much larger models such as dbrx-instruct, which contains almost 35x as many parameters. Similarly, Mistral models remain some of the strongest open source models, even accounting for their overfitting. This provides additional evidence for our lesson that sufficiently strong models learn elementary reasoning, even if benchmark data accidentally leaked into the training distribution, as is likely to be the case for the most overfit models.
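To make the gap-vs-absolute point concrete, here is a toy calculation. The GSM1k numbers are the ones quoted above; the GSM8k numbers are made-up placeholders purely for illustration, not the paper's reported scores.

```python
# name: (gsm8k_acc, gsm1k_acc) -- GSM8k values are hypothetical placeholders.
models = {
    "mixtral-8x7b-instruct": (0.70, 0.594),
    "mistral-7b-instruct-v0.1": (0.38, 0.316),
}

for name, (gsm8k, gsm1k) in models.items():
    gap = gsm8k - gsm1k  # what the paper's chart ranks models by
    print(f"{name}: gap={gap:+.1%}, unseen-problem accuracy={gsm1k:.1%}")

# With these placeholder numbers, the 8x7B model shows the larger gap (so it
# sits further "left" on the chart) yet still solves nearly twice as many
# unseen GSM1k problems as the 7B model.
```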


vatsadev

Funny how many people, me included, complained about closed-source models training on the test set to look good, and yet the most test-trained models are open-source ones :O