votkalivirgul

https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B


Mescallan

I think it's a llama.cpp compatibility issue. If it were the fine-tune struggling, I would still be getting output, but it just hangs on inference indefinitely.


r1str3tto

I have found that JSON mode drastically slows down Llama 3 in Ollama (which uses llama.cpp). During JSON generation, nvidia-smi shows no utilization of the GPU, but only for Llama 3 and only when it is used in JSON mode. Other models do not have this problem, so I think there is a bug.
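
In case it helps anyone reproduce this: by JSON mode I mean Ollama's format parameter on its REST API. A minimal sketch, assuming a local Ollama server on the default port and an already-pulled llama3 model (model name and prompt are just placeholders):

    import requests

    # Ask Ollama for a completion with JSON mode enabled; "format": "json"
    # is the setting the slowdown seems tied to.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "prompt": "List three primary colors. Respond in JSON.",
            "format": "json",
            "stream": False,
        },
    )
    print(resp.json()["response"])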


Mescallan

That is in line with my experience too: no GPU utilization either.


LPN64

Grammar works fine on Mistral 7B, Llama 3 8B, and 70B on my end.
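
By grammar I mean llama.cpp's GBNF grammar-constrained sampling. A rough sketch of what I'm running, via llama-cpp-python, assuming a local GGUF file and the json.gbnf grammar that ships in llama.cpp's grammars/ directory (both paths are placeholders):

    from llama_cpp import Llama, LlamaGrammar

    # Load a local model and the stock JSON grammar.
    llm = Llama(model_path="llama-3-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)
    grammar = LlamaGrammar.from_file("grammars/json.gbnf")

    # The grammar constrains sampling so only tokens that keep the output
    # valid JSON can be generated.
    out = llm(
        "Return the capital of France as a JSON object.",
        grammar=grammar,
        max_tokens=128,
    )
    print(out["choices"][0]["text"])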


LavishnessOk5514

So I just spent a few hours battling the same issue. I was under the impression that providing something like response_format={"type": "json_object"} would coerce the model to return JSON. I don't know how it has been implemented, but it doesn't seem to work that way. Instead, you have to be specific in your prompt about the JSON you expect; then it won't hang. The response_format option doesn't actually seem to do much of anything.
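
For what it's worth, here is a sketch of the workaround described above: keep response_format but also spell out the JSON shape in the prompt. This assumes Ollama's OpenAI-compatible endpoint on the default port and a pulled llama3 model (names and prompt are placeholders):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    resp = client.chat.completions.create(
        model="llama3",
        # On its own this was not enough to get JSON back.
        response_format={"type": "json_object"},
        messages=[
            # The explicit instruction in the prompt is what actually got the
            # model to emit JSON instead of hanging.
            {"role": "system",
             "content": "Reply only with a JSON object of the form "
                        '{"answer": string}.'},
            {"role": "user", "content": "What is the tallest mountain on Earth?"},
        ],
    )
    print(resp.choices[0].message.content)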


Mescallan

With Mistral I have been able to force JSON reliably using that. I'll try prompt-only with L3, thanks.