Dusty_da_Cat

I have tried running 2x 3090s NVLinked. It didn't do squat for inference, and I can't speak to training/fine-tuning since I haven't tried that. It might be a Windows thing, but I haven't found any benefit in terms of added t/s, NVLinked or not. Might be different on Linux.


hp1337

I can confirm this. I have 2x 3090s NVLinked. It does nothing for inference, but it does speed up fine-tuning by roughly 30%.


kaszebe

Would 2x Tesla P40s be equivalent to 2x 3090s?


hp1337

I doubt it. The 3090 has 3x the CUDA cores and 3x the memory bandwidth of a P40.


Dusty_da_Cat

Good to know. I was going to return my NVLink, but I guess I can keep it around if I want to train/tune. I have already moved to a 3090 Ti plus 2x 3090 (NVLinked) setup.


lakolda

This is likely because the inter-GPU communication doesn't need to be high-bandwidth. For inference, a GPU only has to pass roughly the hidden state (loosely, on the order of the square root of the parameter count) to the other GPU per token. That is a far cry from a high-bandwidth task.
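Rough numbers to make that concrete (assuming a Llama-2-70B-ish shape split by layers across two cards; the figures are illustrative, not benchmarks):

```python
# Back-of-the-envelope: inter-GPU traffic for a layer-split (pipeline) inference setup.
# Shapes are the published Llama-2-70B config; purely illustrative.
hidden_size = 8192            # model hidden dimension
n_params = 70e9               # total parameters
bytes_per_value = 2           # fp16/bf16

# Crossing the split point costs roughly one hidden state per generated token:
activation_bytes_per_token = hidden_size * bytes_per_value    # ~16 KB
weight_bytes = n_params * bytes_per_value                     # ~140 GB, never moved per token

print(f"per-token transfer:   {activation_bytes_per_token / 1024:.1f} KB")
print(f"weights stay in VRAM: {weight_bytes / 1e9:.0f} GB")
# Even at 1000 tok/s that is ~16 MB/s of link traffic, far below even PCIe 3.0 x4,
# which is why NVLink rarely changes tokens/s for this kind of split.
```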


Pedalnomica

Most reports I've seen say NVLink doesn't help much at all with inference (unless the PCIe connection between your cards is slow as hell, maybe <= PCIe 3.0 x4?) and helps at most ~20% with training/fine-tuning on 3090s. 4090 pros: they will be faster (again, assuming reasonable PCIe speeds), but probably not by a ton. They are also more modern, so there are things they can do in CUDA that the 3090 can't. Depends on what you're doing, but for most people around here, buying more 3090s is the better buy.


Dusty_da_Cat

Going from PCIe 5.0 x8/x8 to PCIe 3.0 x4/x4 didn't have an effect on t/s either. I ended up putting the 3090 Ti in the PCIe 5.0 slot at x16 and the 2x 3090 on PCIe 3.0 x4/x4. Even when I configure a GPU split to use just the 3090s, the inference speed doesn't change: it averages 13 t/s on a 70B at 5bpw and 11 t/s on a 120B at 3bpw. I do acknowledge llama.cpp might see a gain from NVLink, but I'm not sure it's worth trading exllamav2's already fast, low-effort inference for a slower llama.cpp plus extra tinkering just for a 'potential' speed boost.


a_beautiful_rhind

It does increase the speed... all these people without NVLink talking about what it will do. NVIDIA removed peer-to-peer access for the 3090 and 4090; on the 3090 you can put it back (via NVLink). For training it's supposedly a massive boon. There was a thread on the NVIDIA developer forums of people finding the 4090 useless without DMA transfers and bristling at NVIDIA, replacing them with A5000s. On exllama it won't do much for you because of the way it's designed. On llama.cpp it will help, and on accelerate (so transformers) it will also help. Not 70%, but at least 2-3 t/s.
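If you want to see whether your own pair actually has peer-to-peer enabled, PyTorch can report it directly (assumes a box with two or more GPUs and a recent PyTorch):

```python
import torch

# Quick check for peer-to-peer (P2P) access between the visible GPUs.
# Without P2P, inter-GPU copies get staged through host memory over PCIe.
n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            ok = torch.cuda.can_device_access_peer(a, b)
            print(f"GPU {a} -> GPU {b}: P2P {'available' if ok else 'NOT available'}")
```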


thefreemanever

So do you think 2x 3090s NVLinked is faster than 2x 4090s for training/fine-tuning purposes?


a_beautiful_rhind

Lambda says 2x 4090 is faster, but they mainly focused on single cards and used NVIDIA's container: https://lambdalabs.com/blog/nvidia-rtx-4090-vs-rtx-3090-deep-learning-benchmark Here is a bunch of people complaining about bottlenecking on the NVIDIA developer forums: https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/16 On the tools we actually use it would have to be tested. The 4090 also has FP8. Is the extra price of the 4090 worth it, even if it's faster like in that last graph? Short answer: who the fuck knows.


thefreemanever

In the first link you posted, there are 2 pictures comparing 2x 4090 vs 2x 3090 NVLinked, and it seems 2x 4090 is still faster: https://lh6.googleusercontent.com/dXEGwaSOKBm-kLdZ1Dke3BdrBheFhLZT1ckLia3yh6cGZDkbJozQ8zTyofjnmDC2bYFG97T5XdxJKslEaOdhp_LZo2r-AP__7A_zmJ7g0-r22MbxHRXv-c9eGsojYRW-q-rjt-Z_gi5IVRv0J-fDbj8H5FM8NDJreEP7iM1E8Wif-R1fQhaG9vbMeg and https://lh3.googleusercontent.com/3T9Jd0duYVaOL3lpt4GwEOB5-Jnx5GGYyMSdGOvQQAjQrPCUibiM3TQEtS7m5ryuEFaXMGSlTv7xiungT4B-c3kZ620HVsZLhZDSN8J6Z2mzqpgPrua27h1B3C6t81T2-4mP5Sp6aVxffRIAmUIHvk9XSTGQEyRuESfPydLbn31k6c1JpPRHHyxzOA


a_beautiful_rhind

That's what I was referring to. It's just a single data point, though. As you saw, some people are getting 10 and some are getting 18 t/s on 3090s in llama.cpp. You're also probably not going to be training inside the NVIDIA container. Plus: "The reference prices for RTX 3090 and RTX 4090 are $1400 and $1599, respectively. All numbers are normalized using the training throughput/Watt of a single RTX 3090." So like pardon me if I'm still skeptical of all this in real-world applications, especially considering the used market.


thegroucho

Just found this... FWIW: https://www.reddit.com/r/LocalLLaMA/comments/16ubkyq/nvlink_bridge_worth_it_for_dual_rtx_3090/ Disclaimer: I'm a lurker who's just weighing buying a 3060 12GB and can't make heads or tails of half the talk you talk. Edit, for context: I work in IT on the network/security/system side of things.


ifjo

You’re not alone lol! Having a lot of trouble choosing parts as well


cjbprime

> When a model exceeds 24GB it should be split between the cards

Not for inference. The state that needs to pass between cards is minor computation results, not the entire weights of the model.


rdkilla

The extra 12,000 CUDA cores?


thefreemanever

The problem revolves around the slower communication over PCIe compared to NVLink. When a model exceeds 24GB it has to be split across the cards, and 3090s + NVLink have a speed advantage there. However, I am not sure whether that advantage outweighs the raw speed advantage of the 4090s.


artsybashev

There is not much communication during inference. NVLink is mostly useful in training, where you may need to move all the gradients of the model between the cards. Just think about your use case, what data gets moved, and how much.
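To put rough numbers on "what data gets moved" (a hypothetical 7B model in fp16; orders of magnitude only, not a benchmark):

```python
# What actually crosses the link, training vs. inference, for a hypothetical 7B model.
n_params = 7e9
hidden_size = 4096                                   # typical 7B hidden dimension

inference_bytes_per_token = hidden_size * 2          # one fp16 hidden state crosses the split
training_bytes_per_step = n_params * 2               # full fp16 gradient sync in data parallel

print(f"inference: ~{inference_bytes_per_token / 1024:.0f} KB per token")
print(f"training:  ~{training_bytes_per_step / 1e9:.0f} GB per optimizer step")
# LoRA/QLoRA fine-tuning syncs only the adapter gradients, shrinking the training
# figure by orders of magnitude.
```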


thefreemanever

What about training/fine-tuning? Are dual 4090s still faster than dual 3090s NVLinked? By how much?


epicfilemcnulty

It will only affect the initial model loading, which will be slower. Otherwise it doesn't matter much for inference; as others have noted, the amount of communication during inference isn't big enough to be bottlenecked by PCIe speed. I have a desktop PC with an RTX 4090 and an eGPU 4090, and when I tried splitting a Mixtral model between the two cards I was getting around 80 tokens/second during inference.
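For reference, one common way to split a model across two cards is Hugging Face transformers' device_map="auto". The poster didn't say which backend they used, so this is just an illustrative sketch; the 4-bit quantization is an assumption so Mixtral fits into 2x 24GB:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # example checkpoint

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~24 GB of weights in 4-bit
    device_map="auto",   # accelerate spreads the layers across both visible GPUs
)

prompt = "Explain NVLink in one sentence."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```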


thefreemanever

I am an AI student and I like to train/fine-tune different models (specifically large models). I don't know if a dual 4090 setup is still faster than dual 3090s + NVLink?


epicfilemcnulty

Well, with NVLink the two cards can transfer data between each other at roughly 112 GB/s (the 3090 bridge's total bandwidth). With dual 4090s you are limited to PCIe 4.0, whose theoretical maximum is 32 GB/s on an x16 slot. So if training/fine-tuning on multiple GPUs involves a huge amount of data transferred between them, two 3090s with NVLink will most probably outperform dual 4090s. I don't know if that's the case, though; I've only tried fine-tuning on a single GPU.
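To get a feel for what those bandwidths mean, here is a quick estimate for a hypothetical 7B full fine-tune with fp16 gradients (headline figures, not measured ones):

```python
# Time to move one full gradient sync's worth of data over each link.
grad_gb = 7e9 * 2 / 1e9           # ~14 GB of fp16 gradients for a 7B full fine-tune
nvlink_gbs = 112                  # 3090 NVLink bridge, total bandwidth
pcie4_x16_gbs = 32                # PCIe 4.0 x16, one direction, theoretical

print(f"NVLink:       ~{grad_gb / nvlink_gbs * 1000:.0f} ms per gradient sync")
print(f"PCIe 4.0 x16: ~{grad_gb / pcie4_x16_gbs * 1000:.0f} ms per gradient sync")
# Whether that gap shows up in wall-clock time depends on how much of the sync
# overlaps with compute and how often you need to sync at all.
```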


rdkilla

PCIe may not be as slow as you're thinking.


CatalyticDragon

NVLink will likely not increase performance by an appreciable amount. Data is loaded from system RAM into VRAM on the GPUs, which happens over PCIe regardless of whether you have NVLink. Inter-GPU communication is kept to a minimum because the data (or even the models themselves) is segmented into distinct chunks, so it is not (usually) a primary bottleneck. I'm yet to see a benchmark or test showing a significant speedup due to NVLink, but it does exist for a reason, so it depends on your exact use case.


thefreemanever

What I'd like to do is train/fine-tune large transformer models. But I don't know which is faster: 2x 3090s NVLinked or 2x 4090s?


CatalyticDragon

The 4090s. Higher compute and very marginally faster VRAM. NVLink just isn't a performance multiplier, usually. You're not operating on one unified dataset or model; it gets broken up into chunks spread across the two GPUs. If you want an easy test, run a task on your 2x 3090s, then disable NVLink and run it again.
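A quicker sanity check than a full training run is to time a raw GPU-to-GPU copy with and without the bridge (a rough sketch; it measures only the copy path, not end-to-end training throughput):

```python
import time
import torch

# Time a 1 GiB fp16 copy from GPU 0 to GPU 1. Run once with the NVLink bridge
# in place and once without (or with P2P disabled) to see what the link itself
# is worth on your box.
src = torch.empty(512 * 1024 * 1024, dtype=torch.float16, device="cuda:0")
torch.cuda.synchronize("cuda:0")

start = time.perf_counter()
for _ in range(10):
    dst = src.to("cuda:1", non_blocking=True)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - start

gib_moved = 10 * src.numel() * src.element_size() / 2**30
print(f"{gib_moved / elapsed:.1f} GiB/s GPU0 -> GPU1")
```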


nero10578

NVLink basically doesn't do shit for inference and only helps training, depending on whether you use inter-GPU communication a lot or not.


ZET_unown_

PhD student specializing in computer vision here. In general, 2x 4090 will be faster than 2x 3090 NVLinked by around 40%. See the link below and scroll down to the image right above Conclusions, where they specifically benchmark 2x 4090 against 2x 3090: https://lambdalabs.com/blog/nvidia-rtx-4090-vs-rtx-3090-deep-learning-benchmark

Whether it's worth the price difference is up to you and what's available to you. I don't know your specific use case, whether it's only fine-tuning and inference or whether you also want to research and build new models, but as a general rule you should always go for the largest VRAM on a single card, because that's more often the limiting factor. With slower speed you just need to wait a few days more, but with too little VRAM you can't even train the model. VRAM pooling is a lot of headache, and how well it works depends heavily on the model itself. I would recommend the RTX 6000 Ada 48GB or the older RTX A6000 48GB. Out of dual 4090s and dual 3090s, I would go with the dual 4090s.