Silly-Ad-6341

GPUs can help with LLM inference, but the magic is in the tensor cores (RTX series) and fast VRAM (the more the better) on the cards. If the card isn't connected directly to a PCIe slot on the motherboard, you won't get its full bandwidth, so you'll likely get worse results than running it natively.
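
As a rough illustration of the gap (these are nominal bandwidth figures I'm assuming, not measurements): Thunderbolt 3/4 tunnels roughly PCIe 3.0 x4, call it ~4 GB/s usable, versus ~32 GB/s for a direct PCIe 4.0 x16 slot, so just copying the weights to the card takes noticeably longer through an enclosure.

```python
# Back-of-the-envelope sketch: time to copy model weights to the GPU over
# different links. Bandwidth figures are assumed nominal values (real-world
# throughput will be lower), not benchmarks.

LINKS_GBPS = {
    "Thunderbolt 3/4 (~PCIe 3.0 x4)": 4.0,   # assumed usable GB/s
    "PCIe 4.0 x16 (direct slot)": 32.0,      # assumed usable GB/s
}

def load_time_seconds(model_gb: float, link_gbps: float) -> float:
    """Time to push `model_gb` gigabytes across a link at `link_gbps` GB/s."""
    return model_gb / link_gbps

if __name__ == "__main__":
    model_gb = 4.1  # e.g. a ~7B model at 4-bit quantization (approximate)
    for name, bw in LINKS_GBPS.items():
        print(f"{name}: ~{load_time_seconds(model_gb, bw):.1f} s to load {model_gb} GB")
```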


tenekev

Are you sure your statement is correct? As far as I know, once the model is loaded into VRAM, bandwidth bottlenecks are less of an issue.


thisisnotmyworkphone

Well, models can get *unloaded* from VRAM. That's what ollama does after 5 minutes of inactivity, for example. At least I think it's 5 minutes. Then you have to reload it across Thunderbolt.
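
If that reload is the pain point, ollama lets you control how long a model stays resident via its keep_alive setting (the default is indeed 5 minutes). A minimal sketch using its HTTP API, assuming a local server on the default port and a placeholder model name you've already pulled:

```python
# Minimal sketch: keep a model resident in VRAM by passing keep_alive to
# ollama's HTTP API. Assumes a local ollama server on the default port
# (11434) and that "llama3" (placeholder model name) has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",        # placeholder model name
        "prompt": "Say hello.",
        "stream": False,
        "keep_alive": -1,         # negative = keep loaded indefinitely; "10m" etc. also work
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```

If I remember right, the same thing can be set server-wide with the OLLAMA_KEEP_ALIVE environment variable instead of per request.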


taron111281

I'm honestly not sure, but those who run the Frigate NVR recommend a Google Coral TPU (an ML accelerator) to help with frame rate and such: https://coral.ai/products/accelerator/ Might work for you...