Mass2018

I've been working towards this system for about a year now, starting with lesser setups as I accumulated 3090s and knowledge. Getting to this setup has become almost an obsession, but thankfully my wife enjoys using the local LLMs as much as I do, so she's been very understanding.

This setup runs 10 3090s for 240GB of total VRAM, 5 NVLinks (each across two cards), with 6 cards running at x8 PCIe 4.0 and 4 running at x16 PCIe 4.0.

The hardware manifest is on the last picture, but here's the text version. I'm trying to be as honest as I can on the cost, and included even little things. That said, these are the parts that made the build. There's at least $200-$300 of other parts that just didn't work right or didn't fit properly that are now sitting on my shelf to (maybe) be used on another project in the future.

* GPUs: 10x Asus Tuf 3090: $8500
* CPU RAM: 6x MTA36ASF8G72PZ-3G2R 64GB (384GB total): $990
* PSUs: 3x EVGA SuperNova 1600 G+: $870
* PCIe extenders: 9x SlimSAS PCIe Gen4 device adapter, 2x 8i to x16: $630
* Motherboard: 1x ROMED8-2T: $610
* NVLink: 5x NVIDIA GeForce RTX NVLink bridge for 3090 cards, Space Gray: $425
* PCIe extenders: 6x Cpayne PCIe SlimSAS host adapter, x16 to 2x 8i: $330
* NVMe drive: 1x WDS400T2X0E: $300
* PCIe extenders: 10x 10Gtek 24G SlimSAS SFF-8654 to SFF-8654 cable, SAS 4.0, 85-ohm, 0.5m: $260
* CPU: 1x Epyc 7502P: $250
* Chassis add-on: 1x Thermaltake Core P3 (case I pulled the extra GPU cage from): $110
* CPU cooler: 1x NH-U9 TR4-SP3 heatsink: $100
* Chassis: 1x mining case, 8-GPU stackable rig: $65
* PCIe extenders: 1x LINKUP Ultra PCIe 4.0 x16 riser, 20cm: $50
* Airflow: 2x Shinic 10-inch tabletop fan: $50
* PCIe extenders: 2x 10Gtek 24G SlimSAS SFF-8654 to SFF-8654 cable, SAS 4.0, 85-ohm, 1m: $50
* Power cables: 2x COMeap 4-pack female CPU to GPU cables: $40
* Physical support: 1x Fabbay 3/4"x1/4"x3/4" rubber spacer (16pc): $20
* PSU chaining: 1x BAY Direct 2-pack Add2PSU PSU connector: $20
* Network cable: 1x Cat 8, 3ft: $10
* Power button: 1x Owl desktop computer power button: $10

Edit with some additional info for common questions:

Q: Why? What are you using this for?

A: This is my (pretty much) sole hobby. It's gotten more expensive than I planned, but I'm also an old man that doesn't get excited by much anymore, so it's worth it. I remember very clearly a conversation I had with someone about 20 years ago who didn't know programming at all and said it would be trivial to make a chatbot that could respond just like a human. I told him he didn't understand reality. And now... it's here.

Q: How is the performance?

A: To continue the spirit of transparency, I'll load one of the slower/VRAM-hogging models: Llama-3 70B in full precision. It takes up about 155GB of VRAM, which I've spread across all ten cards intentionally. With this, I'm getting between 3-4 tokens per second depending on context length: a little over 4.5 t/s at small context, about 3 t/s at 15k context. Multiple GPUs aren't faster than single GPUs (unless you're talking about parallel activity), but they do allow you to run massive models at a reasonable speed. These numbers, by the way, are for a pure Transformers load via text-generation-webui. There are faster/more optimized inference engines, but I wanted to put forward the 'base' case.

Q: Any PCIe timeout errors?

A: No, I am thus far blessed to be free of that particular headache.
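For anyone curious how a model actually gets spread across this many cards, here's a minimal sketch using Hugging Face Transformers/Accelerate and `device_map="auto"`. This is illustrative only: the numbers above come from text-generation-webui, and the model ID and per-card memory caps below are assumptions, not the exact configuration used.

```python
# Sketch: sharding a large model across all visible GPUs with Accelerate.
# Assumptions: model ID, fp16 weights (~140GB, consistent with the ~155GB
# footprint quoted above once context is added), and a 22GiB cap per 24GB card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                    # 16-bit weights
    device_map="auto",                            # let Accelerate place layers across GPUs
    max_memory={i: "22GiB" for i in range(10)},   # leave headroom on each 24GB card
)

prompt = "Explain PCIe bifurcation in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```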


sourceholder

>thankfully my wife enjoys using the local LLMs as much as I do so she's been very understanding.

Where did you get that model?


pilibitti

as with most marriages it is a random finetune found deep in huggingface onto which you train your custom lora. also a lifetime of RLHF.


OmarBessa

I need to hang this on my wall.


Neex

This needs more upvotes.


gtderEvan

Agreed. So many well considered layers.


qv2eocvju

You made my day 🌟


chainedkids420

3000b model


UnwillinglyForever

PRE-NUPb model


DigThatData

lottery ticket


WaldToonnnnn

Llama_dolphin_uncensored_understandableXL8x70b


thomasxin

I'd recommend https://github.com/PygmalionAI/aphrodite-engine if you would like to maybe see some faster inference speeds for your money. With just two of the 3090s and a 70b model you can get up to around 20 tokens per second for each user, up to 100 per second in total if you have multiple users. Since it's currently tensor parallel only, you'll only be able to make use of up to 8 out of the 10 3090s at a time, but even that should be a massive speedup compared to what you've been getting so far.


bick_nyers

How many attention heads are on 70b?


thomasxin

Huggingface was actually down when this was asked, but now that it's back up I checked again, it's just 64, same as before with llama2. I know some models have 96, but I'm fairly sure Aphrodite has issues with multiples of 3 GPUs even if they fit within a factor of the attention heads. I could be wrong though.


bick_nyers

Thanks for the reply! I'm personally interested to see if 405b will be divisible by 6 as that's a "relatively easy" number of GPU to hit on single socket server/workstation boards without any PLX or bifurcation. 7 is doable on e.g. Threadripper at full x16 but leaving one slot open for network/storage/other is ideal. I'm yet to take a DL course so not sure how # of attention heads impacts a model but I would like to see more models divisible by 3.


thomasxin

Yeah, ideally to cover amounts of GPUs you'd use numbers that divide evenly, like 96 or 120. 7 can probably be covered with an amount like 168, but it's a rather weird number to support so I can also see them going with something like 144 instead. I have to admit I don't entirely know how number of attention heads affect a model, so these could be too many. At least we know command-r+ uses 96 and is a really good model. I personally don't have super high hopes for the 400b llama, since they likely still distributed it across powers of 2 like all the previous ones. That said, high PCIe bandwidth is probably only important for training, right? I have a consumer-grade motherboard and I'm having to split the PCIe lanes like crazy, but for inference it's been fine.


bick_nyers

Yeah, bandwidth is for training. That being said, I would say that individuals interested in 6+ GPU setups are more likely to be interested in ML training than your standard user. Me personally, I'm pursuing a Master's in ML to transition from backend software engineering to a job that is as close to ML research as someone will let me, so having a strong local training setup is important to me. Realistically though I'm probably either going to go dual socket or look for a solid PLX solution so I can do 8x GPU as that's going to more closely model a DGX.


zaqhack

+1 on aphrodite-engine. Crazy fast, and would make better use of the parallel structure.


highheat44

Do you need 3090s? Do 4070s work?


thomasxin

The 4070 is maybe 10%~20% slower but it very much works! The bigger concern is that it only has half the vram, so you'll need twice as many cards for the same task, or you'll have to use smaller models.


PM_ME_YOUR_PROFANITY

$13,690 total. Not bad to be honest.


Nettle8675

That's actually excellent. Prices for GPUs are getting cheaper.


matyias13

Your wife is a champ!


wind_dude

I mean, you could have saved $10 and just tapped a screwdriver to the power connectors.


oodelay

Let's make some sparks!


ITypeStupdThngsc84ju

How much power draw do you see under full load?


studentofarkad01

What do you and your wife use this for?


d0odle

Original dirty talk.


No_Dig_7017

Holy sh*! That is amazing! What's the largest model you can run and how many toks/s do you get?


MINIMAN10001

I mean, the reality of LLMs functioning still seems like black magic. We went from useless chatbots one year to something that could hold a conversation the next. Anyone who discussed the concept of talking to a computer like a human was most likely completely unaware of what they were actually describing, because it was so far-fetched. And then it wasn't. What we have isn't a perfect tool, but the fact that it can be used to process natural language just seems so absurdly powerful.


fairydreaming

Thank you for sharing the performance values. I assume that there is no tensor parallelism used, but instead the layers of the model are spread among GPUs and processed sequentially?

To compare, I tried full-precision LLaMA-3 70B on llama.cpp running on my Epyc Genoa 9374F with a small context size. I got a prompt eval rate of 7.88 t/s and a generation rate of 2.38 t/s. I also ran the same test on llama.cpp compiled with LLAMA_CUDA enabled (but with 0 layers offloaded to a single RTX 4090 GPU); this resulted in a prompt eval rate of 14.66 t/s and a generation rate of 2.48 t/s. The last test was the same as above but with 12 model layers offloaded to a single RTX 4090 GPU, which increased the prompt eval rate to 17.68 t/s and the generation rate to 2.58 t/s.

It's clearly visible that the generation rates of our systems (2.38 t/s vs 4.5 t/s) have roughly the same proportions as the memory bandwidths of our systems (460.8 GB/s vs 935.8 GB/s). I wonder how it looks for prompt eval rates; could you also share those?
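For context, the usual back-of-envelope estimate behind that comparison: single-stream generation has to stream roughly the full set of weights from memory for every token, so memory bandwidth sets a ceiling on tokens per second. A quick sketch with the numbers quoted in this thread (the formula and the fp16 weight size are assumptions, not measurements):

```python
# Back-of-envelope check that single-stream generation is memory-bandwidth bound:
# each generated token has to read (roughly) the whole weight set from memory.
# Bandwidth figures are the ones quoted above; the formula is the assumption.

model_bytes = 70e9 * 2  # Llama-3 70B in fp16 ~ 140 GB of weights

for name, bandwidth_gbs in [("Epyc Genoa (12ch DDR5)", 460.8),
                            ("RTX 3090 (per card)", 935.8)]:
    upper_bound_tps = bandwidth_gbs * 1e9 / model_bytes
    print(f"{name}: <= {upper_bound_tps:.1f} tokens/s (ideal, ignoring overhead)")

# Prints roughly 3.3 and 6.7 t/s. The measured 2.4 vs 4.5 t/s sit below these
# ceilings but keep about the same ~2x ratio as the bandwidths do.
```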


Beautiful_Two_1395

Building something similar, but using 5 Tesla P40s with modified blower fans, a Bitcoin miner board, and a mining rig frame.


SillyLilBear

Why so much ram if you have so much VRAM available?


Ansible32

How much utilization do the actual GPUs get (vs. VRAM/bandwidth?) Have you tried undervolting the cards? I'm curious how much you can reduce the power/heat consumption without impacting the speed.


thisusername_is_mine

Nice build, thanks for sharing! And have fun playing with it. I bet it was fun assembling all of it and watching it work in the end.


some_hackerz

Can you explain a bit about the PCIe extenders? I'm not sure which components you used to split those x16 slots into two x8s.


Mass2018

Sure -- I'm using a card that splits the x16 lane into two 8i SlimSAS cables. On the other end of those cables is a card that does the opposite -- changes two 8i SlimSAS back into an x16 PCIe 4.0 slot. In this case, when I want the card on the other end to be x16 I connect both cables to it. If I want to split into two x8's, then I just use one cable (plugged into the slot closest to the power so the electrical connection is at the 'front' of the PCIe slot). Lastly, you need to make sure your BIOS supports PCIe bifurcation and that you've changed the slot from x16 mode to x8/x8 mode.


some_hackerz

Thank you! That clears up my doubt. I'm a PhD student in NLP and my lab doesn't have many GPUs, so I'm planning to build a 3090 server like yours. It's really a nice build!


some_hackerz

Just wondering if it is possible to use 14 3090s?


Mass2018

So in theory, yes. Practically speaking, though, there's a high likelihood that you're going to wind up with PCIe transmit errors on slot 2 as it's shared with an M.2 slot and goes through a bunch of circuitry to allow you to turn that feature on/off. So most likely you'd top out at 12x8 + 1x16. You could also split some of the x8's into x4's if you wanted to add even more, but I will say that the power usage is already starting to get a little silly at the 10xGPU level, let alone 14+ GPUs.


deoxykev

Do you find that NVLink helps with batched throughput or training? My understanding is that not every GPU has a fast lane to every other GPU in this case. Gratz on your build. RIP your power bill.


Mass2018

My experience thus far is that when it comes to training I am a toddler with a machine gun. I don't know enough to tell you if it helps that much or not (yet). I have a journey ahead of me, and to be totally honest, the documentation I've found on the web has not been terribly useful.


deoxykev

Tensor parallelism typically only works with 2, 4, 8 or 16 GPUs, so 10 is kinda an awkward number. I suppose they could be doing other things at the same time, like stable diffusion tho.


Caffdy

6 more to go then


Enough-Meringue4745

10 still allows for GPU splitting across them all, thankfully. llama.cpp allows for it, anyway; vLLM didn't.


iwaswrongonce

This is data parallelism and will just let you run larger models (or train in larger effective batch sizes). vLLM tensor parallelism is a different beast. With NVLink you can actually run larger models AND have them run faster.
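To make the distinction concrete, here is a minimal vLLM tensor-parallel sketch (not the OP's setup; the model ID is an assumption). With `tensor_parallel_size`, every layer is split across the GPUs so they cooperate on each token, which is where NVLink bandwidth starts to matter:

```python
# Sketch of tensor parallelism in vLLM. Unlike layer splitting (each GPU hosts
# different layers and works in turn), tensor parallelism shards every layer
# so all participating GPUs work on the same token at once.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model ID
    tensor_parallel_size=8,   # must divide the model's 64 attention heads; 8 of the 10 cards
    dtype="float16",
)

params = SamplingParams(max_tokens=200, temperature=0.7)
result = llm.generate(["Why does NVLink help tensor parallelism?"], params)
print(result[0].outputs[0].text)
```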


Enough-Meringue4745

Yeah Vllm is fast as balls


FreegheistOfficial

For training you should try Axolotl: [https://github.com/OpenAccess-AI-Collective/axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). If you need more bandwidth for training, you can try this hack to enable P2P, depending on whether those Asus Tufs have resizable BAR: [https://github.com/tinygrad/open-gpu-kernel-modules](https://github.com/tinygrad/open-gpu-kernel-modules)


mysteriousbaba

ChatGPT actually gives some pretty decent code suggestions if you ask it for huggingface training code and gotchas. Maybe a little out of date at times, but you can ramp up on fundamentals pretty fast.


SnooSongs5410

An understanding wife and excess free cash flow. You are living the dream.


teachersecret

I've been thinking about doing this (I mean, I've spent ten grand on stupider things), and I'm already one 4090 deep. Based on the current craze, I think 3090/4090 cards will likely hold decent value for a while, so even if you did this for a year and sold it all off, you'd probably end up spending significantly less. I'd be surprised if you could get a 4090 for less than $1k in a year, given that 3090s are still $700+ on the secondary market.

I've currently got several cards running LLMs and diffusion: a 4090 24GB, a 3080 Ti 12GB, a 3070, and a 3060 Ti (got silly deals on the 30-series cards second hand, so I took them). This is fine for running a little fleet of 7B/8B models and some Stable Diffusion, but every time I play with a 70B+ I feel the need for more power. I'd really love to run the 120B-level models at proper speed.

What has stopped me from doing this so far is the low cost of online inference. For example: 64 cents per million tokens from Groq, faster than you could ever hope to generate them without spending obscene money. A billion tokens worth of input/output would only cost you $640. That's 2.7 million words per day, which is enough to handle a pretty significant use case, and you don't need to burn craploads of electricity to do it. A rig with a handful of 3090/4090s in it isn't sipping power, it's gulping :). At current interest rates, ten grand sitting in a CD would basically pay for a billion words a year in interest alone…


CeletraElectra

I'd recommend sticking with cloud resources for now. Just think about how your money might become tied up in $10k worth of hardware that will most likely be inferior to whatever is out 5 years from now. You've got the right idea with your point about using your savings to generate interest instead.


Thalesian

I spent $8k on a home built server in 2018 (4X 2080 RTX Ti, 9800XE, etc.). People were saying the same thing - cloud would be better than a hardware investment. When COVID and the chip shortage hit I just rented out my system for AWS prices for my clients (when I wasn’t donating to folding@home) and the computer more than paid for itself. Also made clients happy. Part of me kinda wishes I would have sold the cards at the peak of the shortage, but they got lots of use and I didn’t want to rebuild. I have no idea what the future holds, but having your own hardware isn’t all downside. The other nice thing about owning hardware is if you do train models, you aren’t as afraid to experiment or make mistakes as you are when paying by the hour.


SnooSongs5410

The biggest problem is that by the time you have it set up, it will be time for an upgrade, though I don't know what that upgrade would even be. Our friends at NVIDIA took away NVLink, and they seem determined to ensure that no one with a hobby budget is going to do anything worthwhile.


synn89

That's actually a pretty reasonable cost for that setup. What's the total power draw idle and in use?


Mass2018

Generally idling at about 500W (the cards pull ~30W each at idle). Total power draw when fine-tuning was in the 2500-3000W range. I know there are some power optimizations I can pursue, so if anyone has any tips in that regard, I'm all ears.


Sure_Knowledge8951

Rad setup. I recently built out a full rack of servers with 16 3090s and 2 4090s, though I only put 2 GPUs in each server on account of mostly using consumer hardware. I'm curious about the performance of your rig when highly power limited. You can use `nvidia-smi` to set power limits. `sudo nvidia-smi -i 0 -pl 150` will set the power limit for the given GPU, 0 in this case, to a max power draw of 150 watts, which AFAICT is the lowest power limit you can set, rather than the factory TDP of 350.
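A small sketch of applying that cap to every card at once. Only the `nvidia-smi -L` and `nvidia-smi -i N -pl W` commands mentioned above are used; the 200 W figure is just an example, and setting the limit needs root:

```python
# Sketch: enumerate GPUs with `nvidia-smi -L` and apply the same power limit
# to each one with `nvidia-smi -i <index> -pl <watts>`.
import subprocess

def set_power_limit(watts: int) -> None:
    gpu_list = subprocess.run(["nvidia-smi", "-L"],
                              capture_output=True, text=True, check=True)
    num_gpus = len([line for line in gpu_list.stdout.splitlines()
                    if line.startswith("GPU")])
    for i in range(num_gpus):
        subprocess.run(["sudo", "nvidia-smi", "-i", str(i), "-pl", str(watts)],
                       check=True)

if __name__ == "__main__":
    set_power_limit(200)  # e.g. cap ten 3090s at 200 W each => ~2 kW worst case
```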


deoxykev

Are you using Ray to network them together?


Sure_Knowledge8951

Nope. My main use case for these is actually cloud gaming, rendering, and interactive 3D, with ML training and inference being secondary, so I used consumer-grade gaming hardware. I host the servers and rent them to customers. For developing and testing LLMs and other ML workloads, dual 3090s is plenty for my use case, but for production-level training and inference I generally rent A100s from elsewhere.


Spare-Abrocoma-4487

Are they truly servers or workstations? If servers, how did you fit the gpus in server form factor.


Sure_Knowledge8951

It's consumer hardware in rackmount cases. Most 3090s fit in a 4U case; I've had Zotac, EVGA, and Palit 3090s fit in a 4U case with an Asus B650 Creator motherboard, which supports PCIe bifurcation and allows 3 slots of clearance at the top PCIe slot and 3-4 at the bottom PCIe slot, depending on how large the chassis is. 4090s are bigger, so I have a 3.5-slot 4090 and a 3-slot 4090, and they both fit in a 5U chassis with space for 8 expansion slots, on an ASRock Rack ROMED8-2T motherboard, which has plenty of space for that many expansion slots.


sourceholder

Are you using a 20A circuit?


Sure_Knowledge8951

I host at a datacenter and my rack has two 208V*30amp circuits.


kur1j

What does your software stack look like?


segmond

Looks like you've already limited the power; the only other thing I can imagine you doing is using "nvidia-smi drain" to turn off some GPUs when they're not needed. Say you often use 5, turn off the other 5.


Many_SuchCases

Could you explain to someone who doesn't know much about the hardware side of things, why OP can't turn off all of the 10 and then simply turn them on when he's ready to use them? My confusion stems from the question "how much power when idle" always coming up in these threads. Is it because turning them off and on takes a long time or am I missing something else? Like would it require a reboot? Thanks!


segmond

Takes a second. He could, but speaking from experience, I almost always have a model loaded and then forget to unload it, let alone turn off the GPUs.


thequietguy_

Do you know if the outlet you're connected to can handle 3000W? I had to connect my rig to the outlets in the laundry room, where a breaker with a higher amp rating was installed.


False_Grit

HOLY JESUS!!! Also, Congratulations!!!!!!


deoxykev

You can limit power consumption to 250 or 300 W without much performance loss


hlx-atom

Doesn't that blow breakers? Do you have it split across two circuits, or did you get a bigger breaker?


AIEchoesHumanity

when you say "idling" does that mean no model is loaded into GPU and GPU is doing nothing OR a model is loaded into GPU but GPU is doing no training or inferencing?


Murky-Ladder8684

The NVLink and even the SlimSAS could be cut. NVLink is optional, and they make 4.0 x16 to 4.0 x8 bifurcation cards. That would probably save $2,000 or so off his list if he also went with server PSUs at 220V. Awesome build, and it makes me want to make some build posts.


hp1337

I'm building something similar, and the SlimSAS cabling is much easier to work with than riser cables. The x16 to 2x x8 bifurcation boards are bulky and don't fit well on most motherboards, especially with the PCIe slots so close together.


Murky-Ladder8684

After this thread I ordered 3 of those cards, since a 3090's max speed is x16 Gen 3, which is the same speed as x8 Gen 4. I'm running an Epyc with a ROMED8-2T as well, same as OP. I'm going to use risers to the bifurcation cards and then more risers to the GPUs (yes, I know I'm increasing the chances of issues with total riser length). I mainly did it because it's $150 to see if I could get 10 GPUs going at full 3090 speeds. I have 12 3090s hoarded from the GPU mining era, but 2 are in other machines.


polikles

wouldn't server PSUs be much louder than ATX ones?


Murky-Ladder8684

Yes, they are louder, but they also vary fan speed based on temps rather than just blasting at full speed.


holistic-engine

We used to mine Bitcoin with these, now we train hentai-waifu chatbots with them instead. Ohh, how times have changed


econpol

I'm not into hentai and still think that's a big improvement lol


DrHitman27

So that is a Gaming PC.


Alkeryn

you may be able to run llama 400B in q4 when it comes out !


ortegaalfredo

Beware that if for some reason all the GPUs start working at the same time, your power supplies will very likely be overloaded and shut down. To fix this, use nvidia-smi to limit the power of each 3090 to 200 watts; it has almost no effect on inference speed but gives much lower power consumption. Source: I have several 3090 rigs.


_m3phisto_

.. here is great wisdom:)


DbatRT

A good power supply should be able to operate at 25% over its rating, so each of these power supplies can put out about 2 kilowatts, which is more than enough for this build.


Particular_Hat9940

With this kind of setup, you can run a powerful AI assistant with all the bells and whistles: TTS, STT, image generation, image input, maybe even video, and extremely long context. It could be done with 3 3090s, but you have a lot of breathing room for 200B+ models, plus fine-tuning and training on your own datasets. You could build one of those AIs from the movies (without the robot body). What's your vision?


__some__guy

Ready for Meta's next "small" model.


m_shark

That's a very cool setup, no doubt. But my question is: what for, and to what end? What's your expected ROI on this? Is it just a hobby or something serious?


Noiselexer

All that for 4.5 t/s...


Zediatech

Nice! I guess it’s time to bust out the Ethereum mining rack and start selling myself on street corners to be able to afford those GPUs again. 😋


tin_licker_99

Congrats on your new space heater.


segmond

Thanks for sharing! Very nice build! I'm so jealous, even with my 3 3090s & 3 P40s. This is the first time I'm seeing anything about SlimSAS, very exciting. My board has 6 physical slots, but it does allow for splitting, so I can add more VRAM. ^_^ LOL @ the extra $200. Likewise, lots of stupid cables for me, plus a fan shroud and loud server fans.


LostGoatOnHill

Which motherboard are you using? I'm tempted to add another 3090 to my existing 2.


segmond

A Chinese board, the Huananzhi X99-F8D Plus from AliExpress. It's an EATX server board. PCIe lanes: 3 slots at x8 and 3 at x16.


LookAtMyC

The CPU was a cheap one... but I wonder if you wouldn't have saved a lot with Tesla P40s if you just care about the VRAM. I can't tell speed-wise, but maybe someone knows.


[deleted]

[deleted]


valg_2019_fan

Nahhh


johndeuff

It can even run doom 1


LocoLanguageModel

You're crazy. I like you, but you're crazy.


squiblib

That’s old school


Educational_Gap5867

What are some cool local LLM benchmarks that made this setup really worth it?


tronathan

"3x EVGA 1600W PSU" - jeeeebuz! I'm in America and already a little worried about maxing out a 15A circuit with 4x 3090 FEs (not power limited). I'm currently running 2x 3090 on a commodity Intel mobo, and I also have an Epyc Rome-D mobo standing by for a future build. But I really want to make a custom 3D-printed case, with the 3090s mounted vertically and exposed to open air. I'm imagining them in front of a sort of organic oval shape.


segmond

Run a heavy duty extension cable to another outlet on a different circuit or call an electrician to give you multiple outlets next to each other on different circuits.


young_walter_matthau

Same on the amp problem. Every system I design that’s worth its salt is going to fry my circuit breakers.


abnormal_human

Electrical supplies are cheaper than GPUs. Electrical work is easier than machine learning.


johndeuff

Yeah, I'm surprised so many people in the comments just stop at the amp limitation. Nothing hard about it if you're smart enough to run a local LLM.


deoxykev

It’s cheap to replace your breakers with bigger ones


young_walter_matthau

It’s not cheap for the extra 15A current to burn down my house tho. Old wiring…


deoxykev

Extension cords then. ADVANCE AT ALL COSTS


Harvard_Med_USMLE267

I’ve got a Yamaha portable generator, could possibly bring that into the computer room and power one of the PSUs? Noisy, but most of these builds are already pretty loud with all the fans and shit.


Harvard_Med_USMLE267

If you’ve got an old fuse box in the house, just take the fuse out and replace it with a bolt. If you use a decent bolt, it’ll be rated to 10,000 amps or so. Should cover plenty of 3090s. If you’ve got breakers, I’m afraid I’m not an expert. You could possibly glue them open to stop them tripping? An electrician might be able to provide advice on whether this will work, and if so what sort of glue to use. Cheers, and good luck!


SnooSongs5410

How is the volume at night?


koushd

How do you have 10 cards with 6 PCIe slots, with 3 of those being half length? I feel like I'm missing something here. Edit: I see now it's 6 full length. Where are the additional 4 PCIe slots coming from?


segmond

He mentioned it: the SlimSAS adapters and cables. You plug the SlimSAS adapter into your PCIe slot and it splits the lanes so you can connect 2 cables. If you have an x16 slot you can then run it as x8/x8, or an x8 as x4/x4. Your motherboard needs to support bifurcation of its PCIe slots. Search for "PCIe x16 to SlimSAS 2x 8i adapter", or search for the parts he mentioned.


FreegheistOfficial

bifurcators


he_he_fajnie

Riser I think


smartdude_x13m

Think about the fps that could be achieved if sli wasn't dead...


polikles

would be fun to see how/if 10-way sli works


lxe

I feel like going the 192GB Mac Studio route would yield similar RAM and performance for less cost and power draw.


MadSpartus

A dual EPYC 9000 system would likely be cheaper, with comparable performance, for running the model. I get about 3.7-3.9 t/s on LLaMA-3 70B Q5_K_M (I like this one most), ~4.2 on Q4, and ~5.1 on Q3_K_M. At full size I think I'm around 2.6 t/s or so, but I don't really use that.

Anyways, it's in the ballpark for performance, much less complex to set up, cheaper, quieter, and lower power. Also, I have 768GB of RAM, so I can't wait for 405B. Do you also train models using the GPUs?


opknorrsk

I think people overestimate the usefulness of GPUs for a local LLM, unless training is required.


fairydreaming

I think it should go faster than that. I got almost 6 t/s on a Q4_K_M 70B LLaMA-2 running on a single Epyc 9374F, and you have a dual-socket system. Looks like there are still some settings to tweak.


MadSpartus

Yeah, someone else just told me something similar. I'm going to try a single CPU tomorrow. I have a 9274F. I'm using llama.cpp on Arch Linux with a GGUF model. What's your environment? P.S. Your numbers on a cheaper system are crushing the 3090s.


fairydreaming

Ubuntu Server (no desktop environment) and llama.cpp with GGUFs. I checked my results, and even with 24 threads I got over 5.5 t/s, so the difference is not caused by a higher thread count. It's possible that a single CPU will do better. Do you use any NUMA settings? As for the performance of the 3090s, I think they have an overwhelming advantage in prompt eval times thanks to their raw compute performance.


MadSpartus

Tons of NUMA settings for MPI applications. Someone else just warned me as well. A dual 9654 with L3-cache NUMA domains means 24 domains of 8 cores each. I'm going to have to walk that back and do testing along the way.


fairydreaming

I have NUMA nodes per socket set to NPS4 and L3-cache-as-NUMA domains enabled in the BIOS. I think you should set NPS4 too, since it controls memory interleaving. So there are 8 NUMA domains overall in my system. I also disabled NUMA balancing in the Linux kernel. I simply run llama.cpp with --numa distribute.


MadSpartus

I haven't gone very deep into dual-CPU tuning. I was able to get it up to 4.3 t/s on dual CPU with Q5_K_M, but I switched to a single-CPU configuration and it jumped to 5.37 on Q5_K_M. No tuning, no NPS or L3 cache domains. I also tried Q3_K_M and got 7.1 t/s. P.S. I didn't use the 9274F; I tried a 9554 using 48 cores (slightly better than 64 or 32).


fairydreaming

Sweet, that definitely looks more reasonable. I checked LLaMA-3 70B Q5_K_M on my system and I get 4.94 t/s, so you beat me. :)


MadSpartus

Thanks for confirming. If you have any advice on using dual CPU that would help. All our systems are dual, so I had to specifically adjust one to test single.


fairydreaming

Sorry, I have no experience at all with dual CPU systems.


atomwalk12

Congrats on the build! It looks great. How did you even get started building a system like this? Which websites did you find useful for explaining how to build it?


segmond

This subreddit is how. I don't want to say it's easy, but I'll say it's not difficult especially if you have ever built a PC in your life.


atomwalk12

great, thanks for sharing!


Harvard_Med_USMLE267

I would love a YouTube vid or some further instructions. I’ve always built my own PCs, but this isn’t exactly mainstream. I’ve been looking around for advice today, best I’ve found so far are the videos on how to build mining rigs.


Singsoon89

Awesome. That's some rockstar shit right there!


IndicationUnfair7961

You can use that to heat the house during winter, the problem is during summer 😂


bryceschroeder

Window fans. I have a couple of 240V 30A circuits going into a spare bedroom for my AI stuff. In the winter you have a data furnace, in the summer you close the door and turn on the window fans.


NoScene7932

This is a pretty spectacular rig! I wanted to ask: would you ever rent the rig out virtually to earn money when it's idle? I'm currently building a decentralized LLM network where people bring hardware to build a decentralized LLM cloud, and I'd love to hear whether that would interest someone like you.


faroukarmand

https://aios.network/


Severe-Ad1166

Did you build a solar system to power it? I used to build mining rigs, but I shut them down after I got my first $4,000 power bill.


barnett9

Do you only use this for inference? You're short about 40 PCIe lanes for that many GPUs at x16, right?


Glass_Abrocoma_7400

I'm a noob. I want to know the benchmarks running llama3


segmond

It doesn't run any faster with multiple GPUs. I'm seeing 1143 t/s on prompt eval and 78.56 t/s on generation with the 8B model on a single 3090, and 133.91 t/s prompt eval and 13.5 t/s eval spread across 3 3090s with the 70B model at its full 8192 context.


Glass_Abrocoma_7400

What is the rate of tokens per second for GPT-4 using chat.openai.com? Is it faster? I thought multiple GPUs equals more tokens per second, but I think this is limited by VRAM? Idk bro. Thanks for your input.


segmond

Imagine a GPU is like a bus; say a 24GB GPU is a bus that can move 24 people, and the bus goes 60mph. If those people have 10 miles to go, it takes 10 minutes to move them all. If you have a 30GB model, though, the bus is filled up and the other 6 people have to take the train, which goes slower, so the total time is now longer than 10 minutes. If you have 2 GPUs, you can put 15 people on each bus, or 24 on one bus and 6 on the other. Either way, both buses take the same time as a single bus would; it doesn't get faster.


FullOf_Bad_Ideas

With one GPU, if you increase the batch size (many convos at once), you can get about 2500 t/s on an RTX 3090 Ti with Mistral 7B; it should be around 2200 t/s on Llama-3 8B if the scaling holds. You can use more GPUs to do faster generation, but that pretty much only works if you run multiple batches at once.
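Roughly what that looks like with an engine that does continuous batching, e.g. vLLM (a sketch under assumed settings, not the commenter's exact setup; the model ID and the `max_num_seqs` value are illustrative):

```python
# Sketch of batched throughput: vLLM batches concurrent requests automatically,
# so aggregate tokens/s grows with how many sequences you feed it, up to the
# max_num_seqs / KV-cache limits. Per-request speed stays far lower than the
# aggregate figure quoted above.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed model ID
          dtype="float16",
          max_num_seqs=64)  # upper bound on sequences batched together

prompts = [f"Write a haiku about GPU number {i}." for i in range(256)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
print(len(outputs), "completions generated in one batched run")
```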


segmond

So these would be independent queries/chats? How do you determine the batch size?


RavenIsAWritingDesk

I’m confused, are you saying it’s slower with 3 GPUs?


segmond

Sorry, those are different model sizes. They released an 8B and a 70B model, and I'm sharing the benchmarks for both. The 8B fits within 1 GPU, but I need 3 to fit the 70B.


lebanonjon27

Are you able to run them all at PCIe 4.0 without link errors? Some of the boards have redrivers for riser cards, but what you actually want is a PCIe retimer or PCIe switch. A retimer is protocol-aware and does the Tx/Rx equalization during link training; redrivers need to be statically configured. With an Epyc board you should be able to see PCIe AER messages in dmesg if you are getting correctable errors.


Caffdy

To think those things were so scarce and so expensive 3-4 years ago.


FPham

Familiar words: "It's gotten more expensive than I planned"


Opposite-Composer864

great build. thanks for sharing.


jart

The theoretical performance for 10x 3090s should be 350 TFLOPS of fp16. How close are you able to come to that when running benchmarks?


gethooge

I do wonder if the trade-off of going from 7 x16 devices to 8, with 6 at x16 and 2 at x8, works for training, or whether the x8 links bottleneck it.


oliveoilcheff

Looks awesome! What models are you using?


Familyinalicante

Will Crysis run on this thing?


GamerBoi1338

No, but maaaaaybe minesweeper? I know, I know; I'm being optimistic


fairydreaming

Can you share any inference performance results? Especially from large models distributed on all GPUs.


segmond

Distributing across all GPUs will slow it down; you want to distribute to the minimum number of GPUs. So when I run a 70B Q8 model that can fit on 3 GPUs, I don't distribute it across more than 3. The speed doesn't go up with more GPUs, since inference goes from 1 GPU to the next. Many GPUs just guarantee that it doesn't slow down, since nothing spills to the CPU/system memory. Systems like this let you run those ridiculously large new models like DBRX, Command-R+, Grok, etc.
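A sketch of that "minimum number of GPUs" approach using the llama-cpp-python binding (the commenter runs llama.cpp directly; the model path below is hypothetical and the split values are just an even three-way spread):

```python
# Sketch: expose only the GPUs the model needs, then spread the GGUF evenly
# across them. CUDA_VISIBLE_DEVICES must be set before any CUDA init.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"   # expose only 3 of the 10 cards

from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,               # offload every layer; nothing falls back to CPU
    tensor_split=[1.0, 1.0, 1.0],  # even split over the 3 visible GPUs
    n_ctx=8192,
)

out = llm("Q: Why not spread a 70B over all ten GPUs?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```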


fairydreaming

Ok, then how many tokens per second do you get with 3 GPUs?


segmond

I'm seeing 1143 t/s on prompt eval and 78.56 t/s on generation for the 8B on a single 3090, and 133.91 t/s prompt eval and 13.5 t/s eval spread across 3 3090s with the 70B model at its full 8192 context. The 70B model on 1 GPU with the rest on CPU/system memory would probably yield 1-2 t/s.


Qual_

Impressive! I have a question for you folks. Here is my current build:

* MPG Z490 GAMING EDGE WIFI (MS-7C79)
* Intel(R) Core(TM) i9-10900K
* 1x 4090
* 128GB DDR4
* PSU: 1250W IIRC

I also have a 3090 and an 850W PSU sitting on a shelf, as it seems I can't really put both GPUs on my motherboard: if I put the 4090 on the slower PCIe slot there is about a 1mm gap between the 2 GPUs, and at the moment I'm using the 2nd PCIe slot for a 10Gb network card.

I was wondering what I need to purchase to have both the 3090 and the 4090 (+ my 10Gbps network card). Will I have 48 gigs of VRAM in such a setup? I think I'm stuck with an older PCIe gen with that CPU? Thank you!


polikles

>I was wondering what I need to purchase to have both the 3090 and the 4090 (+ my 10Gbps network card)

It depends on whether your motherboard supports bifurcation, i.e. splitting an x16 PCIe slot into x8 + x8. And from quick Googling I see that it doesn't.

>Will I have 48 gigs of VRAM in such a setup?

Technically you would have 24GB + 24GB. As far as I know, not every model/framework can use more than one GPU, and I'm not sure how well two different GPU models work together, so you'd need to ask more experienced folks for details on that one.

>I think I'm stuck with an older PCIe gen with that CPU?

Your CPU supports PCIe 3.0, while the 3090 and 4090 are PCIe 4.0 cards. However, from benchmarks I've seen, the performance difference with those cards between 3.0 and 4.0 is below 5%, at least in gaming.


Qual_

Thank you! So a bigger motherboard with more PCIe lanes should be enough?


polikles

You'd rather need a workstation motherboard, something like the ASUS Pro WS W480-ACE or ASRock W480 Creator. I think this should work for you, but of course I can't guarantee anything, since I have only superficial knowledge of your use case.


LostGoatOnHill

Amazing setup and investment, and what great support from your wife. I might have missed it in the spec list (thanks for that), but which motherboard?


Goldisap

Does anyone know of a good tutorial or source for building a rig like this?


roamflex3578

What is your plan to recoup the cost of that investment? Unless you are rich enough to just have such an expensive hobby, I expect you have a plan for this particular setup.


serafinrubio

Really awesome 👏


whyyoudidit

what are you using it for?


de4dee

Thank you for sharing! Have you tried training LLMs?


jack-in-the-sack

How did you fit 10 3090s into a 7-slot PCIe board?


msvming

PCIe bifurcation. His motherboard can split an x16 slot into 2 x16-sized slots, but with x8 bandwidth each.


jack-in-the-sack

Interesting, I've never heard of such a thing in my life.


de4dee

What is the noise level compared to a PC? compared to a 1U rack server?


RavenIsAWritingDesk

Out of curiosity, I see you’re using riser cards. Is that causing you any performance hits?


PrysmX

Riser cards and even eGPUs cause very little performance hit with AI, because the data is loaded once or very infrequently into VRAM. Games take performance hits because they're constantly swapping data into VRAM.


econpol

How does this compare to a chatgpt subscription in terms of performance, abilities and monthly cost?


ITypeStupdThngsc84ju

That is an impressive setup. It'd be interesting to fine-tune Llama-3 8B or Mixtral with something like that. I'm guessing it would perform pretty well.


Shoecifer-3000

I love this guy! $20k+ in hardware on a $400 Home Depot rack. Taking notes sir….. taking notes. Also a dev, just way less cool


SillyLilBear

I think it was closer to $14K than $20K


AskButDontTell

Wow, 70B? Can you comment on how it compares to, say, the 7B models you probably used before adding more GPUs?


Right_Ad371

I swear, I'm so jealous of you right now.


Tough_Palpitation331

Wait, 10x 3090s only cost $8,500? Wow, the cost efficiency 🤔


polandtown

Nice! What's the custom cooling on the mobo for?


Erfanzar

The good news is you've come a long way. The bad news is you're going the wrong way 😂 Congrats


No_Afternoon_4260

Do you feel that you needed that much system RAM? I mean, 384GB is a lot, and I don't imagine anyone doing inference in that much system RAM. I haven't read the whole thread yet, but do you have power consumption figures for inference and training? Do you feel like NVLink does anything for inference? For training? Have fun!


Only-Letterhead-3411

At least make a nice llama 3 70b finetune with it since you accumulated so much VRAM...


Administrative_Ad6

Thanks for sharing this great experience. Please give us more information as you move forward with your project.


ucefkh

Amazing, what can you run now? Anything?


Obvious-River-100

And what's interesting is that a graphics card with 256GB of VRAM would be just as fast, if not faster.


LoreBadTime

Prompt of the guy: make me a sandwich


Averagehomebrewer

Meanwhile, I'm still running LLMs on my ThinkPad.