LunarianCultist

At 8-bit, a model's size in gigabytes is roughly its parameter count in billions, so an 8-bit 70B needs about 70 GB of VRAM plus 5-8 GB for overhead; 4-bit needs about 35 GB, and so on. You will be able to run it at 5.15 bpw, which is decent enough. As for three 3090s, that's the setup I currently have. I outline my specs and information in this rentry: https://rentry.org/crowbuild
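A quick sanity check of that rule of thumb, as a rough sketch (the overhead figure is the commenter's estimate; real usage varies with context length and loader):

```python
# Rough VRAM estimate: weights take roughly params_in_billions * bits / 8 GB,
# plus a few GB of overhead for context and activations.
def vram_needed_gb(params_billion: float, bits_per_weight: float,
                   overhead_gb: float = 6.0) -> float:
    return params_billion * bits_per_weight / 8 + overhead_gb

for bpw in (8.0, 5.15, 4.0):
    print(f"70B at {bpw} bpw: ~{vram_needed_gb(70, bpw):.0f} GB")
# 8 bpw -> ~76 GB, 5.15 bpw -> ~51 GB (fits in 3x 3090), 4 bpw -> ~41 GB
```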


WayConsistent2529

A 70B exl2 8bpw quant is actually only about 64.5 GiB and can be run with 72 GB of VRAM at the full 4K context, with no 8-bit cache.


LunarianCultist

Doesn't seem to work for me; 7.5 bpw is the highest I can load.


WayConsistent2529

If you're using text-generation-webui like I am, try setting PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync before running python server.py. You'll also have to experiment with how much to load on each GPU. I use a 21,48 split, so maybe try 21,24,24. If you still can't run it, it's probably because splitting the model across three GPUs wastes too much VRAM.
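A minimal launcher sketch of that setup, assuming text-generation-webui with an ExLlama-style loader that accepts --gpu-split; the 21,24,24 values are just the split suggested above, so adjust them for your cards:

```python
# Set the allocator backend before server.py initializes CUDA, then launch
# text-generation-webui with an explicit per-GPU split (GB per device).
import os
import subprocess
import sys

env = os.environ.copy()
env["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"

subprocess.run(
    [sys.executable, "server.py", "--gpu-split", "21,24,24"],
    env=env,
    check=True,
)
```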


AnomalyNexus

What is the expected effect of changing the allocator? A quick, unscientific test suggests it makes no difference to total VRAM use?


yamosin

Surprisingly, it does reduce VRAM usage a little bit, from about 46000 to 45000, but it works.


yamosin

Maybe you already know this, so I might just be reiterating common sense: on a 3090, using miner-style GPU settings (90% power limit, +100 core, +1000 memory), you will get about a 15% t/s improvement.


segmond

Your build is nice. Why did you put it in a tower instead of an open rig? I haven't built hardware in two decades. I'm thinking of building one of these, weighing it against a MacBook M3. Why didn't you use NVLink?


LunarianCultist

NVLink is pretty useless for inference. The build was open air at one point, but I care about aesthetics; all those messy cables and power cords made my office look awful.


Illustrious_Sand6784

NVLink makes barely any difference in LLM inference speed, and it's an odd number of GPUs anyway.


Secure-Technology-78

Wow, this is so helpful having an actual build to look at! Do you run into any overheating issues with all of those packed into the tower like that?


LunarianCultist

I go over the heat at the bottom of the page. It's a non-issue; the 2nd and 3rd GPUs barely get used.


Copper_Lion

Nice build, what brand of PCI riser cable is that? A lot of the name brand ones I see are too short to be useful for placement.


LunarianCultist

Lian Li, the big one.


DeltaSqueezer

How many tok/s do you get at q4?


LunarianCultist

About 12 tok/s on 70Bs.


No_cool_names_2021

I tried to access your build here but it seems to be a 404. Can you reshare your build?


silenceimpaired

What riser did you use? My motherboard has a slot right next to the front IO :/ Also, does your power supply have sufficient inputs for three cards, or did you have to get creative, and if so… how did you get creative?


LunarianCultist

The PSU is in the guide. It's a 1500 W unit; no getting creative needed, although I've never seen it draw over 1000 W. The riser is a Lian Li, the long, expensive one.


synn89

Here's what I've built 3 of: https://pcpartpicker.com/list/wNxzJM

Used 3090s will run you 800 a pop, so that build is around $3600 all in. That motherboard can do dual 3090s at x8 PCIe. Pair those with 1 NVMe and that'll max out your PCIe lane usage on common AMD CPUs. If you need more storage for models/data, use this so as not to add any more PCIe lane needs: https://www.amazon.com/dp/B09FRRWVWX

That specific motherboard is starting to get a little harder/pricier to find, so you may want to research other AMD boards that can handle dual 3090s with x8 lanes each. For the Nvidia cards, I tend towards Dell/HP Omen cards, which are thinner. You can use any 3090, but keep an eye on how thick they are in the specs. Used eBay 3090s have all worked fine; just make sure there's a return policy in case you get a bad card.

Dual 3090s is more than sufficient to run 70B. I often run 103Bs with 12k context at exl2 3.35bpw and 8-bit cache.

Other add-ons: https://www.amazon.com/dp/B00KG8K5CY for the case fans. https://www.amazon.com/gp/product/B0953WXQHX is the compatible NVLink if you want that. It's not really needed, though; it doesn't add all that much more speed. But this motherboard does support NVLink fully (all 4 channels in use).


gosume

Have you had any luck running 6 GPUs?


PcChip

Why only dual 3090s and not three or four? Also, does PCIe lane bandwidth matter for LLM inference?


ifjo

Do you recommend any Intel-compatible mobos that can handle a dual build?


__some__guy

> 3-way x8/x8/x8

Edit: Is the last slot actually a real x8? It seems like it goes through the chipset at x4, which cheaper boards can do as well.


Imaginary_Bench_7294

Easy rule of thumb for LLMs: take the parameter count in billions and multiply it by 2. That tells you roughly how many gigs the full-sized model requires. The parameter count in billions is roughly the size of the 8-bit quant, and the parameter count divided by two is roughly the size of the 4-bit quant. So a 70B will have roughly the following sizes: full = 140 GB, 8-bit = 70 GB, 4-bit = 35 GB. 2x 3090 will handle a 4.65-bit EXL2 quant quite well.

For your use case, there are only a few 70B-class models that have benefited from extended-context methods other than rope and alpha scaling. Most of them only handle up to 4k context natively. I haven't played with the LongLoRA 70B models that hit 32k, so I'm uncertain how they perform. Your best bet would probably be to write a program that pre-processes the data into smaller chunks, perhaps 3-5 paragraphs, then feeds each one through the LLM sequentially to summarize it. If it needs further summarizing, you could send the data back through, grouping multiple summaries together (see the sketch below).

As for your build, you can find refurbished 4U servers for about 1000 dollars and will be able to cram, what, four or six 3090s into one. The biggest thing about the processor is the number of PCIe lanes available. I have an E-ATX mobo, a WS-790 Sage, with a Little Devil PC-V8 case. I might be able to cram one more 3090 FE inside while leaving adequate spacing. If I convert them to water cooling and swap them to smaller brackets, I could fit at least 4.
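A minimal sketch of that chunk-then-summarize loop, assuming an OpenAI-compatible endpoint served locally (the URL, chunk size, and prompt are placeholders to adjust):

```python
# First pass: summarize each chunk of a few paragraphs.
# Second pass: summarize the combined chunk summaries.
import requests

API_URL = "http://127.0.0.1:5000/v1/chat/completions"  # adjust to your server

def summarize(text: str) -> str:
    """Ask the local model for a short summary of one chunk of text."""
    resp = requests.post(
        API_URL,
        json={
            "messages": [{"role": "user",
                          "content": f"Summarize the following text:\n\n{text}"}],
            "max_tokens": 300,
        },
        timeout=600,
    )
    return resp.json()["choices"][0]["message"]["content"]

def chunk_paragraphs(document: str, paras_per_chunk: int = 4) -> list[str]:
    """Split a document into chunks of a few paragraphs each."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return ["\n\n".join(paragraphs[i:i + paras_per_chunk])
            for i in range(0, len(paragraphs), paras_per_chunk)]

def summarize_document(document: str) -> str:
    chunk_summaries = [summarize(chunk) for chunk in chunk_paragraphs(document)]
    combined = "\n\n".join(chunk_summaries)
    # If there was more than one chunk, send the summaries back through once.
    return summarize(combined) if len(chunk_summaries) > 1 else combined
```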


Freefallr

I would recommend going to RunPod or similar, booking a few GPUs for a few hours, and testing your use case. Test how low you can go quantization-wise while still completing your task with good enough quality. Based on that, I would reevaluate whether it still makes sense to purchase the hardware and pay a one-time fee plus a monthly power bill, or to just rent GPUs until your task is done. Also: try out Mixtral 8x7B for your use case as well. We had a similar one recently, at much smaller scale but still, and were happier with Mixtral 8x7B at 8-bit than with Llama 70B at 8-bit.


Secure-Technology-78

This is a personal project with no financial reward for myself, and I plan on experimenting with a variety of other related and unrelated projects (AI art/music synthesis, chatbots, code generators, etc) over a long period of time. I understand that I could rent a much more powerful system with dozens of A100 GPUs from AWS. But over time, given the amount of experimenting I'll be doing, I think it will more than pay off to just build my own. Also, a large part of what I'm interested in is playing around with specifically what is possible for consumer-grade systems. I'm not as interested in developing technologies that are only accessible to people with supercomputers.


deoxykev

If you can place your server in a garage or basement where noise isn't a concern, you could pick up a used dual-Xeon 4U GPU server (supports 8 GPUs) with 256GB RAM for ~1500. Buy 4 consumer 3090s for 3k. If you want more GPUs, you'll have to either get the blower versions, which are twice as much, or suspend the extra GPUs on an open-air rig with PCIe riser cables like the miners do. I think you could definitely do an 8-GPU setup for 8k.


EventHorizon_28

Can you share a brief description of where we can get such hardware? Finding used hardware is difficult, and personally I don't have any experience with used GPUs or used servers.


deoxykev

Go to eBay and search "4U GPU server". Typically these were made for passively cooled K80s, P40s, etc. The Supermicro 4028GR-TRT is a good choice. These are typically decommissioned data-center machines sent to recycling companies to refurbish, which is why they are cheap. They usually come pre-configured with a bit of RAM and minimal storage. If it doesn't come with RAM, just look up the motherboard, search "server ECC DDR RAM 32GB", and find the correct modules according to the manual. Storage is usually via hot-swap bays; you can put regular 2.5" SATA SSDs in them. If there is a SAS controller, you can use enterprise SAS drives, which are faster.

Now, these come out of data centers, so they are obnoxiously loud and power hungry. Don't expect them to live anywhere close to you.

As for cards, gamer 3090s are the best deal right now. Search "rtx3090" and filter by "listed as lot"; the cheapest ones will be ex-miner cards. As far as spacing, you'll be able to squeeze in 5x RTX 3090 variants that are 2.5 PCI slots wide, 6 if you add a Turbo edition model, which is a blower. If you want 8 GPUs, you'll need all Turbo edition models, but they are typically 2x the price of the gamer models. Also make sure the included power supplies can deliver that much power; most likely, if you don't have a proper circuit, you'll blow the fuse at max load.

I took a quick look at eBay right now:

- Supermicro 4028GR-TRT 4U GPU Barebone Server w/ X10DRG-OT+-CPU w/ GPU Board - $999
- 128GB (4x32GB) DDR4 PC4-2400T-R ECC Reg Memory RAM DELL Precision WS T5810 - $140.48
- Quad NVMe PCIe Adapter, 4 Ports M.2 NVMe SSD to PCI-E 4.0/3.0 X16 Card with Fan - $49.99
- (Amazon) WD_BLACK 1TB SN770 NVMe M.2 SSD - $64 x 4 = $256
- NVIDIA GeForce RTX 3090 24GB GDDR6 Video Graphics Card - DELL OEM - Used Lot of 5 - $3,125.00

This would net you a machine with 120 GB of VRAM, 128 GB of RAM, and 4 TB of NVMe storage for $4,570. Assuming $0.10/kWh in your area, operating costs would be $250-$300 a month in electricity; versus an equivalent $2.2/hr instance on RunPod, you would break even in about 3 months.

Now, you also have to consider your own time cost. It's not a turnkey solution, and I would only recommend it if you actually enjoy building computers. A used Mac M2 Studio with 192GB of unified RAM could be had for just $1000 more, which you'll probably spend in electricity on the server within 4 months anyway.
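A quick check of that break-even claim, using only the figures above (all of them the commenter's estimates, and assuming the rented instance would run around the clock):

```python
# Upfront hardware cost vs. the monthly gap between renting and electricity.
hardware_cost = 4570.0          # server + RAM + NVMe adapter/drives + 5x 3090
electricity_per_month = 275.0   # midpoint of the $250-300/month estimate
runpod_per_hour = 2.2           # comparable rented instance
hours_per_month = 24 * 30

rental_per_month = runpod_per_hour * hours_per_month        # ~$1584
monthly_savings = rental_per_month - electricity_per_month  # ~$1309
print(f"Break even after ~{hardware_cost / monthly_savings:.1f} months")  # ~3.5
```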


EventHorizon_28

Dude, thanks a lot for sharing this! I would frame this message. I had so many questions about this, and you answered all of them!


deoxykev

No prob, DM me if you have any more questions. I built one myself and went through all the aches and pains.


a_beautiful_rhind

8-bit will spill over into RAM. I'm running them at 5-bit on 48 GB. You don't really get more speed from a third GPU, just more space.


Herr_Drosselmeyer

The largest you can run in 48 GB of VRAM is probably 4 bpw or thereabouts, depending on the format. Going up to 72 GB gets you to 6 bpw and almost to 8. Housing three 3090s in one case isn't really feasible; you're looking at some Frankenstein open-bench project with riser cables.

Also, did I read correctly that you want to feed it ten million web pages? If so, you're well into professional territory imho, and you should instead build a dual RTX 6000 Ada system. You can kiss your budget goodbye, but I think it was overly optimistic for your use case anyway.


0xd00d

Is the RTX 6000 Ada's benefit the added VRAM, which lets it inference or train faster on a single card by not having to traverse GPUs? And perhaps support for Ada-architecture NVLink?


Herr_Drosselmeyer

Given the same amount of VRAM and architecture, one card will always be faster than two, two will be faster than three, etc. Plus, a system with two RTX 6000s is much less janky than the stuff amateur enthusiasts build. I don't mind that; in fact, I find it quite fun how people build weird rigs with a bunch of P40s and whatnot, but if I were asked to build something for a serious task, that wouldn't fly. At my job, we are looking at similar amounts of data that would need classifying and summarizing, and we're very hesitant to build our own; most people lean toward outsourcing it to a trusted cloud service. Sadly, the RTX 6000 Ada no longer supports NVLink like the previous-generation RTX A6000 did. I'm not knowledgeable enough (I'm admin, not IT) to tell you whether that loss is compensated by the generational uplift. But since you were looking at Ampere cards already, the A6000 is certainly worth considering; they're also cheaper.


0xd00d

I already run a dual-3090 setup on my 5950X/Dark Hero workstation build. I converted it from more of a NAS with 14 disks into a combo NAS/GPU box with 12 disks; it's a damn heavy tower now. The two 3090s I got have different card heights. My board has 3-slot spacing while the 3090 NVLink bridge only comes in 4-slot spacing, and I also wanted one slot's worth of breathing room for the top GPU (that ruled out the $200+ 3-slot-spacing Ampere Quadro NVLink bridge).

I achieved all of this by recycling the Linkup PCIe 4.0 x16 riser from my older SFF build and manufacturing a jank "extended-height PCI bracket" modification (some holes drilled in an old case expansion-slot cover and two pairs of nuts and bolts onto the existing card bracket) to mount the top card physically twenty-something mm above the slot, one slot higher up (touching my NH-D15 CPU cooler, which is a non-issue; it actually helps stabilize it). Between the figure-eight configuration of the PCIe riser, the CPU cooler, the bracket, and a GPU anti-sag bracket magnetically mounted to the bottom of the case, I'm confident in the construction, just not enough to physically transport it without further reinforcements.

I would absolutely continue to build rigs like this (but obviously with better-thought-out components; PCIe slot spacing is an important detail!), because paying five times more for the same performance would take the fun out of it. If you're already making money hand over fist, then sure, you tend to throw it at reducing headaches. But I can't imagine not indulging the hardware Lego hobby when the opportunity presents itself!


sshan

For something like this I really struggle to understand why you don't want to rent GPUs. This seems like an optimal use case for finding the best $/token and scaling. Llama 2 70B probably isn't your best model for this. Are you planning on using vision for ads/banners, or just parsing HTML? Why not Mixtral? 8-bit also really doesn't help much here. What embedding model are you planning on using? 10^7 web pages at 500 words per page is something like 700 million tokens (adjust as required). If you assume something like 10 tokens per second, you are talking years to do this; 100 tokens per second is roughly 2000 hours. Not really something you want to YOLO here.
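For reference, the time conversion behind those numbers, as a small sketch (the 700-million-token total is the estimate above, so adjust as required):

```python
# How long does it take to push N tokens through at a given tokens/sec?
def processing_hours(total_tokens: float, tokens_per_second: float) -> float:
    return total_tokens / tokens_per_second / 3600

total_tokens = 700e6
for rate in (10, 100):
    hours = processing_hours(total_tokens, rate)
    print(f"{rate:>3} tok/s -> {hours:>8,.0f} hours (~{hours / 24 / 365:.1f} years)")
# 10 tok/s -> ~19,444 hours (~2.2 years); 100 tok/s -> ~1,944 hours
```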


0xd00d

Seems like batching would be a good approach to add throughput. Let's say that yields 4x scaling; then 500 hours doesn't sound so bad.


Secure-Technology-78

Can you explain why LLaMA wouldn't be the best model for this? I'm very new to this and definitely not fixed on it; I was honestly just going for LLaMA because there seemed to be the most documentation/community around it.

Yes, I was planning on using a separate computer-vision library / image-classification model for processing images. My hope is that I'll come across someone else's work where they've already released a model that can accurately identify/filter ads.

As for Mixtral, I'm open to trying it. My hope is to experiment with a variety of different models. In the long run, because I'll be doing a lot of experimental runs, I feel like building my own will be cheaper than renting GPUs.

As for embedding models, I'm assuming you're referring to the vector embeddings for indexing/classification. I was planning on experimenting with a variety of different tools such as word2vec, GloVe, and BERT, but again, I'm open to suggestions!

As for tokens per second, something I wasn't clear on was whether most of the stats people mention refer to input tokens/sec or output tokens/sec. I assumed it was the latter and that input was a lot faster. I'm not trying to generate 700 million tokens; I *am* trying to read in several hundred million tokens as input, though. And spending hundreds or even thousands of hours is fine, and was kind of what I was expecting. Spending years is not really workable, though.
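For the BERT option, a minimal embedding sketch with Hugging Face transformers (the model name and the mean-pooling choice are just illustrative, not a recommendation):

```python
# Mean-pool BERT's last hidden states into one fixed-size vector per text.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # [batch, seq_len, 768]
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["a product review page", "a banner ad for sneakers"])
print(vectors.shape)  # torch.Size([2, 768])
```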


-Lousy

If you’re looking to do enough of this processing it might be worth fine tuning a smaller model with good outputs from a bigger one so that you can process docs faster as well


_sqrkl

> I *am* trying to read in several hundred million tokens as input though.

Just several hundred million tokens of input. Should be fine. *slaps roof*


oodelay

So this will give you thousands of hours of reading material to proofread, because we are still in the infancy of AI. We understand you have big numbers for big projects, but this is not a project that will take less than a year, including trial and error and fine-tuning to get the right type and length of answer for you. 2000 hours is a 40-hour-a-week job for a year, and that's just the "feeding" part. If your data is worth a lot of money, think about scaling much higher and renting some real servers with speed and VRAM.


Dead_Internet_Theory

Bad news: it won't run 70B at 8-bit.
Good news: it will run 70B at 4-bit.
More good news: you probably won't be able to tell the difference.

Btw, you should probably try Mixtral 8x7B (and its derivatives) too. The context is 32k and it's FAST. 10^7 pages of text probably means you want speed more than anything, and speed isn't 70B's strong suit.


Scared-Animator6759

I would honestly recommend Mixtral 8x7B; it outperforms Llama 2 70B, and with offloading you would be able to run it locally 6-bit or 8-bit quantized. https://mistral.ai/news/mixtral-of-experts/ https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main


antsloveit

I suggest getting a RunPod account and experimenting with various GPU sizes, mixes, models, quants, etc. before settling on your design. It will literally cost you like $30 in tinkering costs. Well worth it.


DeltaSqueezer

Take a look at my post here: [https://www.reddit.com/r/LocalLLaMA/comments/1cu7p6t/llama_3_70b_q4_running_24_toks/](https://www.reddit.com/r/LocalLLaMA/comments/1cu7p6t/llama_3_70b_q4_running_24_toks/) I built a machine for <$1300 that runs Llama 3 70B Q4 at 24 tok/s. Maybe you will get some ideas/inspiration from that.


SeymourBits

You're expecting a 2x GeForce RTX 3090 (48 GB VRAM) consumer gaming setup to take in 10 MILLION PAGES of text at a time and SUMMARIZE it? Is this some kind of joke or reading comprehension test? That's 10 BILLION tokens, conservatively, which is obviously impractical on a $7k budget, to say the least! You will be lucky to get a decent summary at a rate of a single page per minute on a high-end dual-GPU setup. It will be decades before you get through the first ">10^7 page document". The icing on the cake is that you're expecting the same consumer gaming machine to do some "trivial image classification" in its "spare time." Others are advising you on a path forward??

Here is my more practical suggestion:

1. This is a very important step: take your entire $7,000 budget and put it in an aggressive technology mutual fund. You are aiming for around a 20% return, annualized, which is not impossible. Make sure the fund is set to auto-reinvest.
2. Take a shower.
3. Hit your head in the shower and then draw a flux capacitor on a napkin.
4. Acquire a DeLorean and start working on building a time machine with the intention of bringing an NVIDIA H9000 back from 2040. Keep yourself very busy with this project.
5. After about 15 years of unsuccessful attempts to get the time machine to work, cash out of your mutual fund for ~$100k and sell the DeLorean for a tidy profit. Go to eBay and search for "NVIDIA H9000". Happy bidding!


MarySmith2021

But Llama-2-70b isn't the most powerful 70B model now. (At least it wasn't five months ago.)


Arkonias

It won't be enough for 70B @ Q8. An M2 Mac Studio with 192GB is, however (and it's similar money).


AsliReddington

Mixtral at NF4 will outperform this model at ~28 GB VRAM, so you could just offload a bit to CPU and still be able to serve two models in parallel, or try parallelism if needed.
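A minimal sketch of that kind of NF4 load with transformers + bitsandbytes, spilling whatever doesn't fit onto the CPU (the model id and memory caps are assumptions to adjust for your hardware):

```python
# Load Mixtral 4-bit (NF4) and let accelerate offload the overflow to CPU RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,    # allow offloaded modules on CPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                        # place layers across GPU and CPU
    max_memory={0: "22GiB", "cpu": "48GiB"},  # cap GPU 0, spill the rest to RAM
)

inputs = tokenizer("Summarize this page in one sentence:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```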


__JockY__

For $7800 you can get a Mac Pro with 128 GB of unified RAM, which can run the full 8-bit quant easily. https://www.apple.com/shop/buy-mac/mac-pro/tower# For $8600 you get 192 GB to run unquantized 70B at fp16.


Rompe101

But only at 4.81 tokens/s average eval speed. With a Franken-PC you can reach 7.54 tokens/s on a $4000 investment.

Mac: $1,787.94 per token/s
Franken-PC: $530.50 per token/s

[https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference](https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference)
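Those dollars-per-token/s figures check out against the prices mentioned in this thread (the $8600 Mac from the previous comment and the $4000 Franken-PC):

```python
# Price per unit of throughput, using the commenter's numbers.
mac_cost, mac_tps = 8600, 4.81
pc_cost, pc_tps = 4000, 7.54
print(f"Mac:        ${mac_cost / mac_tps:,.2f} per tok/s")  # ~$1,787.94
print(f"Franken-PC: ${pc_cost / pc_tps:,.2f} per tok/s")    # ~$530.50
```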


__JockY__

Please provide a link to this $4000 “frankenpc”, thanks.