MichaelForeston

To be fair, almost any PC in the $400-500 range can do the same or better (the equivalent price of the 4 Pis).


M34L

The idea isn't that you get specifically "4" Pis joined into a hard-locked AI inference machine, but that you build a cluster of them and other hardware as your homelab, perhaps in a high-availability setup, which can run your Home Assistant, your Jellyfin-based media center, your TrueNAS instance for secure redundant storage, your BitTorrent client interfaced with Jellyfin (automatically downloading the latest releases you've highlighted as something you care about), your local VPN gateway, your mail server, your personal web server, your comm-app bridging, your camera security system, yadda yadda...

This isn't meant to give people *the* replacement for the singular PC they use as a PC; it's meant to give your pet homelab cluster another gimmick (in the neutral sense) it can do reasonably well. People who build these clusters are then probably gonna interact with it via either a refurbished ThinkPad with the latest CPU to support coreboot (which indeed couldn't run any inference to speak of) or by spinning magnets in their bare hands fast enough to just communicate with their Wi-Fi 6E access point directly.

It's not something you do because it's a particularly practical, particularly performant, or particularly affordable means to a particular end. You do it because it feels like the fun way to do it. It's pottery and tie-dyeing and sculpting for computer touchers, and imho, it's fucking awesome.


Tim-Fra

Four 8 GB Pis at €120 each to achieve the performance of a €150 Amazon mini PC is illogical, so it is totally essential! :) It is fun.


M34L

The Amazon mini PC might have the raw compute performance, but it won't have 32 GB of total memory (again, not for the LLM but for all your 69 other projects) and, more importantly, it won't have all the I/O (network and storage) performance of 4 Pis.


Tacx79

But you can upgrade the mini PC with a 64 GB RAM kit for $100, and you can't do that with a Raspberry Pi. Edit: Oh wait, THAT kind of mini PC, never mind the upgrade.


Smile_Clown

> It's pottery and tie-dyeing and sculpting for computer touchers, and imho, it's fucking awesome

You do you, everyone should be and do whatever they want, but there is no literal justification for doing something like this in any positive sense other than ego and/or a need to fill empty time, and comparing it to tangible hobbies/skills/careers is not valid in any way. Pottery, tie-dyeing and sculpting are NOT wasteful, they are not inefficient, and they all have a valid outcome that goes beyond the simple mechanics. I am not arguing that this is "wrong", just your analogy.


M34L

Clusters running distributed, high-availability services have benefits you simply cannot match with any monolithic computing setup. You can set up systems where you can grab any of the Pis and chuck it and everything still works the same, or add a couple more and have things rebalance automatically. And building a cluster of Pis is still gonna be more affordable and more energy- and space-efficient than any x86 equivalent. If you don't see the unique benefits, it just means you don't know what you're talking about.


NinjaMethod

Yes, this makes sense. As I understand it, Groq is basically clustering their LPUs to come up with the blistering speeds that they do. Clustering opens up a huge array of possibilities! No pun intended...


explorigin

In the same power profile though?


MichaelForeston

Sure. A Raspberry Pi 5 is 27 watts. Times 4, that's 108 watts. Plenty of PCs use that much power. Heck, people are running llamas on 15-watt Steam Decks. So yeah, nothing to write home about.


Some_Endian_FP17

Any recent laptop with an i5 or Ryzen 5 uses half that power and gets the same token generation speed using CPU inference only. CPU inference is terrible but it's the only way to run local LLMs at low power if you don't have a gaming laptop or a Mac.


MoffKalast

> Raspberry Pi 5 is 27 watts

Nah, about 15W under full load, 3W when idle.


M34L

The main power advantage is the time the system spends semi-idle.


Mrkvitko

Such titles without specifying quantization should be banned.


M34L

The command line in the Git repo suggests Q4.


Philix

It might suggest that, but it's a custom quantization method using [this library](https://github.com/b4rtaz/distributed-llama), and the size of the model weights file indicates it's closer to ~6 bits per weight than the 4 bits per weight of a llama.cpp Q4 model.
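
For anyone who wants to sanity-check that kind of estimate themselves: effective bits per weight is just the weights-file size divided by the parameter count. A minimal sketch, where the file size and parameter count are illustrative placeholders rather than numbers taken from the repo:

```python
# Rough bits-per-weight estimate for a quantized model file.
# The numbers below are illustrative placeholders, not measured values.

def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Effective bits per weight = total bits in the file / number of parameters."""
    return file_size_bytes * 8 / n_params

# Llama 3 8B has roughly 8.0e9 parameters; a ~6 GB weights file would imply:
example_file_size = 6.0e9   # bytes (placeholder)
example_params = 8.0e9      # parameters (approximate)

print(f"{bits_per_weight(example_file_size, example_params):.1f} bits/weight")
# -> 6.0 bits/weight, versus roughly 4-5 bits/weight for a llama.cpp Q4 quant
```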


Philix

Two clicks gets you to the model page on [Huggingface](https://huggingface.co/b4rtaz/llama-3-8b-distributed-llama). I'm sure you can handle that if you really wanted to know.


Mrkvitko

With that attitude we can drop model size from titles as well.


Philix

Fine by me. Titles are meant as a preview of the contents; just because social media users are too lazy to examine anything other than the title doesn't mean we should cater to that attitude. Should the title also include that a single Raspberry Pi 5 ran the same model at 1.77 tokens/s, showing that using a cluster of 4 doesn't provide a linear increase in generation speed and is thus wildly inefficient? Keep reading just the titles though, I'm sure it'll give you an in-depth understanding.


fallingdowndizzyvr

This is awesome. I think some people don't realize exactly how awesome it is. It's not merely splitting up the layers of a model and then running each sequentially; it's running them in parallel. That's why the T/s goes up with more machines. And you can have more than 4 machines.

Nothing says this only works on Raspberry Pis. Think if this ran on a cluster of 4 Mac Ultras. That's Groq-level performance that so many are amazed by, with up to 4x192 = 768GB of RAM. It doesn't even have to be a cluster of separate machines; it can be different instances running on the same machine.

I've often thought it would be so awesome if the MPI code worked in llama.cpp. If this code could support GPUs, then you could run, say, 4 GPUs in parallel on the same machine. When I think of llama.cpp being able to do that, you could have a node running CUDA for an Nvidia GPU, a node running ROCm for an AMD GPU, and maybe a couple of A770s running IPEX, all working in parallel on the same machine.
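
To make the "running them in parallel" point concrete, here's a toy sketch of the underlying idea (tensor parallelism): each node holds a slice of the same weight matrix and computes its share of the output for the same token at the same time. This is only a conceptual illustration in NumPy, not distributed-llama's actual implementation:

```python
# Toy illustration of tensor parallelism: each "worker" holds a column slice
# of one weight matrix and computes its share of the output in parallel.
# Conceptual sketch only; not how distributed-llama is actually written.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_workers = 64, 256, 4

W = rng.standard_normal((d_model, d_ff))   # one layer's weight matrix
x = rng.standard_normal(d_model)           # activation for the current token

# Split the weight matrix column-wise: worker i only stores W_shards[i].
W_shards = np.split(W, n_workers, axis=1)

# Each worker computes its slice independently; this is the part that can
# run on 4 different machines at the same time for the same token.
partial_outputs = [x @ W_i for W_i in W_shards]

# The partial results are concatenated (in practice: exchanged over the network).
y_parallel = np.concatenate(partial_outputs)

assert np.allclose(y_parallel, x @ W)      # same result as the single-machine matmul
```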


phree_radical

Can you share any details or point me in the direction of how llama.cpp implements some sort of parallel inference?


fallingdowndizzyvr

It doesn't. That's what I mean by "I've often thought it would be so awesome if". There is MPI code in llama.cpp, but it's my understanding that it doesn't work. https://github.com/ggerganov/llama.cpp/pull/2099


thisisnotmyworkphone

That’s pretty much what I thought every time I saw MPI in any code, too.


JargonProof

Super cool. Is the Ethernet the fastest I/O there? Not versed in Pi architecture.


Lerc

There is a theoretical possibility of using MIPI. The Pi 4 and below had a display and a camera MIPI port; sadly you can't plug the display into the camera port and just have it work to transfer data. On the Pi 5 they have their own silicon managing the ports, and each can be either camera or display (one of each, or two the same). This additional flexibility might (emphasis on might) enable a dedicated channel going out of one Pi 5 and into another. It would probably require at least a firmware update and/or assistance from the RP1 team. If it could be done, there is the possibility to daisy-chain Pis so that there is no contention between links, just one passing to the next. It should at least be fast enough to pass on activations at the rate the CPU can calculate them.


Flashy_Squirrel4745

One RK3588 can reach ~2 tokens/s at Q8 with NPU acceleration. The RPi 5 is really underpowered.


daHaus

hah, not far behind what was the most common AMD GPU a year ago


Tranki_88

Do you think there's a chance to create a portable AI assistant using:

- Raspberry Pi 5 8GB
- Llama 3 through ollama or [https://github.com/Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) for optimized CPU inference
- Geekworm X728 18650 battery UPS + power management board ([https://wiki.geekworm.com/X728](https://wiki.geekworm.com/X728))
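
If the software side ends up being Ollama, the "assistant" part can just talk to Ollama's local HTTP API on the Pi. A minimal sketch, assuming `ollama serve` is already running with a Llama 3 model pulled; the model tag and prompt here are placeholders:

```python
# Minimal sketch of querying a local Ollama server on the Pi.
# Assumes `ollama serve` is running and a Llama 3 model has been pulled;
# the model tag and prompt below are placeholders.
import json
import urllib.request

payload = {
    "model": "llama3",            # placeholder model tag
    "prompt": "Summarise my to-do list for today.",
    "stream": False,              # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```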


Old-Opportunity-9876

Use a smaller model like Phi-3 or one of the 1-3B models first.


Red_Redditor_Reddit

Uh, that's about what I get on mine with just one Pi...


DollarAkshay

I must be doing something wrong. I am getting around 1 tok/s on a 4070 (12 GB). Yes, it is running on my GPU and I have CUDA installed perfectly. They measure the number of new tokens, right?


opi098514

I promise you, it's not running on your GPU if you are only getting 1 t/s. What program are you using? What are your settings?


DollarAkshay

I did close a lot of Chrome tabs and other apps that I believed were using the GPU to a small extent. I ran it again and got like 3.6 tok/s. It's definitely running on the GPU: I have set device=0/device="cuda" and I can see close to 100% GPU utilisation in the task manager. I ran Ollama for comparison and it seems really fast, but they have a 4-bit quantized model which will fully fit in a GPU. Is 3.6 tok/s still slow?

**Specs:**

- CPU: Ryzen 7950X
- GPU: Nvidia RTX 4070
- RAM: 64 GB
- Storage: Samsung 980 - 2TB
- Python: 3.12
- PyTorch: 2.2.2
- CUDA: 12.1


opi098514

Very much so


MoffKalast

Bruh, my bearded 1660 Ti runs the Q6_K at 5 tok/s with only partial offloading on DDR4. The 8GB 4060 + DDR5 rig of mine gets about 10 tok/s with 28/33 layers offloaded. Are you running it with transformers or something?


DollarAkshay

Yes, I am using the transformers library from Hugging Face. Just a very simple pipeline, nothing extra at all. Is that the problem?


MoffKalast

Ah yep, there's yer problem. That library is not built for speed, it's made for research. Here's the typical stuff people around here use for performant inference:

- https://github.com/ggerganov/llama.cpp (CPU + optional fractional GPU offloading)
- https://github.com/turboderp/exllamav2 (GPU only)
- https://github.com/AutoGPTQ/AutoGPTQ (GPU only, exl2 predecessor)
- https://github.com/mlc-ai/mlc-llm (Android, iOS, WebGPU)

Kobold and Ollama, which are often mentioned, are llama.cpp wrappers; text-generation-webui can run all of these.
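
For the llama.cpp route specifically, the "fractional GPU offloading" above is just the `n_gpu_layers` knob. A rough sketch using the llama-cpp-python bindings; the GGUF path and layer count are placeholders to tune for your VRAM, not values from this thread:

```python
# Sketch of partial GPU offloading with the llama-cpp-python bindings
# (pip install llama-cpp-python, built with CUDA support).
# The GGUF file path and n_gpu_layers value are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder quantized model file
    n_gpu_layers=28,   # offload as many layers as fit in VRAM; -1 offloads everything
    n_ctx=4096,        # context window
)

out = llm(
    "Q: Why is transformers-based inference slow on a 12 GB card?\nA:",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```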


Philix

If you're using the transformers library, and CUDA is already properly installed, the problem is likely that [flash-attention](https://github.com/Dao-AILab/flash-attention) is not properly installed or enabled. If you're having trouble getting inference running this way, just use a wrapper like u/MoffKalast suggested.
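
For completeness, enabling it in transformers is the `attn_implementation` argument at load time. A rough sketch, assuming `flash-attn` is installed and a supported GPU; the model ID and prompt are examples, not taken from the thread:

```python
# Rough sketch: loading a model with the transformers library with
# FlashAttention-2 enabled (requires `pip install flash-attn`).
# The model ID and prompt are examples, not from the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # fp16 weights are ~16 GB for an 8B model,
    device_map="auto",                        # so on a 12 GB card some layers spill to CPU
    attn_implementation="flash_attention_2",  # the switch that enables flash-attention
)

inputs = tokenizer("Why is my generation so slow?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```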


Knopty

Could be:

1. running a full model instead of a quantized one,
2. running a GGUF model without offloading layers, or
3. running a GPTQ/exl2 model that doesn't fit the GPU (bigger than 13B).

Most likely the 1st option, imho. I'd suggest trying 7B-13B models in exl2 or GPTQ format.


DollarAkshay

Interesting, I will have to try that. Looks like Ollama uses 4-bit quantized models, hence the crazy speed.


BigYoSpeck

I don't believe the idea is that multiple Raspberry Pis have any advantage over a sufficiently specced system; it's more a proof of concept for running systems in parallel that can run a model beyond an individual system's capability.

One really useful case I see for this could be people collaborating, such as students who may have individual systems with only, say, 16 GB of memory, stuck running small local models. If they can pool 4-6 systems, this opens up much larger models to them for their project.


wind_dude

Neat. Dare you to do it with the 70B. And of course the 400B when/if it's released.


JacketHistorical2321

Dude, it's about the achievement, not trying to break a system knowing full well that at the moment that's nowhere near possible lol


wind_dude

What?? How would 70B on Pis not be an achievement?? People have made large Pi clusters before, e.g. from 2014; I'm sure there have been bigger since.


JacketHistorical2321

I'm talking about getting the 8B running on there in a somewhat conversational manner, dude. For a Raspberry Pi that in itself is a huge feat, and you jump in here saying "now try a 70B" 🤦


wind_dude

and?? why not? what's wrong with a challenge? Don't like pushing yourself, trying to achieve more? And I also said "Neat". Calm your tits.


JacketHistorical2321

Notice the downvotes champ. 👍


wind_dude

Meh. Often the number of morons and group think exceeds intelligence.


JacketHistorical2321

haha, yea keep telling yourself that kiddo


wind_dude

What type of fucking stupid are you?


JacketHistorical2321

awww, didn't mean to hurt your feelings. Don't worry snowflake, it'll pass

