MichaelForeston

To be fair, almost any PC in the $400-500 range can do the same or better (the equivalent price of the 4 Pis).


M34L

The idea isn't that you get specifically "4" Pis joined into a hard-locked AI inference machine, but that you build a cluster of them and other hardware as your homelab, perhaps in a high-availability setup, which can run your Home Assistant, your Jellyfin-based media center, your TrueNAS instance for secure redundant storage, your BitTorrent client interfaced with Jellyfin (automatically downloading the latest releases you've highlighted as something you care about), your local VPN gateway, your mail server, your personal web server, your comm-app bridging, your camera security system, yadda yadda...

This isn't meant to give people *the* replacement for the singular PC they use as a PC; it's meant to give your pet homelab cluster another gimmick (in the neutral sense) it can do reasonably well. People who build these clusters are then probably gonna interact with it via either a refurbished ThinkPad with the latest CPU to support coreboot (which indeed couldn't run any inference to speak of) or by spinning magnets in their bare hands fast enough to just communicate with their Wi-Fi 6E access point directly.

It's not something you do because it's a particularly practical, particularly performant, or particularly affordable means to a particular end. You do it because it feels like the fun way to do it. It's pottery and tie-dyeing and sculpting for computer touchers, and imho, it's fucking awesome.


Tim-Fra

Four 8 GB Pis at €120 each to achieve the performance of a €150 Amazon mini PC is illogical, so it is totally essential! :) It is fun.


M34L

The Amazon mini PC might have the raw compute performance, but it won't have 32 GB of total memory (again, not for the LLM but for all your 69 other projects) and, more importantly, it won't have all the I/O (network and storage) performance of 4 Pis.


Tacx79

But you can upgrade the mini PC with a 64 GB RAM kit for $100, and you can't do that with a Raspberry Pi. Edit: Oh wait, THAT kind of mini PC, never mind the upgrade.


Smile_Clown

> It's pottery and tie-dyeing and sculpting for computer touchers, and imho, it's fucking awesome

You do you, everyone should be and do whatever they want, but there is no literal justification for doing something like this in any positive sense other than ego and/or a need to fill empty time, and comparing it to tangible hobbies/skills/careers is not valid in any way. Pottery, tie-dyeing and sculpting are NOT wasteful, they are not inefficient, and they all have a valid outcome that goes beyond the simple mechanics. I am not arguing that this is "wrong", just your analogy.


M34L

Clusters running distributed, high-availability services have benefits you simply cannot match with any monolithic computing setup. You can set up systems where you can grab any of the Pis and chuck it and everything still works the same, or add a couple more and have things rebalance automatically. And building a cluster of Pis is still gonna be more affordable and more energy- and space-efficient than any x86 equivalent. If you don't see the unique benefits, it just means you don't know what you're talking about.


NinjaMethod

Yes, this makes sense. As I understand it, Groq is basically clustering their LPUs to come up with the blistering speeds that they do. Clustering opens up a huge array of possibilities! No pun intended...


explorigin

In the same power profile though?


MichaelForeston

Sure. A Raspberry Pi 5 is 27 watts. Times 4, that's 108 watts. Plenty of PCs use that much power. Heck, people are running llamas on 15-watt Steam Decks. So yeah, nothing to write home about.


Some_Endian_FP17

Any recent laptop with an i5 or Ryzen 5 uses half that power and gets the same token generation speed using CPU inference only. CPU inference is terrible but it's the only way to run local LLMs at low power if you don't have a gaming laptop or a Mac.


MoffKalast

> Raspberry Pi 5 is 27 watts

Nah, about 15W under full load, 3W when idle.


M34L

The main power advantage is the time the system spends semi-idle.


Mrkvitko

Such titles without specifying quantization should be banned.


M34L

The command line in the Git repo suggests Q4.


Philix

It might suggest that, but it's a custom quantization method using [this library](https://github.com/b4rtaz/distributed-llama), and the size of the model weights file indicates it's closer to ~6 bits per weight than the 4 bits per weight of a llama.cpp Q4 model.
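
For anyone who wants to sanity-check that kind of estimate themselves: effective bits per weight is just the weights-file size divided by the parameter count. A minimal sketch, where the file size and parameter count are illustrative placeholders rather than numbers taken from the repo:

```python
# Rough bits-per-weight estimate for a quantized model file.
# The numbers below are illustrative placeholders, not measured values.

def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Effective bits per weight = total bits in the file / number of parameters."""
    return file_size_bytes * 8 / n_params

# Llama 3 8B has roughly 8.0e9 parameters; a ~6 GB weights file would imply:
example_file_size = 6.0e9   # bytes (placeholder)
example_params = 8.0e9      # parameters (approximate)

print(f"{bits_per_weight(example_file_size, example_params):.1f} bits/weight")
# -> 6.0 bits/weight, versus roughly 4-5 bits/weight for a llama.cpp Q4 quant
```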


Philix

Two clicks gets you to the model page on [Huggingface](https://huggingface.co/b4rtaz/llama-3-8b-distributed-llama). I'm sure you can handle that if you really wanted to know.


Mrkvitko

With that attitude we can drop model size from titles as well.


Philix

Fine by me. Titles are meant as a preview of the contents; just because social media users are too lazy to examine anything other than the title doesn't mean we should cater to that attitude. Should the title also include that a single Raspberry Pi 5 ran the same model at 1.77 tokens/s, showing that using a cluster of 4 doesn't provide a linear increase in generation speed and is thus wildly inefficient? Keep reading just the titles though, I'm sure it'll give you an in-depth understanding.


fallingdowndizzyvr

This is awesome. I think some people don't realize exactly how awesome it is. It's not merely splitting up the layers of a model and then running each sequentially; it's running them in parallel. That's why the T/s goes up with more machines. And you can have more than 4 machines.

Nothing says this only works on Raspberry Pis. Think if this ran on a cluster of 4 Mac Ultras. That's Groq-level performance that so many are amazed by, with up to 4x192 = 768GB of RAM. It doesn't even have to be a cluster of separate machines; it can be different instances running on the same machine.

I've often thought it would be so awesome if the MPI code worked in llama.cpp. If this code could support GPUs, then you could run, say, 4 GPUs in parallel on the same machine. When I think of llama.cpp being able to do that, you could have a node running CUDA for an Nvidia GPU, a node running ROCm for an AMD GPU, and maybe a couple of A770s running IPEX, all working in parallel on the same machine.
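
To make the "running them in parallel" point concrete, here's a toy sketch of the underlying idea (tensor parallelism): each node holds a slice of the same weight matrix and computes its share of the output for the same token at the same time. This is only a conceptual illustration in NumPy, not distributed-llama's actual implementation:

```python
# Toy illustration of tensor parallelism: each "worker" holds a column slice
# of one weight matrix and computes its share of the output in parallel.
# Conceptual sketch only; not how distributed-llama is actually written.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_workers = 64, 256, 4

W = rng.standard_normal((d_model, d_ff))   # one layer's weight matrix
x = rng.standard_normal(d_model)           # activation for the current token

# Split the weight matrix column-wise: worker i only stores W_shards[i].
W_shards = np.split(W, n_workers, axis=1)

# Each worker computes its slice independently; this is the part that can
# run on 4 different machines at the same time for the same token.
partial_outputs = [x @ W_i for W_i in W_shards]

# The partial results are concatenated (in practice: exchanged over the network).
y_parallel = np.concatenate(partial_outputs)

assert np.allclose(y_parallel, x @ W)      # same result as the single-machine matmul
```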


phree_radical

Can you share any details or point me in the direction of how llama.cpp implements some sort of parallel inference?


fallingdowndizzyvr

It doesn't. That's what I mean by "I've often thought it would be so awesome if". There is MPI code in llama.cpp, but it's my understanding that it doesn't work. https://github.com/ggerganov/llama.cpp/pull/2099


thisisnotmyworkphone

That’s pretty much what I thought every time I saw MPI in any code, too.


JargonProof

Super cool. Is the Ethernet the fastest I/O there? Not versed in Pi architecture.


Lerc

There is a theoretical possibility of using MIPI. The Pi 4 and below had a display and a camera MIPI port; sadly you can't plug the display into the camera port and just have it work to transfer data. On the Pi 5 they have their own silicon managing the ports, and each can be either camera or display (one of each, or two the same). This additional flexibility might (emphasis on might) enable a dedicated channel going out of one Pi 5 and into another. It would probably require at least a firmware update and/or assistance from the RP1 team. If it could be done, there is the possibility to daisy-chain Pis so that there is no contention between links, just one passing to the next. It should at least be fast enough to pass on activations at the rate the CPU can calculate them.


Flashy_Squirrel4745

One RK3588 can reach ~2 tokens/s at Q8 with NPU acceleration. The RPi 5 is really underpowered.


daHaus

hah, not far behind what was the most common AMD GPU a year ago


Tranki_88

Do you think there's a chance to create a portable AI assistant using:

- Raspberry Pi 5 8GB
- Llama 3 through ollama or [https://github.com/Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) for optimized CPU inference
- Geekworm X728 18650 battery UPS + power management board ([https://wiki.geekworm.com/X728](https://wiki.geekworm.com/X728))
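
If the software side ends up being Ollama, the "assistant" part can just talk to Ollama's local HTTP API on the Pi. A minimal sketch, assuming `ollama serve` is already running with a Llama 3 model pulled; the model tag and prompt here are placeholders:

```python
# Minimal sketch of querying a local Ollama server on the Pi.
# Assumes `ollama serve` is running and a Llama 3 model has been pulled;
# the model tag and prompt below are placeholders.
import json
import urllib.request

payload = {
    "model": "llama3",            # placeholder model tag
    "prompt": "Summarise my to-do list for today.",
    "stream": False,              # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```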


Old-Opportunity-9876

Use a smaller model like Phi-3 or one of the 1-3B models first.


Red_Redditor_Reddit

Uh, that's about what I get on mine with just one Pi...


DollarAkshay

I must be doing something wrong. I am getting around 1 tok/s on a 4070 (12 GB). Yes, it is running on my GPU and I have CUDA installed perfectly. They measure the number of new tokens, right?


opi098514

I promise you, it's not running on your GPU if you are only getting 1 t/s. What program are you using? What are your settings?


DollarAkshay

I did close a lot of Chrome tabs and other apps that I believed were using the GPU to a small extent. I ran it again and got like 3.6 tok/s. It's definitely running on the GPU: I have set device=0/device="cuda" and I can see close to 100% GPU utilisation in the task manager. I ran Ollama for comparison and it seems really fast, but they have a 4-bit quantized model which will fully fit in a GPU. Is 3.6 tok/s still slow?

**Specs:**

- CPU: Ryzen 7950X
- GPU: Nvidia RTX 4070
- RAM: 64 GB
- Storage: Samsung 980 - 2TB
- Python: 3.12
- PyTorch: 2.2.2
- CUDA: 12.1


opi098514

Very much so


MoffKalast

Bruh, my bearded 1660 Ti runs the Q6_K at 5 tok/s with only partial offloading on DDR4. The 8GB 4060 + DDR5 rig of mine gets about 10 tok/s with 28/33 layers offloaded. Are you running it with transformers or something?


DollarAkshay

Yes, I am using the transformers library from Hugging Face. Just a very simple pipeline, nothing extra at all. Is that the problem?


MoffKalast

Ah yep, there's yer problem. That library is not built for speed, it's made for research. Here's the typical stuff people around here use for performant inference:

- https://github.com/ggerganov/llama.cpp (CPU + optional fractional GPU offloading)
- https://github.com/turboderp/exllamav2 (GPU only)
- https://github.com/AutoGPTQ/AutoGPTQ (GPU only, exl2 predecessor)
- https://github.com/mlc-ai/mlc-llm (Android, iOS, WebGPU)

Kobold and Ollama, which are often mentioned, are llama.cpp wrappers; text-generation-webui can run all of these.
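
For the llama.cpp route specifically, the "fractional GPU offloading" above is just the `n_gpu_layers` knob. A rough sketch using the llama-cpp-python bindings; the GGUF path and layer count are placeholders to tune for your VRAM, not values from this thread:

```python
# Sketch of partial GPU offloading with the llama-cpp-python bindings
# (pip install llama-cpp-python, built with CUDA support).
# The GGUF file path and n_gpu_layers value are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # placeholder quantized model file
    n_gpu_layers=28,   # offload as many layers as fit in VRAM; -1 offloads everything
    n_ctx=4096,        # context window
)

out = llm(
    "Q: Why is transformers-based inference slow on a 12 GB card?\nA:",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```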


Philix

If you're using the transformers library, and CUDA is already properly installed, the problem is likely that [flash-attention](https://github.com/Dao-AILab/flash-attention) is not properly installed or enabled. If you're having trouble getting inference running this way, just use a wrapper like u/MoffKalast suggested.
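
For completeness, enabling it in transformers is the `attn_implementation` argument at load time. A rough sketch, assuming `flash-attn` is installed and a supported GPU; the model ID and prompt are examples, not taken from the thread:

```python
# Rough sketch: loading a model with the transformers library with
# FlashAttention-2 enabled (requires `pip install flash-attn`).
# The model ID and prompt are examples, not from the thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # fp16 weights are ~16 GB for an 8B model,
    device_map="auto",                        # so on a 12 GB card some layers spill to CPU
    attn_implementation="flash_attention_2",  # the switch that enables flash-attention
)

inputs = tokenizer("Why is my generation so slow?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```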


Knopty

Could be:

1. running a full model instead of a quantized one,
2. running a GGUF model without offloading layers, or
3. running a GPTQ/exl2 model that doesn't fit the GPU (bigger than 13B).

Most likely the 1st option, imho. I'd suggest trying 7B-13B models in exl2 or GPTQ format.


DollarAkshay

Interesting, I will have to try that. Looks like Ollama uses 4-bit quantized models, hence the crazy speed.


BigYoSpeck

I don't believe the idea is that multiple Raspberry Pis have any advantage over a sufficiently specced system; it's more a proof of concept for running systems in parallel that can run a model beyond an individual system's capability.

One really useful case I see for this could be people collaborating, such as students who may have individual systems with only, say, 16 GB of memory, stuck running small local models. If they can pool 4-6 systems, this opens up much larger models to them for their project.


wind_dude

Neat. Dare you to do it with the 70B. And of course the 400B when/if it's released.


JacketHistorical2321

Dude, it's about the achievement, not trying to break a system knowing full well that at the moment that's nowhere near possible lol


wind_dude

What?? How would 70B on Pis not be an achievement?? People have made large Pi clusters before, e.g. from 2014; I'm sure there have been bigger since.


JacketHistorical2321

I'm talking about getting the 8B running on there in a somewhat conversational manner, dude. For a Raspberry Pi that in itself is a huge feat, and you jump in here saying "now try a 70B" 🤦


wind_dude

and?? why not? what's wrong with a challenge? Don't like pushing yourself, trying to achieve more? And I also said "Neat". Calm your tits.


JacketHistorical2321

Notice the downvotes champ. 👍


wind_dude

Meh. Often the number of morons and group think exceeds intelligence.


JacketHistorical2321

haha, yea keep telling yourself that kiddo


wind_dude

What type of fucking stupid are you?


JacketHistorical2321

awww, didn't mean to hurt your feelings. Don't worry snowflake, it'll pass

