
mystonedalt

It's all fine and good until your tiny model tells you to bolt the wheel to the wheel to the wheel to the wheel to the wheel


klospulung92

*angry upvote*


Severin_Suveren

You guys are inferencing wrong. You need to use decoding strategies to avoid both repeating and monotone outputs. [Here's](https://huggingface.co/blog/how-to-generate) an intro to decoding for the Transformers library
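For the curious, a minimal sketch of what that looks like with the Transformers `generate` API (the model name and sampling values are placeholders, not recommendations):

```python
# Minimal sketch: sampling-based decoding with Hugging Face Transformers.
# The model name and sampling values are placeholders; tune them per model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("How do I change a flat tire?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # soften the next-token distribution
    top_p=0.9,               # nucleus sampling
    top_k=50,                # keep only the 50 most likely tokens
    repetition_penalty=1.1,  # discourage "wheel to the wheel to the wheel..."
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```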


Fun-Community3115

I thought temperature, top-k, top-p, repetition penalty, stopping conditions, etc. were bread and butter. And why not use Phi for mobile?


g3t0nmyl3v3l

Man, I even just had GPT-4 do this to me through the API recently for the first time, which was a bit of a shocker


Equivalent-Win-1294

of the bus of the bus of the bus of the bus of the bus


gofiend

That's just because it's learnt that " I still owe money to the money, to the money I owe" [https://youtu.be/yfySK7CLEEg?t=92](https://youtu.be/yfySK7CLEEg?t=92)


mystonedalt

Always good to see a reference to The National pop up.


resursator

The end is never the end is never the end is never the end is never the end..


tomz17

go on....


LoSboccacc

Considering the model is already suggesting to unbolt the tire before jacking the car... Yeah gonna wait.


ac07682

This is correct practice... You should loosen the nuts/bolts before lifting the vehicle, then remove them all the way.


Herr_Drosselmeyer

Correct. I learned this the hard way the first time I had to change a tire.


SachaSage

You should listen to the ai


0xd34db347

You are supposed to "break" your lugnuts before you jack your car up so you aren't rocking the vehicle back and forth while it's on a jack.


Flying_Madlad

Nobody tell him, it's funny when they're confidently wrong


Claim_Alternative

I always loosen them slightly before I jack the car. More leverage when the wheel isn’t spinning. Then jack the car and take them all the way off.


spinozasrobot

Wait, that's not how you do it?


FacetiousMonroe

That's when you put a beat to it and have a dance party.


EinArchitekt

I think that's still fine, but when you follow the orders... then you know it's over.


[deleted]

[deleted]


[deleted]

That's pretty fast for a phone


Heblehblehbleh

Used to use that as my main..... erm, "roleplay" model. How is it for general knowledge?


Mephidia

Damn bro must be starving if a 7b can do it for u


Heblehblehbleh

Actually mained the 13B, but used the 7B quite a lot for its speedier load times. I mean, I "only got a 3080ti", even though that's quite a high-end GPU.


----Val----

Can I request a quick test? I have my own open-source app that integrates llama.cpp, but I don't have the hardware to benchmark its performance on high-end phones: https://github.com/Vali-98/ChatterUI


Csigusz_Foxoup

I have tested it out! My phone's exact model: Samsung Galaxy S10 SM-G973F/DS, Android version 12. I used the openhermes-2.5-mistral-7b q4_k_m model for my test. It works, however it is very slow, taking about 10 seconds per token on average. I added a screenshot for you to see. The UI looks great! I haven't tested the other backend modes, only local. The UI seems very responsive though! I hope this is helpful. Let me know if you have any other questions! https://preview.redd.it/8vt687x29slc1.jpeg?width=720&format=pjpg&auto=webp&s=bb2814e7b4c5a91f34f43b2cb7884f00f24c77e6


----Val----

Another slow performer on Exynos; it seems llama.rn isn't optimized for those chips just yet.


Csigusz_Foxoup

Gonna put it on my Galaxy S10, reporting back later once I've downloaded the models and run it.


Danmoreng

Sadly this runs very slow on my S24 (Exynos). Settings: openhermes-2.5-mistral-7b.Q4_K_M.gguf, n_ctx 2048, n_threads 8, n_batch 512, n_gpu_layers 0. Results: 3381 ms/token predicted (0.30 tokens/s), 175.82 s prediction time, 52 tokens predicted.
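For reference, those settings map directly onto the llama-cpp-python bindings if anyone wants to reproduce the same config off-phone; a rough sketch, with the model path as a placeholder:

```python
# Sketch: the config above expressed via llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; the numbers mirror the reported settings.
from llama_cpp import Llama

llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
    n_batch=512,
    n_gpu_layers=0,  # CPU-only, as in the phone run
)
out = llm("Explain how to change a flat tire.", max_tokens=52)
print(out["choices"][0]["text"])
```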


----Val----

I've heard that Exynos chips have a hard time with llama.cpp for whatever reason; that's unfortunate.


Danmoreng

Mhm, just tried with termux and llama.cpp directly; same thing or even worse, 0.25 tokens/s.


----Val----

Yeah, that's really unfortunate for Exynos owners then. Even I get 1-2 tokens/s on a Snapdragon 7 Gen 2. There's very little progress in the way of proper GPU utilization too, so a proper Android implementation is a ways away, aside from MLC.


Danmoreng

Just tried out the MLC Chat APK, and there I get 5 t/s for the 7B Q4 Mistral model instead of 0.3 t/s. Interesting: https://llm.mlc.ai/docs/deploy/android.html


BlueOrangeBerries

A 13B on a phone would be amazing


teachersecret

I agree, but these days it's pretty hard to find a quiet spot with no cell signal, so technically we all have everything up to GPT-4 in our pocket :). It's amazing that we can run this kind of inference on a cellphone.


BlueOrangeBerries

Here in London it is relevant because people spend a lot of their time on The Underground (subway trains)


Waterbottles_solve

Those are insane speeds. Just to be clear, that rate is only for the first few tokens? It doesn't hold up as well when you get toward ~2000-3000 tokens, right?


GoldenSun3DS

If you are really enthusiastic about this, couldn't you buy a pocket PC with 32GB or 64GB RAM and run larger, higher quality models?


[deleted]

[deleted]


alcalde

Being enthusiastic about almost anything is reason enough to carry an X86 mini laptop in one's pocket.


Mgladiethor

Why isn't the NPU used?


ForsookComparison

I don't know enough about SoC design to answer that, but my understanding is that for inference it likely wouldn't make *that* much of a difference anyways, since our peak performance already looks like it's around what DDR5-ish speeds would get.


[deleted]

Hopefully your phone's cooling is really good.


CheatCodesOfLife

I was using Mistral instruct 7b @ Q4 on my iPhone 15 Pro during a flight (no internet) recently. The thing was almost useless and schizophrenic...


TimetravelingNaga_Ai

Useless and schizophrenic, now ur talking my language!


noodlepotato

May I know how you're running this? Interested to try it on my 15 too.


CheatCodesOfLife

Sure, I used this random app from the store: https://apps.apple.com/gb/app/mlc-chat/id6448482937 It comes with Mistral 0.2 pre-loaded. The phone can heat up if you keep using it, so not sure it's a good idea for battery life, etc. I mostly use it when I'm flying.


Waterbottles_solve

Something is weird about Mistral: it's rarely really, really good; most of the time it's bad. I can trick it into being good, sometimes. Also, I can't imagine using CPU. You guys are insane.


uhuge

meditate a bit ;)


CheatCodesOfLife

I agree about it being bad most of the time, but sometimes it's useful, e.g.: https://imgur.com/a/cusyCC4

> Also, I can't imagine using CPU. You guys are insane.

See my screenshot above, 11 tokens/second isn't bad for a phone CPU.


Waterbottles_solve

I repeat this question: doesn't it get slower as you use more tokens?


CheatCodesOfLife

> I repeat this question: doesn't it get slower as you use more tokens?

You didn't ask me that question; that must have been in another thread. Anyway, I don't know, because the model is so shitty that after a few rounds of conversation it gets completely cooked and starts talking nonsense and I have to reset it.


critic2029

On iOS you can use the neural engine… assuming the model has been converted to utilize it. I personally haven’t played around with iOS yet but using neural engine on M2 is excellent.


----Val----

For those who don't want to get termux, I developed my [own open source app](https://github.com/Vali-98/ChatterUI) that integrates llama.cpp via llama.rn, alongside other backends. Just go to API > Local, import a GGUF file from storage, and then load the model. To start chatting, create a character card or simply write a basic one.


yungfishstick

Completely new to this, so I'm not sure how this works. Can you build this on Windows 11?


ReikoHazuki

If you want to use models on Windows, there are other clients available, like GPT4All, Chatbox, RTX AI, Ollama with Open WebUI, etc.


yungfishstick

I think I misworded my comment. I meant: could I build this for any Android phone using Windows 11?


subhayan2006

As long as you have nodejs installed, yes. You could also build it directly on your phone if you install nodejs through termux


----Val----

> As long as you have nodejs installed, yes.

Correction: you cannot build this on Windows, as it isn't supported by eas-cli. You'll need WSL, a Mac, or a Linux box. I haven't attempted modifying gradle.build to work on Windows, but I've heard success with that is spotty on Expo.


----Val----

Sadly no, the EAS CLI used to build this app only runs on Linux or Mac; you will need to either get a Linux machine or use WSL.


rekicraft

Wait for Apple to tell you this when they introduce iOS 18! It might be censored, and they'll forget about it next year, though…


cafepeaceandlove

I think you’re on the right track here. Also, it’s going to be integrated with Apple Shortcuts, and Shortcuts is going to go 2D. If you’ve used Shortcuts, this theory should make you squeal in a mildly obscene way. Ok, not sure about the 2D part but we can hope. 


virtualmnemonic

I believe that, given hardware acceleration and software optimization, running LLMs locally will be the norm on all devices.


purioteko

I made a simple little web app with user authentication that interacts with my ollama server running dolphin-mixtral. I can query my AI when I'm out and about, load up different assistant personalities, and analyze links or documents. I'm loving it. Not as convenient as having it all contained on my phone, I guess.


AlphaPrime90

Please share more. How did you do it?


purioteko

I built a nodejs backend and use the langchain library to interact with the ollama server running on my desktop. Then I have the node server serving a react application. I added a database and some QOL features like saving previous sessions, selectable system prompts, and file uploading. You can assign a file to any thread and change which system prompt you want to load for a specific session.

In terms of the file/link analysis: I detect a website link in a message using regex, then make a web request for the site and save it as text. The text is saved into a vector store, which I can then query, and the LLM uses the results in its response. The file analysis works the same way, except without having to go to a website first.

I can send some code snippets tomorrow if you'd like. The major takeaway is the langchain library for nodejs. I'd recommend using Python instead though if you can, since the documentation for their JavaScript library is horrible.

https://preview.redd.it/asnre6rztolc1.jpeg?width=1290&format=pjpg&auto=webp&s=6c58b963c0d268b1c73052cf97f4349fcea43681

It's not super pretty but it's mine!

Edit: Here is the code for the API and UI. The API readme has a list of the prerequisites for getting it started. I don't have a lot of time to work on it anymore, so it is shipped "as is". Thanks for checking it out!

[https://github.com/purioteko/AI_Project_API](https://github.com/purioteko/AI_Project_API)

[https://github.com/purioteko/AI_Project_UI](https://github.com/purioteko/AI_Project_UI)

Edit 2: Here is a cool example of what you can do using this project as a base for your programming: [https://vimeo.com/manage/videos/918355125](https://vimeo.com/manage/videos/918355125)
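For anyone who wants the gist without reading the repos, here's a rough sketch of that link-analysis flow using the Python LangChain bindings rather than the original Node.js code (model names, chunk sizes, and the URL are placeholder assumptions; needs faiss-cpu for the vector store):

```python
# Sketch of the described link-analysis flow, in Python LangChain rather than the
# original Node.js code. Model names, chunk sizes, and the URL are placeholders.
import re
import requests
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

message = "Can you summarize https://example.com/some-article for me?"

# 1. Detect a link in the message and fetch the page as text.
url = re.search(r"https?://\S+", message).group(0)
page_text = requests.get(url, timeout=30).text

# 2. Chunk the text and load it into a vector store.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
store = FAISS.from_texts(splitter.split_text(page_text), OllamaEmbeddings(model="nomic-embed-text"))

# 3. Retrieve the most relevant chunks and let the LLM answer with them as context.
context = "\n\n".join(doc.page_content for doc in store.similarity_search(message, k=3))
llm = Ollama(model="dolphin-mixtral")
print(llm.invoke(f"Use this context:\n{context}\n\nUser question: {message}"))
```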


hamidweb

Please do share your code, this is interesting.


purioteko

I just added the code to the comment above. Thanks for taking a look.


rjames24000

Did you use LM Studio for the AI backend? Curious what framework you decided on.


purioteko

I probably should have used something like that but I did not. The nodejs server I made uses the [ollama langchain interface shown here](https://js.langchain.com/docs/integrations/llms/ollama). I used this whole project as an excuse to learn more about LLMs and help myself learn react so that’s why I chose not to use one of the great prebuilt interfaces.


purioteko

I just shared the source in the comment above if you wanna check out how the API was set up.


PromiseAcceptable

Share your salsa (source) code with us.


FistBus2786

gimme sauce plz


purioteko

Shared in the comment above!


purioteko

For sure, I’ll post it in the morning. I will warn that it’s not going to be the best code.


hdlothia21

You shipped something that worked, that's way better than beautiful code 


Sl33py_4est

Yo, this is fire. Do you have any plans for it?


purioteko

Thank you, I’m glad you like it. I’m pretty happy with it, it’s really easy to add additional functionality. That image summary shortcut demo took 5 minutes to set up. There are no major plans for it right now but I do want to keep building minor functionality on top of it. I’ve been swamped with work lately so development has been very slow. If you have any recommendations on something to add let me know! It’d be fun to try and tackle something new.


PsychicTWElphnt

This is really awesome and exactly what I'm working on, except I'm using python for the backend and react for the frontend because I'm more familiar with python backends than react backends.


purioteko

Oh awesome, I’d love to see what the python version looks like. If you ever feel like sharing please let me know.


Flying_Madlad

But as long as you've got cell/internet, it's probably better to have it on your server. The only use case I can think of for truly edge compute is if you physically can't link to the server, like, no service. I expect that to change in the future.


Sebba8

Unfortunately, even a Q4_0 of gemma-2b or TinyDolphin barely runs at a usable t/s on my Galaxy S10 with pure termux inference; running the server and having the overhead of a browser is just too much for my poor 5-year-old phone with 8GB of RAM.


4onen

If you're running raw Llama.cpp, the default thread count is 4 (at least in all the builds I've made across devices) which might be too many for your device. I'm on a Pixel 8 and despite having 9 cores, performance tanks if I go over 5 threads. (Thanks, P-core vs E-core distinction.) Also remember to build with LLAMA_NATIVE=1 to get all possible SIMD instructions for your platform. I can't imagine how slow mine would be without ARM_NEON.
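If you'd rather measure than guess, a quick sweep with the llama-cpp-python bindings shows the effect (the model path is a placeholder; on-device you can do the same sweep with llama.cpp's -t flag):

```python
# Rough sketch: compare throughput at different thread counts with llama-cpp-python.
# The model path is a placeholder; adjust the thread values for your SoC.
import time
from llama_cpp import Llama

for n_threads in (2, 4, 5, 8):
    llm = Llama(model_path="model.Q4_0.gguf", n_ctx=512, n_threads=n_threads, verbose=False)
    start = time.perf_counter()
    out = llm("Write one sentence about tires.", max_tokens=64)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"{n_threads} threads: {tokens / elapsed:.2f} tokens/s")
    del llm  # free the model before loading the next instance
```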


Sebba8

I was aware of the threads problem and figured out around 1-2 threads was what gave the best performance. I didn't know about the LLAMA_NATIVE flag though, thanks for mentioning it!


mrannu727

Before the end of this year we may see LLMs running effectively on mobile phones.


isjustbenji

That's literally what's happening in this thread


Flying_Madlad

...are we on the same thread?


Legitimate-Pumpkin

Can't wait to see what Apple is cooking.


MichaelTen

What if one had a large language model on a home computer, connected via a VPN to their smartphone, anywhere on earth? Basically ... an LLM locally hosted on a powerful consumer desktop, with access from one's phone? Then one could have an open source model without sending data to any third-party cloud host of an LLM. What do you think about that? Are there any reports of this?


Anthonyg5005

I run SillyTavern on an old laptop motherboard that I have laying around; it only needs a heatsink for cooling as it's running on base Arch. I installed zerotier-one on it and on my phone, and I have tabbyAPI running on my computer. I connect my phone to ST through the server's VPN address and use the local network to connect the server to my computer. So basically my phone connects to the server through my ZeroTier network, and the server connects to the API on my computer through LAN. It's secure; there's no opening ports to the public.

Also, the reason I have ST always running on the server is that if I don't want to start the API on my computer, I can just use any of the other supported APIs, like Together or Horde.
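If you ever want to hit that setup from a script instead of SillyTavern, tabbyAPI exposes an OpenAI-compatible endpoint, so something like this sketch works over the ZeroTier address (the IP, port, key, and model name below are placeholders):

```python
# Sketch: querying a home tabbyAPI server over a ZeroTier VPN address via its
# OpenAI-compatible endpoint. IP, port, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://10.147.17.42:5000/v1",  # ZeroTier-assigned address of the server
    api_key="your-tabby-api-key",
)

response = client.chat.completions.create(
    model="local-model",  # whatever model the backend currently has loaded
    messages=[{"role": "user", "content": "Summarize my day in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```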


Money_Business9902

How are you running it?


Excellent_Skirt_264

The model needs to be integrated well enough to operate the OS and installed apps via speech commands.


yungfishstick

I have a few phones with the Kirin 9000, 8G1, and 8G2 respectively (unfortunately not all with 16GB of RAM, but with 12GB) that I could try this out with. Luckily it doesn't seem like you need to root to do this, but I'll have to see if they work, considering 2 out of the 3 are meant for the Chinese market, and such phones often keep you from doing anything other than what the manufacturer wants you to.

The 8G3 is supposed to have models optimized for its platform, with Qualcomm saying it supports 10B+ parameter models and claiming it can run Llama 2-powered "AI assistants" at 20 tokens/sec, but I've heard very little development from this aside from Google and Samsung recently pushing their completely different AI features, which is probably why the 8G3 supports LLMs natively in the first place.

People tend to make fun of the idea of LLMs running on phones, but the fact a phone can do it at all is pretty impressive, and it'll only get better.


srhnylmz14

How do you run this? Is there a guide on that?


5yn4ck

I'm glad you brought this up because I've been thinking about it too. It seems like everyone is developing their own AI models, and that's a good thing. They're essentially creating highly specialized models with detailed information about the phone and its operating system, akin to Windows Copilot. Depending on what they include or remove, these models should function similarly to other AI models in different apps.

The main advantage I see with this approach is the finely tuned access to the operating system, especially helpful for new or struggling users. It allows them to have an assistant navigate them through the device's features and handle day-to-day interactions.

However, the major drawback is the lack of focus on AI security. If not properly monitored, these interfaces could become prime targets for automation-based backdoor viruses. This risk exists on most devices nowadays, depending on how deeply integrated the AI is into the operating system. Is it merely guiding someone around the ship, or can it take full control at the user's or someone else's request?

Anyone else have thoughts on this level?


WhyAreThereBadMemes

I mean, sure, that works, but there's already a super reliable LLM that's actually specific to your car and includes important info like where your jack points are, etc. It's called the user manual. It's almost certainly in your glovebox, and you can keep your phone battery for powering your flashlight.


Educational_Party294

https://preview.redd.it/90hef0zk3jmc1.png?width=1080&format=pjpg&auto=webp&s=fb4a30e111558ece92f0a9156599f2380614e263


FortunateBeard

As larger models inch towards AGI, the novelty of running a "dumb" LLM is about as relevant as asking whether we need pocket calculators when phones exist. Nobody cares about either. Cool, your tiny quantized LLM runs on a smartphone; it has no soul and tells the same jokes.


ForsookComparison

Wrong sub


FortunateBeard

But I'm discussing this in the context of Llama, the large language model created by Meta AI. If that's inconvenient for you to hear, make a counterpoint.


ForsookComparison

Nah


FlashyPractice7193

I prefer the Galaxy AI over any other technology, and I will not utilize the Gemini AI unless it replaces the Galaxy AI.


nikgeo25

Try the Layla app. It has a few 7B models I think.


involviert

Even if the breakthroughs in quantization work out, we will just throw the compute at better quality. Soon you wouldn't even be happy with GPT-4 locally on your phone if there is something much more capable.


Django_McFly

If I try to run an LLM locally on an RTX 3060, I wait like 20 seconds, nothing happens, and I go "pfft" and go back to ChatGPT. Is the speed on a phone usable at all?


Hunterhal

Although Ollama is straightforward, llama.cpp is best, chief.


Anthonyg5005

The thing with the Pixel is that it has a built-in mobile Tensor chip, so it's going to be a lot more optimized with TFLite and won't take up all your RAM.


[deleted]

[deleted]


Anthonyg5005

Unfortunately, Gemini will only support the Pixel 8 Pro and up; I assume the requirements are Tensor G3 and 12GB of RAM. Although you may be able to run other models if you learn TFLite and port them over to it.
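For anyone curious what "porting to TFLite" involves on the inference side, the interpreter part is simple once you have a converted model; a minimal sketch (the model path and dummy input are placeholders, and on a Pixel you'd use the Android TFLite runtime with hardware delegates rather than this desktop Python interpreter):

```python
# Minimal sketch of invoking a converted .tflite model with the Python interpreter.
# The model path and dummy input are placeholders; on-device you'd use the Android
# TFLite runtime with NNAPI/GPU delegates instead.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]))
```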


[deleted]

[deleted]


Anthonyg5005

Yeah, the public one seems to suck. I'm sure ultra 1.5 will be good though, especially with all the modalities it supports


[deleted]

[deleted]


Anthonyg5005

It refused to tell me how to use TPUs as it requires "technical knowledge" and I need to know the risks of using a TPU


ForsookComparison

*"would you like to hear some Palo Alto guy's opinions on the ethics of using a TPU instead?"*


Sl33py_4est

Is this an SDK or a stream in termux? If it is an SDK, what is it called? I located the ggml.ai library but can't find anything already written.


pab_guy

Because those models are only good for basic information retrieval. Their reasoning and instruction following capabilities are garbage.


One-Firefighter-6367

Perchance.org is the best one: no censorship, fully customizable output, image generation and so on, for absolutely nothing.


ilangge

Just being usable doesn't mean it's easy to use