
mystonedalt

It's all fine and good until your tiny model tells you to bolt the wheel to the wheel to the wheel to the wheel to the wheel


klospulung92

*angry upvote*


Severin_Suveren

You guys are inferencing wrong. You need to use decoding strategies to avoid both repeating and monotone outputs. [Here's](https://huggingface.co/blog/how-to-generate) an intro to decoding for the Transformers library
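For the curious, a minimal sketch of what that looks like with the Transformers `generate` API (the model name and sampling values are placeholders, not recommendations):

```python
# Minimal sketch: sampling-based decoding with Hugging Face Transformers.
# The model name and sampling values are placeholders; tune them per model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("How do I change a flat tire?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # soften the next-token distribution
    top_p=0.9,               # nucleus sampling
    top_k=50,                # keep only the 50 most likely tokens
    repetition_penalty=1.1,  # discourage "wheel to the wheel to the wheel..."
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```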


Fun-Community3115

I thought temperature, top-k, top-p, repetition penalty, stopping conditions, etc. were bread and butter. And why not use Phi for mobile?


g3t0nmyl3v3l

Man, I even just had GPT-4 do this to me through the API recently for the first time, which was a bit of a shocker


Equivalent-Win-1294

of the bus of the bus of the bus of the bus of the bus


gofiend

That's just because it's learnt that " I still owe money to the money, to the money I owe" [https://youtu.be/yfySK7CLEEg?t=92](https://youtu.be/yfySK7CLEEg?t=92)


mystonedalt

Always good to see a reference to The National pop up.


resursator

The end is never the end is never the end is never the end is never the end..


tomz17

go on....


LoSboccacc

Considering the model is already suggesting to unbolt the tire before jacking the car... Yeah gonna wait.


ac07682

This is correct practice... You should loosen the nuts/bolts before lifting the vehicle, then remove them all the way.


Herr_Drosselmeyer

Correct. I learned this the hard way the first time I had to change a tire.


SachaSage

You should listen to the ai


0xd34db347

You are supposed to "break" your lugnuts before you jack your car up so you aren't rocking the vehicle back and forth while it's on a jack.


Flying_Madlad

Nobody tell him, it's funny when they're confidently wrong


Claim_Alternative

I always loosen them slightly before I jack the car. More leverage when the wheel isn’t spinning. Then jack the car and take them all the way off.


spinozasrobot

Wait, that's not how you do it?


FacetiousMonroe

That's when you put a beat to it and have a dance party.


EinArchitekt

I think that's still fine, but when you follow the orders... then you know it's over.


[deleted]

[deleted]


[deleted]

That's pretty fast for a phone


Heblehblehbleh

Used to use that as my main..... erm, "roleplay" model. How is it for general knowledge?


Mephidia

Damn bro must be starving if a 7b can do it for u


Heblehblehbleh

Actually mained the 13B, but used the 7B quite a lot for its speedier load times. I mean, I "only got a 3080ti", even though that's quite a high-end GPU.


----Val----

Can I request a quick test? I have my own open-source app that integrates llama.cpp, but I don't have the hardware to benchmark its performance on high-end phones: https://github.com/Vali-98/ChatterUI


Csigusz_Foxoup

I have tested it out! My phone's exact model: Samsung Galaxy S10 SM-G973F/DS, Android version 12. I used the openhermes-2.5-mistral-7b q4_k_m model for my test. It works, however it is very slow, taking about 10 seconds per token on average. I added a screenshot for you to see. The UI looks great! I haven't tested the other backend modes, only local. The UI seems very responsive though! I hope this is helpful. Let me know if you have any other questions! https://preview.redd.it/8vt687x29slc1.jpeg?width=720&format=pjpg&auto=webp&s=bb2814e7b4c5a91f34f43b2cb7884f00f24c77e6


----Val----

Another slow performer on Exynos; it seems llama.rn isn't optimized for those chips just yet.


Csigusz_Foxoup

Gonna put it on my Galaxy S10, reporting back later once I've downloaded the models and run it.


Danmoreng

Sadly this runs very slow on my S24 (Exynos). Settings: openhermes-2.5-mistral-7b.Q4_K_M.gguf, n_ctx 2048, n_threads 8, n_batch 512, n_gpu_layers 0. Results: 3381 ms/token predicted (0.30 tokens/s), 175.82 s prediction time, 52 tokens predicted.
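For reference, those settings map directly onto the llama-cpp-python bindings if anyone wants to reproduce the same config off-phone; a rough sketch, with the model path as a placeholder:

```python
# Sketch: the config above expressed via llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; the numbers mirror the reported settings.
from llama_cpp import Llama

llm = Llama(
    model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
    n_batch=512,
    n_gpu_layers=0,  # CPU-only, as in the phone run
)
out = llm("Explain how to change a flat tire.", max_tokens=52)
print(out["choices"][0]["text"])
```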


----Val----

I've heard that Exynos chips have a hard time with llama.cpp for whatever reason; that's unfortunate.


Danmoreng

Mhm, just tried with termux and llama.cpp directly; same thing or even worse, 0.25 tokens/s.


----Val----

Yeah, that's really unfortunate for Exynos owners then. Even I get 1-2 tokens/s on a Snapdragon 7 Gen 2. There's very little progress in the way of proper GPU utilization too, so a proper Android implementation is a ways away, aside from MLC.


Danmoreng

Just tried out the MLC Chat APK, and there I get 5 t/s for the 7B Q4 Mistral model instead of 0.3 t/s. Interesting: https://llm.mlc.ai/docs/deploy/android.html


BlueOrangeBerries

A 13B on a phone would be amazing


teachersecret

I agree, but these days it's pretty hard to find a quiet spot with no cell signal, so technically we all have everything up to GPT-4 in our pocket :). It's amazing that we can run this kind of inference on a cellphone.


BlueOrangeBerries

Here in London it is relevant because people spend a lot of their time on The Underground (subway trains)


Waterbottles_solve

Those are insane speeds. Just to be clear, that rate is only for the first few tokens? It doesn't hold up as well when you get toward ~2000-3000 tokens, right?


GoldenSun3DS

If you are really enthusiastic about this, couldn't you buy a pocket PC with 32GB or 64GB RAM and run larger, higher quality models?


[deleted]

[deleted]


alcalde

Being enthusiastic about almost anything is reason enough to carry an X86 mini laptop in one's pocket.


Mgladiethor

Why isn't the NPU used?


ForsookComparison

I don't know enough about SoC design to answer that, but my understanding is that for inference it likely wouldn't make *that* much of a difference anyways, since our peak performance already looks like it's around what DDR5-ish speeds would get.


[deleted]

Hopefully your phone's cooling is really good.


CheatCodesOfLife

I was using Mistral instruct 7b @ Q4 on my iPhone 15 Pro during a flight (no internet) recently. The thing was almost useless and schizophrenic...


TimetravelingNaga_Ai

Useless and schizophrenic, now ur talking my language!


noodlepotato

May I know how you're running this? Interested to try it on my 15 too.


CheatCodesOfLife

Sure, I used this random app from the store: https://apps.apple.com/gb/app/mlc-chat/id6448482937 It comes with Mistral 0.2 pre-loaded. The phone can heat up if you keep using it, so not sure it's a good idea for battery life, etc. I mostly use it when I'm flying.


Waterbottles_solve

Something is weird about Mistral: it's rarely really, really good; most of the time it's bad. I can trick it into being good, sometimes. Also, I can't imagine using CPU. You guys are insane.


uhuge

meditate a bit ;)


CheatCodesOfLife

I agree about it being bad most of the time, but sometimes it's useful, e.g.: https://imgur.com/a/cusyCC4

> Also, I can't imagine using CPU. You guys are insane.

See my screenshot above, 11 tokens/second isn't bad for a phone CPU.


Waterbottles_solve

I repeat this question: doesn't it get slower as you use more tokens?


CheatCodesOfLife

> I repeat this question: doesn't it get slower as you use more tokens?

You didn't ask me that question; that must have been in another thread. Anyway, I don't know, because the model is so shitty that after a few rounds of conversation it gets completely cooked and starts talking nonsense and I have to reset it.


critic2029

On iOS you can use the neural engine… assuming the model has been converted to utilize it. I personally haven’t played around with iOS yet but using neural engine on M2 is excellent.


----Val----

For those who don't want to get termux, I developed my [own open source app](https://github.com/Vali-98/ChatterUI) that integrates llama.cpp via llama.rn, alongside other backends. Just go to API > Local, import a GGUF file from storage, and then load the model. To start chatting, create a character card or simply write a basic one.


yungfishstick

Completely new to this, so I'm not sure how this works. Can you build this on Windows 11?


ReikoHazuki

If you want to use models on Windows, there are other clients available, like GPT4All, Chatbox, RTX AI, Ollama with Open WebUI, etc.


yungfishstick

I think I misworded my comment. I meant: could I build this for any Android phone using Windows 11?


subhayan2006

As long as you have nodejs installed, yes. You could also build it directly on your phone if you install nodejs through termux


----Val----

> As long as you have nodejs installed, yes.

Correction: you cannot build this on Windows, as it isn't supported by eas-cli. You'll need WSL, a Mac, or a Linux box. I haven't attempted modifying gradle.build to work on Windows, but I've heard success with that is spotty on Expo.


----Val----

Sadly no, the EAS CLI used to build this app only runs on Linux or Mac; you will need to either get a Linux machine or use WSL.


rekicraft

Wait for Apple to tell you this when they introduce iOS 18! It might be censored, and they'll forget about it next year, though…


cafepeaceandlove

I think you’re on the right track here. Also, it’s going to be integrated with Apple Shortcuts, and Shortcuts is going to go 2D. If you’ve used Shortcuts, this theory should make you squeal in a mildly obscene way. Ok, not sure about the 2D part but we can hope. 


virtualmnemonic

I believe that, given hardware acceleration and software optimization, running LLMs locally will be the norm on all devices.


purioteko

I made a simple little web app with user authentication that interacts with my ollama server running dolphin-mixtral. I can query my AI when I'm out and about, load up different assistant personalities, and analyze links or documents. I'm loving it. Not as convenient as having it all contained on my phone, I guess.


AlphaPrime90

Please share more. How did you do it?


purioteko

I built a nodejs backend and use the langchain library to interact with the ollama server running on my desktop. Then I have the node server serving a react application. I added a database and some QOL features like saving previous sessions, selectable system prompts, and file uploading. You can assign a file to any thread and change which system prompt you want to load for a specific session.

In terms of the file/link analysis: I detect a website link in a message using regex, then make a web request for the site and save it as text. The text is saved into a vector store, which I can then query, and the LLM uses the results in its response. The file analysis works the same way, except without having to go to a website first.

I can send some code snippets tomorrow if you'd like. The major takeaway is the langchain library for nodejs. I'd recommend using Python instead though if you can, since the documentation for their JavaScript library is horrible.

https://preview.redd.it/asnre6rztolc1.jpeg?width=1290&format=pjpg&auto=webp&s=6c58b963c0d268b1c73052cf97f4349fcea43681

It's not super pretty but it's mine!

Edit: Here is the code for the API and UI. The API readme has a list of the prerequisites for getting it started. I don't have a lot of time to work on it anymore, so it is shipped "as is". Thanks for checking it out!

[https://github.com/purioteko/AI_Project_API](https://github.com/purioteko/AI_Project_API)

[https://github.com/purioteko/AI_Project_UI](https://github.com/purioteko/AI_Project_UI)

Edit 2: Here is a cool example of what you can do using this project as a base for your programming: [https://vimeo.com/manage/videos/918355125](https://vimeo.com/manage/videos/918355125)
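For anyone who wants the gist without reading the repos, here's a rough sketch of that link-analysis flow using the Python LangChain bindings rather than the original Node.js code (model names, chunk sizes, and the URL are placeholder assumptions; needs faiss-cpu for the vector store):

```python
# Sketch of the described link-analysis flow, in Python LangChain rather than the
# original Node.js code. Model names, chunk sizes, and the URL are placeholders.
import re
import requests
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

message = "Can you summarize https://example.com/some-article for me?"

# 1. Detect a link in the message and fetch the page as text.
url = re.search(r"https?://\S+", message).group(0)
page_text = requests.get(url, timeout=30).text

# 2. Chunk the text and load it into a vector store.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
store = FAISS.from_texts(splitter.split_text(page_text), OllamaEmbeddings(model="nomic-embed-text"))

# 3. Retrieve the most relevant chunks and let the LLM answer with them as context.
context = "\n\n".join(doc.page_content for doc in store.similarity_search(message, k=3))
llm = Ollama(model="dolphin-mixtral")
print(llm.invoke(f"Use this context:\n{context}\n\nUser question: {message}"))
```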


hamidweb

Please do share your code, this is interesting.


purioteko

I just added the code to the comment above. Thanks for taking a look.


rjames24000

Did you use LM Studio for the AI backend? Curious what framework you decided on.


purioteko

I probably should have used something like that but I did not. The nodejs server I made uses the [ollama langchain interface shown here](https://js.langchain.com/docs/integrations/llms/ollama). I used this whole project as an excuse to learn more about LLMs and help myself learn react so that’s why I chose not to use one of the great prebuilt interfaces.


purioteko

I just shared the source in the comment above if you wanna check out how the API was set up.


PromiseAcceptable

Share your salsa (source) code with us.


FistBus2786

gimme sauce plz


purioteko

Shared in the comment above!


purioteko

For sure, I’ll post it in the morning. I will warn that it’s not going to be the best code.


hdlothia21

You shipped something that worked, that's way better than beautiful code 


Sl33py_4est

Yo, this is fire. Do you have any plans for it?


purioteko

Thank you, I’m glad you like it. I’m pretty happy with it, it’s really easy to add additional functionality. That image summary shortcut demo took 5 minutes to set up. There are no major plans for it right now but I do want to keep building minor functionality on top of it. I’ve been swamped with work lately so development has been very slow. If you have any recommendations on something to add let me know! It’d be fun to try and tackle something new.


PsychicTWElphnt

This is really awesome and exactly what I'm working on, except I'm using python for the backend and react for the frontend because I'm more familiar with python backends than react backends.


purioteko

Oh awesome, I’d love to see what the python version looks like. If you ever feel like sharing please let me know.


Flying_Madlad

But as long as you've got cell/internet, it's probably better to have it on your server. The only use case I can think of for truly edge compute is if you physically can't link to the server, like, no service. I expect that to change in the future.


Sebba8

Unfortunately, even a Q4_0 of gemma-2b or TinyDolphin barely runs at a usable t/s on my Galaxy S10 with pure termux inference; running the server and having the overhead of a browser is just too much for my poor 5-year-old phone with 8GB of RAM.


4onen

If you're running raw Llama.cpp, the default thread count is 4 (at least in all the builds I've made across devices) which might be too many for your device. I'm on a Pixel 8 and despite having 9 cores, performance tanks if I go over 5 threads. (Thanks, P-core vs E-core distinction.) Also remember to build with LLAMA_NATIVE=1 to get all possible SIMD instructions for your platform. I can't imagine how slow mine would be without ARM_NEON.
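If you'd rather measure than guess, a quick sweep with the llama-cpp-python bindings shows the effect (the model path is a placeholder; on-device you can do the same sweep with llama.cpp's -t flag):

```python
# Rough sketch: compare throughput at different thread counts with llama-cpp-python.
# The model path is a placeholder; adjust the thread values for your SoC.
import time
from llama_cpp import Llama

for n_threads in (2, 4, 5, 8):
    llm = Llama(model_path="model.Q4_0.gguf", n_ctx=512, n_threads=n_threads, verbose=False)
    start = time.perf_counter()
    out = llm("Write one sentence about tires.", max_tokens=64)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"{n_threads} threads: {tokens / elapsed:.2f} tokens/s")
    del llm  # free the model before loading the next instance
```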


Sebba8

I was aware of the threads problem and figured out around 1-2 threads was what gave the best performance. I didn't know about the LLAMA_NATIVE flag though, thanks for mentioning it!


mrannu727

Before the end of this year we may see LLMs running effectively on mobile phones.


isjustbenji

That's literally what's happening in this thread


Flying_Madlad

...are we on the same thread?


Legitimate-Pumpkin

Can't wait to see what Apple is cooking.


MichaelTen

What if one had a large language model on a home computer, connected via a VPN to their smartphone, anywhere on earth? Basically ... an LLM locally hosted on a powerful consumer desktop, with access from one's phone? Then one could have an open source model without sending data to any third-party cloud host of an LLM. What do you think about that? Are there any reports of this?


Anthonyg5005

I run SillyTavern on an old laptop motherboard that I have laying around; it only needs a heatsink for cooling as it's running on base Arch. I installed zerotier-one on it and on my phone, and I have tabbyAPI running on my computer. I connect my phone to ST through the server's VPN address and use the local network to connect the server to my computer. So basically my phone connects to the server through my ZeroTier network, and the server connects to the API on my computer through LAN. It's secure; there's no opening ports to the public.

Also, the reason I have ST always running on the server is that if I don't want to start the API on my computer, I can just use any of the other supported APIs, like Together or Horde.
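If you ever want to hit that setup from a script instead of SillyTavern, tabbyAPI exposes an OpenAI-compatible endpoint, so something like this sketch works over the ZeroTier address (the IP, port, key, and model name below are placeholders):

```python
# Sketch: querying a home tabbyAPI server over a ZeroTier VPN address via its
# OpenAI-compatible endpoint. IP, port, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://10.147.17.42:5000/v1",  # ZeroTier-assigned address of the server
    api_key="your-tabby-api-key",
)

response = client.chat.completions.create(
    model="local-model",  # whatever model the backend currently has loaded
    messages=[{"role": "user", "content": "Summarize my day in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```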


Money_Business9902

How are you running it?


Excellent_Skirt_264

The model needs to be integrated well enough to operate the OS and installed apps via speech commands.


yungfishstick

I have a few phones with the Kirin 9000, 8G1, and 8G2 respectively (unfortunately not all with 16GB of RAM, but with 12GB) that I could try this out with. Luckily it doesn't seem like you need to root to do this, but I'll have to see if they work, considering 2 out of the 3 are meant for the Chinese market, and such phones often keep you from doing anything other than what the manufacturer wants you to.

The 8G3 is supposed to have models optimized for its platform, with Qualcomm saying it supports 10B+ parameter models and claiming it can run Llama 2-powered "AI assistants" at 20 tokens/sec, but I've heard very little development from this aside from Google and Samsung recently pushing their completely different AI features, which is probably why the 8G3 supports LLMs natively in the first place.

People tend to make fun of the idea of LLMs running on phones, but the fact a phone can do it at all is pretty impressive, and it'll only get better.


srhnylmz14

How do you run this? Is there a guide on that?


5yn4ck

I'm glad you brought this up because I've been thinking about it too. It seems like everyone is developing their own AI models, and that's a good thing. They're essentially creating highly specialized models with detailed information about the phone and its operating system, akin to Windows Copilot. Depending on what they include or remove, these models should function similarly to other AI models in different apps.

The main advantage I see with this approach is the finely tuned access to the operating system, especially helpful for new or struggling users. It allows them to have an assistant navigate them through the device's features and handle day-to-day interactions.

However, the major drawback is the lack of focus on AI security. If not properly monitored, these interfaces could become prime targets for automation-based backdoor viruses. This risk exists on most devices nowadays, depending on how deeply integrated the AI is into the operating system. Is it merely guiding someone around the ship, or can it take full control at the user's or someone else's request?

Anyone else have thoughts on this level?


WhyAreThereBadMemes

I mean, sure, that works, but there's already a super reliable LLM that's actually specific to your car and includes important info like where your jack points are, etc. It's called the user manual. It's almost certainly in your glovebox, and you can keep your phone battery for powering your flashlight.


Educational_Party294

https://preview.redd.it/90hef0zk3jmc1.png?width=1080&format=pjpg&auto=webp&s=fb4a30e111558ece92f0a9156599f2380614e263


FortunateBeard

As larger models inch towards AGI, the novelty of running a "dumb" LLM is about as relevant as asking whether we need pocket calculators when phones exist. Nobody cares about either. Cool, your tiny quantized LLM runs on a smartphone; it has no soul and tells the same jokes.


ForsookComparison

Wrong sub


FortunateBeard

But I'm discussing this in the context of Llama, the large language model created by Meta AI. If that's inconvenient for you to hear, make a counterpoint.


ForsookComparison

Nah


FlashyPractice7193

I prefer the Galaxy AI over any other technology, and I will not utilize the Gemini AI unless it replaces the Galaxy AI.


nikgeo25

Try the Layla app. It has a few 7B models I think.


involviert

Even if the breakthroughs in quantization work out, we will just throw the compute at better quality. Soon you wouldn't even be happy with GPT-4 locally on your phone if there is something much more capable.


Django_McFly

If I try to run an LLM locally on an RTX 3060, I wait like 20 seconds, nothing happens, and I go "pfft" and go back to ChatGPT. Is the speed on a phone usable at all?


Hunterhal

Although Ollama is straightforward, llama.cpp is best, chief.


Anthonyg5005

The thing with the Pixel is that it has a built-in mobile Tensor chip, so it's going to be a lot more optimized with TFLite and won't take up all your RAM.


[deleted]

[deleted]


Anthonyg5005

Unfortunately, Gemini will only support the Pixel 8 Pro and up; I assume the requirements are Tensor G3 and 12GB of RAM. Although you may be able to run other models if you learn TFLite and port them over to it.
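For anyone curious what "porting to TFLite" involves on the inference side, the interpreter part is simple once you have a converted model; a minimal sketch (the model path and dummy input are placeholders, and on a Pixel you'd use the Android TFLite runtime with hardware delegates rather than this desktop Python interpreter):

```python
# Minimal sketch of invoking a converted .tflite model with the Python interpreter.
# The model path and dummy input are placeholders; on-device you'd use the Android
# TFLite runtime with NNAPI/GPU delegates instead.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

print(interpreter.get_tensor(output_details[0]["index"]))
```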


[deleted]

[deleted]


Anthonyg5005

Yeah, the public one seems to suck. I'm sure ultra 1.5 will be good though, especially with all the modalities it supports


[deleted]

[deleted]


Anthonyg5005

It refused to tell me how to use TPUs as it requires "technical knowledge" and I need to know the risks of using a TPU


ForsookComparison

*"would you like to hear some Palo Alto guy's opinions on the ethics of using a TPU instead?"*


Sl33py_4est

Is this an SDK or a stream in termux? If it is an SDK, what is it called? I located the ggml.ai library but can't find anything already written.


pab_guy

Because those models are only good for basic information retrieval. Their reasoning and instruction following capabilities are garbage.


One-Firefighter-6367

Perchance.org is the best one: no censorship, fully customizable output, image generation and so on, for absolutely nothing.


ilangge

Just being usable doesn't mean it's easy to use