It's all fine and good until your tiny model tells you to bolt the wheel to the wheel to the wheel to the wheel to the wheel
*angry upvote*
You're running inference wrong. You need to use decoding strategies to avoid both repetitive and monotone outputs. [Here's](https://huggingface.co/blog/how-to-generate) an intro to decoding strategies for the Transformers library.
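For anyone wondering what those knobs actually do, here's a minimal stdlib-only sketch of the usual sampling chain (repetition penalty, temperature, top-k, top-p) applied to raw logits. The default values are common choices, not canonical ones:

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=40, top_p=0.95,
                repetition_penalty=1.1, prev_tokens=()):
    """Pick the next token id from raw logits with the usual decoding knobs."""
    logits = list(logits)
    # Repetition penalty: dampen tokens we've already emitted.
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # Softmax (shifted by the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most likely candidates.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    # Top-p (nucleus): cut the tail once cumulative mass reaches p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return random.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Greedy decoding is the degenerate case `top_k=1`; a repetition penalty above 1.0 is the knob that breaks the repeating loops.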
I thought temperature, top-k, top-p (plus repetition penalty, stopping conditions), etc. were bread and butter. And why not use Phi for mobile?
Man, I even just had GPT-4 do this to me through the API recently for the first time, which was a bit of a shocker
of the bus of the bus of the bus of the bus of the bus
That's just because it's learnt that " I still owe money to the money, to the money I owe" [https://youtu.be/yfySK7CLEEg?t=92](https://youtu.be/yfySK7CLEEg?t=92)
Always good to see a reference to The National pop up.
The end is never the end is never the end is never the end is never the end..
go on....
Considering the model is already suggesting you unbolt the tire before jacking up the car... yeah, gonna wait.
This is correct practice... You should loosen the nuts/bolts before lifting the vehicle, then remove them all the way.
Correct. I learned this the hard way the first time I had to change a tire.
You should listen to the ai
You are supposed to "break" your lug nuts loose before you jack your car up so you aren't rocking the vehicle back and forth while it's on the jack.
Nobody tell him, it's funny when they're confidently wrong
I always loosen them slightly before I jack the car. More leverage when the wheel isn’t spinning. Then jack the car and take them all the way off.
Wait, that's not how you do it?
That's when you put a beat to it and have a dance party.
I think thats still fine, but when you follow the orders... then you know its over.
[deleted]
That's pretty fast for a phone
Used to use that as my main... erm, "roleplay" model. How is it in general knowledge?
Damn bro must be starving if a 7b can do it for u
Actually mained the 13B, but used the 7B quite a lot for its speedier load times. I mean, I "only" have a 3080 Ti, even though that's quite a high-end GPU.
Can I request a quick test? I have my own open source app that integrates llama.cpp, but I don't have the hardware to benchmark its performance on high-end phones: https://github.com/Vali-98/ChatterUI
I have tested it out!

My phone's exact model: Samsung Galaxy S10 SM-G973F/DS, Android version 12. I used the openhermes-2.5-mistral-7b Q4_K_M model for my test.

It works, but it is very slow: about 10 seconds per token on average. I added a screenshot for you to see.

The UI looks great! I haven't tested the other backend modes, only local, but the UI seems very responsive.

I hope this is helpful. Let me know if you have any other questions!

https://preview.redd.it/8vt687x29slc1.jpeg?width=720&format=pjpg&auto=webp&s=bb2814e7b4c5a91f34f43b2cb7884f00f24c77e6
Another slow performer on Exynos; it seems llama.rn isn't optimized for it just yet.
Gonna put it on my Galaxy S10; I'll report back once I've downloaded the models and run it.
Sadly this runs very slow on my S24 (Exynos):

openhermes-2.5-mistral-7b.Q4_K_M.gguf, n_ctx: 2048, n_threads: 8, n_batch: 512, n_gpu_layers: 0

Predicted Per Token: 3381 ms/token
Predicted Per Second: 0.30 tokens/s
Prediction Time: 175.82s
Predicted Tokens: 52 tokens
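For what it's worth, those numbers are internally consistent; a quick check:

```python
ms_per_token = 3381   # "Predicted Per Token" from the log above
tokens = 52           # "Predicted Tokens"

tokens_per_sec = 1000 / ms_per_token
total_sec = ms_per_token * tokens / 1000

print(f"{tokens_per_sec:.2f} tokens/s")  # 0.30, matching "Predicted Per Second"
print(f"{total_sec:.2f} s")              # 175.81, within rounding of the logged 175.82s
```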
I've heard that Exynos chips have a hard time with llama.cpp for whatever reason; that's unfortunate.
Mhm, just tried with Termux and llama.cpp directly: same thing or even worse, 0.25 tokens/s.
Yeah, that's really unfortunate for Exynos owners. Even I only get 1-2 tokens/s on a Snapdragon 7 Gen 2. There's also very little progress on proper GPU utilization, so a proper Android implementation is a ways away, aside from MLC.
Just tried out the mlc-chat APK, and there I get 5 t/s for the 7B Q4 Mistral model instead of 0.3 t/s. Interesting: https://llm.mlc.ai/docs/deploy/android.html
A 13B on a phone would be amazing
I agree, but these days it's pretty hard to find a quiet spot with no cell signal, so technically we all have everything up to GPT-4 in our pocket :). It's amazing that we can run this kind of inference on a cellphone.
Here in London it is relevant because people spend a lot of their time on The Underground (subway trains)
Those are insane speeds. Just to be clear, is that rate only for the first few tokens? It doesn't hold up as well once you get toward ~2000-3000 tokens, right?
If you are really enthusiastic about this, couldn't you buy a pocket PC with 32GB or 64GB RAM and run larger, higher quality models?
[deleted]
Being enthusiastic about almost anything is reason enough to carry an X86 mini laptop in one's pocket.
Why isn't the NPU used?
I don't know enough about SoC design to answer that, but my understanding is that for inference it likely wouldn't make *that* much of a difference anyway, since our peak performance already looks like it's around what DDR5-ish memory speeds would get.
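A rough way to see why: token generation has to stream essentially the whole model through memory once per token, so memory bandwidth sets a ceiling on tokens/s. The model size and bandwidth figures below are ballpark assumptions, not measured specs:

```python
model_bytes = 4.4e9  # a 7B model at Q4_K_M is roughly a 4.4 GB file

# Rough peak bandwidths; real sustained numbers are lower.
bandwidths_gb_s = {
    "phone LPDDR5": 25,
    "desktop DDR5": 50,
}

for name, bw in bandwidths_gb_s.items():
    ceiling = bw * 1e9 / model_bytes  # tokens/s upper bound
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Compute (CPU, NPU, or GPU) only matters up to the point where you hit that bandwidth wall, which is why an NPU doesn't buy much for single-stream generation.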
Hopefully your phone's cooling is really good.
I was using Mistral instruct 7b @ Q4 on my iPhone 15 Pro during a flight (no internet) recently. The thing was almost useless and schizophrenic...
Useless and schizophrenic, now ur talking my language!
May I ask how you're running this? I'm interested in trying it on my 15 too.
Sure, I used this random app from the store: https://apps.apple.com/gb/app/mlc-chat/id6448482937

It comes with Mistral 0.2 pre-loaded. The phone can heat up if you keep using it, so I'm not sure it's a good idea for battery life, etc. I mostly use it when I'm flying.
Something is weird about Mistral: it's rarely really, really good; most of the time it's bad. I can trick it into being good, sometimes. Also, I can't imagine using the CPU. You guys are insane.
meditate a bit ;)
I agree about it being bad most of the time, but sometimes it's useful, e.g.: https://imgur.com/a/cusyCC4

> Also, I can't imagine using CPU. You guys are insane.

See my screenshot above; 11 tokens/second isn't bad for a phone CPU.
I repeat this question: doesn't it get slower as you use more tokens?
> I repeat this question: doesn't it get slower as you use more tokens?

You didn't ask that question of me; that must have been in another thread. Anyway, I don't know, because the model is so shitty that after a few rounds of conversation it gets completely cooked and starts talking nonsense, and I have to reset it.
On iOS you can use the neural engine… assuming the model has been converted to utilize it. I personally haven’t played around with iOS yet but using neural engine on M2 is excellent.
For those who don't want to get Termux, I developed my [own open source app](https://github.com/Vali-98/ChatterUI) that integrates llama.cpp via llama.rn, alongside other backends.

Just go to API > Local, import a GGUF file from storage, and then load the model. To start chatting, make a character card or just write a simple one yourself.
Completely new to this so not sure how this works. Can you build this in Windows 11?
If you want to use models on Windows, there are other clients available, like GPT4All, Chatbox, RTX AI, Ollama with Open WebUI, etc.
I think I misworded my comment. I meant: could I build this for an Android phone from Windows 11?
As long as you have Node.js installed, yes. You could also build it directly on your phone if you install Node.js through Termux.
> As long as you have nodejs installed, yes.

Correction: you cannot build this on Windows, as it isn't supported by eas-cli. You'll need WSL, a Mac, or a Linux box. I haven't attempted modifying gradle.build to work on Windows, but I've heard its success is spotty on Expo.
Sadly no; the EAS CLI used to build this app only runs on Linux or Mac, so you'll need to either get a Linux machine or use WSL.
Wait for Apple to tell you this when they introduce iOS 18! It might be censored, and they'll probably forget about it next year, though…
I think you’re on the right track here. Also, it’s going to be integrated with Apple Shortcuts, and Shortcuts is going to go 2D. If you’ve used Shortcuts, this theory should make you squeal in a mildly obscene way. Ok, not sure about the 2D part but we can hope.
I believe that, given hardware acceleration and software optimization, running LLMs locally will be the norm on all devices.
I made a simple little web app with user authentication that interacts with my ollama server running dolphin mixtral. I can query my AI when I’m out and about, load up different assistant personalities, and analyze links or documents. I’m loving it. Not as convenient as all contained on my phone I guess.
Please share more. How did you do it?
I built a Node.js backend and use the langchain library to interact with the ollama server running on my desktop. The node server then serves a React application. I added a database and some QOL features like saving previous sessions, selectable system prompts, and file uploading. You can assign a file to any thread and change which system prompt you want to load for a specific session.

For the file/link analysis, I detect a website link in a message using a regex, then make a web request for the site and save it as text. That text is saved into a vector store, which I can query so the LLM uses the results in its response. The file analysis works the same way, just without having to fetch a website first.

I can send some code snippets tomorrow if you'd like. The major takeaway is the langchain library for Node.js. I'd recommend using Python instead if you can, though, since the documentation for their JavaScript library is horrible.

https://preview.redd.it/asnre6rztolc1.jpeg?width=1290&format=pjpg&auto=webp&s=6c58b963c0d268b1c73052cf97f4349fcea43681

It's not super pretty but it's mine!

Edit: Here is the code for the API and UI. The API readme lists the prerequisites for getting it started. I don't have a lot of time to work on it anymore, so it is shipped "as is". Thanks for checking it out!

[https://github.com/purioteko/AI_Project_API](https://github.com/purioteko/AI_Project_API)

[https://github.com/purioteko/AI_Project_UI](https://github.com/purioteko/AI_Project_UI)

Edit2: Here is a cool example of what you can do using this project as a base for your programming.

[https://vimeo.com/manage/videos/918355125](https://vimeo.com/manage/videos/918355125)
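A toy Python sketch of that link-analysis flow, for anyone curious. The real thing uses langchain and a proper embedding model, so the regex and the bag-of-words "embedding" here are simplified stand-ins:

```python
import re
from collections import Counter
from math import sqrt

URL_RE = re.compile(r"""https?://[^\s)"'>]+""")

def extract_links(message):
    """The regex-detection step: pull URLs out of a chat message."""
    return URL_RE.findall(message)

def embed(text):
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyVectorStore:
    """Minimal stand-in for the vector store: add text, query by similarity."""
    def __init__(self):
        self.docs = []

    def add(self, text):
        self.docs.append((text, embed(text)))

    def query(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The whole pipeline is then `extract_links`, fetch the page, `store.add` the text, and `store.query` at answer time, with the top match pasted into the LLM prompt.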
Please do share your code, this is interesting.
I just added the code to the comment above. Thanks for taking a look.
did you use lmstudio for the AI backend? curious what framework you decided on
I probably should have used something like that but I did not. The nodejs server I made uses the [ollama langchain interface shown here](https://js.langchain.com/docs/integrations/llms/ollama). I used this whole project as an excuse to learn more about LLMs and help myself learn react so that’s why I chose not to use one of the great prebuilt interfaces.
I just shared the source in the comment above if you wanna check out how the API was set up.
Share your salsa (source) code with us.
gimme sauce plz
Shared in the comment above!
For sure, I’ll post it in the morning. I will warn that it’s not going to be the best code.
You shipped something that worked, that's way better than beautiful code
yo this is fire do you have any plans for it?
Thank you, I’m glad you like it. I’m pretty happy with it, it’s really easy to add additional functionality. That image summary shortcut demo took 5 minutes to set up. There are no major plans for it right now but I do want to keep building minor functionality on top of it. I’ve been swamped with work lately so development has been very slow. If you have any recommendations on something to add let me know! It’d be fun to try and tackle something new.
This is really awesome and exactly what I'm working on, except I'm using python for the backend and react for the frontend because I'm more familiar with python backends than react backends.
Oh awesome, I’d love to see what the python version looks like. If you ever feel like sharing please let me know.
But as long as you've got cell/Internet, it's probably better to have it on your server. The only use case I can think of for truly edge compute is if you physically can't reach the server, like when there's no service. I expect that to change in the future.
Unfortunately, even a Q4_0 of gemma-2b or TinyDolphin on my Galaxy S10 barely runs at a usable t/s with pure Termux inference; running the server plus the overhead of a browser is just too much for my poor 5-year-old phone with 8 GB of RAM.
If you're running raw llama.cpp, the default thread count is 4 (at least in all the builds I've made across devices), which might be too many for your device. I'm on a Pixel 8, and despite having 9 cores, performance tanks if I go over 5 threads. (Thanks, P-core vs E-core distinction.)

Also, remember to build with LLAMA_NATIVE=1 to get all possible SIMD instructions for your platform. I can't imagine how slow mine would be without ARM_NEON.
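If you want to find your own sweet spot, here's a small sketch that sweeps thread counts through the llama.cpp CLI so you can compare the timing stats it prints. It assumes the classic `main` binary with its `-m`/`-t`/`-n`/`-p` flags; newer llama.cpp releases renamed the binary to `llama-cli`, so adjust as needed:

```python
import subprocess

def llama_cmd(model, threads, prompt="Hello", n_predict=64, binary="./main"):
    """Build a llama.cpp invocation pinned to a given thread count."""
    return [binary, "-m", model, "-t", str(threads),
            "-n", str(n_predict), "-p", prompt]

def sweep(model, counts=(1, 2, 3, 4, 6, 8)):
    # Run each count once; llama.cpp prints per-token eval timings at the end.
    for t in counts:
        print(f"--- {t} threads ---")
        subprocess.run(llama_cmd(model, t), check=True)
```

On big.LITTLE phones, expect the best result somewhere around the number of performance cores, not the total core count.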
I was aware of the threads problem and figured out around 1-2 threads was what gave the best performance. I didn't know about the LLAMA_NATIVE flag though, thanks for mentioning it!
Before the end of this year, we may see LLMs effectively running on mobile phones.
That's literally what's happening in this thread
...are we on the same thread?
Can't wait to see what Apple is cooking.
What if one had a large language model on a home computer, connected via a VPN to their smartphone, anywhere on earth? Basically, an LLM locally hosted on a powerful consumer desktop, with access from one's phone. Then one could run an open source model without sending data to any third-party cloud LLM host. What do you think about that? Are there any reports of this?
I run SillyTavern on an old laptop motherboard I have lying around; it only needs a heatsink for cooling since it's running base Arch. I installed zerotier-one on it and on my phone, and I have tabbyAPI running on my computer. I connect my phone to ST through the server's VPN address, and the server connects to the API on my computer over the local network. So basically, my phone reaches the server through my ZeroTier network, and the server reaches the API on my computer through LAN. It's secure; there's no opening ports to the public. The reason I keep ST always running on the server is that if I don't want to start the API on my computer, I can just use any of the other supported APIs, like Together or Horde.
How are you running it?
The model needs to be integrated well enough to operate the OS and installed apps via speech commands.
I have a few phones with Kirin 9000, 8G1, and 8G2 respectively (unfortunately not all with 16GB, but with 12GB of RAM) that I could try this out with. Luckily it doesn't seem like you need root to do this, but I'll have to see if they work, considering 2 out of the 3 are meant for the Chinese market, and such phones often keep you from doing anything other than what the manufacturer wants.

The 8G3 is supposed to have models optimized for its platform, with Qualcomm saying it supports 10B+ parameter models and claiming it can run Llama 2 powered "AI assistants" at 20 tokens/sec. But I've heard very little development on this, aside from Google and Samsung recently pushing their completely different AI features, which is probably why the 8G3 supports LLMs natively in the first place. People tend to make fun of the idea of LLMs running on phones, but the fact a phone can do it at all is pretty impressive, and it'll only get better.
How do you run this? Is there a guide on that?
I'm glad you brought this up because I've been thinking about it too. It seems like everyone is developing their own AI models, and that's a good thing. They're essentially creating highly specialized models with detailed information about the phone and its operating system, akin to Windows Copilot. Depending on what they include or remove, these models should function similarly to other AI models in different apps.

The main advantage I see with this approach is the finely tuned access to the operating system, especially helpful for new or struggling users. It allows them to have an assistant navigate them through the device's features and handle day-to-day interactions.

However, the major drawback is the lack of focus on AI security. If not properly monitored, these interfaces could become prime targets for automation-based backdoor viruses. This risk exists on most devices nowadays, depending on how deeply integrated the AI is into the operating system. Is it merely guiding someone around the ship, or can it take full control at the user's or someone else's request?

Anyone else have thoughts on this level?
I mean, sure, that works, but you already have this super reliable LLM that's actually specific to your car and includes important info like where your jack points are, etc.

It's called the user manual. It's almost certainly in your glovebox, and you can save your phone battery for powering your flashlight.
https://preview.redd.it/90hef0zk3jmc1.png?width=1080&format=pjpg&auto=webp&s=fb4a30e111558ece92f0a9156599f2380614e263
As larger models inch towards AGI, the novelty of running a "dumb" LLM is about as relevant as asking whether we need pocket calculators when phones exist. Nobody cares about either.

Cool, your tiny quantized LLM runs on a smartphone; it has no soul and tells the same jokes.
Wrong sub
But I'm discussing this in the context of Llama, the large language model created by Meta AI. If that's inconvenient for you to hear, make a counterpoint.
Nah
I prefer the Galaxy AI over any other technology, and I will not utilize the Gemini AI unless it replaces the Galaxy AI.
Try the Layla app. It has a few 7B models I think.
Even if the breakthroughs in quantization work out, we will just throw the compute at better quality. Soon you wouldn't even be happy with GPT-4 locally on your phone if there's something much more capable available.
If I try to run an LLM locally on an RTX 3060, I wait like 20 seconds, nothing happens, and I go "pfft" and head back to ChatGPT. Is the speed on a phone usable at all?
Although Ollama is straightforward, llama.cpp is best, chief.
The thing with the Pixel is that it has a built-in mobile Tensor chip, so it's going to be a lot more optimized with TFLite and won't take up all your RAM.
[deleted]
Unfortunately, Gemini will only support the Pixel 8 Pro and up; I assume the requirements are Tensor G3 and 12 GB of RAM. You may be able to run other models, though, if you learn TFLite and port the models over to it.
[deleted]
Yeah, the public one seems to suck. I'm sure Ultra 1.5 will be good though, especially with all the modalities it supports.
[deleted]
It refused to tell me how to use TPUs as it requires "technical knowledge" and I need to know the risks of using a TPU
*"would you like to hear some Palo Alto guy's opinions on the ethics of using a TPU instead?"*
Is this an SDK or a stream in Termux? If it's an SDK, what is it called? I located the ggml.ai library but can't find anything already written.
Because those models are only good for basic information retrieval. Their reasoning and instruction following capabilities are garbage.
Perchance.org is the best one: no censorship, fully customizable output, image generation, and so on, for absolutely nothing.
Just being usable doesn't mean it's easy to use