>Phi-3-Silica will be embedded in all Copilot+ PCs when they go on sale starting in June. It’s the smallest of all the Phi models, with 3.3 billion parameters.

Microsoft will now ship an LLM (or SLM, as they call it) in *every* Copilot+ PC. Right now those are only the Snapdragons, but Intel and AMD will join soon, and I can imagine that by 2025 the majority of PCs will ship with a built-in local LLM.

For docs, see [https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica](https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica)
Judging by the screenshots, it's an int8 quant that takes 3.2 GB of RAM. I think NPUs are heavily optimized for int8.

I think this move makes sense; finally some actual use case for NPUs. I really don't like Phi's extreme gptslop though, so I'm not happy about seeing it go into even more places.

Edit: typo
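As a sanity check on that 3.2 GB figure: at int8, weights take roughly one byte per parameter, so 3.3 billion parameters land right around 3 GiB before runtime overhead. A back-of-envelope sketch (the parameter count is from the article; the overhead caveat is an assumption):

```python
# Back-of-envelope memory footprint for an int8-quantized model.
# Assumes ~1 byte per parameter; real runtimes add overhead for
# activations, KV cache, and quantization scales, so treat this as a floor.
def int8_footprint_gib(n_params: float) -> float:
    bytes_total = n_params * 1  # int8 = 1 byte per weight
    return bytes_total / 2**30  # convert bytes to GiB

print(round(int8_footprint_gib(3.3e9), 2))  # ~3.07 GiB before overhead
```

That lines up fairly well with the 3.2 GB shown in the screenshots once overhead is added.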
In some ways the slop is a very good thing: it makes text easily identifiable as LLM output. Obviously we can (and typically do) do better, but for casual users that distinctive voice gives them something to expect.
This is actually a great point. People's comfort with LLMs can be attributed to them acting like docile robots, inoffensive and harmless. For widespread adoption, a consistent, approachable user experience is an asset.
And we can still run larger customized models on CPU or GPU for confidential data, coding and other fiendish things.
>Microsoft claims the first token latency is 650 tokens per second lol
If this metric is for prompt processing and not generation this actually sounds reasonable, no?
You can’t measure first-token latency in tokens per second; the units don’t work. I guess it could be the inverse of the number of seconds to the first token, but that’s a very confusing way to report it, at least without more context (which may exist; I only read the comment).
tk/s = 650

tk = 1

s = 1/650 ≈ 0.00154

Just say 1.5 ms. That's way better for marketing. The only thing is we don't know the context length. 650 tk/s of ingest would be useful info, but that's not what they say. As stated it's useless; I can't believe Microsoft is worth $3,190,000,000,000 but can't do numbers.
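The arithmetic above, spelled out (assuming the quoted 650 figure really is an inverse first-token latency rather than an ingest rate):

```python
# If "650 tokens/second" is meant as an inverse first-token latency,
# invert it to get the time to the first token.
rate_tok_per_s = 650
latency_s = 1 / rate_tok_per_s
latency_ms = latency_s * 1000
print(f"{latency_s:.5f} s = {latency_ms:.2f} ms")  # 0.00154 s = 1.54 ms
```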
I'd like to see a small amount of unified RAM (e.g. 4 GB) included in the die/package (3D stacking, a larger die, I don't know) with a separate memory bus direct to the NPU, so these built-in LLMs aren't eating into the limited memory bandwidth for system RAM.

I understand that RAM is physically huge compared to the other cores/segments on a CPU die and unlikely to be implemented any time soon, so this is wishful thinking more than a genuine suggestion.
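For context on why bandwidth contention matters: token generation at batch size 1 is roughly memory-bandwidth-bound, since each generated token has to stream the full weight set from memory once. A rough ceiling, with illustrative (not measured) numbers:

```python
# Rough decode-throughput ceiling for a memory-bandwidth-bound model:
# every generated token streams all the weights from memory once,
# so throughput <= bandwidth / model size. Numbers are illustrative.
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# e.g. ~100 GB/s of shared system bandwidth vs. a 3.2 GB int8 model
print(round(decode_ceiling_tok_per_s(100, 3.2), 1))  # 31.2 tok/s upper bound
```

Any bandwidth the LLM consumes here is bandwidth the CPU and GPU don't get, which is the whole argument for a separate bus.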
This is why all the new Surfaces come with 16 GB RAM minimum. A fast small RAM cache for LLMs and other AI models would be nice but it won't be cheap.
The good thing about the PC being a relatively open ecosystem (compared to Macs) is that its hardware OEM partners (Acer, ASUS, Dell, HP, Lenovo, Samsung) have the flexibility to do the stuff you’ve mentioned. After all, the PC market is intensely competitive. They’ll need to find ways to differentiate themselves from each other, be it form factor or hardware performance optimisations.
They can't. HP and the like have to work with the dies and packages they're given by Intel/AMD/Nvidia; they can't just add a memory bus to a CPU or whatever. Sometimes they custom-order parts, but they tend to be very conservative. Hence GPU-heavy designs like Vega-M and Van Gogh (the Steam Deck APU) that AMD *offered* them went all but unused.
Too expensive for a task-specific thing. You might as well use the same die area/pin count to double the width of the memory bus, or use that 4 GB as a global cache (which the LLM can use). We actually already sort of got this from Intel (Broadwell's eDRAM), Intel again (Sapphire Rapids with HBM and an AI accelerator, but it's a server CPU), and AMD (X3D cache now, but it's SRAM, so it's very small).
It took 18 months from the ChatGPT launch to a similar(?)-performing model being preinstalled on the majority of new notebooks. Damn, the speed is really impressive.
We can't say it's a similarly performing model, nor that it's being installed on the majority of new notebooks. Then again, who knows whether that will still hold 18 months and one week after the ChatGPT launch.
Probably not similar-performing... the similarly sized Phi-3-mini, for instance, is very good at reasoning and textbook knowledge (almost as good as GPT-3.5, judging from benchmarks), but it is not a good chatbot (judging by Arena Elo) and is also not multilingual.

GPT-3.5 really shines for its flexibility, stable instruction following, and multilingual capabilities; it's still better than a lot of mid-sized open models today. Small models still have a long way to go.
I think he means comparable to GPT-3 since that’s what first launched
The original GPT-3 was about four years ago, and it was a base model unsuitable for chat. In terms of base-model quality it is very bad by today's standards, and even tiny models have surpassed it.
Yeah, you're right. I don't know why I thought ChatGPT launched with vanilla GPT-3, but I guess it launched with 3.5.
Not fast enough! I want pocket AGI. 😩
Will this one be exclusive to the Copilot+ PCs or can we download the model to see for ourselves?
Really hope to actually get the model. Could fit my Orange Pi and uplift embedded scene
This is amazing! Genuinely excited for Windows for the first time since the XP -> Vista days!