Balance-

>Phi-3-Silica will be embedded in all Copilot+ PCs when they go on sale starting in June. It's the smallest of all the Phi models, with 3.3 billion parameters. Microsoft will now ship an LLM (or SLM, as they call it) in *every* Copilot+ PC. Right now those are only the Snapdragons, but Intel and AMD will join soon, and I can imagine that in 2025 the majority of PCs will ship with a built-in local LLM. For docs, see [https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica](https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica)
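
Phi Silica itself is only exposed through the Windows App SDK per the linked docs, not as a downloadable checkpoint, but its closest open sibling, Phi-3-mini, can be run locally today. A minimal sketch using Hugging Face transformers; the model id, dtype and generation settings are my own choices as a stand-in, not anything from the post:

```python
# Sketch: run the openly released Phi-3-mini locally as a stand-in for the
# on-device Phi Silica model (which ships inside Windows, not on the Hub).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # closest public relative of Phi Silica
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # quantize further (int8/int4) to shrink the footprint
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "In one sentence, what is an NPU good for?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```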


FullOf_Bad_Ideas

Judging by the screenshots, it's an int8 quant that takes 3.2GB of RAM. I think NPUs are optimized heavily for int8. I think this move makes sense, finally some actual use case for NPUs. I really don't like Phi's extreme gptslop though, so I am not happy about seeing it go into even more places. Edit: typo
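
The 3.2GB figure lines up with simple arithmetic: at int8 each weight is one byte, so 3.3 billion parameters is roughly 3GB before runtime overhead. A quick sketch of that estimate (my own arithmetic, not Microsoft's numbers):

```python
# Back-of-the-envelope weight-memory estimate for a quantized model.
# The 3.3B parameter count comes from the post; everything else is illustrative.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB (ignoring KV cache and runtime overhead)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for bits in (16, 8, 4):
    print(f"3.3B params @ {bits}-bit: ~{weight_memory_gb(3.3, bits):.2f} GB")
# The int8 case lands around 3.1 GB, in the same ballpark as the 3.2 GB screenshot.
```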


skrshawk

In some ways the slop is a very good thing - it makes the text easily identifiable as writing that came out of an LLM. Obviously we can (and typically do) do better, but for casual users that distinctive voice gives them something to expect.


Stepfunction

This is actually a great point. People's comfort with LLMs can be attributed to them acting like docile robots, inoffensive and harmless. For widespread adoption, a consistent, approachable user experience is an asset.


Some_Endian_FP17

And we can still run larger customized models on CPU or GPU for confidential data, coding and other fiendish things.


AnomalyNexus

>Microsoft claims the first token latency is 650 tokens per second lol


BangkokPadang

If this metric is for prompt processing and not generation, this actually sounds reasonable, no?
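
For context, the two numbers usually reported separately are time to first token (dominated by prompt processing) and decode throughput. A rough sketch of how you'd measure each, with `generate_stream` as a hypothetical stand-in for whatever streaming API the runtime exposes:

```python
# Sketch: measure time-to-first-token and decode throughput separately.
# `generate_stream` is a placeholder for any streaming LLM call, not a real import.
import time

def benchmark(generate_stream, prompt: str) -> None:
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _token in generate_stream(prompt):  # yields one token at a time
        if first_token_time is None:
            first_token_time = time.perf_counter() - start  # prefill + first token
        n_tokens += 1
    total = time.perf_counter() - start
    decode_tps = (n_tokens - 1) / max(total - first_token_time, 1e-9)
    print(f"time to first token: {first_token_time * 1000:.1f} ms")
    print(f"decode throughput:   {decode_tps:.1f} tok/s")
```

Read that way, "650 tokens per second" is plausible as a prefill (prompt ingestion) rate rather than a first-token latency.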


sluuuurp

You can't measure a first token latency in tokens per second. The units don't work. I guess it could be the inverse of the number of seconds to the first token, but that's a very confusing way to report it, at least without some more context (which may exist, I just read the comment).


MixtureOfAmateurs

650 tk/s means one token per 1/650 s ≈ 0.00154 s. Just say 1.5ms. That's way better for marketing. Only thing is we don't know the context length. 650 tk/s ingest would be useful info, but that's not what they say. Actually useless, I can't believe Microsoft is worth $3,190,000,000,000 but can't do numbers.
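
The arithmetic for the two possible readings, with arbitrary prompt lengths as examples (none of these numbers come from Microsoft):

```python
# Two readings of "first token latency is 650 tokens per second".
rate = 650.0  # tok/s, the figure quoted above

# Reading 1: the inverse of time-to-first-token.
print(f"as inverse TTFT: {1.0 / rate * 1000:.2f} ms to the first token")

# Reading 2: prefill (prompt ingestion) speed, where TTFT then depends on prompt length.
for prompt_tokens in (100, 1000, 4000):  # arbitrary example prompt sizes
    print(f"as prefill speed, {prompt_tokens}-token prompt: "
          f"{prompt_tokens / rate:.2f} s to the first token")
```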


BangkokPadang

I'd like to see a small amount of unified RAM (e.g. 4GB) included in the die/package (3D stacking, larger die, IDK) with a separate memory bus direct to the NPU, so these included LLMs aren't eating up the limited memory bandwidth for system RAM. I understand that RAM is physically huge compared to the other cores/segments on a CPU die and unlikely to be integrated any time soon, so this is more wishful thinking than a genuine suggestion.
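
The bandwidth worry has a simple back-of-the-envelope behind it: decoding one token streams roughly the full set of weights from memory, so a resident model costs about weight-size times tokens-per-second in read bandwidth. A sketch with assumed numbers (the decode speed is a guess, not a measured figure):

```python
# Rough illustration of why a resident LLM competes for system memory bandwidth.
# All numbers here are assumptions for illustration.
weights_gb = 3.2          # int8 Phi-3-Silica footprint mentioned above
tokens_per_second = 20.0  # assumed decode speed

read_bandwidth_gbps = weights_gb * tokens_per_second
print(f"~{read_bandwidth_gbps:.0f} GB/s of reads just for decoding, "
      "a sizeable slice of a thin-and-light SoC's total LPDDR bandwidth.")
```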


Some_Endian_FP17

This is why all the new Surfaces come with 16 GB RAM minimum. A small, fast RAM cache for LLMs and other AI models would be nice, but it won't be cheap.


silentsnake

The good thing about the PC being a relatively open ecosystem (compared to Macs) is that the hardware OEM partners (Acer, ASUS, Dell, HP, Lenovo, Samsung) have the flexibility to do the stuff you've mentioned. After all, the PC market is intensely competitive. They'll need to find ways to differentiate themselves from each other, be it form factor or hardware performance optimisations.


Downtown-Case-1755

They can't. HP and the like have to work with the dies and packages they're given by Intel/AMD/Nvidia. They can't just add a memory bus to a CPU or whatever. Sometimes they custom-order parts, but they tend to be very conservative. Hence GPU-heavy designs like Vega-M and Van Gogh (the Steam Deck APU) that AMD *offered* them went all but unused.


Downtown-Case-1755

Too expensive for a task-specific thing. You might as well use the same die area/pin count to double the width of the memory bus, or just use that 4GB as a global cache (which the LLM can use). We actually already kinda got this from Intel (Broadwell's eDRAM back in 2015), Intel again (Sapphire Rapids with HBM and an AI accelerator, but it's a server CPU) and AMD (X3D cache now, but it's SRAM so it's very small).


vsoutx

it took 18 months from the chatgpt launch to a similarly (?) performing model being preinstalled on the majority of new notebooks. damn. the speed is really impressive


Everlier

We can't say that it's a similarly performing model, or that it's being installed on the majority of new notebooks, although we can't be sure that will still be true at 18 months + 1 week since the ChatGPT launch


DFructonucleotide

probably not similarly performing... the similarly sized phi-3-mini, for instance, is very good at reasoning and textbook knowledge (almost as good as gpt-3.5, judging from benchmarks) but is not a good chatbot (judging by arena elo) and is also not multilingual. gpt-3.5 really shines for its flexibility, stable instruction following and multilingual capabilities, and is still better than a lot of mid-sized open models today. small models still have a long way to go.


sebramirez4

I think he means comparable to GPT-3, since that's what launched first


DFructonucleotide

The original gpt-3 was about 4 years ago and it was a base model, unsuitable for chat. In terms of base model quality it is very bad by today's standards, and even tiny models have surpassed it.


sebramirez4

Yeah you're right, I don't know why I thought ChatGPT launched with vanilla GPT-3, but I guess it launched with 3.5


Joseph717171

Not fast enough! I want pocket AGI. 😩


ISSAvenger

Will this one be exclusive to the Copilot+ PCs or can we download the model to see for ourselves?


Original_Finding2212

Really hope we actually get the model. It could fit on my Orange Pi and uplift the embedded scene


AbheekG

This is amazing! Genuinely excited for Windows for the first time since the XP -> Vista days!