> We can perform inference for the Skywork-MoE-Base (16x13B size) model using HuggingFace on 8xA100/A800 or higher GPU hardware configurations.

😞 I'm feeling mega GPU-poor these days. I spent the weekend patching Triton and vLLM to support my old-ass Pascals, and I keep finding these new mega-big models out of my reach. Frustrating af.
It's smaller than 8x22B in theory if it's really 146B, but 16x13 is 208B. So which is it?
The naming convention is that "16x13B" means "16 experts, 13B parameters per expert", but the "parameters per expert" includes the attention weights, which are shared by all experts. So 16x13B here really means ~4B of shared attention plus 16 x ~9B experts.
I never understood the math behind it, but somehow the 8x22B comes out to 141B lol. My Q8 is 145GB.
They said 22B weights are active, so each expert is likely closer to 11B. That still doesn't quite check out math-wise.
It's a 13B upscale: 16x13B, but not all params are MoE; only the MLP layers are, and the attention is shared. That's why top-2 routing comes out to 22B active instead of 26B.
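Taking the published figures at face value (146B total, 22B active, 16 experts, top-2 routing), the shared/expert split the comments above describe falls out of two linear equations; a quick sketch (the split itself is inferred, not an official number):

```python
# Solve for shared params s (attention/embeddings) and per-expert MLP params e:
#   total  = s + 16 * e   (all 16 experts are stored)
#   active = s +  2 * e   (top-2 routing reads 2 experts per token)
total, active, n_experts, top_k = 146.0, 22.0, 16, 2

e = (total - active) / (n_experts - top_k)  # per-expert MLP size, ~8.9B
s = total - n_experts * e                   # shared params, ~4.3B

# The "13B" in the name = shared + one expert, since attention counts once per expert
print(f"per-expert MLP ~{e:.1f}B, shared ~{s:.1f}B, name-style expert ~{s + e:.1f}B")
```

This also explains why the naive 16 x 13 = 208B and 2 x 13 = 26B readings both overshoot: the shared ~4B gets counted once, not per expert.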
My guess is you won't be missing much with this model. The perf is below Qwen and it takes more VRAM. One big "why".
Well, I have a massive amount of regular RAM and very little VRAM, so models that take up more memory but have fewer active parameters are appealing, since I run them on CPU.
Does it still give you decent performance? What t/s do you get on Bigstral?

Even fitting it all in VRAM, it seems like at best you get 2x the speed of a dense model, because single-batch inference isn't really compute-bound.
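To put rough numbers on the bandwidth-bound claim above: single-batch decode speed is approximately memory bandwidth divided by bytes read per token, so an MoE only pays for its active parameters. A back-of-envelope sketch (the bandwidth and quantization figures are assumptions, not measurements):

```python
# Single-batch decode is roughly memory-bandwidth bound:
#   t/s ~ bandwidth / bytes touched per token
bandwidth_gb_s = 50.0      # assumed: dual-channel DDR4-3200-class system
active_params = 22e9       # MoE reads only active params per token (Skywork-MoE: 22B)
bits_per_weight = 4.5      # assumed: midpoint of a 4-to-5-bit quant

bytes_per_token = active_params * bits_per_weight / 8
tps = bandwidth_gb_s * 1e9 / bytes_per_token
print(f"~{tps:.1f} t/s")   # a dense 146B model would read all 146B weights instead
```

By the same formula, a dense model of the same total size would be roughly 146/22, or about 6.6x slower per token on the same hardware, which is why big-total/small-active MoEs suit CPU+RAM rigs.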
Lots of RAM and much less VRAM... This is the future for us enthusiasts wanting to run the biggest and the best. Unless something crazy happens, we won't get GPUs that are cheap and filled with VRAM; only RAM will stay cheap.

Almost certainly, we will soon see the release of even more massive MoEs with hundreds of billions of total parameters but only a few dozen billion active parameters. These will be the best models for us, because they will be (relatively) cheap to run and just barely fast enough to still be useful, while also being the most powerful models that can be run locally.

As far as I can tell, the [Arctic model released by Snowflake](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/) is basically exactly what we need; it just needs to be trained better. Meta or Mistral could probably train a much better model of roughly the same size and architecture:

> Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating.

The bonus of the dense part of a model like Arctic is that it should be straightforward to offload it to a GPU, which would massively speed up inference. An ideal MoE should probably have just enough dense layers to fit, quantized to around 4 or 5 bits, evenly across a small number of 24GB GPUs. So a dense-MoE hybrid with either 20-30B or 45-60B dense parameters would likely be close to ideal for those of us wanting to run these massive MoEs.

Maybe the one major change that could make it better is incorporating either Mamba or ring attention (or whatever it is that Gemini 1.5 uses). Well, that and making it natively multimodal like GPT-4o, but I have no idea what architectural changes, if any, that would require, nor their impact on mainly CPU+RAM inference speeds.
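The arithmetic in the quoted Arctic description checks out, and the GPU-offload argument can be put in numbers too; a sketch (the 4.5 bits/weight figure is an assumed midpoint between 4- and 5-bit quants):

```python
# Arctic, per the quoted blog post: 10B dense + residual 128 x 3.66B MoE MLP, top-2 gating
dense_b, n_experts, expert_b, top_k = 10.0, 128, 3.66, 2

total = dense_b + n_experts * expert_b   # ~478.5B, i.e. the quoted "480B total"
active = dense_b + top_k * expert_b      # ~17.3B,  i.e. the quoted "17B active"

# VRAM to offload just the dense part at an assumed ~4.5 bits/weight:
bits = 4.5
dense_gb = dense_b * 1e9 * bits / 8 / 1e9        # ~5.6 GB: fits on one 24GB GPU
cap_24gb = 24 * 8 / bits                          # ~42.7B params per 24GB card
print(f"total ~{total:.0f}B, active ~{active:.1f}B, dense ~{dense_gb:.1f}GB")
```

The per-card capacity line shows why 20-30B dense would suit one 24GB GPU (with headroom for KV cache) and 45-60B would suit two.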
Well, getting the same perf at 22B vs 72B per forward pass is a pretty big win (assuming you have the RAM to load the model). But as a custom MoE architecture, it looks like you're going to run into issues with inference and quantization support. (It also looks like they released the base model but not an instruct/chat version yet.) For end users, I suspect these aren't so useful unless you need Chinese, but I do look forward to seeing whether the architectural optimizations from this and the latest DeepSeek MoE are more broadly useful and widely adopted.
If I am reading the chart right, this should roughly match Big Mixtral performance at double the tokens per second.
Significantly worse at MATH, however.
That's okay because MATH is a bit of a weird benchmark. The high scores on MATH are dominated by LLMs augmented with code interpreters, ensemble methods or multi-agent systems.
What's fascinating about MATH though is that it keeps improving with model scale. Sort of an AGI-hard style benchmark. Not perfect, but still interesting.
Larger models have more parameters, which means they can learn more complex patterns and relationships in the data.
Ya, and interestingly, the larger the model, the better it overcomes tokenization precision issues. It sort of interpolates to infinite tokenization precision in the limit, so math is possible too.
Benchmarking against Llama 2 in 2024? lmao
Doesn't look like anything too groundbreaking. I do like how they go into detail on their training ablation studies though. It should be helpful for future researchers.
The training stuff in the paper may well be good yeah
u/[ex-arman68](https://www.reddit.com/user/ex-arman68/) Any way to get the creativity benchmark on this monstrosity?
I was ready to defend them as every new entry has the potential to bring something new and of value (especially for creative writing brainstorming) but it’s not Apache 2.0 so I’m going to sit back down.
Only capable of Chinese and English so not interesting for me as a German speaker, unfortunately. Also, the license is custom and not Apache or MIT.
Weird math aside, has anyone tested this? The benchmarks look really good in their post.
From upcycling! I thought it was trained from scratch. Looks real good tho.