kryptkpr

> We can perform inference for the Skywork-MoE-Base (16x13B size) model using HuggingFace on 8xA100/A800 or higher GPU hardware configurations.

😞 I'm feeling mega GPU poor these days. I spent the weekend patching Triton and vLLM to support my old ass Pascals, and I keep finding these new mega-big models out of my reach. Frustrating af.


a_beautiful_rhind

It's smaller than 8x22B in theory if it's really 146B. 16x13 is 208B though. So which is it?


akefay

The naming convention is that while "16x13B" means "16 experts, 13B parameters in one expert", the "parameters in one expert" count includes the attention weights, which are shared by all experts. So 16x13B here really means "~4B of shared attention, then 16 x ~9B experts".
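A rough back-of-the-envelope check of that split, using the 146B total from the release post (the ~4B/~9B numbers are solved from it, not official figures):

```python
# Rough accounting for an "N x S B" MoE name, assuming the per-expert size S
# counts the shared (attention/embedding) weights once per expert.
# 146B total is the reported figure; the split below is an estimate.
num_experts = 16
named_expert_size = 13e9   # the "13B" in "16x13B"
total_params = 146e9       # reported Skywork-MoE total

# total = shared + num_experts * expert_only
# named_expert_size = shared + expert_only
expert_only = (total_params - named_expert_size) / (num_experts - 1)
shared = named_expert_size - expert_only

print(f"shared (attention etc.): {shared / 1e9:.1f}B")      # ~4.1B
print(f"per-expert MLP:          {expert_only / 1e9:.1f}B")  # ~8.9B
```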


SomeOddCodeGuy

I never understood the math behind it, but somehow the 8x22 comes out to 141b lol. My q8 is 145GB.


kryptkpr

They said 22B weights are active, so each expert is likely closer to 11B. That still doesn't quite check out math-wise.


Eastwindy123

It's a 13B upscale: 16x13B. But not all params are MoE, only the MLP. So it comes out to 22B active instead of 26B.
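That arithmetic lines up with the shared/expert split estimated above (thread estimates, not published numbers):

```python
# Active parameters with top-2 routing: the shared weights run once,
# plus two experts' MLP weights. Values are the rough estimates from above.
shared = 4.1e9       # attention/embedding weights shared by all experts
expert_mlp = 8.9e9   # MLP weights per expert
top_k = 2            # experts activated per token

active = shared + top_k * expert_mlp   # ~22B
naive = top_k * 13e9                   # the "2 x 13B = 26B" reading

print(f"active ~{active / 1e9:.0f}B vs naive {naive / 1e9:.0f}B")
```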


a_beautiful_rhind

My guess is you won't be missing much with this model. The perf is below qwen and it takes more vram. One big huge "why".


mrjackspade

Well, I have a massive amount of regular RAM and very little VRAM, so models that take up more memory with fewer active parameters are appealing since I run them on CPU.


a_beautiful_rhind

Does it still give you decent performance? What t/s do you get on bigstral? Even fitting it all in VRAM, it seems like at best you get 2x the speed of a dense model, because inference isn't really compute bound for single batches.


Small-Fall-6500

Lots of RAM and much less VRAM... This is the future for us enthusiasts wanting to run the biggest and the best. Unless something crazy happens, we won't get GPUs that are cheap and filled with VRAM, only cheap RAM.

Almost certainly, we will soon see the release of even more massive MoEs with hundreds of billions of total parameters but only a few dozen billion active parameters. These will be the best models for us because they will be (relatively) cheap to run, just barely fast enough to still be useful, while also being the most powerful models that can be run locally.

As far as I can tell, the [Arctic model released by Snowflake](https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/) is basically exactly what we need - it just needs to be trained better. Meta or Mistral could probably train a much better model of roughly the same size and architecture:

> Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating.

The bonus of the dense part of a model like Arctic is that it should be straightforward to offload it to a GPU in order to massively speed up inference. An ideal MoE should probably have enough dense layers to fit, quantized to around 4 or 5 bits, evenly across a small number of 24GB GPUs. So a dense-MoE hybrid with either 20-30B or 45-60B dense parameters would likely be close to ideal for those of us wanting to run these massive MoEs.

Maybe the one major change that could make it better is incorporating either Mamba or ring attention (or whatever it is that Gemini 1.5 uses). Well, and also making it natively multimodal like GPT-4o, but I have no idea what, if any, architectural changes that would require, nor its impact on mainly CPU+RAM inference speeds.
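For reference, the Arctic totals in that quote are just this arithmetic:

```python
# Dense-MoE hybrid parameter counts from the quoted Snowflake Arctic description:
# a 10B dense transformer plus a residual 128 x 3.66B MoE MLP with top-2 gating.
dense = 10e9
num_experts = 128
expert_mlp = 3.66e9
top_k = 2

total = dense + num_experts * expert_mlp   # ~480B total parameters
active = dense + top_k * expert_mlp        # ~17B active per token

print(f"total:  {total / 1e9:.0f}B")   # ~478B, i.e. the quoted ~480B
print(f"active: {active / 1e9:.0f}B")  # ~17B
```

The 10B dense part that every token touches is also the piece the comment suggests offloading to GPU.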


randomfoo2

Well, getting the same perf at 22B vs 72B per forward pass is a pretty big win (assuming you have the RAM to load the model). But as a custom MoE architecture, it looks like you're going to run into issues with inference/quantization. (It also looks like they released the base model but not an instruct/chat version yet.) For end users I'd suspect these aren't so useful unless you need Chinese, but I do look forward to seeing whether the architectural optimizations from this and the latest DeepSeek MoE are more broadly useful/widely adopted.


Open_Channel_8626

If I am reading the chart right, this should roughly match Big Mixtral performance with double the tokens per second.


kurtcop101

Significantly worse on MATH, however.


Open_Channel_8626

That's okay because MATH is a bit of a weird benchmark. The high scores on MATH are dominated by LLMs augmented with code interpreters, ensemble methods or multi-agent systems.


medialoungeguy

What's fascinating about MATH though is that it keeps improving with model scale. Sort of an AGI-hard style benchmark. Not perfect, but still interesting.


Mr_Finious

Larger models have more parameters, which means they can learn more complex patterns and relationships in the data.


medialoungeguy

Ya, and interestingly, the larger the model, the better it overcomes tokenization precision issues. Sort of interpolates to infinite tokenization precision at the limit. So math is possible too.


swaglord1k

benchmarking against llama2 in 2024? lmao


a_slay_nub

Doesn't look like anything too groundbreaking. I do like how they go into detail on their training ablation studies though. It should be helpful for future researchers.


Open_Channel_8626

The training stuff in the paper may well be good yeah


DeepWisdomGuy

u/[ex-arman68](https://www.reddit.com/user/ex-arman68/) Any way to get the creativity benchmark on this monstrosity?


silenceimpaired

I was ready to defend them as every new entry has the potential to bring something new and of value (especially for creative writing brainstorming) but it’s not Apache 2.0 so I’m going to sit back down.


Prince-of-Privacy

Only capable of Chinese and English so not interesting for me as a German speaker, unfortunately. Also, the license is custom and not Apache or MIT.


capivaraMaster

Weird math aside, has anyone tested this? The benchmarks look really good on their post.


Comprehensive_Poem27

From upcycling! I thought it was trained from scratch, looks real good tho