
Aug 04 2025

Router Wars: Which MoE Routing Strategy Actually Works

Here’s what nobody tells you about Mixture-of-Experts (MoE): the router can single-handedly destroy your model. You can have a perfect expert network architecture, tuned hyperparameters, and unlimited compute, but if your router collapses, you’re back to dense model performance regardless of how many experts you choose.

The router’s job sounds simple – it decides which expert handles each token. In practice, it’s where most MoE implementations go wrong. Pick the wrong strategy and you can spend weeks debugging and still be completely lost.

So which routing strategy should you use, and what should you expect from it? Let’s examine the most common approaches, their real-world tradeoffs, and what works in practice.

The Routing Landscape: Oh So Many Flavors…

Table 1: MoE routing reality. Behind the marketing hype, all production systems use some version of learned routing with engineering tricks layered on top.

Let’s first address the elephant in the room. Why should you care about routing techniques from 2017-2022 when dozens of newer methods are published every week? Because every production MoE model today is built on top of them!

In Table 1 you can see fancy names such as shared experts, capacity factors, adaptive auxiliary (aux) loss, or expert bias (and there are many more). These are just engineering tricks layered on core methods developed almost a decade ago. What are they trying to fix? Two fundamental problems: expert utilization (i.e., are all your experts actually being used?) and expert specialization (i.e., are your experts learning different things, or just copying each other and introducing redundancy?).

What about DeepSeek-V3’s novel routing method? It’s vanilla learned routing with an aux loss at the sequence level, plus extra engineering tricks to improve expert utilization. Qwen3’s routing breakthrough? Also learned routing with an aux loss, but at the global batch level – in other words, it simply relaxes the load balancing regularization a bit more to make experts more specialized.

Want to pick something off the shelf? Go ahead, use Table 1 and close this guide. But when your shiny new routing method fails at 3am during a multi-million-dollar training run, you’ll be debugging one of the core approaches underneath all the engineering layers – the ones we explore in greater depth in the rest of this guide.

The Three Fundamental Approaches

Figure 1: Router choice makes or breaks MoE performance scaling. At 128 experts, learned and Sinkhorn routing both deliver 3x bigger quality gains than hash routing, with the gap widening as we increase expert count.

Hash Routing: The Safe but Boring Choice

Hash routing (Roller et al., 2021) is the most straightforward approach – the router from moe_layer introduced in (Soboleva, 2025) simply assigns tokens using:

$$\text{expert\_id} = \operatorname{hash}(\text{token\_id}) \bmod N$$

where N is the number of experts and token_id is the token index in the vocabulary. It’s deterministic, easy to understand, and impossible to break. It also doesn’t work very well.
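To make this concrete, here is a minimal sketch of a hash router in PyTorch. It uses the simplest possible hash (identity plus modulo); the function and tensor names below (hash_route, token_ids, num_experts) are illustrative and not taken from the moe_layer implementation referenced above.

```python
import torch

def hash_route(token_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Assign each token to an expert purely from its vocabulary id (sketch).

    token_ids: (batch, seq_len) integer tensor of vocabulary indices.
    Returns a tensor of the same shape with expert assignments.
    Deterministic and context-free: the same token id always lands on the
    same expert, no matter what surrounds it.
    """
    return token_ids % num_experts

# Example: 4 experts, a tiny batch of token ids.
token_ids = torch.tensor([[17, 42, 3, 1025]])
print(hash_route(token_ids, num_experts=4))  # tensor([[1, 2, 3, 1]])
```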

Looking at Figure 2a, hash routing maintains perfect load balancing across all layers – every expert gets the same number of tokens (high expert utilization). But Figure 3a shows why this doesn’t help: experts end up learning overlapping, similar representations because token assignments completely disregard the token’s context (low expert specialization). A token representing “function” in code and “function” in a math paper maps to the same token_id but needs completely different processing. Hash routing can’t tell the difference.

As a result, with 16 experts, hash routing gives you only a 1.5% loss improvement (compared to Chinchilla-optimal dense scaling (Hoffmann et al., 2022) at fixed compute), and the gain barely increases with more experts (Figure 1).

Figure 2: Expert load balancing across layers. Hash routing (a) keeps perfect balance, learned routing (b) collapses in early/late layers, Sinkhorn routing (c) maintains hash-level load balancing across all layers.

Learned Routing: The Industry Standard

With hash routing the problem is clear: ignoring context kills performance. Learned routing, first introduced in (Shazeer et al., 2017), takes the opposite approach – it learns which expert should handle each token based on its context. Concretely, the router from moe_layer is now a learned linear layer that outputs logits for each expert. To penalize the router for potential imbalances, we add an auxiliary loss:

$$\mathcal{L}_{\text{aux}} = \text{coeff} \cdot N \sum_{i=1}^{N} f_i \, \bar{P}_i$$

where $f_i$ is the fraction of tokens that activate expert $i$ (binary: either a token’s top_k includes that expert or not), $\bar{P}_i$ is the mean of the experts’ mixing weights we already defined in moe_layer, $N$ is the number of experts, and coeff controls how hard you want to enforce balance.
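As a rough illustration, here is what that auxiliary loss could look like in PyTorch. The function and variable names (load_balancing_loss, router_logits, top_k, coeff) are my own sketch, not the moe_layer code from (Soboleva, 2025).

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int, coeff: float = 1e-2) -> torch.Tensor:
    """Auxiliary load-balancing loss (sketch).

    router_logits: (num_tokens, num_experts) output of the learned linear router.
    f_i – fraction of tokens whose top_k includes expert i (hard counts).
    p_i – mean mixing weight (softmax probability) assigned to expert i.
    The loss is smallest when both are uniform, i.e. ~1/num_experts each.
    """
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)                      # mixing weights
    top_k_idx = probs.topk(top_k, dim=-1).indices                 # chosen experts per token
    mask = F.one_hot(top_k_idx, num_experts).sum(dim=1).float()   # (tokens, experts), binary
    f = mask.mean(dim=0)   # fraction of tokens routed to each expert
    p = probs.mean(dim=0)  # mean router probability per expert
    return coeff * num_experts * torch.sum(f * p)

# Usage with a random router output: 64 tokens, 8 experts, top-2 routing.
logits = torch.randn(64, 8)
print(load_balancing_loss(logits, top_k=2))
```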

The results are impressive. With 16 experts, learned routing delivers a solid 4% loss improvement – nearly 3x larger gains than hash routing (Figure 1)! This is why every production MoE system uses some variant of learned routing (Table 1).

The magic happens through specialization. Figure 3b shows learned routing creating clean, separated expert representations – each expert carves out its own specialty instead of producing overlapping patterns like in hash routing.

But there is a problem: router collapse! Figure 2b shows that while the middle layers balance well, early and late layers funnel most tokens to just 1-2 experts. This creates load balancing nightmares for distributed training (for example, when using expert parallelism (DeepSeek-AI, 2024)). This is why the DeepSeek-V3 (DeepSeek-AI, 2024) and Qwen2 (Qwen Team, 2024) MoE models use shared experts (always-activated experts alongside the routed ones).

Figure 3: The story of expert specialization (middle layer)[1]. Hash (a) creates a mess – experts learn similar representations because assignments are context-free. Learned routing (b) works beautifully in the middle layers – each expert finds its niche. Sinkhorn (c) enforces balance so strictly that experts can’t specialize properly.

Sinkhorn Routing: The Per-Layer Load Balancer

Learned routing delivers great performance but suffers from router collapse in some layers. Hash routing has perfect load balancing through all layers but ignores context entirely, leading to suboptimal performance. What if you could get learned-routing-level performance with better per-layer load balancing control? This is where Sinkhorn routing comes in (Clark et al., 2022).

Learned routing controls load balancing globally across layers – but individual layers can still collapse if others compensate. Sinkhorn gets rid of the auxiliary loss altogether and prevents imbalance by regularizing each layer independently. It iteratively alternates two normalizations over the (tokens × experts) logits matrix:

first, it ensures equal load per expert across tokens (dim=0):

$$\text{logits}_{t,e} \;\leftarrow\; \text{logits}_{t,e} - \log \sum_{t'} \exp\big(\text{logits}_{t',e}\big)$$

then it normalizes the distribution for each token across experts (dim=1):

$$\text{logits}_{t,e} \;\leftarrow\; \text{logits}_{t,e} - \log \sum_{e'} \exp\big(\text{logits}_{t,e'}\big)$$
The final step applies exp(logits) to convert the normalized log-probabilities into experts’ mixing weights. As a result, we can achieve hash-level load balance across all layers (Figure 2c) with learned-routing-level quality (Figure 1).
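Here is a minimal sketch of that log-space normalization in PyTorch. The function name, the fixed iteration count, and stopping after a set number of sweeps (rather than checking convergence) are my assumptions.

```python
import torch

def sinkhorn_normalize(logits: torch.Tensor, num_iters: int = 3) -> torch.Tensor:
    """Alternating log-space normalization of a (num_tokens, num_experts) logit matrix.

    dim=0 pass: each expert column is normalized over tokens, pushing every
                expert toward the same total load.
    dim=1 pass: each token row is normalized over experts, so every token
                carries a proper distribution over experts.
    Returns exp(logits) after the final pass, i.e. the mixing weights.
    """
    for _ in range(num_iters):
        logits = logits - torch.logsumexp(logits, dim=0, keepdim=True)  # balance expert load
        logits = logits - torch.logsumexp(logits, dim=1, keepdim=True)  # normalize per token
    return torch.exp(logits)

weights = sinkhorn_normalize(torch.randn(64, 8))
print(weights.sum(dim=1)[:4])  # each row sums to 1 after the final dim=1 pass
print(weights.sum(dim=0))      # column sums approach 64/8 = 8 as iterations increase
```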

But here’s an important insight: better load balancing ≠ better learning. Figure 3c shows that enforcing strict per-layer balance limits how much experts can differentiate themselves compared to learned routing’s cleaner separation (Figure 3b). Sinkhorn essentially takes collapsed layers (with only a few experts effectively utilized) and forcibly moves tokens from overutilized experts to underutilized ones. You’re not getting better token-expert matching – you’re just solving the load balancing problem.

You might wonder, then, why Sinkhorn isn’t the industry standard if it captures the best parts of both learned and hash routing. Unfortunately, it is significantly harder to scale in practice and has seen limited industry adoption due to implementation complexities. Here’s an important one: Sinkhorn’s iterative algorithm detaches gradients, breaking router training.

The fix is surprisingly simple but somewhat buried in the literature. Let’s use the Sinkhorn weights:

$$\text{indices} = \operatorname{TopK}\big(\operatorname{Sinkhorn}(\text{logits})\big)$$

to select which experts to route to, but compute the experts’ mixing weights from the original (non-detached) logits:

$$\text{weights} = \operatorname{softmax}(\text{logits})\big|_{\text{indices}}$$
Most people miss that and wonder why their router only learns how to load balance.
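A sketch of how that fix could look in PyTorch, building on the sinkhorn_normalize helper sketched above; the routing function, its name, and the softmax-based mixing weights are my assumptions, not the original moe_layer code.

```python
import torch
import torch.nn.functional as F

def sinkhorn_route(logits: torch.Tensor, top_k: int):
    """Select experts with Sinkhorn-balanced scores, but keep gradients
    flowing through the original logits for the mixing weights (sketch)."""
    with torch.no_grad():
        # Balanced scores are used only for *selection*, so the iterative
        # normalization never needs gradients.
        balanced = sinkhorn_normalize(logits.detach().float())
        indices = balanced.topk(top_k, dim=-1).indices            # (tokens, top_k)

    # Mixing weights come from the raw (non-detached) logits, so the router
    # still receives a learning signal through the model loss.
    probs = F.softmax(logits, dim=-1)
    weights = torch.gather(probs, dim=-1, index=indices)          # (tokens, top_k)
    return indices, weights

indices, weights = sinkhorn_route(torch.randn(64, 8, requires_grad=True), top_k=2)
print(weights.requires_grad)  # True: gradients reach the router via the mixing weights
```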

We’re still trying to find a routing mechanism that simultaneously improves both expert utilization and specialization (Chi et al., 2022; Qiu et al., 2025). This matters because we want to improve performance by adding hundreds or even thousands of experts, but router inefficiency seems to become more troublesome as the expert count grows. The router is the core component of the MoE system: if it collapses, the MoE scaling advantages can vanish entirely.

Knowing which router to pick is just the beginning. Even with the right choice, MoE training is fragile. Router collapse, load imbalance, vanishing gradients, and other mysterious training instabilities can appear even in implementations that look alright. Your loss curve keeps going down, but your router learns to route everything to only one expert, and you end up with your baseline dense model despite the increased model capacity.

Sound familiar? In “Debugging Dead MoE Models: A Step-by-Step Guide”, we’ll build a complete MoE model from scratch and debug these issues step-by-step. You will learn how to fix subtle bugs that make MoE training so much harder than training dense models.


Questions? Find me at: https://soboleva-daria.github.io/

Footnotes

[1] Why different projections in Fig 3a vs 3b? Hash lacks router weights, so we use PCA. Learned routing has router weights that define decision boundaries.

References

Chi, Z., Dong, L., Huang, S., et al. (2022). On the representation collapse of sparse mixture of experts. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). https://doi.org/10.48550/arXiv.2204.09179

Clark, A., de las Casas, D., Guy, A., et al. (2022). Unified scaling laws for routed language models. arXiv preprint arXiv:2202.01169. https://doi.org/10.48550/arXiv.2202.01169

DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. https://doi.org/10.48550/arXiv.2412.19437

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. https://doi.org/10.48550/arXiv.2203.15556

Qiu, Z., Huang, Z., Zheng, B., et al. (2025). Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (pp. 5005-5018). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.249

Qwen Team (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. https://doi.org/10.48550/arXiv.2412.15115

Roller, S., Sukhbaatar, S., Szlam, A., et al. (2021). Hash layers for large sparse models. arXiv preprint arXiv:2106.04426. https://doi.org/10.48550/arXiv.2106.04426

Shazeer, N., Mirhoseini, A., Maziarz, K., et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538. https://doi.org/10.48550/arXiv.1701.06538

Soboleva, D. (2025). MoE fundamentals: Sparse models are the future. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-why-moe