
Aug 04 2025

Debugging Dead MoE Models: A Step-by-Step Guide

What We Expect to See at the End

So I bet when you hear Mixture-of-Experts (MoE), you immediately think “another thing that only Google can afford to train”, right? That’s exactly the myth we want to bust today. Yes, the famous MoE models are huge - we’re talking trillion-parameter scale (Kimi Team, 2025). But this is like avoiding neural networks because GPT-4 exists; you can still build a perceptron from scratch in fewer than 20 lines of code.

Unfortunately, it is a common myth that training MoE models is out of reach for most people. In fact, as we were working on this blog, this was the first question Aman raised: do we need a huge compute cluster for this? But here is the counterintuitive fact - MoE is not just about scale; it is a technique you can scale down and still observe the same benefits.

By the end of this blog you will know how to train a small MoE model that beats a dense model at GPT-2 scale. As for compute – you only need your laptop (okay, maybe a small accelerator node, but definitely not a datacenter!). Let’s dive in.

Prerequisites

If you are unfamiliar with dense GPT-2 style models, we highly recommend watching Let’s reproduce GPT-2 (124M) by Andrej Karpathy first and then coming back to this blog, where we build on top of it with a specific focus on MoEs.

Setup

Let’s get ready for the big runs! We’re building on the modded-nanogpt codebase (Jordan et al., 2024a), which gives us a solid 124M GPT-2-like baseline trained on the FineWeb dataset (Penedo et al., 2024). Our model is a decoder-only transformer that uses squared-ReLU activations (So et al., 2021), RoPE (Su et al., 2021), RMSNorm (Zhang and Sennrich, 2019) and QK-Norm (Henry et al., 2020), trained with the Muon optimizer (Jordan et al., 2024b). It is basically a modernized GPT-2 that trains in 11 minutes. If you have access to Cerebras hardware you can follow along there; otherwise it will work fine on GPUs too. To get started, download the MoE fork of NanoGPT.

Once the FineWeb training files have downloaded, open train_gpt_moe.py in your favorite editor or IDE. Everything can be scaled down (including the model size!), and we encourage you to play with it – consider this blog your starting point. Break things, try weird settings, and see what happens.

Where Expert Networks Live

Alright, you’ve pulled the modded-nanogpt codebase and you’re staring at the code. Here comes the exciting part: let’s turn the MoE concept introduced in (Soboleva, 2025a) into something you can actually train! The first step is pretty straightforward: we need to swap out the single feedforward network (FFN) for multiple expert networks.

Figure 1: Each expert is just GPT-2’s FFN. You can find the expert class definition in train_gpt_moe.py#L184-L196 and the instantiation logic in train_gpt_moe.py#L263.

As we can see in Figure 1, each expert is literally GPT-2’s FFN copy-pasted (two linear layers separated by a nonlinearity). Now, how do we use these expert networks? Figure 2 shows how we combine the experts’ outputs when processing batches of tokens, extending the single-token approach we introduced in (Soboleva, 2025a). Note that this implementation is slow and wastes computation, but it is useful for pedagogy.

Figure 2: Expert mixing for batch processing (see train_gpt_moe.py#L208-L225).
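For reference, here is a minimal PyTorch sketch of what Figures 1 and 2 describe. The names (Expert, mix_experts) and details are ours rather than the exact code in train_gpt_moe.py, but the shape of the idea is the same: each expert is a two-layer FFN, and the naive mixing loop runs every expert over every token and scales the outputs by the router’s gate values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A GPT-2 style FFN: two linear layers with a nonlinearity in between."""
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.fc_in = nn.Linear(dim, hidden_mult * dim)
        self.fc_out = nn.Linear(hidden_mult * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(F.gelu(self.fc_in(x)))

def mix_experts(x, experts, gates, indices):
    """Naive (slow, pedagogical) mixing: combine expert outputs with gate weights.

    x:       (batch, seq, dim) token activations
    experts: list of Expert modules
    gates:   (batch, seq, top_k) routing weights
    indices: (batch, seq, top_k) chosen expert ids
    """
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # weight of expert e at each position (zero where it was not selected)
        weight = (gates * (indices == e)).sum(dim=-1, keepdim=True)  # (batch, seq, 1)
        out = out + weight * expert(x)  # wasteful: every expert processes every token
    return out

Running every expert over every token is exactly the wasteful-but-simple behavior noted above; production MoE kernels instead dispatch only the selected tokens to each expert.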

As noted in part 2 of this series, MoE routing is tricky and can completely negate the benefits these models offer in theory. In the next sections we will keep the expert architecture and mixing logic unchanged, play with the routing alone, and observe how it changes the model's quality. Fair warning: we are going to mess things up and make mistakes first, just like it happens in real life.

Learned Routing

Learned routing is the industry standard; it powers the well-known MoE models released by Google, OpenAI, Anthropic, and others. Now you’ll build it yourself! We covered the theory in part 2; here we focus on the actual implementation. Figure 3 shows that learned routing has two objectives: first, finding the best expert for each token through learnable weights, and second, keeping all experts utilized through load balancing. This dual objective is usually where we can shoot ourselves in the foot, so let’s find out what happens when we launch this training run. Plumb the auxiliary loss through the model, sum it across all layers, and add it to the cross-entropy loss with a weight of 0.01.

Figure 3: Learned routing implementation: (top) router logic for token-expert assignments and (bottom) auxiliary loss for load balancing. Locate router definition in train_gpt_moe.py#L228-L240, load balancing logic can be found in train_gpt_moe.py#L242-L250.
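To make Figure 3 concrete, here is a minimal sketch of a learned router with a load-balancing auxiliary loss. The names (Router, w_router) and the exact form of the balancing term are our assumptions - the code in train_gpt_moe.py may differ - but the structure is the same: a linear layer scores the experts, we keep the top_k, and an auxiliary term penalizes uneven expert usage.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    """Learned router: a linear layer scores every expert for every token."""
    def __init__(self, dim: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.w_router = nn.Linear(dim, num_experts, bias=False)
        self.num_experts = num_experts
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, dim)
        logits = self.w_router(x)                        # (batch, seq, num_experts)
        probs = F.softmax(logits, dim=-1)
        gate, indices = probs.topk(self.top_k, dim=-1)   # keep the top_k experts per token
        # renormalize the selected gates so they sum to 1
        # (this is exactly the line where the top_k = 1 bug discussed below hides)
        gate = gate / gate.sum(dim=-1, keepdim=True)

        # load-balancing auxiliary loss:
        # f = fraction of routed tokens per expert, p = mean router probability per expert
        dispatch = F.one_hot(indices, self.num_experts).float().sum(dim=2)  # (batch, seq, num_experts)
        f = dispatch.mean(dim=(0, 1)) / self.top_k
        p = probs.mean(dim=(0, 1))
        aux_loss = self.num_experts * (f * p).sum()      # encourages uniform expert usage
        return gate, indices, aux_loss

During training, aux_loss would be summed across all MoE layers and added to the cross-entropy loss with the 0.01 weight mentioned above.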

Time to check on the loss curves. Figure 4 shows that our MoE has already beaten the dense GPT-2 baseline. Congratulations! You implemented your first MoE model that works! But wait – we mentioned shooting ourselves in the foot, yet everything just worked. So why does this guide exist? You might have noticed that back in part 2 of our MoE 101 series we promised approximately a 2% improvement over the dense GPT-2 baseline with learned routing and 4 experts. We are only getting half of that gain now. Where did the other 1% go? This is the moment most researchers experience when they implement someone else’s paper and it doesn’t work as promised. In this guide we are going to fix it. Let’s debug this properly together.

Figure 4: MoE with 4 experts and learned routing outperforms the dense GPT-2 model. The loss improvement widens throughout training, reaching 1% by our target of 2.4B tokens.

Remember, our router has two objectives: finding the best expert for each token AND keeping all experts utilized. The second objective is easy to check: we can simply look at expert utilization statistics to see how we are doing. Figure 5 shows that each layer is almost perfectly balanced. The load-balancing part of the dual objective appears to be working great - the problem isn’t there.

Figure 5: Expert utilization (averaged across layers) quickly converges to roughly 25% per expert, the perfectly balanced case with 4 experts.
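If you want to reproduce this check, a utilization statistic like the one in Figure 5 can be computed directly from the router’s assignments. The helper below is a hypothetical sketch of such a diagnostic, not necessarily how the training script logs it.

import torch

def expert_utilization(indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Fraction of routing assignments that went to each expert.

    indices: (batch, seq, top_k) expert ids chosen by the router.
    With 4 experts, a perfectly balanced router gives roughly 0.25 per expert.
    """
    counts = torch.bincount(indices.flatten(), minlength=num_experts).float()
    return counts / counts.sum()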

Back to the “finding the best expert for each token” suspect. To debug what the router is learning, we can look at the gradient norms. Check out the problem in Figure 6: the router is getting effectively zero gradient from the cross-entropy part of the loss! This means the router is only learning how to load-balance. The part where it is supposed to learn which expert is actually best for each token? That signal isn’t reaching the router. Before you scroll down to see the solution, take a look at the code in Figure 3. Can you spot the bug that’s preventing our router from getting gradients from the cross-entropy part of the loss? Small hint: consider the top_k = 1 case in particular.

Figure 6: Learned router with top_k=1 receives zero gradient from the cross-entropy loss. The gradient comes only from the load-balancing part of the objective.
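One way to produce a diagnostic like Figure 6 is to backpropagate the cross-entropy loss on its own and measure the gradient norm at the router weights. The helper below is a sketch under the assumption that you can hand it the cross-entropy loss tensor and a router weight (e.g. router.w_router.weight from the sketch above); the actual logging in the repo may differ.

import torch

def router_ce_grad_norm(ce_loss: torch.Tensor, router_weight: torch.Tensor) -> float:
    """Norm of the gradient that the cross-entropy loss alone sends to the router.

    We deliberately backprop only ce_loss (not the load-balancing auxiliary loss)
    and inspect the router weight's gradient. If this stays near zero, the router
    gets no signal about which expert is best for each token (the failure mode
    shown in Figure 6).
    """
    (grad,) = torch.autograd.grad(ce_loss, router_weight,
                                  retain_graph=True, allow_unused=True)
    return 0.0 if grad is None else grad.norm().item()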

Fixing Learned Routing

What we just learned is important: despite perfect load balancing, learned routing can be completely useless. Although the router is learning something, it has only really learned to load-balance, not how to route in a way that improves cross-entropy. Unfortunately, many MoE publications never mention this problem. Most of them focus on load-balancing issues, leaving significant performance on the table.

Alright, let’s diagnose the bug so you can check whether you guessed it right. With top_k = 1, we have gate.sum(dim=-1, keepdim=True) = gate, so our normalization gate = gate / gate.sum(...) becomes gate = gate / gate = 1. This means the cross-entropy gradient never makes it to the router weights. The learning signal is halted! Note that this is specifically an issue with top_k = 1: with top_k = 2, Figure 7 shows that the gradient flow appears normal. Why? Because with two experts active, the gate preserves their relative differences through the normalization, and the router still learns which expert should get more weight. It’s like being asked “are you interested in applying for a math degree?” with no other options: it is hard to make a confident choice. But ask “are you more interested in a math degree or in physics?”, and suddenly your 0.9 confidence in math versus 0.1 in physics makes the decision much easier.

Figure 7: Learned router with top_k=2 receives a substantial amount of gradient from the cross-entropy loss, unlike the top_k=1 case, which gets zero gradient.
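You can see the arithmetic of the bug in isolation with a few lines of PyTorch. This toy snippet is our own (not from the repo) and uses a stand-in downstream loss: with top_k = 1 the normalized gate is a constant 1 and the logits receive essentially zero gradient, while with top_k = 2 the gradient is non-zero.

import torch
import torch.nn.functional as F

# a single token's router logits over 4 experts
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]], requires_grad=True)
probs = F.softmax(logits, dim=-1)

for top_k in (1, 2):
    gate, _ = probs.topk(top_k, dim=-1)
    gate = gate / gate.sum(dim=-1, keepdim=True)  # the normalization under suspicion
    loss = (gate ** 2).sum()  # stand-in for anything downstream that depends on the gate
    (grad,) = torch.autograd.grad(loss, logits, retain_graph=True)
    # with top_k=1 the gate is a constant 1.0 and the gradient norm prints as 0.0000
    print(f"top_k={top_k}: gate={gate.detach().tolist()}, grad norm={grad.norm().item():.4f}")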

This suggests our fix. For top_k = 1 we are going to introduce a “null expert” - a phantom expert that never gets activated, but gives the router an option to contrast its predictions against. Unlike the shared-experts approach, which grows both parameter count and compute, our phantom expert adds zero computational overhead and zero additional parameters! Similarly to how Bachlechner et al. (2020) tackle vanishing gradients in the deeper layers of a network, we are going to boost the gradient flow into the router. Figure 8 shows how we can implement it.

Figure 8: Adding a “null expert” to give the router something to contrast against, enabling better gradient flow.
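As a rough illustration (the actual implementation in Figure 8 and train_gpt_moe.py may differ, and the way the bias enters the gate here is our assumption), one way to realize a null expert is to append a constant logit to the router’s scores before the softmax:

import torch
import torch.nn.functional as F

def route_with_null_expert(logits: torch.Tensor, null_expert_bias: float):
    """Top-1 routing with a phantom 'null expert'.

    A constant logit is appended before the softmax, so the selected real expert's
    probability is always contrasted against a phantom option and is no longer a
    constant 1 after gating; the cross-entropy gradient can therefore reach the
    router. The null expert is never executed: zero extra parameters, zero extra FLOPs.
    """
    batch, seq, num_experts = logits.shape
    null_logit = torch.full((batch, seq, 1), null_expert_bias,
                            dtype=logits.dtype, device=logits.device)
    probs = F.softmax(torch.cat([logits, null_logit], dim=-1), dim=-1)
    # pick the best *real* expert and use its probability as the gate, without renormalizing
    gate, indices = probs[..., :num_experts].topk(1, dim=-1)
    return gate, indices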

null_expert_bias is a tunable parameter; we set it to 10 in our experiments after minimal tuning. Now the router must justify why choosing a real expert is better than the phantom option - exactly the contrast we wanted. Let’s see if this helps the gradient flow. Figure 9 shows that we are finally receiving a non-trivial amount of gradient from the cross-entropy part of the dual objective, matching the behavior observed with top_k = 2 in Figure 7. The “learning” has begun! As a result, Figure 10 shows that our MoE model is now 2% better than the dense baseline. A big gain for a one-line fix (and, in our experience, it becomes even more pronounced at larger scale).

Figure 9: The “null expert” in the top_k=1 case restores gradient flow to the router from the cross-entropy part of the dual objective.

Take a moment and think about what just happened. You took a subtly broken MoE model whose learning signal was completely halted. You root-caused the issue, applied a targeted fix, and boom – the signal is flowing again! We’re now almost doubling the gains over the dense GPT-2 baseline. Note that you did not have to increase the parameter count or compute, and no hyperparameter sweeps were required. You just fixed a tiny component of the MoE network, the router, and it dramatically changed the model’s quality.

Figure 10: MoE with 4 experts and learned routing with a “null expert” beats the dense GPT-2 model by 2% at the target 2.4B tokens. The “null expert” almost doubles the gains over the dense baseline.

Summary of What We Were Able to Achieve

Remember how we started? You were probably convinced that MoE was just another “Google-scale only” technique. We set out to prove that wrong. We took a tiny 124M-parameter GPT-2 model – the kind you can train on a minimal compute budget – and made MoE work on top of it. Learned routing was not “learning” at first, then came back to life with a small fix. The benefits from just tweaking the router were massive. We didn’t tune hyperparameters or spend compute on batch-size sweeps. We diagnosed what was actually broken, applied a targeted fix, and solved it.

MoEs are tricky because they can fail in subtle ways. This is an unpopular opinion, but perfect load balancing does not mean the router is “learning”. Now you know this too. Yet the benefits of MoE are real: when you fix what is actually broken, the gains are immediate and significant.

Our debugging methodology (hypothesis-driven investigation, measuring the right metrics, and understanding the failure modes) applies way beyond MoEs. When your models die (and they will), you will know how to bring them back to life - and you won’t need a datacenter for it.

Citation

Questions? You can find Daria at: https://soboleva-daria.github.io/

References

Bachlechner, T., Majumder, B. P., Mao, H. H., et al. (2020). ReZero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887. https://doi.org/10.48550/arXiv.2003.04887

Henry, A., Dachapally, P. R., Pawar, S., et al. (2020). Query-key normalization for transformers. arXiv preprint arXiv:2010.04245. https://doi.org/10.48550/arXiv.2010.04245

Jordan, K., Bernstein, J., Rappazzo, B., et al. (2024a). modded-nanogpt: Speedrunning the NanoGPT baseline. GitHub Repository. https://github.com/KellerJordan/modded-nanogpt

Jordan, K., Jin, Y., Boza, V., et al. (2024b). Muon: An optimizer for hidden layers in neural networks. Blog Post. https://kellerjordan.github.io/posts/muon/

Kimi Team (2025). Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534. https://doi.org/10.48550/arXiv.2507.20534

Penedo, G., Kydličák, H., Ben Allal, L., et al. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale. In Proceedings of the Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=n6SCkn2QaG

Radford, A., Wu, J., Child, R., et al. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

So, D. R., Mańke, W., Liu, H., et al. (2021). Primer: Searching for efficient transformers for language modeling. arXiv preprint arXiv:2109.08668. https://doi.org/10.48550/arXiv.2109.08668

Soboleva, D. (2025a). MoE fundamentals: Sparse models are the future. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-why-moe

Soboleva, D. (2025b). Router wars: Which MoE routing strategy actually works. Cerebras Blog. https://www.cerebras.ai/blog/moe-guide-router

Su, J., Lu, Y., Pan, S., et al. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864. https://doi.org/10.48550/arXiv.2104.09864

Zhang, B., & Sennrich, R. (2019). Root mean square layer normalization. arXiv preprint arXiv:1910.07467. https://doi.org/10.48550/arXiv.1910.07467

Discussion

Some other random things we found while making this blog post:

  • Just removing the normalization (gate = gate / gate.sum(...)) doesn’t help (at least in the top_k = 1 case). Although some gradient from the cross-entropy loss does flow, the “learning” signal is still minimal, leading to worse performance than the dense GPT-2 baseline.
  • Since we built off modded-nanogpt, we used the Muon optimizer. We also did some runs on top of EleutherAI/nanoGPT-mup, which uses Adam and a slightly older transformer architecture. We found that with Adam, learned routing whose gradient signal comes only from the load-balancing loss (i.e., with the normalization bug) cannot outperform the dense GPT-2 baseline, whereas with Muon it does. Perhaps Muon helps the expert networks learn with less router supervision, though we haven’t studied this phenomenon in depth.
  • We tried to get hash routing to work – the version we introduced in (Soboleva, 2025b). Surprisingly, we were not able to get hash routing to outperform the dense GPT-2 baseline with either the Adam or the Muon optimizer (unless we used a GPT-2-style network only 4 layers deep). We did not spend time figuring out where the hash implementation was broken, so if you try it and find the bug – please let us know.