arxiv:2603.11535

Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Published on Mar 12 · Submitted by Hanchi Sun on Mar 19

Abstract

Expert Threshold routing dynamically allocates computation in MoE models by using exponential moving average thresholds to route tokens independently, achieving better performance than Token-choice MoE without auxiliary losses.

AI-generated summary

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.

Community

🧠 Expert Threshold Routing

Dynamic Computation and Load Balancing for Autoregressive Language Models

📄 Paper · 💻 Code

๐Ÿ” The Routing Trilemma

Mixture-of-Experts (MoE) routing faces a three-way tradeoff. Token Choice (TC) lets each token pick its top experts, but everyone crowds the popular ones, requiring auxiliary losses to patch load imbalance. Expert Choice (EC) flips the direction and lets experts pick tokens, achieving perfect load balance and dynamic computation. But EC needs to see the entire batch to make selections, breaking causality for autoregressive generation.

| Routing | Dynamic Computation | Load Balance | Autoregressive |
|---|---|---|---|
| Token Choice | ❌ Fixed top-k | ❌ Needs aux loss | ✅ |
| Expert Choice | ✅ Variable | ✅ Perfect | ❌ |
| Expert Threshold | ✅ Variable | ✅ Near-perfect | ✅ |
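The directional difference between the two baseline schemes is easy to see in code. This is a minimal NumPy sketch (not the paper's implementation; shapes and k values are illustrative): Token Choice takes a top-k over the expert axis, so every token gets the same compute, while Expert Choice takes a top-k over the token axis, so every expert gets the same load but needs the whole batch.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 4))  # router scores: (tokens, experts)

# Token Choice: each token activates its top-2 experts (fixed compute per token)
tc_mask = np.zeros_like(scores, dtype=bool)
top_experts = np.argsort(scores, axis=1)[:, -2:]
np.put_along_axis(tc_mask, top_experts, True, axis=1)

# Expert Choice: each expert accepts its top-3 tokens from the whole batch
ec_mask = np.zeros_like(scores, dtype=bool)
top_tokens = np.argsort(scores, axis=0)[-3:, :]
np.put_along_axis(ec_mask, top_tokens, True, axis=0)

print(tc_mask.sum(axis=1))  # every token activates exactly 2 experts
print(ec_mask.sum(axis=0))  # every expert takes exactly 3 tokens: perfect balance
```

Note how the EC top-k ranges over the token axis: that is the batch dependence that breaks causality.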

💡 Key Idea

Load balance only needs to hold in expectation over the data distribution, not strictly within each batch. Expert Threshold (ET) routing maintains an exponential moving average (EMA) of each expert's selection cutoff from historical batches. A token activates an expert whenever its router score exceeds that threshold. No dependence on the current batch, no causality violation.

Conceptually, ET is equivalent to doing Expert Choice over an infinitely large batch. As EC's batch size grows, its per-batch cutoff converges to a fixed quantile of the global score distribution, which is exactly what ET's EMA estimates.
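A quick numerical check of this convergence claim (a toy simulation with a standard-normal score distribution, purely illustrative): as the batch grows, the per-batch EC cutoff at a fixed capacity fraction approaches the corresponding quantile of the global distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
capacity_frac = 0.25  # fraction of tokens each expert accepts (illustrative)

# stand-in for the "global" score distribution
global_cutoff = np.quantile(rng.standard_normal(1_000_000), 1 - capacity_frac)

for batch_size in (64, 1024, 65536):
    batch = rng.standard_normal(batch_size)
    ec_cutoff = np.quantile(batch, 1 - capacity_frac)  # EC's per-batch cutoff
    print(batch_size, round(abs(ec_cutoff - global_cutoff), 4))
```

The gap shrinks roughly as 1/√(batch size), which is why an EMA over many historical batches gives ET a stable estimate of the same quantile.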

📊 Results

In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than Token Choice, equivalent to reaching the same performance with 1.6× fewer training tokens. ET also matches or slightly outperforms the best Expert Choice configuration while being fully causal at both training and inference.

The model learns to allocate more computation to structurally important tokens (sentence boundaries, numerical results) and less to predictable ones, leading to sharper expert specialization.

๐Ÿ“ Citation

@article{sun2026expertthresholdrouting,
  title={Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing},
  author={Sun, Ryan and Liu, Yixin and Wu, Yonghui and Sun, Lichao},
  journal={arXiv preprint arXiv:2603.11535},
  year={2026},
  url={https://arxiv.org/abs/2603.11535}
}
