arxiv:2603.11535

Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing

Published on Mar 12 · Submitted by Hanchi Sun on Mar 19

Abstract

Expert Threshold routing dynamically allocates computation in MoE models by using exponential moving average thresholds to route tokens independently, achieving better performance than Token-choice MoE without auxiliary losses.

AI-generated summary

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6× fewer tokens.

Community

🧠 Expert Threshold Routing

Dynamic Computation and Load Balancing for Autoregressive Language Models

📄 Paper · 💻 Code

๐Ÿ” The Routing Trilemma

Mixture-of-Experts (MoE) routing faces a three-way tradeoff. Token Choice (TC) lets each token pick its top experts, but everyone crowds the popular ones, requiring auxiliary losses to patch load imbalance. Expert Choice (EC) flips the direction and lets experts pick tokens, achieving perfect load balance and dynamic computation. But EC needs to see the entire batch to make selections, breaking causality for autoregressive generation.

| Routing | Dynamic Computation | Load Balance | Autoregressive |
|---|---|---|---|
| Token Choice | ❌ Fixed top-k | ❌ Needs aux loss | ✅ |
| Expert Choice | ✅ Variable | ✅ Perfect | ❌ |
| Expert Threshold | ✅ Variable | ✅ Near-perfect | ✅ |
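The directional difference between the two baseline schemes is easy to see in code. This is a minimal NumPy sketch (not the paper's implementation; shapes and k values are illustrative): Token Choice takes a top-k over the expert axis, so every token gets the same compute, while Expert Choice takes a top-k over the token axis, so every expert gets the same load but needs the whole batch.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 4))  # router scores: (tokens, experts)

# Token Choice: each token activates its top-2 experts (fixed compute per token)
tc_mask = np.zeros_like(scores, dtype=bool)
top_experts = np.argsort(scores, axis=1)[:, -2:]
np.put_along_axis(tc_mask, top_experts, True, axis=1)

# Expert Choice: each expert accepts its top-3 tokens from the whole batch
ec_mask = np.zeros_like(scores, dtype=bool)
top_tokens = np.argsort(scores, axis=0)[-3:, :]
np.put_along_axis(ec_mask, top_tokens, True, axis=0)

print(tc_mask.sum(axis=1))  # every token activates exactly 2 experts
print(ec_mask.sum(axis=0))  # every expert takes exactly 3 tokens: perfect balance
```

Note how the EC top-k ranges over the token axis: that is the batch dependence that breaks causality.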

💡 Key Idea

Load balance only needs to hold in expectation over the data distribution, not strictly within each batch. Expert Threshold (ET) routing maintains an exponential moving average (EMA) of each expert's selection cutoff from historical batches. A token activates an expert whenever its router score exceeds that threshold. No dependence on the current batch, no causality violation.

Conceptually, ET is equivalent to doing Expert Choice over an infinitely large batch. As EC's batch size grows, its per-batch cutoff converges to a fixed quantile of the global score distribution, which is exactly what ET's EMA estimates.
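A quick numerical check of this convergence claim (a toy simulation with a standard-normal score distribution, purely illustrative): as the batch grows, the per-batch EC cutoff at a fixed capacity fraction approaches the corresponding quantile of the global distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
capacity_frac = 0.25  # fraction of tokens each expert accepts (illustrative)

# stand-in for the "global" score distribution
global_cutoff = np.quantile(rng.standard_normal(1_000_000), 1 - capacity_frac)

for batch_size in (64, 1024, 65536):
    batch = rng.standard_normal(batch_size)
    ec_cutoff = np.quantile(batch, 1 - capacity_frac)  # EC's per-batch cutoff
    print(batch_size, round(abs(ec_cutoff - global_cutoff), 4))
```

The gap shrinks roughly as 1/√(batch size), which is why an EMA over many historical batches gives ET a stable estimate of the same quantile.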

📊 Results

In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than Token Choice, equivalent to reaching the same performance with 1.6× fewer training tokens. ET also matches or slightly outperforms the best Expert Choice configuration while being fully causal at both training and inference.

The model learns to allocate more computation to structurally important tokens (sentence boundaries, numerical results) and less to predictable ones, leading to sharper expert specialization.

๐Ÿ“ Citation

@article{sun2026expertthresholdrouting,
  title={Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing},
  author={Sun, Ryan and Liu, Yixin and Wu, Yonghui and Sun, Lichao},
  journal={arXiv preprint arXiv:2603.11535},
  year={2026},
  url={https://arxiv.org/abs/2603.11535}
}
