Abstract
Token filtering during pretraining effectively reduces unwanted language model capabilities while maintaining alignment, becoming more effective at larger scales and tolerating noisy labels with sufficient compute.
Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.
Community
Key Findings:
1. Token-level Filtering vs Document-level Filtering (Figure 3)
- Token filtering Pareto-dominates document filtering: it achieves an equal reduction in undesired capabilities (equal medical loss) at a lower cost to desired capabilities (lower biology loss)
- More precise filtering preserves more of the beneficial content (see the loss-masking sketch below)
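A minimal sketch of the distinction, assuming token filtering is implemented by masking flagged tokens out of the pretraining cross-entropy (the paper also studies removing tokens from the stream outright; `masked_lm_loss` and `forget_mask` are illustrative names, not from the paper):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, forget_mask):
    """Next-token cross-entropy that skips classifier-flagged tokens.

    logits:      (batch, seq, vocab) model outputs
    labels:      (batch, seq) target token ids
    forget_mask: (batch, seq) bool, True where a token was flagged
                 as forget-domain content
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none"
    )  # (batch, seq)
    keep = (~forget_mask).float()
    # Document-level filtering would instead drop any sequence with
    # forget_mask.any(dim=1), discarding its benign tokens as well.
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```

Keeping the benign tokens of mixed documents is what puts token filtering on the better side of the trade-off curve.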
2. Scaling Effects (Figures 1, 4, 5, 6)
- Filtering gets more effective with scale. For the largest (1.8B-parameter) models, on the medical forget domain (an estimation sketch follows this item):
- Document filtering: ~30× compute slowdown
- Token removal: >7,000× compute slowdown
- Multiple-choice evaluation: Models score near chance on MedMCQA and MedQA-USMLE (medical) but maintain performance on retain domains
- Free response: Token filtering reduces medical answer correctness by up to 20× and relevance/coherence by ~3× relative to the baseline
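One way to read these compute-slowdown figures: the filtered model spends its whole budget to reach a forget-domain loss the baseline already hit with a small fraction of the compute. A rough sketch of that estimate, assuming loss-vs-compute curves for both runs are available (log-space interpolation; not necessarily the paper's exact procedure):

```python
import numpy as np

def forget_slowdown(compute, loss_filtered, loss_baseline):
    """Estimate the forget-domain compute slowdown: how much earlier
    the baseline reached the filtered run's final forget loss.

    compute:       increasing array of pretraining FLOPs (shared x-axis)
    loss_filtered: forget-domain loss curve of the filtered model
    loss_baseline: forget-domain loss curve of the baseline model
    """
    target = loss_filtered[-1]  # best forget loss the filtered run reached
    # Find (in log compute) where the baseline first hit that loss;
    # np.interp needs ascending x, so reverse the descending curve.
    log_c = np.interp(target, loss_baseline[::-1], np.log(compute)[::-1])
    return float(compute[-1] / np.exp(log_c))
```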
3. Robustness to Attacks (Figure 7)
- For 1.8B models, token filtering is ~10× more robust than unlearning against adversarial finetuning attacks
- Recovering capabilities from a state-of-the-art unlearning baseline (RMU) takes 13× fewer finetuning tokens than recovering them from token removal (measurement sketch below)
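Robustness here is measured in attacker effort: how many forget-domain tokens of finetuning it takes to restore the capability. A minimal sketch of that measurement, assuming a Hugging Face-style causal LM (where `labels=input_ids` yields the shifted next-token loss) and a loss threshold standing in for "capability recovered"; all names are hypothetical:

```python
def tokens_to_recover(model, optimizer, forget_loader, loss_target):
    """Finetune on forget-domain text until its loss drops below
    loss_target; return how many tokens the attacker needed."""
    seen = 0
    for batch in forget_loader:  # batches of (batch, seq) token ids
        out = model(input_ids=batch["input_ids"],
                    labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        seen += batch["input_ids"].numel()
        if out.loss.item() < loss_target:
            return seen
    return seen  # not recovered within this token budget
```

Comparing this count for a token-filtered model against an RMU-unlearned one gives the 13× figure above.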
4. Alignment Compatibility (Figures 8, 9)
- Models can still be aligned on the forget domain:
- Token-level filtering makes refusal training easier (2× better refusal generalization)
- Document filtering struggles with alignment generalization
- Linear probes show models can still distinguish forget vs. retain tokens despite filtering (probe sketch below)
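A minimal sketch of such a linear-probe check, assuming per-token activations from one fixed layer and binary forget/retain labels (names are illustrative):

```python
import torch

def fit_token_probe(acts, labels, epochs=200, lr=1e-2):
    """Logistic-regression probe: can a linear map on frozen
    activations separate forget-domain from retain tokens?

    acts:   (num_tokens, d_model) float activations from one layer
    labels: (num_tokens,) 1.0 for forget tokens, 0.0 for retain
    """
    probe = torch.nn.Linear(acts.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(acts).squeeze(-1), labels).backward()
        opt.step()
    with torch.no_grad():  # in practice, score a held-out split
        acc = ((probe(acts).squeeze(-1) > 0).float() == labels).float().mean()
    return probe, acc.item()
```

High probe accuracy on a filtered model is consistent with the finding: filtering removes the capability without erasing the model's ability to recognize the domain.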
5. Classifier Training (Table 1, Figure 11)
- Small, task-specific models outperform large general ones:
- A 224M-parameter biLM achieves 0.894 F1 on the test set
- It outperforms the 395M-parameter ModernBERT-large (0.794 F1)
- Domain-specific pretraining improves classifier performance (finetuning sketch below)
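A hedged sketch of finetuning such a token-level classifier with a standard token-classification head. The paper's 224M biLM is custom; the ModernBERT-large baseline from Table 1 stands in here, and the toy labels stand in for the sparse-autoencoder-derived ones:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

name = "answerdotai/ModernBERT-large"  # stand-in for the custom biLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=2)

# Toy batch: in practice the per-token 0/1 labels come from the
# SAE labeling pipeline; -100 excludes a position (here the special
# tokens) from the loss.
enc = tok(["Aspirin inhibits platelet aggregation."], return_tensors="pt")
labels = torch.full_like(enc["input_ids"], -100)
labels[0, 1:-1] = 1  # pretend every content token is medical

out = model(**enc, labels=labels)
out.loss.backward()  # one distillation step; wrap in a real loop/optimizer
```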
6. Label Quality Tolerance (Figures 12, 13, 14, 15)
- Robust to imperfect labels:
- Aggressive filtering with sufficient compute can overcome label noise
- Token-level classifiers generalize from weak labels better than document-level ones
- Precision can be traded for recall to maintain effectiveness (thresholding sketch below)
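That precision/recall trade is a choice of decision threshold: filter aggressively by taking the highest threshold that still clears a recall floor. A small sketch with scikit-learn (`min_recall` and the score/label arrays are assumptions, not values from the paper):

```python
from sklearn.metrics import precision_recall_curve

def high_recall_threshold(scores, labels, min_recall=0.95):
    """Pick the largest decision threshold whose recall still meets
    min_recall: catch nearly all forget tokens, accepting that more
    benign tokens get filtered too.

    scores: per-token classifier probabilities
    labels: ground-truth 0/1 forget labels
    """
    _, recall, thresholds = precision_recall_curve(labels, scores)
    # thresholds has one fewer entry than recall, and recall falls
    # as the threshold rises, so keep thresholds that clear the floor.
    ok = recall[:-1] >= min_recall
    return thresholds[ok].max() if ok.any() else thresholds.min()
```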
Related papers recommended by the Semantic Scholar API:
- Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (2025)
- Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (2025)
- How Does Prefix Matter in Reasoning Model Tuning? (2026)
- LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings (2025)
- Next-Embedding Prediction Makes Strong Vision Learners (2025)
- Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants (2025)
- Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training (2025)