# Darwin Family: Zero Gradient Steps, GPQA Diamond 88.89%
How far can we push LLM reasoning *without* training?
Our team at VIDRAFT submitted this paper to Daily Papers yesterday, and it's currently #3. Huge thanks to everyone who upvoted. Sharing the core ideas below.
Darwin Family is a training-free evolutionary merging framework. By recombining the weight spaces of existing LLM checkpoints, with zero gradient-based training, it reaches frontier-level reasoning.
- Darwin-28B-Opus: GPQA Diamond 88.89%
- Zero gradient steps: not a single B200 or H200 hour needed
- Consistent gains across the 4B-35B scale
- Cross-architecture breeding between Transformer and Mamba families
- Stable recursive multi-generation evolution
# Three Core Mechanisms
1. **14-dim Adaptive Merge Genome**: fine-grained recombination at both the component level (Attention / FFN / MLP / LayerNorm / Embedding) and the block level, expanding the prior evolutionary-merge search space (see the sketch after this list).
2. **MRI-Trust Fusion**: we diagnose each layer's reasoning contribution via an **MRI (Model Reasoning Importance)** signal and fuse it with evolutionary search through a **learnable trust parameter**. Trust the diagnostic too much and search collapses; ignore it and search becomes inefficient. Darwin learns the balance from data.
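How these two mechanisms could fit together in code, as a minimal sketch: a genome of per-component blend coefficients drives a lerp-based merge of two parent checkpoints, and a trust parameter decides how far to bend those coefficients toward the MRI diagnostic. The component grouping, the fusion formula, and every name below are illustrative assumptions rather than the paper's released implementation; the real genome has 14 dimensions spanning component- and block-level knobs.

```python
import torch

# Illustrative component grouping; the actual 14 genome dimensions are not reproduced here.
COMPONENT_PATTERNS = {
    "attention": "self_attn",
    "ffn":       "mlp",
    "layernorm": "layernorm",
    "embedding": "embed_tokens",
}

def component_of(param_name: str) -> str:
    for comp, pattern in COMPONENT_PATTERNS.items():
        if pattern in param_name:
            return comp
    return "other"

def merge(parent_a: dict, parent_b: dict, genome: dict,
          mri_alpha: dict, trust: float) -> dict:
    """Interpolate two parent checkpoints. The evolutionary genome proposes a
    blend coefficient per component; the MRI diagnostic proposes its own per
    tensor; a trust value in [0, 1] decides how much to follow the diagnostic
    (a hypothetical convex combination)."""
    child = {}
    for name, w_a in parent_a.items():
        if not torch.is_floating_point(w_a):
            child[name] = w_a.clone()          # copy integer buffers unchanged
            continue
        evo = genome.get(component_of(name), 0.5)
        mri = mri_alpha.get(name, evo)
        alpha = (1.0 - trust) * evo + trust * mri
        child[name] = torch.lerp(w_a, parent_b[name], alpha)
    return child
```

In the framework itself the genome (and, per the post, the trust parameter) is optimized by an evolutionary search such as CMA-ES against benchmark scores, so no gradients ever flow through the models being merged.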
We're thrilled to release Darwin-9B-NEG, a 9B-parameter reasoning model that embeds an architecturally internalised sense of self-confidence directly into the transformer via our proprietary Native Entropy Gating (NEG) technology.
With only 9 billion parameters and 1× inference cost, Pure NEG jumps +12.63 %p over the same model without NEG. Going all-in with ensemble refinement pushes it to 84.34%, surpassing the published Qwen3.5-9B leaderboard score (81.7%) by +2.64 %p.
**What makes NEG different from Multi-Turn Iteration (MTI)?**
Classical MTI needs 3-8× extra inference passes. NEG instead lives *inside* the single decoding loop. Two tiny modules ride along with the transformer: NEG-Head predicts per-token entropy from the last hidden state, and NEG-Gate conditionally restricts the top-k choice when confidence is low. The gate activates on only 4.36% of tokens, so it is essentially free at inference time.
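The NEG-Head and NEG-Gate names come from the description above, but their actual sizes, training, and thresholds are not public, so the following is only a minimal sketch of the idea; the single linear layer, the entropy threshold, and the restricted top-k are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NEGHead(nn.Module):
    """Tiny head predicting per-token entropy from the last hidden state
    (a single linear layer is an assumption about its size)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # softplus keeps the predicted entropy non-negative
        return F.softplus(self.proj(last_hidden)).squeeze(-1)

def neg_gate(logits: torch.Tensor, predicted_entropy: torch.Tensor,
             threshold: float = 2.0, restricted_top_k: int = 5) -> torch.Tensor:
    """If predicted entropy is high (low confidence), restrict sampling to the
    top-k logits; confident tokens pass through untouched, so the gate fires
    on only a small fraction of tokens. Threshold and k are assumptions."""
    gate_on = predicted_entropy > threshold                 # [batch]
    top_k_vals, _ = torch.topk(logits, restricted_top_k, dim=-1)
    cutoff = top_k_vals[..., -1:]                           # [batch, 1]
    restricted = logits.masked_fill(logits < cutoff, float("-inf"))
    return torch.where(gate_on.unsqueeze(-1), restricted, logits)
```

Inside a decoding loop this would sit between the LM head and the sampler, e.g. `ent = neg_head(hidden_states[:, -1]); logits = neg_gate(logits, ent)`, after which tokens are sampled as usual, which is why the cost stays at 1× inference.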
**Key differentiators**
- Architecturally internalised: the model file *is* the feature
- 1× inference cost (vs. 3-8× for MTI)
- Drop-in with vLLM / SGLang / TGI / transformers, no extra engine
- +12.63 %p reasoning gain at zero latency overhead
- Single-file deployment, Apache 2.0 licensed
# Darwin-TTS: 3% of an LLM's Brain Makes TTS Speak with Emotion, Zero Training
We blended 3% of Qwen3-1.7B (LLM) FFN weights into Qwen3-TTS-1.7B's talker module. The result: emotionally enhanced speech synthesis with zero training, zero data, and zero GPU hours.
Qwen3-1.7B (LLM) and Qwen3-TTS-1.7B's talker share a 100% identical architecture: the same hidden_size (2048), the same number of layers (28), and the same number of heads (16). This enabled pure 1:1 weight blending across 84 FFN tensors with a single lerp operation. At a 3% blend, emotion appears. At 5%, emotion intensifies. At 10%, the model breaks, producing 655-second outputs for a 3-second sentence, because the LLM's "keep generating" pattern overwhelms the TTS stop signal.
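A quick sanity check of that 1:1 mapping before blending, as a sketch: the tensor-name suffixes below follow common Qwen-style FFN naming and are assumptions about these particular checkpoints; 28 layers × 3 projections gives the 84 tensors mentioned above.

```python
# Assumed Qwen-style FFN projection names; the talker's real keys may carry a different prefix.
FFN_SUFFIXES = ("mlp.gate_proj.weight", "mlp.up_proj.weight", "mlp.down_proj.weight")

def matched_ffn_names(llm_sd: dict, tts_sd: dict) -> list[str]:
    """Return FFN tensor names present in both state dicts with identical shapes."""
    names = [k for k in tts_sd
             if k.endswith(FFN_SUFFIXES) and k in llm_sd
             and tts_sd[k].shape == llm_sd[k].shape]
    assert len(names) == 28 * 3, f"expected 84 matched FFN tensors, got {len(names)}"
    return names
```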
To our knowledge, this is the first training-free cross-modal weight transfer between an LLM and a TTS model. Prior work requires adapter training (SmolTolk, 2025), fine-tuning (CSLM, 2025), or massive end-to-end compute (GPT-4o). Darwin-TTS achieves cross-modal capability transfer in under 2 minutes on a CPU.
The key insight: TTS models with LLM backbones already "think" in language. We're just restoring 3% of the original LLM's language-understanding patterns, particularly those related to emotional semantics and prosody planning. The code is three lines: load the model, load the LLM FFN, call p.lerp_(llm_weight, 0.03).
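Expanding those three lines into a runnable helper, as a hedged sketch: in-place `lerp_` over the matched FFN tensors is the whole operation, but the key matching by identical names and the helper itself are assumptions, not the released Darwin-TTS script.

```python
import torch

def blend_ffn(tts_sd: dict, llm_sd: dict, alpha: float = 0.03) -> int:
    """In-place blend of the LLM's FFN weights into the TTS talker's FFN weights:
    tts <- (1 - alpha) * tts + alpha * llm. Returns how many tensors were
    blended (84 expected here). Matching by identical key names is an assumption."""
    blended = 0
    with torch.no_grad():
        for name, p in tts_sd.items():
            if ".mlp." in name and name in llm_sd and llm_sd[name].shape == p.shape:
                p.lerp_(llm_sd[name].to(dtype=p.dtype), alpha)  # one lerp per tensor
                blended += 1
    return blended

# Illustrative usage: blend_ffn(tts_model.state_dict(), llm_model.state_dict(), 0.03)
```

In standard PyTorch modules the tensors returned by state_dict() share storage with the live parameters, so the in-place lerp_ updates the model directly; alpha=0.03 corresponds to the 3% blend, and pushing toward 0.10 is the regime where the post reports generation breaking down.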
We are the creators of the Darwin Evolutionary Merge Framework. Darwin LLM V7 achieved GPQA Diamond 86.9% (HF Benchmark #3) through CMA-ES-optimized FFN crossbreeding. Darwin-TTS extends this principle from LLM-to-LLM merging to cross-modal LLM-to-TTS transfer. Apache 2.0.