I think the data mixing should carry over to larger models; existing work from others suggests this, e.g. https://www.datologyai.com/blog/beyondweb
Asankhaya Sharma