LLaDA-8B dLLM — Core AI (on-device diffusion LLM)

The model zoo's first diffusion language model (dLLM) for Apple Core AI. Most on-device LLMs are autoregressive — they write one token at a time, left to right. This one is a masked diffusion model: it starts from a canvas of [MASK] tokens and fills them in in parallel, committing the most-confident positions each step until the answer resolves. A different decoding paradigm, running on-device on Apple Silicon.

Base: GSAI-ML/LLaDA-8B-Instruct (LLaMA-dense 8B, bidirectional, no causal mask)
Distillation: d3LLM/d3LLM_LLaDA (hao-ai-lab / NVIDIA), a pseudo-trajectory distillation that sharply reduces the number of denoising steps (≈8 tokens committed per forward here)
Quantization: int4 weight-only (per-block-32 body) + int8 head, ≈4.9 GB
Runtime: Apple Core AI (coreai-core), GPU
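
To make "int4 weight-only, per-block-32" concrete, here is a minimal sketch of block-wise quantization. It assumes a symmetric absmax scheme with one scale per block of 32 weights; this card does not state which scheme the conversion actually uses, and `QuantizedBlock`, `quantizeInt4`, and `dequantize` are hypothetical names for illustration only.

```swift
// Sketch only: symmetric absmax int4 quantization over blocks of 32 weights.
// The real conversion scheme (symmetric vs. affine, scale dtype, packing) is
// not stated in this card; everything here is an illustrative assumption.

struct QuantizedBlock {
    let scale: Float      // one scale shared by the whole block
    let values: [Int8]    // each value lies in -8...7 (4-bit range), stored unpacked
}

func quantizeInt4(_ weights: [Float], blockSize: Int = 32) -> [QuantizedBlock] {
    var blocks: [QuantizedBlock] = []
    var start = 0
    while start < weights.count {
        let end = min(start + blockSize, weights.count)
        let block = weights[start..<end]
        let absMax = block.map { abs($0) }.max() ?? 0
        // Map the largest magnitude to 7; avoid dividing by zero for all-zero blocks.
        let scale: Float = absMax > 0 ? absMax / 7 : 1
        let quantized = block.map { w -> Int8 in
            let rounded = (w / scale).rounded()
            return Int8(max(-8, min(7, rounded)))
        }
        blocks.append(QuantizedBlock(scale: scale, values: quantized))
        start = end
    }
    return blocks
}

func dequantize(_ blocks: [QuantizedBlock]) -> [Float] {
    var out: [Float] = []
    for block in blocks {
        for v in block.values {
            out.append(Float(v) * block.scale)
        }
    }
    return out
}
```

Smaller blocks track local weight ranges more closely at the cost of storing more scales, which is the usual reason to quantize per block rather than per tensor.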

Use it

▶️ Run it (source) — the DiffuseChat runner (GUI + CLI, one app for every diffusion LM in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/DiffuseChat/DiffuseChat.xcodeproj
# → Run, then pick "LLaDA-8B (diffusion)" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/DiffuseChat
swift run diffuse-cli --model llada-8b --prompt "What is the capital of France?"

💻 Build with it: a complete snippet that runs as pasted; the glue is the kit API:

import CoreAIKit

let prompt = "What is the capital of France?"
let dlm = try await KitDiffusionLM(catalog: "llada-8b")
let reply = try await dlm.reply(to: prompt)
// reply: the denoised answer — pass onStep: to watch the canvas fill in per forward
// (still-masked positions as ░), in parallel, not left-to-right

The take-home is Examples/DiffuseChat/Sources/QuickStart.swift: this exact code as one typed function with no UI. The CLI is an argument shell over it, and the GUI renders the same live canvas. The canvas is fixed (S=256, about 210 generated tokens) and there is no KV cache, so the whole history must fit in it. reply(messages:) takes role/content turns and drops the oldest turns first when they do not fit. Pass onStep: nil if you only want the final text.
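
The "drop the oldest first" rule can be sketched as below. This is not the kit's implementation: `Turn`, `fitHistory`, the `reserve` parameter (canvas slots held back for the answer), and the whitespace word counter are all hypothetical stand-ins. A real app would count tokens with the model's tokenizer.

```swift
// Sketch only: Turn, fitHistory, and wordCount are hypothetical stand-ins,
// not CoreAIKit API. How the kit splits the canvas between prompt and answer
// is an assumption expressed here through `reserve`.

struct Turn {
    let role: String
    let content: String
}

/// Drops the oldest turns until the history fits in `canvas - reserve` tokens.
/// The newest turn is always kept, even if it alone exceeds the budget.
func fitHistory(_ turns: [Turn], canvas: Int, reserve: Int,
                countTokens: (Turn) -> Int) -> [Turn] {
    let budget = canvas - reserve
    var kept = turns
    var total = 0
    for turn in kept {
        total += countTokens(turn)
    }
    while total > budget && kept.count > 1 {
        let dropped = kept.removeFirst()   // oldest turn goes first
        total -= countTokens(dropped)
    }
    return kept
}

// Crude stand-in tokenizer: one token per whitespace-separated word.
let wordCount: (Turn) -> Int = { turn in
    turn.content.split(separator: " ").count
}
```

Because there is no KV cache, every kept turn is re-read on every forward, so trimming history reduces what competes with the answer for canvas space.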

Integration checklist

SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
Info.plist: none needed
Entitlements: none needed
First run downloads the model — 5.3 GB (Mac) — then it loads from the local cache (Application Support; progress via the downloadProgress callback)
Measure in Release — Debug is ~3× slower on per-token host work

What it looks like

The answer doesn't stream left-to-right — the canvas denoises. Mid-generation you see tokens pop in out of order, e.g.:

48 + ░4 =░7░ clips░░░   →   48 + 24 = 72 clips altogether.

That parallel, out-of-order fill is the signature of a masked-diffusion LM.
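
A minimal sketch of how such a frame could be rendered, assuming the canvas is held as an array of optional text pieces where `nil` means "still masked". `renderCanvas` and the piece boundaries below are hypothetical and chosen only to reproduce the frame above; they are not the DiffuseChat renderer or the model's real tokenization.

```swift
// Sketch only: a hypothetical renderer, not the DiffuseChat implementation.
// `nil` marks a position that is still [MASK].

func renderCanvas(_ pieces: [String?], placeholder: Character = "░") -> String {
    var out = ""
    for piece in pieces {
        out += piece ?? String(placeholder)
    }
    return out
}

// Illustrative mid-generation state; the split into pieces is made up.
let midway: [String?] = [
    "48", " +", " ", nil, "4", " =", nil, "7", nil, " clips", nil, nil, nil,
]
print(renderCanvas(midway))
```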

How to run it

Open it in CoreAIChatMac (the zoo's macOS chat app) — Download Models… → LLaDA-8B (diffusion), or point the app at a folder containing the macos/ bundle. The host drives the denoising loop over a single static, bidirectional forward (main(input_ids[1,S]) → logits[1,S,vocab], no KV cache).
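
The host-side loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the CoreAIKit or d3LLM code: `Forward` stands in for the static `main` function (batch dimension dropped), and the "commit the `perStep` most confident masked positions" rule is a simplified placeholder for the real schedule.

```swift
import Foundation

// Sketch only: `Forward`, `bestToken`, `denoise`, and the top-k commit rule are
// illustrative assumptions, not the CoreAIKit or d3LLM implementation.

typealias Forward = ([Int]) -> [[Float]]   // input_ids[S] -> logits[S][vocab]

/// Arg-max token (never the mask id itself) and its softmax probability.
/// Assumes the vocabulary holds at least one non-mask token.
func bestToken(_ logits: [Float], excluding maskID: Int) -> (id: Int, prob: Float) {
    let m = logits.max() ?? 0
    let exps = logits.map { exp($0 - m) }
    let sum = exps.reduce(0, +)
    var best = -1
    for i in exps.indices where i != maskID {
        if best < 0 || exps[i] > exps[best] { best = i }
    }
    return (best, exps[best] / sum)
}

/// Fills every masked position, committing the `perStep` most confident
/// predictions after each full-canvas forward. Returns the final canvas and
/// the number of forwards used.
func denoise(canvas start: [Int], maskID: Int, perStep: Int,
             forward: Forward) -> (canvas: [Int], forwards: Int) {
    var canvas = start
    var forwards = 0
    while canvas.contains(maskID) {
        let logits = forward(canvas)        // one bidirectional pass, no KV cache
        forwards += 1
        // Score only the positions that are still masked.
        var candidates: [(pos: Int, id: Int, prob: Float)] = []
        for pos in canvas.indices where canvas[pos] == maskID {
            let b = bestToken(logits[pos], excluding: maskID)
            candidates.append((pos: pos, id: b.id, prob: b.prob))
        }
        candidates.sort { $0.prob > $1.prob }
        for c in candidates.prefix(max(1, perStep)) {
            canvas[c.pos] = c.id            // committed tokens stay fixed
        }
    }
    return (canvas, forwards)
}
```

Every iteration pays for a forward over all S positions, which is why the number of forwards, not the number of tokens, dominates latency.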

Bundle

macos/ — Core AI .aimodel (GPU) + tokenizer/ + metadata.json
metadata.json exposes the diffusion knobs (no recompile to retune):
- seq: canvas length (256 ⇒ answers up to ~210 tokens). With no KV cache, the whole prompt and answer live in S, and per-step cost is ~linear in S, so a larger canvas allows longer answers but makes each step slower.
- block_size 32, threshold 1.0: threshold is the entropy threshold, which trades quality for speed. A lower value is more gradual (more forwards); a higher value means fewer forwards (faster), but too high a value degrades quality.
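
One plausible reading of the entropy threshold is sketched below: after each forward, commit every masked position whose predictive entropy is at or below `threshold`, and always commit at least one so the loop makes progress. The runtime's exact rule, and whether entropy is measured in nats or bits, are not stated here, so treat `entropy` (in nats) and `positionsToCommit` as hypothetical helpers.

```swift
import Foundation

// Sketch only: assumes "commit every masked position whose entropy (in nats)
// is <= threshold, and at least one per forward". The actual rule used by the
// runtime is not spelled out in this card.

/// Shannon entropy, in nats, of a probability distribution.
func entropy(_ probs: [Float]) -> Float {
    var h: Float = 0
    for p in probs where p > 0 {
        h -= p * log(p)
    }
    return h
}

/// Returns the indices into `maskedProbs` that should be committed this step.
func positionsToCommit(maskedProbs: [[Float]], threshold: Float) -> [Int] {
    guard !maskedProbs.isEmpty else { return [] }
    let entropies = maskedProbs.map { entropy($0) }
    let confident = entropies.indices.filter { entropies[$0] <= threshold }
    if !confident.isEmpty { return confident }
    // Nothing is confident enough: commit the single lowest-entropy position.
    var best = 0
    for i in entropies.indices where entropies[i] < entropies[best] {
        best = i
    }
    return [best]
}
```

Under this reading, raising the threshold admits less certain positions per forward, which matches the card's description: fewer forwards, with quality suffering if the threshold is too permissive.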

Performance (M4 Max, GPU)

metric	value
throughput	≈40 tok/s (threshold 1.0, S=256)
TTFT	~0.3 s
forwards (NFE)	~22 for a full 210-token answer
size	4.9 GB (int4 + int8 head)

The per-step cost is a full bidirectional forward over the whole canvas (no KV cache) — the distillation keeps the step count low. A delayed-KV-cache decode (process only the active region) is the next speed lever.
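
As a back-of-envelope check, the rounded figures in the table above can be combined as follows. These are derived values, not separate measurements, and they assume the reported throughput covers the whole 210-token answer.

```swift
// Back-of-envelope only: derived from the rounded figures in the table,
// not from separate measurements.

let answerTokens = 210.0      // full answer on the S=256 canvas
let forwards = 22.0           // NFE for that answer
let tokensPerSecond = 40.0    // reported throughput

let tokensPerForward = answerTokens / forwards        // about 9.5
let totalSeconds = answerTokens / tokensPerSecond     // about 5.25
let secondsPerForward = totalSeconds / forwards       // about 0.24

print(tokensPerForward, totalSeconds, secondsPerForward)
```

Roughly a quarter of a second per forward is in line with the ~0.3 s TTFT, since the first tokens can only appear after at least one full forward.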

Roadmap

Longer context (larger canvas) build
Delayed-KV-cache decode (the d3LLM generate_multi_block_kv_cache lever) for higher throughput
iPhone (h18p) build

Credits & license

A community port. All credit to the upstream authors — LLaDA (GSAI-ML) and d3LLM (hao-ai-lab / NVIDIA). This bundle inherits and is bound by their licenses — please review them before use. Core AI conversion only; no retraining.

Part of the Core AI model zoo.
