LLaDA-8B dLLM: Core AI (on-device diffusion LLM)

The model zoo's first diffusion language model (dLLM) for Apple Core AI. Most on-device LLMs are autoregressive: they write one token at a time, left to right. This one is a masked diffusion model: it starts from a canvas of [MASK] tokens and fills them in parallel, committing the most-confident positions at each step until the answer resolves. It is a different decoding paradigm, running on-device on Apple Silicon.

  • Base: GSAI-ML/LLaDA-8B-Instruct (LLaMA-dense 8B, bidirectional, no causal mask)
  • Distillation: d3LLM/d3LLM_LLaDA (hao-ai-lab / NVIDIA), a pseudo-trajectory distillation that sharply cuts the number of denoising steps (≈8 tokens committed per forward pass here)
  • Quantization: int4 weight-only (per-block-32 body) + int8 head, ≈4.9 GB
  • Runtime: Apple Core AI (coreai-core), GPU
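
To make the "int4 weight-only, per-block-32" entry concrete, here is a minimal sketch of symmetric block-wise 4-bit quantization with one scale per block of 32 weights. It illustrates the general technique only: the helper names, the symmetric -8...7 code range, and the max-abs scale are assumptions, not the converter's actual scheme.

```swift
import Foundation

/// One quantized block: 4-bit codes (held one per Int8 here for clarity) plus a scale.
struct QuantBlock {
    let codes: [Int8]   // each value lies in -8...7
    let scale: Float
}

/// Symmetric per-block quantization (hypothetical helper, not a Core AI API).
func quantizeInt4(_ weights: [Float], blockSize: Int = 32) -> [QuantBlock] {
    stride(from: 0, to: weights.count, by: blockSize).map { (start: Int) -> QuantBlock in
        let block = weights[start..<min(start + blockSize, weights.count)]
        let maxAbs = block.map { abs($0) }.max() ?? 0
        // Map the largest magnitude in the block onto code 7; avoid dividing by zero.
        let scale: Float = maxAbs > 0 ? maxAbs / 7 : 1
        let codes = block.map { (w: Float) -> Int8 in
            Int8(max(-8, min(7, (w / scale).rounded())))
        }
        return QuantBlock(codes: codes, scale: scale)
    }
}

/// Reconstructs approximate weights: code × per-block scale.
func dequantize(_ blocks: [QuantBlock]) -> [Float] {
    blocks.flatMap { b in b.codes.map { Float($0) * b.scale } }
}
```

Because each block carries its own scale, one outlier only costs precision within its own 32 weights. If the scale were stored in 16 bits (an assumption), the cost would be (32 × 4 + 16) / 32 = 4.5 bits per weight.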

Use it

▶️ Run it (source) with the DiffuseChat runner (GUI + CLI, one app for every diffusion LM in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/DiffuseChat/DiffuseChat.xcodeproj
# → Run, then pick "LLaDA-8B (diffusion)" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/DiffuseChat
swift run diffuse-cli --model llada-8b --prompt "What is the capital of France?"

💻 Build with it. The snippet is complete; the glue is kit API, so copy-paste runs:

import CoreAIKit

let prompt = "What is the capital of France?"  // any user prompt
let dlm = try await KitDiffusionLM(catalog: "llada-8b")
let reply = try await dlm.reply(to: prompt)
// reply: the denoised answer. Pass onStep: to watch the canvas fill in per forward pass
// (still-masked positions shown as ░), in parallel, not left-to-right

The take-home is Examples/DiffuseChat/Sources/QuickStart.swift: this exact code as one typed function, no UI. The CLI is an argument shell over it, and the GUI renders the same live canvas. The canvas is fixed (S=256 ≈ 210 generated tokens) and, because there is no KV cache, the whole history must fit in it. reply(messages:) takes role/content turns and drops the oldest first when they do not fit. Pass onStep: nil if you only want the final text.
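
The drop-oldest-first behavior can be pictured with a small sketch. This shows the idea only, not the kit's implementation: `Turn`, `trimToFit`, the whitespace word count standing in for a tokenizer, and the `budget` parameter are all hypothetical.

```swift
/// A chat turn (hypothetical type; the kit's own message type may differ).
struct Turn {
    let role: String
    let content: String
}

/// Crude stand-in for a tokenizer: counts whitespace-separated words.
func approxTokenCount(_ text: String) -> Int {
    text.split(separator: " ").count
}

/// Drops the oldest turns until the remaining history fits `budget` tokens.
/// The newest turn is always kept, even if it alone exceeds the budget.
func trimToFit(_ turns: [Turn], budget: Int) -> [Turn] {
    var kept = turns
    var total = kept.reduce(0) { $0 + approxTokenCount($1.content) }
    while total > budget && kept.count > 1 {
        total -= approxTokenCount(kept.removeFirst().content)
    }
    return kept
}
```

With a fixed canvas, every token of history comes out of the room left for the answer, so long conversations trade earlier context for answer length.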

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit β†’ product CoreAIKit
  • Info.plist: none needed
  • Entitlements: none needed
  • First run downloads the model (5.3 GB on Mac); after that it loads from the local cache (Application Support; progress via the downloadProgress callback)
  • Measure in Release: Debug is ~3× slower on per-token host work

What it looks like

The answer doesn't stream left-to-right; the canvas denoises. Mid-generation you see tokens pop in out of order, e.g.:

48 + ░4 =░7░ clips░░░   →   48 + 24 = 72 clips altogether.

That parallel, out-of-order fill is the signature of a masked-diffusion LM.
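
A display like the one above can be produced by printing committed tokens as text and still-masked positions as ░. A minimal sketch, assuming the canvas is held as an array of optional token strings (nil meaning masked); `renderCanvas` is a hypothetical helper, not part of CoreAIKit.

```swift
/// Renders a partially denoised canvas: committed tokens as-is, masked slots as ░.
func renderCanvas(_ canvas: [String?], mask: String = "░") -> String {
    canvas.map { $0 ?? mask }.joined()
}

// A made-up mid-generation state matching the example line above.
let midway: [String?] = ["48", " +", " ", nil, "4", " =", nil, "7", nil, " clips", nil, nil, nil]
print(renderCanvas(midway))
// prints "48 + ░4 =░7░ clips░░░"
```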

How to run it

Open it in CoreAIChatMac (the zoo's macOS chat app): Download Models… → LLaDA-8B (diffusion), or point the app at a folder containing the macos/ bundle. The host drives the denoising loop over a single static, bidirectional forward (main(input_ids[1,S]) → logits[1,S,vocab], no KV cache).
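
The shape of that host-side loop can be sketched as below. It is a toy under stated assumptions: `forward` stands in for the model's single static call and is assumed to return a best token and a confidence per position, `maskID` and the fixed `commitsPerStep` are invented for illustration, and the shipped runner commits by threshold rather than by a fixed count.

```swift
/// Per-position prediction from one full-canvas forward pass (hypothetical shape).
struct Prediction {
    let token: Int
    let confidence: Float
}

/// Repeatedly runs `forward` on the whole canvas and commits the most confident
/// masked positions until none remain. Returns the canvas and the number of forwards.
/// Assumes `forward` never predicts `maskID` itself; otherwise the loop would not end.
func denoise(
    canvas initial: [Int],
    maskID: Int,
    commitsPerStep: Int,
    forward: ([Int]) -> [Prediction]
) -> (canvas: [Int], forwards: Int) {
    var canvas = initial
    var forwards = 0
    while canvas.contains(maskID) {
        let preds = forward(canvas)          // no KV cache: every step sees all S positions
        forwards += 1
        let masked = canvas.indices.filter { canvas[$0] == maskID }
        let best = masked
            .sorted { preds[$0].confidence > preds[$1].confidence }
            .prefix(max(1, commitsPerStep))  // always commit at least one position
        for i in best { canvas[i] = preds[i].token }
    }
    return (canvas, forwards)
}
```

Committed positions stay in the canvas as ordinary input tokens, so each later forward conditions on them in both directions.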

Bundle

  • macos/: Core AI .aimodel (GPU) + tokenizer/ + metadata.json
  • metadata.json exposes the diffusion knobs (no recompile to retune):
    • seq: canvas length (256 ⇒ answers up to ~210 tokens). With no KV cache, the whole prompt+answer lives in S, and per-step cost is ~linear in S (a larger canvas allows fuller answers but makes each step slower).
    • block_size 32, threshold 1.0: the entropy threshold trades steps for speed. Lower = more gradual; higher = fewer forwards (faster); too high degrades quality.
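
One plausible reading of the threshold knob, sketched under assumptions: compute the Shannon entropy of each masked position's predicted distribution and commit every position whose entropy falls below `threshold`, falling back to the single lowest-entropy position so that each forward makes progress. The function names, the natural-log entropy, and the fallback rule are illustrative and not taken from the runner.

```swift
import Foundation

/// Shannon entropy (in nats) of a probability distribution.
func entropy(_ probs: [Float]) -> Float {
    var h: Float = 0
    for p in probs where p > 0 { h -= p * log(p) }
    return h
}

/// Chooses which masked positions to commit this step.
/// `distributions[i]` is the predicted distribution at masked position `positions[i]`.
func positionsToCommit(positions: [Int], distributions: [[Float]], threshold: Float) -> [Int] {
    let scored = Array(zip(positions, distributions.map(entropy)))
    let confident = scored.filter { $0.1 < threshold }.map { $0.0 }
    if !confident.isEmpty { return confident }
    // Fallback: commit the single most certain position so the loop always advances.
    return scored.min(by: { $0.1 < $1.1 }).map { [$0.0] } ?? []
}
```

Under this rule a higher threshold admits more positions per step, which matches the "higher = fewer forwards" behavior described above, and it also admits less certain guesses, which is where quality would start to degrade.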

Performance (M4 Max, GPU)

| metric | value |
| --- | --- |
| throughput | ≈40 tok/s (threshold 1.0, S=256) |
| TTFT | ~0.3 s |
| forwards (NFE) | ~22 for a full 210-token answer |
| size | 4.9 GB (int4 + int8 head) |

The per-step cost is a full bidirectional forward over the whole canvas (no KV cache); the distillation keeps the step count low. A delayed-KV-cache decode (process only the active region) is the next speed lever.
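
As a back-of-envelope check, the quoted figures can be combined as follows. The derived values are simple arithmetic on the numbers in this section, not separate measurements.

```swift
let answerTokens = 210.0      // full answer on the S=256 canvas
let forwards = 22.0           // NFE quoted for that answer
let throughput = 40.0         // tok/s quoted at threshold 1.0

let tokensPerForward = answerTokens / forwards    // about 9.5 tokens committed per forward
let totalSeconds = answerTokens / throughput      // 5.25 s for the full answer
let secondsPerForward = totalSeconds / forwards   // about 0.24 s per full-canvas forward
print(tokensPerForward, totalSeconds, secondsPerForward)
```

Roughly a quarter of a second per forward is why the step count, rather than the token count, dominates latency here.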

Roadmap

  • Longer context (larger canvas) build
  • Delayed-KV-cache decode (the d3LLM generate_multi_block_kv_cache lever) for higher throughput
  • iPhone (h18p) build

Credits & license

A community port. All credit to the upstream authors: LLaDA (GSAI-ML) and d3LLM (hao-ai-lab / NVIDIA). This bundle inherits and is bound by their licenses; please review them before use. Core AI conversion only; no retraining.

Part of the Core AI model zoo.
