Fine-tune a minimal LLM with an RTX 2050 GPU

Hello,

I have a JSONL file with more than 1,500 lines (each entry has "input" as an integer matrix, "output" as an integer matrix, and "logic" as a DSL string that produces "output" from "input"), and I want to fine-tune a minimal LLM to see if it can generalize. I have an RTX 2050 GPU on my laptop and I am trying to fine-tune TinyLlama/TinyLlama-1.1B-Chat-v1.0, but it does not seem to work (grad_norm is nan, and in the "Performance" section of the Windows Task Manager, GPU usage remains at 0):

{'loss': 3.184, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 0.01}

Here is the code I tried:

from datasets import load_dataset
from transformers import LlamaForCausalLM, LlamaTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
import torch

dataset = load_dataset("json", data_files="data/logics.jsonl")

model_name = "huggyllama/llama-7b"
model_name = "microsoft/phi-2"
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def format_example(ex):
    return f"""User: Given input {ex['input']} and output {ex['output']}, produce the DSL program.
Answer: {ex['logic']}"""

def tokenize(ex):
    text = format_example(ex)
    tokens = tokenizer(text, truncation=True, padding="max_length", max_length=1024)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized = dataset.map(tokenize, batched=False)

model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_steps=50,
    max_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    output_dir="./llm-dsl"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
)

trainer.train()

model.save_pretrained("./llm-dsl")
tokenizer.save_pretrained("./llm-dsl")

Do I have good enough hardware for this? Is it a good model to start with?


I think it’s possible with 4 GB of VRAM, but plain LoRA might not have enough headroom. Let’s use QLoRA.


Short answer:

  • Your RTX 2050 (4 GB) is just enough to fine-tune a ~1B model like TinyLlama if you use QLoRA (4-bit) + LoRA and very small batch/sequence lengths.

  • TinyLlama-1.1B-Chat is a reasonable starting model, but there are also slightly smaller, newer options (e.g. Qwen2.5-0.5B) that are even friendlier to a 4 GB GPU.

  • Your current run is almost certainly:

    • falling back to CPU (hence 0% GPU usage), and
    • hitting numerical instability (hence grad_norm = nan) due to the way fp16 + 4-bit + LoRA are configured.

Below is a detailed explanation and a working recipe.


1. Is your RTX 2050 “good enough”?

The laptop RTX 2050 is an Ampere GPU with:

  • 2048 CUDA cores
  • 4 GB GDDR6
  • 64-bit memory bus, 30–45 W power range

For fine-tuning LLMs this means:

  • Not enough for full-precision fine-tuning (fp16/fp32) of 7B-class models (LLaMA-2 7B, Phi-2, etc.).

  • Enough for parameter-efficient fine-tuning (PEFT) of ~1B models using:

    • 4-bit quantization (QLoRA) for the frozen base weights, and
    • small LoRA adapters in 16-bit for the trainable part.

Hugging Face’s QLoRA/quantization docs explicitly show that 4-bit weights + LoRA can reduce memory usage by ~4× compared to full 16-bit fine-tuning and allow fine-tuning very large models on a single GPU.

Applied to your case:

  • A 1.1B model in 4-bit needs roughly:

    • 1.1B parameters × 4 bits ≈ 0.55 GB for the weights,
    • plus activations, LoRA weights, and optimizer states on top.

With careful settings (4-bit, batch_size=1, short max_length, gradient accumulation), this fits inside 4 GB. Many people fine-tune 1B–2B models this way on 4–8 GB GPUs using QLoRA.
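
A rough back-of-the-envelope sketch of that estimate (the parameter counts below are assumptions, not measurements):

params = 1.1e9                      # TinyLlama-1.1B
weight_bytes = params * 0.5         # 4-bit weights: 0.5 byte per parameter
lora_params = 9e6                   # rough order of magnitude for r=8 adapters on attention
lora_bytes = lora_params * 2        # LoRA adapters kept in 16-bit
optimizer_bytes = lora_params * 8   # AdamW keeps two 32-bit states per trainable parameter

print(f"base weights ~{weight_bytes / 1e9:.2f} GB")   # ~0.55 GB
print(f"LoRA weights ~{lora_bytes / 1e6:.0f} MB")     # tens of MB
print(f"optimizer    ~{optimizer_bytes / 1e6:.0f} MB")
# Activations scale with batch size x sequence length and take most of the remaining VRAM.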

So:

Yes, your hardware is adequate if you use QLoRA/LoRA correctly and keep the configuration conservative.


2. Is TinyLlama-1.1B-Chat a good model to start with?

TinyLlama-1.1B-Chat-v1.0:

  • 1.1B LLaMA-style model
  • Pretrained on ~3T tokens, then instruction-tuned for chat

It is explicitly designed as a compact, efficient model; there are even mobile/embedded variants (GGUF, LiteRT) targeting low-resource devices.

For your task:

  • Inputs: integer matrices
  • Outputs: integer matrices
  • Target: DSL program mapping input → output

You need a model that:

  • Handles structured text / code reasonably well.
  • Is small enough to fine-tune on 4 GB in 4-bit.

TinyLlama-1.1B-Chat fits this nicely.

That said, for maximum comfort on 4 GB you may also consider:

  • Qwen/Qwen2.5-0.5B-Instruct (or the Coder-0.5B variant): newer pretraining, ~0.5B params, explicitly designed for low-resource devices, strong coding/math abilities.

But if you want to stay with TinyLlama, it is a perfectly reasonable starting point.


3. Why your current run shows grad_norm = nan and GPU usage = 0%

3.1 0% GPU usage: likely CPU fallback

You load the model as:

model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

On Windows, 4-bit quantization with bitsandbytes only uses the GPU if:

  • PyTorch is installed with CUDA, and
  • bitsandbytes has working CUDA kernels for your setup.

If that fails, bitsandbytes can silently fall back to CPU. Then:

  • Training appears to “run”,
  • GPU utilization in Task Manager stays near 0%, and
  • All heavy work is on CPU.

You should explicitly check where the model landed:

import torch
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())

print("Model first param device:", next(model.parameters()).device)

If that prints cpu, you are not training on the RTX 2050 at all. Fixing that (PyTorch + CUDA + bitsandbytes versions) is mandatory.

The HF bitsandbytes/quantization docs emphasize that BitsAndBytesConfig is the recommended way to enable 4-bit quantization with Transformers, not the legacy load_in_4bit=True flag, and they show example code that correctly places layers on GPU.

3.2 learning_rate: 0.0 is actually normal

You use:

warmup_steps=50,
max_steps=100,
learning_rate=2e-4,

With the default linear scheduler, LR starts at 0 and increases linearly during warmup. So seeing learning_rate: 0.0 at the first few steps is expected; it is not itself a bug.
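
You can reproduce this behaviour in isolation. This small sketch (a dummy parameter and optimizer, not your training code) shows the LR starting at 0.0 and ramping up during warmup:

import torch
from transformers import get_linear_schedule_with_warmup

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=2e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=50, num_training_steps=100
)

for step in range(3):
    print(step, scheduler.get_last_lr())  # [0.0], then small increasing values
    optimizer.step()
    scheduler.step()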

3.3 grad_norm: nan: numerical instability

grad_norm is computed as the L2 norm over all parameter gradients. If any gradient becomes NaN, the norm becomes NaN.
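
For intuition, here is a tiny sketch of why a single bad gradient poisons the logged value:

import torch

# Pretend these are per-parameter gradient tensors; one of them contains a NaN.
grads = [torch.tensor([1.0, 2.0]), torch.tensor([float("nan"), 3.0])]

total_norm = torch.norm(torch.stack([torch.norm(g) for g in grads]))
print(total_norm)  # tensor(nan): one NaN gradient makes the whole grad_norm NaN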

In your setup, the main suspects are:

  1. Unstable fp16 + 4-bit combination

    • You set fp16=True while also using 4-bit quantization and LoRA on a small GPU.
    • QLoRA normally uses a careful combination: 4-bit NF4 weights + 16-bit compute dtype (fp16 or bf16) plus LoRA adapters, chosen to keep numerics stable.
    • If bitsandbytes or your model is on CPU, the mixed-precision path can behave unexpectedly.
  2. High learning rate for LoRA on a chat model

    • 2e-4 is quite aggressive for LoRA/QLoRA fine-tuning on instruction models; many official examples use 1e-5 to 5e-5.
  3. No gradient clipping (max_grad_norm)

    • Large spikes in gradients (especially early in training) can overflow fp16 and produce NaNs.
  4. 4-bit misconfiguration

    • The HF 4-bit docs recommend specifying:

      • bnb_4bit_quant_type="nf4"
      • bnb_4bit_compute_dtype=torch.bfloat16 or torch.float16
      • bnb_4bit_use_double_quant=True
    • Doing this via BitsAndBytesConfig is the tested path; direct load_in_4bit=True is older and more likely to hit edge cases.

Given that your logged loss is finite (loss: 3.184) but grad_norm is NaN immediately, the combination of fp16 + 4-bit + high LR + possibly CPU fallback is the most plausible root cause.


4. A safer TinyLlama QLoRA + LoRA recipe for your RTX 2050

Below is a revised training script tailored to:

  • TinyLlama-1.1B-Chat-v1.0
  • 4 GB RTX 2050
  • Your (input, output, logic) JSONL dataset (~1500 samples)

Key changes vs your code:

  • Use AutoModelForCausalLM + BitsAndBytesConfig (modern HF API).
  • Lower LR, add gradient clipping.
  • Smaller max_length (512 instead of 1024).
  • Batch size 1 + gradient accumulation for low VRAM.
  • Apply LoRA to all attention + MLP projections (standard QLoRA practice).

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 1. Dataset --------------------------------------------------------------
dataset = load_dataset("json", data_files={"train": "data/logics.jsonl"})["train"]

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

def format_example(ex):
    # You can refine this prompt later; keep it consistent.
    return (
        f"User: Given input {ex['input']} and output {ex['output']}, "
        f"produce the DSL program.\n"
        f"Assistant: {ex['logic']}"
    )

def tokenize_batch(batch):
    # batch is a dict of lists when batched=True, so rebuild per-example dicts
    texts = [
        format_example({"input": i, "output": o, "logic": l})
        for i, o, l in zip(batch["input"], batch["output"], batch["logic"])
    ]
    enc = tokenizer(
        texts,
        truncation=True,
        max_length=512,      # shorter seq length for 4 GB VRAM
        padding="max_length",
    )
    # Simple causal LM: predict all tokens
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = dataset.map(
    tokenize_batch,
    batched=True,
    remove_columns=dataset.column_names,
)

# 2. 4-bit quantization config (QLoRA style) ------------------------------
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",   # should pick cuda:0 on your RTX 2050
)

model = prepare_model_for_kbit_training(model)

# 3. LoRA config ----------------------------------------------------------
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 4. Training args --------------------------------------------------------
use_bf16 = torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="./llm-dsl",
    per_device_train_batch_size=1,     # very important for 4 GB
    gradient_accumulation_steps=16,    # effective batch ~= 16
    num_train_epochs=3,
    learning_rate=5e-5,                # safer than 2e-4
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=5,
    save_strategy="epoch",
    max_grad_norm=1.0,                 # gradient clipping
    fp16=not use_bf16,
    bf16=use_bf16,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
)

trainer.train()

model.save_pretrained("./llm-dsl")
tokenizer.save_pretrained("./llm-dsl")

If this still runs out of memory or is unstable:

  • Reduce max_length to 256.
  • Reduce LoRA rank to 4.
  • Keep batch size at 1 and rely on gradient accumulation.

This pattern matches what Hugging Face and others recommend for QLoRA fine-tuning with bitsandbytes and PEFT.


5. Step-by-step checklist for your setup

  1. Verify GPU is actually used

    After loading the model:

    print(next(model.parameters()).device)
    

    You want cuda:0. If you see cpu, fix PyTorch + CUDA + bitsandbytes installation.

  2. Confirm VRAM usage

    • On Windows, watch Task Manager → Performance → GPU → Dedicated GPU memory.
    • During training, you should see VRAM consumption around 2–3 GB for TinyLlama-1.1B in 4-bit with LoRA.
  3. Watch for NaNs

    • Add max_grad_norm=1.0 (already in the suggested script).
    • Start with LR ≈ 5e-5; if NaNs persist, drop to 1e-5.
    • If necessary, temporarily disable fp16/bf16 (fp16=False, bf16=False) to check whether the issue is purely precision-related (see the callback sketch after this checklist).
  4. Use smaller sequences and small batches

    • max_length=512 or even 256 is more realistic than 1024 on 4 GB.
    • per_device_train_batch_size=1 is recommended; use gradient_accumulation_steps to simulate a larger batch.
  5. Optionally use Unsloth for extra headroom

    The Unsloth library is designed to make QLoRA fine-tuning faster and use ~70% less VRAM than vanilla Transformers/PEFT, specifically for consumer GPUs.

    You can:

    • Start with the plain HF script above for learning.
    • Later, switch to an Unsloth notebook (e.g. their 1B QLoRA examples) and plug in TinyLlama or Qwen2.5; that often makes 4 GB runs much smoother.
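
For checklist items 2 and 3, a hypothetical TrainerCallback (a sketch, not part of the scripts above) can report peak VRAM at each log step and stop training as soon as a non-finite grad_norm or loss appears:

import math
import torch
from transformers import TrainerCallback

class NanWatchCallback(TrainerCallback):
    """Report peak VRAM and stop training when grad_norm/loss turn non-finite."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        logs = logs or {}
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak VRAM ~{peak_gb:.2f} GB")
        for key in ("grad_norm", "loss"):
            value = logs.get(key)
            if value is not None and not math.isfinite(value):
                print(f"Non-finite {key} at step {state.global_step}; stopping.")
                control.should_training_stop = True
        return control

# Usage: trainer = Trainer(..., callbacks=[NanWatchCallback()])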

6. Summary

  • Hardware:

    • RTX 2050 (4 GB) can fine-tune a ~1B model using 4-bit QLoRA + LoRA, small batches, and short sequence length. It cannot practically do full fine-tuning of 7B models.
  • Model:

    • TinyLlama-1.1B-Chat is a reasonable starting point; newer, even smaller models like Qwen2.5-0.5B-Instruct are also very suitable for your GPU.
  • Your current issues:

    • 0% GPU usage → very likely CPU fallback because bitsandbytes or CUDA is misconfigured.
    • grad_norm = nan → unstable combo of fp16 + 4-bit + high LR and missing gradient clipping; switch to the BitsAndBytesConfig-based QLoRA recipe with lower LR.
  • What to do:

    • Verify the model is really on cuda:0.
    • Use BitsAndBytesConfig with NF4 and prepare_model_for_kbit_training.
    • Use LoRA on attention + MLP, conservative LR, max_grad_norm, and max_length=256–512.

If you adopt the revised script and your environment is correct, you should see:

  • GPU memory usage rising during training,
  • grad_norm remaining finite,
  • and loss decreasing over steps/epochs.


Thank you for this very thorough feedback.
The fine-tuning works.
Here is an excerpt from the "all_logics.jsonl" file used for training:
{"task": [{"input": [[4, 4, 4, 4, 4, 4], [6, 4, 8, 6, 4, 8], [6, 6, 8, 6, 6, 8]], "output": [[4, 4, 4], [6, 4, 8], [6, 6, 8]], "logic": "put(mulmat(mul2(input, 0), rot90(input)), input, (0, 0))"}]}
Here is the Python script to fine-tune:

#pip install transformers>=4.40.0
#pip install peft accelerate bitsandbytes datasets

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
#MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"
#MODEL_NAME = "Qwen/Qwen2.5-72B-Instruct"
DATA_PATH = "data/all_logics.jsonl"

# ---------------------------------------------------------------------------
# Load tokenizer
# ---------------------------------------------------------------------------

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# ---------------------------------------------------------------------------
# Load dataset
# ---------------------------------------------------------------------------

dataset = load_dataset("json", data_files=DATA_PATH, split="train")

# Format messages -> plain training string
def format_example(example):
    conv = ""
    for e in example["task"]:
        conv += "Input:\n" + str(e["input"]).replace(" ", "") + "\n"
        conv += "Logic:\n" + str(e["logic"]).replace(" ", "") + "\n"
        conv += "Output:\n" + str(e["output"]).replace(" ", "") + "\n"
    return tokenizer(conv, truncation=True, padding=False, max_length=512)

dataset = dataset.map(format_example, remove_columns=["task"])

# ---------------------------------------------------------------------------
# Load model in 4-bit to fit GPU
# ---------------------------------------------------------------------------

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    load_in_4bit=True,
    dtype=torch.float16,
    low_cpu_mem_usage=True
)

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# ---------------------------------------------------------------------------
# LoRA configuration
# ---------------------------------------------------------------------------

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ---------------------------------------------------------------------------
# Training arguments
# ---------------------------------------------------------------------------

training_args = TrainingArguments(
    output_dir="./qwen25_dsl_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=1,
    save_strategy="epoch",
    fp16=True,
    bf16=False,
    optim="paged_adamw_32bit",
    warmup_ratio=0.1,
)

# ---------------------------------------------------------------------------
# Trainer
# ---------------------------------------------------------------------------

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    data_collator=data_collator,
)

# ---------------------------------------------------------------------------
# Train
# ---------------------------------------------------------------------------

trainer.train()

# Save LoRA adapter
model.save_pretrained("./qwen25_dsl_lora")
tokenizer.save_pretrained("./qwen25_dsl_lora")
print("Fine-tuning completed.")

Here is the Python script for inference:

import copy
import itertools
import json
import math
import numpy as np
import torch
import typing

def zero(shape: typing.Union[int, tuple]) -> np.ndarray:
    return np.zeros(shape)

def tile(a: np.ndarray, reps: typing.Union[np.ndarray, tuple, list, int]) -> np.ndarray:
    return np.tile(a, reps)

def fliplr(a: np.ndarray) -> np.ndarray:
    return np.fliplr(a)

def flipud(a: np.ndarray) -> np.ndarray:
    return np.flipud(a)

def place_region(a: np.ndarray, at: tuple, region: list, x: typing.Union[int, float]) -> np.ndarray:
    b = copy.deepcopy(a)

    for t in region:
        try:
            b[at[0] + t[0] - region[0][0], at[1] + t[1] - region[0][1]] = x
        except:
            pass

    return b

def hline(dst: np.ndarray, at: int, value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[at, :] = value

    return m

def vline(dst: np.ndarray, at: int, value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[:, at] = value

    return m

def hlineleft(dst: np.ndarray, at: tuple, value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[at[0], :at[1]] = value

    return m

def hlineright(dst: np.ndarray, at: tuple, value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[at[0], at[1]:] = value

    return m

def vlineup(dst: np.ndarray, at: tuple, value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[:at[0], at[1]] = value

    return m

def vlinedown(dst: np.ndarray, at: tuple, value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[at[0]:, at[1]] = value

    return m

def fill_region_at(a: np.ndarray, at: tuple, x: typing.Union[int, float]) -> np.ndarray:
    b = copy.deepcopy(a)

    s = set()
    stack = set()
    stack.add(at)
    v = a[at]

    while (len(stack)):
        loc = stack.pop()

        if (not loc in s):
            s.add(loc)

            try:
                if (b[loc] == v):
                    b[loc] = x

                    for n in neighbors(loc):
                        stack.add(n)
            except:
                pass

    return b

def fill_region2_at(a: np.ndarray, at: tuple, x: typing.Union[int, float]) -> np.ndarray:
    b = copy.deepcopy(a)

    s = set()
    stack = set()
    stack.add(at)
    v = a[at]

    while (len(stack)):
        loc = stack.pop()

        if (not loc in s):
            s.add(loc)

            try:
                if (b[loc] == v):
                    b[loc] = x

                    for n in neighbors(loc, diagonals = False):
                        stack.add(n)
            except:
                pass

    return b

def replace(a: np.ndarray, x: typing.Union[int, float], y: typing.Union[int, float]) -> np.ndarray:
    b = copy.deepcopy(a)
    b[b == x] = y

    return b

def put(dst: np.ndarray, src: np.ndarray, at: typing.Union[np.ndarray, tuple, list]) -> np.ndarray:
    begin = np.array(at)
    end = np.array(at) + np.array(src.shape)
    slices = tuple(slice(b, e) for b, e in zip(begin, end))
    m = copy.deepcopy(dst)
    m[slices] = src

    return m

def put_value(dst: np.ndarray, at: typing.Union[np.ndarray, tuple, list], value: typing.Union[int, float]) -> np.ndarray:
    m = copy.deepcopy(dst)
    m[at] = value

    return m

def fill_region(a: np.ndarray, region: list, x: typing.Union[int, float]) -> np.ndarray:
    b = copy.deepcopy(a)

    for t in region:
        try:
            b[t] = x
        except:
            pass

    return b

def or_(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.bitwise_or(x, y)

def and_(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.bitwise_and(x, y)

def xor(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.bitwise_xor(x, y)

def invert(x: np.ndarray) -> np.ndarray:
    return np.invert(x)

def dotsegment(dst: np.ndarray, begin: tuple, end: tuple, value: typing.Union[int, float], dot_step: int) -> np.ndarray:
    m = copy.deepcopy(dst)

    u = np.array(end) - np.array(begin)
    v = u / np.linalg.norm(u)

    if (min(v)):
        v *= min(v)

    step = np.linalg.norm(v) / np.linalg.norm(u)
    i = 0

    for t in np.arange(0, 1 + 0.1 * step, step):
        if (i % dot_step == 0):
            m[round(begin[0] + t * u[0]), round(begin[1] + t * u[1])] = value

        i += 1

    return m

def rot90(a: np.ndarray) -> np.ndarray:
    return np.rot90(a)

def neighbors(idx, exclude_self = True, diagonals = True):
    dims = len(idx)
    deltas = [-1, 0, 1]

    for delta in itertools.product(deltas, repeat=dims):
        if (exclude_self and all(d == 0 for d in delta)):
            continue

        if (not diagonals and sum(d != 0 for d in delta) != 1):
            continue

        t = tuple(i + d for i, d in zip(idx, delta))

        if (any([x < 0 for x in t])):
            continue

        yield t

def region(a: np.ndarray, at: tuple) -> list:
    s = set()
    stack = set()
    stack.add(at)
    v = a[at]
    indices = []

    while (len(stack)):
        loc = stack.pop()

        if (not loc in s):
            s.add(loc)

            try:
                if (a[loc] == v):
                    indices.append(loc)

                    for n in neighbors(loc, diagonals = False):
                        stack.add(n)
            except:
                pass

    return indices

def matrix_region(a: np.ndarray, region: list) -> np.ndarray:
    iMin = math.inf
    iMax = 0
    jMin = math.inf
    jMax = 0

    for t in region:
        iMin = min(iMin, t[0])
        iMax = max(iMax, t[0])
        jMin = min(jMin, t[1])
        jMax = max(jMax, t[1])

    return a[iMin:iMax, jMin:jMax]

def add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a + b

def sub(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a - b

def add2(a: np.ndarray, b: typing.Union[int, float]) -> np.ndarray:
    return a + b

def sub2(a: np.ndarray, b: typing.Union[int, float]) -> np.ndarray:
    return a - b

def mulmat(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return x @ y

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE = "Qwen/Qwen2.5-0.5B-Instruct"
FINETUNED = "./qwen25_dsl_lora"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, device_map="auto", load_in_4bit=True
)
model = PeftModel.from_pretrained(model, FINETUNED)
model.eval()

folder = "../../ARC-AGI-2-main/data/training"

import json
from os import listdir
from os.path import isfile, join

tasks = [f for f in listdir(folder) if isfile(join(folder, f))]
tasks = ["00d62c1b.json"]
tasks = ["0d3d703e.json"]

successCounter = 0
counter = 0

for task in tasks:
    f = open(folder + "/" + task)

    data = json.loads(f.readline())

    f.close()

    arcPrompt = "You are an ARC DSL generator. Produce only DSL code instead of question marks.\n"

    train = data["train"]

    for n in range(0, len(train)):
        arcPrompt += "Input:\n" + str(train[n]["input"]).replace(" ", "") + "\n"
        arcPrompt += "Logic:\n?\n"
        arcPrompt += "Output:\n" + str(train[n]["output"]).replace(" ", "") + "\n"

    test = data["test"]

    for n in range(0, len(test)):
        counter += 1

        testPrompt = arcPrompt + "Input:\n" + str(test[n]["input"]).replace(" ", "") + "\n"
        testPrompt += "Logic:\n?\n"

        print(testPrompt)

        tokens = tokenizer(testPrompt, return_tensors="pt").to(model.device)

        out = model.generate(
            **tokens,
            max_new_tokens=128,
            temperature=0.1,
            do_sample=False
        )

        logic = tokenizer.decode(out[0], skip_special_tokens=False)

        print(logic)

        try:
            evalOutput = eval(logic)

            input = np.array(test[n]["input"])
            output = np.array(test[n]["output"])

            #print(input)
            #print(output)
            #print(evalOutput)
            #assert(np.all(np.isclose(output, evalOutput)))

            if (np.all(np.isclose(output, evalOutput))):
                successCounter += 1
        except:
            pass

print(str(successCounter) + "/" + str(counter))


For an ARC-AGI task, let’s say I have 4 input/output/logic triples. Currently, the training prompt looks like this: input1, logic1, output1, input2, logic2, output2, input3, logic3, output3, input4, logic4, output4. The inference prompt is: input1, logic1 (to be produced by the LLM), output1, input2, logic2 (to be produced by the LLM), output2, input3, logic3 (to be produced by the LLM), output3, input4, logic4 (to be produced by the LLM), output4, input (test), logic (to be produced by the LLM for the test input).
I tested with the following inference prompt:
You are an ARC DSL generator. Produce only DSL code instead of question marks.
Input:
[[5,8,6],[5,8,6],[5,8,6]]
Logic:
?
Output:
[[1,9,2],[1,9,2],[1,9,2]]
Input:
[[2,3,8],[2,3,8],[2,3,8]]
Logic:
?
Output:
[[6,4,9],[6,4,9],[6,4,9]]
Input:
[[9,4,2],[9,4,2],[9,4,2]]
Logic:
?
Output:
[[8,3,6],[8,3,6],[8,3,6]]
Input:
[[3,1,2],[3,1,2],[3,1,2]]
Logic:
?
Output:
[[4,5,6],[4,5,6],[4,5,6]]
Input:
[[8,1,3],[8,1,3],[8,1,3]]
Logic:
?
And the fine-tuned Qwen model completes it with:
Output:
[[7,8,9],[7,8,9],[7,8,9]]<|endoftext|>
This is not the correct output (the expected output is [[9,5,4],[9,5,4],[9,5,4]]), and it did not produce a DSL expression in place of the ?. Is it possible to resolve this issue? Is it a prompt problem?


Even if the output seems off, the fact that it differs from the original model suggests that fine-tuning itself did happen. In that case, the cause is often that the fine-tuning proceeded in a way different from what was intended; mismatches with expected templates or phrasing are relatively common.


You can absolutely fix this. The fine-tuned model is behaving consistently with how you trained and prompted it:

  • During training you taught it:
    “Input → Logic (DSL) → Output (grid)”, everything as one long string.
  • During inference you ask:
    “Input → Logic: ? → Output … Input … Logic: ? …”, and then you decode the entire prompt + completion and eval() it.

So the model is (a) more likely to continue with output grids than with DSL, and (b) you are not cleanly extracting “just the DSL” from its completion.

Below is the “why” and a concrete way to restructure both training and inference so Qwen reliably returns DSL code, not output grids.


1. What your current setup is really teaching the model

1.1 Training format: you supervise both DSL and outputs

Your fine-tuning example:

{
  "task": [{
    "input": [[4,4,4,4,4,4],[6,4,8,6,4,8],[6,6,8,6,6,8]],
    "output": [[4,4,4],[6,4,8],[6,6,8]],
    "logic": "put(mulmat(mul2(input, 0), rot90(input)), input, (0, 0))"
  }]
}

Formatted as:

def format_example(example):
    conv = ""
    for e in example["task"]:
        conv += "Input:\n" + str(e["input"]).replace(" ", "") + "\n"
        conv += "Logic:\n" + str(e["logic"]).replace(" ", "") + "\n"
        conv += "Output:\n" + str(e["output"]).replace(" ", "") + "\n"
    return tokenizer(conv, truncation=True, padding=False, max_length=512)

So the sequence passed to the model looks like:

Input:
[[4,4,4,4,4,4],
 [6,4,8,6,4,8],
 [6,6,8,6,6,8]]
Logic:
put(mulmat(mul2(input,0),rot90(input)),input,(0,0))
Output:
[[4,4,4],
 [6,4,8],
 [6,6,8]]

Then you use DataCollatorForLanguageModeling(mlm=False), which for causal LM means:

  • labels = shifted input_ids → every token is a training target.

So the model is trained to:

  1. Predict the DSL after Logic:.
  2. Predict the output grid after Output:.

It never sees the pattern Logic:\n?\nOutput: during training.
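
You can verify this directly by running the collator on one formatted string; here is a small sketch using the same tokenizer and collator as your script:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tok.pad_token = tok.eos_token
collator = DataCollatorForLanguageModeling(tok, mlm=False)

example = tok("Logic:\nput(...)\nOutput:\n[[4,4,4]]")
batch = collator([example])

# labels are a copy of input_ids (pad positions would become -100),
# so the Output: grid tokens are supervised exactly like the DSL tokens.
print(batch["input_ids"][0])
print(batch["labels"][0])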

1.2 Inference format: you cue it to produce outputs, not DSL

For ARC evaluation you build a prompt like:

You are an ARC DSL generator. Produce only DSL code instead of question marks.
Input:
[[...train1 input...]]
Logic:
?
Output:
[[...train1 output...]]
Input:
[[...train2 input...]]
Logic:
?
Output:
[[...train2 output...]]
...
Input:
[[...test input...]]
Logic:
?

Then:

tokens = tokenizer(testPrompt, return_tensors="pt").to(model.device)
out = model.generate(
    **tokens,
    max_new_tokens=128,
    temperature=0.1,
    do_sample=False
)
logic = tokenizer.decode(out[0], skip_special_tokens=False)
evalOutput = eval(logic)

Two key problems:

  1. Pattern mismatch

    • Training: Logic:\n<DSL>\nOutput:\n<grid>.
    • Inference: Logic:\n?\nOutput:\n<grid> repeated, then Logic:\n?\n with no output.

    The model has never learned that ? means “fill in DSL here”. It has learned that after Output:\n it should print a grid, because you supervised that explicitly.

  2. Decoding and eval

    • You decode the entire sequence (out[0]), which includes the original prompt plus the generated continuation.

    • If the model produces:

      Output:
      [[7,8,9],[7,8,9],[7,8,9]]<|endoftext|>
      

      and you eval() that, you’re evaluating a big string that is not a clean DSL expression.

So the model is not “failing” randomly. It is following the strongest pattern you taught:

“After Output:, I should print a matrix.”


2. What you actually want: LLM as a DSL program synthesizer

Conceptually, your setup is the classic program synthesis approach used in ARC-AGI:

  • The model sees input/output grids and produces a program in a DSL that explains the mapping.
  • You then execute that program (with your NumPy DSL functions) and check whether the resulting grid matches the target (training or test).

In that paradigm, the “answer” is:

  • Always code, not an output grid.
  • The evaluation metric is correctness of execution, not just token matching.

To get there, you need:

  1. Training where the supervised “answer” is only the DSL.
  2. Prompts that clearly ask for DSL, not outputs.
  3. Decoding that extracts only the DSL text from the completion.

3. Fix 1: Train the model to output only DSL (use chat templates)

Qwen2.5-0.5B-Instruct is an instruction-tuned chat model; its model card shows examples using apply_chat_template to format system/user/assistant messages.

You can use that to make each training example:

  • System: explain the task (“you are an ARC DSL generator…”).
  • User: give the input and output grids.
  • Assistant: give the DSL (logic).

Example preprocessing:

def format_example(example):
    e = example["task"][0]  # if exactly one entry per task
    inp = str(e["input"]).replace(" ", "")
    out = str(e["output"]).replace(" ", "")
    logic = e["logic"].replace(" ", "")

    messages = [
        {
            "role": "system",
            "content": (
                "You are an ARC DSL generator. "
                "Given an input grid and an output grid, "
                "produce a single DSL expression that transforms the input into the output. "
                "Return only the DSL expression, nothing else."
            ),
        },
        {
            "role": "user",
            "content": f"Input:\n{inp}\nOutput:\n{out}",
        },
        {
            "role": "assistant",
            "content": logic,
        },
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,  # for training
    )

    tokenized = tokenizer(
        text,
        truncation=True,
        max_length=512,
        padding="max_length",
    )

    # For now, simplest: label all tokens (still much better than mixing DSL+outputs)
    input_ids = tokenized["input_ids"]
    labels = input_ids.copy()
    tokenized["labels"] = labels
    return tokenized

Hugging Face’s chat templating docs recommend exactly this pattern: apply the chat template on your dataset as a preprocessing step so that training and inference use the same format.

For best practice, you’d go one step further: mask non-assistant tokens in labels (set them to -100). Transformers recently added return_assistant_tokens_mask to apply_chat_template to support this: you get a mask of assistant tokens and build labels that ignore system/user tokens.
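
A minimal sketch of that masking step, assuming your Transformers version supports return_assistant_tokens_mask and that the model's chat template marks assistant turns with the {% generation %} keyword (Qwen's stock template may not, in which case the mask comes back empty and you would keep the simpler all-token labels):

enc = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
    truncation=True,
    max_length=512,
)

# Batch of one conversation: take the first row of ids and of the assistant mask.
input_ids = enc["input_ids"][0]
assistant_mask = enc["assistant_masks"][0]

# Supervise only the assistant (DSL) tokens; -100 is ignored by the loss.
labels = [tok if is_assistant else -100 for tok, is_assistant in zip(input_ids, assistant_mask)]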

But even if you don’t mask yet, the key change is:

  • Your training samples now say clearly:

    • “Given this conversation, the assistant must respond with DSL code.”
  • The model no longer sees any Output: after the DSL in the assistant message.


4. Fix 2: Use a matching prompt at inference time

At inference, build messages in the same style:

  • System: same instructions (“ARC DSL generator… return only DSL”).
  • User: several input/output examples for that ARC task + the new input grid.

Example (sketch):

def build_messages_for_task(train_pairs, test_input):
    # train_pairs: list of (input_grid, output_grid) Python lists
    examples_txt = ""
    for inp, out in train_pairs:
        examples_txt += f"Input:\n{str(inp).replace(' ', '')}\n"
        examples_txt += f"Output:\n{str(out).replace(' ', '')}\n\n"

    user_content = (
        "Here are input/output examples for a single ARC task:\n"
        f"{examples_txt}"
        "Now, given this new input grid, write ONE DSL expression that would produce "
        "the correct output grid. Return only the DSL expression.\n"
        f"Input:\n{str(test_input).replace(' ', '')}\n"
        "DSL:\n"
    )

    messages = [
        {
            "role": "system",
            "content": (
                "You are an ARC DSL generator. You see input/output grid pairs for a "
                "single ARC task. Infer a DSL expression that maps the input grid to "
                "the output grid. For the new input, output only the DSL expression, "
                "no explanation, no grid."
            ),
        },
        {
            "role": "user",
            "content": user_content,
        },
    ]
    return messages

Then:

messages = build_messages_for_task(train_pairs, test_input)

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,   # now we want the assistant to continue
).to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.1,
        do_sample=False,
    )

# Slice off the prompt, keep only new tokens
input_len = inputs["input_ids"].shape[1]
generated = out[0, input_len:]
logic_str = tokenizer.decode(generated, skip_special_tokens=True)

print("Predicted DSL:", logic_str)

Important differences from your previous script:

  • No Logic:\n?\nOutput: pattern. The test part ends in DSL:, strongly nudging the model to continue with code.
  • You use Qwen’s chat template exactly as in its model card, so tokens align with how Qwen2.5 was trained.
  • You decode only the generated tokens, not the entire prompt+completion.

After that you can do:

# VERY IMPORTANT: avoid raw eval in real life; parse or sandbox this instead
eval_output = eval(logic_str)

and compare eval_output to the ARC test output grid.


5. Fix 3: Stop teaching the model to emit “Output:” for the answer

Right now, your ARC “few-shot” pseudo-prompt encourages:

Input:
...
Logic:
?
Output:
...

Given that during fine-tuning the model saw lots of Output:\n<grid>, after a Logic: section it is very natural for it to produce Output:\n[[...]] as the continuation. You are seeing exactly that:

Output:
[[7,8,9],[7,8,9],[7,8,9]]<|endoftext|>

To make “DSL” more likely than “Output”, do:

  • In training:

    • Let the assistant’s message be only the DSL, with no Output: afterwards.
  • In inference:

    • Don’t include Output: after the test input.
    • Instead end with a cue like DSL: or just let the assistant message start after the chat template.

This way, the highest-probability continuation after your prompt is code, not another grid.
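
It also helps to post-process the completion defensively. Here is a hypothetical helper (extract_dsl is not part of your scripts) that drops anything after an Output: marker and keeps the first non-empty line:

def extract_dsl(completion: str) -> str:
    """Keep only the DSL part of a completion: cut at 'Output:' and take the first non-empty line."""
    text = completion.strip()
    if "Output:" in text:
        text = text.split("Output:", 1)[0]
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""

# Example: extract_dsl("put(rot90(input), input, (0,0))\nOutput:\n[[1,2]]")
# returns "put(rot90(input), input, (0,0))"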


6. Higher-level context: LLM + DSL + ARC is a search pipeline

Even after you fix the prompt and training, you should expect that the model:

  • Often produces DSL programs that are almost right or reflect only part of the pattern.
  • Sometimes returns syntactically invalid DSL.

That is normal in ARC-AGI program synthesis. Most competitive systems:

  • Use an LLM to generate multiple candidate programs.
  • Execute each candidate on the examples, discard those that fail.
  • Possibly refine candidates via search / evolution (SOAR, execution-guided synthesis, etc.).

So once DSL output is reliable, a natural next step is:

  1. For each ARC task, sample k DSL programs (e.g. temperature=0.7, num_return_sequences=10).

  2. For each program:

    • Run it on all training I/O pairs.
    • Keep only those that perfectly match all training outputs.
  3. Evaluate the survivors on the test input(s).

Your current loop (eval(logic) + np.isclose()) is already a minimal version of this; it will become much more powerful once the model consistently outputs only DSL code.
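
As a rough sketch of that sample-and-filter loop (it assumes the model, tokenizer, and input_ids from the inference code above, your DSL helpers in scope, and a hypothetical train_pairs list of (input, output) grids):

import numpy as np
import torch

def propose_programs(input_ids, k=10):
    """Sample k candidate DSL programs from the fine-tuned model."""
    with torch.no_grad():
        out = model.generate(
            input_ids=input_ids,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
            num_return_sequences=k,
        )
    prompt_len = input_ids.shape[1]
    return [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
        for seq in out
    ]

def solves_all_train_pairs(program, train_pairs):
    """Keep a candidate only if it reproduces every training output exactly."""
    for inp, expected in train_pairs:
        try:
            result = eval(program, globals(), {"input": np.array(inp)})  # sandbox this in real use
        except Exception:
            return False
        if not np.array_equal(np.array(result), np.array(expected)):
            return False
    return True

candidates = propose_programs(input_ids, k=10)
survivors = [p for p in candidates if solves_all_train_pairs(p, train_pairs)]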


7. Short checklist to fix your specific issue

  1. Change training data

    • Use Qwen’s apply_chat_template.
    • System: “You are an ARC DSL generator… return only DSL.”
    • User: Input/Output grids.
    • Assistant: DSL string.
    • Labels: at minimum all tokens; ideally mask non-assistant tokens using return_assistant_tokens_mask.
  2. Change inference prompt

    • Same system instructions.
    • User message: multiple train Input/Output pairs + test Input. End with “DSL:” or similar.
    • No Logic: ? / Output: for the test case.
  3. Change decoding

    • Use apply_chat_template(..., add_generation_prompt=True) to build inputs.
    • After generate, slice out[0, input_len:] and decode only those tokens.
  4. Then evaluate

    • Run the decoded DSL through your interpreter, check if test grid matches.
    • Later: sample multiple programs and use search/execution for better success rates.


Thanks a lot again for this amazing answer!
I modified my code based on your suggestions. The training works, but not the inference; I get the following error that I don’t understand:

Traceback (most recent call last):
File "C:\Users\julie\Documents\GitHub\aicpp\build\vc143-Release\infer_qwen.py", line 85, in
out = model.generate(
**inputs,
…<2 lines>…
do_sample=False,
)
TypeError: peft.peft_model.PeftModelForCausalLM.generate() argument after ** must be a mapping, not Tensor

I tried this:
input_ids=inputs, #**inputs,
It works but I get this error after:
input_len = inputs["input_ids"].shape[1]
Traceback (most recent call last):
File "C:\Users\julie\Documents\GitHub\aicpp\build\vc143-Release\infer_qwen.py", line 94, in
input_len = inputs["input_ids"].shape[1]
~~~~~~^^^^^^^^^^^^^
IndexError: too many indices for tensor of dimension 2


I forgot to mention… If the library version is too old or too new, minor parts of the API change, often causing errors. And libraries get updated frequently… :roll_eyes:


You’re not doing anything “wrong” in a logical sense. You just have a mismatch between:

  • what apply_chat_template(..., return_tensors="pt") actually returns in your setup, and
  • how my earlier example was treating that return value.

Let’s go through both errors and then show a clean, working inference pattern for your case.


1. First error: argument after ** must be a mapping, not Tensor

You have:

out = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.1,
    do_sample=False,
)

This assumes that inputs is a mapping (a dict-like object), e.g.:

inputs = {
    "input_ids": <tensor of shape [B, T]>,
    "attention_mask": <tensor of shape [B, T]>,
}

Then:

**inputs

expands to:

input_ids=<...>, attention_mask=<...>

But in your code, inputs is not a dict; it’s a plain torch.Tensor (shape [1, seq_len]). That’s why Python complains:

TypeError: ... argument after ** must be a mapping, not Tensor

So:

  • **inputs only works when inputs is dict-like (BatchEncoding, dict, etc.).

  • Because inputs is a tensor, you must pass it explicitly:

    out = model.generate(
        input_ids=inputs,
        max_new_tokens=128,
        temperature=0.1,
        do_sample=False,
    )
    

This is exactly how the Hugging Face chat examples pass the result of apply_chat_template when return_tensors="pt": they treat it as a tensor, not a dict, and feed it positionally or as input_ids=....


2. Second error: inputs["input_ids"] on a tensor

After switching to:

out = model.generate(
    input_ids=inputs,  # works
    max_new_tokens=128,
    ...
)

you then do:

input_len = inputs["input_ids"].shape[1]

But now you are treating inputs as if it were a dict with key "input_ids".

It isn’t. It’s still the same tensor you just passed to generate:

type(inputs)  # torch.Tensor

A tensor doesn’t support string indexing, so that line is just invalid. What you want is:

input_len = inputs.shape[1]

In other words:

  • If inputs is a tensor of shape [batch_size, seq_len], use inputs.shape[1] to get the prompt length.
  • Only do inputs["input_ids"] when inputs is a dict / BatchEncoding.

So the second error is just the mirror image of the first one: you fixed the call to generate, but you kept using dict-style access on a tensor.


3. Why this happened at all (library behavior)

The confusion comes from the fact that there are two different patterns you’ll see online:

Pattern 1 – tensor output

Most current chat examples do something like:

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=128,
)

Here:

  • input_ids is a tensor.
  • You never use **input_ids.
  • You use input_ids.shape[1] to get the prompt length.

This is what your code path is giving you: apply_chat_template(..., return_tensors="pt") → tensor.

Pattern 2 – dict / BatchEncoding output

Other examples (especially around newer versions) show:

encoded = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
)
encoded = {k: v.to(model.device) for k, v in encoded.items()}

outputs = model.generate(
    **encoded,
    max_new_tokens=128,
)
prompt_len = encoded["input_ids"].shape[1]

Here:

  • encoded is a BatchEncoding (dict-like).
  • **encoded expands fine.
  • encoded["input_ids"] is valid.

Your situation is that my earlier answer implicitly used Pattern 2, while your actual return value from apply_chat_template follows Pattern 1.

That’s why:

  • **inputs → TypeError (because inputs is a tensor).
  • inputs["input_ids"] → IndexError / wrong kind of indexing (still a tensor).

You do not need return_dict=True here; recent Transformers versions do accept return_dict=True on apply_chat_template (which gives the dict-like Pattern 2 output), but for the tensor-based way you’re using it, a plain tensor is fine.


4. The simplest correct inference code for you

Given your current behavior (where apply_chat_template(..., return_tensors="pt") returns a tensor), the cleanest pattern is:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE = "Qwen/Qwen2.5-0.5B-Instruct"
FINETUNED = "./qwen25_dsl_lora"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    device_map="auto",
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base_model, FINETUNED)
model.eval()

# Build your messages (system + user)
messages = [
    {
        "role": "system",
        "content": (
            "You are an ARC DSL generator. "
            "Given input/output grids, produce only a DSL expression that "
            "maps the input to the output."
        ),
    },
    {
        "role": "user",
        "content": user_prompt_text,  # your Input/Output ... + "DSL:\n"
    },
]

# 1) Get tensor of token IDs
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)  # shape: [1, seq_len]

# 2) Generate
with torch.no_grad():
    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=128,
        temperature=0.1,
        do_sample=False,
    )

# 3) Slice off the prompt (same batch dimension, shorter seq_len)
prompt_len = input_ids.shape[1]
generated_tokens = out[:, prompt_len:]

# 4) Decode only the generated part as your DSL
logic_str = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
print("Predicted DSL:", logic_str)

No **inputs, no inputs["input_ids"]. You treat input_ids as a tensor all the way through.

If you later upgrade your Transformers version and discover that apply_chat_template now returns a dict-like object instead of a tensor, then you can switch to the other idiom, but for your current environment, the tensor-based style above is correct and simplest.


5. Mental model to remember

When you see code like:

out = model.generate(**inputs, ...)
prompt_len = inputs["input_ids"].shape[1]

that is dict-style code. It only makes sense if inputs is a mapping (BatchEncoding).

When you see:

out = model.generate(input_ids=inputs, ...)
prompt_len = inputs.shape[1]

that is tensor-style code. It only makes sense if inputs is a tensor.

Your current setup is firmly in the second scenario, so just be consistent with that and the errors will disappear.

Thank you for this clarification.
Here is the inference prompt:

Here are input/output examples for a single ARC task:
Input1:
[[5,8,6],[5,8,6],[5,8,6]]
Output1:
[[1,9,2],[1,9,2],[1,9,2]]
Input2:
[[2,3,8],[2,3,8],[2,3,8]]
Output2:
[[6,4,9],[6,4,9],[6,4,9]]
Input3:
[[9,4,2],[9,4,2],[9,4,2]]
Output3:
[[8,3,6],[8,3,6],[8,3,6]]
Input4:
[[3,1,2],[3,1,2],[3,1,2]]
Output4:
[[4,5,6],[4,5,6],[4,5,6]]
Now, given this new input grid, write ONE DSL expression that would produce the correct output grid. Return only the DSL expression.
Input5:
[[8,1,3],[8,1,3],[8,1,3]]
Logic5:

And the answer:

Predicted DSL: Output5:
[[7,0,0],[7,0,0],[7,0,0]]

Am I doing something wrong?


Maybe you’re making some minor mistakes?


You’re not “breaking” the code anymore. The pipeline is fine.
The remaining issue is behavior: the model still prefers to output an Output5: grid instead of a DSL expression.

That is almost always a training + prompt alignment issue, not a runtime bug.


1. What is happening in your example

Your current inference prompt (inside the user message) is:

Here are input/output examples for a single ARC task:
Input1:
[[5,8,6],[5,8,6],[5,8,6]]
Output1:
[[1,9,2],[1,9,2],[1,9,2]]
Input2:
[[2,3,8],[2,3,8],[2,3,8]]
Output2:
[[6,4,9],[6,4,9],[6,4,9]]
Input3:
[[9,4,2],[9,4,2],[9,4,2]]
Output3:
[[8,3,6],[8,3,6],[8,3,6]]
Input4:
[[3,1,2],[3,1,2],[3,1,2]]
Output4:
[[4,5,6],[4,5,6],[4,5,6]]
Now, given this new input grid, write ONE DSL expression that would produce the correct output grid. Return only the DSL expression.
Input5:
[[8,1,3],[8,1,3],[8,1,3]]
Logic5:

Model output:

Predicted DSL: Output5:
[[7,0,0],[7,0,0],[7,0,0]]

So:

  • The model ignored “Return only the DSL expression” and
  • continued the “OutputN:” pattern it saw many times in the prompt.

This is typical LM behavior:

  • It completes the strongest pattern in the text (“InputN/OutputN”) rather than strictly obeying the natural-language instruction.
  • If training also contained many “Output:” + grid continuations, that pattern is even stronger.

You did the low-level things correctly (generate call, slicing, decoding). The remaining problems are:

  1. How the model was trained (what exactly was labeled as the “answer”), and
  2. How the inference prompt relates to that training format.

2. Check 1: What did you actually train the model to do?

The key question:

In your fine-tuning data, what exact string did you give as the assistant’s answer?

There are two possibilities:

Case A – Only DSL was labeled as answer (desired)

For each training triple (input, output, logic) you did something like:

  • System: “You are an ARC DSL generator…”

  • User:

    Input:
    [[...]]
    Output:
    [[...]]
    
  • Assistant:

    put(mulmat(...), input, (0,0))
    

And when you tokenized with apply_chat_template, you:

  • Set labels only on the assistant tokens (DSL code), and
  • Masked everything else with -100 (system/user tokens not trained). (Hugging Face)

In this case, the model has been explicitly optimized to answer with DSL, not with output grids.

Case B – The whole “Input / Logic / Output” string was the target (old behavior)

For each item you did something like your original:

Input:
[[...]]
Logic:
put(...)
Output:
[[...]]

and used DataCollatorForLanguageModeling(mlm=False) so the model was trained to predict everything, including the Output: + grid part.

In that situation, the model learned:

  • “After Output: I should output a grid” (very strongly),
  • and that pattern is now dominating the continuation.

From your earlier code, it looked like you started in Case B. If your new training script didn’t fully change to Case A, this is exactly the behavior you’d expect.

Conclusion:
If you still train on sequences that include the Output: grids in the assistant part, the model will naturally think that emitting grids is a good answer.


3. Check 2: Does inference match your training format?

Suppose you did move to the recommended chat format (Case A):

  • System: “You are an ARC DSL generator. Return only DSL…”
  • User: “Input… Output…”
  • Assistant: “put(…)”.

That means each training conversation is:

<system> You are an ARC DSL generator. ...
<user>   Input: ... Output: ...
<assistant> put(...)

At inference, you are currently giving one large user message with:

  • Four Input/Output examples, and
  • A new Input5, and
  • A line Logic5:.

This is not something the model saw during training:

  • It never saw “Input1 / Output1 / Input2 / Output2 / … Input5 / Logic5:”.
  • It never saw multiple training examples packed into one user turn.
  • It never saw the label “Logic5:”.

So the model is being asked to do two things at once:

  1. Meta-learn a pattern across multiple I/O examples (few-shot ARC induction), and
  2. Generalize to a new prompt structure.

For a small model (0.5B) with 1500 fine-tuning samples, this is quite a lot.


4. Step-by-step way to see what is wrong

4.1 Single-example sanity check

First, strip away the few-shot complexity. For one training triple, do:

System:
You are an ARC DSL generator. Given an input and output grid, produce only the DSL expression.

User:
Input:
[[5,8,6],[5,8,6],[5,8,6]]
Output:
[[1,9,2],[1,9,2],[1,9,2]]
DSL:

and see what the model outputs.

Implementation sketch:

messages = [
    {
        "role": "system",
        "content": (
            "You are an ARC DSL generator. "
            "Given an input and output grid, produce only the DSL expression."
        ),
    },
    {
        "role": "user",
        "content": (
            "Input:\n[[5,8,6],[5,8,6],[5,8,6]]\n"
            "Output:\n[[1,9,2],[1,9,2],[1,9,2]]\n"
            "DSL:\n"
        ),
    },
]

input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

out = model.generate(
    input_ids=input_ids,
    max_new_tokens=128,
    temperature=0.1,
    do_sample=False,
)

prompt_len = input_ids.shape[1]
generated = out[:, prompt_len:]
logic_str = tokenizer.decode(generated[0], skip_special_tokens=True)
print("Predicted:", repr(logic_str))

If here you still get something like Output: [[...]], that means:

  • Training is still teaching the model to output grids,
  • Or the system/user/assistant split is not what you think (e.g. your logic ended up inside the user message instead of assistant).

That’s the first thing to verify.

4.2 Inspect actual training strings

Take one random training example and print out the final string you pass to the tokenizer right before training, e.g.:

for ex in raw_dataset.select(range(1)):
    print(format_example(ex)["text_or_whatever_you_return"])
    break

Check:

  • Does that text end with only DSL, or does it also include an Output: line?
  • Does it contain “Logic:” or “DSL:”? Match that label at inference.

This is the most reliable way to see if the training really follows the pattern you think.


5. How to improve your inference prompt

Assuming your training is now in the “good” format (assistant = DSL only), there are still two issues with the prompt you showed:

  1. It introduces new patterns (Input1, Output1, Logic5) that the model never saw.
  2. It does not show any DSL example inside the context, only grids.

5.1 Match labels between training and inference

If in training you always used:

User:  Input:\n...\nOutput:\n...\n
Assistant:  put(...)

then in inference you should use similar labels:

  • Use Input: and Output:, not Input1:, Output1:.
  • Use the same DSL label (“DSL:” or nothing) you used when constructing the assistant messages during training.

For example, for few-shot prompting:

Here are input/output examples for a single ARC task:

Example 1:
Input:
[[...]]
Output:
[[...]]

Example 2:
Input:
[[...]]
Output:
[[...]]

Now, infer a DSL rule that maps input to output.
For this new input, output only a DSL expression.
Input:
[[8,1,3],[8,1,3],[8,1,3]]
DSL:

This aligns more closely with the training “Input / Output / DSL” pattern.

5.2 Show at least one DSL example in the prompt

Right now, the prompt contains only grids:

  • Input1 / Output1
  • Input2 / Output2
  • Input4 / Output4
  • Input5

The model sees no explicit DSL example in-context, just the verbal instruction.

Given the model is small and your dataset is limited, it helps to show one or two worked examples including DSL, e.g.:

Here is how you should answer:

Example:
Input:
[[5,8,6],[5,8,6],[5,8,6]]
Output:
[[1,9,2],[1,9,2],[1,9,2]]
DSL:
put(replace(replace(replace(input,5,1),8,9),6,2), input, (0,0))

Now here are new training examples (no DSL shown):

Input:
[[2,3,8],[2,3,8],[2,3,8]]
Output:
[[6,4,9],[6,4,9],[6,4,9]]

...

Now, for this new input, output only the DSL expression:
Input:
[[8,1,3],[8,1,3],[8,1,3]]
DSL:

Because Qwen2.5 is an instruction-tuned chat model, it will try to follow the latest instruction, but seeing a concrete DSL format in-context makes it more likely to output DSL instead of continuing the OutputN grid pattern. (Hugging Face)


6. Accept model limitations on this specific pattern

Even with perfect formatting:

  • You are asking a 0.5B model, fine-tuned on ~1500 DSL tasks, to infer a non-trivial permutation over digits from 4 I/O examples and express it as code.
  • That is a real compositional generalization task, similar to ARC-AGI program induction.

In the literature, such tasks are usually solved with:

  • Larger models, more specialized training, and
  • Execution-guided search (generate many programs, execute them, filter) rather than a single greedy decode. (arXiv)

So you should expect:

  • Even after fixing prompts and training labels, many tasks will still fail on the first try.
  • You will often want to sample multiple DSL candidates and check them by executing them against training and test grids.

Your current test (single deterministic generate call, temperature=0.1, do_sample=False) is the most demanding scenario.


7. Summary

  • Your code and inference call are now technically correct.

  • The model’s behavior (“Output5: [[7,0,0]…]”) is a sign that:

    • It has a strong prior to emit Output: + grid patterns (likely from training), and
    • Your inference prompt does not match the exact structure it saw during fine-tuning (labels, number of examples, presence/absence of DSL in-context).

Concrete actions:

  1. Verify training:

    • Ensure assistant messages contain only DSL, no Output: grids.
    • Mask non-assistant tokens in labels if possible.
  2. Simplify and test:

    • Try a single-example prompt (one Input/Output → DSL) to see if the model outputs DSL.
    • If that fails, fix training before worrying about few-shot prompts.
  3. Align inference with training:

    • Use the same label words (Input:, Output:, DSL:) and structure.
    • Avoid Input1 / Output1 / Logic5 unless you trained with that pattern.
    • Optionally include one worked DSL example in the prompt.
  4. Plan for search:

    • Eventually, sample multiple DSL programs and execute them to find ones that match the grids.

So the short answer to “Do I do something wrong?” is:

  • The low-level code is now fine.
  • The remaining issue is training target choice + mismatch between training format and inference prompt, plus the inherent difficulty of the task for a 0.5B model.

Thanks a lot. I will check these points.
For information, here is a sample training prompt:
<|im_start|>system
You are an ARC DSL generator. Given pairs of input grid and output grid, produce DSL expressions that transform the input into the output. Return only the DSL expressions, nothing else.<|im_end|>
<|im_start|>user
Input1:
[[0,0,0,0,0,0,0,8,8,0,0,0,0,0],[0,0,0,0,0,0,0,0,8,8,8,0,0,0],[0,0,0,0,0,0,0,0,0,0,8,0,8,0],[0,0,0,0,0,8,8,8,8,0,8,8,8,0],[0,0,0,0,8,8,0,0,8,8,8,0,8,8],[0,0,0,0,0,0,0,8,8,0,0,0,8,0],[0,0,0,0,0,0,8,8,0,0,0,8,8,0],[0,0,0,0,0,0,0,0,0,8,8,8,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,1,1,1,0,0,0,0,0,0,0],[0,0,0,0,1,0,1,0,0,0,0,0,0,0],[0,0,0,0,0,1,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Output1:
[[0,0,0,0,0,0,0,7,7,0,0,0,0,0],[0,0,0,0,0,0,0,0,7,7,7,0,0,0],[0,0,0,0,0,0,0,0,0,0,7,0,7,0],[0,0,0,0,0,7,7,7,7,0,7,7,7,0],[0,0,0,0,7,7,0,0,7,7,7,0,7,7],[0,0,0,0,0,0,0,7,7,0,0,0,7,0],[0,0,0,0,0,0,7,7,0,0,0,7,7,0],[0,0,0,0,0,0,0,0,0,7,7,7,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Input2:
[[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,8,8,0,8,0,8,8,0,0,0],[0,0,0,0,8,0,8,0,8,0,8,0,0,0],[0,0,0,0,8,8,0,8,0,8,8,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,1,0,1,0,0,0,0,0,0,0,0],[0,0,0,0,1,0,0,0,0,0,0,0,0,0],[0,0,0,1,1,1,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Output2:
[[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,3,3,0,3,0,3,3,0,0,0],[0,0,0,0,3,0,3,0,3,0,3,0,0,0],[0,0,0,0,3,3,0,3,0,3,3,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Input3:
[[0,0,0,8,0,0,0,8,0,0,0,0,0,0],[0,8,0,8,0,8,0,8,0,8,0,0,0,0],[0,8,8,8,8,8,8,8,8,8,0,0,0,0],[0,8,0,8,0,8,0,8,0,8,0,0,0,0],[0,8,0,0,0,8,0,0,0,8,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,1,0,0,0,0,0],[0,0,0,0,0,0,0,1,1,1,0,0,0,0],[0,0,0,0,0,0,0,0,1,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Output3:
[[0,0,0,2,0,0,0,2,0,0,0,0,0,0],[0,2,0,2,0,2,0,2,0,2,0,0,0,0],[0,2,2,2,2,2,2,2,2,2,0,0,0,0],[0,2,0,2,0,2,0,2,0,2,0,0,0,0],[0,2,0,0,0,2,0,0,0,2,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Input4:
[[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,8,0,0,0,0,0],[0,0,0,0,0,8,8,8,0,8,8,0,0,0],[0,0,0,0,0,0,8,0,8,8,0,0,0,0],[0,0,0,0,0,0,0,8,0,0,8,8,0,0],[0,0,0,0,0,0,0,0,8,8,0,8,0,0],[0,0,0,0,0,0,0,0,0,0,8,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,1,0,1,0,0,0,0,0,0,0,0,0],[0,0,0,1,0,0,0,0,0,0,0,0,0,0],[0,0,1,1,1,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Output4:
[[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,3,0,0,0,0,0],[0,0,0,0,0,3,3,3,0,3,3,0,0,0],[0,0,0,0,0,0,3,0,3,3,0,0,0,0],[0,0,0,0,0,0,0,3,0,0,3,3,0,0],[0,0,0,0,0,0,0,0,3,3,0,3,0,0],[0,0,0,0,0,0,0,0,0,0,3,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Input5:
[[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,8,8,0,8,8,0,0],[0,0,0,0,0,0,0,0,8,8,8,0,0,0],[0,0,0,0,0,8,8,8,8,0,0,0,0,0],[0,0,0,0,8,8,0,8,0,0,8,8,0,0],[0,0,0,0,0,0,0,8,8,8,8,0,0,0],[0,0,0,0,0,0,0,0,8,0,8,0,0,0],[0,0,0,0,0,0,8,8,8,0,8,8,8,0],[0,0,0,0,0,0,8,0,0,0,0,0,8,0],[0,0,1,0,0,0,0,0,0,0,0,0,0,0],[0,1,1,1,0,0,0,0,0,0,0,0,0,0],[0,0,1,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
Output5:
[[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,2,2,0,2,2,0,0],[0,0,0,0,0,0,0,0,2,2,2,0,0,0],[0,0,0,0,0,2,2,2,2,0,0,0,0,0],[0,0,0,0,2,2,0,2,0,0,2,2,0,0],[0,0,0,0,0,0,0,2,2,2,2,0,0,0],[0,0,0,0,0,0,0,0,2,0,2,0,0,0],[0,0,0,0,0,0,2,2,2,0,2,2,2,0],[0,0,0,0,0,0,2,0,0,0,0,0,2,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0]]
<|im_end|>
<|im_start|>assistant
Logic1:
replace(replace(input,8,7),1,0)
Logic2:
replace(replace(input,8,3),1,0)
Logic3:
replace(replace(input,8,2),1,0)
Logic4:
replace(replace(input,8,3),1,0)
Logic5:
replace(replace(input,8,2),1,0)
<|im_end|>

1 Like

Hm?


You’re not disastrously off; the training prompt is close, but there are a few important mismatches between:

  • what your system message says (“Return only the DSL expressions”),
  • what your assistant content actually looks like (Logic1: ... Logic2: ...), and
  • what you later do in inference (“Logic5:” on the user side, and expecting only the code).

These mismatches make it much more likely that the model keeps producing things like Output5: or other non-DSL fragments.

Let’s walk through what your sample training prompt is teaching the model, and how to adjust it or your inference so behavior is more predictable.


1. What your training sample really looks like

You showed a full ChatML-style training instance:

<|im_start|>system
You are an ARC DSL generator. Given pairs of input grid and output grid, produce DSL expressions that transform the input into the output. Return only the DSL expressions, nothing else.<|im_end|>
<|im_start|>user
Input1:
[[...]]
Output1:
[[...]]
Input2:
[[...]]
Output2:
[[...]]
Input3:
[[...]]
Output3:
[[...]]
Input4:
[[...]]
Output4:
[[...]]
Input5:
[[...]]
Output5:
[[...]]
<|im_end|>
<|im_start|>assistant
Logic1:
replace(replace(input,8,7),1,0)
Logic2:
replace(replace(input,8,3),1,0)
Logic3:
replace(replace(input,8,2),1,0)
Logic4:
replace(replace(input,8,3),1,0)
Logic5:
replace(replace(input,8,2),1,0)
<|im_end|>

So, per training example:

  • System: instructs the model to be a DSL generator and “return only DSL expressions”.
  • User: supplies 5 input/output grid pairs (Input1..Output5).
  • Assistant: supplies 5 labeled DSL expressions (Logic1: ... Logic2: ... ... Logic5: ...).

Key observations:

  1. The assistant’s answer is not just bare DSL; it includes labels like Logic1:, Logic2:.

    • This is already a small inconsistency with your system instruction “return only the DSL expressions, nothing else”.

    • In practice, the model learns that the answer format is:

      Logic1:
      <dsl1>
      Logic2:
      <dsl2>
      ...
      
  2. Each training example is multi-pair:

    • The user gives 5 pairs, and the assistant returns 5 DSL lines.
    • So the model is not learning “one input-output → one DSL” in isolation; it is learning “a whole block of Input/Output1..5 → Logic1..5”.
  3. The assistant never mentions the outputs again; it only outputs LogicN: lines.

So far, this is logically consistent: user = grids, assistant = DSL. That part is good.


2. How this differs from your inference prompt

Your inference (plain-text) prompt was:

Here are input/output examples for a single ARC task:
Input1:
[[5,8,6],[5,8,6],[5,8,6]]
Output1:
[[1,9,2],[1,9,2],[1,9,2]]
Input2:
[[2,3,8],[2,3,8],[2,3,8]]
Output2:
[[6,4,9],[6,4,9],[6,4,9]]
Input3:
[[9,4,2],[9,4,2],[9,4,2]]
Output3:
[[8,3,6],[8,3,6],[8,3,6]]
Input4:
[[3,1,2],[3,1,2],[3,1,2]]
Output4:
[[4,5,6],[4,5,6],[4,5,6]]
Now, given this new input grid, write ONE DSL expression that would produce the correct output grid. Return only the DSL expression.
Input5:
[[8,1,3],[8,1,3],[8,1,3]]
Logic5:

Differences:

  1. Speaker roles / template

    • In training, this “block” is inside a user message, wrapped in ChatML via apply_chat_template.
    • In inference, you're also likely wrapping this in a user message, which is fine, but the text differs on several important axes (points 2–4 below).
  2. Presence of Output5

    • Training user: has both Input5: and Output5:.
    • Inference user: has Input5: only, no Output5:.
    • In training, the model never had to generate Logic5: without seeing Output5:; now it must. This is exactly the “generalization” you want, but from the model’s perspective it’s out-of-distribution.
  3. Shape of the assistant’s answer

    • Training assistant: a block with Logic1:..Logic5:.
    • Inference expectation: you want only one DSL expression, corresponding to Logic5:. But your prompt just ends with "Logic5:\n"; you are not explicitly saying "Output exactly one line: Logic5: <DSL>" in terms that match the training pattern.
  4. Exact labels

    • Training: Logic1:, Logic2:, Logic3:, Logic4:, Logic5: all appear in the assistant messages.
    • Inference: Logic5: is in the user message. That is a subtle but meaningful shift: in training, Logic5: is a token the assistant tends to start with; in inference, the user has already written it, and the assistant must now decide what to write after it.

Given this, the model is trying its best to continue a pattern it knows:

  • It sees Input1..Output4 pairs (like training).
  • It sees a new Input5: but no Output5:, plus a request to “return only DSL”.
  • Its training pattern tells it that after multiple OutputN lines, it often sees more grid-like content or similar patterns. Depending on how the rest of your training set looks, it may consider “Output5: [[…]]” as a high-probability continuation when it’s unsure about the DSL.

This is why, even though we interpret the prompt as “give me code”, the model is still sometimes producing Output5: + grid. It’s following its learned text statistics more than the natural language instruction.


3. Are you doing something “wrong” in the training prompt?

You’re not disastrously wrong, but there are some design choices that work against what you want:

  1. System vs assistant content mismatch

    • System says: “Return only the DSL expressions, nothing else.”

    • Assistant actually writes:

      Logic1:
      replace(...)
      ...
      
    • So you are teaching it that “return only DSL” = “return multiple LogicN: headers plus DSL bodies”. That’s not fatal, but it lowers the weight of obeying the exact natural-language instruction.

  2. Multi-DSL per conversation

    • Each training example has 5 DSL outputs.
    • This is convenient, but it means the model isn’t directly learning “one (input, output) → one DSL” as a simple mapping, which is what you exploit at inference time.
  3. LogicN: tokens only ever appear in assistant messages during training

    • In your sample, the user never mentions Logic1: etc. Only the assistant does.
    • At inference, because Logic5: is in the user message, the model is in a configuration it hasn’t seen: it must begin generating after Logic5: instead of starting with Logic5:.

None of these are syntactic errors, but they introduce mismatches between training and test that hurt generalization.


4. Ways to make this more consistent

You have a spectrum of options; you don’t necessarily need to rewrite everything.

Option A – Minimal changes, keep current training

If you keep this training format, then:

  1. Inference should mimic the training shape more closely

    For example, put all 5 I/O pairs in the user message (including Output5) and then ask the model to output all 5 logics, but you only care about Logic5 at evaluation:

    System: same as training
    
    User:
    Input1:
    ...
    Output1:
    ...
    ...
    Input5:
    ...
    Output5:
    ?
    
    Assistant (expected):
    Logic1:
    ...
    Logic2:
    ...
    ...
    Logic5:
    ...
    

    Here you might mark Output5 with ? or some special pattern, but still show that there is an Output5 line like in the training data. The assistant is then expected to complete the same five LogicN: lines. This is closer to what you trained on.

  2. In the prompt, explicitly say you want “Logic1..Logic5”

    Explicitly mirror the exact training format:

    Now, based on the above, output:
    Logic1:
    <DSL1>
    Logic2:
    <DSL2>
    ...
    Logic5:
    <DSL5>
    

    so the model is not guessing the format.

  3. Later, you can gradually hide Output5

    Once the model reliably produces LogicN for tasks where all I/O pairs (including the last) are given, you can try training (or fine-tuning further) with some examples that omit Output5 and force the model to infer it.

Option B – Change training to one I/O pair per example

The conceptually clean approach:

  • For training:

    • Each training instance includes one input/output pair in the user, and one DSL in the assistant.
    • No Logic1..5 headers: just one DSL expression string.

    Example:

    <|im_start|>system
    You are an ARC DSL generator. Given an input grid and output grid,
    produce a DSL expression that transforms the input into the output.
    Return only the DSL expression, nothing else.<|im_end|>
    <|im_start|>user
    Input:
    [[...]]
    Output:
    [[...]]
    <|im_end|>
    <|im_start|>assistant
    replace(replace(input,8,7),1,0)
    <|im_end|>
    
  • Then at inference:

    • You can still give multiple I/O examples in one user message (few-shot), but the model’s “answer format” is always a single DSL.

    • For example:

      System: same
      
      User:
      Here are examples for a single task:
      Input:
      [[...]]
      Output:
      [[...]]
      Input:
      [[...]]
      Output:
      [[...]]
      Now, for this new input grid, output only the DSL expression:
      Input:
      [[test_input]]
      DSL:
      

    This “one conversation → one DSL” alignment is much easier for the model to learn.

This requires rewriting your preprocessing, but conceptually it is the cleanest.


5. For your current sample: are you doing something “wrong”?

Given your sample:

  • You are correctly:

    • Using system/user/assistant roles.
    • Putting all grid data on the user side.
    • Putting only DSL on the assistant side.

So the basic pattern is sound.

The parts that work against you:

  1. Assistant actually outputs LogicN: + DSL, not “only DSL expression” as per system.
  2. You pack 5 pairs and 5 logics into a single dialogue instead of 1 pair → 1 logic; this is fine but complicates generalization.
  3. At inference you significantly change the shape: you omit Output5 and only want a single Logic5: instead of the whole block.

So the short answer is:

You’re not fundamentally wrong, but your training prompt and inference prompt are not tightly aligned. The model is learning “User: Input1..Output5 → Assistant: Logic1..Logic5” but at inference you are asking “User: Input1..Output4 + Output1..Output4 + Input5 (no Output5) → Assistant: only Logic5”. That’s a different pattern.


6. Practical next steps

If you want to keep your current training data:

  1. Test single-example behavior first

    Take one training instance (user Input1/Output1, assistant Logic1) and fine-tune a small model where each sample is just 1 pair → 1 DSL. Confirm that at inference, with that same format, the model outputs DSL correctly.

  2. Then introduce few-shot prompts

    Once the model reliably maps 1 pair → 1 DSL, try a prompt with 2–3 training examples plus 1 new example, all in one user message, and see if it still outputs a single DSL.

  3. If you keep multi-logic training (your current setup), at inference:

    • Include Output5 (maybe as ?),
    • Ask explicitly for Logic1..Logic5, and
    • Extract Logic5 from the generated block.

This incremental approach lets you see at which step the behavior diverges.


Summary

  • Your printed training prompt is structurally reasonable: user = input/output grids, assistant = DSL.

  • The main issues are format mismatches and multi-logic vs single-logic expectations:

    • System says “only DSL expressions”, but assistant includes Logic1: labels.
    • Training: 5 I/O pairs → 5 LogicN lines.
    • Inference: 4 I/O pairs + 1 input only → 1 Logic5 line.
  • To reduce the model’s tendency to produce Output5: or other non-DSL text, bring training and inference closer together:

    • Either rewrite training to 1 pair → 1 DSL,
    • Or make inference ask for the same Logic1..Logic5 block you trained on and then post-process Logic5.

You’re very close—this is now mostly about tightening the correspondence between how you train and how you query the model.

Ok, again thank you very much. I kept the same training and changed the inference prompt, and now it generates DSL (yes!). It is not valid yet, but it is a really good step! It does not seem to be reproducible, though…
Here is the inference prompt:
Here are input/output examples for a single ARC task:
Input1:
[[5,8,6],[5,8,6],[5,8,6]]
Output1:
[[1,9,2],[1,9,2],[1,9,2]]
Input2:
[[2,3,8],[2,3,8],[2,3,8]]
Output2:
[[6,4,9],[6,4,9],[6,4,9]]
Input3:
[[9,4,2],[9,4,2],[9,4,2]]
Output3:
[[8,3,6],[8,3,6],[8,3,6]]
Input4:
[[3,1,2],[3,1,2],[3,1,2]]
Output4:
[[4,5,6],[4,5,6],[4,5,6]]
Now, given this new input grid, write 5 DSL logic expressions that would produce the correct output grid. Return only the DSL expressions.
Input5:
[[8,1,3],[8,1,3],[8,1,3]]
And the output:
Predicted DSL: Logic1:
replace(input,(0,0),8)
Logic2:
replace(input,(0,0),8)
Logic3:
replace(input,(0,0),8)
Logic4:
replace(input,(0,0),8)
Logic5:
replace(input,(0,0),8)

Can I do better, or is everything correct and only a bigger model would do better? (The replace function takes a number as its second argument here, not a pair such as (0, 0).)

1 Like

I think we’ve gained a pretty good understanding of what this model (potentially) can do and its actual limitations.


What you’re seeing is actually exactly what your current setup encourages:

  • The model is now clearly in “DSL mode” (good).
  • It’s still weak on DSL grammar + function signatures and on true reasoning, so it invents something like replace(input,(0,0),8) and copies it 5 times.
  • A 0.5B model can learn “shape” and surface patterns of your DSL, but struggles to fully internalize the typed API and the underlying rule for that ARC task.

So:

  • Yes, you can do better with your current setup (prompt + training + decoding changes).
  • But there is also a fundamental capacity issue: a small 0.5B general model, even Qwen2.5-0.5B-Instruct, has limited program-synthesis power; larger or code-specialized models do perform better on code/DSL tasks.

I’ll break this into three parts:

  1. Why replace(input,(0,0),8) and repeated Logic1..Logic5 is actually “expected” behavior.
  2. Concrete things you can improve without changing the model size.
  3. What realistically improves with a bigger / more code-specialized model.

1. Interpreting replace(input,(0,0),8) and the 5 identical logics

1.1 What the model has actually learned so far

Your training format (summarizing):

  • System: “You are an ARC DSL generator… Return only the DSL expressions…”

  • User: Input1..Output5 (5 I/O pairs).

  • Assistant:

    Logic1:
    <dsl1>
    Logic2:
    <dsl2>
    ...
    Logic5:
    <dsl5>
    

At inference now you say:

Here are input/output examples for a single ARC task:
Input1: ... Output1: ...
...
Input4: ... Output4: ...
Now, given this new input grid, write 5 DSL logic expressions that would produce the correct output grid. Return only the DSL expressions.
Input5: ...

The model’s response:

Logic1:
replace(input,(0,0),8)
Logic2:
replace(input,(0,0),8)
Logic3:
replace(input,(0,0),8)
Logic4:
replace(input,(0,0),8)
Logic5:
replace(input,(0,0),8)

This tells you a few things:

  1. It has internalized the high-level format

    • It knows that answers should be Logic1: ... Logic2: ... Logic3: ... Logic4: ... Logic5:.
    • That is precisely what its assistant side looked like during training.
  2. It has learned some surface patterns of the DSL

    • replace(input, something, something) is clearly coming from your training distribution (you do have many replace(replace(input,8,2),1,0) etc).
    • It is not hallucinating completely random tokens.
  3. It has not really learned the function signature

    • In your DSL, replace is of the form:

      replace(a: np.ndarray, x: scalar, y: scalar) -> np.ndarray
      

      so the second argument must be a color index, not a (0, 0) coordinate.

    • The model is mixing patterns from other DSL functions where coordinates appear, like put_value(a,(i,j),v) or fill_region_at(a,(i,j),x), and not respecting per-function types.

This is very typical: LLMs are trained primarily on surface text, and their sense of API semantics is probabilistic and fuzzy, especially in small models. Program-synthesis work consistently finds that without extra constraints or execution guidance, models frequently produce syntactically invalid or semantically ill-typed code.

1.2 Why all five logics are identical

You’re asking:

“write 5 DSL logic expressions that would produce the correct output grid”

But:

  • The prompt only contains one test input (Input5), and no Output5.
  • There is no clear differentiation in the prompt about what Logic1 vs Logic2 vs Logic3 vs Logic4 vs Logic5 should do differently.
  • In the training example you showed, Logic1..Logic5 all correspond to five different I/O pairs with similar “replace 8 → some color, 1 → 0” behavior.

So the model does the safest thing it can:

  • Give you five copies of the same “best guess” for a program that might transform Input5 into a plausible output.
  • Prefix them with Logic1..Logic5 because that’s strongly associated with “block of five DSL expressions” in the assistant messages during training.

In other words: for your current inference prompt, producing five distinct high-quality candidate programs would require a meta-level understanding of “I’m being asked to propose different candidates”, which is not strongly supported by your training format.


2. What you can do better without changing the model size

You can squeeze more out of the current model by attacking three axes:

  1. Prompting: define the DSL precisely and show examples.
  2. Training data: ensure all DSL is valid and type-consistent; maybe simplify per-example structure.
  3. Decoding / search: filter invalid programs and keep the ones that execute correctly.

2.1 Prompt: make the DSL spec explicit (including function signatures)

Right now the model is guessing the DSL purely from examples. You can help it by including a compact DSL spec in the system message or an initial “guidance” example:

You are an ARC DSL generator.

DSL specification (read carefully and follow exactly):
- Grids are NumPy-like 2D arrays of integers in [0..9].
- replace(a, x, y) returns a copy of array a where every cell equal to x is replaced by y.
  - x and y are single integers (colors), not coordinates. Example:
    Input:
    [[1,2],
     [2,2]]
    replace(input, 2, 9) -> [[1,9],[9,9]]

- put(a, b, (i,j)) places array b into a at top-left coordinate (i,j).
- fill_region_at(a, (i,j), x) flood-fills from coordinate (i,j) in a, replacing the starting color with x.
...

When you write programs, you must respect these signatures.
Return only DSL expressions that can be parsed and executed by this DSL.

And in the assistant output:

  • Add a few “gold” DSL examples that use replace correctly, explicitly as:

    Logic1:
    replace(replace(input,8,7),1,0)
    

Thus you’re not just showing DSL in context, but also telling the model what is allowed.

Studies on neural program synthesis strongly emphasize that clear DSL definitions and type rules significantly improve syntactic correctness, especially when coupled with constraints at decoding time.

2.2 Training: potentially simplify to “1 pair → 1 logic” as an experiment

Your current training packs 5 I/O pairs and 5 LogicN into one conversation. That is fine, but it makes the task more complex:

  • The model has to map multiple I/O pairs to multiple programs at once.
  • At inference you then “tilt” the usage: you want 5 candidates for 1 test input.

As a controlled experiment, you can build a second dataset where each JSONL line is:

System: same
User:  Input:\n<one grid>\nOutput:\n<one grid>
Assistant:  <one DSL expression>

Then:

  • Fine-tune a separate copy of Qwen2.5-0.5B-Instruct on that simpler mapping.

  • At inference, test the simple format:

    System: same
    
    User:
    Input:
    [[5,8,6],[5,8,6],[5,8,6]]
    Output:
    [[1,9,2],[1,9,2],[1,9,2]]
    DSL:
    

If the model outputs clean DSL here (often with correct function signatures), you know the core mapping is learnable; the extra complexity in multi-logic training is what hurts.

This “one example → one program” structure is standard in neural program synthesis datasets (e.g., RobustFill, FlashFill-like DSLs).

2.3 Decoding / search: execution-guided filtering

Even if each individual candidate DSL is weak, you can get strong performance by generating many programs and filtering them using your interpreter:

  1. For each ARC task (and each test input), build your prompt.

  2. Use sampling:

    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.95,
        num_return_sequences=K,  # e.g. 16, 32
        do_sample=True,
    )
    
  3. For each decoded program:

    • Try to eval() or parse it into your DSL.
    • If it fails to parse or uses invalid signatures (like replace(input,(0,0),8)), discard it.
    • If it parses, execute it on the training I/O pairs for that ARC task.
    • Keep only programs that reproduce all given training outputs exactly.

This is exactly the idea of execution-guided neural program synthesis / decoding: the model proposes programs; the interpreter acts as a correctness filter or reranking signal. This approach consistently outperforms pure “one-shot” decoding on difficult program-synthesis tasks, including SQL generation and more general DSLs.

For ARC-like DSLs, recent work explicitly shows that execution-guided neural program synthesis in a custom DSL significantly improves out-of-distribution performance.

Even with a weak 0.5B model, “propose many, test many” can give nontrivial success on ARC-style tasks.
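
As a concrete illustration, here is a minimal sketch of that propose-and-filter loop. It assumes `prompt` is the already chat-templated text and that `parse_and_run(program, grid)` is a hypothetical wrapper around your DSL interpreter that returns an output grid or raises on invalid programs.

import torch

def propose_and_filter(model, tokenizer, prompt, train_pairs, k=16):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            num_return_sequences=k,
        )
    prompt_len = inputs["input_ids"].shape[1]
    # Deduplicate the decoded candidates before executing them.
    candidates = {tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
                  for seq in out}
    survivors = []
    for program in candidates:
        try:
            # Keep only programs that reproduce every train output exactly.
            ok = all(parse_and_run(program, grid_in) == grid_out
                     for grid_in, grid_out in train_pairs)
        except Exception:
            continue  # unparsable or crashing program: discard
        if ok:
            survivors.append(program)
    return survivors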

2.4 Grammar constraints (optional but powerful)

If you want to be stricter about DSL syntax (e.g., replace(a, scalar, scalar) only), you can:

  • Write a small grammar for your DSL (e.g., in ANTLR, Lark, PEG).
  • Use grammar-constrained decoding or a library that supports grammar-based token masking.

This ensures that the model can’t output replace(input,(0,0),8) because (0,0) is not a valid term where a scalar is expected.

Recent work on grammar-aligned decoding / grammar-constrained decoding shows that you can enforce such constraints efficiently at decode time and drastically reduce invalid outputs.

This is extra engineering work, but it directly attacks your observed problem: syntactically valid yet semantically wrong signatures.
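
As a minimal sketch with the Lark parsing library: the grammar below covers only an assumed, simplified subset of your DSL (two primitives with fixed arities), so treat it as a starting point rather than your real grammar. Even used only as a post-hoc filter (rather than for token masking during decoding), it already rejects type-broken calls.

# pip install lark
from lark import Lark
from lark.exceptions import LarkError

DSL_GRAMMAR = r"""
    start: expr
    expr: "input"                                 -> input_grid
        | "replace" "(" expr "," INT "," INT ")"  -> replace
        | "put" "(" expr "," expr "," coord ")"   -> put
    coord: "(" INT "," INT ")"
    %import common.INT
    %import common.WS
    %ignore WS
"""

parser = Lark(DSL_GRAMMAR)

def is_syntactically_valid(program: str) -> bool:
    try:
        parser.parse(program)
        return True
    except LarkError:
        return False

print(is_syntactically_valid("replace(replace(input,8,2),1,0)"))  # True
print(is_syntactically_valid("replace(input,(0,0),8)"))           # False: tuple where a color is expected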


3. Would a bigger / different model actually do better?

Short answer: yes, but it’s not magic.

3.1 Limits of 0.5B models

A recent study of ~0.5B reasoning LLMs (including Qwen2.5-0.5B-Instruct) shows:

  • They can do nontrivial reasoning and math but remain significantly weaker than multi-billion parameter models.
  • They’re good enough to pick up surface patterns and simple compositional rules, but complex multi-step, type-sensitive program synthesis is still hard.

Your DSL + ARC tasks are on the “hard compositional” side; they demand understanding of:

  • Grid structure,
  • Local operations (replace, put, flood-fill, etc.),
  • Composing them in precise ways,
  • Not mixing function signatures.

For this, 0.5B is borderline:

  • It can learn some patterns (you’re already seeing that).
  • But you should expect lots of “almost correct” or template-like outputs unless you add strong constraints or a search loop.

3.2 More suitable models on roughly the same scale

If you stay near 0.5B but want better code/DSL behavior:

  • Consider Qwen2.5-Coder-0.5B-Instruct, which is specifically trained for code generation and code reasoning.

    • It has stronger priors about function signatures, types, and program structure.
    • You can still LoRA-fine-tune it in 4-bit on your RTX 2050, similar to what you’ve done.

In many code-synthesis tasks, code-specialized models outperform general instruction models of the same size.

3.3 Larger models (1.5B–7B) improve both pattern learning and generalization

ARC-AGI-like tasks are being attacked by:

  • 1.5B–7B general reasoning models,
  • code-specialized models (e.g., Qwen2.5-Coder-7B), and
  • multi-step search frameworks combining LLM suggestions with program execution.

A 7B code model:

  • Will almost certainly respect function signatures better.
  • Will propose more diverse and coherent DSL programs.
  • May sometimes solve simple ARC tasks end-to-end with very little extra machinery.

But:

  • Fine-tuning a 7B model on an RTX 2050 is painful or impossible without heavy tricks; you can often run 7B inference via remote GPUs or CPU-only for testing, but not fine-tune it locally.
  • Even for 7B, the best ARC-AGI systems use execution-guided search and sometimes test-time training, not just a single pass.

So a bigger model is a win, but it does not remove the need for:

  • Good DSL specification,
  • Clean training targets, and
  • Execution-guided filtering.

4. Answering your core question directly

Can I do better or things are correct and a bigger model can do better?

  1. You can definitely do better with your existing 0.5B model by:

    • Making the DSL spec and function signatures explicit in the prompt.
    • Ensuring training examples have only valid DSL (no noisy / inconsistent patterns).
    • Trying a simpler “1 pair → 1 DSL” training variant.
    • Using execution-guided search: generate many candidate DSLs, parse and run them, filter by correctness.
  2. But there is also a real capacity ceiling:

    • 0.5B general models struggle with complex program synthesis, especially type-sensitive DSLs, compared to larger or code-specific models.
    • If you move to Qwen2.5-Coder-0.5B or a 1.5B+ model and keep the same pipeline, you should see better grammar adherence and more meaningful candidate programs.

So your current results (syntactically DSL-like but semantically wrong, repetitive candidates) are normal and expected for:

  • A small model,
  • A fairly complex DSL, and
  • No grammar constraints or execution-guided search.

You’re on the right track; the next big improvements come from:

  • Tightening the DSL spec and training format.
  • Adding execution-guided filtering.
  • Then, if hardware permits, upgrading to a stronger code-specialized model.

Thanks again. Actually, I designed a C++ engine that finds DSL logic between the “train” input/output pairs of ARC tasks (74% on “training”, 28% on “evaluation”). So I have a JSONL file with over 1500 valid DSL logics. Aside from this prompt issue (getting correct DSL output from the LLM), do I have a chance compared to the LLM + DSL approaches that have already been tried?

1 Like

That C++ engine is probably pretty good…


Yes, you absolutely “have a chance” in the sense that your line of attack is aligned with what the better ARC systems are doing today: program synthesis over a DSL, with learning components assisting search, not “just an LLM”. But you should calibrate expectations:

  • Your current numbers (≈74% train / ≈28% eval) are in the same qualitative regime as many serious research systems that use symbolic/DSL methods on ARC.
  • The very top ARC Prize 2024 solvers reach ~53–55% on the private test set, using a lot more engineering, deep learning, and data than you are using right now. (arXiv)
  • Pure LLM approaches, even when heavily fine-tuned on synthetic ARC-like data, still hover roughly in the <5–10% range on full ARC-AGI. (ARC Prize)

So your approach is not outclassed by “LLM+DSL”; it is inside that design space. The question is how you extend it.


1. Where your current system sits in the landscape

1.1 ARC and what counts as “good”

ARC-AGI (and ARC-AGI-2) are designed so that:

  • Humans can usually solve a task from a handful of IO pairs.
  • Machines are forced to generalize beyond training tasks, not memorize. (Emergent Mind)

As of the 2024 technical report and follow-up overviews:

  • Best open systems on the original ARC-AGI private eval set are ~55%. These are pipelines combining:

    • Pretrained LLMs (e.g. 8B Mistral variants),
    • Data augmentation on ARC tasks,
    • Test-time training / fine-tuning,
    • Heuristic candidate selection and program search. (arXiv)

So your 28% eval with a hand-engineered C++ DSL solver is not laughable; it is within the range of serious entries that do not yet combine every trick in the book.

1.2 Two extremes: small DSL vs huge LLM

François Chollet has described two extremes for ARC-like problems: (dwarkesh.com)

  • Discrete program search side

    • Small DSL (100–200 primitives).
    • Very deep program search.
    • Your C++ solver is clearly here.
  • LLM side

    • Effectively a huge “latent DSL” with millions of internal building blocks learned from massive data.
    • Very shallow explicit search, often just greedy decoding + maybe reranking.

Your current system (C++ DSL search + a small Qwen LoRA model that emits DSL code) sits between these:

  • You still have a small, explicit DSL.
  • You add an LLM as a learned prior over programs.

That is exactly the direction several recent ARC works are exploring.


2. What others have done with DSL + LLM on ARC

2.1 Pure DSL program synthesis

Several groups have gone the “DSL only” route:

  • Hodel’s ARC-DSL defines a small but expressive DSL with hand-written solutions for all train tasks. (GitHub)
  • ILP-based ARC solvers design a DSL as background knowledge and use Inductive Logic Programming to synthesize logic programs from IO pairs. (arXiv)

These systems tend to:

  • Do well on a subset of tasks where the DSL is expressive enough.
  • Have trouble with generalization to evaluation tasks when the DSL or search heuristics do not cover the needed abstractions.

Your C++ engine + DSL is conceptually similar, with different engineering choices.

2.2 Neural / LLM-guided program synthesis over a DSL

There is now a family of neural program synthesis approaches in the ARC domain:

  • Execution-guided neural program synthesis (EG-NPS) with a custom DSL for ARC. This work compares:

    • Non-execution-guided NPS,
    • Execution-guided NPS,
    • Test-time fine-tuning of neural models.
      Execution-guided methods come out ahead on out-of-distribution generalization. (arXiv)
  • LLM-legible ARC DSL + LLMs:

    • The arc-dsl-llm project rewrites Hodel’s DSL to be easier for LLMs to read and write. (GitHub)
    • “Capturing Sparks of Abstraction for the ARC Challenge” feeds complete DSL solutions into a big LLM (Gemini), asks it to refactor and comment them, and uses those abstractions as a resource for smaller local models. (arXiv)
  • Natural-language-enhanced neural program synthesis:

    • A Stanford project uses an ARC DSL and compares baselines with and without natural language guidance. The LLM generates DSL programs given IO pairs + textual hints. (Stanford University)

These are very close to what you are doing:

  • A fixed DSL,
  • A neural model (CNN/Transformer/LLM) that outputs DSL code,
  • Optionally execution feedback to prune or guide search.

Your dataset of ~1500 valid DSL programs is exactly the sort of supervision these methods try to construct (often much more expensively).

2.3 Reflection / multi-agent systems: LLM + DSL solver

Recent “reflection systems” are especially relevant:

  • The Reflection System / MASR combines:

    • A DSL-based program synthesis solver,
    • One or more LLMs that propose or repair programs,
    • “Reflection” loops where LLMs analyze failures and suggest changes. (OpenReview)
  • They also introduce AugARC, an augmented ARC benchmark with more data per task; fine-tuning LLMs on AugARC significantly improves LLM-based performance compared to vanilla ARC. (OpenReview)

Key point: they empirically find that:

  • LLM alone on ARC is weak.
  • DSL solver alone is limited.
  • LLM + DSL solver with reflection is strictly better than either alone.

This directly supports the idea that your C++ solver + LLM is a sane and competitive direction.


3. How your current setup compares

You have:

  1. A C++ DSL engine achieving ≈74% on training tasks and ≈28% on evaluation tasks.

  2. A Qwen 0.5B LoRA model trained on ≈1500 examples of:

    • Input grid(s),
    • Output grid(s),
    • A valid DSL program connecting them.

Relative to existing work:

  • Many published DSL-only solvers focus on subsets of tasks or achieve similar ballpark evaluation accuracies when measured under realistic constraints. (ResearchGate)
  • Pure LLM fine-tuning, even on millions of synthetic ARC-like tasks, tends to score around 10% on full ARC-AGI. (ARC Prize)
  • Your 28% eval performance with a relatively modest engineering setup is already better than most pure LLM approaches and is solid for a single-person project.

So yes, you “have a chance” in the sense that:

  • Your system is in the same conceptual class as several current research lines (NPS + DSL + LLM guidance).
  • There is a lot of headroom above your current 28% and below the current SoTA ~55%, so there is room for meaningful progress without needing to “win ARC” outright.

4. Where you can do better: concrete directions

Below are directions that are both aligned with current research and compatible with your resources.

4.1 Use your C++ solver as a data generator, not just a labeler

Right now you have ~1500 DSL programs, each derived from train tasks.

That is small compared to:

  • Synthetic ARC-like datasets with millions of tasks used in deep-learning-oriented ARC papers. (arXiv)

You have an advantage:

  • A working solver that can generate correct DSL programs for many tasks.

You can:

  1. Augment existing tasks

    • Apply transformations preserving semantics:

      • grid rotations/flips,
      • color remaps,
      • translation of patterns within the grid,
      • scaling where the DSL’s semantics remain valid.
    • Use your solver to recompute DSL solutions for these augmented tasks where necessary.

  2. Generate synthetic tasks

    • Randomly sample short DSL programs from your primitive set,
    • Execute them on random input grids,
    • Store (input, output, DSL) triples as training data.

This follows the logic of AugARC and test-time training work that show data augmentation and synthetic tasks can substantially boost deep learning performance on ARC-like problems. (OpenReview)

Your LLM then trains on far more than 1500 examples, yet remains grounded in your actual DSL semantics.
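
For the second option (synthetic tasks), a minimal sketch could look like this. It assumes a hypothetical run_dsl(program, grid) call into your interpreter (for example a Python binding to the C++ engine) and restricts the sampled programs to nested replace chains for simplicity.

import json
import random

def random_grid(h, w, colors=(0, 1, 2, 3, 8)):
    return [[random.choice(colors) for _ in range(w)] for _ in range(h)]

def random_program(depth=2):
    # Assumed primitive set: only nested replace(expr, color, color) chains.
    expr = "input"
    for _ in range(depth):
        src, dst = random.sample(range(10), 2)
        expr = f"replace({expr},{src},{dst})"
    return expr

with open("data/synthetic_logics.jsonl", "w") as f:
    for _ in range(10_000):
        program = random_program(depth=random.randint(1, 3))
        grid_in = random_grid(random.randint(3, 10), random.randint(3, 10))
        grid_out = run_dsl(program, grid_in)  # hypothetical interpreter call
        f.write(json.dumps({"input": grid_in, "output": grid_out, "logic": program}) + "\n")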

4.2 Move from “one-shot DSL” to “LLM-guided search”

Right now you seem to:

  • Feed a few IO examples to the fine-tuned Qwen model,
  • Ask it to emit one DSL string,
  • Eval it; if it fails, you count it as wrong.

Instead, look at what EG-NPS and reflection-style systems do: (arXiv)

  1. Generate multiple candidates

    • Use num_return_sequences and/or sampling (small temperature, nucleus sampling) to get a batch of DSL candidates.
  2. Execute and score

    • Run all programs on the train examples for that task,
    • Filter those that match all train pairs,
    • Optionally rank them by simplicity (AST depth, length, primitive counts); see the small sketch below.
  3. Optionally repair / reflect

    • If no candidate works, give error messages or differences as feedback to the LLM, and ask for a revised DSL (reflection step).

Your C++ engine already gives you fast program execution and correctness checking. That is exactly what EG-NPS calls execution guidance. Using the LLM only as a proposal distribution over programs and letting the solver enforce correctness is far more powerful than using the LLM as a one-shot predictor.
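
For the “rank by simplicity” step, a crude but serviceable sketch is to sort surviving programs by nesting depth and length as a proxy for AST size. Here `survivors` is assumed to be the list of candidate programs that passed the execution filter.

def simplicity(program: str):
    depth, max_depth = 0, 0
    for ch in program:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    # Sort first by nesting depth, then by raw length.
    return (max_depth, len(program))

best_first = sorted(survivors, key=simplicity)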

4.3 Improve the DSL and its LLM-legibility

Several ARC groups report that DSL design is a major bottleneck: too high-level and specific, and generalization suffers. (lewish.io)

You can:

  • Make the DSL more compositional:

    • Fewer, more general primitives (e.g. region, object, mirror, flood-fill, transform).
    • Encourage programs composed of reusable building blocks rather than ad-hoc chains.
  • Make the DSL more LLM-friendly:

    • Clear, regular syntax (no tricky positional arguments; named parameters can help).
    • Consistent naming, limited ambiguity.
    • Possibly wrap it in an arc-dsl-llm-style surface syntax, then compile to your internal representation. (GitHub)

If you train your Qwen model only on this cleaned, normalized DSL, it will have a much easier time producing syntactically correct programs.

4.4 Borrow ideas from test-time training (TTT / TTFT)

Recent deep-learning-centric work on ARC shows that test-time training / test-time fine-tuning (TTT/TTFT) can yield large gains: models adjust their weights on the few IO examples of a given task just before predicting test outputs. (arXiv)

You can analogously:

  • Keep a small copy of your Qwen+LoRA adapter,

  • For each new ARC task:

    • Fine-tune the LoRA adapter for a few dozen steps on synthetic augmentations of that task’s train pairs,
    • Then decode DSL programs for the test grids.

Because Qwen 0.5B + LoRA is relatively light, a small number of gradient steps per task may be feasible on your hardware, especially with 4-bit loading (which you already use).
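
A minimal sketch of per-task adapter tuning is below. It assumes `tokenized_task_examples` is a small Dataset you have already built from augmentations of the current task's train pairs (not shown), and that you reload the original adapter before moving to the next task.

from transformers import Trainer, TrainingArguments

def test_time_finetune(model, tokenized_task_examples, steps=30):
    args = TrainingArguments(
        output_dir="./ttt-tmp",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_steps=steps,           # a few dozen steps per task
        learning_rate=5e-5,
        fp16=True,
        logging_steps=10,
        report_to="none",
    )
    Trainer(model=model, args=args, train_dataset=tokenized_task_examples).train()
    return model  # reload / reset the original LoRA adapter before the next task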

4.5 Combine your engine with larger “remote” LLMs for analysis

Even if you keep your competition system small and offline, you can:

  • Use large hosted LLMs (Gemini, GPT-4, etc.) during development to:

    • Explain and cluster your DSL programs (extract strategies, macros), similar to “Sparks of Abstraction”. (arXiv)
    • Suggest new DSL primitives or macros.
    • Generate high-level textual descriptions of tasks, which may help your smaller local model (as in natural-language-guided synthesis). (Stanford University)

You do not ship those big models, but you re-use the structure they reveal.


5. Do you have a shot vs existing LLM+DSL approaches?

Reframing “do I have a chance?” into more precise questions:

5.1 Can your approach be competitive with existing LLM+DSL research?

Yes, if:

  • You leverage your C++ engine for massive synthetic data and execution-guided candidate pruning.

  • You adjust your training to:

    • Use many more examples,
    • Use a normalized DSL syntax,
    • Use candidate-set decoding instead of one-shot answers.

These are exactly the levers used in recent NPS + LLM systems on ARC. (arXiv)

5.2 Can your approach reach the top of the ARC leaderboard as is?

Unlikely in its current form:

  • State-of-the-art systems:

    • Use much larger models (≈8B parameters and up),
    • Train on millions of augmented tasks,
    • Employ ensembles and sophisticated test-time training. (arXiv)
  • Your 1500-example fine-tuning dataset + single small Qwen adapter is too small to compete with that, if you keep everything else fixed.

But nothing stops you from:

  • Scaling data (via your solver),
  • Improving search (LLM-guided, execution-guided),
  • And incrementally upgrading model size as hardware allows.

5.3 Is your DSL+LLM approach “already tried and doomed”?

No:

  • Several recent surveys explicitly argue that hybrid program synthesis + learning is the promising direction for ARC (as opposed to “LLM alone”). (ARC Prize)
  • Reflection systems and EG-NPS are early prototypes, not fully optimized. There is clear room for new designs, better DSLs, and better search strategies. (arXiv)

Your unique asset is a working, fast C++ solver with a real DSL and 1500+ discovered programs. That is a valuable starting point many teams do not have.


6. Summary

  • ARC is still hard. Even with big LLMs and heavy engineering, SoTA is ≈55% on the original private test set, far below the 85% prize target. (arXiv)

  • Pure LLM approaches are weak; DSL + program search + learning is currently the most promising family of methods. (ARC Prize)

  • Your system (C++ DSL solver + 0.5B Qwen LoRA trained on 1500 DSL programs) is conceptually aligned with current research and already outperforms naive LLM-only baselines.

  • You have clear avenues to improve:

    • Use your solver as a data generator (synthetic tasks, AugARC-style augmentation),
    • Switch from one-shot DSL prediction to LLM-guided, execution-guided search,
    • Make your DSL more compositional and LLM-legible,
    • Add test-time fine-tuning per task,
    • Use large external LLMs during development to mine abstractions and propose better DSLs.

If you implement even a subset of these, your approach remains competitive with current LLM+DSL work and can plausibly push beyond your current 28% evaluation performance, even without matching the full complexity of SoTA ARC Prize entries.

A setup that uses an LLM in a hybrid with a solver would look something like this.


You can improve the LLM side in three main ways:

  1. Use a more appropriate type of model (code-oriented, not just general instruction).
  2. Change how you fine-tune it on your DSL data.
  3. Change how you use it at inference (search, constraints, reflection), instead of expecting one perfect program from one deterministic decode.

Below I’ll assume you keep your C++ DSL engine and ARC pipeline, and you want to upgrade only the LLM part.


0. Your current position

Right now:

  • Engine: C++ DSL solver that achieves ≈74% on ARC “training” and ≈28% on “evaluation”.
  • Data: >1500 tuples (inputs, outputs, DSL_program) from that solver.
  • LLM: Qwen2.5-0.5B-Instruct, LoRA fine-tuning in 4-bit on RTX 2050.
  • Use: Feed a few I/O examples for a task, ask model for one DSL program, eval it, check if it solves the task.

This is already a solid hybrid system: explicit DSL + neural prior over programs. It is similar in spirit to recent neural program synthesis and reflection systems for ARC. (arXiv)

What is still weak:

  • The model is small (0.5B) and general, not code-specialized.
  • Training data is relatively small and not yet heavily augmented.
  • Inference uses one deterministic decode with no search or constraints.

1. Better model choice, under your hardware

Your GPU is the hard constraint, so there are two tiers:

1.1 Local, small models you can still fine-tune

You should move from a generic instruction model to a code-specialized 0.5B model:

  • Qwen/Qwen2.5-Coder-0.5B-Instruct

    • Same parameter scale as your current model (0.5B).
    • Trained specifically on code, with improved code generation and reasoning. (Hugging Face)
    • Available in HF transformers and GGUF variants, designed to run on modest hardware. (Hugging Face)

Why it is better for your use:

  • It has a stronger prior for:

    • Function signatures and types.
    • Nested calls and syntax.
    • “Program-looking” output and bracket discipline.
  • Your DSL is small, but code-oriented models adapt to DSLs more easily than general chat models.

With 4-bit loading and LoRA, Qwen2.5-Coder-0.5B-Instruct is almost a drop-in replacement for your current setup.

If VRAM plus CPU offload allow, you might also test:

  • Qwen2.5-Coder-1.5B or 3B-Instruct (in 4-bit, offload to CPU if needed). (Hugging Face)

These will give better program synthesis, but fine-tuning them on RTX 2050 may be borderline. Start with 0.5B coder; only then explore 1.5–3B.

1.2 Optional: use larger models only as offline “teachers”

You do not need to fine-tune large models on your GPU. You can:

  • Call a larger open model (e.g., Qwen2.5-Coder-7B/14B/32B) on a remote machine or cloud. (Hugging Face)

  • Use it to:

    • Produce extra DSL solutions for tasks your C++ engine cannot solve yet.
    • Refactor your existing DSL programs into more compositional forms (macros, reusable patterns).
    • Suggest new DSL primitives or high-level patterns.

Then you use only the distilled data and patterns to train your small local model.


2. Better fine-tuning strategy

The model you choose matters, but how you fine-tune it can matter just as much. From your logs and prompts, these are high-leverage changes.

2.1 Clean “one conversation → one output” structure

Right now one training conversation contains 5 I/O pairs and 5 logics. That works, but it makes the learning problem harder.

For the LLM side, consider constructing a simple variant of your dataset:

  • For each (input, output, DSL) triple, create:

    <|im_start|>system
    You are an ARC DSL generator. Given an input grid and an output grid,
    produce a DSL expression that transforms the input into the output.
    Return only the DSL expression, nothing else.
    <|im_end|>
    
    <|im_start|>user
    Input:
    [[...]]
    Output:
    [[...]]
    <|im_end|>
    
    <|im_start|>assistant
    replace(replace(input,8,2),1,0)
    <|im_end|>
    
  • Train on many such conversations, where each example has:

    • one input grid,
    • one output grid,
    • one DSL program in the assistant.

This gives the model a very clear mapping: I/O → single DSL string. It learns the grammar and signatures without the combinatorics of multiple logics per conversation.

You can still keep your 5-pair format as a second dataset, but having the 1-pair format as a main training target makes the behavior at inference less ambiguous.
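
Here is a sketch of that conversion, reusing the "input" / "output" / "logic" field names from your JSONL file; the model name and system text are the ones suggested in this thread.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")

SYSTEM = ("You are an ARC DSL generator. Given an input grid and an output grid, "
          "produce a DSL expression that transforms the input into the output. "
          "Return only the DSL expression, nothing else.")

def to_messages(ex):
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Input:\n{ex['input']}\nOutput:\n{ex['output']}"},
        {"role": "assistant", "content": ex["logic"]},
    ]}

dataset = load_dataset("json", data_files="data/logics.jsonl")["train"].map(to_messages)
text = tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False)  # sanity check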

2.2 Mask labels to assistant tokens only

When using apply_chat_template and Trainer, make sure:

  • labels are -100 (ignored) for system and user tokens.
  • labels equal to input_ids only for assistant tokens.

This makes the model focus on “given this user content and system spec, what assistant DSL should I output”, not on predicting the user text itself.

This “label masking” is the default behavior in libraries like TRL for chat fine-tuning, but with custom tokenization it can be easy to get wrong. Many program synthesis works emphasize training only on the output sequence, not the entire concatenated prompt. (OpenReview)
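
Continuing the sketch from 2.1, one simple way to mask labels is to tokenize the prompt part (system + user, with the generation prompt appended) and the full conversation separately, then set the prompt span to -100. This assumes the tokenized prompt is a prefix of the tokenized full conversation, which holds for ChatML-style templates like Qwen's.

def tokenize_with_masked_labels(ex, max_length=1024):
    messages = ex["messages"]
    # Prompt = system + user turns, ending with the assistant header.
    prompt_ids = tokenizer.apply_chat_template(
        messages[:-1], add_generation_prompt=True, tokenize=True
    )
    full_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=False, tokenize=True
    )[:max_length]
    labels = list(full_ids)
    for i in range(min(len(prompt_ids), len(labels))):
        labels[i] = -100  # no loss on system/user tokens
    return {"input_ids": full_ids,
            "attention_mask": [1] * len(full_ids),
            "labels": labels}

tokenized = dataset.map(tokenize_with_masked_labels, remove_columns=dataset.column_names)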

2.3 Grow your DSL dataset with your solver

1500 programs is small. You have a C++ engine capable of solving many tasks; use it as a generator of supervised data, similar to what AugARC does with augmented tasks. (OpenReview)

You can:

  • Apply robust augmentations to inputs:

    • Rotations (90, 180, 270 degrees), flips, color permutations, permutations of object order.
  • For each augmented input grid, recompute the DSL program where possible and create additional (input, output, DSL) triples.

  • For each DSL you already discovered, randomly generate new input grids and apply the DSL to get new synthetic tasks.

Papers introducing AugARC show that augmenting ARC tasks into thousands to millions of variants significantly improves LLM performance on ARC-like reasoning. (OpenReview)

You can do a smaller version of this with your own solver and still get a big boost over 1500 tasks.
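
A sketch of the rotation/flip part is below; color permutations are omitted here because they would also require remapping the color constants inside the DSL. The augmented pairs are written out without DSL, so your C++ solver can recompute the programs where possible, as described above.

import json
import numpy as np

def dihedral_variants(grid):
    a = np.array(grid)
    for k in range(4):
        r = np.rot90(a, k)
        yield r.tolist()             # rotation by k * 90 degrees
        yield np.fliplr(r).tolist()  # ... plus a horizontal flip

with open("data/logics.jsonl") as src, open("data/augmented_pairs.jsonl", "w") as dst:
    for line in src:
        ex = json.loads(line)
        # The same transform is applied to input and output, so pairs stay consistent.
        for g_in, g_out in zip(dihedral_variants(ex["input"]), dihedral_variants(ex["output"])):
            dst.write(json.dumps({"input": g_in, "output": g_out}) + "\n")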

2.4 Make the DSL LLM-legible and type-stable

Your DSL is already regular, but you can further:

  • Make function signatures unambiguous:

    • replace(a, x, y) where x and y are colors; never coordinates.
    • put(dst, src, (i,j)) where third arg is always a tuple.
  • Avoid overloaded meanings:

    • Do not reuse names like region for both coordinate lists and boolean masks.

Work on grammar-constrained decoding and DSL generation recommends simplifying DSL syntax, avoiding overloaded tokens, and making the grammar as context-free as possible so that both LLM and parser can handle it. (NeurIPS Proceedings)

You can even maintain an internal low-level DSL and expose a slightly more verbose, readable DSL to the model, then compile it down.


3. Better inference: treat the LLM as a proposal engine, not an oracle

The biggest structural improvement is to stop expecting “one forward pass = one perfect program.”

Instead, do what execution-guided neural program synthesis and ARC reflection systems do: let the LLM propose programs, and let your DSL engine enforce correctness. (OpenReview)

3.1 Multi-candidate sampling + execution filtering

For each ARC task:

  1. Build your prompt (few-shot or single-shot).

  2. Generate multiple DSL candidates:

    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=K,  # e.g. 16 or 32
        do_sample=True,
    )
    
  3. Decode each output to a candidate DSL program.

  4. For each candidate:

    • Try to parse/eval it into your DSL. If it fails, discard.
    • Execute on all train I/O pairs for that ARC task.
    • Keep candidates that match all train outputs exactly.

This is “execution-guided decoding” or “execution-guided program synthesis”: you use the semantics of the DSL to score candidates. It is known to significantly improve performance in program synthesis tasks compared to pure neural decoding. (OpenReview)

Your C++ engine makes this cheap.

3.2 Grammar- or type-constrained decoding

To reduce nonsense like replace(input,(0,0),8), add syntactic and type constraints during decoding:

  • Define a context-free grammar for your DSL.
  • Use a grammar-constrained or grammar-aligned decoding algorithm to restrict allowed next tokens. (NeurIPS Proceedings)

This ensures the model cannot produce syntactically invalid programs.
You can go further and encode type constraints (for “this position must be a scalar, not a tuple”), as suggested in recent work on LLM-hardened DSLs and constrained DSL generation. (LinkedIn)

Even with a small model, this dramatically increases the fraction of usable outputs.

3.3 Reflection loop: use the LLM to debug its own programs

Following the Reflection System for ARC: (ResearchGate)

  1. Generate an initial program.

  2. Execute it; if it fails (wrong output):

    • Compute a short description of how it failed (e.g., differences between predicted and target output, or an error trace).

    • Feed this back to the model in a second prompt:

      Your previous DSL program produced this wrong output on the training examples.
      Here is the difference between expected and actual outputs: ...
      Fix the DSL program. Return only the fixed DSL.
      
  3. Repeat for a small number of iterations.

This multi-step process lets even small models gradually refine programs, especially when combined with execution feedback.
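
Here is a sketch of one such repair loop, assuming the hypothetical parse_and_run(program, grid) interpreter wrapper used earlier and a generate_dsl(prompt) helper around your sampling code.

def describe_failure(program, train_pairs):
    for idx, (grid_in, expected) in enumerate(train_pairs):
        try:
            actual = parse_and_run(program, grid_in)
        except Exception as e:
            return f"Pair {idx + 1}: the program failed to execute ({e})."
        if actual != expected:
            return (f"Pair {idx + 1}: expected\n{expected}\n"
                    f"but the program produced\n{actual}")
    return None  # correct on all train pairs

def reflect_and_repair(program, train_pairs, max_rounds=3):
    for _ in range(max_rounds):
        failure = describe_failure(program, train_pairs)
        if failure is None:
            return program
        repair_prompt = (
            "Your previous DSL program produced a wrong result on the training examples.\n"
            f"Program:\n{program}\n"
            f"{failure}\n"
            "Fix the DSL program. Return only the fixed DSL."
        )
        program = generate_dsl(repair_prompt)  # hypothetical: one sampled completion
    return None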


4. Prioritized plan from where you are

If you want a concrete roadmap:

  1. Swap to a code model, same size

    • Move from Qwen2.5-0.5B-Instruct to Qwen2.5-Coder-0.5B-Instruct and repeat your LoRA fine-tune. (Hugging Face)
  2. Clean up the training format

    • Build a “1 input, 1 output → 1 DSL” dataset from your 1500 programs.
    • Train with proper chat templates and label masking so only assistant DSL tokens are supervised.
  3. Augment data with your C++ solver

    • Use rotations, flips, color permutations, and synthetic programs to expand to tens or hundreds of thousands of examples, following the spirit of AugARC. (OpenReview)
  4. Change inference to “sample K candidates + execute + filter”

    • Use sampling, generate K programs, execute all, keep the ones that fit all training examples.
    • This alone can significantly improve success rate.
  5. Add grammar / type constraints once the basics work

    • Implement a small grammar for your DSL and plug it into constrained decoding. (NeurIPS Proceedings)
  6. Optionally, add reflection

    • For hard tasks, let the model iteratively fix its own programs using execution feedback, similar to MASR. (ResearchGate)

Each of these steps improves the LLM side without discarding the work already done in C++ and your DSL.


Summary

  • A better model within your constraints is a code-specialized small model (Qwen2.5-Coder-0.5B-Instruct) rather than a generic chat model. (Hugging Face)

  • A better approach is to treat the LLM as a probabilistic proposal engine over DSL programs, not as an oracle:

    • Larger and better-structured training data (generated by your solver). (OpenReview)
    • Execution-guided, multi-candidate decoding. (OpenReview)
    • Grammar- and type-constrained decoding to enforce DSL syntax. (NeurIPS Proceedings)
    • Optional reflection loops for repair. (ResearchGate)

Those changes together give you a realistic and research-aligned way to push the LLM side well beyond your current behavior, while staying within your RTX 2050 constraints.

Thanks.
I tried to fine-tune the Qwen/Qwen2.5-1.5B-Instruct model only on input/output pairs with DSL logic (no batch per ARC task) but at inference I get a (bad) grid instead of a DSL…

1 Like

With a 1.5B LLM, unlike the 0.5B case, structured output becomes an option, but this might be fixable just by improving the fine-tuning setup…?


You are hitting a supervision / format problem, not a low-level bug.

Fine-tuning Qwen2.5-1.5B-Instruct only on Input/Output + DSL and still getting a grid at inference usually means:

  • The model is still being trained to emit grids somewhere in the sequence.
  • Or the inference prompt does not match what you trained on.
  • Or both.

I’ll structure this as:

  1. Typical failure modes that produce “grid instead of DSL”.
  2. Concrete fixes for Qwen2.5-1.5B-Instruct.
  3. A short checklist for your next run.

1. Likely causes

1.1 The model is still supervised to produce grids

If your training text per sample looks like this:

Input:
[[...]]
Output:
[[...]]
Logic:
replace(...)

and you train with causal LM loss on the full sequence (e.g. DataCollatorForLanguageModeling(mlm=False)), then:

  • The model is trained to predict:

    • the grid after Input:
    • the grid after Output:
    • the DSL after Logic:
  • At inference, when you prompt with something like:

    Input:
    [[...]]
    Output:
    [[...]]
    Logic:
    

    the highest probability continuation is often still “more grid-ish stuff”, because the model’s pretraining and your fine-tuning both provide many “Output: [[grid]]”-style continuations, but only 1500 examples where “Logic:” is followed by a DSL expression.

So even though you intend “Logic:” → DSL, the LM objective is “continue the sequence”, and Output/grids are a strong prior.

This effect is stronger with an instruction model like Qwen2.5-Instruct, which has seen a lot of natural language and table-like outputs in pretraining. (Medium)

1.2 Assistant vs user tokens not separated

For chat models like Qwen2.5-Instruct, the chat template embeds roles as special tokens (<|im_start|>system, <|im_start|>user, <|im_start|>assistant, <|im_end|>) and the recommended practice is:

  • Format data as messages = [{role: "system" ...}, {role: "user"...}, {role: "assistant"...}].
  • Tokenize via tokenizer.apply_chat_template.
  • Mask labels so only the assistant span has loss; system+user tokens get label -100.

If you instead just passed a flat string and let all tokens be supervised, the model is rewarded for copying parts of user grids, “Output:” labels, etc. This makes it more likely to respond with grids or mixed text even after fine-tuning. Hugging Face’s TRL docs and recent issues around Qwen SFT emphasize correct chat-template usage and assistant-only labeling for Qwen2.5.
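
For orientation, apply_chat_template renders such a conversation roughly like this for Qwen2.5 (ChatML-style; exact whitespace and the default system text vary slightly across tokenizer versions). Only the assistant content plus its closing <|im_end|> should keep labels; everything else gets -100.

<|im_start|>system
You are an ARC DSL generator. ...<|im_end|>
<|im_start|>user
Input:
[[...]]
Output:
[[...]]<|im_end|>
<|im_start|>assistant
replace(replace(input,8,2),1,0)<|im_end|>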

1.3 Training / inference format mismatch

You say you now train “only on input/output pairs with DSL logic (no batch per ARC task)”. That is good structurally, but the details matter:

  • Training might be:

    <system> You are an ARC DSL generator...
    <user>   Input: ... Output: ...
    <assistant> replace(replace(input,8,2),1,0)
    
  • Inference might be:

    Here are several examples...
    Input1: ...
    Output1: ...
    Input2: ...
    Output2: ...
    Now, for this new input, output DSL...
    Input5: ...
    Logic5:
    

This changes several things:

  • New labels: Input1, Output1, Logic5 never appear in training.
  • Multiple I/O pairs per prompt, while training only had one.
  • In training, Logic: was only in assistant content; at inference, you might put Logic5: in the user message.

For a 1.5B model with 1500 samples, that kind of distribution shift is enough to push it back to “grid continuation” behavior.

1.4 Not slicing off the prompt correctly

If you decode the entire sequence (out[0]) instead of only the new tokens after the prompt, you may see grids that actually belong to the prompt, or a mix of prompt and completion.

For Qwen2.5, the usual pattern is:

input_ids = tokenizer.apply_chat_template(..., return_tensors="pt").to(device)
out = model.generate(input_ids=input_ids, max_new_tokens=128, ...)
prompt_len = input_ids.shape[1]
generated = out[:, prompt_len:]
text = tokenizer.decode(generated[0], skip_special_tokens=True)

If you don’t slice and you decode the full out, the appearance of grids can be misleading. Qwen examples on vLLM and HF docs show this pattern explicitly. (docs.vllm.ai)


2. Solutions

2.1 Make the assistant target “pure DSL”

For each training example, format it as a proper chat and ensure the assistant content is only the DSL expression.

Example dataset row:

messages = [
    {
        "role": "system",
        "content": (
            "You are an ARC DSL generator. Given an input grid and an output grid, "
            "produce a single DSL expression that transforms the input into the output. "
            "Return only the DSL expression."
        ),
    },
    {
        "role": "user",
        "content": (
            "Input:\n[[5,8,6],[5,8,6],[5,8,6]]\n"
            "Output:\n[[1,9,2],[1,9,2],[1,9,2]]"
        ),
    },
    {
        "role": "assistant",
        "content": "replace(replace(input,5,1),8,9)",  # no extra text
    },
]

Then:

  • Either let TRL’s SFTTrainer handle apply_chat_template and assistant-only masking for Qwen (recommended).

  • Or manually:

    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    labels = encoded["input_ids"].clone()
    # Assistant-only loss: re-render the chat without the assistant message and
    # mask everything up to its token length (assumes the rendered prompt is a
    # token-level prefix of the full text, true for ChatML-style templates).
    prompt = tokenizer.apply_chat_template(messages[:-1], tokenize=False, add_generation_prompt=True)
    prompt_len = len(tokenizer(prompt, add_special_tokens=False)["input_ids"])
    labels[:, :prompt_len] = -100
    encoded["labels"] = labels
    

The crucial point: no grid appears after the DSL in the assistant text. The sequence ends with the DSL. That way, the model is never incentivized to emit grids after the DSL.
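
To run the same masking over your whole JSONL with the plain Trainer you already use, the pieces can be wrapped in a datasets.map call plus a padding collator. This is a sketch: it assumes the input / output / logic keys from your original script, reuses your existing model and training_args, and uses DataCollatorForSeq2Seq only because it pads the labels column with -100.

from datasets import load_dataset
from transformers import DataCollatorForSeq2Seq, Trainer

SYSTEM = ("You are an ARC DSL generator. Given an input grid and an output grid, "
          "produce a single DSL expression. Return only the DSL expression.")

def to_features(ex):
    # Note: f-string rendering of a Python list inserts spaces after commas;
    # whatever rendering you pick here must be reused verbatim at inference time.
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Input:\n{ex['input']}\nOutput:\n{ex['output']}"},
        {"role": "assistant", "content": ex["logic"]},
    ]
    full = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    prompt = tokenizer.apply_chat_template(messages[:-1], tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer(full, add_special_tokens=False, truncation=True, max_length=1024)["input_ids"]
    prompt_len = len(tokenizer(prompt, add_special_tokens=False)["input_ids"])
    labels = [-100] * min(prompt_len, len(input_ids)) + input_ids[prompt_len:]
    return {"input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": labels}

dataset = load_dataset("json", data_files="data/logics.jsonl")["train"]
tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)
trainer = Trainer(
    model=model,                  # your QLoRA + LoRA model from before
    args=training_args,           # your existing TrainingArguments
    train_dataset=tokenized,
    data_collator=collator,
)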

2.2 Match inference to training format

At inference, use exactly the same structure:

messages = [
    {
        "role": "system",
        "content": (
            "You are an ARC DSL generator. Given an input grid and an output grid, "
            "produce a single DSL expression. Return only the DSL expression."
        ),
    },
    {
        "role": "user",
        "content": (
            "Input:\n[[8,1,3],[8,1,3],[8,1,3]]\n"
            "Output:\n[[9,5,4],[9,5,4],[9,5,4]]"  # if you reveal output
        ),
    },
]
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

out = model.generate(
    input_ids=input_ids,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding; temperature/top_p only take effect when do_sample=True
)
generated = out[:, input_ids.shape[1]:]
dsl = tokenizer.decode(generated[0], skip_special_tokens=True)

If you want few-shot examples, keep them consistent:

  • Still use role="user" for all Input/Output pairs.
  • Still use role="assistant" for a single final DSL (or for a worked example + a new query).
  • Use the same labels (Input:, Output:, DSL:) and grid formatting as in training.

Avoid switching to Input1/Output1/Logic5 unless you also trained on that exact pattern.
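
For example, a consistent few-shot prompt can reuse the training format verbatim. The grids and DSL below are the ones from the examples above; the worked example is a complete user/assistant exchange, and the new task is the final user turn.

messages = [
    {"role": "system", "content": "You are an ARC DSL generator. Given an input grid and an "
                                  "output grid, produce a single DSL expression. "
                                  "Return only the DSL expression."},
    # worked example (same shape as a training sample)
    {"role": "user", "content": "Input:\n[[5,8,6],[5,8,6],[5,8,6]]\n"
                                "Output:\n[[1,9,2],[1,9,2],[1,9,2]]"},
    {"role": "assistant", "content": "replace(replace(input,5,1),8,9)"},
    # new query
    {"role": "user", "content": "Input:\n[[8,1,3],[8,1,3],[8,1,3]]\n"
                                "Output:\n[[9,5,4],[9,5,4],[9,5,4]]"},
]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)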

2.3 Reduce “grid continuation” bias

You can push the model further away from outputting grids by:

  • Emphasizing in the system message:

    “You must never output grids or ‘Output:’ lines. Only output valid DSL code such as replace(...), put(...), etc.”

  • Showing a worked example in the prompt:

    Example:
    Input:
    [[5,8,6],[5,8,6],[5,8,6]]
    Output:
    [[1,9,2],[1,9,2],[1,9,2]]
    DSL:
    replace(replace(input,5,1),8,9)
    
    Now solve this:
    Input:
    ...
    Output:
    ...
    DSL:
    

This uses in-context learning to reinforce the format beyond the small SFT dataset.

2.4 Use a code-specialized variant if possible

Qwen2.5-Coder-1.5B-Instruct is trained heavily on code and structured syntax. It tends to respect function signatures and “code-like” outputs more reliably than the generic Instruct model. (Medium)

Given your DSL is code-like, using Qwen/Qwen2.5-Coder-1.5B-Instruct as the base for LoRA fine-tuning will:

  • Make it easier to learn that replace(a, x, y) takes scalars, not coordinates.
  • Reduce the tendency to drift back into natural language or grid dumps.

Training pipeline stays the same, just change the base model.

2.5 Post-filtering and search

Even with all of the above, a single greedy decode will often be wrong or grid-like. You can harden the system by:

  • Generating multiple candidates (num_return_sequences or repeated calls).

  • Discarding outputs that:

    • Contain [[ or ]] (grids).
    • Do not parse against your DSL grammar (e.g. replace(input,(0,0),8) rejected).
  • Executing valid DSL candidates on the train I/O pairs and keeping the ones that match.

This turns the LLM into a proposal generator; your C++ engine remains the correctness oracle.
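
The “reject grids and non-parsing outputs” step can start as a very cheap pre-filter before parsing and execution. looks_like_dsl is a hypothetical helper; tune the checks to your DSL.

def looks_like_dsl(text: str) -> bool:
    t = text.strip()
    # reject grid dumps and multi-line prose; keep single call-like expressions
    return "[[" not in t and "\n" not in t and t.endswith(")")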

Execution-guided decoding is standard in program synthesis and has been shown to beat single-shot LLM output on ARC-like DSLs. (Hugging Face)


3. Checklist

For your next Qwen2.5-1.5B run:

  • Each training sample = {system, user(Input+Output), assistant(DSL only)}.
  • Use apply_chat_template for both training and inference.
  • Mask labels so only assistant tokens have loss.
  • Training text ends at the DSL; no grids after it.
  • Inference prompt uses the same format and labels (Input, Output, DSL) as training.
  • Slice off the prompt before decoding the answer.
  • Optionally switch base to Qwen2.5-Coder-1.5B-Instruct.
  • Optionally add an execution-guided filter to reject grids and invalid DSL.

If you follow these, seeing grids instead of DSL after fine-tuning becomes the exception rather than the default, even with a 1.5B model.