# Qwen3.5-0.8B-thinking

A chain-of-thought fine-tuned version of Qwen/Qwen3.5-0.8B-Base, trained to reason step by step inside `<think>` tags before producing a final answer.
## Model Details
| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B-Base |
| Architecture | Qwen3_5ForCausalLM (hybrid linear / full attention) |
| Parameters | ~0.8B |
| Context window | 4096 tokens |
| Hidden size | 1024 |
| Layers | 24 |
| Attention heads | 8 (2 KV heads) |
| Vocabulary | 248,320 tokens |
| Precision | bfloat16 |
## Training Details

### Data
Fine-tuned on PursuitOfDataScience/0.5M-thinking, a dataset of ~500K examples with structured chain-of-thought reasoning wrapped in `<think>` / `</think>` tags, followed by a clean final answer. After filtering out examples that exceeded the 4,096-token context window, 244,997 examples remained for training.
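The length filter described above can be sketched as follows. The field names (`prompt`, `response`) and the tokenizer interface are assumptions for illustration, not the dataset's actual schema:

```python
def filter_by_length(examples, tokenizer, max_len=4096):
    """Keep only examples whose full prompt + response fits the context window.

    `examples` is an iterable of dicts with hypothetical `prompt`/`response`
    keys; `tokenizer` is any callable returning {"input_ids": [...]}.
    """
    kept = []
    for ex in examples:
        text = ex["prompt"] + ex["response"]
        if len(tokenizer(text)["input_ids"]) <= max_len:
            kept.append(ex)
    return kept
```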
### Procedure

The model was trained with supervised fine-tuning (SFT) using the HuggingFace `Trainer`:
| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| Effective batch size | 32 |
| Learning rate | 2e-5 |
| LR schedule | Linear with warmup |
| Warmup steps | 100 |
| Max sequence length | 4096 |
| Total optimizer steps | 7,657 |
| Hardware | 1× H100 GPU |
| Precision | bfloat16 |
| Attention | SDPA (scaled dot-product attention) |
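As a quick sanity check, the effective batch size and total step count in the table follow directly from the other numbers:

```python
import math

examples = 244_997          # dataset size after length filtering
per_device_batch = 4
grad_accum_steps = 8
num_gpus = 1

# Effective batch = per-device batch x accumulation steps x GPUs
effective_batch = per_device_batch * grad_accum_steps * num_gpus

# Optimizer steps in one epoch (assuming the last partial batch is not dropped)
steps_per_epoch = math.ceil(examples / effective_batch)

print(effective_batch, steps_per_epoch)  # 32 7657
```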
Prompt format used during training:

```text
user: <question>
assistant: <think>
<step-by-step reasoning>
</think>
<final answer>
```
The `<think>` tag is hardcoded into the prompt prefix, so the model always learns to emit structured reasoning before the answer.

**Label masking:** only the assistant response (starting after `<think>`) contributes to the cross-entropy loss; the prompt tokens are masked with `-100`.
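A minimal sketch of that masking scheme (the helper name is illustrative, not the training code's actual function):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores targets with this value

def build_labels(prompt_ids, response_ids):
    """Mask prompt tokens so only the assistant response is scored."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

# Prompt tokens contribute nothing to the loss; response tokens keep their ids.
print(build_labels([11, 12, 13], [7, 8]))  # [-100, -100, -100, 7, 8]
```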
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "PursuitOfDataScience/Qwen3.5-0.8B-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

question = "If Alice has 3 apples and buys 5 more, how many apples does she have?"
prompt = (
    "user: Solve this math problem step by step. "
    "Show your reasoning, then give the final answer after ####.\n\n"
    f"Question: {question}\n"
    "assistant: <think>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
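Since the prompt asks the model to put its final answer after `####` (GSM8K style), a small helper can pull that answer out of the decoded text. The sample string below is illustrative, not actual model output:

```python
import re

def extract_answer(text):
    """Return the text after the first '####' marker, or None if absent."""
    match = re.search(r"####\s*([^\n]+)", text)
    return match.group(1).strip() if match else None

sample = "<think>\n3 + 5 = 8\n</think>\nAlice has 8 apples.\n#### 8"
print(extract_answer(sample))  # 8
```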
## GSM8K Benchmark Results (Pass@1)

Evaluated on the GSM8K test set (1,319 examples) with sampled decoding (temperature=0.6, top_p=0.9, max_new_tokens=4096).

The first column below lists CoT-SFT optimizer steps (0 = base model before any fine-tuning; 7,657 = end of one full epoch). Only the final model is publicly released; intermediate checkpoints are not available.
| Training Steps | GSM8K Accuracy |
|---|---|
| 0 (base, with `<think>`) | 58.23% |
| 0 (base, no `<think>`) | 51.40% |
| 500 | 57.32% |
| 1,000 | 59.97% |
| 1,500 | 63.53% |
| 2,000 | 60.20% |
| 2,500 | 59.21% |
| 3,000 | 60.73% |
| 3,500 | 60.58% |
| 4,000 | 60.35% |
| 4,500 | 61.11% |
| 5,000 | 58.61% |
| 5,500 | 62.62% |
| 6,000 | 62.17% |
| 6,500 | 61.11% |
| 7,000 | 63.68% |
| 7,500 | 61.03% |
| 7,657 | 61.64% |
| final model | 62.40% |
The fine-tuned final model achieves 62.40% vs the base model's 58.23% (+4.17 pp) when both use chain-of-thought (`<think>`) prompting, and a +10.99 pp gain over the base model without any reasoning prompt.
## Acknowledgements
- Base model: Qwen/Qwen3.5-0.8B-Base by the Qwen Team (Alibaba Cloud)
- Training data: PursuitOfDataScience/0.5M-thinking
## License
Apache 2.0 — same as the base model.