I have run hundreds of PEFT fine-tuning jobs without any problem, but now I am hitting the error RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.


Without seeing the code, it’s hard to say for sure, but I think it’s related to gradient checkpointing…


Root cause → fix pairs for PEFT LoRA when you hit
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

  1. Gradient checkpointing with a frozen backbone
    Cause: the base weights are frozen, so checkpointed segments only rebuild a backward graph if their inputs require grad; missing input grads, or leaving the KV cache on, breaks backprop.
    Fix: enable input grads and disable cache before training.
# refs:
# https://github.com/huggingface/peft/issues/137
# https://huggingface.co/proxy/discuss.huggingface.co/t/peft-lora-gpt-neox-backward-pass-failing/35641
# https://huggingface.co/togethercomputer/evo-1-131k-base/commit/e87428b

model.gradient_checkpointing_enable()
model.enable_input_require_grads()   # critical with PEFT + checkpointing
model.config.use_cache = False       # cache off during training

Why this works: gradients must flow through frozen layers, so inputs need requires_grad=True. Reconfirmed in issues and code comments. (GitHub)
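
If your transformers version does not expose enable_input_require_grads(), the same effect can be had with a forward hook on the input embeddings; this is the workaround from the linked peft issue #137 (a sketch, assuming model is already loaded):

# manual equivalent of enable_input_require_grads(): force the embedding output
# to require grad so the checkpointed, otherwise fully frozen segments still
# build a backward graph
def make_inputs_require_grad(module, inputs, output):
    output.requires_grad_(True)

model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)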

  2. No LoRA params are actually trainable
    Cause: wrong target_modules for your architecture, stale adapters, or adapters not activated. Result: every param stays frozen.
    Fix: target real module names for your model and verify trainables; if loading adapters, activate them.
# refs:
# https://github.com/huggingface/peft/issues/1346
# https://huggingface.co/microsoft/phi-2/discussions/82

# list a sample of trainable parameter names; should show lora_A / lora_B entries
print([n for n, p in model.named_parameters() if p.requires_grad][:30])
# fail fast if nothing at all is trainable
assert any(p.requires_grad for _, p in model.named_parameters())

# common sets:
# LLaMA-like: ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
# Minimal VRAM: ["q_proj","v_proj"]

If your names are off, nothing trains and the loss has no grad. (GitHub)
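
If the zero-trainable case comes from loading a saved adapter rather than from target_modules, note that PeftModel.from_pretrained loads adapters frozen by default; a sketch, assuming base_model is already loaded and "path/to/adapter" is a placeholder for your checkpoint:

from peft import PeftModel

# is_trainable=True keeps the LoRA weights unfrozen; the default load is inference-only
model = PeftModel.from_pretrained(base_model, "path/to/adapter", is_trainable=True)
model.print_trainable_parameters()  # should report a non-zero trainable count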

  3. The loss was detached from the graph
    Cause: the loss was computed under torch.no_grad(), converted with loss.item() before backward(), or built from tensors that were passed through .detach() or .cpu().numpy().
    Fix: pass the tensor loss to backward() untouched. Keep detaches for logging only.
# ref: https://discuss.pytorch.org/t/loss-backward-element-0-of-tensors-does-not-require-grad-and-does-not-have-a-grad-fn/193788
loss = outputs.loss          # tensor
loss.backward()              # OK
# logging-only after backward:
scalar = float(loss.detach())  # or loss.item() AFTER backward

PyTorch core discussion confirms these patterns. (PyTorch Forums)
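
The failure is easy to reproduce outside PEFT, which is a quick way to confirm this is your case; a minimal, self-contained sketch:

import torch

w = torch.nn.Parameter(torch.ones(1))

with torch.no_grad():
    bad = (w * 2).sum()   # no graph is recorded here
print(bad.requires_grad)  # False -> bad.backward() raises the same RuntimeError

good = (w * 2).sum()      # graph recorded
good.backward()           # works
print(w.grad)             # tensor([2.])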

  4. Version edge cases in Trainer stacks
    Cause: specific combos of transformers + peft + accelerate + checkpointing have regressions.
    Fix: pin a known good set or upgrade all together; keep use_cache=False. See reports around SFTTrainer and 4.36. (GitHub)
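
Before chasing a regression, record the exact stack so you can compare it against known-good pins or paste it into an issue; a minimal report:

import accelerate, peft, torch, transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)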

  5. Wrapper conflicts
    Cause: mixing PEFT with third-party wrappers that alter forward/backward can drop grads or mark params unused.
    Fix: follow the wrapper’s PEFT integration path or run plain PEFT first; in DDP set find_unused_parameters=True if needed. (GitHub)
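
If you run through the Hugging Face Trainer under DDP, the flag from item 5 is exposed on TrainingArguments; a sketch (output_dir is a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                 # placeholder
    gradient_checkpointing=True,      # Trainer then calls gradient_checkpointing_enable()
    ddp_find_unused_parameters=True,  # only if DDP reports unused parameters
)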

Minimal, safe training template
Use this to avoid 1–3, then iterate.

# URLs:
# https://github.com/huggingface/transformers
# https://github.com/huggingface/peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3-8b"  # example
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lcfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","v_proj"],   # adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lcfg)

model.gradient_checkpointing_enable()  # PEFT + checkpointing
model.enable_input_require_grads()     # ensure input grads
model.config.use_cache = False         # disable cache

model.train()
batch = tok("hello world", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()  # causal LM loss needs labels
out = model(**batch)
loss = out.loss            # tensor, do not .item() yet
loss.backward()            # OK

Quick diagnostics to run before training (combined into one sketch after the list)

  • Sanity check trainables: sum(p.requires_grad for p in model.parameters()) > 0. If 0, fix target_modules and adapter activation. (GitHub)
  • Check for accidental no-grad: search your loop for torch.no_grad, .detach(), .item() before backward(). (PyTorch Forums)
  • If using checkpointing: confirm enable_input_require_grads() and use_cache=False. (GitHub)
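
The three checks above fold into one pre-flight cell; a sketch, assuming model is the PEFT-wrapped model and batch is one tokenized batch with labels:

import torch

# 1) something must be trainable
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
assert n_trainable > 0, "no trainable params: check target_modules / adapter activation"

# 2) checkpointing prerequisite (only relevant if checkpointing is on)
assert model.config.use_cache is False, "set use_cache=False for training"

# 3) the loss must still be attached to the graph
model.train()
loss = model(**batch).loss
assert loss.requires_grad and loss.grad_fn is not None, "loss is detached: look for no_grad/.item()/.detach()"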

Recap

  • If checkpointing is on, you must: enable input grads and turn cache off. That alone fixes most cases. (GitHub)
  • If no params require grad, your target_modules are wrong or adapters inactive. Verify names per model. (GitHub)
  • If you detached the loss or wrapped in no-grad, remove those lines. Backprop needs a tensor with requires_grad=True. (PyTorch Forums)