I have run hundreds of PEFT fine-tuning jobs without any problem, but now I am hitting the error RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.


Without seeing the code, it’s hard to say for sure, but I think it’s related to gradient checkpointing…


Root cause → fix pairs for PEFT LoRA when you hit
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

  1. Gradient checkpointing with a frozen backbone
    Cause: the base weights are frozen, so checkpointed segments only rebuild a backward graph if their inputs require grad; missing input grads, or leaving the KV cache on, breaks backprop.
    Fix: enable input grads and disable cache before training.
# refs:
# https://github.com/huggingface/peft/issues/137
# https://huggingface.co/proxy/discuss.huggingface.co/t/peft-lora-gpt-neox-backward-pass-failing/35641
# https://huggingface.co/togethercomputer/evo-1-131k-base/commit/e87428b

model.gradient_checkpointing_enable()
model.enable_input_require_grads()   # critical with PEFT + checkpointing
model.config.use_cache = False       # cache off during training

Why this works: gradients must flow through frozen layers, so inputs need requires_grad=True. Reconfirmed in issues and code comments. (GitHub)
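
If your transformers version does not expose enable_input_require_grads(), the same effect can be had with a forward hook on the input embeddings; this is the workaround from the linked peft issue #137 (a sketch, assuming model is already loaded):

# manual equivalent of enable_input_require_grads(): force the embedding output
# to require grad so the checkpointed, otherwise fully frozen segments still
# build a backward graph
def make_inputs_require_grad(module, inputs, output):
    output.requires_grad_(True)

model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)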

  2. No LoRA params are actually trainable
    Cause: wrong target_modules for your architecture, stale adapters, or adapters not activated. Result: every param stays frozen.
    Fix: target real module names for your model and verify trainables; if loading adapters, activate them.
# refs:
# https://github.com/huggingface/peft/issues/1346
# https://huggingface.co/microsoft/phi-2/discussions/82

# list a sample of trainable parameter names; should show lora_A / lora_B entries
print([n for n, p in model.named_parameters() if p.requires_grad][:30])
# fail fast if nothing at all is trainable
assert any(p.requires_grad for _, p in model.named_parameters())

# common sets:
# LLaMA-like: ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
# Minimal VRAM: ["q_proj","v_proj"]

If your names are off, nothing trains and the loss has no grad. (GitHub)
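
If the zero-trainable case comes from loading a saved adapter rather than from target_modules, note that PeftModel.from_pretrained loads adapters frozen by default; a sketch, assuming base_model is already loaded and "path/to/adapter" is a placeholder for your checkpoint:

from peft import PeftModel

# is_trainable=True keeps the LoRA weights unfrozen; the default load is inference-only
model = PeftModel.from_pretrained(base_model, "path/to/adapter", is_trainable=True)
model.print_trainable_parameters()  # should report a non-zero trainable count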

  3. The loss was detached from the graph
    Cause: the loss was computed under torch.no_grad(), converted with loss.item() before backward(), or built from tensors that were passed through .detach() or .cpu().numpy().
    Fix: pass the tensor loss to backward() untouched. Keep detaches for logging only.
# ref: https://discuss.pytorch.org/t/loss-backward-element-0-of-tensors-does-not-require-grad-and-does-not-have-a-grad-fn/193788
loss = outputs.loss          # tensor
loss.backward()              # OK
# logging-only after backward:
scalar = float(loss.detach())  # or loss.item() AFTER backward

PyTorch core discussion confirms these patterns. (PyTorch Forums)
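
The failure is easy to reproduce outside PEFT, which is a quick way to confirm this is your case; a minimal, self-contained sketch:

import torch

w = torch.nn.Parameter(torch.ones(1))

with torch.no_grad():
    bad = (w * 2).sum()   # no graph is recorded here
print(bad.requires_grad)  # False -> bad.backward() raises the same RuntimeError

good = (w * 2).sum()      # graph recorded
good.backward()           # works
print(w.grad)             # tensor([2.])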

  4. Version edge cases in Trainer stacks
    Cause: specific combos of transformers + peft + accelerate + checkpointing have regressions.
    Fix: pin a known good set or upgrade all together; keep use_cache=False. See reports around SFTTrainer and 4.36. (GitHub)
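
Before chasing a regression, record the exact stack so you can compare it against known-good pins or paste it into an issue; a minimal report:

import accelerate, peft, torch, transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("accelerate:", accelerate.__version__)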

  5. Wrapper conflicts
    Cause: mixing PEFT with third-party wrappers that alter forward/backward can drop grads or mark params unused.
    Fix: follow the wrapper’s PEFT integration path or run plain PEFT first; in DDP set find_unused_parameters=True if needed. (GitHub)
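
If you run through the Hugging Face Trainer under DDP, the flag from item 5 is exposed on TrainingArguments; a sketch (output_dir is a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                 # placeholder
    gradient_checkpointing=True,      # Trainer then calls gradient_checkpointing_enable()
    ddp_find_unused_parameters=True,  # only if DDP reports unused parameters
)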

Minimal, safe training template
Use this to avoid 1–3, then iterate.

# URLs:
# https://github.com/huggingface/transformers
# https://github.com/huggingface/peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3-8b"  # example
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lcfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","v_proj"],   # adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lcfg)

model.gradient_checkpointing_enable()  # PEFT + checkpointing
model.enable_input_require_grads()     # ensure input grads
model.config.use_cache = False         # disable cache

model.train()
batch = tok("hello world", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()  # causal LM loss needs labels
out = model(**batch)
loss = out.loss            # tensor, do not .item() yet
loss.backward()            # OK

Quick diagnostics to run before training (combined into one sketch after the list)

  • Sanity check trainables: sum(p.requires_grad for p in model.parameters()) > 0. If 0, fix target_modules and adapter activation. (GitHub)
  • Check for accidental no-grad: search your loop for torch.no_grad, .detach(), .item() before backward(). (PyTorch Forums)
  • If using checkpointing: confirm enable_input_require_grads() and use_cache=False. (GitHub)
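
The three checks above fold into one pre-flight cell; a sketch, assuming model is the PEFT-wrapped model and batch is one tokenized batch with labels:

import torch

# 1) something must be trainable
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
assert n_trainable > 0, "no trainable params: check target_modules / adapter activation"

# 2) checkpointing prerequisite (only relevant if checkpointing is on)
assert model.config.use_cache is False, "set use_cache=False for training"

# 3) the loss must still be attached to the graph
model.train()
loss = model(**batch).loss
assert loss.requires_grad and loss.grad_fn is not None, "loss is detached: look for no_grad/.item()/.detach()"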

Recap

  • If checkpointing is on, you must: enable input grads and turn cache off. That alone fixes most cases. (GitHub)
  • If no params require grad, your target_modules are wrong or adapters inactive. Verify names per model. (GitHub)
  • If you detached the loss or wrapped in no-grad, remove those lines. Backprop needs a tensor with requires_grad=True. (PyTorch Forums)