After hundreds of successful PEFT fine-tuning runs I have now hit this error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Without seeing the code, it’s hard to say for sure, but I think it’s related to gradient checkpointing…
Root cause → fix pairs for PEFT LoRA when you hit
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
- Gradient checkpointing with a frozen backbone
Cause: the base weights are frozen, so with gradient checkpointing the model inputs themselves must require gradients for anything to flow back. Missing input grads, or leaving use_cache on, breaks backprop.
Fix: enable input grads and disable cache before training.
# refs:
# https://github.com/huggingface/peft/issues/137
# https://huggingface.co/proxy/discuss.huggingface.co/t/peft-lora-gpt-neox-backward-pass-failing/35641
# https://huggingface.co/togethercomputer/evo-1-131k-base/commit/e87428b
model.gradient_checkpointing_enable()
model.enable_input_require_grads() # critical with PEFT + checkpointing
model.config.use_cache = False # cache off during training
Why this works: gradients must still flow through the frozen layers, so the inputs need requires_grad=True. Confirmed in the linked issues and code comments. (GitHub)
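If the base model is loaded in 4-bit or 8-bit, peft bundles this setup into one helper; a minimal sketch, assuming a bitsandbytes-quantized base already loaded as model:
# sketch: same input-grad / cache setup via peft's helper for quantized bases
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model.config.use_cache = False # still keep the cache off for training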
- No LoRA params are actually trainable
Cause: wrong target_modules for your architecture, stale adapters, or adapters that were never activated. Result: every param stays frozen.
Fix: target real module names for your model and verify trainables; if loading adapters, activate them.
# refs:
# https://github.com/huggingface/peft/issues/1346
# https://huggingface.co/microsoft/phi-2/discussions/82
print([n for n,p in model.named_parameters() if p.requires_grad][:30])
assert any(p.requires_grad for _,p in model.named_parameters())
# common sets:
# LLaMA-like: ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
# Minimal VRAM: ["q_proj","v_proj"]
If your names are off, nothing trains and the loss has no grad. (GitHub)
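If you are not sure which names your architecture uses, listing the Linear module suffixes is a quick way to pick valid target_modules; a small sketch, assuming model is already loaded:
# sketch: list candidate target_modules (suffixes of nn.Linear layers)
import torch.nn as nn
candidates = {name.split(".")[-1] for name, mod in model.named_modules() if isinstance(mod, nn.Linear)}
print(sorted(candidates)) # pick the attention/MLP projections from this set
# after get_peft_model, PEFT reports trainables directly:
# model.print_trainable_parameters()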
- The loss was detached from the graph
Cause: with torch.no_grad(), loss = loss.item() before backward(), .detach(), or .cpu().numpy() on tensors used to compute the loss.
Fix: pass the tensor loss to backward() untouched. Keep detaches for logging only.
# ref: https://discuss.pytorch.org/t/loss-backward-element-0-of-tensors-does-not-require-grad-and-does-not-have-a-grad-fn/193788
loss = outputs.loss # tensor
loss.backward() # OK
# logging-only after backward:
scalar = float(loss.detach()) # or loss.item() AFTER backward
PyTorch core discussion confirms these patterns. (PyTorch Forums)
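For intuition, the error is easy to reproduce in isolation, with no model involved:
# sketch: reproduce the exact error with a detached tensor
import torch
x = torch.randn(3, requires_grad=True)
(x * 2).sum().backward() # works: the loss has a grad_fn
bad = (x * 2).sum().detach() # cut from the graph, like .item()/.numpy()
try:
    bad.backward()
except RuntimeError as e:
    print(e) # "element 0 of tensors does not require grad and does not have a grad_fn"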
- Version edge cases in Trainer stacks
Cause: specific combos of transformers + peft + accelerate + checkpointing have regressions.
Fix: pin a known-good set or upgrade everything together; keep use_cache=False. See reports around SFTTrainer and transformers 4.36. (GitHub)
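Before pinning or filing an issue, dump the exact versions you are running; nothing model-specific is assumed here:
# sketch: print stack versions to compare against a known-good setup
import torch, transformers, peft, accelerate
for mod in (torch, transformers, peft, accelerate):
    print(mod.__name__, mod.__version__)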
- Wrapper conflicts
Cause: mixing PEFT with third-party wrappers that alter forward/backward can drop grads or mark params unused.
Fix: follow the wrapper's PEFT integration path or run plain PEFT first; in DDP set find_unused_parameters=True if needed. (GitHub)
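For reference, the DDP switch lives in two places; a sketch assuming a PEFT-wrapped model and, for the raw-DDP lines, an initialized process group:
# sketch: where find_unused_parameters is set
from transformers import TrainingArguments
args = TrainingArguments(output_dir="out", ddp_find_unused_parameters=True)
# raw PyTorch equivalent (run under torchrun):
# from torch.nn.parallel import DistributedDataParallel as DDP
# ddp_model = DDP(model, find_unused_parameters=True)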
Minimal, safe training template
Use this to avoid the first three causes above, then iterate.
# URLs:
# https://github.com/huggingface/transformers
# https://github.com/huggingface/peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
base = "meta-llama/Llama-3-8b" # example
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
lcfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","v_proj"], # adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lcfg)
model.gradient_checkpointing_enable() # PEFT + checkpointing
model.enable_input_require_grads() # ensure input grads
model.config.use_cache = False # disable cache
model.train()
# minimal example batch; real training uses a DataLoader with labels set
batch = tok("Hello world", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()
out = model(**batch)
loss = out.loss # tensor, do not .item() yet
loss.backward() # OK
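To turn the template into a full training step, add an optimizer; AdamW and lr=2e-4 below are illustrative choices, not from the thread:
# sketch: complete one optimization step (hyperparameters illustrative)
from torch.optim import AdamW
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-4) # build once, before the loop in real code
optimizer.step()
optimizer.zero_grad()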
Quick diagnostics to run before training
- Sanity check trainables: sum(p.requires_grad for p in model.parameters()) > 0. If 0, fix target_modules and adapter activation. (GitHub)
- Check for accidental no-grad: search your loop for torch.no_grad, .detach(), and .item() before backward(). (PyTorch Forums)
- If using checkpointing: confirm enable_input_require_grads() and use_cache=False. (GitHub)
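These checks can be bundled into one helper run right before training; a sketch (the preflight name is mine, not a library function):
# sketch: one-shot pre-training diagnostics (helper name is hypothetical)
def preflight(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("trainable params:", trainable) # want > 0
    print("use_cache:", getattr(model.config, "use_cache", None)) # want False
    print("checkpointing:", getattr(model, "is_gradient_checkpointing", None)) # want True
preflight(model)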
Recap
- If checkpointing is on, you must enable input grads and turn the cache off. That alone fixes most cases. (GitHub)
- If no params require grad, your target_modules are wrong or the adapters are inactive. Verify the names for your model. (GitHub)
- If you detached the loss or wrapped it in no_grad, remove those lines. Backprop needs a tensor with requires_grad=True. (PyTorch Forums)