Tensor Dimension Mismatch when using TRL GKDTrainer

When using TRL’s GKDTrainer to perform generalized knowledge distillation, I encountered the following error:

[rank0]: RuntimeError: The size of tensor a (436) must match the size of tensor b (437) at non-singleton dimension 2
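For context, this is PyTorch’s standard broadcast error: two tensors being combined elementwise disagree by one element along dimension 2. It can be reproduced in isolation, independent of TRL or DeepSpeed:

```python
import torch

# the GKD loss compares student and teacher logits elementwise, so an
# off-by-one anywhere in a shared dimension raises exactly this error
a = torch.zeros(1, 4, 436)
b = torch.zeros(1, 4, 437)
try:
    a + b
except RuntimeError as e:
    print(e)
# -> The size of tensor a (436) must match the size of tensor b (437)
#    at non-singleton dimension 2
```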

The code I use is derived from the template in TRL’s GitHub examples (https://github.com/huggingface/trl/blob/main/examples/scripts/gkd.py):

from datasets import load_dataset
import random
from transformers import AutoTokenizer
from trl import (
    GKDConfig,
    GKDTrainer,
    LogCompletionsCallback,
    ModelConfig,
    ScriptArguments,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)


if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, GKDConfig, ModelConfig))
    args, training_args, model_config = parser.parse_args_and_config()

    ################
    # Model & Tokenizer
    ################
    quantization_config = get_quantization_config(model_config)
    model_kwargs = dict(
        revision=model_config.model_revision,
        trust_remote_code=model_config.trust_remote_code,
        attn_implementation=model_config.attn_implementation,
        torch_dtype=model_config.torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
    )
    training_args.model_init_kwargs = model_kwargs
    
    #teacher_quantization_config = get_quantization_config(model_config)

    teacher_model_kwargs = dict(
        revision=model_config.model_revision,
        trust_remote_code=model_config.trust_remote_code,
        attn_implementation=model_config.attn_implementation,
        torch_dtype=model_config.torch_dtype,
        use_cache=True,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
    )
    training_args.teacher_model_init_kwargs = teacher_model_kwargs

    tokenizer = AutoTokenizer.from_pretrained(
        model_config.model_name_or_path,
        trust_remote_code=model_config.trust_remote_code,
        padding="max_length",
        padding_side="right",
        truncation=True,
        truncation_side="right",
        model_max_length=8192
    )
    
    tokenizer.pad_token = tokenizer.eos_token

    ################
    # Dataset
    ################
    
    dataset = load_dataset('json', data_files=args.dataset_name, num_proc=1)
    
    train_data = dataset['train']

    ################
    # Training
    ################
    trainer = GKDTrainer(
        model=model_config.model_name_or_path,
        teacher_model=training_args.teacher_model_name_or_path,
        args=training_args,
        train_dataset=train_data,
        processing_class=tokenizer,
        peft_config=get_peft_config(model_config)
    )
    
    trainer.train()

    # Save
    trainer.save_model(training_args.output_dir)

My DeepSpeed ZeRO-3 configuration file is as follows; I use 4 GPUs:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  train_micro_batch_size_per_gpu: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

What I can’t understand is that I already set the padding and truncation parameters when initializing the tokenizer, but they seem to have no effect.
The versions of the related libraries are: deepspeed 0.15.3, transformers 4.57.1, trl 0.23.1, peft 0.17.1.
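Incidentally, `padding=` and `truncation=` are encode-time arguments: passing them to `from_pretrained` only stores them without applying them on each call (unlike `model_max_length`, `padding_side`, and `truncation_side`, which are real init kwargs). A minimal network-free sketch with a made-up toy vocabulary:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# toy vocabulary, purely for illustration
vocab = {"[PAD]": 0, "hello": 1, "world": 2}
backend = Tokenizer(WordLevel(vocab, unk_token="[PAD]"))
backend.pre_tokenizer = Whitespace()
tok = PreTrainedTokenizerFast(
    tokenizer_object=backend, pad_token="[PAD]", model_max_length=8
)

# no padding happens unless it is requested per call:
print(len(tok("hello world")["input_ids"]))                        # 2
# padding takes effect only as a call-time argument:
print(len(tok("hello world", padding="max_length")["input_ids"]))  # 8
```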

Initially I was on deepspeed 0.17.6, but with that version I hit a more serious error: training could not even start.

So, what should I do to tackle this problem? I have asked AI many times, but in vain.

A surprising discovery is that if I use DeepSpeed ZeRO-2, all the errors disappear. But ZeRO-2 can’t meet my needs, since my teacher model is too large to fit on one GPU.


Hmm, if changing the DeepSpeed settings fixes it, it might be a bug…