Hi, Hugging Face community! (or dear @mariosasko again)
I’m currently following this tutorial, where the dataset is created as follows:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="../dataset/dutch.txt",
    block_size=128,
)
This method is straightforward for text files, but I’m working with a dataset in the Hugging Face .arrow format, created using datasets.Dataset.save_to_disk. I noticed that transformers.TextDataset and transformers.LineByLineTextDataset don’t seem to support reading from a Hugging Face dataset folder. The source code is here.
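For context, this is roughly how I reload my saved dataset (the path below is just an example, not my real one):

from datasets import load_from_disk

# reload the dataset that was previously written with datasets.Dataset.save_to_disk
dataset = load_from_disk("../dataset/dutch_arrow")  # example path
print(dataset.column_names)  # contains 'text' in my case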
Furthermore, when using the Trainer, it seems to require a transformers.Dataset:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
This is where my confusion lies. transformers.Dataset doesn’t allow reading from an HGF dataset folder and doesn’t seem to enable specifying a column (in my case, column='text'). On the other hand, datasets.Dataset doesn’t allow setting a block_size, which seems crucial for my task, and it’s unclear whether it’s compatible with Trainer.
I’m trying to understand which .Dataset class would be the most appropriate for my scenario. Should I use transformers.Dataset, adapting it somehow to read from HGF data folders, or is there a way to use datasets.Dataset with the necessary block_size and ensure compatibility with Trainer?
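For what it’s worth, here is the rough workaround I’ve been considering: tokenizing the datasets.Dataset myself with map, using max_length=128 in place of block_size, and passing the result to Trainer. I’m not sure this actually reproduces what LineByLineTextDataset does, so the path and details below are just my assumptions:

from datasets import load_from_disk

dataset = load_from_disk("../dataset/dutch_arrow")  # example path

def tokenize(batch):
    # truncate each line to 128 tokens, hoping this mirrors block_size=128
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized,
)

Is something like this the intended way, or am I missing a cleaner option?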
Any guidance or suggestions on how to approach this would be greatly appreciated!
Thank you in advance!