Hi, Hugging Face community! (or dear @mariosasko again)
I’m currently following this tutorial, where the dataset is created as follows:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="../dataset/dutch.txt",
    block_size=128,
)
This method is straightforward for text files, but I’m working with a dataset in the Hugging Face .arrow format, created using datasets.Dataset.save_to_disk. I noticed that transformers.TextDataset and transformers.LineByLineTextDataset don’t seem to support reading from a Hugging Face dataset folder. The source code is here.
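For context, this is roughly how I reload my saved dataset (the path below is just an example, not my real one):

from datasets import load_from_disk

# reload the dataset that was previously written with datasets.Dataset.save_to_disk
dataset = load_from_disk("../dataset/dutch_arrow")  # example path
print(dataset.column_names)  # contains 'text' in my case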
Furthermore, when using the Trainer, it seems to require a transformers.Dataset:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
This is where my confusion lies. transformers.Dataset doesn’t allow reading from an HGF dataset folder and doesn’t seem to enable specifying a column (in my case, column='text'). On the other hand, datasets.Dataset doesn’t allow setting a block_size, which seems crucial for my task, and it’s unclear whether it’s compatible with Trainer.
I’m trying to understand which .Dataset class would be the most appropriate for my scenario. Should I use transformers.Dataset, adapting it somehow to read from HGF data folders, or is there a way to use datasets.Dataset with the necessary block_size and ensure compatibility with Trainer?
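For what it’s worth, here is the rough workaround I’ve been considering: tokenizing the datasets.Dataset myself with map, using max_length=128 in place of block_size, and passing the result to Trainer. I’m not sure this actually reproduces what LineByLineTextDataset does, so the path and details below are just my assumptions:

from datasets import load_from_disk

dataset = load_from_disk("../dataset/dutch_arrow")  # example path

def tokenize(batch):
    # truncate each line to 128 tokens, hoping this mirrors block_size=128
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized,
)

Is something like this the intended way, or am I missing a cleaner option?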
Any guidance or suggestions on how to approach this would be greatly appreciated!
Thank you in advance!