Imran1
November 2, 2023, 9:08am
1
I can load the dataset in streaming mode, but I am confused about how to prepare it for training so that the model is trained iteratively on the whole dataset.
If anyone can provide a notebook, that would be very helpful.
@lhoestq
What are you using for training?
If you have your own training loop, you can use a DataLoader with the streaming dataset.
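For example, a minimal sketch of that idea (assuming the ultrachat_200k dataset discussed in this thread and a recent version of datasets with `select_columns` on streaming datasets):

```python
# A minimal sketch, assuming the ultrachat_200k dataset from this thread:
# a streaming dataset is an IterableDataset and can be wrapped in a DataLoader.
from datasets import load_dataset
from torch.utils.data import DataLoader

stream = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
stream = stream.select_columns(["prompt"]).with_format("torch")

loader = DataLoader(stream, batch_size=2)
for batch in loader:
    print(batch["prompt"])  # a list of 2 prompt strings per batch
    break
```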
Imran1
November 3, 2023, 1:35am
3
Here is the complete code, please check it.
opened 06:18AM - 01 Nov 23 UTC
Hi, I am trying to SFT (fine-tune) the Zephyr model with the ultrachat_200k dataset, but it shows CUDA out of memory issues.
How do I load the dataset with streaming and prepare it for training on each chunk?
Here is the code.
```
!pip install --upgrade "transformers" "datasets" "peft" "accelerate" "bitsandbytes" "safetensors" "trl" "wandb"
!pip install -U git+https://github.com/huggingface/trl.git@main
#!pip install -U flash-attn
!git config --global credential.helper store
!huggingface-cli login --token 'token' --add-to-git-credential
from datasets import load_dataset
dataset_base = load_dataset('HuggingFaceH4/ultrachat_200k',streaming=True)
dataset_base
def formatting_func(example):
    instruction = '### Instruction:\n Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n'
    input_prompt = f"### Prompt:\n{example['prompt']}\n\n"
    # Check if there's an instruction and include it
    if 'context' in example:
        input_prompt += f"### Instruction:\n{example['context']}\n\n"
    input_prompt += "### Conversation:\n"
    for message in example['messages']:
        input_prompt += f"{message['role']}: {message['content']}\n\n"
    text = instruction + input_prompt
    return {"text": text}

# Select the splits you want to format
splits_to_format = ['train_sft', 'test_sft']

# Apply the formatting function to the selected splits (applied lazily for streaming datasets)
for split_name in splits_to_format:
    dataset_base[split_name] = dataset_base[split_name].map(formatting_func)

# Now, your 'train_sft' and 'test_sft' splits have been formatted using the 'formatting_func'.
# You can access them as follows:
train_sft = dataset_base['train_sft']
test_sft = dataset_base['test_sft']

# Streaming datasets can't be indexed, so peek at one example instead
print(next(iter(train_sft))["text"])
import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
model_id = "HuggingFaceH4/zephyr-7b-beta"
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    #use_flash_attention_2=True  # use flash attention v2
    #use_auth_token=True,
)
base_model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
from trl import SFTTrainer
supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=train_sft,
    eval_dataset=test_sft,
    args=transformers.TrainingArguments(
        output_dir="sft_z",
        max_steps=500,
        logging_steps=10,
        save_steps=10000,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        gradient_checkpointing=False,
        group_by_length=False,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        weight_decay=0.05,
        optim="paged_adamw_8bit",
        fp16=True,
        remove_unused_columns=False,
        run_name="sft_zephyar",
        report_to="wandb",
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5
)
supervised_finetuning_trainer.train()
```
lhoestq
November 3, 2023, 11:13am
4
Your issue doesn’t seem to be related to the dataset; feel free to continue the discussion in your GitHub issue.
Imran1
November 3, 2023, 11:41am
5
My question is: how do I iteratively train the model if the dataset is in streaming mode?
Can you provide any notebook? I just want to learn the concepts/tricks.
lhoestq
November 3, 2023, 11:49am
6
You can find code examples on how to use a streaming dataset in your own training loop here: Stream
It’s generally a good starting point if you want to adapt it to your use case.
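For reference, here is a rough sketch of such a manual training loop under the same assumptions as the rest of the thread (ultrachat_200k with a "prompt" column, and a small gpt2 placeholder model instead of Zephyr so the example stays lightweight); it illustrates the pattern, it is not a drop-in script:

```python
# A hedged sketch of a manual training loop over a streaming dataset.
# The column names ("prompt", "prompt_id", "messages") and the gpt2
# placeholder model are assumptions for illustration only.
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder so the sketch fits in memory
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True,
                     padding="max_length", max_length=128)

stream = load_dataset("HuggingFaceH4/ultrachat_200k",
                      split="train_sft", streaming=True)
stream = stream.map(tokenize, batched=True,
                    remove_columns=["prompt", "prompt_id", "messages"])
stream = stream.with_format("torch")

loader = DataLoader(stream, batch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for step, batch in enumerate(loader):
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 10:  # stop early; a streaming dataset has no fixed length
        break
```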
Imran1
November 3, 2023, 12:08pm
7
Thank you. I would like to know: can I use this with the Trainer API?
Actually, I want to train the model on the dataset in streaming mode, where the Trainer API automatically downloads chunks or batches, tokenizes them, and trains on them iteratively. By doing this I will save my RAM.
You can pass your chunking and tokenization function to your streaming dataset using .map(), and then pass the dataset to the Trainer. The chunking and tokenization will happen iteratively during training.
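A minimal sketch of that suggestion, with assumed column names and a small placeholder model (note that with a streaming dataset the Trainer needs max_steps, since the dataset has no known length):

```python
# Sketch: tokenize lazily via .map() on the streaming dataset, then hand it
# to the Trainer. Model, column names, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "gpt2"  # placeholder; the thread uses zephyr-7b-beta
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True, max_length=512)

train_stream = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
    .map(tokenize, batched=True,
         remove_columns=["prompt", "prompt_id", "messages"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trainer_streaming_sketch",
        max_steps=500,  # required: a streaming dataset has no __len__
        per_device_train_batch_size=2,
    ),
    train_dataset=train_stream,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```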
Imran1
November 3, 2023, 2:05pm
9
streaming=True does not support .map().
lhoestq
November 3, 2023, 2:18pm
10