Imran1
November 2, 2023, 9:08am
1
I can load the dataset in streaming mode, but I am confused about how to prepare it for training so that the model is trained iteratively on the whole dataset.
If anyone can provide a notebook, that would be very helpful.
@lhoestq
What are you using for training?
If you have your own training loop, you can use a DataLoader with the streaming dataset.
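For example, a minimal sketch of that idea (assuming the ultrachat_200k dataset discussed in this thread and a recent version of datasets with `select_columns` on streaming datasets):

```python
# A minimal sketch, assuming the ultrachat_200k dataset from this thread:
# a streaming dataset is an IterableDataset and can be wrapped in a DataLoader.
from datasets import load_dataset
from torch.utils.data import DataLoader

stream = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
stream = stream.select_columns(["prompt"]).with_format("torch")

loader = DataLoader(stream, batch_size=2)
for batch in loader:
    print(batch["prompt"])  # a list of 2 prompt strings per batch
    break
```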
Imran1
November 3, 2023, 1:35am
3
Here is the complete code, please check it.
opened 06:18AM - 01 Nov 23 UTC
Hi, I am trying to SFT (fine-tune) the Zephyr model with the ultrachat_200k dataset, but it shows CUDA out of memory issues.
How do I load the dataset with streaming and prepare it for training on each chunk?
Here is the code.
```
!pip install --upgrade "transformers" "datasets" "peft" "accelerate" "bitsandbytes" "safetensors" "trl" "wandb"
!pip install -U git+https://github.com/huggingface/trl.git@main
#!pip install -U flash-attn
!git config --global credential.helper store
!huggingface-cli login --token 'token' --add-to-git-credential
from datasets import load_dataset
dataset_base = load_dataset('HuggingFaceH4/ultrachat_200k',streaming=True)
dataset_base
def formatting_func(example):
    instruction = '### Instruction:\n Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n'
    input_prompt = f"### Prompt:\n{example['prompt']}\n\n"
    # Check if there's an instruction and include it
    if 'context' in example:
        input_prompt += f"### Instruction:\n{example['context']}\n\n"
    input_prompt += "### Conversation:\n"
    for message in example['messages']:
        input_prompt += f"{message['role']}: {message['content']}\n\n"
    text = instruction + input_prompt
    return {"text": text}

# Select the splits you want to format
splits_to_format = ['train_sft', 'test_sft']

# Apply the formatting function to the selected splits (applied lazily for streaming datasets)
for split_name in splits_to_format:
    dataset_base[split_name] = dataset_base[split_name].map(formatting_func)

# Now, your 'train_sft' and 'test_sft' splits have been formatted using the 'formatting_func'.
# You can access them as follows:
train_sft = dataset_base['train_sft']
test_sft = dataset_base['test_sft']

# Streaming datasets can't be indexed, so peek at one example instead
print(next(iter(train_sft))["text"])
import torch
import transformers
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
model_id = "HuggingFaceH4/zephyr-7b-beta"
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    #use_flash_attention_2=True  # use flash attention v2
    #use_auth_token=True,
)
base_model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
from trl import SFTTrainer
supervised_finetuning_trainer = SFTTrainer(
    base_model,
    train_dataset=train_sft,
    eval_dataset=test_sft,
    args=transformers.TrainingArguments(
        output_dir="sft_z",
        max_steps=500,
        logging_steps=10,
        save_steps=10000,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        gradient_checkpointing=False,
        group_by_length=False,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        weight_decay=0.05,
        optim="paged_adamw_8bit",
        fp16=True,
        remove_unused_columns=False,
        run_name="sft_zephyar",
        report_to="wandb",
    ),
    tokenizer=tokenizer,
    peft_config=qlora_config,
    dataset_text_field="text",
    max_seq_length=512,
    neftune_noise_alpha=5
)
supervised_finetuning_trainer.train()
```
lhoestq
November 3, 2023, 11:13am
4
Your issue doesn’t seem to be related to the dataset; feel free to continue the discussion in your GitHub issue.
Imran1
November 3, 2023, 11:41am
5
My question is: how do I iteratively train the model if the dataset is in streaming mode?
Can you provide any notebook? I just want to learn the concepts/tricks.
lhoestq
November 3, 2023, 11:49am
6
You can find code examples on how to use a streaming dataset in your own training loop here: Stream
It’s generally a good starting point if you want to adapt it to your use case.
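For reference, here is a rough sketch of such a manual training loop under the same assumptions as the rest of the thread (ultrachat_200k with a "prompt" column, and a small gpt2 placeholder model instead of Zephyr so the example stays lightweight); it illustrates the pattern, it is not a drop-in script:

```python
# A hedged sketch of a manual training loop over a streaming dataset.
# The column names ("prompt", "prompt_id", "messages") and the gpt2
# placeholder model are assumptions for illustration only.
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder so the sketch fits in memory
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True,
                     padding="max_length", max_length=128)

stream = load_dataset("HuggingFaceH4/ultrachat_200k",
                      split="train_sft", streaming=True)
stream = stream.map(tokenize, batched=True,
                    remove_columns=["prompt", "prompt_id", "messages"])
stream = stream.with_format("torch")

loader = DataLoader(stream, batch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for step, batch in enumerate(loader):
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 10:  # stop early; a streaming dataset has no fixed length
        break
```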
Imran1
November 3, 2023, 12:08pm
7
Thank you. I would like to know: can I use this with the Trainer API?
Actually, I want to train the model on the dataset in streaming mode, where the Trainer API automatically downloads chunks or batches, tokenizes them, and trains on them iteratively. By doing this I will save my RAM.
You can pass your chunking and tokenization function to your streaming dataset using .map(), and then pass the dataset to the Trainer. The chunking and tokenization will happen iteratively during training.
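A minimal sketch of that suggestion, with assumed column names and a small placeholder model (note that with a streaming dataset the Trainer needs max_steps, since the dataset has no known length):

```python
# Sketch: tokenize lazily via .map() on the streaming dataset, then hand it
# to the Trainer. Model, column names, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "gpt2"  # placeholder; the thread uses zephyr-7b-beta
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True, max_length=512)

train_stream = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft", streaming=True)
    .map(tokenize, batched=True,
         remove_columns=["prompt", "prompt_id", "messages"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="trainer_streaming_sketch",
        max_steps=500,  # required: a streaming dataset has no __len__
        per_device_train_batch_size=2,
    ),
    train_dataset=train_stream,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```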
Imran1
November 3, 2023, 2:05pm
9
streaming=True does not support .map().
lhoestq
November 3, 2023, 2:18pm
10