This is my tokenization method. I found that no matter what batch_size I set, the speed stays the same, and tokenization takes even longer than training. What can I do? Thanks very much.
# Tokenize one batch of examples, truncating anything longer than 512 tokens.
def tokenize_function(example):
    return tokenizer(example["sentence1"], truncation=True, max_length=512)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, batch_size=8)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1"])
Hi! What tokenizer are you using? What does tokenizer.is_fast return? If the returned value is False, you can set num_proc > 1 to leverage multiprocessing in map. Fast tokenizers use multithreading to process a batch in parallel on a single process by default, so it doesn’t make sense to use num_proc there.
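To make that concrete, here is a minimal sketch of the suggestion, reusing the tokenizer, raw_datasets, and tokenize_function from the snippet above; the batch size of 1000 and num_proc of 4 are arbitrary values, not recommendations.

print(tokenizer.is_fast)  # False means a pure-Python "slow" tokenizer

if tokenizer.is_fast:
    # Fast (Rust-backed) tokenizers already parallelize across a batch with
    # multithreading, so a larger batch_size is usually enough.
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, batch_size=1000)
else:
    # Slow tokenizers are single-threaded, so spread the work over worker processes.
    tokenized_datasets = raw_datasets.map(
        tokenize_function, batched=True, batch_size=1000, num_proc=4
    )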
I’ve checked mine, and I have a fast tokenizer. However, it’s still taking about 20 seconds per example for tokenization, which is too slow.
Here’s the code:
from transformers import AutoTokenizer

base_model_id = "google/gemma-7b"
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
max_length = 1026

def generate_and_tokenize_prompt(prompt):
    # Build the prompt text, then tokenize and pad every example to max_length.
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

train_dataset = dataset_split['train']
eval_dataset = dataset_split['test']

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)
Can someone please help me figure out what I’m missing? Thanks.
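One thing worth trying first is batched mapping: by default map calls the function once per example, and that per-call Python overhead adds up. Below is a rough sketch only, reusing tokenizer, formatting_func, max_length, and the dataset splits from the snippet above; the batch size of 256 is arbitrary, and it assumes formatting_func takes a single example dict.

def generate_and_tokenize_prompt_batched(batch):
    # With batched=True, `batch` is a dict mapping column names to lists of values.
    # Rebuild per-example dicts so the existing formatting_func can be reused unchanged.
    keys = list(batch.keys())
    n = len(batch[keys[0]])
    texts = [formatting_func({k: batch[k][i] for k in keys}) for i in range(n)]
    result = tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    # Copy the token ids per example so labels match input_ids.
    result["labels"] = [ids.copy() for ids in result["input_ids"]]
    return result

tokenized_train_dataset = train_dataset.map(
    generate_and_tokenize_prompt_batched, batched=True, batch_size=256
)
tokenized_val_dataset = eval_dataset.map(
    generate_and_tokenize_prompt_batched, batched=True, batch_size=256
)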
Feel free to report this issue in the tokenizers repo.
You may need to leave preprocessing_num_workers unset (i.e. None), like this:
accelerate launch \
--num_processes 8 \
--num_machines 1 \
--main_process_port 44144 \
--mixed_precision fp16 \
--dynamo_backend no \
pretrain.py \
--dataset_name \
/data/***/datasets/STORIES \
--model_name_or_path bert-base-uncased \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--learning_rate 2e-4 \
--max_train_steps 500000 \
--num_warmup_steps 5000 \
--output_dir ./base \
--max_seq_length 128 \
--checkpointing_steps 50000 \
# --preprocessing_num_workers None \
--with_tracking \
--report_to wandb
This results in:
Running tokenizer on every text in dataset: 95%|██████████████████████████████████████████████████████████████████████████████████████ | 77660856/82086247 [1:09:52<6:53:00, 178.58 examples/s]
For me, I just set preprocessing_num_workers to 16 at the beginning and remove it when approaching the end. This gives almost 20000 examples/s at the beginning and 180 examples/s at the end, which is still better than 2 examples/s. However, I still don’t understand why a single process runs faster at that point.
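For context, the --preprocessing_num_workers flag in the example pretraining scripts is, as far as I can tell, passed straight through as the num_proc argument of datasets.map, so the two settings above correspond roughly to the sketch below; raw_datasets, tokenize_function, and column_names are placeholder names here, not from the script.

# ~20000 examples/s early on: 16 worker processes
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=16,
    remove_columns=column_names,
)

# Flag omitted (None): a single process, slower overall but without the
# multi-process stall observed near the end of the run.
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=None,
    remove_columns=column_names,
)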