A 124M-parameter GPT-2 model was trained on the 10B-token sample of the FineWeb-Edu dataset (https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) using Karpathy's build-nanogpt repository (https://github.com/karpathy/build-nanogpt). Training took about 3 hours on 4 H100 (80 GB) GPUs and produced the training curves shown below.
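
For reference, the data can be pulled straight from the Hub and tokenized with the GPT-2 tokenizer. The sketch below follows the general preprocessing idea of build-nanogpt's fineweb.py; the `sample-10BT` config name and the `text` column come from the dataset card, and the exact script used for this run may differ:

```python
from datasets import load_dataset
import tiktoken

# Assumption: the 10B-token subset is published as the "sample-10BT" config
# of HuggingFaceFW/fineweb-edu (the config build-nanogpt's fineweb.py targets).
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens["<|endoftext|>"]  # document delimiter, as in the repo

def tokenize(doc):
    # Prepend the end-of-text token so documents stay delimited in the token stream.
    return [eot] + enc.encode_ordinary(doc["text"])

# Example: tokenize the first document and report its length in tokens.
first_doc = next(iter(ds))
print(len(tokenize(first_doc)))
```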
Settings:
- Model parameters: 124M
- Training tokens: 10B
- Batch size: 48 sequences
- Max sequence length: 1,024 tokens
- Total batch size: 196,608 tokens per optimizer step
- Warmup steps: 1,906
- Max steps: 50,862
- GPUs: 4 × H100 (80 GB)
- GPU memory usage: 62,567 MB
- Training time: 3 hours (12 GPU-hours)
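
These numbers are internally consistent: 48 sequences × 1,024 tokens × 4 GPUs ≈ 196,608 tokens per step, and 10B / 196,608 ≈ 50,862 steps. A minimal sketch of the warmup-plus-cosine learning-rate schedule that build-nanogpt uses, plugged with the step counts above, is shown here; the peak and minimum learning rates are assumptions based on the repository's GPT-2 (124M) defaults, not values reported for this run:

```python
import math

# Step counts from the training settings above.
warmup_steps = 1906
max_steps = 50862

# Assumed values: build-nanogpt's default learning rates for GPT-2 (124M).
max_lr = 6e-4
min_lr = max_lr * 0.1

def get_lr(step: int) -> float:
    """Linear warmup followed by cosine decay, as in build-nanogpt's train_gpt2.py."""
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        # After the schedule ends, hold at the minimum learning rate.
        return min_lr
    # Cosine decay from max_lr down to min_lr.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

if __name__ == "__main__":
    for s in (0, 1906, 25000, 50862):
        print(f"step {s:>6}: lr = {get_lr(s):.2e}")
```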
