Pretraining datasets?

by saattrupdan - opened Sep 11, 2025

Sep 11, 2025

edited Sep 11, 2025

Do you have an overview of the pretraining datasets that you've trained the model on?

You're linking to several datasets in your YAML metadata (e.g., HPLT2.0, Fineweb2 and MADLAD-400). Is that an exhaustive list?

That would help a lot with transparency :)

TBergmanis

Tilde org Sep 11, 2025

We have listed: culturax, fineweb-2, hplt, hplt2, madlad-400.
Not listed, because the data is not on Wikipedia or HF, are speakleash, Eurolex, and corpora from OPUS (cc_matrix, paracrawl, Europarl, etc.). There were also some data donations from Slovenian, Slovak and Estonian institutions, which mostly consisted of data already found in the other sources.

We will describe the data more precisely in the technical report when we get to it.

saattrupdan

Sep 12, 2025

@TBergmanis That's great, thanks!

saattrupdan changed discussion status to closed Sep 12, 2025
