Allow streaming of large datasets with image/audio

Hi!
For the metadata file I’d suggest any line-by-line format (JSON Lines, CSV, TSV), so that records can be read one at a time.
Then having the images in a ZIP file does the job :slight_smile:
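
To make that layout concrete, here is a minimal sketch using only the Python standard library. The file names (`img_0.png`, `img_1.png`), the metadata fields (`file_name`, `label`) and the helper names are made up for illustration, and the "images" are just placeholder bytes. It is not how the `datasets` library does it internally, only the general idea: read the metadata line by line and pull each image out of the ZIP on demand.

```python
import io
import json
import zipfile


def build_example_archive():
    """Create an in-memory ZIP of placeholder image bytes plus JSON Lines metadata."""
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w") as zf:
        zf.writestr("img_0.png", b"fake-bytes-0")  # placeholder, not a real PNG
        zf.writestr("img_1.png", b"fake-bytes-1")
    # One JSON object per line; field names are hypothetical.
    metadata_text = "\n".join(
        json.dumps({"file_name": name, "label": label})
        for name, label in [("img_0.png", "cat"), ("img_1.png", "dog")]
    )
    return zip_buffer.getvalue(), metadata_text


def iter_examples(zip_bytes, metadata_text):
    """Yield one example at a time, reading only the image that record needs."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for line in metadata_text.splitlines():
            record = json.loads(line)
            record["image_bytes"] = zf.read(record["file_name"])
            yield record


zip_bytes, metadata_text = build_example_archive()
examples = list(iter_examples(zip_bytes, metadata_text))
```

In a real setup the ZIP and the metadata file would live on disk or on a remote server instead of in memory, but the access pattern stays the same.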

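Most formats are compatible with streaming, with a few exceptions like the TAR format: a TAR archive has no index, so you can’t extract one single file without reading through the archive up to that file (the whole archive in the worst case), while ZIP keeps a central directory that lets you read one file directly.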
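
The difference can be seen with a small standard-library sketch. The member names are invented for illustration. With ZIP, `zipfile` looks the member up in the central directory and reads it directly; with TAR, `tarfile` has to walk the member headers in order until it reaches the one you want. Note that for an uncompressed TAR on a seekable file the payloads in between can be skipped with seeks, but for a compressed TAR (e.g. `.tar.gz`) everything before the target has to be decompressed.

```python
import io
import tarfile
import zipfile

# Hypothetical member names, each file just contains its own name as bytes.
names = [f"img_{i}.png" for i in range(5)]

# Build a ZIP archive in memory.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    for name in names:
        zf.writestr(name, name.encode())

# Build an uncompressed TAR archive in memory with the same content.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    for name in names:
        data = name.encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# ZIP: the central directory gives the member's location, so it is read directly.
zip_buffer.seek(0)
with zipfile.ZipFile(zip_buffer) as zf:
    zip_payload = zf.read("img_3.png")

# TAR: no index, so headers are visited sequentially until the target is found.
tar_buffer.seek(0)
headers_visited = 0
tar_payload = None
with tarfile.open(fileobj=tar_buffer, mode="r:") as tf:
    for member in tf:
        headers_visited += 1
        if member.name == "img_3.png":
            tar_payload = tf.extractfile(member).read()
            break
```

Both approaches return the same bytes here; the point is that the TAR lookup cost grows with the position of the file in the archive, which hurts when the archive is large or remote.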

I don’t think there is a big image dataset on the hub that works with streaming yet, since the image datasets I can see use pickle/torch file formats which are not streamable (you can’t just load one image from such a file without reading the full file).

If you have any questions or if I can help, feel free to ping me :slight_smile:
