Download only requested split?

ego-thales · August 7, 2025, 12:34pm

Hi everyone,

I don’t understand how to download only the requested split. See the following for this [EDIT: current version uses .zip] dataset:

>>> ds = load_dataset("ego-thales/cifar10", name="no_airplane", split="left_out_calibration")
train/automobile/25706.png: 100%|██████████████████████████████████████████████████████| 2.27k/2.27k [00:01<00:00, 2.07kB/s]
train/automobile/25700.png: 100%|██████████████████████████████████████████████████████| 2.39k/2.39k [00:00<00:00, 3.64kB/s]
train/automobile/25701.png: 100%|██████████████████████████████████████████████████████| 2.28k/2.28k [00:00<00:00, 3.19kB/s]
...

It starts with train/automobile, probably because of the train split (which I did not request!). Dataset config (only the relevant part):

  - config_name: no_airplane
    data_files:
      - split: train
        path:
          - train/automobile/*.png
          - train/bird/*.png
          - train/cat/*.png
          - train/deer/*.png
          - train/dog/*.png
          - train/frog/*.png
          - train/horse/*.png
          - train/ship/*.png
          - train/truck/*.png
      - split: calibration
        path:
          - calibration/automobile/*.png
          - calibration/bird/*.png
          - calibration/cat/*.png
          - calibration/deer/*.png
          - calibration/dog/*.png
          - calibration/frog/*.png
          - calibration/horse/*.png
          - calibration/ship/*.png
          - calibration/truck/*.png
      - split: test
        path:
          - test/automobile/*.png
          - test/bird/*.png
          - test/cat/*.png
          - test/deer/*.png
          - test/dog/*.png
          - test/frog/*.png
          - test/horse/*.png
          - test/ship/*.png
          - test/truck/*.png
      - split: left_out_train
        path: train/airplane/*.png
      - split: left_out_calibration
        path: calibration/airplane/*.png
      - split: left_out_test
        path: test/airplane/*.png
      - split: left_out
        path:
          - train/airplane/*.png
          - calibration/airplane/*.png
          - test/airplane/*.png

Is there an elegant solution? It’s for a tutorial, and users may only want to download very small splits…

Thanks in advance!

Élie

ego-thales · August 7, 2025, 12:48pm

I know I can use the streaming=True option if I want this mode, but I’d hope that without it, HF would simply eagerly download the whole split, but only the requested split!
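One thing that may be worth trying for the behaviour wished for above is to pass `data_files` explicitly, so that only the listed glob is resolved instead of the patterns declared in the README configs. The following is a minimal sketch, not a confirmed fix: the helper names `left_out_pattern` and `load_left_out` are made up for illustration, the glob mirrors the .png layout of the config shown earlier (a .zip layout would need a different pattern), and whether labels are still inferred from a single class folder has not been verified here.

```python
def left_out_pattern(split, label="airplane", ext="png"):
    """Glob for one class inside one split folder, e.g. calibration/airplane/*.png."""
    return f"{split}/{label}/*.{ext}"


def load_left_out(split, label="airplane"):
    """Load only the files of one left-out class for one split (hypothetical helper)."""
    # Imported lazily so the pattern helper stays usable without `datasets` installed.
    from datasets import load_dataset

    split_name = f"left_out_{split}"
    return load_dataset(
        "ego-thales/cifar10",
        # Assumption: explicit data_files take the place of the README configs,
        # so only files matching this glob are downloaded.
        data_files={split_name: left_out_pattern(split, label)},
        split=split_name,
    )
```

With this, `ds = load_left_out("calibration")` would be the counterpart of the `load_dataset(..., split="left_out_calibration")` call from the first post.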

ego-thales · August 8, 2025, 8:15am

> I know I can use the streaming=True option if I want this mode, but I’d hope that without it, HF would simply eagerly download the whole split, but only the requested split!

I realize now that even this does not work.

Indeed, my current version uses .zip archives since it’s impossible to load thousands of images individually, even if they are really small. But then, using the streaming option still triggers a metadata search in all config data_files.

But the documentation says:

> You can also zip your images, and in this case each zip should contain both the images and the metadata

So Hugging Face goes through the trouble of unpacking all zip files to find the metadata…

Is there a way to ship metadata alongside the zip file or even at the root of the dataset folder so that huggingface finds it easily and does not unpack the entire dataset when I’m requesting only a very tiny split in `streaming` mode?

Thanks
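A possible way to sidestep the metadata search described above is to fetch a single archive with `huggingface_hub` and load it locally as an image folder. This is only a sketch under assumptions: the archive naming scheme in `archive_name` (for example `calibration/airplane.zip`) is hypothetical, since the thread does not show how the .zip files are actually named, and `load_one_archive` is an illustrative helper, not something the repository provides.

```python
def archive_name(split, label):
    """Hypothetical archive path inside the dataset repo; adapt to the real layout."""
    return f"{split}/{label}.zip"


def load_one_archive(split, label, repo_id="ego-thales/cifar10"):
    """Download exactly one zip file and load it as an image folder."""
    # Imported lazily so archive_name() works without these packages installed.
    from datasets import load_dataset
    from huggingface_hub import hf_hub_download

    # Fetches only the named file from the dataset repo into the local cache.
    local_zip = hf_hub_download(
        repo_id=repo_id,
        filename=archive_name(split, label),
        repo_type="dataset",
    )
    # The local archive is handed to the generic image folder builder.
    return load_dataset("imagefolder", data_files=local_zip, split="train")
```

This costs two extra lines compared with a bare `load_dataset` call, which may or may not be acceptable for a few-liner tutorial.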

John6666 · August 8, 2025, 10:16am

I can only think of sharding with WebDataset…

Edit:
I found this.

ego-thales · August 8, 2025, 10:31am

Me too so far… I won’t resort to it though since my goal is to have clean few-liner downloads for tutorials.

But for the life of me I really feel like I’m missing something. I’ve heard so much talk about HuggingFace and it’s sooo popular. Yet, somehow, requesting a split fetches the entire dataset by default?? Even streaming mode unpacks every archive looking for metadata?? It makes absolutely no sense to me since the configuration explicitly says what to look for…

As a workaround, I defined essentially one dataset configuration per potential split (like "no_bird_calibration", "train", "truck_test"), so load_dataset(path, name=…) only fetches the requested part of the dataset. But it feels really monkeypatchy and gives a huge 1000-line README.md metadata block lol
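The one-configuration-per-split workaround above does not have to be written by hand. Below is a minimal sketch of a generator for such a README metadata block. The naming scheme (`no_<class>_<split>` and `<class>_<split>`) is inferred from the examples quoted in the post, the function names are invented for illustration, and the globs assume the .png layout of the original config rather than the current .zip version.

```python
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]
SPLITS = ["train", "calibration", "test"]


def build_configs(classes=CLASSES, splits=SPLITS):
    """Return one single-split config per (class, split) pair, in two flavours:
    'no_<class>_<split>' keeps every other class, '<class>_<split>' keeps only it."""
    configs = []
    for left_out in classes:
        kept = [c for c in classes if c != left_out]
        for split in splits:
            configs.append({
                "config_name": f"no_{left_out}_{split}",
                "split": split,
                "paths": [f"{split}/{c}/*.png" for c in kept],
            })
            configs.append({
                "config_name": f"{left_out}_{split}",
                "split": split,
                "paths": [f"{split}/{left_out}/*.png"],
            })
    return configs


def to_yaml(configs):
    """Render the configs as the YAML block that goes in the README.md header."""
    lines = ["configs:"]
    for cfg in configs:
        lines.append(f"  - config_name: {cfg['config_name']}")
        lines.append("    data_files:")
        lines.append(f"      - split: {cfg['split']}")
        lines.append("        path:")
        lines.extend(f"          - {p}" for p in cfg["paths"])
    return "\n".join(lines)
```

Running `print(to_yaml(build_configs()))` gives a block that can be pasted between the `---` markers of the dataset card, so the long metadata at least stays reproducible.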

John6666 · August 8, 2025, 10:44am

Hmm… @lhoestq

ego-thales · August 8, 2025, 1:16pm

Just so you can see what I came down to x)
