Download only requested split?

Hi everyone,

I don’t understand how to download only the requested split. See the following for this [EDIT: current version uses .zip] dataset:

>>> ds = load_dataset("ego-thales/cifar10", name="no_airplane", split="left_out_calibration")
train/automobile/25706.png: 100%|██████████████████████████████████████████████████████| 2.27k/2.27k [00:01<00:00, 2.07kB/s]
train/automobile/25700.png: 100%|██████████████████████████████████████████████████████| 2.39k/2.39k [00:00<00:00, 3.64kB/s]
train/automobile/25701.png: 100%|██████████████████████████████████████████████████████| 2.28k/2.28k [00:00<00:00, 3.19kB/s]
...

It starts with train/automobile, probably because of the train split (which I did not request!). Here is the relevant part of the dataset config:

  - config_name: no_airplane
    data_files:
      - split: train
        path:
          - train/automobile/*.png
          - train/bird/*.png
          - train/cat/*.png
          - train/deer/*.png
          - train/dog/*.png
          - train/frog/*.png
          - train/horse/*.png
          - train/ship/*.png
          - train/truck/*.png
      - split: calibration
        path:
          - calibration/automobile/*.png
          - calibration/bird/*.png
          - calibration/cat/*.png
          - calibration/deer/*.png
          - calibration/dog/*.png
          - calibration/frog/*.png
          - calibration/horse/*.png
          - calibration/ship/*.png
          - calibration/truck/*.png
      - split: test
        path:
          - test/automobile/*.png
          - test/bird/*.png
          - test/cat/*.png
          - test/deer/*.png
          - test/dog/*.png
          - test/frog/*.png
          - test/horse/*.png
          - test/ship/*.png
          - test/truck/*.png
      - split: left_out_train
        path: train/airplane/*.png
      - split: left_out_calibration
        path: calibration/airplane/*.png
      - split: left_out_test
        path: test/airplane/*.png
      - split: left_out
        path:
          - train/airplane/*.png
          - calibration/airplane/*.png
          - test/airplane/*.png

Is there an elegant solution? It’s for a tutorial, and users may only want to download very small splits…

Thanks in advance!

Élie
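
One thing that might be worth trying here is bypassing the named config and passing data_files explicitly, so that only the patterns for the wanted split are resolved. The sketch below is an untested assumption about this particular repository: the glob pattern is copied from the config above (the current .zip layout may need different paths), and data_files_for and load_single_split are hypothetical helper names, not part of the datasets library.

```python
from typing import Dict, List

REPO_ID = "ego-thales/cifar10"  # repository named in the post above


def data_files_for(split: str, folders: List[str]) -> Dict[str, List[str]]:
    """Map a single split name to the glob patterns it should cover."""
    return {split: [f"{folder}/*.png" for folder in folders]}


def load_single_split(split: str, folders: List[str], streaming: bool = False):
    """Load one split by handing load_dataset only that split's patterns.

    Assumption: explicit data_files take precedence over the configs
    declared in the dataset card, so the other splits are never resolved.
    """
    # Imported lazily so data_files_for stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        data_files=data_files_for(split, folders),
        split=split,
        streaming=streaming,
    )


# Example call (needs network access):
# ds = load_single_split("left_out_calibration", ["calibration/airplane"])
```

Whether the labels are still inferred from the folder names when only one class folder is matched is something I have not checked.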

I know I can use the streaming=True option if I want lazy loading, but I’d hope that without it, HF would simply eagerly download the whole split, and only the requested split!

I realize now that even this does not work.

Indeed, my current version uses .zip archives since it’s impossible to load thousands of images individually, even if they are really small. But then, using the streaming option still triggers a metadata search in all config data_files.

But the documentation says:

You can also zip your images, and in this case each zip should contain both the images and the metadata

So Hugging Face goes through the trouble of unpacking all zip files to find the metadata…

Is there a way to ship metadata alongside the zip file or even at the root of the dataset folder so that huggingface finds it easily and does not unpack the entire dataset when I’m requesting only a very tiny split in streaming mode?

Thanks

I can only think of sharding with WebDataset

Edit:
I found this.
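
For reference, sharding in the WebDataset style roughly means packing each sample into a tar archive, where files sharing the same key form one sample and the extension names the field. A minimal sketch using only the standard library, assuming PNG bytes plus an integer label per sample (write_shard and the key names are hypothetical, and this does not use the webdataset package itself):

```python
import io
import tarfile
from typing import Iterable, Tuple


def write_shard(samples: Iterable[Tuple[str, bytes, int]], shard_path: str) -> int:
    """Write (key, png_bytes, label) samples into one tar shard.

    Each sample becomes two tar members, <key>.png and <key>.cls,
    stored next to each other. Returns the number of samples written.
    """
    count = 0
    with tarfile.open(shard_path, "w") as tar:
        for key, png_bytes, label in samples:
            fields = ((".png", png_bytes), (".cls", str(label).encode("ascii")))
            for suffix, payload in fields:
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
            count += 1
    return count
```

One shard per split (or per class and split) would keep a small split down to a single small download, at the cost of an extra packaging step.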

Me too so far… :confused: I won’t resort to it though since my goal is to have clean few-liner downloads for tutorials.

But for the life of me I really feel like I’m missing something. I’ve heard so much talk about HuggingFace and it’s sooo popular. Yet, somehow, requesting a split fetches the entire dataset by default?? Even streaming mode unpacks every archive looking for metadata?? It makes absolutely no sense to me since the configuration explicitly says what to look for…

As a workaround, I defined essentially one dataset configuration per potential split (like "no_bird_calibration", "train", "truck_test"), so load_dataset(path, name=…) only fetches the requested part of the dataset. But it feels really monkeypatchy and gives a huge 1000-line README.md metadata block lol :person_shrugging:
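
A workaround like this does not have to be written by hand. The sketch below generates one config per (class, split) pair in the same YAML shape as the config shown earlier. It is only an illustration: the naming scheme is guessed from the examples above, the un-prefixed configs such as "train" are left out, the .png globs come from the earlier layout rather than the current .zip one, and config_block and build_configs are hypothetical names.

```python
from itertools import product
from typing import List

CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]
SPLITS = ["train", "calibration", "test"]


def config_block(name: str, split: str, paths: List[str]) -> str:
    """Render one single-split config entry as YAML text."""
    lines = [
        f"- config_name: {name}",
        "  data_files:",
        f"    - split: {split}",
        "      path:",
    ]
    lines += [f"        - {path}" for path in paths]
    return "\n".join(lines)


def build_configs() -> str:
    """Render the whole `configs:` section, two entries per (class, split) pair."""
    blocks = []
    for left_out, split in product(CLASSES, SPLITS):
        # Config with every class except the left-out one.
        kept = [f"{split}/{c}/*.png" for c in CLASSES if c != left_out]
        blocks.append(config_block(f"no_{left_out}_{split}", split, kept))
        # Config with only the left-out class.
        only = [f"{split}/{left_out}/*.png"]
        blocks.append(config_block(f"{left_out}_{split}", split, only))
    return "configs:\n" + "\n".join(blocks)
```

The output of build_configs() can be pasted into the YAML header of README.md, which at least keeps the long metadata block reproducible.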

Hmm… @lhoestq

Just so you can see what I ended up with x)
