Introduction
In Chapter 3 you got your first taste of the 🤗 Datasets library and saw that there are three main steps involved in fine-tuning a model (recapped in the sketch after this list):
- Load a dataset from the Hugging Face Hub.
- Preprocess the data with `Dataset.map()`.
- Load and compute metrics.
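As a quick refresher, here is a minimal sketch of those three steps, using the MRPC dataset and BERT checkpoint from Chapter 3 (any dataset and tokenizer would work the same way):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
import evaluate

# 1. Load a dataset from the Hugging Face Hub
raw_datasets = load_dataset("glue", "mrpc")

# 2. Preprocess the data with Dataset.map()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# 3. Load and compute metrics (metric.compute() is called once you have predictions)
metric = evaluate.load("glue", "mrpc")
```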
But this is just scratching the surface of what 🤗 Datasets can do! In this chapter, we will take a deep dive into the library. Along the way, we’ll find answers to the following questions:
- What do you do when your dataset is not on the Hub? (See the sketch after this list for a taste.)
- How can you slice and dice a dataset? (And what if you really need to use Pandas?)
- What do you do when your dataset is huge and will melt your laptop’s RAM?
- What the heck are “memory mapping” and Apache Arrow?
- How can you create your own dataset and push it to the Hub?
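To preview the first question: `load_dataset()` can also read local files directly by passing a format name and a `data_files` argument, a pattern we'll explore in depth in this chapter. A minimal sketch, with a hypothetical file name standing in for your own data:

```python
from datasets import load_dataset

# Load from a local CSV file instead of the Hub
# ("my_data.csv" is a placeholder path, not a real dataset)
local_dataset = load_dataset("csv", data_files="my_data.csv")
```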
The techniques you learn here will prepare you for the advanced tokenization and fine-tuning tasks in Chapter 6 and Chapter 7 — so grab a coffee and let’s get started!