Introduction
In Chapter 3 you got your first taste of the 🤗 Datasets library and saw that there are three main steps involved in fine-tuning a model (recapped in the sketch after this list):
- Load a dataset from the Hugging Face Hub.
- Preprocess the data with `Dataset.map()`.
- Load and compute metrics.
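As a quick refresher, here is a minimal sketch of those three steps, using the MRPC dataset and BERT checkpoint from Chapter 3 (any dataset and tokenizer would work the same way):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
import evaluate

# 1. Load a dataset from the Hugging Face Hub
raw_datasets = load_dataset("glue", "mrpc")

# 2. Preprocess the data with Dataset.map()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# 3. Load and compute metrics (metric.compute() is called once you have predictions)
metric = evaluate.load("glue", "mrpc")
```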
But this is just scratching the surface of what 🤗 Datasets can do! In this chapter, we will take a deep dive into the library. Along the way, we’ll find answers to the following questions:
- What do you do when your dataset is not on the Hub? (See the sketch after this list for a taste.)
- How can you slice and dice a dataset? (And what if you really need to use Pandas?)
- What do you do when your dataset is huge and will melt your laptop’s RAM?
- What the heck are “memory mapping” and Apache Arrow?
- How can you create your own dataset and push it to the Hub?
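To preview the first question: `load_dataset()` can also read local files directly by passing a format name and a `data_files` argument, a pattern we'll explore in depth in this chapter. A minimal sketch, with a hypothetical file name standing in for your own data:

```python
from datasets import load_dataset

# Load from a local CSV file instead of the Hub
# ("my_data.csv" is a placeholder path, not a real dataset)
local_dataset = load_dataset("csv", data_files="my_data.csv")
```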
The techniques you learn here will prepare you for the advanced tokenization and fine-tuning tasks in Chapter 6 and Chapter 7 — so grab a coffee and let’s get started!