# Deep Learning Emotion Classification - Code Explanation

This document provides a detailed line-by-line explanation of the `main.ipynb` notebook, which implements a multi-label emotion classification system using the DeBERTa transformer model with K-Fold cross-validation.

---

## Section 1: Imports & Setup

### Lines 18-36: Import Statements

```python
import numpy as np
import pandas as pd
```

- **numpy**: Used for numerical operations, array manipulation, and random seed setting
- **pandas**: Used for data loading and manipulation (CSV files, DataFrames)

```python
import torch
import torch.nn as nn
```

- **torch**: PyTorch deep learning framework for tensor operations and model training
- **torch.nn**: Neural network modules, including loss functions

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
```

- **StratifiedKFold**: Creates k-fold splits while maintaining class distribution in each fold
- **f1_score**: Calculates the F1 metric for evaluation (harmonic mean of precision and recall)

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    AutoConfig
)
```

- **AutoTokenizer**: Automatically loads the appropriate tokenizer for the specified model
- **AutoModelForSequenceClassification**: Pre-trained transformer model with a classification head
- **get_linear_schedule_with_warmup**: Learning rate scheduler with warmup and linear decay
- **AutoConfig**: Model configuration loader

```python
from torch.optim import AdamW
```

- **AdamW**: Adam optimizer with decoupled weight decay (better suited than standard Adam for fine-tuning transformers)

```python
from torch.cuda.amp import autocast, GradScaler
```

- **autocast**: Enables automatic mixed precision (AMP) to speed up training
- **GradScaler**: Scales gradients for mixed precision training to prevent underflow

```python
import gc
import warnings
import os
```

- **gc**: Garbage collection to free up memory
- **warnings**: To suppress warning messages
- **os**: For file system operations and environment variables

```python
warnings.filterwarnings("ignore")
```

- Suppresses all warning messages for cleaner output

---

## Section 2: Configuration

### Lines 52-68: Configuration Class

```python
class Config:
    SEED = 42
```

- Sets the random seed for reproducibility across all random operations

```python
    LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
```

- Defines the 5 emotion labels for multi-label classification

```python
    MODEL_NAME = "microsoft/deberta-v3-base"
```

- Specifies the pre-trained model (DeBERTa-v3-base: ~86M backbone parameters, ~184M including the embedding layer; strong performance on text classification)

```python
    MAX_LEN = 128
```

- Maximum sequence length for tokenization (sequences longer than this are truncated)

```python
    BATCH_SIZE = 16
```

- Number of samples processed together in one forward/backward pass

```python
    EPOCHS = 4
```

- Number of complete passes through the training dataset

```python
    LR = 1.5e-5
```

- Learning rate (1.5 × 10⁻⁵), a small value typical for fine-tuning transformers

```python
    WEIGHT_DECAY = 0.01
```

- Weight-decay strength (L2-style regularization) to prevent overfitting

```python
    WARMUP_RATIO = 0.1
```

- Fraction of training steps used for learning rate warmup (10% of total steps)

```python
    N_FOLDS = 5
```

- Number of folds for K-Fold cross-validation

```python
    TRAIN_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/train.csv"
    TEST_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/test.csv"
```

- Paths to the training and test datasets (Kaggle environment paths)

```python
    SUBMISSION_PATH = "submission.csv"
```

- Output file for predictions

```python
CONFIG = Config()
```

- Creates a global instance of the configuration class
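To see how these values interact, the minimal sketch below derives the step counts that the scheduler in Section 7 will work with. It is illustrative only: `n_train` is a hypothetical row count, not a value from the project data.

```python
# Minimal sketch: how the scheduler settings in Section 7 follow from the config values.
# n_train is hypothetical; the notebook derives the real count from each fold's DataLoader.
BATCH_SIZE, EPOCHS, WARMUP_RATIO = 16, 4, 0.1
n_train = 8000

steps_per_epoch = n_train // BATCH_SIZE            # 500 optimizer steps per epoch
total_steps = steps_per_epoch * EPOCHS             # 2000 steps over 4 epochs
warmup_steps = int(total_steps * WARMUP_RATIO)     # 200 warmup steps, then linear decay

print(steps_per_epoch, total_steps, warmup_steps)  # 500 2000 200
```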
"submission.csv" ``` - Output file for predictions ```python CONFIG = Config() ``` - Creates a global instance of the configuration class --- ## Section 3: Seed & Device Setup ### Lines 84-93: Reproducibility and Device Selection ```python def set_seed(seed=CONFIG.SEED): np.random.seed(seed) ``` - Sets numpy's random seed for reproducible random number generation ```python torch.manual_seed(seed) ``` - Sets PyTorch's random seed for CPU operations ```python torch.cuda.manual_seed_all(seed) ``` - Sets PyTorch's random seed for all GPU devices ```python os.environ['PYTHONHASHSEED'] = str(seed) ``` - Sets hash seed for Python's built-in hash() function for reproducibility ```python set_seed() ``` - Calls the seed setting function ```python device = torch.device("cuda" if torch.cuda.is_available() else "cpu") print(f"Using device: {device}") ``` - Checks if GPU is available; uses GPU if available, otherwise falls back to CPU - Prints the device being used for training --- ## Section 4: Utility Functions ### Lines 109-115: `ensure_text_column` Function ```python def ensure_text_column(df: pd.DataFrame) -> pd.DataFrame: if "text" in df.columns: return df ``` - Checks if DataFrame already has a "text" column; if yes, returns unchanged ```python for c in ["comment_text", "sentence", "content", "review"]: if c in df.columns: return df.rename(columns={c: "text"}) ``` - Searches for common alternative text column names - Renames the first matching column to "text" for standardization ```python raise ValueError("No text column found. Add/rename your text column to 'text'.") ``` - Raises an error if no text column is found ### Lines 117-126: `tune_thresholds` Function ```python def tune_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray: th = np.zeros(y_true.shape[1], dtype=np.float32) ``` - Creates array to store optimal threshold for each label (initialized to 0) - Multi-label classification requires separate thresholds per label ```python for j in range(y_true.shape[1]): best_t, best_f1 = 0.5, -1 ``` - Iterates through each label - Initializes best threshold to 0.5 (default) and best F1 to -1 ```python for t in np.linspace(0.1, 0.9, 17): ``` - Tests 17 threshold values evenly spaced between 0.1 and 0.9 ```python f1 = f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0) ``` - Calculates F1 score for current label and threshold - Converts probabilities to binary predictions using threshold ```python if f1 > best_f1: best_f1, best_t = f1, t ``` - Updates best threshold if current F1 is better ```python th[j] = best_t return th ``` - Stores optimal threshold for each label and returns the array ### Lines 128-141: `get_optimizer_params` Function ```python def get_optimizer_params(model, lr, weight_decay): param_optimizer = list(model.named_parameters()) ``` - Gets all model parameters with their names ```python no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] ``` - Lists parameters that should NOT have weight decay applied - Bias and LayerNorm parameters typically trained without weight decay ```python optimizer_parameters = [ { "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], "weight_decay": weight_decay, }, ``` - First parameter group: all parameters EXCEPT bias and LayerNorm - These parameters will have weight decay applied ```python { "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0, }, ] ``` - Second parameter group: only bias and LayerNorm parameters - These parameters have 
### Lines 128-141: `get_optimizer_params` Function

```python
def get_optimizer_params(model, lr, weight_decay):
    param_optimizer = list(model.named_parameters())
```

- Gets all model parameters together with their names

```python
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
```

- Lists parameters that should NOT have weight decay applied
- Bias and LayerNorm parameters are typically trained without weight decay

```python
    optimizer_parameters = [
        {
            "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
            "weight_decay": weight_decay,
        },
```

- First parameter group: all parameters EXCEPT bias and LayerNorm
- These parameters will have weight decay applied

```python
        {
            "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
```

- Second parameter group: only bias and LayerNorm parameters
- These parameters have weight decay set to 0.0

```python
    return optimizer_parameters
```

- Returns the grouped parameters for differential weight decay

---

## Section 5: Dataset Class

### Lines 157-180: `EmotionDS` Class

```python
class EmotionDS(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len, is_test=False):
```

- Custom PyTorch Dataset class for emotion classification
- The `is_test` flag indicates whether this is test data (no labels)

```python
        self.texts = df["text"].tolist()
```

- Extracts the text data as a Python list

```python
        self.is_test = is_test
        if not is_test:
            self.labels = df[CONFIG.LABELS].values.astype(np.float32)
```

- Stores the test flag
- For training data, extracts the multi-label targets as a float32 array

```python
        self.tok = tokenizer
        self.max_len = max_len
```

- Stores the tokenizer and max length for later use

```python
    def __len__(self):
        return len(self.texts)
```

- Returns the dataset size (required by PyTorch)

```python
    def __getitem__(self, i):
        enc = self.tok(
            self.texts[i],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
```

- Tokenizes the text at index `i`
- **truncation**: Cuts text longer than `max_len`
- **padding**: Pads shorter sequences to `max_len`
- **return_tensors="pt"**: Returns PyTorch tensors

```python
        item = {k: v.squeeze(0) for k, v in enc.items()}
```

- Removes the batch dimension: (1, seq_len) → (seq_len)
- Produces a dict with keys such as `input_ids`, `attention_mask`, and `token_type_ids` (if applicable)

```python
        if not self.is_test:
            item["labels"] = torch.tensor(self.labels[i])
        return item
```

- Adds the labels to the item dict for training data
- Returns the complete item

---

## Section 6: Training & Validation Helper Functions

### Lines 196-213: `train_one_epoch` Function

```python
def train_one_epoch(model, loader, optimizer, scheduler, scaler, criterion):
    model.train()
```

- Sets the model to training mode (enables dropout and other train-time behaviors)

```python
    losses = []
    for batch in loader:
```

- Initializes a list to track losses
- Iterates through batches

```python
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
```

- Moves the batch data to the GPU (or CPU)
- `non_blocking=True`: Asynchronous transfer for faster processing

```python
        optimizer.zero_grad(set_to_none=True)
```

- Clears gradients from the previous step
- `set_to_none=True`: More memory efficient than zeroing the tensors

```python
        with autocast(enabled=True):
            out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
            loss = criterion(out.logits, batch["labels"])
```

- **autocast**: Uses mixed precision (float16) for faster computation
- Forward pass through the model
- Calculates the loss between predictions (logits) and true labels

```python
        scaler.scale(loss).backward()
```

- Scales the loss to prevent gradient underflow in mixed precision
- Computes gradients via backpropagation

```python
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```

- Unscales the gradients before clipping
- Clips gradients to a maximum norm of 1.0 to prevent exploding gradients

```python
        scaler.step(optimizer)
        scaler.update()
```

- Updates the model parameters (skipping the step if gradients overflowed)
- Updates the scaler's internal state

```python
        scheduler.step()
```

- Updates the learning rate according to the schedule

```python
        losses.append(loss.item())
    return np.mean(losses)
```

- Stores the loss value
- Returns the average loss for the epoch

### Lines 215-230: `validate` Function

```python
def validate(model, loader, criterion):
    model.eval()
```

- Sets the model to evaluation mode (disables dropout and other train-time behaviors)

```python
    losses = []
    preds = []
    targs = []
```

- Initializes lists for losses, predictions, and targets

```python
    with torch.no_grad():
```

- Disables gradient computation (saves memory and speeds up inference)

```python
        for batch in loader:
            batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
            with autocast(enabled=True):
                out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
                loss = criterion(out.logits, batch["labels"])
```

- Moves the batch to the device
- Forward pass with mixed precision
- Calculates the validation loss

```python
            losses.append(loss.item())
            preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
            targs.append(batch["labels"].cpu().numpy())
```

- Stores the loss
- Applies sigmoid to convert logits to probabilities in [0, 1]
- Moves predictions and targets to the CPU as numpy arrays

```python
    return np.mean(losses), np.vstack(preds), np.vstack(targs)
```

- Returns the average loss, stacked predictions, and stacked targets
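Both helpers hinge on pairing `BCEWithLogitsLoss` (created in Section 7) with a sigmoid at inference time. The toy example below uses made-up logits for a single sample to show that each of the 5 labels is scored as an independent binary problem against float multi-hot targets.

```python
# Toy example (made-up numbers): BCEWithLogitsLoss over 5 independent labels.
import torch
import torch.nn as nn

logits  = torch.tensor([[ 2.0, -1.0, 0.5, -3.0, 0.0]])    # raw model outputs for one sample
targets = torch.tensor([[ 1.0,  0.0, 1.0,  0.0, 1.0]])    # multi-hot labels, float not int

loss = nn.BCEWithLogitsLoss()(logits, targets)             # sigmoid + binary cross-entropy per label
probs = torch.sigmoid(logits)                               # what validate() stores as predictions

print(loss.item())                                          # single averaged scalar loss
print(probs)                                                # per-label probabilities in [0, 1]
```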
---

## Section 7: Main K-Fold Training Loop

### Lines 246-324: `run_training` Function

```python
def run_training():
    if not os.path.exists(CONFIG.TRAIN_CSV):
        print("Train CSV not found. Please check the path.")
        return None, None
```

- Checks that the training data exists
- Returns None if not found (graceful failure)

```python
    df = pd.read_csv(CONFIG.TRAIN_CSV)
    df = ensure_text_column(df)
```

- Loads the training data
- Ensures a text column exists

```python
    skf = StratifiedKFold(n_splits=CONFIG.N_FOLDS, shuffle=True, random_state=CONFIG.SEED)
    y_str = df[CONFIG.LABELS].astype(str).agg("".join, axis=1)
```

- Creates a 5-fold stratified splitter
- Converts the multi-label targets to a string representation so they can be stratified as a single key
- Example: [1, 0, 1, 0, 0] → "10100"
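A small pandas sketch of that stratification key, using hypothetical label rows: each distinct label combination becomes one "class" that StratifiedKFold keeps proportionally represented in every fold.

```python
# Hypothetical rows showing how label combinations become a stratification key.
import pandas as pd

LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
df = pd.DataFrame(
    [[1, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [1, 0, 1, 0, 0]],
    columns=LABELS,
)

y_str = df[LABELS].astype(str).agg("".join, axis=1)
print(y_str.tolist())   # ['10100', '00010', '10100'] - rows 0 and 2 share a "class"
```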
```python
    oof_preds = np.zeros((len(df), len(CONFIG.LABELS)))
```

- Initializes the out-of-fold predictions array (one row per training sample)

```python
    tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
```

- Loads the DeBERTa tokenizer

```python
    for fold, (train_idx, val_idx) in enumerate(skf.split(df, y_str)):
        print(f"\n{'='*20} FOLD {fold+1}/{CONFIG.N_FOLDS} {'='*20}")
```

- Iterates through each fold
- `train_idx`: indices for training; `val_idx`: indices for validation

```python
        df_tr = df.iloc[train_idx].reset_index(drop=True)
        df_va = df.iloc[val_idx].reset_index(drop=True)
```

- Splits the data into training and validation sets for the current fold
- Resets the index for clean indexing

```python
        ds_tr = EmotionDS(df_tr, tokenizer, CONFIG.MAX_LEN)
        ds_va = EmotionDS(df_va, tokenizer, CONFIG.MAX_LEN)
```

- Creates PyTorch datasets for training and validation

```python
        dl_tr = torch.utils.data.DataLoader(ds_tr, batch_size=CONFIG.BATCH_SIZE, shuffle=True,
                                            num_workers=2, pin_memory=True)
        dl_va = torch.utils.data.DataLoader(ds_va, batch_size=CONFIG.BATCH_SIZE, shuffle=False,
                                            num_workers=2, pin_memory=True)
```

- Creates the data loaders
- **shuffle=True** for training (randomizes batch order)
- **shuffle=False** for validation (keeps a consistent order)
- **num_workers=2**: Uses 2 subprocesses for data loading
- **pin_memory=True**: Speeds up CPU→GPU transfer

```python
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG.MODEL_NAME,
            num_labels=len(CONFIG.LABELS),
            problem_type="multi_label_classification"
        )
        model.to(device)
```

- Loads the pre-trained DeBERTa model
- Configures it for 5-label multi-label classification
- Moves the model to the GPU/CPU

```python
        optimizer_params = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
        optimizer = AdamW(optimizer_params, lr=CONFIG.LR)
```

- Gets parameter groups with differential weight decay
- Creates the AdamW optimizer

```python
        total_steps = len(dl_tr) * CONFIG.EPOCHS
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(total_steps * CONFIG.WARMUP_RATIO),
            num_training_steps=total_steps
        )
```

- Calculates the total number of training steps
- Creates the learning rate scheduler:
  - Warmup: LR increases linearly for the first 10% of steps
  - Decay: LR decreases linearly to 0 over the remaining 90%
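A self-contained sketch of that warmup-then-decay shape on a dummy parameter; the step counts match the arithmetic example in Section 2 and are illustrative, not taken from the notebook's data.

```python
# Standalone illustration of get_linear_schedule_with_warmup on a dummy parameter.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

param = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([param], lr=1.5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=200, num_training_steps=2000)

for step in range(2000):
    optimizer.step()            # no-op here (no gradients); present only to mirror the real loop
    scheduler.step()
    if step in (0, 199, 1099, 1999):
        print(step, scheduler.get_last_lr()[0])   # LR ramps up to 1.5e-5, then decays toward 0
```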
```python
        criterion = nn.BCEWithLogitsLoss()
        scaler = GradScaler(enabled=True)
```

- **BCEWithLogitsLoss**: Binary cross-entropy loss (with built-in sigmoid) for multi-label classification
- Creates the gradient scaler for mixed precision

```python
        best_f1 = 0
        best_state = None
```

- Initializes tracking for the best model

```python
        for ep in range(CONFIG.EPOCHS):
            train_loss = train_one_epoch(model, dl_tr, optimizer, scheduler, scaler, criterion)
            val_loss, val_preds, val_targs = validate(model, dl_va, criterion)
```

- Trains for one epoch
- Validates on the validation set

```python
            val_f1 = f1_score(val_targs, (val_preds >= 0.5).astype(int), average="macro", zero_division=0)
```

- Calculates the macro F1 score (average F1 across all labels)
- Uses a 0.5 threshold for predictions

```python
            print(f"Ep {ep+1}: TrLoss={train_loss:.4f} | VaLoss={val_loss:.4f} | VaF1={val_f1:.4f}")
```

- Prints the epoch metrics

```python
            if val_f1 > best_f1:
                best_f1 = val_f1
                best_state = model.state_dict()
```

- Saves the model state if the validation F1 improves

```python
        torch.save(best_state, f"model_fold_{fold}.pth")
```

- Saves the best model weights for this fold to disk

```python
        model.load_state_dict(best_state)
        _, val_preds, _ = validate(model, dl_va, criterion)
        oof_preds[val_idx] = val_preds
```

- Loads the best weights
- Gets predictions on the validation set
- Stores them as out-of-fold predictions

```python
        del model, optimizer, scaler, scheduler
        torch.cuda.empty_cache()
        gc.collect()
```

- Deletes objects to free memory
- Clears the GPU cache
- Runs the garbage collector

```python
    return oof_preds, df[CONFIG.LABELS].values
```

- Returns the out-of-fold predictions and the true labels

```python
if os.path.exists(CONFIG.TRAIN_CSV):
    oof_preds, y_true = run_training()
else:
    print("Skipping training as data is not found (likely in a dry-run environment).")
```

- Executes training if the data exists
- Otherwise skips gracefully

---

## Section 8: Threshold Optimization

### Lines 340-347: Threshold Tuning

```python
if os.path.exists(CONFIG.TRAIN_CSV):
    best_thresholds = tune_thresholds(y_true, oof_preds)
```

- Finds the optimal threshold for each emotion label using the out-of-fold predictions

```python
    oof_tuned = (oof_preds >= best_thresholds).astype(int)
```

- Converts probabilities to binary predictions using the optimized thresholds

```python
    final_f1 = f1_score(y_true, oof_tuned, average="macro", zero_division=0)
    print(f"\nFinal CV Macro F1: {final_f1:.4f}")
    print(f"Best Thresholds: {best_thresholds}")
```

- Calculates the cross-validated macro F1 score with the optimized thresholds
- Prints the final performance and the optimal thresholds

```python
else:
    best_thresholds = np.array([0.5] * len(CONFIG.LABELS))
```

- Falls back to 0.5 thresholds if the training data is not available

---

## Section 9: Inference & Submission

### Lines 363-420: `predict_test` Function

```python
def predict_test(thresholds):
    if not os.path.exists(CONFIG.TEST_CSV):
        print("Test CSV not found.")
        return
```

- Checks that the test data exists

```python
    df_test = pd.read_csv(CONFIG.TEST_CSV)
    df_test = ensure_text_column(df_test)
```

- Loads the test data and ensures a text column

```python
    tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
    ds_test = EmotionDS(df_test, tokenizer, CONFIG.MAX_LEN, is_test=True)
    dl_test = torch.utils.data.DataLoader(ds_test, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2)
```

- Creates the tokenizer, dataset, and data loader for the test data
- `is_test=True`: No labels expected

```python
    fold_preds = []
```

- Initializes a list to store the predictions from each fold

```python
    for fold in range(CONFIG.N_FOLDS):
        model_path = f"model_fold_{fold}.pth"
        if not os.path.exists(model_path):
            print(f"Model for fold {fold} not found, skipping.")
            continue
```

- Iterates through all folds
- Checks that the saved model exists

```python
        print(f"Predicting Fold {fold+1}...")
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG.MODEL_NAME,
            num_labels=len(CONFIG.LABELS),
            problem_type="multi_label_classification"
        )
        model.load_state_dict(torch.load(model_path))
        model.to(device)
        model.eval()
```

- Loads the model architecture
- Loads the trained weights
- Sets the model to evaluation mode

```python
        preds = []
        with torch.no_grad():
            for batch in dl_test:
                batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
                with autocast(enabled=True):
                    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
                preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
```

- Makes predictions without computing gradients
- Uses mixed precision for speed
- Applies sigmoid to get probabilities

```python
        fold_preds.append(np.vstack(preds))
        del model
        torch.cuda.empty_cache()
        gc.collect()
```

- Stores this fold's predictions
- Frees memory

```python
    if not fold_preds:
        print("No predictions made.")
        return
```

- Checks whether any predictions were made

```python
    avg_preds = np.mean(fold_preds, axis=0)
```

- Averages the predictions across all folds (ensemble)

```python
    final_preds = (avg_preds >= thresholds).astype(int)
```

- Applies the optimized thresholds to get binary predictions

```python
    sub = pd.DataFrame(columns=["id"] + CONFIG.LABELS)
    sub["id"] = df_test["id"] if "id" in df_test.columns else np.arange(len(df_test))
    sub[CONFIG.LABELS] = final_preds
    sub.to_csv(CONFIG.SUBMISSION_PATH, index=False)
    print(f"Submission saved to {CONFIG.SUBMISSION_PATH}")
    print(sub.head())
```

- Creates the submission DataFrame
- Adds the ID column (from the data, or generated if missing)
- Adds the prediction columns
- Saves to CSV
- Displays the first few rows

```python
predict_test(best_thresholds)
```

- Executes the prediction function with the optimized thresholds

---

## Summary

This notebook implements a **robust emotion classification pipeline** with:

1. **K-Fold Cross-Validation**: 5-fold stratified CV for reliable performance estimates
2. **State-of-the-Art Model**: DeBERTa-v3-base transformer
3. **Optimization Techniques**:
   - Mixed precision training (faster, less memory)
   - Gradient clipping (stability)
   - Learning rate warmup and decay
   - Differential weight decay
4. **Threshold Optimization**: Per-label thresholds for better F1 scores
5. **Ensemble Prediction**: Averages predictions from all folds
6. **Memory Management**: Explicit cleanup between folds

The model predicts 5 emotions (anger, fear, joy, sadness, surprise) in a **multi-label** setting, where a text can express multiple emotions simultaneously.
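As a closing illustration of the fold averaging and thresholding from Sections 8-9, here is a tiny numpy sketch with made-up fold outputs; the shapes mirror the notebook's `(n_test, 5)` probability arrays and the thresholds are hypothetical.

```python
# Made-up fold probabilities: averaging a list of (n_test, 5) arrays keeps the same shape.
import numpy as np

fold_preds = [np.full((3, 5), p) for p in (0.2, 0.4, 0.9)]   # three hypothetical folds
avg_preds = np.mean(fold_preds, axis=0)                      # element-wise mean -> (3, 5) array of 0.5

thresholds = np.array([0.5, 0.5, 0.4, 0.6, 0.5])             # hypothetical per-label thresholds
final_preds = (avg_preds >= thresholds).astype(int)          # thresholds broadcast over rows
print(final_preds[0])                                        # [1 1 1 0 1]
```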