# Deep Learning Emotion Classification - Code Explanation

This document provides a detailed line-by-line explanation of the `main.ipynb` notebook, which implements a multi-label emotion classification system using the DeBERTa transformer model with K-Fold cross-validation.

---

## Section 1: Imports & Setup

### Lines 18-36: Import Statements

```python
import numpy as np
import pandas as pd
```

- **numpy**: Used for numerical operations, array manipulation, and random seed setting
- **pandas**: Used for data loading and manipulation (CSV files, DataFrames)

```python
import torch
import torch.nn as nn
```

- **torch**: PyTorch deep learning framework for tensor operations and model training
- **torch.nn**: Neural network modules, including loss functions

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
```

- **StratifiedKFold**: Creates k-fold splits while maintaining class distribution in each fold
- **f1_score**: Calculates the F1 metric for evaluation (harmonic mean of precision and recall)

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    AutoConfig
)
```

- **AutoTokenizer**: Automatically loads the appropriate tokenizer for the specified model
- **AutoModelForSequenceClassification**: Pre-trained transformer model with a classification head
- **get_linear_schedule_with_warmup**: Learning rate scheduler with warmup and linear decay
- **AutoConfig**: Model configuration loader

```python
from torch.optim import AdamW
```

- **AdamW**: Adam optimizer with decoupled weight decay (better suited than standard Adam for fine-tuning transformers)

```python
from torch.cuda.amp import autocast, GradScaler
```

- **autocast**: Enables automatic mixed precision (AMP) to speed up training
- **GradScaler**: Scales gradients for mixed precision training to prevent underflow

```python
import gc
import warnings
import os
```

- **gc**: Garbage collection to free up memory
- **warnings**: To suppress warning messages
- **os**: For file system operations and environment variables

```python
warnings.filterwarnings("ignore")
```

- Suppresses all warning messages for cleaner output

---

## Section 2: Configuration

### Lines 52-68: Configuration Class

```python
class Config:
    SEED = 42
```

- Sets the random seed for reproducibility across all random operations

```python
    LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
```

- Defines the 5 emotion labels for multi-label classification

```python
    MODEL_NAME = "microsoft/deberta-v3-base"
```

- Specifies the pre-trained model (DeBERTa-v3-base: ~86M backbone parameters, ~184M including the embedding layer; strong performance on text classification)

```python
    MAX_LEN = 128
```

- Maximum sequence length for tokenization (sequences longer than this are truncated)

```python
    BATCH_SIZE = 16
```

- Number of samples processed together in one forward/backward pass

```python
    EPOCHS = 4
```

- Number of complete passes through the training dataset

```python
    LR = 1.5e-5
```

- Learning rate (1.5 × 10⁻⁵), a small value typical for fine-tuning transformers

```python
    WEIGHT_DECAY = 0.01
```

- Weight-decay strength (L2-style regularization) to prevent overfitting

```python
    WARMUP_RATIO = 0.1
```

- Fraction of training steps used for learning rate warmup (10% of total steps)

```python
    N_FOLDS = 5
```

- Number of folds for K-Fold cross-validation

```python
    TRAIN_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/train.csv"
    TEST_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/test.csv"
```

- Paths to the training and test datasets (Kaggle environment paths)

```python
    SUBMISSION_PATH = "submission.csv"
```

- Output file for predictions

```python
CONFIG = Config()
```

- Creates a global instance of the configuration class
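To see how these values interact, the minimal sketch below derives the step counts that the scheduler in Section 7 will work with. It is illustrative only: `n_train` is a hypothetical row count, not a value from the project data.

```python
# Minimal sketch: how the scheduler settings in Section 7 follow from the config values.
# n_train is hypothetical; the notebook derives the real count from each fold's DataLoader.
BATCH_SIZE, EPOCHS, WARMUP_RATIO = 16, 4, 0.1
n_train = 8000

steps_per_epoch = n_train // BATCH_SIZE            # 500 optimizer steps per epoch
total_steps = steps_per_epoch * EPOCHS             # 2000 steps over 4 epochs
warmup_steps = int(total_steps * WARMUP_RATIO)     # 200 warmup steps, then linear decay

print(steps_per_epoch, total_steps, warmup_steps)  # 500 2000 200
```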
"submission.csv" ``` - Output file for predictions ```python CONFIG = Config() ``` - Creates a global instance of the configuration class --- ## Section 3: Seed & Device Setup ### Lines 84-93: Reproducibility and Device Selection ```python def set_seed(seed=CONFIG.SEED): np.random.seed(seed) ``` - Sets numpy's random seed for reproducible random number generation ```python torch.manual_seed(seed) ``` - Sets PyTorch's random seed for CPU operations ```python torch.cuda.manual_seed_all(seed) ``` - Sets PyTorch's random seed for all GPU devices ```python os.environ['PYTHONHASHSEED'] = str(seed) ``` - Sets hash seed for Python's built-in hash() function for reproducibility ```python set_seed() ``` - Calls the seed setting function ```python device = torch.device("cuda" if torch.cuda.is_available() else "cpu") print(f"Using device: {device}") ``` - Checks if GPU is available; uses GPU if available, otherwise falls back to CPU - Prints the device being used for training --- ## Section 4: Utility Functions ### Lines 109-115: `ensure_text_column` Function ```python def ensure_text_column(df: pd.DataFrame) -> pd.DataFrame: if "text" in df.columns: return df ``` - Checks if DataFrame already has a "text" column; if yes, returns unchanged ```python for c in ["comment_text", "sentence", "content", "review"]: if c in df.columns: return df.rename(columns={c: "text"}) ``` - Searches for common alternative text column names - Renames the first matching column to "text" for standardization ```python raise ValueError("No text column found. Add/rename your text column to 'text'.") ``` - Raises an error if no text column is found ### Lines 117-126: `tune_thresholds` Function ```python def tune_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray: th = np.zeros(y_true.shape[1], dtype=np.float32) ``` - Creates array to store optimal threshold for each label (initialized to 0) - Multi-label classification requires separate thresholds per label ```python for j in range(y_true.shape[1]): best_t, best_f1 = 0.5, -1 ``` - Iterates through each label - Initializes best threshold to 0.5 (default) and best F1 to -1 ```python for t in np.linspace(0.1, 0.9, 17): ``` - Tests 17 threshold values evenly spaced between 0.1 and 0.9 ```python f1 = f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0) ``` - Calculates F1 score for current label and threshold - Converts probabilities to binary predictions using threshold ```python if f1 > best_f1: best_f1, best_t = f1, t ``` - Updates best threshold if current F1 is better ```python th[j] = best_t return th ``` - Stores optimal threshold for each label and returns the array ### Lines 128-141: `get_optimizer_params` Function ```python def get_optimizer_params(model, lr, weight_decay): param_optimizer = list(model.named_parameters()) ``` - Gets all model parameters with their names ```python no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"] ``` - Lists parameters that should NOT have weight decay applied - Bias and LayerNorm parameters typically trained without weight decay ```python optimizer_parameters = [ { "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], "weight_decay": weight_decay, }, ``` - First parameter group: all parameters EXCEPT bias and LayerNorm - These parameters will have weight decay applied ```python { "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], "weight_decay": 0.0, }, ] ``` - Second parameter group: only bias and LayerNorm parameters - These parameters have 
### Lines 128-141: `get_optimizer_params` Function

```python
def get_optimizer_params(model, lr, weight_decay):
    param_optimizer = list(model.named_parameters())
```

- Gets all model parameters together with their names

```python
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
```

- Lists parameters that should NOT have weight decay applied
- Bias and LayerNorm parameters are typically trained without weight decay

```python
    optimizer_parameters = [
        {
            "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
            "weight_decay": weight_decay,
        },
```

- First parameter group: all parameters EXCEPT bias and LayerNorm
- These parameters will have weight decay applied

```python
        {
            "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
```

- Second parameter group: only bias and LayerNorm parameters
- These parameters have weight decay set to 0.0

```python
    return optimizer_parameters
```

- Returns the grouped parameters for differential weight decay

---

## Section 5: Dataset Class

### Lines 157-180: `EmotionDS` Class

```python
class EmotionDS(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len, is_test=False):
```

- Custom PyTorch Dataset class for emotion classification
- The `is_test` flag indicates whether this is test data (no labels)

```python
        self.texts = df["text"].tolist()
```

- Extracts the text data as a Python list

```python
        self.is_test = is_test
        if not is_test:
            self.labels = df[CONFIG.LABELS].values.astype(np.float32)
```

- Stores the test flag
- For training data, extracts the multi-label targets as a float32 array

```python
        self.tok = tokenizer
        self.max_len = max_len
```

- Stores the tokenizer and max length for later use

```python
    def __len__(self):
        return len(self.texts)
```

- Returns the dataset size (required by PyTorch)

```python
    def __getitem__(self, i):
        enc = self.tok(
            self.texts[i],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )
```

- Tokenizes the text at index `i`
- **truncation**: Cuts text longer than `max_len`
- **padding**: Pads shorter sequences to `max_len`
- **return_tensors="pt"**: Returns PyTorch tensors

```python
        item = {k: v.squeeze(0) for k, v in enc.items()}
```

- Removes the batch dimension: (1, seq_len) → (seq_len)
- Produces a dict with keys such as `input_ids`, `attention_mask`, and `token_type_ids` (if applicable)

```python
        if not self.is_test:
            item["labels"] = torch.tensor(self.labels[i])
        return item
```

- Adds the labels to the item dict for training data
- Returns the complete item

---

## Section 6: Training & Validation Helper Functions

### Lines 196-213: `train_one_epoch` Function

```python
def train_one_epoch(model, loader, optimizer, scheduler, scaler, criterion):
    model.train()
```

- Sets the model to training mode (enables dropout and other train-time behaviors)

```python
    losses = []
    for batch in loader:
```

- Initializes a list to track losses
- Iterates through batches

```python
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
```

- Moves the batch data to the GPU (or CPU)
- `non_blocking=True`: Asynchronous transfer for faster processing

```python
        optimizer.zero_grad(set_to_none=True)
```

- Clears gradients from the previous step
- `set_to_none=True`: More memory efficient than zeroing the tensors

```python
        with autocast(enabled=True):
            out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
            loss = criterion(out.logits, batch["labels"])
```

- **autocast**: Uses mixed precision (float16) for faster computation
- Forward pass through the model
- Calculates the loss between predictions (logits) and true labels

```python
        scaler.scale(loss).backward()
```

- Scales the loss to prevent gradient underflow in mixed precision
- Computes gradients via backpropagation

```python
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```

- Unscales the gradients before clipping
- Clips gradients to a maximum norm of 1.0 to prevent exploding gradients

```python
        scaler.step(optimizer)
        scaler.update()
```

- Updates the model parameters (skipping the step if gradients overflowed)
- Updates the scaler's internal state

```python
        scheduler.step()
```

- Updates the learning rate according to the schedule

```python
        losses.append(loss.item())
    return np.mean(losses)
```

- Stores the loss value
- Returns the average loss for the epoch

### Lines 215-230: `validate` Function

```python
def validate(model, loader, criterion):
    model.eval()
```

- Sets the model to evaluation mode (disables dropout and other train-time behaviors)

```python
    losses = []
    preds = []
    targs = []
```

- Initializes lists for losses, predictions, and targets

```python
    with torch.no_grad():
```

- Disables gradient computation (saves memory and speeds up inference)

```python
        for batch in loader:
            batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
            with autocast(enabled=True):
                out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
                loss = criterion(out.logits, batch["labels"])
```

- Moves the batch to the device
- Forward pass with mixed precision
- Calculates the validation loss

```python
            losses.append(loss.item())
            preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
            targs.append(batch["labels"].cpu().numpy())
```

- Stores the loss
- Applies sigmoid to convert logits to probabilities in [0, 1]
- Moves predictions and targets to the CPU as numpy arrays

```python
    return np.mean(losses), np.vstack(preds), np.vstack(targs)
```

- Returns the average loss, stacked predictions, and stacked targets
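Both helpers hinge on pairing `BCEWithLogitsLoss` (created in Section 7) with a sigmoid at inference time. The toy example below uses made-up logits for a single sample to show that each of the 5 labels is scored as an independent binary problem against float multi-hot targets.

```python
# Toy example (made-up numbers): BCEWithLogitsLoss over 5 independent labels.
import torch
import torch.nn as nn

logits  = torch.tensor([[ 2.0, -1.0, 0.5, -3.0, 0.0]])    # raw model outputs for one sample
targets = torch.tensor([[ 1.0,  0.0, 1.0,  0.0, 1.0]])    # multi-hot labels, float not int

loss = nn.BCEWithLogitsLoss()(logits, targets)             # sigmoid + binary cross-entropy per label
probs = torch.sigmoid(logits)                               # what validate() stores as predictions

print(loss.item())                                          # single averaged scalar loss
print(probs)                                                # per-label probabilities in [0, 1]
```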
---

## Section 7: Main K-Fold Training Loop

### Lines 246-324: `run_training` Function

```python
def run_training():
    if not os.path.exists(CONFIG.TRAIN_CSV):
        print("Train CSV not found. Please check the path.")
        return None, None
```

- Checks that the training data exists
- Returns None if not found (graceful failure)

```python
    df = pd.read_csv(CONFIG.TRAIN_CSV)
    df = ensure_text_column(df)
```

- Loads the training data
- Ensures a text column exists

```python
    skf = StratifiedKFold(n_splits=CONFIG.N_FOLDS, shuffle=True, random_state=CONFIG.SEED)
    y_str = df[CONFIG.LABELS].astype(str).agg("".join, axis=1)
```

- Creates a 5-fold stratified splitter
- Converts the multi-label targets to a string representation so they can be stratified as a single key
- Example: [1, 0, 1, 0, 0] → "10100"
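A small pandas sketch of that stratification key, using hypothetical label rows: each distinct label combination becomes one "class" that StratifiedKFold keeps proportionally represented in every fold.

```python
# Hypothetical rows showing how label combinations become a stratification key.
import pandas as pd

LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
df = pd.DataFrame(
    [[1, 0, 1, 0, 0],
     [0, 0, 0, 1, 0],
     [1, 0, 1, 0, 0]],
    columns=LABELS,
)

y_str = df[LABELS].astype(str).agg("".join, axis=1)
print(y_str.tolist())   # ['10100', '00010', '10100'] - rows 0 and 2 share a "class"
```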
```python
    oof_preds = np.zeros((len(df), len(CONFIG.LABELS)))
```

- Initializes the out-of-fold predictions array (one row per training sample)

```python
    tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
```

- Loads the DeBERTa tokenizer

```python
    for fold, (train_idx, val_idx) in enumerate(skf.split(df, y_str)):
        print(f"\n{'='*20} FOLD {fold+1}/{CONFIG.N_FOLDS} {'='*20}")
```

- Iterates through each fold
- `train_idx`: indices for training; `val_idx`: indices for validation

```python
        df_tr = df.iloc[train_idx].reset_index(drop=True)
        df_va = df.iloc[val_idx].reset_index(drop=True)
```

- Splits the data into training and validation sets for the current fold
- Resets the index for clean indexing

```python
        ds_tr = EmotionDS(df_tr, tokenizer, CONFIG.MAX_LEN)
        ds_va = EmotionDS(df_va, tokenizer, CONFIG.MAX_LEN)
```

- Creates PyTorch datasets for training and validation

```python
        dl_tr = torch.utils.data.DataLoader(ds_tr, batch_size=CONFIG.BATCH_SIZE, shuffle=True,
                                            num_workers=2, pin_memory=True)
        dl_va = torch.utils.data.DataLoader(ds_va, batch_size=CONFIG.BATCH_SIZE, shuffle=False,
                                            num_workers=2, pin_memory=True)
```

- Creates the data loaders
- **shuffle=True** for training (randomizes batch order)
- **shuffle=False** for validation (keeps a consistent order)
- **num_workers=2**: Uses 2 subprocesses for data loading
- **pin_memory=True**: Speeds up CPU→GPU transfer

```python
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG.MODEL_NAME,
            num_labels=len(CONFIG.LABELS),
            problem_type="multi_label_classification"
        )
        model.to(device)
```

- Loads the pre-trained DeBERTa model
- Configures it for 5-label multi-label classification
- Moves the model to the GPU/CPU

```python
        optimizer_params = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
        optimizer = AdamW(optimizer_params, lr=CONFIG.LR)
```

- Gets parameter groups with differential weight decay
- Creates the AdamW optimizer

```python
        total_steps = len(dl_tr) * CONFIG.EPOCHS
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(total_steps * CONFIG.WARMUP_RATIO),
            num_training_steps=total_steps
        )
```

- Calculates the total number of training steps
- Creates the learning rate scheduler:
  - Warmup: LR increases linearly for the first 10% of steps
  - Decay: LR decreases linearly to 0 over the remaining 90%
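A self-contained sketch of that warmup-then-decay shape on a dummy parameter; the step counts match the arithmetic example in Section 2 and are illustrative, not taken from the notebook's data.

```python
# Standalone illustration of get_linear_schedule_with_warmup on a dummy parameter.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

param = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([param], lr=1.5e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=200, num_training_steps=2000)

for step in range(2000):
    optimizer.step()            # no-op here (no gradients); present only to mirror the real loop
    scheduler.step()
    if step in (0, 199, 1099, 1999):
        print(step, scheduler.get_last_lr()[0])   # LR ramps up to 1.5e-5, then decays toward 0
```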
```python
        criterion = nn.BCEWithLogitsLoss()
        scaler = GradScaler(enabled=True)
```

- **BCEWithLogitsLoss**: Binary cross-entropy loss (with built-in sigmoid) for multi-label classification
- Creates the gradient scaler for mixed precision

```python
        best_f1 = 0
        best_state = None
```

- Initializes tracking for the best model

```python
        for ep in range(CONFIG.EPOCHS):
            train_loss = train_one_epoch(model, dl_tr, optimizer, scheduler, scaler, criterion)
            val_loss, val_preds, val_targs = validate(model, dl_va, criterion)
```

- Trains for one epoch
- Validates on the validation set

```python
            val_f1 = f1_score(val_targs, (val_preds >= 0.5).astype(int), average="macro", zero_division=0)
```

- Calculates the macro F1 score (average F1 across all labels)
- Uses a 0.5 threshold for predictions

```python
            print(f"Ep {ep+1}: TrLoss={train_loss:.4f} | VaLoss={val_loss:.4f} | VaF1={val_f1:.4f}")
```

- Prints the epoch metrics

```python
            if val_f1 > best_f1:
                best_f1 = val_f1
                best_state = model.state_dict()
```

- Saves the model state if the validation F1 improves

```python
        torch.save(best_state, f"model_fold_{fold}.pth")
```

- Saves the best model weights for this fold to disk

```python
        model.load_state_dict(best_state)
        _, val_preds, _ = validate(model, dl_va, criterion)
        oof_preds[val_idx] = val_preds
```

- Loads the best weights
- Gets predictions on the validation set
- Stores them as out-of-fold predictions

```python
        del model, optimizer, scaler, scheduler
        torch.cuda.empty_cache()
        gc.collect()
```

- Deletes objects to free memory
- Clears the GPU cache
- Runs the garbage collector

```python
    return oof_preds, df[CONFIG.LABELS].values
```

- Returns the out-of-fold predictions and the true labels

```python
if os.path.exists(CONFIG.TRAIN_CSV):
    oof_preds, y_true = run_training()
else:
    print("Skipping training as data is not found (likely in a dry-run environment).")
```

- Executes training if the data exists
- Otherwise skips gracefully

---

## Section 8: Threshold Optimization

### Lines 340-347: Threshold Tuning

```python
if os.path.exists(CONFIG.TRAIN_CSV):
    best_thresholds = tune_thresholds(y_true, oof_preds)
```

- Finds the optimal threshold for each emotion label using the out-of-fold predictions

```python
    oof_tuned = (oof_preds >= best_thresholds).astype(int)
```

- Converts probabilities to binary predictions using the optimized thresholds

```python
    final_f1 = f1_score(y_true, oof_tuned, average="macro", zero_division=0)
    print(f"\nFinal CV Macro F1: {final_f1:.4f}")
    print(f"Best Thresholds: {best_thresholds}")
```

- Calculates the cross-validated macro F1 score with the optimized thresholds
- Prints the final performance and the optimal thresholds

```python
else:
    best_thresholds = np.array([0.5] * len(CONFIG.LABELS))
```

- Falls back to 0.5 thresholds if the training data is not available

---

## Section 9: Inference & Submission

### Lines 363-420: `predict_test` Function

```python
def predict_test(thresholds):
    if not os.path.exists(CONFIG.TEST_CSV):
        print("Test CSV not found.")
        return
```

- Checks that the test data exists

```python
    df_test = pd.read_csv(CONFIG.TEST_CSV)
    df_test = ensure_text_column(df_test)
```

- Loads the test data and ensures a text column

```python
    tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
    ds_test = EmotionDS(df_test, tokenizer, CONFIG.MAX_LEN, is_test=True)
    dl_test = torch.utils.data.DataLoader(ds_test, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2)
```

- Creates the tokenizer, dataset, and data loader for the test data
- `is_test=True`: No labels expected

```python
    fold_preds = []
```

- Initializes a list to store the predictions from each fold

```python
    for fold in range(CONFIG.N_FOLDS):
        model_path = f"model_fold_{fold}.pth"
        if not os.path.exists(model_path):
            print(f"Model for fold {fold} not found, skipping.")
            continue
```

- Iterates through all folds
- Checks that the saved model exists

```python
        print(f"Predicting Fold {fold+1}...")
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG.MODEL_NAME,
            num_labels=len(CONFIG.LABELS),
            problem_type="multi_label_classification"
        )
        model.load_state_dict(torch.load(model_path))
        model.to(device)
        model.eval()
```

- Loads the model architecture
- Loads the trained weights
- Sets the model to evaluation mode

```python
        preds = []
        with torch.no_grad():
            for batch in dl_test:
                batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
                with autocast(enabled=True):
                    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
                preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
```

- Makes predictions without computing gradients
- Uses mixed precision for speed
- Applies sigmoid to get probabilities

```python
        fold_preds.append(np.vstack(preds))
        del model
        torch.cuda.empty_cache()
        gc.collect()
```

- Stores this fold's predictions
- Frees memory

```python
    if not fold_preds:
        print("No predictions made.")
        return
```

- Checks whether any predictions were made

```python
    avg_preds = np.mean(fold_preds, axis=0)
```

- Averages the predictions across all folds (ensemble)

```python
    final_preds = (avg_preds >= thresholds).astype(int)
```

- Applies the optimized thresholds to get binary predictions

```python
    sub = pd.DataFrame(columns=["id"] + CONFIG.LABELS)
    sub["id"] = df_test["id"] if "id" in df_test.columns else np.arange(len(df_test))
    sub[CONFIG.LABELS] = final_preds
    sub.to_csv(CONFIG.SUBMISSION_PATH, index=False)
    print(f"Submission saved to {CONFIG.SUBMISSION_PATH}")
    print(sub.head())
```

- Creates the submission DataFrame
- Adds the ID column (from the data, or generated if missing)
- Adds the prediction columns
- Saves to CSV
- Displays the first few rows

```python
predict_test(best_thresholds)
```

- Executes the prediction function with the optimized thresholds

---

## Summary

This notebook implements a **robust emotion classification pipeline** with:

1. **K-Fold Cross-Validation**: 5-fold stratified CV for reliable performance estimates
2. **State-of-the-Art Model**: DeBERTa-v3-base transformer
3. **Optimization Techniques**:
   - Mixed precision training (faster, less memory)
   - Gradient clipping (stability)
   - Learning rate warmup and decay
   - Differential weight decay
4. **Threshold Optimization**: Per-label thresholds for better F1 scores
5. **Ensemble Prediction**: Averages predictions from all folds
6. **Memory Management**: Explicit cleanup between folds

The model predicts 5 emotions (anger, fear, joy, sadness, surprise) in a **multi-label** setting, where a text can express multiple emotions simultaneously.
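As a closing illustration of the fold averaging and thresholding from Sections 8-9, here is a tiny numpy sketch with made-up fold outputs; the shapes mirror the notebook's `(n_test, 5)` probability arrays and the thresholds are hypothetical.

```python
# Made-up fold probabilities: averaging a list of (n_test, 5) arrays keeps the same shape.
import numpy as np

fold_preds = [np.full((3, 5), p) for p in (0.2, 0.4, 0.9)]   # three hypothetical folds
avg_preds = np.mean(fold_preds, axis=0)                      # element-wise mean -> (3, 5) array of 0.5

thresholds = np.array([0.5, 0.5, 0.4, 0.6, 0.5])             # hypothetical per-label thresholds
final_preds = (avg_preds >= thresholds).astype(int)          # thresholds broadcast over rows
print(final_preds[0])                                        # [1 1 1 0 1]
```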