Deep Learning Emotion Classification - Code Explanation
This document provides a detailed line-by-line explanation of the main.ipynb notebook, which implements a multi-label emotion classification system using the DeBERTa transformer model with K-Fold cross-validation.
Section 1: Imports & Setup
Lines 18-36: Import Statements
import numpy as np
import pandas as pd
- numpy: Used for numerical operations, array manipulation, and random seed setting
- pandas: Used for data loading and manipulation (CSV files, DataFrames)
import torch
import torch.nn as nn
- torch: PyTorch deep learning framework for tensor operations and model training
- torch.nn: Neural network modules including loss functions
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
- StratifiedKFold: Creates k-fold splits while maintaining class distribution in each fold
- f1_score: Calculates F1 metric for evaluation (harmonic mean of precision and recall)
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
get_linear_schedule_with_warmup,
AutoConfig
)
- AutoTokenizer: Automatically loads the appropriate tokenizer for the specified model
- AutoModelForSequenceClassification: Pre-trained transformer model for classification tasks
- get_linear_schedule_with_warmup: Learning rate scheduler with warmup and linear decay
- AutoConfig: Model configuration loader
from torch.optim import AdamW
- AdamW: Adam optimizer with decoupled weight decay, the standard choice for fine-tuning transformers
from torch.cuda.amp import autocast, GradScaler
- autocast: Enables automatic mixed precision (AMP) to speed up training
- GradScaler: Scales gradients for mixed precision training to prevent underflow
import gc
import warnings
import os
- gc: Garbage collection to free up memory
- warnings: To suppress warning messages
- os: For file system operations and environment variables
warnings.filterwarnings("ignore")
- Suppresses all warning messages for cleaner output
Section 2: Configuration
Lines 52-68: Configuration Class
class Config:
SEED = 42
- Sets random seed for reproducibility across all random operations
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
- Defines the 5 emotion labels for multi-label classification
MODEL_NAME = "microsoft/deberta-v3-base"
- Specifies the pre-trained model (DeBERTa-v3-base, roughly 184M parameters, strong performance on language-understanding benchmarks)
MAX_LEN = 128
- Maximum sequence length for tokenization (sequences longer than this are truncated)
BATCH_SIZE = 16
- Number of samples processed together in one forward/backward pass
EPOCHS = 4
- Number of complete passes through the training dataset
LR = 1.5e-5
- Learning rate (1.5 × 10⁻⁵) - small value typical for fine-tuning transformers
WEIGHT_DECAY = 0.01
- L2 regularization strength to prevent overfitting
WARMUP_RATIO = 0.1
- Fraction of training steps used for learning rate warmup (10% of total steps)
N_FOLDS = 5
- Number of folds for K-Fold cross-validation
TRAIN_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/train.csv"
TEST_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/test.csv"
- Paths to training and test datasets (Kaggle environment paths)
SUBMISSION_PATH = "submission.csv"
- Output file for predictions
CONFIG = Config()
- Creates a global instance of the configuration class
Section 3: Seed & Device Setup
Lines 84-93: Reproducibility and Device Selection
def set_seed(seed=CONFIG.SEED):
np.random.seed(seed)
- Sets numpy's random seed for reproducible random number generation
torch.manual_seed(seed)
- Sets PyTorch's random seed for CPU operations
torch.cuda.manual_seed_all(seed)
- Sets PyTorch's random seed for all GPU devices
os.environ['PYTHONHASHSEED'] = str(seed)
- Sets hash seed for Python's built-in hash() function for reproducibility
set_seed()
- Calls the seed setting function
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
- Checks if GPU is available; uses GPU if available, otherwise falls back to CPU
- Prints the device being used for training
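A quick sanity check (not part of the notebook) that the seeding above makes random draws repeatable:
```python
# Illustrative only: identical seeds should produce identical tensors
set_seed(123)
a = torch.rand(3)
set_seed(123)
b = torch.rand(3)
print(torch.equal(a, b))  # True
```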
Section 4: Utility Functions
Lines 109-115: ensure_text_column Function
def ensure_text_column(df: pd.DataFrame) -> pd.DataFrame:
if "text" in df.columns:
return df
- Checks if DataFrame already has a "text" column; if yes, returns unchanged
for c in ["comment_text", "sentence", "content", "review"]:
if c in df.columns:
return df.rename(columns={c: "text"})
- Searches for common alternative text column names
- Renames the first matching column to "text" for standardization
raise ValueError("No text column found. Add/rename your text column to 'text'.")
- Raises an error if no text column is found
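As an illustration (not from the notebook), a DataFrame with a differently named text column gets standardized:
```python
# Hypothetical DataFrame using the alternative column name "comment_text"
df_demo = pd.DataFrame({"comment_text": ["I am so happy!", "That was terrifying."]})
df_demo = ensure_text_column(df_demo)
print(df_demo.columns.tolist())  # ['text']
```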
Lines 117-126: tune_thresholds Function
def tune_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray:
th = np.zeros(y_true.shape[1], dtype=np.float32)
- Creates array to store optimal threshold for each label (initialized to 0)
- Multi-label classification requires separate thresholds per label
for j in range(y_true.shape[1]):
best_t, best_f1 = 0.5, -1
- Iterates through each label
- Initializes best threshold to 0.5 (default) and best F1 to -1
for t in np.linspace(0.1, 0.9, 17):
- Tests 17 threshold values evenly spaced between 0.1 and 0.9
f1 = f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0)
- Calculates F1 score for current label and threshold
- Converts probabilities to binary predictions using threshold
if f1 > best_f1:
best_f1, best_t = f1, t
- Updates best threshold if current F1 is better
th[j] = best_t
return th
- Stores optimal threshold for each label and returns the array
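A toy run of tune_thresholds on made-up probabilities (values chosen purely for illustration) returns one tuned threshold per label:
```python
# Two labels, four samples; probabilities are hypothetical
y_true_demo = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_prob_demo = np.array([[0.82, 0.31], [0.56, 0.74], [0.23, 0.66], [0.41, 0.12]])
print(tune_thresholds(y_true_demo, y_prob_demo))  # roughly [0.45, 0.35] for this toy data
```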
Lines 128-141: get_optimizer_params Function
def get_optimizer_params(model, lr, weight_decay):
param_optimizer = list(model.named_parameters())
- Gets all model parameters with their names
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
- Lists parameters that should NOT have weight decay applied
- Bias and LayerNorm parameters are typically trained without weight decay
optimizer_parameters = [
{
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
"weight_decay": weight_decay,
},
- First parameter group: all parameters EXCEPT bias and LayerNorm
- These parameters will have weight decay applied
{
"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
- Second parameter group: only bias and LayerNorm parameters
- These parameters have weight decay set to 0.0
return optimizer_parameters
- Returns grouped parameters for differential weight decay
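A usage sketch: assuming a model has already been instantiated (as in Section 7), the groups plug straight into AdamW and can be inspected:
```python
# Illustrative wiring of the parameter groups into the optimizer
groups = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(groups, lr=CONFIG.LR)
for g in optimizer.param_groups:
    print(len(g["params"]), "tensors, weight_decay =", g["weight_decay"])
```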
Section 5: Dataset Class
Lines 157-180: EmotionDS Class
class EmotionDS(torch.utils.data.Dataset):
def __init__(self, df, tokenizer, max_len, is_test=False):
- Custom PyTorch Dataset class for emotion classification
- is_test flag indicates whether this is test data (no labels)
self.texts = df["text"].tolist()
- Extracts text data as a Python list
self.is_test = is_test
if not is_test:
self.labels = df[CONFIG.LABELS].values.astype(np.float32)
- Stores test flag
- If training data, extracts multi-label targets as float32 array
self.tok = tokenizer
self.max_len = max_len
- Stores tokenizer and max length for later use
def __len__(self):
return len(self.texts)
- Returns dataset size (required by PyTorch)
def __getitem__(self, i):
enc = self.tok(
self.texts[i],
truncation=True,
padding="max_length",
max_length=self.max_len,
return_tensors="pt",
)
- Tokenizes the text at index i
- truncation: Cuts text longer than max_len
- padding: Pads shorter sequences to max_len
- return_tensors="pt": Returns PyTorch tensors
item = {k: v.squeeze(0) for k, v in enc.items()}
- Removes the batch dimension (1, seq_len) → (seq_len)
- Returns dict with keys: input_ids, attention_mask, token_type_ids (if applicable)
if not self.is_test:
item["labels"] = torch.tensor(self.labels[i])
return item
- Adds labels to the item dict if training data
- Returns the complete item
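To make the shapes concrete, here is an illustrative instantiation (not from the notebook) on a one-row toy DataFrame with made-up labels:
```python
# Toy example: one text with a single positive label ("surprise")
tok = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
toy_df = pd.DataFrame({
    "text": ["I can't believe this happened!"],
    "anger": [0], "fear": [0], "joy": [0], "sadness": [0], "surprise": [1],
})
ds_demo = EmotionDS(toy_df, tok, CONFIG.MAX_LEN)
item = ds_demo[0]
print(item["input_ids"].shape)  # torch.Size([128]) after padding to MAX_LEN
print(item["labels"])           # tensor([0., 0., 0., 0., 1.])
```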
Section 6: Training & Validation Helper Functions
Lines 196-213: train_one_epoch Function
def train_one_epoch(model, loader, optimizer, scheduler, scaler, criterion):
model.train()
- Sets model to training mode (enables dropout, batch normalization updates)
losses = []
for batch in loader:
- Initializes list to track losses
- Iterates through batches
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
- Moves batch data to GPU (or CPU)
- non_blocking=True: Async transfer for faster processing
optimizer.zero_grad(set_to_none=True)
- Clears gradients from previous step
- set_to_none=True: More memory efficient than setting to zero
with autocast(enabled=True):
out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
loss = criterion(out.logits, batch["labels"])
- autocast: Uses mixed precision (float16) for faster computation
- Forward pass through model
- Calculates loss between predictions (logits) and true labels
scaler.scale(loss).backward()
- Scales loss to prevent gradient underflow in mixed precision
- Computes gradients via backpropagation
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
- Unscales gradients before clipping
- Clips gradients to maximum norm of 1.0 to prevent exploding gradients
scaler.step(optimizer)
scaler.update()
- Updates model parameters (the gradients were already unscaled and clipped; the step is skipped if they contain inf/NaN)
- Updates the scaler's loss-scale factor for the next iteration
scheduler.step()
- Updates learning rate according to schedule
losses.append(loss.item())
return np.mean(losses)
- Stores loss value
- Returns average loss for the epoch
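For readers unfamiliar with the AMP step order, here is a standalone sketch (toy linear model and random data, not from the notebook) that exercises the same scale → unscale → clip → step → update sequence:
```python
# Minimal AMP step on a toy model; autocast/GradScaler become no-ops on CPU-only machines
model_t = nn.Linear(4, 2).to(device)
opt_t = torch.optim.SGD(model_t.parameters(), lr=0.1)
scaler_t = GradScaler(enabled=torch.cuda.is_available())
x, y = torch.randn(8, 4, device=device), torch.randn(8, 2, device=device)

opt_t.zero_grad(set_to_none=True)
with autocast(enabled=torch.cuda.is_available()):
    loss = nn.functional.mse_loss(model_t(x), y)
scaler_t.scale(loss).backward()
scaler_t.unscale_(opt_t)                                    # back to true gradient scale
torch.nn.utils.clip_grad_norm_(model_t.parameters(), 1.0)   # clip on unscaled gradients
scaler_t.step(opt_t)                                        # skipped if gradients overflow
scaler_t.update()
```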
Lines 215-230: validate Function
def validate(model, loader, criterion):
model.eval()
- Sets model to evaluation mode (disables dropout, fixes batch norm)
losses = []
preds = []
targs = []
- Initializes lists for losses, predictions, and targets
with torch.no_grad():
- Disables gradient computation (saves memory and speeds up inference)
for batch in loader:
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
with autocast(enabled=True):
out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
loss = criterion(out.logits, batch["labels"])
- Moves batch to device
- Forward pass with mixed precision
- Calculates validation loss
losses.append(loss.item())
preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
targs.append(batch["labels"].cpu().numpy())
- Stores loss
- Applies sigmoid to convert logits to probabilities [0, 1]
- Moves predictions and targets to CPU as numpy arrays
return np.mean(losses), np.vstack(preds), np.vstack(targs)
- Returns average loss, stacked predictions, and stacked targets
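A toy computation (made-up numbers) showing how the stacked probabilities and targets turn into a macro F1 at the 0.5 threshold used in the next section:
```python
# Three labels, two samples; probabilities and targets are hypothetical
probs = np.array([[0.9, 0.2, 0.7], [0.3, 0.8, 0.4]])
targs = np.array([[1, 0, 1], [0, 1, 1]])
print(f1_score(targs, (probs >= 0.5).astype(int), average="macro", zero_division=0))  # ≈ 0.889
```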
Section 7: Main K-Fold Training Loop
Lines 246-324: run_training Function
def run_training():
if not os.path.exists(CONFIG.TRAIN_CSV):
print("Train CSV not found. Please check the path.")
return None, None
- Checks if training data exists
- Returns None if not found (graceful failure)
df = pd.read_csv(CONFIG.TRAIN_CSV)
df = ensure_text_column(df)
- Loads training data
- Ensures text column exists
skf = StratifiedKFold(n_splits=CONFIG.N_FOLDS, shuffle=True, random_state=CONFIG.SEED)
y_str = df[CONFIG.LABELS].astype(str).agg("".join, axis=1)
- Creates 5-fold stratified splitter
- Converts multi-label to string representation for stratification
- Example: [1,0,1,0,0] → "10100"
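A small illustration (toy rows, not from the notebook) of how the label columns collapse into stratification keys:
```python
# Two toy rows: [1,0,1,0,0] -> "10100", [0,0,1,0,1] -> "00101"
toy = pd.DataFrame({"anger": [1, 0], "fear": [0, 0], "joy": [1, 1], "sadness": [0, 0], "surprise": [0, 1]})
print(toy[CONFIG.LABELS].astype(str).agg("".join, axis=1).tolist())  # ['10100', '00101']
```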
oof_preds = np.zeros((len(df), len(CONFIG.LABELS)))
- Initializes out-of-fold predictions array (for all training samples)
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
- Loads DeBERTa tokenizer
for fold, (train_idx, val_idx) in enumerate(skf.split(df, y_str)):
print(f"\n{'='*20} FOLD {fold+1}/{CONFIG.N_FOLDS} {'='*20}")
- Iterates through each fold
- train_idx: indices for training, val_idx: indices for validation
df_tr = df.iloc[train_idx].reset_index(drop=True)
df_va = df.iloc[val_idx].reset_index(drop=True)
- Splits data into training and validation sets for current fold
- Resets index for clean indexing
ds_tr = EmotionDS(df_tr, tokenizer, CONFIG.MAX_LEN)
ds_va = EmotionDS(df_va, tokenizer, CONFIG.MAX_LEN)
- Creates PyTorch datasets for training and validation
dl_tr = torch.utils.data.DataLoader(ds_tr, batch_size=CONFIG.BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
dl_va = torch.utils.data.DataLoader(ds_va, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
- Creates data loaders
- shuffle=True for training (randomizes batch order)
- shuffle=False for validation (keeps consistent order)
- num_workers=2: Uses 2 subprocesses for data loading
- pin_memory=True: Speeds up CPU→GPU transfer
model = AutoModelForSequenceClassification.from_pretrained(
CONFIG.MODEL_NAME,
num_labels=len(CONFIG.LABELS),
problem_type="multi_label_classification"
)
model.to(device)
- Loads pre-trained DeBERTa model
- Configures for 5-label multi-label classification
- Moves model to GPU/CPU
optimizer_params = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(optimizer_params, lr=CONFIG.LR)
- Gets parameter groups with differential weight decay
- Creates AdamW optimizer
total_steps = len(dl_tr) * CONFIG.EPOCHS
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=int(total_steps * CONFIG.WARMUP_RATIO),
num_training_steps=total_steps
)
- Calculates total training steps
- Creates learning rate scheduler:
- Warmup: LR increases linearly for 10% of steps
- Decay: LR decreases linearly to 0 for remaining 90%
criterion = nn.BCEWithLogitsLoss()
scaler = GradScaler(enabled=True)
- BCEWithLogitsLoss: Combines a sigmoid with binary cross-entropy, applied independently to each of the 5 labels (the standard loss for multi-label classification)
- Creates gradient scaler for mixed precision
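As a quick illustration (made-up logits), BCEWithLogitsLoss applies the sigmoid internally and averages one binary cross-entropy term per label:
```python
# One sample, five emotion logits; labels mark "anger" and "joy" as present
demo_criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0, 1.0]])
labels = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0]])
print(demo_criterion(logits, labels))  # scalar loss averaged over all 5 label terms
```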
best_f1 = 0
best_state = None
- Initializes tracking for best model
for ep in range(CONFIG.EPOCHS):
train_loss = train_one_epoch(model, dl_tr, optimizer, scheduler, scaler, criterion)
val_loss, val_preds, val_targs = validate(model, dl_va, criterion)
- Trains for one epoch
- Validates on validation set
val_f1 = f1_score(val_targs, (val_preds >= 0.5).astype(int), average="macro", zero_division=0)
- Calculates macro F1 score (average F1 across all labels)
- Uses 0.5 threshold for predictions
print(f"Ep {ep+1}: TrLoss={train_loss:.4f} | VaLoss={val_loss:.4f} | VaF1={val_f1:.4f}")
- Prints epoch metrics
if val_f1 > best_f1:
best_f1 = val_f1
best_state = model.state_dict()
- Saves model state if validation F1 improves
torch.save(best_state, f"model_fold_{fold}.pth")
- Saves best model weights to disk
model.load_state_dict(best_state)
_, val_preds, _ = validate(model, dl_va, criterion)
oof_preds[val_idx] = val_preds
- Loads best weights
- Gets predictions on validation set
- Stores out-of-fold predictions
del model, optimizer, scaler, scheduler
torch.cuda.empty_cache()
gc.collect()
- Deletes objects to free memory
- Clears GPU cache
- Runs garbage collector
return oof_preds, df[CONFIG.LABELS].values
- Returns out-of-fold predictions and true labels
if os.path.exists(CONFIG.TRAIN_CSV):
oof_preds, y_true = run_training()
else:
print("Skipping training as data is not found (likely in a dry-run environment).")
- Executes training if data exists
- Otherwise skips gracefully
Section 8: Threshold Optimization
Lines 340-347: Threshold Tuning
if os.path.exists(CONFIG.TRAIN_CSV):
best_thresholds = tune_thresholds(y_true, oof_preds)
- Finds the optimal threshold for each emotion label using the out-of-fold predictions
oof_tuned = (oof_preds >= best_thresholds).astype(int)
- Converts probabilities to binary predictions using optimized thresholds
final_f1 = f1_score(y_true, oof_tuned, average="macro", zero_division=0)
print(f"\nFinal CV Macro F1: {final_f1:.4f}")
print(f"Best Thresholds: {best_thresholds}")
- Calculates cross-validated F1 score with optimized thresholds
- Prints final performance and optimal thresholds
else:
best_thresholds = np.array([0.5] * len(CONFIG.LABELS))
- Falls back to 0.5 thresholds if training data not available
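The per-label comparison relies on NumPy broadcasting: an (N, 5) probability matrix compared against a length-5 threshold vector. A toy check (made-up values):
```python
probs = np.array([[0.62, 0.10, 0.48, 0.90, 0.33],
                  [0.20, 0.55, 0.71, 0.05, 0.80]])
th = np.array([0.45, 0.50, 0.55, 0.40, 0.60])  # one threshold per label
print((probs >= th).astype(int))
# [[1 0 0 1 0]
#  [0 1 1 0 1]]
```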
Section 9: Inference & Submission
Lines 363-420: predict_test Function
def predict_test(thresholds):
if not os.path.exists(CONFIG.TEST_CSV):
print("Test CSV not found.")
return
- Checks if test data exists
df_test = pd.read_csv(CONFIG.TEST_CSV)
df_test = ensure_text_column(df_test)
- Loads test data and ensures text column
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
ds_test = EmotionDS(df_test, tokenizer, CONFIG.MAX_LEN, is_test=True)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2)
- Creates tokenizer, dataset, and data loader for test data
- is_test=True: No labels expected
fold_preds = []
- Initializes list to store predictions from each fold
for fold in range(CONFIG.N_FOLDS):
model_path = f"model_fold_{fold}.pth"
if not os.path.exists(model_path):
print(f"Model for fold {fold} not found, skipping.")
continue
- Iterates through all folds
- Checks if model exists
print(f"Predicting Fold {fold+1}...")
model = AutoModelForSequenceClassification.from_pretrained(
CONFIG.MODEL_NAME,
num_labels=len(CONFIG.LABELS),
problem_type="multi_label_classification"
)
model.load_state_dict(torch.load(model_path))
model.to(device)
model.eval()
- Loads model architecture
- Loads trained weights
- Sets to evaluation mode
preds = []
with torch.no_grad():
for batch in dl_test:
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
with autocast(enabled=True):
out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
- Makes predictions without computing gradients
- Uses mixed precision for speed
- Applies sigmoid to get probabilities
fold_preds.append(np.vstack(preds))
del model
torch.cuda.empty_cache()
gc.collect()
- Stores fold predictions
- Frees memory
if not fold_preds:
print("No predictions made.")
return
- Checks if any predictions were made
avg_preds = np.mean(fold_preds, axis=0)
- Averages predictions across all folds (ensemble)
final_preds = (avg_preds >= thresholds).astype(int)
- Applies optimized thresholds to get binary predictions
sub = pd.DataFrame(columns=["id"] + CONFIG.LABELS)
sub["id"] = df_test["id"] if "id" in df_test.columns else np.arange(len(df_test))
sub[CONFIG.LABELS] = final_preds
sub.to_csv(CONFIG.SUBMISSION_PATH, index=False)
print(f"Submission saved to {CONFIG.SUBMISSION_PATH}")
print(sub.head())
- Creates submission DataFrame
- Adds ID column (from data or generated)
- Adds prediction columns
- Saves to CSV
- Displays first few rows
predict_test(best_thresholds)
- Executes prediction function with optimized thresholds
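A toy sketch (made-up numbers) of the fold ensemble described above: element-wise averaging of the per-fold probability matrices, then thresholding:
```python
fold_preds_demo = [np.array([[0.60, 0.20]]), np.array([[0.40, 0.40]]), np.array([[0.80, 0.30]])]
avg = np.mean(fold_preds_demo, axis=0)       # mean across folds -> shape (1, 2)
print(avg)                                   # [[0.6 0.3]]
print((avg >= 0.5).astype(int))              # [[1 0]]
```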
Summary
This notebook implements a robust emotion classification pipeline with:
- K-Fold Cross-Validation: 5-fold stratified CV for reliable performance estimates
- State-of-the-Art Model: DeBERTa-v3-base transformer
- Optimization Techniques:
- Mixed precision training (faster, less memory)
- Gradient clipping (stability)
- Learning rate warmup and decay
- Differential weight decay
- Threshold Optimization: Per-label thresholds for better F1 scores
- Ensemble Prediction: Averages predictions from all folds
- Memory Management: Explicit cleanup between folds
The model predicts 5 emotions (anger, fear, joy, sadness, surprise) in a multi-label setting, where text can have multiple emotions simultaneously.