Deep Learning Emotion Classification - Code Explanation
This document provides a detailed line-by-line explanation of the main.ipynb notebook, which implements a multi-label emotion classification system using the DeBERTa transformer model with K-Fold cross-validation.
Section 1: Imports & Setup
Lines 18-36: Import Statements
import numpy as np
import pandas as pd
- numpy: Used for numerical operations, array manipulation, and random seed setting
- pandas: Used for data loading and manipulation (CSV files, DataFrames)
import torch
import torch.nn as nn
- torch: PyTorch deep learning framework for tensor operations and model training
- torch.nn: Neural network modules including loss functions
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
- StratifiedKFold: Creates k-fold splits while maintaining class distribution in each fold
- f1_score: Calculates F1 metric for evaluation (harmonic mean of precision and recall)
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
get_linear_schedule_with_warmup,
AutoConfig
)
- AutoTokenizer: Automatically loads the appropriate tokenizer for the specified model
- AutoModelForSequenceClassification: Pre-trained transformer model for classification tasks
- get_linear_schedule_with_warmup: Learning rate scheduler with warmup and linear decay
- AutoConfig: Model configuration loader
from torch.optim import AdamW
- AdamW: Adam optimizer with decoupled weight decay, the standard choice for fine-tuning transformers
from torch.cuda.amp import autocast, GradScaler
- autocast: Enables automatic mixed precision (AMP) to speed up training
- GradScaler: Scales gradients for mixed precision training to prevent underflow
import gc
import warnings
import os
- gc: Garbage collection to free up memory
- warnings: To suppress warning messages
- os: For file system operations and environment variables
warnings.filterwarnings("ignore")
- Suppresses all warning messages for cleaner output
Section 2: Configuration
Lines 52-68: Configuration Class
class Config:
SEED = 42
- Sets random seed for reproducibility across all random operations
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
- Defines the 5 emotion labels for multi-label classification
MODEL_NAME = "microsoft/deberta-v3-base"
- Specifies the pre-trained model (DeBERTa-v3-base, roughly 184M parameters, strong performance on language-understanding benchmarks)
MAX_LEN = 128
- Maximum sequence length for tokenization (sequences longer than this are truncated)
BATCH_SIZE = 16
- Number of samples processed together in one forward/backward pass
EPOCHS = 4
- Number of complete passes through the training dataset
LR = 1.5e-5
- Learning rate (1.5 × 10⁻⁵) - small value typical for fine-tuning transformers
WEIGHT_DECAY = 0.01
- L2 regularization strength to prevent overfitting
WARMUP_RATIO = 0.1
- Fraction of training steps used for learning rate warmup (10% of total steps)
N_FOLDS = 5
- Number of folds for K-Fold cross-validation
TRAIN_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/train.csv"
TEST_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/test.csv"
- Paths to training and test datasets (Kaggle environment paths)
SUBMISSION_PATH = "submission.csv"
- Output file for predictions
CONFIG = Config()
- Creates a global instance of the configuration class
Section 3: Seed & Device Setup
Lines 84-93: Reproducibility and Device Selection
def set_seed(seed=CONFIG.SEED):
np.random.seed(seed)
- Sets numpy's random seed for reproducible random number generation
torch.manual_seed(seed)
- Sets PyTorch's random seed for CPU operations
torch.cuda.manual_seed_all(seed)
- Sets PyTorch's random seed for all GPU devices
os.environ['PYTHONHASHSEED'] = str(seed)
- Sets hash seed for Python's built-in hash() function for reproducibility
set_seed()
- Calls the seed setting function
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
- Checks if GPU is available; uses GPU if available, otherwise falls back to CPU
- Prints the device being used for training
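A quick sanity check (not part of the notebook) that the seeding above makes random draws repeatable:
```python
# Illustrative only: identical seeds should produce identical tensors
set_seed(123)
a = torch.rand(3)
set_seed(123)
b = torch.rand(3)
print(torch.equal(a, b))  # True
```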
Section 4: Utility Functions
Lines 109-115: ensure_text_column Function
def ensure_text_column(df: pd.DataFrame) -> pd.DataFrame:
if "text" in df.columns:
return df
- Checks if DataFrame already has a "text" column; if yes, returns unchanged
for c in ["comment_text", "sentence", "content", "review"]:
if c in df.columns:
return df.rename(columns={c: "text"})
- Searches for common alternative text column names
- Renames the first matching column to "text" for standardization
raise ValueError("No text column found. Add/rename your text column to 'text'.")
- Raises an error if no text column is found
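As an illustration (not from the notebook), a DataFrame with a differently named text column gets standardized:
```python
# Hypothetical DataFrame using the alternative column name "comment_text"
df_demo = pd.DataFrame({"comment_text": ["I am so happy!", "That was terrifying."]})
df_demo = ensure_text_column(df_demo)
print(df_demo.columns.tolist())  # ['text']
```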
Lines 117-126: tune_thresholds Function
def tune_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray:
th = np.zeros(y_true.shape[1], dtype=np.float32)
- Creates array to store optimal threshold for each label (initialized to 0)
- Multi-label classification requires separate thresholds per label
for j in range(y_true.shape[1]):
best_t, best_f1 = 0.5, -1
- Iterates through each label
- Initializes best threshold to 0.5 (default) and best F1 to -1
for t in np.linspace(0.1, 0.9, 17):
- Tests 17 threshold values evenly spaced between 0.1 and 0.9
f1 = f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0)
- Calculates F1 score for current label and threshold
- Converts probabilities to binary predictions using threshold
if f1 > best_f1:
best_f1, best_t = f1, t
- Updates best threshold if current F1 is better
th[j] = best_t
return th
- Stores optimal threshold for each label and returns the array
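A toy run of tune_thresholds on made-up probabilities (values chosen purely for illustration) returns one tuned threshold per label:
```python
# Two labels, four samples; probabilities are hypothetical
y_true_demo = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_prob_demo = np.array([[0.82, 0.31], [0.56, 0.74], [0.23, 0.66], [0.41, 0.12]])
print(tune_thresholds(y_true_demo, y_prob_demo))  # roughly [0.45, 0.35] for this toy data
```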
Lines 128-141: get_optimizer_params Function
def get_optimizer_params(model, lr, weight_decay):
param_optimizer = list(model.named_parameters())
- Gets all model parameters with their names
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
- Lists parameters that should NOT have weight decay applied
- Bias and LayerNorm parameters are typically trained without weight decay
optimizer_parameters = [
{
"params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
"weight_decay": weight_decay,
},
- First parameter group: all parameters EXCEPT bias and LayerNorm
- These parameters will have weight decay applied
{
"params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
"weight_decay": 0.0,
},
]
- Second parameter group: only bias and LayerNorm parameters
- These parameters have weight decay set to 0.0
return optimizer_parameters
- Returns grouped parameters for differential weight decay
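A usage sketch: assuming a model has already been instantiated (as in Section 7), the groups plug straight into AdamW and can be inspected:
```python
# Illustrative wiring of the parameter groups into the optimizer
groups = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(groups, lr=CONFIG.LR)
for g in optimizer.param_groups:
    print(len(g["params"]), "tensors, weight_decay =", g["weight_decay"])
```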
Section 5: Dataset Class
Lines 157-180: EmotionDS Class
class EmotionDS(torch.utils.data.Dataset):
def __init__(self, df, tokenizer, max_len, is_test=False):
- Custom PyTorch Dataset class for emotion classification
- is_test flag indicates whether this is test data (no labels)
self.texts = df["text"].tolist()
- Extracts text data as a Python list
self.is_test = is_test
if not is_test:
self.labels = df[CONFIG.LABELS].values.astype(np.float32)
- Stores test flag
- If training data, extracts multi-label targets as float32 array
self.tok = tokenizer
self.max_len = max_len
- Stores tokenizer and max length for later use
def __len__(self):
return len(self.texts)
- Returns dataset size (required by PyTorch)
def __getitem__(self, i):
enc = self.tok(
self.texts[i],
truncation=True,
padding="max_length",
max_length=self.max_len,
return_tensors="pt",
)
- Tokenizes the text at index i
- truncation: Cuts text longer than max_len
- padding: Pads shorter sequences to max_len
- return_tensors="pt": Returns PyTorch tensors
item = {k: v.squeeze(0) for k, v in enc.items()}
- Removes the batch dimension (1, seq_len) → (seq_len)
- Returns dict with keys: input_ids, attention_mask, token_type_ids (if applicable)
if not self.is_test:
item["labels"] = torch.tensor(self.labels[i])
return item
- Adds labels to the item dict if training data
- Returns the complete item
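To make the shapes concrete, here is an illustrative instantiation (not from the notebook) on a one-row toy DataFrame with made-up labels:
```python
# Toy example: one text with a single positive label ("surprise")
tok = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
toy_df = pd.DataFrame({
    "text": ["I can't believe this happened!"],
    "anger": [0], "fear": [0], "joy": [0], "sadness": [0], "surprise": [1],
})
ds_demo = EmotionDS(toy_df, tok, CONFIG.MAX_LEN)
item = ds_demo[0]
print(item["input_ids"].shape)  # torch.Size([128]) after padding to MAX_LEN
print(item["labels"])           # tensor([0., 0., 0., 0., 1.])
```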
Section 6: Training & Validation Helper Functions
Lines 196-213: train_one_epoch Function
def train_one_epoch(model, loader, optimizer, scheduler, scaler, criterion):
model.train()
- Sets model to training mode (enables dropout, batch normalization updates)
losses = []
for batch in loader:
- Initializes list to track losses
- Iterates through batches
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
- Moves batch data to GPU (or CPU)
- non_blocking=True: Async transfer for faster processing
optimizer.zero_grad(set_to_none=True)
- Clears gradients from previous step
- set_to_none=True: More memory efficient than setting to zero
with autocast(enabled=True):
out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
loss = criterion(out.logits, batch["labels"])
- autocast: Uses mixed precision (float16) for faster computation
- Forward pass through model
- Calculates loss between predictions (logits) and true labels
scaler.scale(loss).backward()
- Scales loss to prevent gradient underflow in mixed precision
- Computes gradients via backpropagation
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
- Unscales gradients before clipping
- Clips gradients to maximum norm of 1.0 to prevent exploding gradients
scaler.step(optimizer)
scaler.update()
- Updates model parameters (the gradients were already unscaled and clipped; the step is skipped if they contain inf/NaN)
- Updates the scaler's loss-scale factor for the next iteration
scheduler.step()
- Updates learning rate according to schedule
losses.append(loss.item())
return np.mean(losses)
- Stores loss value
- Returns average loss for the epoch
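For readers unfamiliar with the AMP step order, here is a standalone sketch (toy linear model and random data, not from the notebook) that exercises the same scale → unscale → clip → step → update sequence:
```python
# Minimal AMP step on a toy model; autocast/GradScaler become no-ops on CPU-only machines
model_t = nn.Linear(4, 2).to(device)
opt_t = torch.optim.SGD(model_t.parameters(), lr=0.1)
scaler_t = GradScaler(enabled=torch.cuda.is_available())
x, y = torch.randn(8, 4, device=device), torch.randn(8, 2, device=device)

opt_t.zero_grad(set_to_none=True)
with autocast(enabled=torch.cuda.is_available()):
    loss = nn.functional.mse_loss(model_t(x), y)
scaler_t.scale(loss).backward()
scaler_t.unscale_(opt_t)                                    # back to true gradient scale
torch.nn.utils.clip_grad_norm_(model_t.parameters(), 1.0)   # clip on unscaled gradients
scaler_t.step(opt_t)                                        # skipped if gradients overflow
scaler_t.update()
```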
Lines 215-230: validate Function
def validate(model, loader, criterion):
model.eval()
- Sets model to evaluation mode (disables dropout, fixes batch norm)
losses = []
preds = []
targs = []
- Initializes lists for losses, predictions, and targets
with torch.no_grad():
- Disables gradient computation (saves memory and speeds up inference)
for batch in loader:
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
with autocast(enabled=True):
out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
loss = criterion(out.logits, batch["labels"])
- Moves batch to device
- Forward pass with mixed precision
- Calculates validation loss
losses.append(loss.item())
preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
targs.append(batch["labels"].cpu().numpy())
- Stores loss
- Applies sigmoid to convert logits to probabilities [0, 1]
- Moves predictions and targets to CPU as numpy arrays
return np.mean(losses), np.vstack(preds), np.vstack(targs)
- Returns average loss, stacked predictions, and stacked targets
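A toy computation (made-up numbers) showing how the stacked probabilities and targets turn into a macro F1 at the 0.5 threshold used in the next section:
```python
# Three labels, two samples; probabilities and targets are hypothetical
probs = np.array([[0.9, 0.2, 0.7], [0.3, 0.8, 0.4]])
targs = np.array([[1, 0, 1], [0, 1, 1]])
print(f1_score(targs, (probs >= 0.5).astype(int), average="macro", zero_division=0))  # ≈ 0.889
```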
Section 7: Main K-Fold Training Loop
Lines 246-324: run_training Function
def run_training():
if not os.path.exists(CONFIG.TRAIN_CSV):
print("Train CSV not found. Please check the path.")
return None, None
- Checks if training data exists
- Returns None if not found (graceful failure)
df = pd.read_csv(CONFIG.TRAIN_CSV)
df = ensure_text_column(df)
- Loads training data
- Ensures text column exists
skf = StratifiedKFold(n_splits=CONFIG.N_FOLDS, shuffle=True, random_state=CONFIG.SEED)
y_str = df[CONFIG.LABELS].astype(str).agg("".join, axis=1)
- Creates 5-fold stratified splitter
- Converts multi-label to string representation for stratification
- Example: [1,0,1,0,0] → "10100"
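A small illustration (toy rows, not from the notebook) of how the label columns collapse into stratification keys:
```python
# Two toy rows: [1,0,1,0,0] -> "10100", [0,0,1,0,1] -> "00101"
toy = pd.DataFrame({"anger": [1, 0], "fear": [0, 0], "joy": [1, 1], "sadness": [0, 0], "surprise": [0, 1]})
print(toy[CONFIG.LABELS].astype(str).agg("".join, axis=1).tolist())  # ['10100', '00101']
```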
oof_preds = np.zeros((len(df), len(CONFIG.LABELS)))
- Initializes out-of-fold predictions array (for all training samples)
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
- Loads DeBERTa tokenizer
for fold, (train_idx, val_idx) in enumerate(skf.split(df, y_str)):
print(f"\n{'='*20} FOLD {fold+1}/{CONFIG.N_FOLDS} {'='*20}")
- Iterates through each fold
- train_idx: indices for training, val_idx: indices for validation
df_tr = df.iloc[train_idx].reset_index(drop=True)
df_va = df.iloc[val_idx].reset_index(drop=True)
- Splits data into training and validation sets for current fold
- Resets index for clean indexing
ds_tr = EmotionDS(df_tr, tokenizer, CONFIG.MAX_LEN)
ds_va = EmotionDS(df_va, tokenizer, CONFIG.MAX_LEN)
- Creates PyTorch datasets for training and validation
dl_tr = torch.utils.data.DataLoader(ds_tr, batch_size=CONFIG.BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
dl_va = torch.utils.data.DataLoader(ds_va, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
- Creates data loaders
- shuffle=True for training (randomizes batch order)
- shuffle=False for validation (keeps consistent order)
- num_workers=2: Uses 2 subprocesses for data loading
- pin_memory=True: Speeds up CPU→GPU transfer
model = AutoModelForSequenceClassification.from_pretrained(
CONFIG.MODEL_NAME,
num_labels=len(CONFIG.LABELS),
problem_type="multi_label_classification"
)
model.to(device)
- Loads pre-trained DeBERTa model
- Configures for 5-label multi-label classification
- Moves model to GPU/CPU
optimizer_params = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(optimizer_params, lr=CONFIG.LR)
- Gets parameter groups with differential weight decay
- Creates AdamW optimizer
total_steps = len(dl_tr) * CONFIG.EPOCHS
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=int(total_steps * CONFIG.WARMUP_RATIO),
num_training_steps=total_steps
)
- Calculates total training steps
- Creates learning rate scheduler:
- Warmup: LR increases linearly for 10% of steps
- Decay: LR decreases linearly to 0 for remaining 90%
criterion = nn.BCEWithLogitsLoss()
scaler = GradScaler(enabled=True)
- BCEWithLogitsLoss: Combines a sigmoid with binary cross-entropy, applied independently to each of the 5 labels (the standard loss for multi-label classification)
- Creates gradient scaler for mixed precision
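As a quick illustration (made-up logits), BCEWithLogitsLoss applies the sigmoid internally and averages one binary cross-entropy term per label:
```python
# One sample, five emotion logits; labels mark "anger" and "joy" as present
demo_criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0, 1.0]])
labels = torch.tensor([[1.0, 0.0, 1.0, 0.0, 0.0]])
print(demo_criterion(logits, labels))  # scalar loss averaged over all 5 label terms
```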
best_f1 = 0
best_state = None
- Initializes tracking for best model
for ep in range(CONFIG.EPOCHS):
train_loss = train_one_epoch(model, dl_tr, optimizer, scheduler, scaler, criterion)
val_loss, val_preds, val_targs = validate(model, dl_va, criterion)
- Trains for one epoch
- Validates on validation set
val_f1 = f1_score(val_targs, (val_preds >= 0.5).astype(int), average="macro", zero_division=0)
- Calculates macro F1 score (average F1 across all labels)
- Uses 0.5 threshold for predictions
print(f"Ep {ep+1}: TrLoss={train_loss:.4f} | VaLoss={val_loss:.4f} | VaF1={val_f1:.4f}")
- Prints epoch metrics
if val_f1 > best_f1:
best_f1 = val_f1
best_state = model.state_dict()
- Saves model state if validation F1 improves
torch.save(best_state, f"model_fold_{fold}.pth")
- Saves best model weights to disk
model.load_state_dict(best_state)
_, val_preds, _ = validate(model, dl_va, criterion)
oof_preds[val_idx] = val_preds
- Loads best weights
- Gets predictions on validation set
- Stores out-of-fold predictions
del model, optimizer, scaler, scheduler
torch.cuda.empty_cache()
gc.collect()
- Deletes objects to free memory
- Clears GPU cache
- Runs garbage collector
return oof_preds, df[CONFIG.LABELS].values
- Returns out-of-fold predictions and true labels
if os.path.exists(CONFIG.TRAIN_CSV):
oof_preds, y_true = run_training()
else:
print("Skipping training as data is not found (likely in a dry-run environment).")
- Executes training if data exists
- Otherwise skips gracefully
Section 8: Threshold Optimization
Lines 340-347: Threshold Tuning
if os.path.exists(CONFIG.TRAIN_CSV):
best_thresholds = tune_thresholds(y_true, oof_preds)
- Finds the optimal threshold for each emotion label using the out-of-fold predictions
oof_tuned = (oof_preds >= best_thresholds).astype(int)
- Converts probabilities to binary predictions using optimized thresholds
final_f1 = f1_score(y_true, oof_tuned, average="macro", zero_division=0)
print(f"\nFinal CV Macro F1: {final_f1:.4f}")
print(f"Best Thresholds: {best_thresholds}")
- Calculates cross-validated F1 score with optimized thresholds
- Prints final performance and optimal thresholds
else:
best_thresholds = np.array([0.5] * len(CONFIG.LABELS))
- Falls back to 0.5 thresholds if training data not available
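The per-label comparison relies on NumPy broadcasting: an (N, 5) probability matrix compared against a length-5 threshold vector. A toy check (made-up values):
```python
probs = np.array([[0.62, 0.10, 0.48, 0.90, 0.33],
                  [0.20, 0.55, 0.71, 0.05, 0.80]])
th = np.array([0.45, 0.50, 0.55, 0.40, 0.60])  # one threshold per label
print((probs >= th).astype(int))
# [[1 0 0 1 0]
#  [0 1 1 0 1]]
```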
Section 9: Inference & Submission
Lines 363-420: predict_test Function
def predict_test(thresholds):
if not os.path.exists(CONFIG.TEST_CSV):
print("Test CSV not found.")
return
- Checks if test data exists
df_test = pd.read_csv(CONFIG.TEST_CSV)
df_test = ensure_text_column(df_test)
- Loads test data and ensures text column
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
ds_test = EmotionDS(df_test, tokenizer, CONFIG.MAX_LEN, is_test=True)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2)
- Creates tokenizer, dataset, and data loader for test data
- is_test=True: No labels expected
fold_preds = []
- Initializes list to store predictions from each fold
for fold in range(CONFIG.N_FOLDS):
model_path = f"model_fold_{fold}.pth"
if not os.path.exists(model_path):
print(f"Model for fold {fold} not found, skipping.")
continue
- Iterates through all folds
- Checks if model exists
print(f"Predicting Fold {fold+1}...")
model = AutoModelForSequenceClassification.from_pretrained(
CONFIG.MODEL_NAME,
num_labels=len(CONFIG.LABELS),
problem_type="multi_label_classification"
)
model.load_state_dict(torch.load(model_path))
model.to(device)
model.eval()
- Loads model architecture
- Loads trained weights
- Sets to evaluation mode
preds = []
with torch.no_grad():
for batch in dl_test:
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
with autocast(enabled=True):
out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
- Makes predictions without computing gradients
- Uses mixed precision for speed
- Applies sigmoid to get probabilities
fold_preds.append(np.vstack(preds))
del model
torch.cuda.empty_cache()
gc.collect()
- Stores fold predictions
- Frees memory
if not fold_preds:
print("No predictions made.")
return
- Checks if any predictions were made
avg_preds = np.mean(fold_preds, axis=0)
- Averages predictions across all folds (ensemble)
final_preds = (avg_preds >= thresholds).astype(int)
- Applies optimized thresholds to get binary predictions
sub = pd.DataFrame(columns=["id"] + CONFIG.LABELS)
sub["id"] = df_test["id"] if "id" in df_test.columns else np.arange(len(df_test))
sub[CONFIG.LABELS] = final_preds
sub.to_csv(CONFIG.SUBMISSION_PATH, index=False)
print(f"Submission saved to {CONFIG.SUBMISSION_PATH}")
print(sub.head())
- Creates submission DataFrame
- Adds ID column (from data or generated)
- Adds prediction columns
- Saves to CSV
- Displays first few rows
predict_test(best_thresholds)
- Executes prediction function with optimized thresholds
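A toy sketch (made-up numbers) of the fold ensemble described above: element-wise averaging of the per-fold probability matrices, then thresholding:
```python
fold_preds_demo = [np.array([[0.60, 0.20]]), np.array([[0.40, 0.40]]), np.array([[0.80, 0.30]])]
avg = np.mean(fold_preds_demo, axis=0)       # mean across folds -> shape (1, 2)
print(avg)                                   # [[0.6 0.3]]
print((avg >= 0.5).astype(int))              # [[1 0]]
```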
Summary
This notebook implements a robust emotion classification pipeline with:
- K-Fold Cross-Validation: 5-fold stratified CV for reliable performance estimates
- State-of-the-Art Model: DeBERTa-v3-base transformer
- Optimization Techniques:
- Mixed precision training (faster, less memory)
- Gradient clipping (stability)
- Learning rate warmup and decay
- Differential weight decay
- Threshold Optimization: Per-label thresholds for better F1 scores
- Ensemble Prediction: Averages predictions from all folds
- Memory Management: Explicit cleanup between folds
The model predicts 5 emotions (anger, fear, joy, sadness, surprise) in a multi-label setting, where text can have multiple emotions simultaneously.