# Deep Learning Emotion Classification - Code Explanation
This document provides a detailed line-by-line explanation of the `main.ipynb` notebook, which implements a multi-label emotion classification system using the DeBERTa transformer model with K-Fold cross-validation.
---
## Section 1: Imports & Setup
### Lines 18-36: Import Statements
```python
import numpy as np
import pandas as pd
```
- **numpy**: Used for numerical operations, array manipulation, and random seed setting
- **pandas**: Used for data loading and manipulation (CSV files, DataFrames)
```python
import torch
import torch.nn as nn
```
- **torch**: PyTorch deep learning framework for tensor operations and model training
- **torch.nn**: Neural network modules including loss functions
```python
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
```
- **StratifiedKFold**: Creates k-fold splits while maintaining class distribution in each fold
- **f1_score**: Calculates F1 metric for evaluation (harmonic mean of precision and recall)
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    AutoConfig
)
```
- **AutoTokenizer**: Automatically loads the appropriate tokenizer for the specified model
- **AutoModelForSequenceClassification**: Pre-trained transformer model for classification tasks
- **get_linear_schedule_with_warmup**: Learning rate scheduler with warmup and linear decay
- **AutoConfig**: Model configuration loader
```python
from torch.optim import AdamW
```
- **AdamW**: Adam optimizer with decoupled weight decay, the standard choice for fine-tuning transformers
```python
from torch.cuda.amp import autocast, GradScaler
```
- **autocast**: Enables automatic mixed precision (AMP) to speed up training
- **GradScaler**: Scales gradients for mixed precision training to prevent underflow
```python
import gc
import warnings
import os
```
- **gc**: Garbage collection to free up memory
- **warnings**: To suppress warning messages
- **os**: For file system operations and environment variables
```python
warnings.filterwarnings("ignore")
```
- Suppresses all warning messages for cleaner output
---
## Section 2: Configuration
### Lines 52-68: Configuration Class
```python
class Config:
    SEED = 42
```
- Sets random seed for reproducibility across all random operations
```python
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
```
- Defines the 5 emotion labels for multi-label classification
```python
MODEL_NAME = "microsoft/deberta-v3-base"
```
- Specifies the pre-trained model (DeBERTa-v3-base, roughly 184M parameters including embeddings; a strong choice for text classification)
```python
MAX_LEN = 128
```
- Maximum sequence length for tokenization (sequences longer than this are truncated)
```python
BATCH_SIZE = 16
```
- Number of samples processed together in one forward/backward pass
```python
EPOCHS = 4
```
- Number of complete passes through the training dataset
```python
LR = 1.5e-5
```
- Learning rate (1.5 × 10⁻⁵) - small value typical for fine-tuning transformers
```python
WEIGHT_DECAY = 0.01
```
- Weight decay strength (applied in decoupled form by AdamW) to reduce overfitting
```python
WARMUP_RATIO = 0.1
```
- Fraction of training steps used for learning rate warmup (10% of total steps)
```python
N_FOLDS = 5
```
- Number of folds for K-Fold cross-validation
```python
TRAIN_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/train.csv"
TEST_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/test.csv"
```
- Paths to training and test datasets (Kaggle environment paths)
```python
SUBMISSION_PATH = "submission.csv"
```
- Output file for predictions
```python
CONFIG = Config()
```
- Creates a global instance of the configuration class
---
## Section 3: Seed & Device Setup
### Lines 84-93: Reproducibility and Device Selection
```python
def set_seed(seed=CONFIG.SEED):
    np.random.seed(seed)
```
- Sets numpy's random seed for reproducible random number generation
```python
torch.manual_seed(seed)
```
- Sets PyTorch's random seed for CPU operations
```python
torch.cuda.manual_seed_all(seed)
```
- Sets PyTorch's random seed for all GPU devices
```python
os.environ['PYTHONHASHSEED'] = str(seed)
```
- Sets Python's hash seed via an environment variable; this mainly affects subprocesses, since the running interpreter's hash randomization is fixed at startup
```python
set_seed()
```
- Calls the seed setting function
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```
- Checks if GPU is available; uses GPU if available, otherwise falls back to CPU
- Prints the device being used for training
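For stricter run-to-run reproducibility on a GPU, the seeding above is sometimes extended with cuDNN flags. This is a minimal sketch of such an extension (not part of the notebook; deterministic kernels can slow training):
```python
import os
import random
import numpy as np
import torch

def set_seed_strict(seed: int = 42):
    # Same seeding as the notebook's set_seed ...
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # ... plus optional cuDNN flags for deterministic GPU kernels (hypothetical addition)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```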
---
## Section 4: Utility Functions
### Lines 109-115: `ensure_text_column` Function
```python
def ensure_text_column(df: pd.DataFrame) -> pd.DataFrame:
if "text" in df.columns:
return df
```
- Checks if DataFrame already has a "text" column; if yes, returns unchanged
```python
for c in ["comment_text", "sentence", "content", "review"]:
    if c in df.columns:
        return df.rename(columns={c: "text"})
```
- Searches for common alternative text column names
- Renames the first matching column to "text" for standardization
```python
raise ValueError("No text column found. Add/rename your text column to 'text'.")
```
- Raises an error if no text column is found
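A quick usage sketch with a hypothetical DataFrame, showing the rename in action:
```python
import pandas as pd

# Hypothetical raw file that stores its text under "comment_text" instead of "text"
df_raw = pd.DataFrame({"comment_text": ["I am thrilled!", "This is terrifying."]})
df_std = ensure_text_column(df_raw)
print(df_std.columns.tolist())  # ['text']
```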
### Lines 117-126: `tune_thresholds` Function
```python
def tune_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray:
    th = np.zeros(y_true.shape[1], dtype=np.float32)
```
- Creates array to store optimal threshold for each label (initialized to 0)
- Multi-label classification requires separate thresholds per label
```python
for j in range(y_true.shape[1]):
    best_t, best_f1 = 0.5, -1
```
- Iterates through each label
- Initializes best threshold to 0.5 (default) and best F1 to -1
```python
for t in np.linspace(0.1, 0.9, 17):
```
- Tests 17 threshold values evenly spaced between 0.1 and 0.9
```python
f1 = f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0)
```
- Calculates F1 score for current label and threshold
- Converts probabilities to binary predictions using threshold
```python
if f1 > best_f1:
    best_f1, best_t = f1, t
```
- Updates best threshold if current F1 is better
```python
th[j] = best_t
return th
```
- Stores optimal threshold for each label and returns the array
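To illustrate the return value, a small synthetic example (random arrays, purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 5))  # 100 samples, 5 binary labels
y_prob = rng.random((100, 5))               # fake predicted probabilities in [0, 1]
th = tune_thresholds(y_true, y_prob)
print(th.shape, th)  # (5,) -- one tuned threshold per label
```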
### Lines 128-141: `get_optimizer_params` Function
```python
def get_optimizer_params(model, lr, weight_decay):
    param_optimizer = list(model.named_parameters())
```
- Gets all model parameters with their names
```python
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
```
- Lists parameters that should NOT have weight decay applied
- Bias and LayerNorm parameters are typically trained without weight decay
```python
optimizer_parameters = [
    {
        "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
```
- First parameter group: all parameters EXCEPT bias and LayerNorm
- These parameters will have weight decay applied
```python
    {
        "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
```
- Second parameter group: only bias and LayerNorm parameters
- These parameters have weight decay set to 0.0
```python
return optimizer_parameters
```
- Returns grouped parameters for differential weight decay
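These groups plug directly into the optimizer; a minimal sketch of how they are consumed (mirroring the call in Section 7, assuming `model` is already built):
```python
from torch.optim import AdamW

# Each group carries its own weight_decay; the learning rate is shared
groups = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(groups, lr=CONFIG.LR)
```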
---
## Section 5: Dataset Class
### Lines 157-180: `EmotionDS` Class
```python
class EmotionDS(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len, is_test=False):
```
- Custom PyTorch Dataset class for emotion classification
- `is_test` flag indicates whether this is test data (no labels)
```python
self.texts = df["text"].tolist()
```
- Extracts text data as a Python list
```python
self.is_test = is_test
if not is_test:
    self.labels = df[CONFIG.LABELS].values.astype(np.float32)
```
- Stores test flag
- If training data, extracts multi-label targets as float32 array
```python
self.tok = tokenizer
self.max_len = max_len
```
- Stores tokenizer and max length for later use
```python
def __len__(self):
    return len(self.texts)
```
- Returns dataset size (required by PyTorch)
```python
def __getitem__(self, i):
    enc = self.tok(
        self.texts[i],
        truncation=True,
        padding="max_length",
        max_length=self.max_len,
        return_tensors="pt",
    )
```
- Tokenizes the text at index `i`
- **truncation**: Cuts text longer than max_len
- **padding**: Pads shorter sequences to max_len
- **return_tensors="pt"**: Returns PyTorch tensors
```python
item = {k: v.squeeze(0) for k, v in enc.items()}
```
- Removes the batch dimension (1, seq_len) → (seq_len)
- Returns dict with keys: input_ids, attention_mask, token_type_ids (if applicable)
```python
if not self.is_test:
item["labels"] = torch.tensor(self.labels[i])
return item
```
- Adds labels to the item dict if training data
- Returns the complete item
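A short usage sketch with a tiny hypothetical DataFrame, showing the shapes the dataset produces:
```python
import pandas as pd
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
df_demo = pd.DataFrame({
    "text": ["I can't believe this happened!", "What a wonderful day."],
    "anger": [1, 0], "fear": [0, 0], "joy": [0, 1], "sadness": [0, 0], "surprise": [1, 0],
})
ds = EmotionDS(df_demo, tok, CONFIG.MAX_LEN)
item = ds[0]
print(item["input_ids"].shape)  # torch.Size([128]) -- padded/truncated to MAX_LEN
print(item["labels"])           # tensor([1., 0., 0., 0., 1.])
```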
---
## Section 6: Training & Validation Helper Functions
### Lines 196-213: `train_one_epoch` Function
```python
def train_one_epoch(model, loader, optimizer, scheduler, scaler, criterion):
    model.train()
```
- Sets model to training mode (enables dropout and other training-only behaviors)
```python
losses = []
for batch in loader:
```
- Initializes list to track losses
- Iterates through batches
```python
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
```
- Moves batch data to GPU (or CPU)
- `non_blocking=True`: Async transfer for faster processing
```python
optimizer.zero_grad(set_to_none=True)
```
- Clears gradients from previous step
- `set_to_none=True`: More memory efficient than setting to zero
```python
with autocast(enabled=True):
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    loss = criterion(out.logits, batch["labels"])
```
- **autocast**: Uses mixed precision (float16) for faster computation
- Forward pass through model
- Calculates loss between predictions (logits) and true labels
```python
scaler.scale(loss).backward()
```
- Scales loss to prevent gradient underflow in mixed precision
- Computes gradients via backpropagation
```python
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```
- Unscales gradients before clipping
- Clips gradients to maximum norm of 1.0 to prevent exploding gradients
```python
scaler.step(optimizer)
scaler.update()
```
- Applies the optimizer step using the already-unscaled gradients (the step is skipped automatically if infs/NaNs are detected)
- Updates the scaler's loss-scale factor for the next iteration
```python
scheduler.step()
```
- Updates learning rate according to schedule
```python
losses.append(loss.item())
return np.mean(losses)
```
- Stores loss value
- Returns average loss for the epoch
### Lines 215-230: `validate` Function
```python
def validate(model, loader, criterion):
    model.eval()
```
- Sets model to evaluation mode (disables dropout for deterministic outputs)
```python
losses = []
preds = []
targs = []
```
- Initializes lists for losses, predictions, and targets
```python
with torch.no_grad():
```
- Disables gradient computation (saves memory and speeds up inference)
```python
for batch in loader:
    batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
    with autocast(enabled=True):
        out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        loss = criterion(out.logits, batch["labels"])
```
- Moves batch to device
- Forward pass with mixed precision
- Calculates validation loss
```python
losses.append(loss.item())
preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
targs.append(batch["labels"].cpu().numpy())
```
- Stores loss
- Applies sigmoid to convert logits to probabilities [0, 1]
- Moves predictions and targets to CPU as numpy arrays
```python
return np.mean(losses), np.vstack(preds), np.vstack(targs)
```
- Returns average loss, stacked predictions, and stacked targets
---
## Section 7: Main K-Fold Training Loop
### Lines 246-324: `run_training` Function
```python
def run_training():
    if not os.path.exists(CONFIG.TRAIN_CSV):
        print("Train CSV not found. Please check the path.")
        return None, None
```
- Checks if training data exists
- Returns None if not found (graceful failure)
```python
df = pd.read_csv(CONFIG.TRAIN_CSV)
df = ensure_text_column(df)
```
- Loads training data
- Ensures text column exists
```python
skf = StratifiedKFold(n_splits=CONFIG.N_FOLDS, shuffle=True, random_state=CONFIG.SEED)
y_str = df[CONFIG.LABELS].astype(str).agg("".join, axis=1)
```
- Creates 5-fold stratified splitter
- Converts multi-label to string representation for stratification
- Example: [1,0,1,0,0] → "10100"
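A tiny illustration of the stratification key construction (hypothetical rows):
```python
import pandas as pd

demo = pd.DataFrame({
    "anger": [1, 0], "fear": [0, 1], "joy": [1, 0], "sadness": [0, 0], "surprise": [0, 1],
})
keys = demo[["anger", "fear", "joy", "sadness", "surprise"]].astype(str).agg("".join, axis=1)
print(keys.tolist())  # ['10100', '01001'] -- each label combination becomes one stratification class
```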
```python
oof_preds = np.zeros((len(df), len(CONFIG.LABELS)))
```
- Initializes out-of-fold predictions array (for all training samples)
```python
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
```
- Loads DeBERTa tokenizer
```python
for fold, (train_idx, val_idx) in enumerate(skf.split(df, y_str)):
print(f"\n{'='*20} FOLD {fold+1}/{CONFIG.N_FOLDS} {'='*20}")
```
- Iterates through each fold
- `train_idx`: indices for training, `val_idx`: indices for validation
```python
df_tr = df.iloc[train_idx].reset_index(drop=True)
df_va = df.iloc[val_idx].reset_index(drop=True)
```
- Splits data into training and validation sets for current fold
- Resets index for clean indexing
```python
ds_tr = EmotionDS(df_tr, tokenizer, CONFIG.MAX_LEN)
ds_va = EmotionDS(df_va, tokenizer, CONFIG.MAX_LEN)
```
- Creates PyTorch datasets for training and validation
```python
dl_tr = torch.utils.data.DataLoader(ds_tr, batch_size=CONFIG.BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
dl_va = torch.utils.data.DataLoader(ds_va, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
```
- Creates data loaders
- **shuffle=True** for training (randomizes batch order)
- **shuffle=False** for validation (keeps consistent order)
- **num_workers=2**: Uses 2 subprocesses for data loading
- **pin_memory=True**: Speeds up CPU→GPU transfer
```python
model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG.MODEL_NAME,
    num_labels=len(CONFIG.LABELS),
    problem_type="multi_label_classification"
)
model.to(device)
```
- Loads pre-trained DeBERTa model
- Configures for 5-label multi-label classification
- Moves model to GPU/CPU
```python
optimizer_params = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(optimizer_params, lr=CONFIG.LR)
```
- Gets parameter groups with differential weight decay
- Creates AdamW optimizer
```python
total_steps = len(dl_tr) * CONFIG.EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(total_steps * CONFIG.WARMUP_RATIO),
    num_training_steps=total_steps
)
```
- Calculates total training steps
- Creates learning rate scheduler:
  - Warmup: LR increases linearly for the first 10% of steps
  - Decay: LR decreases linearly to 0 over the remaining 90%
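As a concrete worked example of the step arithmetic (hypothetical dataset size):
```python
# Hypothetical numbers: 2,000 training rows, batch size 16, 4 epochs
steps_per_epoch = 2000 // 16           # 125 batches per epoch
total_steps = steps_per_epoch * 4      # 500 optimizer steps in total
warmup_steps = int(total_steps * 0.1)  # 50 warmup steps, then linear decay to 0
print(total_steps, warmup_steps)       # 500 50
```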
```python
criterion = nn.BCEWithLogitsLoss()
scaler = GradScaler(enabled=True)
```
- **BCEWithLogitsLoss**: Binary cross-entropy loss for multi-label classification
- Creates gradient scaler for mixed precision
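A minimal sketch of how this loss treats each label as an independent binary problem (toy tensors, not from the notebook):
```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0, 0.0]])  # raw model outputs for one sample
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0, 1.0]])   # multi-hot ground truth (float)
loss = criterion(logits, targets)  # sigmoid + binary cross-entropy per label, averaged
print(loss.item())
```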
```python
best_f1 = 0
best_state = None
```
- Initializes tracking for best model
```python
for ep in range(CONFIG.EPOCHS):
    train_loss = train_one_epoch(model, dl_tr, optimizer, scheduler, scaler, criterion)
    val_loss, val_preds, val_targs = validate(model, dl_va, criterion)
```
- Trains for one epoch
- Validates on validation set
```python
val_f1 = f1_score(val_targs, (val_preds >= 0.5).astype(int), average="macro", zero_division=0)
```
- Calculates macro F1 score (average F1 across all labels)
- Uses 0.5 threshold for predictions
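Macro F1 is the unweighted mean of the per-label F1 scores; a toy check with hypothetical arrays:
```python
import numpy as np
from sklearn.metrics import f1_score

y_t = np.array([[1, 0], [1, 1], [0, 1]])  # true multi-hot labels (2 labels)
y_p = np.array([[1, 0], [0, 1], [0, 1]])  # binarized predictions
per_label = [f1_score(y_t[:, j], y_p[:, j], zero_division=0) for j in range(2)]
macro = f1_score(y_t, y_p, average="macro", zero_division=0)
print(per_label, macro)  # macro equals the mean of the per-label scores
```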
```python
print(f"Ep {ep+1}: TrLoss={train_loss:.4f} | VaLoss={val_loss:.4f} | VaF1={val_f1:.4f}")
```
- Prints epoch metrics
```python
if val_f1 > best_f1:
    best_f1 = val_f1
    best_state = model.state_dict()
```
- Saves model state if validation F1 improves
```python
torch.save(best_state, f"model_fold_{fold}.pth")
```
- Saves best model weights to disk
```python
model.load_state_dict(best_state)
_, val_preds, _ = validate(model, dl_va, criterion)
oof_preds[val_idx] = val_preds
```
- Loads best weights
- Gets predictions on validation set
- Stores out-of-fold predictions
```python
del model, optimizer, scaler, scheduler
torch.cuda.empty_cache()
gc.collect()
```
- Deletes objects to free memory
- Clears GPU cache
- Runs garbage collector
```python
return oof_preds, df[CONFIG.LABELS].values
```
- Returns out-of-fold predictions and true labels
```python
if os.path.exists(CONFIG.TRAIN_CSV):
    oof_preds, y_true = run_training()
else:
    print("Skipping training as data is not found (likely in a dry-run environment).")
```
- Executes training if data exists
- Otherwise skips gracefully
---
## Section 8: Threshold Optimization
### Lines 340-347: Threshold Tuning
```python
if os.path.exists(CONFIG.TRAIN_CSV):
    best_thresholds = tune_thresholds(y_true, oof_preds)
```
- Finds optimal threshold for each emotion label using validation predictions
```python
oof_tuned = (oof_preds >= best_thresholds).astype(int)
```
- Converts probabilities to binary predictions using optimized thresholds
```python
final_f1 = f1_score(y_true, oof_tuned, average="macro", zero_division=0)
print(f"\nFinal CV Macro F1: {final_f1:.4f}")
print(f"Best Thresholds: {best_thresholds}")
```
- Calculates cross-validated F1 score with optimized thresholds
- Prints final performance and optimal thresholds
```python
else:
    best_thresholds = np.array([0.5] * len(CONFIG.LABELS))
```
- Falls back to 0.5 thresholds if training data not available
---
## Section 9: Inference & Submission
### Lines 363-420: `predict_test` Function
```python
def predict_test(thresholds):
    if not os.path.exists(CONFIG.TEST_CSV):
        print("Test CSV not found.")
        return
```
- Checks if test data exists
```python
df_test = pd.read_csv(CONFIG.TEST_CSV)
df_test = ensure_text_column(df_test)
```
- Loads test data and ensures text column
```python
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
ds_test = EmotionDS(df_test, tokenizer, CONFIG.MAX_LEN, is_test=True)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2)
```
- Creates tokenizer, dataset, and data loader for test data
- `is_test=True`: No labels expected
```python
fold_preds = []
```
- Initializes list to store predictions from each fold
```python
for fold in range(CONFIG.N_FOLDS):
model_path = f"model_fold_{fold}.pth"
if not os.path.exists(model_path):
print(f"Model for fold {fold} not found, skipping.")
continue
```
- Iterates through all folds
- Checks if model exists
```python
print(f"Predicting Fold {fold+1}...")
model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG.MODEL_NAME,
    num_labels=len(CONFIG.LABELS),
    problem_type="multi_label_classification"
)
model.load_state_dict(torch.load(model_path))
model.to(device)
model.eval()
```
- Loads model architecture
- Loads trained weights
- Sets to evaluation mode
```python
preds = []
with torch.no_grad():
    for batch in dl_test:
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        with autocast(enabled=True):
            out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
```
- Makes predictions without computing gradients
- Uses mixed precision for speed
- Applies sigmoid to get probabilities
```python
fold_preds.append(np.vstack(preds))
del model
torch.cuda.empty_cache()
gc.collect()
```
- Stores fold predictions
- Frees memory
```python
if not fold_preds:
print("No predictions made.")
return
```
- Checks if any predictions were made
```python
avg_preds = np.mean(fold_preds, axis=0)
```
- Averages predictions across all folds (ensemble)
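A toy illustration of the fold ensemble with hypothetical probabilities from two folds:
```python
import numpy as np

fold_preds = [
    np.array([[0.80, 0.10], [0.40, 0.70]]),  # fold 0 probabilities (2 samples, 2 labels)
    np.array([[0.60, 0.30], [0.20, 0.90]]),  # fold 1 probabilities
]
avg_preds = np.mean(fold_preds, axis=0)
print(avg_preds)  # [[0.7 0.2] [0.3 0.8]] -- element-wise average across folds
```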
```python
final_preds = (avg_preds >= thresholds).astype(int)
```
- Applies optimized thresholds to get binary predictions
```python
sub = pd.DataFrame(columns=["id"] + CONFIG.LABELS)
sub["id"] = df_test["id"] if "id" in df_test.columns else np.arange(len(df_test))
sub[CONFIG.LABELS] = final_preds
sub.to_csv(CONFIG.SUBMISSION_PATH, index=False)
print(f"Submission saved to {CONFIG.SUBMISSION_PATH}")
print(sub.head())
```
- Creates submission DataFrame
- Adds ID column (from data or generated)
- Adds prediction columns
- Saves to CSV
- Displays first few rows
```python
predict_test(best_thresholds)
```
- Executes prediction function with optimized thresholds
---
## Summary
This notebook implements a **robust emotion classification pipeline** with:
1. **K-Fold Cross-Validation**: 5-fold stratified CV for reliable performance estimates
2. **State-of-the-Art Model**: DeBERTa-v3-base transformer
3. **Optimization Techniques**:
   - Mixed precision training (faster, less memory)
   - Gradient clipping (stability)
   - Learning rate warmup and decay
   - Differential weight decay
4. **Threshold Optimization**: Per-label thresholds for better F1 scores
5. **Ensemble Prediction**: Averages predictions from all folds
6. **Memory Management**: Explicit cleanup between folds
The model predicts 5 emotions (anger, fear, joy, sadness, surprise) in a **multi-label** setting, where text can have multiple emotions simultaneously.