# Deep Learning Emotion Classification - Code Explanation
This document provides a detailed line-by-line explanation of the `main.ipynb` notebook, which implements a multi-label emotion classification system using the DeBERTa transformer model with K-Fold cross-validation.
---
## Section 1: Imports & Setup
### Lines 18-36: Import Statements
```python
import numpy as np
import pandas as pd
```
- **numpy**: Used for numerical operations, array manipulation, and random seed setting
- **pandas**: Used for data loading and manipulation (CSV files, DataFrames)
```python
import torch
import torch.nn as nn
```
- **torch**: PyTorch deep learning framework for tensor operations and model training
- **torch.nn**: Neural network modules including loss functions
```python
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
```
- **StratifiedKFold**: Creates k-fold splits while maintaining class distribution in each fold
- **f1_score**: Calculates F1 metric for evaluation (harmonic mean of precision and recall)
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    AutoConfig
)
```
- **AutoTokenizer**: Automatically loads the appropriate tokenizer for the specified model
- **AutoModelForSequenceClassification**: Pre-trained transformer model for classification tasks
- **get_linear_schedule_with_warmup**: Learning rate scheduler with warmup and linear decay
- **AutoConfig**: Model configuration loader
```python
from torch.optim import AdamW
```
- **AdamW**: Adam optimizer with decoupled weight decay, the standard choice for fine-tuning transformers
```python
from torch.cuda.amp import autocast, GradScaler
```
- **autocast**: Enables automatic mixed precision (AMP) to speed up training
- **GradScaler**: Scales gradients for mixed precision training to prevent underflow
```python
import gc
import warnings
import os
```
- **gc**: Garbage collection to free up memory
- **warnings**: To suppress warning messages
- **os**: For file system operations and environment variables
```python
warnings.filterwarnings("ignore")
```
- Suppresses all warning messages for cleaner output
---
## Section 2: Configuration
### Lines 52-68: Configuration Class
```python
class Config:
    SEED = 42
```
- Sets random seed for reproducibility across all random operations
```python
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]
```
- Defines the 5 emotion labels for multi-label classification
```python
MODEL_NAME = "microsoft/deberta-v3-base"
```
- Specifies the pre-trained model (DeBERTa-v3-base, roughly 184M parameters including embeddings; a strong choice for text classification)
```python
MAX_LEN = 128
```
- Maximum sequence length for tokenization (sequences longer than this are truncated)
```python
BATCH_SIZE = 16
```
- Number of samples processed together in one forward/backward pass
```python
EPOCHS = 4
```
- Number of complete passes through the training dataset
```python
LR = 1.5e-5
```
- Learning rate (1.5 × 10⁻⁵) - small value typical for fine-tuning transformers
```python
WEIGHT_DECAY = 0.01
```
- Weight decay strength (applied in decoupled form by AdamW) to reduce overfitting
```python
WARMUP_RATIO = 0.1
```
- Fraction of training steps used for learning rate warmup (10% of total steps)
```python
N_FOLDS = 5
```
- Number of folds for K-Fold cross-validation
```python
TRAIN_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/train.csv"
TEST_CSV = "/kaggle/input/2025-sep-dl-gen-ai-project/test.csv"
```
- Paths to training and test datasets (Kaggle environment paths)
```python
SUBMISSION_PATH = "submission.csv"
```
- Output file for predictions
```python
CONFIG = Config()
```
- Creates a global instance of the configuration class
---
## Section 3: Seed & Device Setup
### Lines 84-93: Reproducibility and Device Selection
```python
def set_seed(seed=CONFIG.SEED):
    np.random.seed(seed)
```
- Sets numpy's random seed for reproducible random number generation
```python
torch.manual_seed(seed)
```
- Sets PyTorch's random seed for CPU operations
```python
torch.cuda.manual_seed_all(seed)
```
- Sets PyTorch's random seed for all GPU devices
```python
os.environ['PYTHONHASHSEED'] = str(seed)
```
- Sets Python's hash seed via an environment variable; this mainly affects subprocesses, since the running interpreter's hash randomization is fixed at startup
```python
set_seed()
```
- Calls the seed setting function
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```
- Checks if GPU is available; uses GPU if available, otherwise falls back to CPU
- Prints the device being used for training
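For stricter run-to-run reproducibility on a GPU, the seeding above is sometimes extended with cuDNN flags. This is a minimal sketch of such an extension (not part of the notebook; deterministic kernels can slow training):
```python
import os
import random
import numpy as np
import torch

def set_seed_strict(seed: int = 42):
    # Same seeding as the notebook's set_seed ...
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # ... plus optional cuDNN flags for deterministic GPU kernels (hypothetical addition)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```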
---
## Section 4: Utility Functions
### Lines 109-115: `ensure_text_column` Function
```python
def ensure_text_column(df: pd.DataFrame) -> pd.DataFrame:
if "text" in df.columns:
return df
```
- Checks if DataFrame already has a "text" column; if yes, returns unchanged
```python
for c in ["comment_text", "sentence", "content", "review"]:
    if c in df.columns:
        return df.rename(columns={c: "text"})
```
- Searches for common alternative text column names
- Renames the first matching column to "text" for standardization
```python
raise ValueError("No text column found. Add/rename your text column to 'text'.")
```
- Raises an error if no text column is found
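A quick usage sketch with a hypothetical DataFrame, showing the rename in action:
```python
import pandas as pd

# Hypothetical raw file that stores its text under "comment_text" instead of "text"
df_raw = pd.DataFrame({"comment_text": ["I am thrilled!", "This is terrifying."]})
df_std = ensure_text_column(df_raw)
print(df_std.columns.tolist())  # ['text']
```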
### Lines 117-126: `tune_thresholds` Function
```python
def tune_thresholds(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray:
    th = np.zeros(y_true.shape[1], dtype=np.float32)
```
- Creates array to store optimal threshold for each label (initialized to 0)
- Multi-label classification requires separate thresholds per label
```python
for j in range(y_true.shape[1]):
    best_t, best_f1 = 0.5, -1
```
- Iterates through each label
- Initializes best threshold to 0.5 (default) and best F1 to -1
```python
for t in np.linspace(0.1, 0.9, 17):
```
- Tests 17 threshold values evenly spaced between 0.1 and 0.9
```python
f1 = f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int), zero_division=0)
```
- Calculates F1 score for current label and threshold
- Converts probabilities to binary predictions using threshold
```python
if f1 > best_f1:
    best_f1, best_t = f1, t
```
- Updates best threshold if current F1 is better
```python
th[j] = best_t
return th
```
- Stores optimal threshold for each label and returns the array
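To illustrate the return value, a small synthetic example (random arrays, purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 5))  # 100 samples, 5 binary labels
y_prob = rng.random((100, 5))               # fake predicted probabilities in [0, 1]
th = tune_thresholds(y_true, y_prob)
print(th.shape, th)  # (5,) -- one tuned threshold per label
```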
### Lines 128-141: `get_optimizer_params` Function
```python
def get_optimizer_params(model, lr, weight_decay):
    param_optimizer = list(model.named_parameters())
```
- Gets all model parameters with their names
```python
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
```
- Lists parameters that should NOT have weight decay applied
- Bias and LayerNorm parameters are typically trained without weight decay
```python
optimizer_parameters = [
    {
        "params": [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
        "weight_decay": weight_decay,
    },
```
- First parameter group: all parameters EXCEPT bias and LayerNorm
- These parameters will have weight decay applied
```python
    {
        "params": [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
```
- Second parameter group: only bias and LayerNorm parameters
- These parameters have weight decay set to 0.0
```python
return optimizer_parameters
```
- Returns grouped parameters for differential weight decay
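These groups plug directly into the optimizer; a minimal sketch of how they are consumed (mirroring the call in Section 7, assuming `model` is already built):
```python
from torch.optim import AdamW

# Each group carries its own weight_decay; the learning rate is shared
groups = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(groups, lr=CONFIG.LR)
```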
---
## Section 5: Dataset Class
### Lines 157-180: `EmotionDS` Class
```python
class EmotionDS(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len, is_test=False):
```
- Custom PyTorch Dataset class for emotion classification
- `is_test` flag indicates whether this is test data (no labels)
```python
self.texts = df["text"].tolist()
```
- Extracts text data as a Python list
```python
self.is_test = is_test
if not is_test:
    self.labels = df[CONFIG.LABELS].values.astype(np.float32)
```
- Stores test flag
- If training data, extracts multi-label targets as float32 array
```python
self.tok = tokenizer
self.max_len = max_len
```
- Stores tokenizer and max length for later use
```python
def __len__(self):
    return len(self.texts)
```
- Returns dataset size (required by PyTorch)
```python
def __getitem__(self, i):
    enc = self.tok(
        self.texts[i],
        truncation=True,
        padding="max_length",
        max_length=self.max_len,
        return_tensors="pt",
    )
```
- Tokenizes the text at index `i`
- **truncation**: Cuts text longer than max_len
- **padding**: Pads shorter sequences to max_len
- **return_tensors="pt"**: Returns PyTorch tensors
```python
item = {k: v.squeeze(0) for k, v in enc.items()}
```
- Removes the batch dimension (1, seq_len) → (seq_len)
- Returns dict with keys: input_ids, attention_mask, token_type_ids (if applicable)
```python
if not self.is_test:
item["labels"] = torch.tensor(self.labels[i])
return item
```
- Adds labels to the item dict if training data
- Returns the complete item
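A short usage sketch with a tiny hypothetical DataFrame, showing the shapes the dataset produces:
```python
import pandas as pd
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
df_demo = pd.DataFrame({
    "text": ["I can't believe this happened!", "What a wonderful day."],
    "anger": [1, 0], "fear": [0, 0], "joy": [0, 1], "sadness": [0, 0], "surprise": [1, 0],
})
ds = EmotionDS(df_demo, tok, CONFIG.MAX_LEN)
item = ds[0]
print(item["input_ids"].shape)  # torch.Size([128]) -- padded/truncated to MAX_LEN
print(item["labels"])           # tensor([1., 0., 0., 0., 1.])
```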
---
## Section 6: Training & Validation Helper Functions
### Lines 196-213: `train_one_epoch` Function
```python
def train_one_epoch(model, loader, optimizer, scheduler, scaler, criterion):
    model.train()
```
- Sets model to training mode (enables dropout and other training-only behaviors)
```python
losses = []
for batch in loader:
```
- Initializes list to track losses
- Iterates through batches
```python
batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
```
- Moves batch data to GPU (or CPU)
- `non_blocking=True`: Async transfer for faster processing
```python
optimizer.zero_grad(set_to_none=True)
```
- Clears gradients from previous step
- `set_to_none=True`: More memory efficient than setting to zero
```python
with autocast(enabled=True):
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    loss = criterion(out.logits, batch["labels"])
```
- **autocast**: Uses mixed precision (float16) for faster computation
- Forward pass through model
- Calculates loss between predictions (logits) and true labels
```python
scaler.scale(loss).backward()
```
- Scales loss to prevent gradient underflow in mixed precision
- Computes gradients via backpropagation
```python
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```
- Unscales gradients before clipping
- Clips gradients to maximum norm of 1.0 to prevent exploding gradients
```python
scaler.step(optimizer)
scaler.update()
```
- Applies the optimizer step using the already-unscaled gradients (the step is skipped automatically if infs/NaNs are detected)
- Updates the scaler's loss-scale factor for the next iteration
```python
scheduler.step()
```
- Updates learning rate according to schedule
```python
losses.append(loss.item())
return np.mean(losses)
```
- Stores loss value
- Returns average loss for the epoch
### Lines 215-230: `validate` Function
```python
def validate(model, loader, criterion):
    model.eval()
```
- Sets model to evaluation mode (disables dropout for deterministic outputs)
```python
losses = []
preds = []
targs = []
```
- Initializes lists for losses, predictions, and targets
```python
with torch.no_grad():
```
- Disables gradient computation (saves memory and speeds up inference)
```python
for batch in loader:
    batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
    with autocast(enabled=True):
        out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        loss = criterion(out.logits, batch["labels"])
```
- Moves batch to device
- Forward pass with mixed precision
- Calculates validation loss
```python
losses.append(loss.item())
preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
targs.append(batch["labels"].cpu().numpy())
```
- Stores loss
- Applies sigmoid to convert logits to probabilities [0, 1]
- Moves predictions and targets to CPU as numpy arrays
```python
return np.mean(losses), np.vstack(preds), np.vstack(targs)
```
- Returns average loss, stacked predictions, and stacked targets
---
## Section 7: Main K-Fold Training Loop
### Lines 246-324: `run_training` Function
```python
def run_training():
    if not os.path.exists(CONFIG.TRAIN_CSV):
        print("Train CSV not found. Please check the path.")
        return None, None
```
- Checks if training data exists
- Returns None if not found (graceful failure)
```python
df = pd.read_csv(CONFIG.TRAIN_CSV)
df = ensure_text_column(df)
```
- Loads training data
- Ensures text column exists
```python
skf = StratifiedKFold(n_splits=CONFIG.N_FOLDS, shuffle=True, random_state=CONFIG.SEED)
y_str = df[CONFIG.LABELS].astype(str).agg("".join, axis=1)
```
- Creates 5-fold stratified splitter
- Converts multi-label to string representation for stratification
- Example: [1,0,1,0,0] → "10100"
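A tiny illustration of the stratification key construction (hypothetical rows):
```python
import pandas as pd

demo = pd.DataFrame({
    "anger": [1, 0], "fear": [0, 1], "joy": [1, 0], "sadness": [0, 0], "surprise": [0, 1],
})
keys = demo[["anger", "fear", "joy", "sadness", "surprise"]].astype(str).agg("".join, axis=1)
print(keys.tolist())  # ['10100', '01001'] -- each label combination becomes one stratification class
```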
```python
oof_preds = np.zeros((len(df), len(CONFIG.LABELS)))
```
- Initializes out-of-fold predictions array (for all training samples)
```python
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
```
- Loads DeBERTa tokenizer
```python
for fold, (train_idx, val_idx) in enumerate(skf.split(df, y_str)):
print(f"\n{'='*20} FOLD {fold+1}/{CONFIG.N_FOLDS} {'='*20}")
```
- Iterates through each fold
- `train_idx`: indices for training, `val_idx`: indices for validation
```python
df_tr = df.iloc[train_idx].reset_index(drop=True)
df_va = df.iloc[val_idx].reset_index(drop=True)
```
- Splits data into training and validation sets for current fold
- Resets index for clean indexing
```python
ds_tr = EmotionDS(df_tr, tokenizer, CONFIG.MAX_LEN)
ds_va = EmotionDS(df_va, tokenizer, CONFIG.MAX_LEN)
```
- Creates PyTorch datasets for training and validation
```python
dl_tr = torch.utils.data.DataLoader(ds_tr, batch_size=CONFIG.BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
dl_va = torch.utils.data.DataLoader(ds_va, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
```
- Creates data loaders
- **shuffle=True** for training (randomizes batch order)
- **shuffle=False** for validation (keeps consistent order)
- **num_workers=2**: Uses 2 subprocesses for data loading
- **pin_memory=True**: Speeds up CPU→GPU transfer
```python
model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG.MODEL_NAME,
    num_labels=len(CONFIG.LABELS),
    problem_type="multi_label_classification"
)
model.to(device)
```
- Loads pre-trained DeBERTa model
- Configures for 5-label multi-label classification
- Moves model to GPU/CPU
```python
optimizer_params = get_optimizer_params(model, CONFIG.LR, CONFIG.WEIGHT_DECAY)
optimizer = AdamW(optimizer_params, lr=CONFIG.LR)
```
- Gets parameter groups with differential weight decay
- Creates AdamW optimizer
```python
total_steps = len(dl_tr) * CONFIG.EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(total_steps * CONFIG.WARMUP_RATIO),
    num_training_steps=total_steps
)
```
- Calculates total training steps
- Creates learning rate scheduler:
  - Warmup: LR increases linearly for the first 10% of steps
  - Decay: LR decreases linearly to 0 over the remaining 90%
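As a concrete worked example of the step arithmetic (hypothetical dataset size):
```python
# Hypothetical numbers: 2,000 training rows, batch size 16, 4 epochs
steps_per_epoch = 2000 // 16           # 125 batches per epoch
total_steps = steps_per_epoch * 4      # 500 optimizer steps in total
warmup_steps = int(total_steps * 0.1)  # 50 warmup steps, then linear decay to 0
print(total_steps, warmup_steps)       # 500 50
```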
```python
criterion = nn.BCEWithLogitsLoss()
scaler = GradScaler(enabled=True)
```
- **BCEWithLogitsLoss**: Binary cross-entropy loss for multi-label classification
- Creates gradient scaler for mixed precision
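A minimal sketch of how this loss treats each label as an independent binary problem (toy tensors, not from the notebook):
```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([[2.0, -1.0, 0.5, -3.0, 0.0]])  # raw model outputs for one sample
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0, 1.0]])   # multi-hot ground truth (float)
loss = criterion(logits, targets)  # sigmoid + binary cross-entropy per label, averaged
print(loss.item())
```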
```python
best_f1 = 0
best_state = None
```
- Initializes tracking for best model
```python
for ep in range(CONFIG.EPOCHS):
    train_loss = train_one_epoch(model, dl_tr, optimizer, scheduler, scaler, criterion)
    val_loss, val_preds, val_targs = validate(model, dl_va, criterion)
```
- Trains for one epoch
- Validates on validation set
```python
val_f1 = f1_score(val_targs, (val_preds >= 0.5).astype(int), average="macro", zero_division=0)
```
- Calculates macro F1 score (average F1 across all labels)
- Uses 0.5 threshold for predictions
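Macro F1 is the unweighted mean of the per-label F1 scores; a toy check with hypothetical arrays:
```python
import numpy as np
from sklearn.metrics import f1_score

y_t = np.array([[1, 0], [1, 1], [0, 1]])  # true multi-hot labels (2 labels)
y_p = np.array([[1, 0], [0, 1], [0, 1]])  # binarized predictions
per_label = [f1_score(y_t[:, j], y_p[:, j], zero_division=0) for j in range(2)]
macro = f1_score(y_t, y_p, average="macro", zero_division=0)
print(per_label, macro)  # macro equals the mean of the per-label scores
```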
```python
print(f"Ep {ep+1}: TrLoss={train_loss:.4f} | VaLoss={val_loss:.4f} | VaF1={val_f1:.4f}")
```
- Prints epoch metrics
```python
if val_f1 > best_f1:
    best_f1 = val_f1
    best_state = model.state_dict()
```
- Saves model state if validation F1 improves
```python
torch.save(best_state, f"model_fold_{fold}.pth")
```
- Saves best model weights to disk
```python
model.load_state_dict(best_state)
_, val_preds, _ = validate(model, dl_va, criterion)
oof_preds[val_idx] = val_preds
```
- Loads best weights
- Gets predictions on validation set
- Stores out-of-fold predictions
```python
del model, optimizer, scaler, scheduler
torch.cuda.empty_cache()
gc.collect()
```
- Deletes objects to free memory
- Clears GPU cache
- Runs garbage collector
```python
return oof_preds, df[CONFIG.LABELS].values
```
- Returns out-of-fold predictions and true labels
```python
if os.path.exists(CONFIG.TRAIN_CSV):
    oof_preds, y_true = run_training()
else:
    print("Skipping training as data is not found (likely in a dry-run environment).")
```
- Executes training if data exists
- Otherwise skips gracefully
---
## Section 8: Threshold Optimization
### Lines 340-347: Threshold Tuning
```python
if os.path.exists(CONFIG.TRAIN_CSV):
    best_thresholds = tune_thresholds(y_true, oof_preds)
```
- Finds optimal threshold for each emotion label using validation predictions
```python
oof_tuned = (oof_preds >= best_thresholds).astype(int)
```
- Converts probabilities to binary predictions using optimized thresholds
```python
final_f1 = f1_score(y_true, oof_tuned, average="macro", zero_division=0)
print(f"\nFinal CV Macro F1: {final_f1:.4f}")
print(f"Best Thresholds: {best_thresholds}")
```
- Calculates cross-validated F1 score with optimized thresholds
- Prints final performance and optimal thresholds
```python
else:
    best_thresholds = np.array([0.5] * len(CONFIG.LABELS))
```
- Falls back to 0.5 thresholds if training data not available
---
## Section 9: Inference & Submission
### Lines 363-420: `predict_test` Function
```python
def predict_test(thresholds):
    if not os.path.exists(CONFIG.TEST_CSV):
        print("Test CSV not found.")
        return
```
- Checks if test data exists
```python
df_test = pd.read_csv(CONFIG.TEST_CSV)
df_test = ensure_text_column(df_test)
```
- Loads test data and ensures text column
```python
tokenizer = AutoTokenizer.from_pretrained(CONFIG.MODEL_NAME)
ds_test = EmotionDS(df_test, tokenizer, CONFIG.MAX_LEN, is_test=True)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size=CONFIG.BATCH_SIZE, shuffle=False, num_workers=2)
```
- Creates tokenizer, dataset, and data loader for test data
- `is_test=True`: No labels expected
```python
fold_preds = []
```
- Initializes list to store predictions from each fold
```python
for fold in range(CONFIG.N_FOLDS):
model_path = f"model_fold_{fold}.pth"
if not os.path.exists(model_path):
print(f"Model for fold {fold} not found, skipping.")
continue
```
- Iterates through all folds
- Checks if model exists
```python
print(f"Predicting Fold {fold+1}...")
model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG.MODEL_NAME,
    num_labels=len(CONFIG.LABELS),
    problem_type="multi_label_classification"
)
model.load_state_dict(torch.load(model_path))
model.to(device)
model.eval()
```
- Loads model architecture
- Loads trained weights
- Sets to evaluation mode
```python
preds = []
with torch.no_grad():
    for batch in dl_test:
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
        with autocast(enabled=True):
            out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
        preds.append(torch.sigmoid(out.logits).float().cpu().numpy())
```
- Makes predictions without computing gradients
- Uses mixed precision for speed
- Applies sigmoid to get probabilities
```python
fold_preds.append(np.vstack(preds))
del model
torch.cuda.empty_cache()
gc.collect()
```
- Stores fold predictions
- Frees memory
```python
if not fold_preds:
print("No predictions made.")
return
```
- Checks if any predictions were made
```python
avg_preds = np.mean(fold_preds, axis=0)
```
- Averages predictions across all folds (ensemble)
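A toy illustration of the fold ensemble with hypothetical probabilities from two folds:
```python
import numpy as np

fold_preds = [
    np.array([[0.80, 0.10], [0.40, 0.70]]),  # fold 0 probabilities (2 samples, 2 labels)
    np.array([[0.60, 0.30], [0.20, 0.90]]),  # fold 1 probabilities
]
avg_preds = np.mean(fold_preds, axis=0)
print(avg_preds)  # [[0.7 0.2] [0.3 0.8]] -- element-wise average across folds
```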
```python
final_preds = (avg_preds >= thresholds).astype(int)
```
- Applies optimized thresholds to get binary predictions
```python
sub = pd.DataFrame(columns=["id"] + CONFIG.LABELS)
sub["id"] = df_test["id"] if "id" in df_test.columns else np.arange(len(df_test))
sub[CONFIG.LABELS] = final_preds
sub.to_csv(CONFIG.SUBMISSION_PATH, index=False)
print(f"Submission saved to {CONFIG.SUBMISSION_PATH}")
print(sub.head())
```
- Creates submission DataFrame
- Adds ID column (from data or generated)
- Adds prediction columns
- Saves to CSV
- Displays first few rows
```python
predict_test(best_thresholds)
```
- Executes prediction function with optimized thresholds
---
## Summary
This notebook implements a **robust emotion classification pipeline** with:
1. **K-Fold Cross-Validation**: 5-fold stratified CV for reliable performance estimates
2. **State-of-the-Art Model**: DeBERTa-v3-base transformer
3. **Optimization Techniques**:
   - Mixed precision training (faster, less memory)
   - Gradient clipping (stability)
   - Learning rate warmup and decay
   - Differential weight decay
4. **Threshold Optimization**: Per-label thresholds for better F1 scores
5. **Ensemble Prediction**: Averages predictions from all folds
6. **Memory Management**: Explicit cleanup between folds
The model predicts 5 emotions (anger, fear, joy, sadness, surprise) in a **multi-label** setting, where text can have multiple emotions simultaneously.