Onur Çopur committed on
Commit 0647d62 · 1 Parent(s): 8880ccb

add dinov3 and dinov2 with registers

Files changed (4):
  1. .gitignore +7 -1
  2. CLAUDE.md +203 -0
  3. embeddings.py +278 -1
  4. patch_attention.py +238 -2
.gitignore CHANGED
@@ -108,4 +108,10 @@ jspm_packages/
 
 # temporary folders
 tmp/
-temp/
+temp/
+*.png
+*.jpg
+*.jpeg
+*.gif
+*.svg
+*.mp4
CLAUDE.md ADDED
@@ -0,0 +1,203 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is an AI-powered tattoo search engine that combines visual similarity search with image captioning. Users upload a tattoo image, and the system finds visually similar tattoos from across the web using multi-model embeddings and multi-platform search.

**Tech Stack**: FastAPI, PyTorch, HuggingFace Transformers, OpenCLIP, DINOv2, SigLIP

**Deployment**: Dockerized application designed for HuggingFace Spaces (GPU recommended)

## Development Commands

### Running the Application

```bash
# Local development
python app.py

# Docker build and run
docker build -t tattoo-search .
docker run -p 7860:7860 --env-file .env tattoo-search
```

### Environment Setup

Required environment variable:
- `HF_TOKEN`: HuggingFace API token (required for GLM-4.5V captioning via Novita provider)

Create `.env` file:
```
HF_TOKEN=your_token_here
```

### Testing Endpoints

```bash
# Health check
curl http://localhost:7860/health

# Get available models
curl http://localhost:7860/models

# Search with image
curl -X POST http://localhost:7860/search \
  -F "embedding_model=clip" \
  -F "include_patch_attention=false"
```

## Architecture

### Core Pipeline Flow

1. **Image Upload** → FastAPI endpoint (`/search` in main.py)
2. **Caption Generation** → GLM-4.5V via HuggingFace InferenceClient (Novita provider)
3. **Multi-Platform Search** → SearchEngineManager coordinates searches across Pinterest, Reddit, Instagram
4. **URL Validation** → URLValidator filters valid/accessible images
5. **Embedding Extraction** → Selected model (CLIP/DINOv2/SigLIP) encodes query + candidates
6. **Similarity Computation** → Cosine similarity ranking in parallel
7. **Optional Patch Analysis** → PatchAttentionAnalyzer for detailed visual correspondence

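As a rough illustration of how these steps might be wired together in main.py (the handler and helper signatures are assumptions; `get_search_engine()` and the `TattooSearchEngine` methods are described below, not defined here):

```python
# Illustrative pipeline wiring only; not the repository's actual main.py.
import io

from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image

app = FastAPI()

@app.post("/search")
async def search(
    file: UploadFile = File(...),
    embedding_model: str = Form("clip"),
    include_patch_attention: bool = Form(False),
):
    engine = get_search_engine(embedding_model)          # lazy singleton (see Model Selection Logic)
    query_image = Image.open(io.BytesIO(await file.read())).convert("RGB")

    caption = engine.generate_caption(query_image)        # step 2: GLM-4.5V caption
    candidates = engine.search_images(caption)             # steps 3-4: search + URL validation
    results = engine.compute_similarity(                   # steps 5-7: embed, rank, optional patches
        query_image, candidates, include_patch_attention=include_patch_attention
    )
    return {"caption": caption, "results": results}
```
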
### Key Components

**main.py - TattooSearchEngine Class**
- Main orchestration class that ties all components together
- `generate_caption()`: Uses HuggingFace InferenceClient with GLM-4.5V model
- `search_images()`: Delegates to SearchEngineManager with caching
- `download_and_process_image()`: Parallel image download and similarity computation
- `compute_similarity()`: ThreadPoolExecutor for concurrent processing with early stopping

**embeddings.py - Model Abstraction**
- `EmbeddingModel`: Abstract base class defining interface
- `CLIPEmbedding`: OpenAI CLIP ViT-B/32 (default)
- `DINOv2Embedding`: Meta's self-supervised vision transformer
- `SigLIPEmbedding`: Google's improved CLIP-like model
- `EmbeddingModelFactory`: Factory pattern for model instantiation with fallback
- All models support both global image embeddings and patch-level features

**search_engines/ - Multi-Platform Search**
- `SearchEngineManager`: Coordinates parallel searches across platforms with fallback strategies
- `BaseSearchEngine`: Abstract interface for platform-specific engines
- Platform implementations: PinterestSearchEngine, RedditSearchEngine, InstagramSearchEngine
- `SearchResult` and `ImageResult`: Data classes for structured results
- Includes intelligent query simplification for fallback searches

**patch_attention.py - Visual Correspondence**
- `PatchAttentionAnalyzer`: Computes patch-level attention matrices between images
- `compute_patch_similarities()`: Extracts patch features and computes attention
- `visualize_attention_heatmap()`: Creates matplotlib visualizations as base64 PNG
- Returns attention matrices showing which image regions correspond best

**utils/ - Supporting Utilities**
- `SearchCache`: In-memory LRU cache with TTL for search results
- `URLValidator`: Concurrent URL validation to filter broken/inaccessible images

### Model Selection Logic

The search engine supports dynamic model switching via `get_search_engine()`:
- Global singleton pattern with lazy initialization
- Models are swapped only when a different embedding model is requested
- Each model implements both global pooling and patch-level encoding

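A minimal sketch of what this lazy, model-swapping singleton might look like; the module-level globals and the `TattooSearchEngine` constructor argument are illustrative assumptions:

```python
# Hypothetical sketch of get_search_engine(); not the repository's exact code.
_search_engine = None
_current_model = None

def get_search_engine(embedding_model: str = "clip"):
    """Return a shared TattooSearchEngine, rebuilding it only when the
    requested embedding model differs from the one currently loaded."""
    global _search_engine, _current_model
    if _search_engine is None or embedding_model != _current_model:
        _search_engine = TattooSearchEngine(embedding_model=embedding_model)
        _current_model = embedding_model
    return _search_engine
```
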
### Search Strategy

SearchEngineManager uses a tiered approach:
1. Primary platforms (Pinterest, Reddit) searched first
2. If results < threshold, try additional platforms (Instagram)
3. If still insufficient, simplify query and retry
4. All platform searches run concurrently via ThreadPoolExecutor

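A rough sketch of that tiered fallback, assuming each engine exposes a `search(query, max_results)` method returning a list, and using a simple word-dropping heuristic as a stand-in for the real query simplification:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative only; the real SearchEngineManager API may differ.
def search_with_fallback(engines, query, max_results=30, threshold=10):
    def run(tier):
        # Run one tier of platform searches concurrently.
        with ThreadPoolExecutor(max_workers=len(tier)) as pool:
            futures = [pool.submit(e.search, query, max_results) for e in tier]
            results = []
            for f in futures:
                try:
                    results.extend(f.result())
                except Exception:
                    pass  # one failing platform should not sink the whole search
            return results

    primary = [engines["pinterest"], engines["reddit"]]
    results = run(primary)
    if len(results) < threshold:                  # tier 2: add Instagram
        results += run([engines["instagram"]])
    if len(results) < threshold:                  # tier 3: retry with a simplified query
        query = " ".join(query.split()[:3])       # run() reads the updated query
        results += run(primary)
    return results
```
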
### Caching Strategy

- Search results cached by query + max_results hash
- Default TTL: 1 hour (3600s)
- Max cache size: 1000 entries with LRU eviction
- Significantly reduces redundant searches

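A minimal sketch of a TTL + LRU cache keyed by a hash of query plus max_results; the class name `SearchCache` comes from the utilities listed above, but this body is illustrative rather than the actual implementation:

```python
import hashlib
import time
from collections import OrderedDict

class SearchCache:
    """Illustrative in-memory LRU cache with TTL; not the repo's exact code."""

    def __init__(self, max_size=1000, ttl=3600):
        self.max_size, self.ttl = max_size, ttl
        self._store = OrderedDict()  # key -> (timestamp, value)

    def _key(self, query: str, max_results: int) -> str:
        return hashlib.md5(f"{query}:{max_results}".encode()).hexdigest()

    def get(self, query, max_results):
        key = self._key(query, max_results)
        entry = self._store.get(key)
        if entry is None or time.time() - entry[0] > self.ttl:
            self._store.pop(key, None)   # expired or missing
            return None
        self._store.move_to_end(key)     # mark as recently used
        return entry[1]

    def set(self, query, max_results, value):
        key = self._key(query, max_results)
        self._store[key] = (time.time(), value)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```
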
## Important Implementation Details

### Caption Generation
- Uses GLM-4.5V via HuggingFace InferenceClient with Novita provider
- Converts PIL image to base64 data URL
- Expects JSON response with "search_query" field
- Fallback to "tattoo artwork" on failure

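The captioning step might look roughly like this; the data-URL conversion and the `"tattoo artwork"` fallback follow the bullets above, while the exact model id, prompt, and message format are assumptions:

```python
import base64, io, json, os
from huggingface_hub import InferenceClient
from PIL import Image

def generate_caption(image: Image.Image) -> str:
    """Illustrative sketch of GLM-4.5V captioning with a fallback query."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

    try:
        client = InferenceClient(provider="novita", api_key=os.environ["HF_TOKEN"])
        response = client.chat.completions.create(
            model="zai-org/GLM-4.5V",  # assumed model id
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": 'Describe this tattoo as a short search query. '
                             'Reply as JSON: {"search_query": "..."}'},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }],
        )
        return json.loads(response.choices[0].message.content)["search_query"]
    except Exception:
        return "tattoo artwork"  # documented fallback
```
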
### Image Download Headers
- Platform-specific headers (Pinterest, Instagram optimizations)
- Random user agent rotation
- Content-type and size validation (10MB limit, min 50x50px)
- Exponential backoff retry mechanism

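A hedged sketch of a download helper along these lines; the 10MB limit and backoff follow the bullets above, while the helper name, header values, and Pinterest check are illustrative assumptions (the 50x50px check would happen after decoding with PIL):

```python
import random
import time
from typing import Optional

import requests

USER_AGENTS = ["Mozilla/5.0 (X11; Linux x86_64) ...", "Mozilla/5.0 (Macintosh) ..."]  # rotated per request

def download_image(url: str, max_retries: int = 3) -> Optional[bytes]:
    """Illustrative downloader with UA rotation, validation, and backoff."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if "pinimg.com" in url:                          # example of a platform-specific tweak
        headers["Referer"] = "https://www.pinterest.com/"

    for attempt in range(max_retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            if not resp.headers.get("Content-Type", "").startswith("image/"):
                return None                          # reject non-image responses
            if len(resp.content) > 10 * 1024 * 1024:
                return None                          # 10MB limit
            return resp.content
        except requests.RequestException:
            time.sleep(2 ** attempt)                 # exponential backoff before retrying
    return None
```
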
### Similarity Computation
- Early stopping optimization: stops at 20 good results (5 if patch attention enabled)
- ThreadPoolExecutor with max 10 workers
- Rate limiting with 0.1s delays between downloads
- Future cancellation after target reached

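The early-stopping pattern could be sketched as follows (target counts and worker limit come from the bullets above; the function name and result format are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Illustrative early-stopping loop; not the actual compute_similarity() body.
def rank_candidates(process_one, candidate_urls, include_patch_attention=False):
    target = 5 if include_patch_attention else 20
    results = []
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {pool.submit(process_one, url): url for url in candidate_urls}
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                results.append(result)
            if len(results) >= target:
                # Cancel anything that has not started yet and stop early
                for f in futures:
                    f.cancel()
                break
    return sorted(results, key=lambda r: r["score"], reverse=True)
```
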
### Patch Attention
- Only triggered when `include_patch_attention=true`
- Computes NxM attention matrix (query patches × candidate patches)
- Visualizations include: attention heatmap, patch grid overlays, top correspondences
- Returns base64-encoded PNG images

## API Response Structures

**POST /search** returns:
```json
{
  "caption": "string",
  "results": [
    {
      "score": 0.95,
      "url": "https://...",
      "patch_attention": {          // optional
        "overall_similarity": 0.87,
        "query_grid_size": 7,
        "candidate_grid_size": 7,
        "attention_summary": {...}
      }
    }
  ],
  "embedding_model": "CLIP-ViT-B-32",
  "patch_attention_enabled": false
}
```

**POST /analyze-attention** returns detailed patch analysis with visualizations

## Common Development Patterns

### Adding a New Embedding Model

1. Create new class in `embeddings.py` inheriting from `EmbeddingModel`
2. Implement `load_model()`, `encode_image()`, `encode_image_patches()`, `get_model_name()`
3. Add to `EmbeddingModelFactory.AVAILABLE_MODELS`
4. Add config to `get_default_model_configs()`

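A skeleton for steps 1-4 might look like the following (meant to slot into `embeddings.py`); only the method names and registration points come from the steps above, while `MyModelEmbedding` and its internals are placeholders:

```python
# embeddings.py (sketch) -- hypothetical new model following the interface above
class MyModelEmbedding(EmbeddingModel):
    """Placeholder embedding model illustrating the required interface."""

    def __init__(self, device: torch.device, model_name: str = "my-org/my-model"):
        super().__init__(device)
        self.model_name = model_name
        self.load_model()

    def load_model(self) -> None:
        ...  # load weights + preprocessing onto self.device

    def encode_image(self, image: Image.Image) -> torch.Tensor:
        ...  # return an L2-normalized global embedding, shape [1, dim]

    def encode_image_patches(self, image: Image.Image) -> torch.Tensor:
        ...  # return normalized patch features, shape [num_patches, dim]

    def get_model_name(self) -> str:
        return f"MyModel-{self.model_name}"

# Step 3: register it (or add it to the dict literal)
EmbeddingModelFactory.AVAILABLE_MODELS["mymodel"] = MyModelEmbedding

# Step 4: add a config entry to get_default_model_configs(), e.g.
# "mymodel": {"model_name": "my-org/my-model", "description": "..."}
```
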
### Adding a New Search Platform

1. Create new engine in `search_engines/` inheriting from `BaseSearchEngine`
2. Add platform to `SearchPlatform` enum in `base.py`
3. Implement `search()` and `is_valid_url()` methods
4. Add to `SearchEngineManager.engines` dict
5. Update platform prioritization in `search_with_fallback()` if needed

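Similarly, a new platform engine can be sketched like this; the `TumblrSearchEngine` example, the method signatures, and the registration comments are illustrative assumptions:

```python
# search_engines/tumblr.py (hypothetical example)
class TumblrSearchEngine(BaseSearchEngine):
    """Placeholder engine showing the pieces a new platform needs."""

    def search(self, query: str, max_results: int = 20):
        # Query the platform and return structured results
        # (assumed to be ImageResult objects per the component list above).
        return []

    def is_valid_url(self, url: str) -> bool:
        return "tumblr.com" in url

# base.py: add a TUMBLR member to the SearchPlatform enum (step 2)
# SearchEngineManager.__init__: register TumblrSearchEngine() in self.engines (step 4)
```
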
## Performance Considerations

- GPU acceleration used if available (CUDA)
- Concurrent image downloads (ThreadPoolExecutor)
- Search result caching to reduce API calls
- Early stopping in similarity computation
- Future cancellation after targets met
- Model instances reused globally to avoid reloading

## Deployment Notes

- Designed for HuggingFace Spaces with Docker SDK
- Port 7860 (HF Spaces default)
- Recommended hardware: T4 Small GPU or higher
- Health check endpoint at `/health` for monitoring
- All models download on first use and cache in `/app/cache`

embeddings.py CHANGED
@@ -216,6 +216,273 @@ class DINOv2Embedding(EmbeddingModel):
         return f"DINOv2-{self.model_name}"
 
 
+class DINOv2WithRegistersEmbedding(EmbeddingModel):
+    """DINOv2 with register tokens - improved feature maps and attention."""
+
+    def __init__(self, device: torch.device, model_name: str = "facebook/dinov2-with-registers-base"):
+        super().__init__(device)
+        self.model_name = model_name
+        self.processor = None
+        self.load_model()
+
+    def load_model(self) -> None:
+        """Load DINOv2 with registers model and preprocessing."""
+        try:
+            from transformers import Dinov2WithRegistersModel, AutoImageProcessor
+
+            logger.info(f"Loading DINOv2 with registers model: {self.model_name}")
+
+            self.model = Dinov2WithRegistersModel.from_pretrained(self.model_name)
+            self.model.to(self.device)
+            self.model.eval()
+
+            self.processor = AutoImageProcessor.from_pretrained(self.model_name)
+
+            logger.info(f"DINOv2 with registers model {self.model_name} loaded successfully")
+        except Exception as e:
+            logger.error(f"Failed to load DINOv2 with registers model: {e}")
+            raise
+
+    def encode_image(self, image: Image.Image) -> torch.Tensor:
+        """Encode image using DINOv2 with registers."""
+        try:
+            inputs = self.processor(images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model(**inputs)
+                # Use pooler_output for global representation, fallback to mean pooling
+                if hasattr(outputs, 'pooler_output') and outputs.pooler_output is not None:
+                    features = outputs.pooler_output
+                else:
+                    # Mean pooling over spatial dimensions
+                    features = outputs.last_hidden_state.mean(dim=1)
+
+                features = F.normalize(features, p=2, dim=1)
+
+            return features
+        except Exception as e:
+            logger.error(f"Failed to encode image with DINOv2 with registers: {e}")
+            raise
+
+    def encode_image_patches(self, image: Image.Image) -> torch.Tensor:
+        """Encode image patches using DINOv2 with registers."""
+        try:
+            inputs = self.processor(images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model(**inputs)
+                # Token sequence structure: [CLS] + 4 register tokens + 256 patch tokens = 261 total
+                # We want only the spatial patch tokens (positions 5 to 260)
+
+                num_register_tokens = 4
+                expected_patches = (224 // 14) ** 2  # 256 for base model with 224x224 input, patch size 14
+
+                # Skip CLS token (position 0) and register tokens (positions 1-4)
+                start_idx = 1 + num_register_tokens  # Position 5
+                end_idx = start_idx + expected_patches  # Position 261
+
+                patch_features = outputs.last_hidden_state[:, start_idx:end_idx, :]  # [1, 256, feature_dim]
+
+                # Normalize patch features
+                patch_features = F.normalize(patch_features, p=2, dim=-1)
+
+            return patch_features.squeeze(0)  # [num_patches, feature_dim]
+
+        except Exception as e:
+            logger.error(f"Failed to encode image patches with DINOv2 with registers: {e}")
+            raise
+
+    def get_model_name(self) -> str:
+        return f"DINOv2-WithRegisters-{self.model_name.split('/')[-1]}"
+
+    def get_attention_maps(self, image: Image.Image) -> torch.Tensor:
+        """
+        Extract native attention maps from DINOv2 with registers.
+
+        Returns:
+            Attention tensor with shape (num_layers, num_heads, num_tokens, num_tokens)
+            where num_tokens includes [CLS] + patches + registers
+        """
+        try:
+            inputs = self.processor(images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model(**inputs, output_attentions=True)
+                # outputs.attentions is a tuple of attention tensors, one per layer
+                # Each has shape: (batch_size, num_heads, sequence_length, sequence_length)
+
+                # Stack all layer attentions
+                attention_stack = torch.stack(outputs.attentions)  # (num_layers, batch_size, num_heads, seq_len, seq_len)
+                attention_stack = attention_stack.squeeze(1)  # Remove batch dimension -> (num_layers, num_heads, seq_len, seq_len)
+
+            return attention_stack
+
+        except Exception as e:
+            logger.error(f"Failed to extract attention maps: {e}")
+            raise
+
+    def compute_cross_attention(self, query_image: Image.Image, candidate_image: Image.Image) -> torch.Tensor:
+        """
+        Compute cross-attention between query and candidate images using patch features.
+
+        This uses the extracted patch embeddings to compute attention from query to candidate,
+        similar to the native attention mechanism but across two images.
+
+        Returns:
+            Cross-attention matrix with shape (query_patches, candidate_patches)
+        """
+        try:
+            # Get patch features for both images
+            query_patches = self.encode_image_patches(query_image)  # (num_query_patches, feature_dim)
+            candidate_patches = self.encode_image_patches(candidate_image)  # (num_candidate_patches, feature_dim)
+
+            # Compute attention-style similarity (softmax over candidate dimension)
+            # attention[i,j] = how much query patch i attends to candidate patch j
+            attention_logits = torch.mm(query_patches, candidate_patches.T)  # (query_patches, candidate_patches)
+
+            # Apply softmax to get attention distribution for each query patch
+            cross_attention = F.softmax(attention_logits, dim=1)
+
+            return cross_attention
+
+        except Exception as e:
+            logger.error(f"Failed to compute cross-attention: {e}")
+            raise
+
+    def supports_native_attention(self) -> bool:
+        """Check if this model supports native attention extraction."""
+        return True
+
+
+class DINOv3Embedding(EmbeddingModel):
+    """DINOv3-based embedding model from HuggingFace transformers."""
+
+    def __init__(self, device: torch.device, model_name: str = "facebook/dinov3-vits16-pretrain-lvd1689m"):
+        super().__init__(device)
+        self.model_name = model_name
+        self.processor = None
+        self.load_model()
+
+    def load_model(self) -> None:
+        """Load DINOv3 model and preprocessing."""
+        try:
+            from transformers import AutoModel, AutoImageProcessor
+
+            logger.info(f"Loading DINOv3 model: {self.model_name}")
+
+            self.model = AutoModel.from_pretrained(self.model_name)
+            self.model.to(self.device)
+            self.model.eval()
+
+            self.processor = AutoImageProcessor.from_pretrained(self.model_name)
+
+            logger.info(f"DINOv3 model {self.model_name} loaded successfully")
+        except Exception as e:
+            logger.error(f"Failed to load DINOv3 model: {e}")
+            raise
+
+    def encode_image(self, image: Image.Image) -> torch.Tensor:
+        """Encode image using DINOv3."""
+        try:
+            inputs = self.processor(images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model(**inputs)
+                # Use pooler_output (CLS token) for global representation
+                if hasattr(outputs, 'pooler_output') and outputs.pooler_output is not None:
+                    features = outputs.pooler_output
+                else:
+                    # Fallback to mean pooling over patch embeddings
+                    features = outputs.last_hidden_state[:, 1:, :].mean(dim=1)
+
+                features = F.normalize(features, p=2, dim=1)
+
+            return features
+        except Exception as e:
+            logger.error(f"Failed to encode image with DINOv3: {e}")
+            raise
+
+    def encode_image_patches(self, image: Image.Image) -> torch.Tensor:
+        """Encode image patches using DINOv3."""
+        try:
+            inputs = self.processor(images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model(**inputs)
+                # DINOv3 outputs: [CLS] + register tokens + patch tokens
+                # We want only the patch tokens (skip CLS at position 0 and register tokens)
+                # For DINOv3-ViTS16, it has 4 register tokens
+                num_register_tokens = 4
+                patch_features = outputs.last_hidden_state[:, 1 + num_register_tokens:, :]
+
+                # Normalize patch features
+                patch_features = F.normalize(patch_features, p=2, dim=-1)
+
+            return patch_features.squeeze(0)  # [num_patches, feature_dim]
+
+        except Exception as e:
+            logger.error(f"Failed to encode image patches with DINOv3: {e}")
+            raise
+
+    def get_model_name(self) -> str:
+        return f"DINOv3-{self.model_name.split('/')[-1]}"
+
+    def supports_native_attention(self) -> bool:
+        """Check if this model supports native attention extraction."""
+        return True
+
+    def get_attention_maps(self, image: Image.Image) -> torch.Tensor:
+        """
+        Extract native attention maps from DINOv3.
+
+        Returns:
+            Attention tensor with shape (num_layers, num_heads, num_tokens, num_tokens)
+        """
+        try:
+            inputs = self.processor(images=image, return_tensors="pt")
+            inputs = {k: v.to(self.device) for k, v in inputs.items()}
+
+            with torch.no_grad():
+                outputs = self.model(**inputs, output_attentions=True)
+                # Stack all layer attentions
+                attention_stack = torch.stack(outputs.attentions)
+                attention_stack = attention_stack.squeeze(1)  # Remove batch dimension
+
+            return attention_stack
+
+        except Exception as e:
+            logger.error(f"Failed to extract attention maps: {e}")
+            raise
+
+    def compute_cross_attention(self, query_image: Image.Image, candidate_image: Image.Image) -> torch.Tensor:
+        """
+        Compute cross-attention between query and candidate images using patch features.
+
+        Returns:
+            Cross-attention matrix with shape (query_patches, candidate_patches)
+        """
+        try:
+            query_patches = self.encode_image_patches(query_image)
+            candidate_patches = self.encode_image_patches(candidate_image)
+
+            # Compute attention-style similarity
+            attention_logits = torch.mm(query_patches, candidate_patches.T)
+
+            # Apply softmax to get attention distribution
+            cross_attention = F.softmax(attention_logits, dim=1)
+
+            return cross_attention
+
+        except Exception as e:
+            logger.error(f"Failed to compute cross-attention: {e}")
+            raise
+
+
 class SigLIPEmbedding(EmbeddingModel):
     """SigLIP-based embedding model."""
 
@@ -297,6 +564,8 @@ class EmbeddingModelFactory:
     AVAILABLE_MODELS = {
         "clip": CLIPEmbedding,
         "dinov2": DINOv2Embedding,
+        "dinov2_registers": DINOv2WithRegistersEmbedding,
+        "dinov3": DINOv3Embedding,
         "siglip": SigLIPEmbedding,
     }
 
@@ -305,7 +574,7 @@
         """Create an embedding model instance.
 
         Args:
-            model_type: Type of model ('clip', 'dinov2', 'siglip')
+            model_type: Type of model ('clip', 'dinov2', 'dinov2_registers', 'dinov3', 'siglip')
            device: PyTorch device
            **kwargs: Additional arguments for specific models
 
@@ -345,6 +614,14 @@ def get_default_model_configs() -> Dict[str, Dict[str, Any]]:
         "model_name": "dinov2_vitb14",
         "description": "Meta DINOv2 - self-supervised vision transformer, good for visual features"
     },
+    "dinov2_registers": {
+        "model_name": "facebook/dinov2-with-registers-base",
+        "description": "Meta DINOv2 with register tokens - improved feature maps and attention"
+    },
+    "dinov3": {
+        "model_name": "facebook/dinov3-vits16-pretrain-lvd1689m",
+        "description": "Meta DINOv3 - vision foundation model with high-quality dense features"
+    },
     "siglip": {
         "model_name": "google/siglip-base-patch16-224",
         "description": "Google SigLIP - improved CLIP-like model with better training"
patch_attention.py CHANGED
@@ -15,14 +15,21 @@ class PatchAttentionAnalyzer:
 
     def __init__(self, embedding_model):
         self.embedding_model = embedding_model
+        self.supports_native_attention = hasattr(embedding_model, 'supports_native_attention') and embedding_model.supports_native_attention()
 
     def compute_patch_similarities(self, query_image: Image.Image, candidate_image: Image.Image) -> Dict[str, Any]:
         """
         Compute patch-level similarities between query and candidate images.
+        Automatically uses native attention if the model supports it.
 
         Returns:
             Dictionary containing attention matrix, top correspondences, and metadata
         """
+        # Use native attention if available
+        if self.supports_native_attention:
+            return self.compute_native_attention_similarities(query_image, candidate_image)
+
+        # Fallback to cosine similarity approach
         try:
             # Get patch features for both images
             query_patches = self.embedding_model.encode_image_patches(query_image)
@@ -205,11 +212,61 @@ class PatchAttentionAnalyzer:
 
         return image.crop((left, top, right, bottom))
 
+    def compute_native_attention_similarities(self, query_image: Image.Image, candidate_image: Image.Image) -> Dict[str, Any]:
+        """
+        Compute patch-level similarities using the model's native attention mechanism.
+        Only available for models with native attention support (e.g., DINOv2 with registers).
+
+        Returns:
+            Dictionary containing attention matrix, top correspondences, and metadata
+        """
+        try:
+            # Use model's cross-attention computation
+            attention_matrix = self.embedding_model.compute_cross_attention(query_image, candidate_image)
+            attention_matrix_np = attention_matrix.cpu().numpy()
+
+            # Get patch counts (attention_matrix is already query_patches x candidate_patches)
+            num_query_patches = attention_matrix.shape[0]
+            num_candidate_patches = attention_matrix.shape[1]
+
+            # Get grid dimensions (assuming square patches)
+            query_grid_size = int(math.sqrt(num_query_patches))
+            candidate_grid_size = int(math.sqrt(num_candidate_patches))
+
+            # Find top correspondences for each query patch
+            top_correspondences = []
+            for i in range(num_query_patches):
+                patch_similarities = attention_matrix[i]
+                top_indices = torch.topk(patch_similarities, k=min(5, num_candidate_patches))
+
+                top_correspondences.append({
+                    'query_patch_idx': i,
+                    'query_patch_coord': self._patch_idx_to_coord(i, query_grid_size),
+                    'top_candidate_indices': top_indices.indices.tolist(),
+                    'top_candidate_coords': [self._patch_idx_to_coord(idx.item(), candidate_grid_size)
+                                             for idx in top_indices.indices],
+                    'similarity_scores': top_indices.values.tolist()
+                })
+
+            return {
+                'attention_matrix': attention_matrix_np,
+                'query_grid_size': query_grid_size,
+                'candidate_grid_size': candidate_grid_size,
+                'top_correspondences': top_correspondences,
+                'query_patches_shape': (num_query_patches, attention_matrix.shape[-1]),
+                'candidate_patches_shape': (num_candidate_patches, attention_matrix.shape[-1]),
+                'overall_similarity': torch.mean(attention_matrix).item(),
+                'use_native_attention': True
+            }
+
+        except Exception as e:
+            raise RuntimeError(f"Error computing native attention similarities: {e}")
+
     def get_similarity_summary(self, similarity_data: Dict[str, Any]) -> Dict[str, Any]:
         """Get a summary of similarity statistics."""
         attention_matrix = similarity_data['attention_matrix']
 
-        return {
+        summary = {
             'overall_similarity': similarity_data['overall_similarity'],
             'max_similarity': float(np.max(attention_matrix)),
             'min_similarity': float(np.min(attention_matrix)),
@@ -218,4 +275,183 @@
             'candidate_patches_count': similarity_data['candidate_patches_shape'][0],
             'high_attention_patches': int(np.sum(attention_matrix > (np.mean(attention_matrix) + np.std(attention_matrix)))),
             'model_name': self.embedding_model.get_model_name()
-        }
+        }
+
+        # Add native attention flag if present
+        if 'use_native_attention' in similarity_data:
+            summary['use_native_attention'] = similarity_data['use_native_attention']
+
+        return summary
+
+    def visualize_multihead_attention(self, image: Image.Image, layer_idx: int = -1, figsize: Tuple[int, int] = (20, 12)) -> str:
+        """
+        Visualize attention from multiple heads for a single image.
+        Only available for models with native attention support.
+
+        Args:
+            image: Input image to visualize attention for
+            layer_idx: Which transformer layer to visualize (-1 for last layer)
+            figsize: Figure size for the plot
+
+        Returns:
+            Base64 encoded PNG image showing multi-head attention patterns
+        """
+        if not self.supports_native_attention:
+            raise ValueError("Multi-head attention visualization requires native attention support")
+
+        try:
+            # Get attention maps from the model
+            attention_maps = self.embedding_model.get_attention_maps(image)
+            # Shape: (num_layers, num_heads, num_tokens, num_tokens)
+
+            # Select the specified layer
+            layer_attention = attention_maps[layer_idx]  # (num_heads, num_tokens, num_tokens)
+            num_heads = layer_attention.shape[0]
+
+            # Extract patch-to-patch attention (exclude CLS token and register tokens)
+            # Token sequence structure varies by model:
+            # DINOv2 with registers: [CLS] + 4 register tokens + 256 spatial patches = 261 total
+            # DINOv3: [CLS] + 4 register tokens + 196 spatial patches (patch size 16) = 201 total
+            model_name = self.embedding_model.get_model_name().lower()
+
+            if 'dinov3' in model_name:
+                num_register_tokens = 4
+                expected_patches = 196  # For 224x224 image with patch size 16 (14*14=196)
+            else:
+                num_register_tokens = 4
+                expected_patches = 256  # For 224x224 image with patch size 14
+
+            # Skip CLS token (position 0) and register tokens (positions 1-4)
+            start_idx = 1 + num_register_tokens  # Position 5
+            end_idx = start_idx + expected_patches  # 261 for DINOv2 with registers, 201 for DINOv3
+            patch_attention = layer_attention[:, start_idx:end_idx, start_idx:end_idx]
+
+            # Convert to numpy
+            patch_attention_np = patch_attention.cpu().numpy()
+
+            # Get grid size
+            num_patches = patch_attention.shape[1]
+            grid_size = int(math.sqrt(num_patches))
+
+            # Create subplot grid
+            num_cols = 4
+            num_rows = (num_heads + num_cols - 1) // num_cols  # Ceiling division
+            fig, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
+            axes = axes.flatten() if num_heads > 1 else [axes]
+
+            layer_name = f"Layer {layer_idx}" if layer_idx >= 0 else f"Last Layer ({len(attention_maps)})"
+            fig.suptitle(f'Multi-Head Attention Patterns - {layer_name}', fontsize=16, fontweight='bold')
+
+            # Plot each head's average attention
+            for head_idx in range(num_heads):
+                # Average attention from all query patches to all key patches
+                head_attn = patch_attention_np[head_idx]
+                avg_attention = np.mean(head_attn, axis=0).reshape(grid_size, grid_size)
+
+                im = axes[head_idx].imshow(avg_attention, cmap='viridis', interpolation='nearest')
+                axes[head_idx].set_title(f'Head {head_idx + 1}')
+                axes[head_idx].axis('off')
+                plt.colorbar(im, ax=axes[head_idx], fraction=0.046, pad=0.04)
+
+            # Hide unused subplots
+            for idx in range(num_heads, len(axes)):
+                axes[idx].axis('off')
+
+            plt.tight_layout()
+
+            # Convert to base64
+            buffer = io.BytesIO()
+            plt.savefig(buffer, format='png', dpi=150, bbox_inches='tight')
+            buffer.seek(0)
+            plot_data = buffer.getvalue()
+            buffer.close()
+            plt.close()
+
+            return base64.b64encode(plot_data).decode()
+
+        except Exception as e:
+            raise RuntimeError(f"Error visualizing multi-head attention: {e}")
+
+    def visualize_attention_comparison(self, query_image: Image.Image, candidate_image: Image.Image,
+                                       figsize: Tuple[int, int] = (20, 10)) -> str:
+        """
+        Compare native attention vs computed cosine similarity side-by-side.
+        Only available for models with native attention support.
+
+        Args:
+            query_image: Query image
+            candidate_image: Candidate image
+            figsize: Figure size for the plot
+
+        Returns:
+            Base64 encoded PNG showing both attention methods
+        """
+        if not self.supports_native_attention:
+            raise ValueError("Attention comparison requires native attention support")
+
+        try:
+            # Compute native attention
+            native_data = self.compute_native_attention_similarities(query_image, candidate_image)
+
+            # Compute cosine similarity for comparison
+            query_patches = self.embedding_model.encode_image_patches(query_image)
+            candidate_patches = self.embedding_model.encode_image_patches(candidate_image)
+            cosine_attention = self.embedding_model.compute_patch_attention(query_patches, candidate_patches)
+            cosine_attention_np = cosine_attention.cpu().numpy()
+
+            # Create comparison visualization
+            fig, axes = plt.subplots(2, 3, figsize=figsize)
+            fig.suptitle('Native Attention vs Cosine Similarity Comparison', fontsize=16, fontweight='bold')
+
+            # Row 1: Native attention
+            axes[0, 0].imshow(query_image)
+            axes[0, 0].set_title('Query Image')
+            axes[0, 0].axis('off')
+
+            im1 = axes[0, 1].imshow(native_data['attention_matrix'], cmap='viridis', aspect='auto')
+            axes[0, 1].set_title(f'Native Attention\n(Avg: {native_data["overall_similarity"]:.3f})')
+            axes[0, 1].set_xlabel('Candidate Patches')
+            axes[0, 1].set_ylabel('Query Patches')
+            plt.colorbar(im1, ax=axes[0, 1], fraction=0.046, pad=0.04)
+
+            # Max attention heatmap for native
+            max_native = np.max(native_data['attention_matrix'], axis=1)
+            native_grid = max_native.reshape(native_data['query_grid_size'], native_data['query_grid_size'])
+            im2 = axes[0, 2].imshow(native_grid, cmap='hot', interpolation='nearest')
+            axes[0, 2].set_title('Max Native Attention per Patch')
+            plt.colorbar(im2, ax=axes[0, 2], fraction=0.046, pad=0.04)
+
+            # Row 2: Cosine similarity
+            axes[1, 0].imshow(candidate_image)
+            axes[1, 0].set_title('Candidate Image')
+            axes[1, 0].axis('off')
+
+            cosine_mean = float(np.mean(cosine_attention_np))
+            im3 = axes[1, 1].imshow(cosine_attention_np, cmap='viridis', aspect='auto')
+            axes[1, 1].set_title(f'Cosine Similarity\n(Avg: {cosine_mean:.3f})')
+            axes[1, 1].set_xlabel('Candidate Patches')
+            axes[1, 1].set_ylabel('Query Patches')
+            plt.colorbar(im3, ax=axes[1, 1], fraction=0.046, pad=0.04)
+
+            # Max attention heatmap for cosine
+            max_cosine = np.max(cosine_attention_np, axis=1)
+            query_grid_size = int(math.sqrt(query_patches.shape[0]))
+            cosine_grid = max_cosine.reshape(query_grid_size, query_grid_size)
+            im4 = axes[1, 2].imshow(cosine_grid, cmap='hot', interpolation='nearest')
+            axes[1, 2].set_title('Max Cosine Similarity per Patch')
+            plt.colorbar(im4, ax=axes[1, 2], fraction=0.046, pad=0.04)
+
+            plt.tight_layout()
+
+            # Convert to base64
+            buffer = io.BytesIO()
+            plt.savefig(buffer, format='png', dpi=150, bbox_inches='tight')
+            buffer.seek(0)
+            plot_data = buffer.getvalue()
+            buffer.close()
+            plt.close()
+
+            return base64.b64encode(plot_data).decode()
+
+        except Exception as e:
+            raise RuntimeError(f"Error comparing attention methods: {e}")
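
As a usage sketch (not part of the commit), the analyzer now picks the native-attention path automatically when the embedding model reports support for it; the image paths below are illustrative:

```python
import torch
from PIL import Image
from embeddings import DINOv3Embedding
from patch_attention import PatchAttentionAnalyzer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
analyzer = PatchAttentionAnalyzer(DINOv3Embedding(device))

query = Image.open("query_tattoo.jpg").convert("RGB")
candidate = Image.open("candidate_tattoo.jpg").convert("RGB")

# Uses compute_native_attention_similarities() because DINOv3 supports native attention
data = analyzer.compute_patch_similarities(query, candidate)
print(data["use_native_attention"], data["overall_similarity"])
print(analyzer.get_similarity_summary(data))

# Native-attention extras (these raise for models without attention support)
heads_png_b64 = analyzer.visualize_multihead_attention(query, layer_idx=-1)
comparison_png_b64 = analyzer.visualize_attention_comparison(query, candidate)
```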