# Patch-ioner_talk2dino_viecap_COCO_Captions - Patch-ioner Configuration
This repository contains a pre-trained ViECap model from the Patch-ioner framework for dense image captioning and controllable visual description.
## Paper Information
Title: "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework"
Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
ArXiv: https://arxiv.org/abs/2510.02898
Project Page: https://paciosoft.com/Patch-ioner/
## GitHub Repository
The official code repository for Patch-ioner can be found here: https://github.com/Ruggero1912/Patch-ioner
## Model Overview
- Model Type: VIECAP
- Configuration: mlp.viecap.k.yaml
- Vision Backbone: dinov2_vitb14_reg
- Language Model: gpt2
- Input Resolution: 518x518
- Prefix Size: 768
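As a sanity check on the numbers above, the patch-grid arithmetic can be sketched in plain Python. This assumes the standard ViT behavior of non-overlapping 14x14 patches (as in `dinov2_vitb14_reg`); it is an illustration, not code from the Patch-ioner repository:

```python
# Sketch: patch-grid arithmetic implied by the configuration above.
# Assumes standard non-overlapping ViT patch embedding; this is an
# illustration, not taken from the Patch-ioner codebase.

def patch_grid(input_resolution: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patch_tokens)."""
    assert input_resolution % patch_size == 0, "resolution must divide evenly"
    side = input_resolution // patch_size
    return side, side * side

side, total = patch_grid(518, 14)  # dinov2_vitb14_reg at 518x518
print(side, total)  # 37 patches per side, 1369 patch tokens
```

Each of these patch tokens lives in the 768-dimensional backbone space, which matches the prefix size fed to GPT-2.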
### ViECap Configuration
- Continuous Prompt Length: 10
- Clip Project Length: 10
- Temperature: 0.01
- Top-K: 3
- Entity Retrieval: coco_entities
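The temperature and top-k values listed above are standard score-shaping knobs. As a generic, hedged illustration (not the actual ViECap implementation, which may apply these to entity-retrieval scores rather than token logits), a temperature of 0.01 makes the softmax extremely peaked before the top-3 candidates are kept:

```python
import math

# Generic illustration of temperature-scaled softmax followed by top-k
# selection. A hedged sketch of the two knobs listed above; not the
# actual ViECap code.

def temperature_softmax(scores, temperature):
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Indices of the k highest-probability entries."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

scores = [0.31, 0.30, 0.28, 0.10]          # e.g. similarity scores
probs = temperature_softmax(scores, temperature=0.01)  # very peaked
print(top_k(probs, k=3))  # -> [0, 1, 2]
```

With temperature 0.01, even small score gaps become near-deterministic preferences, so the top-3 selection effectively keeps only the closest-matching candidates.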
## Performance
| Task | METEOR | CIDEr | SPICE |
|---|---|---|---|
| Image Captioning | 0.225 | 0.769 | 0.161 |
| Narratives (VIST) | 10.700 | 28.200 | 12.500 |
## Detailed Results
### Image Captioning Results
- METEOR: 0.2250
- CIDEr: 0.7690
- SPICE: 0.1612
- BLEU_4: 0.2362
- ROUGE_L: 0.4779
- CLIP-S: 0.7188
### Narratives (VIST) Results
- METEOR: 10.7000
- CIDEr: 28.2000
- SPICE: 12.5000
- BLEU_4: 2.6000
- ROUGE_L: 23.1000
- CLIP-S: 66.5000
## Quick Start
```python
from patch_ioner import load_model

# Load the model from its YAML configuration
config_path = "config.yaml"
model = load_model(config_path)

# Run inference on an image
image_path = "your_image.jpg"
results = model.forward(image_path)
print(results)
```
### transformers Sample Usage
The model can also be loaded using the transformers library:
```python
from transformers import AutoModel

# Use the MODEL_ID of this repository
MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_viecap_COCO_Captions"
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
```
## Repository Contents
- `config.yaml`: Model configuration file
- `model.pt`: Pre-trained model weights
- `README.md`: This file
## Installation
```shell
pip install git+https://github.com/Ruggero1912/Patch-ioner
```
## Usage Examples
Refer to the Patch-ioner repository for updated usage examples.
## Model Configuration
- Prefix Size: 768
- Memory Bank Size: 0
- Normalization: False
## Training Details
- Training Dataset: COCO Captions
- Training Epochs: TBD
- Batch Size: TBD
- Learning Rate: TBD
- Optimizer: AdamW
## Citation
If you use this model in your research, please cite our paper; refer to the Project Page for the up-to-date citation template.
## Contributing
We welcome contributions to improve the Patch-ioner framework. Please see the main repository for contribution guidelines.
## License
See the main repository for detailed license information.
## Issues and Support
For issues related to this model or the Patch-ioner framework, please:
- Check the main repository for existing issues
- Open a new issue with detailed information about your problem
- Contact the authors.
## Related Models
Explore other Patch-ioner model configurations:
- Patch-ioner_mlp - MLP-based DeCap model
- Patch-ioner_viecap - VieCap controllable captioning
- Patch-ioner_clipcap - ClipCap integration
More models are available on Ruggero1912's Hugging Face profile.
This model is part of the Patch-ioner framework for dense image captioning and controllable visual description.
## Evaluation Results
All results are self-reported.

COCO Captions:
- METEOR: 0.225
- CIDEr: 0.769
- SPICE: 0.161
- BLEU-4: 0.236
- ROUGE-L: 0.478
- CLIP-S: 0.719

Visual Storytelling Dataset (VIST):
- METEOR: 10.700
- CIDEr: 28.200
- SPICE: 12.500
- BLEU-4: 2.600