# Patch-ioner_talk2dino_viecap_COCO_Captions - Patch-ioner Configuration
This repository contains a pre-trained ViECap model from the Patch-ioner framework for dense image captioning and controllable visual description.
## Paper Information
Title: "One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework"
Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi
ArXiv: https://arxiv.org/abs/2510.02898
Project Page: https://paciosoft.com/Patch-ioner/
## GitHub Repository
The official code repository for Patch-ioner can be found here: https://github.com/Ruggero1912/Patch-ioner
## Model Overview
- Model Type: VIECAP
- Configuration: mlp.viecap.k.yaml
- Vision Backbone: dinov2_vitb14_reg
- Language Model: gpt2
- Input Resolution: 518x518
- Prefix Size: 768
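As a sanity check on the numbers above, the patch-grid arithmetic can be sketched in plain Python. This assumes the standard ViT behavior of non-overlapping 14x14 patches (as in `dinov2_vitb14_reg`); it is an illustration, not code from the Patch-ioner repository:

```python
# Sketch: patch-grid arithmetic implied by the configuration above.
# Assumes standard non-overlapping ViT patch embedding; this is an
# illustration, not taken from the Patch-ioner codebase.

def patch_grid(input_resolution: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patch_tokens)."""
    assert input_resolution % patch_size == 0, "resolution must divide evenly"
    side = input_resolution // patch_size
    return side, side * side

side, total = patch_grid(518, 14)  # dinov2_vitb14_reg at 518x518
print(side, total)  # 37 patches per side, 1369 patch tokens
```

Each of these patch tokens lives in the 768-dimensional backbone space, which matches the prefix size fed to GPT-2.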
### ViECap Configuration
- Continuous Prompt Length: 10
- Clip Project Length: 10
- Temperature: 0.01
- Top-K: 3
- Entity Retrieval: coco_entities
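The temperature and top-k values listed above are standard score-shaping knobs. As a generic, hedged illustration (not the actual ViECap implementation, which may apply these to entity-retrieval scores rather than token logits), a temperature of 0.01 makes the softmax extremely peaked before the top-3 candidates are kept:

```python
import math

# Generic illustration of temperature-scaled softmax followed by top-k
# selection. A hedged sketch of the two knobs listed above; not the
# actual ViECap code.

def temperature_softmax(scores, temperature):
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Indices of the k highest-probability entries."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

scores = [0.31, 0.30, 0.28, 0.10]          # e.g. similarity scores
probs = temperature_softmax(scores, temperature=0.01)  # very peaked
print(top_k(probs, k=3))  # -> [0, 1, 2]
```

With temperature 0.01, even small score gaps become near-deterministic preferences, so the top-3 selection effectively keeps only the closest-matching candidates.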
## Performance
| Task | METEOR | CIDEr | SPICE |
|---|---|---|---|
| Image Captioning | 0.225 | 0.769 | 0.161 |
| Narratives (VIST) | 10.700 | 28.200 | 12.500 |
## Detailed Results
### Image Captioning Results
- METEOR: 0.2250
- CIDEr: 0.7690
- SPICE: 0.1612
- BLEU_4: 0.2362
- ROUGE_L: 0.4779
- CLIP-S: 0.7188
### Narratives (VIST) Results
- METEOR: 10.7000
- CIDEr: 28.2000
- SPICE: 12.5000
- BLEU_4: 2.6000
- ROUGE_L: 23.1000
- CLIP-S: 66.5000
## Quick Start
```python
from patch_ioner import load_model

# Load the model from its YAML configuration
config_path = "config.yaml"
model = load_model(config_path)

# Run inference on an image
image_path = "your_image.jpg"
results = model.forward(image_path)
print(results)
```
### transformers Sample Usage
The model can also be loaded using the transformers library:
```python
from transformers import AutoModel

# Use the MODEL_ID of this repository
MODEL_ID = "Ruggero1912/Patch-ioner_talk2dino_viecap_COCO_Captions"
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
```
## Repository Contents
- `config.yaml`: Model configuration file
- `model.pt`: Pre-trained model weights
- `README.md`: This file
## Installation
```shell
pip install git+https://github.com/Ruggero1912/Patch-ioner
```
## Usage Examples
Refer to the Patch-ioner repository for updated usage examples.
## Model Configuration
- Prefix Size: 768
- Memory Bank Size: 0
- Normalization: False
## Training Details
- Training Dataset: COCO Captions
- Training Epochs: TBD
- Batch Size: TBD
- Learning Rate: TBD
- Optimizer: AdamW
## Citation
If you use this model in your research, please cite our paper; refer to the Project Page for the up-to-date citation template.
## Contributing
We welcome contributions to improve the Patch-ioner framework. Please see the main repository for contribution guidelines.
## License
See the main repository for detailed license information.
## Issues and Support
For issues related to this model or the Patch-ioner framework, please:
- Check the main repository for existing issues
- Open a new issue with detailed information about your problem
- Contact the authors.
## Related Models
Explore other Patch-ioner model configurations:
- Patch-ioner_mlp - MLP-based DeCap model
- Patch-ioner_viecap - VieCap controllable captioning
- Patch-ioner_clipcap - ClipCap integration
More models are available on Ruggero1912's Hugging Face profile.
This model is part of the Patch-ioner framework for dense image captioning and controllable visual description.
## Evaluation Results
All results are self-reported.

COCO Captions:
- METEOR: 0.225
- CIDEr: 0.769
- SPICE: 0.161
- BLEU-4: 0.236
- ROUGE-L: 0.478
- CLIP-S: 0.719

Visual Storytelling Dataset (VIST):
- METEOR: 10.700
- CIDEr: 28.200
- SPICE: 12.500
- BLEU-4: 2.600