---
license: mit
tags:
- low-light
- low-light-image-enhancement
- image-enhancement
- image-restoration
- computer-vision
- low-light-enhance
- multimodal
- multimodal-learning
- transformer
- transformers
- vision-transformer
- vision-transformers
model-index:
- name: ModalFormer
  results:
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v1
      type: LOL-v1
    metrics:
    - type: PSNR
      value: 27.97
      name: PSNR
    - type: SSIM
      value: 0.897
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Real
      type: LOL-v2-Real
    metrics:
    - type: PSNR
      value: 29.33
      name: PSNR
    - type: SSIM
      value: 0.915
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LOL-v2-Synthetic
      type: LOL-v2-Synthetic
    metrics:
    - type: PSNR
      value: 30.15
      name: PSNR
    - type: SSIM
      value: 0.951
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-indoor
      type: SDSD-indoor
    metrics:
    - type: PSNR
      value: 31.37
      name: PSNR
    - type: SSIM
      value: 0.917
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: SDSD-outdoor
      type: SDSD-outdoor
    metrics:
    - type: PSNR
      value: 31.73
      name: PSNR
    - type: SSIM
      value: 0.904
      name: SSIM
  - task:
      type: low-light-image-enhancement
    dataset:
      name: MEF
      type: MEF
    metrics:
    - type: NIQE
      value: 3.44
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: LIME
      type: LIME
    metrics:
    - type: NIQE
      value: 3.82
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: DICM
      type: DICM
    metrics:
    - type: NIQE
      value: 3.64
      name: NIQE
  - task:
      type: low-light-image-enhancement
    dataset:
      name: NPE
      type: NPE
    metrics:
    - type: NIQE
      value: 3.55
      name: NIQE
pipeline_tag: image-to-image
---

# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

<div align="center">

**[Alexandru Brateanu](https://scholar.google.com/citations?user=ru0meGgAAAAJ&hl=en), [Raul Balmez](https://scholar.google.com/citations?user=vPC7raQAAAAJ&hl=en), [Ciprian Orhei](https://scholar.google.com/citations?user=DZHdq3wAAAAJ&hl=en), [Codruta Ancuti](https://scholar.google.com/citations?user=5PA43eEAAAAJ&hl=en), [Cosmin Ancuti](https://scholar.google.com/citations?user=zVTgt8IAAAAJ&hl=en)**

[arXiv:2507.20388](https://arxiv.org/abs/2507.20388)

</div>

### Abstract

*Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features (including deep feature embeddings, segmentation information, geometric cues, and color information) to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.*

## 🆕 Updates

- `29.07.2025` 🎉 The [**ModalFormer**](https://arxiv.org/abs/2507.20388) paper is now available! Check it out and explore our results and methodology.
- `28.07.2025` 📦 Pre-trained models and test data published! arXiv paper version and HuggingFace demo coming soon, stay tuned!

## ⚙️ Setup and Testing

For the smoothest setup, use a Linux machine with CUDA-capable GPUs.

To set up the environment, first run the provided setup script:

```bash
./environment_setup.sh
# or
bash environment_setup.sh
```

Note: if the script fails to start, make sure `environment_setup.sh` is executable by running:

```bash
chmod +x environment_setup.sh
```

The setup takes a couple of minutes to run.

Please check out the [**GitHub repository**](https://github.com/albrateanu/ModalFormer) for more implementation details.
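
The setup steps above can be combined into a small pre-flight check. The sketch below is illustrative only: `check_gpu` and `prepare_setup` are hypothetical helper names that do not come from the repository, and using `nvidia-smi` on the `PATH` as a proxy for a CUDA-ready machine is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; they are NOT part of the ModalFormer repository.

# Report whether an NVIDIA driver tool is on PATH (assumption: the presence of
# nvidia-smi is a reasonable proxy for a CUDA-ready machine).
check_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "gpu: nvidia-smi found"
    else
        echo "gpu: nvidia-smi not found; a CUDA-capable GPU is recommended"
    fi
}

# Verify that the setup script exists and make it executable if it is not.
# Takes an optional path; defaults to environment_setup.sh in the current directory.
prepare_setup() {
    local script="${1:-environment_setup.sh}"
    if [ ! -f "$script" ]; then
        echo "setup: $script not found; run this from the repository root" >&2
        return 1
    fi
    if [ ! -x "$script" ]; then
        chmod +x "$script"
    fi
    echo "setup: $script is executable"
}

check_gpu
```

From the repository root, `prepare_setup && ./environment_setup.sh` would then start the setup only when the script is present and executable.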

## 📚 Citation

```
@misc{brateanu2025modalformer,
      title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
      author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
      year={2025},
      eprint={2507.20388},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20388},
}
```

## 🙏 Acknowledgements

We use [this codebase](https://github.com/caiyuanhao1998/Retinexformer) as the foundation for our implementation.

| Paper: https://arxiv.org/pdf/2507.20388 |