# ReasonX – Intrinsic Judge
This is the MLLM judge released with ReasonX: MLLM-Guided Intrinsic Image Decomposition (CVPR 2026). It is fine-tuned from InternVL2.5-4B to make relative intrinsic comparisons between two marked points on an RGB image, across four physical modalities: depth, surface normals, albedo, and irradiance.
- Paper: ReasonX: MLLM-Guided Intrinsic Image Decomposition
- Project page: alaradirik.github.io/reasonx/
- Code & training details: github.com/adobe-research/ReasonX
## What does this model do?
Given an RGB image with two colored point markers (red and green), the judge answers modality-specific pairwise questions such as:
| Modality | Example question |
|---|---|
| Depth | Which point appears closer to the camera – red or green? |
| Normals | Which point has a surface more facing towards the camera – red or green? |
| Albedo | Do the red and green points have the same base color? |
| Irradiance | Which point is more illuminated – red or green? |
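These per-modality prompts can be kept in a small lookup table. The exact prompt strings used for training live in the GitHub repo; the dictionary below is an illustrative sketch transcribed from the table above, not the canonical prompts.

```python
# Illustrative prompt templates per modality, transcribed from the table
# above; the canonical training prompts are in the ReasonX repo.
JUDGE_PROMPTS = {
    "depth": "Which point appears closer to the camera – red or green?",
    "normals": "Which point has a surface more facing towards the camera – red or green?",
    "albedo": "Do the red and green points have the same base color?",
    "irradiance": "Which point is more illuminated – red or green?",
}

def build_question(modality: str) -> str:
    """Return the pairwise comparison question for one intrinsic modality."""
    return JUDGE_PROMPTS[modality]
```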
In the ReasonX framework, the frozen judge serves as a reward model inside a GRPO (Group Relative Policy Optimization) loop to fine-tune intrinsic decomposition models (PRISM, Marigold) on unlabeled, in-the-wild images, without requiring ground-truth intrinsic maps. The judge achieves the following accuracy on held-out synthetic test sets (InteriorVerse / HyperSim):
| Modality | Accuracy | Macro F1 |
|---|---|---|
| Depth | 0.962 | 0.962 |
| Normals | 0.935 | 0.933 |
| Albedo | 0.894 | 0.889 |
| Irradiance | 0.876 | 0.878 |
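Inside the GRPO loop, the judge's free-form answer must be reduced to a scalar reward. The exact reward formulation used in the paper is in the repo; the function below is a hypothetical keyword-matching sketch, where `expected` is the answer implied by the decomposition model's own prediction at the two marked points.

```python
def verdict_to_reward(response: str, expected: str) -> float:
    """Map a free-form judge answer to a binary reward.

    `expected` is "red", "green", or "same". This keyword-matching scheme
    is an illustrative assumption, not the exact reward used in ReasonX.
    """
    text = response.lower()
    if expected == "same":
        # Albedo-style yes/no question: "same" or an affirmative counts.
        return 1.0 if ("same" in text or "yes" in text) else 0.0
    # Pick whichever single color the judge names as its verdict.
    if "red" in text and "green" not in text:
        verdict = "red"
    elif "green" in text and "red" not in text:
        verdict = "green"
    else:
        return 0.0  # ambiguous answer earns no reward
    return 1.0 if verdict == expected else 0.0
```

For example, `verdict_to_reward("The green point is closer.", "green")` yields `1.0`, while a mismatched or ambiguous answer yields `0.0`.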
## Usage
The model expects an RGB image with two hollow circle markers drawn on it – one red (255, 0, 0) and one green (0, 255, 0) – indicating the two points to compare. Full point-drawing utilities and usage examples are available in the GitHub repository.
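The repository ships a `draw_points()` utility for this step. As a stand-in, hollow circle markers can be drawn with plain PIL; the radius and stroke width below are arbitrary illustrative defaults, not the paper's settings.

```python
from PIL import Image, ImageDraw

def draw_point_markers(image, red_xy, green_xy, radius=12, width=3):
    """Draw one hollow red and one hollow green circle on an RGB image.

    Stand-in sketch for the repo's draw_points() utility; radius and
    width are illustrative defaults, not the values used in the paper.
    """
    out = image.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    for (x, y), color in [(red_xy, (255, 0, 0)), (green_xy, (0, 255, 0))]:
        draw.ellipse(
            [x - radius, y - radius, x + radius, y + radius],
            outline=color,
            width=width,
        )
    return out
```

Usage: `draw_point_markers(Image.open("img.jpg"), (80, 120), (200, 160))` returns a copy of the image with both markers drawn.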
### Quick start
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "adirik/InternVL2_5-4B-Intrinsic-Judge",
    torch_dtype=torch.bfloat16,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "adirik/InternVL2_5-4B-Intrinsic-Judge",
    trust_remote_code=True,
    use_fast=False,
)

# Load a point-pair annotated image (red and green markers drawn on RGB)
# See the GitHub repo for the load_image() and draw_points() utility functions
pixel_values = load_image("annotated_image.jpg").to(torch.bfloat16).cuda()

question = "Which point appears to be closer to the camera – red or green?"
generation_config = dict(max_new_tokens=128, do_sample=False)

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
# e.g. "The green point is closer."
```
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{Dirik2025ReasonXMI,
  title     = {ReasonX: MLLM-Guided Intrinsic Image Decomposition},
  author    = {Dirik, Alara and Wang, Tuanfeng and Ceylan, Duygu and
               Zafeiriou, Stefanos and Fr{\"u}hst{\"u}ck, Anna},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```