# ReasonX Intrinsic Judge

This is the MLLM judge released with ReasonX: MLLM-Guided Intrinsic Image Decomposition (CVPR 2026). It is fine-tuned from InternVL2.5-4B to make relative intrinsic comparisons between two marked points on an RGB image, across four physical modalities: depth, surface normals, albedo, and irradiance.

Paper: ReasonX: MLLM-Guided Intrinsic Image Decomposition
Project page: alaradirik.github.io/reasonx/
Code & Training Details: github.com/adobe-research/ReasonX

## What does this model do?

Given an RGB image with two colored point markers (red and green), the judge answers modality-specific pairwise questions such as:

| Modality   | Example question                                                             |
|------------|------------------------------------------------------------------------------|
| Depth      | Which point appears closer to the camera — red or green?                     |
| Normals    | Which point lies on a surface facing more towards the camera — red or green? |
| Albedo     | Do the red and green points have the same base color?                        |
| Irradiance | Which point is more illuminated — red or green?                              |

In the ReasonX framework, the frozen judge is used as a reward model inside a GRPO loop to fine-tune intrinsic decomposition models (PRISM, Marigold) on unlabeled, in-the-wild images without requiring ground-truth intrinsic maps. The judge achieves the following accuracy on held-out synthetic test sets (InteriorVerse / HyperSim):
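As a rough illustration of how a pairwise judge answer can be turned into a scalar reward for GRPO (the actual reward shaping lives in the GitHub repository; the function name and the ±1/0 scheme below are assumptions for this sketch):

```python
def judge_reward(judge_answer: str, predicted_winner: str) -> float:
    """Illustrative mapping from a free-text judge answer to a scalar reward.

    `predicted_winner` is "red" or "green", derived from the student
    model's own intrinsic prediction at the two marked points.
    """
    answer = judge_answer.lower()
    if "red" in answer and "green" not in answer:
        judged = "red"
    elif "green" in answer and "red" not in answer:
        judged = "green"
    else:
        # Ambiguous or tie answer: provide no learning signal
        return 0.0
    return 1.0 if judged == predicted_winner else -1.0
```

Because the judge stays frozen, the student model is rewarded only when its own relative ordering at the two points agrees with the judge's answer.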

| Modality   | Accuracy | Macro F1 |
|------------|----------|----------|
| Depth      | 0.962    | 0.962    |
| Normals    | 0.935    | 0.933    |
| Albedo     | 0.894    | 0.889    |
| Irradiance | 0.876    | 0.878    |

## Usage

The model expects an RGB image with two hollow circle markers drawn on it, one red (255, 0, 0) and one green (0, 255, 0), indicating the two points to compare. Full point-drawing utilities and usage examples are available in the GitHub repository.
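The repository ships its own drawing helpers; if you want a quick stand-in, a hollow circle marker can be drawn with Pillow. The function name, radius, and stroke width below are illustrative, not the repo's exact API:

```python
from PIL import Image, ImageDraw


def draw_points(image, red_xy, green_xy, radius=12, width=4):
    """Draw hollow red and green circle markers at two (x, y) pixel locations."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for (x, y), color in [(red_xy, (255, 0, 0)), (green_xy, (0, 255, 0))]:
        # Outline-only ellipse leaves the marked pixel itself unobscured
        draw.ellipse(
            [x - radius, y - radius, x + radius, y + radius],
            outline=color,
            width=width,
        )
    return out
```

Keeping the circles hollow matters: the judge compares the image content *at* the marked points, so the markers should not paint over the pixels being compared.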

### Quick start

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "adirik/InternVL2_5-4B-Intrinsic-Judge",
    torch_dtype=torch.bfloat16,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "adirik/InternVL2_5-4B-Intrinsic-Judge",
    trust_remote_code=True,
    use_fast=False,
)

# Load a point-pair annotated image (red and green markers drawn on RGB)
# See the GitHub repo for the load_image() and draw_points() utility functions
pixel_values = load_image("annotated_image.jpg").to(torch.bfloat16).cuda()

question = "Which point appears to be closer to the camera — red or green?"
generation_config = dict(max_new_tokens=128, do_sample=False)

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
# e.g. "The green point is closer."
```
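The `load_image()` helper above comes from the repository. As a minimal single-tile stand-in, following InternVL's documented preprocessing (resize to 448 px, scale to [0, 1], normalize with ImageNet statistics), something like the sketch below works; it omits InternVL's dynamic tiling and aspect-ratio handling, so treat it as an approximation:

```python
import torch
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def load_image(path, input_size=448):
    """Single-tile InternVL-style preprocessing: resize, scale to [0, 1],
    normalize with ImageNet statistics, and add a leading tile dimension."""
    image = Image.open(path).convert("RGB").resize((input_size, input_size))
    tensor = torch.tensor(list(image.getdata()), dtype=torch.float32)
    tensor = tensor.view(input_size, input_size, 3).permute(2, 0, 1) / 255.0
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
    return ((tensor - mean) / std).unsqueeze(0)  # shape (1, 3, H, W)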

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{Dirik2025ReasonXMI,
  title     = {ReasonX: MLLM-Guided Intrinsic Image Decomposition},
  author    = {Dirik, Alara and Wang, Tuanfeng and Ceylan, Duygu
               and Zafeiriou, Stefanos and Fr{\"u}hst{\"u}ck, Anna},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```