# ReasonX Intrinsic Judge

This is the MLLM judge released with ReasonX: MLLM-Guided Intrinsic Image Decomposition (CVPR 2026). It is fine-tuned from InternVL2.5-4B to make relative intrinsic comparisons between two marked points on an RGB image, across four physical modalities: depth, surface normals, albedo, and irradiance.

Paper: ReasonX: MLLM-Guided Intrinsic Image Decomposition
Project page: alaradirik.github.io/reasonx/
Code & Training Details: github.com/adobe-research/ReasonX

## What does this model do?

Given an RGB image with two colored point markers (red and green), the judge answers modality-specific pairwise questions such as:

| Modality   | Example question                                                             |
|------------|------------------------------------------------------------------------------|
| Depth      | Which point appears closer to the camera — red or green?                     |
| Normals    | Which point lies on a surface facing more towards the camera — red or green? |
| Albedo     | Do the red and green points have the same base color?                        |
| Irradiance | Which point is more illuminated — red or green?                              |

In the ReasonX framework, the frozen judge is used as a reward model inside a GRPO loop to fine-tune intrinsic decomposition models (PRISM, Marigold) on unlabeled, in-the-wild images without requiring ground-truth intrinsic maps. The judge achieves the following accuracy on held-out synthetic test sets (InteriorVerse / HyperSim):
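As a rough illustration of how a pairwise judge answer can be turned into a scalar reward for GRPO (the actual reward shaping lives in the GitHub repository; the function name and the ±1/0 scheme below are assumptions for this sketch):

```python
def judge_reward(judge_answer: str, predicted_winner: str) -> float:
    """Illustrative mapping from a free-text judge answer to a scalar reward.

    `predicted_winner` is "red" or "green", derived from the student
    model's own intrinsic prediction at the two marked points.
    """
    answer = judge_answer.lower()
    if "red" in answer and "green" not in answer:
        judged = "red"
    elif "green" in answer and "red" not in answer:
        judged = "green"
    else:
        # Ambiguous or tie answer: provide no learning signal
        return 0.0
    return 1.0 if judged == predicted_winner else -1.0
```

Because the judge stays frozen, the student model is rewarded only when its own relative ordering at the two points agrees with the judge's answer.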

| Modality   | Accuracy | Macro F1 |
|------------|----------|----------|
| Depth      | 0.962    | 0.962    |
| Normals    | 0.935    | 0.933    |
| Albedo     | 0.894    | 0.889    |
| Irradiance | 0.876    | 0.878    |

## Usage

The model expects an RGB image with two hollow circle markers drawn on it, one red (255, 0, 0) and one green (0, 255, 0), indicating the two points to compare. Full point-drawing utilities and usage examples are available in the GitHub repository.
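The repository ships its own drawing helpers; if you want a quick stand-in, a hollow circle marker can be drawn with Pillow. The function name, radius, and stroke width below are illustrative, not the repo's exact API:

```python
from PIL import Image, ImageDraw


def draw_points(image, red_xy, green_xy, radius=12, width=4):
    """Draw hollow red and green circle markers at two (x, y) pixel locations."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for (x, y), color in [(red_xy, (255, 0, 0)), (green_xy, (0, 255, 0))]:
        # Outline-only ellipse leaves the marked pixel itself unobscured
        draw.ellipse(
            [x - radius, y - radius, x + radius, y + radius],
            outline=color,
            width=width,
        )
    return out
```

Keeping the circles hollow matters: the judge compares the image content *at* the marked points, so the markers should not paint over the pixels being compared.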

### Quick start

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "adirik/InternVL2_5-4B-Intrinsic-Judge",
    torch_dtype=torch.bfloat16,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "adirik/InternVL2_5-4B-Intrinsic-Judge",
    trust_remote_code=True,
    use_fast=False,
)

# Load a point-pair annotated image (red and green markers drawn on RGB)
# See the GitHub repo for the load_image() and draw_points() utility functions
pixel_values = load_image("annotated_image.jpg").to(torch.bfloat16).cuda()

question = "Which point appears to be closer to the camera — red or green?"
generation_config = dict(max_new_tokens=128, do_sample=False)

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
# e.g. "The green point is closer."
```
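The `load_image()` helper above comes from the repository. As a minimal single-tile stand-in, following InternVL's documented preprocessing (resize to 448 px, scale to [0, 1], normalize with ImageNet statistics), something like the sketch below works; it omits InternVL's dynamic tiling and aspect-ratio handling, so treat it as an approximation:

```python
import torch
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def load_image(path, input_size=448):
    """Single-tile InternVL-style preprocessing: resize, scale to [0, 1],
    normalize with ImageNet statistics, and add a leading tile dimension."""
    image = Image.open(path).convert("RGB").resize((input_size, input_size))
    tensor = torch.tensor(list(image.getdata()), dtype=torch.float32)
    tensor = tensor.view(input_size, input_size, 3).permute(2, 0, 1) / 255.0
    mean = torch.tensor(IMAGENET_MEAN).view(3, 1, 1)
    std = torch.tensor(IMAGENET_STD).view(3, 1, 1)
    return ((tensor - mean) / std).unsqueeze(0)  # shape (1, 3, H, W)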

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{Dirik2025ReasonXMI,
  title     = {ReasonX: MLLM-Guided Intrinsic Image Decomposition},
  author    = {Dirik, Alara and Wang, Tuanfeng and Ceylan, Duygu
               and Zafeiriou, Stefanos and Fr{\"u}hst{\"u}ck, Anna},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```