Daily Papers

by AK and the research community

Jan 6

Multi-Objective Task-Aware Predictor for Image-Text Alignment

Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on context or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that each lack at least one of these key properties: (1) alignment with human judgments, (2) long-sequence processing, (3) inference efficiency, and (4) applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi- and single-objective scoring. MULTI-TAP can produce a single overall score using a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust across different LVLM architectures, achieving significantly higher performance than existing metrics and performing on par with the GPT-4o-based predictor, G-VEval, despite its smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can also produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP surpasses VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and on our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.

4 authors · Oct 1, 2025
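
The MULTI-TAP abstract above mentions training a lightweight ridge regression layer on frozen LVLM hidden states to score several objectives at once. The following is a minimal sketch of that general idea only, not the authors' implementation: the hidden states are replaced by random synthetic features, and the names `fit_ridge`, `predict`, `hidden_dim`, and `alpha` are hypothetical choices made for illustration. The only detail taken from the abstract is the use of seven scoring dimensions.

```python
import numpy as np


def fit_ridge(features, targets, alpha=1.0):
    """Closed-form multi-output ridge regression.

    Solves (Xc^T Xc + alpha * I) W = Xc^T Yc on mean-centered data,
    so the intercept is not penalized.
    """
    x_mean = features.mean(axis=0)
    y_mean = targets.mean(axis=0)
    xc = features - x_mean
    yc = targets - y_mean
    dim = features.shape[1]
    weights = np.linalg.solve(xc.T @ xc + alpha * np.eye(dim), xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights, bias


def predict(features, weights, bias):
    """Return one score per objective for each row of features."""
    return features @ weights + bias


# Synthetic stand-in for frozen hidden states (assumption: in practice these
# would be extracted from a pre-trained LVLM, which is not done here).
rng = np.random.default_rng(0)
n_samples, hidden_dim, n_objectives = 200, 32, 7
hidden_states = rng.normal(size=(n_samples, hidden_dim))

# Synthetic stand-in for human-annotated scores on seven dimensions.
true_weights = rng.normal(size=(hidden_dim, n_objectives))
noise = 0.1 * rng.normal(size=(n_samples, n_objectives))
scores = hidden_states @ true_weights + noise

alpha = 1.0
weights, bias = fit_ridge(hidden_states, scores, alpha=alpha)
predictions = predict(hidden_states[:5], weights, bias)
print(predictions.shape)
```

Because only the small regression layer is fit, the backbone's hidden states could be computed once and cached, which is one plausible reading of why such a layer is described as lightweight.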

The ALPINE-CRISTAL-JWST Survey: The Fast Metal Enrichment of Massive Galaxies at z~5

We present the stellar mass-metallicity relation (MZR) and mass-metallicity-star formation relation ("fundamental metallicity relation"; FMR) of 18 massive (log(M/M_⊙) = 9.5-11) main-sequence galaxies at z~5 from the ALPINE-CRISTAL-JWST sample. This sample complements recent JWST studies at stellar masses up to two orders of magnitude lower. The metallicities are derived using strong optical lines and verified by temperature-based oxygen abundance measurements for the five galaxies in which faint auroral lines are detected. We find little evolution at the massive end of the MZR between z~5 and cosmic noon at z~2, suggesting fast metal enrichment at early times. The FMR at z=5 exhibits a 5x larger scatter (preferentially to lower metallicities) compared to the local FMR. This scatter can be explained by bursty star formation and the direct build-up of metals in early galaxies, as well as by differences in age and outflow efficiencies. Capitalizing on all available samples, we find that the observed MZR and FMR over three orders of magnitude in stellar mass are generally in good agreement with results from cosmological simulations, although some simulations underestimate the metal enrichment at low stellar masses. This may be due to overly efficient metal-rich outflows. We show that the ALPINE-CRISTAL-JWST galaxies likely joined the current FMR at z~10 and will evolve into massive (log(M/M_⊙)~11.4) galaxies with super-solar metallicities by z=0.

56 authors · Oct 17, 2025