SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge Paper • 2505.21605 • Published May 27, 2025
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data Paper • 2510.09781 • Published Oct 10, 2025 • 27
PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory Paper • 2512.06688 • Published Dec 7, 2025 • 2
Emergent Social Intelligence Risks in Generative Multi-Agent Systems Paper • 2603.27771 • Published Mar 29 • 52
Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty? Paper • 2605.12684 • Published May 12 • 11
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? Paper • 2606.05080 • Published 18 days ago • 30
Steering Multimodal Large Language Models Decoding for Context-Aware Safety Paper • 2509.19212 • Published Sep 23, 2025
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments Paper • 2510.01179 • Published Oct 1, 2025 • 29
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL Paper • 2505.23977 • Published May 29, 2025 • 10
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning Paper • 2505.14625 • Published May 20, 2025 • 13
KodCode-V1 Collection KodCode-V1 is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. • 5 items • Updated Mar 2 • 5