τ^2-Bench: Evaluating Conversational Agents in a Dual-Control Environment
Paper
•
2506.07982
•
Published
•
7
A Qwen3-4B model fine-tuned with GRPO reinforcement learning on tau2-bench tasks.
This model was trained using Group Relative Policy Optimization (GRPO) on the tau2-bench telecom domain. It builds on the SFT checkpoint and was trained for 95 steps with sparse binary rewards from task completion.
The model produces tool calls in inline JSON format:
<thinking>Analysis of the customer issue...</thinking>
{"name": "tool_name", "arguments": {"param": "value"}}
@article{yao2024tau2bench,
title={tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Yao, Shunyu and others},
journal={arXiv preprint arXiv:2506.07982},
year={2024}
}
Base model
Jarrodbarnes/Qwen3-4B-Instruct-SFT