SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
Abstract
SkillCoach is a self-evolving rubric framework that evaluates and improves agentic skill-use by analyzing skill selection, following, composition, and reflection processes, providing better supervision than outcome-only metrics.
Skills are becoming a reusable operational layer for LLM agents, encoding standard operating procedures (SOPs), domain rules, tool workflows, scripts, and validation routines. In realistic skill repositories, overlapping skills make reliable skill-use difficult. Final verifier success is too coarse a signal for both evaluation and training: an agent may still pass by trial and error while selecting distractor skills, skipping required steps, composing workflows incorrectly, or omitting final checks. We introduce SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use. SkillCoach derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It keeps the external verifier as a separate outcome signal, so that process quality can be distinguished from accidental task success. The evolved rubrics also serve as process supervision for selecting high-quality training trajectories. Experiments show that evolved rubrics substantially improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering for enhancing agentic skill-use.
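To make the distinction between process quality and outcome concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Trajectory` class, the [0, 1] score scale, the unweighted mean, the 0.75 threshold, and the choice to require both a verifier pass and a high process score are all assumptions made for illustration. Only the four dimension names and the idea of keeping the verifier as a separate signal come from the abstract.

```python
from dataclasses import dataclass

# The four rubric dimensions named in the abstract.
DIMENSIONS = ("selection", "following", "composition", "reflection")


@dataclass
class Trajectory:
    """Hypothetical record of one rollout (illustrative only)."""
    name: str
    verifier_passed: bool   # external outcome signal, kept separate from the rubric
    rubric_scores: dict     # dimension -> score in [0, 1] (assumed scale)


def process_score(traj: Trajectory) -> float:
    """Unweighted mean over the four dimensions (an assumed aggregation)."""
    return sum(traj.rubric_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def outcome_only_filter(trajs):
    """Keep every trajectory that the verifier accepted."""
    return [t for t in trajs if t.verifier_passed]


def process_aware_filter(trajs, threshold=0.75):
    """Keep trajectories that pass the verifier AND score well on process.

    The threshold value and the conjunction of both signals are assumptions.
    """
    return [
        t for t in trajs
        if t.verifier_passed and process_score(t) >= threshold
    ]


trajectories = [
    # Passed the verifier with a clean process.
    Trajectory("clean", True,
               {"selection": 1.0, "following": 1.0,
                "composition": 1.0, "reflection": 1.0}),
    # Passed by trial and error: distractor skill chosen, no final check.
    Trajectory("lucky", True,
               {"selection": 0.0, "following": 0.5,
                "composition": 0.5, "reflection": 0.0}),
    # Reasonable process, but the verifier rejected the result.
    Trajectory("failed", False,
               {"selection": 1.0, "following": 1.0,
                "composition": 0.5, "reflection": 0.5}),
]

if __name__ == "__main__":
    print([t.name for t in outcome_only_filter(trajectories)])
    print([t.name for t in process_aware_filter(trajectories)])
```

In this toy data, outcome-only filtering keeps the "lucky" trajectory alongside the "clean" one, while the process-aware filter keeps only "clean". That is the kind of accidental success the abstract says final accuracy hides.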
Get this paper in your agent:
hf papers read 2607.01874
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash