Reward Functions

This module provides reward functions, primarily intended for use with the GRPOTrainer.

Format rewards

think_format_reward

trl.rewards.think_format_reward

( completions: list[list[dict[str, str]]], **kwargs ) → list[float]

Parameters

completions (list[list[dict[str, str]]]) — List of completions to be evaluated. Each completion must be a list of one message, i.e. a dictionary containing the key "content" with the value being the text of the completion.
**kwargs — Additional keyword arguments. This function does not use them, but they are required in the function signature to ensure compatibility with trainers like GRPOTrainer.

Returns

list[float]

A list of rewards, where each reward is 1.0 if the completion matches the expected format, otherwise 0.0.

Reward function that checks if the reasoning process is enclosed within "<think>" and "</think>" tags. The function returns a reward of 1.0 if the format is correct, otherwise 0.0.

Example:

>>> from trl.rewards import think_format_reward

>>> completions = [
...     [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
...     [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
... ]
>>> think_format_reward(completions)
[1.0, 0.0]
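
To make the format check above concrete, here is a minimal sketch of how such a reward could be written. It is an illustration only and not necessarily how TRL implements it: the name `sketch_think_format_reward` and the exact regular expression are assumptions. The sketch assumes the completion must start with "<think>", contain a closing "</think>", and not open a second "<think>" block.

```python
import re

# Hypothetical pattern (an assumption, not taken from TRL's source):
# the text must begin with <think>, must not contain another <think>
# afterwards, and must contain a closing </think>.
THINK_PATTERN = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)


def sketch_think_format_reward(completions, **kwargs):
    """Return 1.0 for each completion that matches THINK_PATTERN, else 0.0.

    `completions` follows the conversational layout described above: each
    completion is a list holding one message dict with a "content" key.
    Extra keyword arguments are accepted and ignored.
    """
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if THINK_PATTERN.match(text) else 0.0 for text in contents]


completions = [
    [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
    [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
]
rewards = sketch_think_format_reward(completions, prompts=["unused"])
```

Under these assumptions, `rewards` agrees with the example above: the first completion scores 1.0 and the second, which never closes the tag, scores 0.0. Text placed before the opening tag or a second "<think>" block would also score 0.0 in this sketch; whether the library function treats those cases identically should be checked against its source.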
