Hugging Face
Building on HF
81.8
TFLOPS
Naman Vats
PRO
namanvats
12 followers · 38 following
AI & ML interests
Make Open Source AI win
Recent Activity
liked a model about 11 hours ago
deepseek-ai/DeepSeek-V4-Flash
upvoted a collection 25 days ago
AgentDoG
reacted to anakin87's post with ❤️ 26 days ago
How does LLM training with RL environments work?

It all starts with Reinforcement Learning with Verifiable Rewards:
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s).

Consider a more complex tic-tac-toe env. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

---

What happens at training? We use Group Relative Policy Optimization with a tic-tac-toe env. No critic model is needed: the group is the baseline. Simpler than PPO.

1. Rollout generation: from the same board, the model plays N games via sampling
2. Each game is scored with deterministic rewards (win, format, ...)
3. The mean score is computed across the group
4. Each rollout's advantage = its score minus the group mean
5. The model is updated to favor trajectories above the baseline

Repeat.

For a deep dive, check out https://github.com/anakin87/llm-rl-environments-lil-course (a free hands-on course on RL environments for LLMs).
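The scoring and baseline steps described in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the linked course: the function names `verifiable_reward` and `group_relative_advantages`, the exact-match reward rule, and the sample answers are all hypothetical. Many GRPO implementations also divide each advantage by the group's standard deviation; that step is left out here because the post does not mention it.

```python
from statistics import mean


def verifiable_reward(answer: str, ground_truth: str) -> float:
    """Hypothetical deterministic reward: 1.0 on an exact match
    (ignoring case and surrounding whitespace), else 0.0."""
    return 1.0 if answer.strip().lower() == ground_truth.strip().lower() else 0.0


def group_relative_advantages(scores: list[float]) -> list[float]:
    """Advantage of each rollout = its score minus the group mean.
    The group itself acts as the baseline, so no critic model is used."""
    if not scores:
        raise ValueError("need at least one rollout score")
    baseline = mean(scores)
    return [score - baseline for score in scores]


# Assumed example: four sampled answers to the same question, ground truth "4".
rollout_answers = ["4", "5", " 4 ", "four"]
scores = [verifiable_reward(a, "4") for a in rollout_answers]
advantages = group_relative_advantages(scores)

print(scores)      # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # [0.5, -0.5, 0.5, -0.5]
```

Rollouts with a positive advantage scored above the group mean and would be reinforced by the policy update; those with a negative advantage would be discouraged. The update step itself is not shown.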
Organizations
namanvats's datasets (1)
namanvats/harbor-goose-openhands-benchmark
Viewer • Updated Apr 12 • 246 • 103 • 3