---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF** optimized using **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.

## Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that uses multi-candidate feedback (RLAIF) as the main optimization step. The trained model is served via a FastAPI endpoint for real-time inference.

**Key Features:**

- **SFT + GRPO Training Pipeline** - SFT warmup on human expert data (RLHF-style), then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- **LoRA Fine-tuning** - Parameter-efficient training with PEFT
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment
## Project Structure

```
DeepBattler-RL/
├── RL/                                          # Core RL training & evaluation
│   ├── train_battleground_rlaif.py              # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py               # Evaluation scripts
│   ├── infer_battleground_cloud.py              # Cloud inference utilities
│   ├── battleground_nl_utils.py                 # Game state to natural language conversion
│   └── datasets/                                # Training data (JSONL format)
├── app.py                                       # FastAPI inference server
├── Dockerfile                                   # Docker deployment config
├── requirements.txt                             # Python dependencies
├── Agent/                                       # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                           # HDT plugin for game state extraction
```
## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

**Requirements:**

- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)

### Running the Inference Server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

The server loads:

- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
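
Under the hood, the server applies the LoRA adapter on top of the base model. The snippet below is a minimal sketch of what that loading step could look like, assuming the `transformers` and `peft` libraries; it is illustrative, not the exact code in `app.py`:

```python
# Sketch: load the Qwen base model and attach the GRPO-trained LoRA adapter.
# Assumes `transformers` and `peft` are installed; variable names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,  # fall back to float32 on CPU-only machines
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER)  # merge-free LoRA attachment
model.eval()
```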
### API Usage

**POST `/generate_actions`**

```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```
**Response:**

```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```
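
A quick client-side example of calling the endpoint, sketched with the `requests` library; the payload fields mirror the request schema above and the `state` contents are placeholders to be filled with the plugin's exported JSON:

```python
# Sketch: call the inference server from Python. State values are placeholders.
import requests

payload = {
    "phase": "PlayerTurn",
    "turn": 5,
    "state": {
        "game_state": {},  # fill with the JSON exported by DeepBattlerPlugin
        "tavern": [],
        "hand": [],
        "board": [],
    },
    "max_new_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post("http://localhost:7860/generate_actions", json=payload, timeout=60)
resp.raise_for_status()
for action in resp.json()["actions"]:
    print(action["type"], action)
```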
## Training

### Dataset Format

Training data is stored in JSONL format under `RL/datasets/`:

```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```

Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
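
GRPO is "group relative" in that each candidate is scored against the other candidates for the same game state rather than against a learned value function: a candidate's advantage is its reward minus the group mean, typically divided by the group standard deviation. A small illustration of that normalization using the reward convention above (a sketch, not the training script's actual code):

```python
# Sketch: group-relative advantages, computed per game state over its candidates.
# Rewards follow the dataset convention above (expert=1.0, medium=0.5, bad=-0.5).
from statistics import mean, pstdev

candidates = [
    {"role": "expert", "reward": 1.0},
    {"role": "medium", "reward": 0.5},
    {"role": "bad", "reward": -0.5},
]

rewards = [c["reward"] for c in candidates]
mu, sigma = mean(rewards), pstdev(rewards)
for c in candidates:
    c["advantage"] = (c["reward"] - mu) / (sigma + 1e-8)  # epsilon avoids div-by-zero
    print(c["role"], round(c["advantage"], 3))
```

Candidates that beat the group average (the expert action) get positive advantages and are reinforced; below-average candidates get negative advantages and are pushed down.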
### Running Training

**SFT + GRPO Pipeline:**

```bash
python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --sft_epochs 3 \
  --grpo_epochs 3
```

**With Game History Context:**

```bash
python RL/train_battleground_rlaif_gamehistory.py \
  --model Qwen/Qwen3-4B-Instruct \
  --output ./battleground_rlaif_qwen_gamehistory
```
### Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip the SFT phase |
| `--skip_grpo` | False | Skip the GRPO phase |
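
Both scripts train LoRA adapters with PEFT (see Key Features). The sketch below shows what a typical adapter configuration could look like; the rank, alpha, and target modules here are illustrative assumptions, not the values hard-coded in the training scripts:

```python
# Sketch: a typical LoRA setup for a Qwen causal LM with PEFT.
# Hyperparameters are illustrative, not the training scripts' actual values.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```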
## Docker Deployment

```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.
## Action Types

The model outputs JSON action sequences with these action types:

| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
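
When consuming the response downstream (e.g., in a plugin or agent), it can help to validate each emitted action against this table before executing it. A small validation sketch; the field names follow the API example above, and the required-field mapping is an assumption rather than a documented schema:

```python
# Sketch: validate model-emitted actions against the supported action types.
# The required-index mapping is an assumption based on the API example above.
REQUIRED_FIELDS = {
    "BUY_FROM_TAVERN": ["tavern_index"],
    "PLAY_FROM_HAND": ["hand_index", "board_index"],
    "SELL_FROM_BOARD": ["board_index"],
    "HERO_POWER": [],
    "ROLL": [],
    "UPGRADE_TAVERN": [],
    "FREEZE": [],
    "END_TURN": [],
}

def validate_action(action: dict) -> bool:
    """Return True if the action type is known and its index fields are present."""
    fields = REQUIRED_FIELDS.get(action.get("type"))
    if fields is None:
        return False
    return all(field in action for field in fields)

assert validate_action({"type": "BUY_FROM_TAVERN", "tavern_index": 2})
assert not validate_action({"type": "PLAY_FROM_HAND"})  # missing index fields
```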
## Related Components

- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).

## License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.

---