---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds
This repository contains the **RL training and inference pipeline** for DeepBattler. It combines **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)**, optimized with **GRPO (Group Relative Policy Optimization)**, to train a policy model for Hearthstone Battlegrounds decision-making.
## Overview
DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that uses multi-candidate feedback (RLAIF) and serves as the main optimization step. The trained model is served via a FastAPI endpoint for real-time inference.
**Key Features:**
- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- **LoRA Fine-tuning** - Parameter-efficient training with PEFT
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment
## Project Structure
```
DeepBattler-RL/
├── RL/                                          # Core RL training & evaluation
│   ├── train_battleground_rlaif.py              # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py               # Evaluation scripts
│   ├── infer_battleground_cloud.py              # Cloud inference utilities
│   ├── battleground_nl_utils.py                 # Game state to natural language conversion
│   └── datasets/                                # Training data (JSONL format)
├── app.py                                       # FastAPI inference server
├── Dockerfile                                   # Docker deployment config
├── requirements.txt                             # Python dependencies
├── Agent/                                       # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                           # HDT plugin for game state extraction
```
## Quick Start
### Installation
```bash
pip install -r requirements.txt
```
**Requirements:**
- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)
### Running the Inference Server
```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
The server loads:
- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
### API Usage
**POST `/generate_actions`**
```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```
**Response:**
```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```
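As a minimal sketch of calling this endpoint from Python, the snippet below builds the request body shown above and posts it with the standard library. The helper names `build_request` and `generate_actions`, the `http://localhost:7860` address, and the empty placeholder state are illustrative assumptions, not part of the repository.

```python
import json
import urllib.request


def build_request(phase, turn, state, max_new_tokens=256, temperature=0.2):
    """Assemble the JSON body for POST /generate_actions (hypothetical helper)."""
    return {
        "phase": phase,
        "turn": turn,
        "state": state,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }


def generate_actions(base_url, payload, timeout=60.0):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/generate_actions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage, assuming a server is running locally on port 7860.
# The empty state below is only a placeholder for real plugin output.
#
# state = {"game_state": {}, "tavern": [], "hand": [], "board": []}
# result = generate_actions("http://localhost:7860",
#                           build_request("PlayerTurn", 5, state))
# for action in result["actions"]:
#     print(action["type"])
```

A low `temperature` such as the 0.2 used in the example request keeps sampling close to the most likely action sequence.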
## Training
### Dataset Format
Training data is stored in JSONL format under `RL/datasets/`:
```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```
Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
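To illustrate how per-candidate rewards can feed a group-relative objective, the sketch below applies the usual GRPO normalization (reward minus group mean, divided by group standard deviation) to the candidates of one record. It assumes the `reward` field is used directly as the group reward; the training scripts may compute advantages differently, and the function names here are hypothetical.

```python
import json
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one candidate group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def advantages_for_record(jsonl_line):
    """Return (role, advantage) pairs for one JSONL training record."""
    record = json.loads(jsonl_line)
    candidates = record["candidates"]
    advantages = group_relative_advantages([c["reward"] for c in candidates])
    return [(c["role"], a) for c, a in zip(candidates, advantages)]
```

With the rewards from the example record (1.0, 0.5, -0.5), the `expert` candidate receives the largest positive advantage, `medium` a small positive one, and `bad` a negative one, so the policy is pushed toward the expert action relative to the rest of the group.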
### Running Training
**SFT + GRPO Pipeline:**
```bash
python RL/train_battleground_rlaif.py \
--model Qwen/Qwen3-4B-Instruct \
--data RL/datasets/battleground_rlaif_multicandidate.jsonl \
--output ./battleground_rlaif_qwen \
--sft_epochs 3 \
--grpo_epochs 3
```
**With Game History Context:**
```bash
python RL/train_battleground_rlaif_gamehistory.py \
--model Qwen/Qwen3-4B-Instruct \
--output ./battleground_rlaif_qwen_gamehistory
```
### Training Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip SFT phase |
| `--skip_grpo` | False | Skip GRPO phase |
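For readers who want to see how these flags and defaults might map onto a command-line parser, here is a minimal `argparse` sketch reconstructed from the table. It is not the parser in `RL/train_battleground_rlaif.py`; the `--data` and `--output` flags come from the example commands and are given no default because none is documented.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="SFT + GRPO training (sketch)")
    parser.add_argument("--model", default="Qwen/Qwen3-4B-Instruct",
                        help="Base model path")
    parser.add_argument("--data", default=None,
                        help="JSONL dataset path (default not documented)")
    parser.add_argument("--output", default=None,
                        help="Output directory (default not documented)")
    parser.add_argument("--sft_epochs", type=int, default=3)
    parser.add_argument("--grpo_epochs", type=int, default=3)
    parser.add_argument("--per_device_batch_size", type=int, default=4)
    parser.add_argument("--sft_learning_rate", type=float, default=1e-5)
    parser.add_argument("--grpo_learning_rate", type=float, default=5e-6)
    parser.add_argument("--max_seq_length", type=int, default=1024)
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--skip_sft", action="store_true")
    parser.add_argument("--skip_grpo", action="store_true")
    return parser
```

Under this reading, passing `--skip_sft` alone would run only the GRPO phase, and passing `--skip_grpo` alone would run only the SFT warmup.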
## Docker Deployment
```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```
For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.
## Action Types
The model outputs JSON action sequences with these action types:
| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
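Because the model emits free-form JSON, a client may want to check the `type` of each action before acting on it. The following is a small sketch of such a check against the table above; the function name `unknown_action_types` is hypothetical, and the check covers only the `type` field, not per-action fields such as `tavern_index`.

```python
# Action types listed in the table above.
ACTION_TYPES = frozenset({
    "BUY_FROM_TAVERN",
    "PLAY_FROM_HAND",
    "SELL_FROM_BOARD",
    "HERO_POWER",
    "ROLL",
    "UPGRADE_TAVERN",
    "FREEZE",
    "END_TURN",
})


def unknown_action_types(actions):
    """Return the `type` values in `actions` that are not documented types."""
    return [a.get("type") for a in actions if a.get("type") not in ACTION_TYPES]
```

An empty return value means every action in the sequence uses a documented type; anything else can be logged or rejected before it reaches the game client.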
## Related Components
- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)
For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).
## License
This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.
---