PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM (Paper: 2503.07111)
AlphaSpace (Paper) is a novel methodology designed to enhance the spatial reasoning capabilities of large language models (LLMs) for 3D Cartesian space navigation. AlphaSpace employs a semantics-based tokenization strategy that encodes height information through specialized semantic tokens, and it integrates primarily symbolic synthetic reasoning data. This approach enables LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates.
GPU Configuration: Cluster of 8x NVIDIA H200-SXM-140GB.
GPU Usage:
We use the Llama-Factory library to train the model.
| Parameter | Continual Training |
|---|---|
| Epoch | 1 |
| Global batch size | 128 |
| Learning Rate | 1e-4 |
| Learning rate scheduler | cosine with warmup |
| Optimizer | AdamW Fused |
| Warmup Ratio | 0.1 |
| Max length | 4096 |
| Precision | bf16 |
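
To make the table concrete, here is a minimal Python sketch of two things it implies: how a global batch size of 128 splits across 8 GPUs, and what a cosine schedule with a 0.1 warmup ratio does to a peak learning rate of 1e-4. It is not taken from the training code. The per-device batch size and the total step count are hypothetical, since neither is given above, and the exact rounding of the warmup step count may differ from what Llama-Factory does.

```python
import math

PEAK_LR = 1e-4        # "Learning Rate" from the table
WARMUP_RATIO = 0.1    # "Warmup Ratio" from the table
GLOBAL_BATCH = 128    # "Global batch size" from the table
NUM_GPUS = 8          # from the GPU configuration


def grad_accum_steps(per_device_batch: int) -> int:
    """Accumulation steps so that
    per_device_batch * NUM_GPUS * steps == GLOBAL_BATCH.
    per_device_batch is an assumption; the card does not state it."""
    per_step = per_device_batch * NUM_GPUS
    if GLOBAL_BATCH % per_step != 0:
        raise ValueError("per-device batch does not divide the global batch")
    return GLOBAL_BATCH // per_step


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero.
    total_steps is hypothetical; it depends on the dataset size."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Example: 4 samples per GPU needs 4 accumulation steps to reach 128.
    print(grad_accum_steps(4))
    # Learning rate at a few points of a hypothetical 1000-step run.
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at(s, 1000))
```

With a single epoch, the warmup covers the first tenth of all optimizer steps, so the model sees the peak learning rate only briefly before the cosine decay begins.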