Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving
arXiv:2604.11734v2
Abstract: Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult.
We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories.
Key Features of Multi-ORFT
- Inter-Agent Self-Attention: Models interactions among agents so that their joint trajectories are coordinated rather than planned independently.
- Cross-Attention: Lets the planner attend to inputs beyond the agents' own trajectory tokens, supporting scene consistency and road adherence.
- AdaLN-Zero Scene Conditioning: Conditions the denoiser on scene context so that generated trajectories stay consistent with the driving environment.
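To make the AdaLN-Zero idea concrete, the following is a minimal sketch in plain Python, not the Multi-ORFT implementation. It assumes the usual AdaLN-Zero recipe: a conditioning vector is projected to a shift, a scale, and a gate, and that projection is zero-initialized so the block starts as an identity map. The class name `AdaLNZeroBlock`, the `sublayer` stand-in, and all dimensions are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, vec):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


class AdaLNZeroBlock:
    """Hypothetical residual block modulated by a scene-conditioning vector."""

    def __init__(self, dim, cond_dim):
        self.dim = dim
        # Zero-initialized projection producing shift, scale, and gate.
        self.weight = [[0.0] * cond_dim for _ in range(3 * dim)]
        self.bias = [0.0] * (3 * dim)

    def sublayer(self, h):
        # Stand-in for the attention or MLP sublayer (assumption for the demo).
        return [2.0 * v + 1.0 for v in h]

    def __call__(self, x, cond):
        mod = linear(self.weight, self.bias, cond)
        d = self.dim
        shift, scale, gate = mod[:d], mod[d:2 * d], mod[2 * d:]
        # Adaptive layer norm: scale and shift depend on the condition.
        h = [hv * (1.0 + s) + sh
             for hv, s, sh in zip(layer_norm(x), scale, shift)]
        out = self.sublayer(h)
        # Gated residual: a zero gate leaves x unchanged.
        return [xv + g * o for xv, g, o in zip(x, gate, out)]


block = AdaLNZeroBlock(dim=4, cond_dim=3)
x = [0.5, -1.0, 2.0, 0.0]
scene = [1.0, 2.0, 3.0]
y_init = block(x, scene)  # identical to x because the gate starts at zero
```

Because the gate starts at zero, the conditioned block initially passes its input through unchanged, and scene conditioning only takes effect as the projection is trained.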
In post-training, we formulate a two-level Markov Decision Process (MDP) that exposes step-wise reverse-kernel likelihoods for online optimization. We combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training.
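The sketch below illustrates, under stated assumptions, how step-wise reverse-kernel likelihoods and group-relative advantages might fit together. The source does not define the variance gate, so the rule used here (zeroing the advantages of any rollout group whose reward standard deviation falls below a threshold) is an assumption, as are the function names, the isotropic Gaussian reverse kernel, and the plain policy-gradient surrogate.

```python
import math
from statistics import mean, pstdev


def gaussian_log_prob(x, mu, sigma):
    """Log-density of an isotropic Gaussian N(mu, sigma^2 I) evaluated at x.

    Stands in for the likelihood of one reverse (denoising) step.
    """
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return (-0.5 * sq_dist / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))


def variance_gated_advantages(rewards, min_std=1e-3, eps=1e-8):
    """Group-relative advantages with a hypothetical variance gate.

    Rewards come from a group of rollouts of the same scene. If they barely
    differ, normalizing by the standard deviation would amplify noise, so the
    group is gated out by returning zero advantages.
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd < min_std:
        return [0.0] * len(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def rollout_surrogate(step_log_probs, advantage):
    """Policy-gradient surrogate for one rollout: advantage times the summed
    log-likelihood of its denoising steps."""
    return advantage * sum(step_log_probs)


flat_group = variance_gated_advantages([1.0, 1.0, 1.0, 1.0])
mixed_group = variance_gated_advantages([0.0, 1.0, 2.0])
step_lp = gaussian_log_prob([0.1, -0.2], [0.0, 0.0], 0.5)
```

In this sketch a group with identical rewards contributes no gradient, while a group with spread-out rewards yields zero-mean, unit-scale advantages that weight each rollout's summed denoising-step log-likelihood.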
Performance Results
On the Waymo Open Motion Dataset (WOMD) closed-loop benchmark, Multi-ORFT improved on the pre-trained planner as follows:
- Collision rate reduced from 2.04% to 1.89%.
- Off-road rate reduced from 1.68% to 1.36%.
- Average speed increased from 8.36 m/s to 8.61 m/s.
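For a sense of scale, the reported before-and-after values can be converted to relative changes. The short snippet below does only that arithmetic on the three figures above; the helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Percentage change from the pre-trained planner to Multi-ORFT."""
    return (after - before) / before * 100.0


# (pre-trained planner, Multi-ORFT) values as reported above.
metrics = {
    "collision rate (%)": (2.04, 1.89),
    "off-road rate (%)": (1.68, 1.36),
    "average speed (m/s)": (8.36, 8.61),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_change(before, after):+.2f}%")
# collision rate (%): -7.35%
# off-road rate (%): -19.05%
# average speed (m/s): +2.99%
```

In relative terms, the off-road rate changes the most, while the gains in collision rate and average speed are more modest.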
Moreover, Multi-ORFT outperformed several strong open-source baselines, including:
- SMART-large
- SMART-tiny-CLSFT
- VBD
Conclusion
The results indicate that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving. Relative to the pre-trained planner, Multi-ORFT lowers collision and off-road rates while raising average speed, improving both safety and traffic efficiency.
