M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit
arXiv:2604.19404v1 (cross-listed)
Abstract
Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm.
Key Features of M$^{2}$GRPO
The M$^{2}$GRPO policy combines three components for cooperative pursuit with biomimetic underwater robots:
- Selective State-Space Mamba Policy: This policy leverages observation history to capture long-horizon temporal dependencies.
- Attention-Based Relational Features: Attention over relational features encodes inter-agent interactions, which the policy uses to coordinate the robots.
- Bounded Continuous Actions: Actions are produced through normalized Gaussian sampling, which keeps the continuous outputs within fixed bounds.
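As a rough illustration of the first and third components, the sketch below runs a scalar selective state-space recurrence over an observation history and then draws a bounded action from a Gaussian. Everything in it is an assumption made for illustration, not the paper's architecture: the function names `selective_scan` and `sample_bounded_action`, the scalar weights, and in particular the tanh squashing, which is one common way to bound Gaussian samples and may differ from the normalization the authors use.

```python
import math
import random


def softplus(z):
    # Numerically stable softplus; keeps the step size non-negative.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(observations, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar selective state-space recurrence (hypothetical toy version).

    The step size, input gate, and readout all depend on the current
    input, which is what makes the recurrence "selective".
    """
    h = 0.0
    outputs = []
    for x in observations:
        delta = softplus(w_delta * x)   # input-dependent step size
        a_bar = math.exp(delta * a)     # discretized decay, in (0, 1] for a < 0
        b_bar = delta * (w_b * x)       # input-dependent input gate
        h = a_bar * h + b_bar * x       # hidden state summarizes the history
        outputs.append((w_c * x) * h)   # input-dependent readout
    return outputs


def sample_bounded_action(mean, log_std, rng, low=-1.0, high=1.0):
    """Draw a Gaussian sample and squash it into [low, high] with tanh.

    Assumption: tanh squashing stands in for the paper's normalization.
    """
    raw = rng.gauss(mean, math.exp(log_std))
    squashed = math.tanh(raw)           # lies in [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)


# Toy observation history for one robot (made-up values).
history = [0.2, -0.1, 0.4, 0.3]
feature = selective_scan(history)[-1]
rng = random.Random(0)
action = sample_bounded_action(mean=feature, log_std=-1.0, rng=rng)
```

In a real policy the scalars would be learned matrices and the Gaussian mean and log standard deviation would come from a network head; the scalar form only shows how the history enters the action.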
Improved Credit Assignment
To improve credit assignment without compromising stability, M$^{2}$GRPO extends GRPO to the multi-agent setting:
- Group-Relative Advantages: Rewards are normalized across agents within each episode, allowing for more accurate assessment of each agent’s contribution to the group’s success.
- Multi-Agent Extension of GRPO: This extension significantly reduces the demand for training resources while enabling stable and scalable policy updates.
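The normalization described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it treats each agent's "reward" as a single episode return, normalizes with the population standard deviation plus a small `eps`, and feeds the result into a PPO-style clipped surrogate with a clip range of 0.2. The function names are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalize per-agent episode returns against the group statistics."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std over the agents in the episode
    return [(r - mu) / (sigma + eps) for r in returns]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Clipped objective term for one sample (to be maximized).

    `ratio` is the new-to-old policy probability ratio for the action.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Made-up returns for three pursuers in one episode.
episode_returns = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(episode_returns)
```

Because the baseline is the group mean, no learned value function is needed for this step, which is consistent with the reduced training cost the summary mentions.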
Performance Evaluation
Extensive simulations and real-world pool experiments have been conducted to evaluate the effectiveness of M$^{2}$GRPO:
- The framework was tested across various team scales and evader strategies.
- M$^{2}$GRPO consistently outperforms Multi-Agent Proximal Policy Optimization (MAPPO) and recurrent baselines.
- The gains appear in both pursuit success rate and capture efficiency.
Conclusion
Overall, M$^{2}$GRPO offers a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems, addressing the challenges of long-horizon decision making, partial observability, and inter-robot coordination.
