M2GRPO: Multi-Agent Policy Optimization for Underwater Robots

M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Source: arXiv:2604.19404v1 (cross-listed)

Abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm.

Key Features of M$^{2}$GRPO

The M$^{2}$GRPO policy combines three components aimed at cooperative pursuit with biomimetic underwater robots:

Selective State-Space Mamba Policy: This policy leverages observation history to capture long-horizon temporal dependencies.
Attention-Based Relational Features: Attention over relational features encodes inter-agent interactions, so each robot can condition its actions on the other agents as the pursuit evolves.
Bounded Continuous Actions: Actions are produced through normalized Gaussian sampling, which keeps each continuous action within a bounded range.
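
To make two of these components more concrete, here is a minimal Python sketch of a selective state-space recurrence and a bounded Gaussian action sampler. It is an illustration under stated assumptions, not the authors' implementation: the scalar hidden state, the weights `w_delta`, `w_b`, and `w_c` (stand-ins for learned projections), and the tanh squashing used to bound the action are all hypothetical choices. The source only states that actions come from "normalized Gaussian sampling" and does not spell out the mechanism.

```python
import math
import random

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar selective state-space recurrence over an observation history.

    Hypothetical simplification: one input channel, one hidden state, and
    scalar weights standing in for the learned projections of a Mamba layer.
    """
    h, ys = 0.0, []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # discretized decay, in (0, 1] when a < 0
        b_bar = delta * (w_b * x)      # input-dependent input gate (Euler approximation)
        h = a_bar * h + b_bar * x      # state update carries history forward
        ys.append((w_c * x) * h)       # input-dependent readout
    return ys

def sample_bounded_action(mean, log_std, low=-1.0, high=1.0, rng=random):
    """Draw a Gaussian sample, squash it with tanh, and rescale to [low, high].

    The tanh squashing is an assumption; the source does not specify it.
    """
    raw = rng.gauss(mean, math.exp(log_std))
    squashed = math.tanh(raw)          # lies in [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)

history = [0.5, -0.2, 1.0, 0.0]
features = selective_scan(history)
action = sample_bounded_action(mean=features[-1], log_std=-1.0)
```

The "selective" part is that the step size, input gate, and readout all depend on the current input, so the recurrence can decide per step how much history to keep. A real Mamba layer does this with vector-valued states and learned linear projections.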

Improved Credit Assignment

To improve credit assignment without compromising stability, M$^{2}$GRPO combines two elements:

Group-Relative Advantages: Rewards are normalized across agents within each episode, allowing for more accurate assessment of each agent’s contribution to the group’s success.
Multi-Agent Extension of GRPO: This extension significantly reduces the demand for training resources while enabling stable and scalable policy updates.
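
The normalization step can be sketched in a few lines of Python. This is a hedged illustration, assuming one scalar reward per agent per episode and the standard mean and standard-deviation normalization used in GRPO; the function names, the `eps` constant, and the clip range of 0.2 are assumptions, and the paper's exact objective may differ.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-agent episode rewards against the team mean and std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when all agents receive the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO-style clipped objective for one sample (assumed clip range)."""
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical episode rewards for a team of four pursuers.
rewards = [1.0, 0.0, 2.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In the original single-agent GRPO formulation, group statistics of this kind stand in for a learned value baseline. That is one plausible reading of the reduced training resource demand mentioned above, although this summary does not confirm the mechanism.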

Performance Evaluation

Extensive simulations and real-world pool experiments have been conducted to evaluate the effectiveness of M$^{2}$GRPO:

The framework was tested across various team scales and evader strategies.
Results indicate that M$^{2}$GRPO consistently outperforms both Multi-Agent Proximal Policy Optimization (MAPPO) and recurrent baselines.
The gains are reported on two metrics: pursuit success rate and capture efficiency.

Conclusion

Overall, the proposed M$^{2}$GRPO framework offers a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems. It addresses long-horizon decision-making, partial observability, and inter-robot coordination within a single framework, and it is validated both in simulation and in real-world pool experiments.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
