M2GRPO: Multi-Agent Policy Optimization for Underwater Robots

M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Source: arXiv:2604.19404v1 (cross-listed)

Abstract

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm.

Key Features of M$^{2}$GRPO

M$^{2}$GRPO combines three components to support biomimetic underwater robots in cooperative pursuit:

  • Selective State-Space Mamba Policy: This policy leverages observation history to capture long-horizon temporal dependencies.
  • Attention-Based Relational Features: Attention over the other agents encodes inter-agent interactions, so each robot can coordinate with its teammates as the situation changes.
  • Bounded Continuous Actions: Actions are produced through normalized Gaussian sampling, which keeps each continuous command within fixed bounds.
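
As a rough illustration of the first and third items above, the sketch below shows a toy scalar selective state-space recurrence over an observation history, followed by a bounded action drawn from a Gaussian. It is not the authors' implementation. The function names, the scalar parameters (a, w_delta, w_b), and the use of tanh squashing to bound the action are all assumptions made for illustration; the summary does not say exactly how the "normalized Gaussian sampling" is carried out.

```python
import math
import random


def selective_scan(xs, a=1.0, w_delta=0.5, w_b=1.0):
    """Toy scalar selective state-space recurrence (hypothetical parameters).

    The step size delta depends on the current input, which is the
    "selective" part: each step decides how much past state to keep.
    """
    h = 0.0
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps delta > 0
        decay = math.exp(-delta * a)               # lies in (0, 1) when a > 0
        h = decay * h + delta * w_b * x            # discretized state update
    return h


def sample_bounded_action(mean, log_std, low, high, rng):
    """Draw a Gaussian sample and squash it into [low, high].

    tanh squashing is an assumed stand-in for the normalization step.
    """
    raw = rng.gauss(mean, math.exp(log_std))
    squashed = math.tanh(raw)                      # now in [-1, 1]
    scaled = low + 0.5 * (squashed + 1.0) * (high - low)
    return min(high, max(low, scaled))             # guard against rounding


if __name__ == "__main__":
    rng = random.Random(0)
    history = [0.2, -0.1, 0.4, 0.0]                # made-up observation values
    summary = selective_scan(history)
    action = sample_bounded_action(summary, -1.0, -1.0, 1.0, rng)
    print(summary, action)
```

In a real policy the scalars would be learned, vector-valued projections and the history summary would feed a network head that outputs the Gaussian mean and log standard deviation; the sketch only shows the shape of the computation.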

Improved Credit Assignment

To improve credit assignment without compromising stability, M$^{2}$GRPO relies on two mechanisms:

  • Group-Relative Advantages: Rewards are normalized across agents within each episode, allowing for more accurate assessment of each agent’s contribution to the group’s success.
  • Multi-Agent Extension of GRPO: This extension significantly reduces the demand for training resources while enabling stable and scalable policy updates.
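
The two items above can be sketched as follows, under stated assumptions. The first function normalizes per-agent rewards within one episode by the group mean and standard deviation, which is the usual group-relative form. The second is the standard PPO-style clipped surrogate that GRPO-type methods typically apply to such advantages. The function names, the eps and clip_eps values, and the use of one scalar reward per agent are hypothetical; the summary does not give the exact formulation used in M$^{2}$GRPO.

```python
import statistics


def group_relative_advantages(agent_rewards, eps=1e-8):
    """Normalize one episode's per-agent rewards across the group.

    Each agent is scored relative to its teammates, so no learned value
    baseline is needed in this sketch (an assumption, not a confirmed detail).
    """
    mean = statistics.fmean(agent_rewards)
    std = statistics.pstdev(agent_rewards)         # population std of the group
    return [(r - mean) / (std + eps) for r in agent_rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for a single sample.

    ratio is new_policy_prob / old_policy_prob for the taken action.
    """
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]                      # made-up rewards for 3 pursuers
    advantages = group_relative_advantages(rewards)
    objective = sum(clipped_surrogate(1.1, adv) for adv in advantages)
    print(advantages, objective)
```

In the original single-agent GRPO, replacing a learned critic with a group baseline is what lowers training cost; whether the multi-agent extension saves resources in exactly this way is assumed here rather than stated in the summary.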

Performance Evaluation

M$^{2}$GRPO was evaluated in extensive simulations and in real-world pool experiments:

  • The framework was tested across various team scales and evader strategies.
  • M$^{2}$GRPO consistently outperforms both Multi-Agent Proximal Policy Optimization (MAPPO) and recurrent baselines.
  • The improvements are reported in two metrics: pursuit success rate and capture efficiency.

Conclusion

Overall, M$^{2}$GRPO offers a practical and scalable approach to cooperative underwater pursuit with biomimetic robot systems. It targets the three difficulties named at the outset: long-horizon decision-making, partial observability, and inter-robot coordination.


Lazarus Omolua