Optimize Multi-Agent Consumer Assistants: Evaluation Blueprint

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Conversational shopping assistants (CSAs) are an emerging application of agentic AI, but moving one from prototype to production is hard. A recent arXiv paper (2603.03565v2) takes on two parts of that problem, how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems, using grocery shopping as its setting.

The authors note that grocery shopping makes both problems harder: user requests are often underspecified, results are highly sensitive to personal preferences, and choices are constrained by budget and inventory. That makes a robust framework for assessing and improving CSA performance essential.

Multi-Faceted Evaluation Rubric

To tackle these challenges, the paper introduces an evaluation rubric that breaks end-to-end shopping quality into structured dimensions, so that a CSA can be assessed on how well it meets user needs across the whole shopping experience. Key components of the rubric include:

Accuracy: Assessing the correctness of product recommendations and user query interpretations.
Relevance: Evaluating how well the suggestions align with user preferences and context.
Engagement: Measuring user interaction levels and satisfaction with the conversational flow.
Efficiency: Analyzing the time taken to complete a shopping task and the number of interactions required.
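One way to picture a rubric like this is as a small data structure with one score per dimension. The sketch below is illustrative only: the dimension names come from the list above, while the 1-5 scale, the weights, and the names `Dimension`, `RUBRIC`, and `overall_score` are assumptions made for the example, not values or identifiers from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float  # assumed relative importance, not taken from the paper


# Dimension names follow the article; the weights are illustrative assumptions.
RUBRIC = (
    Dimension("accuracy", 0.4),
    Dimension("relevance", 0.3),
    Dimension("engagement", 0.15),
    Dimension("efficiency", 0.15),
)


def overall_score(scores: dict[str, int], rubric=RUBRIC) -> float:
    """Weighted mean of per-dimension scores, each on an assumed 1-5 scale."""
    for dim in rubric:
        if dim.name not in scores:
            raise ValueError(f"missing score for dimension: {dim.name}")
        if not 1 <= scores[dim.name] <= 5:
            raise ValueError(f"score out of range for dimension: {dim.name}")
    total_weight = sum(dim.weight for dim in rubric)
    weighted_sum = sum(dim.weight * scores[dim.name] for dim in rubric)
    return weighted_sum / total_weight


# Hypothetical scores for a single conversation.
example = {"accuracy": 5, "relevance": 4, "engagement": 3, "efficiency": 2}
print(round(overall_score(example), 2))
```

Keeping the per-dimension scores next to the aggregate matters in practice: a single overall number can hide a conversation that is accurate but needlessly long.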

In addition, the authors develop a calibrated LLM-as-judge pipeline, in which automated evaluations are aligned with human annotations. The aim is an evaluation process that is both precise and reflective of real-world user experience.
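A common way to check whether an automated judge agrees with human annotators is a chance-corrected agreement statistic such as Cohen's kappa. The article does not say which metric the authors use, so the sketch below is an assumption about how such a check could look. The labels are made-up illustration data, not results from the paper.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of items where the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight conversations (illustration only).
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(cohens_kappa(human_labels, judge_labels))
```

In a calibration loop, a team might revise the judge prompt or rubric wording and recompute a statistic like this until agreement with human annotations is acceptable for each dimension.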

Optimizing Multi-Agent Systems

Building on this evaluation foundation, the researchers explore two complementary strategies for prompt optimization, both based on a state-of-the-art prompt optimizer known as GEPA (Shao et al., 2025). The strategies differ in scope: one tunes agents individually, the other tunes the system as a whole.

Sub-agent GEPA: This strategy focuses on optimizing individual agent nodes based on localized rubrics, allowing for tailored improvements that enhance specific aspects of the CSA.
MAMuT (Multi-Agent Multi-Turn) GEPA: A novel system-level approach that jointly optimizes prompts across multiple agents. This method uses multi-turn simulation and trajectory-level scoring to evaluate the collective performance of the agents throughout the interaction.
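The difference between the two strategies is mostly a difference in what gets scored. The toy sketch below is not GEPA and does not use its API. It replaces the optimizer with an exhaustive search over two hypothetical agents and swaps real multi-turn simulation for a lookup table, purely to show how picking each prompt by a local score can diverge from picking the combination by a whole-trajectory score. All agent names, prompt names, and numbers are made up for illustration.

```python
from itertools import product

# Hypothetical agents and candidate prompts (not from the paper).
CANDIDATES = {
    "planner": ["plan_v1", "plan_v2"],
    "search": ["search_v1", "search_v2"],
}

# Made-up local rubric scores for each agent's prompt in isolation.
LOCAL_SCORES = {
    ("planner", "plan_v1"): 0.6,
    ("planner", "plan_v2"): 0.8,
    ("search", "search_v1"): 0.5,
    ("search", "search_v2"): 0.9,
}


def local_score(agent: str, prompt: str) -> float:
    """Stand-in for scoring one agent node against its own rubric."""
    return LOCAL_SCORES[(agent, prompt)]


def trajectory_score(prompts: dict[str, str]) -> float:
    """Stand-in for simulating a multi-turn conversation and judging it whole."""
    score = sum(local_score(a, p) for a, p in prompts.items()) / len(prompts)
    # Assumed interaction effect: these two prompts work poorly together.
    if prompts["planner"] == "plan_v2" and prompts["search"] == "search_v2":
        score -= 0.3
    return score


def optimize_per_agent() -> dict[str, str]:
    """Pick each agent's prompt independently, using only its local score."""
    return {a: max(ps, key=lambda p: local_score(a, p)) for a, ps in CANDIDATES.items()}


def optimize_jointly() -> dict[str, str]:
    """Pick the prompt combination with the best whole-trajectory score."""
    agents = list(CANDIDATES)
    combos = (dict(zip(agents, c)) for c in product(*CANDIDATES.values()))
    return max(combos, key=trajectory_score)


per_agent = optimize_per_agent()
joint = optimize_jointly()
print(per_agent, trajectory_score(per_agent))
print(joint, trajectory_score(joint))
```

In this toy setup the locally best prompts interact badly, so the joint search settles on a different combination. Interactions of that kind are invisible to per-agent scoring, which is the motivation for scoring whole trajectories.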

Together, these strategies give developers two levers: targeted fixes for an individual agent, and system-level tuning that accounts for how agents interact over a full conversation.

Supporting Practitioners

The authors also recognize that practitioners building production-level CSAs need practical resources. They have made rubric templates and evaluation design guidance available to help teams build and optimize their own conversational shopping systems.

In conclusion, the paper offers a blueprint for the challenges of multi-agent shopping assistants: build a comprehensive evaluation first, then optimize against it. The goal is a more effective CSA and, ultimately, a more satisfying shopping experience for consumers.
