SAGE: A Service Agent Graph-guided Evaluation Benchmark
arXiv:2604.09285v1
Rapid advances in Large Language Models (LLMs) have transformed customer service automation, but benchmarking their performance remains difficult. Existing evaluation frameworks rely mainly on static paradigms and single-dimensional metrics, which capture neither the diversity of real user interactions nor the strict adherence to structured Standard Operating Procedures (SOPs) that real-world deployments require.
To address these limitations, we introduce SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, which enable precise verification of logical compliance and comprehensive path coverage in dialogues.
Key Features of SAGE
- Dynamic Dialogue Graphs: These graphs allow for the representation of SOPs in a flexible manner, accommodating various user interactions and ensuring that all possible dialogue paths are covered during evaluation.
- Adversarial Intent Taxonomy: This taxonomy categorizes potential user intents that can be adversarial in nature, allowing for a robust analysis of how LLMs handle challenging conversational scenarios.
- Modular Extension Mechanism: This feature enables easy adaptation and deployment of the SAGE framework across different domains, facilitating low-cost integration into existing systems.
- Automated Dialogue Data Synthesis: SAGE supports the generation of synthetic dialogue data, which can be used to train and test LLMs, enhancing their capabilities in varied contexts.
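To make the graph idea concrete, the following is a minimal sketch of an SOP expressed as a directed graph, with path enumeration and a path-coverage ratio. It is an illustration under stated assumptions, not SAGE's actual data model: the refund SOP, the node names, and the functions enumerate_paths and path_coverage are all hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List, Sequence

# Hypothetical SOP for a refund request, written as a directed acyclic graph.
# Keys are dialogue steps; values are the steps allowed to follow them.
SOP_GRAPH: Dict[str, List[str]] = {
    "greet": ["verify_identity"],
    "verify_identity": ["check_order", "escalate"],
    "check_order": ["issue_refund", "deny_refund"],
    "issue_refund": [],
    "deny_refund": [],
    "escalate": [],
}


def enumerate_paths(graph: Dict[str, List[str]], start: str) -> List[List[str]]:
    """Return every path from start to a terminal node (graph assumed acyclic)."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        successors = graph[path[-1]]
        if not successors:
            # A node with no successors ends the dialogue.
            paths.append(path)
        for nxt in successors:
            stack.append(path + [nxt])
    return paths


def path_coverage(
    graph: Dict[str, List[str]], start: str, observed: Sequence[Sequence[str]]
) -> float:
    """Fraction of complete SOP paths that appear among the observed dialogues."""
    all_paths = {tuple(p) for p in enumerate_paths(graph, start)}
    covered = all_paths & {tuple(p) for p in observed}
    return len(covered) / len(all_paths)
```

In this toy graph there are three complete paths, so a test set that exercises only the successful-refund path covers one third of the SOP.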
Evaluation Framework
The evaluation process within SAGE involves a structured framework where Judge Agents and a Rule Engine critically analyze the interactions between User and Service Agents. This interaction analysis generates deterministic ground truth metrics, which are essential for accurately assessing the performance of LLMs.
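A deterministic rule check of this kind can be sketched as follows. This is an assumed, simplified stand-in for a rule engine, not the one described in the paper: the graph, the action names, and the function check_compliance are hypothetical, and the check covers only whether each transition is permitted by the SOP.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical SOP graph: each step maps to the steps allowed to follow it.
SOP_GRAPH: Dict[str, List[str]] = {
    "greet": ["verify_identity"],
    "verify_identity": ["check_order", "escalate"],
    "check_order": ["issue_refund", "deny_refund"],
    "issue_refund": [],
    "deny_refund": [],
    "escalate": [],
}


def check_compliance(
    graph: Dict[str, List[str]], start: str, actions: List[str]
) -> Tuple[bool, Optional[int]]:
    """Check an agent's action sequence against the SOP graph.

    Returns (True, None) when every transition is allowed, otherwise
    (False, i) where i is the index of the first offending action.
    """
    if not actions or actions[0] != start:
        return False, 0
    for i in range(1, len(actions)):
        allowed = graph.get(actions[i - 1], [])
        if actions[i] not in allowed:
            return False, i
    return True, None
```

Because the verdict depends only on the graph and the logged actions, the same dialogue always receives the same result, which is what makes such a check usable as ground truth alongside model-based judges.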
Experimental Findings
Extensive experiments on 27 LLMs across six industrial scenarios reveal a significant “Execution Gap”: models can accurately classify user intents but often fail to derive the correct subsequent actions. We also observe “Empathy Resilience,” in which models maintain a polite conversational demeanor despite underlying logical failures under high adversarial intensity.
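One way to quantify such a gap is to compare intent accuracy with action accuracy over the same turns. The sketch below is an assumed operationalization for illustration only; the text above does not give the paper's exact metric, and the Turn record and execution_gap function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One evaluated turn: gold labels versus the model's predictions."""
    gold_intent: str
    pred_intent: str
    gold_action: str
    pred_action: str


def execution_gap(turns: List[Turn]) -> float:
    """Intent accuracy minus action accuracy (assumed definition).

    A positive value means the model recognizes what the user wants
    more often than it takes the correct next step.
    """
    if not turns:
        raise ValueError("at least one turn is required")
    n = len(turns)
    intent_acc = sum(t.gold_intent == t.pred_intent for t in turns) / n
    action_acc = sum(t.gold_action == t.pred_action for t in turns) / n
    return intent_acc - action_acc
```

For example, a model that labels every intent correctly but picks the right action in only half of the turns has a gap of 0.5 under this definition.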
Conclusion
SAGE advances the evaluation of LLMs in customer service applications. By replacing static, single-metric benchmarks with dynamic, multi-faceted assessment, it supports more accurate and reliable performance evaluation. The code and resources for SAGE are available online.
