From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
arXiv:2604.06448v1
Abstract
Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that uses unsupervised node-level graph embeddings to identify services whose real-event behavior is under-represented in load tests.
Introduction
Streaming platforms such as Prime Video must sustain performance through high-demand events. Load tests are effective for assessing capacity, but they often fail to reproduce the complex user behavior seen during actual events. This article describes an approach to anomaly detection in microservice architectures that uses graph embeddings to find services that behave differently under load tests and live traffic.
Methodology
Our approach uses a Graph Convolutional Network Graph Autoencoder (GCN-GAE) to learn structural representations from directed, weighted service graphs at minute-level resolution. Anomalies are identified by computing the cosine similarity between the embeddings generated from load tests and those generated from actual event data: a service whose two embeddings have low similarity behaved differently during the event than it did under test.
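The comparison step can be illustrated with a minimal sketch. It assumes that each service already has one embedding vector from a load test and one from a live event, produced by the same trained encoder so that the two are comparable. The service names, the two-dimensional vectors, and the 0.5 threshold are hypothetical values chosen for illustration; they are not taken from the system described here.

```python
import numpy as np

def cosine_similarity_per_service(load_emb, event_emb, eps=1e-12):
    """Row-wise cosine similarity between two (n_services, dim) matrices."""
    dot = np.sum(load_emb * event_emb, axis=1)
    norms = np.linalg.norm(load_emb, axis=1) * np.linalg.norm(event_emb, axis=1)
    # eps guards against division by zero for all-zero embeddings
    return dot / np.maximum(norms, eps)

def flag_services(names, load_emb, event_emb, threshold=0.5):
    """Return (name, similarity) for services below the (assumed) threshold."""
    sims = cosine_similarity_per_service(load_emb, event_emb)
    return [(n, float(s)) for n, s in zip(names, sims) if s < threshold]

# Hypothetical services and embeddings, for illustration only.
names = ["playback", "catalog", "ads"]
load_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
event_emb = np.array([[2.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

sims = cosine_similarity_per_service(load_emb, event_emb)
flagged = flag_services(names, load_emb, event_emb)
print(flagged)
```

In this toy input, "playback" changes only in magnitude, so its cosine similarity stays at 1 and it is not flagged. "catalog" points in an orthogonal direction during the event, so it is the only service flagged. Cosine similarity ignores vector length, which is one reason it suits comparing structural roles rather than raw traffic volume.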
Key Features of the Anomaly Detection System
- Unsupervised Node-Level Graph Embeddings: The system identifies under-represented services without requiring labeled data, making it adaptable to various scenarios.
- Minute-Level Resolution: By analyzing data at a granular level, we can detect anomalies that may occur within short time frames.
- Incident-Related Service Identification: The system effectively flags services that are affected during incidents, streamlining the debugging process.
- Early Detection Capability: By comparing embeddings from load tests to those from actual events, the system can detect potential issues before they escalate.
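To make "directed, weighted service graphs at minute-level resolution" concrete, here is one way such graphs could be assembled from call records. This is a sketch under stated assumptions: the record format `(epoch_seconds, caller, callee)`, the function name `build_minute_graphs`, and the choice of call count as the edge weight are all hypothetical, since the article does not specify how edges or weights are defined.

```python
from collections import defaultdict
import numpy as np

def build_minute_graphs(calls, services):
    """Group call records into one directed, weighted adjacency matrix per minute.

    calls: iterable of (epoch_seconds, caller, callee) tuples (assumed format).
    Returns {minute_index: matrix}, where matrix[i, j] counts calls from
    services[i] to services[j] in that minute.
    """
    index = {name: i for i, name in enumerate(services)}
    n = len(services)
    graphs = defaultdict(lambda: np.zeros((n, n)))
    for ts, caller, callee in calls:
        minute = int(ts // 60)
        # Direction matters: the row is the caller, the column is the callee.
        graphs[minute][index[caller], index[callee]] += 1.0
    return dict(graphs)

# Hypothetical services and call records, for illustration only.
services = ["a", "b", "c"]
calls = [(0, "a", "b"), (30, "a", "b"), (61, "b", "c")]
graphs = build_minute_graphs(calls, services)
print(sorted(graphs))   # minute indices that contain at least one call
print(graphs[0])
```

Each per-minute matrix would then be the input to a graph encoder. Because the matrix is not symmetric, a call from "a" to "b" does not imply an edge from "b" to "a".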
Evaluation and Results
To validate the effectiveness of our anomaly detection system, we introduced a preliminary synthetic anomaly injection framework for controlled evaluation. The results demonstrate:
- Precision: 96% – of the services the system flagged as anomalous, 96% were true anomalies.
- Low False Positive Rate: 0.08% – only 0.08% of normal services were incorrectly flagged, which keeps unnecessary alerts to a minimum.
- Recall: 58% – the system detected 58% of the injected anomalies; recall remains limited under conservative propagation assumptions.
These metrics illustrate the practical utility of our approach within Prime Video and provide insights for future enhancements.
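For readers less familiar with how the three metrics relate, the following sketch computes them from a set of injected anomalies and a set of flagged services. The service names and counts are a toy example and do not reproduce the evaluation data; `detection_metrics` is a hypothetical helper name.

```python
def detection_metrics(injected, flagged, all_services):
    """Precision, recall, and false positive rate for a set-based detector."""
    injected, flagged, all_services = set(injected), set(flagged), set(all_services)
    tp = len(injected & flagged)                  # anomalies correctly flagged
    fp = len(flagged - injected)                  # normal services flagged
    fn = len(injected - flagged)                  # anomalies missed
    tn = len(all_services - injected - flagged)   # normal services left alone
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Toy example: 10 services, 4 injected anomalies, 3 services flagged.
all_services = [f"s{i}" for i in range(10)]
injected = ["s0", "s1", "s2", "s3"]
flagged_services = ["s0", "s1", "s9"]
metrics = detection_metrics(injected, flagged_services, all_services)
print(metrics)
```

Precision is computed over flagged services, recall over injected anomalies, and false positive rate over normal services. Because the denominators differ, a detector that flags conservatively can score high precision and a very low false positive rate while still missing a sizable share of anomalies, which is the pattern reported above.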
Conclusion
Our graph-based anomaly detection system not only addresses the limitations of traditional load testing but also lays the groundwork for broader applications across microservice ecosystems. As streaming services continue to evolve, the need for robust anomaly detection mechanisms becomes increasingly critical. We aim to further refine our approach and explore its implementation in diverse environments.
Future Directions
Looking ahead, we plan to improve recall and explore integrating additional features and datasets. Our findings offer methodological lessons that can inform future research and applications in microservices and beyond.
