Graph Embedding Anomaly Detection in Microservice Systems

From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

Source: arXiv:2604.06448v1 (announce type: cross)

Abstract

Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that uses unsupervised node-level graph embeddings to identify under-represented services, that is, services whose behavior under real event traffic is not adequately covered by load tests.

Introduction

Streaming platforms like Prime Video must maintain performance during high-demand events. Traditional load tests are effective for assessing capacity, but they often fall short of replicating the complex user behaviors observed during actual events. This article describes a graph-based approach to anomaly detection in microservice architectures that targets this gap.

Methodology

Our approach uses a graph autoencoder with a graph convolutional network encoder (GCN-GAE) to learn structural representations from directed, weighted service graphs at minute-level resolution. Each service is a node, and the model produces one embedding per service. Anomalies are flagged based on the cosine similarity between a service's load-test embedding and its embedding from actual event data: low similarity indicates that the service behaved differently during the event than during the test.
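To make the pipeline more concrete, the following is a minimal sketch of one forward pass through a GCN-GAE on a tiny service graph, written in plain Python. Everything in it is an assumption made for illustration: the call records, the service names, the row normalization of the adjacency matrix, the one-hot node features, and the fixed, untrained weight matrices. It is not the Prime Video implementation, and the training loop that would fit the weights is omitted.

```python
import math
from collections import defaultdict

# Hypothetical call records: (minute, caller, callee, request_count).
CALLS = [
    (0, "gateway", "playback", 120),
    (0, "gateway", "catalog", 40),
    (0, "playback", "drm", 80),
    (1, "gateway", "playback", 300),
    (1, "playback", "drm", 290),
]
SERVICES = ["gateway", "playback", "catalog", "drm"]


def build_minute_graphs(calls):
    """Group call records into one directed, weighted edge map per minute."""
    graphs = defaultdict(lambda: defaultdict(float))
    for minute, caller, callee, count in calls:
        graphs[minute][(caller, callee)] += count
    return {minute: dict(edges) for minute, edges in graphs.items()}


def normalized_adjacency(edges, services):
    """Dense adjacency with self-loops, each row scaled to sum to 1.

    Row normalization is an assumption of this sketch for directed graphs.
    """
    index = {name: i for i, name in enumerate(services)}
    n = len(services)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (caller, callee), weight in edges.items():
        adj[index[caller]][index[callee]] += weight
    return [[value / sum(row) for value in row] for row in adj]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj_norm, features, weights, relu=True):
    """One graph convolution: A_norm @ X @ W, optionally followed by ReLU."""
    out = matmul(matmul(adj_norm, features), weights)
    if relu:
        out = [[max(0.0, v) for v in row] for row in out]
    return out


def decode(embeddings):
    """Inner-product decoder: sigmoid(Z @ Z^T) scores every node pair."""
    transposed = [list(col) for col in zip(*embeddings)]
    scores = matmul(embeddings, transposed)
    return [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in scores]


graphs = build_minute_graphs(CALLS)
adj = normalized_adjacency(graphs[0], SERVICES)

# One-hot features, a common choice when nodes carry no attributes.
features = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Untrained placeholder weights; training would fit these to the graph.
w0 = [[0.5, -0.2, 0.1], [0.1, 0.4, -0.3], [-0.3, 0.2, 0.6], [0.6, 0.1, 0.2]]
w1 = [[0.3, -0.1], [0.2, 0.5], [-0.4, 0.1]]

hidden = gcn_layer(adj, features, w0)
embeddings = gcn_layer(adj, hidden, w1, relu=False)  # one 2-d vector per service
reconstruction = decode(embeddings)
```

One design point worth noting: an inner-product decoder produces a symmetric score matrix, so it does not by itself reconstruct edge direction. How the original system handles direction and edge weights in its reconstruction objective is not stated in the source.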

Key Features of the Anomaly Detection System

  • Unsupervised Node-Level Graph Embeddings: The system learns one embedding per service and identifies under-represented services without requiring labeled data.
  • Minute-Level Resolution: Service graphs are analyzed minute by minute, so anomalies that last only a short time can still be detected.
  • Incident-Related Service Identification: The system flags services tied to documented incidents, which narrows the search during debugging.
  • Early Detection Capability: Comparing embeddings from load tests with those from actual events can surface potential issues before they escalate.
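The comparison between load-test and event embeddings can be sketched as follows. The embedding values, the service names, and the threshold of 0.8 are all hypothetical; the source does not state how the real threshold is chosen.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention in this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def flag_services(load_test, event, threshold=0.8):
    """Return services whose event embedding diverges from the load-test one."""
    flagged = {}
    for service, test_vec in load_test.items():
        if service not in event:
            continue  # no event embedding to compare against
        score = cosine_similarity(test_vec, event[service])
        if score < threshold:
            flagged[service] = score
    return flagged


# Hypothetical per-service embeddings for the same minute offset.
load_test = {"playback": [0.9, 0.1], "drm": [0.2, 0.8], "catalog": [0.5, 0.5]}
event = {"playback": [0.8, 0.2], "drm": [0.9, 0.1], "catalog": [0.5, 0.5]}

flagged = flag_services(load_test, event)  # only "drm" falls below the threshold
```

Cosine similarity compares direction rather than magnitude, so a service is flagged when the shape of its embedding changes, not merely when its values scale up or down.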

Evaluation and Results

To validate the effectiveness of our anomaly detection system, we introduced a preliminary synthetic anomaly injection framework for controlled evaluation. The results demonstrate:

  • Precision: 96%, meaning that nearly all services the system flagged were true (injected) anomalies.
  • Low False Positive Rate: 0.08%, meaning that very few normal services were incorrectly flagged.
  • Recall: 58%, meaning that roughly 42% of injected anomalies were missed. Recall remains limited under conservative propagation assumptions and is the main area for improvement.
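For readers less familiar with these metrics, the sketch below shows how each one is computed from confusion-matrix counts. The counts are illustrative only and are not taken from the paper; they were chosen to show how high precision and a very low false positive rate can coexist with modest recall when normal cases far outnumber anomalous ones.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and false positive rate from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
    }


# Illustrative counts only, not results from the paper.
metrics = detection_metrics(tp=30, fp=2, fn=20, tn=1948)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Precision is computed over flagged services, recall over truly anomalous services, and the false positive rate over truly normal services, which is why the three can diverge so sharply.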

These metrics illustrate the practical utility of our approach within Prime Video and provide insights for future enhancements.

Conclusion

Our graph-based anomaly detection system not only addresses the limitations of traditional load testing but also lays the groundwork for broader applications across microservice ecosystems. As streaming services continue to evolve, the need for robust anomaly detection mechanisms becomes increasingly critical. We aim to further refine our approach and explore its implementation in diverse environments.

Future Directions

Looking ahead, we plan to improve recall and explore the integration of additional features and datasets. Our findings offer methodological lessons that can inform future research and applications in microservices and beyond.


Lazarus Omolua (https://richlyai.com/blog)