AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
Summary: arXiv:2604.03425v1 Announce Type: cross
Abstract: Fully Homomorphic Encryption (FHE) enables privacy-preserving Transformer inference, but long-sequence encrypted Transformers quickly exceed single-GPU memory capacity because encoded weights are already large and encrypted activations grow rapidly with sequence length. Multi-GPU execution therefore becomes unavoidable, yet scaling remains challenging because communication is jointly induced by application-level aggregation and encryption-level RNS coupling. Existing approaches either synchronize between devices frequently or replicate encrypted tensors across devices, leading to excessive communication and latency.
Fully Homomorphic Encryption (FHE) allows computation directly on encrypted data, which makes it a candidate for privacy-preserving machine learning inference. Long-sequence encrypted Transformers, however, run into single-GPU memory limits: encoded weights are already large, and encrypted activations grow with sequence length, so multi-GPU execution becomes necessary.
To address these challenges, the authors introduce AEGIS, an Application-Encryption Guided Inference System for scalable long-sequence encrypted Transformer inference on multi-GPU platforms. AEGIS derives device placement from ciphertext dependencies that arise jointly from the Transformer dataflow and CKKS polynomial coupling. It co-locates modulus-coherent and token-coherent data, so communication is introduced only when application dependencies require it.
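The encryption-level coupling can be illustrated with a minimal sketch in plain Python. It assumes the coupling refers to the residue number system (RNS) representation commonly used in CKKS implementations, where each large coefficient is stored as residues ("limbs") modulo several primes. The moduli and function names below are hypothetical toy choices, not AEGIS code. The sketch shows that multiplication acts on each limb independently, whereas moving to a new modulus (as rescaling and key switching do in RNS-based CKKS) needs every limb at once.

```python
from math import prod

# Toy pairwise-coprime moduli (hypothetical); real CKKS limbs use much larger primes.
MODULI = (97, 101, 103)

def to_rns(x, moduli=MODULI):
    """Split an integer into one residue ("limb") per modulus."""
    return tuple(x % q for q in moduli)

def rns_mul(a, b, moduli=MODULI):
    """Limb-wise product: limb i depends only on limb i of each operand."""
    return tuple((ai * bi) % q for ai, bi, q in zip(a, b, moduli))

def from_rns(limbs, moduli=MODULI):
    """CRT reconstruction: every limb contributes to the result."""
    big_q = prod(moduli)
    total = 0
    for r, q in zip(limbs, moduli):
        q_hat = big_q // q
        total += r * q_hat * pow(q_hat, -1, q)  # modular inverse, Python 3.8+
    return total % big_q

def base_extend(limbs, new_modulus, moduli=MODULI):
    """Compute the residue for a new modulus; this needs all existing limbs."""
    return from_rns(limbs, moduli) % new_modulus

x, y = 1234, 567
product_limbs = rns_mul(to_rns(x), to_rns(y))
print(from_rns(product_limbs) == x * y)        # True (x * y is below the modulus product)
print(base_extend(to_rns(x), 107) == x % 107)  # True
```

If limbs of one ciphertext were spread across GPUs, `rns_mul` would need no communication, but `base_extend` would require gathering limbs from every device. Under this reading, keeping limbs that are combined together on the same device avoids that gather, which is one plausible interpretation of "modulus-coherent" placement.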
Key results reported for AEGIS:
- Reduces inter-GPU communication by up to 57.9% in feed-forward networks and 81.3% in self-attention.
- Achieves up to 96.62% scaling efficiency on four GPUs.
- Delivers a reported 3.86x end-to-end speedup relative to prior methods.
- Reduces per-device memory requirements by 69.1%.
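As a quick consistency check on these figures, the sketch below applies the standard strong-scaling definition, efficiency = speedup / number of devices. It assumes the speedup is measured against a single-GPU run, which is an assumption here because the text describes the baseline only loosely. The helper names are illustrative.

```python
def scaling_efficiency(speedup, num_devices):
    """Strong-scaling efficiency: fraction of ideal linear speedup achieved."""
    return speedup / num_devices

def implied_speedup(efficiency, num_devices):
    """Speedup implied by a given efficiency on num_devices devices."""
    return efficiency * num_devices

def ideal_memory_reduction(num_devices):
    """Per-device reduction if memory were split perfectly evenly."""
    return 1 - 1 / num_devices

print(f"{scaling_efficiency(3.86, 4):.4f}")  # 0.9650
print(f"{implied_speedup(0.9662, 4):.4f}")   # 3.8648
print(f"{ideal_memory_reduction(4):.2f}")    # 0.75
```

Under that assumption, 96.62% efficiency on four GPUs corresponds to a speedup of about 3.86x, so the two figures agree. The reported 69.1% per-device memory reduction also sits below the 75% that a perfectly even four-way split would give, as expected when some data is not partitioned.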
These results indicate that coordinating application-level and encryption-level parallelism is effective. AEGIS also reorders polynomial operators so that the remaining collective operations overlap with computation, which hides part of their latency. The authors present this as a practical foundation for scalable homomorphic Transformer inference and for broader use of privacy-preserving machine learning models.
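The effect of overlapping a collective with computation can be sketched with a toy two-stream timeline model. This is not the AEGIS scheduler: the operator names, durations, and the `makespan` helper are hypothetical, and each stream is assumed to run its operators strictly in issue order.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    stream: str       # "compute" or "comm"
    duration: float   # arbitrary time units
    deps: tuple = ()  # names of ops that must finish first

def makespan(ops):
    """Total time when each stream runs its ops in issue order."""
    stream_free = {}  # time at which each stream becomes idle
    finish = {}       # finish time of each op by name
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0.0)
        start = max(ready, stream_free.get(op.stream, 0.0))
        finish[op.name] = start + op.duration
        stream_free[op.stream] = finish[op.name]
    return max(finish.values())

local_a = Op("local_a", "compute", 4)
local_b = Op("local_b", "compute", 3)  # independent of the collective
allreduce = Op("allreduce", "comm", 5, deps=("local_a",))
consume = Op("consume", "compute", 2, deps=("allreduce",))

# Naive order: the compute stream stalls on the collective before local_b runs.
naive = [local_a, allreduce, consume, local_b]
# Reordered: the independent op is issued while the collective is in flight.
reordered = [local_a, allreduce, local_b, consume]

print(makespan(naive))      # 14.0
print(makespan(reordered))  # 11.0
```

In this toy schedule, moving the independent operator ahead of the consumer hides three of the five time units spent in the collective. The same total work is done in both orders; only the issue order changes.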
As demand for privacy in data processing grows, systems like AEGIS become more relevant. Beyond addressing the immediate memory and communication challenges of long-sequence encrypted Transformers, the work offers a reference point for future research on secure and efficient machine learning systems in which models are deployed without exposing sensitive information.
In conclusion, AEGIS is a notable step toward scalable and efficient homomorphic encrypted inference. By reducing communication overhead while maintaining high computational efficiency, it could change how encrypted data is processed in AI systems.
