Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
arXiv:2604.15357v1 (cross-listed)
Abstract: Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However, the ubiquity of Dynamic Voltage and Frequency Scaling (DVFS) renders traditional static profiling invalid in real-world deployments, as inference latency fluctuates with varying processor (CPU and GPU) frequencies.
While exhaustive profiling across frequency combinations is theoretically possible, it is prohibitively expensive, particularly for emerging Small Language Models (SLMs), where variable context lengths stretch profiling time to days. We observe that simple analytic scaling fails to predict these fluctuations because of the complex asynchronous coupling between the CPU (kernel launching) and the GPU (kernel execution).
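To see why a single scaling factor falls short, consider a toy pipeline in which the CPU launches n identical kernels back to back and the GPU runs each kernel once it has been launched and the previous kernel has finished. This is an illustrative assumption, not FLAME's model; the function names and cycle counts are hypothetical.

```python
def pipelined_latency(n, launch_cycles, exec_cycles, f_cpu, f_gpu):
    """Latency of n identical kernels under asynchronous launch (toy model)."""
    t_launch = launch_cycles / f_cpu   # CPU time to issue one kernel
    t_exec = exec_cycles / f_gpu       # GPU time to run one kernel
    # The first launch and the last execution cannot overlap with anything;
    # in between, the slower of the two stages sets the pace.
    return t_launch + t_exec + (n - 1) * max(t_launch, t_exec)


def naive_scaled_latency(t_ref, f_gpu_ref, f_gpu):
    """Simple analytic scaling: latency inversely proportional to GPU frequency."""
    return t_ref * f_gpu_ref / f_gpu


# Hypothetical workload: 100 kernels, 2 launch cycles and 3 execution cycles each.
t_ref = pipelined_latency(100, 2.0, 3.0, f_cpu=1.0, f_gpu=1.0)   # GPU-bound
t_fast = pipelined_latency(100, 2.0, 3.0, f_cpu=1.0, f_gpu=2.0)  # launch-bound
t_naive = naive_scaled_latency(t_ref, 1.0, 2.0)
print(t_ref, t_fast, t_naive)  # 302.0 201.5 151.0
```

Doubling the GPU frequency moves the bottleneck to kernel launching, so the toy pipeline finishes in 201.5 time units rather than the 151 that proportional scaling predicts.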
Introduction to FLAME
In this paper, we introduce FLAME, a tool that estimates inference latency across CPU and GPU frequency combinations. FLAME is built to handle the asynchronous CPU-GPU coupling that defeats simple analytic scaling. Its key features include:
- Layer-wise Modeling: FLAME models each layer separately and quantifies the overlapping parallelism between CPU kernel launching and GPU execution.
- Dynamic Pipeline Bubbles: When extending from individual layers to the full model, FLAME aggregates the dynamic pipeline bubbles caused by asynchronous processor interactions.
- Generalizability: FLAME’s bottom-up methodology ensures its effectiveness across a wide range of model architectures, from Deep Neural Networks (DNNs) to Small Language Models (SLMs).
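The layer-wise idea can be sketched with a small event model. This is a minimal sketch under stated assumptions, not FLAME's actual formulation: each layer is a list of kernels with hypothetical launch and execution cycle counts, a bubble is taken to be the time the GPU sits idle waiting for the next launch, and the `Layer` class and `estimate_model_latency` function are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    name: str
    launch_cycles: List[float]  # CPU cycles to launch each kernel (hypothetical)
    exec_cycles: List[float]    # GPU cycles to execute each kernel (hypothetical)


def estimate_model_latency(
    layers: List[Layer], f_cpu: float, f_gpu: float
) -> Tuple[float, List[Tuple[str, float, float]]]:
    """Return total latency and per-layer (name, gpu_busy, bubble) terms."""
    cpu_t = 0.0  # time at which the CPU finishes issuing the latest launch
    gpu_t = 0.0  # time at which the GPU finishes the latest kernel
    report = []
    for layer in layers:
        busy = bubble = 0.0
        for lc, ec in zip(layer.launch_cycles, layer.exec_cycles):
            cpu_t += lc / f_cpu
            gap = max(0.0, cpu_t - gpu_t)  # GPU idle while waiting for the launch
            run = ec / f_gpu
            gpu_t = max(cpu_t, gpu_t) + run
            bubble += gap
            busy += run
        report.append((layer.name, busy, bubble))
    # Bottom-up aggregation: GPU busy time plus accumulated bubbles.
    total = sum(b + g for _, b, g in report)
    return total, report


model = [Layer("conv", [2.0, 2.0], [3.0, 3.0]), Layer("fc", [4.0], [1.0])]
for f_gpu in (1.0, 2.0):
    total, report = estimate_model_latency(model, f_cpu=1.0, f_gpu=f_gpu)
    print(f_gpu, total, report)
```

In this toy model, the bubble a layer sees depends on how far the CPU ran ahead during earlier layers, which is why the state (`cpu_t`, `gpu_t`) is carried across layers instead of estimating each layer in isolation.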
Efficiency and Accuracy
FLAME substantially reduces profiling time while keeping estimates accurate. According to the paper, it:
- Reduces DNN profiling time from hours to minutes.
- Reduces SLM profiling time from days to minutes.
- Keeps estimation errors small across frequency combinations.
Shorter profiling lets developers and researchers iterate faster on model development and deployment.
Utility in Deadline-aware DVFS
Beyond profiling, FLAME is also useful for deadline-aware DVFS. In the paper's evaluation, it outperforms existing state-of-the-art methods in both power efficiency and latency guarantees.
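One way a latency estimator can drive deadline-aware DVFS is sketched below: among all frequency pairs whose estimated latency meets the deadline, pick the one with the lowest estimated power. This is a hedged illustration, not the paper's controller; `pick_frequencies`, the `margin` parameter, and the latency and power callables are hypothetical stand-ins supplied by the caller.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def pick_frequencies(
    cpu_freqs: Sequence[float],
    gpu_freqs: Sequence[float],
    estimate_latency: Callable[[float, float], float],
    estimate_power: Callable[[float, float], float],
    deadline: float,
    margin: float = 0.0,
) -> Optional[Tuple[float, float]]:
    """Return the lowest-power (f_cpu, f_gpu) meeting the deadline, or None."""
    best = None
    for f_cpu, f_gpu in product(cpu_freqs, gpu_freqs):
        # Inflate the estimate by a safety margin to absorb estimation error.
        if estimate_latency(f_cpu, f_gpu) * (1.0 + margin) > deadline:
            continue
        power = estimate_power(f_cpu, f_gpu)
        if best is None or power < best[0]:
            best = (power, f_cpu, f_gpu)
    return None if best is None else (best[1], best[2])


# Toy stand-ins (assumptions): a 10-kernel launch/execute pipeline for latency,
# and a cubic frequency term for power.
def toy_latency(f_cpu: float, f_gpu: float) -> float:
    t_launch, t_exec = 2.0 / f_cpu, 3.0 / f_gpu
    return t_launch + t_exec + 9 * max(t_launch, t_exec)


def toy_power(f_cpu: float, f_gpu: float) -> float:
    return f_cpu ** 3 + 2.0 * f_gpu ** 3


print(pick_frequencies([1.0, 2.0], [1.0, 2.0], toy_latency, toy_power, deadline=22.0))
```

The exhaustive search is only practical because each estimate is cheap; the more accurate the estimator, the smaller the safety margin needs to be, which is where the power savings come from.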
As mobile edge applications become increasingly prevalent, accurate latency estimation will play a critical role in optimizing performance and resource management. FLAME stands out as a promising solution that addresses the intricate challenges of CPU-GPU coupling, paving the way for more efficient mobile edge computing.
