ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
arXiv:2512.09427v3
Abstract
Existing memory management techniques severely hinder efficient Large Language Model (LLM) serving on accelerators constrained by poor random-access bandwidth. Static pre-allocation preserves memory contiguity, but it incurs significant overhead because every request is provisioned for the worst case. Fine-grained paging mitigates this overhead, but it relies on the high random-access tolerance of High Bandwidth Memory (HBM), which makes it unsuitable for LPDDR systems where non-sequential access rapidly degrades bandwidth.
Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series.
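The cost of worst-case provisioning can be made concrete with a small calculation. The sketch below is illustrative only: the function names, the model dimensions, and the request lengths are hypothetical and are not taken from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes for one token: one key and one value vector
    per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def static_utilization(actual_lengths, max_len):
    """Fraction of reserved token slots actually used when every
    request reserves `max_len` slots up front (static pre-allocation)."""
    reserved = len(actual_lengths) * max_len
    return sum(actual_lengths) / reserved


# Hypothetical workload: four requests, each provisioned for 2048 tokens.
lengths = [256, 512, 256, 1024]
print(static_utilization(lengths, max_len=2048))  # 0.25

# Hypothetical model shape (assumed values, 16-bit elements).
print(kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128))  # 57344
```

In this made-up workload three quarters of the reserved KV-cache stays idle, which is the kind of overhead that on-demand allocation aims to recover.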
Introduction to ODMA
ODMA advances generation-length prediction by addressing two limitations that arise in production workloads:
- Distribution Drift: The distribution of generation lengths shifts over time, which invalidates static bucket boundaries and leads to inefficient memory usage.
- Performance Fragility: Heavy-tailed request patterns, in which a small share of requests is much longer than the rest, can significantly degrade performance.
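The effect of drift on static buckets can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions: the boundaries, the request lengths, and the helper names `allocated_size` and `utilization` are hypothetical, and each request is assumed to receive the smallest bucket that fits it.

```python
import bisect


def allocated_size(length, boundaries):
    """Smallest bucket boundary that fits `length`.
    `boundaries` must be sorted in ascending order."""
    i = bisect.bisect_left(boundaries, length)
    if i == len(boundaries):
        raise ValueError("length exceeds the largest bucket")
    return boundaries[i]


def utilization(lengths, boundaries):
    """Used token slots divided by allocated token slots."""
    allocated = sum(allocated_size(n, boundaries) for n in lengths)
    return sum(lengths) / allocated


# Static boundaries tuned for the original traffic (hypothetical values).
boundaries = [128, 256, 512, 1024]
before_drift = [120, 250, 500, 1000]
after_drift = [130, 260, 520, 1000]  # lengths shift slightly upward

print(round(utilization(before_drift, boundaries), 3))  # 0.974
print(round(utilization(after_drift, boundaries), 3))   # 0.678
```

A small upward shift pushes three of the four requests into the next bucket, so utilization falls sharply even though the requests themselves barely changed.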
Key Features of ODMA
ODMA combines three mechanisms to improve memory allocation efficiency:
- Lightweight Length Predictor: Predicts each request's generation length, and hence its memory requirement, at low overhead.
- Adaptive Bucket Partitioning: Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization.
- Fallback Safety Pool: Provides robustness against prediction errors and unexpected workload changes.
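The interplay of these components can be sketched as follows. This is not the paper's implementation: the class names, the equal-mass rule for placing boundaries, the exponential decay, and the slot-counting fallback pool are all assumptions chosen to keep the example small, and the length predictor is represented only by the `predicted_len` argument.

```python
import bisect


class OnlineHistogram:
    """Fixed-width histogram of observed generation lengths.
    Old observations are decayed so recent traffic dominates (assumed policy)."""

    def __init__(self, max_len, bin_width=64, decay=0.99):
        self.bin_width = bin_width
        self.decay = decay
        self.counts = [0.0] * ((max_len + bin_width - 1) // bin_width)

    def observe(self, length):
        self.counts = [c * self.decay for c in self.counts]
        idx = min((length - 1) // self.bin_width, len(self.counts) - 1)
        self.counts[max(idx, 0)] += 1.0

    def boundaries(self, num_buckets):
        """Upper bin edges that split the observed mass into roughly equal parts."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations yet")
        edges, cum, k = [], 0.0, 1
        for i, c in enumerate(self.counts):
            cum += c
            while k < num_buckets and cum >= k * total / num_buckets:
                edges.append((i + 1) * self.bin_width)
                k += 1
        # The last bucket always covers the maximum length.
        edges.append(len(self.counts) * self.bin_width)
        return sorted(set(edges))


class ContiguousAllocator:
    """Reserves one contiguous region per request, sized by its bucket,
    and draws on a fallback pool when a request outgrows that region."""

    def __init__(self, boundaries, fallback_slots):
        self.boundaries = boundaries
        self.fallback_free = fallback_slots

    def reserve(self, predicted_len):
        """Size of the region reserved for a predicted generation length."""
        i = bisect.bisect_left(self.boundaries, predicted_len)
        return self.boundaries[min(i, len(self.boundaries) - 1)]

    def overflow(self, extra_slots):
        """Try to cover a misprediction from the fallback pool.
        Returns False if the pool cannot cover it."""
        if extra_slots > self.fallback_free:
            return False
        self.fallback_free -= extra_slots
        return True


hist = OnlineHistogram(max_len=1024, bin_width=64, decay=1.0)
for n in [50, 60, 100, 120, 200, 250, 900, 1000]:
    hist.observe(n)
alloc = ContiguousAllocator(hist.boundaries(4), fallback_slots=100)
print(alloc.boundaries)    # [64, 128, 256, 1024]
print(alloc.reserve(100))  # 128
```

Recomputing `boundaries` periodically from the decayed histogram is one simple way to track drift; how often to recalibrate and how large to make the fallback pool are tuning choices that this sketch leaves open.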
Performance Improvements
On the Alpaca and Google-NQ benchmarks, ODMA improves the accuracy of S3's predictor:
- Alpaca: from 98.60% to 99.55%.
- Google-NQ: from 82.68% to 93.36%.
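These figures can be read either as absolute gains in percentage points or as reductions in the remaining error. The short calculation below derives both from the reported accuracies; it assumes only that the error rate is 100% minus the accuracy, and the helper name `accuracy_gain` is hypothetical.

```python
def accuracy_gain(before_pct, after_pct):
    """Return (absolute gain in percentage points,
    relative reduction of the error rate in percent)."""
    abs_gain = after_pct - before_pct
    err_before = 100.0 - before_pct
    rel_err_reduction = abs_gain / err_before * 100.0
    return round(abs_gain, 2), round(rel_err_reduction, 1)


print(accuracy_gain(98.60, 99.55))  # Alpaca:    (0.95, 67.9)
print(accuracy_gain(82.68, 93.36))  # Google-NQ: (10.68, 61.7)
```

On Alpaca the absolute gain is under one percentage point, yet in both cases roughly two thirds of the previous mispredictions are removed.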
Deployment and Results
Deployed with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators, ODMA improves two serving metrics:
- KV-cache Utilization: An increase of up to 19.25 percentage points (absolute).
- Throughput (TPS): An increase of 23-27% over static baselines.
Conclusion
ODMA shows that predictor-driven contiguous allocation is effective on LPDDR-class devices. By combining a lightweight length predictor, adaptive bucket partitioning, and a fallback safety pool, it improves both memory utilization and throughput for LLM serving on accelerators with limited random-access bandwidth.
