ODMA: Efficient Memory Allocation for LLMs on LPDDR Accelerators

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

arXiv:2512.09427v3

Abstract

Existing memory management techniques severely hinder efficient Large Language Model (LLM) serving on accelerators constrained by poor random-access bandwidth. Static pre-allocation preserves memory contiguity, but it incurs significant memory overhead because every request is provisioned for the worst case. Fine-grained paging mitigates this overhead, but it depends on the high random-access tolerance of High Bandwidth Memory (HBM), which makes it unsuitable for LPDDR systems, where non-sequential access rapidly degrades bandwidth.

Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series.
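
To make the cost of worst-case provisioning concrete, the following minimal sketch computes how much of a statically reserved KV cache ends up holding real tokens. The function name, the maximum sequence length, and the request lengths are illustrative assumptions, not figures from the paper.

```python
def kv_utilization(actual_lengths, reserved_per_request):
    """Fraction of reserved KV-cache token slots that hold real tokens."""
    used = sum(actual_lengths)
    reserved = reserved_per_request * len(actual_lengths)
    return used / reserved


# Hypothetical workload: total tokens (prompt + generation) per request.
lengths = [120, 300, 80, 2048, 512]
MAX_SEQ_LEN = 2048  # assumed worst-case slot size under static pre-allocation

static_util = kv_utilization(lengths, MAX_SEQ_LEN)
print(f"static pre-allocation utilization: {static_util:.1%}")
```

In this made-up workload a single long request forces every slot to be sized at 2048 tokens, so most of the reserved memory stays empty. Closing that gap without giving up contiguity is the problem ODMA targets.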

Introduction to ODMA

ODMA advances generation-length prediction by addressing two critical limitations in production workloads:

Distribution Drift: The distribution of generation lengths shifts over time, which invalidates static bucket boundaries and leads to inefficient memory usage.
Performance Fragility: Heavy-tailed request patterns can significantly degrade performance.
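
Both limitations can be illustrated with a small sketch. It rounds each request up to the smallest fixed bucket that fits and measures the unused share of reserved memory before and after the workload shifts. The bucket boundaries, the request lengths, and the helper names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def reserved_tokens(length, boundaries):
    """Round a request up to the smallest bucket that fits it.

    Returns None when the request is longer than the largest bucket.
    """
    i = bisect_left(boundaries, length)
    return boundaries[i] if i < len(boundaries) else None


def waste_ratio(lengths, boundaries):
    """Unused share of reserved tokens, plus the count of unserved requests."""
    reserved = used = unserved = 0
    for n in lengths:
        slot = reserved_tokens(n, boundaries)
        if slot is None:
            unserved += 1
            continue
        reserved += slot
        used += n
    waste = 1 - used / reserved if reserved else 0.0
    return waste, unserved


STATIC_BOUNDARIES = [128, 256, 512, 1024]  # hypothetical, tuned on old traffic

before_drift = [100, 120, 250, 500, 1000]        # lengths sit just under boundaries
after_drift = [130, 140, 260, 520, 1030, 4000]   # small shift plus one tail request

waste_before, unserved_before = waste_ratio(before_drift, STATIC_BOUNDARIES)
waste_after, unserved_after = waste_ratio(after_drift, STATIC_BOUNDARIES)
print(f"before drift: waste={waste_before:.1%}, unserved={unserved_before}")
print(f"after drift:  waste={waste_after:.1%}, unserved={unserved_after}")
```

A shift of only a few tokens pushes every request into the next larger bucket, and the two longest requests no longer fit any bucket at all. That is the kind of behavior that fixed boundaries cannot absorb.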

Key Features of ODMA

ODMA combines three components:

Lightweight Length Predictor: Predicts each request's generation length, so that memory can be reserved according to expected need rather than the worst case.
Adaptive Bucket Partitioning: Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization.
Fallback Safety Pool: Provides robustness against prediction errors, acting as a safety net when the workload changes unexpectedly.
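
One way the second and third components could fit together is sketched below. This is not the paper's implementation: the `OnlineHistogram` class, the equal-mass quantile rule for choosing boundaries, and the `serve` function that charges any overrun to a shared reserve are all assumptions made for illustration.

```python
from bisect import bisect_left


class OnlineHistogram:
    """Fixed-width histogram of observed generation lengths."""

    def __init__(self, max_len, bin_width):
        self.bin_width = bin_width
        self.counts = [0] * (-(-max_len // bin_width))  # ceiling division

    def observe(self, length):
        # Lengths beyond max_len are clamped into the last bin.
        idx = min((max(length, 1) - 1) // self.bin_width, len(self.counts) - 1)
        self.counts[idx] += 1

    def quantile_boundaries(self, num_buckets):
        """Upper bin edges that split the observed mass into roughly equal parts."""
        total = sum(self.counts)
        if total == 0:
            return []
        targets = [total * (k + 1) / num_buckets for k in range(num_buckets)]
        boundaries, running, t = [], 0, 0
        for i, count in enumerate(self.counts):
            running += count
            while t < num_buckets and running >= targets[t]:
                edge = (i + 1) * self.bin_width
                if not boundaries or boundaries[-1] != edge:
                    boundaries.append(edge)
                t += 1
        return boundaries


def serve(predicted_len, actual_len, boundaries, pool_free):
    """Return (slot_size, tokens_drawn_from_safety_pool) for one request.

    Assumption: an overrun beyond the predicted slot is charged to the pool.
    """
    i = bisect_left(boundaries, predicted_len)
    slot = boundaries[i] if i < len(boundaries) else 0
    overflow = max(actual_len - slot, 0)
    if overflow > pool_free:
        raise MemoryError("safety pool exhausted")
    return slot, overflow


hist = OnlineHistogram(max_len=2048, bin_width=64)
for n in [40, 50, 60, 200, 210, 220, 900, 950, 1000]:  # hypothetical lengths
    hist.observe(n)

boundaries = hist.quantile_boundaries(num_buckets=3)
print(boundaries)
print(serve(predicted_len=200, actual_len=300, boundaries=boundaries, pool_free=512))
```

Recomputing `quantile_boundaries` periodically lets the bucket sizes follow the observed lengths, while the pool absorbs requests whose generation runs past the predicted slot.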

Performance Improvements

On the Alpaca and Google-NQ benchmarks, ODMA improves on the prediction accuracy of S3:

Alpaca: from 98.60% to 99.55%.
Google-NQ: from 82.68% to 93.36%.
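
The reported accuracies can also be read as a drop in misprediction rate, which makes the Alpaca gain easier to appreciate. The short calculation below uses only the four percentages quoted above. The function name is hypothetical, and treating 100% minus accuracy as the misprediction rate is an assumption about how accuracy is defined.

```python
def accuracy_gain(before_pct, after_pct):
    """Absolute gain in points and relative drop in misprediction rate."""
    abs_gain = after_pct - before_pct
    err_before = 100.0 - before_pct
    err_after = 100.0 - after_pct
    rel_error_drop = (err_before - err_after) / err_before
    return abs_gain, rel_error_drop


results = {}
for name, before, after in [("Alpaca", 98.60, 99.55), ("Google-NQ", 82.68, 93.36)]:
    results[name] = accuracy_gain(before, after)
    gain, drop = results[name]
    print(f"{name}: +{gain:.2f} points, misprediction rate down {drop:.1%}")
```

Under that reading, the gain of 0.95 points on Alpaca removes roughly 68% of the remaining mispredictions, and the gain of 10.68 points on Google-NQ removes roughly 62%.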

Deployment and Results

Deployed with DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators, ODMA improves two serving metrics:

KV-cache Utilization: An increase of up to 19.25 percentage points (absolute).
Throughput (TPS): An increase of 23-27% over static baselines.

Conclusion

ODMA validates the efficacy of predictor-driven contiguous allocation for LPDDR-class devices. By combining a lightweight length predictor with adaptive bucket partitioning and a fallback safety pool, it improves both memory utilization and throughput on hardware where random access is expensive.
