ODMA: Efficient Memory Allocation for LLMs on LPDDR Accelerators

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators

arXiv:2512.09427v3

Abstract

Existing memory management techniques hinder efficient Large Language Model (LLM) serving on accelerators with poor random-access bandwidth. Static pre-allocation preserves memory contiguity, but it wastes memory because every request is provisioned for the worst case. Fine-grained paging reduces this waste, but it relies on the tolerance of High Bandwidth Memory (HBM) for random access, which makes it unsuitable for LPDDR systems where non-sequential access rapidly degrades bandwidth.

Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series.
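
The cost of worst-case provisioning can be made concrete with a little arithmetic. The sketch below is illustrative only: the model shape (28 layers, 4 key/value heads, head dimension 128, 16-bit values), the request lengths, and both function names are assumptions chosen for the example, not figures from the paper.

```python
# Hypothetical model shape, chosen for illustration (not taken from the paper).
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit keys and values


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, dtype_bytes=BYTES_PER_VALUE):
    """KV-cache bytes needed for one token of one sequence."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def static_utilization(actual_lengths, max_len):
    """Fraction of reserved KV-cache that is actually used when every
    request reserves a contiguous region of max_len tokens."""
    reserved = max_len * len(actual_lengths)
    used = sum(min(n, max_len) for n in actual_lengths)
    return used / reserved


per_token = kv_bytes_per_token()          # 57344 bytes (56 KiB) per token
lengths = [64, 128, 256, 2048]            # assumed generation lengths
util = static_utilization(lengths, 2048)  # 2496 / 8192 = 0.3046875

print(f"{per_token} bytes per token, utilization {util:.1%}")
```

Under these assumed numbers, a single long request forces every reservation to 2048 tokens, and about 70% of the reserved KV-cache is never written.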

Introduction to ODMA

ODMA advances generation-length prediction by addressing two critical limitations in production workloads:

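  • Distribution Drift: As the distribution of generation lengths shifts over time, static bucket boundaries stop matching the workload, and memory is used inefficiently.
  • Performance Fragility: Under heavy-tailed request patterns, a small number of very long generations can significantly degrade performance.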
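
A toy calculation shows why drift hurts fixed buckets. In this sketch, each request reserves the smallest bucket that fits its generation length, and a perfect predictor is assumed so that only the effect of the boundaries is visible. The boundaries, the two length samples, and the function names are hypothetical.

```python
import bisect


def bucket_capacity(length, boundaries):
    """Smallest bucket boundary that can hold `length` tokens."""
    i = bisect.bisect_left(boundaries, length)
    if i == len(boundaries):
        raise ValueError("length exceeds the largest bucket")
    return boundaries[i]


def utilization(lengths, boundaries):
    """Used tokens divided by reserved tokens across all requests."""
    reserved = sum(bucket_capacity(n, boundaries) for n in lengths)
    return sum(lengths) / reserved


STATIC_BOUNDARIES = [128, 512, 2048]      # tuned once, never updated

before_drift = [100, 120, 110, 500, 480]  # lengths sit just under boundaries
after_drift = [140, 150, 160, 600, 620]   # lengths shift slightly upward

util_before = utilization(before_drift, STATIC_BOUNDARIES)  # 1310 / 1408
util_after = utilization(after_drift, STATIC_BOUNDARIES)    # 1670 / 5632

print(f"before drift: {util_before:.1%}, after drift: {util_after:.1%}")
```

A modest upward shift pushes every request into the next larger bucket, and utilization in this example falls from roughly 93% to under 30%.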

Key Features of ODMA

ODMA combines three components:

  • Lightweight Length Predictor: Predicts the generation length of each request, which determines how much contiguous memory to reserve for it.
  • Adaptive Bucket Partitioning: Bucket boundaries are dynamically recalibrated via online histograms to maximize utilization.
  • Fallback Safety Pool: Provides robustness against prediction errors, so that a request whose generation exceeds its predicted length can still be served.
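
One way these pieces could fit together is sketched below. This is not the paper's implementation: the class names, the decayed fixed-width histogram, the equal-mass rule for placing boundaries, and the token-counting fallback pool are all assumptions made for illustration. The length predictor itself is left out; its output is simply passed in as `predicted_length`.

```python
import bisect


class AdaptiveBuckets:
    """Hypothetical sketch: derive bucket boundaries from an online
    histogram of observed generation lengths."""

    def __init__(self, num_buckets=4, max_len=4096, bin_width=64, decay=0.99):
        self.num_buckets = num_buckets
        self.max_len = max_len
        self.bin_width = bin_width
        self.decay = decay                      # < 1.0 favors recent traffic
        self.counts = [0.0] * (max_len // bin_width)
        self.boundaries = [max_len]             # start with one worst-case bucket

    def observe(self, length):
        """Record the final generation length of a finished request."""
        self.counts = [c * self.decay for c in self.counts]
        length = max(1, min(length, self.max_len))
        self.counts[(length - 1) // self.bin_width] += 1.0

    def recalibrate(self):
        """Place boundaries so each bucket covers a roughly equal share
        of the recently observed requests."""
        total = sum(self.counts)
        if total == 0:
            return
        boundaries, running, k = [], 0.0, 1
        for i, count in enumerate(self.counts):
            running += count
            upper = (i + 1) * self.bin_width    # upper edge of histogram bin i
            while k < self.num_buckets and running >= k * total / self.num_buckets:
                if not boundaries or boundaries[-1] != upper:
                    boundaries.append(upper)
                k += 1
        if not boundaries or boundaries[-1] != self.max_len:
            boundaries.append(self.max_len)     # always keep a worst-case bucket
        self.boundaries = boundaries

    def capacity_for(self, predicted_length):
        """Contiguous reservation size (in tokens) for a predicted length."""
        i = bisect.bisect_left(self.boundaries, predicted_length)
        return self.boundaries[min(i, len(self.boundaries) - 1)]


class FallbackPool:
    """Hypothetical safety pool that covers under-predictions."""

    def __init__(self, free_tokens):
        self.free_tokens = free_tokens

    def borrow(self, reserved, needed):
        """Return the extra tokens granted when `needed` exceeds `reserved`."""
        extra = max(0, needed - reserved)
        if extra > self.free_tokens:
            raise MemoryError("fallback pool exhausted")
        self.free_tokens -= extra
        return extra


buckets = AdaptiveBuckets(num_buckets=2, max_len=1024)
for finished_length in [100, 100, 100, 900]:
    buckets.observe(finished_length)
buckets.recalibrate()

pool = FallbackPool(free_tokens=512)
reserved = buckets.capacity_for(90)             # reservation for a predicted length of 90
extra = pool.borrow(reserved, needed=200)       # generation ran longer than predicted
print(buckets.boundaries, reserved, extra)
```

Recomputing boundaries from a decayed histogram lets the buckets follow the workload as it drifts, while the pool bounds the damage from an occasional under-prediction. How ODMA actually sizes and refills its safety pool is not described in this summary.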

Performance Improvements

On the Alpaca and Google-NQ benchmarks, ODMA improves on S3's prediction accuracy:

  • Alpaca: from 98.60% to 99.55%.
  • Google-NQ: from 82.68% to 93.36%.

Deployment and Results

When serving DeepSeek-R1-Distill-Qwen-7B on Cambricon MLU370-X4 accelerators, ODMA improves two key metrics:

  • KV-cache Utilization: An increase of up to 19.25% (absolute).
  • Throughput (TPS): An increase of 23-27% over static baselines.

Conclusion

ODMA validates the efficacy of predictor-driven contiguous allocation for LPDDR-class devices. By combining generation-length prediction with adaptive bucket boundaries and a fallback safety pool, it improves both memory utilization and throughput on random-access-constrained accelerators.


Lazarus Omolua (https://richlyai.com/blog)