Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
arXiv:2603.29002v1
Type: cross
Abstract
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference.
Introduction
As demand grows for more sophisticated natural language processing applications, the efficiency of large language models has come under scrutiny. Profiling reveals a 22%-97% memory processing overhead during LLM inference, which suggests that existing architectures are not fully optimized for long-context workloads.
Memory Processing Pipeline
Our four-step memory processing pipeline gives a common structure for locating and addressing these overheads:
- Prepare Memory: Initialize and allocate necessary memory resources for processing.
- Compute Relevancy: Assess which parts of the memory are relevant for the current task.
- Retrieval: Fetch the most pertinent data from memory based on the relevancy computation.
- Apply to Inference: Integrate the retrieved data into the inference process to enhance performance.
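The four steps above can be sketched as a toy retrieval loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `VOCAB` word-count embedding, dot-product scoring, and top-k selection are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Memory = List[Tuple[str, Vector]]

# Hypothetical toy vocabulary; a real system would use a learned embedding.
VOCAB = ("gpu", "fpga", "energy")


def toy_embed(text: str) -> List[float]:
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def prepare_memory(chunks: Sequence[str], embed: Callable[[str], Vector]) -> Memory:
    """Step 1 (Prepare Memory): build the memory as (text, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def compute_relevancy(query_vec: Vector, memory: Memory) -> List[float]:
    """Step 2 (Compute Relevancy): dot-product score for every memory entry."""
    return [sum(q * k for q, k in zip(query_vec, vec)) for _, vec in memory]


def retrieve(scores: Sequence[float], memory: Memory, k: int) -> List[str]:
    """Step 3 (Retrieval): fetch the k highest-scoring entries."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [memory[i][0] for i in top]


def apply_to_inference(prompt: str, retrieved: Sequence[str]) -> str:
    """Step 4 (Apply to Inference): here, prepend retrieved text to the prompt."""
    return "\n".join([*retrieved, prompt])


def run_pipeline(chunks: Sequence[str], prompt: str, k: int = 1) -> str:
    """Chain the four steps for a single query."""
    memory = prepare_memory(chunks, toy_embed)
    scores = compute_relevancy(toy_embed(prompt), memory)
    return apply_to_inference(prompt, retrieve(scores, memory, k))
```

In this sketch, steps 2 and 3 scan and gather from the whole memory, while step 4 feeds the model itself. That difference in character is what the later sections refer to as heterogeneity.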
Systematic Profiling
Profiling the memory processing pipeline shows that it is a significant performance bottleneck in LLM inference, which makes memory management a primary optimization target. The pipeline steps also differ widely in their computational characteristics, which supports a tailored approach to hardware utilization.
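To see why the size of the memory processing share matters, the following sketch applies Amdahl's law. It assumes, as a simplification not stated in the source, that the overhead can be read as a fraction of total inference time. The function name `end_to_end_speedup` and the example stage speedup of 2 are illustrative, not measured values.

```python
def end_to_end_speedup(memory_fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: only the memory-processing share of runtime is accelerated.

    memory_fraction: assumed share of total runtime spent in memory processing.
    stage_speedup:   hypothetical speedup applied to that share alone.
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - memory_fraction) + memory_fraction / stage_speedup)


if __name__ == "__main__":
    # Compare the two ends of the reported 22%-97% range.
    for fraction in (0.22, 0.97):
        gain = end_to_end_speedup(fraction, stage_speedup=2.0)
        print(f"fraction={fraction:.2f} -> end-to-end speedup {gain:.2f}x")
```

Under this reading, the same stage-level acceleration yields a much larger end-to-end gain when memory processing dominates the runtime.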
Heterogeneous Systems
Motivated by these findings, we propose that heterogeneous systems, which combine multiple types of processing units, are well suited to improving memory processing efficiency. Our approach splits the workload: specific tasks are offloaded to Field-Programmable Gate Arrays (FPGAs), while compute-intensive operations stay on Graphics Processing Units (GPUs).
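One way to picture this split is a placement rule that keeps compute-intensive operations on the GPU and sends the rest to the FPGA. The sketch below is hypothetical: the source does not say how tasks are selected, so the `Op` type, the use of arithmetic intensity (FLOPs per byte moved) as the criterion, and the threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical operation descriptor."""
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to memory

    @property
    def intensity(self) -> float:
        """Arithmetic intensity in FLOPs per byte."""
        return self.flops / self.bytes_moved


def place(op: Op, threshold: float = 10.0) -> str:
    """Assumed rule: compute-intensive ops stay on the GPU, others go to the FPGA."""
    return "gpu" if op.intensity >= threshold else "fpga"


if __name__ == "__main__":
    ops = [
        Op("dense_matmul", flops=1e9, bytes_moved=1e6),  # high intensity
        Op("topk_gather", flops=1e6, bytes_moved=1e6),   # low intensity
    ]
    for op in ops:
        print(f"{op.name}: {place(op)}")
```

A real system would also weigh data transfer cost between devices, which this sketch ignores.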
Experimental Results
We evaluated our heterogeneous system on an AMD MI210 GPU paired with an Alveo U55C FPGA. Compared to the GPU baseline, the system is 1.04 to 2.2 times faster and consumes 1.11 to 4.7 times less energy. Similar results were observed with the NVIDIA A100, reinforcing the validity of our approach.
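Ratios of the form "N times faster" or "N times less energy" are easier to compare as percentage reductions. The helper below performs that conversion; the name `reduction_percent` is illustrative, and the only inputs used are the four factors reported above.

```python
def reduction_percent(factor: float) -> float:
    """Convert an 'N times less' factor into a percentage reduction.

    A factor of N means the new value is 1/N of the baseline,
    so the reduction is (1 - 1/N) * 100 percent.
    """
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


if __name__ == "__main__":
    for label, factor in [
        ("runtime, low end", 1.04),
        ("runtime, high end", 2.2),
        ("energy, low end", 1.11),
        ("energy, high end", 4.7),
    ]:
        print(f"{label}: {factor}x -> {reduction_percent(factor):.1f}% reduction")
```

The low and high ends of each range need not come from the same workload, so the runtime and energy figures should not be paired to derive a power ratio.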
Conclusion
Our research shows that heterogeneous systems can make memory processing in large language models more efficient. Managing memory operations effectively improves both the performance and the energy efficiency of LLM inference. These findings can inform future heterogeneous hardware design for AI workloads.
