Sandwich: Efficient CPU Serving for Large Language Models

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Summary: arXiv:2507.18454v2

Abstract: CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints. Existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges:

  • Seamless phase-wise plan switching to eliminate cross-phase interference.
  • TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation.
  • Fast-start-then-finetune dynamic-shape tensor program generation.

Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.

The Challenge of CPU LLM Serving

As large language models (LLMs) become more prevalent, the demand for efficient serving solutions continues to grow. CPUs are attractive for this work because they are widely available, cost-efficient, and usable at the edge. However, when the prefill and decode phases run on the same machine (a non-disaggregated deployment), their conflicting resource demands interfere with each other and hinder performance.
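To make the "conflicting resource demands" concrete, here is a rough back-of-the-envelope sketch, not taken from the paper. It estimates the arithmetic intensity (FLOPs per byte moved) of a single square weight-matrix multiply. The function name, the 4096 hidden size, and the 2-byte weights are illustrative assumptions; real models have many layers, attention, and caches that this ignores.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """Estimate FLOPs per byte for multiplying `tokens` rows by a d_model x d_model matrix.

    Counts 2 FLOPs per multiply-add, and counts memory traffic as the weight
    matrix plus the input and output activations, each touched once.
    """
    flops = 2 * tokens * d_model * d_model
    weight_values = d_model * d_model
    activation_values = 2 * tokens * d_model  # input rows + output rows
    bytes_moved = bytes_per_value * (weight_values + activation_values)
    return flops / bytes_moved


# Decode handles one new token per sequence; prefill handles the whole prompt.
decode = arithmetic_intensity(tokens=1, d_model=4096)
prefill = arithmetic_intensity(tokens=512, d_model=4096)
print(f"decode:  {decode:.2f} FLOPs/byte")
print(f"prefill: {prefill:.2f} FLOPs/byte")
```

Under these assumptions, decode performs about 1 FLOP per byte while a 512-token prefill performs about 410, so decode tends to be limited by memory bandwidth and prefill by compute. A core allocation tuned for one phase is therefore unlikely to suit the other.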

Existing serving systems struggle with this cross-phase interference. Many also ignore sub-NUMA hardware structures, such as last-level cache (LLC) slices, when assigning cores, and their kernels perform suboptimally when tensor shapes change at run time.

Introducing Sandwich

Sandwich aims to resolve these issues with a full-stack LLM serving system designed specifically for CPUs. It is built around three techniques:

  • Seamless Phase-wise Plan Switching: This feature allows for the dynamic adjustment of resource allocation during different phases of model serving, thereby eliminating cross-phase interference and optimizing resource utilization.
  • TopoTree: This tree-based hardware abstraction lets the system automate partial core allocation (using only some of the available cores) while staying aware of hardware substructures such as LLC slices.
  • Fast-start-then-finetune: The dynamic-shape tensor program generation technique employed by Sandwich allows for quick initial execution followed by fine-tuning. This approach significantly reduces the time and resources spent on kernel tuning, leading to improved overall performance.
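The first two techniques can be pictured with a small sketch. This is not Sandwich's implementation, which the summary above does not describe in detail. It is a minimal illustration, assuming a regular machine -> NUMA node -> LLC slice -> core tree and a made-up policy in which prefill uses every core and decode uses one core per LLC slice. All names (`TopoNode`, `build_tree`, `partial_allocation`, `PhaseSwitcher`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TopoNode:
    """One node of a hypothetical hardware tree: machine, numa, llc_slice, or core."""
    kind: str
    ident: int
    children: List["TopoNode"] = field(default_factory=list)

    def cores(self) -> List[int]:
        """Return the core IDs under this node, in left-to-right order."""
        if self.kind == "core":
            return [self.ident]
        return [c for child in self.children for c in child.cores()]


def build_tree(numa_nodes: int, slices_per_numa: int, cores_per_slice: int) -> TopoNode:
    """Build a regular machine -> NUMA -> LLC slice -> core tree (illustrative layout)."""
    root = TopoNode("machine", 0)
    core_id = 0
    for n in range(numa_nodes):
        numa = TopoNode("numa", n)
        for s in range(slices_per_numa):
            llc = TopoNode("llc_slice", n * slices_per_numa + s)
            for _ in range(cores_per_slice):
                llc.children.append(TopoNode("core", core_id))
                core_id += 1
            numa.children.append(llc)
        root.children.append(numa)
    return root


def partial_allocation(root: TopoNode, per_slice: int) -> List[int]:
    """Pick the first `per_slice` cores under every LLC slice (breadth-first walk)."""
    picked: List[int] = []
    queue = [root]
    while queue:
        node = queue.pop(0)
        if node.kind == "llc_slice":
            picked.extend(node.cores()[:per_slice])
        else:
            queue.extend(node.children)
    return picked


class PhaseSwitcher:
    """Keep one precomputed core plan per phase and switch between them."""

    def __init__(self, plans: Dict[str, List[int]]) -> None:
        self.plans = plans
        self.active: Optional[str] = None

    def enter(self, phase: str) -> List[int]:
        """Mark `phase` as active and return the core IDs its plan uses."""
        self.active = phase
        return self.plans[phase]


tree = build_tree(numa_nodes=2, slices_per_numa=2, cores_per_slice=4)
switcher = PhaseSwitcher({
    "prefill": tree.cores(),                # assumed policy: all cores
    "decode": partial_allocation(tree, 1),  # assumed policy: one core per slice
})
print(switcher.enter("prefill"))
print(switcher.enter("decode"))
```

On Linux, a core list like these could be applied to a worker process with `os.sched_setaffinity`; whether Sandwich switches plans that way is not stated here. The point of the tree is that a "partial" allocation can be expressed per substructure rather than as a flat core count.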

Performance Benchmarks

Sandwich was evaluated on five CPU platforms spanning x86 and ARM architectures. Compared with state-of-the-art systems, it achieved an average 2.01x end-to-end speedup and up to a 3.40x latency reduction. A 3.40x reduction means latency falls to roughly 29% of the baseline.

Additionally, Sandwich's kernels match the performance of static compilers while incurring three orders of magnitude lower tuning cost.
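The general idea behind a fast-start-then-finetune scheme can be sketched as follows. This illustrates the pattern only, not Sandwich's tensor program generator. It assumes a kernel has a single tunable parameter (a row-tile size), that dynamic sequence lengths are grouped into power-of-two buckets, and that `toy_cost` stands in for a real timing measurement. All of these names and choices are hypothetical.

```python
from typing import Callable, Dict, List


def bucket(seq_len: int) -> int:
    """Round a dynamic sequence length up to the next power of two."""
    b = 1
    while b < seq_len:
        b *= 2
    return b


def toy_cost(rows: int, tile: int) -> float:
    """Stand-in for a timing run: padded rows plus a fixed overhead per tile."""
    tiles = -(-rows // tile)  # ceiling division
    return tiles * tile + 2.0 * tiles


class KernelTable:
    """Serve immediately with a default tile size, then refine per shape bucket."""

    def __init__(self, default: int, candidates: List[int],
                 cost: Callable[[int, int], float]) -> None:
        self.default = default
        self.candidates = candidates
        self.cost = cost
        self.best: Dict[int, int] = {}

    def lookup(self, seq_len: int) -> int:
        """Fast start: use the tuned tile for this bucket if known, else the default."""
        return self.best.get(bucket(seq_len), self.default)

    def finetune(self, seq_len: int) -> int:
        """Finetune: evaluate every candidate for this bucket and keep the cheapest."""
        b = bucket(seq_len)
        self.best[b] = min(self.candidates, key=lambda tile: self.cost(b, tile))
        return self.best[b]


table = KernelTable(default=16, candidates=[4, 16, 64], cost=toy_cost)
print(table.lookup(100))    # untuned bucket: falls back to the default tile
print(table.finetune(100))  # tunes the 128-row bucket
print(table.lookup(128))    # same bucket, now served by the tuned tile
```

Requests are never blocked on tuning: an unseen shape runs with the default configuration, and tuning work is spent only on buckets that actually occur. A real system would replace `toy_cost` with measured kernel run times and search a much larger configuration space.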

Conclusion

Sandwich targets three obstacles to CPU LLM serving: cross-phase interference between prefill and decode, core allocation that ignores sub-NUMA structures, and costly dynamic-shape kernel tuning. The reported results, an average 2.01x end-to-end speedup across five x86 and ARM platforms, indicate that addressing these together can substantially improve CPU serving efficiency.


Lazarus Omolua
https://richlyai.com/blog