Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Summary: arXiv:2507.18454v2
Abstract: CPUs are critical for LLM serving due to their availability, cost efficiency, and edge applicability. However, efficient CPU serving is hindered by conflicting prefill/decode resource demands under non-disaggregated deployment constraints. Existing solutions fail to avoid cross-phase interference, ignore sub-NUMA hardware structures, and deliver suboptimal dynamic-shape kernel performance. We propose Sandwich, a full-stack CPU LLM serving system with three core innovations addressing these challenges:
- Seamless phase-wise plan switching to eliminate cross-phase interference.
- TopoTree, a tree-based hardware abstraction for automated substructure-aware (e.g., LLC slices) partial core allocation.
- Fast-start-then-finetune dynamic-shape tensor program generation.
Across five x86/ARM CPU platforms, Sandwich achieves an average 2.01x end-to-end speedup and up to 3.40x latency reduction over state-of-the-art systems. Its kernels match static compiler performance with three orders of magnitude lower tuning cost.
The Challenge of CPU LLM Serving
As large language models (LLMs) become more prevalent, the demand for efficient serving solutions continues to grow. CPUs matter for LLM serving because they are widely available, cost-efficient, and usable at the edge. However, the prefill and decode phases place conflicting demands on resources, and on a CPU the two phases cannot be disaggregated onto separate hardware, so they compete for the same cores.
Existing serving systems fail to avoid this cross-phase interference. They also ignore sub-NUMA hardware structures and deliver suboptimal performance for dynamic-shape kernels.
Introducing Sandwich
Sandwich aims to resolve these issues with a full-stack serving system designed specifically for CPUs. It introduces three techniques:
- Seamless Phase-wise Plan Switching: Sandwich switches execution plans between the prefill and decode phases, so each phase runs with its own resource allocation. The goal is to eliminate cross-phase interference.
- TopoTree: A tree-based hardware abstraction that automates partial core allocation while remaining aware of hardware substructures such as LLC slices.
- Fast-start-then-finetune: A dynamic-shape tensor program generation scheme that produces a usable kernel quickly and then fine-tunes it, which reduces the time and resources spent on kernel tuning.
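The first two ideas above can be illustrated with a minimal Python sketch. It is not Sandwich's implementation or API: the names `TopoNode`, `build_machine`, `allocate`, `make_plans`, and `plan_for` are hypothetical, the machine layout (2 NUMA nodes, 2 LLC slices each, 4 cores per slice) is invented, and the per-phase core counts are assumptions chosen only to show that prefill and decode can use different plans over the same hardware tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopoNode:
    """One node of a hypothetical hardware tree: machine, NUMA node, LLC slice, or core."""
    kind: str
    ident: int
    children: List["TopoNode"] = field(default_factory=list)

    def cores(self) -> List[int]:
        """Return the IDs of all cores under this node, left to right."""
        if self.kind == "core":
            return [self.ident]
        return [c for child in self.children for c in child.cores()]


def build_machine(numa_nodes: int, slices_per_numa: int, cores_per_slice: int) -> TopoNode:
    """Build a uniform machine -> NUMA -> LLC slice -> core tree with sequential core IDs."""
    machine = TopoNode("machine", 0)
    next_core = 0
    for n in range(numa_nodes):
        numa = TopoNode("numa", n)
        for s in range(slices_per_numa):
            llc = TopoNode("llc_slice", n * slices_per_numa + s)
            for _ in range(cores_per_slice):
                llc.children.append(TopoNode("core", next_core))
                next_core += 1
            numa.children.append(llc)
        machine.children.append(numa)
    return machine


def allocate(root: TopoNode, cores_per_slice: int) -> List[int]:
    """Partial allocation: take the first `cores_per_slice` cores of every LLC slice."""
    if root.kind == "llc_slice":
        return root.cores()[:cores_per_slice]
    return [c for child in root.children for c in allocate(child, cores_per_slice)]


def make_plans(root: TopoNode) -> Dict[str, List[int]]:
    """Illustrative per-phase plans; a real system would obtain these from a search."""
    return {
        # Assumption: prefill is compute-heavy, so it gets every core.
        "prefill": allocate(root, cores_per_slice=4),
        # Assumption: decode is bandwidth-heavy, so it gets fewer cores per slice.
        "decode": allocate(root, cores_per_slice=2),
    }


def plan_for(phase: str, plans: Dict[str, List[int]]) -> List[int]:
    """Look up the core set for a phase; a server would re-pin its workers to it."""
    return plans[phase]


machine = build_machine(numa_nodes=2, slices_per_numa=2, cores_per_slice=4)
plans = make_plans(machine)
print("prefill cores:", plan_for("prefill", plans))
print("decode cores:", plan_for("decode", plans))
```

Walking the tree rather than a flat core list is what lets the allocator reason about "two cores from each LLC slice" instead of "the first eight cores", which on this invented layout would crowd into two slices and leave the other two idle.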
Performance Benchmarks
Across five CPU platforms spanning x86 and ARM, Sandwich achieves an average 2.01x end-to-end speedup and up to a 3.40x latency reduction over state-of-the-art systems.
Additionally, Sandwich’s kernels match the performance of static compilers at three orders of magnitude lower tuning cost, which makes per-shape tuning far less of a bottleneck for LLM serving on CPUs.
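For readers unfamiliar with how such ratios are usually derived, the short Python sketch below computes per-platform speedups as baseline latency divided by new latency, then summarizes them. The platform names and latencies are made up for illustration and are not measurements from the paper. The summary above also does not say whether the reported average is an arithmetic or a geometric mean, so both are shown.

```python
from statistics import geometric_mean, mean

# Hypothetical (baseline_seconds, new_seconds) pairs; NOT data from the paper.
latencies = {
    "platform_a": (4.0, 2.0),
    "platform_b": (6.0, 2.0),
    "platform_c": (3.0, 2.0),
}

# Speedup is the baseline latency divided by the new latency.
speedups = {name: base / new for name, (base, new) in latencies.items()}

arithmetic_avg = mean(speedups.values())
geometric_avg = geometric_mean(speedups.values())
best = max(speedups.values())

print(f"arithmetic mean speedup: {arithmetic_avg:.2f}x")
print(f"geometric mean speedup:  {geometric_avg:.2f}x")
print(f"largest speedup:         {best:.2f}x")
```

An "average" figure and an "up to" figure answer different questions: the first summarizes all platforms and workloads, while the second reports only the best case.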
Conclusion
Sandwich combines phase-wise plan switching, topology-aware core allocation, and low-cost kernel generation into a single CPU serving stack. The reported results suggest that efficient LLM serving on CPUs is practical on both x86 and ARM platforms.
