Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
Summary: arXiv:2604.12301v1
Abstract: We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching.
We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes: edit-heavy, explanation-heavy, general chat, and RAG-heavy.
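The greedy-additive evaluation mentioned above can be sketched as follows. This is a minimal illustration, assuming "greedy-additive" means starting from no tactics and repeatedly adding whichever tactic improves savings the most, stopping when none helps. The source does not spell out the exact procedure. The function name `greedy_additive`, the tactic labels, and the `toy_savings` numbers are all hypothetical and are not results from the study.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def greedy_additive(
    tactics: Iterable[str],
    savings: Callable[[FrozenSet[str]], int],
) -> Tuple[List[str], int]:
    """Grow a tactic subset one tactic at a time.

    A tactic is kept only if adding it strictly improves the savings score.
    """
    remaining = set(tactics)
    chosen: List[str] = []
    best = savings(frozenset())
    while remaining:
        # Score every single-tactic extension of the current subset.
        scored = [(savings(frozenset(chosen + [t])), t) for t in sorted(remaining)]
        top_score, top_tactic = max(scored)
        if top_score <= best:
            break  # no remaining tactic helps, so stop
        chosen.append(top_tactic)
        remaining.remove(top_tactic)
        best = top_score
    return chosen, best


def toy_savings(subset: FrozenSet[str]) -> int:
    """Invented savings in percentage points, for illustration only."""
    base = {"routing": 30, "compression": 20, "caching": 5}
    total = sum(base[t] for t in subset)
    # Hypothetical negative interaction between two tactics.
    if "caching" in subset and "compression" in subset:
        total -= 10
    return total


chosen, best = greedy_additive(["routing", "compression", "caching"], toy_savings)
print(chosen, best)
```

With these made-up numbers the search keeps two tactics and rejects the third, because the assumed interaction makes the full set score lower than the pair. That is the kind of effect a pairwise and greedy evaluation is designed to expose.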
Key Findings
The main findings are:
- Token Savings: Tactic 1 (local routing) combined with Tactic 2 (prompt compression) achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads.
- RAG-Heavy Workloads: For RAG-heavy workloads, implementing the complete set of tactics, including Tactic 4 (draft-review), results in 51% savings.
- Workload Dependency: The best tactic subset depends on the workload, so a configuration tuned for one workload class should not be assumed to suit another.
Implementation Framework
All seven tactics are implemented in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface. Any local model served through Ollama can act as the triage layer, and any cloud model behind an OpenAI-compatible endpoint can act as the frontier model. A coding agent that already speaks either protocol can therefore place the shim in front of its cloud model.
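To make the triage idea concrete, here is a minimal sketch of Tactic 1 (local routing). It is not the shim's actual code: `route`, `RouteResult`, the `"trivial"` label, and the `threshold` parameter are hypothetical names chosen for illustration, and the models are passed in as plain callables so the example runs without Ollama or a cloud endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RouteResult:
    target: str  # "local" or "cloud"
    answer: str


def route(
    prompt: str,
    classify: Callable[[str], Tuple[str, float]],
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
    threshold: float = 0.8,  # assumed confidence cut-off, not from the source
) -> RouteResult:
    """Answer locally only when the classifier is confident the task is trivial."""
    label, confidence = classify(prompt)
    if label == "trivial" and confidence >= threshold:
        return RouteResult("local", local_model(prompt))
    # Anything uncertain or complex goes to the cloud model.
    return RouteResult("cloud", cloud_model(prompt))


# Stand-ins for real models; a real deployment would call Ollama and a cloud API.
def stub_classify(prompt: str) -> Tuple[str, float]:
    return ("trivial", 0.9) if len(prompt) < 40 else ("complex", 0.9)


def stub_local(prompt: str) -> str:
    return "local:" + prompt


def stub_cloud(prompt: str) -> str:
    return "cloud:" + prompt


short = route("rename variable x to count", stub_classify, stub_local, stub_cloud)
long = route(
    "refactor the payment module to support idempotent retries across services",
    stub_classify, stub_local, stub_cloud,
)
print(short.target, long.target)
```

Defaulting to the cloud model whenever the classifier is unsure is a conservative design choice: a wrong "cloud" decision only costs tokens, while a wrong "local" decision may degrade the answer.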
Evaluation Metrics
We measured the following metrics:
- Tokens Saved: Quantifying the reduction in token usage across different tactics and workloads.
- Dollar Cost: Analyzing the cost implications associated with token savings.
- Latency: Measuring the response time for each tactic to ensure efficiency.
- Routing Accuracy: Evaluating the effectiveness of routing decisions made by the local model.
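The metrics listed above can be computed from per-request logs roughly as follows. This is a sketch under stated assumptions: the `Request` record, its field names, and the per-million-token pricing model are hypothetical, and the source does not give the study's exact definitions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    baseline_cloud_tokens: int  # tokens the cloud model would use with no tactics
    actual_cloud_tokens: int    # tokens actually sent to the cloud model
    routed_local: bool          # what the local router decided
    should_route_local: bool    # assumed ground-truth label for that decision


def token_savings(requests: Sequence[Request]) -> float:
    """Fraction of baseline cloud tokens avoided across all requests."""
    baseline = sum(r.baseline_cloud_tokens for r in requests)
    actual = sum(r.actual_cloud_tokens for r in requests)
    return 1 - actual / baseline


def routing_accuracy(requests: Sequence[Request]) -> float:
    """Share of requests where the router matched the ground-truth label."""
    correct = sum(r.routed_local == r.should_route_local for r in requests)
    return correct / len(requests)


def dollar_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a token count at an assumed flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Made-up log entries for illustration only; not data from the study.
log = [
    Request(1000, 0, routed_local=True, should_route_local=True),
    Request(1000, 500, routed_local=False, should_route_local=True),
]
print(token_savings(log), routing_accuracy(log), dollar_cost(500, 10.0))
```

Latency is omitted here because it is measured by timing requests rather than derived from token counts.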
Conclusion
This study measures how much cloud LLM token usage a small local triage model can remove, tactic by tactic and workload by workload. The central result is that the best tactic subset depends on the workload, so practitioners should select tactics per workload class to reduce cost while maintaining the effectiveness of their coding agents.
As cloud LLM offerings change, these measurements give developers a reference point for choosing and re-evaluating their token-reduction strategies.
