Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
arXiv:2604.09611v1 (cross-list)
Abstract
Large language models (LLMs) are increasingly utilized in applications that form multi-request workflows, such as document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored.
Research Overview
This paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. The study develops four representative workloads capturing:
- Sequential patterns
- Interactive patterns
- Agentic patterns
- Composite patterns
Methodology
Using an NVIDIA A100 testbed running two state-of-the-art serving systems, vLLM and Parrot, the study analyzes how key energy knobs affect:
- Latency
- Throughput
- Component-level energy use
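As a rough illustration of how these three metrics can be derived from raw measurements, the sketch below integrates timestamped power samples into energy and combines them with per-request timings. It is a minimal sketch, not the paper's harness: `RequestRecord`, `energy_joules`, and `summarize` are hypothetical names, and the power trace is assumed to come from some periodic sampler (for example, polling the GPU's reported power draw). Other components such as CPU or DRAM could be handled the same way from their own traces.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RequestRecord:
    """Timing for one LLM request (hypothetical record layout)."""
    start_s: float      # time the request was issued
    end_s: float        # time the last output token arrived
    output_tokens: int  # number of generated tokens


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def summarize(records: Sequence[RequestRecord],
              timestamps_s: Sequence[float],
              gpu_power_w: Sequence[float]) -> Dict[str, float]:
    """Compute latency, throughput, and GPU energy for one workflow run."""
    if not records:
        raise ValueError("at least one request record is required")
    makespan = max(r.end_s for r in records) - min(r.start_s for r in records)
    tokens = sum(r.output_tokens for r in records)
    energy = energy_joules(timestamps_s, gpu_power_w)
    return {
        "mean_latency_s": sum(r.end_s - r.start_s for r in records) / len(records),
        "throughput_tok_per_s": tokens / makespan,
        "gpu_energy_j": energy,
        "energy_per_token_j": energy / tokens,
    }
```

Reporting energy per token alongside total energy makes runs with different output lengths comparable, which matters when workflows issue many dependent requests.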
Key Findings
Batch size is the strongest lever on performance and energy, although its benefits depend on the workload:
- Workloads with large shared prompts benefit from optimal batching.
- Sequential summarization gains little from batching.
- Multi-agent coding benefits only partially.
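One plausible intuition for this workload dependence is that batching can only group requests that are ready at the same time. The toy model below is an assumption for illustration, not the paper's analysis: it counts scheduling rounds for a dependency graph of requests when at most `batch_size` ready requests run per round. The function `batched_rounds` and the three example graphs are hypothetical, and the model ignores prompt-prefix sharing and real timing.

```python
from typing import Dict, List


def batched_rounds(deps: Dict[str, List[str]], batch_size: int) -> int:
    """Count rounds needed to finish all requests.

    `deps` maps each request to the requests it must wait for.
    Each round runs at most `batch_size` requests whose dependencies are done.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    done = set()
    rounds = 0
    while len(done) < len(deps):
        ready = [r for r in deps
                 if r not in done and all(d in done for d in deps[r])]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        done.update(ready[:batch_size])
        rounds += 1
    return rounds


# Sequential chain: each step consumes the previous step's output.
chain = {f"s{i}": ([f"s{i - 1}"] if i > 0 else []) for i in range(8)}

# Independent requests (e.g., many queries over one shared prompt).
fanout = {f"q{i}": [] for i in range(8)}

# Agentic pattern: a planner, four parallel workers, then a reviewer.
workers = [f"code{i}" for i in range(4)]
agentic = {"plan": [], **{w: ["plan"] for w in workers}, "review": workers}

for name, graph in [("chain", chain), ("fanout", fanout), ("agentic", agentic)]:
    print(name, [batched_rounds(graph, b) for b in (1, 4, 8)])
```

In this model the chain needs eight rounds at every batch size, the independent requests drop from eight rounds to one, and the agentic graph improves only until its serial planner and reviewer stages dominate.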
Energy Management Techniques
The study also examines two further energy knobs:
- GPU power capping yields modest but predictable energy savings.
- Energy scales linearly with output length, which leaves limited room for efficiency gains.
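The linear-scaling observation can be read through a simple two-term model. The symbols below are assumptions introduced for illustration and do not appear in the source: $E_0$ is a fixed per-request cost (for example, prompt processing), $e_{\mathrm{tok}}$ is the marginal energy per generated token, and $n$ is the output length in tokens.

```latex
E(n) \approx E_0 + e_{\mathrm{tok}}\, n,
\qquad
\frac{E(n)}{n} \approx \frac{E_0}{n} + e_{\mathrm{tok}}
\;\xrightarrow{\; n \to \infty \;}\; e_{\mathrm{tok}}
```

Under this model, total energy grows in proportion to output length, while energy per token approaches the floor $e_{\mathrm{tok}}$ as outputs get longer, so changing output length alone changes per-token efficiency very little.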
Optimization Strategies
Engine-level optimizations in vLLM sustain higher GPU utilization and efficiency, especially for decode-heavy workloads, whereas Parrot’s workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings offer actionable guidelines for developers and system operators designing performance- and energy-aware LLM serving systems for emerging multi-request workflows.
Conclusion
This study characterizes how performance and energy efficiency trade off in multi-request LLM workflows. Its main takeaways are that batch size matters most but only for suitable workloads, that GPU power capping gives modest, predictable savings, and that the choice of serving system shifts the balance between GPU utilization and energy consumption.
