Performance and Energy Trade-offs in Multi-Request LLM Workflows

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

Source: arXiv:2604.09611v1

Abstract

Large language models (LLMs) are increasingly utilized in applications that form multi-request workflows, such as document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored.

Research Overview

This paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. The study develops four representative workloads capturing:

Sequential patterns
Interactive patterns
Agentic patterns
Composite patterns
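The dependency structure that separates these workflows from independent single requests can be sketched as a directed acyclic graph of LLM calls. The example below is a minimal sketch under stated assumptions: the function `critical_path_latency`, the two example shapes (a chain and a fan-out/fan-in), and all latencies are hypothetical illustrations, not the paper's workload definitions or measurements.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_latency(deps, latency_s):
    """End-to-end latency when every request starts as soon as all of its
    dependencies have finished (assumes unlimited parallelism)."""
    finish = {}
    # static_order() yields each request after all of its dependencies.
    for req in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(req, ())), default=0.0)
        finish[req] = start + latency_s[req]
    return max(finish.values())


# Hypothetical chain: each step consumes the previous step's output.
chain = {"s1": [], "s2": ["s1"], "s3": ["s2"]}
# Hypothetical fan-out/fan-in: three independent calls feed one merge call.
fan = {"a": [], "b": [], "c": [], "merge": ["a", "b", "c"]}

# Made-up per-request latencies in seconds.
lat = {"s1": 2.0, "s2": 2.0, "s3": 2.0,
       "a": 2.0, "b": 2.0, "c": 2.0, "merge": 1.0}

chain_latency = critical_path_latency(chain, lat)
fan_latency = critical_path_latency(fan, lat)
print(chain_latency, fan_latency)
```

In this toy model the chain cannot overlap any of its requests, while the fan-out shape can run its three independent calls concurrently, which is one reason the same serving knob may behave differently across workflow shapes.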

Methodology

Using an NVIDIA A100 testbed running two state-of-the-art serving systems, vLLM and Parrot, the study analyzes how key energy management knobs influence:

Latency
Throughput
Component-level energy use
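Component-level energy is typically derived from sampled power readings, since energy is the time integral of power. The sketch below shows that arithmetic with the trapezoidal rule. It is an illustration only: the function names, the component labels, and the power traces are hypothetical, and the paper's actual measurement tooling is not described in this summary.

```python
def energy_joules(samples):
    """Integrate (timestamp_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def component_energy(traces):
    """Map each component name to its energy in joules."""
    return {name: energy_joules(samples) for name, samples in traces.items()}


# Hypothetical power traces (seconds, watts) for one workflow run.
traces = {
    "gpu": [(0.0, 250.0), (2.0, 250.0), (4.0, 300.0)],
    "cpu": [(0.0, 50.0), (4.0, 50.0)],
}

energy = component_energy(traces)
tokens_generated = 525  # made-up output token count for the run
gpu_joules_per_token = energy["gpu"] / tokens_generated
print(energy, gpu_joules_per_token)
```

Dividing energy by the number of generated tokens gives a per-token figure that can be compared across configurations with different run times.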

Key Findings

The findings reveal that batch size is the most impactful lever affecting performance metrics, although the benefits are workload-dependent. Specifically:

Optimal batching benefits workloads with large shared prompts.
Sequential summarization workloads gain little from batching.
Batching is only partially effective for multi-agent coding.
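One plausible mechanism behind the shared-prompt result is that requests batched together can reuse the computation for a common prefix. The toy count below illustrates that effect. It is an assumption-laden sketch, not the paper's model: `prefill_tokens` and the token counts are hypothetical, and real serving systems differ in how and when they share prefixes.

```python
def prefill_tokens(shared_prefix, unique_suffixes, share_prefix):
    """Count prompt tokens processed for a batch of requests.

    shared_prefix: tokens in the prompt prefix common to all requests
    unique_suffixes: per-request token counts that follow the prefix
    share_prefix: whether the prefix is processed once for the whole batch
    """
    n = len(unique_suffixes)
    prefix_cost = shared_prefix if share_prefix else n * shared_prefix
    return prefix_cost + sum(unique_suffixes)


# Hypothetical batch: 8 requests, 2000-token shared prompt, 50 unique tokens each.
suffixes = [50] * 8
without_sharing = prefill_tokens(2000, suffixes, share_prefix=False)
with_sharing = prefill_tokens(2000, suffixes, share_prefix=True)
print(without_sharing, with_sharing)
```

A strictly sequential workflow offers no such opportunity within a single run, because each request needs the previous request's output before it can start, which is consistent with the small batching gains reported for sequential summarization.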

Energy Management Techniques

The study also examines two further knobs, GPU power capping and output length:

GPU power capping provides modest but predictable energy savings.
Energy scales linearly with output length, which leaves limited room for efficiency gains.
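One way to read the output-length finding is through a simple linear model, E(n) = E_fixed + e_token * n, where n is the number of output tokens. Under that model, energy per token is E_fixed / n + e_token, which flattens toward e_token as n grows. The sketch below illustrates this. The model form is an assumption, and the coefficients are made-up values rather than figures from the paper.

```python
def energy_per_output_token(fixed_j, per_token_j, n_out):
    """Per-token energy under the assumed model E(n) = fixed_j + per_token_j * n."""
    if n_out <= 0:
        raise ValueError("n_out must be positive")
    return fixed_j / n_out + per_token_j


# Hypothetical coefficients: 100 J fixed cost, 2 J per generated token.
per_token = {
    n: energy_per_output_token(100.0, 2.0, n) for n in (50, 500, 5000)
}
for n, joules in per_token.items():
    print(f"{n:>5} output tokens: {joules:.2f} J/token")
```

In this toy setting, longer outputs amortize the fixed cost, but the per-token figure cannot fall below the marginal cost per token, so the efficiency gain is bounded.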

Optimization Strategies

Engine-level optimizations in the vLLM system maintain higher GPU utilization and efficiency, particularly for decode-heavy workloads. Conversely, Parrot’s workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings provide actionable guidelines for developers and system operators aiming to design performance- and energy-aware LLM serving systems in emerging multi-request workflows.
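When neither system wins on both latency and energy, a common way to compare measured configurations is to keep only those that are not dominated on either axis (a Pareto front). The helper below is a generic sketch of that selection. The configuration names and numbers are hypothetical placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return names of configs not dominated in (latency, energy).

    points: iterable of (name, latency_s, energy_j); lower is better for both.
    A config is dominated if another is no worse on both metrics and strictly
    better on at least one.
    """
    points = list(points)
    front = []
    for name, lat, en in points:
        dominated = any(
            (l2 <= lat and e2 <= en) and (l2 < lat or e2 < en)
            for _, l2, e2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical measurements: (config name, latency in s, energy in J).
measurements = [
    ("config_a", 1.0, 500.0),
    ("config_b", 1.5, 400.0),
    ("config_c", 1.6, 450.0),  # slower and costlier than config_b
    ("config_d", 2.5, 380.0),
]

front = pareto_front(measurements)
print(front)
```

An operator with a strict latency target would then pick the lowest-energy configuration on the front that still meets the target.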

Conclusion

This research clarifies how performance and energy efficiency trade off in multi-request workflows built on large language models. As these applications evolve, its findings on batching, power capping, output length, and serving-system design can guide LLM deployments toward lower energy consumption.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
