Performance and Energy Trade-offs in Multi-Request LLM Workflows

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

Source: arXiv:2604.09611v1 (cross-listed)

Abstract

Large language models (LLMs) are increasingly utilized in applications that form multi-request workflows, such as document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing LLM inference systems or consider single-request evaluations, overlooking workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls remains underexplored.

Research Overview

This paper presents the first systematic characterization of performance-energy trade-offs in multi-request LLM inference. The study develops four representative workloads capturing:

  • Sequential patterns
  • Interactive patterns
  • Agentic patterns
  • Composite patterns

Methodology

On an NVIDIA A100 testbed running two state-of-the-art serving systems (vLLM and Parrot), the study analyzes how key energy management knobs influence:

  • Latency
  • Throughput
  • Component-level energy use
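
As a rough illustration of how these three metrics can be derived from raw measurements, the sketch below integrates timestamped power samples into energy and then computes latency, throughput, and energy per token. The sample format, the function names, and the numbers are assumptions made for illustration and are not the paper's tooling; on a real testbed the power readings would come from a GPU monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class PowerSample:
    t_s: float      # timestamp in seconds
    power_w: float  # instantaneous power draw in watts


def energy_joules(samples):
    """Trapezoidal integration of power (W) over time (s), giving energy (J)."""
    total = 0.0
    for a, b in zip(samples, samples[1:]):
        total += 0.5 * (a.power_w + b.power_w) * (b.t_s - a.t_s)
    return total


def summarize(samples, tokens_out, n_requests):
    """Hypothetical helper: workflow-level metrics from one measured run."""
    duration = samples[-1].t_s - samples[0].t_s
    energy = energy_joules(samples)
    return {
        "latency_s": duration,                         # end-to-end workflow time
        "throughput_tok_per_s": tokens_out / duration,
        "energy_j": energy,
        "energy_per_token_j": energy / tokens_out,
        "energy_per_request_j": energy / n_requests,
    }


if __name__ == "__main__":
    # Made-up samples for a 2-second run that produced 110 tokens over 4 requests.
    run = [PowerSample(0.0, 200.0), PowerSample(1.0, 300.0), PowerSample(2.0, 300.0)]
    print(summarize(run, tokens_out=110, n_requests=4))
```

Reporting energy per token and per request alongside total energy makes it easier to compare workloads that issue different numbers of requests.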

Key Findings

Batch size is the most impactful lever for performance and energy, although its benefits are workload-dependent. Specifically:

  • Optimal batching proves advantageous for workloads with large shared prompts.
  • Sequential summarization workloads see minimal benefits from batching.
  • Multi-agent coding shows partial effectiveness with batching techniques.
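
One way to build intuition for the shared-prompt case is to count the prompt tokens that must be processed. The toy model below assumes that a serving system computes a shared prefix once per batch and reuses it across the requests in that batch. This is a simplification for illustration, not a description of how vLLM or Parrot behave in the paper, and the function name and token counts are hypothetical.

```python
def prefill_tokens(shared_len, unique_lens, batch_size, share_prefix=True):
    """Count prompt tokens processed when requests are served in batches.

    Assumption: with share_prefix=True the shared prefix is processed once
    per batch; otherwise every request pays for the full prompt.
    """
    total = 0
    for i in range(0, len(unique_lens), batch_size):
        batch = unique_lens[i:i + batch_size]
        if share_prefix:
            total += shared_len + sum(batch)
        else:
            total += sum(shared_len + u for u in batch)
    return total


if __name__ == "__main__":
    # Eight requests, each with a 1000-token shared prompt and a 50-token unique suffix.
    suffixes = [50] * 8
    print(prefill_tokens(1000, suffixes, batch_size=1))  # one request per batch
    print(prefill_tokens(1000, suffixes, batch_size=8))  # all requests in one batch
```

Under this assumption, larger batches cut prompt work sharply when the shared prompt dominates. A sequential workload, where each request depends on the previous output, cannot place its requests in the same batch, which is consistent with the limited benefit reported above.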

Energy Management Techniques

The study further explores the implications of various energy management techniques:

  • GPU power capping provides modest but predictable energy savings.
  • Energy use scales linearly with output length, so adjusting output length offers limited efficiency gains.
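
A simple linear model, offered here as an assumption rather than the paper's fitted result, makes the output-length point concrete. Let E(n) be the energy of a request that produces n output tokens, with a fixed overhead E_0 (for example, prompt processing) and a per-token decode cost e_d; both symbols are hypothetical.

```latex
E(n) = E_0 + e_d \, n, \qquad \frac{E(n)}{n} = \frac{E_0}{n} + e_d
```

Under this model, total energy changes roughly in proportion to n, but energy per token is bounded below by e_d. Changing output length therefore shifts total energy without improving per-token efficiency beyond amortizing E_0.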

Optimization Strategies

Engine-level optimizations in the vLLM system maintain higher GPU utilization and efficiency, particularly for decode-heavy workloads. Conversely, Parrot’s workflow-aware scheduling achieves lower energy consumption under strict power constraints. These findings provide actionable guidelines for developers and system operators aiming to design performance- and energy-aware LLM serving systems in emerging multi-request workflows.

Conclusion

This study clarifies how performance and energy efficiency trade off in multi-request LLM workflows: batch size, power caps, output length, and the choice of serving system each affect latency, throughput, and energy differently depending on the workload. These insights can help developers and operators tune LLM deployments for both performance and energy consumption.


Lazarus Omolua
https://richlyai.com/blog