Two-dimensional Early Exit Optimisation of LLM Inference
Source: arXiv:2604.18592v1 (cross-listed)
Abstract: We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently.
Rapid progress in large language models (LLMs) has increased interest in reducing their inference cost. This article summarizes a two-dimensional early exit strategy for LLM-based classification. The method coordinates two kinds of exit: layer-wise (stopping before the final layer) and sentence-wise (stopping before the whole input has been read). Because the two savings compound, the combined reduction in computation is larger than either achieves alone.
Key Features of the 2D Early Exit Strategy
- Incremental Processing: The model reads the input sentence by sentence, so it can return a prediction before the whole input has been processed.
- Layer Activation: Deeper layers are activated progressively as processing continues, so the full depth of the model is used only when shallower layers are not sufficient.
- Multiplicative Savings: Coordinating the two dimensions yields multiplicative computational savings that exceed those from optimizing either dimension on its own.
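The three features above can be sketched as a nested loop over sentence prefixes and exit layers. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `two_d_early_exit`, the `score` callback, the confidence threshold, and the schedule that unlocks one additional exit layer per sentence read are all hypothetical choices made for the example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical callback type: given a sentence prefix and a layer index,
# return (predicted_label, confidence in [0, 1]).
ScoreFn = Callable[[Sequence[str], int], Tuple[int, float]]


def two_d_early_exit(
    sentences: Sequence[str],
    exit_layers: Sequence[int],
    score: ScoreFn,
    threshold: float = 0.9,
) -> Tuple[int, int, int]:
    """Return (label, sentences_used, layer_used).

    Assumed schedule: after reading n sentences, the first n exit
    layers (shallowest first) are allowed to produce a prediction.
    """
    for n in range(1, len(sentences) + 1):
        prefix = sentences[:n]
        allowed = exit_layers[: min(n, len(exit_layers))]
        for layer in allowed:
            label, conf = score(prefix, layer)
            if conf >= threshold:
                # Exit in both dimensions: partial input, partial depth.
                return label, n, layer
    # No exit fired: fall back to the full input at the deepest exit layer.
    label, _ = score(sentences, exit_layers[-1])
    return label, len(sentences), exit_layers[-1]


def toy_score(prefix: Sequence[str], layer: int) -> Tuple[int, float]:
    """Toy stand-in for a model: confidence grows with input read and depth."""
    conf = min(1.0, 0.2 * len(prefix) + 0.02 * layer)
    return 1, conf


if __name__ == "__main__":
    result = two_d_early_exit(["a", "b", "c", "d"], [8, 16, 24, 32], toy_score)
    print(result)  # (label, sentences_used, layer_used)
```

With the toy scorer, the loop exits after three of four sentences at layer 16 of 32, so neither the full input nor the full depth is used. A real system would replace `toy_score` with a forward pass through the model up to the given layer, followed by a classification adapter.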
Experimental Evaluation
The 2D early exit strategy was evaluated on four LLMs (Llama 3.1, Llama 3.2, Gemma, and Qwen), ranging from 3B to 8B parameters, across three sentiment classification datasets. The main results:
- Speed-ups of 1.4x to 2.3x over optimal layer-wise early exit on simpler tasks.
- Graceful degradation on more complex multi-class classification problems.
- Fine-tuning reduced but did not eliminate the computational advantage of the 2D approach.
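To make the term "multiplicative" concrete, the sketch below uses a deliberately simplified cost model in which compute is proportional to the fraction of layers executed times the fraction of the input read. The model and the function names `relative_cost` and `speedup` are assumptions for illustration only; the summary above does not state how cost was measured, and real attention cost is not strictly linear in input length.

```python
def relative_cost(layer_fraction: float, input_fraction: float) -> float:
    """Toy cost model: fraction of layers run times fraction of input read."""
    return layer_fraction * input_fraction


def speedup(baseline_cost: float, new_cost: float) -> float:
    """Speed-up factor of new_cost relative to baseline_cost."""
    return baseline_cost / new_cost


# Layer-wise exit only: half the layers, but the whole input.
layer_only = relative_cost(0.5, 1.0)

# 2D exit: half the layers and half the input.
two_d = relative_cost(0.5, 0.5)

print(layer_only, two_d, speedup(layer_only, two_d))  # 0.5 0.25 2.0
```

Read in the other direction, a speed-up of s over a baseline means using 1/s of the baseline's compute, so the reported 1.4x and 2.3x correspond to roughly 71% and 43% of the compute of optimal layer-wise early exit.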
Model Agnosticism and Compatibility
The approach is model-agnostic: it can be applied to different LLM architectures without extensive modification and requires only lightweight classification adapters. It is also independent of other efficiency techniques such as quantization and pruning, so it can be combined with them in existing workflows.
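The summary does not describe the adapters' architecture. One plausible lightweight form, shown here purely as an assumption, is a linear head with a softmax applied to the hidden state at an exit layer. The class `LinearExitAdapter` and its weights are hypothetical; in practice the weights would be trained while the base model stays frozen.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearExitAdapter:
    """Hypothetical adapter: one linear layer mapping a hidden state to class scores."""

    def __init__(self, weights: Sequence[Sequence[float]], bias: Sequence[float]):
        self.weights = weights  # shape: [num_classes][hidden_size]
        self.bias = bias        # shape: [num_classes]

    def predict(self, hidden: Sequence[float]) -> Tuple[int, float]:
        """Return (label, confidence), where confidence is the top softmax probability."""
        logits = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        return label, probs[label]


if __name__ == "__main__":
    # Two classes, hidden size 2, hand-picked weights for illustration only.
    adapter = LinearExitAdapter(weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
    label, conf = adapter.predict([2.0, 0.0])
    print(label, round(conf, 3))
```

The confidence returned by such an adapter is the kind of signal an exit rule can compare against a threshold to decide whether to stop or to continue to deeper layers or further sentences.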
Future Directions
The findings indicate that the 2D early exit strategy works best when semantic information accumulates predictably across the input structure. This suggests it may also apply to sequence-processing tasks beyond sentiment classification, which is a direction for further research.
In conclusion, two-dimensional early exit optimisation reduces the cost of LLM inference for classification beyond what layer-wise exiting alone achieves, while requiring only lightweight classification adapters.
