SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
arXiv ID: arXiv:2604.09557v1
Announcement Type: Cross
Abstract: Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness.
Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes.
Key Features of SPEED-Bench
- Qualitative Data Split: A carefully curated split whose samples are selected by prioritizing semantic diversity.
- Throughput Data Split: It includes a throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios.
- Integration with Production Engines: By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks.
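One way to picture "prioritizing semantic diversity" in the qualitative split is greedy farthest-point selection over sample embeddings. The sketch below is only an illustration under that assumption: the source does not state which selection procedure SPEED-Bench uses, and the function names (`cosine_distance`, `select_diverse`) and the toy 2-D vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_diverse(embeddings, k):
    """Greedy farthest-point selection (an assumed stand-in, not the paper's method).

    Starts from the first sample, then repeatedly adds the unselected sample
    whose distance to its nearest already-selected sample is largest.
    Returns the chosen indices in selection order.
    """
    n = len(embeddings)
    if k <= 0 or n == 0:
        return []
    selected = [0]  # arbitrary seed: the first sample
    chosen = {0}
    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], cosine_distance(e, embeddings[nxt]))
    return selected


# Toy 2-D "embeddings": two pairs of near-duplicates plus one outlier.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
    [0.05, 0.99],
    [-1.0, 0.0],
]
picked = select_diverse(toy_embeddings, 3)
print(picked)
```

In this toy input the near-duplicate vectors are skipped in favor of samples pointing in different directions, which is the behavior a diversity-first split is after. A real pipeline would use sentence embeddings and likely a more scalable selection method.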
Significance of SPEED-Bench
SPEED-Bench addresses common pitfalls in existing benchmarks. Many benchmarks rely on synthetic inputs that do not represent real-world workloads, which can lead to overestimated throughput. Such overestimates can mislead developers and researchers who rely on these benchmarks when working with LLMs.
Insights from SPEED-Bench
Through our rigorous evaluation with SPEED-Bench, we highlight several key insights:
- We quantify how synthetic inputs can overestimate real-world throughput.
- We identify batch-size dependent optimal draft lengths, emphasizing the importance of tuning parameters for different workloads.
- We analyze biases in low-diversity data, which can skew results and lead to inaccurate conclusions.
- We explore the caveats of vocabulary pruning in state-of-the-art drafters, shedding light on potential pitfalls in model design.
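The batch-size dependence of the optimal draft length can be illustrated with a toy cost model. The sketch below is not the paper's methodology. It assumes each draft token is accepted independently with probability `alpha`, which gives the standard expected-tokens-per-step expression (1 - alpha^(gamma+1)) / (1 - alpha) for draft length `gamma`. It also assumes a simple roofline-style target model whose forward-pass time is flat until compute dominates. All constants and function names are hypothetical.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verification step.

    Assumes i.i.d. acceptance with probability alpha for each of gamma draft
    tokens, plus the one token the target model always contributes.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def target_pass_time(batch_size, tokens_per_seq, mem_time=1.0, per_token_time=0.02):
    """Toy roofline: a memory-bound floor, then cost linear in tokens processed.

    mem_time and per_token_time are made-up constants in arbitrary units.
    """
    return max(mem_time, batch_size * tokens_per_seq * per_token_time)


def modeled_speedup(alpha, gamma, batch_size, draft_time_per_token=0.05):
    """Modeled speedup of speculative decoding over plain autoregressive decoding."""
    baseline_step = target_pass_time(batch_size, 1)  # one token per step
    sd_step = gamma * draft_time_per_token + target_pass_time(batch_size, gamma + 1)
    return expected_tokens_per_step(alpha, gamma) * baseline_step / sd_step


def best_draft_length(alpha, batch_size, max_gamma=8):
    """Draft length in 1..max_gamma with the highest modeled speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: modeled_speedup(alpha, g, batch_size),
    )


for batch in (1, 8, 32, 64):
    g = best_draft_length(0.8, batch)
    print(batch, g, round(modeled_speedup(0.8, g, batch), 3))
```

Under these assumed constants, long drafts pay off at small batch sizes, where verification is memory-bound and extra tokens are nearly free. At large batch sizes verification becomes compute-bound, so shorter drafts win and the modeled speedup can fall below 1. Real engines such as vLLM or TensorRT-LLM would need to be measured directly, since the crossover depends on hardware, model, and workload.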
Conclusion
We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms. By offering a diverse and comprehensive benchmarking suite, we aim to make SD evaluations more accurate and more representative of realistic serving conditions.
For more information and access to SPEED-Bench, please visit our official release page.
