The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
arXiv:2604.15409v1
Abstract: KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences.
Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding. Because greedy decoding involves no sampling, this rules out sampling randomness as a cause. Cache-ON also yields higher accuracy in 8 of 9 conditions, which indicates that the divergence direction is systematic rather than random.
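The non-associativity the abstract refers to can be seen with three numbers. The following is a minimal NumPy sketch, not code from the paper; the values are chosen by hand so that the rounding is easy to follow.

```python
import numpy as np

# FP16 has an 11-bit significand, so values in [2048, 4096) are spaced 2 apart.
a, b, c = np.float16(2048), np.float16(1), np.float16(1)

# Order 1: 2048 + 1 = 2049 is not representable and rounds (ties-to-even)
# back to 2048, and the same thing happens again for the second + 1.
left_first = (a + b) + c      # 2048.0

# Order 2: 1 + 1 = 2 is exact, and 2048 + 2 = 2050 is representable.
right_first = a + (b + c)     # 2050.0

# The same two orderings in FP32 are both exact.
a32, b32, c32 = np.float32(2048), np.float32(1), np.float32(1)
fp32_left = (a32 + b32) + c32     # 2050.0
fp32_right = a32 + (b32 + c32)    # 2050.0

print("FP16:", left_first, right_first, "equal:", bool(left_first == right_first))
print("FP32:", fp32_left, fp32_right, "equal:", bool(fp32_left == fp32_right))
```

The same operands in a different grouping give different FP16 results, and the result is deterministic: rerunning either ordering always reproduces the same value.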
Key Findings
- Controlled FP32 falsification reduces divergence by eight orders of magnitude.
- Under FP32, the token flip rate drops to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver.
- Layer-wise drift profiling reveals architecturally predictable propagation patterns.
- Models using Grouped-Query Attention exhibit sharp divergence at the first layer.
- Gemma’s larger head dimension and sliding window attention produce uniform accumulation across all layers.
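Layer-wise drift profiling, as mentioned in the findings above, could be sketched roughly as follows. This summary does not state the paper's drift metric, so the maximum absolute difference per layer is an assumption, `layerwise_drift` is a hypothetical helper, and the hidden states are made-up toy values rather than measurements.

```python
import numpy as np

def layerwise_drift(states_a, states_b):
    """Return one drift value per layer: the max absolute difference
    between the hidden states of two runs (assumed metric)."""
    drift = []
    for h_a, h_b in zip(states_a, states_b):
        # Compare in FP32 so the measurement adds no FP16 rounding of its own.
        diff = np.abs(h_a.astype(np.float32) - h_b.astype(np.float32))
        drift.append(float(diff.max()))
    return drift

# Toy per-layer hidden states for a 3-layer model (illustrative values only).
cache_off = [
    np.array([1.0, 2.0], dtype=np.float16),
    np.array([3.0, 4.0], dtype=np.float16),
    np.array([5.0, 6.0], dtype=np.float16),
]
cache_on = [
    np.array([1.0, 2.0], dtype=np.float16),
    np.array([3.0, 4.5], dtype=np.float16),
    np.array([5.0, 7.0], dtype=np.float16),
]

profile = layerwise_drift(cache_off, cache_on)
print(profile)
```

In a real experiment the two lists would hold the per-layer activations captured from the cache-OFF and cache-ON runs for the same token position; plotting `profile` against layer index would then show whether drift spikes at the first layer or accumulates evenly.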
Analysis of Divergence
The divergence in token sequences is systematic rather than random: it is rooted in the floating-point precision used during inference. The findings underline how much numerical stability matters in large language model (LLM) inference systems, particularly under FP16 precision.
The controlled FP32 experiment is central to this argument. Rerunning the comparison in FP32 reduces divergence by eight orders of magnitude and drops the token flip rate to 0.0%, which points to FP16 non-associativity as the reason cache-ON and cache-OFF execution fail to stay consistent.
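How a different accumulation order can flip a greedily decoded token in FP16 but not in FP32 can be illustrated with a toy example. This is a sketch under stated assumptions: the two orderings (one left-to-right pass versus per-block partial sums) are stand-ins and are not claimed to match the actual cache-ON and cache-OFF kernels, and the helper names and numbers are hypothetical.

```python
import numpy as np

def sequential_sum(values, dtype):
    """Accumulate left to right, rounding to `dtype` after every add."""
    acc = dtype(0)
    for v in values:
        acc = dtype(acc + dtype(v))
    return acc

def blocked_sum(values, dtype, block):
    """Sum each block separately, then combine the partial sums."""
    partials = [sequential_sum(values[i:i + block], dtype)
                for i in range(0, len(values), block)]
    return sequential_sum(partials, dtype)

def greedy_token(terms, rival_logit, dtype, block=None):
    """Return 0 if the accumulated logit beats the rival logit, else 1."""
    if block is None:
        logit = sequential_sum(terms, dtype)
    else:
        logit = blocked_sum(terms, dtype, block)
    return 0 if logit > dtype(rival_logit) else 1

# Terms of one logit (exact sum 2055) and a competing logit of 2050.
terms = [2048.0] + [1.0] * 7
rival = 2050.0

fp16_seq = greedy_token(terms, rival, np.float16)             # logit rounds to 2048
fp16_blk = greedy_token(terms, rival, np.float16, block=4)    # logit is 2048 + 4 = 2052
fp32_seq = greedy_token(terms, rival, np.float32)             # logit is 2055
fp32_blk = greedy_token(terms, rival, np.float32, block=4)    # logit is 2055

print("FP16 tokens:", fp16_seq, fp16_blk)
print("FP32 tokens:", fp32_seq, fp32_blk)
```

In FP16 the two orderings land on opposite sides of the rival logit, so the argmax changes; in FP32 both orderings produce the same sum and the same token. Once one token differs, every later step conditions on a different prefix, which is why a single flip is enough to make the decoded sequences diverge.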
Implications for Future Research
These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation. This has broad implications for the design and optimization of future transformer architectures, especially as the reliance on cache-based optimizations continues to grow.
Furthermore, a mechanistic account of the numerical instability in modern LLM inference systems could guide researchers in developing more robust models that mitigate the effects of floating-point precision errors. As demand for high-performance AI systems grows, addressing these numerical challenges will become more important.
Conclusion
The study sheds light on a previously overlooked aspect of autoregressive inference in transformer models. By revealing the systematic FP16 divergence associated with KV caching, it opens the door to further work on computational methods that keep language model outputs consistent and accurate.
