FBS: Modeling Native Parallel Reading inside a Transformer
Source: arXiv:2601.21708v2 (announce type: replace)
Abstract: Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train-test consistency for preview/skimming. We propose the Fovea-Block-Skip Transformer (FBS), which injects a causal, trainable loop into Transformers via Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.
Introduction
Large language models now perform well on tasks such as text generation, translation, and summarization. Inference, however, is still dominated by strict autoregression: the model produces one token at a time, and each token waits on the one before it. Existing acceleration methods mostly patch this pipeline. According to the authors, they leave out three ingredients of human reading: content-adaptive foresight, compute allocation that follows chunk structure, and train-test consistency for preview and skimming.
The Fovea-Block-Skip Transformer (FBS)
FBS injects a causal, trainable loop into the Transformer through three modules. The abstract names the modules and the gaps they address but does not give their internals, so the descriptions below are limited to what it states:
- Parafovea-Attention Window (PAW): addresses content-adaptive foresight, that is, a preview of upcoming content whose extent depends on the text itself. The name echoes parafoveal preview in human reading, where a reader takes in words next to the current fixation point.
- Chunk-Head (CH): addresses chunk-structure-aware compute allocation, spending computation according to how the text groups into chunks rather than treating every token identically.
- Skip-Gate (SG): addresses skimming, deciding where computation can be skipped. The abstract stresses that preview and skimming should behave the same way during training and at inference.
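To make the idea of gated skipping concrete, here is a minimal sketch of a generic skip gate of the kind used in conditional-computation models. It is not the FBS Skip-Gate, whose design the abstract does not specify. The names `skip_gate_score`, `gated_block`, `gate_weights`, `gate_bias`, and `threshold` are all hypothetical, and the logistic gate with a hard threshold is an assumption chosen for simplicity.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def skip_gate_score(hidden: Vector, gate_weights: Vector, gate_bias: float) -> float:
    """Hypothetical gate: logistic score in (0, 1) for 'this position needs the block'."""
    logit = sum(h * w for h, w in zip(hidden, gate_weights)) + gate_bias
    return 1.0 / (1.0 + math.exp(-logit))


def gated_block(
    hidden_states: List[Vector],
    block: Callable[[Vector], Vector],
    gate_weights: Vector,
    gate_bias: float,
    threshold: float = 0.5,
) -> Tuple[List[Vector], List[bool]]:
    """Run `block` with a residual connection only where the gate fires.

    Positions whose score falls below `threshold` keep their hidden state
    unchanged, which is where the compute saving would come from.
    """
    outputs: List[Vector] = []
    executed: List[bool] = []
    for h in hidden_states:
        run = skip_gate_score(h, gate_weights, gate_bias) >= threshold
        if run:
            update = block(h)
            outputs.append([a + b for a, b in zip(h, update)])  # residual add
        else:
            outputs.append(list(h))  # skipped: identity pass-through
        executed.append(run)
    return outputs, executed


def toy_block(h: Vector) -> Vector:
    """Stand-in for an expensive Transformer sub-layer (assumption)."""
    return [2.0 * x for x in h]


# Toy inputs; the gate here only looks at the first feature.
states = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
outputs, executed = gated_block(states, toy_block, gate_weights=[4.0, 0.0], gate_bias=0.0)
print(executed)  # which positions ran the block
print(outputs)
```

A hard threshold like this is not differentiable, so a gate trained with a soft score and deployed with a hard decision behaves differently at test time. That mismatch is one form of the train-test consistency problem the abstract mentions; this sketch does not attempt to solve it.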
Performance and Benefits
According to the abstract, FBS improves the quality-efficiency trade-off across diverse benchmarks without increasing the number of parameters. The fixed parameter count matters: it suggests the gains come from how compute is spent rather than from added model capacity.
The reported ablations show that the three modules (PAW, CH, and SG) are complementary, meaning each contributes something the others do not provide.
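One common way to make "improves the quality-efficiency trade-off" precise is Pareto dominance: a configuration is better if it is no worse on both quality and cost and strictly better on at least one. The sketch below illustrates that comparison. The type `OperatingPoint`, the helper names, and every number are placeholders invented for illustration; none of them are results from the paper.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    quality: float  # higher is better, e.g. a benchmark score
    cost: float     # lower is better, e.g. relative latency or FLOPs


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost
    strictly_better = a.quality > b.quality or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder values for illustration only; NOT measurements from the paper.
points = [
    OperatingPoint("config_a", quality=0.70, cost=1.00),
    OperatingPoint("config_b", quality=0.70, cost=0.80),
    OperatingPoint("config_c", quality=0.65, cost=0.60),
]
front = pareto_front(points)
print([p.name for p in front])
```

In this toy data, `config_b` matches `config_a` on quality at lower cost, so `config_a` drops off the front. `config_c` stays because it trades some quality for a lower cost and nothing beats it on both axes.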
Conclusion
FBS is an attempt to model native parallel reading inside the Transformer itself rather than patching token-by-token decoding from the outside. Its three modules bring preview, chunk-aware compute allocation, and skimming into a single causal, trainable loop, and the authors report a better quality-efficiency trade-off at an unchanged parameter count. As demand for efficient language models grows, borrowing mechanisms from human reading is a direction that merits further research.
