LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
arXiv:2506.18841v3
Introduction
Ultra-long text generation by large language models (LLMs) is in demand across many applications. However, generating coherent, high-quality long-form text remains a significant challenge, both because models have a limited maximum generation length and because quality degrades as sequences grow longer. Earlier approaches, including the well-known LongWriter, rely on supervised fine-tuning (SFT) with synthetic long-form outputs. While effective to some extent, this method depends on synthetic data that is costly and complex to produce.
The Challenge with Traditional Approaches
Previous methods for improving ultra-long text generation face several hurdles:
- Dependence on synthetic SFT data, which is difficult to create.
- Frequent lapses in coherence and consistency in the generated text.
- A tendency for outputs to be overly artificial and structurally monotonous.
Introducing LongWriter-Zero
In light of these challenges, we introduce LongWriter-Zero, a model that uses an incentivization-based approach to overcome the limitations of traditional SFT methods. Rather than relying on pre-existing annotated or synthetic data, LongWriter-Zero uses reinforcement learning (RL) to elicit ultra-long, high-quality text generation in LLMs from scratch.
Methodology
Our approach starts RL training directly from a base model, similar to the R1-Zero methodology. This training encourages the model to reason in ways that support planning and refinement throughout the writing process. To make training effective, we designed specialized reward models that guide the LLM towards:
- Improved length control.
- Enhanced writing quality.
- Better structural formatting.
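To make the three reward goals above concrete, here is a minimal Python sketch of how length, quality, and format signals could be folded into one scalar reward for RL. It is an illustration under stated assumptions, not the reward design used for LongWriter-Zero: the names `LengthTarget`, `length_reward`, `format_reward`, and `combined_reward`, the linear decay, the duplicate-paragraph check, and the unweighted average are all hypothetical, and `quality_score` stands in for the output of a learned writing-quality reward model.

```python
from dataclasses import dataclass


@dataclass
class LengthTarget:
    """Hypothetical target word-count range for one prompt (assumes 0 < low <= high)."""
    low: int
    high: int


def length_reward(n_words: int, target: LengthTarget) -> float:
    """Return 1.0 inside [low, high], decaying linearly to 0.0 outside it."""
    if n_words < target.low:
        # Too short: reward grows linearly from 0.0 up to 1.0 at the lower bound.
        return max(0.0, n_words / target.low)
    if n_words > target.high:
        # Too long: reward reaches 0.0 once the output is twice the upper bound.
        return max(0.0, 1.0 - (n_words - target.high) / target.high)
    return 1.0


def format_reward(text: str) -> float:
    """Toy structural check: fraction of non-empty paragraphs that are unique."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    return len(set(paragraphs)) / len(paragraphs)


def combined_reward(text: str, target: LengthTarget, quality_score: float) -> float:
    """Unweighted mean of the three signals (the equal weighting is an assumption).

    quality_score is a placeholder in [0, 1] for a learned quality reward model.
    """
    scores = [
        length_reward(len(text.split()), target),
        quality_score,
        format_reward(text),
    ]
    return sum(scores) / len(scores)
```

In this sketch, a draft that falls inside the target length range and has no repeated paragraphs is scored mostly by its quality signal, while a draft that is far too short or that pads its length by repeating paragraphs is penalized regardless of quality.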
Results and Evaluation
Experimental evaluations show that LongWriter-Zero, trained from the Qwen2.5-32B base model, consistently outperforms traditional SFT methods on long-form writing tasks. It achieves state-of-the-art results on WritingBench and Arena-Write, even surpassing models with over 100 billion parameters, including DeepSeek R1 and Qwen3-235B.
Open Source Availability
To support further research in natural language processing, we have made our data and model checkpoints publicly available. Researchers and developers can access LongWriter-Zero on Hugging Face.
Conclusion
LongWriter-Zero marks a significant advance in high-quality ultra-long text generation. By using reinforcement learning and eliminating the need for synthetic data, it opens new ways to improve the coherence, quality, and structure of long-form content generated by LLMs.
