PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning
arXiv:2604.12652v1 (cross-listed)
Abstract: Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging. The CLIP Score is often criticized for being too coarse-grained, while VLM-based reward models such as RewardDance require costly human-annotated preference data and additional fine-tuning. In response, we have developed PromptEcho, a novel reward construction method that requires no annotation and no reward model training.
Introduction
Reinforcement learning (RL) has become a common way to improve how closely text-to-image (T2I) models follow their prompts. The main obstacle is obtaining high-quality reward signals: existing options are either too coarse or too expensive to build. PromptEcho is designed to address both problems.
Challenges in Reward Signal Acquisition
Two existing sources of reward for text-to-image models each fall short:
- CLIP Score: While useful, this metric is often too coarse-grained to provide nuanced feedback.
- VLM-based reward models: Approaches like RewardDance necessitate expensive human-annotated datasets and additional fine-tuning, complicating the reward signal extraction process.
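To make the "coarse-grained" criticism concrete, the sketch below shows the general shape of a CLIP-style score: one cosine similarity between a single image embedding and a single text embedding, so the entire prompt collapses into one number. The embeddings here are hypothetical toy vectors rather than outputs of a real CLIP model, and the scale of 2.5 with clipping at zero follows the commonly used CLIPScore formulation, not anything stated in this article.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def clip_style_score(image_emb: Sequence[float],
                     text_emb: Sequence[float],
                     scale: float = 2.5) -> float:
    """One scalar for the whole prompt: scale * max(cos, 0).

    Assumption: scale=2.5 and the clipping at zero mirror the usual
    CLIPScore definition; the embeddings are supplied by the caller.
    """
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)


# Hypothetical 2-d embeddings, for illustration only.
same_direction = clip_style_score([1.0, 0.0], [1.0, 0.0])
orthogonal = clip_style_score([1.0, 0.0], [0.0, 1.0])
```

Because every object, attribute, and relation in the prompt is folded into one text vector, a score of this form cannot say which part of a dense caption the image failed to follow.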
Introducing PromptEcho
PromptEcho constructs the reward from a frozen Vision-Language Model (VLM). Given a generated image, it computes the token-level cross-entropy loss with the original prompt as the label, which extracts the image-text alignment knowledge encoded during the VLM’s pretraining. This method offers several advantages:
- Annotation-Free: There is no need for human annotation, making the process more efficient.
- No Reward Model Training: The elimination of the need for training a reward model reduces resource expenditure.
- Deterministic and Efficient: The same image and prompt always yield the same reward, and the computation is efficient.
- Adaptive Improvement: As more robust open-source VLMs become available, the quality of the reward automatically improves.
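The token-level cross-entropy idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the frozen VLM's forward pass is replaced by hand-written toy logits, the function names are hypothetical, and turning the loss into a reward by negating its mean is an assumption (the article only says the loss is computed with the prompt as the label).

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_cross_entropy(logits_per_position: Sequence[Sequence[float]],
                        prompt_token_ids: Sequence[int]) -> List[float]:
    """Cross-entropy at each prompt position, with the prompt token as label.

    In a real setup, logits_per_position would come from a frozen VLM that
    is shown the generated image and teacher-forced on the prompt tokens.
    """
    if len(logits_per_position) != len(prompt_token_ids):
        raise ValueError("need exactly one logit vector per prompt token")
    if not prompt_token_ids:
        raise ValueError("prompt must contain at least one token")
    return [-log_softmax(logits)[token_id]
            for logits, token_id in zip(logits_per_position, prompt_token_ids)]


def prompt_echo_reward(logits_per_position: Sequence[Sequence[float]],
                       prompt_token_ids: Sequence[int]) -> float:
    """Assumption: reward = negative mean token loss, so higher is better."""
    losses = token_cross_entropy(logits_per_position, prompt_token_ids)
    return -sum(losses) / len(losses)


# Toy example: vocabulary of 3 tokens, prompt of 2 tokens (ids 0 then 1).
prompt_ids = [0, 1]
aligned_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]     # VLM "expects" the prompt
misaligned_logits = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]  # VLM expects other tokens

reward_aligned = prompt_echo_reward(aligned_logits, prompt_ids)
reward_misaligned = prompt_echo_reward(misaligned_logits, prompt_ids)
```

In this sketch, an image under which the VLM finds the prompt likely yields a reward near zero, while an image that makes the prompt tokens unlikely yields a strongly negative one. No sampling is involved, which is consistent with the reward being deterministic.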
Evaluation with DenseAlignBench
To rigorously assess prompt-following capability, we developed DenseAlignBench, a benchmark of concept-rich dense captions. Experiments on two state-of-the-art T2I models, Z-Image and QwenImage-2512, show:
- Substantial improvements on DenseAlignBench with a net win rate increase of +26.8 percentage points for Z-Image and +16.2 percentage points for QwenImage-2512.
- Consistent gains observed on additional benchmarks including GenEval, DPG-Bench, and TIIFBench, all without any task-specific training.
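For readers unfamiliar with the metric, the sketch below shows one common way a net win rate is computed from pairwise comparisons. It assumes net win rate means win rate minus loss rate, with ties counted in the denominator; the article does not spell out the definition, and the counts used here are made up for illustration rather than taken from the paper.

```python
def net_win_rate(wins: int, losses: int, ties: int = 0) -> float:
    """Net win rate in percentage points: (wins - losses) / total * 100.

    Assumption: ties count toward the total but toward neither side.
    """
    total = wins + losses + ties
    if total == 0:
        raise ValueError("at least one comparison is required")
    return 100.0 * (wins - losses) / total


# Hypothetical tallies: the RL-tuned model is judged better on 50 prompts,
# worse on 30, and tied on 20.
example = net_win_rate(wins=50, losses=30, ties=20)
```

Under this definition, a positive value means the tuned model wins more pairwise comparisons than it loses, and zero means the two models are evenly matched.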
Ablation Studies and Future Directions
Ablation studies further confirm that PromptEcho outperforms inference-based scoring using the same VLM, and that reward quality correlates positively with VLM size. These results leave room for future work on reinforcement learning for text-to-image generation.
We will open-source the trained models and DenseAlignBench to support further research and development.
