Dynamic Context Evolution for Scalable Synthetic Data Generation
Source: arXiv:2604.07147v1 (cross-listed)
Abstract: Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists.
Introduction to Dynamic Context Evolution
To address cross-batch mode collapse, the researchers introduce a framework called Dynamic Context Evolution (DCE). It combines three mechanisms designed to improve the diversity and quality of model outputs across batches.
Mechanisms of DCE
- Verbalized Tail Sampling: The model labels each generated idea with an indication of how obvious it is. Ideas judged too obvious are filtered out, which pushes the retained set toward less common outputs.
- Semantic Memory: DCE maintains a persistent embedding index and uses it to reject near-duplicate outputs across batches, so that only ideas not already in memory are retained.
- Adaptive Prompt Evolution: The generation prompt is rebuilt for each batch from the current memory state and a set of diversity strategies, so prompts change over time instead of repeating.
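The three mechanisms above can be sketched as a single batch loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a bag-of-letters stand-in for a real embedding model, the obviousness label is assumed to be a score in [0, 1] where higher means more obvious, and every name here (`SemanticMemory`, `filter_obvious`, `build_prompt`, `run_batch`, the default `tau` and `delta` values) is hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text):
    """Stand-in for a real embedding model: letter-count vector (assumption)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


@dataclass
class SemanticMemory:
    """Persistent index that rejects near-duplicates across batches."""
    delta: float = 0.85  # hypothetical deduplication threshold
    texts: list = field(default_factory=list)
    vectors: list = field(default_factory=list)

    def try_add(self, text):
        vec = toy_embed(text)
        # Reject if any stored item is at least `delta` similar.
        if any(cosine(vec, v) >= self.delta for v in self.vectors):
            return False
        self.texts.append(text)
        self.vectors.append(vec)
        return True


def filter_obvious(candidates, tau):
    """Keep ideas whose self-reported obviousness is below tau (assumed scale)."""
    return [c["idea"] for c in candidates if c["obviousness"] < tau]


def build_prompt(base_prompt, memory, strategy, max_shown=5):
    """Rebuild the prompt from the memory state and a diversity strategy."""
    recent = memory.texts[-max_shown:]
    avoid = "\n".join(f"- {t}" for t in recent) or "- (none yet)"
    return (
        f"{base_prompt}\n"
        f"Strategy: {strategy}\n"
        f"Avoid ideas similar to:\n{avoid}"
    )


def run_batch(generate, base_prompt, memory, strategy, tau=0.5):
    """One batch: evolve prompt, generate, drop obvious ideas, deduplicate."""
    prompt = build_prompt(base_prompt, memory, strategy)
    candidates = generate(prompt)  # list of {"idea": str, "obviousness": float}
    kept = filter_obvious(candidates, tau)
    return [idea for idea in kept if memory.try_add(idea)]
```

Here `generate` is any callable that sends the prompt to a model and parses the labeled ideas. Because the memory object persists between calls to `run_batch`, a second batch that repeats earlier ideas returns nothing new.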
Experimental Validation
DCE was evaluated in three domains: sustainable packaging concepts, educational exam questions, and creative writing prompts. The experiments used two model families, gpt-5-mini and claude-haiku-4-5.
Results and Comparisons
In component-ablation experiments with 2-3 random seeds per method, DCE produced the following results:
- DCE achieved a collapse rate of 0.0 +/- 0.0%, compared with 5.6 +/- 2.0% for naive prompting.
- DCE produced 17-18 HDBSCAN clusters per seed, whereas naive prompting varied between 2 and 17 clusters, indicating that DCE yields more reliably structured conceptual output.
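To make a figure like the collapse rate concrete, the sketch below computes a simple near-duplicate rate over a sequence of output embeddings. This is an illustrative proxy only: the summary does not give the paper's exact definition of collapse rate, and the function name `near_duplicate_rate`, the threshold `delta`, and the toy vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def near_duplicate_rate(vectors, delta):
    """Fraction of items at least `delta` similar to some earlier item.

    Hypothetical proxy for output collapse, not the paper's metric.
    """
    if not vectors:
        return 0.0
    dupes = 0
    for i, v in enumerate(vectors):
        # Compare each item only against items generated before it.
        if any(cosine(v, u) >= delta for u in vectors[:i]):
            dupes += 1
    return dupes / len(vectors)


# Toy embeddings (made up): items 2 and 4 nearly repeat items 1 and 3.
outputs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
rate = near_duplicate_rate(outputs, delta=0.99)
```

Under this proxy, a method that never emits a near-duplicate scores 0.0, and higher values mean more repeated output.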
Conclusion
The findings were corroborated with an independent embedding model (all-MiniLM-L6-v2) and held across sensitivity sweeps of the verbalized tail sampling threshold (tau) and the deduplication threshold (delta). Deduplication and prompt evolution were each insufficient on their own but effective in combination, at a cost of approximately $0.50 per 1,000 candidates using only standard API calls. The approach requires no fine-tuning or custom architectures.
In summary, Dynamic Context Evolution offers a systematic approach to maintaining output diversity and quality when generating synthetic data at scale with large language models.
