The Last Fingerprint: How Markdown Training Shapes LLM Prose
arXiv:2603.27006v1
Large language models (LLMs) have changed how we interact with text, but their stylistic quirks raise questions about how they generate it. One such quirk is the em dash, which different models use at very different rates and which has drawn wide discussion among researchers and users. This article explores the relationship between markdown formatting and the tendency of LLMs to produce em dashes, and proposes that the habit is an artifact of training data rather than a mere stylistic flaw.
The Em Dash as a Marker of AI-Generated Text
The em dash has emerged as a notable marker of AI-generated text due to its inconsistent application by different language models. While some models exhibit a proclivity for overusing em dashes, others show a more moderate approach. This inconsistency is not merely a stylistic choice; it may reflect deeper structural influences from the training data.
Our hypothesis is that the em dash is a remnant of markdown formatting, one of the many elements that LLMs internalize when trained on large, varied text corpora rich in markdown. The following five-step genealogy outlines this connection:
- Training Data Composition: LLMs are trained on diverse datasets, many of which include substantial markdown-formatted content.
- Structural Internalization: These models internalize various structural elements from their training data, including punctuation usage.
- Dual-Register Status: The em dash serves multiple functions in prose, contributing to its frequent appearance.
- Post-Training Amplification: After training, LLMs may amplify certain stylistic features, such as em dashes, in their generated outputs.
- Suppression Experiment Results: A two-condition suppression experiment across twelve models revealed that while overt markdown features could be suppressed, the em dash persisted in many cases.
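To make the distinction between overt markdown features and the em dash concrete, the following minimal sketch counts each kind separately in a model response. It is an illustration only, not the measurement procedure used in the study: the function name `count_features` and the regular expressions are assumptions chosen for this example, and they cover only headers and bullet points.

```python
import re

# Assumption: the em dash is identified by its Unicode code point U+2014.
EM_DASH = "\u2014"

# Assumption: simple line-based patterns for two overt markdown features.
HEADER_RE = re.compile(r"^[ ]{0,3}#{1,6}[ \t]+\S", re.MULTILINE)
BULLET_RE = re.compile(r"^[ \t]*[-*+][ \t]+\S", re.MULTILINE)


def count_features(text: str) -> dict:
    """Count overt markdown features and em dashes in one response."""
    return {
        "headers": len(HEADER_RE.findall(text)),
        "bullets": len(BULLET_RE.findall(text)),
        "em_dashes": text.count(EM_DASH),
    }


if __name__ == "__main__":
    sample = (
        "# Title\n\n"
        "- first point\n"
        "- second point\n\n"
        "Plain prose \u2014 with a dash.\n"
    )
    print(count_features(sample))
```

Keeping the em dash count apart from the header and bullet counts is what allows a response to register as free of overt markdown while still containing em dashes.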
Experiment Findings and Implications
In our suppression experiment, we tested twelve models from five different providers: Anthropic, OpenAI, Meta, Google, and DeepSeek. The results were revealing:
- When instructed to avoid markdown formatting, most models eliminated or significantly reduced overt features like headers and bullet points.
- However, the em dash continued to appear, with frequencies varying significantly across models.
- Meta’s Llama models stood out, producing no em dashes at all, which suggests a distinct fine-tuning approach.
- A separate three-condition suppression gradient showed that even explicitly prohibiting the em dash did not fully eliminate it in some models.
These findings indicate that the frequency of em dashes can serve as a diagnostic tool for understanding the fine-tuning methodologies applied to LLMs. Rather than viewing this stylistic quirk as a defect, we suggest it be considered a fingerprint of the specific training processes that shape these powerful models.
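One way to treat em dash frequency as a diagnostic is to normalize it by text length and compare conditions. The sketch below assumes a rate per 1,000 whitespace-separated words and a simple ratio between a suppressed and a baseline response. Both the metric and the names `em_dashes_per_1000_words` and `retention_ratio` are hypothetical choices for illustration, not definitions taken from the study.

```python
from typing import Optional

# Assumption: the em dash is identified by its Unicode code point U+2014.
EM_DASH = "\u2014"


def em_dashes_per_1000_words(text: str) -> float:
    """Em dashes per 1,000 whitespace-separated tokens (a rough word count)."""
    words = text.split()
    if not words:
        return 0.0
    return 1000.0 * text.count(EM_DASH) / len(words)


def retention_ratio(baseline: str, suppressed: str) -> Optional[float]:
    """Share of the baseline em dash rate that survives under suppression.

    Returns None when the baseline contains no em dashes, because the
    ratio is undefined in that case.
    """
    base_rate = em_dashes_per_1000_words(baseline)
    if base_rate == 0.0:
        return None
    return em_dashes_per_1000_words(suppressed) / base_rate


if __name__ == "__main__":
    baseline = "alpha beta\u2014gamma delta epsilon zeta eta theta iota kappa lambda"
    suppressed = "a b\u2014c " + " ".join(["w"] * 18)
    print(em_dashes_per_1000_words(baseline))
    print(retention_ratio(baseline, suppressed))
```

Splitting on whitespace is a simplification: an unspaced em dash joins two words into one token, so a real analysis would likely need a more careful tokenizer.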
Conclusion
The interplay between markdown training and em dash usage in LLMs provides valuable insights into the structural influences imparted during model development. By reframing our understanding of em dash frequency as a marker of fine-tuning methodology, we can further explore the nuances of AI-generated text and improve the capabilities of future language models.
