Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Summary: arXiv:2604.19598v1
Type: Cross
This study analyzes the consistency of exercise prescriptions generated by three large language models (LLMs): GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash. All models were run at temperature=0 so that they were compared under the same controlled conditions. Each model generated prescriptions for six distinct clinical scenarios, 20 times per scenario, giving 360 outputs in total (3 models × 6 scenarios × 20 repetitions).
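The repeated-generation design can be sketched as a nested loop over models, scenarios, and repetitions. This is a minimal illustration, not the authors' harness: `generate` is a hypothetical callable standing in for whichever API client is used, and the scenario labels are placeholders, since the summary does not list the six scenarios.

```python
from itertools import product

MODELS = ["GPT-4.1", "Claude Sonnet 4.6", "Gemini 2.5 Flash"]
SCENARIOS = [f"scenario_{i}" for i in range(1, 7)]  # placeholder labels (assumption)
REPETITIONS = 20


def run_study(generate):
    """Collect one output per (model, scenario, repetition) cell.

    `generate(model, scenario, temperature)` is a hypothetical callable
    supplied by the caller; it should return the model's text output.
    """
    records = []
    for model, scenario, rep in product(MODELS, SCENARIOS, range(REPETITIONS)):
        text = generate(model, scenario, temperature=0)
        records.append({"model": model, "scenario": scenario, "rep": rep, "text": text})
    return records


def fake_generate(model, scenario, temperature):
    # Stand-in so the sketch runs without any API access.
    return f"{model} prescription for {scenario}"


records = run_study(fake_generate)
print(len(records))  # 3 models x 6 scenarios x 20 repetitions = 360
```

Keeping the model, scenario, and repetition index on every record makes it straightforward to group outputs later, for example to compare the 20 repetitions within a single model and scenario.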
Study Design and Methodology
The analysis focused on four key dimensions:
- Semantic Similarity: The degree to which the generated outputs were similar in meaning.
- Output Reproducibility: How consistently a model reproduces its output across repetitions, reported in the findings as the share of unique outputs.
- FITT Classification: Classification of each prescription's content by Frequency, Intensity, Time, and Type (FITT) of exercise.
- Safety Expression: Whether the prescriptions incorporate safety considerations.
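The first two dimensions can be illustrated with a short sketch. It rests on assumptions: the summary does not say how semantic similarity was computed, so a bag-of-words cosine is used here as a crude stand-in for an embedding-based measure, and the toy outputs are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def unique_output_rate(outputs):
    """Share of distinct strings among the outputs (exact match)."""
    return len(set(outputs)) / len(outputs)


def cosine_bow(a, b):
    """Cosine similarity of word-count vectors; a crude stand-in for an
    embedding-based semantic similarity (assumption, not the study's metric)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(outputs, sim=cosine_bow):
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Toy outputs (invented for illustration): three exact copies and one paraphrase.
outputs = [
    "walk 30 minutes five days per week",
    "walk 30 minutes five days per week",
    "walk 30 minutes five days per week",
    "brisk walking for 30 minutes on five days each week",
]
print(unique_output_rate(outputs))  # 2 distinct strings out of 4 -> 0.5
print(mean_pairwise_similarity(outputs))
```

The two numbers are best read together: pairs of exact copies score a similarity of about 1.0 and raise the mean, so a high mean similarity alone does not show whether a model is rephrasing stable content or simply repeating text.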
Findings
The results revealed significant differences in performance among the three models:
- Mean Semantic Similarity: GPT-4.1 achieved the highest mean semantic similarity score of 0.955, followed closely by Gemini 2.5 Flash at 0.950, and Claude Sonnet 4.6 at 0.903. Statistical analysis confirmed these differences were significant (H = 458.41, p < .001).
- Output Reproducibility: GPT-4.1 produced entirely unique outputs (100%) while keeping semantic content stable. In contrast, Gemini 2.5 Flash repeated itself heavily, with only 27.5% of its outputs being unique. This suggests that Gemini 2.5 Flash's high similarity score stemmed from text duplication rather than from consistently sound reasoning.
- Safety Expression: All models reached ceiling levels on the safety expression metrics, so this dimension did little to distinguish between them.
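The reported H statistic can be put in context with a small sketch. The summary does not name the test; H conventionally denotes the Kruskal-Wallis statistic, so that test is assumed here. The similarity scores below are invented toy values, not the study's data, and the function omits the tie correction for brevity.

```python
from math import exp


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H using average ranks for ties (no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Map each distinct value to the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)


def p_value_three_groups(h):
    """Chi-square upper tail with 2 degrees of freedom, which is exp(-h / 2)."""
    return exp(-h / 2)


# Toy similarity scores for three models (invented for illustration).
model_a = [0.95, 0.96, 0.97]
model_b = [0.94, 0.93, 0.92]
model_c = [0.90, 0.89, 0.88]

h = kruskal_wallis_h(model_a, model_b, model_c)
print(h)                        # about 7.2 for these toy values
print(p_value_three_groups(h))  # about 0.027
```

With three groups the chi-square approximation has 2 degrees of freedom, so under this assumption an H of 458.41 corresponds to exp(-458.41 / 2), which is far below .001 and consistent with the reported p-value.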
Conclusion
The findings indicate that model selection is a clinical decision rather than a purely technical one. The divergent behaviors observed under repeated generation show that relying solely on single-output evaluations can be misleading. Output behavior across repeated generations should therefore be a core criterion for the reliable deployment of AI-based exercise prescription systems. By understanding how different models perform in terms of consistency and reproducibility, healthcare professionals can make more informed choices about which AI tools to integrate into their practice.
