AI Exercise Prescription Consistency: Comparing Top LLMs

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Summary: arXiv:2604.19598v1

Type: Cross

This study analyzes how consistently three prominent large language models (LLMs) generate exercise prescriptions: GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash. All models were run at temperature=0 so that they were compared under the same controlled conditions. Each model generated prescriptions for six distinct clinical scenarios, with 20 repetitions per scenario, for a total of 360 outputs (3 models × 6 scenarios × 20 repetitions).
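The arithmetic behind the 360 outputs can be checked with a short sketch that enumerates the run plan. The scenario labels and the `build_run_plan` helper are hypothetical placeholders for illustration; the summary does not list the actual scenarios or describe the authors' tooling.

```python
from itertools import product

MODELS = ["GPT-4.1", "Claude Sonnet 4.6", "Gemini 2.5 Flash"]
# Placeholder labels: the summary does not name the six clinical scenarios.
SCENARIOS = [f"scenario_{i}" for i in range(1, 7)]
REPETITIONS = 20
TEMPERATURE = 0.0


def build_run_plan():
    """Return one record per planned generation call."""
    return [
        {"model": m, "scenario": s, "repetition": r, "temperature": TEMPERATURE}
        for m, s, r in product(MODELS, SCENARIOS, range(1, REPETITIONS + 1))
    ]


plan = build_run_plan()
print(len(plan))  # 3 models x 6 scenarios x 20 repetitions = 360
```

Each record would correspond to one API call; keeping the plan as data makes it easy to confirm that every model sees every scenario the same number of times.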

Study Design and Methodology

The analysis focused on four key dimensions:

Semantic Similarity: The degree to which the generated outputs were similar in meaning.
Output Reproducibility: How often repeated runs of the same scenario produced identical rather than unique text.
FITT Classification: Classification of each prescription by Frequency, Intensity, Time, and Type (FITT) of exercise.
Safety Expression: The models’ ability to incorporate safety considerations into the prescriptions.
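To make the first two dimensions concrete, here is a minimal sketch of how a mean pairwise similarity and a unique-output fraction could be computed for one set of repeated generations. It assumes a simple bag-of-words cosine similarity as a stand-in; the summary does not say which embedding or similarity measure the authors used, and the function names and sample texts below are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations


def bow_vector(text):
    """Stand-in 'embedding': lowercase word counts (an assumption, not the paper's method)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(outputs):
    """Average similarity over all unordered pairs of outputs."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    vectors = [bow_vector(o) for o in outputs]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def unique_fraction(outputs):
    """Share of outputs that are distinct as exact strings."""
    return len(set(outputs)) / len(outputs)


# Made-up outputs: three verbatim repeats plus one variant.
outputs = ["walk 30 minutes daily"] * 3 + ["cycle 20 minutes daily"]
print(mean_pairwise_similarity(outputs))  # 0.75
print(unique_fraction(outputs))           # 0.5
```

Reporting the two numbers side by side matters because verbatim repeats push the similarity score up without saying anything about how stable the varied outputs are.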

Findings

The results revealed significant differences in performance among the three models:

Mean Semantic Similarity: GPT-4.1 achieved the highest mean semantic similarity score of 0.955, followed closely by Gemini 2.5 Flash at 0.950, and Claude Sonnet 4.6 at 0.903. Statistical analysis confirmed these differences were significant (H = 458.41, p < .001).
Output Reproducibility: Every GPT-4.1 output was unique (100%), yet its semantic content remained stable. In contrast, only 27.5% of Gemini 2.5 Flash's outputs were unique, meaning most were verbatim repeats. This suggests that its high similarity score stemmed from text duplication rather than from consistent reasoning.
Safety Expression: All models scored at ceiling on the safety expression metrics, so this dimension did not discriminate between them.
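The H statistic reported for semantic similarity is the conventional symbol for a Kruskal-Wallis test, although this summary does not name the test. Under that assumption, the following sketch shows how H is computed from ranked scores for three groups. With three groups the chi-square approximation has 2 degrees of freedom, for which the p-value reduces to exp(-H/2). The scores below are made-up toy values, not data from the study.

```python
import math
from collections import Counter


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(c ** 3 - c for c in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Toy similarity scores for three hypothetical models (illustrative only).
toy = [[0.91, 0.92, 0.93], [0.94, 0.95, 0.96], [0.97, 0.98, 0.99]]
h = kruskal_wallis_h(toy)
print(round(h, 2), round(p_value_three_groups(h), 4))  # 7.2 0.0273
```

Plugging the reported H = 458.41 into the same approximation gives exp(-229.2), a value far below .001, which is consistent with the significance level stated above.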

Conclusion

The findings indicate that model selection is a clinical decision rather than a purely technical one. The divergent behaviors observed under repeated generation show that evaluating a single output per model can be misleading. Output behavior across repeated generations should therefore be a core criterion for the reliable deployment of AI-based exercise prescription systems. By understanding how different models perform in terms of consistency and reproducibility, healthcare professionals can make more informed choices about which AI tools to integrate into their practice.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
